I don't have an STM8, or a compiler for it, but wanted to stick my spoon too in the soup.
The naïve shift-and-add-if-carry implementation takes 49 bytes and 49 cycles (1+3N bytes and cycles for N bits, not including the far return), when the argument is in the X register:
; A = popcount(X)
; 49 bytes, 49 cycles
popcount:
CLR A ; 1 byte, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycles
RETF ; (1 byte, 5 cycles)
With a 16-entry lookup table, that shrinks to 35 bytes and 25 cycles (not including the far return, or the lookup table),
; A = popcount(X)
; uses 16-byte popcount table _POPCNT
; 35 bytes, 25 cycles
popcount:
PUSHW X ; 1 byte, 2 cycles
CLRW X ; 1 byte, 1 cycle
POP A ; 1 byte, 1 cycle
PUSH A ; 1 byte, 1 cycle
AND A, #$0F ; 2 bytes, 1 cycle
LD A, (_POPCNT, X) ; 3 bytes, 1 cycle
LD XL, A ; 1 byte, 1 cycle
POP A ; 1 byte, 1 cycle
SWAP A ; 1 byte, 1 cycle
AND A, #$0F ; 2 bytes, 1 cycle
EXG A, XL ; 1 byte, 1 cycle
ADD A, (_POPCNT, X) ; 3 bytes, 1 cycle
LD XL, A ; 1 byte, 1 cycle
POP A ; 1 byte, 1 cycle
PUSH A ; 1 byte, 1 cycle
AND A, #$0F ; 2 bytes, 1 cycle
EXG A, XL ; 1 byte, 1 cycle
ADD A, (_POPCNT, X) ; 3 bytes, 1 cycle
LD XL, A ; 1 byte, 1 cycle
POP A ; 1 byte, 1 cycle
SWAP A ; 1 byte, 1 cycle
AND A, #$0F ; 2 bytes, 1 cycle
EXG XL, A ; 1 byte, 1 cycle
ADD A, (_POPCNT, X) ; 3 bytes, 1 cycle
RETF ; (1 byte, 5 cycles)
Interestingly, if the X register does not need to be saved, and you have N bytes on stack, you can do a popcount in about 12 cycles per byte, using the above scheme.
With that same 16-byte lookup table, an 8-bit popcount takes 19 bytes and 15 cycles, not including the far return:
; A = popcountb(A)
; uses 16-byte popcount table _POPCNT
; 19 bytes, 15 cycles
popcountb:
PUSHW X ; 1 byte, 2 cycles
CLRW X ; 1 byte, 1 cycle
PUSH A ; 1 byte, 1 cycle
SWAP A ; 1 byte, 1 cycle
AND A, #$0F ; 2 bytes, 1 cycle
LD XL, A ; 1 byte, 1 cycle
LD A, (_POPCNT, X) ; 3 bytes, 1 cycle
EXG A, XL ; 1 byte, 1 cycle
POP A ; 1 byte, 1 cycle
AND A, #$0F ; 2 bytes, 1 cycle
EXG A, XL ; 1 byte, 1 cycle
ADD A, (_POPCNT, X) ; 3 bytes, 1 cycle
POPW X ; 1 byte, 2 cycles
RETF ; (1 byte, 5 cycles)
Using a 256-byte table, a 16-bit popcount with argument in X register takes 12 bytes and 9 cycles, and a 8-bit popcount with argument in A register takes 7 bytes and 7 cycles:
; A = popcount(X)
; uses 256-byte popcount table _POPCNT
; 12 bytes, 9 cycles
popcount:
LD A, XH ; 1 byte, 1 cycle
PUSH A ; 1 byte, 1 cycle
CLR A ; 1 byte, 1 cycle
PUSH A ; 1 byte, 1 cycle
LD XH, A ; 1 byte, 1 cycle
LD A, (_POPCNT, X) ; 3 bytes, 1 cycle
POPW X ; 1 byte, 2 cycles
ADD A, (_POPCNT, X) ; 3 bytes, 1 cycle
RETF ; (1 byte, 5 cycles)
; A = popcountb(A)
; uses 256-byte popcount table _POPCNT
; 7 bytes, 7 cycles
popcountb:
PUSHW X ; 1 byte, 2 cycles
CLRW X ; 1 byte, 1 cycle
LD XL, A ; 1 byte, 1 cycle
LD A, (_POPCNT, X) ; 3 bytes, 1 cycle
POPW X ; 1 byte, 2 cycles
RETF ; (1 byte, 5 cycles)
Note that these are all written to keep non-parameter registers intact, and the far return is not included in the byte or cycle counts. The cycle and byte counts also assume the address of the _POPCNT table is within the first 64k.
Actually, none of the registers need to saved, as SDCC uses a caller-saves calling convention. ^-^
Righty-o; like I said, I don't know/have the compiler, and don't know the binary ABI.
I installed sdcc 3.5. Compiling (-mstm8 -S) a simple test program, using the default 'medium' memory model (16-bit address space only, HwAoRrDk is using 24-bit address space or large memory model AFAICS), with first parameter byte at (3, sp). Based on some experimentation, with the 256-byte table, I'd suggest
; uint8_t popcount8(uint8_t u)
; uses 256-byte table _popcnt
; 8 bytes, 8 cycles
_popcount8:
clrw x ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
ld xl, a ; 1 byte, 1 cycle
ld a, (_popcnt, x) ; 3 bytes, 1 cycle
ret ; 1 bytes, 4 cycles
; uint8_t popcount16(uint16_t u)
; uses 256-byte table _popcnt
; 15 bytes, 12 cycles
_popcount16:
clrw x ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
ld xl, a ; 1 byte, 1 cycle
ld a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x04, sp) ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
ret ; 1 byte, 4 cycles
; uint8_t popcount24(uint8_t u1, uint8_t u2, uint8_t u3)
; uses 256-byte table _popcnt
; 22 bytes, 16 cycles
_popcount24:
clrw x ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
ld xl, a ; 1 byte, 1 cycle
ld a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x04, sp) ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x05, sp) ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
ret ; 1 byte, 4 cycles
; uint8_t popcount32(uint8_t u1, uint8_t u2, uint8_t u3)
; uses 256-byte table _popcnt
; 29 bytes, 20 cycles
_popcount32:
clrw x ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
ld xl, a ; 1 byte, 1 cycle
ld a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x04, sp) ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x05, sp) ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x06, sp) ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
ret ; 1 byte, 4 cycles
which can be extended to any number of parameters, running in 4+4n cycles, in 1+7n bytes of code, for n parameter bytes, using the medium memory model (16-bit address space).
A variant that uses a 128-byte lookup table, as noted by RenThraysk:
; uint8_t popcount8(uint8_t u)
; uses 128-byte table _popcnt
; 10 bytes, 10 cycles
_popcount8:
clrw x ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
srl a ; 1 byte, 1 cycle
ld xl, a ; 1 byte, 1 cycle (does not affect carry)
clr a ; 1 byte, 1 cycle (does not affect carry)
adc a, (_popcnt, x) ; 3 bytes, 1 cycle
ret ; 1 byte, 4 cycles
; uint8_t popcount16(uint16_t u)
; uses 128-byte table _popcnt
; 18 bytes, 15 cycles
_popcount16:
clrw x ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
srl a ; 1 byte, 1 cycle
ld xl, a ; 1 byte, 1 cycle (does not affect carry)
clr a ; 1 byte, 1 cycle (does not affect carry)
adc a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x04, sp) ; 2 bytes, 1 cycle
srl a ; 1 byte, 1 cycle (does not affect carry)
exg a, xl ; 1 byte, 1 cycle (does not affect carry)
adc a, (_popcnt, x) ; 3 bytes, 1 cycle
ret ; 1 byte, 4 cycles
; uint8_t popcount24(uint8_t u1, uint8_t u2, uint8_t u3)
; uses 128-byte table _popcnt
; 26 bytes, 20 cycles
_popcount16:
clrw x ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
srl a ; 1 byte, 1 cycle
ld xl, a ; 1 byte, 1 cycle (does not affect carry)
clr a ; 1 byte, 1 cycle (does not affect carry)
adc a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x04, sp) ; 2 bytes, 1 cycle
srl a ; 1 byte, 1 cycle (does not affect carry)
exg a, xl ; 1 byte, 1 cycle (does not affect carry)
adc a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x05, sp) ; 2 bytes, 1 cycle
srl a ; 1 byte, 1 cycle (does not affect carry)
exg a, xl ; 1 byte, 1 cycle (does not affect carry)
adc a, (_popcnt, x) ; 3 bytes, 1 cycle
ret ; 1 byte, 4 cycles
; uint8_t popcount32(uint8_t u1, uint8_t u2, uint8_t u3)
; uses 128-byte table _popcnt
; 34 bytes, 25 cycles
_popcount32:
clrw x ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
srl a ; 1 byte, 1 cycle
ld xl, a ; 1 byte, 1 cycle (does not affect carry)
clr a ; 1 byte, 1 cycle (does not affect carry)
adc a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x04, sp) ; 2 bytes, 1 cycle
srl a ; 1 byte, 1 cycle (does not affect carry)
exg a, xl ; 1 byte, 1 cycle (does not affect carry)
adc a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x05, sp) ; 2 bytes, 1 cycle
srl a ; 1 byte, 1 cycle (does not affect carry)
exg a, xl ; 1 byte, 1 cycle (does not affect carry)
adc a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x06, sp) ; 2 bytes, 1 cycle
srl a ; 1 byte, 1 cycle (does not affect carry)
exg a, xl ; 1 byte, 1 cycle (does not affect carry)
adc a, (_popcnt, x) ; 3 bytes, 1 cycle
ret ; 1 byte, 4 cycles
These run in 5+5n cycles and use 2+8n bytes, so slightly slower and slightly larger but halving the needed table.
Using just 16-byte lookup table, the calculation takes 3+11n cycles and 19n bytes, for n bytes:
_popcnt:
db 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
; uint8_t popcount8(uint8_t u)
; uses 16-byte table _popcnt
; 19 bytes, 14 cycles
_popcount8:
clrw x ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
ld a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
swap a ; 1 byte, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
ret ; 1 byte, 4 cycles
; uint8_t popcount16(uint16_t u)
; uses 16-byte table _popcnt
; 38 bytes, 25 cycles
_popcount16:
clrw x ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
ld a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
swap a ; 1 byte, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x04, sp) ; 2 bytes, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x04, sp) ; 2 bytes, 1 cycle
swap a ; 1 byte, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
ret ; 1 byte, 4 cycles
; uint8_t popcount24(uint8_t u1, uint8_t u2, uint8_t u3)
; uses 16-byte table _popcnt
; 57 bytes, 36 cycles
_popcount24:
clrw x ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
ld a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
swap a ; 1 byte, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x04, sp) ; 2 bytes, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x04, sp) ; 2 bytes, 1 cycle
swap a ; 1 byte, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x05, sp) ; 2 bytes, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x05, sp) ; 2 bytes, 1 cycle
swap a ; 1 byte, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
ret ; 1 byte, 4 cycles
; uint8_t popcount32(uint8_t u1, uint8_t u2, uint8_t u3, uint8_t u4)
; uses 16-byte table _popcnt
; 76 bytes, 47 cycles
_popcount32:
clrw x ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
ld a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x03, sp) ; 2 bytes, 1 cycle
swap a ; 1 byte, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x04, sp) ; 2 bytes, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x04, sp) ; 2 bytes, 1 cycle
swap a ; 1 byte, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x05, sp) ; 2 bytes, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x05, sp) ; 2 bytes, 1 cycle
swap a ; 1 byte, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x06, sp) ; 2 bytes, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
ld a, (0x06, sp) ; 2 bytes, 1 cycle
swap a ; 1 byte, 1 cycle
and a, #0x0F ; 2 bytes, 1 cycle
exg a, xl ; 1 byte, 1 cycle
add a, (_popcnt, x) ; 3 bytes, 1 cycle
ret ; 1 byte, 4 cycles
I would wager that one of these variants – unless you find a faster one, which is quite possible, as I'm definitely not familiar with this architecture! – should fit ones needs.
It is interesting to note that the exact same implementation on STM8 work for different C prototypes:
uint8_t popcount16(uint8_t, uint8_t);
uint8_t popcount16(uint16_t);
uint8_t popcount24(uint8_t, uint8_t, uint8_t);
uint8_t popcount24(uint16_t, uint8_t);
uint8_t popcount24(uint8_t, uint16_t);
uint8_t popcount32(uint8_t, uint8_t, uint8_t, uint8_t);
uint8_t popcount32(uint8_t, uint8_t, uint16_t);
uint8_t popcount32(uint8_t, uint16_t, uint8_t);
uint8_t popcount32(uint16_t, uint8_t, uint8_t);
uint8_t popcount32(uint16_t, uint16_t);
uint8_t popcount32(uint32_t);