I don't have an STM8, or a compiler for it, but I wanted to stick my spoon in the soup too.
The naïve shift-and-add-with-carry implementation takes 49 bytes and 49 cycles for a 16-bit argument in the X register (1+3N bytes and 1+3N cycles for N bits, not including the far return):
; A = popcount(X)
; 49 bytes, 49 cycles
popcount:
CLR A ; 1 byte, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
SRLW X ; 1 byte, 2 cycles
ADC A, #$0 ; 2 bytes, 1 cycle
RETF ; (1 byte, 5 cycles)
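Since I cannot run the assembly, a small C model is a handy way to sanity-check the logic. The following is only a sketch: the function names are mine (hypothetical), and each loop iteration stands in for one SRLW X / ADC A, #$0 pair from the listing above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of the unrolled routine above.
   One loop iteration corresponds to one SRLW X / ADC A, #$0 pair. */
uint8_t popcount16_shift(uint16_t x)
{
    uint8_t a = 0;                            /* CLR A */
    for (int i = 0; i < 16; i++) {
        uint8_t carry = (uint8_t)(x & 1u);    /* bit that SRLW shifts into carry */
        x >>= 1;                              /* SRLW X */
        a = (uint8_t)(a + carry);             /* ADC A, #$0 */
    }
    return a;
}

/* Size (bytes) and time (cycles) formula from the text: 1 + 3N for N bits. */
unsigned unrolled_cost(unsigned nbits)
{
    return 1u + 3u * nbits;
}

/* Quick self-check against a few hand-computed values. */
void popcount16_shift_selfcheck(void)
{
    assert(popcount16_shift(0x0000u) == 0);
    assert(popcount16_shift(0x8001u) == 2);
    assert(popcount16_shift(0xFFFFu) == 16);
    assert(unrolled_cost(16u) == 49u);
}
```

The model says nothing about the byte or cycle counts themselves; it only checks that shifting out one bit at a time and adding the carry yields the bit count.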
With a 16-entry lookup table, that shrinks to 37 bytes and 26 cycles (not including the far return, or the lookup table):
; A = popcount(X)
; uses 16-byte popcount table _POPCNT
; 37 bytes, 26 cycles
popcount:
PUSHW X ; 1 byte, 2 cycles
CLRW X ; 1 byte, 1 cycle
POP A ; 1 byte, 1 cycle
PUSH A ; 1 byte, 1 cycle
AND A, #$0F ; 2 bytes, 1 cycle
LD XL, A ; 1 byte, 1 cycle
LD A, (_POPCNT, X) ; 3 bytes, 1 cycle
LD XL, A ; 1 byte, 1 cycle
POP A ; 1 byte, 1 cycle
SWAP A ; 1 byte, 1 cycle
AND A, #$0F ; 2 bytes, 1 cycle
EXG A, XL ; 1 byte, 1 cycle
ADD A, (_POPCNT, X) ; 3 bytes, 1 cycle
LD XL, A ; 1 byte, 1 cycle
POP A ; 1 byte, 1 cycle
PUSH A ; 1 byte, 1 cycle
AND A, #$0F ; 2 bytes, 1 cycle
EXG A, XL ; 1 byte, 1 cycle
ADD A, (_POPCNT, X) ; 3 bytes, 1 cycle
LD XL, A ; 1 byte, 1 cycle
POP A ; 1 byte, 1 cycle
SWAP A ; 1 byte, 1 cycle
AND A, #$0F ; 2 bytes, 1 cycle
EXG A, XL ; 1 byte, 1 cycle
ADD A, (_POPCNT, X) ; 3 bytes, 1 cycle
RETF ; (1 byte, 5 cycles)
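The nibble-table idea can be modelled in C as below. This is a sketch under assumptions: the table contents are what I would expect _POPCNT to hold (entry i is the number of set bits in i), and the names POPCNT_NIBBLE and popcount16_nibbles are hypothetical. The four lookups follow the same order as the assembly: low and high nibble of the high byte, then low and high nibble of the low byte.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed contents of the 16-byte table: entry i = number of set bits in i. */
const uint8_t POPCNT_NIBBLE[16] = {
    0, 1, 1, 2,  1, 2, 2, 3,
    1, 2, 2, 3,  2, 3, 3, 4
};

/* Hypothetical C model of the 16-bit routine: four nibble lookups, summed. */
uint8_t popcount16_nibbles(uint16_t x)
{
    uint8_t hi = (uint8_t)(x >> 8);       /* byte popped first (XH) */
    uint8_t lo = (uint8_t)(x & 0xFFu);    /* byte popped second (XL) */
    uint8_t a;

    a = POPCNT_NIBBLE[hi & 0x0Fu];                    /* AND A, #$0F */
    a = (uint8_t)(a + POPCNT_NIBBLE[hi >> 4]);        /* SWAP A; AND A, #$0F */
    a = (uint8_t)(a + POPCNT_NIBBLE[lo & 0x0Fu]);
    a = (uint8_t)(a + POPCNT_NIBBLE[lo >> 4]);
    return a;
}

/* Exhaustive cross-check against a plain bit-by-bit count. */
void popcount16_nibbles_selfcheck(void)
{
    for (uint32_t v = 0; v <= 0xFFFFu; v++) {
        uint8_t expected = 0;
        for (uint32_t t = v; t != 0; t >>= 1)
            expected = (uint8_t)(expected + (t & 1u));
        assert(popcount16_nibbles((uint16_t)v) == expected);
    }
}
```

Because the input space is only 65536 values, the exhaustive check is cheap on a PC and covers every case the routine could see.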
Interestingly, if the X register does not need to be preserved and the N bytes to count are already on the stack, the same scheme does a popcount in about 12 cycles per byte.
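As a rough illustration of that per-byte loop, here is a C sketch in which a plain byte buffer stands in for the stack. The names NIBBLE_BITS, popcount_bytes and approx_cycles are hypothetical, and the cycle estimate is just the "about 12 per byte" figure from the text, not a measurement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 16-entry table: entry i = number of set bits in i. */
static const uint8_t NIBBLE_BITS[16] = {
    0, 1, 1, 2,  1, 2, 2, 3,
    1, 2, 2, 3,  2, 3, 3, 4
};

/* Hypothetical model: sum two nibble lookups per byte over len bytes. */
unsigned popcount_bytes(const uint8_t *buf, size_t len)
{
    unsigned total = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        total += NIBBLE_BITS[buf[i] & 0x0Fu];   /* low nibble */
        total += NIBBLE_BITS[buf[i] >> 4];      /* high nibble */
    }
    return total;
}

/* Rough estimate only: about 12 cycles per byte, as stated in the text. */
unsigned approx_cycles(size_t len)
{
    return 12u * (unsigned)len;
}
```

One thing the model hides: in the assembly the running total lives in the 8-bit A register, so it would wrap once the count exceeds 255, which limits a single pass to 31 bytes if every bit may be set.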
With that same 16-byte lookup table, an 8-bit popcount takes 19 bytes and 15 cycles, not including the far return:
; A = popcountb(A)
; uses 16-byte popcount table _POPCNT
; 19 bytes, 15 cycles
popcountb:
PUSHW X ; 1 byte, 2 cycles
CLRW X ; 1 byte, 1 cycle
PUSH A ; 1 byte, 1 cycle
SWAP A ; 1 byte, 1 cycle
AND A, #$0F ; 2 bytes, 1 cycle
LD XL, A ; 1 byte, 1 cycle
LD A, (_POPCNT, X) ; 3 bytes, 1 cycle
EXG A, XL ; 1 byte, 1 cycle
POP A ; 1 byte, 1 cycle
AND A, #$0F ; 2 bytes, 1 cycle
EXG A, XL ; 1 byte, 1 cycle
ADD A, (_POPCNT, X) ; 3 bytes, 1 cycle
POPW X ; 1 byte, 2 cycles
RETF ; (1 byte, 5 cycles)
Using a 256-byte table, a 16-bit popcount with the argument in the X register takes 12 bytes and 9 cycles, and an 8-bit popcount with the argument in the A register takes 7 bytes and 7 cycles:
; A = popcount(X)
; uses 256-byte popcount table _POPCNT
; 12 bytes, 9 cycles
popcount:
LD A, XH ; 1 byte, 1 cycle
PUSH A ; 1 byte, 1 cycle
CLR A ; 1 byte, 1 cycle
PUSH A ; 1 byte, 1 cycle
LD XH, A ; 1 byte, 1 cycle
LD A, (_POPCNT, X) ; 3 bytes, 1 cycle
POPW X ; 1 byte, 2 cycles
ADD A, (_POPCNT, X) ; 3 bytes, 1 cycle
RETF ; (1 byte, 5 cycles)
; A = popcountb(A)
; uses 256-byte popcount table _POPCNT
; 7 bytes, 7 cycles
popcountb:
PUSHW X ; 1 byte, 2 cycles
CLRW X ; 1 byte, 1 cycle
LD XL, A ; 1 byte, 1 cycle
LD A, (_POPCNT, X) ; 3 bytes, 1 cycle
POPW X ; 1 byte, 2 cycles
RETF ; (1 byte, 5 cycles)
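For completeness, here is one way the 256-byte table could be generated and used, again as a C sketch rather than anything STM8-specific. The names POPCNT_BYTE, popcnt_byte_init, popcountb_table and popcount16_table are hypothetical; on the actual part the table would presumably be a constant in flash rather than filled in at run time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 256-entry table: entry i = number of set bits in i. */
uint8_t POPCNT_BYTE[256];

/* Fill the table using popcount(i) = popcount(i >> 1) + (i & 1). */
void popcnt_byte_init(void)
{
    POPCNT_BYTE[0] = 0;
    for (unsigned i = 1; i < 256; i++)
        POPCNT_BYTE[i] = (uint8_t)(POPCNT_BYTE[i >> 1] + (i & 1u));
}

/* 8-bit popcount: a single lookup, like LD A, (_POPCNT, X) with XH = 0. */
uint8_t popcountb_table(uint8_t a)
{
    return POPCNT_BYTE[a];
}

/* 16-bit popcount: low byte lookup first, then add the high byte lookup. */
uint8_t popcount16_table(uint16_t x)
{
    uint8_t a = POPCNT_BYTE[x & 0xFFu];
    return (uint8_t)(a + POPCNT_BYTE[x >> 8]);
}

/* Quick self-check; initialises the table first. */
void popcnt_byte_selfcheck(void)
{
    popcnt_byte_init();
    assert(POPCNT_BYTE[0xFF] == 8);
    assert(popcount16_table(0xFFFFu) == 16);
}
```

The generated values can be dumped as data directives for whichever assembler is in use, which avoids typing 256 entries by hand.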
Note that these are all written to keep non-parameter registers intact, and the far return is not included in the byte or cycle counts. The byte and cycle counts also assume the _POPCNT table lies within the first 64k of the address space, so that a 16-bit offset can reach it.