### Author Topic: Optimising a popcount implementation  (Read 539 times)

0 Members and 1 Guest are viewing this topic.

#### HwAoRrDk

• Frequent Contributor
• Posts: 739
• Country:
##### Optimising a popcount implementation
« on: May 15, 2020, 04:16:23 pm »
I am implementing a popcount, aka "count the number of 1 bits", function in assembly for an 8-bit micro (STM8). The common C implementation (essentially, this) wasn't really cutting it in terms of speed.

For a pathological input value with the most-significant bit set (such as 0x8000) my assembly implementation will always take the maximum time possible regardless of the number of other 1 bits, as it always has to shift through all 16 bits. While my code is generally faster than the C implementation, it actually loses out on such values with only a few most-significant bits set, e.g. 0xC000 or 0xE000.

But, I had an idea: would it be a worthwhile optimisation to examine the most-significant bit and then use that to decide which direction to do the bit shifting? My thinking is that if an input value has the MSb set, it's likely to be a large value, with most bits set in the upper byte, so left-shifting (vs. right-shifting) will usually be quicker to reach a zero value. At any rate, I think it will at least 'flatten' the average execution time curve, given a distribution of random input values.

#### SiliconWizard

• Super Contributor
• Posts: 5439
• Country:
##### Re: Optimising a popcount implementation
« Reply #1 on: May 15, 2020, 05:53:15 pm »
But, I had an idea: would it be a worthwhile optimisation to examine the most-significant bit and then use that to decide which direction to do the bit shifting? My thinking is that if an input value has the MSb set, it's likely to be a large value, with most bits set in the upper byte, so left-shifting (vs. right-shifting) will usually be quicker to reach a zero value. At any rate, I think it will at least 'flatten' the average execution time curve, given a distribution of random input values.

If the numbers you deal with are biased this way, sure, but otherwise, I for one don't see a reason that would help on a random distribution.

If you want something fast and are ready to waste a little memory, I would just use a look-up table to divide your numbers in chunks.
For instance, you could use a table of 256 items and shift your numbers by 8 bits. Provided that memory access is reasonably fast, it would certainly be much faster than counting individual bits.
You can adjust the table's depth according to the trade-off you're willing to make.

#### ogden

• Super Contributor
• Posts: 3217
• Country:
##### Re: Optimising a popcount implementation
« Reply #2 on: May 15, 2020, 06:36:24 pm »
The common C implementation (essentially, this) wasn't really cutting it in terms of speed.
You were looking at solution, in webpage you mention:

Code: [Select]
v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporaryv = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // tempc = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count

#### HwAoRrDk

• Frequent Contributor
• Posts: 739
• Country:
##### Re: Optimising a popcount implementation
« Reply #3 on: May 15, 2020, 07:13:28 pm »
If you want something fast and are ready to waste a little memory, I would just use a look-up table to divide your numbers in chunks.
For instance, you could use a table of 256 items and shift your numbers by 8 bits. Provided that memory access is reasonably fast, it would certainly be much faster than counting individual bits.
You can adjust the table's depth according to the trade-off you're willing to make.

I had thought about a 256-byte look-up table, but decided that because there is still the need to do some shifting of the input value (STM8 only has shift-once instructions, so multiple shifts take several cycles), it probably won't be faster. I don't know, maybe I should try it.

You were looking at solution, in webpage you mention:

Code: [Select]
v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporaryv = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // tempc = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count

Yeah... nah. Modified to work with a 16-bit value (as opposed to 32-bits, as presented), it compiles to this machine code:

Code: [Select]
sub sp, #2ldw x, (0x06, sp)srlw xld a, xland a, #0x55ld (0x02, sp), ald a, xhand a, #0x55ld (0x01, sp), aldw x, (0x06, sp)subw x, (0x01, sp)ldw (0x06, sp), xldw x, (0x06, sp)ld a, xland a, #0x33ld (0x02, sp), ald a, xhand a, #0x33ld (0x01, sp), aldw x, (0x06, sp)srlw xsrlw xld a, xland a, #0x33rlwa xand a, #0x33ld xh, aaddw x, (0x01, sp)ldw (0x06, sp), xldw x, (0x06, sp)ld a, #0x10div x, aaddw x, (0x06, sp)ld a, xland a, #0x0frlwa xand a, #0x0fld xh, apushw xpush #0x01push #0x01callf __mulintaddw sp, #4ld a, xhaddw sp, #2retf
That's an enormous amount of code! And, because the target architecture doesn't have anything other than a 8-bit hardware multiply, it has to call __mulint intrinsic to boot.

Compare and contrast to my initial implementation:

Code: [Select]
 ldw x, (4, sp) clr a0001$: srlw x adc a, #0 tnzw x jrne 0001$ retf
Remember, this is for an 8-bit microcontroller. And one that has a limited number of registers at that (A, X, Y). Pretty much any sophisticated C code that is traditionally regarded as an optimisation won't be.

#### T3sl4co1l

• Super Contributor
• Posts: 15415
• Country:
• Expert, Analog Electronics, PCB Layout, EMC
##### Re: Optimising a popcount implementation
« Reply #4 on: May 15, 2020, 07:13:38 pm »
Exactly how much faster do you expect a int32 solution, that requires multiplication, to complete on an 8-bit platform..?

The shift-and-mask process should be adaptable though.  Not sure it'll be much faster than looping, honestly.  Loops don't hurt on non-pipelined architectures.  I'd write out the logic on paper to make sure it works (and run some simulations or whatever too), see how many bytes and cycle counts can be had.

Tim
Seven Transistor Labs, LLC
Electronic design, from concept to prototype.
Bringing a project to life?  Send me a message!

#### RenThraysk

• Contributor
• Posts: 26
• Country:
##### Re: Optimising a popcount implementation
« Reply #5 on: May 15, 2020, 07:20:55 pm »
Unroll the loop, so only test if non zero every 2 bits... or 4...

Eg

Code: [Select]
 ldw x, (4, sp) clr a0001$: srlw x adc a, #0 srlw x adc a, #0 tnzw x jrne 0001$ retf
« Last Edit: May 15, 2020, 07:24:23 pm by RenThraysk »

#### SiliconWizard

• Super Contributor
• Posts: 5439
• Country:
##### Re: Optimising a popcount implementation
« Reply #6 on: May 15, 2020, 07:21:20 pm »
If you want something fast and are ready to waste a little memory, I would just use a look-up table to divide your numbers in chunks.
For instance, you could use a table of 256 items and shift your numbers by 8 bits. Provided that memory access is reasonably fast, it would certainly be much faster than counting individual bits.
You can adjust the table's depth according to the trade-off you're willing to make.

I had thought about a 256-byte look-up table, but decided that because there is still the need to do some shifting of the input value (STM8 only has shift-once instructions, so multiple shifts take several cycles), it probably won't be faster. I don't know, maybe I should try it.

Whether it will be faster on average indeed all depends on the target itself, but you should give it a try IMHO. True that if your target doesn't have a barrel shifter, shifting by more than 1 bit is not efficient, but you could then achieve splitting your word in bytes with other means. Don't know about the STM8, but it may have instructions to swap bytes, for instance, which you could use instead of shifting.

#### ogden

• Super Contributor
• Posts: 3217
• Country:
##### Re: Optimising a popcount implementation
« Reply #7 on: May 15, 2020, 07:30:20 pm »
I had thought about a 256-byte look-up table, but decided that because there is still the need to do some shifting of the input value (STM8 only has shift-once instructions, so multiple shifts take several cycles), it probably won't be faster. I don't know, maybe I should try it.
To process 16-bit word as two 8-bit bytes you don't need shift instruction. You need type casting - address 16bit word as array of two bytes. As STM8 have byte nibble swap instruction, you have option to make 16-byte lookup table, count bits of 4-bit nibbles.

#### HwAoRrDk

• Frequent Contributor
• Posts: 739
• Country:
##### Re: Optimising a popcount implementation
« Reply #8 on: May 15, 2020, 08:11:04 pm »
Don't know about the STM8, but it may have instructions to swap bytes, for instance, which you could use instead of shifting.

To process 16-bit word as two 8-bit bytes you don't need shift instruction. You need type casting - address 16bit word as array of two bytes. As STM8 have byte nibble swap instruction, you have option to make 16-byte lookup table, count bits of 4-bit nibbles.

Ah, you're right, thanks for the suggestion. There is an instruction for exchanging bytes between registers. Given that, I managed to come up with this:

Code: [Select]
clrw xld a, (4+0, sp)exg a, xladd a, (_table, x)exg a, xlld a, (4+1, sp)exg a, xladd a, (_table, x)retf
Only 13 cycles, and a constant 13 cycles at that. Shame that the table takes 256 bytes of flash, but I suppose I could make compilation conditional between the LUT-based and non-LUT implementations, so the choice can be made if space is tight.

I shall have to see about perhaps doing as suggested with the 16-entry table and dealing with nibbles. Although, I have a feeling the code will be more voluminous, so will be slower.

#### ogden

• Super Contributor
• Posts: 3217
• Country:
##### Re: Optimising a popcount implementation
« Reply #9 on: May 15, 2020, 08:38:05 pm »
I shall have to see about perhaps doing as suggested with the 16-entry table and dealing with nibbles. Although, I have a feeling the code will be more voluminous, so will be slower.
With 65536 byte table you can do even faster It's about tradeoffs only you can decide about.

#### SiliconWizard

• Super Contributor
• Posts: 5439
• Country:
##### Re: Optimising a popcount implementation
« Reply #10 on: May 15, 2020, 10:42:24 pm »
Shame that the table takes 256 bytes of flash,

Since each entry only needs 4 bits (number of bits for a byte is from 0 to 8 ), you could already pack 2 entries per byte and make this 128 bytes instead of 256. There would be a small overhead to address it (and probably use nibble swap too), so not as efficient, but something to consider.

#### brucehoult

• Super Contributor
• Posts: 1601
• Country:
• Formerly SiFive, Samsung R&D
##### Re: Optimising a popcount implementation
« Reply #11 on: May 16, 2020, 08:15:41 am »
If you want something fast and are ready to waste a little memory, I would just use a look-up table to divide your numbers in chunks.
For instance, you could use a table of 256 items and shift your numbers by 8 bits. Provided that memory access is reasonably fast, it would certainly be much faster than counting individual bits.
You can adjust the table's depth according to the trade-off you're willing to make.

I had thought about a 256-byte look-up table, but decided that because there is still the need to do some shifting of the input value (STM8 only has shift-once instructions, so multiple shifts take several cycles), it probably won't be faster. I don't know, maybe I should try it.

You were looking at solution, in webpage you mention:

Code: [Select]
v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporaryv = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // tempc = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count

If you don't have a fast multiply but you do have fast shift then you can instead finish:

Code: [Select]
v += v >> 16;v += v >> 8;c = (v&0xF) + (v>>4);
Without fast shift nothing is going to beat table lookup -- and loading the bytes from RAM not shifting in a register.

I don't know the STM8 at all, but on a 6502 I'd do (assuming the input buffer is at a known location (preferably ZP), and the table is in a known location (preferably page-aligned but doesn't have to be), and the input is not longer than 32 bytes (256 bits)):

Code: [Select]
  ldx #num_bytes  lda #0  clcloop:  ldy INPUT,x  adc TABLE,y  dex  bne loop
Or if you're just doing something like 4 bytes:

Code: [Select]
  clc  ldy $INPUT lda TABLE,y ldy$INPUT+1  adc TABLE,y  ldy $INPUT+2 adc TABLE,y ldy$INPUT+3  adc TABLE,y[/quote]You'd have to be pretty squeezed on most things to not have space for a 256 byte table, especially if this is actually speed critical for you.

#### brucehoult

• Super Contributor
• Posts: 1601
• Country:
• Formerly SiFive, Samsung R&D
##### Re: Optimising a popcount implementation
« Reply #12 on: May 16, 2020, 08:39:59 am »
Oh. It seems the ST7 is a pretty close cousin of the 6502 .. and the STM8 extends X and Y to 16 bits, which makes it more like a cousin of the 6800 but without a B accumulator. Completely binary incompatible I assume.

#### Siwastaja

• Super Contributor
• Posts: 2771
• Country:
##### Re: Optimising a popcount implementation
« Reply #13 on: May 16, 2020, 09:51:41 am »
If the instructions are fetched one per cycle and you have excess instruction memory, as is often the case with microcontrollers, just unroll the loop completely. Make an assembler macro (or C static inline __attribute__(("always_inline"))) out of it.

Decrementing the loop counter and jumping is a large part of such simple algorithm.

#### Nominal Animal

• Super Contributor
• Posts: 1774
• Country:
##### Re: Optimising a popcount implementation
« Reply #14 on: May 16, 2020, 04:49:25 pm »
I don't have an STM8, or a compiler for it, but wanted to stick my spoon too in the soup.

The naïve shift-and-add-if-carry implementation takes 49 bytes and 49 cycles (1+3N bytes and cycles for N bits, not including the far return), when the argument is in the X register:
Code: [Select]
                            ; A = popcount(X)                            ; 49 bytes, 49 cyclespopcount:    CLR   A                 ;  1 byte,  1 cycle    SRLW  X                 ;  1 byte,  2 cycles    ADC   A, #$0 ; 2 bytes, 1 cycles SRLW X ; 1 byte, 2 cycles ADC A, #$0            ;  2 bytes, 1 cycles    SRLW  X                 ;  1 byte,  2 cycles    ADC   A, #$0 ; 2 bytes, 1 cycles SRLW X ; 1 byte, 2 cycles ADC A, #$0            ;  2 bytes, 1 cycles    SRLW  X                 ;  1 byte,  2 cycles    ADC   A, #$0 ; 2 bytes, 1 cycles SRLW X ; 1 byte, 2 cycles ADC A, #$0            ;  2 bytes, 1 cycles    SRLW  X                 ;  1 byte,  2 cycles    ADC   A, #$0 ; 2 bytes, 1 cycles SRLW X ; 1 byte, 2 cycles ADC A, #$0            ;  2 bytes, 1 cycles    SRLW  X                 ;  1 byte,  2 cycles    ADC   A, #$0 ; 2 bytes, 1 cycles SRLW X ; 1 byte, 2 cycles ADC A, #$0            ;  2 bytes, 1 cycles    SRLW  X                 ;  1 byte,  2 cycles    ADC   A, #$0 ; 2 bytes, 1 cycles SRLW X ; 1 byte, 2 cycles ADC A, #$0            ;  2 bytes, 1 cycles    SRLW  X                 ;  1 byte,  2 cycles    ADC   A, #$0 ; 2 bytes, 1 cycles SRLW X ; 1 byte, 2 cycles ADC A, #$0            ;  2 bytes, 1 cycles    SRLW  X                 ;  1 byte,  2 cycles    ADC   A, #$0 ; 2 bytes, 1 cycles SRLW X ; 1 byte, 2 cycles ADC A, #$0            ;  2 bytes, 1 cycles    RETF                    ; (1 byte,  5 cycles)With a 16-entry lookup table, that shrinks to 35 bytes and 25 cycles (not including the far return, or the lookup table),
Code: [Select]
                            ; A = popcount(X)                            ; uses 16-byte popcount table _POPCNT                            ; 35 bytes, 25 cyclespopcount:    PUSHW X                 ;  1 byte,  2 cycles    CLRW  X                 ;  1 byte,  1 cycle    POP   A                 ;  1 byte,  1 cycle    PUSH  A                 ;  1 byte,  1 cycle    AND   A, #$0F ; 2 bytes, 1 cycle LD A, (_POPCNT, X) ; 3 bytes, 1 cycle LD XL, A ; 1 byte, 1 cycle POP A ; 1 byte, 1 cycle SWAP A ; 1 byte, 1 cycle AND A, #$0F           ;  2 bytes, 1 cycle    EXG   A, XL             ;  1 byte,  1 cycle    ADD   A, (_POPCNT, X)   ;  3 bytes, 1 cycle    LD    XL, A             ;  1 byte,  1 cycle    POP   A                 ;  1 byte,  1 cycle    PUSH  A                 ;  1 byte,  1 cycle    AND   A, #$0F ; 2 bytes, 1 cycle EXG A, XL ; 1 byte, 1 cycle ADD A, (_POPCNT, X) ; 3 bytes, 1 cycle LD XL, A ; 1 byte, 1 cycle POP A ; 1 byte, 1 cycle SWAP A ; 1 byte, 1 cycle AND A, #$0F           ;  2 bytes, 1 cycle    EXG   XL, A             ;  1 byte,  1 cycle    ADD   A, (_POPCNT, X)   ;  3 bytes, 1 cycle    RETF                    ; (1 byte,  5 cycles)Interestingly, if the X register does not need to be saved, and you have N bytes on stack, you can do a popcount in about 12 cycles per byte, using the above scheme.

With that same 16-byte lookup table, an 8-bit popcount takes 19 bytes and 15 cycles, not including the far return:
Code: [Select]
                            ; A = popcountb(A)                            ; uses 16-byte popcount table _POPCNT                            ; 19 bytes, 15 cyclespopcountb:    PUSHW X                 ;  1 byte,  2 cycles    CLRW  X                 ;  1 byte,  1 cycle    PUSH  A                 ;  1 byte,  1 cycle    SWAP  A                 ;  1 byte,  1 cycle    AND   A, #$0F ; 2 bytes, 1 cycle LD XL, A ; 1 byte, 1 cycle LD A, (_POPCNT, X) ; 3 bytes, 1 cycle EXG A, XL ; 1 byte, 1 cycle POP A ; 1 byte, 1 cycle AND A, #$0F           ;  2 bytes, 1 cycle    EXG   A, XL             ;  1 byte,  1 cycle    ADD   A, (_POPCNT, X)   ;  3 bytes, 1 cycle    POPW  X                 ;  1 byte,  2 cycles    RETF                    ; (1 byte,  5 cycles)
Using a 256-byte table, a 16-bit popcount with argument in X register takes 12 bytes and 9 cycles, and a 8-bit popcount with argument in A register takes 7 bytes and 7 cycles:
Code: [Select]
                            ; A = popcount(X)                            ; uses 256-byte popcount table _POPCNT                            ; 12 bytes, 9 cyclespopcount:    LD    A, XH             ;  1 byte,  1 cycle    PUSH  A                 ;  1 byte,  1 cycle    CLR   A                 ;  1 byte,  1 cycle    PUSH  A                 ;  1 byte,  1 cycle    LD    XH, A             ;  1 byte,  1 cycle    LD    A, (_POPCNT, X)   ;  3 bytes, 1 cycle    POPW  X                 ;  1 byte,  2 cycles    ADD   A, (_POPCNT, X)   ;  3 bytes, 1 cycle    RETF                    ; (1 byte,  5 cycles)                            ; A = popcountb(A)                            ; uses 256-byte popcount table _POPCNT                            ;  7 bytes, 7 cyclespopcountb:    PUSHW X                 ;  1 byte,  2 cycles    CLRW  X                 ;  1 byte,  1 cycle    LD    XL, A             ;  1 byte,  1 cycle    LD    A, (_POPCNT, X)   ;  3 bytes, 1 cycle    POPW  X                 ;  1 byte,  2 cycles    RETF                    ; (1 byte,  5 cycles)
Note that these are all written to keep non-parameter registers intact, and the far return is not included in the byte or cycle counts.  The cycle and byte counts also assume the address of the _POPCNT table is within the first 64k.

The following users thanked this post: HwAoRrDk

#### HwAoRrDk

• Frequent Contributor
• Posts: 739
• Country:
##### Re: Optimising a popcount implementation
« Reply #15 on: May 16, 2020, 07:29:25 pm »
I don't have an STM8, or a compiler for it, but wanted to stick my spoon too in the soup.

Cool, thanks!

With a 16-entry lookup table, that shrinks to 35 bytes and 25 cycles (not including the far return, or the lookup table),

That's fairly similar to what I came up with myself:

Code: [Select]
 clrw x ld a, (4+0, sp) and a, #0x0F exg a, xl add a, (0001$, x) exg a, xl ld a, (4+0, sp) swap a and a, #0x0F exg a, xl add a, (0001$, x) exg a, xl ld a, (4+1, sp) and a, #0x0F exg a, xl add a, (0001$, x) exg a, xl ld a, (4+1, sp) swap a and a, #0x0F exg a, xl add a, (0001$, x) retf0001$: .byte 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 27 cycles, including return, and no stack usage. BTW, I attempted to write an implementation that uses a 128-byte table (with counts packed as 4-bit nibbles), but part-way through it became clear the code would be too convoluted to be quicker than the 16-byte table version, so I abandoned that method. I think what I will do is have three alternate implementations: 'large' 256-byte LUT, 'small' 16-byte LUT, and shift-loop, and which one is used is controlled by defines. Then I have options for any speed-vs-size trade-off I need to make. Interestingly, if the X register does not need to be saved, and you have N bytes on stack, you can do a popcount in about 12 cycles per byte, using the above scheme. ... Note that these are all written to keep non-parameter registers intact, and the far return is not included in the byte or cycle counts. Actually, none of the registers need to saved, as SDCC uses a caller-saves calling convention. Also, I should note that your code as-is wouldn't actually work, because it's not accessing the function arguments in the correct manner. The return address is placed at the top of the stack (or should that be bottom, as it grows downwards), with function arguments following, so you can't simply pop function arguments like that. That's why my code loads function args by offset from the stack pointer. #### RenThraysk • Contributor • Posts: 26 • Country: ##### Re: Optimising a popcount implementation « Reply #16 on: May 16, 2020, 10:05:00 pm » Could use the carry flag to halve the table size. Code: [Select] srl xadc a, (0001$, x)
Whilst works for 8 bit, not sure works as well with 16.

#### Nominal Animal

• Super Contributor
• Posts: 1774
• Country:
##### Re: Optimising a popcount implementation
« Reply #17 on: May 17, 2020, 12:48:46 am »
Actually, none of the registers need to saved, as SDCC uses a caller-saves calling convention.
Righty-o; like I said, I don't know/have the compiler, and don't know the binary ABI.

I installed sdcc 3.5.  Compiling (-mstm8 -S) a simple test program, using the default 'medium' memory model (16-bit address space only, HwAoRrDk is using 24-bit address space or large memory model AFAICS), with first parameter byte at (3, sp).  Based on some experimentation, with the 256-byte table, I'd suggest
Code: [Select]
                            ; uint8_t popcount8(uint8_t u)                            ; uses 256-byte table _popcnt                            ; 8 bytes, 8 cycles_popcount8:    clrw x                  ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    ld   xl, a              ; 1 byte,  1 cycle    ld   a, (_popcnt, x)    ; 3 bytes, 1 cycle    ret                     ; 1 bytes, 4 cycles                            ; uint8_t popcount16(uint16_t u)                            ; uses 256-byte table _popcnt                            ; 15 bytes, 12 cycles_popcount16:    clrw x                  ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    ld   xl, a              ; 1 byte,  1 cycle    ld   a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x04, sp)      ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    ret                     ; 1 byte,  4 cycles                            ; uint8_t popcount24(uint8_t u1, uint8_t u2, uint8_t u3)                            ; uses 256-byte table _popcnt                            ; 22 bytes, 16 cycles_popcount24:    clrw x                  ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    ld   xl, a              ; 1 byte,  1 cycle    ld   a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x04, sp)      ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x05, sp)      ; 2 bytes, 1 cycle     exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    ret                     ; 1 byte,  4 cycles                            ; uint8_t popcount32(uint8_t u1, uint8_t u2, uint8_t u3)                            ; uses 256-byte table _popcnt                            ; 29 bytes, 20 cycles_popcount32:    clrw x                  ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    ld   xl, a              ; 1 byte,  1 cycle    ld   a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x04, sp)      ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x05, sp)      ; 2 bytes, 1 cycle     exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x06, sp)      ; 2 bytes, 1 cycle     exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    ret                     ; 1 byte,  4 cycles
which can be extended to any number of parameters, running in 4+4n cycles, in 1+7n bytes of code, for n parameter bytes, using the medium memory model (16-bit address space).

A variant that uses a 128-byte lookup table, as noted by RenThraysk:
Code: [Select]
                            ; uint8_t popcount8(uint8_t u)                            ; uses 128-byte table _popcnt                            ; 10 bytes, 10 cycles_popcount8:    clrw x                  ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    srl  a                  ; 1 byte,  1 cycle    ld   xl, a              ; 1 byte,  1 cycle  (does not affect carry)    clr  a                  ; 1 byte,  1 cycle  (does not affect carry)    adc  a, (_popcnt, x)    ; 3 bytes, 1 cycle    ret                     ; 1 byte,  4 cycles                            ; uint8_t popcount16(uint16_t u)                            ; uses 128-byte table _popcnt                            ; 18 bytes, 15 cycles_popcount16:    clrw x                  ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    srl  a                  ; 1 byte,  1 cycle    ld   xl, a              ; 1 byte,  1 cycle  (does not affect carry)    clr  a                  ; 1 byte,  1 cycle  (does not affect carry)    adc  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x04, sp)      ; 2 bytes, 1 cycle    srl  a                  ; 1 byte,  1 cycle  (does not affect carry)    exg  a, xl              ; 1 byte,  1 cycle  (does not affect carry)    adc  a, (_popcnt, x)    ; 3 bytes, 1 cycle    ret                     ; 1 byte,  4 cycles                            ; uint8_t popcount24(uint8_t u1, uint8_t u2, uint8_t u3)                            ; uses 128-byte table _popcnt                            ; 26 bytes, 20 cycles_popcount16:    clrw x                  ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    srl  a                  ; 1 byte,  1 cycle    ld   xl, a              ; 1 byte,  1 cycle  (does not affect carry)    clr  a                  ; 1 byte,  1 cycle  (does not affect carry)    adc  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x04, sp)      ; 2 bytes, 1 cycle    srl  a                  ; 1 byte,  1 cycle  (does not affect carry)    exg  a, xl              ; 1 byte,  1 cycle  (does not affect carry)    adc  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x05, sp)      ; 2 bytes, 1 cycle    srl  a                  ; 1 byte,  1 cycle  (does not affect carry)    exg  a, xl              ; 1 byte,  1 cycle  (does not affect carry)    adc  a, (_popcnt, x)    ; 3 bytes, 1 cycle    ret                     ; 1 byte,  4 cycles                            ; uint8_t popcount32(uint8_t u1, uint8_t u2, uint8_t u3)                            ; uses 128-byte table _popcnt                            ; 34 bytes, 25 cycles_popcount32:    clrw x                  ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    srl  a                  ; 1 byte,  1 cycle    ld   xl, a              ; 1 byte,  1 cycle  (does not affect carry)    clr  a                  ; 1 byte,  1 cycle  (does not affect carry)    adc  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x04, sp)      ; 2 bytes, 1 cycle    srl  a                  ; 1 byte,  1 cycle  (does not affect carry)    exg  a, xl              ; 1 byte,  1 cycle  (does not affect carry)    adc  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x05, sp)      ; 2 bytes, 1 cycle    srl  a                  ; 1 byte,  1 cycle  (does not affect carry)    exg  a, xl              ; 1 byte,  1 cycle  (does not affect carry)    adc  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x06, sp)      ; 2 bytes, 1 cycle    srl  a                  ; 1 byte,  1 cycle  (does not affect carry)    exg  a, xl              ; 1 byte,  1 cycle  (does not affect carry)    adc  a, (_popcnt, x)    ; 3 bytes, 1 cycle    ret                     ; 1 byte,  4 cycles
These run in 5+5n cycles and use 2+8n bytes, so slightly slower and slightly larger but halving the needed table.

Using just 16-byte lookup table, the calculation takes 3+11n cycles and 19n bytes, for n bytes:
Code: [Select]
_popcnt:    db 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4                            ; uint8_t popcount8(uint8_t u)                            ; uses 16-byte table _popcnt                            ; 19 bytes, 14 cycles_popcount8:    clrw x                  ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    ld   a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    swap a                  ; 1 byte,  1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    ret                     ; 1 byte,  4 cycles                            ; uint8_t popcount16(uint16_t u)                            ; uses 16-byte table _popcnt                            ; 38 bytes, 25 cycles_popcount16:    clrw x                  ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    ld   a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    swap a                  ; 1 byte,  1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x04, sp)      ; 2 bytes, 1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x04, sp)      ; 2 bytes, 1 cycle    swap a                  ; 1 byte,  1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    ret                     ; 1 byte,  4 cycles                            ; uint8_t popcount24(uint8_t u1, uint8_t u2, uint8_t u3)                            ; uses 16-byte table _popcnt                            ; 57 bytes, 36 cycles_popcount24:    clrw x                  ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    ld   a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    swap a                  ; 1 byte,  1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x04, sp)      ; 2 bytes, 1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x04, sp)      ; 2 bytes, 1 cycle    swap a                  ; 1 byte,  1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x05, sp)      ; 2 bytes, 1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x05, sp)      ; 2 bytes, 1 cycle    swap a                  ; 1 byte,  1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    ret                     ; 1 byte,  4 cycles                                ; uint8_t popcount32(uint8_t u1, uint8_t u2, uint8_t u3, uint8_t u4)                            ; uses 16-byte table _popcnt                            ; 76 bytes, 47 cycles_popcount32:    clrw x                  ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    ld   a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x03, sp)      ; 2 bytes, 1 cycle    swap a                  ; 1 byte,  1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x04, sp)      ; 2 bytes, 1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x04, sp)      ; 2 bytes, 1 cycle    swap a                  ; 1 byte,  1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x05, sp)      ; 2 bytes, 1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x05, sp)      ; 2 bytes, 1 cycle    swap a                  ; 1 byte,  1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x06, sp)      ; 2 bytes, 1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    ld   a, (0x06, sp)      ; 2 bytes, 1 cycle    swap a                  ; 1 byte,  1 cycle    and  a, #0x0F           ; 2 bytes, 1 cycle    exg  a, xl              ; 1 byte,  1 cycle    add  a, (_popcnt, x)    ; 3 bytes, 1 cycle    ret                     ; 1 byte,  4 cycles
I would wager that one of these variants – unless you find a faster one, which is quite possible, as I'm definitely not familiar with this architecture! – should fit ones needs.

It is interesting to note that the exact same implementation on STM8 work for different C prototypes:
Code: [Select]
uint8_t  popcount16(uint8_t, uint8_t);uint8_t  popcount16(uint16_t);uint8_t  popcount24(uint8_t, uint8_t, uint8_t);uint8_t  popcount24(uint16_t, uint8_t);uint8_t  popcount24(uint8_t, uint16_t);uint8_t  popcount32(uint8_t, uint8_t, uint8_t, uint8_t);uint8_t  popcount32(uint8_t, uint8_t, uint16_t);uint8_t  popcount32(uint8_t, uint16_t, uint8_t);uint8_t  popcount32(uint16_t, uint8_t, uint8_t);uint8_t  popcount32(uint16_t, uint16_t);uint8_t  popcount32(uint32_t);

#### HwAoRrDk

• Frequent Contributor
• Posts: 739
• Country:
##### Re: Optimising a popcount implementation
« Reply #18 on: May 17, 2020, 02:56:24 pm »
I installed sdcc 3.5.

Where did you get such an old version of SDCC? Version 3.5 was released in 2015! (4.0.0 is current.) Are you using Linux and that is what is available in the distro's default package repository? If so, that sucks.

A variant that uses a 128-byte lookup table, as noted by RenThraysk:
...
These run in 5+5n cycles and use 2+8n bytes, so slightly slower and slightly larger but halving the needed table.

I also followed RenThraysk's suggestion (an ingenious idea!), and came up with essentially the same.

By the way, if you hadn't used ld xl, a on the first byte and instead exg a, xl, then the clr a is unnecessary, and it would be one cycle less. By using exg initially rather than ld, you get the A reg initialised to zero for free (because XL was already zeroed by the clrw x).

It is interesting to note that the exact same implementation on STM8 work for different C prototypes:

Yes, I suppose that is one benefit of having function arguments passed exclusively on the stack. A uint32_t value is passed no differently from four uint8_t args.

#### Nominal Animal

• Super Contributor
• Posts: 1774
• Country:
##### Re: Optimising a popcount implementation
« Reply #19 on: May 17, 2020, 04:33:19 pm »
Where did you get such an old version of SDCC? Version 3.5 was released in 2015! (4.0.0 is current.) Are you using Linux and that is what is available in the distro's default package repository? If so, that sucks.
Yes, distro packaging, because I don't have the hardware, and I was only interested in testing it.  (I would have used a PPA if there was one with 4.0.0, but didn't find any.)
For actual work, I'd have installed 4.0.0, of course.  But this way, I can uninstall (purge) it trivially after testing.

This is a laptop with a centrally managed Linux distro used by faculty of science students among others.  I volunteer for bug triage, so I try to keep this as close to "vanilla" as I can, even though I do have full root access.  And, as it has only a 120G SSD mostly filled with other software, it doesn't really have enough storage for virtual machines (which would be my preferred option for testing).

By the way, if you hadn't used ld xl, a on the first byte and instead exg a, xl, then the clr a is unnecessary, and it would be one cycle less. By using exg initially rather than ld, you get the A reg initialised to zero for free (because XL was already zeroed by the clrw x).
True!  Just proves how unfamiliar I am with this architecture.

Other than AVR (which has "lots" of registers, comparatively speaking), it's been decades since I last wrote 8-bit assembly (6502; STM8 reminded me a lot of 6502).  I always felt constricted with lack of registers on 6502, and even on 8086 (80x86, really), and have forgotten most of the patterns of how to efficiently use just a small number of registers.  In comparison, AVR, ARM, and AMD64 all have oodles of registers.

One downside of passing parameters always on stack is that the compiler cannot really inline inline assembly functions.  On AMD64 (x86-64), you can write static inline functions with an inline assembly body efficiently, as parameters are passed in registers.  GCC extended inline assembly syntax even allows you to use register classes instead of specific registers, and tell the C compiler what inputs the assembly expects, what outputs it provides, and what it clobbers, with the compiler doing the adaptation to surrounding code as it sees best.  (Historically, one of GCC's weak points was that it generated many register-to-register moves, even when not really needed.  Nowadays GCC is much better at this.)  This is an efficient tool: I have used it to implement odd-ish non-trivial SIMD (SSE/AVX) functions, that evaluate complicated functions for 2/4/8/16 parameters in parallel.

I guess SDCC could be taught the concept of an inline function, where a limited number of parameters are passed in and out in registers, with the intent of the compiler being able to inline the thing easily – it would need to include the clobber/no-clobber register list somehow, to avoid unnecessary stack ops too.  If ever called as an actual function, the compiler would generate a separate copy of the code, adding the necessary move-params-from-stack-to-registers, according to calling convention on the target architecture.

It is difficult to say whether it would be useful on STM8, as the cost of a proper function call is just six-seven cycles total; this inlining would only make sense for short, but extremely often used functions.  I don't even know if you tend to have those on STM8.

#### T3sl4co1l

• Super Contributor
• Posts: 15415
• Country:
• Expert, Analog Electronics, PCB Layout, EMC
##### Re: Optimising a popcount implementation
« Reply #20 on: May 17, 2020, 07:10:25 pm »
(Historically, one of GCC's weak points was that it generated many register-to-register moves, even when not really needed.  Nowadays GCC is much better at this.)

An aside: I've found GCC not only generates these (and also things like unused sign-extends and implicit (whole byte) bitmasks) poorly for AVR, but it's actually gotten considerably worse from GCC 5, to 8 and 9! :O  I guess the back end (GIMPLE-->AVR?) has fallen sorely behind the other mainstream outputters.

Tim
Seven Transistor Labs, LLC
Electronic design, from concept to prototype.
Bringing a project to life?  Send me a message!

The following users thanked this post: Nominal Animal

#### Nominal Animal

• Super Contributor
• Posts: 1774
• Country:
##### Re: Optimising a popcount implementation
« Reply #21 on: May 17, 2020, 07:32:18 pm »
Crap.  I haven't noticed, since this lappy is based on Ubuntu 18.04 LTS, which means avr-gcc is version 5.4.0.

If I was trapped in Groundhog day, learning the different approaches to programming language compilers and then trying to do this stuff better (having essentially infinite time and resources), would be one of my goals.

(And now a kernel update is giving me grief: the psmouse driver handling the (EliteBook 820 G2) touchpad occasionally forgets what a button click is. For a band-aid, removing and reloading the driver works – I've got that as a shortcut now –, but I have a bad feeling that this isn't anything as simple as that, and instead is some kind of a userspace conflict between event devices... Tentacles, you know. Yuk.)
« Last Edit: May 17, 2020, 07:36:21 pm by Nominal Animal »

Smf