Should memcmp be faster than a loop? (32F417)

I am doing some RAM testing and found that

--- Code: ---if ( memcmp(buf0,buf1,PKTSIZE)!=0 ) errcnt++;
--- End code ---

is no faster than

--- Code: ---uint32_t errcnt=0;
for (int i=0; i<PKTSIZE; i++)
    if (buf0[i]!=buf1[i]) errcnt++;
--- End code ---

which surprised me, since the lib functions are supposed to be super-optimised, using word size compares etc.

The two buffers were malloc'd so are definitely 4-aligned and probably 8- or 16-aligned.

I would expect a loop using word compares, doing PKTSIZE/4 iterations, to be faster, but had some trouble making it work (pointer notation) :)

I am using the -fno-tree-loop-distribute-patterns compiler option, to prevent loops being replaced with lib functions.
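For context, here is a minimal sketch of the kind of loop that option is about. As far as I know, GCC's loop-distribute-patterns pass recognises fill and copy loops (memset/memcpy/memmove style); whether it ever touches a compare loop is best confirmed in the listing rather than assumed. The name fill_bytes is made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: a plain byte-fill loop.
   With -ftree-loop-distribute-patterns active, GCC may replace the loop
   with a call to memset(); -fno-tree-loop-distribute-patterns keeps it
   as a loop. */
void fill_bytes(uint8_t *dst, uint8_t value, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}
```

Comparing the disassembly of this function with and without the option shows whether the substitution happens in a given build.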

Look at the generated assembler instructions; that will tell you why the speed is or isn't as expected.

Compilers are very smart these days, so likely both of them got optimized down into very similar or even identical machine code.

I had a nice example where I used a function to calculate a checksum over an area of flash in order to verify it. However, the compiler figured out that it could precalculate this. So instead of actually calling the function, it just calculated what the checksum result would be and returned that fixed number. It is impressive how far compilers can go with optimizations.
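A minimal sketch of that effect, assuming the data is a const array visible to the compiler. The array, its contents and the function names are invented for illustration; they are not the original checksum code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical data standing in for a block of flash. */
static const uint8_t table[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

/* Simple additive checksum. */
static uint32_t checksum(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Everything is known at compile time, so with optimisation enabled
   the compiler may fold this whole call into a constant. */
uint32_t table_checksum(void)
{
    return checksum(table, sizeof table);
}

/* Reading through a volatile pointer forces real loads at run time,
   so the sum cannot be precalculated. */
uint32_t table_checksum_runtime(void)
{
    const volatile uint8_t *p = table;
    uint32_t sum = 0;
    for (size_t i = 0; i < sizeof table; i++)
        sum += p[i];
    return sum;
}
```

Both functions return the same value; the difference only shows up in the generated code.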

The stdlib memcmp doesn't seem to be doing anything remotely clever:

--- Code: ---          memcmp:
0804643c:   push    {r4, lr}
0804643e:   subs    r1, #1
08046440:   add     r2, r0
08046442:   cmp     r0, r2
08046444:   bne.n   0x804644a <memcmp+14>
08046446:   movs    r0, #0
08046448:   b.n     0x8046456 <memcmp+26>
0804644a:   ldrb    r3, [r0, #0]
0804644c:   ldrb.w  r4, [r1, #1]!
08046450:   cmp     r3, r4
08046452:   beq.n   0x8046458 <memcmp+28>
08046454:   subs    r0, r3, r4
08046456:   pop     {r4, pc}
08046458:   adds    r0, #1
0804645a:   b.n     0x8046442 <memcmp+6>
--- End code ---

I see no attempt to identify e.g. opportunities for word compares.
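For comparison, a rough sketch of what a word-at-a-time memcmp could look like. This is not the library's implementation; word_memcmp is a made-up name. It loads words with memcpy to stay clear of alignment and aliasing problems, on the assumption that an optimising compiler turns each 4-byte memcpy into a single load.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; intended to follow the same contract as memcmp(). */
int word_memcmp(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    /* Fast path: skip over 4-byte chunks while they are equal. */
    while (n >= 4) {
        uint32_t wa, wb;
        memcpy(&wa, pa, 4);
        memcpy(&wb, pb, 4);
        if (wa != wb)
            break;          /* the byte loop below finds the exact byte */
        pa += 4;
        pb += 4;
        n -= 4;
    }

    /* Slow path: the word that differed, or the last few tail bytes. */
    while (n--) {
        if (*pa != *pb)
            return *pa - *pb;
        pa++;
        pb++;
    }
    return 0;
}
```

The byte loop is kept for the final step because memcmp has to report the order of the first differing byte, which a plain word compare does not give directly on a little-endian core.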

BTW the block is 2048 bytes.
buf0 is at 0x2000cd88
buf1 is at 0x2000d590
Both buffers are 8-aligned, which is what I would expect (re another thread on the standard dictating malloc'd block alignment).
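The alignment of those two addresses can be checked with a mask, as in this small sketch (is_aligned is a made-up helper name and assumes the alignment is a power of two). Applied to the addresses above, buf0 comes out 8-aligned but not 16-aligned, while buf1 is 16-aligned.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is a multiple of align.
   Assumes align is a power of two. */
bool is_aligned(uintptr_t addr, uintptr_t align)
{
    return (addr & (align - 1u)) == 0;
}
```

On the target the same check would be written as is_aligned((uintptr_t)buf0, 8).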

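Tried this; the pointer casts were the fiddly part :)

--- Code: ---
#include <stdbool.h>
#include <stdint.h>

// Compare two blocks. Both must be 4-aligned, and blocksize a multiple of 4.
// Return true if the blocks are identical.

bool ft_memcmp(const void * buf1, const void * buf2, uint32_t blocksize)
{
    // Cast once, then index as 32-bit words.
    const uint32_t *p1 = (const uint32_t *)buf1;
    const uint32_t *p2 = (const uint32_t *)buf2;
    for (uint32_t i=0; i<(blocksize/4); i++)
        if (p1[i] != p2[i])
            return false;   // stop at the first mismatching word
    return true;
}
--- End code ---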

memcmp is still an iterator under the hood.

I would be interested to see how the compiler optimizes the "blocksize/4" in your loop condition. You could write it in a theoretically faster way as "blocksize>>2", but would the compiler do this trick by itself?
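For what it's worth, with an unsigned operand the two forms give identical results, and optimising compilers generally emit a single shift for both, so the choice is about readability. The sketch below (function names invented) is an easy pair to compare in a listing. For a signed int the two differ on negative values, so there the compiler has to add a small fix-up around the shift.

```c
#include <stdint.h>

/* Hypothetical names: number of 32-bit words in a block, written two ways. */
uint32_t words_div(uint32_t blocksize)
{
    return blocksize / 4;
}

uint32_t words_shift(uint32_t blocksize)
{
    return blocksize >> 2;
}
```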

