-
I am doing some RAM testing and found that
if ( memcmp(buf0,buf1,PKTSIZE)!=0 ) errcnt++;
is no faster than
uint32_t errcnt=0;
for (int i=0; i<PKTSIZE; i++)
{
if (buf0[i]!=buf1[i]) errcnt++;
}
which surprised me, since the lib functions are supposed to be super-optimised, using word size compares etc.
The two buffers were malloc'd so are definitely 4-aligned and probably 8- or 16-aligned.
I would expect a loop doing PKTSIZE/4 word compares, but I had some trouble making that work (pointer notation) :)
I am using the -fno-tree-loop-distribute-patterns compiler option, to prevent loops being replaced with lib functions.
-
Look at the generated assembler instructions; that will tell you why the speed is or isn't as expected.
-
Compilers are very smart these days, so likely both of them got optimized down into very similar or even identical machine code.
I had a nice example where I used a function to calculate a checksum of an area of flash in order to verify it. The compiler figured out that it could precalculate this, so instead of actually calling the function it just computed what the checksum result would be and returned that fixed number. It is impressive how far compilers can go with optimizations.
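As a hypothetical illustration (all names and values here are invented, not my original code): with the data visible at compile time, GCC or Clang at -O2 will typically evaluate the call and just return the constant.
#include <stdint.h>
#include <stddef.h>
// The table stands in for a const area of flash.
static const uint8_t table[4] = { 0x12, 0x34, 0x56, 0x78 };
// Simple additive checksum over n bytes.
static uint32_t checksum(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}
uint32_t table_checksum(void)
{
    // With optimisation on, this whole call is usually folded to "return 0x114".
    return checksum(table, sizeof table);
}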
-
The stdlib memcmp doesn't seem to be doing anything remotely clever
memcmp:
0804643c: push {r4, lr}
0804643e: subs r1, #1
08046440: add r2, r0
08046442: cmp r0, r2
08046444: bne.n 0x804644a <memcmp+14>
08046446: movs r0, #0
08046448: b.n 0x8046456 <memcmp+26>
0804644a: ldrb r3, [r0, #0]
0804644c: ldrb.w r4, [r1, #1]!
08046450: cmp r3, r4
08046452: beq.n 0x8046458 <memcmp+28>
08046454: subs r0, r3, r4
08046456: pop {r4, pc}
08046458: adds r0, #1
0804645a: b.n 0x8046442 <memcmp+6>
I see no attempt to identify e.g. opportunities for word compares.
BTW the block is 2048 bytes.
buf0 is at 0x2000cd88
buf1 is at 0x2000d590
Both buffers 8-aligned which is what I would expect (re another thread on the standard dictating malloced block alignment).
Tried this but obviously I haven't got the casts correct :)
// Compare two blocks. Must be 4-aligned, and blocksize a multiple of 4.
// Return True if compare ok.
bool ft_memcmp(const void * buf1, const void * buf2, uint32_t blocksize)
{
bool ret = true;
for (int i=0; i<(blocksize/4);i+=4)
{
if ( (*(uint32_t*)buf1) != (*(uint32_t*)buf2) )
{
ret=false;
break;
}
}
return (ret);
}
-
Memcmp is still an iterator under the hood.
I would be interested to see how the compiler optimizes your "for (int i=0; i<(blocksize/4);i+=4)" line. You could write that in a theoretically faster way: "for (int i=0; i<(blocksize>>2);i+=4)" but would the compiler be doing this cheat?
-
If I were to do the 32-bit comparison optimization, I would likely define a new uint32_t pointer variable, cast the buffer into that, then just step that across the data exactly like before, simply dividing the buffer length by 4. Or actually just >>2, since a bit shift is faster (I think the compiler optimizer does that fast-divide trick anyway).
The compiler might be avoiding this optimization on its own because it might not be confident that the data will arrive 4-byte aligned.
-
for (int i=0; i<(blocksize/4);i+=4)
{
if ( (*(uint32_t*)buf1) != (*(uint32_t*)buf2) )
{
ret=false;
break;
}
}
This code is only comparing the first word of buf1 and buf2.
Also, the count is wrong: if you count up to blocksize/4, you should not increment by 4...
-
You could write that in a theoretically faster way: "for (int i=0; i<(blocksize>>2)
I would expect a /4 to be done as a shift; even the 1980s Z80 compilers did that. It would also be precomputed. But hey it doesn't:
ft_memcmp:
08043b54: movs r3, #0
08043b56: cmp.w r3, r2, lsr #2
08043b5a: bcs.n 0x8043b78 <ft_memcmp+36>
356 {
08043b5c: push {r4, r5}
360 if ( (*(uint32_t*)buf1) != (*(uint32_t*)buf2) )
08043b5e: ldr r4, [r0, #0]
08043b60: ldr r5, [r1, #0]
08043b62: cmp r4, r5
08043b64: bne.n 0x8043b74 <ft_memcmp+32>
358 for (int i=0; i<(blocksize/4);i+=4)
08043b66: adds r3, #4
08043b68: cmp.w r3, r2, lsr #2
08043b6c: bcc.n 0x8043b5e <ft_memcmp+10>
357 bool ret = true;
08043b6e: movs r0, #1
367 }
08043b70: pop {r4, r5}
08043b72: bx lr
362 ret=false;
08043b74: movs r0, #0
08043b76: b.n 0x8043b70 <ft_memcmp+28>
357 bool ret = true;
08043b78: movs r0, #1
367 }
08043b7a: bx lr
define a new uint32 pointer variable, cast the buffer into that then just step that across the data exactly like before
I am not clever enough :)
This code is only comparing the first word of buf1 and buf2.
Also, the count is wrong: if you count up to blocksize/4, you should not increment by 4...
Told you I am not clever enough :) I do only simple C. This doesn't compile
// Compare two blocks. Must be 4-aligned, and blocksize a multiple of 4.
// Return True if compare ok.
bool ft_memcmp(const void * buf1, const void * buf2, uint32_t blocksize)
{
bool ret = true;
uint32_t count = blocksize/4;
for (int i=0; i<count;i++)
{
if ( ((uint32_t*)&buf1[i]) != ((uint32_t*)&buf2[i]) )
{
ret=false;
break;
}
}
return (ret);
}
-
The stdlib memcmp doesn't seem to be doing anything remotely clever
picolibc (https://github.com/picolibc/picolibc/blob/main/newlib/libc/string/memcmp.c), largely inspired and sharing most of the code with newlib nano, uses an optimized loop for memcmp if both input pointers are word aligned and the right defines are used at compile time.
-
I would expect a /4 to be done as a shift [...]. But hey it doesn't:
08043b56: cmp.w r3, r2, lsr #2
08043b68: cmp.w r3, r2, lsr #2
Yes, it definitely does!
-
From picolibc, I lifted some code and got this, which may be right. I will corrupt some data and make sure it fails, but stepping through the loop it is doing the right thing, I think. a1 and a2 increment by 4, count starts at 512 and decrements by 1.
// Compare two blocks. Must be 4-aligned, and blocksize a multiple of 4.
// Return True if compare ok.
bool ft_memcmp(const void * buf1, const void * buf2, uint32_t blocksize)
{
bool ret = true;
uint32_t count = blocksize/4;
uint32_t *a1;
uint32_t *a2;
a1 = (uint32_t*) buf1;
a2 = (uint32_t*) buf2;
while (count>0)
{
if (*a1 != *a2)
{
ret=false;
break;
}
a1++;
a2++;
count--;
}
return (ret);
}
And it is fast.
Yes, it definitely does!
I saw, but it seems to do it each time around the loop. Although the shift by 2 exec time may be hidden.
-
You could write that in a theoretically faster way: "for (int i=0; i<(blocksize>>2);i+=4)"
[rant]
I have to be blunt: this is the kind of micro-optimization I abhor in my code and always flag when I do code reviews.
[/rant]
Two main reasons:
- We have compilers that outsmart any human in this kind of things.
- I have a guiding principle of "write what you mean", so if I want to say /4, I write /4, not >>2.
(In any case, I would write / sizeof(the pointer target type).)
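A sketch of what I mean, applied to the ft_memcmp above (the function name here is made up):
#include <stdbool.h>
#include <stdint.h>
// Same word-compare loop, but the element size is spelled out instead of a magic /4 or >>2.
// Assumes both buffers are 4-aligned and blocksize is a multiple of 4.
bool ft_memcmp_sizeof(const void *buf1, const void *buf2, uint32_t blocksize)
{
    const uint32_t *a1 = (const uint32_t *)buf1;
    const uint32_t *a2 = (const uint32_t *)buf2;
    for (uint32_t i = 0; i < blocksize / sizeof *a1; i++)
    {
        if (a1[i] != a2[i])
            return false;
    }
    return true;
}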
-
This is the asm for the above code. Nothing clever; the main gain is 512 loops instead of 2048.
ft_memcmp:
08043b54: lsrs r2, r2, #2
364 while (count>0)
08043b56: cbz r2, 0x8043b78 <ft_memcmp+36>
356 {
08043b58: push {r4}
366 if (*a1 != *a2)
08043b5a: ldr r3, [r1, #0]
08043b5c: ldr r4, [r0, #0]
08043b5e: cmp r4, r3
08043b60: bne.n 0x8043b74 <ft_memcmp+32>
371 a1++;
08043b62: adds r0, #4
372 a2++;
08043b64: adds r1, #4
373 count--;
08043b66: subs r2, #1
364 while (count>0)
08043b68: cmp r2, #0
08043b6a: bne.n 0x8043b5a <ft_memcmp+6>
357 bool ret = true;
08043b6c: movs r0, #1
377 }
08043b6e: ldr.w r4, [sp], #4
08043b72: bx lr
368 ret=false;
08043b74: movs r0, #0
08043b76: b.n 0x8043b6e <ft_memcmp+26>
357 bool ret = true;
08043b78: movs r0, #1
377 }
08043b7a: bx lr
-
From picolibc, I lifted some code and got this, which may be right.
And it is fast.
Looks good.
I also like early returns, if no cleanup is needed, as is the case here:
f(...)
{
for(...)
{
...
if(failed test) return false;
...
}
return true;
}
It does not make a big difference (possibly none at all with the right optimization options), and some might dislike it.
Still it makes the code slightly more readable, in my view.
What -O or other optimization options are you using?
clang 15 does a by 4 loop unrolling with -O2 and higher.
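Roughly what that by-4 unrolling looks like if written by hand (just a sketch; it assumes the block size is a multiple of 16 bytes and the buffers are 4-aligned - the compiler does this for you):
#include <stdbool.h>
#include <stdint.h>
// Four word compares per iteration, so the loop-control overhead is paid once per 16 bytes.
bool ft_memcmp_unrolled(const void *buf1, const void *buf2, uint32_t blocksize)
{
    const uint32_t *a1 = (const uint32_t *)buf1;
    const uint32_t *a2 = (const uint32_t *)buf2;
    uint32_t count = blocksize / 4;
    while (count >= 4)
    {
        if (a1[0] != a2[0]) return false;
        if (a1[1] != a2[1]) return false;
        if (a1[2] != a2[2]) return false;
        if (a1[3] != a2[3]) return false;
        a1 += 4;
        a2 += 4;
        count -= 4;
    }
    return true;
}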
-
I am using -Og. I have these notes in my design doc:
Compiler optimisation
Much time has been spent on this. It is a can of worms, because e.g. optimisation level -O3 replaces this loop
for (uint32_t i=0; i<length; i++)
{
buf[offset+i]=data;
}
with memcpy(), which while “functional” will crash the system if you have this loop in the boot block and memcpy() is located at some address outside it (which it will be, being part of the standard library!). Selected boot block functions therefore have the attribute to force -O0 (zero optimisation) in case somebody tries -O3.
The basic optimisation of -O0 (zero) works fine, is easily fast enough for the job, and gives you informative single step debugging, but it produces about 30% more code. The next one, -Og, is the best one to use and doesn’t seem to do any risky stuff like above.
Arguably, one should develop with optimisation ON (say -Og) and then you will pick up problems as you go along. Then switch to -O0 for single stepping if needed to track something down. Whereas if you write the whole thing with -O0 and only change to something else later, you have an impossible amount of stuff to regression-test.
The problems tend to be really subtle, especially if timing issues arise. For example the 20ns min CS high time for the 45DB serial FLASH can be violated (by two function calls in succession) with a 168MHz CPU, and the function hang_around_us() is useful just in case. A lot of ST HAL code works only because it is so bloated. Such time-critical code needs to be in a function which has the
__attribute__((optimize("O0")))
attribute above it. Or be written in assembler.
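For example, a minimal sketch of the attribute in use (the function name and body are invented for illustration):
#include <stdint.h>
// Forcing -O0 on just this function keeps the empty delay loop as real
// instructions; at -Og or higher the compiler would normally delete it.
__attribute__((optimize("O0")))
void hang_around_cycles(uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
    {
        // burn a few cycles per iteration
    }
}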
These figures show the relative benefits, at a particular stage of the project
-O0 produces 491k
-Og produces 342k
-O1 produces 338k
-Os produces 305k
Others, not listed above, have not been tested.
A compiler command line switch -fno-tree-loop-distribute-patterns has been added to globally prevent memcpy etc substitutions.
===
I have tested the code with -O3 and it all seems to run, but I am not interested in spending days re-testing every feature. -Og does sometimes result in "optimised out" for some variable when I am stepping through but it doesn't usually matter, and I can always make it "volatile" just for the test.
I try to avoid early returns, because they can bite. I only use them right at the start, e.g.
if (buf1==NULL) return;
to avoid some invalid access trap.
-
Yes, it definitely does!
I saw, but it seems to do it each time around the loop. Although the shift by 2 exec time may be hidden.
Yes, the cycle cost for an operand shift is zero in cortex-m3 and higher, so doing it once before the loop would have actually increased the cycle count!
the <op2> field can be replaced with one of the following options:
...
- An immediate shifted register, for example Rm, LSL #4.
- A register shifted register, for example Rm, LSL Rs.
...
Compare: CMP Rn, <op2> (1 cycle)
-
Properly written libraries should take advantage of hardware features when they are notified of the platform.
For example, later generation x86 CPU's have the MOVSB, MOVSW, STOSB, STOSW, and similar opcodes which can be put into a hardware loop based on the CX register. No software based construct can be as fast, so libraries should use them when possible.
This is similar to using a floating point coprocessor, when available, instead of pure software FP back when FP hardware wasn't always on the die.
-
FYI, attached is the source code of memcmp() taken from Newlib sources:
//
// gcc-arm-none-eabi-9-2020-q2-update/src/newlib/newlib/libc/string/memcmp.c
//
#include <string.h>
/* Nonzero if either X or Y is not aligned on a "long" boundary. */
#define UNALIGNED(X, Y) \
(((long)X & (sizeof (long) - 1)) | ((long)Y & (sizeof (long) - 1)))
/* How many bytes are copied each iteration of the word copy loop. */
#define LBLOCKSIZE (sizeof (long))
/* Threshhold for punting to the byte copier. */
#define TOO_SMALL(LEN) ((LEN) < LBLOCKSIZE)
int
memcmp (const void *m1,
const void *m2,
size_t n)
{
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
unsigned char *s1 = (unsigned char *) m1;
unsigned char *s2 = (unsigned char *) m2;
while (n--)
{
if (*s1 != *s2)
{
return *s1 - *s2;
}
s1++;
s2++;
}
return 0;
#else
unsigned char *s1 = (unsigned char *) m1;
unsigned char *s2 = (unsigned char *) m2;
unsigned long *a1;
unsigned long *a2;
/* If the size is too small, or either pointer is unaligned,
then we punt to the byte compare loop. Hopefully this will
not turn up in inner loops. */
if (!TOO_SMALL(n) && !UNALIGNED(s1,s2))
{
/* Otherwise, load and compare the blocks of memory one
word at a time. */
a1 = (unsigned long*) s1;
a2 = (unsigned long*) s2;
while (n >= LBLOCKSIZE)
{
if (*a1 != *a2)
break;
a1++;
a2++;
n -= LBLOCKSIZE;
}
/* check m mod LBLOCKSIZE remaining characters */
s1 = (unsigned char*)a1;
s2 = (unsigned char*)a2;
}
while (n--)
{
if (*s1 != *s2)
return *s1 - *s2;
s1++;
s2++;
}
return 0;
#endif /* not PREFER_SIZE_OVER_SPEED */
}
Newlib usually comes with arm-none-gcc compilers in the form of libc.a library. Depending on how it was compiled, you'll get this or that result.
-
Interesting. Mine was obviously compiled for smaller size.
What controls how it gets compiled? I don't think the sources are supplied (with Cube) so there is no way I can control this using say the -O optimisation setting.
-
A bit like memcpy(), this is an endless source of tricks, benchmarks and blog posts.
The most efficient way of achieving it will depend on so many factors that it's impossible to answer what is going to be faster in a given context. It's not always the functions from the standard library. But one thing to consider is that compilers are now clever enough to generate inline, optimized code for memcpy() and memcmp() in a large number of cases, so you'll be hard-pressed to do better. Usually not worth your time, especially if the memory blocks are relatively small.
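For instance (just a sketch, assuming GCC or Clang at -O2 or similar), a small compile-time-constant length is usually expanded inline rather than calling the library function:
#include <stdint.h>
#include <string.h>
// With the length a small constant and only equality wanted, the compiler
// typically emits a couple of word loads and compares instead of a call.
int headers_match(const uint8_t *a, const uint8_t *b)
{
    return memcmp(a, b, 8) == 0;
}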
-
FYI, attached is the source code of memcmp() taken from Newlib sources:
Interesting that this "memcmp" source punts to what it calls a "byte copier". Hmmm. :palm:
Properly written block functions shouldn't always punt misaligned pointers to byte-sized routines. The efficient way to do misalignment is to handle the edges with byte copies, but still do the big "middle" section with the largest native size. There's obviously a lowest useful block size below which the realignment overhead isn't worth it, but you can do that math one time and hardcode the threshold for a given architecture. Above a certain size, there's always a "middle" block that is aligned for both source and destination. No reason to grind through that a byte at a time instead of using the hardware's largest and most efficient data size.
This can be written as "fallthrough" code where the front-end misalignment (if any) is handled at the byte level, the middle section is done fast, and the back-end misalignment (if any) is handled at the byte level. This yields small AND fast code. Kinda get a warm feeling when it's done.
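A sketch of that shape (the function name is invented; it only attempts the fast path when both pointers share the same offset within a word, the size threshold is arbitrary, and like the newlib version it plays a little loose with strict aliasing):
#include <stddef.h>
#include <stdint.h>
// Byte-compare the misaligned head, word-compare the aligned middle,
// byte-compare the tail (and the mismatching word, if any).
int ft_memcmp2(const void *m1, const void *m2, size_t n)
{
    const unsigned char *s1 = (const unsigned char *)m1;
    const unsigned char *s2 = (const unsigned char *)m2;
    if (n >= 16 && ((uintptr_t)s1 & 3u) == ((uintptr_t)s2 & 3u))
    {
        // Head: bytes until both pointers are word-aligned.
        while (((uintptr_t)s1 & 3u) != 0)
        {
            if (*s1 != *s2)
                return *s1 - *s2;
            s1++; s2++; n--;
        }
        // Middle: whole words.
        const uint32_t *a1 = (const uint32_t *)s1;
        const uint32_t *a2 = (const uint32_t *)s2;
        while (n >= 4)
        {
            if (*a1 != *a2)
                break;              // fall through to bytes to report the exact byte
            a1++; a2++; n -= 4;
        }
        s1 = (const unsigned char *)a1;
        s2 = (const unsigned char *)a2;
    }
    // Tail: remaining bytes.
    while (n--)
    {
        if (*s1 != *s2)
            return *s1 - *s2;
        s1++; s2++;
    }
    return 0;
}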
-
That's exactly what I thought the lib functions did. There is no penalty, other than the code size, but this function is so tiny anyway.
I wrote the comparison loop using the 32 bit cast anyway (and made sure the buffers are aligned).
-
What controls how it gets compiled? I don't think the sources are supplied (with Cube) so there is no way I can control this using say the -O optimisation setting.
The library comes bundled with the compiler from the ARM Developer (https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads) website. It is in binary form, and unless you ask nicely, the exact compilation options are not known.
You can still get the library source code from the same site, and if you fancy you can extract some select files to add them to your project (or make a submodule in your git/svn/...) and compile them with whatever options that best suit your needs. More work, but you will be in full control.
-
// Compare two blocks. Must be 4-aligned, and blocksize a multiple of 4.
// Return True if compare ok.
bool ft_memcmp(const void * buf1, const void * buf2, uint32_t blocksize)
{
bool ret = true;
uint32_t count = blocksize/4;
uint32_t *a1;
uint32_t *a2;
a1 = (uint32_t*) buf1;
a2 = (uint32_t*) buf2;
while (count>0)
{
if (*a1 != *a2)
{
ret=false;
break;
}
a1++;
a2++;
count--;
}
return (ret);
}
I like it tidy, because it's fun. Your code is not wrong; I'm just doing a mental/memory refresh so I won't forget C...
// Compare two blocks. Must be 4-aligned, and blocksize a multiple of 4.
// Return True if compare ok.
bool ft_memcmp(const void *buf1, const void *buf2, uint32_t blocksize) {
uint32_t count = blocksize / 4;
uint32_t *a1 = (uint32_t*) buf1;
uint32_t *a2 = (uint32_t*) buf2;
while ((count--) > 0) if (*(a1++) != *(a2++)) return false;
return true;
}
There are probably some mistakes here and there; the point is... tidy ;D
-
Yes, it definitely does!
I saw, but it seems to do it each time around the loop. Although the shift by 2 exec time may be hidden.
Yes, the cycle cost for an operand shift is zero in cortex-m3 and higher, so doing it once before the loop would have actually increased the cycle count!
Doing extra work is never free.
It might be hidden in, for example, a lower max MHz on the part than would otherwise be possible. See, for example, the 1-cycle multiply option on Cortex M0+. You'll never see that on a high-clocked part.
Most CPUs are designed so that register read -> add/sub/cmp -> writeback/forward is on the critical path. Arbitrary size shifts have a similar gate delay depth to add (perhaps just slightly more if the adder has good carry look-ahead not just ripple), so can usually be accommodated too. Doing an arbitrary-sized shift AND THEN an add/sub/cmp basically doubles the critical path in the ALU pipe stage, potentially halving the clock speed attainable.
Small shifts of {0,1,2,3} bits can probably be accommodated before an adder and end up about the same total delay as an arbitrary shift. But the ARM ISA allows an arbitrary shift.
-
For example, later generation x86 CPU's have the MOVSB, MOVSW, STOSB, STOSW, and similar opcodes which can be put into a hardware loop based on the CX register. No software based construct can be as fast, so libraries should use them when possible.
REP MOVSB and friends have been in 8086 from Day #1.
For almost all x86 models in the last 45 years a software loop has been faster than REP MOVSB.
Here's an old discussion involving Intel engineers:
https://community.intel.com/t5/Intel-Fortran-Compiler/Time-to-revisit-REP-MOVS/td-p/796377
I believe some of the most recent µarch revisions might have given REP MOVSB the performance attention that people assume it always had. But it's VERY recent.
-
That was a VERY informative thread reference, thank you!
It has been a very long time since I wrote x86 Assembly code so modern hardware has made my earlier comments out of date.
The best new data for me was learning that the FP Assembly has its own block-move instructions. That's incredibly helpful. *Those* should be able to optimize per my discussion above, and actually do even better than I expected thanks to the lack of MovSx's historical baggage and their inherently wider operands.
Thanks again, that was well worth updating my knowledge!
-
Properly written libraries should take advantage of hardware features when they are notified of the platform.
True, and that's why building even a simple library function like memcmp can be a daunting task, with lots of "preprocessor directive" checking (#ifdef for this machine and that).
For example, later generation x86 CPU's have the MOVSB, MOVSW, STOSB, STOSW, and similar opcodes which can be put into a hardware loop based on the CX register. No software based construct can be as fast, so libraries should use them when possible.
This is similar to using a floating point coprocessor, when available, instead of pure software FP back when FP hardware wasn't always on the die.
I believe the modern compiler is partly (or wholly?) taking over this job, so using an old compiler keeps us from taking advantage of new opcodes and machine architectures. That's why my stance when coding is to forget about machine-dependent code (or trying to be clever at hand-tuning it for each machine architecture), because machine-dependent code becomes obsolete pretty soon. When I read about algorithms and analysis in any software engineering or computer science book, I almost never see machine dependency discussed or tackled; we should concentrate on the operational complexity of a given algorithm (language- and hardware-agnostic). As for the fastest (or smallest) code on a certain machine, let the compiler do its best, or else... just code in assembly. YMMV.
-
Which platform?
I did some test on Windows, and memcmp is many times faster than a simple loop (read the first post with pictures).
https://www.eevblog.com/forum/programming/which-strlen-function-is-faster/msg3393358/#msg3393358
-
Which platform?
I did some test on Windows, and memcmp is many times faster than a simple loop (read the first post with pictures).
https://www.eevblog.com/forum/programming/which-strlen-function-is-faster/msg3393358/#msg3393358
You've got a recent CPU with a very competitive REP MOVSB, at least up to 128 byte size, and from 16k and up. And not awful between those.
I don't think you'd get anywhere near as good results from it even as recently as Skylake.
-
The link goes to the memcpy results, don't know why - the memcmp results are at the top of the page.
The memcmp does not use the rep-movsb instruction.
The CPU is 5 years old. The optimized rep-movsb instruction is standard since about 2010 (Haswell?).
Newer CPUs should be even better, especially with the small size array optimization.
-
Last I checked, the GNU C library didn't even use MMX or SSE2 instructions for things like memcpy(). I don't know what the reason is. MMX has been around since 1997. Most of the GNU graphical tool kits are so slow that a 1GHz CPU is needed to open a context menu smoothly.
There was also the whole fiasco where the Intel C compiler's runtime dispatch would fall back to the unoptimized code path, effectively disabling the optimizations, if it did not detect an Intel Pentium 3 or 4 CPU. This was back around 2005 and AMD took legal action.
-
Glibc has had vector optimization for memcpy, memcmp, and friends for ages.
-
glibc-2.8 from 2009 didn't use any vector optimizations, even though all AMD64 CPUs support SSE2, and those had come out six years earlier, so the excuse of maintaining compatibility with all CPUs of that class doesn't hold. For example, not all i686 CPUs support MMX, so the non-MMX-equipped Pentium Pro could be the reason for not using MMX. It probably got to the point where the threat of outside patches to glibc motivated them to add support.
-
It's been a long time since I wrote assembly for the x86 platform, but isn't it possible to read a register and determine the CPU? Knowing that, you could choose the most advanced approach supported by that hardware.
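On x86 that's the CPUID instruction. A minimal sketch using GCC/Clang's <cpuid.h> helper, checking just the SSE2 bit:
#include <stdio.h>
#include <cpuid.h>
int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    // Leaf 1: feature flags; bit_SSE2 lives in EDX.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        printf("SSE2: %s\n", (edx & bit_SSE2) ? "yes" : "no");
    return 0;
}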
-
Get a pro-grade compiler, not the old free one: https://stackoverflow.com/questions/152016/detecting-cpu-architecture-compile-time