Author Topic: Should memcmp be faster than a loop? (32F417)  (Read 3403 times)


Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3642
  • Country: gb
  • Doing electronics since the 1960s...
Should memcmp be faster than a loop? (32F417)
« on: October 27, 2022, 09:36:18 am »
I am doing some RAM testing and found that

Code: [Select]
if ( memcmp(buf0,buf1,PKTSIZE)!=0 ) errcnt++;
is no faster than

Code: [Select]
uint32_t errcnt=0;
for (int i=0; i<PKTSIZE; i++)
{
    if (buf0[i]!=buf1[i]) errcnt++;
}

which surprised me, since the lib functions are supposed to be super-optimised, using word size compares etc.

The two buffers were malloc'd so are definitely 4-aligned and probably 8- or 16-aligned.

I would expect a loop doing PKTSIZE/4 word compares, but had some trouble making it work (pointer notation) :)

I am using the -fno-tree-loop-distribute-patterns compiler option, to prevent loops being replaced with lib functions.
« Last Edit: October 27, 2022, 09:47:04 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline RoGeorge

  • Super Contributor
  • ***
  • Posts: 6136
  • Country: ro
Re: Should memcmp be faster than a loop? (32F417)
« Reply #1 on: October 27, 2022, 10:48:51 am »
Look at the generated assembler instructions; that will tell you why the speed is or isn't as expected.

Offline Berni

  • Super Contributor
  • ***
  • Posts: 4911
  • Country: si
Re: Should memcmp be faster than a loop? (32F417)
« Reply #2 on: October 27, 2022, 10:57:21 am »
Compilers are very smart these days, so likely both of them got optimized down into very similar or even identical machine code.

I had a nice example where I used a function to calculate a checksum over an area of flash, for verifying it. The compiler figured out that it could precalculate the result, so instead of actually calling the function it just returned the fixed number the checksum would have produced. It is impressive how far compilers can go with optimizations.
 
The following users thanked this post: I wanted a rude username

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3642
  • Country: gb
  • Doing electronics since the 1960s...
Re: Should memcmp be faster than a loop? (32F417)
« Reply #3 on: October 27, 2022, 11:06:12 am »
The stdlib memcmp doesn't seem to be doing anything remotely clever

Code: [Select]
          memcmp:
0804643c:   push    {r4, lr}
0804643e:   subs    r1, #1
08046440:   add     r2, r0
08046442:   cmp     r0, r2
08046444:   bne.n   0x804644a <memcmp+14>
08046446:   movs    r0, #0
08046448:   b.n     0x8046456 <memcmp+26>
0804644a:   ldrb    r3, [r0, #0]
0804644c:   ldrb.w  r4, [r1, #1]!
08046450:   cmp     r3, r4
08046452:   beq.n   0x8046458 <memcmp+28>
08046454:   subs    r0, r3, r4
08046456:   pop     {r4, pc}
08046458:   adds    r0, #1
0804645a:   b.n     0x8046442 <memcmp+6>

I see no attempt to identify e.g. opportunities for word compares.

BTW the block is 2048 bytes.
buf0 is at 0x2000cd88
buf1 is at 0x2000d590
Both buffers are 8-aligned, which is what I would expect (re another thread on the standard dictating malloc'd block alignment).

Tried this but obviously I haven't got the casts correct :)

Code: [Select]

// Compare two blocks. Must be 4-aligned, and blocksize a multiple of 4.
// Return True if compare ok.

bool ft_memcmp(const void * buf1, const void * buf2, uint32_t blocksize)
{
bool ret = true;
for (int i=0; i<(blocksize/4);i+=4)
{
if ( (*(uint32_t*)buf1) != (*(uint32_t*)buf2) )
{
ret=false;
break;
}
}
return (ret);
}
« Last Edit: October 27, 2022, 11:32:03 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline AndyBeez

  • Frequent Contributor
  • **
  • Posts: 853
  • Country: nu
Re: Should memcmp be faster than a loop? (32F417)
« Reply #4 on: October 27, 2022, 12:06:45 pm »
Memcmp is still an iterator under the hood.

I would be interested to see how the compiler optimizes your "for (int i=0; i<(blocksize/4);i+=4)" line. You could write it in a theoretically faster way: "for (int i=0; i<(blocksize>>2);i+=4)", but would the compiler be doing this cheat anyway?
 

Offline Berni

  • Super Contributor
  • ***
  • Posts: 4911
  • Country: si
Re: Should memcmp be faster than a loop? (32F417)
« Reply #5 on: October 27, 2022, 12:17:00 pm »
If I were to do the 32-bit comparison optimization, I would likely define a new uint32 pointer variable, cast the buffer into that, then just step it across the data exactly like before, only dividing the buffer length by 4. Or actually >>2, since a bit shift is faster (though I think the compiler optimizer does that fast-divide trick anyway).

The compiler might be avoiding this optimization on its own because it cannot be confident that the data will arrive 4-byte aligned.
 

Online newbrain

  • Super Contributor
  • ***
  • Posts: 1706
  • Country: se
Re: Should memcmp be faster than a loop? (32F417)
« Reply #6 on: October 27, 2022, 01:13:32 pm »
Code: [Select]
for (int i=0; i<(blocksize/4);i+=4)
{
if ( (*(uint32_t*)buf1) != (*(uint32_t*)buf2) )
{
ret=false;
break;
}
}
This code is only comparing the first word of buf1 and buf2.
Also, the count is wrong: if you count up to blocksize/4, you should not increment by 4...
Nandemo wa shiranai wa yo, shitteru koto dake.
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3642
  • Country: gb
  • Doing electronics since the 1960s...
Re: Should memcmp be faster than a loop? (32F417)
« Reply #7 on: October 27, 2022, 01:17:11 pm »
Quote
You could write that in a theoretically faster way: "for (int i=0; i<(blocksize>>2)

I would expect a /4 to be done as a shift; even the 1980s Z80 compilers did that. It would also be precomputed. But hey it doesn't:

Code: [Select]
          ft_memcmp:
08043b54:   movs    r3, #0
08043b56:   cmp.w   r3, r2, lsr #2
08043b5a:   bcs.n   0x8043b78 <ft_memcmp+36>
 356      {
08043b5c:   push    {r4, r5}
 360      if ( (*(uint32_t*)buf1) != (*(uint32_t*)buf2) )
08043b5e:   ldr     r4, [r0, #0]
08043b60:   ldr     r5, [r1, #0]
08043b62:   cmp     r4, r5
08043b64:   bne.n   0x8043b74 <ft_memcmp+32>
 358      for (int i=0; i<(blocksize/4);i+=4)
08043b66:   adds    r3, #4
08043b68:   cmp.w   r3, r2, lsr #2
08043b6c:   bcc.n   0x8043b5e <ft_memcmp+10>
 357      bool ret = true;
08043b6e:   movs    r0, #1
 367      }
08043b70:   pop     {r4, r5}
08043b72:   bx      lr
 362      ret=false;
08043b74:   movs    r0, #0
08043b76:   b.n     0x8043b70 <ft_memcmp+28>
 357      bool ret = true;
08043b78:   movs    r0, #1
 367      }
08043b7a:   bx      lr
 891      {

Quote
define a new uint32 pointer variable, cast the buffer into that then just step that across the data exactly like before

I am not clever enough :)

Quote
This code is only comparing the first word of buf1 and buf2.
Also, the count is wrong: if you count up to blocksize/4, you should not increment by 4...

Told you I am not clever enough :) I do only simple C. This doesn't compile

Code: [Select]
// Compare two blocks. Must be 4-aligned, and blocksize a multiple of 4.
// Return True if compare ok.

bool ft_memcmp(const void * buf1, const void * buf2, uint32_t blocksize)
{
bool ret = true;
uint32_t count = blocksize/4;

for (int i=0; i<count;i++)
{
if ( ((uint32_t*)&buf1[i]) != ((uint32_t*)&buf2[i]) )
{
ret=false;
break;
}
}
return (ret);
}
« Last Edit: October 27, 2022, 01:20:14 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online newbrain

  • Super Contributor
  • ***
  • Posts: 1706
  • Country: se
Re: Should memcmp be faster than a loop? (32F417)
« Reply #8 on: October 27, 2022, 01:19:20 pm »
The stdlib memcmp doesn't seem to be doing anything remotely clever
picolibc, largely inspired and sharing most of the code with newlib nano, uses an optimized loop for memcmp if both input pointers are word aligned and the right defines are used at compile time.
Nandemo wa shiranai wa yo, shitteru koto dake.
 
The following users thanked this post: peter-h

Online newbrain

  • Super Contributor
  • ***
  • Posts: 1706
  • Country: se
Re: Should memcmp be faster than a loop? (32F417)
« Reply #9 on: October 27, 2022, 01:26:09 pm »
I would expect a /4 to be done as a shift [...]. But hey it doesn't:

Code: [Select]
08043b56:   cmp.w   r3, r2, lsr #2
08043b68:   cmp.w   r3, r2, lsr #2

Yes, it definitely does!
Nandemo wa shiranai wa yo, shitteru koto dake.
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3642
  • Country: gb
  • Doing electronics since the 1960s...
Re: Should memcmp be faster than a loop? (32F417)
« Reply #10 on: October 27, 2022, 01:33:32 pm »
From picolib, I lifted some code and got this which may be right. I will corrupt some data and make sure it fails, but stepping through the loop it is doing the right thing, I think. a1 and a2 increment by 4, count starts at 512 and decrements by 1.

Code: [Select]
// Compare two blocks. Must be 4-aligned, and blocksize a multiple of 4.
// Return True if compare ok.

bool ft_memcmp(const void * buf1, const void * buf2, uint32_t blocksize)
{
bool ret = true;
uint32_t count = blocksize/4;
    uint32_t *a1;
    uint32_t *a2;
a1 = (uint32_t*) buf1;
    a2 = (uint32_t*) buf2;

    while (count>0)
    {
        if (*a1 != *a2)
        {
ret=false;
break;
        }
        a1++;
        a2++;
        count--;
    }

    return (ret);
}

And it is fast.

Quote
Yes, it definitely does!

I saw, but it seems to do it each time around the loop. Although the shift by 2 exec time may be hidden.
« Last Edit: October 27, 2022, 01:36:20 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 
The following users thanked this post: newbrain

Online newbrain

  • Super Contributor
  • ***
  • Posts: 1706
  • Country: se
Re: Should memcmp be faster than a loop? (32F417)
« Reply #11 on: October 27, 2022, 01:33:43 pm »
You could write that in a theoretically faster way: "for (int i=0; i<(blocksize>>2);i+=4)"
[rant]
I have to be blunt: this is the kind of micro-optimizations I abhor in my code and always flag when I do code reviews.
[/rant]
Two main reasons:
  • We have compilers that outsmart any human in this kind of things.
  • I have a guiding principle of "write what you mean", so if I want to say /4, I write /4, not >>2
(In any case, I would write /sizeof (the pointer target type).)
Nandemo wa shiranai wa yo, shitteru koto dake.
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3642
  • Country: gb
  • Doing electronics since the 1960s...
Re: Should memcmp be faster than a loop? (32F417)
« Reply #12 on: October 27, 2022, 01:44:09 pm »
This is the asm for the above code. Nothing clever; the main gain is 512 loops instead of 2048.

Code: [Select]
          ft_memcmp:
08043b54:   lsrs    r2, r2, #2
 364          while (count>0)
08043b56:   cbz     r2, 0x8043b78 <ft_memcmp+36>
 356      {
08043b58:   push    {r4}
 366              if (*a1 != *a2)
08043b5a:   ldr     r3, [r1, #0]
08043b5c:   ldr     r4, [r0, #0]
08043b5e:   cmp     r4, r3
08043b60:   bne.n   0x8043b74 <ft_memcmp+32>
 371              a1++;
08043b62:   adds    r0, #4
 372              a2++;
08043b64:   adds    r1, #4
 373              count--;
08043b66:   subs    r2, #1
 364          while (count>0)
08043b68:   cmp     r2, #0
08043b6a:   bne.n   0x8043b5a <ft_memcmp+6>
 357      bool ret = true;
08043b6c:   movs    r0, #1
 377      }
08043b6e:   ldr.w   r4, [sp], #4
08043b72:   bx      lr
 368      ret=false;
08043b74:   movs    r0, #0
08043b76:   b.n     0x8043b6e <ft_memcmp+26>
 357      bool ret = true;
08043b78:   movs    r0, #1
 377      }
08043b7a:   bx      lr
 901      {
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online newbrain

  • Super Contributor
  • ***
  • Posts: 1706
  • Country: se
Re: Should memcmp be faster than a loop? (32F417)
« Reply #13 on: October 27, 2022, 01:50:01 pm »
From picolib, I lifted some code and got this which may be right. [...]

And it is fast.
Looks good.
I also like early returns, if no cleanup is needed, as is the case here:
Code: [Select]
f(...)
{
    for(...)
    {
        ...
        if(failed test) return false;
        ...
    }
    return true;
}
It does not make a big difference (possibly none at all with the right optimization options), and some might dislike it.
Still it makes the code slightly more readable, in my view.

What -O or other optimization options are you using?
clang 15 does a by 4 loop unrolling with -O2 and higher.
« Last Edit: October 27, 2022, 01:53:33 pm by newbrain »
Nandemo wa shiranai wa yo, shitteru koto dake.
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3642
  • Country: gb
  • Doing electronics since the 1960s...
Re: Should memcmp be faster than a loop? (32F417)
« Reply #14 on: October 27, 2022, 02:04:06 pm »
I am using -Og. I have these notes in my design doc:

Compiler optimisation

Much time has been spent on this. It is a can of worms, because e.g. optimisation level -O3 replaces this loop
for (uint32_t i=0; i<length; i++)
{
 buf[offset+i]=data;
}
with memcpy(), which while “functional” will crash the system if you have this loop in the boot block and memcpy() is located at some address outside it (which it will be, being part of the standard library!). Selected boot block functions therefore have the attribute to force -O0 (zero optimisation) in case somebody tries -O3.

The basic optimisation of -O0 (zero) works fine, is easily fast enough for the job, and gives you informative single step debugging, but it produces about 30% more code. The next one, -Og, is the best one to use and doesn’t seem to do any risky stuff like above.

Arguably, one should develop with optimisation ON (say -Og) and then you will pick up problems as you go along. Then switch to -O0 for single stepping if needed to track something down. Whereas if you write the whole thing with -O0 and only change to something else later, you have an impossible amount of stuff to regression-test.

The problems tend to be really subtle, especially if timing issues arise. For example the 20ns min CS high time for the 45DB serial FLASH can be violated (by two function calls in succession) with a 168MHz CPU, and the function hang_around_us() is useful just in case. A lot of ST HAL code works only because it is so bloated. Such time-critical code needs to be in a function which has the
__attribute__((optimize("O0")))
attribute above it. Or be written in assembler.

These figures show the relative benefits, at a particular stage of the project

-O0 produces 491k
-Og produces 342k
-O1 produces 338k
-Os produces 305k

Others, not listed above, have not been tested.

A compiler command line switch -fno-tree-loop-distribute-patterns has been added to globally prevent memcpy etc substitutions.


===

I have tested the code with -O3 and it all seems to run, but I am not interested in spending days re-testing every feature. -Og does sometimes result in "optimised out" for some variable when I am stepping through but it doesn't usually matter, and I can always make it "volatile" just for the test.

I try to avoid early returns, because they can bite. I use them only early on, e.g.
if (buf1==NULL) return;
to avoid an invalid access trap.
« Last Edit: October 27, 2022, 02:07:40 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online newbrain

  • Super Contributor
  • ***
  • Posts: 1706
  • Country: se
Re: Should memcmp be faster than a loop? (32F417)
« Reply #15 on: October 27, 2022, 03:19:23 pm »
Quote
Yes, it definitely does!

I saw, but it seems to do it each time around the loop. Although the shift by 2 exec time may be hidden.
Yes, the cycle cost for an operand shift is zero on Cortex-M3 and higher, so doing it once before the loop would actually have increased the cycle count!
Quote from: https://developer.arm.com/documentation/100166/0001/Programmers-Model/Instruction-set-summary/Table-of-processor-instructions?lang=en
the <op2> field can be replaced with one of the following options:
    ...
    An immediate shifted register, for example Rm, LSL #4.
    A register shifted register, for example Rm, LSL Rs.
    ...
Compare   CMP Rn, <op2>   1
Nandemo wa shiranai wa yo, shitteru koto dake.
 

Offline IDEngineer

  • Super Contributor
  • ***
  • Posts: 1923
  • Country: us
Re: Should memcmp be faster than a loop? (32F417)
« Reply #16 on: November 23, 2022, 08:34:43 am »
Properly written libraries should take advantage of hardware features when they are notified of the platform.

For example, later-generation x86 CPUs have the MOVSB, MOVSW, STOSB, STOSW, and similar opcodes which, with a REP prefix, loop in hardware based on the CX register. No software-based construct can be as fast, so libraries should use them when possible.

This is similar to using a floating point coprocessor, when available, instead of pure software FP back when FP hardware wasn't always on the die.
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 357
  • Country: be
Re: Should memcmp be faster than a loop? (32F417)
« Reply #17 on: November 26, 2022, 06:12:57 pm »
FYI, attached is the source code of memcmp() taken from Newlib sources:

Code: [Select]
//
// gcc-arm-none-eabi-9-2020-q2-update/src/newlib/newlib/libc/string/memcmp.c
//
 
#include <string.h>


/* Nonzero if either X or Y is not aligned on a "long" boundary.  */
#define UNALIGNED(X, Y) \
  (((long)X & (sizeof (long) - 1)) | ((long)Y & (sizeof (long) - 1)))

/* How many bytes are copied each iteration of the word copy loop.  */
#define LBLOCKSIZE (sizeof (long))

/* Threshhold for punting to the byte copier.  */
#define TOO_SMALL(LEN)  ((LEN) < LBLOCKSIZE)

int
memcmp (const void *m1,
        const void *m2,
        size_t n)
{
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
  unsigned char *s1 = (unsigned char *) m1;
  unsigned char *s2 = (unsigned char *) m2;

  while (n--)
    {
      if (*s1 != *s2)
        {
          return *s1 - *s2;
        }
      s1++;
      s2++;
    }
  return 0;
#else 
  unsigned char *s1 = (unsigned char *) m1;
  unsigned char *s2 = (unsigned char *) m2;
  unsigned long *a1;
  unsigned long *a2;

  /* If the size is too small, or either pointer is unaligned,
     then we punt to the byte compare loop.  Hopefully this will
     not turn up in inner loops.  */
  if (!TOO_SMALL(n) && !UNALIGNED(s1,s2))
    {
      /* Otherwise, load and compare the blocks of memory one
         word at a time.  */
      a1 = (unsigned long*) s1;
      a2 = (unsigned long*) s2;
      while (n >= LBLOCKSIZE)
        {
          if (*a1 != *a2)
            break;
          a1++;
          a2++;
          n -= LBLOCKSIZE;
        }

      /* check m mod LBLOCKSIZE remaining characters */

      s1 = (unsigned char*)a1;
      s2 = (unsigned char*)a2;
    }
  while (n--)
    {
      if (*s1 != *s2)
        return *s1 - *s2;
      s1++;
      s2++;
    }

  return 0;
#endif /* not PREFER_SIZE_OVER_SPEED */
}


Newlib usually comes with arm-none-eabi-gcc compilers in the form of a libc.a library. Depending on how it was compiled, you'll get this or that result.
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3642
  • Country: gb
  • Doing electronics since the 1960s...
Re: Should memcmp be faster than a loop? (32F417)
« Reply #18 on: November 26, 2022, 06:30:33 pm »
Interesting. Mine was obviously compiled for smaller size.

What controls how it gets compiled? I don't think the sources are supplied (with Cube) so there is no way I can control this using say the -O optimisation setting.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14230
  • Country: fr
Re: Should memcmp be faster than a loop? (32F417)
« Reply #19 on: November 26, 2022, 07:17:19 pm »
A bit like memcpy(), this is an endless source of tricks, benchmarks and blog posts.
The most efficient way of achieving it will depend on so many factors that it's impossible to answer what is going to be faster in a given context. It's not always the functions from the standard library. But one thing to consider is that compilers are now clever enough to generate inline, optimized code for memcpy() and memcmp() in a large number of cases, so you'll be hard-pressed to do better. Usually not worth your time, especially if the memory blocks are relatively small.

 

Offline IDEngineer

  • Super Contributor
  • ***
  • Posts: 1923
  • Country: us
Re: Should memcmp be faster than a loop? (32F417)
« Reply #20 on: November 26, 2022, 07:39:23 pm »
FYI, attached is the source code of memcmp() taken from Newlib sources:
Interesting that this "memcmp" source punts to what it calls a "byte copier". Hmmm.  :palm:

Properly written block functions shouldn't always punt misaligned pointers to byte-sized routines. The efficient way to do misalignment is to handle the edges with byte copies, but still do the big "middle" section with the largest native size. There's obviously a lowest useful block size below which the realignment overhead isn't worth it, but you can do that math one time and hardcode the threshold for a given architecture. Above a certain size, there's always a "middle" block that is aligned for both source and destination. No reason to grind through that a byte at a time instead of using the hardware's largest and most efficient data size.

This can be written as "fallthrough" code where the front-end misalignment (if any) is handled at the byte level, the middle section is done fast, and the back-end misalignment (if any) is handled at the byte level. This yields small AND fast code. Kinda get a warm feeling when it's done.
« Last Edit: November 26, 2022, 07:42:35 pm by IDEngineer »
 
The following users thanked this post: peter-h

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3642
  • Country: gb
  • Doing electronics since the 1960s...
Re: Should memcmp be faster than a loop? (32F417)
« Reply #21 on: November 26, 2022, 08:05:48 pm »
That's exactly what I thought the lib functions did. There is no penalty, other than the code size, but this function is so tiny anyway.

I wrote the comparison loop using the 32 bit cast anyway (and made sure the buffers are aligned).
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 357
  • Country: be
Re: Should memcmp be faster than a loop? (32F417)
« Reply #22 on: November 26, 2022, 08:20:32 pm »
What controls how it gets compiled? I don't think the sources are supplied (with Cube) so there is no way I can control this using say the -O optimisation setting.

The library comes bundled with the compiler from the ARM Developer website. It is in binary form, and unless you ask nicely, the exact compilation options are not known.

You can still get the library source code from the same site, and if you fancy you can extract some select files to add them to your project (or make a submodule in your git/svn/...) and compile them with whatever options that best suit your needs. More work, but you will be in full control.
 
The following users thanked this post: peter-h

Offline Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11518
  • Country: my
  • reassessing directives...
Re: Should memcmp be faster than a loop? (32F417)
« Reply #23 on: November 26, 2022, 08:31:35 pm »
Code: [Select]
// Compare two blocks. Must be 4-aligned, and blocksize a multiple of 4.
// Return True if compare ok.

bool ft_memcmp(const void * buf1, const void * buf2, uint32_t blocksize)
{
bool ret = true;
uint32_t count = blocksize/4;
    uint32_t *a1;
    uint32_t *a2;
a1 = (uint32_t*) buf1;
    a2 = (uint32_t*) buf2;

    while (count>0)
    {
        if (*a1 != *a2)
        {
ret=false;
break;
        }
        a1++;
        a2++;
        count--;
    }

    return (ret);
}
I like it tidy, because it's fun. Your code is not wrong, I'm just doing a mental/memory refresh so I won't forget C...
Code: [Select]
// Compare two blocks. Must be 4-aligned, and blocksize a multiple of 4.
// Return True if compare ok.

bool ft_memcmp(const void *buf1, const void *buf2, uint32_t blocksize) {
    uint32_t count = blocksize / 4;
    uint32_t *a1 = (uint32_t*) buf1;
    uint32_t *a2 = (uint32_t*) buf2;
    while ((count--) > 0) if (*(a1++) != *(a2++)) return false;
    return true;
}
there should be some mistakes here and there, the point is... tidy ;D
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 
The following users thanked this post: peter-h

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 3973
  • Country: nz
Re: Should memcmp be faster than a loop? (32F417)
« Reply #24 on: November 27, 2022, 12:46:21 am »
Quote
Yes, it definitely does!

I saw, but it seems to do it each time around the loop. Although the shift by 2 exec time may be hidden.
Yes, the cycle cost for an operand shift is zero in cortex-m3 and higher, so doing it once before the loop would have actually increased the cycle count!

Doing extra work is never free.

It might be hidden in, for example, a lower max MHz on the part than would otherwise be possible. See, for example, the 1-cycle multiply option on Cortex M0+. You'll never see that on a high-clocked part.

Most CPUs are designed so that register read -> add/sub/cmp -> writeback/forward is on the critical path. Arbitrary size shifts have a similar gate delay depth to add (perhaps just slightly more if the adder has good carry look-ahead not just ripple), so can usually be accommodated too. Doing an arbitrary-sized shift AND THEN an add/sub/cmp basically doubles the critical path in the ALU pipe stage, potentially halving the clock speed attainable.

Small shifts of {0,1,2,3} bits can probably be accommodated before an adder and end up about the same total delay as an arbitrary shift. But the ARM ISA allows an arbitrary shift.
 

