Topic: Should memcmp be faster than a loop? (32F417)

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 3973
  • Country: nz
Re: Should memcmp be faster than a loop? (32F417)
« Reply #25 on: November 27, 2022, 12:53:18 am »
For example, later generation x86 CPU's have the MOVSB, MOVSW, STOSB, STOSW, and similar opcodes which can be put into a hardware loop based on the CX register. No software based construct can be as fast, so libraries should use them when possible.

REP MOVSB and friends have been in 8086 from Day #1.

For almost all x86 models in the last 45 years a software loop has been faster than REP MOVSB.

Here's an old discussion involving Intel engineers:

https://community.intel.com/t5/Intel-Fortran-Compiler/Time-to-revisit-REP-MOVS/td-p/796377

I believe some of the most recent µarch revisions might have given REP MOVSB the performance love and attention that people assume it always had. But that's VERY recent.
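For readers who have not seen it, this is roughly what a REP MOVSB copy looks like when wrapped in C. It is a minimal sketch that assumes GCC or Clang inline assembly on x86; `copy_rep_movsb` is a hypothetical helper name, and other targets fall back to a plain byte loop. It says nothing about speed on any given CPU; it only shows the instruction being discussed.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n bytes with the x86 string instruction REP MOVSB.
 * Assumption: GCC/Clang inline asm syntax. On any other compiler or
 * target the function falls back to an ordinary byte loop.
 * copy_rep_movsb is an illustrative name, not a library function. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;

    assert(n == 0 || (dst != NULL && src != NULL));

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* (R|E)DI = destination, (R|E)SI = source, (R|E)CX = byte count.
     * The calling convention keeps the direction flag clear, so the
     * copy runs forward. All three registers are modified, hence "+". */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return ret;
}
```

Calling `copy_rep_movsb(dst, src, len)` behaves like a forward `memcpy` for non-overlapping buffers. Whether it beats a software loop depends on the CPU generation, as discussed above.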
 

Offline IDEngineer

  • Super Contributor
  • ***
  • Posts: 1923
  • Country: us
Re: Should memcmp be faster than a loop? (32F417)
« Reply #26 on: November 27, 2022, 04:43:27 am »
That was a VERY informative thread reference, thank you!

It has been a very long time since I wrote x86 Assembly code so modern hardware has made my earlier comments out of date.

The best new data for me was learning that the FP Assembly has its own block-move instructions. That's incredibly helpful. *Those* should be able to optimize per my discussion above, and actually do even better than I expected thanks to the lack of MovSx's historical baggage and their inherently wider operands.

Thanks again, that was well worth updating my knowledge!
 

Online Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11518
  • Country: my
  • reassessing directives...
Re: Should memcmp be faster than a loop? (32F417)
« Reply #27 on: November 27, 2022, 10:10:08 am »
Properly written libraries should take advantage of hardware features when they are notified of the platform.
True. That's why building even a simple library routine like memcmp can be a daunting task, with lots of preprocessor directives checking #IFDEFINED MACHINE(this and that).

For example, later generation x86 CPU's have the MOVSB, MOVSW, STOSB, STOSW, and similar opcodes which can be put into a hardware loop based on the CX register. No software based construct can be as fast, so libraries should use them when possible.

This is similar to using a floating point coprocessor, when available, instead of pure software FP back when FP hardware wasn't always on the die.
I believe the modern compiler is partly (or wholly?) taking over this job, so using an old compiler keeps us from taking advantage of new opcodes and machine architectures. That's why my stand when coding is to forget about machine-dependent code (or trying to be clever by hand-tuning for each machine architecture), because machine-dependent code becomes obsolete pretty soon. When I read about algorithms and analysis in any software engineering or computer science book, I almost never see machine dependency discussed or tackled; we should concentrate on the operation complexity of an algorithm (language and hardware agnostic). As for the fastest (or smallest) code on a certain machine, let the compiler do its best, or else just code in assembly. YMMV.
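As a small illustration of the kind of preprocessor checking meant here, the sketch below picks a chunk width at compile time and then compares one chunk at a time. `chunk_t` and `bytes_equal` are hypothetical names, and the `UINTPTR_MAX` test is only one assumed way of telling 32-bit targets (such as a Cortex-M4) from 64-bit ones. The rest is plain C that the compiler is free to optimize for the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compile-time selection: a 64-bit chunk on 64-bit targets,
 * a 32-bit chunk everywhere else. chunk_t is an illustrative name. */
#if UINTPTR_MAX > 0xFFFFFFFFu
typedef uint64_t chunk_t;
#else
typedef uint32_t chunk_t;
#endif

/* Return 1 if both buffers hold the same n bytes, else 0.
 * Unlike memcmp, this only answers equal / not equal. */
static int bytes_equal(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    assert(n == 0 || (a != NULL && b != NULL));

    /* Whole chunks first. memcpy keeps the loads alignment-safe;
     * compilers typically reduce it to a single load instruction. */
    while (n >= sizeof(chunk_t)) {
        chunk_t wa, wb;
        memcpy(&wa, pa, sizeof wa);
        memcpy(&wb, pb, sizeof wb);
        if (wa != wb)
            return 0;
        pa += sizeof wa;
        pb += sizeof wb;
        n -= sizeof wa;
    }

    /* Remaining tail, one byte at a time. */
    while (n--) {
        if (*pa++ != *pb++)
            return 0;
    }
    return 1;
}
```

The only machine-dependent part is the one `#if`; everything else is left to the compiler, which is the split argued for in this post.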
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 

Offline udok

  • Regular Contributor
  • *
  • Posts: 57
  • Country: at
Re: Should memcmp be faster than a loop? (32F417)
« Reply #28 on: November 30, 2022, 08:33:29 am »
Which platform?

I did some tests on Windows, and memcmp is many times faster than a simple loop (read the first post with pictures).

https://www.eevblog.com/forum/programming/which-strlen-function-is-faster/msg3393358/#msg3393358
« Last Edit: November 30, 2022, 08:54:43 am by udok »
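For reference, the "simple loop" in comparisons like this is usually something along the following lines. This is an assumed baseline, not the code used in the linked tests, and `loop_memcmp` is a hypothetical name. It follows the memcmp convention of comparing bytes as unsigned char and returning a negative, zero, or positive result.

```c
#include <assert.h>
#include <stddef.h>

/* Byte-at-a-time comparison with memcmp-style results:
 * < 0 if a sorts before b, 0 if equal, > 0 if a sorts after b.
 * loop_memcmp is an illustrative name for the "simple loop" baseline. */
static int loop_memcmp(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    assert(n == 0 || (a != NULL && b != NULL));

    for (size_t i = 0; i < n; i++) {
        if (pa[i] != pb[i])
            return (pa[i] < pb[i]) ? -1 : 1;
    }
    return 0;
}
```

A library memcmp typically gets its advantage by comparing several bytes per iteration, so the gap against this loop grows with buffer size.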
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 3973
  • Country: nz
Re: Should memcmp be faster than a loop? (32F417)
« Reply #29 on: November 30, 2022, 09:29:21 am »
Which platform?

I did some tests on Windows, and memcmp is many times faster than a simple loop (read the first post with pictures).

https://www.eevblog.com/forum/programming/which-strlen-function-is-faster/msg3393358/#msg3393358

You've got a recent CPU with a very competitive REP MOVSB, at least up to 128 byte size, and from 16k and up. And not awful between those.

I don't think you'd get anywhere near as good results from it even as recently as Skylake.
 

Offline udok

  • Regular Contributor
  • *
  • Posts: 57
  • Country: at
Re: Should memcmp be faster than a loop? (32F417)
« Reply #30 on: November 30, 2022, 09:40:16 am »
The link goes to the memcpy results, I don't know why. The memcmp results are at the top of the page.
The memcmp does not use the rep-movsb instruction.

The CPU is 5 years old. The optimized rep-movsb instruction (ERMSB) has been standard since about 2012 (Ivy Bridge, the generation before Haswell).
Newer CPUs should be even better,  especially with the small size array optimization.
 

Offline 2N2222A

  • Regular Contributor
  • *
  • Posts: 69
  • Country: fr
Re: Should memcmp be faster than a loop? (32F417)
« Reply #31 on: December 05, 2022, 02:28:01 am »
Last I checked, the GNU C library didn't even use MMX or SSE2 instructions for things like memcpy(). I don't know what the reason is. MMX has been around since 1997. Most of the GNU graphical tool kits are so slow that a 1GHz CPU is needed to open a context menu smoothly.

There was also the whole fiasco where code built with the Intel C compiler would effectively disable its optimizations by not taking the optimized code branch at run time unless it detected an Intel Pentium 3 or 4 CPU. This was back around 2005 and AMD took legal action.
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3669
  • Country: us
Re: Should memcmp be faster than a loop? (32F417)
« Reply #32 on: December 05, 2022, 03:49:03 am »
Glibc has had vector optimization for memcpy, memcmp, and friends for ages.
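To give an idea of what "vector optimization" means for a compare routine, here is a minimal sketch using SSE2 intrinsics. It is not glibc's implementation; `bytes_equal_sse2` is a hypothetical name, and the SSE2 path is compiled only when the compiler defines `__SSE2__`. Otherwise the byte loop handles everything.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Return 1 if both buffers hold the same n bytes, else 0.
 * Illustrative only: real library routines also handle alignment,
 * page boundaries and ordering results, which this sketch ignores. */
static int bytes_equal_sse2(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    assert(n == 0 || (a != NULL && b != NULL));

#if defined(__SSE2__)
    while (n >= 16) {
        /* Unaligned 16-byte loads from each buffer. */
        __m128i va = _mm_loadu_si128((const __m128i *)pa);
        __m128i vb = _mm_loadu_si128((const __m128i *)pb);
        /* cmpeq sets every equal byte lane to 0xFF; movemask packs the
         * top bit of each lane into an int, so 0xFFFF means all equal. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) != 0xFFFF)
            return 0;
        pa += 16;
        pb += 16;
        n -= 16;
    }
#endif

    /* Tail (or everything, when SSE2 is not available). */
    while (n--) {
        if (*pa++ != *pb++)
            return 0;
    }
    return 1;
}
```

Each loop iteration checks 16 bytes with three instructions plus a branch, which is where the speedup over a byte loop comes from on larger buffers.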
 

Offline 2N2222A

  • Regular Contributor
  • *
  • Posts: 69
  • Country: fr
Re: Should memcmp be faster than a loop? (32F417)
« Reply #33 on: December 06, 2022, 04:05:40 am »
glibc-2.8 from 2008 didn't use any vector optimizations, even though all AMD64 CPUs support SSE2 and had been out for about 5 years by then, so the excuse of maintaining compatibility with all CPUs of that class doesn't hold. By contrast, not all i686 CPUs support MMX, so the non-MMX Pentium Pro could be the reason for not using MMX. It probably got to the point where the threat of outside patches to glibc motivated them to add support.
 

Offline IDEngineer

  • Super Contributor
  • ***
  • Posts: 1923
  • Country: us
Re: Should memcmp be faster than a loop? (32F417)
« Reply #34 on: December 06, 2022, 04:44:54 am »
It's been a long time since I wrote Assembly for the x86 platform, but isn't it possible to read a register and determine the CPU? Knowing that, you could choose to use the most advanced approach supported by that hardware.
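On x86 the mechanism for this is the CPUID instruction. Below is a minimal sketch of run-time selection, assuming GCC or Clang and their `<cpuid.h>` helper `__get_cpuid`. The names `cpu_has_sse2`, `pick_equal`, `equal_bytewise` and `equal_fast_standin` are hypothetical, and the "fast" routine is only a stand-in that calls the library memcmp. On non-x86 targets the check simply reports no SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>       /* GCC/Clang wrapper around the CPUID instruction */
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

/* Return 1 if the running CPU reports SSE2 support, else 0. */
static int cpu_has_sse2(void)
{
#if HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;
    /* CPUID leaf 1: bit 26 of EDX is the SSE2 feature flag. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return (int)((edx >> 26) & 1u);
#endif
    return 0;
}

typedef int (*equal_fn)(const void *, const void *, size_t);

/* Portable baseline: compare one byte at a time. */
static int equal_bytewise(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    while (n--) {
        if (*pa++ != *pb++)
            return 0;
    }
    return 1;
}

/* Stand-in for a vectorized routine; here it just defers to memcmp. */
static int equal_fast_standin(const void *a, const void *b, size_t n)
{
    return memcmp(a, b, n) == 0;
}

/* Choose an implementation once, based on what the CPU reports. */
static equal_fn pick_equal(void)
{
    return cpu_has_sse2() ? equal_fast_standin : equal_bytewise;
}
```

A program would call `pick_equal()` once at start-up and keep the returned function pointer, so the CPUID query is not repeated on every comparison.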
 

Online Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11518
  • Country: my
  • reassessing directives...
Re: Should memcmp be faster than a loop? (32F417)
« Reply #35 on: December 06, 2022, 05:16:13 am »
Get a pro-grade compiler, not an old and free one: https://stackoverflow.com/questions/152016/detecting-cpu-architecture-compile-time
 

