Attached are the results of the memcpy function tests.
I attached the source code and the binaries too.
To build the sources, call ./build.sh from an MSYS2 or Cygwin terminal.
To run the tests, call ./run.sh from the terminal.
The test program is only 155 lines of C code, compiled on Windows under MSYS2.
Don't forget to set the INCLUDE, LIB and PATH variables for the
Microsoft VC compiler.
Mingw Notes:
- gcc with -O3 -march=native now produces AVX2 code for the memcpy for-loop, with good results.
  The AVX2 code generation did not work for the strcpy for-loop; I don't know why.
- gcc with -O2 produces the slowest result, only 0.8 bytes/cycle.
- The Mingw binary is ridiculously large, over 250 kBytes without stripping.
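For reference, the for-loop variant being benchmarked is presumably a plain byte-wise copy along these lines (the function name is mine, not from the attached sources):

```c
#include <stddef.h>

/* Byte-wise copy loop; with -O3 -march=native, gcc can auto-vectorize
 * this into AVX2 code, as noted above. */
void *forloop_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}
```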
Microsoft UCRT versus older MSVCRT Notes:
- The VCRUNTIME/UCRT DLL is faster than the old MSVCRT.
  The good news is that on Win10, MSVCRT basically uses the new vcruntime.dll under the hood.
- The older VS 2008 compiler has problems with the builtin functions:
  with -O2 the builtins are enabled and used instead of the MSVCRT functions,
  but unfortunately the builtin functions are very slow.
  The newer VS 2019 does not have this problem.
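On MSVC, which of the two you get can be forced per function: #pragma function() disables the builtin and forces a real CRT call, while #pragma intrinsic() does the opposite. A minimal sketch (the helper name is mine; the pragma only has effect under MSVC):

```c
#include <string.h>

#ifdef _MSC_VER
/* Force the library version of memcpy instead of the compiler builtin.
 * The reverse, #pragma intrinsic(memcpy), forces the builtin. */
#pragma function(memcpy)
#endif

void copy_buf(char *dst, const char *src, size_t n)
{
    memcpy(dst, src, n);  /* under MSVC with the pragma above: a real CRT call */
}
```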
Intel Notes:
- The Intel compiler inlines very aggressively and calls the Intel memcpy functions very often.
- With optimizations turned on, the code gets very large, same as with gcc.
Memcpy Test Program Size:
test_3_memcpy_movsb_intr_cl_15.exe 6656
test_4_memcpy_movsb_asm_cl_15.exe 6656
test_5a_msvcrt_dll_cl_15.exe 6656
test_6_forloop_cl_15.exe 6656
test_9_ucrt_dll_cl_19.exe 10752
test_2_AgnerFog_cl_15.exe 15360
test_1_Intel_icl_19.exe 41472
test_7_forloop_O2_gcc_10.exe 41472
test_8_forloop_O3_gcc_10.exe 42496
test_5b_msvcrt_static_cl_15.exe 68096
test_9_ucrt_static_cl_19.exe 124416
- Most programs are linked dynamically against MSVCRT.DLL or VCRUNTIME.DLL.
- The performance with static linking seems to be better and more reproducible,
  especially in the region above 30 bytes/cycle.
- The "rep; movsb" implementation relies on the CPU's microcoded fast-string
  support (ERMSB) and is fast and very small.
- The memcpy_movsb_intr test uses the C intrinsic __movsb(), which is inlined;
  memcpy_movsb_asm uses an external assembler function, which is not inlined.
- A function call on x64 is very cheap: parameters are passed in registers, and
  inlining is in most cases not faster.
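The two movsb tests presumably boil down to a copy like the following sketch. __movsb() is the documented MSVC intrinsic from <intrin.h>; the gcc/clang inline-asm path and the portable fallback are my additions, not from the attached sources:

```c
#include <stddef.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

static void movsb_copy(void *dst, const void *src, size_t n)
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    __movsb(dst, src, n);                 /* inlined rep movsb */
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("rep movsb"      /* ERMSB fast-string copy */
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
#else
    unsigned char *d = dst;               /* portable byte-loop fallback */
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
#endif
}
```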
I would want to see the actual application profiled, showing that strlen was a significant time sink (and exactly which strlen calls were burning the time), before even thinking of going down that rabbit hole. And then I would want to convince myself that a better high-level algorithm was not a better fix than optimising strlen or fucking with compiler options.
While it is possible for strlen to be the bottleneck, and for some applications it probably is (I remember from CppCon that Facebook spent a lot of time optimising string handling because it turns out to matter to their workload), I would bet that for 99.9% of all C programs the cost of strlen is noise.
I have definitely heard that memcpy/strcpy/memmove account for a significant fraction of general-purpose CPU cycles at places like Facebook/Google/Amazon. But that is a different situation: if you are operating giant datacenters, it makes sense to optimize something that is a few % of your total workload. Even for them, I guess it is not a big factor in any one application; the reason it is so big is that it is ubiquitous. It is easier to optimize one function in the C library than 1000 applications.
I am surprised by the large difference in performance of these simple functions, and by the fact that the C for() loop is so bad in comparison.
I'm not. The C for loop operates one byte at a time. It has to work regardless of the alignment of the inputs/outputs, and regardless of whether the length is a multiple of any particular word size. If the compiler were to convert it to multi-byte operations, it would need to add alignment checks at the beginning of the call that either choose a fast vs. slow implementation, or handle the ends slowly and let the fast algorithm do the middle. And it has to do that in a way that doesn't cause a big performance hit for short buffers, which are probably the most common case. Even detecting conditions that could be optimized like this is hard for the compiler, and if it did, it would pretty much only work for cases already handled by the standard library, so there isn't really much point.
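The head/middle/tail split described above can be sketched like this (a simplified, word-sized version for illustration, not actual library code; real implementations use SIMD and many more size classes):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy bytes until dst is 8-byte aligned, move the middle as 64-bit
 * words, then finish the tail byte-by-byte.  The word loads/stores go
 * through fixed-size memcpy() so they stay well-defined even if src is
 * unaligned; compilers lower these to single moves. */
void *split_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n && ((uintptr_t)d & 7)) {   /* unaligned head, slow */
        *d++ = *s++;
        n--;
    }
    while (n >= 8) {                    /* aligned middle, 8 bytes at a time */
        uint64_t w;
        memcpy(&w, s, 8);
        memcpy(d, &w, 8);
        d += 8; s += 8; n -= 8;
    }
    while (n--)                         /* tail, slow */
        *d++ = *s++;
    return dst;
}
```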
This is bad enough for memcpy, where the length is known in advance, but impossible for strlen() and strcpy(). Keep in mind that accessing, even just reading, past the end of an array is undefined behavior. When you write a for loop that implements strlen() byte-by-byte and the compiler doesn't know where the input char* came from, it isn't allowed to even access past the null byte.
Rare but not impossible. Consider the following code:

    typedef struct {
        char name[7];
        volatile char flag;
    } mystruct;

    my_strlen(mystruct_inst->name);

This is a stupid thing to do, but if my_strlen does a for loop by bytes, then any vectorization done automatically by the compiler is invalid behavior: a wide read of name would also touch the volatile flag. Now if you explicitly write a loop that accesses 32-bit words, the compiler is free to turn those into vector operations. In real life, as long as you do naturally aligned accesses you won't cross a page boundary or a memory-protection boundary, and won't venture into memory-mapped IO, so it is fine for the C library to provide a strlen() implementation that has a different contract, but the compiler isn't going to make that assumption for you.
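An explicit word-at-a-time strlen of the kind described above might look like this sketch (the classic zero-byte bit trick; note that the aligned 32-bit loads may inspect bytes past the terminator, which is exactly the relaxed contract a C library grants itself but the compiler cannot assume for your loop):

```c
#include <stddef.h>
#include <stdint.h>

size_t strlen_word(const char *s)
{
    const char *p = s;

    /* Walk byte-by-byte until p is 4-byte aligned. */
    while ((uintptr_t)p % sizeof(uint32_t) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Scan one aligned 32-bit word at a time.  The expression
     * (v - 0x01010101) & ~v & 0x80808080 is nonzero iff some byte
     * of v is zero (Bit Twiddling Hacks), so the loop stops at the
     * first word containing the terminator. */
    const uint32_t *w = (const uint32_t *)p;
    while (!((*w - 0x01010101u) & ~*w & 0x80808080u))
        w++;

    /* Locate the exact zero byte within that word. */
    p = (const char *)w;
    while (*p)
        p++;
    return (size_t)(p - s);
}
```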