Author Topic: GCC ARM32 compiler too clever, or not clever enough? (Read 13920 times)

Nominal Animal · « **Reply #100 on:** August 22, 2022, 05:32:43 am »

The functions are nothing special, and it is enough to just implement them (memcpy, memmove, memset, memcmp, and optionally abort) yourself.

peter-h · « **Reply #101 on:** August 22, 2022, 06:11:07 am »

I have done that for some of them, where I was calling them explicitly (with a prefix on the name for clarity). The challenge is this: in the main code, if the compiler invokes e.g. memset, it can use the stdlib one because it is available. But in the boot block one doesn't have stdlib so that substitution needs to be blocked or the function is provided locally.

If say the boot block is a single file and I put a static function in there called memset will the compiler use that one, even if it is not explicitly referenced? The compiler normally warns that function x is not used and strips it out. How do you avoid that?

It would be good if the .map file cross ref table showed where e.g. memcpy is used, but it doesn't.

brucehoult · « **Reply #102 on:** August 22, 2022, 09:09:08 am »

libc functions are weak linked, so if you have ANY function with the same name yourself it will use yours in preference.

If you don't care about speed you can just use a simple loop (or REP MOVSB etc, where available), make memcpy and memmove the same function, etc. (though I believe the compiler only calls memcpy and makes sure there is no overlap)

Nominal Animal · « **Reply #103 on:** August 22, 2022, 09:50:12 am »

Quote from: peter-h on August 22, 2022, 06:11:07 am

If say the boot block is a single file and I put a static function in there called memset will the compiler use that one, even if it is not explicitly referenced?

Yes, because even though the compiler itself generates the call, it is as if the structure assignment (or whatever code is being replaced) source code was replaced with the actual function call.

Quote from: peter-h on August 22, 2022, 06:11:07 am

The compiler normally warns that function x is not used and strips it out. How do you avoid that?

When the compiler does use the function, it will not complain nor strip the implementation out, because it can see the use case.

Quote from: peter-h on August 22, 2022, 06:11:07 am

It would be good if the .map file cross ref table showed where e.g. memcpy is used, but it doesn't.

You can see the symbol reference in the object file, though, using either readelf or objdump.

Quote from: brucehoult on August 22, 2022, 09:09:08 am

(though I believe the compiler only calls memcpy and makes sure there is no overlap)

I haven't explored all the cases where the compiler generates these calls, but I suggest a slightly different approach:

Implement the memory copy loop in two different functions, one that does the loop in increasing addresses, and the other in decreasing addresses. memcpy() can always call the first one. memmove() uses the increasing address one if source address is above the destination address, and the decreasing address one if source address is below the destination address. You can use either one when the source and destination addresses match. (Since memmove must act as if the data was first copied to a temporary array, and then copied back, we really shouldn't try and optimize the source==destination case out.)

The increasing address copy can then be used as memrepeat(data+1, data, (sizeof data)-(sizeof data[0])) to fill the array data with the contents of its first element. Especially useful when the elements are structures or unions.
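A minimal C sketch of the two-loop arrangement described above. The names copy_up, copy_down, fill_points and the my_ prefix are assumptions made for illustration (the prefix keeps the functions from colliding with libc in a hosted test build); in a boot block the public functions would carry the standard names. Addresses are compared as uintptr_t because relational comparison of pointers into different objects is not defined by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes in increasing address order. */
static void copy_up(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}

/* Copy n bytes in decreasing address order. */
static void copy_down(unsigned char *dst, const unsigned char *src, size_t n)
{
    dst += n;
    src += n;
    while (n--)
        *--dst = *--src;
}

/* memcpy: the ranges must not overlap, so one direction always works. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    copy_up(dst, src, n);
    return dst;
}

/* memmove: source above destination copies upward, otherwise downward. */
void *my_memmove(void *dst, const void *src, size_t n)
{
    if ((uintptr_t)src > (uintptr_t)dst)
        copy_up(dst, src, n);
    else
        copy_down(dst, src, n);
    return dst;
}

/* memrepeat: an upward copy with dst just above src replicates the
   leading bytes of src through the rest of the range. */
void *my_memrepeat(void *dst, const void *src, size_t n)
{
    copy_up(dst, src, n);
    return dst;
}

/* Usage: fill an array of structures with copies of its first element. */
struct point { int x, y; };

void fill_points(struct point *pts, size_t count, struct point first)
{
    assert(count > 0);
    pts[0] = first;
    my_memrepeat(pts + 1, pts, (count - 1) * sizeof pts[0]);
}
```

The fill works only because the copy runs strictly upward one byte at a time; a memmove-style or word-wise copy would not propagate the first element.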

brucehoult · « **Reply #104 on:** August 22, 2022, 11:21:37 am »

Quote from: Nominal Animal on August 22, 2022, 09:50:12 am

Quote from: brucehoult on August 22, 2022, 09:09:08 am
(though I believe the compiler only calls memcpy and makes sure there is no overlap)
I haven't explored all the cases where the compiler generates these calls, but I suggest a slightly different approach:

Implement the memory copy loop in two different functions, one that does the loop in increasing addresses, and the other in decreasing addresses. memcpy() can always call the first one. memmove() uses the increasing address one if source address is above the destination address, and the decreasing address one if source address is below the destination address.

That's not a different approach, it's the same approach.

Except memcpy() should be whichever of increasing or decreasing addresses is the fastest, and memmove() either uses the slower one or tail-calls memcpy when it can. Often a decreasing loop is the fastest.

You *could* if code space is really really tight have memset() poke the constant into the first/last bytes of the memory range and then call memcpy() with an overlapping range. However this is a pretty slow way to implement memset(), and probably setting up the call to memcpy() is just as much code as the (twice as fast) loop to just write the same data from a register repeatedly. False savings.

DiTBho · « **Reply #105 on:** August 22, 2022, 11:36:48 am »

Quote from: peter-h on April 16, 2022, 08:38:48 pm

Basically this IDE stuff is a tool which got put together by real men, for real men

for Homo real Neanderthalensis beings for real Neanderthals, you mean.

Nominal Animal · « **Reply #106 on:** August 22, 2022, 12:47:48 pm »

Quote from: brucehoult on August 22, 2022, 11:21:37 am

Quote from: Nominal Animal on August 22, 2022, 09:50:12 am
Quote from: brucehoult on August 22, 2022, 09:09:08 am
(though I believe the compiler only calls memcpy and makes sure there is no overlap)
I haven't explored all the cases where the compiler generates these calls, but I suggest a slightly different approach:

Implement the memory copy loop in two different functions, one that does the loop in increasing addresses, and the other in decreasing addresses. memcpy() can always call the first one. memmove() uses the increasing address one if source address is above the destination address, and the decreasing address one if source address is below the destination address.
That's not a different approach, it's the same approach.

Okay. I guess I got hung up on the mention of same function and avoid overlap, since I use a pair of functions that differ only in the loop direction.

Quote from: brucehoult on August 22, 2022, 11:21:37 am

Except memcpy() should be whichever of increasing or decreasing addresses is the fastest, and memmove() either uses the slower one or tail-calls memcpy when it can. Often a decreasing loop is the fastest.

True.

Quote from: brucehoult on August 22, 2022, 11:21:37 am

You *could* if code space is really really tight have memset() poke the constant into the first/last bytes of the memory range and then call memcpy() with an overlapping range. However this is a pretty slow way to implement memset(), and probably setting up the call to memcpy() is just as much code as the (twice as fast) loop to just write the same data from a register repeatedly. False savings.

Fully agreed.

(While memrepeat() does exactly that, it is only useful when the data to be filled is composite, a structure or union, or possibly a floating-point value on architectures without floating-point registers, and larger than a machine register in size. memset() is easily implemented more efficiently than that, and is common enough to spend the dozen or two bytes for its implementation.)

Even things like checking whether the range is aligned or not, and doing the loop using native word size elements (with the fill byte duplicated across the bytes in that word) is not usually worth it at run time, generally speaking, for any of these functions. Using C11 _Alignof (or the earlier GCC/clang/ICC __alignof__ operator) one can create an inline wrapper that can select between an optimized (native word alignment and size) or a byte-per-byte version, which may be useful on some architectures; but then that wrapper is visible in the header file and linkage won't be to memset()/memcpy()/memmove()/memcmp() but to the optimized or per-byte version instead.
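For reference, the word-at-a-time fill mentioned above might look roughly like this in C. This is a sketch, not a tuned implementation: the name my_memset_words, the choice of uint32_t as the "native word", and the use of the GCC/clang may_alias attribute to sidestep strict-aliasing rules are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit word that is allowed to alias any object type (GCC/clang extension). */
typedef uint32_t __attribute__((may_alias)) word_t;

static_assert(sizeof(word_t) == 4, "the fill pattern below assumes 4-byte words");

void *my_memset_words(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    const unsigned char byte = (unsigned char)c;

    /* Head: single bytes until p reaches a word boundary. */
    while (n > 0 && ((uintptr_t)p % sizeof(word_t)) != 0) {
        *p++ = byte;
        n--;
    }

    /* Body: whole words, with the fill byte duplicated into every byte. */
    const word_t pattern = (word_t)byte * UINT32_C(0x01010101);
    word_t *w = (word_t *)p;
    while (n >= sizeof(word_t)) {
        *w++ = pattern;
        n -= sizeof(word_t);
    }

    /* Tail: the remaining 0 to 3 bytes. */
    p = (unsigned char *)w;
    while (n--)
        *p++ = byte;

    return dst;
}
```

The head and tail loops are the run-time cost referred to above: for short fills they can easily outweigh what the word loop saves.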

Quote from: DiTBho on August 22, 2022, 11:36:48 am

Quote from: peter-h on April 16, 2022, 08:38:48 pm
Basically this IDE stuff is a tool which got put together by real men, for real men
for Homo real Neanderthalensis beings for real Neanderthals, you mean.

Dude, some of my distant ancestors were Neanderthals, and they had bigger brains than we do. The ooga-booga caveman is a poor stereotype. Get an expert in them drunk, and they'll admit that everything points to them having been more intelligent than us (heck, that people who lived a few thousand years ago were not only more intelligent than us but also had bodies closer to Olympic athletes than the potato sacks we are), which itself is sufficient incentive for most humans to paint them ugly.
If you exclude childhood deaths, even their life expectancy was similar to ours, and much longer than most humans during the agricultural era. Yes, they had hard lives in that most male skeletons found show signs of old fractures having fully healed, but that also shows that such injuries were dealt with and not fatal.

DiTBho · « **Reply #107 on:** August 22, 2022, 01:31:33 pm »

Quote from: SiliconWizard on June 20, 2022, 07:01:23 pm

Any code that is statically analyzed as unreachable during compilation will just not yield emitted code from the compiler. That usually happens even at the first level of optimization.

In avionics there is a specific activity (which someone is paid for) to check for dead (unreachable) code.
It's boring and annoying, but it's a task that has to be done, so I developed a tool to help people check for it in C.
I then recycled the project and merged it into myC.

DiTBho · « **Reply #108 on:** August 22, 2022, 01:39:25 pm »

Quote from: brucehoult on May 02, 2022, 01:29:31 pm

z80 is just SO ANNOYING to program. It shouldn't be worse than 6502, but it pretty much is, because it's just so inconsistent.

if you look at SmartC (on 90s Byte Magazine), or SDCC, well ... it's full of similar comments

DiTBho · « **Reply #109 on:** August 22, 2022, 01:51:40 pm »

Quote from: brucehoult on May 02, 2022, 01:29:31 pm

The z80 can have really fast and compact code if you manage to keep everything in its very limited register set. But if you run out and start having to load and store things to RAM then it gets pretty awful pretty quickly.

that's the same with 68hc11, just a little mitigated by the gcc-v3.4.6 trick of using internal ram as registers, but you still have push, pop, etc.

brucehoult · « **Reply #110 on:** August 22, 2022, 02:15:53 pm »

Quote from: Nominal Animal on August 22, 2022, 12:47:48 pm

Quote from: brucehoult on August 22, 2022, 11:21:37 am
Quote from: Nominal Animal on August 22, 2022, 09:50:12 am
Quote from: brucehoult on August 22, 2022, 09:09:08 am
(though I believe the compiler only calls memcpy and makes sure there is no overlap)
I haven't explored all the cases where the compiler generates these calls, but I suggest a slightly different approach:

Implement the memory copy loop in two different functions, one that does the loop in increasing addresses, and the other in decreasing addresses. memcpy() can always call the first one. memmove() uses the increasing address one if source address is above the destination address, and the decreasing address one if source address is below the destination address.
That's not a different approach, it's the same approach.
Okay. I guess I got hung up on the mention of same function and avoid overlap, since I use a pair of functions that differ only in the loop direction.

Ideally you write memcpy() and memmove() in assembly language as a single function with two entry points just a couple of instructions apart:

Code: [Select]

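memmove: # a0,a1,a2 = dst,src,len
    add a2,a2,a1 # a2 = 1 past end of src
    blt a0,a1,memcpy$1 # dst below src: an upward copy is safe
    bge a0,a2,memcpy$1 # dst at or past end of src: no overlap
    j slow_copy # dst lies inside the src range
memcpy:
    add a2,a2,a1 # a2 = 1 past end of src
memcpy$1:
    :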

Well, that's if there's a significant speed difference so you want to do memcpy() any time it's actually possible. If they're the same speed then just a simple test will do:

Code: [Select]

memmove: # a0,a1,a2 = dst,src,len
    bge a0,a1,copy_downwards
memcpy:
    # copy upwards
    :

Quote

Even things like checking whether the range is aligned or not, and doing the loop using native word size elements (with the fill byte duplicated across the bytes in that word) is not usually worth it at run time, generally speaking, for any of these functions.

If most copies are either very small (less than 16 bytes, say), or else larger than L2 cache (e.g. 256k or 8M or something like this) then it may be faster or at least as fast to just do a byte by byte copy. But for things from a few hundred bytes to a few hundred KB it's probably worth taking a few instructions to figure out the best tactic. If you have plenty of code space available.

I REALLY LOVE that pretty soon on most RISC-V CPUs you won't have to care and can just unconditionally do memcpy() as:

Code: [Select]

memcpy:
     mv      a3,a0
1:
     vsetvli a4,a2,e8,m4
     vle8.v  v0,(a1)
     add     a1,a1,a4
     sub     a2,a2,a4
     vse8.v  v0,(a3)
     add     a3,a3,a4
     bnez    a2,1b
     ret

... and that's going to be optimal for any size or alignment.

On the Allwinner D1 at 1 GHz, that code takes a constant 31 ns for any size from 0 to 64 bytes copied. The standard Debian glibc memcpy() takes from 50 ns for 0 bytes copied up to 112 ns for 64 bytes copied.

Similar applies to ARMv9, of course.

peter-h · « **Reply #111 on:** August 22, 2022, 02:52:31 pm »

Quote

libc functions are weak linked

Except in Cube libc.a they are not weak

(I solved that with objcopy -weaken...)

The other funny thing is that your own versions of the functions need the -O0 attribute otherwise the compiler will again replace them with the stdlib ones, won't it?

Doing them in asm is the safe thing.

eutectique · « **Reply #112 on:** August 22, 2022, 03:25:26 pm »

I have shown you that this is not the case: https://www.eevblog.com/forum/microcontrollers/is-st-cube-ide-a-piece-of-buggy-crap/msg4363990/#msg4363990

If you provide your implementation of a library function, it will be linked in. Regardless of optimisation flags. I've done it without weakening the symbols, which would be a terrible hack, IMHO.

Perhaps, the order of objects in your linker command is wrong?

brucehoult · « **Reply #113 on:** August 22, 2022, 03:31:21 pm »

Quote from: peter-h on August 22, 2022, 02:52:31 pm

Quote
libc functions are weak linked

Except in Cube libc.a they are not weak (I solved that with objcopy -weaken...)

The other funny thing is that your own versions of the functions need the -O0 attribute otherwise the compiler will again replace them with the stdlib ones, won't it?

Yup, that's easy to get with any simple copying or memset-like loop. I think I've only seen it with -O2 or -O3 on GCC, and NOT -O1 or even -Os.

Anyway, in GCC you can disable it with "-fno-tree-loop-distribute-patterns" added to CFLAGS. Or I guess with a pragma in the code.

Code: [Select]

#pragma GCC push_options
#pragma GCC optimize ("no-tree-loop-distribute-patterns")
:
:
#pragma GCC pop_options

Also, putting asm volatile ("") somewhere inside the loop should disable the optimisation on any compiler that accepts GCC-style inline asm.
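As a sketch of the barrier approach: a byte-wise fill whose loop body contains an empty asm statement. The names my_memset and clear_buffer_demo are assumptions so that the example can be built next to a hosted libc; in a boot block the function would be called memset. The empty asm emits no instructions, but because it is volatile the optimizer should not collapse the loop into a library call. This relies on GCC-style inline asm and has not been checked on every compiler or version.

```c
#include <assert.h>
#include <stddef.h>

void *my_memset(void *dst, int c, size_t n)
{
    unsigned char *p = dst;

    while (n--) {
        *p++ = (unsigned char)c;
        /* Empty volatile asm: generates no instructions, but the optimizer
           cannot see through it, so the loop is not treated as a memset idiom. */
        __asm__ volatile ("");
    }
    return dst;
}

/* Usage: clear a small buffer and check both ends. */
void clear_buffer_demo(void)
{
    unsigned char buf[16];

    my_memset(buf, 0, sizeof buf);
    assert(buf[0] == 0 && buf[15] == 0);
}
```

The price is that the barrier also blocks vectorisation and unrolling of the loop, which fits the earlier point that these routines rarely need to be fast in a boot block.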

Quote

Doing them in asm is the safe thing.

yup.

Nominal Animal · « **Reply #114 on:** August 22, 2022, 03:35:21 pm »

Quote from: peter-h on August 22, 2022, 02:52:31 pm

Doing them in asm is the safe thing.

Yes, I too agree. I usually use extended asm, which works in GCC and clang (and Intel CC on x86/x86-64).

I would not worry overmuch optimizing them for speed, either; they're not that critical.

It does annoy me that there aren't memcpyi()/memmovei()/memseti()/memcmpi(), memcpyl()/memmovel()/memsetl()/memcmpl(), and memcpyll()/memmovell()/memsetll()/memcmpll() variants for the cases when the compiler knows at compile time that both pointers and the length are aligned to a multiple of int, long, or long long. It would be trivial for e.g. memcpyi(), memcpyl(), and memcpyll() to be weak aliases for memcpy(), so the default cost would be a few symbol aliases in the symbol table!

(I do believe many libraries already use an ELF resolver via the ifunc function attribute, so that they can select the best variants at runtime based on CPUID on x86-64, which traditionally has fluctuated between REP MOVSB being recommended or recommended against, depending on the exact processor.)

As it is, creating optimized (aligned pointers and length a multiple of said alignment) versions can be done, but the compiler will still always use the generic memcpy()/memmove()/memset()/memcmp() ones, so it is just as well to use a completely different name or interface to the optimized/aligned "versions" of those, since only the explicit calls by us human programmers will ever use them anyway.

SiliconWizard · « **Reply #115 on:** August 22, 2022, 05:57:03 pm »

Compilers will inline code for memcpy() and memmove() in a number of cases already and do so relatively cleverly. But for cases where they need to *call* the functions, yes this is often suboptimal.
That reminds me of a different discussion where I considered memory copy to be in general suboptimal for many programming languages and down to the CPUs themselves which often have very limited (or even bad) block copy functionalities. Some people spend generous amounts of time trying to optimize memory copy for their specific cases on specific targets and benchmarks show sometimes wide differences, so that's not a secondary problem IMHO.

That said, back to the memcpy() issues mentioned earlier, I guess this will be obvious to most here, but I've seen quite a few people not knowing the difference between memcpy() and memmove() - and often not even knowing the latter - and introducing nice bugs due to this.

peter-h · « **Reply #116 on:** August 22, 2022, 06:57:44 pm »

Quote

Compilers will inline code for memcpy() and memmove() in a number of cases already and do so relatively cleverly. But for cases where they need to *call* the functions, yes this is often suboptimal.

The former is fine but the latter is a bastard if you don't catch it, and you don't have stdlib in your project.

I have made a note in my doc for the product I am working on to check

Quote

You can see the symbol reference in the object file, though, using either readelf or objdump.

SiliconWizard · « **Reply #117 on:** August 22, 2022, 07:51:39 pm »

Uh?

Maybe we should stick to programming. Just a thought.

brucehoult · « **Reply #118 on:** August 23, 2022, 01:29:50 am »

Quote from: SiliconWizard on August 22, 2022, 07:51:39 pm

Uh?

Maybe we should stick to programming. Just a thought.

I think he meant d11n.

SiliconWizard · « **Reply #119 on:** August 23, 2022, 02:09:52 am »

Quote from: brucehoult on August 23, 2022, 01:29:50 am

Quote from: SiliconWizard on August 22, 2022, 07:51:39 pm
Uh?

Maybe we should stick to programming. Just a thought.

I think he meant d11n.

That was in reply to a post that since disappeared, so maybe it would be best for the thread if we deleted our last 3 posts as well.

peter-h · « **Reply #120 on:** November 04, 2022, 10:00:42 pm »

Back on this topic

Does anyone know under which conditions ARM32 GCC (currently v10) removes (or doesn't remove) unused/unreachable code?

I mean individual functions which are not called by anything, and are not pointed to by a function table.

AFAICT they are all getting removed. This is fine.

But I am working in ST Cube IDE (which is basically a makefile generator, with an editor, and 100x more features than I know how to configure) and in the project I have tons of ST "HAL" libs which, if compiled, would really bloat the project.

There is a different scenario where a precompiled object file (.o) or a library (.a) gets removed if none of the functions within the module is being referenced, and this is much less granular than removing unreachable code during compilation. I think that is because it is being done by the linker, which cannot strip a function out of a .o or .a file.

Nominal Animal · « **Reply #121 on:** November 04, 2022, 10:19:13 pm »

Quote from: peter-h on November 04, 2022, 10:00:42 pm

There is a different scenario where a precompiled object file (.o) or a library (.a) gets removed if none of the functions within the module is being referenced, and this is much less granular than removing unreachable code during compilation. I think that is because it is being done by the linker, which cannot strip a function out of a .o or .a file.

If you compile with -ffunction-sections, then the linker can do it; just tell the linker to garbage-collect unreferenced sections with --gc-sections. It does exactly what you want, at function granularity.

When compiled, each function gets put into a section named .text.function_name. During linking, the linker will examine which function symbols are reachable starting from the ELF entry point (includes both function calls and taking the address of a function; it does this by examining the symbol references for each function). All functions that might get called, get mapped to the common .text section, and the rest of the functions discarded.

See for example this eLinux.org page from 2011. (Also, Teensyduino (Arduino add-on for Teensy microcontrollers from PJRC) uses this by default, so I do believe others use it extensively too.)

For GCC, the options needed are -ffunction-sections (during compiling) and -Wl,--gc-sections (during linking; often -Wl,--gc-sections,--relax is used).
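Under the reachability rule described above, a function that is only ever called through a pointer should still count as referenced, because storing its address creates a symbol reference from whatever holds the pointer. A small C sketch of that situation follows; the handler names, the handlers table and dispatch() are made up for illustration, and whether a given table survives still depends on the table itself being reachable from the entry point.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_fn)(int);

/* Neither handler is called by name anywhere in the program. */
static int handler_double(int x) { return 2 * x; }
static int handler_negate(int x) { return -x; }

/* Taking the addresses here makes the table refer to both functions,
   so they are reachable for as long as the table is. */
static const handler_fn handlers[] = { handler_double, handler_negate };

/* The only direct call path: an indirect call through the table. */
int dispatch(size_t index, int arg)
{
    assert(index < sizeof handlers / sizeof handlers[0]);
    return handlers[index](arg);
}
```

Code that is reached in some way the linker cannot see (for example, a routine jumped to by a computed address) has no such reference and needs one added explicitly.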

DiTBho · « **Reply #122 on:** November 04, 2022, 10:25:47 pm »

ummm, dead code should be detected independently of the compiler.
It is a tool that I have developed for both C and my-C.
Stood and Understand can also catch dead code.

SiliconWizard · « **Reply #123 on:** November 04, 2022, 10:30:10 pm »

Quote from: Nominal Animal on November 04, 2022, 10:19:13 pm

Quote from: peter-h on November 04, 2022, 10:00:42 pm
There is a different scenario where a precompiled object file (.o) or a library (.a) gets removed if none of the functions within the module is being referenced, and this is much less granular than removing unreachable code during compilation. I think that is because it is being done by the linker, which cannot strip a function out of a .o or .a file.
If you compile with -ffunction-sections, then the linker can do it; just tell the linker to garbage-collect unreferenced sections with --gc-sections. It does exactly what you want, at function granularity.

When compiled, each function gets put into a section named .text.function_name. During linking, the linker will examine which function symbols are reachable starting from the ELF entry point (includes both function calls and taking the address of a function; it does this by examining the symbol references for each function). All functions that might get called, get mapped to the common .text section, and the rest of the functions discarded.

See for example this eLinux.org page from 2011. (Also, Teensyduino (Arduino add-on for Teensy microcontrollers from PJRC) uses this by default, so I do believe others use it extensively too.)

For GCC, the options needed are -ffunction-sections (during compiling) and -Wl,--gc-sections (during linking; often -Wl,--gc-sections,--relax is used).

Yep.

I find it unfortunate that we have to put functions in individual sections (which makes the object files bigger, not that it is a huge deal, but yeah) to get this behavior. I can't really find a rationale for not making it the default behavior of the linker.

One question I have (I admit I don't use these options very often actually) is that, what happens if some function is never called directly in the code, but passed as a function pointer somewhere and called indirectly? Is the linker clever enough not to prune this function in this case? I suppose that taking a pointer to it should be enough to determine that said function is not dead code, but just wondering. Too lazy right now to test it.

peter-h · « **Reply #124 on:** November 04, 2022, 11:06:11 pm »

Quote

what happens if some function is never called directly in the code, but passed as a function pointer somewhere and called indirectly? Is the linker clever enough not to prune this function in this case? I suppose that taking a pointer to it should be enough to determine that said function is not dead code, but just wondering. Too lazy right now to test it.

I had this once (e.g. RAM based code which is jumped to) and there were very few ways to make it not disappear. I did it by adding the entry point (function) address to a table of words in the .s (assembler) startup file. That works perfectly.

I am more interested in the compiler removing unused code, and I think this is default behaviour.

