Topic: GCC compiler divmod call generation for Cortex-M0

Offline cv007Topic starter

  • Frequent Contributor
  • **
  • Posts: 858
GCC compiler divmod call generation for Cortex-M0
« on: January 03, 2025, 02:29:32 am »
Is there any explanation for the div calls generated for the following test?

https://godbolt.org/z/zbWzhjG9Y
line 14 - comment/uncomment to see changes

If the base_ variable is global, the compiler generates a single call to __aeabi_uidivmod; if it is local (or static), it generates a call to __aeabi_uidivmod plus another to __aeabi_uidiv. The optimization appears to be the opposite of what it should be, or normally is. What am I missing here?
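
The linked test is not reproduced in the thread. As a rough sketch only, the two cases being compared might look like the following. The names `u32`, `base_global`, `test_global`, and `test_local` are assumptions for illustration, not the code behind the Godbolt link. Both functions take a quotient and a remainder by the same divisor, which is the pattern a single `__aeabi_uidivmod` call can serve on a core without hardware divide.

```cpp
#include <cstdint>

using u32 = std::uint32_t;  // assumed alias; the thread uses u32 without showing its definition

// Case 1 (hypothetical): the divisor lives in a global, so its value
// is not a compile-time constant inside the function.
u32 base_global = 10;

u32 test_global(u32 u) {
    u32 q = u / base_global;  // quotient
    u32 r = u % base_global;  // remainder by the same divisor
    return q + r;
}

// Case 2 (hypothetical): the divisor is a local whose value the
// compiler knows at compile time.
u32 test_local(u32 u) {
    const u32 base_local = 10;
    u32 q = u / base_local;
    u32 r = u % base_local;
    return q + r;
}
```

Both functions return the same value for any input; only the generated code differs. Compiling each for the M0 (for example with `-mcpu=cortex-m0 -O2`) and comparing the assembly shows which library calls are emitted.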
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4736
  • Country: nz
Re: GCC compiler divmod call generation for Cortex-M0
« Reply #1 on: January 03, 2025, 03:31:19 am »
Because it sees that 10 is a constant and distributes it to the two divisions before noticing that actually there is no division instruction on that CPU.

If you change the CPU to M3 then there is a single division instruction generated for the divide by 10, and a multiply-subtract to get the remainder.

Even if you change it to ...

Code:
u32 q = u / base_;      // quotient
u32 r = u - q * base_;  // remainder computed by hand instead of with the % operator
return q + r;

... it just puts the 2nd division back in again.

 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4736
  • Country: nz
Re: GCC compiler divmod call generation for Cortex-M0
« Reply #2 on: January 03, 2025, 03:38:23 am »
Even the latest version of GCC does the same.

Every version of Clang from 9 to head generates some variation on:

Code:
test1(unsigned int):
        push    {r4, r5, r7, lr}
        add     r7, sp, #8
        mov     r4, r0
        movs    r5, #10
        mov     r1, r5
        bl      __aeabi_uidiv
        muls    r5, r0, r5
        subs    r1, r4, r5
        adds    r0, r1, r0
        pop     {r4, r5, r7, pc}
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15736
  • Country: fr
Re: GCC compiler divmod call generation for Cortex-M0
« Reply #3 on: January 03, 2025, 07:13:10 am »
If you declare base_ 'constexpr', you'll also get the double div thing. Fun. :popcorn:
 

Offline mikerj

  • Super Contributor
  • ***
  • Posts: 3372
  • Country: gb
Re: GCC compiler divmod call generation for Cortex-M0
« Reply #4 on: January 03, 2025, 11:33:58 am »
Turn off optimisations and you get the two calls for any combination of global/local/constexpr, so it seems like an optimiser failure.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4736
  • Country: nz
Re: GCC compiler divmod call generation for Cortex-M0
« Reply #5 on: January 03, 2025, 12:09:32 pm »
Quote from: mikerj
Turn off optimisations and you get the two calls for any combination of global/local/constexpr, so it seems like an optimiser failure.

Seems like a cost estimation failure for the C-M0. It's fine on cores with hardware divide.

If no one submits a bug report then it will never change.
 
The following users thanked this post: SiliconWizard

Offline bson

  • Supporter
  • ****
  • Posts: 2493
  • Country: us
Re: GCC compiler divmod call generation for Cortex-M0
« Reply #6 on: January 04, 2025, 08:35:06 pm »
A lot of these optimizations happen in the target machine description, and it can get pretty messy when different target variants have different instruction sets.  I'm not at all surprised there are bugs...
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4736
  • Country: nz
Re: GCC compiler divmod call generation for Cortex-M0
« Reply #7 on: January 04, 2025, 08:42:28 pm »
Quote from: bson
A lot of these optimizations happen in the target machine description, and it can get pretty messy when different target variants have different instruction sets. I'm not at all surprised there are bugs...

It's not a bug, it's just less than optimal code generation.

It is weird though that a "divmod" call exists which presumably returns both results, but you can't just use both of them.
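
For context on the "returns both results" point: the Arm run-time ABI specifies `__aeabi_uidivmod` as returning the quotient and remainder together as a pair (in r0 and r1). The sketch below is a host-side model of that contract in portable C++, not the library routine itself. `UDivModResult`, `uidivmod_model`, and `sum_quot_rem` are hypothetical names introduced for illustration.

```cpp
#include <cstdint>

// Hypothetical model of a combined divide/modulo helper: one call, two results.
struct UDivModResult {
    std::uint32_t quot;  // the real routine returns this in r0 on Arm
    std::uint32_t rem;   // the real routine returns this in r1 on Arm
};

// Assumes denominator != 0; this model does not define divide-by-zero behaviour.
UDivModResult uidivmod_model(std::uint32_t numerator, std::uint32_t denominator) {
    std::uint32_t q = numerator / denominator;
    return {q, numerator - q * denominator};
}

// Uses both halves of a single combined call, which is the code shape
// the thread expects for a constant divisor of 10.
std::uint32_t sum_quot_rem(std::uint32_t u) {
    const UDivModResult res = uidivmod_model(u, 10);
    return res.quot + res.rem;
}
```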
 

Offline cv007Topic starter

  • Frequent Contributor
  • **
  • Posts: 858
Re: GCC compiler divmod call generation for Cortex-M0
« Reply #8 on: January 04, 2025, 09:16:49 pm »
With -O1 plus -fexpensive-optimizations, the single-call optimization appears to be enabled:
https://godbolt.org/z/n8eT5Meb5
but still only with the base as a global. Any value known to the compiler at compile time results in the double call.

My code that uses divmod is in a C++ class where the base is not a constant, so I do end up with a single divmod call for the m0. During various testing and code creation I will sometimes see the double call and get confused as to why that is happening.

The AVR target will always produce a single divmod call (with optimizations on), and in that case it looks like the optimization comes from the machine description file in GCC. Maybe, since the M0 is something of a loner in the Cortex-M series, GCC is treating the cost of divmod vs. div the same way for all Cortex-M cores, even though the M0 has to make a library call. Reloading a constant vs. loading from memory is, I suppose, also part of this cost calculation.
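
A minimal sketch of the pattern described above: a class whose base is a run-time member, so the divisor is not a literal constant where the division happens. `DigitWriter` and its members are hypothetical names, not the poster's class, and whether GCC emits one `__aeabi_uidivmod` call for the loop body still depends on the compiler version and flags.

```cpp
#include <cstdint>
#include <string>

// Hypothetical number formatter with a run-time base (2..16 assumed).
class DigitWriter {
public:
    explicit DigitWriter(std::uint32_t base) : base_(base) {}

    std::string format(std::uint32_t value) const {
        static const char digits[] = "0123456789abcdef";
        std::string out;
        do {
            std::uint32_t q = value / base_;  // quotient ...
            std::uint32_t r = value % base_;  // ... and remainder by the same divisor
            out.insert(out.begin(), digits[r]);  // prepend the next digit
            value = q;
        } while (value != 0);
        return out;
    }

private:
    std::uint32_t base_;  // run-time value rather than a literal constant
};
```

`std::string` is used only to keep the sketch short; firmware would more likely write into a fixed buffer.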
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4344
  • Country: us
Re: GCC compiler divmod call generation for Cortex-M0
« Reply #9 on: January 04, 2025, 10:11:48 pm »
Quote
it's just less than optimal code generation.
gcc support for ARMv6m (Cortex M0, M0+) isn't great.

Another example is that there is no "optimized" floating point code (there IS for ARMv7m), so any use of floats on CM0 brings in the rather bloated and slow C version from gcc.
 

Online coppice

  • Super Contributor
  • ***
  • Posts: 9975
  • Country: gb
Re: GCC compiler divmod call generation for Cortex-M0
« Reply #10 on: January 04, 2025, 10:17:38 pm »
Quote
it's just less than optimal code generation.
gcc support for ARMv6m (Cortex M0, M0+) isn't great.

Another example is that there is no "optimized" floating point code (there IS for ARMv7m), so any use of floats on CM0 brings in the rather bloated and slow C version from gcc.
Very few people implementing GCC for a machine without an FPU have bothered to do much about that slow C code implementation of floating point. I think it was only intended as a "get you started" version of floating point, so I don't think they even tried to make it as good as you can get with C. I started to do work on it at one point, but got distracted.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15736
  • Country: fr
Re: GCC compiler divmod call generation for Cortex-M0
« Reply #11 on: January 04, 2025, 10:21:29 pm »
For those who want faster FP on the M0/M0+, definitely look at the RPi implementation for the RP2040. I don't know what license it is under, so I can't say whether you're allowed to use it on other targets. Something to look at.

If I'm not mistaken, it's based on this implementation, which for sure you can freely (GPL 2) use: https://www.quinapalus.com/qfplib-m0-full.html
 
The following users thanked this post: mikerj

