Author Topic: C Compiler Divide and Remainder Optimisation (Read 2560 times)

Zero999 · « **on:** October 30, 2021, 07:06:02 pm »

I've finally decided to learn a bit of C, which got me wondering about optimization. I noticed there are integer division and remainder operators, yet CPUs which have a hardware divide produce both the quotient and remainder with one instruction.

If I compiled this code for a processor with a divide instruction, will most modern C compilers have the sense to use a single divide instruction, rather than two?

int a, b, c, d;

c = a / b;
d = a % b;

magic · « **Reply #1 on:** October 30, 2021, 07:38:34 pm »

Sure thing.

Code:

Disassembly of section .text:

0000000000000000 <div2>:
   0:   89 f8                   mov    eax,edi
   2:   49 89 d0                mov    r8,rdx
   5:   99                      cdq    
   6:   f7 fe                   idiv   esi
   8:   41 89 00                mov    DWORD PTR [r8],eax
   b:   89 11                   mov    DWORD PTR [rcx],edx
   d:   c3                      ret

You may need to enable optimization (-O with gcc).
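
The source behind the listing above was not posted. A minimal sketch that would be consistent with its register usage (dividend in edi, divisor in esi, two output pointers in rdx and rcx) might look like the following; the signature and parameter names are assumptions.

```c
/* Hypothetical source for the div2 listing above. The signature and the
 * parameter names are assumptions inferred from the register usage. */
void div2(int a, int b, int *quot, int *rem)
{
    *quot = a / b;  /* quotient: left in eax by idiv */
    *rem  = a % b;  /* remainder: left in edx by idiv */
}
```

Compiling this with something like gcc -O2 -c div2.c and inspecting it with objdump -d -M intel div2.o (the file name is just an example) should show whether your toolchain emits a single idiv.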

T3sl4co1l · « **Reply #2 on:** October 30, 2021, 08:00:02 pm »

Yup, make sure the arguments are the same and it will probably be able to merge the divs.

If in doubt, read the output listing! Or paste it into godbolt.org, etc. Pays to know the instruction set! (Which I guess if you're coming from a hardware perspective, that's a given?)

Tim

ataradov · « **Reply #3 on:** October 30, 2021, 09:02:01 pm »

And if you are worried about stuff like this, then you can use godbolt.org to explore how different compilers on different architectures with different options behave.

This example - https://godbolt.org/z/qWeT6afqv

tggzzz · « **Reply #4 on:** October 30, 2021, 10:03:16 pm »

Quote from: Zero999 on October 30, 2021, 07:06:02 pm

I've finally decided to learn a bit of C, which got me wondering about optimization. I noticed there are integer division and remainder operators, yet CPUs which have a hardware divide produce both the quotient and remainder with one instruction.

If I compiled this code for a processor with a divide instruction, will most modern C compilers have the sense to use a single divide instruction, rather than two?

int a, b, c, d;

c = a / b;
d = a % b;

The answer depends on the CPU architecture, the compiler, the compiler flags, the compiler version, the surrounding code, and probably the phase of the moon.

That is one reason that embedded projects often nail down the compiler for all time, in the form of a virtual machine.

Curiously this topic is currently being debated on comp.arch. All of the participants have long industry experience, some are active here, and there is still disagreement.

IMNSHO, that is significant.

Nominal Animal · « **Reply #5 on:** October 30, 2021, 10:23:08 pm »

Quote from: Zero999 on October 30, 2021, 07:06:02 pm

If I compiled this code for a processor with a divide instruction, will most modern C compilers have the sense to use a single divide instruction, rather than two?

int a, b, c, d;

c = a / b;
d = a % b;

The ones I use, with optimization enabled, do.

If you wanted to be sure, you could always use

int a, b, c, d;
div_t result;
result = div(a, b);
c = result.quot;
d = result.rem;

The div() family of functions has existed in the standard C library since C89.
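
As a small self-contained illustration of the div() call shown above, here is a sketch of a helper that splits a number of seconds into minutes and seconds. The helper name and its use case are made up for this example; only div() and div_t come from the standard library.

```c
#include <stdlib.h>

/* Illustrative helper (not from the thread): splits a count of seconds
 * into whole minutes and leftover seconds with a single div() call. */
void split_seconds(int total, int *minutes, int *seconds)
{
    div_t r = div(total, 60);
    *minutes = r.quot;
    *seconds = r.rem;
}
```

Whether this ends up as one divide instruction still depends on how the C library was built, since div() is an ordinary library function.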

tggzzz · « **Reply #6 on:** October 30, 2021, 11:12:48 pm »

Quote from: Nominal Animal on October 30, 2021, 10:23:08 pm

Quote from: Zero999 on October 30, 2021, 07:06:02 pm
If I compiled this code for a processor with a divide instruction, will most modern C compilers have the sense to use a single divide instruction, rather than two?

int a, b, c, d;

c = a / b;
d = a % b;
The ones I use, with optimization enabled, do.

If you wanted to be sure, you could always use

int a, b, c, d;
div_t result;
result = div(a, b);
c = result.quot;
d = result.rem;

The div() family of functions has existed in the standard C library since C89.

One issue to consider is that if high levels of optimisation are selected, and there is undefined behaviour (in the C sense), then the compiler may legitimately optimise it to a no-op, or to cause nasal daemons, or to delete your filesystem. There's a lot more UB in all of the various C specifications than most people realise.
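
For the division operators specifically, two cases are undefined for int operands: a zero divisor, and INT_MIN divided by -1 (the quotient does not fit in an int). A minimal sketch of a guard, with a hypothetical function name, could look like this:

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: computes a / b and a % b only when both are defined.
 * Returns false and leaves the outputs untouched in the undefined cases. */
bool checked_divmod(int a, int b, int *quot, int *rem)
{
    if (b == 0)
        return false;               /* division by zero is undefined */
    if (a == INT_MIN && b == -1)
        return false;               /* quotient would overflow int */
    *quot = a / b;
    *rem  = a % b;
    return true;
}
```

The checks cost a couple of compares, so this is the kind of thing one would use at trust boundaries rather than around every division.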

SiliconWizard · « **Reply #7 on:** October 31, 2021, 12:23:50 am »

And with that said, if you're worried about such minute details when starting with a new (to you) language, you're probably not doing yourself a favor.
I understand that someone with a hardware background will think about this kind of low-level details, but from a learning, and even programming POV, this usually doesn't help.

brucehoult · « **Reply #8 on:** October 31, 2021, 01:13:30 am »

Quote from: Zero999 on October 30, 2021, 07:06:02 pm

I've finally decided to learn a bit of C, which got me wondering about optimization. I noticed there are integer division and remainder operators, yet CPUs which have a hardware divide produce both the quotient and remainder with one instruction.

Some do. Many don't. RISC ISAs will typically not want to write two result registers with one instruction because that requires extra hardware that will usually go unused for millions of instructions at a time.

The RISC-V manual says that if you want both, do the DIV first, with the result of the DIV (naturally) not going to either of the source operands. Certain CPUs *might* then optimise the REM to simply copy the result from a hidden register.

Note also that if you have a fast multiplier -- and especially if you have an integer MULADD/MULSUB instruction -- then using that to calculate the remainder will be much much faster than dividing a second time.

Code:

div quo,dividend,divisor
mul rem,quo,divisor
sub rem,dividend,rem

In fact Clang/LLVM does exactly this at all optimisation levels except -O0, not only for RISC-V but also for 32 bit and 64 bit ARM and POWER. GCC doesn't for RISC-V (it does DIV and REM in the suggested order) but does for ARM and POWER.

https://godbolt.org/z/9oPbn7f4q
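
The div/mul/sub sequence above corresponds to the following C, shown here as a sketch with an illustrative function name. It assumes the division itself is defined (b is not zero, and not INT_MIN divided by -1).

```c
/* Quotient from one division, remainder from a multiply and a subtract.
 * The function name is illustrative. Assumes b != 0 and that the
 * INT_MIN / -1 case does not occur. */
void divmod_mulsub(int a, int b, int *quot, int *rem)
{
    int q = a / b;
    *quot = q;
    *rem  = a - q * b;   /* same value as a % b whenever a / b is defined */
}
```

The product q * b cannot overflow here, because its magnitude never exceeds that of a.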

TheCalligrapher · « **Reply #9 on:** October 31, 2021, 01:44:32 am »

Quote from: Zero999 on October 30, 2021, 07:06:02 pm

If I compiled this code for a processor with a divide instruction, will most modern C compilers have the sense to use a single divide instruction, rather than two?

int a, b, c, d;

c = a / b;
d = a % b;

Well, if the underlying hardware platform offers a single signed division instruction that covers both results, then of course a credible compiler will use it or at least consider using it.

Although in some cases, some non-obvious optimization considerations (e.g. pipelining) might come into play and make the compiler choose multiple separate instructions instead of one (even if one is available), because multiple instructions are actually more efficient, however counterintuitive that might seem at first sight. You asked your question under the assumption that using a single instruction is an "optimization". That is not necessarily true.

Another issue to keep in mind is the rounding rules. Modern C and C++ (since C99 and C++11) require signed integer division to round towards zero (and the remainder to agree with that rounding model). Whether your underlying hardware platform offers a division instruction with such rounding behavior is another question. If it doesn't, the compiler will have to issue extra instructions to post-correct the result.
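
To make the rounding rule concrete: with truncation towards zero, -7 / 2 is -3 and -7 % 2 is -1, so (a / b) * b + a % b always gives back a. The sketch below shows the sort of fix-up needed when the wanted rounding differs from what the division delivers. It goes from truncating to flooring division, which is the reverse direction of the case described above, and the function name is illustrative.

```c
/* Floored division built on top of C's truncating operators.
 * Illustrative name. Assumes b != 0 and that the INT_MIN / -1 case
 * does not occur. */
int floor_div(int a, int b)
{
    int q = a / b;               /* truncates towards zero (C99 and later) */
    int r = a % b;               /* zero, or has the same sign as a */
    if (r != 0 && ((r < 0) != (b < 0)))
        q -= 1;                  /* truncation rounded up, so step down */
    return q;
}
```

A compiler targeting hardware with a different rounding mode would have to emit a comparable correction in the other direction.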

SiliconWizard · « **Reply #10 on:** October 31, 2021, 02:08:42 am »

Quote from: brucehoult on October 31, 2021, 01:13:30 am

The RISC-V manual says that if you want both, do the DIV first,

If there's a succession of a DIV and REM with the same operands, then the underlying CPU may implement this in many different ways. The spec suggests this order because they had a specific instruction-fusing way of implementing it in mind, but it's just a suggestion.

Quote from: brucehoult on October 31, 2021, 01:13:30 am

with the result of the DIV (naturally) not going to either of the source operands.

Not sure I quite got what you meant here?

Quote from: brucehoult on October 31, 2021, 01:13:30 am

Certain CPUs *might* then optimise the REM to simply copy the result from a hidden register.

That's the sort of thing I did in my RISC-V core. The division internally computes both the quotient and remainder, and stores the results in internal registers. Then depending on whether the instruction is a DIV or REM, it returns the quotient or remainder. The order doesn't matter as it checks operands and if the operation (signed/unsigned) and operands are the same, it will just return the appropriate computed value. The instructions don't even have to be in immediate succession. Of course there are myriads of ways to implement this, some for which order will matter, some for which it won't.
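
As a rough software model of that idea (not the actual core, whose implementation was not posted), the sketch below keeps the operands and both results of the last division and reuses them when the same operands come in again. All names are hypothetical, and it only models the signed case.

```c
#include <stdbool.h>

/* Toy software model of a divider that remembers its last operation.
 * All names are hypothetical; this is not the design described above. */
struct div_unit {
    bool valid;          /* true once the cache holds a result */
    int  a, b;           /* operands of the cached operation */
    int  quot, rem;      /* both results of the cached operation */
    unsigned divides;    /* number of real divisions performed */
};

static void div_unit_compute(struct div_unit *u, int a, int b)
{
    if (u->valid && u->a == a && u->b == b)
        return;                      /* hit: reuse the cached results */
    u->a = a;
    u->b = b;
    u->quot = a / b;                 /* one modelled hardware division */
    u->rem  = a - u->quot * b;       /* remainder falls out of the same step */
    u->valid = true;
    u->divides++;
}

int div_unit_div(struct div_unit *u, int a, int b)
{
    div_unit_compute(u, a, b);
    return u->quot;
}

int div_unit_rem(struct div_unit *u, int a, int b)
{
    div_unit_compute(u, a, b);
    return u->rem;
}
```

Because the cache is keyed on the operand values, the order of DIV and REM does not matter in this model, and the two requests need not be adjacent.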

ejeffrey · « **Reply #11 on:** October 31, 2021, 02:53:32 am »

Quote from: SiliconWizard on October 31, 2021, 02:08:42 am

Quote from: brucehoult on October 31, 2021, 01:13:30 am
with the result of the DIV (naturally) not going to either of the source operands.

Not sure I quite got what you meant here?

If you do a div where the result overwrites an input argument then a subsequent rem can't operate on the same arguments so it can't use the cached value.

magic · « **Reply #12 on:** October 31, 2021, 06:55:28 am »

Quote from: Nominal Animal on October 30, 2021, 10:23:08 pm

If you wanted to be sure, you could always use

int a, b, c, d;
div_t result;
result = div(a, b);
c = result.quot;
d = result.rem;

What a wonderful way to shoot your foot

Code:

Disassembly of section .text:

0000000000000000 <div2>:
   0:   55                      push   rbp
   1:   48 89 d5                mov    rbp,rdx
   4:   53                      push   rbx
   5:   48 89 cb                mov    rbx,rcx
   8:   48 83 ec 08             sub    rsp,0x8
   c:   e8 00 00 00 00          call   11 <div2+0x11>
  11:   89 45 00                mov    DWORD PTR [rbp+0x0],eax
  14:   48 c1 f8 20             sar    rax,0x20
  18:   89 03                   mov    DWORD PTR [rbx],eax
  1a:   48 83 c4 08             add    rsp,0x8
  1e:   5b                      pop    rbx
  1f:   5d                      pop    rbp
  20:   c3                      ret

This was compiled with -O2, thank you very much.

magic · « **Reply #13 on:** October 31, 2021, 07:03:31 am »

Quote from: SiliconWizard on October 31, 2021, 12:23:50 am

And with that said, if you're worried about such minute details when starting with a new (to you) language, you're probably not doing yourself a favor.
I understand that someone with a hardware background will think about this kind of low-level details, but from a learning, and even programming POV, this usually doesn't help.

I suppose it helps to know that trivial optimizations are handled and not feel tempted to rewrite every other function in assembly

Nominal Animal · « **Reply #14 on:** October 31, 2021, 07:16:30 am »

Quote from: magic on October 31, 2021, 06:55:28 am

Quote from: Nominal Animal on October 30, 2021, 10:23:08 pm
If you wanted to be sure, you could always use

int a, b, c, d;
div_t result;
result = div(a, b);
c = result.quot;
d = result.rem;
What a wonderful way to shoot your foot

Hey, I never in any way indicated it was efficient, I only said it was the way to be sure it does only one divide instruction.

If you look at the disassembly of the called function (it's in libc.a:div.o), you'll see (edited to Intel syntax)

0000000000000000 <div>:
   0:   89 f8                   mov    eax,edi
   2:   99                      cdq    
   3:   f7 fe                   idiv   esi
   5:   48 c1 e2 20             shl    rdx,0x20
   9:   89 c0                   mov    eax,eax
   b:   48 09 d0                or     rax,rdx
   e:   c3                      ret

See?

magic · « **Reply #15 on:** October 31, 2021, 07:52:49 am »

I don't think it's in any way guaranteed.

This is the GNU implementation

Code:

#include <stdlib.h>

/* Return the `div_t' representation of NUMER over DENOM.  */
div_t
div (int numer, int denom)
{
  div_t result;

  result.quot = numer / denom;
  result.rem = numer % denom;

  return result;
}

In this case it all depends on whether your glibc is compiled with optimization or not.

Honestly, such a thing could potentially be useful on architectures where division is implemented fully in software. But apparently glibc just doesn't give a damn.

Nominal Animal · « **Reply #16 on:** October 31, 2021, 11:17:38 am »

Quote from: magic on October 31, 2021, 07:52:49 am

Honestly, such a thing could potentially be useful on architectures where division is implemented fully in software. But apparently glibc just doesn't give a damn.

Or they trust the compiler handles this. Just like Zero999 should.

Besides, in Linux at least, if you really really care, you can just reimplement the function in your own code. The standard C library symbols are weak, and you can override them by writing your own function with the same signature.

If it matters any, glibc, newlib, musl libc, and nanolibc all use similar code. uclibc and uclibc-ng use

Code:

div_t div(int numer, int denom)
{
    div_t result;
    result.quot = numer / denom;
    result.rem  = numer - (result.quot * denom);
    return(result);
}

but use / and % for ldiv() and lldiv().

Personally, I prefer the form that uses C99 and later compound literals,

Code:

div_t div(int num, int den)
{
	return (div_t){ .quot = num / den, .rem = num % den };
}

but instead of using div() explicitly, I let the compiler generate the best code it can.

The reason is that library function calls cannot really be optimized (and I don't want the compiler to examine the binaries and replace calls to library functions with code it thinks ought to be equivalent).

Indeed, GCC explicitly provides built-in standard C library functions for this reason: because the definition is dictated by the standard, the compiler can provide the implementation instead of generating a call to a library function. This lets GCC turn for example printf("Hello, world!\n"); into puts("Hello, world!"); or fputs("Hello, world!\n", stdout);.
It just happens that div()/ldiv()/lldiv() are not among those, possibly because they are basically never used in real-world code, so nobody has bothered.

Older versions of GCC (up to 6.3.1, I believe) do have a slight bug (inefficiency) on ARM Cortex-M0/M0+, in that they generate calls to both __aeabi_idiv and __aeabi_idivmod although the latter alone would suffice; see GCC bug #43721 for details. Later versions of GCC detect when both the quotient and remainder are needed, and use __aeabi_idivmod (or the equivalent for the target arch) to calculate both at the same time, if possible.

magic · « **Reply #17 on:** October 31, 2021, 11:33:45 am »

Quote from: Nominal Animal on October 31, 2021, 11:17:38 am

Or they trust the compiler handles this.

Maybe this.
I tested avr-gcc (no division in hardware) and the compiler handles such cases with a single call to libgcc function __divmodhi4 as long as -O is enabled.

SiliconWizard · « **Reply #18 on:** October 31, 2021, 05:20:21 pm »

Quote from: magic on October 31, 2021, 07:03:31 am

Quote from: SiliconWizard on October 31, 2021, 12:23:50 am
And with that said, if you're worried about such minute details when starting with a new (to you) language, you're probably not doing yourself a favor.
I understand that someone with a hardware background will think about this kind of low-level details, but from a learning, and even programming POV, this usually doesn't help.
I suppose it helps to know that trivial optimizations are handled and not feel tempted to rewrite every other function in assembly

Which, again, boils down to what I just said above.
If you are tempted to rewrite even basic operators in assembly just because, you'd better have a VERY good reason for doing it. You'd better be VERY timing-constrained. Otherwise you're just wasting your time while making your code non-portable.

But it's a bit like a case of "if you ask this question, make sure you really need to know the answer". Otherwise this is just "knowledge pollution" that you do not need when you're just starting learning something new.

And with that said, if you're using any compiled language for embedded dev, it becomes a good skill to know how to generate assembly code from your source code (for this, you can for instance pass the -S option to GCC) and get to know how to interpret it. Just don't over-abuse it. Again do it only if you really need to do it. Otherwise you'll just be wasting your time.

Nominal Animal · « **Reply #19 on:** October 31, 2021, 05:39:37 pm »

Quote from: SiliconWizard on October 31, 2021, 05:20:21 pm

If you are tempted to rewrite even basic operators in assembly just because, you'd better have a VERY good reason for doing it.

Fully agreed.

It is a very common affliction to worry about such small details and miss the big picture, because the big picture (algorithms, data structures, and analysis of the chosen approach in general) is 'hard', and diving into small details is 'easy' and 'comforting'. I'd say it is natural for C programmers to feel the need to 'optimize' their code at the single operation or loop level, before they discover how minuscule a difference that makes compared to changes at the algorithmic level. Natural, yes, but not beneficial: I believe it is something one needs to grow out of. Personally, I've seen this again and again (I'd like to say 'in most C programmers I've helped to learn', but I'm not sure of the statistics, it just feels that way), and I was definitely that way myself too, once.

magic · « **Reply #20 on:** October 31, 2021, 07:14:53 pm »

I gave OP the benefit of the doubt and assumed that he is competent in assembly and wants to know how much of the work can be automated away with C. That's not exactly the same as somebody who starts programming altogether and picks C as the first language.

Maybe I was naive. But I went down a similar path years ago: Visual Basic, assembly, then C. I have always seen C as a glorified macro assembler with some confusing syntax here and there and I tend to have a clear vision of what sort of assembly I want to get from the C code I write.

Nominal Animal · « **Reply #21 on:** October 31, 2021, 07:38:52 pm »

Quote from: magic on October 31, 2021, 07:14:53 pm

That's not exactly the same as somebody who starts programming altogether and picks C as the first language.

That's not what I meant.

The output from those who concentrate on small details instead of algorithms and data structures leaves a lot of room for improvement in my experience, regardless of their professed level of expertise.

(Note that my experience is mostly on the systems programming side, and not freestanding C in embedded and very resource-constrained systems, which are quite different, and tend to have specific hot spots where exact assembly may be required, like high-frequency interrupt routines. In any case, C is a programming language, and treating it as a macro assembler will lead to increasing headaches as the compiler optimization strategies evolve.)

I don't care much exactly what the generated code from my C compilers is, as long as it is not horrible. I can do and have done assembly on several architectures, and often do examine the machine code generated by specific hot/important pieces of code, but I do not compare it to my expectations: I only analyse it for weaknesses.
Nowadays, when I do need an exact assembly implementation, I often reach for GCC/ICC/Clang extended inline assembly (in static inline function wrappers), which integrates better than external assembly functions, as the compiler can choose which registers to use in each inlined instance (if inlined) and so on. For SIMD, I like the GCC generic __attribute__((vector_size (N * sizeof (basetype)))) support, since it works across SSE/AVX/NEON for basic operations.
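
As a small sketch of the vector_size attribute mentioned above, the following defines a four-float vector type and an element-wise multiply-add. It relies on the GCC/Clang vector extension, so it is not portable standard C, and the type and function names are illustrative.

```c
/* GCC/Clang vector extension: four floats handled as a single value.
 * The type and function names are illustrative. */
typedef float vec4f __attribute__((vector_size (4 * sizeof (float))));

vec4f vec4f_madd(vec4f a, vec4f b, vec4f c)
{
    return a * b + c;    /* element-wise; the compiler selects the instructions */
}
```

On targets without suitable SIMD instructions the compiler is expected to fall back to scalar code, so the same source should still build.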

Zero999 · « **Reply #22 on:** October 31, 2021, 09:40:34 pm »

Quote from: magic on October 31, 2021, 07:14:53 pm

I gave OP the benefit of the doubt and assumed that he is competent in assembly and wants to know how much of the work can be automated away with C. That's not exactly the same as somebody who starts programming altogether and picks C as the first language.

Maybe I was naive. But I went down a similar path years ago: Visual Basic, assembly, then C. I have always seen C as a glorified macro assembler with some confusing syntax here and there and I tend to have a clear vision of what sort of assembly I want to get from the C code I write.

Yes, that's why I asked the question. I'm proficient in assembler and good old QBasic. I've recently decided to give C a go and wondered whether it performs obvious optimisations, so I don't have to bother with assembly. A complete newcomer, with no knowledge of hardware and assembler would not ask such a question.

I didn't expect this thread to get so long, but I'm not surprised knowing this place.

Nominal Animal · « **Reply #23 on:** October 31, 2021, 10:44:32 pm »

Quote from: Zero999 on October 31, 2021, 09:40:34 pm

I've recently decided to give C a go and wondered whether it performs obvious optimisations, so I don't have to bother with assembly.

The optimizations the compiler can and knows how to perform are a rather funky set, compared to what seems obvious to us as human programmers.

A lot of this is due to how the C standard defines itself via an abstract machine, using sequence points and side effects. (I recommend glancing through n1256.pdf, the final revision of the C99 standard that is available on the net, and then building on top of that with n1570.pdf (the same for C11) and n2176.pdf (the same for C17). While the standards describe the common functionality, compilers can and occasionally do diverge, often not intentionally but because of language-lawyer disputes over what a specific paragraph in the standard actually means, and especially when compile-time options that request such divergence (say, the "unsafe-math" optimizations) are used. So do let reality dictate your choices, instead of treating the standard as a book of Law that is not to be crossed. Reality beats theory every time.)

For example, it may come as a surprise, or not, that the compiler is not allowed to reorder the members in a structure, or that a simple cast limits a numeric expression to the range and precision of that type but does not force the compiler to use a temporary variable, or that a union is a practical way to type-pun (reinterpret the same memory pattern as a different type), and so on. Many C programmers do not know that there is a freestanding environment where the standard C library is not available (only some of the header files are), and the typical environment is the hosted environment; and that many embedded environments actually support a funky mix of freestanding C and a subset of freestanding C++ (omitting the standard C++ library, exceptions, and perhaps other stuff depending on the target).
Freestanding C is less implementation-defined than freestanding C++, and many embedded targets have partial standard C library support via avr-libc, newlib, or nanolibc.
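
The union type-punning mentioned above can be sketched as follows. The function name is illustrative, and the example assumes that float is a 32-bit IEEE 754 type, which holds on common targets but is not guaranteed by the standard.

```c
#include <stdint.h>

/* Reinterprets the bits of a float as a 32-bit unsigned integer.
 * Assumes float is 32 bits wide (common, but not guaranteed). */
uint32_t float_bits(float f)
{
    union { float f; uint32_t u; } pun;
    pun.f = f;
    return pun.u;        /* reading the other member reinterprets the bytes */
}
```

This is valid in C (C99 and later); C++ has different rules for reading an inactive union member.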

Using C as a tool to produce specific assembly is becoming harder and harder, because the compilers, especially gcc and clang, are implementing more and more optimizations at the abstract syntax tree level: detecting uses and patterns, rather than idioms. It is important to realize that the way the C standard is specified makes some "obvious" optimizations impossible, simply because the C standard requires the abstract machine to perform certain steps in a certain order, and the obvious optimization changes that.

If you intend to use C (or the mix of freestanding C and freestanding C++) in embedded and resource-constrained targets as opposed to in hosted environments (like in Linux, Windows, Macs, BSDs in general, where the standard libraries are available), then I'd recommend looking at GCC/ICC/Clang extended assembly, because that can be a powerful tool in defining macro-like functions; a key difference being that you can let the compiler choose the registers to use if they don't matter, via machine constraints. The compiler can optimize both the code around such assembly and, if you want, the registers used in the assembly itself; and if inlined, it does so separately at each inline site. As I said before, typically you'll only want to examine the generated code for pathologically bad patterns, and not try to finesse each C expression to produce the exact code you want: that way lies only frustration and hair loss.
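
To tie this back to the original question, here is a sketch of extended assembly that gets the quotient and remainder from one x86 idiv. The function name is made up, the asm path is only used with GCC-compatible compilers on x86, and other targets fall back to the plain operators. The "a" and "d" constraints are dictated by the instruction, while the "r" constraint leaves the divisor's register to the compiler.

```c
/* Quotient and remainder from a single x86 idiv via extended asm.
 * Illustrative only: other targets use the plain C operators instead.
 * Assumes b != 0 and that the INT_MIN / -1 case does not occur. */
static inline void divmod_asm(int a, int b, int *quot, int *rem)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    int q, r;
    __asm__ ("cltd\n\t"                      /* sign-extend eax into edx:eax */
             "idivl %[divisor]"              /* eax = quotient, edx = remainder */
             : "=a" (q), "=&d" (r)           /* registers fixed by idiv */
             : "a" (a), [divisor] "r" (b)    /* compiler picks b's register */
             : "cc");
    *quot = q;
    *rem  = r;
#else
    *quot = a / b;
    *rem  = a % b;
#endif
}
```

In practice the compiler already does this for a / b and a % b with optimization enabled, so a wrapper like this is mainly useful as a template for instructions the compiler does not generate on its own.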

If you intend to use C for systems programming, take a serious look at the POSIX.1-2017 interfaces that are provided by all POSIXy systems including Linux, Mac OS, and *BSDs. In particular, I often rant that when an example or tutorial uses fgets() (and not getline()), one ought to ignore that example or exercise; and the same applies to anything using opendir(), readdir(), and closedir() to traverse directory trees instead of using nftw(), scandir(), glob(), or the fts family of functions (originating in BSDs, but included in the standard C libraries of Linux and other POSIXy systems). These are the facilities, provided by the standard libraries every binary is linked against, that one is expected to use in real life, and not the antiquated poor C89 ones. You'll also find localization, character set conversion (iconv), and even atfile support (meaning you can give a directory descriptor as a parameter to use as the base directory if the pathname parameter is a relative path), and lots more. I also recommend looking at the Linux man-pages project (sections 2 and 3, specifically), since it is currently the best maintained man page repository for functionality included in standard C libraries, and each page contains a Conforming to section, which describes where that functionality is defined, and therefore where one can expect it to be available. (On all Linux distributions, the man pages available include this set, plus any man pages provided by installed packages; so this is NOT the "complete set of man pages" in Linux, just the core set.)
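
As an illustration of the getline() recommendation, the sketch below reads lines of any length from a stream without a fixed-size buffer. The function name is made up for the example, and it assumes a POSIX.1-2008 (or later) C library.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Returns the length of the longest line in the stream (newline excluded),
 * or -1 on a read error. The function name is illustrative. */
long longest_line(FILE *in)
{
    char   *line = NULL;   /* getline() allocates and grows this buffer */
    size_t  size = 0;
    ssize_t len;
    long    longest = 0;

    while ((len = getline(&line, &size, in)) != -1) {
        if (len > 0 && line[len - 1] == '\n')
            len--;                       /* do not count the newline */
        if (len > longest)
            longest = len;
    }
    free(line);
    return ferror(in) ? -1 : longest;
}
```

With fgets() the same task needs extra logic to detect and stitch together lines longer than the buffer, which is where many examples go wrong.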

