C does some implicit casting that may be what you're getting bitten by.
If you do a multiplication of two variables, the hardware multiply produces a result twice the width (e.g. 16×16 bit = 32 bit, 32×32 bit = 64 bit), though in C the result type follows from the promoted operand types. If both operands are literals, the compiler folds the arithmetic for you at compile time.
https://godbolt.org/ has AVR GCC so you can try to view the assembler live.
Can't run it though.
Or are you compiling with `-mint8`?
https://gcc.gnu.org/wiki/avr-gcc
Try compiling with -Wconversion -Wsign-conversion to show where conversions are happening.
int warning = 255u * 255u;
// implicit cast is happening
But I've never seen it cut a constant down to a too-small data type; it was usually a matter of the calculation working with a too-small output type.
I think the problem was that I have to cast every constant to make sure the kind of math I expect to happen actually happens.
If you're doing this, and you want an unsigned subtract to happen, then you must make it 5u, yes.
unsigned int var;
int ans = var - 5;
Typically what I do is I don't put constants in expressions.
uint16_t a;
uint16_t b;
uint32_t c = a * b;
Doesn't leave much room for imagination.
So going around marking everything as ul may not give your intended result, due to which (slow) multiply the compiler picks.
TI has a short note describing it, for your archives: https://www.ti.com/lit/an/spra683/spra683.pdf
Typically what I do is I don't put constants in expressions.
uint16_t a;
uint16_t b;
uint32_t c = a * b;
Doesn't leave much room for imagination.
You say that but the whole point of this conversation is that this code is defective and will fail on 8 and 16 bit platforms.
Note: this has nothing to do with optimization. Optimization doesn’t change semantics of the code.(1) At most it exposes errors in already invalid code, that might’ve been missed in other circumstances.
I see, so basically I'd have to put "u" on the end of every constant anyway to make sure I don't end up with half my range. presumably at least one constant needs to be "ul" to promote the calculation to 32 bit.
That depends on the intention. A multiplication of two unsigned int values will be done entirely within unsigned int. For most of the inputs it will overflow. An unsigned overflow is well defined in both C and C++, but it may not be what you want.
Typically what I do is I don't put constants in expressions.
uint16_t a;
uint16_t b;
uint32_t c = a * b;
Doesn't leave much room for imagination.
So going around marking everything as ul may not give your intended result, due to which (slow) multiply the compiler picks.
Except that this code shows exactly what has been explained to be wrong: both in this thread and the document you have yourself linked. uint32_t in the third line bears no significance for the multiplication. That multiplication is determined by the types of a and b, which are uint16_t in this case. Depending on the platform that may be signed multiplication and, again depending on the platform, that may overflow.
If you ever used this code and it was giving the results you expected, it's because of lucky circumstances. Typically this is a combination of working with a 2's complement representation of integers and compilers optimizing undefined cases. Since a compiler is allowed to do so, it substitutes that with code acting as if the operation was valid (which is more optimal). By the magic of 2's complement, the intermediate result, despite being actually invalid, has a binary representation matching what it would be if the whole operation were unsigned. And since the code downstream re-interprets it as unsigned, it happens to be what you expected.
But this is an accident. This is not what this expression indicates in terms of language’s semantics.
____
(1) The exception being bugs in the compiler itself, but those are very rare compared to mistakes of whoever uses the compiler.
I might be getting the new twist in this thread wrong....
Take this code on a 32-bit platform:
#include <stdio.h>
#include <stdint.h>
uint16_t my_func(uint8_t a, uint8_t b)
{
return a * b;
}
Called with "my_func(20,20)" I can't work out if people are suggesting it should be 400 or 144.
My understanding is that the required behavior is smaller types are first converted to the native signed or unsigned integers, the calculation is then carried out, and then the result is truncated during the assignment.
This also agrees with the generated code:
push rbp
mov rbp, rsp
mov edx, edi
mov eax, esi
mov BYTE PTR [rbp-4], dl
mov BYTE PTR [rbp-8], al
movzx edx, BYTE PTR [rbp-4] ; Promote 8-bit unsigned value to 32-bit unsigned value
movzx eax, BYTE PTR [rbp-8] ; Promote 8-bit unsigned value to 32-bit unsigned value
imul eax, edx ; 32-bit multiply to give result in eax.
pop rbp
ret
Likewise, floating-point constants default to double, and float arguments to variadic functions are promoted to double; whether plain float arithmetic is actually evaluated in double is implementation-dependent (see FLT_EVAL_METHOD), and can be overridden with compiler switches.
In arithmetic expressions, if the value of an operand can be described as an int, it is promoted to an int; if it is smaller than an int but cannot be represented by an int, it is converted to an unsigned int. All other integer types (i.e., those with range at least as large as int or unsigned int) are kept unchanged. (See e.g. C99 6.3.1.1.)
Thus, all operands in an integer arithmetic expression are always at least int or unsigned int, or larger integer types.
Arithmetic operations (like addition and subtraction) cause additional conversions (see e.g. C99 6.3.1.8). Basically, floating-point and integer operands are promoted to the type of the larger operand. (Technically, floating point type ranking is float, double, and long double; integer type ranks are listed in e.g. C99 6.3.1.1p1.)
This means that if you define your own Q15 (https://en.wikipedia.org/wiki/Q_(number_format)) fixed point format, you can write the multiplication operation (A×B×2⁻¹⁵) as
#include <stdint.h>
typedef int32_t q15;
static inline q15 q15_mul(const q15 a, const q15 b)
{
return ((int_fast64_t)a * b) / INT32_C(32768);
}
Using return ((int_fast64_t)a * b) >> 15; will work, but is technically "implementation defined". (I personally do use it, because it is common enough that any implementation failing to generate equivalent code would silently miscompile a ton of existing C code anyway.)
Because a and b are 32-bit quantities, we want the intermediate result to be 64-bit, before the division/shift. To do this, we need to cast one of the operands to a 64-bit signed integer type, here int_fast64_t. The compiler will promote the other operand(s) to that type also, due to the arithmetic operations.
The INT32_C(32768) is simply a way to write the 2¹⁵ divisor without worrying about the maximum range of an int (on 16-bit targets, 32768 does not fit in an int). Written this way, we leave the compiler to promote it to int_fast64_t. All C compilers currently used will optimize the division to a bit shift, on all architectures where that works (all GCC supports).
We could explicitly cast all three operands to int64_t or int_fast64_t, but the last time I checked the behaviour of various C compilers a decade ago, letting the compiler do the type promotions let them generate better optimized code: fewer register moves in static inline functions, especially on LP64 architectures like x86-64 on Linux.
Writing your own mixed-size fixed point arithmetic operations (in assembly, e.g. GCC or Clang extended asm (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html)) on types larger than the native word size, can save a significant number of operations, because (at least the last time I checked), compilers typically use a compiler-provided functions to do the operations, without optimized mixed-size versions. (A particular example is on 8-bit architectures, converting an N-bit unsigned integer to decimal via repeated multiplications by 10, and subtracting the smallest power of ten at least 2N using N+8-bit unsigned integer type.)
The key reason why I mark macro-like accessor functions static inline as opposed to just static, is that GCC does not issue a warning for unused functions of the former type, while it does for the latter, when default/recommended/common warnings are enabled. In other words:
- static inline: Accessor-like function, okay if not used at all. Only defined in this compilation unit; does not generate a linkable symbol in the symbol table. If not used, the object file won't contain the function at all.
- static: Local function. The compiler should warn if not used at all. Only defined in this compilation unit; does not generate a linkable symbol in the symbol table.
- Neither static nor inline: Externally accessible function (part of the API; generates a linkable symbol in the symbol table in the object file).
Note that these are all at least as much directed to my fellow developers as they are to the compiler; and do not make any assumptions about whether the function is actually inlined by the compiler or not.
Anyway, no need to listen to me or anyone else, when you can verify the facts for yourself. Take for example the following example.c:
#include <stdlib.h>
#include <stdio.h>
#undef FUNC_PREFIX
#if defined(USE_STATIC_INLINE)
#define FUNC_PREFIX static inline
#elif defined(USE_STATIC)
#define FUNC_PREFIX static
#elif defined(USE_INLINE)
#define FUNC_PREFIX inline
#else
#define FUNC_PREFIX
#endif
FUNC_PREFIX void describe(const int num, const char *val)
{
printf("%d: %s\n", num, val);
}
FUNC_PREFIX int unused_function(int num)
{
return num + 1;
}
int main(int argc, char *argv[])
{
for (int i = 0; i < argc; i++)
describe(i, argv[i]);
return EXIT_SUCCESS;
}
and compile the four versions (I will be using -O2 because that's my habit, but do check other optimization options as well as omitting it):
gcc -Wall -O2 example.c -o ex.none
gcc -DUSE_STATIC -Wall -O2 example.c -o ex.static
gcc -DUSE_INLINE -Wall -O2 example.c -o ex.inline
gcc -DUSE_STATIC_INLINE -Wall -O2 example.c -o ex.static_inline
On my system, the -DUSE_STATIC causes the compiler (GCC 7.5.0) to complain about unused_function() being defined but not used.
(Clang does complain for both -DUSE_STATIC and -DUSE_STATIC_INLINE, though.)
The above example is too simple to exhibit any code differences: it always trips the compiler's logic on when to inline a function. That is, all four run the same code, but only ex.none contains the binary symbols describe and unused_function. Feel free to investigate your own functions (my own focus was in funky complicated double-precision arithmetic functions and basic 3D vector algebra operations) to see if your code tends to be affected the way I described in my earlier post.
Although GCC code generation has improved a lot since the GCC 2 (1992-2001) and 3 (2001-2006) era, GCC 4 still generated a lot of superfluous register moves, increasing register pressure, and often using stack for temporary variables. This was particularly noticeable when inlining a function (which can occur with or without declaring the function inline).
If you are interested in how GCC static inline has evolved, compare 4.0.4 (https://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/Inline.html) to 7.5.0 (https://gcc.gnu.org/onlinedocs/gcc-7.5.0/gcc/Inline.html) to latest (https://gcc.gnu.org/onlinedocs/gcc/Inline.html) GCC version inlining documentation. As described in various versions, static inline has similar semantics in both C and C++, which is very useful when working on microcontrollers (that rely on a funky mix of freestanding C and C++ environments).
In short, the reasons I personally mark a function static inline has nothing to do with inlining per se, and everything to do with my intent regarding that function; especially whether it is okay for it to be completely omitted from the compiled binaries (i.e., not used/needed at all).
Asking myself: Okay, but how that relates to your statement that "let them generate better optimized code: fewer register moves in static inline function"?
About a decade ago, I had access to GCC (4.x.y), Intel Compiler Collection, Pathscale, and AMD Open64 C compilers on Linux; that's when I did those experiments on x86 and x86-64 to see the effects on the generated code.
The understanding I developed from testing the abovementioned compilers (ignoring "no change either way" cases; thus not trying to get the best results for a specific compiler, but to avoid the worst cases regardless of compiler) was that implicit and explicit casting are done at different stages of code synthesis, and that implicit casting makes it easier for the compilers to realize a register is unused, or always filled with zeros, for example when casting a 32-bit or smaller value to uint64_t on a 32-bit architecture. When the code is in a smallish local scope, say a macro-like accessor function or a pure arithmetic function, this was more noticeable. Obviously, this only matters when these expressions are heavily used in a program; I was dealing with potential models in simulations, calculated hundreds of millions of times per second.
The best way to explain this is to compare the following code (a×b/2³²) compiled for 32-bit Cortex-M4 and Cortex-M0:
#include <stdint.h>
int64_t mul64q32(const int64_t a, const int64_t b) { return a*b >> 32; }
int32_t mul32q32(const int32_t a, const int32_t b) { return ((int_fast64_t)a * b) >> 32; }
Compiling these to Cortex-M4 on GCC-7.5.0 yields (essentially)
mul64q32:
mul r3, r0, r3
mla r1, r2, r1, r3
umull r2, r3, r0, r2
adds r0, r1, r3
asrs r1, r0, #31
bx lr
mul32q32:
smull r0, r1, r0, r1
mov r0, r1
bx lr
Because of the 32-bit shift, one of the four 32-bit multiplications can be omitted in mul64q32. Cortex-M4 has 32×32bit multiplication with 64-bit result (in a register pair), so a single operation suffices. If mul32q32 gets inlined, and the surrounding code can use the result directly in the r1 register, the mov can be avoided, too: it then simplifies to a single smull instruction.
Now, compile the same code for Cortex-M0, and we get
mul64q32:
push {r4, lr}
bl __aeabi_lmul
movs r0, r1
asrs r1, r1, #31
pop {r4, pc}
mul32q32:
movs r2, r1
push {r4, lr}
asrs r1, r0, #31
asrs r3, r2, #31
bl __aeabi_lmul
movs r0, r1
pop {r4, pc}
where __aeabi_lmul is a compiler-provided 64×64-bit multiplication with 64-bit result (r1:r0 × r3:r2 = r1:r0).
Because the ARM GCC implementation on Cortex-M0 does not have a 32×32-bit multiplication with 64-bit result as a compiler-provided function, it has to expand the multiplicands to 64 bits, and use a generic 64×64-bit multiplication function. (Clang-10 does the same, using __muldi3 function, but does a funky shuffle to swap the two register pairs - essentially five unnecessary register-to-register moves. Odd.)
The root problem here is not at all in inlining or anything related to that, but the premature promotion of arguments to a multi-word type, then using a generic but slower operation to do the arithmetic (because the compiler does not realize it can simply omit doing the superfluous operations).
This seems to still be an issue, so much so that if using GCC or Clang-10 to compile for Cortex-M0, it would be worth the effort to implement mul32q32 in inline assembly, since it would need only two multiplication instructions, compared to four in __aeabi_lmul/__muldi3, assuming mul32q32 was so heavily used the difference would matter in real life. (Personally, I implement both [inline assembly and naïve-but-easily-verifiably-correct versions], selectable at compile time, with runnable unit tests on the target to verify they produce identical results for all arguments.)
As I always say, reality beats theory. Here, it means that while the C (and C++) standards describe the rules that should yield portable code (for example, "correctness"), individual compilers have behaviours ("efficiency") we can examine and rely on because of practical reasons. Yes, it does mean that before one can rely on these features, the output of each new (major) version of ones compiler has to be checked.
Simply put, standards describe "correctness", whereas "efficiency" is up to individual compilers. If you want the latter, you need to examine how different compilers generate efficient code.
In my experience, the key point is actually not optimum code generation, but to try and avoid the silliest and worst cases instead. (A good example of this is how optimizing for size, -Os, can often yield as efficient code as -O2 or even better. Then, the efficiency gained is just a side effect of trying to keep code size down.)
When you have something like mul32q32 above, used millions of times a second, implementing it in assembly for specific architectures is often worth the effort; you only know after examining the code generated by your toolchain for that particular architecture. You basically sidestep the compiler altogether by switching to assembly, instead of trying to find the best C or C++ expression for the job. (On x86-64, one can use <immintrin.h> intrinsics (https://software.intel.com/sites/landingpage/IntrinsicsGuide/) for Single-Instruction-Multiple-Data operations, instead of resorting to assembly. This was my main observation on x86 and x86-64 with floating-point math, really; and not relying on the compiler to vectorize the expressions also means that one has to think of data ordering and access patterns, which makes a major difference wrt. efficiency with SIMD.)
But it turns out that writing 5000 does not mean 5000; it's down to what the compiler chooses to interpret it as. In this case possibly an 8-bit number? So, gibberish. I solved all the math-related problems by running around putting "u" or "ul" on the end of every defined constant.
The most portable way to define the size of a literal integer constant (like 5000) is to use one of the NAME_C(literal) macros defined in the compiler-provided <stdint.h>. (This header file is available even in freestanding environments where the standard C library, like <stdio.h> and others, is NOT available.) Specifically:
INT8_C(-128...+127)
UINT8_C(0...255)
INT16_C(-32768...+32767)
UINT16_C(0...65535)
INT32_C(-2147483648...+2147483647)
UINT32_C(0...4294967295)
These evaluate to the literal itself, with the appropriate suffix (u, uL, LL, uLL, et cetera) appended, taking the guesswork completely out of it.
This also means you cannot do e.g. UINT8_C(NAMED_CONSTANT). Use a cast, (uint8_t)(NAMED_CONSTANT), for those instead.
Again, these are provided by the compiler (and must be provided by the compiler since C99 or so, and are provided even by Microchip XC compilers), and do not depend on any standard C library being available.
Yes, yes. That has even already been suggested earlier in the thread. Now I guess it needs to sink in.
Now, as we discuss on a regular basis, is C a bit tricky for arithmetic? It is. That comes from both legacy and efficiency reasons.
Regarding literals, and otherwise potential conversion issues, there is a GCC warning that is very useful and that I always enable, especially for embedded development: '-Wconversion'. Enable it!
A quick example (using large values just because I tested it with GCC for x86_64):
int Test(int n)
{
int m = 20000000000 * n;
return m;
}
The literal here is 20×10⁹; it exceeds what can be represented as an 'int' in this context.
With 'gcc -Wall': no warning.
With 'gcc -Wall -Wconversion':
test1.c: In function 'Test':
test1.c:3:10: warning: conversion from 'long long int' to 'int' may change value [-Wconversion]
3 | int m = 20000000000 * n;
| ^~~~~~~~~~~
I really suggest the OP test this warning option with their original code, and report back if it indeed catches the issue. It should.
inline never has the effect you are looking for. It's better to use the compiler specific options if you want the literal effect of the keyword, like: __attribute__((always_inline)); or __forceinline.
Yes, you lose portability. But when you need to do this you probably are micro-optimizing anyway.
In general I don't second guess the compiler, but if there's a specific need for it, use the compiler-specific directives and use the preprocessor to deal with it:
#ifdef __GNUC__
static inline ureg load32(void __iomem *address) __attribute__((always_inline))
#endif
If you want to *force* inlining with GCC, yes, this is the only thing that will work 100% of the time, even at -O0 optimization level.
Note: a minor remark here, but GCC yields an error when you place the attribute at the end of the declaration, as you did; you need to put it at the beginning, for some reason:
error: attributes should be specified before the declarator in a function definition
I'm not 100% sure, but LLVM/Clang should support this attribute too. Just guessing, as it tends to support a good range of GCC extensions and attributes.
But otherwise, unless you selectively disable function inlining through options, GCC will inline almost every function it can starting at -O1, with or without 'inline', even with or without 'static'. In the latter case, a non-static function will usually be inlined where it's called inside the same compilation unit (source file if you will), AND callable code for this function will also be generated at the same time, which will be the version that gets called outside of its compilation unit. Depending on optimization level though, GCC will inline more or less aggressively. At -O1, I think it will automatically inline only functions that are "small enough", but at -O3 it will inline practically ALL functions, which can lead to significantly larger binaries. I don't know GCC's exact criteria though, and this may change at every new version...
So while "inline" is indeed pretty much ignored by most modern compilers *for call optimization*, the standard, as I quoted, still states that it can be an implementation-defined suggestion to the compiler that the call should be "fast" - without mentioning a particular method. The C standard add two footnotes about it:
By using, for example, an alternative to the usual function call mechanism, such as ‘‘inline
substitution’’. (...)
For example, an implementation might never perform inline substitution, or might only perform inline
substitutions to calls in the scope of an inline declaration.
As it's entirely up to the implementation, of course, it may do nothing as far as optimization is concerned. Which is indeed the case with modern GCC and many other C compilers.
But yes, it's true that a number of C compilers already implemented "inline" well before it got standardized (and its inclusion in the standard probably comes from this fact, actually), and AFAIR, in older compilers, the keyword was indeed the only way of having the compiler inline a function (when the function could be inlined...), as they would not inline anything on their own.