The key reason why I mark macro-like accessor functions static inline, as opposed to just static, is that GCC does not issue a warning for unused functions of the former type, while it does for the latter, when default/recommended/common warnings are enabled. In other words,
- static inline: Accessor-like function, okay if not used at all. Only defined in this compilation unit; does not generate a linkable symbol in the symbol table. If not used, the object file won't implement the function at all.
- static: Local function. The compiler should warn if it is not used at all. Only defined in this compilation unit; does not generate a linkable symbol in the symbol table.
- Neither static nor inline: Externally accessible function (part of API, generates a linkable symbol in the symbol table in the object file).
Note that these are all at least as much directed at my fellow developers as they are at the compiler; and they make no assumptions about whether the function is actually inlined by the compiler or not.
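As a concrete illustration of the accessor-like case, here is a minimal sketch; the register type, field layout, and function name are made up for the example:

```c
#include <stdint.h>

/* Hypothetical memory-mapped register value, for illustration only. */
typedef struct { uint32_t raw; } reg_t;

/* static inline: macro-like accessor, fine if a given compilation unit
   never uses it; no linkable symbol is emitted for it. */
static inline uint32_t reg_field(const reg_t r, unsigned shift, unsigned bits)
{
    return (r.raw >> shift) & ((UINT32_C(1) << bits) - 1u);
}
```

If a translation unit includes this but never calls reg_field(), the function simply vanishes from the object file, without any warning.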
Anyway, no need to listen to me or anyone else when you can verify the facts for yourself. Take, for example, the following example.c:
#include <stdlib.h>
#include <stdio.h>

#undef FUNC_PREFIX
#if defined(USE_STATIC_INLINE)
#define FUNC_PREFIX  static inline
#elif defined(USE_STATIC)
#define FUNC_PREFIX  static
#elif defined(USE_INLINE)
#define FUNC_PREFIX  inline
#else
#define FUNC_PREFIX
#endif

FUNC_PREFIX void describe(const int num, const char *val)
{
    printf("%d: %s\n", num, val);
}

FUNC_PREFIX int unused_function(int num)
{
    return num + 1;
}

int main(int argc, char *argv[])
{
    for (int i = 0; i < argc; i++)
        describe(i, argv[i]);
    return EXIT_SUCCESS;
}
and compile the four versions (I will be using -O2 because that's my habit, but do check other optimization options, as well as omitting it entirely):
gcc -Wall -O2 example.c -o ex.none
gcc -DUSE_STATIC -Wall -O2 example.c -o ex.static
gcc -DUSE_INLINE -Wall -O2 example.c -o ex.inline
gcc -DUSE_STATIC_INLINE -Wall -O2 example.c -o ex.static_inline
On my system, only -DUSE_STATIC causes the compiler (GCC 7.5.0) to complain about unused_function() being defined but not used. (Clang does complain for both -DUSE_STATIC and -DUSE_STATIC_INLINE, though.)
The above example is too simple to exhibit any differences in the generated code: it always triggers the compiler's logic on when to inline a function. That is, all four binaries run the same code, but only ex.none contains the binary symbols describe and unused_function. Feel free to investigate your own functions (my own focus was on funky, complicated double-precision arithmetic functions and basic 3D vector algebra operations) to see whether your code tends to be affected the way I described in my earlier post.
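One way to verify the symbol-table claim yourself, assuming GNU binutils (nm) is available; the heredoc recreates a trimmed example.c so the snippet is self-contained:

```shell
# Recreate a trimmed example.c, build two variants, and inspect their symbols.
cat > example.c << 'EOF'
#include <stdlib.h>
#include <stdio.h>
#if defined(USE_STATIC_INLINE)
#define FUNC_PREFIX static inline
#else
#define FUNC_PREFIX
#endif
FUNC_PREFIX void describe(const int num, const char *val)
{
    printf("%d: %s\n", num, val);
}
FUNC_PREFIX int unused_function(int num)
{
    return num + 1;
}
int main(int argc, char *argv[])
{
    for (int i = 0; i < argc; i++)
        describe(i, argv[i]);
    return EXIT_SUCCESS;
}
EOF
gcc -Wall -O2 example.c -o ex.none
gcc -DUSE_STATIC_INLINE -Wall -O2 example.c -o ex.static_inline
# ex.none should list both functions; ex.static_inline should list neither.
nm ex.none | grep -E 'describe|unused_function' || true
nm ex.static_inline | grep -E 'describe|unused_function' || true
```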
Although GCC code generation has improved a lot since the GCC 2 (1992-2001) and GCC 3 (2001-2006) era, GCC 4 still generated a lot of superfluous register moves, increasing register pressure, and often used the stack for temporary variables. This was particularly noticeable when inlining a function (which can occur with or without declaring the function inline).
If you are interested in how GCC static inline handling has evolved, compare the 4.0.4, 7.5.0, and latest GCC inlining documentation. As described in the various versions, static inline has similar semantics in both C and C++, which is very useful when working on microcontrollers (which rely on a funky mix of freestanding C and C++ environments).
In short, the reasons I personally mark a function static inline have nothing to do with inlining per se, and everything to do with my intent regarding that function; especially whether it is okay for it to be completely omitted from the compiled binaries (i.e., not used/needed at all).
Asking myself: Okay, but how does that relate to your statement, "let them generate better optimized code: fewer register moves in static inline functions"?
About a decade ago, I had access to the GCC (4.x.y), Intel, PathScale, and AMD Open64 C compilers on Linux; that's when I did those experiments on x86 and x86-64 to see the effects on the generated code.
The understanding I developed from testing the abovementioned compilers (ignoring the "no change either way" cases; that is, not trying to get the best results for a specific compiler, but to avoid the worst cases regardless of compiler) was that implicit and explicit conversions are handled at different stages of code generation, and that implicit conversion makes it easier for the compilers to realize a register is unused, or always filled with zeros: for example, when converting a 32-bit or smaller value to uint64_t on a 32-bit architecture. When the code is in a smallish local scope, say a macro-like accessor function or a pure arithmetic function, this was more noticeable. Obviously, this only matters when these expressions are heavily used in a program; I was dealing with potential models in simulations, calculated hundreds of millions of times per second.
The best way to explain this is to compare the following code (computing a·b/2^32, i.e., the upper 32 bits of the product) compiled for 32-bit Cortex-M4 and Cortex-M0:
#include <stdint.h>

/* a*b / 2^32, with 64-bit and 32-bit arguments, respectively: */
int64_t mul64q32(const int64_t a, const int64_t b) { return (a * b) >> 32; }
int32_t mul32q32(const int32_t a, const int32_t b) { return ((int_fast64_t)a * b) >> 32; }
Compiling these for Cortex-M4 with GCC 7.5.0 yields (essentially)
mul64q32:
mul r3, r0, r3
mla r1, r2, r1, r3
umull r2, r3, r0, r2
adds r0, r1, r3
asrs r1, r0, #31
bx lr
mul32q32:
smull r0, r1, r0, r1
mov r0, r1
bx lr
Because of the 32-bit shift, one of the four 32-bit multiplications can be omitted in mul64q32. Cortex-M4 has a 32×32-bit multiplication with a 64-bit result (in a register pair), so a single operation suffices for mul32q32. If mul32q32 gets inlined, and the surrounding code can use the result directly in the r1 register, the mov can be avoided, too: it then simplifies to a single smull instruction.
Now, compile the same code for Cortex-M0, and we get
mul64q32:
push {r4, lr}
bl __aeabi_lmul
movs r0, r1
asrs r1, r1, #31
pop {r4, pc}
mul32q32:
movs r2, r1
push {r4, lr}
asrs r1, r0, #31
asrs r3, r2, #31
bl __aeabi_lmul
movs r0, r1
pop {r4, pc}
where __aeabi_lmul is a compiler-provided 64×64-bit multiplication with a 64-bit result (r1:r0 × r3:r2 = r1:r0).
Because the ARM GCC implementation on Cortex-M0 does not provide a 32×32-bit multiplication with a 64-bit result as a compiler-provided function, it has to expand the multiplicands to 64 bits and use a generic 64×64-bit multiplication function. (Clang-10 does the same, using the __muldi3 function, but does a funky shuffle to swap the two register pairs: essentially five unnecessary register-to-register moves. Odd.)
The root problem here is not in inlining or anything related to it, but in the premature promotion of the arguments to a multi-word type, followed by a generic but slower operation to do the arithmetic (because the compiler does not realize it can simply omit the superfluous operations).
This seems to still be an issue; so much so that if using GCC or Clang-10 to compile for Cortex-M0, it would be worth the effort to implement mul32q32 in inline assembly, since it would need only two multiplication instructions, compared to four in __aeabi_lmul/__muldi3, assuming mul32q32 is used so heavily that the difference matters in real life. (Personally, I implement both [inline assembly and naïve-but-easily-verifiably-correct versions], selectable at compile time, with runnable unit tests on the target to verify they produce identical results for all arguments.)
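A sketch of that dual-implementation pattern; the macro name MUL32Q32_USE_ASM and the self-test are my own, and the target-specific inline-assembly body is omitted:

```c
#include <stdint.h>
#include <assert.h>

#if defined(MUL32Q32_USE_ASM)
/* Target-specific inline-assembly implementation would go here. */
#else
/* Naive but easily-verifiably-correct reference version. */
static inline int32_t mul32q32(const int32_t a, const int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}
#endif

/* Runnable sanity check; a real unit test on the target would sweep
   a much larger set of argument pairs and compare the two versions. */
static void mul32q32_selftest(void)
{
    assert(mul32q32(65536, 65536) == 1);                   /* 2^32 >> 32 = 1 */
    assert(mul32q32(INT32_MIN, INT32_MIN) == 1073741824);  /* 2^62 >> 32 = 2^30 */
    assert(mul32q32(0, 12345) == 0);
}
```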
As I always say, reality beats theory. Here, it means that while the C (and C++) standards describe the rules that should yield portable code (that is, "correctness"), individual compilers have behaviours ("efficiency") that we can examine and, for practical reasons, rely on. Yes, it does mean that before one can rely on these features, the output of each new (major) version of one's compiler has to be checked. Simply put, standards describe "correctness", whereas "efficiency" is up to individual compilers. If you want the latter, you need to examine how different compilers generate code for your constructs.
In my experience, the key point is actually not optimum code generation, but trying to avoid the silliest and worst cases instead. (A good example of this is how optimizing for size, -Os, can often yield code as efficient as -O2, or even better. There, the efficiency gained is just a side effect of trying to keep the code size down.)
When you have something like mul32q32 above, used millions of times a second, implementing it in assembly for specific architectures is often worth the effort; but you only know after examining the code generated by your toolchain for that particular architecture. You basically sidestep the compiler altogether by switching to assembly, instead of trying to find the best C or C++ expression for the job. (On x86-64, one can use <immintrin.h> intrinsics for Single-Instruction-Multiple-Data operations instead of resorting to assembly. This was my main observation on x86 and x86-64 with floating-point math, really; and not relying on the compiler to vectorize the expressions also means that one has to think about data ordering and access patterns, which makes a major difference in efficiency with SIMD.)
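For instance, a minimal SSE sketch (the function name and layout are mine; real code would also handle alignment and tail elements, and requires an x86/x86-64 target):

```c
#include <immintrin.h>

/* Multiply four consecutive floats by a factor with one SIMD multiply,
   instead of relying on the compiler to vectorize a scalar loop. */
static void scale4(float *dst, const float *src, float factor)
{
    const __m128 f = _mm_set1_ps(factor);   /* broadcast factor to all 4 lanes */
    const __m128 v = _mm_loadu_ps(src);     /* unaligned load of 4 floats */
    _mm_storeu_ps(dst, _mm_mul_ps(v, f));   /* multiply and store */
}
```

Because the data is processed four lanes at a time, keeping the floats contiguous (structure-of-arrays rather than array-of-structures) is exactly the data-ordering concern mentioned above.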