Type punning via a union has been supported by the C standard since C99, so it's not just a GCC feature.
However, in C++ reading a union member other than the one last written is undefined behaviour, which means that you should not try to use it in the Arduino environment, because it uses C++ (GNU g++).
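As a minimal sketch of what union-based type punning looks like in C99 and later, the following reinterprets the storage of a float as a 32-bit unsigned integer and back. The names float_bits, float_to_bits() and bits_to_float() are made up for this illustration, and it assumes that float is 32 bits wide.

```c
#include <stdint.h>

/* Hypothetical helper type for illustration only.
   Assumes float is 32 bits wide; that is common but not guaranteed. */
typedef union {
    float    f;
    uint32_t u;
} float_bits;

/* Write the float member, then read the integer member (type punning).
   This is fine in C99 and later, but should not be relied on in C++. */
static inline uint32_t float_to_bits(float value)
{
    float_bits pun;
    pun.f = value;
    return pun.u;
}

/* The reverse direction: write the integer member, read the float member. */
static inline float bits_to_float(uint32_t bits)
{
    float_bits pun;
    pun.u = bits;
    return pun.f;
}
```

On a platform that uses IEEE 754 single precision for float, float_to_bits(1.0f) should yield 0x3F800000.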
In both C++ and C (since C99), functions marked static inline are as fast as macros. They not only provide type safety (which macros do not; a macro accepts anything as a parameter), they also help the compiler generate better code when you enable optimizations.
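To illustrate the difference, here is a small sketch comparing a macro with a static inline function. The names SQUARE_MACRO, square_i32() and next_value() are invented for this example. Besides the typed parameter and return value, it also shows a related pitfall not mentioned above: a macro may evaluate its argument more than once.

```c
#include <stdint.h>

/* Macro version: accepts any expression, and the expression is
   substituted (and therefore evaluated) twice. */
#define SQUARE_MACRO(x) ((x) * (x))

/* Inline function version: parameter and return types are declared,
   and the argument is evaluated exactly once. */
static inline int32_t square_i32(int32_t x)
{
    return x * x;
}

/* Hypothetical helper with a side effect: increments the counter
   and returns the new value. */
static inline int32_t next_value(int32_t *counter)
{
    return ++(*counter);
}
```

Calling square_i32(next_value(&c)) increments c once, whereas SQUARE_MACRO(next_value(&c)) increments it twice.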
I build all my code with GCC using at least the -Wall -O2 flags. These enable all the typical warnings (some of them not so useful; -Wunused-variable being one) and optimization. (I've compiled a lot of code using similar settings, as I used to build my own "distro" from scratch; see linuxfromscratch.org. I've also maintained computational clusters, where the correctness of the simulations is important, and compiled and optimized some simulators for them. And writing highly parallel and distributed simulators for computational materials physics is what I'd like to do as a career. So I do claim to have a lot of experience doing this, and that's what I recommend.)
Because of the various quirks in the details, rather than storing a datagram (or file header) in a structure, I recommend using buffers and static inline accessor functions. For 32-bit architectures in C, for example,
#include <stdint.h>
typedef union {
float f[1];
uint32_t u32[1];
int32_t i32[1];
uint16_t u16[2];
int16_t i16[2];
uint8_t u8[4];
int8_t i8[4];
} word32;
/* Pack u32 from different byte orders */
static inline uint32_t pack_u32_1234(const uint8_t src[4]) { return ((word32){ .u8 = { src[0], src[1], src[2], src[3] } }).u32[0]; }
static inline uint32_t pack_u32_3412(const uint8_t src[4]) { return ((word32){ .u8 = { src[2], src[3], src[0], src[1] } }).u32[0]; }
static inline uint32_t pack_u32_2143(const uint8_t src[4]) { return ((word32){ .u8 = { src[1], src[0], src[3], src[2] } }).u32[0]; }
static inline uint32_t pack_u32_4321(const uint8_t src[4]) { return ((word32){ .u8 = { src[3], src[2], src[1], src[0] } }).u32[0]; }
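/* On glibc, __BYTE_ORDER comes from <endian.h>. The "-0" keeps the
   expression valid if the macro is undefined or empty, so such cases
   fall through to the #error below. */
#if __BYTE_ORDER-0 == 1234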
// Little-endian architecture
#define pack_le32(...) pack_u32_1234(__VA_ARGS__)
#define pack_be32(...) pack_u32_4321(__VA_ARGS__)
#elif __BYTE_ORDER-0 == 4321
// Big-endian architecture
#define pack_le32(...) pack_u32_4321(__VA_ARGS__)
#define pack_be32(...) pack_u32_1234(__VA_ARGS__)
#elif __BYTE_ORDER-0 == 3412
// PDP-endian architecture
#define pack_le32(...) pack_u32_3412(__VA_ARGS__)
#define pack_be32(...) pack_u32_2143(__VA_ARGS__)
#elif __BYTE_ORDER-0 == 2143
// Inverse PDP-endian architecture
#define pack_le32(...) pack_u32_2143(__VA_ARGS__)
#define pack_be32(...) pack_u32_3412(__VA_ARGS__)
#else
#error Unknown __BYTE_ORDER!
#endif
produces macro-like inline functions that typically generate excellent code for constructing a 32-bit unsigned integer from four 8-bit unsigned integers. (I'm not sure I numbered the macros right for each architecture, though.) The code tends to be better than code accessing memory via a union, although you can certainly construct edge cases either way. Note that accessing unaligned memory is problematic on many architectures; the accessor functions above are not affected by that, because they read the source one byte at a time. Because the above functions are static inline, they do not add to the binary size unless used.
If you have a pointer unsigned char *p to a buffer position, and want to obtain the 32-bit unsigned integer stored in big-endian byte order starting 5 bytes after p, you would call pack_be32((const uint8_t *)(p + 5)). The source is portable between wildly different architectures, and the compiled binary code is quite good (depending somewhat on your optimization settings, of course).
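For cross-checking the result, for example in a unit test, a shift-based reader can serve as a reference. This is a different technique from the union-based accessors above, and get_be32() is a name made up for this sketch. It does not depend on the host byte order or on alignment, but may or may not compile to code as compact as the accessors above.

```c
#include <stdint.h>

/* Hypothetical reference helper, not one of the accessors above.
   Builds the value arithmetically from four bytes stored in
   big-endian order (most significant byte first). */
static inline uint32_t get_be32(const unsigned char *src)
{
    return ((uint32_t)src[0] << 24)
         | ((uint32_t)src[1] << 16)
         | ((uint32_t)src[2] <<  8)
         |  (uint32_t)src[3];
}
```

With a buffer whose bytes at offsets 5 to 8 are 0x12, 0x34, 0x56, 0x78, get_be32(p + 5) yields 0x12345678, which is what pack_be32((const uint8_t *)(p + 5)) should return as well.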