Is there a well-known trick for reading unaligned data on the ARM chips that don't support it in hardware?
No, there isn't. (If such a trick exists, it is as obscure as secret undocumented instructions, and probably involves those.)
The load-shift-add method takes 8 or 10 instructions to read a 32-bit value: four byte loads, three shifts, and three adds. Anything cleverer, like reading the two aligned 32-bit words that cover the desired value, leads to longer machine code. The simple method is just too simple to beat with anything more complex.
Personally, I prefer explicit casts and notation, so that both we humans and future versions of GCC parse the intent correctly, for example:
#include <stdint.h>
/* Read 32-bit unsigned integer from a possibly unaligned pointer.
     get_u32_native(ptr):   Native byte order.
     get_u32_swapped(ptr):  Swapped byte order.
     get_u32_le(ptr):       Little-endian byte order (least significant byte first).
     get_u32_be(ptr):       Big-endian byte order (most significant byte first).
*/
static inline uint32_t get_u32_le(const void *const ptr)
{
    const uint8_t *const byte = ptr;
    return (uint32_t)byte[0]
         + ((uint32_t)byte[1] << 8)
         + ((uint32_t)byte[2] << 16)
         + ((uint32_t)byte[3] << 24);
}
static inline uint32_t get_u32_be(const void *const ptr)
{
    const uint8_t *const byte = ptr;
    return (uint32_t)byte[3]
         + ((uint32_t)byte[2] << 8)
         + ((uint32_t)byte[1] << 16)
         + ((uint32_t)byte[0] << 24);
}
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define get_u32_native get_u32_le
#define get_u32_swapped get_u32_be
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define get_u32_native get_u32_be
#define get_u32_swapped get_u32_le
#else
#error Unsupported byte order.
#endif
These compile to 20 bytes on Cortex-M0, M0+, and M4 when using -O2 or -Os (and I always do, and recommend you do too).
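As a quick sanity check of the accessors, a sketch like the following reads the same four bytes from an odd offset in both byte orders. The buffer contents and the helper name check_accessors are made up for illustration; the two accessors are repeated only so the snippet compiles on its own.

```c
#include <stdint.h>

static inline uint32_t get_u32_le(const void *const ptr)
{
    const uint8_t *const byte = ptr;
    return (uint32_t)byte[0]
         + ((uint32_t)byte[1] << 8)
         + ((uint32_t)byte[2] << 16)
         + ((uint32_t)byte[3] << 24);
}

static inline uint32_t get_u32_be(const void *const ptr)
{
    const uint8_t *const byte = ptr;
    return (uint32_t)byte[3]
         + ((uint32_t)byte[2] << 8)
         + ((uint32_t)byte[1] << 16)
         + ((uint32_t)byte[0] << 24);
}

/* Hypothetical self-check: returns 1 if both accessors decode the
   bytes 0x11 0x22 0x33 0x44, stored at (odd) offset 1, as expected. */
static int check_accessors(void)
{
    static const uint8_t buffer[8] = {
        0x00, 0x11, 0x22, 0x33, 0x44, 0x00, 0x00, 0x00
    };
    return get_u32_le(buffer + 1) == UINT32_C(0x44332211)
        && get_u32_be(buffer + 1) == UINT32_C(0x11223344);
}
```

Because only byte loads are used, the result does not depend on the alignment of buffer + 1 or on the byte order of the machine running the check.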
The other way, reading the two aligned words that cover the value, say
#include <stdint.h>
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
static inline uint32_t get_u32_le(const void *const ptr)
{
    const uintptr_t address = (uintptr_t)ptr;
    const uint32_t *base = (const uint32_t *)(address & (~(uintptr_t)3));
    const uint32_t word[2] = { base[0], base[1] };
    const unsigned char shift = (address & 3) << 3;
    return (word[0] >> shift) + (word[1] << (32 - shift));
}
#else
#error Not implemented
#endif
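compiles to nine instructions (26 bytes) on Cortex-M4 and 13 instructions (also 26 bytes) on Cortex-M0 and M0+, compared to 20 bytes for each of the earlier two functions. Furthermore, if ptr happens to be aligned, it also accesses the following 32-bit integer, and that can be quite problematic in some cases (for example, at the very end of a buffer). In that same aligned case the shift count 32 - shift is 32, which is undefined behavior in C for a 32-bit operand.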
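To illustrate what guarding the aligned case would take, here is a minimal sketch of the two-word approach with an early return. The name get_u32_le_words is made up for this example, it assumes a little-endian target like the version above, and I have not measured its code size; the extra compare and branch should only make it longer.

```c
#include <stdint.h>

/* Two-word read with the aligned case handled separately.
   Assumes a little-endian target, and that the two aligned words
   covering the value may legally be read as uint32_t. */
static inline uint32_t get_u32_le_words(const void *const ptr)
{
    const uintptr_t address = (uintptr_t)ptr;
    const uint32_t *const base = (const uint32_t *)(address & ~(uintptr_t)3);
    const unsigned int shift = (unsigned int)(address & 3) << 3;

    /* Aligned: a single load suffices. Skipping base[1] avoids both
       the read past the value and the undefined shift by 32 bits. */
    if (shift == 0)
        return base[0];

    /* Unaligned: shift is 8, 16, or 24, so both shift counts are valid. */
    return (base[0] >> shift) | (base[1] << (32 - shift));
}
```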
If you look at the generated code, you'll see that this approach simply needs too many individual operations to get under the 10-instruction, 20-byte mark, even with trickery.
The only situation where I find myself needing to access possibly unaligned pointers is when processing binary data (or a data stream) that follows a specific protocol. In those cases, I've always found the overhead of using such accessors warranted.
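As a sketch of that use case, consider a wire format where the 32-bit fields follow a single type byte and therefore sit at odd offsets. The layout, HEADER_SIZE, struct header, and parse_header are all invented for this example; only get_u32_le comes from the code earlier in this answer.

```c
#include <stddef.h>
#include <stdint.h>

static inline uint32_t get_u32_le(const void *const ptr)
{
    const uint8_t *const byte = ptr;
    return (uint32_t)byte[0]
         + ((uint32_t)byte[1] << 8)
         + ((uint32_t)byte[2] << 16)
         + ((uint32_t)byte[3] << 24);
}

/* Hypothetical wire format, invented for this example:
     offset 0: uint8   message type
     offset 1: uint32  payload length, little-endian
     offset 5: uint32  sequence number, little-endian */
#define HEADER_SIZE 9

struct header {
    uint8_t  type;
    uint32_t length;
    uint32_t sequence;
};

/* Decode a header from raw bytes.
   Returns 0 on success, -1 if fewer than HEADER_SIZE bytes are available. */
static int parse_header(const uint8_t *data, size_t size, struct header *out)
{
    if (size < HEADER_SIZE)
        return -1;
    out->type     = data[0];
    out->length   = get_u32_le(data + 1);  /* odd offset, byte loads only */
    out->sequence = get_u32_le(data + 5);
    return 0;
}
```

Decoding field by field like this also avoids casting the buffer to a packed struct, so the parser does not depend on compiler-specific packing attributes or on the byte order of the target.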
(When dealing with massive amounts of binary data, I've crafted my own protocol that ensures data alignment. I have done this, BTW, for reading and writing molecular dynamics data from both Fortran 95 and C code. The Fortran code requires one nonstandard feature, reading raw binary data without record length words, but all the Fortran 90/95 compilers I had access to supported it. This allowed a distributed simulation to save oodles of data to node-local storage without slowing down the simulation; combining and slicing the data was done afterwards with a helper utility, when collecting the results.)