Author Topic: C struct bitfields - size versus speed  (Read 9667 times)


Online HwAoRrDkTopic starter

  • Super Contributor
  • ***
  • Posts: 1477
  • Country: gb
C struct bitfields - size versus speed
« on: April 30, 2018, 10:07:43 pm »
I never really thought about this until today, but when using bitfields there must be some kind of trade-off between memory size and execution speed, right?

Say you have a struct like the following:

Code: [Select]
typedef struct {
    uint8_t foo;
    uint16_t bar;
    uint8_t baz : 3;
    bool flag_a : 1;
    bool flag_b : 1;
    bool flag_c : 1;
    bool flag_d : 1;
    bool flag_e : 1;
} my_struct_t;

Now, by setting baz to 3 bits (assuming it only needs to represent values 0-7) and using single-bit fields for our 5 boolean flags, we've gone from a total size of 9 bytes to only 4 bytes. Nice, some memory saved. However, when you're writing your code to access these bitfields, what kind of machine code is the compiler actually generating?

I assume there must be some penalty to reads and writes to these fields, in the form of extra instructions necessary to 'resize' the data. Without knowing what actually goes on, I would assume the compiler has to generate masking and shifting instructions. For instance, in the case of assigning some other uint8_t variable to baz, I would assume you would need to mask off the relevant three bits, and depending on the alignment within the struct, shift those 3 bits into the correct position.

Am I right? Or, because the C standards don't actually define how struct bitfields are physically packed, is this all platform- and/or compiler-dependent?
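[Editor's note: for comparison, here is a hand-rolled sketch of the read-modify-write a compiler typically emits for a bitfield store. The bit positions (the 3-bit field in the low bits of its packing byte) are an assumption; the actual layout is compiler-specific.]

```c
#include <stdint.h>

/* Hypothetical hand-written equivalent of `s.baz = v;` for a 3-bit field
 * assumed to live in the low bits of its storage byte. A real compiler
 * is free to place the field elsewhere. */
static inline uint8_t set_baz(uint8_t packed, uint8_t v)
{
    packed &= (uint8_t)~0x07u;       /* clear the 3 field bits */
    packed |= (uint8_t)(v & 0x07u);  /* mask the new value and merge it */
    return packed;
}

static inline uint8_t get_baz(uint8_t packed)
{
    return packed & 0x07u;           /* no shift needed at bit offset 0 */
}
```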
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26907
  • Country: nl
    • NCT Developments
Re: C struct bitfields - size versus speed
« Reply #1 on: April 30, 2018, 10:13:30 pm »
The compiler will create bit shifts & masks to get and store the bits from the struct, so yes, it will be a lot slower. I have used this once or twice in the past decades to save memory space.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline helius

  • Super Contributor
  • ***
  • Posts: 3642
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #2 on: April 30, 2018, 10:19:22 pm »
However, when you're writing your code to access these bitfields, what kind of machine code is the compiler actually generating?

I assume there must be some penalty to reads and writes to these fields, in the form of extra instructions necessary to 'resize' the data. Without knowing what actually goes on, I would assume the compiler has to generate masking and shifting instructions. For instance, in the case of assigning some other uint8_t variable to baz, I would assume you would need to mask off the relevant three bits, and depending on the alignment within the struct, shift those 3 bits into the correct position.
If the assigned value is a constant, the compiler can use a pre-shifted literal. The only extra instructions are a mask to preserve the remaining bits. If you assign multiple fields, it can coalesce multiple writes into one, and that saves more time than the mask instructions take. Keeping the struct size small also saves time when initializing it and passing it in function arguments.
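[Editor's note: a sketch of the coalescing described above, with assumed bit positions. Assigning three 1-bit fields one at a time versus one combined clear-and-set over the same storage byte gives the same result with fewer operations.]

```c
#include <stdint.h>

/* Three single-bit field writes done one mask at a time... */
static inline uint8_t set_flags_naive(uint8_t b)
{
    b |= 1u << 3;              /* flag_a = 1 (assumed bit position) */
    b |= 1u << 4;              /* flag_b = 1 */
    b &= (uint8_t)~(1u << 5);  /* flag_c = 0 */
    return b;
}

/* ...versus one coalesced read-modify-write covering all three fields. */
static inline uint8_t set_flags_coalesced(uint8_t b)
{
    return (uint8_t)((b & ~0x38u) | 0x18u);  /* clear bits 3-5, set bits 3-4 */
}
```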

Quote
Am I right? Or, because the C standards don't actually define how struct bitfields are physically packed, is this all platform- and/or compiler-dependent?
Of course struct packing is compiler dependent; it goes far beyond just bitfields.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #3 on: April 30, 2018, 10:22:36 pm »
Why not compile some code accessing the various fields and look at the assembly file?  Use the -S option on the gcc command line.  Assuming gcc, of course.

Change all of the variables to unsigned char and see how that works out.

Definitely put some bit extract/test expressions inside an 'if' statement like if (flag_a) { ... }

Yes, it's going to be compiler dependent and maybe even platform dependent.

It will be worth the time for the education.

Unless you are working on a small micro where flash is large but ram is small, why bother?

Oh, and be aware of endianness; in some cases the bits will be laid out 'backwards', so when you write masking operations you have a 50-50 chance of getting everything wrong.
I played with this a couple of weeks ago and I had to use an #ifdef to test the endianness.
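[Editor's note: a minimal file for the `gcc -O2 -S` experiment suggested above. The struct matches the OP's; `roundtrip_baz` is an illustrative helper added here, and also shows that an unsigned bitfield store truncates to the field width.]

```c
/* bf_test.c - compile with e.g. `gcc -O2 -S bf_test.c` and read bf_test.s
 * to see exactly what your compiler emits for bitfield access. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t foo;
    uint16_t bar;
    uint8_t baz : 3;
    bool flag_a : 1;
} my_struct_t;

uint8_t read_baz(const my_struct_t *s)       { return s->baz; }
void    write_baz(my_struct_t *s, uint8_t v) { s->baz = v; }
bool    test_flag(const my_struct_t *s)      { return s->flag_a; }

/* store then load: an unsigned 3-bit field keeps only the low 3 bits */
uint8_t roundtrip_baz(uint8_t v)
{
    my_struct_t s = {0};
    write_baz(&s, v);
    return read_baz(&s);
}
```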

 

Offline C

  • Super Contributor
  • ***
  • Posts: 1346
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #4 on: April 30, 2018, 10:35:30 pm »

It really comes down to how smart the compiler is.
Some things in a language force a compiler to be smarter.

A dumb compiler might treat each bit as a byte.

A little smarter compiler might do shifts and masks.

A really smart compiler would look at where a range of bits exists.
A switch statement is then handled not as a 0-7 value
but as the actual storage value range.
You program the switch statement as 0-7 and the compiler translates it to storage-position values, for the same result with fewer instructions.

When a compiler gets this smart, it can skip a lot of shifts & rotates, which gives a faster, smaller code result.

C
 

Online HwAoRrDkTopic starter

  • Super Contributor
  • ***
  • Posts: 1477
  • Country: gb
Re: C struct bitfields - size versus speed
« Reply #5 on: April 30, 2018, 10:49:24 pm »
The compiler will create bit shifts & masks to get and store the bits from the struct so yes it will be a lot slower.

When you say "a lot slower", are you exaggerating for effect, or do you mean that?

Why not compile some code accessing the various field and look at the assembly file?  Use the -S option on the gcc command line.  Assuming gcc, of course.

I'm not really acquainted with assembly language. Much of it is just gibberish to me. :P All I'd be able to conclude with any certainty is that the code is shorter or longer. I've previously found it very hard to look at the -S output from GCC, as the annotation of the assembly with the original source code is all over the place and lines frequently don't seem to bear any relation to the section of assembly code they're placed adjacent to. :(

Unless you are working on a small micro where flash is large but ram is small, why bother?

The situation that has got me thinking about all this is that I have recently been writing some code for exactly that kind of small micro with little RAM. :) The code takes incoming serial data and throws it into a FIFO queue to be processed - some classification, filtering, etc. I am using bitfields for my queue structure in order to save a lot of memory - the difference between taking half the available RAM and only a quarter. But, I need this code to be super speedy, as my program's main loop will be doing a lot of other stuff, so I need the queue-processing code to take as little time as possible.

In my mind, at the moment, the saving on memory outweighs any potential execution speed penalty. But, if it later turns out I'm marginal on being able to adequately 'keep up' with the incoming data stream because my main loop is doing too much other stuff, I'm worried I may have shot myself in the foot by giving myself a performance penalty by using bitfields.
 

Online IanB

  • Super Contributor
  • ***
  • Posts: 11891
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #6 on: April 30, 2018, 11:05:12 pm »
If you write a test like this:

Code: [Select]
if (my_struct.flag_b) { /* do something */ }
Then I don't see why the generated code should be slow at all. It's a simple test of a bit using a bit mask.
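[Editor's note: a sketch of what that flag test compiles down to: a single AND against a constant mask. The bit position (bit 4 here) is an assumption about one particular compiler's layout.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Roughly what `if (my_struct.flag_b) { ... }` becomes: load the packing
 * byte, AND with a constant mask, branch on zero/non-zero. */
static inline bool flag_b_is_set(uint8_t packed_byte)
{
    return (packed_byte & (1u << 4)) != 0;
}
```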
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #7 on: April 30, 2018, 11:30:07 pm »
Here's a little GCC example of endianness and bitfields.

Code: [Select]
#include <stdio.h>

union my_register {
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ /* GCC predefined; bare __BIG_ENDIAN is 0 unless <endian.h> is included */
    struct
    {
        unsigned short Right : 1;
        unsigned short Mid   : 2;
        unsigned short Left  : 5;
    };
#else
    struct
    {
        unsigned short Left  : 5;
        unsigned short Mid   : 2;
        unsigned short Right : 1;
    };
#endif
    unsigned char value;
};

int main(void)
{
    union my_register reg;

    reg.value = (unsigned char) 0b10001101; /* little-endian GCC: Left = bits 0-4, Mid = bits 5-6, Right = bit 7 */

    printf("%#2x %#1x %#1x\n", reg.Left, reg.Mid, reg.Right);
    return 0;
}


As long as you access the bitfields with a fully qualified name rather than masking against a literal like 0b10001101, you don't have to consider endianness or anything else.  The compiler will make it work.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4037
  • Country: nz
Re: C struct bitfields - size versus speed
« Reply #8 on: May 01, 2018, 04:35:47 am »
I never really thought about this until today, but when using bitfields there must be some kind of trade-off between memory size and execution speed, right?

Say you have a struct like the following:

Code: [Select]
typedef struct {
    uint8_t foo;
    uint16_t bar;
    uint8_t baz : 3;
    bool flag_a : 1;
    bool flag_b : 1;
    bool flag_c : 1;
    bool flag_d : 1;
    bool flag_e : 1;
} my_struct_t;

Now, by setting baz to 3 bits (assuming it only needs to represent values 0-7) and using single-bit fields for our 5 boolean flags, we've gone from a total size of 9 bytes to only 4 bytes. Nice, some memory saved. However, when you're writing your code to access these bitfields, what kind of machine code is the compiler actually generating?

I assume there must be some penalty to reads and writes to these fields, in the form of extra instructions necessary to 'resize' the data. Without knowing what actually goes on, I would assume the compiler has to generate masking and shifting instructions. For instance, in the case of assigning some other uint8_t variable to baz, I would assume you would need to mask off the relevant three bits, and depending on the alignment within the struct, shift those 3 bits into the correct position.

Am I right? Or, because the C standards don't actually define how struct bitfields are physically packed, is this all platform- and/or compiler-dependent?

Exactly what instructions the compiler has to generate depends on what instructions the CPU type has available :-)

At one extreme, things such as the 68020 could do that in a single instruction: BFINS Dn, <ea>, [offset:width].  The VAX might well have been able to as well, though I can't recall. It's not necessarily FAST though... in the case of the 68020, BFINS from one register to another takes 10 clock cycles if the instruction is in the cache and it can't be pipelined with other instructions (the most common case).

Simple instructions such as add, sub, and, or, xor take 2 clock cycles. Shifts by static amounts take 4 clock cycles. So you can do several simple operations in the time taken by one BFINS.

At the other extreme, you might not have any support for bit fields at all, in which case you have to do everything with shifts or rotates and masking.  If you at least have rotate (the RISC-V base instruction set doesn't) then you can do this:

ROR t1, rs1, #off
ASL t2, rs3, #32-size
LSR t1, t1, #size
OR rd, t1, t2
ROL rd, rd, #size+off

That will take five clock cycles on most RISC CPUs, or four if the first two instructions can be executed together.
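[Editor's note: the shift-and-mask approach above, written out as portable C helpers. This is roughly what a compiler without bitfield instructions has to spell out; `off` and `width` are generic parameters, with `off + width <= 32` assumed.]

```c
#include <stdint.h>

/* Insert the low `width` bits of `val` into `word` at bit offset `off`. */
static inline uint32_t bf_insert(uint32_t word, uint32_t val,
                                 unsigned off, unsigned width)
{
    uint32_t mask = ((width < 32 ? (1u << width) : 0u) - 1u) << off;
    return (word & ~mask) | ((val << off) & mask);
}

/* Extract a `width`-bit field at bit offset `off` from `word`. */
static inline uint32_t bf_extract(uint32_t word, unsigned off, unsigned width)
{
    uint32_t mask = (width < 32 ? (1u << width) : 0u) - 1u;
    return (word >> off) & mask;
}
```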

In between, you can have instructions that don't actually do the bitfield insert directly, but just make it easier. For example on the Motorola M88000:

CLR t1, rs1, size<off>   //  clears the destination bitfield
MAK t2, rs2, size<off>  // shifts the inserted value to the right place and clears all other bits
OR rd, t1, t2

That can be done in two clock cycles because the first two instructions can be executed at the same time.

If you need to read the stuff from RAM and write it back afterwards then you'd probably never notice the speed difference between any of these!
« Last Edit: May 01, 2018, 04:37:21 am by brucehoult »
 

Offline bson

  • Supporter
  • ****
  • Posts: 2270
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #9 on: May 01, 2018, 04:46:36 am »
I'd be more concerned with the unaligned uint16_t; unless you explicitly flag the structure as packed, you may find the compiler inserts a pad byte.  Or you get more bizarre failures; for example, Cortex-M0/M0+ will hard fault on unaligned accesses while M3/M4/M7 just handle them transparently.  You may find your code works fine on an M4, then you recompile it for M0 and, without an explicit packed attribute, the compiler will feel free to pad the layout of your structure to suit the target.  So if the structure describes a protocol header layout or some other externally determined format, after recompiling for M0 it mysteriously doesn't work at all.  (As a side note, if you write code for M4/M7 that you intend to ever run on M0, it's a good idea to compile it with M0 as the target and set the STKALIGN and UNALIGN_TRP bits in CCR sometime during startup, to make sure it still works and doesn't hard fault on an unaligned access.)
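[Editor's note: a GCC/Clang-specific sketch of the padding point: the same fields with and without `__attribute__((packed))`. The unpacked size depends on the target ABI; the packed one is pinned to 4 bytes (1 + 2 + one byte of bitfields).]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t foo;
    uint16_t bar;   /* likely preceded by a pad byte for 2-byte alignment */
    uint8_t baz : 3;
    bool flag_a : 1;
} unpacked_t;

typedef struct __attribute__((packed)) {
    uint8_t foo;
    uint16_t bar;   /* now possibly misaligned: fine on x86, a trap on M0 */
    uint8_t baz : 3;
    bool flag_a : 1;
} packed_t;

size_t unpacked_size(void) { return sizeof(unpacked_t); }
size_t packed_size(void)   { return sizeof(packed_t); }
```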
 

Offline janekm

  • Supporter
  • ****
  • Posts: 515
  • Country: gb
Re: C struct bitfields - size versus speed
« Reply #10 on: May 01, 2018, 05:15:57 am »
The compiler may not respect your requests for the physical layout of these fields unless you use one of the compiler-specific options to make it a "packed" struct. This can catch you out if these data structures are going to be stored or communicated to somewhere else (say if this was a network packet).
 

Offline Jeroen3

  • Super Contributor
  • ***
  • Posts: 4078
  • Country: nl
  • Embedded Engineer
    • jeroen3.nl
Re: C struct bitfields - size versus speed
« Reply #11 on: May 01, 2018, 05:29:19 am »

It really comes down to how smart the compiler is.
Some things in a language force a compiler to be smarter.

A dumb compiler might treat each bit as a byte.

A little smarter compiler might do shifts and masks.

A really smart compiler would look at where a range of bits exists.
A really smart compiler would use bitbanding. But this is not available in all parts and cores, so compilers prefer to stay portable.
If you need to write&read a lot of random bits, you'd definitely want to use bitbanding.
 

Offline obiwanjacobi

  • Frequent Contributor
  • **
  • Posts: 988
  • Country: nl
  • What's this yippee-yayoh pin you talk about!?
    • Marctronix Blog
Re: C struct bitfields - size versus speed
« Reply #12 on: May 01, 2018, 07:23:09 am »
Some MCUs may support dedicated bit manipulation instructions. If you do not prevent the compiler from using them, the performance will be good.

Here's my take on a bit array (C++) https://github.com/obiwanjacobi/atl/blob/master/Source/Code/ArduinoTemplateLibrary/BitArray.h
[2c]
Arduino Template Library | Zalt Z80 Computer
Wrong code should not compile!
 

Online HwAoRrDkTopic starter

  • Super Contributor
  • ***
  • Posts: 1477
  • Country: gb
Re: C struct bitfields - size versus speed
« Reply #13 on: May 01, 2018, 03:30:41 pm »
I'd be more concerned with the unaligned uint16_t; unless you explicitly flag the structure as packed you may find the compiler inserts a pad byte.

Why would that be? The first two fields, foo and bar, are whole bytes (that is, byte-aligned), so what reason would a compiler have to insert a padding byte? Would it be something to do with platforms that have a large word size? Without telling the compiler to pack closely, do structure fields get padded out to native word size (e.g. 16/32/64 bits) by some compilers on some platforms? What are the reasons for such behaviour?

The compiler may not respect your requests for the physical layout of these fields unless you use one of the compiler-specific options to make it a "packed" struct. This can catch you out if these data structures are going to be stored or communicated to somewhere else (say if this was a network packet).

Is this worth doing just as a matter of routine, even if close packing happens by default? For instance, with AVR-GCC packing of the structure as-defined appears to be the default behaviour - unless my toolchain happens to be specifying some compiler flag by default to do this.
 

Offline glarsson

  • Frequent Contributor
  • **
  • Posts: 814
  • Country: se
Re: C struct bitfields - size versus speed
« Reply #14 on: May 01, 2018, 04:00:13 pm »
Why would that be? The first two fields, foo and bar are whole bytes (that is, byte-aligned), so what reason would a compiler have to insert a padding byte? Would it be something to do with platforms that have a large word size? Without telling the compiler to pack closely, do structure fields get padded out to native word size (e.g. 16/32/64 bits) by some compilers on some platforms? What are the reasons for such behaviour?
Most processor architectures like padding to their word sizes. For misaligned variables, MIPS, PowerPC, x86, etc. will perform two memory accesses and an and/shift/or to get the wanted content. Other processors like SPARC will not do this and just fail (SIGBUS on UNIX systems), as they cannot complete the instruction within one clock cycle.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26907
  • Country: nl
    • NCT Developments
Re: C struct bitfields - size versus speed
« Reply #15 on: May 01, 2018, 04:05:03 pm »
The compiler will create bit shifts & masks to get and store the bits from the struct so yes it will be a lot slower.
When you say "a lot slower", are you exaggerating for effect, or do you mean that?
It is slower because reading a value consisting of whole bytes just takes one instruction (on a modern 32 bit CPU). Taking a value apart into bits means extra machine code for shifting & masking the values.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline FlyingHacker

  • Frequent Contributor
  • **
  • Posts: 807
  • Country: us
  • You're Doing it Wrong
Re: C struct bitfields - size versus speed
« Reply #16 on: May 01, 2018, 04:33:22 pm »
Often the best bet is to put some test code into a big loop and time the code different ways to see what the speed differences are with your specific compiler and hardware.

You also have to evaluate things like how much is the code used? Is it in a loop executed millions of times, or is it used once or twice? Weigh that against memory usage, ease of code reading, and ease of programming...
--73
 

Offline helius

  • Super Contributor
  • ***
  • Posts: 3642
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #17 on: May 01, 2018, 05:07:03 pm »
At one extreme, things such as the 68020 could do that in a single instruction: BFINS Dn, <ea>, [offset:width].  The VAX might well have been able to as well, though I can't recall.

The VAX has INSV src, offset, width, dest
But with little-endian addressing, so the bitfield's offset points to its LSB. With the MC68020, the offset of a bitfield points to its MSB.
In both architectures the offset can be up to 268435455 bytes away from the dest. This allows you to do bitfield ops on structures directly given the struct's address, instead of needing to calculate the address of the word containing the bitfield.
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14480
  • Country: fr
Re: C struct bitfields - size versus speed
« Reply #18 on: May 01, 2018, 05:21:43 pm »
Bit fields are not very portable and are essentially sugar coating.

They may be faster on a given target if it has specific bit-manipulation instructions and the compiler generates code for them, but even so, the benefit would depend on the number of bits in the same word you are working on. If you have to manipulate several bits in the same word at the same time, it's usually much faster to use conventional integer bit operations such as &, |, ~, ^. Of course, a good compiler may optimize your code in that way with bit fields as well. So the answer is, you have to check the assembly output of your compiler in different situations to see for yourself.

Apart from optimization reasons, I tend to avoid bit fields because they are usually not portable, especially if you intend each specific bit field to have a specific position in the encapsulating data word, so compiling on a different target may get you completely different results.
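[Editor's note: a sketch of the conventional-integer alternative mentioned above: named masks over one uint8_t, so several flags can be set, cleared, or toggled in a single AND/OR/XOR. The mask names are illustrative, not from the original post.]

```c
#include <stdint.h>

enum {
    FLAG_A = 1u << 0,
    FLAG_B = 1u << 1,
    FLAG_C = 1u << 2,
    FLAG_D = 1u << 3,
    FLAG_E = 1u << 4,
};

/* each call touches any number of flags in one integer operation */
static inline uint8_t flags_set(uint8_t f, uint8_t m)    { return f | m; }
static inline uint8_t flags_clear(uint8_t f, uint8_t m)  { return f & (uint8_t)~m; }
static inline uint8_t flags_toggle(uint8_t f, uint8_t m) { return f ^ m; }
```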

 

Offline Howardlong

  • Super Contributor
  • ***
  • Posts: 5319
  • Country: gb
Re: C struct bitfields - size versus speed
« Reply #19 on: May 01, 2018, 06:37:50 pm »

It really comes down to how smart the compiler is.
Some things in a language force a compiler to be smarter.

A dumb compiler might treat each bit as a byte.

A little smarter compiler might do shifts and masks.

A really smart compiler would look at where a range of bits exists.
A really smart compiler would use bitbanding. But this is not available in all parts and cores, so compilers prefer to stay portable.
If you need to write&read a lot of random bits, you'd definitely want to use bitbanding.

As I understand it, bit banding such as on the Cortex M3 and M4 aliases each bit in the bit-band region to a word in a separate alias address space, allowing single bit atomic test/set/clear operations and speeding up single-bit access on those cores. It is no good for manipulating multiple bits concurrently.

If you have fields of more than one bit, then some cores offer bitfield instructions such as the Cortex M3 and M4, but these are register only instructions, so no good if you’re looking for atomic memory RMW, but they can be computationally efficient for the OP’s requirement.

The PIC16/PIC24/dsPIC offer efficient direct atomic single bit manipulation across the entire RAM address space, but it’s no benefit for multi-bit bit fields. However the PIC32 manages RMW differently by providing aliases for memory mapped peripheral registers so you can set, clear and invert as many bits as you like atomically within a machine word: this is only available on peripheral registers though, so no good for your own RAM based structures.
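[Editor's note: for reference, ARM's documented Cortex-M3/M4 SRAM bit-band address mapping, as a C helper. Each bit of the 0x20000000 region maps to a word at `0x22000000 + offset*32 + bit*4`. Computing the alias address is portable; dereferencing it only works on a core that actually has bit-banding.]

```c
#include <stdint.h>

/* Alias-word address for bit `bit` of the SRAM byte at `byte_addr`. */
static inline uint32_t bitband_sram_alias(uint32_t byte_addr, unsigned bit)
{
    return 0x22000000u + ((byte_addr - 0x20000000u) * 32u) + (bit * 4u);
}
```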
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26907
  • Country: nl
    • NCT Developments
Re: C struct bitfields - size versus speed
« Reply #20 on: May 01, 2018, 06:52:52 pm »
Bit fields are not very portable and are essentially sugar coating.

They may be faster on a given target if it has specific bit-manipulation instructions and the compiler generates code for them, but even so, the benefit would depend on the number of bits in the same word you are working on. If you have to manipulate several bits in the same word at the same time, it's usually much faster to use conventional integer bit operations such as &, |, ~, ^. Of course, a good compiler may optimize your code in that way with bit fields as well. So the answer is, you have to check the assembly output of your compiler in different situations to see for yourself.
I disagree. Bit fields are a good way to save space without making your code obfuscated. Let the C compiler deal with the nasty stuff!
Quote
Apart from optimization reasons, I tend to avoid bit fields because they are usually not portable, especially if you intend to have each specific bit field to have a specific position in the encapsulating data word, so compiling on a different target may get you completely different results.
Sharing structs between platforms/compilers (for example to communicate) isn't a good thing to do anyway because of alignment issues and packing strategies. I always tell people not to map structs and plain C types onto byte buffers because it will go wrong at some point.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline andyturk

  • Frequent Contributor
  • **
  • Posts: 895
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #21 on: May 02, 2018, 12:57:11 am »
Sharing structs between platforms/compilers (for example to communicate) isn't a good thing to do anyway because of alignment issues and packing strategies. I always tell people not to map structs and plain C types onto byte buffers because it will go wrong at some point.
:-+
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: C struct bitfields - size versus speed
« Reply #22 on: May 02, 2018, 01:38:11 am »
Sharing structs between platforms/compilers (for example to communicate) isn't a good thing to do anyway because of alignment issues and packing strategies. I always tell people not to map structs and plain C types onto byte buffers because it will go wrong at some point.

Look at the file formats (ELF, ZIP etc.), or communication protocols (such as TCP/IP). They all work with records which are essentially mapped C structures. They work quite well across platforms without causing issues. Of course, when these structures were created they were aligned by hand - there's no need for padding (and if it is then the padding is explicitly added). Byte/bit order is also fixed by design. The designers were clever about this. But so can you.

Besides, if you don't waste your time packing and unpacking structures, you will be more productive - you'll write more :)

 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14480
  • Country: fr
Re: C struct bitfields - size versus speed
« Reply #23 on: May 02, 2018, 08:30:39 am »
Sharing structs between platforms/compilers (for example to communicate) isn't a good thing to do anyway because of alignment issues and packing strategies. I always tell people not to map structs and plain C types onto byte buffers because it will go wrong at some point.

Look at the file formats (ELF, ZIP etc.), or communication protocols (such as TCP/IP). They all work with records which are essentially mapped C structures. They work quite well across platforms without causing issues. Of course, when these structures were created they were aligned by hand - there's no need for padding (and if it is then the padding is explicitly added). Byte/bit order is also fixed by design. The designers were clever about this. But so can you.

Besides, if you don't waste your time packing and unpacking structures, you will be more productive - you'll write more :)

Absolutely. Forcing the alignment of struct members is usually straightforward and reasonably portable. All C compilers I have come across for targets from 8-bitters to 64-bit CPUs had means of defining alignment in non-ambiguous ways.
On some platforms, you still have to worry about unaligned access - they can be either slower (on most targets, which is why the native default alignment usually matches the target's data bus width) or lead to an exception (such as on the 16-bit PIC micros). That can be worked around of course.

 

Offline andyturk

  • Frequent Contributor
  • **
  • Posts: 895
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #24 on: May 02, 2018, 08:45:11 am »
Besides, if you don't waste your time packing and unpacking structures, you will be more productive - you'll write more :)
You might have endian conversions in a portable format, so a direct-mapped structure could still need packing/unpacking.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26907
  • Country: nl
    • NCT Developments
Re: C struct bitfields - size versus speed
« Reply #25 on: May 02, 2018, 09:48:12 am »
Sharing structs between platforms/compilers (for example to communicate) isn't a good thing to do anyway because of alignment issues and packing strategies. I always tell people not to map structs and plain C types onto byte buffers because it will go wrong at some point.
Look at the file formats (ELF, ZIP etc.), or communication protocols (such as TCP/IP). They all work with records which are essentially mapped C structures. They work quite well across platforms without causing issues. Of course, when these structures were created they were aligned by hand - there's no need for padding (and if it is then the padding is explicitly added). Byte/bit order is also fixed by design. The designers were clever about this. But so can you.
And still there are plenty of ways this can go wrong, giving very subtle errors which you won't catch doing unit testing on a different platform. Being clever isn't always being smart.
I've seen this problem pop up a couple of times in a large software project I (and some others) inherited. This took a couple of days to hunt down, so there goes your productivity.
Besides that, if you pack a struct the compiler may start to shuffle bytes around anyway, because it is likely you provide a byte (void) pointer to the data and the compiler can no longer know how the data is aligned in memory. Remember that many platforms (for example ARM) cannot do unaligned 16 or 32 bit reads/writes, and it may not even lead to an exception. And then there is big-endian versus little-endian conversion. All in all it is better to create a program with defined behaviour using byte shifts to read/write data into a byte array. It won't be slower anyway, because a lot of protocols are big endian and most processors used nowadays are little endian.
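[Editor's note: the byte-shift approach described above, as a pair of helpers. They produce the same bytes on any host, require no alignment of `buf`, and are the usual portable alternative to mapping a struct onto a buffer. The function names are illustrative.]

```c
#include <stdint.h>

/* Serialize a 16-bit value big-endian into an arbitrary byte buffer. */
static inline void put_u16_be(uint8_t *buf, uint16_t v)
{
    buf[0] = (uint8_t)(v >> 8);
    buf[1] = (uint8_t)(v & 0xFF);
}

/* Deserialize a big-endian 16-bit value from an arbitrary byte buffer. */
static inline uint16_t get_u16_be(const uint8_t *buf)
{
    return (uint16_t)((buf[0] << 8) | buf[1]);
}
```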
« Last Edit: May 02, 2018, 09:54:05 am by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: C struct bitfields - size versus speed
« Reply #26 on: May 02, 2018, 02:28:46 pm »
And still there are plenty of ways this can go wrong giving very subtile errors which you won't catch doing unit testing on a different platform. Being clever isn't always being smart.

This is typical of any programming. Not all the bugs are easy to catch. Therefore, you need to think - think when designing software and think when designing your tests. And if you find a bug, start blaming yourself, so that next time you make fewer bugs and find them faster. If you blame these things on alignment, endianness, or other similar factors, you not only miss learning from the mistakes, but you will cast lots of useful tools out of your toolbox because they were "unsafe" in your past experiences.

... a lot of protocols are big endian and most processors used nowadays are little endian.

If you get data with different endianness, you have to deal with it anyway. But writing piles of unnecessary code is much more prone to bugs than simply calling stub functions a la htons().

Alignment is never a problem if your structures don't have any internal misalignments - you align the whole structure and everything gets aligned automatically.
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26907
  • Country: nl
    • NCT Developments
Re: C struct bitfields - size versus speed
« Reply #27 on: May 02, 2018, 02:42:01 pm »
And still there are plenty of ways this can go wrong giving very subtile errors which you won't catch doing unit testing on a different platform. Being clever isn't always being smart.
This is typical to any programming. Not all the bugs are easy to catch. Therefore, you need to think - think when designing software and think when designing your tests. And if you find a bug, start blaming yourself, so that next time you make less bugs and find them fasters. If you start blaming these things on alignment, endianness, or other similar factors, you not only miss learning from the mistakes, but you will cast lots of useful tools out of your toolbox because they were "unsafe" in your past experiences.
Actually my past experiences have taught me not to do esoteric stuff in C like relying on how structs are mapped in memory. I write my code in a way even a complete idiot can maintain it successfully when I have moved on to a more interesting project. There are enough ways left in C to shoot yourself in the foot, so don't make things more complicated than they have to be. BTW you seem to have missed that I inherited the project with the alignment bug which wasn't caught by unit testing on a PC.
Quote
... a lot of protocols are big endian and most processors used nowadays are little endian.
If you get data with different endianness, you have to deal with it anyway. But writing piles of unnecessary code is much more prone to bugs than simply calling stub functions a la htons().

Alignment is never a problem if your structures don't have any internal misalignments - you align the whole structure and everything gets aligned automatically.
That is a very big IF. What if your input buffer gets misaligned because someone changes a pointer somewhere or inserts an extra field? Besides that, you can create a simple wrapper like htons/htonl (which is platform independent as a bonus) yourself to fill a buffer. Good, platform-independent protocol implementations work that way, because then correctness doesn't depend on the capabilities of the person maintaining that code and/or compiler-dependent settings/pragmas/attributes.
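A minimal sketch of the kind of htons/htonl-style buffer-filling wrapper described above (the function names `put_be16`, `put_be32` and `get_be16` are hypothetical):

```c
#include <stdint.h>

/* Write multi-byte values into a byte buffer in big-endian (network) order.
   Byte-by-byte stores work regardless of host endianness or alignment. */
static inline void put_be16(uint8_t *buf, uint16_t v)
{
    buf[0] = (uint8_t)(v >> 8);
    buf[1] = (uint8_t)(v & 0xFFu);
}

static inline void put_be32(uint8_t *buf, uint32_t v)
{
    buf[0] = (uint8_t)(v >> 24);
    buf[1] = (uint8_t)(v >> 16);
    buf[2] = (uint8_t)(v >> 8);
    buf[3] = (uint8_t)(v & 0xFFu);
}

static inline uint16_t get_be16(const uint8_t *buf)
{
    return (uint16_t)(((uint16_t)buf[0] << 8) | buf[1]);
}
```

Because these never dereference anything wider than a byte, a misaligned buffer can't cause a fault or a silently wrong value.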
« Last Edit: May 02, 2018, 03:35:14 pm by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: C struct bitfields - size versus speed
« Reply #28 on: May 02, 2018, 03:21:58 pm »
For what it is worth, I've found data accessor functions to be much more robust and maintainable than struct bitfields, in the long term.

The main reason is that the order in which bitfields are packed is completely up to the compiler. Compilers can even have different conventions on different architectures - and there is nothing barring them from changing it from one version to the next.  Plus, people change both targets and compilers quite often, too.

It is also easier to thoroughly test the accessor functions.  For example, you can test bitstream operations by comparing to a slow, known-good version, that extracts the values one bit at a time; or by using a test vector and comparing to known results.  For structs with bitfields, you basically need a comprehensive set of structs (saved in binary), then verify the struct correctly maps to the binary data by comparing the test structs to known values.  Very few programmers bother.

As to speed, you're going to have a hard time measuring the difference in timing between structs with bitfields and accessor functions (unless you write horribly stupid accessor functions, that e.g. use a loop to extract individual bits from the same source word/byte).  This is because memory access, even on embedded systems, is slower than the few bit shift and mask operations needed to extract/pack a field; we're talking about fewer than a dozen clock cycles, which pipeline extremely well on superscalar architectures.  If you find a case where you can measure the difference, I bet there is an even faster approach (typically by avoiding accessing the packed fields that often, and using an unpacked, fast structure for the repeated accesses instead). In marginal cases, like on a microcontroller, you can optimize the accessor functions just for that hardware.
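As a sketch of the accessor-function approach, with a bit-at-a-time known-good reference to test against (the LSB-first bit order within each byte and the function names are assumptions here):

```c
#include <stdint.h>
#include <stddef.h>

/* Extract "width" bits (1..8) starting at absolute bit offset "off" in a
   byte buffer, LSB-first within each byte. */
static uint8_t get_bits(const uint8_t *buf, size_t off, unsigned width)
{
    unsigned shift = off % 8u;
    uint16_t w = buf[off / 8u];
    if (shift + width > 8u)                 /* field straddles a byte boundary */
        w |= (uint16_t)((uint16_t)buf[off / 8u + 1u] << 8);
    return (uint8_t)((w >> shift) & ((1u << width) - 1u));
}

/* Slow, obviously-correct reference: one bit at a time. */
static uint8_t get_bits_ref(const uint8_t *buf, size_t off, unsigned width)
{
    uint8_t v = 0;
    for (unsigned i = 0; i < width; i++)
        v |= (uint8_t)(((buf[(off + i) / 8u] >> ((off + i) % 8u)) & 1u) << i);
    return v;
}
```

Comparing `get_bits` against `get_bits_ref` over a test vector exercises every offset/width combination you care about, which is exactly the kind of exhaustive check that is awkward to do for compiler-laid-out struct bitfields.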
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: C struct bitfields - size versus speed
« Reply #29 on: May 02, 2018, 03:59:41 pm »
... you seem to have missed that I inherited the project with the alignment bug.

A bug is always a consequence of a programmer's mistake. Someone years ago hired an incompetent programmer and he made a bug. You wouldn't say that the bad hiring decision is somehow related to the alignment, would you?

That is a very big IF. What if your input buffer gets misaligned because someone changes a pointer somewhere or inserts an extra field?

This is a very good question. If you use a structure in communications, then your structure may change in the future. Therefore, you design it in such a way that the communicating parties have means to figure out which version of the structure they get. At the very least you include a length field, and you make sure your structure is easy to align by placing padding or reserved fields at the end. Such measures ensure that any number of programmers can use your communication protocol, and it works well across versions and platforms.

If one of the thousands of programmers who use your protocol won't take time to understand the mechanism and misaligns something or otherwise fails to follow the protocol, this is clearly a bug. You don't want to make 999 good programmers write extra parsing code just because one programmer is incompetent, right?

I think this thread has deviated from the original question. The OP didn't ask about communications. He asked whether it's a good idea to use bitfields (combining multiple individual variables into a single integer) or whether it's better to use separate variables.

I personally do not use bitfields; I prefer bitwise logic and masks. Such an approach appears more flexible to me, but, under the hood, it's the same as a bitfield.

If variables are 1 bit long (TRUE or FALSE) then it's a good idea to unite them into a single number. Many CPUs have some sort of instructions to access single bits. More importantly, you can access several variables together. For example, you can test whether any one of flag_a, flag_d or flag_e is set in a single operation. If these were different variables, it wouldn't be so easy. Similarly, you can set multiple flags by OR'ing, clear them by AND'ing, etc. Also, you can pass the whole set of flags to a function as a single parameter.
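A sketch of that single-number flag idiom (the flag names follow the OP's struct; the helper names are made up):

```c
#include <stdint.h>
#include <stdbool.h>

#define FLAG_A (1u << 0)
#define FLAG_B (1u << 1)
#define FLAG_C (1u << 2)
#define FLAG_D (1u << 3)
#define FLAG_E (1u << 4)

/* Test several flags at once: one AND, one compare. */
static inline bool any_of_a_d_e(uint8_t flags)
{
    return (flags & (FLAG_A | FLAG_D | FLAG_E)) != 0u;
}

static inline uint8_t set_flags(uint8_t flags, uint8_t which)
{
    return (uint8_t)(flags | which);            /* set several by OR'ing */
}

static inline uint8_t clear_flags(uint8_t flags, uint8_t which)
{
    return (uint8_t)(flags & (uint8_t)~which);  /* clear several by AND'ing */
}
```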

For 2-3-bit-long variables, access to a bitfield becomes inefficient. To set the bitfield, you need to read the variable, AND it with the mask, XOR it with the value, then write back. Some CPUs can combine the AND and XOR in a single instruction (and even do a shift), but there's still a need to read, modify, and write. A single variable requires only a write, which definitely beats the bitfield.

Of course, an 8-bit bitfield is as good as a separate variable.
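The read/AND/XOR/write sequence described above, spelled out for a hypothetical 2-bit field at bit positions 4-5:

```c
#include <stdint.h>

#define FLD_POS  4u
#define FLD_MASK (0x3u << FLD_POS)

/* Read-modify-write of a 2-bit field. After the AND clears the field,
   the XOR is equivalent to an OR for merging the new bits in. */
static inline uint8_t write_field(uint8_t word, uint8_t new_bits)
{
    uint8_t tmp = word;                                            /* read        */
    tmp &= (uint8_t)~FLD_MASK;                                     /* AND (clear) */
    tmp ^= (uint8_t)(((unsigned)new_bits << FLD_POS) & FLD_MASK);  /* XOR (merge) */
    return tmp;                                                    /* write back  */
}
```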

« Last Edit: May 02, 2018, 06:14:09 pm by NorthGuy »
 

Offline helius

  • Super Contributor
  • ***
  • Posts: 3642
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #30 on: May 02, 2018, 06:00:52 pm »
For 2-3-bit-long variables, access to a bitfield becomes inefficient. To set the bitfield, you need to read the variable, AND it with the mask, XOR it with the value, then write back. Some CPUs can combine the AND and XOR in a single instruction (and even do a shift), but there's still a need to read, modify, and write. A single variable requires only a write, which definitely beats the bitfield.

Of course, an 8-bit bitfield is as good as a separate variable.
When you consider the whole hardware architecture, single-byte writes also require some form of read-modify-write on anything beyond very simple 8-bit micros. Writing a solitary byte that misses in the cache requires the entire cache block to be loaded, then modified, and finally written back. In many cases uncachable writes smaller than a full word are not allowed.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: C struct bitfields - size versus speed
« Reply #31 on: May 02, 2018, 06:22:58 pm »
Of course, an 8-bit bitfield is as good as a separate variable.
When you consider the whole hardware architecture, single-byte writes also require some form of read-modify-write on anything beyond very simple 8-bit micros. Writing a solitary byte that misses in the cache requires the entire cache block to be loaded, then modified, and finally written back. In many cases uncachable writes smaller than a full word are not allowed.

I don't understand how this makes an aligned 8-bit bitfield different from a standalone char variable.

Besides, there are CPUs without cache (such as PIC16), or CPUs with cache lines much larger than 32-bit (such as modern Intel).
 

Offline helius

  • Super Contributor
  • ***
  • Posts: 3642
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #32 on: May 02, 2018, 06:26:20 pm »
The point is more that a single 8-bit datum (a char member) may not have any advantage over a 7 or 9 bit field in a struct, as RMW would be required in both cases. The only difference is that the char may be written with a simple instruction, but the hardware has to take care of it in both cases.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: C struct bitfields - size versus speed
« Reply #33 on: May 02, 2018, 07:25:49 pm »
The point is more that a single 8-bit datum (a char member) may not have any advantage over a 7 or 9 bit field in a struct, as RMW would be required in both cases. The only difference is that the char may be written with a simple instruction, but the hardware has to take care of it in both cases.

Yes. Big CPUs like Intel's do their own optimizations as they execute instructions, which work better than what the C compiler can do. It's very hard to predict how fast the code may run, or even to measure the execution time accurately.
 

Offline abyrvalg

  • Frequent Contributor
  • **
  • Posts: 824
  • Country: es
Re: C struct bitfields - size versus speed
« Reply #34 on: May 02, 2018, 11:09:29 pm »
Big CPUs like Intel's do their own optimizations as they execute instructions, which work better than what the C compiler can do.
The compilers have a huge potential advantage over CPUs for optimizations: they have more time and more information about the code. It’s just the x86 world’s poor compatibility tradeoff limiting them: you never know at compile time what CPU microarch will be used at runtime, so it’s CPU’s job to do the final optimizations at runtime (repeating the same things at every run, having less time, seeing smaller pieces of code - what a pity).
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #35 on: May 06, 2018, 05:40:12 am »
There are a lot of potential optimizations, bottlenecks, syntactic advantages and portability pitfalls associated with bitfields.
Everything CAN be done using shift and mask operations, but if that's what your compiler does ALL the time, it's not a very good compiler!

8-bit and other "small" architectures tend to have specialized instructions for dealing with single bits.  MANY processors have "some" capabilities WRT *certain* single bits (ie "branch if minus", carry bits, etc.)  So right away you're faced with "produce different code depending on size and position of bitfield" possibilities.  Some ARM Cortex chips have bit-banding for accessing single bits, and "bit field" instructions for larger bit fields (CM3 and up.)  Freescale seems to have added a "bit manipulation engine" to some of their CM0 chips, that extends the bit-banding idea to multiple bit fields and additional operations, but only for the peripheral address space.

Most 32bit chips have barrel shifters, but most 8bit chips don't (which means that they might do a 6bit shift in a loop that takes 6+ instruction times.)  OTOH, they might have the option of fetching only the necessary byte of a longer-than-byte bitfield.  Or have a nybble-swap that's equivalent to a 4-bit rotate.

Setting or comparing bitfields with constants may be easier than variables, because you can shift the constant instead of the variable.  And especially if the constant is all zero bits or all one bits.
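For example, comparing a field against a constant can avoid shifting the variable entirely, because the compiler folds the shifted constant (the `MODE_POS`/`MODE_MASK` names and field layout are hypothetical):

```c
#include <stdint.h>
#include <stdbool.h>

#define MODE_POS  3u
#define MODE_MASK (0x7u << MODE_POS)

/* (5u << MODE_POS) is computed at compile time, so the test is just one
   AND and one compare - no runtime shift of the variable. */
static inline bool mode_is_5(uint8_t reg)
{
    return (reg & MODE_MASK) == (5u << MODE_POS);
}
```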

 

Online gf

  • Super Contributor
  • ***
  • Posts: 1182
  • Country: de
Re: C struct bitfields - size versus speed
« Reply #36 on: May 06, 2018, 08:26:23 am »
Quote
Writing a solitary byte that misses in the cache requires the entire cache block to be loaded, then modified, and finally written back. In many cases uncachable writes smaller than a full word are not allowed.

Btw, "cache" is a good keyword. Memory operand access times for an L1 cache hit vs. an L3 cache miss differ by about two orders of magnitude. So it can make a significant performance difference whether the whole working set fits into L1 or L2 cache, or whether we are facing cache misses and fetches from DRAM frequently. Reducing the size of data structures may help to achieve this goal (I'm thinking e.g. of a huge array of structures), even if it then takes a couple more instructions to access the bitfield data. For instance, if an x86 does not need to stall on a DRAM fetch for, say, 100 ns due to an L3 cache miss, it can execute many instructions in that time. It all depends on the memory access patterns of the individual application and on the hardware architecture, of course. For some microcontrollers these considerations may not apply at all. Performance analysis/tuning needs to be done for each use case individually.
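To put numbers on the working-set angle with the OP's structure (exact sizes are implementation-defined; the figures in the comments assume a typical 32-bit GCC target):

```c
#include <stdint.h>
#include <stdbool.h>

/* One byte per flag: typically 10 bytes including padding. */
typedef struct {
    uint8_t  foo;
    uint16_t bar;
    uint8_t  baz;
    bool     flag_a, flag_b, flag_c, flag_d, flag_e;
} rec_unpacked_t;

/* Bitfield version from the opening post: typically 6 bytes, so a large
   array of these shrinks accordingly and more records fit per cache line. */
typedef struct {
    uint8_t  foo;
    uint16_t bar;
    uint8_t  baz    : 3;
    bool     flag_a : 1, flag_b : 1, flag_c : 1, flag_d : 1, flag_e : 1;
} rec_packed_t;
```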

 

Online gf

  • Super Contributor
  • ***
  • Posts: 1182
  • Country: de
Re: C struct bitfields - size versus speed
« Reply #37 on: May 06, 2018, 09:04:21 am »
Quote
Big CPUs like Intel's do their own optimizations as they execute instructions

Most of these CPU-internal "optimizations" are, however, related to instruction pipelining, i.e. the CPU tries to avoid pipeline stalls by doing out-of-order execution, branch prediction, speculative execution, ...

[ Well, we would not need this if Moore's law also applied to CPU clock frequency and DRAM latency. However, we already had 2 GHz-clocked CPUs about ten years ago, but we don't have 500 GHz CPUs today - so the CPU makers had to find other tricks to keep increasing performance. ]
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: C struct bitfields - size versus speed
« Reply #38 on: May 06, 2018, 01:41:48 pm »
Quote
Big CPUs like Intel's do their own optimizations as they execute instructions
Most of these CPU-internal "optimizations" are, however, related to instruction pipelining, i.e. the CPU tries to avoid pipeline stalls by doing out-of-order execution, branch prediction, speculative execution, ...

That's exactly what you need here. It can pre-fetch the data, and it can postpone the data write until after the instruction, so what you have left is only the operation itself. Furthermore, Intel has a number of parallel execution units, so multiple instructions can execute in parallel. Because of all these mechanisms, the execution of a series of instructions may take only one cycle. Moreover, it is possible that inserting an extra instruction in just the right place makes execution faster compared to the same code without the instruction.

Of course, this makes execution timing totally unpredictable - if things go wrong, the CPU may stall waiting for memory for an enormous amount of time. Not what you want in an embedded system.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6264
  • Country: fi
    • My home page and email address
Re: C struct bitfields - size versus speed
« Reply #39 on: May 06, 2018, 06:57:19 pm »
Btw, "cache" is a good keyword. Memory operand access times for an L1 cache hit vs. an L3 cache miss differ by about two orders of magnitude. So it can make a significant performance difference whether the whole working set fits into L1 or L2 cache, or whether we are facing cache misses and fetches from DRAM frequently. Reducing the size of data structures may help to achieve this goal
Yup; another is split/recombine data that may or may not be logically related, but are typically accessed at the same time. The keyword is then "cache locality".

Performance analysis/tuning needs to be done for each use case individually
with actual data.

It is often not hard to microbenchmark specific operations, and for things like microcontrollers this may be enough. On pipelined superscalar machines (with complex caching schemes shared at some level between cores), one must benchmark the overall effect of an approach, because of the overall complexity of the situation. Something that yields fantastic figures in microbenchmarks, but overall slows down the entire algorithm when used in practice, is not unheard of: bad patterns of cache usage (or just touching a lot of cache lines, changing the caching patterns and how the CPU predicts future accesses) easily does that.  Which is why I personally draw a sharp distinction between microbenchmarking (an operation or an algorithm) and actually benchmarking an approach (an implementation tested with actual data). The former is indicative; the latter is a finding.

As an example, I've seen developers get very surprised when they find that optimizing some code for size makes it run faster than with otherwise aggressive optimizations enabled. On arches (like Intel/AMD x86 and x86-64) where the hardware does a lot of speculative execution, some types of conditional expressions are cheap, while others are expensive, and the difference may depend entirely on exactly how your compiler behaves.

(None of the compilers I've used do vectorization particularly well for C, either.  ICC is probably the best in this regard, but one definitely cannot rely on it, especially because of how it treats non-Intel processors at runtime -- unless you know you'll only run the code on Intel processors, of course.  This is a bit off topic for this thread, because most code that uses vectorization uses it for floating-point components; for binary operations, vectorization is only useful if you perform the same operation with optionally different operands to many consecutive 8, 16, 32, or 64-bit sized aligned units at the same time.)
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #40 on: May 06, 2018, 11:59:36 pm »
BTW, the biggest "oops" with bitfields that I've seen recently wasn't related to bit ordering or padding, but to the "atomic" functions provided by the hardware.  You might think that a line like:
Code: [Select]
SERCOM5->USART.INTENCLR.bit.ERROR = 1;
Would clear the ERROR interrupt enable for SERCOM5, right?  But NO!  The INTENCLR register reads ones in bit positions where the interrupts are currently enabled, so when the compiled code reads the register, OR's in the ERROR bit, and writes it back out (unoptimized bitfield code, and a CM0 that doesn't really have any optimization possibilities), it ends up disabling ALL the interrupts that were enabled!
This statement works correctly, and probably generates less code as well...
Code: [Select]
SERCOM5->USART.INTENCLR.reg = SERCOM_USART_INTENCLR_ERROR;
https://community.atmel.com/forum/problem-clearingsetting-bit-interrupt-flag-register


The debate over bit-field non-portability is ... amusing ... in light of ARM's CMSIS essentially standardizing on the use of "bitfields overlaid on hardware registers."  (not even a nod to packing or padding.)   I guess theoretically, these are definitions closely associated with a particular hardware implementation, provided with a particular compiler, so it's not so important.  OTOH, ARMv7-M (CM3 and higher) theoretically has user-selectable endianness (but only at RESET time), and I've never seen a CMSIS file that provides big-endian definitions.  (OTTH, I don't think I've seen an ARMv7 that implements the endianness selection, nor one that's big-endian.)

 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11260
  • Country: us
    • Personal site
Re: C struct bitfields - size versus speed
« Reply #41 on: May 07, 2018, 12:17:38 am »
At this point in time, I would say that non-portability of bit fields is a hardware problem. There is a lot of code that nobody wants to rewrite, so making hardware that does not naturally support those assumptions is a sure way to get DOA hardware. It is like designing a new CPU to be big-endian only. Good luck with that.

Same goes for compilers. There are plenty of very good choices, so if you make a new compiler and it breaks those assumptions, you will have a hard time marketing that compiler.

I use packed structs with bitfields in all my software and so far I have never run into a problem. And I get the best code for the situation, since I clearly communicate to the compiler what I want to do.
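A sketch of that style for a hypothetical wire header, pinned to GCC/Clang on a little-endian target (the `packed` attribute and the LSB-first bitfield order are exactly the compiler-specific assumptions being relied on):

```c
#include <stdint.h>

/* 4-byte header: byte 0 holds two 4-bit fields, bytes 1..2 an unaligned
   uint16_t (hence "packed"), byte 3 a flags byte. */
typedef struct __attribute__((packed)) {
    uint8_t  version : 4;   /* bits 0..3 of byte 0, with GCC's LSB-first order */
    uint8_t  type    : 4;   /* bits 4..7 */
    uint16_t length;
    uint8_t  flags;
} header_t;
```

As noted throughout the thread, this compiles to exactly the masks and shifts you'd write by hand, but the layout is only guaranteed for this compiler/target combination.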
Alex
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4037
  • Country: nz
Re: C struct bitfields - size versus speed
« Reply #42 on: May 07, 2018, 02:07:04 am »
BTW, the biggest "oops" with bitfields that I've seen recently wasn't related to bit ordering or padding, but to the "atomic" functions provided by the hardware.  You might think that a line like:
Code: [Select]
SERCOM5->USART.INTENCLR.bit.ERROR = 1;
Would clear the ERROR interrupt enable for SERCOM5, right?  But NO!  The INTENCLR register reads ones in bit positions where the interrupts are currently enabled, so when the compiled code reads the register, OR's in the ERROR bit, and writes it back out (unoptimized bitfield code, and a CM0 that doesn't really have any optimization possibilities), it ends up disabling ALL the interrupts that were enabled!
This statement works correctly, and probably generates less code as well...
Code: [Select]
SERCOM5->USART.INTENCLR.reg = SERCOM_USART_INTENCLR_ERROR;
https://community.atmel.com/forum/problem-clearingsetting-bit-interrupt-flag-register

That's really nothing to do with C compilers being stupid or bitfield code not being compiled properly. It's purely down to the programmer treating something like RAM that doesn't behave like RAM. Or the hardware designer providing a badly designed interface.

Can this register only disable interrupts, and is there another register for enabling them? If that's the case then it's nothing like memory and shouldn't be treated as if it were. But the compiler can't know that.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Re: C struct bitfields - size versus speed
« Reply #43 on: May 07, 2018, 04:20:03 am »
Or the hardware designer providing a badly designed interface.

If marketing says you must build your MCUs with an ARM CPU which doesn't have suitable instructions for the task, what can the poor designer do?
 

Offline andyturk

  • Frequent Contributor
  • **
  • Posts: 895
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #44 on: May 07, 2018, 04:36:56 am »
The debate over bit-field non-portability is ... amusing ...
If your code is always compiled with the same compiler and always for ARM Cortex (e.g., memory mapped registers), then bitfields are fair game from a portability perspective. But if hardware targets and toolchains vary (e.g., for network formats), then staying with standard types and bit masks might be a better choice.

ARM's SVDConv.exe can generate mask and shift values, but most of the .svd files I've seen use bitfields.
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4199
  • Country: us
Re: C struct bitfields - size versus speed
« Reply #45 on: May 07, 2018, 05:18:40 am »
Quote
[experience with INTENCLR register bitfields]
That's really nothing to do with C compilers being stupid or bitfield code not being compiled properly.
Agreed; I didn't say it did.  Bitfield endianness issues are not compiler bugs or bad code, either.  They're both cases where using bitfields has unexpectedly "bitten" people, though, where more traditional manipulation would have been more obvious (perhaps.)



Quote
I use packed structs with bitfields for all my software and so far I have never ran into a problem.
Ditto.  Well, except for fragmentOffset in IP, but I may have been blinded by the 16-bit CPUs of the day (68k vs 80186).
Not so much a "hardware problem" as a SOLVED problem.  If your compiler can't make it come out right, it's time for a different compiler.
(ps: Intel added bi-endian support to their x86 compiler some time ago (because a fetch-and-bswap is only a tiny bit slower than just a fetch). It's pretty cool, and I'm surprised I haven't seen it spring up elsewhere (Keil, LLVM, GCC).)


Quote
ARM's SVDConv.exe can generate mask and shift values, but most of the .svd files I've seen use bitfields.
Ah.  That explains why some (many?) of the .h files have both...  I had forgotten that they're program-generated.

Code: [Select]
typedef union {
  struct {
    uint32_t SWRST:1;          /*!< bit:      0  Software Reset */
    uint32_t ENABLE:1;         /*!< bit:      1  Enable */

 :

  } bit;                       /*!< Structure used for bit  access */
  uint32_t reg;                /*!< Type      used for register access */
} SERCOM_I2CS_CTRLA_Type;

 :

#define SERCOM_I2CS_CTRLA_SWRST_Pos 0            /**< \brief (SERCOM_I2CS_CTRLA) Software Reset */
#define SERCOM_I2CS_CTRLA_SWRST     (0x1ul << SERCOM_I2CS_CTRLA_SWRST_Pos)


 

