Author Topic: State machines - ensuring the integrity of state variable (Read 574 times)

HwAoRrDk · « **on:** January 25, 2024, 09:30:03 am »

I have been pondering this subject recently. When one has a state machine that is driving the functionality of a system (e.g. a microcontroller) that will be expected to have very long uptime (e.g. weeks or months), how can one assure the integrity of the state? That is, if the current state is represented by a single variable in memory, what techniques can be used to assure the integrity of that variable's value?

I can imagine a scenario where if a bit-flip happens in memory, all of a sudden your current state has unexpectedly changed to another state (or even an invalid state), and things may go wrong.

Some ideas I have had thus far:

- Don't have zero be a valid state value. If your state value accidentally gets erased to zero, it won't be a valid state.
- Don't keep the state variable on the stack, but instead in global/static memory. That way, if any stack corruption occurs, the state variable is unlikely to get clobbered.
- Make your individual state values differ from each other by more than one bit, so that a single bit-flip makes an invalid state that is easily detected. For example, instead of state values being 0x1, 0x2, 0x3, etc. make them 0x101, 0x202, 0x303, etc.
- Keep a second copy of the state variable, but inverted. For example, if state is 0x3, inverse state is 0xFC. Both need to be set appropriately to change state. Also, regularly (in situation of steady-state) verify that the state and inverse state correlate. If they ever diverge, state is invalid.
- Some other kind of checksum on the state variable.

Any of these have merit? Anyone got any other ideas?
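
To make the "no zero" and "more than one bit apart" ideas concrete, here is a minimal sketch. The state names and the `state_is_known()` helper are made up for illustration; the values are the 0x101, 0x202, 0x303 examples from the list above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state codes: none is zero, and any two of them
   differ in at least two bit positions. */
#define STATE_IDLE    0x0101u
#define STATE_RUNNING 0x0202u
#define STATE_DONE    0x0303u

/* Returns true only for one of the defined codes. Zero, and any
   code with a single bit flipped, ends up in the default branch. */
static bool state_is_known(uint16_t s)
{
	switch (s) {
	case STATE_IDLE:
	case STATE_RUNNING:
	case STATE_DONE:
		return true;
	default:
		return false;
	}
}
```

This only detects the corruption; it cannot say which state was intended, and two flipped bits can still turn one valid code into another (0x101 and 0x303 are only two bits apart).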

Kalvin · « **Reply #1 on:** January 25, 2024, 09:56:04 am »

I would go with this:

- Keep a second copy of the state variable, but inverted.

Very simple to implement, understand and check. Keep the state variable and its inverted version in a struct, so that they will always stay together. If the program accidentally corrupts the memory, it is very unlikely that the state variable and its 1s complement will stay valid after memory corruption, so your program can detect state variable corruption very reliably.

Rerouter · « **Reply #2 on:** January 25, 2024, 10:18:18 am »

Some microcontrollers also have dedicated registers that are great for this kind of stuff, though you need to convince yourself of their behaviour, as some retain their contents through a brownout reset but lose them on a power loss.

For similar systems it's sometimes possible to make it stateless, e.g. can the state be recovered from your sensors or other inputs? That removes the need to store it. You would still store it, but you can use the recovered value as the integrity check.
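
A rough sketch of that idea, assuming a purely hypothetical gate controller with two limit switches (none of these names come from a real project): the state is re-derived from the inputs and the stored copy is only trusted when the two agree.

```c
#include <stdbool.h>

/* Hypothetical states of an imaginary gate controller. */
typedef enum {
	GATE_CLOSED = 1,
	GATE_OPEN   = 2,
	GATE_MOVING = 3,
	GATE_FAULT  = 4
} gate_state_t;

/* Derive the state purely from the two (assumed) limit-switch inputs. */
static gate_state_t gate_state_from_inputs(bool closed_switch, bool open_switch)
{
	if (closed_switch && open_switch) {
		return GATE_FAULT; /* both switches active at once: sensor fault */
	}
	if (closed_switch) {
		return GATE_CLOSED;
	}
	if (open_switch) {
		return GATE_OPEN;
	}
	return GATE_MOVING;
}

/* The stored state is plausible only if the inputs lead to the same state. */
static bool gate_state_is_plausible(gate_state_t stored, bool closed_switch, bool open_switch)
{
	return stored == gate_state_from_inputs(closed_switch, open_switch);
}
```

Whether this is possible at all depends on the system; many state machines have states that the inputs alone cannot distinguish.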

Kalvin · « **Reply #3 on:** January 25, 2024, 10:46:04 am »

For completeness: If you need to check the integrity of a larger data structure, you need to compute some kind of checksum over the data structure.

There are many checksum algorithms you can choose from. For example, CRC32 is quite a popular algorithm.

For a small microcontroller the half-byte checksum variant may be useful:
https://create.stephan-brumme.com/crc32/#half-byte
Half-byte checksum is a nice compromise between speed and program memory size.
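
For reference, a minimal sketch of a half-byte (nibble-at-a-time) CRC32, written for this thread rather than copied from the page linked above. It assumes the usual reflected CRC-32 polynomial 0xEDB88320 with initial value and final XOR of 0xFFFFFFFF; the function and table names are my own.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-entry lookup table for the reflected CRC-32 polynomial 0xEDB88320.
   Entry i is the result of running i through four bitwise CRC steps. */
static const uint32_t crc32_nibble_table[16] = {
	0x00000000u, 0x1DB71064u, 0x3B6E20C8u, 0x26D930ACu,
	0x76DC4190u, 0x6B6B51F4u, 0x4DB26158u, 0x5005713Cu,
	0xEDB88320u, 0xF00F9344u, 0xD6D6A3E8u, 0xCB61B38Cu,
	0x9B64C2B0u, 0x86D3D2D4u, 0xA00AE278u, 0xBDBDF21Cu
};

/* Computes the CRC-32 of a buffer, four bits per table lookup. */
static uint32_t crc32_halfbyte(const void *data, size_t length)
{
	const uint8_t *p = (const uint8_t *)data;
	uint32_t crc = 0xFFFFFFFFu;

	while (length-- != 0) {
		crc ^= *p++;                                        /* mix in the next byte */
		crc = crc32_nibble_table[crc & 0x0Fu] ^ (crc >> 4); /* low nibble */
		crc = crc32_nibble_table[crc & 0x0Fu] ^ (crc >> 4); /* high nibble */
	}
	return crc ^ 0xFFFFFFFFu;
}
```

The table costs 64 bytes instead of the 1 KiB of the usual 256-entry table, at the price of two lookups per byte. A quick sanity check is that the nine ASCII characters "123456789" should give the standard check value 0xCBF43926.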

Some microcontrollers have dedicated hardware for speeding up checksum computations.

5U4GB · « **Reply #4 on:** January 25, 2024, 11:09:39 am »

There's a lot of knowledge around this problem with people who write code for applications dealing with radiation, electrical noise, or similar environments. For starters, do you want to simply detect faults and fail fast, or recover from faults? If the former, a checksum over the data and halting the CPU until the watchdog restarts you is the standard approach. For the latter, you need to store the values in TMR (triple modular-redundant) form and scrub them if a fault occurs on read. You also need to look at the CPU that you're working with, e.g. does it have ECC on internal buses and registers so you can assume they're fault-resistant or not, etc. The TMS570 series is great for this, lots of fault-recovery mechanisms built in.
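
As a very small illustration of the TMR idea, here is a sketch for a single byte. The type and function names are invented for this example, and it assumes the three copies can simply sit next to each other; in a real design you would probably want them in separate memory regions.

```c
#include <stdint.h>

/* Three redundant copies of one byte (hypothetical type for this sketch). */
typedef struct {
	volatile uint8_t copy[3];
} tmr_u8_t;

/* Write the same value to all three copies. */
static void tmr_write(tmr_u8_t *v, uint8_t value)
{
	v->copy[0] = value;
	v->copy[1] = value;
	v->copy[2] = value;
}

/* Read with a bitwise 2-of-3 majority vote. If any copy disagrees
   with the voted value, rewrite ("scrub") all three copies. */
static uint8_t tmr_read(tmr_u8_t *v)
{
	const uint8_t a = v->copy[0];
	const uint8_t b = v->copy[1];
	const uint8_t c = v->copy[2];
	const uint8_t voted = (uint8_t)((a & b) | (a & c) | (b & c));

	if (a != voted || b != voted || c != voted) {
		tmr_write(v, voted);
	}
	return voted;
}
```

The vote is done per bit, so it recovers the value as long as no bit position is wrong in more than one copy.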

This is just a tiny intro to this, covering it fully would require a book. Let me know if there's any particular aspect you're interested in.

Siwastaja · « **Reply #5 on:** January 25, 2024, 11:22:29 am »

Bit flips in SRAM or CPU registers are super rare. Note that protecting against them in software design is doomed, not to total failure, but to... meh-iness. Even if you find a way to protect that state variable, you have dozens of other variables which also affect the control flow or are equally critical, so...? Or think about peripheral registers. Are you constantly re-initializing everything? If yes, how often are you going to do this? Is it allowable to misbehave for a second? A minute? An hour?

You may then think that protecting variables which have long lifetimes is somehow more important, but is that so? Your random bitflip happens at any random time, at any random address. It can be a temporary variable in stack, or directly a CPU register, during that 100 nanoseconds when a decision for the next state is made.

All in all, software design practices which are safer against memory corruption are available, but hugely complex. Simple mitigations are not very effective. A more usual way is to have two CPUs which run the same program in perfect lockstep, with the exact same inputs, and a third observer which compares the outputs. But then again, you would need to prepare the inputs, and the observer would still be a single point of failure.

The big question is, are you building something with safety implications? A pacemaker, where a failure leads to death? Or just a gadget where it is allowable that, once you have a million units in the field, one needs a reboot by the user every year because of a bitflip?

tggzzz · « **Reply #6 on:** January 25, 2024, 12:53:06 pm »

Quote from: HwAoRrDk on January 25, 2024, 09:30:03 am

I have been pondering this subject recently. When one has a state machine that is driving the functionality of a system (e.g. a microcontroller) that will be expected to have very long uptime (e.g. weeks or months), how can one assure the integrity of the state? That is, if the current state is represented by a single variable in memory, what techniques can be used to assure the integrity of that variable's value?

I can imagine a scenario where if a bit-flip happens in memory, all of a sudden your current state has unexpectedly changed to another state (or even an invalid state), and things may go wrong.

Some ideas I have had thus far:

- Don't have zero be a valid state value. If your state value accidentally gets erased to zero, it won't be a valid state.
- Don't keep the state variable on the stack, but instead in global/static memory. That way, if any stack corruption occurs, the state variable is unlikely to get clobbered.
- Make your individual state values differ from each other by more than one bit, so that a single bit-flip makes an invalid state that is easily detected. For example, instead of state values being 0x1, 0x2, 0x3, etc. make them 0x101, 0x202, 0x303, etc.
- Keep a second copy of the state variable, but inverted. For example, if state is 0x3, inverse state is 0xFC. Both need to be set appropriately to change state. Also, regularly (in situation of steady-state) verify that the state and inverse state correlate. If they ever diverge, state is invalid.
- Some other kind of checksum on the state variable.

Any of these have merit? Anyone got any other ideas?

There are many techniques, each with their advantages and disadvantages. We cannot enumerate all of them.

We could make suggestions after we know:

- the consequences of failure (legal, commercial, ...)
- permissible and required actions when a failure is detected
- the "threats" that must be overcome (unexpected inputs, power failures, suboptimal designs/implementations, ...)
- how this piece of the jigsaw fits into the entire system

berke · « **Reply #7 on:** January 25, 2024, 02:33:02 pm »

Against most dumb programming errors, the easiest defence is to use a memory-safe language with a strong, static and rich type system and automatic or static memory management (e.g. garbage collection). Rust and OCaml are examples that can be, and have been, embedded. Under reasonable conditions these languages guarantee that there will be no memory corruption due to software faults. The rich static type system gives guarantees of data integrity and allows higher-level program constraints to be expressed and automatically checked by the compiler.

If you can't use a language designed with correctness and safety from the ground up, but must rely on C for example, there are still things you can do.
- Prove that your code is correct using a proof assistant.
- Use an advanced static analyzer with annotations.

As for bitflips in SRAM, if your thing is not going into space or near a radiation source, then they are very unlikely to occur, especially in small microcontrollers. Microcontroller NOR flash is also quite robust. However, "mass storage" flash is another story, especially SD/MMC cards, and there you can get corruption under some circumstances (excessive activity, temperature effects). Sticking to "industrial grade" brands (Delkin etc.) helps.

Redundancy gets very complicated very fast. Triplicate the system? How do you know the voter will work? What about common manufacturing defects? What about I/O? Either you have deep pockets and get some custom silicon fabbed, or use some FPGA with scrubbing/ECC support (or non-volatile configuration) and do TMR synthesis, but as has been said it's still gonna be meh.

golden_labels · « **Reply #8 on:** January 25, 2024, 03:30:13 pm »

For a typical microcontroller application running for mere weeks or months, which is not safety critical, the effort spent on addressing this issue is usually better spent elsewhere. That is: by allocating time to this thing you spend less on parts of your system more likely to be a source of failure.

Other than checksumming and keeping duplicates, you may also choose the values wisely. You already got the right concept by noting they should differ by more than one bit. This is actually a much wider and well researched topic: Hamming distance and everything around it.
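
To put a number on it: a set of codes with minimum Hamming distance d lets you detect any corruption of up to d-1 bits. A small sketch for checking a candidate set of 16-bit state codes follows; the helper names are made up, and this is something you would run on the host or in a unit test rather than on the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in a 16-bit value. */
static unsigned popcount16(uint16_t x)
{
	unsigned n = 0;
	while (x != 0) {
		x &= (uint16_t)(x - 1); /* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Smallest number of differing bits between any two codes in the set.
   Returns 16 (the maximum possible) if there are fewer than two codes. */
static unsigned min_hamming_distance(const uint16_t *codes, size_t count)
{
	unsigned best = 16;
	for (size_t i = 0; i < count; i++) {
		for (size_t j = i + 1; j < count; j++) {
			const unsigned d = popcount16((uint16_t)(codes[i] ^ codes[j]));
			if (d < best) {
				best = d;
			}
		}
	}
	return best;
}
```

With this, 0x101, 0x202, 0x303 come out at a minimum distance of 2 (single flips detectable), plain 1, 2, 3 at a distance of 1, and something like 0x0F, 0x33, 0x55 at a distance of 4.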

But, as noted above, don’t get fixated on this. Unless you either have a solid reason to focus on this issue or want to do this for educational purposes.

This may turn into a deadly rabbit hole: just realize that the machine instruction reading the state value is also subject to that kind of corruption. And it has more bits than your variable. Depending on the internal organization of the microcontroller and how the program is stored/read/executed, the per-bit error probability may be considerably lower, so it’s not exactly an apples to apples comparison. But I hope you see the outline of the trap.

HwAoRrDk · « **Reply #9 on:** January 26, 2024, 12:37:39 pm »

I like how - always, without fail - whenever a topic related to robustness or integrity is broached, everyone always jumps straight in to the deep end and assumes you're doing something safety- or life-critical, and starts talking about triple-redundancy, memory-scrubbing, ECC, radiation-hardening, ISO safety standards, type- and memory-safe languages, static analysis, CRCs, etc, etc, etc.

Yes, often the XY Problem is manifest, and generalising is best rather than sticking to the specific topic at hand so that the OP is informed about wider things that they may need to consider. But sheesh, guys, give it a break sometimes...

So let me just clarify:

- No, I'm not doing anything safety- or life-critical.
- No, I'm not in need of complying with any regulatory framework.
- No, I'm extremely unlikely to get sued if my system fails.
- No, I'm not sending anything into space, or controlling a nuclear reactor, X-rays, lasers, etc.

With that said, yes, I appreciate there are a whole lot of other things that can go wrong, and that a single state variable in memory is a drop in the ocean, all things considered. It's just that it occurred to me the other day that I'm writing some code where if the internal state gets corrupted, annoying things may happen, and I wondered "how can I make this better?". Not ideal, not perfection, just a little bit better. And so I think aiding the integrity of the keystone of the execution of my system is not wholly futile.

I think I'm not going to go overkill, and will just go with keeping a second inverse copy of the state variable.

Code:

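#include <stdint.h>

typedef enum {
	STATE_FOO = 1,
	STATE_BAR = 2,
	/* etc... */
} state_enum_t;

/* Both copies use a fixed-width type, so the complement check does not
   depend on the size or signedness the compiler picks for the enum
   (with a one-byte enum, comparing the XOR against -1 would never match). */
typedef struct {
	uint8_t state;
	uint8_t inverse;
} state_t;

static volatile state_t state;

#define state_current() ((state_enum_t)state.state)
/* Evaluates (s) once, then stores the value and its bitwise complement. */
#define state_change(s) do { const uint8_t s_ = (uint8_t)(s); state.state = s_; state.inverse = (uint8_t)~s_; } while(0)
/* Valid only when every bit of the two copies differs. */
#define state_is_valid() ((uint8_t)(state.state ^ state.inverse) == 0xFFu)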

The state will only be changed using the state_change() macro, and every time through the main loop the state will be verified with state_is_valid().

ejeffrey · « **Reply #10 on:** January 26, 2024, 03:00:35 pm »

Application level checksums on data structures can work. Simple duplication is the simplest example when you aren't constrained in memory. Obviously it won't catch everything, but it can be one tool to help. One thing to watch out for is that data is corrupted in transit as well as at rest. So if you verify the checksum of an in-memory data structure, find it's OK, and then separately load the value you want into a register, you may have missed a good chunk of your benefit. Instead you want to load the value you care about only once, as part of the checksum, then keep it around to use.
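
A minimal sketch of the "load once, check, then use that same value" pattern, assuming a value-plus-inverse pair like the one discussed earlier in the thread. The names here are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* A value stored together with its bitwise complement. */
typedef struct {
	uint8_t value;
	uint8_t inverse;
} guarded_u8_t;

static volatile guarded_u8_t guarded_state;

/* Store the value and its complement. */
static void guarded_set(uint8_t v)
{
	guarded_state.value = v;
	guarded_state.inverse = (uint8_t)~v;
}

/* Reads each copy from memory exactly once, checks the local snapshots,
   and hands back the snapshot that was checked. The caller should use
   *out from then on instead of reading guarded_state again. */
static bool guarded_get(uint8_t *out)
{
	const uint8_t v = guarded_state.value;
	const uint8_t inv = guarded_state.inverse;

	if ((uint8_t)(v ^ inv) != 0xFFu) {
		return false; /* copies disagree: treat the state as corrupted */
	}
	*out = v;
	return true;
}
```

The point is that the value acted upon is the very one that passed the check, rather than a second read that could have been corrupted in between.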

Lockstep processing is not even that exotic these days; the ARM Cortex-R class microcontrollers do it. This will catch single upsets not just in data memory but in program memory, or the processor core itself. I'd definitely look into that if you are interested.

ejeffrey · « **Reply #11 on:** January 26, 2024, 03:05:33 pm »

Another thing to watch out for if you are just using duplication is to make sure your compiler doesn't optimize out the checks. You can either write your protected variable access functions in assembly or mark both copies of your state variable volatile. If you do the latter, make sure that you're accessing the variables exactly the number of times you expect.

5U4GB · « **Reply #12 on:** January 26, 2024, 03:51:44 pm »

Quote from: ejeffrey on January 26, 2024, 03:05:33 pm

Another thing to watch out for if you are just using duplication is to make sure your compiler doesn't optimize out the checks. You can either write your protected variable access functions in assembly or mark both copies of your state variable volatile. If you do the latter, make sure that you're accessing the variables exactly the number of times you expect.

If you're really serious about this you'll need to use something like CompCert, a formally verified compiler that guarantees the generated code preserves the semantics of the source. Optimizing compilers like gcc will transform your code into what they consider equivalent code, but the result may no longer have the properties you wanted, such as the redundant reads and checks you wrote.

