Topic: EEPROM corruption on AVR

Oaklander · « **on:** March 05, 2024, 07:05:06 am »

I need to make the EEPROM in an ATtiny202 corruption-proof so it can store a configuration variable.
To do that Microchip advises to enable brown-out detection and somewhere else I've found that it's a bad practice to use the first address in EEPROM.

My question is how corruption proof will the EEPROM be if I implement the measures above? Will it then reliably retain the data for years or even decades?

Psi · « **Reply #1 on:** March 05, 2024, 07:11:12 am »

It's pretty reliable as long as you don't write to the same address 100,000's of times and wear it out.
Data retention from datasheet is 40 years at 55°C and 100,000 writes.
People have tested AVRs by writing data in loops, and it took a few million writes to a location before they saw any issues.

As long as you have the brownout detector active I'd be more worried about a coding bug causing excessive writes than I would be about the flash failing.

But you could always implement some sort of software error detection/checking where you store data in 3 different locations spread over the eeprom and then you can check them all to confirm the data agrees. But you do have to consider the situation if you lose power after writing some of those areas but not all. Maybe a flag to say you were in the middle of a write. If you know that then you can finish the write because you know the order you wrote them so you know which one is correct.

Can also store a checksum with a block of data to confirm when you read it back all bits are correct.

I guess it boils down to how critical this data is. If it's wrong, does someone die?
If you can detect that the data has been corrupted, can the device just refuse to function?
Or does it NEED to be correct at startup every time?

kripton2035 · « **Reply #2 on:** March 05, 2024, 07:17:21 am »

or you can use a small fram chip that will hold the data for centuries...

Oaklander · « **Reply #3 on:** March 05, 2024, 07:21:07 am »

Excessive writes should not be an issue as the data will be written only once. I'm also pretty sure my code doesn't do any unintended writes.

Data redundancy and automatic error detection and correction is something I have thought about. But on the other hand it also looks like a possible source for coding errors.

Checksum doesn't really help anything in my application.

Psi · « **Reply #4 on:** March 05, 2024, 07:23:55 am »

Quote from: Oaklander on March 05, 2024, 07:21:07 am

Excessive writes should not be an issue as the data will be written only once. I'm also pretty sure my code doesn't do any unintended writes.

Until a cosmic ray flips a bit and turns the next instruction into a loop that jumps back.
Better enable the watchdog timer too.

I contain my flash/eeprom write stuff into a function and inc a counter whenever the function gets run.
It allows me to keep an eye on this number just in case.

Also, for AVRs, in the EEPROM library there are WRITE functions and also UPDATE functions.
The Update functions are safer as they do a read first to check if the data has changed. If you try to write the same value it skips the write.
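
A minimal host-side sketch of the write vs. update difference, combined with the write counter mentioned above. `sim_eeprom`, `sim_write_byte` and `sim_update_byte` are made-up stand-ins that operate on a RAM array, not the library functions; on a real part you would call `eeprom_update_byte()` from `<avr/eeprom.h>` instead.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_EEPROM_SIZE 64u

static uint8_t  sim_eeprom[SIM_EEPROM_SIZE]; /* RAM stand-in for the EEPROM cells */
static uint32_t sim_write_count;             /* how many physical writes happened */

/* Unconditional write: always costs one erase/write cycle. */
static void sim_write_byte(size_t addr, uint8_t value)
{
    sim_eeprom[addr] = value;
    sim_write_count++;
}

/* Update: read first, and only write when the stored value differs. */
static void sim_update_byte(size_t addr, uint8_t value)
{
    if (sim_eeprom[addr] != value) {
        sim_write_byte(addr, value);
    }
}
```

Calling `sim_update_byte()` repeatedly with the same value only bumps the counter once, which is the whole point: a runaway loop that keeps storing an unchanged value does not wear the cell.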

Oaklander · « **Reply #5 on:** March 05, 2024, 07:39:47 am »

So actually the safest thing would be to not include any EEPROM writes in the code and program the EEPROM at the same time as the program memory.

Jeroen3 · « **Reply #6 on:** March 05, 2024, 07:46:38 am »

Corruption proof is a hard requirement.

The bare minimum should be that you can detect corruption. So add a suitable CRC or similar.
Maybe you would also need to recover from corruption: use ECC, or store multiple copies in different places in the EEPROM.
Then you should ensure your software does not write or erase to eeprom when the device isn't truly ready and has time to do so.
You could also make a transaction log such that you always keep the old settings when writing new ones.

Chip brownout protection shouldn't really be an issue unless there is some hardware errata.
What you should do is ensure you do not write to eeprom when there is risk. For example, if you log power-ups, wait a few seconds until you're sure the device is on, on.
Often the code is run a short while when programming or during testing and immediately cut off.

Oaklander · « **Reply #7 on:** March 05, 2024, 08:02:05 am »

The setting will only be written once. It could be hard coded in the firmware but there are dozens of different values for the setting so I'm using the EEPROM to prevent having so many different versions of the firmware.

HwAoRrDk · « **Reply #8 on:** March 05, 2024, 11:16:27 am »

In that case the best thing to do would be as you already mentioned: write the EEPROM only at programming time, don't include any code in your firmware that can write to the EEPROM, and add a CRC to the data which gets verified at startup. If verification fails, fall back to some hard-coded values or enter a fail-safe fault state.

Kleinstein · « **Reply #9 on:** March 05, 2024, 12:12:03 pm »

The old AVRs without brown-out detection had a problem with the first location to get overwritten. Not using the first location already helped a lot and setting the brownout detector active pretty much solved the problem.

If there is space, one could use redundant memory. Saving the data 3 times and then doing a majority vote is relatively easy, though it needs more memory.
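
As a rough sketch of the three-copy majority vote (the function name `majority_vote3` is made up for illustration), comparing whole bytes rather than individual bits:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true and stores the agreed value if at least two of the three
 * copies match. Returns false when all three differ, in which case the
 * caller has to fall back to a default or a fault state. */
static bool majority_vote3(uint8_t a, uint8_t b, uint8_t c, uint8_t *out)
{
    if (a == b || a == c) {
        *out = a;
        return true;
    }
    if (b == c) {
        *out = b;
        return true;
    }
    return false; /* no two copies agree */
}
```

Note the explicit failure return: with plain voting there is no winner when all three copies disagree, so that case has to be handled separately.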

wek · « **Reply #10 on:** March 05, 2024, 01:03:22 pm »

> Saving the data 3 times and then doing a majority vote

And what if I have 3 different values, because power outage happened in the middle of writing the second copy (i.e. there's one good old value, one good new value and one corrupted new value)?

JW

mikerj · « **Reply #11 on:** March 05, 2024, 02:02:17 pm »

Quote from: wek on March 05, 2024, 01:03:22 pm

> Saving the data 3 times and then doing a majority vote

And what if I have 3 different values, because power outage happened in the middle of writing the second copy (i.e. there's one good old value, one good new value and one corrupted new value)?

JW

Include a counter in your data structure (along with a CRC) and use the valid data with the highest count. If you expect to write often enough that the counter overflows, then add some logic to detect this, e.g. if you have good data with a counter value of 255 and good data with a counter value of 0, then 0 is the most recent.
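
One possible way to write that overflow logic for an 8-bit counter is modular comparison. This is only a sketch; `seq_is_newer` is a made-up name, and it assumes the two records being compared are never more than 127 writes apart.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if counter value a is more recent than b, allowing for 8-bit
 * wrap-around (so 0 counts as newer than 255). */
static bool seq_is_newer(uint8_t a, uint8_t b)
{
    uint8_t diff = (uint8_t)(a - b); /* forward distance from b to a, modulo 256 */
    return diff != 0u && diff < 128u;
}
```

With this, `seq_is_newer(0, 255)` is true and `seq_is_newer(255, 0)` is false, which matches the 255/0 example above.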

iMo · « **Reply #12 on:** March 05, 2024, 02:07:10 pm »

The OP is not asking about protection against erroneous writes.
So he/she writes to 3 (or more) different places in the EEPROM (those writes will always be OK, she/he assumes).
Then after, say, 10 years, he/she reads the EEPROM again and does a majority vote.

wraper · « **Reply #13 on:** March 05, 2024, 02:40:54 pm »

AFAIK accidental EEPROM overwrite during power cycling was a problem only in early AVRs. Nonetheless, enabling brown-out detection ensures that the MCU does not execute incorrectly due to out-of-spec power.

cv007 · « **Reply #14 on:** March 05, 2024, 07:41:10 pm »

Quote

and somewhere else I've found that it's a bad practice to use the first address in EEPROM

Does not apply for these newer avr. The default 0 value (after reset) for the nvm.addr register is an invalid address for the nvm to use in all cases. The eeprom is at data space starting at 0x1400, and there are no addresses to avoid.

Perkele · « **Reply #15 on:** March 05, 2024, 08:51:11 pm »

Data mirroring with added checksums. At 16 or 20MHz, AVR is fast enough to calculate CRC-16 in a reasonable amount of time.
In addition to BOR, enable the set-up time fuse (also called power-on reset timeout on other platforms) and set it to the maximum value. It is 64ms on most AVRs.
A PSU becoming unstable after several years in operation can cause mayhem on power-on, especially if you're doing EEPROM access at that time.
In some applications, in addition to start-up delay, I might also add a simple wait loop if I need 100ms or 200ms of delay.

Oaklander · « **Reply #16 on:** March 06, 2024, 10:03:45 am »

So if I enable brown-out detection and set-up time and remove any writes to the EEPROM from the code the EEPROM should be quite reliable.

In addition I could write the data in multiple locations, compare those when reading, and select the majority. This could also be used to correct the corrupted entries, but that would introduce write functionality into the code. So wouldn't that be bad after all, if I actually want to remove any writes from the code?

If EEPROM corruption happens will the resulting data be random or are some changes more likely than others? When I discovered the corruption problem the data had turned into FF.
If it turns out the data usually turns to FF, I should make the code ignore those entries when comparing the data it reads.

Jeroen3 · « **Reply #17 on:** March 06, 2024, 01:03:04 pm »

Some protocols, such as J1939, try not to use FF (and sometimes 00) as valid values. The reason is that 00 or FF are erased or undefined states of memory or a bus, and accepting those increases the risk of bugs.
They often use an offset or range where 00 or FF would be invalid.

They also use special values, eg effective range is 0-250, while FB (251) to FF (255) have special meaning, such as Sensor Error (FE) and Unavailable (FF).
You could utilize a similar strategy.
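
A tiny sketch of what such a reserved-value scheme could look like for a one-byte setting. The range 0x01..0xFA and the names are hypothetical choices for illustration; the only hard fact used here is that an erased AVR EEPROM cell reads back as 0xFF.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    CFG_MIN_VALID = 0x01, /* hypothetical: 0x00 is never a legal setting */
    CFG_MAX_VALID = 0xFA  /* hypothetical: 0xFB..0xFF are reserved */
};

/* Rejects 0x00 and 0xFF (typical erased/undefined states) along with the
 * rest of the reserved range. This catches only those specific failure
 * patterns, not arbitrary bit flips inside the valid range. */
static bool cfg_is_valid(uint8_t raw)
{
    return raw >= CFG_MIN_VALID && raw <= CFG_MAX_VALID;
}
```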

cv007 · « **Reply #18 on:** March 06, 2024, 04:38:29 pm »

Quote

and remove any writes to the EEPROM from the code

Maybe describe in more detail what kind of data you want in eeprom, and how often this data needs to change. With writing removed from code as you state, it sounds like this is a one time config data write. If so, then the conversation takes a different direction.

Also note, these avr have an additional page of eeprom called user row (an additional 32 bytes in your tiny202) which does not erase with a chip erase (normal programming) or an eeprom erase (nvm command).

Oaklander · « **Reply #19 on:** March 06, 2024, 07:46:08 pm »

Quote from: cv007 on March 06, 2024, 04:38:29 pm

Quote
and remove any writes to the EEPROM from the code
Maybe describe in more detail what kind of data you want in eeprom, and how often this data needs to change. With writing removed from code as you state, it sounds like this is a one time config data write. If so, then the conversation takes a different direction.

Also note, these avr have an additional page of eeprom called user row (an additional 32 bytes in your tiny202) which does not erase with a chip erase (normal programming) or an eeprom erase (nvm command).

I've already explained it before. The data will be one 8bit integer which will be programmed at the same time with the firmware and it must never change. It could be hard coded but there are dozens of different values for the setting so I'm using the EEPROM to prevent having so many different versions of the firmware.

The current solution is to send a command over serial bus to program the EEPROM once the firmware has been programmed. I could remove the code needed for that and use the programmer to program the EEPROM too.

I could use the user row but does it have other advantages over EEPROM than not being erased like you said?

Siwastaja · « **Reply #20 on:** March 06, 2024, 07:52:43 pm »

Quote from: Jeroen3 on March 06, 2024, 01:03:04 pm

Some protocols, such as J1939, try not to use FF (and sometimes 00) as valid values. The reason is that 00 or FF are erased or undefined states of memory or a bus, and accepting those increases the risk of bugs.

Instead of trying to avoid some magical values which correlate with some types of problems, but miss all other sorts of corruption, one should just use checksums; CRC8 is pretty strong already for tens of bytes; CRC16 even better. If system availability is important, writing the whole checksummed block of data multiple times allows the code to try the next block if the CRC check of the first one fails.

And CRC8 is just a few lines of code, definitely simpler and more robust than some probably variable length escaping scheme.
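
To give an idea of the size, here is a sketch of a bitwise CRC-8 (polynomial 0x07, initial value 0x00, chosen here as an assumption since no particular variant was specified) plus the "try the next block" read. The block layout (4 data bytes followed by 1 CRC byte) and the names are made up for illustration, and `image` stands in for a RAM copy of the EEPROM contents.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-8, polynomial x^8 + x^2 + x + 1 (0x07), init 0x00. */
static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u)
                                : (uint8_t)(crc << 1);
        }
    }
    return crc;
}

#define BLOCK_DATA_LEN 4u                    /* hypothetical payload size */
#define BLOCK_LEN      (BLOCK_DATA_LEN + 1u) /* payload plus one CRC byte */

/* Walks the stored copies in order and returns the first one whose CRC
 * matches, or NULL if every copy is damaged. */
static const uint8_t *first_valid_block(const uint8_t *image, size_t copies)
{
    for (size_t i = 0; i < copies; i++) {
        const uint8_t *blk = image + i * BLOCK_LEN;
        if (crc8(blk, BLOCK_DATA_LEN) == blk[BLOCK_DATA_LEN]) {
            return blk;
        }
    }
    return NULL;
}
```

One caveat of the zero initial value: a block of all zeros also has a CRC of zero and would be accepted, so a non-zero initial value may be preferable if an all-zero block must be rejected.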

AndyBeez · « **Reply #21 on:** March 06, 2024, 08:46:20 pm »

Quote from: Oaklander on March 05, 2024, 07:05:06 am

Will it then reliably retain the data for years or even decades?

Will the capacitors, solder joints and other mechanical contacts last for decades? Will the user trash the circuit after five years? In the wild, AVR EEPROM data should exist far beyond the lifespan of a product. I built an ATTINY to run a vehicle's console lighting; the EEPROM data is still in there, almost two decades after it was flashed. The vehicle was turned into soup cans ages ago.

Yes, setting Brownout is good practice; remembering that Brownout detection is meant to initiate a graceful restart, deep sleep mode or low battery operation. When a brownout condition exists, nothing should be written and values read may not be reliable.

What value are you writing, how often are you updating it, and how often are you reading it during the boot cycle and at runtime? If your write budget goes over the 100K limit in a couple of years, you'll need a different methodology. If you have to wait for the year 3000, then wear levelling will never be your problem.

The guys suggesting CRC are correct. You can also use shadow data where you read from multiple bytes and XOR the input with the previous input. If the result is not zero, you got bad data. You do not need too many clock cycles to implement this check in assembler.

What you then do if the data is corrupted

Oaklander · « **Reply #22 on:** March 06, 2024, 09:29:14 pm »

Quote from: Siwastaja on March 06, 2024, 07:52:43 pm

writing the whole checksummed block of data multiple times allows the code to try the next block if the CRC check of the first one fails

That's what I will do. It's much simpler than the majority method discussed earlier.

cv007 · « **Reply #23 on:** March 07, 2024, 03:16:54 am »

Quote

The data will be one 8bit integer which will be programmed at the same time with the firmware and it must never change

If your programmer could do the job, then just let it. Your programmer will have verified the value at programming time so no need to add any special code to get this value in the mcu, and no need to verify its value on your own. Now all you need to do is decide where you want this value- eeprom, flash, user row.

In all cases, you just have to decide what address this byte will be at- if in flash, you can use the last byte in flash (known location that will not change), for eeprom or userrow it could be the first address of either (0x1400 or 0x1300).

There are various ways to 'insert' this value into the hex file- can do it 'manually' by inserting a hex record or the programmer may have a way to add unique data at time of programming.

example for flash, storing the byte at end of flash (where, unless you are using every available byte for code, will be free)-
#define MY_SPECIAL_VALUE (*(uint8_t*)(MAPPED_PROGMEM_END)) //note mapped address used, so can read flash directly

and any use will read
if( MY_SPECIAL_VALUE > 10 ) { /* do something */ }

and this special value can be easily added to a hex file, which then will program that value at the end of flash.

Psi · « **Reply #24 on:** March 07, 2024, 09:54:31 am »

Just thinking outside the box a bit. Might not be useful..

If you have a page of flash memory spare/unused, you could store your EEPROM write functions in that flash page at a fixed address. Then you can use those functions to write anything you want to eeprom using serial commands. Then when you have the system all set up, you issue the command to do a flash erase on just that page to permanently remove all flash write functions from the firmware on the device.

You'd also want to set a flag in EEPROM before page erase that says the functions no longer exist at that address. Just so you can block trying to run functions that no longer exist, but that is easy.

Oaklander · « **Reply #25 on:** March 07, 2024, 01:07:50 pm »

I was thinking that since I have a single byte to store I could just predetermine the correspondence of every data value to the possible check digit values. Then I wouldn't need any algorithm to calculate the values but just a simple lookup table for each pair of data and check digit.

Would there be any downsides to this?

AndyBeez · « **Reply #26 on:** March 07, 2024, 02:55:08 pm »

Downside? You're assuming the flash memory lookup table is always intact.

Another quick validation is to store the byte and the inverse of the byte in Eeprom. ANDing the two bytes together will always equal zero. If it does not then one or the other byte is bad > so the device should be marked as faulty.

Oaklander · « **Reply #27 on:** March 07, 2024, 04:24:12 pm »

Quote from: AndyBeez on March 07, 2024, 02:55:08 pm

Downside? You're assuming the flash memory lookup table is always intact.

I thought about that but there's no guarantee that the checksum algorithm will stay intact either.

cv007 · « **Reply #28 on:** March 07, 2024, 06:34:16 pm »

I think you are over complicating this. Your problem is the same as the desire to put in a unique s# into an mcu, which typically becomes a programming problem not a runtime one.

It will depend on what you have for programmer, but if it programs from a hex file and not an elf file-

tiny202- 2k
0x0000 - 0x07FF flash code space (programmer)
0x8000 - 0x87FF flash mapped data space (reading)

My app, which has various i2c addresses I assign-

Code: [Select]

#include <avr/io.h>
#define MY_I2C_ADDR (*(const uint8_t*)(MAPPED_PROGMEM_END)) //read from last byte in flash
int main() {
    TWI0.SADDR = MY_I2C_ADDR;
    while (1){}
    }

relevant code- is reading 0x87FF which is the last mapped flash byte, which is what we want-

Code: [Select]

00000046 <main>:
  46:	80 91 ff 87 	lds	r24, 0x87FF	; 0x8087ff <_end+0x487f>
  4a:	80 93 1c 08 	sts	0x081C, r24	; 0x80081c <__RODATA_PM_OFFSET__+0x7f881c>
  4e:	ff cf       	rjmp	.-2      	; 0x4e <main+0x8>

resulting myapp.hex-

Code: [Select]

:1000000019C020C01FC01EC01DC01CC01BC01AC00C
:1000100019C018C017C016C015C014C013C012C034
:1000200011C010C00FC00EC00DC00CC00BC00AC064
:1000300009C008C011241FBECFEFCDBFDFE3DEBF74
:1000400002D006C0DDCF8091FF8780931C08FFCFD0
:04005000F894FFCF52
:00000001FF

hex record-
:NNAAAATTDD..DDCC
NN - n DD data bytes
AAAA - start address of record data
TT - record type (00 = data)
CC = checksum (so sum of all record values/bytes is 0, mod256)

add a record for MY_I2C_ADDR, which is at flash code space address 0x07FF
value we want is 0x44-
:0107FF0044B5
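
For anyone scripting this, a small sketch of how the CC byte can be computed. `ihex_checksum` is a made-up helper name; it takes the record bytes NN, AAAA, TT and the data (everything between the colon and the checksum) already converted from hex text to binary.

```c
#include <stddef.h>
#include <stdint.h>

/* Intel HEX record checksum: the two's complement of the 8-bit sum of all
 * record bytes, so that the sum including the checksum is 0 mod 256. */
static uint8_t ihex_checksum(const uint8_t *record, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + record[i]);
    }
    return (uint8_t)(0u - sum);
}
```

For the record above the bytes are 01 07 FF 00 44, which sum to 0x14B; the low byte is 0x4B and its two's complement is 0xB5, matching the B5 at the end of the record.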

myapp_0x44.hex-

Code: [Select]

:1000000019C020C01FC01EC01DC01CC01BC01AC00C
:1000100019C018C017C016C015C014C013C012C034
:1000200011C010C00FC00EC00DC00CC00BC00AC064
:1000300009C008C011241FBECFEFCDBFDFE3DEBF74
:1000400002D006C0DDCF8091FF8780931C08FFCFD0
:04005000F894FFCF52
:0107FF0044B5 <--added record
:00000001FF

You would obviously automate the generation of the hex file with the desired additional value, and could also generate each variation you need at one time with a resulting hex file that has the value in its name such as the above example (a folder full of hex files for each variation, based on name). If you use 256 possible values, maybe not, if there are only 10 possible values then maybe. Could also be scripted and done when compiling (after compile run script automatically).

You also do not say how many mcu's you will want to program- if it's a couple dozen, then all of this is probably still overkill.

You have a 2k mcu, and filling it up with crc code, communications, handling all failure scenarios, etc., may not leave a lot of code room to do the main job. Programming a value at programming time eliminates all that.

And.. if there is a limited number of mcu's you will be using (pick a number for 'limited'), and I was using an ide like MPLABX to program, I would just modify a value in the code before programming each mcu. Change a number, click programming icon, done. Repeat.

Code: [Select]

#include <avr/io.h>
const uint8_t my_i2c_addr = 0x44; //change as needed
int main() {
    TWI0.SADDR = my_i2c_addr;
    while (1){}
    }

Oaklander · « **Reply #29 on:** March 07, 2024, 08:04:07 pm »

Using the last byte of flash would be a simple solution but I don't really want to have dozens of different versions of the firmware. If I make changes to the firmware it will be much harder to update the MCUs. If the data is stored in EEPROM I could just update each MCU with the same firmware preserving EEPROM and everything would still work.

Currently I'm thinking of programming the EEPROM full of data and checksum pairs. When reading the data the program would start at the first pair and check if the data and checksum match. If they don't it would go to check the next pair and so on until a matching pair is found. It would be a small and simple solution. And unless the entire EEPROM gets corrupted there would be recoverable data.

That would of course be coupled with brown-out detection and start-up delay.

It's really hard for me to judge the robustness of the different methods.

Quote from: AndyBeez on March 07, 2024, 02:55:08 pm

Another quick validation is to store the byte and the inverse of the byte in Eeprom. ANDing the two bytes together will always equal zero. If it does not then one or the other byte is bad > so the device should be marked as faulty.

Wouldn't it be better to use XOR and compare the result to 255? That would also detect errors where both bits are zero.
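
A sketch of how the XOR variant and the pair scan could look, with a RAM array `image` standing in for the EEPROM contents and all names made up for illustration. A pair counts as good only if the check byte is the exact bitwise inverse of the value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* value XOR check must be all ones. Unlike the AND test, this rejects a
 * pair such as (0x40, 0x00), where a bit position is zero in both bytes. */
static bool pair_is_valid(uint8_t value, uint8_t check)
{
    return (uint8_t)(value ^ check) == 0xFFu;
}

/* Scans (value, inverted value) pairs from the start and returns the first
 * consistent value. Returns false if no pair checks out. */
static bool read_first_valid_pair(const uint8_t *image, size_t pair_count,
                                  uint8_t *out)
{
    for (size_t i = 0; i < pair_count; i++) {
        uint8_t value = image[2u * i];
        uint8_t check = image[2u * i + 1u];
        if (pair_is_valid(value, check)) {
            *out = value;
            return true;
        }
    }
    return false;
}
```

An erased pair (FF, FF) fails this test as well, since FF XOR FF is 00.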

cv007 · « **Reply #30 on:** March 07, 2024, 11:55:42 pm »

Quote

And unless the entire EEPROM gets corrupted there would be recoverable data.

Why do you think eeprom will get corrupted? If done at programming time, the programmer will verify the write and it will never change again. No real need to verify the value every time you read it as it will not change.

If you use user row eeprom instead of eeprom, you will eliminate the possibility of a chip erase erasing the eeprom for any later programming/updates. The user row eeprom is designed so that any chip erase or eeprom erase will leave it untouched.

Just using the ide for programming as an example, although programming tools could do this easier-

Code: [Select]

#include <avr/io.h>
__attribute__(( section(".user_signatures") )) uint8_t my_i2c_addr = 0x44;
int main() {
    while (1){}
    }

Compile/program, you now have a value in user row eeprom (named user signatures because of legacy), and it will not change unless you specifically change it later. Any updates, assuming you are not programming this byte again, will leave that value untouched. This is a one time thing for each mcu.

Now that the specific value is in a specific mcu, your app can assume the value is at the start of user row eeprom (0x1300) since there is only 1 byte in use.

Code: [Select]

#include <avr/io.h>
#define MY_I2C_ADDR (*(const uint8_t*)(USER_SIGNATURES_START)) 
int main() {
    TWI0.SADDR = MY_I2C_ADDR;
    while (1){}
    }

How many mcu's are we talking about, and how many values of the possible 256 will be used?

Nominal Animal · « **Reply #31 on:** March 08, 2024, 02:29:26 am »

If you do
__attribute__((section (".user_signatures")))
const volatile uint8_t my_i2c_addr = 0x44;
you can not only just use my_i2c_addr, but typical linker scripts also export symbols __start_user_signatures that has the same address as the first variable in that section, and __stop_user_signatures that has an address just after the last variable in that section.

If you use a dedicated section for each configuration variable, especially if they have a common prefix, say conf_ (so your linker script can collect them in a single super-section if so desired, putting them all consecutively in the final firmware), you can easily manipulate their contents using objcopy on the ELF object file, just before converting it to a hex file; it uses the name of the section, making this easier than working with addresses. You can even create a simple utility (say, Python with a nice configuration GUI) to do this (running an external objcopy command to update the values, and another utility to convert it to hex, plus a third one like avrdude to actually upload/flash it to the target device).

cv007, you might wish to switch to
#define MY_I2C_ADDR (*(const volatile uint8_t*)(USER_SIGNATURES_START))
because const volatile promises your compiler you are not trying to modify the referred to value, but also that it might not be what the code the compiler sees might imply and should therefore not cache any accesses to it or try to infer its value, so it should do an actual read from that address whenever the expression is evaluated. While currently address dereferences tend to behave as if they were declared volatile, it is better to be explicit, in my opinion; and it tells us humans reading the code that that value right there might not be what the code itself implies.

cv007 · « **Reply #32 on:** March 08, 2024, 05:00:20 am »

Quote

cv007, you might wish to switch to
#define MY_I2C_ADDR (*(const volatile uint8_t*)(USER_SIGNATURES_START))

It doesn't do anything useful in this case. The programmed value is a one time thing and is not changing, the variable symbol is no longer around, and we are not even referring to that symbol so the compiler cannot optimize away to an inline constant value. The compiler will have to read that location at least once, and if it wants to cache that value afterwards it makes no difference as it's not changing.

If the value was changing, then volatile would be in order (volatile+const) to make sure the compiler reads the location regardless of what it thinks it knows about the value. The const would prevent attempts to write to the var via compiler error, and is a good idea to prevent corrupting the page buffer in this case. Can still change/write this var, but it will be through a process the compiler will not be able to see through (hence the volatile).

Quote

If you use a dedicated section for each configuration variable

No need to complicate when only 1 byte is being used. It will be at the first location in the section that was named, and that address is known. If one wanted to store multiple values, then a single struct/array would be the simple way to go with no need to touch a linker script, and the same applies-

typedef struct { uint8_t addr; uint8_t addr2; } twi_info_t;
__attribute__((section (".user_signatures"))) twi_info_t my_twi_info = { 0x44, 0 }; //0=no second address

Again, only 1 var in the section we named, and will be at a known address and the linker will not be shuffling the order of these values which it could otherwise do. Usage pretty much as before-

Code: [Select]

#include <avr/io.h>
typedef struct { uint8_t addr; uint8_t addr2; } twi_info_t;
#define MY_I2C_INFO (*(const twi_info_t*)(USER_SIGNATURES_START))
int main() {
    TWI0.SADDR = MY_I2C_INFO.addr<<1; //forgot the shift in earlier example
    if( MY_I2C_INFO.addr2 ) TWI0.SADDRMASK = (MY_I2C_INFO.addr2<<1) | 1;
    while (1){}

}

Oaklander · « **Reply #33 on:** March 08, 2024, 07:00:00 am »

Quote from: cv007 on March 07, 2024, 11:55:42 pm

Quote
And unless the entire EEPROM gets corrupted there would be recoverable data.
Why do you think eeprom will get corrupted.

Because I've seen it happen.

And I was at first just going to enable brown-out detection. All the further measures with checksums and redundancy were suggested by replies in this thread.

Psi · « **Reply #34 on:** March 09, 2024, 04:49:05 am »

Quote from: Oaklander on March 08, 2024, 07:00:00 am

Quote from: cv007 on March 07, 2024, 11:55:42 pm
Quote
And unless the entire EEPROM gets corrupted there would be recoverable data.
Why do you think eeprom will get corrupted.
Because I've seen it happen.

And I was at first just going to enable brown-out detection. All the further measures with checksums and redundancy were suggested by replies in this thread.

I've seen corruption happen even with brown-out active if the device is running on a battery that's so flat the brown-out-reset keeps reoccurring in a constant loop.

Boot-up code does a whole bunch of shit, like starting up the flash/eeprom module. Devices doing some flash/eeprom writes during bootup is pretty common.
If a brown-out-reset occurs just after the write starts but before it has finished you can have issues.

One way to prevent that is to ensure something with high current draw happens as the very first thing on boot, like an LED flash,
That way it will brown out reset before it gets to writing flash/eeprom code.
If it can get past an LED flash without brown-out-reset then it's probably got enough energy in the battery to safely finish a flash/eeprom write during init.

Siwastaja · « **Reply #35 on:** March 09, 2024, 12:11:34 pm »

Quote from: Psi on March 09, 2024, 04:49:05 am

One way to prevent that is to ensure something with high current draw happens as the very first thing on boot, like an LED flash,
That way it will brown out reset before it gets to writing flash/eeprom code.
If it can get past an LED flash without brown-out-reset then it's probably got enough energy in the battery to safely finish a flash/eeprom write during init.

While I applaud such mitigation strategies, they do not solve the underlying problem, just try to make it more unlikely by controlling probabilities. (This is analogous to how a switch mode converter blowing up because of inrush current into load capacitance can be made to work with a soft-start circuit which ramps up the reference voltage, but it is not a true solution; it only patches the most obvious corner case of self-destruction; the real solution is to add current sensing and terminate switching cycles on overcurrent; well, all modern switcher ICs do both; but it's important to realize which one of the features is fundamentally important and which is just a convenience feature.)

The only really correct way, and a pretty simple one, is to accept the fact that EEPROM or flash operations happening during power glitches, which are also a fact of life, will corrupt some part of memory, and to add some sort of double buffering scheme which makes it completely bulletproof. For example, pick two sections that belong to different erase units. Modify/erase-write one, and include a CRC. For the next write, use the other section. Now if there is a power failure during the write, the older section is still intact, and the code reading it detects this from the valid CRC. Writing can then be retried as many times as required. Data corruption simply never happens, no matter how much power glitching you apply.
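
A host-side sketch of such a two-slot scheme, under several assumptions that are mine rather than anything prescribed above: each slot holds a sequence counter, one payload byte and a CRC-8 (polynomial 0x07, initial value 0xFF), `sim_slots` is a RAM stand-in for two separate erase units, and all names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t seq;   /* incremented on every save */
    uint8_t value; /* the payload */
    uint8_t crc;   /* CRC-8 over seq and value */
} slot_t;

static slot_t sim_slots[2]; /* stand-in for two separate erase units */

static uint8_t crc8_step(uint8_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int bit = 0; bit < 8; bit++) {
        crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u)
                            : (uint8_t)(crc << 1);
    }
    return crc;
}

static uint8_t slot_crc(const slot_t *s)
{
    return crc8_step(crc8_step(0xFFu, s->seq), s->value);
}

static bool slot_valid(const slot_t *s)
{
    return slot_crc(s) == s->crc;
}

/* Index of the valid slot with the most recent counter, or -1 if none. */
static int newest_valid_slot(void)
{
    bool v0 = slot_valid(&sim_slots[0]);
    bool v1 = slot_valid(&sim_slots[1]);
    if (v0 && v1) {
        uint8_t diff = (uint8_t)(sim_slots[1].seq - sim_slots[0].seq);
        return (diff != 0u && diff < 128u) ? 1 : 0; /* wrap-aware compare */
    }
    if (v0) return 0;
    if (v1) return 1;
    return -1;
}

/* Always writes to the slot that is NOT the current good one, so the
 * previous data survives an interrupted write. */
static void save_value(uint8_t value)
{
    int cur = newest_valid_slot();
    int target = (cur == 0) ? 1 : 0;
    slot_t next;
    next.seq = (cur < 0) ? 0u : (uint8_t)(sim_slots[cur].seq + 1u);
    next.value = value;
    next.crc = slot_crc(&next);
    sim_slots[target] = next; /* on hardware: byte-wise EEPROM writes */
}

static bool load_value(uint8_t *out)
{
    int cur = newest_valid_slot();
    if (cur < 0) return false;
    *out = sim_slots[cur].value;
    return true;
}
```

If power fails partway through `save_value()`, the target slot ends up with a mismatching CRC and `load_value()` keeps returning the previous value until a later save completes.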

Psi · « **Reply #36 on:** March 09, 2024, 12:19:09 pm »

Quote from: Siwastaja on March 09, 2024, 12:11:34 pm

Quote from: Psi on March 09, 2024, 04:49:05 am
One way to prevent that is to ensure something with high current draw happens as the very first thing on boot, like an LED flash,
That way it will brown out reset before it gets to writing flash/eeprom code.
If it can get past an LED flash without brown-out-reset then it's probably got enough energy in the battery to safely finish a flash/eeprom write during init.
While I applaud such mitigation strategies, they do not solve the underlying problem, just try to make it more unlikely by controlling probabilities.

Unless you use a radiation hardened MCU, and write overly complex fault tolerant code, you're always one cosmic ray away from your device shitting the bed.
It's always controlling probabilities

But yes, point taken.

Siwastaja · « **Reply #37 on:** March 09, 2024, 12:49:06 pm »

Yeah, of course, you are technically correct, the best kind of correct. I am working on an abstraction level which, on purpose, ignores cosmic rays.

But the probability of a user just turning the thing off in the middle of an update is many orders of magnitude higher than a cosmic ray flipping a bit on the surface of the Earth. But yeah, consider things like metastability: someone has decided that for a certain design, two flip flops suffice for synchronization. Expected bit error rates can actually be calculated! DRAM sees a lot of errors actually in comparison, even ECC DRAM fails every x years in one of a million computers.

Our (as in, developers of commercial / industrial grade hardware and software, not aerospace) abstracted mental model of a microprocessor and memory is, though, that this kind of random corruption does not happen; in other words, any corruption is because we made a mistake in hardware (e.g. signal integrity), software (plain old good bug), or there is a silicon bug (common in microcontrollers). We put our efforts into finding and minimizing these kinds of issues instead.

Internal buses of microprocessors like AVR or STM32, or personal computers, are not protected in any way. Digital interfacing to external components is rarely checksummed at all (sometimes it is). We just... assume these failures never happen, because it is actually very demanding to do otherwise (people underestimate the complexity of e.g. rad hard software practices). And if I sell a million units and they run for 10 years non-stop, then it's likely one unit at one customer failed once due to us ignoring random bit flips.

On the other hand, if I fail to buffer and checksum flash erase-writes, which possibly take seconds, all I need is 500 customers and 1 day, and someone has already applied power for 2 seconds and then unplugged the thing at just the right time, corrupting the memory.

So yes, you are most definitely right, everything is managing probabilities, and let me add, managing the product: probability * consequence * 1/effort. This is exactly why we don't care about rad hard software practices: the probability is extremely small, consequences are capped in most commercial devices, and the effort to do significantly better is huge.

Oaklander · « **Reply #38 on:** March 09, 2024, 03:51:41 pm »

Now I have implemented the following:
Strictest brown-out detection and 64ms setup time enabled.
All of EEPROM gets programmed by pairs of data byte and inverted data byte for checking. (Might replace this with CRC-8. Not sure yet). The firmware will use the first pair where the data and check digits match.
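
For readers following along, a minimal sketch of how reading such pairs could look. It works on a plain RAM copy of the EEPROM; the function name `read_config` and the buffer-based interface are assumptions for illustration, not the actual firmware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scan an EEPROM image laid out as (data, inverted data) pairs and return
 * the first pair whose two bytes agree. Returns 1 and stores the data byte
 * in *value on success, 0 if no pair is consistent. */
int read_config(const uint8_t *eeprom, size_t len, uint8_t *value)
{
    assert(eeprom != NULL && value != NULL);
    for (size_t i = 0; i + 1 < len; i += 2) {
        if ((uint8_t)~eeprom[i] == eeprom[i + 1]) {
            *value = eeprom[i];
            return 1;
        }
    }
    return 0;
}
```

A side effect of the inverted check byte is that a fully erased pair (0xFF, 0xFF) and a zeroed pair (0x00, 0x00) both fail the check, so they are skipped rather than read as data.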

There are no writes to the EEPROM during normal operation. I did however retain the EEPROM programming over the serial interface. There are several safeguards in place. Before programming, two special commands must be sent in the correct order to set two variables to true. The actual programming isn't possible unless they are both true. The data gets sent with a check digit, and after programming the EEPROM is read back to check that it is correctly programmed.
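
A rough host-side sketch of that kind of guard, with everything hypothetical: the command codes `CMD_UNLOCK_A`/`CMD_UNLOCK_B`, the RAM array standing in for the EEPROM, and the choice to re-lock after every pair are assumptions for illustration, not the actual firmware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EEPROM_SIZE 64

/* Hypothetical unlock command codes. */
enum { CMD_UNLOCK_A = 0xA5, CMD_UNLOCK_B = 0x5A };

uint8_t eeprom[EEPROM_SIZE];      /* RAM stand-in for the real EEPROM */
static bool unlock_a, unlock_b;   /* both must be true before a write */

/* Feed one received command byte. Anything out of order clears both flags. */
void handle_command(uint8_t cmd)
{
    if (cmd == CMD_UNLOCK_A && !unlock_a && !unlock_b) {
        unlock_a = true;
    } else if (cmd == CMD_UNLOCK_B && unlock_a && !unlock_b) {
        unlock_b = true;
    } else {
        unlock_a = unlock_b = false;
    }
}

/* Write one (data, inverted data) pair at an even address. The write only
 * happens when both flags are set and the check byte matches; the pair is
 * read back afterwards and the guard is re-armed in every case. */
bool program_pair(size_t addr, uint8_t data, uint8_t check)
{
    assert(addr % 2 == 0);
    bool ok = false;
    if (unlock_a && unlock_b && addr + 1 < EEPROM_SIZE &&
        (uint8_t)~data == check) {
        eeprom[addr] = data;
        eeprom[addr + 1] = check;
        ok = (eeprom[addr] == data && eeprom[addr + 1] == check);
    }
    unlock_a = unlock_b = false;
    return ok;
}
```

With the flags cleared after each pair, a stray jump into the write path can at most touch one pair before the guard has to be satisfied again.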

I guess that the probability of anything happening should be sufficiently low now. My main concern is that the instructions to write to the EEPROM are still in the flash so they could get executed by error. I wonder: if such an error happens, will the write happen to just one EEPROM address or will it loop through the entire EEPROM?

Siwastaja · « **Reply #39 on:** March 09, 2024, 05:34:11 pm »

Quote from: Oaklander on March 09, 2024, 03:51:41 pm

All of EEPROM gets programmed by pairs of data byte and inverted data byte for checking. (Might replace this with CRC-8. Not sure yet). The firmware will use the first pair where the data and check digits match.

This is exactly what I thought about suggesting but somehow it never made it into any of my replies. Using some CRC-8 polynomial could be a theoretical improvement, but real-world corruption of a byte and its inverse such that they still match after inversion is IMHO very unlikely. I think your solution is pretty rock solid, something else fails first.

AndyBeez · « **Reply #40 on:** March 09, 2024, 05:56:42 pm »

You asked if there was any difference between comparing to #FF rather than #00. It's just a developer preference. In assembly languages, testing for zero is a thing. Sometimes this is more efficient when you're counting clock cycles. Otherwise, it's just a geek fight.

cv007 · « **Reply #41 on:** March 09, 2024, 07:18:35 pm »

Quote

My main concern is that the instructions to write to the EEPROM are still in the flash

As stated before, programming eeprom/userrow/flash only at programming time eliminates any eeprom writing code.

Here is what has to happen for the new avr to program eeprom at runtime-
1. the page buffer has to have a byte written to a valid eeprom/userrow address,
2. a specific value has to be written to a specific register (ccp)
3. and within the next 4 cpu cycles a specific nvm command has to be written to a specific nvm register

without any of these coded in flash, it's not happening, but having eeprom writing code inside the mcu means 1/2/3 are happening somewhere.

Additional differences the new avr has, which would apply if you were writing at runtime-
any nvm write is aborted at any reset (old avr will continue the write through a reset, assuming power level is above some point)
the page buffer is cleared at any reset, and an nvm write with a cleared buffer does nothing but produce a write error
any write to an invalid address will not write to the page buffer

Both old/new have to deal with write errors at runtime, and decisions have to be made on what to do when errors occur.

I would guess the biggest source of problems for the original avr (at the 0 address) is the continuation of the write process through resets: bod probably not set and/or startup timer not used, maybe cpu speed is high and still in a voltage range not suitable for the speed. An eeprom write routine is in flash and started, the voltage bounces around the POR threshold where the mcu goes in/out of reset, and the write process continues through these resets. A reset does clear the addr/data registers which the write portion of erase/write now uses, hence the writing of 0 to address 0.

The new avr can also have problems where writing is interrupted, but the write will abort at a reset so it's probably less likely the write was far enough in the process to do 'damage'. Obviously it depends on where these resets take place, and the new avr can still get into a place where the erase part took place but was reset before it did any writes, so it can end up with an erased byte (or bytes since it uses a page buffer). The downside of the new avr with a page buffer is that if you are writing a bunch of bytes (up to page buffer size, typically 32/64 bytes on these avr), and a reset takes place in that process, you may have a lot of corrupt bytes to deal with. The answer to that is probably to just limit your writes so any problems are small/more manageable.

Oaklander · « **Reply #42 on:** March 09, 2024, 07:45:45 pm »

I'm not very concerned about single erroneous writes happening as the EEPROM is full of data. My biggest fear is that somehow the entire EEPROM gets written over. But on every iteration of the writing loop I'm checking that two different variables are true and the EEPROM is written two bytes at a time. So if the program somehow happened to jump inside the loop it shouldn't be able to execute more than once.

Data corruption on write isn't a problem as the data gets verified after the writing. If there is any problem I can just send the write commands again over serial.

AndyBeez · « **Reply #43 on:** March 09, 2024, 08:07:22 pm »

AVR Lock Bits may be relevant to you too

Quote

To protect memory contents from being accidentally overwritten, or from unauthorized reading, the Lock bits can be set to protect the memory contents. As shown in the table below, the memories can either be protected from further writing, or you may completely disable both reading and writing of memories on the chip.

https://onlinedocs.microchip.com/pr/GUID-EEA42155-A873-46B4-9500-D88617041B0B-en-US-2/index.html?GUID-DA9954DA-96B2-46E2-8835-576161B2280B

Oaklander · « **Reply #44 on:** March 09, 2024, 08:24:46 pm »

Quote from: AndyBeez on March 09, 2024, 08:07:22 pm

AVR Lock Bits may be relevant to you too

According to the data sheet, even when the device is locked the CPU still has write access to the EEPROM, so it doesn't really help against corruption. Locking would also make it impossible to update firmware while preserving EEPROM, which I want to be able to do.

cv007 · « **Reply #45 on:** March 09, 2024, 08:37:59 pm »

Quote

AVR Lock Bits may be relevant to you too

That document is referring to an 'old' avr. The LOCKBIT in the new avr is to restrict UPDI access, but does not restrict user code from using the nvm peripheral.

Quote

My biggest fear is that somehow the entire EEPROM gets written over

The 'original' solution is still available, where this byte sits at the end of flash. No eeprom, no userrow, no nvm writing, flash cannot be written at runtime even if you tried as your whole app will be the bootloader section which cannot write to itself. Whether you also consider flash to be as fragile as eeprom/userrow is up to you.

Your app will still be a single app, and the byte value added becomes a programming time problem- depending on what you have for a programmer, can be easily automated (you also have the ability to read the mcu first, preserving previously written value).

I would use userrow, at programming time, which is what it was designed for. Your mcu, your choice, do whatever you want.

Oaklander · « **Reply #46 on:** March 10, 2024, 07:15:48 pm »

I decided to go the programming route after all. I wrote a simple front end for Avrdude which can be used to program the flash, fuses and EEPROM.
Now there are no EEPROM write instructions in the flash so that threat is eliminated.

Xor · « **Reply #47 on:** March 10, 2024, 07:26:29 pm »

Another approach if it’s a one time deal at time of production is to use resistors on the pcb to set the variant … it’s never going to get corrupt then … of course more headache at production time

Psi · « **Reply #48 on:** March 11, 2024, 06:55:46 am »

Quote from: Xor on March 10, 2024, 07:26:29 pm

Another approach if it’s a one time deal at time of production is to use resistors on the pcb to set the variant … it’s never going to get corrupt then … of course more headache at production time

True.

OP said it was a single 8bit integer, so he could set that using two HEX-Digit rotary dip switches.
Assuming he has enough spare GPIO.
Could reduce the inputs needed using analog input and resistors.

peter-h · « **Reply #49 on:** March 12, 2024, 04:43:57 pm »

The old 90S1200 AVR would corrupt the 1st EEPROM address.

No idea if the successor (2313) fixed that. I am still working through a huge stock of the 90S1200

kevin.gibbs · « **Reply #50 on:** April 08, 2024, 01:42:36 pm »

So your conditions are different. In the case of Oaklander, the data is initially written once and is not changed further. Therefore, it makes sense to write EEPROM when programming the chip and exclude writing in the program.

If you want to change values along the way, I would start 2 groups of 3 identical records and switch between them.
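
One possible way to read a group of three identical records is a bitwise majority vote, sketched below. The voting itself and the back-to-back layout of the copies are assumptions made here for illustration; the post only proposes keeping the copies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise majority of three bytes: each output bit takes the value that at
 * least two of the three copies agree on. */
uint8_t vote3(uint8_t a, uint8_t b, uint8_t c)
{
    return (uint8_t)((a & b) | (a & c) | (b & c));
}

/* Rebuild one record of `len` bytes from three copies stored back to back
 * in `group` (3 * len bytes in total). */
void read_group(const uint8_t *group, size_t len, uint8_t *out)
{
    assert(group != NULL && out != NULL);
    for (size_t i = 0; i < len; i++)
        out[i] = vote3(group[i], group[len + i], group[2 * len + i]);
}
```

This recovers the record as long as no bit position is damaged in more than one copy. Switching between the two groups on each update would work the same way as alternating between two sections.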

To Oaklander: I would also disable EEPROM erase. That way, the hardware configuration will remain in place when the firmware is updated.

