I've got a weird situation going on here.
Several of my ongoing projects involve MEMS sensors. I'm qualifying a new (to me) one from ST, the LSM6DS3, which is very popular and deeply stocked at multiple large distributors, so it evidently ships in volume. In other words, it's not some obscure part that nobody's heard of.
When I powered up the first two proto PCBs with this part, the LSM kept failing. It would take anywhere from 30 minutes to 2 hours, but within 2 hours you could pretty much guarantee a failure. By "fail" I mean the part would change operational modes with no external stimulus. For example, if the part was configured (via SPI) for an update rate of 400 Hz, after a while it would switch to (say) 12.5 Hz. You could see this on a scope. The even weirder aspect was that reading the configuration registers back out showed the PROPER VALUES for 400 Hz (or any other rate, I tried several) were still in the registers. And yes, I tried rewriting those proper values on the chance that this would cause the chip to re-evaluate them and go back to proper operation - but no luck, the chip stayed in its self-configured mode. The only way to restore proper operation was to invoke its "soft reset" SPI command or power cycle the device.
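Since the registers read back correctly even while the part is running at the wrong rate, the only reliable tell is the timing of the data itself. Here is a minimal sketch of that kind of check, assuming the firmware can timestamp each data-ready event with a free-running microsecond counter. Every name in it (`odr_watchdog_t`, `odr_watchdog_sample`, and so on) is made up for illustration and is not part of any ST driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical output-rate watchdog; names are illustrative only. */
typedef struct {
    uint32_t expected_period_us; /* e.g. 2500 us for a 400 Hz update rate */
    uint32_t tolerance_pct;      /* allowed deviation from nominal, percent */
    uint32_t last_timestamp_us;  /* time of the previous data-ready event */
    uint8_t  bad_count;          /* consecutive out-of-tolerance periods */
    uint8_t  bad_limit;          /* report a fault after this many in a row */
    bool     primed;             /* false until the first event is seen */
} odr_watchdog_t;

void odr_watchdog_init(odr_watchdog_t *w, uint32_t expected_period_us,
                       uint32_t tolerance_pct, uint8_t bad_limit)
{
    w->expected_period_us = expected_period_us;
    w->tolerance_pct = tolerance_pct;
    w->last_timestamp_us = 0u;
    w->bad_count = 0u;
    w->bad_limit = bad_limit;
    w->primed = false;
}

/* Call once per data-ready event. Returns true when the measured period
 * has been out of tolerance for bad_limit consecutive events. */
bool odr_watchdog_sample(odr_watchdog_t *w, uint32_t now_us)
{
    if (!w->primed) {
        /* First event: nothing to compare against yet. */
        w->primed = true;
        w->last_timestamp_us = now_us;
        return false;
    }

    /* Unsigned subtraction stays correct across counter wraparound. */
    uint32_t period = now_us - w->last_timestamp_us;
    w->last_timestamp_us = now_us;

    uint32_t slack = (w->expected_period_us * w->tolerance_pct) / 100u;
    bool in_range = (period >= w->expected_period_us - slack) &&
                    (period <= w->expected_period_us + slack);

    if (in_range) {
        w->bad_count = 0u;
        return false;
    }
    if (w->bad_count < w->bad_limit) {
        w->bad_count++;
    }
    return w->bad_count >= w->bad_limit;
}
```

Requiring several bad periods in a row keeps a single late interrupt from triggering a reset. Note this only catches a wrong rate; a sensor that stops producing data-ready events altogether needs a separate timeout.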
There was another error condition where the SPI interface itself became unresponsive and the ONLY way to regain control of the device was to cycle power. Ugh.
I wrote a bunch of failsafe routines to detect and gracefully handle everything but the SPI failure, and over a few days got it to the point where no more failures occurred (at least in the sense that if they did occur, my firmware caught them and dynamically restored proper operation). Then, separately, I discovered that the SPI problem appears to be tied to the internal digital low-pass filters, as if they require too much processing power. This led me to suspect that the internal filters are implemented in firmware, not hardware, and when they hit some edge case they take out the internal CPU. By disabling all possible internal filtering and going back to my own filters in my own firmware, the SPI errors also disappeared. At this point two prototypes ran for a couple of weeks with no errors at all.
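For anyone wanting the shape of that recovery logic, here is a rough sketch of the escalation, not my actual firmware. It assumes two health inputs per check: whether the SPI bus still answers (for instance, an ID register reads back correctly) and whether the output rate looks right. It also assumes the board can cut power to the sensor, say through a load switch, which is a hardware assumption on my part. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical recovery escalation; names are illustrative only. */
typedef enum {
    ACT_NONE,        /* sensor healthy, nothing to do */
    ACT_SOFT_RESET,  /* issue the soft reset command, then reconfigure */
    ACT_POWER_CYCLE, /* cut and restore sensor power, then reconfigure */
    ACT_GIVE_UP      /* retries exhausted, flag the unit as faulty */
} recovery_action_t;

typedef struct {
    uint8_t soft_resets_tried;
    uint8_t power_cycles_tried;
    uint8_t max_soft_resets;
    uint8_t max_power_cycles;
} recovery_state_t;

/* Decide the next recovery action from the latest health check.
 * The caller performs the action and then re-checks the sensor. */
recovery_action_t recovery_step(recovery_state_t *s, bool bus_alive,
                                bool rate_ok)
{
    if (bus_alive && rate_ok) {
        /* Healthy again: forget earlier attempts. */
        s->soft_resets_tried = 0u;
        s->power_cycles_tried = 0u;
        return ACT_NONE;
    }

    /* A soft reset is only worth trying while the bus still answers. */
    if (bus_alive && s->soft_resets_tried < s->max_soft_resets) {
        s->soft_resets_tried++;
        return ACT_SOFT_RESET;
    }

    /* Bus dead, or soft resets did not help: escalate to power. */
    if (s->power_cycles_tried < s->max_power_cycles) {
        s->power_cycles_tried++;
        s->soft_resets_tried = 0u; /* fresh start after power-up */
        return ACT_POWER_CYCLE;
    }

    return ACT_GIVE_UP;
}
```

Keeping the decision as a pure function of the health flags makes it easy to exercise on a PC without hardware. Rewriting the configuration registers is deliberately not a recovery step here, since that did nothing on my boards.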
(I'd like to add that every MEMS device I've worked with has weirdnesses like this. They're finicky, and their poor documentation doesn't help. I don't know why they all seem to be like this, but they're simply a PITA to get running. I feel like I could write an aftermarket cheat sheet on the ones I've sleuthed through. However, they're darned useful so I just budget extra time whenever I start working with a new one, knowing I'll be basically writing my own appnote to get them going.)
Yesterday, I fired up a fresh proto PCB and left it running overnight. By early this morning, it had entered one of the failure modes! Since my firmware can handle anything but the SPI lockup, all evidence points to the LSM6DS3's internal firmware having locked up again. I didn't have the unit instrumented (I thought we were past this) so I couldn't dig any deeper beyond knowing that my recovery code couldn't handle it, and it handles everything I know about except the SPI failure. Cycling power instantly fixed it, exactly as with the first two units, which further supports the SPI failure theory. Since then it's been running flawlessly for about 8 hours so far today.
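In hindsight, even minimal instrumentation would have told me which failure this was. Something like the fault log below is what I have in mind: a small ring buffer in RAM that records when each fault was detected and what the firmware did about it. This is a sketch only; the record fields, the capacity of 8, and all names are arbitrary choices for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define FAULT_LOG_CAPACITY 8u /* power of two keeps indexing consistent */

/* Hypothetical fault record; field meanings are up to the application. */
typedef struct {
    uint32_t uptime_s;   /* seconds since boot when the fault was seen */
    uint8_t  fault_code; /* e.g. 1 = wrong rate, 2 = bus unresponsive */
    uint8_t  action;     /* what the recovery code did in response */
} fault_record_t;

typedef struct {
    fault_record_t records[FAULT_LOG_CAPACITY];
    uint32_t total; /* faults logged since boot, including overwritten */
} fault_log_t;

/* Append a record, overwriting the oldest once the buffer is full. */
void fault_log_push(fault_log_t *log, uint32_t uptime_s,
                    uint8_t fault_code, uint8_t action)
{
    fault_record_t *slot = &log->records[log->total % FAULT_LOG_CAPACITY];
    slot->uptime_s = uptime_s;
    slot->fault_code = fault_code;
    slot->action = action;
    log->total++;
}

/* Return the i-th oldest record still held, or NULL if out of range. */
const fault_record_t *fault_log_get(const fault_log_t *log, uint32_t i)
{
    uint32_t kept = (log->total < FAULT_LOG_CAPACITY) ? log->total
                                                      : FAULT_LOG_CAPACITY;
    if (i >= kept) {
        return NULL;
    }
    uint32_t first = log->total - kept;
    return &log->records[(first + i) % FAULT_LOG_CAPACITY];
}
```

A zero-initialized `fault_log_t` is a valid empty log. Because the log lives on the microcontroller rather than in the sensor, it survives a sensor power cycle and can be dumped over a debug port the next morning.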
Thinking about differences between this latest board and the earlier ones, the only one I can find is accumulated run time. They're all running the same firmware, powered by the same supply, flashed by the same programmer, etc. Those first two boards got dozens of hours on them while I was working through the error handling, and their error rate dropped as I discovered more failure modes and wrote firmware to handle them. But maybe those parts were just "burning in" and my code had little to do with it. This latest one had never been out of its antistatic bag before I flashed it and let it run overnight, and within hours it failed such that commanding a reset over SPI would not work (the classic symptom of SPI failure on this part). I've never considered "burn-in" of integrated circuits to be a "thing" before, but this sure feels like it. And these are MEMS devices, which brings in a mechanical aspect that most ICs don't have.
Is there any established industry practice or literature on IC burn-in? Perhaps something specific to MEMS devices?