Topic: Failure mode/probability of micro-controller sleep/standby/etc modes?

Offline globoy (Topic starter)

  • Regular Contributor
  • *
  • Posts: 208
  • Country: us
Perhaps this is too general a question (e.g. it depends on the particular micro-controller), but I am curious about the ways a micro-controller can fail to come out of a sleep state, and how probable those failures are.  For example, some flip-flop gets flipped and the device no longer detects the wake-up condition.  Does anyone have any experience with this?

More detailed backstory: I'm doing a Failure Analysis (FA) for a consumer product that I am helping design (electronics + firmware).  The device is battery powered and probably will sit around for quite a bit of time so minimizing quiescent current consumption is important.  Of course cost is also important.  I am using a STM32F030 value line micro and its lowest-power, most disabled standby mode to cut power (just it and an ultra-low quiescent current LDO).  Two wakeup pins are used to detect a button press or the start of battery charging.  It all works just fine - in my lab - but I want to know how possible it is that it won't wake up.  This would be a catastrophic event and to the user the device would simply appear to be broken.

It seems to me that the probability of this kind of failure must be pretty low, since I'm sure a huge number of devices out there use this strategy.  What makes me wonder is that I have remote controls that sometimes seem to stop working completely until their AA or AAA batteries are removed and re-inserted.  That seems like it could be a hung micro.

If this is a real risk then I could do things like add an external watchdog and wake up the micro every so often to tickle the watchdog, lest it issue a reset.  That's an additional cost.  Or I could use the low-going status signal from the charger IC to reset the micro through a capacitor, so users could recover a device by plugging it into a charger.  But this has some non-ideal side-effects.
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11387
  • Country: us
    • Personal site
Are you actually doing FA, as in there is already a failure? Or are you just trying to predict what may go wrong? Trying to predict failure modes like this is a waste of time. You will never guess what may go wrong when things go wrong. You may be running into some unique silicon bugs that only apply to your specific use case. There is no way to predict that.

If it is really critical, provision as many watchdogs and recovery methods as acceptable by the design and the price and don't worry about it. When the failures happen, you will have to investigate and address them as they show up.

And as you assume more and more things that may fail in theory, your whole system runs a risk of turning into a mess of watchdogs, which in turn may be the source of the issues itself.
Alex
 

Offline voltsandjolts

  • Supporter
  • ****
  • Posts: 2330
  • Country: gb
What is the worst case scenario if your system fails in some way?
Is there a risk of injury, death or just some financial loss?
 

Offline globoy (Topic starter)

  • Regular Contributor
  • *
  • Posts: 208
  • Country: us
I am contributing to a predictive FA being spearheaded by the client.  The failure mode of the micro not coming out of standby mode is that the device appears dead to the customer and the cost to the client could be a warranty expense and reputation hit.  It's not life threatening or risky in that way at all (it's just a consumer gadget that would look like it died).   I understand the risk is low and the cost is probably low too.  However it's an item on the FA list so I'm just seeing if people have experience they could share.  I did some online research without finding much helpful info so I turned to the brain trust here to see what experience you all have.

I will probably do what Alex suggested and use either the IWDG or RTC wake-up facility to try to wake up the device periodically while it is asleep so it can reset the various register states (on the assumption that would eliminate some possible failure modes) and then return to standby.  This can be done infrequently enough so as not to impact the sleep power budget significantly, costs me nothing but a little time, and it's simple enough that it's probably not a big risk of introducing more bugs.
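
To put a rough number on "infrequently enough", here is a minimal sketch of the duty-cycle arithmetic for a periodic wake-up. Every figure in it (standby current, run current, time awake, wake period) is an illustrative assumption, not a value from this thread or from the STM32F030 datasheet; substitute measured numbers.

```python
def average_current_ua(standby_ua, active_ma, active_ms, period_s):
    """Time-weighted average current (in microamps) for a periodic wake-up."""
    active_s = active_ms / 1000.0
    if not 0 <= active_s <= period_s:
        raise ValueError("active time must fit inside the wake period")
    active_ua = active_ma * 1000.0
    # Weight each current by the time spent drawing it, then normalise.
    charge = active_ua * active_s + standby_ua * (period_s - active_s)
    return charge / period_s


# Hypothetical numbers: 5 uA in standby, 5 mA for 10 ms once per hour.
baseline_ua = 5.0
with_wake_ua = average_current_ua(
    baseline_ua, active_ma=5.0, active_ms=10.0, period_s=3600.0
)
print(f"average: {with_wake_ua:.4f} uA")
print(f"overhead: {with_wake_ua - baseline_ua:.4f} uA")
```

With these assumed numbers the hourly wake-up adds roughly 0.014 µA to the average, which is why the wake period, rather than the work done while awake, is usually the knob that matters.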
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 27245
  • Country: nl
    • NCT Developments
If you can run the device from an internal RC oscillator, this would eliminate the possibility of an external crystal failure (or the oscillator not starting due to being outside its temperature range). Having the device wake up regularly using the RTC timer is also helpful so it can re-init the wake-up conditions.

It is also a good idea to test your device at temperature extremes in a climate chamber and subject the device to external disruptions like ESD and radiated immunity. Personally I like to know what a device can take before it starts to misbehave. The limits for consumer goods are quite low when it comes to ESD and radiated immunity. In the real world a device can be subjected to much more mayhem. The radiated immunity level for consumer devices to pass EMC testing is typically 3V/m but I like my designs to keep working at 30V/m. It helps to reduce complaints & returns from consumers.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Online DavidAlfa

  • Super Contributor
  • ***
  • Posts: 6039
  • Country: es
Quote from: globoy
"it's just a consumer gadget that would look like it died"
Likely this would be 99.99% a design or programming mistake, not a cosmic ray flipping a bit.
Hantek DSO2x1x            Drive        FAQ          DON'T BUY HANTEK! (Aka HALF-MADE)
Stm32 Soldering FW      Forum      Github      Donate
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 27245
  • Country: nl
    • NCT Developments
Quote from: globoy
"it's just a consumer gadget that would look like it died"
Quote from: DavidAlfa
"Likely this would be 99.99% a design or programming mistake, not a cosmic ray flipping a bit."
Cosmic rays are not likely but you can group events like ESD, radiation and electric fields together as potential culprits, as these can cause weird things to happen with a microcontroller.

Once I had an issue with a device in which the microcontroller would lock up when the device was placed right next to a 65kVAC system which arced over every now and then, even though the unit was inside a fully enclosed sheet metal box with only a few small holes. It turned out the grounding of the circuit board needed to be organised a little bit differently in order to fix it.
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14701
  • Country: fr
As said above.

Use a (or several as suggested by ataradov if it's really critical) watchdog at the minimum, and make it wake up the MCU on a regular basis (pick the period as a good compromise between safety and power draw), even if most of the time, when it wakes up, it will just go back to sleep.

I would not recommend, OTOH, putting an MCU into the lowest power mode with even the lowest-power internal oscillator (feeding the watchdog) disabled, and no guaranteed way to wake up the MCU except for conditions that you may not have any hard guarantee about.
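
As a rough aid for picking that period, the sketch below computes the timeout of an STM32-style independent watchdog (IWDG) from its prescaler and 12-bit reload value. It assumes the usual relation timeout = prescaler × (reload + 1) / f_LSI and a nominal 40 kHz LSI; the 30 kHz and 50 kHz corner frequencies are assumed tolerance limits for illustration, so check the datasheet for the actual part.

```python
IWDG_PRESCALERS = (4, 8, 16, 32, 64, 128, 256)
IWDG_MAX_RELOAD = 0xFFF  # 12-bit reload register


def iwdg_timeout_s(prescaler, reload, lsi_hz=40_000):
    """Watchdog timeout in seconds for a given prescaler and reload value."""
    if prescaler not in IWDG_PRESCALERS:
        raise ValueError(f"unsupported prescaler: {prescaler}")
    if not 0 <= reload <= IWDG_MAX_RELOAD:
        raise ValueError(f"reload out of range: {reload}")
    return prescaler * (reload + 1) / lsi_hz


# Longest possible timeout at the nominal and the assumed corner frequencies.
nominal_s = iwdg_timeout_s(256, IWDG_MAX_RELOAD)
fastest_s = iwdg_timeout_s(256, IWDG_MAX_RELOAD, lsi_hz=50_000)
slowest_s = iwdg_timeout_s(256, IWDG_MAX_RELOAD, lsi_hz=30_000)
print(f"nominal {nominal_s:.2f} s, range {fastest_s:.2f} s to {slowest_s:.2f} s")
```

Under these assumptions the longest watchdog period is a few tens of seconds, so a wake interval of minutes or hours would have to come from the RTC instead, and any deadline has to be budgeted against the fastest-oscillator case.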
 

Offline jnk0le

  • Regular Contributor
  • *
  • Posts: 56
  • Country: pl
Could be transient voltage glitches due to the change in current draw when entering/exiting sleep states.

Had a CH32V003 with just a 100 nF cap crashing (illegal instruction exception) due to those wakeups. (They happened every 1 ms, so it was reproducible in less than 30 min.)
The issue was solved by adding the electrolytic cap that was required for the relay anyway. (added later)
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8251
  • Country: fi
Quote from: globoy
"The failure mode of the micro not coming out of standby mode is that the device appears dead to the customer and the cost to the client could be a warranty expense and reputation hit."

This will happen anyway, and 99.9% likely it will not be caused by microcontroller peripherals missing wakeup conditions, but instead by a good old bug in your codebase (or just maybe a HW design issue like lacking ESD protection, but even that is less likely than a software bug). In code, there will be gazillions of complex and weird interactions - especially if you have to do over-the-air firmware updates.

Invest time in making the whole code base as robust as possible, add logging, recovery mechanisms etc., and set up a way for customer service to file bug reports so that once something happens, you can react quickly. That way you will have one disgruntled customer instead of a thousand.

This is where companies usually fail: a total and utter communication gap between end customer and R&D. Invest your resources to prevent this. If you outsource your customer care to basically human bot companies, you have lost the game. They need to have at least a very crude understanding of your product and access to your ticket system.
 

Offline Perkele

  • Regular Contributor
  • *
  • Posts: 56
  • Country: ie
Here's an overkill.
If you don't trust the MCU, then an external watchdog triggered by the same push-buttons that you're using to wake-up the MCU.
If the MCU does not wake-up it gets reset. If it does wake-up it then disables the watchdog.
A watchdog with a standby current of about 10 µA should cost you about 50 to 70 cents each in quantities of 1000.
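
For scale, a quick sketch of what a constant 10 µA draw costs in battery charge. The 500 mAh capacity and the 5 µA baseline standby current are hypothetical placeholders, and the calculation ignores battery self-discharge.

```python
HOURS_PER_YEAR = 24 * 365


def charge_used_mah(current_ua, hours):
    """Charge consumed (in mAh) by a constant current over a number of hours."""
    return current_ua / 1000.0 * hours


def shelf_life_years(capacity_mah, total_standby_ua):
    """Years until a battery is drained by standby current alone."""
    return capacity_mah / charge_used_mah(total_standby_ua, HOURS_PER_YEAR)


# The watchdog's 10 uA on its own, over one year.
watchdog_mah_per_year = charge_used_mah(10.0, HOURS_PER_YEAR)

# Hypothetical 500 mAh battery with a 5 uA baseline, without and with the watchdog.
without_wdt = shelf_life_years(500.0, 5.0)
with_wdt = shelf_life_years(500.0, 5.0 + 10.0)
print(f"watchdog alone: {watchdog_mah_per_year:.1f} mAh per year")
print(f"shelf life: {without_wdt:.1f} years without, {with_wdt:.1f} years with")
```

If the rest of the design idles at only a few microamps, an extra 10 µA can dominate the standby budget, which is a separate consideration from the part cost.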
 

Online DavidAlfa

  • Super Contributor
  • ***
  • Posts: 6039
  • Country: es
Quote from: nctnico
"Once I had an issue with a device in which the microcontroller would lock up when the device was placed right next to a 65kVAC system which arced over every now and then."
Of course, but this is a kinda special scenario, most electronics will behave in funny ways here.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8251
  • Country: fi
Quote from: Perkele
"If you don't trust the MCU, then an external watchdog triggered by the same push-buttons that you're using to wake-up the MCU."

Then what if the external watchdog chip has a silicon fault so that it gets stuck, resetting the CPU so that it never wakes up? I mean, this scenario has roughly the same probability as the MCU silicon having a fault that stops the wakeup sources from waking it up.

As ataradov points out in his first reply, trying to prepare for rare hardware failures is a futile task, and the risk that the mitigation strategy adds another point of failure is very real.
 

Offline globoy (Topic starter)

  • Regular Contributor
  • *
  • Posts: 208
  • Country: us
Thanks everybody.  Good input.  Sounds like this is something people really haven't seen happening.

Probably I'm not going to add extra WDT functionality, mainly for cost reasons (50-70 cents is significant to the hardware budget).  I'll use the built-in IWDG peripheral while the code is running and then have the RTC, running from the low-speed RC oscillator, wake the device up periodically so that it can reset all the registers and then go back to standby.  There is ESD protection and I'll probably add some slight filtering to the wakeup-generating signals to reduce spurious wakeups due to EMI (in my experience the transition from sleep to running and back again is an area where strange bugs can manifest).
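
For sizing that "slight filtering", a minimal sketch of first-order RC low-pass arithmetic. The 10 kΩ and 100 nF values and the 70% threshold are hypothetical; the real input thresholds of the wakeup pins come from the datasheet.

```python
import math


def rc_time_constant_s(r_ohm, c_farad):
    """Time constant tau = R * C of a first-order RC filter."""
    return r_ohm * c_farad


def rc_cutoff_hz(r_ohm, c_farad):
    """-3 dB cutoff frequency, 1 / (2 * pi * R * C)."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


def time_to_fraction_s(r_ohm, c_farad, fraction):
    """Time for a step input to charge the capacitor to a fraction of its final value."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be between 0 and 1 (exclusive)")
    return -r_ohm * c_farad * math.log(1.0 - fraction)


# Hypothetical filter: 10 kOhm in series, 100 nF to ground.
R_OHM, C_FARAD = 10e3, 100e-9
tau_s = rc_time_constant_s(R_OHM, C_FARAD)
cutoff_hz = rc_cutoff_hz(R_OHM, C_FARAD)
t70_s = time_to_fraction_s(R_OHM, C_FARAD, 0.70)
print(f"tau = {tau_s * 1e3:.2f} ms, cutoff = {cutoff_hz:.1f} Hz")
print(f"time to reach 70% of the step = {t70_s * 1e3:.2f} ms")
```

The trade-off is that the same delay applies to a genuine button press, so the time constant has to stay well below the shortest press the product should register.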

We are spending a bit of effort on FW validation because this isn't an IOT device.  It's once&done programming-wise.  And the user base is definitely not the kind to even want to consider this device has a computer in it.  They're not us  :D.  Fortunately the firmware isn't that complex and I'm trying to use reasonable deeply embedded programming ideology (self-test, minimal interrupts (2), non-blocking code, static memory allocations, stack analysis, timing analysis, small modules, state-machine based decision making, some run-time state/value validation, etc).  I'm even thinking of some sort of fuzzing hardware test for the device going in and out of standby.
« Last Edit: May 26, 2024, 03:43:59 pm by globoy »
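
One way to picture the standby fuzzing idea is a host-side model: feed random event sequences into the power state machine and check invariants after every step. The sketch below is only that, a model run on a PC, not the hardware test itself. The states, events and transitions are hypothetical and are not taken from globoy's firmware.

```python
import random

STATES = ("STANDBY", "RUNNING", "CHARGING")
EVENTS = ("button", "charger_on", "charger_off", "idle_timeout", "rtc_tick")


def step(state, event):
    """Return the next state of a hypothetical power state machine."""
    if state == "STANDBY":
        if event == "button":
            return "RUNNING"
        if event == "charger_on":
            return "CHARGING"
        return "STANDBY"  # e.g. rtc_tick: re-arm wake-up sources, sleep again
    if state == "RUNNING":
        if event == "charger_on":
            return "CHARGING"
        if event == "idle_timeout":
            return "STANDBY"
        return "RUNNING"
    if state == "CHARGING":
        if event == "charger_off":
            return "STANDBY"
        return "CHARGING"
    raise ValueError(f"unknown state: {state}")


def fuzz(iterations, seed):
    """Drive the model with random events and check invariants at every step."""
    rng = random.Random(seed)  # seeded so that a failing run can be replayed
    state = "STANDBY"
    visited = {state}
    for _ in range(iterations):
        state = step(state, rng.choice(EVENTS))
        assert state in STATES
        if state == "STANDBY":
            # Invariant: both wake-up sources must always leave standby.
            assert step(state, "button") == "RUNNING"
            assert step(state, "charger_on") == "CHARGING"
        visited.add(state)
    return visited


visited_states = fuzz(iterations=10_000, seed=1)
print(sorted(visited_states))
```

On real hardware the same idea would need a fixture that injects button and charger edges at random times around standby entry and exit; the model only helps decide which invariants are worth checking.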
 

