Author Topic: STM32F4; Watchdog / Failsafe "best practice" (Read 4335 times)

NE666 · « **on:** December 02, 2025, 08:48:27 pm »

For context; whilst I program professionally, my 'swim lane' isn't embedded systems. That's something I've taken to in my spare time and I've therefore limited experience in this domain. i.e., this is very likely to be a noob question, sorry. Since it's a hobby interest, I'm not talking about "life at risk" applications either, although I'd also be interested in any thoughts in that context too.

I'm currently playing with some bare metal C code on an F411RE Nucleo board and 'discovered' something I didn't intuit from my experience writing software for hosted environments. In the latter, I'm accustomed to having (very simple) program execution terminate if main() should return. This doesn't seem to be the case, however, on this platform. At least, not entirely. If I've configured on-chip peripherals to use interrupts and I've provided corresponding callback functions to the interrupt vector table, these continue to get called even though my main() function has been aborted (via an assert(0), in this experiment). So, program execution isn't reset but limps on, half living, half dead.

Is this expected behaviour, and is using the hardware watchdog the expected / best practice way in which to handle this type of lock-up?

ataradov · « **Reply #1 on:** December 02, 2025, 10:16:33 pm »

It really depends on what your assert() is doing, but generally, yes, this is expected behavior. There is no exit. Most systems on main() return will just run while (1);

If you want interrupts to be blocked, you need to disable them in the assert() macro. But also, keep in mind that release version may define assert() to do nothing. If you want to terminate for sure and in a way you want, write your own termination function that would place the device into a safe state.

And you can enable watchdogs if you want to reset, of course. But that also depends on the reason for the crash. If this is something that is not rectified by a reset, then you will just be in an infinite reset loop. And that is not always OK, since reset may be controlling I/Os that might affect other parts of the system.
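One way to keep a watchdog from producing the endless reset loop described above is to count consecutive watchdog resets in memory that survives a reset, and to fall back to a safe mode after a few attempts. The following is only a minimal sketch, not tested on hardware. `boot_state_t`, `BOOT_MAGIC`, `MAX_WDG_RESETS`, `boot_decide()` and `boot_mark_healthy()` are made-up names. It assumes the caller already knows whether the last reset came from a watchdog (on STM32 the reset flags in RCC->CSR report this) and that the state lives in a RAM section the startup code does not zero.

```c
#include <stdbool.h>
#include <stdint.h>

#define BOOT_MAGIC     0xB007C0DEu  /* marks the retained state as valid */
#define MAX_WDG_RESETS 3u           /* stop retrying a normal start after this many */

typedef enum { BOOT_NORMAL, BOOT_SAFE_MODE } boot_mode_t;

/* Hypothetical retained state; on target, place it in a no-init RAM section. */
typedef struct {
    uint32_t magic;
    uint32_t wdg_resets;   /* consecutive watchdog resets seen so far */
} boot_state_t;

/* Call once, early in boot. */
boot_mode_t boot_decide(boot_state_t *state, bool reset_was_watchdog)
{
    if (state->magic != BOOT_MAGIC) {      /* power-up: RAM content is arbitrary */
        state->magic = BOOT_MAGIC;
        state->wdg_resets = 0u;
    }
    if (reset_was_watchdog) {
        if (state->wdg_resets < UINT32_MAX) {
            state->wdg_resets++;
        }
    } else {
        state->wdg_resets = 0u;            /* any other reset clears the streak */
    }
    return (state->wdg_resets >= MAX_WDG_RESETS) ? BOOT_SAFE_MODE : BOOT_NORMAL;
}

/* Call after the application has run healthily for a while. */
void boot_mark_healthy(boot_state_t *state)
{
    state->wdg_resets = 0u;
}
```

What "safe mode" means (outputs held inactive, waiting for service, retrying after a long delay) is a per-design decision, as is the threshold.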

pcprogrammer · « **Reply #2 on:** December 03, 2025, 07:52:12 am »

Bare metal embedded programming is completely different from writing software for environments with an OS.

Ask yourself, what is my intention with this embedded device? Most of the time the answer will be to perform a certain task repeatedly, such as controlling a motor or reading sensor input and writing information to a display. All confined within a single device. For this, most of the time there will be no need for watchdog or fail safe mechanisms. Code will be reasonably small and well defined, so that it can be written without errors. Also there is often no feedback path to the user to report something caught with an assert.

For systems where there is danger to life due to heavy equipment being controlled and there being a risk of some external event throwing the process under the bus, yes, one has to think about this, but I doubt it will be handled or caught with an assert.

Designing a fool proof watchdog system is not easy. You have to carefully examine all the processes, including things like interrupts to decide what is going to reset the watchdog timer. And like ataradov wrote it still brings risk of endless reset of the system.

In my opinion, it is something that has to be worked out on a per design basis, where all aspects of the required system are examined in depth to come up with a proper strategy for that system.

peter-h · « **Reply #3 on:** December 03, 2025, 12:33:31 pm »

Watchdog use can be very complex.

Modern chips have clever features on them but I have never used them.

Surprisingly, a lot of safety critical boxes do not have a watchdog. Avionics (some at least) is one example. I suspect they do it because the startup can be very slow (even minutes) and they prefer to have faith in the thing not crashing to start with.

So obviously if you do use a watchdog you need to be sure that it recovers the system to a working state fast enough.

The simplest way to use a watchdog is to pulse it from your "main loop", which might run at say 100ms rate. Some people pulse the watchdog from a timer ISR but that will trip only if your target has been "properly crashed". In my designs I usually pulse from a timer ISR but have a function which disables this and then you pulse it manually from a more sensible place.

pcprogrammer · « **Reply #4 on:** December 03, 2025, 12:58:04 pm »

Quote from: peter-h on December 03, 2025, 12:33:31 pm

The simplest way to use a watchdog is to pulse it from your "main loop", which might run at say 100ms rate. Some people pulse the watchdog from a timer ISR but that will trip only if your target has been "properly crashed". In my designs I usually pulse from a timer ISR but have a function which disables this and then you pulse it manually from a more sensible place.

Having the main loop simply pulse the watchdog is just as bad as doing it in a timer interrupt service routine.

The main loop is the place to do it, but it needs to gather information from other processes to make sure it is ok to pulse the watchdog.

Say, for instance, that there is an interrupt routine that should handle a certain critical task, but for some reason is no longer activated. Assume something cleared the interrupt enable for it. Without checking in some way that this process is still doing what it should, the main loop would never know.

It takes careful examination of the whole system to design a proper fail safe setup. There is no one universal solution for this.

peter-h · « **Reply #5 on:** December 03, 2025, 02:06:38 pm »

Yes, there are modern watchdogs which enable a second condition to come in. This is the 32F4xx one

22.1 WWDG introduction
The window watchdog is used to detect the occurrence of a software fault, usually
generated by external interference or by unforeseen logical conditions, which causes the
application program to abandon its normal sequence. The watchdog circuit generates an
MCU reset on expiry of a programmed time period, unless the program refreshes the
contents of the downcounter before the T6 bit becomes cleared. An MCU reset is also
generated if the 7-bit downcounter value (in the control register) is refreshed before the
downcounter has reached the window register value. This implies that the counter must be
refreshed in a limited window.
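The window rule quoted above can be written down as a small decision function. This is a host-side model for reasoning about the timing, not driver code. `wwdg_refresh_outcome()` and the enum names are invented for illustration, and the boundary handling (a refresh exactly at the window value is accepted) is my reading of the quoted text.

```c
#include <stdint.h>

typedef enum {
    WWDG_REFRESH_OK,        /* inside the window: the counter is reloaded */
    WWDG_RESET_TOO_EARLY,   /* refreshed while the counter is above the window */
    WWDG_RESET_TIMEOUT      /* T6 is already clear: the reset has been generated */
} wwdg_outcome_t;

/* counter: current 7-bit downcounter value, window: 7-bit window value. */
wwdg_outcome_t wwdg_refresh_outcome(uint8_t counter, uint8_t window)
{
    counter &= 0x7Fu;
    window  &= 0x7Fu;

    if ((counter & 0x40u) == 0u) {      /* bit T6 cleared means timeout */
        return WWDG_RESET_TIMEOUT;
    }
    if (counter > window) {             /* has not reached the window yet */
        return WWDG_RESET_TOO_EARLY;
    }
    return WWDG_REFRESH_OK;
}
```

With a window of 0x50, for example, a refresh at 0x7F is too early, anything from 0x50 down to 0x40 is accepted, and at 0x3F the reset has already fired.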

So you need to think carefully what your code is doing and work it all together. It will be totally application specific.

Could get quite interesting in an RTOS environment

Siwastaja · « **Reply #6 on:** December 03, 2025, 04:06:35 pm »

There is no generic answer. You need to use your brain to design something which you think is correct, and document what it does and why you think this is the best way.

Using standard library assert() which then calls standard library abort() is not IMHO a great idea on MCU (is there even abort() available, and if yes, what it does?) - instead, do your own assert, which calls error handler which does whatever you consider best way of recovery - maybe put some IOs in safe states, log an error somehow/somewhere, and after a delay reset the MCU (NVIC_SystemReset()). You can use part of RAM which you choose not to clear in init code (the one which calls main()) to carry such log (e.g., error codes) through reset.

However only resort to this kind of "hard assert" when absolutely necessary (e.g., after you detect a condition which very likely signifies memory corruption). In softer cases, better to return error codes and propagate them to upper levels, or log errors, maybe even trying to do something useful instead of erroring out.
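A custom "hard assert" of the kind described above might look roughly like this. It is a sketch under assumptions rather than anyone's actual code: `APP_ASSERT`, `app_fault()`, `fault_record_t` and the `platform_*` hooks are hypothetical names, and the hooks are stubbed so the example builds on a PC. On target the reset hook would call NVIC_SystemReset() and never return, and the record would sit in RAM that the init code leaves untouched.

```c
#include <stdint.h>

#define FAULT_MAGIC 0xFA17ED00u

/* Hypothetical record; on target, keep it in RAM the startup code does not clear. */
typedef struct {
    uint32_t    magic;
    uint32_t    line;
    const char *file;   /* points at a string literal, normally in flash */
} fault_record_t;

fault_record_t g_fault_record;

/* Platform hooks, stubbed for a host build. */
static int g_reset_requests;
static void platform_outputs_safe(void) { /* e.g. switch off PWM outputs */ }
static void platform_reset(void)        { g_reset_requests++; }

void app_fault(const char *file, uint32_t line)
{
    /* On target: call __disable_irq() here so the ISRs stop running. */
    platform_outputs_safe();
    g_fault_record.file  = file;
    g_fault_record.line  = line;
    g_fault_record.magic = FAULT_MAGIC;   /* written last: record is complete */
    platform_reset();                     /* does not return on real hardware */
}

/* Stays active in release builds, unlike assert() when NDEBUG is defined. */
#define APP_ASSERT(cond) \
    do { if (!(cond)) { app_fault(__FILE__, (uint32_t)__LINE__); } } while (0)
```

After the reset, main() can check `g_fault_record.magic`, report the file and line by whatever channel is available, and clear the magic again.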

Watchdogs are useful for those cases where you after all get infinite loop somewhere, got memory corruption or program counter corruption, forgot to clear interrupt flag and got infinite stream of interrupts, or created some complicated deadlock condition with two concurrent things waiting for each other... Watchdogs are more useful the more you have them. Many MCUs offer multiple, so that you clear them separately, and if any of them triggers then you get a reset. So for example if you have one critical interrupt handler, clear one watchdog there, and another critical processing code somewhere else, let another watchdog guard that.

SteveThackery · « **Reply #7 on:** December 03, 2025, 10:06:57 pm »

Quote from: peter-h on December 03, 2025, 12:33:31 pm

Some people pulse the watchdog from a timer ISR but that will trip only if your target has been "properly crashed". In my designs I usually pulse from a timer ISR but have a function which disables this and then you pulse it manually from a more sensible place.

I don't get this at all. As the OP suggested, ISRs will almost always continue working even when the main process is completely wrecked and sitting in a doom loop somewhere obscure. So in those instances the watchdog never issues its reset.

It seems to me that a watchdog that gets a pulse from a timer or other ISR is virtually useless. Instead you should implement a "sanity check" process that compiles a picture of the important elements in your running program and, if OK, it pulses the watchdog.

Even if you ignore all that and just pulse the watchdog in your main loop, it's (a bit) better than using a timer interrupt or similar. In my experience, a crash usually stops main() from continuing, so the watchdog will issue a reset.

Siwastaja · « **Reply #8 on:** December 04, 2025, 12:50:10 pm »

Quote from: SteveThackery on December 03, 2025, 10:06:57 pm

I don't get this at all. As the OP suggested, ISRs will almost always continue working even when the main process is completely wrecked and sitting in a doom loop somewhere obscure. So in those instances the watchdog never issues its reset.

ISR stops functioning if the interrupt source stops producing interrupts e.g. due to program misconfiguring it, program elsewhere disabling interrupts (they are sometimes disabled for multitude of good reasons, and maybe due to a bug it is not turned back on); even a failure within the peripheral. Or, ISRs can get starved out of execution time if all time is spent serving other interrupts (of higher priority) - typical example would be infinitely looping interrupt.

Total device reset takes care of all of them, so clearing a watchdog in an ISR totally makes sense.

You may want to additionally guard the "main thread" of execution from stuff like infinite loops by another watchdog, sure.

Or, if the MCU only has one, why not write magic values in main thread, check those in ISR, and reset the watchdog there. Then it will guard both the ISR and the main thread.

Berni · « **Reply #9 on:** December 04, 2025, 01:19:53 pm »

For watchdogs it is best to put checks before actually servicing the watchdog.

If you need to make sure that multiple parts of your code are executing then you instrument those parts with counters and then when it comes time to service the watchdog (be it main thread or a ISR) you check those counters and only actually clear the watchdog if they are all in order. That way a failure on any of them can trigger the watchdog. So in this case the hardware watchdog is watching over a more complex software watchdog.
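The counter scheme described above could be sketched like this. All names (`wdg_report_progress()`, `wdg_service()`, `MONITORED_COUNT`) are made up, and `hw_watchdog_refresh()` is a stub standing in for the real refresh, which on an STM32 IWDG is a write of 0xAAAA to IWDG->KR. The sketch assumes each counter has exactly one writer and that 32-bit reads and writes are atomic, which holds on a Cortex-M.

```c
#include <stdbool.h>
#include <stdint.h>

#define MONITORED_COUNT 3u   /* e.g. main loop, ADC ISR, control ISR (illustrative) */

static volatile uint32_t progress[MONITORED_COUNT]; /* bumped by the monitored code */
static uint32_t last_seen[MONITORED_COUNT];         /* snapshot from the last check */
static uint32_t hw_refreshes;                       /* stub: counts hardware refreshes */

/* Stub for the hardware refresh so the sketch builds on a PC. */
static void hw_watchdog_refresh(void) { hw_refreshes++; }

/* Each monitored piece of code calls this after doing real work. */
void wdg_report_progress(uint32_t id)
{
    if (id < MONITORED_COUNT) {
        progress[id]++;
    }
}

/* Call periodically, less often than the slowest monitored code runs.
 * The hardware watchdog is refreshed only if every counter has moved. */
bool wdg_service(void)
{
    bool all_alive = true;

    for (uint32_t i = 0u; i < MONITORED_COUNT; i++) {
        uint32_t now = progress[i];
        if (now == last_seen[i]) {
            all_alive = false;          /* no progress since the last check */
        }
        last_seen[i] = now;
    }
    if (all_alive) {
        hw_watchdog_refresh();
    }
    return all_alive;
}
```

If any one of the instrumented parts stalls, `wdg_service()` stops refreshing and the hardware watchdog eventually resets the device, which is the "hardware watchdog watching a software watchdog" arrangement.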

As for what is the appropriate action when your program gets borked, that is highly application dependent and needs a careful risk assessment to decide. For some applications rebooting is the worst thing to do (like an aircraft engine ECU) while on others (like a remote hard to reach security camera) it is very fitting to avoid the device becoming locked up.

The more difficult to recover cases are when you hit a Hard Fault handler. That usually means that shit has already hit the fan and parts of your RAM (or even worse the stack) might already be corrupt. The computer is possibly in such a dickered state that it is not even safe to log the crash. Usually the best course of action there is to save a crash dump somewhere into RAM, reboot, then on boot check that memory area for a crash dump and log it before continuing the boot sequence. But during development you tend to put a while(1) in there to make sure the CPU stops and preserves the crashed state as untouched as possible for investigation with a debugger.

peter-h · « **Reply #10 on:** December 04, 2025, 02:20:41 pm »

OTOH, a hard fault will trip the watchdog even if it is crap

I am sure it is OK to reboot an ECU on a jet engine if you do it reasonably fast. What happens on a piston (car) engine, where you may be controlling injectors, I have no idea...

SteveThackery · « **Reply #11 on:** December 04, 2025, 02:41:56 pm »

Quote from: Siwastaja on December 04, 2025, 12:50:10 pm

ISR stops functioning if the interrupt source stops producing interrupts e.g. due to program misconfiguring it, program elsewhere disabling interrupts (they are sometimes disabled for multitude of good reasons, and maybe due to a bug it is not turned back on); even a failure within the peripheral. Or, ISRs can get starved out of execution time if all time is spent serving other interrupts (of higher priority) - typical example would be infinitely looping interrupt.

Total device reset takes care of all of them, so clearing a watchdog in an ISR totally makes sense.

No, no, that's just not right. OK, under certain - unlikely - conditions a crash can also stop an ISR running. Even more unlikely, it could stop all ISRs running. But this is hopelessly unreliable as a way of checking that the main program has crashed, for the reasons given: lots of crashes do not disable interrupts. In fact the examples you give are hopelessly contrived. For example, in your scenario - where a crash can result in no interrupts - you would need the crash to occur during one of the very brief moments when interrupts are temporarily disabled, which amounts to, what, 1% of execution time?

Do you realise how silly it is to put a watchdog pulse in its own ISR? The only thing it protects against is that particular ISR not working. Can you see how circular that is? If the only thing a watchdog protects is its own ISR, then you don't need a watchdog at all!

You talk about putting sanity data in the main loop and the watchdog - triggered by an ISR - examines the data and issues a reset accordingly. That definitely sounds better, but ultimately it doesn't make sense. If you do that you don't need an ISR - or even a watchdog - at all. The main loop, which has just marshalled this sanity data, can check it and issue a reset itself.

There can be no real doubt that the main loop is where you should pulse the watchdog, because it is almost always the main loop that crashes. If you want to protect the ISRs, then you need a watchdog for each ISR. A watchdog with its own, separate, ISR, protects nothing but its own ISR.

Sanity checking of variables and states is the most comprehensive way to identify program faults, and a recovery strategy can be implemented - which might or might not involve a reset. However, this is not a good way to identify actual crashes, because a crash will rarely allow the sanity checking code to run. That's why a watchdog - pulsed from within the main loop - is also necessary.

SiliconWizard · « **Reply #12 on:** December 04, 2025, 03:02:40 pm »

Indeed, refreshing the watchdog in a dedicated timer ISR is one of the worst approaches which kind of screams: "I was mandated to enable the watchdog, but I didn't have time to do any better".

AFAIR, that's exactly what happened with some Toyota firmware years ago: https://www.safetyresearch.net/toyota-unintended-acceleration-and-the-big-bowl-of-spaghetti-code/

Siwastaja · « **Reply #13 on:** December 04, 2025, 04:22:18 pm »

Quote from: SteveThackery on December 04, 2025, 02:41:56 pm

No, no, that's just not right. OK, under certain - unlikely - conditions a crash can also stop an ISR running. Even more unlikely, it could stop all ISRs running. But this is hopelessly unreliable as a way of checking that the main program has crashed, for the reasons given: lots of crashes do not disable interrupts. In fact the examples you give are hopelessly contrived. For example, in your scenario - where a crash can result in no interrupts - you would need the crash to occur during one of the very brief moments when interrupts are temporarily disabled, which amounts to, what, 1% of execution time?

Bugs do not follow the percentage of runtime. Bugs can be anywhere. Better protect as much as you can.

Quote

Do you realise how silly it is to put a watchdog pulse in its own ISR? The only thing it protects against is that particular ISR not working.

Exactly, and often the most important and most serious parts of the program are in the interrupts. At least in my projects. Which is exactly why I guard them. My ISRs for example process input data (from ADC). If that data freezes, rest of the program misbehaves. Or, they calculate setpoints for motor control. Again, freezing = disaster.

Quote

If the only thing a watchdog protects is its own ISR, then you don't need a watchdog at all!

Wut? No one suggested using watchdog to protect watchdog ISR (if you even have any, often you configure them to just generate reset, without IRQ). Some other ISR! Like PWM, ADC, - whatever is important.

STM32 even has this Window Watchdog which can trigger on both too frequent and too infrequent clears. Then it has this Independent Watchdog that uses separate low-frequency clock input and as such can also protect from main oscillator failure. Maybe reset WWDG on some important ISR, and IWDG on the main thread? This is all case-by-case, you need to understand what your program is doing and how, to protect it the best.
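For the Independent Watchdog, the timeout follows from simple arithmetic: roughly (reload + 1) × prescaler / f_LSI, with a 12-bit reload value and dividers from 4 to 256. The helper below picks a divider and reload value for a wanted timeout. `iwdg_pick_config()` and `iwdg_config_t` are hypothetical names, the function only does the arithmetic and touches no registers, and the LSI frequency is passed in because the nominal 32 kHz varies considerably between parts and with temperature, so leave generous margin.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t prescaler;  /* divider: 4, 8, 16, 32, 64, 128 or 256 */
    uint16_t reload;     /* 12-bit reload value, 0..4095 */
} iwdg_config_t;

/* Picks the smallest divider (finest resolution) that can reach timeout_ms.
 * Returns false if the timeout cannot be represented. */
bool iwdg_pick_config(uint32_t timeout_ms, uint32_t lsi_hz, iwdg_config_t *out)
{
    if (timeout_ms == 0u || lsi_hz == 0u) {
        return false;
    }
    for (uint32_t div = 4u; div <= 256u; div *= 2u) {
        /* Watchdog ticks needed for the timeout, rounded up. */
        uint64_t per_ms_scaled = 1000ull * div;
        uint64_t ticks = ((uint64_t)timeout_ms * lsi_hz + per_ms_scaled - 1u)
                         / per_ms_scaled;
        if (ticks >= 1u && ticks <= 4096u) {
            out->prescaler = (uint16_t)div;
            out->reload    = (uint16_t)(ticks - 1u);
            return true;
        }
    }
    return false;
}
```

With a nominal 32000 Hz LSI, a 1000 ms timeout comes out as divider 8 and reload 3999, and the longest representable timeout is 32768 ms (divider 256, reload 4095).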

Siwastaja · « **Reply #14 on:** December 04, 2025, 04:28:43 pm »

Quote from: SiliconWizard on December 04, 2025, 03:02:40 pm

Indeed, refreshing the watchdog in a dedicated timer ISR is one of the worst approaches which kind of screams: "I was mandated to enable the watchdog, but I didn't have time to do any better".

Equally stupid is "I can only have one watchdog, and I need to tick this box of having a watchdog and clearing it in main thread", and now you are not protecting another important part of the program, namely the ISR contexts.

There is absolutely nothing special or different between main thread and ISR context. Either one can be starved of execution time due to bugs or hardware failure (including random bit flips). Higher priority interrupts starve lower priority interrupts, and any interrupt can starve the main thread, and main thread can also starve the interrupts by disabling them (atomic blocks taking longer than expected, or reconfiguration for whatever reason).

The reason to have watchdog is to catch those weird bugs that remain uncaught from other ways of input validation and normal sanity checking and cause total hang-up of the important functionality. And in MCU projects this functionality is typically spread in both interrupt context and main thread.

SteveThackery · « **Reply #15 on:** December 04, 2025, 07:07:29 pm »

peter-h said "Some people pulse the watchdog from a timer ISR". I think he was referring to the situation where there is only one watchdog, and the timer ISR only pulses the watchdog and nothing else. That is what I was talking about when I said that this arrangement is as good as useless, because quite often ISRs work just fine even when the main loop is a steaming wreck. And it is real: I have seen it done. The person doing it can't have given it a moment's thought. Or they must have thought that a crash would stop the ISR from running, which is usually wrong.

However, I cannot agree that the use of watchdogs is complicated, as some people have said. It is very simple:

A watchdog detects when a particular section of code has not executed for a specified amount of time. That is all. That "section of code" could be the main loop, a called function, or an ISR. Crucially, it ONLY monitors the section of code it is pulsed by. A watchdog pulsed by the main loop will not detect a corrupted ISR, and vice versa.

I totally agree that watchdogs can and should be used to monitor ISRs if said ISR has critical code in it. The same goes for the main loop and even some function calls if you wish.

I think we are in agreement. My initial reaction was to the idea of a watchdog being pulsed inside a timer ISR which does nothing except pulse the watchdog. I'm sorry I didn't express that clearly enough.

pcprogrammer · « **Reply #16 on:** December 04, 2025, 07:27:04 pm »

Quote

However, I cannot agree that the use of watchdogs is complicated, as some people have said. It is very simple:

Yes pulsing a watchdog is very simple. You just write to some peripheral location et voila.

Setting up the watchdog is also not complex. Just writing to the control registers and whatever.

Making proper use of it is what is difficult. One has to analyze the complete process the device has to perform, and decide on a good strategy as to where to reset the watchdog timer. When there is only one watchdog, one has to come up with other ways of monitoring if all the processes are still behaving as intended.

And there are systems where a watchdog is not what you need to make it fail safe. Redundancy can be key in situations where a system reset is out of the question.

Take for instance a drone that will crash when the controller is offline for too long. A reset and initialization of the process might already take too long to avoid a crash. Having multiple independent processors with a majority vote system can be the right solution for it.

As I wrote before, there is no single solution for making a system fail safe. Careful consideration of all the design parameters needs to be done to come up with what is the best protection for that particular case.

SiliconWizard · « **Reply #17 on:** December 04, 2025, 07:37:09 pm »

I think it was relatively obvious that this was what we both referred to, and I illustrated it with a Toyota firmware case that was (among many other horrors listed in the article) doing just that.

But I'll add something to this: *refreshing* a watchdog *inside* an ISR very rarely makes any sense in most cases anyway. ISRs are usually short and simple. If code gets stuck inside an ISR, the watchdog will precisely time out, which is what you want. This is not a place you want to refresh it, almost ever (there may always be niche cases). So, precisely *not* handling the watchdog whatsoever in any ISR is often just the way to go.

Of course, if you use those window watchdogs, that's when things may get a bit hairy, but you'll rarely get something relevant trying to refresh it in the supposed right window inside ISRs, that becomes a mess quite quickly.

I think in the cases peter had in mind, and that was also the case for Toyota firmware, some kind of RTOS is involved, and indeed many people do not know how to handle watchdogs properly with RTOSs - I mean, in particular, preemptive ones - so using a dedicated task looks like the reasonable way to go. But how do you do it properly? Obviously again, just implementing a task that's guaranteed to refresh the timer often enough is almost as good as nothing - that will cover the case where the CPU is not running at all, but not much more than this.

One way to handle it that I have used is the following: you indeed have one task dedicated to handling the watchdog, but what it does is a bit more elaborate than just refresh it using a timer, on its own. You create one "watchdog event" per task, which each task must signal within a certain time frame, and the watchdog task listens to the watchdog events from all other tasks and only refreshes the hardware watchdog if all events have been signaled. That protects all tasks from getting "stuck". Some RTOSs have specific watchdog events for this, with others, you'll use more generic events - but pretty much all RTOSs implement some kind of events, so this is doable with almost all. This is the approach I recommend in general.
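Stripped of any particular RTOS API, the per-task watchdog event idea above reduces to a bitmask that the watchdog task collects and clears once per supervision period. The sketch below is generic and uses invented names (`wdg_task_checkin()`, `wdg_task_step()`, `TASK_COUNT`); `hw_wdg_kick()` is a stub for the real hardware refresh. On a preemptive RTOS the read-modify-write on `checked_in` would need a critical section or the RTOS's own event mechanism, which is left out here.

```c
#include <stdbool.h>
#include <stdint.h>

#define TASK_COUNT 3u
#define ALL_TASKS  ((1u << TASK_COUNT) - 1u)

static volatile uint32_t checked_in;  /* bit n set: task n reported this period */
static uint32_t hw_kicks;             /* stub: counts hardware refreshes */

/* Stub for the hardware refresh so the sketch builds on a PC. */
static void hw_wdg_kick(void) { hw_kicks++; }

/* Called by each supervised task once per period, from a point that proves
 * the task is actually doing its job. Needs protection against preemption. */
void wdg_task_checkin(uint32_t task_id)
{
    if (task_id < TASK_COUNT) {
        checked_in |= (1u << task_id);
    }
}

/* Body of the watchdog task, run once per supervision period.
 * Refreshes the hardware watchdog only if every task has checked in. */
bool wdg_task_step(void)
{
    uint32_t seen = checked_in;

    checked_in = 0u;                  /* start a new period */
    if (seen == ALL_TASKS) {
        hw_wdg_kick();
        return true;
    }
    return false;
}
```

The supervision period has to be longer than the slowest task's check-in interval and shorter than the hardware watchdog timeout.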

cv007 · « **Reply #18 on:** December 04, 2025, 09:42:13 pm »

Quote

best practice way in which to handle this type of lock-up

This was self created. With an assert you eventually/normally end up in _exit, which is probably a while(1) loop. If your library does not have it (such as libc_nano), and you did not create it or it was not provided for you in some way, your linker will inform you (undefined reference to _exit). You said 'bare-metal' so it would be interesting to know which specs/library is in use, or if you are creating these sys functions on your own.

The mcu assert is intended as a way to inform you of code problems during development, typically by sending out debug info via printf and ending up wherever _exit takes you (there is nowhere to go). There is a reason assert is normally turned off when the mcu leaves development: there is no one around to view the info and/or do anything about it, and unless otherwise changed the _exit will 'lock up' the mcu as you have found out (it could now be in or out of an exception depending on where the assert took place, and some or all exceptions may still be in effect).

So, now you know what assert normally leads to on the mcu and you can use it with that in mind (or simply not use it, or modify the assert macro to your liking). As you indicate you have limited mcu experience, so just skip the watchdog for now as it requires some thought to have it be of any help. The default handler for unhandled exceptions is also another place where development and production may differ. If you are using a debugger, sitting in the default handlers while(1) loop allows you to look at registers/stack and see where you came from, but does little good when the mcu is on its own with no debugger attached.

You can also treat the mcu as always being in production, where you do not allow these loops. For cortex-m mcu's, my default handler saves stack/register info to a small block of ram (init only on power up) then resets the mcu. After the reset main code can look at the ram and dump the info if there was a previous exception. Typically I don't have main dump that info out until I have a problem, and if it does and I'm not around it's no big deal as the mcu will run (debug==production). These unhandled exceptions are really not that common and typically only show up when new code is introduced (unaligned access, bad address, etc). The assert macro could also be changed to do something that doesn't block, although if it is using printf then you are relying on that to still be functional.
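The save-then-reset approach above can be sketched as two small functions, one called from the default/fault handler and one from main() after the reset. This is an illustration with assumed names (`crash_record_t`, `crash_record_capture()`, `crash_record_take()`), not the poster's code. The only hardware fact it relies on is that a Cortex-M core pushes R0-R3, R12, LR, PC and xPSR, in that order, onto the active stack on exception entry. Getting the right stack pointer into the handler (MSP or PSP) needs a few lines of assembly that are omitted here.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC 0xDEADC0DEu

/* Registers stacked by the core on exception entry, in stack order. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} stacked_frame_t;

/* Hypothetical record; on target, place it in RAM that is only
 * initialised on power-up, not on every reset. */
typedef struct {
    uint32_t        magic;
    stacked_frame_t frame;
} crash_record_t;

/* Called from the fault handler with the stack pointer that was active
 * when the exception was taken. The handler then resets the MCU. */
void crash_record_capture(crash_record_t *rec, const uint32_t *stacked_sp)
{
    memcpy(&rec->frame, stacked_sp, sizeof rec->frame);
    rec->magic = CRASH_MAGIC;         /* written last: record is complete */
}

/* Called from main() after reset. Returns true once per captured crash
 * and copies the frame out so it can be dumped or logged later. */
bool crash_record_take(crash_record_t *rec, stacked_frame_t *out)
{
    if (rec->magic != CRASH_MAGIC) {
        return false;
    }
    *out = rec->frame;
    rec->magic = 0u;                  /* consume, so it is reported only once */
    return true;
}
```

The stacked PC is usually the most useful field, since it points at or near the instruction that faulted.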

peter-h · « **Reply #19 on:** December 05, 2025, 03:01:05 pm »

I think there are two separate issues here.

One is dodgy code which crashes here and there. That should really be fixed, because a watchdog will - at best - produce a system running well enough for the customer to not notice the restarts.

The other is external events, which are hopefully rare. For example I had a customer who mounted my products next to a piezo igniter for some huge industrial boiler. It didn't matter how much I told them it must go into a metal box, shielded cabling, very careful grounding, etc. But even if you do it "right" you can never guarantee the EMC margin. A watchdog is good for that kind of thing, and frankly triggering it from an ISR is probably fine, because the spike will usually crash the whole system.

Siwastaja · « **Reply #20 on:** December 05, 2025, 04:18:04 pm »

Quote from: SteveThackery on December 04, 2025, 07:07:29 pm

peter-h said "Some people pulse the watchdog from a timer ISR".

I assumed he must have some good reason to do it, and do something else than just configure a timer interrupt just to clear the watchdog, because that would be, as you say, nearly useless (not completely, but quite ineffective use of watchdog). So I assumed he must be doing something else, like writing magic values (maybe even timestamps of processing something) to variables elsewhere, checking those in the timer ISR and then clearing the watchdog if the self-tests pass. In that case, it would be equivalent to doing said checks on a timer ISR and calling reset function if the checks fail, but using watchdog clear instead would additionally protect against even that ISR (and hence, the checks!) failing. So that would totally make sense, and would be decent use of a single watchdog.

Siwastaja · « **Reply #21 on:** December 05, 2025, 04:24:06 pm »

Quote from: SiliconWizard on December 04, 2025, 07:37:09 pm

ISRs are usually short and simple.

Weird generalization. That is admittedly one way of development, which I abandoned over 10 years ago and never looked back. My programs have long been 99% in interrupt handlers. Some interrupt handlers are short and simple, some others are complex and long. Usually it turns out that highest priority ISRs are short and simple, and lowest priority can be more complex and long.

This is just event-driven programming, where event "callbacks" = interrupt handlers.

And interrupt handlers form chains of processing.

But quite interestingly, often those short and simple, high-priority ISRs are also those that are most important (in terms of being executed, not being missed, or delayed) - so protecting them with a watchdog seems very sensible thing to do.

And having the complexity elsewhere than those simple interrupts - I can't follow your logic. Exactly because there is complexity elsewhere, that complexity totally can starve the interrupts from being executed. So seems like clearing the watchdog in that important ISR seems the right thing to do, would you not agree?

Sure, if the main purpose of watchdog is just to protect against accidental infinite loop in the main thread, then watchdog clears should happen in that main thread (and not within that infinite loop). But I think plain old infinite loops are quite easy to avoid anyway.

Siwastaja · « **Reply #22 on:** December 05, 2025, 04:27:33 pm »

Quote from: peter-h on December 05, 2025, 03:01:05 pm

One is dodgy code which crashes here and there. That should really be fixed, because a watchdog will - at best - produce a system running well enough for the customer to not notice the restarts.

Decent watchdog designs can generate an interrupt just before the reset. Make that interrupt highest priority and come up with some way of logging at least something useful. That way you have better chances of fixing those bugs that are rare enough not to trigger when you try to find them in your lab, but do trigger in production.

peter-h · « **Reply #23 on:** December 05, 2025, 04:44:03 pm »

Quote

if the main purpose of watchdog is just to protect against accidental infinite loop in the main thread,

I would suggest writing software carefully is a better way to protect against accidental infinite loop in the main thread

A watchdog should "almost never" trip. If it trips often, there are deep structural issues in the code. OR the box is getting "spiked", and then a) the watchdog is working exactly as intended and b) you are probably close to blowing up some hardware because a spike which gets inside the box with enough residual power to cause an illegal opcode fetch (etc) is probably within an order of magnitude of blowing something up. My industrial piezo igniter example above was a perfect example.

SteveThackery · « **Reply #24 on:** December 05, 2025, 06:01:59 pm »

Quote from: peter-h on December 05, 2025, 03:01:05 pm

A watchdog is good for that kind of thing, and frankly triggering it from an ISR is probably fine, because the spike will usually crash the whole system.

I've tried to explain more than once that "crash the whole system" doesn't make sense. What does that even mean? A transient pulse can inject spikes into the data, address, instruction buses. A bad address or bad instruction will almost always send the CPU off into a wild stampede for a few milliseconds, and then end up in a tight loop somewhere, stopping the main loop from running. But rarely does it stop ISRs from running.

If you don't believe me, you can try it for yourself. I did my experiments on a microprocessor rather than a microcontroller, because you really need access to the data and address buses. Connect it up to a logic analyser and make it crash - you could use a piezo lighter, or you can usually just short out a couple of bus lines briefly. Then analyse what it is doing. You will almost always find it sitting in a loop of some kind. And you will often find that ISRs continue to work just fine.

I can't be alone in having experimented like this. I bet it's written about somewhere.

Ask yourself this: what does "crash the whole system" actually mean? Do you imagine the CPU will stop executing code? Why would it do that? Yes, OK, perhaps you could induce a hardware latch-up (although that never happened for me). But mostly the program counter keeps incrementing, the CPU tries to execute whatever rubbish it sees, until it ends up in a loop. And now another question: an ISR interrupts whatever the CPU is doing; so why would a CPU running in an unintended loop not respond to interrupts?

The only time I can see that happening is if the crash occurs during one of those brief moments when interrupts are disabled. That would be pretty unlikely, but possible.

So no, I'm sorry - triggering a watchdog from an ISR as a way to recover a crashed main loop is not "probably fine".

