Annoying long runtime bug

I have a bug that takes hours and hours to manifest.  It leaves no trace as to if/when it's going to happen, just that the device stops doing its thing.

I added an independently clocked watchdog, but it is being refreshed.  So it's not locking up or hard faulting.

So yesterday after work I fired it up on the breadboard with the debugger attached and let it run and run and run.  By 11pm when I was heading to bed it was still running.

At 2am when I got up for the bathroom, I poked my head in and ....  it was locked up.  However I was NOT about to sit and debug it at 2am!  So this morning I opened up the VM to find that at some point the debugger had disconnected, damn.

Of course if you run it without the debugger attached it will lock up inside 20 minutes!  (Does anyone know a way to connect CubeMX to an already running core?)

It will be a missed interrupt that deadlocks a state machine which then deadlocks (or provides no impetus for) the others.  I hate state machines.  A bit like I hate the cold and rain, but just like the weather you can't avoid state machines, necessary evil.  Just so damn hard to get right and cover all possible scenarios.

I suppose I can just plaster it in debug logging.  If it's refreshing its watchdog timer, I can put a big fat state dumping log statement there to show me the state machine variables etc.

A question I have is around missing interrupts.

Take the scenario where you have a bursty process which takes a few milliseconds.  While it's doing its thing, 2 interrupts are received.  By the time the core gets to the second one, another has already "overwritten" or superseded it.

Are there any "engineering" best practices around recovering from the potential state machine mess this can create?

My own personal feeling is to aim for "detection" of invalid states or invalid transitions and have a way to completely reset the state machine and notify other state machines accordingly.  I mean, would it not be prudent to be able to stop, clear and restart any sub component and not have the whole system lock up while doing so?
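To make the "detect and reset" idea concrete, here is a minimal sketch in C.  The states, the transition table and the names (fsm_t, fsm_goto, fsm_reset) are all made up for illustration, not taken from the actual project.  Every transition goes through one function that checks a table of legal moves, and anything illegal resets the machine and bumps a counter you can log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical states for illustration; not from the original project. */
typedef enum { ST_IDLE, ST_BUSY, ST_DONE, ST_COUNT } state_t;

typedef struct {
    state_t  state;
    uint32_t resets;   /* how many times recovery was needed */
} fsm_t;

/* allowed[from][to] is true when the transition is legal. */
static const bool allowed[ST_COUNT][ST_COUNT] = {
    /* to:        IDLE   BUSY   DONE */
    /* IDLE */ { false, true,  false },
    /* BUSY */ { false, false, true  },
    /* DONE */ { true,  false, false },
};

/* Put the machine back in a known state and count the event. */
static void fsm_reset(fsm_t *f)
{
    f->state = ST_IDLE;
    f->resets++;
    /* A real system would also notify dependent state machines here. */
}

/* Returns true if the transition was legal; otherwise resets and returns false. */
static bool fsm_goto(fsm_t *f, state_t next)
{
    if (f->state >= ST_COUNT || next >= ST_COUNT || !allowed[f->state][next]) {
        fsm_reset(f);
        return false;
    }
    f->state = next;
    return true;
}
```

A non-zero resets count in the log at least tells you which machine got confused, even if the trigger only shows up after hours.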

I suppose this problem is already targeted by RTOSes et al.  Maybe I should consider going to an RTOS and stop managing my own wonky/janky state machines, timing and interrupt management.

(Sometimes I make these posts as I often answer half my own problem just writing them.)

Two things occurred to me.  First, I have no idea and no way to tell how much "load" is on the core at any point in time or generally.  There is a possibility that something is (or everything is) doing too much in interrupts such that they are getting missed.

Second thing was that, while an IWDG will help reset your core when it locks up, you want a NON-independent watchdog timer to tell you if you are overloading the core.  AKA, the RTOS Idle Task, which could be done with a standard watchdog.  If I set the timeout of the watchdog as an interrupt and refresh the timer in the interrupt context, that will tell me very quickly if interrupts are backlogged longer than the period of the timer "reset window".

Attaching to a running target, I am not sure how to do it within CubeMX, but for the Eclipse for embedded developers, using a SEGGER J-Link plugin, there is this option "Connect to running target".

Impossible to say without source, of course, but some general ideas:

- In the main() spin loop, add a counter (increment per loop).  Reset it every heartbeat or so.  Just before resetting, save the value to a cpuCount variable.  Monitor that via debugger, or spit it out via serial, whatever.  Interrupts use up CPU cycles --> reduced spin count --> more CPU usage.  So it's an inverse measure.  Disable all the main() housekeeping and interrupts to see what the maximum value is, and, obviously the minimum value is about 0.  (Depending on platform, interrupt saturation may still let through one main() instruction per interrupt, so it might not go completely to zero, it just grinds extremely slowly.)  (If you're crashing out at the same time, this might not show up, of course.)
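A rough sketch of that spin counter in C, in case it helps.  All the names (spin_tick, heartbeat_tick, cpu_count_max) are hypothetical.  One deliberate variation from the description above: instead of resetting the counter in the heartbeat, it takes the difference from the previous snapshot, so only main() ever writes the counter and the heartbeat interrupt cannot race with the increment.

```c
#include <stdint.h>

static volatile uint32_t spin_count;   /* written only by the main loop */
static volatile uint32_t cpu_count;    /* spins seen during the last heartbeat period */
static uint32_t last_spin;             /* previous snapshot, heartbeat only */
static uint32_t cpu_count_max = 1;     /* calibrate with housekeeping and interrupts off */

/* Call once per pass of the main() spin loop. */
static void spin_tick(void)
{
    spin_count++;
}

/* Call from the heartbeat, e.g. a periodic timer interrupt. */
static void heartbeat_tick(void)
{
    uint32_t now = spin_count;
    cpu_count = now - last_spin;   /* unsigned wrap-around is well defined */
    last_spin = now;
}

/* Inverse measure: fewer spins than the calibrated maximum means more load. */
static uint32_t cpu_load_percent(void)
{
    uint32_t c = cpu_count;
    if (c > cpu_count_max) {
        c = cpu_count_max;
    }
    return 100u - (uint32_t)(((uint64_t)c * 100u) / cpu_count_max);
}
```

Watch cpu_count in the debugger or print it over serial; the percentage is only as good as the calibration of cpu_count_max.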

- Others have advocated for this before: run everything through a single interrupt, so nothing gets out of sync.  Which really means, have all the ISRs jump to the same handler and figure it out from there.  Maybe call with a fixed parameter so the callee knows what device to service and switch() from there.

Downsides are full stack overhead (at least on many devices; anything with register swaps ala Z80 could potentially do very well at this, assuming the compiler emits relevant instructions to use it) and somewhat more latency, and no way to prioritize high-rate or missable interrupts while another is executing.

I'm not a fan of this myself, at least on the smaller platforms with a handful of active interrupts, that I've been working with.  I do have issues getting the priorities right and making sure everything is consistent in any order, but with only a handful, it's not unmanageable.  Granted, it wouldn't take many more to get there.

Complexity of those interactions goes up, probably exponentially with number, so if you're using a lot... it may be a worthwhile change.
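For what it's worth, the single-handler arrangement described above might look something like this in C.  The source names and ISR names are invented; real vector names depend on the part and the startup file.  Each ISR is a one-line stub passing a fixed parameter, and the common handler switches on it.

```c
#include <stdint.h>

/* Hypothetical interrupt sources, for illustration only. */
typedef enum { SRC_UART, SRC_TIMER, SRC_GPIO, SRC_COUNT } irq_source_t;

static volatile uint32_t serviced[SRC_COUNT];   /* per-source service counters */

/* The one place where all interrupt work is decided. */
static void common_handler(irq_source_t src)
{
    switch (src) {
    case SRC_UART:
        serviced[SRC_UART]++;    /* read the data register, buffer the byte, ... */
        break;
    case SRC_TIMER:
        serviced[SRC_TIMER]++;   /* advance timeouts, ... */
        break;
    case SRC_GPIO:
        serviced[SRC_GPIO]++;    /* latch the pin state, ... */
        break;
    default:
        break;                   /* unknown source: ignore or log */
    }
}

/* Each real ISR is only a stub that passes a fixed parameter. */
void uart_isr(void)  { common_handler(SRC_UART);  }
void timer_isr(void) { common_handler(SRC_TIMER); }
void gpio_isr(void)  { common_handler(SRC_GPIO);  }
```

Note this only keeps things in sync if all the stubs run at the same priority, so they can't preempt each other.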

Which segues into the next point...

- Don't try to write something you can't understand.  Reduce it to manageable pieces you can reason about.  This level is different for everyone, but know your strengths and weaknesses, and plan accordingly.

If you have multiple overlapping and interacting interrupts, that sounds like a recipe for disaster.  Add a state machine below that and who knows?  Maybe that's descriptive of the present issue, I don't know.  But just look at that laundry list of hazards.  It's too complex and (literally) unreasonable.  Break it into smaller pieces, isolate the processes.

You can deal with high-rate interrupts by isolating them with minimal or zero side effects, for example buffering data then handling it either in another (strictly lower priority) interrupt, or preferably in the main() spin loop or heartbeat.  Bring it down to a manageable place where everything can be resolved, in preordained order.

In your state machine, plan for zero or more items in each buffer.  Maybe it will sometimes run too often, or interrupts get held up, so you sometimes read an empty buffer -- just take that into account.  If more than one item is seen, create a plan to resolve them: process them in order, take the latest one, whatever.

If the order of multiple simultaneous data/items/events matters, you can save a timestamp with each event and resolve them later.  Or better yet, use a common event buffer -- which can be designed and tested* for atomicity and therefore is a safe gate between asynchronous interrupts and synchronous (i.e. main() loop) logic.
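A minimal sketch of such a common event buffer in C, under some loud assumptions: the names (evq_push, evq_pop, event_t) are made up, there is a single consumer (the main loop), and CRITICAL_ENTER/CRITICAL_EXIT are placeholders that must mask interrupts on real hardware.  On a Cortex-M they might map to the CMSIS __disable_irq()/__enable_irq() intrinsics, or better, a save/restore of the interrupt mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholders: no-ops here, but they must mask interrupts on a real target. */
#define CRITICAL_ENTER()  ((void)0)
#define CRITICAL_EXIT()   ((void)0)

#define EVQ_SIZE 8u   /* must be a power of two so index wrap-around stays consistent */

typedef struct {
    uint8_t  source;      /* which interrupt or device raised the event */
    uint32_t timestamp;   /* e.g. a free-running tick counter */
} event_t;

static volatile event_t  evq[EVQ_SIZE];
static volatile uint32_t evq_head;      /* advanced by producers (ISRs) */
static volatile uint32_t evq_tail;      /* advanced by the consumer (main loop) */
static volatile uint32_t evq_dropped;   /* events lost because the queue was full */

/* Called from ISRs. Returns false, and counts the loss, if the queue is full. */
bool evq_push(uint8_t source, uint32_t timestamp)
{
    bool ok = false;
    CRITICAL_ENTER();
    if (evq_head - evq_tail < EVQ_SIZE) {
        uint32_t i = evq_head % EVQ_SIZE;
        evq[i].source = source;
        evq[i].timestamp = timestamp;
        evq_head++;              /* publish only after the slot is filled */
        ok = true;
    } else {
        evq_dropped++;
    }
    CRITICAL_EXIT();
    return ok;
}

/* Called from the main loop. Returns false when there is nothing to process. */
bool evq_pop(event_t *out)
{
    if (evq_tail == evq_head) {
        return false;
    }
    uint32_t i = evq_tail % EVQ_SIZE;
    out->source = evq[i].source;
    out->timestamp = evq[i].timestamp;
    evq_tail++;                  /* release the slot only after copying it out */
    return true;
}
```

The main loop then drains it with while (evq_pop(&e)) and handles zero, one or many events per pass.  A non-zero evq_dropped is exactly the "missed interrupt" evidence you would otherwise never see.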

*Mind, this is a fairly simple system and we're already well into the realm of "tests don't prove anything".  The best testing you could do here, would be something like fuzzing inputs and triggering interrupts with random inputs to try and force any possible timing or logic error.  For sure, you're going to have minimal to nonexistent test coverage if you're going by internal timers and stuff -- the timing is too consistent, at best plus or minus a few CPU cycles due to instruction timings, and sure you can set different timer rates or even randomize them, but those updates might still be synchronous with other CPU functions and you can't guarantee every pattern will be seen.  You can at least get independent timing from an external source -- for example, hooking up a GPIO interrupt to a signal generator and giving that a spin.  But you still have no way to enumerate every possible combination and sequence of interrupt, latency, delay, etc.

On the upside, at least on simple platforms, the clock rates are fairly low (10s MHz to low 100s), so the timing doesn't need to be very fine-grained to find that one-in-a-million hangup where two interrupts conspire to corrupt the buffer state, or something like that.  And it's usually just a timing coincidence, not pattern-dependent.  And, note the failure threshold: one error at all, is evidence of total design failure; absence of error is not evidence of success.  If it comes down to interrupts interfering within a single clock cycle, out of say ~100s Hz average interrupt rates, that's literally a one in a million chance, and you need to be able to detect and log/report that one failure when it happens.  So, plan accordingly.

- If it's overlapped interrupts, also check the stack from time to time.  This isn't usually easy to do from inside the language, but there may be some methods (library?), or if nothing else, a little ASM can do it.

* In each interrupt, log the stack pointer: take the minimum value between the variable and the current value, and store that back.  (ATOMICALLY!)
* From time to time, scan the memory space where the stack resides (usually upper RAM?).  This will either be uninitialized RAM (random?), or you can clear it or fill it with patterns (0s or 0xff's or 0xcafebabe's or...) before program start (__initn or something like that).  It will be overwritten with return pointers, assorted register contents, and local variables as functions utilize it.
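Both checks can be sketched in portable C, assuming a descending stack (true for ARM Cortex-M and AVR, among others).  The function names are hypothetical, and on a real target the region bounds would come from linker symbols and the stack pointer from an intrinsic or a line of ASM, which differ per toolchain.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xCAFEBABEu

/* Lowest stack pointer value seen so far (descending stack assumed). */
static volatile uintptr_t min_sp = UINTPTR_MAX;

/* Call at the top of each ISR with the current stack pointer.
 * The read-compare-write is NOT atomic: with nested interrupts,
 * wrap it in a critical section. */
void record_sp(uintptr_t sp)
{
    if (sp < min_sp) {
        min_sp = sp;
    }
}

/* Fill [low, high) with the pattern. On a real target do this before
 * the stack is in use, or only below the current stack pointer. */
void stack_paint(uint32_t *low, uint32_t *high)
{
    for (uint32_t *p = low; p < high; p++) {
        *p = STACK_FILL;
    }
}

/* Count words at the low end that still hold the pattern, i.e. the
 * headroom the stack has never reached. */
size_t stack_unused_words(const uint32_t *low, const uint32_t *high)
{
    size_t n = 0;
    while (low + n < high && low[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

If stack_unused_words() ever returns 0, or min_sp dips below the bottom of the stack region, the stack has very likely run into whatever sits under it.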

Again, this is an observational method, so it doesn't prove anything (it's a testing method).  But it may give direction to a problem you already know exists.

Downside: you can't scan it from inside the program after the CPU's crashed, and monitoring from main(), say, will only get a check every so often.  Then again, if you can attach the debugger and dump RAM after the fact, you don't need any scanning code in the program at all.


