I like to error() out when fundamental sanity checks fail in ways that have no obvious or easy recovery. I was thinking about error handlers, and the obvious approach is to store as much information as possible in a RAM segment that the init code does not clear, then just reset, and let the networking code (or whatever is relevant for the project) deliver the "oh, we rebooted with error code 123, with the following extra info: ..." message.
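To make the idea concrete, here is a minimal sketch of such a reset-surviving record. Everything in it is an assumption for illustration: the names (`crash_record_save`, `crash_record_take`), the magic value, the four extra words, and above all the `.noinit` section, which only works if the linker script defines such a section and the startup code leaves it alone. The magic plus a simple check word are there to tell a real record from power-on garbage.

```c
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC       0xDEADC0DEu  /* arbitrary marker, hypothetical */
#define CRASH_EXTRA_WORDS 4

/* Assumption: the linker script has a ".noinit" section that startup
 * code neither zeroes nor copies. On a host build the attribute is
 * dropped so the sketch still compiles. */
#if defined(__arm__) && defined(__GNUC__)
#define NOINIT __attribute__((section(".noinit")))
#else
#define NOINIT
#endif

typedef struct {
    uint32_t magic;                     /* CRASH_MAGIC when a record is present */
    uint32_t code;                      /* error code passed to error() */
    uint32_t extra[CRASH_EXTRA_WORDS];  /* free-form extra info */
    uint32_t check;                     /* inverted XOR of the fields above */
} crash_record_t;

static NOINIT crash_record_t crash_record;

static uint32_t crash_check(const crash_record_t *r)
{
    uint32_t c = r->magic ^ r->code;
    for (int i = 0; i < CRASH_EXTRA_WORDS; i++)
        c ^= r->extra[i];
    return ~c;
}

/* Called from the error handler; extra must point at CRASH_EXTRA_WORDS words.
 * On a real target the caller would trigger a reset right after this. */
void crash_record_save(uint32_t code, const uint32_t *extra)
{
    crash_record.magic = CRASH_MAGIC;
    crash_record.code  = code;
    memcpy(crash_record.extra, extra, sizeof crash_record.extra);
    crash_record.check = crash_check(&crash_record);
}

/* Called once after boot. Returns 1 and fills *out if a valid record
 * was left behind, 0 otherwise. The stored record is cleared either way
 * so it is reported only once. */
int crash_record_take(crash_record_t *out)
{
    int valid = crash_record.magic == CRASH_MAGIC &&
                crash_record.check == crash_check(&crash_record);
    if (valid)
        *out = crash_record;
    memset(&crash_record, 0, sizeof crash_record);
    return valid;
}
```

After boot, the networking code would call `crash_record_take()` and, if it returns 1, send the code and extra words along with the "we rebooted" message.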
The call chain that led to the error would be nice to have, and on architectures where return addresses for function calls are stored on the stack, it's just a matter of scanning the stack and storing the values that are valid code addresses (with the possibility of rare false positives: values that are not actually return addresses, just data that happens to fall within the code segment's address range).
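The scan itself is only a few lines. This is a sketch under stated assumptions: a 32-bit target, word-aligned stack contents, and code occupying one contiguous range `[code_start, code_end)`. The function name and parameters are made up for illustration; on a real target the bounds would come from linker symbols and the current stack pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk the stack words from sp (lowest address, most recent frame)
 * up to stack_top and copy every word that lies inside the code
 * range into out[]. Returns the number of candidates stored.
 * False positives are possible: any data word that happens to fall
 * in the code range is indistinguishable from a return address. */
size_t scan_stack_for_code_addrs(const uint32_t *sp,
                                 const uint32_t *stack_top,
                                 uint32_t code_start,
                                 uint32_t code_end,
                                 uint32_t *out,
                                 size_t out_max)
{
    size_t n = 0;
    for (const uint32_t *p = sp; p < stack_top && n < out_max; ++p) {
        uint32_t w = *p;
        if (w >= code_start && w < code_end)
            out[n++] = w;
    }
    return n;
}
```

Because the scan starts at the stack pointer, the candidates come out innermost call first, which is the order you would want to print them in.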
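But on ARM Cortex-M the hardware only stacks a return address when entering an interrupt handler; a normal function call puts the return address in the LR register, and it ends up on the stack only if the compiler chooses to push LR (typically in functions that call other functions, not in leaf functions).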
So maybe this: at the beginning of each function (except some super performance-critical ones), use a helper macro to allocate 4-8 bytes on the stack and store a magic number plus some kind of identifier for the function, like a #defined unique constant for each .c file plus the __LINE__ number; these would fit in maybe 20 bits. Maybe something else, too: any status information that can be automatically discarded after the function returns. Once the function exits, the stack allocation disappears, and here we go: an easily generated call graph, and the magic number reduces false entries.
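A rough sketch of what such a macro could look like, using 8 bytes per function: one magic word and one ID word with an 8-bit file ID and a 12-bit line number (20 bits total). All names here (`TRACE_ENTER`, `THIS_FILE_ID`, `CRUMB_MAGIC`, `crumbs_collect`) and the bit split are assumptions for illustration. The `volatile` is what keeps the compiler from optimizing the otherwise unused local away.

```c
#include <stddef.h>
#include <stdint.h>

#define CRUMB_MAGIC 0xC0FFEE42u   /* arbitrary marker, hypothetical */

/* 20-bit ID: 8-bit per-file constant, 12-bit line number. */
#define CRUMB_ID(file, line) \
    ((((uint32_t)(file) & 0xFFu) << 12) | ((uint32_t)(line) & 0xFFFu))
#define CRUMB_FILE(id) (((id) >> 12) & 0xFFu)
#define CRUMB_LINE(id) ((id) & 0xFFFu)

/* Assumption: every .c file defines its own unique THIS_FILE_ID. */
#define THIS_FILE_ID 7u

/* Put this at the top of a function body. The two words live in the
 * function's stack frame and vanish when the function returns. */
#define TRACE_ENTER() \
    volatile uint32_t trace_crumb_[2] = \
        { CRUMB_MAGIC, CRUMB_ID(THIS_FILE_ID, __LINE__) }; \
    (void)trace_crumb_

/* Example of an instrumented function. */
int demo_leaf(int x)
{
    TRACE_ENTER();
    return x + 1;
}

/* Error-handler side: scan stack words from lo (stack pointer) up to
 * hi (top of stack) and collect the ID that follows each magic word.
 * An ID word with bits set above bit 19 is rejected as a false hit. */
size_t crumbs_collect(const uint32_t *lo, const uint32_t *hi,
                      uint32_t *ids, size_t max)
{
    size_t n = 0;
    for (const uint32_t *p = lo; p + 1 < hi && n < max; ++p) {
        if (p[0] == CRUMB_MAGIC && (p[1] >> 20) == 0) {
            ids[n++] = p[1];
            ++p;  /* skip the ID word just consumed */
        }
    }
    return n;
}
```

Two caveats with this sketch: stale breadcrumbs from functions that have already returned can still sit below the stack pointer, so the scan has to start at the current SP, and it costs two stores plus 8 bytes of stack per instrumented call.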
Any comments on this, or completely orthogonal run-time instrumentation ideas? I mean things that have helped you catch those unimaginable bugs that only show up once you have a lot of units in the field, in some weird corner case, so that instead of trying to reproduce the bug you already have enough data to possibly solve it outright, or at least get a massive hint on how to reproduce it. (Anything except "make it completely bug-free by single-stepping in a debugger on the lab table".)