Author Topic: Gracefully recovering from HardFault on Cortex-M4 (Read 21024 times)

poorchava · « **on:** December 02, 2014, 08:44:26 am »

I'm writing a big piece of software which makes heavy use of the ARM DSP library (mainly fixed-point FFT and complex math). The problem is that data acquisition via DMA runs in parallel with the FFT calculation, and the DMA interrupt has a higher priority than the one in which the FFT is calculated. Occasionally I get a HardFault exception and, if the Call Stack window in Atmel Studio is to be believed, the fault occurs within the FFT function. My suspicion is, that CPU tries to switch context to higher priority interrupt during some critical phase of FFT or that DMA tries to write to the same location that FFT does (although I suspect that this would rather generate Bus Fault).

Anyway - I am aware of this fault and I have no way of preventing it as it appears totally random. I cannot decrease priority of DMA interrupt and neither can I delay the FFT, but in general if FFT results are invalid, then nothing bad happens: there will be fresh ones in some tens of milliseconds and incorrect results will be discarded further in the processing algorithm. Is there a way to recover gracefully from a HardFault so that the device is not reset? There are tons of resources on the internet about debugging HardFaults, but not so much about recovering from them. I've even seen an opinion, that HardFaults are fatal by default and cannot be recovered from.

Any advice will be helpful.

EDIT: the uC is SAM4S series

Brutte · « **Reply #1 on:** December 02, 2014, 06:13:08 pm »

Quote from: poorchava on December 02, 2014, 08:44:26 am

Anyway - I am aware of this fault and I have no way of preventing it as it appears totally random.

The faults do not happen because of bad mood of the core. There must have been a good reason why that happened and the very first thing you should write about here is the reason itself. Do you know the exact reason?
I am not going to study Atmel Studio, but it should give you access to the resources of the core (put a screenshot in the opening post).

Quote

I cannot decrease priority of DMA interrupt (..)

Do you want to hide the bug or to solve the actual problem?

Quote

Is there a way to recover gracefully from a HardFault so that the device is not reset?

The device is not reset by the hard fault event. If it resets then that must be something else.

Quote

I've even seen an opinion, that HardFaults are fatal by default and cannot be recovered from.

That is the case with tinies (a.k.a Cortex-M0) where the core has limited functionality and core state is lost during most of such unexpected events. Your uC has a full featured core.

Quote

Any advice will be helpful.

What about RTFRMing?

nctnico · « **Reply #2 on:** December 02, 2014, 07:41:04 pm »

You can recover from a hardfault interrupt by just exiting the interrupt. The code will continue but use a wrong value.
But I'd try to get to the bottom of why the hardfault interrupt occurs. You can dump the stackframe to examine what goes wrong and where.

Sal Ammoniac · « **Reply #3 on:** December 02, 2014, 07:52:28 pm »

Faults don't occur "randomly" -- they always have a cause.

Do you have Bus Faults and Usage Faults enabled? Hard faults only occur if the actual fault (bus or usage) is disabled (or a fault occurs when executing a fault handler). Enable Bus and Usage Faults so you can determine which one is actually occurring and then you can read the fault status registers to determine the address of the access that caused the fault (and sometimes the address of the instruction that caused the fault, but only for precise faults.)

If you haven't read the ARM Technical Reference Manual for the Cortex-M4, you really should, especially the sections on fault and exception handling.
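A minimal sketch of what enabling the configurable faults could look like. The bit positions are the architectural ones for SCB->SHCSR and SCB->CCR on Cortex-M3/M4; the helper names are made up for illustration, and the helpers only compute the new register values so that they can be checked off-target.

```c
#include <stdint.h>

/* Enable bits in the System Handler Control and State Register (SCB->SHCSR). */
#define SHCSR_MEMFAULTENA (UINT32_C(1) << 16)
#define SHCSR_BUSFAULTENA (UINT32_C(1) << 17)
#define SHCSR_USGFAULTENA (UINT32_C(1) << 18)

/* Trap bits in the Configuration and Control Register (SCB->CCR). */
#define CCR_UNALIGN_TRP   (UINT32_C(1) << 3)
#define CCR_DIV_0_TRP     (UINT32_C(1) << 4)

/* Hypothetical helper: SHCSR value with MemManage, BusFault and UsageFault enabled. */
static uint32_t shcsr_with_faults_enabled(uint32_t shcsr)
{
    return shcsr | SHCSR_MEMFAULTENA | SHCSR_BUSFAULTENA | SHCSR_USGFAULTENA;
}

/* Hypothetical helper: CCR value with divide-by-zero trapping switched on. */
static uint32_t ccr_with_div0_trap(uint32_t ccr)
{
    return ccr | CCR_DIV_0_TRP;
}
```

On the target, assuming the CMSIS core header is included, this would be applied early in start-up as SCB->SHCSR = shcsr_with_faults_enabled(SCB->SHCSR); and SCB->CCR = ccr_with_div0_trap(SCB->CCR);, with MemManage_Handler, BusFault_Handler and UsageFault_Handler actually implemented rather than left pointing at a dummy handler.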

Howardlong · « **Reply #4 on:** December 02, 2014, 07:54:17 pm »

Agreed with others here, you really do need to understand the problem and fix it properly, ignoring it is only going to end in tears.

I had a similar problem about four years ago: ADC sampling at 400 kSPS, 16 bit, through DMA for data acquisition streamed over USB to the host. Randomly, maybe every 90 s or so on average, but not regularly, I'd lose a sample, which messed up the host program (it was a baseband satellite telemetry downlink). Although the downlink had FEC, losing a sample broke the derived clock on the host, so too many bits were lost to decode while the clock got back in sync.

It took me several days to figure it out. Actually, I never did really figure it out, I just completely re-wrote the buffering code from scratch. Clearly there was an error somewhere, but it was like looking for a needle in a haystack: one sample in 40 million, randomly occurring, is hard to identify!

So either fix it or re-write it is my advice to you I'm afraid, but definitely don't ignore it.

Sal Ammoniac · « **Reply #5 on:** December 02, 2014, 08:32:07 pm »

Another thing to add... One thing that makes debugging faults much easier is instruction trace. If your debugger and probe support it, it can be invaluable for debugging these types of issues because you can see exactly where in the instruction stream the error occurred.

Unfortunately, not all debuggers support instruction trace, and probes that do typically cost more than those that don't. If you're doing this in a professional capacity, buying the tools will usually be well worth the cost.

hans · « **Reply #6 on:** December 02, 2014, 09:35:52 pm »

I've had to debug a very annoying HardFault error some time back, which the HFSR (Hard Fault Status Register) described as an "imprecise error". It caused every ECU firmware to crash when we expanded our memory usage beyond a certain point.
It ended up being a stack corruption because we forgot to reset the PSP/MSP (process/main stack pointer) mode in the bootloader. As a result the application started up with PSP, so all init code was set up for PSP. On the first RTOS interrupt it would blindly switch back to MSP and use the bootloader's stack pointer. Because the bootloader had a smaller memory profile than the application, it would start to corrupt the application's data, while the application would corrupt the RTOS's stack. It was very annoying, because it only happened when we started with our bootloader and not when we started a debug session of the application (because there was no history effect of our broken bootloader).

We did not have a debugger with trace functionality, that would be very handy indeed. The smallest clues we could get were often very vague. It only took 2 weeks of man hours to diagnose and find that bug. The fix was simple; like 1 line of code.

What nctnico said is technically correct.. the CPU will jump out of the routine and try to resume your program just like any other ISR. However, if the stack is corrupted that will not work that well (if at all). Also, how will you determine or progress your project when your other application (I assume the FFT results are going somewhere.. and that's still firmware which inevitably will contain bugs or unwanted side effects) could develop hard faults as well? Sounds like you could be making your future debugging even more difficult.

In addition, if the HardFault is caused by a pointer error (accessing an unpowered peripheral for example), I am pretty certain the CPU will retry the instruction after the fault is exited. In that case you would need to fix the pointer in the processor core register, that is (hopefully/likely) pushed onto the stack. Sounds like dangerous territory to mess around with the stack at that point. Also, how do you know which one? The compiler may change its mind the next time you change your code..
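To illustrate why "just return" retries the same instruction, and what stepping past it would involve, here is a rough sketch. It relies on the Thumb-2 encoding rule that a first halfword whose bits [15:11] are 0b11101, 0b11110 or 0b11111 starts a 32-bit instruction; the function names are hypothetical. It only makes sense for precise faults, where the stacked PC really points at the faulting instruction, and it does nothing about whatever wrong value the skipped instruction leaves behind.

```c
#include <stdint.h>

/* Size in bytes of the Thumb instruction whose first halfword is `hw`. */
static uint32_t thumb_insn_size(uint16_t hw)
{
    uint16_t top5 = (uint16_t)(hw >> 11);
    return (top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu) ? 4u : 2u;
}

/* Hypothetical helper: address of the instruction following the one at
 * `stacked_pc`, given that instruction's first halfword. A fault handler
 * would write this back into the stacked PC slot before returning. */
static uint32_t pc_after_faulting_insn(uint32_t stacked_pc, uint16_t first_halfword)
{
    return stacked_pc + thumb_insn_size(first_halfword);
}
```

For an imprecise bus fault the stacked PC is already past the offending store, so this kind of patching would skip an innocent instruction instead.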

Quote

My suspicion is, that CPU tries to switch context to higher priority interrupt during some critical phase of FFT or that DMA tries to write to the same location that FFT does (although I suspect that this would rather generate Bus Fault).

I hope you're not using the same buffer for DMA & FFT, otherwise I cannot see how both pieces would be accessing the exact same piece of memory. I would highly recommend using 2 buffers if you're not already: 1 for DMA, 1 for processing.
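A two-buffer ("ping-pong") arrangement could be sketched roughly like this. All names and the block length are assumptions for illustration; how the next buffer address is handed to the DMA controller is device specific and left out, and the sketch does not detect the case where processing falls more than one block behind.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_LEN 8u  /* hypothetical block length; a real FFT block is larger */

typedef struct {
    int32_t buf[2][BLOCK_LEN];
    volatile uint8_t dma_idx;  /* half currently owned by the DMA */
    volatile uint8_t ready;    /* set when the other half holds a full block */
} pingpong_t;

/* Call from the DMA "transfer complete" interrupt: hands the filled half to
 * the consumer and returns the half the DMA should fill next. */
static int32_t *pingpong_swap(pingpong_t *pp)
{
    pp->dma_idx ^= 1u;
    pp->ready = 1u;
    return pp->buf[pp->dma_idx];
}

/* Call from the processing context: returns the completed half, or NULL if
 * no new block has arrived since the last call. */
static const int32_t *pingpong_take(pingpong_t *pp)
{
    if (!pp->ready) {
        return NULL;
    }
    pp->ready = 0u;
    return pp->buf[pp->dma_idx ^ 1u];
}
```

The point of the split is that the DMA and the FFT never hold a pointer into the same half at the same time, so a corrupted result cannot come from the two racing on one buffer.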

poorchava · « **Reply #7 on:** December 02, 2014, 10:49:07 pm »

Ok, some details: I'm using SAM4SA16C, debug probe is Atmel-ICE (just a repainted J-link, that works slower than original J-link) and the IDE is Atmel Studio (seriously, whoever in Atmel made the decision to use Visual Studio as base IDE should be hanged by the balls on some tall object...).

The hard fault occurs within the FFT function which is developed by ARM and it's a large blob of ASM voo-doo which I really do not want to reverse engineer (after looking at the source code of it - it's just some C crafted in a way that forces CPU to use particular ASM instructions). In general I doubt that the problem is within the FFT function.

As for enabling bus and usage faults - I'm getting 'dummy handler' exception (the ASF way of 'unimplemented interrupt'). After reading IPSR register it's always "3" which means hard fault. Perhaps there are some other configuration bits that will enable bus and usage faults to be recognized. I'll check that tomorrow.

In general debugging this application is an abysmal pain in the ass, because after it is paused the whole DSP engine has to be reset. The timing of DMA events and sampling rate are very critical in order to maintain synchronization with signal from external source (sorry, I can't be more specific). Any intervention by debugger causes the sync to be lost and whole system goes tits up, so I cannot really compare 2 states of the cpu, that would come from same data stream. After pausing the system to read memory I have to let it run for a second or so to acquire signal, get in sync with it (with new sync parameters and such) and then I can pause it again.

As for tracing the fault by analyzing stack: I don't know how to do it to be honest. The CallStack tool in Atmel Studio produces some results, but those are often bullshit.

As for buffering - the DMA destination buffer is separate to the FFT source buffer. Data from DMA buffer are processed and then copied to FFT source buffer. DMA should not be messing with whatever data that FFT operates on.

As for repeatability: I get the fault once or twice per day of work, let's say once every 4-5 hours of continuous operation. I have not found a way to reliably reproduce the error, but this may as well not be possible, as signals processed by the device are largely random crap (the signal has very low SNR), so some transient random disturbance may be causing the fault once a day and it's not likely that I will catch it and identify as fault trigger.

Analyzing the program flow is an abysmal pain in the ass, as Maximal Optimization (except for 'unsafe' options) is turned on (otherwise the code doesn't work as it's too slow). I'm already using like 90% of RAM and 70% of flash with maximal optimization.

I will resume the battle tomorrow...

Howardlong · « **Reply #8 on:** December 02, 2014, 11:07:50 pm »

From what you've said, if it were me, and I ran the project, I'd probably rewrite the buffering/DMA business from scratch.

Assuming you can't, I am not sure how a transient response would cause a fault directly on a fixed point FFT, unless there are some documented constraints on the input. Is it fixed or floating point FFT by the way? If it's floating point, where are you doing the conversion?

Can you force a known good buffer, say all zeros or something, into the FFT continuously while simultaneously sampling with the DMA into a bitbucket, and see if it is indeed caused by the input stream, either through the data itself or a memory conflict?

Although you mentioned RAM and Flash utilisation, you didn't mention how much CPU is available?

Is it something as simple as a divide by zero or a floating point NaN for example?

nctnico · « **Reply #9 on:** December 03, 2014, 01:53:54 am »

Divide by zero also crossed my mind.

andyturk · « **Reply #10 on:** December 03, 2014, 02:09:43 am »

Quote from: poorchava on December 02, 2014, 10:49:07 pm

The hard fault occurs within the FFT function which is developed by ARM and it's a large blob of ASM voo-doo which I really do not want to reverse engineer (after looking at the source code of it - it's just some C crafted in a way that forces CPU to use particular ASM instructions). In general I doubt that the problem is within the FFT function.

CMSIS-DSP?

The math code itself may be fine, but if you pass it an improperly created buffer or something with a junk pointer, it'll have consequences. One way to "debug" problems like this without a good debugging environment is to start adding assertions to the code that look for specific issues and then fail in a way you can track more easily. E.g., you could add sanity checks on the buffers you pass to the FFT library to make sure they're pointing to a reasonable part of memory. You could also make sure you have enough stack space, and maybe check for some sentinel values at the bottom of the stack afterwards to make sure it didn't overflow.
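The kind of checks meant here might look like the following sketch. The function names, the sentinel value and the idea of passing the region bounds explicitly are all assumptions for illustration; on a real target the bounds would come from the linker script or the device memory map.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define STACK_SENTINEL 0xDEADBEEFu  /* arbitrary marker value */

/* True if [ptr, ptr+len) lies inside [base, base+size) and ptr is word aligned. */
static bool buffer_in_region(uintptr_t ptr, size_t len, uintptr_t base, size_t size)
{
    if ((ptr & 3u) != 0u || ptr < base) {
        return false;
    }
    size_t off = (size_t)(ptr - base);
    return off <= size && len <= size - off;
}

/* Fill the lowest `words` words of the stack area with the sentinel at start-up. */
static void stack_paint(uint32_t *stack_bottom, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack_bottom[i] = STACK_SENTINEL;
    }
}

/* True if the sentinel words are intact, i.e. the stack never grew into them. */
static bool stack_guard_intact(const uint32_t *stack_bottom, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        if (stack_bottom[i] != STACK_SENTINEL) {
            return false;
        }
    }
    return true;
}
```

Calling buffer_in_region() right before each FFT call and stack_guard_intact() right after it, and halting or logging on failure, turns a once-a-day HardFault into a failure that points at a specific precondition.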

poorchava · « **Reply #11 on:** December 03, 2014, 06:27:26 am »

Quote from: andyturk on December 03, 2014, 02:09:43 am

CMSIS-DSP?

Yeah, CMSIS-DSP, more precisely 32bit, fixed point real transform. The uC doesn't have FPU (I'd probably be doing fixed point math even if it did anyway).

westfw · « **Reply #12 on:** December 03, 2014, 08:28:19 am »

Neither "divide by zero" nor "bus error" counts as a "hard fault", although I guess it's easy for them to get escalated to hard faults if there is an error in the MemManage, BusFault, or UsageFault handlers.

The way I read the documentation, a "hard fault" is hardly ever the result of the primary error; it's the result of an error during attempts to handle some other error. Make sure you have correct handlers installed and enabled for the other faults/interrupts, and you'll be closer to debugging the actual problem.

Jeroen3 · « **Reply #13 on:** December 03, 2014, 12:10:57 pm »

Returning from a hardfault won't work, you'd be returning to the address that triggered the fault. Triggering the fault again. And you'd need to undo some stacking the hardware did for you.
Instead, use this: (to begin with)

Code:

	/* Snapshot the fault status registers (CMSIS register names). */
	uint32_t CFSRValue = SCB->CFSR;
	uint32_t HFSRValue = SCB->HFSR;

	/* HFSR bit 30 (FORCED): a configurable fault was escalated to HardFault. */
	if ((HFSRValue & (1UL << 30)) != 0) {
		/* The upper halfword of CFSR is the UsageFault status register. */
		CFSRValue >>= 16;
		if((CFSRValue & (1UL << 9)) != 0) {
			faultPrint(FAULT_MSG_PREFIX"fault: Divide by zero\n");
		}
		if((CFSRValue & (1UL << 8)) != 0) {
			faultPrint(FAULT_MSG_PREFIX"fault: Unaligned access\n");
		}
		if((CFSRValue & (1UL << 3)) != 0) {
			faultPrint(FAULT_MSG_PREFIX"fault: No coprocessor UsageFault\n");
		}
		if((CFSRValue & (1UL << 2)) != 0) {
			faultPrint(FAULT_MSG_PREFIX"fault: Invalid PC load UsageFault\n");
		}
		if((CFSRValue & (1UL << 1)) != 0) {
			faultPrint(FAULT_MSG_PREFIX"fault: Invalid state\n");
		}
		if((CFSRValue & (1UL << 0)) != 0) {
			faultPrint(FAULT_MSG_PREFIX"fault: Undefined instruction\n");
		}
	}

Refer to the ARM documentation for why and when they happen (some can be disabled), and for how to read the register dump.
The register dump is made by hardware, on the stack, before entering the handler. You can read it with some assembler magic.
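For reference, the basic frame the hardware pushes on a Cortex-M3/M4 without FPU context is eight words, in the order shown below. This is a sketch with made-up type and function names; it takes a plain pointer so the decoding can be tried on a PC with a fake frame.

```c
#include <stdint.h>

/* Basic exception frame as stacked by the core, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} exc_frame_t;

/* EXC_RETURN bit 2 tells which stack holds the frame: 0 = MSP, 1 = PSP. */
static int frame_is_on_psp(uint32_t exc_return)
{
    return (exc_return & (1u << 2)) != 0u;
}

/* Copy the eight stacked words starting at `sp` into a struct. */
static exc_frame_t frame_read(const uint32_t *sp)
{
    exc_frame_t f = {
        .r0 = sp[0], .r1 = sp[1], .r2 = sp[2], .r3 = sp[3],
        .r12 = sp[4], .lr = sp[5], .pc = sp[6], .xpsr = sp[7]
    };
    return f;
}
```

The "assembler magic" is then only the small shim at handler entry that tests bit 2 of LR (which holds EXC_RETURN there), picks MSP or PSP accordingly and passes it to C as the first argument. The stacked pc is the address to look up in the map file or disassembly.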

Having a free uart that can send to a pc terminal makes this a lot easier. (fully software polled of course)

Read here: https://blog.feabhas.com/2013/02/developing-a-generic-hard-fault-handler-for-arm-cortex-m3cortex-m4/

Good luck!

Brutte · « **Reply #14 on:** December 03, 2014, 02:58:41 pm »

Quote from: Jeroen3 on December 03, 2014, 12:10:57 pm

Instead, use this: (to begin with)

Why everyone insists on printf..ing in hardfaults? I cannot see in OP's posts the uC is orbiting the Earth and without debugging access. Most likely OP bends over the hardware right now so why insisting on printf..ing registers? Connect the debugger and any decent IDE displays all the required SCB content in plain English. Including stack frames, PC, registers, xPSR, CFSR, HFSR and what not. Perhaps even with some interpretations and hints about what went wrong..

nctnico · « **Reply #15 on:** December 03, 2014, 04:57:48 pm »

Half the embedded engineers don't use debuggers. I never put a JTAG header on my boards and even if I wanted to I usually use those pins for other purposes. So yes, printing to the serial port is a good way. Another advantage of printing is that a field service engineer can connect a serial interface and create a log file. And even if you use a debugger it won't point to the exact fault because optimisation causes the debugger to get confused which line number corresponds with what address.

mikerj · « **Reply #16 on:** December 03, 2014, 06:45:28 pm »

Quote from: nctnico on December 03, 2014, 04:57:48 pm

Half the embedded engineers don't use debuggers.

I doubt that very much, certainly for the case of professional engineers where time==money. Modern 32 bit parts often have dedicated JTAG or SWD ports anyway, so the pin issue is irrelevant on these. Trying to debug using printfs and toggling spare pins in a complex program, perhaps running under an RTOS, on such a device is a last resort, anyone with any sense at all would use a debugger.

If you are using very low pin count devices with multiplexed debugger pins then the problem is often easily resolved by using a larger part in the same family for debugging. However, these small parts typically don't have particularly complex code, so going back to the stone age method of debugging is less of a problem.

janoc · « **Reply #17 on:** December 03, 2014, 08:12:03 pm »

@mikerj, @Brutte

In this case using a debugger is actually a problem, as the OP said, because pausing the execution so that you can inspect the state breaks the HW and everything has to be re-initialized. Another common use case with this type of problem is debugging USB stacks - pause the code, the host disconnects you because of a time out and you have to reconnect the device, reinitializing the stack from scratch ...

So printf-ing data to a UART can be a valuable time-saving hack in these situations - not everyone has access to a hw debugger with the tracing feature (and not every hw supports it!).

paulie · « **Reply #18 on:** December 03, 2014, 08:52:59 pm »

Quote from: nctnico on December 03, 2014, 04:57:48 pm

Half the embedded engineers don't use debuggers.

I doubt that very much, certainly for the case of professional engineers where time==money.

IMO it's more like 90% of them prefer the simplest most direct tools and those hw debuggers ain't that. First of all the vast majority of work out there deals with low pin count parts that don't even have the capability. Even for those that do, in many cases a quick register/ram printout turns out more useful. Lastly, of the many engineers I chum with daily only a few actually admit resorting to the fancy schmancy approach and even then not that often.

I think there are many cases when HW debug is the only solution but they're not that common at all. Mostly it's for hotshots who like to show off their expensive and complicated gear so clients and peers can see how smart and "advanced" they are.

nctnico · « **Reply #19 on:** December 04, 2014, 12:35:58 am »

Quote from: mikerj on December 03, 2014, 06:45:28 pm

Quote from: nctnico on December 03, 2014, 04:57:48 pm
Half the embedded engineers don't use debuggers.
I doubt that very much, certainly for the case of professional engineers where time==money.

Ask the other attendees the next time you visit a microcontroller seminar. A hardware debugger has limited use unless you develop all the software on the target but there are much better alternatives for that.

Kjelt · « **Reply #20 on:** December 04, 2014, 09:05:54 am »

It can be good practice IMO to put printf's in debug mode in the code esp. when you are programming with multiple engineers on a large platform.
If any engineer causes an (old) known issue the printf will show the cause again instead of making the same mistake over and over again.
Also profiling (rtos etc.) code is good to have in debug mode esp. with dynamic memory usage.
A debugger is also used a lot when you are stuck on an annoying problem or want to double-check your running code.
So why choose one over the other? They all have their value in time.
In short, in larger projects or with multiple people involved, printf's or substitutes are really valuable IMO, esp. in debug mode. Can you imagine large software stacks such as Windows crashing with only one error code: "Windows crashed please reset computer"?

Jeroen3 · « **Reply #21 on:** December 04, 2014, 09:19:39 am »

Quote

Why everyone insists on printf..ing in hardfaults?

The code sample I posted was equipped with my own faultPrint, not a single library call was made.
It just pushes a const string to the uart.
Indeed I am doing some snprintf'ing to show the registers and some other numbers.
But this could perfectly be replaced by some simple routine converting bin to hex-ascii.
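Such a routine could be as small as the sketch below (the function name is made up). It uses no library calls, so it is safe to call from a fault handler where printf may not be.

```c
#include <stdint.h>

/* Writes `value` as 8 uppercase hex digits plus a terminating NUL.
 * `out` must have room for 9 bytes. */
static void u32_to_hex(uint32_t value, char out[9])
{
    static const char digits[] = "0123456789ABCDEF";
    for (int i = 7; i >= 0; i--) {
        out[i] = digits[value & 0xFu];  /* lowest nibble goes to the rightmost slot */
        value >>= 4;
    }
    out[8] = '\0';
}
```

The resulting string can be handed to the same polled UART routine that prints the constant fault messages.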

Quote

Half the embedded engineers don't use debuggers.

Indeed you skip the jtag/swd header on the production run (if you have DFU). If you're smart you'll make some test points, but sometimes there is not enough space. But you'll usually have a proto, with similar functionality and hardware, and sometimes the bigger binary compatible cousin of the mcu family.

The problem described is a sporadic problem, that would need at least the more expensive debuggers with tracing functionality to be of much help. This requires more pins than your usual debugger. I've never used one of those, neither made the pins available.

If you're running an (RT)OS you might be able to shoot out the last 64 context switches to see what threads did something and in what mode they got switched out.
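A log of the last 64 context switches can be kept in a small ring buffer, for example like this. The record fields and all names are assumptions for illustration; a real RTOS would call the put function from its context-switch hook, if it offers one.

```c
#include <stdint.h>

#define SWITCH_LOG_LEN 64u  /* must be a power of two for the index mask */

typedef struct {
    uint8_t  thread_id;  /* thread being switched in */
    uint32_t tick;       /* timestamp of the switch */
} switch_rec_t;

typedef struct {
    switch_rec_t rec[SWITCH_LOG_LEN];
    uint32_t head;       /* total number of records ever written */
} switch_log_t;

/* Record one context switch, overwriting the oldest entry when full. */
static void switch_log_put(switch_log_t *sw, uint8_t thread_id, uint32_t tick)
{
    switch_rec_t *r = &sw->rec[sw->head & (SWITCH_LOG_LEN - 1u)];
    r->thread_id = thread_id;
    r->tick = tick;
    sw->head++;
}

/* Record written `age` switches ago (0 = most recent).
 * The caller must keep `age` below both `head` and SWITCH_LOG_LEN. */
static switch_rec_t switch_log_get(const switch_log_t *sw, uint32_t age)
{
    return sw->rec[(sw->head - 1u - age) & (SWITCH_LOG_LEN - 1u)];
}
```

In the fault handler the entries can then be walked from age 0 upwards and pushed out over the UART.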

Sal Ammoniac · « **Reply #22 on:** December 04, 2014, 06:18:27 pm »

Quote from: nctnico on December 03, 2014, 04:57:48 pm

Half the embedded engineers don't use debuggers

As a professional embedded engineer, I can say this has not been my experience. Nearly every single one of my colleagues uses debuggers. Sure, they (and I) also use printfs in the code occasionally, but we primarily rely on debuggers.

In an industry where time literally is money, we tend to use high-end, expensive tools that support trace and other advanced features, because these tools save time.

Maybe engineers in underfunded start-ups (or one man companies) don't use debuggers for financial reasons, but I can assure you that here in Silicon Valley the debugger is king in the embedded world.

mikerj · « **Reply #23 on:** December 04, 2014, 06:56:09 pm »

Quote from: nctnico on December 04, 2014, 12:35:58 am

Quote from: mikerj on December 03, 2014, 06:45:28 pm
Quote from: nctnico on December 03, 2014, 04:57:48 pm
Half the embedded engineers don't use debuggers.
I doubt that very much, certainly for the case of professional engineers where time==money.
Ask the other attendees the next time you visit a microcontroller seminar. A hardware debugger has limited use unless you develop all the software on the target but there are much better alternatives for that.

You typically want to DEBUG on the target, I usually develop code on alternative platforms.

poorchava · « **Reply #24 on:** December 04, 2014, 10:40:06 pm »

Well, I may be onto something, although not necessarily connected to the mysterious hard faults caused by FFT. It seems that for some reason the compiler will mess up structure alignment in memory. There are some instructions in CM4 that can do unaligned (byte) access, but they are limited and slow. When I tried to use the arm_copy_q7 math function to copy structures (this function works just like memcpy, but uses SIMD instructions) I was getting Usage Faults coming from unaligned access. Adding `__attribute__((aligned(4)))` to the structure declaration fixes the issue. One could think that the compiler would do stuff like that automatically...
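One possible explanation, assuming the structures in question contain only byte-sized members (or are packed): a struct is only aligned as strictly as its most demanding member, so the compiler is free to place it at any address. The sketch below uses made-up struct names and GCC/Clang attribute syntax to show the difference.

```c
#include <stdint.h>

/* Only byte-sized members, so the natural alignment is 1. */
struct sample_raw {
    uint8_t tag;
    uint8_t payload[6];
};

/* Same members, but every instance is forced onto a 4-byte boundary
 * (and the size is padded up to a multiple of 4). */
struct sample_aligned {
    uint8_t tag;
    uint8_t payload[6];
} __attribute__((aligned(4)));

/* True if `p` is suitable for 32-bit word accesses. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0u;
}
```

A check like is_word_aligned() on both pointers before handing them to a word-wise copy routine would catch the problem as an assertion instead of a Usage Fault.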

