Author Topic: 32F4 hard fault trap - how to track this down? (Read 9281 times)

peter-h · « **on:** July 22, 2022, 07:40:23 pm »

The stack trace is just nonsense - CPU executing invalid opcodes. But the 0xFFFFFFED is a clue, and it googles to various things, but I don't see a systematic method. One guy found it by lighting up an LED and moving the point along his code until it stayed lit at the trap. In my case I have loads of code - about 300k.

thm_w · « **Reply #1 on:** July 22, 2022, 08:35:32 pm »

Look at the call stack: is the same function involved every time, or is it a different function each time?

This might give some clues: https://github.com/ferenc-nemeth/arm-hard-fault-handler

ataradov · « **Reply #2 on:** July 22, 2022, 10:52:35 pm »

It is a good idea to not rely on the IDE for a stack trace.

Here is an example of an instrumented HF handler:

Code: [Select]

//-----------------------------------------------------------------------------
void irq_handler_hard_fault_c(uint32_t lr, uint32_t msp, uint32_t psp)
{
  uint32_t s_r0, s_r1, s_r2, s_r3, s_r12, s_lr, s_pc, s_psr;
  uint32_t r_CFSR, r_HFSR, r_DFSR, r_AFSR, r_BFAR, r_MMAR;
  uint32_t *sp = (uint32_t *)((lr & 4) ? psp : msp);

  s_r0  = sp[0];
  s_r1  = sp[1];
  s_r2  = sp[2];
  s_r3  = sp[3];
  s_r12 = sp[4];
  s_lr  = sp[5];
  s_pc  = sp[6];
  s_psr = sp[7];

  r_CFSR = SCB->CFSR;  // Configurable Fault Status Register (MMFSR, BFSR and UFSR)
  r_HFSR = SCB->HFSR;  // Hard Fault Status Register
  r_DFSR = SCB->DFSR;  // Debug Fault Status Register
  r_MMAR = SCB->MMFAR; // MemManage Fault Address Register
  r_BFAR = SCB->BFAR;  // Bus Fault Address Register
  r_AFSR = SCB->AFSR;  // Auxiliary Fault Status Register

  asm("nop"); // Setup breakpoint here

  while (1);
}

//-----------------------------------------------------------------------------
__attribute__((naked)) void irq_handler_hard_fault(void) // Rename to whatever you have in the vector table
{
  asm volatile (R"asm(
    mov    r0, lr
    mrs    r1, msp
    mrs    r2, psp
    b      irq_handler_hard_fault_c
    )asm"
  );
}

Once you get to the C handler, you will have all the saved context. The interesting value here is s_lr, since it would contain the last valid call that was performed. You can follow the code (assembly) from there and then correlate the values in the saved registers with what the code was likely doing.

And the values of all those xxxSR registers would tell you the nature of the fault.
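To make the "nature of the fault" part concrete, here is a minimal host-side sketch that turns a CFSR value copied from the debugger into flag names. The bit positions follow the ARMv7-M CFSR layout (MMFSR in bits 7:0, BFSR in bits 15:8, UFSR in bits 31:16); the function and table names (`cfsr_decode`, `cfsr_bits`) are made up for this example and are not part of CMSIS.

```c
#include <stdint.h>
#include <string.h>

/* One entry per defined CFSR flag (ARMv7-M layout). */
struct cfsr_bit { uint32_t mask; const char *name; };

static const struct cfsr_bit cfsr_bits[] = {
  { 1u << 0,  "IACCVIOL"    }, { 1u << 1,  "DACCVIOL"  },
  { 1u << 3,  "MUNSTKERR"   }, { 1u << 4,  "MSTKERR"   },
  { 1u << 5,  "MLSPERR"     }, { 1u << 7,  "MMARVALID" },
  { 1u << 8,  "IBUSERR"     }, { 1u << 9,  "PRECISERR" },
  { 1u << 10, "IMPRECISERR" }, { 1u << 11, "UNSTKERR"  },
  { 1u << 12, "STKERR"      }, { 1u << 13, "LSPERR"    },
  { 1u << 15, "BFARVALID"   }, { 1u << 16, "UNDEFINSTR"},
  { 1u << 17, "INVSTATE"    }, { 1u << 18, "INVPC"     },
  { 1u << 19, "NOCP"        }, { 1u << 24, "UNALIGNED" },
  { 1u << 25, "DIVBYZERO"   },
};

/* Hypothetical helper: writes a space-separated list of the flags set in
   cfsr into out and returns how many flags were found. */
int cfsr_decode(uint32_t cfsr, char *out, size_t out_size)
{
  int count = 0;

  if (out_size == 0)
    return 0;
  out[0] = '\0';

  for (size_t i = 0; i < sizeof(cfsr_bits) / sizeof(cfsr_bits[0]); i++) {
    if (cfsr & cfsr_bits[i].mask) {
      if (count > 0)
        strncat(out, " ", out_size - strlen(out) - 1);
      strncat(out, cfsr_bits[i].name, out_size - strlen(out) - 1);
      count++;
    }
  }
  return count;
}
```

BFAR and MMFAR are only meaningful when BFARVALID or MMARVALID shows up in the decoded list.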

peter-h · « **Reply #3 on:** July 23, 2022, 06:21:11 am »

The above shows syntax errors

This is an example of my asm code which is ok

but I am reluctant to go hacking yours in case you did something for a reason

Making the obvious changes produces loads of warnings

ataradov · « **Reply #4 on:** July 23, 2022, 06:27:35 am »

What syntax errors? Does the compiler complain or just the IDE? You are too reliant on IDEs.

Your IDE may not understand this style of strings. You can change to whatever it understands as long as the code remains the same.

I see the edit with the error message. What version of the compiler do you have?

Ok, raw strings seem to be an extension for now, so you need -std=gnu99 or -std=gnu11 passed to the compiler. Or rewrite the strings without the raw stuff.

Warnings are normal, those variables are not used. This is an excerpt from code that used to print them, but you can just look at them in the debugger.

peter-h · « **Reply #5 on:** July 23, 2022, 07:03:04 am »

Those are compiler errors.

GCC v10.

Apologies; I find the ARM asm syntax impenetrable. I programmed in asm for decades but this is something else

I would be grateful for any help. I think a part of it is that one can quote each line, or a whole section of asm, but I can't get anything to compile.

This is nearer, I think:

ataradov · « **Reply #6 on:** July 23, 2022, 07:10:10 am »

You need \n at the end of all those new strings. That's why I used raw strings, so I won't have to do that nonsense myself.
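A small host-side sketch of why the \n matters: adjacent string literals are glued together by the compiler, so without the newline the assembler receives a single line. The names below (`without_newlines`, `with_newlines`, `count_asm_lines`) are only for illustration.

```c
#include <string.h>

/* Adjacent string literals are concatenated by the compiler before the
   assembler ever sees the text. */
static const char without_newlines[] =
  "mov r0, lr"
  "mrs r1, msp";

static const char with_newlines[] =
  "mov r0, lr\n"
  "mrs r1, msp\n";

/* Counts the non-empty lines the assembler would see in text. */
int count_asm_lines(const char *text)
{
  int lines = 0;
  int in_line = 0;

  for (; *text != '\0'; text++) {
    if (*text == '\n') {
      in_line = 0;
    } else if (!in_line) {
      in_line = 1;
      lines++;
    }
  }
  return lines;
}
```

The first string collapses to "mov r0, lrmrs r1, msp", which is one malformed instruction rather than two valid ones.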

ARM syntax is one of the easiest to understand. I'm not sure what is so hard about it.

peter-h · « **Reply #7 on:** July 23, 2022, 07:43:27 am »

You are right of course; I need to read my own code more carefully

I also set optimisation to -O0 for this function, otherwise some of the variables don't show up in the debugger.

Code: [Select]

// This was originally used to print out those registers
__attribute__((optimize("O0")))
void irq_handler_hard_fault_c(uint32_t lr, uint32_t msp, uint32_t psp)
{
  uint32_t s_r0, s_r1, s_r2, s_r3, s_r12, s_lr, s_pc, s_psr;
  uint32_t r_CFSR, r_HFSR, r_DFSR, r_AFSR, r_BFAR, r_MMAR;
  uint32_t *sp = (uint32_t *)((lr & 4) ? psp : msp);

  s_r0  = sp[0];
  s_r1  = sp[1];
  s_r2  = sp[2];
  s_r3  = sp[3];
  s_r12 = sp[4];
  s_lr  = sp[5];
  s_pc  = sp[6];
  s_psr = sp[7];

  r_CFSR = SCB->CFSR;  // Configurable Fault Status Register (MMFSR, BFSR and UFSR)
  r_HFSR = SCB->HFSR;  // Hard Fault Status Register
  r_DFSR = SCB->DFSR;  // Debug Fault Status Register
  r_MMAR = SCB->MMFAR; // MemManage Fault Address Register
  r_BFAR = SCB->BFAR;  // Bus Fault Address Register
  r_AFSR = SCB->AFSR;  // Auxiliary Fault Status Register

  asm("nop"); // Setup breakpoint here

  while (1);
}

// Rename this to whatever you have in the vector table
__attribute__((naked)) void HardFault_Handler(void)
{
	//asm volatile (R"asm(
	asm volatile (
    "mov    r0, lr \n"
    "mrs    r1, msp \n"
    "mrs    r2, psp \n"
    "b      irq_handler_hard_fault_c \n"
    //)asm"
  );
}
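The -O0 attribute above is one way to keep the locals visible. Another option, sketched here as an assumption rather than something anyone in the thread used, is to copy the stacked registers into a volatile global so the stores survive any optimisation level. The names `fault_snapshot`, `g_fault_snapshot` and `fault_snapshot_capture` are hypothetical.

```c
#include <stdint.h>

/* Same eight words the hardware stacks on exception entry. */
struct fault_snapshot {
  uint32_t r0, r1, r2, r3, r12, lr, pc, psr;
};

/* volatile: the compiler must perform every store, even though nothing
   in the program reads the values back. */
volatile struct fault_snapshot g_fault_snapshot;

/* sp points at the stacked frame (MSP or PSP, chosen by the caller). */
void fault_snapshot_capture(const uint32_t *sp)
{
  g_fault_snapshot.r0  = sp[0];
  g_fault_snapshot.r1  = sp[1];
  g_fault_snapshot.r2  = sp[2];
  g_fault_snapshot.r3  = sp[3];
  g_fault_snapshot.r12 = sp[4];
  g_fault_snapshot.lr  = sp[5];
  g_fault_snapshot.pc  = sp[6];
  g_fault_snapshot.psr = sp[7];
}
```

A global also stays inspectable after the handler's own stack frame is gone, which can help if the debugger attaches late.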

I looked for ways to make the Cube stack trace longer but can't find anything. The stack trace length varies anyway.

newbrain · « **Reply #8 on:** July 23, 2022, 08:18:57 am »

Quote from: peter-h on July 22, 2022, 07:40:23 pm

The stack trace is just nonsense - CPU executing invalid opcodes. But the 0xFFFFFFED is a clue, and it googles to various things, but I don't see a systematic method.

https://developer.arm.com/documentation/ddi0403/d/System-Level-Architecture/System-Level-Programmers--Model/ARMv7-M-exception-model/Exception-return-behavior?lang=en

peter-h · « **Reply #9 on:** July 23, 2022, 08:29:38 am »

Yes; it is something to do with the FPU, but (if so) how?

Now that I have put in the extra debug above, the target has decided to run for longer

wek · « **Reply #10 on:** July 23, 2022, 08:42:02 am »

Just don't get fixated on a particular value. Look at the stack as thm_w said. You can also post it for us to chew on, together with the content of the processor registers.

Btw. if you want some other particular value to analyze, then 0x656d6974, that's "time" or "emit"
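A quick way to check whether a suspicious word in a stack dump is really text is to print its bytes in memory order. This is a minimal host-side sketch; `word_to_ascii` is a made-up helper name, and it assumes the little-endian byte order Cortex-M parts normally use.

```c
#include <stdint.h>

/* Writes the four bytes of word into out in little-endian memory order,
   replacing non-printable bytes with '.'. out must hold 5 chars. */
void word_to_ascii(uint32_t word, char out[5])
{
  for (int i = 0; i < 4; i++) {
    unsigned char c = (unsigned char)(word >> (8 * i));
    out[i] = (c >= 0x20 && c < 0x7F) ? (char)c : '.';
  }
  out[4] = '\0';
}
```

Read in memory order, 0x656d6974 comes out as "time"; read as a big-endian number it spells "emit".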

JW

newbrain · « **Reply #11 on:** July 23, 2022, 08:46:51 am »

Quote from: peter-h on July 23, 2022, 08:29:38 am

Yes; it is something to do with the FPU, but (if so) how?

Got to a real keyboard now, so here's some basic info:
0xFFFFFFED is a magic value that's loaded in the lr register when an exception is entered - in a regular subroutine call the return address would be loaded there instead.
When a return is executed (e.g., a branch on the value of lr) the magic value indicates that this is not a regular subroutine return but an exception return.
In a Cortex-M with an FP extension, the stack frame that's automatically saved entering the exception might (Extended) or might not (Basic) contain FP registers.
As the table states, 0xFFFFFFED means: Thread mode, use Process stack pointer and restore an Extended stack frame.
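The table boils down to three bits of the value. A minimal host-side sketch of the decoding, assuming an ARMv7-M core with the FP extension (the struct and function names are made up for this example):

```c
#include <stdbool.h>
#include <stdint.h>

struct exc_return_info {
  bool valid;          /* upper bits match the EXC_RETURN pattern      */
  bool thread_mode;    /* bit 3: 1 = return to Thread, 0 = Handler     */
  bool uses_psp;       /* bit 2: 1 = frame is on PSP, 0 = on MSP       */
  bool extended_frame; /* bit 4: 0 = frame also holds FP registers     */
};

/* Hypothetical helper: splits an EXC_RETURN value into its fields. */
struct exc_return_info exc_return_decode(uint32_t lr)
{
  struct exc_return_info info;

  info.valid          = (lr & 0xFFFFFFE0u) == 0xFFFFFFE0u;
  info.thread_mode    = (lr & (1u << 3)) != 0;
  info.uses_psp       = (lr & (1u << 2)) != 0;
  info.extended_frame = (lr & (1u << 4)) == 0;
  return info;
}
```

The `(lr & 4) ? psp : msp` test in the handler posted earlier is the same bit 2 check.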

peter-h · « **Reply #12 on:** July 23, 2022, 10:21:53 am »

It is still running

But I had an idea, relating to the extended stack frame: I read somewhere that the printf family uses the heap for the %f implementation. Clearly that would be a disaster for thread safety under an RTOS. It is library dependent and I can't find GCC-specific info, but if true this would definitely be a vulnerability, because a) malloc and free are definitely not thread-safe (I can mutex them, having made sure none are used before mutexes become available, which is quite late in my main()) and b) the heap is a stupid idea anyway when used that way, due to fragmentation.

I can modify all instances of %f to output two integers etc.
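A minimal sketch of the two-integer approach, assuming the value is already held as a scaled integer in thousandths (for example millivolts). `format_milli` is a hypothetical helper name. It only sidesteps the floating-point conversion; it says nothing about whether the rest of printf is thread-safe in a given library.

```c
#include <stdint.h>
#include <stdio.h>

/* Formats a value held in thousandths as "[-]I.FFF" using integer
   conversions only. Returns whatever snprintf returns. */
int format_milli(char *out, size_t out_size, int32_t milli)
{
  /* Take the magnitude in unsigned arithmetic so INT32_MIN is safe. */
  uint32_t mag = (milli < 0) ? (uint32_t)0 - (uint32_t)milli
                             : (uint32_t)milli;

  return snprintf(out, out_size, "%s%lu.%03lu",
                  (milli < 0) ? "-" : "",
                  (unsigned long)(mag / 1000u),
                  (unsigned long)(mag % 1000u));
}
```

The sign is handled separately because a value such as -50 has an integer part of zero, which would otherwise lose the minus.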

EDIT: I found malloc() in the library. No source but I set a breakpoint on the FLASH address. An ignore count of 5 gets me past known usage. Then I don't see any calls to it despite using stuff like %7.3f in printfs.

peter-h · « **Reply #13 on:** July 23, 2022, 11:33:30 am »

It finally bombed:

Code: [Select]

lr	uint32_t	0xffffffed (Hex)	
msp	uint32_t	0x2001ffe0 (Hex)	
psp	uint32_t	0x100079f0 (Hex)	
s_r0	uint32_t	0x200060e8 (Hex)	
s_r1	uint32_t	0x200060ec (Hex)	
s_r2	uint32_t	0x64 (Hex)	
s_r3	uint32_t	0x5c28f5c9 (Hex)	
s_r12	uint32_t	0xa5a5a5a5 (Hex)	
s_lr	uint32_t	0xa5a5a5a5 (Hex)	
s_pc	uint32_t	0xa5a5a5a5 (Hex)	
s_psr	uint32_t	0xa5a5a5a5 (Hex)	
r_CFSR	uint32_t	0x60000000 (Hex)	
r_HFSR	uint32_t	0x20000 (Hex)	
r_DFSR	uint32_t	0x40000000 (Hex)	
r_AFSR	uint32_t	0x0 (Hex)	
r_BFAR	uint32_t	0xe000ed34 (Hex)	
r_MMAR	uint32_t	0x9 (Hex)	
sp	uint32_t *	0x100013dc (Hex)	

r0	0xffffffed (Hex)		
r1	0x2001ffe0 (Hex)		
r2	0x100079f0 (Hex)		
r3	0x0 (Hex)		
r4	0x804beac (Hex)		
r5	0x3e8 (Hex)		
r6	0x20002478 (Hex)		
r7	0x2001ff88 (Hex)		
r8	0x801ce69 (Hex)		
r9	0xa5a5a5a5 (Hex)		
r10	0xa5a5a5a5 (Hex)		
r11	0xa5a5a5a5 (Hex)		
r12	0xa0000000 (Hex)		
sp	0x2001ff88 (Hex)		
lr	0xffffffed (Hex)		
pc	0x8041d42 (Hex)		
xpsr	0x21000003 (Hex)		
d0	0x0 (Hex)		
d1	0x0 (Hex)		
d2	0x0 (Hex)		
d3	0x0 (Hex)		
d4	0x0 (Hex)		
d5	0x0 (Hex)		
d6	0x0 (Hex)		
d7	0x0 (Hex)		
d8	0x0 (Hex)		
d9	0x0 (Hex)		
d10	0x0 (Hex)		
d11	0x0 (Hex)		
d12	0x0 (Hex)		
d13	0x0 (Hex)

SP at 0x2001ff88 is reasonable (my general stack is 2001e000-2001ffff).
MSP is the same as above. The CPU switches SP to MSP or PSP, automatically.
PSP at 0x100079f0 is in one of the RTOS stacks (RTOS uses the 64k CCM at 10000000-1000ffff) but I need to restart the target to find out which task it belongs to (I have a graphical display of the CCM block, with a mouseover display of the address and data). But before I restart the target, someone here may have a suggestion to do something else, so I won't restart it yet.
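The region checks above can be written down as a small host-side classifier for pasting values from the dump. The general stack and CCM ranges are the ones quoted in this post; the flash range and the lower bound of main RAM are assumptions for illustration and need adjusting to the actual part and linker script.

```c
#include <stdint.h>

enum region {
  REGION_UNKNOWN,
  REGION_FLASH,      /* assumed 0x08000000-0x080FFFFF                  */
  REGION_CCM_RTOS,   /* 0x10000000-0x1000FFFF, RTOS stacks             */
  REGION_MAIN_RAM,   /* assumed 0x20000000 up to the general stack     */
  REGION_MAIN_STACK  /* 0x2001E000-0x2001FFFF, general (MSP) stack     */
};

/* Hypothetical helper: says which memory region an address falls in. */
enum region classify_address(uint32_t addr)
{
  if (addr >= 0x2001E000u && addr <= 0x2001FFFFu) return REGION_MAIN_STACK;
  if (addr >= 0x20000000u && addr <= 0x2001DFFFu) return REGION_MAIN_RAM;
  if (addr >= 0x10000000u && addr <= 0x1000FFFFu) return REGION_CCM_RTOS;
  if (addr >= 0x08000000u && addr <= 0x080FFFFFu) return REGION_FLASH;
  return REGION_UNKNOWN;
}
```

Anything that classifies as unknown in a slot that should hold an address, such as 0xa5a5a5a5 in s_pc, is a hint that the slot was never written with a real value.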

Thank you for any pointers.

wek · « **Reply #14 on:** July 23, 2022, 12:09:26 pm »

I said, look at the *stack*, not just stack pointer. You're about to enter hard stuff. You may find dissecting 300kLOC to be a viable option.

-----

> printf() using malloc() [thus not reentrant]

It may come as a nasty surprise, but the C standard does not guarantee that *any* library function is reentrant. No, not even abs().

Some versions of printf() may use internal heap, but that's not reentrant either.

https://nadler.com/embedded/newlibAndFreeRTOS.html

Generally, printf() and kin have no place on an MCU. If you want to use them, you'll pay the full price, including the hidden portions.

JW

peter-h · « **Reply #15 on:** July 23, 2022, 12:46:14 pm »

Here is the stack above 2001FF88

I see that FFFFFFED value on it. Addresses 100xxxxx are within the RTOS stacks and addresses 200xxxxx are obviously main RAM.

The only FLASH address I see is 8014610 and that is within the RTOS code (port.c) although the .map file shows nothing specific at that address

and it doesn't look like anything I recognise.

Code: [Select]

          prvPortStartFirstTask:
080145f0:   ldr     r0, [pc, #32]   ; (0x8014614)
080145f2:   ldr     r0, [r0, #0]
080145f4:   ldr     r0, [r0, #0]
080145f6:   msr     MSP, r0
080145fa:   mov.w   r0, #0
080145fe:   msr     CONTROL, r0
08014602:   cpsie   i
08014604:   cpsie   f
08014606:   dsb     sy
0801460a:   isb     sy
0801460e:   svc     0
08014610:   nop     
282       }
08014612:   movs    r0, r0
08014614:           ; <UNDEFINED> instruction: 0xed08e000
708       	__asm volatile
          vPortEnableVFP:
08014618:   ldr.w   r0, [pc, #12]   ; 0x8014628
0801461c:   ldr     r1, [r0, #0]

But then I don't know how to interpret that stack frame. It is automated.

I did see that Nadler site but a breakpoint on malloc doesn't reveal any heap usage at all, after the known calls (5 of them after startup). GCC printf doesn't call a malloc for sure. It may have an internal heap... In years past, I saw some usage of statics which would obviously not be thread safe.

abyrvalg · « **Reply #16 on:** July 23, 2022, 01:43:43 pm »

Check 080148A8 (a saved PC/LR of a Thumb code always has bit 0 set. The 08014610 you’ve spotted is just some pointer).
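The bit 0 rule can be applied mechanically to a stack dump. Below is a minimal host-side sketch that picks out words that look like return addresses pushed by Thumb code; the flash bounds are assumptions for illustration and the function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed flash range; adjust to the actual device. */
#define FLASH_START 0x08000000u
#define FLASH_END   0x080FFFFFu

/* A saved LR from Thumb code has bit 0 set and points into flash. */
static int looks_like_thumb_return(uint32_t word)
{
  return (word & 1u) != 0 && word >= FLASH_START && word <= FLASH_END;
}

/* Stores the indices of candidate words in found[] (at most max_found)
   and returns how many were stored. */
size_t find_return_candidates(const uint32_t *words, size_t n_words,
                              size_t *found, size_t max_found)
{
  size_t count = 0;

  for (size_t i = 0; i < n_words && count < max_found; i++) {
    if (looks_like_thumb_return(words[i]))
      found[count++] = i;
  }
  return count;
}
```

Clear bit 0 of a candidate before looking it up in the .map file or the disassembly, which is how a stacked 080148A9 becomes the 080148A8 to check.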

peter-h · « **Reply #17 on:** July 23, 2022, 01:54:21 pm »

Nothing in the .map file for 080148A8 but here is the content

That is within FreeRTOS.

I have been digging around to see how the arm32 stack is filled up and can't find a clear description, so I don't know what to make of 100079f0 (which I am not looking at yet because I would need to restart the target and then all the values may change). This is what is in there (data, not code)

The fact that SP was in the general stack, not within one of the RTOS stacks (0x10000000+) tells me that this was an ISR which did it, because the CPU switches to the general stack for interrupts. My ISRs should all be in main RAM (0x20000000+).

harerod · « **Reply #18 on:** July 23, 2022, 02:50:09 pm »

ataradov, thank you for that snippet. I am looking forward to using this.
I added this to my existing HardFault handler. In my production code the while(1) is replaced by a define, which is either an eternal loop for debugging or an immediate system restart request for production use. I also added "__attribute__((unused))".
Adjusted for CubeIDE 1.3, your examples might read like:

Code: [Select]

// put this function in Vector Table
__attribute__((naked)) void HardFault_Handler_asm(void)
{
  asm(
    "mov    r0, lr\n"
    "mrs    r1, msp\n"
    "mrs    r2, psp\n"
    "b      HardFault_Handler_c\n" // must match the C handler's name below
  );
} 

// c-handler with breakpoint
void HardFault_Handler_c(uint32_t lr, uint32_t msp, uint32_t psp)
{
  __attribute__((unused)) uint32_t s_r0, s_r1, s_r2, s_r3, s_r12, s_lr, s_pc, s_psr;
  __attribute__((unused)) uint32_t r_CFSR, r_HFSR, r_DFSR, r_AFSR, r_BFAR, r_MMAR;
  uint32_t *sp = (uint32_t *)((lr & 4) ? psp : msp);

  s_r0  = sp[0];
  s_r1  = sp[1];
  s_r2  = sp[2];
  s_r3  = sp[3];
  s_r12 = sp[4];
  s_lr  = sp[5];
  s_pc  = sp[6];
  s_psr = sp[7];

  r_CFSR = SCB->CFSR;  // Configurable Fault Status Register (MMFSR, BFSR and UFSR)
  r_HFSR = SCB->HFSR;  // Hard Fault Status Register
  r_DFSR = SCB->DFSR;  // Debug Fault Status Register
  r_MMAR = SCB->MMFAR; // MemManage Fault Address Register
  r_BFAR = SCB->BFAR;  // Bus Fault Address Register
  r_AFSR = SCB->AFSR;  // Auxiliary Fault Status Register

  asm("nop"); // Setup breakpoint here

  while(1);
}

wek · « **Reply #19 on:** July 23, 2022, 02:50:44 pm »

> I don't know how to interpret that stack frame.

The first two rows are - as you've already used them - R0, R1, R2, R3, R12, LR, PC, xPSR, as they were stacked by the hardware on entry to the fault handler.

If you have the FPU enabled, then the next 4 rows are the FPU registers (or just space for them if lazy stacking is on, which it probably is) and one more word is FPSCR. There may be an aligner, too, see CCR.STKALIGN.

The rest is what was at the stack at the moment when the fault happened.
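The layout described above can be written as two structs, which also gives the arithmetic for where "the rest" starts. This is a host-side sketch assuming the ARMv7-M frame format with the FP extension; the struct and function names are made up for this example.

```c
#include <stdint.h>

/* Basic frame: 8 words. */
struct basic_frame {
  uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};

/* Extended frame: basic frame plus S0-S15, FPSCR and one reserved word. */
struct extended_frame {
  struct basic_frame core;
  uint32_t s[16];
  uint32_t fpscr;
  uint32_t reserved;
};

/* Returns the SP value the interrupted code had before the frame was
   pushed. Bit 4 of EXC_RETURN selects the frame type; bit 9 of the
   stacked xPSR says whether a 4-byte aligner was inserted. */
uint32_t sp_before_exception(uint32_t frame_addr, uint32_t exc_return,
                             uint32_t stacked_xpsr)
{
  uint32_t size = (exc_return & (1u << 4))
                    ? (uint32_t)sizeof(struct basic_frame)
                    : (uint32_t)sizeof(struct extended_frame);
  uint32_t pad = (stacked_xpsr & (1u << 9)) ? 4u : 0u;

  return frame_addr + size + pad;
}
```

With lazy stacking the S0-S15 slots are reserved but may still hold stale data, so only the first eight words can be trusted unconditionally.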

This is a post-mortem status; nobody said it would be useful, and it may be full of red herrings, but it's all you have. The svc instruction in the snippet you posted causes the SVC exception, which should also stack 0x08014610 as PC at that point, but that would be at the bottom of the stack only if the FPU were not used, so that's confusing. I'd have a look at the SVC handler, too, just for the fun. Yes, I know that's the heart of the RTOS. I don't use an RTOS and have exactly zero experience debugging it or debugging within it.

"Normally", with "simple errors", the fault happens so that PC points to the last "correct" offending instruction. Your PC points to 0x00208CEA. I don't know how that area behaves, probably traps, so it must've been a jump to that address immediately before, but we of course have no trace of where it jumped from. The problem with post-mortem analysis is, that you can't walk backwards (plus runaway program sometimes destroys evidence, too).

What is also strange is the content of the CFSR and HFSR registers you've given above; they are complete nonsense.
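One way to make "nonsense" checkable is to mask out the bits the architecture defines and see whether anything is left. A minimal host-side sketch, with masks built from the ARMv7-M register layouts (the macro and function names are made up for this example):

```c
#include <stdint.h>

/* Defined CFSR bits: MMFSR 0,1,3,4,5,7; BFSR 8-13,15; UFSR 16-19,24,25. */
#define CFSR_DEFINED_BITS 0x030FBFBBu
/* Defined HFSR bits: VECTTBL (1), FORCED (30), DEBUGEVT (31). */
#define HFSR_DEFINED_BITS 0xC0000002u

/* Both helpers return the reserved bits that are set; a non-zero result
   means the value cannot be a genuine reading of that register. */
uint32_t cfsr_reserved_bits(uint32_t cfsr)
{
  return cfsr & ~CFSR_DEFINED_BITS;
}

uint32_t hfsr_reserved_bits(uint32_t hfsr)
{
  return hfsr & ~HFSR_DEFINED_BITS;
}
```

Both posted values, 0x60000000 for CFSR and 0x20000 for HFSR, consist only of reserved bits. As a side observation (an assumption, not something the thread confirms): 0x00020000 would be INVSTATE if it were read as CFSR, and 0x40000000 would be FORCED if read as HFSR, so it may be worth checking whether the values were shown against the wrong names.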

JW

wek · « **Reply #20 on:** July 23, 2022, 02:57:38 pm »

Coincidence?

As I've said, I know nothing about RTOS.

JW

peter-h · « **Reply #21 on:** July 23, 2022, 03:07:21 pm »

SVC handler:

Code: [Select]

static void prvTaskExitError( void )
{
volatile uint32_t ulDummy = 0;

	/* A function that implements a task must not exit or attempt to return to
	its caller as there is nothing to return to.  If a task wants to exit it
	should instead call vTaskDelete( NULL ).

	Artificially force an assert() to be triggered if configASSERT() is
	defined, then stop here so application writers can catch the error. */
	configASSERT( uxCriticalNesting == ~0UL );
	portDISABLE_INTERRUPTS();
	while( ulDummy == 0 )
	{
		/* This file calls prvTaskExitError() after the scheduler has been
		started to remove a compiler warning about the function being defined
		but never called.  ulDummy is used purely to quieten other warnings
		about code appearing after this function is called - making ulDummy
		volatile makes the compiler think the function could return and
		therefore not output an 'unreachable code' warning for code that appears
		after it. */
	}
}
/*-----------------------------------------------------------*/

void vPortSVCHandler( void )
{
	__asm volatile (
					"	ldr	r3, pxCurrentTCBConst2		\n" /* Restore the context. */
					"	ldr r1, [r3]					\n" /* Use pxCurrentTCBConst to get the pxCurrentTCB address. */
					"	ldr r0, [r1]					\n" /* The first item in pxCurrentTCB is the task top of stack. */
					"	ldmia r0!, {r4-r11, r14}		\n" /* Pop the registers that are not automatically saved on exception entry and the critical nesting count. */
					"	msr psp, r0						\n" /* Restore the task stack pointer. */
					"	isb								\n"
					"	mov r0, #0 						\n"
					"	msr	basepri, r0					\n"
					"	bx r14							\n"
					"									\n"
					"	.align 4						\n"
					"pxCurrentTCBConst2: .word pxCurrentTCB				\n"
				);
}
/*-----------------------------------------------------------*/

static void prvPortStartFirstTask( void )
{
	/* Start the first task.  This also clears the bit that indicates the FPU is
	in use in case the FPU was used before the scheduler was started - which
	would otherwise result in the unnecessary leaving of space in the SVC stack
	for lazy saving of FPU registers. */
	__asm volatile(
					" ldr r0, =0xE000ED08 	\n" /* Use the NVIC offset register to locate the stack. */
					" ldr r0, [r0] 			\n"
					" ldr r0, [r0] 			\n"
					" msr msp, r0			\n" /* Set the msp back to the start of the stack. */
					" mov r0, #0			\n" /* Clear the bit that indicates the FPU is in use, see comment above. */
					" msr control, r0		\n"
					" cpsie i				\n" /* Globally enable interrupts. */
					" cpsie f				\n"
					" dsb					\n"
					" isb					\n"
					" svc 0					\n" /* System call to start first task. */
					" nop					\n"
				);
}

I am at the limit of my knowledge here, but if I can find the address of the code which resulted in this trap I can put in breakpoints around that.

That 79F0 address is within the TCP/IP RTOS task, which is completely unsurprising, but the code for that is in the FLASH. The stuff at 79F0 is just a data+RTOS stack area. For example if a function running under an RTOS declares a variable, that variable, being stack-based as normal, will end up in this area.

I can leave this for a bit, otherwise I can restart it and see if it ends up in the same place.

Jeroen3 · « **Reply #22 on:** July 23, 2022, 03:36:36 pm »

Do you know what type of error caused it yet?
Have you checked the fault analyzer in CubeIDE yet? It will tell you exactly what everyone here is trying to access via the complicated assembly.

(Relying on IDEs too much lol.)

peter-h · « **Reply #23 on:** July 23, 2022, 04:12:58 pm »

0x100079f0 is within one of the RTOS stack areas, for task "TCP/IP". This is the whole RTOS stack space for that task (FreeRTOS fills its entire workspace (a sort of heap actually) with A5).
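The A5 fill makes it possible to measure how much of a task stack was never touched. A minimal host-side sketch of the idea, assuming a stack that grows downwards so the untouched part sits at the low end; `unused_stack_bytes` is a made-up name (FreeRTOS offers uxTaskGetStackHighWaterMark() for the same purpose on the target).

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_BYTE 0xA5u

/* Counts fill bytes from the lowest address of the stack upwards and
   stops at the first byte that has been overwritten. */
size_t unused_stack_bytes(const uint8_t *stack_low, size_t stack_size)
{
  size_t n = 0;

  while (n < stack_size && stack_low[n] == STACK_FILL_BYTE)
    n++;
  return n;
}
```

A result of zero means the lowest byte has been written, which is a strong hint that the stack has overflowed or that something else has written into it.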

Interestingly at 79F0 is 0x00000000, which looks like it overwrote the tail end of a MEM_SYS text string which was there before (which may not be relevant).

If this is repeatable I can do a watchpoint on 79F0 and 0x00000000.

The "unused stack" areas are much bigger than shown; I cropped it of necessity.

Jeroen3 · « **Reply #24 on:** July 23, 2022, 04:38:56 pm »

Have you enabled "halt on exception" in the debug startup settings yet? Then it breaks at the exact instruction causing the fault, with the context intact.
You can then look at the window with the function call trace (I forgot the name) to see how you got to that point and which pointers were used to get there.
You should be able to click back in that trace to see more context of those functions and whether any pointers go to something they shouldn't. If that trace is gibberish, you've smashed the stack.

