Topic: 32F4 hard fault trap - how to track this down?

peter-h · « **on:** July 22, 2022, 07:40:23 pm »

The stack trace is just nonsense - CPU executing invalid opcodes. But the 0xFFFFFFED is a clue, and it googles to various things, but I don't see a systematic method. One guy found it by lighting up an LED and moving the point along his code until it stayed lit at the trap. In my case I have loads of code - about 300k.

thm_w · « **Reply #1 on:** July 22, 2022, 08:35:32 pm »

Look at the call stack: was any specific function occurring, or is it a different function each time?

This might give some clues: https://github.com/ferenc-nemeth/arm-hard-fault-handler

ataradov · « **Reply #2 on:** July 22, 2022, 10:52:35 pm »

It is a good idea not to rely on the IDE for the stack trace.

Here is an example of an instrumented HF handler:

Code:

//-----------------------------------------------------------------------------
void irq_handler_hard_fault_c(uint32_t lr, uint32_t msp, uint32_t psp)
{
  uint32_t s_r0, s_r1, s_r2, s_r3, s_r12, s_lr, s_pc, s_psr;
  uint32_t r_CFSR, r_HFSR, r_DFSR, r_AFSR, r_BFAR, r_MMAR;
  uint32_t *sp = (uint32_t *)((lr & 4) ? psp : msp);

  s_r0  = sp[0];
  s_r1  = sp[1];
  s_r2  = sp[2];
  s_r3  = sp[3];
  s_r12 = sp[4];
  s_lr  = sp[5];
  s_pc  = sp[6];
  s_psr = sp[7];

  r_CFSR = SCB->CFSR;  // Configurable Fault Status Register (MMSR, BFSR and UFSR)
  r_HFSR = SCB->HFSR;  // Hard Fault Status Register
  r_DFSR = SCB->DFSR;  // Debug Fault Status Register
  r_MMAR = SCB->MMFAR; // MemManage Fault Address Register
  r_BFAR = SCB->BFAR;  // Bus Fault Address Register
  r_AFSR = SCB->AFSR;  // Auxiliary Fault Status Register

  asm("nop"); // Setup breakpoint here

  while (1);
}

//-----------------------------------------------------------------------------
__attribute__((naked)) void irq_handler_hard_fault(void) // Rename to whatever you have in the vector table
{
  asm volatile (R"asm(
    mov    r0, lr
    mrs    r1, msp
    mrs    r2, psp
    b      irq_handler_hard_fault_c
    )asm"
  );
}

Once you get to the C handler, you will have all the saved context. The interesting value here is s_lr, since it would contain the return address of the last valid call that was performed. You can follow the code (assembly) from there and then correlate the values in the saved registers with what the code was likely doing.

And the values of all those xxxSR registers would tell you the nature of the fault.
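As a rough illustration of reading those status registers, here is a minimal sketch that turns CFSR and HFSR values into bit names. The bit positions are the ARMv7-M ones (with the FP extension); the table layout and the names `decode_cfsr`, `decode_hfsr` and `fault_bit_t` are made up for this example and are not part of CMSIS or of the handler above.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Bit positions follow the ARMv7-M fault status registers.
   The table and function names are invented for this sketch. */
typedef struct {
  uint32_t    mask;
  const char *name;
} fault_bit_t;

static const fault_bit_t cfsr_bits[] = {
  { 1u << 0,  "IACCVIOL"    },  /* MemManage: instruction access violation */
  { 1u << 1,  "DACCVIOL"    },  /* MemManage: data access violation */
  { 1u << 3,  "MUNSTKERR"   },  /* MemManage fault while unstacking */
  { 1u << 4,  "MSTKERR"     },  /* MemManage fault while stacking */
  { 1u << 5,  "MLSPERR"     },  /* MemManage fault during lazy FP save */
  { 1u << 7,  "MMARVALID"   },  /* MMFAR holds a valid address */
  { 1u << 8,  "IBUSERR"     },  /* BusFault on instruction fetch */
  { 1u << 9,  "PRECISERR"   },  /* precise data bus error */
  { 1u << 10, "IMPRECISERR" },  /* imprecise data bus error */
  { 1u << 11, "UNSTKERR"    },  /* BusFault while unstacking */
  { 1u << 12, "STKERR"      },  /* BusFault while stacking */
  { 1u << 13, "LSPERR"      },  /* BusFault during lazy FP save */
  { 1u << 15, "BFARVALID"   },  /* BFAR holds a valid address */
  { 1u << 16, "UNDEFINSTR"  },  /* UsageFault: undefined instruction */
  { 1u << 17, "INVSTATE"    },  /* UsageFault: invalid EPSR state */
  { 1u << 18, "INVPC"       },  /* UsageFault: bad EXC_RETURN or PC load */
  { 1u << 19, "NOCP"        },  /* UsageFault: coprocessor (FPU) disabled */
  { 1u << 24, "UNALIGNED"   },  /* UsageFault: unaligned access */
  { 1u << 25, "DIVBYZERO"   },  /* UsageFault: divide by zero */
};

static const fault_bit_t hfsr_bits[] = {
  { 1u << 1,  "VECTTBL"  },     /* fault on vector table read */
  { 1u << 30, "FORCED"   },     /* escalated from a configurable fault */
  { 1u << 31, "DEBUGEVT" },     /* debug event */
};

/* Writes the names of all set, known bits into out (space separated).
   Returns how many names were written; stops early if out is full. */
static int decode_bits(uint32_t value, const fault_bit_t *tbl, size_t n,
                       char *out, size_t out_size)
{
  int count = 0;
  if (out_size == 0) return 0;
  out[0] = '\0';
  for (size_t i = 0; i < n; i++) {
    if ((value & tbl[i].mask) == 0) continue;
    size_t used = strlen(out);
    size_t need = strlen(tbl[i].name) + (count ? 1u : 0u);
    if (used + need + 1 > out_size) break;
    if (count) strcat(out, " ");
    strcat(out, tbl[i].name);
    count++;
  }
  return count;
}

int decode_cfsr(uint32_t cfsr, char *out, size_t out_size)
{
  return decode_bits(cfsr, cfsr_bits, sizeof cfsr_bits / sizeof cfsr_bits[0],
                     out, out_size);
}

int decode_hfsr(uint32_t hfsr, char *out, size_t out_size)
{
  return decode_bits(hfsr, hfsr_bits, sizeof hfsr_bits / sizeof hfsr_bits[0],
                     out, out_size);
}
```

Feeding it r_CFSR and r_HFSR copied out of the debugger (or calling it from the handler before the breakpoint) gives a quick first read on whether the fault was a bus, memory-management or usage fault, and whether BFAR or MMFAR is worth looking at.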

peter-h · « **Reply #3 on:** July 23, 2022, 06:21:11 am »

The above shows syntax errors

This is an example of my asm code which is ok

but I am reluctant to go hacking yours in case you did something for a reason

Making the obvious changes produces loads of warnings

ataradov · « **Reply #4 on:** July 23, 2022, 06:27:35 am »

What syntax errors? Does the compiler complain or just the IDE? You are too reliant on IDEs.

Your IDE may not understand this style of strings. You can change to whatever it understands as long as the code remains the same.

I see the edit with the error message. What version of the compiler do you have?

Ok, raw strings seem to be an extension for now, so you need -std=gnu99 or -std=gnu11 passed to the compiler. Or rewrite the strings without the raw stuff.

Warnings are normal; those variables are not used. This is an excerpt from code that used to print them, but you can just look at them in the debugger.

peter-h · « **Reply #5 on:** July 23, 2022, 07:03:04 am »

Those are compiler errors.

GCC v10.

Apologies; I find the ARM asm syntax impenetrable. I programmed in asm for decades but this is something else

I would be grateful for any help. I think a part of it is that one can quote each line, or a whole section of asm, but I can't get anything to compile.

This is nearer, I think:

ataradov · « **Reply #6 on:** July 23, 2022, 07:10:10 am »

You need \n at the end of all those new strings. That's why I used raw strings, so I won't have to do that nonsense myself.

ARM syntax is one of the easiest to understand. I'm not sure what is so hard about it.

peter-h · « **Reply #7 on:** July 23, 2022, 07:43:27 am »

You are right of course; I need to read my own code more carefully

I also set optimisation to -O0 for this function, otherwise some of the variables don't show up in the debugger.

Code:

// This was originally used to print out those registers
__attribute__((optimize("O0")))
void irq_handler_hard_fault_c(uint32_t lr, uint32_t msp, uint32_t psp)
{
  uint32_t s_r0, s_r1, s_r2, s_r3, s_r12, s_lr, s_pc, s_psr;
  uint32_t r_CFSR, r_HFSR, r_DFSR, r_AFSR, r_BFAR, r_MMAR;
  uint32_t *sp = (uint32_t *)((lr & 4) ? psp : msp);

  s_r0  = sp[0];
  s_r1  = sp[1];
  s_r2  = sp[2];
  s_r3  = sp[3];
  s_r12 = sp[4];
  s_lr  = sp[5];
  s_pc  = sp[6];
  s_psr = sp[7];

  r_CFSR = SCB->CFSR;  // Configurable Fault Status Register (MMSR, BFSR and UFSR)
  r_HFSR = SCB->HFSR;  // Hard Fault Status Register
  r_DFSR = SCB->DFSR;  // Debug Fault Status Register
  r_MMAR = SCB->MMFAR; // MemManage Fault Address Register
  r_BFAR = SCB->BFAR;  // Bus Fault Address Register
  r_AFSR = SCB->AFSR;  // Auxiliary Fault Status Register

  asm("nop"); // Setup breakpoint here

  while (1);
}

// Rename this to whatever you have in the vector table
__attribute__((naked)) void HardFault_Handler(void)
{
	//asm volatile (R"asm(
	asm volatile (
    "mov    r0, lr \n"
    "mrs    r1, msp \n"
    "mrs    r2, psp \n"
    "b      irq_handler_hard_fault_c \n"
    //)asm"
  );
}

I looked for ways to make the Cube stack trace longer but can't find anything. The stack trace length varies anyway.

newbrain · « **Reply #8 on:** July 23, 2022, 08:18:57 am »

Quote from: peter-h on July 22, 2022, 07:40:23 pm

The stack trace is just nonsense - CPU executing invalid opcodes. But the 0xFFFFFFED is a clue, and it googles to various things, but I don't see a systematic method.

https://developer.arm.com/documentation/ddi0403/d/System-Level-Architecture/System-Level-Programmers--Model/ARMv7-M-exception-model/Exception-return-behavior?lang=en

peter-h · « **Reply #9 on:** July 23, 2022, 08:29:38 am »

Yes; it is something to do with the FPU, but (if so) how?

Now that I have put in the extra debug above, the target has decided to run for longer

wek · « **Reply #10 on:** July 23, 2022, 08:42:02 am »

Just don't get fixated on a particular value. Look at the stack as thm_w said. You can also post it for us to chew on, together with the content of the processor registers.

Btw. if you want some other particular value to analyze, then 0x656d6974, that's "time" or "emit"
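For spotting text like this in a stack dump, a small helper that renders a 32-bit word as four characters can be handy. This is only a sketch; `word_to_ascii` is a hypothetical name, and it assumes the little-endian byte order of a Cortex-M, so the lowest byte comes first in memory.

```c
#include <stdint.h>

/* Writes the four bytes of word to out in memory order for a
   little-endian core (lowest byte first). Non-printable bytes are
   shown as '.'. out must have room for 5 chars including the NUL. */
void word_to_ascii(uint32_t word, char out[5])
{
  for (int i = 0; i < 4; i++) {
    uint8_t b = (uint8_t)(word >> (8 * i));
    out[i] = (b >= 0x20 && b < 0x7f) ? (char)b : '.';
  }
  out[4] = '\0';
}
```

With 0x656d6974 this gives "time", which is how the bytes sit in memory; reading the hex digits left to right gives "emit".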

JW

newbrain · « **Reply #11 on:** July 23, 2022, 08:46:51 am »

Quote from: peter-h on July 23, 2022, 08:29:38 am

Yes; it is something to do with the FPU, but (if so) how?

Got to a real keyboard now, so here's some basic info:
0xFFFFFFED is a magic value that's loaded in the lr register when an exception is entered - in a regular subroutine call the return address would be loaded there instead.
When a return is executed (e.g., a branch on the value of lr) the magic value indicates that this is not a regular subroutine return but an exception return.
In a Cortex-M with an FP extension, the stack frame that's automatically saved entering the exception might (Extended) or might not (Basic) contain FP registers.
As the table states, 0xFFFFFFED means: Thread mode, use Process stack pointer and restore an Extended stack frame.
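The table lookup can also be written out as a few bit tests. The sketch below assumes ARMv7-M with the FP extension (bit 4 selects the frame type, bit 3 the mode, bit 2 the stack); the struct and function names are invented for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
  bool valid;           /* upper bits match the EXC_RETURN pattern */
  bool thread_mode;     /* bit 3 set: return to Thread mode, else Handler mode */
  bool uses_psp;        /* bit 2 set: frame is on the process stack, else main stack */
  bool extended_frame;  /* bit 4 clear: frame also holds FP state */
} exc_return_t;

/* Decodes an ARMv7-M EXC_RETURN value as found in lr on exception entry. */
exc_return_t decode_exc_return(uint32_t lr)
{
  exc_return_t r;
  r.valid          = (lr & 0xFFFFFFE0u) == 0xFFFFFFE0u;
  r.thread_mode    = (lr & (1u << 3)) != 0;
  r.uses_psp       = (lr & (1u << 2)) != 0;
  r.extended_frame = (lr & (1u << 4)) == 0;
  return r;
}
```

For 0xFFFFFFED all three properties come out as described: Thread mode, process stack, extended frame. The `(lr & 4) ? psp : msp` test in the handler earlier in the thread is the same bit 2 check.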

peter-h · « **Reply #12 on:** July 23, 2022, 10:21:53 am »

It is still running

But I had an idea, relating to the extended stack frame: I read somewhere that the printf family uses the heap for the %f implementation. Clearly that would be a disaster for thread safety under an RTOS. It is library dependent and I can't find GCC-specific info, but if true this would definitely be a vulnerability, due to a) malloc and free being definitely not thread-safe (I can mutex them, having made sure none are used prior to mutexes having become available, which is quite late in my main()) and b) the heap being a stupid idea anyway when used that way, due to fragmentation.

I can modify all instances of %f to output two integers etc.
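A minimal sketch of what "two integers instead of %f" could look like, assuming three decimals are enough. `format_milli` is a hypothetical helper, not something from the project; whether the integer conversions in a given library build stay off the heap is something to verify rather than assume.

```c
#include <stdio.h>
#include <stddef.h>

/* Formats value with three decimals using only integer conversions.
   Intended for modest magnitudes: value * 1000 must fit in a long.
   Returns what snprintf returns. */
int format_milli(char *out, size_t out_size, float value)
{
  /* Round to the nearest 1/1000, then work with integers only. */
  long scaled = (long)(value * 1000.0f + (value < 0 ? -0.5f : 0.5f));
  const char *sign = (scaled < 0) ? "-" : "";
  unsigned long mag = (scaled < 0) ? (unsigned long)(-scaled)
                                   : (unsigned long)scaled;
  return snprintf(out, out_size, "%s%lu.%03lu", sign, mag / 1000u, mag % 1000u);
}
```

A call such as `format_milli(buf, sizeof buf, 3.3f)` leaves "3.300" in buf. The sign is handled separately so that values between -1 and 0 still print with a minus.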

EDIT: I found malloc() in the library. No source but I set a breakpoint on the FLASH address. An ignore count of 5 gets me past known usage. Then I don't see any calls to it despite using stuff like %7.3f in printfs.

peter-h · « **Reply #13 on:** July 23, 2022, 11:33:30 am »

It finally bombed:

Code:

lr	uint32_t	0xffffffed (Hex)	
msp	uint32_t	0x2001ffe0 (Hex)	
psp	uint32_t	0x100079f0 (Hex)	
s_r0	uint32_t	0x200060e8 (Hex)	
s_r1	uint32_t	0x200060ec (Hex)	
s_r2	uint32_t	0x64 (Hex)	
s_r3	uint32_t	0x5c28f5c9 (Hex)	
s_r12	uint32_t	0xa5a5a5a5 (Hex)	
s_lr	uint32_t	0xa5a5a5a5 (Hex)	
s_pc	uint32_t	0xa5a5a5a5 (Hex)	
s_psr	uint32_t	0xa5a5a5a5 (Hex)	
r_CFSR	uint32_t	0x60000000 (Hex)	
r_HFSR	uint32_t	0x20000 (Hex)	
r_DFSR	uint32_t	0x40000000 (Hex)	
r_AFSR	uint32_t	0x0 (Hex)	
r_BFAR	uint32_t	0xe000ed34 (Hex)	
r_MMAR	uint32_t	0x9 (Hex)	
sp	uint32_t *	0x100013dc (Hex)	

r0	0xffffffed (Hex)		
r1	0x2001ffe0 (Hex)		
r2	0x100079f0 (Hex)		
r3	0x0 (Hex)		
r4	0x804beac (Hex)		
r5	0x3e8 (Hex)		
r6	0x20002478 (Hex)		
r7	0x2001ff88 (Hex)		
r8	0x801ce69 (Hex)		
r9	0xa5a5a5a5 (Hex)		
r10	0xa5a5a5a5 (Hex)		
r11	0xa5a5a5a5 (Hex)		
r12	0xa0000000 (Hex)		
sp	0x2001ff88 (Hex)		
lr	0xffffffed (Hex)		
pc	0x8041d42 (Hex)		
xpsr	0x21000003 (Hex)		
d0	0x0 (Hex)		
d1	0x0 (Hex)		
d2	0x0 (Hex)		
d3	0x0 (Hex)		
d4	0x0 (Hex)		
d5	0x0 (Hex)		
d6	0x0 (Hex)		
d7	0x0 (Hex)		
d8	0x0 (Hex)		
d9	0x0 (Hex)		
d10	0x0 (Hex)		
d11	0x0 (Hex)		
d12	0x0 (Hex)		
d13	0x0 (Hex)

SP at 0x2001ff88 is reasonable (my general stack is 2001e000-2001ffff).
MSP is the same as above. The CPU switches SP to MSP or PSP, automatically.
PSP at 0x100079f0 is in one of the RTOS stacks (RTOS uses the 64k CCM at 10000000-1000ffff) but I need to restart the target to find out which task it belongs to (I have a graphical display of the CCM block, with a mouseover display of the address and data). But before I restart the target, someone here may have a suggestion to do something else, so I won't restart it yet.
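When going through a dump like this, it can help to sort every value by the memory region it would point into. The sketch below uses the ranges given in this post for the CCM and the general stack; the 1 MB flash size and the start of main RAM at 0x20000000 are assumptions, and all names are invented.

```c
#include <stdint.h>

typedef enum {
  REGION_FLASH,       /* code and constants */
  REGION_CCM,         /* 64k CCM, used for the RTOS task stacks here */
  REGION_MAIN_STACK,  /* general (MSP) stack */
  REGION_MAIN_RAM,    /* rest of main RAM */
  REGION_OTHER        /* anything else, e.g. fill patterns or peripherals */
} region_t;

/* Classifies a 32-bit value by the memory region it would point into. */
region_t classify_address(uint32_t addr)
{
  if (addr >= 0x08000000u && addr <= 0x080FFFFFu) return REGION_FLASH; /* assumed 1 MB */
  if (addr >= 0x10000000u && addr <= 0x1000FFFFu) return REGION_CCM;
  if (addr >= 0x2001E000u && addr <= 0x2001FFFFu) return REGION_MAIN_STACK;
  if (addr >= 0x20000000u && addr <= 0x2001DFFFu) return REGION_MAIN_RAM; /* assumed start */
  return REGION_OTHER;
}
```

Run over the values above, msp lands in the general stack, psp in the CCM, pc in flash, and the 0xa5a5a5a5 words in no region at all.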

Thank you for any pointers.

wek · « **Reply #14 on:** July 23, 2022, 12:09:26 pm »

I said, look at the *stack*, not just stack pointer. You're about to enter hard stuff. You may find dissecting 300kLOC to be a viable option.

-----

> printf() using malloc() [thus not reentrant]

It may come as a nasty surprise, but by the C standard, *no* library function is reentrant. No, not even abs().

Some versions of printf() may use internal heap, but that's not reentrant either.

https://nadler.com/embedded/newlibAndFreeRTOS.html

Generally, printf() and kin have no place in mcu. If you want to use them, you'll pay all the price, including the hidden portions.

JW

peter-h · « **Reply #15 on:** July 23, 2022, 12:46:14 pm »

Here is the stack above 2001FF88

I see that FFFFFFED value on it. Addresses 100xxxxx are within the RTOS stacks and addresses 200xxxxx are obviously main RAM.

The only FLASH address I see is 8014610 and that is within the RTOS code (port.c) although the .map file shows nothing specific at that address

and it doesn't look like anything I recognise.

Code:

          prvPortStartFirstTask:
080145f0:   ldr     r0, [pc, #32]   ; (0x8014614)
080145f2:   ldr     r0, [r0, #0]
080145f4:   ldr     r0, [r0, #0]
080145f6:   msr     MSP, r0
080145fa:   mov.w   r0, #0
080145fe:   msr     CONTROL, r0
08014602:   cpsie   i
08014604:   cpsie   f
08014606:   dsb     sy
0801460a:   isb     sy
0801460e:   svc     0
08014610:   nop     
282       }
08014612:   movs    r0, r0
08014614:           ; <UNDEFINED> instruction: 0xed08e000
708       	__asm volatile
          vPortEnableVFP:
08014618:   ldr.w   r0, [pc, #12]   ; 0x8014628
0801461c:   ldr     r1, [r0, #0]

But then I don't know how to interpret that stack frame. It is automated.

I did see that Nadler site but a breakpoint on malloc doesn't reveal any heap usage at all, after the known calls (5 of them after startup). GCC printf doesn't call a malloc for sure. It may have an internal heap... In years past, I saw some usage of statics which would obviously not be thread safe.

abyrvalg · « **Reply #16 on:** July 23, 2022, 01:43:43 pm »

Check 080148A8 (a saved PC/LR of a Thumb code always has bit 0 set. The 08014610 you’ve spotted is just some pointer).
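The bit 0 rule can be applied mechanically to a whole stack dump. Below is a sketch that picks out words which look like saved Thumb return addresses (odd, and inside flash); the function name and the idea of passing the flash range as parameters are assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Scans words[0..n_words) for values that look like saved Thumb return
   addresses: bit 0 set and inside [flash_start, flash_end).
   Stores up to max_out matching indices in index_out and returns the
   number stored. */
size_t find_return_candidates(const uint32_t *words, size_t n_words,
                              uint32_t flash_start, uint32_t flash_end,
                              size_t *index_out, size_t max_out)
{
  size_t found = 0;
  for (size_t i = 0; i < n_words && found < max_out; i++) {
    uint32_t w = words[i];
    int in_flash = (w >= flash_start) && (w < flash_end);
    int thumb    = (w & 1u) != 0;
    if (in_flash && thumb)
      index_out[found++] = i;
  }
  return found;
}
```

Clear bit 0 of each hit before looking it up in the .map file or the disassembly. Odd constants that happen to fall in the flash range will show up as false positives, so treat the result as a list of candidates only.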

peter-h · « **Reply #17 on:** July 23, 2022, 01:54:21 pm »

Nothing in the .map file for 080148A8 but here is the content

That is within FreeRTOS.

I have been digging around to see how the arm32 stack is filled up and can't find a clear description, so I don't know what to make of 100079f0 (which I am not looking at yet because I would need to restart the target and then all the values may change). This is what is in there (data, not code)

The fact that SP was in the general stack, not within one of the RTOS stacks (0x10000000+), tells me that this was an ISR which did it, because the CPU switches to the general stack for interrupts. My ISRs should all be in main RAM (0x20000000+).

harerod · « **Reply #18 on:** July 23, 2022, 02:50:09 pm »

ataradov, thank you for that snippet. I am looking forward to using this.
I added this to my existing HardFault-handler. In my production code the while(1) is replaced by a define, which is either an eternal loop for debugging or an immediate system restart request for production use. I also added "__attribute__((unused))".
Adjusted for CubeIde 1.3, your examples might read like:

Code:

// put this function in Vector Table
__attribute__((naked)) void HardFault_Handler_asm(void)
{
  asm(
    "mov    r0, lr\n"
    "mrs    r1, msp\n"
    "mrs    r2, psp\n"
    "b      HardFault_Handler_c\n" // must match the name of the C handler below
  );
} 

// c-handler with breakpoint
void HardFault_Handler_c(uint32_t lr, uint32_t msp, uint32_t psp)
{
  __attribute__((unused)) uint32_t s_r0, s_r1, s_r2, s_r3, s_r12, s_lr, s_pc, s_psr;
  __attribute__((unused)) uint32_t r_CFSR, r_HFSR, r_DFSR, r_AFSR, r_BFAR, r_MMAR;
  uint32_t *sp = (uint32_t *)((lr & 4) ? psp : msp);

  s_r0  = sp[0];
  s_r1  = sp[1];
  s_r2  = sp[2];
  s_r3  = sp[3];
  s_r12 = sp[4];
  s_lr  = sp[5];
  s_pc  = sp[6];
  s_psr = sp[7];

  r_CFSR = SCB->CFSR;  // Configurable Fault Status Register (MMSR, BFSR and UFSR)
  r_HFSR = SCB->HFSR;  // Hard Fault Status Register
  r_DFSR = SCB->DFSR;  // Debug Fault Status Register
  r_MMAR = SCB->MMFAR; // MemManage Fault Address Register
  r_BFAR = SCB->BFAR;  // Bus Fault Address Register
  r_AFSR = SCB->AFSR;  // Auxiliary Fault Status Register

  asm("nop"); // Setup breakpoint here

  while(1);
}

wek · « **Reply #19 on:** July 23, 2022, 02:50:44 pm »

> I don't know how to interpret that stack frame.

The first two rows are - as you've already used them - R0, R1, R2, R3, R12, LR, PC, xPSR, as they were stacked by the fault handler.

If you have the FPU enabled, then the next 4 rows are the FPU registers (or just space for them if lazy stacking is on, which it probably is) and one more word is FPSCR. There may be an aligner, too, see CCR.STKALIGN.

The rest is what was at the stack at the moment when the fault happened.

This is a post-mortem status; nowhere is it said that it's useful, and it may also be full of red herrings, except that it's all you have. The svc instruction in the snippet you posted causes the SVC exception, which should also stack 0x08014610 as PC at that point, but that would be on the bottom of the stack only if the FPU were not used, so that's confusing. I'd have a look at the SVC handler, too, just for the fun. Yes I know that's the heart of the RTOS. I don't use RTOS and have exactly zero experience debugging it or debugging within it.

"Normally", with "simple errors", the fault happens so that PC points to the last "correct" offending instruction. Your PC points to 0x00208CEA. I don't know how that area behaves, probably traps, so it must've been a jump to that address immediately before, but we of course have no trace of where it jumped from. The problem with post-mortem analysis is, that you can't walk backwards (plus runaway program sometimes destroys evidence, too).

What is also strange is the content of the CFSR and HFSR registers you've given above; they are complete nonsense.
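One way to make "nonsense" concrete is to mask off every architecturally defined bit and see what remains. This sketch assumes ARMv7-M with the FP extension and assumes reserved bits read as zero, which is what I would expect on a Cortex-M4; the macro and function names are made up.

```c
#include <stdint.h>

/* Defined bits: MMFSR 0xBB, BFSR 0xBF00, UFSR 0x030F0000. */
#define CFSR_DEFINED_BITS 0x030FBFBBu
/* Defined bits: VECTTBL (1), FORCED (30), DEBUGEVT (31). */
#define HFSR_DEFINED_BITS 0xC0000002u

/* Returns the bits of cfsr that fall outside the defined fields. */
uint32_t cfsr_reserved_bits(uint32_t cfsr)
{
  return cfsr & ~CFSR_DEFINED_BITS;
}

/* Returns the bits of hfsr that fall outside the defined fields. */
uint32_t hfsr_reserved_bits(uint32_t hfsr)
{
  return hfsr & ~HFSR_DEFINED_BITS;
}
```

With the values posted earlier (CFSR 0x60000000, HFSR 0x20000) both functions return the input unchanged, meaning only reserved bits are set. A result like that suggests the debugger is not showing what was actually read from the registers, so it is worth checking how the variables were captured before interpreting them.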

JW

wek · « **Reply #20 on:** July 23, 2022, 02:57:38 pm »

Coincidence?

As I've said, I know nothing about RTOS.

JW

peter-h · « **Reply #21 on:** July 23, 2022, 03:07:21 pm »

SVC handler:

Code:

static void prvTaskExitError( void )
{
volatile uint32_t ulDummy = 0;

	/* A function that implements a task must not exit or attempt to return to
	its caller as there is nothing to return to.  If a task wants to exit it
	should instead call vTaskDelete( NULL ).

	Artificially force an assert() to be triggered if configASSERT() is
	defined, then stop here so application writers can catch the error. */
	configASSERT( uxCriticalNesting == ~0UL );
	portDISABLE_INTERRUPTS();
	while( ulDummy == 0 )
	{
		/* This file calls prvTaskExitError() after the scheduler has been
		started to remove a compiler warning about the function being defined
		but never called.  ulDummy is used purely to quieten other warnings
		about code appearing after this function is called - making ulDummy
		volatile makes the compiler think the function could return and
		therefore not output an 'unreachable code' warning for code that appears
		after it. */
	}
}
/*-----------------------------------------------------------*/

void vPortSVCHandler( void )
{
	__asm volatile (
					"	ldr	r3, pxCurrentTCBConst2		\n" /* Restore the context. */
					"	ldr r1, [r3]					\n" /* Use pxCurrentTCBConst to get the pxCurrentTCB address. */
					"	ldr r0, [r1]					\n" /* The first item in pxCurrentTCB is the task top of stack. */
					"	ldmia r0!, {r4-r11, r14}		\n" /* Pop the registers that are not automatically saved on exception entry and the critical nesting count. */
					"	msr psp, r0						\n" /* Restore the task stack pointer. */
					"	isb								\n"
					"	mov r0, #0 						\n"
					"	msr	basepri, r0					\n"
					"	bx r14							\n"
					"									\n"
					"	.align 4						\n"
					"pxCurrentTCBConst2: .word pxCurrentTCB				\n"
				);
}
/*-----------------------------------------------------------*/

static void prvPortStartFirstTask( void )
{
	/* Start the first task.  This also clears the bit that indicates the FPU is
	in use in case the FPU was used before the scheduler was started - which
	would otherwise result in the unnecessary leaving of space in the SVC stack
	for lazy saving of FPU registers. */
	__asm volatile(
					" ldr r0, =0xE000ED08 	\n" /* Use the NVIC offset register to locate the stack. */
					" ldr r0, [r0] 			\n"
					" ldr r0, [r0] 			\n"
					" msr msp, r0			\n" /* Set the msp back to the start of the stack. */
					" mov r0, #0			\n" /* Clear the bit that indicates the FPU is in use, see comment above. */
					" msr control, r0		\n"
					" cpsie i				\n" /* Globally enable interrupts. */
					" cpsie f				\n"
					" dsb					\n"
					" isb					\n"
					" svc 0					\n" /* System call to start first task. */
					" nop					\n"
				);
}

I am at the limit of my knowledge here, but if I can find the address of the code which resulted in this trap I can put in breakpoints around that.

That 79F0 address is within the TCP/IP RTOS task, which is completely unsurprising, but the code for that is in the FLASH. The stuff at 79F0 is just a data+RTOS stack area. For example if a function running under an RTOS declares a variable, that variable, being stack-based as normal, will end up in this area.

I can leave this for a bit, otherwise I can restart it and see if it ends up in the same place.

Jeroen3 · « **Reply #22 on:** July 23, 2022, 03:36:36 pm »

Do you know what type of error caused it yet?
Have you checked the fault analyzer in CubeIDE yet? It will tell you exactly what everyone here is trying to access via the complicated assembly.

(Relying on IDEs too much lol.)

peter-h · « **Reply #23 on:** July 23, 2022, 04:12:58 pm »

0x100079f0 is within one of the RTOS stack areas, for task "TCP/IP". This is the whole RTOS stack space for that task (FreeRTOS fills its entire workspace (a sort of heap actually) with A5).
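Because of that A5 fill, the unused part of a task stack can be measured from a memory dump. A minimal sketch, assuming a descending stack (so untouched bytes sit at the lowest addresses) and a 0xA5 fill byte; `unused_stack_bytes` is a hypothetical name. On a running target, FreeRTOS offers uxTaskGetStackHighWaterMark() for the same purpose.

```c
#include <stdint.h>
#include <stddef.h>

/* Counts how many bytes at the low end of a task stack still hold the
   0xA5 fill pattern. stack_low points to the lowest address of the
   stack area, size is its length in bytes. */
size_t unused_stack_bytes(const uint8_t *stack_low, size_t size)
{
  size_t n = 0;
  while (n < size && stack_low[n] == 0xA5u)
    n++;
  return n;
}
```

A result of 0 means the lowest byte has been written, which makes an overflow into whatever lies below the stack quite likely. The count can be misleading if a task leaves a large local array untouched or genuinely stores 0xA5 bytes, so it is a hint rather than proof.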

Interestingly at 79F0 is 0x00000000, which looks like it overwrote the tail end of a MEM_SYS text string which was there before (which may not be relevant).

If this is repeatable I can do a watchpoint on 79F0 and 0x00000000.

The "unused stack" areas are much bigger than shown; I cropped it of necessity.

Jeroen3 · « **Reply #24 on:** July 23, 2022, 04:38:56 pm »

Have you enabled "halt on exception" in the debug startup settings yet? Then it breaks at the exact instruction causing the fault, with the context intact.
You can then look at the window with the function call trace (I forgot the name) to see how you got to that point, and what pointers were used to get there.
You should be able to click back in that trace to see more context of those functions and whether any pointers go to something they shouldn't. If that trace is gibberish, you've smashed the stack.

ataradov · « **Reply #25 on:** July 23, 2022, 04:46:19 pm »

It is not always possible to know the exact instruction. Some faults are imprecise. And based on the information in the post #13, there is no valid information saved, the fault is secondary to something else going on.

It is hard to suggest something useful when a lot of other people are suggesting other stuff.

Jeroen3 · « **Reply #26 on:** July 23, 2022, 04:51:12 pm »

In peter's last post the fault address points to the memory location of an assertion ready to be printed.

ataradov · « **Reply #27 on:** July 23, 2022, 04:57:37 pm »

There is assertion text in the memory, but I don't see any pointers that it is ready to be printed.

But if that is a suspicion in any way, then the assertion routine should be changed to while (1) and a breakpoint set there. But based on all the previous information, I don't think it is.

peter-h · « **Reply #28 on:** July 23, 2022, 05:29:12 pm »

Quote

Have you enabled "halt on exception" in the debug startup settings yet?

That is checked, but greyed-out.

It does halt on exceptions. That is how I got that data.

Quote

There is assertion text in the memory

What is "assertion text"?

Some of the STM libs use "assert..." all over the place. I don't use it.

cv007 · « **Reply #29 on:** July 23, 2022, 07:15:09 pm »

Quote

Some of the STM libs use "assert..." all over the place. I don't use it.

lwip has its own assert settings, under arch settings such as-

https://github.com/STMicroelectronics/stm32_mw_lwip/blob/master/system/arch/cc.h

and if this is the same as what you have, it would explain the assertion text in your mcu. I would assume that define could be changed to use the 'general' assert macro so you can enable/disable all asserts in one place. I didn't dig through all the defines to figure out if there is another way in lwip to disable its assertions, but it's possible this is the only place it can be done.

Quote

I added this to my existing HardFault-handler. In my production code the while(1) becomes replaced by a define, which is either an eternal loop for debugging or an immediate system restart request for production use.

Another option is to save the data in a set-aside section of ram, so you can still keep going (reset) but also have a little data to look at (check out any previous exception).

I have a section of ram set aside in the linker script to store debug data (so it's in a fixed location where the address is always known). In startup code an errorFunc is used to dump the stack registers to 'debugram', then the mcu is reset. The errorFunc address gets populated in all unused vectors (vectors in ram), so any unhandled exception results in a dump of some stack info plus a reset. When I get back to main, I can deal with the data if wanted. I only have a uart as debugger for this mcu (by choice), so when I get back to main I can dump out the data via uart. Not as good as stopping at the exception to take a look around, but it's better than nothing and can always be left 'enabled'. I have only used this a couple of times, and it did help, but any exceptions I create are usually simple errors in unaligned access (which are also the type where you already have a good idea where to look: your most recent code).

https://github.com/cv007/NUCLEO32_G031K8_B/blob/main/startup.cpp
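The save-then-reset idea can be sketched roughly as follows. This is not the code from the linked repository; the record layout, the magic value, the function names and the ".noinit" section mentioned in the comment are all assumptions. The magic word plus checksum lets main() tell a real record from whatever random content RAM holds after power-up.

```c
#include <stdint.h>
#include <stdbool.h>

#define FAULT_RECORD_MAGIC 0xFA17C0DEu  /* arbitrary marker value */

typedef struct {
  uint32_t magic;
  uint32_t r0, r1, r2, r3, r12, lr, pc, psr;  /* stacked exception frame */
  uint32_t cfsr, hfsr;
  uint32_t checksum;                          /* XOR of all fields above */
} fault_record_t;

/* On target this object would be placed in a RAM section that the startup
   code does not clear, e.g. with __attribute__((section(".noinit")));
   the section name depends on the linker script and is an assumption. */
static fault_record_t fault_record;

static uint32_t record_checksum(const fault_record_t *r)
{
  return r->magic ^ r->r0 ^ r->r1 ^ r->r2 ^ r->r3 ^ r->r12 ^
         r->lr ^ r->pc ^ r->psr ^ r->cfsr ^ r->hfsr;
}

/* Called from the fault handler with the 8-word stacked frame, just
   before requesting a reset. */
void fault_record_save(const uint32_t frame[8], uint32_t cfsr, uint32_t hfsr)
{
  fault_record.magic = FAULT_RECORD_MAGIC;
  fault_record.r0  = frame[0];
  fault_record.r1  = frame[1];
  fault_record.r2  = frame[2];
  fault_record.r3  = frame[3];
  fault_record.r12 = frame[4];
  fault_record.lr  = frame[5];
  fault_record.pc  = frame[6];
  fault_record.psr = frame[7];
  fault_record.cfsr = cfsr;
  fault_record.hfsr = hfsr;
  fault_record.checksum = record_checksum(&fault_record);
}

/* Called early in main(). Copies a valid record to out and returns true;
   in every case the stored record is invalidated so it is reported once. */
bool fault_record_take(fault_record_t *out)
{
  bool valid = fault_record.magic == FAULT_RECORD_MAGIC &&
               fault_record.checksum == record_checksum(&fault_record);
  if (valid)
    *out = fault_record;
  fault_record.magic = 0;
  return valid;
}
```

After a reset, main() would call fault_record_take() once and, if it returns true, send the fields out over the uart or log them.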

nctnico · « **Reply #30 on:** July 23, 2022, 07:43:21 pm »

Quote from: ataradov on July 22, 2022, 10:52:35 pm

Here is an example of an instrumented HF handler:

Another option is to get the stackpointer (for example using: asm("mov %0, sp \n\t" : "=r"(reg_sp) ); ) and print the values from the stackpointer and up using formatting routines that write data to a UART directly. When the hardfault handler is entered, the registers are pushed onto the stack as well. I typically have this method print the last 12 stack entries (32 bit words) and that allows me to get a good idea where the hardfault occurred (RAM and ROM addresses used).
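A printf-free dump routine along those lines might look like the sketch below. The character output is passed in as a callback so the same code can sit on top of any blocking UART write; `capture_putc` is only a host-side stand-in for such a routine, and all names here are invented.

```c
#include <stdint.h>
#include <stddef.h>

typedef void (*putc_fn)(char c);

/* Host-side stand-in for a blocking UART write; on target this would
   poll the transmit register instead of filling a buffer. */
static char   capture_buf[256];
static size_t capture_len;

static void capture_putc(char c)
{
  if (capture_len + 1 < sizeof capture_buf)
    capture_buf[capture_len++] = c;
}

/* Emits v as 8 lowercase hex digits, no heap and no printf involved. */
static void put_hex32(putc_fn put, uint32_t v)
{
  static const char digits[] = "0123456789abcdef";
  for (int shift = 28; shift >= 0; shift -= 4)
    put(digits[(v >> shift) & 0xFu]);
}

/* Prints n_words 32-bit words starting at sp, four per line. */
void dump_stack_words(putc_fn put, const uint32_t *sp, size_t n_words)
{
  for (size_t i = 0; i < n_words; i++) {
    put_hex32(put, sp[i]);
    put((i % 4 == 3 || i + 1 == n_words) ? '\n' : ' ');
  }
}
```

In a fault handler the call would be something like `dump_stack_words(uart_putc, (const uint32_t *)reg_sp, 12)`, where `uart_putc` is whatever polling output routine the project already has (a hypothetical name here).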

peter-h · « **Reply #31 on:** July 23, 2022, 07:58:53 pm »

Cube can set a watchpoint on a Write to an address but there seems to be no way to specify data=0x00000000. Unless the Condition box does that but I cannot find any documentation on it.

uer166 · « **Reply #32 on:** July 23, 2022, 08:07:26 pm »

Are you by any chance using printf() with float format with FreeRTOS? I've had a similar issue, printf("%f") calls malloc() which happens to be broken on ST's provided FreeRTOS.

peter-h · « **Reply #33 on:** July 23, 2022, 08:19:49 pm »

Yes, but see above. I tested this with a breakpoint on malloc(). Only known code hits it.

I am using newlib-nano (no idea of why this config was chosen; someone else set this project up 2-3 years ago and I started working on it 1-2 years ago)

There is stuff on google about newlib and such e.g. https://nadler.com/embedded/newlibAndFreeRTOS.html but I have found zero evidence of %f using malloc(). Unless it is calling a different malloc()

There is a FreeRTOS malloc() and then there is an LWIP malloc() (called mem_malloc). But why should a GNU C printf be calling those?

I do have one more idea up my sleeve:
https://community.st.com/s/question/0D50X0000BOtfhnSQB/how-to-make-ethernet-and-lwip-working-on-stm32
and the first bold link there (the need for __DMB in various places in the low level ETH DMA code). I am just not sure where to do this and the writer "Piranha" is not generally contactable. He does know absolutely everything though - if you can get him to tell you.

I can find the ETH DMA "OWN" bit ops, but not sure about the rest. If that is the cause, it will never be found by debugging... This fix does need to be done.

Finally, the RTOS thread which causes this crash is a simple http server, which has a 1Hz auto refresh on one of the pages. With that page not up, the server is of course dormant, and no crashing happens. The issue could of course be elsewhere (e.g. this task is the only one calling the LWIP netconn API) but this task does have a printf %f in it. I can easily remove that and I will do that tomorrow.

uer166 · « **Reply #34 on:** July 23, 2022, 08:30:29 pm »

Quote from: peter-h on July 23, 2022, 08:19:49 pm

But why should a GNU C printf be calling those?

printf() calls malloc if the format includes a floating point variable. It doesn't normally on ints, strings, hex, etc.

A quick test might be to uncheck that "use float with printf" options and see if you get a different failure, if any.

peter-h · « **Reply #35 on:** July 23, 2022, 08:42:51 pm »

Quote

printf() calls malloc if the format includes a floating point variable

Is that a definite in GCC? I know for a fact that it is not mandatory because I have programmed in C nearly 30 years ago (Z180) and had the printf sources (Hitech Z180 compiler) and they didn't do any of that. One didn't have heaps in those days

My feeling, with due respect, is that this is another "internet myth".

Unless this printf library is somehow integrated with FreeRTOS and uses its malloc? That malloc is supposed to be mutexed. AFAIK nothing in my project, other than FR, uses that heap. That heap lives entirely within the FR memory block, and is used to allocate blocks to the various tasks. This is my graphical viewer for the 64k CCM block used for this:

Yellow=unused Green=task stack
$100 on freelancer.com to get that written

It picks up a 64k file which I generate.

What is the meaning of "use float with printf"? IIRC, if you uncheck these, you get smaller libs but you can't use %f.

SiliconWizard · « **Reply #36 on:** July 23, 2022, 09:20:16 pm »

This is not GCC per se, this is newlib.

You can have a look at the source code: https://sourceware.org/git/?p=newlib-cygwin.git;a=tree;f=newlib/libc/stdio;h=0f5e4dd0dc465029d0b6c0a5d03fc2cc70e8df87;hb=HEAD
(it'll take a while though as it's a cascade of function calls and conditional compilation.)

Unless FreeRTOS redefines printf() and the like, those functions will come from newlib, and *can* call malloc() in some situations. Not the only functions from the std lib that can do this either. IIRC, strtok(), for instance, will also call malloc().

While it's a common conception that newlib in 'nano' mode will call malloc() from printf() calls only if you enable float format, I can't guarantee you for sure this is the only case.
The official doc is here: https://sourceware.org/newlib/libc.html
but I haven't found much info on the above point there.

At the moment, I don't know where the newlib documentation about such details as the exact differences between nano and non-nano, what calls malloc(), and so on, is. I mean, the official doc. There are numerous blog articles, forum threads, and so on, about that, but I can't find the official doc, which is why I kinda had to "reverse-engineer" this reading the source code and looking at what functions are linked in the final object code. If anyone can point you(/us) to any such official doc, that'll probably be helpful. Otherwise, as you said, unless you dig yourself, you never really know if the info you find is true or if it's a myth.

peter-h · « **Reply #37 on:** July 23, 2022, 09:41:52 pm »

Hmmm, you guys are not wrong! I traced through this (got no sources for sprintf)

sprintf(adcString, "ADC1: %d +5V rail: %4.2fV ADC2: %d +3.3V rail: %4.2fV", adc1,adcv1,adc2,adcv2);

But the malloc didn't actually get called. It may be conditional on some config option. I will take another look tomorrow morning.

In the .map file

Code:

 .text._malloc_r
                0x0000000008042728      0x478 c:/st/stm32cubeide_1.10.1/stm32cubeide/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.10.3-2021.10.win32_1.0.0.202111181127/tools/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard\libc.a(lib_a-mallocr.o)
                0x0000000008042728                _malloc_r
 .text.memcmp   0x0000000008042ba0       0x20 c:/st/stm32cubeide_1.10.1/stm32cubeide/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.10.3-2021.10.win32_1.0.0.202111181127/tools/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard\libc.a(lib_a-memcmp.o)
                0x0000000008042ba0                memcmp
 .text.memcpy   0x0000000008042bc0       0x1c c:/st/stm32cubeide_1.10.1/stm32cubeide/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.10.3-2021.10.win32_1.0.0.202111181127/tools/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard\libc.a(lib_a-memcpy-stub.o)
                0x0000000008042bc0                memcpy
 .text.memmove  0x0000000008042bdc       0x34 c:/st/stm32cubeide_1.10.1/stm32cubeide/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.10.3-2021.10.win32_1.0.0.202111181127/tools/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard\libc.a(lib_a-memmove.o)
                0x0000000008042bdc                memmove
 .text.memset   0x0000000008042c10       0x10 c:/st/stm32cubeide_1.10.1/stm32cubeide/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.10.3-2021.10.win32_1.0.0.202111181127/tools/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard\libc.a(lib_a-memset.o)
                0x0000000008042c10                memset
 .text.__malloc_lock
                0x0000000008042c20        0xc c:/st/stm32cubeide_1.10.1/stm32cubeide/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.10.3-2021.10.win32_1.0.0.202111181127/tools/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard\libc.a(lib_a-mlock.o)
                0x0000000008042c20                __malloc_lock
 .text.__malloc_unlock
                0x0000000008042c2c        0xc c:/st/stm32cubeide_1.10.1/stm32cubeide/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.10.3-2021.10.win32_1.0.0.202111181127/tools/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard\libc.a(lib_a-mlock.o)
                0x0000000008042c2c                __malloc_unlock
 .text.printf   0x0000000008042c38       0x24 c:/st/stm32cubeide_1.10.1/stm32cubeide/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.10.3-2021.10.win32_1.0.0.202111181127/tools/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard\libc.a(lib_a-printf.o)
                0x0000000008042c38                printf
 .text.putchar  0x0000000008042c5c       0x10 c:/st/stm32cubeide_1.10.1/stm32cubeide/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.10.3-2021.10.win32_1.0.0.202111181127/tools/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard\libc.a(lib_a-putchar.o)
                0x0000000008042c5c                putchar
 .text.rand     0x0000000008042c6c       0x38 c:/st/stm32cubeide_1.10.1/stm32cubeide/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.10.3-2021.10.win32_1.0.0.202111181127/tools/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard\libc.a(lib_a-rand.o)
                0x0000000008042c6c                rand
 .text._sbrk_r  0x0000000008042ca4       0x20 c:/st/stm32cubeide_1.10.1/stm32cubeide/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.10.3-2021.10.win32_1.0.0.202111181127/tools/bin/../lib/gcc/arm-none-eabi/10.3.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard\libc.a(lib_a-sbrkr.o)

and the disassembly is

Code: [Select]

          _malloc_r:
08042728:   add.w   r3, r1, #11
0804272c:   cmp     r3, #22
0804272e:   stmdb   sp!, {r0, r1, r2, r4, r5, r6, r7, r8, r9, r10, r11, lr}
08042732:   mov     r5, r0
08042734:   bls.n   0x8042744 <_malloc_r+28>
08042736:   bics.w  r7, r3, #7
0804273a:   bpl.n   0x8042746 <_malloc_r+30>
0804273c:   movs    r3, #12
0804273e:   str     r3, [r5, #0]
08042740:   movs    r4, #0
08042742:   b.n     0x8042a90 <_malloc_r+872>
08042744:   movs    r7, #16
08042746:   cmp     r1, r7
08042748:   bhi.n   0x804273c <_malloc_r+20>
0804274a:   mov     r0, r5
0804274c:   bl      0x8042c20 <__malloc_lock>
08042750:   cmp.w   r7, #504        ; 0x1f8
08042754:   ldr     r6, [pc, #704]  ; (0x8042a18 <_malloc_r+752>)
08042756:   bcs.n   0x80427c8 <_malloc_r+160>
08042758:   add.w   r2, r7, #8
0804275c:   add     r2, r6
0804275e:   sub.w   r1, r2, #8
08042762:   ldr     r4, [r2, #4]
08042764:   cmp     r4, r1
08042766:   mov.w   r3, r7, lsr #3
0804276a:   bne.n   0x8042772 <_malloc_r+74>
0804276c:   ldr     r4, [r2, #12]
0804276e:   cmp     r2, r4
08042770:   beq.n   0x8042794 <_malloc_r+108>
08042772:   ldr     r3, [r4, #4]
08042774:   ldrd    r1, r2, [r4, #8]
08042778:   bic.w   r3, r3, #3
0804277c:   str     r2, [r1, #12]
0804277e:   add     r3, r4
08042780:   str     r1, [r2, #8]
08042782:   ldr     r2, [r3, #4]
08042784:   orr.w   r2, r2, #1
08042788:   str     r2, [r3, #4]
0804278a:   mov     r0, r5
0804278c:   bl      0x8042c2c <__malloc_unlock>
08042790:   adds    r4, #8
08042792:   b.n     0x8042a90 <_malloc_r+872>
08042794:   adds    r3, #2
08042796:   ldr     r4, [r6, #16]
08042798:   ldr     r1, [pc, #640]  ; (0x8042a1c <_malloc_r+756>)
0804279a:   cmp     r4, r1
0804279c:   beq.n   0x804288e <_malloc_r+358>
0804279e:   ldr     r2, [r4, #4]
080427a0:   bic.w   r12, r2, #3
080427a4:   sub.w   r0, r12, r7
080427a8:   cmp     r0, #15
080427aa:   ble.n   0x804283e <_malloc_r+278>
080427ac:   adds    r2, r4, r7
080427ae:   orr.w   r3, r0, #1
080427b2:   orr.w   r7, r7, #1
080427b6:   str     r7, [r4, #4]
080427b8:   strd    r2, r2, [r6, #16]
080427bc:   strd    r1, r1, [r2, #8]
080427c0:   str     r3, [r2, #4]
080427c2:   str.w   r0, [r4, r12]
080427c6:   b.n     0x804278a <_malloc_r+98>
080427c8:   lsrs    r3, r7, #9
080427ca:   beq.n   0x8042822 <_malloc_r+250>
080427cc:   cmp     r3, #4
080427ce:   bhi.n   0x80427f6 <_malloc_r+206>
080427d0:   lsrs    r3, r7, #6
080427d2:   adds    r3, #56 ; 0x38
080427d4:   adds    r2, r3, #1
080427d6:   add.w   r2, r6, r2, lsl #3
080427da:   sub.w   r12, r2, #8
080427de:   ldr     r4, [r2, #4]
080427e0:   cmp     r4, r12
080427e2:   beq.n   0x80427f2 <_malloc_r+202>
080427e4:   ldr     r2, [r4, #4]
080427e6:   bic.w   r2, r2, #3
080427ea:   subs    r0, r2, r7
080427ec:   cmp     r0, #15
080427ee:   ble.n   0x804282a <_malloc_r+258>
080427f0:   subs    r3, #1
080427f2:   adds    r3, #1
080427f4:   b.n     0x8042796 <_malloc_r+110>
080427f6:   cmp     r3, #20
080427f8:   bhi.n   0x80427fe <_malloc_r+214>
080427fa:   adds    r3, #91 ; 0x5b
080427fc:   b.n     0x80427d4 <_malloc_r+172>
080427fe:   cmp     r3, #84 ; 0x54
08042800:   bhi.n   0x8042808 <_malloc_r+224>
08042802:   lsrs    r3, r7, #12
08042804:   adds    r3, #110        ; 0x6e
08042806:   b.n     0x80427d4 <_malloc_r+172>
08042808:   cmp.w   r3, #340        ; 0x154
0804280c:   bhi.n   0x8042814 <_malloc_r+236>
0804280e:   lsrs    r3, r7, #15
08042810:   adds    r3, #119        ; 0x77
08042812:   b.n     0x80427d4 <_malloc_r+172>
08042814:   movw    r2, #1364       ; 0x554
08042818:   cmp     r3, r2
0804281a:   bhi.n   0x8042826 <_malloc_r+254>
0804281c:   lsrs    r3, r7, #18
0804281e:   adds    r3, #124        ; 0x7c
08042820:   b.n     0x80427d4 <_malloc_r+172>
08042822:   movs    r3, #63 ; 0x3f
08042824:   b.n     0x80427d4 <_malloc_r+172>
08042826:   movs    r3, #126        ; 0x7e
08042828:   b.n     0x80427d4 <_malloc_r+172>
0804282a:   cmp     r0, #0
0804282c:   ldr     r1, [r4, #12]
0804282e:   blt.n   0x804283a <_malloc_r+274>
08042830:   ldr     r3, [r4, #8]
08042832:   str     r1, [r3, #12]
08042834:   str     r3, [r1, #8]
08042836:   adds    r3, r4, r2
08042838:   b.n     0x8042782 <_malloc_r+90>
0804283a:   mov     r4, r1
0804283c:   b.n     0x80427e0 <_malloc_r+184>
0804283e:   cmp     r0, #0
08042840:   strd    r1, r1, [r6, #16]
08042844:   blt.n   0x8042856 <_malloc_r+302>
08042846:   add     r12, r4
08042848:   ldr.w   r3, [r12, #4]
0804284c:   orr.w   r3, r3, #1
08042850:   str.w   r3, [r12, #4]
08042854:   b.n     0x804278a <_malloc_r+98>
08042856:   cmp.w   r12, #512       ; 0x200
0804285a:   ldr     r0, [r6, #4]
0804285c:   bcs.w   0x804298c <_malloc_r+612>
08042860:   mov.w   r2, r12, lsr #3
08042864:   mov.w   lr, r12, lsr #5
08042868:   mov.w   r12, #1
0804286c:   adds    r2, #1

This is some sort of mutex protected malloc.

God knows where this one came from.

This is the malloc_lock

Code: [Select]

          __malloc_lock:
08042c20:   ldr     r0, [pc, #4]    ; (0x8042c28 <__malloc_lock+8>)
08042c22:   b.w     0x8048358 <__retarget_lock_acquire_recursive>
08042c26:   nop     
08042c28:   stmia   r2!, {r0, r3, r7}
08042c2a:   movs    r0, #0
          __malloc_unlock:
08042c2c:   ldr     r0, [pc, #4]    ; (0x8042c34 <__malloc_unlock+8>)
08042c2e:   b.w     0x804835a <__retarget_lock_release_recursive>
08042c32:   nop     
08042c34:   stmia   r2!, {r0, r3, r7}
08042c36:   movs    r0, #0

Does anyone recognise this stuff??

uer166 · « **Reply #38 on:** July 23, 2022, 11:13:23 pm »

Quote from: peter-h on July 23, 2022, 08:42:51 pm

My feeling, with due respect, is that this is another "internet myth".

That may be your feeling, but like I said, I chased down this specific issue before on STM32G474, and we eventually gave up and just re-implemented a printFloat() on our own. Not sure if it's been posted before or not: https://community.st.com/s/question/0D50X0000BB1eL7SQJ/bug-cubemx-freertos-projects-corrupt-memory.

Of course, this issue may have nothing to do with this, but I can guarantee you that using printf(%f) and FreeRTOS together in ST's ecosystem does not work.

cv007 · « **Reply #39 on:** July 24, 2022, 12:15:01 am »

Quote

But the malloc didn't actually get called. It may be conditional on some config option.

This seems to be what you are showing-
https://sourceware.org/git?p=newlib-cygwin.git;a=blob;f=newlib/libc/stdio/vfprintf.c;h=6a198e2c657e8cf44b720c8bec76b1121921a42d;hb=HEAD#l866

Ending up in _svfprintf_r via a starting point of sprintf, the __SMBF FILE struct flag is not set so malloc is then skipped.

This suggestion may be worthless- but it seems lwip is your latest code addition, and you have lwip assert enabled as seen by the assert strings in memory, so maybe disable the lwip assert to see what changes. Maybe an assert is taking place, and the resulting printf originating within lwip code is causing some problem. Not necessarily a great way to go about it, but a few pokes of the patient to see how they react is sometimes useful. Alternatively- make a lwip assert fail to see if the lwip assert printf actually works correctly.

peter-h · « **Reply #40 on:** July 24, 2022, 08:00:36 am »

Got out of bed and ... it has crashed again. Ran for 5hrs. So it isn't simply that printf. Same trace data as before.

I put a breakpoint on the _malloc_r and found it gets called right away, in this

Code: [Select]

ck_1 = HAL_RCC_GetPCLK1Freq();
ck_2 = HAL_RCC_GetPCLK2Freq();
printf("PCLK1=%ld  PCLK2=%ld\n",ck_1,ck_2);

which is nothing to do with %f. That, btw, uses a printf which has had its output redirected to the SWD console so you can send debug output to the debugger window at high speed. It does sort of work... but I will remove it now

There is a mutex passed through, fwiw.

So this printf uses malloc for just about everything! But whose heap?

Anyway, not sure what I can do about that. Obviously it is not thread safe. I probably need to mutex printf sprintf but then mutexes are not accessible until osKernelInitialize().

I did a fixed _sbrk for the general heap malloc but if this _malloc_r is getting a wrong value, that will be broken. Where is it placing the heap? The source - thanks cv007 - says

Supporting OS subroutines required: <<close>>, <<fstat>>, <<isatty>>,
<<lseek>>, <<read>>, <<sbrk>>, <<write>>.

I fixed sbrk (and I see _sbrk is at the same address) a long time ago. But that source does not reference sbrk so where does this printf place its heap? The initial call to _malloc_r has this register content

Code: [Select]

General Registers		General Purpose and FPU Register Group	
	r0	0x20000598 (Hex)		
	r1	0x400 (Hex)		
	r2	0x1 (Hex)		
	r3	0x400 (Hex)		
	r4	0x200008ec (Hex)		
	r5	0x800 (Hex)		
	r6	0x20000598 (Hex)		
	r7	0x2001fd90 (Hex)		
	r8	0x80086cc (Hex)		
	r9	0x200008ec (Hex)		
	r10	0x20000598 (Hex)		
	r11	0x0 (Hex)		
	r12	0x0 (Hex)		
	sp	0x2001fb38 (Hex)		
	lr	0x8048407 (Hex)		
	pc	0x8042728 (Hex)		
	xpsr	0x1000000 (Hex)		
	d0	0x0 (Hex)		
	d1	0x0 (Hex)		
	d2	0x0 (Hex)		
	d3	0x0 (Hex)		
	d4	0x0 (Hex)

and while I don't know which registers are used for the parameters, I would bet it is allocating 0x400 (R1) or 0x800 (R5) which is 1k or 2k!
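For what it's worth, the ARM procedure call standard (AAPCS) passes the first four integer or pointer arguments in r0 to r3, and newlib declares `_malloc_r(struct _reent *, size_t)`. On that basis r0 (0x20000598) is the reentrancy structure and r1 (0x400) is the request: 1024 bytes. That matches newlib's default `BUFSIZ`, so one plausible reading, not confirmed anywhere in this thread, is that the first `printf` is allocating the stdout buffer. A minimal sketch under that assumption, which hands stdout a static buffer before any output so stdio has no reason to allocate one (`stdout_buf` and `use_static_stdout_buffer` are names made up here):

```c
#include <stdio.h>

/* Hypothetical static buffer; 1024 matches the 0x400 seen in r1. */
static char stdout_buf[1024];

/* Give stdout a caller-owned, line-buffered buffer.
   Returns 0 on success, exactly as setvbuf() does.
   Must run before the first output on stdout. */
int use_static_stdout_buffer(void)
{
    return setvbuf(stdout, stdout_buf, _IOLBF, sizeof stdout_buf);
}
```

Call it once, early in main() and before the first printf. If the 1 KB malloc on the first printf goes away, the assumption was right; if not, the lr value at the breakpoint should show who is really asking.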

It calls _sbrk_r

and the source for that is

which calls _sbrk. Funny how arm implements a RET... they pop the PC

The funny thing is that this heap seems to get discarded when printf returns. Or does it? Does the first call to "printf" grab a block? It would need to use some global variables which get preserved. Normally you don't need to call sbrk to do a malloc, but you do need to call it when creating a new heap, on an existing heap.

Then it gets more complicated. I am tracing calls to sbrk, to see where this "printf heap" is going, and I find calls to the normal malloc is calling _malloc_r !!!! So we come full circle

and I see no calls to _free_r (well not until TLS is running and I know about that one; it gets a 48k block which is freed when the https session ends, about once a minute).

Clearly the printf family is using the general heap for every call. And it never calls free(). That means it must be storing a pointer to its block somewhere. But it still does a malloc call on every subsequent use.

A key question is whether this heap is really mutex protected. It looks like it is. Does anyone recognise this

A google on __retarget_lock_acquire_recursive shows that this is indeed an empty function, and this
https://gist.github.com/thomask77/65591d78070ace68885d3fd05cdebe3a
describes the right code for that - using one of the FreeRTOS mutexes.

So this newlib printf has not been correctly implemented on this system.

This
https://www.freertos.org/FreeRTOS_Support_Forum_Archive/May_2017/freertos_printf_and_heap_4_f2b0ee0cj.html
suggests there is a solution, with FreeRTOS having a proper printf printf-stdarg.c. This is referenced in comments in the FR files but it says it is an incomplete version (no float output for example).

I think I need a new printf which does not use the heap except for %f.

I still don't know if this is the cause of the crashing but it is a good candidate.

I unchecked the newlib-nano checkboxes and it still calls malloc for the integer output, and it calls those empty mutexes...

But I found something

The 1st printf calls malloc but the 2nd one doesn't. Same for calling __retarget_lock_init_recursive. So maybe the heap (and its attempted protection) is used for %f and for longs only.

How can one make those empty mutex functions call proper ones, given that a) I don't have the C source and b) they are presumably not defined as weak?

emece67 · « **Reply #41 on:** July 24, 2022, 09:42:03 am »

emece67 · « **Reply #42 on:** July 24, 2022, 09:47:53 am »

peter-h · « **Reply #43 on:** July 24, 2022, 11:08:52 am »

It is more complicated.

Both the default ST Cube printf and the newlib-nano alternative do this:

1) Uses the heap for floats and longs
2) Uses malloc in the initial call to a printf, and uses it again for the initial call to a sprintf, and does not use it after that
3) Calls _sbrk only once (at the very first printf, but not at the very first sprintf)

Now why would a printf call _sbrk? It is not needed for just using the heap. For that you call malloc(). _sbrk is used only by malloc itself to check if the requested block will fit. Possibly the newlib stuff gets a block via malloc and then runs its own heap within that, but I don't really get it.
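The usual division of labour, as far as I understand it: `malloc` manages free lists inside a region, and `sbrk` is only how that region grows. The very first allocation finds the region empty and has to call `_sbrk`; an allocator will typically ask for more than the immediate request, so later small allocations are served without another call. That would fit `_sbrk` being hit only once. A toy model of the `sbrk` side, assuming a fixed 4 KB region (`demo_heap`, `demo_sbrk` and the size are invented for illustration; a real `_sbrk` works from linker symbols instead of an array):

```c
#include <stddef.h>

#define DEMO_HEAP_SIZE 4096u            /* hypothetical region size */

static unsigned char demo_heap[DEMO_HEAP_SIZE];
static size_t demo_brk;                 /* bytes handed out so far */

/* Same contract as sbrk(): move the break by incr bytes and return the
   OLD break, or (void *)-1 if the move would leave the region. */
void *demo_sbrk(ptrdiff_t incr)
{
    unsigned char *old = demo_heap + demo_brk;

    if (incr < 0) {
        size_t dec = (size_t)0 - (size_t)incr;  /* magnitude of incr */
        if (dec > demo_brk) {
            return (void *)-1;                  /* would go below the start */
        }
        demo_brk -= dec;
    } else {
        if ((size_t)incr > DEMO_HEAP_SIZE - demo_brk) {
            return (void *)-1;                  /* would run past the end */
        }
        demo_brk += (size_t)incr;
    }
    return old;
}
```

A breakpoint on the real `_sbrk` showing the increment it is asked for would tell you how big a chunk the allocator grabs on that first call.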

Bad coding really; no need for the heap for these little blocks of RAM. They could use the stack and it would be re-entrant, etc.

Does anyone know what is supposed to be the difference in ST Cube between newlib-nano and the default? Nano is supposed to be smaller, but there is stuff like this
https://stackoverflow.com/questions/32948032/newlib-nano-long-long-support
stating that it does not support long-long, which is definitely incorrect.

I am now implementing this lot
https://gist.github.com/thomask77/3a2d54a482c294beec5d87730e163bdd
which is implementing the mutexes used by newlib (basically the printf family AFAICT).

I am getting lots of linker errors
multiple definition of `__retarget_lock_try_acquire'; ./src/heap_locking.o:
and I don't know how to fix this since the original symbols are defined in code to which there are no sources, the functions are not "weak", and the symbols are already in the symbol table. Is there some sort of override which one can put in the .c file where the new functions are, and which will get fed through to the linker? Setting up the linker options in Cube is complicated. I have asked the Q in the Programming forum.
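A note on those multiple-definition errors, from my recollection of the comments in newlib's `libc/misc/lock.c`, so please check it against your newlib version: when newlib is built with retargetable locking, the dummy lock routines and the static lock objects are meant to be replaced as a complete set. If only some are provided, the linker still pulls in the library's dummy object for the rest, and the ones you did define collide with it. The skeleton below lists the set as I remember it. The bodies are stand-ins that only count nesting so it compiles on a PC; on the target each would wrap a FreeRTOS recursive mutex, and the `_LOCK_T` typedef would come from `<sys/lock.h>`.

```c
/* On the target the typedef comes from <sys/lock.h>; struct __lock is
   yours to define. This layout is a stand-in for an RTOS mutex handle. */
struct __lock { int depth; };
typedef struct __lock *_LOCK_T;

/* Static locks newlib refers to (list from memory, verify per version). */
struct __lock __lock___sinit_recursive_mutex;
struct __lock __lock___sfp_recursive_mutex;
struct __lock __lock___atexit_recursive_mutex;
struct __lock __lock___at_quick_exit_mutex;
struct __lock __lock___malloc_recursive_mutex;
struct __lock __lock___env_recursive_mutex;
struct __lock __lock___tz_mutex;
struct __lock __lock___dd_hash_mutex;
struct __lock __lock___arc4random_mutex;

/* Stand-in for dynamically created locks; a real version allocates one
   mutex per call and stores it in *lock. */
static struct __lock demo_dynamic_lock;

void __retarget_lock_init(_LOCK_T *lock)            { *lock = &demo_dynamic_lock; }
void __retarget_lock_init_recursive(_LOCK_T *lock)  { *lock = &demo_dynamic_lock; }
void __retarget_lock_close(_LOCK_T lock)            { (void)lock; }
void __retarget_lock_close_recursive(_LOCK_T lock)  { (void)lock; }

/* Take: on the target, block on the RTOS mutex here. */
void __retarget_lock_acquire(_LOCK_T lock)           { lock->depth++; }
void __retarget_lock_acquire_recursive(_LOCK_T lock) { lock->depth++; }

/* Try-take: this stand-in always succeeds and reports 1. */
int __retarget_lock_try_acquire(_LOCK_T lock)           { lock->depth++; return 1; }
int __retarget_lock_try_acquire_recursive(_LOCK_T lock) { lock->depth++; return 1; }

/* Give: on the target, release the RTOS mutex here. */
void __retarget_lock_release(_LOCK_T lock)           { lock->depth--; }
void __retarget_lock_release_recursive(_LOCK_T lock) { lock->depth--; }
```

If the link still reports duplicates with the whole set present, the .map file should show which library object is being pulled in and by which symbol.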

Mutex protection on the heap should take care of the newlib heap usage issue. Well, printf itself may still not be thread-safe in which case I have a bigger problem, because that is needed.

But malloc and free themselves are still unprotected AFAICT, if called directly.

Whether the above stuff is causing the crashes, I don't know, but it needs to be fixed.

Thanks to everyone for your continued help.

peter-h · « **Reply #44 on:** July 24, 2022, 04:12:11 pm »

I was puzzled why both newlib-nano and standard produced the same behaviour. Later I found the code size doesn't change! Turns out those checkboxes in Cube are BS.

You have to also select one of these, and it was always "Standard C". If you do that, the nano option does nothing. Obvious, I suppose, to those who know.

Anyway, this means all the stuff around the net about Newlib-nano being a load of crap doesn't need to worry me. I am dealing with the standard C lib, and that is what uses malloc for floats (promoted to double, per standard C printf), longs, and long longs. That also explains why %llu etc. has been working; it isn't supposed to work on newlib-nano.

Still have to do all those mutexes...

cv007 · « **Reply #45 on:** July 24, 2022, 05:38:29 pm »

Quote

Bad coding really; no need for the heap for these little blocks of RAM. They could use the stack and it would be re-entrant, etc.

As shown in a few of your disassembly screenshots, they also use a big chunk of stack space (316 bytes) in addition to whatever else they are doing. Not sure what freertos has in place for stack checking, but probably would be nice to know how all the stack spaces are doing.

peter-h · « **Reply #46 on:** July 24, 2022, 06:00:51 pm »

I am happy with RTOS stacks; I have that graphical monitoring tool for that and have almost 64k of RTOS stacks to play with for the whole project.

Funny tracing through that printf code. First it calls __retarget_lock_acquire_recursive, then it calls _malloc_r, and that calls __retarget_lock_acquire_recursive.

Normally this would not work but I guess a "recursive" mutex can be nested like that, and still works at each depth. I would have used one mutex for the printf and another one for the heap, but maybe printf calls __retarget_lock_acquire_recursive several times without closing each one...
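On the nesting: the acquire function is shared, but which lock is taken comes in as the argument. In the `__malloc_lock` listing earlier in this thread, `ldr r0, [pc, #4]` loads a lock address before the branch, so stdio and the heap may well be passing different lock objects to the same routine. Either way, a recursive mutex tolerates nesting because it tracks an owner and a depth. A bookkeeping model of that (not a real mutex, not itself thread-safe, and `rec_lock_t` plus the task ids are invented for illustration):

```c
#include <stdbool.h>

typedef struct {
    int owner;   /* id of the task holding the lock, -1 when free */
    int depth;   /* how many times the owner has taken it */
} rec_lock_t;

#define REC_LOCK_INIT { -1, 0 }

/* Succeeds if the lock is free or already held by the same task. */
bool rec_lock_try_acquire(rec_lock_t *l, int task)
{
    if (l->depth > 0 && l->owner != task) {
        return false;            /* held by someone else */
    }
    l->owner = task;
    l->depth++;
    return true;
}

/* Only the owner may release; the lock frees up when depth reaches 0. */
bool rec_lock_release(rec_lock_t *l, int task)
{
    if (l->depth == 0 || l->owner != task) {
        return false;
    }
    if (--l->depth == 0) {
        l->owner = -1;
    }
    return true;
}
```

With one task taking the lock twice (printf, then `_malloc_r` inside it), a second task is kept out until both releases have happened.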

That function __retarget_lock_acquire_recursive is empty as described and that is what I am trying to fix with my stuff above. Need to override the existing symbols...

On past record, it would not surprise me if there was some build option which, when #defined, magically joins this all up with the FreeRTOS mutexes (like my posts above), but I can't see it. For the newlib-nano stuff there is configUSE_NEWLIB_REENTRANT but that creates other problems and it looked like a huge rabbit hole and anyway I do need uint32_t and 64_t printf and scanf support (%f one can avoid, at a push).

jc101 · « **Reply #47 on:** July 24, 2022, 09:06:20 pm »

I had the printf problem with FreeRTOS, but on a PIC32. I would get a crash when using printf to format a float; my fix was to use a define to map malloc to the FreeRTOS pvPortMalloc etc., which are thread safe.
This made all (albeit limited) malloc calls use the FreeRTOS heap, which solved the problem for me. I was using heap 4 for my project; I think heap 3 does this for you, but you don't get the benefit of heap 4. I also had to ensure the task calling printf had enough memory on its stack, or the task would fault if it was swapped out while in the printf.

Code: [Select]

#define malloc(size) pvPortMalloc(size)
#define free(ptr) vPortFree(ptr)

peter-h · « **Reply #48 on:** July 24, 2022, 09:21:46 pm »

You probably had sources to your pic32 printf. Then you could put those #defines in there.

I could also fix the malloc and free (to which I also don't have sources) by wrapping them in mutexes (just one mutex actually because you cannot have malloc and free occurring concurrently) and calling them say m_malloc and m_free, and then have a file heap.h with similar #defines in it. Or I could replace the malloc code altogether; there is a ton of heap code out there.
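A sketch of that wrapper idea, assuming one lock shared by both calls. `heap_lock()` and `heap_unlock()` are hypothetical hooks; on the target they would take and give a single RTOS mutex, and here they only count so the code runs on a PC. Worth noting the limit of the approach: calls made inside the prebuilt library (printf reaching `_malloc_r` directly) never go through `m_malloc`, so this only covers your own code.

```c
#include <stdlib.h>

/* Hypothetical lock hooks; replace the bodies with one RTOS mutex
   take/give pair on the target. The counters exist only for checking. */
int heap_lock_depth;   /* should be 0 whenever no call is in progress */
int heap_lock_takes;   /* total number of times the lock was taken */

void heap_lock(void)   { heap_lock_depth++; heap_lock_takes++; }
void heap_unlock(void) { heap_lock_depth--; }

/* malloc guarded by the shared heap lock. */
void *m_malloc(size_t size)
{
    heap_lock();
    void *p = malloc(size);
    heap_unlock();
    return p;
}

/* free guarded by the same lock, so it cannot overlap with m_malloc. */
void m_free(void *ptr)
{
    heap_lock();
    free(ptr);
    heap_unlock();
}
```

The heap.h mentioned above would then carry `#define malloc(size) m_malloc(size)` and the matching line for free, included only by application code and never by the file that defines the wrappers.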

Looking at how my printf calls the same recursive mutex as my heap does, they must have come from the same place.

Somebody who knows their way around this development environment would have had this done in less time than it takes me to write this

jc101 · « **Reply #49 on:** July 24, 2022, 09:28:08 pm »

That is probably the case. Have you searched the FreeRTOS forums? They have all sorts buried in there for all kinds of IDEs.

Some of them are archived but still searchable via the FreeRTOS website.

Sauli · « **Reply #50 on:** July 25, 2022, 04:13:22 am »

I have been using https://github.com/MarioViara/xprintfc for years instead of printf and have been very happy with it. At least you know what it is doing. I have made some tweaks to it, like instead of giving an output function it always prints to a given buffer (as sprintf), and added function xnformat (as snprintf). Also added a mutex in xvformat for thread safety.

peter-h · « **Reply #51 on:** July 25, 2022, 06:01:00 am »

I took out the %f printf from the RTOS task (an HTTP server which has a 1Hz refresh on a page and eventually crashes the thing) and it still crashes, so I am not sure this alone is the cause.

100079f0 is the SP and that is within the RTOS stack of a task called TCP/IP. Both stack pointers are ok for the context. SP is initialised to top of RAM at startup.

This time the ffffffed does not appear - presumably because the handler makes a call to the function printing out the regs?

The point, however, is that this thing should crash simply because printf is not reentrant and the heap also isn't. So that needs fixing before one can dig around more. But the constant value of 100079f0 is curious.

abyrvalg · « **Reply #52 on:** July 25, 2022, 08:15:45 am »

What is at that line 214 of lwip/src/core/timeouts.c?

Jeroen3 · « **Reply #53 on:** July 25, 2022, 08:17:10 am »

Again, the text of an assert is written to SRAM. Why?
Put a breakpoint in your assert print function.

peter-h · « **Reply #54 on:** July 25, 2022, 08:30:25 am »

Yeah - that's the one. It's being crapped on halfway in the text string. But the text string should be stored elsewhere (initialised data, down in FLASH, not in the RTOS stack area) unless it appears in the RTOS stack area because it has been invoked, and ends up in there in the process of it being emitted.

Looking into ASSERT, it ends up here

Code: [Select]

#define LWIP_PLATFORM_ASSERT(x) do {debug_thread_printf("Assertion \"%s\" failed at line %d in %s\n", \
                                     x, __LINE__, __FILE__); } while(0)

which is the debug output in this project. It puts stuff into a buffer from where an RTOS thread picks it up and sends it to a USB VCP port, which I monitor with a PC. But nothing appears on that - probably because the thing has crashed before it got output. Those debugs are flakey; it is not like sending a string straight to a UART with polling.
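Given that the assert text goes through a buffered printf path that may never get flushed before the crash, one option is an assert hook that does no formatting at all and just parks the details in a static record that the debugger can read afterwards. A minimal sketch; `assert_record`, `record_assert` and `DEMO_PLATFORM_ASSERT` are hypothetical names, and wiring it into `LWIP_PLATFORM_ASSERT` is left as an assumption about your port:

```c
/* Last failed assertion, kept in static RAM for post-mortem inspection. */
struct assert_record {
    const char *msg;    /* assertion text (a string literal in flash) */
    const char *file;   /* source file of the failing check */
    int line;           /* source line of the failing check */
    unsigned count;     /* how many assertions have fired so far */
};

volatile struct assert_record last_assert;

/* No printf, no heap, no locks: only a few stores. */
void record_assert(const char *msg, const char *file, int line)
{
    last_assert.msg = msg;
    last_assert.file = file;
    last_assert.line = line;
    last_assert.count++;
}

/* Hypothetical drop-in shape for a platform assert macro. */
#define DEMO_PLATFORM_ASSERT(x) record_assert((x), __FILE__, __LINE__)
```

A breakpoint on `record_assert`, or a look at `last_assert` after the hard fault, then shows whether an assert fired at all, without depending on the USB debug thread still being alive.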

Whole file:

Code: [Select]

/**
 * @file
 * Stack-internal timers implementation.
 * This file includes timer callbacks for stack-internal timers as well as
 * functions to set up or stop timers and check for expired timers.
 *
 */

/*
 * Copyright (c) 2001-2004 Swedish Institute of Computer Science.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without modification,
 * are permitted provided that the following conditions are met:
 *
 * 1. Redistributions of source code must retain the above copyright notice,
 *    this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright notice,
 *    this list of conditions and the following disclaimer in the documentation
 *    and/or other materials provided with the distribution.
 * 3. The name of the author may not be used to endorse or promote products
 *    derived from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
 * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
 * SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
 * OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
 * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
 * OF SUCH DAMAGE.
 *
 * This file is part of the lwIP TCP/IP stack.
 *
 * Author: Adam Dunkels <adam@sics.se>
 *         Simon Goldschmidt
 *
 */

#include "lwip/opt.h"

#include "lwip/timeouts.h"
#include "lwip/priv/tcp_priv.h"

#include "lwip/def.h"
#include "lwip/memp.h"
#include "lwip/priv/tcpip_priv.h"

#include "lwip/ip4_frag.h"
#include "lwip/etharp.h"
#include "lwip/dhcp.h"
#include "lwip/autoip.h"
#include "lwip/igmp.h"
#include "lwip/dns.h"
#include "lwip/nd6.h"
#include "lwip/ip6_frag.h"
#include "lwip/mld6.h"
#include "lwip/sys.h"
#include "lwip/pbuf.h"

#if LWIP_DEBUG_TIMERNAMES
#define HANDLER(x) x, #x
#else /* LWIP_DEBUG_TIMERNAMES */
#define HANDLER(x) x
#endif /* LWIP_DEBUG_TIMERNAMES */

/** This array contains all stack-internal cyclic timers. To get the number of
 * timers, use LWIP_ARRAYSIZE() */
const struct lwip_cyclic_timer lwip_cyclic_timers[] = {
#if LWIP_TCP
  /* The TCP timer is a special case: it does not have to run always and
     is triggered to start from TCP using tcp_timer_needed() */
  {TCP_TMR_INTERVAL, HANDLER(tcp_tmr)},
#endif /* LWIP_TCP */
#if LWIP_IPV4
#if IP_REASSEMBLY
  {IP_TMR_INTERVAL, HANDLER(ip_reass_tmr)},
#endif /* IP_REASSEMBLY */
#if LWIP_ARP
  {ARP_TMR_INTERVAL, HANDLER(etharp_tmr)},
#endif /* LWIP_ARP */
#if LWIP_DHCP
  {DHCP_COARSE_TIMER_MSECS, HANDLER(dhcp_coarse_tmr)},
  {DHCP_FINE_TIMER_MSECS, HANDLER(dhcp_fine_tmr)},
#endif /* LWIP_DHCP */
#if LWIP_AUTOIP
  {AUTOIP_TMR_INTERVAL, HANDLER(autoip_tmr)},
#endif /* LWIP_AUTOIP */
#if LWIP_IGMP
  {IGMP_TMR_INTERVAL, HANDLER(igmp_tmr)},
#endif /* LWIP_IGMP */
#endif /* LWIP_IPV4 */
#if LWIP_DNS
  {DNS_TMR_INTERVAL, HANDLER(dns_tmr)},
#endif /* LWIP_DNS */
#if LWIP_IPV6
  {ND6_TMR_INTERVAL, HANDLER(nd6_tmr)},
#if LWIP_IPV6_REASS
  {IP6_REASS_TMR_INTERVAL, HANDLER(ip6_reass_tmr)},
#endif /* LWIP_IPV6_REASS */
#if LWIP_IPV6_MLD
  {MLD6_TMR_INTERVAL, HANDLER(mld6_tmr)},
#endif /* LWIP_IPV6_MLD */
#endif /* LWIP_IPV6 */
};

#define MEMP_SYS_TIMEOUT 10

#if LWIP_TIMERS && !LWIP_TIMERS_CUSTOM

/** The one and only timeout list */
static struct sys_timeo *next_timeout;
static u32_t timeouts_last_time;

#if LWIP_TCP
/** global variable that shows if the tcp timer is currently scheduled or not */
static int tcpip_tcp_timer_active;

/**
 * Timer callback function that calls tcp_tmr() and reschedules itself.
 *
 * @param arg unused argument
 */
static void
tcpip_tcp_timer(void *arg)
{
  LWIP_UNUSED_ARG(arg);

  /* call TCP timer handler */
  tcp_tmr();
  /* timer still needed? */
  if (tcp_active_pcbs || tcp_tw_pcbs) {
    /* restart timer */
    sys_timeout(TCP_TMR_INTERVAL, tcpip_tcp_timer, NULL);
  } else {
    /* disable timer */
    tcpip_tcp_timer_active = 0;
  }
}

/**
 * Called from TCP_REG when registering a new PCB:
 * the reason is to have the TCP timer only running when
 * there are active (or time-wait) PCBs.
 */
void
tcp_timer_needed(void)
{
  /* timer is off but needed again? */
  if (!tcpip_tcp_timer_active && (tcp_active_pcbs || tcp_tw_pcbs)) {
    /* enable and start timer */
    tcpip_tcp_timer_active = 1;
    sys_timeout(TCP_TMR_INTERVAL, tcpip_tcp_timer, NULL);
  }
}
#endif /* LWIP_TCP */

/**
 * Timer callback function that calls mld6_tmr() and reschedules itself.
 *
 * @param arg unused argument
 */
static void
cyclic_timer(void *arg)
{
  const struct lwip_cyclic_timer* cyclic = (const struct lwip_cyclic_timer*)arg;
#if LWIP_DEBUG_TIMERNAMES
  LWIP_DEBUGF(TIMERS_DEBUG, ("tcpip: %s()\n", cyclic->handler_name));
#endif
  cyclic->handler();
  sys_timeout(cyclic->interval_ms, cyclic_timer, arg);
}

/** Initialize this module */
void sys_timeouts_init(void)
{
  size_t i;
  /* tcp_tmr() at index 0 is started on demand */
  for (i = (LWIP_TCP ? 1 : 0); i < LWIP_ARRAYSIZE(lwip_cyclic_timers); i++) {
    /* we have to cast via size_t to get rid of const warning
      (this is OK as cyclic_timer() casts back to const* */
    sys_timeout(lwip_cyclic_timers[i].interval_ms, cyclic_timer, LWIP_CONST_CAST(void*, &lwip_cyclic_timers[i]));
  }

  /* Initialise timestamp for sys_check_timeouts */
  timeouts_last_time = sys_now();
}

/**
 * Create a one-shot timer (aka timeout). Timeouts are processed in the
 * following cases:
 * - while waiting for a message using sys_timeouts_mbox_fetch()
 * - by calling sys_check_timeouts() (NO_SYS==1 only)
 *
 * @param msecs time in milliseconds after that the timer should expire
 * @param handler callback function to call when msecs have elapsed
 * @param arg argument to pass to the callback function
 */
#if LWIP_DEBUG_TIMERNAMES
void
sys_timeout_debug(u32_t msecs, sys_timeout_handler handler, void *arg, const char* handler_name)
#else /* LWIP_DEBUG_TIMERNAMES */
void
sys_timeout(u32_t msecs, sys_timeout_handler handler, void *arg)
#endif /* LWIP_DEBUG_TIMERNAMES */
{
  struct sys_timeo *timeout, *t;
  u32_t now, diff;

  timeout = (struct sys_timeo *)memp_malloc(MEMP_SYS_TIMEOUT);
  if (timeout == NULL) {
    LWIP_ASSERT("sys_timeout: timeout != NULL, pool MEMP_SYS_TIMEOUT is empty", timeout != NULL);
    return;
  }

  now = sys_now();
  if (next_timeout == NULL) {
    diff = 0;
    timeouts_last_time = now;
  } else {
    diff = now - timeouts_last_time;
  }

  timeout->next = NULL;
  timeout->h = handler;
  timeout->arg = arg;
  timeout->time = msecs + diff;
#if LWIP_DEBUG_TIMERNAMES
  timeout->handler_name = handler_name;
  LWIP_DEBUGF(TIMERS_DEBUG, ("sys_timeout: %p msecs=%"U32_F" handler=%s arg=%p\n",
    (void *)timeout, msecs, handler_name, (void *)arg));
#endif /* LWIP_DEBUG_TIMERNAMES */

  if (next_timeout == NULL) {
    next_timeout = timeout;
    return;
  }

  if (next_timeout->time > msecs) {
    next_timeout->time -= msecs;
    timeout->next = next_timeout;
    next_timeout = timeout;
  } else {
    for (t = next_timeout; t != NULL; t = t->next) {
      timeout->time -= t->time;
      if (t->next == NULL || t->next->time > timeout->time) {
        if (t->next != NULL) {
          t->next->time -= timeout->time;
        } else if (timeout->time > msecs) {
          /* If this is the case, 'timeouts_last_time' and 'now' differs too much.
             This can be due to sys_check_timeouts() not being called at the right
             times, but also when stopping in a breakpoint. Anyway, let's assume
             this is not wanted, so add the first timer's time instead of 'diff' */
          timeout->time = msecs + next_timeout->time;
        }
        timeout->next = t->next;
        t->next = timeout;
        break;
      }
    }
  }
}

/**
 * Go through timeout list (for this task only) and remove the first matching
 * entry (subsequent entries remain untouched), even though the timeout has not
 * triggered yet.
 *
 * @param handler callback function that would be called by the timeout
 * @param arg callback argument that would be passed to handler
*/
void
sys_untimeout(sys_timeout_handler handler, void *arg)
{
  struct sys_timeo *prev_t, *t;

  if (next_timeout == NULL) {
    return;
  }

  for (t = next_timeout, prev_t = NULL; t != NULL; prev_t = t, t = t->next) {
    if ((t->h == handler) && (t->arg == arg)) {
      /* We have a match */
      /* Unlink from previous in list */
      if (prev_t == NULL) {
        next_timeout = t->next;
      } else {
        prev_t->next = t->next;
      }
      /* If not the last one, add time of this one back to next */
      if (t->next != NULL) {
        t->next->time += t->time;
      }
      memp_free(MEMP_SYS_TIMEOUT, t);
      return;
    }
  }
  return;
}

/**
 * @ingroup lwip_nosys
 * Handle timeouts for NO_SYS==1 (i.e. without using
 * tcpip_thread/sys_timeouts_mbox_fetch(). Uses sys_now() to call timeout
 * handler functions when timeouts expire.
 *
 * Must be called periodically from your main loop.
 */
#if !NO_SYS && !defined __DOXYGEN__
static
#endif /* !NO_SYS */
void
sys_check_timeouts(void)
{
  if (next_timeout) {
    struct sys_timeo *tmptimeout;
    u32_t diff;
    sys_timeout_handler handler;
    void *arg;
    u8_t had_one;
    u32_t now;

    now = sys_now();
    /* this cares for wraparounds */
    diff = now - timeouts_last_time;
    do {
      PBUF_CHECK_FREE_OOSEQ();
      had_one = 0;
      tmptimeout = next_timeout;
      if (tmptimeout && (tmptimeout->time <= diff)) {
        /* timeout has expired */
        had_one = 1;
        timeouts_last_time += tmptimeout->time;
        diff -= tmptimeout->time;
        next_timeout = tmptimeout->next;
        handler = tmptimeout->h;
        arg = tmptimeout->arg;
#if LWIP_DEBUG_TIMERNAMES
        if (handler != NULL) {
          LWIP_DEBUGF(TIMERS_DEBUG, ("sct calling h=%s arg=%p\n",
            tmptimeout->handler_name, arg));
        }
#endif /* LWIP_DEBUG_TIMERNAMES */
        memp_free(MEMP_SYS_TIMEOUT, tmptimeout);
        if (handler != NULL) {
#if !NO_SYS
          /* For LWIP_TCPIP_CORE_LOCKING, lock the core before calling the
             timeout handler function. */
          LOCK_TCPIP_CORE();
#endif /* !NO_SYS */
          handler(arg);
#if !NO_SYS
          UNLOCK_TCPIP_CORE();
#endif /* !NO_SYS */
        }
        LWIP_TCPIP_THREAD_ALIVE();
      }
    /* repeat until all expired timers have been called */
    } while (had_one);
  }
}

/** Set back the timestamp of the last call to sys_check_timeouts()
 * This is necessary if sys_check_timeouts() hasn't been called for a long
 * time (e.g. while saving energy) to prevent all timer functions of that
 * period being called.
 */
void
sys_restart_timeouts(void)
{
  timeouts_last_time = sys_now();
}

/** Return the time left before the next timeout is due. If no timeouts are
 * enqueued, returns 0xffffffff
 */
#if !NO_SYS
static
#endif /* !NO_SYS */
u32_t
sys_timeouts_sleeptime(void)
{
  u32_t diff;
  if (next_timeout == NULL) {
    return 0xffffffff;
  }
  diff = sys_now() - timeouts_last_time;
  if (diff > next_timeout->time) {
    return 0;
  } else {
    return next_timeout->time - diff;
  }
}

#if !NO_SYS

/**
 * Wait (forever) for a message to arrive in an mbox.
 * While waiting, timeouts are processed.
 *
 * @param mbox the mbox to fetch the message from
 * @param msg the place to store the message
 */
void
sys_timeouts_mbox_fetch(sys_mbox_t *mbox, void **msg)
{
  u32_t sleeptime;

again:
  if (!next_timeout) {
    sys_arch_mbox_fetch(mbox, msg, 0);
    return;
  }

  sleeptime = sys_timeouts_sleeptime();
  if (sleeptime == 0 || sys_arch_mbox_fetch(mbox, msg, sleeptime) == SYS_ARCH_TIMEOUT) {
    /* If a SYS_ARCH_TIMEOUT value is returned, a timeout occurred
       before a message could be fetched. */
    sys_check_timeouts();
    /* We try again to fetch a message from the mbox. */
    goto again;
  }
}

#endif /* NO_SYS */

#else /* LWIP_TIMERS && !LWIP_TIMERS_CUSTOM */
/* Satisfy the TCP code which calls this function */
void
tcp_timer_needed(void)
{
}
#endif /* LWIP_TIMERS && !LWIP_TIMERS_CUSTOM */

Quote

Put a breakpoint in your assert print function.

Great idea - doing it now.

The funny thing is that another system, same firmware, has been running fine for several days. It does sometimes crash though.

abyrvalg · « **Reply #55 on:** July 25, 2022, 08:42:53 am »

> But the text string should be stored elsewhere
Exactly. This text in RAM means it has been invoked (note the line number being printf’ed; it isn’t stored in the flash text, the complete message is constructed in a buffer).

peter-h · « **Reply #56 on:** July 25, 2022, 08:49:05 am »

Yes; clearly that malloc failed. That is a malloc out of the LWIP private heap. This is a really good pointer to what the issue might be.

There is a setting in lwipopts.h

/* MEMP_NUM_SYS_TIMEOUT: the number of simulateously active
timeouts. */
#define MEMP_NUM_SYS_TIMEOUT 10

and this is checked during compilation to make sure it is big enough relative to the number of packet buffers etc. I will later try increasing it.

#if LWIP_TIMERS && (MEMP_NUM_SYS_TIMEOUT < (LWIP_TCP + IP_REASSEMBLY + LWIP_ARP + (2*LWIP_DHCP) + LWIP_AUTOIP + LWIP_IGMP + LWIP_DNS + PPP_SUPPORT + (LWIP_IPV6 ? (1 + LWIP_IPV6_REASS + LWIP_IPV6_MLD) : 0)))
#error "MEMP_NUM_SYS_TIMEOUT is too low to accomodate all required timeouts"

but there is no other reference to MEMP_NUM_SYS_TIMEOUT so it looks like LWIP allocates memory for these timeouts as it goes along, having first done a simple check that there will be enough.

The ASSERT is a macro and the line number gets inserted at compile time.

emece67 · « **Reply #57 on:** July 25, 2022, 09:18:35 am »

peter-h · « **Reply #58 on:** July 25, 2022, 10:28:34 am »

OK... found the immediate cause.

The timer allocation was returning a NULL which triggered the assert.

The output string from the assert (which was basically never tested before) was too long for the buffer, and that bombed the system.

As to why memp_malloc fails, that's another Q but nowhere else does LWIP check that return value; it assumes it will always work
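One possible explanation, offered as an assumption and not something established in this thread: with its default configuration lwIP takes memp allocations from fixed-size pools dimensioned at compile time, so the timeout pool would hold exactly MEMP_NUM_SYS_TIMEOUT entries and the allocation returns NULL once more timeouts are active at the same moment than there are slots. The sketch below is not lwIP code; `pool_alloc`, `pool_free` and `POOL_SLOTS` are hypothetical names that only illustrate that behaviour.

```c
#include <stddef.h>

#define POOL_SLOTS 4  /* hypothetical; plays the role of MEMP_NUM_SYS_TIMEOUT */

/* One pool element. The link is only meaningful while the slot is free,
   so the caller may overwrite the whole slot while it owns it. */
union slot {
  union slot *next;
  unsigned char payload[32];  /* stand-in for struct sys_timeo */
};

static union slot pool[POOL_SLOTS];
static union slot *free_list;
static int pool_ready;

/* Chain every slot onto the free list. */
static void pool_init(void)
{
  free_list = NULL;
  for (size_t i = 0; i < POOL_SLOTS; i++) {
    pool[i].next = free_list;
    free_list = &pool[i];
  }
  pool_ready = 1;
}

/* Returns a free slot, or NULL when all POOL_SLOTS are in use. */
void *pool_alloc(void)
{
  if (!pool_ready) {
    pool_init();
  }
  union slot *s = free_list;
  if (s == NULL) {
    return NULL;  /* pool exhausted: the caller must handle this */
  }
  free_list = s->next;
  return s;
}

/* Puts a slot back on the free list. */
void pool_free(void *p)
{
  union slot *s = p;
  s->next = free_list;
  free_list = s;
}
```

If that assumption holds, raising MEMP_NUM_SYS_TIMEOUT adds slots, and a NULL means "too many timeouts active at once" and not general heap fragmentation.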

Jeroen3 · « **Reply #59 on:** July 25, 2022, 12:17:45 pm »

So the thing that should tell you about crashes is crashing your system... the irony.

peter-h · « **Reply #60 on:** July 25, 2022, 01:15:36 pm »

and the information was there all along, if I knew what to look for. I didn't realise that was an assert text string.

It crashes the system because it used a vsprintf instead of a vsnprintf

Fixed now. But I would still like to know why that timer malloc fails.

cv007 · « **Reply #61 on:** July 25, 2022, 04:09:08 pm »

Back in post #39, probably not worthless.

Quote

This suggestion may be worthless- but it seems lwip is your latest code addition, and you have lwip assert enabled as seen by the assert strings in memory, so maybe disable the lwip assert to see what changes. Maybe an assert is taking place, and the resulting printf originating within lwip code is causing some problem. Not necessarily a great way to go about it, but a few pokes of the patient to see how they react is sometimes useful. Alternatively- make a lwip assert fail to see if the lwip assert printf actually works correctly.

bson · « **Reply #62 on:** July 25, 2022, 06:49:33 pm »

Quote from: peter-h on July 22, 2022, 07:40:23 pm

0x656d6974 looks like a string ('emit'). Something probably blew away the stack by overrunning a local buffer and then returned. Since it's an even address on a thumb-only core you get a hard fault.
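A quick way to test a suspicious PC or LR value for this pattern is to decode its four bytes as ASCII. The sketch below is a host-side helper with hypothetical names (`word_as_ascii`, `is_thumb_target`). It lists the bytes in little-endian memory order, so 0x656d6974 comes out as "time" (reading the hex digits from the high byte down gives 'emit'), which would fit a string starting "time..." having been written over the stack.

```c
#include <stdint.h>

/* Writes the four bytes of 'word' to out[0..3] in little-endian memory
   order and NUL-terminates. Returns 1 if all four are printable ASCII. */
int word_as_ascii(uint32_t word, char out[5])
{
  int printable = 1;
  for (int i = 0; i < 4; i++) {
    unsigned char c = (unsigned char)(word >> (8 * i));
    out[i] = (char)c;
    if (c < 0x20 || c > 0x7e) {
      printable = 0;
    }
  }
  out[4] = '\0';
  return printable;
}

/* A code address loaded into PC by pop or bx must have bit 0 set on a
   Thumb-only core; an even value is a strong hint that it is not a
   genuine return address. */
int is_thumb_target(uint32_t addr)
{
  return (addr & 1u) != 0;
}
```

For 0x656d6974, `word_as_ascii` returns 1 with the text "time" and `is_thumb_target` returns 0, so both checks flag it.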

ataradov · « **Reply #63 on:** July 25, 2022, 07:05:19 pm »

And the best thing to observe in cases like this is LR register in the exception. A typical flow for this is something like this:

1. bl or blx some_function. This sets LR to the address of the instruction after the call.
2. The code is called, does something like push {r0, r1, ...... lr}
3. The code runs, stack overflow happens
4. pop {r0, r1, ...... pc} oops we are screwed

But if the code did not modify the LR internally (and GCC tries not to do that unless really necessary), then you will at least have the correct last call that was performed. All you need to do is find the address in LR in the disassembly, look one instruction up and you will find what was called and what likely led to this breakdown.

And you can obviously do a sanity check. If LR does not point past a bl/blx instruction, then it was overwritten by the called code.

And as usual, an RTOS makes it much harder to debug, but not impossible.

Note that you need to observe the stacked LR, not the actual CPU LR. The CPU LR would contain the exception return magic code, and it is useless here.
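The sanity check above can be automated against a dump of the firmware image. The following is a rough host-side sketch, not a full disassembler: `lr_follows_call` and its parameters are hypothetical, and it only recognises the 32-bit Thumb `bl` encoding and the 16-bit `blx Rm` encoding. Because it pattern-matches halfwords, it can occasionally report a false positive when the tail of some other 32-bit instruction happens to look like a call.

```c
#include <stddef.h>
#include <stdint.h>

/* code:  firmware image as Thumb halfwords
   base:  address of code[0] (assumed even)
   count: number of halfwords in code[]
   Returns 1 if the instruction just before the stacked LR looks like
   bl or blx, 0 otherwise. */
int lr_follows_call(uint32_t stacked_lr, const uint16_t *code,
                    uint32_t base, size_t count)
{
  uint32_t ret = stacked_lr & ~1u;  /* drop the Thumb bit */
  if (ret < base) {
    return 0;
  }
  size_t idx = (ret - base) / 2;    /* halfword index of the return address */
  if (idx > count) {
    return 0;                       /* outside the image */
  }
  /* 16-bit "blx Rm": 0100 0111 1mmm m000 */
  if (idx >= 1 && (code[idx - 1] & 0xFF87u) == 0x4780u) {
    return 1;
  }
  /* 32-bit "bl label": 11110 S imm10, then 11 J1 1 J2 imm11 */
  if (idx >= 2 && (code[idx - 2] & 0xF800u) == 0xF000u &&
      (code[idx - 1] & 0xD000u) == 0xD000u) {
    return 1;
  }
  return 0;
}
```

A stacked LR such as 0x656d6974 fails this check straight away because it lies outside the image, which is the "it was overwritten" case.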

harerod · « **Reply #64 on:** July 25, 2022, 07:09:00 pm »

cv007 - my experience from a project running freeRTOS/lwip on an F407:
There are several asserts in lwip that will cause issues with printf/malloc under freeRTOS.
I decided not to use malloc in the project, so I use my own output routines. For this purpose I patched the asserts in lwip. Important hint: sprintf seems to do fine without malloc (not sure about float), so I just sprintf to a buffer, which is then passed to any one of the available output channels (UARTs, USB, ETH).
All peachy, if it weren't for CubeIDE (v1.3, in my case) overwriting those patches, whenever the MX-Tools regenerates any component's code, e.g. after you changed any tiny parameter. My workaround is: check code into git, update MX/IOC, let git check for unwanted changes (whose location one remembers after a few rounds), revert those changes (mostly done by inserting the previous file version).

peter-h · « **Reply #65 on:** July 26, 2022, 06:17:48 am »

Unfortunately I didn't know where to look for the stacked value of LR. But the key was the realisation that the text of an assert message was within the RTOS stack area, i.e. that message was being output. I should have searched the project for that message. FWIW, this one (an overflow of a 128 byte debug facility buffer) has bit me before, and I fixed it yesterday by using vsnprintf(buf, sizeof(buf)-1, ...) instead of the previous vsprintf which was created about 3 years ago. So this one won't happen again.
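For reference, a minimal sketch of a bounded debug formatter. `debug_format` and `debug_truncated` are hypothetical names, not the project's actual functions. vsnprintf writes at most size-1 characters plus the terminating NUL, so passing sizeof(buf) is already safe and the -1 only costs one byte. The return value is the length the full message would have had, which makes truncation detectable.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Formats into buf without ever writing more than 'size' bytes.
   Returns what vsnprintf returns: the untruncated message length,
   or a negative value on an encoding error. */
int debug_format(char *buf, size_t size, const char *fmt, ...)
{
  va_list ap;
  va_start(ap, fmt);
  int n = vsnprintf(buf, size, fmt, ap);
  va_end(ap);
  return n;
}

/* True if the message did not fit (or formatting failed). */
int debug_truncated(int n, size_t size)
{
  return n < 0 || (size_t)n >= size;
}
```

With an 8-byte buffer, formatting "%s:%d" with "timeouts.c" and 190 returns 14 and leaves "timeout" in the buffer: the message is cut short, but nothing beyond the buffer is touched.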

The code has now been running for an unusually long time, and a breakpoint on the assert (telling me that the malloc returned a NULL) has not been seen. Time will tell...

Some candidates:

The ST USB MSC and CDC code used a malloc at the start, and it was assumed these blocks will never be freed (a legitimate use of the heap in embedded). But they do get freed because there is a "DeInit" function (for a bizarre reason ST produced a DeInit version of almost every hardware init function) which gets called if there is a break in the USB connection! And USB has a habit of going to sleep (on the windoze end). So one would have been getting lots of fragmentation due to this. Stupid stupid code by ST - this is supposed to be an "embedded" system! And it would vary according to the USB controller; the one at work is probably better behaved and so the board running on my desk there runs for much longer (I work at work and at home; same project). The two blocks, around 600 and 200 bytes, have been replaced with static buffers.

I have over time checked code for malloc use and removed these, but in some cases it is simply not possible e.g. MbedTLS uses a lot of it, as does FreeRTOS, but these use a private heap. TLS chucks away the whole heap when a connection is closed so that should be ok, but you never know... FR seems to allocate and never unallocate.

ETH global interrupt was not used but was left enabled. The ISR (in the ST HAL stuff) was not pointed to by a vector, so got optimised away, and a callback routine in ethernetif.c (which was called only by the ISR) got optimised away too. I found this through some bizarre breakpoint behaviour. Try setting a breakpoint on code which the compiler later removes... No evidence that interrupt was actually being generated though.

The printf family is still not mutex-protected but commenting out a %7.3f use in an RTOS task has not helped. I am working on this but it is messy
https://www.eevblog.com/forum/programming/st-cube-gcc-how-to-override-a-function-not-defined-as-weak-(no-sources)/
Last night I was tracing through some long and float printf code. It calls __retarget_lock_acquire_recursive and __retarget_lock_release_recursive (with r0=0, which I think is the function parameter, in this case a handle; but does a recursive mutex need an individual handle?) and I no longer saw the calls to malloc() which I saw before. But then I did add -u _printf_float to the linker options to remove a warning on the use of a float printf (which was emitted even though it did actually work; I don't understand that). Now I should be using the Standard C library with the newlib-nano unchecked. Maybe it does a malloc only in some more complicated cases. Also I notice that it does not do any sort of mutex initialisation; maybe that is not needed for recursive mutexes?

Thank you all

ataradov · « **Reply #66 on:** July 26, 2022, 06:21:05 am »

Quote from: peter-h on July 26, 2022, 06:17:48 am

Unfortunately I didn't know where to look for the stacked value of LR.

It is the value of s_lr that my code retrieves. Or you can manually look it up in the relevant stack (MSP or PSP), basically do what my code does, but manually.

peter-h · « **Reply #67 on:** July 26, 2022, 06:25:04 am »

OK; I looked to see if I have posted a screenshot but I don't think I have.

ataradov · « **Reply #68 on:** July 26, 2022, 06:27:41 am »

You need to run that many times to see if the failure point is consistent. And if not you need to observe and spot patterns. A single screenshot won't do much.

eutectique · « **Reply #69 on:** July 26, 2022, 02:30:51 pm »

Quote from: peter-h on July 25, 2022, 10:28:34 am

The output string from the assert (which was basically never tested before) was too long for the buffer, and that bombed the system.

Good catch!

I bet your assert strings contain full file paths. To get rid of all this unnecessary junk and save flash, you might want to add the following to your Makefile (or whatever equivalent your IDE offers):

Code: [Select]

CFLAGS += -fmacro-prefix-map=$(dir $<)=

And there is even more economical way of implementing asserts. Dump the values of PC and LR instead of strings: https://interrupt.memfault.com/blog/asserts-in-embedded-systems#register-values-only-assert-5
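A rough sketch of that idea, written for this thread and not taken from the linked article: `COMPACT_ASSERT`, `assert_record_fail` and `g_assert_record` are hypothetical names. It stores only the line number and the return address into the failing function, using the GCC/Clang builtin `__builtin_return_address`, so no message string is formatted and there is no buffer to overflow. A real target would normally halt or reset after recording; that is left out here so the code also runs on a host.

```c
#include <stddef.h>
#include <stdint.h>

struct assert_record {
  const void *lr;   /* return address into the code that failed the check */
  uint32_t line;    /* __LINE__ of the failing COMPACT_ASSERT */
  uint32_t count;   /* number of failures recorded so far */
};

/* Hypothetical storage; on a target this could live in a RAM section
   that survives a reset. */
static volatile struct assert_record g_assert_record;

/* noinline so that the return address points at the call site. */
__attribute__((noinline)) void assert_record_fail(uint32_t line)
{
  g_assert_record.lr = __builtin_return_address(0);
  g_assert_record.line = line;
  g_assert_record.count++;
  /* Target code would typically stop here (breakpoint, reset, ...). */
}

#define COMPACT_ASSERT(expr)            \
  do {                                  \
    if (!(expr)) {                      \
      assert_record_fail(__LINE__);     \
    }                                   \
  } while (0)
```

The recorded address can then be looked up in the map file or disassembly, the same way as a stacked LR.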

peter-h · « **Reply #70 on:** July 27, 2022, 04:31:40 pm »

The main issue was that my debugs were output with sprintf and not with snprintf.

Cutting off a debug is ok because if you really need it, you can make the buffer longer.

The other part was that I was running out of heap. I think the cause (not seen again after mods) was the fragmentation caused by lots of USB timeouts. Stupid ST MSC/CDC code.

I have basically solved the printf thread-safety issue - here
https://www.eevblog.com/forum/programming/a-question-on-mutexes-normal-v-recursive-and-printf/msg4324273/#msg4324273

The heap thread-safety issue remains, but I can solve that with mutexes since only a few bits of code use the heap and all of it is in .c form. Replacing the heap with an open source one would not help a huge amount, I think, because malloc and free will always need to be mutexed.

EDIT: I added simple mutexes to the heap

Code: [Select]

#include "FreeRTOS.h"
#include "cmsis_os2.h"
#include <newlib_locking.h>


extern osMutexId_t g_HEAP_Mutex;


// The mutex lock is *not* recursive (as is shown in many examples online) because
// that seems pointless.
// These two functions are used by both malloc and free.

void __malloc_lock(void)
{
	osMutexAcquire(g_HEAP_Mutex,osWaitForever);
}

void __malloc_unlock (void)
{
	osMutexRelease(g_HEAP_Mutex);
}

// This one is not used
void __malloc_lock_acquire(void)
{
	osMutexAcquire(g_HEAP_Mutex,osWaitForever);
}

Getting the above to override the empty stubs in the ST-supplied libc.a involved the method described in the above link, first converting all symbols to "weak" with objcopy.

