Author Topic: 32F4 hard fault trap - how to track this down?  (Read 9397 times)

0 Members and 1 Guest are viewing this topic.

Offline Sauli

  • Contributor
  • Posts: 43
  • Country: fi
Re: 32F4 hard fault trap - how to track this down?
« Reply #50 on: July 25, 2022, 04:13:22 am »
I have been using https://github.com/MarioViara/xprintfc for years instead of printf and have been very happy with it. At least you know what it is doing. I have made some tweaks to it, like instead of giving an output function it always prints to a given buffer (as sprintf), and added function xnformat (as snprintf). Also added a mutex in xvformat for thread safety.
 
The following users thanked this post: peter-h

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3735
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F4 hard fault trap - how to track this down?
« Reply #51 on: July 25, 2022, 06:01:00 am »
I took out the %f printf from the RTOS task (an HTTP server which has a 1Hz refresh on a page and eventually crashes the thing) and it still crashes, so I am not sure this alone is the cause.



100079f0 is the SP and that is within the RTOS stack of a task called TCP/IP. Both stack pointers are ok for the context. SP is initialised to top of RAM at startup.





This time the ffffffed does not appear - presumably because the handler makes a call to the function printing out the regs?

The point, however, is that this thing should crash simply because printf is not reentrant and the heap also isn't. So that needs fixing before one can dig around more. But the constant value of 100079f0 is curious.
« Last Edit: July 25, 2022, 07:45:09 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline abyrvalg

  • Frequent Contributor
  • **
  • Posts: 825
  • Country: es
Re: 32F4 hard fault trap - how to track this down?
« Reply #52 on: July 25, 2022, 08:15:45 am »
What is at that line 214 of lwip/src/core/timeouts.c?
 
The following users thanked this post: peter-h

Offline Jeroen3

  • Super Contributor
  • ***
  • Posts: 4078
  • Country: nl
  • Embedded Engineer
    • jeroen3.nl
Re: 32F4 hard fault trap - how to track this down?
« Reply #53 on: July 25, 2022, 08:17:10 am »
Again, the text of an assert is written to SRAM. Why?
Put a breakpoint in your assert print function.
 
The following users thanked this post: peter-h

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3735
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F4 hard fault trap - how to track this down?
« Reply #54 on: July 25, 2022, 08:30:25 am »


Yeah - that's the one. It's being crapped on halfway in the text string. But the text string should be stored elsewhere (initialised data, down in FLASH, not in the RTOS stack area) unless it appears in the RTOS stack area because it has been invoked, and ends up in there in the process of it being emitted.

Looking into ASSERT, it ends up here

Code: [Select]
#define LWIP_PLATFORM_ASSERT(x) do {debug_thread_printf("Assertion \"%s\" failed at line %d in %s\n", \
                                     x, __LINE__, __FILE__); } while(0)


which is the debug output in this project. It puts stuff into a buffer from where an RTOS thread picks it up and sends it to a USB VCP port, which I monitor with a PC. But nothing appears on that - probably because the thing has crashed before it got output. Those debugs are flakey; it is not like sending a string straight to a UART with polling.

Whole file:

Code: [Select]
/**
 * @file
 * Stack-internal timers implementation.
 * This file includes timer callbacks for stack-internal timers as well as
 * functions to set up or stop timers and check for expired timers.
 *
 */

/*
 * Copyright (c) 2001-2004 Swedish Institute of Computer Science.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without modification,
 * are permitted provided that the following conditions are met:
 *
 * 1. Redistributions of source code must retain the above copyright notice,
 *    this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright notice,
 *    this list of conditions and the following disclaimer in the documentation
 *    and/or other materials provided with the distribution.
 * 3. The name of the author may not be used to endorse or promote products
 *    derived from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
 * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
 * MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
 * SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
 * OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
 * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
 * OF SUCH DAMAGE.
 *
 * This file is part of the lwIP TCP/IP stack.
 *
 * Author: Adam Dunkels <adam@sics.se>
 *         Simon Goldschmidt
 *
 */

#include "lwip/opt.h"

#include "lwip/timeouts.h"
#include "lwip/priv/tcp_priv.h"

#include "lwip/def.h"
#include "lwip/memp.h"
#include "lwip/priv/tcpip_priv.h"

#include "lwip/ip4_frag.h"
#include "lwip/etharp.h"
#include "lwip/dhcp.h"
#include "lwip/autoip.h"
#include "lwip/igmp.h"
#include "lwip/dns.h"
#include "lwip/nd6.h"
#include "lwip/ip6_frag.h"
#include "lwip/mld6.h"
#include "lwip/sys.h"
#include "lwip/pbuf.h"

#if LWIP_DEBUG_TIMERNAMES
#define HANDLER(x) x, #x
#else /* LWIP_DEBUG_TIMERNAMES */
#define HANDLER(x) x
#endif /* LWIP_DEBUG_TIMERNAMES */

/** This array contains all stack-internal cyclic timers. To get the number of
 * timers, use LWIP_ARRAYSIZE() */
const struct lwip_cyclic_timer lwip_cyclic_timers[] = {
#if LWIP_TCP
  /* The TCP timer is a special case: it does not have to run always and
     is triggered to start from TCP using tcp_timer_needed() */
  {TCP_TMR_INTERVAL, HANDLER(tcp_tmr)},
#endif /* LWIP_TCP */
#if LWIP_IPV4
#if IP_REASSEMBLY
  {IP_TMR_INTERVAL, HANDLER(ip_reass_tmr)},
#endif /* IP_REASSEMBLY */
#if LWIP_ARP
  {ARP_TMR_INTERVAL, HANDLER(etharp_tmr)},
#endif /* LWIP_ARP */
#if LWIP_DHCP
  {DHCP_COARSE_TIMER_MSECS, HANDLER(dhcp_coarse_tmr)},
  {DHCP_FINE_TIMER_MSECS, HANDLER(dhcp_fine_tmr)},
#endif /* LWIP_DHCP */
#if LWIP_AUTOIP
  {AUTOIP_TMR_INTERVAL, HANDLER(autoip_tmr)},
#endif /* LWIP_AUTOIP */
#if LWIP_IGMP
  {IGMP_TMR_INTERVAL, HANDLER(igmp_tmr)},
#endif /* LWIP_IGMP */
#endif /* LWIP_IPV4 */
#if LWIP_DNS
  {DNS_TMR_INTERVAL, HANDLER(dns_tmr)},
#endif /* LWIP_DNS */
#if LWIP_IPV6
  {ND6_TMR_INTERVAL, HANDLER(nd6_tmr)},
#if LWIP_IPV6_REASS
  {IP6_REASS_TMR_INTERVAL, HANDLER(ip6_reass_tmr)},
#endif /* LWIP_IPV6_REASS */
#if LWIP_IPV6_MLD
  {MLD6_TMR_INTERVAL, HANDLER(mld6_tmr)},
#endif /* LWIP_IPV6_MLD */
#endif /* LWIP_IPV6 */
};

#define MEMP_SYS_TIMEOUT 10

#if LWIP_TIMERS && !LWIP_TIMERS_CUSTOM

/** The one and only timeout list */
static struct sys_timeo *next_timeout;
static u32_t timeouts_last_time;

#if LWIP_TCP
/** global variable that shows if the tcp timer is currently scheduled or not */
static int tcpip_tcp_timer_active;

/**
 * Timer callback function that calls tcp_tmr() and reschedules itself.
 *
 * @param arg unused argument
 */
static void
tcpip_tcp_timer(void *arg)
{
  LWIP_UNUSED_ARG(arg);

  /* call TCP timer handler */
  tcp_tmr();
  /* timer still needed? */
  if (tcp_active_pcbs || tcp_tw_pcbs) {
    /* restart timer */
    sys_timeout(TCP_TMR_INTERVAL, tcpip_tcp_timer, NULL);
  } else {
    /* disable timer */
    tcpip_tcp_timer_active = 0;
  }
}

/**
 * Called from TCP_REG when registering a new PCB:
 * the reason is to have the TCP timer only running when
 * there are active (or time-wait) PCBs.
 */
void
tcp_timer_needed(void)
{
  /* timer is off but needed again? */
  if (!tcpip_tcp_timer_active && (tcp_active_pcbs || tcp_tw_pcbs)) {
    /* enable and start timer */
    tcpip_tcp_timer_active = 1;
    sys_timeout(TCP_TMR_INTERVAL, tcpip_tcp_timer, NULL);
  }
}
#endif /* LWIP_TCP */

/**
 * Timer callback function that calls mld6_tmr() and reschedules itself.
 *
 * @param arg unused argument
 */
static void
cyclic_timer(void *arg)
{
  const struct lwip_cyclic_timer* cyclic = (const struct lwip_cyclic_timer*)arg;
#if LWIP_DEBUG_TIMERNAMES
  LWIP_DEBUGF(TIMERS_DEBUG, ("tcpip: %s()\n", cyclic->handler_name));
#endif
  cyclic->handler();
  sys_timeout(cyclic->interval_ms, cyclic_timer, arg);
}

/** Initialize this module */
void sys_timeouts_init(void)
{
  size_t i;
  /* tcp_tmr() at index 0 is started on demand */
  for (i = (LWIP_TCP ? 1 : 0); i < LWIP_ARRAYSIZE(lwip_cyclic_timers); i++) {
    /* we have to cast via size_t to get rid of const warning
      (this is OK as cyclic_timer() casts back to const* */
    sys_timeout(lwip_cyclic_timers[i].interval_ms, cyclic_timer, LWIP_CONST_CAST(void*, &lwip_cyclic_timers[i]));
  }

  /* Initialise timestamp for sys_check_timeouts */
  timeouts_last_time = sys_now();
}

/**
 * Create a one-shot timer (aka timeout). Timeouts are processed in the
 * following cases:
 * - while waiting for a message using sys_timeouts_mbox_fetch()
 * - by calling sys_check_timeouts() (NO_SYS==1 only)
 *
 * @param msecs time in milliseconds after that the timer should expire
 * @param handler callback function to call when msecs have elapsed
 * @param arg argument to pass to the callback function
 */
#if LWIP_DEBUG_TIMERNAMES
void
sys_timeout_debug(u32_t msecs, sys_timeout_handler handler, void *arg, const char* handler_name)
#else /* LWIP_DEBUG_TIMERNAMES */
void
sys_timeout(u32_t msecs, sys_timeout_handler handler, void *arg)
#endif /* LWIP_DEBUG_TIMERNAMES */
{
  struct sys_timeo *timeout, *t;
  u32_t now, diff;

  timeout = (struct sys_timeo *)memp_malloc(MEMP_SYS_TIMEOUT);
  if (timeout == NULL) {
    LWIP_ASSERT("sys_timeout: timeout != NULL, pool MEMP_SYS_TIMEOUT is empty", timeout != NULL);
    return;
  }

  now = sys_now();
  if (next_timeout == NULL) {
    diff = 0;
    timeouts_last_time = now;
  } else {
    diff = now - timeouts_last_time;
  }

  timeout->next = NULL;
  timeout->h = handler;
  timeout->arg = arg;
  timeout->time = msecs + diff;
#if LWIP_DEBUG_TIMERNAMES
  timeout->handler_name = handler_name;
  LWIP_DEBUGF(TIMERS_DEBUG, ("sys_timeout: %p msecs=%"U32_F" handler=%s arg=%p\n",
    (void *)timeout, msecs, handler_name, (void *)arg));
#endif /* LWIP_DEBUG_TIMERNAMES */

  if (next_timeout == NULL) {
    next_timeout = timeout;
    return;
  }

  if (next_timeout->time > msecs) {
    next_timeout->time -= msecs;
    timeout->next = next_timeout;
    next_timeout = timeout;
  } else {
    for (t = next_timeout; t != NULL; t = t->next) {
      timeout->time -= t->time;
      if (t->next == NULL || t->next->time > timeout->time) {
        if (t->next != NULL) {
          t->next->time -= timeout->time;
        } else if (timeout->time > msecs) {
          /* If this is the case, 'timeouts_last_time' and 'now' differs too much.
             This can be due to sys_check_timeouts() not being called at the right
             times, but also when stopping in a breakpoint. Anyway, let's assume
             this is not wanted, so add the first timer's time instead of 'diff' */
          timeout->time = msecs + next_timeout->time;
        }
        timeout->next = t->next;
        t->next = timeout;
        break;
      }
    }
  }
}

/**
 * Go through timeout list (for this task only) and remove the first matching
 * entry (subsequent entries remain untouched), even though the timeout has not
 * triggered yet.
 *
 * @param handler callback function that would be called by the timeout
 * @param arg callback argument that would be passed to handler
*/
void
sys_untimeout(sys_timeout_handler handler, void *arg)
{
  struct sys_timeo *prev_t, *t;

  if (next_timeout == NULL) {
    return;
  }

  for (t = next_timeout, prev_t = NULL; t != NULL; prev_t = t, t = t->next) {
    if ((t->h == handler) && (t->arg == arg)) {
      /* We have a match */
      /* Unlink from previous in list */
      if (prev_t == NULL) {
        next_timeout = t->next;
      } else {
        prev_t->next = t->next;
      }
      /* If not the last one, add time of this one back to next */
      if (t->next != NULL) {
        t->next->time += t->time;
      }
      memp_free(MEMP_SYS_TIMEOUT, t);
      return;
    }
  }
  return;
}

/**
 * @ingroup lwip_nosys
 * Handle timeouts for NO_SYS==1 (i.e. without using
 * tcpip_thread/sys_timeouts_mbox_fetch(). Uses sys_now() to call timeout
 * handler functions when timeouts expire.
 *
 * Must be called periodically from your main loop.
 */
#if !NO_SYS && !defined __DOXYGEN__
static
#endif /* !NO_SYS */
void
sys_check_timeouts(void)
{
  if (next_timeout) {
    struct sys_timeo *tmptimeout;
    u32_t diff;
    sys_timeout_handler handler;
    void *arg;
    u8_t had_one;
    u32_t now;

    now = sys_now();
    /* this cares for wraparounds */
    diff = now - timeouts_last_time;
    do {
      PBUF_CHECK_FREE_OOSEQ();
      had_one = 0;
      tmptimeout = next_timeout;
      if (tmptimeout && (tmptimeout->time <= diff)) {
        /* timeout has expired */
        had_one = 1;
        timeouts_last_time += tmptimeout->time;
        diff -= tmptimeout->time;
        next_timeout = tmptimeout->next;
        handler = tmptimeout->h;
        arg = tmptimeout->arg;
#if LWIP_DEBUG_TIMERNAMES
        if (handler != NULL) {
          LWIP_DEBUGF(TIMERS_DEBUG, ("sct calling h=%s arg=%p\n",
            tmptimeout->handler_name, arg));
        }
#endif /* LWIP_DEBUG_TIMERNAMES */
        memp_free(MEMP_SYS_TIMEOUT, tmptimeout);
        if (handler != NULL) {
#if !NO_SYS
          /* For LWIP_TCPIP_CORE_LOCKING, lock the core before calling the
             timeout handler function. */
          LOCK_TCPIP_CORE();
#endif /* !NO_SYS */
          handler(arg);
#if !NO_SYS
          UNLOCK_TCPIP_CORE();
#endif /* !NO_SYS */
        }
        LWIP_TCPIP_THREAD_ALIVE();
      }
    /* repeat until all expired timers have been called */
    } while (had_one);
  }
}

/** Set back the timestamp of the last call to sys_check_timeouts()
 * This is necessary if sys_check_timeouts() hasn't been called for a long
 * time (e.g. while saving energy) to prevent all timer functions of that
 * period being called.
 */
void
sys_restart_timeouts(void)
{
  timeouts_last_time = sys_now();
}

/** Return the time left before the next timeout is due. If no timeouts are
 * enqueued, returns 0xffffffff
 */
#if !NO_SYS
static
#endif /* !NO_SYS */
u32_t
sys_timeouts_sleeptime(void)
{
  u32_t diff;
  if (next_timeout == NULL) {
    return 0xffffffff;
  }
  diff = sys_now() - timeouts_last_time;
  if (diff > next_timeout->time) {
    return 0;
  } else {
    return next_timeout->time - diff;
  }
}

#if !NO_SYS

/**
 * Wait (forever) for a message to arrive in an mbox.
 * While waiting, timeouts are processed.
 *
 * @param mbox the mbox to fetch the message from
 * @param msg the place to store the message
 */
void
sys_timeouts_mbox_fetch(sys_mbox_t *mbox, void **msg)
{
  u32_t sleeptime;

again:
  if (!next_timeout) {
    sys_arch_mbox_fetch(mbox, msg, 0);
    return;
  }

  sleeptime = sys_timeouts_sleeptime();
  if (sleeptime == 0 || sys_arch_mbox_fetch(mbox, msg, sleeptime) == SYS_ARCH_TIMEOUT) {
    /* If a SYS_ARCH_TIMEOUT value is returned, a timeout occurred
       before a message could be fetched. */
    sys_check_timeouts();
    /* We try again to fetch a message from the mbox. */
    goto again;
  }
}

#endif /* NO_SYS */

#else /* LWIP_TIMERS && !LWIP_TIMERS_CUSTOM */
/* Satisfy the TCP code which calls this function */
void
tcp_timer_needed(void)
{
}
#endif /* LWIP_TIMERS && !LWIP_TIMERS_CUSTOM */


Quote
Put a breakpoint in your assert print function.

Great idea - doing it now.

The funny thing is that another system, same firmware, has been running fine for several days. It does something crash though.
« Last Edit: July 25, 2022, 08:46:28 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline abyrvalg

  • Frequent Contributor
  • **
  • Posts: 825
  • Country: es
Re: 32F4 hard fault trap - how to track this down?
« Reply #55 on: July 25, 2022, 08:42:53 am »
> But the text string should be stored elsewhere
Exactly. This text in RAM means it have been invoked (note the line number being printf’ed, it isn’t stored in flash text, the complete message is constructed in a buffer).
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3735
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F4 hard fault trap - how to track this down?
« Reply #56 on: July 25, 2022, 08:49:05 am »
Yes; clearly that malloc failed. That is a malloc out of the LWIP private heap. This is a really good pointer to what the issue might be.

There is a setting in lwipopts.h

/* MEMP_NUM_SYS_TIMEOUT: the number of simulateously active
   timeouts. */
#define MEMP_NUM_SYS_TIMEOUT    10

and this is checked during compilation to make sure it is big enough relative to the number of packet buffers etc. I will later try increasing it.

#if LWIP_TIMERS && (MEMP_NUM_SYS_TIMEOUT < (LWIP_TCP + IP_REASSEMBLY + LWIP_ARP + (2*LWIP_DHCP) + LWIP_AUTOIP + LWIP_IGMP + LWIP_DNS + PPP_SUPPORT + (LWIP_IPV6 ? (1 + LWIP_IPV6_REASS + LWIP_IPV6_MLD) : 0)))
  #error "MEMP_NUM_SYS_TIMEOUT is too low to accomodate all required timeouts"

but there is no other reference to MEMP_NUM_SYS_TIMEOUT so it looks like LWIP allocates memory for these timeouts as it goes along, having first done a simple check that there will be enough.

The ASSERT is a macro and the line number gets inserted at compile time.
« Last Edit: July 25, 2022, 08:56:37 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline emece67

  • Frequent Contributor
  • **
  • !
  • Posts: 614
  • Country: 00
Re: 32F4 hard fault trap - how to track this down?
« Reply #57 on: July 25, 2022, 09:18:35 am »
.
« Last Edit: August 19, 2022, 05:41:31 pm by emece67 »
 
The following users thanked this post: peter-h

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3735
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F4 hard fault trap - how to track this down?
« Reply #58 on: July 25, 2022, 10:28:34 am »
OK... found the immediate cause.

The timer allocation was returning a NULL which triggered the assert.

The output string from the assert (which was basically never tested before) was too long for the buffer, and that bombed the system.

As to why memp_malloc fails, that's another Q but nowhere else does LWIP check that return value; it assumes it will always work :)
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline Jeroen3

  • Super Contributor
  • ***
  • Posts: 4078
  • Country: nl
  • Embedded Engineer
    • jeroen3.nl
Re: 32F4 hard fault trap - how to track this down?
« Reply #59 on: July 25, 2022, 12:17:45 pm »
So the thing that should tell you about crashes is crashing your system... the irony.
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3735
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F4 hard fault trap - how to track this down?
« Reply #60 on: July 25, 2022, 01:15:36 pm »
:) :) :)

and the information was there all along, if I knew what to look for. I didn't realise that was an assert text string.

It crashes the system because it used a vsprintf instead of a vnsprintf :) Fixed now. But I would still like to know why that timer malloc fails.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline cv007

  • Frequent Contributor
  • **
  • Posts: 828
Re: 32F4 hard fault trap - how to track this down?
« Reply #61 on: July 25, 2022, 04:09:08 pm »
Back in post #39, probably not worthless.
Quote
This suggestion may be worthless- but it seems lwip is your latest code addition, and you have lwip assert enabled as seen by the assert strings in memory, so maybe disable the lwip assert to see what changes. Maybe an assert is taking place, and the resulting printf originating within lwip code is causing some problem. Not necessarily a great way to go about it, but a few pokes of the patient to see how they react is sometimes useful. Alternatively- make a lwip assert fail to see if the lwip assert printf actually works correctly.
 

Offline bson

  • Supporter
  • ****
  • Posts: 2283
  • Country: us
Re: 32F4 hard fault trap - how to track this down?
« Reply #62 on: July 25, 2022, 06:49:33 pm »

0x656d6974 looks like a string ('emit').  Something probably blew away the stack by overrunning a local buffer and then returned.  Since it's an even address on a thumb-only core you get a hard fault.

 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11299
  • Country: us
    • Personal site
Re: 32F4 hard fault trap - how to track this down?
« Reply #63 on: July 25, 2022, 07:05:19 pm »
And the best thing to observe in cases like this is LR register in the exception. A typical flow for this is something like this:

1. bl or blx some_function .This saves the LR to be the next address after this one.
2. The code is called, does something like push {r0, r1, ...... lr}
3. The code runs, stack overflow happens
4. pop {r0, r1, ...... pc}  oops we are screwed

But if the code did not modify the LR internally (and GCC tries not to do that unless really necessary), then you will at least have the correct last call that was performed. All you need to do is find the address in LR in the disassembly, look one instruction up and you will find what was called and what is likely lead to this breakdown.

And you can obviously do a sanity check. If LR does not point past a bl/blx instruction, then it was overwritten by the called code.

And as usual, RTOS make is much harder to debug, but no impossible.

Note that you need to observe the stacked LR, not the actual CPU LR. The CPU LR would contain the exception return magic code, and it useless here.
« Last Edit: July 25, 2022, 07:08:04 pm by ataradov »
Alex
 

Offline harerod

  • Frequent Contributor
  • **
  • Posts: 451
  • Country: de
  • ee - digital & analog
    • My services:
Re: 32F4 hard fault trap - how to track this down?
« Reply #64 on: July 25, 2022, 07:09:00 pm »
cv007 - my experience from a project running freeRTOS/lwip on an F407:
There are several asserts in lwip that will cause issues with printf/malloc under freeRTOS. 
I decided not to use malloc in the project, so I use my own output routines. For this purpose I patched the asserts in lwip. Important hint: sprintf seems to do fine without malloc (not sure about float), so I just sprintf to a buffer, which is then passed to any one of the available output channels (UARTs, USB, ETH).
All peachy, if it weren't for CubeIDE (v1.3, in my case) overwriting those patches, whenever the MX-Tools regenerates any component's code, e.g. after you changed any tiny parameter. My workaround is: check code into git, update MX/IOC, let git check for unwanted changes (whose location one remembers after a few rounds), revert those changes (mostly done by inserting the previous file version).
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3735
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F4 hard fault trap - how to track this down?
« Reply #65 on: July 26, 2022, 06:17:48 am »
Unfortunately I didn't know where to look for the stacked value of LR. But the key was the realisation that the text of an assert message was within the RTOS stack area, i.e. that message was being output. I should have searched the project for that message. FWIW, this one (an overflow of a 128 byte debug facility buffer) has bit me before, and I fixed it yesterday by using vsnprintf(buf, sizeof(buf)-1, ...) instead of the previous vsprintf which was created about 3 years ago. So this one won't happen again.

The code has now been running for an unusually long time, and a breakpoint on the assert (telling me that the malloc returned a NULL) has not been seen. Time will tell...

Some candidates:

The ST USB MSC and CDC code used a malloc at the start, and it was assumed these blocks will never be freed (a legitimate use of the heap in embedded). But they do get freed because there is a "DeInit" function (for a bizzare reason ST produced a DeInit version of almost every hardware init function) which gets called if there is a break in the USB connection! And USB has a habit of going to sleep (on the windoze end). So one would have been getting lots of fragmentation due to this. Stupid stupid code by ST - this is supposed to be an "embedded" system! And it would vary according to the USB controller; one at work is probably better behaved and so the board running on my desk there runs for much longer (I work at work and at home; same project). The two blocks, around 600 and 200 bytes, have been replaced with static buffers.

I have over time checked code for malloc use and removed these, but in some cases it is simply not possible e.g. MbedTLS uses a lot of it, as does FreeRTOS, but these use a private heap. TLS chucks away the whole heap when a connection is closed so that should be ok, but you never know... FR seems to allocate and never unallocate.

ETH global interrupt was not used but was left enabled. The ISR (in the ST HAL stuff) was not pointed to by a vector, so got optimised away, and a callback routine in ethernetif.c (which was called only by the ISR) got optimised away too. I found this in some bizzare breakpoint behaviour. Try setting a breakpoint on code which the compiler later removes... No evidence that interrupt was actually being generated though.

The printf family is still not mutex-protected but commenting out a %7.3f use in an RTOS task has not helped. I am working on this but it is messy
https://www.eevblog.com/forum/programming/st-cube-gcc-how-to-override-a-function-not-defined-as-weak-(no-sources)/
Last night I was tracing through some long and float printf code and while it calls __retarget_lock_acquire_recursive and __retarget_lock_release_recursive (with r0=0 which I think is the function parameter, in this case a handle, but does a recursive mutex need an individual handle) and I no longer saw the calls to malloc() which I saw before. But then I did add -u _printf_float to the linker options to remove a warning on the use of a float printf (which was emitted even though it did actually work; I don't understand that). Now I should be using the Standard C library with the newlib-nano unchecked. Maybe it does a malloc only in some more complicated cases. Also I notice that it does not do any sort of mutex initialisation; maybe that is not needed for recursive mutexes?

Thank you all :)
« Last Edit: July 26, 2022, 07:05:33 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11299
  • Country: us
    • Personal site
Re: 32F4 hard fault trap - how to track this down?
« Reply #66 on: July 26, 2022, 06:21:05 am »
Unfortunately I didn't know where to look for the stacked value of LR.
It is the value of s_lr that my code retrieves. Or you can manually looks it up in the relevant stack (MSP or PSP), basically do what my code dos, but manually.
Alex
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3735
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F4 hard fault trap - how to track this down?
« Reply #67 on: July 26, 2022, 06:25:04 am »
OK; I looked to see if I have posted a screenshot but I don't think I have.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 11299
  • Country: us
    • Personal site
Re: 32F4 hard fault trap - how to track this down?
« Reply #68 on: July 26, 2022, 06:27:41 am »
You need to run that many times to see if the failure point is consistent. And if not you need to observe and spot patterns. A single screenshot won't do much.
Alex
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 401
  • Country: be
Re: 32F4 hard fault trap - how to track this down?
« Reply #69 on: July 26, 2022, 02:30:51 pm »
The output string from the assert (which was basically never tested before) was too long for the buffer, and that bombed the system.

Good catch!

I bet your assert strings contain full file paths. To get rid of all this unnecessary junk and save flash, you might want to add the following to your Makefile (or whatever equivalent your IDE offers):

Code: [Select]
CFLAGS += -fmacro-prefix-map=$(dir $<)=

And there is even more economical way of implementing asserts. Dump the values of PC and LR instead of strings: https://interrupt.memfault.com/blog/asserts-in-embedded-systems#register-values-only-assert-5
« Last Edit: July 26, 2022, 02:37:18 pm by eutectique »
 

Online peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3735
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F4 hard fault trap - how to track this down?
« Reply #70 on: July 27, 2022, 04:31:40 pm »
The main issue was that my debugs were output with sprintf and not with snprintf.

Cutting off a debug is ok because if you really need it, you can make the buffer longer.

The other part was that I was running out of heap. I think the cause (not seen again after mods) was the fragmentation caused by lots of USB timeouts. Stupid ST MSC/CDC code.

I have basically solved the printf thread-safety issue - here
https://www.eevblog.com/forum/programming/a-question-on-mutexes-normal-v-recursive-and-printf/msg4324273/#msg4324273

The heap thread-safety issue remains, but I can solve that with mutexes since only a few bits of code use the heap and all of it in in .c form. Replacing the heap with an open source one would not help a huge amount, I think, because malloc and free will always need to be mutexed.

EDIT: I added simple mutexes to the heap

Code: [Select]
#include "FreeRTOS.h"
#include "cmsis_os2.h"
#include <newlib_locking.h>


extern osMutexId_t g_HEAP_Mutex;


// The mutex lock is *not* recursive (as is shown in many examples online) because
// that seems pointless.
// These two functions are used by both malloc and free.

void __malloc_lock(void)
{
osMutexAcquire(g_HEAP_Mutex,osWaitForever);
}

void __malloc_unlock (void)
{
osMutexRelease(g_HEAP_Mutex);
}

// This one is not used
void __malloc_lock_acquire(void)
{
osMutexAcquire(g_HEAP_Mutex,osWaitForever);
}


Getting the above to override the empty stubs in the ST-supplied libc.a involved the method described in the above link, first converting all symbols to "weak" with objcopy.
« Last Edit: July 28, 2022, 06:01:01 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 


Share me

Digg  Facebook  SlashDot  Delicious  Technorati  Twitter  Google  Yahoo
Smf