EEVblog Electronics Community Forum

Electronics => Microcontrollers => Topic started by: ajb on April 07, 2018, 01:16:48 am

Title: Catching an application hang in the act (Resolved - new errata entry pending)
Post by: ajb on April 07, 2018, 01:16:48 am
I have an extremely frustrating bug in a Cortex M7 project I'm currently working on--every once in a while the system gets hung up somewhere, and becomes unresponsive.  I have a watchdog implemented, so the system resets cleanly at least, but this is still very undesirable.  Unfortunately, this may only happen once in a few hours, and so far has not correlated with any other action or activity, which makes it rather hard to debug.  I know (or, am pretty sure) that it's not encountering a fault, because I have a fault handler implemented, and that's not triggering.  I could add some instrumentation to various parts of the code, and will probably start that next, but there's enough real-time stuff happening that extensive logging is difficult and, due to the infrequency of occurrence, any sort of brute force approach will take days to carry out.

Anyway, that's not the real point of the thread, because I have a number of other strategies in mind to try, but one thing in particular has been a problem:

I've been trying to wait for the system to hang up, with the watchdog disabled, and then debug via SWD to see what the state of things is, and hopefully find *where* it's hanging up.  The problem is that I cannot seem to start a debug session without the application being reset.  I'm using a J-Link Plus and Atollic, and I've cut down the standard debug startup script so that it should just connect to the target and halt, and this works fine normally.  I can start the application, then at any point connect and stop it in its tracks.  But whenever I DO catch the system in a hang, using the same debug script seems to reset the system to the beginning of the application.  What gives?

Any ideas what might be causing the apparent application reset?  Does that give any hint as to what the underlying fault may be?  Is there another setup or tool I should be using here that may allow me to see what's going on without disturbing the hang state?   Even catching a glimpse of the PC and other system registers would be nice!

(Actually as I write this, it occurs to me that the debug script I've been using is not aware of the bootloader, so perhaps the application is somehow getting thrown to somewhere before the application, and this confuses the debugger, since it's not aware of any code below the application start address?  Oh well, will have to wait a few hours for it to fail again to test that, so if anyone has any suggestions in the meantime...) 
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: Rerouter on April 07, 2018, 01:24:48 am
Out of curiosity, can you hook up your debugger, connect, and then just leave it until a hang occurs?

The other one: as you're doing realtime code, I assume you're using some level of interrupts. Do you have a clean way of handling multiple interrupts occurring at the same time?

Another thought is to dump registers after an unexpected reset, as most devices do not reinitialise their registers on reset, so you may be able to catch something interesting there.

And finally, you're not suffering from memory creep, are you? E.g. failing to release no-longer-needed memory; this can occur if you have iterative code.
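
One piece of state that does survive a reset on STM32 parts is the set of reset-cause flags in RCC->CSR. A minimal sketch of decoding them is below. The bit positions are as I understand them for the F4/F7 families and should be checked against the reference manual; the enum and the helper name `reset_cause` are hypothetical, not part of any ST library.

```c
#include <stdint.h>

/* Reset-flag bits in RCC->CSR (assumed layout for STM32F4/F7;
 * verify against the reference manual for your part). */
#define CSR_BORRSTF  (1u << 25)  /* brown-out reset */
#define CSR_PINRSTF  (1u << 26)  /* NRST pin reset */
#define CSR_PORRSTF  (1u << 27)  /* power-on/power-down reset */
#define CSR_SFTRSTF  (1u << 28)  /* software reset */
#define CSR_IWDGRSTF (1u << 29)  /* independent watchdog reset */
#define CSR_WWDGRSTF (1u << 30)  /* window watchdog reset */
#define CSR_LPWRRSTF (1u << 31)  /* low-power management reset */

typedef enum {
    RESET_UNKNOWN, RESET_LOW_POWER, RESET_WWDG, RESET_IWDG,
    RESET_SOFTWARE, RESET_POWER_ON, RESET_BROWN_OUT, RESET_PIN
} reset_cause_t;

/* Hypothetical helper: pick the most specific cause. The pin flag is
 * checked last because it is usually set alongside the other causes. */
reset_cause_t reset_cause(uint32_t csr)
{
    if (csr & CSR_LPWRRSTF) return RESET_LOW_POWER;
    if (csr & CSR_WWDGRSTF) return RESET_WWDG;
    if (csr & CSR_IWDGRSTF) return RESET_IWDG;
    if (csr & CSR_SFTRSTF)  return RESET_SOFTWARE;
    if (csr & CSR_PORRSTF)  return RESET_POWER_ON;
    if (csr & CSR_BORRSTF)  return RESET_BROWN_OUT;
    if (csr & CSR_PINRSTF)  return RESET_PIN;
    return RESET_UNKNOWN;
}
```

On target this would be called early in startup with the value of RCC->CSR, after which the flags are cleared by setting the RMVF bit so the next reset starts from a clean slate.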
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: andyturk on April 07, 2018, 01:43:38 am
Try Segger's Ozone (https://www.segger.com/products/development-tools/ozone-j-link-debugger/) debugger and use the "Attach to Running Program" menu item. I'm able to initiate a debug session this way on various M4F mcus without forcing a reset.

Does your application have an assert mechanism that might be "causing" the hang? If so, a breakpoint there will shed light. Do you have trace access on the system?

The usual suspects for this kind of thing include stack overflows and memory allocation failure.
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: ajb on April 07, 2018, 02:04:36 am
No malloc or recursion--or at least not in any of my code, I am using some (limited) 3rd party code for USB host, but I've seen it hang even when that is idle.

Leaving a debug session active sort of puts a damper on other development, but I may do that overnight, or perhaps set my laptop up and let it run for a while.

Re: Interrupts, yes, there are a handful, but as far as I can tell they are prioritized correctly (the M7 has a nestable interrupt controller, so the short and/or time-sensitive interrupts do preempt the longer ones).  Critical sections all seem to be in order as well.

I think my next approach will be to switch to the other watchdog timer--the one I'm currently using immediately and unconditionally resets the chip, but I should be able to set the other one up to issue an interrupt to grab the system registers first. 

I will give Ozone a shot as well, thanks for the reminder about that.  Normally I can connect to a running target without a reset using the startup script I have, so I'm not sure if it will be any different--but I've been meaning to try out Ozone anyway so it's worth a shot.
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: andyturk on April 07, 2018, 02:12:26 am
Quote from: ajb
I think my next approach will be to switch to the other watchdog timer--the one I'm currently using immediately and unconditionally resets the chip, but I should be able to set the other one up to issue an interrupt to grab the system registers first.

Or, jam a "BKPT" instruction in the handler for the watchdog while your debugger is attached. That's still going to be pretty far away from whatever caused the problem though. Trace is your friend in these situations.

Are you using an RTOS? How is the watchdog pet/kicked/fed?
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: abraxa on April 07, 2018, 06:10:38 am
Are the trace pins available? You could use PC sampling if you have an FX2 board (cheap 8-channel logic analyzer): http://essentialscrap.com/tips/arm_trace/
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: MosherIV on April 07, 2018, 08:20:40 am
Is there a serial output available?
If yes, output diagnostic info during the running, add extra info in the wdt isr to dump more info before the reset.

Is there an area of RAM that is not cleared on reset?
Can you set aside an area?
If yes, dump info into this area.

If single threaded, dump the stack.

In my experience, sw driven resets are generally caused by pointer corruption or stack corruption (especially array overflow, i.e. an array as a local var).

Edit: I knew there was another reason as I pressed save :palm:
Null pointers! Always check for null pointers. Reading them will just give you rubbish (well, actually, in most cases it will return data from a vector table). Writing to a null pointer may cause a hw reset, since null is often in Flash and unauthorised writes may cause a reset, or nothing may happen except that the data is not written. If mapping a struct onto a null pointer, accessing info from the struct will return garbage data. Any pointers inside the struct will be garbage, and any data used as offsets will be garbage. In short, always check for null pointers!
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: ajb on April 08, 2018, 06:38:24 pm
I left a debug session running overnight and much of yesterday, and, of course, it didn't fail  |O.  I've seen the same hardware fail five times in two hours or not once in 10+ hours in basically identical situations, it's really maddening.  Left a system running overnight again last night without a debug session running, and it was hung when I came back to it today.  I tried connecting with Ozone, and looking at the console, it appears to identify the connected target, then says "Can not attach to CPU. Trying connect under reset."  I guess the hang condition is preventing the debugger from connecting normally, and the J-Link falls back to connecting under reset, which explains why I always seem to wind up at the beginning of the application when I try to debug during a hang.

I added a WDT ISR that dumps the stacked system registers in the hopes of at least catching a useful PC value, but despite being a higher priority than anything else, this isn't running when the hang occurs.  This would seem to implicate an atomic section, but I only have a handful of those and they're fairly simple and mostly only used during initialization.  Switching the atomic sections from global disable to using PRIMASK might allow the ISR to run, so I'll give that a try.  If I can't get the WDT ISR to run, then the amount of state I can capture is limited to what I can make survive the reset and bootloader initialization.  Even if I break at the very top of the bootloader, I expect that hardware initialization will reset the SP, which means losing anything useful in the stack.  Even running my debug USART at 3Mbaud, I can only dump so much during execution before I run into timing problems.

I should be able to get some persistent logging implemented today and will instrument the atomic sections and a couple of other things, but the fact that the debugger can't seem to connect during the hang has me concerned that this could be a hardware issue.  I'm not sure that it would be possible for software to cause debugging to fail like that. 

Trace lines are all in use on this system, but I could bodge them out, and I've ordered an FX2 device from Amazon in case that helps.
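
For reference, a minimal sketch of the "dump the stacked registers" idea mentioned above. It relies on the standard Cortex-M basic exception frame (r0-r3, r12, lr, pc, xpsr pushed on exception entry). The record layout, the magic value and the names `crash_record_t` and `capture_frame` are hypothetical; on target the record would need to live in RAM that is not zeroed at startup.

```c
#include <stdint.h>

/* Basic exception frame pushed by a Cortex-M core on exception entry,
 * lowest address first. With an active FPU context the frame is longer,
 * but these first eight words keep the same offsets. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} stacked_frame_t;

/* Hypothetical capture record, meant to survive a watchdog reset. */
typedef struct {
    uint32_t        magic;  /* marks the record as valid */
    stacked_frame_t frame;
} crash_record_t;

#define CRASH_MAGIC 0xC0FFEE01u

/* Copy the stacked registers out of the interrupted context's stack.
 * 'sp' is the stack pointer value at exception entry. */
void capture_frame(crash_record_t *rec, const uint32_t *sp)
{
    rec->frame.r0   = sp[0];
    rec->frame.r1   = sp[1];
    rec->frame.r2   = sp[2];
    rec->frame.r3   = sp[3];
    rec->frame.r12  = sp[4];
    rec->frame.lr   = sp[5];
    rec->frame.pc   = sp[6];  /* address the hang was interrupted at */
    rec->frame.xpsr = sp[7];
    rec->magic = CRASH_MAGIC; /* written last so a partial record is not trusted */
}

/* On target, a small assembly shim in the ISR would choose MSP or PSP
 * from bit 2 of EXC_RETURN (held in LR) and pass it in r0, e.g.:
 *     tst   lr, #4
 *     ite   eq
 *     mrseq r0, msp
 *     mrsne r0, psp
 */
```

The stacked PC is usually the most useful field, since it points at the instruction that was about to execute when the watchdog interrupt was taken.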
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: Rerouter on April 09, 2018, 01:29:59 pm
As a bit of hardware weirdness, the fact that it doesn't play up while the debugger is connected may be a very strong indication of what is wrong.

Have a read of the datasheet to see whether it needs pullups or pulldowns on those debug pins. Are any of them floating without a debugger?

Equally, does your debugger supply any power? If so, could your device be suffering from a brownout?

Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: ajb on April 09, 2018, 03:23:29 pm
I wouldn't read too much into the fact that the problem did not occur during the debug session, it really is just that erratic, unfortunately.  I've not seen any recommendations on debug IO handling for the STM32F7 specifically, but SWDIO and SWCLK are pulled up and down, respectively, which is the generally recommended condition.  The debugger is not powering the target.
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: andyturk on April 09, 2018, 05:03:47 pm
Quote from: ajb
Left a system running overnight again last night without a debug session running, and it was hung when I came back to it today.

So it's "hung", but the WDT isn't resetting the system? That sounds like a hole in the watchdog implementation.  It's probably not the issue you're chasing, but having the system detect the hang correctly might make it easier to track down the issue.

If you're on an F7, you've probably got an RTOS in there. Is there watchdog coverage for each thread?

https://betterembsw.blogspot.com/2014/05/proper-watchdog-timer-use.html
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: ajb on April 09, 2018, 06:31:26 pm
Sorry, I wasn't clear.  At the time I had disabled the WDT in the hopes that I could debug the system while it was hung.  When the WDT is enabled and the system does hang, the WDT ISR does NOT run, but the WDT DOES reset the system properly.  If I cause an artificial hang by entering an infinite loop, the WDT ISR runs before WDT resets the system, so I'm confident that it's configured correctly.  Whatever the actual hang condition is, though, apparently it blocks interrupts entirely--in addition to the WDT ISR, no other ISRs are running AFAICT.  Unfortunately since the WDT issues a configurable-priority IRQ it won't run if PRIMASK is set or a higher-priority exception is running.  The fact that the WDT can't be tied to the NMI is kind of a frustrating oversight.  Maybe I'll rig something up with a timer and DMA to issue an NMI request or cause a hardfault.

There's no RTOS.  The application isn't actually that complex in terms of threads of execution, but wants as much computational performance as possible, hence the 200MHz M7.  I am taking advantage of the PendSV handler for one task that runs as a software interrupt, but everything else is easily handled in a superloop (and obviously interrupts/DMA where required for coms).  The WDT reset includes sanity-checking other application conditions, so for the moment I'm confident in its coverage.
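
The reason a configurable-priority watchdog IRQ cannot run while PRIMASK is set can be shown with a small host-side model of the Cortex-M preemption rule. This is a simplified sketch under stated assumptions: it ignores FAULTMASK and priority grouping, and the type and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define PRIO_NMI       (-2)   /* fixed priority, cannot be masked */
#define PRIO_HARDFAULT (-1)   /* fixed priority */
#define PRIO_THREAD    256    /* numerically above any exception: nothing active */

/* Simplified view of the state that decides preemption. */
typedef struct {
    bool primask;      /* set: execution priority is raised to 0 */
    int  basepri;      /* 0 = disabled, else masks priorities >= this value */
    int  active_prio;  /* priority of the running handler, PRIO_THREAD if none */
} core_state_t;

/* Current execution priority: the lowest number among the active
 * handler, BASEPRI (if enabled) and PRIMASK (if set). */
int execution_priority(const core_state_t *s)
{
    int p = s->active_prio;
    if (s->basepri != 0 && s->basepri < p) p = s->basepri;
    if (s->primask && 0 < p) p = 0;
    return p;
}

/* An exception is taken only if its priority number is strictly lower
 * than the current execution priority. */
bool can_preempt(const core_state_t *s, int exc_prio)
{
    return exc_prio < execution_priority(s);
}
```

With `primask` set, even a priority-0 interrupt fails the test while NMI and HardFault still pass, which is why a watchdog wired to the NMI would have been more useful for this kind of diagnosis.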
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: bson on April 10, 2018, 07:03:04 am
Maybe your application doesn't hang per se, but you just have a bug that causes PRIMASK to be left set to 1 and all interrupts disabled.  That's pretty much the kiss of death.  Especially if PendSV exceptions are used for context switches; even those can't happen.
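
One common way to make that kind of bug less likely is to have every critical section save the mask state on entry and restore it on exit, so that nested sections pair up correctly. A minimal host-runnable sketch follows. `fake_primask`, `read_primask` and `write_primask` are stand-ins invented for this example; on target they would map to the CMSIS-Core intrinsics `__get_PRIMASK()`, `__disable_irq()` and `__set_PRIMASK()`.

```c
#include <stdint.h>

/* Stand-in for the PRIMASK register so the sketch runs on a host. */
static uint32_t fake_primask;

static uint32_t read_primask(void)        { return fake_primask; }
static void     write_primask(uint32_t v) { fake_primask = v & 1u; }

/* Enter: remember the previous mask state, then mask interrupts. */
uint32_t critical_enter(void)
{
    uint32_t saved = read_primask();
    write_primask(1u);
    return saved;
}

/* Exit: restore the state seen on entry instead of blindly re-enabling,
 * so an inner section cannot unmask interrupts for an outer one. */
void critical_exit(uint32_t saved)
{
    write_primask(saved);
}
```

This does not protect against a code path that returns without calling `critical_exit()`; that still has to be caught by review or by instrumenting the enter/exit pairs.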
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: andersm on April 10, 2018, 07:56:35 am
If you are using any sleep modes, check the errata of your chip if there are any mentions of missing wakeups, they are surprisingly common bugs. Also review the rest of the errata while you're at it.
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: ajb on April 12, 2018, 05:43:20 am
Quote from: bson
Maybe your application doesn't hang per se, but you just have a bug that causes primask to be left set to 1 and all interrupts disabled.

Yes, all of the evidence so far indicates that it's either this or a hardware issue.

So I spent some time trying to rig up a timer to trigger an NMI or a fault, as if the fault handler runs, then that at least tells me the chip isn't somehow locked up at the hardware level--plus it may give me a chance to jump in and debug, or at least dump a bunch of state.  Unfortunately I haven't had any success.  Looks like the DMA can't access the system control block registers--which, honestly, is probably how it should be--and attempting to misconfigure the DMA to cause a fault only results in a DMA transfer error.  At this point the only way I can see to cause a fixed priority interrupt when the system hangs is to use DMA to drive an IO line to drive a transistor and shunt the crystal, which will trip the clock supervisor which WILL generate an NMI.  That's a pretty ugly solution, but I may well resort to it. It's really frustrating that neither of the two WDTs in this part is tied to the NMI!

I did check the errata, and didn't see anything that seemed relevant.  No sleep states here.

I did implement some basic persistent logging, where the application shoves token values into a buffer at various points (mostly function/ISR entries and exits), and the buffer gets checked and printed out at application start if it contains any data.  No smoking gun so far, I'm just seeing the expected sequence of activity.  I'll add some more log points to the application, and just keep collecting as many instances as I can to see if a pattern emerges.
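
A minimal sketch of this kind of breadcrumb log, assuming a ring buffer of 16-bit tokens guarded by a magic value. All names and sizes here are made up for illustration. On target the structure would have to sit in a RAM section that the startup code and bootloader leave untouched; that linker detail is not shown.

```c
#include <stddef.h>
#include <stdint.h>

#define CRUMB_MAGIC 0xB7EADC0Bu
#define CRUMB_COUNT 64u          /* number of tokens retained */

typedef struct {
    uint32_t magic;               /* valid only if equal to CRUMB_MAGIC */
    uint32_t head;                /* total number of tokens written */
    uint16_t tokens[CRUMB_COUNT];
} crumb_log_t;

/* Host stand-in; on target, place this in a no-init RAM section. */
static crumb_log_t crumb_log;

/* Start a fresh log, e.g. after the previous one has been printed. */
void crumb_clear(void)
{
    crumb_log.head  = 0;
    crumb_log.magic = CRUMB_MAGIC;
}

/* Record one token; cheap enough to call at function/ISR entry and exit. */
void crumb_put(uint16_t token)
{
    crumb_log.tokens[crumb_log.head % CRUMB_COUNT] = token;
    crumb_log.head++;
}

/* Copy out the most recent tokens, oldest first. Returns how many were
 * copied, or 0 if the buffer does not hold a valid log. */
size_t crumb_read(uint16_t *out, size_t max)
{
    if (crumb_log.magic != CRUMB_MAGIC) return 0;
    uint32_t n = crumb_log.head < CRUMB_COUNT ? crumb_log.head : CRUMB_COUNT;
    if (n > max) n = (uint32_t)max;
    uint32_t start = crumb_log.head - n;
    for (uint32_t i = 0; i < n; i++)
        out[i] = crumb_log.tokens[(start + i) % CRUMB_COUNT];
    return n;
}
```

At application start the flow would be: call `crumb_read()`, print whatever comes back, then call `crumb_clear()` before normal operation resumes.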
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: ajb on April 19, 2018, 05:40:03 pm
Finally got some other things out of the way and have been able to concentrate on this a bit more.  I added a bunch more breadcrumbing to the application in the hopes of catching something useful, and wound up with a system that hangs quite frequently--perhaps ten times in the past two hours.  That would seem to suggest that there is something about the application that's causing the problem--but fuck if I know what it is at the moment.

On the bright side (where "bright" consists of at least having a bit more information), I have managed to catch a hang during an active debug session both in Atollic and in Ozone.  Unfortunately, though, even then I can't actually get anywhere through the debugger.  In Ozone, attempting to halt the program fails (console just shows "J-Link: CPU could not be halted").  In Atollic, the entire debug system seems to freak out and continuously attempt to read out the registers, which of course fails, and it either has to be shut down or left to eventually crash. 

As another test, I reinstalled an older version of the application, and the frequent hangs immediately stopped, so that's certainly a compelling indication that this is some sort of software problem, which is encouraging.  However it's disconcerting that whatever is happening prevents debugging entirely.  I'm not sure why that should ever happen as a result of a software fault.  Even if the core enters a lockup state (when a fault is encountered while executing the hardfault or NMI handlers) the ARM documentation states that the system remains in lockup until halted by the debugger, which suggests that even then debugging should still work.  I'm not sure if it's worth trying JTAG instead of SWD, in case that makes any difference. 

I haven't had a chance to try the trace solution suggested earlier, that may be next on my list.
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: splin on April 20, 2018, 03:05:42 pm
Have you tried using the memory protection unit to catch and prevent any loose pointer writes corrupting critical registers such as System Control, SCS, debug registers etc in the 0xE000xxxx range?

Might be a bit awkward if your application needs to rewrite some of these registers after initialisation but you should still be able to protect most with up to 8 memory protection regions.
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: ajb on April 20, 2018, 03:13:58 pm
Looks like this is finally resolved! 

I found a thread on ST's forums yesterday with two other people experiencing similar problems, and one of them had apparently traced it to the QSPI timeout feature.  Normally, with an external QSPI flash in memory mapped mode, the MCU will hold the QSPI CS line constantly asserted.  Since flash memories usually consume more power when CS is asserted, the QSPI controller offers a feature where after a certain period of inactivity it deasserts CS, allowing the flash to go into a lower power state.  The next time the memory is accessed, it should reassert CS and immediately resume operation, seamlessly returning the requested data.  Except occasionally that doesn't seem to happen correctly.  The hypothesis is that if a memory access is attempted around the time that the timeout expires, the hardware gets into some sort of deadlock that winds up stalling the entire AHB, hence completely locking up the MCU.  This would explain why the problem is intermittent and variable, because it depends on the timing of those two events lining up in a particular way, and any change to the application that changes the timing characteristics of QSPI accesses will change the way that the lockups manifest.

The upshot is, since I don't particularly care about power consumption in my application, I can simply disable the timeout feature, which took the configuration on my desk from locking up about every five minutes to going almost twenty hours with no other changes!  So I'm pretty confident that's the source of the issue, and re-enabling the timeout, which is a single bit in an initialization value, brings it back to constant lockups.  I'll hopefully do a minimal demonstration of the issue a little later to further confirm the issue, and assuming that confirms what's currently suspected, hopefully this will get added to the errata sheet.
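
To make the "single bit" concrete, here is a minimal sketch of building the QUADSPI->CR initialization value with the timeout counter left off. The bit positions are as I read them from the STM32F7 reference manual (the CMSIS header already provides `QUADSPI_CR_TCEN`); the helper name `qspi_cr_init_value` is hypothetical.

```c
#include <stdint.h>

/* Assumed QUADSPI->CR bit layout; verify against the reference manual. */
#define CR_EN            (1u << 0)   /* peripheral enable */
#define CR_TCEN          (1u << 3)   /* timeout counter enable */
#define CR_PRESCALER_POS 24          /* 8-bit clock prescaler field */

/* Hypothetical helper: compose the CR value written during init.
 * With enable_timeout == 0 the TCEN bit stays clear, so nCS remains
 * asserted after a memory-mapped access instead of being released. */
uint32_t qspi_cr_init_value(uint32_t prescaler, int enable_timeout)
{
    uint32_t cr = (prescaler & 0xFFu) << CR_PRESCALER_POS;
    if (enable_timeout)
        cr |= CR_TCEN;
    return cr;
}
```

The trade-off is the one described above: the external flash never gets the chance to drop into its lower-power state, which only matters if power consumption does.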
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: andyturk on April 20, 2018, 06:31:37 pm
Quote from: ajb
... The hypothesis is that if a memory access is attempted around the time that the timeout expires, the hardware gets into some sort of deadlock that winds up stalling the entire AHB, hence completely locking up the MCU.

Wow. Is that a problem with all of ST's QSPI implementations?
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...?
Post by: ajb on April 21, 2018, 06:28:33 am
Quote from: andyturk
Quote from: ajb
... The hypothesis is that if a memory access is attempted around the time that the timeout expires, the hardware gets into some sort of deadlock that winds up stalling the entire AHB, hence completely locking up the MCU.
Wow. Is that a problem with all of ST's QSPI implementations?

At least the F767/F777 and F746.  Beyond that, I don't know how many other parts share the same implementation, and I don't have any other QSPI-equipped parts on hand to test.  I have a minimal program running on a Nucleo-F767ZI that exhibits the problem--no interrupts, no peripherals besides QSPI and GPIO, runs from the 16MHz HSI.  Scoping the CS line and a GPIO toggle, sure enough, it locks up when a QSPI access happens right about when CS is released.

I'll include the demo program below in case anyone wants to try it on other hardware.  Be warned that it's a bit sensitive to timing, especially when attempting to sweep the delay time instead of the QUADSPI timeout value.  A big chunk of the file is defines for QUADSPI config values and GPIO macros from my library, the actual program is really quite simple.

Code:
/*
 * The following is a minimal demonstration of an apparent problem with
 * the QUADSPI peripheral on STM32F7 parts.  The QUADSPI seems to totally
 * lock up the MCU--even preventing debugging--under the following conditions:
 *  - The QUADSPI is in memory mapped mode
 *  - The QUADSPI's timeout counter is enabled
 *  - The timeout counter expires, releasing the QUADSPI_NCS line, right about
 *  the time that a memory-mapped QUADSPI access is attempted.
 *
 * This program has been run on a Nucleo-F767ZI board and successfully
 * demonstrated the problem.  It does not require that an external IC be
 * connected to the QUADSPI interface.  It should run on similar parts with
 * no modification.
 *
 * Be aware that the nature of this apparent fault makes the program rather
 * delicate with regard to timing.  It has been successful when compiled
 * without optimization.  It would probably be more reliable if the main test
 * loop were written in assembly, but even so the run-time performance
 * optimizations of the Cortex-M7 may introduce sufficient timing variations
 * that it takes several tries to achieve the timing required to cause the
 * fault.
 *
 */


/* Includes */
#include "stm32f7xx.h"



/* ****************************************************************************
 * QUADSPI CONFIG VALUES
 * ****************************************************************************/

#define QUADSPI_DCR_FSIZE_Pos 16
#define QUADSPI_CR_PRESCALER_Pos 24
#define QUADSPI_DCR_CSHT_Pos 8

/* QUADSPI->CCR.FMODE specifies the functional mode */
#define QUADSPI_CCR_FMODE_gp 26
#define QUADSPI_CCR_FMODE_MEMMAP (3<<QUADSPI_CCR_FMODE_gp)

/* QUADSPI->CCR.DMODE specifies the number of IO lines to use for
 * data transfer */
#define QUADSPI_CCR_DMODE_gp 24
#define QUADSPI_CCR_DMODE_SINGLE (1<<QUADSPI_CCR_DMODE_gp)

/* QUADSPI->CCR.DCYC specifies the number of dummy cycles to insert between
 * the address and data phases of the transaction */
#define QUADSPI_CCR_DCYC_gp 18
#define QUADSPI_CCR_DCYC_gm (0x1f<<QUADSPI_CCR_DCYC_gp)
#define QUADSPI_CCR_DCYC_m(num) ((num<<QUADSPI_CCR_DCYC_gp) & QUADSPI_CCR_DCYC_gm)


/* QUADSPI->CCR.ABMODE specifies the number of IO lines to use for
 * alternate byte transfer */
#define QUADSPI_CCR_ABMODE_gp 14
#define QUADSPI_CCR_ABMODE_NONE (0x0<<QUADSPI_CCR_ABMODE_gp)

/* QUADSPI->CCR.ADSIZE specifies the number of address bits to use */
#define QUADSPI_CCR_ADSIZE_gp 12
#define QUADSPI_CCR_ADSIZE_24BIT (0x2<<QUADSPI_CCR_ADSIZE_gp)

/* QUADSPI->CCR.ADMODE specifies the number of IO lines to use for
 * address transfer */
#define QUADSPI_CCR_ADMODE_gp 10
#define QUADSPI_CCR_ADMODE_gm (0x3<<QUADSPI_CCR_ADMODE_gp)
#define QUADSPI_CCR_ADMODE_SINGLE (0x1<<QUADSPI_CCR_ADMODE_gp)


/* QUADSPI->CCR.IMODE specifies the number of IO lines to use for
 * instruction transfer */
#define QUADSPI_CCR_IMODE_gp 8
#define QUADSPI_CCR_IMODE_gm (0x3<<QUADSPI_CCR_IMODE_gp)
#define QUADSPI_CCR_IMODE_SINGLE (0x1<<QUADSPI_CCR_IMODE_gp)


#define QSPI_CCR_MEMMAP_VAL (QUADSPI_CCR_FMODE_MEMMAP |\
QUADSPI_CCR_DMODE_SINGLE |\
QUADSPI_CCR_DCYC_m(0)|\
QUADSPI_CCR_ABMODE_NONE |\
QUADSPI_CCR_ADSIZE_24BIT |\
QUADSPI_CCR_ADMODE_SINGLE |\
QUADSPI_CCR_IMODE_SINGLE |\
EXTFLASH_CMD_READ)

#define EXTFLASH_CMD_READ 0x03

/* ****************************************************************************
 * GPIO CONFIG MACROS
 * ****************************************************************************/

#define IO_OUTSET(...) /* set (port, pin) output */ IO_OUTSET_SUB(__VA_ARGS__)
#define IO_OUTSET_SUB(port, pin) GPIO##port->BSRR = (1<<pin)

#define IO_OUTCLR_SUB(port, pin) GPIO##port->BSRR = ((1<<pin)<<16)
#define IO_OUTCLR(...) /* clear (port, pin) output */ IO_OUTCLR_SUB(__VA_ARGS__)

#define IO_OUTTGL(...) /* toggle (port, pin) output */ IO_OUTTGL_SUB(__VA_ARGS__)
#define IO_OUTTGL_SUB(port, pin) GPIO##port->ODR = (GPIO##port->ODR ^ (1<<pin));

#define IO_DIROUT(...) /* set (port, pin) as output */ IO_DIROUT_SUB(__VA_ARGS__)
#define IO_DIROUT_SUB(port, pin) IO_SET_MODE(port,pin, IO_MODER_OUTPUT_gv)

#define IO_SET_MODE(...) IO_SET_MODE_SUB(__VA_ARGS__)
#define IO_SET_MODE_SUB(gpio, pin, mode_gv) GPIO##gpio->MODER = ((GPIO##gpio->MODER & ~IO_MODER_gm(pin)) | ((mode_gv << (pin*2))&IO_MODER_gm(pin)))

#define IO_MODER_gm(pinNum)         (0x03<<(2*pinNum))
#define IO_MODER_INPUT_gv 0x00
#define IO_MODER_OUTPUT_gv 0x01
#define IO_MODER_ALT_gv 0x02
#define IO_MODER_ANALOG_gv 0x03

#define IO_AF_SEL(...)                        IO_AF_SEL_SUB(__VA_ARGS__)
#define IO_AF_SEL_SUB(gpio, pin, af_gv)   IO_AF_SEL_PIN##pin(gpio, pin, af_gv)

#define IO_AF_SEL_PIN0(...) IO_AF_SEL_SUB_LO(__VA_ARGS__)
#define IO_AF_SEL_PIN1(...) IO_AF_SEL_SUB_LO(__VA_ARGS__)
#define IO_AF_SEL_PIN2(...) IO_AF_SEL_SUB_LO(__VA_ARGS__)
#define IO_AF_SEL_PIN3(...) IO_AF_SEL_SUB_LO(__VA_ARGS__)
#define IO_AF_SEL_PIN4(...) IO_AF_SEL_SUB_LO(__VA_ARGS__)
#define IO_AF_SEL_PIN5(...) IO_AF_SEL_SUB_LO(__VA_ARGS__)
#define IO_AF_SEL_PIN6(...) IO_AF_SEL_SUB_LO(__VA_ARGS__)
#define IO_AF_SEL_PIN7(...) IO_AF_SEL_SUB_LO(__VA_ARGS__)

#define IO_AF_SEL_PIN8(...) IO_AF_SEL_SUB_HI(__VA_ARGS__)
#define IO_AF_SEL_PIN9(...) IO_AF_SEL_SUB_HI(__VA_ARGS__)
#define IO_AF_SEL_PIN10(...) IO_AF_SEL_SUB_HI(__VA_ARGS__)
#define IO_AF_SEL_PIN11(...) IO_AF_SEL_SUB_HI(__VA_ARGS__)
#define IO_AF_SEL_PIN12(...) IO_AF_SEL_SUB_HI(__VA_ARGS__)
#define IO_AF_SEL_PIN13(...) IO_AF_SEL_SUB_HI(__VA_ARGS__)
#define IO_AF_SEL_PIN14(...) IO_AF_SEL_SUB_HI(__VA_ARGS__)
#define IO_AF_SEL_PIN15(...) IO_AF_SEL_SUB_HI(__VA_ARGS__)

#define IO_AF_SEL_SUB_HI(gpio, pin, af_gv)    GPIO##gpio->AFR[1] = ((GPIO##gpio->AFR[1] & ~IO_AFRH_gm(pin)) | ((af_gv << ((pin-8)*4)) & IO_AFRH_gm(pin)));
#define IO_AF_SEL_SUB_LO(gpio, pin, af_gv)    GPIO##gpio->AFR[0] = ((GPIO##gpio->AFR[0] & ~IO_AFRL_gm(pin)) | ((af_gv << (pin*4)) & IO_AFRL_gm(pin)));

#define IO_AFRL_gm(pinNum)                    (0x0f<<(pinNum*4))
#define IO_AFRH_gm(pinNum)                    (0x0f<<((pinNum-8)*4))

/* ****************************************************************************
 * GPIOS IN USE
 * ****************************************************************************/

#define LED1 B,0 //GREEN
#define LED2 B,7 //BLUE
#define LED3 B,14 //RED

#define QUADSPI_CLK F,10
#define QUADSPI_NCS B,6

#define QUADSPI_CLK_AF 9
#define QUADSPI_NCS_AF 10


/* Private macro */
/* Private variables */
/* Private function prototypes */
/* Private functions */

void QSPI_memMapMode(void);
void QSPI_memMapModeDisable(void);

void QSPI_init(void){

    IO_SET_MODE(QUADSPI_CLK, IO_MODER_ALT_gv);
    IO_SET_MODE(QUADSPI_NCS, IO_MODER_ALT_gv);

    IO_AF_SEL(QUADSPI_CLK, QUADSPI_CLK_AF);
    IO_AF_SEL(QUADSPI_NCS, QUADSPI_NCS_AF);

    RCC->AHB3ENR |= RCC_AHB3ENR_QSPIEN;

    // clk = AHB, using bank 1
    QUADSPI->CR = 0<<QUADSPI_CR_PRESCALER_Pos | 0;

    /** vvvv  This is the problem  vvvv  ******************************************/

    QUADSPI->CR |= QUADSPI_CR_TCEN;  // enable timeout in memory mapped mode

    /** ^^^^  This is the problem  ^^^^  ******************************************/

    QUADSPI->DCR =
        ( 23 << QUADSPI_DCR_FSIZE_Pos ) | // FSIZE = 23 -> 2^(23+1) bytes = 16MB (24 address bits)
        ( 7 << QUADSPI_DCR_CSHT_Pos);  // 7 -> 8 clock cycles of nCS high between commands

    QUADSPI->CR |= QUADSPI_CR_EN;

    QUADSPI->FCR =
        QUADSPI_FCR_CTEF |
        QUADSPI_FCR_CTCF |
        QUADSPI_FCR_CSMF |
        QUADSPI_FCR_CTOF;
    QUADSPI->DLR = 0;

    QUADSPI->LPTR = 0;

    //Configure QSPI for memory-mapped mode
    QSPI_memMapMode();
}


void QSPI_memMapMode(void){
    QUADSPI->CCR = QSPI_CCR_MEMMAP_VAL;
}

void QSPI_memMapModeDisable(void){
    QUADSPI->CCR = 0;
    QUADSPI->CR |= QUADSPI_CR_ABORT;
    while(QUADSPI->CR & QUADSPI_CR_ABORT);
}


uint32_t delay_time=10;
volatile uint8_t dummyval;
volatile uint8_t * exfl = (uint8_t*)(0x90000000);
uint16_t offs=0;
int main(void)
{
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOBEN | RCC_AHB1ENR_GPIOFEN;
    IO_DIROUT(LED1);
    IO_DIROUT(LED2);
    IO_DIROUT(LED3);

    /* configure SWO pin (optional -- since the MCU will
     * wind up in a state where debugging is impossible,
     * the only way to extract the parameters that trigger
     * the fault is to capture that data beforehand).
     */
    IO_SET_MODE(B,3,IO_MODER_ALT_gv);
    IO_AF_SEL(B,3, 0);

    QSPI_init();

    while (1)
    {
        for(uint32_t i=0; i<500; i++){
            /* many reads are made at each timing step to
             * increase the chances of the fault occurring, since
             * it is so sensitive to timing.
             */
            IO_OUTSET(LED3);
            for(uint32_t j=0; j < delay_time; j++);
            IO_OUTCLR(LED3);
            dummyval = *(exfl+(offs++));
        }
        IO_OUTSET(LED2);
        if( (delay_time % 100) ==0) IO_OUTTGL(LED1);

        //delay_time+=1;

        QSPI_memMapModeDisable();
        /* sweeping the QSPI timeout works better than sweeping
         * delay_time because the timer (running at AHB frequency)
         * has better granularity than a C spin loop, therefore the
         * chances of the timing lining up to cause the fault are
         * much improved.
         */
        QUADSPI->LPTR++;
        QSPI_memMapMode();
        IO_OUTCLR(LED2);
    }
}
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...? (Resolved!)
Post by: splin on April 21, 2018, 12:08:22 pm
This is the sort of thing that can wake you up in a cold sweat - luckily in this case the bug could be found as the CS signal was external to the chip and thus observable. If it was an internal signal then only ST would have the ability to trace it.

If you're a *very* big customer you could probably persuade ST to take an interest but even then you'd likely need a very high level of proof that it wasn't a problem with your software. Small developers and medium size enterprises would likely have no chance of getting ST to investigate in any meaningful way and may have to abandon the development or restart with another device, with the potential to bankrupt the developer.

A good reason to avoid the bleeding edge and only use devices that have been out for at least 2 or 3 years, especially if they are complex or you need to use them in unusual ways. Using the latest part because it has the speed or memory size you need should be relatively low risk so long as you aren't using peripherals that are brand new in that part. Sure, you can evaluate the part using the manufacturer's example code, but there is always the risk of running into an obscure timing error which only manifests itself in your application, too infrequently for you to have any chance of debugging it but often enough to make it unsaleable.

Just looking at errata sheets makes you wonder just how many other obscure bugs are lurking in the silicon. One that amused me is from the STM32F3 series:

Quote
2.2.4 Load multiple not supported by ADC interface

Description


The ADC interface does not support LDM, STM, LDRD and STRD instructions for successive multiple-data read and write accesses to a contiguous address block.

Workaround

The workaround consists in preventing compilers from generating LDM, STM, LDRD and STRD instructions. In general, this can be achieved through organizing the source code such as to avoid consecutive read and write accesses to neighboring addresses in lower-to-higher order. In case where consecutive read and write accesses to neighboring addresses cannot be avoided, order the source code such as to access higher address first.

So, to be absolutely sure, all you need to do is either write all your ADC code in assembler, persuade your compiler writer to add a 'NO_LOAD_OR_STORE_MULTIPLE_INSTRUCTIONS' pragma or inspect the generated code for the offending instructions. And repeat every time you update the compiler or you change any of the ADC code. What fun! But at least it's documented and thus avoidable.
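The "inspect the generated code" option can at least be partly automated. Below is a minimal sketch, assuming objdump-style disassembly text (one instruction per line, mnemonic as a whitespace-delimited token). The function name `is_multi_access_line` is hypothetical, and the prefix match is deliberately crude: it flags every LDM/STM/LDRD/STRD variant in the listing, not only those that touch the ADC.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive check: does the token (tok, len) begin with prefix? */
static int starts_with_ci(const char *tok, size_t len, const char *prefix)
{
    size_t plen = strlen(prefix);
    if (len < plen)
        return 0;
    for (size_t i = 0; i < plen; i++) {
        if (tolower((unsigned char)tok[i]) != prefix[i])
            return 0;
    }
    return 1;
}

/* Returns 1 if any whitespace-delimited token on the disassembly line
 * starts with ldm, stm, ldrd or strd (e.g. "ldmia.w", "stmdb", "ldrd"),
 * and 0 otherwise. Hypothetical helper, for illustration only. */
int is_multi_access_line(const char *line)
{
    static const char *const prefixes[] = { "ldm", "stm", "ldrd", "strd" };
    const char *p = line;

    assert(line != NULL);
    while (*p != '\0') {
        /* Skip leading whitespace, then measure the next token. */
        while (*p != '\0' && isspace((unsigned char)*p))
            p++;
        const char *tok = p;
        while (*p != '\0' && !isspace((unsigned char)*p))
            p++;
        size_t len = (size_t)(p - tok);

        for (size_t i = 0; i < sizeof prefixes / sizeof prefixes[0]; i++) {
            if (starts_with_ci(tok, len, prefixes[i]))
                return 1;
        }
    }
    return 0;
}
```

Feeding each line of the disassembly for the ADC driver through this function gives a short list to review by hand. It would still need to be rerun after every compiler update or ADC code change, as noted above.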
Title: Re: Catching an application hang in the act (Resolved - new errata entry pending)
Post by: ajb on May 23, 2018, 05:20:05 pm
Minor update.  I submitted a support request to ST with the outline of the problem and the complete minimal example program.  There was no response for a month (possibly because I was honest about the production quantity...), but they've finally confirmed the behavior and said it will be added to the next errata sheet revision.   :clap:

Interestingly, their response said "it's a known limitation"--I really hope it's a newly known limitation, because if not, why the hell wasn't it already in the errata sheet?  I'm going to choose to believe that they've only recently learned about it, otherwise I'll be too pissed off about how much time I could have saved in dealing with this problem if there had been an errata entry for it.
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...? (Resolved!)
Post by: andersm on May 23, 2018, 09:26:28 pm
Quote
So, to be absolutely sure, all you need to do is either write all your ADC code in assembler, persuade your compiler writer to add a 'NO_LOAD_OR_STORE_MULTIPLE_INSTRUCTIONS' pragma or inspect the generated code for the offending instructions. And repeat every time you update the compiler or you change any of the ADC code. What fun! But at least it's documented and thus avoidable.
The Cortex-M0 always restarts an LDM/STM instruction if it is interrupted, and the M3, M4, and M7 do so under certain conditions (e.g. if the instruction is inside an IT block, or if an LDM instruction updates the base register), so you should be careful about using those instructions with hardware registers anyway. The core technical reference manuals all contain the following paragraph:
Quote
This means that software must not use load-multiple or store-multiple instructions to access a device or access a memory region that is read-sensitive or sensitive to repeated writes. The software must not use these instructions in any case where repeated reads or writes might cause inconsistent results or unwanted side-effects.
On a related note, a lot of PIC32MX chips have an erratum where an interrupted store instruction isn't correctly aborted, so you should really always disable interrupts when writing to e.g. peripheral transmit registers.
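To make the "read-sensitive" point from the TRM paragraph concrete, here is a small host-side model, not real hardware access. It assumes a FIFO data register that pops one entry on every read; the type `fifo_model_t` and both functions are hypothetical names used only for this illustration. A load that is interrupted and restarted reads the register twice, so one entry is silently lost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side model of a read-sensitive FIFO data register:
 * every read pops one entry. Hypothetical, for illustration only. */
typedef struct {
    const uint8_t *data;  /* bytes queued in the FIFO */
    size_t len;           /* number of queued bytes */
    size_t head;          /* index of the next byte to pop */
} fifo_model_t;

/* Models one bus read of the FIFO register.
 * Returns 0 once the FIFO is empty. */
uint8_t fifo_reg_read(fifo_model_t *f)
{
    assert(f != NULL);
    return (f->head < f->len) ? f->data[f->head++] : 0;
}

/* Models a load that is interrupted after the bus access and then
 * restarted from the beginning: the register is read twice and only
 * the second value reaches the destination register. */
uint8_t restarted_read(fifo_model_t *f)
{
    (void)fifo_reg_read(f);   /* first attempt, result thrown away */
    return fifo_reg_read(f);  /* restarted access pops a second entry */
}
```

With a FIFO holding 0xA1, 0xB2, 0xC3, a restarted read in this model returns 0xB2 and the 0xA1 entry is gone, which is the kind of inconsistent result the TRM warns about.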
Title: Re: Catching an application hang in the act. J-Link+Atollic, or...? (Resolved!)
Post by: bson on May 31, 2018, 03:38:20 pm
Quote
So, to be absolutely sure, all you need to do is either write all your ADC code in assembler, persuade your compiler writer to add a 'NO_LOAD_OR_STORE_MULTIPLE_INSTRUCTIONS' pragma or inspect the generated code for the offending instructions. And repeat every time you update the compiler or you change any of the ADC code. What fun! But at least it's documented and thus avoidable.
Hmm.  An ADC conversion register (or FIFO head register) should be declared volatile, and a compiler shouldn't use LDM/STM to access a volatile register, at least for certain target architectures.
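A minimal sketch of that idea combined with the errata workaround quoted earlier (access the higher address first). The register layout `adc_regs_t` and the function `adc_read_pair` are hypothetical and do not reflect the real STM32F3 ADC register map. Whether a given compiler really avoids LDM/LDRD for volatile accesses is toolchain-specific and not something the C standard promises, so checking the disassembly is still worthwhile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with two adjacent 32-bit data registers.
 * Each member is volatile so every access in the source is a
 * separate, ordered access as far as the C abstract machine goes. */
typedef struct {
    volatile uint32_t DATA_LO;  /* lower address */
    volatile uint32_t DATA_HI;  /* higher address */
} adc_regs_t;

/* Reads both registers as two distinct volatile accesses,
 * higher address first, and packs them into one 64-bit value. */
uint64_t adc_read_pair(adc_regs_t *adc)
{
    assert(adc != NULL);
    uint32_t hi = adc->DATA_HI;  /* higher address first */
    uint32_t lo = adc->DATA_LO;  /* then the lower address */
    return ((uint64_t)hi << 32) | lo;
}
```

On a target, `adc` would point at the peripheral's base address; on a host, the function can be exercised against an ordinary struct in RAM.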