Author Topic: 32F4 - 3 ways to detect end of DMA transfer (Read 4754 times)

nctnico · « **Reply #25 on:** November 19, 2022, 05:19:50 pm »

The above is a nice summary, but it still seems to assume that the code is actually executed as intended. This may not be the case at all times (think of radiation). These events will be rare, but if you run enough devices for long enough, they will start to pop up as users reporting strange behaviour. That is why I have range checks at various levels in my software. At least the software doesn't continue processing with a wrong value. Since many applications do things in a loop anyway (read input, process, create output), skipping one iteration and recovering on the next does not affect device functionality in many cases.

voltsandjolts · « **Reply #26 on:** November 19, 2022, 05:34:43 pm »

Quote from: nctnico on November 19, 2022, 05:19:50 pm

think of radiation

Nah.
Then you are into the realms of random bit flips in SRAM or whatever.

Spend your life writing error handlers that never get used?
Unless you're in space applications, don't worry about it.

nctnico · « **Reply #27 on:** November 19, 2022, 05:58:15 pm »

There are other causes as well, like external noise. Don't get stuck on just one example.

Siwastaja · « **Reply #28 on:** November 19, 2022, 06:15:50 pm »

That's why signal integrity is handled on a different level: the HW level. We almost always work under the assumption that bit-flips simply do not happen in an MCU. Even with radiation, rad-hard parts are usually used so that this guarantee can still be upheld.

I mean, software that can recover from internal HW data corruption is super-duper-hyper-mega difficult to write, orders of magnitude harder than points 5&6 in my previous post.

This is why we have EMC testing and qualification: products are subjected to much higher levels of external noise than are ever present around even remotely compliant devices, and normal operation under these artificial conditions is verified. This is why we look at eye diagrams on communication buses. For example, RAM address and data buses are not checksummed or anything, yet computers work reliably! (And ECC DRAM exists because of bitflips in the DRAM itself, which are part of DRAM technology and its compromise between price, energy consumption, and amount of memory available.)

But of course, if you find that a mental model of radiation- or noise-induced bitflips helps you write more robust code in general, that's fine. What actually causes failures is the plain old bug.

nctnico · « **Reply #29 on:** November 19, 2022, 06:28:32 pm »

Google did some actual research into this field:
https://www.cnet.com/culture/google-computer-memory-flakier-than-expected/
Regardless of the actual cause of an error, the fact of life is that errors do happen. One PC may run reliably when it is turned on and off every day, but put thousands of PCs together and suddenly you get a measurable number of errors that cannot be explained. Microcontrollers are no different. So if you add a bit of resilience to your code (it doesn't need to be super fancy), you'll be adding reliability to your product.

One of the interesting things I have observed over the past decades is that when a microcontroller goes off the rails and has the program counter at a random address in the address range where the code is, quite often it will get back on track executing the code properly again. It is almost as if instruction opcodes are designed to do that. As a test try to disassemble some code and you'll see the disassembler will get back on track after a few instructions.

Siwastaja · « **Reply #30 on:** November 19, 2022, 06:41:53 pm »

Corruption and bit-flips in DRAM are completely normal; it's specified operation. This is why ECC exists. Google's result about water being wet is not interesting when we discuss oranges.

Microcontrollers are totally different: they are designed to be nearly free of such bitflips, which is why they use internal SRAM. Of course, cheap parts do not guarantee this and you have to pay a premium to get the guarantee, but in practice cheap parts are just as good.

Bitflips can and will happen; you can even calculate the rate at which metastability propagates through the logic when using 2 or 3 flip-flops to synchronize inputs. But while others assume this rate is zero, you probably overestimate it by many orders of magnitude. So the question is exactly how much effort you are going to put into making the code rad-hard. You know how difficult this is, don't you?

In any case, I'm 100% sure your "robust" code does not survive random bitflips. Writing software which eliminates (or even significantly reduces) the impact of random physical memory corruption is much more than "do not use function pointers" or other stupid nearly irrelevant bullshit. Maybe some simple measures are slightly better than nothing, maybe not. But many of those not scientifically proven half-baked measures are still very useful against plain old bugs, so go for it.

An example case in point: add canaries between global variables, or between the stack and .data. Overindexing by +1 is a usual mental fuck-up (even if you validated inputs). The bug is caught, good. Now consider a random bitflip corrupting an address held in memory, or the program counter, or whatever. The resulting address is now totally random. Very unlikely to hit the canary. Likely to hit an invalid region, causing hardfault, which is good, but unless that happens, it likely hits something valid in memory and totally wreaks havoc, and there is nothing (simple) you can do. If you are truly concerned about this, having two CPUs with separate memories running in lock-step is a classic solution, but expensive as hell, and there's still the question of how to handle the error, and who handles it.
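To make the canary idea concrete, here is a minimal sketch in C. It is only an illustration under assumptions: the guard word sits after a buffer inside a struct (placing canaries between globals or between the stack and .data would need linker script work that is not shown), and the names `CANARY_VALUE`, `guarded_buf_t`, `guarded_init`, `canary_intact` and `simulate_overrun_by_one` are all made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CANARY_VALUE 0xDEADBEEFu   /* hypothetical magic pattern */
#define BUF_LEN      8u

/* Buffer with a guard word directly behind it. Struct member order is
 * fixed, so writing one element past data[] lands in the canary. */
typedef struct {
    uint8_t  data[BUF_LEN];
    uint32_t canary;
} guarded_buf_t;

static void guarded_init(guarded_buf_t *g)
{
    assert(g != NULL);
    memset(g->data, 0, sizeof g->data);
    g->canary = CANARY_VALUE;
}

/* Returns true while the guard word still holds the magic pattern. */
static bool canary_intact(const guarded_buf_t *g)
{
    assert(g != NULL);
    return g->canary == CANARY_VALUE;
}

/* Simulates the classic off-by-one write. It goes through a byte pointer
 * to the whole struct so that the demo itself stays well-defined C. */
static void simulate_overrun_by_one(guarded_buf_t *g)
{
    uint8_t *bytes = (uint8_t *)g;
    bytes[offsetof(guarded_buf_t, canary)] = 0x00;
}
```

Calling `canary_intact()` periodically catches the +1 overrun, but as argued above it says nothing about a write that lands at a random address elsewhere in memory.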

peter-h · « **Reply #31 on:** November 19, 2022, 06:54:42 pm »

So, instead of that forever loop (which one breaks out of if the transfer is finished) should the loop break out if some large integer overflows?

How large should it be?

One needs to quantify that experimentally. Or one could go for some massive margin: a uint32 takes some seconds to overflow, so that should work.

Alternatively one could use a timer - the 1kHz "systick" one usually present. That is more dangerous because in an RTOS environment any bit of your code could get delayed (possibly by tens of ms if you are writing FLASH, as a rare process). And sometimes (like in my boot block code) there are no interrupts at all.

And there is no assurance that exiting that loop will produce a working system. A watchdog is the only way. There are various ways to trigger it, e.g. an overflow of the above uint32 sets a flag, various tasks can also set this flag, and an RTOS task dedicated to checking that flag and various other flags sets off the watchdog.
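The flag-plus-supervisor arrangement described above could look roughly like this. It is a sketch, not anybody's actual code: the fault bit names, `report_fault`, `supervisor_poll` and the `watchdog_kick` stand-in are hypothetical, and the counter replaces what would be a watchdog reload register write on real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fault bits; any task may raise one. */
#define FAULT_DMA_TIMEOUT  (1u << 0)
#define FAULT_TASK_STALLED (1u << 1)

static volatile uint32_t fault_flags;  /* set by tasks, read by the supervisor */
static uint32_t watchdog_kicks;        /* stand-in for the hardware watchdog reload */

/* Called by any task that detects a problem, e.g. the timeout counter
 * running out. Note: |= is a read-modify-write, so under a preemptive
 * RTOS this would need a critical section or an atomic operation. */
static void report_fault(uint32_t fault_bit)
{
    fault_flags |= fault_bit;
}

/* Stand-in for refreshing the real watchdog. */
static void watchdog_kick(void)
{
    watchdog_kicks++;
}

/* Body of the dedicated supervisor task, called periodically.
 * Returns true if the watchdog was refreshed. Once any fault is
 * flagged it stops refreshing, so the watchdog expires and resets. */
static bool supervisor_poll(void)
{
    if (fault_flags != 0u) {
        return false;
    }
    watchdog_kick();
    return true;
}
```

The design choice here is that only one place ever refreshes the watchdog, so every failure path reduces to "set a bit and carry on".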

Metastability is a different issue. There is probably clock sync going on between the CPU clock and the PCLK1/2 peripherals, but the latter is divided down from the former so unless they did something dumb and got the edges in exactly the wrong place, they will have a constant and safe margin. Also this is old science - 40 years ago you had chips like the Z85C30 SCC which were totally async to the CPU yet they worked 100%, and AFAIK they used only a double sync. That said, I would be amazed if there was a meta issue between the CPU and the peripherals, since the clock division can be done fully sync i.e. the peripheral gets its edge a sub nanosecond after the CPU gets its edge.

nctnico · « **Reply #32 on:** November 19, 2022, 08:44:33 pm »

Quote from: Siwastaja on November 19, 2022, 06:41:53 pm

Very unlikely to hit the canary. Likely to hit an invalid region, causing hardfault, which is good, but unless that happens, it likely hits something valid in memory and totally wreaks havoc, and there is nothing (simple) you can do. If you are truly concerned about this, having two CPUs with separate memories running in lock-step is a classic solution, but expensive as hell, and there's still the question of how to handle the error, and who handles it.

You are missing the point here. My point is that with some simple checks in place your device will survive a bit-flip due to whatever cause without going off the rails completely. There really is no need to obsess over exact causes, graceful recovery strategies, etc. For most generic, consumer and industrial products, having some resilience against using (obviously) wrong values will be enough. Of course things are different when lives depend on it, but those are not the products I'm referring to.

Quote from: peter-h on November 19, 2022, 06:54:42 pm

So, instead of that forever loop (which one breaks out of if the transfer is finished) should the loop break out if some large integer overflows?

How large should it be?

One needs to quantify that experimentally. Or one could go for some massive margin: a uint32 takes some seconds to overflow, so that should work.

Alternatively one could use a timer - the 1kHz "systick" one usually present.

Using a 'large enough number' to exit the loop is one way. Using a hardware timer or counter that increments by itself is another (no interrupts needed; just record the start count and break the loop once the count has advanced far enough). Time spent in interrupts or other processing doesn't matter as long as you check whether the loop ended because the hardware finished the job or because the timer expired.
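The free-running counter approach can be sketched as follows. The function pointers stand in for "read the hardware counter" and "check the transfer-complete condition"; `wait_done_or_timeout`, `wait_result_t` and the `fake_*` helpers are names invented for this example, and the fakes exist only so the logic can be exercised off-target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { WAIT_DONE, WAIT_TIMEOUT } wait_result_t;

typedef uint32_t (*read_ticks_fn)(void);  /* reads a free-running counter */
typedef bool     (*is_done_fn)(void);     /* true once the job has finished */

/* Polls is_done() until it reports completion or until the counter has
 * advanced by timeout_ticks. The result says which of the two ended the wait. */
static wait_result_t wait_done_or_timeout(read_ticks_fn read_ticks,
                                          is_done_fn is_done,
                                          uint32_t timeout_ticks)
{
    assert(read_ticks != NULL && is_done != NULL);
    const uint32_t start = read_ticks();
    for (;;) {
        if (is_done()) {
            return WAIT_DONE;
        }
        /* Unsigned subtraction is modulo 2^32, so the elapsed count is
         * still correct when the counter wraps during the wait. */
        if ((uint32_t)(read_ticks() - start) >= timeout_ticks) {
            return WAIT_TIMEOUT;
        }
    }
}

/* Fakes standing in for hardware, for trying the logic on a PC. */
static uint32_t fake_ticks;
static bool     fake_done;
static uint32_t fake_read_ticks(void) { return fake_ticks++; }
static bool     fake_is_done(void)    { return fake_done; }
```

Because completion is checked first on every pass, a transfer that finishes while the CPU was busy in an interrupt is still reported as `WAIT_DONE` rather than as a timeout.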

peter-h · « **Reply #33 on:** November 19, 2022, 09:55:35 pm »

Can anyone think of a way that DMA transfer might get stuck, without defective silicon?

Only by some bug like a stray pointer (in another RTOS task) poking a value into NDTR, so it never reaches zero, etc...

How about this?

uint32_t x = 0xffffffff;
while (true)
{
    if (transfer_ended()) break;    // placeholder for whichever end-of-transfer check is used
    x--;
    if (x == 0) break;              // timeout
}

Although I can't see how much better this is than triggering the watchdog if x==0.

nctnico · « **Reply #34 on:** November 19, 2022, 10:05:45 pm »

It all depends on how critical the DMA transfer is. If it is for an ethernet packet, it is not critical at all, as either packet loss is allowed or the packet will be sent again. Same for many other protocols that run over SPI, I2C, UART. Heck, even for sampling an ADC or audio stream it may not be a problem. But it all depends on whether the system is designed to re-synchronise itself or recover (for example by retrying). If the system can re-synchronise / recover itself at a higher level, then a failed DMA transfer isn't critical at all. Think about writing to an external flash. If a DMA failure causes a write or verify to fail, the high level can decide to retry or skip to the next sector. There really is no need to trigger a reset for such an event.
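As a rough illustration of recovering at a higher level, here is a bounded retry wrapper. Everything in it is hypothetical: `write_sector_fn` represents whatever write-and-verify routine sits on top of the DMA transfer, and `fake_write_sector` only simulates a routine that fails a few times before succeeding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical low-level operation: write and verify one flash sector.
 * Returns false if anything failed, including a DMA timeout. */
typedef bool (*write_sector_fn)(uint32_t sector);

/* Tries the write up to max_attempts times. Reports the number of
 * attempts made through attempts_used. A false result lets the caller
 * decide what to do next, e.g. skip to the next sector. */
static bool write_with_retry(write_sector_fn write_sector, uint32_t sector,
                             unsigned max_attempts, unsigned *attempts_used)
{
    assert(write_sector != NULL && attempts_used != NULL);
    *attempts_used = 0;
    while (*attempts_used < max_attempts) {
        (*attempts_used)++;
        if (write_sector(sector)) {
            return true;
        }
    }
    return false;
}

/* Fake sector write that fails a configurable number of times. */
static unsigned fake_failures_left;
static bool fake_write_sector(uint32_t sector)
{
    (void)sector;
    if (fake_failures_left > 0u) {
        fake_failures_left--;
        return false;
    }
    return true;
}
```

No reset is involved: a failed transfer just becomes a `false` return value that the layer above handles.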

peter-h · « **Reply #35 on:** November 20, 2022, 08:22:25 am »

It was only to continue the RTOS task if the DMA mysteriously failed to finish.

EDIT: after extensive testing I have put this bit (in the code in my 1st post) back in

Code: [Select]

hang_around_us(1);
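SPI3->DR;	// dummy read: discard any stale RX data
SPI3->DR;
SPI3->SR;	// on this SPI, a DR read followed by an SR read is the sequence that clears OVR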

Somebody somewhere went down the same rabbit hole and didn't document why, but it is needed in some cases. Exactly which ones, would take me days or weeks of my life to establish. On the face of it, the RX DMA should always remove all data from the SPI RX channel, so maybe the SR read is what actually matters.

It wastes only 1.3us so not worth worrying about. In any performance critical application one would not use DMA for moving so little data that 1.3us is relevant. In fact, various other code (e.g. the use of memcpy and such in low level USB and even ETH code) suggests that the CPU moving 32 bits at a time, 168MHz, outperforms a DMA (typically running at no more than 42MHz and probably half that) a lot of the time. I use DMA mainly to move 512 bytes but since the SPI is run in byte mode, I don't have a 16 bit or 32 bit option.

Siwastaja · « **Reply #36 on:** November 20, 2022, 01:49:07 pm »

Quote from: nctnico on November 19, 2022, 08:44:33 pm

You are missing the point here. My point is that with some simple checks in place your device will survive a bit-flip due to whatever cause without going off the rails completely.

I got your point perfectly clear; it's just that it is wrong. Completely and totally wrong. Random bit-flips are colossally difficult to deal with, "some simple checks" are very unlikely to offer any or much help, because simple checks (such as function argument range checks, peripheral error flag checks) cover some 0.1% of memory. Clearly you have never done rad-hard (where random bitflips are considered; actual reason doesn't need to be radiation); neither have I, but I know enough to leave it to others. But, the silver lining is, you are still doing the right thing because these simple checks have much better coverage against your own bugs (or HW peripheral bugs!), and those are the real causes for problems.

Communication links over an unreliable physical medium are of course different. And there are those nasty cases where you are not certain whether your physical link should be classified as unreliable or not. Better to assume it is when unsure. But then again, bitflips in such an interface can only affect the interface itself, so the surface area for errors is again very small, and trivial checks like bounds-checking a data length field are easy to implement.
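A bounds check on a length field is about as simple as these checks get. The sketch below assumes a made-up frame layout (one length byte followed by that many payload bytes) and a made-up limit `MAX_PAYLOAD`; `frame_payload` is an illustrative name, not part of any real protocol stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PAYLOAD 64u   /* hypothetical protocol limit */

/* Assumed frame layout: frame[0] is the payload length, followed by
 * that many payload bytes. Returns false instead of trusting a length
 * that exceeds the protocol limit or the bytes actually received. */
static bool frame_payload(const uint8_t *frame, size_t frame_len,
                          const uint8_t **payload, size_t *payload_len)
{
    assert(payload != NULL && payload_len != NULL);
    if (frame == NULL || frame_len < 1u) {
        return false;                 /* no room for the length byte */
    }
    const size_t claimed = frame[0];
    if (claimed > MAX_PAYLOAD) {
        return false;                 /* larger than the protocol allows */
    }
    if (claimed > frame_len - 1u) {
        return false;                 /* claims more bytes than were received */
    }
    *payload = &frame[1];
    *payload_len = claimed;
    return true;
}
```

A corrupted length byte then costs one dropped frame rather than a read past the end of the receive buffer.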

paulca · « **Reply #37 on:** November 20, 2022, 09:15:01 pm »

I know this won't be a productive addition to the conversation, but every time I see this thread now I can't help but think...

"When you take the annoying thing back out of the microwave, the DMA transfer WILL be over."

Had another evening of STM32 mystery tour of non working DMA.

peter-h · « **Reply #38 on:** November 20, 2022, 09:52:16 pm »

The 32F4 DMAs are complicated.

FWIW this is now my current code, with a primitive timeout of some tens of seconds which is way longer than any possible transfer

Code: [Select]


	volatile uint32_t limit = 0;	// starts at 0: the first limit-- wraps to 0xFFFFFFFF

	while(true)
	{

		limit--;
		if (limit==0) break;		// timeout after 2^32 passes

		// Either end-transfer detection method below works but the NDTR may be less reliable
		// http://efton.sk/STM32/gotcha/g20.html
		// https://www.eevblog.com/forum/microcontrollers/32f4-3-ways-to-detect-end-of-dma-transfer/
#if 0
		uint16_t temp1;
		temp1 = DMA1_Stream0->NDTR;
		if ( temp1 == 0 ) break;					// transfer count = 0
#else
		uint32_t temp2;
		temp2 = DMA1->LISR;
		if ( (temp2 & (1<<5)) !=0 ) break;			// TCIF0
#endif
		// Third method of detecting end of transfer: wait for DMA to set EN to 0.
		// All 3 methods produce the same time from last SPI clock to CS=1 (1.8us).

		//if ( (DMA1_Stream0->CR & DMA_SxCR_EN) == 0) break;

	}

I made "limit" volatile because I think the compiler would otherwise optimise it away.
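Starting "limit" at 0 looks odd but works, because unsigned arithmetic wraps: the first decrement gives the maximum value, and zero is only reached again after a full cycle. The small sketch below shows the same decrement-then-test pattern with an 8-bit counter so the whole cycle is short; `iterations_until_zero` is a name made up for this demonstration.

```c
#include <stdint.h>

/* Counts how many passes a "decrement, then test for zero" loop makes
 * for a given start value. An 8-bit counter is used so that the full
 * wrap-around takes 256 passes instead of 2^32. */
static uint32_t iterations_until_zero(uint8_t start)
{
    uint8_t limit = start;
    uint32_t passes = 0;
    for (;;) {
        passes++;
        limit--;              /* 0 wraps to 255, as unsigned arithmetic is modulo 2^8 */
        if (limit == 0) {
            break;
        }
    }
    return passes;
}
```

With a start value of 0 the loop makes 256 passes, and the uint32_t version correspondingly makes 2^32. As for volatile: it is harmless here, and since the loop exit depends on "limit" and the loop body reads hardware registers (declared volatile in the CMSIS headers), the counter should not be removable by the compiler either way.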

