Author Topic: 32F417 SPI running at one third the speed it should (Read 8375 times)

peter-h · « **on:** January 14, 2022, 05:00:54 pm »

I posted about this here
https://community.st.com/s/question/0D53W00001J9VXrSAN/32f417-spi-running-at-half-the-speed-it-should?t=1641986845261
but that's a very high throughput forum which almost nobody reads. One guy there spent some time on identifying what might be the explanation but has not offered a solution. I am posting it here in the hope that someone might have come across this before.

Basically the code is using polling and is stuffing bytes into an SPI master "uart" and receiving what comes back. The SPI is running at 21MHz so the limit is just over 2 megabytes/sec. It appears that even a 32F4 running at 168MHz can't cope with this!

This is the code

Code: [Select]

 
    while ((hspi->TxXferCount > 0U) || (hspi->RxXferCount > 0U))
    {
      /* Check TXE flag */
      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE)) && (hspi->TxXferCount > 0U) && (txallowed == 1U))
      {
        *(__IO uint8_t *)&hspi->Instance->DR = (*hspi->pTxBuffPtr);
        hspi->pTxBuffPtr++;
        hspi->TxXferCount--;
        /* Next Data is a reception (Rx). Tx not allowed */
        txallowed = 0U;
      }
 
      /* Wait until RXNE flag is reset */
      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE)) && (hspi->RxXferCount > 0U))
      {
        (*(uint8_t *)hspi->pRxBuffPtr) = hspi->Instance->DR;
        hspi->pRxBuffPtr++;
        hspi->RxXferCount--;
        /* Next Data is a Transmission (Tx). Tx is allowed */
        txallowed = 1U;
      }

The issue seems to have two parts:

- The existing code (from an ST library) is a bit dumb in that it blocks loading a TX byte unless an RX byte has been retrieved. This results in the double-buffered TX being allowed to run out of data, and with SPI if you aren't sending stuff out you won't be getting stuff back because it is the action of sending out that generates the clock. This seems reasonable as a quick hack, but it ignores the fact that you can load two bytes into a TX; one propagates through to the shift register and the other ends up in the TX buffer. Whereas the RX, while having the same buffering technically, is less usable because once you detect there is data available you don't know how much time you have to get that byte out.

- The SPI runs with its own clock which is much slower than the 168MHz of the CPU. It is the APB clock, which in my case is 42MHz (and I can't change that for various reasons; there is an 84MHz option but it uses another SPI channel which I can't access). This means that say a read of the "tx buffer empty" bit actually takes a lot more than the ~7ns CPU instruction speed. It's a stupid design where they use an ARM core and "asynchronously" attached a pile of peripherals to it, so there are various oddball delays, some requiring multiple APB clocks to prevent metastability.

The answer should be DMA but that is quite tricky to get working.

I am sure it can be done with polling. Many years ago I did a floppy disk controller with a Z80 running at 2MHz and it had the same issue, which was solved by a cunning loop structure. In this case I suspect a similar trick might work, whereby the TX channel is kept full, but the RX channel is checked often enough.

wek · « **Reply #1 on:** January 14, 2022, 05:07:59 pm »

Quote from: peter-h on January 14, 2022, 05:00:54 pm

One guy there spent some time on identifying what might be the explanation but has not offered a solution.

Beg your pardon. You've been told several times, use DMA. That *is* the solution. That you refuse it, is your problem.

JW

peter-h · « **Reply #2 on:** January 14, 2022, 05:50:17 pm »

I am happy to try it if you can post some code to get me started. It takes hours of reading just to find out which of the 135 DMA channels is the one to use

DavidAlfa · « **Reply #3 on:** January 14, 2022, 06:29:47 pm »

At best, you only have 64 clocks/byte at 168MHZ Core & 21MHz SPI, there're quite a lot of checks, branches, add few flash cache misses (another 4-7 clocks)... looks pretty tight there!
I suggest to setup the DMA in cubeMX, because it really takes close to no effort. Then start increasing the frequency until you start losing bytes/clocks.

peter-h · « **Reply #4 on:** January 14, 2022, 06:39:48 pm »

One thing I wondered was whether stuff like hspi->TxXferCount is wasting time. It seems to be indexing into a structure and possibly doing so at runtime.

Variables like the byte counts should be in registers.

Currently I am using -Og optimisation level. I have tried various others and -O3 seems to break some code, and apparently (there was a thread on this) this is to be expected since it is an experimental thing. But I could try it, or use some directive to create register variables.

I've never used Cube MX but know somebody who has.

Presumably one would replace just that one loop with DMA, plus a mysterious bit of code before it which handles a single-byte case

Code: [Select]

   // The need for this initial byte is unknown
    if (initial_TxXferCount == 0x01U)
    {
      *((__IO uint8_t *)&hspi->Instance->DR) = (*hspi->pTxBuffPtr);
      hspi->pTxBuffPtr++;
      hspi->TxXferCount--;
    }

Siwastaja · « **Reply #5 on:** January 14, 2022, 07:28:43 pm »

Quote from: peter-h on January 14, 2022, 05:50:17 pm

I am happy to try it if you can post some code to get me started. It takes hours of reading just to find out which of the 135 DMA channels is the one to use

Geez, just follow the manual. On the top of my head, set the DMA channel control register, which will also have the channel mapping field. The channel mapping table is in the reference manual. Set memory addres in M0AR, and SPI data register in PAR. Write number of data to NDTR. Enable DMA in SPI peripheral. Clear the DMA error flags (IFCR, if I recall correctly). Enable DMA channel (CR |= 1;).

SiliconWizard · « **Reply #6 on:** January 14, 2022, 07:33:06 pm »

If you don't want to use the HAL or can't use it because it's too inefficient, you can still look at the source code to speed up your development. I've done this a lot. It will at least help you make sense of the reference manuals and get you there faster. Source code is provided - take advantage of it. Still read the manual to figure out what is strictly necessary in your case and what isn't. But I find those vendor-libraries, provided with source code, a better source of information than they are good for direct use. =)

peter-h · « **Reply #7 on:** January 14, 2022, 09:13:34 pm »

How does DMA take care of sending out a byte and then waiting for one to come back? It must involve two channels.

SiliconWizard · « **Reply #8 on:** January 14, 2022, 09:59:10 pm »

Are you not using SPI? =)

wek · « **Reply #9 on:** January 14, 2022, 10:23:25 pm »

Quote

How does DMA take care of sending out a byte and then waiting for one to come back? It must involve two channels.

Yes, of course. Except that in 'F4, the individual DMA elements are called *streams*. And you want the stream handling Rx to be of higher priority than any other stream in that DMA.

Quote

It takes hours of reading just to find out which of the 135 DMA channels is the one to use

And you expect somebody else to spend that time for you? How do you intend to program without reading the fine manual?
Here you are (attachment). And remember, the DMA element in question is *stream*; *channel* means input of the request mux (i.e. the number which goes into DMA_SxCR.CHSEL.

32F417 SPI running at one third the speed it should

Except for that, everything is exactly as Siwastaja said above, just follow his description together with reading the registers part of DMA chapter. And maybe this could help you to fill in the control register (this is for something else so you have to go through the individual fields and correct them in yourself):

Code: [Select]

          
#define OR |  // because I hate C

#define DMA_SxCR_xBURST_INCR1                0       // only in FIFO mode (not Direct mode)
#define DMA_SxCR_xBURST_INCR4                1
#define DMA_SxCR_xBURST_INCR8                2
#define DMA_SxCR_xBURST_INCR16               3

#define DMA_SxCR_CT_MEMORY0                  0       // currently trageted memory, only in double-buffer mode
#define DMA_SxCR_CT_MEMORY1                  1

#define DMA_SxCR_PL_PRIORITY_LOW             0
#define DMA_SxCR_PL_PRIORITY_MEDIUM          1
#define DMA_SxCR_PL_PRIORITY_HIGH            2
#define DMA_SxCR_PL_PRIORITY_VERY_HIGH       3

#define DMA_SxCR_PINCOS_IS_PSIZE             0       // peripheral increment offset size, only if PINC=1, PBURST=0 and FIFO mode (not Direct mode)
#define DMA_SxCR_PINCOS_IS_4                 1

#define DMA_SxCR_xSIZE_BYTE                  0       // in direct (non-FIFO) mode, MSIZE is ignored and PSIZE used throughut
#define DMA_SxCR_xSIZE_HALFWORD              1
#define DMA_SxCR_xSIZE_WORD                  2

#define DMA_SxCR_DIR_P2M                     0       // direction (peripheral = P, memory = M)
#define DMA_SxCR_DIR_M2P                     1
#define DMA_SxCR_DIR_M2M                     2


#define DMA_SxFCR_FTH__1_4   0  // FIFO threshold = 1/4
#define DMA_SxFCR_FTH__1_2   1  // FIFO threshold = 1/2
#define DMA_SxFCR_FTH__3_4   2  // FIFO threshold = 3/4
#define DMA_SxFCR_FTH__FULL  3  // FIFO threshold = full

#define DMA_SxFCR_FS__1_4    0  // 0   <  FIFO status < 1/4
#define DMA_SxFCR_FS__2_4    1  // 1/4 <= FIFO status < 1/2
#define DMA_SxFCR_FS__3_4    2  // 1/2 <= FIFO status < 3/4
#define DMA_SxFCR_FS__4_4    3  // 3/4 <= FIFO status < full
#define DMA_SxFCR_FS__EMPTY  4
#define DMA_SxFCR_FS__FULL   5



disp_DMAStream->CR = 0
            OR (disp_DMAChannel          * DMA_SxCR_CHSEL_0   )  // channel select
            OR (DMA_SxCR_xBURST_INCR1    * DMA_SxCR_MBURST_0  )  // memory burst (only in FIFO mode)
            OR (DMA_SxCR_xBURST_INCR1    * DMA_SxCR_PBURST_0  )  // peripheral burst (only in FIFO mode)
            OR (0                        * DMA_SxCR_ACK       )  // "reserved" (says manual)
            OR (0                        * DMA_SxCR_CT        )  // current target (only in double-buffer mode)
            OR (0                        * DMA_SxCR_DBM       )  // double-buffer mode
            OR (DMA_SxCR_PL_PRIORITY_LOW * DMA_SxCR_PL_0      )  // priority level
            OR (0                        * DMA_SxCR_PINCOS    )  // peripheral increment offset size (only if peripheral address increments, FIFO mode and PBURST is 0)
            OR (DMA_SxCR_xSIZE_WORD      * DMA_SxCR_MSIZE_0   )  // memory data size; in direct mode forced to the same value as PSIZE
            OR (DMA_SxCR_xSIZE_HALFWORD      * DMA_SxCR_PSIZE_0   )  // peripheral data size
            OR (1                        * DMA_SxCR_MINC      )  // memory address increments
            OR (1                        * DMA_SxCR_PINC      )  // peripheral address increments
            OR (0                        * DMA_SxCR_CIRC      )  // circular mode (forced to 1 if double-buffer mode, forced to 0 if flow control is peripheral)
            OR (DMA_SxCR_DIR_M2M         * DMA_SxCR_DIR_0     )  // data transfer direction
            OR (0                        * DMA_SxCR_PFCTRL    )  // peripheral is the flow controller (i.e. who determines end of transfer) - only for SDIO
            OR (1                        * DMA_SxCR_TCIE      )  // transfer complete interrupt enable
            OR (0                        * DMA_SxCR_HTIE      )  // half transfer interrupt enable
            OR (0                        * DMA_SxCR_TEIE      )  // transfer error interrupt enable
            OR (0                        * DMA_SxCR_DMEIE     )  // direct mode error interrupt enable
            OR (1                        * DMA_SxCR_EN        )  // stream enable
          ;

Don't forget clearing the transfer-complete flag, as Siwastaja said.

You may want to use FIFO to assemble bytes into words to reduce traffic on the memory side, but start without it (i.e. leave DMA_SxFCR at its reset state) until you gain confidence. With FIFO off, MSIZE field is ignored.

Btw. you want to use 16-bit transfers on the SPI side.

Btw. you would want to do that in the polled implementation, too - that's the simplest possible optimization and it would probably result in noticeable improvement.

JW

SiliconWizard · « **Reply #10 on:** January 14, 2022, 11:11:54 pm »

Would the HAL_SPI_TransmitReceive_DMA() function from the HAL not be good enough in your case?

And if not, you can have a look at its source code as I suggested (it does a lot including configuring the DMA transfers, which you might want to do only once, and repeatedly restart them rather that reconfiguring the channels from scratch every time...) But the code is likely to at least guide you.

peter-h · « **Reply #11 on:** January 15, 2022, 06:12:21 am »

My code was based on the non-DMA HAL function, with unwanted stuff removed. Yes that's a very good point; they have an interrupt version and a DMA version. I am away right now so can't test anything but will look this up when I get back.

How can one use SPI in a 16-bit mode? Do you mean actually setting SPI to 16 bits so that it is shifting out, and shifting in, a 16 bit value at a time? I am doing that for an ADC which needs that (ADS1118). I can see that would work, in the case of a len value which is a multiple of two.

"And you expect somebody else to spend that time for you?"

No; I sent you a PM asking if you do consultancy, since you mentioned that somewhere, but got no reply

Development in this area can be very slow because getting it wrong trashes the FLASH chip so it has to be re-initialised, so you need two sets of w/r functions: the old and the one you are testing. One can spend days on this unless one already knows how.

DavidAlfa · « **Reply #12 on:** January 15, 2022, 08:42:05 am »

Have you considered using libopencm3? It's very much like the old ST libraries.
Setting up stuff is harder, you must take care of everything, but it's still better than doing it from scratch.

Edit: Easier said than done. Tried, errors everywhere. Again, no too many instructions else than "make"..
I will never hate these Linux things enough. Quick setup? keep dreaming.
First, you have to master the operating system, spend 10 hours on google until you find a lucky post where a guy posts some Matrix codes that does the fix.
Everything in that universe is sick, stuck in the 80s complexity. Then they critice HAL is bulky and buggy?
With HAL I can start developing in 30 seconds, and fight only against the stm32 problems.
With the free hippie linux stuff I just wanna thrown all away!

Siwastaja · « **Reply #13 on:** January 15, 2022, 12:51:32 pm »

Quote from: peter-h on January 14, 2022, 09:13:34 pm

How does DMA take care of sending out a byte and then waiting for one to come back? It must involve two channels.

Yes, you configure two streams with different memory addresses. Look at the channel mapping, there are SPIx_RX and SPIx_TX separately. And yeah, in STM32 terminology, streams are something that run in parallel, and channels are those fixed connections, like SPI5_RX is some channel number and say UART2_TX is another channel number. You configure the channel number in the control register of the DMA Stream of your choice. Note that not all streams support all channels, so you have to pick the right combo, but this isn't too hard, it's in the manual.

Again, just read the DMA section in the reference manual, it's quite simple if you ignore advanced operating modes (FIFOs, packing modes).

And indeed, using SPI in 16-bit mode halves the processing overhead (number of CPU interruptions in non-DMA solution, or number of DMA transfers in the simple non-FIFO case), but if your SPI device actually works with 8-bit granularity, then only even number of bytes is supported. This might or might not be a problem. How to use it? Well, just set it to 16-bit mode and process two bytes at a time in software. The only possible issue is the order of bytes, and having only two options, swapping the bytes is quite easy to do if it does not work properly. But then, if the slave expects, say, 7 byte transaction and misbehaves if it gets 8, you can't go this way.

Siwastaja · « **Reply #14 on:** January 15, 2022, 01:09:32 pm »

DMA channel mapping list is on pages 307-308 of RM0090. This also shows a total classic documentation error, ST calls this fixed mapping "an example", as if you could modify it. You can't, this is the only existing mapping, not an example of anything.

In any case, here is some random code of DMA configuration for SPI, I pulled out from some random old project:

Init:

Code: [Select]

	MC1_CS1();

	// DMA2 STREAM 0 ch 3 = motcon RX
	DMA2_Stream0->M0AR = (uint32_t)&motcon_rx[0];
	DMA2_Stream0->PAR = (uint32_t)&(SPI1->DR);
	DMA2_Stream0->NDTR = MOTCON_DATAGRAM_LEN;
	DMA2_Stream0->CR = 3UL<<25 /*Channel*/ | 0b10UL<<16 /*high prio*/ | 0b01UL<<13 /*16-bit mem*/ | 0b01UL<<11 /*16-bit periph*/ |
	                   1UL<<10 /*mem increment*/ | 1UL<<4 /*transfer complete interrupt*/;

	// DMA2 STREAM 3 ch 3 = motcon TX
	DMA2_Stream3->M0AR = (uint32_t)&motcon_tx[0];
	DMA2_Stream3->PAR = (uint32_t)&(SPI1->DR);
	DMA2_Stream3->NDTR = MOTCON_DATAGRAM_LEN;
	DMA2_Stream3->CR = 3UL<<25 /*Channel*/ | 0b10UL<<16 /*high prio*/ | 0b01UL<<13 /*16-bit mem*/ | 0b01UL<<11 /*16-bit periph*/ |
	                   1UL<<10 /*mem increment*/ | 0b01<<6 /*mem->periph*/;


	// SPI1 @ APB2 = 60 MHz
	SPI1->CR1 = 1UL<<11 /*16-bit frame*/ | 1UL<<9 /*Software slave management*/ | 1UL<<8 /*SSI bit must be high*/ |
		0b010UL<<3 /*div 8 = 7.5 MHz*/ | 1UL<<2 /*Master*/;

	SPI1->CR2 = 1UL<<1 /* TX DMA enable */ | 1UL<<0 /* RX DMA enable*/;

	SPI1->CR1 |= 1UL<<6; // Enable SPI

	NVIC_SetPriority(DMA2_Stream0_IRQn, 0b0000); // Priority is the most urgent; keep the ISR short
	NVIC_EnableIRQ(DMA2_Stream0_IRQn);

Start of transfer:

Code: [Select]

	MC1_CS0();
	DMA2->LIFCR = 0b111101UL<<0;  DMA2_Stream0->CR |= 1UL; // Enable RX DMA
	DMA2->LIFCR = 0b111101UL<<22; DMA2_Stream3->CR |= 1UL; // Enable TX DMA

End of transfer:

Code: [Select]

void motcon_rx_done_inthandler()
{
	DMA2->LIFCR = 0b111101UL<<0;
	MC1_CS1();
}

Maybe not your exact case but I'm sure having references won't hurt?

peter-h · « **Reply #15 on:** January 15, 2022, 02:14:38 pm »

Yes; thank you all. This is very good stuff. I will get onto it as soon as I get back.

If using DMA, am I right that the special case of count=1 doesn't need to be specially handled?

IIRC, the blocksize is always 512. The various functions take length as a parameter, and even support any 16 bit blocksize, but when I was reading the flash data sheet it was too ambiguous, so I avoided non-512 sizes. And when Windows goes in via USB, it always does who sectors (512) anyway.

IIRC, the flash supports any size read simply by clocking it but I am not using that. It isn't useful because there isn't enough RAM to make use of it.

If I used polling with 16 bit SPI mode, given the buffer is defined as uint8_t *buf, presumably I would need to convert that into a uint16_t *buf. Otherwise I would have to extract two bytes at a time, pack them into a 16 bit int, and stuff that into the SPI, which wastes a bit of time. I am still trying to get my head around this stuff

Siwastaja · « **Reply #16 on:** January 15, 2022, 02:34:47 pm »

Yes, DMA can handle the size of 1. You could improve performance of these single-byte accesses by not using DMA on them, but not much, because DMA channel configuration isn't too many operations, especially if you don't use HAL for it. If performance is important:
* Only write the DMA control fields which need change (maybe M0AR, probably NDTR)
* Instead of read-modify-write on CR, do full writes. DMA1_Stream0->CR = config; DMA1_Stream0->CR = config | 1UL; is faster than DMA1_Stream0->CR = config; DMA1_Stream0->CR |= 1UL;

If DMA config doesn't change except for NDTR, re-enabling would be just three writes: NDTR, IFCR, CR once for enabling the channel.

Time consumed is same if you read from 8-bit SPI and write a byte in memory, or read from 16-bit SPI and write a halfword in memory. Hence, 16-bit version will double the performance. It may happen that you need to swap the bytes, I don't remember if M7 has a single instruction for this but assuming you have optimizations enabled, it's not a big operation, you still save a lot of time by halving all other overhead (think about interrupt entry latency of 12 cycles alone!), even if you need to spend 1-3 CPU cycles to swap the bytes.

You can make the buffer uint16_t, or you can keep it uint8_t and cast the pointer to uint16_t, or you can create two different access types through union, whatever. In some solutions (like pointer casting), you need to add align attribute to the definition, because uint16_t needs to be aligned by 2, but uint8_t can be arbitrarily aligned. Or, you can just write higher level C code where you do two writes to the uint8_t[] table, and let the compiler optimize it. But I'm not sure if the compiler understands if the table is aligned or not, even if you have aligned attribute, so it's not necessarily as fast.

peter-h · « **Reply #17 on:** January 15, 2022, 05:16:36 pm »

Just been reading up the RM on SPI. There is a bit ordering config but no byte ordering config and I can't immediately find out a doc specifying which byte (in 16 bit mode) gets shifted out first.

One would assume that if 16 bit mode and MSB-first was selected then the high byte of a 16 bit value would come out first, because "MSB first" ought to mean bit 15

On the 32F4 that is the byte at the higher memory address (of a 16 bit int).

The flash serial interface shifts MSB first so it looks like 16 bit mode could be used but would need a byte swap.

However, with DMA, there won't be any need for 16-bit mode anyway.

Coming back to polling mode, and 16 bit SPI mode, what I am still not sure about is whether that length=1 hack is needed. It isn't possible anyway; the minimum length will be 2. I don't understand the need for that hack, however (I vaguely recall reading somewhere it was needed in case an interrupt came in at the wrong moment, and would bugger up the SPI) and it may be moot since as I said, I can limit DMA (or 16-bit polled mode) to the length=512 case.

Contrary to what I said before, the length=1 case is used for things like command bytes. But obviously no speed optimisation is needed there. Only the full block w/r of 512 needs optimising.

BTW, FWIW, SPI2 is used only for the serial flash. I am driving various other chips (ADCs, displays, etc) using SPI3 but for that I am using the HAL SPI functions unmodified. Those all work fine and don't need tweaking since SPI speeds are low; from ~500k to ~5mbps.

nctnico · « **Reply #18 on:** January 15, 2022, 05:20:25 pm »

Quote from: Siwastaja on January 15, 2022, 02:34:47 pm

You can make the buffer uint16_t, or you can keep it uint8_t and cast the pointer to uint16_t,

I strongly advice against casting pointers that way. You never know whether an uint8_t array is aligned or not (there is no guarantee at all) and unaligned accesses are not supported on most ARM platforms. It may even fail silently (been there, done that).

peter-h · « **Reply #19 on:** January 15, 2022, 05:25:05 pm »

An excellent point; I cannot tell if the buffer is aligned because it is supplied by the caller, which could be anything.

It thus looks like the only way is to use DMA, in 8-bit mode. The 16-bit mode will need special handling on the 1st/last byte.

Siwastaja · « **Reply #20 on:** January 15, 2022, 05:50:47 pm »

Quote from: nctnico on January 15, 2022, 05:20:25 pm

Quote from: Siwastaja on January 15, 2022, 02:34:47 pm
You can make the buffer uint16_t, or you can keep it uint8_t and cast the pointer to uint16_t,
I strongly advice against casting pointers that way. You never know whether an uint8_t array is aligned or not (there is no guarantee at all)

Of course you know if you align it. All decent compilers support this (see attributes), and if your MCU projects absolutely must be portable outside of the two major ARM compilers (gcc and clang) which follow the same syntax, you can create a wrapper header.

Perfect portability and adherence to only C standard utilities is impossible in MCU projects anyway; usually you need to add alignment attributes when using DMA, for the DMA.

You are of course right, if you can, avoid the pointer casting, except casting to char* or void* which is always safe. And in this case, using uint16_t [] in definition and casting to uint8_t* (or char*, or void*) wherever needed as single bytes is always safe.

And despite the fact that you have encountered a CPU where unaligned access caused weird issues instead of the expected crash, STM32F417 does not behave that way, but gives UsageFault as documented. And if you google UsageFault, the first result shows "unaligned memory access" and you don't even need to open that link to see that. So it's not usually a huge problem even if you mess it up accidently. One of the easiest bugs to track down, if we exclude your weird experience on non-Cortex-M4.

peter-h · « **Reply #21 on:** January 17, 2022, 06:28:21 pm »

There is another angle on this. Not worth a lot but maybe something.

To receive say 512 bytes, you merely need to stuff 512 bytes (or 256 words if using 16-bit SPI mode) into the TX. It doesn't matter what they are, you are interested only in generating the clock pulses, so no need to read the value out of a buffer, increment a pointer, etc. They can all be 0x00 or 0x0000

And likewise to transmit, you don't need to put the RX data anywhere, because it will be junk anyway.

That "transmit-receive" function was written by ST to be general-purpose, but in reality whether you are having to transmit the 512, or receive the 512, will depend on preceeding commands (mostly byte values) sent to the FLASH. You are unlikely to be wanting to TX and RX valid data. The only case I recall where both need to be valid data was with a TI ADS1118 which shifts out 16 bits and shifts in 16 bits at the same time. It's a weird device...

Re DMA, note that DMA can't access CCM, so the calling code need to watch that. In my case it is reasonably easy to take care of this, but if you use CCM for RTOS stacks (which is probably very tempting) then it could bite you.

betocool · « **Reply #22 on:** January 19, 2022, 09:41:58 am »

Hey, for SPI transfers, DMA is the way to go IMO...

Here's some code setting up an I2S transfer (which uses the SPI interface) for sending data on an STM32H7:

Code: [Select]

/* Set up DMA TX */
    RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;   // DMA2 clock enable;

	DMA2_Stream3->CR &= ~DMA_SxCR_EN;
	while(DMA2_Stream3->CR & DMA_SxCR_EN)
	{

	}

	aux_register |= DMA_SxCR_MSIZE_1;
	aux_register |= DMA_SxCR_PSIZE_1;
	aux_register |= DMA_SxCR_MINC;
	aux_register |= DMA_SxCR_CIRC;
	aux_register |= DMA_SxCR_PL_1;
	aux_register |= DMA_SxCR_DIR_0;
	aux_register |= DMA_SxCR_TCIE;
	aux_register |= DMA_SxCR_HTIE;

    HAL_NVIC_SetPriority(DMA2_Stream3_IRQn, 6, 0);
    HAL_NVIC_EnableIRQ(DMA2_Stream3_IRQn);

	DMA2_Stream3->CR = aux_register;
	DMA2_Stream3->FCR &= ~(DMA_SxFCR_DMDIS);

	DMA2_Stream3->PAR = (uint32_t)&(SPI3->TXDR);

	DMAMUX1_Channel11->CCR = 0x3E;

	/* Set up I2S3 TX */
	SPI3->CR1 &= ~SPI_CR1_SPE;
	while(SPI3->CR1 & ~SPI_CR1_SPE)
	{

	}

	aux_register = 0;
	aux_register |= SPI_CFG1_TXDMAEN;
	SPI3->CFG1 |= aux_register;

	//Chip Enable pin
	HAL_GPIO_WritePin(GPIOD, GPIO_PIN_11, GPIO_PIN_SET);

	DMA2_Stream3->M0AR = (uint32_t)&(controller_output_buffer[0].left);
	DMA2_Stream3->NDTR = 2 * OUTPUT_BUF_SIZE_STEREO_SAMPLES;

	/* Clear interrupt flags */
	DMA2->LIFCR |= DMA_LIFCR_CTCIF3 | DMA_LIFCR_CHTIF3;
	DMA2_Stream3->CR |= DMA_SxCR_EN;
	SPI3->CR1 |= SPI_CR1_SPE;
	SPI3->CR1 |= SPI_CR1_CSTART;

You'll see it's not at all complex once you read the datasheet. Basically set the direction, peripheral, number of transfers and an interrupt. If you want, set up a double buffer, and receive interrupts at transfer completion and half transfer (what I use).

You'll have to set up an additional SPI Rx DMA channel, main difference being the direction! You will receive whatever data the slave is ready to send back to you. If you only want to receive, as someone mentione before... send Zero's out on the Tx line. I'm pretty sure you can set up an RX only SPI channel with clocking and DMA... I must see if I find something like that, I think I used that on a STM32L4.

Cheers,

Alberto

betocool · « **Reply #23 on:** January 19, 2022, 09:45:14 am »

Here's part of another example. This one runs on an L4...

Code: [Select]

	/*
	 * Prepare SPI 3 for DMA transfers in both directions
	 * This functionality gets activated using low level
	 * register access
	 * SPI 3 TX -> DMA2 Channel 2 / DIR: 1 /
	 * SPI 3 RX -> DMA2 Channel 1 / DIR: 0
	 */

	/* Configure PC3 input as interrupt, rising edge */
	GPIO_InitStruct.Pin = GPIO_PIN_3;
	GPIO_InitStruct.Mode = GPIO_MODE_IT_RISING;
	GPIO_InitStruct.Pull = GPIO_PULLDOWN;
	HAL_GPIO_Init(GPIOC, &GPIO_InitStruct);

	/* Enable DMA2 */
	RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;

	/* Select channels */
	DMA2_CSELR->CSELR |= (0x3 << DMA_CSELR_C1S_Pos); ///< SPI3RX Ch1
	DMA2_CSELR->CSELR |= (0x3 << DMA_CSELR_C2S_Pos); ///< SPI3TX Ch2
	/*
	 * SPI TX
	 */
	DMA2_Channel2->CCR |= 	DMA_CCR_MSIZE_0 | 				///< 16 Bits Mem Width
							DMA_CCR_PSIZE_0 | 				///< 16 Bits Peripheral Width
							DMA_CCR_MINC    |				///< Increment memory
							DMA_CCR_DIR;					///< Direction (TX)
	/*
	 * SPI RX
	 */
	DMA2_Channel1->CCR |= 	DMA_CCR_MSIZE_0 | 				///< 16 Bits Mem Width
							DMA_CCR_PSIZE_0 | 				///< 16 Bits Peripheral Width
							DMA_CCR_MINC    ;				///< Increment memory
	DMA2_Channel1->CCR |= 	DMA_CCR_TCIE;					///< Enable DMA RX ready interrupt
	/* Clear interrupt flags just in case */
	DMA2->IFCR |= DMA_IFCR_CTCIF1;
	DMA2->IFCR |= DMA_IFCR_CTCIF2;

	HAL_NVIC_SetPriority(DMA2_Channel1_IRQn, 7, 0);			///< Set the interrupt priority and enable
	HAL_NVIC_EnableIRQ(DMA2_Channel1_IRQn);


	// Disable SPI
	bool loop = true;
	while(loop)
	{
		if((SPI3->SR & SPI_SR_BSY) == 0)
		{
			if((SPI3->SR & SPI_SR_FRLVL) != 0)
			{
				(void)(uint32_t)SPI3->DR;
			}
			else
			{
				loop = false;
			}
		}
	}
	SPI3->CR1 &= ~SPI_CR1_SPE;

peter-h · « **Reply #24 on:** January 19, 2022, 05:22:10 pm »

One issue is that DMA cannot write to the CCM, so I still need a version of this which does not use DMA. So I spent a few hours today on that, since it should be simple

I implemented 16 bit mode only if blocksize=512 which is the only time-critical scenario.

Well, it works:

Code: [Select]

  if (Size==512)
  {

	  // Set 16 bit SPI2 mode

	  SPI2->CR1 &= ~(0x1 << 6);			// disable SPI2
	  SPI2->CR1 |= (0x1 << 11);			// set 16 bit mode
	  SPI2->CR1 |= (0x1 << 6);			// enable SPI2

	  // Loop, sending out TX buffer while receiving bytes into RXbuffer
	  // Generally, both counts must be the same, and if you are just receiving then TX data can be anything

	  uint16_t val16,val16b;

	  while ((TxXferCount > 0U) || (RxXferCount > 0U))
	  {
	      /* Do a transmit if TX empty */
	      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE)) && (TxXferCount > 0U) && txallowed )
	      {
	    	  val16 = (*hspi->pTxBuffPtr);
	    	  val16b = (val16>>8) | (val16<<8);  // byte swap
	    	   *(__IO uint16_t *)&SPI2->DR = val16b;
	    	  hspi->pTxBuffPtr+=2;
	    	  TxXferCount-=2;
	    	  /* Next Data is a reception (Rx). Tx not allowed */
	    	  txallowed = false;
	      }

	      /* Do a receive if RX not empty */
	      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE)) && (RxXferCount > 0U))
	      {
	    	  val16 = SPI2->DR;
	    	  (*(uint16_t *)hspi->pRxBuffPtr) = (val16>>8) | (val16<<8);  // byte swap
	    	  hspi->pRxBuffPtr+=2;
	    	  RxXferCount-=2;
	    	  /* Next Data is a Transmission (Tx). Tx is allowed */
	    	  txallowed = true;
	      }

	  }

	  // Set back to 8 bit SPI2 mode
	  SPI2->CR1 &= ~(0x1 << 6);			// disable SPI2
	  SPI2->CR1 &= ~(0x1 << 11);		// set 8 bit mode back
	  SPI2->CR1 |= (0x1 << 6);			// enable SPI2


  }

Running a tight loop for reading the flash and timing 100000 512 byte page reads, I get this

original 8 bit SPI mode = 50 secs = 1024kbytes/sec
16 bit SPI mode = 40 secs = 1280kbytes/sec (-Og optimisation)
16 bit SPI mode = 37 secs = 1383kbytes/sec (-O3 optimisation)

The optimisation is done with __attribute__((optimize("O3"))) i.e. on that one function only.

So the help with 16 bit SPI mode is only 20%.

It seems obvious that the main issue must be the txallowed flag. Its purpose is to prevent loading the TX unless a value has been immediately previously extracted from RX. That loop just keeps going around, loading the TX if there is space, and reading the RX if there is data. The complication is that this is not a normal UART because you need TX to generate the clocks to get RX data to come in. So there is a deadlock, which is needed, but it has the effect of under-utilising the TX. I wonder if anyone has any ideas on how to fix this.

I then changed the loop to test txallowed and check the TXE bit only if txallowed is true, because TXE is a slow read. I also changed the test of either counter being nonzero to ORing them together. That knocked off 3 seconds - same as -O3, and now -O3 has no effect. Curious!

Code: [Select]

	  uint16_t val16,val16b;

	  while ( (TxXferCount|RxXferCount) != 0 )	// while ((TxXferCount > 0U) || (RxXferCount > 0U))
	  {

	      /* Do a transmit if TX empty */
		  if (txallowed && (TxXferCount > 0U))
		  {
		      if ( __HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE) )
		      {
		    	  val16 = (*hspi->pTxBuffPtr);
		    	  val16b = (val16>>8) | (val16<<8);		// byte swap
		    	  /* *(__IO uint16_t *)& */ SPI2->DR = val16b;
		    	  hspi->pTxBuffPtr+=2;
		    	  TxXferCount-=2;
		    	  /* Next Data is a reception (Rx). Tx not allowed */
		    	  txallowed = false;
		      }
		  }

	      /* Do a receive if RX not empty */
	      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE)) && (RxXferCount > 0U))
	      {
	    	  val16 = SPI2->DR;
	    	  (*(uint16_t *)hspi->pRxBuffPtr) = (val16>>8) | (val16<<8);	// byte swap
	    	  hspi->pRxBuffPtr+=2;
	    	  RxXferCount-=2;
	    	  /* Next Data is a Transmission (Tx). Tx is allowed */
	    	  txallowed = true;
	      }

	  }

No matter what I do I can't get under 37 secs, which is 50% of the SPI speed. Well, better than 30%

Scope trace below showing the 16 clocks and then a gap.

I will try DMA next. Maybe DMA will be faster even with CCM, after copying the 512 bytes to/from a temp buffer in main RAM...

DavidAlfa · « **Reply #25 on:** January 20, 2022, 09:36:23 am »

Why do you have TxXferCount and RxXferCount? You're the master, you have to transmit to receive data, the size will be the same in any case?
Why do you need txallowed flag? You always have to transmit first, then the received data will be placed in DR register, setting the RXNE flag.
Another possible tweak: If you wait for RXNE, then you don't need to check for TXE, since you know the buffer will be empty and ready for a new transfer? Then you only would need to check it at the beginning?

Also, don't forget O2 optimization, sometimes it's faster than O3, and generates smaller code.
Just thinking, I've not tested it. But I think you can simplify it to make it faster.

Code: [Select]

uint16_t val16;

while ( !__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE) );                // Wait until TX flag is ready to transfer
while ( XferCount )
{  
  SPI2->DR = ((*hspi->pTxBuffPtr >>8) | (*hspi->pTxBuffPtr<<8));  // byte swap then send
  hspi->pTxBuffPtr+=2;                                            // Increase TX buffer pointer  
  while (!__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE));               // Wait until Rx data is done
  val16 = SPI2->DR;                                               // Read DR
  (*(uint16_t *)hspi->pRxBuffPtr) = (val16>>8) | (val16<<8);	  // byte swap
  hspi->pRxBuffPtr+=2;                                            // Increase Rx buffer pointer
  XferCount-=2;                                                   // Decrease remaining bytes
}

You could also read the received data, but process it when the transfer is going on, so you can use the waiting time for the byte swapping and pointer incrementing.
That way the waiting between transfers will be reduced. Again, just an idea, not tested.

Code: [Select]

bool first = true;
uint16_t val16;

while ( !__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE) );                // Wait until TX flag is ready to transfer
while ( XferCount )
{
  SPI2->DR = ((*hspi->pTxBuffPtr >>8) | (*hspi->pTxBuffPtr<<8));  // byte swap then send  
  hspi->pTxBuffPtr+=2;                                            // Increase TX buffer pointer
  if ( !first )                                                   // Use transfer time to store previous received data
  {
    (*(uint16_t *)hspi->pRxBuffPtr) = (val16>>8) | (val16<<8);	  // byte swap
    hspi->pRxBuffPtr+=2;                                          // Increase Rx buffer pointer    
  }
  
  while ( !__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE) );             // Wait until Rx data is done
  val16= SPI2->DR;                                                // Read DR
  first = false;                                                  // No longer the first byte
  XferCount-=2;                                                   // Decrease remaining bytes
}
(*(uint16_t *)hspi->pRxBuffPtr) = (val16>>8) | (val16<<8);	  // Store last received data which will not be done inside the loop

Also, you can swap the bytes outside the SPI loop to increase the speed further. Process the TX buffer before sending, process the RX buffer after receiving.
I think you would need to do this in any case if you use DMA.

betocool · « **Reply #26 on:** January 20, 2022, 10:00:44 am »

I missed what the CCM was...

I see that you have a pause in your trace which is almost as long as your transfer.

If you were to implement DMA, you could have 512 or 1024 half-words transferred in one go. Once the transfer finishes, process the data and start a new DMA transfer. You don't even need to change the parameters.

One further improvement is having a circular buffer with double interrupts, and you never need set up a DMA transfer again, but that's more useful for continuous streaming.

Both the above work in RX or TX or RX/TX. Have a look at the reference manual as well, sometimes they have a high level example on exactly what you want to do.

Cheers,

Alberto

peter-h · « **Reply #27 on:** January 20, 2022, 10:55:30 am »

Thank you all.

Yes it did occur to me that two counters are always dumb with SPI when both TX and RX are enabled; that came from the ST code.

I will work on this some more over the next few days, eventually doing the DMA too.

The byte swap ought to not be needed if DMA is used because I would then use the 8-bit mode.

Currently I am wasting time on a special case of this function which is tx-only, and is used for "just writing". When it has sent out the block (1-512 bytes) it merely clears the RX overrun condition (which is reading a couple of registers, per the RM). But the 16 bit version of it doesn't work; the data is somehow corrupted. When I do the buffer compare in my test code, where the data should be 00 01 02 03 I see 00 00 02 00 04 00. Like my byte swap code was duff. The 8 bit SPI mode does work, and I struggle to see what is different, other than the obvious. And I can trivially solve it by not running 16 bit mode in this TX-only case (since it is used only for writing the flash, which always gets you a ~20ms per page hit).

I can't use the existing function as-is for tx-only because SPI always receives "some" data and it has to go "somewhere", so you either dump it within the loop by doing just

Code: [Select]

 	  			if ( (SPI2->SR & 1) != 0 )
  	  			{
  	  				SPI2->DR;							// in rxdump mode, just dump any rx data
  	  			}

or by dumping it into a buffer, and you don't really want to dump it into the source buffer because the calling code may not like that

I will post on progress...

Code: [Select]

  	if (rxdump)
  	{
  		RxXferCount=0;
  	}

  	if (Size==512)
  	{

  		// Set 16 bit SPI2 mode
  		SPI2->CR1 &= ~(0x1 << 6);			// disable SPI2
  		SPI2->CR1 |= (0x1 << 11);			// set 16 bit mode
  		SPI2->CR1 |= (0x1 << 6);			// enable SPI2

		#ifdef TIMING_DEBUG
  			ADS1118_an_2p5v_external(0);
		#endif

  		// Loop, sending out TX buffer while receiving bytes into RXbuffer
  		// Both counts must be the same (with SPI), and if you are just receiving then TX data can be anything

  		volatile uint16_t val16,val16b;

  		while ( (TxXferCount|RxXferCount) != 0 )		// while ((TxXferCount > 0U) || (RxXferCount > 0U))
  		{

  			/* Do a transmit if TX empty */
  			if ( txallowed && (TxXferCount > 0) )
  			{
				val16 = (*hspi->pTxBuffPtr);			// Get TX data speculatively
				val16b = (val16>>8) | (val16<<8);		// byte swap

  				if ( (SPI2->SR & (1<<1)) != 0 )			// if ( __HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE) )
  				{
  					SPI2->DR = val16b;
  					hspi->pTxBuffPtr+=2;
  					TxXferCount-=2;
  					txallowed=rxdump;					// block another TX until RX done (except in rxdump mode)
   				}
  			}

  			/* Do a receive if RX not empty */
  			if (!rxdump)
  			{
  	  			if ( ((SPI2->SR & 1) != 0) && (RxXferCount > 0) )		// if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE)) etc
  	  			{
  	  				val16 = SPI2->DR;
   	  				(*(uint16_t *)hspi->pRxBuffPtr) = (val16>>8) | (val16<<8);	// byte swap
  	  				hspi->pRxBuffPtr+=2;
 	  				RxXferCount-=2;
   	   				txallowed = true;
  	  			}
  			}
  			else
  			{
 	  			if ( (SPI2->SR & 1) != 0 )
  	  			{
  	  				SPI2->DR;							// in rxdump mode, just dump any rx data
  	  			}
  			}

  		}

		#ifdef TIMING_DEBUG
  			ADS1118_an_2p5v_external(1);
		#endif

  		// Clear any rx data and the overrun flag in case not all received data was read; mandatory in rxdump mode
  		__HAL_SPI_CLEAR_OVRFLAG(hspi);

  		// Set back to 8 bit SPI2 mode
  		SPI2->CR1 &= ~(0x1 << 6);			// disable SPI2
  		SPI2->CR1 &= ~(0x1 << 11);			// set 8 bit mode
  		SPI2->CR1 |= (0x1 << 6);			// enable SPI2

  	}
  	else
  	{
  		// This is the non-512 byte case - uses 8 bit SPI mode

I suspected that the rxdump mode issue is with the 16 bit mode setting

Code: [Select]

  		// Set back to 8 bit SPI2 mode
  		SPI2->CR1 &= ~(0x1 << 6);			// disable SPI2
  		SPI2->CR1 &= ~(0x1 << 11);			// set 8 bit mode
  		SPI2->CR1 |= (0x1 << 6);			// enable SPI2

and needing a delay after re-enabling SPI, but delays don't make any difference.

wek · « **Reply #28 on:** January 20, 2022, 11:24:26 am »

> Yes it did occur to me that two counters are always dumb with SPI when both TX and RX are enabled; that came from the ST code.

This piece of code has some history. It started from the same premise as you have - that you can have either one or two frames in transit, given the double-buffering of both Tx and Rx. For that, you need to keep track of transmitted or received frames separately.

This "worked" in simplistic test cases, failing only when somebody tried to use it in actual program with interrupts, where receiver has not been serviced timely and the two incoming frames overflew. From there stems txallowed, obviously a patch.

JW

peter-h · « **Reply #29 on:** January 20, 2022, 03:31:24 pm »

With everyone's help I got the 100k page time from 40 secs (was 37 but I had to add some code to implement the rx-dump mode) to 30 secs. It is about 60% of the theoretical SPI speed (21mbps).

It's gonna be DMA next...

Siwastaja · « **Reply #30 on:** January 20, 2022, 04:03:25 pm »

CCM might be a bad idea if it prevents you from using DMA.

Where does the data come from / where does it go to next? Use DMA for that piece, too.

If the data is generated or "consumed" ie. parsed by CPU; in other words, if it all has to go through CPU, try to make 32-bit accesses on the bus, and the advantage of the CCM will be small so you can just as well move it to the regular RAM.

peter-h · « **Reply #31 on:** January 20, 2022, 04:27:07 pm »

Well, yes, a good question

The 32F417 has 128k of normal RAM and 64k of CCM RAM. And somebody (i.e. me) had to decide how to spread this around. It can be debated all sorts of ways., and the scheme used now is not what was used earlier in the development.

We have an RTOS (FreeRTOS) and after power-up initialisation practically everything that runs on this box is running as an RTOS thread. I decided to allocate the 64k CCM to the RTOS (what they call application stacks). It looks like a good number to provide for that purpose. And this is "fast" RAM.

Then I need about 50k available to a single malloc for MbedTLS (whoever wrote that POS doesn't know what "embedded" stands for

). And if that came out of CCM it would leave very little. Also TLS won't be used much and when it does it won't need to be specially fast (it is slow anyway).

A time-critical process reading the SPI FLASH is USB, which runs wholly via two ISRs; one for reading 512 byte sectors and one for writing them. The reading one is important to get fast, because that interrupt is happening anytime a USB cable is connected and with a PC on the other end... and that ISR blocks everything else for the duration of a page read (200us in theory; currently around 400us). The USB buffers are in main RAM and thus DMA-accessible.

There won't be a whole lot of RTOS activity accessing the FLASH and especially not with critical timing requirements, and these won't be able to use DMA so the FLASH code falls back to a) 16 bit SPI mode and then b) 8 bit SPI mode.

Then I have power-up stuff like a boot loader for programming the FLASH but that obviously runs rarely, but can use DMA. The power-up code uses a stack in main RAM for various buffers which are at the top of main RAM.

So I think this is the best compromise.

I still haven't found a way to handle the USB sector write and doing that properly i.e. returning a BUSY to force the host machine to keep retrying, instead of just hanging in the ISR for the page write time of ~20ms and then returning OK (which actually works fine). I posted a thread on it where almost everybody told me to spend half my life writing a driver for the FLASH

Siwastaja · « **Reply #32 on:** January 20, 2022, 04:35:44 pm »

Oh, the root cause to your problems here seems to be an undersized microcontroller, the fact that you need to shuffle things around between normal RAM and CCM not only due to performance, but due to having too little RAM, is pretty revealing. Also the SPI throughput required seems a close call.

You should pick a larger part, for example from F7 series. What you get is SPI with FIFOs which enables massively better performance with simpler program logic even if you don't use DMA - you can do 32-bit writes on FIFOs and avoid race conditions between write/read because there is room for "one more" word in the TX FIFO; only thing you need to take care is to write and read the same amount.

And of course, enough RAM. And, 32-bit access between the SPI peripheral and DMA to save bus cycles.

peter-h · « **Reply #33 on:** January 20, 2022, 04:55:09 pm »

Well, an x86 at 4GHz and 24GB of RAM would be even better

The 32F417 is fine for this. It just needs some thought.

Optimisation is fun

Siwastaja · « **Reply #34 on:** January 20, 2022, 05:02:53 pm »

Yeah, sure. I like optimizing and doing things "no one else can do" (supposedly) just for fun.

But looking at all your threads, it seems you need quite a bit of external help to get things done, and resist the proper solutions that would easily fix the root issues, instead resorting on ugly hacks, and all this is taking a lot of time to achieve and results in an unreliable product.

But it's great to hear you are having fun. And all your threads become useful learning experiences for others, too.

peter-h · « **Reply #35 on:** January 20, 2022, 06:45:18 pm »

" it seems you need quite a bit of external help to get things done"

That's because I work on my own. It's my own small business. I had some people to work with in the 1980s but not really since. And I use forums to get help, while doing my best to make posts which somebody else may find useful in years to come. I run a "tech" forum and know how useful they can be. And how useless they can be, but that's another story...

"resist the proper solutions that would easily fix the root issues"

It may appear that way but there are constraints I am working with, which I am not spelling out. Sometimes for commercial confidentiality reasons, sometimes for cost (I don't want to pay £10 for a chip, when the 32F417 is £5 (500+) and that is a significant difference. Also you don't get that much more RAM. You don't get a few MB. RAM is "power" but unfortunately all microcontrollers are short of RAM. Or power; in this product I have a long term proven SMPS which can deliver about 300mA and if it was 500mA then I would have to revisit the design and do a huge amount of testing; control loop stability over temperature, etc. It's a special PSU with multiple isolated outputs and high efficiency.

Well, the 32F417 has plenty of RAM for anything I want to do, until one gets to stuff like TLS (which is frankly pointless; much "security" is just fashion, dictated by people who are clueless about real risks with embedded systems, but it is ridiculously resource-hungry) or driving big LCDs with graphics, and for these things e.g. multiple concurrent TLS sessions, you would want a few MB external RAM but then you lose most of the I/O.

" instead resorting on ugly hacks, and all this is taking a lot of time to achieve and results in an unreliable product."

What one person calls an ugly hack is an elegant solution to another.
What I am getting done in a few days would IME take a month in a firm where coders get paid monthly. Especially as most refuse to use online resources, and just waste company time going up blind alleys. Most full-time people have no interest in productivity, for obvious reasons.
I don't think I am doing anything unreliable. I write code carefully and test stuff exhaustively. Been doing this since c. 1979, 2MHz Z80, and back then you really were working right on the edge of everything - unless it was a control board for a washing machine

"But it's great to hear you are having fun."

It's been fun to get properly into C after many years. GCC is much better than say IAR C in the 1980s which generated such crappy code that we would do profiling (with an ICE; something one doesn't have these days) and I would rewrite a lot of bits in assembler. But I still don't get pointers

" And all your threads become useful learning experiences for others, too."

That is partly why I spend time posting.

Back to this SPI, this is almost the best possible with 16 bit SPI mode, versus the 8 bit mode. Ignore the ringing; that's just crappy scope grounding. DMA next... and hope to close the gap completely, in 8 bit mode.

peter-h · « **Reply #36 on:** January 20, 2022, 08:46:02 pm »

Can anyone tell me if these DMAs can work without affecting each other?

DMA1 Ch 7 Stream 7 is triggered by one of the DACs and is used to feed both DACs concurrently (32 bit mode)
DMA1 Ch 0 Stream 3 is SPI2 RX
DMA1 Ch 0 Stream 4 is SPI2 TX

I am wading my way through the various config bits (I have example code but it is heavy on #defines and takes ages to unravel, so I am trying to understand it ref the RM register bit fields) and from e.g. this

it looks like the entire stream has to share all those config bits. And things like NTDR (the transfer count) are clearly common to all channels in a stream. Luckily, it seems, I am using different streams for the three jobs.

wek · « **Reply #37 on:** January 20, 2022, 09:41:49 pm »

In 'F4, *Stream* is the transferring element. You don't share anything, it's a single element.

Channel is just number, telling which input of the requests multiplexer is selected, for given Stream.

NDTR is number of transfers for given stream.

Streams in DMA influence each other in that they share one Peripheral and one Memory port, through which they perform the transfers. There's an arbiter inside DMA which decides which Stream is going to access which port next. There's also a priority system for this arbitration. Read the DMA chapter in RM0090 and read AN4031.

JW

DavidAlfa · « **Reply #38 on:** January 20, 2022, 10:28:00 pm »

I don't understand why you keep using spi Send+Receive?
When receiving large blocks of data, you aren't sending anything, just clock.
So a much efficient way would be to first send the command (read, address), then receive the data by simply writing 0 to DR and waiting for RXNE flag.
Instead sending a useless buffer to receive data.

I've tested it, there's a delay of just one spi clock between bytes.
Using spiSendReceive is much slower.

peter-h · « **Reply #39 on:** January 20, 2022, 11:00:32 pm »

I think the answer is that I have a "transmit-receive" function which can do

- transmit and receive
- transmit only (there is an "rxdump" mode)
- receive only

I know what you mean. I could optimise the code for all 3 modes. But for example, to receive, one cannot just stuff junk into the SPI TX. One has to wait for TXE before stuffing the next one in, and then wait for RXNE before reading that one out. And it is these status reg accesses which are slow, and the data writes/reads which are also slow, because the SPI is not running at 168MHz. It is running (in my case) at 42MHz, and there is a "pretty relaxed" coupling between the CPU and the SPI, which takes multiple cycles at the SPI (the APB) clock frequency. The end result is that the whole thing is a lot slower than one would expect when using a 168MHz CPU.

The current (16 bit) SPI loop I have does in fact write a 16 bit value to TX and waits for RXNE, reads out the 16 bit value, and writes another word to TX. It doesn't wait for TXE because that isn't necessary. But one still doesn't achieve gap-less comms. Not at 21mbps. At 10mbps one certainly would.

Whether one stuffs junk into TX, or stuff some "real" value into TX, makes very little difference. Most of the time is spent on the CPU-SPI interface which wastes a load of clocks. Some of those clocks are probably to avoid metastability issues (one needs ~ 3 clocks to be really safe). It could have been done better. The SPI logic could have been running at the CPU clock, not at the APB clock.

Anyway tomorrow I should test the DMA and then we will see if it totally closes the gaps.

DavidAlfa · « **Reply #40 on:** January 20, 2022, 11:27:54 pm »

Why? TXE is almost inmediately set after writing to DR, after it's been transferred into the shift register.
After the full word has been sent, the input shift register is loaded into DR and RXNE is set. This happens much later than TXE.
So you can just skip TXE except fort the beginning?
Have a look at the RM spi waveforms.

Did I mention the "Ofast" optimization? Doesn't care about size, but for best speed.

Oh and remember to check for BSY flag. That's what really indicates the SPI transfer is done!
So you don't disable CS too early. TXE migh be set, but it means the buffer is free, not the output shift register.

In any case, you don't need to send and receive data simultaneously, it's always a command-answer.
When receiving, data is don't care. All you need is sending the command+parameters, then the required clocks to get the data.
It seems your method is to fill 512 bytes TXbuffer and send it to receive 512bytes?

Anyway, SendReceive can be much simpler. I've tested this. You'll need to adapt it for the 16 bit mode.

Code: [Select]

uint8_t txBf[6] = { 0x90 };                                           // RDID command
uint8_t rxBf[6];                                                      // Receive buffer

void spi_SendReceive(uint8_t * TxData, uint8_t * RxData, uint16_t size){
  while ( !__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_TXE) );              // Wait until TX flag is ready
  while ( size-- ) {
    SPI1->DR = *TxData++;                                           // Send data
    while (!__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_RXNE));             // Wait until Rx data is done
    *RxData++ = SPI1->DR;                                           // Read data
  }
}

int main (void) {

  SET_CS(ENABLE);
  spi_SendReceive(txBf,rxBf,6);
  SET_CS(DISABLE);

}

Being 8-bit, I can run it roughly 1/2 the spi speed at 50MHZ SPI and 100MHz CPU.
Test is sending/receiving 6 bytes to read flash ID.

DavidAlfa · « **Reply #41 on:** January 21, 2022, 12:54:51 am »

And this is the diference when using separate send and receive transactions:

Code: [Select]

uint8_t txBf[6] = { 0x90 };                                         // RDID command
uint8_t rxBf[6];                                                    // Receive buffer

void spi_Send(uint8_t * TxData, uint16_t size){
  while ( size-- ) {
    while ( !__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_TXE) );            // Wait until TX flag is ready
    SPI1->DR = *TxData++;                                           // Send data
  }
  while ( !__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_TXE) );              // Wait until TX flag is ready (Following RM procedure)
  while ( __HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_BSY) );               // Wait until BUSY flag is gone
}

void spi_Receive(uint8_t * RxData, uint16_t size){
  SPI1->DR;                                                         // read DR to clear RXNE
  while ( !__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_TXE) );              // Wait until HW is idle
  while ( size-- ) {                                                // Start receiving
    SPI1->DR = 0;
    while (!__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_RXNE));             // Wait until Rx data is done
    *RxData++ = SPI1->DR;                                           // Read data
  }
}

int main (void) {

  SET_CS(ENABLE);
  spi_Send(txBf,4);
  spi_Receive(rxBf, 2);
  SET_CS(DISABLE);

}

As you see, sending data is much faster. That's because you don't need to wait until the transfer is done to keep loading the buffer, it accepts new data shortly after, when the buffer is transferred to the shift register, so it'll work almost continuously.

The problem comes when receiving. You have to wait for the transfer to be completed. The SPI peripheral stops.
Reading the data, loading DR all that causes a delay while accessing the bus/peripheral, the new data is transferred to the shift register, and the a new clock sequence starts.
This is a limitation in full-duplex mode.

DavidAlfa · « **Reply #42 on:** January 21, 2022, 01:48:14 am »

Finally, playing with half-duplex mode, the delay between bytes is gone.
It's basically the same as full duplex, but uses either MOSI or MISO, instead both at the same time.
You can also enable bidirecional mode and use only 1 pin, joining MISO/MOSI in the memory, which is usually OK.

The performance comes from the fact that in Half duplex RX only mode, the spi clock is sent continuously, without requiring writes to DR.
However it goes non-stop, the RM specifies this:

Quote

In Master receive-only mode (RXONLY=1), the communication is always continuous and
the BSY flag is always read at 1.

So you must be fast. The cpu has a whole spi word time to read the data from DR, which is more than enough if done correctly.

I modified this function, you can clearly see the huge throughput increase.

Code: [Select]

void spi_Receive(uint8_t * RxData, uint16_t size){
  
  SPI1->DR;                                                         // Clear RXNE
  __HAL_SPI_DISABLE(&hspi1);                                        // Disable SPI
  SPI1->CR1 |= SPI_CR1_RXONLY;                                      // Set RX only mode
  __HAL_SPI_ENABLE(&hspi1);                                         // Enable SPI (Starts sending clock non-stop!)
  while ( size-- ) {
    while (!__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_RXNE));             // Wait until Rx data is done
    *RxData++ = SPI1->DR;                                           // Read data
  }
  __HAL_SPI_DISABLE(&hspi1);                                        // Disable SPI
  SPI1->CR1 &= ~(SPI_CR1_RXONLY);                                   // Disable RX only mode
  __HAL_SPI_ENABLE(&hspi1);                                         // Enable SPI
}

As you see, there's an additional byte (I've seen up to 2) received by the spi.
That's because when you get RXNE, it has already started receiving the next byte, by the time you stop it ,it'll be already sent.
Not an issue really, those byte are discarded, your code will only take the required number.
You'll need to test this properly to ensure data integrity.

My scope is crap, seems Nyquist artifacts, being Hantek, for sure it has crappy filtering, poor software interpolation... makes it very hard to see anything running too fast for the sampling rate.
Most scopes show a solid block. This one usually shows nothing, or a weird wave due the sampling rate catching random stuff.

Testing bigger transfers with 256 bytes (100% speed would be 40.96us):
Send: 46us (89% thoroughput)
Read: 42us (97% thoroughput)

I've been using "__attribute__((optimize("Ofast")))" for the functions.

wek · « **Reply #43 on:** January 21, 2022, 08:06:18 am »

Quote from: DavidAlfa on January 21, 2022, 01:48:14 am

The performance comes from the fact that in Half duplex RX only mode, the spi clock is sent continuously, without requiring writes to DR.
However it goes non-stop, the RM specifies this:
Quote
In Master receive-only mode (RXONLY=1), the communication is always continuous and
the BSY flag is always read at 1.
So you must be fast. The cpu has a whole spi word time to read the data from DR, which is more than enough if done correctly.

Yes, but that implies disabled interrupts. With large blocks read, that means interrupts being disabled for quite some time. In some applications it doesn't matter, in others it does.

Also, if you don't clear RXONLY early (*before* the last frame is finished), there will be one more frame shifted in. Again, for some applications (maybe for sSPI indeed) it doesn't matter, for others it does.

JW

DavidAlfa · « **Reply #44 on:** January 21, 2022, 09:11:33 am »

This was a short test showing that spi can be made much faster without DMA.
In my case I only have 16 cpu clocks between bytes, half than 168MHz Core / 42MHz SPI, and it works.

Whatever your application needs is up to you.
There's no shift, you're reading N bytes from certain address, you have to set the address when issuing the read command. If you requested 512 bytes from 0x0, you get exactly that. Next time you'll read address 0x200.
So those extra bytes being sent in the short time while you disable/exit RXONLY mode don't care.
Some memories have automatic address increase, but that's gone after setting CE high, requiring a new read command with the address.
Interrupts? Maybe. If you're searching the fastest performance, probably there's no problem disabling interrupts between page reads, that's just ~50-100us, will cause some jittering, that's something you'll need to evaluate and address.
Of course, you can just run DMA...

peter-h · « **Reply #45 on:** January 21, 2022, 09:26:18 am »

What is the right way to detect DMA completion in this case? I need the transfer to be blocking so I can set CS=1 on the device at the end.

The DMA TX channel is triggered by SPI TXE and the DMA RX channel is triggered by SPI RXNE.

I have one code example which waits on (LISR AND DMA_LISR_TCIF0) but really it should be TCIF3 if you wait until RX has finished. Or you could read NDTR (on the RX channel, not the TX channel because that has the "UART" double buffer on the end of it) and that should be zero when all the required bytes were sent and received.

You can't check the SPI because it doesn't keep a transfer count. Only the DMA knows.

It doesn't appear to work though.

Siwastaja · « **Reply #46 on:** January 21, 2022, 10:33:22 am »

DMA RX stream completion? I mean, that can only happen when the last SPI clock cycle has been issued, slave response bit read out, and then it takes a few more clock cycles for the DMA to actually write that to memory. It should be quite safe to deassert nCS at that point, unless the slave has specific requirements for keeping nCS active for longer.

TX stream completion indeed isn't good enough because that happens almost instantly after the last SPI word transmission begins.

peter-h · « **Reply #47 on:** January 21, 2022, 11:26:55 am »

DMA works

100k pages in 22 seconds, and solid data, in 8-bit mode so no byte swap needed.

I also found that either test below works to detect end of transfer; there is a tiny difference but both achieve a good clearance to cs=1 of about 600ns

Code: [Select]

	while(true)
	{

		uint16_t temp1;
		uint32_t temp2;

		temp1 = DMA1_Stream3->NDTR;
		if ( temp1 == 0 ) break;					// transfer count = 0

		temp2 = DMA1->LISR;
		if ( (temp2 & (1<<27)) !=0 ) break;			// TCIF3

	}

peter-h · « **Reply #48 on:** January 21, 2022, 04:50:42 pm »

In case someone comes across this later, this is the DMA code

Code: [Select]

bool B_HAL_SPI_TransmitReceive(SPI_HandleTypeDef *hspi, uint8_t *pTxData, uint8_t *pRxData, uint16_t Size, bool rxdump)
{

  	// Check for null pointers
  	if ((pTxData == NULL) || (pRxData == NULL) || (Size == 0U))
  	{
  		return(false);
  	}

  	hspi->pRxBuffPtr  = (uint8_t *)pRxData;
  	uint32_t RxXferCount = Size;
  	hspi->pTxBuffPtr  = (uint8_t *)pTxData;
  	uint32_t TxXferCount = Size;

	// Check whether DMA can be used. This uses 8 bit mode because it is fast enough, and otherwise a byte swap
	// would be needed before TX and after RX. DMA cannot be used if memory is CCM.

	// Check neither end is in CCM. All other addresses are accepted because e.g. one could
	// be copying data from main FLASH to the serial FLASH.

	// We also prevent DMA usage on short blocks. The reason for this was never determined definitely but it
	// crashed the CPU (tripped the watchdog) during USB-driven FLASH writes. Most likely because there is heavy
	// SPI2 activity during the programming cycle, waiting for it to end, for 20ms. There is no point in optimising
	// short r/w. The entire API for the serial FLASH (e.g. LF_read) uses 512 byte blocks.

	uint32_t txadd = (uint32_t) hspi->pTxBuffPtr;
	uint32_t rxadd = (uint32_t) hspi->pRxBuffPtr;

	// Test for CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K (linkerscript.ld)
	if ( ((txadd>=0x10000000) && (txadd<0x10010000)) || ((rxadd>=0x10000000) && (rxadd<0x10010000)) || (Size!=512) )
	{
		goto no_dma;
	}

	// ==== We can use DMA ====

	static uint8_t rxdump_target;					// must not be in CCM :)

	// DMA ch 1 clock enable already done in b_main.c
	//RCC->AHB1ENR |= (1u << 21);					// DMA1EN=1 - DMA1 clock enable
	//hang_around_us(1);							// give it a chance to wake up

	// DMA1 Ch 0 Stream 3 is SPI2 RX

	DMA1_Stream3->CR = 0;							// disable DMA so all regs can be written

	DMA1->LIFCR = (0x03d << 22);					// clear int flags & transfer complete - 111101

	DMA1_Stream3->NDTR = Size;

	if (rxdump)
	{
		DMA1_Stream3->M0AR = (uint32_t) &rxdump_target;		// memory address to dump rx data to
	}
	else
	{
		DMA1_Stream3->M0AR = rxadd;					// memory address in normal mode
	}

	DMA1_Stream3->PAR = (uint32_t) &(SPI2->DR);		// peripheral address
	DMA1_Stream3->FCR = 0;							// direct mode

	if (rxdump)
	{
		DMA1_Stream3->CR = 0 << 25						// CHSEL: ch 0
						|  0 << 23			  			// MBURST: memory burst - single transfer ??
						|  0 << 21						// PBURST: peripheral burst - single transfer ??
						|  3 << 16						// PL: highest priority
						|  0 << 15						// PINCOS: no peripheral address increment offset
						|  0 << 13						// MSIZE: memory data size: byte
						|  0 << 11						// PSIZE: peripheral data size: byte
						|  0 << 10						// MINC: memory address increment: 0
						|  0 << 9						// PINC: peripheral address increment: 0
						|  0 << 8						// CIRC: no circular mode
						|  0 << 6						// DIR: peripheral to memory
						|  0 << 5						// PFCTRL: DMA is flow controller
						|  1 << 0;						// EN: enable stream
	}
	else
	{
		DMA1_Stream3->CR = 0 << 25						// CHSEL: ch 0
						|  0 << 23			  			// MBURST: memory burst - single transfer ??
						|  0 << 21						// PBURST: peripheral burst - single transfer ??
						|  3 << 16						// PL: highest priority
						|  0 << 15						// PINCOS: no peripheral address increment offset
						|  0 << 13						// MSIZE: memory data size: byte
						|  0 << 11						// PSIZE: peripheral data size: byte
						|  1 << 10						// MINC: memory address increment: 1
						|  0 << 9						// PINC: peripheral address increment: 0
						|  0 << 8						// CIRC: no circular mode
						|  0 << 6						// DIR: peripheral to memory
						|  0 << 5						// PFCTRL: DMA is flow controller
						|  1 << 0;						// EN: enable stream
	}

	// DMA1 Ch 0 Stream 4 is SPI2 TX

	DMA1_Stream4->CR = 0;							// disable DMA so all regs can be written

	DMA1->HIFCR = (0x03d << 0);						// clear int flags & transfer complete - 111101

	DMA1_Stream4->NDTR = Size;
	DMA1_Stream4->M0AR = txadd;						// memory address
	DMA1_Stream4->PAR = (uint32_t) &(SPI2->DR);		// peripheral address
	DMA1_Stream4->FCR = 0;							// direct mode

	DMA1_Stream4->CR = 0 << 25						// CHSEL: ch 0
					|  0 << 23			  			// MBURST: memory burst - single transfer ??
					|  0 << 21						// PBURST: peripheral burst - single transfer ??
					|  0 << 16						// PL: priority low
					|  0 << 15						// PINCOS: no peripheral address increment offset
					|  0 << 13						// MSIZE: memory data size: byte
					|  0 << 11						// PSIZE: peripheral data size: byte
					|  1 << 10						// MINC: memory address increment: 1
					|  0 << 9						// PINC: peripheral address increment: 0
					|  0 << 8						// CIRC: no circular mode
					|  1 << 6						// DIR: memory to peripheral
					|  0 << 5						// PFCTRL: DMA is flow controller
					|  1 << 0;						// EN: enable stream

	#ifdef TIMING_DEBUG
		ADS1118_an_2p5v_external(0);
	#endif

	// Config SPI2 to let DMA handle the data. These need to be cleared when transfer complete!

	SPI2->CR2 = 3;									// TXDMAEN, RXDMAEN: 11 - both set in one go

	// Wait for DMA to finish. Blocking is necessary to prevent FLASH CS=1 too early.
	// There could be a timeout here but a failure is impossible short of duff silicon.

	while(true)
	{

		// Either method below worked fine

		//uint16_t temp1;
		uint32_t temp2;

		//temp1 = DMA1_Stream3->NDTR;
		//if ( temp1 == 0 ) break;					// transfer count = 0

		temp2 = DMA1->LISR;
		if ( (temp2 & (1<<27)) !=0 ) break;			// TCIF3

	}

	SPI2->CR2 = 0;									// TXDMAEN, RXDMAEN: 00 - both set in one go

	#ifdef TIMING_DEBUG
		ADS1118_an_2p5v_external(1);
	#endif

    // Clear any rx data and the overrun flag in case not all received data was read; mandatory in rxdump mode
   	__HAL_SPI_CLEAR_OVRFLAG(hspi);

   	return (true);

It implements the "tx-only" mode (rxdump=true) when it sets the DMA RX to point to a dummy byte (which has to be in main RAM; if it is in CCM then not only doesn't it work but some funny stuff was happening) and sets the memory increment to no-increment.

The reason I use DMA only for size=512 is because that is 99.9% of the SPI FLASH usage, and not having that condition buggered up the USB write access to the file system on the SPI FLASH. I didn't find out why (it would be extremely hard other than by trial and error). I suspected it may be the tight loop which is checking the FLASH programming status, which drives the SPI heavily for the ~20ms. This isn't easy to avoid because this code is used both in startup code (no interrupts, no RTOS) and by the main code (where one could use an RTOS delay to check the FLASH programming status every few ms).

Using DMA has greatly improved the general response of the USB read access from the USB host, regardless of what internal read or write ops are taking place. The whole 512 byte page gets transferred in about 200us.

Thank you all for your input

I will get rid of the hspi references, which are pointlessly complicated. This code should have just SPI1,2,3 as the 1st parameter.

wek · « **Reply #49 on:** January 21, 2022, 05:10:10 pm »

> This code should have just SPI1,2,3 as the 1st parameter.

That would make it unnecessary complicated, as you'd need to handle the different DMA streams belonging to different SPI, too.

Except for the arbitrary "abstraction", there's no need for such universal function in real life.

JW

peter-h · « **Reply #50 on:** January 21, 2022, 05:39:10 pm »

Very good point, and I agree. By the time you have designed a PCB, most of the allocation of features is fixed.

For example I could have made good use of SPI1 (42MHz clock is possible) but lost SPI1 because I needed four UARTs, with some GPIO for them. And loads of stuff like that.

Siwastaja · « **Reply #51 on:** January 21, 2022, 06:14:29 pm »

Yeah, I stopped trying to write "generic" driver code when multiple peripherals work with each other (typical example: DMA & SPI, indeed). It's a HUGE task to get right, and usually you get just a half-ass solution which does half of the job, in other words, is useless.

But if you just write what you exactly need, the amount of code ends up so small that it's also quick to modify or rewrite, when necessary.

As you probably already realized by looking at DMA channel mapping list, there is quite limited choice available. If you need more than 1 DMA stream and more than 1 peripheral, chances are you just can't do it without going and redesigning the PCB to use a different peripheral instantiation; peripheral X might be only available on Stream Y, but peripheral Z is also only available on Stream Y, so you can't get both. (This is fixed in the F7/H7 series by introducing the DMAMUX allowing basically equivalent of >100 "channels" per stream; every peripheral available on every stream.)

This, combined with the fact that the instantiations of peripherals (like SPI1, SPI2, SPI3...) might be slightly different, makes attempts of creating a universal code interface where you can just swap between different SPIs, quite futile, unless you are ready to sacrifice something. The STM32 HAL also suffers from this compromise.

nctnico · « **Reply #52 on:** January 21, 2022, 06:30:36 pm »

Quote from: Siwastaja on January 21, 2022, 06:14:29 pm

Yeah, I stopped trying to write "generic" driver code when multiple peripherals work with each other (typical example: DMA & SPI, indeed). It's a HUGE task to get right, and usually you get just a half-ass solution which does half of the job, in other words, is useless.

Dang you are evil... you sure know how to bring back bad memories... a very long time ago I worked on a project with a web-based front-end. The programmers of that front-end spend ages on their universal layout engine for tables. In every project meeting they would come up with new and wild ideas about how to make it more universal while even in the current state it wasn't up the the task. A never ending saga!

SiliconWizard · « **Reply #53 on:** January 21, 2022, 06:57:01 pm »

Quote from: Siwastaja on January 21, 2022, 06:14:29 pm

Yeah, I stopped trying to write "generic" driver code when multiple peripherals work with each other (typical example: DMA & SPI, indeed). It's a HUGE task to get right, and usually you get just a half-ass solution which does half of the job, in other words, is useless.

Yes, this is something that pops up on a regular basis. I agree with you there. Rewriting too generic driver code is usually a waste of time, and the fact you think you're going to do it better than vendors (with their libraries) while taking the same approach is also delusional.

Now we may need to define what "generic" means here.
If that means, as vendors usually do, writing drivers that cover the complete features of every peripheral you use, then yes, it's the above approach, and is wasteful at best.
Now if that means writing drivers that cover your use case *only*, but with generic enough parameters (when possible) that porting them to other targets would be much easier - not having to change software interfaces is always a big plus - then go "generic". =)

peter-h · « **Reply #54 on:** January 22, 2022, 07:31:49 am »

I managed to break the code last night, and it turned out to be replacing

Code: [Select]

__HAL_SPI_CLEAR_OVRFLAG(hspi);
with (per the RM)

Code: [Select]

volatile uint32_t tmp;
tmp=SPI2->DR;
tmp=SPI2->SR;

but I originally did just

Code: [Select]

SPI2->DR;
SPI2->SR;

which someone here told me does a read of that register(s). Well, it does not, and I need to go through a lot of unrelated code to see where I may have used that.

I also get an "unused" compiler warning on tmp, which the macro gets around by using

Code: [Select]

/** @brief  Clear the SPI OVR pending flag.
  * @param  __HANDLE__ specifies the SPI Handle.
  *         This parameter can be SPI where x: 1, 2, or 3 to select the SPI peripheral.
  * @retval None
  */
#define __HAL_SPI_CLEAR_OVRFLAG(__HANDLE__)        \
  do{                                              \
    __IO uint32_t tmpreg_ovr = 0x00U;              \
    tmpreg_ovr = (__HANDLE__)->Instance->DR;       \
    tmpreg_ovr = (__HANDLE__)->Instance->SR;       \
    UNUSED(tmpreg_ovr);                            \
  } while(0U)

The UNUSED(tmp); fixes it.

Siwastaja · « **Reply #55 on:** January 22, 2022, 08:25:31 am »

Quote from: peter-h on January 22, 2022, 07:31:49 am

Code: [Select]
SPI2->DR; SPI2->SR;
which someone here told me does a read of that register(s). Well, it does not

It definitely does, unless you have written your own header which fails to qualify SPI2 as volatile, or use a broken compiler.

I do exactly that all the time and never had a problem.

Did you check the disassembly to verify the absence of reading the registers? Or are you just assuming it didn't happen? To me, this sounds like a timing issue where any random change fixed it. Because now, with the tmp as volatile, write to tmp is also forced, so there will be longer delay between the two reads.

I.e., original did this (in pseudo-assembly):
load SPI2->DR
load SPI2->SR

And the current incarnation does:
load SPI2->DR
store tmp # causes delay!
load SPI2->SR
store tmp

BTW, I always tend to read SR first, then DR. Don't know if this is relevant at all, just saying.

peter-h · « **Reply #56 on:** January 22, 2022, 09:14:17 am »

I will check the asm. Could also be optimisation level. Or a silicon bug which requires a delay between the reads.

But what is the objective? To clear the OVR flag, yeah, sure, but if you want to clear out all RX data, the macro won't do that. It removes just 1 value, The RX can contain 2 bytes! I've been around this block so many times. To empty out a UART with a 16 byte FIFO you have to do 17 reads. So I think one should read DR twice, to be 100% sure there is nothing in there which will bite you next time because the SPI RX won't be empty.

Code: [Select]

#define SPI2 ((SPI_TypeDef *) SPI2_BASE)
and then all this

Code: [Select]

/**
  * @brief  SPI handle Structure definition
  */
typedef struct __SPI_HandleTypeDef
{
  SPI_TypeDef                *Instance;      /*!< SPI registers base address               */

  SPI_InitTypeDef            Init;           /*!< SPI communication parameters             */

  uint8_t                    *pTxBuffPtr;    /*!< Pointer to SPI Tx transfer Buffer        */

  uint16_t                   TxXferSize;     /*!< SPI Tx Transfer size                     */

  __IO uint16_t              TxXferCount;    /*!< SPI Tx Transfer Counter                  */

  uint8_t                    *pRxBuffPtr;    /*!< Pointer to SPI Rx transfer Buffer        */

  uint16_t                   RxXferSize;     /*!< SPI Rx Transfer size                     */

  __IO uint16_t              RxXferCount;    /*!< SPI Rx Transfer Counter                  */

  void (*RxISR)(struct __SPI_HandleTypeDef *hspi);   /*!< function pointer on Rx ISR       */

  void (*TxISR)(struct __SPI_HandleTypeDef *hspi);   /*!< function pointer on Tx ISR       */

  DMA_HandleTypeDef          *hdmatx;        /*!< SPI Tx DMA Handle parameters             */

  DMA_HandleTypeDef          *hdmarx;        /*!< SPI Rx DMA Handle parameters             */

  HAL_LockTypeDef            Lock;           /*!< Locking object                           */

  __IO HAL_SPI_StateTypeDef  State;          /*!< SPI communication state                  */

  __IO uint32_t              ErrorCode;      /*!< SPI Error code                           */

#if (USE_HAL_SPI_REGISTER_CALLBACKS == 1U)
  void (* TxCpltCallback)(struct __SPI_HandleTypeDef *hspi);             /*!< SPI Tx Completed callback          */
  void (* RxCpltCallback)(struct __SPI_HandleTypeDef *hspi);             /*!< SPI Rx Completed callback          */
  void (* TxRxCpltCallback)(struct __SPI_HandleTypeDef *hspi);           /*!< SPI TxRx Completed callback        */
  void (* TxHalfCpltCallback)(struct __SPI_HandleTypeDef *hspi);         /*!< SPI Tx Half Completed callback     */
  void (* RxHalfCpltCallback)(struct __SPI_HandleTypeDef *hspi);         /*!< SPI Rx Half Completed callback     */
  void (* TxRxHalfCpltCallback)(struct __SPI_HandleTypeDef *hspi);       /*!< SPI TxRx Half Completed callback   */
  void (* ErrorCallback)(struct __SPI_HandleTypeDef *hspi);              /*!< SPI Error callback                 */
  void (* AbortCpltCallback)(struct __SPI_HandleTypeDef *hspi);          /*!< SPI Abort callback                 */
  void (* MspInitCallback)(struct __SPI_HandleTypeDef *hspi);            /*!< SPI Msp Init callback              */
  void (* MspDeInitCallback)(struct __SPI_HandleTypeDef *hspi);          /*!< SPI Msp DeInit callback            */

#endif  /* USE_HAL_SPI_REGISTER_CALLBACKS */
} SPI_HandleTypeDef;

Siwastaja · « **Reply #57 on:** January 22, 2022, 10:11:53 am »

Oh, SPI is trivial: you need to do exactly the same number of writes and reads to the DR. This is regardless of whether the "FIFO" is just one word long like on F417, or longer like on F7 series. This all stems from how the SPI protocol works.

Reading DR when no data is available may not cause any problems, but it isn't beneficial either and messes with your mind. Especially if you have a longer FIFO.

So no, an extra dummy read to "empty" the FIFO is definitely not required, you can't make empty more empty. The correct pattern is to see if RX FIFO is not empty (polling or interrupt) in this case RXNE, and if it's not, then do one read. If the RX FIFO had more than one word in it, another RXNE event (new interrupt, or polling using while() loop would succeed again) would immediately follow, causing another readout.

But since the RX "FIFO" is just 1 word, it can't contain 2 words. If RXNE = 1, then DR read returns valid data, and then RXNE = 0; you should read this value if you check. Unless your timing is so marginal that RXNE goes back to 1 immediately after reading out the previous data, but then you are on the brink of overrun anyway.

Do you even need to read SR? I don't remember, every STM32 peripheral is different and some require dummy reads on status register for no apparent reason when just reading the data out of DR would logically do. But this is at least usually documented.

The problem is you assuming, instead of checking. Earlier you assumed that compiler does not generate the reads, wasting a lot of time on it and writing a forum post about how the compiler does not do the reads. And now it turned out you never checked and are ready to go for the next theory about the delays. Whether the reads are generated or not would take just one minute to see from the disassembly.

One question, why are you posting a screenshot of I2S section? You don't have I2S, right? What makes you think anything in the I2S section is relevant to SPI use?

peter-h · « **Reply #58 on:** January 22, 2022, 01:20:21 pm »

"One question, why are you posting a screenshot of I2S section? You don't have I2S, right? What makes you think anything in the I2S section is relevant to SPI use?"

Because the RM sez so :

Follow the link.

In reality few people care for the overrun flag (unless interrupting from it, which almost nobody does). What really matters is that the RX channel is empty - otherwise you start by reading a byte of crap out of it next time.

"So no, an extra dummy read to "empty" the FIFO is definitely not required, you can't make empty more empty"

True but what is the alternative? You need to have a loop on RXNE, and keep reading until RXNE goes false. Then (if you are a "proper software engineer"

) you need a timeout on that loop, which is yet more fun if you want it to work in both startup code (ints disabled) and later code (ints + RTOS etc). You end up spending time doing it using something like CYCCNT, when reading N bytes is going to do the job perfectly.

Siwastaja · « **Reply #59 on:** January 22, 2022, 02:13:48 pm »

Well OK I understand, you are just looking at the OVR flag clear procedure, which is probably the same in SPI in I2S modes given they refer you to this section.

However, why do you want to clear the OVR flag?

"few people care for the overrun flag"; well then, I'm one of those "few" people. It is a serious error condition which indicates unrecoverable data loss, and it's most trivial to enable OVR interrupt and attach the error reporting function there, job done, no need to ever look at the flag. I normally enable all possible error interrupt sources, let them end up in the same handler, and there dump error code and all relevant status registers to UART or whatever, maybe once a second so you can attach the cable and see.

Do not depend on planned overruns which you then clear. Obviously in some cases you don't care about the data, and you can just let it overrun, but when the data matters, you don't want overruns at all; and being the SPI master yourself, preventing the overruns 100% is trivial timing work, it's your task as a programmer to guarantee, leaving the OVR flag as error detection mechanism only.

So I think recovery sequence is irrelevant then. For the same reason, I don't use timeouts for core functionality where timeout could be only triggered for two reason: colossal fuckup by me, or actual silicon failure. Sure, people love to throw around unhandled or poorly handled timeouts everywhere, but recovering from serious failures is hard. Do it well, or don't do it at all. (Latter usually works.) But even if you don't try to recover, do check for such failures and add a handler which reports a helpful error code, instead of the software starting to act randomly.

SPI master is deterministic and you have all the control. Having FIFO of 1 word, you don't need to loop on RXNE, only single event can happen. Read it out before the next one, guarantee this by code design and you are fine.

And if you don't have time to do that, then that's what DMA is for.

peter-h · « **Reply #60 on:** January 22, 2022, 02:33:53 pm »

Well, I agree on error recovery. When I stripped down that HAL SPI code, I removed all the timeouts because they either cannot happen except with duff silicon or they cannot be usefully handled at that (deep) level.

I've never implemented checking serial comms error flags

The error must be detected by the higher level code which then retries, etc. But for a retry to work you do have to clean out the RX "UART" channel, otherwise you will be forever a byte short

DMA is good, and I have just learnt a lot

I will now have a go at using DMA for this USB code. Not even sure the 32F417 DMA can connect to that USB FIFO. I don't think it can. There is a dedicated DMA for the USB I think (like there is for ETH, where you really need it) but is it not available in FS mode (only in HS mode). But if the USB FIFO has an address then surely one could do a memory-memory transfer? There is no "busy" status.

nctnico · « **Reply #61 on:** January 22, 2022, 02:39:56 pm »

Quote from: Siwastaja on January 22, 2022, 08:25:31 am

Quote from: peter-h on January 22, 2022, 07:31:49 am
Code: [Select]
SPI2->DR; SPI2->SR;
which someone here told me does a read of that register(s). Well, it does not

It definitely does, unless you have written your own header which fails to qualify SPI2 as volatile, or use a broken compiler.

This is interesting. It is the first time I ever see this mentioned. I just checked with GCC 4.6.2; the statement does perform a read but it does not result in less instructions compared to reading a value into a dummy variable (either optimised for size or speed ). That makes sense because in the end the data that is read must end up in a register even if it isn't used. I do get a warning from the IDE that it is a statement without effect though.

peter-h · « **Reply #62 on:** January 22, 2022, 02:57:08 pm »

I struggle to unravel the ST #defines and structs etc but DR itself it certainly declared as __IO i.e. volatile, otherwise very little code would ever work.

This is what I see

Siwastaja · « **Reply #63 on:** January 22, 2022, 02:59:45 pm »

Quote from: nctnico on January 22, 2022, 02:39:56 pm

Quote from: Siwastaja on January 22, 2022, 08:25:31 am
Quote from: peter-h on January 22, 2022, 07:31:49 am
Code: [Select]
SPI2->DR; SPI2->SR;
which someone here told me does a read of that register(s). Well, it does not

It definitely does, unless you have written your own header which fails to qualify SPI2 as volatile, or use a broken compiler.
This is interesting. It is the first time I ever see this mentioned. I just checked with GCC 4.6.2; the statement does perform a read but it does not result in less instructions compared to reading a value into a dummy variable (either optimised for size or speed ). That makes sense because in the end the data that is read must end up in a register even if it isn't used. I do get a warning from the IDE that it is a statement without effect though.

That is because the IDE does not understand the C language, which is no surprise, because IDEs are second-tier projects developed with much less understanding, compared to compilers which need to work.

This is because by very definition of C language, evaluating a value of a variable qualified volatile is not a statement without effect. The memory access is the effect, and needs to be generated.

Now indeed if you just assign the result of the evaluation to a normal variable which isn't qualified volatile, compiler is free to optimize that variable (and assignment to it) away and obviously will do it. But note that peter-h made that local variable volatile as well, and now compiler must write to it.

I prefer not to add any temporary variable because that generates an unnecessary warning, which then needs to be silenced, leading to excess code. I like my code to be "to the point". If I want read access to a register, I just write exactly that, it also conveys the meaning of "dummy" read very well. If I assign it to a temporary variable, it then raises questions about why this variable is never used.

Quote from: peter-h on January 22, 2022, 02:57:08 pm

I struggle to unravel the ST #defines and structs etc but DR itself it certainly declared as __IO i.e. volatile, otherwise very little code would ever work.

Yes, tracking down the actual "volatile" keyword through all the #defines requires some work, but all IO registers eventually are volatile, as you say otherwise nothing would work at all.

peter-h · « **Reply #64 on:** January 22, 2022, 03:00:35 pm »

So I went back to my non-working code, and it looks like SPI2->DR; does actually do those reads

peter-h · « **Reply #65 on:** January 22, 2022, 03:07:03 pm »

So I went back to my original hypothesis about clearing out the full two byte / two 16-bit word RX queue, with

Code: [Select]

   	SPI2->DR;
   	SPI2->DR;
   	SPI2->SR;

and it works

So the ST code, where they just clear out the OVR flag, is BS.

The scenario where this breaks is where you just happen, by design, to not read the RX data, which with SPI happens if just sending data. This is of course common. You can't prevent "stuff" being received. I guess ST got away with this by a combination of accidents. This is the relevant bit of their code (HAL_SPI_Transmit)

Code: [Select]

    /* Transmit data in 8 Bit mode */

  	// The need for this initial byte is unknown
    if (initial_TxXferCount == 0x01U)
    {
    	*((__IO uint8_t *)&SPI2->DR) = (*hspi->pTxBuffPtr);
    	hspi->pTxBuffPtr++;
    	hspi->TxXferCount--;
    }

    while (hspi->TxXferCount > 0U)
    {
    	/* Wait until TXE flag is set before loading data */
    	if (__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE))
    	{
    		*((__IO uint8_t *)&SPI2->DR) = (*hspi->pTxBuffPtr);
    		hspi->pTxBuffPtr++;
    		hspi->TxXferCount--;
    	}
    }

    //========= PH addition ===========
    // Best way to make sure TX is finished is by waiting for the RX byte (which is garbage) to arrive
    // Potential for hanging here...
    while (__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE))
    {
    	SPI2->DR;		// extract the byte and dump it
    }
    // ====================

    // Clear overrun flag in 2 Lines communication mode, just in case
   	__HAL_SPI_CLEAR_OVRFLAG(hspi);

    if (hspi->ErrorCode != HAL_SPI_ERROR_NONE)
    {
    	errorcode = HAL_ERROR;
    }

error:
  	return errorcode;

}

So the bit I added at the very end, to make it blocking for correct CS=1, masked the issue.

peter-h · « **Reply #66 on:** January 24, 2022, 11:44:43 am »

Well, I got another taste of the "loosely coupled" peripherals in this uC. At the end of this code (discussed earlier) I had a GPIO waggle, to get a pulse on a scope to measure timing. This evening, having finished with it all, I removed that line, and the code stopped working. Running my test code I was finding that the last byte (byte 511) of each page was wrong. I had noticed previously that this could be induced by not fully clearing RX data (in the "dump rx data" mode, where SPI is merely sending), so I spent a bit of time tracking it down.

It turned out that a small delay is needed before doing the final SPI2->DR (2 of them).

Code: [Select]

	  	while ((TxXferCount > 0U) || (RxXferCount > 0U))
	  	{
	  		/* Do a TX if TX empty */
			if ( txallowed && (TxXferCount > 0) )
			{
				if ( (SPI2->SR & (1<<1)) != 0 )
			    {
					*(__IO uint8_t *)&SPI2->DR = *pTxBuffPtr;
			    	pTxBuffPtr++;
			    	TxXferCount--;
			    	txallowed=rxdump;					// block another TX until RX done (tx always ok in rxdump)
			    }
			}

		    /* Do an RX if RX not empty */
			if ( !rxdump )
			{
				if ( ((SPI2->SR & 1) != 0) && (RxXferCount > 0) )
				{
	   				(*(uint8_t *)pRxBuffPtr) = SPI2->DR;
	   		    	pRxBuffPtr++;
	   		    	RxXferCount--;
	   		    	txallowed = true;					// allow tx again
		    	}
			}
	  	}

		#ifdef TIMING_DEBUG
	  		ADS1118_an_2p5v_external(1);
		#endif

  	}

    // Clear any rx data and the overrun flag in case not all received data was read; mandatory in rxdump mode
	// Have to read DR 2x to be sure the RX is cleared. The ST command __HAL_SPI_CLEAR_OVRFLAG(__HANDLE__) does
	// only 1 read of DR and this failed in rxdump mode, if DR was not read above (which it actually is).
   	// A small delay is needed because the very last transfer may not propagate to DR.

   	hang_around_us(1);
   	SPI2->DR;
   	SPI2->DR;
   	SPI2->SR;

   	return (true);

}

One could also solve it with a 3rd SPI2->DR but it's not clear with how much margin.

The APB buses are 42MHz but there are so many clocks between the CPU and the SPI...

Siwastaja · « **Reply #67 on:** January 24, 2022, 12:01:46 pm »

You can always consider resetting the SPI peripheral through the RCC reset registers. It doesn't take long, you need two writes to RCC (activate and deactivate reset), and then re-configure the SPI peripheral which is usually just a few register writes. This way, you should reliably end up with all the shift registers, buffers, and any other internal state equal to the reset state of the MCU. Maybe. Hopefully.

If you need to depend on cycle accurate synchronism between your code and peripherals, it's not going to work reliably. You need to remove the race conditions completely, not work around them.

peter-h · « **Reply #68 on:** January 24, 2022, 12:37:03 pm »

It is stuff which looks like what one might expect interfacing to a slow device of any kind.

The SPI runs at APB clock, 42MHz in this case, so let's say you get a RXNE status bit set, you can definitely read out that byte. But if say you had a 4 byte RX FIFO, with its propagation driven on a 42MHz clock, and you read it with a CPU running on a 168MHz clock, and you want to empty out that FIFO and dump the data, and you do

SPI2->DR;
SPI2->DR;
SPI2->DR;
SPI2->DR;

you may not get all four bytes - because they could not ripple through fast enough. I may be wrong in this; it is well possible that the SPI2->DR read operation itself is slow enough to enable the FIFO to propagate.

If I am right, then there are two approaches:

wait
SPI2->DR;
wait
SPI2->DR;
wait
SPI2->DR;
wait
SPI2->DR;

or have a loop testing RXNE, but that needs a timeout of some sort.

I don't get why ST run the peripherals from such a slow clock. They should have run them from the CPU clock. Lots of people have got their fingers burnt with this.

Anyway, I now have what looks like solid code; I can't break it

In DMA mode, 16 bit SPI mode, or 8 bit SPI mode. The 16 bit mode is used instead of DMA if the RAM is CCM.

peter-h · « **Reply #69 on:** February 03, 2022, 07:34:22 am »

Just in case someone finds this useful, below is the source for the 32F417 SPI3 DMA.

I had lots of problems getting it to work and it sounds like the DMA "streams" cannot be used for more than one device even if they are on different channels, so e.g. you cannot do this

Learn something every day! Basically there are just eight DMA controllers (on DMA1): stream 0-7. The "channel" is just connecting these controllers to the peripherals.

Code: [Select]

/*
*
* DMA-only version of the ultra-bloated HAL SPI TransmitReceive() but fixed for SPI3 and XX-only options added.
*
* For use where fast transfers are needed, on the limit of SPI3 speed so with zero gaps. This is impossible
* to do by polling at 10.5mbps or 21mbps and is probably marginal at 5.25mbps. The 16 bit SPI mode just
* manages gap-free with polling but works only with even block sizes, and has the "first byte problem" which
* DMA gets around.
*
* ** DMA ONLY SO NO CCM ACCESS SO THE TWO BUFFERS HAVE TO BE "STATIC" **
*
* This function is blocking so the caller can set CS=1 right away (check device data sheet!). A non-blocking
* version would make sense only if transmitting only (txonly=true) but the caller would have to tidy up
* the DMA, SPI3->CR2, etc.
*
* Two modes, obviously mutually exclusive, for tx-only and rx-only:
* If txonly, dumps rx data so you don't need to allocate a buffer for it
* If rxonly, transmits all-0x00 so you don't need to feed SPI with some "known garbage"
*
* The rx-only mode is superfluous in most cases but it does avoid shifting non-0x00 data to the device
* while we are reading data out of it. With some devices this can matter. The ADS1118 ADC is one such.
*
* SPI3 basic config is done elsewhere.
*
* Returns (false) if memory is in CCM, size=0, null pointers...
*
* NULL pointers are allowed if using the txonly or rxonly modes; then the unused one can be NULL.
*
* Originally streams 2+5 were used but it didn't work and produced bizzare effects like transfers hanging
* with the transfer counters having 1 and 127 in them, for a required count of 1. Changing the streams to
* 0+7 fixed it. Seemingly a "stream" cannot be shared across peripherals, and stream 5 is used for DAC1
* even though the "channel" is different.
*
*/

bool SPI3_DMA_TransmitReceive(uint8_t *pTxData, uint8_t *pRxData, uint16_t Size, bool txonly, bool rxonly)
{

  	// Check for invalid inputs

	if ( Size==0 ) return (false);
  	if ( (pTxData==NULL) && !rxonly ) return (false);
  	if ( (pRxData==NULL) && !txonly ) return (false);

	uint32_t txadd = (uint32_t) pTxData;
	uint32_t rxadd = (uint32_t) pRxData;
	static uint8_t txonly_target=0;					// must not be in CCM
	static uint8_t rxonly_source=0;					// must not be in CCM

	// Test for CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K (linkerscript.ld)
	if ( ((txadd>=0x10000000) && (txadd<0x10010000)) || ((rxadd>=0x10000000) && (rxadd<0x10010000)) )
	{
		return(false);
	}

	// DMA ch 1 clock enable already done in b_main.c
	//RCC->AHB1ENR |= (1u << 21);					// DMA1EN=1 - DMA1 clock enable
	//hang_around_us(1);							// give it a chance to wake up

	// DMA1 Ch 0 Stream 0 is SPI3 RX

	DMA1_Stream0->CR = 0;							// disable DMA so all regs can be written

	DMA1->LIFCR = (0x03d << 0);						// clear int flags & transfer complete - 111101 stream 0

	DMA1_Stream0->NDTR = Size;

	if (txonly)
	{
		DMA1_Stream0->M0AR = (uint32_t) &txonly_target;		// memory address to dump rx data to
	}
	else
	{
		DMA1_Stream0->M0AR = rxadd;					// memory address in normal mode
	}

	DMA1_Stream0->PAR = (uint32_t) &(SPI3->DR);		// peripheral address
	DMA1_Stream0->FCR = 0;							// direct mode

	if (txonly)
	{
		DMA1_Stream0->CR = 0 << 25					// CHSEL: ch 0
						|  0 << 23			  		// MBURST: memory burst - single transfer
						|  0 << 21					// PBURST: peripheral burst - single transfer
						|  3 << 16					// PL: highest priority
						|  0 << 15					// PINCOS: no peripheral address increment offset
						|  0 << 13					// MSIZE: memory data size: byte
						|  0 << 11					// PSIZE: peripheral data size: byte
						|  0 << 10					// MINC: memory address increment: 0
						|  0 << 9					// PINC: peripheral address increment: 0
						|  0 << 8					// CIRC: no circular mode
						|  0 << 6					// DIR: peripheral to memory
						|  0 << 5					// PFCTRL: DMA is flow controller
						|  1 << 0;					// EN: enable stream
	}
	else
	{
		DMA1_Stream0->CR = 0 << 25					// CHSEL: ch 0
						|  0 << 23			  		// MBURST: memory burst - single transfer
						|  0 << 21					// PBURST: peripheral burst - single transfer
						|  3 << 16					// PL: highest priority
						|  0 << 15					// PINCOS: no peripheral address increment offset
						|  0 << 13					// MSIZE: memory data size: byte
						|  0 << 11					// PSIZE: peripheral data size: byte
						|  1 << 10					// MINC: memory address increment: 1
						|  0 << 9					// PINC: peripheral address increment: 0
						|  0 << 8					// CIRC: no circular mode
						|  0 << 6					// DIR: peripheral to memory
						|  0 << 5					// PFCTRL: DMA is flow controller
						|  1 << 0;					// EN: enable stream
	}

	// DMA1 Ch 0 Stream 7 is SPI3 TX

	DMA1_Stream7->CR = 0;							// disable DMA so all regs can be written

	DMA1->HIFCR = (0x03d << 22);					// clear int flags & transfer complete - 111101 stream 7

	DMA1_Stream7->NDTR = Size;

	if (rxonly)
	{
		DMA1_Stream7->M0AR = (uint32_t) &rxonly_source;		// memory address to fetch dummy tx data from
	}
	else
	{
		DMA1_Stream7->M0AR = txadd;					// memory address in normal mode
	}

	DMA1_Stream7->PAR = (uint32_t) &(SPI3->DR);		// peripheral address
	DMA1_Stream7->FCR = 0;							// direct mode

	if (rxonly)
	{
		DMA1_Stream7->CR = 0 << 25					// CHSEL: ch 0
						|  0 << 23			  		// MBURST: memory burst - single transfer
						|  0 << 21					// PBURST: peripheral burst - single transfer
						|  0 << 16					// PL: priority low
						|  0 << 15					// PINCOS: no peripheral address increment offset
						|  0 << 13					// MSIZE: memory data size: byte
						|  0 << 11					// PSIZE: peripheral data size: byte
						|  0 << 10					// MINC: memory address increment: 0
						|  0 << 9					// PINC: peripheral address increment: 0
						|  0 << 8					// CIRC: no circular mode
						|  1 << 6					// DIR: memory to peripheral
						|  0 << 5					// PFCTRL: DMA is flow controller
						|  1 << 0;					// EN: enable stream
	}
	else
	{
		DMA1_Stream7->CR = 0 << 25					// CHSEL: ch 0
						|  0 << 23			  		// MBURST: memory burst - single transfer
						|  0 << 21					// PBURST: peripheral burst - single transfer
						|  0 << 16					// PL: priority low
						|  0 << 15					// PINCOS: no peripheral address increment offset
						|  0 << 13					// MSIZE: memory data size: byte
						|  0 << 11					// PSIZE: peripheral data size: byte
						|  1 << 10					// MINC: memory address increment: 1
						|  0 << 9					// PINC: peripheral address increment: 0
						|  0 << 8					// CIRC: no circular mode
						|  1 << 6					// DIR: memory to peripheral
						|  0 << 5					// PFCTRL: DMA is flow controller
						|  1 << 0;					// EN: enable stream
	}

	// Config SPI3 to let DMA handle the data. These need to be cleared when transfer complete!
	// This starts the transfer

	SPI3->CR2 |= 3;									// TXDMAEN, RXDMAEN: 11 - both set in one go
	SPI3->CR1 |= (1<<6);							// SPE=1 enable SPI

	// Wait for DMA to finish. Blocking is necessary to prevent device CS=1 too early.
	// There could be a timeout here but a failure is impossible short of duff silicon, because
	// we are a Master and generating the SPI clock.

	while(true)
	{

		// Either method below worked fine

		//uint16_t temp1;
		uint32_t temp2;

		//temp1 = DMA1_Stream0->NDTR;
		//if ( temp1 == 0 ) break;					// transfer count = 0

		temp2 = DMA1->LISR;
		if ( (temp2 & (1<<5)) !=0 ) break;			// TCIF0

	}

	SPI3->CR2 &= ~3;								// TXDMAEN, RXDMAEN: 00 - both cleared in one go

	// Clear int pending flags. They get cleared at the top of this function anyway, but...

	DMA1->LIFCR = (0x03d << 0);					// clear int flags & transfer complete - 111101 stream 0
	DMA1->HIFCR = (0x03d << 22);				// clear int flags & transfer complete - 111101 stream 7

    // Clear any rx data and the overrun flag in case not all received data was read

	SPI3->CR1 &= ~(1<<6);							// SPE=0 disable SPI

   	SPI3->DR;
	SPI3->DR;
   	SPI3->SR;

   	return (true);

}

Siwastaja · « **Reply #70 on:** February 03, 2022, 09:52:31 am »

This is what I told you earlier.

Yes, it's limiting. Luckily, SPI3_TX is also available on Stream7, so just use it.

Sometimes you get in a corner where you just can't make it work, unless you change to a different SPI for example, so you need to look at these tables already at PCB design stage.

STM32H7 comes with a DMAMUX which allows every peripheral connected to every DMA stream; basically over 100 "channels" instead of just 7.

peter-h · « **Reply #71 on:** February 03, 2022, 10:49:58 am »

I read every page of the data sheet before doing the PCB, and on the whole got everything right. A lot of time was spent on it. I didn't read the 2000 page RM

Everything worked 1st time. Well, not the USB cable insertion detection... done on issue B.

Looking back on it, the only thing not optimal was that SPI1 was lost, to get the four serial ports, and losing SPI1 means losing 42mbps SPI option (2 and 3 can do 21mbps max) which makes the FatFS file system a bit slower, but that doesn't matter, and would have compromised the min baud rate on the UART(s) driven from the same clock as SPI1 (can't recall which).

Looking at what DMA may be needed for one day, DMA2 can do the two ADCs. That could be clever e.g. storing readings in a circular buffer and then some code can add them up and divide by how many, for an easy FIR filter. I can't think what DMA would be good for for timers; perhaps storing captured values?

Yes I am using streams 0+7 now, for SPI3. You are right - I didn't have to move stream 2 to stream 0.

Siwastaja · « **Reply #72 on:** February 03, 2022, 01:54:56 pm »

Quote from: peter-h on February 03, 2022, 10:49:58 am

I read every page of the data sheet before doing the PCB, and on the whole got everything right. A lot of time was spent on it. I didn't read the 2000 page RM

Me neither; one could think the datasheet contains all the basic circuitry and peripheral mapping to allow a PCB design - I pretty much used to ignore the reference manual at PCB design stage, until reality hits in the face, and then I learn, and then I want to remind others of these traps, too.

Two examples:

* The first design to actually need more than just a few DMA channels: two SPIs, one UART + ADC, for total of 7 streams. This combination hit the inability to do DMA channel mapping at all. A different UART or different SPI would have worked out, so it was solvable with pin changes. Would have required board respin, which I didn't want to do, so used interrupt-driven UART in the end; not a big problem, but good to remember.

* Capacitor on NRST pin. Many reference designs leave this unconnected, I also did that for many many designs. Until one day I had STM32H743 which does not boot, at all. Touching NRST with oscillosscope probe makes it boot. The datasheet does not tell you this: it shows a reset button and a capacitor across the button, circled as a combination called "External reset circuit", and capacitor described as something "protecting the device from parasitic resets". Fine, I don't have a button, I don't need debouncing, and I don't get parasitic resets, the CPU never boots at all. So this can't be the problem, right?

But lo and behold, the reference manual, where this kind of circuitry thing does not belong at all, suddenly has this circuit diagram (RM0433 fig. 43), and it has completely different explanation. It shows the capacitor alone, and goes on: "when it [NRST] is not used, connect this pin to ground via a 10 to 100 nF capacitor".

So yes, reading that 3319 pages of reference manual before PCB design would be highly recommendable. Of course if the design does not use Ethernet or CAN or camera interface or DMA2D graphics accelerator etc., you can skip most of the pages.

And I actually read all of it (excluding obviously unused peripherals) before committing to project. I think I read it for a week or two, day in day out. And I still missed that capacitor thing!

peter-h · « **Reply #73 on:** February 03, 2022, 03:57:39 pm »

Hmmm... I have NRST unconnected too. Didn't put a cap on it because the debugger drives it. But presumably 10nF should be safe?

Given it has a ~40k pullup, one would need a fairly powerful radiated spike to pull it down for long enough. This is the data sheet, showing 100nF

Just tested 10nF and it works fine with STLINK V3 ISOL.

I thought about DMA and UARTs and it isn't generally useful, due to the inevitably low baud rates (below 1mbps) in NRZ async encoding. I've been in serial comms for > 40 years and very few products use high rates. Most uses 9.6k

and 115.2k is the highest generally.

Ethernet uses its own DMA - at least if using the ST library code.

I am not sure the "camera interface" is useful today. It is quite slow.

harerod · « **Reply #74 on:** February 03, 2022, 05:12:02 pm »

I just finished another STM32 design where the NRST ends up "all over the board" (slight exaggeration). MCU, DCDC-open-drain-power-good, JTAG, secondary circuitry. I prefer to connect NRST to a 10k pull-up and a 100nF to ground. This cappa should be placed closely to the MCU.
An example where this concept doesn't work, would be some new ATtiny (e.g. ATTINY814). Here the RESET-pin doubles as programming interface. It may take 1..4.7nF or so. The point is that this is not a RESET pin in the normal sense.
The NRST on STM32F4 is a real RESET pin. Thus we can add protection. And any professional programmer (JLINK, STLINK) can handle an 100nF "Angstkondensator".

Siwastaja · « **Reply #75 on:** February 03, 2022, 05:51:31 pm »

Quote from: peter-h on February 03, 2022, 03:57:39 pm

Hmmm... I have NRST unconnected too. Didn't put a cap on it because the debugger drives it. But presumably 10nF should be safe?

Given it has a ~40k pullup, one would need a fairly powerful radiated spike to pull it down for long enough. This is the data sheet, showing 100nF

The very point of my post was, the reference manual, where this information does not belong at all, gives completely different rationale for the capacitor: it is Just Needed. They don't give the exact reason, but at least H743 with external Vcore* fails to start without the capacitor, and it has nothing to do with EMI or radiated anything, the test environment was quiet and I probed that pin to verify clean signal - rising fast, though, as expected without the cap. The CPU clearly needs the slow ramp on the pin, lengthening the internally generated power-on reset pulse.

*) I'm mentioning this because it could be important, even though the thing should boot with internal LDO regardless, but maybe there is some weird power sequencing thing going on. In any case, the capacitor is needed per the reference manual, and not needed per the datasheet. That's the point. You have to read all the documentation carefully.

It's actually quite classical to somehow f*** up the internal reset circuitry. Early AVR devices were known to have issues to the point people designing in external power-on-reset driver ICs. Later AVRs supposedly fixed it, but still required smaller external pull-up and some official programmers denied working if they did not detect the correct amount of resistance.

harerod · « **Reply #76 on:** February 03, 2022, 06:14:08 pm »

The attached STM32H743 excerpt from the datasheet might shed some light on Siwastaja's NRST mystery: "6.3.16 NRST pin characteristics"
Not only does it give the "Figure 23. Recommended NRST pin protection", it also gives the "Table 65. NRST pin characteristics", which include timings. I would just assume that a pin without cappa and excellent layout (e.g. no pin loading at all) might actually violate these requirements.
However - peter-h wrangles the venerable STM32F417. The NRST is described in the datasheet under "NRST pin characteristics". Minor differences, only.

AndyC_772 · « **Reply #77 on:** February 03, 2022, 06:22:18 pm »

Quote from: Siwastaja on February 03, 2022, 05:51:31 pm

The very point of my post was, the reference manual, where this information does not belong at all, gives completely different rationale for the capacitor: it is Just Needed. They don't give the exact reason, but at least H743 with external Vcore* fails to start without the capacitor

Thank you for sharing this observation, you've probably just saved me hours of head scratching.

I've just completed a prototype using the STM32H723, my first design with an H7 device. Like the part you're using, there's nothing about needing a capacitor in the data sheet, but it's prescribed in the full Reference Manual. Fortunately I can solder one easily across the back of the debug connector, where NRST and GND appear adjacent to each other. Glad I went for the full size 0.1" header this time...!

Fig. 43 (page 309) shows a simple block diagram of the reset circuit, including a 20us pulse generator which drives the NRST pin. This block has inputs 'pwr_bor_rst' and 'pwr_por_rst', which imply that the device should generate its own nice, clean 20us reset pulse following power-up or a brown-out.

I wonder whether perhaps this pulse generator simply doesn't work, or at least, doesn't work properly at power-up?

Siwastaja · « **Reply #78 on:** February 03, 2022, 06:32:20 pm »

Quote from: AndyC_772 on February 03, 2022, 06:22:18 pm

I wonder whether perhaps this pulse generator simply doesn't work, or at least, doesn't work properly at power-up?

My take is, it generates a pulse that is too short, or does it too early. This pulse is exposed on the pin, so adding C creates a simple RC delay with the internal pull-up.

Protection against noise and ESD are additional benefits, but clearly not the actual reason why it's needed.

I also scoped my supplies and they were increasing nicely and monotonically. Internal POR circuit should have no difficulties...

Obviously nobody will follow that "recommended circuit" of the datasheet, because it includes a pushbutton, and 99% of the products do not use reset buttons, except devboards. And in that "recommendation", the capacitor is grouped with the mechanical switch, clearly indicating they belong together, making people easily ignore the capacitor.

peter-h · « **Reply #79 on:** February 03, 2022, 06:36:43 pm »

Quote

H743 with external Vcore* fails to start without the capacitor,

It won't be due to the slower risetime enabling a start. It will be because it starts up, the HS oscillator starts up, and then it samples NRST, and if it is HIGH then it bombs. So the cap is needed for a delay. Or it may just need a delay on NRST after VCC rises.

I am not surprised they screwed this up. NRST will have a messy rise; all you need is a non monotonic VCC rise and the hardware will see a "pulse" on NRST.

Historically a reset without a "supervisor" has simply not worked. I have for many years used the Seiko S808
https://eu.mouser.com/ProductDetail/ABLIC/S-80825CNMC-B8KT2G?qs=1uru1TqDsKB%2FV9OyZye9RA%3D%3D
with Atmel 90S1200 AVR and other chips. For the Hitachi H8/3xx I have used the Maxim MAX706 but have just designed out those scammers and put in the ST version
https://eu.mouser.com/ProductDetail/STMicroelectronics/STM706M6F?qs=QRsLLuHQazBglYFCH0NTIg%3D%3D

I am just about to populate 10 boards and will fit the 10nF on all, then change the design to add this.

The pushbutton is optional, but it is dumb because it will bounce around. The 417 data sheet says it will reject <100ns pulses and will accept >300ns pulses. A switch will not do either

If I really had a button I would put in a schmitt trigger to feed NRST (and made the cap TC a lot longer; say 100ms).

bson · « **Reply #80 on:** February 03, 2022, 07:39:27 pm »

Quote from: nctnico on January 22, 2022, 02:39:56 pm

Quote from: Siwastaja on January 22, 2022, 08:25:31 am
Quote from: peter-h on January 22, 2022, 07:31:49 am
Code: [Select]
SPI2->DR; SPI2->SR;
which someone here told me does a read of that register(s). Well, it does not

It definitely does, unless you have written your own header which fails to qualify SPI2 as volatile, or use a broken compiler.
This is interesting. It is the first time I ever see this mentioned. I just checked with GCC 4.6.2; the statement does perform a read but it does not result in less instructions compared to reading a value into a dummy variable (either optimised for size or speed ). That makes sense because in the end the data that is read must end up in a register even if it isn't used. I do get a warning from the IDE that it is a statement without effect though.

This was changed reasonably recently (maybe even in gcc 10). If you have a volatile variable on the stack now the compiler guarantees it occupies space on the stack throughout its entire formal scope, and all accesses to the stack slot become barriers.

harerod · « **Reply #81 on:** February 04, 2022, 09:00:00 am »

Quote from: bson on February 03, 2022, 07:39:27 pm

...This was changed reasonably recently (maybe even in gcc 10). If you have a volatile variable on the stack now the compiler guarantees it occupies space on the stack throughout its entire formal scope, and all accesses to the stack slot become barriers.

Thanks for the heads-up.
STM32CubeIDE release v1.8.0 RN0114 states
GNU Tools for STM32, based on GNU Tools for Arm Embedded Processors 9-2020-q2-update 9.3.1 20200408 (release)
So most discussions for STM32 would be about GCC 9.x

peter-h · « **Reply #82 on:** February 04, 2022, 10:08:13 am »

Yes; current Cube, v1.8.0, comes with the above v9.x compiler. I checked it myself, compiling and checking the binary matches byte for byte. I also checked the version which is output when invoking the exe from a command line.

It's been at that version for a long time.

bson · « **Reply #83 on:** February 04, 2022, 05:02:16 pm »

Quote from: harerod on February 04, 2022, 09:00:00 am

Quote from: bson on February 03, 2022, 07:39:27 pm
...This was changed reasonably recently (maybe even in gcc 10). If you have a volatile variable on the stack now the compiler guarantees it occupies space on the stack throughout its entire formal scope, and all accesses to the stack slot become barriers.
Thanks for the heads-up.
STM32CubeIDE release v1.8.0 RN0114 states
GNU Tools for STM32, based on GNU Tools for Arm Embedded Processors 9-2020-q2-update 9.3.1 20200408 (release)
So most discussions for STM32 would be about GCC 9.x

That's not necessarily true. I don't use their IDE and I use the toolchains provided by ARM:
https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-rm/downloads

keremsur · « **Reply #84 on:** April 06, 2022, 09:10:52 am »

How can one use SPI in a 16-bit mode? Do you mean actually setting SPI to 16 bits so that it is shifting out, and shifting in, a 16 bit value at a time?

AndyC_772 · « **Reply #85 on:** April 06, 2022, 10:31:45 am »

It depends on the device. Usually there's a register you can program to set the word length.

One interesting feature on some of the STM32 devices is that the SPI transmit register supports 'packing'.

What this means, is that if your code performs a 16-bit write to it, then 16 bits get clocked out. If, however, your code performs an 8-bit write, then only 8 bits are sent.

The register is usually defined in the header as 16 bits wide, because it is. This leads to an infuriating symptom, where you write:

Code: [Select]

uint8_t d;
SPI_DR = d;

... and wonder why the SPI port sends twice as many bits as it should, and every other byte is zero. Instead you must write:

Code: [Select]

uint8_t *spi_dr_8bit = (uint8_t*) &SPI_DR;
uint8_t d;
*spi_dr_8bit = d;

...since this explicitly performs an 8 bit write, as opposed to a write of an 8 bit value to a 16 bit register.

peter-h · « **Reply #86 on:** April 06, 2022, 11:54:21 am »

Yes the 32F417 SPI can do 8 bits or 16 bits.

The byte order is not configurable for the 16 bit mode however and in some applications one has to swap the two bytes in software.

The 16 bit mode does no more than sending 2 bytes in the 8 bit mode, but at least you know the 16 bits will come out without any gaps if there are interrupts etc. It also supports some code optimisation if one is polling SPI and running at high bit rates. The 32F4 peripherals are not as fast as you might expect from a 168MHz CPU; they run at say 42MHz and a lot of stuff takes several clocks, so polled implementations run surprisingly slowly, with gaps. See this thread further back.

DMA is the best way and then the 16 bit mode is generally not needed.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee