32F417 SPI running at one third the speed it should

I posted about this here, but that's a very high-traffic forum where almost nobody reads any given thread. One guy there spent some time identifying what might be the explanation but has not offered a solution. I am posting it here in the hope that someone has come across this before.

Basically the code is using polling and is stuffing bytes into an SPI master "uart" and receiving what comes back. The SPI is running at 21MHz, so the limit is about 2.6 megabytes/sec (21MHz / 8 bits per byte). It appears that even a 32F4 running at 168MHz can't cope with this!

This is the code

--- Code: ---
    while ((hspi->TxXferCount > 0U) || (hspi->RxXferCount > 0U))
    {
      /* Check TXE flag */
      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE)) && (hspi->TxXferCount > 0U) && (txallowed == 1U))
      {
        *(__IO uint8_t *)&hspi->Instance->DR = (*hspi->pTxBuffPtr);
        hspi->pTxBuffPtr++;
        hspi->TxXferCount--;
        /* Next Data is a reception (Rx). Tx not allowed */
        txallowed = 0U;
      }
      /* Check RXNE flag */
      if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE)) && (hspi->RxXferCount > 0U))
      {
        (*(uint8_t *)hspi->pRxBuffPtr) = hspi->Instance->DR;
        hspi->pRxBuffPtr++;
        hspi->RxXferCount--;
        /* Next Data is a Transmission (Tx). Tx is allowed */
        txallowed = 1U;
      }
    }
--- End code ---

The issue seems to have two parts:

- The existing code (from an ST library) is a bit dumb in that it blocks loading a TX byte until an RX byte has been retrieved. This lets the double-buffered TX run out of data, and with SPI, if you aren't sending anything you won't be getting anything back, because it is the act of sending that generates the clock. The interlock seems reasonable as a quick hack, but it ignores the fact that you can load two bytes into the TX side: one propagates through to the shift register and the other sits in the TX buffer. The RX side, while technically having the same buffering, is less usable because once you detect that data is available you don't know how much time you have to get that byte out.

- The SPI runs from its own clock, which is much slower than the 168MHz of the CPU. It is the APB clock, which in my case is 42MHz (and I can't change that for various reasons; there is an 84MHz option but it uses another SPI channel which I can't access). This means that, say, a read of the "TX buffer empty" bit actually takes a lot longer than the ~6ns CPU cycle time. It's a stupid design where they took an ARM core and "asynchronously" attached a pile of peripherals to it, so there are various oddball delays, some requiring multiple APB clocks to prevent metastability.

The answer should be DMA but that is quite tricky to get working.

I am sure it can be done with polling. Many years ago I did a floppy disk controller with a Z80 running at 2MHz and it had the same issue, which was solved by a cunning loop structure. In this case I suspect a similar trick might work, whereby the TX channel is kept full, but the RX channel is checked often enough.
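
To make the "keep TX full, check RX often" idea concrete, here is a minimal sketch of that loop structure. It runs against a toy software model of the SPI (MOSI looped back to MISO), not real hardware. All names (`spi_model_t`, `spi_xfer`, `max_in_flight`) and the cost of one bit time per status read are assumptions made for illustration; only the RXNE/TXE bit positions are taken from the STM32F4 status register. Calling it with `max_in_flight` = 1 behaves like the interlocked ST loop, while 2 lets a second byte wait in the TX buffer.

```c
#include <stdint.h>
#include <stddef.h>

/* Flag positions match the STM32F4 SPI_SR register (RXNE bit 0, TXE bit 1). */
#define SR_RXNE 0x01u
#define SR_TXE  0x02u

/* Hypothetical software model of the SPI with MOSI looped back to MISO.
   One tick is one SPI bit time; every status read costs one tick. */
typedef struct {
    uint8_t  tx_buf, shift, rx_buf;
    int      tx_full, rx_full, bits_left, overrun;
    unsigned ticks;
} spi_model_t;

static void model_tick(spi_model_t *m)
{
    if (m->bits_left > 0 && --m->bits_left == 0) {   /* byte finished */
        if (m->rx_full) m->overrun = 1;              /* unread byte is in the way */
        else { m->rx_buf = m->shift; m->rx_full = 1; }
    }
    if (m->bits_left == 0 && m->tx_full) {           /* start the next byte */
        m->shift = m->tx_buf; m->tx_full = 0; m->bits_left = 8;
    }
    m->ticks++;
}

static uint32_t spi_read_sr(spi_model_t *m)
{
    model_tick(m);
    return (m->rx_full ? SR_RXNE : 0u) | (m->tx_full ? 0u : SR_TXE);
}

static void spi_write_dr(spi_model_t *m, uint8_t b) { m->tx_buf = b; m->tx_full = 1; }
static uint8_t spi_read_dr(spi_model_t *m) { m->rx_full = 0; return m->rx_buf; }

/* Full-duplex polled transfer; returns the model's tick count afterwards.
   max_in_flight = 1 mimics the interlocked loop; 2 keeps the TX buffer
   topped up so the clock does not stop between bytes. */
static unsigned spi_xfer(spi_model_t *m, const uint8_t *tx, uint8_t *rx,
                         size_t len, size_t max_in_flight)
{
    size_t tx_left = len, rx_left = len;             /* locals, not struct fields */
    while (rx_left > 0) {
        uint32_t sr = spi_read_sr(m);                /* one status read per pass */
        if (sr & SR_RXNE) {                          /* drain RX first */
            *rx++ = spi_read_dr(m);
            rx_left--;
        }
        /* rx_left - tx_left = bytes written but not yet read back */
        if ((sr & SR_TXE) && tx_left > 0 && (rx_left - tx_left) < max_in_flight) {
            spi_write_dr(m, *tx++);
            tx_left--;
        }
    }
    return m->ticks;
}
```

In this model the two-in-flight version finishes in fewer ticks because the shifter never idles. The catch on real hardware is the one already mentioned: with two bytes in flight, each received byte has to be read within one byte time or it gets overwritten, which the model records in `overrun`.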


--- Quote from: peter-h on January 14, 2022, 05:00:54 pm ---One guy there spent some time on identifying what might be the explanation but has not offered a solution.

--- End quote ---
Beg your pardon. You've been told several times, use DMA. That *is* the solution. That you refuse it, is your problem.


I am happy to try it if you can post some code to get me started. It takes hours of reading just to find out which of the 135 DMA channels is the one to use :)

At best you only have 64 clocks/byte with a 168MHz core and 21MHz SPI. There are quite a lot of checks and branches, plus a few flash cache misses (another 4-7 clocks)... it looks pretty tight!
I suggest setting up the DMA in CubeMX, because it really takes close to no effort. Then start increasing the frequency until you start losing bytes/clocks.
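
The 64 clocks/byte figure is plain arithmetic: 8 SPI clocks per byte, times the ratio of core clock to SPI clock. A small sketch for trying other clock combinations (the function names are made up for illustration):

```c
#include <stdint.h>

/* SPI payload rate in bytes per second: one byte takes 8 SPI clocks. */
static uint32_t spi_bytes_per_sec(uint32_t spi_hz)
{
    return spi_hz / 8u;
}

/* Core clock cycles available to handle each byte at the full SPI rate. */
static uint32_t core_clocks_per_byte(uint32_t core_hz, uint32_t spi_hz)
{
    return (uint32_t)(((uint64_t)core_hz * 8u) / spi_hz);
}
```

For a 168MHz core and 21MHz SPI this gives 64 clocks per byte and 2,625,000 bytes/sec. Every status poll, branch and peripheral-bus wait in the loop has to fit inside that budget.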

One thing I wondered was whether stuff like hspi->TxXferCount is wasting time. It seems to be indexing into a structure and possibly doing so at runtime.

Variables like the byte counts should be in registers.

Currently I am using -Og optimisation level. I have tried various others and -O3 seems to break some code, and apparently (there was a thread on this) this is to be expected since it is an experimental thing. But I could try it, or use some directive to create register variables.

I've never used Cube MX but know somebody who has.

Presumably one would replace just that one loop with DMA, plus a mysterious bit of code before it which handles a single-byte case:

--- Code: ---
    // The need for this initial byte is unknown
    if (initial_TxXferCount == 0x01U)
      *((__IO uint8_t *)&hspi->Instance->DR) = (*hspi->pTxBuffPtr);
--- End code ---

