Topic: 32F417 SPI running at one third the speed it should

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5835
  • Country: es
Re: 32F417 SPI running at one third the speed it should
« Reply #25 on: January 20, 2022, 09:36:23 am »
Why do you have both TxXferCount and RxXferCount? You're the master: you have to transmit to receive data, so the size will be the same in either case.
Why do you need the txallowed flag? You always have to transmit first; then the received data will be placed in the DR register, setting the RXNE flag.
Another possible tweak: if you wait for RXNE, then you don't need to check TXE, since you know the buffer will be empty and ready for a new transfer. You would then only need to check it at the beginning.

Also, don't forget -O2 optimization; sometimes it's faster than -O3, and it generates smaller code.
Just thinking out loud, I've not tested it, but I think you can simplify it to make it faster.
Code: [Select]
uint16_t val16;

while ( !__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE) );                // Wait until TX flag is ready to transfer
while ( XferCount )
{
  SPI2->DR = (uint16_t)((*(uint16_t *)hspi->pTxBuffPtr >> 8) | (*(uint16_t *)hspi->pTxBuffPtr << 8));  // byte swap then send (pTxBuffPtr is a byte pointer, so cast to read 16 bits)
  hspi->pTxBuffPtr+=2;                                            // Increase TX buffer pointer 
  while (!__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE));               // Wait until Rx data is done
  val16 = SPI2->DR;                                               // Read DR
  (*(uint16_t *)hspi->pRxBuffPtr) = (val16>>8) | (val16<<8);   // byte swap
  hspi->pRxBuffPtr+=2;                                            // Increase Rx buffer pointer
  XferCount-=2;                                                   // Decrease remaining bytes
}

You could also read the received data but process it while the next transfer is in progress, so the waiting time is used for the byte swapping and pointer incrementing.
That way the gap between transfers will be reduced. Again, just an idea, not tested.
Code: [Select]
bool first = true;
uint16_t val16;

while ( !__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE) );                // Wait until TX flag is ready to transfer
while ( XferCount )
{
  SPI2->DR = (uint16_t)((*(uint16_t *)hspi->pTxBuffPtr >> 8) | (*(uint16_t *)hspi->pTxBuffPtr << 8));  // byte swap then send (pTxBuffPtr is a byte pointer, so cast to read 16 bits)
  hspi->pTxBuffPtr+=2;                                            // Increase TX buffer pointer
  if ( !first )                                                   // Use transfer time to store previous received data
  {
    (*(uint16_t *)hspi->pRxBuffPtr) = (val16>>8) | (val16<<8);   // byte swap
    hspi->pRxBuffPtr+=2;                                          // Increase Rx buffer pointer   
  }
 
  while ( !__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE) );             // Wait until Rx data is done
  val16= SPI2->DR;                                                // Read DR
  first = false;                                                  // No longer the first byte
  XferCount-=2;                                                   // Decrease remaining bytes
}
(*(uint16_t *)hspi->pRxBuffPtr) = (val16>>8) | (val16<<8);   // Store last received data which will not be done inside the loop

Also, you can swap the bytes outside the SPI loop to increase the speed further. Process the TX buffer before sending, process the RX buffer after receiving.
I think you would need to do this in any case if you use DMA.
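
A minimal sketch of the "swap outside the SPI loop" idea, assuming the block is held in a plain byte buffer. The function name `swap_bytes_in_pairs` is made up for illustration; it is not part of the ST HAL. The intent is to run it over the TX buffer before sending and over the RX buffer after receiving, so the SPI loop itself only moves 16-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of every 16-bit pair in buf, in place.
 * len is in bytes; a trailing odd byte (if any) is left untouched.
 * Hypothetical helper, not an ST HAL function. */
static void swap_bytes_in_pairs(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        uint8_t tmp = buf[i];
        buf[i]      = buf[i + 1];
        buf[i + 1]  = tmp;
    }
}
```

On a Cortex-M4 the CMSIS `__REV16` intrinsic could probably do the same job two halfwords at a time on a word-aligned buffer, but that variant is untested here.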
« Last Edit: January 20, 2022, 09:53:26 am by DavidAlfa »
Hantek DSO2x1x            Drive        FAQ          DON'T BUY HANTEK! (Aka HALF-MADE)
Stm32 Soldering FW      Forum      Github      Donate
 

Offline betocool

  • Regular Contributor
  • *
  • Posts: 96
  • Country: au
Re: 32F417 SPI running at one third the speed it should
« Reply #26 on: January 20, 2022, 10:00:44 am »
I missed what the CCM was...

I see that you have a pause in your trace which is almost as long as your transfer.

If you were to implement DMA, you could have 512 or 1024 half-words transferred in one go. Once the transfer finishes, process the data and start a new DMA transfer. You don't even need to change the parameters.

One further improvement is having a circular buffer with double interrupts, and you never need set up a DMA transfer again, but that's more useful for continuous streaming.

Both the above work in RX or TX or RX/TX. Have a look at the reference manual as well, sometimes they have a high level example on exactly what you want to do.
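
The circular-buffer idea can be sketched as below. This is only an illustration under assumptions: the buffer size, `process_block`, and the two handler names are hypothetical, and on real hardware the handlers would be called from the DMA half-transfer and transfer-complete interrupts while the DMA keeps running in circular mode.

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_BUF_HALFWORDS 1024u              /* hypothetical: two halves of 512 half-words */

static uint16_t dma_buf[DMA_BUF_HALFWORDS];  /* buffer the circular DMA would fill */
static uint32_t checksum;                    /* stand-in for "process the data" */

/* Placeholder processing step: here it just sums the half-words. */
static void process_block(const uint16_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        checksum += p[i];
    }
}

/* Half-transfer event: the first half is complete, DMA is now filling the second. */
void on_half_transfer(void)
{
    process_block(&dma_buf[0], DMA_BUF_HALFWORDS / 2);
}

/* Transfer-complete event: the second half is complete, DMA has wrapped to the first. */
void on_transfer_complete(void)
{
    process_block(&dma_buf[DMA_BUF_HALFWORDS / 2], DMA_BUF_HALFWORDS / 2);
}
```

Each handler only touches the half that the DMA is not currently writing, which is why the transfer never needs to be set up again. The processing must finish before the DMA comes back around to that half.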

Cheers,

Alberto
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3671
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F417 SPI running at one third the speed it should
« Reply #27 on: January 20, 2022, 10:55:30 am »
Thank you all.

Yes it did occur to me that two counters are always dumb with SPI when both TX and RX are enabled; that came from the ST code.

I will work on this some more over the next few days, eventually doing the DMA too.

The byte swap ought to not be needed if DMA is used because I would then use the 8-bit mode.

Currently I am wasting time on a special case of this function which is tx-only, and is used for "just writing". When it has sent out the block (1-512 bytes) it merely clears the RX overrun condition (which is reading a couple of registers, per the RM). But the 16 bit version of it doesn't work; the data is somehow corrupted. When I do the buffer compare in my test code, where the data should be 00 01 02 03 I see 00 00 02 00 04 00. Like my byte swap code was duff. The 8 bit SPI mode does work, and I struggle to see what is different, other than the obvious. And I can trivially solve it by not running 16 bit mode in this TX-only case (since it is used only for writing the flash, which always gets you a ~20ms per page hit).

I can't use the existing function as-is for tx-only because SPI always receives "some" data and it has to go "somewhere", so you either dump it within the loop by doing just

Code: [Select]
  if ( (SPI2->SR & 1) != 0 )
    {
    SPI2->DR; // in rxdump mode, just dump any rx data
    }

or by dumping it into a buffer, and you don't really want to dump it into the source buffer because the calling code may not like that :)

I will post on progress...

Code: [Select]
  if (rxdump)
  {
  RxXferCount=0;
  }

  if (Size==512)
  {

  // Set 16 bit SPI2 mode
  SPI2->CR1 &= ~(0x1 << 6); // disable SPI2
  SPI2->CR1 |= (0x1 << 11); // set 16 bit mode
  SPI2->CR1 |= (0x1 << 6); // enable SPI2

#ifdef TIMING_DEBUG
  ADS1118_an_2p5v_external(0);
#endif

  // Loop, sending out TX buffer while receiving bytes into RXbuffer
  // Both counts must be the same (with SPI), and if you are just receiving then TX data can be anything

  volatile uint16_t val16,val16b;

  while ( (TxXferCount|RxXferCount) != 0 ) // while ((TxXferCount > 0U) || (RxXferCount > 0U))
  {

  /* Do a transmit if TX empty */
  if ( txallowed && (TxXferCount > 0) )
  {
  val16 = (*hspi->pTxBuffPtr); // Get TX data speculatively
  val16b = (val16>>8) | (val16<<8); // byte swap

  if ( (SPI2->SR & (1<<1)) != 0 ) // if ( __HAL_SPI_GET_FLAG(hspi, SPI_FLAG_TXE) )
  {
  SPI2->DR = val16b;
  hspi->pTxBuffPtr+=2;
  TxXferCount-=2;
  txallowed=rxdump; // block another TX until RX done (except in rxdump mode)
  }
  }

  /* Do a receive if RX not empty */
  if (!rxdump)
  {
    if ( ((SPI2->SR & 1) != 0) && (RxXferCount > 0) ) // if ((__HAL_SPI_GET_FLAG(hspi, SPI_FLAG_RXNE)) etc
    {
    val16 = SPI2->DR;
    (*(uint16_t *)hspi->pRxBuffPtr) = (val16>>8) | (val16<<8); // byte swap
    hspi->pRxBuffPtr+=2;
    RxXferCount-=2;
    txallowed = true;
    }
  }
  else
  {
    if ( (SPI2->SR & 1) != 0 )
    {
    SPI2->DR; // in rxdump mode, just dump any rx data
    }
  }

  }

#ifdef TIMING_DEBUG
  ADS1118_an_2p5v_external(1);
#endif

  // Clear any rx data and the overrun flag in case not all received data was read; mandatory in rxdump mode
  __HAL_SPI_CLEAR_OVRFLAG(hspi);

  // Set back to 8 bit SPI2 mode
  SPI2->CR1 &= ~(0x1 << 6); // disable SPI2
  SPI2->CR1 &= ~(0x1 << 11); // set 8 bit mode
  SPI2->CR1 |= (0x1 << 6); // enable SPI2

  }
  else
  {
  // This is the non-512 byte case - uses 8 bit SPI mode

I suspected that the rxdump mode issue is with the 16 bit mode setting

Code: [Select]
  // Set back to 8 bit SPI2 mode
  SPI2->CR1 &= ~(0x1 << 6); // disable SPI2
  SPI2->CR1 &= ~(0x1 << 11); // set 8 bit mode
  SPI2->CR1 |= (0x1 << 6); // enable SPI2

and needing a delay after re-enabling SPI, but delays don't make any difference.
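
One possible explanation, offered as a guess from the code above rather than a confirmed diagnosis: in ST's HAL `pTxBuffPtr` is declared as a byte pointer, so `val16 = (*hspi->pTxBuffPtr);` would fetch only one byte per 16-bit frame, whereas the RX path casts to `uint16_t *`. The host-side sketch below (all helper names are hypothetical) models what would reach the wire in 16-bit, MSB-first mode for both ways of fetching the TX word.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte swap as used in the posted loop. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Split a 16-bit frame into the two bytes seen on the wire, MSB first. */
static void put_frame(uint8_t *wire, uint16_t frame)
{
    wire[0] = (uint8_t)(frame >> 8);
    wire[1] = (uint8_t)(frame & 0xFFu);
}

/* TX word fetched through a byte pointer: only tx[i] is read per frame. */
static void wire_from_byte_deref(const uint8_t *tx, size_t len, uint8_t *wire)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        uint16_t val16 = tx[i];                  /* models: val16 = (*hspi->pTxBuffPtr); */
        put_frame(&wire[i], swap16(val16));
    }
}

/* TX word fetched as a little-endian half-word, as a uint16_t * cast would on Cortex-M. */
static void wire_from_halfword_deref(const uint8_t *tx, size_t len, uint8_t *wire)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        uint16_t val16 = (uint16_t)(tx[i] | (tx[i + 1] << 8));
        put_frame(&wire[i], swap16(val16));
    }
}
```

For the input 00 01 02 03 04 05 the byte-pointer version yields 00 00 02 00 04 00, which matches the corruption described, while the half-word version yields the original sequence. If that is indeed the cause, casting the TX fetch the same way as the RX store would be the thing to try.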
« Last Edit: January 20, 2022, 11:10:27 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 486
  • Country: sk
Re: 32F417 SPI running at one third the speed it should
« Reply #28 on: January 20, 2022, 11:24:26 am »
> Yes it did occur to me that two counters are always dumb with SPI when both TX and RX are enabled; that came from the ST code.

This piece of code has some history. It started from the same premise as you have: that you can have either one or two frames in transit, given the double-buffering of both Tx and Rx. For that, you need to keep track of transmitted and received frames separately.

This "worked" in simplistic test cases, failing only when somebody tried to use it in an actual program with interrupts, where the receiver was not serviced in time and the two incoming frames overran. From there stems txallowed, obviously a patch.

JW

 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3671
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F417 SPI running at one third the speed it should
« Reply #29 on: January 20, 2022, 03:31:24 pm »
With everyone's help I got the 100k page time from 40 secs (was 37 but I had to add some code to implement the rx-dump mode) to 30 secs. It is about 60% of the theoretical SPI speed (21 Mbps).

It's gonna be DMA next...
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8110
  • Country: fi
Re: 32F417 SPI running at one third the speed it should
« Reply #30 on: January 20, 2022, 04:03:25 pm »
CCM might be a bad idea if it prevents you from using DMA.

Where does the data come from / where does it go to next? Use DMA for that piece, too.

If the data is generated or "consumed" (i.e. parsed) by the CPU, in other words if it all has to go through the CPU, try to make 32-bit accesses on the bus; the advantage of the CCM will then be small, so you can just as well move it to regular RAM.
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3671
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F417 SPI running at one third the speed it should
« Reply #31 on: January 20, 2022, 04:27:07 pm »
Well, yes, a good question :)

The 32F417 has 128k of normal RAM and 64k of CCM RAM. And somebody (i.e. me) had to decide how to spread this around. It can be debated all sorts of ways, and the scheme used now is not what was used earlier in the development.

We have an RTOS (FreeRTOS) and after power-up initialisation practically everything that runs on this box is running as an RTOS thread. I decided to allocate the 64k CCM to the RTOS (what they call application stacks). It looks like a good number to provide for that purpose. And this is "fast" RAM.

Then I need about 50k available to a single malloc for MbedTLS (whoever wrote that POS doesn't know what "embedded" stands for :) ). And if that came out of CCM it would leave very little. Also TLS won't be used much and when it does it won't need to be specially fast (it is slow anyway).

A time-critical process reading the SPI FLASH is USB, which runs wholly via two ISRs; one for reading 512 byte sectors and one for writing them. The reading one is important to get fast, because that interrupt is happening anytime a USB cable is connected and with a PC on the other end... and that ISR blocks everything else for the duration of a page read (200us in theory; currently around 400us). The USB buffers are in main RAM and thus DMA-accessible.

There won't be a whole lot of RTOS activity accessing the FLASH and especially not with critical timing requirements, and these won't be able to use DMA so the FLASH code falls back to a) 16 bit SPI mode and then b) 8 bit SPI mode.

Then I have power-up stuff like a boot loader for programming the FLASH but that obviously runs rarely, but can use DMA. The power-up code uses a stack in main RAM for various buffers which are at the top of main RAM.

So I think this is the best compromise.

I still haven't found a way to handle the USB sector write and doing that properly i.e. returning a BUSY to force the host machine to keep retrying, instead of just hanging in the ISR for the page write time of ~20ms and then returning OK (which actually works fine). I posted a thread on it where almost everybody told me to spend half my life writing a driver for the FLASH :)


 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8110
  • Country: fi
Re: 32F417 SPI running at one third the speed it should
« Reply #32 on: January 20, 2022, 04:35:44 pm »
Oh, the root cause of your problems here seems to be an undersized microcontroller. The fact that you need to shuffle things around between normal RAM and CCM, not only for performance but because there is too little RAM, is pretty revealing. Also, the SPI throughput required seems a close call.

You should pick a larger part, for example from the F7 series. What you get is SPI with FIFOs, which enables massively better performance with simpler program logic even if you don't use DMA. You can do 32-bit writes to the FIFOs and avoid race conditions between write/read because there is room for "one more" word in the TX FIFO; the only thing you need to take care of is to write and read the same amount.

And of course, enough RAM. And, 32-bit access between the SPI peripheral and DMA to save bus cycles.
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3671
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F417 SPI running at one third the speed it should
« Reply #33 on: January 20, 2022, 04:55:09 pm »
Well, an x86 at 4GHz and 24GB of RAM would be even better :)

The 32F417 is fine for this. It just needs some thought.

Optimisation is fun :)
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8110
  • Country: fi
Re: 32F417 SPI running at one third the speed it should
« Reply #34 on: January 20, 2022, 05:02:53 pm »
Yeah, sure. I like optimizing and doing things "no one else can do" (supposedly) just for fun.

But looking at all your threads, it seems you need quite a bit of external help to get things done, and resist the proper solutions that would easily fix the root issues, instead resorting on ugly hacks, and all this is taking a lot of time to achieve and results in an unreliable product.

But it's great to hear you are having fun. And all your threads become useful learning experiences for others, too.
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3671
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F417 SPI running at one third the speed it should
« Reply #35 on: January 20, 2022, 06:45:18 pm »
" it seems you need quite a bit of external help to get things done"

That's because I work on my own. It's my own small business. I had some people to work with in the 1980s but not really since. And I use forums to get help, while doing my best to make posts which somebody else may find useful in years to come. I run a "tech" forum and know how useful they can be. And how useless they can be, but that's another story...

"resist the proper solutions that would easily fix the root issues"

It may appear that way but there are constraints I am working with, which I am not spelling out. Sometimes for commercial confidentiality reasons, sometimes for cost (I don't want to pay £10 for a chip when the 32F417 is £5 (500+), and that is a significant difference). Also you don't get that much more RAM. You don't get a few MB. RAM is "power" but unfortunately all microcontrollers are short of RAM. Or power; in this product I have a long term proven SMPS which can deliver about 300mA and if it was 500mA then I would have to revisit the design and do a huge amount of testing; control loop stability over temperature, etc. It's a special PSU with multiple isolated outputs and high efficiency.

Well, the 32F417 has plenty of RAM for anything I want to do, until one gets to stuff like TLS (which is frankly pointless; much "security" is just fashion, dictated by people who are clueless about real risks with embedded systems, but it is ridiculously resource-hungry) or driving big LCDs with graphics, and for these things e.g. multiple concurrent TLS sessions, you would want a few MB external RAM but then you lose most of the I/O.

" instead resorting on ugly hacks, and all this is taking a lot of time to achieve and results in an unreliable product."

What one person calls an ugly hack is an elegant solution to another.
What I am getting done in a few days would IME take a month in a firm where coders get paid monthly. Especially as most refuse to use online resources, and just waste company time going up blind alleys. Most full-time people have no interest in productivity, for obvious reasons.
I don't think I am doing anything unreliable. I write code carefully and test stuff exhaustively. Been doing this since c. 1979, 2MHz Z80, and back then you really were working right on the edge of everything - unless it was a control board for a washing machine :)

"But it's great to hear you are having fun."

It's been fun to get properly into C after many years. GCC is much better than say IAR C in the 1980s which generated such crappy code that we would do profiling (with an ICE; something one doesn't have these days) and I would rewrite a lot of bits in assembler. But I still don't get pointers ;)

" And all your threads become useful learning experiences for others, too."

That is partly why I spend time posting.

Back to this SPI, this is almost the best possible with 16 bit SPI mode, versus the 8 bit mode. Ignore the ringing; that's just crappy scope grounding. DMA next... and hope to close the gap completely, in 8 bit mode.

« Last Edit: January 20, 2022, 06:53:26 pm by peter-h »
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3671
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F417 SPI running at one third the speed it should
« Reply #36 on: January 20, 2022, 08:46:02 pm »
Can anyone tell me if these DMAs can work without affecting each other?

DMA1 Ch 7 Stream 7 is triggered by one of the DACs and is used to feed both DACs concurrently (32 bit mode)
DMA1 Ch 0 Stream 3 is SPI2 RX
DMA1 Ch 0 Stream 4 is SPI2 TX

I am wading my way through the various config bits (I have example code but it is heavy on #defines and takes ages to unravel, so I am trying to understand it ref the RM register bit fields) and from e.g. this



it looks like the entire stream has to share all those config bits. And things like NDTR (the transfer count) are clearly common to all channels in a stream. Luckily, it seems, I am using different streams for the three jobs.
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 486
  • Country: sk
Re: 32F417 SPI running at one third the speed it should
« Reply #37 on: January 20, 2022, 09:41:49 pm »
In 'F4, *Stream* is the transferring element. You don't share anything, it's a single element.

Channel is just a number, telling which input of the request multiplexer is selected for the given Stream.

NDTR is the number of transfers for the given Stream.

Streams in DMA influence each other in that they share one Peripheral and one Memory port, through which they perform the transfers. There's an arbiter inside DMA which decides which Stream is going to access which port next. There's also a priority system for this arbitration.  Read the DMA chapter in RM0090 and read AN4031.
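
To make the Stream/Channel distinction concrete, here is a rough sketch of how the per-Stream control word might be composed for the two SPI2 Streams mentioned above. The bit positions are as understood from the DMA_SxCR description in RM0090 and should be checked against the manual; `stream_cfg_t` and `make_stream_cr` are illustrative names, not ST API.

```c
#include <stdint.h>

/* DMA_SxCR field positions on STM32F4 (assumed from RM0090; verify before use). */
#define CR_DIR_Pos     6u   /* 2 bits: 0 = peripheral-to-memory, 1 = memory-to-peripheral */
#define CR_MINC_Pos   10u   /* memory address increment */
#define CR_PSIZE_Pos  11u   /* 2 bits: 0 = byte, 1 = half-word, 2 = word */
#define CR_MSIZE_Pos  13u   /* 2 bits: same encoding as PSIZE */
#define CR_PL_Pos     16u   /* 2 bits: priority level used by the arbiter, 0..3 */
#define CR_CHSEL_Pos  25u   /* 3 bits: which request input feeds this Stream */

/* Hypothetical description of one Stream's settings. */
typedef struct {
    uint8_t channel;   /* 0..7, request multiplexer input */
    uint8_t dir;       /* 0 = periph->mem, 1 = mem->periph */
    uint8_t priority;  /* 0 (low) .. 3 (very high) */
    uint8_t size;      /* 0 = byte, 1 = half-word, 2 = word */
} stream_cfg_t;

/* Build the control word for ONE Stream; nothing here is shared with other Streams. */
static uint32_t make_stream_cr(stream_cfg_t c)
{
    return ((uint32_t)(c.channel  & 7u) << CR_CHSEL_Pos)
         | ((uint32_t)(c.priority & 3u) << CR_PL_Pos)
         | ((uint32_t)(c.size     & 3u) << CR_MSIZE_Pos)
         | ((uint32_t)(c.size     & 3u) << CR_PSIZE_Pos)
         | (1u << CR_MINC_Pos)
         | ((uint32_t)(c.dir      & 3u) << CR_DIR_Pos);
}
```

For example, SPI2 RX on Stream 3 would use `{ .channel = 0, .dir = 0, ... }` and SPI2 TX on Stream 4 `{ .channel = 0, .dir = 1, ... }`. Both select Channel 0, yet each Stream has its own control word and its own NDTR. The priority field is what the arbiter looks at when several Streams compete for the ports.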

JW
 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5835
  • Country: es
Re: 32F417 SPI running at one third the speed it should
« Reply #38 on: January 20, 2022, 10:28:00 pm »
I don't understand why you keep using spi Send+Receive?
When receiving large blocks of data, you aren't sending anything, just clock.
So a much more efficient way would be to first send the command (read, address), then receive the data by simply writing 0 to DR and waiting for the RXNE flag,
instead of sending a useless buffer to receive data.

I've tested it, there's a delay of just one spi clock between bytes.
Using spiSendReceive is much slower.
« Last Edit: January 20, 2022, 11:00:40 pm by DavidAlfa »
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3671
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F417 SPI running at one third the speed it should
« Reply #39 on: January 20, 2022, 11:00:32 pm »
I think the answer is that I have a "transmit-receive" function which can do

- transmit and receive
- transmit only (there is an "rxdump" mode)
- receive only

I know what you mean. I could optimise the code for all 3 modes. But for example, to receive, one cannot just stuff junk into the SPI TX. One has to wait for TXE before stuffing the next one in, and then wait for RXNE before reading that one out. And it is these status reg accesses which are slow, and the data writes/reads which are also slow, because the SPI is not running at 168MHz. It is running (in my case) at 42MHz, and there is a "pretty relaxed" coupling between the CPU and the SPI, which takes multiple cycles at the SPI (the APB) clock frequency. The end result is that the whole thing is a lot slower than one would expect when using a 168MHz CPU.

The current (16 bit) SPI loop I have does in fact write a 16 bit value to TX and waits for RXNE, reads out the 16 bit value, and writes another word to TX. It doesn't wait for TXE because that isn't necessary. But one still doesn't achieve gap-less comms. Not at 21 Mbps. At 10 Mbps one certainly would.

Whether one stuffs junk into TX, or stuff some "real" value into TX, makes very little difference. Most of the time is spent on the CPU-SPI interface which wastes a load of clocks. Some of those clocks are probably to avoid metastability issues (one needs ~ 3 clocks to be really safe). It could have been done better. The SPI logic could have been running at the CPU clock, not at the APB clock.

Anyway tomorrow I should test the DMA and then we will see if it totally closes the gaps.

 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5835
  • Country: es
Re: 32F417 SPI running at one third the speed it should
« Reply #40 on: January 20, 2022, 11:27:54 pm »
Why? TXE is set almost immediately after writing to DR, once the data has been transferred into the shift register.
After the full word has been sent, the input shift register is loaded into DR and RXNE is set. This happens much later than TXE.
So you can just skip TXE except for the beginning?
Have a look at the RM spi waveforms.

Did I mention the "Ofast" optimization? It doesn't care about size, only about best speed.

Oh, and remember to check the BSY flag. That's what really indicates the SPI transfer is done,
so you don't release CS too early. TXE might be set, but it means the buffer is free, not the output shift register.

In any case, you don't need to send and receive data simultaneously; it's always command-answer.
When receiving, the transmitted data is don't-care. All you need is to send the command+parameters, then the required clocks to get the data.
It seems your method is to fill a 512-byte TX buffer and send it to receive 512 bytes?



Anyway, SendReceive can be much simpler. I've tested this. You'll need to adapt it for the 16 bit mode.
Code: [Select]
uint8_t txBf[6] = { 0x90 };                                           // RDID command
uint8_t rxBf[6];                                                      // Receive buffer

void spi_SendReceive(uint8_t * TxData, uint8_t * RxData, uint16_t size){
  while ( !__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_TXE) );              // Wait until TX flag is ready
  while ( size-- ) {
    SPI1->DR = *TxData++;                                           // Send data
    while (!__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_RXNE));             // Wait until Rx data is done
    *RxData++ = SPI1->DR;                                           // Read data
  }
}

int main (void) {

  SET_CS(ENABLE);
  spi_SendReceive(txBf,rxBf,6);
  SET_CS(DISABLE);

}

Being 8-bit, I can run it at roughly 1/2 the SPI speed, with 50MHz SPI and 100MHz CPU.
Test is sending/receiving 6 bytes to read flash ID.
« Last Edit: January 21, 2022, 07:45:19 am by DavidAlfa »
 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5835
  • Country: es
Re: 32F417 SPI running at one third the speed it should
« Reply #41 on: January 21, 2022, 12:54:51 am »
And this is the difference when using separate send and receive transactions:
Code: [Select]
uint8_t txBf[6] = { 0x90 };                                         // RDID command
uint8_t rxBf[6];                                                    // Receive buffer

void spi_Send(uint8_t * TxData, uint16_t size){
  while ( size-- ) {
    while ( !__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_TXE) );            // Wait until TX flag is ready
    SPI1->DR = *TxData++;                                           // Send data
  }
  while ( !__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_TXE) );              // Wait until TX flag is ready (Following RM procedure)
  while ( __HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_BSY) );               // Wait until BUSY flag is gone
}

void spi_Receive(uint8_t * RxData, uint16_t size){
  SPI1->DR;                                                         // read DR to clear RXNE
  while ( !__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_TXE) );              // Wait until HW is idle
  while ( size-- ) {                                                // Start receiving
    SPI1->DR = 0;
    while (!__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_RXNE));             // Wait until Rx data is done
    *RxData++ = SPI1->DR;                                           // Read data
  }
}

int main (void) {

  SET_CS(ENABLE);
  spi_Send(txBf,4);
  spi_Receive(rxBf, 2);
  SET_CS(DISABLE);

}


As you see, sending data is much faster. That's because you don't need to wait until the transfer is done to keep loading the buffer, it accepts new data shortly after, when the buffer is transferred to the shift register, so it'll work almost continuously.

The problem comes when receiving. You have to wait for the transfer to be completed. The SPI peripheral stops.
Reading the data and loading DR both cause a delay while accessing the bus/peripheral; then the new data is transferred to the shift register, and a new clock sequence starts.
This is a limitation in full-duplex mode.
« Last Edit: January 21, 2022, 03:07:49 am by DavidAlfa »
 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5835
  • Country: es
Re: 32F417 SPI running at one third the speed it should
« Reply #42 on: January 21, 2022, 01:48:14 am »
Finally, playing with half-duplex mode, the delay between bytes is gone.
It's basically the same as full duplex, but uses either MOSI or MISO, instead of both at the same time.
You can also enable bidirectional mode and use only 1 pin, joining MISO/MOSI at the memory, which is usually OK.

The performance comes from the fact that in Half duplex RX only mode, the spi clock is sent continuously, without requiring writes to DR.
However it goes non-stop, the RM specifies this:
Quote
In Master receive-only mode (RXONLY=1), the communication is always continuous and
the BSY flag is always read at 1.
So you must be fast. The cpu has a whole spi word time to read the data from DR, which is more than enough if done correctly.

I modified this function, you can clearly see the huge throughput increase.
Code: [Select]
void spi_Receive(uint8_t * RxData, uint16_t size){
 
  SPI1->DR;                                                         // Clear RXNE
  __HAL_SPI_DISABLE(&hspi1);                                        // Disable SPI
  SPI1->CR1 |= SPI_CR1_RXONLY;                                      // Set RX only mode
  __HAL_SPI_ENABLE(&hspi1);                                         // Enable SPI (Starts sending clock non-stop!)
  while ( size-- ) {
    while (!__HAL_SPI_GET_FLAG(&hspi1, SPI_FLAG_RXNE));             // Wait until Rx data is done
    *RxData++ = SPI1->DR;                                           // Read data
  }
  __HAL_SPI_DISABLE(&hspi1);                                        // Disable SPI
  SPI1->CR1 &= ~(SPI_CR1_RXONLY);                                   // Disable RX only mode
  __HAL_SPI_ENABLE(&hspi1);                                         // Enable SPI
}

As you see, there's an additional byte (I've seen up to 2) received by the SPI.
That's because when you get RXNE, it has already started receiving the next byte; by the time you stop it, the clocks will already have been sent.
Not an issue really; those bytes are discarded, and your code will only take the required number.
You'll need to test this properly to ensure data integrity.

My scope is crap, seems Nyquist artifacts, being Hantek, for sure it has crappy filtering, poor software interpolation... makes it very hard to see anything running too fast for the sampling rate.
Most scopes show a solid block. This one usually shows nothing, or a weird wave due the sampling rate catching random stuff.

Testing bigger transfers with 256 bytes (100% speed would be 40.96us):
Send: 46us (89% throughput)
Read: 42us (97% throughput)
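
The percentages above follow from simple arithmetic: the ideal time is bytes x 8 / SCK frequency, and the throughput figure is ideal time divided by measured time. A small sketch, with hypothetical helper names:

```c
/* Ideal (gap-less) time in microseconds to clock `bytes` bytes at `sck_hz`. */
static double ideal_transfer_us(unsigned bytes, double sck_hz)
{
    return (double)bytes * 8.0 / sck_hz * 1e6;
}

/* Achieved throughput as a percentage of the gap-less ideal. */
static double throughput_percent(double ideal_us, double measured_us)
{
    return 100.0 * ideal_us / measured_us;
}
```

With 256 bytes at 50 MHz this gives 40.96 us; 40.96/46 and 40.96/42 then work out to roughly 89% and 97.5%. The same formula gives about 195 us for a 512-byte page at 21 MHz.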

I've been using "__attribute__((optimize("Ofast")))" for the functions.
« Last Edit: January 21, 2022, 07:41:47 am by DavidAlfa »
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 486
  • Country: sk
Re: 32F417 SPI running at one third the speed it should
« Reply #43 on: January 21, 2022, 08:06:18 am »
> The performance comes from the fact that in Half duplex RX only mode, the spi clock is sent continuously, without requiring writes to DR.
> However it goes non-stop, the RM specifies this:
> > In Master receive-only mode (RXONLY=1), the communication is always continuous and
> > the BSY flag is always read at 1.
> So you must be fast. The cpu has a whole spi word time to read the data from DR, which is more than enough if done correctly.

Yes, but that implies disabled interrupts. With large blocks read, that means interrupts being disabled for quite some time. In some applications it doesn't matter, in others it does.

Also, if you don't clear RXONLY early (*before* the last frame is finished), there will be one more frame shifted in. Again, for some applications (maybe for sSPI indeed) it doesn't matter, for others it does.

JW
 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5835
  • Country: es
Re: 32F417 SPI running at one third the speed it should
« Reply #44 on: January 21, 2022, 09:11:33 am »
This was a short test showing that SPI can be made much faster without DMA.
In my case I only have 16 CPU clocks between bytes, half of what a 168MHz core / 42MHz SPI would give, and it works.
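
The clock budget can be worked out as below. This is only a sketch with a hypothetical function name; the 100MHz / 50MHz pair in the usage note is an assumed example of a 2:1 core-to-SCK ratio, not something stated in the post.

```c
#include <assert.h>
#include <stdint.h>

/* CPU cycles that elapse during one SPI frame when SCK runs back to back. */
static uint32_t cpu_cycles_per_frame(uint32_t core_hz, uint32_t sck_hz, uint32_t bits_per_frame)
{
    assert(sck_hz > 0u);
    /* 64-bit intermediate so core_hz * bits_per_frame cannot overflow */
    return (uint32_t)(((uint64_t)core_hz * bits_per_frame) / sck_hz);
}
```

For 8-bit frames, `cpu_cycles_per_frame(168000000u, 42000000u, 8u)` gives 32, while a 2:1 ratio such as 100MHz / 50MHz gives the 16 cycles mentioned above.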

Whatever your application needs is up to you.
There's no shift: you're reading N bytes from a certain address, and you have to set the address when issuing the read command. If you requested 512 bytes from 0x0, you get exactly that. Next time you'll read address 0x200.
So the extra bytes clocked in the short time while you exit RXONLY mode don't matter.
Some memories auto-increment the address, but that state is lost once CE goes high, so the next read needs a new command with the address.
Interrupts? Maybe. If you're after the fastest performance, there's probably no problem disabling interrupts for the duration of a page read; that's just ~50-100us. It will cause some jitter, which is something you'll need to evaluate and address.
Of course, you can just run DMA...
« Last Edit: January 21, 2022, 09:21:48 am by DavidAlfa »
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3671
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F417 SPI running at one third the speed it should
« Reply #45 on: January 21, 2022, 09:26:18 am »
What is the right way to detect DMA completion in this case? I need the transfer to be blocking so I can set CS=1 on the device at the end.

The DMA TX channel is triggered by SPI TXE and the DMA RX channel is triggered by SPI RXNE.

I have one code example which waits on (LISR AND DMA_LISR_TCIF0) but really it should be TCIF3 if you wait until RX has finished. Or you could read NDTR (on the RX channel, not the TX channel because that has the "UART" double buffer on the end of it) and that should be zero when all the required bytes were sent and received.

You can't check the SPI because it doesn't keep a transfer count. Only the DMA knows.

It doesn't appear to work though.
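
For reference, the transfer-complete flag sits at a different bit for each stream, which is why TCIF0 is the wrong flag when the RX side is on stream 3. Below is a small sketch of the mapping, based on my reading of the DMA_LISR / DMA_HISR layout in the F4 reference manual (worth checking against the RM). The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit position of the transfer-complete flag (TCIFx) for DMA stream 0..7.
   Streams 0-3 report in LISR, streams 4-7 in HISR, with the same layout. */
static uint32_t dma_tcif_bit(uint32_t stream)
{
    /* first flag bit of each stream's group within its register */
    static const uint8_t group_base[4] = { 0u, 6u, 16u, 22u };
    assert(stream < 8u);
    /* group order: FEIF, (reserved), DMEIF, TEIF, HTIF, TCIF -> TCIF is base + 5 */
    return (uint32_t)group_base[stream & 3u] + 5u;
}

/* True if the stream's flags live in HISR rather than LISR. */
static bool dma_stream_uses_hisr(uint32_t stream)
{
    assert(stream < 8u);
    return stream >= 4u;
}
```

Stream 3 comes out as bit 27 of LISR, and stream 4 as bit 5 of HISR.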
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8110
  • Country: fi
Re: 32F417 SPI running at one third the speed it should
« Reply #46 on: January 21, 2022, 10:33:22 am »
DMA RX stream completion? I mean, that can only happen when the last SPI clock cycle has been issued, slave response bit read out, and then it takes a few more clock cycles for the DMA to actually write that to memory. It should be quite safe to deassert nCS at that point, unless the slave has specific requirements for keeping nCS active for longer.

TX stream completion indeed isn't good enough because that happens almost instantly after the last SPI word transmission begins.
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3671
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F417 SPI running at one third the speed it should
« Reply #47 on: January 21, 2022, 11:26:55 am »
DMA works :)
100k pages in 22 seconds, and solid data, in 8-bit mode so no byte swap needed.

I also found that either test below works to detect end of transfer; there is a tiny difference but both achieve a good clearance to cs=1 of about 600ns

Code: [Select]
while(true)
{

uint16_t temp1;
uint32_t temp2;

temp1 = DMA1_Stream3->NDTR;
if ( temp1 == 0 ) break; // transfer count = 0

temp2 = DMA1->LISR;
if ( (temp2 & (1<<27)) !=0 ) break; // TCIF3

}
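
If an endless loop is not acceptable, the same test can be bounded. The following is a minimal sketch, not the poster's code: `wait_flag_set` and its spin budget are hypothetical, and it takes a pointer to the register so it can be tried on a host as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Poll *reg until any bit in mask is set, or until max_spins polls have
   been made. Returns true if the flag was seen, false on timeout. */
static bool wait_flag_set(const volatile uint32_t *reg, uint32_t mask, uint32_t max_spins)
{
    assert(reg != 0);
    while (max_spins-- != 0u) {
        if ((*reg & mask) != 0u) {
            return true;
        }
    }
    return false;
}
```

On the target the call would look like `wait_flag_set(&DMA1->LISR, 1u << 27, spins)`, with a spin budget chosen to suit the clock speed and block size.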

 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3671
  • Country: gb
  • Doing electronics since the 1960s...
Re: 32F417 SPI running at one third the speed it should
« Reply #48 on: January 21, 2022, 04:50:42 pm »
In case someone comes across this later, this is the DMA code

Code: [Select]
bool B_HAL_SPI_TransmitReceive(SPI_HandleTypeDef *hspi, uint8_t *pTxData, uint8_t *pRxData, uint16_t Size, bool rxdump)
{

  // Check for null pointers
  if ((pTxData == NULL) || (pRxData == NULL) || (Size == 0U))
  {
  return(false);
  }

  hspi->pRxBuffPtr  = (uint8_t *)pRxData;
  uint32_t RxXferCount = Size;
  hspi->pTxBuffPtr  = (uint8_t *)pTxData;
  uint32_t TxXferCount = Size;

// Check whether DMA can be used. This uses 8 bit mode because it is fast enough, and otherwise a byte swap
// would be needed before TX and after RX. DMA cannot be used if memory is CCM.

// Check neither end is in CCM. All other addresses are accepted because e.g. one could
// be copying data from main FLASH to the serial FLASH.

// We also prevent DMA usage on short blocks. The reason for this was never determined definitely but it
// crashed the CPU (tripped the watchdog) during USB-driven FLASH writes. Most likely because there is heavy
// SPI2 activity during the programming cycle, waiting for it to end, for 20ms. There is no point in optimising
// short r/w. The entire API for the serial FLASH (e.g. LF_read) uses 512 byte blocks.

uint32_t txadd = (uint32_t) hspi->pTxBuffPtr;
uint32_t rxadd = (uint32_t) hspi->pRxBuffPtr;

// Test for CCMRAM (rw) : ORIGIN = 0x10000000, LENGTH = 64K (linkerscript.ld)
if ( ((txadd>=0x10000000) && (txadd<0x10010000)) || ((rxadd>=0x10000000) && (rxadd<0x10010000)) || (Size!=512) )
{
goto no_dma;
}

// ==== We can use DMA ====

static uint8_t rxdump_target; // must not be in CCM :)

// DMA1 clock enable already done in b_main.c
//RCC->AHB1ENR |= (1u << 21); // DMA1EN=1 - DMA1 clock enable
//hang_around_us(1); // give it a chance to wake up

// DMA1 Ch 0 Stream 3 is SPI2 RX

DMA1_Stream3->CR = 0; // disable DMA so all regs can be written

DMA1->LIFCR = (0x03d << 22); // clear int flags & transfer complete - 111101

DMA1_Stream3->NDTR = Size;

if (rxdump)
{
DMA1_Stream3->M0AR = (uint32_t) &rxdump_target; // memory address to dump rx data to
}
else
{
DMA1_Stream3->M0AR = rxadd; // memory address in normal mode
}

DMA1_Stream3->PAR = (uint32_t) &(SPI2->DR); // peripheral address
DMA1_Stream3->FCR = 0; // direct mode

if (rxdump)
{
DMA1_Stream3->CR = 0 << 25 // CHSEL: ch 0
|  0 << 23   // MBURST: memory burst - single transfer ??
|  0 << 21 // PBURST: peripheral burst - single transfer ??
|  3 << 16 // PL: highest priority
|  0 << 15 // PINCOS: no peripheral address increment offset
|  0 << 13 // MSIZE: memory data size: byte
|  0 << 11 // PSIZE: peripheral data size: byte
|  0 << 10 // MINC: memory address increment: 0
|  0 << 9 // PINC: peripheral address increment: 0
|  0 << 8 // CIRC: no circular mode
|  0 << 6 // DIR: peripheral to memory
|  0 << 5 // PFCTRL: DMA is flow controller
|  1 << 0; // EN: enable stream
}
else
{
DMA1_Stream3->CR = 0 << 25 // CHSEL: ch 0
|  0 << 23   // MBURST: memory burst - single transfer ??
|  0 << 21 // PBURST: peripheral burst - single transfer ??
|  3 << 16 // PL: highest priority
|  0 << 15 // PINCOS: no peripheral address increment offset
|  0 << 13 // MSIZE: memory data size: byte
|  0 << 11 // PSIZE: peripheral data size: byte
|  1 << 10 // MINC: memory address increment: 1
|  0 << 9 // PINC: peripheral address increment: 0
|  0 << 8 // CIRC: no circular mode
|  0 << 6 // DIR: peripheral to memory
|  0 << 5 // PFCTRL: DMA is flow controller
|  1 << 0; // EN: enable stream
}

// DMA1 Ch 0 Stream 4 is SPI2 TX

DMA1_Stream4->CR = 0; // disable DMA so all regs can be written

DMA1->HIFCR = (0x03d << 0); // clear int flags & transfer complete - 111101

DMA1_Stream4->NDTR = Size;
DMA1_Stream4->M0AR = txadd; // memory address
DMA1_Stream4->PAR = (uint32_t) &(SPI2->DR); // peripheral address
DMA1_Stream4->FCR = 0; // direct mode

DMA1_Stream4->CR = 0 << 25 // CHSEL: ch 0
|  0 << 23   // MBURST: memory burst - single transfer ??
|  0 << 21 // PBURST: peripheral burst - single transfer ??
|  0 << 16 // PL: priority low
|  0 << 15 // PINCOS: no peripheral address increment offset
|  0 << 13 // MSIZE: memory data size: byte
|  0 << 11 // PSIZE: peripheral data size: byte
|  1 << 10 // MINC: memory address increment: 1
|  0 << 9 // PINC: peripheral address increment: 0
|  0 << 8 // CIRC: no circular mode
|  1 << 6 // DIR: memory to peripheral
|  0 << 5 // PFCTRL: DMA is flow controller
|  1 << 0; // EN: enable stream

#ifdef TIMING_DEBUG
ADS1118_an_2p5v_external(0);
#endif

// Config SPI2 to let DMA handle the data. These need to be cleared when transfer complete!

SPI2->CR2 = 3; // TXDMAEN, RXDMAEN: 11 - both set in one go

// Wait for DMA to finish. Blocking is necessary to prevent FLASH CS=1 too early.
// There could be a timeout here but a failure is impossible short of duff silicon.

while(true)
{

// Either method below worked fine

//uint16_t temp1;
uint32_t temp2;

//temp1 = DMA1_Stream3->NDTR;
//if ( temp1 == 0 ) break; // transfer count = 0

temp2 = DMA1->LISR;
if ( (temp2 & (1<<27)) !=0 ) break; // TCIF3

}

SPI2->CR2 = 0; // TXDMAEN, RXDMAEN: 00 - both cleared in one go

#ifdef TIMING_DEBUG
ADS1118_an_2p5v_external(1);
#endif

    // Clear any rx data and the overrun flag in case not all received data was read; mandatory in rxdump mode
    __HAL_SPI_CLEAR_OVRFLAG(hspi);

    return (true);

// no_dma: the polled (non-DMA) path targeted by the goto above is not shown in this post
}

It implements the "tx-only" mode (rxdump=true) by pointing the DMA RX stream at a dummy byte and setting the memory increment to no-increment. The dummy byte has to be in main RAM; when it was in CCM, not only did it not work but some funny stuff was happening.
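
Since the two RX configurations in the code above differ only in MINC, the CR value can be built in one place. This is a sketch under that observation; the `SKETCH_CR_*` names and `spi2_rx_dma_cr` are hypothetical and only cover the non-zero fields used in the post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in the DMA stream CR register, as used in the code above. */
#define SKETCH_CR_EN            (1u << 0)   /* EN: enable stream */
#define SKETCH_CR_DIR_M2P       (1u << 6)   /* DIR: memory to peripheral */
#define SKETCH_CR_MINC          (1u << 10)  /* MINC: memory address increment */
#define SKETCH_CR_PL_VERY_HIGH  (3u << 16)  /* PL: highest priority */

/* CR value for the SPI2 RX stream: channel 0, byte sizes, peripheral to
   memory. In rxdump mode the memory address stays fixed on the dummy byte. */
static uint32_t spi2_rx_dma_cr(bool rxdump)
{
    uint32_t cr = SKETCH_CR_PL_VERY_HIGH | SKETCH_CR_EN;
    if (!rxdump) {
        cr |= SKETCH_CR_MINC;
    }
    assert((cr & SKETCH_CR_DIR_M2P) == 0u); /* RX must stay peripheral-to-memory */
    return cr;
}
```

The result would be written to `DMA1_Stream3->CR` in place of the two near-identical blocks.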

The reason I use DMA only for size=512 is because that is 99.9% of the SPI FLASH usage, and not having that condition buggered up the USB write access to the file system on the SPI FLASH. I didn't find out why (it would be extremely hard other than by trial and error). I suspected it may be the tight loop which is checking the FLASH programming status, which drives the SPI heavily for the ~20ms. This isn't easy to avoid because this code is used both in startup code (no interrupts, no RTOS) and by the main code (where one could use an RTOS delay to check the FLASH programming status every few ms).
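
The eligibility test described above can be pulled out into a helper. This is a sketch that mirrors the condition in the posted code (CCM at 0x10000000, 64K, and 512-byte blocks only); the helper and macro names are hypothetical, and like the original it only looks at the start addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_CCM_BASE        0x10000000u
#define SKETCH_CCM_SIZE        0x00010000u  /* 64K, per the linker script comment */
#define SKETCH_DMA_BLOCK_SIZE  512u

/* True if addr falls inside CCM RAM, which the DMA cannot reach. */
static bool addr_in_ccm(uint32_t addr)
{
    return (addr >= SKETCH_CCM_BASE) && (addr < SKETCH_CCM_BASE + SKETCH_CCM_SIZE);
}

/* True if a transfer may use DMA: neither buffer in CCM, and exactly one
   512-byte block. Zero-length requests are rejected earlier by the caller. */
static bool spi_dma_allowed(uint32_t tx_addr, uint32_t rx_addr, uint32_t size)
{
    assert(size != 0u);
    return !addr_in_ccm(tx_addr) && !addr_in_ccm(rx_addr) && (size == SKETCH_DMA_BLOCK_SIZE);
}
```

Anything that fails the test simply falls back to the polled path.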

Using DMA has greatly improved the general response of the USB read access from the USB host, regardless of what internal read or write ops are taking place. The whole 512 byte page gets transferred in about 200us.

Thank you all for your input :)

I will get rid of the hspi references, which are pointlessly complicated. This code should have just SPI1,2,3 as the 1st parameter.
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 486
  • Country: sk
Re: 32F417 SPI running at one third the speed it should
« Reply #49 on: January 21, 2022, 05:10:10 pm »
> This code should have just SPI1,2,3 as the 1st parameter.

That would make it unnecessarily complicated, as you'd need to handle the different DMA streams belonging to different SPI, too.
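
To illustrate what that handling would involve, here is a sketch of a per-SPI lookup table. Only the SPI2 row is confirmed by the code in this thread; the SPI1 and SPI3 rows are from my reading of the F4 DMA request mapping table (which also lists alternative streams) and should be verified against the RM. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t dma;        /* DMA controller number: 1 or 2 */
    uint8_t rx_stream;  /* stream used for SPIx_RX */
    uint8_t tx_stream;  /* stream used for SPIx_TX */
    uint8_t channel;    /* CHSEL value, same for RX and TX here */
} spi_dma_map_t;

/* Look up the DMA routing for SPI1..SPI3. */
static spi_dma_map_t spi_dma_lookup(unsigned spi)
{
    static const spi_dma_map_t map[3] = {
        { 2u, 0u, 3u, 3u },  /* SPI1: DMA2 stream 0 (RX), stream 3 (TX), channel 3 (assumed) */
        { 1u, 3u, 4u, 0u },  /* SPI2: DMA1 stream 3 (RX), stream 4 (TX), channel 0 */
        { 1u, 0u, 5u, 0u },  /* SPI3: DMA1 stream 0 (RX), stream 5 (TX), channel 0 (assumed) */
    };
    assert(spi >= 1u && spi <= 3u);
    return map[spi - 1u];
}
```

Every register access and flag test in the transfer function would then have to go through this table, including the choice between LISR and HISR for the completion flag.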

Except for the arbitrary "abstraction", there's no need for such a universal function in real life.

JW
 

