Along with simple things like output polarity and programmable SPI bit lengths, it would be enormously valuable to have I/O peripherals with something more than a single-byte buffer. This would drop interrupt rates in direct proportion to the buffer depth, either allowing faster speeds or lower CPU burden (or some balance between those two). Yes, I know there are some MCU peripherals which have buffers but many do not and the performance impact can be huge, even if all the associated ISR does is move the latest byte in or out of the single register. You still incur the ISR overhead and the subsequent jitter in system timing.
Nowadays you are supposed to use DMA to implement buffering.
Yep, DMA capability is nice, but it should be there to free up the CPU for more useful work, not required just to get full bandwidth out of a peripheral.
On the STM32F4 series you can't get SPI to run with no pause between bytes by manual polling, even if you dedicate 100% of the CPU to it. It has a single-byte buffer, and the status flags don't clearly tell you exactly when it needs servicing. Because it is bidirectional you have to both fill and empty the buffer. In principle you can load the next byte while the previous byte is still being shifted out/in, but no matter how I followed the available status flags, something would eventually go wrong and cause a data over/underrun. The only reliable way was to send one byte, wait for it to be transferred fully, and then give it the next one (this is also how the official HAL driver does it). That leaves a small pause between finding out the transfer is done and feeding it the next byte. Feeding it with DMA works fine, though.
But for comparison, here is the process for sending in polling mode:
1) Init SPI (Turn on clocks, set register, enable peripheral)
2) Write/read data into data register
3) Wait for space in buffer
4) Repeat step 2 until done
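For the STM32F4 case above, the only reliable polling sequence I found looks roughly like this. This is just a sketch against the standard CMSIS register names (`SPI_SR_TXE`, `SPI_SR_RXNE`, `SPI_SR_BSY`); the function and buffer names are mine:

```c
#include "stm32f4xx.h"  /* CMSIS device header: SPI_TypeDef, SPI_SR_* flags */

/* Full-duplex transfer, one byte at a time. Waits for each byte to finish
 * shifting completely before loading the next one, which is exactly what
 * causes the small inter-byte pause described above. Assumes the SPI is
 * already initialized and enabled in master mode. */
static void spi_transfer_polled(SPI_TypeDef *spi, const uint8_t *tx,
                                uint8_t *rx, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        while (!(spi->SR & SPI_SR_TXE)) { }   /* wait for space in TX register */
        spi->DR = tx[i];                      /* write the next byte */
        while (!(spi->SR & SPI_SR_RXNE)) { }  /* wait for the received byte */
        rx[i] = (uint8_t)spi->DR;             /* read it back */
        while (spi->SR & SPI_SR_BSY) { }      /* wait until the shift is done */
    }
}
```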
Here is what you do with DMA mode:
1) Init SPI (Turn on clocks, set register, enable peripheral)
2) Init DMA (turn on clocks, figure out which DMA channel is able to service the SPI, move other conflicting DMA channels out of the way, link up the DMA channel with the correct peripheral, note down the channel so you can use it later)
3) Copy data into an area of RAM that is accessible by DMA
4) Flush CPU data cache to make sure it is written out there
5) Make sure the DMA is idle and ready for a new job
6) Configure the SPI peripheral for a DMA mode transfer
7) Configure the DMA with the appropriate address and data size
8) Start the DMA transfer
9) Poll the DMA status register to check if it's done; repeat step 9 otherwise
10) Terminate the DMA transfer
11) Flush the CPU data cache again
12) Retrieve the received data from the RAM area.
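To make the difference concrete, here is a rough sketch of that sequence for SPI1 on an STM32F4 (no data cache on that part, so the cache-flush steps drop out there). The stream/channel numbers come from the DMA request mapping table in the reference manual, and I'm assuming SPI1_RX on DMA2 Stream 0 / Channel 3 and SPI1_TX on DMA2 Stream 3 / Channel 3; check them against your exact part, and note all error handling is omitted:

```c
#include "stm32f4xx.h"  /* CMSIS device header: DMA2, SPI1, register bit names */

static void spi1_transfer_dma(const uint8_t *tx, uint8_t *rx, uint16_t len)
{
    RCC->AHB1ENR |= RCC_AHB1ENR_DMA2EN;            /* step 2: DMA clock on */

    while (DMA2_Stream0->CR & DMA_SxCR_EN) { }     /* step 5: wait until idle */
    while (DMA2_Stream3->CR & DMA_SxCR_EN) { }

    /* step 7: addresses, transfer count, channel select, increment, direction */
    DMA2_Stream0->PAR  = (uint32_t)&SPI1->DR;      /* RX: peripheral -> memory */
    DMA2_Stream0->M0AR = (uint32_t)rx;
    DMA2_Stream0->NDTR = len;
    DMA2_Stream0->CR   = (3u << DMA_SxCR_CHSEL_Pos) | DMA_SxCR_MINC;

    DMA2_Stream3->PAR  = (uint32_t)&SPI1->DR;      /* TX: memory -> peripheral */
    DMA2_Stream3->M0AR = (uint32_t)tx;
    DMA2_Stream3->NDTR = len;
    DMA2_Stream3->CR   = (3u << DMA_SxCR_CHSEL_Pos) | DMA_SxCR_MINC |
                         DMA_SxCR_DIR_0;

    SPI1->CR2 |= SPI_CR2_RXDMAEN | SPI_CR2_TXDMAEN; /* step 6: SPI in DMA mode */

    DMA2_Stream0->CR |= DMA_SxCR_EN;               /* step 8: go (RX first) */
    DMA2_Stream3->CR |= DMA_SxCR_EN;

    while (!(DMA2->LISR & DMA_LISR_TCIF0)) { }     /* step 9: poll until done */
    while (SPI1->SR & SPI_SR_BSY) { }

    /* step 10: clear flags and put the SPI back in normal mode */
    DMA2->LIFCR = DMA_LIFCR_CTCIF0 | DMA_LIFCR_CTCIF3;
    SPI1->CR2 &= ~(SPI_CR2_RXDMAEN | SPI_CR2_TXDMAEN);
}
```

And that is the short version, with the one-time init and all the fault checking left out.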
Okay, sure, you can skip the cache flushes if you configure the memory protection unit to mark that area as non-cacheable, but that is even more setup work. And yes, you could do other stuff while the DMA is running, but if I am sending 8 bytes through SPI at 30 Mbit/s, that takes only about 2 µs; what useful work am I going to get done in that time? If I decide to jump into an ISR, the context switch alone takes about 1 µs, so we might as well just run in circles in a loop for that time.
Why do I have problems like this on a modern 32-bit ARM MCU when I never had them on a 16-bit PIC way back? That chip also had DMA, but you could always run the peripherals at full speed with polling; the DMA was only there for when you needed to move a huge pile of data and had something useful to do in the meantime. Then you have peripherals like I2C that are horrendously slow and could really use DMA, yet need constant CPU intervention to actually complete a transfer, so the DMA is only usable once you have already done half the transfer with polling.