Author Topic: BiDirectional SPI: system <-> UI comms. (Read 1510 times)

paulca · « **on:** February 24, 2023, 01:01:01 pm »

MCU One: DSP stuff.
MCU Two: UI stuff.

I chose SPI for the interlink protocol.

I've had it sort of working, but it raised a lot of questions, so let me describe the approach and the problems I encountered.

DSP is master. The DSP periodically sends the results of some analysis, like peak and rms values for all channels, plus some status stuff... blah, blah.
GUI is slave. The GUI sends in reply a set of configuration values.

I created both as structs, say "GUI_Struct", "DSP_Struct". Then I unioned them into SPITxRx union. The SPI_Transmit/Receive_DMA calls send or receive a SPITxRx union.
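For illustration, a minimal sketch of that layout. The field names and the channel count are assumptions, not the actual structs. The point is that both directions share one fixed-size frame, so a full-duplex transfer always moves the same number of bytes each way.

```c
#include <assert.h>
#include <stdint.h>

#define LINK_CHANNELS 8u   /* hypothetical channel count */

/* DSP -> GUI: analysis results (illustrative fields). */
typedef struct {
    uint32_t status;
    float    peak[LINK_CHANNELS];
    float    rms[LINK_CHANNELS];
} DSP_Struct;

/* GUI -> DSP: configuration values (illustrative fields). */
typedef struct {
    uint32_t flags;
    float    gain[LINK_CHANNELS];
} GUI_Struct;

/* One frame type for both directions. The raw view is what would be
 * handed to the SPI/DMA call as a plain byte buffer. */
typedef union {
    DSP_Struct dsp;
    GUI_Struct gui;
    uint8_t    raw[sizeof(DSP_Struct) > sizeof(GUI_Struct)
                       ? sizeof(DSP_Struct) : sizeof(GUI_Struct)];
} SPITxRx;

/* Catch layout surprises at compile time rather than on the wire. */
static_assert(sizeof(SPITxRx) % 4u == 0u, "frame should be whole 32-bit words");
static_assert(sizeof(SPITxRx) >= sizeof(DSP_Struct), "frame too small for DSP data");
static_assert(sizeof(SPITxRx) >= sizeof(GUI_Struct), "frame too small for GUI data");
```

Both ends would need to be built with the same struct definitions and packing for this to hold.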

The DSP master creates its analysis and immediately sends it if the SPI is available, so the cadence is chosen by it. This is the first part I don't like: it couples the rate at which I want to generate that statistical information to the rate at which the GUI can send config updates. I mean, it's fine; it currently adds only about 100ms max latency between the GUI and the DSP. The GUI display is local to the GUI so will update faster. To address this I considered an external interrupt from the slave to signal that it would like to send something. That can force the TxRx process on the master. It may cause its periodic update to collide and be skipped, but so what, right?

The issue I appeared to have first off was the timing. Making the SPI SS line an EXT Interrupt, having the slave respond in the ISR and call SPI_TransmitReceive_DMA() appeared to be very unstable. The data was being corrupted frequently, and while the scope could decode what was actually being sent, the SPI peripheral couldn't 50% of the time. Losing the first bit or completely transposing the bit positions were common. Also, while that SS line is digitally controlled and shouldn't have any bounce, I kept seeing the whole thing lock up, most likely because of a collision with active DMA processes. The other part of that is that calling the DMA function inside an ISR often has nasty side effects. It was more stable with a flag and main() calling the TxRx.... still data corruption. The corruption might also just be breadboard noise, but I have been running the SPI down to 2Mbit or lower and it still corrupts. I have been through the datasheet/ref man on the topic of CPU cache and DMA interlinks. It does not appear to be that issue, I'm not even sure the cache is enabled, but I have tried invalidating the buffers before calls and no effect.

Is this the right way to do it? As I only have 1 slave, would it be advantageous to use the hardware NSS line handler? Would it help?

paulca · « **Reply #1 on:** February 24, 2023, 01:07:41 pm »

The master SPI sequence, has to be the issue.

GPIO_Reset( SPI_SS );
SPI_TransmitReceieve();
GPIO_Set( SPI_SS );

This can't leave any time at all for the slave to respond to that SS line going low before the peripheral starts blasting data out. Is that normal? Part of me expects a "ACK" mechanism.

EDIT:
Is it possible to just free-run the TxRx constantly? Like full duplex I2S for example. The two MCUs can just constantly exchange the relevant buffers.

paulca · « **Reply #2 on:** February 24, 2023, 01:11:39 pm »

Both MCUs are STM32H7 series. Undecided, but probably a 743 for the UI and a 750 for the DSP.

So I have other options available. I've never tried them, but I believe it's possible to set up a kind of shared memory buffer that is synchronized by hardware over a parallel bus. Is it FMC?

If that isn't that difficult to set up and I can define an area of memory mapped data to sync between the two... that might be ideal!

wek · « **Reply #3 on:** February 24, 2023, 01:23:36 pm »

UART.

Both directions are mutually independent.

JW

paulca · « **Reply #4 on:** February 24, 2023, 01:29:04 pm »

Quote from: wek on February 24, 2023, 01:23:36 pm

USART.

Both directions are independent.

JW

Hmm. True. I suppose the question becomes one of data rate.

Realistically the Tx/Rx 'datum' size is maybe a few 100 bytes. It's a fixed size, making the UART comm handling much easier. I was planning on sending the RMS/Peak data at 10Hz. If it's, say, 128 bytes at 10Hz, that's 1.28 KB/s. Or 128 * (8 + start/stop bits) * 10 ≈ 12.8 kilobaud. I'm sure on a dedicated impedance matched PCB trace I can do more than just an order of magnitude faster than that. I can do an order of magnitude faster than that over a Dupont lead!
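Spelling that estimate out as a small helper. It assumes 8N1 framing (1 start + 8 data + 1 stop = 10 bits on the wire per byte), which is an assumption and not something fixed by the thread; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bits on the wire per payload byte for 8N1 framing (assumed). */
#define UART_BITS_PER_BYTE_8N1 10u

/* Minimum baud rate needed to move `bytes_per_msg` bytes,
 * `msgs_per_sec` times per second, with no idle gaps. */
uint32_t uart_required_baud(uint32_t bytes_per_msg, uint32_t msgs_per_sec)
{
    uint64_t baud = (uint64_t)bytes_per_msg * UART_BITS_PER_BYTE_8N1 * msgs_per_sec;
    assert(baud <= UINT32_MAX);   /* result must fit the return type */
    return (uint32_t)baud;
}
```

For 128 bytes at 10 Hz, `uart_required_baud(128u, 10u)` gives 12800, so even a standard 115200 baud link would leave roughly 9x headroom.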

I might have a try. Thanks for the option.

Siwastaja · « **Reply #5 on:** February 24, 2023, 03:27:48 pm »

STM32 and fixed-size SPI messages?

Use the hardware NSS feature for once! It works in this simple case (I think; don't count on it). Configure circular DMA. I don't think you need interrupts at all.

Bonus points on using 32-bit DMA transfers and making all data points at most 32-bits (and aligned(4)), so that update to any variable is inherently atomic. If you don't need atomicity between different elements, then just access the variables in memory whenever needed, can't get any easier. If you need atomic unit of the whole packet, then you need some kind of double-buffering and time the use, so basically interrupts. But those interrupts can be slow, triggered at DMA completion, so you have time until the whole DMA buffer fills with the next packet.

paulca · « **Reply #6 on:** February 24, 2023, 03:37:43 pm »

Quote from: Siwastaja on February 24, 2023, 03:27:48 pm

Bonus points on using 32-bit DMA transfers and making all data points at most 32-bits (and aligned(4)), so that update to any variable is inherently atomic. If you don't need atomicity between different elements, then just access the variables in memory whenever needed, can't get any easier. If you need atomic unit of the whole packet, then you need some kind of double-buffering and time the use, so basically interrupts. But those interrupts can be slow, triggered at DMA completion, so you have time until the whole DMA buffer fills with the next packet.

Interesting. Even just the align 4 for every field would not only give atomicity but idempotency as well. Meaning you can read and write it as many repeated times as you like and it will result in a consistent outcome.

On circular DMA. I'm already using the ping/pong active/inactive pointer trick everywhere else to avoid dealing with "threading" where I can make the contended monitor scope just a pointer write.

I was going to use that again with circular DMA and UART with an active pointer to the one ready for read/write flipped in interrupt.
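A minimal sketch of that ping/pong pointer flip, written without any hardware dependencies. The frame size and all names are hypothetical. The only shared state is one index, and flipping it is a single aligned word write.

```c
#include <assert.h>
#include <stdint.h>

#define FRAME_WORDS 32u   /* hypothetical: 32 x 32-bit words = 128 bytes */

typedef struct {
    uint32_t word[FRAME_WORDS];
} Frame;

static_assert(sizeof(Frame) == 4u * FRAME_WORDS, "frame must be packed 32-bit words");

static Frame frames[2];

/* Index of the frame the application may read; DMA fills the other one. */
static volatile uint32_t ready_idx = 0u;

/* Buffer the DMA should fill next (the one the application is not reading). */
Frame *link_dma_target(void)
{
    return &frames[ready_idx ^ 1u];
}

/* Buffer the application may read right now. */
const Frame *link_ready_frame(void)
{
    return &frames[ready_idx];
}

/* To be called from the transfer-complete interrupt:
 * publish the frame that was just filled. */
void link_on_dma_complete(void)
{
    ready_idx ^= 1u;
}
```

The reader still has to finish with a frame before the next flip, so this only works when the consumer is quicker than the frame period.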

julian1 · « **Reply #7 on:** February 24, 2023, 10:43:17 pm »

My first instinct would be to make UI spi master, and the DSP spi slave. But also have an out-of-band interrupt that the DSP can raise to signal the UI that data is ready from the DSP side.

The UI mcu can then flexibly initiate spi transfer in response to the interrupt (from the handler), or else set a flag to process the data in the superloop, or UI interface refresh loop, or just ignore it entirely etc.

Rudolph Riedel · « **Reply #8 on:** February 25, 2023, 03:09:20 pm »

Quote from: paulca on February 24, 2023, 01:11:39 pm

Both MCUs are STM32H7 series. Undecided, but probably a 743 for the UI and a 750 for the DSP.

What about using CAN-FD then?

Siwastaja · « **Reply #9 on:** February 25, 2023, 03:31:51 pm »

CAN is great when:
* More than two devices communicate
* When publish-subscribe pattern works, i.e., anyone can be interested in anyone's data (not just one-to-one messages)
* When the bus is physically long, requiring differential signaling

I don't see it as a good fit for paulca; it's only going to increase the effort of peripheral configuration. Getting the MCAN to work is maybe 100 LoC and 2 days of work, whereas SPI or UART is 10 lines and 2-3 hours. (If SPI seems harder than that, then scale the CAN accordingly.)

You can of course ease the task by using reused code (made by others) but then it comes with its own drawbacks. SPI with fixed datagrams can be made extremely simple.

Benta · « **Reply #10 on:** February 25, 2023, 08:36:05 pm »

I won't say it's not possible to use SPI bidirectionally (QSPI EEPROMs do it), but it's quite unusual and creates hardware overhead for arbitration circuitry. Plus firmware issues, as you've noted yourself.
Just run a simple SPI ring instead. CPU MOSI to GUI, GUI MISO to CPU. If it's just one SPI slave device, you probably won't need any additional hardware at all.

But perhaps I've misunderstood the situation?

NorthGuy · « **Reply #11 on:** February 25, 2023, 10:06:51 pm »

Quote from: paulca on February 24, 2023, 01:29:04 pm

Realistically the Tx/Rx 'datum' size is maybe a few 100 bytes. It's a fixed size, making the UART comm handling much easier. I was planning on sending the RMS/Peak data at 10Hz. If it's, say, 128 bytes at 10Hz, that's 1.28 KB/s. Or 128 * (8 + start/stop bits) * 10 ≈ 12.8 kilobaud. I'm sure on a dedicated impedance matched PCB trace I can do more than just an order of magnitude faster than that. I can do an order of magnitude faster than that over a Dupont lead!

UART depends on the clock. If the clocks on the two sides differ too much from each other you'll get errors. SI is not that important, and neither is wire length to a certain extent. So breadboarding shouldn't be a problem. 8-10 MBaud should be ok.

SPI uses the master's clock, so clock discrepancy doesn't matter. Hence you can go much faster. Still, you can build two SPI connections consisting of two wires each - CLK and MOSI. On one SPI line, one device is a master. On the other SPI line - the other. Masters only transmit. Slaves only receive. Hence everything is source-synchronous, and this configuration allows higher speed compared to traditional SPI. From the programming viewpoint, you get the same benefits as UART, but you need 2 extra wires.

mikerj · « **Reply #12 on:** February 25, 2023, 10:09:33 pm »

Quote from: paulca on February 24, 2023, 01:07:41 pm

The master SPI sequence, has to be the issue.

GPIO_Reset( SPI_SS );
SPI_TransmitReceieve();
GPIO_Set( SPI_SS );

This can't leave any time at all for the slave to respond to that SS line going low before the peripheral starts blasting data out. Is that normal? Part of me expects a "ACK" mechanism.

EDIT:
Is it possible to just free-run the TxRx constantly? Like full duplex I2S for example. The two MCUs can just constantly exchange the relevant buffers.

It took me quite a while to get stable SPI comms between two (identical) micros. Unlike I2C, there is no handshake mechanism to hold up the master whilst the slave does its stuff, so turnaround time becomes pretty critical. I eventually used a spare pin as a busy flag on the slave so the master could tell when it was safe to start sending data, really handy for e.g. FW updates when the flash erase/write in the slave took a while.
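A rough sketch of the busy-flag idea on the master side, with the hardware hidden behind two hypothetical hooks so the logic can be tried on a PC. The `LinkPort` type, the function names and the simulated slave are all made up for illustration. The poll limit keeps the master from blocking forever if the slave never becomes ready.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hardware hooks; on a real board these would wrap the GPIO
 * read of the busy pin and the SS-framed SPI transfer. */
typedef struct {
    bool (*slave_busy)(void);
    void (*exchange)(const uint8_t *tx, uint8_t *rx, uint32_t len);
} LinkPort;

/* Exchange one frame, but only once the slave's busy pin is released.
 * Gives up after `max_polls` busy readings (at least one).
 * Returns true if the frame was exchanged. */
bool link_exchange_when_ready(const LinkPort *port, const uint8_t *tx,
                              uint8_t *rx, uint32_t len, uint32_t max_polls)
{
    assert(port && tx && rx);
    uint32_t polls = 0u;
    while (port->slave_busy()) {
        if (++polls >= max_polls) {
            return false;
        }
    }
    port->exchange(tx, rx, len);
    return true;
}

/* --- Simulated slave, for trying the logic without hardware --- */
static uint32_t sim_busy_reads = 0u;   /* busy readings left before "ready" */
static uint32_t sim_exchanges = 0u;    /* number of frames exchanged */

static bool sim_slave_busy(void)
{
    if (sim_busy_reads > 0u) {
        sim_busy_reads--;
        return true;
    }
    return false;
}

static void sim_exchange(const uint8_t *tx, uint8_t *rx, uint32_t len)
{
    for (uint32_t i = 0u; i < len; i++) {
        rx[i] = (uint8_t)~tx[i];       /* stand-in reply: inverted echo */
    }
    sim_exchanges++;
}

static const LinkPort sim_port = { sim_slave_busy, sim_exchange };
```

With `sim_busy_reads = 2` and a limit of 5 polls the exchange goes through; with the slave busy for longer than the limit, the call returns false and nothing is sent.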

paulca · « **Reply #13 on:** February 28, 2023, 01:54:53 pm »

So many options. So little time. I ported the DSP over from the H743 to the H750 as the extra flash memory was going to waste and I can use it for bitmap pre-canned GUI elements better there. This went fairly flawlessly, except for the USART which now produces garbage, not just garbage but consistent garbage. Scope decode flags half the bits are in error due to being out of spec.

Assuming I fix that, given I really need this to go to a PCB interconnect as it's driving me nuts on breadboard... I am going to put 2xUARTs and 1xSPI breakout and go ahead with the PCB. That will take care of a lot of complications around stability and playing the game of "what's broken now?".

Recapping the "easy" options are:

Avoid signalling on the SPI by running it in a circular loop on both ends. Start it up, let it run at a sensible (slowish) speed and consume/update the "idle buffer halves". Only need to implement the error callback to invalidate things and restart it.

Avoiding SPI altogether and use much the same approach with UART, except have it be fully asynchronous, fixed message size.

Not considering (FD)CAN bus. Not going to consider concurrent access to the QuadSPI flash chip/bus either!
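A hardware-free sketch of the first option, the free-running circular buffer with "idle halves". The buffer size and function names are hypothetical. On the STM32 HAL the three handlers would presumably be called from `HAL_SPI_TxRxHalfCpltCallback`, `HAL_SPI_TxRxCpltCallback` and `HAL_SPI_ErrorCallback`, but that wiring is not shown and is an assumption about how one might hook it up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FRAME_BYTES 128u   /* hypothetical fixed frame size */

static_assert(FRAME_BYTES % 4u == 0u, "keep frames word-sized");

/* Circular DMA buffer holding two frames back to back. While the DMA
 * works on one half, the application owns the other ("idle") half. */
static uint8_t rx_ring[2u * FRAME_BYTES];
static uint8_t latest[FRAME_BYTES];
static volatile bool latest_valid = false;

/* DMA finished the first half and has moved on to the second. */
void link_on_half_complete(void)
{
    memcpy(latest, &rx_ring[0], FRAME_BYTES);
    latest_valid = true;
}

/* DMA finished the second half and has wrapped to the first. */
void link_on_full_complete(void)
{
    memcpy(latest, &rx_ring[FRAME_BYTES], FRAME_BYTES);
    latest_valid = true;
}

/* Transfer error: invalidate what we hold; the caller restarts the stream. */
void link_on_error(void)
{
    latest_valid = false;
}

/* Copy out the most recent good frame. Returns false if there is none.
 * A real version would also guard against a callback firing mid-copy. */
bool link_get_latest(uint8_t out[FRAME_BYTES])
{
    if (!latest_valid) {
        return false;
    }
    memcpy(out, latest, FRAME_BYTES);
    return true;
}
```

The transmit side would mirror this: update the half that the DMA is not currently sending.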

It does beg the question though. When you aren't dealing with an IC, but another micro or even a general purpose OS on the end of the SPI....how DO you get the timing to work on Async messages with SPI?

As I see it the master says, "Owww! YOU! Here's some data." and starts blasting data at the slave.

If all the slave is doing is waiting on that SS line, fine. If it's just unluckily in an interrupt because a UART debug message completed and takes a handful of microseconds to respond to the EXT-INT pin interrupt.... the master is already sending data. That's kind of annoying, but what is even worse, the slave has no idea it's in the middle of a stream so it gives you offset bits and corrupt data. If you have to start enabling CRC buffers for 2Mbit/s there's a problem.

Again, maybe it's only designed for "free running" SPI loops or between devices which can dedicate their timing to serving the master, such as in ICs. Can't judge a goldfish by its ability to climb a tree.

Or maybe I'm doing it wrong. It would seem to me that using the HAL function to start the SPI peripheral in the interrupt handler is not only flaky sometimes, but also probably takes a load of time. Especially if it actually disables the peripheral and DMA streams and needs to re-enable them. Thus no FIFO, no buffer, no DR register even. Maybe I should start the SPI in receive in anticipation and only trigger on either DMA interrupts to get data or the SS line's "rising" edge to wait on data.

PCB.Wiz · « **Reply #14 on:** March 01, 2023, 12:15:20 am »

Quote from: paulca on February 24, 2023, 01:01:01 pm

... Also, while that SS line is digitally controlled and shouldn't have any bounce, I kept seeing the whole thing lock up, most likely because of a collision with active DMA processes. The other part of that is that calling the DMA function inside an ISR often has nasty side effects. It was more stable with a flag and main() calling the TxRx.... still data corruption. The corruption might also just be breadboard noise, but I have been running the SPI down to 2Mbit or lower and it still corrupts. I have been through the datasheet/ref man on the topic of CPU cache and DMA interlinks. It does not appear to be that issue, I'm not even sure the cache is enabled, but I have tried invalidating the buffers before calls and no effect.

Ouch, sounds like a lot of variables in play.

Quote from: paulca on February 24, 2023, 01:07:41 pm

The master SPI sequence, has to be the issue.
GPIO_Reset( SPI_SS );
SPI_TransmitReceieve();
GPIO_Set( SPI_SS );

This can't leave any time at all for the slave to respond to that SS line going low before the peripheral starts blasting data out. Is that normal? Part of me expects a "ACK" mechanism.

SPI is inherently very simple, and largely expects a hardware slave, so SS to Data can be very short.

Did you look at using the SPI data line as an ACK, if the slave has variable SS response times?
I.e. you issue SS, then wait for MISO to change, which signals that the slave has filled the SPI buffer and is 'ready'. That means a mix of SW and HW pin control.
Master then issues clocks, and hopefully the armed/ready slave can keep up for the agreed data burst.

Siwastaja · « **Reply #15 on:** March 01, 2023, 07:23:57 am »

Regarding "time to respond".

SPI is best suited for asynchronous payloads; by that I mean the data flowing in opposite directions have no time relation to each other. This rules out the classic command - response mindset. For example: setpoints run one way, measurements the other.

If and when you need command - response functionality (for example: "command: read register 123", "response: value of register 123 is 42"), which SPI really isn't the best for, there are two classic ways to achieve this:

1) Quick data generation in the middle of the SCK clock cycles. Example: first byte as transmitted on MOSI defines a command, for example 0x85 means "read register 5". MISO bits during this first byte are meaningless and ignored. After 8 SCLK cycles, slave quickly accesses register 5 and outputs it during the rest of the SCK cycles.

2) Response is to the previous command. When master generates SCK cycles, the slave immediately responds with data. But this response is to the previous command. While slave responds, master gives bits of the new command. Once the transmission is complete (e.g., nCS goes inactive, or just right amount of cycles), slave can process the command so that the response will be ready for the next transaction.

With 1), datasheet has to specify maximum SCLK frequency so that data generation can fit in that tiny time slot available after fully receiving the command bytes. There is extra overhead for having to transmit dummy bits on MISO. A variation to give more time is to add dummy bits; for example, make the command fully defined with 6 first bits, and then you have 2 bit time slots to process the command before the reply has to start.

With 2), datasheet has to specify minimum nCS inactive period. 2) allows more time to process the command, and incurs less overhead if readout is going on all the time, in a predictable pattern. On the other hand, if one does random accesses every now and then but needs the reply "immediately", then each read requires two full transactions.
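To make 2) concrete, here is a small PC-side simulation of a slave whose reply always belongs to the previous command. The register file and the command format (low 4 bits = register index) are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 16u

static_assert(NUM_REGS == 16u, "command carries a 4-bit register index");

typedef struct {
    uint8_t regs[NUM_REGS];
    uint8_t pending_reply;   /* prepared between transactions, sent in the next */
} PipelinedSlave;

/* One single-byte transaction. The slave shifts out the reply to the
 * PREVIOUS command while shifting in the new one, then prepares the
 * next reply during the nCS inactive period. */
uint8_t slave_transaction(PipelinedSlave *s, uint8_t command)
{
    uint8_t reply = s->pending_reply;              /* already loaded before SCK starts */
    s->pending_reply = s->regs[command & 0x0Fu];   /* done after nCS goes inactive */
    return reply;
}
```

Reading register 5 and then register 3 takes three transactions here: the first returns stale data, the second returns register 5, the third returns register 3. That is the "two full transactions per random read" cost mentioned above.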

But whenever possible, try if you can use the "data not related in time" pattern. I think this could work with an UI. If the UI processor itself chooses what data to show, then you surely just supply all possible data to it, so data does not change when buttons are pressed.

newbrain · « **Reply #16 on:** March 01, 2023, 12:55:23 pm »

Quote from: Siwastaja on March 01, 2023, 07:23:57 am

there are two classic ways to achieve this

I would add a third one, used almost universally in (Q)SPI flash memories:
3) the command is sent in the first byte, then a dummy byte is read, then the expected answer. The slave can immediately insert a dummy byte in the MISO FIFO (as it can be whatever) then it has a byte time to go and fetch the needed information.
More than one dummy byte is also possible.
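The same kind of simulation for 3), with one dummy byte between command and answer. Again the command format and register file are hypothetical. In this model the register lookup is instantaneous; on real hardware it would have to finish during the eight clocks of the dummy byte.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 16u

static_assert(NUM_REGS == 16u, "command carries a 4-bit register index");

typedef enum { WAIT_COMMAND, SEND_DUMMY, SEND_DATA } Phase;

typedef struct {
    uint8_t regs[NUM_REGS];
    Phase   phase;
    uint8_t answer;
} DummyByteSlave;

/* nCS falling edge: start of a new transaction. */
void slave_select(DummyByteSlave *s)
{
    s->phase = WAIT_COMMAND;
}

/* Exchange one byte: returns what the slave shifts out on MISO
 * while `mosi` is being shifted in. */
uint8_t slave_exchange(DummyByteSlave *s, uint8_t mosi)
{
    switch (s->phase) {
    case WAIT_COMMAND:
        /* Nothing useful to send yet; latch the command. */
        s->answer = s->regs[mosi & 0x0Fu];
        s->phase = SEND_DUMMY;
        return 0x00u;
    case SEND_DUMMY:
        /* Dummy byte: its value is irrelevant, it only buys time. */
        s->phase = SEND_DATA;
        return 0xFFu;
    case SEND_DATA:
    default:
        return s->answer;
    }
}
```

The master clocks three bytes per read: command, dummy, data. Adding more dummy bytes would just mean more `SEND_DUMMY` steps before the answer.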

paulca · « **Reply #17 on:** March 01, 2023, 02:32:00 pm »

I did try a mechanism previously I called "tug and hold".

The master would go round each slave, tug and release the slave select, then sample it. If the slave had grabbed the line and held it low it's ready.

The difficult part is not cooking two micros with one of the pins HIGH and the other open drain. Also the timing and GPIO control was formidable and I gave up. I briefly tried using 2 lines, Master->Slave request, Slave->Master Ready lines. Then it came down to timing. Using "SysTick" your resolution is 1ms +/- 1ms. Which really deflates your bandwidth and latency expectations. Shorter delays or a finite spin lock might work better.

Not needed here though as I only have 1 slave.

I think I'll try the circular fixed size, free running SPI on the next session. Most of the effort is going into the PCB at the minute.

Siwastaja · « **Reply #18 on:** March 01, 2023, 03:31:38 pm »

I always ask the question: instead of a possibly complex and edge-case-y "are you ready?" "I am ready" thing, can't you just make the access fast enough for the problem to go away? If you just have to write a few bytes to a FIFO, or enable a DMA stream, this is all achievable in maybe 20 CPU clock cycles. Can't you just make it a highest priority interrupt, for example?
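As a back-of-envelope check of that argument, a small sketch comparing the time for a handful of CPU cycles with one SPI bit period. The 400 MHz core clock is an assumed figure for illustration only; the 2 Mbit/s rate is the one mentioned earlier in the thread. Interrupt entry latency and anything that blocks the interrupt come on top of this.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of `cycles` CPU cycles in nanoseconds, rounded down. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t cpu_mhz)
{
    assert(cpu_mhz > 0u);
    return (uint32_t)(((uint64_t)cycles * 1000u) / cpu_mhz);
}

/* Duration of one SPI bit in nanoseconds, rounded down. */
uint32_t spi_bit_ns(uint32_t bits_per_sec)
{
    assert(bits_per_sec > 0u);
    return 1000000000u / bits_per_sec;
}
```

With these assumptions, 20 cycles at 400 MHz is 50 ns, against a 500 ns bit period at 2 Mbit/s, so the setup work itself fits within a fraction of the first bit.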

