
Optimizing UART (for example) output...



--- Quote from: Siwastaja on July 03, 2021, 09:52:55 am ---Multi-slave is simple, the master has complete control over the timing as slaves have no way to interact, including not being able to make the master wait. You can also decide to guarantee to the slaves that once master initiates the transaction it won't make the addressed slave wait midway, easily guaranteeing deterministic timing.

Multi-master is tricky but I'd avoid such systems in UART, SPI or I2C. Other communication buses are better suited, for example CAN is great when everybody just publishes Things^tm without a clear "master".

--- End quote ---

Yes. Well, actually, "multi-master" is something you can implement with I2C, and is even part of the standard. For SPI or UART, this isn't the case, and pretty much requires fully custom implementations, including hardware support.

Quite telling of the STM32 family is that many of its SPI peripherals (they all differ between families, by the way) default to a weird "multi-master mode" which makes them fail out of the box: the peripheral simply turns off and an error flag, MODE FAULT (MODF), is set. You have to explicitly disable that multi-master mode by setting some configuration bit(s) to '1', or, more precisely, use a configuration bit to "fake" an internal nCS level for the peripheral so that it believes it has won the multi-master arbitration at all times.

It's funny because I'd hazard a guess nobody ever has used their multimaster mode.

The most common SPI pattern is one master, one slave. Daisy-chained slaves with a shared nCS signal, appearing as a single slave with a longer frame, are far less common, perhaps followed by true multi-slave with shared SCK/MOSI/MISO signals and separate nCS signals. Multi-master SPI mostly exists in the dreams (nightmares?) of designers rather than in real-world designs.

With basic UART, I have used a single-master, multiple-slave pattern where the master sends a recognizable ID string, after which only the addressed slave configures its TX pin as an output driver, sends its message, and disables the output driver again. This stays simple as long as the master is responsible for polling and the slaves do not generate spurious messages.


--- Quote ---an interrupt after sending each char is not expensive.
--- End quote ---
We disagree on that, or my question wouldn't make sense.

Although perhaps that IS a path to the answer - I was assuming that putting data into a DMA buffer and starting the DMA was more "expensive" than putting data into a circular buffer.  Which may or may not be true, but it's the wrong question - I need to compare the circular buffer PLUS all of the interrupts needed to empty it against the cost of DMA setup AND the completion interrupt.  (And put that way, it sounds like DMA is going to come out ahead "often."  Interrupt overhead is mostly fixed, and if DMA setup is N times more expensive than per-character mode, then the break-even size is going to be N.  And I'm pretty sure that N is a relatively small number - say, less than 20.)

--- Quote ---What is terrible is when you have to wait for the previous data to be sent!
--- End quote ---
Yeah; DMA buffers tend to be sized for larger output, so if you have dual ("ping-pong") DMA buffers, you don't want to "fill" them both with single-byte outputs and then have to block on your third byte.  There are a couple of possible solutions:
* This is where it gets Nagle-like: while a DMA output is in progress, you keep filling your 2nd DMA buffer.
* Just because part of your buffer is being DMAed doesn't mean you can't put bytes in the rest of the buffer.  (This might complicate the DMA ISR - you can't automatically chain to the other buffer if you have to examine the current buffer to see whether it has additional data.)

Interrupt for every byte of data is very expensive unless the data rates are really slow.

IRQ entry and return, peripheral status flag clearing and other minimal housekeeping code end up, depending on the architecture*, at around 30 clock cycles per byte, which, at say 1 Mbaud and a 20 MHz CPU clock, is already about 19% of CPU time.

*) For example, AVR has lower interrupt entry and exit latency, but the compiler has to manually save and restore registers, so it ends up the same as or even longer than Cortex-M3 and up.

The key here is that controlling the DMA should not take many more cycles than transmitting a single byte. If it does, look for the reason: either the particular DMA controller is overcomplicated and not a suitable microcontroller DMA design (on STM32 this is not the case - the DMA is fine), or you are using a library which bloats it.


If you are making a USB slave with dual ping-pong buffers for each endpoint,
then just use one buffer for the DMA transfer and the other buffer to store the next packet of data.
Do not add more data to that buffer; just wait for the first buffer to be handled before returning an ACK to the host!

Let the host accumulate the incoming bulk, not your USB slave device!
That way, you will always set up one DMA transfer per USB endpoint packet (if it has new data).


Likewise, the UART receiver can store new data bytes directly into the currently inactive ping-pong buffer that is used to transmit data to the host.

When returning data to the host, start by sending the buffers that are filled the most, to maximize throughput.

