Author Topic: Optimizing UART (for example) output... (Read 2993 times)

westfw · « **on:** July 01, 2021, 03:50:27 am »

Has anyone ever implemented a sort-of "Nagle-ized" UART driver? Essentially, interrupt-driven FIFO I/O, unless the FIFO fills up, and then start DMA instead... Because starting up DMA is pretty expensive for short outputs, but starts to make more sense the more output there is...
Maybe it'd be overly complicated.

westfw · « **Reply #1 on:** July 01, 2021, 08:08:49 am »

I meant for output. Trying to do UART input has a whole separate (and larger) set of issues.
(I'm sorry I wasn't clearer.)

Siwastaja · « **Reply #2 on:** July 01, 2021, 08:12:50 am »

Quote from: westfw on July 01, 2021, 03:50:27 am

Because starting up DMA is pretty expensive for short outputs, but starts to make more sense the

First off, are you really sure this is the case? In STM32 for example, setting up the DMA* is roughly equivalent to putting four words into the UART peripheral's FIFO back-to-back, or equivalent to sending one byte using an interrupt handler.

* Assuming you already know the previous DMA is finished and don't need to poll for completion, it's 2 to 4 memory writes depending on if you change the buffer address and size or not.

Microcontroller DMA implementations seem to be quite simple and fast to set up. Obviously, a bloated library can totally mess this up.
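To make the "2 to 4 memory writes" figure concrete, here is a minimal sketch that counts the register writes needed to start a transfer. The register block `mock_dma_regs_t`, its field names and the enable bit are hypothetical stand-ins, not taken from any vendor header, and whether the count really can be left untouched between transfers depends on the hardware.

```c
#include <stdint.h>

/* Hypothetical register block for one DMA channel. Field names are
 * illustrative and not taken from any vendor header. */
typedef struct {
    volatile uint32_t  ctrl;      /* assumed: writing 1 enables the channel */
    volatile uint32_t  count;     /* number of bytes to transfer */
    volatile uintptr_t mem_addr;  /* source address in memory */
} mock_dma_regs_t;

typedef struct {
    mock_dma_regs_t *regs;
    const uint8_t   *last_buf;  /* shadow copies, so registers are never read back */
    uint32_t         last_len;
} tx_dma_t;

/* Start a TX transfer and return the number of register writes performed.
 * Assumes the caller already knows the previous transfer has finished. */
int tx_dma_start(tx_dma_t *d, const uint8_t *buf, uint32_t len)
{
    int writes = 0;

    d->regs->ctrl = 0u;                      /* disable before reconfiguring */
    writes++;

    if (buf != d->last_buf) {                /* only rewrite the address if it changed */
        d->regs->mem_addr = (uintptr_t)buf;
        d->last_buf = buf;
        writes++;
    }
    if (len != d->last_len) {                /* only rewrite the size if it changed */
        d->regs->count = len;
        d->last_len = len;
        writes++;
    }

    d->regs->ctrl = 1u;                      /* enable: the transfer begins */
    writes++;
    return writes;
}
```

With a fresh driver state the first call costs four writes; repeating the same buffer and length costs two.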

OTOH, if you require a timebase handler triggered by a timer for other reasons, and a suitable timer rate happens to match the UART byte rate, then managing the buffer and sending bytes in the same ISR is likely the lowest-cost solution. Since UART TX has no failure modes like NACK, you can even simply guarantee the timing, getting rid of checking the TX Empty flag before writing to the TX data register!

Jeroen3 · « **Reply #3 on:** July 01, 2021, 09:01:03 am »

I considered writing one. Aside from lines, sometimes a frame of several hundred bytes would be encoded and sent over UART. Such a frame could use a DMA transfer.
But instead I just optimized the UART transmit interrupt. It's so short that the total effect was negligible, unless you're running at max baud for some reason.
At 115200 baud on a 160 MHz STM32, the effort wouldn't yield much benefit.

When you are using the UART as a block device only, then it might be worth it.

Receiving by DMA is also complicated, since you'd have to abort the DMA on the IDLE flag or a timeout and then read whatever was received. Not all DMAs support this.

DiTBho · « **Reply #4 on:** July 01, 2021, 10:09:06 am »

The ideal scenario is a large elastic FIFO.

SiliconWizard · « **Reply #5 on:** July 01, 2021, 04:33:21 pm »

Quote from: westfw on July 01, 2021, 03:50:27 am

Has anyone ever implemented a sort-of "Nagle-ized" UART driver? Essentially, interrupt-driven FIFO I/O, unless the FIFO fills up, and then start DMA instead... Because starting up DMA is pretty expensive for short outputs, but starts to make more sense the more output there is...
Maybe it'd be overly complicated.

I haven't implemented this scheme, AFAIR anyway.

But I certainly have used UART FIFOs instead of using DMA on MCUs that had sufficiently large FIFOs. In many cases, that's fully adequate. I have even used MCUs that didn't have any DMA controller, but had comfortable FIFOs for UART, SPI and I2C. That was perfectly workable. (FIFOs had a depth of something like 32 words, which was not bad compared to the more typical 4 or 8 words you can find on many MCUs.)

For your proposed scheme, if I understand it correctly, you of course have to determine if the overhead of handling this dual-approach doesn't exceed that of a pure FIFO one with interrupts, or of a pure DMA approach.

Another point to consider: all this can be considered for UART TX. For UART RX, using DMA is often tricky.

PlainName · « **Reply #6 on:** July 01, 2021, 05:16:13 pm »

Quote

Because starting up DMA is pretty expensive for short outputs

Isn't Nagle used to deal with the overhead of frames, not the DMA or interrupt stuff? That is, with IP you have a fixed overhead for each frame sent, and if you send lots of single-character frames the overhead is massive. Like thousands of % massive. Not only does that slow your comms down, it hogs the shared media too, so you're potentially stifling some other system.

Frames used in serial comms tend to be quite a bit smaller, and frame data is a fixed size, which is generated en bloc, so there isn't anything to gain anyway. The trade-off between DMA and ISR would be quite hardware-specific - the size of the UART buffer would have a significant contribution.

Siwastaja · « **Reply #7 on:** July 01, 2021, 06:42:02 pm »

I second SiliconWizard's suggestion to utilize the FIFOs whenever this is possible. I have written some interrupt-driven state machines which are very efficient but also very robust.

For example, when the timing is deterministic and being late is a fatal error, indicative of something being broken and requiring shutdown or at least complex recovery attempts, you can turn the classic idea of polling a completion flag with a timeout upside down: use a timer to trigger the next state instead, and then in the timer interrupt own_assert() the availability (and correct number!) of data and the absence of any error flags. Say with SPI, one state might need to write 5 bytes; it would do two FIFO write accesses, one 32-bit and one 8-bit, then set a timer to trigger the next interrupt. The next ISR would then look at the FIFO fill level (and error flags) in the status register, and error out if it differs from 5 bytes (on STM32 SPI, for example, this requires checking a strange combination of flags, but is doable). Then it would do two FIFO read operations, again 32-bit and 8-bit. DMA would win only when the data doesn't fit in the FIFOs, or when the length starts going over some 16 bytes.

Depending on a certain FIFO size can be a portability problem though. For example, on many STM32 devices, the different instances of peripherals even on the same device (SPI1 vs. SPI2 and so on) have different sized FIFOs, and migration between devices, even more so.

westfw · « **Reply #8 on:** July 02, 2021, 12:48:09 am »

Quote

Isn't Nagle used to deal with the overhead of frames, not the DMA or interrupt stuff?

Yes. I "generalized" the concept, and perhaps incorrectly. Perhaps it's closer to "delayed ack." I like to think that all of those TCP congestion minimization algorithms are vaguely related... (in recent times, one tends to think of only the network link being "congested", but down in the more deeply embedded world, you also can have CPU and Memory "congestion.")

Consider configuring a SAMD21 as a 4-port USB/Serial Adapter. 48MHz ARM CM0 and no HW FIFOs on the UARTs. If all four ports have users typing away at a keyboard, you don't really care very much about the performance, and per-character interrupts are fine. If all four ports start some sort of bulk data transfer, it'd be nice to use DMA. 50k interrupts per second is pretty painful. (or maybe it just used to be? Estimating 100 cycles per tx interrupt, I guess that's only about 10% of the CPU? Maybe I'm just optimizing prematurely.)

PlainName · « **Reply #9 on:** July 02, 2021, 10:07:40 am »

I think it would be product-specific. For big packets then DMA might well be preferred for streaming, but you couldn't really mix DMA and ISR since they might overlap. You could have the single-character output check for DMA operation and then wait if necessary, but you might as well just plonk it in a new DMA buffer and be done.

Siwastaja · « **Reply #10 on:** July 02, 2021, 02:33:30 pm »

UART and SPI in a typical microcontroller application are examples of extremely simple, completely deterministic master-slave protocols. There just is no "congestion" or whatever. There is simply no need for all that complexity. Any proper solution doesn't need to waste time waiting, and choosing between proper solutions is just about micro-optimizing a few clock cycles, like hey, ISR entrance takes 12 clock cycles, enabling DMA takes 10 clock cycles and so on.

In I2C, there at least is a remote possibility of having to support slave clock stretching which complicates the timing a bit.

PlainName · « **Reply #11 on:** July 02, 2021, 04:03:04 pm »

Quote

UART and SPI in a typical microcontroller application are examples of extremely simple, completely deterministic master-slave protocols

That was my initial thought, but I had neglected to consider a multi-drop setup where the master talks to many slaves over the same single wire (well, pair of wires). Or many masters, come to that. Each master-slave connection may be working in parallel with very different data going back and forth.

Siwastaja · « **Reply #12 on:** July 03, 2021, 09:52:55 am »

Multi-slave is simple, the master has complete control over the timing as slaves have no way to interact, including not being able to make the master wait. You can also decide to guarantee to the slaves that once master initiates the transaction it won't make the addressed slave wait midway, easily guaranteeing deterministic timing.

Multi-master is tricky but I'd avoid such systems in UART, SPI or I2C. Other communication buses are better suited, for example CAN is great when everybody just publishes Things^tm without a clear "master".

DavidAlfa · « **Reply #13 on:** July 03, 2021, 04:43:49 pm »

I've done that many times; it works very well. Even without DMA, since the UART is slow, an interrupt after sending each char is not expensive.
What is terrible is when you have to wait for the previous data to be sent!

I could write lots of messages in a single shot; they piled up in the FIFO buffer and went out without any waiting or data loss.

If your messages are constant, e.g. printf("SYSTEM BOOTING\n"), printf("TIMEOUT ERROR\n"), you can just store the pointer in your FIFO, count the length and save it for the DMA transfer size.
Then, jump to the next pointer in the FIFO, if any.

If they're variable, you need a RAM buffer: copy the string into it (e.g. UART_Fifo[32][32]) and the rest is pretty much the same thing, get the length and pass it to the DMA.

I only printed static messages so I used the pointer method.
Might not be perfect, but you can check the code here, it's really simple:

https://github.com/deividAlfa/Alfa-166-Unilink-CD-emulator/blob/main/Core/Src/serial.c
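For readers who want the gist without opening the link, the pointer method described above could be sketched roughly like this. This is not the code from the linked repository; `txq_t`, `txq_put`, `txq_next` and the queue depth are hypothetical, and a real driver would also have to protect `head` and `tail` against the DMA-complete interrupt.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TXQ_SLOTS 8u  /* assumed depth; a power of two so the counters can wrap */

typedef struct {
    const char *ptr;  /* points at a constant string; nothing is copied */
    size_t      len;  /* saved up front as the DMA transfer size */
} txq_entry_t;

typedef struct {
    txq_entry_t slot[TXQ_SLOTS];
    unsigned    head;  /* count of entries ever queued */
    unsigned    tail;  /* count of entries ever fetched for sending */
} txq_t;

/* Queue a constant message. Returns 0 on success, -1 if the queue is full. */
int txq_put(txq_t *q, const char *msg)
{
    if (q->head - q->tail == TXQ_SLOTS)
        return -1;
    q->slot[q->head % TXQ_SLOTS] = (txq_entry_t){ msg, strlen(msg) };
    q->head++;
    return 0;
}

/* Call when the previous transfer has finished: fetch the next entry.
 * Returns 1 and fills *out if a message is waiting, 0 if the queue is empty. */
int txq_next(txq_t *q, txq_entry_t *out)
{
    if (q->head == q->tail)
        return 0;
    *out = q->slot[q->tail % TXQ_SLOTS];
    q->tail++;
    return 1;
}
```

The caller would hand `out->ptr` and `out->len` to whatever starts the DMA, then call `txq_next()` again from the completion handler.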

SiliconWizard · « **Reply #14 on:** July 03, 2021, 04:45:51 pm »

Quote from: Siwastaja on July 03, 2021, 09:52:55 am

Multi-slave is simple, the master has complete control over the timing as slaves have no way to interact, including not being able to make the master wait. You can also decide to guarantee to the slaves that once master initiates the transaction it won't make the addressed slave wait midway, easily guaranteeing deterministic timing.

Multi-master is tricky but I'd avoid such systems in UART, SPI or I2C. Other communication buses are better suited, for example CAN is great when everybody just publishes Things^tm without a clear "master".

Yes. Well, actually, "multi-master" is something you can implement with I2C, and is even part of the standard. For SPI or UART, this isn't the case, and pretty much requires fully custom implementations, including hardware support.

Siwastaja · « **Reply #15 on:** July 04, 2021, 07:12:33 am »

Quite descriptive of the STM32 family is that many of their SPI peripherals (they are all different between families, btw) default to some weird "multimaster mode" which makes them fail by default (i.e., the peripheral just turns off and an error flag, MODE FAULT, is set), and you have to explicitly disable that multimaster mode by setting some configuration bit(s) to '1', or, technically, use a configuration bit to "fake" an internal nCS level for the peripheral so it thinks it has won the multi-master arbitration all the time.

It's funny because I'd hazard a guess nobody has ever used their multimaster mode.

The most common SPI pattern is one master, one slave. Chained slaves with shared nCS signals, appearing like a longer-frame single slave, are way less frequent, maybe followed by true multi-slave, with shared SCK/MOSI/MISO signals and separate nCS signals. Multi-master SPI mostly exists in the dreams (nightmares?) of designers instead of real-world designs.

With basic UART, I have used a single-master, multiple-slave pattern where the master sends a recognizable ID string, after which only one slave at a time configures its TX pin as an output driver, sends its message and disables the output driver. This is quite simple as long as the master is responsible for polling and slaves do not just generate spurious messages.

westfw · « **Reply #16 on:** July 04, 2021, 09:35:50 am »

Quote

an interrupt after sending each char is not expensive.

We disagree on that, or my question wouldn't make sense.

Although perhaps that IS a path to the answer - I was assuming that putting data into a DMA buffer and starting the DMA was more "expensive" than putting data into a circular buffer. Which may or may not be true, but it's the wrong question - I need to compare the circular buffer PLUS all of the interrupts needed to empty the buffer to the cost of DMA setup AND completion interrupt. (And put that way, it sounds like DMA is going to come out ahead "often." Interrupt overhead is mostly fixed, and if DMA is N times more complex than per-character mode, then the break-even size is going to be N. And I'm pretty sure that N is a relatively small number, say, less than 20.)
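That break-even argument can be written down as a tiny calculation. This is only a sketch: the function name is hypothetical, the cost of putting bytes into the buffer is assumed to be the same in both schemes and is left out, and any cycle counts fed into it are placeholders rather than measurements.

```c
#include <stdint.h>

/* Smallest message length (in bytes) at which one DMA transfer costs no more
 * CPU cycles than sending the same bytes with one TX interrupt per byte.
 * cycles_per_tx_irq must be non-zero. */
uint32_t dma_break_even_bytes(uint32_t cycles_per_tx_irq,
                              uint32_t cycles_dma_setup,
                              uint32_t cycles_dma_done_irq)
{
    uint32_t dma_total = cycles_dma_setup + cycles_dma_done_irq;

    /* n interrupts cost n * cycles_per_tx_irq, so the DMA wins once
     * n >= dma_total / cycles_per_tx_irq, rounded up. */
    return (dma_total + cycles_per_tx_irq - 1u) / cycles_per_tx_irq;
}
```

For example, if a TX interrupt were 100 cycles, DMA setup 300 cycles and the completion interrupt another 100, the break-even would be 4 bytes.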

Quote

What is terrible is when you have to wait for the previous data to be sent!

Yeah; DMA buffers tend to be sized toward larger output, so if you have dual ("ping-pong") DMA buffers, you don't want to "fill" them both with single-byte outputs and then have to block on your third byte. There are a couple of possible solutions:

This is where it gets Nagle-like: if a DMA output is in progress, you keep filling your 2nd DMA buffer.
Just because part of your buffer is being DMAed doesn't mean you can't put bytes in the rest of the buffer. (This might complicate the DMA ISR - you couldn't automatically chain to the other buffer if you have to examine the current buffer to see whether it's got additional data.)
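A minimal sketch of the second idea, assuming a single circular buffer instead of ping-pong buffers: bytes can be appended while a chunk is in flight, and the completion handler checks whether more data has accumulated before starting the next transfer. All names are hypothetical, and instead of touching real DMA registers the sketch only records the transfer it would start. A real driver would also need to mask interrupts around `txbuf_write()`.

```c
#include <stdint.h>

#define TXBUF_SIZE 64u  /* assumed size; a power of two so indices wrap with a mask */

typedef struct {
    uint8_t  data[TXBUF_SIZE];
    uint32_t head;       /* total bytes ever queued */
    uint32_t tail;       /* total bytes whose DMA has completed */
    uint32_t in_flight;  /* length of the chunk being sent now, 0 when idle */
    /* Stand-ins for hardware: a real driver would program the DMA here. */
    const uint8_t *dma_src;
    uint32_t       dma_starts;
} txbuf_t;

/* Start a DMA for everything queued so far, if the channel is idle. */
static void txbuf_kick(txbuf_t *b)
{
    uint32_t pending = b->head - b->tail;
    if (b->in_flight != 0u || pending == 0u)
        return;

    uint32_t off        = b->tail & (TXBUF_SIZE - 1u);
    uint32_t contiguous = TXBUF_SIZE - off;  /* a chunk must not cross the wrap point */

    b->in_flight = (pending < contiguous) ? pending : contiguous;
    b->dma_src   = &b->data[off];
    b->dma_starts++;
}

/* Append bytes, even while a DMA is running. Returns how many were accepted. */
uint32_t txbuf_write(txbuf_t *b, const uint8_t *src, uint32_t len)
{
    uint32_t space = TXBUF_SIZE - (b->head - b->tail);
    if (len > space)
        len = space;
    for (uint32_t i = 0; i < len; i++)
        b->data[(b->head + i) & (TXBUF_SIZE - 1u)] = src[i];
    b->head += len;
    txbuf_kick(b);
    return len;
}

/* Call from the DMA-complete interrupt: release the sent chunk and send
 * whatever was appended in the meantime. */
void txbuf_on_dma_done(txbuf_t *b)
{
    b->tail += b->in_flight;
    b->in_flight = 0u;
    txbuf_kick(b);
}
```

A single byte written to an idle buffer goes out as a one-byte transfer straight away; everything written while it is in flight is batched into the next transfer, which is the Nagle-like part.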

Siwastaja · « **Reply #17 on:** July 04, 2021, 11:55:56 am »

Interrupt for every byte of data is very expensive unless the data rates are really slow.

IRQ entrance, return, peripheral status flag clearing and some other minimal code ends up, depending on architecture*, around 30 clock cycles, which, at say 1Mbaud and 20MHz CPU clock is already 19% CPU time.

*) for example, AVR has lower interrupt entrance and exit latency but the compiler has to manually save and restore registers, so it ends up the same or even longer than Cortex-M3 and up
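The load estimate above can be reproduced with a few lines of integer arithmetic. This is a sketch with a hypothetical function name; the one assumption that matters is how many bit times each byte occupies on the wire. Counting only the 8 data bits gives roughly the 19% quoted, while 8N1 framing (start bit, 8 data bits, stop bit, so 10 bit times per byte) gives 15% for the same inputs.

```c
#include <stdint.h>

/* CPU time spent in per-byte interrupts, in tenths of a percent (rounded down).
 * bits_per_byte is the number of bit times each byte occupies on the wire
 * (10 for 8N1 framing) and must be non-zero, as must cpu_hz. */
uint32_t irq_load_permille(uint32_t baud, uint32_t bits_per_byte,
                           uint32_t cycles_per_irq, uint32_t cpu_hz)
{
    uint64_t bytes_per_sec = baud / bits_per_byte;  /* one interrupt per byte */
    return (uint32_t)((bytes_per_sec * cycles_per_irq * 1000u) / cpu_hz);
}
```

With 1 Mbaud, 30 cycles per interrupt and a 20 MHz clock this returns 150 for 10 bits per byte and 187 for 8. The SAMD21 estimate earlier in the thread (50k interrupts per second at 100 cycles each on a 48 MHz core) comes out at 104, i.e. about 10%.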

The key here is, controlling the DMA should not take many more cycles than transmitting that single byte. If it does, look for why: is the particular DMA overcomplicated and not a suitable microcontroller DMA design (on STM32, for example, this is not the case, the DMA's OK), or maybe you are using a library which bloats it.

spostma · « **Reply #18 on:** July 04, 2021, 01:22:15 pm »

If you are making a USB slave with dual ping-pong buffers for each endpoint,
then just use one buffer for the DMA transfer, and the other buffer to store the next packet of data.
Do not add more data to that buffer, but just wait for the first buffer to be handled before returning an ACK to the host!

Let the host accumulate the incoming bulk, not your USB slave device!
That way, you will always set up one DMA transfer per USB endpoint packet (if it has new data).

---

Likewise, the UART receiver can store new data bytes directly in the inactive ping-pong buffer
that is used to transmit data to the host.

When returning data to the host, start sending the buffers that are filled most to maximize throughput.

SiliconWizard · « **Reply #19 on:** July 12, 2021, 10:00:06 pm »

Yeah. Just one point: unless you implement flow control, you need to make sure both sides can handle the max throughput allowed by the bit rate. That usually means having full control over both sides.

Otherwise, flow control is required.

