One example from my recent experience that might be useful for somebody.
I've been working on a "music synthesizer" device, which generates music by writing to a DAC register at some sample rate, say 48 kHz.
Before DMA:
Set a timer to fire at the sample frequency. In the timer interrupt handler, do a simple calculation of the sample value and write it to the DAC output register.
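A minimal sketch of that per-sample interrupt, simulated on the desktop. All names here are hypothetical (on real hardware `DAC_DHR` would be the memory-mapped DAC data holding register, and `timer_isr` would be installed as the timer's interrupt handler):

```c
/* Sketch of the "before DMA" approach: one timer interrupt per sample.
   Simulated on a PC; register and handler names are made up. */
#include <stdint.h>

#define SAMPLE_RATE 48000u

volatile uint16_t DAC_DHR;  /* stand-in for the DAC data register */
uint32_t phase;             /* 32-bit phase accumulator for a test tone */

/* On real hardware this runs 48000 times per second. */
void timer_isr(void)
{
    phase += 39370533u;                 /* ~440 Hz: 440/48000 * 2^32 */
    DAC_DHR = (uint16_t)(phase >> 20);  /* keep top 12 bits for a 12-bit DAC */
}
```

Even though the per-sample math is trivial, at 48 kHz this is 48000 interrupt entries and exits per second, and that fixed overhead is exactly what the DMA version removes.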
After DMA:
Precalculate a buffer of samples, say 1024 values. Set up DMA with this buffer as the source (with address increment) and the DAC register as the destination (without address increment, so every DMA write goes to the DAC output register), and enable the "half transfer complete" DMA interrupt. Set the timer to the sample frequency, but instead of configuring a timer interrupt, configure it to trigger a DMA transfer of one sample from the buffer to the DAC register. In the "half transfer complete" interrupt, calculate new data and refill the half of the buffer that has just been transferred.
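The refill side of that scheme can be sketched as follows, again as a desktop simulation. This assumes the DMA is in circular mode (so it wraps to the start of the buffer by itself) and fires both a half-transfer and a transfer-complete interrupt; the function names are hypothetical, and the per-sample copy to the DAC register is done by the DMA hardware, so it does not appear in the code at all:

```c
/* Sketch of the double-buffered ("ping-pong") DMA refill logic.
   On real hardware the DMA controller moves dma_buf[i] to the DAC data
   register on each timer trigger; the CPU only refills the half that
   has already been sent. */
#include <stdint.h>

#define BUF_LEN 1024u  /* total samples in the DMA buffer */

uint16_t dma_buf[BUF_LEN];
uint32_t phase;        /* 32-bit phase accumulator for a test tone */

/* Refill n samples starting at dst with new data (sawtooth test tone). */
void fill_half(uint16_t *dst, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++) {
        phase += 39370533u;                /* ~440 Hz at 48 kHz */
        dst[i] = (uint16_t)(phase >> 20);  /* 12-bit DAC value */
    }
}

/* Half-transfer interrupt: the first half has been sent, refill it. */
void dma_half_transfer_isr(void)     { fill_half(&dma_buf[0], BUF_LEN / 2); }

/* Transfer-complete interrupt: the second half has been sent, refill it;
   circular mode makes the DMA wrap back to dma_buf[0] on its own. */
void dma_transfer_complete_isr(void) { fill_half(&dma_buf[BUF_LEN / 2], BUF_LEN / 2); }
```

With a 1024-sample buffer at 48 kHz, the CPU is interrupted roughly every 10.7 ms instead of every ~21 µs, and each interrupt has about half a buffer's worth of time to finish its refill.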
Result: no more timer interrupt storm. All writes from the buffer to the DAC happen via DMA, triggered by the timer, entirely in hardware, without the CPU. The CPU just refills the buffer of samples occasionally, without hard real-time requirements. Effectively this removes the per-sample timer interrupt overhead.
I haven't measured the improvement quantitatively yet. I could probably measure the maximum achievable sample rate in both cases, or learn how to gather some statistics from FreeRTOS, which I'm using; suggestions about that are welcome.
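One FreeRTOS option for this is its built-in run-time statistics, which report per-task CPU usage. A configuration sketch, assuming a spare fast timer is available as the stats clock (`my_stats_timer_value()` is a hypothetical helper you would write to read it):

```c
/* FreeRTOSConfig.h fragment: enable run-time statistics.
   The stats clock should tick 10-100x faster than the RTOS tick. */
#define configGENERATE_RUN_TIME_STATS            1
#define configUSE_STATS_FORMATTING_FUNCTIONS     1
#define portCONFIGURE_TIMER_FOR_RUN_TIME_STATS() /* start a fast free-running timer here */
#define portGET_RUN_TIME_COUNTER_VALUE()         my_stats_timer_value() /* hypothetical */

/* Then, somewhere in a task, dump a per-task CPU usage table as text: */
char stats[512];
vTaskGetRunTimeStats(stats);  /* task name, absolute time, percentage */
```

Comparing the table between the interrupt-driven and DMA-driven versions should show the CPU time freed up, without needing to push either version to its maximum sample rate.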