Author Topic: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe (Read 1339 times)

peter-h · « **on:** June 04, 2026, 11:10:53 am »

This is for driving an LCD with 16 bit parallel (ST7789). I currently use SPI and up to 42MHz clock is possible on my project.

The standard method is to use the FSMC but that is hard wired to output on ports D and E. I currently use these for lots of GPIO stuff.

I am using the QFP100 package and the FSMC can't be used (not enough pins) but going to the QFP144 would expose enough extra pins.

The problem is: no automatic /WR strobe. And no real speed control unless one outputs in software...

Has anyone found some cunning way?

I realise one can generate a /WR if one sets up a table where there are 3 entries for each word, and then the DMA will generate the write pulse. But then you have a 3x bigger table... Or a 2x bigger table and generate the /WR externally with an edge detector. And you still waste a byte or two just to generate the /WR.

Also, one could do a display fill with one colour, with just a 3 x 4 byte source buffer, if using an incrementing source address within a circular buffer.

There should be some cunning way to generate the /WR pulses from the same timer used to set the DMA rate.

langwadt · « **Reply #1 on:** June 04, 2026, 11:34:05 am »

should be straightforward if you have a timer pin: set the timer to pwm, dma on update, adjust pwm duty cycle to get desired setup/hold

peter-h · « **Reply #2 on:** June 04, 2026, 05:00:29 pm »

That is very cunning.

Can the 32F4 DMA drive a 16 bit value on ports F or G and be synced with a timer event in this way? There is bound to be jitter I would think, due to internal bus contention etc.

But I guess one would use the PWM pulse to latch the previous value, not the one which DMA loads when the timer overflows. So jitter does not matter.

langwadt · « **Reply #3 on:** June 04, 2026, 07:36:40 pm »

Quote from: peter-h on June 04, 2026, 05:00:29 pm

That is very cunning.

Can the 32F4 DMA drive a 16 bit value on ports F or G and be synced with a timer event in this way? There is bound to be jitter I would think, due to internal bus contention etc.

But I guess one would use the PWM pulse to latch the previous value, not the one which DMA loads when the timer overflows. So jitter does not matter.

you should be able to DMA from memory to the ODR register in the port. jitter only matters in that the data has to be at the pins t_setup before the pwm edge

there are probably a few quirks; I think you'd have to set up the first data value manually before starting the timer, and use a timer with a repetition counter (limiting the burst to 65K) to be able to stop the timer at the right time

peter-h · « **Reply #4 on:** June 08, 2026, 09:16:52 pm »

Can one do this with an 8 bit wide port, say PE0-PE7, or PE8-PE15? I know DMA can have an 8-bit target in general.

The problem is that the other half of that 16 bit port may have some pins set to GPIO outputs.

langwadt · « **Reply #5 on:** June 08, 2026, 09:59:41 pm »

Quote from: peter-h on June 08, 2026, 09:16:52 pm

Can one do this with an 8 bit wide port, say PE0-PE7, or PE8-PE15? I know DMA can have an 8-bit target in general.

The problem is that the other half of that 16 bit port may have some pins set to GPIO outputs.

no afaik you cannot do that, only way to only affect some bits is to use the BSRR register

peter-h · « **Reply #6 on:** June 09, 2026, 08:53:29 am »

I should test this because AFAICS the DMA can write just bytes.

But maybe only to the low bits e.g. PE0-PE7.

langwadt · « **Reply #7 on:** June 09, 2026, 10:58:19 am »

Quote from: peter-h on June 09, 2026, 08:53:29 am

I should test this because AFAICS the DMA can write just bytes.

But maybe only to the low bits e.g. PE0-PE7.

the dma can write bytes, but you'll have to test what happens when you only write upper or lower byte of the data register

Yansi · « **Reply #8 on:** June 09, 2026, 11:07:00 am »

Bus access error will likely happen, I think IO port registers can be accessed only as a whole halfword, not individual bytes. But I may be wrong, don't remember and it might be different on different STM32 series. Check the datasheet/ref manual, it is well described there.

DMA can transfer byte/halfwords or words. That is easily configurable.

peter-h · « **Reply #9 on:** June 09, 2026, 11:54:24 am »

Quote

I think IO port registers can be accessed only as a whole halfword, not individual bytes

That may well be the case, in which case it probably works if the half-port has no pins configured as outputs. If all are inputs then the write will have no effect.

And obviously you will be discarding the "other" 8 bits of the 16 bits

I am working on the assumption that I may have to go to QFP144 and then I will get two more 16 bit ports. In the meantime I am using SPI (to drive an LCD) and going up to 42MHz which one can do on SPI1.

Yansi · « **Reply #10 on:** June 09, 2026, 12:30:28 pm »

If you do not need to use the whole 16 bits of a port, it has already been suggested that you point the DMA at the BSRR register instead of ODR!

peter-h · « **Reply #11 on:** June 09, 2026, 01:35:43 pm »

OK I get you, but normally BSRR is used to software-set GPIO bits, and one (in all code I've seen) does a uint32_t write into it.

For changing GPIO bits one does a RMW i.e. reads BSRR into a uint32_t, masks the bit, drops in the desired one, and writes back all 32 bits. So I am not sure how you can write just 1/4 of this register.

SiliconWizard · « **Reply #12 on:** June 09, 2026, 02:17:15 pm »

BSRR is a set/reset register: it allows setting or clearing arbitrary bits. The register is split in half, which is how it works. The upper half sets port bits and the lower half clears them (or the other way around, TBC, don't remember).

A bit set (1) in BSRR modifies the corresponding port bit, a bit clear (0) takes no action.

So you have to format your DMA data accordingly.
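For reference, on the STM32F4 it is the low half of BSRR (bits 0-15) that sets pins and the high half (bits 16-31) that resets them, and where both bits are 1 the set wins. No read-modify-write is involved. A minimal host-side sketch of that behaviour follows; `bsrr_apply`, `bsrr_word` and `bsrr_demo` are made-up names that model the register in software, not ST's API.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of what one BSRR write does to a port's ODR (illustration only). */
static uint16_t bsrr_apply(uint16_t odr, uint32_t bsrr)
{
    uint16_t set   = (uint16_t)(bsrr & 0xFFFFu);  /* BS0..BS15: 1 = drive pin high */
    uint16_t reset = (uint16_t)(bsrr >> 16);      /* BR0..BR15: 1 = drive pin low  */
    /* Clear first, then set, so a pin named in both halves ends up set. */
    return (uint16_t)((odr & (uint16_t)~reset) | set);
}

/* Hypothetical helper: pack a set mask and a reset mask into one BSRR word. */
static uint32_t bsrr_word(uint16_t set_mask, uint16_t reset_mask)
{
    return (uint32_t)set_mask | ((uint32_t)reset_mask << 16);
}

/* Put 0xA5 on pins 8..15: set its 1-bits, reset its 0-bits, and leave
 * pins 0..7 out of both masks so they keep whatever state they had. */
static int bsrr_demo(void)
{
    uint16_t odr = 0x12C3u;
    odr = bsrr_apply(odr, bsrr_word(0xA500u, 0x5A00u));
    return odr == 0xA5C3u;   /* high byte replaced, low byte 0xC3 untouched */
}
```

So a DMA stream aimed at BSRR needs one such 32-bit word per output byte, with zeros in every bit position belonging to a pin that must not change.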

peter-h · « **Reply #13 on:** June 09, 2026, 02:44:44 pm »

So your RAM table needs to be 32 bits wide.

A bit of digging confirms this. Also people have been up this BSRR road before e.g.
https://community.st.com/stm32-mcus-products-25/how-do-i-move-data-continuously-over-dma-from-an-array-in-memory-to-a-8-bit-set-of-gpio-pins-bus-i-feel-like-this-should-be-simple-but-i-am-majorly-stumped-i-am-using-an-stm32h7-nucleo-board-3124
https://www.eevblog.com/forum/microcontrollers/stm32f4-dma-mem-gtgpio-triggered-by-timer/

The wastage is not too bad if you are using DMA to do a one-colour fill of a display region. Then you set up DMA source to be incrementing only within 1 or 2 addresses (i.e. possibly not incrementing at all).

SiliconWizard · « **Reply #14 on:** June 09, 2026, 02:55:14 pm »

Yes you need 32 bits per entry and there's some overhead building the tables, but it's worth it if you need FSMC-like access and can't use FSMC pins.

peter-h · « **Reply #15 on:** June 09, 2026, 03:13:03 pm »

You could use memory-memory DMA to build the table

Actually this is a PITA: if going to a QFP144 to get two extra 16 bit ports, one cannot use the FSMC because it is hardwired to ports D & E, and the extra ports you get on the QFP144 are F and H (and note H0 and H1 are used for the CPU crystal, so only F is 16-bit accessible). Sure, one can probably reassign D or E bits to say H and then use FSMC on D or E.

DavidAlfa · « **Reply #16 on:** June 10, 2026, 05:12:54 pm »

IIRC only Timer 1 dma can write to IO ports in the F family.
Set the timer to PWM, enable two PWM channels, CH1 and CH2.
CH1 = No output, 1% duty, DMA transfer on match between RAM and PORTx->ODR.
CH2 = Output (WR), 99% duty, makes a small negative pulse at the end of the timer cycle.

I recall issuing 8-bit and 16-bit DMA transfers to the ODR port... try yourself, you'll get DMA TEIFx flag if wrong.
I only got it because I was trying to use a different timer DMA to write to the IO regs.
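As a rough sketch of the arithmetic behind this two-channel setup, the following computes ARR and the two compare values from a target cycle time and /WR low width. It assumes PWM mode 1, upcounting, active-high output (so CH2 is high while CNT < CCR2 and low for the rest of the period) and a prescaler of 1. `lcd_timer_plan`, `lcd_plan` and the 168MHz timer clock in the usage note are illustrative assumptions, not taken from the posts above, and none of this has been tried on hardware.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t arr;   /* auto-reload value: the period is arr + 1 ticks      */
    uint32_t ccr1;  /* CH1 compare: DMA request point, early in the period */
    uint32_t ccr2;  /* CH2 compare: /WR is low from here to the period end */
} lcd_timer_plan;

/* Round a duration in ns up to whole timer ticks. */
static uint32_t ns_to_ticks_ceil(uint32_t timer_hz, uint32_t ns)
{
    uint64_t num = (uint64_t)timer_hz * ns;
    return (uint32_t)((num + 999999999u) / 1000000000u);
}

static lcd_timer_plan lcd_plan(uint32_t timer_hz, uint32_t cycle_ns,
                               uint32_t wr_low_ns)
{
    assert(timer_hz > 0u && wr_low_ns > 0u && wr_low_ns < cycle_ns);
    uint32_t period = ns_to_ticks_ceil(timer_hz, cycle_ns);
    uint32_t low    = ns_to_ticks_ceil(timer_hz, wr_low_ns);
    /* Need at least tick 0, the DMA tick, and then the /WR-low ticks. */
    assert(period >= low + 2u);
    lcd_timer_plan p = { period - 1u, 1u, period - low };
    return p;
}
```

With a 168MHz timer clock, `lcd_plan(168000000u, 66u, 20u)` gives ARR = 11, CCR1 = 1 and CCR2 = 8: a 12-tick (about 71ns) cycle with /WR low for the last 4 ticks, rising at the overflow. The data setup time is then whatever is left between the DMA write landing and that rising edge.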

peter-h · « **Reply #17 on:** June 10, 2026, 07:40:23 pm »

Thank you!

I've found that outputting 8 bits wide is not worth doing because there is a 120ns min cycle time for parallel loading of the ST7789, while it will take SPI at 60MHz

One would need to be parallel-loading at least 16 bits at a time to make it worth doing.

Currently I am working with getting everything working at 42MHz (max I can go, 168MHz, APB2=84MHz, SPI1) and then will decide which way to go; and it would have to be a QFP144. Various tweaks e.g. -O3 is 10% faster than -Og so the code is not wholly SPI limited.

DavidAlfa · « **Reply #18 on:** June 10, 2026, 07:56:19 pm »

For the best speed, try -Ofast.
Use spi dma when possible, filling operations are extremely fast with it.

You might want to check out this uGUI mod I did some time ago...
https://github.com/deividAlfa/ST7789-STM32-uGUI

langwadt · « **Reply #19 on:** June 10, 2026, 08:50:54 pm »

Quote from: peter-h on June 10, 2026, 07:40:23 pm

Thank you!

I've found that outputting 8 bits wide is not worth doing because there is a 120ns min cycle time for parallel loading of the ST7789, while it will take SPI at 60MHz

this says 66ns https://www.buydisplay.com/download/ic/ST7789.pdf

peter-h · « **Reply #20 on:** June 10, 2026, 09:22:38 pm »

Quote

For the best speed, try -Ofast.
Use spi dma when possible, filling operations are extremely fast with it.

I found -Ofast to be same as -O3. Actually same code size too.

Already use DMA for everything horizontal over 5 pixels. Maybe it could be less because 5px is 10-15 bytes... but my SPI3 is also currently blocking because it has a mutex (shared with much else) and when I go to SPI1 it will be dedicated. The Ramtex library is not so smart and I did tons of optimisations but ultimately if doing say rotated text or angled lines, it is 1 pixel at a time unless you do lots of messy stuff. It is all one way; no buffer and no readback. For scanline work (use that a lot for fills) I DMA up to 1.5k in one go.

Quote

this says 66ns

You are right! I was looking at the RGB timing.

So that is 15MB/sec versus 5.25MB/sec via SPI. That assumes I can actually get 15MHz with that PWM timer /WR hack... I'd have to test it.
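The comparison is simple enough to write down. A small sketch of the arithmetic, ignoring any gaps between bytes or per-transfer overhead; the function names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* SPI shifts one bit per clock, so bytes/s = SCK / 8. */
static uint32_t spi_bytes_per_sec(uint32_t sck_hz)
{
    return sck_hz / 8u;
}

/* Parallel bus: one write of bus_bytes every cycle_ns nanoseconds. */
static uint32_t par_bytes_per_sec(uint32_t cycle_ns, uint32_t bus_bytes)
{
    assert(cycle_ns > 0u);
    return (uint32_t)(1000000000ull * bus_bytes / cycle_ns);
}
```

`spi_bytes_per_sec(42000000u)` is 5,250,000 and `par_bytes_per_sec(66u, 1u)` is 15,151,515, which are the 5.25MB/sec and 15MB/sec figures. A 16-bit bus at the same 66ns cycle would double the latter.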

peter-h · « **Reply #21 on:** June 11, 2026, 09:22:37 am »

TIM1 and DMA1 are already used in my project.

Digging around more, it looks like TIM8 -> DMA2 should work, and a byte write to ODR+1 should land in ODR[15:8] and drive PD8-PD15.

It also looks like PC6-8 can be used for the PWM output to produce the /WR.

There are some gotchas with spurious TIM behaviour. People found that with TIM8_UP, if setting the UG (update generation) bit or the initial configuration causes an immediate update event when you enable the timer, you can get a spurious first trigger. You may have to set the UDIS bit or clear the update flag carefully, or use the ARPE/preload setup so that the first genuine overflow is the first DMA trigger. One method: after configuring TIM8 but before enabling the DMA request, set the UG bit (EGR.UG) to force a known update that loads the preload registers, then clear the update flag (SR.UIF), then enable the DMA request (DIER.UDE), then enable the counter.
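That ordering, written out as a sketch. The `tim_regs` struct and the bit macros are stand-ins so it builds on a PC; on the target one would use `TIM8` and the `TIM_*` masks from the CMSIS device header instead. `tim_start_clean` is a made-up name, and I have not verified on silicon that this sequence really suppresses the spurious first request.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the TIM registers touched here (illustration only). */
typedef struct { volatile uint32_t CR1, DIER, SR, EGR; } tim_regs;

#define CR1_CEN   (1u << 0)  /* counter enable            */
#define DIER_UDE  (1u << 8)  /* update DMA request enable */
#define SR_UIF    (1u << 0)  /* update interrupt flag     */
#define EGR_UG    (1u << 0)  /* update generation         */

static void tim_start_clean(tim_regs *tim)
{
    assert((tim->CR1 & CR1_CEN) == 0u);  /* counter must still be stopped */
    tim->DIER &= ~DIER_UDE;  /* no DMA request while the update is forced  */
    tim->EGR   = EGR_UG;     /* forced update: loads the preload registers */
    tim->SR    = ~SR_UIF;    /* clear UIF (write 0 to clear, 1s ignored)   */
    tim->DIER |= DIER_UDE;   /* from now on, real updates request DMA      */
    tim->CR1  |= CR1_CEN;    /* start counting                             */
}

/* Host-side check of the final register state on the stand-in struct. */
static int tim_start_demo(void)
{
    tim_regs t = { 0u, DIER_UDE, SR_UIF, 0u };
    tim_start_clean(&t);
    return (t.CR1 & CR1_CEN) && (t.DIER & DIER_UDE) && !(t.SR & SR_UIF);
}
```

The DMA stream itself would need to be configured and enabled before the last two writes, otherwise the first real update has nothing to trigger.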

The biggest question mark is the ODR+1 and byte write aspect. I also don't want to write stuff to PD0-PD7 since some of those are outputs.

DMA to BSRR would need a specially loaded source buffer, not just the data bytes.

Reading e.g. this
https://community.st.com/stm32-mcus-products-25/can-dma-output-a-byte-stream-with-8080-style-write-strobe-34627
I think what I am trying to do is impossible.

It may still be worth doing in software (no DMA), which is obviously possible. That is not a trivial consideration, because DMA transfers are often blocking anyway (the code waits so that it can set /CS=1 at the end), so one gains in speed but the CPU is not freed to do something else.

peter-h · « **Reply #22 on:** June 15, 2026, 11:18:02 am »

Moving to next stage of this, I did a quick and dirty test writing to an 8 bit port, in software, with a /wr strobe generated on another pin.

Ignore the crappy waveform. I don't have properly grounded probes etc. But this shows a 20ns (2 CPU cycle times at 168MHz!) wide /wr and an overall cycle time of 66ns which is 15MB/sec. A 1 cycle time (7ns) wide /wr is also possible, which surprised me.

[Scope capture, 20ns/div]

Data setup and hold times are tons. I am amazed how fast one can waggle a GPIO!

Here is the code - just a quick hack to output bytes on PD8-PD15 and /wr on PD2

Code:

			#define WR_PIN      (1u << 2)   // /WR on PD2
			#define WR_PIN_POS  2u

			// Configure PD2 (/WR) as push-pull output, very-high speed, no pull.
			// Call once before strobing.
			    RCC->AHB1ENR |= RCC_AHB1ENR_GPIODEN;
			    (void)RCC->AHB1ENR;                          // let the clock settle

			    // MODER PD2 = 01 (output)
			    GPIOD->MODER   = (GPIOD->MODER   & ~(3u << (WR_PIN_POS * 2)))
			                                     |  (1u << (WR_PIN_POS * 2));
			    // OTYPER PD2 = 0 (push-pull)
			    GPIOD->OTYPER &= ~(1u << WR_PIN_POS);
			    // OSPEEDR PD2 = 11 (very high speed)
			    GPIOD->OSPEEDR = (GPIOD->OSPEEDR & ~(3u << (WR_PIN_POS * 2)))
			                                     |  (3u << (WR_PIN_POS * 2));
			    // PUPDR PD2 = 00 (no pull)
			    GPIOD->PUPDR  &= ~(3u << (WR_PIN_POS * 2));

			    // Data bus PD8-PD15: force output, push-pull, VERY-HIGH speed, no pull,
			    // so the data lines slew as fast as /WR (else /WR falls before data
			    // has finished changing). One register pair covers all 8 bits.
			    GPIOD->MODER   = (GPIOD->MODER   & 0x0000FFFFu) | 0x55550000u; // PD8-15 = 01 output
			    GPIOD->OTYPER &= ~(0xFF00u);                                   // PD8-15 push-pull
			    GPIOD->OSPEEDR = (GPIOD->OSPEEDR & 0x0000FFFFu) | 0xFFFF0000u; // PD8-15 = 11 very-high
			    GPIOD->PUPDR   = (GPIOD->PUPDR   & 0x0000FFFFu);               // PD8-15 = 00 no pull

			    // Idle /WR high (de-asserted)
			    GPIOD->BSRR = WR_PIN;

			// Example: 256-byte alternating buffer.

			    static uint8_t buf[256];
			    for (int i = 0; i < 256; i++)
			        buf[i] = (i & 1) ? 0xAAu : 0x55u;

			    while(true){

				    volatile uint16_t *odr  = (volatile uint16_t *)&GPIOD->ODR;  // 16-bit low half
				    volatile uint32_t *bsrr = &GPIOD->BSRR;
				    const uint32_t wr_high  = WR_PIN;

				    uint8_t       *p   = buf;
				    const uint8_t *end = buf + sizeof(buf);

				    __asm volatile (
				    "   ldrb   r3, [%[p]], #1    \n"   // preload first byte
				    "   lsl    r3, r3, #8        \n"   // position onto PD8-PD15; /WR(bit2)=0
				    "1:                          \n"
				    "   strh   r3, [%[odr]]      \n"   // drive bus + /WR LOW (falling edge)
				    "   nop                      \n"   // SETUP: data valid + /WR low, before the
				    "                            \n"   //   rising (latch) edge. Add more NOPs here
				    "                            \n"   //   for more setup. This window = TDST.
				    "   cmp    %[p], %[end]      \n"   // (in /WR-low window) done?
				    "   beq    2f                \n"   // if last byte, skip the prefetch
				    "   ldrb   r3, [%[p]], #1    \n"   // (in /WR-low window) load NEXT byte
				    "   lsl    r3, r3, #8        \n"   // (in /WR-low window) position NEXT byte
				    "   str    %[wrh], [%[bsrr]] \n"   // /WR HIGH -> latch this byte
				    "   b      1b                \n"
				    "2:                          \n"
				    "   nop                      \n"   // pad TWRL for the final byte (no prefetch)
				    "   nop                      \n"
				    "   str    %[wrh], [%[bsrr]] \n"   // /WR HIGH -> latch final byte
				    : [p] "+r" (p)
				    : [odr] "r" (odr), [bsrr] "r" (bsrr),
				      [wrh] "r" (wr_high), [end] "r" (end)
				    : "r3", "cc", "memory"
				    );
			    }

wek · « **Reply #23 on:** June 16, 2026, 09:58:29 am »

~20ns means 3 cycles.
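For anyone checking the sums, a tiny sketch of the conversion (integer maths only; the function names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Length of one core clock cycle in picoseconds, truncated. */
static uint32_t cycle_time_ps(uint32_t hclk_hz)
{
    assert(hclk_hz > 0u);
    return (uint32_t)(1000000000000ull / hclk_hz);
}

/* Nearest whole number of core cycles for a duration in ns. */
static uint32_t ns_to_cycles_nearest(uint32_t hclk_hz, uint32_t ns)
{
    return (uint32_t)(((uint64_t)hclk_hz * ns + 500000000ull) / 1000000000ull);
}
```

At 168MHz one cycle is 5952ps, so 2 cycles is about 12ns and 3 cycles is about 18ns; `ns_to_cycles_nearest(168000000u, 20u)` returns 3.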

When I did experiments back when I was starting with STM32F4xx, I was unable to generate a 2-cycle long pulse by writing from processor to GPIO; only 1-cycle and 3-cycle. I believe this is due to the way the GPIO's AHB bus is arbitrated within the bus matrix, but I am not an insider so there may be some other mechanism involved, e.g. within the processor's S port.

[As a curiosity, in those early days I experimented also with bit-banging GPIO on a NXP LPC1786 (Cortex-M3 but it should be mostly identical to M4 in this regard). At one point, the data hold delay to clock edge disappeared (i.e. instructions were write_for_clock_edge->write_for_data_change, and both changes appeared at the GPIO at the same time). Clock was written through bit-banding to the same GPIO port as data. My conclusion was, that as - contrary to STM32 - in LPC17xx GPIO is in the 0x2xxx'xxxx area, which by default is Normal (rather than Device), the processor was allowed to collapse both writes to one, and it did so. As at the same time we decided to select the more capable STM32, I've never gotten to proving this hypothesis (by setting the GPIO as Device using MPU).]

With DMA, things will get different again, the DMA will impose enough delay (IMO min 3 cycles) so this thing won't be observable. OTOH, the DMA's delays/latencies are relatively hard to calculate with absolute confidence, especially if other channels in DMA are used simultaneously.

Btw. you don't need to run benchmarks at maximum clock frequency, if you keep everything else set in the same way (e.g. FLASH latency): this is an extensively synchronous machine.

JW

peter-h · « **Reply #24 on:** June 16, 2026, 11:27:58 am »

Quote

only 1-cycle and 3-cycle

That's funny; I now realise I found the same when packing the /WR pulse width with NOPs

I am now running this somewhat weird code, to get data out on PD8-PD15, with /WR on PD1, and no other PD bits (i.e. PD0, PD2-7) affected. Due to the BSRR weirdness there is a 256x uint32_t lookup table. It is in FLASH because I don't want to waste 1k of RTOS stack space. Generated by Claude.

Code:

	// ─────────────────────────────────────────────────────────────────────────────
	// Compile-time BSRR table for LCD_PAR_WR - const, so it lives in FLASH (.rodata),
	// not RAM. No runtime fill, no 256x4 bytes of RAM used.
	//
	// The 256 entries are generated by the preprocessor; each is a constant
	// expression the compiler folds at build time:
	//   entry b = (b<<8)            data 1-bits -> SET   (PD8-PD15, low half)
	//           | ((~b & 0xFF)<<24) data 0-bits -> RESET (PD8-PD15, high half)
	//           | (WR_PIN<<16)      /WR (PD1)   -> RESET (low)
	// Pin-safe: only PD8-PD15 and PD1 ever appear in an entry, so a BSRR write
	// leaves every other PD pin alone; PD1 (the /WR) is the only non-data pin driven.
	// ─────────────────────────────────────────────────────────────────────────────

	#define WR_PIN  (1u << 1)        // /WR on PD1

	// One table entry for byte value b.
	#define LCD_PAR_E(b)   ( ((uint32_t)(b) << 8)                 \
	                       | ((uint32_t)((~(uint32_t)(b)) & 0xFFu) << 24) \
	                       | ((uint32_t)WR_PIN << 16) )

	// 16 consecutive entries starting at n.
	#define LCD_PAR_E16(n) \
	    LCD_PAR_E((n)+0),  LCD_PAR_E((n)+1),  LCD_PAR_E((n)+2),  LCD_PAR_E((n)+3),  \
	    LCD_PAR_E((n)+4),  LCD_PAR_E((n)+5),  LCD_PAR_E((n)+6),  LCD_PAR_E((n)+7),  \
	    LCD_PAR_E((n)+8),  LCD_PAR_E((n)+9),  LCD_PAR_E((n)+10), LCD_PAR_E((n)+11), \
	    LCD_PAR_E((n)+12), LCD_PAR_E((n)+13), LCD_PAR_E((n)+14), LCD_PAR_E((n)+15)

	// Full 256-entry table, built entirely at compile time -> flash, not RAM.
	static const uint32_t lcd_par_bsrr[256] = {
	    LCD_PAR_E16(0),   LCD_PAR_E16(16),  LCD_PAR_E16(32),  LCD_PAR_E16(48),
	    LCD_PAR_E16(64),  LCD_PAR_E16(80),  LCD_PAR_E16(96),  LCD_PAR_E16(112),
	    LCD_PAR_E16(128), LCD_PAR_E16(144), LCD_PAR_E16(160), LCD_PAR_E16(176),
	    LCD_PAR_E16(192), LCD_PAR_E16(208), LCD_PAR_E16(224), LCD_PAR_E16(240)
	};

	// Strobe ONE byte: precomputed data+/WR-low word from the (flash) table, NOP for
	// the /WR-low window (unchanged width), then /WR high to latch.
	#define LCD_PAR_WR(byte) do {                                  \
	        GPIOD->BSRR = lcd_par_bsrr[(uint8_t)(byte)];           \
	        __asm volatile ("nop");                                \
	        GPIOD->BSRR = WR_PIN;                                  \
	    } while (0)



// Pipelined buffer strobe: emit a contiguous run of DATA bytes, /WR on PD1, with
// the next byte's table load issued during the current byte's /WR-low window so
// the flash latency is hidden. _p and _end are byte pointers (start, one-past-end).
// The run must hold at least one byte: the first ldrb executes before any check.
#define LCD_PAR_WR_BUF(_p, _end) do {                              \
		uint8_t       *_pp  = (_p);                                \
		const uint8_t *_ee  = (_end);                              \
		uint32_t _t1, _t2;                                         \
		__asm volatile (                                           \
		"   ldrb  %[t1], [%[p]], #1              \n"               \
		"   ldr   %[t2], [%[lut], %[t1], lsl #2] \n"               \
		"1: str   %[t2], [%[bsrr]]               \n"               \
		"   cmp   %[p], %[end]                   \n"               \
		"   beq   2f                             \n"               \
		"   ldrb  %[t1], [%[p]], #1              \n"               \
		"   ldr   %[t2], [%[lut], %[t1], lsl #2] \n"               \
		"   str   %[wrhi], [%[bsrr]]             \n"               \
		"   b     1b                             \n"               \
		"2: nop                                  \n"               \
		"   str   %[wrhi], [%[bsrr]]             \n"               \
		: [p] "+r" (_pp), [t1] "=&r" (_t1), [t2] "=&r" (_t2)       \
		: [lut] "r" (lcd_par_bsrr), [bsrr] "r" (&GPIOD->BSRR),     \
		  [wrhi] "r" ((uint32_t)WR_PIN), [end] "r" (_ee)           \
		: "cc", "memory" );                                        \
	} while (0)

I am getting a 70ns cycle to the LCD which is ~14MB/sec. The ST7789 min is 66ns. The only faster thing would be going to 16 bit parallel which means the QFP144 package and more hassle...

AIUI, no way to use DMA for 8 bit parallel unless one does one of these

- make sure the other half of the 16 bit port is all inputs, so writing to them has no effect
- prepare the data table to be 32 bit wide, containing BSRR values (in most cases this will be too wasteful if the table data is actually variable)
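To put a number on the second option, here is a minimal sketch of expanding a byte buffer into BSRR words for PD8-PD15, using the same per-entry formula as the flash table above. `lcd_expand_bsrr` and `lcd_expand_demo` are hypothetical names; `wr_pin_mask` is the low-byte pin to drive low with each word, and could be 0 if a timer channel generates /WR instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand n data bytes into n BSRR words; returns the bytes of RAM used. */
static size_t lcd_expand_bsrr(const uint8_t *src, size_t n,
                              uint32_t *dst, size_t dst_words,
                              uint16_t wr_pin_mask)
{
    assert(src && dst && n <= dst_words);
    assert((wr_pin_mask & 0xFF00u) == 0u);  /* /WR must not sit on a data pin */
    for (size_t i = 0; i < n; i++) {
        uint32_t b = src[i];
        dst[i] = (b << 8)                         /* 1-bits: set PD8-PD15   */
               | ((~b & 0xFFu) << 24)             /* 0-bits: reset PD8-PD15 */
               | ((uint32_t)wr_pin_mask << 16);   /* /WR pin: reset (low)   */
    }
    return n * sizeof(uint32_t);
}

/* Three bytes in, three words out, 12 bytes of buffer consumed. */
static int lcd_expand_demo(void)
{
    const uint8_t in[3] = { 0x00u, 0xFFu, 0xA5u };
    uint32_t out[3];
    size_t used = lcd_expand_bsrr(in, 3, out, 3, (uint16_t)(1u << 1));
    return used == 12u && out[0] == 0xFF020000u
        && out[1] == 0x0002FF00u && out[2] == 0x5A02A500u;
}
```

The 4x growth is the wastage: a 1.5k scanline of pixel data becomes a 6k DMA source buffer, plus the CPU time to build it.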

