EEVblog® Electronics Community Forum

Electronics => Microcontrollers => Topic started by: peter-h on June 04, 2026, 11:10:53 am

Title: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 04, 2026, 11:10:53 am
This is for driving an LCD with 16 bit parallel (ST7789). I currently use SPI and up to 42MHz clock is possible on my project.

The standard method is to use the FSMC but that is hard wired to output on ports D and E. I currently use these for lots of GPIO stuff.

I am using the QFP100 package and the FSMC can't be used (not enough pins) but going to the QFP144 would expose enough extra pins.

The problem is: no automatic /WR strobe. And no real speed control unless one outputs in software...

Has anyone found some cunning way?

I realise one can generate a /WR if one sets up a table where there are 3 entries for each word, and then the DMA will generate the write pulse. But then you have a 3x bigger table... Or a 2x bigger table and generate the /WR externally with an edge detector. And you still waste a byte or two just to generate the /WR.

Also, one could do a display fill with one colour, with just a 3 x 4 byte source buffer, if using an incrementing source address within a circular buffer.

There should be some cunning way to generate the /WR pulses from the same timer used to set the DMA rate.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: langwadt on June 04, 2026, 11:34:05 am
Should be straightforward if you have a timer pin: set the timer to PWM, trigger DMA on the update event, and adjust the PWM duty cycle to get the desired setup/hold.
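
As a rough illustration of the arithmetic behind this suggestion, here is a minimal sketch that turns a target cycle time and /WR low time into timer reload and compare values. The helper names (`ns_to_ticks_ceil`, `wr_pwm_config`) are hypothetical, and it assumes a PWM mode in which the output is low while CNT < CCR; the polarity on a real board depends on the PWM mode and output polarity chosen.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert nanoseconds to timer ticks, rounding up so the
 * resulting time is never shorter than requested. */
static uint32_t ns_to_ticks_ceil(uint32_t timer_hz, uint32_t ns)
{
    uint64_t num = (uint64_t)timer_hz * ns;
    return (uint32_t)((num + 999999999ull) / 1000000000ull);
}

typedef struct {
    uint32_t arr;  /* auto-reload value: period in ticks minus 1 */
    uint32_t ccr;  /* compare value: number of ticks /WR stays low (assumed polarity) */
} wr_pwm_cfg;

/* Hypothetical helper: derive ARR and CCR for a /WR strobe of the given
 * cycle time and low time. */
static wr_pwm_cfg wr_pwm_config(uint32_t timer_hz, uint32_t cycle_ns,
                                uint32_t wr_low_ns)
{
    uint32_t period = ns_to_ticks_ceil(timer_hz, cycle_ns);
    uint32_t low    = ns_to_ticks_ceil(timer_hz, wr_low_ns);
    assert(period >= 2u && low >= 1u && low < period);
    wr_pwm_cfg cfg = { period - 1u, low };
    return cfg;
}
```

With a 168MHz timer clock, a 66ns minimum cycle rounds up to 12 ticks (about 71.4ns), so `wr_pwm_config(168000000u, 66u, 30u)` gives ARR = 11 and CCR = 6.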

Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 04, 2026, 05:00:29 pm
That is very cunning.

Can the 32F4 DMA drive a 16 bit value on ports F or G and be synced with a timer event in this way? There is bound to be jitter I would think, due to internal bus contention etc.

But I guess one would use the PWM pulse to latch the previous value, not the one which DMA loads when the timer overflows. So jitter does not matter.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: langwadt on June 04, 2026, 07:36:40 pm
Quote
That is very cunning.

Can the 32F4 DMA drive a 16 bit value on ports F or G and be synced with a timer event in this way? There is bound to be jitter I would think, due to internal bus contention etc.

But I guess one would use the PWM pulse to latch the previous value, not the one which DMA loads when the timer overflows. So jitter does not matter.

You should be able to DMA from memory to the ODR register in the port. Jitter only matters in that the data has to be at the pins t_setup before the PWM edge.

There are probably a few quirks: I think you'd have to set up the first data value manually before starting the timer, and use a timer with a repetition counter (limiting the burst to 65K) to be able to stop the timer at the right time.

Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 08, 2026, 09:16:52 pm
Can one do this with an 8 bit wide port, say PE0-PE7, or PE8-PE15? I know DMA can have an 8-bit target in general.

The problem is that the other half of that 16 bit port may have some pins set to GPIO outputs.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: langwadt on June 08, 2026, 09:59:41 pm
Quote
Can one do this with an 8 bit wide port, say PE0-PE7, or PE8-PE15? I know DMA can have an 8-bit target in general.

The problem is that the other half of that 16 bit port may have some pins set to GPIO outputs.

No, AFAIK you cannot do that; the only way to affect just some bits is to use the BSRR register.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 09, 2026, 08:53:29 am
I should test this because AFAICS the DMA can write just bytes.

But maybe only to the low bits e.g. PE0-PE7.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: langwadt on June 09, 2026, 10:58:19 am
Quote
I should test this because AFAICS the DMA can write just bytes.

But maybe only to the low bits e.g. PE0-PE7.

the dma can write bytes, but you'll have to test what happens when you only write upper or lower byte of the data register
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: Yansi on June 09, 2026, 11:07:00 am
A bus access error will likely happen; I think IO port registers can be accessed only as a whole halfword, not individual bytes. But I may be wrong, I don't remember, and it might be different on different STM32 series. Check the datasheet/reference manual, it is well described there.

DMA can transfer byte/halfwords or words. That is easily configurable.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 09, 2026, 11:54:24 am
Quote
I think IO port registers can be accessed only as a whole halfword, not individual bytes

That may well be the case, in which case it probably works if the half-port has no pins configured as outputs. If all are inputs then the write will have no effect.

And obviously you will be discarding the "other" 8 bits of the 16 bits :)

I am working on the assumption that I may have to go to QFP144 and then I will get two more 16 bit ports. In the meantime I am using SPI (to drive an LCD) and going up to 42MHz which one can do on SPI1.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: Yansi on June 09, 2026, 12:30:28 pm
If you do not need the whole 16 bits of a port, it has already been suggested that you have the DMA write to the BSRR register instead of ODR!
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 09, 2026, 01:35:43 pm
OK I get you, but normally BSRR is used to software-set GPIO bits, and one (in all code I've seen) does a uint32_t write into it.

For changing GPIO bits one does a RMW i.e. reads BSRR into a uint32_t, masks the bit, drops in the desired one, and writes back all 32 bits. So I am not sure how you can write just 1/4 of this register.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: SiliconWizard on June 09, 2026, 02:17:15 pm
BSRR is a set/reset register: it allows setting or clearing arbitrary bits. The register is split in half, which is how it works. The lower half (bits 15:0) sets port bits and the upper half (bits 31:16) clears them.

A bit set (1) in BSRR modifies the corresponding port bit, a bit clear (0) takes no action.

So you have to format your DMA data accordingly.
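
To make the formatting concrete, here is a minimal host-side sketch that models what a single BSRR write does to a 16-bit ODR value, plus a helper that builds the BSRR word for a group of pins. Both function names are hypothetical; the model assumes the usual STM32F4 behaviour where bits 15:0 set, bits 31:16 reset, and set takes priority if both are requested for the same pin.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of one BSRR write acting on a 16-bit ODR value.
 * Low half sets pins, high half resets pins, set wins on conflict. */
static uint16_t bsrr_apply(uint16_t odr, uint32_t bsrr)
{
    uint16_t set   = (uint16_t)(bsrr & 0xFFFFu);
    uint16_t reset = (uint16_t)(bsrr >> 16);
    return (uint16_t)((odr & (uint16_t)~reset) | set);
}

/* Hypothetical helper: BSRR word that forces the pins selected by `mask`
 * to `value` and leaves every other pin of the port untouched. */
static uint32_t bsrr_for_field(uint16_t mask, uint16_t value)
{
    assert((value & (uint16_t)~mask) == 0);          /* value must fit in mask */
    uint16_t set   = value;                          /* 1-bits -> set   */
    uint16_t reset = (uint16_t)(mask & (uint16_t)~value); /* 0-bits -> reset */
    return ((uint32_t)reset << 16) | set;
}
```

For example, driving 0xA5 onto bits 15:8 uses `bsrr_for_field(0xFF00, 0xA500)`, which is 0x5A00A500; applied to an ODR of 0x00FF the low byte is preserved and the result is 0xA5FF.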
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 09, 2026, 02:44:44 pm
So your RAM table needs to be 32 bits wide.

A bit of digging confirms this. Also people have been up this BSRR road before e.g.
https://community.st.com/stm32-mcus-products-25/how-do-i-move-data-continuously-over-dma-from-an-array-in-memory-to-a-8-bit-set-of-gpio-pins-bus-i-feel-like-this-should-be-simple-but-i-am-majorly-stumped-i-am-using-an-stm32h7-nucleo-board-3124
https://www.eevblog.com/forum/microcontrollers/stm32f4-dma-mem-gtgpio-triggered-by-timer/

The wastage is not too bad if you are using DMA to do a one-colour fill of a display region. Then you set up DMA source to be incrementing only within 1 or 2 addresses (i.e. possibly not incrementing at all).
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: SiliconWizard on June 09, 2026, 02:55:14 pm
Yes you need 32 bits per entry and there's some overhead building the tables, but it's worth it if you need FSMC-like access and can't use FSMC pins.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 09, 2026, 03:13:03 pm
You could use memory-memory DMA to build the table ;)

Actually this is a PITA: even when going to a QFP144 to get two extra 16 bit ports, one cannot use the FSMC on them because it is hardwired to ports D & E, and the extra ports you get on the QFP144 are F and H (and note H0 and H1 are used for the CPU crystal, so only F is 16-bit accessible). Sure, one can probably reassign D or E bits to, say, H and then use FSMC on D or E.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: DavidAlfa on June 10, 2026, 05:12:54 pm
IIRC only Timer 1 dma can write to IO ports in the F family.
Set the timer to PWM, enable two PWM channels, CH1 and CH2.
CH1 = No output, 1% duty, DMA transfer on match between RAM and PORTx->ODR.
CH2 = Output (WR), 99% duty, makes a small negative pulse at the end of the timer cycle.

I recall issuing 8-bit and 16-bit DMA transfers to the ODR port... try yourself, you'll get DMA TEIFx flag if wrong.
I only got it because I was trying to use a different timer DMA to write to the IO regs.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 10, 2026, 07:40:23 pm
Thank you!

I've found that outputting 8 bits wide is not worth doing because there is a 120ns min cycle time for parallel loading of the ST7789, while it will take SPI at 60MHz ;)

One would need to be parallel-loading at least 16 bits at a time to make it worth doing.

Currently I am working with getting everything working at 42MHz (max I can go, 168MHz, APB2=84MHz, SPI1) and then will decide which way to go; and it would have to be a QFP144. Various tweaks e.g. -O3 is 10% faster than -Og so the code is not wholly SPI limited.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: DavidAlfa on June 10, 2026, 07:56:19 pm
For the best speed, try -Ofast.
Use SPI DMA when possible; filling operations are extremely fast with it.

You might want to check out this uGUI mod I did some time ago...
https://github.com/deividAlfa/ST7789-STM32-uGUI
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: langwadt on June 10, 2026, 08:50:54 pm
Quote
Thank you!

I've found that outputting 8 bits wide is not worth doing because there is a 120ns min cycle time for parallel loading of the ST7789, while it will take SPI at 60MHz ;)

this says 66ns https://www.buydisplay.com/download/ic/ST7789.pdf
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 10, 2026, 09:22:38 pm
Quote
For the best speed, try -Ofast.
Use spi dma when possible, filling operations are extremely fast with it it.

I found -Ofast to be same as -O3. Actually same code size too.

Already use DMA for everything horizontal over 5 pixels. Maybe it could be less because 5px is 10-15 bytes... but my SPI3 is also currently blocking because it has a mutex (shared with much else) and when I go to SPI1 it will be dedicated. The Ramtex library is not so smart and I did tons of optimisations but ultimately if doing say rotated text or angled lines, it is 1 pixel at a time unless you do lots of messy stuff. It is all one way; no buffer and no readback. For scanline work (use that a lot for fills) I DMA up to 1.5k in one go.

Quote
this says 66ns

You are right! I was looking at the RGB timing.

So that is 15MB/sec versus 5.25MB/sec via SPI. That assumes I can actually get 15MHz with that PWM timer /WR hack... I'd have to test it.
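
The comparison above is simple arithmetic, sketched here with two hypothetical helpers. It assumes 8 SPI clocks per byte with no inter-byte gaps, and one bus transfer per /WR cycle.

```c
#include <assert.h>
#include <stdint.h>

/* SPI: one bit per clock, so 8 clocks per byte (gaps between bytes ignored). */
static uint32_t spi_bytes_per_sec(uint32_t sck_hz)
{
    return sck_hz / 8u;
}

/* Parallel bus: bus_bytes are transferred once per write cycle of cycle_ns. */
static uint32_t parallel_bytes_per_sec(uint32_t cycle_ns, uint32_t bus_bytes)
{
    assert(cycle_ns > 0u);
    return (uint32_t)(1000000000ull * bus_bytes / cycle_ns);
}
```

`spi_bytes_per_sec(42000000u)` gives 5,250,000 bytes/sec, and `parallel_bytes_per_sec(66u, 1u)` gives about 15.15 million bytes/sec for an 8-bit bus at the 66ns minimum cycle; a 16-bit bus at the same cycle would double that.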
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 11, 2026, 09:22:37 am
TIM1 and DMA1 are already used in my project.

Digging around more, it looks like TIM8 -> DMA2 should work, and a byte write to ODR+1 should land in ODR[15:8] and drive PD8-PD15.

It also looks like PC6-8 can be used for the PWM output to produce the /WR.

There are some gotchas with spurious TIM behaviour. People found that with TIM8_UP, when you enable the timer, if the UG (update generation) bit or the initial configuration causes an immediate update event, you can get a spurious first trigger. You may have to set the UDIS bit or clear the update flag carefully, or use the ARPE/preload setup so the first genuine overflow is the first DMA trigger. One method is, after configuring TIM8 but before enabling the DMA request, to set the UG bit (EGR.UG) to force a known update that loads the preload registers, then clear the update flag (SR.UIF), then enable the DMA request (DIER.UDE), then enable the counter.
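
The ordering described above might be sketched as follows. This is untested on hardware: `tim_regs` is a hypothetical stand-in for the handful of timer registers involved so that the sequence compiles on a host, and on target one would use `TIM8` and the bit definitions from the CMSIS device header instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the TIM registers used here (not the real layout). */
typedef struct {
    volatile uint32_t CR1, DIER, SR, EGR;
} tim_regs;

#define TIM_CR1_CEN   (1u << 0)   /* counter enable            */
#define TIM_DIER_UDE  (1u << 8)   /* update DMA request enable */
#define TIM_SR_UIF    (1u << 0)   /* update interrupt flag     */
#define TIM_EGR_UG    (1u << 0)   /* update generation         */

/* Start the timer so that the first DMA request comes from the first
 * genuine overflow rather than from the forced update. */
static void tim_start_clean(tim_regs *t)
{
    assert(t != 0);
    t->DIER &= ~TIM_DIER_UDE;   /* DMA request off while the update is forced   */
    t->EGR   = TIM_EGR_UG;      /* force an update: loads the preload registers */
    t->SR    = ~TIM_SR_UIF;     /* clear the update flag (write 0 to clear)     */
    t->DIER |= TIM_DIER_UDE;    /* now allow update events to request DMA       */
    t->CR1  |= TIM_CR1_CEN;     /* finally start counting                       */
}
```

Whether this fully suppresses the spurious first request would need checking on a scope.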

The biggest question mark is the ODR+1 and byte write aspect. I also don't want to write stuff to PD0-PD7 since some of those are outputs.

DMA to BSRR would need a specially loaded source buffer, not just the data bytes.

Reading e.g. this
https://community.st.com/stm32-mcus-products-25/can-dma-output-a-byte-stream-with-8080-style-write-strobe-34627
I think what I am trying to do is impossible.

It may still be worth doing in software (no DMA), where it is obviously possible. That is not a trivial consideration, because DMA transfers are often blocking anyway (the code waits so it can set /CS=1 at the end), so one gains speed from DMA but not the ability for the CPU to do something else.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 15, 2026, 11:18:02 am
Moving to next stage of this, I did a quick and dirty test writing to an 8 bit port, in software, with a /wr strobe generated on another pin.

Ignore the crappy waveform. I don't have properly grounded probes etc. But this shows a 20ns (2 CPU cycle times at 168MHz!) wide /wr and an overall cycle time of 66ns which is 15MB/sec. A 1 cycle time (7ns) wide /wr is also possible, which surprised me. 20ns/div:

(https://peter-ftp.co.uk/screenshots/202606152520640813.jpg)

Data setup and hold times are tons. I am amazed how fast one can waggle a GPIO!

Here is the code - just a quick hack to output bytes on PD8-PD15 and /wr on PD2

Code:
#define WR_PIN   (1u << 2)   // /WR on PD2
#define WR_PIN_POS  2u

// Configure PD2 (/WR) as push-pull output, very-high speed, no pull.
// Call once before strobing.
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIODEN;
    (void)RCC->AHB1ENR;                          // let the clock settle

    // MODER PD2 = 01 (output)
    GPIOD->MODER   = (GPIOD->MODER   & ~(3u << (WR_PIN_POS * 2)))
                                     |  (1u << (WR_PIN_POS * 2));
    // OTYPER PD2 = 0 (push-pull)
    GPIOD->OTYPER &= ~(1u << WR_PIN_POS);
    // OSPEEDR PD2 = 11 (very high speed)
    GPIOD->OSPEEDR = (GPIOD->OSPEEDR & ~(3u << (WR_PIN_POS * 2)))
                                     |  (3u << (WR_PIN_POS * 2));
    // PUPDR PD2 = 00 (no pull)
    GPIOD->PUPDR  &= ~(3u << (WR_PIN_POS * 2));

    // Data bus PD8-PD15: force output, push-pull, VERY-HIGH speed, no pull,
    // so the data lines slew as fast as /WR (else /WR falls before data
    // has finished changing). One register pair covers all 8 bits.
    GPIOD->MODER   = (GPIOD->MODER   & 0x0000FFFFu) | 0x55550000u; // PD8-15 = 01 output
    GPIOD->OTYPER &= ~(0xFF00u);                                   // PD8-15 push-pull
    GPIOD->OSPEEDR = (GPIOD->OSPEEDR & 0x0000FFFFu) | 0xFFFF0000u; // PD8-15 = 11 very-high
    GPIOD->PUPDR   = (GPIOD->PUPDR   & 0x0000FFFFu);               // PD8-15 = 00 no pull

    // Idle /WR high (de-asserted)
    GPIOD->BSRR = WR_PIN;

// Example: 256-byte alternating buffer.

    static uint8_t buf[256];
    for (int i = 0; i < 256; i++)
        buf[i] = (i & 1) ? 0xAAu : 0x55u;

    while(true){

    volatile uint16_t *odr  = (volatile uint16_t *)&GPIOD->ODR;  // 16-bit low half
    volatile uint32_t *bsrr = &GPIOD->BSRR;
    const uint32_t wr_high  = WR_PIN;

    uint8_t       *p   = buf;
    const uint8_t *end = buf + sizeof(buf);

    __asm volatile (
    "   ldrb   r3, [%[p]], #1    \n"   // preload first byte
    "   lsl    r3, r3, #8        \n"   // position onto PD8-PD15; /WR(bit2)=0
    "1:                          \n"
    "   strh   r3, [%[odr]]      \n"   // drive bus + /WR LOW (falling edge)
    "   nop                      \n"   // SETUP: data valid + /WR low, before the
    "                            \n"   //   rising (latch) edge. Add more NOPs here
    "                            \n"   //   for more setup. This window = TDST.
    "   cmp    %[p], %[end]      \n"   // (in /WR-low window) done?
    "   beq    2f                \n"   // if last byte, skip the prefetch
    "   ldrb   r3, [%[p]], #1    \n"   // (in /WR-low window) load NEXT byte
    "   lsl    r3, r3, #8        \n"   // (in /WR-low window) position NEXT byte
    "   str    %[wrh], [%[bsrr]] \n"   // /WR HIGH -> latch this byte
    "   b      1b                \n"
    "2:                          \n"
    "   nop                      \n"   // pad TWRL for the final byte (no prefetch)
    "   nop                      \n"
    "   str    %[wrh], [%[bsrr]] \n"   // /WR HIGH -> latch final byte
    : [p] "+r" (p)
    : [odr] "r" (odr), [bsrr] "r" (bsrr),
      [wrh] "r" (wr_high), [end] "r" (end)
    : "r3", "cc", "memory"
    );
    }


Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: wek on June 16, 2026, 09:58:29 am
~20ns means 3 cycles.

When I did experiments back then when starting with STM32F4xx, I was unable to generate a 2-cycle long pulse by writing from processor to GPIO; only 1-cycle and 3-cycle. I believe this is due to the way how the GPIO's AHB bus is arbitrated within the bus matrix, but I am not an insider so there may be some other mechanism involved, e.g. within the processor's S port.

[As a curiosity, in those early days I experimented also with bit-banging GPIO on a NXP LPC1786 (Cortex-M3 but it should be mostly identical to M4 in this regard). At one point, the data hold delay to clock edge disappeared (i.e. instructions were write_for_clock_edge->write_for_data_change, and both changes appeared at the GPIO at the same time). Clock was written through bit-banding to the same GPIO port as data. My conclusion was, that as - contrary to STM32 - in LPC17xx GPIO is in the 0x2xxx'xxxx area, which by default is Normal (rather than Device), the processor was allowed to collapse both writes to one, and it did so. As at the same time we decided to select the more capable STM32, I've never gotten to proving this hypothesis (by setting the GPIO as Device using MPU).]

With DMA, things will get different again, the DMA will impose enough delay (IMO min 3 cycles) so this thing won't be observable. OTOH, the DMA's delays/latencies are relatively hard to calculate with absolute confidence, especially if other channels in DMA are used simultaneously.

Btw. you don't need to run benchmarks at maximum clock frequency, if you keep everything else set in the same way (e.g. FLASH latency): this is an extensively synchronous machine.

JW
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 16, 2026, 11:27:58 am
Quote
only 1-cycle and 3-cycle

That's funny; I now realise I found the same when packing the /WR pulse width with NOPs :)

I am now running this somewhat weird code, to get data out on PD8-PD15, with /WR on PD1, and no other PD bits (i.e. PD0, PD2-7) affected. Due to the BSRR weirdness there is a 256x uint32_t lookup table. It is in FLASH because I don't want to waste 1k of RTOS stack space. Generated by Claude ;)

Code:
// ─────────────────────────────────────────────────────────────────────────────
// Compile-time BSRR table for LCD_PAR_WR - const, so it lives in FLASH (.rodata),
// not RAM. No runtime fill, no 256x4 bytes of RAM used.
//
// The 256 entries are generated by the preprocessor; each is a constant
// expression the compiler folds at build time:
//   entry b = (b<<8)            data 1-bits -> SET   (PD8-PD15, low half)
//           | ((~b & 0xFF)<<24) data 0-bits -> RESET (PD8-PD15, high half)
//           | (WR_PIN<<16)      /WR (PD1)   -> RESET (low)
// Pin-safe (only PD8-PD15 and PD1 ever appear). This method is used to make sure
// only PD1 is driven (the /WR).
// ─────────────────────────────────────────────────────────────────────────────

#define WR_PIN  (1u << 1)        // /WR on PD1

// One table entry for byte value b.
#define LCD_PAR_E(b)   ( ((uint32_t)(b) << 8)                 \
                       | ((uint32_t)((~(uint32_t)(b)) & 0xFFu) << 24) \
                       | ((uint32_t)WR_PIN << 16) )

// 16 consecutive entries starting at n.
#define LCD_PAR_E16(n) \
    LCD_PAR_E((n)+0),  LCD_PAR_E((n)+1),  LCD_PAR_E((n)+2),  LCD_PAR_E((n)+3),  \
    LCD_PAR_E((n)+4),  LCD_PAR_E((n)+5),  LCD_PAR_E((n)+6),  LCD_PAR_E((n)+7),  \
    LCD_PAR_E((n)+8),  LCD_PAR_E((n)+9),  LCD_PAR_E((n)+10), LCD_PAR_E((n)+11), \
    LCD_PAR_E((n)+12), LCD_PAR_E((n)+13), LCD_PAR_E((n)+14), LCD_PAR_E((n)+15)

// Full 256-entry table, built entirely at compile time -> flash, not RAM.
static const uint32_t lcd_par_bsrr[256] = {
    LCD_PAR_E16(0),   LCD_PAR_E16(16),  LCD_PAR_E16(32),  LCD_PAR_E16(48),
    LCD_PAR_E16(64),  LCD_PAR_E16(80),  LCD_PAR_E16(96),  LCD_PAR_E16(112),
    LCD_PAR_E16(128), LCD_PAR_E16(144), LCD_PAR_E16(160), LCD_PAR_E16(176),
    LCD_PAR_E16(192), LCD_PAR_E16(208), LCD_PAR_E16(224), LCD_PAR_E16(240)
};

// Strobe ONE byte: precomputed data+/WR-low word from the (flash) table, NOP for
// the /WR-low window (unchanged width), then /WR high to latch.
#define LCD_PAR_WR(byte) do {                                  \
        GPIOD->BSRR = lcd_par_bsrr[(uint8_t)(byte)];           \
        __asm volatile ("nop");                                \
        GPIOD->BSRR = WR_PIN;                                  \
    } while (0)



// Pipelined buffer strobe: emit a contiguous run of DATA bytes, /WR on PD1, with
// the next byte's table load issued during the current byte's /WR-low window so
// the flash latency is hidden. _p and _end are byte pointers (start, one-past-end).
#define LCD_PAR_WR_BUF(_p, _end) do {                              \
uint8_t       *_pp  = (_p);                                \
const uint8_t *_ee  = (_end);                              \
uint32_t _t1, _t2;                                         \
__asm volatile (                                           \
"   ldrb  %[t1], [%[p]], #1              \n"               \
"   ldr   %[t2], [%[lut], %[t1], lsl #2] \n"               \
"1: str   %[t2], [%[bsrr]]               \n"               \
"   cmp   %[p], %[end]                   \n"               \
"   beq   2f                             \n"               \
"   ldrb  %[t1], [%[p]], #1              \n"               \
"   ldr   %[t2], [%[lut], %[t1], lsl #2] \n"               \
"   str   %[wrhi], [%[bsrr]]             \n"               \
"   b     1b                             \n"               \
"2: nop                                  \n"               \
"   str   %[wrhi], [%[bsrr]]             \n"               \
: [p] "+r" (_pp), [t1] "=&r" (_t1), [t2] "=&r" (_t2)       \
: [lut] "r" (lcd_par_bsrr), [bsrr] "r" (&GPIOD->BSRR),     \
  [wrhi] "r" ((uint32_t)WR_PIN), [end] "r" (_ee)           \
: "cc", "memory" );                                        \
} while (0)


I am getting a 70ns cycle to the LCD which is ~14MB/sec. The ST7789 min is 66ns. The only faster thing would be going to 16 bit parallel which means the QFP144 package and more hassle...

AIUI, no way to use DMA for 8 bit parallel unless one does one of these

- make sure the other half of the 16 bit port is all inputs, so writing to them has no effect
- prepare the data table to be 32 bit wide, containing BSRR values (in most cases this will be too wasteful if the table data is actually variable)


Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: wek on June 16, 2026, 01:53:11 pm
Quote
AIUI, no way to use DMA for 8 bit parallel unless one does one of these

IMO, PD8-PD15 should be viable, by setting the write side of the DMA to byte width and GPIO_ODR+1 as its address. 12-cycle (71.4ns @ 168MHz) timer-triggered writes might be viable, depending on other loads.

Oh, and FSMC *is* usable in the 100-pin package, as long as you don't need the lower address pins - and you don't as the LCD controller has only one address pin. IMO that is what you should aim for.

JW
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 16, 2026, 02:40:45 pm
Interesting. My digging online found that this does not work. But once I get the bit-banged parallel mode (PD8-PD15, with PD1 as the /WR) working, I will try the 8 bit wide DMA mode.

The plan would be:

LCD parallel data via DMA2 Stream 1, Channel 7, triggered by TIM8_UP (the update event). /WR is generated as PWM on PC6 = TIM8_CH1 (output only — no DMA on that channel). TIM8 counts at 168MHz, so the /WR pulse has ~7ns resolution, and the DMA writes the data byte to GPIO once per timer period.

If it works it would be ideal. I already have DMA running on SPI1 at 42MHz (and yes — on the unused JTAG pins we discussed a while back, which does work!), but I want more bandwidth. Bit-banging gets me the speed, but at 100% CPU; the appeal of the timer+DMA scheme is the same rate with the core free.

One of the things to test will be to do with DMA latency. If this is too bad then the data may get updated during the next /WR. The timer will just sit there generating the /WR pulses and we are relying on the DMA not getting too slow. I am aiming for something not much worse than 66ns cycle time. I've had bitbanged code (above) achieving that.
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: SiliconWizard on June 16, 2026, 05:41:22 pm
I haven't completely followed your last tests, but I would probably rather generate the /WR signal using a GPIO as well driven by DMA, just like data, so the sequence would be fully controlled.
So instead of say, driving 8 data bits using DMA and /WR separately, I would drive 9 bits, including /WR, using DMA, and prepare the data accordingly, so that the sequence is fully predictable.
Downside is that it'd require 16-bit access and thus wasting 7 bits out of 16 in your data buffer just for the /WR bit.
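
A minimal sketch of how such a buffer might be prepared, under stated assumptions: data on bits 15:8 and /WR on bit 1 of the same port (both positions are hypothetical), and two halfwords per data byte so that /WR gets both a falling and a rising edge. Since whole halfwords land in ODR, every other pin on that port would be written as 0, so this only suits a port with no unrelated outputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WR_BIT      (1u << 1)   /* assumed /WR position within the port */
#define DATA_SHIFT  8u          /* assumed: data bus on bits 15:8       */

/* Expand n data bytes into 2*n ODR halfwords for a halfword-wide DMA.
 * Returns the number of halfwords written to dst. */
static size_t build_odr_stream(const uint8_t *src, size_t n,
                               uint16_t *dst, size_t dst_len)
{
    assert(dst_len >= 2u * n);
    for (size_t i = 0; i < n; i++) {
        uint16_t data = (uint16_t)((uint16_t)src[i] << DATA_SHIFT);
        dst[2u * i]      = data;                      /* data valid, /WR low   */
        dst[2u * i + 1u] = (uint16_t)(data | WR_BIT); /* same data, /WR high:
                                                         rising edge latches   */
    }
    return 2u * n;
}
```

The cost is 4 bytes of buffer per data byte plus the time to build it, and the DMA has to run at twice the byte rate.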

Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 16, 2026, 06:01:51 pm
Sure; that is one way. The problem is that in my project preparing such a buffer would take quite some time. It's funny... 99% of the time this 168MHz chip is incredibly fast, but bit-banging this is really on the limit.

Currently I am achieving 82ns cycle time with this Claude-generated asm code (I can't get my head around arm32 asm):

Code:
// Pipelined buffer strobe, /WR on PC8 (GPIOC), data on PD8-PD15 (GPIOD).
//
// Measured: /WR-high ~22ns is the limiter; /WR-low has spare time. So all the
// loop housekeeping that can go in the /WR-low window is kept there (where there
// is margin), leaving the /WR-high window holding only what is unavoidable: the
// data store for the next byte and the single loop branch.
//
// Per byte, the /WR-low window contains: cmp p,end (done?), the branch test, and
// the next-byte table load (ldrb + ldr). The /WR-high window contains only the
// next byte's data store and the loop branch. This is the tightest arrangement
// for a one-byte-per-iteration loop; shortening further needs unrolling (to drop
// the per-byte branch) or the hardware /WR via TIM8+DMA.
//
// Per byte:
//   str data -> GPIOD        drive bus (start of /WR-high tail of prev byte)
//   str wlo  -> GPIOC        /WR low
//   cmp p,end                done?      (in /WR-low, which has margin)
//   beq last                            (in /WR-low)
//   ldrb/ldr next word                  (in /WR-low - pipelines next byte)
//   str whi  -> GPIOC        /WR high (latch)
//   b loop
//
// _p and _end are byte pointers (start, one-past-end). Only called with end > p
// (C wrapper guards size != 0). Pin-safe: data -> PD8-PD15 only; /WR -> PC8 only.
#define LCD_PAR_WR_BUF(_p, _end) do {                                  \
uint8_t       *_pp = (_p);                                     \
const uint8_t *_ee = (_end);                                   \
uint32_t _t1, _t2;                                             \
__asm volatile (                                              \
"   ldrb  %[t1], [%[p]], #1               \n"  /* byte0           */ \
"   ldr   %[t2], [%[lut], %[t1], lsl #2]  \n"  /* word0           */ \
"1: str   %[t2],  [%[bd]]                 \n"  /* data -> GPIOD   */ \
"   str   %[wlo], [%[bc]]                 \n"  /* /WR low         */ \
"   cmp   %[p], %[end]                    \n"  /* done? (in low)  */ \
"   beq   2f                              \n"  /* last? (in low)  */ \
"   ldrb  %[t1], [%[p]], #1               \n"  /* next byte(in low)*/ \
"   ldr   %[t2], [%[lut], %[t1], lsl #2]  \n"  /* next word(in low)*/ \
"   str   %[whi], [%[bc]]                 \n"  /* /WR high (latch)*/ \
"   b     1b                              \n"                  \
/* epilogue: last byte. Pad its /WR-low (only cmp+beq wide) to the */ \
/* regular width, then raise /WR. NOP count tuned on the scope.     */ \
"2: nop                                   \n"                  \
"   nop                                   \n"                  \
"   nop                                   \n"                  \
"   nop                                   \n"                  \
"   nop                                   \n"                  \
"   str   %[whi], [%[bc]]                 \n"  /* /WR high (latch)*/ \
: [p] "+r" (_pp), [t1] "=&r" (_t1), [t2] "=&r" (_t2)           \
: [lut] "r" (lcd_par_bsrr),                                    \
  [bd]  "r" (&GPIOD->BSRR), [bc] "r" (&GPIOC->BSRR),           \
  [wlo] "r" (WR_C_LOW), [whi] "r" (WR_C_HIGH),                 \
  [end] "r" (_ee)                                              \
: "cc", "memory" );                                           \
} while (0)

I got 66ns cycle time (above) by having the /WR strobe on the same Port D as the 8 bit data bus. That enables a slight code saving on the BSRR manipulation. Now the /WR is on PC8 which is less efficient because you are doing BSRR on two different ports. PC8 was chosen because it is one of the few pins on which a PWM output can be output (within limits of which timers I am already using, etc).

If anyone can speed up this asm I would be super interested... Well there is the 8x etc unrolling of the loop as an option.

I will wire this up to make sure it actually works (the bloody LCD swaps over the /WR and the D/C pins between SPI and parallel modes, which I will address in the finished board with an LVC157 mux so the mode remains GPIO-selectable) and then move to the DMA experiment.

I actually wonder how the hell DMA can work:

You set up a timer to do PWM, coming out on PC8, and when the timer overflows, it triggers DMA to load in the next byte. So you have to load the 1st byte manually and start the timer. The number of pulses configured will be the buffer size. Is this configurable? I know the # of DMA transactions is configurable but that's not the same thing. There is no config for the timer to stop generating the pulses after N pulses. A DMA NDTR=0 interrupt will likely come too late to stop the pulses. How do people do it?

Digging around, the only method seems to be to terminate the timer via a second DMA, not an interrupt. You can have the timer's update event drive two DMA streams: one writes pixel data to the GPIO ODR, with the burst length set by NDTR, and a second one stops the timer. When the data DMA finishes, instead of an interrupt, you arrange for the last transfer to also trigger a write that stops the timer (e.g. a DMA writing to TIM8->CR1 to clear CEN, sequenced so it lands after the final strobe). A DMA-driven stop has deterministic timing: it happens a fixed number of cycles after the last data transfer, not whenever the CPU gets round to the ISR. This removes the interrupt-latency uncertainty entirely. Bloody hell!

There is a repetition counter (RCR) on TIM8 which is 8 bits only, so you can do up to 256 bytes per burst.

As an update in case anybody else tries this, this is TIM8 just doing PWM at 71ns cycle (ST7789 is 66ns min). It does 10 pulses and that's it. Perfect so far :)

Code: [Select]
// ── TIM8 PWM /WR burst test: 10 pulses on PC8 (TIM8_CH3), then self-stop ──
//
// Bench check for the future DMA scheme: configure TIM8 CH3 (PC8) as PWM, use
// one-pulse mode + the repetition counter (RCR) so the timer emits exactly 10
// pulses and then halts itself in hardware (CEN cleared by OPM). No DMA here -
// this just proves the pulse count and timing on the scope.
//
// TIM8 counter clock = 168MHz (APB2 timer clock). ARR sets the period, CCR3 the
// /WR low/high split. RCR = N-1 makes one-pulse-mode span N counter cycles, so
// OPM stops after N pulses.
//
// Scope PC8: expect exactly 10 PWM pulses then idle, with the period and duty
// set below. Re-run by re-arming (set CEN again).
//
// Adjust:
//   ARR  = period-1   (period in 168MHz ticks; 11 -> 12 ticks -> 71.4ns)
//   CCR3 = high-time portion (PWM mode 1: output high while CNT<CCR3)
//   RCR  = pulses-1   (10 pulses -> 9)

// Clock the timer and the GPIO port
__HAL_RCC_TIM8_CLK_ENABLE();
__HAL_RCC_GPIOC_CLK_ENABLE();

// PC8 -> AF3 (TIM8_CH3), push-pull, very-high speed, no pull
GPIOC->MODER   = (GPIOC->MODER   & ~(3u << (8*2))) | (2u << (8*2));   // AF mode
GPIOC->OTYPER &= ~(1u << 8);                                          // push-pull
GPIOC->OSPEEDR = (GPIOC->OSPEEDR & ~(3u << (8*2))) | (3u << (8*2));   // very-high
GPIOC->PUPDR  &= ~(3u << (8*2));                                      // no pull
GPIOC->AFR[1]  = (GPIOC->AFR[1] & ~(0xFu << ((8-8)*4))) | (3u << ((8-8)*4)); // AF3

// Timer base: period and prescaler
TIM8->PSC = 0;          // no prescale -> 168MHz counter clock
TIM8->ARR = 11;         // 12 ticks -> 71.4ns period
TIM8->CCR3 = 6;         // high while CNT<6, low while 6..11 -> ~half/half
TIM8->RCR = 9;          // 10 pulses (N-1)

// One-pulse mode: timer stops itself after the RCR burst
TIM8->CR1 |= TIM_CR1_OPM;

// CH3 = PWM mode 1, output enable
TIM8->CCMR2 &= ~TIM_CCMR2_OC3M;
TIM8->CCMR2 |= (6u << TIM_CCMR2_OC3M_Pos);   // PWM mode 1
TIM8->CCMR2 |= TIM_CCMR2_OC3PE;              // preload enable
TIM8->CCER  |= TIM_CCER_CC3E;                // CH3 output enable

// Advanced-timer outputs need MOE (main output enable) in BDTR
TIM8->BDTR |= TIM_BDTR_MOE;

// Force an update to load PSC/ARR/RCR/CCR3 shadow registers
TIM8->EGR = TIM_EGR_UG;

// Start: generates 10 pulses then OPM clears CEN automatically
TIM8->CR1 |= TIM_CR1_CEN;

// (To repeat the burst: TIM8->EGR = TIM_EGR_UG; TIM8->CR1 |= TIM_CR1_CEN;)

and this is the PWM waveform (ignore the ringing - bad probe earth)

(https://peter-ftp.co.uk/screenshots/202606174920665609.jpg)
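
For anyone wanting to check the ARR numbers, here is a minimal host-side sketch of the arithmetic used in the comments above (period = ARR+1 ticks of the 168MHz timer clock). The function names are made up for illustration and nothing here touches the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Timer period in picoseconds for a given ARR (the counter runs ARR+1 ticks). */
uint64_t tim_period_ps(uint32_t arr, uint32_t clk_hz)
{
    assert(clk_hz > 0u);
    return ((uint64_t)arr + 1u) * 1000000000000ull / clk_hz;
}

/* Smallest ARR whose period is at least min_ps (tick count rounded up). */
uint32_t tim_arr_for_min_period(uint64_t min_ps, uint32_t clk_hz)
{
    assert(clk_hz > 0u && min_ps > 0u);
    uint64_t ticks = (min_ps * clk_hz + 999999999999ull) / 1000000000000ull;
    return (uint32_t)(ticks - 1u);
}
```

With a 66ns minimum cycle and a 168MHz timer clock this rounds up to 12 ticks, i.e. ARR=11, which is the 71.4ns setting used above.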
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: langwadt on June 16, 2026, 08:34:31 pm
you might be able to use a different timer to gate the main timer (gated mode)
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 16, 2026, 08:58:54 pm
OK a good idea!

TIM8 can be gated by TIM1, TIM2, TIM4, or TIM5.

TIM2 and TIM5 I have spare. I would like to reserve TIM5 (32 bit) for something else. But TIM2 has a mystery: it may be used by the ethernet subsystem, as a timestamp comparator, for precision time protocol (PTP). Not confirmed, but the 32F417 doc talks about it in the PTP context. See
https://community.st.com/s/question/0D53W00001QgDboSAF/32f417-is-tim2-used-with-ethernet-at-all
Nobody seems to know for 100% sure...
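
A rough sketch of what the slave side of the gating would look like, in case anyone goes that way. It only builds the TIM8 SMCR value (SMS = gated mode, TS = which ITR input). The enum and function names are hypothetical, the ITR numbering is assumed to follow the order of the list above, and the master timer's TRGO setup (to produce a gate of the right length) is not shown. Untested on hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions in TIMx_SMCR, written out so this compiles on a host.
   Real code would use TIM_SMCR_SMS / TIM_SMCR_TS from the CMSIS header. */
#define SMCR_SMS_POS   0u
#define SMCR_TS_POS    4u
#define SMCR_SMS_GATED 5u   /* SMS = 101: counter runs while the trigger is high */

/* TIM8 internal trigger inputs (assumed order: ITR0..ITR3). */
enum tim8_itr {
    TIM8_ITR_TIM1 = 0,
    TIM8_ITR_TIM2 = 1,
    TIM8_ITR_TIM4 = 2,
    TIM8_ITR_TIM5 = 3
};

/* Value for TIM8->SMCR selecting gated mode from the given master timer. */
uint32_t tim8_smcr_gated(enum tim8_itr master)
{
    assert((unsigned)master <= 3u);
    return ((uint32_t)master << SMCR_TS_POS) | (SMCR_SMS_GATED << SMCR_SMS_POS);
}
```

So gating from TIM2 would be TIM8->SMCR = tim8_smcr_gated(TIM8_ITR_TIM2), with TIM2 set up to hold its trigger output high for the length of the burst.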

But actually using the 8-bit RCR for 1-256 bytes is still very efficient; you just have a little bit of CPU time at the end of the block if >256.
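
To put a number on the "little bit of CPU time", a minimal sketch of the chunking arithmetic: how long each burst is and what goes into RCR. The names are made up for illustration; the 256 limit is the 8-bit RCR (RCR holds pulses-1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RCR_MAX_PULSES 256u   /* 8-bit RCR: 1..256 pulses per burst */

/* Length of the next burst when `remaining` bytes are still to be sent. */
uint32_t burst_len(size_t remaining)
{
    return remaining > RCR_MAX_PULSES ? RCR_MAX_PULSES : (uint32_t)remaining;
}

/* RCR value for a burst of n pulses. */
uint8_t burst_rcr(uint32_t n)
{
    assert(n >= 1u && n <= RCR_MAX_PULSES);
    return (uint8_t)(n - 1u);
}

/* Number of bursts (timer restarts) needed for a whole transfer. */
size_t burst_count(size_t size)
{
    return (size + RCR_MAX_PULSES - 1u) / RCR_MAX_PULSES;
}
```

A 600 byte transfer is three bursts (256, 256, 88), so the CPU only has to step in three times.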

I've also looked at the DMA -> BSRR route (which will obviously work - the above asm code uses it too - but needs a 32 bit wide buffer in which every 4th byte is the actual data) and an asm loop to write every 4th byte, 256 times, is not too bad. Even better if you could have two buffers and swap them.
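
A sketch of how the 32 bit BSRR buffer could be laid out so that only one byte per word needs rewriting. It assumes the data bus sits on pins 8..15 of the port, and it relies on the BSRR rule that a set bit takes priority over the matching reset bit, so the reset field can be pre-filled once. The names are hypothetical, and bsrr_apply() is only a host-side model of the port, not a real register access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DATA_SHIFT     8u                            /* assumption: bus on pins 8..15 */
#define DATA_RESET_ALL (0xFFu << (DATA_SHIFT + 16u)) /* BR8..BR15 */

/* Pre-fill once: every word resets all eight data pins. */
void bsrr_prefill(uint32_t *buf, size_t n)
{
    assert(buf != NULL);
    for (size_t i = 0; i < n; i++)
        buf[i] = DATA_RESET_ALL;
}

/* Per pixel byte: only the set field (bits 8..15) carries the data. */
void bsrr_put(uint32_t *buf, size_t i, uint8_t b)
{
    buf[i] = DATA_RESET_ALL | ((uint32_t)b << DATA_SHIFT);
}

/* Host-side model of one BSRR write: resets applied first, sets win. */
uint16_t bsrr_apply(uint16_t odr, uint32_t w)
{
    odr &= (uint16_t)~(w >> 16);
    odr |= (uint16_t)(w & 0xFFFFu);
    return odr;
}
```

The lower eight pins never appear in either field, so they are left alone whatever the data byte is.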
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 18, 2026, 08:02:56 pm
I am happy to say that it does work! 32F4 DMA can write bytewide to the upper half of a 16 bit port, without messing up pins on the lower half programmed to be outputs.
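
The address trick behind the bytewide write, as a host-runnable sketch: on a little-endian core (which the Cortex-M4 is) the byte at ODR+1 is bits 8..15, so a byte store there leaves bits 0..7 alone. fake_odr is a stand-in variable, not the real GPIO register, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for GPIOx->ODR so this runs on a host (assumption). */
volatile uint32_t fake_odr;

/* True on a little-endian machine, where byte 1 of a word is bits 8..15. */
int is_little_endian(void)
{
    const uint16_t probe = 1u;
    return *(const uint8_t *)&probe == 1u;
}

/* Byte-wide store to the upper half of the 16 bit port (bits 8..15). */
void write_high_byte(volatile uint32_t *odr, uint8_t b)
{
    assert(is_little_endian());
    *((volatile uint8_t *)odr + 1) = b;
}
```

This is the same address the DMA is given as its peripheral address: ODR+1, with byte-wide peripheral size.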

Also generating a strobe with PWM works, and so does triggering DMA to load the next byte onto the bus.

It is pretty fiddly to get it working right though.

https://vimeo.com/1202420391

Init code:

Code: [Select]
/*
═══════════════════════════════════════════════════════════════════════════════
CC4-triggered DMA: load the next byte AFTER /WR has gone high, so it settles
during the high phase + next low phase and is rock-solid before the next latch.
═══════════════════════════════════════════════════════════════════════════════

PRINCIPLE
  CH3 (CCR3) generates /WR via PWM mode 2: /WR low CNT 0..CCR3-1, high CNT CCR3..ARR.
  Rising edge (the LATCH) is at CNT = CCR3.
  CH4 (CCR4) is a DMA-TRIGGER ONLY channel (no output pin). Set CCR4 a few ticks
  AFTER CCR3, so the CC4 event - and thus the DMA load of the next byte - happens
  while /WR is already HIGH, a few cycles past the latch edge.
  The byte loaded at CNT=CCR4 then has the rest of this period plus the next
  period's low phase to settle before the NEXT rising edge at CNT=CCR3. No race.

  CC-compare DMA requests are NOT gated by the repetition counter (unlike UPDATE),
  so CC4 fires every period -> one byte per period, exactly what we want, while RCR
  still bounds the burst length via OPM.

TIMING (ARR=11, 71.4ns period, 168MHz):
  CCR3 = 6  -> /WR low CNT 0..5 (~36ns), high CNT 6..11 (~36ns), latch at CNT=6.
  CCR4 = 8  -> DMA loads next byte at CNT=8 (~12ns after the rising edge).
  Next latch is at CNT=6 of the next period = (12-8)+6 = 10 ticks ~59.5ns after the CC4 event.
  So each byte gets up to ~59.5ns of settle (less the DMA latency), loaded well clear of any latch edge.

DMA MAPPING: TIM8_CH4 -> DMA2 Stream7, Channel 7.
  (TIM8_UP was Stream1 Ch7; CC4 is a different stream. Stream7/Ch7 = TIM8_CH4/TRIG/COM.)

*/

// TIM8 CH3 (PC8) = /WR via PWM. CH4 = internal DMA-trigger compare, set a
// few ticks AFTER the /WR rising edge so the DMA loads the next byte while
// /WR is high - it then settles ~54ns before the next latch. CC4 compare
// DMA fires every period (not gated by RCR), so one byte per pulse; RCR+OPM
// bound the burst length. PC8 stays a GPIO output here (bit-bang init needs
// it); the DAT burst flips PC8 to AF3 only for the pixel stream.

__HAL_RCC_TIM8_CLK_ENABLE();
__HAL_RCC_DMA2_CLK_ENABLE();

// Pre-select AF3 for PC8 in the mux; leave MODER as OUTPUT (set in step 3).
GPIOC->AFR[1] = (GPIOC->AFR[1] & ~(0xFu << ((WR_PIN_POS-8)*4)))
  | (3u << ((WR_PIN_POS-8)*4));                                        // AF3

// Timer base
TIM8->PSC  = 0;
TIM8->ARR  = 12;       // ARR=12 is 13 ticks ≈ 77ns
TIM8->RCR  = 0;        // set per-burst

TIM8->CR1 |= TIM_CR1_OPM;                     // one-pulse: self-stop after burst

// CH3 = /WR output, PWM mode 2 (low-first). Latch (rising edge) at CNT=CCR3.
TIM8->CCR3 = 8;
TIM8->CCMR2 &= ~TIM_CCMR2_OC3M;
TIM8->CCMR2 |= (7u << TIM_CCMR2_OC3M_Pos);    // PWM mode 2
TIM8->CCMR2 |= TIM_CCMR2_OC3PE;               // preload
TIM8->CCER  |= TIM_CCER_CC3E;                 // CH3 output enable (drives PC8 in AF)
TIM8->BDTR  |= TIM_BDTR_MOE;                  // main output enable

// CH4 = DMA-trigger only (no output pin). Compare at CCR4 generates the CC4
// event -> CC4 DMA request, loading the next byte while /WR is high.
TIM8->CCR4 = 8;                               // equal to CCR3 as set here; raise it to move the load later past the latch
TIM8->CCMR2 &= ~TIM_CCMR2_OC4M;               // OC4M=0 (frozen) - still sets CC4IF on match
// (no CC4E - we don't need an output pin, just the compare event/flag)

// DMA requests from CC4 (NOT from update). UDE off; CC4DE on per-burst.
TIM8->DIER &= ~TIM_DIER_UDE;
// CC4DE toggled per-burst in LCD_Transmit_buf_DAT (kept off here).

// DMA2 Stream7 Ch7 = TIM8_CH4. byte->byte, mem-increment, mem->periph, PAR=ODR+1.
DMA2_Stream7->CR = 0;
while (DMA2_Stream7->CR & DMA_SxCR_EN) { }
DMA2_Stream7->PAR = (uint32_t)(&GPIOD->ODR) + 1u;     // high byte -> PD8-15
DMA2_Stream7->CR =
  (7u << DMA_SxCR_CHSEL_Pos)   // channel 7 = TIM8_CH4
| (0u << DMA_SxCR_MSIZE_Pos)   // memory size = byte
| (0u << DMA_SxCR_PSIZE_Pos)   // periph size = byte
| DMA_SxCR_MINC                // increment memory
| (1u << DMA_SxCR_DIR_Pos)     // memory -> peripheral
| (2u << DMA_SxCR_PL_Pos);     // priority high

TIM8->EGR = TIM_EGR_UG;            // load shadow regs (timer idle)
GPIOC->BSRR = WR_C_HIGH;           // PC8 = 1 (idle high), stays GPIO output

To output a buffer:

Code: [Select]

// TIM8 + DMA hardware /WR. CPU is idle during each burst.
//
// PC8 is a GPIO output (idle high) on entry. Flip it to AF3 (TIM8_CH3) for
// the pixel burst, then back to GPIO output afterwards so the next column's
// address setup can bit-bang /WR.
//
// Stream the buffer in <=256-byte bursts (RCR is 8-bit). For each burst:
// prime byte 0 onto PD8-15 (pulse 0 latches it), point DMA at bytes 1..n-1,
// then start TIM8. The CC4 compare event loads each subsequent byte while
// /WR is high so it settles ~54ns before the next latch.
//
// TWO RACE FIXES for the rare wrong/short filled-rectangle line:
//   1. __DSB() after EACH prime write (inside the loop) forces byte 0 onto
//      the bus before the timer can latch it - covers byte 0 of every
//      <=256 chunk, including the 2nd+ chunks of a >256 transfer.
//   2. After the timer self-stops (CEN clear), WAIT for the DMA transfer-
//      complete flag (TCIF7) BEFORE disabling the stream. The timer stopping
//      and the DMA draining its last byte are SEPARATE events; tearing the
//      stream down on CEN-clear alone can catch the DMA still draining ->
//      a whole line comes out short or wrong. This closes that race.
LCD_DC_GPIO_PORT->BSRR = LCD_DC_PIN;        // DAT: D/C=1
lcd_cs(0);

// PC8 -> AF3 (TIM8_CH3): the timer drives /WR for the DMA burst.
GPIOC->MODER = (GPIOC->MODER & ~(3u << (8*2))) | (2u << (8*2));   // PC8 = AF

uint32_t off = 0;
while (off < size)
{
    uint32_t n = size - off;
    if (n > 256u) n = 256u;
    const uint8_t *b = &outbuf[off];

    // CC4 DMA request OFF while we set up.
    TIM8->DIER &= ~TIM_DIER_CC4DE;

    // Prime byte 0, then a data barrier so the store lands before the
    // timer can latch it (covers byte 0 of THIS chunk).
    *((volatile uint8_t *)(&GPIOD->ODR) + 1) = b[0];
    __DSB();

    // Arm DMA for bytes 1..n-1 (none if n==1). CC4 in period k loads byte k+1.
    if (n > 1u)
    {
        DMA2_Stream7->CR &= ~DMA_SxCR_EN;
        while (DMA2_Stream7->CR & DMA_SxCR_EN) { }
        DMA2->HIFCR = (DMA_HIFCR_CTCIF7 | DMA_HIFCR_CHTIF7 |
                       DMA_HIFCR_CTEIF7 | DMA_HIFCR_CDMEIF7 | DMA_HIFCR_CFEIF7);
        DMA2_Stream7->M0AR = (uint32_t)&b[1];
        DMA2_Stream7->NDTR = n - 1u;
        DMA2_Stream7->CR  |= DMA_SxCR_EN;
    }

    // Load RCR (n pulses) via UG with CC4DE OFF, then clear UIF/CC4IF.
    TIM8->RCR = n - 1u;
    TIM8->EGR = TIM_EGR_UG;
    TIM8->SR  = 0;

    // Clear any stale CC4 flag FIRST, THEN enable CC4 DMA, then start.
    // (Enabling CC4DE while CC4IF is still set from the previous burst's
    //  last period fires an immediate spurious DMA request -> the whole
    //  burst shifts by one byte -> a wrong-colour run on that row.)
    TIM8->SR  = 0;
    if (n > 1u) TIM8->DIER |= TIM_DIER_CC4DE;
    TIM8->CR1 |= TIM_CR1_CEN;

    // Wait for OPM to self-stop the timer.
    while (TIM8->CR1 & TIM_CR1_CEN) { }

    // THEN wait for the DMA to finish draining its last byte before tearing
    // the stream down - otherwise the timer-stop vs DMA-drain race can
    // truncate the line. (Skip if n==1: no DMA was armed.)
    if (n > 1u)
    {
        while (!(DMA2->HISR & DMA_HISR_TCIF7)) { }
        DMA2_Stream7->CR &= ~DMA_SxCR_EN;
    }

    TIM8->DIER &= ~TIM_DIER_CC4DE;

    off += n;
}

// PC8 back to GPIO output, idle HIGH.
GPIOC->BSRR  = WR_C_HIGH;                                         // PC8 = 1
GPIOC->MODER = (GPIOC->MODER & ~(3u << (8*2))) | (1u << (8*2));   // PC8 = output
__DSB();                       // ensure PC8 is GPIO before any following bit-bang

lcd_cs(1);

It uses the timer's 8 bit repetition counter (RCR), so it is limited to 256 bytes per burst. However, the CPU overhead at the end of each burst is pretty small. The only way I know of to do bigger packets nonstop is - as mentioned above here - to use another timer to gate this one. Maybe other CPUs have a 16 bit counter too...
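
For anyone trying to follow the sequencing in the output code, here is a small host-side model of one burst: the CPU primes byte 0, each /WR rising edge latches whatever is on the bus, and each CC4 request then loads the next byte. The stale_request flag models the spurious request you get if CC4DE is enabled with CC4IF still set. This is only a sketch of the logic with made-up names, not timing-accurate and not the real driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model one burst of n /WR pulses. out[k] is what the LCD latches on pulse k. */
void simulate_burst(const uint8_t *in, size_t n, uint8_t *out, int stale_request)
{
    assert(in != NULL && out != NULL && n >= 1u);

    uint8_t bus  = in[0];     /* CPU primes byte 0 onto the data pins */
    size_t  next = 1u;        /* DMA memory pointer (M0AR = &in[1]) */
    size_t  ndtr = n - 1u;    /* DMA transfers remaining */

    if (stale_request && ndtr > 0u) {   /* spurious request before pulse 0 */
        bus = in[next++];
        ndtr--;
    }

    for (size_t k = 0; k < n; k++) {
        out[k] = bus;                   /* /WR rising edge latches the bus */
        if (ndtr > 0u) {                /* CC4 request: DMA loads next byte */
            bus = in[next++];
            ndtr--;
        }
    }
}
```

With a clean start the latched bytes equal the input. With the stale request, byte 0 is lost and the last byte is latched twice, which is the one-byte shift described in the comments above.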

Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: langwadt on June 18, 2026, 08:50:12 pm
Quote from peter-h: Maybe other CPUs have a 16 bit counter too...

e.g. the STM32G4
Title: Re: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe
Post by: peter-h on June 19, 2026, 06:42:19 am
I wonder whether gating a timer with another timer is actually going to work over long periods. It could be quite fiddly...

That counter is a much better way - and works.

So many people went up that road (bytewide output) and either gave up or just read everywhere that it is not possible.