Author Topic: 32F4xx: any way to output with DMA to non-FSMC ports and get a /WR strobe  (Read 1260 times)


Offline wek

  • Frequent Contributor
  • **
  • Posts: 591
  • Country: sk
AIUI, no way to use DMA for 8 bit parallel unless one does one of these
IMO, PD8-PD15 should be viable, by setting the write (peripheral) side of the DMA to byte width (8 bit) and GPIO_ODR+1 as its address. 12-cycle (71.4ns @ 168MHz) timer-triggered writes might be viable, depending on other loads.
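
The ODR+1 idea can be tried on a PC before touching hardware. This is only a host-side sketch: `fake_odr`, `write_upper_byte` and `demo_upper_byte` are made-up names, a plain 32-bit variable stands in for GPIOD->ODR, and a little-endian machine (like the Cortex-M4) is assumed. It says nothing about bus timing or whether the real GPIO block accepts byte accesses, which still has to be confirmed on the chip.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for GPIOD->ODR (hypothetical): low 16 bits are the port pins. */
static uint32_t fake_odr;

/* Byte-wide store to (register address + 1). On a little-endian core this
   touches bits 8..15 only, which is what a DMA stream with a byte-wide
   peripheral side and PAR = &ODR + 1 would write. */
static void write_upper_byte(volatile uint32_t *odr, uint8_t value)
{
    volatile uint8_t *lane = (volatile uint8_t *)odr + 1;
    *lane = value;
}

/* Loads the fake register, performs one byte write, returns the result. */
static uint32_t demo_upper_byte(uint32_t initial, uint8_t value)
{
    const uint16_t probe = 1;
    assert(*(const uint8_t *)&probe == 1);  /* sketch assumes little-endian */

    fake_odr = initial;
    write_upper_byte(&fake_odr, value);
    return fake_odr;
}
```

Starting from 0x00A5 and writing 0x3C gives 0x3CA5: the low byte (PD0-PD7 in the analogy) is left alone.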

Oh, and FSMC *is* usable in the 100-pin package, as long as you don't need the lower address pins - and you don't as the LCD controller has only one address pin. IMO that is what you should aim for.

JW
« Last Edit: June 16, 2026, 02:02:59 pm by wek »
 
The following users thanked this post: peter-h

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 5817
  • Country: gb
  • Doing electronics since the 1960s...
Interesting. My digging online suggested that this does not work. But once I get the bit-banged parallel mode (PD8-PD15, with PD1 as the /WR) working, I will try the 8 bit wide DMA mode.

The plan would be:

LCD parallel data via DMA2 Stream 1, Channel 7, triggered by TIM8_UP (the update event). /WR is generated as PWM on PC6 = TIM8_CH1 (output only, no DMA on that channel). TIM8 counts at 168MHz, so the /WR pulse has ~6ns resolution (one tick = 5.95ns), and the DMA writes the data byte to GPIO once per timer period.

If it works it would be ideal. I already have DMA running on SPI1 at 42MHz (and yes — on the unused JTAG pins we discussed a while back, which does work!), but I want more bandwidth. Bit-banging gets me the speed, but at 100% CPU; the appeal of the timer+DMA scheme is the same rate with the core free.

One of the things to test is DMA latency. If it is too long, the data may change while the next /WR pulse is in progress. The timer will just sit there generating the /WR pulses, so we are relying on the DMA keeping up. I am aiming for something not much worse than 66ns cycle time. I've had bit-banged code (above) achieving that.
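
For picking the timer period from a target cycle time, the arithmetic can be sketched as below. The helper names `arr_for_min_period` and `period_ps` are invented for this example; the only inputs taken from the thread are the 168MHz timer clock and the 66ns target.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest ARR whose period (ARR + 1 timer ticks) is at least min_ns. */
static uint32_t arr_for_min_period(uint32_t timer_hz, uint32_t min_ns)
{
    assert(timer_hz > 0u && min_ns > 0u);
    /* ticks = ceil(min_ns * timer_hz / 1e9), done in 64 bits */
    uint64_t ticks = ((uint64_t)min_ns * timer_hz + 999999999ull) / 1000000000ull;
    return (uint32_t)(ticks - 1u);
}

/* Resulting period in picoseconds for a given ARR (rounded down). */
static uint64_t period_ps(uint32_t timer_hz, uint32_t arr)
{
    assert(timer_hz > 0u);
    return ((uint64_t)(arr + 1u) * 1000000000000ull) / timer_hz;
}
```

With 168MHz and a 66ns minimum this gives ARR = 11, i.e. 12 ticks or about 71.4ns, because 11 ticks (65.5ns) would be just too short.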
« Last Edit: June 16, 2026, 04:11:15 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 17525
  • Country: fr
I haven't completely followed your last tests, but I would probably rather generate the /WR signal using a GPIO as well driven by DMA, just like data, so the sequence would be fully controlled.
So instead of say, driving 8 data bits using DMA and /WR separately, I would drive 9 bits, including /WR, using DMA, and prepare the data accordingly, so that the sequence is fully predictable.
Downside is that it'd require 16-bit access and thus wasting 7 bits out of 16 in your data buffer just for the /WR bit.
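
A rough sketch of what preparing such a buffer could look like, under assumptions that are not from the thread's final design: data on bits 8..15, /WR on bit 1 of the same port (the PD1 mentioned earlier), and two 16-bit writes per data byte (one with /WR low, one with /WR high). `expand_with_wr`, `WR_BIT` and `DATA_SHIFT` are hypothetical names. Note that a 16-bit write to ODR drives every pin of the port, so the other pins would have to be don't-cares or folded into the words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WR_BIT     (1u << 1)  /* assumption: /WR on bit 1 of the data port */
#define DATA_SHIFT 8u         /* data bus on bits 8..15 */

/* Expands n data bytes into 2*n half-words for a 16-bit wide DMA:
   first the data with /WR low, then the same data with /WR high
   (the rising edge latches). Returns the number of half-words written. */
static size_t expand_with_wr(const uint8_t *src, size_t n, uint16_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        uint16_t data = (uint16_t)((uint16_t)src[i] << DATA_SHIFT);
        dst[2 * i]     = data;                       /* /WR low  */
        dst[2 * i + 1] = (uint16_t)(data | WR_BIT);  /* /WR high */
    }
    return 2 * n;
}
```

The cost is visible straight away: four bytes of buffer and two DMA transfers per pixel byte, plus the CPU time for the expansion.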

 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 5817
  • Country: gb
  • Doing electronics since the 1960s...
Sure; that is one way. The problem is that in my project preparing such a buffer would take quite some time. It's funny... 99% of the time this 168MHz chip is incredibly fast, but bit-banging this is really on the limit.

Currently I am achieving 82ns cycle time with this Claude-generated asm code (I can't get my head around arm32 asm):

Code: [Select]
// Pipelined buffer strobe, /WR on PC8 (GPIOC), data on PD8-PD15 (GPIOD).
//
// Measured: /WR-high ~22ns is the limiter; /WR-low has spare time. So all the
// loop housekeeping that can go in the /WR-low window is kept there (where there
// is margin), leaving the /WR-high window holding only what is unavoidable: the
// data store for the next byte and the single loop branch.
//
// Per byte, the /WR-low window contains: cmp p,end (done?), the branch test, and
// the next-byte table load (ldrb + ldr). The /WR-high window contains only the
// next byte's data store and the loop branch. This is the tightest arrangement
// for a one-byte-per-iteration loop; shortening further needs unrolling (to drop
// the per-byte branch) or the hardware /WR via TIM8+DMA.
//
// Per byte:
//   str data -> GPIOD        drive bus (start of /WR-high tail of prev byte)
//   str wlo  -> GPIOC        /WR low
//   cmp p,end                done?      (in /WR-low, which has margin)
//   beq last                            (in /WR-low)
//   ldrb/ldr next word                  (in /WR-low - pipelines next byte)
//   str whi  -> GPIOC        /WR high (latch)
//   b loop
//
// _p and _end are byte pointers (start, one-past-end). Only called with end > p
// (C wrapper guards size != 0). Pin-safe: data -> PD8-PD15 only; /WR -> PC8 only.
#define LCD_PAR_WR_BUF(_p, _end) do {                                  \
uint8_t       *_pp = (_p);                                     \
const uint8_t *_ee = (_end);                                   \
uint32_t _t1, _t2;                                             \
__asm volatile (                                              \
"   ldrb  %[t1], [%[p]], #1               \n"  /* byte0           */ \
"   ldr   %[t2], [%[lut], %[t1], lsl #2]  \n"  /* word0           */ \
"1: str   %[t2],  [%[bd]]                 \n"  /* data -> GPIOD   */ \
"   str   %[wlo], [%[bc]]                 \n"  /* /WR low         */ \
"   cmp   %[p], %[end]                    \n"  /* done? (in low)  */ \
"   beq   2f                              \n"  /* last? (in low)  */ \
"   ldrb  %[t1], [%[p]], #1               \n"  /* next byte(in low)*/ \
"   ldr   %[t2], [%[lut], %[t1], lsl #2]  \n"  /* next word(in low)*/ \
"   str   %[whi], [%[bc]]                 \n"  /* /WR high (latch)*/ \
"   b     1b                              \n"                  \
/* epilogue: last byte. Pad its /WR-low (only cmp+beq wide) to the */ \
/* regular width, then raise /WR. NOP count tuned on the scope.     */ \
"2: nop                                   \n"                  \
"   nop                                   \n"                  \
"   nop                                   \n"                  \
"   nop                                   \n"                  \
"   nop                                   \n"                  \
"   str   %[whi], [%[bc]]                 \n"  /* /WR high (latch)*/ \
: [p] "+r" (_pp), [t1] "=&r" (_t1), [t2] "=&r" (_t2)           \
: [lut] "r" (lcd_par_bsrr),                                    \
  [bd]  "r" (&GPIOD->BSRR), [bc] "r" (&GPIOC->BSRR),           \
  [wlo] "r" (WR_C_LOW), [whi] "r" (WR_C_HIGH),                 \
  [end] "r" (_ee)                                              \
: "cc", "memory" );                                           \
} while (0)

I got the 66ns cycle time mentioned above by having the /WR strobe on the same Port D as the 8 bit data bus, which saves a little code on the BSRR manipulation. Now /WR is on PC8, which is less efficient because the BSRR writes go to two different ports. PC8 was chosen because it is one of the few pins that can carry a timer PWM output (within the limits of which timers I am already using, etc).
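
For anyone following the asm: it indexes a 256-entry table of ready-made BSRR words, one per data byte. The table itself is not shown above, so the following is only a guess at how such a table could be built; `bsrr_lut`, `bsrr_word_for_byte` and `bsrr_lut_init` are illustrative names, not the actual `lcd_par_bsrr` code.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t bsrr_lut[256];

/* BSRR: bits 0..15 set pins, bits 16..31 reset pins. For a data bus on
   PD8..PD15 the ones of the byte go to bits 8..15 (set) and the zeros of
   the byte go to bits 24..31 (reset); PD0..PD7 are never touched. */
static uint32_t bsrr_word_for_byte(uint8_t b)
{
    uint32_t set   = (uint32_t)b << 8;
    uint32_t reset = (uint32_t)(uint8_t)~b << 24;
    return set | reset;
}

/* Fills the table once at start-up. */
static void bsrr_lut_init(void)
{
    for (unsigned i = 0; i < 256u; i++)
        bsrr_lut[i] = bsrr_word_for_byte((uint8_t)i);
    assert(bsrr_lut[0xFF] == 0x0000FF00u);  /* all ones: set only */
}
```

The table costs 1KB but replaces shift/mask work in the loop with a single indexed load.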

If anyone can speed up this asm I would be super interested... Well there is the 8x etc unrolling of the loop as an option.

I will wire this up to make sure it actually works (the bloody LCD swaps over the /WR and the D/C pins between SPI and parallel modes  |O which I will address in the finished board with a LVC157 mux so the mode remains GPIO-selectable) and then move to the DMA experiment.

I actually wonder how the hell DMA can work:

You set up a timer to do PWM, coming out on PC8, and when the timer overflows, it triggers DMA to load in the next byte. So you have to load the 1st byte manually and start the timer. The number of pulses has to match the buffer size. Is this configurable? I know the number of DMA transactions is configurable but that's not the same thing. As far as I can see there is no config for the timer to stop generating pulses after N pulses. A DMA NDTR=0 interrupt will likely come too late to stop the pulses. How do people do it?

Digging around, the only method seems to be: terminate the timer via a second DMA, not an interrupt. You can have the timer's update event drive two DMA streams: one writing pixel data to GPIO ODR, with the burst length set by NDTR, and a second one that stops the timer. When the data DMA finishes, instead of an interrupt, you arrange for the last transfer to also trigger a write that stops the timer (e.g. a DMA writing to TIM8->CR1 to clear CEN, sequenced so it lands after the final strobe). A DMA-driven stop has deterministic timing: it happens a fixed number of cycles after the last data transfer, not whenever the CPU gets round to the ISR. This removes the interrupt-latency uncertainty entirely. Bloody hell!

There is a counter on TIM8 (the repetition counter, RCR) which is 8 bits only, so you can do up to 256 bytes per burst.
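
The 256-byte limit just means a longer buffer gets cut into bursts, each with its own counter value of pulses-1. A small sketch of that bookkeeping (helper names `next_burst_len`, `rcr_for_burst` and `burst_count` are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define RCR_MAX_PULSES 256u  /* 8-bit counter: values 0..255 -> 1..256 pulses */

/* Length of the next burst when off bytes of size have already gone out. */
static uint32_t next_burst_len(uint32_t size, uint32_t off)
{
    assert(off < size);
    uint32_t n = size - off;
    return (n > RCR_MAX_PULSES) ? RCR_MAX_PULSES : n;
}

/* Counter value to program for a burst of n pulses (pulses - 1). */
static uint8_t rcr_for_burst(uint32_t n)
{
    assert(n >= 1u && n <= RCR_MAX_PULSES);
    return (uint8_t)(n - 1u);
}

/* How many times the CPU has to re-arm the timer for size bytes. */
static uint32_t burst_count(uint32_t size)
{
    return (size + RCR_MAX_PULSES - 1u) / RCR_MAX_PULSES;
}
```

A 600-byte buffer, for example, goes out as bursts of 256, 256 and 88 bytes, so the CPU steps in three times.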

As an update in case anybody else tries this, this is TIM8 just doing PWM at 71ns cycle (ST7789 is 66ns min). It does 10 pulses and that's it. Perfect so far :)

Code: [Select]
// ── TIM8 PWM /WR burst test: 10 pulses on PC8 (TIM8_CH3), then self-stop ──
//
// Bench check for the future DMA scheme: configure TIM8 CH3 (PC8) as PWM, use
// one-pulse mode + the repetition counter (RCR) so the timer emits exactly 10
// pulses and then halts itself in hardware (CEN cleared by OPM). No DMA here -
// this just proves the pulse count and timing on the scope.
//
// TIM8 counter clock = 168MHz (APB2 timer clock). ARR sets the period, CCR3 the
// /WR low/high split. RCR = N-1 makes one-pulse-mode span N counter cycles, so
// OPM stops after N pulses.
//
// Scope PC8: expect exactly 10 PWM pulses then idle, with the period and duty
// set below. Re-run by re-arming (set CEN again).
//
// Adjust:
//   ARR  = period-1   (period in 168MHz ticks; 11 -> 12 ticks -> 71.4ns)
//   CCR3 = high-time portion (PWM mode 1: output high while CNT<CCR3)
//   RCR  = pulses-1   (10 pulses -> 9)

// Clock the timer and the GPIO port
__HAL_RCC_TIM8_CLK_ENABLE();
__HAL_RCC_GPIOC_CLK_ENABLE();

// PC8 -> AF3 (TIM8_CH3), push-pull, very-high speed, no pull
GPIOC->MODER   = (GPIOC->MODER   & ~(3u << (8*2))) | (2u << (8*2));   // AF mode
GPIOC->OTYPER &= ~(1u << 8);                                          // push-pull
GPIOC->OSPEEDR = (GPIOC->OSPEEDR & ~(3u << (8*2))) | (3u << (8*2));   // very-high
GPIOC->PUPDR  &= ~(3u << (8*2));                                      // no pull
GPIOC->AFR[1]  = (GPIOC->AFR[1] & ~(0xFu << ((8-8)*4))) | (3u << ((8-8)*4)); // AF3

// Timer base: period and prescaler
TIM8->PSC = 0;          // no prescale -> 168MHz counter clock
TIM8->ARR = 11;         // 12 ticks -> 71.4ns period
TIM8->CCR3 = 6;         // high while CNT<6, low while 6..11 -> ~half/half
TIM8->RCR = 9;          // 10 pulses (N-1)

// One-pulse mode: timer stops itself after the RCR burst
TIM8->CR1 |= TIM_CR1_OPM;

// CH3 = PWM mode 1, output enable
TIM8->CCMR2 &= ~TIM_CCMR2_OC3M;
TIM8->CCMR2 |= (6u << TIM_CCMR2_OC3M_Pos);   // PWM mode 1
TIM8->CCMR2 |= TIM_CCMR2_OC3PE;              // preload enable
TIM8->CCER  |= TIM_CCER_CC3E;                // CH3 output enable

// Advanced-timer outputs need MOE (main output enable) in BDTR
TIM8->BDTR |= TIM_BDTR_MOE;

// Force an update to load PSC/ARR/RCR/CCR3 shadow registers
TIM8->EGR = TIM_EGR_UG;

// Start: generates 10 pulses then OPM clears CEN automatically
TIM8->CR1 |= TIM_CR1_CEN;

// (To repeat the burst: TIM8->EGR = TIM_EGR_UG; TIM8->CR1 |= TIM_CR1_CEN;)

and this is the PWM waveform (ignore the ringing - bad probe earth)


« Last Edit: June 17, 2026, 08:57:54 am by peter-h »
 
The following users thanked this post: SiliconWizard

Online langwadt

  • Super Contributor
  • ***
  • Posts: 5631
  • Country: dk
you might be able to use a different timer to gate the main timer (gated mode)
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 5817
  • Country: gb
  • Doing electronics since the 1960s...
OK a good idea!

TIM8 can be gated by TIM1, TIM2, TIM4, or TIM5.
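
If gating is tried, the slave-mode setup boils down to one register value. The sketch below only computes that value on the host; it assumes the four timers listed above map to ITR0..ITR3 in that order and uses the usual SMCR layout (SMS in bits 2:0 with 101 = gated mode, TS in bits 6:4). Both should be checked against the reference manual; `tim8_master` and `tim8_gated_smcr` are names invented here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed trigger inputs of TIM8: ITR0..ITR3, in the order listed above. */
enum tim8_master {
    MASTER_TIM1 = 0,
    MASTER_TIM2 = 1,
    MASTER_TIM4 = 2,
    MASTER_TIM5 = 3
};

#define SMCR_SMS_GATED 0x5u  /* SMS[2:0] = 101: counter runs while TRGI is high */
#define SMCR_TS_SHIFT  4u    /* TS[2:0] occupies bits 6:4 */

/* Value to write to TIM8->SMCR so the chosen master gates the counter. */
static uint32_t tim8_gated_smcr(enum tim8_master m)
{
    assert((unsigned)m <= 3u);
    return ((uint32_t)m << SMCR_TS_SHIFT) | SMCR_SMS_GATED;
}
```

The master timer would additionally need its TRGO configured to output the gate window; that part is not covered here.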

TIM2 and TIM5 I have spare. I would like to reserve TIM5 (32 bit) for something else. But TIM2 has a mystery: it may be used by the ethernet subsystem, as a timestamp comparator, for precision time protocol (PTP). This is not confirmed, but the 32F417 doc talks about it in the PTP context. See
https://community.st.com/s/question/0D53W00001QgDboSAF/32f417-is-tim2-used-with-ethernet-at-all
Nobody seems to know for 100% sure...

But actually using the 8-bit RCR for 1-256 bytes is still very efficient; you just have a little bit of CPU time at the end of the block if >256.

I've also looked at the DMA -> BSRR route (which will obviously work - the above asm code uses it too - but needs a 32 bit wide buffer in which every 4th byte is the actual data) and an asm loop to write every 4th byte, 256 times, is not too bad. Even better if you could have two buffers and swap them.
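
One way the "every 4th byte" buffer could work, sketched on the host: pre-fill each 32-bit word with all eight reset bits for PD8-PD15, then per transfer only drop the data byte into bits 8..15. This leans on the BSRR rule that the set bit wins when set and reset are both written for the same pin, which should be confirmed in the reference manual before relying on it. `bsrr_buf_prepare` and `bsrr_buf_fill` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One-time fill: every word requests a reset of PD8..PD15 (bits 24..31). */
static void bsrr_buf_prepare(uint32_t *buf, size_t words)
{
    assert(buf != NULL);
    for (size_t i = 0; i < words; i++)
        buf[i] = 0xFF000000u;
}

/* Per transfer: place each data byte in the set field (bits 8..15).
   Only that one byte of each word changes between transfers. */
static void bsrr_buf_fill(uint32_t *buf, const uint8_t *src, size_t n)
{
    assert(buf != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        buf[i] = (buf[i] & ~0x0000FF00u) | ((uint32_t)src[i] << 8);
}
```

With two such buffers, one can be refilled while the DMA is draining the other.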
« Last Edit: June 17, 2026, 09:19:13 am by peter-h »
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 5817
  • Country: gb
  • Doing electronics since the 1960s...
I am happy to say that it does work! 32F4 DMA can write bytewide to the upper half of a 16 bit port, without messing up pins on the lower half programmed to be outputs.

Also generating a strobe with PWM works, and so does triggering DMA to load the next byte onto the bus.

It is pretty fiddly to get it working right though.



Init code:

Code: [Select]
/*
═══════════════════════════════════════════════════════════════════════════════
CC4-triggered DMA: load the next byte AFTER /WR has gone high, so it settles
during the high phase + next low phase and is rock-solid before the next latch.
═══════════════════════════════════════════════════════════════════════════════

PRINCIPLE
  CH3 (CCR3) generates /WR via PWM mode 2: /WR low CNT 0..CCR3-1, high CNT CCR3..ARR.
  Rising edge (the LATCH) is at CNT = CCR3.
  CH4 (CCR4) is a DMA-TRIGGER ONLY channel (no output pin). Set CCR4 a few ticks
  AFTER CCR3, so the CC4 event - and thus the DMA load of the next byte - happens
  while /WR is already HIGH, a few cycles past the latch edge.
  The byte loaded at CNT=CCR4 then has the rest of this period plus the next
  period's low phase to settle before the NEXT rising edge at CNT=CCR3. No race.

  CC-compare DMA requests are NOT gated by the repetition counter (unlike UPDATE),
  so CC4 fires every period -> one byte per period, exactly what we want, while RCR
  still bounds the burst length via OPM.

TIMING (ARR=11, 71.4ns period, 168MHz):
  CCR3 = 6  -> /WR low CNT 0..5 (~36ns), high CNT 6..11 (~36ns), latch at CNT=6.
  CCR4 = 8  -> DMA loads next byte at CNT=8 (~12ns after the rising edge).
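  Next latch is at CNT=6 of the next period = (12-8)+6 = 10 ticks ~60ns after the
  CC4 event (one period is ARR+1 = 12 ticks). So each byte gets up to ~60ns of
  settle, less the DMA request latency (~54ns is the working figure used below),
  loaded well clear of any latch edge.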

DMA MAPPING: TIM8_CH4 -> DMA2 Stream7, Channel 7.
  (TIM8_UP was Stream1 Ch7; CC4 is a different stream. Stream7/Ch7 = TIM8_CH4/TRIG/COM.)

*/

// TIM8 CH3 (PC8) = /WR via PWM. CH4 = internal DMA-trigger compare, set a
// few ticks AFTER the /WR rising edge so the DMA loads the next byte while
// /WR is high - it then settles ~54ns before the next latch. CC4 compare
// DMA fires every period (not gated by RCR), so one byte per pulse; RCR+OPM
// bound the burst length. PC8 stays a GPIO output here (bit-bang init needs
// it); the DAT burst flips PC8 to AF3 only for the pixel stream.

__HAL_RCC_TIM8_CLK_ENABLE();
__HAL_RCC_DMA2_CLK_ENABLE();

// Pre-select AF3 for PC8 in the mux; leave MODER as OUTPUT (set in step 3).
GPIOC->AFR[1] = (GPIOC->AFR[1] & ~(0xFu << ((WR_PIN_POS-8)*4)))
  | (3u << ((WR_PIN_POS-8)*4));                                        // AF3

// Timer base
TIM8->PSC  = 0;
TIM8->ARR  = 12;       // ARR=12 is 13 ticks ≈ 77ns
TIM8->RCR  = 0;        // set per-burst

TIM8->CR1 |= TIM_CR1_OPM;                     // one-pulse: self-stop after burst

// CH3 = /WR output, PWM mode 2 (low-first). Latch (rising edge) at CNT=CCR3.
TIM8->CCR3 = 8;
TIM8->CCMR2 &= ~TIM_CCMR2_OC3M;
TIM8->CCMR2 |= (7u << TIM_CCMR2_OC3M_Pos);    // PWM mode 2
TIM8->CCMR2 |= TIM_CCMR2_OC3PE;               // preload
TIM8->CCER  |= TIM_CCER_CC3E;                 // CH3 output enable (drives PC8 in AF)
TIM8->BDTR  |= TIM_BDTR_MOE;                  // main output enable

// CH4 = DMA-trigger only (no output pin). Compare at CCR4 generates the CC4
// event -> CC4 DMA request, loading the next byte while /WR is high.
// NOTE: CCR4 equals CCR3 in this build, not "a few ticks after" as in the
// notes above; adjust on the scope if needed.
TIM8->CCR4 = 8;
TIM8->CCMR2 &= ~TIM_CCMR2_OC4M;               // OC4M=0 (frozen) - still sets CC4IF on match
// (no CC4E - we don't need an output pin, just the compare event/flag)

// DMA requests from CC4 (NOT from update). UDE off; CC4DE on per-burst.
TIM8->DIER &= ~TIM_DIER_UDE;
// CC4DE toggled per-burst in LCD_Transmit_buf_DAT (kept off here).

// DMA2 Stream7 Ch7 = TIM8_CH4. byte->byte, mem-increment, mem->periph, PAR=ODR+1.
DMA2_Stream7->CR = 0;
while (DMA2_Stream7->CR & DMA_SxCR_EN) { }
DMA2_Stream7->PAR = (uint32_t)(&GPIOD->ODR) + 1u;     // high byte -> PD8-15
DMA2_Stream7->CR =
  (7u << DMA_SxCR_CHSEL_Pos)   // channel 7 = TIM8_CH4
| (0u << DMA_SxCR_MSIZE_Pos)   // memory size = byte
| (0u << DMA_SxCR_PSIZE_Pos)   // periph size = byte
| DMA_SxCR_MINC                // increment memory
| (1u << DMA_SxCR_DIR_Pos)     // memory -> peripheral
| (2u << DMA_SxCR_PL_Pos);     // priority high

TIM8->EGR = TIM_EGR_UG;            // load shadow regs (timer idle)
GPIOC->BSRR = WR_C_HIGH;           // PC8 = 1 (idle high), stays GPIO output
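
The settle-time figure in the TIMING notes is easy to get off by one tick, so here is the tick count as a small host-side check. `settle_ticks` and `ticks_to_ps` are names made up for this sketch; it counts from the CC4 compare event to the next /WR rising edge and deliberately ignores DMA request latency, which eats into the result on real hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks from the CC4 compare event (CNT == ccr4) to the next /WR rising
   edge (CNT == ccr3 in the following period). One period is arr + 1 ticks. */
static uint32_t settle_ticks(uint32_t arr, uint32_t ccr3, uint32_t ccr4)
{
    assert(ccr3 <= arr && ccr4 <= arr && ccr4 >= ccr3);
    return (arr + 1u - ccr4) + ccr3;
}

/* Converts timer ticks to picoseconds (rounded down). */
static uint32_t ticks_to_ps(uint32_t ticks, uint32_t timer_hz)
{
    assert(timer_hz > 0u);
    return (uint32_t)(((uint64_t)ticks * 1000000000000ull) / timer_hz);
}
```

For the example values in the notes (ARR=11, CCR3=6, CCR4=8) this gives 10 ticks, about 59.5ns at 168MHz. For the values actually programmed above (ARR=12, CCR3=8, CCR4=8) it gives a full 13-tick period, about 77.4ns, again before latency.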

To output a buffer:

Code: [Select]

// TIM8 + DMA hardware /WR. CPU is idle during each burst.
//
// PC8 is a GPIO output (idle high) on entry. Flip it to AF3 (TIM8_CH3) for
// the pixel burst, then back to GPIO output afterwards so the next column's
// address setup can bit-bang /WR.
//
// Stream the buffer in <=256-byte bursts (RCR is 8-bit). For each burst:
// prime byte 0 onto PD8-15 (pulse 0 latches it), point DMA at bytes 1..n-1,
// then start TIM8. The CC4 compare event loads each subsequent byte while
// /WR is high so it settles ~54ns before the next latch.
//
// TWO RACE FIXES for the rare wrong/short filled-rectangle line:
//   1. __DSB() after EACH prime write (inside the loop) forces byte 0 onto
//      the bus before the timer can latch it - covers byte 0 of every
//      <=256 chunk, including the 2nd+ chunks of a >256 transfer.
//   2. After the timer self-stops (CEN clear), WAIT for the DMA transfer-
//      complete flag (TCIF7) BEFORE disabling the stream. The timer stopping
//      and the DMA draining its last byte are SEPARATE events; tearing the
//      stream down on CEN-clear alone can catch the DMA still draining ->
//      a whole line comes out short or wrong. This closes that race.
LCD_DC_GPIO_PORT->BSRR = LCD_DC_PIN;        // DAT: D/C=1
lcd_cs(0);

// PC8 -> AF3 (TIM8_CH3): the timer drives /WR for the DMA burst.
GPIOC->MODER = (GPIOC->MODER & ~(3u << (8*2))) | (2u << (8*2));   // PC8 = AF

uint32_t off = 0;
while (off < size)
{
    uint32_t n = size - off;
    if (n > 256u) n = 256u;
    const uint8_t *b = &outbuf[off];

    // CC4 DMA request OFF while we set up.
    TIM8->DIER &= ~TIM_DIER_CC4DE;

    // Prime byte 0, then a data barrier so the store lands before the
    // timer can latch it (covers byte 0 of THIS chunk).
    *((volatile uint8_t *)(&GPIOD->ODR) + 1) = b[0];
    __DSB();

    // Arm DMA for bytes 1..n-1 (none if n==1). CC4 in period k loads byte k+1.
    if (n > 1u)
    {
        DMA2_Stream7->CR &= ~DMA_SxCR_EN;
        while (DMA2_Stream7->CR & DMA_SxCR_EN) { }
        DMA2->HIFCR = (DMA_HIFCR_CTCIF7 | DMA_HIFCR_CHTIF7 |
                       DMA_HIFCR_CTEIF7 | DMA_HIFCR_CDMEIF7 | DMA_HIFCR_CFEIF7);
        DMA2_Stream7->M0AR = (uint32_t)&b[1];
        DMA2_Stream7->NDTR = n - 1u;
        DMA2_Stream7->CR  |= DMA_SxCR_EN;
    }

    // Load RCR (n pulses) via UG with CC4DE OFF, then clear UIF/CC4IF.
    TIM8->RCR = n - 1u;
    TIM8->EGR = TIM_EGR_UG;
    TIM8->SR  = 0;

    // Clear any stale CC4 flag FIRST, THEN enable CC4 DMA, then start.
    // (Enabling CC4DE while CC4IF is still set from the previous burst's
    //  last period fires an immediate spurious DMA request -> the whole
    //  burst shifts by one byte -> a wrong-colour run on that row.)
    TIM8->SR  = 0;
    if (n > 1u) TIM8->DIER |= TIM_DIER_CC4DE;
    TIM8->CR1 |= TIM_CR1_CEN;

    // Wait for OPM to self-stop the timer.
    while (TIM8->CR1 & TIM_CR1_CEN) { }

    // THEN wait for the DMA to finish draining its last byte before tearing
    // the stream down - otherwise the timer-stop vs DMA-drain race can
    // truncate the line. (Skip if n==1: no DMA was armed.)
    if (n > 1u)
    {
        while (!(DMA2->HISR & DMA_HISR_TCIF7)) { }
        DMA2_Stream7->CR &= ~DMA_SxCR_EN;
    }

    TIM8->DIER &= ~TIM_DIER_CC4DE;

    off += n;
}

// PC8 back to GPIO output, idle HIGH.
GPIOC->BSRR  = WR_C_HIGH;                                         // PC8 = 1
GPIOC->MODER = (GPIOC->MODER & ~(3u << (8*2))) | (1u << (8*2));   // PC8 = output
__DSB();                       // ensure PC8 is GPIO before any following bit-bang

lcd_cs(1);

It uses the 8 bit repetition counter (RCR) in the timer, so it is limited to 256 bytes per burst. However, the CPU overhead at the end of each burst is pretty small. The only way I know of to do bigger packets nonstop is, as mentioned above, to use another timer to gate this one. Maybe other CPUs have a 16 bit counter too...

 

Online langwadt

  • Super Contributor
  • ***
  • Posts: 5631
  • Country: dk
Maybe other CPUs have a 16 bit counter too...

e.g. the STM32G4
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 5817
  • Country: gb
  • Doing electronics since the 1960s...
I wonder whether gating a timer with another timer is actually going to work over long periods. It could be quite fiddly...

That counter is a much better way - and works.

So many people went up that road (bytewide output) and either gave up or just read everywhere that it is not possible.
 

