I am happy to say that it does work! The 32F4 DMA can write byte-wide to the upper half of a 16-bit port without disturbing pins on the lower half that are programmed as outputs.
Also generating a strobe with PWM works, and so does triggering DMA to load the next byte onto the bus.
It is pretty fiddly to get it working right though.
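As a quick illustration of the byte-wide write, here is a minimal host-side sketch. It assumes a little-endian target (true of the Cortex-M4), and `fake_odr` and `write_high_byte` are hypothetical names: `fake_odr` is ordinary RAM standing in for a port's output data register, so nothing here touches real hardware.

```c
#include <stdint.h>

/* Hypothetical stand-in for GPIOx->ODR: plain RAM, not a real register. */
static volatile uint32_t fake_odr;

/* Write one byte to bits 15:8 of a 32-bit register image.
   On a little-endian target, byte offset +1 maps to bits 15:8,
   so bits 7:0 (the lower half of the port) are left untouched. */
static void write_high_byte(volatile uint32_t *odr, uint8_t value)
{
    *((volatile uint8_t *)odr + 1) = value;
}
```

Starting from `fake_odr = 0x00A5`, calling `write_high_byte(&fake_odr, 0x3C)` changes only the upper byte. This is the same access that the DMA performs when PAR is set to the ODR address plus one with a byte-wide peripheral size.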
Init code:
/*
═══════════════════════════════════════════════════════════════════════════════
CC4-triggered DMA: load the next byte AFTER /WR has gone high, so it settles
during the high phase + next low phase and is rock-solid before the next latch.
═══════════════════════════════════════════════════════════════════════════════
PRINCIPLE
CH3 (CCR3) generates /WR via PWM mode 2: /WR low CNT 0..CCR3-1, high CNT CCR3..ARR.
Rising edge (the LATCH) is at CNT = CCR3.
CH4 (CCR4) is a DMA-TRIGGER ONLY channel (no output pin). Set CCR4 a few ticks
AFTER CCR3, so the CC4 event - and thus the DMA load of the next byte - happens
while /WR is already HIGH, a few cycles past the latch edge.
The byte loaded at CNT=CCR4 then has the rest of this period plus the next
period's low phase to settle before the NEXT rising edge at CNT=CCR3. No race.
CC-compare DMA requests are NOT gated by the repetition counter (unlike UPDATE),
so CC4 fires every period -> one byte per period, exactly what we want, while RCR
still bounds the burst length via OPM.
TIMING (ARR=11, 71.4ns period, 168MHz):
CCR3 = 6 -> /WR low CNT 0..5 (~36ns), high CNT 6..11 (~36ns), latch at CNT=6.
CCR4 = 8 -> DMA loads next byte at CNT=8 (~12ns after the rising edge).
Next latch is at CNT=6 of the next period = (12-8)+6 = 10 ticks ~60ns later
(the period is ARR+1 = 12 ticks, so the wrap from CNT=11 to 0 counts as a tick).
So each byte gets ~60ns of settle, loaded well clear of any latch edge.
DMA MAPPING: TIM8_CH4 -> DMA2 Stream7, Channel 7.
(TIM8_UP was Stream1 Ch7; CC4 is a different stream. Stream7/Ch7 = TIM8_CH4/TRIG/COM.)
*/
// TIM8 CH3 (PC8) = /WR via PWM. CH4 = internal DMA-trigger compare, set a
// few ticks AFTER the /WR rising edge so the DMA loads the next byte while
// /WR is high - it then settles for most of a period before the next latch. CC4 compare
// DMA fires every period (not gated by RCR), so one byte per pulse; RCR+OPM
// bound the burst length. PC8 stays a GPIO output here (bit-bang init needs
// it); the DAT burst flips PC8 to AF3 only for the pixel stream.
__HAL_RCC_TIM8_CLK_ENABLE();
__HAL_RCC_DMA2_CLK_ENABLE();
// Pre-select AF3 for PC8 in the mux; leave MODER as OUTPUT (set in step 3).
GPIOC->AFR[1] = (GPIOC->AFR[1] & ~(0xFu << ((WR_PIN_POS-8)*4)))
              | (3u << ((WR_PIN_POS-8)*4)); // AF3
// Timer base
TIM8->PSC = 0;
TIM8->ARR = 12; // ARR=12 is 13 ticks ≈ 77ns
TIM8->RCR = 0; // set per-burst
TIM8->CR1 |= TIM_CR1_OPM; // one-pulse: self-stop after burst
// CH3 = /WR output, PWM mode 2 (low-first). Latch (rising edge) at CNT=CCR3.
TIM8->CCR3 = 6; // latch at CNT=6, so CCR4 (8) falls a couple of ticks after it
TIM8->CCMR2 &= ~TIM_CCMR2_OC3M;
TIM8->CCMR2 |= (7u << TIM_CCMR2_OC3M_Pos); // PWM mode 2
TIM8->CCMR2 |= TIM_CCMR2_OC3PE; // preload
TIM8->CCER |= TIM_CCER_CC3E; // CH3 output enable (drives PC8 in AF)
TIM8->BDTR |= TIM_BDTR_MOE; // main output enable
// CH4 = DMA-trigger only (no output pin). Compare at CCR4 generates the CC4
// event -> CC4 DMA request, loading the next byte while /WR is high.
TIM8->CCR4 = 8; // a few ticks after CCR3 (the latch)
TIM8->CCMR2 &= ~TIM_CCMR2_OC4M; // OC4M=0 (frozen) - still sets CC4IF on match
// (no CC4E - we don't need an output pin, just the compare event/flag)
// DMA requests from CC4 (NOT from update). UDE off; CC4DE on per-burst.
TIM8->DIER &= ~TIM_DIER_UDE;
// CC4DE toggled per-burst in LCD_Transmit_buf_DAT (kept off here).
// DMA2 Stream7 Ch7 = TIM8_CH4. byte->byte, mem-increment, mem->periph, PAR=ODR+1.
DMA2_Stream7->CR = 0;
while (DMA2_Stream7->CR & DMA_SxCR_EN) { }
DMA2_Stream7->PAR = (uint32_t)(&GPIOD->ODR) + 1u; // high byte -> PD8-15
DMA2_Stream7->CR =
      (7u << DMA_SxCR_CHSEL_Pos)  // channel 7 = TIM8_CH4
    | (0u << DMA_SxCR_MSIZE_Pos)  // memory size = byte
    | (0u << DMA_SxCR_PSIZE_Pos)  // periph size = byte
    | DMA_SxCR_MINC               // increment memory
    | (1u << DMA_SxCR_DIR_Pos)    // memory -> peripheral
    | (2u << DMA_SxCR_PL_Pos);    // priority high
TIM8->EGR = TIM_EGR_UG; // load shadow regs (timer idle)
GPIOC->BSRR = WR_C_HIGH; // PC8 = 1 (idle high), stays GPIO output
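The tick arithmetic in the TIMING note is easy to get off by one, because the period of an up-counting timer is ARR+1 ticks, not ARR. Below is a small host-side sketch for checking a set of values before putting them on the hardware. `wr_timing_t`, `wr_timing` and `ticks_to_ns` are hypothetical helper names, not part of the code above, and the sketch assumes CCR3 <= CCR4 <= ARR. It also ignores DMA request latency, which adds a few more cycles before the byte actually reaches the port.

```c
#include <stdint.h>

/* Hypothetical helper type: all values are in timer ticks. */
typedef struct {
    uint32_t period_ticks;  /* full period = ARR + 1                  */
    uint32_t low_ticks;     /* /WR low:  CNT 0..CCR3-1                */
    uint32_t high_ticks;    /* /WR high: CNT CCR3..ARR                */
    uint32_t load_delay;    /* rising edge (CCR3) to CC4 event (CCR4) */
    uint32_t settle_ticks;  /* CC4 event to the NEXT rising edge      */
} wr_timing_t;

/* Assumes PWM mode 2, up-counting, and ccr3 <= ccr4 <= arr. */
static wr_timing_t wr_timing(uint32_t arr, uint32_t ccr3, uint32_t ccr4)
{
    wr_timing_t t;
    t.period_ticks = arr + 1u;
    t.low_ticks    = ccr3;
    t.high_ticks   = t.period_ticks - ccr3;
    t.load_delay   = ccr4 - ccr3;
    /* Rest of this period after CCR4, plus the next low phase. */
    t.settle_ticks = (t.period_ticks - ccr4) + ccr3;
    return t;
}

/* Convert ticks to nanoseconds for a given timer clock in Hz. */
static double ticks_to_ns(uint32_t ticks, double timer_hz)
{
    return (double)ticks * 1e9 / timer_hz;
}
```

For ARR=11, CCR3=6, CCR4=8 at 168 MHz, `wr_timing(11, 6, 8)` gives a 12-tick period (about 71.4 ns), 6 ticks low, 6 ticks high, a load 2 ticks after the latch edge, and 10 ticks (about 60 ns) from the load to the next latch.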
To output a buffer:
// TIM8 + DMA hardware /WR. CPU is idle during each burst.
//
// PC8 is a GPIO output (idle high) on entry. Flip it to AF3 (TIM8_CH3) for
// the pixel burst, then back to GPIO output afterwards so the next column's
// address setup can bit-bang /WR.
//
// Stream the buffer in <=256-byte bursts (RCR is 8-bit). For each burst:
// prime byte 0 onto PD8-15 (pulse 0 latches it), point DMA at bytes 1..n-1,
// then start TIM8. The CC4 compare event loads each subsequent byte while
// /WR is high so it settles for most of a period before the next latch.
//
// TWO RACE FIXES for the rare wrong/short filled-rectangle line:
// 1. __DSB() after EACH prime write (inside the loop) forces byte 0 onto
// the bus before the timer can latch it - covers byte 0 of every
// <=256 chunk, including the 2nd+ chunks of a >256 transfer.
// 2. After the timer self-stops (CEN clear), WAIT for the DMA transfer-
// complete flag (TCIF7) BEFORE disabling the stream. The timer stopping
// and the DMA draining its last byte are SEPARATE events; tearing the
// stream down on CEN-clear alone can catch the DMA still draining ->
// a whole line comes out short or wrong. This closes that race.
LCD_DC_GPIO_PORT->BSRR = LCD_DC_PIN; // DAT: D/C=1
lcd_cs(0);
// PC8 -> AF3 (TIM8_CH3): the timer drives /WR for the DMA burst.
GPIOC->MODER = (GPIOC->MODER & ~(3u << (8*2))) | (2u << (8*2)); // PC8 = AF
uint32_t off = 0;
while (off < size)
{
    uint32_t n = size - off;
    if (n > 256u) n = 256u;
    const uint8_t *b = &outbuf[off];

    // CC4 DMA request OFF while we set up.
    TIM8->DIER &= ~TIM_DIER_CC4DE;

    // Prime byte 0, then a data barrier so the store lands before the
    // timer can latch it (covers byte 0 of THIS chunk).
    *((volatile uint8_t *)(&GPIOD->ODR) + 1) = b[0];
    __DSB();

    // Arm DMA for bytes 1..n-1 (none if n==1). CC4 in period k loads byte k+1.
    if (n > 1u)
    {
        DMA2_Stream7->CR &= ~DMA_SxCR_EN;
        while (DMA2_Stream7->CR & DMA_SxCR_EN) { }
        DMA2->HIFCR = (DMA_HIFCR_CTCIF7 | DMA_HIFCR_CHTIF7 |
                       DMA_HIFCR_CTEIF7 | DMA_HIFCR_CDMEIF7 | DMA_HIFCR_CFEIF7);
        DMA2_Stream7->M0AR = (uint32_t)&b[1];
        DMA2_Stream7->NDTR = n - 1u;
        DMA2_Stream7->CR |= DMA_SxCR_EN;
    }

    // Load RCR (n pulses) via UG with CC4DE OFF, then clear UIF/CC4IF.
    TIM8->RCR = n - 1u;
    TIM8->EGR = TIM_EGR_UG;
    TIM8->SR = 0;

    // Clear any stale CC4 flag FIRST, THEN enable CC4 DMA, then start.
    // (Enabling CC4DE while CC4IF is still set from the previous burst's
    // last period fires an immediate spurious DMA request -> the whole
    // burst shifts by one byte -> a wrong-colour run on that row.)
    TIM8->SR = 0;
    if (n > 1u) TIM8->DIER |= TIM_DIER_CC4DE;
    TIM8->CR1 |= TIM_CR1_CEN;

    // Wait for OPM to self-stop the timer.
    while (TIM8->CR1 & TIM_CR1_CEN) { }

    // THEN wait for the DMA to finish draining its last byte before tearing
    // the stream down - otherwise the timer-stop vs DMA-drain race can
    // truncate the line. (Skip if n==1: no DMA was armed.)
    if (n > 1u)
    {
        while (!(DMA2->HISR & DMA_HISR_TCIF7)) { }
        DMA2_Stream7->CR &= ~DMA_SxCR_EN;
    }
    TIM8->DIER &= ~TIM_DIER_CC4DE;

    off += n;
}
// PC8 back to GPIO output, idle HIGH.
GPIOC->BSRR = WR_C_HIGH; // PC8 = 1
GPIOC->MODER = (GPIOC->MODER & ~(3u << (8*2))) | (1u << (8*2)); // PC8 = output
__DSB(); // ensure PC8 is GPIO before any following bit-bang
lcd_cs(1);
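To make the prime-then-DMA ordering easier to reason about, here is a rough host-side model of the loop above. It is only a sketch: `simulate_stream` is a hypothetical name, there is no hardware access, and the `stale_cc4if` parameter simply models the spurious request described in the comments (one extra DMA load before the first pulse) rather than the exact silicon behaviour.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model: replays the chunked burst logic and records
   the byte the LCD would latch on each /WR rising edge.
   stale_cc4if != 0 models a leftover CC4 flag firing one DMA request as soon
   as CC4DE is enabled. Returns the number of latch edges generated. */
static size_t simulate_stream(const uint8_t *src, size_t size,
                              uint8_t *latched, size_t cap, int stale_cc4if)
{
    size_t out = 0, off = 0;

    while (off < size) {
        size_t n = size - off;
        if (n > 256u) n = 256u;        /* 8-bit RCR: at most 256 pulses */

        const uint8_t *b = &src[off];
        uint8_t bus = b[0];            /* CPU primes byte 0 onto the bus */
        const uint8_t *m0ar = &b[1];   /* DMA source: bytes 1..n-1 */
        size_t ndtr = n - 1u;          /* DMA transfer count */

        if (stale_cc4if && ndtr > 0u) {  /* spurious request before pulse 0 */
            bus = *m0ar++;
            ndtr--;
        }

        for (size_t pulse = 0; pulse < n; pulse++) {
            /* CNT == CCR3: rising edge of /WR latches the bus. */
            if (out < cap) latched[out] = bus;
            out++;
            /* CNT == CCR4: CC4 request loads the next byte, if any is left. */
            if (ndtr > 0u) {
                bus = *m0ar++;
                ndtr--;
            }
        }
        off += n;
    }
    return out;
}
```

With `stale_cc4if` set to 0, the latched sequence matches the source buffer byte for byte, including across the 256-byte chunk boundaries. With it set to 1, a four-byte buffer {10, 20, 30, 40} comes out as {20, 30, 40, 40}, which is the one-byte shift the flag-clearing order is meant to prevent.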
It uses the timer's 8-bit repetition counter (RCR), so each burst is limited to 256 bytes. However, the CPU overhead at the end of each burst is pretty small. The only way I know of to send bigger packets nonstop is, as mentioned above, to use another timer to gate this one. Maybe other CPUs have a 16-bit repetition counter...