~20ns means 3 cycles.
When I ran experiments back when I was starting out with the STM32F4xx, I was unable to generate a 2-cycle-long pulse by writing from the processor to GPIO; only 1-cycle and 3-cycle pulses. I believe this is due to the way the GPIO's AHB bus is arbitrated within the bus matrix, but I am not an insider, so some other mechanism may be involved, e.g. within the processor's S port.
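For reference, the kind of test I mean is two back-to-back stores to the port's set/reset register. Below is a minimal sketch in C, assuming the STM32F4 BSRR layout (lower 16 bits set a pin, upper 16 bits reset it). `gpio_pulse` is a hypothetical helper name, and the register is passed in as a pointer so the function does not depend on any particular device header.

```c
#include <stdint.h>

/* Hypothetical helper: emit the shortest possible high pulse on one pin.
 * bsrr is assumed to point to a BSRR-style register (bits 0..15 set the
 * pin, bits 16..31 reset it), as on STM32F4 GPIO ports. */
static inline void gpio_pulse(volatile uint32_t *bsrr, unsigned pin)
{
    *bsrr = (uint32_t)1u << pin;          /* first store: pin high */
    *bsrr = (uint32_t)1u << (pin + 16u);  /* second store: pin low */
}
```

On target this would be called with the address of the port's BSRR, assuming a device header that exposes it as one 32-bit register. Whether the compiler really emits two adjacent stores depends on the optimization settings, so check the disassembly before trusting what the scope shows.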
[As a curiosity, in those early days I also experimented with bit-banging GPIO on an NXP LPC1786 (Cortex-M3, but it should be mostly identical to the M4 in this regard). At one point, the data hold delay after the clock edge disappeared (i.e. the instructions were write_for_clock_edge->write_for_data_change, and both changes appeared at the GPIO at the same time). The clock was written through bit-banding to the same GPIO port as the data. My conclusion was that, since in the LPC17xx (contrary to the STM32) GPIO sits in the 0x2xxx'xxxx area, which by default is Normal (rather than Device) memory, the processor was allowed to merge both writes into one, and it did so. As we decided at the same time to go with the more capable STM32, I never got around to proving this hypothesis (by setting the GPIO region as Device using the MPU).]
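In case the bit-banding part is unclear: on Cortex-M3/M4, each bit in the 0x2000'0000..0x200F'FFFF bit-band region has its own word in the alias region starting at 0x2200'0000, so changing the clock bit is a single word store. Here is a sketch of the address calculation as given in the ARM documentation; `bitband_alias` and the two macro names are hypothetical, not taken from any vendor header.

```c
#include <stdint.h>

#define BITBAND_SRAM_BASE  0x20000000u  /* start of the bit-band region */
#define BITBAND_ALIAS_BASE 0x22000000u  /* start of the alias region    */

/* Hypothetical helper: address of the alias word that maps to one bit
 * of a byte in the SRAM-side bit-band region. Each byte of the region
 * takes 32 bytes of alias space (8 bits x 4 bytes per alias word). */
static inline uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    uint32_t byte_offset = byte_addr - BITBAND_SRAM_BASE;
    return BITBAND_ALIAS_BASE + byte_offset * 32u + bit * 4u;
}
```

For example, bit 7 of the byte at 0x200FFFFF maps to the alias word at 0x23FFFFFC.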
With DMA, things are different again: the DMA will impose enough delay (IMO at least 3 cycles) that this effect won't be observable. OTOH, the DMA's delays/latencies are relatively hard to calculate with absolute confidence, especially if other channels of the same DMA are in use simultaneously.
Btw., you don't need to run benchmarks at the maximum clock frequency, as long as you keep everything else set up the same way (e.g. FLASH latency): this is a thoroughly synchronous machine, so the cycle counts carry over.
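To put numbers on that, here is a small sketch for converting between scope readings and cycle counts. The helper names are hypothetical, and the 24 MHz and 168 MHz figures in the note below are only assumed example clocks, not something taken from your setup.

```c
#include <stdint.h>

/* Duration of a given number of clock cycles, in nanoseconds. */
static inline double cycles_to_ns(uint32_t cycles, double clock_hz)
{
    return (double)cycles * 1e9 / clock_hz;
}

/* Nearest whole number of clock cycles for a measured duration. */
static inline uint32_t ns_to_cycles(double ns, double clock_hz)
{
    return (uint32_t)(ns * clock_hz / 1e9 + 0.5);
}
```

A pulse measured as 125 ns at an assumed 24 MHz comes out as 3 cycles, and the same 3 cycles at an assumed 168 MHz would be just under 18 ns, which is easier to read off the slow measurement than the fast one.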
JW