More testing on GPIO performance. I tried reading the input register from code as fast as I can:
uint32_t a0 = GPIOD->ISTAT;
uint32_t a1 = GPIOD->ISTAT;
uint32_t a2 = GPIOD->ISTAT;
uint32_t a3 = GPIOD->ISTAT;
uint32_t a4 = GPIOD->ISTAT;
uint32_t a5 = GPIOD->ISTAT;
uint32_t a6 = GPIOD->ISTAT;
uint32_t a7 = GPIOD->ISTAT;
All eight values went into registers, so the reads compile to back-to-back loads:
800025a: f8d3 9010 ldr.w r9, [r3, #16]
800025e: f8d3 8010 ldr.w r8, [r3, #16]
8000262: f8d3 e010 ldr.w lr, [r3, #16]
8000266: f8d3 c010 ldr.w ip, [r3, #16]
800026a: 691f ldr r7, [r3, #16]
800026c: 691d ldr r5, [r3, #16]
800026e: 691c ldr r4, [r3, #16]
8000270: 6918 ldr r0, [r3, #16]
When the clock is 4 MHz (system clock divided by 2), I get the following pattern: "03 04 04 05 06 06 07 07".
No value is skipped and some values repeat, so the GPIO itself can definitely sample every value.
EDIT: Another test: run the DMA in untriggered M2M (memory-to-memory) mode, so it just reads the GPIO register as fast as it can. It already misses bytes:
17 19 1a 1b 1d 1e 1f 20 22 23 25 26 27 28 29 2b 2c 2d 2e 2f 31 32 34 35 36 37 38 3a 3b 3c
Values 18, 1c, 21, 24, 2a, 30, 33 and 39 never show up, so the bottleneck is the DMA itself.
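To compare the two captures without counting by eye, here is a small host-side sketch. It assumes the input is a free-running 8-bit counter that advances by 1 per tick, which is what the dumps suggest. The names `capture_stats`, `analyze_capture`, `cpu_reads` and `dma_reads` are made up for illustration and are not part of any vendor library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Summary of one capture of a counter that advances by 1 per tick. */
typedef struct {
    size_t repeats; /* samples equal to the previous one (reader faster than input) */
    size_t missed;  /* counter values that never appeared (reader slower than input) */
} capture_stats;

/* Hypothetical helper: compares each sample with its predecessor.
 * The uint8_t subtraction also handles wraparound from 0xff to 0x00. */
static capture_stats analyze_capture(const uint8_t *samples, size_t n)
{
    capture_stats s = {0, 0};
    assert(samples != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        uint8_t step = (uint8_t)(samples[i] - samples[i - 1]);
        if (step == 0)
            s.repeats++;                     /* same value read twice */
        else
            s.missed += (size_t)(step - 1);  /* values skipped between samples */
    }
    return s;
}

/* The eight back-to-back CPU loads from the post. */
static const uint8_t cpu_reads[] = {
    0x03, 0x04, 0x04, 0x05, 0x06, 0x06, 0x07, 0x07
};

/* The untriggered M2M DMA capture from the post. */
static const uint8_t dma_reads[] = {
    0x17, 0x19, 0x1a, 0x1b, 0x1d, 0x1e, 0x1f, 0x20, 0x22, 0x23,
    0x25, 0x26, 0x27, 0x28, 0x29, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f,
    0x31, 0x32, 0x34, 0x35, 0x36, 0x37, 0x38, 0x3a, 0x3b, 0x3c
};
```

Calling `analyze_capture(cpu_reads, sizeof cpu_reads)` and `analyze_capture(dma_reads, sizeof dma_reads)` gives one `capture_stats` per run. Repeats with no misses means the reader outpaces the input. Misses with no repeats means the reader is too slow.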