I tested my STM32F446RE (64-pin Nucleo) for parallel port reading, using a TIM -> DMA -> GPIO signal/trigger path. It can read
glitch-free at about 11.25 MHz.
Have you decided which 40 pins to use? I think you would have to use 3-4 GPIO ports to assemble a 40-line bus. This would be the most difficult part: finding pins that:
1. Belong to the same port.
2. Are consecutive: A0, A1, etc.
3. Use as few ports as possible overall, to reduce congestion on the DMA bus.
DMA can't read individual pins; it fetches the data in 8-bit blocks. A block is something like A0-A7, A8-A15, B0-B7, etc.
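To illustrate the block idea, here is a minimal sketch in plain C of how five 8-bit blocks could be combined into one 40-bit word after the DMA transfer. The names `lane_from_port` and `assemble_bus40`, and the lane ordering (lane 0 holds bus bits 0-7), are my own assumptions for illustration, not anything from the ST libraries. It only works on values already sampled from the ports, so it doesn't touch any hardware registers.

```c
#include <stdint.h>
#include <stddef.h>

#define BUS_LANES 5  /* assumed layout: 5 blocks x 8 bits = 40 bus lines */

/* Extract one 8-bit block from a 16-bit port snapshot (a value read from
 * a port's input data register).
 * high_half = 0 selects pins 0-7, high_half = 1 selects pins 8-15. */
static uint8_t lane_from_port(uint16_t idr, int high_half)
{
    return (uint8_t)(high_half ? (idr >> 8) : (idr & 0xFFu));
}

/* Combine five byte lanes into one 40-bit value.
 * Lane 0 becomes bits 0-7, lane 1 becomes bits 8-15, and so on. */
static uint64_t assemble_bus40(const uint8_t lanes[BUS_LANES])
{
    uint64_t word = 0;
    for (size_t i = 0; i < BUS_LANES; ++i) {
        word |= (uint64_t)lanes[i] << (8u * i);
    }
    return word;
}
```

With consecutive pins inside each block, this is just one shift and one OR per block, which is why that layout is the cheap one.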
It's possible to ignore points 1, 2 and 3, but the speed would be much lower than 10 MHz, and more complex bit manipulation would be needed to re-map the data.
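For the scattered-pin case, the re-mapping could look roughly like the sketch below: a table says where each bus line physically lives, and every bit is picked out one at a time. The type `pin_map_t`, the function `remap_bus`, and the convention that port index 0 means GPIOA and 1 means GPIOB are hypothetical, chosen only for this example. It's plain C working on already-sampled port values.

```c
#include <stdint.h>
#include <stddef.h>

/* Where one bus line physically lives (hypothetical mapping entry). */
typedef struct {
    uint8_t port;  /* index into the snapshot array (assumed: 0 = GPIOA, 1 = GPIOB, ...) */
    uint8_t pin;   /* pin number within that port, 0-15 */
} pin_map_t;

/* Build the bus word bit by bit: bus line i is taken from the port and
 * pin given by map[i]. Supports up to 64 lines. */
static uint64_t remap_bus(const uint16_t *ports, const pin_map_t *map, size_t n_lines)
{
    uint64_t word = 0;
    for (size_t i = 0; i < n_lines; ++i) {
        uint64_t bit = (uint64_t)(ports[map[i].port] >> map[i].pin) & 1u;
        word |= bit << i;
    }
    return word;
}
```

That is one shift, mask and OR per bus line (40 per sample instead of 5), on top of reading more ports, which gives an idea of why the achievable rate drops.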