I recall there are a few MCUs with parallel-bus-style GPIOs; I think the PIC32s are such a case (the Parallel Master Port, PMP)? The whole parallel array of GPIOs on that port can be driven at once, along with a clock/bus strobe signal, hands-free with DMA. This certainly isn't available on everything, so YMMV.
I would think that with anything having a memory interface (e.g. ST's FSMC), you can just map the FPGA to a suitable slice of address space and go with it. Parallel RAM interfaces are quite simple, and even SDRAM or DDR isn't beyond the pale (of course, you'll need quite a bit more logic inside the FPGA to interface to that).
For example, one of the (older/now obsolete?) Discovery boards had (SD?)RAM and a (16-bit parallel) LCD on the same FSMC bus, mapped appropriately. The LCD shows up as a couple of addresses (registers), I think? Very easy to dump lots of data to the LCD that way.
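To make the "device shows up as a couple of addresses" idea concrete, here is a minimal sketch in C. Everything named here is made up for illustration: `bus_window_t`, `bus_write_burst`, and the two-register layout (one command/index register, one data register, selected by an address line) are assumptions, not any particular board's or LCD's actual map. On real hardware the window would sit at a fixed address inside the external-bus region from the MCU's memory map; here it is an ordinary struct so the code also runs on a PC.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-register window. On a real MCU this would be a fixed
 * address in the external-bus region (check the memory map); the FPGA or
 * LCD decodes one address line to pick between the two registers. */
typedef struct {
    volatile uint16_t cmd;   /* address line low:  command/index register */
    volatile uint16_t data;  /* address line high: data register */
} bus_window_t;

/* Select a register inside the device, then stream a buffer into it.
 * On real hardware each store to win->data turns into one bus write
 * cycle, with the bus controller generating chip-select and write-strobe
 * timing, so the CPU never touches the strobe pin itself. */
static void bus_write_burst(bus_window_t *win, uint16_t reg,
                            const uint16_t *buf, size_t n)
{
    win->cmd = reg;
    for (size_t i = 0; i < n; i++) {
        win->data = buf[i];  /* volatile: every store is really issued */
    }
}
```

The same loop is what a memory-to-memory DMA transfer with a fixed destination address would replace, if the part supports that.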
If you don't have a parallel bus interface, you're limited to bit-banging, and whatever rate that can be done at. Usually IO propagates not just through multiple bus/clock domains but through different clock rates as well, and the CPU may be stalled (wait states) during those delays. I'm not sure what is generally available/possible in the average device, but most things can toggle pins at CPU clock rate when CLK_CPU = CLK_PERIPH (so like, Cortex-M0 etc. stuff).
When the core runs faster than the peripheral bus, YMMV: it may wait, it may go through anyway (some devices allow astonishingly high pin toggle rates), it might be resynchronized (some transitions go missing??), or even cached (slowed down to local bus rate?). I think x86/x86-64 CPUs do some IO caching like this, where a sequence of IO operations is propagated in order, but I'm not sure of the details, and anyway you're likely not using much of that on such a system (that's low-level driver stuff you'd rarely even see).
Even among AVRs, there are a few with a bus interface: some very old ones, I think, from back when onboard SRAM was expensive and DIP packages ruled, so you could add external RAM to beef it up as needed; and, among more recent families, I think just the top end of the XMEGAs (e.g. ATxmega128A1U?) had a bus interface peripheral (EBI). These could offer couple-cycle wait state performance, so with an 8-bit CPU at up to 32MHz, you'd be looking at say 4MB/s without much trouble (even in C, maybe?); the catch is not having a lot of memory to buffer things in, let alone the CPU power to do much raw data processing. Whereas with GPIO writes, you have to address the ports, fetch data, write, strobe, wait, and so on, and you're lucky to get maybe 1MB/s that way. (I did a reverb effects box using external parallel RAM this way, with an ATxmega64D3; I think that's about right: on the order of 20-30 cycles per word fetched, at 32MHz. Or was that per byte rather than per word? Maybe it wasn't so bad after all. But then, I was also doing some DSP operations inlined with that, so it's hard to say just in terms of raw rate.)
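The rough numbers above are just clock rate divided by cycles per transfer. A small sketch of that arithmetic, with the caveat that the cycle counts are back-calculated from the figures quoted (4MB/s at 32MHz works out to about 8 cycles per byte, 1MB/s to about 32), not measured; `xfer_rate_bytes_per_sec` is a made-up helper name.

```c
#include <stdint.h>

/* Bytes per second for a transfer loop that moves `bytes_per_xfer` bytes
 * every `cycles_per_xfer` CPU cycles at `clk_hz`.
 * Integer math, rounds down; returns 0 if cycles_per_xfer is 0. */
static uint32_t xfer_rate_bytes_per_sec(uint32_t clk_hz,
                                        uint32_t cycles_per_xfer,
                                        uint32_t bytes_per_xfer)
{
    if (cycles_per_xfer == 0) {
        return 0;
    }
    /* 64-bit intermediate so clk_hz * bytes_per_xfer cannot overflow. */
    return (uint32_t)(((uint64_t)clk_hz * bytes_per_xfer) / cycles_per_xfer);
}
```

For example, `xfer_rate_bytes_per_sec(32000000, 8, 1)` gives 4000000 and `xfer_rate_bytes_per_sec(32000000, 32, 1)` gives 1000000. If the 20-30 cycles were per 16-bit word, 25 cycles with 2 bytes per transfer gives 2560000, which is the "maybe it wasn't so bad" case.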
So yeah, for most Cortex-M0-class things running at nominal speeds, that should be doable even with GPIOs, but you may have to optimize it in ASM to get the timings cycle-exact, and it won't be DMA'd. If GPIOs can be DMA'd as if they were a bus, or if an external bus can be configured, or if the CPU can run somewhat faster, you'll also be set. There is no single solution here; there are too many things to choose from. You'll have to check which MCUs offer which features, and go with that. Bit-banged GPIO is probably the most portable/universal option, but even that can be subject to limitations.
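For reference, the bit-banged version of a parallel write looks something like the sketch below. The pin assignment (data on bits 0-7, strobe on bit 8 of one 16-bit port) and the names `DATA_MASK`, `STROBE_BIT` and `bitbang_write` are assumptions for illustration; the port is passed in as a pointer so the same code runs on a PC against a plain variable. It makes no attempt at cycle-exact timing, and setup/hold delays for the receiver are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed pin assignment on a single 16-bit output port. */
#define DATA_MASK   0x00FFu  /* bits 0-7: parallel data */
#define STROBE_BIT  0x0100u  /* bit 8:    write strobe, active on rising edge */

/* Write n bytes, one strobe pulse per byte. `port` would be the MCU's
 * port output register on real hardware. */
static void bitbang_write(volatile uint16_t *port, const uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Present the data with strobe low, leaving the other pins alone. */
        *port = (uint16_t)((*port & ~(DATA_MASK | STROBE_BIT)) | buf[i]);
        *port |= STROBE_BIT;             /* rising edge: receiver latches */
        *port &= (uint16_t)~STROBE_BIT;  /* return strobe low */
    }
}
```

That is three read-modify-write port accesses per byte plus the loop and the buffer fetch, which is where the cycles go. Many parts have separate set/clear/toggle registers that avoid the read-modify-write, so check what yours offers before hand-tuning this in ASM.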
Also, supposing a wide bus is acceptable, taking a whole 16 or 32 GPIOs at once obviously helps the bitrate. If not, then clock speeds will be that much more critical, and something like a multi-lane SPI bus (dual/quad SPI) may prove more attractive. Some devices have almost arbitrary numbers of lanes, I think? Like, if you want to treat 8 bits as a parallel bus driven like SPI, you can?
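To show what "lanes" buys you, here is a sketch of how a byte stream gets cut up for a 1-, 2-, 4- or 8-lane bus: each clock edge carries one `lanes`-bit symbol, so a byte takes 8/lanes clocks. The MSB-first ordering and the helper name `split_into_lanes` are assumptions for illustration; a real peripheral does this in hardware and a given device may order bits differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Split n bytes into `lanes`-bit symbols, most significant bits first.
 * `lanes` must be 1, 2, 4 or 8. Each output element holds one symbol in
 * its low `lanes` bits, i.e. what goes on the pins for one clock.
 * Returns the number of symbols written (n * 8 / lanes), or 0 if `lanes`
 * is unsupported or `out` is too small. */
static size_t split_into_lanes(const uint8_t *in, size_t n, unsigned lanes,
                               uint8_t *out, size_t out_cap)
{
    if (lanes != 1 && lanes != 2 && lanes != 4 && lanes != 8) {
        return 0;
    }
    size_t per_byte = 8u / lanes;            /* clocks needed per byte */
    if (n * per_byte > out_cap) {
        return 0;
    }
    uint8_t mask = (uint8_t)((1u << lanes) - 1u);
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t s = 0; s < per_byte; s++) {
            unsigned shift = 8u - lanes * (unsigned)(s + 1);
            out[k++] = (uint8_t)((in[i] >> shift) & mask);
        }
    }
    return k;
}
```

With 4 lanes the byte 0xA5 goes out as two symbols, 0xA then 0x5; with 8 lanes it is one symbol per byte, which is the "8 bits as a parallel bus driven like SPI" case.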
Tim