If you really want it as fast as possible, it may be worth looking at the Pi's SPI interface, as it's DMA capable and can run at speeds up to 32Mb/s (4MB/s). I'd lay odds that's a good bit faster than you can bit-bang parallel GPIO, unless you are coding 'bare metal' and running without an OS. The hardware can handle DMA transfers of up to 65535 bytes, and the default buffer size is 4096, so you could potentially dump a 4096 byte block of data to your target in fractionally over 1ms (32768 bits at 32Mb/s).

Unfortunately I understand SPI slave mode is borked, and when the Pi is the master there is no possibility of flow control once a transfer has started, so your target, once it's initially ready, has to be able to accept the whole data block at the rate the Pi is sending it. To get a byte strobe pulse, you'll need a fast divide by eight counter driven by the SPI clock, and held cleared while /CS is high so it's synchronised to the transfer. You'll also need a *fast* SIPO shift register to handle the SPI to parallel conversion.
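As a rough sanity check on those numbers, here is a minimal Python sketch that works out the ideal block transfer time and models what the external logic has to do. It is only an illustration, not Pi driver code: the function names are made up for this example, MSB-first bit order is assumed, and the timing ignores any setup overhead or gaps between bytes.

```python
def transfer_time_s(n_bytes: int, clock_hz: float) -> float:
    """Ideal time to clock n_bytes out over SPI (no overhead, no inter-byte gaps)."""
    return n_bytes * 8 / clock_hz


def to_bits(data: bytes):
    """Yield the bits of data MSB first (assumed bit order on the wire)."""
    for byte in data:
        for i in range(7, -1, -1):
            yield (byte >> i) & 1


def spi_to_parallel(bits):
    """Model a SIPO shift register plus a divide by eight counter.

    The counter starts at zero, as if it had been held cleared by /CS high.
    Each SPI clock shifts one bit in; every eighth clock the counter wraps,
    which is the byte strobe, and the parallel byte is yielded.
    """
    shift = 0
    count = 0
    for bit in bits:
        shift = ((shift << 1) | bit) & 0xFF  # 8-bit shift register
        count = (count + 1) % 8              # divide by eight counter
        if count == 0:                       # byte strobe
            yield shift


if __name__ == "__main__":
    t = transfer_time_s(4096, 32e6)
    print(f"4096 bytes at 32 MHz: {t * 1e3:.3f} ms")
    print([hex(b) for b in spi_to_parallel(to_bits(b"\xA5\x3C"))])
```

At 32MHz each strobe arrives every 250ns, which is the rate the target's parallel input has to keep up with for the whole block.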