Yes, the Ostrich is fast; a 27010 loads in about 4 seconds. Actually, I suspect the bug is in the Ostrich hardware/firmware, not the PC end: it doesn't reliably trigger on the bit masks you send to the emulator.
I used an FTDI FT245R-based board. These are inexpensive (no more than $20) and easy to use. There are 4 pins to monitor with your embedded software or hardware.
They are fast (1 Mbyte/sec) and have a decent FIFO buffer: 128 bytes RX, 256 bytes TX.
I send out the data in an 8 bit stream and have the emulator h/w re-assemble it to 16 bits.
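To illustrate the re-assembly step, here is a rough sketch in Python of what the emulator h/w does with the byte stream. The byte order (low byte first) is an assumption on my part, and the function name is made up for the example.

```python
def bytes_to_words(stream, low_byte_first=True):
    """Pair up an 8-bit stream into 16-bit words.

    low_byte_first is an assumed convention; flip it if the
    link sends the high byte of each word first.
    """
    if len(stream) % 2:
        raise ValueError("stream must contain an even number of bytes")
    words = []
    for i in range(0, len(stream), 2):
        first, second = stream[i], stream[i + 1]
        if low_byte_first:
            words.append((second << 8) | first)
        else:
            words.append((first << 8) | second)
    return words
```

Whichever order you pick, the PC end and the emulator only have to agree on it.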
My emulator is pretty simple in its operation. The micro waits for a control command, like download/upload data, base address and trace window size. Uploads to the emulator are performed with the target power off; otherwise there will be contention, which really slows things down.
There are several SRAM banks in the design. Some monitor reads, some monitor writes and the corresponding data. The address bus is common to all devices and is under micro & CPLD control.
When I'm tracing, I send a command to repeatedly scan through the trace region and see if there have been hits in the desired region (e.g. $2000-$20FF).
This is done using another SRAM. When a system bus write is performed, the corresponding entry in the HIT buffer (which is initially cleared to $00) changes to $01, and the write value is placed in a dedicated write SRAM buffer. If there was a hit, the micro sends out a block of data with the raw address (an offset from $00-$FF) and the data as a raw hex value. The PC end determines the full address by adding the offset (from the HIT buffer) to the base address. The s/w then marks the area RED (write) and updates the value. For a read, a block toggles to GREEN.
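The hit scan and the PC-side bookkeeping can be sketched roughly like this. It is only a model in Python, not my firmware: the function names are made up, the buffers are plain bytearrays standing in for the SRAMs, and clearing the hit flag after reporting is an assumption.

```python
def scan_hits(hit_buf, write_buf):
    """Micro side: return (offset, value) for every flagged location."""
    records = []
    for offset, flag in enumerate(hit_buf):
        if flag == 0x01:
            records.append((offset, write_buf[offset]))
            hit_buf[offset] = 0x00  # assumed: flag is cleared once reported
    return records


def mark_write(view, base, offset, value):
    """PC side: full address = base + offset; store value, colour RED."""
    view[base + offset] = (value, "RED")


def mark_read(view, base, offset):
    """PC side: keep the last known value, toggle the block to GREEN."""
    old_value, _ = view.get(base + offset, (None, None))
    view[base + offset] = (old_value, "GREEN")
```

With a base address of $2000, a hit at offset $10 ends up as an entry for $2010 in the display map.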
There are 2 SRAM 'system' banks. Each is a mirrored copy of the other unless a value has been changed. The CPLD and micro control which one is accessed, swapping when the RAM is disabled (using the target _CE line and a system CS pin). The micro then updates the other bank until another change is made.
Doing it this way reduces the processing overhead and improves trace speed. Obviously the micro & CPLD monitor the R/W, _CE and/or _WR/_OE lines, depending on where your trace region is, and do the update and swap when the chip is not selected.
You may find 16-bit SRAMs easier to use due to their larger size. In that case, you'd just need to select the upper or lower 8 bits of data.
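As a simplified model of the two-bank scheme, something like the following captures the idea. The class and method names are hypothetical, and the real swap is done by the CPLD on the _CE line rather than by a function call. The re-mirroring is also shown as instantaneous here, which it is not in hardware.

```python
class MirroredBanks:
    """Two SRAM banks: the target sees one, the micro edits the other."""

    def __init__(self, size):
        self.banks = [bytearray(size), bytearray(size)]
        self.active = 0    # bank currently visible to the target
        self.pending = {}  # changes held only by the inactive bank

    def target_read(self, addr):
        return self.banks[self.active][addr]

    def host_write(self, addr, value):
        # Changes go to the bank the target is NOT using.
        self.banks[1 - self.active][addr] = value
        self.pending[addr] = value

    def chip_deselected(self):
        # Called when the target has the RAM disabled (_CE inactive).
        if not self.pending:
            return
        self.active = 1 - self.active       # swap banks
        stale = self.banks[1 - self.active]
        for addr, value in self.pending.items():
            stale[addr] = value             # bring the old bank up to date
        self.pending.clear()
```

The target never sees a half-updated bank, because the change only becomes visible at the swap.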
Break your design up into blocks and it will be a lot easier to implement.