BRAM speed shouldn't be a concern, because you can double the effective read rate by reading through both ports.

Interesting, I could use that. Do you have a link showing how to do this?

I do not have a link, but it is not hard to explain. 7-series BRAM can be accessed through two independent ports. If your logic runs at frequency f and the BRAM runs at f/2, then you do the following:

1. De-multiplex the address A, so that you now have Aeven (for even clocks) and Aodd (for odd clocks).

2. Bring Aeven and Aodd from the f clock domain to the f/2 clock domain.

3. After the clock domain crossing, supply Aeven to one of the two BRAM read ports and Aodd to the other read port.

4. After the BRAM read latency has elapsed, you get the data Deven and Dodd from the corresponding BRAM ports.

5. Bring Deven and Dodd from the f/2 clock domain back to the f clock domain.

6. Multiplex Deven and Dodd into the final data vector D, where D is either Deven (for even clocks) or Dodd (for odd clocks).

The whole thing appears as a RAM that works at f, which is twice the BRAM clock of f/2. This, of course, introduces considerable latency, but if you know the addresses up front (as is the case with FFT), this is not a problem.
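
To make the data flow concrete, here is a rough behavioral sketch of the six steps in Python. It is only a tick-level model, not HDL. The function name `simulate_interleaved_reads` and the parameter `slow_read_cycles` are made up for illustration. It assumes that the f/2 clock edges line up with the even ticks of the f clock, and it treats the clock domain crossings as ideal. The latency it reports belongs to this model only and says nothing about a real 7-series design.

```python
from collections import deque


def _read(mem, addr):
    """One BRAM port read; None models 'no address supplied'."""
    return None if addr is None else mem[addr]


def simulate_interleaved_reads(mem, addresses, slow_read_cycles=1):
    """Model one address per fast tick in, one data word per fast tick out.

    Assumptions (not from a datasheet): slow-clock edges coincide with even
    fast ticks, and each port returns data slow_read_cycles slow cycles
    after the address is issued.
    """
    addrs = list(addresses)
    # Model latency in fast ticks: collecting a pair takes 2 ticks, and each
    # slow read cycle is worth another 2 fast ticks.
    latency = 2 + 2 * slow_read_cycles
    in_flight = deque([(None, None)] * slow_read_cycles)
    held_even = held_odd = None   # steps 1-2: de-multiplexed addresses
    d_even = d_odd = None         # step 5: data back in the fast domain
    outputs = []

    for t in range(len(addrs) + latency):
        a = addrs[t] if t < len(addrs) else None
        if t % 2 == 0:
            # Slow-clock edge. Steps 3-4: both ports read in the same cycle.
            in_flight.append((_read(mem, held_even), _read(mem, held_odd)))
            d_even, d_odd = in_flight.popleft()
            held_even, held_odd = a, None
            outputs.append(d_even)    # step 6: even ticks take Deven
        else:
            held_odd = a
            outputs.append(d_odd)     # step 6: odd ticks take Dodd
    return outputs, latency


if __name__ == "__main__":
    mem = [10, 11, 12, 13, 14, 15]
    out, lat = simulate_interleaved_reads(mem, [3, 0, 2, 1, 5])
    print(lat, out)
```

In this model, every address presented at fast tick t produces its data at tick t + latency, so the throughput is one read per fast tick even though each port accepts only one address per slow cycle.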

If you wish, you can run one of the BRAM read ports with a 180-degree phase shift relative to the other (say, one port is clocked by the rising edges of f/2 and the other by the falling edges of f/2), which may help to make the pipeline shorter and thus decrease latency.