A project I'm working on right now:
https://github.com/gabriel-tenma-white/fpga-fftI've already beat Xilinx's FFT IP core in performance per resource utilization. With the same specifications (pipelined streaming data input/output, 24 bit data, 16 bit twiddle) the used LUTs are about the same, number of multipliers are slightly lower, but the achieved Fmax is higher than Xilinx's on the same device and speed grade.
The next step is to apply the Bailey's algorithm again but on DRAM to get 4096*4096 = 16M FFTs. The main tricky part will be to transpose data efficiently (or to read data in transposed order) in DRAM. The way to do this is to store your data in a bit-permuted order such that a sub-block in the matrix occupies a page of DRAM (rather than a few lines in the matrix occupying a page), as well as process several columns at once (e.g. read the first 8 values in every matrix row to get the first 8 columns, then perform 8 FFTs). I ran the numbers and it should be possible to make efficient use of each opened page of the DRAM at size 16M on a Zynq-7010.
I'm also looking into adapting this to floating point arithmetic. Floating point multiply is easy, you simply multiply the mantissa and add the exponents, then possibly shift the mantissa left by one position if there's a leading 0 (then drop the leading 1 if you want IEEE format). The additions are harder, you have to use bit shifts or wide multiplexers which are expensive in an FPGA. Looking at the Xilinx FFT cores though, the floating point implementation uses not much more LUTs than the fixed point cores, so it's probably doable.