I've done something similar, just a few notes:
* Regarding the FFT accelerator: it only does a raw FFT or inverse FFT. In many applications you need to apply a window function before running the FFT, and as far as I can tell there is no provision to do that in an accelerated way. In my tests, I apply the window while copying the incoming digital audio data into the buffer required by the FFT accelerator. That adds some overhead. It would have been nice to have windowing built into the FFT accelerator.
* In a related vein, the required data transfer phase itself adds significant overhead. If you only need a real FFT, that's a lot of unnecessary overhead. It would have been nice to have a "real-only" mode (no imaginary part in the input buffer) in the FFT accelerator, to avoid this costly transfer.
* Some figures: I set PLL0 to 520MHz in my tests (I tried pushing it to 600MHz, but there seem to be some random problems at that frequency, maybe due to power supply issues or something else, not sure), so by default both the CPU and the FFT accelerator are clocked at 520MHz. I timed different portions of the code. At 520MHz, the 512-point FFT itself takes 16µs, which is not too shabby. Unfortunately, the preliminary data transfer takes almost as much! It's probably feasible to rewrite this part in assembly to improve things a bit (I compiled with the -O3 option and looked at the resulting assembly; it could be optimized further, but there is not much room for that...). I think it's a bit disappointing. Again, a special "real-FFT" mode would have been nice, so we could use the input buffer directly instead of having to transfer it to a complex number array. We can probably work around that by storing the 16-bit input samples as 32-bit numbers from the start and passing that directly to the FFT accelerator (given the right alignment). I would have to test that.
* Regarding the fft_lcd example I started with (but modified quite a bit), I noticed that the formula used to compute the power spectrum in dBFS is kind of weird: it uses the log() function, which in standard C is NOT the base-10 log but the natural log! The correct one would be log10(). It's no big deal for testing purposes of course, but since ln(x) is about 2.3 times log10(x), you get a stretched scale that makes you think the noise floor is a lot lower than it actually is. I modified the computation to what I believe is correct, and you can see a lot more noise along the whole spectrum. You also start to see big sidelobes when using a single-frequency test signal, and that's when you realize applying a window function is necessary, and can see its effect. As a small additional note, the computation in the original source code first applies a sqrt() and then a log(), which is wasteful: the log of a square root is just half the log, so you can drop the sqrt() and halve the multiplier instead. In any case, the most expensive function is the log() (or log10()): the part that computes the power spectrum in dBFS (which again is not exactly dBFS in the example IMO) takes a lot more time to execute than the FFT itself (approx. 200µs or so)!
* Regarding the LCD itself, with the MAIX DOCK board, I noticed that I had to lower the SPI frequency to 15MHz to make it work reliably. The original 20MHz would tend to give me a corrupted image, especially with big DMA data transfers (such as a full-screen update in one chunk). It's probably a signal integrity problem, but I didn't find any datasheet or reference for the LCD unit itself (only for the controller)...
To time things inside the code, use the read_cycle() function, and use sysctl_clock_get_freq(SYSCTL_CLOCK_CPU) to convert the cycle counts to time values when needed.
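A small sketch of the conversion, in case it helps. `cycles_to_us` is a helper name I made up; only read_cycle() and sysctl_clock_get_freq() come from the SDK, and they appear only in the comment so the snippet compiles on its own.

```c
#include <stdint.h>

/* Convert the difference between two cycle counter readings to
 * microseconds, given the CPU clock in Hz. */
double cycles_to_us(uint64_t start, uint64_t end, uint32_t cpu_hz)
{
    return (double)(end - start) * 1e6 / (double)cpu_hz;
}

/* On the target, usage would look roughly like this (not compiled here):
 *
 *   uint64_t t0 = read_cycle();
 *   ... code under test ...
 *   uint64_t t1 = read_cycle();
 *   uint32_t hz = sysctl_clock_get_freq(SYSCTL_CLOCK_CPU);
 *   printf("%.1f us\n", cycles_to_us(t0, t1, hz));
 */
```

As a sanity check against the figures above, 16µs at 520MHz corresponds to 8320 cycles.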