Author Topic: Hardware accelerated FFT on Kendryte K210  (Read 3568 times)


Offline hamster_nz (Topic starter)

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Hardware accelerated FFT on Kendryte K210
« on: April 16, 2019, 02:09:05 am »
Last night I played around with the FFT accelerator on the Kendryte K210, mainly using the code at https://github.com/MrJBSwe/fft_lcd as a guide. Heavens above, it seems to work!

To actually use it isn't that hard - it is just one call in the SDK:

Code:
fft_complex_uint16_dma(DMAC_CHANNEL0, DMAC_CHANNEL1, FFT_FORWARD_SHIFT, FFT_DIR_FORWARD, buffer_input, FFT_N, buffer_output);

The hard bit is in getting all the data set up correctly, especially with a sparsely documented SDK. The accelerator only works with up to 512 points.

For the input, the development board has an MSM261S4030H0 I2S MEMS mic, which is actually a pretty nifty part. For a couple of dollars plus postage you can get it on a small PCB - https://www.seeedstudio.com/Sipeed-I2S-Mic-for-MAIX-Dev-Boards-p-2887.html

Here's a pic with a 15kHz test tone, and ambient noise:
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline zepan0

  • Contributor
  • Posts: 12
  • Country: cn
Re: Hardware accelerated FFT on Kendryte K210
« Reply #1 on: April 16, 2019, 02:22:22 am »
Nicely done! The hardware FFT supports 64, 128, 256, and 512 points.
MaixPy supports FFT too; here is the documentation:
https://maixpy.sipeed.com/zh/libs/Maix/fft.html
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14431
  • Country: fr
Re: Hardware accelerated FFT on Kendryte K210
« Reply #2 on: April 16, 2019, 02:09:03 pm »
I've done something similar, just a few notes:

* Regarding the FFT accelerator, obviously it's a raw FFT or inverse FFT. In many applications, you'd need to apply a window function before executing the FFT, and there is no provision to do that in an accelerated way as far as I can tell. In my tests, I apply a window function during the data transfer between the incoming digital audio data and the buffer required by the FFT accelerator. That adds some overhead. It would have been nice to have this built into the FFT accelerator.
* In a related vein, the required data transfer phase itself adds significant overhead. If you're only using a real FFT, that's a lot of unnecessary overhead. It would have been nice to have a "real-only" mode (no imaginary part in the input buffer) in the FFT accelerator to avoid this costly transfer.
* Some figures: I set PLL0 to 520MHz in my tests (I tried pushing it to 600MHz, but there seem to be some random problems at that frequency, maybe due to power supply problems or something else, not sure), so by default both the CPU and FFT accelerator are clocked at 520MHz. I timed different portions of the code. At 520MHz, the 512-point FFT itself takes 16µs, which is not too shabby. Unfortunately, the preliminary data transfer takes almost as much!! It's probably feasible to rewrite this part in assembly to improve things a bit (I compiled with the -O3 option and looked at the resulting assembly; it could be further optimized but there is not much room for that...). I think it's a bit disappointing. Again, a special "real-FFT" mode would have been nice so we could directly use the input buffer instead of having to transfer it to a complex number array. We can probably work around that by storing the 16-bit input samples as 32-bit numbers from the start and passing that directly to the FFT accelerator (given the right alignment). I would have to test that.
* Regarding the fft_lcd example I started with (but modified quite a bit), I noticed that the formula used to compute the power spectrum in dBFS is kind of weird, and uses a log() function, which in standard C is NOT the base-10 log but the natural log! The correct one would be log10(). It's no big deal for testing purposes of course, but it actually gets you a scale that will make you think the noise floor is a lot lower than it actually is. I modified the computation with what I believe is correct, and you can see a lot more noise along the whole spectrum. You also start to see big sidelobes when using a single-frequency test signal, and that's when you realize applying a window function is necessary, and can see its effect. As a small additional note, the computation in the original source code first applies a sqrt() then a log(), which is wasteful: obviously you can do without the sqrt(). In any case, the most expensive function is the log() (or log10()): the part that computes the power spectrum in dBFS (which again is not exactly dBFS in the example IMO) takes a lot more time to execute than the FFT itself (approx. 200µs or so)!
* Regarding the LCD itself, with the MAIX DOCK board, I noticed that I had to lower the SPI frequency to 15MHz to make it work reliably. The original 20MHz would tend to get me a corrupted image especially when using big DMA data transfers (such as full screen update in one chunk). Probably a signal integrity problem, but I didn't find any datasheet or reference for the LCD unit itself (only the controller)...

To time things inside the code, use the read_cycle() function, and sysctl_clock_get_freq(SYSCTL_CLOCK_CPU) to convert the cycle counts to time values when needed.
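As a rough illustration of that conversion, here is a minimal sketch in portable C. The helper name cycles_to_us is made up for this example; only read_cycle() and sysctl_clock_get_freq(SYSCTL_CLOCK_CPU) come from the SDK, and they appear in a comment only so the snippet compiles off-target.

```c
#include <stdint.h>

/* Hypothetical helper: convert a cycle-counter difference to microseconds.
   cpu_hz is the CPU clock in Hz, e.g. 520000000 for the 520MHz setup above. */
static double cycles_to_us(uint64_t start, uint64_t end, uint32_t cpu_hz)
{
    return (double)(end - start) * 1e6 / (double)cpu_hz;
}

/* Intended use on the K210 (SDK calls named in the post, not compiled here):
 *
 *   uint64_t t0 = read_cycle();
 *   ... code under test ...
 *   uint64_t t1 = read_cycle();
 *   double us = cycles_to_us(t0, t1, sysctl_clock_get_freq(SYSCTL_CLOCK_CPU));
 */
```

For scale, 16µs at 520MHz corresponds to 8320 CPU cycles.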
« Last Edit: April 16, 2019, 02:32:34 pm by SiliconWizard »
 

Offline hamster_nzTopic starter

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Hardware accelerated FFT on Kendryte K210
« Reply #3 on: April 16, 2019, 11:48:09 pm »
Yeah, I looked at windowing for a future version of my code. I just wanted to see how the H/W worked, so I didn't start investigating it.

The data transfer phase is interesting if you want to speed things up: it means multiplying 16-bit samples by a window value, then packing two into a 64-bit word with zeros for the I values. Some optimization could really help speed this up - I should have a look at what code GCC generates.
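A minimal sketch of that window-and-pack step, in portable C so it can be tried off-target. The struct complex_pair_t and the function window_and_pack are hypothetical names for this example. The field order (I1, R1, I2, R2) is an assumption and should be checked against fft_data_t in the SDK's fft.h. The window table is supplied by the caller as Q15 coefficients in the range 0..32767.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the accelerator's 64-bit input word holding two
   complex samples. The field order is an assumption; verify it against
   fft_data_t in your SDK before relying on it. */
typedef struct {
    int16_t I1;
    int16_t R1;
    int16_t I2;
    int16_t R2;
} complex_pair_t;

/* Multiply each real sample by its Q15 window coefficient (0..32767) and
   pack two samples per 64-bit word with the imaginary parts set to zero.
   n is the number of real samples and must be even; out holds n/2 words. */
static void window_and_pack(const int16_t *samples, const int16_t *window_q15,
                            size_t n, complex_pair_t *out)
{
    for (size_t i = 0; i < n / 2; ++i) {
        /* The 32-bit product cannot overflow; dividing by 32768 undoes the
           Q15 scaling and truncates toward zero. */
        int32_t a = (int32_t)samples[2 * i] * window_q15[2 * i] / 32768;
        int32_t b = (int32_t)samples[2 * i + 1] * window_q15[2 * i + 1] / 32768;
        out[i].R1 = (int16_t)a;
        out[i].I1 = 0;
        out[i].R2 = (int16_t)b;
        out[i].I2 = 0;
    }
}
```

Doing the window multiply inside the packing loop means each sample is touched only once, which matches the approach described earlier in the thread of applying the window during the transfer.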

For MrJBSwe's code, the use of log() vs log10() didn't make much of a difference as the results are scaled to fit on the display, but as log(sqrt(x)) = 0.5 * log(x), things could be made quicker by making things less explicit (folding the sqrt() into the scale factor).

I've not had any issue with the display, but I am not doing multiple DMA transfers at the same time (which would be needed for streaming audio while processing FFTs and updating the display, for example)
« Last Edit: April 17, 2019, 12:39:37 am by hamster_nz »
 

