Indeed, if you have to saturate and the instruction set has no saturating instructions, that's a colossal waste of time, especially if you need it at multiple steps. But really, the algorithm should be structured so that saturation is needed only as the final step, once per output sample, if at all possible.
Inputs coming from real-world sources are bounded to begin with, and the coefficients are known constants, so all that is needed is to saturate the output (or, even better, to show by worst-case analysis that the output can never wrap around and needs no saturation at all).
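As a minimal sketch of that structure (assuming Q15 samples and coefficients, which is my assumption, not something stated above): keep a wide accumulator so every intermediate sum runs unsaturated, and clamp exactly once at the output.

```c
#include <stdint.h>

#define NTAPS 32

/* Q15 FIR sketch: all intermediate products and sums live in a 64-bit
 * accumulator, so nothing can wrap mid-filter. The single clamp
 * happens once per output sample, at the very end. */
int16_t fir_q15(const int16_t coeff[NTAPS], const int16_t state[NTAPS])
{
    int64_t acc = 0;
    for (int i = 0; i < NTAPS; i++)
        acc += (int32_t)coeff[i] * state[i];  /* 16x16 -> 32-bit products */

    acc >>= 15;                               /* scale back to Q15 */
    if (acc > INT16_MAX) acc = INT16_MAX;     /* saturate once, at the end */
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```

If the worst-case analysis shows the sum of |coefficients| is bounded, the clamp can even be proven dead code and removed entirely.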
That said, even if your processing is in floating point, the output often needs to be saturated or range-checked anyway (for instance, when converting back to the DAC or codec's integer format), so the same work has to be done either way.
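For illustration, a hypothetical float output stage still ends up doing the same one-time clamp:

```c
#include <stdint.h>
#include <math.h>

/* Float pipeline or not, the output stage clamps once when the result
 * goes back to a 16-bit integer format (names here are illustrative). */
static inline int16_t float_to_q15(float x)
{
    x *= 32768.0f;                   /* scale [-1.0, 1.0) to Q15 */
    if (x >  32767.0f) x =  32767.0f;
    if (x < -32768.0f) x = -32768.0f;
    return (int16_t)lrintf(x);       /* round to nearest integer */
}
```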
Floating point sometimes has a performance benefit over integer, because many cores (Cortex-M4F/M7 MCUs, for example) come with far more FPU registers than usable general-purpose integer registers (32 single-precision versus roughly a dozen), and the FPU registers are all free for data, not wasted on trivial things like loop counters and addresses.
For example, I hand-wrote a decimating FIR filter in assembly where all the accumulators lived in single-precision floating-point registers (32 registers is a massive amount of single-cycle storage!), eliminating a lot of loads and stores. With the loops unrolled, the code running from ITCM, and the coefficients loaded from DTCM as a dual-issued (Cortex-M7) parallel operation, the performance is basically that of a classical dedicated DSP, even on "just" a general-purpose MCU.
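A rough C rendition of the register-resident-accumulator idea (this is my sketch, not the original assembly): compute several outputs at once, so each coefficient is loaded from memory once and reused, while the accumulators stay in S registers for the whole loop.

```c
/* Four decimated FIR outputs at once; D is the decimation factor.
 * x points at the sample aligned with the first output, with ntaps-1
 * history samples before it and 3*D samples after it. A decent
 * compiler keeps a0..a3 in FPU registers across the loop. */
void fir_decim4(const float *coeff, int ntaps, int D,
                const float *x, float *y)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (int k = 0; k < ntaps; k++) {
        float c = coeff[k];          /* one coefficient load...        */
        a0 += c * x[0*D - k];        /* ...amortized over four outputs */
        a1 += c * x[1*D - k];
        a2 += c * x[2*D - k];
        a3 += c * x[3*D - k];
    }
    y[0] = a0; y[1] = a1; y[2] = a2; y[3] = a3;
}
```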
But sure, at some point you run out of those 16 (DP) / 32 (SP) registers and can't avoid memory loads and stores. At that point, if you can load four 16-bit values in one go over the 64-bit-wide AXI bus and then use integer SIMD instructions to do whatever you need, the fixed-point version will greatly outperform the floating-point one.
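For instance, on cores with the Armv7E-M DSP extension, the CMSIS `__SMLAD` intrinsic does two 16x16 multiply-accumulates per instruction on packed halfwords; a sketch of the idea (assuming even `n` and 32-bit-aligned buffers):

```c
#include <stdint.h>
#include "cmsis_compiler.h"   /* CMSIS-Core; defines __SMLAD on DSP-capable cores */

/* Packed-SIMD Q15 dot product: each 32-bit load brings in two samples,
 * and __SMLAD performs both multiply-accumulates in one instruction. */
int32_t dot_q15_smlad(const int16_t *x, const int16_t *c, int n)
{
    const uint32_t *xp = (const uint32_t *)x;  /* two Q15 values per word */
    const uint32_t *cp = (const uint32_t *)c;
    int32_t acc = 0;
    for (int i = 0; i < n / 2; i++)
        acc = (int32_t)__SMLAD(xp[i], cp[i], (uint32_t)acc);
    return acc;
}
```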
So the answer really is, it depends.