Author Topic: Starting with STM32 (NUCLEO-L412KB) (Read 10143 times)

Picuino · « **Reply #100 on:** May 20, 2024, 04:01:02 pm »

Impulse response:

Picuino · « **Reply #101 on:** May 20, 2024, 04:04:27 pm »

Step (of value = 100000) response:

Now the scale is OK.

Picuino · « **Reply #102 on:** May 20, 2024, 04:05:29 pm »

STM32 program:

Code: [Select]

        // Define variables
        q31_t pSrc[100];
        q31_t pDst[100];
        arm_biquad_casd_df1_inst_q31 S;
        q31_t pCoeffs[] = {44314902, 88629804, 44314902, 1448274717, -551792502};
               // { 0x02A43116, 0x0548622C, 0x02A43116, 0xA9AD14E3, 0x20E3AF76 };
        q31_t pState[4];  // DF1 needs 4 state words per biquad stage

        uart_init();
        while (1) {
            // Initialize buffer pSrc
            for (int i = 0; i < 100; i++)
                pSrc[i] = 0;
            //pSrc[0]= 0x40000000;
            for (int i = 20; i < 60; i++)  pSrc[i] = 100000;

            // Process data
            arm_biquad_cascade_df1_init_q31(&S, 1, pCoeffs, pState, 1);
            GPIOA->BSRR = (1 << 3); // Set PA3
            arm_biquad_cascade_df1_q31(&S, pSrc, pDst, 100);
            GPIOA->BRR = (1 << 3);  // Reset PA3

            // Output data
            for (int i = 0; i < 100; i++) {
                printf("%ld\r\n", pDst[i]);
                HAL_Delay(2);
            }
            HAL_Delay(2000);
        }

The processing time stays at 42us for 100 samples.

gf · « **Reply #103 on:** May 20, 2024, 04:26:22 pm »

Gain is basically correct now, but there is still a problem with a full step (-2147483648 -> 2147483647): the overshoot will then exceed the q31 range, and the response will go crazy. If you want the filter to withstand a full step, the gain would need to be reduced. This could be done by scaling down all three b parameters by the same factor (say 0.8, or whatever is necessary to bring the overshoot into the -1...1 range).
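A minimal sketch of that gain reduction, assuming the five-element coefficient layout {b0, b1, b2, a1, a2} used in the program above. The helper name scale_b_q31 and the rational factor num/den are illustrative only; this is not a CMSIS function.

```c
#include <stdint.h>

/* Hypothetical helper (not part of CMSIS): scale only the three
 * feed-forward terms b0, b1, b2 of one biquad stage by num/den.
 * The feedback terms a1, a2 stay untouched, so the poles do not
 * move and only the overall gain changes. */
void scale_b_q31(int32_t coeffs[5], int32_t num, int32_t den)
{
    for (int i = 0; i < 3; i++) {
        /* 64-bit intermediate avoids overflow of coeff * num */
        coeffs[i] = (int32_t)(((int64_t)coeffs[i] * num) / den);
    }
}
```

Calling scale_b_q31(pCoeffs, 4, 5) would apply the factor 0.8 mentioned above; the factor that is really needed depends on the overshoot of the particular filter.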

EDIT:

And keep in mind: The smaller the signal, the larger the (relative) numerical error.
Note that your step amplitude of 100000 in Q31 corresponds to only 4.6566e-05.
That's less than one ADC count after scaling the ADC full-scale range to Q31.
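To put numbers on this, here is a small sketch that maps a 12-bit ADC reading onto the full Q31 range. It assumes an unsigned reading (0..4095) with mid-scale at 2048; the names adc12_to_q31 and ADC12_LSB_Q31 are made up for the illustration.

```c
#include <stdint.h>

/* One LSB of a 12-bit ADC after mapping its full-scale range
 * onto Q31: 2^32 / 2^12 = 2^20 = 1048576. */
#define ADC12_LSB_Q31 ((int32_t)1 << 20)

/* Assumption: raw is an unsigned 12-bit reading (0..4095) with
 * mid-scale at 2048. The result spans the whole Q31 range. */
int32_t adc12_to_q31(uint16_t raw)
{
    return ((int32_t)raw - 2048) * ADC12_LSB_Q31;
}
```

With this mapping a single ADC count is 1048576 in Q31, so the test step of 100000 is roughly a tenth of one count.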

Picuino · « **Reply #104 on:** May 20, 2024, 05:05:14 pm »

No problem.
I will multiply the value of two ADCs (12bits x 12bits = 24bits) and perhaps add several samples (16) before applying the filter. This produces 28-bit values (worst case), which will not saturate the filter and, I hope, will be sufficiently large values.
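A rough sketch of that multiply-and-accumulate step, to check the bit growth. It assumes raw unsigned 12-bit readings from both ADCs; the function name mac_block_u12 is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply two 12-bit ADC streams sample by sample and sum n products.
 * Each product needs up to 24 bits; summing n = 16 of them adds
 * log2(16) = 4 bits, so the result fits in 28 bits. */
uint32_t mac_block_u12(const uint16_t *a, const uint16_t *b, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (uint32_t)a[i] * (uint32_t)b[i];
    }
    return acc;
}
```

In the worst case (all readings at 4095, n = 16) the sum is 268304400, just below 2^28 = 268435456.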

The problem now is to downsample from 1.333 MHz or higher to an output frequency of 10 kHz at most. In slower cases, I will need an output frequency lower than 1 Hz and I don't know how the filter will behave in that case. Perhaps I may need to sum several blocks of input samples before applying the output filter.

Picuino · « **Reply #105 on:** May 20, 2024, 05:37:20 pm »

As the filter has a lower cutoff frequency, the coefficients become smaller and smaller and, therefore, resolution is lost.
There comes a time when the coefficients are very small and become almost zero.

Code: [Select]

   // Cutoff frequency = 10000
   coeff = { 0x0008CE29, 0x00119C53, 0x0008CE29, 0x7BBC3A97, 0xC4208CC3 };

   // Cutoff frequency = 1000
   coeff = { 0x00001738, 0x00002E70, 0x00001738, 0x7F92C8E9, 0xC06CDA36 };

   // Cutoff frequency = 100
   coeff = { 0x0000003C, 0x00000077, 0x0000003C, 0x7FF51415, 0xC00AEAFD };

   // Cutoff frequency = 10
   coeff = { 0x00000001, 0x00000001, 0x00000001, 0x7FFEE868, 0xC0011795 };

   // Cutoff frequency = 1
   coeff = { 0x00000000, 0x00000000, 0x00000000, 0x7FFFE40A, 0xC0001BF6 };

Script:

Code: [Select]

#
# Python script to calculate Butterworth filter coefficients
# 
from scipy import signal

fs = 1333000   # Sample rate

for fc in [10000, 1000, 100, 10, 1]:   # Cutoff frequencies
    b, a = signal.butter(2, fc, btype='low', analog=False, output='ba', fs=fs)
    coeff = list(b) + list(-a)[1:]                             # {b0, b1, b2, -a1, -a2}
    coeff = [round(c * (2 ** 30)) for c in coeff]              # Q31 with postShift = 1
    coeff = [c if c >= 0 else c + 0x100000000 for c in coeff]  # two's complement for hex output
    print()
    print("   // Cutoff frequency = %d" % fc)
    print("   coeff = { 0x%08X, 0x%08X, 0x%08X, 0x%08X, 0x%08X };" % tuple(coeff))

Picuino · « **Reply #106 on:** May 20, 2024, 06:26:23 pm »

High Precision Q31 Biquad Cascade Filter
https://arm-software.github.io/CMSIS-DSP/latest/group__BiquadCascadeDF1__32x64.html

Code: [Select]

void arm_biquad_cas_df1_32x64_q31 (const arm_biquad_cas_df1_32x64_ins_q31 *S, const q31_t *pSrc, q31_t *pDst, uint32_t blockSize)

Quote

This function implements a high precision Biquad cascade filter which operates on Q31 data values. The filter coefficients are in 1.31 format and the state variables are in 1.63 format. The double precision state variables reduce quantization noise in the filter and provide a cleaner output. These filters are particularly useful when implementing filters in which the singularities are close to the unit circle. This is common for low pass or high pass filters with very low cutoff frequencies.

EDIT:

Python script with postShift handling:

Code: [Select]

#
# Python script to calculate Butterworth filter coefficients
#
from scipy import signal

fs = 1333000   # Sample rate
fc = 100000    # Cutoff frequency
postShift = 0  # Post shift gain

b, a = signal.butter(2, fc, btype='low', analog=False, output='ba', fs=fs)
coeff = list(b) + list(-a)[1:]
# Halve the coefficients until all of them fit the Q31 range [-1, 1)
while max(coeff) >= 1.0 or min(coeff) < -1.0:
    postShift += 1
    coeff = [c/2 for c in coeff]
# The coefficients are already divided by 2**postShift, so scale by 2**31 only
coeff = [round(c * (2 ** 31)) for c in coeff]
coeff = [c if c >= 0 else c + 0x100000000 for c in coeff]
print("   // Sample rate = %0.2f" % fs)
print("   // Cutoff frequency = %0.2f" % fc)
print("   postShift = %d;" % postShift)
print("   q31_t pCoeffs[] = { 0x%08X, 0x%08X, 0x%08X, 0x%08X, 0x%08X };" % tuple(coeff))

SiliconWizard · « **Reply #107 on:** May 20, 2024, 09:29:23 pm »

Sorry if you discussed it before - but doesn't the STM32L4 have an FPU? Are you sure any of these fixed-point implementations will be faster than using floating point?

gf · « **Reply #108 on:** May 20, 2024, 10:15:13 pm »

Quote from: SiliconWizard on May 20, 2024, 09:29:23 pm

Sorry if you discussed it before - but doesn't the STM32L4 have an FPU? Are you sure any of these fixed-point implementations will be faster than using floating point?

I would not be sure either that fixed point is really faster.

Floating point normalization certainly solves scaling issues.
But I have doubts that (single precision) FP solves IIR precision issues, since mantissa precision is only 24 bits.
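A quick way to see the mantissa limit is to push a 32-bit coefficient through a float and back. This is only a sketch: the helper name q31_through_float is invented, and it assumes the value stays far enough below 2^31 that the rounded float still converts back without overflow.

```c
#include <stdint.h>

/* Round-trip a Q31 integer through single precision. A float has a
 * 24-bit significand, so integers above 2^24 lose low-order bits.
 * Assumption: |v| is small enough that the rounded float still fits
 * into int32_t when converted back. */
int32_t q31_through_float(int32_t v)
{
    volatile float f = (float)v;   /* volatile: force a real float store */
    return (int32_t)f;
}
```

For example, the feedback coefficient 0x7FFFE40A (2147476490) from the 1 Hz case above comes back as 2147476480: the lowest 7 bits are rounded away.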

dietert1 · « **Reply #109 on:** May 21, 2024, 05:11:52 am »

For a lock-in, one uses a direct receiver, that is, signal detection by mixing the input signal with a carrier of known frequency and phase. This mixing with a synthetic carrier of adjustable phase happens at the ADC sampling rate: one multiplication per ADC sample.
The output of that mixer is bandwidth limited at the carrier frequency. Any meaningful sampling rate won't be higher than the carrier frequency, apart from the Nyquist factor of 2. So there should be some downsampling, e.g. using a boxcar.
As far as I understand, the filter design discussed here is meant to reduce the detection bandwidth even further, e.g. to 1 Hz with a 1 kHz carrier. In this case I see the need for precision, but I don't see the need for high speed. At 10 kHz carrier frequency the rate will be 20 kHz (50 usec).
For an IIR filter I would try to use the FPU. If the dynamic range is a problem, one can get an F7 or H7 with the double-precision FPU. Others get the precision by using two or three filter stages running at ever lower rates, where each stage reduces bandwidth and rate by at most a factor of 10 or so. As far as I remember, Cortex-M supports filters with a double precision accumulator (32 x 32 to 64 multiply and add into a 64-bit accumulator).

Regards, Dieter

Picuino · « **Reply #110 on:** May 21, 2024, 07:58:47 am »

Yes, what I am going to do is to multiply some 32 or 64 samples of the two ADCs and add the result in an accumulator. The output will be a buffer with the data from the accumulators, which should come out of the DAC at a rate of about 100kHz.
To that 100kHz buffer is where I should apply the filter or filters and, in the case of slower outputs, reduce the number of points.

What is not clear to me is how to reduce the number of points of the DAC after a filter. Do I add several points again to generate a single point?

gf · « **Reply #111 on:** May 21, 2024, 09:00:53 am »

Quote from: Picuino on May 21, 2024, 07:58:47 am

Yes, what I am going to do is to multiply some 32 or 64 samples of the two ADCs and add the result in an accumulator. The output will be a buffer with the data from the accumulators, which should come out of the DAC at a rate of about 100kHz.

What you have in mind is basically a boxcar filter (moving average filter) with 32...64 taps, applied to the stream of multiplied samples, followed by downsampling from 1333kSa/s to ~100kSa/s (factor 13x) by picking only every 13th sample and discarding the samples in between. And obviously you want a filter length which is larger than the decimation factor.

If the number of boxcar taps is an integer multiple of the decimation factor, then the cheapest way to do that is a CIC decimator (e.g. 13x5 -> 65 taps, or 16x4 -> 64 taps for fs_out=fs/16=83kSa/s). See https://www.dsprelated.com/showarticle/1337.php.
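For illustration, a single-stage CIC decimator could look roughly like the sketch below: one integrator at the input rate, decimation by R, then one comb with a delay of N output samples, which together give a boxcar of R*N taps. All names (cic1_t, cic1_init, cic1_push, CIC_MAX_N) are invented for this example, and it assumes the input stream is zero before the first sample.

```c
#include <stddef.h>
#include <stdint.h>

#define CIC_MAX_N 8   /* hypothetical upper limit for the comb delay */

typedef struct {
    uint64_t integ;             /* integrator, runs at the input rate  */
    uint64_t delay[CIC_MAX_N];  /* comb delay line, runs at fs/R       */
    size_t   n;                 /* comb delay in output samples        */
    size_t   idx;               /* current position in the delay line  */
    size_t   r;                 /* decimation factor R                 */
    size_t   phase;             /* input samples since the last output */
} cic1_t;

/* Caller must keep r >= 1 and 1 <= n <= CIC_MAX_N. */
void cic1_init(cic1_t *c, size_t r, size_t n)
{
    c->integ = 0;
    c->n = n;
    c->idx = 0;
    c->r = r;
    c->phase = 0;
    for (size_t i = 0; i < CIC_MAX_N; i++) c->delay[i] = 0;
}

/* Feed one input sample. Returns 1 and writes *out on every R-th call;
 * *out is then the sum of the last R*n inputs (boxcar with R*n taps). */
int cic1_push(cic1_t *c, int32_t x, int64_t *out)
{
    /* unsigned accumulator: wrap-around is well defined and cancels
     * in the subtraction below */
    c->integ += (uint64_t)(int64_t)x;
    if (++c->phase < c->r) return 0;
    c->phase = 0;
    *out = (int64_t)(c->integ - c->delay[c->idx]);
    c->delay[c->idx] = c->integ;
    c->idx = (c->idx + 1) % c->n;
    return 1;
}
```

With cic1_init(&c, 16, 4) this corresponds to the 16x4 = 64-tap case. Per input sample the cost is one addition, plus one subtraction per output sample, independent of the number of taps.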

EDIT:

Attached is the frequency response plot of a 64-tap boxcar filter @1333kSa/s. Check yourself what this means for the suppression of the carrier and carrier harmonics if the carrier frequency can be arbitrary. Also keep in mind that any frequencies which pass through the filter (with more or less attenuation) are folded down to frequencies < fs/R/2 by the downsampling (where fs is the original sample rate and R is the downsampling factor). If the carrier frequency happens to be "unfavorable", the folded frequencies can even fall into the 0...10kHz region of interest.

[ With your milliohm meter, the length of the boxcar was an integer multiple of the carrier period. Then the carrier and carrier harmonics fall into zeros of the filter's frequency response and are rejected completely. But this is no longer guaranteed if the carrier frequency can be arbitrary. ]

For comparison I also added a plot with the frequency response of a "proper" FIR downsampling filter which avoids aliasing (with a stopband attenuation of ~65dB; more than that is possible, too, with more taps). 9x13=117 taps means that 9 taps must be calculated for each source sample when the downsampling factor is 13. With CMSIS, you could use arm_fir_decimate_q31() to do the filtering and decimation.

Picuino · « **Reply #112 on:** May 21, 2024, 10:56:14 am »

There is a very simple detail that I don't quite understand.
It seems that a Boxcar filter or a CIC filter is a filter that adds the previous N samples, so the algorithm would be something like this:

Code: [Select]

out[10] =                            in[10] + in[9] + in[8] + in[7]
out[11] =                   in[11] + in[10] + in[9] + in[8]
out[12] =          in[12] + in[11] + in[10] + in[9]
out[13] = in[13] + in[12] + in[11] + in[10]
and so on...

This means that the number of output samples is equal to the number of input samples, but filtered.

However what I need is to reduce the number of samples (for example from 1333kHz to 133kHz).

The only way I can think of to do this is to sum blocks:

Code: [Select]

out[1] =                                     in[10] + in[9] + in[8] + in[7]
out[2] = in[14] + in[13] + in[12] + in[11]

Is there any other way to make decimation?

gf · « **Reply #113 on:** May 21, 2024, 11:18:34 am »

Quote from: Picuino on May 21, 2024, 10:56:14 am

There is a very simple detail that I don't quite understand.
It seems that a Boxcar filter or a CIC filter is a filter that adds the previous N samples, so the algorithm would be something like this:
Code: [Select]

out[10] = in[10] + in[9] + in[8] + in[7]
out[11] = in[11] + in[10] + in[9] + in[8]
out[12] = in[12] + in[11] + in[10] + in[9]
out[13] = in[13] + in[12] + in[11] + in[10]
and so on...

This means that the number of output samples is equal to the number of input samples, but filtered.

And the next step after filtering is downsampling, i.e. you keep only every (say) 10th sample of out[] and discard the 9 samples in between. Of course you can optimize: You do not need to calculate those filtered samples, which are discarded in the next step.
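As a sketch of that optimization: the function below evaluates the boxcar sum only at the positions that survive the decimation. The name boxcar_decimate and its argument layout are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Boxcar filter with `taps` taps followed by decimation by r, computing
 * only the outputs that are kept. Output k is the sum of
 * in[k*r] .. in[k*r + taps - 1]. Returns the number of outputs written.
 * Caller must pass r >= 1 and an `out` buffer that is large enough. */
size_t boxcar_decimate(const int32_t *in, size_t n_in,
                       size_t taps, size_t r, int64_t *out)
{
    size_t n_out = 0;
    for (size_t start = 0; start + taps <= n_in; start += r) {
        int64_t acc = 0;
        for (size_t i = 0; i < taps; i++) acc += in[start + i];
        out[n_out++] = acc;
    }
    return n_out;
}
```

With taps equal to r this reduces to the block sum from the question; with taps larger than r the windows overlap, which gives a longer filter at the same output rate.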

Picuino · « **Reply #114 on:** May 21, 2024, 11:25:21 am »

https://arm-software.github.io/CMSIS_5/DSP/html/group__FIR__decimate.html#ga6a19d62083e85b3f5e34e8a8283c1ea0
https://arm-software.github.io/CMSIS_5/DSP/html/group__FIR__decimate.html#ga27c05d7892f8a327aab86fbfee9b0f29

Thank you very much for your help.
Do you know what would be the way to obtain the FIR filter coefficients?

Code: [Select]

arm_status arm_fir_decimate_init_q31 (
		arm_fir_decimate_instance_q31 *  	S,
		uint16_t  	numTaps,
		uint8_t  	M,
		const q31_t *  	pCoeffs,
		q31_t *  	pState,
		uint32_t  	blockSize 
	) 

Parameters
    [in,out]	S	points to an instance of the Q31 FIR decimator structure
    [in]	numTaps	number of coefficients in the filter
    [in]	M	decimation factor
    [in]	pCoeffs	points to the filter coefficients
    [in]	pState	points to the state buffer
    [in]	blockSize	number of input samples to process

Returns
    execution status

        ARM_MATH_SUCCESS : Operation successful
        ARM_MATH_LENGTH_ERROR : blockSize is not a multiple of M

Details
    pCoeffs points to the array of filter coefficients stored in time reversed order:

        {b[numTaps-1], b[numTaps-2], b[numTaps-3], ..., b[1], b[0]}

    pState points to the array of state variables. pState is of length numTaps+blockSize-1 words where blockSize is the number of input samples passed to arm_fir_decimate_q31(). M is the decimation factor.
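Two small helpers that follow from the details above, as a sketch only. Neither fir_decimate_state_len nor reverse_coeffs_q31 is part of CMSIS; both names are made up here.

```c
#include <stddef.h>
#include <stdint.h>

/* Length of the state buffer in 32-bit words: numTaps + blockSize - 1. */
size_t fir_decimate_state_len(size_t num_taps, size_t block_size)
{
    return num_taps + block_size - 1;
}

/* Reverse a coefficient array in place, so that b[0] ends up last
 * (the time reversed order described in the documentation). */
void reverse_coeffs_q31(int32_t *b, size_t num_taps)
{
    for (size_t i = 0; i < num_taps / 2; i++) {
        int32_t tmp = b[i];
        b[i] = b[num_taps - 1 - i];
        b[num_taps - 1 - i] = tmp;
    }
}
```

For 80 taps and a block of 1000 input samples the state buffer needs 1079 words. For a symmetric (linear phase) coefficient set the reversal changes nothing.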

gf · « **Reply #115 on:** May 21, 2024, 11:49:44 am »

Quote from: Picuino on May 21, 2024, 11:25:21 am

https://arm-software.github.io/CMSIS_5/DSP/html/group__FIR__decimate.html#ga6a19d62083e85b3f5e34e8a8283c1ea0
https://arm-software.github.io/CMSIS_5/DSP/html/group__FIR__decimate.html#ga27c05d7892f8a327aab86fbfee9b0f29

Thank you very much for your help.
Do you know what would be the way to obtain the FIR filter coefficients?

For 10x decimation, try this one:

Code: [Select]

pkg load signal
R = 10      % decimation factor
BW = 10     % end of passband (start of transition band), kHz
fs =  1333  % sample rate, kSa/s
ntaps = 80  % number of taps
h = remez(ntaps-1,[0 BW fs/2/R fs/2]/(fs/2),[1 1 0 0]);
% plot frequency response
[H,f] = freqz(h,1,10000,fs);
plot(f,20*log10(abs(H)))
grid on
ylim([-70 0])
% scale h to Q31
int32(round(h*2^32))

The question is how fast this function is.

[ If it is too slow, the decimation could be split into several stages, using half-band filters for the first stages. Only for the last stage, the stopband must start below Nyquist. For the above case, the first decimation-by-2 stage would need only 6 taps or so. The function for the first stage could also be hand-optimized in order to avoid some of the overhead of the generic CMSIS function. ]

dietert1 · « **Reply #116 on:** May 21, 2024, 12:19:55 pm »

Let's say I want to use an exponential running average as a low-pass filter before downsampling, e.g.
Yn = 0.000 002 * Xn + 0.999 998 * Yn-1
If Xn is a 12-bit ADC value, I will be adding zeros. But I can rewrite the formula (with Yn now scaled up by a factor of 500 000) as
Yn = Xn + Yn-1 - Yn-1 / 500 000
Is a 32-bit integer unit good enough to do this? I think it can work, and the operation can be done at the ADC rate. Maybe one should use a right shift instead of the division, using a power of 2 instead of an arbitrary number.

If one wants to input a 12 * 12 product into the filter, one can extend the arithmetic to 64 bits using the same idea. That takes twice the number of operations per cycle but can still run at the ADC rate.
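The shift variant could be sketched as follows. It assumes an unsigned 12-bit input and replaces the divisor 500 000 by 2^k (k = 19 gives 524288); the type ema_u32_t and the function ema_update are names invented for the example.

```c
#include <stdint.h>

/* Exponential running average with a power-of-two time constant:
 *   acc[n] = acc[n-1] + x[n] - (acc[n-1] >> k)
 * acc holds the output scaled by 2^k, so the filtered value is acc >> k.
 * Assumption: x is an unsigned 12-bit sample and k <= 19, so that
 * acc needs at most 12 + 19 = 31 bits. */
typedef struct {
    uint32_t acc;   /* scaled filter state */
    unsigned k;     /* shift, time constant is about 2^k samples */
} ema_u32_t;

uint32_t ema_update(ema_u32_t *f, uint16_t x)
{
    f->acc = f->acc + x - (f->acc >> f->k);
    return f->acc >> f->k;
}
```

This costs one shift, one add and one subtract per sample. For the 24-bit products the same structure would need a 64-bit accumulator.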

Regards, Dieter

Picuino · « **Reply #117 on:** May 21, 2024, 01:14:16 pm »

Within a week I will receive the board with the other micro (STM32G431KB), which is the one I am going to use in the end.
This other model has a maximum speed of 4Msps.

In practice I will set the main clock speed to 170MHz and the ADC clock speed to 1/4, which gives me a conversion speed of 170/4/15 = 2.833Msps.

I am not going to do hardware oversampling because that only serves to increase the number of bits of resolution and I already checked that when taking data, with all the noise produced by the instrumentation amplifier, the results do not improve by increasing the number of bits of the ADC.
I prefer to take many samples per second and filter after multiplying the two signals.
This sampling speed is too large to apply a filter with so many parameters (80), which is very slow.

To start I will try to multiply the two ADC signals and add 8 results to decimate the sampling rate to 354kHz.
I will try to apply the filter at this lower frequency.

I have no idea about the speed of the other processor (STM32G431KB). In principle it has more clock speed and also has instructions to accelerate the digital filters. Until it arrives to me (around the 29th) I can't test it.

Picuino · « **Reply #118 on:** May 21, 2024, 01:26:08 pm »

I'm going to try programming on my current board (STM32L412KB) just to get a rough idea of the timing.

Picuino · « **Reply #119 on:** May 21, 2024, 01:42:53 pm »

Input buffer = 1000 samples
Output buffer = 100 samples

process time = 936us @ 80MHz with STM32L412KB

Program:

Code: [Select]

        // Define variables
        q31_t pSrc[1000];
        q31_t pDst[1000];
        q31_t pCoeffs[] = { 1943208, 464978, 305641, -49046, -638438, -1497324,
                -2650941, -4109235, -5862456, -7878098, -10098245, -12435856,
                -14770057, -16950103, -18809941, -20144286, -20744894,
                -20388173, -18853952, -15933574, -11440900, -5223349, 2827034,
                12763159, 24575302, 38187347, 53451404, 70147061, 87988766,
                106625888, 125658165, 144643853, 163116013, 180598380,
                196623003, 210748070, 222574464, 231761399, 238041130,
                241228017, 241228017, 238041130, 231761399, 222574464,
                210748070, 196623003, 180598380, 163116013, 144643853,
                125658165, 106625888, 87988766, 70147061, 53451404, 38187347,
                24575302, 12763159, 2827034, -5223349, -11440900, -15933574,
                -18853952, -20388173, -20744894, -20144286, -18809941,
                -16950103, -14770057, -12435856, -10098245, -7878098, -5862456,
                -4109235, -2650941, -1497324, -638438, -49046, 305641, 464978,
                1943208, };
        q31_t pState[1000 + 80];
        arm_fir_decimate_instance_q31 S;

        uart_init();
        while (1) {
            // Initialize buffer pSrc
            for (int i = 0; i < 1000; i++)
                pSrc[i] = 0;
            for (int i = 0; i < 500; i++)
                pSrc[i] = 10000000;

            // Process data
            arm_fir_decimate_init_q31(&S, 80, 10, pCoeffs, pState, 1000);
            GPIOA->BSRR = (1 << 3); // Set PA3
            arm_fir_decimate_q31(&S, pSrc, pDst, 1000);
            GPIOA->BRR = (1 << 3);  // Reset PA3

            // Output data
            for (int i = 0; i < 100; i++) {
                printf("%ld\r\n", pDst[i]);
                while(uart_sending());
            }
            HAL_Delay(2000);
        }

Attached: output response to a step signal

Picuino · « **Reply #120 on:** May 21, 2024, 01:49:07 pm »

The other microcontroller may be able to reduce 2.833Msps in real time.

Picuino · « **Reply #121 on:** May 21, 2024, 01:58:16 pm »

Yet another test:

Code: [Select]

pkg load signal
R = 10      % decimation factor
BW = 20     % end of passband (start of transition band), kHz
fs =  2833  % sample rate, kSa/s
ntaps = 70 % number of taps
h = remez(ntaps-1,[0 BW fs/2/R fs/2]/(fs/2),[1 1 0 0]);
% plot frequency response
[H,f] = freqz(h,1,10000,fs);
plot(f,20*log10(abs(H)))
grid on
ylim([-70 0])
% scale h to Q31
int32(round(h*2^32))

Program:

Code: [Select]

        // Define variables
        q31_t pSrc[1000];
        q31_t pDst[1000];
        q31_t pCoeffs[] = { -3102388, -3113797, -4515965, -6180048, -8075456,
                -10141397, -12290802, -14416199, -16366203, -17980278,
                -19065273, -19421132, -18834989, -17096251, -14007364, -9390118,
                -3103346, 4955244, 14832503, 26518003, 39934516, 54936181,
                71309499, 88773820, 106991524, 125572769, 144091117, 162094330,
                179121976, 194721956, 208466551, 219971072, 228905848,
                235012496, 238111387, 238111387, 235012496, 228905848,
                219971072, 208466551, 194721956, 179121976, 162094330,
                144091117, 125572769, 106991524, 88773820, 71309499, 54936181,
                39934516, 26518003, 14832503, 4955244, -3103346, -9390118,
                -14007364, -17096251, -18834989, -19421132, -19065273,
                -17980278, -16366203, -14416199, -12290802, -10141397, -8075456,
                -6180048, -4515965, -3113797, -3102388, };
        q31_t pState[1000 + 70];
        arm_fir_decimate_instance_q31 S;

        uart_init();
        while (1) {
            // Initialize buffer pSrc
            for (int i = 0; i < 1000; i++)
                pSrc[i] = 0;
            for (int i = 0; i < 500; i++)
                pSrc[i] = 10000000;

            // Process data
            arm_fir_decimate_init_q31(&S, 70, 10, pCoeffs, pState, 1000);
            GPIOA->BSRR = (1 << 3); // Set PA3
            arm_fir_decimate_q31(&S, pSrc, pDst, 1000);
            GPIOA->BRR = (1 << 3);  // Reset PA3

            // Output data
            for (int i = 0; i < 100; i++) {
                printf("%ld\r\n", pDst[i]);
                while (uart_sending())
                    ;
            }
            HAL_Delay(2000);
        }

Process time: 836us (less)

Picuino · « **Reply #122 on:** May 21, 2024, 02:45:48 pm »

Code: [Select]

#
# Python script to calculate
# FIR coefficients with remez algorithm
#

import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

fs = 2833000   # Sample rate, Hz
cutoff = 20000 # Desired cutoff frequency, Hz
R = 10         # Decimation factor
numtaps = 70   # Size of the FIR filter.

def plot_response(fs, w, h, title):
    plt.figure()
    plt.plot(0.5*fs*w/np.pi, 20*np.log10(np.abs(h)))
    plt.ylim(-100, 5)
    plt.xlim(0, 0.1*fs)
    plt.grid(True)
    plt.xlabel('Frequency (Hz)')
    plt.ylabel('Gain (dB)')
    plt.title(title)
    plt.show()


bands = [0, cutoff, 0.5*fs/R, 0.5*fs]
gains = [1, 0]

taps = signal.remez(numtaps, bands, gains, fs=fs)

q31_taps = [int(round(t * 2**31)) for t in taps]
q31_taps = [t if t >= 0 else t + 0x100000000 for t in q31_taps]  # two's complement
q31_taps = [f"0x{t:08X}" for t in q31_taps]
print(", ".join(q31_taps))


w, h = signal.freqz(taps, [1], worN=2000)
plot_response(fs, w, h, "Low-pass Filter")

Python equivalent code for calculating FIR coefficients and frequency response.

EDIT: Q31 conversion corrected.

gf · « **Reply #123 on:** May 21, 2024, 03:15:13 pm »

Quote from: Picuino on May 21, 2024, 02:45:48 pm

Code: [Select]
q31_taps =[round(t*2**32) for t in taps]

Sorry, my mistake. I got that wrong too. When converting to Q31, the scaling is 2**31, not 2**32.
Basically, the sum of the taps should be 1, in order that the DC gain of the filter becomes 1.
[ But I noticed that remez does not always produce a sum of exactly 1, it can be slightly off. ]
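A simple check for this, as a sketch with invented helper names (q31_tap_sum, q31_dc_gain): sum the integer taps and compare against 2^31.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of all taps as a 64-bit integer. For unity DC gain in Q31 the
 * sum should be close to 2^31 = 2147483648. */
int64_t q31_tap_sum(const int32_t *taps, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += taps[i];
    return sum;
}

/* DC gain of the filter as a plain number (1.0 = unity). */
double q31_dc_gain(const int32_t *taps, size_t n)
{
    return (double)q31_tap_sum(taps, n) / 2147483648.0;
}
```

A coefficient set scaled with 2^32 instead of 2^31 shows up here as a DC gain of about 2.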

gf · « **Reply #124 on:** May 21, 2024, 03:41:36 pm »

Quote from: Picuino on May 21, 2024, 01:42:53 pm

Input buffer = 1000 samples
Output buffer = 100 samples
process time = 936us @ 80MHz with STM32L412KB

Yet another test:
Process time: 836us (less)

Both are about 9.5 cycles per output sample per tap.
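The arithmetic behind that figure, written out as a sketch. The function names are made up; realtime_load additionally compares the measured processing time with the time it takes to acquire the same block of samples.

```c
#include <stddef.h>

/* CPU cycles spent per output sample per tap. */
double cycles_per_output_tap(double time_us, double clock_mhz,
                             size_t n_out, size_t n_taps)
{
    return time_us * clock_mhz / ((double)n_out * (double)n_taps);
}

/* Processing time divided by acquisition time for a block of n_in
 * samples at fs_ksps kSa/s. Values above 1.0 mean the filter cannot
 * keep up with the input stream. */
double realtime_load(double time_us, size_t n_in, double fs_ksps)
{
    double acquire_us = (double)n_in / fs_ksps * 1000.0;
    return time_us / acquire_us;
}
```

With the measurements above, 936us at 80MHz for 100 outputs and 80 taps gives 9.36 cycles, and 836us with 70 taps gives about 9.55 cycles. At 1333kSa/s a block of 1000 samples arrives in about 750us, so the 936us run corresponds to a load of roughly 1.25 on the STM32L412KB.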

