Author Topic: Simple diy SDR radio with Tayloe mixer and STM32 (Read 84225 times)

mark03 · « **Reply #25 on:** June 09, 2015, 07:53:37 pm »

Quote from: Howardlong on June 09, 2015, 05:55:11 pm

NXP LPC4370 is the device, 200MHz M4F plus two M0 cores with 80Msps ADC. I only used the M4F core.

Wow, I knew about the older parts in that family but hadn't realized they had such a fast ADC in the '70. I wonder what the use case is, for their big customers, I mean. I assume anyone doing serious RF would be using an external ADC.

Quote from: Howardlong on June 09, 2015, 05:55:11 pm

A CIC implementation might be interesting, I considered that but I don't have any real experience in doing them other than back of envelope stuff, so I stuck to what I know, and doing the polyphase decimation and properly doing ARM assembler (ie, writing it, not just reading it in a disassembly listing) for the first time were probably enough new things for my grey matter to deal with. I learned a lot.

Not sure if doing an FFT block filter is what I think you mean? Are you talking about an overlap-add/overlap-save fast convolution or something else?

Re: FFT filters, yes, I was thinking fast convolution, but upon reflection it doesn't make sense for your polyphase decimator. It has to be applied to each of those mini-FIRs separately, and at 8 taps long, wouldn't actually be faster. (The exact trade-off point depends greatly on architecture and optimizations, but I usually see 10-30 taps quoted as parity between direct-form FIR and FFT-convolution.)

I think CIC could be faster, but it depends whether you implement a clean-up FIR at the end or not. Also, it requires fixed point (the efficiency derives in part from the overflows happening the right way). Definitely less coef storage though!

Cool project. On the back burner I still have plans for a digitized bat receiver (ultrasound) as an excuse to learn some ARM assembly. A lot less computation than your radio though.

mark03 · « **Reply #26 on:** June 09, 2015, 08:01:18 pm »

Quote from: Kleinstein on June 09, 2015, 07:44:02 pm

A 12 Bit ADC will to certain degree limit the performance, getting weak stations. But at a higher sampling rate this should not be so bad, especially with some analog filtering at the Tayloe mixer. An adjustable gain might help to get most out of the 12 Bit ADCs. For good performance a higher resolution ADC would help, though I think 16 Bit with adjustable gain should be enough in most cases - unless there is a strong local station in the IF bandwidth.

If using the direct-sampling approach, you gain back a lot of bits in the process of decimating to the signal bandwidth---one bit for every factor of four rate change. So, 80 MHz -> 20 kHz nets you an extra six good bits.
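As a rough check of that rule of thumb, assuming the quantisation noise is white and spread evenly over the full input bandwidth (an idealisation, real ADCs have spurs and other non-white error), the gain in effective bits from decimating works out as:

```latex
\Delta\text{bits} \approx \frac{1}{2}\log_2\!\left(\frac{f_{s,\mathrm{in}}}{f_{s,\mathrm{out}}}\right)
= \log_4\!\left(\frac{80\ \text{MHz}}{20\ \text{kHz}}\right)
= \log_4 4000 \approx 5.98
```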

I have a friend who's modified the OpenHPSDR Hermes shortwave receiver using an Analog Devices part with 12-bit ADC. He says most of the time he can't tell any difference between that and the 16-bit ADC in the original design. And that is a vastly more demanding application than AM radio receiver.

Yansi · « **Reply #27 on:** June 10, 2015, 02:05:23 am »

Hi again!

I have been studying FIR filters... But still doing something horribly wrong.

I have made a simple template project with 32F429 using onchip ADC, DAC and DMA with corresponding circular buffers. This stuff runs pretty fine. I have a 48kHz sampling rate, buffer length selectable (now 1kByte which equals 21.333ms (1024/48kHz) of delay from ADC to DAC). Computing in 16bit precision. I am running an audio signal to the ADC and out of the DAC after.

The first computational thing I have programmed in between was a volume control. That worked as expected. With 168MHz I have 3500 clocks (168MHz/48kHz) for computing each sample. This "buffer base" with the volume control (simple multiplication/division) took about 50 clocks per sample (if I remember it right), which equals about 1.5% of CPU load. Up to this point everything is OK.

I have wanted to add a FIR filter, just as a little practice - I have never done this stuff before, but I'd like to learn at least some basics... So.. I thought I'll just slap there a FIR in front of the output volume regulator. But heck.. my crappy FIR takes about 25%+ CPU time at 32 points length (with -O3) . It kinda works and is filtering stuff... sometimes overflowing somewhere. But toooo slow I think.

I guess my implementation will make you laugh, but it is really my first FIR and the first implementation I could think of. It takes int16 data in as input sample and returns int16 data out as the filtered output, one sample at a time. The first loop advances the delay line one step, the second loop calculates the product sum. The two div1 and div2 are just two scaling factors I fiddle with to get rid of the distortions due to overflows in the filter. Cruel solution, I know. Acc is int32 accumulator for the sum. The fl (filterlength) coefficient is the length of the filter, 32 now.

I took the coef. from here, because it is more probable they're right than if I'd try to calculate them myself. https://sestevenson.wordpress.com/implementation-of-fir-filtering-in-c-part-1/ I have only transformed the numbers into int16 format, where +-1 is full scale.

/* FIR FILTER */
fl = filterlen;
for (k=fl-1; k>0; k--) sample[k]=sample[k-1]; /* shift the delay line */
sample[0] = in; /* add new sample to delay chain */
acc=0;
for (k=0; k<fl; k++) { /* from k=0 to M-1 */
acc += (coeffs[k] * sample[k])/div1; /* accumulate the product */
}
out = acc/div2;

I understand that the "delay line advancing loop" is not necessary and could be solved using pointer arithmetic. I will look into that tomorrow. But still I think this can be improved way more. What would you suggest to optimise and how?
Thanx,
Y

PS: I do not use ARM_DSP_Lib, I do not use FPU. Just because I want to test what can be done without it.

splin · « **Reply #28 on:** June 10, 2015, 11:34:11 am »

Take a look at this, especially pages 16 onwards which explains the instruction timings and how to improve your FIR speed:

http://www.arm.com/files/pdf/dspconceptsm4presentation.pdf

Howardlong · « **Reply #29 on:** June 10, 2015, 01:12:16 pm »

A few things.

Firstly, the CMSIS-DSP library is there for a reason, I assume you can use that with your device, but equally I fully understand why you want to do it from first principles, it's what I'd do too.

However, if you're doing things the hard way (as I did!), it's worth taking a look at the code that's generated.

The typical ways to optimise M4 code are:

o Unroll loops to minimise loop overhead
o Edit: use compiler idioms
o Use intrinsic functions
o Arrange data to take advantage of load and store multiple instructions
o Arrange algorithms to do as much in-register processing as possible and minimise interim store and load instructions
o Order instructions so they don't stall each other (eg, by avoiding use of the results of the immediately previous instruction, once you've unrolled your loops it'll be easier to accomplish too)
o Run in zero wait state memory, but be aware of possible stalls due to concurrent bus access on same-segment code and data memory
o Align consecutive 32 bit instructions on 32 bit word boundaries

In your code specifically, firstly see if you can move that divide out of the loop without overflowing. You may find that reducing your coefficient resolution will help avoid overflow. Divide on M4 is slow.

Avoid shifting the delay line by fiddling with your algorithm, there are a few ways of doing this, such as having two copies of your coefficients one after each other and sliding along them. If you must use a shifting delay line, either unroll the loop or use the (hopefully) optimised memmove from the c library. Regrettably M4 doesn't offer circular buffers, but then it's not a DSP!
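One way to read the "two copies of your coefficients" idea is sketched below. It is only an illustration under assumptions: the names `NTAPS`, `h`, `h2`, `buf`, `wr`, `fir2_init` and `fir2_step` are hypothetical, the coefficients are placeholders, and the accumulator is returned unscaled. The sample buffer is written in place and never shifted; instead the start pointer into a doubled, reversed coefficient table slides by one each sample, so the inner loop needs no wrap test.

```c
#include <stdint.h>

#define NTAPS 4                          /* hypothetical filter length */

static const int16_t h[NTAPS] = {1, 2, 3, 4};  /* placeholder coefficients */
static int16_t h2[2 * NTAPS];            /* reversed coefficients, stored twice */
static int16_t buf[NTAPS];               /* sample buffer, never shifted */
static unsigned wr = NTAPS - 1;          /* index of the newest sample */

/* Build the doubled table: h2[j] = h[NTAPS-1 - (j mod NTAPS)]. */
void fir2_init(void)
{
    for (unsigned j = 0; j < 2 * NTAPS; j++)
        h2[j] = h[NTAPS - 1 - (j % NTAPS)];
}

/* One output per input sample; returns the raw 32-bit accumulator. */
int32_t fir2_step(int16_t in)
{
    wr = (wr + 1 == NTAPS) ? 0 : wr + 1; /* one wrap test per sample */
    buf[wr] = in;                        /* overwrite the oldest sample */

    /* Slide along the doubled table so that c[i] lines up with buf[i]. */
    const int16_t *c = &h2[NTAPS - 1 - wr];

    int32_t acc = 0;
    for (unsigned i = 0; i < NTAPS; i++) /* straight linear walk, no wrap */
        acc += (int32_t)c[i] * buf[i];
    return acc;
}
```

The cost is twice the coefficient storage; whether it beats a shifted delay line on a given M4 part is something to measure rather than assume.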

To give you some idea, I achieved an 8 point FIR in floating point M4F assembler in 34 cycles including the loads and stores by the above methods.

I don't know if you are using GCC, but I wasn't too impressed with its optimisation efforts on the M4. It doesn't seem to know about the benefit of 32 bit word alignment for successive 32 bit instructions for example.

There is an example floating point FIR implementation in "The Definitive Guide to ARM Cortex M3 and Cortex M4 Processors" in C that claims 3.125 cycles/MAC.

Kalvin · « **Reply #30 on:** June 10, 2015, 01:22:42 pm »

Quote from: Howardlong on June 10, 2015, 01:12:16 pm

In your code specifically, firstly see if you can move that divide out of the loop without overflowing. You may find that reducing your coefficient resolution will help avoid overflow. Divide on M4 is slow.

- Scale the input and coefficients so that you don't need scaling or division inside the loops.
- Arrange the divisions so that they are 2**N if possible, which can then be reduced to shifts.
- Divisions can also be converted to faster multiplications when using fractional or floating point arithmetics.
- Don't move the data unless it is necessary. Use circular buffers and pointers instead and keep the sample data in place.
- Use circular buffers and modulo 2**N buffer sizes if possible, and see if the compiler supports modulo addressing so you don't have to test buffer boundaries.
- Check the assembly code the compiler generates.
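A minimal sketch of several of these points together, assuming Q15 coefficients and a power-of-two buffer: the index is wrapped with a mask instead of a boundary test, and the only scaling is one shift after the loop. All names (`FIR_TAPS`, `BUF_LEN`, `fir_q15_step`, and so on) and the coefficient values are made up for illustration. The final `>> 15` relies on arithmetic right shift of negative values, which is implementation-defined in C but is what the common ARM compilers do.

```c
#include <stdint.h>

#define FIR_TAPS 5
#define BUF_LEN  8                       /* power of two, >= FIR_TAPS */
#define BUF_MASK (BUF_LEN - 1u)

/* Placeholder Q15 coefficients; they sum to 32768, i.e. unity DC gain. */
static const int16_t fir_coeffs_q15[FIR_TAPS] = {4096, 8192, 8192, 8192, 4096};

static int16_t  fir_buf[BUF_LEN];        /* samples stay in place */
static unsigned fir_head;                /* index of the newest sample */

int16_t fir_q15_step(int16_t in)
{
    fir_head = (fir_head + 1u) & BUF_MASK;   /* modulo 2**N via mask */
    fir_buf[fir_head] = in;

    int32_t  acc = 0;
    unsigned idx = fir_head;
    for (unsigned k = 0; k < FIR_TAPS; k++) {
        acc += (int32_t)fir_coeffs_q15[k] * fir_buf[idx];
        idx = (idx - 1u) & BUF_MASK;         /* step back in time, wrap by mask */
    }
    return (int16_t)(acc >> 15);             /* single shift instead of a divide */
}
```

With these particular coefficients the sum of absolute values is 1.0 in Q15, so the 32-bit accumulator cannot overflow for any 16-bit input; other coefficient sets need their own worst-case check.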

Howardlong · « **Reply #31 on:** June 10, 2015, 01:25:19 pm »

FWIW here was a recent discussion on M4F optimisation https://www.eevblog.com/forum/microcontrollers/how-are-micros-programmed-in-real-world-situations/msg660776/#msg660776

Yansi · « **Reply #32 on:** June 10, 2015, 02:04:37 pm »

Thanks for suggestions guys. Some of them are familiar to me, some not. Thanks for the presentation from ARM. I will definitely read it and try to understand it. I have been looking into the code of the DSP Lib and tried to understand that. But couldn't - until now thanks to the presentation. Seems it is done exactly according to the presentation in the DSPlib.

Yeah - the divide can be dismantled from the inner loop. It is way faster then and works "fine" (= scaling needed before and after the FIR). But I don't know how to calculate the scaling factors so the range could be used fully.

Yeah I said that the data moving loop is unnecessary. But it was really the first implementation I have tried and that worked.

I checked the assembly output with my friend yesterday - it really didn't look nice :-)

Run in zero wait state memory - huh.. I can load the code in the CCM memory, possibly. Also move there all the buffers.

Or I could possibly write it entirely in embedded assembly code. I have some minor experience with CM3 assembly coding, so why not try the CM4. But I think it can be done nice also in C - I will try that first.

No GCC. Hate that stuff. I use Keil or occasionally IAR. (The IAR compiler is a little better than Keil I think, but I don't like the IAR GUI, that's why I stick with Keil uVision).

I will try to modify my code and make some optimisations. (or possibly try the DSPlib)

How can I possibly calculate the scaling factors for the filter input and output? I have 16bit input/output data and 32bit accumulator (or do I need bigger but slower 64bit acc?). How should I scale the input and output to match the filter?

Would you suggest (in general) to use floating or fixed point arithmetic on the CM4 for DSP processing (for the SDR)? I think the float can better deal with overflows - there's enough dynamic range not to overflow ever. Only the output is then scaled as necessary. Or are there some drawbacks too?

ermeneuta · « **Reply #33 on:** June 10, 2015, 02:27:35 pm »

Quote

I think CIC could be faster, but it depends whether you implement a clean-up FIR at the end or not. Also, it requires fixed point (the efficiency derives in part from the overflows happening the right way). Definitely less coef storage though!

Not necessarily... Following what Richard Lyons writes in his very good book, I have implemented in my ARM Radio project a polyphase CIC, all in floating point. The following is the comment at the beginning of the CIC block :

//-------------------------------------------------------------------------
// Now we decimate by 16 the input samples, using the CIC polyphase decomposition
// technique, which has the advantage of eliminating the recursive
// component, allowing the use of floating point, rather fast on a Cortex M4F
//
// A dividing by 16, order 4, CIC is used. Then a 4096-entry buffer is filled, and passed
// to the baseband interrupt routine, where it is additionally filtered with a
// sinc-compensating FIR, which also adds further stop band rejection and a decimation by 4
//-------------------------------------------------------------------------

I cannot publish the source code (yet) as this project is my answer to the Keil/ARM design contest
(look here : http://www2.keil.com/mdk5/contest), but after the end date of the contest, it will be published as open source.

The STM32F429ZIT chip samples with two of its ADCs in interleaved mode, then a complex multiplication is done with an LO signal generated by a quadrature complex oscillator (again, thanks to Richard Lyons...), brought to zero IF, downsampled first with the CIC, then with a FIR that implements also the compensation for the droop of the response curve of the CIC (actually very small...).

At this point the downsampled complex signal (at a sampling rate of 27901.786 Hz, plus or minus the tolerance of the 8 MHz quartz of the STM32F4 clock...) is bandpassed with the fast convolution method (overlap-and-discard) with a selectable bandwidth, then applied to the AM, or SSB, or CW demodulator, with a selectable AGC time constant. The output of the demodulator is sent to the on-chip DAC, using DMA with two flip-flop buffers, and the output of the DAC is sent to an external, hardware, reconstruction filter. That's all.

The project is finished, all working, and I am now busy to write the documentation, the less pleasant step, but badly needed from the rules of the contest..

This is a bad photo of the on-board TFT screen, also driven by the STM32f429ZIT processor, with the ARM Radio tuned to the DCF77 time/frequency standard in Mainflingen, Germany :

I am intending to shoot also a short YouTube video, but haven't found the time yet...

Alberto

Howardlong · « **Reply #34 on:** June 10, 2015, 02:57:09 pm »

Very nice Alberto!

It looks like you came to a similar conclusion as I did regarding the use of floating point rather than fixed point. In essence, the larger register set afforded to you by the FPU means you can do a lot more in-register processing than you can in fixed point, and in ARM registers are everything when trying to optimise.

The Richard Lyons book is pretty much the bible of DSP as far as I am concerned. His Streamlining book is also very handy.

To the OP, regarding figuring out the divide ratio, it's dependent on the number of taps, the coefficients and the input range. Come up with a worst case scenario and use that as the basis for your scaling.
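One possible way to turn that worst case into a number, sketched under the assumption of int16 samples, int16 coefficients and an int32 accumulator: the largest possible accumulator magnitude is the largest input magnitude times the sum of the absolute coefficient values. The helper name `fir_guard_shift` is hypothetical. It returns how many bits the input would have to be shifted right so that the sum can never leave the int32 range.

```c
#include <stdint.h>
#include <stddef.h>

/* Worst case: every sample is at full scale (magnitude 32768) with the
 * sign that makes each product positive, so |acc| <= 32768 * sum(|h[k]|). */
unsigned fir_guard_shift(const int16_t *coeffs, size_t ntaps)
{
    uint64_t abs_sum = 0;
    for (size_t k = 0; k < ntaps; k++) {
        int32_t c = coeffs[k];
        abs_sum += (uint64_t)(c < 0 ? -c : c);
    }

    uint64_t worst = abs_sum * 32768u;   /* largest possible |acc| */
    unsigned shift = 0;
    while ((worst >> shift) > (uint64_t)INT32_MAX)
        shift++;                         /* pre-shift the input this much */
    return shift;
}
```

This bound is pessimistic for real signals, so it gives a safe starting point; how much headroom to trade for resolution is a judgment call.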

mark03 · « **Reply #35 on:** June 10, 2015, 04:49:29 pm »

Nice project Alberto! Thanks for pointing me to nonrecursive (and polyphase) CIC filters. The only kind I knew about were the original (Hogenauer) recursive structure. I'm guessing it's the nonrecursive structure that permits the use of floating point in this case? Guess I need to pick up Lyons' book.

Can you fit a decoder for DCF77 / WWVB / etc. into the spare cycles?

Howardlong · « **Reply #36 on:** June 10, 2015, 04:58:02 pm »

Just one further comment, for some reason the CMSIS-DSP decimator isn't polyphase whereas the interpolator is. I guess no-one got around to really optimising at the algorithm level.

ermeneuta · « **Reply #37 on:** June 10, 2015, 05:35:04 pm »

Quote from: mark03 on June 10, 2015, 04:49:29 pm

Nice project Alberto! Thanks for pointing me to nonrecursive (and polyphase) CIC filters. The only kind I knew about were the original (Hogenauer) recursive structure. I'm guessing it's the nonrecursive structure that permits the use of floating point in this case? Guess I need to pick up Lyons' book.

Can you fit a decoder for DCF77 / WWVB / etc. into the spare cycles?

Correct. Eliminating the integrator part of the CIC, with its unavoidable overflow that is compensated only when you work in two's-complement arithmetic, is what permits the use of floating point. All is well described, with examples, in the excellent book of Richard Lyons, a must.
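For readers without the book at hand, here is a sketch of what one nonrecursive stage can look like. It is not the ARM Radio source, which is unpublished at this point; the names `cic_stage_t` and `cic_stage_decimate2` are invented here. It relies on the identity that an order-4, decimate-by-16 CIC factors into four cascaded decimate-by-2 stages, each a plain FIR (1 + z^-1)^4 with taps 1, 4, 6, 4, 1. There is no integrator, so nothing needs to overflow and float works. Only every second output is computed, which is where the polyphase saving comes from.

```c
#include <stddef.h>

/* History of one stage: the three inputs before the current pair. */
typedef struct { float z1, z2, z3; } cic_stage_t;   /* x[n-1], x[n-2], x[n-3] */

/* Consumes n input samples (n even), writes n/2 outputs, returns the count.
 * Each output is normalised by 1/16 so the DC gain of the stage is 1. */
size_t cic_stage_decimate2(cic_stage_t *s, const float *in, size_t n, float *out)
{
    size_t m = 0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        float x0 = in[i];        /* older sample of the pair */
        float x1 = in[i + 1];    /* newer sample of the pair */

        out[m++] = (x1 + 4.0f * x0 + 6.0f * s->z1 + 4.0f * s->z2 + s->z3)
                   * (1.0f / 16.0f);

        s->z3 = s->z1;           /* advance the history by two samples */
        s->z2 = x0;
        s->z1 = x1;
    }
    return m;
}
```

Chaining four such stages, each with its own `cic_stage_t`, would give the overall divide-by-16; the droop-compensating FIR described earlier would follow separately.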

A DCF77 decoder could well be coded in the GUI task, that runs at idle priority. Decoding the DCF77 is by no means a processor intensive task...

Maybe, after the Friedrichshafen fair at the end of June, when I will have some spare time, I will consider that. If you perhaps have a pointer to a snippet of code that implements that decoder, that would spare me some time...

Alberto

mark03 · « **Reply #38 on:** June 10, 2015, 07:22:12 pm »

Quote from: Howardlong on June 10, 2015, 04:58:02 pm

Just one further comment, for some reason the CMSIS-DSP decimator isn't polyphase whereas the interpolator is. I guess no-one got around to really optimising at the algorithm level.

Do you know if they have different code on M4 vs M3? I have a sneaking suspicion that they do not, which I find strange considering the marketing push for the M4 core (and now to some extent the M7) is focused on this idea of a "digital signal controller". I guess there are commercial sources for optimized code, just not for hobbyists.

mark03 · « **Reply #39 on:** June 10, 2015, 07:30:32 pm »

Quote from: ermeneuta on June 10, 2015, 05:35:04 pm

A DCF77 decoder could well be coded in the GUI task, that runs at idle priority. Decoding the DCF77 is by no means a processor intensive task... Maybe, after the Friedrichshafen fair at the end of June, when I will have some spare time, I will consider that. If you perhaps have a pointer to a snippet of code that implements that decoder, that would spare me some time...

Sorry, I don't. I live in the US (finally updated my profile flag) so I would like to write a decoder for the new phase modulation on WWVB, which is only a few years old. (Most/all of our "atomic clock" consumer products are still using the old time code.) Wikipedia says this is similar to DCF77, so if I get around to it this summer I'll look you up!

Yansi · « **Reply #40 on:** June 10, 2015, 09:20:59 pm »

I have just got some spare time for toying with the FIR. I have started to optimise my code. I have thrown away the "delay line advancing loop" and replaced it with a pointer instead. But I have stepped onto a trouble:

Here is the arm presentation, please see page 15
http://www.arm.com/files/pdf/dspconceptsm4presentation.pdf
This is the slide on page 15:

I don't understand the code, it seems fucked up somehow. Where is the "state" array defined and isn't there an error on this line?
state[stateIndex++] = in[sample];

I understand the state array as an array of filtLen length. The FIR loop then advances back in time through the array. Before each FIR computation, I need to advance 1 sample forward in time, that means I must increment the pointer first and then write the new sample. They are postincrementing the pointer. That means the first FIR product is calculated between coefficient[0] and something other than the last sample added to the chain!

Is there really an error?

A little example: Consider a FIR of length 7. We have a state array 7 samples long, stateIndex=4. The samples are stored like this:

stateIndex: V
| x-4 | x-3 | x-2 | x-1 | x | x-6 | x-5 |

Run through the FIR loop from k=0 to k=6. You will end up with the stateIndex in the same position. The logic tells you that you should overwrite the oldest sample in the delay chain, the x-6 one. But if they post-increment the pointer, they are overwriting the x sample and the first multiplication in the next FIR loop (k=0) will read the x-6 (x-7 in reality) sample first, instead of the newest one! There will be a 1 sample overlap or something. There's something wrong with their code I think.

Or did I miss something?

mark03 · « **Reply #41 on:** June 11, 2015, 12:15:32 am »

It does seem as though the postincrement should have been a preincrement. But the presentation is really about optimization strategies, so I wouldn't get too wrapped up in it. The author probably rushed to put that slide together. It sounds like you are getting the hang of this so I would continue experimenting and learning from that.

Yansi · « **Reply #42 on:** June 11, 2015, 01:17:39 am »

I am getting hung with anything from that presentation. If you look the second suggested method with loop unrolling - the code doesn't look right either. Multiplying one coefficient with four different samples? Wtf? Shouldn't the filtLen (loop count) be divided by four then?

One doesn't have to dig very deep into the next code examples to find other strange things that don't make any sense. I suspect the document was "obfuscated on purpose" to not show exactly how to do it. I am quite confused reading it. I haven't made any progress in optimizing my code.

My code now looks like this. It does a 101-point FIR in about 1300 clocks (12.8 per tap on average). It is similar to the "standard FIR code" described in the presentation, only with the pointer corrected to pre-increment. I am stuck here. I have tried the first optimization suggested: "circular buffer alternative". But I couldn't understand it a bit. The pointer wrapping condition disappeared in their example code. But I can't understand how it works without wrapping the pointer.

sample[++sampleindex] = in;
acc=0;
for (k=0; k<flen; k++) {
acc += coeffs[k] * sample[sampleindex--];
if (sampleindex<0) sampleindex = flen-1;
}
out = acc*mul2/div2;

How can I optimize my FIR code according to the "circular alternative" suggested in the presentation?

By the way, I have tested the ARM DSPlib, thinking it will be fast and optimized, if the function name is "arm_fir_fast_q15". But it is only a tiny fraction faster than my basic implementation above.

And provides very silent output signal. So silent one must amplify an elephant to hear something, compared to my FIR which has tendency to overflow quite easily.

Thanx 4 help,
Y

splin · « **Reply #43 on:** June 11, 2015, 03:50:07 pm »

Quote from: Yansi on June 10, 2015, 09:20:59 pm

little example: Consider a FIR of length 7. We have a state array 7 samples long, stateIndex=4. The samples are stored like this:

stateIndex: V
| x-4 | x-3 | x-2 | x-1 | x | x-6 | x-5 |

Run through the FIR loop from k=0 to k=6. You will end up with the stateIndex in the same position. The logic tells you that you should overwrite the oldest sample in the delay chain, the x-6 one. But if they post-increment the pointer, they are overwriting the x sample and the first multiplication in the next FIR loop (k=0) will read the x-6 (x-7 in reality) sample first, instead of the newest one! There will be a 1 sample overlap or something. There's something wrong with their code I think.

Or did I miss something?

Yes. The post-increment means that the filtering processes the samples in the order:

x-6, x, x-1, x-2, x-3, x-4, x-5

and leaves sampleindex pointing at x-6. The next sample, x+1, overwrites x-6, and the next filter takes the samples in order:

x-5, x+1, x, x-1, x-2, x-3, x-4

The post-increment probably is a mistake but that code will work fine so long as the coefficients are in the matching order 6, 0, 1, 2, 3, 4, 5

But as mark03 says, don't get hung up too much on the detail - the presentation is about techniques to get the best out of the M4, not production ready code. Its just a guide.

splin · « **Reply #44 on:** June 11, 2015, 06:01:06 pm »

Quote from: Yansi on June 11, 2015, 01:17:39 am

I am getting hung with anything from that presentation. If you look the second suggested method with loop unrolling - the code doesn't look right either. Multiplying one coefficient with four different samples? Wtf?

That is an error. It should probably post increment k when fetching the coefficients (as shown in the code on page 29), or use 4 different sum variables (as shown on pages 32 & 33).

Quote

Shouldnt the filtLen (loop count) be divided by four then?

Yes, but it's just not been included in that code snippet. If you look at the example code on page 29 for example it has this just before the loop:

filtLen = filtLen >> 2;

Quote

One doesn't have to dig very deep into the next code examples to find other strange things that don't make any sense. I suspect the document was "obfuscated on purpose" to not show exactly how to do it. I am quite confused reading it. I haven't made any progress in optimizing my code.

I doubt that; I think it's just that it's pitched at people who are familiar with DSP and these types of coding techniques, but not specifically with the Cortex M4 DSP instructions/architecture.

Quote

My code now looks like this. It does a 101-point FIR in about 1300 clocks (12.8 per tap on average). It is similar to the "standard FIR code" described in the presentation, only with the pointer corrected to pre-increment. I am stuck here. I have tried the first optimization suggested: "circular buffer alternative". But I couldn't understand it a bit. The pointer wrapping condition disappeared in their example code. But I can't understand how it works without wrapping the pointer.

sample[++sampleindex] = in;
acc=0;
for (k=0; k<flen; k++) {
acc += coeffs[k] * sample[sampleindex--];
if (sampleindex<0) sampleindex = flen-1;
}
out = acc*mul2/div2;

How can I optimize my FIR code according to the "circular alternative" suggested in the presentation?

I don't think the 'circular buffer alternative' on pages 19 to 22 is very clear - you probably had to be at the presentation, but these are just notes. However the idea is to make the buffer of input samples longer than the filter length so that you can be sure that stateindex won't wrap-around in the filter loop so that you can eliminate the 'if (stateindex < 0)' test in the inner loop, saving 4 cycles. There is an extra cost of copying samples though.

In your case with a long filter, it would be simpler to eliminate the wrap-around test using:

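if (++sampleindex >= flen) sampleindex = 0; /* wrap the write position */
sample[sampleindex] = in;
acc=0;
k = 0;

while(sampleindex >= 0) { /* newest sample down to index 0 */
acc += coeffs[k++] * sample[sampleindex--];
}

sampleindex = flen-1; /* wrap once, outside the inner loops */

while(k < flen) { /* remaining, older samples */
acc += coeffs[k++] * sample[sampleindex--];
}

out = acc*mul2/div2;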

You can significantly improve on the above 8 cycles/tap by implementing loop unrolling. Your data is 16 bit, so if your coefficients can also be limited to 16 bits or less then using the SIMD instructions will double the speed again with 2 x 16x16 multiply accumulates in a single instruction as shown on page 27. Applying all the techniques shown you should be able to achieve the stated 1.625 cycles/tap, but this assumes that there are no extra delays due to instruction fetch wait states or memory bus delays due to instruction/data contention.
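A portable sketch of the loop-unrolling step only, without the SIMD intrinsics (those need the CMSIS headers and are not shown here). The function name `fir_dot_unrolled4` is made up, the samples are assumed to be laid out linearly so that `x[k]` pairs with `coeffs[k]`, and `ntaps` is assumed to be a multiple of four. Four separate accumulators are used so that consecutive multiply-accumulates do not depend on each other's results.

```c
#include <stdint.h>

/* Dot product of ntaps coefficients and samples, unrolled by four.
 * Assumes ntaps is a multiple of 4 and x[] holds the samples in the
 * same order as coeffs[] (no wrap-around inside the call). */
int32_t fir_dot_unrolled4(const int16_t *coeffs, const int16_t *x, unsigned ntaps)
{
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;

    for (unsigned k = 0; k < ntaps; k += 4) {
        acc0 += (int32_t)coeffs[k]     * x[k];
        acc1 += (int32_t)coeffs[k + 1] * x[k + 1];
        acc2 += (int32_t)coeffs[k + 2] * x[k + 2];
        acc3 += (int32_t)coeffs[k + 3] * x[k + 3];
    }
    return acc0 + acc1 + acc2 + acc3;    /* combine once at the end */
}
```

Whether the compiler already does this at -O3 is worth checking in the generated assembly before hand-unrolling.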

Quote

By the way, I have tested the ARM DSPlib, thinking it will be fast and optimized, if the function name is "arm_fir_fast_q15". But it is only a tiny fraction faster than my basic implementation above. And provides very silent output signal. So silent one must amplify an elephant to hear something, compared to my FIR which has tendency to overflow quite easily.

arm_fir_fast_q15 uses SIMD and loop unrolling so should be much faster than yours, so I'm not sure why yours is so slow. Note that the filter length must be even so you can't have a 101 tap filter. Also note the scaling limitations:

This fast version uses a 32-bit accumulator with 2.30 format. The accumulator maintains full precision of the intermediate multiplication results but provides only a single guard bit. Thus, if the accumulator result overflows it wraps around and distorts the result. In order to avoid overflows completely the input signal must be scaled down by log2(numTaps) bits. The 2.30 accumulator is then truncated to 2.15 format and saturated to yield the 1.15 result.

With a filter length of 100 or 102 taps you need to scale the input down by 6 or 7 bits to prevent overflow. However, as you say the volume is very low I suggest you need to check the input sample and coefficient scaling.
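The log2(numTaps) rule quoted above can be worked out with a few lines of integer code. This is just an illustrative helper with a made-up name (`fir_q15_guard_bits`), not part of CMSIS-DSP; it rounds up, so it returns 7 for both 100 and 102 taps.

```c
/* Smallest number of bits b such that 2**b >= num_taps, i.e.
 * ceil(log2(num_taps)): how far to pre-scale the input down. */
unsigned fir_q15_guard_bits(unsigned num_taps)
{
    unsigned bits = 0;
    while ((1u << bits) < num_taps)
        bits++;
    return bits;
}
```

The input samples would then be shifted right by that many bits before the filter and the output shifted back up afterwards, at the cost of that much input resolution.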

Have fun!

rockets4kids · « **Reply #45 on:** June 11, 2015, 06:57:57 pm »

I've only just skimmed this thread, but two things:

1. For a Tayloe mixer you want a square wave source, so the Si570 (or similar) is generally preferred.

2. There is already a project that does just about *exactly* what you want: mcHF http://www.m0nka.co.uk/

Yansi · « **Reply #46 on:** June 11, 2015, 08:00:55 pm »

splin: Thank you for your detailed reply. I have toyed with my FIR implementation some long hours again, then looked into the ARM presentation and I might possibly have got it! Hurray! Just to revise some statements:

Quote from: splin on June 11, 2015, 06:01:06 pm

Quote from: Yansi on June 11, 2015, 01:17:39 am
Quote
One doesn't have to dig very deep into the next code examples to find other strange things that don't make any sense. I suspect the document was "obfuscated on purpose" to not show exactly how to do it. I am quite confused reading it. I haven't made any progress in optimizing my code.

I doubt that; I think it's just that it's pitched at people who are familiar with DSP and these types of coding techniques, but not specifically with the Cortex M4 DSP instructions/architecture.

The "final optimised FIR code" is indeed faulty. I have finally simulated the code in great detail DaveCAD style, when I was heading home by train today. It really took my nerve, but I solved the problem. There are variables interchanged!

Now we can piece the presentation together:

And I can prove it with my DaveCAD simulation!

On top you have the state buffer with d(n) samples. The braces over the pairs are the 32bit data pairs for the dual-multiply-accumulate SIMD instructions. Under the buffer there are the sums with their related SIMDs. I have simulated four steps (2 inner loop runs). The small rectangles represents the four SMLAD instructions performed with their respective data inputs:

Each inner loop run consists of 2 groups of 4 SIMDs: The first group is calculated as x0c0, x1c0, x2c0, x3c0. The second group then x2c1, x3c1, x0'c1, x1'c1. (Consider the apostrophe marked variables as newly fetched pairs of samples, c1 as new pair of coefficients). This and only this way it will work.

Also note that the coefficient array must be in reverse order, the FIR is calculated backwards, the last multiplied samples are the newest ones shifted in the state buffer from the "right side".

I haven't made my implementation of this yet, because I have just arrived and haven't even had dinner :-)

After dinner I will put a few words about why the DSPLib was so slow and produced so weak output - I probably figured that out too.. maybe.

Yansi · « **Reply #47 on:** June 11, 2015, 08:35:59 pm »

Oh, I forgot the conclusion: There's nothing wrong with the suggested optimizations. But all of them are done wrong in the presentation :-)

DSPLib being slow: The DSP lib code for calculating blocksizes <4 is non-optimal, is slow. Almost the same as mine. Thus the same speed. You must calculate a block of 4 or more, better a multiple of 4 samples, to never run the slow code..

And I see there one other odd thing, but I might be overlooking something (the code there is quite messy): Where the hell is the loop that scrolls the state buffer after each FIR function call? You need to shift the state buffer blocklength samples left after each processing. (so the pointer never wraps circularly, always starts from the zero index). Where did that loop go?

Conclusion2: The suggested optimizations need a second loop for shifting the state buffer left by blocklength samples before each call of the FIR calculation. (If I got right how the unwound linear buffer works). I don't see there any loop to do this, which is quite strange. The DSPlib FIR probably hasn't worked at all, that's why I was getting only ultraweak output. There has been no scaling on the input!

Conclusion3: The more data you process at once, the more effective this implementation gets. Because there is (should be!) one statebuffer shift per call and the implementation is optimal only for multiples of 4 samples. So how big blocks do you calculate at once in your SDR radios?
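To make the "one state buffer shift per call" idea concrete, here is a small sketch under stated assumptions. It is not the CMSIS-DSP source; `NUM_TAPS`, `BLOCK`, `state` and `fir_block` are hypothetical names, and the accumulator is returned unscaled. The state buffer is longer than the filter (NUM_TAPS - 1 history samples plus one block), so the inner loop reads a straight run of memory with no wrap test, and the history is copied back to the front once per block.

```c
#include <stdint.h>
#include <string.h>

#define NUM_TAPS 4                       /* hypothetical filter length */
#define BLOCK    8                       /* samples processed per call */

/* Linear state buffer: NUM_TAPS-1 old samples followed by one block. */
static int16_t state[NUM_TAPS - 1 + BLOCK];

void fir_block(const int16_t *coeffs, const int16_t *in, int32_t *out)
{
    /* Append the new block after the saved history. */
    memcpy(&state[NUM_TAPS - 1], in, BLOCK * sizeof(int16_t));

    for (unsigned n = 0; n < BLOCK; n++) {
        int32_t acc = 0;
        for (unsigned k = 0; k < NUM_TAPS; k++)      /* no wrap test needed */
            acc += (int32_t)coeffs[k] * state[n + NUM_TAPS - 1 - k];
        out[n] = acc;
    }

    /* The single shift per call: keep the last NUM_TAPS-1 samples. */
    memmove(state, &state[BLOCK], (NUM_TAPS - 1) * sizeof(int16_t));
}
```

The copy costs NUM_TAPS - 1 moves per call regardless of block size, which is why larger blocks amortise it better.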

I will try to implement the code myself now. I don't trust the DSPlib now. It was made too much universal that it lacks the effectivity I am looking for.

danfo098 · « **Reply #48 on:** June 11, 2015, 09:13:53 pm »

Here is some more FIR filter code that you can try. It uses a 256-entry "autoroll" circular buffer, so for it to work the number of coefficients must be equal to or less than that. Haven't actually tested it on the M4 yet so don't know how fast (or slow) it will be.

Code: [Select]

#include <stdio.h>
#include <stdint.h> /* uint8_t, uint16_t */

#define NO_COEFFS    7
#define TEST_LENGTH  1024

float coeffs[NO_COEFFS] = {0.1f, 0.2f, 0.3f, 0.4f, 0.3f, 0.2f, 0.1f};

void fir_init(float gain)
{
  uint8_t n;

  for (n=0;n<NO_COEFFS;n++)
  {
	coeffs[n] *= gain;
  }
}

void fir(float in, float *out)
{
  uint8_t n;
  float acc;

  static uint8_t sampleidx = 0;
  static uint8_t sampleinc = NO_COEFFS + 1;
  static float samplebuf[256] = {0.0f};

  samplebuf[sampleidx] = in;
  acc = 0.0f;

  for(n=0;n<NO_COEFFS;n++)
  {
    acc += coeffs[n] * samplebuf[sampleidx--];
  }

  sampleidx += sampleinc;
  *out = acc;
}

void fir_fast(float in, float *out)
{
  uint8_t n;
  float acc;

  static uint8_t sampleidx = 0;
  static uint8_t sampleinc = NO_COEFFS + 1;
  static float samplebuf[256] = {0.0f};

  samplebuf[sampleidx] = in;
  acc = 0.0f;
  acc += coeffs[0] * samplebuf[sampleidx--];
  acc += coeffs[1] * samplebuf[sampleidx--];
  acc += coeffs[2] * samplebuf[sampleidx--];
  acc += coeffs[3] * samplebuf[sampleidx--];
  acc += coeffs[4] * samplebuf[sampleidx--];
  acc += coeffs[5] * samplebuf[sampleidx--];
  acc += coeffs[6] * samplebuf[sampleidx--];

  sampleidx += sampleinc;
  *out = acc;
}

int main(void)
{
  uint16_t n;
  float inbuf[TEST_LENGTH] = {0.0f};
  float outbuf[TEST_LENGTH] = {0.0f};

  inbuf[0] = 1.0f; /* test the impulse response */
  fir_init(1.0f); /* set filter gain */

  for(n=0;n<TEST_LENGTH;n++)
  {
    fir(inbuf[n], &outbuf[n]);
  }

  for(n=0;n<15;n++)
  {
    printf("Out sample no %d: %f \r\n", n, outbuf[n]);
  }

  return 0;
}
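The "autoroll" in the code above comes from the index being a `uint8_t` into a 256-entry buffer: unsigned 8-bit arithmetic wraps modulo 256, so decrementing past 0 gives 255 and adding past 255 comes back round, with no explicit boundary test. A tiny sketch of just the index arithmetic (the helper name `autoroll_step` is made up for illustration):

```c
#include <stdint.h>

/* Mimics the index handling of fir(): step back 'taps' times while
 * reading, then jump forward taps+1 to land on the next write slot.
 * All arithmetic is modulo 256 because idx is a uint8_t. */
uint8_t autoroll_step(uint8_t idx, uint8_t taps)
{
    for (uint8_t n = 0; n < taps; n++)
        idx--;                           /* 0 wraps to 255 */
    idx += (uint8_t)(taps + 1);          /* 255 wraps back to 0 */
    return idx;
}
```

The net effect of one filter call is therefore to advance the write index by exactly one slot, wherever in the buffer it happens to be.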

Yansi · « **Reply #49 on:** June 11, 2015, 09:24:55 pm »

Thanks for the code. Looks the same as mine (see page 2 of this thread please). It wasn't fast. The optimized version I am implementing now should be better (If I can make it work :-\ ).

Can you please explain me the autoroll buffer?

rockets4kids: I am sorry, I've skipped your answer! Si570 is nice, but look at its price. I can have three DDS chips for its price.
At the page you have posted ( http://www.m0nka.co.uk/ ) I got a big laugh when I saw this:

Quote

I guess is normal for companies like Atmel and ST to completely re-write their support libraries from scratch every few months and keep us developers busy. The latest one is another great achievement in abstraction from reality
Please i want my new Cortex M7 running at 1Ghz, so it can handle this write that is otherwise two Thumb instructions.

