Author Topic: First attempt at FIR bandpass filter on FPGA, critique/timing errors.  (Read 5171 times)

0 Members and 1 Guest are viewing this topic.

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Hello all,

I'm working on a project where I want to implement some FIR bandpass filters in a Spartan 6(XC6SLX9).  I've never done this before so I created a proof of concept module to make sure I understand everything.  I finished the module and comparing its step response with one generated from Matlab it appears to be working but I have a few questions.  Below is the code.
Code: [Select]
module Bandpass_Filter(
input wire clk,
input wire reset,

input wire[11:0] data_in,

output reg signed[11:0] data_out
    );

`include "filter_1_coeff.h"  // This file contains the information about the filter.

integer i;

reg shift;     // Shifts data in.
reg[9:0] filter_chunk;   // Used to time multiplex the 16 DSP slices.
wire signed[47:0] product[0:15];   // Results of multipication.
reg signed[47:0] carry_in;         // Carry in, carry out from last calculation.
reg signed[17:0] data_delay[0:num_taps-1];   // Data in.

assign product[0] = data_delay[0+filter_chunk]*coeff[0+filter_chunk] + carry_in;   // Multiply the first data and coefficent and add in the carry in.
genvar j;
generate
for(j=1; j<16; j=j+1) begin : product_operate // Create 15 more MAC operations.
assign product[j] = (data_delay[j+filter_chunk]*coeff[j+filter_chunk]) + product[j-1]; // Multiply the  data and coefficent and add in the previous carry out.
end
endgenerate

always @(posedge clk or negedge reset) begin
if(!reset) begin
carry_in <= 0;
shift <= 0;
data_out <= 0;
filter_chunk <= 0;

for(i=0; i<num_taps; i=i+1) begin : data_reset
data_delay[i] <= 0;
end
end else begin

carry_in <= product[15];          // Save last value to be the carry in for the next chunk on the next clock cycle.
filter_chunk <= filter_chunk + 16;  // Increment to evaluate the next 16 items of data.
shift <= 0;
if(filter_chunk == num_taps-16) begin  // When the end is reached shift data and give data output.
data_out <= product[15] >>> coeff_scale;
carry_in <= 0;
filter_chunk <= 0;
shift <= 1;
end

// Shift data when new data arrives.
if(shift) begin 
data_delay[0] <= {6'b0, data_in};
for(i=1; i<num_taps; i=i+1) begin : gen_data_shift
data_delay[i] <= data_delay[i-1];
end
end
end
end
endmodule

filter_1_coeff.h
Code: [Select]
localparam num_taps = 256;
localparam coeff_scale = 16;
localparam signed[17:0] coeff[0:255] = '{5, 0, -6, -14, -20, -23, -23, -18, -9, 3, 16, 27, 35, 37, 32, 22, 7, -8, -22, -32, -35, -32, -24, -12, -1, 6, 9, 6, 0, -7, -11, -8, 2, 20, 42, 63, 77, 76, 59, 24, -23, -75, -121, -150, -155, -133, -84, -17, 56, 120, 164, 178, 160, 116, 57, -4, -54, -81, -83, -64, -33, -6, 5, -8, -47, -103, -159, -195, -194, -142, -40, 100, 253, 387, 468, 472, 388, 222, 0, -238, -447, -584, -621, -551, -387, -164, 71, 270, 395, 429, 375, 261, 129, 25, -16, 21, 124, 256, 361, 382, 274, 25, -343, -766, -1151, -1396, -1408, -1136, -581, 192, 1061, 1866, 2438, 2635, 2377, 1661, 579, -702, -1961, -2966, -3518, -3492, -2863, -1718, -241, 1315, 2680, 3611, 3941, 3611, 2680, 1315, -241, -1718, -2863, -3492, -3518, -2966, -1961, -702, 579, 1661, 2377, 2635, 2438, 1866, 1061, 192, -581, -1136, -1408, -1396, -1151, -766, -343, 25, 274, 382, 361, 256, 124, 21, -16, 25, 129, 261, 375, 429, 395, 270, 71, -164, -387, -551, -621, -584, -447, -238, 0, 222, 388, 472, 468, 387, 253, 100, -40, -142, -194, -195, -159, -103, -47, -8, 5, -6, -33, -64, -83, -81, -54, -4, 57, 116, 160, 178, 164, 120, 56, -17, -84, -133, -155, -150, -121, -75, -23, 24, 59, 76, 77, 63, 42, 20, 2, -8, -11, -7, 0, 6, 9, 6, -1, -12, -24, -32, -35, -32, -22, -8, 7, 22, 32, 37, 35, 27, 16, 3, -9, -18, -23, -23, -20, -14, -6, 0};

First, I've never done DSP with an FPGA so I'm open to any critique about the design.  I'm not sure exactly how these are implemented so maybe I'm doing some things in a bad way. 

Second, I'm including a header file which defines the filter coefficients but ISE gives me an error for this.  It doesn't like the single quote before the coefficient declaration(3rd line, the erorr is just 'Unexpected ' found.').  Modelsim is perfectly fine with this and in fact wont work without it.  I believe that's the way you're supposed to define 2 dimensional arrays.  What is the deal here?  If I remove the quote it will synthesize in ISE but I'm not sure it's being synthesized correctly. 

Third, if I remove the quote I mentioned in the second question and synthesize the module.  Assuming it is working correctly, it then fails timing.  I'm trying to run this module at 100Mhz and I'm getting a max path of 50ns.  This seems strange to me and makes me think I'm doing something wrong.  I'm daisy chaining the multiplications such they should naturally infer the DSP slices in an efficient manner.   I believe that's how these are interned to be configured.  So it seems weird that 20Mhz is the fastest they could run.  Am I doing something wrong here or is this normal? 

Thank you for any help or advice!
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Are you able to describe what you are doing in words (not code).

So if I read it right, you want to have a 256-tap FIR filter, in a small FPGA (Spartan 6 LX9 only has 16 DSP slices).

You seem to be implementing it by creating 16 partial products (kernel value * data value) each cycle, and then summing all 16 of them them over 16 cycles.

The design is fully inferred, but you have paid attention to the underlying widths of the multipliers.

You are having problems meeting timing.

Since you asked for a critique....

First thoughts

- You are not paying much attention to how the pipelining works in the multiplier (but isn't your limiting factor)

- You are not careful about how you 'fan in' the filter kernel and constants into the multiplier. This gives poor performance (but isn't your limiting factor)

- You are using single-cycle multiplication. This gives poor performance (but isn't your limiting factor)

- You are adding the 16 products from each multiplier each cycle, and 'ejecting' the total once every 16 cycles. This gives poor performance, as data from literally every DSP block has to come together and be added in a single cycle. This will be your limiting factor.

- You have 'control' signals that are running out to every DSP block on the FPGA. This fan-out might be a limiting factor.

- The device will have high utilization (100% of DSP slices). This gives poor performance unless things are carefully placed.  A bigger FPGA might help. (but isn't your limiting factor)

My suggested action plan

- Outline clearly what results you want/need - e.g. timing requirements, resource utilization, inputs, latency

- Have a read through the DSP block user's guide, have a think about which features could be valuable to you in this case and which are not. You won't get this knowledge experimenting with Verilog and seeing the results.

- Think about how you want it to be structured, and how this will make use of the features of the H/W

- Try getting the structure you want. Use primitives if you have to.

- Once you know what you want is possible, and how it is achievable, then think about how you can get the code to infer what you are after.

My thoughts on a design
This is untested, but an example of how you can do it differently. They are all flawed in some way.

Design idea 1:
Why not have 16 FIR filters, that output once every 256 cycles, set so their outputs are staggered by 16 cycles, otherwise they output 'zero'. Then you can just OR the outputs together rather than using adders.

You can use the block ram as circular buffer to  allow you to write incoming data into memory that isn't needed for the currently running calculation.

Pros: Simple, scalable. Things working locally for a global result.

Cons: Not sure if you will have enough resources to hold 16 copies of the kernel as well.

Design idea 2:
Look carefully how you can use the LUT-4 as a shift register and an indexed memory. With an two 16x18-bit RAM for the data and an 16x18-bit ROM for the kernel you can then have each DSP block do 16th the work. Then pipeline the adder.

Pros: Looks to be the best way, for both resources and performance.

Design idea 3:
If your FIR filter is symmetric, you can wrap the data back on itself, and then you only need half the DSP blocks (assuming that using 17-bit data rather than 16-bit is OK). e.g. total = (d[0]+d[255]) * k[0]) + (d[1]+d[254]) * k[1]....

Pros: Saves a lot of DSP. Might be possible,

Cons: not a good general solution. You lose a bit of precision on the input.
« Last Edit: April 23, 2018, 12:40:19 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 
The following users thanked this post: pigtwo

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Thank you very much for the detailed response. 

Your description of what I was trying to do was exactly right for this module.  But as you asked I can describe what I ultimately want.  This was a test to see if I could get a DSP to work at all.  But the final design is I wan't 15 bandpass filters ranging between 24kHz to about 20Hz(Ex one BP filter will be 10KHz to 16 KHz but another will be 100Hz to 160Hz with other spread between).  I imagine this will be fairly difficult given the magnitudes of the frequencies.  I would like to implement it with a Spartan-6 XC6SLX9 because it is the largest FPGA that comes in a QFP or QFN.   Now if this seems to be impossible
I've been trying to find a reason to buy a reflow oven so this could be it. 

The basic idea of this project is to create a very basic spectrum analyzer.  So I want to take in audio frequencies and display an output for various frequency ranges amplitudes(a basic min/max measurement).  Accuracy is not very important here(although I'd like to try to keep it reasonable for entertainment's sake). 

The final design specs are:
- 100Mhz system clock
- 1Msps ADC
- 15 Bandpass filters ranging between 24Khz and 20Hz(proportionally).
- ~400 outputs a second from all the filters(IE complete processes). 
- Nearly all FPGA resources can be used.
- Latency is unimportant

The filter specs are not super important.  I choose 256 tap FIR because the frequency response looked relativity good.  But I think I could decrease the number of taps to fit the design requirements.  I even could move to IIR if it is necessary. 

In regards to your suggestions I have a lot of questions(hope you don't mind):
Quote
You are not paying much attention to how the pipelining works in the multiplier (but isn't your limiting factor)
So my understanding of pipelining the DSP48A1 slices was basically connecting the carry out of one to the carry in of the next.  Should these be instead registered between slices?  My original understanding is the 16 DSP48A1 slices could all be daisy chained together and be able to complete that operation in one clock cycle.  Maybe this is wrong. 

Quote
You are not careful about how you 'fan in' the filter kernel and constants into the multiplier. This gives poor performance (but isn't your limiting factor)
I don't completely understand this.  My understanding here is each multiplier has various inputs(the filter coefficients).  These all get multiplexed in based on the state of the FPGA.  How should this be done differently?  Since this isn't major feel free to tell me just to look something up.  As long as it's in the right direction I can probably figure it out.

Quote
- Try getting the structure you want. Use primitives if you have to.
Does it make sense to use the DSP48A1 primitive to get better results?  I looked into it but the instance for it was huge!  And people said online that Xilinx was very good at infering these.

Quote
Why not have 16 FIR filters, that output once every 256 cycles, set so their outputs are staggered by 16 cycles, otherwise they output 'zero'. Then you can just OR the outputs together rather than using adders.
To make sure I understand, I would take 1 multiplier and have it do the multiplication and summation over 16 elements.  Then have 16 of those doing this over the data range and finally sum all their outputs at the end for the output.  This seems to avoid long data paths as they all act in parallel instead of in series.  Is this correct?   It seems much better to me. 

I will almost certainly have more questions later but I have to think about your response a little more.  Thank your very much hampser_nz!  I am trying to pivot my career towards FPGAs so this is very helpful!   
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3137
  • Country: ca
If you want to do 15 filters which run in parallel and you have 16 DSPs which you want to use, then each filter gets only one DSP, right?

If you want to stick to DSPs, the filter can do only one multiplication per clock. What it is going to multiply? There's not much choice here. It's going to be the most current data point and one of the coefficients. This sounds much simpler than what your design does:

Code: [Select]
result <= result + data_in*coefficient;
This will be done inside DSP (hopefully), and the only thing left is to fetch the correct coefficient. You can use a shift register, or you can use BRAM. Since you'll need to fetch coefficients for another 14 filters (some of which can share the BRAM block), and you're not in a hurry (DSP is rather slow), BRAM is probably better. To organize the BRAM fetch, you need to maintain a counter, incrementing it on every clock:

Code: [Select]
counter <= counter + 1;
and you fetch from BRAM into a register at every clock:

Code: [Select]
coefficient <= coefficient_storage[counter];
The register here helps to make sure that the DSP doesn't have to wait for the BRAM fetch which will make things dramatically faster. You may want to look at the DSP block docs to see if adding more registers may help to pipeline MADD operations, which will make things even faster.

Of course, you need to make sure that new data_in is fetched every clock.

You also need to check when counter overflows, get the result and reset everything afterwards. Looks simple enough.

It'll take 256 clocks (for 256 taps) to get the result, then everything starts over. Won't work any faster. Other 14 filters work at the same time, doing the same thing, but with other sets of coefficients.

« Last Edit: April 23, 2018, 04:00:30 am by NorthGuy »
 
The following users thanked this post: pigtwo

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
I guess what I am trying to say is you have lots of options, if you only want to process 192kS/s, then you have lots of ways to implement it.

If you want a 999-tap filter, I would suggest you think about running at 192MHz, and have logic like this (not in any particular HDL):

Code: [Select]
m[0:1024]  <<< memory for buffering samples
k[0:1023] <<< filter kernel constants

every clock cycle
  if new_sample_in = 1 then
    m[n] <= sample_in;
    j = 0;
    a = 0;
    n = MOD(n+1,1024);
    new_sample_out = 0;
  else if j < 999 then
    a = a + m[mod(n+j,1024)] * j[j];
    j++;
    new_sample_out = 0;
  else if j = 999 then
    filter_out = a;
    new_sample_out = 1;
    j++;
  else
    new_sample_out = 0;
  end if;
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online radar_macgyver

  • Frequent Contributor
  • **
  • Posts: 687
  • Country: us
To hamster_nz's point about fan-in and fan-out control, look up the term 'systolic filter' and you'll find some implementations that ensure that the large filter is broken up into small chunks that only interact with each other, and don't end up causing control signals to be routed to all the chunks. This greatly improves timing.

A reference I used back in the day:
https://www.xilinx.com/publications/archives/books/dsp.pdf

Any reason the FIR filter IP core won't work for you?
 
The following users thanked this post: pigtwo

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
What you are wanting to do is to work out:

   (taps_in_filter * sample_rate)

and balance that against:
 
   (MAC_blocks_used * design_clock_speed)

given the set of constraints you are under

* "sample_rate" is determined by your application. Usually higher than 1.2x the bandwidth of interest.

* "taps_in_filter " is determined by your application and the performance you need from your filter. For high performance it might be 5x or 7x the wavelength (in samples) of the critical frequency of the filter. (Rough rule of thumb)

* "design_clock_speed"'s upper limit is determined by your FPGA architecture and speed grade, and your ability to design to that speed. For a non-specialist designer 180 MHz would be a sensible number for Spartan-3, 200 MHz for Spartan-6, and maybe 240 MHz for 7-series. To get the datasheet 'hero' numbers (300MHz+) requires a specialist, and an ideal fit for the application.

* The upper limit for "MAC_blocks_used" is determined by the size of your FPGA. If your multiplication takes more than the native size of the DSP block (18-bit x 18-bit ) then you might need two or more DSP blocks to implement each MAC.

So....

If (MAC_blocks_used * design_clock_speed) is less than (taps_in_in_filter * sample_rate) you don't have enough resources to make the filter work, so go back to the drawing board and change some other numbers

If  (MAC_blocks_used * design_clock_speed) is greater than (taps_in_in_filter * sample_rate), or "design_clock_speed" is too low (<100MHz), then you are wasting resource. Your FPGA is idle most of the time.

However, when "dsp_blocks_used" gets too high, it becomes hard to get the right information in the right place at the right time, even if you exploit the architecture features. Having a bigger FPGA doesn't usually make that better. A bigger die gives longer routing delays.

So for each filter out your values for 'taps_in_filter', 'sample_rate', 'MAC_blocks_used', and 'design_clock_speed' and that should give you a pretty good idea of what you need to do, and if you have spare cycles for housekeeping or if everything must mesh precisely together.

....or just use the FIR filter IP block, and accept it as being magic :-)
« Last Edit: April 23, 2018, 05:28:18 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
12
« Reply #7 on: April 23, 2018, 09:26:22 pm »
I did some testing last night.

An 256-tap FIR filter. Decomposed into sixteen 16-tap filters, with the DSP blocks inferred, with pipelining:
Code: [Select]
            -- The inferred multiplier, with pipelining
            product <= a * b;
            a <= kernel(to_integer(count));
            b <= memory(to_integer(index));

Data has to be supplied once every 16 cycles. Latency about 2,056 cycles or so (a few cycles more than it takes to get 128 data items).

Implemented targeting an Spartan 6LX9-2.
- Usage about 1/3rd of the fabric, 100% of DSP blocks
- Max clock frequency around 170 MHz, when constrained for 6ns (a 166 MHz design target).

Performance,
- Throughput ~ 10.5 M 18-bit data items per second
- 2,688M 18x18 MAC operations per second.
- 40-bit signed result (just sent straight to I/O pins)

I can see a few tricks that I would like to try, most of which will reduce fabric usage.

I want to spend a while verifying it and making the filter kernel a parameter rather than hard-coded...
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3137
  • Country: ca
Re: 12
« Reply #8 on: April 23, 2018, 11:51:38 pm »
An 256-tap FIR filter. Decomposed into sixteen 16-tap filters, with the DSP blocks inferred, with pipelining:

I think you need to drop the idea of decompositions.

The OP wants to run 15 different FIR filters. So, if you assign one filter to one DSP, you will fully utilize 15 of 16  DSPs without any decomposition.

You get a value from ADC at 1 MHz.  You run the filter at 256 MHz - all 256 taps at once, one tap per clock, using one DSP. 256 MHz shouldn't be a problem. This gives you 1 MValues/s per DSP. If you run 16 of these in parallel (each with a different set of coefficients), you get 16 MValues/s throughput.

Such approach is much simpler, easier to write, and will run faster. It will also produce much better resource utilization.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: 12
« Reply #9 on: April 24, 2018, 01:52:42 am »
An 256-tap FIR filter. Decomposed into sixteen 16-tap filters, with the DSP blocks inferred, with pipelining:

I think you need to drop the idea of decompositions.

The OP wants to run 15 different FIR filters. So, if you assign one filter to one DSP, you will fully utilize 15 of 16  DSPs without any decomposition.

You get a value from ADC at 1 MHz.  You run the filter at 256 MHz - all 256 taps at once, one tap per clock, using one DSP. 256 MHz shouldn't be a problem. This gives you 1 MValues/s per DSP. If you run 16 of these in parallel (each with a different set of coefficients), you get 16 MValues/s throughput.

Such approach is much simpler, easier to write, and will run faster. It will also produce much better resource utilization.

Definitely agree. And 1MS/s that is way overkill for audio work - with a 256-tap FIR you would be hard pressed to do anything useful in the audio range.

I just wanted to see what peak throughout was possible on Spartan 6 without any big design investment.... around half the "capable of operating at up to 390 MHz" mentioned for the DSP block in the datasheet.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
@NorthGuy
Quote
If you want to do 15 filters which run in parallel and you have 16 DSPs which you want to use, then each filter gets only one DSP, right?
My original justification for using 16 dsp slices in series was so that I could use all of them.  I thought of the way you describe it but I was bothered by not using that last dsp slice.  But from what you and others have said it seems it's not worth changing the overall architecture to use that last dsp slice. 

Quote
The OP wants to run 15 different FIR filters. So, if you assign one filter to one DSP, you will fully utilize 15 of 16  DSPs without any decomposition.

You get a value from ADC at 1 MHz.  You run the filter at 256 MHz - all 256 taps at once, one tap per clock, using one DSP. 256 MHz shouldn't be a problem. This gives you 1 MValues/s per DSP. If you run 16 of these in parallel (each with a different set of coefficients), you get 16 MValues/s throughput.

Such approach is much simpler, easier to write, and will run faster. It will also produce much better resource utilization.
That does sound much better.  I'm going to try this this weekend.   I was originally a little worried about raising the frequency that high but since it's all inside the FPGA I guess I shouldn't have to worry about it too much.  Plus it will be good experience in working with higher frequencies in FPGAs.  Previously the highest frequency circuit board/FPGA design I've made is 50Mhz.   

@radar_macgyver Thank you for the link.  I will have to look through that this weekend.  It looks very helpful. 

@hampster_nz Thank you again for the detailed response.  I've read through it but I'll probably have to re-read it a couple times to internalize it.  I think I have a much better understanding of what to do now. 

As NorthGuy said 1MS/s is probably way too much and I noticed it was very difficult to get a good frequency response at that rate(without a huge number of taps).  So I actually designed the original filter with a sample rate of 100KS/s which I forgot to mention earlier.  Not to mention all I want to do is a very basic amplitude measurement so a large number of samples probably isn't necessary.  I think I will give it another shot using one filter per channel. 

The reason I'm not using a FIR filter core is because I'm trying to create a resume of projects to show potential employers.   I would like to get a job coding FPGAs but currently I only have 3 years of mild PCB/circuit design experience so I need something to show.  So I figured I should do it manually.  I figure anyone could just load a core and have it work.  Also it seems like a good learning exercise.  It has been pretty interesting so far. 

I'll probably have more questions as I re-read.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Something I forgot to ask earlier, is the way I stored the filter coefficients a good way to do this?  I wasn't sure how to define them another way.  Also, how are these stored in the FPGA?  Does each coefficient take up some amount of CLBs or is this done in a more efficent manner?  I know NorthGuy mentioned using BRAM but I'm not sure how to use this exactly. 
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3137
  • Country: ca
Re: 12
« Reply #12 on: April 24, 2018, 04:02:01 am »
Definitely agree. And 1MS/s that is way overkill for audio work - with a 256-tap FIR you would be hard pressed to do anything useful in the audio range.

My bad. I didn't get this was for Audio.  It is probably a good idea to run each filter at its own frequency. This way it's possible to get better results with lesser number of taps. The slower filters will be slow, so you probably can use a single DSP to run several filters

I just wanted to see what peak throughout was possible on Spartan 6 without any big design investment.... around half the "capable of operating at up to 390 MHz" mentioned for the DSP block in the datasheet.

You should be able to get very close to the datasheet figure. There may be problems with long routing.  The internal registers may be missing somewhere. DSP has lots of built-in registers which all must be used to get peak performance. When you couple it with BRAM, you also need to infer BRAM's output register. Along with the input registers of the DSP, it creates a chain of registers. Also, outside things, such as counters may need to be pipelined. I'm sure if you dig into it, you'll be able to run it much faster than 170 MHz.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3137
  • Country: ca
Something I forgot to ask earlier, is the way I stored the filter coefficients a good way to do this?  I wasn't sure how to define them another way.  Also, how are these stored in the FPGA?  Does each coefficient take up some amount of CLBs or is this done in a more efficent manner?  I know NorthGuy mentioned using BRAM but I'm not sure how to use this exactly.

You have array x. If BRAM is used, the coefficients are used to initialize BRAM. Then you use as x[index]. When this happens, "index" gets wired to the address input of the BRAM, the next clock, the coefficient appears on the data output of the BRAM, so the data output is wired to the wherever x[index] is used.

There's also LUT memory, which is smaller, but it is combinatorial, so you can do things like x[y[index]] within one clock (although this is not necessarily fast). The tools select the type of memory based on how you use it. If you want particular kind of memory, you often can infer it, or you can simply instantiate it.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Just glancing through the "Spartan-6 FPGA DSP48A1 Slice User Guide" ( https://www.xilinx.com/support/documentation/user_guides/ug389.pdf ). I forgot just how many neat ideas it had in it.

I am then left with lots of wonderings like if:

Code: [Select]
   if restart = 1 then
     accum <= zero_48_bits + product;
   else
     accum <= accum + product;
   end if;

...will infer the use of the fast path from the P Reg to the Z Mux (by adjusting OpMode[3:2] bits as needed), or go the long way via the 'C' input. Time to glance a the synthesis manuals....


(attached image is from the above user guide)

Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: 12
« Reply #15 on: April 25, 2018, 11:49:03 am »
You should be able to get very close to the datasheet figure. There may be problems with long routing.  The internal registers may be missing somewhere. DSP has lots of built-in registers which all must be used to get peak performance. When you couple it with BRAM, you also need to infer BRAM's output register. Along with the input registers of the DSP, it creates a chain of registers. Also, outside things, such as counters may need to be pipelined. I'm sure if you dig into it, you'll be able to run it much faster than 170 MHz.

After a lot of mucking around, I finally got all the correct registers to be inferred in the inferred DSP block, and the kernel and data stream stored in inferred BRAM blocks. Fmax is now 280MHz.

Sort of interesting what a bit of tweaking can do...

Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3137
  • Country: ca
Re: 12
« Reply #16 on: April 25, 2018, 06:12:01 pm »
After a lot of mucking around, I finally got all the correct registers to be inferred in the inferred DSP block, and the kernel and data stream stored in inferred BRAM blocks. Fmax is now 280MHz.

Sort of interesting what a bit of tweaking can do...

The amount of gain you extracted by the tweaking (roughly 1.6 times) is comparable to the performance gain you would get by moving to UltraScale+. Quite impressive.

I'm sure it can go even faster, but the faster it runs the harder it is to satisfy timing.

 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: 12
« Reply #17 on: April 25, 2018, 06:55:38 pm »
After a lot of mucking around, I finally got all the correct registers to be inferred in the inferred DSP block, and the kernel and data stream stored in inferred BRAM blocks. Fmax is now 280MHz.

Sort of interesting what a bit of tweaking can do...

The amount of gain you extracted by the tweaking (roughly 1.6 times) is comparable to the performance gain you would get by moving to UltraScale+. Quite impressive.

I'm sure it can go even faster, but the faster it runs the harder it is to satisfy timing.

Nope, it own't run any faster and still meet timing.

The switching limit for Block RAM is 280 MHz for the -2 grade, so it has 'maxed out'.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3137
  • Country: ca
Re: 12
« Reply #18 on: April 25, 2018, 07:58:34 pm »
Nope, it own't run any faster and still meet timing.

The switching limit for Block RAM is 280 MHz for the -2 grade, so it has 'maxed out'.

I see. That's rather slow. It's way faster in 7-series.

It is possible to do it without BRAM, using distributed RAM and/or SRLs, although this will take more resources.

For the data queue, SRLs may work faster than BRAM. If you rotate data through SRL chain with N+1 elements (N being the number of taps), it'll automatically supply the data in the correct order.

You can try to use distributed RAM for coefficients - it must be fast in read-only mode. SRLs may also work.

Another possibility is to use BRAM, but run it at 1/2 clock rate. You can fetch two values per clock.

 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
It's finally the weekend so I've had some more time to look at this again and I've redesigned as suggested by using only a single multiplier per channel.  But I'm still running into the issue that ISE doesn't like the way I've defined my coefficients.  Below is an example of how I do it.
Code: [Select]
localparam num_taps = 1000;
localparam coeff_scale = 16;
localparam signed[17:0] coeff[0:999] = '{0, 3, 5, 8, 9, 10, 10, 9, 6, 3, 0, -4, -8, -11, -13, -14, -13, -11, -8, -4, 0, 4, 9, 12, 14, 15, 14, 12, 9, 5, 0, -4, -7, -10, -12, -12, -11, -9, -7, -4, -1, 2, 4, 6, 6, 6, 5, 4, 2, 1, 0, 0, 0, 1, 2, 3, 4, 4, 4, 3, 1, -2, -5, -8, -11, -13, -14, -14, -12, -8, -3, 3, 9, 15, 20, 23, 24, 22, 19, 13, 5, -3, -11, -19, -25, -28, -29, -27, -22, -15, -7, 2, 11, 19, 24, 27, 28, 25, 20, 14, 6, -1, -8, -13, -17, -18, -18, -15, -12, -8, -3, 0, 3, 4, 3, 2, 0, -2, -3, -4, -3, 0, 4, 9, 14, 19, 23, 24, 23, 19, 11, 2, -10, -21, -32, -40, -45, -46, -42, -33, -20, -4, 13, 30, 44, 55, 61, 61, 54, 43, 26, 7, -13, -31, -47, -58, -63, -62, -55, -43, -27, -9, 9, 25, 38, 46, 48, 46, 39, 29, 18, 6, -4, -12, -17, -18, -16, -13, -8, -3, 0, 1, -1, -5, -12, -20, -27, -33, -35, -32, -25, -13, 4, 23, 42, 59, 73, 80, 79, 69, 51, 26, -3, -35, -65, -90, -108, -115, -112, -96, -71, -38, 0, 39, 74, 102, 121, 128, 122, 104, 76, 42, 4, -33, -65, -89, -104, -107, -100, -83, -60, -33, -6, 19, 39, 51, 56, 54, 46, 34, 21, 10, 2, -2, 0, 6, 16, 26, 34, 38, 36, 26, 9, -14, -42, -70, -95, -114, -122, -117, -99, -68, -25, 24, 76, 124, 163, 188, 195, 183, 152, 103, 42, -25, -93, -152, -199, -226, -231, -214, -175, -120, -53, 19, 87, 146, 188, 211, 212, 193, 156, 106, 49, -9, -60, -101, -127, -137, -132, -114, -87, -55, -24, 1, 19, 26, 24, 14, 0, -14, -25, -28, -20, -1, 27, 63, 100, 134, 158, 167, 158, 128, 78, 12, -65, -144, -216, -272, -305, -309, -280, -221, -135, -30, 84, 196, 292, 363, 399, 397, 356, 278, 172, 48, -82, -203, -304, -373, -406, -398, -351, -272, -169, -55, 59, 160, 239, 289, 305, 290, 248, 186, 113, 39, -26, -75, -105, -113, -103, -79, -49, -20, 0, 6, -6, -34, -75, -123, -169, -203, -216, -201, -155, -79, 23, 141, 262, 371, 454, 497, 492, 432, 321, 166, -21, -219, -409, -570, -682, -731, -708, -613, -453, -241, 0, 247, 475, 659, 781, 826, 789, 676, 498, 273, 26, -217, -431, -595, -693, -719, -673, -564, -408, -225, -38, 132, 267, 355, 391, 377, 323, 242, 152, 70, 12, -13, 0, 47, 116, 193, 258, 291, 277, 205, 72, -113, -335, -568, -781, -943, -1023, -997, -852, -590, -224, 215, 686, 1140, 1523, 1785, 1886, 1799, 1516, 1051, 439, -267, -999, -1682, -2241, -2609, -2737, -2596, -2185, -1533, -694, 256, 1225, 2115, 2833, 3298, 3454, 3275, 2766, 1969, 954, -183, -1332, -2377, -3213, -3752, -3934, -3735, -3170, -2288, -1173, 66, 1310, 2435, 3329, 3904, 4102, 3904, 3329, 2435, 1310, 66, -1173, -2288, -3170, -3735, -3934, -3752, -3213, -2377, -1332, -183, 954, 1969, 2766, 3275, 3454, 3298, 2833, 2115, 1225, 256, -694, -1533, -2185, -2596, -2737, -2609, -2241, -1682, -999, -267, 439, 1051, 1516, 1799, 1886, 1785, 1523, 1140, 686, 215, -224, -590, -852, -997, -1023, -943, -781, -568, -335, -113, 72, 205, 277, 291, 258, 193, 116, 47, 0, -13, 12, 70, 152, 242, 323, 377, 391, 355, 267, 132, -38, -225, -408, -564, -673, -719, -693, -595, -431, -217, 26, 273, 498, 676, 789, 826, 781, 659, 475, 247, 0, -241, -453, -613, -708, -731, -682, -570, -409, -219, -21, 166, 321, 432, 492, 497, 454, 371, 262, 141, 23, -79, -155, -201, -216, -203, -169, -123, -75, -34, -6, 6, 0, -20, -49, -79, -103, -113, -105, -75, -26, 39, 113, 186, 248, 290, 305, 289, 239, 160, 59, -55, -169, -272, -351, -398, -406, -373, -304, -203, -82, 48, 172, 278, 356, 397, 399, 363, 292, 196, 84, -30, -135, -221, -280, -309, -305, -272, -216, -144, -65, 12, 78, 128, 158, 167, 158, 134, 100, 63, 27, -1, -20, -28, -25, -14, 0, 14, 24, 26, 19, 1, -24, -55, -87, -114, -132, -137, -127, -101, -60, -9, 49, 106, 156, 193, 212, 211, 188, 146, 87, 19, -53, -120, -175, -214, -231, -226, -199, -152, -93, -25, 42, 103, 152, 183, 195, 188, 163, 124, 76, 24, -25, -68, -99, -117, -122, -114, -95, -70, -42, -14, 9, 26, 36, 38, 34, 26, 16, 6, 0, -2, 2, 10, 21, 34, 46, 54, 56, 51, 39, 19, -6, -33, -60, -83, -100, -107, -104, -89, -65, -33, 4, 42, 76, 104, 122, 128, 121, 102, 74, 39, 0, -38, -71, -96, -112, -115, -108, -90, -65, -35, -3, 26, 51, 69, 79, 80, 73, 59, 42, 23, 4, -13, -25, -32, -35, -33, -27, -20, -12, -5, -1, 1, 0, -3, -8, -13, -16, -18, -17, -12, -4, 6, 18, 29, 39, 46, 48, 46, 38, 25, 9, -9, -27, -43, -55, -62, -63, -58, -47, -31, -13, 7, 26, 43, 54, 61, 61, 55, 44, 30, 13, -4, -20, -33, -42, -46, -45, -40, -32, -21, -10, 2, 11, 19, 23, 24, 23, 19, 14, 9, 4, 0, -3, -4, -3, -2, 0, 2, 3, 4, 3, 0, -3, -8, -12, -15, -18, -18, -17, -13, -8, -1, 6, 14, 20, 25, 28, 27, 24, 19, 11, 2, -7, -15, -22, -27, -29, -28, -25, -19, -11, -3, 5, 13, 19, 22, 24, 23, 20, 15, 9, 3, -3, -8, -12, -14, -14, -13, -11, -8, -5, -2, 1, 3, 4, 4, 4, 3, 2, 1, 0, 0, 0, 1, 2, 4, 5, 6, 6, 6, 4, 2, -1, -4, -7, -9, -11, -12, -12, -10, -7, -4, 0, 5, 9, 12, 14, 15, 14, 12, 9, 4, 0, -4, -8, -11, -13, -14, -13, -11, -8, -4, 0, 3, 6, 9, 10, 10, 9, 8, 5, 3};


I get the error:
Code: [Select]
ERROR:HDLCompiler:806 - "filter_coeff_15.h" Line 3: Syntax error near "'".


So it doesn't like the single quote before the "{".  But my understanding is that this is how this is defined.  If I remove the single quote then it will compile but I get 12000 warnings and the design is too large for the FPGA.  I think it's treating the coefficient as like 18000 bit register or something.  What am I doing wrong here? 

In case it's useful below is my current code.  I haven't tested it yet because of the above problem so there might be errors. 
Code: [Select]
`timescale 1ns / 1ps
module Bandpass_Filter(
input wire clk,
input wire reset,

input wire[11:0] data_in,

output reg signed[11:0] data_out
    );

`include "filter_coeff_15.h"  // This file contains the information about the filter.

integer i;

reg shift;     // Shifts data in.
reg[15:0] data_index;  // Index of the delay data.  This should be parameteized later.
reg signed[17:0] data_delay[0:num_taps-1];   // Data in.
wire signed[47:0] product;   // Results of multipication.
reg signed[47:0] carry_in;         // Carry in, carry out from last calculation.

always @(posedge clk or negedge reset) begin
if(!reset) begin
shift <= 0;
carry_in <= 0;
data_out <= 0;

for(i=0; i<num_taps; i=i+1) begin : data_reset
data_delay[i] <= 0;
end
end else begin
if(!shift) begin       // Operate on the data when not shifting.
if(data_index < num_taps) begin
data_index <= data_index + 1;
carry_in <= data_delay[data_index]*coeff[data_index] + carry_in;  // Later buffer the coefficent value rather than accessing directly.
end else begin
shift <= 1;  // Delete later
end
end else begin // Shift data when new data arrives.
data_index <= 0;
carry_in <= 0;
shift <= 0;    // Delete later.
data_out <= carry_in >>> coeff_scale;   // Scale the data and output it.
data_delay[0] <= {6'b0, data_in};  // Take in the new data;
for(i=1; i<num_taps; i=i+1) begin : gen_data_shift  // Shift all the data.
data_delay[i] <= data_delay[i-1];
end
end
end
end
endmodule
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
You can't reset the contents of a large memory block in one cycle. Try removing that.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Yep that was it.  Now the design easily fits.

I still get 12000 warnings for some reason but I think it's something to do with the coefficient localparam.   Almost all the warnings look like this:
Code: [Select]
WARNING:Xst:1895 - Due to other FF/Latch trimming, FF/Latch <data_delay_999_17384> (without init value) has a constant value of 0 in block <Bandpass_Filter>. This FF/Latch will be trimmed during the optimization process.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
I've been trying for a few hours but I can't seem to get the design to do what I want to meet timing.  I'm trying to run the module at 200MHz but I can't seem to get it above 128MHz.  I have two main problems that I think are preventing me.  My longest path is from the coefficient values to the accumulator so I'm trying to shorten that.

My current code is posted at the bottom for reference.  It's still a work in progress and I'm mostly working on timing so there is probably logic problems or off by one errors. 

First, I'm trying to pre-fetch the coefficient and data values so that there is as minimal delay between the input register that holds the values and the inputs of the DSP slice.  When I've attempted this it seems that the register I'm trying to infer gets removed.  See the attached picture.  It seems the the coefficient value is taken directly from BRAM to the multiplier then the accumulator.  I was thinking maybe it was using the A and B register inside the DSP slice but the timing report shows the path going from BRAM all the way to the accumulator.  Am I doing something wrong in my definition of the pre-fetch register?

The second problem I'm having is inferring the P reg to Z mux loop as hamster_NZ mentioned.  I got it to do it once but there were other problems.  When I fixed those it went back to creating an accumulator outside of the DSP slice and using the M output.  I don't understand why it would do this.  I tried to force it to multiply then sum but it always does the summation outside the DSP block.  How can you correctly infer the accumulator to be inside the DSP block?

Also, I'm using the schematic viewer to figure this stuff out. Is that the best tool for this?  I notice I can't really see inside the DSP48A1 to see exactly how it's configured.  Is there a way to do this?  Or maybe because it's governed by opcodes you can't?

Thanks again for the help. 

Code: [Select]
`timescale 1ns / 1ps

module Bandpass_Filter(
input wire clk,
input wire reset,

input wire shift,
input wire[11:0] data_in,

output reg signed[11:0] data_out
    );

`include "filter_coeff_15.h"  // This file contains the information about the filter.

integer i;

// reg shift;     // Shifts data in.
reg[15:0] data_index;  // Index of the delay data.  This should be parameteized later.
reg first;
reg signed[17:0] coeff_pre_fetch; 
reg signed[17:0] data_pre_fetch;
reg signed[17:0] data_delay[0:num_taps-1];   // Data in.
wire signed[47:0] product;
reg signed[47:0] carry_in;         // Carry in, carry out from last calculation.

assign product = data_pre_fetch*coeff_pre_fetch;

always @(posedge clk or negedge reset) begin
if(!reset) begin
first <= 1;
carry_in <= 0;
data_out <= 0;
end else begin
data_pre_fetch <= data_delay[data_index+1];
coeff_pre_fetch <= coeff[data_index+1];
data_index <= data_index + 1;
if(data_index < num_taps) begin
if(first) begin
first <= 0;
carry_in <= product + 48'b0;
end else begin
carry_in <= product + carry_in;
end
end else begin
carry_in <= carry_in;
end

if(shift) begin // Shift data when new data arrives.
first <= 1;
data_index <= 0;
data_out <= carry_in >>> coeff_scale;   // Scale the data and output it.
data_delay[0] <= {6'b0, data_in};  // Take in the new data;
for(i=1; i<num_taps; i=i+1) begin : gen_data_shift  // Shift all the data.
data_delay[i] <= data_delay[i-1];
end
end
end
end
endmodule

 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
I've been trying my hand a Verilog, and learnt something...

Code: [Select]
localparam signed[17:0] coeff[0:255] = {
      5,   0, -6, -14, -20, -23, -23, -18, -9, 3, 16, 27, 35, 37, 32,     
     22,   7, -8, -22, -32, -35, -32, -24, -12, -1, 6, 9, 6, 0, -7,  -11, -8, 2, 20, 42, 63, 77, 76, 59,
  24, -23, -75, -121, -150, -155, -133, -84, -17, 56, 120, 164, 178, 160, 116, 57, -4, -54, -81, -83,
-64, -33, -6, 5, -8, -47, -103, -159, -195, -194, -142, -40, 100, 253, 387, 468, 472, 388, 222, 0,
-238, -447, -584, -621, -551, -387, -164, 71, 270, 395, 429, 375, 261, 129, 25, -16, 21, 124, 256,
361, 382, 274, 25, -343, -766, -1151, -1396, -1408, -1136, -581, 192, 1061, 1866, 2438, 2635, 2377,
1661, 579, -702, -1961, -2966, -3518, -3492, -2863, -1718, -241, 1315, 2680, 3611, 3941, 3611, 2680,
1315, -241, -1718, -2863, -3492, -3518, -2966, -1961, -702, 579, 1661, 2377, 2635, 2438, 1866, 1061,
192, -581, -1136, -1408, -1396, -1151, -766, -343, 25, 274, 382, 361, 256, 124, 21, -16, 25, 129, 261,
375, 429, 395, 270, 71, -164, -387, -551, -621, -584, -447, -238, 0, 222, 388, 472, 468, 387, 253, 100,
-40, -142, -194, -195, -159, -103, -47, -8, 5, -6, -33, -64, -83, -81, -54, -4, 57, 116, 160, 178,
164, 120, 56, -17, -84, -133, -155, -150, -121, -75, -23, 24, 59, 76, 77, 63, 42, 20, 2, -8, -11, -7,
0, 6, 9, 6, -1, -12, -24, -32, -35, -32, -22, -8, 7, 22, 32, 37, 35, 27, 16, 3, -9, -18, -23, -23, -20, -14, -6, 0};

Gives the following error:

Code: [Select]
WARNING:HDLCompiler:413 - "C:\Users\Hamster\Desktop\Projects\filter_test\bandpass.v" Line 11: Result of 8163-bit expression is truncated to fit in 4608-bit target.

This makes me think that it is attempting to concatenate 255 32-bit binary numbers together, rather than 255 18-bit numbers
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Yeah, I've been running into that.  I don't know what the deal is here because I believe it should be:
Code: [Select]
localparam signed[17:0] coeff[0:255] = '{ 5,   0, ....};

But ISE doesn't like that and throws an error on the single quote.  Modelsim seems fine with it so I don't know what the deal is.

I think you're right and that might be what's causing my problems.
 


Share me

Digg  Facebook  SlashDot  Delicious  Technorati  Twitter  Google  Yahoo
Smf