Author Topic: First attempt at FIR bandpass filter on FPGA, critique/timing errors.  (Read 5229 times)


Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Hello all,

I'm working on a project where I want to implement some FIR bandpass filters in a Spartan-6 (XC6SLX9).  I've never done this before, so I created a proof-of-concept module to make sure I understand everything.  I finished the module, and comparing its step response with one generated in Matlab, it appears to be working, but I have a few questions.  Below is the code.
Code: [Select]
module Bandpass_Filter(
    input wire clk,
    input wire reset,

    input wire[11:0] data_in,

    output reg signed[11:0] data_out
    );

`include "filter_1_coeff.h"  // This file contains the information about the filter.

integer i;

reg shift;     // Shifts data in.
reg[9:0] filter_chunk;   // Used to time multiplex the 16 DSP slices.
wire signed[47:0] product[0:15];   // Results of multiplication.
reg signed[47:0] carry_in;         // Carry in, carry out from last calculation.
reg signed[17:0] data_delay[0:num_taps-1];   // Data in.

assign product[0] = data_delay[0+filter_chunk]*coeff[0+filter_chunk] + carry_in;   // Multiply the first data and coefficient and add in the carry in.
genvar j;
generate
    for(j=1; j<16; j=j+1) begin : product_operate // Create 15 more MAC operations.
        assign product[j] = (data_delay[j+filter_chunk]*coeff[j+filter_chunk]) + product[j-1]; // Multiply the data and coefficient and add in the previous carry out.
    end
endgenerate

always @(posedge clk or negedge reset) begin
    if(!reset) begin
        carry_in <= 0;
        shift <= 0;
        data_out <= 0;
        filter_chunk <= 0;

        for(i=0; i<num_taps; i=i+1) begin : data_reset
            data_delay[i] <= 0;
        end
    end else begin

        carry_in <= product[15];          // Save last value to be the carry in for the next chunk on the next clock cycle.
        filter_chunk <= filter_chunk + 16;  // Increment to evaluate the next 16 items of data.
        shift <= 0;
        if(filter_chunk == num_taps-16) begin  // When the end is reached shift data and give data output.
            data_out <= product[15] >>> coeff_scale;
            carry_in <= 0;
            filter_chunk <= 0;
            shift <= 1;
        end

        // Shift data when new data arrives.
        if(shift) begin
            data_delay[0] <= {6'b0, data_in};
            for(i=1; i<num_taps; i=i+1) begin : gen_data_shift
                data_delay[i] <= data_delay[i-1];
            end
        end
    end
end
endmodule

filter_1_coeff.h
Code: [Select]
localparam num_taps = 256;
localparam coeff_scale = 16;
localparam signed[17:0] coeff[0:255] = '{5, 0, -6, -14, -20, -23, -23, -18, -9, 3, 16, 27, 35, 37, 32, 22, 7, -8, -22, -32, -35, -32, -24, -12, -1, 6, 9, 6, 0, -7, -11, -8, 2, 20, 42, 63, 77, 76, 59, 24, -23, -75, -121, -150, -155, -133, -84, -17, 56, 120, 164, 178, 160, 116, 57, -4, -54, -81, -83, -64, -33, -6, 5, -8, -47, -103, -159, -195, -194, -142, -40, 100, 253, 387, 468, 472, 388, 222, 0, -238, -447, -584, -621, -551, -387, -164, 71, 270, 395, 429, 375, 261, 129, 25, -16, 21, 124, 256, 361, 382, 274, 25, -343, -766, -1151, -1396, -1408, -1136, -581, 192, 1061, 1866, 2438, 2635, 2377, 1661, 579, -702, -1961, -2966, -3518, -3492, -2863, -1718, -241, 1315, 2680, 3611, 3941, 3611, 2680, 1315, -241, -1718, -2863, -3492, -3518, -2966, -1961, -702, 579, 1661, 2377, 2635, 2438, 1866, 1061, 192, -581, -1136, -1408, -1396, -1151, -766, -343, 25, 274, 382, 361, 256, 124, 21, -16, 25, 129, 261, 375, 429, 395, 270, 71, -164, -387, -551, -621, -584, -447, -238, 0, 222, 388, 472, 468, 387, 253, 100, -40, -142, -194, -195, -159, -103, -47, -8, 5, -6, -33, -64, -83, -81, -54, -4, 57, 116, 160, 178, 164, 120, 56, -17, -84, -133, -155, -150, -121, -75, -23, 24, 59, 76, 77, 63, 42, 20, 2, -8, -11, -7, 0, 6, 9, 6, -1, -12, -24, -32, -35, -32, -22, -8, 7, 22, 32, 37, 35, 27, 16, 3, -9, -18, -23, -23, -20, -14, -6, 0};

First, I've never done DSP with an FPGA so I'm open to any critique about the design.  I'm not sure exactly how these are implemented so maybe I'm doing some things in a bad way. 

Second, I'm including a header file which defines the filter coefficients, but ISE gives me an error for this.  It doesn't like the single quote before the coefficient declaration (3rd line; the error is just 'Unexpected ' found.').  Modelsim is perfectly fine with this and in fact won't work without it.  I believe that's the way you're supposed to define 2 dimensional arrays.  What is the deal here?  If I remove the quote it will synthesize in ISE, but I'm not sure it's being synthesized correctly.

Third, if I remove the quote I mentioned in the second question and synthesize the module, then, assuming it is working correctly, it fails timing.  I'm trying to run this module at 100 MHz and I'm getting a max path of 50 ns.  This seems strange to me and makes me think I'm doing something wrong.  I'm daisy-chaining the multiplications such that they should naturally infer the DSP slices in an efficient manner.   I believe that's how these are intended to be configured.  So it seems weird that 20 MHz is the fastest they could run.  Am I doing something wrong here or is this normal?

Thank you for any help or advice!
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Are you able to describe what you are doing in words (not code).

So if I read it right, you want to have a 256-tap FIR filter, in a small FPGA (Spartan 6 LX9 only has 16 DSP slices).

You seem to be implementing it by creating 16 partial products (kernel value * data value) each cycle, and then summing all 16 of them over 16 cycles.

The design is fully inferred, but you have paid attention to the underlying widths of the multipliers.

You are having problems meeting timing.

Since you asked for a critique....

First thoughts

- You are not paying much attention to how the pipelining works in the multiplier (but isn't your limiting factor)

- You are not careful about how you 'fan in' the filter kernel and constants into the multiplier. This gives poor performance (but isn't your limiting factor)

- You are using single-cycle multiplication. This gives poor performance (but isn't your limiting factor)

- You are adding the 16 products from each multiplier each cycle, and 'ejecting' the total once every 16 cycles. This gives poor performance, as data from literally every DSP block has to come together and be added in a single cycle. This will be your limiting factor.

- You have 'control' signals that are running out to every DSP block on the FPGA. This fan-out might be a limiting factor.

- The device will have high utilization (100% of DSP slices). This gives poor performance unless things are carefully placed.  A bigger FPGA might help. (but isn't your limiting factor)

My suggested action plan

- Outline clearly what results you want/need - e.g. timing requirements, resource utilization, inputs, latency

- Have a read through the DSP block user's guide, have a think about which features could be valuable to you in this case and which are not. You won't get this knowledge experimenting with Verilog and seeing the results.

- Think about how you want it to be structured, and how this will make use of the features of the H/W

- Try getting the structure you want. Use primitives if you have to.

- Once you know what you want is possible, and how it is achievable, then think about how you can get the code to infer what you are after.

My thoughts on a design
This is untested, but an example of how you can do it differently. They are all flawed in some way.

Design idea 1:
Why not have 16 FIR filters, that output once every 256 cycles, set so their outputs are staggered by 16 cycles, otherwise they output 'zero'. Then you can just OR the outputs together rather than using adders.

You can use the block ram as circular buffer to  allow you to write incoming data into memory that isn't needed for the currently running calculation.

Pros: Simple, scalable. Things working locally for a global result.

Cons: Not sure if you will have enough resources to hold 16 copies of the kernel as well.

Design idea 2:
Look carefully at how you can use the LUTs as shift registers and indexed memories. With two 16x18-bit RAMs for the data and a 16x18-bit ROM for the kernel you can then have each DSP block do a 16th of the work. Then pipeline the adder.

Pros: Looks to be the best way, for both resources and performance.

Design idea 3:
If your FIR filter is symmetric, you can wrap the data back on itself, and then you only need half the DSP blocks (assuming that using 17-bit data rather than 18-bit is OK), e.g. total = ((d[0]+d[255]) * k[0]) + ((d[1]+d[254]) * k[1]) + ...

Pros: Saves a lot of DSP. Might be possible.

Cons: not a good general solution. You lose a bit of precision on the input.
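
Design idea 3 could be sketched roughly as follows. This is only an illustration, not code from the thread: the module name `folded_tap` and its ports are made up, and it assumes 17-bit signed samples so that the sum of two samples still fits the 18-bit multiplier input.

```verilog
// Hypothetical sketch of one folded (symmetric) FIR tap.
// Names and widths are assumptions for illustration only.
module folded_tap (
    input  wire               clk,
    input  wire signed [16:0] d_new,    // sample d[i]
    input  wire signed [16:0] d_old,    // mirrored sample d[N-1-i]
    input  wire signed [17:0] k,        // coefficient shared by both samples
    output reg  signed [35:0] product
);
    reg signed [17:0] pre_sum;          // pre-adder result
    reg signed [17:0] k_r;              // coefficient delayed to match pre_sum

    always @(posedge clk) begin
        pre_sum <= d_new + d_old;       // 17-bit + 17-bit fits in 18 bits
        k_r     <= k;
        product <= pre_sum * k_r;       // one multiply covers two taps
    end
endmodule
```

Because the two mirrored samples share one coefficient, a 256-tap symmetric kernel needs only 128 multiplies per output, at the cost of the one input bit mentioned above.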
« Last Edit: April 23, 2018, 12:40:19 am by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Thank you very much for the detailed response. 

Your description of what I was trying to do was exactly right for this module.  But as you asked, I can describe what I ultimately want.  This was a test to see if I could get a DSP slice to work at all.  The final design is 15 bandpass filters covering roughly 20 Hz to 24 kHz (e.g. one BP filter will be 10 kHz to 16 kHz, another will be 100 Hz to 160 Hz, with the others spread between).  I imagine this will be fairly difficult given the range of frequencies.  I would like to implement it with a Spartan-6 XC6SLX9 because it is the largest FPGA that comes in a QFP or QFN.   Now if this seems to be impossible, I've been trying to find a reason to buy a reflow oven, so this could be it.

The basic idea of this project is to create a very basic spectrum analyzer.  So I want to take in audio frequencies and display an output for various frequency ranges amplitudes(a basic min/max measurement).  Accuracy is not very important here(although I'd like to try to keep it reasonable for entertainment's sake). 

The final design specs are:
- 100 MHz system clock
- 1 Msps ADC
- 15 bandpass filters ranging between 24 kHz and 20 Hz (proportionally).
- ~400 outputs a second from all the filters (i.e. complete processes).
- Nearly all FPGA resources can be used.
- Latency is unimportant

The filter specs are not super important.  I chose a 256-tap FIR because the frequency response looked relatively good.  But I think I could decrease the number of taps to fit the design requirements.  I could even move to IIR if necessary.

In regards to your suggestions I have a lot of questions(hope you don't mind):
Quote
You are not paying much attention to how the pipelining works in the multiplier (but isn't your limiting factor)
So my understanding of pipelining the DSP48A1 slices was basically connecting the carry out of one to the carry in of the next.  Should these be instead registered between slices?  My original understanding is the 16 DSP48A1 slices could all be daisy chained together and be able to complete that operation in one clock cycle.  Maybe this is wrong. 

Quote
You are not careful about how you 'fan in' the filter kernel and constants into the multiplier. This gives poor performance (but isn't your limiting factor)
I don't completely understand this.  My understanding here is each multiplier has various inputs(the filter coefficients).  These all get multiplexed in based on the state of the FPGA.  How should this be done differently?  Since this isn't major feel free to tell me just to look something up.  As long as it's in the right direction I can probably figure it out.

Quote
- Try getting the structure you want. Use primitives if you have to.
Does it make sense to use the DSP48A1 primitive to get better results?  I looked into it but the instance for it was huge!  And people said online that Xilinx was very good at inferring these.

Quote
Why not have 16 FIR filters, that output once every 256 cycles, set so their outputs are staggered by 16 cycles, otherwise they output 'zero'. Then you can just OR the outputs together rather than using adders.
To make sure I understand, I would take 1 multiplier and have it do the multiplication and summation over 16 elements.  Then have 16 of those doing this over the data range and finally sum all their outputs at the end for the output.  This seems to avoid long data paths as they all act in parallel instead of in series.  Is this correct?   It seems much better to me. 

I will almost certainly have more questions later but I have to think about your response a little more.  Thank you very much hamster_nz!  I am trying to pivot my career towards FPGAs so this is very helpful!
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
If you want to do 15 filters which run in parallel and you have 16 DSPs which you want to use, then each filter gets only one DSP, right?

If you want to stick to DSPs, the filter can do only one multiplication per clock. What is it going to multiply? There's not much choice here. It's going to be the most current data point and one of the coefficients. This sounds much simpler than what your design does:

Code: [Select]
result <= result + data_in*coefficient;
This will be done inside DSP (hopefully), and the only thing left is to fetch the correct coefficient. You can use a shift register, or you can use BRAM. Since you'll need to fetch coefficients for another 14 filters (some of which can share the BRAM block), and you're not in a hurry (DSP is rather slow), BRAM is probably better. To organize the BRAM fetch, you need to maintain a counter, incrementing it on every clock:

Code: [Select]
counter <= counter + 1;
and you fetch from BRAM into a register at every clock:

Code: [Select]
coefficient <= coefficient_storage[counter];
The register here helps to make sure that the DSP doesn't have to wait for the BRAM fetch which will make things dramatically faster. You may want to look at the DSP block docs to see if adding more registers may help to pipeline MADD operations, which will make things even faster.

Of course, you need to make sure that new data_in is fetched every clock.

You also need to check when counter overflows, get the result and reset everything afterwards. Looks simple enough.

It'll take 256 clocks (for 256 taps) to get the result, then everything starts over. Won't work any faster. Other 14 filters work at the same time, doing the same thing, but with other sets of coefficients.

« Last Edit: April 23, 2018, 04:00:30 am by NorthGuy »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
I guess what I am trying to say is you have lots of options, if you only want to process 192kS/s, then you have lots of ways to implement it.

If you want a 999-tap filter, I would suggest you think about running at 192MHz, and have logic like this (not in any particular HDL):

Code: [Select]
m[0:1023]  <<< memory for buffering samples
k[0:1023]  <<< filter kernel constants (999 used)

every clock cycle
  if new_sample_in = 1 then
    m[n] <= sample_in;
    j = 0;
    a = 0;
    n = mod(n+1,1024);
    new_sample_out = 0;
  else if j < 999 then
    a = a + m[mod(n+j,1024)] * k[j];
    j++;
    new_sample_out = 0;
  else if j = 999 then
    filter_out = a;
    new_sample_out = 1;
    j++;
  else
    new_sample_out = 0;
  end if;
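
For readers who prefer HDL, the logic above might translate into Verilog along these lines. It is an untested sketch under several assumptions: the module name, port names, and the `kernel.hex` file are made up, the tap count is a power of two so the address wraps by itself, and the sample buffer is read newest-first so that tap `j` lines up with `k[j]` (for a symmetric kernel the direction makes no difference).

```verilog
// Hypothetical single-MAC FIR: one multiply-accumulate per clock.
// Module, port, and file names are illustrative only.
module fir_single_mac #(
    parameter TAPS   = 256,   // power of two so addresses wrap naturally
    parameter ADDR_W = 8      // log2(TAPS)
) (
    input  wire               clk,
    input  wire               new_sample_in,   // one-cycle strobe
    input  wire signed [17:0] sample_in,
    output reg  signed [47:0] filter_out,
    output reg                new_sample_out
);
    reg signed [17:0] m [0:TAPS-1];   // circular sample buffer
    reg signed [17:0] k [0:TAPS-1];   // filter kernel

    reg [ADDR_W-1:0]  n = 0;          // next write position
    reg [ADDR_W:0]    j = TAPS + 1;   // tap counter; above TAPS means idle
    reg signed [47:0] a = 0;          // accumulator

    integer i;
    initial begin
        for (i = 0; i < TAPS; i = i + 1) m[i] = 0;
        $readmemh("kernel.hex", k);   // assumed file of 18-bit hex words
        filter_out     = 0;
        new_sample_out = 0;
    end

    // After a write, n-1 holds the newest sample, so tap j reads n-1-j.
    wire [ADDR_W-1:0]  rd_addr = n - 1'b1 - j[ADDR_W-1:0];
    wire signed [17:0] x_j     = m[rd_addr];
    wire signed [17:0] k_j     = k[j[ADDR_W-1:0]];

    always @(posedge clk) begin
        new_sample_out <= 1'b0;
        if (new_sample_in) begin
            m[n] <= sample_in;        // overwrite the oldest sample
            n    <= n + 1'b1;
            j    <= 0;
            a    <= 0;
        end else if (j < TAPS) begin
            a <= a + x_j * k_j;       // one MAC per clock
            j <= j + 1'b1;
        end else if (j == TAPS) begin
            filter_out     <= a;      // full sum is ready
            new_sample_out <= 1'b1;
            j              <= j + 1'b1;
        end
    end
endmodule
```

In this form a new sample must not arrive until at least TAPS + 2 clocks after the previous one, otherwise the running sum is discarded. The memory reads here are combinational, which keeps the sketch short but is not the arrangement that would reach the highest clock rates.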
 

Offline radar_macgyver

  • Frequent Contributor
  • **
  • Posts: 698
  • Country: us
To hamster_nz's point about fan-in and fan-out control, look up the term 'systolic filter' and you'll find some implementations that ensure that the large filter is broken up into small chunks that only interact with each other, and don't end up causing control signals to be routed to all the chunks. This greatly improves timing.

A reference I used back in the day:
https://www.xilinx.com/publications/archives/books/dsp.pdf

Any reason the FIR filter IP core won't work for you?
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
What you are wanting to do is to work out:

   (taps_in_filter * sample_rate)

and balance that against:
 
   (MAC_blocks_used * design_clock_speed)

given the set of constraints you are under

* "sample_rate" is determined by your application. It must be more than 2x the highest frequency of interest (Nyquist), and in practice somewhat higher to leave room for the anti-aliasing filter.

* "taps_in_filter " is determined by your application and the performance you need from your filter. For high performance it might be 5x or 7x the wavelength (in samples) of the critical frequency of the filter. (Rough rule of thumb)

* "design_clock_speed"'s upper limit is determined by your FPGA architecture and speed grade, and your ability to design to that speed. For a non-specialist designer 180 MHz would be a sensible number for Spartan-3, 200 MHz for Spartan-6, and maybe 240 MHz for 7-series. To get the datasheet 'hero' numbers (300MHz+) requires a specialist, and an ideal fit for the application.

* The upper limit for "MAC_blocks_used" is determined by the size of your FPGA. If your multiplication takes more than the native size of the DSP block (18-bit x 18-bit ) then you might need two or more DSP blocks to implement each MAC.

So....

If (MAC_blocks_used * design_clock_speed) is less than (taps_in_filter * sample_rate) you don't have enough resources to make the filter work, so go back to the drawing board and change some other numbers.

If (MAC_blocks_used * design_clock_speed) is much greater than (taps_in_filter * sample_rate), or "design_clock_speed" is too low (<100MHz), then you are wasting resources. Your FPGA is idle most of the time.

However, when "MAC_blocks_used" gets too high, it becomes hard to get the right information in the right place at the right time, even if you exploit the architecture features. Having a bigger FPGA doesn't usually make that better. A bigger die gives longer routing delays.

So for each filter, work out your values for 'taps_in_filter', 'sample_rate', 'MAC_blocks_used', and 'design_clock_speed'. That should give you a pretty good idea of what you need to do, and whether you have spare cycles for housekeeping or everything must mesh precisely together.

....or just use the FIR filter IP block, and accept it as being magic :-)
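
As a worked instance of the balance described above, the small simulation-only module below plugs in numbers mentioned in this thread: a 256-tap filter, the 1 MS/s ADC, one DSP slice per filter, and the 100 MHz system clock. The module and parameter names are invented for the example, and rates are expressed in kHz so the products stay well inside 32-bit integers.

```verilog
// Hypothetical budget check using numbers quoted in this thread.
// Simulation only; not meant for synthesis.
module mac_budget;
    localparam integer TAPS_IN_FILTER   = 256;
    localparam integer SAMPLE_RATE_KSPS = 1000;     // 1 MS/s ADC
    localparam integer MAC_BLOCKS_USED  = 1;        // one DSP slice per filter
    localparam integer DESIGN_CLOCK_KHZ = 100000;   // 100 MHz system clock

    // Both figures are in thousands of MAC operations per second.
    localparam integer NEEDED    = TAPS_IN_FILTER  * SAMPLE_RATE_KSPS;
    localparam integer AVAILABLE = MAC_BLOCKS_USED * DESIGN_CLOCK_KHZ;

    initial begin
        $display("needed    = %0d kMAC/s", NEEDED);
        $display("available = %0d kMAC/s", AVAILABLE);
        if (AVAILABLE < NEEDED)
            $display("not enough MAC capacity");
        else
            $display("fits, %0d%% of capacity used", (100 * NEEDED) / AVAILABLE);
    end
endmodule
```

With these values the filter needs 256,000 kMAC/s against 100,000 kMAC/s available, so it does not fit. Lowering SAMPLE_RATE_KSPS to 100 (the 100 kS/s figure the original filter was designed for) brings the requirement down to 25,600 kMAC/s, which does fit.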
« Last Edit: April 23, 2018, 05:28:18 am by hamster_nz »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
« Reply #7 on: April 23, 2018, 09:26:22 pm »
I did some testing last night.

A 256-tap FIR filter. Decomposed into sixteen 16-tap filters, with the DSP blocks inferred, with pipelining:
Code: [Select]
            -- The inferred multiplier, with pipelining
            product <= a * b;
            a <= kernel(to_integer(count));
            b <= memory(to_integer(index));

Data has to be supplied once every 16 cycles. Latency about 2,056 cycles or so (a few cycles more than it takes to get 128 data items).

Implemented targeting a Spartan-6 LX9-2.
- Usage about 1/3rd of the fabric, 100% of DSP blocks
- Max clock frequency around 170 MHz, when constrained for 6ns (a 166 MHz design target).

Performance,
- Throughput ~ 10.5 M 18-bit data items per second
- 2,688M 18x18 MAC operations per second.
- 40-bit signed result (just sent straight to I/O pins)

I can see a few tricks that I would like to try, most of which will reduce fabric usage.

I want to spend a while verifying it and making the filter kernel a parameter rather than hard-coded...
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
« Reply #8 on: April 23, 2018, 11:51:38 pm »
Quote
A 256-tap FIR filter. Decomposed into sixteen 16-tap filters, with the DSP blocks inferred, with pipelining:

I think you need to drop the idea of decompositions.

The OP wants to run 15 different FIR filters. So, if you assign one filter to one DSP, you will fully utilize 15 of 16  DSPs without any decomposition.

You get a value from ADC at 1 MHz.  You run the filter at 256 MHz - all 256 taps at once, one tap per clock, using one DSP. 256 MHz shouldn't be a problem. This gives you 1 MValues/s per DSP. If you run 16 of these in parallel (each with a different set of coefficients), you get 16 MValues/s throughput.

Such approach is much simpler, easier to write, and will run faster. It will also produce much better resource utilization.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
« Reply #9 on: April 24, 2018, 01:52:42 am »
Quote
A 256-tap FIR filter. Decomposed into sixteen 16-tap filters, with the DSP blocks inferred, with pipelining:

I think you need to drop the idea of decompositions.

The OP wants to run 15 different FIR filters. So, if you assign one filter to one DSP, you will fully utilize 15 of 16  DSPs without any decomposition.

You get a value from ADC at 1 MHz.  You run the filter at 256 MHz - all 256 taps at once, one tap per clock, using one DSP. 256 MHz shouldn't be a problem. This gives you 1 MValues/s per DSP. If you run 16 of these in parallel (each with a different set of coefficients), you get 16 MValues/s throughput.

Such approach is much simpler, easier to write, and will run faster. It will also produce much better resource utilization.

Definitely agree. And 1 MS/s is way overkill for audio work - with a 256-tap FIR at that rate you would be hard pressed to do anything useful in the audio range.

I just wanted to see what peak throughput was possible on Spartan 6 without any big design investment.... around half the "capable of operating at up to 390 MHz" mentioned for the DSP block in the datasheet.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
@NorthGuy
Quote
If you want to do 15 filters which run in parallel and you have 16 DSPs which you want to use, then each filter gets only one DSP, right?
My original justification for using 16 dsp slices in series was so that I could use all of them.  I thought of the way you describe it but I was bothered by not using that last dsp slice.  But from what you and others have said it seems it's not worth changing the overall architecture to use that last dsp slice. 

Quote
The OP wants to run 15 different FIR filters. So, if you assign one filter to one DSP, you will fully utilize 15 of 16  DSPs without any decomposition.

You get a value from ADC at 1 MHz.  You run the filter at 256 MHz - all 256 taps at once, one tap per clock, using one DSP. 256 MHz shouldn't be a problem. This gives you 1 MValues/s per DSP. If you run 16 of these in parallel (each with a different set of coefficients), you get 16 MValues/s throughput.

Such approach is much simpler, easier to write, and will run faster. It will also produce much better resource utilization.
That does sound much better.  I'm going to try this this weekend.   I was originally a little worried about raising the frequency that high but since it's all inside the FPGA I guess I shouldn't have to worry about it too much.  Plus it will be good experience in working with higher frequencies in FPGAs.  Previously the highest frequency circuit board/FPGA design I've made is 50Mhz.   

@radar_macgyver Thank you for the link.  I will have to look through that this weekend.  It looks very helpful. 

@hamster_nz Thank you again for the detailed response.  I've read through it but I'll probably have to re-read it a couple times to internalize it.  I think I have a much better understanding of what to do now.

As hamster_nz said, 1 MS/s is probably way too much, and I noticed it was very difficult to get a good frequency response at that rate (without a huge number of taps).  So I actually designed the original filter with a sample rate of 100 kS/s, which I forgot to mention earlier.  Not to mention all I want to do is a very basic amplitude measurement, so a large number of samples probably isn't necessary.  I think I will give it another shot using one filter per channel.

The reason I'm not using a FIR filter core is because I'm trying to create a resume of projects to show potential employers.   I would like to get a job coding FPGAs but currently I only have 3 years of mild PCB/circuit design experience so I need something to show.  So I figured I should do it manually.  I figure anyone could just load a core and have it work.  Also it seems like a good learning exercise.  It has been pretty interesting so far. 

I'll probably have more questions as I re-read.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Something I forgot to ask earlier: is the way I stored the filter coefficients a good way to do this?  I wasn't sure how to define them another way.  Also, how are these stored in the FPGA?  Does each coefficient take up some amount of CLBs or is this done in a more efficient manner?  I know NorthGuy mentioned using BRAM but I'm not sure how to use this exactly.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
« Reply #12 on: April 24, 2018, 04:02:01 am »
Quote
Definitely agree. And 1 MS/s is way overkill for audio work - with a 256-tap FIR at that rate you would be hard pressed to do anything useful in the audio range.

My bad. I didn't get this was for audio.  It is probably a good idea to run each filter at its own sample rate. This way it's possible to get better results with fewer taps. The slower filters will be slow, so you can probably use a single DSP to run several filters.

Quote
I just wanted to see what peak throughput was possible on Spartan 6 without any big design investment.... around half the "capable of operating at up to 390 MHz" mentioned for the DSP block in the datasheet.

You should be able to get very close to the datasheet figure. There may be problems with long routing.  The internal registers may be missing somewhere. DSP has lots of built-in registers which all must be used to get peak performance. When you couple it with BRAM, you also need to infer BRAM's output register. Along with the input registers of the DSP, it creates a chain of registers. Also, outside things, such as counters may need to be pipelined. I'm sure if you dig into it, you'll be able to run it much faster than 170 MHz.
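
The chain of registers described above might look something like the sketch below. It is an assumption-laden illustration rather than a verified design: the module name, port names, and hex file names are invented, the sample write port is left out, and whether the tools actually absorb each stage into the BRAM and DSP48A1 has to be confirmed in the synthesis report.

```verilog
// Hypothetical pipelined MAC: every stage is registered so the tools
// have the option of packing the registers into BRAM and DSP blocks.
module mac_pipelined #(
    parameter DEPTH  = 256,
    parameter ADDR_W = 8
) (
    input  wire               clk,
    input  wire [ADDR_W-1:0]  sample_addr,
    input  wire [ADDR_W-1:0]  kernel_addr,
    input  wire               restart,   // high with the first address of a sum
    output reg  signed [47:0] accum
);
    reg signed [17:0] samples [0:DEPTH-1];  // write port omitted for brevity
    reg signed [17:0] kernel  [0:DEPTH-1];
    initial begin
        $readmemh("samples.hex", samples);  // assumed file names
        $readmemh("kernel.hex",  kernel);
    end

    reg signed [17:0] s_q, s_q2, a_reg;     // sample path
    reg signed [17:0] k_q, k_q2, b_reg;     // kernel path
    reg signed [35:0] m_reg;                // registered product
    reg        [3:0]  restart_d = 4'b0;     // restart delayed to match the data

    always @(posedge clk) begin
        // Stage 1: synchronous memory read
        s_q   <= samples[sample_addr];
        k_q   <= kernel[kernel_addr];
        // Stage 2: memory output register
        s_q2  <= s_q;
        k_q2  <= k_q;
        // Stage 3: multiplier input registers
        a_reg <= s_q2;
        b_reg <= k_q2;
        // Stage 4: product register
        m_reg <= a_reg * b_reg;
        // Stage 5: accumulator; a delayed restart loads instead of adding
        restart_d <= {restart_d[2:0], restart};
        if (restart_d[3])
            accum <= m_reg;
        else
            accum <= accum + m_reg;
    end
endmodule
```

The price of the extra registers is latency: the first product of a sum reaches `accum` five clocks after its address is presented, so control signals such as `restart` have to be delayed by the same amount.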
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Quote
Something I forgot to ask earlier, is the way I stored the filter coefficients a good way to do this?  I wasn't sure how to define them another way.  Also, how are these stored in the FPGA?  Does each coefficient take up some amount of CLBs or is this done in a more efficent manner?  I know NorthGuy mentioned using BRAM but I'm not sure how to use this exactly.

You have array x. If BRAM is used, the coefficients are used to initialize the BRAM. Then you use it as x[index]. When this happens, "index" gets wired to the address input of the BRAM, and on the next clock the coefficient appears on the data output of the BRAM, so the data output is wired to wherever x[index] is used.

There's also LUT memory, which is smaller, but it is combinatorial, so you can do things like x[y[index]] within one clock (although this is not necessarily fast). The tools select the type of memory based on how you use it. If you want particular kind of memory, you often can infer it, or you can simply instantiate it.
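
A plain Verilog-2001 way to write such a coefficient memory is sketched below. The module name, the file name `filter_1_coeff.hex`, and the parameter names are assumptions made for the example. It also sidesteps the `'{...}` array syntax from the original header, which is a SystemVerilog construct and is the likely reason ISE rejects it while ModelSim accepts it.

```verilog
// Hypothetical coefficient ROM in plain Verilog-2001.
// Names and the hex file are illustrative assumptions.
module coeff_rom #(
    parameter NUM_TAPS = 256,
    parameter ADDR_W   = 8      // log2(NUM_TAPS)
) (
    input  wire               clk,
    input  wire [ADDR_W-1:0]  index,
    output reg  signed [17:0] coefficient
);
    reg signed [17:0] x [0:NUM_TAPS-1];

    // One 18-bit two's-complement hex word per line of the file.
    initial $readmemh("filter_1_coeff.hex", x);

    // Registered read: the value for `index` appears one clock later,
    // which is the access pattern that can map onto block RAM.
    always @(posedge clk)
        coefficient <= x[index];
endmodule
```

If the read were written as a continuous assignment instead (`assign coefficient = x[index];`), the access would be combinational and the tools would have to fall back to LUT memory, as described above.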
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Just glancing through the "Spartan-6 FPGA DSP48A1 Slice User Guide" ( https://www.xilinx.com/support/documentation/user_guides/ug389.pdf ). I forgot just how many neat ideas it had in it.

I am then left wondering about lots of things, like whether:

Code: [Select]
   if restart = 1 then
     accum <= zero_48_bits + product;
   else
     accum <= accum + product;
   end if;

...will infer the use of the fast path from the P Reg to the Z Mux (by adjusting OpMode[3:2] bits as needed), or go the long way via the 'C' input. Time to glance at the synthesis manuals....


(attached image is from the above user guide)

 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
« Reply #15 on: April 25, 2018, 11:49:03 am »
Quote
You should be able to get very close to the datasheet figure. There may be problems with long routing.  The internal registers may be missing somewhere. DSP has lots of built-in registers which all must be used to get peak performance. When you couple it with BRAM, you also need to infer BRAM's output register. Along with the input registers of the DSP, it creates a chain of registers. Also, outside things, such as counters may need to be pipelined. I'm sure if you dig into it, you'll be able to run it much faster than 170 MHz.

After a lot of mucking around, I finally got all the correct registers to be inferred in the inferred DSP block, and the kernel and data stream stored in inferred BRAM blocks. Fmax is now 280MHz.

Sort of interesting what a bit of tweaking can do...

 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
« Reply #16 on: April 25, 2018, 06:12:01 pm »
Quote
After a lot of mucking around, I finally got all the correct registers to be inferred in the inferred DSP block, and the kernel and data stream stored in inferred BRAM blocks. Fmax is now 280MHz.

Sort of interesting what a bit of tweaking can do...

The amount of gain you extracted by the tweaking (roughly 1.6 times) is comparable to the performance gain you would get by moving to UltraScale+. Quite impressive.

I'm sure it can go even faster, but the faster it runs the harder it is to satisfy timing.

 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
« Reply #17 on: April 25, 2018, 06:55:38 pm »
Quote
After a lot of mucking around, I finally got all the correct registers to be inferred in the inferred DSP block, and the kernel and data stream stored in inferred BRAM blocks. Fmax is now 280MHz.

Sort of interesting what a bit of tweaking can do...

The amount of gain you extracted by the tweaking (roughly 1.6 times) is comparable to the performance gain you would get by moving to UltraScale+. Quite impressive.

I'm sure it can go even faster, but the faster it runs the harder it is to satisfy timing.

Nope, it won't run any faster and still meet timing.

The switching limit for Block RAM is 280 MHz for the -2 grade, so it has 'maxed out'.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
« Reply #18 on: April 25, 2018, 07:58:34 pm »
Quote
Nope, it won't run any faster and still meet timing.

The switching limit for Block RAM is 280 MHz for the -2 grade, so it has 'maxed out'.

I see. That's rather slow. It's way faster in 7-series.

It is possible to do it without BRAM, using distributed RAM and/or SRLs, although this will take more resources.

For the data queue, SRLs may work faster than BRAM. If you rotate data through SRL chain with N+1 elements (N being the number of taps), it'll automatically supply the data in the correct order.

You can try to use distributed RAM for coefficients - it must be fast in read-only mode. SRLs may also work.

Another possibility is to use BRAM, but run it at 1/2 clock rate. You can fetch two values per clock.
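
For anyone wanting to try the distributed-RAM option for the coefficients, a minimal sketch is below. It is not from this thread: the module and port names are made up, only four placeholder coefficients are filled in, and it assumes XST honours the rom_style constraint written as a Verilog attribute (worth checking against the XST user guide for your ISE version).

```verilog
// Hypothetical sketch: a small coefficient ROM steered towards LUTs
// instead of block RAM. Names and values are placeholders.
module coeff_rom_distributed (
    input  wire               clk,
    input  wire         [3:0] addr,
    output reg  signed [17:0] coeff
);
    // rom_style is a synthesis constraint (assumed to be accepted by XST
    // in this form); simulators simply ignore the attribute.
    (* rom_style = "distributed" *)
    reg signed [17:0] rom [0:15];

    integer i;
    initial begin
        for (i = 0; i < 16; i = i + 1)
            rom[i] = 18'sd0;
        rom[0] =  18'sd5;    // placeholder coefficients
        rom[1] =  18'sd0;
        rom[2] = -18'sd6;
        rom[3] = -18'sd14;
    end

    // Registered read, so the LUT output does not feed the DSP directly.
    always @(posedge clk)
        coeff <= rom[addr];
endmodule
```

Whether this actually beats the BRAM limit depends on how many LUT levels the address decode needs for a deep table, so treat it as something to experiment with rather than a guaranteed win.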

 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
It's finally the weekend so I've had some more time to look at this again and I've redesigned as suggested by using only a single multiplier per channel.  But I'm still running into the issue that ISE doesn't like the way I've defined my coefficients.  Below is an example of how I do it.
Code: [Select]
localparam num_taps = 1000;
localparam coeff_scale = 16;
localparam signed[17:0] coeff[0:999] = '{0, 3, 5, 8, 9, 10, 10, 9, 6, 3, 0, -4, -8, -11, -13, -14, -13, -11, -8, -4, 0, 4, 9, 12, 14, 15, 14, 12, 9, 5, 0, -4, -7, -10, -12, -12, -11, -9, -7, -4, -1, 2, 4, 6, 6, 6, 5, 4, 2, 1, 0, 0, 0, 1, 2, 3, 4, 4, 4, 3, 1, -2, -5, -8, -11, -13, -14, -14, -12, -8, -3, 3, 9, 15, 20, 23, 24, 22, 19, 13, 5, -3, -11, -19, -25, -28, -29, -27, -22, -15, -7, 2, 11, 19, 24, 27, 28, 25, 20, 14, 6, -1, -8, -13, -17, -18, -18, -15, -12, -8, -3, 0, 3, 4, 3, 2, 0, -2, -3, -4, -3, 0, 4, 9, 14, 19, 23, 24, 23, 19, 11, 2, -10, -21, -32, -40, -45, -46, -42, -33, -20, -4, 13, 30, 44, 55, 61, 61, 54, 43, 26, 7, -13, -31, -47, -58, -63, -62, -55, -43, -27, -9, 9, 25, 38, 46, 48, 46, 39, 29, 18, 6, -4, -12, -17, -18, -16, -13, -8, -3, 0, 1, -1, -5, -12, -20, -27, -33, -35, -32, -25, -13, 4, 23, 42, 59, 73, 80, 79, 69, 51, 26, -3, -35, -65, -90, -108, -115, -112, -96, -71, -38, 0, 39, 74, 102, 121, 128, 122, 104, 76, 42, 4, -33, -65, -89, -104, -107, -100, -83, -60, -33, -6, 19, 39, 51, 56, 54, 46, 34, 21, 10, 2, -2, 0, 6, 16, 26, 34, 38, 36, 26, 9, -14, -42, -70, -95, -114, -122, -117, -99, -68, -25, 24, 76, 124, 163, 188, 195, 183, 152, 103, 42, -25, -93, -152, -199, -226, -231, -214, -175, -120, -53, 19, 87, 146, 188, 211, 212, 193, 156, 106, 49, -9, -60, -101, -127, -137, -132, -114, -87, -55, -24, 1, 19, 26, 24, 14, 0, -14, -25, -28, -20, -1, 27, 63, 100, 134, 158, 167, 158, 128, 78, 12, -65, -144, -216, -272, -305, -309, -280, -221, -135, -30, 84, 196, 292, 363, 399, 397, 356, 278, 172, 48, -82, -203, -304, -373, -406, -398, -351, -272, -169, -55, 59, 160, 239, 289, 305, 290, 248, 186, 113, 39, -26, -75, -105, -113, -103, -79, -49, -20, 0, 6, -6, -34, -75, -123, -169, -203, -216, -201, -155, -79, 23, 141, 262, 371, 454, 497, 492, 432, 321, 166, -21, -219, -409, -570, -682, -731, -708, -613, -453, -241, 0, 247, 475, 659, 781, 826, 789, 676, 498, 273, 26, -217, -431, -595, -693, -719, -673, -564, -408, -225, -38, 132, 267, 355, 391, 377, 323, 242, 
152, 70, 12, -13, 0, 47, 116, 193, 258, 291, 277, 205, 72, -113, -335, -568, -781, -943, -1023, -997, -852, -590, -224, 215, 686, 1140, 1523, 1785, 1886, 1799, 1516, 1051, 439, -267, -999, -1682, -2241, -2609, -2737, -2596, -2185, -1533, -694, 256, 1225, 2115, 2833, 3298, 3454, 3275, 2766, 1969, 954, -183, -1332, -2377, -3213, -3752, -3934, -3735, -3170, -2288, -1173, 66, 1310, 2435, 3329, 3904, 4102, 3904, 3329, 2435, 1310, 66, -1173, -2288, -3170, -3735, -3934, -3752, -3213, -2377, -1332, -183, 954, 1969, 2766, 3275, 3454, 3298, 2833, 2115, 1225, 256, -694, -1533, -2185, -2596, -2737, -2609, -2241, -1682, -999, -267, 439, 1051, 1516, 1799, 1886, 1785, 1523, 1140, 686, 215, -224, -590, -852, -997, -1023, -943, -781, -568, -335, -113, 72, 205, 277, 291, 258, 193, 116, 47, 0, -13, 12, 70, 152, 242, 323, 377, 391, 355, 267, 132, -38, -225, -408, -564, -673, -719, -693, -595, -431, -217, 26, 273, 498, 676, 789, 826, 781, 659, 475, 247, 0, -241, -453, -613, -708, -731, -682, -570, -409, -219, -21, 166, 321, 432, 492, 497, 454, 371, 262, 141, 23, -79, -155, -201, -216, -203, -169, -123, -75, -34, -6, 6, 0, -20, -49, -79, -103, -113, -105, -75, -26, 39, 113, 186, 248, 290, 305, 289, 239, 160, 59, -55, -169, -272, -351, -398, -406, -373, -304, -203, -82, 48, 172, 278, 356, 397, 399, 363, 292, 196, 84, -30, -135, -221, -280, -309, -305, -272, -216, -144, -65, 12, 78, 128, 158, 167, 158, 134, 100, 63, 27, -1, -20, -28, -25, -14, 0, 14, 24, 26, 19, 1, -24, -55, -87, -114, -132, -137, -127, -101, -60, -9, 49, 106, 156, 193, 212, 211, 188, 146, 87, 19, -53, -120, -175, -214, -231, -226, -199, -152, -93, -25, 42, 103, 152, 183, 195, 188, 163, 124, 76, 24, -25, -68, -99, -117, -122, -114, -95, -70, -42, -14, 9, 26, 36, 38, 34, 26, 16, 6, 0, -2, 2, 10, 21, 34, 46, 54, 56, 51, 39, 19, -6, -33, -60, -83, -100, -107, -104, -89, -65, -33, 4, 42, 76, 104, 122, 128, 121, 102, 74, 39, 0, -38, -71, -96, -112, -115, -108, -90, -65, -35, -3, 26, 51, 69, 79, 80, 73, 59, 42, 23, 4, -13, -25, 
-32, -35, -33, -27, -20, -12, -5, -1, 1, 0, -3, -8, -13, -16, -18, -17, -12, -4, 6, 18, 29, 39, 46, 48, 46, 38, 25, 9, -9, -27, -43, -55, -62, -63, -58, -47, -31, -13, 7, 26, 43, 54, 61, 61, 55, 44, 30, 13, -4, -20, -33, -42, -46, -45, -40, -32, -21, -10, 2, 11, 19, 23, 24, 23, 19, 14, 9, 4, 0, -3, -4, -3, -2, 0, 2, 3, 4, 3, 0, -3, -8, -12, -15, -18, -18, -17, -13, -8, -1, 6, 14, 20, 25, 28, 27, 24, 19, 11, 2, -7, -15, -22, -27, -29, -28, -25, -19, -11, -3, 5, 13, 19, 22, 24, 23, 20, 15, 9, 3, -3, -8, -12, -14, -14, -13, -11, -8, -5, -2, 1, 3, 4, 4, 4, 3, 2, 1, 0, 0, 0, 1, 2, 4, 5, 6, 6, 6, 4, 2, -1, -4, -7, -9, -11, -12, -12, -10, -7, -4, 0, 5, 9, 12, 14, 15, 14, 12, 9, 4, 0, -4, -8, -11, -13, -14, -13, -11, -8, -4, 0, 3, 6, 9, 10, 10, 9, 8, 5, 3};


I get the error:
Code: [Select]
ERROR:HDLCompiler:806 - "filter_coeff_15.h" Line 3: Syntax error near "'".


So it doesn't like the single quote before the "{".  But my understanding is that this is how an array like this is defined.  If I remove the single quote then it will compile, but I get 12000 warnings and the design is too large for the FPGA.  I think it's treating the coefficients as something like a single 18000-bit register.  What am I doing wrong here? 

In case it's useful below is my current code.  I haven't tested it yet because of the above problem so there might be errors. 
Code: [Select]
`timescale 1ns / 1ps
module Bandpass_Filter(
input wire clk,
input wire reset,

input wire[11:0] data_in,

output reg signed[11:0] data_out
    );

`include "filter_coeff_15.h"  // This file contains the information about the filter.

integer i;

reg shift;     // Shifts data in.
reg[15:0] data_index;  // Index of the delay data.  This should be parameterized later.
reg signed[17:0] data_delay[0:num_taps-1];   // Data in.
wire signed[47:0] product;   // Results of multiplication.
reg signed[47:0] carry_in;         // Carry in, carry out from last calculation.

always @(posedge clk or negedge reset) begin
if(!reset) begin
shift <= 0;
carry_in <= 0;
data_out <= 0;

for(i=0; i<num_taps; i=i+1) begin : data_reset
data_delay[i] <= 0;
end
end else begin
if(!shift) begin       // Operate on the data when not shifting.
if(data_index < num_taps) begin
data_index <= data_index + 1;
carry_in <= data_delay[data_index]*coeff[data_index] + carry_in;  // Later buffer the coefficient value rather than accessing directly.
end else begin
shift <= 1;  // Delete later
end
end else begin // Shift data when new data arrives.
data_index <= 0;
carry_in <= 0;
shift <= 0;    // Delete later.
data_out <= carry_in >>> coeff_scale;   // Scale the data and output it.
data_delay[0] <= {6'b0, data_in};  // Take in the new data;
for(i=1; i<num_taps; i=i+1) begin : gen_data_shift  // Shift all the data.
data_delay[i] <= data_delay[i-1];
end
end
end
end
endmodule
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
You can't reset the contents of a large memory block in one cycle. Try removing that.
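
A minimal sketch of what "removing that" could look like (not taken from the posts above; the module, port and signal names are made up for illustration). The array gets power-up contents from an initial block, the clocked process that touches the array has no reset branch, and only the small control register is reset.

```verilog
// Hypothetical sketch: sample memory with no reset on the array itself.
module sample_ram_no_reset (
    input  wire               clk,
    input  wire               reset_n,   // active-low, only clears wr_addr
    input  wire               we,
    input  wire signed [17:0] din,
    input  wire         [9:0] rd_addr,
    output reg  signed [17:0] dout
);
    reg signed [17:0] mem [0:1023];
    reg         [9:0] wr_addr;
    integer i;

    // Power-up contents only; this is not a run-time reset.
    initial begin
        for (i = 0; i < 1024; i = i + 1)
            mem[i] = 18'sd0;
    end

    // Memory port: one write and one registered read per clock, and no
    // reset branch that would force all 1024 words to change at once.
    always @(posedge clk) begin
        if (we)
            mem[wr_addr] <= din;
        dout <= mem[rd_addr];
    end

    // Control logic can still be reset as usual.
    always @(posedge clk or negedge reset_n) begin
        if (!reset_n)
            wr_addr <= 10'd0;
        else if (we)
            wr_addr <= wr_addr + 10'd1;
    end
endmodule
```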
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Yep that was it.  Now the design easily fits.

I still get 12000 warnings for some reason but I think it's something to do with the coefficient localparam.   Almost all the warnings look like this:
Code: [Select]
WARNING:Xst:1895 - Due to other FF/Latch trimming, FF/Latch <data_delay_999_17384> (without init value) has a constant value of 0 in block <Bandpass_Filter>. This FF/Latch will be trimmed during the optimization process.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
I've been trying for a few hours but I can't seem to get the design to meet timing.  I'm trying to run the module at 200MHz but I can't get it above 128MHz.  I have two main problems that I think are holding me back.  My longest path is from the coefficient values to the accumulator, so I'm trying to shorten that.

My current code is posted at the bottom for reference.  It's still a work in progress and I'm mostly working on timing, so there are probably logic problems or off-by-one errors. 

First, I'm trying to pre-fetch the coefficient and data values so that there is as little delay as possible between the register that holds the values and the inputs of the DSP slice.  When I've attempted this it seems that the register I'm trying to infer gets removed.  See the attached picture.  It seems that the coefficient value is taken directly from BRAM to the multiplier and then the accumulator.  I was thinking maybe it was using the A and B registers inside the DSP slice, but the timing report shows the path going from BRAM all the way to the accumulator.  Am I doing something wrong in my definition of the pre-fetch register?

The second problem I'm having is inferring the P reg to Z mux loop as hamster_nz mentioned.  I got it to do it once but there were other problems.  When I fixed those it went back to creating an accumulator outside of the DSP slice and using the M output.  I don't understand why it would do this.  I tried to force it to multiply then sum but it always does the summation outside the DSP block.  How can you correctly infer the accumulator to be inside the DSP block?

Also, I'm using the schematic viewer to figure this stuff out. Is that the best tool for this?  I notice I can't really see inside the DSP48A1 to see exactly how it's configured.  Is there a way to do this?  Or maybe because it's governed by opcodes you can't?

Thanks again for the help. 

Code: [Select]
`timescale 1ns / 1ps

module Bandpass_Filter(
input wire clk,
input wire reset,

input wire shift,
input wire[11:0] data_in,

output reg signed[11:0] data_out
    );

`include "filter_coeff_15.h"  // This file contains the information about the filter.

integer i;

// reg shift;     // Shifts data in.
reg[15:0] data_index;  // Index of the delay data.  This should be parameterized later.
reg first;
reg signed[17:0] coeff_pre_fetch; 
reg signed[17:0] data_pre_fetch;
reg signed[17:0] data_delay[0:num_taps-1];   // Data in.
wire signed[47:0] product;
reg signed[47:0] carry_in;         // Carry in, carry out from last calculation.

assign product = data_pre_fetch*coeff_pre_fetch;

always @(posedge clk or negedge reset) begin
if(!reset) begin
first <= 1;
carry_in <= 0;
data_out <= 0;
end else begin
data_pre_fetch <= data_delay[data_index+1];
coeff_pre_fetch <= coeff[data_index+1];
data_index <= data_index + 1;
if(data_index < num_taps) begin
if(first) begin
first <= 0;
carry_in <= product + 48'b0;
end else begin
carry_in <= product + carry_in;
end
end else begin
carry_in <= carry_in;
end

if(shift) begin // Shift data when new data arrives.
first <= 1;
data_index <= 0;
data_out <= carry_in >>> coeff_scale;   // Scale the data and output it.
data_delay[0] <= {6'b0, data_in};  // Take in the new data;
for(i=1; i<num_taps; i=i+1) begin : gen_data_shift  // Shift all the data.
data_delay[i] <= data_delay[i-1];
end
end
end
end
endmodule

 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
I've been trying my hand at Verilog, and learnt something...

Code: [Select]
localparam signed[17:0] coeff[0:255] = {
      5,   0, -6, -14, -20, -23, -23, -18, -9, 3, 16, 27, 35, 37, 32,     
     22,   7, -8, -22, -32, -35, -32, -24, -12, -1, 6, 9, 6, 0, -7,  -11, -8, 2, 20, 42, 63, 77, 76, 59,
  24, -23, -75, -121, -150, -155, -133, -84, -17, 56, 120, 164, 178, 160, 116, 57, -4, -54, -81, -83,
-64, -33, -6, 5, -8, -47, -103, -159, -195, -194, -142, -40, 100, 253, 387, 468, 472, 388, 222, 0,
-238, -447, -584, -621, -551, -387, -164, 71, 270, 395, 429, 375, 261, 129, 25, -16, 21, 124, 256,
361, 382, 274, 25, -343, -766, -1151, -1396, -1408, -1136, -581, 192, 1061, 1866, 2438, 2635, 2377,
1661, 579, -702, -1961, -2966, -3518, -3492, -2863, -1718, -241, 1315, 2680, 3611, 3941, 3611, 2680,
1315, -241, -1718, -2863, -3492, -3518, -2966, -1961, -702, 579, 1661, 2377, 2635, 2438, 1866, 1061,
192, -581, -1136, -1408, -1396, -1151, -766, -343, 25, 274, 382, 361, 256, 124, 21, -16, 25, 129, 261,
375, 429, 395, 270, 71, -164, -387, -551, -621, -584, -447, -238, 0, 222, 388, 472, 468, 387, 253, 100,
-40, -142, -194, -195, -159, -103, -47, -8, 5, -6, -33, -64, -83, -81, -54, -4, 57, 116, 160, 178,
164, 120, 56, -17, -84, -133, -155, -150, -121, -75, -23, 24, 59, 76, 77, 63, 42, 20, 2, -8, -11, -7,
0, 6, 9, 6, -1, -12, -24, -32, -35, -32, -22, -8, 7, 22, 32, 37, 35, 27, 16, 3, -9, -18, -23, -23, -20, -14, -6, 0};

Gives the following warning:

Code: [Select]
WARNING:HDLCompiler:413 - "C:\Users\Hamster\Desktop\Projects\filter_test\bandpass.v" Line 11: Result of 8163-bit expression is truncated to fit in 4608-bit target.

This makes me think that it is attempting to concatenate 255 32-bit binary numbers together, rather than 255 18-bit numbers
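
To illustrate the width issue: the Verilog standard does not actually allow unsized constants inside a concatenation, and tools that accept them anyway typically treat each one as 32 bits. Giving every element an explicit size keeps the total width predictable. The snippet below is only a sketch with four placeholder values and made-up names (coeff_packed_demo, COEFFS); it packs the table into one vector and slices it back out.

```verilog
// Hypothetical sketch: Verilog-2001 friendly coefficient table packed
// into one vector, with every element explicitly sized to 18 bits.
module coeff_packed_demo;
    localparam NUM   = 4;
    localparam WIDTH = 18;

    // The concatenation is exactly NUM*WIDTH = 72 bits wide. The first
    // element listed ends up in the most significant bits.
    localparam [NUM*WIDTH-1:0] COEFFS = {18'sd5, 18'sd0, -18'sd6, -18'sd14};

    reg signed [WIDTH-1:0] c;
    integer i;

    initial begin
        for (i = 0; i < NUM; i = i + 1) begin
            // Index 0 selects the first element listed above.
            c = COEFFS[(NUM-1-i)*WIDTH +: WIDTH];
            $display("coeff[%0d] = %0d", i, c);
        end
    end
endmodule
```

In a simulator this should print the four values in the order they were listed, with the negative ones intact, because each slice is copied into a signed 18-bit register of the same width.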
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Yeah, I've been running into that.  I don't know what the deal is here because I believe it should be:
Code: [Select]
localparam signed[17:0] coeff[0:255] = '{ 5,   0, ....};

But ISE doesn't like that and throws an error on the single quote.  Modelsim seems fine with it so I don't know what the deal is.

I think you're right and that might be what's causing my problems.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Yeah, I've been running into that.  I don't know what the deal is here because I believe it should be:
Code: [Select]
localparam signed[17:0] coeff[0:255] = '{ 5,   0, ....};

But ISE doesn't like that and throws an error on the single quote.  Modelsim seems fine with it so I don't know what the deal is.

I think you're right and that might be what's causing my problems.

I don't know what I am doing, so I've cheated and made it 31:0. I then pull the value into a 18-bit register, so it all works out fine :D


EDIT: But it won't simulate, so I am changing to the style recommended in page 184 of https://www.xilinx.com/support/documentation/sw_manuals/xilinx11/xst.pdf, and cross the bridge of how to parameterize it later.
« Last Edit: April 29, 2018, 02:37:08 am by hamster_nz »
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Hmmm And it meets timing like that?  I get an error from ISE that it's impossible to meet timing but I assume it's because I'm doing something wrong. 

Code: [Select]
At least one timing constraint is impossible to meet because component delays alone exceed the constraint. A timing
   constraint summary below shows the failing constraints (preceded with an Asterisk (*)). Please use the Timing Analyzer (GUI) or TRCE
   (command line) with the Mapped NCD and PCF files to identify which constraints and paths are failing because of the component delays
   alone. If the failing path(s) is mapped to Xilinx components as expected, consider relaxing the constraint. If it is not mapped to
   components as expected, re-evaluate your HDL and how synthesis is optimizing the path. To allow the tools to bypass this error, set the
   environment variable XIL_TIMING_ALLOW_IMPOSSIBLE to 1.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
So, my Verilog is very weak, but I made a VHDL design that has been verified enough that I would sort of trust it. I then bashed up some really rough Verilog, and just worked around my lack of skill (e.g. how to set up the kernel ROM correctly).

I then test-benched it against the original, debugged it till they were in agreement. Resource count is identical (1xDSP, 2xBRAM, 80 registers...), timing is identical, so I am sure they are bug-for-bug compatible. Here it is.

Code: [Select]
Timing summary:
 ---------------
 
 Timing errors: 0  Score: 0  (Setup/Max: 0, Hold: 0)
 
 Constraints cover 640 paths, 0 nets, and 290 connections
 
 Design statistics:
    Minimum period:   3.570ns{1}   (Maximum frequency: 280.112MHz)
 

Code: [Select]
`timescale 1ns / 1ps
module bandpass(
    input   clk,
    input  [17:0] din,
    input  din_enable,
    output [47:0] dout,
    output dout_enable
    );

reg [9:0] num_taps = 255;   /* Not referenced below; max_count is used instead */
reg signed[17:0] buffer[0:1023];   /* Data in, sized for a block RAM */
/* Pipelining for the multiplier */
reg signed[17:0] a1,a2,a3;
reg signed[17:0] b1,b2,b3;
/* Results of multiplication. */
reg signed[35:0] product;   
/* Accumulate the products */
reg signed[47:0] accumulator;

reg [47:0] result;
reg result_enable;

   reg [9:0] max_count = 255;
   reg [9:0] data_index;
   reg [9:0] coeff_index;
reg [9:0] write_index;
/* Shift registers for scheduling things */
reg [4:0] reset_accum_sr;
reg [4:0] eject_result_sr;

   assign dout        = result;
assign dout_enable = result_enable;

integer i;

//////////////////////////////////////////
// Not sure how to assign initial values
//////////////////////////////////////////
initial
begin
      for (i=0; i<1024; i=i+1)
   buffer[i] = 2'b00;

data_index   = 255;
max_count    = 255;
write_index  = 254;
      coeff_index  = 254;

end

always @(posedge clk) begin

   /****************************
* The inferred DSP block    *
****************************/
/* The accumulator */
   accumulator <= (reset_accum_sr[0] == 1 ? accumulator:  48'b0) + { {12{product[35]}}, product[35:0] };

/* The multiply operation*/
   product <= a3 * b3;

/* The input pipeline */
a3 <= a2;
a2 <= a1;
a1 <= buffer[data_index];

b3 <= b2;
b2 <= b1;
// b1 <= coeff[coeff_index];
/////////////////////////////////////////////////////////////////
// Can't work out how to infer a pre-initialised ROM properly
/////////////////////////////////////////////////////////////////
// Filter Kernel
case(coeff_index)
           10'b0000000000: b1 <= 18'b111111110111111110; 10'b0000000001: b1 <= 18'b111111101000001011;
           10'b0000000010: b1 <= 18'b111111011001100110; 10'b0000000011: b1 <= 18'b111111001101000111;
           10'b0000000100: b1 <= 18'b111111000011100001; 10'b0000000101: b1 <= 18'b111110111101011101;
           10'b0000000110: b1 <= 18'b111110111011010110; 10'b0000000111: b1 <= 18'b111110111101011001;
           ... 496 more lines ....
           10'b1111111100: b1 <= 18'b000000000000000000; 10'b1111111101: b1 <= 18'b000000000000000000;
           10'b1111111110: b1 <= 18'b000000000000000000; 10'b1111111111: b1 <= 18'b000000000000000000;
endcase;

/***********************************
* Ejecting the result of the filter*
***********************************/
if (eject_result_sr[0] == 1)
begin
result <= accumulator;
result_enable <= 1;
        end else begin
result_enable <= 0;
end
/*********************************
* When we need to trigger the    *
* ejecting the result            *
*********************************/
if (coeff_index == max_count-1)
begin
eject_result_sr = {1'b1, eject_result_sr[4:1]};
end else begin
eject_result_sr = {1'b0, eject_result_sr[4:1]};
end

/*********************************
* Restarting the filter when new *
* data arrives                   *
*********************************/
if (din_enable == 1)
begin
reset_accum_sr = {1'b0, reset_accum_sr[4:1]};
coeff_index    = 0;
data_index     = write_index - max_count + 1;
end else begin
reset_accum_sr = {1'b1, reset_accum_sr[4:1]};
if (coeff_index != max_count)
begin
coeff_index  = coeff_index + 1;
data_index   = data_index + 1;
   end
end;

/*********************************
* Storing new data in the buffer *
*********************************/
if (din_enable == 1)
begin
buffer[write_index] <= din;
write_index <= write_index+1;
end;
end
endmodule
 

Offline whollender

  • Regular Contributor
  • *
  • Posts: 58
  • Country: us
I've used this to initialize ROMs for FFT windowing instead of an FIR kernel, but the result is the same:

Code: [Select]
   // Window function ROM
   localparam WINDOW_ROM_WIDTH = 16;
   reg [WINDOW_ROM_WIDTH-1:0] window_rom [255:0];

   initial
      $readmemh("window_samp.hex", window_rom, 0, 255);


Where "window_samp.hex" is a text file with hex numbers:

Code: [Select]
0002
0002
0003
0005
0007
0009
000D
0011
0016
001B
0022
002A
0033
003D
0049
0056
0065
0076
0089
009E
00B5
00CF
00EB
010B

I don't think that I could find a better way to do it in Verilog.
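
The same idea adapted to a signed 18-bit FIR kernel might look like the sketch below. This is an illustration rather than tested code from the thread: the file name coeff.hex, the module name and the port names are assumptions, and the file is assumed to hold 256 values, one per line.

```verilog
// Hypothetical sketch: signed 18-bit coefficient ROM loaded from a hex file.
module coeff_rom_file (
    input  wire               clk,
    input  wire         [7:0] addr,
    output reg  signed [17:0] coeff
);
    reg signed [17:0] rom [0:255];

    // "coeff.hex" is an assumed file name. Negative values have to be
    // written as 18-bit two's complement, e.g. -6 becomes 3FFFA and
    // 5 becomes 00005.
    initial
        $readmemh("coeff.hex", rom, 0, 255);

    // Registered read, the usual template for a synchronous ROM.
    always @(posedge clk)
        coeff <= rom[addr];
endmodule
```

The hex file itself is easy to generate from whatever tool produced the coefficients, as long as each value is masked to 18 bits before printing.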
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
@hamster_nz Thank you for the code.  I've run it and I get the same results you do(mostly) so that should make it a lot easier to figure out what I'm doing wrong.  I'm going to start tweaking my code in the direction of yours to hopefully figure out the fundamental difference. 

I did figure out my issue with the localparam stuff.  Apparently assigning two dimensional arrays like that is only for SystemVerilog.  You seem to have to do it the way you did it, or with an initial block assigning each value, or as whollender mentions.  I haven't tried whollender's method yet but so far all methods have inferred BRAM. 

@whollender Thank you for the info.  I knew there had to be a better way than what I was doing.  That is much nicer because it's less of a pain to generate a file like that.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
So after hours of messing with my code and nearly losing my mind I've finally figured out what the main difference between my code and hamster_nz's code is.  The fundamental difference is the input pipeline.  My pipeline was only 2 registers deep while his was 3.  I'm not sure why this matters.  While reading through the DSP48A1 user guide I noticed that the input path can have 2 registers in series.  Since my input pipeline only had 2 registers I guess the system couldn't completely optimize this resulting in weird things like putting the accumulator outside the DSP slice. 

If my previous assumption is correct I'm still a little confused about why it couldn't optimize it.  Assuming I need to use the two input registers to pipeline, why is the third needed?  I can see that to meet timing I should have a third to pre-load the values from BRAM so that they are closer to the DSP slice inputs.  Then I would just expect that area to not meet timing.  Instead it seems to completely mess everything up.

I guess my main takeaway from this is that if you want the system to be completely optimized you have to use it a very exact way.  And until you get it exactly right none of it will optimize, even if some of the pieces seem completely unrelated. 

My final question to hamster_nz is what made you decide to make the input pipeline three registers deep?  Was it just to use the two input registers inside the DSP48A1 and then one outside the DSP slice?   
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
If my previous assumption is correct I'm still a little confused about why it couldn't optimize it.  Assuming I need to use the two input registers to pipeline, why is the third needed?  I can see that to meet timing I should have a third to pre-load the values from BRAM so that they are closer to the DSP slice inputs.  Then I would just expect that area to not meet timing.  Instead it seems to completely mess everything up.

BRAM has internal output registers too. They may be bypassed, but this makes BRAM very slow. If you don't have these registers, the tools may feel they cannot go with BRAM at all. Therefore you need 3 registers - one for BRAM and two for DSP.

Getting close shouldn't be a problem, not at this speed. They're already close enough. I don't think there's a need for intermediary flip-flops between BRAM and DSP.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Ah, that makes sense.  So when I'm only using two input pipeline registers it probably puts one in the BRAM and one in the DSP slice.   Which causes the DSP slice to not be entirely pipelined which breaks whatever optimization it wants to do. 
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Does your design use one or two BRAMs? If it is only one, then that is also involved, because a lot of LUTs are being used to MUX in the samples.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
My design was only inferring one BRAM which makes sense because I was using a huge shift register to implement the buffer.  The data from the big shift register to the DSP was the limiting factor in my new design.  I wrote a new one using a circular buffer like you used in yours so when I get home today I'll see if that improves the design further. 
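
For reference, the circular-buffer idea can be sketched roughly as follows. This is a simplified illustration, not the code I ended up with: buffer, write_index and din_enable follow the names used earlier in the thread, while tap, sample and read_index are made up here.

```verilog
// Hypothetical sketch: circular sample buffer replacing a long shift register.
module circular_buffer_sketch (
    input  wire               clk,
    input  wire               din_enable,
    input  wire signed [17:0] din,
    input  wire         [9:0] tap,       // 0 = newest sample, 1 = one before, ...
    output reg  signed [17:0] sample
);
    reg signed [17:0] buffer [0:1023];
    reg         [9:0] write_index;       // next location to be written
    integer i;

    initial begin
        for (i = 0; i < 1024; i = i + 1)
            buffer[i] = 18'sd0;
        write_index = 10'd0;
    end

    // The newest sample sits one slot behind write_index. With 1024
    // entries and 10-bit indices the subtraction wraps around for free.
    wire [9:0] read_index = write_index - 10'd1 - tap;

    always @(posedge clk) begin
        if (din_enable) begin
            buffer[write_index] <= din;          // only one word changes per sample
            write_index <= write_index + 10'd1;
        end
        sample <= buffer[read_index];            // registered read
    end
endmodule
```

The point of the change is that a new sample costs a single memory write instead of moving every stored value, which is what lets the storage map onto a RAM at all.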
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Well, my new code is not creating 2 separate BRAMs.  Even though I tried to make a circular buffer it doesn't seem to be working.  And it's also not putting the accumulator inside the DSP slice again.  I just need to really go through it and figure out why.  Surprisingly it's doing much better than previous attempts on timing despite all the issues mentioned.   The input pipelining seems to help a lot.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
I figured out the accumulator problem.  I forgot to make my product and accumulator registers signed.  If it helps anyone else ever, it matters whether you make those signed or not.
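
A small simulation-only demo of why the keyword matters at the language level is below. It is a sketch with made-up names, and it only shows the Verilog arithmetic rules; I am assuming, not claiming, that this is what stopped the tools from packing the accumulator into the DSP slice.

```verilog
// Hypothetical demo: how a missing "signed" changes a multiply-accumulate.
module signed_demo;
    reg signed [17:0] a, b;
    reg        [35:0] product_u;   // "signed" forgotten
    reg signed [35:0] product_s;
    reg signed [47:0] acc_u, acc_s;

    initial begin
        a = -18'sd3;
        b =  18'sd5;

        product_u = a * b;   // same bit pattern as product_s
        product_s = a * b;

        // Widening to 48 bits is where they differ: an expression with an
        // unsigned operand is zero-extended, a fully signed one is
        // sign-extended.
        acc_u = 48'sd0 + product_u;
        acc_s = 48'sd0 + product_s;

        $display("unsigned product register: acc = %0d", acc_u);
        $display("signed product register:   acc = %0d", acc_s);
    end
endmodule
```

The second line should come out as -15, while the first comes out as a large positive number because the upper 12 bits of the accumulator were filled with zeros.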
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
I guess my main takeaway from this is that if you want the system to be completely optimized you have to use it a very exact way.  And until you get it exactly right none of it will optimize, even if some of the pieces seem completely unrelated. 

I agree, but would say it more softly - to get the best results, you have to write with a clear view in mind of what it should look like at the technology level. If your DSP blocks need pipelining to perform well, you need pipelining in your design (either explicitly or implicitly).

I know others feel differently, and say that you should focus on the behavior you are wanting to describe, not on the implementation. Maybe they are from an ASIC world where you have greater freedom, or maybe they like having issues with timing closure  ;).

Quote
My final question to hamster_nz is what made you decide to make the input pipeline three registers deep?  Was it just to use the two input registers inside the DSP48A1 and then one outside the DSP slice?
Yes - to use the registers. You might even need one more cycle.

- The RAM block needs a cycle to retrieve the data.

- A cycle is needed to get the data from the RAM into DSP's first input register.

- There is another cycle to get into the second register, the one that feeds the multiplier.

- The result of the multiplication then goes into the product register, so the multiplier has the whole cycle.

- Another cycle is needed to add the product to the accumulator.

The 'trick' (if there is one), is to schedule/sequence everything on the first cycle, then use shift registers to delay the control signals the required number of cycles till when they are needed (e.g. resetting the accumulator, or producing the result).
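
A cut-down sketch of that scheduling idea is below. It is an illustration only, not the full filter posted earlier: the first and last flags, the 4-bit shift registers and the module name are assumptions made for this example, and a1/b1 stand in for the block RAM read register.

```verilog
// Hypothetical sketch: MAC pipeline with control signals delayed to match.
module mac_pipeline_sketch (
    input  wire               clk,
    input  wire signed [17:0] a_in,      // sample
    input  wire signed [17:0] b_in,      // coefficient
    input  wire               first,     // high with the first term of a sum
    input  wire               last,      // high with the last term of a sum
    output reg  signed [47:0] acc,
    output reg                acc_valid
);
    reg signed [17:0] a1, a2, a3;        // a1/b1: RAM read register stand-in
    reg signed [17:0] b1, b2, b3;        // a2..a3/b2..b3: DSP input registers
    reg signed [35:0] product;
    reg         [3:0] first_sr, last_sr; // control delayed by four clocks

    initial begin
        a1 = 0; a2 = 0; a3 = 0;
        b1 = 0; b2 = 0; b3 = 0;
        product = 0; acc = 0; acc_valid = 0;
        first_sr = 0; last_sr = 0;
    end

    always @(posedge clk) begin
        // Data path: three input stages, then the product register.
        a1 <= a_in;  a2 <= a1;  a3 <= a2;
        b1 <= b_in;  b2 <= b1;  b3 <= b2;
        product <= a3 * b3;

        // Control path: decided when the data enters, used four clocks later.
        first_sr <= {first_sr[2:0], first};
        last_sr  <= {last_sr[2:0],  last};

        // Bit 3 lines up with the product of the term it was issued with.
        if (first_sr[3])
            acc <= product;              // restart the sum
        else
            acc <= acc + product;

        acc_valid <= last_sr[3];         // acc holds the full sum when high
    end
endmodule
```

The control logic never has to know about the pipeline depth except through the length of the two shift registers, so adding or removing a pipeline stage only means changing that length.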

« Last Edit: May 02, 2018, 04:46:06 am by hamster_nz »
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
Quote
I agree, but would say it more softly - to get the best results, you have to write with a clear view in mind of what it should look like at the technology level. If your DSP blocks need pipelining to perform well, you need pipelining in your design (either explicitly or implicitly).
I think I basically mean the same thing.  The thing that threw me off originally was I saw that it said that the DSP48A1 needs to be fully pipelined to be optimized.  I thought this meant there should be a register for the input, product, and accumulator (as opposed to doing all this in one clock cycle).  But it seems to mean that the two input registers must also be used or other parts of the optimization might not be used.  But this may actually be an unrelated issue I want to ask about below.

Quote
The 'trick' (if there is one), is to schedule/sequence everything on the first cycle, then use shift registers to delay the control signals the required number of cycles till when they are needed (e.g. resetting the accumulator, or producing the result).
Yeah, when I was reading your code I didn't get why you did that at first.  But I then tried to write it myself and ran into the problem that doing that was solving.  It was confusing at first but now it seems pretty obvious. 

So I have my design which is basically hamster_nz's design (once you see a way to do it it's hard to think of a different way).  But on my home computer (Windows 7) I couldn't get my buffer register to infer BRAM.  I feel like I tried everything and I exactly copied hamster_nz's way of creating the buffer but I was getting weird results.  Most of the time I would get what I think is a large register with its outputs multiplexed.  Then ISE was telling me my design was empty because the only output wasn't being driven.  All this was very strange because Modelsim showed the design simulating correctly.  So I came into work and wanted to try something and installed the ISE VM because my work computer is Windows 10.  And with no changes the design is created perfectly!  So it seems like ISE is doing something weird. 

Not only that but it says it can run at 314MHz.  From what I can tell the inferred design is correct(2 BRAM, 1 DSP48A1).  The limiting factor was the path from BRAM to the first input register of the DSP48A1.

On the virtual machine ISE it also shows hamster_nz's code's max frequency being 300MHz which is different from what I get on my personal computer.(Edit: The max frequency differences here were because I used a speed grade of -3.  Switching it to -2 makes the max frequency of both go to 280MHz as expected). 

What operating system are you running ISE on?  Maybe I should always be running the virtual machine version? 

I'll have to do some side by side tests when I get home today. 

Below is my code in case there is a useful difference between it and hamster_nz's.  It was originally not as similar to his, but in troubleshooting my design I tried making them more similar until mine would start to work (which was never on my computer).
Code: [Select]
module Bandpass_Filter(
// Basic control signals
input wire clk,
input wire reset,

// Input data/control signals
input wire[11:0] din,
input wire din_enable,

// Output data/control signals
output reg[11:0] dout,
output reg dout_enable
);

// Define local registers
reg signed[17:0] buffer[0:1023];  // Input buffer to hold data values.  Sized to a power of 2. 
reg signed[35:0] product;       // Product result
reg signed[47:0] accumulator;    

reg signed[17:0] data_d1, data_d2, data_d3;     // Data pipeline
reg signed[17:0] coeff_d1, coeff_d2, coeff_d3; // Coefficient pipeline

reg[9:0] data_index;
reg[9:0] coeff_index;
reg[9:0] write_index;      // This points to the last item in the buffer

reg[4:0] clear_acc_sr;
reg[4:0] eject_data_sr;

integer i;

// Filter parameters (num_taps is a register initialised below, not a localparam)
reg[9:0] num_taps;
localparam data_scale = 16;


initial begin
for(i=0; i<1024; i=i+1)
buffer[i] = 18'b0;
data_index = 1000;
write_index = 999;
coeff_index = 999;
num_taps = 1000;
end

always @(posedge clk) begin
//////////////////////////////////////////
// Create data and coefficient pipeline //
//////////////////////////////////////////
data_d1 <= buffer[data_index];
data_d2 <= data_d1;
data_d3 <= data_d2;

// `include "filter_coeff_15.h";
case(coeff_index)
0: coeff_d1 <= 4;
1: coeff_d1 <= 3;
2: coeff_d1 <= 5;
3: coeff_d1 <= 8;
4: coeff_d1 <= 9;
5: coeff_d1 <= 10;
6: coeff_d1 <= 10;
7: coeff_d1 <= 9;
8: coeff_d1 <= 6;
9: coeff_d1 <= 3;
10: coeff_d1 <= 0;
11: coeff_d1 <= -4;
12: coeff_d1 <= -8;
13: coeff_d1 <= -11;
14: coeff_d1 <= -13;
15: coeff_d1 <= -14;
16: coeff_d1 <= -13;
17: coeff_d1 <= -11;
18: coeff_d1 <= -8;
19: coeff_d1 <= -4;
20: coeff_d1 <= 0;
21: coeff_d1 <= 4;
22: coeff_d1 <= 9;
23: coeff_d1 <= 12;
24: coeff_d1 <= 14;
25: coeff_d1 <= 15;
26: coeff_d1 <= 14;
27: coeff_d1 <= 12;
28: coeff_d1 <= 9;
29: coeff_d1 <= 5;
30: coeff_d1 <= 0;
31: coeff_d1 <= -4;
32: coeff_d1 <= -7;
33: coeff_d1 <= -10;
34: coeff_d1 <= -12;
35: coeff_d1 <= -12;
36: coeff_d1 <= -11;
37: coeff_d1 <= -9;
38: coeff_d1 <= -7;
39: coeff_d1 <= -4;
40: coeff_d1 <= -1;
41: coeff_d1 <= 2;
42: coeff_d1 <= 4;
43: coeff_d1 <= 6;
44: coeff_d1 <= 6;
45: coeff_d1 <= 6;
46: coeff_d1 <= 5;
47: coeff_d1 <= 4;
48: coeff_d1 <= 2;
49: coeff_d1 <= 1;
50: coeff_d1 <= 0;
51: coeff_d1 <= 0;
52: coeff_d1 <= 0;
53: coeff_d1 <= 1;
54: coeff_d1 <= 2;
55: coeff_d1 <= 3;
56: coeff_d1 <= 4;
57: coeff_d1 <= 4;
58: coeff_d1 <= 4;
59: coeff_d1 <= 3;
60: coeff_d1 <= 1;
61: coeff_d1 <= -2;
62: coeff_d1 <= -5;
63: coeff_d1 <= -8;
64: coeff_d1 <= -11;
65: coeff_d1 <= -13;
66: coeff_d1 <= -14;
67: coeff_d1 <= -14;
68: coeff_d1 <= -12;
69: coeff_d1 <= -8;
70: coeff_d1 <= -3;
71: coeff_d1 <= 3;
72: coeff_d1 <= 9;
73: coeff_d1 <= 15;
74: coeff_d1 <= 20;
75: coeff_d1 <= 23;
76: coeff_d1 <= 24;
77: coeff_d1 <= 22;
78: coeff_d1 <= 19;
79: coeff_d1 <= 13;
80: coeff_d1 <= 5;
81: coeff_d1 <= -3;
82: coeff_d1 <= -11;
83: coeff_d1 <= -19;
84: coeff_d1 <= -25;
85: coeff_d1 <= -28;
86: coeff_d1 <= -29;
87: coeff_d1 <= -27;
88: coeff_d1 <= -22;
89: coeff_d1 <= -15;
90: coeff_d1 <= -7;
91: coeff_d1 <= 2;
92: coeff_d1 <= 11;
93: coeff_d1 <= 19;
94: coeff_d1 <= 24;
95: coeff_d1 <= 27;
96: coeff_d1 <= 28;
97: coeff_d1 <= 25;
98: coeff_d1 <= 20;
99: coeff_d1 <= 14;
100: coeff_d1 <= 6;
101: coeff_d1 <= -1;
102: coeff_d1 <= -8;
103: coeff_d1 <= -13;
104: coeff_d1 <= -17;
105: coeff_d1 <= -18;
106: coeff_d1 <= -18;
107: coeff_d1 <= -15;
108: coeff_d1 <= -12;
109: coeff_d1 <= -8;
110: coeff_d1 <= -3;
111: coeff_d1 <= 0;
112: coeff_d1 <= 3;
113: coeff_d1 <= 4;
114: coeff_d1 <= 3;
115: coeff_d1 <= 2;
116: coeff_d1 <= 0;
117: coeff_d1 <= -2;
118: coeff_d1 <= -3;
119: coeff_d1 <= -4;
120: coeff_d1 <= -3;
121: coeff_d1 <= 0;
122: coeff_d1 <= 4;
123: coeff_d1 <= 9;
124: coeff_d1 <= 14;
125: coeff_d1 <= 19;
126: coeff_d1 <= 23;
127: coeff_d1 <= 24;
128: coeff_d1 <= 23;
129: coeff_d1 <= 19;
130: coeff_d1 <= 11;
131: coeff_d1 <= 2;
132: coeff_d1 <= -10;
133: coeff_d1 <= -21;
134: coeff_d1 <= -32;
135: coeff_d1 <= -40;
136: coeff_d1 <= -45;
137: coeff_d1 <= -46;
138: coeff_d1 <= -42;
139: coeff_d1 <= -33;
140: coeff_d1 <= -20;
141: coeff_d1 <= -4;
142: coeff_d1 <= 13;
143: coeff_d1 <= 30;
144: coeff_d1 <= 44;
145: coeff_d1 <= 55;
146: coeff_d1 <= 61;
147: coeff_d1 <= 61;
148: coeff_d1 <= 54;
149: coeff_d1 <= 43;
150: coeff_d1 <= 26;
151: coeff_d1 <= 7;
152: coeff_d1 <= -13;
153: coeff_d1 <= -31;
154: coeff_d1 <= -47;
155: coeff_d1 <= -58;
156: coeff_d1 <= -63;
157: coeff_d1 <= -62;
158: coeff_d1 <= -55;
159: coeff_d1 <= -43;
160: coeff_d1 <= -27;
161: coeff_d1 <= -9;
162: coeff_d1 <= 9;
163: coeff_d1 <= 25;
164: coeff_d1 <= 38;
165: coeff_d1 <= 46;
166: coeff_d1 <= 48;
167: coeff_d1 <= 46;
168: coeff_d1 <= 39;
169: coeff_d1 <= 29;
170: coeff_d1 <= 18;
171: coeff_d1 <= 6;
172: coeff_d1 <= -4;
173: coeff_d1 <= -12;
174: coeff_d1 <= -17;
175: coeff_d1 <= -18;
176: coeff_d1 <= -16;
177: coeff_d1 <= -13;
178: coeff_d1 <= -8;
179: coeff_d1 <= -3;
180: coeff_d1 <= 0;
181: coeff_d1 <= 1;
182: coeff_d1 <= -1;
183: coeff_d1 <= -5;
184: coeff_d1 <= -12;
185: coeff_d1 <= -20;
186: coeff_d1 <= -27;
187: coeff_d1 <= -33;
188: coeff_d1 <= -35;
189: coeff_d1 <= -32;
190: coeff_d1 <= -25;
191: coeff_d1 <= -13;
192: coeff_d1 <= 4;
193: coeff_d1 <= 23;
194: coeff_d1 <= 42;
195: coeff_d1 <= 59;
196: coeff_d1 <= 73;
197: coeff_d1 <= 80;
198: coeff_d1 <= 79;
199: coeff_d1 <= 69;
200: coeff_d1 <= 51;
201: coeff_d1 <= 26;
202: coeff_d1 <= -3;
203: coeff_d1 <= -35;
204: coeff_d1 <= -65;
205: coeff_d1 <= -90;
206: coeff_d1 <= -108;
207: coeff_d1 <= -115;
208: coeff_d1 <= -112;
209: coeff_d1 <= -96;
210: coeff_d1 <= -71;
211: coeff_d1 <= -38;
212: coeff_d1 <= 0;
213: coeff_d1 <= 39;
214: coeff_d1 <= 74;
215: coeff_d1 <= 102;
216: coeff_d1 <= 121;
217: coeff_d1 <= 128;
218: coeff_d1 <= 122;
219: coeff_d1 <= 104;
220: coeff_d1 <= 76;
221: coeff_d1 <= 42;
222: coeff_d1 <= 4;
223: coeff_d1 <= -33;
224: coeff_d1 <= -65;
225: coeff_d1 <= -89;
226: coeff_d1 <= -104;
227: coeff_d1 <= -107;
228: coeff_d1 <= -100;
229: coeff_d1 <= -83;
230: coeff_d1 <= -60;
231: coeff_d1 <= -33;
232: coeff_d1 <= -6;
233: coeff_d1 <= 19;
234: coeff_d1 <= 39;
235: coeff_d1 <= 51;
236: coeff_d1 <= 56;
237: coeff_d1 <= 54;
238: coeff_d1 <= 46;
239: coeff_d1 <= 34;
240: coeff_d1 <= 21;
241: coeff_d1 <= 10;
242: coeff_d1 <= 2;
243: coeff_d1 <= -2;
244: coeff_d1 <= 0;
245: coeff_d1 <= 6;
246: coeff_d1 <= 16;
247: coeff_d1 <= 26;
248: coeff_d1 <= 34;
249: coeff_d1 <= 38;
250: coeff_d1 <= 36;
251: coeff_d1 <= 26;
252: coeff_d1 <= 9;
253: coeff_d1 <= -14;
254: coeff_d1 <= -42;
255: coeff_d1 <= -70;
256: coeff_d1 <= -95;
257: coeff_d1 <= -114;
258: coeff_d1 <= -122;
259: coeff_d1 <= -117;
260: coeff_d1 <= -99;
261: coeff_d1 <= -68;
262: coeff_d1 <= -25;
263: coeff_d1 <= 24;
264: coeff_d1 <= 76;
265: coeff_d1 <= 124;
266: coeff_d1 <= 163;
267: coeff_d1 <= 188;
268: coeff_d1 <= 195;
269: coeff_d1 <= 183;
270: coeff_d1 <= 152;
271: coeff_d1 <= 103;
272: coeff_d1 <= 42;
273: coeff_d1 <= -25;
274: coeff_d1 <= -93;
275: coeff_d1 <= -152;
276: coeff_d1 <= -199;
277: coeff_d1 <= -226;
278: coeff_d1 <= -231;
279: coeff_d1 <= -214;
280: coeff_d1 <= -175;
281: coeff_d1 <= -120;
282: coeff_d1 <= -53;
283: coeff_d1 <= 19;
284: coeff_d1 <= 87;
285: coeff_d1 <= 146;
286: coeff_d1 <= 188;
287: coeff_d1 <= 211;
288: coeff_d1 <= 212;
289: coeff_d1 <= 193;
290: coeff_d1 <= 156;
291: coeff_d1 <= 106;
292: coeff_d1 <= 49;
293: coeff_d1 <= -9;
294: coeff_d1 <= -60;
295: coeff_d1 <= -101;
296: coeff_d1 <= -127;
297: coeff_d1 <= -137;
298: coeff_d1 <= -132;
299: coeff_d1 <= -114;
300: coeff_d1 <= -87;
301: coeff_d1 <= -55;
302: coeff_d1 <= -24;
303: coeff_d1 <= 1;
304: coeff_d1 <= 19;
305: coeff_d1 <= 26;
306: coeff_d1 <= 24;
307: coeff_d1 <= 14;
308: coeff_d1 <= 0;
309: coeff_d1 <= -14;
310: coeff_d1 <= -25;
311: coeff_d1 <= -28;
312: coeff_d1 <= -20;
313: coeff_d1 <= -1;
314: coeff_d1 <= 27;
315: coeff_d1 <= 63;
316: coeff_d1 <= 100;
317: coeff_d1 <= 134;
318: coeff_d1 <= 158;
319: coeff_d1 <= 167;
320: coeff_d1 <= 158;
321: coeff_d1 <= 128;
322: coeff_d1 <= 78;
323: coeff_d1 <= 12;
324: coeff_d1 <= -65;
325: coeff_d1 <= -144;
326: coeff_d1 <= -216;
327: coeff_d1 <= -272;
328: coeff_d1 <= -305;
329: coeff_d1 <= -309;
330: coeff_d1 <= -280;
331: coeff_d1 <= -221;
332: coeff_d1 <= -135;
333: coeff_d1 <= -30;
334: coeff_d1 <= 84;
335: coeff_d1 <= 196;
336: coeff_d1 <= 292;
337: coeff_d1 <= 363;
338: coeff_d1 <= 399;
339: coeff_d1 <= 397;
340: coeff_d1 <= 356;
341: coeff_d1 <= 278;
342: coeff_d1 <= 172;
343: coeff_d1 <= 48;
344: coeff_d1 <= -82;
345: coeff_d1 <= -203;
346: coeff_d1 <= -304;
347: coeff_d1 <= -373;
348: coeff_d1 <= -406;
349: coeff_d1 <= -398;
350: coeff_d1 <= -351;
351: coeff_d1 <= -272;
352: coeff_d1 <= -169;
353: coeff_d1 <= -55;
354: coeff_d1 <= 59;
355: coeff_d1 <= 160;
356: coeff_d1 <= 239;
357: coeff_d1 <= 289;
358: coeff_d1 <= 305;
359: coeff_d1 <= 290;
360: coeff_d1 <= 248;
361: coeff_d1 <= 186;
362: coeff_d1 <= 113;
363: coeff_d1 <= 39;
364: coeff_d1 <= -26;
365: coeff_d1 <= -75;
366: coeff_d1 <= -105;
367: coeff_d1 <= -113;
368: coeff_d1 <= -103;
369: coeff_d1 <= -79;
370: coeff_d1 <= -49;
371: coeff_d1 <= -20;
372: coeff_d1 <= 0;
373: coeff_d1 <= 6;
374: coeff_d1 <= -6;
375: coeff_d1 <= -34;
376: coeff_d1 <= -75;
377: coeff_d1 <= -123;
378: coeff_d1 <= -169;
379: coeff_d1 <= -203;
380: coeff_d1 <= -216;
381: coeff_d1 <= -201;
382: coeff_d1 <= -155;
383: coeff_d1 <= -79;
384: coeff_d1 <= 23;
385: coeff_d1 <= 141;
386: coeff_d1 <= 262;
387: coeff_d1 <= 371;
388: coeff_d1 <= 454;
389: coeff_d1 <= 497;
390: coeff_d1 <= 492;
391: coeff_d1 <= 432;
392: coeff_d1 <= 321;
393: coeff_d1 <= 166;
394: coeff_d1 <= -21;
395: coeff_d1 <= -219;
396: coeff_d1 <= -409;
397: coeff_d1 <= -570;
398: coeff_d1 <= -682;
399: coeff_d1 <= -731;
400: coeff_d1 <= -708;
401: coeff_d1 <= -613;
402: coeff_d1 <= -453;
403: coeff_d1 <= -241;
404: coeff_d1 <= 0;
405: coeff_d1 <= 247;
406: coeff_d1 <= 475;
407: coeff_d1 <= 659;
408: coeff_d1 <= 781;
409: coeff_d1 <= 826;
410: coeff_d1 <= 789;
411: coeff_d1 <= 676;
412: coeff_d1 <= 498;
413: coeff_d1 <= 273;
414: coeff_d1 <= 26;
415: coeff_d1 <= -217;
416: coeff_d1 <= -431;
417: coeff_d1 <= -595;
418: coeff_d1 <= -693;
419: coeff_d1 <= -719;
420: coeff_d1 <= -673;
421: coeff_d1 <= -564;
422: coeff_d1 <= -408;
423: coeff_d1 <= -225;
424: coeff_d1 <= -38;
425: coeff_d1 <= 132;
426: coeff_d1 <= 267;
427: coeff_d1 <= 355;
428: coeff_d1 <= 391;
429: coeff_d1 <= 377;
430: coeff_d1 <= 323;
431: coeff_d1 <= 242;
432: coeff_d1 <= 152;
433: coeff_d1 <= 70;
434: coeff_d1 <= 12;
435: coeff_d1 <= -13;
436: coeff_d1 <= 0;
437: coeff_d1 <= 47;
438: coeff_d1 <= 116;
439: coeff_d1 <= 193;
440: coeff_d1 <= 258;
441: coeff_d1 <= 291;
442: coeff_d1 <= 277;
443: coeff_d1 <= 205;
444: coeff_d1 <= 72;
445: coeff_d1 <= -113;
446: coeff_d1 <= -335;
447: coeff_d1 <= -568;
448: coeff_d1 <= -781;
449: coeff_d1 <= -943;
450: coeff_d1 <= -1023;
451: coeff_d1 <= -997;
452: coeff_d1 <= -852;
453: coeff_d1 <= -590;
454: coeff_d1 <= -224;
455: coeff_d1 <= 215;
456: coeff_d1 <= 686;
457: coeff_d1 <= 1140;
458: coeff_d1 <= 1523;
459: coeff_d1 <= 1785;
460: coeff_d1 <= 1886;
461: coeff_d1 <= 1799;
462: coeff_d1 <= 1516;
463: coeff_d1 <= 1051;
464: coeff_d1 <= 439;
465: coeff_d1 <= -267;
466: coeff_d1 <= -999;
467: coeff_d1 <= -1682;
468: coeff_d1 <= -2241;
469: coeff_d1 <= -2609;
470: coeff_d1 <= -2737;
471: coeff_d1 <= -2596;
472: coeff_d1 <= -2185;
473: coeff_d1 <= -1533;
474: coeff_d1 <= -694;
475: coeff_d1 <= 256;
476: coeff_d1 <= 1225;
477: coeff_d1 <= 2115;
478: coeff_d1 <= 2833;
479: coeff_d1 <= 3298;
480: coeff_d1 <= 3454;
481: coeff_d1 <= 3275;
482: coeff_d1 <= 2766;
483: coeff_d1 <= 1969;
484: coeff_d1 <= 954;
485: coeff_d1 <= -183;
486: coeff_d1 <= -1332;
487: coeff_d1 <= -2377;
488: coeff_d1 <= -3213;
489: coeff_d1 <= -3752;
490: coeff_d1 <= -3934;
491: coeff_d1 <= -3735;
492: coeff_d1 <= -3170;
493: coeff_d1 <= -2288;
494: coeff_d1 <= -1173;
495: coeff_d1 <= 66;
496: coeff_d1 <= 1310;
497: coeff_d1 <= 2435;
498: coeff_d1 <= 3329;
499: coeff_d1 <= 3904;
500: coeff_d1 <= 4102;
501: coeff_d1 <= 3904;
502: coeff_d1 <= 3329;
503: coeff_d1 <= 2435;
504: coeff_d1 <= 1310;
505: coeff_d1 <= 66;
506: coeff_d1 <= -1173;
507: coeff_d1 <= -2288;
508: coeff_d1 <= -3170;
509: coeff_d1 <= -3735;
510: coeff_d1 <= -3934;
511: coeff_d1 <= -3752;
512: coeff_d1 <= -3213;
513: coeff_d1 <= -2377;
514: coeff_d1 <= -1332;
515: coeff_d1 <= -183;
516: coeff_d1 <= 954;
517: coeff_d1 <= 1969;
518: coeff_d1 <= 2766;
519: coeff_d1 <= 3275;
520: coeff_d1 <= 3454;
521: coeff_d1 <= 3298;
522: coeff_d1 <= 2833;
523: coeff_d1 <= 2115;
524: coeff_d1 <= 1225;
525: coeff_d1 <= 256;
526: coeff_d1 <= -694;
527: coeff_d1 <= -1533;
528: coeff_d1 <= -2185;
529: coeff_d1 <= -2596;
530: coeff_d1 <= -2737;
531: coeff_d1 <= -2609;
532: coeff_d1 <= -2241;
533: coeff_d1 <= -1682;
534: coeff_d1 <= -999;
535: coeff_d1 <= -267;
536: coeff_d1 <= 439;
537: coeff_d1 <= 1051;
538: coeff_d1 <= 1516;
539: coeff_d1 <= 1799;
540: coeff_d1 <= 1886;
541: coeff_d1 <= 1785;
542: coeff_d1 <= 1523;
543: coeff_d1 <= 1140;
544: coeff_d1 <= 686;
545: coeff_d1 <= 215;
546: coeff_d1 <= -224;
547: coeff_d1 <= -590;
548: coeff_d1 <= -852;
549: coeff_d1 <= -997;
550: coeff_d1 <= -1023;
551: coeff_d1 <= -943;
552: coeff_d1 <= -781;
553: coeff_d1 <= -568;
554: coeff_d1 <= -335;
555: coeff_d1 <= -113;
556: coeff_d1 <= 72;
557: coeff_d1 <= 205;
558: coeff_d1 <= 277;
559: coeff_d1 <= 291;
560: coeff_d1 <= 258;
561: coeff_d1 <= 193;
562: coeff_d1 <= 116;
563: coeff_d1 <= 47;
564: coeff_d1 <= 0;
565: coeff_d1 <= -13;
566: coeff_d1 <= 12;
567: coeff_d1 <= 70;
568: coeff_d1 <= 152;
569: coeff_d1 <= 242;
570: coeff_d1 <= 323;
571: coeff_d1 <= 377;
572: coeff_d1 <= 391;
573: coeff_d1 <= 355;
574: coeff_d1 <= 267;
575: coeff_d1 <= 132;
576: coeff_d1 <= -38;
577: coeff_d1 <= -225;
578: coeff_d1 <= -408;
579: coeff_d1 <= -564;
580: coeff_d1 <= -673;
581: coeff_d1 <= -719;
582: coeff_d1 <= -693;
583: coeff_d1 <= -595;
584: coeff_d1 <= -431;
585: coeff_d1 <= -217;
586: coeff_d1 <= 26;
587: coeff_d1 <= 273;
588: coeff_d1 <= 498;
589: coeff_d1 <= 676;
590: coeff_d1 <= 789;
591: coeff_d1 <= 826;
592: coeff_d1 <= 781;
593: coeff_d1 <= 659;
594: coeff_d1 <= 475;
595: coeff_d1 <= 247;
596: coeff_d1 <= 0;
597: coeff_d1 <= -241;
598: coeff_d1 <= -453;
599: coeff_d1 <= -613;
600: coeff_d1 <= -708;
601: coeff_d1 <= -731;
602: coeff_d1 <= -682;
603: coeff_d1 <= -570;
604: coeff_d1 <= -409;
605: coeff_d1 <= -219;
606: coeff_d1 <= -21;
607: coeff_d1 <= 166;
608: coeff_d1 <= 321;
609: coeff_d1 <= 432;
610: coeff_d1 <= 492;
611: coeff_d1 <= 497;
612: coeff_d1 <= 454;
613: coeff_d1 <= 371;
614: coeff_d1 <= 262;
615: coeff_d1 <= 141;
616: coeff_d1 <= 23;
617: coeff_d1 <= -79;
618: coeff_d1 <= -155;
619: coeff_d1 <= -201;
620: coeff_d1 <= -216;
621: coeff_d1 <= -203;
622: coeff_d1 <= -169;
623: coeff_d1 <= -123;
624: coeff_d1 <= -75;
625: coeff_d1 <= -34;
626: coeff_d1 <= -6;
627: coeff_d1 <= 6;
628: coeff_d1 <= 0;
629: coeff_d1 <= -20;
630: coeff_d1 <= -49;
631: coeff_d1 <= -79;
632: coeff_d1 <= -103;
633: coeff_d1 <= -113;
634: coeff_d1 <= -105;
635: coeff_d1 <= -75;
636: coeff_d1 <= -26;
637: coeff_d1 <= 39;
638: coeff_d1 <= 113;
639: coeff_d1 <= 186;
640: coeff_d1 <= 248;
641: coeff_d1 <= 290;
642: coeff_d1 <= 305;
643: coeff_d1 <= 289;
644: coeff_d1 <= 239;
645: coeff_d1 <= 160;
646: coeff_d1 <= 59;
647: coeff_d1 <= -55;
648: coeff_d1 <= -169;
649: coeff_d1 <= -272;
650: coeff_d1 <= -351;
651: coeff_d1 <= -398;
652: coeff_d1 <= -406;
653: coeff_d1 <= -373;
654: coeff_d1 <= -304;
655: coeff_d1 <= -203;
656: coeff_d1 <= -82;
657: coeff_d1 <= 48;
658: coeff_d1 <= 172;
659: coeff_d1 <= 278;
660: coeff_d1 <= 356;
661: coeff_d1 <= 397;
662: coeff_d1 <= 399;
663: coeff_d1 <= 363;
664: coeff_d1 <= 292;
665: coeff_d1 <= 196;
666: coeff_d1 <= 84;
667: coeff_d1 <= -30;
668: coeff_d1 <= -135;
669: coeff_d1 <= -221;
670: coeff_d1 <= -280;
671: coeff_d1 <= -309;
672: coeff_d1 <= -305;
673: coeff_d1 <= -272;
674: coeff_d1 <= -216;
675: coeff_d1 <= -144;
676: coeff_d1 <= -65;
677: coeff_d1 <= 12;
678: coeff_d1 <= 78;
679: coeff_d1 <= 128;
680: coeff_d1 <= 158;
681: coeff_d1 <= 167;
682: coeff_d1 <= 158;
683: coeff_d1 <= 134;
684: coeff_d1 <= 100;
685: coeff_d1 <= 63;
686: coeff_d1 <= 27;
687: coeff_d1 <= -1;
688: coeff_d1 <= -20;
689: coeff_d1 <= -28;
690: coeff_d1 <= -25;
691: coeff_d1 <= -14;
692: coeff_d1 <= 0;
693: coeff_d1 <= 14;
694: coeff_d1 <= 24;
695: coeff_d1 <= 26;
696: coeff_d1 <= 19;
697: coeff_d1 <= 1;
698: coeff_d1 <= -24;
699: coeff_d1 <= -55;
700: coeff_d1 <= -87;
701: coeff_d1 <= -114;
702: coeff_d1 <= -132;
703: coeff_d1 <= -137;
704: coeff_d1 <= -127;
705: coeff_d1 <= -101;
706: coeff_d1 <= -60;
707: coeff_d1 <= -9;
708: coeff_d1 <= 49;
709: coeff_d1 <= 106;
710: coeff_d1 <= 156;
711: coeff_d1 <= 193;
712: coeff_d1 <= 212;
713: coeff_d1 <= 211;
714: coeff_d1 <= 188;
715: coeff_d1 <= 146;
716: coeff_d1 <= 87;
717: coeff_d1 <= 19;
718: coeff_d1 <= -53;
719: coeff_d1 <= -120;
720: coeff_d1 <= -175;
721: coeff_d1 <= -214;
722: coeff_d1 <= -231;
723: coeff_d1 <= -226;
724: coeff_d1 <= -199;
725: coeff_d1 <= -152;
726: coeff_d1 <= -93;
727: coeff_d1 <= -25;
728: coeff_d1 <= 42;
729: coeff_d1 <= 103;
730: coeff_d1 <= 152;
731: coeff_d1 <= 183;
732: coeff_d1 <= 195;
733: coeff_d1 <= 188;
734: coeff_d1 <= 163;
735: coeff_d1 <= 124;
736: coeff_d1 <= 76;
737: coeff_d1 <= 24;
738: coeff_d1 <= -25;
739: coeff_d1 <= -68;
740: coeff_d1 <= -99;
741: coeff_d1 <= -117;
742: coeff_d1 <= -122;
743: coeff_d1 <= -114;
744: coeff_d1 <= -95;
745: coeff_d1 <= -70;
746: coeff_d1 <= -42;
747: coeff_d1 <= -14;
748: coeff_d1 <= 9;
749: coeff_d1 <= 26;
750: coeff_d1 <= 36;
751: coeff_d1 <= 38;
752: coeff_d1 <= 34;
753: coeff_d1 <= 26;
754: coeff_d1 <= 16;
755: coeff_d1 <= 6;
756: coeff_d1 <= 0;
757: coeff_d1 <= -2;
758: coeff_d1 <= 2;
759: coeff_d1 <= 10;
760: coeff_d1 <= 21;
761: coeff_d1 <= 34;
762: coeff_d1 <= 46;
763: coeff_d1 <= 54;
764: coeff_d1 <= 56;
765: coeff_d1 <= 51;
766: coeff_d1 <= 39;
767: coeff_d1 <= 19;
768: coeff_d1 <= -6;
769: coeff_d1 <= -33;
770: coeff_d1 <= -60;
771: coeff_d1 <= -83;
772: coeff_d1 <= -100;
773: coeff_d1 <= -107;
774: coeff_d1 <= -104;
775: coeff_d1 <= -89;
776: coeff_d1 <= -65;
777: coeff_d1 <= -33;
778: coeff_d1 <= 4;
779: coeff_d1 <= 42;
780: coeff_d1 <= 76;
781: coeff_d1 <= 104;
782: coeff_d1 <= 122;
783: coeff_d1 <= 128;
784: coeff_d1 <= 121;
785: coeff_d1 <= 102;
786: coeff_d1 <= 74;
787: coeff_d1 <= 39;
788: coeff_d1 <= 0;
789: coeff_d1 <= -38;
790: coeff_d1 <= -71;
791: coeff_d1 <= -96;
792: coeff_d1 <= -112;
793: coeff_d1 <= -115;
794: coeff_d1 <= -108;
795: coeff_d1 <= -90;
796: coeff_d1 <= -65;
797: coeff_d1 <= -35;
798: coeff_d1 <= -3;
799: coeff_d1 <= 26;
800: coeff_d1 <= 51;
801: coeff_d1 <= 69;
802: coeff_d1 <= 79;
803: coeff_d1 <= 80;
804: coeff_d1 <= 73;
805: coeff_d1 <= 59;
806: coeff_d1 <= 42;
807: coeff_d1 <= 23;
808: coeff_d1 <= 4;
809: coeff_d1 <= -13;
810: coeff_d1 <= -25;
811: coeff_d1 <= -32;
812: coeff_d1 <= -35;
813: coeff_d1 <= -33;
814: coeff_d1 <= -27;
815: coeff_d1 <= -20;
816: coeff_d1 <= -12;
817: coeff_d1 <= -5;
818: coeff_d1 <= -1;
819: coeff_d1 <= 1;
820: coeff_d1 <= 0;
821: coeff_d1 <= -3;
822: coeff_d1 <= -8;
823: coeff_d1 <= -13;
824: coeff_d1 <= -16;
825: coeff_d1 <= -18;
826: coeff_d1 <= -17;
827: coeff_d1 <= -12;
828: coeff_d1 <= -4;
829: coeff_d1 <= 6;
830: coeff_d1 <= 18;
831: coeff_d1 <= 29;
832: coeff_d1 <= 39;
833: coeff_d1 <= 46;
834: coeff_d1 <= 48;
835: coeff_d1 <= 46;
836: coeff_d1 <= 38;
837: coeff_d1 <= 25;
838: coeff_d1 <= 9;
839: coeff_d1 <= -9;
840: coeff_d1 <= -27;
841: coeff_d1 <= -43;
842: coeff_d1 <= -55;
843: coeff_d1 <= -62;
844: coeff_d1 <= -63;
845: coeff_d1 <= -58;
846: coeff_d1 <= -47;
847: coeff_d1 <= -31;
848: coeff_d1 <= -13;
849: coeff_d1 <= 7;
850: coeff_d1 <= 26;
851: coeff_d1 <= 43;
852: coeff_d1 <= 54;
853: coeff_d1 <= 61;
854: coeff_d1 <= 61;
855: coeff_d1 <= 55;
856: coeff_d1 <= 44;
857: coeff_d1 <= 30;
858: coeff_d1 <= 13;
859: coeff_d1 <= -4;
860: coeff_d1 <= -20;
861: coeff_d1 <= -33;
862: coeff_d1 <= -42;
863: coeff_d1 <= -46;
864: coeff_d1 <= -45;
865: coeff_d1 <= -40;
866: coeff_d1 <= -32;
867: coeff_d1 <= -21;
868: coeff_d1 <= -10;
869: coeff_d1 <= 2;
870: coeff_d1 <= 11;
871: coeff_d1 <= 19;
872: coeff_d1 <= 23;
873: coeff_d1 <= 24;
874: coeff_d1 <= 23;
875: coeff_d1 <= 19;
876: coeff_d1 <= 14;
877: coeff_d1 <= 9;
878: coeff_d1 <= 4;
879: coeff_d1 <= 0;
880: coeff_d1 <= -3;
881: coeff_d1 <= -4;
882: coeff_d1 <= -3;
883: coeff_d1 <= -2;
884: coeff_d1 <= 0;
885: coeff_d1 <= 2;
886: coeff_d1 <= 3;
887: coeff_d1 <= 4;
888: coeff_d1 <= 3;
889: coeff_d1 <= 0;
890: coeff_d1 <= -3;
891: coeff_d1 <= -8;
892: coeff_d1 <= -12;
893: coeff_d1 <= -15;
894: coeff_d1 <= -18;
895: coeff_d1 <= -18;
896: coeff_d1 <= -17;
897: coeff_d1 <= -13;
898: coeff_d1 <= -8;
899: coeff_d1 <= -1;
900: coeff_d1 <= 6;
901: coeff_d1 <= 14;
902: coeff_d1 <= 20;
903: coeff_d1 <= 25;
904: coeff_d1 <= 28;
905: coeff_d1 <= 27;
906: coeff_d1 <= 24;
907: coeff_d1 <= 19;
908: coeff_d1 <= 11;
909: coeff_d1 <= 2;
910: coeff_d1 <= -7;
911: coeff_d1 <= -15;
912: coeff_d1 <= -22;
913: coeff_d1 <= -27;
914: coeff_d1 <= -29;
915: coeff_d1 <= -28;
916: coeff_d1 <= -25;
917: coeff_d1 <= -19;
918: coeff_d1 <= -11;
919: coeff_d1 <= -3;
920: coeff_d1 <= 5;
921: coeff_d1 <= 13;
922: coeff_d1 <= 19;
923: coeff_d1 <= 22;
924: coeff_d1 <= 24;
925: coeff_d1 <= 23;
926: coeff_d1 <= 20;
927: coeff_d1 <= 15;
928: coeff_d1 <= 9;
929: coeff_d1 <= 3;
930: coeff_d1 <= -3;
931: coeff_d1 <= -8;
932: coeff_d1 <= -12;
933: coeff_d1 <= -14;
934: coeff_d1 <= -14;
935: coeff_d1 <= -13;
936: coeff_d1 <= -11;
937: coeff_d1 <= -8;
938: coeff_d1 <= -5;
939: coeff_d1 <= -2;
940: coeff_d1 <= 1;
941: coeff_d1 <= 3;
942: coeff_d1 <= 4;
943: coeff_d1 <= 4;
944: coeff_d1 <= 4;
945: coeff_d1 <= 3;
946: coeff_d1 <= 2;
947: coeff_d1 <= 1;
948: coeff_d1 <= 0;
949: coeff_d1 <= 0;
950: coeff_d1 <= 0;
951: coeff_d1 <= 1;
952: coeff_d1 <= 2;
953: coeff_d1 <= 4;
954: coeff_d1 <= 5;
955: coeff_d1 <= 6;
956: coeff_d1 <= 6;
957: coeff_d1 <= 6;
958: coeff_d1 <= 4;
959: coeff_d1 <= 2;
960: coeff_d1 <= -1;
961: coeff_d1 <= -4;
962: coeff_d1 <= -7;
963: coeff_d1 <= -9;
964: coeff_d1 <= -11;
965: coeff_d1 <= -12;
966: coeff_d1 <= -12;
967: coeff_d1 <= -10;
968: coeff_d1 <= -7;
969: coeff_d1 <= -4;
970: coeff_d1 <= 0;
971: coeff_d1 <= 5;
972: coeff_d1 <= 9;
973: coeff_d1 <= 12;
974: coeff_d1 <= 14;
975: coeff_d1 <= 15;
976: coeff_d1 <= 14;
977: coeff_d1 <= 12;
978: coeff_d1 <= 9;
979: coeff_d1 <= 4;
980: coeff_d1 <= 0;
981: coeff_d1 <= -4;
982: coeff_d1 <= -8;
983: coeff_d1 <= -11;
984: coeff_d1 <= -13;
985: coeff_d1 <= -14;
986: coeff_d1 <= -13;
987: coeff_d1 <= -11;
988: coeff_d1 <= -8;
989: coeff_d1 <= -4;
990: coeff_d1 <= 0;
991: coeff_d1 <= 3;
992: coeff_d1 <= 6;
993: coeff_d1 <= 9;
994: coeff_d1 <= 10;
995: coeff_d1 <= 10;
996: coeff_d1 <= 9;
997: coeff_d1 <= 8;
998: coeff_d1 <= 5;
999: coeff_d1 <= 3;
default: coeff_d1 <= 0;
endcase
coeff_d2 <= coeff_d1;
coeff_d3 <= coeff_d2;

// Define multiplication
product <= data_d3*coeff_d3;

// Define accumulator
accumulator <= (clear_acc_sr[0] ? 48'b0 : accumulator) + { {12{product[35]}}, product[35:0] };

// Handle resetting filter when new data is received
if(din_enable) begin
clear_acc_sr <= {1'b1, clear_acc_sr[4:1]};
data_index <= write_index - num_taps + 1'b1;
coeff_index <= 1'b0;
end else begin
data_index <= data_index + 1'b1;
coeff_index <= coeff_index + 1'b1;
clear_acc_sr <= {1'b0, clear_acc_sr[4:1]};
end

// Handle writing buffer data
if(din_enable) begin
buffer[write_index] <= { {6{din[11]}}, din[11:0] };
write_index <= write_index + 1'b1;
end

// Handle eject_data scheduling
if(coeff_index == num_taps-1) begin
eject_data_sr <= {1'b1, eject_data_sr[4:1]};
end else begin
eject_data_sr <= {1'b0, eject_data_sr[4:1]};
end

// Handle ejecting data
if(eject_data_sr[0]) begin
dout <= accumulator[11:0];// >>> data_scale;
dout_enable <= 1'b1;
end else begin
// dout <= dout;
dout_enable <= 1'b0;
end

end

endmodule
« Last Edit: May 02, 2018, 05:55:53 pm by pigtwo »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Quote
What operating system are you running ISE on?  Maybe I should always be running the virtual machine version? 
I am running it on Windows 10 natively, but I have an older system that I use when I need to program hardware.

I think you have earned yourself a DSP48A1 scout badge. You need a new project that forces you to learn some other part of the chip. BRAM? Weird PLL reprogramming? SERDES? 

Maybe a Stereo FM transmitter based on waggling a pin really fast with the SERDES block?

Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline pigtwoTopic starter

  • Regular Contributor
  • *
  • Posts: 133
I've compared my results and I do get different results from the two installations.  I noticed that my home computer was running ISE 14.6.  I reinstalled with 14.7 and it seems to be working now.  I feel much better about this now.  I was losing my mind for quite a while.  I've retried a few of my past problems and they now work much better.  All that stuff about the accumulator being outside of the DSP slice seems to have gone away.  So I take back what I said about the design needing to be hyper exact before optimizations happen.  It seems to have just been some sort of problem with my installation.

Quote
I think you have earned yourself a DSP48A1 scout badge. You need a new project that forces you to learn some other part of the chip. BRAM? Weird PLL reprogramming? SERDES? 
Hahah, I definitely feel like I have a much better understanding of the DSP48A1 now.  I agree; this whole project was actually to get me familiar with fast multiplications (DSP48A1) so I'd be more prepared for my next project, which I'm sort of stealing from you.  I wanted to become familiar with driving high-speed LVDS lines with a SERDES, so the project is driving a DVI display showing the Mandelbrot set/Julia sets, ideally allowing the user to zoom or pan.  I have no idea how practical that is since I've done zero calculations so far.

Quote
Maybe a Stereo FM transmitter based on waggling a pin really fast with the SERDES block?

That sounds like a very good idea.  It seems like a very cool project, especially since I know almost nothing about FM transmission, antennas, etc.  Plus I've really wanted to buy a spectrum analyzer but haven't really had a good reason to.  Maybe this can be it.

Thanks again for all the help!
 

