Topic: More Efficient MAC operation

ali_asadzadeh (Topic starter)
More Efficient MAC operation
« on: January 18, 2021, 10:27:01 am »
Hi,
I have built an RMS unit inside a Gowin FPGA. I need to do a MAC operation on 24-bit input data; the Verilog code for this operation is something like this:

Code: [Select]
input signed [23:0] i_Data;
reg [47:0] r_MAC;

r_MAC       <= r_MAC + (i_Data * i_Data);

The code works as expected, but it uses a lot of DSP resources.
I want to know whether there is a clever way of doing it with fewer DSP blocks.
Gowin has the MULTADDALU IP core, which I think would use only two multipliers. Is there a way to do this calculation with this IP?
ASiDesigner, Stands for Application specific intelligent devices
I'm a Digital Expert from 8-bits to 64-bits
 

BrianHG
Re: More Efficient MAC operation
« Reply #1 on: January 18, 2021, 10:43:59 am »
If you don't need the full integer precision, you can always truncate the product (this is an example; adjust it to the precision you need):
Code: [Select]
input signed [23:0] i_Data;
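reg [23:0] r_MAC;
wire signed [47:0] w_Sq = i_Data * i_Data;  // full-width signed product, computed before any truncation

r_MAC       <= r_MAC + (w_Sq >> 24);        // accumulate only the upper 24 bits of the square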
You can work at 256 times the precision by widening the result from 24 to 32 bits (and shifting by 16 instead of 24).
Also, is your compiler smart enough to know that (i_Data * i_Data) is a squaring operation, which might need fewer multipliers? If not, try coding it as a true power, like this:
r_MAC       <= 48'( r_MAC + (i_Data**2) );

Just use the MULTADDALU IP core. Does it use 2 multipliers with your code?
How many is your current Verilog code using?
« Last Edit: January 18, 2021, 10:45:30 am by BrianHG »
 

ali_asadzadeh (Topic starter)
Re: More Efficient MAC operation
« Reply #2 on: January 18, 2021, 11:18:25 am »
Thanks BrianHG,
I have changed the code as you suggested:

Code: [Select]
r_MAC       <= 48'( r_MAC + (i_Data**2) );
The result is still the same; it uses the same number of DSP blocks.

Quote
Just use the MULTADDALU IP core. Does it use 2 multipliers with your code?
How many is your current Verilog code using?
I do not get your point here. What did you mean?
My code is a simple RMS calculator, and the only place where I have used multipliers is this line of code; the rest just controls the state machine.
If you mean I should use the IP core, then I want to know how to split the multiplication to use two 18-bit numbers, because the maximum width of its inputs is 18 bits.
 

NorthGuy
Re: More Efficient MAC operation
« Reply #3 on: January 18, 2021, 01:52:19 pm »
Look at the datasheet, figure out what operand size the DSP block can take, and use that size. For example, instead of 24-bit numbers use 16-bit numbers. This way you'll only need one DSP block for your MAC.
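
A minimal sketch of this narrowing idea, assuming the 18-bit signed multiplier inputs mentioned elsewhere in this thread. The module name, the i_Clk, i_Clear and i_Valid ports, and the w_ signals are hypothetical, and how the tools map it onto DSP blocks is not guaranteed. Dropping the 6 least significant bits scales the accumulated sum by 2^-12 and adds truncation error, so it only suits designs that can tolerate that.

```verilog
// Sketch only: keep the top 18 bits of the 24-bit sample so that the square
// needs a single 18x18 signed multiply. All names except i_Data and r_MAC
// are hypothetical.
module sq_mac_narrow (
    input  wire               i_Clk,
    input  wire               i_Clear,   // synchronous clear of the accumulator
    input  wire               i_Valid,   // accumulate i_Data on this cycle
    input  wire signed [23:0] i_Data,
    output reg         [47:0] r_MAC
);
    // Drop the 6 least significant bits; the sign bit is preserved.
    wire signed [17:0] w_D18 = i_Data[23:6];

    // 18x18 -> 36-bit signed product. A square is never negative.
    wire signed [35:0] w_Sq = w_D18 * w_D18;

    always @(posedge i_Clk) begin
        if (i_Clear)
            r_MAC <= 48'd0;
        else if (i_Valid)
            r_MAC <= r_MAC + {12'd0, w_Sq};  // zero-extend: w_Sq >= 0
    end
endmodule
```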
 

asmi
Re: More Efficient MAC operation
« Reply #4 on: January 18, 2021, 05:02:52 pm »
Quote
If you mean I should use the IP core, then I want to know how to split the multiplication to use two 18-bit numbers, because the maximum width of its inputs is 18 bits.
This is basic math. Here is a very simple example with decimal numbers, just to demonstrate the idea (imagine that we only have multipliers that take single decimal digits as arguments):
12 * 24 = (10 + 2) * (20 + 4) = (10 * 20) + (2 * 20) + (10 * 4) + (2 * 4) = (1 * 2) * 100 + (2 * 2) * 10 + (1 * 4) * 10 + (2 * 4)
Each product in parentheses is a separate multiplier, while the *10 and *100 operations are just shifts (remember we're working in base 10 here, so assume 1 << 1 is 10).

The situation is the same with binary numbers: you split the input arguments into pieces that fit into your HW multipliers, and then recombine the results.
The cool thing about this is that you can scale the code depending on whether you want to minimize latency, minimize resource utilization, or maximize throughput, by changing the number of operations performed per clock (and perhaps pipelining the operations for maximum throughput).
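
The same split can be sketched in Verilog for the 24-bit square from the first post. This is an illustration under assumptions, not a tested Gowin design: the module name, the i_Clk, i_Clear and i_Valid ports, and the w_ signals are hypothetical, and whether it really lowers the DSP count depends on the synthesis tool. The sample is written as hi * 2^12 + lo, and because both factors are the same number the two cross terms are equal, so three narrow products are enough instead of four.

```verilog
// Sketch only: square a signed 24-bit sample from three products that each
// fit an 18x18 signed multiplier, then accumulate. All names except i_Data
// and r_MAC are hypothetical.
module sq_mac_split (
    input  wire               i_Clk,
    input  wire               i_Clear,   // synchronous clear of the accumulator
    input  wire               i_Valid,   // accumulate i_Data on this cycle
    input  wire signed [23:0] i_Data,
    output reg         [47:0] r_MAC
);
    // i_Data = hi * 2^12 + lo, with hi signed and lo unsigned.
    wire signed [11:0] w_Hi = i_Data[23:12];
    wire signed [12:0] w_Lo = {1'b0, i_Data[11:0]};  // leading 0 keeps lo non-negative

    // Partial products; every operand is far narrower than 18 bits.
    wire signed [23:0] w_HiHi = w_Hi * w_Hi;
    wire signed [24:0] w_HiLo = w_Hi * w_Lo;  // used twice: 2 * hi * lo
    wire signed [25:0] w_LoLo = w_Lo * w_Lo;

    // x^2 = hi^2 * 2^24 + 2 * hi * lo * 2^12 + lo^2.
    // The factor 2 is folded into the shift (13 instead of 12).
    wire signed [47:0] w_Sq = (w_HiHi <<< 24) + (w_HiLo <<< 13) + w_LoLo;

    always @(posedge i_Clk) begin
        if (i_Clear)
            r_MAC <= 48'd0;
        else if (i_Valid)
            r_MAC <= r_MAC + w_Sq;  // w_Sq is a square, so it is never negative
    end
endmodule
```

As a quick check of the arithmetic, i_Data = -1 gives hi = -1 and lo = 4095, and 16777216 - 33546240 + 16769025 = 1. To trade throughput for area, the three products could instead be computed one per clock on a single shared multiplier.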
« Last Edit: January 18, 2021, 07:59:32 pm by asmi »
 

ali_asadzadeh (Topic starter)
Re: More Efficient MAC operation
« Reply #5 on: January 19, 2021, 07:25:09 am »
Thanks asmi for the hint. The DSP blocks are almost the same as those in Spartan-6 devices; they accept 18-bit data as input.
 