Author Topic: vhdl, fast divider and fast multiplier, without instantiating DSP slices  (Read 11416 times)

0 Members and 1 Guest are viewing this topic.

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
guys, I am looking for algorithms implemented in vhdl
examples, tips, advices, and keyworks to google for

let me know, thanks in advance  :-+

I am going to (re)implement my 32bit ALU (1)
for an hobby project (arise-v2/r5 softcore)


(1) it will be tested under ghdl
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9889
  • Country: us
Almost every modern FPGA has a bunch of fast (1 cycle) multipliers which may, in fact, be DSP blocks but single cycle multipliers were available before DSP came along.

Division is a completely different issue.

I took a class on this stuff in grad school and it must have slipped away.  I didn't use the information for a very long time.

If you insist on doing multiply the hard way, look into Boothe's Algorithm, it deals with more than 1 bit at a time but it is still iterative.
https://en.wikipedia.org/wiki/Booth%27s_multiplication_algorithm

When I wanted to do non-restoring division on signed integers (2's complement), I had a heck of a time getting from what was in the textbook down to something that resembled VHDL.  Then I stumbled across a book that had some microcode for the operation and just copied it.  I wish I had seen this paper back then:

http://www.ijareeie.com/upload/2013/july/59_VHDL.pdf

There are hundreds of web sites and text books that cover computer arithmetic.
« Last Edit: August 24, 2016, 01:41:40 pm by rstofer »
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26891
  • Country: nl
    • NCT Developments
I'd use the * in VHDL and let the synthesizer deal with it. Doing it yourself is likely similar to try and come up with better code than a C compiler / standard library does.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
reasons to do so: don't underestimate the didactical purpose :-+
« Last Edit: August 24, 2016, 06:06:08 am by legacy »
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Boothe's Algorithm

good, that's a key to google for  :-+
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
I'd use the * in VHDL and

interesting, I am looking at some projects in OpenCores, and it seems they have different ideas and behaviors about that, e.g. the Plasma's author used a "*", whereas the Ion's author did something strange which I can't understand, and so on  :-//
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
@rstofer
VHDL Implementation of Non Restoring Division Algorithm Using High Speed Adder/Subtractor
interesting, that's another key to google for and a good lecture, thanks  :-+
 

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 26891
  • Country: nl
    • NCT Developments
I'd use the * in VHDL and
interesting, I am looking at some projects in OpenCores, and it seems they have different ideas and behaviors about that, e.g. the Plasma's author used a "*", whereas the Ion's author did something strange which I can't understand, and so on  :-//
Don't take Opencores projects as examples. Actually the amount of really clever VHDL code you can find is very small.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9889
  • Country: us
Looking back, I just used the Xilinx multiplier gadget:

Code: [Select]
   MULT18X18_inst : MULT18X18
   port map (
      P => Product, -- 36-bit multiplier output
      A => Multiplicand,-- 18-bit multiplier input
      B => Multiplier -- 18-bit multiplier input
   );

In my case, I was using 16 bit 2's complement so I just extended the sign bit two places to the left.  Notice that the multiplier is not a clocked process.  It must be one big ball of logic.  I didn't check the datasheet to see how fast it can run nor did I look to see how long the operands are stable before I snag up the result.  If I were pushing the Spartan 3 I would probably have to do some research.

 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
which is the fastest division algorithm ?
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8637
  • Country: gb
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #10 on: August 25, 2016, 03:36:58 pm »
If you insist on doing multiply the hard way, look into Boothe's Algorithm, it deals with more than 1 bit at a time but it is still iterative.
https://en.wikipedia.org/wiki/Booth%27s_multiplication_algorithm
The key feature of Booth's algorithm is not that it deals with multiple bits. All the algorithms I know of can do that. Booth is distinctive because its naturally 2's complement. Wallace, and other approaches, generally work with unsigned numbers. You need to wrap them in extra logic, with extra delays to achieve a 2's complement multiplier. This can be good or bad. For most DSP you only want a signed multiplier, so Booth is a natural choice. For an ALU you might want both signed and unsigned modes. Wallace might then be a better choice.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9889
  • Country: us
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #11 on: August 25, 2016, 04:53:54 pm »
Xilinx doesn't say how they build up the 18x18 multiplier in the Spartan 3 but it may very well be a Wallace tree because it is unclocked.  The Wallace tree is just a big fur ball of logic.

http://www.xilinx.com/support/documentation/application_notes/xapp467.pdf

In my case, I was only interested in 16 bit 2's complement arithmetic because the computer I was emulating used only that format.

As to division, there are 3 broad approaches: Shift and Add/Subtract, High Radix and Convergence.  Of the Shift and Add/Subtract type, non-restoring is probably a bit faster but if it is, it will only be by a cycle of two.  Both obtain k bits in k cycles plus something for overhead and this is where things will vary.

High Radix is trying to do multi-bit division that I haven't thought much about.

Look for SRT algorithms because they try to do multi-bit division using a lookup table.

Convergence uses an iterative Newton-Raphson scheme (or similar) to converge on a solution.  It will use a lot less cycles on a large word size.  One application is when instead of dividing as n/d, they first find 1/d and then multiply by n.  Clearly, that gets into some kind of fractional notation.

There are MANY discussions on the Internet and a couple of books seem to be useful:

"Computer Arithmetic Algorithms and Hardware Design" - Behrooz Parhami

http://www.alibris.com/Computer-Arithmetic-Algorithms-and-Hardware-Designs-Behrooz-Parhami/book/8234663?matches=42

I don't think I would buy the $600+ offering.  They like that book more than I do!

"Synthesis or Arithmetic Circuits" - Deschamps, Bioul, Sutter

http://www.alibris.com/Synthesis-of-Arithmetic-Circuits-FPGA-ASIC-and-Embedded-Systems-Jean-Pierre-DesChamps/book/28466467?matches=16

A little more reasonable!  This book has a lot of pseudo code that makes the algorithm more clear.

There are many other books on the subject.

The thing about division is that it is seldom used.  It is worth spending a little time researching solutions but, in the end, non-restoring 2's complement gets it done.  Especially for short word lengths.  For a 64 bit machine, maybe the Newton-Raphson approach is a lot better.

 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8637
  • Country: gb
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #12 on: August 25, 2016, 05:03:44 pm »
Xilinx doesn't say how they build up the 18x18 multiplier in the Spartan 3 but it may very well be a Wallace tree because it is unclocked.  The Wallace tree is just a big fur ball of logic.
Both Wallace and Booth can be inplemented as a ripple through logic array, typically with some carry speeds ups. They can also both be implemented as a partitioned tree, with one or more registers creating a pipeline.

As to division, there are 3 broad approaches: Shift and Add/Subtract, High Radix and Convergence.  Of the Shift and Add/Subtract type, non-restoring is probably a bit faster but if it is, it will only be by a cycle of two.  Both obtain k bits in k cycles plus something for overhead and this is where things will vary.
The are more than 3 broad approaches. If you have a single cycle multiplier you can wrap a successive approximation register (SAR) around it, just like many ADCs are a DAC with a SAR wrapped around it. You can spin that into a square root unit, too. Just feed the same number into both multiplier inputs.

 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #13 on: August 25, 2016, 05:33:42 pm »


that's my toy division algorithm, it calculates correctly
but for a word of 32bits it takes 35 clock's ticks  :-DD

it needs "some" work  ;D
( oh well, a completely erase and rewind
I'd better go for the newton_raphson's method)
« Last Edit: August 25, 2016, 05:38:30 pm by legacy »
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #14 on: August 25, 2016, 05:36:18 pm »
@rstofer
@coppice
thanks for your advice  :-+
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9889
  • Country: us
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #15 on: August 25, 2016, 07:25:42 pm »
that's my toy division algorithm, it calculates correctly
but for a word of 32bits it takes 35 clock's ticks  :-DD

it needs "some" work  ;D
( oh well, a completely erase and rewind
I'd better go for the newton_raphson's method)

Not necessarily...

Nevertheless, here is a paper on the Successive Approximation Algorithm:

http://www.ijiee.org/papers/173-E0005.pdf

The algorithm takes as few as 2 states, as many as 34 states but averages around 4.  That's pretty fast!

They also show restoring as being 1 state faster than non-restoring but I wonder if that is just for unsigned.

 

Offline exmadscientist

  • Frequent Contributor
  • **
  • Posts: 342
  • Country: us
  • Technically A Professional
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #16 on: August 26, 2016, 04:23:08 am »
As a point of comparison it might be worth looking at DIV instruction latency on modern CPUs, for example as tabulated by Agner Fog for x86 in his optimization manuals, section 4: http://www.agner.org/optimize/#manuals. Now admittedly this is a very biased sample -- dedicated designs with large teams, plenty of gates to burn, targeting very high clock rates -- but suddenly 35 clocks does not look so bad at all!
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #17 on: August 26, 2016, 11:44:46 am »
 :horse:

is there an ALU's source code to look for, as example of *good* implementation ?
it seems there are no examples about the integer division
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #18 on: August 26, 2016, 12:18:21 pm »
ideas to Speed up the slow integer division algorithm

More difficult problem -- you have no hardware divider, you can't use built-in hardware multipliers because you don't have/can't use them, and traditional division methods like the Grade school algorithm: long division are just too slow, so, what to do?

Find alternative solutions: Multiply by the reciprocal A / D = A * (1/ D) looks interesting !!!

(1/D) ---> Use Newton's method for calculation of the reciprocal of D

it's called Newton-Raphson, it uses Newton's method to converge to the quotient, its strategy is to find the reciprocal of D, and multiply that reciprocal by N to find the final quotient Q

it sounds great, it can also be Pipelined, and it's very fast (quadratic convergence), and so on ...
... unfortunately it requires conversion to/from fractional or still worse floating-point :palm: :palm: :palm:
« Last Edit: August 26, 2016, 12:33:15 pm by legacy »
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9889
  • Country: us
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #19 on: August 26, 2016, 04:39:03 pm »
ideas to Speed up the slow integer division algorithm

More difficult problem -- you have no hardware divider, you can't use built-in hardware multipliers because you don't have/can't use them, and traditional division methods like the Grade school algorithm: long division are just too slow, so, what to do?

Find alternative solutions: Multiply by the reciprocal A / D = A * (1/ D) looks interesting !!!

(1/D) ---> Use Newton's method for calculation of the reciprocal of D

it's called Newton-Raphson, it uses Newton's method to converge to the quotient, its strategy is to find the reciprocal of D, and multiply that reciprocal by N to find the final quotient Q

it sounds great, it can also be Pipelined, and it's very fast (quadratic convergence), and so on ...
... unfortunately it requires conversion to/from fractional or still worse floating-point :palm: :palm: :palm:

I don't think you need to find the reciprocal of N, only D.  N * (1/D) => N/D

What device are you using that doesn't have a multiplier?  In fact, how did you decide to do multiplication?

Pipelining helps if you are doing lots of consecutive divisions but it does nothing to help one division in a long sequence of other instructions unless, by chance or clever application coding (compiler code generation), you don't need the result immediately.

Successive Approximation looks good except for the bit where it only works for unsigned division.  I'd have to think about how to use it with 2's complement.  Having to convert to unsigned is a PITA because it takes extra cycles on both ends of the algorithm.

MIPS used the same sequential hardware for both multiply and divide.  See "Computer Organization and Design" - Patterson and Hennessy.  I don't see a clear description of how they handled both signed and unsigned but they did.

http://www.alibris.com/search/books/isbn/9788181475343

I had to buy my copy from India!  I'm not sure why the publishers sell books in foreign markets for FAR LESS than they sell here.  Kind of like pharmaceuticals.

SRT is also interesting.  It is discussed here:
http://people.irisa.fr/Nicolas.Veyrat-Charvillon/files/articles/SPIE_2005.pdf

More on 'divgen' discussed in the paper above:
http://lipforge.ens-lyon.fr/www/divgen/


 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #20 on: August 26, 2016, 04:47:38 pm »
I don't think you need to find the reciprocal of N

(1/D) is fractional_t
you have to convert N into fractional_t: N --> N~
then you can multiply N~ x 1/D, and you get a fractional_t number, Q~
then you have to convert Q~ into integer, Q~ --> Q

What device are you using that doesn't have a multiplier? 

already implemented, it takes 2 clock's ticks

Pipelining

I don't need it

MIPS used the same sequential hardware for both multiply and divide.  See "Computer Organization and Design" - Patterson and Hennessy.  I don't see a clear description of how they handled both signed and unsigned but they did.

I have already implemented something similar, and the integer-division takes too long, 36 clock's ticks for 32 bit, anyway I have solved the unsigned/signed problem

SRT is also interesting

yup, thanks for the link  :D
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9889
  • Country: us
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #21 on: August 26, 2016, 07:12:29 pm »
I'm not convinced you need to convert N.  It is a whole number.  OTOH, I haven't tried every possibility.

6 / 4 = 1 with a remainder of 2

1 10  -- 6 as a whole number, I just aligned the digits for giggles
0.01  -- 1/4
===
1.10  => quotient is 1, remainder is 1/2 which when converted is 2

I only tried a few test cases and I certainly don't know how to mechanize the creation and extraction of the fractional parts but it seems to work.  Maybe it breaks down for negative numbers...

Unless there is some super fast way to create reciprocals, I'm not sure I see any benefit to the algorithm.
« Last Edit: August 26, 2016, 08:43:00 pm by rstofer »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #22 on: August 26, 2016, 10:23:54 pm »
Unless there is some super fast way to create reciprocals, I'm not sure I see any benefit to the algorithm.

Early Cray supercomputers used reciprocal then multiply. They were pretty smart designers.....
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #23 on: August 26, 2016, 11:35:26 pm »
I have also *a lot* of problems with cordic/16bit
e.g. cos(0) is -1, because 0x8000 is -1 :palm: :palm: :palm:
0x7fff is +1, just 1 bit of difference makes the difference
(those problems are common with fractional_t)
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9889
  • Country: us
Re: vhdl, fast divider and fast multiplier, without instantiating DSP slices
« Reply #24 on: August 27, 2016, 03:53:05 pm »
Unless there is some super fast way to create reciprocals, I'm not sure I see any benefit to the algorithm.

Early Cray supercomputers used reciprocal then multiply. They were pretty smart designers.....

I wonder if the CDC 6000 series also used it.  Considering the 60 bit integer word size, anything that works in log(n) versus linear in n would be a really big deal.

The thing is, CDC converted fixed point operands to floating point before performing the multiply/divide operations and then converted back to fixed point.  This took several steps in software.  Page 3-21 here:
http://www.textfiles.com/bitsavers/pdf/cdc/6x00/60100000D_6600refMan_Feb67.pdf

Cray does pretty much the same thing.  See 2240004   3-18 here:
http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html#p3-19

The CDC and Cray computers were designed for scientific computing.  Integer arithmetic was a side issue, the CPUs were designed to expedite floating point.  Both computers had long and short integers.  The floating point accumulator in the CDC was 96 bits wide!

The CDC machines were impressive!  I have often wondered how many FPGAs it would take to replicate one of these machines and how performance would compare.  Lacking software to run on the new machine makes the project pretty much worthless.
 


Share me

Digg  Facebook  SlashDot  Delicious  Technorati  Twitter  Google  Yahoo
Smf