Author Topic: vhdl, fast divider and fast multiplier, without instantiating DSP slices (Read 11416 times)

legacy · « **on:** August 23, 2016, 09:18:30 pm »

guys, I am looking for algorithms implemented in vhdl
examples, tips, advices, and keyworks to google for

let me know, thanks in advance

I am going to (re)implement my 32bit ALU (1)
for an hobby project (arise-v2/r5 softcore)

(1) it will be tested under ghdl

rstofer · « **Reply #1 on:** August 23, 2016, 11:46:32 pm »

Almost every modern FPGA has a bunch of fast (1 cycle) multipliers which may, in fact, be DSP blocks but single cycle multipliers were available before DSP came along.

Division is a completely different issue.

I took a class on this stuff in grad school and it must have slipped away. I didn't use the information for a very long time.

If you insist on doing multiply the hard way, look into Boothe's Algorithm, it deals with more than 1 bit at a time but it is still iterative.
https://en.wikipedia.org/wiki/Booth%27s_multiplication_algorithm

When I wanted to do non-restoring division on signed integers (2's complement), I had a heck of a time getting from what was in the textbook down to something that resembled VHDL. Then I stumbled across a book that had some microcode for the operation and just copied it. I wish I had seen this paper back then:

http://www.ijareeie.com/upload/2013/july/59_VHDL.pdf

There are hundreds of web sites and text books that cover computer arithmetic.

nctnico · « **Reply #2 on:** August 24, 2016, 12:43:40 am »

I'd use the * in VHDL and let the synthesizer deal with it. Doing it yourself is likely similar to try and come up with better code than a C compiler / standard library does.

legacy · « **Reply #3 on:** August 24, 2016, 05:43:37 am »

reasons to do so: don't underestimate the didactical purpose

legacy · « **Reply #4 on:** August 24, 2016, 05:47:15 am »

Quote from: rstofer on August 23, 2016, 11:46:32 pm

Boothe's Algorithm

good, that's a key to google for

legacy · « **Reply #5 on:** August 24, 2016, 06:01:35 am »

Quote from: nctnico on August 24, 2016, 12:43:40 am

I'd use the * in VHDL and

interesting, I am looking at some projects in OpenCores, and it seems they have different ideas and behaviors about that, e.g. the Plasma's author used a "*", whereas the Ion's author did something strange which I can't understand, and so on

legacy · « **Reply #6 on:** August 24, 2016, 06:04:59 am »

@rstofer
VHDL Implementation of Non Restoring Division Algorithm Using High Speed Adder/Subtractor
interesting, that's another key to google for and a good lecture, thanks

nctnico · « **Reply #7 on:** August 24, 2016, 09:33:58 am »

Quote from: legacy on August 24, 2016, 06:01:35 am

Quote from: nctnico on August 24, 2016, 12:43:40 am
I'd use the * in VHDL and
interesting, I am looking at some projects in OpenCores, and it seems they have different ideas and behaviors about that, e.g. the Plasma's author used a "*", whereas the Ion's author did something strange which I can't understand, and so on

Don't take Opencores projects as examples. Actually the amount of really clever VHDL code you can find is very small.

rstofer · « **Reply #8 on:** August 24, 2016, 01:38:57 pm »

Looking back, I just used the Xilinx multiplier gadget:

Code: [Select]

   MULT18X18_inst : MULT18X18
   port map (
      P => Product,		-- 36-bit multiplier output
      A => Multiplicand,-- 18-bit multiplier input
      B => Multiplier	-- 18-bit multiplier input
   );

In my case, I was using 16 bit 2's complement so I just extended the sign bit two places to the left. Notice that the multiplier is not a clocked process. It must be one big ball of logic. I didn't check the datasheet to see how fast it can run nor did I look to see how long the operands are stable before I snag up the result. If I were pushing the Spartan 3 I would probably have to do some research.

legacy · « **Reply #9 on:** August 25, 2016, 03:17:39 pm »

which is the fastest division algorithm ?

coppice · « **Reply #10 on:** August 25, 2016, 03:36:58 pm »

Quote from: rstofer on August 23, 2016, 11:46:32 pm

If you insist on doing multiply the hard way, look into Boothe's Algorithm, it deals with more than 1 bit at a time but it is still iterative.
https://en.wikipedia.org/wiki/Booth%27s_multiplication_algorithm

The key feature of Booth's algorithm is not that it deals with multiple bits. All the algorithms I know of can do that. Booth is distinctive because its naturally 2's complement. Wallace, and other approaches, generally work with unsigned numbers. You need to wrap them in extra logic, with extra delays to achieve a 2's complement multiplier. This can be good or bad. For most DSP you only want a signed multiplier, so Booth is a natural choice. For an ALU you might want both signed and unsigned modes. Wallace might then be a better choice.

rstofer · « **Reply #11 on:** August 25, 2016, 04:53:54 pm »

Xilinx doesn't say how they build up the 18x18 multiplier in the Spartan 3 but it may very well be a Wallace tree because it is unclocked. The Wallace tree is just a big fur ball of logic.

http://www.xilinx.com/support/documentation/application_notes/xapp467.pdf

In my case, I was only interested in 16 bit 2's complement arithmetic because the computer I was emulating used only that format.

As to division, there are 3 broad approaches: Shift and Add/Subtract, High Radix and Convergence. Of the Shift and Add/Subtract type, non-restoring is probably a bit faster but if it is, it will only be by a cycle of two. Both obtain k bits in k cycles plus something for overhead and this is where things will vary.

High Radix is trying to do multi-bit division that I haven't thought much about.

Look for SRT algorithms because they try to do multi-bit division using a lookup table.

Convergence uses an iterative Newton-Raphson scheme (or similar) to converge on a solution. It will use a lot less cycles on a large word size. One application is when instead of dividing as n/d, they first find 1/d and then multiply by n. Clearly, that gets into some kind of fractional notation.

There are MANY discussions on the Internet and a couple of books seem to be useful:

"Computer Arithmetic Algorithms and Hardware Design" - Behrooz Parhami

http://www.alibris.com/Computer-Arithmetic-Algorithms-and-Hardware-Designs-Behrooz-Parhami/book/8234663?matches=42

I don't think I would buy the $600+ offering. They like that book more than I do!

"Synthesis or Arithmetic Circuits" - Deschamps, Bioul, Sutter

http://www.alibris.com/Synthesis-of-Arithmetic-Circuits-FPGA-ASIC-and-Embedded-Systems-Jean-Pierre-DesChamps/book/28466467?matches=16

A little more reasonable! This book has a lot of pseudo code that makes the algorithm more clear.

There are many other books on the subject.

The thing about division is that it is seldom used. It is worth spending a little time researching solutions but, in the end, non-restoring 2's complement gets it done. Especially for short word lengths. For a 64 bit machine, maybe the Newton-Raphson approach is a lot better.

coppice · « **Reply #12 on:** August 25, 2016, 05:03:44 pm »

Quote from: rstofer on August 25, 2016, 04:53:54 pm

Xilinx doesn't say how they build up the 18x18 multiplier in the Spartan 3 but it may very well be a Wallace tree because it is unclocked. The Wallace tree is just a big fur ball of logic.

Both Wallace and Booth can be inplemented as a ripple through logic array, typically with some carry speeds ups. They can also both be implemented as a partitioned tree, with one or more registers creating a pipeline.

Quote from: rstofer on August 25, 2016, 04:53:54 pm

As to division, there are 3 broad approaches: Shift and Add/Subtract, High Radix and Convergence. Of the Shift and Add/Subtract type, non-restoring is probably a bit faster but if it is, it will only be by a cycle of two. Both obtain k bits in k cycles plus something for overhead and this is where things will vary.

The are more than 3 broad approaches. If you have a single cycle multiplier you can wrap a successive approximation register (SAR) around it, just like many ADCs are a DAC with a SAR wrapped around it. You can spin that into a square root unit, too. Just feed the same number into both multiplier inputs.

legacy · « **Reply #13 on:** August 25, 2016, 05:33:42 pm »

that's my toy division algorithm, it calculates correctly
but for a word of 32bits it takes 35 clock's ticks

it needs "some" work

( oh well, a completely erase and rewind
I'd better go for the newton_raphson's method)

legacy · « **Reply #14 on:** August 25, 2016, 05:36:18 pm »

@rstofer
@coppice
thanks for your advice

rstofer · « **Reply #15 on:** August 25, 2016, 07:25:42 pm »

Quote from: legacy on August 25, 2016, 05:33:42 pm

that's my toy division algorithm, it calculates correctly
but for a word of 32bits it takes 35 clock's ticks

it needs "some" work
( oh well, a completely erase and rewind
I'd better go for the newton_raphson's method)

Not necessarily...

Nevertheless, here is a paper on the Successive Approximation Algorithm:

http://www.ijiee.org/papers/173-E0005.pdf

The algorithm takes as few as 2 states, as many as 34 states but averages around 4. That's pretty fast!

They also show restoring as being 1 state faster than non-restoring but I wonder if that is just for unsigned.

exmadscientist · « **Reply #16 on:** August 26, 2016, 04:23:08 am »

As a point of comparison it might be worth looking at DIV instruction latency on modern CPUs, for example as tabulated by Agner Fog for x86 in his optimization manuals, section 4: http://www.agner.org/optimize/#manuals. Now admittedly this is a very biased sample -- dedicated designs with large teams, plenty of gates to burn, targeting very high clock rates -- but suddenly 35 clocks does not look so bad at all!

legacy · « **Reply #17 on:** August 26, 2016, 11:44:46 am »

is there an ALU's source code to look for, as example of *good* implementation ?
it seems there are no examples about the integer division

legacy · « **Reply #18 on:** August 26, 2016, 12:18:21 pm »

ideas to Speed up the slow integer division algorithm

More difficult problem -- you have no hardware divider, you can't use built-in hardware multipliers because you don't have/can't use them, and traditional division methods like the Grade school algorithm: long division are just too slow, so, what to do?

Find alternative solutions: Multiply by the reciprocal A / D = A * (1/ D) looks interesting !!!

(1/D) ---> Use Newton's method for calculation of the reciprocal of D

it's called Newton-Raphson, it uses Newton's method to converge to the quotient, its strategy is to find the reciprocal of D, and multiply that reciprocal by N to find the final quotient Q

it sounds great, it can also be Pipelined, and it's very fast (quadratic convergence), and so on ...
... unfortunately it requires conversion to/from fractional or still worse floating-point

rstofer · « **Reply #19 on:** August 26, 2016, 04:39:03 pm »

Quote from: legacy on August 26, 2016, 12:18:21 pm

ideas to Speed up the slow integer division algorithm

More difficult problem -- you have no hardware divider, you can't use built-in hardware multipliers because you don't have/can't use them, and traditional division methods like the Grade school algorithm: long division are just too slow, so, what to do?

Find alternative solutions: Multiply by the reciprocal A / D = A * (1/ D) looks interesting !!!

(1/D) ---> Use Newton's method for calculation of the reciprocal of D

it's called Newton-Raphson, it uses Newton's method to converge to the quotient, its strategy is to find the reciprocal of D, and multiply that reciprocal by N to find the final quotient Q

it sounds great, it can also be Pipelined, and it's very fast (quadratic convergence), and so on ...
... unfortunately it requires conversion to/from fractional or still worse floating-point

I don't think you need to find the reciprocal of N, only D. N * (1/D) => N/D

What device are you using that doesn't have a multiplier? In fact, how did you decide to do multiplication?

Pipelining helps if you are doing lots of consecutive divisions but it does nothing to help one division in a long sequence of other instructions unless, by chance or clever application coding (compiler code generation), you don't need the result immediately.

Successive Approximation looks good except for the bit where it only works for unsigned division. I'd have to think about how to use it with 2's complement. Having to convert to unsigned is a PITA because it takes extra cycles on both ends of the algorithm.

MIPS used the same sequential hardware for both multiply and divide. See "Computer Organization and Design" - Patterson and Hennessy. I don't see a clear description of how they handled both signed and unsigned but they did.

http://www.alibris.com/search/books/isbn/9788181475343

I had to buy my copy from India! I'm not sure why the publishers sell books in foreign markets for FAR LESS than they sell here. Kind of like pharmaceuticals.

SRT is also interesting. It is discussed here:
http://people.irisa.fr/Nicolas.Veyrat-Charvillon/files/articles/SPIE_2005.pdf

More on 'divgen' discussed in the paper above:
http://lipforge.ens-lyon.fr/www/divgen/

legacy · « **Reply #20 on:** August 26, 2016, 04:47:38 pm »

Quote from: rstofer on August 26, 2016, 04:39:03 pm

I don't think you need to find the reciprocal of N

(1/D) is fractional_t
you have to convert N into fractional_t: N --> N~
then you can multiply N~ x 1/D, and you get a fractional_t number, Q~
then you have to convert Q~ into integer, Q~ --> Q

Quote from: rstofer on August 26, 2016, 04:39:03 pm

What device are you using that doesn't have a multiplier?

already implemented, it takes 2 clock's ticks

Quote from: rstofer on August 26, 2016, 04:39:03 pm

Pipelining

I don't need it

Quote from: rstofer on August 26, 2016, 04:39:03 pm

MIPS used the same sequential hardware for both multiply and divide. See "Computer Organization and Design" - Patterson and Hennessy. I don't see a clear description of how they handled both signed and unsigned but they did.

I have already implemented something similar, and the integer-division takes too long, 36 clock's ticks for 32 bit, anyway I have solved the unsigned/signed problem

Quote from: rstofer on August 26, 2016, 04:39:03 pm

SRT is also interesting

yup, thanks for the link

rstofer · « **Reply #21 on:** August 26, 2016, 07:12:29 pm »

I'm not convinced you need to convert N. It is a whole number. OTOH, I haven't tried every possibility.

6 / 4 = 1 with a remainder of 2

1 10 -- 6 as a whole number, I just aligned the digits for giggles
0.01 -- 1/4
===
1.10 => quotient is 1, remainder is 1/2 which when converted is 2

I only tried a few test cases and I certainly don't know how to mechanize the creation and extraction of the fractional parts but it seems to work. Maybe it breaks down for negative numbers...

Unless there is some super fast way to create reciprocals, I'm not sure I see any benefit to the algorithm.

hamster_nz · « **Reply #22 on:** August 26, 2016, 10:23:54 pm »

Quote from: rstofer on August 26, 2016, 07:12:29 pm

Unless there is some super fast way to create reciprocals, I'm not sure I see any benefit to the algorithm.

Early Cray supercomputers used reciprocal then multiply. They were pretty smart designers.....

legacy · « **Reply #23 on:** August 26, 2016, 11:35:26 pm »

I have also *a lot* of problems with cordic/16bit
e.g. cos(0) is -1, because 0x8000 is -1

0x7fff is +1, just 1 bit of difference makes the difference
(those problems are common with fractional_t)

rstofer · « **Reply #24 on:** August 27, 2016, 03:53:05 pm »

Quote from: hamster_nz on August 26, 2016, 10:23:54 pm

Quote from: rstofer on August 26, 2016, 07:12:29 pm
Unless there is some super fast way to create reciprocals, I'm not sure I see any benefit to the algorithm.

Early Cray supercomputers used reciprocal then multiply. They were pretty smart designers.....

I wonder if the CDC 6000 series also used it. Considering the 60 bit integer word size, anything that works in log(n) versus linear in n would be a really big deal.

The thing is, CDC converted fixed point operands to floating point before performing the multiply/divide operations and then converted back to fixed point. This took several steps in software. Page 3-21 here:
http://www.textfiles.com/bitsavers/pdf/cdc/6x00/60100000D_6600refMan_Feb67.pdf

Cray does pretty much the same thing. See 2240004 3-18 here:
http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html#p3-19

The CDC and Cray computers were designed for scientific computing. Integer arithmetic was a side issue, the CPUs were designed to expedite floating point. Both computers had long and short integers. The floating point accumulator in the CDC was 96 bits wide!

The CDC machines were impressive! I have often wondered how many FPGAs it would take to replicate one of these machines and how performance would compare. Lacking software to run on the new machine makes the project pretty much worthless.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: vhdl, fast divider and fast multiplier, without instantiating DSP slices (Read 11416 times)

Share me