Author Topic: what arithmetic functions need to be made in hardware? (Read 7633 times)

legacy · « **on:** September 15, 2016, 03:07:29 pm »

what arithmetic functions need to be made in hardware
to make the programming comfortable and efficient ?!?

I am still developing my own arithmetic engine. Since multiplication and division get complex (and both eat a lot of logic), I am tempted to recycle a CORDIC/BKM fixed-point engine, which can converge to the product or quotient of two integers, with the bonus of providing a few elementary functions such as
- trigonometric sine, cosine, tangent
- hyperbolic sine-h, cosine-h, tangent-h
- absolute value of a complex number (also called "hypotenuse" in Pythagoras' theorem)
- inverse tangent (atan)
- exponential of a complex number
- logarithm of a complex number
- Multiplication and division of complex numbers in rectangular form
- 2D rotation of a complex number by a real angle
- Square root of a real number

what do you think? what do you suggest? any tips?

I also have to consider how many clock ticks they take: e.g. the hyperbolic functions take additional cycles in CORDIC, while BKM raises the complexity to a different order of magnitude. Also, to compute hyperbolic functions with BKM I need to implement a cascade of two BKM modules, and this doubles the required logic on the FPGA
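To make the "converge to the multiplication/division" idea concrete, here is a minimal sketch of linear-mode CORDIC, which uses only shifts and adds. The Q16.16 format, the iteration count and the function names are assumptions for illustration, not details of the engine described above.

```python
FRAC = 16                 # fractional bits (Q16.16 is an assumed format)
ONE = 1 << FRAC


def to_fx(value):
    """Convert a float to fixed point."""
    return int(round(value * ONE))


def to_float(value):
    """Convert fixed point back to a float."""
    return value / ONE


def cordic_mul(x, z, iters=FRAC + 1):
    """Linear rotation mode: returns x*z, valid for |z| < 2."""
    y = 0
    for i in range(iters):
        step = ONE >> i          # 2**-i in fixed point
        if z >= 0:               # drive z towards zero
            y += x >> i
            z -= step
        else:
            y -= x >> i
            z += step
    return y


def cordic_div(y, x, iters=FRAC + 1):
    """Linear vectoring mode: returns y/x for x > 0 and |y/x| < 2."""
    z = 0
    for i in range(iters):
        step = ONE >> i
        if y >= 0:               # drive y towards zero
            y -= x >> i
            z += step
        else:
            y += x >> i
            z -= step
    return z
```

Each iteration yields roughly one more bit of the result, so the latency grows with the word length, and operands outside the convergence range need to be pre-scaled.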

Kilrah · « **Reply #1 on:** September 15, 2016, 03:31:38 pm »

Anything between none and all of them?

Question makes no sense without an intended purpose. There's a reason why there are hundreds of slight variations of the same microcontroller: each of them fits some purpose better than another.

JacquesBBB · « **Reply #2 on:** September 15, 2016, 03:38:34 pm »

I believe it really depends on the final application.

For N-body simulations, the most important operation is the computation of the inverse of the distance
$$ 1/ \sqrt{x^2+y^2+z^2} $$

For this there has been some development of specialized hardware, such as the GRAPE cards
http://jun.artcompsci.org/talks/saltrakecity20121115.pdf

but it is used only by a minority compared to general purpose workstations.
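As an illustration of why this operation suits dedicated hardware, the inverse distance can be computed with multiplies and adds only, using Newton-Raphson iteration on the reciprocal square root. This is a generic sketch, not the GRAPE pipeline; the constant seed and the iteration count are assumptions (real hardware would typically seed from a small lookup table and iterate less).

```python
import math


def rsqrt(a, iters=6):
    """Approximate 1/sqrt(a) for a > 0 without a divider or square-root unit."""
    if a <= 0:
        raise ValueError("a must be positive")
    m, e = math.frexp(a)          # a = m * 2**e with 0.5 <= m < 1
    if e % 2:                     # make the exponent even, so m lies in [0.5, 2)
        m, e = m * 2.0, e - 1
    y = 1.0                       # crude seed, inside the convergence region
    for _ in range(iters):
        y = y * (1.5 - 0.5 * m * y * y)   # Newton step for f(y) = 1/y**2 - m
    return math.ldexp(y, -e // 2)         # undo the scaling: 2**(-e/2)


def inv_distance(x, y, z):
    """1 / sqrt(x^2 + y^2 + z^2)."""
    return rsqrt(x * x + y * y + z * z)
```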

legacy · « **Reply #3 on:** September 15, 2016, 04:55:54 pm »

Quote from: Kilrah on September 15, 2016, 03:31:38 pm

Question makes no sense without an intended purpose

I am developing a general purpose soft core, but it would be also cute to use it in scientific applications, or just like a DSP
in this case, e.g. for mpeg compression/decompression, I'd better implement discrete cosine transform, in hw

currently I need a full ALU: the bitwise part is completed and working, the arithmetic part is complete but … I don't like the multiply and divide units because they take too much logic. also I have already implemented a fixed point engine with saturating operations (add/sub/mul/div)
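For readers unfamiliar with saturating arithmetic: instead of wrapping around on overflow, the result is clamped to the largest or smallest representable value. A minimal sketch, assuming a 16-bit word in Q8.8 format (the actual word length and format of the engine are not stated in the thread):

```python
def saturate(value, bits=16):
    """Clamp value to the range of a signed two's-complement word."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def sat_add(a, b, bits=16):
    return saturate(a + b, bits)


def sat_sub(a, b, bits=16):
    return saturate(a - b, bits)


def sat_mul(a, b, frac=8, bits=16):
    """Fixed-point multiply: the full product is rescaled, then clamped."""
    return saturate((a * b) >> frac, bits)
```

For example, `sat_add(30000, 10000)` clamps to 32767 rather than wrapping to a negative number.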

and I am tempted to add the cordic unit, and the BKM unit to this project

Kilrah · « **Reply #4 on:** September 15, 2016, 05:01:46 pm »

Quote from: legacy on September 15, 2016, 04:55:54 pm

I am developing a general purpose soft core, but it would be also cute to use it in scientific applications, or just like a DSP

Then you've gotta make it parametric, otherwise it will be used absolutely nowhere.

Let the user enable/disable each instruction (or group of closely related / logic-sharing instructions), and/or choose their preferred implementation of each from several you've implemented if that's what you feel like doing. Someone will be happy with the approach that takes "too much logic" but doesn't result in the approximations the one that takes less logic implies etc...

Someone · « **Reply #5 on:** September 15, 2016, 10:57:12 pm »

Quote from: JacquesBBB on September 15, 2016, 03:38:34 pm

I believe it really depends on the final application.

Absolutely, you can see many academic papers where different architectures have been evaluated and they can be run against "generic" tests such as Dhrystone or more targeted benchmarks:
http://parasuite.inria.fr/benchmarks/

Quote from: Kilrah on September 15, 2016, 05:01:46 pm

Quote from: legacy on September 15, 2016, 04:55:54 pm
I am developing a general purpose soft core, but it would be also cute to use it in scientific applications, or just like a DSP

Then you've gotta make it parametric, otherwise it will be used absolutely nowhere.

Let the user enable/disable each instruction (or group of closely related / logic-sharing instructions), and/or choose their preferred implementation of each from several you've implemented if that's what you feel like doing. Someone will be happy with the approach that takes "too much logic" but doesn't result in the approximations the one that takes less logic implies etc...

Take it to the next level and have the compiler spit out all the possible implementation options for a given program. These can then be compared against the speed or area metrics of each physical implementation of the target core at compile time to select the optimal solution. It's these higher-order comparisons that take so much time to complete when engineering systems; if there were free, cross-platform/cross-target tools to increase productivity, they would be very attractive.

Kalvin · « **Reply #6 on:** September 16, 2016, 07:01:37 am »

Have you thought about implementing some OS functionality in hardware? For example, a timer queue and a tasker could be implemented at the hardware level. Starting a timer would insert it into the hardware queue, and the hardware would keep track of the timers and generate notifications when they expire. A tasker could implement a simple task queue with priorities etc. Those could be configurable when compiling the VHDL/Verilog for the system.
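A small software model of such a timer queue, just to pin down the behaviour the hardware would have to provide. The class and method names are made up for this sketch; a hardware version would more likely use a sorted register array or a delta list than a heap.

```python
import heapq


class TimerQueue:
    """Behavioural model: timers expire in order of their deadline."""

    def __init__(self):
        self.now = 0
        self._heap = []           # entries are (expiry_tick, timer_id)

    def start(self, timer_id, delay):
        """Insert a timer that expires 'delay' ticks from now."""
        heapq.heappush(self._heap, (self.now + delay, timer_id))

    def tick(self):
        """Advance time by one tick and return the timers that expired."""
        self.now += 1
        expired = []
        while self._heap and self._heap[0][0] <= self.now:
            expired.append(heapq.heappop(self._heap)[1])
        return expired
```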

How about adding one bit to operand field which would make the operation atomic? For example if a sequence of the instructions have the atomic bit set, the instructions would be guaranteed to be executed as atomic without interrupts. Then there would not be any need to enable/disable interrupts when performing atomic operations. Of course it would cost one bit in the instruction word, but if you have a VLIW architecture, you might be able to implement this functionality.

legacy · « **Reply #7 on:** September 16, 2016, 11:11:14 am »

Quote from: Kalvin on September 16, 2016, 07:01:37 am

How about adding one bit to operand field which would make the operation atomic?

good idea

rstofer · « **Reply #8 on:** September 16, 2016, 04:05:01 pm »

Memory management is an essential component if any multitasking OS is to be added. Memory protection is part of this. A user just can't be allowed to access memory that doesn't belong to his process. Demand paging or at least page faults need to be handled. There's a barn full of logic involved with memory...

I don't pretend to understand how content addressable memory is implemented but it is clearly required for paging. Either a page is in RAM or it needs to be (re)loaded from disk. Dirty pages need to be saved, code pages (Harvard architecture) don't. Etc...

Atomic semaphores: https://docs.oracle.com/cd/E19683-01/806-6867/sync-27385/index.html

tatus1969 · « **Reply #9 on:** September 16, 2016, 05:24:43 pm »

at the end it will always be a cost-performance tradeoff for every target, as you don't spend chip area unless you need to.

Bruce Abbott · « **Reply #10 on:** September 16, 2016, 06:06:08 pm »

Quote from: rstofer on September 16, 2016, 04:05:01 pm

Memory management is an essential component if any multitasking OS is to be added. Memory protection is part of this. A user just can't be allowed to access memory that doesn't belong to his process. Demand paging or at least page faults need to be handled.

This explains why multitasking on the Commodore Amiga didn't work...

rstofer · « **Reply #11 on:** September 16, 2016, 07:17:22 pm »

Quote from: Bruce Abbott on September 16, 2016, 06:06:08 pm

Quote from: rstofer on September 16, 2016, 04:05:01 pm
Memory management is an essential component if any multitasking OS is to be added. Memory protection is part of this. A user just can't be allowed to access memory that doesn't belong to his process. Demand paging or at least page faults need to be handled.
This explains why multitasking on the Commodore Amiga didn't work...

Multitasking can be done at some level even on the most simple uCs. RTOS's and even superloops count as a form of multitasking.

Windows implements some kind of demand paging. The question isn't so much what is required but whether it is worth doing in hardware. The 'big iron' mainframes did a lot of the memory management in hardware including demand paging, at least up to the point of realizing that there was a page fault. Memory protection was a big feature of the CDC 6400 systems with bounds registers for each process (classified work, no peeking). Given the speed of modern processors, I wonder if moving more off to hardware is even required.

'Small iron' like the IBM1130 didn't initially do multitasking but they did do a form of demand paging by loading on call various subroutines. There were LOCAL (Load On CALl) cards at the end of the job deck that indicated which user subroutines could overlay each other. It was up to the programmer to get it right. There were SOCAL cards to allow overlaying system routines. All because memory was a very limited resource. In this regard, DMS for the IBM1130 was far superior to CP/M for the 8080s that came along a decade later. Yes, people added paging to 8080s (and their brethren) but so far as I remember, it was never supported by the OS's of the day.

bitslice · « **Reply #12 on:** September 16, 2016, 08:22:38 pm »

Quote from: legacy on September 15, 2016, 03:07:29 pm

what arithmetic functions need to be made in hardware
to make the programming comfortable and efficient ?!?

64bit barrel shifts are sometimes implemented in hardware rather than software.
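The usual hardware structure is a logarithmic shifter: six multiplexer stages for 64 bits, where stage k shifts by 2^k if bit k of the shift amount is set. A behavioural sketch (function names are illustrative):

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1
STAGES = 6                        # 2**6 == 64


def barrel_shift_left(value, amount):
    """Logical left shift by 0..63, one mux stage per bit of 'amount'."""
    value &= MASK
    for stage in range(STAGES):
        if (amount >> stage) & 1:
            value = (value << (1 << stage)) & MASK
    return value


def barrel_rotate_left(value, amount):
    """Rotate left by 0..63 using the same staged structure."""
    value &= MASK
    for stage in range(STAGES):
        if (amount >> stage) & 1:
            s = 1 << stage
            value = ((value << s) | (value >> (WIDTH - s))) & MASK
    return value
```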

helius · « **Reply #13 on:** September 16, 2016, 08:42:12 pm »

Quote from: rstofer on September 16, 2016, 07:17:22 pm

...'Small iron' like the IBM1130 didn't initially do multitasking but they did do a form of demand paging by loading on call various subroutines...

That's a rather more expansive definition of "demand paging" than I am used to. Usually it only means one thing: that a program may make an access to memory which doesn't exist anywhere, and it will get allocated and initialized by the page fault handler. This isn't required for a virtual memory system, but it's generally convenient to do it this way.
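In that narrow sense, demand paging can be modelled in a few lines: an access to a page that does not exist yet triggers a fault handler, which allocates and zero-fills the page. The page size and names below are assumptions for this sketch.

```python
PAGE_SIZE = 4096


class DemandPagedMemory:
    """Pages come into existence on first touch."""

    def __init__(self):
        self.pages = {}           # page number -> bytearray
        self.faults = 0

    def _page(self, addr):
        number = addr // PAGE_SIZE
        if number not in self.pages:
            # "page fault": allocate and initialize the missing page
            self.faults += 1
            self.pages[number] = bytearray(PAGE_SIZE)
        return self.pages[number]

    def read(self, addr):
        return self._page(addr)[addr % PAGE_SIZE]

    def write(self, addr, value):
        self._page(addr)[addr % PAGE_SIZE] = value & 0xFF
```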

rstofer · « **Reply #14 on:** September 16, 2016, 10:25:11 pm »

Quote from: helius on September 16, 2016, 08:42:12 pm

Quote from: rstofer on September 16, 2016, 07:17:22 pm
...'Small iron' like the IBM1130 didn't initially do multitasking but they did do a form of demand paging by loading on call various subroutines...
That's a rather more expansive definition of "demand paging" than I am used to. Usually it only means one thing: that a program may make an access to memory which doesn't exist anywhere, and it will get allocated and initialized by the page fault handler. This isn't required for a virtual memory system, but it's generally convenient to do it this way.

In modern terms with virtual memory systems all over the place, yes, the definition above is expansive.
But it did allow for dynamic loading of code; FSIN(x) could overlay FCOS(x) even if both were used in the same expression (hideous for execution time). But it was an overlay of an area of memory, not a remapping.

Still, it allowed rather large programs to fit in our small 8k word machines.

jcosper · « **Reply #15 on:** September 17, 2016, 03:16:03 am »

Fast MACs are always useful for DSP

AndyC_772 · « **Reply #16 on:** September 17, 2016, 08:00:03 am »

Quote from: legacy on September 15, 2016, 04:55:54 pm

I am developing a general purpose soft core, but it would be also cute to use it in scientific applications

Is this a commercial development that's intended for a particular project, and must achieve some price / performance / feature combination that's not available with off-the-shelf silicon? Or is it an academic or hobby exercise?

If it's the former, then obviously the feature set you need to implement will be dictated by its end use.

If it's the latter, then the idea that you "need" to implement any particular feature at all depends only on whether or not you want to implement it. If something interests you, or will be a useful learning exercise, then go right ahead and do it, and if it doesn't, then don't.

If you're after ideas in either case: one thing I find tends to be done badly in many microprocessors has nothing to do with the processing itself; it's getting data in and out of the CPU in the first place. Make sure you include an interface that will readily connect to an external data source or sink on your FPGA, eg. some kind of parallel interface with a reasonable clock rate and word length, and a simple scheme for flow control and addressing.

Feynman · « **Reply #17 on:** September 17, 2016, 08:29:58 am »

Quote from: JacquesBBB on September 15, 2016, 03:38:34 pm

I believe it really depends on the final application.

For motor control I would love Clarke and Park transformations (and their inverses) and Space Vector Modulation. As a bonus I would take a hardware PI(D) control algorithm (with integrator anti-windup) as well

legacy · « **Reply #18 on:** September 17, 2016, 10:35:43 pm »

Quote from: Feynman on September 17, 2016, 08:29:58 am

For motor control I would love Clarke and Park transformations (and their inverses) and Space Vector Modulation. As a bonus I would take a hardware PI(D) control algorithm (with integrator anti-windup) as well

too bonus

legacy · « **Reply #19 on:** September 18, 2016, 11:39:11 am »

Quote from: AndyC_772 on September 17, 2016, 08:00:03 am

Is this a commercial development that's intended for a particular project, and must achieve some price / performance / feature combination that's not available with off-the-shelf silicon? Or is it an academic or hobby exercise?

personal project, pure hobby, not academic, training purpose
(maybe also for research? who knows)

Quote from: AndyC_772 on September 17, 2016, 08:00:03 am

If it's the former, then obviously the feature set you need to implement will be dictated by its end use.

at the beginning I was just playing with fpga: the ALU was missing the multiplier and the divider, so I added them, and I am still having trouble making them as fast as they could be

I mean the multiplier was taking 32 clock ticks, now it takes just 2 clock ticks
whereas the divider still takes 35 clock ticks
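A figure in the low thirties is what a one-bit-per-clock divider gives for 32-bit operands. The sketch below shows radix-2 restoring division with one loop iteration per quotient bit; it is an assumption that the divider in question works this way, and the thread does not say where the extra ticks beyond 32 go.

```python
def restoring_divide(dividend, divisor, width=32):
    """Unsigned restoring division; each loop pass models one clock tick."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    remainder = 0
    quotient = 0
    for bit in range(width - 1, -1, -1):
        # shift the next dividend bit into the partial remainder
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        if remainder >= divisor:  # trial subtraction succeeds
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

Getting below this takes a higher radix (several quotient bits per tick) or an iterative scheme built on the fast multiplier, both of which cost more logic.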

anyway, when I wanted to use my soc in a real application I discovered that I needed trigonometric functions (cosine and sine) made in hardware, therefore I added the CORDIC unit (circular domain only)
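For reference, a minimal fixed-point sketch of circular CORDIC in rotation mode. The Q2.30 format, the 30 iterations and the names are assumptions, not the parameters of the unit described here; in hardware the arctangent table would be a small ROM.

```python
import math

FRAC = 30                         # fractional bits (assumed format)
ONE = 1 << FRAC
ITERS = 30

# atan(2**-i) table in fixed point
ATAN_TABLE = [int(round(math.atan(2.0 ** -i) * ONE)) for i in range(ITERS)]

# each micro-rotation stretches the vector, so start from the inverse gain
GAIN = 1.0
for i in range(ITERS):
    GAIN /= math.sqrt(1.0 + 2.0 ** (-2 * i))
GAIN_FX = int(round(GAIN * ONE))


def cordic_sincos(angle):
    """Return (cos, sin) of angle in radians, for |angle| up to about 1.74."""
    x, y, z = GAIN_FX, 0, int(round(angle * ONE))
    for i in range(ITERS):
        if z >= 0:                # rotate counter-clockwise
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:                     # rotate clockwise
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return x / ONE, y / ONE
```

Angles outside the convergence range have to be folded back into it first, for example by quadrant.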

then I discovered that I need hyperbolic functions (hyperbolic cosine, hyperbolic sine), the CORDIC unit can calculate them as well as the exponential, but I wanted to implement the BKM(1) algorithm because it also provides 2d-rotation of a complex number by a real angle
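The extra cycles for the hyperbolic functions come from the fact that hyperbolic CORDIC only converges if some iterations are repeated (indices 4, 13, 40, ...). A floating-point sketch to show the structure; the names and the 24-step limit are assumptions, and a real unit would use fixed point and a ROM for the atanh values.

```python
import math

# iteration indices 1..24, with 4 and 13 executed twice
SEQUENCE = []
for i in range(1, 25):
    SEQUENCE.append(i)
    if i in (4, 13):
        SEQUENCE.append(i)

# overall gain of the hyperbolic micro-rotations
GAIN = 1.0
for i in SEQUENCE:
    GAIN *= math.sqrt(1.0 - 2.0 ** (-2 * i))


def cordic_hyperbolic(t):
    """Return (cosh t, sinh t) for |t| below about 1.1."""
    x, y, z = 1.0 / GAIN, 0.0, t
    for i in SEQUENCE:
        d = 1.0 if z >= 0 else -1.0
        p = 2.0 ** -i             # a plain right shift in hardware
        x, y, z = x + d * y * p, y + d * x * p, z - d * math.atanh(p)
    return x, y


def cordic_exp(t):
    """exp(t) = cosh(t) + sinh(t)."""
    c, s = cordic_hyperbolic(t)
    return c + s
```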

and here we are

(1) currently I am working on a modified version of BKM called "Zeda"; from the paper, the original BKM algorithm does not converge as fast as expected and the complex sign function (derived from the Robertson diagram) is too critical when ported to fixed point

```
******************************************************************************
 test15: "trig(pi/4)"
******************************************************************************
 stimulus(+0.000000,+0.392699)
 expected(+0.923880,+0.382683) <--- correct value
computed0(+0.923880,+0.382683) diff(+0.000000,-0.000000) zeda_fp_cexp: success
computed1(+0.928537,+0.371240) diff(+0.004658,-0.011444) mbkm_fp_cexp: failure
computed2(+0.923880,+0.382683) diff(+0.000000,-0.000000) cordic_fx_sc: success
computed3(+0.923880,+0.382683) diff(+0.000000,-0.000000) zeda_fx_cexp: success

******************************************************************************
 test16: "trig(pi/6)"
******************************************************************************
 stimulus(+0.000000,+0.523599)
 expected(+0.866025,+0.500000) <--- correct value
computed0(+0.866025,+0.500000) diff(+0.000000,+0.000000) zeda_fp_cexp: success
computed1(+0.854114,+0.520086) diff(-0.011911,+0.020086) mbkm_fp_cexp: failure
computed2(+0.866025,+0.500000) diff(+0.000000,+0.000000) cordic_fx_sc: success
computed3(+0.866025,+0.500000) diff(+0.000000,-0.000000) zeda_fx_cexp: success
```

Quote from: AndyC_772 on September 17, 2016, 08:00:03 am

If it's the latter, then the idea that you "need" to implement any particular feature at all depends only on whether or not you want to implement it. If something interests you, or will be a useful learning exercise, then go right ahead and do it, and if it doesn't, then don't.

exactly

Quote from: AndyC_772 on September 17, 2016, 08:00:03 am

If you're after ideas in either case: one thing I find tends to be done badly in many microprocessors has nothing to do with the processing itself; it's getting data in and out of the CPU in the first place. Make sure you include an interface that will readily connect to an external data source or sink on your FPGA, eg. some kind of parallel interface with a reasonable clock rate and word length, and a simple scheme for flow control and addressing.

ah well, I have already implemented a *debug engine*, it talks serially over the serial port at 115200bps (it can go up to 1Mbps), and I am going to add a super fast cypress-USB interface at 20Mbyte/sec

the debug engine uses a protocol that I have developed to take the full control of the datapath, I can read/write registers, read/write devices, bypass the cpu, inject opcode, program the external asynchronous static ram, etc

on the host side, the debug engine talks to an interface written in C, it comes with a client server model, the server is attached to the serial port and talks to a client, different clients are possible, including a client with a comfortable shell which can be used interactively and can accept scripts

btw, before putting the bitstream into a real fpga I simulate everything with GHDL. I have a lot of test entities in my testbench for the CORDIC unit; the BKM unit comes with even more test entities because it's more complex

legacy · « **Reply #20 on:** September 18, 2016, 11:44:47 am »

Quote from: bitslice on September 16, 2016, 08:22:38 pm

64bit barrel shifts are sometimes implemented in hardware rather than software.

added

legacy · « **Reply #21 on:** September 18, 2016, 11:49:21 am »

Quote from: jcosper on September 17, 2016, 03:16:03 am

Fast MACs are always useful for DSP

ironically, as a side effect
when I implemented the fast-mul
I got a multiply-and-accumulate unit

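```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;  -- provides the signed type

entity multiplier_bh_v1 is
  generic
  (
    n: natural := 16;  -- in_x and in_u are n+1 bits wide
    m: natural := 8    -- in_y is m+1 bits wide
  );
  port
  (
    in_x : in  signed(n downto 0);
    in_y : in  signed(m downto 0);
    in_u : in  signed(n downto 0);      -- accumulator input
    out_z: out signed(n+m+1 downto 0)   -- n+m+2 bits
  );
end multiplier_bh_v1;
```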

out_z = in_u + (in_x * in_y)

currently it's "unsigned" only
I need to make it "signed"
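A quick behavioural model of the signed version is handy as a reference for the testbench. It also shows that the n+m+2 output bits of the entity are enough for in_u + in_x * in_y in two's complement, since the worst case is 2^(n+m) + 2^n - 1. The helper names are made up for this sketch.

```python
N, M = 16, 8                      # generics of multiplier_bh_v1


def to_signed(raw, bits):
    """Interpret the low 'bits' bits of raw as a two's-complement number."""
    raw &= (1 << bits) - 1
    return raw - (1 << bits) if raw >> (bits - 1) else raw


def mac_signed(in_x, in_y, in_u, n=N, m=M):
    """Model of out_z = in_u + in_x * in_y on raw bit patterns."""
    x = to_signed(in_x, n + 1)
    y = to_signed(in_y, m + 1)
    u = to_signed(in_u, n + 1)
    z = u + x * y
    out_bits = n + m + 2
    # the result always fits the declared output width
    assert -(1 << (out_bits - 1)) <= z < (1 << (out_bits - 1))
    return z & ((1 << out_bits) - 1)   # raw bits of out_z
```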

legacy · « **Reply #22 on:** September 18, 2016, 11:54:54 am »

Quote from: Feynman on September 17, 2016, 08:29:58 am

For motor control I would love Clarke and Park transformations (and their inverses) and Space Vector Modulation

this could be a very interesting "demo"

I wonder if 2D-rotations are enough
I don't actually know the Clarke and Park transformations
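On whether 2D rotations are enough: in the usual textbook formulation, the Park transformation is exactly a 2D rotation by the rotor angle, and the Clarke transformation is a fixed-coefficient projection of the three phase quantities onto two axes. A minimal sketch, assuming the amplitude-invariant (2/3) scaling, which is only one of several conventions; the function names are illustrative.

```python
import math


def clarke(a, b, c):
    """Three-phase (a, b, c) to stationary two-axis (alpha, beta)."""
    alpha = (2.0 / 3.0) * (a - 0.5 * b - 0.5 * c)
    beta = (b - c) / math.sqrt(3.0)
    return alpha, beta


def park(alpha, beta, theta):
    """Stationary (alpha, beta) to rotating (d, q): a rotation by -theta."""
    d = alpha * math.cos(theta) + beta * math.sin(theta)
    q = -alpha * math.sin(theta) + beta * math.cos(theta)
    return d, q


def inverse_park(d, q, theta):
    """Rotating (d, q) back to stationary (alpha, beta): a rotation by +theta."""
    alpha = d * math.cos(theta) - q * math.sin(theta)
    beta = d * math.sin(theta) + q * math.cos(theta)
    return alpha, beta
```

So a rotation unit would cover Park and its inverse, while Clarke needs only a few constant multiplications and additions.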

Kalvin · « **Reply #23 on:** September 18, 2016, 01:01:22 pm »

Quote from: legacy on September 16, 2016, 11:11:14 am

Quote from: Kalvin on September 16, 2016, 07:01:37 am
How about adding one bit to operand field which would make the operation atomic?

good idea

An update to my earlier suggestion: Microchip has this kind of functionality implemented in the PIC24F processors (at least). The instruction "disi #N" will disable interrupts for the next N instructions. Issuing "disi #0" will cancel the effect of the previous "disi #N" instruction. This would be less costly hardware-wise than reserving an extra bit in the instruction word. Nesting blocks of disi's would be a nightmare, though. But when used carefully, disi would be quite handy for implementing atomic operations.
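A toy model of the idea, for a soft core rather than for the PIC24 itself: a countdown register holds off any pending interrupt until it reaches zero. It counts instructions as described above and is not a cycle-accurate model of Microchip's instruction; the program encoding and names are made up for this sketch.

```python
def run(program, irq_at):
    """Execute a list of ("op", name) / ("disi", n) steps.

    An interrupt request arrives just before step 'irq_at'. Returns the
    order in which operations and the interrupt handler ("ISR") ran.
    """
    trace = []
    disi_count = 0                # instructions left with interrupts held off
    pending = False
    for step, (kind, arg) in enumerate(program):
        if step == irq_at:
            pending = True
        if pending and disi_count == 0:
            trace.append("ISR")   # interrupt is taken between instructions
            pending = False
        if kind == "disi":
            disi_count = arg      # "disi 0" cancels an earlier window
        else:
            trace.append(arg)
            if disi_count > 0:
                disi_count -= 1
    if pending:
        trace.append("ISR")
    return trace
```

With a `("disi", 2)` step in front of a read/write pair, an interrupt that arrives during the pair is serviced only after the write; without it, the handler runs between the read and the write.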

AndyC_772 · « **Reply #24 on:** September 18, 2016, 05:58:16 pm »

It's a bit of a swine if you're writing code in a high level language, though, where you don't know the exact length of the object code in advance.

How many instructions do I need to disable interrupts for? Answers on a postcard...

legacy · « **Reply #25 on:** September 18, 2016, 06:05:24 pm »

Quote from: AndyC_772 on September 18, 2016, 05:58:16 pm

It's a bit of a swine if you're writing code in a high level language, though, where you don't know the exact length of the object code in advance.

How many instructions do I need to disable interrupts for? Answers on a postcard...

usually you need to disable interrupts in "critical sections", which are written in assembly
and here we go with that idea

you have the same problem if you have to set a bit in your instruction opcode
in order to make it "interrupt disabled during execution" (IDDE)
you need a pragma, or a way, to tell the compiler to set the IDDE-bit

AndyC_772 · « **Reply #26 on:** September 18, 2016, 06:24:17 pm »

I agree about the pragma.

A 'critical' section doesn't need to be in assembler, though. Something like a read / modify / write, or sending an unlock sequence to a peripheral, doesn't necessarily need to be fast but it does need to be uninterrupted.

The PIC24 architecture is particularly bad for needing specific assembly language instructions to achieve ordinary objectives. Simply reading from Flash requires 'table read' instructions to be used, which are included as built-in macros by the XC16 compiler, but the result is messy. It also means a function to, say, print a string located in RAM can't be the same as a function to print a string constant from Flash.

Writing to the register that governs the clock source is a pain too, because it requires a sequence of writes to be done not just one after the other, but actually on consecutive clock edges. This again is handled by a built-in macro, but it's a pointless restriction that doesn't really help anybody.

mac.6 · « **Reply #27 on:** September 18, 2016, 07:09:37 pm »

Turbo DMA: a DMA engine that applies a transfer function such as FFT/DCT while moving data between memory locations or from/to devices.

Kalvin · « **Reply #28 on:** September 19, 2016, 03:22:06 pm »

Quote from: AndyC_772 on September 18, 2016, 05:58:16 pm

It's a bit of a swine if you're writing code in a high level language, though, where you don't know the exact length of the object code in advance.

How many instructions do I need to disable interrupts for? Answers on a postcard...

For the PIC24F the parameter value of disi can be up to 0x3fff, i.e. 16384 cycles. So, one can use disi #0x3fff at the start of the block and disi #0 at the end of the block. If one needs to disable the interrupts for more than a few tens of cycles for the atomic operation, then one should consider an alternative solution to the problem.

