Is this a commercial development that's intended for a particular project, and must achieve some price / performance / feature combination that's not available with off-the-shelf silicon? Or is it an academic or hobby exercise?
Personal project, pure hobby, not academic; it's for training purposes
(maybe also for research, who knows).
If it's the former, then obviously the feature set you need to implement will be dictated by its end use.
At the beginning I was just playing with an FPGA. The ALU had no multiplier and no divider, so I decided to add them, and I am still having trouble making them faster than they currently are.
I mean, the multiplier used to take 32 clock ticks and now takes just 2,
whereas the divider still takes 35 clock ticks.
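For illustration only, here is a minimal Python model of a bit-serial restoring divider, which is the usual reason a divider costs about one clock tick per quotient bit. The 32-bit unsigned width, the restoring algorithm, and the name `restoring_divide` are all assumptions made for this sketch; it does not describe the actual divider in the SoC, and any extra ticks for setup or sign handling are not modelled.

```python
def restoring_divide(dividend, divisor, width=32):
    """Unsigned restoring division; one loop pass stands for one clock tick.

    Hypothetical model: width and algorithm are assumptions for illustration.
    Returns (quotient, remainder, ticks).
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    mask = (1 << width) - 1
    remainder = 0
    quotient = dividend & mask
    ticks = 0
    for _ in range(width):
        # Shift the {remainder, quotient} register pair left by one bit.
        remainder = (remainder << 1) | (quotient >> (width - 1))
        quotient = (quotient << 1) & mask
        # Trial subtraction: keep the result only if it does not go negative.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
        ticks += 1
    return quotient, remainder, ticks


if __name__ == "__main__":
    print(restoring_divide(100, 7))
```

Because each quotient bit depends on the remainder left by the previous step, the loop cannot be collapsed as easily as a multiplier's partial products, which is one plausible reason the divider lags behind.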
Anyway, when I wanted to use my SoC in a real application, I discovered that I needed trigonometric functions (cosine and sine) in hardware, so I added a CORDIC unit (circular domain only).
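As a rough sketch of what a circular-domain CORDIC does in rotation mode, here is a floating-point Python model. The function name `cordic_sin_cos` and the iteration count are assumptions for this example; a hardware unit would use fixed-point registers, shifts instead of the multiplications by 2^-i, and a small ROM for the arctangent table.

```python
import math


def cordic_sin_cos(angle, iterations=32):
    """Circular CORDIC, rotation mode. angle in radians, |angle| <= pi/2.

    Floating-point illustration only; returns (cos(angle), sin(angle)).
    """
    # Elementary rotation angles atan(2^-i), normally stored in a ROM.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector; compensate once up front.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0  # rotate so that z is driven to zero
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x, y


if __name__ == "__main__":
    print(cordic_sin_cos(math.pi / 8))
```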
Then I discovered that I also need hyperbolic functions (hyperbolic cosine and hyperbolic sine). A CORDIC unit can calculate them, as well as the exponential, but I wanted to implement the BKM (1) algorithm because it also provides 2D rotation of a complex number by a real angle.
and here we are
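To make the motivation concrete: a single complex exponential covers sine/cosine, the hyperbolic pair, and 2D rotation. The sketch below uses Python's `cmath.exp` as a floating-point reference, not the BKM iteration itself; the helper names `cexp_reference`, `rotate` and `cosh_sinh` are made up for this example.

```python
import cmath
import math


def cexp_reference(re, im):
    """exp(re + j*im) as (real, imag); with re == 0 this is (cos(im), sin(im))."""
    z = cmath.exp(complex(re, im))
    return z.real, z.imag


def rotate(x, y, theta):
    """Rotate the point (x, y) by the real angle theta: multiply by exp(j*theta)."""
    w = complex(x, y) * cmath.exp(1j * theta)
    return w.real, w.imag


def cosh_sinh(t):
    """Hyperbolic pair obtained from two real exponentials."""
    e_pos, _ = cexp_reference(t, 0.0)
    e_neg, _ = cexp_reference(-t, 0.0)
    return (e_pos + e_neg) / 2.0, (e_pos - e_neg) / 2.0


if __name__ == "__main__":
    print(cexp_reference(0.0, math.pi / 6))
    print(rotate(1.0, 0.0, math.pi / 2))
    print(cosh_sinh(1.0))
```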
(1) Currently I am working on a modified version of BKM called "Zeda". According to the paper, the original BKM algorithm does not converge as fast as expected, and the complex sign function (derived from the Robertson diagram) is too critical when ported to fixed point.
******************************************************************************
test15: "trig(pi/4)"
******************************************************************************
stimulus(+0.000000,+0.392699)
expected(+0.923880,+0.382683) <--- correct value
computed0(+0.923880,+0.382683) diff(+0.000000,-0.000000) zeda_fp_cexp: success
computed1(+0.928537,+0.371240) diff(+0.004658,-0.011444) mbkm_fp_cexp: failure
computed2(+0.923880,+0.382683) diff(+0.000000,-0.000000) cordic_fx_sc: success
computed3(+0.923880,+0.382683) diff(+0.000000,-0.000000) zeda_fx_cexp: success
******************************************************************************
test16: "trig(pi/6)"
******************************************************************************
stimulus(+0.000000,+0.523599)
expected(+0.866025,+0.500000) <--- correct value
computed0(+0.866025,+0.500000) diff(+0.000000,+0.000000) zeda_fp_cexp: success
computed1(+0.854114,+0.520086) diff(-0.011911,+0.020086) mbkm_fp_cexp: failure
computed2(+0.866025,+0.500000) diff(+0.000000,+0.000000) cordic_fx_sc: success
computed3(+0.866025,+0.500000) diff(+0.000000,-0.000000) zeda_fx_cexp: success
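The pass/fail decision in a log like the one above can be sketched as a per-component tolerance check. This is only an illustration: the function name `check` and the tolerance of 1e-5 are assumptions, since the actual threshold used by the testbench is not stated here.

```python
def check(expected, computed, tol=1e-5):
    """Compare (real, imag) pairs component-wise.

    Returns (diff, verdict), where diff = computed - expected.
    tol is a hypothetical threshold chosen for this sketch.
    """
    diff = tuple(c - e for c, e in zip(computed, expected))
    ok = all(abs(d) <= tol for d in diff)
    return diff, ("success" if ok else "failure")


if __name__ == "__main__":
    expected = (0.923880, 0.382683)
    # Rounded values copied from the test15 lines of the log.
    for name, computed in [
        ("zeda_fp_cexp", (0.923880, 0.382683)),
        ("mbkm_fp_cexp", (0.928537, 0.371240)),
    ]:
        diff, verdict = check(expected, computed)
        print(f"{name}: diff({diff[0]:+.6f},{diff[1]:+.6f}) {verdict}")
```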
If it's the latter, then the idea that you "need" to implement any particular feature at all depends only on whether or not you want to implement it. If something interests you, or will be a useful learning exercise, then go right ahead and do it, and if it doesn't, then don't.
exactly
If you're after ideas in either case: one thing I find tends to be done badly in many microprocessors has nothing to do with the processing itself; it's getting data in and out of the CPU in the first place. Make sure you include an interface that will readily connect to an external data source or sink on your FPGA, e.g. some kind of parallel interface with a reasonable clock rate and word length, and a simple scheme for flow control and addressing.
Ah well, I have already implemented a *debug engine*. It talks over the serial port at 115200 bps (it can go up to 1 Mbps), and I am going to add a much faster Cypress USB interface at 20 MB/s.
The debug engine uses a protocol that I developed to take full control of the datapath: I can read/write registers, read/write devices, bypass the CPU, inject opcodes, program the external asynchronous static RAM, etc.
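Purely as an illustration of how such a command could be framed on a serial link, here is a hypothetical layout: one opcode byte, a 32-bit address, a 32-bit value, and a one-byte additive checksum. None of this is the real protocol; the opcode values, field widths, checksum, and the names `build_frame` and `parse_frame` are invented for this sketch.

```python
import struct

# Hypothetical opcodes; the real debug protocol is not documented here.
OP_READ_REG = 0x01
OP_WRITE_REG = 0x02

FRAME_LEN = 10  # 1 opcode + 4 address + 4 value + 1 checksum


def build_frame(opcode, address, value=0):
    """Pack a command as big-endian opcode/address/value plus a checksum byte."""
    body = struct.pack(">BII", opcode, address, value)
    checksum = sum(body) & 0xFF
    return body + bytes([checksum])


def parse_frame(frame):
    """Validate length and checksum, then return (opcode, address, value)."""
    if len(frame) != FRAME_LEN:
        raise ValueError("unexpected frame length")
    body, checksum = frame[:-1], frame[-1]
    if sum(body) & 0xFF != checksum:
        raise ValueError("bad checksum")
    return struct.unpack(">BII", body)


if __name__ == "__main__":
    frame = build_frame(OP_WRITE_REG, 0x10, 0xDEADBEEF)
    print(frame.hex(), parse_frame(frame))
```

A fixed-length frame keeps the receiving state machine in the FPGA simple, at the cost of sending an unused value field for read commands.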
On the host side, the debug engine talks to an interface written in C that follows a client-server model: the server is attached to the serial port and talks to a client. Different clients are possible, including one with a comfortable shell that can be used interactively and can accept scripts.
BTW, before loading the bitstream into a real FPGA, I simulate everything with GHDL. I have a lot of test entities in my testbench for the CORDIC unit; the BKM unit comes with even more test entities because it's more complex.