EEVblog Electronics Community Forum

Electronics => FPGA => Topic started by: ali_asadzadeh on March 04, 2020, 11:30:55 am

Title: Open source FPU with 1 cycle performance
Post by: ali_asadzadeh on March 04, 2020, 11:30:55 am
Hi,
I'm using the FPU100 core from OpenCores for now:
https://opencores.org/projects/fpu100
It's good for now, but I took a look at the ARM Cortex-M4 FPU: it can do most things in 1 or 2 clock cycles, and its divide takes about 14 clock cycles. What better open source options do we have? How much area and how many resources does ARM's FPU use?
I think if we could do it with a few DSP blocks and fewer than 1000 LEs, it would be very useful.

What cores do you use for floating-point calculations? I know you may say that we don't do that in an FPGA and should use fixed point, but I just want to study it.
Title: Re: Open source FPU with 1 cycle performance
Post by: TK on March 04, 2020, 12:01:18 pm
Have you tried RISC-V?
Title: Re: Open source FPU with 1 cycle performance
Post by: ali_asadzadeh on March 04, 2020, 12:16:20 pm
Quote
Have you tried RISC-V?
Where can I get the HDL code?
Title: Re: Open source FPU with 1 cycle performance
Post by: TK on March 04, 2020, 01:07:59 pm
You can try riscv.org (http://riscv.org), sifive.com (http://sifive.com) and https://github.com/chipsalliance/Cores-SweRV

More information about the RISC-V implementation at Western Digital: https://blog.westerndigital.com/risc-v-swerv-core-open-source/
Title: Re: Open source FPU with 1 cycle performance
Post by: SiliconWizard on March 04, 2020, 05:40:29 pm
I may be wrong, but I'm not sure the WD open source implementation contains any FPU thus far (https://github.com/chipsalliance/Cores-SweRV).

Regarding SiFive open source cores, I'm not sure either - but someone from SiFive (Bruce) is a regular here, so he'll be able to answer that point precisely. So even if there's indeed an FPU, keep in mind that SiFive's open source cores are written in Chisel (AFAIR) and not in a standard HDL directly. That may matter for the OP.

I personally don't know of any resource- and performance-optimized, robust implementation of a reusable FPU, and I would certainly be interested in one as well.

I haven't taken a deep look yet at the PULP platform, maybe you'll find something interesting there: https://github.com/pulp-platform
Title: Re: Open source FPU with 1 cycle performance
Post by: TK on March 04, 2020, 09:14:57 pm
Sorry, I confused the M extension (multiply/divide) with the FPU... I'm also not sure which RISC-V cores have implemented an FPU.
Title: Re: Open source FPU with 1 cycle performance
Post by: asmi on March 04, 2020, 09:22:13 pm
Quote
Sorry, I confused the M extension (multiply/divide) with the FPU... I'm also not sure which RISC-V cores have implemented an FPU.
The FPU is the "F" extension, possibly with "D", depending on how exactly you define FPU (on x86 it supports multiple data types). Both are part of the "G" superset (G = IMAFD).
Title: Re: Open source FPU with 1 cycle performance
Post by: ve7xen on March 04, 2020, 09:59:38 pm
The F-Box FPU module from the Shakti RISC-V project might be interesting, though I'm not sure how reusable it is and it certainly isn't well documented: https://gitlab.com/shaktiproject/cores/fbox . It seems that the implementation in the C-class core (https://gitlab.com/shaktiproject/cores/c-class) is not derived from it, though, and is single-cycle, so I'm not sure what's up with that, but maybe something here is useful. It's all SystemVerilog.
Title: Re: Open source FPU with 1 cycle performance
Post by: asmi on March 04, 2020, 10:21:51 pm
Quote
F-Box FPU module from Shakti RISC-V project might be interesting, not sure how reusable it is and certainly isn't well documented: https://gitlab.com/shaktiproject/cores/fbox .It seems that the implementations in the C-class core (https://gitlab.com/shaktiproject/cores/c-class) is not derived from this though, and is single-cycle, so not sure what's up with that but maybe something here is useful. It's all SystemVerilog.
I would look for a performance metric rather than single-cycle operation. What use is "single-cycle performance" when that cycle needs to be at least 1 us?
Also, I think they use Bluespec, which is not exactly SystemVerilog.
Title: Re: Open source FPU with 1 cycle performance
Post by: coppice on March 04, 2020, 10:35:08 pm
Quote
I would look for a performance metric rather than single-cycle operation. What use is "single-cycle performance" when that cycle needs to be at least 1 us?
Spot on. The reason the opencores FPU takes numerous cycles is that it's pipelined to keep the clock rate up (to 100MHz in their test case). I didn't look at the details of their design, but if it's sensibly pipelined it should be possible to start a new calculation every cycle, so the throughput should be respectable.

It's rare for the FPU in a processor to complete calculations in 1 or 2 cycles. They can typically start new calculations every 1 or 2 cycles. The pipelining of the processor then masks the several cycles the calculation actually takes, so there may appear to be no latency to the programmer.
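
To put rough numbers on the difference between latency and throughput, here is a minimal Python sketch. The function name, the 4-cycle latency and the operation count are illustrative assumptions, not figures taken from the opencores FPU.

```python
def total_cycles(n_ops: int, latency: int, initiation_interval: int = 1) -> int:
    """Cycles until the last of n_ops independent operations completes.

    latency: cycles from issue to result for a single operation.
    initiation_interval: cycles between successive issues (1 = fully pipelined).
    """
    if n_ops == 0:
        return 0
    # The last operation issues after (n_ops - 1) intervals, then needs `latency` cycles.
    return (n_ops - 1) * initiation_interval + latency


# Hypothetical unit with 4-cycle latency, pipelined vs. not pipelined.
pipelined = total_cycles(1000, latency=4, initiation_interval=1)
unpipelined = total_cycles(1000, latency=4, initiation_interval=4)
print(pipelined, unpipelined)  # 1003 4000
```

With these assumed numbers both units have the same latency, but the pipelined one finishes 1000 independent operations in roughly a quarter of the cycles.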
Title: Re: Open source FPU with 1 cycle performance
Post by: hamster_nz on March 05, 2020, 12:48:23 am
Quote
I would look for a performance metric rather than a single-cycle. What use does "single-cycle performance" has when that cycle needs to be at least a 1 us?
Spot on. The reason the opencores FPU takes numerous cycles is its pipelined to keep the clock rate up (to 100MHz in their test case). I didn't look at the details of their design, but if its sensibly pipelined it should be possible to start a new calculation every cycle, so the throughput should be respectable.

If anyone is interested in how this dynamic instruction scheduling works but doesn't know where to start, have a read up on Tomasulo's algorithm.

If you try to implement it you will find that it is surprisingly simple and effective.
Title: Re: Open source FPU with 1 cycle performance
Post by: brucehoult on March 05, 2020, 01:09:48 am
Quote
I may be wrong, but I'm not sure the WD open source implementation contains any FPU this far ( https://github.com/chipsalliance/Cores-SweRV indeed )

SweRV (all three versions to date) are RV32IMC. No FPU.

Quote
Regarding SiFive open source cores, I'm not sure either - but someone from SiFive (Bruce) is a regular here, so he'll be able to answer that point precisely. So even if there's indeed an FPU, keep in mind that SiFive's open source cores are written in Chisel (AFAIR) and not in a standard HDL directly. That may matter for the OP.

No longer SiFive, but familiar with existing and some unannounced products :-)

All SiFive cores to date use completely unmodified ALUs including FPU from the open-source Rocket implementation from UCB. The FPU is 4 cycle latency for add/sub/mul/fma with 1 operation per clock cycle throughput, fully pipelined. Denorms, underflow, and infinities do not affect the throughput or latency.

https://github.com/ucb-bar/berkeley-hardfloat/
Title: Re: Open source FPU with 1 cycle performance
Post by: ali_asadzadeh on March 05, 2020, 06:29:28 am
Quote
I would look for a performance metric rather than a single-cycle. What use does "single-cycle performance" have when that cycle needs to be at least a 1 us?
The ARM Cortex-M can do it; you can see it here in the M4 reference manual:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/BEHJADED.html
Also, we should see 1GHz Cortex-M cores very soon. I know it can be pipelined and can have 1 or 2 clock cycles of latency, and that's still much better than the one I found on opencores.

If only we could see a cool, human-readable, high-performance FPU... ^-^
Title: Re: Open source FPU with 1 cycle performance
Post by: ali_asadzadeh on March 05, 2020, 06:32:23 am
Quote
All SiFive cores to date use completely unmodified ALUs including FPU from the open-source Rocket implementation from UCB. The FPU is 4 cycle latency for add/sub/mul/fma with 1 operation per clock cycle throughput, fully pipelined. Denorms, underflow, and infinities do not affect the throughput or latency.

https://github.com/ucb-bar/berkeley-hardfloat/
That's interesting. How can it be retargeted for FPGAs like Xilinx or Gowin? It's written in Chisel |O
Title: Re: Open source FPU with 1 cycle performance
Post by: brucehoult on March 05, 2020, 06:42:38 am
Quote
All SiFive cores to date use completely unmodified ALUs including FPU from the open-source Rocket implementation from UCB. The FPU is 4 cycle latency for add/sub/mul/fma with 1 operation per clock cycle throughput, fully pipelined. Denorms, underflow, and infinities do not affect the throughput or latency.

https://github.com/ucb-bar/berkeley-hardfloat/
That's interesting. How can it be retargeted for FPGAs like Xilinx or Gowin? It's written in Chisel |O

A big advantage of the Chisel / FIRRTL / Verilog pipeline is that you can optimize the same design for SoC or optimize it for FPGA.

You can also run the SoC-optimized Verilog on an FPGA, which is less efficient, but is better for validating your SoC design before you tape out and send it to a foundry.
Title: Re: Open source FPU with 1 cycle performance
Post by: coppice on March 05, 2020, 11:35:27 am
Quote
All SiFive cores to date use completely unmodified ALUs including FPU from the open-source Rocket implementation from UCB. The FPU is 4 cycle latency for add/sub/mul/fma with 1 operation per clock cycle throughput, fully pipelined. Denorms, underflow, and infinities do not affect the throughput or latency.

https://github.com/ucb-bar/berkeley-hardfloat/
That sounds nice. Denorms have been a huge problem with many FPUs, when used for real time work.
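
For anyone unsure what a denorm looks like, here is a small Python sketch. It assumes IEEE 754 single precision, and the halving loop is just an illustrative stand-in for something like a decaying filter tail; the function name is made up for this example.

```python
import struct


def is_subnormal_f32(x: float) -> bool:
    """True if x, rounded to IEEE 754 single precision, is a denormal (subnormal)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF   # 8-bit biased exponent field
    mantissa = bits & 0x7FFFFF       # 23-bit fraction field
    # Denormals have an all-zero exponent field and a non-zero fraction.
    return exponent == 0 and mantissa != 0


# A value that keeps halving eventually drops below the smallest normal float32 (2**-126).
x, steps = 1.0, 0
while not is_subnormal_f32(x):
    x *= 0.5
    steps += 1
print(steps)  # 127
```

On FPUs that handle such values through a slow path, every operation after that point can cost far more than usual, which is why a constant-latency design matters for real time work.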
Title: Re: Open source FPU with 1 cycle performance
Post by: coppice on March 05, 2020, 11:55:48 am
Quote
I would look for a performance metric rather than a single-cycle. What use does "single-cycle performance" have when that cycle needs to be at least a 1 us?
ARM Cortex M can do it, and you can see it here in M4 reference manual,
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/BEHJADED.html
Also we should see 1GHz Cortex M cores very very soon. I know it can be pipe-lined and can have 1 or 2 clock cycle latency, and that's also very better than the one that I have found in opencores.

If we could see some cool and human readable high performance FPU  ....  ^-^ ^-^ ^-^
It seems you are trying to develop hardware by plugging together black boxes, with no understanding of why each black box was built the way it is. This seldom ends well. The M4 is pipelined, so basically nothing really finishes in one cycle. Things have to flow through the pipeline to completion. Those lines in the instruction table that say 1 cycle mean you can start a new operation every cycle. It is the number of cycles needed to issue an instruction, not the number of cycles it takes to finish. There are lines in that table which say something like "1 + P". This means the instruction takes 1 cycle plus the number of cycles needed for a pipeline refill, because those instructions inherently flush and restart the pipeline. If you write code that needs a result from an FPU instruction for another instruction, you may see a pipeline stall if those instructions are too close together.
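
The stall on dependent instructions can be illustrated with a toy in-order issue model in Python. This is only a sketch: the 3-cycle result latency and the one-issue-per-cycle rule are assumptions for illustration, not M4 figures.

```python
def issue_cycles(deps, latency):
    """Return the cycle in which each instruction issues.

    deps[i] is the index of the earlier instruction whose result instruction i
    needs, or None if it is independent. At most one instruction issues per
    cycle, in program order, and an instruction waits until its operand is ready.
    """
    issue = []
    for dep in deps:
        earliest = issue[-1] + 1 if issue else 0
        if dep is not None:
            # The operand is only available `latency` cycles after its producer issued.
            earliest = max(earliest, issue[dep] + latency)
        issue.append(earliest)
    return issue


# Back-to-back dependent operations stall...
print(issue_cycles([None, 0], latency=3))              # [0, 3]
# ...while independent work placed in between hides the latency.
print(issue_cycles([None, None, None, 0], latency=3))  # [0, 1, 2, 3]
```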

If you want a true 1 cycle FPU the clock speed will be rather low. If you want the 1GHz speed you referred to, you have to pipeline to break up the ripple-through time into smaller pieces. A high clock speed and single cycle operation do NOT go hand in hand. They are in serious conflict with each other.
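
As a back-of-envelope illustration of that conflict, the Python sketch below splits a fixed amount of logic delay evenly across pipeline stages. The 40 ns of logic and 1 ns of register overhead are made-up numbers, and real designs never split this evenly.

```python
def fmax_mhz(logic_delay_ns: float, stages: int, reg_overhead_ns: float) -> float:
    """Rough maximum clock for logic split evenly into `stages` pipeline stages."""
    period_ns = logic_delay_ns / stages + reg_overhead_ns
    return 1000.0 / period_ns


# Assumed figures: 40 ns of combinational delay end to end, 1 ns per register stage.
for stages in (1, 4, 8):
    f = fmax_mhz(40.0, stages, 1.0)
    latency_ns = stages * 1000.0 / f
    print(f"{stages} stage(s): {f:6.1f} MHz, latency {latency_ns:.0f} ns")
```

With these assumptions the single-cycle version tops out at about 24 MHz, while 8 stages reach about 167 MHz at the cost of a slightly longer total latency in nanoseconds.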
Title: Re: Open source FPU with 1 cycle performance
Post by: SiliconWizard on March 05, 2020, 02:38:43 pm
Yeah, as the OP stated, they want that on FPGAs. Any decent FPU will inevitably be heavily pipelined to reach even moderate clock frequencies on FPGAs, and will thus have significant latency. Throughput can still be good, although depending on the FPU structure you may get high throughput only on a run of similar operations (such as a series of multiplies) and not on a mix of operations (multiply, then divide, then add, and so on). For the latter, you'll need a more involved structure that takes up significant resources.

Title: Re: Open source FPU with 1 cycle performance
Post by: Scrts on March 05, 2020, 04:43:29 pm
Is 1 cycle performance truly needed? I remember doing one design where performance wasn't needed, so in the end it was cheaper to buy a Nios II/f core with FP enabled and save on engineering time than to learn and write my own, or take one off the internet and work out how it operates...
Title: Re: Open source FPU with 1 cycle performance
Post by: ali_asadzadeh on March 06, 2020, 08:07:29 am
Quote
Is 1 cycle performance truly needed?
Nope, but accepting a new instruction every cycle, low area and high speed are needed. I want to study a few of the better open source alternatives, so I can choose one or implement a better version for my own needs.
Title: Re: Open source FPU with 1 cycle performance
Post by: ali_asadzadeh on March 06, 2020, 08:59:35 am
I have downloaded the SiFive E76-MC core, which has floating point inside, but it's all garbage with no reusable code! |O Why do they even call it open source? Look at the code, it's not human readable!
Title: Re: Open source FPU with 1 cycle performance
Post by: brucehoult on March 06, 2020, 09:59:58 am
Quote
I have downloaded SiFive E76-MC  core, which have a floating point inside, But it's all garbage and no reusable code! |O why they even name it open source! look at he codes, it's not human readable!

The E76 is absolutely not an open source core! What gave you the impression that it is?

It's a high performance commercial core, licensed for a considerable amount of money, and you've got an evaluation copy of it.
Title: Re: Open source FPU with 1 cycle performance
Post by: ali_asadzadeh on March 06, 2020, 12:04:22 pm
Quote
The E76 is absolutely not an open source core! What gave you the impression that it is?

It's a high performance commercial core, licensed for a considerable amount of money, and you've got an evaluation copy of it.
Thanks for the info. Is there an open source Verilog or VHDL core with an FPU inside that you are aware of?
Title: Re: Open source FPU with 1 cycle performance
Post by: brucehoult on March 06, 2020, 01:08:02 pm
Quote
The E76 is absolutely not an open source core! What gave you the impression that it is?

It's a high performance commercial core, licensed for a considerable amount of money, and you've got an evaluation copy of it.
Thanks for the info, then is there a verilog or VHDL open source with FPU inside that you are aware of?

I've already pointed you to the source code for the Berkeley FPU (which you'd find somewhere inside that obfuscated E76 evaluation core).

It's in Chisel, not Verilog or VHDL, but it shouldn't be hard for someone who knows one of those to convert it. The Chisel tools output Verilog, which is by default relatively readable because people look at waveforms and debug at the Verilog level. It's reasonably easy to relate names in the Verilog back to the names in the Chisel, in much the same way that you can look at names in gdb and relate them back to the original C or C++ names.

There are many organisations and individuals creating RISC-V cores. Many use Chisel, but some do work directly in Verilog or VHDL.

There is a list here:

https://riscv.org/risc-v-cores/

Each entry lists the implementation language, and also the RISC-V flavour supported. If it's got an F, D, or G in the User Spec field then it's got an FPU (G is short for IMAFD).

You might want to unselect the "SOC PLATFORMS" and "SOCS" filter buttons. You can also hit REFINE and choose a license acceptable to you.

I see several cores with BSD or MIT licenses and FPUs, but the ones I noticed are written in Chisel or Bluespec. Cores written in Verilog or VHDL seem in general to be simpler ones, without an FPU.
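
The rule of thumb above for reading the User Spec field (an F, D, or G means there is an FPU) can be sketched in Python. The function name is hypothetical and the handling of multi-letter extensions is deliberately simplified: everything from the first Z, S or X onward, or after an underscore, is ignored.

```python
import re


def has_fpu(isa: str) -> bool:
    """True if a RISC-V ISA string such as 'RV32IMC' or 'RV64GC' implies an FPU."""
    s = isa.upper()
    m = re.fullmatch(r"RV(32|64|128)([A-Z]+)(_.*)?", s)
    if m is None:
        raise ValueError(f"not a recognised RISC-V ISA string: {isa!r}")
    # Keep only the single-letter extensions; multi-letter ones start with Z, S or X.
    base = re.split(r"[ZSX]", m.group(2), maxsplit=1)[0]
    # F = single precision, D = double precision, G = shorthand for IMAFD.
    return any(letter in base for letter in "FDG")


print(has_fpu("RV32IMC"))  # False (the SweRV configuration mentioned earlier)
print(has_fpu("RV32IMF"))  # True
print(has_fpu("RV64GC"))   # True
```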
Title: Re: Open source FPU with 1 cycle performance
Post by: SiliconWizard on March 06, 2020, 02:41:04 pm
I already suggested that the OP take a look at the PULP platform.

Turns out the Ariane core does include an FPU. It's written in SystemVerilog: https://github.com/pulp-platform/ariane/tree/master/src
Title: Re: Open source FPU with 1 cycle performance
Post by: ali_asadzadeh on March 06, 2020, 06:33:04 pm
Thanks brucehoult,

Quote
I've already pointed you to the source code for the Berkeley FPU (which you'd find somewhere inside that obfuscated E76 evaluation core).
Do you have any idea how many resources it uses and what speed it can achieve on an Artix-7 device?
Title: Re: Open source FPU with 1 cycle performance
Post by: brucehoult on March 06, 2020, 08:52:28 pm
Quote
Thanks brucehoult,

Quote
I've already pointed you to the source code for the Berkeley FPU (which you'd find somewhere inside that obfuscated E76 evaluation core).
Do you have any idea on How much resource and speed can it achieve on an ARTIX 7 device?

You might find this paper interesting: https://hal.archives-ouvertes.fr/hal-02303453/document

They give figures for a number of RISC-V cores. For a Rocket implementing RV32IMF they get 76 MHz on an Artix 7 and 8132 LUT, 3094 FF. I'm pretty sure I've seen Rocket-based cores running at 100 MHz on an Artix 7 but that might be without FPU.
Title: Re: Open source FPU with 1 cycle performance
Post by: ali_asadzadeh on July 11, 2020, 12:07:37 pm
Does anybody know how to get Verilog from the Chisel source?

https://github.com/ucb-bar/berkeley-hardfloat/

I want to generate Verilog from this Berkeley repo. There is also a build.sbt in the repo.
Title: Re: Open source FPU with 1 cycle performance
Post by: ali_asadzadeh on December 10, 2020, 01:41:27 pm
I have no idea how to compile Chisel to Verilog. If anyone knows how, please point it out for the berkeley-hardfloat repo on GitHub so we can take a look at it.
Also, I have found this useful open-source FPU. It doesn't accept data on every cycle, which is what I wanted, but it has low area and good speed: it can reach 150MHz on Gowin.

https://github.com/dawsonjon/fpu
Title: Re: Open source FPU with 1 cycle performance
Post by: brucehoult on December 10, 2020, 09:23:42 pm
Quote
No Idea on how to compile chisel to Verilog, if anyone knows something please point it out for the berkeley-hardfloat on GitHub
So we can take a look at it, also I have found this useful open-source FPU, it's not what I wanted to accept data on every cycle, But it has low area and good speed, it can reach 150MHz on Gowin.

https://github.com/dawsonjon/fpu

You generate Verilog from Chisel by ... running Chisel. That's its purpose.

Well, these days Chisel itself generates FIRRTL (Flexible Intermediate Representation for RTL). You then run various optimizations on the FIRRTL (including such things as optimizing for SoC or for FPGA), and finally you convert the FIRRTL to Verilog; I think VHDL is also an option.

You can also, I believe, convert Verilog or VHDL to FIRRTL for optimization.
Title: Re: Open source FPU with 1 cycle performance
Post by: ali_asadzadeh on December 10, 2020, 10:22:18 pm
Thanks bruce, do you recommend any tutorials on Chisel?
Title: Re: Open source FPU with 1 cycle performance
Post by: brucehoult on December 10, 2020, 11:30:08 pm
Quote
Thanks bruce, do you recommend any tutorials on Chisel?

I used to work with the people who made it (and continue to enhance it) but I've never used it myself.

A quick Google search finds tutorials at UC Berkeley, and this one seems reasonable as a quick start:

https://www.instructables.com/Getting-Started-With-Chisel/

But I just found that in 30 seconds so I expect you've already looked at that and more in the last five months since your initial messages.