Author Topic: Open source FPU with 1 cycle performance  (Read 6147 times)

Offline ali_asadzadeh (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1905
  • Country: ca
Open source FPU with 1 cycle performance
« on: March 04, 2020, 11:30:55 am »
Hi,
I'm using the FPU100 core from OpenCores for now:
https://opencores.org/projects/fpu100
It's good for now, but I took a look at the ARM Cortex-M4 FPU: it can do most things in 1 or 2 clock cycles, and its divide takes about 14 clock cycles. What better open source options do we have? How much area and how many resources does ARM's FPU use?
I think that if it could be done with a few DSP blocks and fewer than 1000 LEs, it would be very useful.

What cores do you use for floating-point calculations? I know you may say that we don't do this in FPGAs and should use fixed point, but I just want to study it.
ASiDesigner, Stands for Application specific intelligent devices
I'm a Digital Expert from 8-bits to 64-bits
 

Offline TK

  • Super Contributor
  • ***
  • Posts: 1722
  • Country: us
  • I am a Systems Analyst who plays with Electronics
Re: Open source FPU with 1 cycle performance
« Reply #1 on: March 04, 2020, 12:01:18 pm »
Have you tried RISC-V?
 

Offline ali_asadzadeh (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1905
  • Country: ca
Re: Open source FPU with 1 cycle performance
« Reply #2 on: March 04, 2020, 12:16:20 pm »
Quote
Have you tried RISC-V?
Where can I get the HDL code?
 

Offline TK

  • Super Contributor
  • ***
  • Posts: 1722
  • Country: us
  • I am a Systems Analyst who plays with Electronics
Re: Open source FPU with 1 cycle performance
« Reply #3 on: March 04, 2020, 01:07:59 pm »
You can try riscv.org, sifive.com and https://github.com/chipsalliance/Cores-SweRV

More information about the RISC-V implementation at Western Digital: https://blog.westerndigital.com/risc-v-swerv-core-open-source/
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: Open source FPU with 1 cycle performance
« Reply #4 on: March 04, 2020, 05:40:29 pm »
I may be wrong, but I'm not sure the WD open source implementation contains any FPU thus far (https://github.com/chipsalliance/Cores-SweRV indeed).

Regarding SiFive's open source cores, I'm not sure either, but someone from SiFive (Bruce) is a regular here, so he'll be able to answer that point precisely. Even if there is indeed an FPU, keep in mind that SiFive's open source cores are written in Chisel (AFAIR) and not directly in a standard HDL. That may matter for the OP.

I personally don't know of any resource- and performance-optimized, robust implementation of a reusable FPU, and would certainly be interested in one as well.

I haven't taken a deep look at the PULP platform yet; maybe you'll find something interesting there: https://github.com/pulp-platform
 

Offline TK

  • Super Contributor
  • ***
  • Posts: 1722
  • Country: us
  • I am a Systems Analyst who plays with Electronics
Re: Open source FPU with 1 cycle performance
« Reply #5 on: March 04, 2020, 09:14:57 pm »
Sorry, I confused the M extension (multiply/divide) with the FPU... I am also not sure which RISC-V cores have implemented an FPU.
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2733
  • Country: ca
Re: Open source FPU with 1 cycle performance
« Reply #6 on: March 04, 2020, 09:22:13 pm »
Quote
Sorry, I confused the M extension (multiply/divide) with the FPU... I am also not sure which RISC-V cores have implemented an FPU.
The FPU is the "F" extension (single precision), possibly together with "D" (double precision), depending on how exactly you define an FPU (on x86 it supports multiple data types). Both are part of the "G" superset (IMAFD = G).
 
The following users thanked this post: TK

Offline ve7xen

  • Super Contributor
  • ***
  • Posts: 1193
  • Country: ca
    • VE7XEN Blog
Re: Open source FPU with 1 cycle performance
« Reply #7 on: March 04, 2020, 09:59:38 pm »
The F-Box FPU module from the Shakti RISC-V project might be interesting; I'm not sure how reusable it is, and it certainly isn't well documented (https://gitlab.com/shaktiproject/cores/fbox). It seems that the implementation in the C-class core is not derived from this, though, and is single-cycle, so I'm not sure what's up with that, but maybe something here is useful. It's all SystemVerilog.
73 de VE7XEN
He/Him
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2733
  • Country: ca
Re: Open source FPU with 1 cycle performance
« Reply #8 on: March 04, 2020, 10:21:51 pm »
Quote
The F-Box FPU module from the Shakti RISC-V project might be interesting; I'm not sure how reusable it is, and it certainly isn't well documented (https://gitlab.com/shaktiproject/cores/fbox). It seems that the implementation in the C-class core is not derived from this, though, and is single-cycle, so I'm not sure what's up with that, but maybe something here is useful. It's all SystemVerilog.
I would look for a performance metric rather than "single-cycle". What use does "single-cycle performance" have when that cycle needs to be at least 1 us?
Also, I think they use Bluespec, which is not exactly SystemVerilog.
« Last Edit: March 04, 2020, 10:41:30 pm by asmi »
 
The following users thanked this post: Someone, I wanted a rude username

Online coppice

  • Super Contributor
  • ***
  • Posts: 8646
  • Country: gb
Re: Open source FPU with 1 cycle performance
« Reply #9 on: March 04, 2020, 10:35:08 pm »
Quote
I would look for a performance metric rather than "single-cycle". What use does "single-cycle performance" have when that cycle needs to be at least 1 us?
Spot on. The reason the OpenCores FPU takes numerous cycles is that it's pipelined to keep the clock rate up (to 100 MHz in their test case). I didn't look at the details of their design, but if it's sensibly pipelined, it should be possible to start a new calculation every cycle, so the throughput should be respectable.

It's rare for the FPU in a processor to complete calculations in 1 or 2 cycles. It can typically start a new calculation every 1 or 2 cycles. The pipelining of the processor then masks the several cycles the calculation actually takes, so there may appear to be no latency to the programmer.
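
To put rough numbers on the latency-versus-throughput point, here is a minimal sketch in Python. The function names and the example figures (a 4-cycle unit, 1000 independent operations) are assumptions chosen for illustration; they are not measurements of FPU100 or of any real core.

```python
def cycles_pipelined(n_ops: int, latency: int, interval: int = 1) -> int:
    """Cycles to finish n_ops independent operations on a pipelined unit.

    latency  -- cycles from issuing one operation to its result
    interval -- cycles between successive issues (1 = fully pipelined)
    """
    if n_ops <= 0:
        return 0
    # The first result appears after `latency` cycles; each further
    # operation finishes `interval` cycles after the previous one.
    return latency + (n_ops - 1) * interval


def cycles_unpipelined(n_ops: int, latency: int) -> int:
    """Cycles when each operation must finish before the next one starts."""
    return max(n_ops, 0) * latency


if __name__ == "__main__":
    n, lat = 1000, 4  # hypothetical workload and latency
    print(cycles_pipelined(n, lat))    # 1003
    print(cycles_unpipelined(n, lat))  # 4000
```

For a long stream of independent operations the pipelined unit approaches one result per cycle, even though each individual result still takes `latency` cycles to appear.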
« Last Edit: March 04, 2020, 10:38:26 pm by coppice »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2803
  • Country: nz
Re: Open source FPU with 1 cycle performance
« Reply #10 on: March 05, 2020, 12:48:23 am »
Quote
I would look for a performance metric rather than "single-cycle". What use does "single-cycle performance" have when that cycle needs to be at least 1 us?
Spot on. The reason the OpenCores FPU takes numerous cycles is that it's pipelined to keep the clock rate up (to 100 MHz in their test case). I didn't look at the details of their design, but if it's sensibly pipelined, it should be possible to start a new calculation every cycle, so the throughput should be respectable.

If anyone is interested in how this dynamic instruction scheduling works but doesn't know where to start, have a read up on Tomasulo's algorithm.

If you try to implement it, you will find that it is surprisingly simple and effective.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4037
  • Country: nz
Re: Open source FPU with 1 cycle performance
« Reply #11 on: March 05, 2020, 01:09:48 am »
Quote
I may be wrong, but I'm not sure the WD open source implementation contains any FPU thus far (https://github.com/chipsalliance/Cores-SweRV indeed).

SweRV (all three versions to date) are RV32IMC. No FPU.

Quote
Regarding SiFive open source cores, I'm not sure either - but someone from SiFive (Bruce) is a regular here, so he'll be able to answer that point precisely. So even if there's indeed an FPU, keep in mind that SiFive's open source cores are written in Chisel (AFAIR) and not in a standard HDL directly. That may matter for the OP.

No longer SiFive, but familiar with existing and some unannounced products :-)

All SiFive cores to date use completely unmodified ALUs including FPU from the open-source Rocket implementation from UCB. The FPU is 4 cycle latency for add/sub/mul/fma with 1 operation per clock cycle throughput, fully pipelined. Denorms, underflow, and infinities do not affect the throughput or latency.

https://github.com/ucb-bar/berkeley-hardfloat/
 
The following users thanked this post: ali_asadzadeh, TK, SiliconWizard, I wanted a rude username

Offline ali_asadzadeh (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1905
  • Country: ca
Re: Open source FPU with 1 cycle performance
« Reply #12 on: March 05, 2020, 06:29:28 am »
Quote
I would look for a performance metric rather than "single-cycle". What use does "single-cycle performance" have when that cycle needs to be at least 1 us?
The ARM Cortex-M can do it; you can see it here in the M4 reference manual:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/BEHJADED.html
Also, we should see 1 GHz Cortex-M cores very soon. I know it can be pipelined and can have 1 or 2 clock cycles of latency, and that's still much better than the one I found on OpenCores.

If only we could see a cool, human-readable, high-performance FPU...  ^-^
 

Offline ali_asadzadeh (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1905
  • Country: ca
Re: Open source FPU with 1 cycle performance
« Reply #13 on: March 05, 2020, 06:32:23 am »
Quote
All SiFive cores to date use completely unmodified ALUs including FPU from the open-source Rocket implementation from UCB. The FPU is 4 cycle latency for add/sub/mul/fma with 1 operation per clock cycle throughput, fully pipelined. Denorms, underflow, and infinities do not affect the throughput or latency.

https://github.com/ucb-bar/berkeley-hardfloat/
That's interesting. How can it be retargeted for FPGAs like Xilinx or Gowin? It's written in Chisel |O
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4037
  • Country: nz
Re: Open source FPU with 1 cycle performance
« Reply #14 on: March 05, 2020, 06:42:38 am »
Quote
All SiFive cores to date use completely unmodified ALUs including FPU from the open-source Rocket implementation from UCB. The FPU is 4 cycle latency for add/sub/mul/fma with 1 operation per clock cycle throughput, fully pipelined. Denorms, underflow, and infinities do not affect the throughput or latency.

https://github.com/ucb-bar/berkeley-hardfloat/
That's interesting. How can it be retargeted for FPGAs like Xilinx or Gowin? It's written in Chisel |O

A big advantage of the Chisel / FIRRTL / Verilog pipeline is that you can optimize the same design for an SoC or optimize it for an FPGA.

You can also run the SoC-optimized Verilog on an FPGA, which is less efficient, but is better for validating your SoC design before you tape out and send it to a foundry.
 

Online coppice

  • Super Contributor
  • ***
  • Posts: 8646
  • Country: gb
Re: Open source FPU with 1 cycle performance
« Reply #15 on: March 05, 2020, 11:35:27 am »
Quote
All SiFive cores to date use completely unmodified ALUs including FPU from the open-source Rocket implementation from UCB. The FPU is 4 cycle latency for add/sub/mul/fma with 1 operation per clock cycle throughput, fully pipelined. Denorms, underflow, and infinities do not affect the throughput or latency.

https://github.com/ucb-bar/berkeley-hardfloat/
That sounds nice. Denorms have been a huge problem with many FPUs, when used for real time work.
 

Online coppice

  • Super Contributor
  • ***
  • Posts: 8646
  • Country: gb
Re: Open source FPU with 1 cycle performance
« Reply #16 on: March 05, 2020, 11:55:48 am »
Quote
I would look for a performance metric rather than "single-cycle". What use does "single-cycle performance" have when that cycle needs to be at least 1 us?
The ARM Cortex-M can do it; you can see it here in the M4 reference manual:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/BEHJADED.html
Also, we should see 1 GHz Cortex-M cores very soon. I know it can be pipelined and can have 1 or 2 clock cycles of latency, and that's still much better than the one I found on OpenCores.

If only we could see a cool, human-readable, high-performance FPU...  ^-^
It seems you are trying to develop hardware by plugging together black boxes, with no understanding of why each black box was built the way it is. This seldom ends well. The M4 is pipelined, so basically nothing really finishes in one cycle. Things have to flow through the pipeline to conclusion. Those lines in the instruction table that say 1 cycle mean you can start a new operation every cycle. It is the number of cycles needed to initiate an instruction, not the number needed to finish it. There are lines in that table which say something like "1 + P". This means the instruction takes 1 cycle plus the number of cycles needed for a pipeline refill, because those instructions inherently flush and restart the pipeline. If you write code that needs a result from an FPU instruction for another instruction, you may see a pipeline stall if those instructions are too close together.
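
The stall on back-to-back dependent instructions can be illustrated with a toy in-order issue model. This is only a sketch under simplifying assumptions: one issue per cycle, a single fixed result latency, and no modelling of forwarding or of the real M4 pipeline. The function `run_in_order`, the register names, and the 4-cycle latency are all hypothetical.

```python
def run_in_order(program, latency):
    """Return the cycle at which the last result becomes available.

    program -- list of (dest, sources) tuples, issued strictly in order
    latency -- cycles from issue until the result can be used
    One instruction may issue per cycle; an instruction stalls until
    every source register it reads has been produced.
    """
    ready_at = {}  # register name -> cycle its value becomes available
    cycle = 0      # earliest cycle the next instruction may issue
    finish = 0
    for dest, sources in program:
        # Registers never written by the program count as ready at cycle 0.
        issue = max([cycle] + [ready_at.get(s, 0) for s in sources])
        ready_at[dest] = issue + latency
        finish = max(finish, ready_at[dest])
        cycle = issue + 1
    return finish


if __name__ == "__main__":
    independent = [("f0", ("a", "b")), ("f1", ("c", "d")),
                   ("f2", ("e", "g")), ("f3", ("h", "i"))]
    chain = [("f0", ("a", "b")), ("f1", ("f0", "c")),
             ("f2", ("f1", "d")), ("f3", ("f2", "e"))]
    print(run_in_order(independent, 4))  # 7: a new issue every cycle
    print(run_in_order(chain, 4))        # 16: each op waits for the previous
```

Four independent operations overlap almost completely, while the dependent chain pays the full latency four times over.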

If you want a true 1 cycle FPU, the clock speed will be rather low. If you want the 1 GHz speed you referred to, you have to pipeline to break up the ripple-through time into smaller pieces. A high clock speed and single-cycle operation do NOT go hand in hand. They are in serious conflict with each other.
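
A back-of-the-envelope model of that conflict, as a sketch only: assume the FPU's combinational logic can be cut into equal stages and that every stage adds a fixed register overhead. The 40 ns logic delay and 1 ns overhead below are made-up numbers for illustration, not figures for any FPGA or core.

```python
def fmax_mhz(logic_delay_ns: float, stages: int, reg_overhead_ns: float) -> float:
    """Estimated maximum clock (MHz) when the logic is split into equal stages.

    logic_delay_ns  -- total ripple-through delay of the unpipelined logic
    stages          -- number of pipeline stages (1 = single-cycle)
    reg_overhead_ns -- per-stage register setup/clock-to-output overhead
    """
    if stages < 1:
        raise ValueError("stages must be >= 1")
    period_ns = logic_delay_ns / stages + reg_overhead_ns
    return 1000.0 / period_ns


if __name__ == "__main__":
    for stages in (1, 4, 8):
        # Hypothetical 40 ns of logic and 1 ns of register overhead.
        print(stages, round(fmax_mhz(40.0, stages, 1.0), 1))
    # 1 24.4
    # 4 90.9
    # 8 166.7
```

In this model the single-cycle version is stuck at a low clock, and each extra stage buys less than the one before because the register overhead does not shrink.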
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: Open source FPU with 1 cycle performance
« Reply #17 on: March 05, 2020, 02:38:43 pm »
Yeah, as the OP stated, they want that on FPGAs. Any decent FPU will inevitably be heavily pipelined to reach even moderate clock frequencies on FPGAs, and will thus have significant latency. Throughput can still be good, although depending on the FPU structure, you may get high throughput only on a series of similar operations (like a series of multiplies) and not on a mixed sequence of operations (such as a multiply, then a divide, then an add, and so on). For the latter, you'll need a more involved structure taking up significant resources.

 

Offline Scrts

  • Frequent Contributor
  • **
  • Posts: 797
  • Country: lt
Re: Open source FPU with 1 cycle performance
« Reply #18 on: March 05, 2020, 04:43:29 pm »
Is 1 cycle performance truly needed? I remember doing one design where performance wasn't needed, so in the end it was cheaper to buy the Nios II/f core with FP enabled and save on engineering time, rather than trying to learn and either write my own or take one off the internet and understand how it works...
 
The following users thanked this post: Someone

Offline ali_asadzadeh (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1905
  • Country: ca
Re: Open source FPU with 1 cycle performance
« Reply #19 on: March 06, 2020, 08:07:29 am »
Quote
Is 1 cycle performance truly needed?
Nope, but accepting a new instruction every cycle, low area, and high speed are needed. I want to study a few better open source alternatives so I can choose one or implement a better version for my own needs.
 

Offline ali_asadzadeh (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1905
  • Country: ca
Re: Open source FPU with 1 cycle performance
« Reply #20 on: March 06, 2020, 08:59:35 am »
I have downloaded the SiFive E76-MC core, which has a floating-point unit inside, but it's all garbage with no reusable code! |O Why do they even call it open source? Look at the code, it's not human readable!
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4037
  • Country: nz
Re: Open source FPU with 1 cycle performance
« Reply #21 on: March 06, 2020, 09:59:58 am »
Quote
I have downloaded the SiFive E76-MC core, which has a floating-point unit inside, but it's all garbage with no reusable code! |O Why do they even call it open source? Look at the code, it's not human readable!

The E76 is absolutely not an open source core! What gave you the impression that it is?

It's a high performance commercial core, licensed for a considerable amount of money, and you've got an evaluation copy of it.
 

Offline ali_asadzadeh (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1905
  • Country: ca
Re: Open source FPU with 1 cycle performance
« Reply #22 on: March 06, 2020, 12:04:22 pm »
Quote
The E76 is absolutely not an open source core! What gave you the impression that it is?

It's a high performance commercial core, licensed for a considerable amount of money, and you've got an evaluation copy of it.
Thanks for the info. Is there an open source Verilog or VHDL core with an FPU inside that you are aware of?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4037
  • Country: nz
Re: Open source FPU with 1 cycle performance
« Reply #23 on: March 06, 2020, 01:08:02 pm »
Quote
The E76 is absolutely not an open source core! What gave you the impression that it is?

It's a high performance commercial core, licensed for a considerable amount of money, and you've got an evaluation copy of it.
Thanks for the info. Is there an open source Verilog or VHDL core with an FPU inside that you are aware of?

I've already pointed you to the source code for the Berkeley FPU (which you'd find somewhere inside that obfuscated E76 evaluation core).

It's in Chisel, not Verilog or VHDL, but it shouldn't be hard for someone who knows one of those to convert it. The Chisel tools output Verilog, and by default it's relatively readable, since people look at waveforms and debug things at the Verilog level. It's reasonably easy to relate the names of things in the Verilog back to the names in the Chisel, in much the same way that you can look at the names of things in gdb and relate them back to the original C or C++ names.

There are many organisations and individuals creating RISC-V cores. Many use Chisel, but some do work directly in Verilog or VHDL.

There is a list here:

https://riscv.org/risc-v-cores/

Each entry lists the implementation language, and also the RISC-V flavour supported. If it's got an F, D, or G in the User Spec field then it's got an FPU (G is short for IMAFD).
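
As a rough illustration of that rule, the check can be sketched in a few lines of Python. The helper `has_fpu` is hypothetical, and it assumes the field holds a conventional ISA string such as RV32IMC or RV64GC; it only inspects the single-letter extensions and ignores anything after an underscore or a multi-letter Z/X name.

```python
def has_fpu(isa: str) -> bool:
    """True if a RISC-V ISA string such as 'RV32IMAFDC' implies an FPU."""
    s = isa.strip().upper()
    if not s.startswith("RV"):
        raise ValueError(f"not a RISC-V ISA string: {isa!r}")
    # Drop 'RV' and the register width digits (32, 64, 128).
    letters = s[2:].lstrip("0123456789")
    # Keep only the single-letter extensions: cut at the first '_' and at
    # the first 'Z' or 'X', which introduce multi-letter extension names.
    base = letters.split("_")[0]
    for marker in ("Z", "X"):
        base = base.split(marker)[0]
    # F = single precision, D = double precision, G = IMAFD.
    return any(ch in base for ch in "FDG")


if __name__ == "__main__":
    for isa in ("RV32IMC", "RV32IMAFDC", "RV64GC"):
        print(isa, has_fpu(isa))
    # RV32IMC False
    # RV32IMAFDC True
    # RV64GC True
```

With this rule, an RV32IMC core such as SweRV comes out as having no FPU, which matches what was said earlier in the thread.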

You might want to unselect the "SOC PLATFORMS" and "SOCS" filter buttons. You can also hit REFINE and choose a license acceptable to you.

I see several cores with BSD or MIT licenses and FPUs, but the ones I noticed are written in Chisel or Bluespec. Cores written in Verilog or VHDL seem in general to be simpler ones, without an FPU.
 
The following users thanked this post: ali_asadzadeh

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14476
  • Country: fr
Re: Open source FPU with 1 cycle performance
« Reply #24 on: March 06, 2020, 02:41:04 pm »
I already suggested that the OP take a look at the PULP platform.

Turns out the Ariane core does include an FPU. It's written in SystemVerilog: https://github.com/pulp-platform/ariane/tree/master/src
 
The following users thanked this post: ali_asadzadeh

