Author Topic: Very small linux capable core  (Read 9073 times)

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2728
  • Country: ca
Re: Very small linux capable core
« Reply #25 on: July 27, 2021, 02:53:16 pm »
https://libre-soc.org/3d_gpu/architecture/6600scoreboard/
Thanks! I skimmed through the link, and immediately noticed it's talking about a 5R4W reg file - which was like :o Multi-ported memory is the biggest problem in FPGAs, as BRAMs are only dual-port. More read ports can be added by duplicating BRAMs and writing to all of them at once, but this won't work if more than a single write port is required. If you serialize writes to the reg file from the execution units, you can get away with only a single write port, but in any case you will need a metric ton of read ports (two per computational EU, one for a load unit, two for a store unit), or a complex forwarding matrix (à la the CDB in classic Tomasulo). So such a core will be a BIG resource hog, which is why I'm curious to see if the advantages of OoO with speculative execution are going to be worth it.
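
As a rough illustration of the duplicated-BRAM trick described above, here is a minimal Python behavioral sketch (not HDL, and not taken from any core in this thread). The class name `ReplicatedRegFile` and its methods are hypothetical; each bank stands in for one dual-port BRAM used as one read port plus one shared write port.

```python
MASK64 = (1 << 64) - 1


class ReplicatedRegFile:
    """N-read, 1-write register file modeled as N identical memory copies.

    Assumption: each bank behaves like a simple dual-port BRAM
    (one read port, one write port). Names here are illustrative only.
    """

    def __init__(self, num_read_ports, num_regs=32):
        # One full copy of the register file per read port.
        self.banks = [[0] * num_regs for _ in range(num_read_ports)]

    def write(self, reg, value):
        # The single write port broadcasts to every copy so they stay in sync.
        if reg == 0:
            return  # x0 is hard-wired to zero in RISC-V
        for bank in self.banks:
            bank[reg] = value & MASK64

    def read(self, port, reg):
        # Each read port reads only from its own private copy.
        return self.banks[port][reg]


rf = ReplicatedRegFile(num_read_ports=5)
rf.write(3, 42)
print([rf.read(port, 3) for port in range(5)])
```

The storage cost grows linearly with the number of read ports, and the scheme offers no answer for a second simultaneous write, which is the limitation the post points out.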

I already have a RV64I core built as a "classic" pipeline running at a bit over 170 MHz on my Spartan-7 board, so it will be interesting to compare them.
« Last Edit: July 27, 2021, 02:56:29 pm by asmi »
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14230
  • Country: fr
Re: Very small linux capable core
« Reply #26 on: July 27, 2021, 04:06:48 pm »
I already have a RV64I core built as a "classic" pipeline running at a bit over 170 MHz on my Spartan-7 board, so it will be interesting to compare them.

Have you "benchmarked" it in some way? I'm curious.
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2728
  • Country: ca
Re: Very small linux capable core
« Reply #27 on: July 27, 2021, 06:30:34 pm »
Have you "benchmarked" it in some way? I'm curious.
Not really, it's a bit of a mess and could use some polishing (like adding dynamic branch prediction), but I plan to add that because I want it in the new core as well. I want to try an OoO approach to see if it's worth going that way, or whether for FPGA it's better to stick to the classic pipeline, before I invest a more significant amount of time into further development. BTW - that same core closes timing at 258 MHz on K325-2 of G2 board, which is quite a nice 50% frequency bump.

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14230
  • Country: fr
Re: Very small linux capable core
« Reply #28 on: July 27, 2021, 06:54:44 pm »
Have you "benchmarked" it in some way? I'm curious.
Not really, it's a bit of a mess and could use some polishing (like adding dynamic branch prediction), but I plan to add that because I want it in the new core as well. I want to try an OoO approach to see if it's worth going that way, or whether for FPGA it's better to stick to the classic pipeline, before I invest a more significant amount of time into further development. BTW - that same core closes timing at 258 MHz on K325-2 of G2 board, which is quite a nice 50% frequency bump.

Well, if you haven't benchmarked what you already have, I can only suppose it's not really ready for running any significant amount of code, in which case it's hard to tell how it will perform in real use. Or I'm not sure what you mean by "mess" and why you haven't tried some benchmarks just to see where you stand.

I'm not sure if going for something a lot more advanced makes much sense if you haven't completed the simpler design first, but that's just me.

Fmax is cool but tells you nothing about performance, and the whole point of OoO execution is performance. So yeah... I dunno. Just a thought, not to deter you. Just that I have been working on a RISC-V core myself and I know it took quite a while to get it fully debugged, with enough around the core (some peripherals, memory cache, etc) to get it to run "significant" code. So my experience just tells me that going for a variant with OoO before everything else was finalized wouldn't have been the best idea. Just a thought, of course I absolutely do not know what your own goals are and I don't mean to lecture you. ;D
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2728
  • Country: ca
Re: Very small linux capable core
« Reply #29 on: July 27, 2021, 08:35:33 pm »
Well, if you haven't benchmarked what you already have, I can only suppose it's not really ready for running any significant amount of code, in which case it's hard to tell how it will perform in real use. Or I'm not sure what you mean by "mess" and why you haven't tried some benchmarks just to see where you stand.

I'm not sure if going for something a lot more advanced makes much sense if you haven't completed the simpler design first, but that's just me.

Fmax is cool but tells you nothing about performance, and the whole point of OoO execution is performance. So yeah... I dunno. Just a thought, not to deter you. Just that I have been working on a RISC-V core myself and I know it took quite a while to get it fully debugged, with enough around the core (some peripherals, memory cache, etc) to get it to run "significant" code. So my experience just tells me that going for a variant with OoO before everything else was finalized wouldn't have been the best idea. Just a thought, of course I absolutely do not know what your own goals are and I don't mean to lecture you. ;D
There is a difference between validation and performance benchmarking. I've done the former, but not the latter. That means the core executes code correctly, and so it can successfully run any (valid) code.
I will see if I can run brucehoult's code when I have some spare time.

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 3973
  • Country: nz
Re: Very small linux capable core
« Reply #30 on: July 28, 2021, 12:22:29 am »
I already have a RV64I core built as a "classic" pipeline running at a bit over 170 MHz on my Spartan-7 board, so it will be interesting to compare them.

That's extremely fast for an FPGA core if it's using only 1 CPI (for non-load/branch). I'd be tempted to say "impossible" in an Artix-7. Is Spartan-7 *that* much better? I don't have experience with them.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14230
  • Country: fr
Re: Very small linux capable core
« Reply #31 on: July 28, 2021, 12:30:15 am »
I already have a RV64I core built as a "classic" pipeline running at a bit over 170 MHz on my Spartan-7 board, so it will be interesting to compare them.

That's extremely fast for an FPGA core if it's using only 1 CPI (for non-load/branch). I'd be tempted to say "impossible" in an Artix-7. Is Spartan-7 *that* much better? I don't have experience with them.

I doubt it can be anywhere close to 1 CPI, especially for a 64-bit core, but even for a 32-bit one, at this clock rate. Or if it is, tell us your secret.

Spartan-7 are stripped-down Artix-7 AFAIK.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 3973
  • Country: nz
Re: Very small linux capable core
« Reply #32 on: July 28, 2021, 12:52:38 am »
I want to try an OoO approach to see if it's worth going that way, or whether for FPGA it's better to stick to the classic pipeline, before I invest a more significant amount of time into further development.

Dual-issue in-order with both an "early" ALU stage and "late" ALU stage is far far simpler, less area, less energy than OoO but gets much of the performance benefit, at least on code that runs in L1 cache / SRAM.

Examples of this setup include the ARM A55, SiFive U74, and the Western Digital SweRV.

On my primes benchmark (https://hoult.org/primes.txt) the dual-issue in-order SiFive U74 in the HiFive Unmatched uses almost exactly the same number of clock cycles as the 3-issue 128 entry ROB OoO ARM A15 in the Beagle-X15 and Odroid XU4. I think the difference (including between X15 and XU4) comes down to different gcc versions.

The A57 is also 3 issue 128 entry ROB, and I think is pretty much just an Armv8 version of the A15, so is probably the same, but I don't have one to test.

The A72 in a Pi 4 uses 17.5% (A32) to 19.7% (A64) fewer clock cycles than the above.

In comparison, going from the single-issue U54 to the dual-issue U74 results in a 42.5% decrease in clock cycles.
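
To put the cycle-count percentages above on a common scale, a reduction in clock cycles can be converted to a speedup at equal clock frequency. This is only arithmetic on the figures quoted in the post; the function name is made up for illustration, and it says nothing about achievable clock rates.

```python
def speedup_from_cycle_reduction(reduction_percent):
    """Speedup at equal clock frequency for a given % reduction in cycles."""
    fraction_remaining = 1.0 - reduction_percent / 100.0
    return 1.0 / fraction_remaining


# Percentages are the ones quoted in the post above.
cases = [
    ("U54 -> U74 (single-issue to dual-issue)", 42.5),
    ("U74/A15 -> A72, A32 code", 17.5),
    ("U74/A15 -> A72, A64 code", 19.7),
]
for label, percent in cases:
    print(f"{label}: {speedup_from_cycle_reduction(percent):.2f}x")
```

Read this way, the step from single-issue to dual-issue in-order is worth roughly 1.7x per clock, while the further step to the A72 is worth roughly 1.2x.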
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2728
  • Country: ca
Re: Very small linux capable core
« Reply #33 on: July 28, 2021, 01:06:01 am »
That's extremely fast for an FPGA core if it's using only 1 CPI (for non-load/branch). I'd be tempted to say "impossible" in an Artix-7. Is Spartan-7 *that* much better? I don't have experience with them.
Spartan-7 and Artix-7 have the same fabric. The only difference is that there is a speed grade 3 for Artix, but not for Spartan, so technically Artix can be a bit faster. All operations are single-cycle and fully pipelined, but of course there are stalls due to branches/jumps, as well as data hazards because the core is only partially bypassed. Like I said above, I plan to eventually add dynamic branch prediction, as well as more bypassing. Also currently interrupts and exceptions are not implemented, however I don't think it's going to be that hard to add them.

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2728
  • Country: ca
Re: Very small linux capable core
« Reply #34 on: July 28, 2021, 01:13:28 am »
I doubt it can be anywhere close to 1 CPI, especially for a 64-bit core, but even for a 32-bit one, at this clock rate. Or if it is, tell us your secret.
There is no secret. If you throw enough hardware at the problem, you can achieve a lot. My core currently consumes 3424 LUTs and 1525 FFs, and I use post-route physical optimization to help close timing. See attachments for details.

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14230
  • Country: fr
Re: Very small linux capable core
« Reply #35 on: July 28, 2021, 01:30:17 am »
Alright. Of course it was all in what was meant by "1 CPI".

Your core is not fully bypassed as you said (and has no branch prediction). I'm curious, as I said, to see real performance in terms of average CPI on real code, because it's likely to be pretty far from 1. The fact most instructions independently only require one cycle is the best case, and I don't doubt you can reach 170 MHz for that.

I've done extensive simulations - and then real tests on FPGA - with various configurations, so I kind of know what to expect by now. You can have most instructions (except branches) individually 1 CPI, but the average CPI on any real code with no or little bypassing and no branch prediction is likely to be something between 2 and 3. So, I'm curious. :)

And, I don't claim my implementation is optimized in terms of Fmax either - but I do know cores with similar features rarely get above 100 MHz or so on this kind of FPGA. That's probably also why Bruce was surprised.
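
The point about Fmax versus average CPI can be made concrete with one line of arithmetic: instruction throughput is clock frequency divided by average CPI. The sketch below uses only numbers mentioned in the thread (170 MHz, an average CPI of 2 to 3, and a 100 MHz core for comparison); the function names are hypothetical and no measured results are implied.

```python
def mips(fmax_mhz, average_cpi):
    """Millions of instructions per second for a given clock and average CPI."""
    return fmax_mhz / average_cpi


def break_even_cpi(fast_mhz, fast_cpi, slow_mhz):
    """Average CPI a slower-clocked core needs to match a faster-clocked one."""
    return fast_cpi * slow_mhz / fast_mhz


for cpi in (1.0, 2.0, 3.0):
    print(f"170 MHz at CPI {cpi:.1f}: {mips(170, cpi):.1f} MIPS")

# A 100 MHz core matches a 170 MHz core running at CPI 2 if its own CPI
# is at or below this value.
print(f"break-even CPI at 100 MHz: {break_even_cpi(170, 2.0, 100):.2f}")
```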
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2728
  • Country: ca
Re: Very small linux capable core
« Reply #36 on: July 28, 2021, 02:23:18 am »
Alright. Of course it was all in what was meant by "1 CPI".

Your core is not fully bypassed as you said (and has no branch prediction). I'm curious, as I said, to see real performance in terms of average CPI on real code, because it's likely to be pretty far from 1. The fact most instructions independently only require one cycle is the best case, and I don't doubt you can reach 170 MHz for that.
In my case bypassing shouldn't change much in terms of frequency, because it's not on the critical path: I have a dedicated stage (read registers) which does just that - reads registers and passes the values along to the EX stage. The one bypass which IS on the critical path (from EX/MEM to RR/EX) is already implemented - this way I only have a 2:1 MUX in front of the ALU (which is on the critical path and limits Fmax now, specifically the shifter). The MUX takes its value either from the EX/MEM pipeline register (effectively the ALU output) or from the register value passed along from the RR stage, based on a flag that also comes from the RR stage. Another trick - I have separate ALUs for 64-bit and 32-bit operations. If I could find a way to speed up the 64-bit shifter, I could increase Fmax even further.
The same can be said about branch prediction - the fetch stage is very simple. The only complication there is that the memory bus is 64 bits wide while commands are 32 bits, so it has to split each 64-bit value into two commands, and it also has to deal with jump targets that are not aligned to a 64-bit boundary.
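
The fetch-side splitting described above might look roughly like the following behavioral sketch. It assumes little-endian memory, 4-byte-aligned 32-bit instructions only (no compressed extension), and a hypothetical helper name `split_fetch_word`; it is an illustration of the idea, not asmi's actual logic.

```python
MASK32 = 0xFFFFFFFF


def split_fetch_word(word64, pc):
    """Return the (address, instruction) pairs usable from one 64-bit fetch.

    Assumptions: little-endian layout, pc is 4-byte aligned, and every
    instruction is 32 bits wide.
    """
    base = pc & ~0x7                 # address of the 64-bit fetch word
    low = word64 & MASK32            # instruction at base
    high = (word64 >> 32) & MASK32   # instruction at base + 4
    if pc & 0x4:
        # A jump landed on the upper half: the lower instruction is discarded.
        return [(base + 4, high)]
    return [(base, low), (base + 4, high)]


word = (0x00000013 << 32) | 0x00A00093  # upper half: nop, lower half: addi x1, x0, 10
print(split_fetch_word(word, 0x100))
print(split_fetch_word(word, 0x104))
```

A jump target in the upper half of a fetch word yields only one usable instruction for that fetch, which is one source of the front-end bubbles mentioned in the thread.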

Online ali_asadzadeh (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1896
  • Country: ca
Re: Very small linux capable core
« Reply #37 on: July 28, 2021, 06:23:28 am »
asmi can the core be shared >:D
ASiDesigner, Stands for Application specific intelligent devices
I'm a Digital Expert from 8-bits to 64-bits
 

Offline hve

  • Contributor
  • Posts: 46
  • Country: nl
Re: Very small linux capable core
« Reply #38 on: July 28, 2021, 09:01:50 am »
Not sure if it is already mentioned:

Assuming you have external memory attached.. :)

https://github.com/litex-hub/linux-on-litex-vexriscv

 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2728
  • Country: ca
Re: Very small linux capable core
« Reply #39 on: July 28, 2021, 12:55:30 pm »
asmi can the core be shared >:D
It will be open source, but right now it's not ready for prime time yet. Also it's optimized for frequency and performance, and not resource utilization.
 
The following users thanked this post: ali_asadzadeh

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14230
  • Country: fr
Re: Very small linux capable core
« Reply #40 on: July 28, 2021, 04:27:10 pm »
Thanks for sharing your ideas.
In the Gowin I have about 12K LUTs, and I also have 32 MB of internal SDRAM. I want to compile and use this https://libiec61850.com/libiec61850/ on minimal Linux, I hope I can do it. ^-^

As can be read on the website, this library can be ported for any custom HAL, so you don't need Linux. https://libiec61850.com/libiec61850/documentation/
You could probably use it on a baremetal system writing your own HAL support. Not requiring to run Linux will open many more options for what you want to achieve. Just a thought. You may have other reasons for running on Linux, but we don't know.
 

Online ali_asadzadeh (Topic starter)

  • Super Contributor
  • ***
  • Posts: 1896
  • Country: ca
Re: Very small linux capable core
« Reply #41 on: July 28, 2021, 05:51:11 pm »
Quote
It will be open source, but right now it's not ready for prime time yet. Also it's optimized for frequency and performance, and not resource utilization.
That's great :-+
Guys, what are your thoughts on LatticeMico32 and ZipCPU?
I think at least the zipcpu blog is just a gold mine >:D >:D >:D >:D
ASiDesigner, Stands for Application specific intelligent devices
I'm a Digital Expert from 8-bits to 64-bits
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 3973
  • Country: nz
Re: Very small linux capable core
« Reply #42 on: July 28, 2021, 08:53:44 pm »
Thanks for sharing your ideas.
In the Gowin I have about 12K LUTs, and I also have 32 MB of internal SDRAM. I want to compile and use this https://libiec61850.com/libiec61850/ on minimal Linux, I hope I can do it. ^-^

As can be read on the website, this library can be ported for any custom HAL, so you don't need Linux. https://libiec61850.com/libiec61850/documentation/
You could probably use it on a baremetal system writing your own HAL support. Not requiring to run Linux will open many more options for what you want to achieve. Just a thought. You may have other reasons for running on Linux, but we don't know.

Yeah, I already covered that in:

https://www.eevblog.com/forum/fpga/very-small-linux-capable-core/msg3614765/#msg3614765

Quote
It should be easy to create a HAL for FreeRTOS or Zephyr for that project.

https://www.eevblog.com/forum/fpga/very-small-linux-capable-core/msg3614778/#msg3614778

Quote
According to the link you sent, it needs only time, threads, TCP/IP sockets, ethernet, and a filesystem. The mentioned RTOSes have all those.
 

Offline dolbeau

  • Regular Contributor
  • *
  • Posts: 86
  • Country: fr
Re: Very small linux capable core
« Reply #43 on: August 04, 2021, 12:41:21 pm »
Now, it's unclear to me whether VexRiscv is really ready to be directly used for running Linux at the moment outside of pure simulation. Quoting VexRiscv's Readme:
Quote
There is currently no SoC to run it on hardware, it is WIP. But the CPU simulation can already boot linux and run user space applications (even python).
Maybe brucehoult can confirm it is possible already, but what the maintainers say sounds confusing to me.

VexRiscv will run Linux in FPGA (not yet ASIC that I know of) with no issue, both in the SpinalHDL Saxon SoC and in the Migen Litex SoC. For this second case, the entire process is scripted in Linux-on-Litex-VexRiscv, complete with SMP support, optional C, and optional FPU (can be shared by 1 to 4+ cores). The Litex SoC requires sufficient memory (at least 32 MiB, preferably more) and supports optional Ethernet, micro-SD, framebuffer, USB OHCI host, and you may even get SATA if you feel adventurous.

I run a 4-core RV32GCBK VexRiscv Litex SoC on a Qmtech Wukong board (Artix-7 100T). It works fine, including recompiling enough X11 by itself to run Doom.
« Last Edit: August 04, 2021, 12:45:13 pm by dolbeau »
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14230
  • Country: fr
Re: Very small linux capable core
« Reply #44 on: August 04, 2021, 03:56:16 pm »
Now, it's unclear to me whether VexRiscv is really ready to be directly used for running Linux at the moment outside of pure simulation. Quoting VexRiscv's Readme:
Quote
There is currently no SoC to run it on hardware, it is WIP. But the CPU simulation can already boot linux and run user space applications (even python).
Maybe brucehoult can confirm it is possible already, but what the maintainers say sounds confusing to me.

VexRiscv will run Linux in FPGA (not yet ASIC that I know of) with no issue, both in the SpinalHDL Saxon SoC and in the Migen Litex SoC. For this second case, the entire process is scripted in Linux-on-Litex-VexRiscv, complete with SMP support, optional C, and optional FPU (can be shared by 1 to 4+ cores). The Litex SoC requires sufficient memory (at least 32 MiB, preferably more) and supports optional Ethernet, micro-SD, framebuffer, USB OHCI host, and you may even get SATA if you feel adventurous.

I run a 4-core RV32GCBK VexRiscv Litex SoC on a Qmtech Wukong board (Artix-7 100T). It works fine, including recompiling enough X11 by itself to run Doom.

OK, thanks for the feedback. At what frequency do the cores run?
 

Offline dolbeau

  • Regular Contributor
  • *
  • Posts: 86
  • Country: fr
Re: Very small linux capable core
« Reply #45 on: August 04, 2021, 06:08:38 pm »
OK, thanks for the feedback. At what frequency do the cores run?

100 MHz with no issue on the speed grade 2 Artix-7. It's mostly limited by the decode path I think (there's a tuning in VexRiscv to relieve some pressure related to C which is enabled, and all the B and K instructions add a lot of stuff to the decoder).

IIRC, the 'vanilla' VexRiscv in default config (RV32IMA) reached at least 120 MHz on that board, I don't think I tried to push beyond that.
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2728
  • Country: ca
Re: Very small linux capable core
« Reply #46 on: August 04, 2021, 08:02:22 pm »
100 MHz with no issue on the speed grade 2 Artix-7. It's mostly limited by the decode path I think (there's a tuning in VexRiscv to relieve some pressure related to C which is enabled, and all the B and K instructions add a lot of stuff to the decoder).
Yep, this is why in my core I've split decoding and register reading into separate stages. Otherwise it was on the critical path. Now execute/alu64 is on the critical path, and there is nothing I can do about it, and I don't want to make it multicycle. That is part of the reason why I want to try Tomasulo out of order approach - this way I can make shifts (which is the critical path inside alu) multicycle without affecting other ALU operations. I've also added full bypassing, and as I expected, it did not cause any timing problems.
« Last Edit: August 04, 2021, 08:18:19 pm by asmi »
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 3973
  • Country: nz
Re: Very small linux capable core
« Reply #47 on: August 05, 2021, 12:59:20 am »
Yep, this is why in my core I've split decoding and register reading into separate stages. Otherwise it was on the critical path. Now execute/alu64 is on the critical path, and there is nothing I can do about it, and I don't want to make it multicycle. That is part of the reason why I want to try Tomasulo out of order approach - this way I can make shifts (which is the critical path inside alu) multicycle without affecting other ALU operations. I've also added full bypassing, and as I expected, it did not cause any timing problems.

You can of course easily split the six layers of a logarithmic 64 bit shifter into two pipeline stages with three shift layers in each stage.
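
For readers unfamiliar with the structure, here is a minimal Python model of a six-layer logarithmic left shifter split into two groups of three layers, as suggested above. It is a behavioral sketch only; the function names are made up, and the comment marks where a pipeline register would sit in hardware.

```python
MASK64 = (1 << 64) - 1


def shift_layers(value, shamt, bit_positions):
    """Apply the shifter layers controlled by the given shift-amount bits."""
    for bit in bit_positions:
        if (shamt >> bit) & 1:
            # Layer `bit` shifts by 2**bit when its control bit is set.
            value = (value << (1 << bit)) & MASK64
    return value


def sll64_two_stage(value, shamt):
    """64-bit logical left shift done as two groups of three layers."""
    shamt &= 63
    stage1 = shift_layers(value & MASK64, shamt, (0, 1, 2))  # shifts by 1, 2, 4
    # In hardware, a pipeline register would hold stage1 (and shamt) here.
    stage2 = shift_layers(stage1, shamt, (3, 4, 5))          # shifts by 8, 16, 32
    return stage2


print(hex(sll64_two_stage(0x0123456789ABCDEF, 12)))
```

Right shifts follow the same layering; each layer is a 2:1 mux per bit, which is why three layers per stage roughly halves the logic depth of the shifter.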

The problem is this almost certainly isn't going to let you double the clock frequency, as something else will become the critical path long before that. That means that while every other instruction will now go a little faster, shifts will now be slower. If the clock speed improvement is small (under 5%?) it might not be a win overall. Shifts are pretty common in some types of code, and with the base ISA (no B extension) there is a regrettable number of back-to-back shifts to do things such as zero-extending a 32 bit unsigned value to 64 bits whenever someone has foolishly used an unsigned int (32 bits) as a loop index (etc.) and then uses it to index an array. People who use size_t or similar for their loop indexes are fine, as are people who use "int".

Coremark is an unfortunate offender here. They have gone out of their way to typedef critical variables as "unsigned int" because this produces better code than "int" on 64 bit ARM CPUs. But it de-optimises 64 bit RISC-V. For some time, RISC-V people were changing the typedef, but the benchmark owner made a ruling this is not permitted.

The RISC-V B extension adds some instructions which do 64 bit computations after implicitly zero-extending one of the operands, which eliminates this problem. And also instructions to sign-extend and zero-extend 8, 16, and 32 bit quantities to 64 bits with a single instruction instead of a pair of shifts.
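
The "pair of shifts" idiom mentioned above can be modeled in a few lines. The helpers below mimic the behavior of the RV64I shift instructions slli, srli and srai on 64-bit values; the wrapper names `zext32_by_shifts` and `sext8_by_shifts` are hypothetical and exist only for this sketch.

```python
MASK64 = (1 << 64) - 1


def slli(x, n):
    """Shift left logical, 64-bit."""
    return (x << n) & MASK64


def srli(x, n):
    """Shift right logical, 64-bit."""
    return (x & MASK64) >> n


def srai(x, n):
    """Shift right arithmetic, 64-bit (sign bit is replicated)."""
    x &= MASK64
    if x >> 63:
        x -= 1 << 64  # reinterpret as a negative Python int
    return (x >> n) & MASK64


def zext32_by_shifts(x):
    """Zero-extend the low 32 bits using two dependent shifts."""
    return srli(slli(x, 32), 32)


def sext8_by_shifts(x):
    """Sign-extend the low 8 bits using two dependent shifts."""
    return srai(slli(x, 56), 56)


print(hex(zext32_by_shifts(0xFFFFFFFF80000000)))
print(hex(sext8_by_shifts(0x80)))
```

The second shift depends on the result of the first, so a core with a two-cycle shifter pays the extra latency twice on every such extension.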

But anyway -- it would be very interesting to know what becomes the critical path after you split shifts into 2 stages, and how much frequency increase you can then get.
 

Offline asmi

  • Super Contributor
  • ***
  • Posts: 2728
  • Country: ca
Re: Very small linux capable core
« Reply #48 on: August 05, 2021, 02:27:53 am »
But anyway -- it would be very interesting to know what becomes the critical path after you split shifts into 2 stages, and how much frequency increase you can then get.
I think jump/branch handling is the next in line, but I'm not sure about it yet. Currently I have too much latency on the front end (PC to fetch), so that's the area I need to focus on, as well as dynamic branch prediction to reduce this latency. There were some design decisions made in this area which I question now, so we'll see about that. I also plan to eventually add support for an external memory bus, and this will basically require implementing caches.

When I started this design, my target was to hit 150 MHz and some reasonable CPI (anything below 2 I consider acceptable). Right now it's about 4 running very branch-heavy code (loop for a few million cycles, toggle GPIO, repeat), but that's also an ideal target for branch prediction to drastically improve CPI. There are also some memory operations, but there is little I can do about speeding them up. For timing reasons I had to add an additional register after the BRAM output, so the data memory latency is currently 2. The good thing is that I still have a healthy amount of headroom as far as frequency goes. Another good thing is that the addition of full bypassing indeed brought the CPI of compute-heavy linear code close to 1, so that once I add branch prediction and reduce instruction fetch latency, I can get a very good overall CPI. Right now I only have early detection of unconditional jumps, so it's possible to hand-optimize code with branches to achieve a very good CPI.

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 3973
  • Country: nz
Re: Very small linux capable core
« Reply #49 on: August 05, 2021, 03:25:12 am »
SRAM/cache latency of 2 is very common in RISC-V pipelines, for example in all the Rocket-based cores such as SiFive's original 3- and 5-series. In fact they take 3 cycles for sub-word loads. But they don't stall unless the value is actually used by the immediately following instruction.
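
The "only stall if the next instruction uses the loaded value" behavior can be sketched as a simple load-use check. This is a simplified model under stated assumptions: an in-order single-issue pipeline, a made-up `Instr` record, and no interaction between overlapping stalls. It is not a description of Rocket's actual interlock logic.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                 # "load" or anything else
    rd: Optional[int]       # destination register, None if there is none
    srcs: Tuple[int, ...]   # source registers


def count_load_use_stalls(program, load_latency=2):
    """Count stall cycles caused by consuming a load result too early."""
    stalls = 0
    for i, instr in enumerate(program):
        if instr.op != "load" or instr.rd in (None, 0):
            continue
        # Only instructions closer than the load latency can be affected.
        for distance in range(1, load_latency):
            j = i + distance
            if j >= len(program):
                break
            if instr.rd in program[j].srcs:
                stalls += load_latency - distance
                break
            if program[j].rd == instr.rd:
                break  # loaded value is overwritten before anyone reads it
    return stalls


dependent = [Instr("load", 5, (10,)), Instr("add", 6, (5, 7))]
scheduled = [Instr("load", 5, (10,)), Instr("add", 8, (1, 2)), Instr("add", 6, (5, 7))]
print(count_load_use_stalls(dependent), count_load_use_stalls(scheduled))
```

With a latency of 2, placing one independent instruction after the load hides the delay entirely, which is the kind of scheduling a compiler can do.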

 

