Author Topic: Soft CPU Design on Zedboard: What will it take to get any embedded os running?  (Read 623 times)


Offline Synesthesiac

  • Newbie
  • Posts: 4
  • Country: us
Hey guys,

I'm taking on the challenge of getting a better job without getting a master's degree just yet by doing some projects. Using Computer Organization and Design and Computer Architecture: A Quantitative Approach as references, I've already built a semi-decent RV64IM implementation with an experimental cache and am currently doing some superscalar work, but eventually I want to take it to the next level and actually run something more than, say, a bubble sort. I'm used to writing ARM firmware for the cores in the Zedboard and using that to interface with the PL, but how should I approach supporting something, especially a simple embedded OS, with my own soft CPU instead of the ARM cores? I've taken Operating Systems and have used embedded Linux and FreeRTOS on the Zedboard in the past, but I just don't know how much work I'm giving myself here with my own messy hardware :P.

It's probably a big project with maybe a lot of consideration for MMU and cache design, but I definitely want this to be one of my goals. Any tips or recommendations?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 1525
  • Country: us
  • Formerly SiFive, Samsung R&D
Wow, nice work on getting an RV64IM core working.

As far as I know, typical embedded OSes (RTOS) such as FreeRTOS and Zephyr don't use (or certainly don't need) an MMU, and I don't suppose a cache will do much if you're running off "ROM" and RAM in the FPGA itself.

If you're planning to run Linux and use the DRAM on the board then that's a different matter. I'd love to know how to do that too.

Superscalar is hard! I guess dual issue in order can be worth it but otherwise multiple small cores can be better.

I'm hoping to get some FPGA skills and follow in your footsteps in the coming months. I bought both Zynq and non-Zynq Arty boards a couple of years ago but didn't get a chance to use them while employed :-(
 

Online asmi

  • Super Contributor
  • ***
  • Posts: 1069
  • Country: ca
Using DRAM is fairly easy on Xilinx devices because they provide free controller IP that exposes an industry-standard AXI4 slave interface port (an AXI3 slave in the case of Zynq devices and their hard IP controller). So your CPU will only need to implement an AXI4 master port, which is very easy (much easier than a slave one).

But to use it efficiently you will need a cache, because without one the latency would kill performance. That is typical for DDRx memories, as they are designed for bandwidth at the expense of latency. This is why MIG (Memory Interface Generator, the IP provided by Xilinx for creating "soft" memory controllers) typically has a very wide data bus (up to 512 bits, depending on how wide the physical memory bus is, as well as some other parameters). The controller supports bursts of up to 256 beats to better utilize the available bandwidth, which is ideal for cache-line-long transactions.
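To put numbers on the burst idea, here is a rough Python sketch of how a cache-line refill maps onto AXI read-burst fields. The function name, the dictionary keys and the example line size are made up for illustration. It relies on the AXI rules that AxLEN encodes beats minus one, AxSIZE encodes log2 of the bytes per beat, a burst may not cross a 4 KiB boundary, and INCR bursts are limited to 256 beats on AXI4 (16 on AXI3).

```python
AXI_4K_BOUNDARY = 4096  # a single burst must not cross a 4 KiB boundary


def burst_for_cache_line(addr, line_bytes, bus_bits, max_beats=256):
    """Return illustrative AXI read-burst fields for one cache-line refill.

    max_beats is 256 for AXI4 INCR bursts; pass 16 for an AXI3 port.
    """
    bytes_per_beat = bus_bits // 8
    if bytes_per_beat == 0 or bytes_per_beat & (bytes_per_beat - 1):
        raise ValueError("bus width must be a power-of-two number of bytes")
    if line_bytes % bytes_per_beat:
        raise ValueError("line size must be a multiple of the bus width")
    if addr % line_bytes:
        raise ValueError("this sketch assumes line-aligned refills")
    beats = line_bytes // bytes_per_beat
    if beats > max_beats:
        raise ValueError("line does not fit in a single burst")
    last = addr + line_bytes - 1
    if addr // AXI_4K_BOUNDARY != last // AXI_4K_BOUNDARY:
        raise ValueError("burst would cross a 4 KiB boundary")
    return {
        "araddr": addr,
        "arlen": beats - 1,                         # AxLEN = beats - 1
        "arsize": bytes_per_beat.bit_length() - 1,  # AxSIZE = log2(bytes/beat)
        "arburst": 0b01,                            # INCR burst type
        "beats": beats,
    }


if __name__ == "__main__":
    # A 64-byte line over a 128-bit port takes 4 beats; over 512 bits, just 1.
    print(burst_for_cache_line(0x80000040, 64, 128))
    print(burst_for_cache_line(0x80000040, 64, 512))
```

The wider the port, the fewer beats a line refill needs, which is the practical reason the very wide MIG data bus pairs well with a cache.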

There is the System Cache IP to help implement higher-level caches and ensure system-wide coherency in a multi-master environment, but you will probably want to implement the L1I/L1D caches yourself.
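As a starting point for thinking about an L1, here is a minimal Python model of a direct-mapped cache that only tracks tags and counts hits and misses. The class name, the 64-line by 64-byte geometry and the direct-mapped organisation are assumptions picked for illustration, not anything prescribed by the Xilinx IP or by the OP's design.

```python
class DirectMappedCache:
    """Tag-only model of a direct-mapped cache (no data, no write policy)."""

    def __init__(self, num_lines=64, line_bytes=64):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines  # None means the line is invalid
        self.hits = 0
        self.misses = 0

    def split(self, addr):
        """Split a byte address into (tag, index, offset)."""
        offset = addr % self.line_bytes
        index = (addr // self.line_bytes) % self.num_lines
        tag = addr // (self.line_bytes * self.num_lines)
        return tag, index, offset

    def access(self, addr):
        """Record one access; return True on a hit, False on a miss."""
        tag, index, _ = self.split(addr)
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag  # refill the line on a miss
        self.misses += 1
        return False


if __name__ == "__main__":
    cache = DirectMappedCache()
    for addr in range(0, 256, 8):  # 32 sequential 8-byte loads
        cache.access(addr)
    print(cache.hits, cache.misses)  # 28 hits, 4 misses (one per line)
```

Sequential accesses miss once per line and then hit, so each trip to DRAM is amortised over a whole line. Two addresses exactly one cache size apart (4096 bytes here) map to the same index and evict each other, which is the usual argument for adding associativity later.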

I'm working on RV64 core as well :D
« Last Edit: May 08, 2020, 05:47:33 pm by asmi »
 

Offline 0db

  • Regular Contributor
  • *
  • Posts: 184
  • Country: zm
superscalar ... how to properly test it?
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 7304
  • Country: us
A core without an assembler or C compiler is just about worthless.  Furthermore, the amount of effort required to port GCC to some arbitrary architecture may exceed my "best used by" date.

Assuming you have a C compiler, porting something like FreeRTOS shouldn't be terribly difficult.

I suppose it depends on what you mean by "OS".  A non-realtime OS like MS-DOS or CP/M would be fairly easy once the sector allocation was understood.  At least there would be some user interactivity.

Porting Linux is a much bigger job and, as I understand it, uClinux isn't being continued.  It had the advantage of not needing an MMU.

https://elinux.org/images/b/bb/Uclinux.pdf

Porting Unix V6 may be easy because the annotated source is available but it would require an MMU:

https://en.wikipedia.org/wiki/Lions%27_Commentary_on_UNIX_6th_Edition,_with_Source_Code

The MMU wouldn't need to be terribly sophisticated; Unix V6 ran on a PDP-11.  In fact, I have 2.11BSD running on a SIMH simulator on a Raspberry Pi.  The project is known as the PiDP-11.

https://obsolescence.wixsite.com/obsolescence/pidp-11

The point being, Unix is pretty easy to port and doesn't suffer from having a GUI.

But, again, it depends on what is meant by "OS" and it almost certainly needs an assembler/compiler.

 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 1525
  • Country: us
  • Formerly SiFive, Samsung R&D
A core without an assembler or C compiler is just about worthless.  Furthermore, the amount of effort required to port GCC to some arbitrary architecture may exceed my "best used by" date.

He said his core implements RV64IM, which gcc already knows how to generate code for. FreeRTOS and Zephyr already support RISC-V and at least FreeRTOS only needs RV32I/RV64I. Linux generally needs RV64GC.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 5330
  • Country: fr
A core without an assembler or C compiler is just about worthless.  Furthermore, the amount of effort required to port GCC to some arbitrary architecture may exceed my "best used by" date.

He said his core implements RV64IM, which gcc already knows how to generate code for. FreeRTOS and Zephyr already support RISC-V and at least FreeRTOS only needs RV32I/RV64I. Linux generally needs RV64GC.

Yup.

From experience, I'd now say going up to having a Linux OS working on a "homebuilt" core would be a gigantic challenge. Don't underestimate this. Especially for a one-person project.

Writing a full RV64GC core with no bugs plus a working MMU to be able to run Linux successfully - good luck. SiFive did it. I'm sure the team was a bit more staffed than 1 person.

For some simple "embedded OS" as the OP said, as long as you have a working core, a compiler and the basic required peripherals, that should be alright. Still a pretty ambitious endeavor IMO.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 1525
  • Country: us
  • Formerly SiFive, Samsung R&D
Writing a full RV64GC core with no bugs plus a working MMU to be able to run Linux successfully - good luck. SiFive did it. I'm sure the team was a bit more staffed than 1 person.

Berkeley's open source "Rocket" core already ran Linux before SiFive was founded.

The FE-310 and FU-540 are both based on Rocket, but with additional features e.g. in the FU-540 having five CPU cores (with two different ISA flavours), a coherent L2 cache, an SoC with onboard ethernet and DDR controller. There was also work iterating the Privileged ISA to 1.10 and defining a new interrupt controller architecture. And updating the Linux kernel and preparing it for upstreaming. And much work on binutils and gcc and other toolchain stuff.

Even with the Rocket head start SiFive had around 25 people by the time the FU-540 taped out. Many, many more people were added to create the superscalar (dual issue, in order) 7-series and Out of Order 8-series. Those are serious endeavours.
 
The following users thanked this post: SiliconWizard

Online asmi

  • Super Contributor
  • ***
  • Posts: 1069
  • Country: ca
Writing a full RV64GC core with no bugs plus a working MMU to be able to run Linux successfully - good luck. SiFive did it. I'm sure the team was a bit more staffed than 1 person.
I think you grossly overestimate the complexity of implementing RV in a general sense. I will give it to you that it will be hard to make it fast, but if speed is not your requirement (and in a lot of hobby cases it isn't) then you can do it all quite quickly. From my own experience, 80+% of all the complexity is in pipelining and dealing with all the resulting issues, as well as HW-specific optimization to reach max performance, not in the actual behavioral modeling.
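To illustrate how small the purely behavioral side can be, here is a hedged Python sketch of a single-step model covering just three RV64I instructions (ADDI, ADD, SUB). The function names and the choice of instructions are illustrative only; a real core also needs loads, stores, branches, CSRs and traps, and this says nothing about pipelining.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def step(regs, insn):
    """Execute one 32-bit instruction word against a 32-entry register list."""
    opcode = insn & 0x7F
    rd = (insn >> 7) & 0x1F
    funct3 = (insn >> 12) & 0x7
    rs1 = (insn >> 15) & 0x1F
    rs2 = (insn >> 20) & 0x1F
    funct7 = (insn >> 25) & 0x7F

    if opcode == 0x13 and funct3 == 0:                       # ADDI
        result = regs[rs1] + sign_extend(insn >> 20, 12)
    elif opcode == 0x33 and funct3 == 0 and funct7 == 0x00:  # ADD
        result = regs[rs1] + regs[rs2]
    elif opcode == 0x33 and funct3 == 0 and funct7 == 0x20:  # SUB
        result = regs[rs1] - regs[rs2]
    else:
        raise NotImplementedError(hex(insn))

    if rd != 0:                      # x0 is hard-wired to zero
        regs[rd] = result & MASK64   # wrap to XLEN = 64 bits


if __name__ == "__main__":
    regs = [0] * 32
    step(regs, 0x00500093)  # addi x1, x0, 5
    step(regs, 0xFFF00113)  # addi x2, x0, -1
    step(regs, 0x002081B3)  # add  x3, x1, x2
    print(regs[1], hex(regs[2]), regs[3])
```

Each instruction is one decode branch and one arithmetic expression here. Hazards, forwarding and timing closure do not show up at this level at all, which is where the pipelining effort comes in.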
 
The following users thanked this post: 0db

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 1525
  • Country: us
  • Formerly SiFive, Samsung R&D
Writing a full RV64GC core with no bugs plus a working MMU to be able to run Linux successfully - good luck. SiFive did it. I'm sure the team was a bit more staffed than 1 person.
I think you grossly overestimate the complexity of implementing RV in a general sense. I will give it to you that it will be hard to make it fast, but if speed is not your requirement (and in a lot of hobby cases it isn't) then you can do it all quite quickly. From my own experience, 80+% of all the complexity is in pipelining and dealing with all the resulting issues, as well as HW-specific optimization to reach max performance, not in the actual behavioral modeling.

Implementing RV64I (or 32) is pretty straightforward, but M is a bit harder, and a fully IEEE-compliant FPU is really not that easy.
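Part of what makes M fiddlier than the base ISA is the division corner cases: RISC-V division never traps, so divide-by-zero and signed overflow have defined results that a divider has to produce. Here is a small Python reference model of those rules for XLEN = 64. The function names are just for illustration and this is a checking aid, not a hardware design.

```python
XLEN = 64
MASK = (1 << XLEN) - 1
INT_MIN = -(1 << (XLEN - 1))


def to_signed(x):
    """Reinterpret an XLEN-bit pattern as a signed integer."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def div(rs1, rs2):
    a, b = to_signed(rs1), to_signed(rs2)
    if b == 0:
        return MASK                   # divide by zero: quotient is all ones (-1)
    if a == INT_MIN and b == -1:
        return a & MASK               # signed overflow: quotient is the dividend
    q = abs(a) // abs(b)              # RISC-V rounds toward zero, unlike Python's //
    return (-q if (a < 0) != (b < 0) else q) & MASK


def rem(rs1, rs2):
    a, b = to_signed(rs1), to_signed(rs2)
    if b == 0:
        return a & MASK               # divide by zero: remainder is the dividend
    if a == INT_MIN and b == -1:
        return 0                      # signed overflow: remainder is zero
    r = abs(a) % abs(b)
    return (-r if a < 0 else r) & MASK  # remainder takes the dividend's sign


def divu(rs1, rs2):
    rs1 &= MASK
    rs2 &= MASK
    return MASK if rs2 == 0 else rs1 // rs2


def remu(rs1, rs2):
    rs1 &= MASK
    rs2 &= MASK
    return rs1 if rs2 == 0 else rs1 % rs2


if __name__ == "__main__":
    print(to_signed(div(-7 & MASK, 2)), to_signed(rem(-7 & MASK, 2)))  # -3 -1
    print(hex(div(7, 0)), rem(7, 0))
```

A model like this is handy as a golden reference when running random operand pairs through a hardware divider in simulation.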

On the other hand only a masochist does all that themselves :-)  There are a ton of projects that implement their own absolutely unique pipeline but grab the perfectly good Berkeley FPU and/or ALU as black box components.
 

Offline 0db

  • Regular Contributor
  • *
  • Posts: 184
  • Country: zm
Dunning-Kruger effect, and ego boost syndrome.
Be aware.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 5330
  • Country: fr
Writing a full RV64GC core with no bugs plus a working MMU to be able to run Linux successfully - good luck. SiFive did it. I'm sure the team was a bit more staffed than 1 person.
I think you grossly overestimate the complexity of implementing RV in a general sense. I will give it to you that it will be hard to make it fast, but if speed is not your requirement (and in a lot of hobby cases it isn't) then you can do it all quite quickly. From my own experience, 80+% of all the complexity is in pipelining and dealing with all the resulting issues, as well as HW-specific optimization to reach max performance, not in the actual behavioral modeling.

I think you missed part of my post, and/or grossly underestimate the total amount of work for a whole system able to run Linux.

There was no "general sense". I specifically talked about RV64GC and Linux (for a simpler core and a simpler OS, I already said it was doable, but no picnic either). Take a look at what the "G" entails, and see how much time and care it will take to properly implement all the extensions needed to support Linux. And then the MMU. And then see how much knowledge and time it will take to build your own Linux distro compatible with all this. As Bruce said, it's a gigantic amount of work.

As can be followed in another thread, I happen to have designed a complete and working pipelined RV32/64 core, so I kind of know what it entails now. And so far, it's far from supporting all extensions implied by "G". Only base, M and Zicsr. It would be a lot of extra work. It's not just the FPU, although an FPU is already a lot of hard work. Sure you can reuse an existing one, but as was mentioned in another thread, there aren't that many open (and good) FPUs out there. Bruce mentioned the Berkeley FPU which is good, but AFAIK, it's written in Chisel. If you don't use Chisel otherwise, you may not want to have to mess with that. So, not trivial. And then there are the other mandatory extensions.
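To make that gap concrete, here is a small Python sketch that expands a simplified ISA string and lists what a core is missing relative to a target. It assumes "G" stands for IMAFD plus Zicsr and Zifencei, as in the 2019 ratified unprivileged spec. The parser is deliberately simplified (no version numbers, no S/X extensions) and the function names are made up for illustration.

```python
# "G" as shorthand for the general-purpose set (assumed per the 2019 spec).
G_EXPANSION = ("I", "M", "A", "F", "D", "Zicsr", "Zifencei")


def expand(isa):
    """Expand a simplified ISA string like 'RV64GC' or 'RV64IM_Zicsr'."""
    if not isa.lower().startswith(("rv32", "rv64")):
        raise ValueError("expected a string starting with RV32 or RV64")
    xlen = int(isa[2:4])
    parts = isa[4:].split("_")
    exts = set()
    for letter in parts[0]:            # single-letter extensions
        if letter.upper() == "G":
            exts.update(G_EXPANSION)
        else:
            exts.add(letter.upper())
    for name in parts[1:]:             # multi-letter extensions after "_"
        if name:
            exts.add(name.capitalize())
    return xlen, exts


def missing(core_isa, target_isa):
    """Return the sorted extensions the target needs that the core lacks."""
    core_xlen, core_exts = expand(core_isa)
    target_xlen, target_exts = expand(target_isa)
    if core_xlen != target_xlen:
        raise ValueError("XLEN mismatch")
    return sorted(target_exts - core_exts)


if __name__ == "__main__":
    # Base + M + Zicsr measured against the RV64GC that Linux generally expects.
    print(missing("RV64IM_Zicsr", "RV64GC"))
    # ['A', 'C', 'D', 'F', 'Zifencei']
```

So a core with base, M and Zicsr still lacks atomics, both floating-point extensions, compressed instructions and Zifencei, and that is before the privileged side (supervisor mode and the MMU) enters the picture.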

But then there's also the MMU, plus the OP wants a cache system, plus, plus... plus a working and adapted Linux system. Do you really realize what this all means?

(Again as I said above, a simple RV32/64IM core + a simple OS would be doable. That's an order of magnitude simpler IMO.)

But those convinced it's doable in a few months for one person, please have at it. Just don't consider that because you managed to write a simple, non-pipelined RV32I core, that would mean that upping your game to an RV64GC + MMU + Linux would be a piece of cake. Don't hesitate to try and report back.
 

Offline 0db

  • Regular Contributor
  • *
  • Posts: 184
  • Country: zm
As can be followed in another thread, I happen to have designed a complete and working pipelined RV32/64 core

Everyone can write on the public internet, but only evaluable facts matter, and I can only judge facts.

OpenCores is full of public sources: soft cores, devices, a lot of stuff. 90% of it shows the difference between "I'd like to do something with my FPGA" and "I have just successfully done it and verified that it actually works as expected", and public repositories show the effort and the time it takes to achieve results.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 1525
  • Country: us
  • Formerly SiFive, Samsung R&D
Bruce mentioned the Berkeley FPU which is good, but AFAIK, it's written in Chisel. If you don't use Chisel otherwise, you may not want to have to mess with that.

Chisel compiles to Verilog, so you can do that once and then build around that. You might or might not want to rename some of the inputs and outputs to make interfacing cleaner, but that's a heck of a lot easier than starting from scratch.
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 5330
  • Country: fr
Bruce mentioned the Berkeley FPU which is good, but AFAIK, it's written in Chisel. If you don't use Chisel otherwise, you may not want to have to mess with that.

Chisel compiles to Verilog, so you can do that once and then build around that. You might or might not want to rename some of the inputs and outputs to make interfacing cleaner, but that's a heck of a lot easier than starting from scratch.

I'll have to try that at least once to see. Never used the Chisel translators. I'm not sure the Verilog output is very readable though (like with most code generation tools?)
It should certainly be much easier than writing your own, but that could still be a concern for some people/teams, so I thought that was worth mentioning.

But anyway, if you start from an existing core that already has everything needed to run Linux, with a corresponding Linux distro, then that can definitely be a one-person project. That only holds if you start with something that already works, though, and I don't think that was what the OP was after.

OTOH, any simpler core, as I said above, with a simple OS, is definitely a lot easier. I don't think an order of magnitude is exaggerated.
« Last Edit: May 12, 2020, 05:37:03 pm by SiliconWizard »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 1525
  • Country: us
  • Formerly SiFive, Samsung R&D
Bruce mentioned the Berkeley FPU which is good, but AFAIK, it's written in Chisel. If you don't use Chisel otherwise, you may not want to have to mess with that.

Chisel compiles to Verilog, so you can do that once and then build around that. You might or might not want to rename some of the inputs and outputs to make interfacing cleaner, but that's a heck of a lot easier than starting from scratch.

I'll have to try that at least once to see. Never used the Chisel translators. I'm not sure the Verilog output is very readable though (like with most code generation tools?)

Sure. Don't even think about looking inside the black box. Just find the buses for the operand and operation/mode inputs and result output and be happy.

Chisel output does keep the original names the programmer used as part of the Verilog names, in a similar way to how C++ does. It's not like it's "WIRE23752" or something (unless you explicitly ask it to obfuscate).
 

