Author Topic: The RISC-V ISA discussion  (Read 19478 times)


Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4404
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #150 on: April 23, 2020, 12:24:22 am »
I believe I've written about this before.

You have two completely different and independent problems: 1) predicting if a branch will be taken or not, and 2) predicting the PC for a taken branch.

You only have to predict the PC for the JALR instruction. The next PC will be the value in the rs1 register plus an offset. You don't know at instruction decode time what the value in the register will be when you get to the execute stage. You do however know with absolute certainty that the branch *will* be taken. JALR is a very infrequently used instruction *except* for function return, which can be accelerated using a small return address stack (even 2 entries gives almost all the benefit). The ISA manual lists the combinations of rs1 and rd that can be assumed to imply particular actions with the return address stack. The other uses for JALR are virtual function call / call via a pointer and switch statements. They are fairly rare but can be accelerated by a branch target cache which is used *only* for JALR, not for conditional branches.
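
As a rough illustration of the return address stack idea above, here is a minimal Python sketch. The push/pop table follows the rd/rs1 hint convention as I understand it from the unprivileged ISA manual, with x1 and x5 treated as link registers. The names `ras_action` and `ReturnAddressStack`, and the choice to drop the oldest entry on overflow, are assumptions made for this example and are not taken from any particular core.

```python
LINK_REGS = {1, 5}  # x1 (ra) and x5 (t0) are treated as link registers


def ras_action(rd, rs1):
    """Return the return-address-stack hint implied by a JALR's rd and rs1."""
    rd_link = rd in LINK_REGS
    rs1_link = rs1 in LINK_REGS
    if not rd_link and not rs1_link:
        return "none"           # plain indirect jump
    if not rd_link and rs1_link:
        return "pop"            # function return
    if rd_link and not rs1_link:
        return "push"           # indirect call
    # Both registers are link registers.
    return "push" if rd == rs1 else "pop_then_push"


class ReturnAddressStack:
    """Tiny fixed-depth stack; the overflow policy here is an assumption."""

    def __init__(self, depth=2):
        self.depth = depth
        self.entries = []

    def push(self, addr):
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # drop the oldest entry when full
        self.entries.append(addr)

    def pop(self):
        # Returns None when empty, meaning "no prediction available".
        return self.entries.pop() if self.entries else None
```

Here `ras_action(0, 1)` corresponds to a plain `ret` (`jalr x0, 0(x1)`) and yields a pop.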

Conditional branches need to have a prediction for whether they will be taken or not. But once you have decided (using the branch predictor) whether or not it will be taken, it is absolutely certain what the next PC will be -- if the branch is not taken it will be PC+4 (or +2 for a compressed instruction) and if it is taken it will be PC+simm12.
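
To make the "absolutely certain" next PC concrete, the sketch below decodes the offset of a 32-bit conditional branch and picks the next PC from a taken/not-taken prediction. It assumes the standard B-type layout (12 encoded immediate bits plus an implied zero low bit, giving a 13-bit signed byte offset). The function names are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def b_type_offset(instr):
    """Extract the signed byte offset from a 32-bit B-type instruction."""
    imm = ((instr >> 31) & 0x1) << 12    # imm[12]
    imm |= ((instr >> 7) & 0x1) << 11    # imm[11]
    imm |= ((instr >> 25) & 0x3F) << 5   # imm[10:5]
    imm |= ((instr >> 8) & 0xF) << 1     # imm[4:1]; imm[0] is always zero
    return sign_extend(imm, 13)


def next_pc(pc, instr, predict_taken, length=4):
    """Next PC for a conditional branch once the prediction is known."""
    if predict_taken:
        return (pc + b_type_offset(instr)) & 0xFFFFFFFF
    return (pc + length) & 0xFFFFFFFF
```

For example, `0x00000463` encodes `beq x0, x0, +8`, so from PC `0x100` the predicted-taken next PC is `0x108` and the not-taken one is `0x104`.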
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 15172
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #151 on: April 23, 2020, 01:09:11 pm »
Quote
You have two completely different and independent problems: 1) predicting if a branch will be taken or not, and 2) predicting the PC for a taken branch.

Maybe my question/issue was not very clear. Let's ignore the target PC for a while and just think about the prediction (taken/not taken).

In the general case, at any given cycle, if the instruction being fetched is a conditional branch (needs to be predicted), you must have the corresponding prediction ready for the next cycle - but at the fetch stage, the instruction is not decoded yet, so you don't know it's a conditional branch. The branch predictor will not issue a status ready for the next cycle, unless you access it asynchronously. It works, but it's suboptimal. Is that clearer?

Basically, by the time a conditional branch gets to the decode stage - so you know it's a conditional branch - you must have a valid target address - which will depend on the prediction (taken/not taken) for the current fetch. Otherwise you waste 1 cycle.

I may not be approaching the thing completely right, but basically, if you want your branch predictor to be able to issue a valid prediction at EVERY cycle, there is something problematic here to make it fully synchronous. Does what I'm saying make sense?

« Last Edit: April 23, 2020, 01:12:18 pm by SiliconWizard »
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4404
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #152 on: April 23, 2020, 02:11:16 pm »
You don't need a lot of decoding.

You need to look at the branch predictor if and only if opcode[6:0] == 1100011 (or opcode[1:0] == 01 && opcode[15:14] == 11 if RVC is supported, i.e. C.BEQZ and C.BNEZ).

That only needs a very shallow AND/OR network to figure out.

You can extract the presumed offset from the opcode and add it to the PC in parallel, without knowing whether the instruction is a branch or not.

Also you can read the branch predictor table as soon as you know the PC of the instruction, before you've fetched the instruction or determined that it is a branch. If the instruction turns out not to be a branch then you just ignore what you read.

Of course the branch predictor table can't be updated until the branch proceeds through the pipeline, the register contents have been read and compared etc, and the correctness of the prediction checked.
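
The shallow decode described in this post can be sketched as a single predicate. This is only an illustration in Python rather than RTL, and `is_conditional_branch` is a made-up name. For the compressed case it relies on C.BEQZ and C.BNEZ being the two quadrant-1 encodings (low bits 01) whose funct3 starts with 11, which is my reading of the RVC encoding tables.

```python
def is_conditional_branch(instr, rvc=True):
    """True if instr (32-bit, or 16-bit compressed) is a conditional branch."""
    low2 = instr & 0b11
    if low2 == 0b11:
        # Full-size instruction: the BRANCH major opcode is 1100011.
        return (instr & 0x7F) == 0b1100011
    if rvc:
        # Compressed: quadrant 01 with bits [15:14] == 11 (C.BEQZ / C.BNEZ).
        return low2 == 0b01 and ((instr >> 14) & 0b11) == 0b11
    return False
```

In hardware this amounts to a handful of AND/OR terms on nine instruction bits, which is why it can sit in the fetch stage.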
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 15172
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #153 on: April 23, 2020, 02:54:47 pm »
Quote
Also you can read the branch predictor table as soon as you know the PC of the instruction, before you've fetched the instruction or determined that it is a branch.

This is currently what I do indeed. Point is, as I said earlier: the PC of the instruction to fetch depends on a possible prediction from the previous cycle, and this dependency causes me trouble as far as registering goes. To make this clearer: in my implementation, the current PC is the output of a MUX with the "next PC" and "predicted PC" as inputs. That part alone may be the main point to refactor?

Quote
If the instruction turns out not to be a branch then you just ignore what you read.

But then you may have fetched the wrong next instruction - you waste 1 cycle (or more).

Maybe there's no way around this if again I want to get to higher speeds (but with potentially more mispredictions.)

But if you remember, this is why I added a tag (like with caches) in the branch predictor (which I know you weren't fond of), so I don't need to wait till the instruction is fetched to know for sure it's a branch that needs to be predicted. It may sound a bit wasteful, but from my tests, it was worth it in terms of correct prediction rate. The tag could be dropped, and it does use some memory, but it's not what causes me a problem here: I tested removing it, and I still have the same problem.

Quote
Of course the branch predictor table can't be updated until the branch proceeds through the pipeline, the register contents have been read and compared etc, and the correctness of the prediction checked.

Yes, that point is OK.

I'm sure I may have to rethink things a bit. It's very possible that I'll have to compromise the performance of my BP to make it fully synchronous.

« Last Edit: April 23, 2020, 02:57:23 pm by SiliconWizard »
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4404
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #154 on: April 24, 2020, 03:46:07 am »
Quote
Also you can read the branch predictor table as soon as you know the PC of the instruction, before you've fetched the instruction or determined that it is a branch.

This is currently what I do indeed. Point is, as I said earlier: the PC of the instruction to fetch depends on a possible prediction from the previous cycle, and this dependency causes me trouble as far as registering goes. To make this clearer: in my implementation, the current PC is the output of a MUX with the "next PC" and "predicted PC" as inputs.

That doesn't sound quite precise.

opcode = sram[pc]; // or icache
pc  = (predict_taken[pc] & is_branch(opcode)) ? pc+branch_offset(opcode) : pc+4;

You can do sram[pc] and predict_taken[pc] in parallel (and pc+4 too). Once you have the opcode back from the sram you can do prediction&is_branch(opcode) and pc+branch_offset(opcode) in parallel. Then you have remaining only a 2:1 mux.

If you want to be able to do single cycle branches then you simply have to make sure that your cycle time is long enough to do sram access + a 32 bit add + 2:1 mux in sequence as that will be the critical path for that pipeline stage.
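
The two-line pseudocode above can be exercised as a small software model, with one call standing for one clock cycle of the fetch stage. This is a behavioural sketch only: the instruction memory and predictor are plain dictionaries, the predictor is a fixed table rather than a trained one, and everything apart from the names used in the pseudocode (`sram`, `predict_taken`, `is_branch`, `branch_offset`) is an assumption.

```python
def sign_extend(value, bits):
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def is_branch(opcode):
    return (opcode & 0x7F) == 0b1100011


def branch_offset(opcode):
    # Computed for every instruction; the result is ignored for non-branches.
    imm = ((opcode >> 31) & 1) << 12
    imm |= ((opcode >> 7) & 1) << 11
    imm |= ((opcode >> 25) & 0x3F) << 5
    imm |= ((opcode >> 8) & 0xF) << 1
    return sign_extend(imm, 13)


def fetch_cycle(pc, sram, predict_taken):
    # These three depend only on pc, so hardware can start them in parallel.
    opcode = sram[pc]
    prediction = predict_taken.get(pc, False)
    sequential = pc + 4
    # Once the opcode is back, the select and the target are also parallel.
    take = prediction and is_branch(opcode)
    target = pc + branch_offset(opcode)
    return target if take else sequential  # the final 2:1 mux


NOP = 0x00000013         # addi x0, x0, 0
BEQ_BACK_8 = 0xFE000CE3  # beq x0, x0, -8
sram = {0x0: NOP, 0x4: NOP, 0x8: BEQ_BACK_8}
# The entry for 0x0 is a stale "taken" bit on a non-branch; it gets ignored.
predict_taken = {0x0: True, 0x8: True}

pc, trace = 0, []
for _ in range(5):
    trace.append(pc)
    pc = fetch_cycle(pc, sram, predict_taken)
print([hex(p) for p in trace])
```

The loop fetches one instruction per call, including the one right after the predicted-taken branch at `0x8`.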

I don't expect this to be the overall critical pipeline stage -- or at least not by much -- given that one of the other pipeline stages has to do register access and muxing to the ALU inputs, and the ALU has to be able to do a 0..32 bit shift in the same amount of time.

You could of course split instruction fetch and branch prediction into two pipeline stages and have taken branches take 2 clock cycles. That would allow slightly higher MHz, but not much higher and I would be pretty darn sure not enough higher to compensate for taking an extra cycle every five or six instructions. I don't know offhand of any RISC-V core with a branch predictor that does that. SiFive's tiny 2-series cores and PULP ZeroRiscy don't bother with branch prediction at all and just accept they are going to need 2 clock cycles for every taken branch.
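
A quick back-of-the-envelope check of that trade-off, as a sketch: if the baseline runs at 1 cycle per instruction and the split design pays one extra cycle per taken branch, the split design has to clock faster by the same factor as its CPI grows just to break even. The function name is hypothetical, and the model ignores every other source of stalls.

```python
def required_clock_ratio(instr_per_taken_branch, extra_cycles=1):
    """Clock speedup needed to match a 1-CPI baseline.

    Run time is instructions * CPI / frequency, so the break-even frequency
    ratio equals the CPI of the design that pays extra cycles on branches.
    """
    return 1 + extra_cycles / instr_per_taken_branch


for n in (5, 6):
    print(f"taken branch every {n} instructions: "
          f"needs {required_clock_ratio(n):.3f}x the clock")
```

With a taken branch every five or six instructions, the two-stage variant would need roughly a 17 to 20 percent higher clock before it gains anything.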

Quote
If the instruction turns out not to be a branch then you just ignore what you read.

But then you may have fetched the wrong next instruction - you waste 1 cycle (or more).

No. You fetched an unnecessary branch prediction -- because the instruction turned out not to be a branch. You are nowhere near fetching the next instruction yet.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 15172
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #155 on: April 24, 2020, 04:18:01 pm »
I'm not completely sure what I say is clear, or if we fully understand each other.

Quote
opcode = sram[pc]; // or icache
pc  = (predict_taken[pc] & is_branch(opcode)) ? pc+branch_offset(opcode) : pc+4;

OK - that's the idea of what I do. I'll try to illustrate what I want to achieve with code pieces later on to be clearer/more precise.

But as I said, I'm expecting to be able to do the above for EVERY clock cycle.

Feeding the 'pc' register, which is the output of a MUX, to predict_taken[pc] (which I'd like to be implemented as block RAM/synchronous RAM in general) seems problematic as 'pc' in this way isn't exactly registered. So predict_taken has to be implemented as asynchronous memory basically. As soon as I register the output of this mux, the issue disappears entirely. But of course that would add a 1-cycle latency, which I do not want.

Quote
You can do sram[pc] and predict_taken[pc] in parallel (and pc+4 too). Once you have the opcode back from the sram you can do prediction&is_branch(opcode) and pc+branch_offset(opcode) in parallel. Then you have remaining only a 2:1 mux.

If you want to be able to do single cycle branches then you simply have to make sure that your cycle time is long enough to do sram access + a 32 bit add + 2:1 mux in sequence as that will be the critical path for that pipeline stage.

Ok, there I think you got what I want to do. Single cycle branches when they are correctly predicted.

Quote
You could of course split instruction fetch and branch prediction into two pipeline stages and have taken branches take 2 clock cycles.

That's what I want to avoid. That would obviously solve the issues altogether, but my goal is single-cycle branches as much as possible. The potential speed increase I could get adding a stage here would likely not make up for the additional branch latency.

It's no surprise that I got, for instance, a CoreMark/MHz figure almost exactly the same as the Freedom U540. It was almost entirely due to my branch predictor (disabling it, or going for something simpler, got me significantly lower figures). I'm certain the rest of the U540 makes it have much better performance overall. But maybe my branch predictor, as it is, is just not completely realistic for actual implementations. That's what I'm currently trying to work on/figure out.
« Last Edit: April 24, 2020, 06:57:27 pm by SiliconWizard »
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4404
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #156 on: April 24, 2020, 10:20:26 pm »
Quote
I'm not completely sure what I say is clear, or if we fully understand each other.

opcode = sram[pc]; // or icache
pc  = (predict_taken[pc] & is_branch(opcode)) ? pc+branch_offset(opcode) : pc+4;

OK - that's the idea of what I do. I'll try to illustrate what I want to achieve with code pieces later on to be clearer/more precise.

But as I said, I'm expecting to be able to do the above for EVERY clock cycle.

Yes, of course. Everyone does that, as I said.

Quote
Feeding the 'pc' register, which is the output of a MUX, to predict_taken[pc] (which I'd like to be implemented as block RAM/synchronous RAM in general) seems problematic as 'pc' in this way isn't exactly registered. So predict_taken has to be implemented as asynchronous memory basically. As soon as I register the output of this mux, the issue disappears entirely. But of course that would add a 1-cycle latency, which I do not want.

Why is PC the output of a mux?

We are talking about RISC-V here. The PC is a special register located in the instruction fetch unit as, literally, a register. It is registered (to use your terminology). It is not a general register in the register file with a mux to access it. And even if it was, you'd make a bypass for instruction fetch that went directly from the output of the PC register, not via the register-select mux.

You read the PC contents from a register (flipflops on an SoC), pass it through a bunch of asynchronous logic including some SRAM holding instructions, minimal instruction decode, adders, mux, and feed the result back into the input of the same register you read the old PC from. Some time after the signal settles the clock ticks and BOOM you read the new PC value into the register, replacing the old PC.

If you want to, you can implement an entire RV32I CPU using a single-stage pipeline with not only the PC being updated, but also the fetched instruction decoded, values read from registers, passed through the ALU, and presented back to the write port of the registers before the next clock tick.

Michael Field (aka field_hamster, hamsternz) posted his own single-stage RISC-V design here sometime last year (I think). Anyway it's on his github. Rudi-RV32I if I recall.

It's certainly no problem to have PC read, instruction fetch, branch predictor lookup, next PC calculation all in one clock cycle. You simply have to make the clock cycle suitably long that everything has propagated before it ticks. If you don't do that then you might be able to make the clock cycle VERY slightly faster, but it won't be by anywhere near enough to compensate for using more clock cycles for branches.

Quote
You can do sram[pc] and predict_taken[pc] in parallel (and pc+4 too). Once you have the opcode back from the sram you can do prediction&is_branch(opcode) and pc+branch_offset(opcode) in parallel. Then you have remaining only a 2:1 mux.

If you want to be able to do single cycle branches then you simply have to make sure that your cycle time is long enough to do sram access + a 32 bit add + 2:1 mux in sequence as that will be the critical path for that pipeline stage.

Ok, there I think you got what I want to do. Single cycle branches when they are correctly predicted.

Yes, like everyone does. It would be crazy to do something else, for typical programs.
« Last Edit: April 24, 2020, 10:22:11 pm by brucehoult »
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 15172
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #157 on: April 25, 2020, 04:24:00 pm »
As I already said, my implementation at this point does work. It's just not optimal IMO, and that's what I'm working on. I'm sure I'll find a way to optimize it, and there will certainly be many more opportunities to optimize my whole core later on. I was already happy about it being able to run @100MHz on a Spartan 6, a bit less happy about the ~85MHz with the branch predictor enabled, but it was not that bad either, considering it implements RV32IM_Zicsr, exceptions/traps and branch prediction. My goal though is to make the branch predictor not add significant overall delay.

Just one point - may have been obvious, but I'm not sure, so here it is. When I talk about "branch prediction", I'm actually talking about both branch prediction and branch target prediction, as I know those are usually formally 2 different concepts - but I need both, and I have implemented both.

Without branch target prediction, what you suggested earlier poses no problem, but there will always be a one-cycle latency if you need to get at the decode stage to figure out the next PC. I don't really see a way around that, and that's usually why branch target buffers are used. So, yeah I have implemented both, and I consider both part of the "branch prediction unit" in my design. That may have been a bit confusing, so here I make it clear. That's also why I talked about PC tags. BTBs are a form of cache and require some form of tags like with any cache to be effective.

So this is mainly a matter of optimizing the implementation of both predictors used in conjunction, and I'm sure I'll figure it out.
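
For readers unfamiliar with tagged BTBs, the "form of cache" remark above can be sketched as a direct-mapped table. This is not the implementation discussed in this post, whose details are not given. The class name, the index/tag split, and the assumption of 4-byte aligned instructions (no RVC) are all choices made for the example.

```python
class DirectMappedBTB:
    """Branch target buffer indexed by low PC bits and checked with a tag."""

    def __init__(self, entries=512):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.entries = entries
        self.index_bits = entries.bit_length() - 1
        self.table = [None] * entries  # each slot is (tag, target) or None

    def _split(self, pc):
        word = pc >> 2  # assumes 4-byte aligned instructions
        return word & (self.entries - 1), word >> self.index_bits

    def lookup(self, pc):
        """Return the predicted target, or None on a miss or tag mismatch."""
        index, tag = self._split(pc)
        slot = self.table[index]
        if slot is not None and slot[0] == tag:
            return slot[1]
        return None

    def update(self, pc, target):
        """Record a resolved taken branch, evicting whatever shared the slot."""
        index, tag = self._split(pc)
        self.table[index] = (tag, target)
```

A hit tells the fetch stage both that the instruction at `pc` is a taken-predicted branch and where it goes, before the instruction itself has been read. Without the tag, two PCs that share an index would silently use each other's targets.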
« Last Edit: April 25, 2020, 04:28:04 pm by SiliconWizard »
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4404
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #158 on: April 26, 2020, 02:20:34 am »
Quote
As I already said, my implementation at this point does work. It's just not optimal IMO, and that's what I'm working on. I'm sure I'll find a way to optimize it, and there will certainly be many more opportunities to optimize my whole core later on. I was already happy about it being able to run @100MHz on a Spartan 6, a bit less happy about the ~85MHz with the branch predictor enabled, but it was not that bad either, considering it implements RV32IM_Zicsr, exceptions/traps and branch prediction. My goal though is to make the branch predictor not add significant overall delay.

This is good.

Quote
Just one point - may have been obvious, but I'm not sure, so here it is. When I talk about "branch prediction", I'm actually talking about both branch prediction and branch target prediction, as I know those are usually formally 2 different concepts - but I need both, and I have implemented both.

Without branch target prediction, what you suggested earlier poses no problem, but there will always be a one-cycle latency if you need to get at the decode stage to figure out the next PC. I don't really see a way around that, and that's usually why branch target buffers are used. So, yeah I have implemented both, and I consider both part of the "branch prediction unit" in my design. That may have been a bit confusing, so here I make it clear. That's also why I talked about PC tags. BTBs are a form of cache and require some form of tags like with any cache to be effective.

So this is mainly a matter of optimizing the implementation of both predictors used in conjunction, and I'm sure I'll figure it out.

I've said all this before but I'll repeat.

1) you don't need a 1 cycle latency to calculate the next PC. You can do just enough decode of the instruction (figure out if it's a conditional branch, extract the offset and add it to the PC) right there in the instruction fetch stage. This will result in a slightly lower maximum MHz, but not much.

2) only JALR instructions logically require prediction of the branch target. They are comparatively rare, with function return being the vast majority. Returns can be handled very well with typically a 2 (FE310) to 6 (FU540) entry return address stack.

3) branch target predictors are *huge*. Each entry needs to store both the current (matching) PC and the next PC, which is 64 bits on RV32. Plus it needs to be a CAM, which is very expensive, adding a comparator for the PC (tag) of every entry. A branch predictor needs 2 bits for each entry and is direct addressed access. You can afford to have at least 32 times more branch predictor entries than branch target entries for the same area / LUTs. Maybe 50x.

4) a return address entry is just as large as a branch target entry, but you only need a few of them to be effective and they don't need to be CAM as you only need to check the top entry. (you *could* CAM it and let any entry match, but that's only going to help malformed programs)

5) yes, you can just use a branch target predictor for conditional branches and JAL as well as for JALR. That will save waiting for the instruction fetch and doing the minimal decode needed. It will need far more resources (or have a lower hit rate than a branch predictor, if it has fewer entries), and most of it will be used by things that aren't JALR.
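
As an illustration of why a branch predictor entry in point 3 needs only 2 bits and no tag, here is a minimal sketch of a directly indexed table of 2-bit saturating counters. The saturating-counter scheme is the textbook interpretation and an assumption here, as are the class name, the initial counter value and the indexing by word address.

```python
class BranchHistoryTable:
    """Directly indexed table of 2-bit saturating counters (0..3)."""

    def __init__(self, entries=128):
        self.entries = entries
        self.counters = [1] * entries  # start at "weakly not taken"

    def _index(self, pc):
        # No tag: different PCs that map to the same index simply alias.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

A lookup is a single table read with no comparison, which is what makes it cheap compared with a tagged or CAM-based target predictor.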

I mean .. experiment with different things by all means. I don't want to discourage that! And it depends on what your goal is.

For reference (BHT = Branch History Table, aka 2 bits per entry predictor, BTB = predicts the next PC, RAS = Return Address Stack):

FE310-G000: 128 BHT, 40 BTB, 2 RAS
FE310-G002: 512 BHT, 28 BTB, 6 RAS
FU540-C000: 256 BHT, 30 BTB, 6 RAS

ARM A53: 3072 BHT, 256 BTB, 8 RAS

Note that the FU540 taped out almost exactly a year after the FE310-G000, while the FE310-G002 was another year after the FU540.
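
The entry sizes from point 3 can be turned into a rough storage estimate for the two FE310 configurations listed above. This is a sketch under simple assumptions: 2 bits per BHT entry, 64 bits per BTB entry (matching PC plus next PC on RV32), and no valid bits or other overhead. The FU540 and A53 are left out because their address widths differ. `predictor_bits` is a hypothetical helper.

```python
def predictor_bits(bht_entries, btb_entries, xlen=32):
    """Return (BHT bits, BTB bits) under the simple per-entry sizes above."""
    bht = bht_entries * 2          # one 2-bit counter per entry
    btb = btb_entries * 2 * xlen   # matching PC + next PC per entry
    return bht, btb


CONFIGS = {"FE310-G000": (128, 40), "FE310-G002": (512, 28)}
for name, (bht_n, btb_n) in CONFIGS.items():
    bht, btb = predictor_bits(bht_n, btb_n)
    print(f"{name}: BHT {bht} bits, BTB {btb} bits")

# One BTB entry costs as much storage as this many BHT entries on RV32.
ENTRY_RATIO = (2 * 32) // 2
```

Even with many more BHT entries than BTB entries, the BTB still dominates the storage in both configurations, before counting the comparators a CAM would add.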
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 15172
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #159 on: April 26, 2020, 02:31:50 pm »
Quote
5) yes, you can just use a branch target predictor for conditional branches and JAL as well as for JALR. That will save waiting for the instruction fetch and doing the minimal decode needed.

Yes, there we go. This is exactly my goal, and what I'm basically doing. The key idea was to avoid waiting for the instruction fetch itself (whereas I do agree that the subsequent minimal decode required would be negligible).

And the point I was trying to make is: if you have to wait for the fetch, how can you select the most probable PC for the next instruction at every cycle? That's where BTBs come into play as far as I have read and also thought about it. And yes I'm trying to optimize every kind of branch.

Quote
It will need far more resources (or have a lower hit rate than a branch predictor, if it has fewer entries), and most of it will be used by things that aren't JALR.

Yup. I know and this is what I'm currently running into. I don't mind the required memory for this per se, but as I said, my implementation currently requires extra resources on top of mere memory for storing the entries, and that's what I'll have to work on/refactor. Maybe I won't find a better implementation for this, and that's the price I'll have to pay if I want to stick to this.

Quote
I mean .. experiment with different things by all means. I don't want to discourage that! And it depends on what your goal is.

Yup! And I have experimented with different things in my simulator earlier. This one approach was the most promising one from everything I tested, but my tests were not exhaustive by any means either. But I knew this approach was going to be expensive, and I now have real data to figure this out.

Quote
For reference (BHT = Branch History Table, aka 2 bits per entry predictor, BTB = predicts the next PC, RAS = Return Address Stack):

FE310-G000: 128 BHT, 40 BTB, 2 RAS
FE310-G002: 512 BHT, 28 BTB, 6 RAS
FU540-C000: 256 BHT, 30 BTB, 6 RAS

ARM A53: 3072 BHT, 256 BTB, 8 RAS

Thanks for these figures!
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 15172
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #160 on: April 26, 2020, 02:49:47 pm »
As to max frequency for a typical RV32IM (or RV64IM while we're at it) core on a typical FPGA (3-stage and 5-stage pipeline cores for instance), would you happen to have some real figures, just so I can get some idea of what should be achievable?

I suppose SiFive's cores have all been prototyped on FPGAs? Would you have figures about achievable max freq (and on which FPGA(s)) ?
(I think I remember one figure: if I'm not mistaken, the U540 runs @100MHz on FPGA - not sure which - and I don't know whether this is the max frequency achievable either.)
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4404
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #161 on: April 27, 2020, 02:01:33 am »
Quote
As to max frequency for a typical RV32IM (or RV64IM while we're at it) core on a typical FPGA (3-stage and 5-stage pipeline cores for instance), would you happen to have some real figures, just so I can get some idea of what should be achievable?

Frequency isn't necessarily the number you're looking for :-) The popular picorv32 core runs at 300 or 400 MHz in cheap FPGAs, but uses I think 4 clock cycles per instruction. And then there is the very small SERV core (about 300 LUT4s?) which runs at 300 MHz but uses 32 clock cycles for most instructions and 64 for branches and stores.

I think 100 MHz is pretty typical for 1 CPI designs.
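
The point about frequency versus cycles per instruction is simple arithmetic, sketched here with the approximate figures quoted in this post (they are the post's rough numbers, not measurements). `mips` is a hypothetical helper that returns millions of instructions per second.

```python
def mips(freq_mhz, cpi):
    """Millions of instructions per second for a clock in MHz and a given CPI."""
    return freq_mhz / cpi


print("picorv32 @ 400 MHz, ~4 CPI:", mips(400, 4))
print("SERV     @ 300 MHz, 32 CPI:", mips(300, 32))
print("1-CPI    @ 100 MHz        :", mips(100, 1))
```

On these numbers a 400 MHz picorv32 and a 100 MHz single-CPI core land at about the same instruction throughput, while SERV trades almost all of its throughput for size.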

Quote
I suppose SiFive's cores have all been prototyped on FPGAs? Would you have figures about achievable max freq (and on which FPGA(s)) ?
(I think I remember one figure: if I'm not mistaken, the U540 runs @100MHz on FPGA - not sure which - and I don't know whether this is the max frequency achievable either.)

That was certainly true up to the time I stopped working there, though the more sophisticated designs such as the OoO 8-series and especially simulating an entire SoC with multiple cores needs a very large FPGA. The FU540 was prototyped on the VC707 board ($3500) but the FU740 and FU840 have needed the VCU118 board ($7915).

There is no attempt to make those cores run as quickly as possible. They use RTL designed for the eventual SoC which is not optimal for FPGA. Even if they run at 30 MHz or 50 MHz that's still massively faster than Verilator, and is good enough to boot Linux and run SPEC and so forth with performance per clock representative of the final 1.5 GHz (or whatever) SoC. In fact those FPGA prototypes deliberately slow down RAM access to be proportional to how it will be on the SoC.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 15172
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #162 on: April 27, 2020, 01:37:05 pm »
Quote
There is no attempt to make those cores run as quickly as possible. They use RTL designed for the eventual SoC which is not optimal for FPGA. Even if they run at 30 MHz or 50 MHz that's still massively faster than Verilator, and is good enough to boot Linux and run SPEC and so forth with performance per clock representative of the final 1.5 GHz (or whatever) SoC.

Yeah I agree with this. But I unfortunately have no means of knowing what exactly I could get directly on silicon for now, so FPGA prototyping is all I have to get an approximate, and relative, idea.
(I would need access to Synopsys - or similar - tools along with some PDK, which I don't have at the moment.)

I have already noticed that some logic paths for my implementation, synthesized on FPGA, were unnecessarily long compared to what I expected, but yes that's due to how FPGA slices are designed. My goal is not to tweak my RTL to make it specifically efficient on FPGAs either, I just want a relative idea. If ~100MHz on a typical FPGA for a ~1 CPI, pipelined RV32 core is already reasonable, then I'm fine for now.

I haven't even tested the 64-bit version yet on FPGA (my core is generic enough to implement both RV32 and RV64), but I expect it to reach much lower frequencies, and the 30 MHz to 50 MHz figure you're giving is likely what I'll get.

I'd be curious to see how fast it could run on silicon, say on a 45nm process...
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 15172
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #163 on: April 27, 2020, 02:03:53 pm »
At this point, that's what I managed to get on a Spartan 6 LX 25 for the whole core (including BHT, BTB and register file):
(and yes, this is with a large BTB of 512 entries - I know what you're going to say - I'll likely rework this later on... but at this point, I'm not too shocked by the 90Kbits required.)

  • 2205 LUTs
  • 5 BRAM blocks (90Kbits)
  • 12 DSP48A1 slices
  • ~111 MHz max.

« Last Edit: April 27, 2020, 02:12:16 pm by SiliconWizard »
 

Offline 0db

  • Frequent Contributor
  • **
  • Posts: 336
  • Country: zm
Re: The RISC-V ISA discussion
« Reply #164 on: April 27, 2020, 04:55:59 pm »
Quote
The popular picorv32 core runs at 300 or 400 MHz in cheap FPGAs

Which cheap FPGAs?
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4404
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #165 on: April 27, 2020, 05:27:26 pm »
Quote
The popular picorv32 core runs at 300 or 400 MHz in cheap FPGAs

Which cheap FPGAs?

7-Series Xilinx
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 15172
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #166 on: April 27, 2020, 05:34:38 pm »
That's stated there: https://github.com/cliffordwolf/picorv32
The core should run @300-400MHz on a Xilinx Artix-7 or something. (Note that I don't really consider the Artix-7 a "cheap" FPGA, but it's all relative. Compared to the high-end FPGAs that are often used for CPU prototyping, such as the Virtex series, it sure is cheap.)

Note that the goal is entirely different. It's designed with an average CPI of 4, and they say it's optimized for size and speed, not performance.
This core running  @400MHz will have approx. the same performance as a pipelined core running @100MHz with an average CPI close to 1.
So it's an exercise in simplicity, and I get the idea. Implementing a more complex, pipelined core, as I experienced, is not an easy task.

In terms of features, I'd say my core is somewhere between their "regular" and "large" variants (917 and 2019 LUTs respectively), so with my ~2200 LUTs (and much better performance/MHz), my approach doesn't look too bad.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4404
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #167 on: April 27, 2020, 08:27:22 pm »
Quote
This core running  @400MHz will have approx. the same performance as a pipelined core running @100MHz with an average CPI close to 1.

Yes, I said it uses about 4 clock cycles per instruction.

The idea is that if you're using it to control other things in the FPGA then simple things often will run at 400 MHz and you don't want the CPU slowing it down, or have to deal with different clock domains.
 

