Author Topic: The RISC-V ISA discussion  (Read 32866 times)


Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
The RISC-V ISA discussion
« on: December 27, 2019, 04:46:35 pm »
(Foreword: not sure this is the appropriate section, but no other really seems more appropriate so...)

I've been seriously considering and studying the RISC-V ISA lately. I also started developing a cycle-accurate CPU "emulator" (simulator may be a better word?), meant to be useful for testing new ideas, benchmarking, etc. (and the first target is RISC-V, but it won't be limited to that.)

So while taking a closer look at the RISC-V ISA, I have a few remarks/questions... if anyone has worked with it and has experience/insight, it would be great if they could chime in. The discussion can pretty much follow on to many other aspects of it. Thought this could be interesting.

My first remarks:

1. Looks like a very nice "exercise" in simplicity. I like the "minimalist" approach. Makes implementing it pretty straightforward.
2. The minimalism looks a bit too much to me on some points. A couple examples:
2.1. The "bit manipulation" extension is not part of the base ISA. I personally think this decision is a bit too drastic. Bit manipulation can definitely be pretty useful in many cases (I'm thinking of some instruction akin to "clz" for instance... or byte swaps, bit reverse, etc.) Could be debated, but what's worse, this "B" extension is not even defined yet. I really think this is a problem at this point, because it's (in my eyes) part of basic operations and even if it's an extension, it should have been defined already IMO. As it is, core designers are likely to define their own extensions with this, and this is going to lead to useless fragmentation for something that again, seems basic to me.
2.2. The "no flag register" approach is interesting, but it makes some operations pretty clunky. For instance, working with integers wider than the native ISA width. No "add with carry" or anything like this. Would be interesting to see how you guys (experienced with RISC-V) would implement it and how much more efficient (or not?) it would be with at least additional operations with carry. You may say this could be a further extension, but again I think this is pretty basic?

I'll probably have tons of other remarks and questions later on, but I'd be interested in reading opinions on these first two to begin with.
« Last Edit: December 27, 2019, 07:36:45 pm by SiliconWizard »
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 12390
  • Country: us
    • Personal site
Re: The RISC-V ISA discussion
« Reply #1 on: December 27, 2019, 05:27:00 pm »
Well, what is in the main set and what is an extension is a matter of preference. If you include everything into the basic set, then you will make basic implementations of the ISA much harder. And I personally appreciate the simplicity and ease of implementation of the basic set.

There are no advanced instructions in Cortex-M0+ either. You just go to the higher end core when you need them. Same with RISC-V, you go to a core that also implements an extension.

Yes, it sucks that extensions sit undefined for years. I believe the main sticking point is a lack of confidence that a specific implementation or set of instructions is good. Hopefully, with more and more RISC-V devices appearing, there will be more push to standardize things.

Add with carry can be somewhat efficiently implemented using SLTU and the like. It is not as efficient as far as the number of instructions goes, but if you are doing an optimized microarchitecture, it makes for a much easier implementation.
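To make the SLTU trick concrete, here is a C sketch (the function name is mine, not from any library) of a 64-bit add built from 32-bit halves the way a flagless ISA has to do it; the unsigned compare is exactly what SLTU computes:

```c
#include <stdint.h>

/* 64-bit add assembled from 32-bit halves, mirroring the flagless
   RV32I sequence: the carry out of the low word is recovered with
   an unsigned compare (exactly what SLTU computes), since the
   32-bit sum is smaller than an operand precisely when it wrapped. */
uint64_t add64_sltu(uint32_t a_lo, uint32_t a_hi,
                    uint32_t b_lo, uint32_t b_hi)
{
    uint32_t lo    = a_lo + b_lo;         /* add  */
    uint32_t carry = lo < a_lo;           /* sltu */
    uint32_t hi    = a_hi + b_hi + carry; /* add ; add */
    return ((uint64_t)hi << 32) | lo;
}
```

Four data-processing operations instead of the two an ADC-style ISA needs, which is the overhead being discussed.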
Alex
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #2 on: December 27, 2019, 07:17:29 pm »
I think it's pretty amazing that despite being fairly minimalist with only 37 instructions a compiler will generate from C code (so leaving out fence, system call, debugger call, the CSR instructions), RV32I has everything necessary to efficiently support a modern software stack.

You could even make it a bit more minimalist without any great harm. I'd suggest, for example, leaving out all the "immediate" instructions except addi. Boom! You're now down to 29 instructions. And you've freed up 3% of the opcode space at a stroke. The cost? One instruction to load the desired immediate value into a register and then use the register-to-register version of the instruction instead.

Here are some instruction frequency stats I gathered from the RISC-V Debian distro with the standard packages and an assortment of extras. Format: percentage of total instructions, mnemonic, raw instruction count. I've listed the top 16 instructions in full, but after that only the remaining register-immediate ones.

 16.224593   addi   2528047
 15.237536    jal   2374248
 11.123998  auipc   1733294
  9.981167     ld   1555223
  6.658275    beq   1037464
  4.305509    bne    670866
  3.687067     sd    574503
  3.418121    lbu    532597
  3.376591   jalr    526126
  2.435197     lw    379442
  2.357368    lui    367315
  1.800274     sb    280511
  1.768576  addiw    275572
  1.472592     sw    229453
  1.430902   slli    222957
  1.314809   andi    204868
:
  0.433172   srli     67495
  0.296973    ori     46273
  0.244295   xori     38065
  0.192047  sltiu     29924
  0.131027   srai     20416
  0.011071   slti      1725

addi is *the* most popular instruction. This happens on other code bases I've looked at as well. On Fedora addi comes in slightly behind jal. Part of this is that addi does triple duty as both the "move register" instruction and the "load immediate" instruction (both of which could be done by other instructions such as ori instead), but incrementing and decrementing loop variables and the stack pointer is so common anyway that addi would always be among the top instructions. This is 64 bit code, so addiw also makes a showing. If you want to think about RV32I then probably just lump addi and addiw together and call it 18%.

What about the others? slli+andi+srli+ori+xori+sltiu+srai+slti together come to 4.05% of all instructions. That's more than the 3.125% of the opcode space they take up (along with addi), but not a lot more. If you left them all out then RISC-V programs would get at most 4% bigger (less, because the same constant could often be loaded once and left in a register, of which there are usually plenty, to be used several times), and probably no more than 1% slower (because the loading of the constant could often be done outside of a loop).

Do I seriously suggest ripping those immediate instructions out of the standard? No, of course not. The standard is ratified :-) And they are carrying their weight, collectively, even if ori, xori, sltiu, srai, slti individually are not. It would also make the hardware *more* complex to disable them, given that the ALU supports those operations, and the data path for immediates from the instruction decoder to the ALU has to exist anyway.

It is however a simple mathematical fact that an immediate instruction takes up 128x more encoding space than the corresponding instruction with two register sources. We can add hundreds and hundreds of R-type instructions in future without problems, but it's going to need a very strong justification to add more immediate instructions -- at least within the 32 bit opcode space. Future 48 bit, 64 bit or longer instructions are a different matter.

I make an exception for the shift instructions. slli, srli, srai don't use the entire 12 bit immediate field, but only enough bits to encode a number up to the register size -- 5 bits for RV32, 6 for RV64. There is room to add more than 100 "shift-like" instructions in the unused all-zero bits of the slli and srli encodings. (srai already uses one of these). The proposal for the BitManip extension adds a number of "shift-like" instructions with immediate versions e.g. sloi, sroi, rori, grevi, gorci.

All this, I think, demonstrates that while RV32I is fairly minimal, it could be made significantly more minimal without huge harm to code size or speed.
 
The following users thanked this post: DIPLover, Xenoamor, I wanted a rude username

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #3 on: December 27, 2019, 08:22:19 pm »
Well, what is in the main set and what is an extension is a matter of preference. If you include everything into the basic set, then you will make basic implementations of the ISA much harder. And I personally appreciate the simplicity and ease of implementation of the basic set.

Well of course not everyone will have the same opinion of what should be the minimal set, and it will largely depend on the kind of code they tend to work on.
The RISC-V idea is to put a minimal subset in the base set (I/E), with which you can implement everything (except for the more hardware-related extensions such as "A"). Additional extensions (except again the hardware-related ones) are for performance only. You can absolutely implement FP in software with RV32/64I, for instance.

So yes, it all comes down to what you consider important for performance or not. As I said, for instance, I wouldn't have a problem with bit manipulation having its own extension (although I may have done it differently, but that's preferences, as you said). I just think it's past time it got defined. I understand the whole idea of statistically evaluating the use of given instructions and deciding which ones to include based on that, but I also think this approach is not without flaws.

There are no advanced instructions in Cortex-M0+ either. You just go to the higher end core when you need them. Same with RISC-V, you go to a core that also implements an extension.

OK, but the difference here is not as drastic. Cortex-M0 has (I don't know the difference with the M0+? Is the instruction set smaller than in the M0?) "add with carry" instructions and clz (I think), for instance, which were in question here.

Yes, it sucks that extensions sit undefined for years. I believe the main sticking point is a lack of confidence that a specific implementation or set of instructions is good. Hopefully, with more and more RISC-V devices appearing, there will be more push to standardize things.

Certainly. I don't quite know how priorities at the RISC-V Foundation level are defined though. I'd be interested in understanding what drives them. I'd suspect that they are largely influenced by the "main" big members.

Add with carry can be somewhat efficiently implemented using SLTU and the like. It is not as efficient as far as the number of instructions goes, but if you are doing an optimized microarchitecture, it makes for a much easier implementation.

Well, sure, it would use 'sltu'. As for being much easier to implement... that seems slightly exaggerated. Handling a carry flag is pretty cheap IMO. You get it for almost no added cost with pretty much any multi-bit adder. Adding a couple of instructions (which would be derivatives of the normal add anyway) wouldn't massively hurt anything either.

Just a small example.
Consider the very simple code below, compiled for a 32-bit target:
Code:
uint64_t Add64(uint64_t n1, uint64_t n2)
{
return n1 + n2;
}

RV32I:
Code:
mv a5,a0
add a0,a0,a2
sltu a5,a0,a5
add a1,a1,a3
add a1,a5,a1
ret

NanoMIPS (you can see that it's almost exactly the same as with RV32I):
Code:
addu $a2,$a0,$a2
addu $a1,$a1,$a3
sltu $a4,$a2,$a0
move $a0,$a2
addu $a1,$a4,$a1
jrc $ra

ARM Cortex-M4:
Code:
adds r0, r0, r2
adc r1, r3, r1
bx lr

ARM Cortex-M0 (don't know the difference between adc and adcs, but it seems pretty equivalent to -M4):
Code:
adds r0, r0, r2
adcs r1, r1, r3
bx lr

It's basically 5 instructions (not counting ret) for RV32I (and interestingly NanoMIPS, which looks pretty close anyway - not that surprising), and 2 for Cortex-M0 and -M4.
No matter how efficient your implementation is, it's hard to beat that. If you're using a lot of large integer operations in some code, it'll make a pretty significant difference.

Not to mention that beyond code size (which can be mitigated using compressed instructions), you potentially get additional performance issues if you need more instructions to do the same operation. Data hazards are a lot more likely to occur between successive instructions and may not all be solvable without stalling the pipeline...
« Last Edit: December 27, 2019, 08:30:24 pm by SiliconWizard »
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 12390
  • Country: us
    • Personal site
Re: The RISC-V ISA discussion
« Reply #4 on: December 27, 2019, 08:29:47 pm »
Adding carry and other flags has significant implications on the hardware design.

Having separate flags introduces additional pipeline hazards, which may make efficient implementation very hard.

The idea here is that the microarchitecture will take care of multiple instructions that can be fused together and executed as one.

Who cares how many instructions there are if they take the same amount of time to execute?

Of course, the simplest implementations won't do any of this and will suffer a bit. But you shouldn't design modern architectures around the simplest implementations.
Alex
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #5 on: December 27, 2019, 08:34:52 pm »
2.1. The "bit manipulation" extension is not part of the base ISA. I personally think this decision is a bit too drastic. Bit manipulation can definitely be pretty useful in many cases (I'm thinking of some instruction akin to "clz" for instance... or byte swaps, bit reverse, etc.) Could be debated, but what's worse, this "B" extension is not even defined yet. I really think this is a problem at this point, because it's (in my eyes) part of basic operations and even if it's an extension, it should have been defined already IMO. As it is, core designers are likely to define their own extensions with this, and this is going to lead to useless fragmentation for something that again, seems basic to me.

Unfortunately we don't have time machines. Would it have been good to have bitmanip instructions ready to go in 2015? Sure, of course. Should the ISA announcement, formation of the Foundation etc have been delayed to 2019 or 2020 to allow time for bitmanip and vectors to be designed and added? HELL NO.

There is an element of things taking longer now because it's not just Krste, Andrew and Yunsup sitting around a table and deciding by fiat what is in and what is out. The ISA is owned by a community consisting of dozens (hundreds) of organisations now, and it's necessary to get input from a lot of people as to what they'd like to see in there for their applications, evaluate how useful each thing is, how synergistic different things are, and vote on inclusion and how to organize into various sub-extensions.

When it *was* just Krste, Andrew, and Yunsup, they took the time to propose something, implement it in actual chips, add support to gcc, and compile and run software. And then throw that away and try something else.

Even now, the strong preference is to actually implement proposed extensions in real chips (preferably multiple independent implementations) and gain experience with it before ratifying it. It's not just throwing a spec over the wall and hoping the designers were sufficiently prescient.

I think also there is an element of people simply not realizing how long things take even in a closed-doors and unannounced effort at Intel or ARM.

I've heard from people who previously worked for ARM that the Aarch64 project was started in 2001, soon after the AMD64 specification was published and well before the first Opteron or Athlon64 processors were released in 2003. This was successfully kept secret until the ARMv8.0 spec was published in October 2011 -- after the RISC-V project was started. There were no Aarch64 chips until Apple shocked everyone with the iPhone 5s in September 2013. The other phone makers were all simultaneously saying both "Why on earth would you need 64 bits in a phone? It's just a marketing gimmick." and "We'll have one in six months". Actually, it was 19 months until the Galaxy S6 in April 2015.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #6 on: December 27, 2019, 08:57:01 pm »
Add with carry can be somewhat efficiently implemented using SLTU and the like. It is not as efficient as far as the number of instructions goes, but if you are doing an optimized microarchitecture, it makes for a much easier implementation.

Well, sure, it would use 'sltu'. As for being much easier to implement... that seems slightly exaggerated. Handling a carry flag is pretty cheap IMO. You get it for almost no added cost with pretty much any multi-bit adder. Adding a couple of instructions (which would be derivatives of the normal add anyway) wouldn't massively hurt anything either.

It's not the extra instructions, it's having to add a flags register as an extra result of instructions, and an extra source for some instructions. That's expensive, a huge cost and bottleneck, especially once you go superscalar or out of order, and it's simply not useful very often.

Yes, it takes four instructions on RISC-V or NanoMIPS (which btw exists as precisely one licensed RTL core at the moment with, as far as I know, exactly one user: Mediatek -- you can't buy a chip or a board with a chip on it) vs two instructions on ARM. But how often do you need it? (The "mov" is an artifact of the calling convention, and will disappear if the function is inlined.)

Sure, I was doing add with carry all the time on 8 bit CPUs, but on 32 bit or 64 bit CPUs it's a rarity.

The only time it would come close to being performance-critical is for bignum libraries, and in that case it's going to be dominated by the loads and stores even if the data is coming from L1 cache (or SRAM). So even with a carry flag it's four instructions per word (two loads, the adc, a store) rather than one. And then there's a couple of instructions of loop overhead (you can unroll, but there's still overhead). RISC-V just adds a constant 2 instructions per word on top of that.
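As a sketch of what that per-word overhead looks like in the bignum case (the function name is mine, not from any bignum library), here is the flagless add loop in C:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a bignum add on a flagless ISA: the carry is rebuilt
   each word with unsigned compares (the SLTU idiom). Versus a
   carry-flag ISA's load/load/adc/store per word, this needs two
   extra data-processing instructions per word. */
void bignum_add(uint32_t *r, const uint32_t *a,
                const uint32_t *b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t s  = a[i] + carry;
        uint32_t c1 = s < carry;   /* wraps only if carry==1 and a[i] was max */
        r[i] = s + b[i];
        uint32_t c2 = r[i] < s;    /* carry out of s + b[i] */
        carry = c1 | c2;
    }
}
```

With the loads, stores, and loop overhead counted, the difference shrinks to well under the 5-vs-2 ratio the isolated Add64 example suggests.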

If you're doing a specialized embedded CPU and multi-precision arithmetic is a dominant part of your workload, then you can add custom instructions for it.
 

Offline emece67

  • Frequent Contributor
  • **
  • !
  • Posts: 614
  • Country: 00
Re: The RISC-V ISA discussion
« Reply #7 on: December 27, 2019, 09:33:43 pm »
.
« Last Edit: August 19, 2022, 02:43:40 pm by emece67 »
 

Offline I wanted a rude username

  • Frequent Contributor
  • **
  • Posts: 669
  • Country: au
  • ... but this username is also acceptable.
Re: The RISC-V ISA discussion
« Reply #8 on: December 27, 2019, 09:47:44 pm »
It is instructive to reflect that the legendary Alpha/AXP, HPC king of the 1990s, did not even have an integer DIV instruction.

"Perfection is attained not when there is no longer anything to add, but when there is no longer anything to take away." (Antoine de Saint-Exupéry)
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 12390
  • Country: us
    • Personal site
Re: The RISC-V ISA discussion
« Reply #9 on: December 27, 2019, 09:49:51 pm »
They did not do it because of some "perfection", but because they could not implement it economically.

Cortex-M0+ also does not have an integer divide instruction, and it sucks.
Alex
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #10 on: December 27, 2019, 09:51:04 pm »
Add with carry can be somewhat efficiently implemented using SLTU and the like. It is not as efficient as far as the number of instructions goes, but if you are doing an optimized microarchitecture, it makes for a much easier implementation.

Well, sure, it would use 'sltu'. As for being much easier to implement... that seems slightly exaggerated. Handling a carry flag is pretty cheap IMO. You get it for almost no added cost with pretty much any multi-bit adder. Adding a couple of instructions (which would be derivatives of the normal add anyway) wouldn't massively hurt anything either.

It's not the extra instructions, it's having to add a flags register as an extra result of instructions, and an extra source for some instructions. That's expensive, a huge cost and bottleneck, especially once you go superscalar or out of order, and it's simply not useful very often.

I understand that. Yes that would be instructions with the equivalent of 3 sources instead of 2. There are such instructions in the FP extensions by the way (fused multiply add), so if you're implementing FP extensions, you'll need the logic to handle 3 sources anyway. Given that 3-source instructions are part of extensions already, it would make sense to put instructions using carry in an extension as well, I concede that.

As to which approach would yield the most efficient execution, that's really not trivial; it would depend on a number of factors, but I would guess that in simpler architectures with in-order execution, it may be more efficient. The simulator I'm working on is meant to help figure this out on real code.

And as to the "rarity" of needing that... I don't know. There are certainly a number of applications where using 64-bit integers in 32-bit code, for instance, is pretty common. Whether this would be performance critical largely depends on the application.

Again I agree that the third source is added pain to handle, but I'll point out again that this is already the case if you implement the F extension?

 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #11 on: December 27, 2019, 09:52:09 pm »
By the way, the RISC-V Vector extension proposal includes add-with-carry, using the mask input as carry-in instead of as a mask:

# vd = vs2 + vs1 + v0.LSB
vadc.vvm   vd, vs2, vs1, v0  # in the base vector encoding with 32 bit opcodes, the mask can only come from v0

# vd = carry_out(vs2 + vs1 + v0.LSB)
vmadc.vvm   vd, vs2, vs1, v0  # produces the carry out into vd (has to be v0 or moved to v0 to use it later)

Note that this enables doing a number of multi-precision adds in parallel, not one huge multi-precision add across the vector.
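A scalar C model of one lane of those vector instructions may help; this follows my reading of the draft proposal quoted above, so treat the exact semantics as an assumption rather than ratified behavior:

```c
#include <stdint.h>

/* Scalar model of one lane of the proposed vadc/vmadc pair:
   r = vs2 + vs1 + carry-in LSB, with the lane's carry out
   written through cout. Sketch against the draft Vector
   proposal, not ratified behavior. */
uint32_t vadc_lane(uint32_t vs2, uint32_t vs1, uint32_t v0_lsb,
                   uint32_t *cout)
{
    uint32_t s = vs2 + vs1;
    uint32_t c = s < vs2;          /* carry out of vs2 + vs1 */
    uint32_t r = s + (v0_lsb & 1); /* add the mask bit as carry in */
    c |= r < s;                    /* carry out of the carry-in step */
    *cout = c;                     /* what vmadc.vvm would produce */
    return r;                      /* what vadc.vvm would produce */
}
```

Running this per lane, with the carry-out vector fed back in as the next word's carry-in mask, is what enables several multi-precision adds in parallel.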

 
The following users thanked this post: SiliconWizard

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #12 on: December 27, 2019, 10:06:05 pm »
I understand that. Yes that would be instructions with the equivalent of 3 sources instead of 2. There are such instructions in the FP extensions by the way (fused multiply add), so if you're implementing FP extensions, you'll need the logic to handle 3 sources anyway. Given that 3-source instructions are part of extensions already, it would make sense to put instructions using carry in an extension as well, I concede that.

That's a different register file and a different ALU.

Supporting 3 input operands is expensive, but FMA is maybe *the* most common FP operation, so it's extremely important to support it efficiently. ADC, on the other hand, is a rarity in most code.

The Bitmanip extension proposal includes some operations that need 3 integer register inputs: cmix (conditional mix), cmov (conditional move) and funnel shifts (fsl, fsr, fsri). Some people may find it worthwhile to support those, but I think they're *extremely* unlikely to find their way into the standard B extension that is what general purpose operating systems such as Linux will support.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #13 on: December 27, 2019, 10:32:28 pm »
I understand that. Yes that would be instructions with the equivalent of 3 sources instead of 2. There are such instructions in the FP extensions by the way (fused multiply add), so if you're implementing FP extensions, you'll need the logic to handle 3 sources anyway. Given that 3-source instructions are part of extensions already, it would make sense to put instructions using carry in an extension as well, I concede that.

That's a different register file and a different ALU.

Oh. Right. So you'd have to duplicate it, but I guess the structure would be pretty similar. I still don't have a precise idea of how much area / how many LEs it takes to implement that. I'm currently working on a typical 5-stage pipeline with 2 data sources and 1 destination as per the base RISC-V ISA. But at this point I have no precise idea how much even that would take in hardware.

I have implemented pipelines in the past but none with data hazards (or very simple ones) or branch hazards, so I'm still wrapping my head around that.

Supporting 3 input operands is expensive, but FMA is maybe *the* most common FP operation, so it's extremely important to support it efficiently. ADC, on the other hand, is a rarity in most code.

Yep, so the judgment was that doing this would indeed improve performance even with the added complexity.

The Bitmanip extension proposal includes some operations that need 3 integer register inputs: cmix (conditional mix), cmov (conditional move) and funnel shifts (fsl, fsr, fsri). Some people may find it worthwhile to support those, but I think they're *extremely* unlikely to find their way into the standard B extension that is what general purpose operating systems such as Linux will support.

Interesting; I have only read the ratified documents so far, so I don't know anything about the current proposals.
From what you say about the Bitmanip extension, it looks like the proposal includes a lot of stuff (probably way beyond what I had in mind) and that may be one of the reasons it takes time to finalize...
 

Offline I wanted a rude username

  • Frequent Contributor
  • **
  • Posts: 669
  • Country: au
  • ... but this username is also acceptable.
Re: The RISC-V ISA discussion
« Reply #14 on: December 27, 2019, 10:38:00 pm »
They did not do it because of some "perfection", but because they could not implement it economically.

Just because they couldn't change the laws of physics doesn't mean their design was bad.

Obviously if the cost had been the same as MUL, they would have included it. But it isn't, not just in space but also time, which would have flow-on effects for the pipeline. And anyway, in server land floating point division is more useful.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #15 on: December 27, 2019, 10:53:16 pm »
As to which approach would yield the most efficient execution is really not this trivial, it would depend on a number of factors, but I would guess that in simpler architectures with in-order execution, that may be more efficient. The simulator I'm working on is meant to help figuring this out on real code.

A simulator is of course specific to a particular microarchitecture.

What is best is really not trivial at all. Just counting instructions or clock cycles is not enough. Adding hardware for extra instructions can result in a slower maximum MHz. Decreasing the number of clock cycles by 1% is not useful if you then have to run the clock 1% slower (or more). For battery powered things, the additional gates also add to the energy use, whether you are using those instructions or not. No one does clock gating on just the "clz" circuit :-) The extra die area also adds to the cost of each chip.

For applications processors in current mobile phones and up, both Intel and ARM have decided to take a kitchen sink approach, and mandate that all CPUs have every instruction anyone ever thought of. Everyone gets SIMD. Everyone gets clz and popcount and sha and aes and ....

Are they right? Maybe. Maybe not.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #16 on: December 27, 2019, 11:25:21 pm »
As to which approach would yield the most efficient execution is really not this trivial, it would depend on a number of factors, but I would guess that in simpler architectures with in-order execution, that may be more efficient. The simulator I'm working on is meant to help figuring this out on real code.

A simulator is of course specific to a particular microarchitecture.

Well, the whole idea is to be able to try various instruction sets but also various microarchitectures. Implementing just one microarchitecture would serve limited purpose, as comparing different instruction sets that way would inevitably be biased.

What is best is really not trivial at all. Just counting instructions or clock cycles is not enough. Adding hardware for extra instructions can result in a slower maximum MHz. Decreasing the number of clock cycles by 1% is not useful if you then have to run the clock 1% slower (or more). For battery powered things, the additional gates also add to the energy use, whether you are using those instructions or not. No one does clock gating on just the "clz" circuit :-) The extra die area also adds to the cost of each chip.

A lot of factors, for sure. Decreasing the CPI (and the number of required instructions, as long as it doesn't adversely increase the CPI) can be beneficial even if the core won't clock as fast, as you can do the same amount of work at a lower frequency. For that to be interesting, of course, you need applications in which it makes a difference, and power-consumption-wise it could indeed not be beneficial... or it could be. (But running at lower frequencies could have other benefits anyway.) So yeah, it so much... depends!

For applications processors in current mobile phones and up, both Intel and ARM have decided to take a kitchen sink approach, and mandate that all CPUs have every instruction anyone ever thought of. Everyone gets SIMD. Everyone gets clz and popcount and sha and aes and ....

Are they right? Maybe. Maybe not.

As we said above, there are just too many factors to consider, so they just go for the general-purpose approach that will get them the most customers, and that is relatively easy to handle, not requiring a huge number of potential variants (which can be a problematic point with RISC-V).

But we are talking about completely different things... RISC-V is just an ISA, not actual chips. Intel sells mostly chips; ARM, IP; but still, too high a level of customization would probably be a nightmare for them to handle. Having completely "modular" instruction sets is pretty neat, but very tough to handle when you sell something IMO.

At SiFive, you are at an interesting stage, as you are actually selling stuff built on RISC-V, and you probably encounter the issues I'm talking about above... picking just the right set of extensions, managing queries from customers that ask for non-standard stuff... By the way, I know you have a number of "off-the-shelf" base cores, but they don't include ALL extensions of course. How do you handle it if some customer asks for a specific extension that your cores don't support? Is it a no-no, or is it a yes (and you have most ratified extensions ready), or do you even design custom extensions in some cases?
« Last Edit: December 27, 2019, 11:26:56 pm by SiliconWizard »
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4555
  • Country: us
Re: The RISC-V ISA discussion
« Reply #17 on: December 28, 2019, 06:53:34 am »
Its limitations are not so different from those of other RISC architectures.

MIPS lacks a carry bit and multi-precision math is a bit weird.

Cortex-M0 and M0+ are missing a painful number of "expected" instructions (everyone notices division, but bit tests are pretty painful, too). Neither the "and with immediate" nor the "shift the register until the desired bit is in the carry or sign position" methods common on other ARM architectures are there.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #18 on: December 28, 2019, 07:41:57 am »
Cortex-M0 and M0+ are missing a painful number of "expected" instructions (everyone notices division, but bit tests are pretty painful, too). Neither the "and with immediate" nor the "shift the register until the desired bit is in the carry or sign position" method common on other ARM architectures is there.

That doesn't sound right. As with any Thumb1 implementation, there is the "LSL Rd, Rm, #bits" instruction ("MOVS Rd, Rs, LSL #bits" in unified syntax), which sets the N and Z flags as you would expect. Opcode 00000bbbbbsssddd. You can then follow it with a BMI or BPL as expected.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0432c/CHDCICDF.html
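For anyone more comfortable in C than in Thumb assembly, the shift-into-sign idiom described above can be mimicked like this (an illustrative sketch; the function name `bit_is_set` is invented here):

```c
#include <stdint.h>

/* C analogue of the Thumb1 bit-test idiom: shift bit n of x into the
 * sign position (as "LSLS Rd, Rm, #(31-n)" does, setting the N flag),
 * then branch on the sign (BMI/BPL). Here the sign of the shifted
 * value plays the role of the N flag. */
static int bit_is_set(uint32_t x, unsigned n)
{
    return (int32_t)(x << (31u - n)) < 0;
}
```

On a two's-complement machine this boils down to a shift and a sign test, matching the two-instruction Thumb sequence.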
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #19 on: December 28, 2019, 07:52:10 am »
Adding carry and other flags has significant implications for the hardware design.
Having separate flags introduces additional data hazards, which may make an efficient implementation very hard.

Yup, precisely.
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7837
  • Country: pl
Re: The RISC-V ISA discussion
« Reply #20 on: December 28, 2019, 09:56:29 am »
The idea here is that the microarchitecture will take care of multiple instructions that can be fused together and executed as one.
That's gonna be tricky for ADC, as the decoder needs to remember which register holds the SLTU-computed carry bit of which addition and then find the ADD instructions that consume this register, which may come in a few permutations, possibly partly before the SLTU itself, and possibly spread out and interleaved with unrelated code if the compiler targets a superscalar core. Sounds fun; I'm not sure if even x86 performs such complex fusions.
« Last Edit: December 28, 2019, 09:59:23 am by magic »
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4555
  • Country: us
Re: The RISC-V ISA discussion
« Reply #21 on: December 28, 2019, 10:14:39 am »
Quote
there is the "LSL Rd, Rm, #bits" instruction ("MOVS Rd, Rs, LSL #bits" in unified syntax), which sets the N and Z flags as you would expect.
Hmm.   You're correct.  I wonder what I was thinking of?  :-(
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #22 on: December 28, 2019, 02:52:28 pm »
The idea here is that the microarchitecture will take care of multiple instructions that can be fused together and executed as one.
That's gonna be tricky for ADC, as the decoder needs to remember which register holds the SLTU-computed carry bit of which addition and then find the ADD instructions that consume this register, which may come in a few permutations, possibly partly before the SLTU itself, and possibly spread out and interleaved with unrelated code if the compiler targets a superscalar core. Sounds fun; I'm not sure if even x86 performs such complex fusions.

I've heard/read about proposals to design RISC-V cores with instruction fusion like this, but have never seen one actually work. It sure sounds pretty complex to implement correctly, and I'm frankly not convinced it would end up being simpler than just handling 3-source instructions (which already exist in some RISC-V extensions anyway...)

Anyway, it's all an exercise in making a good compromise between simplicity and performance. A pretty tough endeavor that's obviously bound not to please everyone.

I completely understand the whole idea of having a simple ISA (RISC-V in that regard is very much in the RISC spirit of the early days, whereas most RISC processors have now become monsters and we can question what the "R" even means anymore). But putting all the work for performance on the microarchitecture's shoulders is debatable as well. One point of RISC-V is to make it very easy and lightweight to implement, but if we then need to design complex microarchitectures to really make it efficient, is the compromise really always worth it? At least it certainly doesn't look as easy as what we may hear here and there...

 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #23 on: December 28, 2019, 02:59:03 pm »
Adding carry and other flags has significant implications for the hardware design.
Having separate flags introduces additional data hazards, which may make an efficient implementation very hard.

Yup, precisely.

As we said, it's basically just handling 3 sources instead of 2. Sure, it adds complexity, but "very hard" is a bit much here. Certainly, though, if you're looking to design very small cores, that would be something to avoid.

I was thinking of another way to implement ADC not requiring a separate flag register. Not ultra efficient, but simpler?
The idea would just be to make all integer registers 1 bit wider (i.e. 33 bits for RV32). The destination of any ADD would naturally receive the carry in its MSB (bit 32). Thus a further ADC using this register as a source would not require handling a third source. This extra bit could maybe be used for other purposes as well? I know it sounds a bit wasteful, but it looks much simpler to implement. And yes, for those who consider ADC to be a rarity these days, that would probably make them cringe. ;D
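A toy software model of that 33-bit-register idea, assuming nothing beyond the post above (the `reg33` type and function names are invented for illustration): each 33-bit register is held in a `uint64_t`, with bit 32 receiving the carry out of the last ADD.

```c
#include <stdint.h>

#define MASK32 0xFFFFFFFFull

/* bits 0..31 = value, bit 32 = carry out of the last ADD */
typedef uint64_t reg33;

/* ADD: sum the 32-bit parts; the carry naturally lands in bit 32. */
static reg33 reg33_add(reg33 a, reg33 b)
{
    return (a & MASK32) + (b & MASK32);
}

/* "ADC-like" step: add the carry bit riding in register c to a,
 * still reading only two register sources. */
static reg33 reg33_addc(reg33 a, reg33 c)
{
    return (a & MASK32) + ((c >> 32) & 1u);
}
```

A 64-bit addition then becomes `lo = reg33_add(a_lo, b_lo); hi = reg33_addc(reg33_add(a_hi, b_hi), lo);` - with the caveat that propagating a carry out of the high word for longer chains would need a little more care.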


 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #24 on: December 28, 2019, 04:27:50 pm »
Exactly what's your interest?
Building an HDL cpu-core with pipeline?
Writing a cycle-accurate pipeline simulator?
Designing an HL compiler, from HL to machine-code?

Each of these fields has its own trade-offs.

But talking about "architecture", I wish I had "See RISC-V Run" (The Book) under my Xmas tree.
Has it already been written? Let's write it! I want it under the tree :D
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #25 on: December 28, 2019, 04:48:49 pm »
Exactly what's your interest?
Building an HDL cpu-core with pipeline?

Likely yes.

Writing a cycle-accurate pipeline simulator?

Yes, but not just for the sake of it. Mainly to be able to try/benchmark new ideas much more easily than having to write HDL for them every time. Then the ideas that would prove interesting could be later implemented in HDL.

Implementing a simulator also helps (at least it helps me) tremendously in understanding how to implement things in HDL when they become this complex.

Designing an HL compiler, from HL to machine-code?

This one would be a big nope. For now anyway.

But talking about "architecture", I wish I had "See RISC-V Run" (The Book) under my Xmas tree.
Has it already been written? Let's write it! I want it under the tree :D

What would you like to read in it?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #26 on: December 28, 2019, 05:28:46 pm »
The idea here is that the microarchitecture will take care of multiple instructions that can be fused together and executed as one.
That's gonna be tricky for ADC, as the decoder needs to remember which register holds the SLTU-computed carry bit of which addition and then find the ADD instructions that consume this register, which may come in a few permutations, possibly partly before the SLTU itself, and possibly spread out and interleaved with unrelated code if the compiler targets a superscalar core. Sounds fun; I'm not sure if even x86 performs such complex fusions.

I've heard/read about proposals to design RISC-V cores with instruction fusion like this, but have never seen one actually work. It sure sounds pretty complex to implement correctly, and I'm frankly not convinced it would end up being simpler than just handling 3-source instructions (which already exist in some RISC-V extensions anyway...)

No one is going to do anything as complex as that.

Instruction fusion is of course a complex thing. The whole point of it is that it allows you to put the complexity of decoding it and the complexity of needing possibly more than two source operands only on the implementation that wants to do that. On others the potentially-fused instructions just execute serially as usual.

That means that one constraint on fusible instruction pairs (or sequences) is that they must modify only one register (or, if they are executed fused, all result registers must be properly written with the intermediate values -- which is a whole other level of complexity).

If we rearrange that 64 bit add example a little...

Code:
add a0,a0,a2
sltu a5,a0,a2
add a5,a5,a3
add a1,a1,a5
ret

... then the sltu and following add can be fused.
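For reference, here is a C rendering of that sequence (the unsigned-compare trick works because an unsigned add that wraps produces a sum smaller than either addend):

```c
#include <stdint.h>

/* 64-bit add on RV32, mirroring the assembly above:
 *   add  a0,a0,a2   ->  lo += b_lo
 *   sltu a5,a0,a2   ->  carry = (lo < b_lo)
 *   add  a5,a5,a3   ->  t = carry + b_hi
 *   add  a1,a1,a5   ->  hi += t
 */
static void add64(uint32_t *lo, uint32_t *hi, uint32_t b_lo, uint32_t b_hi)
{
    *lo += b_lo;
    uint32_t carry = (*lo < b_lo);
    uint32_t t = carry + b_hi;
    *hi += t;
}
```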

I don't expect CPUs will go looking for fusible pairs that are randomly separated in the instruction stream. If you know you're going to be running on a CPU that fuses certain sequences then you tell the compiler to schedule (and register allocate) so as to generate those pairs.

We know recent Intel x86 CPUs fuse cmp/bCC pairs. They don't do it if they are separated.

SiFive's 7-series cores fuse conditional branches that branch forward over exactly one instruction into effectively predicating that following instruction.

I don't know of other RISC-V cores that implement any instruction fusion yet. It's a high end feature for high performance CPUs and virtually all RISC-V cores actually released until now have been aimed at the low end. Several companies and other organizations are working on high performance RISC-V cores, so we'll find out in due course.

Quote
Anyway, it's all an exercise in making a good compromise between simplicity and performance. A pretty tough endeavor that's obviously bound not to please everyone.

Absolutely.

Quote
I completely understand the whole idea of having a simple ISA (RISC-V in that regard is very much in the RISC spirit of the early days, whereas most RISC processors have now become monsters and we can question what the "R" even means anymore). But putting all the work for performance on the microarchitecture's shoulders is debatable as well. One point of RISC-V is to make it very easy and lightweight to implement, but if we then need to design complex microarchitectures to really make it efficient, is the compromise really always worth it? At least it certainly doesn't look as easy as what we may hear here and there...

I think you should look at the situation as it is before performing any such heroics.

Simple RISC-V cores such as the original open source single-issue in-order "Rocket" have both code size and program execution cycles within epsilon (+/- a couple of percent) of Thumb2 cores such as Cortex M0/M3/M4. It depends on the benchmark. Sometimes ARM comes out slightly ahead and sometimes Rocket comes out slightly ahead.

The Rocket cores also come out smaller in silicon area (cheaper, lower energy use) in the same process AND can be clocked higher in the same process. More recent core designs from SiFive, Andes, Western Digital, PULP and others are better again than Rocket.

A lot of people have looked at the FE310 and said "Why on earth would you want to make a microcontroller with only 16 KB of RAM run at 320 MHz?" A fair enough question. The answer is that 320 MHz is simply the speed that all of the chips turned out to run at on the very cheap 180nm process (some do 400 MHz), and there seemed to be no marketing reason to artificially limit it. They run perfectly well at 16 MHz or 80 MHz or whatever if you want (and at lower power levels, of course).

There doesn't seem to be anything else that comes close to matching RV32IMAC and Thumb2 on all axes of code size, clock cycles, and die area.

NanoMIPS, maybe, which looks very very nice in the manual but seems basically stillborn, which is a great pity. There was an announcement of an RTL core in May 2018, and later of it being licensed to MediaTek, but nothing else since then. It might never see the light of day anywhere else. I'd love to be wrong on that, as it seems to be very nice work.

SH4 and ColdFire and Thumb1 have good code size, but having only 2-address instructions loses too much on execution speed unless you go superscalar out-of-order like x86.

Three-address instructions but only 32 bit opcodes loses too much on code size.

Maybe Xtensa's mixed 16 bit and 24 bit instructions comes close. I haven't studied it closely enough to know. I do know RISC-V is getting lots of design wins against Xtensa (and ARC) and also displacing them from existing customers.
 
The following users thanked this post: I wanted a rude username

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #27 on: December 28, 2019, 05:38:10 pm »
I was thinking of another way to implement ADC not requiring a separate flag register. Not ultra efficient, but simpler?
The idea would just be to make all integer registers 1 bit wider (ie. 33 bits for RV32). The destination of any ADD would naturally receive the carry in its MSB (bit 32). Thus a further ADC using this register as a source would not require handling a third source. This extra bit could maybe be used for other purposes as well? I know it sounds a bit wasteful, but it looks much simpler to implement. And yes, for those who consider ADC to be a rarity these days, that would probably make them cringe. ;D

That's certainly a much better solution than a condition code register.

Another thing not noted so far in this thread: RISC-V doesn't have a rotate instruction.

Have at it :-)
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 12390
  • Country: us
    • Personal site
Re: The RISC-V ISA discussion
« Reply #28 on: December 28, 2019, 05:47:48 pm »
Yes, the CPU will not be looking for randomly located bits and pieces. It will be looking for a specific pattern. This is way above your typical home-brew on FPGA, but not uncommon at all for real devices. You would have to do that anyway to compete with x86 and the like, since that type of stuff is where they get all their performance from.

Even with the Cortex-M7 you have a dual-issue pipeline. It has the same instruction set as the Cortex-M4, but your program can literally run 2 times faster if you interleave floating-point and integer instructions. The core does not do it automatically; it expects the programmer or the compiler to take care of that.
« Last Edit: December 28, 2019, 06:10:07 pm by ataradov »
Alex
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #29 on: December 28, 2019, 06:08:52 pm »
A lot of people have looked at the FE310 and said "Why on earth would you want to make a microcontroller with only 16 KB of RAM run at 320 MHz?" A fair enough question. The answer is that 320 MHz is simply the speed that all of the chips turned out to run at on the very cheap 180nm process (some do 400 MHz), and there seemed to be no marketing reason to artificially limit it. They run perfectly well at 16 MHz or 80 MHz or whatever if you want (and at lower power levels, of course).

Yeah, I was one of them! But I get the point, and SRAM is not cheap... (I also understand that SiFive at this point would not gain anything selling off-the-shelf chips potentially "competing" with your custom offerings...)

There doesn't seem to be anything else that comes close to matching RV32IMAC and Thumb2 on all axes of code size, clock cycles, and die area.

NanoMIPS, maybe, which looks very very nice in the manual but it seems basically still-born, which is a great pity. There was an announcement of an RTL core in May 2018, and later of it being licensed to Mediatek, but nothing else since then. It might never see the light of day anywhere else. I'd love to be wrong on that, as it seems to be very nice work.

Yeah, I've played a bit with the NanoMIPS target for GCC to get an idea of what we'd get with real code... But obviously never had the chance to try an actual chip.

SH4 and ColdFire and Thumb1 have good code size but having only 2-address instructions loses too much on execution speed unless you go super scalar out-of-order like x86.

You probably know that both SH2 and SH4 patents have expired, and there is one open core: http://www.j-core.org/ (it still only implements SH2 I think, since SH4's patents had not yet expired when they started, and the project now seems to be on hold...) SH looks interesting. Of course, contrary to RISC-V, they are only 32-bit ISAs. SH5 was an attempt at a 64-bit version, but it never saw the light of day AFAIK.

 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #30 on: December 28, 2019, 06:14:22 pm »
I was thinking of another way to implement ADC not requiring a separate flag register. Not ultra efficient, but simpler?
The idea would just be to make all integer registers 1 bit wider (ie. 33 bits for RV32). The destination of any ADD would naturally receive the carry in its MSB (bit 32). Thus a further ADC using this register as a source would not require handling a third source. This extra bit could maybe be used for other purposes as well? I know it sounds a bit wasteful, but it looks much simpler to implement. And yes, for those who consider ADC to be a rarity these days, that would probably make them cringe. ;D

That's certainly a much better solution than a condition code register.

Glad you didn't reject the idea. Yes it seems pretty easy to implement, would "only" require 1 extra bit, and would have the benefit of holding one flag per register - which would not only avoid an additional data hazard, but would also allow a range of optimizations for complex "bigint" arithmetic. I think it could be implemented as an extension, as the base registers wouldn't actually need to be wider than what is spec'ed in the base ISA. We could just add one register with each bit corresponding to one data register, and address the bits with the same source index as one of the operands.
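A toy model of that side-register idea, with everything here (the carry word, the explicit rc index, all the names) invented purely to illustrate; the proposal above would presumably constrain the carry index to coincide with one of the encoded source registers so that no extra operand field is needed:

```c
#include <stdint.h>

/* Toy model: 32 data registers plus one "carry word" whose bit i holds
 * the carry out of the last add that wrote register i. */
typedef struct {
    uint32_t x[32];
    uint32_t carry; /* bit i = carry flag of register i */
} cpu_state;

static void set_carry(cpu_state *s, int rd, uint64_t sum)
{
    s->carry = (s->carry & ~(1u << rd)) | ((uint32_t)(sum >> 32) << rd);
}

/* ADD that records its carry out in the bit belonging to rd. */
static void add_cc(cpu_state *s, int rd, int rs1, int rs2)
{
    uint64_t sum = (uint64_t)s->x[rs1] + s->x[rs2];
    s->x[rd] = (uint32_t)sum;
    set_carry(s, rd, sum);
}

/* ADC that consumes the carry bit belonging to register rc. */
static void adc_cc(cpu_state *s, int rd, int rs1, int rs2, int rc)
{
    uint64_t sum = (uint64_t)s->x[rs1] + s->x[rs2] + ((s->carry >> rc) & 1u);
    s->x[rd] = (uint32_t)sum;
    set_carry(s, rd, sum);
}
```

One flag per register means independent carry chains for "bigint" work can live side by side without serializing on a single flags register, which is the optimization opportunity mentioned above.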
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #31 on: December 28, 2019, 06:14:41 pm »
What would you like to read in it?

Good stuff like in the book "See MIPS Run": from the idea behind the design to the real applications, with humor and anecdotes  :D
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #32 on: December 28, 2019, 06:25:07 pm »
What would you like to read in it?

Good stuff like in the book "see MIPS run": from the idea behind the design to the real applications, with humor and anecdotes  :D

I haven't read it (so it wasn't ringing a bell), but I will take a look.
 

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: The RISC-V ISA discussion
« Reply #33 on: December 28, 2019, 06:41:06 pm »
One point of RISC-V is to make it very easy and lightweight to implement, but if we then need to design complex microarchitectures to really make it efficient, is the compromise really always worth it?

No, it isn't always worth it. That's why the ARMs in our phones have plenty of dedicated extensions, coprocessors and GPUs, and hardly deserve to be called Reduced anything.

The extremes are low power/simple/small/slow and power hungry/complex/big/fast. You have to choose where you want to be, can't have it all.

And IMO after all these years we already have plenty of good-enough ISAs and CPUs, and the ISA competition is over. As long as the C (or whatever) we write runs fine, who cares what the ML looks like? 99.999% of the time, nobody cares.

The only thing that RISC-V brings to the table is that it's free, if you ask me.
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #34 on: December 28, 2019, 09:47:41 pm »
Quote from: GeorgeOfTheJungle
The only thing that RISC-V brings to the table is that it's free, if you ask me.

Free as in Freedom, not free as in a free lunch, yes. That is absolutely the main thing. It's not that the code might be 10% smaller than some other ISA or run 2% faster or use 50% less energy (though those might be true, in some cases).

There are other "Free as in Freedom" ISAs, but at this point in time they don't have the software ecosystem and hardware and software momentum (acceleration?) that RISC-V does.

The truly big thing is that as an ISA it is "good enough", and that if you choose to start using it, you can invest in software and tooling and whatever, confident that no one can ever force you to change ISA later on against your will. Unlike, say, people who invested in .. well, it's a long list .. PDP11/VAX/Alpha, i860/i960/Itanium, PA-RISC, SPARC, M68k, M88k, DG Nova/Eclipse.

Anyone competent can continue to make RISC-V cores as far into the future as they want, using whatever design and implementation techniques are available at that time, and no lawyer will tell them they can't. If one supplier goes out of business or raises prices too much, you can switch to another supplier, or even hire your own engineers to make your own core.

Or use an emulator on whatever the leading CPU is at the time: RISC-V is very easy to interpret or JIT with high performance: portable QEMU manages about a 3x-4x slowdown over native, and the "RV8" JIT Michael Clark and I wrote demonstrated as low as 1.1x to 2x on x86_64. qemu-arm and qemu-aarch64, on the other hand, show around a 3x to 8x slowdown over native (those pesky condition codes are largely to blame), plus if you try to use that commercially you may well find yourself receiving a letter from a lawyer.

What happens if there is a new hardware technology with vastly faster switching times, but you can't put many logic elements in a computer? Unless it's something totally alien, you might see something like a 4004 or 6502 implemented on it first, but once it can support a 32 bit ISA it's likely that RV32I (or maybe RV32E) will be the first implemented for it, as it will have a complete and current software toolchain and ecosystem.

What guarantee do we have that the optimal ISAs won't change completely in future? None, of course, beyond the fact that for 55 years now ISAs have fashionably headed off in various directions as technology changed, but keep coming back to somewhere in the space between 1964's IBM 360 and 1975's IBM 801 project (the first true RISC).

https://carrv.github.io/2017/papers/clark-rv8-carrv2017.pdf
https://rv8.io/
 

Offline I wanted a rude username

  • Frequent Contributor
  • **
  • Posts: 669
  • Country: au
  • ... but this username is also acceptable.
Re: The RISC-V ISA discussion
« Reply #35 on: December 28, 2019, 10:18:42 pm »
Free as in Freedom, not free as in a free lunch, yes. That is absolutely the main thing.



Richard Stallman blesses this message with a recorder solo.

Also (you probably know this, because riscv.org reblogged it), as Arnd Bergmann commented when support for the indigenous Chinese C-SKY architecture was added to Linux:

> One more general comment: I think this may well be the last new CPU
> architecture we ever add to the kernel. Both nds32 and c-sky are made
> by companies that also work on risc-v, and generally speaking risc-v
> seems to be killing off any of the minor licensable instruction set projects,
> just like ARM has mostly killed off the custom vendor-specific instruction
> sets already. If we add another architecture in the future, it may instead
> be something like the LLVM bitcode or WebAssembly, who knows?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #36 on: December 28, 2019, 10:27:54 pm »
Also (you probably know this, because riscv.org reblogged it), as Arnd Bergmann commented when support for the indigenous Chinese C-SKY architecture was added to Linux:

Yes, I saw that at the time.

Certainly Andes (nds32) has gone all-in on RISC-V and is currently (with SiFive) one of the top 2 commercial RISC-V vendors in the world, and #1 in China, where they already had a long-established client list for their own ISA. They also have a bunch of extensions which they have already ported from their own ISA to RISC-V, and the proposed "P" packed-SIMD/DSP extension is essentially donated by Andes with, I think, very few if any modifications.
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #37 on: December 28, 2019, 11:08:08 pm »
The truly big thing is that as an ISA it is "good enough", and that if you choose to start using it, you can invest in software and tooling and whatever confident that no one can ever force you to change ISA later on against your will. Unlike, say, people who invested in .. well, it's a long list .. PDP11/VAX/Alpha, i860/i960/Itanium, PA-RISC, SPARC, M68k, M88k, DG Nova/Eclipse.

At home, I daily use two PA-8700(+) @ 875 MHz servers running Linux. I could have moved to ARM (RPI?) or x86 (or AMD Geode?), but I preferred HPPA simply because I like the specific hardware implementation of the Cxxx workstation and server lines, and because they have neater PCI stuff under the hood, as well as a true HMC chip on the bus (which helps to debug the PCI), and this is good for my specific applications. I also remotely use four POWER9 machines (PPC64++/BE) running Linux, but that's pure research activity.

90% of what I do there can be moved to x86, ARM, MIPS, ... everything that runs Linux can potentially be OK. But, talking about things I use daily for my job, PowerPC is used in avionics because its implementation offers guarantees that other architectures don't, and the M88K was used (and is still used) for military mission-critical stuff because its implementation also offers the possibility of having a voter with better guarantees on its functionality.

In these cases, my company cares more about the implementation, and all the legal stuff a chip-maker can offer.
« Last Edit: December 28, 2019, 11:25:41 pm by legacy »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #38 on: December 28, 2019, 11:16:03 pm »
Talking about things I daily use for my job, the PowerPC is used in avionics because its implementation offers guarantees. The M88K was used (and it is still used) for military mission-critical stuff because its implementation offers the possibility to have a voter with better guarantees on its functionality.

At the RISC-V Summit I was talking to someone from a well known airliner avionics company. They told me they're switching from PowerPC to RISC-V and don't see qualification or certification or whatever as a problem. They'll be using the same supplier for the physical chips as they were before.
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #39 on: December 28, 2019, 11:29:27 pm »
At the RISC-V Summit I was talking to someone from a well known airliner avionics company. They told me they're switching from PowerPC to RISC-V and don't see qualification or certification or whatever as a problem. They'll be using the same supplier for the physical chips as they were before.

I'd like to see who will assume responsibility for the certifications. It's one thing to talk, another to sign with your blood. We will see. What I know is: we have just invested 5 million euros in PowerPC for the new year.
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #40 on: December 28, 2019, 11:30:44 pm »
the same supplier

Who? AMCC?

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #41 on: December 28, 2019, 11:39:51 pm »
the same supplier

Who? AMCC?

They told me who, but I think they would probably prefer me not to say at present.

I haven't seen you mention who you work for.
« Last Edit: December 28, 2019, 11:41:47 pm by brucehoult »
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4555
  • Country: us
Re: The RISC-V ISA discussion
« Reply #42 on: December 29, 2019, 03:00:41 am »
Quote
Unlike, say, people who invested in .. well, it's a long list .. PDP11/VAX/Alpha, i860/i960/Itanium, PA-RISC, SPARC, ...
It's "interesting" that a lot of those have presumably had their "intellectual property rights" expire, and COULD probably be manufactured by anyone who really wanted to.  But for most of them, there are other reasons people aren't actually interested any more.

Back when the original x86 patents were expiring, "we" had some significant worry that all the chip vendors would be making nothing but x86 clones (and our code didn't run on x86...)  That didn't happen, though!

(Now, 8051 sort-of did this.  I wonder what made that work...)
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #43 on: December 29, 2019, 04:53:31 am »
Quote
Unlike, say, people who invested in .. well, it's a long list .. PDP11/VAX/Alpha, i860/i960/Itanium, PA-RISC, SPARC, ...
It's "interesting" that a lot of those have presumably had their "intellectual property rights" expire, and COULD probably be manufactured by anyone who really wanted to.  But most have other reasons that people aren't actually interested any more.

Certainly patents have expired, but I gather you could run afoul of copyright claims on the ISA itself, or the manual, or ...  I don't know. And Trademark infringement if you claim to be compatible with something rather than simply copying it and calling it something else.

What prevents anyone who wants to from making PowerPC- (1994ish) or MIPS32/MIPS64- (1999) compatible CPUs without a license?
 

Offline donotdespisethesnake

  • Super Contributor
  • ***
  • Posts: 1093
  • Country: gb
  • Embedded stuff
Re: The RISC-V ISA discussion
« Reply #44 on: December 29, 2019, 12:05:05 pm »
Free as in Freedom, not free as in a free lunch, yes. That is absolutely the main thing.

Richard Stallman blesses this message with a recorder solo.

I'm not sure he would; I don't think RISC-V is copyleft? Open-source licenses dilute his concept a lot (i.e. all the non-copyleft licenses). They present great opportunities for commercial operators, i.e. to not pay license fees to Microsoft, ARM etc., but don't necessarily help the end users, which is what Stallman is concerned about.
Bob
"All you said is just a bunch of opinions."
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7837
  • Country: pl
Re: The RISC-V ISA discussion
« Reply #45 on: December 29, 2019, 12:07:46 pm »
If we rearrange that 64 bit add example a little...

Code:
add a0,a0,a2
sltu a5,a0,a2
add a5,a5,a3
add a1,a1,a5
ret

... then the sltu and following add can be fused.
That's still not competitive with CISC on Fibonacci computation benchmarks ;)

You would have to fuse SLTU with the preceding ADD and then fuse the two following ADDs into ADC.
It could be doable, I think. The Sxxx instruction decoder would be looking for such patterns and simply repointing the target register to a hidden carry/overflow/zero flag in the ALU instruction's ROB entry, making the Sxxx essentially a NOP. Then the following ADD would detect the condition and wait for another ADD to fuse with.
 

Offline David Hess

  • Super Contributor
  • ***
  • Posts: 18809
  • Country: us
  • DavidH
Re: The RISC-V ISA discussion
« Reply #46 on: December 29, 2019, 12:18:08 pm »
I understand that. Yes that would be instructions with the equivalent of 3 sources instead of 2. There are such instructions in the FP extensions by the way (fused multiply add), so if you're implementing FP extensions, you'll need the logic to handle 3 sources anyway. Given that 3-source instructions are part of extensions already, it would make sense to put instructions using carry in an extension as well, I concede that.

That's a different register file and a different ALU.

Oh. Right. So you'd have to duplicate it, but I guess the structure would be pretty similar. I still don't have a precise idea of how much area/LEs it takes to implement that. I'm currently working on a typical 5-stage pipeline with 2 data sources and 1 destination as per the base RISC-V ISA. But at this point, even that I have no idea exactly how much it would take in hardware.

I have implemented pipelines in the past but none with data hazards (or very simple ones) or branch hazards, so I'm still wrapping my head around that.

The problem with 3 operand instructions is that the register file requires an extra read port which is *very* expensive.  I am not sure how they handle special cases like 3 and 4 operand FMA but I assume that an extra register file access is buried in the instruction latency.

But there is no requirement that flags be interchangeable with normal registers so they can be implemented in their own single port register file and some ISAs (Power?) do this.  Now the problem becomes extra instruction length.

Using a single flags register obviously can be made to work however it entails a lot of complexity in a high performance design.
 

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: The RISC-V ISA discussion
« Reply #47 on: December 29, 2019, 03:51:08 pm »
Quote from: GeorgeOfTheJungle
The only thing that RISC-V brings to the table is that it's free, if you ask me.

Free as in Freedom, not free as in a free lunch, yes. That is absolutely the main thing. It's not that the code might be 10% smaller than some other ISA or run 2% faster or use 50% less energy (though those might be true, in some cases).

There are other "Free as in Freedom" ISAs, but at this point in time they don't have the software ecosystem and hardware and software momentum (acceleration?) that RISC-V does.

Sorry but "Use 50% less energy" sounds too good to be true... compared to what? Apples to oranges? But if that's true,  :-+ , you're going to own the market very soon.

In any case, in my opinion the ISA doesn't matter much these days. You only need to know the exact details to write for bare metal/lower level layers support/HAL/kernel etc, but above that, the application is going to be written in at least C or C++ (or worse) and there it doesn't matter much if at all.

You as an Apple user know it very well, because you've witnessed the almost seamless transitions from 68k to PPC and then to Intel and ARM.

Quote
The only thing that RISC-V brings to the table is that it's free, if you ask me.

"The only" was a bit rude, I apologize, should have said instead "The most significant".
« Last Edit: December 29, 2019, 09:27:07 pm by GeorgeOfTheJungle »
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #48 on: December 29, 2019, 04:37:51 pm »
Quote
Unlike, say, people who invested in .. well, it's a long list .. PDP11/VAX/Alpha, i860/i960/Itanium, PA-RISC, SPARC, ...
It's "interesting" that a lot of those have presumably had their "intellectual property rights" expire, and COULD probably be manufactured by anyone who really wanted to.  But most have other reasons that people aren't actually interested any more.

Certainly patents have expired, but I gather you could run afoul of copyright claims on the ISA itself, or the manual, or ...  I don't know. And Trademark infringement if you claim to be compatible with something rather than simply copying it and calling it something else.

What prevents anyone who wants to from making PowerPC- (1994ish) or MIPS32/MIPS64- (1999) compatible CPUs without a license?

https://en.wikipedia.org/wiki/OpenPOWER_Foundation
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #49 on: December 29, 2019, 08:51:15 pm »
E.g. considering the temperature range, chips have the following classification:

Consumer: 0°C to +95°C -> Level E
Extended Consumer: -20°C to +105°C -> Level D
Industrial: -40°C to +105°C -> Level C
Automotive: -40°C to +125°C -> Level B
Avionics: -55°C to +125°C -> Level A
Extended Avionics: -55°C to +150°C -> Level A+  <--- new spec 2015

Of course, there are other specs to be satisfied, but the point is: Motorola's, Freescale's, and AMCC's devices were/are built with ruggedized technology and designed with rules to satisfy all the specs.
« Last Edit: December 29, 2019, 09:31:00 pm by legacy »
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #50 on: December 29, 2019, 11:12:59 pm »
Again this has little to do with the ISA... An ISA is not an IP block, and an IP block is not an actual chip... That's two levels to overcome before getting something certified...

Given the simplicity of the RISC-V ISA (at least if you stick to I (/E) or IM for instance), I guess that an actual implementation with a simple pipeline/in-order execution and nothing too fancy should be much easier to validate than any more complex core... but that's still a lot of work. Who's going to be first to commit the resources for that? Could be interesting, but I'd be willing to bet that some big fish will have to invest serious cash for this to happen.

Could happen with government grants, but still probably through a large and reputable corporation... and then that would be mainly for strategic independence, so no wonder China has seemingly invested more in actual RISC-V chips so far than most other countries AFAIK. So the first RISC-V processor in avionics applications could well be a Chinese one in Chinese planes, for instance...
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #51 on: December 30, 2019, 02:16:36 am »
Sorry but "Use 50% less energy" sounds too good to be true... compared to what? Apples to oranges? But if that's true,  :-+ , you're going to own the market very soon.

I didn't make a claim that some RISC-V core uses 50% less energy than some other core. What I said is that even if someone makes such a core, that's still not the main reason to use RISC-V. Conversely, even if some RISC-V core used 50% more energy, the Open Source license-free nature of it might still be a reason to use it.

Basically, the exact technical specs don't matter, as long as they are reasonably in the same ballpark as others.

They *do* happen to be pretty good, but that's just icing.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #52 on: December 30, 2019, 02:19:49 am »
Quote
Unlike, say, people who invested in .. well, it's a long list .. PDP11/VAX/Alpha, i860/i960/Itanium, PA-RISC, SPARC, ...
It's "interesting" that a lot of those have presumably had their "intellectual property rights" expire, and COULD probably be manufactured by anyone who really wanted to.  But most have other reasons that people aren't actually interested any more.

Certainly patents have expired, but I gather you could run afoul of copyright claims on the ISA itself, or the manual, or ...  I don't know. And Trademark infringement if you claim to be compatible with something rather than simply copying it and calling it something else.

What prevents anyone who wants to from making PowerPC- (1994ish) or MIPS32/MIPS64- (1999) compatible CPUs without a license?

https://en.wikipedia.org/wiki/OpenPOWER_Foundation

I'm aware of that. It appears to require applying to join their Foundation and getting a license.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #53 on: December 30, 2019, 02:21:41 am »
E.g. considering the temperature range, chips have the following classification:

Consumer: 0°C to +95°C -> Level E
Extended Consumer: -20°C to +105°C -> Level D
Industrial: -40°C to +105°C -> Level C
Automotive: -40°C to +125°C -> Level B
Avionics: -55°C to +125°C -> Level A
Extended Avionics: -55°C to +150°C -> Level A+  <--- new spec 2015

Of course, there are other specs to be satisfied, but the point is: Motorola's, Freescale's, and AMCC's devices were/are built with ruggedized technology and designed with rules to satisfy all the specs.

As I said, the same company that is already making their PowerPC avionics chips will be making the RISC-V avionics chips.

I'm sure they didn't suddenly forget the physical specs required, or how to achieve them :-)
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #54 on: December 30, 2019, 01:03:08 pm »
Just for the record, the transition from m68k to PowerPC took six years, and the chips were manufactured by the same company. At some point the contracts with Motorola were moved to the new Freescale division, which surely added some delay; either way, it took six years.

One of the really good reasons for using the PowerPC architecture is real-time operation. There is a whole market for safety-critical equipment where even a small unpredictable delay is a problem, where a delay can kill people or destroy millions of dollars of equipment.

There are some new ARM designs that are now reaching acceptable levels of real-time operation; ARM made some progress in 2016. But ARM does not manufacture chips itself, so it takes a couple of years after a design is released for real processors to become available. The latest 2018 chips are interesting, very interesting even, though they still offer significantly lower processing power than a PowerPC.

I do believe the architecture of RISC-V can (and will) be tuned to do it better, but one thing that PPC still does better than any other chip is radiation-hardened designs thanks to the contribution by IBM.

This is not "ISA architecture", this is "how the chip is made", and this is mainly due to the fact that the chip-manufacturer has by far the best fabs for rad-hardening but they have also redesigned parts of the chip to increase reliability.

Do you know that potentially AMD has the know-how (and potentially even the fab) for the satellite market but has always refused to do it? The reason is unknown, but it's interesting, since they could have potentially switched from x86 compatible to PowerPC. But they have never done it.

Switching from PowerPC to RISC-V is not a matter of redesigning the "die film", but rather of literally redesigning everything. You have to simulate everything and run a lot of experiments on it to pass qualification. This can be done, but it means enormous amounts of money and resources and the exploration of numerous dead ends before anything concrete materializes for the avionics market.

Hence, if I were a CEO, I would approach it progressively: from level C to level B, then from level B to level A.

The SEU Radiation Events program is very interesting. The entry level is alpha-particle-tolerant systems: alpha particles can be shielded, so only internal alpha sources remain a concern, and those can be mitigated by enforcing the hardware qualification process, choosing Level A or Level A+ manufacturing, and requiring that there is no impurity or contamination in the fab process. Neutron-tolerant systems, however, have a problem with the cosmic ray flux, which depends on location, altitude, and atmospheric conditions and is very difficult to shield; in that case you need chips qualified for SEU. The hardest scenario is SEU from neutrons, for which chips are designed, tested, and qualified as "radiation-hardened".

Even the CPU's internal cache needs to be protected via parity/ECC, and the external memory cells are spread to avoid a multiple-bit upset. IBM's, Motorola's, Freescale's, and AMCC's devices were/are built with ruggedized technology and designed with rules to mitigate those threats.

-

In short, in my opinion, "real-time operations made in ruggedized technology" is the keyword you have to beat.
 

Offline David Hess

  • Super Contributor
  • ***
  • Posts: 18809
  • Country: us
  • DavidH
Re: The RISC-V ISA discussion
« Reply #55 on: December 31, 2019, 03:20:06 am »
I do believe the architecture of RISC-V can (and will) be tuned to do it better, but one thing that PPC still does better than any other chip is radiation-hardened designs thanks to the contribution by IBM.

IBM used a silicon on insulator process for PowerPC which gives a big boost to radiation resistance.

Quote
Do you know that potentially AMD has the know-how (and potentially even the fab) for the satellite market but has always refused to do it? The reason is unknown, but it's interesting, since they could have potentially switched from x86 compatible to PowerPC. But they have never done it.

AMD shared process technology with IBM so they would have the same advantage.  I suspect they never bothered because the market is so small and development tools in that market were not geared toward x86.  Things might have been different if the embedded x86 market had still existed.
« Last Edit: January 02, 2020, 12:56:13 am by David Hess »
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #56 on: December 31, 2019, 08:06:17 pm »
no wonder China has seemingly invested more so far in actual RISC-V chips than most other countries

Loongson is the "Chinese MIPS", and it's where they have invested money. A lot of money. Loongson has six product grades:
  • commercial
  • industrial
  • enhanced industrial
  • military
  • enhanced military
  • aerospace

 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #57 on: January 01, 2020, 11:12:02 am »
Ron Minnich:

Quote
[..] All this said, note that the HiFive is no more open, today, than your average ARM SOC; and it is much less open than, e.g., Power. […] Open instruction sets do not necessarily result in open implementations. An open implementation of RISC-V will require a commitment on the part of a company to opening it up at all levels, not just the instruction set.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #58 on: January 01, 2020, 02:45:00 pm »
Ron Minnich:

Quote
[..] All this said, note that the HiFive is no more open, today, than your average ARM SOC; and it is much less open than, e.g., Power. […] Open instruction sets do not necessarily result in open implementations. An open implementation of RISC-V will require a commitment on the part of a company to opening it up at all levels, not just the instruction set.

When he wrote that, soon after the release of the board, the zero-stage and first-stage bootloader code had not yet been open-sourced. That was done a couple of months later, almost a year and a half ago.

So that comment is way out of date.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #59 on: January 01, 2020, 03:58:44 pm »
no wonder China has seemingly invested more so far in actual RISC-V chips than most other countries

Loongson is the "Chinese MIPS", and it's where they have invested money. A lot of money.

Does that make what I said above any less true? I don't have exact figures (maybe Bruce does), but I've seen more Chinese companies designing AND releasing RISC-V-based chips than in any other part of the world, and the reason is obviously independence (oh, and cost), which China works on relentlessly.

I don't know much about Loongson; I remember a Chinese regular once said it was certainly not as significant as we thought, but it would be nice to know more. China may have invested public money in Loongson, but how many commercial companies have shipped Loongson-based processors so far compared to RISC-V? Anyway, this was not meant to be a contest of anything - just stating that Chinese companies are definitely interested in RISC-V, and that so far, if predictions had to be made, it doesn't look unlikely that the first RISC-V processor to be certified could be a Chinese one, which was my point.

I don't know all the companies actively designing RISC-V cores in the world, but it looks like SiFive is one of the rare western ones. One of its closest "equivalents" (I think) is Andes, which is Taiwanese. That's very few companies outside of China.

Anyway, I think this is getting a bit off-topic as I meant this topic as a discussion about the ISA (which RISC-V is, and nothing more), and it seems to be slipping.
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #60 on: January 01, 2020, 04:27:57 pm »
Loongson has six product grades, including military, enhanced military and aerospace. This is not an opinion, this is a fact, and it means that China really uses Loongson for everything they internally consider serious.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #61 on: January 01, 2020, 10:16:12 pm »
no wonder China has seemingly invested more so far in actual RISC-V chips than most other countries

Loongson is the "Chinese MIPS", and it's where they have invested money. A lot of money.

Does that make what I said above any less true? I don't have exact figures (maybe Bruce does), but I've seen more Chinese companies designing AND releasing RISC-V-based chips than in any other part of the world, and the reason is obviously independence (oh, and cost), which China works on relentlessly.

I don't know exact numbers. Obviously there have been some very nicely spec'd and priced chips and boards coming out of China, based so far on the Kendryte K210 and GigaDevice GD32VF103. China has its own internal RISC-V Foundation equivalent (in fact I think two of them), so there's a lot of interest there and it's not just talk.

Quote
I don't know all companies actively designing RISC-V cores in the world, but it looks like SiFive is one of the rare western ones. One of its closest "equivalent" (I think) is Andes, which is taiwanese. That's a very few companies outside of China.

Andes is a serious player. They're an established company with a long client list, which they seem to be successfully switching from their own NDS32 to RISC-V. Both Andes and SiFive have announced numbers for incremental or total design wins at various points in 2019, and the numbers are in the same ballpark.

The PULP project, based out of ETH Zurich and the University of Bologna, seems to be quite influential in Europe. NXP is using their cores, and I think a number of others are too. The RISC-V Foundation is using one of their cores as a reference implementation.

There are several companies with RISC-V cores here in Russia (I'm in Moscow for the New Year holidays, and will visit SPB for a couple of days too). CloudBear and Syntacore are the main names. I count that as "western", culturally and in the approach to engineering and technology.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #62 on: January 02, 2020, 06:29:28 am »
Here's the latest cheap board: the same STM32F103 clone (but RISC-V) as the Longan Nano, on a spartan 33mm x 33mm board with USB, crystal, power regulator, and 'every pin exposed' via 0.1"-spaced plated-through holes, for $3. That's 108 MHz, 128 KB flash, 32 KB SRAM.

https://www.cnx-software.com/2020/01/02/polos-gd32v-alef-tiny-risc-v-mcu-board/
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2850
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #63 on: January 02, 2020, 10:21:04 am »
Exactly what's your interest?
Building an HDL cpu-core with pipeline?
Writing a cycle-accurate pipeline simulator?
Designing an HL compiler, from HL to machine-code?

Each of these fields has its trade-off

but talking about "architecture", I wish I had "See RISC-V Run" (The Book) under my Xmas tree.
Has it already been written? Let's write it! I want it under the tree :D
With synchronicity, I've just got my Xmas holiday project up and running - an RV32I in VHDL, working from the ISA (not somebody else's HDL).

It is currently running this code compiled by GCC, and then dumped into ROM & RAM by a script:

Code: [Select]
/* Minimal bare-metal hello world: busy-waits on a memory-mapped
   UART status register, then writes each byte to the TX register. */
char text[] = "Hello world!\r\n";

int _start(void) {
  volatile char * serial           = (char *)0xE0000000; /* UART TX data   */
  volatile char * serial_tx_status = (char *)0xE0000004; /* UART TX status */
  while(1) {
    char *s = text;
    while(*s) {
      while(*serial_tx_status) {
        /* wait until the transmitter is free */
      }
      *serial = *s;
      s++;
    }
  }
  return 0;
}

I've still got to set up the stack and so on, but that will come...

It was pretty awesome feeling to see it running in real hardware. A combination of "Wow! It works!" and "Really? It's that easy?" and self doubt like "Gosh! I never thought it would work - must have missed something...".

It's not pipelined or anything at all advanced - memory reads take two cycles (due to memory latency), everything else takes one cycle. Optimised for size, it still runs at a little over 50 MHz in an Artix-7 -1; a slightly less resource-optimised version runs at 60 MHz.

Resource usage is close to nothing (a small fraction of even a low-end FPGA dev board):

Code: [Select]
+-------------------------+-------------------+------------+------------+---------+------+-----+--------+--------+--------------+
|         Instance        |       Module      | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | DSP48 Blocks |
+-------------------------+-------------------+------------+------------+---------+------+-----+--------+--------+--------------+
| basys3_top_level        |             (top) |        731 |        682 |      48 |    1 |  95 |      3 |      0 |            0 |
|   (basys3_top_level)    |             (top) |          0 |          0 |       0 |    0 |   0 |      0 |      0 |            0 |
|   i_top_level           |         top_level |        731 |        682 |      48 |    1 |  95 |      3 |      0 |            0 |
|     (i_top_level)       |         top_level |          1 |          0 |       0 |    1 |   1 |      0 |      0 |            0 |
|     i_bus_bridge        |        bus_bridge |        103 |        103 |       0 |    0 |   0 |      0 |      0 |            0 |
|     i_peripheral_ram    |    peripheral_ram |          1 |          1 |       0 |    0 |   1 |      1 |      0 |            0 |
|     i_peripheral_serial | peripheral_serial |         63 |         63 |       0 |    0 |  62 |      0 |      0 |            0 |
|     i_program_memory    |    program_memory |          5 |          5 |       0 |    0 |   1 |      2 |      0 |            0 |
|     i_riscv_cpu         |         riscv_cpu |        558 |        510 |      48 |    0 |  30 |      0 |      0 |            0 |
|       i_alu             |               alu |        144 |        144 |       0 |    0 |   0 |      0 |      0 |            0 |
|       i_branch_test     |       branch_test |         29 |         29 |       0 |    0 |   0 |      0 |      0 |            0 |
|       i_data_bus_mux_a  |    data_bus_mux_a |         28 |         28 |       0 |    0 |   0 |      0 |      0 |            0 |
|       i_data_bus_mux_b  |    data_bus_mux_b |         18 |         18 |       0 |    0 |   0 |      0 |      0 |            0 |
|       i_decoder         |           decoder |         66 |         66 |       0 |    0 |   0 |      0 |      0 |            0 |
|       i_program_counter |   program_counter |         79 |         79 |       0 |    0 |  30 |      0 |      0 |            0 |
|       i_register_file   |     register_file |         49 |          1 |      48 |    0 |   0 |      0 |      0 |            0 |
|       i_result_bus_mux  |    result_bus_mux |         32 |         32 |       0 |    0 |   0 |      0 |      0 |            0 |
|       i_shifter         |           shifter |        113 |        113 |       0 |    0 |   0 |      0 |      0 |            0 |
|       i_sign_extender   |     sign_extender |         24 |         24 |       0 |    0 |   0 |      0 |      0 |            0 |
+-------------------------+-------------------+------------+------------+---------+------+-----+--------+--------+--------------+
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: The RISC-V ISA discussion
« Reply #64 on: January 02, 2020, 10:57:11 am »
The further a society drifts from truth, the more it will hate those who speak it.
 
The following users thanked this post: iMo

Offline GeorgeOfTheJungle

  • Super Contributor
  • ***
  • !
  • Posts: 2699
  • Country: tr
Re: The RISC-V ISA discussion
« Reply #65 on: January 02, 2020, 11:10:55 am »
I was going to buy a RISC-V Longan Nano ($4.90) to play with, but ended up buying an esp32 TTGO-Display ($4.99) because for the same price I get 31x the flash, 16x the RAM, two cores instead of one at 2x the clock speed, plus WiFi, BT, an ethernet MAC, a lipo charger, etc. The only thing I'm missing is that nice acrylic case, damn it.

https://github.com/Xinyuan-LilyGO/TTGO-T-Display
https://www.seeedstudio.com/Sipeed-Longan-Nano-RISC-V-GD32VF103CBT6-Development-Board-p-4205.html

Edit:
Something I still don't get is the SiFive HiFive1 Arduino board: take a RISC-V "freedom e310" pcb and shove in the cheapest ($1.x) single core esp32 (for WiFi+BT) there is and... sell it for $59 ?!?!? I mean...  :wtf: Not a good start if you ask me.

https://www.sifive.com/boards/hifive1-rev-b
https://sifive.cdn.prismic.io/sifive%2Fa4546ced-0922-4d87-9334-e97c1a9fd9a5_hifive1.b01.schematics.pdf
« Last Edit: January 02, 2020, 03:39:00 pm by GeorgeOfTheJungle »
The further a society drifts from truth, the more it will hate those who speak it.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #66 on: January 02, 2020, 03:55:19 pm »
With synchronicity, I've just got my Xmas holiday project up and running - an RV32I in VHDL, working from the ISA (not somebody else's HDL).
(...)

Cool stuff. I'll let you know here when my simulator has progressed enough that I can run your example code and Bruce's small benchmark on it.
(As to working from the ISA - same here. I'm writing it 100% from scratch with no 3rd-party code.) Goal is to later implement an HDL version as well.

I saw your core is not pipelined - it would indeed have taken you a lot more time. I still haven't quite finished my implementation of a 5-stage pipeline in software (being in software doesn't make it any easier, as I'm really trying to simulate a true pipeline that could later be translated to HDL relatively easily...). Do you intend to pipeline your core? It would also be interesting to compare how many LEs it takes (non-pipelined vs. 3-stage vs. 5-stage pipelined versions, for instance...) and the max frequency in each case...

Additional question: have you implemented the FENCE instruction?
« Last Edit: January 02, 2020, 05:41:41 pm by SiliconWizard »
 

Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
Re: The RISC-V ISA discussion
« Reply #67 on: January 02, 2020, 08:50:07 pm »
Something I still don't get is the SiFive HiFive1 Arduino board: take a RISC-V "freedom e310" pcb and shove in the cheapest ($1.x) single core esp32 (for WiFi+BT) there is and... sell it for $59 ?!?!?
They're not in the business of selling boards or chips. The boards are really meant to demonstrate their technology.

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2850
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #68 on: January 02, 2020, 10:19:49 pm »
Do you intend on making your core pipelined? It would also be interesting to compare how many LEs it takes (comparing non-pipelined, 3-stage and 5-stage pipelined versions for instance...) and max freq in each case...
I don't really have any intention of pipelining it - it was mainly to work out a micro-architecture.

Additional question: have you implemented the FENCE instruction?
All memory accesses are in order, so FENCE is equivalent to a NOP.

It has been quite interesting, and has let me see quite a few optimizations that I wouldn't have seen otherwise. Currently the slowest path is the one supporting the branch instructions:

Code: [Select]
                     Delay        Levels of logic               
Instruction register 2.454        1               
Decode               4.650        3               
Register File        1.586        1               
Branch test          3.820        6               
Program Counter      1.846        2               
Instruction Memory   1.801        2               
                    16.157       15  61.9MHz

I'm looking at duplicating the register file so the decoding is completely bypassed - without those levels of logic it should be ~3 ns faster.
                               
Code: [Select]
With duplicated register file (at cost of 64 LUTMEM)                               
                     Delay        Levels of logic               
Instruction register 2.454        1               
Decode               1.660        0               
Register File        1.586        1               
Branch test          3.820        6               
Program Counter      1.846        2               
Instruction Memory   1.801        2               
                    13.167       12  75.9 MHz

That will move the bottleneck to the ALU, Shifter or load/store paths.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #69 on: January 02, 2020, 10:58:52 pm »
Do you have mul and div? If yes, how are they implemented?

Would you like to compile and test a simple integer calculator written in C, implementing a recursive descent parser? It's bare-metal; it just needs char_get and char_put over a (serial?) console.

I wrote this code in 2004 to test a wild toolchain for the Game Boy. I re-used it in 2011 to test a pipelined softcore, helping a colleague find a couple of bugs.
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4555
  • Country: us
Re: The RISC-V ISA discussion
« Reply #70 on: January 03, 2020, 01:59:58 am »
From a "user" point of view, I really like having a standardized instruction set that has been:
  • Vetted by compiler people not to be awful to generate code for.
  • Vetted by hardware people to have simple implementations.
  • Vetted by hardware people not to have a bunch of "known bottlenecks" that would interfere with more complex implementations (pipelining being a good example).
  • Created with some extensibility in mind.
  • Created with some sub-setting in mind.
  • Non-proprietary enough to incorporate good ideas from multiple sources.
  • But constrained enough not to get out of control.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2850
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #71 on: January 03, 2020, 04:48:22 am »
Do you have mul and div? If so, how are they implemented?

No, just RV32I

Quote
Would you like to compile and test a simple integer calculator written in C implementing a recursive descent parser? It's bare-metal; it just needs char_get and char_put over a (serial?) console.

I wrote this code in 2004 to test a wild toolchain for the Game Boy. I re-used this code in 2011 to test a pipelined softcore, helping a colleague to find a couple of bugs.

Would love to - I'll have to write the serial RX though...

I've attached the sketch of the design, so you can point and laugh :D
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline Someone

  • Super Contributor
  • ***
  • Posts: 5840
  • Country: au
    • send complaints here
Re: The RISC-V ISA discussion
« Reply #72 on: January 03, 2020, 11:12:22 am »
It's not pipelined or anything at all advanced - memory reads take two cycles (due to memory latency), everything else takes one cycle. Optimised for size, it still runs at a little over 50 MHz in an Artix-7 -1. A slightly less resource-optimised version runs at 60 MHz.

Resource usage is close to nothing (a small fraction of even a low-end FPGA dev board):

Code: [Select]
+-------------------------+-------------------+------------+------------+---------+------+-----+--------+--------+--------------+
|         Instance        |       Module      | Total LUTs | Logic LUTs | LUTRAMs | SRLs | FFs | RAMB36 | RAMB18 | DSP48 Blocks |
+-------------------------+-------------------+------------+------------+---------+------+-----+--------+--------+--------------+
| basys3_top_level        |             (top) |        731 |        682 |      48 |    1 |  95 |      3 |      0 |            0 |
Well done, that's already competitive with a MicroBlaze without much effort.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #73 on: January 03, 2020, 02:40:21 pm »
Do you have mul and div? If so, how are they implemented?

No, just RV32I

Quote
Would you like to compile and test a simple integer calculator written in C implementing a recursive descent parser? It's bare-metal; it just needs char_get and char_put over a (serial?) console.

I wrote this code in 2004 to test a wild toolchain for the Game Boy. I re-used this code in 2011 to test a pipelined softcore, helping a colleague to find a couple of bugs.

Would love to - I'll have to write the serial RX though...

I've attached the sketch of the design, so you can point and laugh :D

Well, that's pretty basic but does the job - there's nothing to really laugh about here. It's rather typical and clean.

What amazes me is that it still manages to work at 50MHz+ without pipelining. Those Artix-7 FPGAs are damn great (and yes, RV32I is very lightweight, but still.)

 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #74 on: January 03, 2020, 02:44:58 pm »
Do you have mul and div? If so, how are they implemented?

As he said, it's RV32I, so no, it doesn't include the MUL and DIV ops, which are part of an extension in RISC-V ('M'). Whereas multiplies, with his non-pipelined design, should be achievable with a simple multiply operator in HDL without hurting the max frequency too much (at least on those FPGAs), divides are another story. I doubt they can be achieved without pipelining, and then that creates a whole new data hazard to handle...
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #75 on: January 03, 2020, 03:40:33 pm »
should be achievable with a simple multiply operator in HDL without harming the max frequency too much (at least on those FPGAs), the divides are another story

The ones implemented in my softcore Arise-v2 are a bit slow in terms of clock cycles, but they don't add any fmax penalty to the synthesis. The mul is a 32-bit MAC(1) and makes the FSM wait 13 clock cycles, while the div makes it wait 34 clock cycles.

(1) Not using any DSP slice; it's implemented manually.
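The classic shift-add scheme behind such a MAC-style multiplier can be sketched in C. This is illustrative only (one bit per iteration); an FSM that retires several bits per cycle, as the 13-cycle figure suggests, is the same idea partially unrolled:

```c
#include <stdint.h>

/* Shift-add 32x32 -> 32-bit multiply: one partial product per iteration,
   mirroring what a small FSM would do in one clock cycle per bit. */
uint32_t mul_shift_add(uint32_t a, uint32_t b)
{
    uint32_t acc = 0;

    while (b != 0) {
        if (b & 1)
            acc += a;   /* accumulate the current partial product */
        a <<= 1;
        b >>= 1;
    }
    return acc;
}
```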
« Last Edit: January 03, 2020, 04:12:30 pm by legacy »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #76 on: January 03, 2020, 09:31:24 pm »
Do you have mul and div? If so, how are they implemented?

As he said, it's RV32I, so no, it doesn't include the MUL and DIV ops, which are part of an extension in RISC-V ('M'). Whereas multiplies, with his non-pipelined design, should be achievable with a simple multiply operator in HDL without hurting the max frequency too much (at least on those FPGAs), divides are another story. I doubt they can be achieved without pipelining, and then that creates a whole new data hazard to handle...

If you tell gcc to compile for RV32I then it will automatically use and link in software multiply and divide routines from libgcc if they are needed.

It's possible to use RV32IM with a hardware multiply but not divide. Well, in fact you're allowed to compile code using RV32IM without having either multiply or divide instructions, as long as you ensure the "unimplemented instruction" traps go to a handler that emulates them. Saying a system supports RV32IM (or any other extension) promises that programs using multiply and divide opcodes will work, it doesn't promise that they will be fast.

So, you can implement multiply in hardware but not divide, implement a handler for divide and happily compile for RV32IM. Divide will be slower than if you called the libgcc routine directly, but possibly not all that much slower, depending on how efficiently you implement the trap handler.

There is a -mno-div flag to tell gcc not to use the divide instruction even though you're compiling for RV32IM.
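For the curious, the core of such a trap-handler emulation of DIV might look roughly like this in C. This is a hedged sketch: the names are made up, and a real handler must also fetch the faulting instruction word at mepc, save/restore the register file, advance mepc by 4, and mret. The divide-by-zero and overflow cases follow the M-extension's defined results:

```c
#include <stdint.h>

/* Hypothetical core of an illegal-instruction handler emulating DIV.
   'insn' is the trapped instruction word, 'regs' the saved register file.
   Per the M extension: x/0 -> -1, and INT32_MIN/-1 -> INT32_MIN (no trap). */
void emulate_div(uint32_t insn, int32_t regs[32])
{
    uint32_t rd  = (insn >> 7)  & 31;
    uint32_t rs1 = (insn >> 15) & 31;
    uint32_t rs2 = (insn >> 20) & 31;
    int32_t a = regs[rs1], b = regs[rs2];
    int32_t q = (b == 0) ? -1
              : (a == INT32_MIN && b == -1) ? a
              : a / b;                  /* C truncation matches RISC-V DIV */

    if (rd != 0)
        regs[rd] = q;                   /* x0 stays hard-wired to zero */
    /* the caller then advances mepc by 4 and executes mret */
}
```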
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #77 on: January 04, 2020, 10:38:33 pm »
Do you have mul and div? If so, how are they implemented?

As he said, it's RV32I, so no, it doesn't include the MUL and DIV ops, which are part of an extension in RISC-V ('M'). Whereas multiplies, with his non-pipelined design, should be achievable with a simple multiply operator in HDL without hurting the max frequency too much (at least on those FPGAs), divides are another story. I doubt they can be achieved without pipelining, and then that creates a whole new data hazard to handle...

If you tell gcc to compile for RV32I then it will automatically use and link in software multiply and divide routines from libgcc if they are needed.

Well, sure! But that's just software emulation; it's not like actually having multiply and divide. You can emulate anything in software, and yes, GCC can emulate integer multiplies and divides, and of course FP operations. And in this case GCC is implementing them, not you. So that doesn't really answer legacy's question.

It's possible to use RV32IM with a hardware multiply but not divide. Well, in fact you're allowed to compile code using RV32IM without having either multiply or divide instructions, as long as you ensure the "unimplemented instruction" traps go to a handler that emulates them. Saying a system supports RV32IM (or any other extension) promises that programs using multiply and divide opcodes will work, it doesn't promise that they will be fast.

Well, certainly, again. But that would be software emulation as well (this time through traps). Of course nothing is said about speed in the spec. But I would still consider the above approach dubious. While the spec doesn't say a given instruction has to be fast, I think it's reasonable to assume a core will not raise an exception for an instruction from an extension it's supposed to support; that would be a bit nonsensical. Of course, you didn't say the core would be RV32IM, you said the "system" would be. That does capture emulated functionality, but I personally think calling it an "RV32IM system" would be a bit of a misnomer. And I certainly hope no vendor has the balls to market an RV core as supporting some extensions when said support is only via trap handlers. That would be really bad marketing if you ask me.

And getting back to the question and hamster's work: the trap approach was probably out of the question anyway, as he said his work was still preliminary and he hadn't even set up a stack yet...

So, you can implement multiply in hardware but not divide, implement a handler for divide and happily compile for RV32IM. Divide will be slower than if you called the libgcc routine directly, but possibly not all that much slower, depending on how efficiently you implement the trap handler.

You can, but unless (1) you need to run unmodified object code, or (2) you absolutely need to reduce code size (using traps this way would of course shrink code compared to letting the compiler inline the emulated operations), I don't really see the point. No matter how efficient your trap handler is, it's going to be much slower than an operation executed directly in the ALU without disturbing the pipeline.

Of course, I understand it was worth mentioning as a way to run code that uses some extension on a core that doesn't support it.

There is a -mno-div flag to tell gcc not to use the divide instruction even though you're compiling for RV32IM.

I was wondering about this earlier, so it's nice to know. I was indeed considering the "M" extension and how it requires divides, whereas divides are a lot more resource-hungry to implement in hardware and not always necessary performance-wise, because in practice a lot of computations can be rewritten to avoid divides. So a flag that guarantees no divide instruction is emitted is nice to have.

But I wouldn't know what to call a core that implements only multiplies. Again, I don't think you could claim such a core supports the "M" extension.
« Last Edit: January 04, 2020, 10:42:40 pm by SiliconWizard »
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #78 on: January 08, 2020, 02:45:45 pm »
So... my CPU emulator/simulator is now functional! ;D

At this point, it can simulate a 5-stage pipelined RISC-V core. The variant can be selected at init time; it currently supports RV32I/E, RV64I/E, and the M extension (RV32/RV64). The C extension is not quite finished, and I then intend to implement most of the other standard extensions as well.

I have tested it with a few RISC-V test programs. After the first few very simple, hand-crafted assembly tests, I wrote some start-up code and a linker script so I can test C code. Current tests emulate an environment with 256KB of instruction memory and 1MB of data memory. The startup code sets up the stack pointer, initializes data, zero-fills uninitialized globals and calls 'main'; I put an 'ebreak' instruction right after that, and set up a stop condition in my simulator to halt execution on 'ebreak', which is a simple way to stop automatically once 'main' has returned.
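For anyone setting up something similar, the data/bss initialization steps of such startup code can be sketched in C. In a real crt0 the bounds come from linker-script symbols (with names like __data_start, which vary by script); here they are plain parameters so the sketch stays self-contained:

```c
#include <stddef.h>
#include <stdint.h>

/* Copy the .data image from its load address (ROM) to its run address (RAM). */
void init_data(uint32_t *dst, const uint32_t *src, size_t words)
{
    while (words--)
        *dst++ = *src++;
}

/* Zero-fill .bss, i.e. uninitialized globals, as the C standard requires. */
void zero_bss(uint32_t *bss, size_t words)
{
    while (words--)
        *bss++ = 0;
}
```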

Example code that simulates correctly (stripping out includes/comments/...):
Code: [Select]
static char gst_szString[256];

static size_t StrLen(const char *szString)
{
    size_t nLength;

    if (szString == NULL)
        return 0;

    for (nLength = 0; *szString != '\0'; szString++)
        nLength++;

    return nLength;
}

static char * StrReverse(char *szStringDest, const char *szStringSrc)
{
    size_t nLength, i, j;

    if ((szStringDest == NULL) || (szStringSrc == NULL))
        return NULL;

    nLength = StrLen(szStringSrc);

    for (i = 0, j = nLength - 1; i < nLength; i++, j--)
        szStringDest[i] = szStringSrc[j];

    szStringDest[nLength] = '\0';

    return szStringDest;
}

int main(void)
{
    (void) StrReverse(gst_szString, "This is a test string!");

    return 0;
}

So, this basically stores a reversed version of some constant string in a global variable. The simulator dumps data memory content in a file once it's done, so I can check that it correctly stored "!gnirts tset a si sihT".

The simulator CLI outputs this for the above at this point (yes, as you can see, it runs on Windows here, but it's pure C99 code, so it builds and runs fine on any POSIX-compliant platform):
Code: [Select]
CPUEmu: CPUEmu-RISCV (0.1.1) on CPUEmu (0.1.0)
CPU Variant: RV32IM
Binary file '..\..\Tests\RISCV_Code\GCC-Build\Test2.bin' loaded.
Running simulation...
Simulation completed in 0.000015 s.
Clock Counter = 664
Instruction Counter = 442
CPI = 1.502
STOP: Stop on Instruction (Num = 1)
Registers:
    x0        0x00000000
    x1        0x00000058
    x2        0x10100000
    x3        0x00000000
    x4        0x00000000
    x5        0x00000000
    x6        0x00000000
    x7        0x00000000
    x8        0x00000000
    x9        0x00000000
    x10       0x00000000
    x11       0x10000016
    x12       0x00000054
    x13       0x10000016
    x14       0x000000C7
    x15       0x10000016
    x16       0x00000000
    x17       0x00000000
    x18       0x00000000
    x19       0x00000000
    x20       0x00000000
    x21       0x00000000
    x22       0x00000000
    x23       0x00000000
    x24       0x00000000
    x25       0x00000000
    x26       0x00000000
    x27       0x00000000
    x28       0x00000000
    x29       0x00000000
    x30       0x00000000
    x31       0x00000000
    pc        0x0000005C

(About half of all executed instructions are in the startup code, as it zero-fills the 256-byte string global.)

The CPI of about 1.5 is not extraordinary, but it's not too bad for a first pipeline version. I think branches are the culprit here: I haven't implemented branch prediction so far, so every taken branch wastes 2 cycles, and this small test code is almost entirely made up of loops.
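As a rough sanity check, assuming each taken branch costs 2 cycles, the counters above account for almost all of the overhead:

```c
/* Back-of-envelope check against the simulator's counters above. */
static double cpi_from(long clocks, long insns)
{
    return (double)clocks / insns;              /* 664/442 = 1.502... */
}

static long unaccounted(long clocks, long insns, long taken)
{
    return clocks - (insns + 2 * taken);        /* cycles not explained by
                                                   taken-branch penalties */
}
```

442 instructions plus 2 × 109 = 218 branch-penalty cycles leaves only 4 of the 664 cycles unexplained, presumably pipeline fill and load-use stalls.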

So now I'm going to write many tests for it. (I'm also going to test Bruce's benchmark shortly.) If anyone has any example code to share that I could test, that'd be cool as well.

« Last Edit: January 08, 2020, 02:47:29 pm by SiliconWizard »
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #79 on: January 08, 2020, 04:30:11 pm »
simulate a 5-stage pipelined RISC-V core

How do you simulate the pipeline stages?
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #80 on: January 08, 2020, 07:13:11 pm »
simulate a 5-stage pipelined RISC-V core

How do you simulate the pipeline stages?

For each simulated clock cycle, each stage executes what would be executed in a purely digital design. The FETCH stage fetches the instruction from instruction memory at the current PC, the DECODE stage decodes the instruction, etc. To simulate the registering and propagation, each stage has associated "output" registers in a dedicated struct, which are read as inputs by the next stage. For instance, the FETCH stage stores the fetched instruction and PC in members of a dedicated struct; the DECODE stage reads them to decode, and also passes them on to the next stage by writing them to its own output registers... To avoid having to duplicate the registers between each stage (to make sure each stage reads the previous stage's output registers as they were on the previous clock cycle), you just execute the stages from last to first (since in software they obviously run sequentially, not in parallel...)
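The last-to-first trick can be illustrated with a toy three-stage version (names and structure are illustrative, not the actual simulator code): because each stage overwrites its own output register only after its consumer has already read it, no double-buffering is needed:

```c
#include <stdint.h>

/* Toy 3-stage pipeline: one output "register" per stage. Running the
   stages last-to-first means each one reads its predecessor's value
   from the PREVIOUS simulated cycle, with no duplicated registers. */
static uint32_t if_out, id_out, ex_out;          /* inter-stage registers */
static uint32_t pc;

static void stage_ex(void) { ex_out = id_out; }  /* consume last cycle's ID */
static void stage_id(void) { id_out = if_out; }  /* consume last cycle's IF */
static void stage_if(void) { if_out = pc; pc += 4; } /* fetch next PC */

void clock_cycle(void)
{
    stage_ex();   /* reverse order: last stage first */
    stage_id();
    stage_if();   /* overwrites if_out only after ID has read it */
}
```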

For handling stalls, I have two stall registers (plain integers). Each bit represents the current stall state of the corresponding stage. The first stall register handles bubble-like stalls: bubbles must propagate through the pipeline, so this "register" is shifted left on each simulated clock cycle. The second register handles stalls that are not bubbles. Both registers are then ANDed, and the result defines which stages execute and which don't on each cycle.
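A minimal sketch of that two-mask idea in C (assumed names, not the actual code; bit i of each mask says whether stage i may run this cycle, with stage 0 = fetch):

```c
#include <stdint.h>

#define N_STAGES 5

/* 1 = stage runs, 0 = stage stalled. Start with everything running. */
static uint32_t bubble_mask = (1u << N_STAGES) - 1;
static uint32_t hold_mask   = (1u << N_STAGES) - 1;

/* AND of both masks decides which stages execute this cycle. */
uint32_t stages_to_run(void)
{
    return bubble_mask & hold_mask & ((1u << N_STAGES) - 1);
}

void insert_bubble(int stage)
{
    bubble_mask &= ~(1u << stage);
}

/* A bubble travels with the instruction stream toward writeback,
   so the bubble mask shifts left each cycle, refilling from fetch. */
void advance_clock(void)
{
    bubble_mask = (bubble_mask << 1) | 1u;
}
```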

Not getting into a lot of details as it's a bit involved in the end but hopefully you get the basics from the above.

The pipeline I implemented is a classic RISC 5-stage pipeline (IF, ID, IE, MA, WB). I simulated data bypass for data hazards, and handled load-use hazards with a 1-cycle stall when required and taken branches with a 2-cycle stall. So I basically tried to make it as efficient as possible (stalling only when strictly necessary); still pretty textbook stuff. The main thing that could be improved is branches, by adding some form of branch prediction. I'll probably do that later on.

As I said earlier, the whole idea is to implement it so that it's "easily" implementable in HDL later (and so that the simulation is accurate). Doing this really helps with understanding how to design it: I now have a much better grasp of CPU pipelines (and of RISC-V, of course), and the ease and speed with which things can be tested is a big plus compared to implementing directly in HDL and going through tedious steps every time you want to test something new. So I can definitely confirm that designing a simulator first is a worthwhile approach; ironing out the initial pipeline bugs, for instance, would have taken a lot longer had I started in HDL directly.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #81 on: January 08, 2020, 07:17:13 pm »
I've added branch counters so for the above example, I now get:
Code: [Select]
Clock Counter = 664
Instruction Counter = 442
CPI = 1.502
Branch Counter = 114
Taken Branch Counter = 109

Definitely helps seeing what happens.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #82 on: January 09, 2020, 12:06:01 am »
Now with Bruce's benchmark code (prime counting):
(This benchmark helped me find - and fix - a small decoding bug for the M extension.)

Code: [Select]
Clock Counter = 47886782350
Instruction Counter = 31207736416
CPI = 1.534
Branch Counter = 13783063651
Taken Branch Counter = 6628998761

And... the result is actually correct! :-+
the number of primes found can be seen in the following register:
Code: [Select]
x29       0x0038A888
(which is indeed 3713160 dec)

So... that's ~47.9 billion clocks (simulated), which is explained by the CPI of ~1.5, and ~6.6 billion taken branches. It looks clear that a 5-stage pipeline without branch prediction is not ideal. Now I can make some interesting experiments. But already happy that I got this far. ;D
 

Offline ataradov

  • Super Contributor
  • ***
  • Posts: 12390
  • Country: us
    • Personal site
Re: The RISC-V ISA discussion
« Reply #83 on: January 09, 2020, 12:13:10 am »
Now with Bruce's benchmark code (prime counting):
I can't find it mentioned anywhere in this thread? Where is the code?
Alex
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #84 on: January 09, 2020, 12:36:50 am »
Now with Bruce's benchmark code (prime counting):
I can't find it mentioned anywhere in this thread? Where is the code?

A link was posted here: https://www.eevblog.com/forum/embedded-computing/raspberry-pi-4/msg2765598/#msg2765598
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #85 on: January 09, 2020, 07:39:50 am »
Now with Bruce's benchmark code (prime counting):
I can't find it mentioned anywhere in this thread? Where is the code?

The code lives at http://hoult.org/primes.txt  I should probably put it on github :-)

Silly little benchmark with the main virtue that it's been run (and *can* run) on a metric shedload of stuff from AVR to Xeons. Only exercises cpu core and L1 / SRAM, very branchy.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #86 on: January 09, 2020, 03:28:52 pm »
Now with Bruce's benchmark code (prime counting):
I can't find it mentioned anywhere in this thread? Where is the code?

The code lives at http://hoult.org/primes.txt  I should probably put it on github :-)

Silly little benchmark with the main virtue that it's been run (and *can* run) on a metric shedload of stuff from AVR to Xeons. Only exercises cpu core and L1 / SRAM, very branchy.

Yes, it's very simple and very branchy as you said, so it actually helps not just with benchmarking but with debugging CPUs... I found a second bug in my simulator thanks to it (one that didn't appear when compiled at -O1 but did at -O3). ;D
 

Offline legacy

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: The RISC-V ISA discussion
« Reply #87 on: January 09, 2020, 09:58:49 pm »
Not polished, simply packaged as a "demo" to be compiled on a Linux host. It's here; you can download it as a tgz. It's the simple integer calculator written in C I was talking about a few posts ago. The machine defines were automatically generated and need to be redefined for the target, as does the console I/O.

Have fun  :D
 
The following users thanked this post: hamster_nz

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #88 on: January 10, 2020, 01:20:00 am »
Thanks. I haven't implemented any kind of console emulation at this point (or any peripheral), so I can't really test it as such for the time being. For now I'll favor code that runs without interaction, but I'll give it a try once I implement at least some peripheral emulation.

I've found and fixed a new tricky bug with data hazard + JALR thanks to test code I had written for a Project Euler problem. Getting there! ;D

(Regarding branch prediction, what I said earlier about not having implemented it is technically not quite exact. It has some form of very basic static prediction, in that untaken branches have no penalty. I don't consider it branch prediction per se, but the textbook definition would say it is. That said, if anyone has worked on branch prediction and has a *simple* yet reasonably effective approach, I'd be happy to hear about it and implement it. An interesting alternative to branch prediction I've thought about, and which I've seen considered by the PULP team, would be an extension for "hardware loops". Another idea, not new per se, would be to add a couple of instructions giving hints about the preferred direction of any branch instruction. That would be quite effective IMO, but it requires compiler support, which may not be trivial. For hand-written assembly, it would certainly work fine...)
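On the branch prediction question: one classic scheme that is simple yet reasonably effective is a table of 2-bit saturating counters indexed by low PC bits (a bimodal predictor). A sketch, with the table size picked arbitrarily:

```c
#include <stdint.h>

#define BHT_BITS 8                        /* 256 entries; size is arbitrary */
#define BHT_SIZE (1u << BHT_BITS)

static uint8_t bht[BHT_SIZE];             /* 2-bit counters, start at 0 */

static unsigned index_of(uint32_t pc)
{
    return (pc >> 2) & (BHT_SIZE - 1);    /* drop the always-zero low bits */
}

/* Predict taken when the counter is in states 2 or 3. */
int predict_taken(uint32_t pc)
{
    return bht[index_of(pc)] >= 2;
}

/* After resolving the branch, nudge the counter toward the outcome,
   saturating at 0 and 3 so one odd outcome doesn't flip the prediction. */
void train(uint32_t pc, int taken)
{
    uint8_t *c = &bht[index_of(pc)];

    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}
```

The 2-bit hysteresis is what makes it work for loops: the single not-taken outcome at loop exit doesn't flip the prediction for the next run of the loop.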
« Last Edit: January 10, 2020, 01:29:05 am by SiliconWizard »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2850
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #89 on: January 10, 2020, 01:50:28 am »
I'm looking  at adding really simple interrupts to my design. Nothing fancy, just a single interrupt.

Has anybody found a concise discussion of how interrupts should work in RISC-V? Reading through the privileged ISA spec, it seems the following CSRs come into play....

mstatus : Machine Status register
mtvec  : Machine trap-handler base address
mip  : Machine Interrupt Pending
mie : Machine Interrupt Enable
mepc :  Machine Exception Program Counter
mcause :  Machine Cause Register

Does anybody know of a simple how-to describing it?
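For what it's worth, my reading of the privileged spec is that machine-mode interrupt entry and MRET boil down to roughly this (sketched in C; the Hart struct and field names are made up, mstatus.MIE/MPIE are modelled as plain ints, and only "direct" mtvec mode is shown):

```c
#include <stdint.h>

/* Illustrative model of M-mode interrupt entry/return, per my reading of
   the privileged spec. Names are made up for the sketch. */
typedef struct {
    uint32_t pc, mepc, mcause, mtvec;
    int mie, mpie;                    /* mstatus.MIE / mstatus.MPIE bits */
} Hart;

void take_interrupt(Hart *h, uint32_t cause)
{
    h->mepc   = h->pc;                /* save the interrupted PC        */
    h->mcause = (1u << 31) | cause;   /* top bit set marks "interrupt"  */
    h->mpie   = h->mie;               /* stash the previous enable bit  */
    h->mie    = 0;                    /* mask further interrupts        */
    h->pc     = h->mtvec & ~3u;       /* vector to handler, direct mode */
}

void mret(Hart *h)                    /* what the MRET instruction does */
{
    h->mie  = h->mpie;
    h->mpie = 1;
    h->pc   = h->mepc;
}
```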


Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #90 on: January 10, 2020, 03:02:49 am »
I'm looking  at adding really simple interrupts to my design. Nothing fancy, just a single interrupt.

Has anybody found a concise discussion of how interrupts should work in RISC-V? Reading through the privileged ISA spec, it seems the following CSRs come into play....

Interested as well.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #91 on: January 12, 2020, 03:55:45 pm »
Have done more intensive tests, including testing single- and double-precision FP (GCC) emulation on a simulated RV32IM. (That helped me find and fix a couple of other bugs.)
It's starting to be relatively stable. (Until next bug of course ;D )

Then I added emulation of MMIO, and on top of that simple emulation of console output (no input yet - coming!)
I can use printf() from the C lib. As I'm using my own startup code, I had to implement a few C I/O functions (only _write() does anything for now; all the others are stubs.)
Below is the code if anyone's interested (nothing fancy, but it could save you a few minutes of looking things up.)
Code: [Select]
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stdbool.h>

#include <unistd.h>
#include <errno.h>
#include <sys/stat.h>

#include "Tests_MMIO.h"

////////////////////////////////////////////////////////////////////////////////

int _read(int file, char *data, int len)
{
    return 0;
}

////////////////////////////////////////////////////////////////////////////////

int _close(int file)
{
    return 0;
}

////////////////////////////////////////////////////////////////////////////////

int _lseek(int file, int ptr, int dir)
{
    return 0;
}

////////////////////////////////////////////////////////////////////////////////

int _fstat(int file, struct stat *st)
{
    return 0;
}

////////////////////////////////////////////////////////////////////////////////

int _isatty(int file)
{
    return 0;
}

////////////////////////////////////////////////////////////////////////////////

int _write(int file, char *data, int len)
{
    if ((file != STDOUT_FILENO) && (file != STDERR_FILENO))
    {
        errno = EBADF;
        return -1;
    }

    for (int i = 0; i < len; i++)
        *TESTS_MMIO_CONSOLEOUT_PTR = data[i];

    return len;
}

////////////////////////////////////////////////////////////////////////////////

The obligatory Hello World with just
Code: [Select]
int main(void)
{
printf("Hello World!\n");

return 0;
}
gives this:

Code: [Select]
CPUEmu: CPUEmu-RISCV (0.1.6) on CPUEmu (0.1.2)
CPU Variant: RV32IM
Binary file '..\..\Tests\RISCV_Code\GCC-Build\Test8.bin' loaded.
Running simulation...
---------------------------------Emulated Console-------------------------------
Hello World!

--------------------------------------------------------------------------------
Simulation completed in 0.000419 s.
Clock Counter = 5136
Instruction Counter = 3732
CPI = 1.376
Branch Counter = 753
Taken Branch Counter = 687

 :-+

(The RISC-V binary file is about 14KB.)
« Last Edit: January 12, 2020, 04:00:13 pm by SiliconWizard »
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #92 on: January 14, 2020, 12:03:03 am »
@legacy:

Code: [Select]
CPUEmu: CPUEmu-RISCV (0.1.7) on CPUEmu (0.1.2)
CPU Variant: RV32IM
Binary file '..\..\Tests\RISCV_Code\GCC-Build\Test11.bin' loaded.
Running simulation...
---------------------------------Emulated Console-------------------------------

check ctypes: success

mycalc v2.1 24/04/2002, 24/05/2004, 24/06/2014, 24/10/2019
iset={@,+,-,*,/,!,=,?,<,>,a..z,A..Z,0..9}
    type '@' to exit
> a=10
10

> b=5
5

> c=-30
-30

> a*b+c
20

> d=a*c+b
-295

> d/c
9

> @
byebye


--------------------------------------------------------------------------------
Simulation completed in 0.008811 s.
Clock Counter = 46044
Instruction Counter = 33980
CPI = 1.355
Branch Counter = 6723
Taken Branch Counter = 4603

 ;D

Actually, thanks for this little prog: it helped me find and fix a nasty issue with load-use hazards affecting the JALR instruction. Your prog uses big switches, which GCC compiles into branch tables (even at -O0), and that helped expose the issue!
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #93 on: January 15, 2020, 05:53:37 pm »
Alright, moving forward...
I got CoreMark running on my simulator, which was a nice achievement. (It helped me find a silly bug in the SLT instruction!)

https://www.eembc.org/coremark
You can get the code here: https://github.com/eembc/coremark

To port it to a target, you just need to adjust the core_portme.c and core_portme.h files. It basically just requires adapting the timing functions and a few other defines. (I'm using the CSR cycle register, as I just implemented the Zicsr extension.)
So for timing, it looks like this:
Code: [Select]
inline __attribute__((always_inline)) uint32_t ReadCycle32(void)
{
    uint32_t nCycle;

    __asm__ volatile ("rdcycle %0" : "=r" (nCycle));

    return nCycle;
}

#define GETMYTIME(_t) (*_t = ReadCycle32())

I get ~2.57 CoreMark/MHz, with an average CPI of ~1.3. I guess I could get a somewhat better score with branch prediction; there is none yet.
I'd be curious to see what kind of score you can get with a real RISC-V CPU. So if anyone here can try this...

There are a few scores for RISC-V on eembc's website, for Andes processors and the SiFive HiFive Unleashed. For the latter there are actually two scores (1 and 4 threads); my own test is of course single-threaded at this point. Interestingly, the 1-thread score for the HiFive Unleashed is 2.01, so my core (simulated, but as I said, cycle-accurate) seems to do better... :P (One thing to consider: mine has no memory access penalty, since it simulates 1-cycle memory, so I'm guessing the difference lies in the HiFive using DDR RAM and caches? Although the data memory this benchmark uses is very small, so everything should fit in cache...)

Would be interesting to see more results with the FE310 for instance, and the GigaDevice MCU if anyone has a dev board...

 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #94 on: January 15, 2020, 06:12:17 pm »
FE310-G000 does 2.73 Coremark/MHz. The U54 in the HiFive Unleashed does 3.01 not 2.01. Maybe a typo in the page you saw?

Actually, I get a couple of percent better than the published numbers on the HiFive1 if I enable -msave-restore which uses short routines to save registers on function entry and restore them on exit instead of inline code. It uses extra instructions but makes more of Coremark fit into the 16 k L1 instruction cache.

Single-cycle memory is certainly a huge advantage. SiFive cores have 2 cycle access time for L1 or SRAM (3 cycle for subword loads).
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #95 on: January 15, 2020, 06:27:24 pm »
FE310-G000 does 2.73 Coremark/MHz. The U54 in the HiFive Unleashed does 3.01 not 2.01. Maybe a typo in the page you saw?

Actually, I get a couple of percent better than the published numbers on the HiFive1 if I enable -msave-restore which uses short routines to save registers on function entry and restore them on exit instead of inline code. It uses extra instructions but makes more of Coremark fit into the 16 k L1 instruction cache.

Single-cycle memory is certainly a huge advantage. SiFive cores have 2 cycle access time for L1 or SRAM (3 cycle for subword loads).

Thanks for the info!
Well, the scores can be found here: https://www.eembc.org/coremark/scores.php
It's 2.01 (listed as 3020.46 @ 1500 MHz, so unless the raw score itself is wrong, the ratio looks correct) for 1 thread and 8.02 for 4 threads, so it looked at least somewhat consistent? Apparently the scores were obtained with CoreMark compiled at -O2; maybe that could explain the difference, dunno if you used -O3? But it could be a typo indeed... (If so, you may want to report it, as EEMBC gives "official" scores.)

So, that the U54 does better than my core seems more logical. ;D
I think the U54 also has a 5-stage pipeline, but with branch prediction and out-of-order execution (mine has neither so far)? So that should make quite a bit of difference.
Still, I'm happy so far to get something reasonable with a relatively "simple" design.
« Last Edit: January 15, 2020, 06:37:47 pm by SiliconWizard »
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #96 on: January 16, 2020, 02:41:34 pm »
I'm looking at adding really simple interrupts to my design. Nothing fancy, just a single interrupt.

Has anybody found a concise discussion on how interrupts should work in RISC-V? Reading through the privileged ISA spec it seems the following CSRs come into play....

Am working on that now.
Whereas the unprivileged ISA spec is clear and easy to follow, I find the privileged one to be a lot harder to get.

This doc helps clear things up a bit:
https://sifive.cdn.prismic.io/sifive/0d163928-2128-42be-a75a-464df65e04e0_sifive-interrupt-cookbook.pdf

You can also watch this: https://www.youtube.com/watch?v=QFPQ_kTsbtw
which is a discussion of future improvements to interrupt handling.

And now a question (for Bruce?):
is any part of the privileged architecture mandatory for any RISC-V core (even just, say, RV32I)? If so, and if I understand things correctly, is the "machine" level the minimum that must be implemented?

« Last Edit: January 16, 2020, 02:44:45 pm by SiliconWizard »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #97 on: January 16, 2020, 03:27:41 pm »
And now a question (for Bruce?):
is any part of the privileged architecture mandatory for any RISC-V core (even just, say, RV32I)? If so, and if I understand things correctly, is the "machine" level the minimum that must be implemented?

If you want to call something "RISC-V" then it is only necessary to be able to run User ISA instructions. You are explicitly allowed to have a different (or no) privileged architecture.

Until very recently (v2.2) it was required to implement FENCE.I and the ability to read time/cycle/instructions retired CSRs, but in the ratified version of the ISA these have been made optional extensions.

If you're making actual hardware, then you need some kind of Machine level, but you might run User mode RISC-V software under some non RISC-V Machine level (x86, POWER, something custom) or in an emulator.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #98 on: January 16, 2020, 03:40:05 pm »
OK, thanks for this. Implementing a different privileged arch could be interesting for some particular applications.

I guess one benefit of implementing at least some of the privileged arch is being able to take advantage of the support in the software toolchain. For instance, the "interrupt" GCC function attribute, which automatically saves the registers used in the handler function and returns with an "mret" instruction. If you don't implement this in hardware, then you're pretty much on your own for handling interrupts and exceptions in general.

But yes, as I said, I find the privileged spec to be harder to follow, and information more spread out. Dunno what's your opinion on this?

Just a tiny related remark: any part of the privileged arch requires the Zicsr extension as far as I've seen, but Zicsr is usually not mentioned. Most non-"G" CPUs (for "G" it's implicit) don't seem to list Zicsr among their supported extensions, even when it's actually supported. I know I'm nitpicking; maybe it's just because this extension was only relatively recently separated from the base ISA?

And another question now that I'm starting to get the whole thing a bit better: (I'll take the machine level as an example, but it's the same at any level)
The spec says the mepc CSR is supposed to hold the PC of the instruction that caused the exception, or of the instruction that was interrupted. Then the mret instruction in the end is supposed to restore the current PC from mepc. So this would mean that returning from an exception jumps to the same instruction again. (For interrupts, it means the interrupted instruction will be executed again, meaning that if it's interrupted, it must NOT get to the execute stage!) Am I getting it right? If so, how are you supposed to return from exception handlers, if that even makes sense, since it would just return to the same instruction again and again?

« Last Edit: January 16, 2020, 04:11:36 pm by SiliconWizard »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #99 on: January 16, 2020, 05:44:39 pm »
But yes, as I said, I find the privileged spec to be harder to follow, and information more spread out. Dunno what's your opinion on this?

Maybe. As a compiler kind of person I've paid much more attention to the unprivileged ISA :-)

Quote
Just a tiny related remark: any part of the privileged arch requires the Zicsr extension as far as I've seen, but Zicsr is usually not mentioned. Most non-"G" CPUs (for "G" it's implicit) don't seem to list Zicsr among their supported extensions, even when it's actually supported. I know I'm nitpicking; maybe it's just because this extension was only relatively recently separated from the base ISA?

I think so, yes, as that's *very* recent.

Quote
The spec says the mepc CSR is supposed to hold the PC of the instruction that caused the exception, or of the instruction that was interrupted. Then the mret instruction in the end is supposed to restore the current PC from mepc. So this would mean that returning from an exception jumps to the same instruction again. (For interrupts, it means the interrupted instruction will be executed again, meaning that if it's interrupted, it must NOT get to the execute stage!) Am I getting it right? If so, how are you supposed to return from exception handlers, if that even makes sense, since it would just return to the same instruction again and again?

It's important to be able to try the same instruction again, especially for things such as page faults, but there are also other kinds of things that you can fix up and retry.

It's much much easier to figure out the instruction length and skip it before returning than it would be to try to work backwards to find the start of the previous instruction.

In the case of asynchronous interrupts, I believe the definition is that interrupts are checked for before the execution of the current instruction, so there is no concern of executing it twice.

If you look at for example https://github.com/riscv/riscv-pk/blob/master/machine/misaligned_ldst.c you will see that misaligned_load_trap() calculates insn_t insn = get_insn(mepc, &mstatus); uintptr_t npc = mepc + insn_len(insn); and then later write_csr(mepc, npc);

i.e. it emulates the misaligned load (if the hardware trapped on it, which some will and some won't), and then returns to the next instruction.
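The length calculation Bruce mentions is trivial in the base encoding, since the low bits of every instruction word encode its size. A minimal sketch of that test in C (this is illustrative, not riscv-pk's actual code, and it ignores the reserved 48/64-bit formats):

```c
#include <stdint.h>

/* In the standard RISC-V encoding, instructions whose two low bits are
 * not 0b11 are 16-bit compressed; everything else in RV32/64GC is
 * 32-bit. (Reserved longer encodings, signalled by more low bits set,
 * are ignored here.) */
static int insn_len(uint32_t insn)
{
    return (insn & 0x3) == 0x3 ? 4 : 2;
}
```

So mepc + insn_len(insn) skips exactly the trapping instruction, whether or not it was compressed.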
 
The following users thanked this post: SiliconWizard

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #100 on: January 17, 2020, 02:50:26 am »
It's important to be able to try the same instruction again, especially for things such as page faults, but there are also other kinds of things that you can fix up and retry.

It's much much easier to figure out the instruction length and skip it before returning than it would be to try to work backwards to find the start of the previous instruction.
(...)

Yes, I realized mepc is writable... so you can change it to whatever fits your requirements before returning.

Now that makes me think about something... I suppose the "atomicity" of CSR operations means that, in a pipelined core, a CSR write requires flushing the pipeline before executing the following instruction; otherwise you may get data hazards. And I suppose bypassing CSRs would be completely out of the question given how many there are.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #101 on: January 17, 2020, 03:09:09 am »
Yes, the E31, S51, U54 manuals all state:

"Most CSR writes result in a pipeline flush with a five-cycle penalty."

On the 7-series it's a seven-cycle penalty.
 
The following users thanked this post: SiliconWizard

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #102 on: January 20, 2020, 03:55:35 pm »
Exceptions/traps are now implemented. That was not as easy as I expected. Exceptions in a pipelined CPU are a bit of a bitch.
For anyone interested, here is the small first code I used to test them:

Code: [Select]
_start:
        /* Testing Exceptions. */
        la      a1, ExceptionHandler
        csrw    mtvec, a1

        li      a0, 10          /* Number of times the exception will be raised */

        rdcycle s0

        sret                    /* This should raise an 'illegal instruction' exception here */

        rdcycle s1
        sub     a1, s1, s0      /* a1 <- number of elapsed cycles for the repeated exception */

        ebreak

ExceptionHandler:
        addi    a0, a0, -1
        bnez    a0, EH_End
        csrr    s2, mepc
        addi    s2, s2, 4
        csrw    mepc, s2
EH_End:
        mret

So, it basically forces an exception through the use of an instruction that is illegal where it is (sret). To better exercise the whole exception handling and the pipeline, the exception handler will return to the same instruction 10 times and finally will return to the next instruction (adding 4 to mepc), which will eventually finish execution. The total number of cycles for the 10 exceptions is counted with rdcycle. In the end, I get 122 cycles for the 10 exceptions.

During the process, I've been wondering about nested exceptions. Interrupts are automatically disabled upon trapping, so you get a chance to save mepc/mstatus before re-enabling them if you want to implement nested interrupts. There's no such provision for exceptions, though: if an exception is raised in an exception handler *before* you get a chance to save mepc/mstatus (maybe unlikely, but not impossible), the situation is unrecoverable. I've read a couple articles that confirm the issue. I'm wondering how that should be handled?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #103 on: January 20, 2020, 05:53:05 pm »
During the process, I've been wondering about nested exceptions. Interrupts are automatically disabled upon trapping, so you get a chance to save mepc/mstatus before re-enabling them if you want to implement nested interrupts. There's no such provision for exceptions, though: if an exception is raised in an exception handler *before* you get a chance to save mepc/mstatus (maybe unlikely, but not impossible), the situation is unrecoverable. I've read a couple articles that confirm the issue. I'm wondering how that should be handled?

Be very careful writing exception handlers :-) Or at least the first few instructions of them. Preferably use a shared and well debugged entry sequence or copy and paste it from a reference source.

Bugs in M-mode software are always very very bad. That's why we write most code in U-mode, when it exists.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #104 on: January 20, 2020, 07:02:54 pm »
During the process, I've been wondering about nested exceptions. Interrupts are automatically disabled upon trapping, so you get a chance to save mepc/mstatus before re-enabling them if you want to implement nested interrupts. There's no such provision for exceptions, though: if an exception is raised in an exception handler *before* you get a chance to save mepc/mstatus (maybe unlikely, but not impossible), the situation is unrecoverable. I've read a couple articles that confirm the issue. I'm wondering how that should be handled?

Be very careful writing exception handlers :-) Or at least the first few instructions of them. Preferably use a shared and well debugged entry sequence or copy and paste it from a reference source.

Ahah, yep. But the exception could be caused not by a software bug, but something else. Even if it's kind of a corner case, it doesn't make me feel fully comfortable.

Some CPUs handle this case as "double faults" (which usually either stops the CPU completely, or jumps to a predefined address.) I'm considering adding this.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #105 on: January 20, 2020, 07:34:08 pm »
If you swap mscratch with SP (or some other register) (which is a single instruction), decrement it by the size of some registers, save the registers there (where the value kept in mscratch is a pointer to preferably some small SRAM area), and grab some or all of mstatus/mepc/mcause/mtval .. and something goes wrong with this ... your basic CPU core is pretty seriously borked.

Taking a trap to some other vector is probably not going to help. Shutting down or looping forever is almost certainly the best you can do.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #106 on: January 21, 2020, 03:01:02 pm »
If you swap mscratch with SP (or some other register) (which is a single instruction), decrement it by the size of some registers, save the registers there (where the value kept in mscratch is a pointer to preferably some small SRAM area), and grab some or all of mstatus/mepc/mcause/mtval .. and something goes wrong with this ... your basic CPU core is pretty seriously borked.

Taking a trap to some other vector is probably not going to help. Shutting down or looping forever is almost certainly the best you can do.

I'll think about it. Several CPUs actually implement what I mentioned. Some (at least x86, which you may not label as a good architecture ;D ) actually have a provision for double fault handlers, and triple faults basically generate a CPU reset.

Stopping or preferably resetting the CPU, even in case of double faults, looks like a simple and reasonable way to handle it, so I'll implement that first. (Not doing anything about it if it ever happens still doesn't look like a good idea to me. The infamous "it should never happen" often leads to catastrophic events.  ;D )

So what I'm thinking is implementing a number of "reset sources", including watchdogs and double faults (nested exceptions). I may make all of them optional through a custom CSR. Or not.
« Last Edit: January 21, 2020, 03:02:51 pm by SiliconWizard »
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #107 on: January 28, 2020, 11:29:16 pm »
I've added interrupt support, and also added peripheral simulation in separate thread(s) so 1/ peripheral simulation doesn't hinder the overall simulation speed and 2/ it's easier to properly simulate asynchronous events.

I'll give one of the small tests (assembly) I devised (with just some constants slightly renamed), again for anyone interested. It uses the machine timer (that I've simulated), and will interrupt a loop 10 times, then the interrupts get disabled and it's done.

Code: [Select]
/* Testing Interrupts (Machine Timer). */

        /* Set exception handler */
        la      a1, ExceptionHandler
        csrw    mtvec, a1

        /* Disable Machine-level timer interrupts */
        li      t0, CSR_MIE_MTIE
        csrc    mie, t0

        /* Set mtimecmp to mtime + PERIOD (next interrupt) */
        li      a1, MTIME_ADDR
        lw      a2, 0(a1)
        addi    a2, a2, PERIOD
        li      a1, MTIMECMP_ADDR
        sw      a2, 0(a1)

        li      a0, 10          /* Number of times the Machine Timer interrupt will be handled */

        /* Enable Machine-level interrupts */
        csrs    mie, t0
        csrsi   mstatus, CSR_MSTATUS_MIE

        rdcycle s0

Loop:
        bnez    a0, Loop        /* Loop until a0 == 0, which means the interrupt has fired 10 times */

        rdcycle s1

        /* Disable Machine-level interrupts */
        csrc    mie, t0

        sub     a1, s1, s0      /* a1 <- number of elapsed cycles */

        ebreak

ExceptionHandler:
        csrr    s2, mcause
        bgez    s2, EH_End      /* Not an interrupt */

        addi    a0, a0, -1

        /* Set mtimecmp to mtime + PERIOD (next interrupt) */
        li      a1, MTIME_ADDR
        lw      a2, 0(a1)
        addi    a2, a2, PERIOD
        li      a1, MTIMECMP_ADDR
        sw      a2, 0(a1)

EH_End:
        mret

One of the next steps will be to translate the simulated core to HDL. But before I do that, I'd like to get some kind of branch prediction done. I'm suspecting branch prediction might need some changes in the pipeline that I don't want to have to translate several times... so I'd like to get this right first.

So yeah, I'm currently studying branch prediction and trying to choose something effective enough, yet simple. For both simplicity and security, I am ruling out anything that would lead to instructions getting to the execution stage or later, and then "undo" things if said instructions are not to be executed. Way too slippery for my taste, so the latest stage I'll let them go is the decode stage. (Not trying to go for high performance either, just something reasonable.) Speculative execution, we all know where that can lead. Of course since I'm designing a 5-stage pipeline and don't really intend on pushing it deeper, it's doable. When you have a 20-stage pipeline or deeper, it's pretty hard to avoid speculative execution without harming performance a lot. Not my area.
 

Offline andersm

  • Super Contributor
  • ***
  • Posts: 1198
  • Country: fi
Re: The RISC-V ISA discussion
« Reply #108 on: January 29, 2020, 12:37:26 am »
I'm assuming you've read your Hennessy & Patterson? It has pretty in-depth coverage of branch predictors. Strictly speaking, you've already implemented a static no-branch-taken prediction scheme. A bit more interesting might be statically predicting backwards branches taken, and forwards branches not taken, which was used in some old CPUs. However, IIRC all static schemes turn out to add very little performance, and you really need something dynamic if it's going to be useful.
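For reference, the static backwards-taken/forwards-not-taken rule is a one-liner, decided from the branch direction alone (function name is mine, just to illustrate the idea):

```c
#include <stdbool.h>
#include <stdint.h>

/* Static BTFN ("backward taken, forward not taken") prediction:
 * loop-closing branches jump backwards, so guess those as taken;
 * forward branches (typically if/else skips) as not taken. */
static bool btfn_predict_taken(uint32_t branch_pc, uint32_t target_pc)
{
    return target_pc < branch_pc;   /* backward branch => predict taken */
}
```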

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #109 on: January 29, 2020, 02:11:59 am »
I'm assuming you've read your Hennessy & Patterson? It has pretty in-depth coverage of branch predictors.

I've read parts of it, and quite a few articles on the topic.

Strictly speaking, you've already implemented a static no-branch-taken prediction scheme.

Yes, that's what I noted a few posts earlier. On "general" code, it's actually not that bad. Of course it falls short on tight loops.

A bit more interesting might be statically predicting backwards branches taken, and forwards branches not taken, which was used in some old CPUs. However, IIRC all static schemes turn out to add very little performance, and you really need something dynamic if it's going to be useful.

Yes I realize that. What I'm really wondering about though, and that I'll be able to test in the simulator before even considering implementing it in HDL, is the impact of a good branch predictor in a relatively short pipeline. Given that branches are taken at the execute stage in my 5-stage pipeline, the penalty is relatively small (I lose 2 cycles for taken branches), so it becomes significant only for tight loops/series of branches with just a few instructions in between. A typical worst case is a branch instruction looping to itself: it will have a CPI = 3, whereas it could be just 1 with even a simple branch predictor. But with "real-life" code, the impact is much less on average. So I need to figure out whether it's even worth it compared to how much additional logic it would take (and the potential bugs), knowing that I'm not really looking to design high-performance cores, but just something reasonable.


 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2850
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #110 on: February 03, 2020, 03:39:52 am »
I've just found "Clarvi"  - https://github.com/ucam-comparch/clarvi

It's by the Computer Architecture Group, University of Cambridge Computer Laboratory, and looks to be a very nice implementation to refer to. As per the README.md:

Quote
This is an RV32I core. It provides a minimum implementation of the v1.9 RISC-V privileged specification, including full support for interrupts and exceptions.

Lines 383 to 470 of clarvi.sv contain the core of the trap/interrupt/exception handling, which has proven to be quite enlightening.
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 
The following users thanked this post: oPossum

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #111 on: February 06, 2020, 08:18:08 pm »
I've now completed a basic dynamic branch predictor with a classic 2-bit saturating counter, 2^n entries indexed by the n LSBs of the PC of the branch instructions (well, actually of the PC shifted right by 1 bit if the C extension is supported, and 2 bits otherwise, since those bits would always be zero). It's a tagged table, so the whole PC address is checked during prediction, and it also stores the target address.
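In C-like pseudo-hardware, a minimal sketch of such a tagged, direct-mapped table might look like this (sizes and names are mine, not the actual implementation; no C extension here, so the index drops 2 low PC bits):

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_BITS 10
#define BTB_SIZE (1u << BTB_BITS)

/* One entry: the full PC is kept as a tag, so a hit is guaranteed to be
 * a branch, with the target stored alongside a 2-bit saturating counter
 * (0 = strongly not taken ... 3 = strongly taken). */
struct btb_entry {
    bool     valid;
    uint32_t pc;       /* tag: full branch address */
    uint32_t target;   /* predicted destination */
    uint8_t  counter;  /* 2-bit saturating counter */
};

static struct btb_entry btb[BTB_SIZE];

static unsigned btb_index(uint32_t pc)
{
    return (pc >> 2) & (BTB_SIZE - 1);
}

/* Returns true (and the target) only on a tag hit predicted taken. */
static bool btb_lookup(uint32_t pc, uint32_t *target)
{
    struct btb_entry *e = &btb[btb_index(pc)];
    if (e->valid && e->pc == pc && e->counter >= 2) {
        *target = e->target;
        return true;
    }
    return false;
}

/* After resolving a branch at the execute stage, install or refresh
 * its entry; a mismatching tag evicts the old entry to a weak state. */
static void btb_update(uint32_t pc, uint32_t target, bool taken)
{
    struct btb_entry *e = &btb[btb_index(pc)];
    if (!e->valid || e->pc != pc) {
        e->valid   = true;
        e->pc      = pc;
        e->target  = target;
        e->counter = taken ? 2 : 1;   /* weakly taken / weakly not taken */
    } else {
        e->target = target;
        if (taken) { if (e->counter < 3) e->counter++; }
        else       { if (e->counter > 0) e->counter--; }
    }
}
```

On a miss or a not-taken prediction, the fetch stage just falls back to the static next-instruction behavior.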

I've tested it with a number of test and benchmark code. On CoreMark, I get the following results:
- Branch predictor disabled: ~2.56 CoreMark/MHz, average CPI = 1.313
- Branch predictor enabled: ~3.00 CoreMark/MHz, 89.0% of correct branch predictions (pretty typical for a bimodal predictor), 0.6% of mispredicted target addresses, average CPI = 1.112

Pretty happy with the results already.

I haven't implemented a return stack buffer yet, so "returns" are very likely to have a mispredicted target address. CoreMark doesn't make heavy use of function calls here. But I've tested it on a very simple, yet very taxing test for function calling (the Takeuchi function), and I get the following: (called with: Takeuchi(18, 12, 6), which BTW should return 7. If not, your compiler or processor is borked.)

- Branch predictor disabled: average CPI = 1.243
- Branch predictor enabled: 61.8% of correct branch predictions, 27.2% of mispredicted target addresses, average CPI = 1.111

Interesting stuff.

For anyone curious, this is the Takeuchi function:

Code: [Select]
int Takeuchi(int x, int y, int z)
{
    if (x > y)
        return Takeuchi(Takeuchi(x - 1, y, z),
                        Takeuchi(y - 1, z, x),
                        Takeuchi(z - 1, x, y));
    else
        return z;
}

Next step, I'm going to implement some kind of return stack buffer in the predictor and see if this is worth it.
But apart from particular cases like this, I get an average speed-up (on typical code) of 15%-20%. Not too shabby.
« Last Edit: February 06, 2020, 08:23:30 pm by SiliconWizard »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2850
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #112 on: February 06, 2020, 08:37:04 pm »
I've now completed a basic dynamic branch predictor with a classic 2-bit saturating counter, 2^n entries indexed by the n LSBs of the PC of the branch instructions (well, actually of the PC shifted right by 1 bit if the C extension is supported, and 2 bits otherwise, since those bits would always be zero). It's a tagged table, so the whole PC address is checked during prediction, and it also stores the target address.

What size values of 'n' are you using?

Do the tags help much?

If you don't have tags, then you might pollute the branch predictor tables when two branches have the same LSBs. But even then they will only fight half the time (if both branches are usually taken, or both usually not taken, the prediction is still correct).

However, if you do have tags, then on every tag mismatch the branch predictor throws out valid information and replaces it with a guessed "weakly taken" or "weakly not taken" state.

Wouldn't the bits be better spent making the branch predictor table larger, reducing the chance of a collision?

Using stale data (from an older branch at a matching address) only requires two misses to change the behavior, so it doesn't seem that expensive, and the extra simplicity might be helpful.

Another question... When you first hit a branch, are you guessing that backwards jumps are "weakly taken", and forward jumps are "weakly not taken"?

(Asking because I'm looking into branch prediction at the moment....)
« Last Edit: February 06, 2020, 08:40:00 pm by hamster_nz »
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #113 on: February 06, 2020, 09:56:42 pm »
I've now completed a basic dynamic branch predictor with a classic 2-bit saturating counter, 2^n entries indexed by the n LSBs of the PC of the branch instructions (well, actually of the PC shifted right by 1 bit if the C extension is supported, and 2 bits otherwise, since those bits would always be zero). It's a tagged table, so the whole PC address is checked during prediction, and it also stores the target address.

What size values of 'n' are you using?

At this point, 10 bits (so 1024 entries). I've experimented a bit, and there was still a significant improvement going up to 10 bits, but beyond that the improvement was completely negligible. That's for a tagged version of the predictor, though; so far, I haven't experimented much without tags. Of course, to get a better idea, I'd probably have to evaluate it on a much larger number of test cases...

Do the tags help much?

For up to 2^10 entries, definitely, at least in my first experiments. Without tags, on anything but the smallest pieces of code, the misprediction rate would increase a lot, and it seems that it's often a lot worse than just static prediction.

I haven't tested for a much larger number of entries, so I can't tell where the sweet spot would lie. I'd have to experiment more.

Keep in mind that the predictor must predict not only the direction (taken or not taken) but also the target address. Some studies have shown that direction itself could be predicted pretty well without tags (and the first, simple predictors only did that, I think) and a relatively small number of entries, but target addresses are another story.

Wouldn't the bits be better spent making the branch predictor table larger, reducing the chance of a collision?

It doesn't look like it from my (again limited) testing so far. Besides, even though dropping the tags would make each entry smaller, having to deal with more entries (meaning, having to address more lines) may make the table slower to access (not sure)? And we're still talking about relatively small buffers here (a few KBytes). I don't think that'd be worth optimizing unless you were going for a very low-area core (in which case, I don't think I would go for a 5-stage pipeline anyway...)

But I may experiment further with this.

Another question... When you first hit a branch, are you guessing that backwards jumps are "weakly taken", and forward jumps are "weakly not taken"?

Nope. When a branch is not predicted yet, it defaults to the static behavior of "branch not taken" - meaning it just fetches the next instruction.
I had thought about doing just what you say, but I'm really not sure it's worth it in practice once you have a dynamic predictor, and it would still require an extra address comparator and more...

As I said, I'm not looking to design the best prediction possible (Intel does that pretty well, although it's not always without faults). What I'm getting at this point is pretty satisfactory. The only thing that might add some more speed-up would be to improve indirect branches such as returns from calls (although I've read about exploits with return stack buffers - see SpectreRSB - so I'm also evaluating what is safe to implement in terms of security and what is not.)

If you're going for a different approach, don't hesitate to report back so we can see what really makes a difference and what doesn't.

 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2850
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #114 on: February 06, 2020, 10:23:33 pm »
Keep in mind that the predictor must predict not only the direction (taken or not taken) but also the target address.

Thinking out loud, for the RISC-V ISA, is this only true for JALR?

JAL is always taken, and along with BEQ, BNE, BLT, BGE, BLTU and BGEU it uses an immediate offset from the current instruction's address, so the target address can be calculated during decode?

But JALR is a big problem as it is the call and return instruction, and for implementing jump tables and so on.
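Right - for the conditional branches the target offset sits (scrambled) in the instruction word itself. As a sketch, extracting the B-type immediate in C (bit positions per the base encoding; this is illustrative, not anyone's actual decoder):

```c
#include <stdint.h>

/* Extract the B-type branch immediate (in bytes, sign-extended) from a
 * 32-bit RISC-V instruction word. The bits are scattered as
 * imm[12|10:5] in insn[31:25] and imm[4:1|11] in insn[11:7]. */
static int32_t b_type_imm(uint32_t insn)
{
    int32_t imm = (int32_t)(insn & 0x80000000u) >> 19;  /* imm[12], sign-extended */
    imm |= (insn & 0x00000080u) << 4;                   /* imm[11]   */
    imm |= (insn >> 20) & 0x7e0;                        /* imm[10:5] */
    imm |= (insn >> 7) & 0x1e;                          /* imm[4:1]  */
    return imm;
}
```

The target is then just the branch's PC plus this immediate, available as soon as the instruction is decoded.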
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #115 on: February 06, 2020, 10:46:34 pm »
Keep in mind that the predictor must predict not only the direction (taken or not taken) but also the target address.

Thinking out loud, for the RISCV ISA, is this only true for JALR?

JAL is always taken, and along with BEQ, BNE, BLT, BGE, BLTU and BGE are all immediate offsets from the current instruction's address so the target address can be calculated during the decode?

Technically yes, but this is not how I have implemented things. The target address for all branches is calculated at the execute stage. I've tried to limit the number of special cases to make things easier to implement and easier to validate. Besides, in a 5-stage pipeline, the address of the currently fetched instruction would then potentially depend on the calculated address at the decode stage, which would lengthen the logic paths. All that may not be very favorable for speed.

In the same vein, tags in my case actually help speed as well. Here's why: with tags, I'm sure that a matching entry in the table IS a branch. Without tags, a hit could correspond to any instruction sharing the same lower n bits, which is much more likely than aliasing among branch instructions alone. So without tags, I would need to determine that the instruction is a branch early - at the decode stage, obviously - which would again add a logic dependency, thus lengthening the logic path. So I'm trading some memory for speed.

Of course all this is heavily dependent on the way you implement your pipeline.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #116 on: February 07, 2020, 02:35:28 am »
Keep in mind that the predictor must predict not only the direction (taken or not taken) but also the target address. Some studies have shown that direction itself could be predicted pretty well without tags (and the first, simple predictors only did that, I think) and a relatively small number of entries, but target addresses are another story.

The target address is known (if the branch is taken) for everything except JALR (which isn't conditional).

So the set of things for which a branch prediction is needed is completely disjoint from the set of things for which a branch target prediction is needed, so they are usually handled by completely different data structures.

Implementing "gshare" should massively increase the prediction rate. Keep track of the actual direction of the last n conditional branches (in a shift register), XOR this with the low n bits of the PC, and use this to select the counter.

Don't store the actual PC for the tag. Waste of time. It just forces branches that alias to fight each other every time instead of half the time.
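The scheme described above can be sketched in a few lines of C. This is a toy model - the table size, history length, and names are my own assumptions, not taken from any particular core:

```c
#include <assert.h>
#include <stdint.h>

// Minimal gshare sketch: a global history register of the last N branch
// outcomes is XORed with the low N bits of the PC to index a table of
// 2-bit saturating counters. No tags stored.
#define HIST_BITS 10
#define TABLE_SIZE (1u << HIST_BITS)

static uint8_t counters[TABLE_SIZE];   // 2-bit counters, start at "strongly not taken"
static uint32_t history;               // last HIST_BITS outcomes, newest in bit 0

static uint32_t gshare_index(uint32_t pc) {
    // Word-aligned PCs: drop the low 2 bits before hashing.
    return ((pc >> 2) ^ history) & (TABLE_SIZE - 1);
}

static int predict(uint32_t pc) {
    return counters[gshare_index(pc)] >= 2;   // taken if counter is 2 or 3
}

static void update(uint32_t pc, int taken) {
    uint8_t *c = &counters[gshare_index(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    history = ((history << 1) | (taken & 1)) & (TABLE_SIZE - 1);
}
```

A branch that always behaves the same way saturates its counter within a few executions, and the XOR with the history lets the same static branch use different counters depending on the path that led to it.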
 
The following users thanked this post: hamster_nz

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2850
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #117 on: February 07, 2020, 03:07:07 am »
Keep in mind that the predictor must predict not only the direction (taken or not taken) but also the target address. Some studies have shown that direction itself could be predicted pretty well without tags (and the first, simple predictors only did that, I think) and a relatively small number of entries, but target addresses are another story.

The target address is known (if the branch is taken) for everything except JALR (which isn't conditional).

So the set of things for which a branch prediction is needed is completely disjoint from the set of things for which a branch target prediction is needed, so they are usually handled by completely different data structures.

Implementing "gshare" should massively increase the prediction rate. Keep track of the actual direction of the last n conditional branches (in a shift register), XOR this with the low n bits of the PC, and use this to select the counter.

Don't store the actual PC for the tag. Waste of time. It just forces branches that alias to fight each other every time instead of half the time.

What surprises me in advanced CPU design at this level is how much is done using "almost correct", "approximate solutions" and "rules of thumb" that work "correctly enough" most of the time and are designed to work with imprecise information (like the two-bit predictor for branches), because theoretically more accurate, more correct solutions would be too complex or expensive.

Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #118 on: February 07, 2020, 04:42:59 am »
Yeah, it's weird that "two" seems to be the optimum number of bits for a counter. More just means it takes too long to unwind the prediction when circumstances change.

The Pentium MMX and Pentium Pro suddenly got massively better branch prediction than either their predecessors or their peers (e.g. PowerPC) -- in the range of 98% or better -- and how it happened was top secret for some time. It was just XORing the taken/not-taken history of recent branches with the PC. Almost too simple. A 1024-entry table needs 2*2^10 + 10 = 2058 bits of storage and that's *all*. (1024 entries with tags needs ... 32k bits? Ok, maybe (2 + (32-10)) * 2^10 = 24576.)
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #119 on: February 07, 2020, 01:49:51 pm »
Keep in mind that the predictor must predict not only the direction (taken or not taken) but also the target address. Some studies have shown that direction itself could be predicted pretty well without tags (and the first, simple predictors only did that, I think) and a relatively small number of entries, but target addresses are another story.

The target address is known (if the branch is taken) for everything except JALR (which isn't conditional).

So the set of things for which a branch prediction is needed is completely disjoint from the set of things for which a branch target prediction is needed, so they are usually handled by completely different data structures.

You probably didn't follow what I replied to hamster. There are several reasons I mixed both. (And they are usually, but *not always*, handled separately. While I've studied the state of the art reasonably well, I'm implementing things with a specific set of requirements here.)

- Implementing a scheme that can be potentially reused for other ISAs without major modifications;
- Handling ALL branches in the same way;
- Security (yes, checking that a prediction actually belongs to the current branch instruction and not to another one will prevent a few potential exploit issues, so it doesn't just marginally improve prediction rate, but has another purpose);
- Simplifying logic (at the expense of more memory) - has benefits for design simplicity, verification AND length of logic paths.

As I implemented it, the branch predictor doesn't just predict branch direction (and target), it does so acting as some kind of cache. So basically, at the fetch stage, the corresponding predictor's table line is fetched, and at the decode stage, we have all info needed for branching without adding dependencies on the instruction decoding itself. It limits logic depth.
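The idea can be sketched roughly like this (my own field names and sizes, just a toy model - not my actual implementation): each entry carries the upper PC bits as a tag, so a hit both predicts the branch and proves the fetched instruction really is one.

```c
#include <assert.h>
#include <stdint.h>

#define BTB_BITS 9
#define BTB_SIZE (1u << BTB_BITS)

// One predictor line, fetched in parallel with the instruction itself.
typedef struct {
    uint8_t  valid;
    uint32_t tag;      // upper PC bits: a tag match guarantees this IS a branch
    uint32_t target;   // predicted target address
    uint8_t  counter;  // 2-bit direction counter
} btb_entry;

static btb_entry btb[BTB_SIZE];

// Looked up at the fetch stage; by decode, direction and target are
// available without depending on the instruction decode logic itself.
static const btb_entry *btb_lookup(uint32_t pc) {
    const btb_entry *e = &btb[(pc >> 2) & (BTB_SIZE - 1)];
    if (e->valid && e->tag == (pc >> (2 + BTB_BITS)))
        return e;      // tag match: known branch, prediction usable
    return 0;          // miss: fetch falls through to pc + 4
}

static void btb_update(uint32_t pc, uint32_t target, int taken) {
    btb_entry *e = &btb[(pc >> 2) & (BTB_SIZE - 1)];
    e->valid  = 1;
    e->tag    = pc >> (2 + BTB_BITS);
    e->target = target;
    if (taken) { if (e->counter < 3) e->counter++; }
    else       { if (e->counter > 0) e->counter--; }
}
```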

Implementing "gshare" should massively increase the prediction rate.

I've read a few papers, and the "massively" looks pretty overrated. A couple of papers notably find that on a mix of benchmarks, gshare is slightly better, but nothing drastic.

Given the first results I get with my approach, there doesn't seem to be real room for a "massive increase" anyway, and I'm not looking to extract the last few %. As I said, I'll leave that to Intel, AMD, ARM (maybe SiFive? ;D ). Not my area.

At the beginning, I was even considering no dynamic BP at all. With the one I implemented, I get an average speedup of +15% to +20%, and over 10% even in degenerate cases such as the Takeuchi function, so I'm not convinced at this point that I'm going to try to further improve this. As I said, the only thing I'm considering is a return stack buffer, and even on that I'm not completely decided.

My doubt about the real benefit of RSBs is that except for very small functions (or functions that are called often and return very early, thus not executing much on average), the context saving/restoring on the stack takes a significant number of cycles compared to the return itself, so a misprediction penalty for returns may be relatively negligible. Of course there are always cases where it would make a difference, but I'm not sure it's worth it.

An illustration of this is the simple Takeuchi function test case I showed. You can see that even with a relatively poor prediction rate (still better than static), the average CPI is close to 1.1, which is close to the average CPI I get on a mix of benchmarks. Not only do I consider 1.1 a decent figure here (for my requirements anyway), but it seems to show that branch prediction for function calls specifically may be marginal in many cases compared to branch prediction for all other kinds of branches. Don't hesitate to prove me wrong here, but my tests and the rationale behind them seem to hold.
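For reference, the return stack buffer being weighed here is typically just a small circular stack (a generic sketch under my own assumptions - depth and names are mine, this is not implemented in my core):

```c
#include <assert.h>
#include <stdint.h>

// Return-stack-buffer sketch: calls (jal/jalr writing ra) push the return
// address; returns (jalr through ra) pop it as the predicted target.
// A small circular stack absorbs overflow by overwriting the oldest entries.
#define RAS_DEPTH 8

static uint32_t ras[RAS_DEPTH];
static unsigned ras_top;

static void ras_push(uint32_t return_addr) {
    ras[ras_top++ % RAS_DEPTH] = return_addr;
}

static uint32_t ras_pop(void) {
    return ras[--ras_top % RAS_DEPTH];
}
```

Nested calls pop in reverse order of the pushes, which is exactly the call/return pairing the predictor needs.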

Don't store the actual PC for the tag. Waste of time. It just forces branches that alias to fight each other every time instead of half the time.

My rationale explained above.

As to performance, I've done some more comparative testing today. Without tags, the prediction rate itself, not surprisingly, very slightly increases (but like marginally - something in the order of +0.1%), but the target prediction rate decreases significantly (several %).

Again, this is to be considered with a specific set of requirements and a specific implementation of the pipeline.

Of course, I'll be interested to see real tests of different approaches, and see how they compare, but raw performance is not the only criterion here.

 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #120 on: February 08, 2020, 04:11:29 pm »
I added a test with the following benchmark: https://people.sc.fsu.edu/~jburkardt/c_src/linpack_bench/linpack_bench.html

I haven't implemented F/D extensions yet, so I tested it with emulated floating point on a RV32IM. A lot of function calls for sure.
Of course it's dead slow with emulated FP. ;D

I get the following:
Avg. CPI = 1.030
Correct branch predictions = 87.8%
Mispredicted branch targets = 4.8%

As to the benchmark itself... I get a whopping ~0.62 MFlops (simulated RV32IM clocked at 100 MHz.)

I'd be curious to see results for this benchmark on ARM Cortex CPUs (also with emulated FP of course), if anyone has some time to spare.

Edit: for anyone feeling like trying: the default N value for this benchmark (1000) makes it pretty slow to execute AND requires a lot of RAM (~8 MBytes) - impractical for MCUs unless you have external SDRAM. You can change N to 100; I think 128 KBytes of RAM should do it then (take a look at the memory allocations). It'll be a bit less "accurate", but I get a pretty similar figure, so that'll do for comparing. Another point: you will probably have to comment out the timestamp() function (not useful here, and it uses time functions probably not available on a typical MCU), and port the cpu_time() function. That's all.
« Last Edit: February 08, 2020, 07:26:33 pm by SiliconWizard »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #121 on: February 08, 2020, 10:47:57 pm »
Have you tried my primes benchmark?

http://hoult.org/primes.txt

90% correct branch predictions is of course 90% of the maximum possible gain, so it's fine for a simple processor.

The problem is once you have an out-of-order processor with say ~200 instructions decoded and dispatched into the execution engine, and there is a branch every five or six instructions on average. You have to have very close to perfect branch prediction to have any chance of those 200 instructions being the right 200 instructions.
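A quick back-of-the-envelope model of that: with a branch every ~5.5 instructions, ~200 in-flight instructions span about 36 predictions, so the chance they are ALL correct is accuracy^36 (the numbers here just follow the rough figures above):

```c
#include <assert.h>

// Probability that n consecutive branch predictions are all correct,
// assuming independent predictions at a fixed accuracy.
static double all_correct(double accuracy, int n) {
    double p = 1.0;
    for (int i = 0; i < n; i++)
        p *= accuracy;
    return p;
}
```

At 90% accuracy, 0.9^36 is only about 2%, so the window is almost never entirely right; at 99% it is right roughly 70% of the time - which is why big OoO cores need near-perfect predictors.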
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #122 on: February 09, 2020, 12:05:17 am »
Have you tried my primes benchmark?

http://hoult.org/primes.txt

Yes I have! I get surprisingly good results with it: 97.7% correctly predicted branches.
I get an avg. CPI = 1.235, worse than what I get with most of my other tests, which are now closer to 1.11. I suspect it's mostly explained by stalls due to more load-use hazards than usual. I haven't implemented a specific instrumentation counter for those, but I will - that will confirm or refute this.

90% correct branch predictions is of course 90% of the maximum possible gain, so it's fine for a simple processor.

The problem is once you have an out-of-order processor with say ~200 instructions decoded and dispatched into the execution engine, and there is a branch every five or six instructions on average. You have to have very close to perfect branch prediction to have any chance of those 200 instructions being the right 200 instructions.

I can see how that would become critical for complex OoO CPUs, and even more so with long pipelines. I haven't considered OoO execution yet though, and probably won't.
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2850
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #123 on: February 09, 2020, 03:23:43 am »
I found it pretty interesting to think about how the two-bit saturating counter is 'good enough' for branch prediction.

If you have any loop (e.g. iterating over a list, or "for(i=0;i<10;i++) ..."), after one or two passes it quickly learns the flow through the code, and when the loop finishes, the history of branch prediction doesn't get completely trashed by exiting the loop.

Exiting the loop also removes half of the history (going from 11 to 10, or from 00 to 01), so next time the entry in the table is used it will quickly adapt to the new pattern.  :-+
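The state machine being described is small enough to write out directly (a generic sketch of the standard two-bit counter, not any specific core's version):

```c
#include <assert.h>

// 2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken.
// A single mispredicted loop exit only moves 3 -> 2, so the next time the
// loop is entered, the branch still predicts taken.
static int counter_update(int state, int taken) {
    if (taken) return state < 3 ? state + 1 : 3;
    else       return state > 0 ? state - 1 : 0;
}

static int counter_predict(int state) { return state >= 2; }
```

Running a 10-iteration loop through it shows the behavior above: the exit drops the counter from 11 to 10, and the very next pass already predicts taken again.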

 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #124 on: February 09, 2020, 02:45:25 pm »
I implemented a load-use hazard instrumentation counter, and I can confirm the point with the avg. CPI on Bruce's primes benchmark. I get ~21.5% of load-use hazards (percentage over all executed instructions!). (With CoreMark, for instance, I get ~6.5%. On many other tests I've done, I get less than 5%.)

I've looked at the generated assembly and the culprit seems to lie in this small portion:
Code: [Select]
lw   t5,0(t1)      # load t5
blt  a3,t5,.L7     # reads t5 immediately -> 1-cycle stall
lw   a2,0(a6)      # load a2
ble  a4,a2,.L9     # reads a2 immediately -> 1-cycle stall
Two load-use hazards that involve a 1-cycle stall each.

Looking at the whole assembly code (which is pretty small), I can't really figure out a way to better order instructions to avoid this, so I can't really blame GCC. Maybe there is.
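For reference, the stall being counted here comes from the classic load-use check in a 5-stage pipeline, which can be sketched like this (a generic model with my own field names, not the actual HDL):

```c
#include <assert.h>
#include <stdint.h>

// Load-use hazard check: stall one cycle when the instruction in EX is a
// load and the instruction in ID reads the register being loaded. The
// loaded value only becomes available after MEM, one cycle too late for
// normal forwarding into EX.
typedef struct {
    int     is_load;  // instruction in EX is lw/lh/lb/...
    uint8_t rd;       // its destination register
} ex_stage;

typedef struct {
    uint8_t rs1, rs2; // source registers of the instruction in ID
} id_stage;

static int load_use_stall(ex_stage ex, id_stage id) {
    return ex.is_load && ex.rd != 0 &&      // x0 never creates a dependency
           (ex.rd == id.rs1 || ex.rd == id.rs2);
}
```

In the snippet above, `lw t5,0(t1)` followed by `blt a3,t5,.L7` trips exactly this check: the branch reads t5 one cycle before it can be forwarded.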
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #125 on: February 09, 2020, 02:55:46 pm »
I found it pretty interesting to think about how the two-bit saturating counter is 'good enough' for branch prediction.

If you have any loop (e.g. iterating over a list, or "for(i=0;i<10;i++) ..."), after one or two passes it quickly learns the flow through the code, and when the loop finishes, the history of branch prediction doesn't get completely trashed by exiting the loop.

Exiting the loop also removes half of the history (going from 11 to 10, or from 00 to 01), so next time the entry in the table is used it will quickly adapt to the new pattern.  :-+

Yes, it's really just a compromise between "latency" - the number of executions it takes to get the prediction right - and accuracy. One bit is too little; two bits is the sweet spot (confirmed by many studies and by my own tests). I've tried 3 bits and more: it consistently performs worse on average. (Of course you can always devise very specific examples where more bits would perform better, but the main point is to devise something that works well enough in most situations.) Maybe a little more "intelligence" in the predictor could use some kind of adaptive counter depth. (That might have been done already; I have read quite a few papers, but certainly not all.)

Real "loops" can be predictable at compile time, and even with good predictors, you always have the overhead of the branch instruction itself (which takes one cycle even with the best prediction). To further optimize loops without requiring complex predictors (which would be no better for other cases anyway), you can also implement some kind of hardware loop extension. That's what they did with PULP (you can take a look at their documents to see what they did). Loops will execute N times without the extra branch instruction.
« Last Edit: February 09, 2020, 02:58:11 pm by SiliconWizard »
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #126 on: February 15, 2020, 03:13:35 pm »
OK, now the time has come to start implementing this in HDL.

And now I'm facing something new. Whereas I'm currently interested in implementing core(s) that are relatively simple and, for instance, would not deal with any kind of memory cache (just directly coupled SRAM blocks), my prototyping stage will obviously be done on FPGAs. Problem is, unless I go for a very large FPGA (expensive shit), the amount of embedded block RAM is relatively limited. I can certainly test things, but I'll quickly be constrained when running even moderately large programs.

And of course many dev boards (I already have quite a few) have a nice amount of SDRAM or DDR RAM. Problem is, it's impossible to use them without implementing caches, unless you're ready to have your core run at a very low frequency.

So now my next task is to implement caches.

There would be another possible approach - designing my own dev board with some FPGA + several fast SRAM chips. Not cheap and would require a very large number of IOs in order to be able to access the different SRAM chips concurrently...

And while working on caches, I now fully see why OoO execution would be particularly helpful. This can help a lot keeping the core busy when there are cache misses. Dang, looks like opening some can of worms.

Has any of you already implemented simple CPU cores with some memory caches, and if so what approach did you take and did you evaluate the average penalty (compared to a simple, tightly-coupled memory system)?
« Last Edit: February 15, 2020, 03:17:07 pm by SiliconWizard »
 

Offline hamster_nz

  • Super Contributor
  • ***
  • Posts: 2850
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #127 on: February 16, 2020, 02:23:37 am »
And while working on caches, I now fully see why OoO execution would be particularly helpful. This can help a lot keeping the core busy when there are cache misses. Dang, looks like opening some can of worms.

To me, Out of Order ('OoO') execution seems helpful only for hiding the hit from cache misses, as it can only hide a few cycles of latency. If memory accesses can raise exceptions, then making it all work properly is a whole new level of complexity.

I've been playing around implementing the Tomasulo Algorithm to allow OoO, but decided that it is only super-useful if either your instructions take multiple cycles, or you can issue multiple instructions per cycle... The classical implementation limits performance to generating at most one result each cycle.

Also, the cache doesn't have to be that big to be useful - 80486 had 8KB or 16KB of L1 cache...
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #128 on: February 16, 2020, 03:21:00 am »
I've been playing around implementing the Tomasulo Algorithm to allow OoO, but decided that it is only super-useful if either your instructions take multiple cycles, or you can issue multiple instructions per cycle... The classical implementation limits performance to generating at most one result each cycle.

Yeah, I've studied that a bit, but I definitely don't want to mess with OoO at this point. My mental process is just going through all this, and I'm starting to more clearly see when things are useful and why they are commonly used in today's high-performance CPUs...

Also, the cache doesn't have to be that big to be useful - 80486 had 8KB or 16KB of L1 cache...

I'm going to make some more experiments with caches in my simulator first and get an idea, especially to select appropriate cache sizes and experiment with simple replacement policies. Yes, I'm basically thinking of implementing an instruction cache of 32KB and a data cache of 32KB or 64KB; the rest of the EBR will be used for register files and branch prediction (should fit in the FPGAs I'm targeting right now).

While working on memory access, I'm also thinking of a possible extension for dealing with memory copies. It's a very frequent operation in actual code, and doing it with the base instruction set - loops with branches - looks rather inefficient, so it looks like an opportunity for a specific extension. I've thought of just making it part of a more general DMA coprocessor, but typical DMA would require significant setup overhead for starting every copy/transfer (and would have to be much more generic than just being able to move memory blocks...), with typically a set of memory-mapped registers... So, specifically for memory-to-memory transfers, I think an extension with just a couple of specific, well-thought-out instructions could improve things drastically.

 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #129 on: February 16, 2020, 08:02:24 am »
And while working on caches, I now fully see why OoO execution would be particularly helpful. This can help a lot keeping the core busy when there are cache misses. Dang, looks like opening some can of worms.

To me, Out of Order ('OoO') execution seems helpful only for hiding the hit from cache misses, as it can only hide a few cycles of latency.

Or long-latency instructions. And it can be quite a lot of clock cycles on a cache miss if you have to go all the way past L1, L2, maybe L3 cache to RAM.

In 1996 the Alpha 21264 could have 80 instructions in the OoO engine, the largest up to that point. The Pentium Pro (1995) had a 40 entry Reorder Buffer.

Intel has gradually been increasing the size of the Reorder Buffer:

Nehalem: 128 uops
Sandy Bridge: 168
Haswell: 192
Skylake: 224

That can hide a *lot* of latency.

Quote
Also, the cache doesn't have to be that big to be useful - 80486 had 8KB or 16KB of L1 cache...

68020/68030 got useful speedups from a 256 byte icache. The 68030 added a 256 byte dcache. 68040 increased both to 4k which made a really big difference.

The PowerPC 603 had 8k each for icache and dcache which worked well for normal code but it performed badly with the m68000 emulator which used a huge 512k byte switch statement to contain two PPC instructions for each of the 65536 possible m68k instructions. In some cases the 1st PPC instruction completely emulated the m68k instruction and the 2nd PPC instruction jumped back to fetch the next m68k instruction. In other cases the 1st instruction just set up some information about the m68k instruction with a load immediate and then the 2nd instruction jumped to some common handler code.

See https://groups.google.com/d/msg/comp.sys.powerpc/jfSUDOGuNNM/BHeYAVoT2NIJ

The PowerPC 603e increased the L1 caches to 16k each, which fixed this problem.
 
The following users thanked this post: SiliconWizard, I wanted a rude username

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #130 on: March 04, 2020, 06:26:32 pm »
Alright, I've implemented memory caches. So far I have added only an instruction cache (the memory cache mechanism itself was easy, but properly inserting it in the pipeline was not... correctly stalling the pipeline in ALL cases - while being as efficient as possible - with the added instruction cache took a while to get right.) So next step will be to add a data cache, now that it's ironed out.

I've tested the instruction cache with the following parameters:  4-way set associative, 64 bytes per line, 32KB, and a PLRUm replacement policy.

In most tests I have done, the penalty is negligible (of course thanks to reasonable code locality on average). There is only a marginal difference with CoreMark, linpack and Bruce's primes benchmark, and the miss rate is consistently less than 0.01%.

I tested with a 2-way, 16KB, still 64 bytes/line cache. The difference was negligible compared to 4-way, 32KB with the above benchmarks, but obviously more significant on code with a lot of function calls into distant parts of the object code, where, not surprisingly, the miss rate almost doubled on average (e.g. 0.71% vs. 0.38% in one of my tests).
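For a cache of this shape, the address decomposition is fixed by the parameters. A quick sanity check of the geometry (helper names are mine):

```c
#include <assert.h>
#include <stdint.h>

// Address decomposition for a 32 KB, 4-way, 64 B/line cache:
// 32768 / (4 * 64) = 128 sets -> 6 offset bits, 7 index bits, tag = the rest.
#define LINE_BYTES 64
#define WAYS       4
#define CACHE_SIZE (32 * 1024)
#define SETS       (CACHE_SIZE / (WAYS * LINE_BYTES))   // = 128

static uint32_t cache_set(uint32_t addr) { return (addr / LINE_BYTES) % SETS; }
static uint32_t cache_tag(uint32_t addr) { return addr / (LINE_BYTES * SETS); }
```

Addresses 8 KB apart land in the same set with different tags, so up to 4 such lines can coexist before the replacement policy has to evict one.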
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #131 on: March 04, 2020, 07:35:54 pm »
Alright, I've implemented memory caches. So far I have added only an instruction cache (the memory cache mechanism itself was easy, but properly inserting it in the pipeline was not... correctly stalling the pipeline in ALL cases - while being as efficient as possible - with the added instruction cache took a while to get right.) So next step will be to add a data cache, now that it's ironed out.

I've tested the instruction cache with the following parameters:  4-way set associative, 64 bytes per line, 32KB, and a PLRUm replacement policy.

In most tests I have done, the penalty is negligible (of course thanks to reasonable code locality on average). There is only a marginal difference with CoreMark, linpack and Bruce's primes benchmark, and the miss rate is consistently less than 0.01%.

I tested with a 2-way, 16KB, still 64 bytes/line cache. The difference was negligible compared to 4-way, 32KB with the above benchmarks, but more significant obviously on code with a lot of function calls in further parts of the object code, in which, not surprisingly, the miss rate almost doubled on average (eg: 0.71% vs. 0.38% in one of my tests).

My primes benchmark compiles to around 200 bytes of code, so any instruction cache at all is going to work :-)  It uses 8 KB of data, so doesn't need a lot of cache there either.

I've noticed experimentally that CoreMark runs well with a 16 KB instruction cache if you enable the C extension, but considerably less well without it. If I recall correctly, on the HiFive1 it's around a factor of two in execution time. That's an extreme case, because icache misses have to go to SPI flash, which is very slow. Turning on -msave-restore (which uses runtime functions to save registers in function prologues and restore them in function epilogues) also gives a couple of percent speedup despite the extra instructions executed, because a bit more of the hot code fits into the instruction cache.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #132 on: March 04, 2020, 08:44:00 pm »
My primes benchmark compiles to around 200 bytes of code, so any instruction cache at all is going to work :-)  It uses 8 KB of data, so doesn't need a lot of cache there either.

Yup obviously. ;D

I've noticed experimentally that Coremark runs well with a 16 KB instruction cache if you enable the C extension but considerably less well without it.

That's interesting. It may depend on your compilation options a bit - dunno.
I haven't implemented the C extension yet, so it's purely RV32IM here, but even with a 16KB cache, I see little penalty compared to 32KB. I get a negligible miss rate, so even a high penalty per miss would not matter much (unless of course the penalty were HUGE.)

What kind of icache is implemented in the CPU you're mentioning? Is it a n-way set associative? If so, what "n"? And if so, what's the replacement policy?

Also, what core frequency do you run it at? Because of course, with very slow flash and a very high core frequency, miss penalties will have a huge impact. (In my simulator, I have set up a typical penalty for accessing DDR RAM.)

What would also be interesting is the icache miss rate you get with CoreMark (I don't know if this CPU has instrumentation registers that would let you figure this out...?)
« Last Edit: March 04, 2020, 08:46:16 pm by SiliconWizard »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #133 on: March 05, 2020, 12:56:07 am »
FE310-G000 in a HIFive1, running at 256 MHz or 320 MHz depending on my whim. icache is 16 KB 2-way associative, 32 byte cache lines.

It takes around 1 us to do a 1/2/4 byte data load from the SPI. An icache line will be slower, but not hugely because it will do a burst transfer. But it's several hundred clock cycles, anyway, much like missing all the way to DRAM on a modern x86.

Sadly, the hardware performance monitor has only clock cycles and instructions retired counters.

https://sifive.cdn.prismic.io/sifive%2F500a69f8-af3a-4fd9-927f-10ca77077532_fe310-g000.pdf

I could try on the FU540 in the HiFive Unleashed, which has a much more comprehensive performance monitor. And L2 cache.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #134 on: March 05, 2020, 02:52:00 pm »
On my tests, enough of CoreMark's "core" code seems to fit within 16KB that I still get an extremely low miss rate even with just a 16KB cache. I'm not sure what would explain the difference here with the FE310. The only thing that may differ is the replacement policy, but with only 2 ways, that shouldn't make much of a difference, if any. The other difference is the line size (I use 64 bytes), but I don't think it would make a difference here either. If not all code fits within 16KB, larger lines could actually be more of a problem than a benefit?

Of course the other difference could be compiled code. I am currently using GCC 9.2 (custom build), I could try with SiFive's toolchain (currently based on GCC 8.3 IIRC?), although I don't expect that much of a difference. (But I could very well be "lucky" here, and it could be a matter of just a few dozen less bytes!)
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #135 on: March 14, 2020, 06:15:36 pm »
Data memory cache is now also implemented. Ran all my tests with both instruction and data cache enabled. Both are 32KB, 4-way set associative.

For CoreMark, I get negligible penalty compared to tightly-coupled memory.

One of the most taxing benchmarks was linpack with N=1000 and FP emulation (uses over 8MB of data memory - heavy matrix computation stuff with FP.) About -2.2% speed compared to tightly-coupled memory. Average CPI = 1.054. Miss rate on the instruction cache is negligible (everything fits within 32KB with no problem even though FP emulation is used.) Miss rate on the data cache is ~0.357%. Probably not the best possible, but I'm pretty happy with the results, given that I used a pretty simple replacement policy (PLRUm - I read a few papers, and it looked like the sweet spot for a very simple yet effective policy).
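For reference, the policy itself is simple enough to sketch. This is generic PLRUm as I understand it from the literature - each way in a set has an "MRU bit" - not my exact HDL:

```c
#include <assert.h>
#include <stdint.h>

// PLRUm ("MRU bits" pseudo-LRU) for one 4-way set: touching a way sets its
// bit, and when all bits would be set, the other ways' bits are cleared.
// The victim is the first way whose bit is clear.
#define WAYS 4

static void plrum_touch(uint8_t *mru, int way) {
    *mru |= 1u << way;
    if (*mru == (1u << WAYS) - 1)   // all set: keep only the accessed way
        *mru = 1u << way;
}

static int plrum_victim(uint8_t mru) {
    for (int w = 0; w < WAYS; w++)
        if (!(mru & (1u << w))) return w;
    return 0;                        // unreachable after plrum_touch
}
```

One bit per way (versus full LRU state) is what makes it so cheap in hardware, while still protecting recently used lines.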

If anyone knows of typical benchmark code for data caches, I'll take it!
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #136 on: March 19, 2020, 06:05:40 pm »
OK, I've just devised some code that looks like memory testing code (writing and reading back patterns, written in linear order or differently).

On large buffers (4 MB) I get an average miss rate of ~7%. Not hugely surprising.

I have also benchmarked memcpy() for various buffer sizes from 1KB to 4MB. Got the following results:

Code: [Select]
* memcpy(): 1 KiB, Exec. cycles = 1261 (1.23 cycles/byte)
* memcpy(): 2 KiB, Exec. cycles = 1738 (0.85 cycles/byte)
* memcpy(): 4 KiB, Exec. cycles = 3412 (0.83 cycles/byte)
* memcpy(): 8 KiB, Exec. cycles = 6876 (0.84 cycles/byte)
* memcpy(): 16 KiB, Exec. cycles = 23188 (1.42 cycles/byte)
* memcpy(): 32 KiB, Exec. cycles = 59268 (1.81 cycles/byte)
* memcpy(): 64 KiB, Exec. cycles = 130742 (1.99 cycles/byte)
* memcpy(): 128 KiB, Exec. cycles = 261112 (1.99 cycles/byte)
* memcpy(): 256 KiB, Exec. cycles = 521858 (1.99 cycles/byte)
* memcpy(): 512 KiB, Exec. cycles = 1043440 (1.99 cycles/byte)
* memcpy(): 1024 KiB, Exec. cycles = 2086538 (1.99 cycles/byte)
* memcpy(): 2048 KiB, Exec. cycles = 4172772 (1.99 cycles/byte)
* memcpy(): 4096 KiB, Exec. cycles = 8345270 (1.99 cycles/byte)

I'd be curious to compare that with what you get on a FU540...
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #137 on: March 20, 2020, 08:03:21 am »
Ok, on FU540 at 1.45GHz:

Code: [Select]
* memcpy(): 1 KiB, Exec. cycles = 549 (0.54 cycles/byte)
* memcpy(): 2 KiB, Exec. cycles = 1018 (0.50 cycles/byte)
* memcpy(): 4 KiB, Exec. cycles = 1953 (0.48 cycles/byte)
* memcpy(): 8 KiB, Exec. cycles = 3812 (0.47 cycles/byte)
* memcpy(): 16 KiB, Exec. cycles = 11051 (0.67 cycles/byte)
* memcpy(): 32 KiB, Exec. cycles = 18994 (0.58 cycles/byte)
* memcpy(): 64 KiB, Exec. cycles = 91698 (1.40 cycles/byte)
* memcpy(): 128 KiB, Exec. cycles = 425028 (3.24 cycles/byte)
* memcpy(): 256 KiB, Exec. cycles = 1780168 (6.79 cycles/byte)
* memcpy(): 512 KiB, Exec. cycles = 3308322 (6.31 cycles/byte)
* memcpy(): 1024 KiB, Exec. cycles = 5550824 (5.29 cycles/byte)
* memcpy(): 2048 KiB, Exec. cycles = 10328072 (4.92 cycles/byte)
* memcpy(): 4096 KiB, Exec. cycles = 20878807 (4.98 cycles/byte)
* memcpy(): 8192 KiB, Exec. cycles = 40870825 (4.87 cycles/byte)
* memcpy(): 16384 KiB, Exec. cycles = 81417119 (4.85 cycles/byte)
* memcpy(): 32768 KiB, Exec. cycles = 162796040 (4.85 cycles/byte)
* memcpy(): 65536 KiB, Exec. cycles = 325120247 (4.84 cycles/byte)
* memcpy(): 131072 KiB, Exec. cycles = 649763946 (4.84 cycles/byte)
* memcpy(): 262144 KiB, Exec. cycles = 1298570608 (4.84 cycles/byte)
* memcpy(): 524288 KiB, Exec. cycles = 2600203293 (4.84 cycles/byte)
* memcpy(): 1048576 KiB, Exec. cycles = 5197559587 (4.84 cycles/byte)

So memcpy() in DRAM is 285 MB/sec.


Code:

Code: [Select]
#include <stdio.h>
#include <string.h>

typedef unsigned long ulong;

ulong read_cycles() {
    ulong cycles;
    asm volatile ("rdcycle %0" : "=r" (cycles));
    return cycles;
}

typedef void test_proc(ulong arg);

// Run the test 10 times and keep the minimum cycle count.
ulong measure_time(test_proc p, ulong arg) {
  ulong min = (ulong)-1;  // wraps to ULONG_MAX
  for (int i=0; i<10; ++i) {
    ulong start = read_cycles();
    p(arg);
    ulong cycles = read_cycles() - start;
    if (min > cycles) min = cycles;
  }
  return min;
}

#define MAX_SZ (1<<30)
char buf_a[MAX_SZ], buf_b[MAX_SZ];

void empty(ulong arg) {}
void do_memcpy(ulong sz) {memcpy(buf_a, buf_b, sz);}

int main() {
  ulong empty_time = measure_time(empty, 0);  // measurement overhead (currently unused)
  for (ulong kb=1; kb<=(1<<20); kb*=2) {
    ulong t = measure_time(do_memcpy, 1024*kb);
    printf("* memcpy(): %lu KiB, Exec. cycles = %lu (%4.2f cycles/byte)\n",
           kb, t, t/(1024.0*kb));
  }
}


 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #138 on: March 20, 2020, 01:49:11 pm »
Incidentally, if I drop the FU540 down to 37.75 MHz (the slowest clock the board can generate) then I get:

Code: [Select]
* memcpy(): 1 KiB, Exec. cycles = 554 (0.54 cycles/byte)
* memcpy(): 2 KiB, Exec. cycles = 1016 (0.50 cycles/byte)
* memcpy(): 4 KiB, Exec. cycles = 1969 (0.48 cycles/byte)
* memcpy(): 8 KiB, Exec. cycles = 3906 (0.48 cycles/byte)
* memcpy(): 16 KiB, Exec. cycles = 9683 (0.59 cycles/byte)
* memcpy(): 32 KiB, Exec. cycles = 19035 (0.58 cycles/byte)
* memcpy(): 64 KiB, Exec. cycles = 115153 (1.76 cycles/byte)
* memcpy(): 128 KiB, Exec. cycles = 257464 (1.96 cycles/byte)
* memcpy(): 256 KiB, Exec. cycles = 486513 (1.86 cycles/byte)
* memcpy(): 512 KiB, Exec. cycles = 943371 (1.80 cycles/byte)
* memcpy(): 1024 KiB, Exec. cycles = 1751138 (1.67 cycles/byte)
* memcpy(): 2048 KiB, Exec. cycles = 3342729 (1.59 cycles/byte)
* memcpy(): 4096 KiB, Exec. cycles = 6900410 (1.65 cycles/byte)
* memcpy(): 8192 KiB, Exec. cycles = 14309351 (1.71 cycles/byte)
* memcpy(): 16384 KiB, Exec. cycles = 28623709 (1.71 cycles/byte)
* memcpy(): 32768 KiB, Exec. cycles = 57386550 (1.71 cycles/byte)
* memcpy(): 65536 KiB, Exec. cycles = 114705607 (1.71 cycles/byte)

So slowing the CPU down doesn't slow down the RAM by as much. (I hit ^C before the program completed.)

I can get 2.00 cycles per byte at 182 MHz:

Code: [Select]
* memcpy(): 1 KiB, Exec. cycles = 544 (0.53 cycles/byte)
* memcpy(): 2 KiB, Exec. cycles = 1013 (0.49 cycles/byte)
* memcpy(): 4 KiB, Exec. cycles = 1937 (0.47 cycles/byte)
* memcpy(): 8 KiB, Exec. cycles = 3859 (0.47 cycles/byte)
* memcpy(): 16 KiB, Exec. cycles = 9441 (0.58 cycles/byte)
* memcpy(): 32 KiB, Exec. cycles = 33845 (1.03 cycles/byte)
* memcpy(): 64 KiB, Exec. cycles = 70787 (1.08 cycles/byte)
* memcpy(): 128 KiB, Exec. cycles = 160051 (1.22 cycles/byte)
* memcpy(): 256 KiB, Exec. cycles = 447387 (1.71 cycles/byte)
* memcpy(): 512 KiB, Exec. cycles = 990994 (1.89 cycles/byte)
* memcpy(): 1024 KiB, Exec. cycles = 1994527 (1.90 cycles/byte)
* memcpy(): 2048 KiB, Exec. cycles = 3930281 (1.87 cycles/byte)
* memcpy(): 4096 KiB, Exec. cycles = 7860005 (1.87 cycles/byte)
* memcpy(): 8192 KiB, Exec. cycles = 16408431 (1.96 cycles/byte)
* memcpy(): 16384 KiB, Exec. cycles = 33264848 (1.98 cycles/byte)
* memcpy(): 32768 KiB, Exec. cycles = 66778139 (1.99 cycles/byte)
* memcpy(): 65536 KiB, Exec. cycles = 133847271 (1.99 cycles/byte)
* memcpy(): 131072 KiB, Exec. cycles = 267943099 (2.00 cycles/byte)
* memcpy(): 262144 KiB, Exec. cycles = 536411759 (2.00 cycles/byte)

Which works out to about 87 MB/sec at that clock speed. Or 21 MB/sec at 37.75 MHz.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #139 on: March 20, 2020, 05:26:04 pm »
Thanks! Interesting results that look consistent with what I get. Of course, the CPU clock relative to DRAM throughput (and latency) will influence the average number of cycles per byte.

Another interesting point is the difference (from my own tests) regarding the buffer size at which the "knee" appears. I think the FU540 also has a 32KB L1 cache? But does it have an L2 cache?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #140 on: March 20, 2020, 05:38:54 pm »
32 KB 8-way L1, yes. And 2 MB 16-way L2 shared between the 4 cores.

Weird that the speed is actually worse when working in L2 than once it gets to RAM! I don't have an explanation for that.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #141 on: March 20, 2020, 05:55:11 pm »
32 KB 8-way L1, yes. And 2 MB 16-way L2 shared between the 4 cores.

L2 cache mostly explains why you get the knee much later (meaning with bigger buffer sizes) on average.

Weird that the speed is actually worse when working in L2 than once it gets to RAM! I don't have an explanation for that.

For both points, do you have any means of disabling L2 cache only and see what you get?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #142 on: March 20, 2020, 06:20:25 pm »
Weird that the speed is actually worse when working in L2 than once it gets to RAM! I don't have an explanation for that.

For both points, do you have any means of disabling L2 cache only and see what you get?

You can't disable L2 entirely. Way 0 is enabled at reset and software can enable more ways, but once enabled they can not be disabled except by reset.

You can mask ways from being allocated in by a particular CPU (or DMA) by setting bits in the appropriate WayMask register but it seems each CPU must have at least one way unmasked.

See page 69 and following in https://static.dev.sifive.com/FU540-C000-v1.0.pdf
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #143 on: March 20, 2020, 06:33:52 pm »
Weird that the speed is actually worse when working in L2 than once it gets to RAM! I don't have an explanation for that.

For both points, do you have any means of disabling L2 cache only and see what you get?

You can't disable L2 entirely. Way 0 is enabled at reset and software can enable more ways, but once enabled they can not be disabled except by reset.

You can mask ways from being allocated in by a particular CPU (or DMA) by setting bits in the appropriate WayMask register but it seems each CPU must have at least one way unmasked.

See page 69 and following in https://static.dev.sifive.com/FU540-C000-v1.0.pdf

OK thanks. You could at least try disabling all ways beside way 0 and see if it makes any difference...
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #144 on: March 20, 2020, 06:53:43 pm »
I think that would require running bare metal code rather than running under Debian :-)

I've never done that with this board, though it's one of the projects I've been considering playing with during funemployment after my move from SF to New Zealand at the end of the month (hopefully...).
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #145 on: March 20, 2020, 07:12:40 pm »
Yep that would require going bare metal... Well if that was already one of your projects :)

 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #146 on: April 20, 2020, 04:49:38 pm »
Just some news. I've worked on a VHDL implementation for a while. Writing a simulator first (cycle-accurate for a pipelined core) definitely helped a lot. Some things were still tougher than others to properly "translate", but that wasn't too bad. I would have wasted a lot of time ironing out the bugs if I had gone for an HDL version first. (Main reason being that it's a 5-stage pipelined core. A non-pipelined, or with fewer stages, core would have been much easier to implement and get right.)

So, it's basically done. Right now I'm still having a small issue with the branch prediction logic, which makes the synthesizer infer block RAM for it only partially, so that's a lot of wasted LUTs and hinders speed a bit as well. Currently without branch prediction, RV32IM+Zicsr (w/single-cycle multiply) takes about 2000 LUTs (5-stage pipeline, fully bypassed) and can run at up to ~100MHz on a Spartan 6 LX25. With branch prediction enabled, that's about 3500 LUTs and max freq down to ~85MHz. But I should be able to get it to at least 100MHz once I fix the implementation of the branch prediction, and down to about 2100 or 2200 LUTs max.

The test project currently uses block RAM for instruction and data memory, but I intend to use DDR RAM later on with a memory cache. I've implemented a memory cache in VHDL and verified it in simulation; now I'll need to test it for real on an FPGA. But let's fix this branch prediction thing first...
 
The following users thanked this post: hamster_nz, emece67, I wanted a rude username

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #147 on: April 20, 2020, 08:19:16 pm »
Nice work!
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #148 on: April 20, 2020, 11:52:57 pm »
Thanks. I now get the "issue" with the branch predictor, but I still need to find an appropriate "fix" without harming its performance.

The main reason is that the predictor must produce the next PC every cycle, but a predicted target can itself be a branch instruction, which in turn requires a prediction for the following cycle... so there is a small part of my implementation that is combinational, which is not that great. I need to give it some more thought.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #149 on: April 22, 2020, 04:04:08 pm »
Haven't had a lot of time re-working this, but I'm kind of stuck...

Issue basically is:
- The current PC is either the predicted PC or the "next" PC;
- The predicted PC is determined based on the previous PC;
- That kind of makes prediction not fully synchronous, which is the cause of my "issues" here.

Does anyone have an idea how to solve this? Does branch prediction actually have a 1-cycle latency in some cases to be able to make the whole prediction fully synchronous? (Which may be the answer here but would hinder performance.) Any idea welcome.

I think the problematic case would be the one I spotted above: a branch prediction leading to a branch instruction (because to handle this case, prediction must be able to work at EVERY clock cycle) - otherwise I think the latency can be worked around. If anyone sees what I mean and has any tip...
« Last Edit: April 22, 2020, 04:12:12 pm by SiliconWizard »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #150 on: April 23, 2020, 12:24:22 am »
I believe I've written about this before.

You have two completely different and independent problems: 1) predicting if a branch will be taken or not, and 2) predicting the PC for a taken branch.

You only have to predict the PC for the JALR instruction. The next PC will be the value in the rs1 register plus an offset. You don't know at instruction decode time what the value in the register will be when you get to the execute stage. You do however know with absolute certainty that the branch *will* be taken. JALR is a very infrequently used instruction *except* for function return, which can be accelerated using a small return address stack (even 2 entries gives almost all the benefit). The ISA manual lists the combinations of rs1 and rd that can be assumed to imply particular actions with the return address stack. The other uses for JALR are virtual function call / call via a pointer and switch statements. They are fairly rare but can be accelerated by a branch target cache which is used *only* for JALR, not for conditional branches.
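A return address stack is trivial to model. Here's a sketch in C (the depth and push/pop conditions are illustrative, not any particular core's implementation; on overflow it just wraps and the oldest entry becomes a misprediction you tolerate):

```c
#include <stdint.h>

#define RAS_DEPTH 2  /* even 2 entries catch almost all returns */
typedef struct {
    uint32_t entry[RAS_DEPTH];
    unsigned top;  /* index of the most recent push; wraps on overflow */
} ras_t;

/* On a call (JAL or JALR with rd == x1/ra): push the return address. */
static void ras_push(ras_t *r, uint32_t return_pc) {
    r->top = (r->top + 1) % RAS_DEPTH;
    r->entry[r->top] = return_pc;
}

/* On a return (JALR with rs1 == ra, rd == x0): predict by popping. */
static uint32_t ras_pop(ras_t *r) {
    uint32_t pc = r->entry[r->top];
    r->top = (r->top + RAS_DEPTH - 1) % RAS_DEPTH;
    return pc;
}
```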

Conditional branches need to have a prediction for whether they will be taken or not. But once you have decided (using the branch predictor) whether or not it will be taken, it is absolutely certain what the next PC will be -- if the branch is not taken it will be PC+4 (or+2) and if it is taken it will be PC+simm12.
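The taken/not-taken predictor itself is typically just a table of 2-bit saturating counters indexed by low PC bits. A sketch (the table size here is arbitrary):

```c
#include <stdint.h>

#define BHT_ENTRIES 256  /* hypothetical size; real cores range from 128 up */
static uint8_t bht[BHT_ENTRIES];  /* 0..1 = predict not-taken, 2..3 = predict taken */

static unsigned bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_ENTRIES - 1);  /* drop the low alignment bits */
}

static int bht_predict_taken(uint32_t pc) {
    return bht[bht_index(pc)] >= 2;
}

/* Called once the branch resolves: saturate the counter toward the outcome,
   so a single anomalous outcome doesn't immediately flip the prediction. */
static void bht_update(uint32_t pc, int taken) {
    uint8_t *c = &bht[bht_index(pc)];
    if (taken) { if (*c < 3) ++*c; }
    else       { if (*c > 0) --*c; }
}
```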
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #151 on: April 23, 2020, 01:09:11 pm »
You have two completely different and independent problems: 1) predicting if a branch will be taken or not, and 2) predicting the PC for a taken branch.

Maybe my question/issue was not very clear. Let's ignore the target PC for a while and just think about the prediction (taken/not taken).

In the general case, at any given cycle, if the instruction being fetched is a conditional branch (needs to be predicted), you must have the corresponding prediction ready for the next cycle - but at the fetch stage, the instruction is not decoded yet, so you don't know it's a conditional branch. The branch predictor will not issue a status ready for the next cycle, unless you access it asynchronously. It works, but it's suboptimal. Is that clearer?

Basically, by the time a conditional branch gets to the decode stage - so you know it's a conditional branch - you must have a valid target address - which will depend on the prediction (taken/not taken) for the current fetch. Otherwise you waste 1 cycle.

I may not be approaching the thing completely right, but basically, if you want your branch predictor to be able to issue a valid prediction at EVERY cycle, there is something problematic here to make it fully synchronous. Does what I'm saying make sense?

« Last Edit: April 23, 2020, 01:12:18 pm by SiliconWizard »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #152 on: April 23, 2020, 02:11:16 pm »
You don't need a lot of decoding.

You need to look at the branch predictor if and only if opcode[6:0] == 1100011  (or opcode[1:0] == 01 && opcode[15:14] == 11 if RVC is supported).

That only needs a very shallow AND/OR network to figure out.

You can extract the presumed offset from the opcode and add it to the PC in parallel, without knowing whether the instruction is a branch or not.
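In C terms, the minimal decode and offset extraction look something like this (32-bit B-type encodings only; the RVC forms are left out of this sketch):

```c
#include <stdint.h>

/* Conditional branch (BEQ/BNE/BLT/BGE/BLTU/BGEU): opcode[6:0] == 1100011. */
static inline int is_branch(uint32_t op) {
    return (op & 0x7f) == 0x63;
}

/* B-type immediate: imm[12|10:5] = op[31:25], imm[4:1|11] = op[11:7], imm[0] = 0.
   Relies on arithmetic right shift of a negative int32_t for sign extension,
   which virtually every compiler provides. */
static inline int32_t branch_offset(uint32_t op) {
    return ((int32_t)(op & 0x80000000) >> 19)  /* imm[12], sign-extended */
         | ((op & 0x80) << 4)                  /* imm[11]                */
         | ((op >> 20) & 0x7e0)                /* imm[10:5]              */
         | ((op >> 7) & 0x1e);                 /* imm[4:1]               */
}
```

For example, `beq x0, x0, 8` encodes as 0x00000463 and `beq x0, x0, -4` as 0xFE000EE3.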

Also you can read the branch predictor table as soon as you know the PC of the instruction, before you've fetched the instruction or determined that it is a branch. If the instruction turns out not to be a branch then you just ignore what you read.

Of course the branch predictor table can't be updated until the branch proceeds through the pipeline, the register contents have been read and compared etc, and the correctness of the prediction checked.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #153 on: April 23, 2020, 02:54:47 pm »
Also you can read the branch predictor table as soon as you know the PC of the instruction, before you've fetched the instruction or determined that it is a branch.

This is currently what I do indeed. Point is, as I said earlier: the PC of the instruction to fetch depends on a possible prediction from the previous cycle, and this dependency causes me troubles as far as registering goes. To better understand: in my implementation, the current PC is the output of a MUX with the "next PC" and "predicted PC" as inputs. That part alone may be the main point to refactor?

If the instruction turns out not to be a branch then you just ignore what you read.

But then you may have fetched the wrong next instruction - you waste 1 cycle (or more).

Maybe there's no way around this if again I want to get to higher speeds (but with potentially more mispredictions.)

But if you remember, this is why I added a tag (like with caches) in the branch predictor (which I know you weren't fond of), so I don't need to wait till the instruction is fetched to know for sure it's a branch that needs to be predicted. May sound a bit wasteful, but from my tests, it was worth it in terms of correct prediction rate. But whereas this tag thing could be done without and uses some memory, it's not what causes me a problem here. I could remove this (and I tested that), but I still have the same problem.

Of course the branch predictor table can't be updated until the branch proceeds through the pipeline, the register contents have been read and compared etc, and the correctness of the prediction checked.

Yes, that point is OK.

I'm sure I may have to rethink things a bit. It's very possible that I'll have to compromise the performance of my BP to make it fully synchronous.

« Last Edit: April 23, 2020, 02:57:23 pm by SiliconWizard »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #154 on: April 24, 2020, 03:46:07 am »
Also you can read the branch predictor table as soon as you know the PC of the instruction, before you've fetched the instruction or determined that it is a branch.

This is currently what I do indeed. Point is, as I said earlier: the PC of the instruction to fetch depends on a possible prediction from the previous cycle, and this dependency causes me troubles as far as registering goes. To better understand: in my implementation, the current PC is the output of a MUX with the "next PC" and "predicted PC" as inputs.

That doesn't sound quite precise.

opcode = sram[pc]; // or icache
pc  = (predict_taken[pc] & is_branch(opcode)) ? pc+branch_offset(opcode) : pc+4;

You can do sram[pc] and predict_taken[pc] in parallel (and pc+4 too). Once you have the opcode back from the sram you can do prediction&is_branch(opcode) and pc+branch_offset(opcode) in parallel. Then you have remaining only a 2:1 mux.

If you want to be able to do single cycle branches then you simply have to make sure that your cycle time is long enough to do sram access + a 32 bit add + 2:1 mux in sequence as that will be the critical path for that pipeline stage.

I don't expect this to be the overall critical pipeline stage -- or at least not by much -- given that one of the other pipeline stages has to do register access and muxing to the ALU inputs, and the ALU has to be able to do a 0..32 bit shift in the same amount of time.

You could of course split instruction fetch and branch prediction into two pipeline stages and have taken branches take 2 clock cycles. That would allow slightly higher MHz, but not much higher, and I would be pretty darn sure not enough higher to compensate for taking an extra cycle every five or six instructions. I don't know offhand of any RISC-V core with a branch predictor that does that. SiFive's tiny 2-series cores and PULP ZeroRiscy don't bother with branch prediction at all and just accept that they are going to need 2 clock cycles for every taken branch.

Quote
If the instruction turns out not to be a branch then you just ignore what you read.

But then you may have fetched the wrong next instruction - you waste 1 cycle (or more).

No. You fetched an unnecessary branch prediction -- because the instruction turned out not to be a branch. You are nowhere near fetching the next instruction yet.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #155 on: April 24, 2020, 04:18:01 pm »
I'm not completely sure what I say is clear, or if we fully understand each other.

opcode = sram[pc]; // or icache
pc  = (predict_taken[pc] & is_branch(opcode)) ? pc+branch_offset(opcode) : pc+4;

OK - that's the idea of what I do. I'll try to illustrate what I want to achieve with code pieces later on to be clearer/more precise.

But as I said, I'm expecting to be able to do the above for EVERY clock cycle.

Feeding the 'pc' register, which is the output of a MUX, to predict_taken[pc] (which I'd like to be implemented as block RAM/synchronous RAM in general) seems problematic as 'pc' in this way isn't exactly registered. So predict_taken has to be implemented as asynchronous memory basically. As soon as I register the output of this mux, the issue disappears entirely. But of course that would add a 1-cycle latency, which I do not want.

You can do sram[pc] and predict_taken[pc] in parallel (and pc+4 too). Once you have the opcode back from the sram you can do prediction&is_branch(opcode) and pc+branch_offset(opcode) in parallel. Then you have remaining only a 2:1 mux.

If you want to be able to do single cycle branches then you simply have to make sure that your cycle time is long enough to do sram access + a 32 bit add + 2:1 mux in sequence as that will be the critical path for that pipeline stage.

Ok, there I think you got what I want to do. Single cycle branches when they are correctly predicted.

You could of course split instruction fetch and branch prediction into two pipeline stages and have taken branches take 2 clock cycles.

That's what I want to avoid. That would obviously solve the issues altogether, but my goal is single-cycle branches as much as possible. The potential speed increase I could get adding a stage here would likely not make up for the additional branch latency.

No surprise, then, that I got, for instance, a CoreMark/MHz figure almost exactly the same as the Freedom U540's. That was almost entirely due to my branch predictor (disabling it, or going for something simpler, got me significantly lower figures). I'm certain the rest of the U540 gives it much better performance overall. But maybe my branch predictor, as it is, is just not completely realistic for actual implementations. That's what I'm currently trying to work on/figure out.
« Last Edit: April 24, 2020, 06:57:27 pm by SiliconWizard »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #156 on: April 24, 2020, 10:20:26 pm »
I'm not completely sure what I say is clear, or if we fully understand each other.

opcode = sram[pc]; // or icache
pc  = (predict_taken[pc] & is_branch(opcode)) ? pc+branch_offset(opcode) : pc+4;

OK - that's the idea of what I do. I'll try to illustrate what I want to achieve with code pieces later on to be clearer/more precise.

But as I said, I'm expecting to be able to do the above for EVERY clock cycle.

Yes, of course. Everyone does that, as I said.

Quote
Feeding the 'pc' register, which is the output of a MUX, to predict_taken[pc] (which I'd like to be implemented as block RAM/synchronous RAM in general) seems problematic as 'pc' in this way isn't exactly registered. So predict_taken has to be implemented as asynchronous memory basically. As soon as I register the output of this mux, the issue disappears entirely. But of course that would add a 1-cycle latency, which I do not want.

Why is PC the output of a mux?

We are talking about RISC-V here. The PC is a special register located in the instruction fetch unit as, literally, a register. It is registered (to use your terminology). It is not a general register in the register file with a mux to access it. And even if it was, you'd make a bypass for instruction fetch that went directly from the output of the PC register, not via the register-select mux.

You read the PC contents from a register (flipflops on an SoC), pass it though a bunch of asynchronous logic including some SRAM holding instructions, minimal instruction decode, adders, mux, and feed the result back into the input of the same register you read the old PC from. Some time after the signal settles the clock ticks and BOOM you read the new PC value into the register, replacing the old PC.

If you want to, you can implement an entire RV32I CPU using a single-stage pipeline with not only the PC being updated, but also the fetched instruction decoded, values read from registers, passed through the ALU, and presented back to the write port of the registers before the next clock tick.

Michael Field (aka field_hamster, hamsternz) posted his own single-stage RISC-V design here sometime last year (I think). Anyway it's on his github. Rudi-RV32I if I recall.

It's certainly no problem to have PC read, instruction fetch, branch predictor lookup, next PC calculation all in one clock cycle. You simply have to make the clock cycle suitably long that everything has propagated before it ticks. If you don't do that then you might be able to make the clock cycle VERY slightly faster, but it won't be by anywhere enough to compensate for using more clock cycles for branches.

Quote
You can do sram[pc] and predict_taken[pc] in parallel (and pc+4 too). Once you have the opcode back from the sram you can do prediction&is_branch(opcode) and pc+branch_offset(opcode) in parallel. Then you have remaining only a 2:1 mux.

If you want to be able to do single cycle branches then you simply have to make sure that your cycle time is long enough to do sram access + a 32 bit add + 2:1 mux in sequence as that will be the critical path for that pipeline stage.

Ok, there I think you got what I want to do. Single cycle branches when they are correctly predicted.

Yes, like everyone does. It would be crazy to do something else, for typical programs.
« Last Edit: April 24, 2020, 10:22:11 pm by brucehoult »
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #157 on: April 25, 2020, 04:24:00 pm »
As I already said, my implementation at this point does work. It's just not optimal IMO, and that's what I'm working on. I'm sure I'll find a way to optimize it, and there will certainly be many more opportunities to optimize my whole core later on. I was already happy about it being able to run at ~100MHz on a Spartan 6, a bit less happy about the ~85MHz with the branch predictor enabled, but that's not too bad either, considering it implements RV32IM_Zicsr, exceptions/traps and branch prediction. My goal though is to make the branch predictor not add significant overall delay.

Just one point - may have been obvious, but I'm not sure, so here it is. When I talk about "branch prediction", I'm actually talking about both branch prediction and branch target prediction, as I know those are usually formally 2 different concepts - but I need both, and I have implemented both.

Without branch target prediction, what you suggested earlier poses no problem, but there will always be a one-cycle latency if you need to get at the decode stage to figure out the next PC. I don't really see a way around that, and that's usually why branch target buffers are used. So, yeah I have implemented both, and I consider both part of the "branch prediction unit" in my design. That may have been a bit confusing, so here I make it clear. That's also why I talked about PC tags. BTBs are a form of cache and require some form of tags like with any cache to be effective.

So this is mainly a matter of optimizing the implementation of both predictors used in conjunction, and I'm sure I'll figure it out.
« Last Edit: April 25, 2020, 04:28:04 pm by SiliconWizard »
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #158 on: April 26, 2020, 02:20:34 am »
As I already said, my implementation at this point does work. It's just not optimal IMO, and that's what I'm working on. I'm sure I'll find a way to optimize it, and there will certainly be many more opportunities to optimize my whole core later on. I was already happy about it being able to run at ~100MHz on a Spartan 6, a bit less happy about the ~85MHz with the branch predictor enabled, but that's not too bad either, considering it implements RV32IM_Zicsr, exceptions/traps and branch prediction. My goal though is to make the branch predictor not add significant overall delay.

This is good.

Quote
Just one point - may have been obvious, but I'm not sure, so here it is. When I talk about "branch prediction", I'm actually talking about both branch prediction and branch target prediction, as I know those are usually formally 2 different concepts - but I need both, and I have implemented both.

Without branch target prediction, what you suggested earlier poses no problem, but there will always be a one-cycle latency if you need to get at the decode stage to figure out the next PC. I don't really see a way around that, and that's usually why branch target buffers are used. So, yeah I have implemented both, and I consider both part of the "branch prediction unit" in my design. That may have been a bit confusing, so here I make it clear. That's also why I talked about PC tags. BTBs are a form of cache and require some form of tags like with any cache to be effective.

So this is mainly a matter of optimizing the implementation of both predictors used in conjunction, and I'm sure I'll figure it out.

I've said all this before but I'll repeat.

1) you don't need a 1 cycle latency to calculate the next PC. You can do just enough decode of the instruction (figure out if it's a conditional branch, extract the offset and add it to the PC) right there in the instruction fetch stage. This will result in a slightly lower maximum MHz, but not much.

2) only JALR instructions logically require prediction of the branch target. They are comparatively rare, with function return being the vast majority. Returns can be handled very well with typically a 2 (FE310) to 6 (FU540) entry return address stack.

3) branch target predictors are *huge*. Each entry needs to store both the current (matching) PC and the next PC, which is 64 bits on RV32. Plus it needs to be a CAM, which is very expensive, adding a comparator for the PC (tag) of every entry. A branch predictor needs 2 bits for each entry and is direct addressed access. You can afford to have at least 32 times more branch predictor entries than branch target entries for the same area / LUTs. Maybe 50x.

4) a return address entry is just as large as a branch target entry, but you only need a few of them to be effective and they don't need to be CAM as you only need to check the top entry. (you *could* CAM it and let any entry match, but that's only going to help malformed programs)

5) yes, you can just use a branch target predictor for conditional branches and JAL as well as for JALR. That will save waiting for the instruction fetch and doing the minimal decode needed. It will need far more resources (or have a lower hit rate than a branch predictor, if it has fewer entries), and most of it will be used by things that aren't JALR.
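
To make point 1 concrete: for RV32, the opcode lives in bits [6:0] and the branch/JAL offsets are pure bit rearrangements of the instruction word, so no register read or full decode is needed. A rough Python sketch of such fetch-stage logic (my own illustration, not taken from any particular core):

```python
def sign_extend(value, bits):
    """Sign-extend `value`, which is `bits` wide."""
    mask = 1 << (bits - 1)
    return (value ^ mask) - mask

def early_next_pc(pc, inst, predict_taken):
    """Minimal fetch-stage decode for RV32: compute the next PC for
    conditional branches (opcode 0x63, if predicted taken) and JAL
    (opcode 0x6F). Everything else, including JALR, falls through
    to pc + 4."""
    opcode = inst & 0x7F
    if opcode == 0x6F:  # JAL, J-type immediate: imm[20|10:1|11|19:12]
        imm = ((inst >> 31) & 0x1) << 20 \
            | ((inst >> 12) & 0xFF) << 12 \
            | ((inst >> 20) & 0x1) << 11 \
            | ((inst >> 21) & 0x3FF) << 1
        return (pc + sign_extend(imm, 21)) & 0xFFFFFFFF
    if opcode == 0x63 and predict_taken:  # Bxx, B-type: imm[12|10:5|4:1|11]
        imm = ((inst >> 31) & 0x1) << 12 \
            | ((inst >> 7) & 0x1) << 11 \
            | ((inst >> 25) & 0x3F) << 5 \
            | ((inst >> 8) & 0xF) << 1
        return (pc + sign_extend(imm, 13)) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF
```

In hardware this is just a couple of muxes and an adder sitting next to the fetch logic, which is where the "slightly lower maximum MHz" comes from.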

I mean .. experiment with different things by all means. I don't want to discourage that! And it depends on what your goal is.

For reference (BHT = Branch History Table, aka 2 bits per entry predictor, BTB = predicts the next PC, RAS = Return Address Stack):

FE310-G000: 128 BHT, 40 BTB, 2 RAS
FE310-G002: 512 BHT, 28 BTB, 6 RAS
FU540-C000: 256 BHT, 30 BTB, 6 RAS

ARM A53: 3072 BHT, 256 BTB, 8 RAS

Note that the FU540 taped out almost exactly a year after the FE310-G000, while the FE310-G002 was another year after the FU540.
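
To put point 3 in numbers, here is the raw storage implied by the FE310-G002 figures, assuming RV32 with full 32-bit tags and targets (a pessimistic assumption of mine, since real designs truncate both):

```python
# Rough storage estimate for FE310-G002-sized predictors (RV32,
# full 32-bit tags/targets assumed -- real designs truncate both).
bht_bits = 512 * 2           # 2-bit saturating counter per entry -> 1024
btb_bits = 28 * (32 + 32)    # (matching PC, next PC) per entry   -> 1792
ras_bits = 6 * 32            # return addresses only              -> 192

# Per entry: 2 bits vs 64 bits, i.e. the 32x ratio mentioned above.
ratio = (btb_bits // 28) // (bht_bits // 512)
```

So even with about 18x as many entries, the BHT is still smaller than the BTB, before even counting the BTB's comparators.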
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #159 on: April 26, 2020, 02:31:50 pm »
5) yes, you can just use a branch target predictor for conditional branches and JAL as well as for JALR. That will save waiting for the instruction fetch and doing the minimal decode needed.

Yes, there we go. This is exactly my goal, and what I'm basically doing. The key idea was to avoid waiting for the instruction fetch itself (though I agree that the subsequent minimal decode required would be negligible).

And the point I was trying to make is: if you have to wait for the fetch, how can you select the most probable PC for the next instruction on every cycle? That's where BTBs come into play, as far as I have read and thought about it. And yes, I'm trying to optimize every kind of branch.
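
For reference, the kind of structure I mean is something like this toy direct-mapped BTB (a deliberately simplified sketch of the general technique, not my actual RTL):

```python
class DirectMappedBTB:
    """Toy direct-mapped branch target buffer: one (tag, target) pair
    per set, indexed by the low PC bits. Real BTBs also track branch
    type, use partial tags, may be set-associative, etc."""
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries   # each slot: (tag, target) or None

    def _index_tag(self, pc):
        word = pc >> 2                  # drop alignment bits
        return word % self.entries, word // self.entries

    def predict(self, pc):
        """Fetch-stage lookup: predicted next PC, available without
        waiting for the instruction itself."""
        idx, tag = self._index_tag(pc)
        entry = self.table[idx]
        if entry is not None and entry[0] == tag:
            return entry[1]             # hit: redirect fetch
        return pc + 4                   # miss: predict fall-through

    def update(self, pc, target):
        """Train on a taken branch once it resolves in the pipeline."""
        idx, tag = self._index_tag(pc)
        self.table[idx] = (tag, target)
```

The direct-mapped indexing avoids the full CAM cost Bruce mentions, at the price of aliasing between branches that happen to share an index.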

It will need far more resources (or have a lower hit rate than a branch predictor, if it has fewer entries), and most of it will be used by things that aren't JALR.

Yup. I know, and this is what I'm currently running into. I don't mind the required memory per se, but as I said, my implementation currently requires extra resources on top of the mere memory for storing the entries, and that's what I'll have to work on/refactor. Maybe I won't find a better implementation, and in that case it's the price I'll have to pay if I want to stick with this approach.

I mean .. experiment with different things by all means. I don't want to discourage that! And it depends on what your goal is.

Yup! And I did experiment with different things in my simulator earlier. This approach was the most promising of everything I tested, though my tests were by no means exhaustive. I knew this approach was going to be expensive, and now I have real data to quantify that.

For reference (BHT = Branch History Table, aka 2 bits per entry predictor, BTB = predicts the next PC, RAS = Return Address Stack):

FE310-G000: 128 BHT, 40 BTB, 2 RAS
FE310-G002: 512 BHT, 28 BTB, 6 RAS
FU540-C000: 256 BHT, 30 BTB, 6 RAS

ARM A53: 3072 BHT, 256 BTB, 8 RAS

Thanks for these figures!
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #160 on: April 26, 2020, 02:49:47 pm »
As to max frequency for a typical RV32IM (or RV64IM while we're at it) core on a typical FPGA (3-stage and 5-stage pipeline cores for instance), would you happen to have some real figures, just so I can get some idea of what should be achievable?

I suppose SiFive's cores have all been prototyped on FPGAs? Would you have figures about achievable max freq (and on which FPGA(s)) ?
(I think I remember one figure: if I'm not mistaken, the U540 runs @100MHz on FPGA - not sure which - and I don't know whether this is the max frequency achievable either.)
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #161 on: April 27, 2020, 02:01:33 am »
As to max frequency for a typical RV32IM (or RV64IM while we're at it) core on a typical FPGA (3-stage and 5-stage pipeline cores for instance), would you happen to have some real figures, just so I can get some idea of what should be achievable?

Frequency isn't necessarily the number you're looking for :-) The popular picorv32 core runs at 300 or 400 MHz in cheap FPGAs, but uses I think 4 clock cycles per instruction. And then there is the very small SERV core (about 300 LUT4s?) which runs at 300 MHz but uses 32 clock cycles for most instructions and 64 for branches and stores.

I think 100 MHz is pretty typical for 1 CPI designs.

Quote
I suppose SiFive's cores have all been prototyped on FPGAs? Would you have figures about achievable max freq (and on which FPGA(s)) ?
(I think I remember one figure: if I'm not mistaken, the U540 runs @100MHz on FPGA - not sure which - and I don't know whether this is the max frequency achievable either.)

That was certainly true up to the time I stopped working there, though the more sophisticated designs such as the OoO 8-series, and especially simulating an entire SoC with multiple cores, need a very large FPGA. The FU540 was prototyped on the VC707 board ($3500) but the FU740 and FU840 have needed the VCU118 board ($7915).

There is no attempt to make those cores run as quickly as possible. They use RTL designed for the eventual SoC, which is not optimal for FPGA. Even if they run at 30 MHz or 50 MHz that's still massively faster than verilator, and is good enough to boot Linux and run SPEC and so forth with performance per clock representative of the final 1.5 GHz (or whatever) SoC. In fact those FPGA prototypes deliberately slow down RAM access to be proportional to how it will be on the SoC.
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #162 on: April 27, 2020, 01:37:05 pm »
There is no attempt to make those cores run as quickly as possible. They use RTL designed for the eventual SoC, which is not optimal for FPGA. Even if they run at 30 MHz or 50 MHz that's still massively faster than verilator, and is good enough to boot Linux and run SPEC and so forth with performance per clock representative of the final 1.5 GHz (or whatever) SoC.

Yeah I agree with this. But I unfortunately have no means of knowing what exactly I could get directly on silicon for now, so FPGA prototyping is all I have to get an approximate, and relative, idea.
(I would need access to Synopsys - or similar - tools along with some PDK, which I don't have at the moment.)

I have already noticed that some logic paths in my implementation, synthesized on FPGA, were unnecessarily long compared to what I expected, but yes, that's due to how FPGA slices are designed. My goal is not to tweak my RTL to make it specifically efficient on FPGAs either; I just want a relative idea. If ~100 MHz on a typical FPGA for a ~1 CPI, pipelined RV32 core is already reasonable, then I'm fine for now.

I haven't even tested the 64-bit version on FPGA yet (my core is generic enough to implement both RV32 and RV64), but I expect it to reach much lower frequencies, and the 30 MHz to 50 MHz figure you're giving is likely what I'll get.

I'd be curious to see how fast it could run on silicon, say on a 45nm process...
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #163 on: April 27, 2020, 02:03:53 pm »
At this point, that's what I managed to get on a Spartan 6 LX 25 for the whole core (including BHT, BTB and register file):
(and yes, this is with a large BTB of 512 entries - I know what you're going to say - I'll likely rework this later on... but at this point, I'm not too shocked by the 90Kbits required.)

  • 2205 LUTs
  • 5 BRAM blocks (90Kbits)
  • 12 DSP48A1 slices
  • ~111 MHz max.
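
As a sanity check on those numbers (assuming Spartan-6 18Kbit RAMB16BWER blocks and full 32-bit BTB tags and targets - both assumptions mine):

```python
# 5 Spartan-6 block RAMs at 18 Kbit each match the reported 90 Kbits.
bram_kbits = 5 * 18                      # -> 90

# A 512-entry BTB with full 32-bit tags and targets alone needs:
btb_kbits = 512 * (32 + 32) // 1024      # -> 32
```

So the 512-entry BTB would account for roughly a third of the block RAM on its own, with the rest left for the BHT, register file, etc.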

« Last Edit: April 27, 2020, 02:12:16 pm by SiliconWizard »
 

Offline 0db

  • Frequent Contributor
  • **
  • Posts: 336
  • Country: zm
Re: The RISC-V ISA discussion
« Reply #164 on: April 27, 2020, 04:55:59 pm »
The popular picorv32 core runs at 300 or 400 MHz in cheap FPGAs

Which cheap FPGAs?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #165 on: April 27, 2020, 05:27:26 pm »
The popular picorv32 core runs at 300 or 400 MHz in cheap FPGAs

Which cheap FPGAs?

7-Series Xilinx
 

Offline SiliconWizardTopic starter

  • Super Contributor
  • ***
  • Posts: 17190
  • Country: fr
Re: The RISC-V ISA discussion
« Reply #166 on: April 27, 2020, 05:34:38 pm »
That's stated there: https://github.com/cliffordwolf/picorv32
The core should run @300-400MHz on a Xilinx Artix-7 or something. (Note that I don't really consider the Artix-7 a "cheap" FPGA, but it's all relative. Compared to the high-end FPGAs that are often used for CPU prototyping, such as the Virtex series, it sure is cheap.)

Note that the goal is entirely different. It's designed with an average CPI of 4, and they say it's optimized for size and maximum clock frequency, not per-clock performance.
This core running @400MHz will have approx. the same performance as a pipelined core running @100MHz with an average CPI close to 1.
So it's an exercise in simplicity, and I get the idea. Implementing a more complex, pipelined core, as I experienced, is not an easy task.
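
The equivalence is just instructions-per-second arithmetic:

```python
# Throughput = clock frequency / average CPI.
picorv32_ips = 400e6 / 4     # ~4 cycles per instruction at 400 MHz
pipelined_ips = 100e6 / 1    # ~1 CPI at 100 MHz
# Both come out to 100 million instructions per second.
```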

In terms of features, I'd say my core is somewhere between their "regular" and "large" variants (917 and 2019 LUTs respectively), so with my ~2200 LUTs (and much better performance/MHz), my approach doesn't look too bad.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5841
  • Country: nz
Re: The RISC-V ISA discussion
« Reply #167 on: April 27, 2020, 08:27:22 pm »
This core running @400MHz will have approx. the same performance as a pipelined core running @100MHz with an average CPI close to 1.

Yes, I said it uses about 4 clock cycles per instruction.

The idea is that if you're using it to control other things in the FPGA, then the simple surrounding logic will often run at 400 MHz, and you don't want the CPU slowing it down or to have to deal with different clock domains.
 

