I haven't looked at it in detail yet, but the "FENCE" class looks interesting!
Radical personalization is the present and future of many industries. Now it's available in CPUs too.
EIEIO. And related :-)
Expensive? 60 USD is nothing for a board.
There is no great *technical* advantage over ARM or MIPS, but also no disadvantage. Compare code size, compare Dhrystone or Coremark or SPEC ... it's a photo finish in most cases. MIPS code is the biggest (and microMIPS doesn't help as much as Thumb or rvc), rv32i is comparable to ARM, rv32ic to Thumb2. In 64 bit, rv64ic is much smaller than anything else (ARM didn't see fit to duplicate Thumb in 64 bit!).
In case anyone is interested in instruction encodings...
I just went over to ARM to get a sense of the size of their instruction set. Somehow, I think they have moved beyond Reduced Instruction Set with the latest designs. There certainly are some 'interesting' instructions but I wonder which opcodes GCC actually uses.
QuoteRadical personalization is the present and future of many industries. Now it's available in CPUs too.
So if I understand correctly, RISC-V is more intended for high volume SoC type customers who want to make specialized cores? (i.e. the Western Digital use case).
I looked through the SiFive site and it seems the message is that you can get your own custom SoC made quicker.
QuoteI haven't looked at it in detail yet, but the "FENCE" class looks interesting!
QuoteEIEIO. And related :-)
This will add more "fun" to every superscalar implementation of RISC-V. EIEIO + isync + sync on our PowerPC 460 can stir up great emotions: people hammering their heads on the desk and wanting to throw the target board out of the window ... which is ... love ... in reverse order
Another interesting point I see: like MIPS and PowerPC, RISC-V uses LL/SC to emulate CAS; that is, LL/SC is used to write a tiny code sequence that loads a target memory address, compares it to a comparand, and then writes back a swap value to the target if the comparand and target values are equal.
It would be interesting to know how LL monitors an address (say, a semaphore), and how SC does its job.
A senior here said that x86/x64 is better because it implements DWCAS (a sort of CAS, but more complex) instead of LL/SC ... dunno, I have ZERO experience with Intel x86.
Thinking about the smallest FPGA incarnation: does RISC-V make sense as a general-purpose drop-in core? Maybe there is a project where the CPU just handles details (console I/O or file I/O, say) but the majority of the project is some kind of hardware thing (even including another CPU) that just needs a little high-level help; that is, the full hardware description is too ugly to contemplate and a programmable core would smooth things out.
It should be an interesting winter.
- strict separation of computation from data transfer (load/store)
- enough registers that you don't touch memory much. Arguments for most functions fit in registers, and the return address too (the otherwise RISC AVR8 violates this).
- no instruction can cause more than one cache or TLB miss, or two adjacent lines/pages if unaligned access is supported (and this case might be trapped and emulated)
- each instruction modifies at most one register.
- integer instructions read at most two registers. This is ultra-purist :-) A number of RISC ISAs break it in order to have e.g. a register plus (scaled) register addressing mode, or conditional select. But no more than three!
- no microcode or hardware sequencing. Each instruction executes in a small and fixed number of clock cycles (usually one). Load/Store multiple are the main offenders in both ARM and PowerPC. They help with code size, but it's interesting that ARM didn't put them in Aarch64 and is deprecating them in 32 bit as well, providing the much less offensive load/store pair.
What a huge number of instructions *does* do is make very small low end implementations impossible. And puts a big burden of work on every hardware and every emulator implementer.
QuoteThere is no great *technical* advantage over ARM or MIPS, but also no disadvantage. Compare code size, compare Dhrystone or Coremark or SPEC ... it's a photo finish in most cases.
A lack of flags that increases code size by 4 times and requires 2 extra registers to detect various conditions sure seems like a disadvantage. That extra code and register pressure also make the caches effectively smaller. And having to execute an ALU operation twice or more cannot help power efficiency.
Technically, only flags that represent changes in state, like carry and overflow, are required; zero, negative, and parity can be computed at any time. What I would like to see is a design where the flags that require state are stored in a small register dedicated to each destination register. That avoids the hazard of a single flags register as in x86, and it avoids a flags-register operand, which would cost extra instruction bits.
Some ISAs do this to track whether a register has been used in the current execution context so that the entire register set does not need to be saved on a context switch. The first use of a register is just another bit of state to save.
I've been tinkering with RISC-V in my spare time, and I have to say that the 32-bit integer instruction set is quite nice for hardware implementation:
- The source and destination registers are always encoded in the same place.
- The most significant bit of any constant is always in the same place (makes for easy sign extension)
- The privileged instructions (ones that need to be trapped for OS / Hypervisor) are all nicely contained
The only thing I find awkward is the encoding of the offsets in the jump instructions: fine for hardware, but painful to decode in a naive software emulator.
Quote- each instruction modifies at most one register.
That is pretty standard but how then do you handle integer multiplies and divides? Break them up into two instructions?
If both the high and low bits of the same product are required, then the recommended code sequence is: MULH[[ S ]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in same order and rdh cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies.
Is there a preferred or recommended memory map for a RISC-V environment?
The ISA spec doesn't have much to say, apart from that the ISA is set up to be helpful for generating relocatable code. Is there a guide of "common/best practice" for where you put your memory-mapped I/O, bootstrap ROMs, and so on?
For my software emulator I was thinking of trying something like the FE310-G000:Quote00000000:00000FFF Debug address space
00001000:01FFFFFF On-chip Non volatile memory
02000000:1FFFFFFF I/O
20000000:7FFFFFFF Off-chip Non volatile memory
80000000:FFFFFFFF On-chip volatile memory
And after a reset, execution starts at 0x00001000.
Does that sound sane?
/dts-v1/;
/ {
    model = "SiFive,FE310G-0000-Z0";
    compatible = "sifive,fe300";
    /include/ 0x20004;
};
0x01_0000_0000:0x0F_FFFF_FFFF Peripherals
0x10_0000_0000:0x1F_FFFF_FFFF System
0x20_0000_0000:0x3F_FFFF_FFFF RAM
You could probably re-code it on MIPS one-to-one, except for LUI (when not followed by XORI or ADDI), which would require an extra instruction: a very simple hardware emulator.
Why does every instruction have "11" at the end? That way it only uses 1/4 of the code-point space.
It would be an interesting experiment to implement this. And this is EXACTLY what RISC-V enables you to do for low cost in time and money. Modify your favourite FPGA implementation to have your new instructions, modify gcc or llvm to generate them, and run dhrystone/coremark/SPEC/your favourite benchmark suite with and without using the new instructions. Publish the results with execution time, energy use, area cost, and any effect on MHz. We all learn something!
QuoteSome ISAs do this to track whether a register has been used in the current execution context so that the entire register set does not need to be saved on a context switch. The first use of a register is just another bit of state to save.
I haven't seen a bit for every register, but it's common for FPUs or vector units to have a single bit for the whole unit, as many programs don't use FP or vectors at all.
Back in June, Intel disclosed a "Lazy FPU State Restore" bug in all Core-based processors. Microsoft and others fixed the bug by disabling the use of the FPU dirty bit and just saving and restoring everything on every context switch. The effect on performance was basically unmeasurable.
Again, worth trying, though context switches are very rare on normal systems.
QuoteIt would be an interesting experiment to implement this. And this is EXACTLY what RISC-V enables you to do for low cost in time and money.
It would be too big of a change to RISC-V. It alters the basic ISA and architecture and then a new code generator would be required anyway. It goes against the design principles of RISC-V.
Intel has a lot of performance problems with their vector units, so much so that they had to issue a directive not to use them for things like memory copies.