including the ones that are most difficult to implement
- pre increment?
- post increment?
- stack? (push/pop)?
about the first two instructions, ARM seems to have them, while m88K doesn't,
and currently I have no idea how to implement them in RISC terms
thanks
bit count instructions. like 'how many bits in this byte are 1'
i HATE RISC processors by the way. i want processors with massive single cycle instruction sets.
When i was still mucking inside silicon i built hardware instruction set expanders for an ARM7 in an FPGA.
i even had vector-like capabilities
bit count instructions. like 'how many bits in this byte are 1'
to be used for bitmap purposes?
I can implement an instruction like that; can you suggest an "opcode name" for it?
what should it be called?
bitcnt rt, rs1
(how many bits in register rs1 are "1"?, put the result int rt)
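The intended semantics of that bitcnt could be pinned down with a naive C reference model (my sketch of the behavior, not an implementation proposal; the function name just mirrors the proposed mnemonic):

```c
#include <stdint.h>

/* Reference model for: bitcnt rt, rs1
 * Counts how many bits in rs1 are "1"; the result goes to rt. */
uint32_t bitcnt(uint32_t rs1)
{
    uint32_t rt = 0;
    while (rs1) {
        rt += rs1 & 1u;   /* add the low bit */
        rs1 >>= 1;        /* shift the next bit down */
    }
    return rt;
}
```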
I can also offer bit manipulation
Advanced bit manipulation instructions or engine.
Simple things like reversing the order of all the bits, as well as more complex things like an arbitrarily programmable remapping of input bits to output bits. This is very slow to do in software on pretty much any CPU I can imagine:
8-bit value:
input bit -> output bit
0 -> 5
1 -> 7
2 -> 2
3 -> 0
4 -> 6
5 -> 1
6 -> 4
7 -> 3
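A software version of that exact table shows why it's slow: one loop iteration per bit, plus a table access each time (the map[] array and function name below are mine, just encoding the table above):

```c
#include <stdint.h>

/* Output bit positions for input bits 0..7, per the table above. */
static const uint8_t map[8] = { 5, 7, 2, 0, 6, 1, 4, 3 };

uint8_t remap_bits(uint8_t in)
{
    uint8_t out = 0;
    for (int i = 0; i < 8; i++)
        if (in & (1u << i))                     /* is input bit i set?     */
            out |= (uint8_t)(1u << map[i]);     /* then set output bit map[i] */
    return out;
}
```

A hardware permutation network does all eight moves in parallel in one cycle, which is the whole argument for the instruction.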
When i was still mucking inside silicon i built hardware instruction set expanders for an ARM7 in an FPGA.
i even had vector-like capabilities
like the tensor processor unit (TPU) by SGI
?
(kidding)
do you think I should implement a MAC (multiply-accumulate, from the DSP world)?
saturated operation ALU engine?
something to accelerate matrix computation?
(exactly what?)
about pre-increment and post-decrement (and stack operations)
how are they implemented within the pipeline?
in which stage?
here, Arise-v2 (my soft core) currently has a strict RISC pipeline:
- fetch
- decode
- register_rd (2 registers)
- ALU/EA (multiply and division will take more cycles)
- load/store (dtack will take more cycles)
- register_wr(1 register), branch
which looks like the MIPS approach, and MIPS doesn't have them
bit count instructions. like 'how many bits in this byte are 1'
to be used for bitmap purposes?
I can implement an instruction like that; can you suggest an "opcode name" for it?
what should it be called?
The SPARC v9 architecture defines an instruction
popc rs, rd, meaning "population count". It wasn't actually implemented in hardware until the UltraSPARC IV. This is also called the "Hamming weight", so
hwt would be a suitable mnemonic. Using a combination of bitwise operations and popc, you can implement the FF1 and FF0 operations (Find First 1 or Find First 0), which are used in hash algorithms.
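One such combination: FF1 (the index of the lowest set bit) falls out of popc applied to the trailing-zero mask. A C sketch, with hwt() standing in for the hardware instruction (the helper is mine, just so the snippet is self-contained):

```c
#include <stdint.h>

/* Software stand-in for the hardware popc / hwt instruction. */
static uint32_t hwt(uint32_t x)
{
    uint32_t n = 0;
    while (x) { n += x & 1u; x >>= 1; }
    return n;
}

/* Find First 1: index of the lowest set bit (caller must ensure x != 0).
 * x & -x isolates the lowest set bit; subtracting 1 turns it into a mask
 * of the trailing zeros, whose population count is the bit index. */
uint32_t ff1(uint32_t x)
{
    return hwt((x & -x) - 1u);
}
```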
I want all the instructions from a modern ARM architecture (let's say ARMv8). It has all you need and some more.
But good luck implementing all of that; it is a nightmare.
how are they implemented within the pipeline?
in which stage?
The decoder issues multiple simple instructions. Push/pop typically take more cycles because of this.
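The cracked form can be modeled in C: each stack operation becomes one ALU micro-op plus one memory micro-op (a sketch assuming a full descending stack of 32-bit words; names are illustrative):

```c
#include <stdint.h>

/* push cracked by the decoder into two micro-ops. */
void push(uint32_t **sp, uint32_t x)
{
    *sp -= 1;       /* micro-op 1: pre-decrement the stack pointer (ALU)   */
    **sp = x;       /* micro-op 2: store through the updated pointer (mem) */
}

/* pop cracked the same way, in the opposite order. */
void pop(uint32_t **sp, uint32_t *rt)
{
    *rt = **sp;     /* micro-op 1: load (mem)                              */
    *sp += 1;       /* micro-op 2: post-increment the stack pointer (ALU)  */
}
```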
I want all the instructions from a modern ARM architecture (let's say ARMv8). It has all you need and some more.
But good luck implementing all of that; it is a nightmare.
hence we need a compromise
hence we need a compromise
No, you need a compromise, I'm perfectly happy using ARM
- pre increment?
- post increment?
- stack? (push/pop)?
I have no idea about how to implement them in RISC terms
When implemented for convenience, you crack them within the dispatcher. However, that will require instruction-group restrictions, which makes optimization more difficult.
As an example,
stwu sp,-stksz(sp)
would crack to an add & store, but there would be a FIG restriction. The store has a register dependency which can't be mitigated.
bit count instructions. like 'how many bits in this byte are 1'
i HATE RISC processors by the way. i want processors with massive single cycle instruction sets.
When i was still mucking inside silicon i built hardware instruction set expanders for an ARM7 in an FPGA.
i even had vector-like capabilities
It is a question of transistor count I suppose.
"Given a million transistors, is it better to have them arranged into a single CISC or several smaller RISCs?"
Transistor count is never the constraint, wiring and power dissipation are.
Transistor count is never the constraint, wiring and power dissipation are.
those are called "microarchitectural constraints", right
?
guys, and what about addressing modes ?
e.g. should I include "scaled indexed addressing mode" ?
(currently it takes +1 extra cycle)
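For reference, the effective-address computation that mode adds is just one extra shift feeding the adder — which is where the extra cycle goes in a simple EA stage (a sketch, with illustrative names):

```c
#include <stdint.h>

/* Scaled indexed addressing: EA = base + (index << scale).
 * The shift before the add is the extra work versus plain base+offset. */
uint32_t scaled_index_ea(uint32_t base, uint32_t index, unsigned scale)
{
    return base + (index << scale);
}
```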
e.g. should I include "scaled indexed addressing mode" ?
Sure, why not?
The point is, the more you include, the more people will use. But if the core is an FPGA-only soft core, then it must be small in the first place.
bit count instructions. like 'how many bits in this byte are 1'
If you have 256 nibbles free (which I know is a big if), this is a simple array lookup. Not sure if it deserves its own instruction?
Not sure if it deserves its own instruction?
Everything deserves its own instruction. Why not? Apart from being annoying for chip designers, there is really no downside. None of those small extensions significantly affect the price of the device.
Array lookups need memory access, and that's expensive; actually more expensive than counting with a simple, stupid loop.
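For comparison, here is the table version being debated, done with a 256-entry byte table for simplicity (my sketch; whether the memory traffic beats the loop depends entirely on the core and its memory system):

```c
#include <stdint.h>

static uint8_t popc_table[256];

/* Fill the table once: popcount(i) = popcount(i/2) + (low bit of i). */
static void init_table(void)
{
    for (int i = 1; i < 256; i++)
        popc_table[i] = (uint8_t)(popc_table[i / 2] + (i & 1));
}

/* Count set bits in a 32-bit word with four table lookups. */
uint32_t popc32(uint32_t x)
{
    return popc_table[x & 0xFF]
         + popc_table[(x >> 8) & 0xFF]
         + popc_table[(x >> 16) & 0xFF]
         + popc_table[x >> 24];
}
```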
For soft cores, look at MicroBlaze or NIOS-II. They have all you really need and nothing extra. Plus, they support extensions. I know it is much more fun to design cores, of course.
I want all the instructions from a modern ARM architecture (let's say ARMv8). It has all you need and some more.
But good luck implementing all of that; it is a nightmare.
hence we need a compromise
Such a compromise was available starting in the 70s from several computer companies, mini and mainframe types, that offered WCS (writable control store), which allowed end users to define and run their own machine instructions using the same microcode mechanism as the standard supplied instructions.
https://en.wikipedia.org/wiki/Microcode
Such a compromise was available starting in the 70s from several computer companies, mini and mainframe types, that offered WCS (writable control store), which allowed end users to define and run their own machine instructions using the same microcode mechanism as the standard supplied instructions.
Because the compilers of the day were getting a little too good at actually using the available instructions!
I have no idea about how to implement [autoincrement/etc] in RISC terms
Presumably, since an ALU operation happens in a single clock cycle and a memory access takes more than one clock, you just do the increment/etc. while you're waiting for the memory. That's where all those weird modes of RISC operands come from, right: "We have an ALU, a barrel shifter, and ...; they should all be able to do something each clock cycle!"
The less standard thing that I wish were more common is probably bitfield instructions (combining barrel shift and AND.)
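The shift-plus-AND pair that a bitfield-extract instruction (e.g. ARM's UBFX) fuses into one operation looks like this in C (a sketch with illustrative names):

```c
#include <stdint.h>

/* Extract 'len' bits starting at bit position 'lo': the barrel-shift
 * plus AND-mask pair that a single bitfield instruction would replace.
 * Assumes 0 < len <= 31. */
uint32_t extract_bits(uint32_t x, unsigned lo, unsigned len)
{
    return (x >> lo) & ((1u << len) - 1u);
}
```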
I wonder how the PDP11 or PDP10 instruction sets (both of which are nominally register/memory architectures) would work if you RISCified them and made it so only the load/store operations could access memory?
I want all the instructions from a modern ARM architecture
The 32-bit ARM instructions, I hope? The Cortex (thumbX only) encodings have lost any elegance, and the CM0 is almost painful in its weird limitations...