Author Topic: if U were a RISC assembly programmer which instructions would U want to have ?  (Read 21033 times)


Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
I'd put the PC, SP and flags in the register map as well.

register-file

no, I can't, because I need (1) to write 2 registers in the "write back" stage, so I need to have status, PC and SP implemented outside the register file, which makes it possible to write just one register per clock cycle

(1) it's required by features like "private registers during exception" and "private registers during function call": I can support both of them in hw, without pipeline stall, but I need to have PC and SP implemented externally, otherwise I will need +3 clock cycles of stall.

read more than one register simultaneously to save clock cycles

Currently I can read 2 registers simultaneously, but I can't write 2 registers simultaneously, just one per clock cycle
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 26906
  • Country: nl
    • NCT Developments
But you don't have to write the result, PC and flags all at once. The PC can be updated at the decoding stage, flags during fetch. SP is a general purpose register anyway if you make indexed addressing flexible.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
The PC can be updated at the decoding stage

no, it's updated in the write back stage due to the debug constraints
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
SP is a general purpose register

it's a special register, the only one passed in a function call: I am implementing a simplified form of "register windows" where each function has its private registers, so the caller's SP needs to be passed (copied) into the callee's SP

things like parameters are passed on the stack instead of in registers, it's my choice

also, in this way, SP is the only one that can have pre/post increment/decrement without stall penalties
« Last Edit: March 03, 2016, 06:54:59 am by legacy »
 

Offline sarepairman2

  • Frequent Contributor
  • **
  • Posts: 480
  • Country: 00
goto op amp
 

Offline ale500

  • Frequent Contributor
  • **
  • Posts: 415
I like the following instruction (from PowerPC): isel rdest, ra, rb, rsel: if rsel is 0 then rdest gets the value of ra, if rsel != 0 then rdest gets rb (or was it with 3 arguments?).
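
To make the semantics concrete, here is a minimal Python sketch of the select operation exactly as described in this post (a register as selector). The function names `isel` and `max_u32` and the 32-bit masking are assumptions for illustration. As far as I know, the real PowerPC `isel` selects on a condition-register bit rather than on a general register, so treat this as a model of the idea, not of the ISA.

```python
MASK32 = 0xFFFFFFFF

def isel(ra: int, rb: int, rsel: int) -> int:
    """Branch-free select as described in the post: ra when rsel == 0, otherwise rb."""
    return (ra if rsel == 0 else rb) & MASK32

def max_u32(a: int, b: int) -> int:
    """Example use: unsigned max without a branch instruction."""
    a_lt_b = 1 if a < b else 0   # stands in for the result of a compare instruction
    return isel(a, b, a_lt_b)
```

The attraction for a soft-core is that a short if/else becomes a single ALU-stage mux, with no branch and no pipeline flush.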

Can I ask what decode does? What do you do there? (I like designing soft-cores too.)
 

Offline Cerebus

  • Super Contributor
  • ***
  • Posts: 10576
  • Country: gb
I'd put the PC, SP and flags in the register map as well.

register-file

no, I can't because I need (1) to write 2 register in the "write back" stage, so I need to have status, PC and SP implemented out of the register file, which makes possible to write just one register per clock cycle



You can. Keep the PC, SP and flags in dedicated registers and just add decoding logic that muxes them into the data path instead of the register file when they're addressed as registers. It obviously means that the physical registers in the register file that sit at the same addresses never get actually used.

You can see something similar in the design below that I cooked up for fun back in 2008 while learning Verilog. Note in this case that the PC is spread out along the pipeline registers but can be read as if it was a register, or written as if it's a register in the writeback phase.
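
A behavioural sketch of that muxing idea, in Python rather than Verilog. Everything here is hypothetical: the class name, the 16-entry file and the choice of indices 13 to 15 for flags/SP/PC are assumptions made for illustration, not taken from the 2008 design or from Arise.

```python
class RegisterMap:
    """Register file with one write port; some indices alias dedicated registers."""
    FLAGS, SP, PC = 13, 14, 15          # hypothetical index assignment

    def __init__(self):
        self.file = [0] * 16            # ordinary register file
        self.special = {self.FLAGS: 0, self.SP: 0, self.PC: 0}

    def read(self, idx):
        # Decode-stage mux: a special index selects the dedicated register.
        return self.special[idx] if idx in self.special else self.file[idx]

    def write(self, idx, value):
        value &= 0xFFFFFFFF
        if idx in self.special:
            self.special[idx] = value   # the file slot at this index is never used
        else:
            self.file[idx] = value

    def writeback(self, idx, value, next_pc):
        """One register write plus a PC update in the same cycle."""
        self.write(idx, value)
        if idx != self.PC:              # an explicit write to PC overrides the sequential update
            self.special[self.PC] = next_pc & 0xFFFFFFFF
```

Because PC lives outside the file, `writeback` only ever uses the single write port of `file` once per call, which is the constraint legacy described.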
Anybody got a syringe I can use to squeeze the magic smoke back into this?
 

Offline CXm0dk

  • Newbie
  • Posts: 4
  • Country: 00
including the most difficult to be implemented
  • pre increment?
  • post increment?
  • stack? (push/pop)?

Hello legacy,

  here is an idea discussed in this forum in the past: a looping instruction to increase code density and a means of reducing fixed loop overhead:

Code: [Select]
nloop iter,#of instruction

0 <= iter < 16             iterate the following instructions 3 to 18 times (iter + 3)
0 <= #of instruction < 16  number of instructions to iterate, 1 through 16 (#of instruction + 1; only 32bit instruction size?)

then, you can write this in Assembly:

Code: [Select]
mov Ru15,0x0Af1794d      // random address of data
enter_loop: nloop 6      // for (i=9 ; i>0 ; i--) || for (i=0 ; i<9; i++)
{ slwi                   // or rlwinm, see previous post of backsheeplogic
  load Ru01,Ru15         // get value
  mov  Ru06,Ru01         // duplicate
  bitrev 17,Ru00,Ru01    // number of bit 1
  neg Ru01               // bit invert of value
  bitrev 17,Ru02,Ru01    // number of bit 0 of initial value
  rsht Ru01,17           // right shift of complemented upper value
  add Ru01,17            // add it + 17
  xor Ru06,Ru01          // shuffle & shake
  mac Ru03,Ru06,Ru00     // famous DSP function
  mac Ru06,Ru03,Ru02     // more shuffled with famous DSP function
  sto Ru15,Ru06          // store #@%&~*
  add Ru15,4             // next value
}                        // nloop 6,12  end of 9 loops of 13 instructions
exit_loop:               // looped without using a register for i (shadow register?) Loop overhead 1 instruction!

  Now you have the primitive for pre/post increment or push/pop n/POPCNT!

  Another example:

Code: [Select]
nloop 0,0 imul Ru00,Ru00   // square of square of square of Ru00! (iter = 0 gives 3 iterations)
or
Code: [Select]
imul Ru00,Ru00
imul Ru00,Ru00
imul Ru00,Ru00

  Have fun! ;-)
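
For anyone who wants to play with the idea before touching HDL, here is a small Python model of the nloop behaviour as I read the encoding above (iter + 3 iterations over the next count + 1 instructions). The tuple-based program format, the `run` function and the two supported opcodes are assumptions made for this sketch, and nesting is not handled.

```python
def run(program, regs):
    """Execute (op, *args) tuples; returns the number of instructions executed."""
    pc = 0
    executed = 0
    loop = None                                  # shadow state: [start, end, remaining]
    while pc < len(program):
        op, *args = program[pc]
        if op == "nloop":
            iter_field, count_field = args
            assert 0 <= iter_field < 16 and 0 <= count_field < 16
            loop = [pc + 1, pc + 1 + count_field, iter_field + 3]
        elif op == "imul":
            dst, src = args
            regs[dst] = (regs[dst] * regs[src]) & 0xFFFFFFFF
        elif op == "add":
            dst, imm = args
            regs[dst] = (regs[dst] + imm) & 0xFFFFFFFF
        else:
            raise ValueError(f"unknown op {op}")
        executed += 1
        # Hardware loop: at the last body instruction, go back without a branch.
        if loop is not None and pc == loop[1]:
            loop[2] -= 1
            if loop[2] > 0:
                pc = loop[0]
                continue
            loop = None
        pc += 1
    return executed

regs = {"Ru00": 2}
count = run([("nloop", 0, 0), ("imul", "Ru00", "Ru00")], regs)
```

With Ru00 starting at 2, three squarings give 256, and the whole thing costs four executed instructions: the nloop itself plus three imul.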
« Last Edit: March 04, 2016, 09:40:56 am by CXm0dk »
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
You can. Keep the PC, SP and flags in dedicated registers and just add decoding logic that muxes them into the data path instead of the register file when they're addressed as registers. It obviously means that the physical registers in the register file that sit at the same addresses never get actually used.

You can see something similar in the design below that I cooked up for fun back in 2008 while learning Verilog. Note in this case that the PC is spread out along the pipeline registers but can be read as if it was a register, or written as if it's a register in the writeback phase.

good trick, brilliant :D
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
  an idea discussed in this forum in the past, the looping instruction for increasing code density and a mean to reduce fixed loop overhead:

Code: [Select]
nloop iter,#of instruction

excellent, thank you very much :D
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
  Have fun! ;-)

yep, I bought a Papilio Pro board. It has a Spartan-6 FPGA and comes with synchronous DRAM (8 MByte) already installed on the board. I added an external asynchronous static RAM (2 MByte), and I am going to add an STN LCD (4 bit, 320x240, 2 bits of color) and a second UART console, since the first UART is used by the DebugEngine

Arise-v2.1 and greater will run there with a few of your tips & tricks successfully implemented (I hope) :D
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
what decode do ?

it prepares the control signals along the data path and takes care of the magic under the hood
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Code: [Select]
{ slwi                   // or rlwinm, see previous post of backsheeplogic

who is that guy called "slwi" ?
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Code: [Select]
exit_loop:               // looped without using a register for i (shadow register?) Loop overhead 1 instruction!

yes, basically it's "decrement and loop until zero", in my head it uses a shadow register


Now you have the primitive for pre/post increment or push/pop n/POPCNT!

I don't get it  :-// I was talking about a single instruction that is able to perform pre/post increment within the load/store class of instructions

something like:
push r1 ---> store r1, (sp)+4 ---> store r1, (sp), sp=sp+4; 1 instruction
pop r1 ---> load r1, -4(sp) ---> sp=sp-4, load r1, (sp); 1 instruction

sizeof(r1) = 4 bytes, it's a 32-bit register; the load/store stage is attached to a 32-bit bus, but it's able to access memory with byte granularity
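
The same two operations as a behavioural sketch in Python, following the arrows above (upward-growing stack, post-increment on push, pre-decrement on pop). The class name `Stack32`, the byte-addressed `bytearray` memory and the little-endian byte order are assumptions made only for this illustration.

```python
class Stack32:
    """Byte-addressable memory with a 32-bit push/pop pair."""

    def __init__(self, size=64, sp=0):
        self.mem = bytearray(size)
        self.sp = sp

    def push(self, value):
        # store r1, (sp)+4  ->  store at sp, then sp = sp + 4
        self.mem[self.sp:self.sp + 4] = (value & 0xFFFFFFFF).to_bytes(4, "little")
        self.sp += 4

    def pop(self):
        # load r1, -4(sp)  ->  sp = sp - 4, then load from sp
        self.sp -= 4
        return int.from_bytes(self.mem[self.sp:self.sp + 4], "little")
```

Each method touches memory once and SP once, which is why the single-instruction form needs SP to be updatable alongside the normal register write.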
 

Offline CXm0dk

  • Newbie
  • Posts: 4
  • Country: 00
Code: [Select]
{ slwi                   // or rlwinm, see previous post of backsheeplogic

who is that guy called "slwi" ?
  see this previous post
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
oh, PowerPC, it's this guy  :)
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
I can implement "rlwinm" with +1 penalty (it takes an extra clock cycle in the ALU stage), but …

Code: [Select]
dest-register, source-register, rotate-left, mask-left, mask-right

it has too many parameters  :wtf: :wtf: :wtf:

do I have to implement "variable length" opcodes in the fetch unit like in Blackfin?
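
For reference, PowerPC itself fits rlwinm into one fixed 32-bit word (6-bit opcode, five 5-bit fields and a record bit), so the operand count alone does not force variable-length opcodes. The Python sketch below models the operation and that packing as I understand the M-form layout; the helper names (`rotl32`, `mask32`, `encode_rlwinm`) are made up for this example.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit value left by n bits (0..31)."""
    x &= MASK32
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32

def mask32(mb, me):
    """Ones from bit mb through bit me, PowerPC numbering (bit 0 is the MSB)."""
    head = MASK32 >> mb                      # ones in bits mb..31
    tail = (MASK32 << (31 - me)) & MASK32    # ones in bits 0..me
    return head & tail if mb <= me else head | tail   # mb > me wraps around

def rlwinm(rs, sh, mb, me):
    """Rotate left word immediate, then AND with mask."""
    return rotl32(rs, sh) & mask32(mb, me)

def slwi(rs, n):
    """Shift left by n, expressed through rlwinm."""
    return rlwinm(rs, n, 0, 31 - n)

def encode_rlwinm(ra, rs, sh, mb, me, rc=0):
    """Pack the operands into one 32-bit word (primary opcode 21)."""
    for field in (ra, rs, sh, mb, me):
        assert 0 <= field < 32
    return (21 << 26) | (rs << 21) | (ra << 16) | (sh << 11) | (mb << 6) | (me << 1) | rc
```

The price is that all three immediates are limited to 5 bits, which is exactly enough for a 32-bit rotate and mask.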
 

Offline CXm0dk

  • Newbie
  • Posts: 4
  • Country: 00
Now you have the primitive for pre/post increment or push/pop n/POPCNT!

I don't get it  :-// as I was talking about 1 instruction which is able to perform pre/post increment within the load/store class of instructions


not in one instruction, for example, the pre-increment addressing mode:

Code: [Select]
xor Ru01,Ru01         // clear Ru01 (int sum = 0)
mov Ru15,0x64EF01AD   // tail pointer of an integer array
nloop 17              //  for (i = 20 ; i > 0 ; i--) sum += *(--pointer);
{
  sub Ru15,4          // decrement address ; pointer -= sizeof(int)
  load Ru00,Ru15      // value = *pointer
  add Ru01,Ru00       // sum += value
}

  I am not saying that the array is passed as the last parameter on the stack (in that case you would replace 0x64EF01AD with SP) ... ;-)

something like:
push r1 ---> store r1, (sp)+4 ---> store r1, (sp), sp=sp+4; 1 instruction
pop r1 ---> load r1, -4(sp) ---> sp=sp-4, load r1, (sp); 1 instruction


well, this is neither one instruction nor implementable with SP as a special register; I was just thinking of something like this for push n:

Code: [Select]
//  push in stack the int array[20]
mov Ru15,0x64EF01AD   // pointer = array
mov Ru31,SP           // next store space in stack ; instack = SP
nloop 17              // duplicate values of int array in stack : size(int array) - 3 = 17
{
  load Ru00,Ru15      // value = *pointer
  add Ru15,4          // next value in array ; pointer++
  add SP,4            // reserve space for value
  sto Ru31,Ru00       // *(SP-4) = value
  add Ru31,4          // next store address in stack ; instack++
}                     // nloop 17,4 ; 20 * 5 instructions executed

oh, I forgot an optimization  >:D:

Code: [Select]
//  push in stack the int array[20]
mov Ru15,0x64EF01AD   // pointer = array
mov Ru31,SP           // next store space in stack ; instack = SP
add SP,80             // reserve space in stack for the whole array
nloop 17              // duplicate values of int array in stack : size(int array) - 3 = 17
{
  load Ru00,Ru15      // value = *pointer
  add Ru15,4          // next value in array ; pointer++
  sto Ru31,Ru00       // *instack = value
  add Ru31,4          // next store address in stack ; instack++
}                     // nloop 17,3 ; 20 * 4 instructions executed if instructions are not parallelized between ALU & LOAD/STORE unit
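
A quick way to check the instruction counts quoted in the comments is to model both loops in Python. This is only a counting sketch: `copy_to_stack` and its return tuple are invented for the example, the stack is a plain list, and the 0..15 range given earlier for the iter field is not enforced, since the loops above use 17.

```python
def copy_to_stack(array, hoist_sp):
    """Model the two push loops; returns (stack_words, final_sp, instructions_executed)."""
    stack = []
    sp = 0
    executed = 2                   # mov Ru15,... and mov Ru31,SP
    if hoist_sp:
        sp += 4 * len(array)       # add SP,80 done once, before the loop
        executed += 1
    executed += 1                  # the nloop instruction itself
    for value in array:
        stack.append(value)        # effect of load followed by sto
        executed += 4              # load, add Ru15, sto, add Ru31
        if not hoist_sp:
            sp += 4                # add SP,4 inside the loop body
            executed += 1
    return stack, sp, executed

data = list(range(20))
naive = copy_to_stack(data, hoist_sp=False)
hoisted = copy_to_stack(data, hoist_sp=True)
```

Both variants leave the same words on the stack and the same final SP; the naive loop executes 103 instructions in total and the hoisted one 84, setup and nloop included.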


this is much simpler than any complex addressing mode, and nloop may be resolved in an early stage of your pipeline (using 0 real CPU cycles). Creating complex addressing modes will not cover all the use cases; in addition, the compiler will rarely use them, and they complicate the code generation stage... See the reduction in the addressing modes between the 68k family and its derivative versions.

Even better, if you have an instruction prefetch unit, it can prefetch the next instructions @exit_loop at the end of the first loop!  :popcorn:

EDIT: correction about the 68000 addressing mode features
« Last Edit: March 05, 2016, 12:24:50 pm by CXm0dk »
 

Offline helius

  • Super Contributor
  • ***
  • Posts: 3642
  • Country: us
Quote
creating complex addressing modes will not cover all the use cases, plus the compiler will rarely use them and that complicates the code generation stage... See the reduction in the addressing mode between the 68000 and the later versions.
In fact the later 680x0 processors greatly increased their addressing modes, not reduced them, adding complex modes like double-scaled double-indexed double-indirect. They had VAX envy...
 

Offline ale500

  • Frequent Contributor
  • **
  • Posts: 415
The 68020 and later had extra indirect addressing modes... those were the first to get the slowest implementation on the 68060, due to being hardly used.

Things like:
Code: [Select]
lea eax,[ebx+esi*4]

are sort of useful for fast add/multiply, but indirect addressing modes? One should check, for instance, what Metrowerks used to generate for the 68k and see whether those modes are used...

But if we look at the evolution of RISC, there is not much more than base+index and sometimes post/pre increment/decrement. And that means another write port to the register file, or another clock if there is only one port.



 

Offline Muxr

  • Super Contributor
  • ***
  • Posts: 1369
  • Country: us
I want a cpu core that does not need a compiler. the instruction set should be so simple that you can basically translate a high level language instruction directly into cpu instructions.

oh, and want a machine that calculates in BCD ( even for floating point )

So a FORTH processor like RTX2010?

How about that newfangled number format the UNUM?
Or picoJava, there was even a Verilog source for it: http://java.epicentertech.com/Archive_Embedded/Sun_Microsystems/Micro%20&%20Pico%20Java/PicoJava_Port_to_FPGA.pdf

Although it still required compilation, it ran Java bytecode natively.
 

Offline CXm0dk

  • Newbie
  • Posts: 4
  • Country: 00
In fact the later 680x0 processors greatly increased their addressing modes, not reduced them, adding complex modes like double-scaled double-indexed double-indirect. They had VAX envy...
Hello helius,

you are right, my statement was inaccurate :palm:. It's the ColdFire that simplified the addressing modes, see Differences between ColdFire & 68K

Maybe I should have written "See the reduction in the addressing mode between the 68k families and the earlier derivative versions."
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
PowerPC addressing modes

Quote
the effective address (EA), also called the logical address, is the address computed by the processor when executing a memory access or branch instruction or when fetching the next sequential instruction. Unless address translation is disabled, this address is converted by the MMU to the appropriate physical address. (Note that the architecture specification uses only the term effective address and not logical address.)

The PowerPC architecture supports the following simple addressing modes for memory access instructions:
  • EA = (rA|0) (register indirect)
  • EA = (rA|0) + offset (including offset = 0) (register indirect with immediate index)
  • EA = (rA|0) + rB (register indirect with index)
These simple addressing modes allow efficient address generation for memory accesses.
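
The three modes in the list above can be sketched in a few lines of Python. This assumes 32-bit addresses and a plain list as the register file; the function names are invented for the example. The point worth noticing is the `(rA|0)` notation: register number 0 in the rA slot means the constant 0, not the contents of r0.

```python
MASK32 = 0xFFFFFFFF

def ra_or_zero(regs, ra):
    """(rA|0): register number 0 stands for the literal value 0."""
    return 0 if ra == 0 else regs[ra]

def ea_register_indirect(regs, ra):
    """EA = (rA|0)"""
    return ra_or_zero(regs, ra) & MASK32

def ea_immediate_index(regs, ra, offset):
    """EA = (rA|0) + offset, offset being a signed immediate."""
    return (ra_or_zero(regs, ra) + offset) & MASK32

def ea_indexed(regs, ra, rb):
    """EA = (rA|0) + rB"""
    return (ra_or_zero(regs, ra) + regs[rb]) & MASK32
```

All three need one adder and no extra register write, which is why they fit a single-write-port design so easily.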
 

Offline MK14

  • Super Contributor
  • ***
  • Posts: 4539
  • Country: gb
including the most difficult to be implemented
  • pre increment?
  • post increment?
  • stack? (push/pop)?

about the first two instructions, ARM seems to have them, while m88K hasn't
and currently I have no idea about how to implement them in RISC terms

thanks  :D

Delete ALL the other instructions, and just give me my SBNZ.



https://en.wikipedia.org/wiki/One_instruction_set_computer
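
For the curious, a toy interpreter for SBNZ fits in a dozen lines of Python. It follows the definition on the linked page as I remember it (subtract the contents at a from the contents at b, store at c, branch to d if the result is not zero). The memory layout, the `run_sbnz` name and the "stop when the PC runs off the end" halting rule are assumptions of this sketch.

```python
def run_sbnz(mem, pc=0, max_steps=1000):
    """Each instruction is four cells a, b, c, d: mem[c] = mem[b] - mem[a]; if nonzero, jump to d."""
    steps = 0
    while 0 <= pc and pc + 3 < len(mem) and steps < max_steps:
        a, b, c, d = mem[pc:pc + 4]
        result = mem[b] - mem[a]
        mem[c] = result
        pc = d if result != 0 else pc + 4   # fall through when the result is zero
        steps += 1
    return mem, steps

# One instruction at address 0, then two data cells: ONE at 4, X at 5.
# Program: X = X - ONE; loop back to 0 until X reaches zero.
memory = [4, 5, 5, 0, 1, 3]
final_mem, steps = run_sbnz(memory)
```

Here the single instruction counts X down from 3 to 0 in three steps and then falls off the end of memory.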
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
computer science humor  :-DD
 

