Author Topic: if U were a RISC assembly programmer which instructions would U want to have ? (Read 21033 times)

legacy · « **Reply #50 on:** March 02, 2016, 08:12:06 pm »

Quote from: nctnico on March 02, 2016, 08:03:18 pm

I'd put the PC, SP and flags in the register map as well.

register-file

no, I can't, because I need (1) to write 2 registers in the "write back" stage, so I need to have status, PC and SP implemented outside the register file, since the register file can write just one register per clock cycle

(1) it's required by features like "private registers during exception" and "private registers during function call": I can support both of them in hw, without pipeline stall, but I need to have PC and SP implemented externally, otherwise I will need +3 clock cycles of stall.

Quote from: nctnico on March 02, 2016, 08:03:18 pm

read more than one register simultaneously to save clock cycles

Currently I can read 2 registers simultaneously, I can't write 2 registers simultaneously, just one per clock cycle

nctnico · « **Reply #51 on:** March 02, 2016, 09:18:06 pm »

But you don't have to write the result, PC and flags all at once. The PC can be updated at the decoding stage, flags during fetch. SP is a general purpose register anyway if you make indexed addressing flexible.

legacy · « **Reply #52 on:** March 02, 2016, 10:34:01 pm »

Quote from: nctnico on March 02, 2016, 09:18:06 pm

The PC can be updated at the decoding stage

no, it's updated in the write back stage due to the debug constraints

legacy · « **Reply #53 on:** March 02, 2016, 11:04:35 pm »

Quote from: nctnico on March 02, 2016, 09:18:06 pm

SP is a general purpose register

it's a special register, the only one to be passed in a function call: as I am implementing a simplified form of "register windows" and each function has its private registers, the caller's SP needs to be passed (copied) into the callee's SP

things like parameters are passed on the stack rather than in registers, it's my choice

also, in this way, SP is the only one that can have pre/post increment/decrement without stall penalties

sarepairman2 · « **Reply #54 on:** March 03, 2016, 05:01:26 am »

goto op amp

ale500 · « **Reply #55 on:** March 03, 2016, 05:41:57 am »

I like the following instruction (from PowerPC): isel rdest, ra, rb, rsel: if rsel is 0 then rdest gets the value of ra, if rsel != 0 then rdest gets rb (or was it with 3 arguments?).
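For anyone who wants to play with the idea, the select as described in this post can be modelled in a few lines of Python. This is only a sketch of the description above, not the Power ISA definition: as far as I recall, the real isel takes a condition-register bit instead of a fourth GPR and picks ra when that bit is set. The names `isel` and `max_unsigned` and the 32-bit masking are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers


def isel(ra, rb, rsel):
    """Branch-free select as described in the post: ra when rsel == 0, else rb."""
    return (ra if rsel == 0 else rb) & MASK32


def max_unsigned(a, b):
    """Example use: max(a, b) with no branch, a compare result acts as selector."""
    a_is_smaller = 1 if a < b else 0  # would come from a compare instruction
    return isel(a, b, a_is_smaller)
```

The point of such an instruction is that short if/else assignments no longer need a branch, so the pipeline has nothing to flush.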

Can I ask what decode does? What do you do there? (I like designing soft-cores too.)

Cerebus · « **Reply #56 on:** March 04, 2016, 12:29:49 am »

Quote from: legacy on March 02, 2016, 08:12:06 pm

Quote from: nctnico on March 02, 2016, 08:03:18 pm
I'd put the PC, SP and flags in the register map as well.

register-file

no, I can't, because I need (1) to write 2 registers in the "write back" stage, so I need to have status, PC and SP implemented outside the register file, since the register file can write just one register per clock cycle

You can. Keep the PC, SP and flags in dedicated registers and just add decoding logic that muxes them into the data path instead of the register file when they're addressed as registers. It obviously means that the physical registers in the register file that sit at the same addresses never get actually used.

You can see something similar in the design below that I cooked up for fun back in 2008 while learning Verilog. Note in this case that the PC is spread out along the pipeline registers but can be read as if it was a register, or written as if it's a register in the writeback phase.
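A rough behavioural model of the muxing trick, in Python rather than Verilog, just to show the idea. The register numbers 13/14/15 and the class name `RegisterBank` are made up for the example; this is not the 2008 design mentioned above.

```python
# Hypothetical register numbers for the special registers (an assumption).
PC_ADDR, SP_ADDR, FLAGS_ADDR = 13, 14, 15


class RegisterBank:
    def __init__(self, size=16):
        self.regfile = [0] * size  # ordinary register file, one write port
        self.special = {PC_ADDR: 0, SP_ADDR: 0, FLAGS_ADDR: 0}  # dedicated registers

    def read(self, addr):
        # Decode logic muxes a dedicated register onto the data path
        # whenever its address is selected.
        if addr in self.special:
            return self.special[addr]
        return self.regfile[addr]

    def write(self, addr, value):
        if addr in self.special:
            self.special[addr] = value & 0xFFFFFFFF
        else:
            self.regfile[addr] = value & 0xFFFFFFFF

    def writeback(self, rd, result, next_pc):
        # One register-file write plus a PC update in the same cycle:
        # the PC lives outside the register file, so no second write port.
        self.write(rd, result)
        self.special[PC_ADDR] = next_pc & 0xFFFFFFFF
```

The slots 13 to 15 inside `regfile` are simply never touched, which is the "never get actually used" part.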

CXm0dk · « **Reply #57 on:** March 04, 2016, 09:33:46 am »

Quote from: legacy on February 29, 2016, 06:17:25 pm

including the most difficult to be implemented
pre increment?
post increment?
stack? (push/pop)?

Hello legacy,

an idea discussed in this forum in the past: the looping instruction, for increasing code density and as a means to reduce fixed loop overhead:

Code: [Select]

nloop iter,#of instruction

0 <= iter < 16: iterate the following instructions 3 to 18 times
0 <= #of instruction < 16: number of instructions to iterate, from 1 through 16 (only 32bit instruction size?)
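A small Python sketch of how the two 4-bit fields could be decoded and executed, assuming the offsets given above (iter + 3 iterations, count + 1 instructions). The function names and the hidden down-counter are illustrative assumptions, not an existing ISA.

```python
def decode_nloop(iter_field, count_field):
    """Return (iterations, body_length) for the two 4-bit nloop fields."""
    if not (0 <= iter_field < 16 and 0 <= count_field < 16):
        raise ValueError("both fields are 4 bits wide")
    return iter_field + 3, count_field + 1


def run_nloop(iter_field, count_field, body):
    """Run body (a list of callables) under a hidden down-counter.

    Returns the number of body instructions executed.
    """
    iterations, length = decode_nloop(iter_field, count_field)
    if len(body) != length:
        raise ValueError("body length does not match the count field")
    shadow = iterations  # hypothetical shadow register, invisible to software
    executed = 0
    while shadow > 0:
        for instr in body:
            instr()
            executed += 1
        shadow -= 1
    return executed
```

With these offsets, loops of 1 or 2 iterations cannot be encoded, which is fine since unrolling them costs no more than the nloop itself.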

then, you can write this in Assembly:

Code: [Select]

mov Ru15,0x0Af1794d      // random address of data
enter_loop: nloop 6      // for (i=9 ; i>0 ; i--) || for (i=0 ; i<9; i++)
{ slwi                   // or rlwinm, see previous post of backsheeplogic
  load Ru01,Ru15         // get value
  mov  Ru06,Ru01         // duplicate
  bitrev 17,Ru00,Ru01    // number of bit 1
  neg Ru01               // bit invert of value
  bitrev 17,Ru02,Ru01    // number of bit 0 of initial value
  rsht Ru01,17           // right shift of complemented upper value
  add Ru01,17            // add it + 17
  xor Ru06,Ru01          // shuffle & shake
  mac Ru03,Ru06,Ru00     // famous DSP function
  mac Ru06,Ru03,Ru02     // more shuffled with famous DSP function
  sto Ru15,Ru06          // store #@%&~*
  add Ru15,4             // next value
}                        // nloop 6,12  end of 9 loops of 13 instructions
exit_loop:               // looped without using a register for i (shadow register?) Loop overhead 1 instruction!

Now you have the primitive for pre/post increment or push/pop n/POPCNT!

Another example:

Code: [Select]

nloop 0,0 imul Ru00,Ru00 // for square of square of square of Ru00!

or

Code: [Select]

imul Ru00,Ru00
imul Ru00,Ru00
imul Ru00,Ru00

Have fun! ;-)

legacy · « **Reply #58 on:** March 04, 2016, 12:12:00 pm »

Quote from: Cerebus on March 04, 2016, 12:29:49 am

You can. Keep the PC, SP and flags in dedicated registers and just add decoding logic that muxes them into the data path instead of the register file when they're addressed as registers. It obviously means that the physical registers in the register file that sit at the same addresses never get actually used.

You can see something similar in the design below that I cooked up for fun back in 2008 while learning Verilog. Note in this case that the PC is spread out along the pipeline registers but can be read as if it was a register, or written as if it's a register in the writeback phase.

good trick, brilliant

legacy · « **Reply #59 on:** March 04, 2016, 12:13:28 pm »

Quote from: CXm0dk on March 04, 2016, 09:33:46 am

an idea discussed in this forum in the past: the looping instruction, for increasing code density and as a means to reduce fixed loop overhead:

Code: [Select]
nloop iter,#of instruction

excellent, thank you very much

legacy · « **Reply #60 on:** March 04, 2016, 12:20:54 pm »

Quote from: CXm0dk on March 04, 2016, 09:33:46 am

Have fun! ;-)

yep, I bought a Papilio Pro board: it has a Spartan-6 FPGA and comes with synchronous DRAM (8 Mbyte) already installed on the board. I added an external asynchronous static RAM (2 Mbyte), and I am going to add an STN LCD (4-bit, 320x240, 2 bits of color) and a second UART console, since the first UART is used by the DebugEngine

Arise-v2.1 and greater will run there with a few of your tips & tricks successfully implemented (I hope)

legacy · « **Reply #61 on:** March 04, 2016, 12:22:36 pm »

Quote from: ale500 on March 03, 2016, 05:41:57 am

what decode does?

it prepares the control signals along the data path and takes care of the magic under the hood

legacy · « **Reply #62 on:** March 04, 2016, 12:33:46 pm »

Quote from: CXm0dk on March 04, 2016, 09:33:46 am

Code: [Select]
{ slwi // or rlwinm, see previous post of backsheeplogic

who is that guy called "slwi" ?

legacy · « **Reply #63 on:** March 04, 2016, 12:51:35 pm »

Quote from: CXm0dk on March 04, 2016, 09:33:46 am

Code: [Select]
exit_loop: // looped without using a register for i (shadow register?) Loop overhead 1 instruction!

yes, basically it's "decrement and loop until zero", in my head it uses a shadow register

Quote from: CXm0dk on March 04, 2016, 09:33:46 am

Now you have the primitive for pre/post increment or push/pop n/POPCNT!

I don't get it

as I was talking about 1 instruction which is able to perform pre/post increment within the load/store class of instructions

something like:
push r1 ---> store r1, (sp)+4 ---> store r1, (sp), sp=sp+4; 1 instruction
pop r1 ---> load r1, -4(sp) ---> sp=sp-4, load r1, (sp); 1 instruction

sizeof(r1) = 4 bytes: it's a 32-bit register; the load/store stage is attached to a 32-bit bus but is able to access memory with byte granularity
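The intended semantics of the two instructions, written out as a small Python model (an upward-growing stack as in the lines above; the class name `Machine`, the method names and the little-endian byte order are assumptions for the sketch):

```python
class Machine:
    def __init__(self, mem_size=64, sp=0):
        self.mem = bytearray(mem_size)  # byte-addressable memory
        self.sp = sp

    def store_postinc(self, value):
        """push r1: store r1, (sp)+4  ->  mem[sp] = r1 ; sp = sp + 4"""
        word = (value & 0xFFFFFFFF).to_bytes(4, "little")  # assumed byte order
        self.mem[self.sp:self.sp + 4] = word
        self.sp += 4

    def load_predec(self):
        """pop r1: load r1, -4(sp)  ->  sp = sp - 4 ; r1 = mem[sp]"""
        self.sp -= 4
        return int.from_bytes(self.mem[self.sp:self.sp + 4], "little")
```

Note that the push writes only SP back, while the pop writes both SP and r1, so the pop is the one that needs two register writes.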

CXm0dk · « **Reply #64 on:** March 04, 2016, 02:05:30 pm »

Quote from: legacy on March 04, 2016, 12:33:46 pm

Quote from: CXm0dk on March 04, 2016, 09:33:46 am
Code: [Select]
{ slwi // or rlwinm, see previous post of backsheeplogic

who is that guy called "slwi" ?

see this previous post

legacy · « **Reply #65 on:** March 04, 2016, 03:04:15 pm »

oh, PowerPC, it's this guy

legacy · « **Reply #66 on:** March 04, 2016, 03:16:11 pm »

I can implement "rlwinm" with +1 penalty (it takes an extra clock cycle in the ALU stage), but …

Code: [Select]

dest-register, source-register, rotate-left, mask-left, mask-right

it has too many parameters

do I have to implement "variable length" opcodes in the fetch unit like in Blackfin?
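For reference, a Python sketch of what rlwinm computes (rotate left, then AND with a mask running from bit MB to bit ME, PowerPC numbering with bit 0 as the MSB). As far as I can tell the PowerPC encoding keeps it inside one fixed 32-bit word, because all five operands are 5-bit fields: 6 bits of opcode + 5 x 5 bits + 1 record bit. The helper names below are mine, not from any toolchain.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    x &= MASK32
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32


def ppc_mask(mb, me):
    """Ones from bit mb through bit me (bit 0 = MSB); wraps around when mb > me."""
    head = MASK32 >> mb                    # bits mb..31 set
    tail = (MASK32 << (31 - me)) & MASK32  # bits 0..me set
    return (head & tail) if mb <= me else (head | tail)


def rlwinm(rs, sh, mb, me):
    """Rotate Left Word Immediate then AND with Mask."""
    return rotl32(rs, sh) & ppc_mask(mb, me)


def slwi(rs, n):
    """Shift left by n, expressed through rlwinm (0 <= n < 32)."""
    return rlwinm(rs, n, 0, 31 - n)


def srwi(rs, n):
    """Logical shift right by n, expressed through rlwinm (0 < n < 32)."""
    return rlwinm(rs, 32 - n, n, 31)
```

So slwi is not a separate operation in hardware terms, only an alias for one particular rotate/mask combination.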

CXm0dk · « **Reply #67 on:** March 04, 2016, 05:04:20 pm »

Quote from: legacy on March 04, 2016, 12:51:35 pm

Quote from: CXm0dk on March 04, 2016, 09:33:46 am
Now you have the primitive for pre/post increment or push/pop n/POPCNT!

I don't get it as I was talking about 1 instruction which is able to perform pre/post increment within the load/store class of instructions

not in one instruction; for example, a pre-decrement addressing mode:

Code: [Select]

xor Ru01,Ru01         // clear Ru01 (int sum = 0)
mov Ru15,0x64EF01AD   // tail pointer of an integer array
nloop 17              //  for (i = 20 ; i > 0 ; i--) sum += *(--pointer);
{
  sub Ru15,4          // decrement address ; pointer -= sizeof(int)
  load Ru00,Ru15      // value = *pointer
  add Ru01,Ru00       // sum += value
}

I'm not saying that the array is passed as the last parameter on the stack (otherwise you would replace 0x64EF01AD with SP) ... ;-)

Quote from: legacy on March 04, 2016, 12:51:35 pm

something like:
push r1 ---> store r1, (sp)+4 ---> store r1, (sp), sp=sp+4; 1 instruction
pop r1 ---> load r1, -4(sp) ---> sp=sp-4, load r1, (sp); 1 instruction

well, neither in one instruction nor implementable with SP as a special register; I was just thinking of something like this for push n:

Code: [Select]

//  push in stack the int array[20]
mov Ru15,0x64EF01AD   // pointer = array
mov Ru31,SP           // next store space in stack ; instack = SP
nloop 17              // duplicate values of int array in stack : size(int array) - 3 = 17
{
  load Ru00,Ru15      // value = *pointer
  add Ru15,4          // next value in array ; pointer++
  add SP,4            // reserve space for value
  sto Ru31,Ru00       // *(SP-4) = value
  add Ru31,4          // next store address in stack ; instack++
}                     // nloop 17,4 ; 20 * 5 instructions executed

oh, I forgot an optimization:

Code: [Select]

//  push in stack the int array[20]
mov Ru15,0x64EF01AD   // pointer = array
mov Ru31,SP           // next store space in stack ; instack = SP
add SP,80             // reserve space in stack for the whole array
nloop 17              // duplicate values of int array in stack : size(int array) - 3 = 17
{
  load Ru00,Ru15      // value = *pointer
  add Ru15,4          // next value in array ; pointer++
  sto Ru31,Ru00       // *instack = value
  add Ru31,4          // next store address in stack ; instack++
}                     // nloop 17,3 ; 20 * 4 instructions executed if instructions are not parallelized between ALU & LOAD/STORE unit

this is much simpler than any complex addressing mode, and nloop may be resolved in an early stage of your pipeline (costing 0 real CPU cycles). Creating complex addressing modes will not cover all the use cases, plus the compiler will rarely use them and they complicate the code generation stage... See the reduction in addressing modes between the 68k family and its derivative versions.

Even better, if you have an instruction prefetch unit, it can prefetch the next instructions at exit_loop at the end of the first loop!
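To check the instruction counts of the two push-n variants above, here is a quick Python model. It only counts executed instructions, ignores the two setup mov instructions and any pipeline effects, and uses a dict as word-addressed memory; the function names are made up.

```python
def copy_bump_sp_each_time(src, mem, sp):
    """First variant: SP is bumped inside the loop (5 instructions per pass)."""
    instack = sp              # mov Ru31,SP (setup, not counted)
    executed = 0
    for value in src:         # one pass per nloop iteration
        sp += 4               # add SP,4
        mem[instack] = value  # load Ru00,Ru15 + sto Ru31,Ru00
        instack += 4          # add Ru31,4
        executed += 5         # load, add Ru15, add SP, sto, add Ru31
    return sp, executed


def copy_reserve_once(src, mem, sp):
    """Optimized variant: reserve the whole block first (4 instructions per pass)."""
    instack = sp              # mov Ru31,SP (setup, not counted)
    sp += 4 * len(src)        # add SP,80 for a 20-int array
    executed = 1
    for value in src:
        mem[instack] = value  # load + sto
        instack += 4          # add Ru31,4
        executed += 4         # load, add Ru15, sto, add Ru31
    return sp, executed
```

For 20 integers this gives 100 executed instructions against 81, with the same final SP and the same stack contents.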

EDIT: correction about the 68000 addressing mode features

helius · « **Reply #68 on:** March 04, 2016, 08:52:31 pm »

Quote

creating complex addressing modes will not cover all the use cases, plus the compiler will rarely use them and they complicate the code generation stage... See the reduction in addressing modes between the 68000 and the later versions.

In fact the later 680x0 processors greatly increased their addressing modes, not reduced them, adding complex modes like double-scaled double-indexed double-indirect. They had VAX envy...

ale500 · « **Reply #69 on:** March 05, 2016, 05:44:47 am »

The 68020 and later had extra indirect addressing modes... those were the first to be given a slower implementation on the 68060, due to being hardly used.

Things like :

Code: [Select]

lea eax,[ebx+esi*4]

for fast add/multiply are sort of useful, but indirect addressing modes?... One should check, for instance, what Metrowerks used to generate for the 68k and see if those modes are used...

But if we look at the evolution of RISC, there is not much more than base+index and sometimes post/pre increment/decrement. But that means another write port on the register file, or another clock cycle if there is only one port.
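To put numbers on the write-port remark, a tiny Python model: count how many registers each instruction form has to write and divide by the number of write ports. The instruction spellings in the table are hypothetical, and the model ignores everything except the write-back stage.

```python
def writeback_cycles(registers_written, write_ports=1):
    """Cycles spent in write-back for one instruction (at least one)."""
    if registers_written == 0:
        return 1  # the stage is still traversed, nothing is written
    return -(-registers_written // write_ports)  # ceiling division


# Registers written by each (hypothetical) instruction form.
WRITES = {
    "load rd, (ra)": 1,    # rd only
    "load rd, (ra)+": 2,   # rd and the updated base ra
    "store rs, (ra)": 0,   # nothing
    "store rs, (ra)+": 1,  # the updated base ra only
}
```

Only the load with update needs two writes; a store with update still fits a single write port.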

Muxr · « **Reply #70 on:** March 05, 2016, 06:00:54 am »

Quote from: HAL-42b on March 01, 2016, 02:32:17 am

Quote from: free_electron on March 01, 2016, 02:08:00 am
I want a cpu core that does not need a compiler. the instruction set should be so simple that you can basically translate a high level language instruction directly into cpu instructions.

oh, and want a machine that calculates in BCD ( even for floating point )

So a FORTH processor like RTX2010?

How about that newfangled number format the UNUM?

Or picoJava, there was even a Verilog source for it: http://java.epicentertech.com/Archive_Embedded/Sun_Microsystems/Micro%20&%20Pico%20Java/PicoJava_Port_to_FPGA.pdf

Although it still required compilation, it ran Java bytecode natively.

CXm0dk · « **Reply #71 on:** March 05, 2016, 12:21:15 pm »

Quote from: helius on March 04, 2016, 08:52:31 pm

In fact the later 680x0 processors greatly increased their addressing modes, not reduced them, adding complex modes like double-scaled double-indexed double-indirect. They had VAX envy...

Hello helius,

you are right, my statement is inaccurate: it's the ColdFire which simplifies the addressing modes, see Differences between ColdFire & 68K

Maybe I should have written "See the reduction in the addressing modes between the 68k families and the earlier derivative versions."

legacy · « **Reply #72 on:** March 06, 2016, 01:35:07 pm »

PowerPC addressing modes

Quote

the effective address (EA), also called the logical address, is the address computed by the processor when executing a memory access or branch instruction or when fetching the next sequential instruction. Unless address translation is disabled, this address is converted by the MMU to the appropriate physical address. (Note that the architecture specification uses only the term effective address and not logical address.)

The PowerPC architecture supports the following simple addressing modes for memory access instructions:
EA = (rA|0) (register indirect)
EA = (rA|0) + offset (including offset = 0) (register indirect with immediate index)
EA = (rA|0) + rB (register indirect with index)
These simple addressing modes allow efficient address generation for memory accesses.
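The three modes collapse into one small function. A Python sketch, assuming 32-bit addresses and a plain list as the GPR file; the function name and argument layout are mine, only the (rA|0) rule comes from the quoted text, where a register field of 0 means the literal value zero rather than the contents of r0.

```python
def effective_address(gpr, ra, offset=0, rb=None):
    """EA = (rA|0) + offset, or (rA|0) + rB for the indexed form."""
    base = 0 if ra == 0 else gpr[ra]  # (rA|0): field 0 reads as zero
    index = gpr[rb] if rb is not None else offset
    return (base + index) & 0xFFFFFFFF  # assumption: 32-bit addresses
```

Register indirect is just the immediate form with offset = 0, so the address adder always does the same thing and only its second input is muxed.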

MK14 · « **Reply #73 on:** March 06, 2016, 01:41:03 pm »

Quote from: legacy on February 29, 2016, 06:17:25 pm

including the most difficult to be implemented
pre increment?
post increment?
stack? (push/pop)?

about the first two instructions, ARM seems to have them, while m88K hasn't
and currently I have no idea about how to implement them in RISC terms

thanks

Delete ALL the other instructions, and just give me my SBNZ.

https://en.wikipedia.org/wiki/One_instruction_set_computer

legacy · « **Reply #74 on:** March 06, 2016, 01:58:57 pm »

computer science humor

