Author Topic: Critique of first design  (Read 384 times)

0 Members and 2 Guests are viewing this topic.

Offline aheid

  • Regular Contributor
  • *
  • Posts: 151
  • Country: no
Critique of first design
« on: February 14, 2020, 12:40:48 am »
So I made my first non-trivial design, a soft-core running my own homebrew instruction set. Thanks to the excellent help I got in my previous thread, I managed to get most of it up and running in just a couple of evenings. Instruction set was something I came up with on my commute to work, so yeah, not a whole lot of thought went into that. It's a RISC-y MISC I guess :) Keep in mind I've got zero digital design experience.

It's not 100% completed though, and before I finish it up and continue on my next project I would be very grateful if any of you wizards could have a quick look to see what horrible things I have done that needs fixing. As for completeness, a proper memory interface is lacking, and I haven't tested the call instruction. But it runs a small sample program (defined at the end of core.v) so there's that.

One thing I wonder is how to split up the main instruction FSM. Is tasks the best way?

Another question is if there's a good way to write a reusable "demux" for a signal... for mux'ing I can use a function with a case statment (like my load_reg). But what if I want to assign one of N regs from some source reg Y? I can write a case and assign one by one, but if I need to do the same job multiple places, what then?
« Last Edit: February 14, 2020, 12:45:09 am by aheid »
 

Offline aheid

  • Regular Contributor
  • *
  • Posts: 151
  • Country: no
Re: Critique of first design
« Reply #1 on: February 14, 2020, 09:28:34 am »
Doubled fMax and reduced LUT usage by 10% by simply gating the alu_flags for conditional branching (CYCLE_3) and postponing the actual jump by  a cycle... though simply delaying without gating the flags only helped sligthly.


Topkek  :palm: Slight typo meant I essentially no-op'ed the whole conditional branch code, so no wonder it flew :D I get the point of Verilog being sloppy, not a single warning.

Clearly I got learning potential when it comes to thinking about dependecies and how they translate :)
« Last Edit: February 14, 2020, 02:05:51 pm by aheid »
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 4285
  • Country: fr
Re: Critique of first design
« Reply #2 on: February 14, 2020, 02:09:28 pm »
Just took a quick look, but it doesn't seem too bad for a "beginner"! Looks like you catch on real quick.
I'll let the Verilog gurus tell you where your style could be improved.
 
The following users thanked this post: aheid

Offline kfnight

  • Regular Contributor
  • *
  • Posts: 67
Re: Critique of first design
« Reply #3 on: February 14, 2020, 06:01:39 pm »
Rather than

Code: [Select]
    localparam R0  = 4'h0;
    localparam R1  = 4'h1;
    localparam R2  = 4'h2;
    localparam R3  = 4'h3;
    localparam R4  = 4'h4;

you can do

Code: [Select]
    localparam
        R0  = 4'h0,
        R1  = 4'h1,
        R2  = 4'h2,
        R3  = 4'h3,
        R4  = 4'h4;
 
The following users thanked this post: BrianHG, aheid

Offline free_electron

  • Super Contributor
  • ***
  • Posts: 7310
  • Country: us
    • SiliconValleyGarage
Re: Critique of first design
« Reply #4 on: February 14, 2020, 06:51:11 pm »

cleanup of alu
Code: [Select]
module ALU(
    input clk,
    input [2:0] op_sel,
    input [7:0] bus_a,
    input [7:0] bus_b,
    output reg[7:0] result,
    output reg[3:0] flags
);

`define OP_ADD 3'b000
`define OP_SUB 3'b001
`define OP_AND 3'b010
`define OP_OR  3'b011
`define OP_XOR 3'b100
`define OP_NOT 3'b101
`define OP_LSH 3'b110
`define OP_RSH 3'b111
   
    `define FLAG_CARRY flags[0]
    `define FLAG_ZERO  flags[1]
    `define FLAG_SIGN  flags[2]
    `define FLAG_TBD   flags[3]
   
    wire [8:0] alu_out;

    always_comb begin
case (op_sel)
`OP_ADD: alu_out = (bus_a + bus_b);
`OP_SUB: alu_out = (bus_a - bus_b);
`OP_AND: alu_out = (bus_a & bus_b);
`OP_OR:  alu_out = (bus_a | bus_b);
`OP_XOR: alu_out = (bus_a ^ bus_b);
`OP_NOT: alu_out = {1'b0, (~bus_a)};
`OP_LSH: alu_out = {bus_a[6:0],1b0);
`OP_RSH: alu_out = {1b0,bus_a[7:1]);
  default: alu_out = 0;                   
endcase

end

always_ff @(posedge clk) begin
result <= alu_out[7:0];
`FLAG_ZERO <= |alu_out[8:0];
`FLAG_SIGN <= alu_out[7];
`FLAG_CARRY <= alu_out[8];
end
   
endmodule
Professional Electron Wrangler.
Any comments, or points of view expressed, are my own and not endorsed , induced or compensated by my employer(s).
 
The following users thanked this post: BrianHG, aheid

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 6930
  • Country: us
Re: Critique of first design
« Reply #5 on: February 14, 2020, 08:35:28 pm »
I am far from achieving incompetence with Verilog and even less skilled with System Verilog but...

You aren't making any accommodation for Carry In.  Maybe multi-word arithmetic is not on the menu.  Manipulating Carry In comes up with two's complement arithmetic.

More worrying to me is the "always_ff at @(posedge clk)' controlling the flags and the result:  On every single clock, even during instruction fetch, the flags are getting clicked.  Any perturbation of either bus or op_sel will change the results. Maybe there is some other logic that prevents this.

Overflow isn't detected.  Maybe it isn't important but if it gets set, there needs to be some deliberate action to clear it (like a write to the status register).  It should not get cleared at the beginning of every ALU operation.  The idea is to be able to carry out some long computation and check at the very end whether overflow occurred along the way.  It isn't necessary to know exactly where it occurred, just that it occurred.  Or you can cause overflow to generate a TRAP operation.


ETA:


Code: [Select]

-- code for ADD instruction from inside main FSM
-- note that the CaryIndCtrl is set to clear the carry indicator before the addition takes place
-- but the overflow indicator is not set
-- then in state S11, both the carry and overflow indicators are enabled which causes them to latch
-- the flag results of the operation

-- A - add [EA] to ACC   
   when s_A  => A_BusCtrl    <= A_BUS_RamData;
                AFR_Ctrl     <= AFR_LOAD;
                CarryIndCtrl <= CARRY_IND_CLEAR;
                NextState    <= s11;
   when s11  => Add          <= '1';
                CI           <= '0';
                A_BusCtrl    <= A_BUS_ADDER_RESULT;
                ACC_Ctrl     <= ACC_LOAD;
                CarryIndCtrl <= CARRY_IND_ENABLE;
                OvflIndCtrl  <= OVFL_IND_ENABLE;
                NextState    <= Fetch;
       
 <gigntic snip>       
       
-- carry indicator
 
  process(Reset, Clk, CarryIndCtrl)
  begin
    if Reset = '1' then
      CarryInd <= '0';
    elsif Clk'event and Clk = '1' then
      case CarryIndCtrl is
        when CARRY_IND_NOP    => null;
        when CARRY_IND_CLEAR  => CarryInd <= '0';
        when CARRY_IND_SET    => CarryInd <= '1';
        when CARRY_IND_ENABLE => CarryInd <= CO;
        when CARRY_IND_INVERT => CarryInd <= not CO;
        when others           => null;
      end case;
    end if;
  end process;

-- overflow indicator
 
 process(Reset, Clk, OvflIndCtrl)
 begin
  if Reset = '1' then
    OvflInd <= '0';
  elsif Clk'event and Clk = '1' then
    case OvflIndCtrl is
      when OVFL_IND_NOP    => null;
      when OVFL_IND_CLEAR  => OvflInd <= '0';
      when OVFL_IND_SET    => OvflInd <= '1';
      when OVFL_IND_ENABLE => if OvflInd = '0' then  -- don't turn off
                                OvflInd <= OFL;
                              end if;
      when others          => null;
    end case;
  end if;
 end process;

[/font]

No doubt there are better ways to do this but this is what occurred to me.  Some instructions affect the flags, some don't.  Carry is cleared at the start of every operation that uses the flag, overflow is not cleared except with a specific instruction and then only conditioned by a bit in the instruction.  It is possible to test overflow without clearing it or to test it and reset it depending on the instruction coding.
« Last Edit: February 14, 2020, 09:15:56 pm by rstofer »
 
The following users thanked this post: aheid

Offline aheid

  • Regular Contributor
  • *
  • Posts: 151
  • Country: no
Re: Critique of first design
« Reply #6 on: February 14, 2020, 09:39:51 pm »
Thanks for the feedback, highly appreciated! Learned some new tricks, thanks for that free_electron.

For carry, the idea was that for "add with carry" you'd do "tfm" followed by "add" to add any carry. So 16 bit addition of r6:r5 to r1:r0 would be
Code: [Select]
// lower 8 bits
add r0, r5
// add carry
tfm r9, 0x1
add r1, r9
// upper 8 bits
add r1, r6

Like I said, didn't put too much effort into the instruction set, the code density sucks. Though I admit I was kinda drawn towards minimalism, see if I could get away with a RISC-like MISC, unlike several of those early MISC's which from what I could see were more WLIW-like.

I must admit I haven't thought much about overflow flag. If you need to you could use the tfm instruction again and then or result into a spare register as you go, but yeah something for me to think about.

I guess I need to learn more about timing constraints and how to relieve them, I see my iCE40 does 16 bit mux/adders/counters at max 100MHz, and the VexRiscv folks claim 92MHz for their RISC-V core, while mine is like 22MHz currently  :palm:
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 6930
  • Country: us
Re: Critique of first design
« Reply #7 on: February 15, 2020, 12:14:07 am »
If you want speed, do what the big kids do and pipeline everything.  There's a reason that ARM and RISC processors have a 5 stage pipeline.  Five is typical, I don't know much about any specific implementation.  That's why RISC machines are typically just load-store from registers with all arithmetic operations occurring only between registers.

After you synthesize the code, look at the schematic (if you can't do that with the Lattice tools, install Vivado and just use if for synthesis and simulation).  The SOURCE code should be perfectly portable.  Somewhere in the reports, the tools will disclose the longest logic chain(s).  If you created a priority tree, it could be a very long and stringy bit of logic.

If <cond1>
.. xxxx
elsif <cond2>
.. xxxx
elsif <cond3>
.. xxxx
else
..xxxx

That kind of thing results in a long string of logic.  It is a priority tree in that cond1, if true, masks all lower level tests.  If false, testing flows to cond2 which masks all lower level tests.  And so on.

Sometimes the synthesizer can figure out how to make a MUX out of this, other times it just generates stringy logic.  It might also do a lookup function using the LUT feature.  With careful coding, you can force the LUT approach and this will run as fast as the chip is possible.

I didn't look at how you implemented the registers but one approach is to have two identical banks in parallel.  When a register is written, both banks are updated at the same clock.  You can read from 2 different registers (or even twice from the same register), add the values and put them back in the same register all in one clock.  And this one-clock datapath is the heart of the machine!

Here is my register code.  There are two instances wire together at a high level.  Note that I have a 'debug' port which allows me to view a register by selecting it with some toggle switches.  Kind of handy!


Code: [Select]
-- the two instances at the top level
-- note that both register banks are written at the same time with LD_REG as the enable signal and a clock.

    -- SR1 Register Bank
    inst_Regs1 : entity work.Registers
        Port Map (
            reset       => reset,
            clk         => Pulse,
            load        => LD_REG,
            ReadAddr    => SR1muxOutput,
            WriteAddr   => DRmuxOutput,
            Din         => mBus,
            Dout        => SR1Output,
            DebugIn     => Sw(2 downto 0),
            DebugOut    => open
            );
           
    -- SR2 Register Bank       
    inst_Regs2 : entity work.Registers
        Port Map (
            reset       => reset,
            clk         => Pulse,
            load        => LD_REG,
            ReadAddr    => SR2,
            WriteAddr   => DRmuxOutput,
            Din         => mBus,
            Dout        => SR2Output,
            DebugIn     => Sw(2 downto 0),
            DebugOut    => DebugOut
            );



-- the register entity

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity Registers is
    Port ( reset    : in  std_logic;
           clk      : in  std_logic;
           load     : in  std_logic;
           ReadAddr : in  std_logic_vector(2 downto 0);
           WriteAddr: in  std_logic_vector(2 downto 0);
           Din      : in  std_logic_vector(15 downto 0);
           Dout     : out std_logic_vector(15 downto 0);
           DebugIn  : in  std_logic_vector( 2 downto 0);
           DebugOut : out std_logic_vector(15 downto 0)
         );
end Registers;

architecture Behavioral of Registers is

    type Reg_t is array(0 to 7) of std_logic_vector(15 downto 0);
    signal Regs : Reg_t;
   
begin

    Dout        <= Regs(to_integer(unsigned(ReadAddr)));
    DebugOut    <= Regs(to_integer(unsigned(DebugIn)));
   
    process(reset,clk,WriteAddr,Load,Din)
    begin
        if rising_edge(clk) then
            if reset = '1' then
                Regs <= ((others => (others => '0')));
            else
                if Load = '1' then
                    Regs(to_integer(unsigned(WriteAddr))) <= Din;
                end if;
             end if;
        end if;
    end process;
   
end Behavioral;

[/font]

« Last Edit: February 15, 2020, 12:28:43 am by rstofer »
 
The following users thanked this post: aheid

Offline free_electron

  • Super Contributor
  • ***
  • Posts: 7310
  • Country: us
    • SiliconValleyGarage
Re: Critique of first design
« Reply #8 on: February 15, 2020, 02:10:03 am »
You aren't making any accommodation for Carry In.
neither did the orignal poster. ...

[cquote]
More worrying to me is the "always_ff at @(posedge clk)' controlling the flags and the result:  On every single clock, even during instruction fetch, the flags are getting clicked.  Any perturbation of either bus or op_sel will change the results. Maybe there is some other logic that prevents this.
[/quote]
That is up to higher up modules. my goal was single cycle alu. if the machine is doing a move operation then the alu is not being used so the output doesn't really matter.
My code is partitioned in such a way that the combinatorial block of the ALU is always active. By the time the posedge comes the output of the combinatorial block is stable so it locks the result .
you could add an additional 'enable_alu' signal that prevents other cycles from modifying the output.

Quote
Overflow isn't detected.

mine does. The alu has 9 bit output. the msb is overflow

This processor does not have instruction to do continuous adding. The only add instruction is from x = a + b. you would need additional x=x+a instruction for that.


`define OP_ADD_A 3'b1000
`define OP_ADD_B 3'b1000

`OP_ADD_A alu_out = (result + bus_a);
... and so on.

But architecturally speaking this is garbage.
i would prefer to see additional registers that control source and target. Then you can do anything to anything. The memory map would be a block of ram for example 64K. the processor registers themselves would live above that

reg_source1 @0x10000
reg_source2 @0x10001
reg_target

instructions
LD_SRC_A   ' load data from source a in accumulator
LD_SRC_B ' load data from source b in accumulator
ADD_A  ' add data for source A to accumluator
ADD_A_INC ' add data for source a to accumulator and increment source_a address.
ADD_B ' add data from source b to accumulator
STOR ' write accumulator contents to target
STORINC ( same as STOR but increment target address ) this allows for stream processing.

since the data pointers (A , B and TARGET) reside inside the memory map you can even manipulate thos eprogrammatically.

i made a core once that could unroll loops. there were speical 'stream instructions.
you set the source and target begin addresses and load a count value. the adding went on (with post increment of address pointers )  as long as count did not reach zero.
crunch a table with 100 numbers ? like adding a constant to 100 memory locations .. 100 clockticks. hardware did it.
vector processing.

Professional Electron Wrangler.
Any comments, or points of view expressed, are my own and not endorsed , induced or compensated by my employer(s).
 

Offline aheid

  • Regular Contributor
  • *
  • Posts: 151
  • Country: no
Re: Critique of first design
« Reply #9 on: February 15, 2020, 12:46:52 pm »
But architecturally speaking this is garbage.

I'm sure we can all agree I shouldn't quit my day-job to become an ISA designer anytime soon  :-DD

I got a few spare bits in the instruction word, so was considering specifying the dest register using those, though would be restricted to first 4 or 8 registers. But I decided to let that be for now. This isn't about designing a good ISA but learning how to implement one :)

I get that pipelining would increase the throughput (instructions per cycle) but but I think it would be good for me to learn how they got a (reported) fmax of 4 times what I got with not that much more resource usage. Clearly data ain't flowing well through my core. Time to study those tool docs.

Anyway thanks for the additional info from both of you, and I'll try that cloned-reg bank trick, see how that works out.

edit: hah, started to figure out the the timing analysis tool in Radiant, so found out that my main culprit was the way I handled the reset signal. Instead of checking for reset every cycle, I moved the reset check to CYCLE_0. This made my fmax go from ~22MHz to ~40MHz. This was before the double-register trick or anything else. Good times ahead.
« Last Edit: February 15, 2020, 01:13:09 pm by aheid »
 

Offline free_electron

  • Super Contributor
  • ***
  • Posts: 7310
  • Country: us
    • SiliconValleyGarage
Re: Critique of first design
« Reply #10 on: February 15, 2020, 04:34:03 pm »
It is not really that hard to make a small custom core. There is a couple of things you need to think about BEFORE you start out.
Register or accumulator based architecture ? Or a hybrid ? The old school processors 6502 , 8051, 6805 are accumulator based. Arm Mis are register based
Harvard or Von Neumann ? Harvard has clean split between code and data an io. Von nueman is one pool. each had pro and con.

i would go with a modified Von Neumann. It needs less move instructions ( ROM to RAM is required in harvard, RAM to and from IO in Harvard) but still offers code protection. bugs in code can not overwrite code.
Accumulator based requires all operations to pass through the accumulator with premove and post-move of data. Regster based can directly spit the result of an operation into a target register. so you can have more than one 'accumulator'.
Here again i would use a hybrid. in the sense that i map the code and data pointers as registers.

This give the most flexibility.

instruction set : hardcoded machine or a microcode based machine ? A microcoded machine is the most flexible.

Memory layout :
------------------
for compilers sake :
- Code begins at 0
- data Ram begins somewhere 'up'
- working registers sit at top and grow downwards.
- I/O channels are controlle by a separate instruction so that code and ram writes can not accidentally toggle I/O . This gives some protection against runaway programs starting to flip pins ... ( this is a harvard trick but we use it in a von neumann machine)

This layout is 'expandable' memory map can grow by adding bits to address bus , word size can grow by adding bits to databus.

Registers.
-----------
well we need some control registers. To begin with , we need to know what instruction we are on and where the next one comes from. so
CPTR : code pointer
and we have the same for data ...
DPTR : data pointer
since IO lives in a separate block we need an IO pointer too
IOPTR : I/O pointer
Having two datapointers complicates the RAM adressing and will require multiple clocks to do moves. we're trying to make a single clock machine. Not a single clock cycle where multiple clock pulses are required to do an operation. we want everything done in one clock pulse)
And we will need an accumulator and some flags
ACC : accumulator
FLAGS : overflow,underflow,zero,true,sign. (overflow and underflow are carry and borrow)

ISA
----
Now on to the instruction set. What do we need first ? An aLU ? NO ! the ALU is the last of your worries. You need memory access functions. without that your ALU doesn't do diddly squat !
But the very first instruction we need is one that doesn't do anything. NOP

NOP = 0xFF. why ? because uninitialized memory is all logic 1 ( a flash or eeprom cell is logic 1 when erased )

Now, if we think a bit further : why stop there and risk having weird behavior ? anything not defined as instruction should do nothing. so in our decidsion 'case statement' the 'default' should be NOP
The only thing we need to take care of is to make sure we do not use 0xFF as an instruction so with uninitialized rom the processor will run until it finds a valid instruction. so without a program at least the CPTR should be incrementing. Later on , for added security we can cast 0xFF as 'HALT'. that way , if there is no code the processor doesnt do anything. but for debugging we leave it as NOP for now.

NOP = 0xFF and anything not explicitly defined

case INSTR
   0xFF : NOP
   default : NOP

The first operation is a a simple READ operation: take the contents of whatever is pointed at by DPTR, and throw in the accumulator
RD : 0xFE
and its counterpart WRITE
WR : 0XFD
whatever is in the accumulator goes into the location pointed at by DPTR
and since our IO is partitioned off we need an IORD and IOWR too
IORD 0xFC
IOWR 0xFB
these load or store the accumulator contents in whatever is pointed to by IOPTR.
from now on i am not going to assign numbers ot the instructions. it is obvious this has to happen ...

So what do we need next?  So far we can shuffle stuff to and from the accumulator from locations pointed at by the dptr and ioptr. Note that the DPTR can point to locations in ROM ! they share the same memory map. so there is no need for a special 'load from rom' instruction.

We do need a way to preset the pointers . this will involve two clockticks as we need to read 2 bytes from code space.
SETDPTR  : load the next 2 bytes from code 'rom' into the DPTR
SETIOPTR : load the next 2 bytes from 'rom' into the IOPTR

It would also be nice to be able to 'stream' data so we should have a RD and WR operation that can post incrmeent the DPTR and IOPTR
RDINC : reads from DPTR and increments DPTR post operation
WRINC: write to dptr and post increments

Now we need the ability to 'jump' to different locations (goto). possibly based on the outcome of ALU operations (if then else)

JMP : load next 2 bytes from rom into CPTR

for jump based on decision we need a test
JMPTRUE ; jump is accumulator is true ( nonzero)
JMPFALSE : jump if accumulator is false ( 0)
these just look at the zero flag of the accumulator. based upon that they either execute a JMP or let 2 clockticks pass wihtout doing anything.

these jump operations need a temporary holding place as we can not simpy load the lower or upper byte. this would move the CPTR and create issues..
so we need a TEMPHI register.

whenever a jump style instruction is met we do this :
take the next byte and throw in SCRATCH. ( it is good to have a SCRATCH register anyway. this is a destructive register that can be used as scratchpad
load CPTR with scratch + next byte.

next thing : bit set , reset and toggle instructions and their accompanying tests
since we have many empty instructions i will encode these in banks

SETBIT7..0
instructions 0XDF to 0xD8
0xDF : ACC[7] <=1b1;
0xDE : ACC[6] <=1'b1;
...
CLRBIT7..0
instructions 0xD7 to 0xD0
0xDF : ACC[7] <=1b0;
0xDE : ACC[6] <=1'b0;

TESTBIT7:0
0xCF .. 0xC0:
0XCF : ACC[7:0] <= {7'b0,ACC[7]};
0xCE : ACC[7:0] <= {7'b0,ACC[6]};
....

then we can jump afterwards based on ACC being 0 or nonzero. you could also do this differently by simply loading the same bit 8 times. ACC[7:0] <= (ACC[7],acc[7],acc[7] .... but that requires more logic.

and TOGGLE is always nice to have
0xC7 : ACC[7] <= ~acc[7]
0xc6 : acc[6] <= ~acc[6]

blinking LEd :
SETIOPTR #led
loop:
   TOGGLE0
   IOWR
   JMP loop

and of course you still need a basic arithmetic block + - AND OR XOR NOT XNOR NAND NOR INCR DECR SHL SHR and so on....


Professional Electron Wrangler.
Any comments, or points of view expressed, are my own and not endorsed , induced or compensated by my employer(s).
 
The following users thanked this post: aheid


Share me

Digg  Facebook  SlashDot  Delicious  Technorati  Twitter  Google  Yahoo
Smf