Regarding x86: it manages by a heaping mess of taking things apart and putting them back together. AFAIK, all high-performance processors today use micro-operations internally, with the decoder and the front end of the pipeline translating complex opcodes into sequences of simple operations, while tracking hazards (pipeline conflicts and out-of-order dependencies).
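To make the "taking apart" concrete, here's a toy sketch in C (purely illustrative, not any real decoder) of how a read-modify-write instruction like add [mem], reg gets cracked into simple micro-ops:

    /* Toy illustration of micro-op "cracking": one complex CISC-style
       opcode in, a short sequence of RISC-like micro-ops out. */
    #include <stdio.h>

    typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } Uop;

    /* Hypothetical decode of "add [mem], reg" into three simple steps. */
    static int decode_add_mem_reg(Uop out[]) {
        out[0] = UOP_LOAD;   /* tmp   <- [mem]      */
        out[1] = UOP_ADD;    /* tmp   <- tmp + reg  */
        out[2] = UOP_STORE;  /* [mem] <- tmp        */
        return 3;            /* number of micro-ops emitted */
    }

    int main(void) {
        static const char *names[] = { "LOAD", "ADD", "STORE" };
        Uop uops[8];
        int n = decode_add_mem_reg(uops);
        for (int i = 0; i < n; i++)
            printf("uop %d: %s\n", i, names[uops[i]]);
        return 0;
    }

Each micro-op is then scheduled independently, which is what lets the out-of-order machinery reorder around stalls.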
Deep pipelines, dependency analysis, and good branch prediction all combine to give a total execution capacity over 2 instructions/cycle per core (today's big cores are more like 4-6 wide at issue; 2-3/cycle was more typical of, like, Pentium III days).
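You can actually see the dependency analysis at work from user land with a crude timing loop. A rough sketch (assumes gcc/clang on x86-64; __rdtsc() comes from <x86intrin.h>, and the numbers are only ballpark without core pinning or serializing the TSC reads):

    /* Crude demo: a chain of dependent adds is limited by latency,
       while independent adds can retire several per cycle. */
    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>

    #define N 100000000

    int main(void) {
        uint64_t a = 1, b = 1, c = 1, d = 1;

        uint64_t t0 = __rdtsc();
        for (long i = 0; i < N; i++) {
            a += i;                          /* each add depends on the last */
            __asm__ volatile("" : "+r"(a));  /* barrier: keep the compiler honest */
        }
        uint64_t t1 = __rdtsc();
        for (long i = 0; i < N; i++) {       /* four independent dependency chains */
            a += i; b += i; c += i; d += i;
            __asm__ volatile("" : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
        }
        uint64_t t2 = __rdtsc();

        printf("dependent:   %.2f cycles/add\n", (double)(t1 - t0) / N);
        printf("independent: %.2f cycles/add\n", (double)(t2 - t1) / (4.0 * N));
        return 0;
    }

On a typical out-of-order core the dependent chain sits near 1 cycle/add, while the four independent chains together land well under that, i.e., more than one instruction retiring per cycle.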
The absolute end-to-end latency of a given instruction might be quite large (20-40 cycles?), and in raw cycle counts not much better than simple, shallow-pipeline RISC machines (e.g., AVR, Cortex-M0), but because all those delays are hidden behind pipelining, out-of-order execution, and the caches (never mind everything the operating system layers on top), you have no way to actually tell, or care, how long a given instruction takes. So the system works.
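Back-of-the-envelope on why the latency doesn't matter, assuming a fully pipelined unit where one result completes per cycle in steady state:

    /* Toy steady-state pipeline model (assumed 20-cycle latency, 1/cycle
       throughput; not a simulator, just the arithmetic). */
    #include <stdio.h>

    int main(void) {
        double latency = 20.0;             /* end-to-end latency, in cycles */
        double n = 1e6;                    /* back-to-back instructions     */
        double total = latency + (n - 1);  /* first result after 'latency', then one per cycle */
        printf("amortized cost: %f cycles/instruction\n", total / n);
        return 0;                          /* ~1.000019: the 20 cycles vanish in the noise */
    }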
Think I heard the claim that x86 decode overhead takes a startling percentage of power: something like the Intel Atom or AMD Geode paying an extra 20% just for the privilege of running x86, compared to low-power native-RISC machines (ARM Cortex-A series, PPC, etc.?). Whether that makes an ultimate difference in an end product is a very high-level systems-engineering question; it's really hard to make any firm statement from the low-level hardware alone, even if such a claim happens to be true.
Tim