Author Topic: which Effective Addressing Modes are essential for a HL language?  (Read 10833 times)


Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4050
  • Country: nz
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #75 on: November 16, 2018, 12:58:24 am »
At 28nm, the power consumption of the FU540 (quad core 1.5 GHz 64-bit processor) is something like 90% leakage current, and the difference in power consumption between idle (or sleeping) and running flat out is very small. Stopping the clock doesn't save power now. Only actually turning off the power supply saves power.

Doesn't sound very compelling. Xilinx 7-series FPGAs are 28nm. The quiescent current is rather low (e.g. 200 mA with an Artix-7 XC7A100T), but this goes up very quickly when you start massive switching at 600 MHz (to 10 A or more with the same Artix-7 XC7A100T).

That's because Xilinx uses the lower frequency and lower leakage 28 HPL process. Not all 28nm is the same, even from the same foundry (TSMC).

Low frequency silicon, huh? Very smooth ...  :bullshit:

How about Intel processors? A guy measured power consumption of 22/32 nm technology here:

http://blog.stuffedcow.net/2012/10/intel32nm-22nm-core-i5-comparison/

Look at pictures 2a/2b and extend the orange "Total power" line until it crosses 0 frequency. Looks like around 10 W (perhaps 15 W). And this includes a whole lot more than just leakage. As the frequency increases, the power goes over 100 W. Doesn't look like leakage dominates. Do they use "low frequency silicon" too?

No, Intel spends a lot of money on designing and implementing aggressive power-gating (not just clock-gating) of idle circuits, as mentioned in the last sentence of my original message quoted above.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3148
  • Country: ca
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #76 on: November 16, 2018, 03:42:26 am »
No, Intel spends a lot of money on designing and implementing aggressive power-gating (not just clock-gating) of idle circuits, as mentioned in the last sentence of my original message quoted above.

Maybe you can also explain what power gating has to do with this?

A linear dependency between frequency and power consumption is direct proof that the power consumption depends on switching, power gating or not. Only a small portion (around 10%) of the power consumption is static and can be attributed to leakage.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4050
  • Country: nz
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #77 on: November 16, 2018, 04:45:14 am »
No, Intel spends a lot of money on designing and implementing aggressive power-gating (not just clock-gating) of idle circuits, as mentioned in the last sentence of my original message quoted above.

Maybe you can also explain what power gating has to do with this?

A linear dependency between frequency and power consumption is direct proof that the power consumption depends on switching, power gating or not. Only a small portion (around 10%) of the power consumption is static and can be attributed to leakage.

Don't just take my word for it. This is something known by absolutely anyone who is making chips at 65nm, 40nm, 28nm and below.

1991 article saying leakage dominates at 65nm and below: https://www.eetimes.com/document.asp?doc_id=1264175

Broadcom presentation saying that leakage current dominates and turning off everything that does not need to be on is critical http://www.islped.org/2014/files/islped2014_leakage_mitigation_JohnRedmond_Broadcom.pdf

IEEE article explaining why leakage starts to dominate more and more once you go below 100 nm: http://www.ruf.rice.edu/~mobile/elec518/readings/DevicesAndCircuits/kim03leakage.pdf
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3148
  • Country: ca
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #78 on: November 16, 2018, 03:19:26 pm »
Don't just take my word for it. This is something known by absolutely anyone who is making chips at 65nm, 40nm, 28nm and below.

1991 article saying leakage dominates at 65nm and below: https://www.eetimes.com/document.asp?doc_id=1264175

Broadcom presentation saying that leakage current dominates and turning off everything that does not need to be on is critical http://www.islped.org/2014/files/islped2014_leakage_mitigation_JohnRedmond_Broadcom.pdf

IEEE article explaining why leakage starts to dominate more and more once you go below 100 nm: http://www.ruf.rice.edu/~mobile/elec518/readings/DevicesAndCircuits/kim03leakage.pdf

Look at the articles you're quoting. Generally, they derive a theoretical form for the dependency between leakage and transistor size. Then they need to fit the coefficients of their equations. To do so, they use empirical data obtained from measurements on transistors built with the technologies available back then. Then they extrapolate these data to smaller sizes and predict the leakage at those sizes (admittedly, for future technologies).

But guess what? Technologies change. The leakage gets smaller. The old empirical coefficients no longer apply. The pessimistic predictions don't hold. They now use 7nm technology. Even looking back a few years: the Xilinx 7-series FPGAs, built on 28nm technology, have leakage well below 5%. Looking to the future, technologies can only improve.

Regardless, the simple logic needed for what you call "fancy" addressing modes doesn't consume much silicon, but it can produce great results. Look at the dsPIC33, for example. Its "fancy" CISC addressing schemes are used to execute complex instructions within a single instruction cycle. For example, you can do a single-cycle MAD which includes two memory fetches (from dual-port RAM), auto-increments, zero-overhead looping, and wrapping of addresses at specified bounds. It can even automatically shuffle bits for FFT.

How much would it take for RISC-V to implement such a sophisticated MAD instruction? I'd guess around 10 instructions (including one cycle for the branch delay slot, which you have eliminated). So, if a dsPIC33 runs at 100 MHz and you want to match it with RISC, you will have to run at 1 GHz. This means a faster technology (more leakage!), caches, cache controllers, etc., which is a really massive amount of silicon, much more space and energy than the dsPIC33. And no amount of magical C optimization can help.
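The "automatically shuffle bits for FFT" feature mentioned above is bit-reversed addressing: the address generator reverses the low bits of the index so an in-place FFT can read its inputs in the reordered sequence. A rough Python model of the idea (function and variable names are mine, purely illustrative, not dsPIC33 syntax):

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of `index`, as DSP bit-reversed
    addressing does when generating FFT buffer addresses."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

# Reorder an 8-point buffer the way bit-reversed addressing would.
data = list(range(8))
reordered = [data[bit_reverse(i, 3)] for i in range(8)]
# reordered == [0, 4, 2, 6, 1, 5, 3, 7]
```

A DSP does this reordering for free inside the address-generation unit; a plain RISC core has to spend instructions on it, which is part of the cycle-count gap being argued here.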

 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4050
  • Country: nz
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #79 on: November 17, 2018, 04:30:56 am »
Well, we're just going to have to agree to disagree.

We're designing and building SoCs here, according to our instruction set and hardware design philosophy. The proof is in how they work, and that speaks for itself. So far, every chip taped out has come back working first try, and at the predicted performance and power consumption. The only opinions that matter are those of the paying customers.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3148
  • Country: ca
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #80 on: November 17, 2018, 02:29:31 pm »
Well, we're just going to have to agree to disagree.

Good idea. Quite right.

We're designing and building SoCs here, according to our instruction set and hardware design philosophy. The proof is in how they work, and that speaks for itself. So far, every chip taped out has come back working first try, and at the predicted performance and power consumption. The only opinions that matter are those of the paying customers.

Well, this is a discussion forum, not a marketing platform. And this thread is not about your SoCs. I don't see how your paying customers are of any concern here.

 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #81 on: November 17, 2018, 03:18:38 pm »
In the end, SLAC got implemented, tested, and added to the ISA  :D

Not as an "EA" addressing mode, though, but rather as a regular instruction that has to precede a load/store.
The reason? Elisabeth doesn't like adding a new timing constraint to the list.

It seems a good compromise.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4050
  • Country: nz
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #82 on: November 17, 2018, 07:11:31 pm »
In the end, SLAC got implemented, tested, and added to the ISA  :D

Not as an "EA" addressing mode, though, but rather as a regular instruction that has to precede a load/store.
The reason? Elisabeth doesn't like adding a new timing constraint to the list.

It seems a good compromise.

So you've got "Rd = Ra + Rb << #n" ?

Perfectly good RISC instruction. I approve. And your actual load or store can add a fixed offset to that.
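Put together, the SLAC instruction plus a load/store with a fixed displacement yields EA = base + (index << n) + offset. A tiny Python model of the address arithmetic (function name and addresses are mine, for illustration only):

```python
def slac_ea(base, index, shift, offset=0):
    """Model of Rd = Ra + (Rb << n) followed by a load/store that
    adds a fixed offset: EA = base + (index << shift) + offset."""
    return base + (index << shift) + offset

# A 32-bit (4-byte) array based at 0x1000: element 5 lives at
# 0x1000 + (5 << 2) = 0x1014.
addr = slac_ea(0x1000, 5, 2)
```

This is exactly the scaled-index case that array subscripting in a HL language needs most often.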
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #83 on: November 17, 2018, 07:35:52 pm »
So you've got "Rd = Ra + Rb << #n" ?

Yup. Precisely! :D

 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #84 on: November 17, 2018, 07:53:04 pm »
Next step: teaching our home-made HL-compiler *HOW* to use SLAC.

Currently, we are testing Arise-v2 on the HDL simulator (and sometimes on the real hardware) by writing short assembly programs. Basically, they are loops operating on predefined values so we can check, step by step, that things go as expected. The HL-compiler is already able to identify and parse mathematical expressions(1) and to transform them into RPN. However, it still needs to recognize matrix elements so it can use SLAC for the EA.

(1) valid expressions can contain numbers, variables(2), algebraic operators { +,-,*,/,% }, and functions (returning (2)). Matrix elements are currently not recognized.

RPN is the reason why we considered hardware stack support: push and pop map onto auto-increment/decrement load/store, since the HL compiler tends not to use registers for RPN evaluation.

This point is still under evaluation. Not implemented, but not rejected. Suspended.

(2) currently, valid types are { uint32_t, sint32_t, fixedpoint_t }. The language is strongly typed, so unchecked converters are provided, and required in expressions, since there is absolutely no casting.
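For the infix-to-RPN step, the classic approach is Dijkstra's shunting-yard algorithm. Here is a minimal Python sketch for the operator set listed above; the precedence table and function name are my own, not taken from the Arise-v2 compiler:

```python
# Precedence values are mine; higher binds tighter.
PREC = {'+': 1, '-': 1, '*': 2, '/': 2, '%': 2}

def to_rpn(tokens):
    """Convert a list of infix tokens to RPN (shunting-yard)."""
    out, ops = [], []
    for tok in tokens:
        if tok in PREC:
            # Pop operators of equal/higher precedence (left-assoc).
            while ops and ops[-1] != '(' and PREC[ops[-1]] >= PREC[tok]:
                out.append(ops.pop())
            ops.append(tok)
        elif tok == '(':
            ops.append(tok)
        elif tok == ')':
            while ops[-1] != '(':
                out.append(ops.pop())
            ops.pop()  # discard the '('
        else:  # number or variable
            out.append(tok)
    while ops:
        out.append(ops.pop())
    return out

# a*b + c*d  ->  a b * c d * +
rpn = to_rpn(['a', '*', 'b', '+', 'c', '*', 'd'])
```

Recognizing a matrix element would then just mean emitting a subscript pseudo-operator into the output stream, which the code generator can later lower to SLAC.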
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3148
  • Country: ca
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #85 on: November 18, 2018, 05:02:45 pm »
Currently, we are testing Arise-v2 on the HDL simulator (and sometimes on the real hardware) by writing short assembly programs. Basically, they are loops operating on predefined values so we can check, step by step, that things go as expected. The HL-compiler is already able to identify and parse mathematical expressions(1) and to transform them into RPN. However, it still needs to recognize matrix elements so it can use SLAC for the EA.

I've always used binary trees, which are a different representation of the same information as RPN (or you can look at them as a representation of RPN). The leaves of the tree are either variables (blocks of information about where the value is stored at run time) or constants. Starting from the leaves, I can reduce the tree by replacing operation nodes with leaves, simultaneously emitting instructions. For example, an expression:

Code: [Select]
a*b + c*d
is represented as

Code: [Select]
+(*(a,b),*(c,d))
I find a node containing only leaves, and I remove it:

Code: [Select]
temp1 := *(a,b) // this will be replaced by a single instruction
+(temp1,*(c,d))

And another one:

Code: [Select]
temp1 := *(a,b) // this will be replaced by a single instruction
temp2 := *(c,d) // this will be replaced by a single instruction
temp3 := +(temp1,temp2) // this will be replaced by a single instruction

Now I can assign storage to the temporary variables (temp1, temp2). Usually they're stored in registers, or on the stack if there aren't enough registers. However, you need a really, really complex expression to run out of registers.
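The reduction described above fits in a few lines of Python; the node representation and names below are mine, purely as a sketch of the technique:

```python
# Operator nodes are ('op', left, right) tuples; leaves are plain
# strings naming variables or constants.
def compile_expr(tree):
    code = []
    counter = [0]

    def reduce(node):
        # A leaf already names a value; nothing to emit.
        if isinstance(node, str):
            return node
        op, left, right = node
        l, r = reduce(left), reduce(right)
        # Both children are now leaves: collapse this node to a temp,
        # emitting one instruction-like line for it.
        counter[0] += 1
        temp = f"temp{counter[0]}"
        code.append(f"{temp} := {op}({l},{r})")
        return temp

    reduce(tree)
    return code

# a*b + c*d, as in the worked example above
code = compile_expr(('+', ('*', 'a', 'b'), ('*', 'c', 'd')))
```

Each emitted line corresponds to one machine instruction once the temporaries are mapped to registers.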

Besides the operations on values, there are operations which operate on addresses. Namely, a subscript operation:

Code: [Select]
subscript(array,index) // expanded from array[index]
and dot operation:

Code: [Select]
dot(structure,member) // expanded from structure.member
These cannot be dealt with as easily as arithmetic. You cannot simply remove the nodes, especially if such operations are nested. For example:

Code: [Select]
dot(subscript(array,index),member) // expanded from array[index].member
So, they get aggregated. After aggregation, the resulting expression always has one base variable (such as "array" in the above example) and linear indexes, which are either fixed offsets (from structure members or arrays with fixed indices) or indexes scaled by fixed coefficients. It is always possible to represent the address calculation in the form:

Code: [Select]
EA = base + k0 + index1*k1 + index2*k2 + ...
where index1 and index2 are derived from the sub-expressions and are usually stored in registers (just like temp1 and temp2 above); k0, k1, k2 are numbers known at compile time; and "base" is the register the original variable is based on (such as the stack pointer for local variables).

At this point, you need some sort of heuristic to process the equation and map it to the most efficient combination of your SLAC instruction, other instructions, or hardware addressing. In most cases this is fairly simple. Since, at this point in the compilation, you haven't yet assigned storage for the temporaries (index1, index2, etc.), you can mandate that things which must be in registers for your addressing are indeed placed into registers, not on the stack.

Then you emit the corresponding instructions, including the final instruction which fetches (or stores, for L-values) the data. The fetched value sits in a register and looks just like a regular variable to the upstream parts of the expression.
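The aggregated EA formula above is easy to model directly. A small Python sketch, where the struct size and member offset are made-up example constants, not from any real ABI:

```python
# EA = base + k0 + index1*k1 + index2*k2 + ...
def effective_address(base, k0, terms):
    """`terms` is a list of (index_value, coefficient) pairs, where
    each coefficient k is known at compile time."""
    return base + k0 + sum(idx * k for idx, k in terms)

STRUCT_SIZE = 12     # hypothetical sizeof(element)
MEMBER_OFFSET = 4    # hypothetical offsetof(member)
sp = 0x2000          # base register, e.g. the stack pointer

# &array[3].member == sp + MEMBER_OFFSET + 3*STRUCT_SIZE
addr = effective_address(sp, MEMBER_OFFSET, [(3, STRUCT_SIZE)])
```

The heuristic's job is then to map each (index, k) term onto whatever scaled-index or shift-and-add hardware the target offers, and fold k0 into the load/store displacement.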



 

Offline theoldwizard1

  • Regular Contributor
  • *
  • Posts: 172
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #86 on: November 19, 2018, 10:24:02 pm »
I have a different background than many of you.  I "cut my teeth" writing embedded code in assembler on several different "oddball" microprocessors.   I then moved on to programming, including writing assembler, on a VAX, which I think most people would agree was the ULTIMATE complex instruction and addressing mode processor ever built.  The last big project I worked on was designing an application binary interface (ABI) to be used by GNU C for a 32-bit RISC architecture, so I have sort of been around the block on this.

Surprisingly enough, it highly depends on the number of registers available.
EXTREMELY TRUE!  Looking back, it amazes me that VMS only allowed compilers/programmers to use 12 of 16 registers.  The 4 "dedicated" registers are the program counter, the stack pointer, the frame pointer, and the argument pointer.

Now it seems like everything has a minimum of 32 registers.  Reserving Reg0 to always be ZERO is very useful in embedded programming.  (IIRC we designed the ABI to use a special data section called .zdata that was within the range of the address offset added to R0.)

Address Register Direct with Displacement Mode EA = mem(reg + sign_ext(const))
const = imm16bit
Probably the most important addressing mode.  The big debate is just how many bits should be allowed for that constant (offset).

An immediate addressing mode that would allow a 32-bit constant to be loaded into a register is also valuable, but few architectures support it (it would likely require a variable-length instruction).

Atomic adjustment of the stack pointer is critical.
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 4050
  • Country: nz
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #87 on: November 20, 2018, 01:45:49 am »
I have a different background than many of you.  I "cut my teeth" writing embedded code in assembler on several different "oddball" microprocessors.   I then moved on to programming, including writing assembler, on a VAX, which I think most people would agree was the ULTIMATE complex instruction and addressing mode processor ever built.  The last big project I worked on was designing an application binary interface (ABI) to be used by GNU C for a 32-bit RISC architecture, so I have sort of been around the block on this.

Pretty similar here. I taught myself machine code (not even asm, as I didn't have an assembler) on 6502 and Z80 systems that otherwise had only a BASIC interpreter. Then PDP-11, then VAX, Z8000 and m68000, PowerPC/MIPS/Alpha/SPARC/RISC-V, ARM, AVR8.

VAX was the ultimate demonstration that you could provide complex addressing modes and complex instructions for things like function call, but sticking to the simple addressing modes and instructions made your code run faster.

Dave Patterson discovered this while on sabbatical at DEC, and this experience is what led him, on his return to Berkeley, to design RISC I.

Quote
Surprisingly enough, it highly depends on the number of registers available.
EXTREMELY TRUE!  Looking back, it amazes me that VMS only allowed compilers/programmers to use 12 of 16 registers.  The 4 "dedicated" registers are the program counter, the stack pointer, the frame pointer, and the argument pointer.

Now it seems like everything has a minimum of 32 registers.

Except 32 bit ARM/Thumb, which also has 16 and reserves about the same number of them.

Quote
  Reserving Reg0 to always be ZERO is very useful in embedded programming.  (IIRC we designed the ABI to use a special data section called .zdata that was within the range of the address offset added to R0.)

On MIPS and RISC-V there is unlikely to be RAM at address zero, but there is a dedicated GP (Global Pointer) register that points at the global variables (.data section). There is a linker section .sdata that is put at the start, so you are more likely to be able to reference things there as a simple offset from GP: +/- 2 KB for RISC-V (GP actually points 2 KB past the start of the globals section) and a 64 KB range for MIPS. Outside that range you need a three-instruction sequence "lui Rtmp,#nnnnn; add Rtmp,Rtmp,GP; ld/st nnn(Rtmp)".
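That reachability decision can be sketched as a simple cost function. The +/- 2 KB window follows the RISC-V numbers in the post; the function name and example addresses are mine:

```python
# A global within [-2048, 2047] of GP takes one load/store
# instruction; anything farther needs the three-instruction
# lui/add/ld sequence described above.
def gp_access_cost(gp, addr):
    offset = addr - gp
    return 1 if -2048 <= offset <= 2047 else 3

GP = 0x10000800  # points 2 KB past the start of .sdata, per the post

near = gp_access_cost(GP, 0x10000000)  # offset -2048: reachable
far = gp_access_cost(GP, 0x10010000)   # far global: lui sequence
```

This is why linkers sort small globals into .sdata first: it maximizes how many of them land inside the one-instruction window.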

Quote
Address Register Direct with Displacement Mode EA = mem(reg + sign_ext(const))
const = imm16bit
Probably the most important addressing mode.  The big debate is just how many bits should be allowed for that constant (offset).

Yes. Some CPUs have short instructions with an 8-bit, or even 5-bit, offset! They'll usually have bigger ones available as well, but these actually cover a lot of cases.

IBM 360, ARM, and RISC-V have 12 bits. This covers probably 99% of real-world offsets. The IBM assembler allows you to dedicate several additional GP registers to directly address further 4KB chunks of globals, if you wish.

Most other RISC ISAs give 16 bit offsets. It's very very rare to need more.

Quote
An immediate addressing mode that would allow a 32-bit constant to be loaded into a register is also valuable, but few architectures support it (it would likely require a variable-length instruction).

CISC ISAs usually allow this. Certainly VAX, m68000, i386.

The only RISC ISA I know of that allows this is the new NanoMIPS encoding, introduced in a chip in May (with 16, 32, and 48 bit instruction lengths).

RISC-V will probably get a 48 bit opcode to load a 32 bit literal in the future, but there won't be enough space to allow ADDI, ORI, ANDI, XORI etc to have 32 bit literals.
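For reference, the way RISC-V builds an arbitrary 32-bit constant today is a lui/addi pair: lui supplies the upper 20 bits, and addi adds a 12-bit immediate that is sign-extended, so the upper part must be pre-adjusted whenever the low 12 bits would read as negative. A Python sketch of the split (function names are mine):

```python
def split_imm32(value):
    """Split a 32-bit value into (hi, lo) such that lui hi; addi lo
    reconstructs it. lo is the sign-extended 12-bit immediate."""
    lo = value & 0xFFF
    if lo >= 0x800:          # addi will sign-extend: compensate in hi
        lo -= 0x1000
    hi = (value - lo) & 0xFFFFFFFF   # multiple of 0x1000, fits lui
    return hi, lo

def rebuild(hi, lo):
    """What the lui/addi pair computes at run time (mod 2^32)."""
    return (hi + lo) & 0xFFFFFFFF

for v in (0x12345678, 0xDEADBEEF, 0xFFF, 0):
    hi, lo = split_imm32(v)
    assert rebuild(hi, lo) == v
```

A 48-bit load-immediate opcode would replace this two-instruction dance with one instruction, at the cost of a longer encoding.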
 

Offline legacyTopic starter

  • Super Contributor
  • ***
  • !
  • Posts: 4415
  • Country: ch
Re: which Effective Addressing Modes are essential for a HL language?
« Reply #88 on: November 20, 2018, 03:31:53 pm »

An immediate addressing mode that would allow a 32-bit constant to be loaded into a register is also valuable, but few architectures support it (it would likely require a variable-length instruction).

Arise-v2 offers a 32bit constant for this. Unsigned.

Atomic adjustment of the stack pointer is critical.

Yup. We are currently working on this  :D
« Last Edit: November 20, 2018, 03:36:14 pm by legacy »
 

