mem(reg+const) is very useful if the reg is the equivalent of a stack frame pointer or base pointer.
I would have thought that postincrement/postdecrement are most likely not that useful, depending on whether you allow byte or word addressing.
Maybe see if you can find a copy of "Computer Architecture: A Quantitative Approach" by John L. Hennessy.
PS. Is Pascal p-code still a thing?
IMHO indexed (register + X) is very important. Think about stack frames (SP+x) or accessing members of structs.
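As a concrete illustration, here is a minimal C sketch (struct frame and read_b are names invented for this example): the member access compiles to a single load at base register + constant offset, i.e. exactly the mem(reg + const) mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct: each field sits at a compile-time constant
   offset from the struct's base address. */
struct frame {
    int32_t a;   /* offset 0 */
    int32_t b;   /* offset 4 */
    int32_t c;   /* offset 8 */
};

/* Accessing f->b is one load from (f + offsetof(struct frame, b)),
   a textbook mem(reg + const) operand. */
int32_t read_b(const struct frame *f)
{
    return f->b;
}
```

A stack-frame local has the same shape: the base register is SP or FP and the constant is the local's slot offset.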
Now, since implementing them costs time and effort, I wonder if they are worth it.
I am wondering if EA = reg0 + reg1 * scale (immediate 16bit constant) could be really useful for easily accessing a cell in the matrix.
it highly depends on the number of registers available.
But... An Atomic PUSH and POP capability is pretty important.
On the PDP11, this was done with post-incremented indexing off of the PC register.
QuoteAn Atomic PUSH and POP capability is pretty important.
Fully agreed. Also either atomic compare-exchange or load-linked/store-conditional, for lock support.
Surprisingly enough, it highly depends on the number of registers available.
The next problem is finding a compiler that generates efficient code for either case.
I am wondering if EA = reg0 + reg1 * scale (immediate 16bit constant) could be really useful for easily accessing a cell in the matrix
The scale is not always a constant, so do consider EA = reg0 + reg1 * reg2 instead. (Not only is the instruction shorter, but it is much more versatile this way.) Also, all of those could be negative as well.
if you had a fast (whatever your minimum instruction duration is) fused multiply-add instruction (regN = regA + regB * regC) and/or related load-effective-address instruction (regN = imm32 + regA<<imm5 + regB * regC) for signed integers, that would do just as well. If simplifying the addressing modes would allow that, I'd definitely go for it.
In general linear algebra (say, the same niche BLAS/LAPACK or GSL target), using a separate signed stride for both row and column advancement allows advanced features not supported by even GSL. I have an outline here on stackoverflow in C. Unlike in GSL, "views" are first-class matrices, indistinguishable from the matrix they view to; this not only simplifies code, but it makes it easier for "non-professional programmers" (scientists) to write efficient linear algebra code.
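A minimal C sketch of that signed-stride idea (the mat type and mat_get below are invented for illustration; this is not the GSL or the Stack Overflow code): a transposed or flipped "view" is just the same data pointer with different strides.

```c
#include <assert.h>
#include <stdint.h>

/* A "view" is a first-class matrix: base pointer plus two signed strides. */
typedef struct {
    int32_t *data;
    int32_t  row_stride;   /* signed, so negative strides can flip axes */
    int32_t  col_stride;
} mat;

/* EA = base + r*row_stride + c*col_stride: the reg0 + reg1*reg2
   style address arithmetic discussed above. */
static int32_t mat_get(const mat *m, int32_t r, int32_t c)
{
    return m->data[r * m->row_stride + c * m->col_stride];
}
```

For a 2x3 row-major matrix {data, 3, 1}, its transpose view is simply {data, 1, 3}: no copy needed, and both are indistinguishable to code using mat_get.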
Can you give me an example of a C structure where this makes a real difference?
uint32_t *canvas;
uint16_t width, row, col;
canvas[row*width + col] = ...;
the address calculation is essentially canvas + 4*col + 4*row*width . Even when canvas and width are fixed, you are still left with imm32 + 4*regA + 4*imm16*regB.
Here I don't follow you.
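That byte-address identity can be checked directly in C (same_address is a helper written for this check, assuming 4-byte uint32_t):

```c
#include <assert.h>
#include <stdint.h>

/* Confirms that indexing canvas[row*width + col] lands on the byte
   address canvas + 4*col + 4*row*width claimed above. */
static int same_address(uint32_t *canvas, uint16_t width,
                        uint16_t row, uint16_t col)
{
    char *by_index = (char *)&canvas[(uint32_t)row * width + col];
    char *by_bytes = (char *)canvas + 4u * col + 4u * (uint32_t)row * width;
    return by_index == by_bytes;
}
```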
Auto-increments/decrements take very little code space, do not take any extra time, and require very little extra logic, so they're easy.
The compiler implementation will be easier if you do not make any exceptions - such as register A can only be used for this, while B can only be used for that, etc.
This takes:
- 1 extra writeback stage to update the auto-incremented/decremented register in the RegisterFile
- 1 glue stage
- 1 extra cycle
Yup, precisely. One of the goals is getting stuff prepared for the simplest Compiler implementation possible.
If your registers are implemented as a memory file which can only be accessed through a single access port
Having auto-increments/auto-decrements may make the compiler more complex, unless you want to limit the use of auto-increments to trivial cases such as "i++".
This is the case. Like in the MIPS R2K design.
Having auto-increments/auto-decrements may make the compiler more complex, unless you want to limit the use of auto-increments to trivial cases such as "i++".
Why?
Consider this: expressions are handled as RPN, for which auto-increments/auto-decrements should be useful, I guess.
RPN = Reverse Polish Notation
for(x = 0; x < z; x++) {
lots of code here with accessing various arrays with x.
}
z = a[x] + b[x];
// 20 unrelated lines here
x++;
... you can read two registers in parallel per cycle, but only write one register per cycle! For two registers to be updated, and considering the FPGA uses BRAM of the same kind as above, you need the two write cycles to be done in two different stages of the FSM!
Consider this: expressions are handled as RPN, for which auto-increments/auto-decrements should be useful, I guess.
Just a quick note, but auto-incrementing with a step of only 1 is only marginally useful, especially if you have other sophisticated addressing modes such as those you suggested (with scaling and such). So, would you allow increments different from 1?
Arise-v2 allows { 8bit, 16bit, 32bit } addressing on the local bus.
Therefore these two are:
- Address Register Indirect with Postincrement: EA = reg; reg += OpSize
- Address Register Indirect with Predecrement: reg -= OpSize; EA = reg
OpSize = { 1, 2, 4 }, according to { byte, word, long } accesses
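C's pointer post-increment maps directly onto this: *p++ advances the pointer by the operand size, which is exactly the OpSize = { 1, 2, 4 } scaling above. A small sketch (sum3 is a made-up example function):

```c
#include <assert.h>
#include <stdint.h>

/* Each *p++ is one "EA = reg; reg += OpSize" access; here
   OpSize = sizeof(uint32_t) = 4. With uint16_t it would be 2,
   with uint8_t it would be 1. */
static uint32_t sum3(const uint32_t *p)
{
    uint32_t s = 0;
    s += *p++;
    s += *p++;
    s += *p++;
    return s;
}
```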
Now, since implementing them costs time and effort, I wonder if they are worth it.
Purpose of this: give a good support for a C-like compiler (still to be completed).
Note: this does depend on using modern compilers. If you're going to write your own compiler then it may be different as it's a *lot* of work. I'd say it would be crazy. You're much better off writing a back-end for gcc or llvm.
Your question betrays a 50's and 60's mindset about computer design. Patterson and Hennessy proved that wrong with the MIPS R2K and the SPARC 1 thirty years ago.
At a minimum you need to read these two books:
"Computer Architecture: A Quantitative Approach" by Patterson and Hennesy
It would be nice if you defined acronyms the first time you use them. I have *no* idea what you mean by MAC and I read the first 3 editions of P & H cover to cover.
It seems clear in context that he means "Multiply-ACcumulate" aka FMA.
Your question betrays a 50's and 60's mindset about computer design. Patterson and Hennessy proved that wrong with the MIPS R2K and the SPARC 1 thirty years ago.
At a minimum you need to read these two books:
"Computer Architecture: A Quantitative Approach" by Patterson and Hennesy
"Advanced Compiler Design & Implementation" by Muchnick
I *strongly* recommend reading P & H starting with the first edition and continuing until the device complexity level exceeds what you want to do.
Also absolutely essential is:
"Cache and Memory Hierarchy Design" by Prrzybylski
or a recent equivalent.
In 1998, a DEC Alpha 164LX at 533 MHz had several times the floating point performance of the fastest Intel and AMD parts. AMD got their big jump with the Opteron when HP killed the Alpha and AMD managed to hire a bunch of the team. Register coloring and speculative execution made a huge difference in performance. The Alpha was as big an advance in design as the R2K and SPARC.
Addressing modes really isn't a relevant question today and hasn't been for over 30 years unless you are writing for MCUs in assembler. The name of the game is memory hierarchy and data access patterns and making transformations in the compiler to minimize cache misses and pipeline stalls.
As it happens I have the processor manuals for both the R2K and the 88K.
You can have large memory or you can have fast memory, but you cannot have large, fast memory
Address Register Indirect with Displacement Mode: EA = reg + sign_ext(const)
[quote]Note: this [code that executes well without complex addressing modes] does depend on using modern compilers. If you're going to write your own compiler then it may be different as it's a *lot* of work[/quote]I thought it was sort-of the other way around - writing a compiler is easier if there isn't any choice of addressing modes, because it becomes a relatively straightforward register allocation/scheduling problem instead of a complex choice of special-case instructions with different performance characteristics? Isn't that part of the whole RISC philosophy? That does depend on caches making the instruction fetches fast...
Thanks. I'm very much a polymath, so I really struggle with acronyms.
The fallacy in the subject line tends to make me expect anything.
I should very much like to know what the objective is. A demonstration project, or a DSP engine for a Zynq or other FPGA? I have a long history with DSP going back to the 11/780 and FPS-120B. I spent quite a lot of time circa 1989-90 explaining the implications of the RISC model to my colleagues at the ARCO Plano Research Center. And doing occasional cute tricks like passing data between functions using foo(void), bar(void), etc.
Not only would I cite section 8.3.5 of the F77 standard from memory, I'd top it off with a short description of the Burroughs 5000 and why 8.3.5 was included in the standard.
I learned computer internals on the 6502, so I may be prejudiced. But I really like the addressing modes of the 6502 and that was what excited Woz about it. It was a huge improvement over the 8080/Z80 model.
- Implied addressing mode: there are no bytes or bits in the instruction op-codes to explicitly specify the operands, so they are implicit, as in the CPL and LD SP, IY instructions. In general, instructions of this type have one or two op-code bytes.
- Immediate addressing mode: the instruction needs an explicit operand which is specified in the instruction itself (second or third byte), and it is always one byte long. Examples of this mode are the instructions ADD A, N, XOR N and LD IXL, N, where N is a one-byte operand.
- Extended immediate addressing mode: it is similar to the immediate mode but takes a two-byte operand. Instructions LD HL, NN and LD IX, NN have this kind of addressing.
- Register addressing: in the op-code, there is a three-bit field to indicate one 8-bit register as the operand. RL R and AND R are examples of this addressing mode, where R is a general purpose register (B, C, D, E, H, L or A).
- Register indirect addressing: in this case, the operand itself is not in a register. Instead, a register pair contains the operand’s memory address. That’s why it is called indirect addressing. LD A, (BC) and INC (HL) are examples of this mode. Note the use of parentheses around registers to denote indirect addressing.
- Extended addressing (or direct addressing): the instruction has a two-byte field containing the operand’s memory address. Instructions like LD A, (NN) and LD (NN), HL illustrate the direct addressing mode, where NN is a 16-bit value representing a memory address.
- Modified page zero addressing: this mode is exclusive to a single instruction, RST P, which causes the CPU to jump execution to a specified location P, inside the page zero memory area. Page zero is a memory area comprising the first 256 bytes of memory (usually ROM). P can assume only the following values: 0x0, 0x8, 0x10, 0x18, 0x20, 0x28, 0x30 and 0x38. RST P is normally used to call special ROM subroutines. (For the ZX Spectrum, see which ROM routines are available here.)
- Relative addressing: it is used in relative jump instructions (conditional and unconditional), in which the memory address to go to is calculated by adding an operand displacement to the PC register. The displacement value ranges from -128 to +127 (signed binary representation), allowing it to address a location in memory from -126 to +129 relative to the jump instruction address (this is due to PC being incremented by two during instruction execution). As an example, instruction JR Z, D, where D is the displacement, uses this addressing mode. Relative addressing makes jump instructions one byte shorter than absolute jump instructions (JP), where a 16-bit address operand is provided.
- Indexed addressing: this mode allows one to use indirect addressing via the index registers IX and IY. Like the relative addressing mode, the instruction has an 8-bit displacement operand, which is added to the index register being used. AND (IX + D) and LD (IY + D), N are examples of the indexed mode, where D is an 8-bit signed displacement and N is an immediate operand.
Addressing Modes
The 6502 processor provides several ways in which memory locations can be addressed. Some instructions support several different modes while others may only support one. In addition, the two index registers cannot always be used interchangeably. This lack of orthogonality in the instruction set is one of the features that makes the 6502 trickier to program well.
Implicit
For many 6502 instructions the source and destination of the information to be manipulated is implied directly by the function of the instruction itself and no further operand needs to be specified. Operations like 'Clear Carry Flag' (CLC) and 'Return from Subroutine' (RTS) are implicit.
Accumulator
Some instructions have an option to operate directly upon the accumulator. The programmer specifies this by using a special operand value, 'A'. For example:
LSR A ;Logical shift right one bit
ROR A ;Rotate right one bit
Immediate
Immediate addressing allows the programmer to directly specify an 8 bit constant within the instruction. It is indicated by a '#' symbol followed by a numeric expression. For example:
LDA #10 ;Load 10 ($0A) into the accumulator
LDX #LO LABEL ;Load the LSB of a 16 bit address into X
LDY #HI LABEL ;Load the MSB of a 16 bit address into Y
Zero Page
An instruction using zero page addressing mode has only an 8 bit address operand. This limits it to addressing only the first 256 bytes of memory (e.g. $0000 to $00FF) where the most significant byte of the address is always zero. In zero page mode only the least significant byte of the address is held in the instruction making it shorter by one byte (important for space saving) and one less memory fetch during execution (important for speed).
An assembler will automatically select zero page addressing mode if the operand evaluates to a zero page address and the instruction supports the mode (not all do).
LDA $00 ;Load accumulator from $00
ASL ANSWER ;Shift labelled location ANSWER left
Zero Page,X
The address to be accessed by an instruction using indexed zero page addressing is calculated by taking the 8 bit zero page address from the instruction and adding the current value of the X register to it. For example if the X register contains $0F and the instruction LDA $80,X is executed then the accumulator will be loaded from $008F (e.g. $80 + $0F => $8F).
NB:
The address calculation wraps around if the sum of the base address and the register exceed $FF. If we repeat the last example but with $FF in the X register then the accumulator will be loaded from $007F (e.g. $80 + $FF => $7F) and not $017F.
STY $10,X ;Save the Y register at location on zero page
AND TEMP,X ;Logical AND accumulator with a zero page value
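The wrap-around rule can be modelled in a line of C (zp_x_ea is a name invented here): the base/index sum is simply truncated to 8 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Zero Page,X effective address: the sum wraps modulo 256,
   so it can never leave page zero. */
static uint16_t zp_x_ea(uint8_t base, uint8_t x)
{
    return (uint8_t)(base + x);
}
```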
Zero Page,Y
The address to be accessed by an instruction using indexed zero page addressing is calculated by taking the 8 bit zero page address from the instruction and adding the current value of the Y register to it. This mode can only be used with the LDX and STX instructions.
LDX $10,Y ;Load the X register from a location on zero page
STX TEMP,Y ;Store the X register in a location on zero page
Relative
Relative addressing mode is used by branch instructions (e.g. BEQ, BNE, etc.) which contain a signed 8 bit relative offset (e.g. -128 to +127) which is added to the program counter if the condition is true. As the program counter itself is incremented during instruction execution by two, the effective address range for the target instruction must be within -126 to +129 bytes of the branch.
BEQ LABEL ;Branch if zero flag set to LABEL
BNE *+4 ;Skip over the following 2 byte instruction
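The target arithmetic can be sketched in C (branch_target is invented here): the signed offset is added after the PC has stepped past the 2-byte branch, which is where the -126 to +129 range comes from.

```c
#include <assert.h>
#include <stdint.h>

/* Branch target: PC has already advanced past the 2-byte branch
   when the signed 8-bit offset is added. */
static uint16_t branch_target(uint16_t branch_addr, int8_t offset)
{
    return (uint16_t)(branch_addr + 2 + offset);
}
```

An offset of -2 branches to the instruction itself (a tight spin loop).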
Absolute
Instructions using absolute addressing contain a full 16 bit address to identify the target location.
JMP $1234 ;Jump to location $1234
JSR WIBBLE ;Call subroutine WIBBLE
Absolute,X
The address to be accessed by an instruction using X register indexed absolute addressing is computed by taking the 16 bit address from the instruction and adding the contents of the X register. For example if X contains $92 then an STA $2000,X instruction will store the accumulator at $2092 (e.g. $2000 + $92).
STA $3000,X ;Store accumulator between $3000 and $30FF
ROR CRC,X ;Rotate right one bit
Absolute,Y
The Y register indexed absolute addressing mode is the same as the previous mode only with the contents of the Y register added to the 16 bit address from the instruction.
AND $4000,Y ;Perform a logical AND with a byte of memory
STA MEM,Y ;Store accumulator in memory
Indirect
JMP is the only 6502 instruction to support indirection. The instruction contains a 16 bit address which identifies the location of the least significant byte of another 16 bit memory address which is the real target of the instruction.
For example if location $0120 contains $FC and location $0121 contains $BA then the instruction JMP ($0120) will cause the next instruction execution to occur at $BAFC (e.g. the contents of $0120 and $0121).
JMP ($FFFC) ;Force a power on reset
JMP (TARGET) ;Jump via a labelled memory area
Indexed Indirect
Indexed indirect addressing is normally used in conjunction with a table of addresses held on zero page. The address of the table is taken from the instruction and the X register added to it (with zero page wrap around) to give the location of the least significant byte of the target address.
LDA ($40,X) ;Load a byte indirectly from memory
STA (MEM,X) ;Store accumulator indirectly into memory
Indirect Indexed
Indirect indexed addressing is the most common indirection mode used on the 6502. The instruction contains the zero page location of the least significant byte of a 16 bit address. The Y register is dynamically added to this value to generate the actual target address for the operation.
LDA ($40),Y ;Load a byte indirectly from memory
STA (DST),Y ;Store accumulator indirectly into memory
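The effective-address calculation can be modelled in C (ind_y_ea is a name invented here): fetch a little-endian 16-bit pointer from zero page, then add Y.

```c
#include <assert.h>
#include <stdint.h>

/* (zp),Y: read a 16-bit little-endian base address from zero page,
   then add the Y register to form the target address. */
static uint16_t ind_y_ea(const uint8_t mem[256], uint8_t zp, uint8_t y)
{
    uint16_t base = mem[zp] | ((uint16_t)mem[(uint8_t)(zp + 1)] << 8);
    return (uint16_t)(base + y);
}
```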
I had an NCD 1024x1024 greyscale X terminal
I'm accustomed to "multiply-add" or "multiply-accumulate" and was not expecting op code references in a discussion of addressing modes.
long history with DSP and not knowing the acronym MAC does not compute
QuoteI'm accustomed to "multiply-add" or "multiply-accumulate" and was not expecting op code references in a discussion of addressing modes.
Yup, it's a weird EA offered by weird beasts like the 88K.
The 'multiply' is really just a shift
Anywhere that you care about speed, you'll be iterating over elements with a fixed stride.
Fresh news. About the matrix stuff: Arise-v2 needs to add each element of a bounding box containing an image to its local neighbors, weighted by the kernel. This is related to a form of mathematical convolution, not traditional matrix multiplication.
In the future, the GTE will for sure offer support for this, but the current version of the hardware has no support, therefore operations need to be done in software.
A lot of multiplications, kernel-cells by image-cells.
As it happens I have the processor manuals for both the R2K and the 88K.
There are at least 3 editions of P & H since the R2K was mentioned. Those editions came out not because they wanted to sell books, but because the world had changed so drastically that the previous edition was no longer relevant to current products. The discussion in the 1st ed is dominated by the branch delay slot. Then came speculative execution and branch prediction. I have no idea where we are now.
I've got the 4th ed on my shelf, but haven't had an HPC project, so not a lot of point in wading through all that detail. It's probably entirely different now. And certainly will change drastically after the discovery of the speculative execution vulnerabilities.
I don't think I ever saw an 88K in operation unless perhaps the Data General system I looked at at a trade show used it. Which is probably the case as that would have caught my attention and Wikipedia says they used it. I bought the manual when it was introduced as I really like the 68K, but it never got traction. The SPARCstation, HP "Snakes", and IBM PowerPC all appeared at the same time and MIPS was going great guns in the SGI heavy iron.
I should appreciate an explanation of why a convolution is not a matrix multiply
static sint32_t convolution_get_pixel
(
sint32_t kernel[],
uint32_t kernel_n,
sint32_t image[],
uint32_t image_width,
uint32_t x,
uint32_t y
)
{
sint32_t ans;
sint32_t sum;
sint32_t k;   /* signed: the loops below run from -1 to +1 */
sint32_t j;
uint32_t OF0;
uint32_t OF1;
sum = 0;
for (k = -1; k <= 1; k++)
{
for (j = -1; j <= 1; j++)
{
OF0 = get_kernel_cell_offset(j + 1, k + 1, kernel_n * 4);
OF1 = get_image_cell_offset(x - j, y - k, image_width * 4);
sum += kernel[OF0] * image[OF1];
}
}
ans = sum;
return ans;
}
void convolution_image
(
sint32_t kernel[],
uint32_t kernel_n,
sint32_t image_in[],
sint32_t image_out[],
uint32_t image_height,
uint32_t image_width
)
{
sint32_t pixel;
uint32_t x;
uint32_t y;
uint32_t OF0;
for (y = 0; y < image_height; y++)
{
for (x = 0; x < image_width; x++)
{
pixel = convolution_get_pixel(kernel, kernel_n, image_in, image_width, x, y);
OF0 = get_image_cell_offset(x,y,image_width * 4);
image_out[OF0] = pixel;
}
}
}
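The two offset helpers are not shown in the post, so this is only a guess at their shape: plausible row-major bodies, treating the offsets as element indices (the `* 4` arguments in the calls above suggest byte strides instead, so take this purely as an illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bodies for the helpers used above (originals not shown):
   plain row-major cell offsets, offset = row * row_stride + column. */
static uint32_t get_kernel_cell_offset(uint32_t col, uint32_t row,
                                       uint32_t row_stride)
{
    return row * row_stride + col;
}

static uint32_t get_image_cell_offset(uint32_t x, uint32_t y,
                                      uint32_t row_stride)
{
    return y * row_stride + x;
}
```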
OF0 = get_kernel_cell_offset( reg0, reg1, const);
OF1 = get_image_cell_offset( reg0, reg1, const);
These two lines can be translated into "SLAC" instructions. Neat, and very fast, since SLAC is basically a SHIFT which takes 1 clock edge to compute.
sum += kernel[OF0] * image[OF1];
This can be translated into a "MAC".
QuoteI learned computer internals on the 6502, so I may be prejudiced. But I really like the addressing modes of the 6502 and that was what excited Woz about it. It was a huge improvement over the 8080/Z80 model.
But given your performance targets I'm a bit baffled about the purpose of the project.
QuoteNote: this [code that executes well without complex addressing modes] does depend on using modern compilers. If you're going to write your own compiler then it may be different as it's a *lot* of work
I thought it was sort-of the other way around - writing a compiler that is easier if there isn't any choice of addressing modes, because it becomes a relatively straightforward register allocation/scheduling problem instead of a complex choice of special-case instructions with different performance characteristics ? Isn't that part of the whole RISC philosophy? That does depend on caches making the instruction fetches fast...
But once you have a proper optimiser anyway, you don't need all those addressing modes any more.
int* b;
unsigned char* c;
for (k = 0; k < n; k++) {
b[k] = c[k];
}
movzx eax,byte ptr [rsi+rax]
mov [rdi+rax*4],eax
mov rbx,rax
add rbx,rdi
mov rcx,rax
shl rcx,2
add rcx,rsi
movzx eax,byte ptr [rbx]
mov [rcx],eax
movzx eax,byte ptr [rbx]
inc rbx
mov [rcx],eax
add rcx,4
for (k = 0; k < n; k++) {
if (must_process(k)) b[k] = c[k];
}
Just a small change to the source code:
for (k = 0; k < n; k++) {
if (must_process(k)) b[k] = c[k];
}
What the "proper optimizer" can do now. If must_process() returns mostly "true", the above optimization trick with increments will be benefical. However, if must_process() returns mostly "false", it's better to stick with naïve unoptimized implementation. But the "proper optimizer" has no clue, so it cannot even make a proper choice between the two, and therefore is likely to produce a code which will perform worse than the naïve variant. Doesn't look very impressive, does it?
Meanwhile, hardware indexing works the same as before.
20: 89 df mov %ebx,%edi
22: e8 00 00 00 00 callq 27 <foo+0x27>
27: 85 c0 test %eax,%eax
29: 74 0a je 35 <foo+0x35>
2b: 41 0f b6 44 1d 00 movzbl 0x0(%r13,%rbx,1),%eax
31: 41 89 04 9c mov %eax,(%r12,%rbx,4)
35: 48 83 c3 01 add $0x1,%rbx
39: 48 39 eb cmp %rbp,%rbx
3c: 75 e2 jne 20 <foo+0x20>
101ce: 8522 mv a0,s0
101d0: fe3ff0ef jal ra,101b2 <must_process>
101d4: 2405 addiw s0,s0,1
101d6: c501 beqz a0,101de <foo+0x28>
101d8: 00094783 lbu a5,0(s2)
101dc: c09c sw a5,0(s1)
101de: 0905 addi s2,s2,1
101e0: 0491 addi s1,s1,4
101e2: fe8996e3 bne s3,s0,101ce <foo+0x18>
730: 2a0203e0 mov w0, w2
734: 97fffff5 bl 708 <must_process>
738: 34000060 cbz w0, 744 <foo+0x34>
73c: 38626824 ldrb w4, [x1, x2]
740: b82278a4 str w4, [x5, x2, lsl #2]
744: 91000442 add x2, x2, #0x1
748: eb03005f cmp x2, x3
74c: 54ffff21 b.ne 730 <foo+0x20> // b.any
...
for (y = 0; y < image_height; y++)
{
for (x = 0; x < image_width; x++)
{
pixel = convolution_get_pixel(kernel, kernel_n, image_in, image_width, x, y);
OF0 = get_image_cell_offset(x,y,image_width * 4);
image_out[OF0] = pixel;
}
}
}
stride = 4*image_width;
next_line = stride;
image_end = stride*image_height;
y = 0;
OF0 = 0;
while ( OF0 < image_end )
{
x = 0;
while ( OF0 < next_line )
{
pixel = convolution_get_pixel(kernel, kernel_n, image_in, image_width, x, y);
image_out[OF0] = pixel;
OF0 += 4;
x++;
}
next_line += stride;
y++;
}
}
And the same sort of thing can be done to get the multiplies out of the convolution addresses (you can get rid of x and y then). Compilers have been applying this sort of transformation automatically for decades.
OF0 += 4;
fancy addressing modes ... umm, SLAC is not an addressing-mode instruction, it's a common arithmetic instruction that precedes a load/store instruction. It has the same execution time as an addition (1 clock edge), but it makes the compiler's job easier, since what you see in the C code is directly translated into assembly! Besides, it costs zero effort in the ISA's implementation, since it's an already existing SHIFT instruction with a little modification.
I compiled your code for x86_64 with gcc -O2:
20: 89 df mov %ebx,%edi
22: e8 00 00 00 00 callq 27 <foo+0x27>
27: 85 c0 test %eax,%eax
29: 74 0a je 35 <foo+0x35>
2b: 41 0f b6 44 1d 00 movzbl 0x0(%r13,%rbx,1),%eax
31: 41 89 04 9c mov %eax,(%r12,%rbx,4)
35: 48 83 c3 01 add $0x1,%rbx
39: 48 39 eb cmp %rbp,%rbx
3c: 75 e2 jne 20 <foo+0x20>
Cool. Sure enough, it used your fancy addressing modes. It has nine instructions and thirty bytes of code in the loop.
00: 89 df mov edi,ebx
02: e8 00 00 00 00 call must_process
07: 85 c0 test eax,eax
09: 74 09 jz 0x14
0b: 41 0f b6 04 1e movzx eax,[r14+rbx]
10: 41 89 04 9f mov [r15+rbx*4],eax
14: ff c0 inc ebx
16: 74 e8 jz 0x00
00: 89 d9 mov ecx,ebx
02: e8 00 00 00 00 call must_process
07: 85 c0 test eax,eax
09: 74 07 jz 0x12
0b: 0f b6 04 1e movzx eax,[rsi+rbx]
0f: 89 04 9f mov [rdi+rbx*4],eax
12: ff c0 inc ebx
14: 74 ea jz 0x00
101ce: 8522 mv a0,s0
101d0: fe3ff0ef jal ra,101b2 <must_process>
101d4: 2405 addiw s0,s0,1
101d6: c501 beqz a0,101de <foo+0x28>
101d8: 00094783 lbu a5,0(s2)
101dc: c09c sw a5,0(s1)
101de: 0905 addi s2,s2,1
101e0: 0491 addi s1,s1,4
101e2: fe8996e3 bne s3,s0,101ce <foo+0x18>
101de: 0905 addi s2,s2,1
101e0: 0491 addi s1,s1,4
I should appreciate an explanation of why a convolution is not a matrix multiply
Different order of the cells' addresses during multiplication, which makes the "SLAC" instruction useful for the EA of each cell (kernel and image, convolved).
Though if you want input from the reflection seismology side of DSP send me a PM.
I compiled your code for x86_64 with gcc -O2:
20: 89 df mov %ebx,%edi
22: e8 00 00 00 00 callq 27 <foo+0x27>
27: 85 c0 test %eax,%eax
29: 74 0a je 35 <foo+0x35>
2b: 41 0f b6 44 1d 00 movzbl 0x0(%r13,%rbx,1),%eax
31: 41 89 04 9c mov %eax,(%r12,%rbx,4)
35: 48 83 c3 01 add $0x1,%rbx
39: 48 39 eb cmp %rbp,%rbx
3c: 75 e2 jne 20 <foo+0x20>
Cool. Sure enough, it used your fancy addressing modes. It has nine instructions and thirty bytes of code in the loop.
Not very cool. That's just a consequence of how GCC does theoretical optimizations rather than catering to real-world hardware. I'm sure any half-decent compiler, pointed in the right direction, could come up with something better. As a compiler writer, you would probably agree that the optimizations applied here are not very hard to implement in the compiler:
Though I don't understand why the number of bytes is so important - everyone has tons of RAM. Speed is king.
And look what your "proper optimizer" has done for the "best in the world" RISC-V:
Now, if you run this loop 1,000,000 times while must_process() returns true only 2 or 3 times, these two lines:
101de: 0905 addi s2,s2,1
101e0: 0491 addi s1,s1,4
are going to execute at every pass - that is about 2 million needless additions. You really call this "proper optimization"?
In the absence of knowing how often must_process() returns true: absolutely.
It's worth converting to a shift and add inside the condition if must_process() returns true less than a third of the time. If it's more than that then you're doing extra work by shifting and adding every time.
Complex addressing modes are not free.
Aside from consuming a small number of code points, hardware indexing is free. You simply extend your address generation unit so that instead of one addition (base + offset) it does two (base + shifted index + offset). If this fits between clocks, it doesn't take any extra time.
It's absolutely not free. It might under some circumstances be cheap enough to be worth it, but it's not free. It's extra circuitry to power, even when instructions aren't using it.
It's not even a question of "does it fit between clocks?". The question is "Is it the thing that is preventing you from making the clock faster?"
It's absolutely not free. It might under some circumstances be cheap enough to be worth it, but it's not free. It's extra circuitry to power, even when instructions aren't using it.
A digital circuit made with MOS transistors only consumes energy on switching on-off/off-on; the energy consumption is linear in the clock frequency and quadratic in the applied voltage, while the leakage current when a transistor keeps its internal state is negligible.
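That is the standard dynamic-power relation, roughly P = a * C * V^2 * f (activity factor a, switched capacitance C, supply voltage V, clock frequency f). A toy calculation (dynamic_power and all numbers here are illustrative only):

```c
#include <assert.h>

/* Dynamic CMOS power: linear in frequency, quadratic in voltage. */
static double dynamic_power(double a, double c, double v, double f)
{
    return a * c * v * v * f;
}
```

Halving f halves P, while halving V quarters it, which is why voltage scaling saves more power than frequency scaling alone.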
It's not even a question of "does it fit between clocks?". The question is "Is it the thing that is preventing you from making the clock faster?"
The EA addressing mode we are talking about can be calculated by the ALU, which is a mandatory stage of the machine, therefore since the circuits are already there (arranged differently, but they are all there ... shift, add, etc.), it literally adds zero cost!
on MIPS-R2K: Fetch -> decode & registers load from RF -> ALU & EA evaluation -> load/store -> WriteBack.
At 28nm, the power consumption of the FU540 (quad core 1.5 GHz 64 bit processor) is something like 90% leakage current, with the power consumption difference between idle (or sleeping) and running flat out being very small. Stopping the clock doesn't save power now. Only actually turning off the power supply saves power.
At 28nm, the power consumption of the FU540 (quad core 1.5 GHz 64 bit processor) is something like 90% leakage current, with the power consumption difference between idle (or sleeping) and running flat out being very small. Stopping the clock doesn't save power now. Only actually turning off the power supply saves power.
Doesn't sound very compelling. Xilinx 7-series FPGAs are 28nm. The quiescent current is rather low (e.g. 200 mA with an Artix-7 XC7A100T), but this goes up very quickly when you start massive switching at 600 MHz (to 10 A or more with the same Artix-7 XC7A100T).
A standard ALU (as in the MIPS) takes two inputs and applies one operation such as AND/OR/XOR, ADD, SHIFT etc to them, producing one output.
Your addressing mode requires two adds and a shift. That means either the standard ALU must be used three times in a row (extra time), or else be extended with an extra shifter and extra adder which are only used by a fraction of instructions (loads and stores using this addressing mode). The latter uses not only extra transistors (extra area, cost, and power) but also extra time as the result of one of the new function units feeds into the next one. You also now need four inputs instead of two. The register file now also needs three read ports instead of two for stores, which is expensive.
Usually, this doesn't take all that much BlockRAM because there aren't a whole lot of registers - 16?
At 28 nm, the power consumption of the FU540 (quad-core 1.5 GHz 64-bit processor) is something like 90% leakage current, and the power consumption difference between idle (or sleeping) and running flat out is very small. Stopping the clock doesn't save power now. Only actually turning off the power supply saves power.
Doesn't sound very compelling. Xilinx 7-series FPGAs are 28 nm. The quiescent current is rather low (e.g. 200 mA with an Artix-7 XC7A100T), but it goes up very quickly once you start massive switching at 600 MHz (to 10 A or more with the same Artix-7 XC7A100T).
That's because Xilinx uses the lower frequency and lower leakage 28 HPL process. Not all 28nm is the same, even from the same foundry (TSMC).
Low frequency silicon, huh? Very smooth ...
How about Intel processors? A guy measured power consumption of 22/32 nm technology here:
http://blog.stuffedcow.net/2012/10/intel32nm-22nm-core-i5-comparison/
Look at pictures 2a/2b and extend the orange "Total power" line until it crosses 0 frequency. It looks like around 10 W (perhaps 15 W), and that includes a whole lot more than just leakage. As the frequency increases, the power goes over 100 W. It doesn't look like leakage dominates. Do they use "low frequency silicon" too?
No, Intel spends a lot of money on designing and implementing aggressive power-gating (not just clock-gating) of idle circuits, as mentioned in the last sentence of my original message quoted above.
Maybe you can also explain what power gating has to do with this?
The linear dependence between frequency and power consumption is direct proof that the power consumption depends on switching, power gating or not. Only a small portion (around 10%) of the power consumption is static and can be attributed to leakage.
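The argument reduces to the usual first-order power model. A sketch with made-up coefficients (loosely shaped like the numbers quoted from the linked measurements, not real chip data):

```python
def total_power_w(freq_ghz, p_static_w=10.0, dyn_w_per_ghz=30.0):
    """First-order model at fixed voltage: P = P_static + k*f.
    The y-intercept at f = 0 is the static part (leakage plus
    always-on circuits); the slope is the switching (dynamic) part."""
    return p_static_w + dyn_w_per_ghz * freq_ghz
```

With these illustrative coefficients, extrapolating to f = 0 leaves 10 W of static power, while at 3 GHz the total is 100 W, i.e. switching dominates.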
Don't just take my word for it. This is something known by absolutely anyone who is making chips at 65nm, 40nm, 28nm and below.
EE Times article saying leakage dominates at 65 nm and below: https://www.eetimes.com/document.asp?doc_id=1264175
Broadcom presentation saying that leakage current dominates and that turning off everything that does not need to be on is critical: http://www.islped.org/2014/files/islped2014_leakage_mitigation_JohnRedmond_Broadcom.pdf
IEEE article explaining why leakage starts to dominate more and more once you go below 100 nm: http://www.ruf.rice.edu/~mobile/elec518/readings/DevicesAndCircuits/kim03leakage.pdf
Well, we're just going to have to agree to disagree.
We're designing and building SoCs here, according to our instruction set and hardware design philosophy. The proof is in how they work, and that speaks for itself. So far, every chip taped out has come back working on the first try, and at the predicted performance and power consumption. The only opinions that matter are the paying customers'.
In the end, SLAC got implemented, tested, and added to the ISA.
But not as an "EA" addressing mode; rather as a regular instruction that has to precede a load/store.
Reasons for this? Elisabeth doesn't like to add a new timing constraint to the list.
It seems a good compromise.
Currently, we are testing Arise-v2 on the HDL simulator (and sometimes on the real hardware) by writing short assembly programs. Basically, they are loops operating on predefined values, so we can check that things go as expected, step by step. The HL-compiler is already able to identify and parse mathematical expressions(1) and transform them into RPN. However, it still needs to recognize a matrix element so it can use SLAC for the EA.
a*b + c*d
+(*(a,b),*(c,d))
temp1 := *(a,b) // this will be replaced by a single instruction
+(temp1,*(c,d))
temp1 := *(a,b) // this will be replaced by a single instruction
temp2 := *(c,d) // this will be replaced by a single instruction
temp3 := +(temp1,temp2) // this will be replaced by a single instruction
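The lowering shown above can be sketched as a small recursive tree walk (a hypothetical helper, not the actual HL-compiler code):

```python
def lower(expr, code):
    """Lower a prefix-form expression tree into three-address
    statements, allocating one temporary per operation."""
    if isinstance(expr, str):              # leaf: a plain operand name
        return expr
    op, lhs, rhs = expr
    left = lower(lhs, code)
    right = lower(rhs, code)
    temp = f"temp{len(code) + 1}"          # temp1, temp2, ...
    code.append(f"{temp} := {op}({left},{right})")
    return temp

code = []
lower(('+', ('*', 'a', 'b'), ('*', 'c', 'd')), code)
# code now holds the three statements above, one machine instruction each
```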
subscript(array,index) // expanded from array[index]
dot(structure,member) // expanded from structure.member
dot(subscript(array,index),member) // expanded from array[index].member
EA = base + k0 + index1*k1 + index2*k2 + ...
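A sketch of how a subscript/dot chain folds into that linear form (all offsets and strides below are made up for illustration):

```python
def effective_address(base, k0, terms):
    """EA = base + k0 + sum(index_n * k_n): k0 collects the constant
    member offsets, and each k_n is the stride of one subscript level."""
    ea = base + k0
    for index, stride in terms:
        ea += index * stride
    return ea

# m[row][col].field for a matrix of 16-byte structs, 8 columns per row,
# field at offset 4 within the struct:
addr = effective_address(0x2000, 4, [(3, 8 * 16), (5, 16)])
```

Because the whole chain collapses into one base, one constant, and one stride per index, a single fused shift/multiply-add instruction (or SLAC) covers it.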
Surprisingly enough, it highly depends on the number of registers available.
Address Register Direct with Displacement Mode EA = mem(reg + sign_ext(const))
const = imm16bit
I have a different background than many of you. I "cut my teeth" writing embedded code in assembler on several different "oddball" microprocessors. I then moved on to programming, including writing assembler, on a VAX, which I think most people would agree was the ultimate complex-instruction-and-addressing-mode processor ever built. The last big project I worked on was designing an application binary interface (ABI) to be used by GNU C for a 32-bit RISC architecture, so I have sort of been around the block on this.
Surprisingly enough, it highly depends on the number of registers available.
EXTREMELY TRUE! Looking back, it amazes me that VMS only allowed compilers/programmers to use 12 of 16 registers. The 4 "dedicated" registers are the program counter, the stack pointer, the frame pointer, and the argument pointer.
Now it seems like everything has a minimum of 32 registers.
Reserving Reg0 to always be ZERO is very useful in embedded programming. (IIRC we designed the ABI to use a special data section called .zdata that was within the range of the address offset added to R0.)
Address Register Direct with Displacement Mode EA = mem(reg + sign_ext(const))
const = imm16bit
Probably the most important addressing mode. The big debate is just how many bits should be allowed for that constant (offset).
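A sketch of that mode with the sign extension made explicit (the widths are assumptions, matching the imm16 above):

```python
def sign_ext16(imm):
    """Sign-extend a 16-bit immediate to a full-width signed integer."""
    return imm - 0x10000 if imm & 0x8000 else imm

def ea_displacement(reg, imm16):
    """EA = reg + sign_ext(const): one mode covers both stack-frame
    slots (SP + x) and struct member access (base + offset), with
    negative offsets reaching below the base register."""
    return reg + sign_ext16(imm16)
```

For example, ea_displacement(0x8000, 0x0008) gives 0x8008, while 0xFFF8 encodes -8 and gives 0x7FF8.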
An immediate addressing mode that allows a 32-bit constant to be loaded into a register is also valuable, but few architectures support it (it would likely require variable-length instructions).
Atomic adjustment of the stack pointer is critical.