mem(reg+const) is very useful if the reg is the equivalent of a stack frame pointer or base pointer.
I would have thought that postincrement/postdecrement are most likely not that useful, depending on whether you allow byte or word addressing.
Maybe see if you can find a copy of "Computer Architecture: A Quantitative Approach" by John L. Hennessy.
PS. Is Pascal p-code still a thing?
IMHO indexed (register + X) is very important. Think about stack frames (SP+x) or accessing members of structs.
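As a concrete illustration, here is a minimal C sketch (struct frame and read_b are names invented for this example): the member access compiles to a single load at base register + constant offset, i.e. exactly the mem(reg + const) mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct: each field sits at a compile-time constant
   offset from the struct's base address. */
struct frame {
    int32_t a;   /* offset 0 */
    int32_t b;   /* offset 4 */
    int32_t c;   /* offset 8 */
};

/* Accessing f->b is one load from (f + offsetof(struct frame, b)),
   a textbook mem(reg + const) operand. */
int32_t read_b(const struct frame *f)
{
    return f->b;
}
```

A stack-frame local has the same shape: the base register is SP or FP and the constant is the local's slot offset.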
Now, since implementing them costs time and effort, I wonder if they are worth it.
I am wondering if EA = reg0 + reg1 * scale (immediate 16bit constant) could be really useful for easily accessing a cell in the matrix.
it highly depends on the number of registers available.
But... An Atomic PUSH and POP capability is pretty important.
On the PDP11, this was done with post-incremented indexing off of the PC register.
QuoteAn Atomic PUSH and POP capability is pretty important.
Fully agreed. Also either atomic compare-exchange or load-linked/store-conditional, for lock support.
Surprisingly enough, it highly depends on the number of registers available.
The next problem is finding a compiler that generates efficient code for either case.
I am wondering if EA = reg0 + reg1 * scale (immediate 16bit constant) could be really useful for easily accessing a cell in the matrix
The scale is not always a constant, so do consider EA = reg0 + reg1 * reg2 instead. (Not only is the instruction shorter, but it is much more versatile this way.) Also, all of those could be negative as well.
if you had a fast (whatever your minimum instruction duration is) fused multiply-add instruction (regN = regA + regB * regC) and/or related load-effective-address instruction (regN = imm32 + regA<<imm5 + regB * regC) for signed integers, that would do just as well. If simplifying the addressing modes would allow that, I'd definitely go for it.
In general linear algebra (say, the same niche BLAS/LAPACK or GSL target), using a separate signed stride for both row and column advancement allows advanced features not supported by even GSL. I have an outline here on stackoverflow in C. Unlike in GSL, "views" are first-class matrices, indistinguishable from the matrix they view to; this not only simplifies code, but it makes it easier for "non-professional programmers" (scientists) to write efficient linear algebra code.
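A minimal C sketch of that signed-stride idea (the mat type and mat_get below are invented for illustration; this is not the GSL or the Stack Overflow code): a transposed or flipped "view" is just the same data pointer with different strides.

```c
#include <assert.h>
#include <stdint.h>

/* A "view" is a first-class matrix: base pointer plus two signed strides. */
typedef struct {
    int32_t *data;
    int32_t  row_stride;   /* signed, so negative strides can flip axes */
    int32_t  col_stride;
} mat;

/* EA = base + r*row_stride + c*col_stride: the reg0 + reg1*reg2
   style address arithmetic discussed above. */
static int32_t mat_get(const mat *m, int32_t r, int32_t c)
{
    return m->data[r * m->row_stride + c * m->col_stride];
}
```

For a 2x3 row-major matrix {data, 3, 1}, its transpose view is simply {data, 1, 3}: no copy needed, and both are indistinguishable to code using mat_get.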
Can you give me an example of a C structure where this makes a real difference?
uint32_t *canvas;
uint16_t width, row, col;
canvas[row*width + col] = ...;
the address calculation is essentially canvas + 4*col + 4*row*width . Even when canvas and width are fixed, you are still left with imm32 + 4*regA + 4*imm16*regB.
Here I don't follow you.
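That byte-address identity can be checked directly in C (same_address is a helper written for this check, assuming 4-byte uint32_t):

```c
#include <assert.h>
#include <stdint.h>

/* Confirms that indexing canvas[row*width + col] lands on the byte
   address canvas + 4*col + 4*row*width claimed above. */
static int same_address(uint32_t *canvas, uint16_t width,
                        uint16_t row, uint16_t col)
{
    char *by_index = (char *)&canvas[(uint32_t)row * width + col];
    char *by_bytes = (char *)canvas + 4u * col + 4u * (uint32_t)row * width;
    return by_index == by_bytes;
}
```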
Auto-increments/decrements take very little code space, do not take any extra time, and require very little extra logic, so they're easy.
The compiler implementation will be easier if you do not make any exceptions - such as register A can only be used for this, while B can only be used for that, etc.
This takes:
- 1 extra writeback stage to update the auto-incremented/decremented register in the RegisterFile
- 1 glue stage
- 1 extra cycle
Yup, precisely. One of the goals is getting stuff prepared for the simplest Compiler implementation possible.
If your registers are implemented as a memory file which can only be accessed through a single access port
Having auto-increments/auto-decrements may make the compiler more complex, unless you want to limit the use of auto-increments to trivial cases such as "i++".
This is the case. Like in the MIPS R2K design.
Having auto-increments/auto-decrements may make the compiler more complex, unless you want to limit the use of auto-increments to trivial cases such as "i++".
Why?
Consider this: expressions are handled as RPN, for which auto-increments/auto-decrements should be useful, I guess.
RPN = Reverse Polish Notation
for(x = 0; x < z; x++) {
lots of code here with accessing various arrays with x.
}
z = a[x] + b[x];
// 20 unrelated lines here
x++;
... you can read two registers in parallel per cycle, but only write one register per cycle! For two registers to be updated, and considering the FPGA uses BRAM of the same kind as above, you need the two write cycles to be done in two different stages of the FSM!
Consider this: expressions are handled as RPN, for which auto-increments/auto-decrements should be useful, I guess.
Just a quick note, but auto-incrementing with a step of only 1 is only marginally useful, especially if you have other sophisticated addressing modes such as those you suggested (with scaling and such). So, would you allow increments different from 1?
Arise-v2 allows { 8bit, 16bit, 32bit } addressing on the local bus.
Therefore these two are:
- Address Register Indirect with Postincrement: EA = reg; reg += OpSize
- Address Register Indirect with Predecrement: reg -= OpSize; EA = reg
OpSize = { 1, 2, 4 }, according to { byte, word, long } accesses
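C's pointer post-increment maps directly onto this: *p++ advances the pointer by the operand size, which is exactly the OpSize = { 1, 2, 4 } scaling above. A small sketch (sum3 is a made-up example function):

```c
#include <assert.h>
#include <stdint.h>

/* Each *p++ is one "EA = reg; reg += OpSize" access; here
   OpSize = sizeof(uint32_t) = 4. With uint16_t it would be 2,
   with uint8_t it would be 1. */
static uint32_t sum3(const uint32_t *p)
{
    uint32_t s = 0;
    s += *p++;
    s += *p++;
    s += *p++;
    return s;
}
```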
Now, since implementing them costs time and effort, I wonder if they are worth it.
Purpose of this: give a good support for a C-like compiler (still to be completed).
Note: this does depend on using modern compilers. If you're going to write your own compiler then it may be different as it's a *lot* of work. I'd say it would be crazy. You're much better off writing a back-end for gcc or llvm.
Your question betrays a 50's and 60's mindset about computer design. Patterson and Hennessy proved that wrong with the MIPS R2K and the SPARC 1 thirty years ago.
At a minimum you need to read these two books:
"Computer Architecture: A Quantitative Approach" by Patterson and Hennesy
It would be nice if you defined acronyms the first time you use them. I have *no* idea what you mean by MAC and I read the first 3 editions of P & H cover to cover.
It seems clear in context that he means "Multiply-ACcumulate" aka FMA.
Your question betrays a 50's and 60's mindset about computer design. Patterson and Hennessy proved that wrong with the MIPS R2K and the SPARC 1 thirty years ago.
At a minimum you need to read these two books:
"Computer Architecture: A Quantitative Approach" by Patterson and Hennesy
"Advanced Compiler Design & Implementation" by Muchnick
I *strongly* recommend reading P & H starting with the first edition and continuing until the device complexity level exceeds what you want to do.
Also absolutely essential is:
"Cache and Memory Hierarchy Design" by Prrzybylski
or a recent equivalent.
In 1998, a DEC Alpha 164LX at 533 MHz had several times the floating point performance of the fastest Intel and AMD parts. AMD got their big jump with the Opteron when HP killed the Alpha and AMD managed to hire a bunch of the team. Register coloring and speculative execution made a huge difference in performance. The Alpha was as big an advance in design as the R2K and SPARC.
Addressing modes really isn't a relevant question today and hasn't been for over 30 years unless you are writing for MCUs in assembler. The name of the game is memory hierarchy and data access patterns and making transformations in the compiler to minimize cache misses and pipeline stalls.
As it happens I have the processor manuals for both the R2K and the 88K.
You can have large memory or you can have fast memory, but you cannot have large, fast memory
Address Register Indirect with Displacement Mode: EA = reg + sign_ext(const)
[quote]Note: this [code that executes well without complex addressing modes] does depend on using modern compilers. If you're going to write your own compiler then it may be different as it's a *lot* of work[/quote]I thought it was sort-of the other way around - writing a compiler is easier if there isn't any choice of addressing modes, because it becomes a relatively straightforward register allocation/scheduling problem instead of a complex choice of special-case instructions with different performance characteristics? Isn't that part of the whole RISC philosophy? That does depend on caches making the instruction fetches fast...
Thanks. I'm very much a polymath, so I really struggle with acronyms.
The fallacy in the subject line tends to make me expect anything.
I should very much like to know what the objective is. A demonstration project, or a DSP engine for a Zynq or other FPGA? I have a long history with DSP going back to the 11/780 and FPS-120B. I spent quite a lot of time circa 1989-90 explaining the implications of the RISC model to my colleagues at the ARCO Plano Research Center. And doing occasional cute tricks like passing data between functions using foo(void), bar(void), etc.
Not only would I cite section 8.3.5 of the F77 standard from memory, I'd top it off with a short description of the Burroughs 5000 and why 8.3.5 was included in the standard.
I learned computer internals on the 6502, so I may be prejudiced. But I really like the addressing modes of the 6502 and that was what excited Woz about it. It was a huge improvement over the 8080/Z80 model.
- Implied addressing mode: there are no bytes or bits in the instruction op-codes to explicitly specify the operands, so they are implicit, as in the CPL and LD SP, IY instructions. In general, instructions of this type have one or two op-code bytes.
- Immediate addressing mode: the instruction needs an explicit operand which is specified in the instruction itself (second or third byte), and it is always one byte long. Examples of this mode are the instructions ADD A, N, XOR N and LD IXL, N, where N is a one-byte operand.
- Extended immediate addressing mode: it is similar to the immediate mode but takes a two-byte operand. Instructions LD HL, NN and LD IX, NN have this kind of addressing.
- Register addressing: in the op-code, there is a three-bit field to indicate one 8-bit register as the operand. RL R and AND R are examples of this addressing mode, where R is a general purpose register (B, C, D, E, H, L or A).
- Register indirect addressing: in this case, the operand itself is not in a register. Instead, a register pair contains the operand’s memory address. That’s why it is called indirect addressing. LD A, (BC) and INC (HL) are examples of this mode. Note the use of parentheses around registers to denote indirect addressing.
- Extended addressing (or direct addressing): the instruction has a two-byte field containing the operand’s memory address. Instructions like LD A, (NN) and LD (NN), HL illustrate the direct addressing mode, where NN is a 16-bit value representing a memory address.
- Modified page zero addressing: this mode is exclusive to a single instruction, RST P, which causes the CPU to jump execution to a specified location P, inside the page zero memory area. Page zero is a memory area comprising the first 256 bytes of memory (usually ROM). P can assume only the following values: 0x0, 0x8, 0x10, 0x18, 0x20, 0x28, 0x30 and 0x38. RST P is normally used to call special ROM subroutines. (For the ZX Spectrum, see which ROM routines are available here.)
- Relative addressing: it is used in relative jump instructions (conditional and unconditional), in which the memory address to go to is calculated by adding an operand displacement to the PC register. The displacement value ranges from -128 to +127 (signed binary representation), allowing it to address a location in memory from -126 to +129 relative to the jump instruction address (this is due to PC being incremented by two during instruction execution). As an example, instruction JR Z, D, where D is the displacement, uses this addressing mode. Relative addressing makes jump instructions one byte shorter than absolute jump instructions (JP), where a 16-bit address operand is provided.
- Indexed addressing: this mode allows one to use indirect addressing via the index registers IX and IY. Like the relative addressing mode, the instruction has an 8-bit displacement operand, which is added to the index register being used. AND (IX + D) and LD (IY + D), N are examples of the indexed mode, where D is an 8-bit signed displacement and N is an immediate operand.
Addressing Modes
The 6502 processor provides several ways in which memory locations can be addressed. Some instructions support several different modes while others may only support one. In addition, the two index registers cannot always be used interchangeably. This lack of orthogonality in the instruction set is one of the features that makes the 6502 trickier to program well.
Implicit
For many 6502 instructions the source and destination of the information to be manipulated is implied directly by the function of the instruction itself and no further operand needs to be specified. Operations like 'Clear Carry Flag' (CLC) and 'Return from Subroutine' (RTS) are implicit.
Accumulator
Some instructions have an option to operate directly upon the accumulator. The programmer specifies this by using a special operand value, 'A'. For example:
LSR A ;Logical shift right one bit
ROR A ;Rotate right one bit
Immediate
Immediate addressing allows the programmer to directly specify an 8 bit constant within the instruction. It is indicated by a '#' symbol followed by a numeric expression. For example:
LDA #10 ;Load 10 ($0A) into the accumulator
LDX #LO LABEL ;Load the LSB of a 16 bit address into X
LDY #HI LABEL ;Load the MSB of a 16 bit address into Y
Zero Page
An instruction using zero page addressing mode has only an 8 bit address operand. This limits it to addressing only the first 256 bytes of memory (e.g. $0000 to $00FF) where the most significant byte of the address is always zero. In zero page mode only the least significant byte of the address is held in the instruction making it shorter by one byte (important for space saving) and one less memory fetch during execution (important for speed).
An assembler will automatically select zero page addressing mode if the operand evaluates to a zero page address and the instruction supports the mode (not all do).
LDA $00 ;Load accumulator from $00
ASL ANSWER ;Shift labelled location ANSWER left
Zero Page,X
The address to be accessed by an instruction using indexed zero page addressing is calculated by taking the 8 bit zero page address from the instruction and adding the current value of the X register to it. For example if the X register contains $0F and the instruction LDA $80,X is executed then the accumulator will be loaded from $008F (e.g. $80 + $0F => $8F).
NB:
The address calculation wraps around if the sum of the base address and the register exceed $FF. If we repeat the last example but with $FF in the X register then the accumulator will be loaded from $007F (e.g. $80 + $FF => $7F) and not $017F.
STY $10,X ;Save the Y register at location on zero page
AND TEMP,X ;Logical AND accumulator with a zero page value
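The wrap-around rule can be modelled in a line of C (zp_x_ea is a name invented here): the base/index sum is simply truncated to 8 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Zero Page,X effective address: the sum wraps modulo 256,
   so it can never leave page zero. */
static uint16_t zp_x_ea(uint8_t base, uint8_t x)
{
    return (uint8_t)(base + x);
}
```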
Zero Page,Y
The address to be accessed by an instruction using indexed zero page addressing is calculated by taking the 8 bit zero page address from the instruction and adding the current value of the Y register to it. This mode can only be used with the LDX and STX instructions.
LDX $10,Y ;Load the X register from a location on zero page
STX TEMP,Y ;Store the X register in a location on zero page
Relative
Relative addressing mode is used by branch instructions (e.g. BEQ, BNE, etc.) which contain a signed 8 bit relative offset (e.g. -128 to +127) which is added to the program counter if the condition is true. As the program counter itself is incremented during instruction execution by two, the effective address range for the target instruction must be within -126 to +129 bytes of the branch.
BEQ LABEL ;Branch if zero flag set to LABEL
BNE *+4 ;Skip over the following 2 byte instruction
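The target arithmetic can be sketched in C (branch_target is invented here): the signed offset is added after the PC has stepped past the 2-byte branch, which is where the -126 to +129 range comes from.

```c
#include <assert.h>
#include <stdint.h>

/* Branch target: PC has already advanced past the 2-byte branch
   when the signed 8-bit offset is added. */
static uint16_t branch_target(uint16_t branch_addr, int8_t offset)
{
    return (uint16_t)(branch_addr + 2 + offset);
}
```

An offset of -2 branches to the instruction itself (a tight spin loop).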
Absolute
Instructions using absolute addressing contain a full 16 bit address to identify the target location.
JMP $1234 ;Jump to location $1234
JSR WIBBLE ;Call subroutine WIBBLE
Absolute,X
The address to be accessed by an instruction using X register indexed absolute addressing is computed by taking the 16 bit address from the instruction and adding the contents of the X register. For example if X contains $92 then an STA $2000,X instruction will store the accumulator at $2092 (e.g. $2000 + $92).
STA $3000,X ;Store accumulator between $3000 and $30FF
ROR CRC,X ;Rotate right one bit
Absolute,Y
The Y register indexed absolute addressing mode is the same as the previous mode only with the contents of the Y register added to the 16 bit address from the instruction.
AND $4000,Y ;Perform a logical AND with a byte of memory
STA MEM,Y ;Store accumulator in memory
Indirect
JMP is the only 6502 instruction to support indirection. The instruction contains a 16 bit address which identifies the location of the least significant byte of another 16 bit memory address which is the real target of the instruction.
For example if location $0120 contains $FC and location $0121 contains $BA then the instruction JMP ($0120) will cause the next instruction execution to occur at $BAFC (e.g. the contents of $0120 and $0121).
JMP ($FFFC) ;Force a power on reset
JMP (TARGET) ;Jump via a labelled memory area
Indexed Indirect
Indexed indirect addressing is normally used in conjunction with a table of addresses held on zero page. The address of the table is taken from the instruction and the X register added to it (with zero page wrap around) to give the location of the least significant byte of the target address.
LDA ($40,X) ;Load a byte indirectly from memory
STA (MEM,X) ;Store accumulator indirectly into memory
Indirect Indexed
Indirect indexed addressing is the most common indirection mode used on the 6502. The instruction contains the zero page location of the least significant byte of a 16 bit address. The Y register is dynamically added to this value to generate the actual target address for the operation.
LDA ($40),Y ;Load a byte indirectly from memory
STA (DST),Y ;Store accumulator indirectly into memory
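The effective-address calculation can be modelled in C (ind_y_ea is a name invented here): fetch a little-endian 16-bit pointer from zero page, then add Y.

```c
#include <assert.h>
#include <stdint.h>

/* (zp),Y: read a 16-bit little-endian base address from zero page,
   then add the Y register to form the target address. */
static uint16_t ind_y_ea(const uint8_t mem[256], uint8_t zp, uint8_t y)
{
    uint16_t base = mem[zp] | ((uint16_t)mem[(uint8_t)(zp + 1)] << 8);
    return (uint16_t)(base + y);
}
```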
I had an NCD 1024x1024 greyscale X terminal
I'm accustomed to "multiply-add" or "multiply-accumulate" and was not expecting op code references in a discussion of addressing modes.
long history with DSP and not knowing the acronym MAC does not compute
QuoteI'm accustomed to "multiply-add" or "multiply-accumulate" and was not expecting op code references in a discussion of addressing modes.
Yup, it's a weird EA offered by weird beasts like the 88K.
The 'multiply' is really just a shift
Anywhere that you care about speed, you'll be iterating over elements with a fixed stride.
Fresh news. About the matrix stuff: Arise-v2 needs to add each element of a bounding box containing an image to its local neighbors, weighted by the kernel. This is related to a form of mathematical convolution, not traditional matrix multiplication.
In the future, the GTE will for sure offer support for this, but the current version of the hardware has no support, therefore operations need to be done in software.
A lot of multiplications, kernel-cells by image-cells.
As it happens I have the processor manuals for both the R2K and the 88K.
There are at least 3 editions of P & H since the R2K was mentioned. Those editions came out not because they wanted to sell books, but because the world had changed so drastically that the previous edition was no longer relevant to current products. The discussion in the 1st ed is dominated by the branch delay slot. Then came speculative execution and branch prediction. I have no idea where we are now.
I've got the 4th ed on my shelf, but haven't had an HPC project, so not a lot of point in wading through all that detail. It's probably entirely different now. And certainly will change drastically after the discovery of the speculative execution vulnerabilities.
I don't think I ever saw an 88K in operation unless perhaps the Data General system I looked at at a trade show used it. Which is probably the case as that would have caught my attention and Wikipedia says they used it. I bought the manual when it was introduced as I really like the 68K, but it never got traction. The SPARCstation, HP "Snakes", and IBM PowerPC all appeared at the same time and MIPS was going great guns in the SGI heavy iron.
I should appreciate an explanation of why a convolution is not a matrix multiply
static sint32_t convolution_get_pixel
(
sint32_t kernel[],
uint32_t kernel_n,
sint32_t image[],
uint32_t image_width,
uint32_t x,
uint32_t y
)
{
sint32_t ans;
sint32_t sum;
sint32_t k;   /* signed: the loops below run from -1 to +1 */
sint32_t j;
uint32_t OF0;
uint32_t OF1;
sum = 0;
for (k = -1; k <= 1; k++)
{
for (j = -1; j <= 1; j++)
{
OF0 = get_kernel_cell_offset(j + 1, k + 1, kernel_n * 4);
OF1 = get_image_cell_offset(x - j, y - k, image_width * 4);
sum += kernel[OF0] * image[OF1];
}
}
ans = sum;
return ans;
}
void convolution_image
(
sint32_t kernel[],
uint32_t kernel_n,
sint32_t image_in[],
sint32_t image_out[],
uint32_t image_height,
uint32_t image_width
)
{
sint32_t pixel;
uint32_t x;
uint32_t y;
uint32_t OF0;
for (y = 0; y < image_height; y++)
{
for (x = 0; x < image_width; x++)
{
pixel = convolution_get_pixel(kernel, kernel_n, image_in, image_width, x, y);
OF0 = get_image_cell_offset(x,y,image_width * 4);
image_out[OF0] = pixel;
}
}
}
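The two offset helpers are not shown in the post, so this is only a guess at their shape: plausible row-major bodies, treating the offsets as element indices (the `* 4` arguments in the calls above suggest byte strides instead, so take this purely as an illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bodies for the helpers used above (originals not shown):
   plain row-major cell offsets, offset = row * row_stride + column. */
static uint32_t get_kernel_cell_offset(uint32_t col, uint32_t row,
                                       uint32_t row_stride)
{
    return row * row_stride + col;
}

static uint32_t get_image_cell_offset(uint32_t x, uint32_t y,
                                      uint32_t row_stride)
{
    return y * row_stride + x;
}
```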
OF0 = get_kernel_cell_offset( reg0, reg1, const);
OF1 = get_image_cell_offset( reg0, reg1, const);
These two lines can be translated into "SLAC" instructions. Neat, and very fast, since SLAC is basically a SHIFT which takes 1 clock edge to compute.
sum += kernel[OF0] * image[OF1];
This can be translated into a "MAC".
QuoteI learned computer internals on the 6502, so I may be prejudiced. But I really like the addressing modes of the 6502 and that was what excited Woz about it. It was a huge improvement over the 8080/Z80 model.
But given your performance targets I'm a bit baffled about the purpose of the project.
QuoteNote: this [code that executes well without complex addressing modes] does depend on using modern compilers. If you're going to write your own compiler then it may be different as it's a *lot* of work
I thought it was sort-of the other way around - writing a compiler that is easier if there isn't any choice of addressing modes, because it becomes a relatively straightforward register allocation/scheduling problem instead of a complex choice of special-case instructions with different performance characteristics ? Isn't that part of the whole RISC philosophy? That does depend on caches making the instruction fetches fast...
But once you have a proper optimiser anyway, you don't need all those addressing modes any more.
int* b;
unsigned char* c;
for (k = 0; k < n; k++) {
b[k] = c[k];
}
movzx eax,byte ptr [rsi+rax]
mov [rdi+rax*4],eax
mov rbx,rax
add rbx,rdi
mov rcx,rax
shl rcx,2
add rcx,rsi
movzx eax,byte ptr [rbx]
mov [rcx],eax
movzx eax,byte ptr [rbx]
inc rbx
mov [rcx],eax
add rcx,4
for (k = 0; k < n; k++) {
if (must_process(k)) b[k] = c[k];
}
Just a small change to the source code:
for (k = 0; k < n; k++) {
if (must_process(k)) b[k] = c[k];
}
What the "proper optimizer" can do now. If must_process() returns mostly "true", the above optimization trick with increments will be benefical. However, if must_process() returns mostly "false", it's better to stick with naïve unoptimized implementation. But the "proper optimizer" has no clue, so it cannot even make a proper choice between the two, and therefore is likely to produce a code which will perform worse than the naïve variant. Doesn't look very impressive, does it?
Meanwhile, hardware indexing works the same as before.
20: 89 df mov %ebx,%edi
22: e8 00 00 00 00 callq 27 <foo+0x27>
27: 85 c0 test %eax,%eax
29: 74 0a je 35 <foo+0x35>
2b: 41 0f b6 44 1d 00 movzbl 0x0(%r13,%rbx,1),%eax
31: 41 89 04 9c mov %eax,(%r12,%rbx,4)
35: 48 83 c3 01 add $0x1,%rbx
39: 48 39 eb cmp %rbp,%rbx
3c: 75 e2 jne 20 <foo+0x20>
101ce: 8522 mv a0,s0
101d0: fe3ff0ef jal ra,101b2 <must_process>
101d4: 2405 addiw s0,s0,1
101d6: c501 beqz a0,101de <foo+0x28>
101d8: 00094783 lbu a5,0(s2)
101dc: c09c sw a5,0(s1)
101de: 0905 addi s2,s2,1
101e0: 0491 addi s1,s1,4
101e2: fe8996e3 bne s3,s0,101ce <foo+0x18>
730: 2a0203e0 mov w0, w2
734: 97fffff5 bl 708 <must_process>
738: 34000060 cbz w0, 744 <foo+0x34>
73c: 38626824 ldrb w4, [x1, x2]
740: b82278a4 str w4, [x5, x2, lsl #2]
744: 91000442 add x2, x2, #0x1
748: eb03005f cmp x2, x3
74c: 54ffff21 b.ne 730 <foo+0x20> // b.any
...
for (y = 0; y < image_height; y++)
{
for (x = 0; x < image_width; x++)
{
pixel = convolution_get_pixel(kernel, kernel_n, image_in, image_width, x, y);
OF0 = get_image_cell_offset(x,y,image_width * 4);
image_out[OF0] = pixel;
}
}
}
stride = 4*image_width;
next_line = stride;
image_end = stride*image_height;
y = 0;
OF0 = 0;
while ( OF0 < image_end )
{
x = 0;
while ( OF0 < next_line )
{
pixel = convolution_get_pixel(kernel, kernel_n, image_in, image_width, x, y);
image_out[OF0] = pixel;
OF0 += 4;
x++;
}
next_line += stride;
y++;
}
}
And the same sort of thing can be done to get the multiplies out of the convolution addresses (you can get rid of x and y then). Compilers have been applying this sort of transformation automatically for decades.
OF0 += 4;
fancy addressing modes ... umm, SLAC is not an addressing-mode instruction, it's a common arithmetic instruction that precedes a load/store instruction. It has the same execution time as an addition (1 clock edge), but it makes the compiler's job easier, since what you see in the C code is directly translated into assembly! Besides, it costs zero effort in the ISA's implementation, since it's an already existing SHIFT instruction with a little modification.
I compiled your code for x86_64 with gcc -O2:
20: 89 df mov %ebx,%edi
22: e8 00 00 00 00 callq 27 <foo+0x27>
27: 85 c0 test %eax,%eax
29: 74 0a je 35 <foo+0x35>
2b: 41 0f b6 44 1d 00 movzbl 0x0(%r13,%rbx,1),%eax
31: 41 89 04 9c mov %eax,(%r12,%rbx,4)
35: 48 83 c3 01 add $0x1,%rbx
39: 48 39 eb cmp %rbp,%rbx
3c: 75 e2 jne 20 <foo+0x20>
Cool. Sure enough, it used your fancy addressing modes. It has nine instructions and thirty bytes of code in the loop.
00: 89 df mov edi,ebx
02: e8 00 00 00 00 call must_process
07: 85 c0 test eax,eax
09: 74 09 jz 0x14
0b: 41 0f b6 04 1e movzx eax,[r14+rbx]
10: 41 89 04 9f mov [r15+rbx*4],eax
14: ff c0 inc ebx
16: 74 e8 jz 0x00
00: 89 d9 mov ecx,ebx
02: e8 00 00 00 00 call must_process
07: 85 c0 test eax,eax
09: 74 07 jz 0x12
0b: 0f b6 04 1e movzx eax,[rsi+rbx]
0f: 89 04 9f mov [rdi+rbx*4],eax
12: ff c0 inc ebx
14: 74 ea jz 0x00
101ce: 8522 mv a0,s0
101d0: fe3ff0ef jal ra,101b2 <must_process>
101d4: 2405 addiw s0,s0,1
101d6: c501 beqz a0,101de <foo+0x28>
101d8: 00094783 lbu a5,0(s2)
101dc: c09c sw a5,0(s1)
101de: 0905 addi s2,s2,1
101e0: 0491 addi s1,s1,4
101e2: fe8996e3 bne s3,s0,101ce <foo+0x18>
101de: 0905 addi s2,s2,1
101e0: 0491 addi s1,s1,4
I should appreciate an explanation of why a convolution is not a matrix multiply
Different order of the cells' addresses during multiplication, which makes the "SLAC" instruction useful for the EA of each cell (kernel and image, convolved).
Though if you want input from the reflection seismology side of DSP send me a PM.
I compiled your code for x86_64 with gcc -O2:
20: 89 df mov %ebx,%edi
22: e8 00 00 00 00 callq 27 <foo+0x27>
27: 85 c0 test %eax,%eax
29: 74 0a je 35 <foo+0x35>
2b: 41 0f b6 44 1d 00 movzbl 0x0(%r13,%rbx,1),%eax
31: 41 89 04 9c mov %eax,(%r12,%rbx,4)
35: 48 83 c3 01 add $0x1,%rbx
39: 48 39 eb cmp %rbp,%rbx
3c: 75 e2 jne 20 <foo+0x20>
Cool. Sure enough, it used your fancy addressing modes. It has nine instructions and thirty bytes of code in the loop.
Not very cool. That's just a consequence of how GCC does theoretical optimizations rather than catering to real-world hardware. I'm sure any half-decent compiler, pointed in the right direction, could come up with something better. As a compiler writer, you would probably agree that the optimizations applied here are not very hard to implement in the compiler:
Though I don't understand why the number of bytes is so important - everyone has tons of RAM. Speed is king.
And look what your "proper optimizer" has done for the "best in the world" RISC-V:
Now, if you run this loop 1,000,000 times while must_process() returns true only 2 or 3 times, these two lines:
101de: 0905 addi s2,s2,1
101e0: 0491 addi s1,s1,4
are going to execute at every pass - that is about 2 million needless additions. You really call this "proper optimization"?
In the absence of knowing how often must_process() returns true: absolutely.
It's worth converting to a shift and add inside the condition if must_process() returns true less than a third of the time. If it's more than that then you're doing extra work by shifting and adding every time.
Complex addressing modes are not free.
Aside from consuming a small number of code points, hardware indexing is free. You simply extend your address generation unit so that instead of one addition (base + offset) it does two (base + shifted index + offset). If this fits between clocks, it doesn't take any extra time.
It's absolutely not free. It might under some circumstances be cheap enough to be worth it, but it's not free. It's extra circuitry to power, even when instructions aren't using it.
It's not even a question of "does it fit between clocks?". The question is "Is it the thing that is preventing you from making the clock faster?"
It's absolutely not free. It might under some circumstances be cheap enough to be worth it, but it's not free. It's extra circuitry to power, even when instructions aren't using it.
A digital circuit made with MOS transistors only consumes energy on switching on-off/off-on; the energy consumption is linear in the clock frequency and quadratic in the applied voltage, while the leakage current when a transistor keeps its internal state is negligible.
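That is the standard dynamic-power relation, roughly P = a * C * V^2 * f (activity factor a, switched capacitance C, supply voltage V, clock frequency f). A toy calculation (dynamic_power and all numbers here are illustrative only):

```c
#include <assert.h>

/* Dynamic CMOS power: linear in frequency, quadratic in voltage. */
static double dynamic_power(double a, double c, double v, double f)
{
    return a * c * v * v * f;
}
```

Halving f halves P, while halving V quarters it, which is why voltage scaling saves more power than frequency scaling alone.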
It's not even a question of "does it fit between clocks?". The question is "Is it the thing that is preventing you from making the clock faster?"
The EA addressing mode we are talking about can be calculated by the ALU, which is a mandatory stage of the machine, therefore since the circuits are already there (arranged differently, but they are all there ... shift, add, etc.), it literally adds zero cost!
on MIPS-R2K: Fetch -> decode & registers load from RF -> ALU & EA evaluation -> load/store -> WriteBack.
At 28nm, the power consumption of the FU540 (quad core 1.5 GHz 64 bit processor) is something like 90% leakage current, with the power consumption difference between idle (or sleeping) and running flat out being very small. Stopping the clock doesn't save power now. Only actually turning off the power supply saves power.
At 28nm, the power consumption of the FU540 (quad core 1.5 GHz 64 bit processor) is something like 90% leakage current, with the power consumption difference between idle (or sleeping) and running flat out being very small. Stopping the clock doesn't save power now. Only actually turning off the power supply saves power.
Doesn't sound very compelling. Xilinx 7-series FPGAs are 28nm. The quiescent current is rather low (e.g. 200 mA with an Artix-7 XC7A100T), but this goes up very quickly when you start massive switching at 600 MHz (to 10 A or more with the same Artix-7 XC7A100T).
A standard ALU (as in the MIPS) takes two inputs and applies one operation such as AND/OR/XOR, ADD, SHIFT etc to them, producing one output.
Your addressing mode requires two adds and a shift. That means either the standard ALU must be used three times in a row (extra time), or else be extended with an extra shifter and extra adder which are only used by a fraction of instructions (loads and stores using this addressing mode). The latter uses not only extra transistors (extra area, cost, and power) but also extra time as the result of one of the new function units feeds into the next one. You also now need four inputs instead of two. The register file now also needs three read ports instead of two for stores, which is expensive.
Usually, this doesn't take all that much BlockRAM because there aren't a whole lot of registers - 16?
At 28 nm, the power consumption of the FU540 (quad-core 1.5 GHz 64-bit processor) is something like 90% leakage current, and the power consumption difference between idle (or sleeping) and running flat out is very small. Stopping the clock doesn't save power now. Only actually turning off the power supply saves power.
Doesn't sound very compelling. Xilinx 7-series FPGAs are 28 nm. The quiescent current is rather low (e.g. 200 mA with an Artix-7 XC7A100T), but it goes up very quickly once you start massive switching at 600 MHz (to 10 A or more with the same Artix-7 XC7A100T).
That's because Xilinx uses the lower frequency and lower leakage 28 HPL process. Not all 28nm is the same, even from the same foundry (TSMC).
Low frequency silicon, huh? Very smooth ...
How about Intel processors? A guy measured power consumption of 22/32 nm technology here:
http://blog.stuffedcow.net/2012/10/intel32nm-22nm-core-i5-comparison/
Look at pictures 2a/2b and extend the orange "Total power" line until it crosses 0 frequency. It looks like around 10 W (perhaps 15 W), and that includes a whole lot more than just leakage. As the frequency increases, the power goes over 100 W. It doesn't look like leakage dominates. Do they use "low frequency silicon" too?
No, Intel spends a lot of money on designing and implementing aggressive power-gating (not just clock-gating) of idle circuits, as mentioned in the last sentence of my original message quoted above.
Maybe you can also explain what power gating has to do with this?
The linear dependence between frequency and power consumption is direct proof that the power consumption depends on switching, power gating or not. Only a small portion (around 10%) of the power consumption is static and can be attributed to leakage.
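The argument reduces to the usual first-order power model. A sketch with made-up coefficients (loosely shaped like the numbers quoted from the linked measurements, not real chip data):

```python
def total_power_w(freq_ghz, p_static_w=10.0, dyn_w_per_ghz=30.0):
    """First-order model at fixed voltage: P = P_static + k*f.
    The y-intercept at f = 0 is the static part (leakage plus
    always-on circuits); the slope is the switching (dynamic) part."""
    return p_static_w + dyn_w_per_ghz * freq_ghz
```

With these illustrative coefficients, extrapolating to f = 0 leaves 10 W of static power, while at 3 GHz the total is 100 W, i.e. switching dominates.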
Don't just take my word for it. This is something known by absolutely anyone who is making chips at 65nm, 40nm, 28nm and below.
EE Times article saying leakage dominates at 65 nm and below: https://www.eetimes.com/document.asp?doc_id=1264175
Broadcom presentation saying that leakage current dominates and that turning off everything that does not need to be on is critical: http://www.islped.org/2014/files/islped2014_leakage_mitigation_JohnRedmond_Broadcom.pdf
IEEE article explaining why leakage starts to dominate more and more once you go below 100 nm: http://www.ruf.rice.edu/~mobile/elec518/readings/DevicesAndCircuits/kim03leakage.pdf
Well, we're just going to have to agree to disagree.
We're designing and building SoCs here, according to our instruction set and hardware design philosophy. The proof is in how they work, and that speaks for itself. So far, every chip taped out has come back working on the first try, and at the predicted performance and power consumption. The only opinions that matter are the paying customers'.
In the end, SLAC got implemented, tested, and added to the ISA.
But not as an "EA" addressing mode; rather as a regular instruction that has to precede a load/store.
Reasons for this? Elisabeth doesn't like to add a new timing constraint to the list.
It seems a good compromise.
Currently, we are testing Arise-v2 on the HDL simulator (and sometimes on the real hardware) by writing short assembly programs. Basically, they are loops operating on predefined values, so we can check that things go as expected, step by step. The HL-compiler is already able to identify and parse mathematical expressions(1) and transform them into RPN. However, it still needs to recognize a matrix element so it can use SLAC for the EA.
a*b + c*d
+(*(a,b),*(c,d))
temp1 := *(a,b) // this will be replaced by a single instruction
+(temp1,*(c,d))
temp1 := *(a,b) // this will be replaced by a single instruction
temp2 := *(c,d) // this will be replaced by a single instruction
temp3 := +(temp1,temp2) // this will be replaced by a single instruction
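The lowering shown above can be sketched as a small recursive tree walk (a hypothetical helper, not the actual HL-compiler code):

```python
def lower(expr, code):
    """Lower a prefix-form expression tree into three-address
    statements, allocating one temporary per operation."""
    if isinstance(expr, str):              # leaf: a plain operand name
        return expr
    op, lhs, rhs = expr
    left = lower(lhs, code)
    right = lower(rhs, code)
    temp = f"temp{len(code) + 1}"          # temp1, temp2, ...
    code.append(f"{temp} := {op}({left},{right})")
    return temp

code = []
lower(('+', ('*', 'a', 'b'), ('*', 'c', 'd')), code)
# code now holds the three statements above, one machine instruction each
```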
subscript(array,index) // expanded from array[index]
dot(structure,member) // expanded from structure.member
dot(subscript(array,index),member) // expanded from array[index].member
EA = base + k0 + index1*k1 + index2*k2 + ...
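A sketch of how a subscript/dot chain folds into that linear form (all offsets and strides below are made up for illustration):

```python
def effective_address(base, k0, terms):
    """EA = base + k0 + sum(index_n * k_n): k0 collects the constant
    member offsets, and each k_n is the stride of one subscript level."""
    ea = base + k0
    for index, stride in terms:
        ea += index * stride
    return ea

# m[row][col].field for a matrix of 16-byte structs, 8 columns per row,
# field at offset 4 within the struct:
addr = effective_address(0x2000, 4, [(3, 8 * 16), (5, 16)])
```

Because the whole chain collapses into one base, one constant, and one stride per index, a single fused shift/multiply-add instruction (or SLAC) covers it.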
Surprisingly enough, it highly depends on the number of registers available.
Address Register Direct with Displacement Mode EA = mem(reg + sign_ext(const))
const = imm16bit
I have a different background than many of you. I "cut my teeth" writing embedded code in assembler on several different "oddball" microprocessors. I then moved on to programming, including writing assembler, on a VAX, which I think most people would agree was the ultimate complex-instruction-and-addressing-mode processor ever built. The last big project I worked on was designing an application binary interface (ABI) to be used by GNU C for a 32-bit RISC architecture, so I have sort of been around the block on this.
Surprisingly enough, it highly depends on the number of registers available.
EXTREMELY TRUE! Looking back, it amazes me that VMS only allowed compilers/programmers to use 12 of 16 registers. The 4 "dedicated" registers are the program counter, the stack pointer, the frame pointer, and the argument pointer.
Now it seems like everything has a minimum of 32 registers.
Reserving Reg0 to always be ZERO is very useful in embedded programming. (IIRC we designed the ABI to use a special data section called .zdata that was within the range of the address offset added to R0.)
Address Register Direct with Displacement Mode EA = mem(reg + sign_ext(const))
const = imm16bit
Probably the most important addressing mode. The big debate is just how many bits should be allowed for that constant (offset).
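A sketch of that mode with the sign extension made explicit (the widths are assumptions, matching the imm16 above):

```python
def sign_ext16(imm):
    """Sign-extend a 16-bit immediate to a full-width signed integer."""
    return imm - 0x10000 if imm & 0x8000 else imm

def ea_displacement(reg, imm16):
    """EA = reg + sign_ext(const): one mode covers both stack-frame
    slots (SP + x) and struct member access (base + offset), with
    negative offsets reaching below the base register."""
    return reg + sign_ext16(imm16)
```

For example, ea_displacement(0x8000, 0x0008) gives 0x8008, while 0xFFF8 encodes -8 and gives 0x7FF8.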
An immediate addressing mode that allows a 32-bit constant to be loaded into a register is also valuable, but few architectures support it (it would likely require variable-length instructions).
Atomic adjustment of the stack pointer is critical.