Fresh news on the matrix front: Arise-v2 needs to add each element of a bounding box containing an image to its local neighbors, weighted by the kernel. This is a form of mathematical convolution, not traditional matrix multiplication.
In the future the GTE will surely offer support for this, but the current version of the hardware has none, so the operations need to be done in software.
A lot of multiplications: kernel cells by image cells.
As it happens I have the processor manuals for both the R2K and the 88K.
There have been at least 3 editions of P & H since the R2K was covered. Those editions came out not because they wanted to sell books, but because the world had changed so drastically that the previous edition was no longer relevant to current products. The discussion in the 1st ed is dominated by the branch delay slot. Then came speculative execution and branch prediction. I have no idea where we are now.
I've got the 4th ed on my shelf, but haven't had an HPC project, so not a lot of point in wading through all that detail. It's probably entirely different now. And it will certainly change drastically after the discovery of the speculative execution vulnerabilities.
I don't think I ever saw an 88K in operation unless perhaps the Data General system I looked at at a trade show used it. Which is probably the case as that would have caught my attention and Wikipedia says they used it. I bought the manual when it was introduced as I really like the 68K, but it never got traction. The SPARCstation, HP "Snakes", and IBM PowerPC all appeared at the same time and MIPS was going great guns in the SGI heavy iron.
I should appreciate an explanation of why a convolution is not a matrix multiply
private sint32_t convolution_get_pixel
(
    sint32_t kernel[],
    uint32_t kernel_n,
    sint32_t image[],
    uint32_t image_width,
    uint32_t x,
    uint32_t y
)
{
    sint32_t ans;
    sint32_t sum;
    sint32_t k;   /* must be signed: the counters run -1..1 */
    sint32_t j;
    uint32_t OF0;
    uint32_t OF1;
    sum = 0;
    for (k = -1; k <= 1; k++)
    {
        for (j = -1; j <= 1; j++)
        {
            OF0 = get_kernel_cell_offset(j + 1, k + 1, kernel_n * 4);
            OF1 = get_image_cell_offset(x - j, y - k, image_width * 4);
            sum += kernel[OF0] * image[OF1];
        }
    }
    ans = sum;
    return ans;
}
public void convolution_image
(
    sint32_t kernel[],
    uint32_t kernel_n,
    sint32_t image_in[],
    sint32_t image_out[],
    uint32_t image_height,
    uint32_t image_width
)
{
    sint32_t pixel;
    uint32_t x;
    uint32_t y;
    uint32_t OF0;
    for (y = 0; y < image_height; y++)
    {
        for (x = 0; x < image_width; x++)
        {
            pixel = convolution_get_pixel(kernel, kernel_n, image_in, image_width, x, y);
            OF0 = get_image_cell_offset(x, y, image_width * 4);
            image_out[OF0] = pixel;
        }
    }
}
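Since the offset helpers never appear in the thread, here is a self-contained C sketch of the pair above. The helper implementations are assumptions: they return element offsets (row-major index) rather than the byte offsets the SLAC discussion uses, so that plain C indexing works, and the image loop is restricted to interior pixels so x - j and y - k stay in range.

```c
#include <stdint.h>

/* ASSUMED helpers: the thread never shows them.  They return element
   offsets (row-major), not the byte offsets used later for SLAC. */
static uint32_t get_kernel_cell_offset(int32_t col, int32_t row, uint32_t n)
{
    return (uint32_t)(row * (int32_t)n + col);
}

static uint32_t get_image_cell_offset(int32_t x, int32_t y, uint32_t width)
{
    return (uint32_t)(y * (int32_t)width + x);
}

static int32_t convolution_get_pixel(const int32_t kernel[], uint32_t kernel_n,
                                     const int32_t image[], uint32_t image_width,
                                     uint32_t x, uint32_t y)
{
    int32_t sum = 0;
    for (int32_t k = -1; k <= 1; k++)       /* signed: the counters run -1..1 */
        for (int32_t j = -1; j <= 1; j++)
            sum += kernel[get_kernel_cell_offset(j + 1, k + 1, kernel_n)]
                 * image[get_image_cell_offset((int32_t)x - j, (int32_t)y - k,
                                               image_width)];
    return sum;
}

static void convolution_image(const int32_t kernel[], uint32_t kernel_n,
                              const int32_t image_in[], int32_t image_out[],
                              uint32_t image_height, uint32_t image_width)
{
    /* Interior pixels only, so the 3x3 window never leaves the image. */
    for (uint32_t y = 1; y + 1 < image_height; y++)
        for (uint32_t x = 1; x + 1 < image_width; x++)
            image_out[get_image_cell_offset((int32_t)x, (int32_t)y, image_width)] =
                convolution_get_pixel(kernel, kernel_n, image_in,
                                      image_width, x, y);
}

/* Identity-kernel check: interior pixels should pass through unchanged. */
static int demo(void)
{
    int32_t kernel[9] = {0, 0, 0,  0, 1, 0,  0, 0, 0};
    int32_t in[16], out[16] = {0};
    for (int32_t i = 0; i < 16; i++) in[i] = i;
    convolution_image(kernel, 3, in, out, 4, 4);
    return out[5] == 5 && out[6] == 6 && out[9] == 9 && out[10] == 10;
}
```

With a 3x3 identity kernel, demo() returns 1: each interior pixel of the 4x4 test image comes through unchanged.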
OF0 = get_kernel_cell_offset( reg0, reg1, const);
OF1 = get_image_cell_offset( reg0, reg1, const);
These two lines can be translated into "SLAC" instructions. Neat, and very fast, since SLAC is basically a SHIFT, which takes 1 clock edge to compute.
sum += kernel[OF0] * image[OF1];
This can be translated into a "MAC".
I learned computer internals on the 6502, so I may be prejudiced. But I really like the addressing modes of the 6502, and that was what excited Woz about it. It was a huge improvement over the 8080/Z80 model.
But given your performance targets I'm a bit baffled about the purpose of the project.
Quote
Note: this [code that executes well without complex addressing modes] does depend on using modern compilers. If you're going to write your own compiler then it may be different, as it's a *lot* of work.
I thought it was sort of the other way around - writing a compiler is easier if there isn't any choice of addressing modes, because it becomes a relatively straightforward register allocation/scheduling problem instead of a complex choice of special-case instructions with different performance characteristics. Isn't that part of the whole RISC philosophy? That does depend on caches making the instruction fetches fast...
But once you have a proper optimiser anyway, you don't need all those addressing modes any more.
int* b;
unsigned char* c;
unsigned int n, k;   /* n and k were implied in the original fragment */
for (k = 0; k < n; k++) {
    b[k] = c[k];
}
movzx eax,byte ptr [rsi+rax]
mov [rdi+rax*4],eax
mov rbx,rax
add rbx,rsi
mov rcx,rax
shl rcx,2
add rcx,rdi
movzx eax,byte ptr [rbx]
mov [rcx],eax
movzx eax,byte ptr [rbx]
inc rbx
mov [rcx],eax
add rcx,4
Just a small change to the source code:
Code: [Select]
for (k = 0; k < n; k++) {
    if (must_process(k)) b[k] = c[k];
}
What can the "proper optimizer" do now? If must_process() returns mostly "true", the increment trick above is beneficial. However, if must_process() returns mostly "false", it's better to stick with the naïve unoptimized implementation. But the "proper optimizer" has no clue, so it cannot even make a proper choice between the two, and is therefore likely to produce code that performs worse than the naïve variant. Doesn't look very impressive, does it?
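In C terms, the two code shapes the optimizer has to choose between look roughly like this (a sketch; must_process() here is a stand-in stub, and the real trade-off lives in the generated code, not the source):

```c
#include <stddef.h>

static int must_process(size_t k) { return (k & 1) == 0; }  /* stand-in stub */

/* Shape 1: index at the point of use.  The address arithmetic (scaled
   index, or shift+add) only happens when the branch is taken. */
static void copy_indexed(int *b, const unsigned char *c, size_t n)
{
    for (size_t k = 0; k < n; k++)
        if (must_process(k))
            b[k] = c[k];
}

/* Shape 2: maintain running pointers.  Two increments execute on every
   pass, taken branch or not - the pattern in the RISC-V listing. */
static void copy_pointers(int *b, const unsigned char *c, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        if (must_process(k))
            *b = *c;
        b++;
        c++;
    }
}

/* Both shapes must compute the same result; only the cost differs. */
static int demo(void)
{
    unsigned char src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int a[8] = {0}, b[8] = {0};
    copy_indexed(a, src, 8);
    copy_pointers(b, src, 8);
    for (int i = 0; i < 8; i++)
        if (a[i] != b[i]) return 0;
    return a[0] == 1 && a[1] == 0;   /* even indices copied, odd skipped */
}
```

Which shape wins depends entirely on how often the branch is taken, which is exactly the information the compiler lacks without profile data.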
Meanwhile, hardware indexing works the same as before.
20: 89 df mov %ebx,%edi
22: e8 00 00 00 00 callq 27 <foo+0x27>
27: 85 c0 test %eax,%eax
29: 74 0a je 35 <foo+0x35>
2b: 41 0f b6 44 1d 00 movzbl 0x0(%r13,%rbx,1),%eax
31: 41 89 04 9c mov %eax,(%r12,%rbx,4)
35: 48 83 c3 01 add $0x1,%rbx
39: 48 39 eb cmp %rbp,%rbx
3c: 75 e2 jne 20 <foo+0x20>
101ce: 8522 mv a0,s0
101d0: fe3ff0ef jal ra,101b2 <must_process>
101d4: 2405 addiw s0,s0,1
101d6: c501 beqz a0,101de <foo+0x28>
101d8: 00094783 lbu a5,0(s2)
101dc: c09c sw a5,0(s1)
101de: 0905 addi s2,s2,1
101e0: 0491 addi s1,s1,4
101e2: fe8996e3 bne s3,s0,101ce <foo+0x18>
730: 2a0203e0 mov w0, w2
734: 97fffff5 bl 708 <must_process>
738: 34000060 cbz w0, 744 <foo+0x34>
73c: 38626824 ldrb w4, [x1, x2]
740: b82278a4 str w4, [x5, x2, lsl #2]
744: 91000442 add x2, x2, #0x1
748: eb03005f cmp x2, x3
74c: 54ffff21 b.ne 730 <foo+0x20> // b.any
Code: [Select]
...
for (y = 0; y < image_height; y++)
{
    for (x = 0; x < image_width; x++)
    {
        pixel = convolution_get_pixel(kernel, kernel_n, image_in, image_width, x, y);
        OF0 = get_image_cell_offset(x, y, image_width * 4);
        image_out[OF0] = pixel;
    }
}
}
stride = 4 * image_width;
next_line = stride;
image_end = stride * image_height;
y = 0;
OF0 = 0;
while ( OF0 < image_end )
{
    x = 0;
    while ( OF0 < next_line )
    {
        pixel = convolution_get_pixel(kernel, kernel_n, image_in, image_width, x, y);
        image_out[OF0] = pixel;
        OF0 += 4;
        x++;
    }
    next_line += stride;
    y++;
}
And the same sort of thing can be done to get the multiplies out of the convolution addresses (you can get rid of x and y then). Compilers have been applying this sort of transformation automatically for decades.
Code: [Select]
OF0 += 4;
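Applied to the convolution itself, that transformation might look like the sketch below (my reconstruction, not code from the thread): the kernel cells are walked with one running offset, and the image offset steps by -1 along a row and back up by one line stride between rows, so no multiply survives in the inner loop. Element offsets are used here so plain C indexing works.

```c
#include <stdint.h>

/* One 3x3 output pixel with the index multiplies strength-reduced away:
   kof walks the kernel linearly; iof starts at image[y+1][x+1] and is
   only ever adjusted by +/-1 and by the line stride. */
static int32_t conv_pixel_reduced(const int32_t kernel[], const int32_t image[],
                                  uint32_t image_width, uint32_t x, uint32_t y)
{
    int32_t sum = 0;
    int32_t kof = 0;
    int32_t iof = (int32_t)((y + 1) * image_width + (x + 1));
    for (int32_t k = 0; k < 3; k++) {
        for (int32_t j = 0; j < 3; j++) {
            sum += kernel[kof] * image[iof];
            kof += 1;
            iof -= 1;                       /* x+1, x, x-1 within the row */
        }
        iof += 3 - (int32_t)image_width;    /* back 3 cells, up one line */
    }
    return sum;
}

/* Cross-check against the naive indexing from the original routine. */
static int demo(void)
{
    enum { W = 5, H = 5 };
    int32_t kernel[9] = {1, 2, 1,  2, 4, 2,  1, 2, 1};
    int32_t image[W * H];
    for (int32_t i = 0; i < W * H; i++) image[i] = 3 * i + 1;

    for (int32_t y = 1; y < H - 1; y++)
        for (int32_t x = 1; x < W - 1; x++) {
            int32_t direct = 0;
            for (int32_t k = -1; k <= 1; k++)
                for (int32_t j = -1; j <= 1; j++)
                    direct += kernel[(k + 1) * 3 + (j + 1)]
                            * image[(y - k) * W + (x - j)];
            if (conv_pixel_reduced(kernel, image, W,
                                   (uint32_t)x, (uint32_t)y) != direct)
                return 0;
        }
    return 1;
}
```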
Fancy addressing modes... umm, SLAC is not an addressing-mode instruction. It's a common arithmetic instruction that precedes a load/store instruction, and it has the same execution time as an addition: 1 clock edge. But it makes the compiler's job easier, since what you see in the C code is directly translated into assembly! Besides, it costs zero effort in the ISA's implementation, since it's an already-existing SHIFT instruction with a little modification.
I compiled your code for x86_64 with gcc -O2:
Code: [Select]
20: 89 df mov %ebx,%edi
22: e8 00 00 00 00 callq 27 <foo+0x27>
27: 85 c0 test %eax,%eax
29: 74 0a je 35 <foo+0x35>
2b: 41 0f b6 44 1d 00 movzbl 0x0(%r13,%rbx,1),%eax
31: 41 89 04 9c mov %eax,(%r12,%rbx,4)
35: 48 83 c3 01 add $0x1,%rbx
39: 48 39 eb cmp %rbp,%rbx
3c: 75 e2 jne 20 <foo+0x20>
Cool. Sure enough, it used your fancy addressing modes. It has nine instructions and thirty bytes of code in the loop.
00: 89 df mov edi,ebx
02: e8 00 00 00 00 call must_process
07: 85 c0 test eax,eax
09: 74 09 jz 0x14
0b: 41 0f b6 04 1e movzx eax,byte ptr [r14+rbx]
10: 41 89 04 9f mov [r15+rbx*4],eax
14: ff c0 inc ebx
16: 75 e8 jnz 0x00
00: 89 d9 mov ecx,ebx
02: e8 00 00 00 00 call must_process
07: 85 c0 test eax,eax
09: 74 07 jz 0x12
0b: 0f b6 04 1e movzx eax,byte ptr [rsi+rbx]
0f: 89 04 9f mov [rdi+rbx*4],eax
12: ff c0 inc ebx
14: 75 ea jnz 0x00
101ce: 8522 mv a0,s0
101d0: fe3ff0ef jal ra,101b2 <must_process>
101d4: 2405 addiw s0,s0,1
101d6: c501 beqz a0,101de <foo+0x28>
101d8: 00094783 lbu a5,0(s2)
101dc: c09c sw a5,0(s1)
101de: 0905 addi s2,s2,1
101e0: 0491 addi s1,s1,4
101e2: fe8996e3 bne s3,s0,101ce <foo+0x18>
I should appreciate an explanation of why a convolution is not a matrix multiply
The cells' addresses are traversed in a different order during the multiplication, which makes the "SLAC" instruction useful for computing the EA of each cell (kernel and image, convolved).
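As an aside (my illustration, not from the thread): a convolution *can* be recast as a traditional matrix multiply if you first gather each pixel's neighbourhood into a row - the "im2col" trick used by BLAS-backed implementations - but that gather is exactly the reordered address traversal being discussed, so the addressing cost doesn't disappear, it just moves:

```c
#include <stdint.h>

enum { W = 4, H = 4, ROWS = (W - 2) * (H - 2) };

/* im2col: one row per interior pixel, holding its 3x3 neighbourhood in
   kernel order.  This gather is the reordered traversal; after it, the
   convolution is a plain matrix-vector product. */
static void im2col_3x3(const int32_t *img, int32_t *cols)
{
    int32_t r = 0;
    for (int32_t y = 1; y < H - 1; y++)
        for (int32_t x = 1; x < W - 1; x++)
            for (int32_t k = -1; k <= 1; k++)
                for (int32_t j = -1; j <= 1; j++)
                    cols[r++] = img[(y - k) * W + (x - j)];
}

/* out = cols (ROWS x 9) * kernel (9 x 1): ordinary row-times-column. */
static void matvec(const int32_t *cols, const int32_t *kernel, int32_t *out)
{
    for (int32_t i = 0; i < ROWS; i++) {
        int32_t s = 0;
        for (int32_t c = 0; c < 9; c++)
            s += cols[i * 9 + c] * kernel[c];
        out[i] = s;
    }
}

/* The matrix-multiply route must agree with direct convolution. */
static int demo(void)
{
    int32_t img[W * H], kernel[9] = {0, 1, 0,  1, -4, 1,  0, 1, 0};
    int32_t cols[ROWS * 9], out[ROWS];
    for (int32_t i = 0; i < W * H; i++) img[i] = i * i;

    im2col_3x3(img, cols);
    matvec(cols, kernel, out);

    int32_t r = 0;
    for (int32_t y = 1; y < H - 1; y++)
        for (int32_t x = 1; x < W - 1; x++, r++) {
            int32_t direct = 0;
            for (int32_t k = -1; k <= 1; k++)
                for (int32_t j = -1; j <= 1; j++)
                    direct += kernel[(k + 1) * 3 + (j + 1)]
                            * img[(y - k) * W + (x - j)];
            if (out[r] != direct) return 0;
        }
    return 1;
}
```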
Though if you want input from the reflection seismology side of DSP send me a PM.
I compiled your code for x86_64 with gcc -O2:
Code: [Select]
20: 89 df mov %ebx,%edi
22: e8 00 00 00 00 callq 27 <foo+0x27>
27: 85 c0 test %eax,%eax
29: 74 0a je 35 <foo+0x35>
2b: 41 0f b6 44 1d 00 movzbl 0x0(%r13,%rbx,1),%eax
31: 41 89 04 9c mov %eax,(%r12,%rbx,4)
35: 48 83 c3 01 add $0x1,%rbx
39: 48 39 eb cmp %rbp,%rbx
3c: 75 e2 jne 20 <foo+0x20>
Cool. Sure enough, it used your fancy addressing modes. It has nine instructions and thirty bytes of code in the loop.
Not very cool. That's just a consequence of GCC doing theoretical optimizations rather than catering to real-world hardware. I'm sure any half-decent compiler, pointed in the right direction, could come up with something better. As a compiler writer, you would probably agree that the optimizations applied here are not very hard to implement in the compiler.
Although, I don't understand why the number of bytes is so important - everyone has tons of RAM. Speed is king.
And look what your "proper optimizer" has done for the "best in the world" RISC-V:
Now, if you run this loop 1,000,000 times while must_process() returns true only 2 or 3 times, these two lines:
Code: [Select]
101de: 0905 addi s2,s2,1
101e0: 0491 addi s1,s1,4
are going to execute on every pass - that is about 2 million needless additions. You really call this "proper optimization"?
In the absence of knowing how often must_process() returns true: absolutely.
It's worth converting to a shift and add inside the condition if must_process() returns true less than a third of the time. If it's more than that then you're doing extra work by shifting and adding every time.
Complex addressing modes are not free.
Aside from consuming a small number of code points, hardware indexing is free. You simply extend your address-generation unit so that instead of one addition (base + offset) it does two (base + shifted index + offset). If this fits between clocks, it doesn't take any extra time.
It's absolutely not free. It might under some circumstances be cheap enough to be worth it, but it's not free. It's extra circuitry to power, even when instructions aren't using it.
It's not even a question of "does it fit between clocks?". The question is "Is it the thing that is preventing you from making the clock faster?"
It's absolutely not free. It might under some circumstances be cheap enough to be worth it, but it's not free. It's extra circuitry to power, even when instructions aren't using it.
A digital circuit made with MOS transistors only consumes energy when switching on-to-off or off-to-on; the energy consumption is linear in the clock frequency and quadratic in the applied voltage, while the leakage current when a transistor keeps its internal state is negligible.
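The first-order model behind that claim: dynamic power is P_dyn = a·C·V²·f (activity factor, switched capacitance, supply voltage squared, clock frequency), plus a leakage term P_leak = V·I_leak that doesn't depend on the clock at all. A sketch, with all inputs purely illustrative:

```c
/* First-order CMOS power model: dynamic power scales linearly with clock
   frequency and quadratically with supply voltage; leakage ignores the
   clock entirely.  Any figures fed to these are illustrative only. */
static double dynamic_power(double alpha, double cap, double vdd, double freq)
{
    return alpha * cap * vdd * vdd * freq;
}

static double leakage_power(double vdd, double i_leak)
{
    return vdd * i_leak;
}
```

Halving Vdd cuts dynamic power by 4x while halving f only cuts it by 2x; whether switching or leakage dominates the total is exactly the process question raised in the 28nm discussion.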
It's not even a question of "does it fit between clocks?". The question is "Is it the thing that is preventing you from making the clock faster?"
The EA addressing mode we are talking about can be calculated by the ALU, which is a mandatory stage of the machine. Since the circuits are already there (arranged differently, but all there: shift, add, etc.), it literally adds zero cost!
On the MIPS R2K: Fetch -> Decode & register read from the RF -> ALU & EA evaluation -> Load/Store -> Write-back.
At 28nm, the power consumption of the FU540 (a quad-core 1.5 GHz 64-bit processor) is something like 90% leakage current, and the power consumption difference between idle (or sleeping) and running flat out is very small. Stopping the clock doesn't save power now. Only actually turning off the power supply saves power.
At 28nm, the power consumption of the FU540 (a quad-core 1.5 GHz 64-bit processor) is something like 90% leakage current, and the power consumption difference between idle (or sleeping) and running flat out is very small. Stopping the clock doesn't save power now. Only actually turning off the power supply saves power.
Doesn't sound very compelling. Xilinx 7-series FPGAs are 28nm. The quiescent current is rather low (e.g. 200 mA with an Artix-7 XC7A100T), but it goes up very quickly once you start massive switching at 600 MHz (to 10 A or more with the same Artix-7 XC7A100T).
A standard ALU (as in the MIPS) takes two inputs and applies one operation such as AND/OR/XOR, ADD, SHIFT etc to them, producing one output.
Your addressing mode requires two adds and a shift. That means either the standard ALU must be used three times in a row (extra time), or else be extended with an extra shifter and extra adder which are only used by a fraction of instructions (loads and stores using this addressing mode). The latter uses not only extra transistors (extra area, cost, and power) but also extra time as the result of one of the new function units feeds into the next one. You also now need four inputs instead of two. The register file now also needs three read ports instead of two for stores, which is expensive.
Usually, this doesn't take all that much BlockRAM because there aren't a whole lot of registers - 16?
At 28nm, the power consumption of the FU540 (a quad-core 1.5 GHz 64-bit processor) is something like 90% leakage current, and the power consumption difference between idle (or sleeping) and running flat out is very small. Stopping the clock doesn't save power now. Only actually turning off the power supply saves power.
Doesn't sound very compelling. Xilinx 7-series FPGAs are 28nm. The quiescent current is rather low (e.g. 200 mA with an Artix-7 XC7A100T), but it goes up very quickly once you start massive switching at 600 MHz (to 10 A or more with the same Artix-7 XC7A100T).
That's because Xilinx uses the lower frequency and lower leakage 28 HPL process. Not all 28nm is the same, even from the same foundry (TSMC).