Author Topic: which Effective Addressing Modes are essential for a HL language? (Read 10653 times)

rhb · « **Reply #50 on:** November 12, 2018, 11:52:23 pm »

Quote from: legacy on November 12, 2018, 08:46:16 pm

Fresh news. About matrices stuff, Arise-v2 needs to add each element of a bounding box containing an image to its local neighbors, weighted by the kernel. This is related to a form of mathematical convolution, not traditional matrix multiplication.

In the future, the GTE will for sure offers support for this, but the current version of the hardware has no support, therefore operations need to be done in software.

A lot of multiplications, kernel-cells by image-cells.

I should appreciate an explanation of why a convolution is not a matrix multiply. I shall be most interested to discover how you came up with this notion. This may be simply an issue of jargon. Please start by describing your operator and its origin. But I am not aware of a convolution that is not an isomorph of a matrix multiply whether 1 dimensional or N dimensional. I should greatly appreciate your filling the gap in my education.

brucehoult · « **Reply #51 on:** November 13, 2018, 01:13:04 am »

Quote from: rhb on November 11, 2018, 08:13:51 pm

As it happens I have the processor manuals for both the R2K and the 88K.

So do I :-) And correspond frequently with two of the main designers of the 88K, one of whom also did a lot on the original Athlon64, and both were involved in designing a new GPU I was working on at Samsung from Sep 2016 to Mar 2018. Both were also while at Motorola co-authors on the seminal 1991 paper "Single Instruction Stream Parallelism Is Greater than Two" (in fact they found it's typically over a dozen, with perfect branch prediction, but 2 to 6 with reasonably achievable hardware at the time)

Quote

There are at least 3 editions of P & H since the R2K was mentioned. Those editions came out not because they wanted to sell books, but because the world had changed so drastically that the previous edition was no longer relevant to current products. The discussion in the 1st ed is dominated by the branch delay slot. Then came speculative execution and branch prediction. I have no idea where we are now.

That's not entirely accurate. Yes, branch delay slots were a mistake, as both H and P agree now (which is why RISC-V doesn't have them). They happened to be convenient at one particular point in processor implementation, but complicate all other implementations. And with good branch prediction as we have now the reason is completely gone anyway, even on simple implementations -- we now know how to make fairly small branch predictors with just a couple of percent mispredict rate, while the number of branch delay slots that can't be filled (except by a NOP) is typically in the tens of percents.

The reason to change processors in later editions of H&P is I think to use examples with which students are familiar -- or at least have easy access. Hence the last two editions using Aarch64 and then RISC-V.

Quote

I've got the 4th ed on my shelf, but haven't had an HPC project, so not a lot of point in wading though all that detail. It's probably entirely different now. And certainly will change drastically after the discovery of the speculative execution vulnerabilities.

These vulnerabilities are pretty easy to avoid if only you realise that you need to avoid them!

The very simple principle to follow is to ensure that work done during speculative execution is completely undone in the event of a misprediction.

Of course this was already followed for such obvious things as stores to memory. There are just a few more things now that people are starting to realise must be buffered and completely rolled back as well.

As an example, if a cache line is fetched as a result of speculative execution, it must be held in a special area until the speculation is over, and only then written to the cache itself (and in particular, not result in the eviction of any other cache line until that point).

The same goes for updates to the TLB, branch history buffer, branch target buffer.

I think we will also see a move to non-inclusive caches. In the recent practice, a cache line brought from main memory will be stored at the same time into all of L1, L2, and L3 caches. This complicates rollback. In future I think we will see cache lines brought just to L1 at first. Only when they are evicted from L1 by a newer cache line will they be written to L2.

None of this is very difficult. Nor is it expensive in circuitry. And it does not cost execution speed.

The difficult part is mostly in realising that you *should* do it.

brucehoult · « **Reply #52 on:** November 13, 2018, 01:18:32 am »

Quote from: rhb on November 11, 2018, 10:22:36 pm

I don't think I ever saw an 88K in operation unless perhaps the Data General system I looked at at a trade show used it. Which is probably the case as that would have caught my attention and Wikipedia says they used it. I bought the manual when it was introduced as I really like the 68K, but it never got traction. The SPARCstation, HP "Snakes", and IBM PowerPC all appeared at the same time and MIPS was going great guns in the SGI heavy iron.

The 88K was really great technically. It's only problem was that it was from Motorola and they by that time were very ambivalent about being in the high performance processor market.

The 88K was largely axed in favour of PowerPC (which used some parts of the 88K implementation). And then a few years later Motorola was more interested in supplying automotive and communications markets that only needed low performance CPUs than in supplying the high performance mobile CPUs that Apple wanted (the high performance desktop PowerPCs from IBM were just fine). At that time Intel was just starting down the excellent Centrino/Core path with high performance and low power and so Apple jumped ship.

legacy · « **Reply #53 on:** November 13, 2018, 01:28:30 am »

Quote from: rhb on November 12, 2018, 11:52:23 pm

I should appreciate an explanation of why a convolution is not a matrix multiply

Different order of the cell's address during multiplication, which makes "SLAC" instruction useful for the EA of each cell (Kernel and image, convoluted)

SLAC = Shift Left And Accumulate (similar to MAC, but faster)

In short, we are thinking about adding this instruction to the ISA

The convolution looks something like this

Code: [Select]

private sint32_t convolution_get_pixel
(
    sint32_t kernel[],
    uint32_t kernel_n,
    sint32_t image[],
    uint32_t image_width,
    uint32_t x,
    uint32_t y
)
{
    sint32_t ans;
    sint32_t sum;
    uint32_t k;
    uint32_t j;
    uint32_t OF0;
    uint32_t OF1;

    sum = 0;
    for (k = -1; k <= 1; k++)
    {
        for (j = -1; j <= 1; j++)
        {
            OF0  = get_kernel_cell_offset(j + 1, k + 1, kernel_n * 4);
            OF1  = get_image_cell_offset(x - j, y - k, image_width * 4);
            sum += kernel[OF0] * image[OF1];
        }
    }
    ans = sum;
    return ans;
}

public void convolution_image
(
    sint32_t kernel[],
    uint32_t kernel_n,
    sint32_t image_in[],
    sint32_t image_out[],
    uint32_t image_height,
    uint32_t image_width
)
{
    sint32_t pixel;
    uint32_t x;
    uint32_t y;
    uint32_t OF0;

    for (y = 0; y < image_height; y++)
    {
        for (x = 0; x < image_width; x++)
        {
            pixel          = convolution_get_pixel(kernel, kernel_n, image_in, image_width, x, y);
            OF0            = get_image_cell_offset(x,y,image_width * 4);
            image_out[OF0] = pixel;
        }
    }
}

Code: [Select]

            OF0  = get_kernel_cell_offset( reg0, reg1, const);
            OF1  = get_image_cell_offset( reg0,  reg1, const);

These two lines can be translated into "SLAC" instructions. Neat, and very fast since SLAC is basically a SHIFT which takes 1 clock edge to compute

Code: [Select]

            sum += kernel[OF0] * image[OF1];

This can be translated into a "MAC"

edit:
there are other ways to convolute an image * kernel
kernel_n is usually =3, for a kernell matrix 3x3

brucehoult · « **Reply #54 on:** November 13, 2018, 01:30:40 am »

Quote from: rhb on November 12, 2018, 01:10:30 am

I learned computer internals on the 6502, so I may be prejudiced. But I really like the addressing modes of the 6502 and that was what excited Woz about it. It was a huge improvement over the 8080/Z80 model.

What excited Woz about the 6502 was mostly that it cost $25 when the M6800 and Z80 cost $200+.

Also, unlike those other two, it was pipelined just sufficiently that it could run most code in 1 clock cycle per byte read or written from RAM (code and data), which is a kind of "speed of light" for processors without an icache (or Harvard architecture). The exceptions were 1) single-byte instructions that didn't read or write anything and took 2 cycles, and 2) indexed addressing modes with a carry from lo byte into hi byte (crossed a page boundary)

Quote

But given your performance targets I'm a bit baffled about the purpose of the project.

Indeed.

brucehoult · « **Reply #55 on:** November 13, 2018, 01:42:18 am »

Quote from: westfw on November 12, 2018, 09:23:53 am

Quote
Note: this [code that executes well without complex addressing modes] does depend on using modern compilers. If you're going to write your own compiler then it may be different as it's a *lot* of work

I thought it was sort-of the other way around - writing a compiler that is easier if there isn't any choice of addressing modes, because it becomes a relatively straightforward register allocation/scheduling problem instead of a complex choice of special-case instructions with different performance characteristics ? Isn't that part of the whole RISC philosophy? That does depend on caches making the instruction fetches fast...

Yes, at the limit of the very simplest compiler it's easiest to have only "this register contains the exact address of the data to load/store" and do everything else with arithmetic.

However this is *very* low performance unless you have, as I said, an optimiser to do common subexpression elimination and movement of invariant code out of the loop. And preferably induction variable detection and strength reduction too.

It's much simpler to do a little bit of tree matching to generate the complex addressing modes. In particular a technique called "Hold up code generation" is simple and effective, especially on machines that have an LEA instruction. Basically instead of just emitting adds, multiplies by small powers of two etc you build a little state machine that records an addressing mode that can be used to represent a fragment of the calculation. Each time you come to a new operation you see if it can be added to the addressing more to transition to another addressing mode (some of which may be "virtual" if your machine doesn't actually have all combinations). If you come to a load or store then you emit it with the addressing mode implied by the state. If you come to an impossible transition from a state then you emit a LEA with what you have so far, and then continue.

This is surprisingly effective on things like VAX, M68K and x86 and I wrote compiler back-ends using it in the mid 80s. It also takes very little code and runs very quickly. Much much easier than writing a proper optimiser.

But once you have a proper optimiser anyway, you don't need all those addressing modes any more.

NorthGuy · « **Reply #56 on:** November 13, 2018, 04:26:58 am »

Quote from: brucehoult on November 13, 2018, 01:42:18 am

But once you have a proper optimiser anyway, you don't need all those addressing modes any more.

Really?

A simple case:

Code: [Select]

int* b;
unsigned char* c;

for (k = 0; k < n; k++) {
  b[k] = c[k];  
}

With hardware indexing (such as x64), if the compiler can take advantage of the indexing, the body of the loop is very simple, something like that:

Code: [Select]

movzx eax,byte ptr [rsi+rax]
mov [rdi+rax*4]

If an (unoptimized) compiler calculates everything straight without using indexing, it becomes quite a bit more:

Code: [Select]

mov rbx,rax
add rbx,rdi
mov rcx,rax
shl rcx,2
add rcx,rsi
movzx eax,byte ptr [rbx]
mov [rcx]

What the "proper optimizer" can do without hardware indexing. I guess, it can move the calculation out of the loop, only leaving increments of base registers on every cycle:

Code: [Select]

movzx eax,byte ptr [rbx]
inc rbx
mov [rcx]
add rcx,4

Lo and behold, the performance of the hardware indexing is matched by "proper optimizer". Who needs hardware indexing?

But.

"Grau, teurer Freund, ist alle Theorie, Und grün des Lebens goldner Baum."

Just a small change to the source code:

Code: [Select]

for (k = 0; k < n; k++) {
  if (must_process(k)) b[k] = c[k];  
}

What the "proper optimizer" can do now. If must_process() returns mostly "true", the above optimization trick with increments will be benefical. However, if must_process() returns mostly "false", it's better to stick with naïve unoptimized implementation. But the "proper optimizer" has no clue, so it cannot even make a proper choice between the two, and therefore is likely to produce a code which will perform worse than the naïve variant. Doesn't look very impressive, does it?

Meanwhile, hardware indexing works the same as before.

brucehoult · « **Reply #57 on:** November 13, 2018, 06:51:51 am »

Quote from: NorthGuy on November 13, 2018, 04:26:58 am

Just a small change to the source code:

Code: [Select]
for (k = 0; k < n; k++) { if (must_process(k)) b[k] = c[k]; }
What the "proper optimizer" can do now. If must_process() returns mostly "true", the above optimization trick with increments will be benefical. However, if must_process() returns mostly "false", it's better to stick with naïve unoptimized implementation. But the "proper optimizer" has no clue, so it cannot even make a proper choice between the two, and therefore is likely to produce a code which will perform worse than the naïve variant. Doesn't look very impressive, does it?

Meanwhile, hardware indexing works the same as before.

I compiled your code for x86_64 with gcc -O2:

Code: [Select]

  20:	89 df                        mov    %ebx,%edi
  22:	e8 00 00 00 00               callq  27 <foo+0x27>
  27:	85 c0                        test   %eax,%eax
  29:	74 0a                        je     35 <foo+0x35>
  2b:	41 0f b6 44 1d 00            movzbl 0x0(%r13,%rbx,1),%eax
  31:	41 89 04 9c                  mov    %eax,(%r12,%rbx,4)
  35:	48 83 c3 01                  add    $0x1,%rbx
  39:	48 39 eb                     cmp    %rbp,%rbx
  3c:	75 e2                        jne    20 <foo+0x20>

Cool. Cure enough it used your fancy addressing modes. It has nine instructions and thirty bytes of code in the loop.

Let's try something without fancy addressing modes. I have RISC-V handy (also 64 bit):

Code: [Select]

   101ce:       8522                    mv      a0,s0
   101d0:       fe3ff0ef                jal     ra,101b2 <must_process>
   101d4:       2405                    addiw   s0,s0,1
   101d6:       c501                    beqz    a0,101de <foo+0x28>
   101d8:       00094783                lbu     a5,0(s2)
   101dc:       c09c                    sw      a5,0(s1)
   101de:       0905                    addi    s2,s2,1
   101e0:       0491                    addi    s1,s1,4
   101e2:       fe8996e3                bne     s3,s0,101ce <foo+0x18>

So that's ... nine instructions and twenty four bytes of code. I'm not seeing a lot of advantage from those fancy addressing modes here.

Let's try 64 bit ARM. It's got complex addressing modes too.

Code: [Select]

 730:   2a0203e0        mov     w0, w2
 734:   97fffff5        bl      708 <must_process>
 738:   34000060        cbz     w0, 744 <foo+0x34>
 73c:   38626824        ldrb    w4, [x1, x2]
 740:   b82278a4        str     w4, [x5, x2, lsl #2]
 744:   91000442        add     x2, x2, #0x1
 748:   eb03005f        cmp     x2, x3
 74c:   54ffff21        b.ne    730 <foo+0x20>  // b.any

Hmm. That's eight instructions and 32 bytes of code. It saved an instruction because, like RISC-V, it can test a register for zero and branch in one instruction.

I'm still not seeing the big benefit of these complex addressing modes.

There's *some* benefit, of course. But there's also a cost. Something extra to design and document and learn. Extra transistors to implement shift and add within an instruction. Extra transistors to be able to read three registers (w4, x5, x2 in the ARM case and %eax, %r12, %rbx for the x86), in one clock cycle instead of only two. Wasted power in any instructions that don't actually make use of those capabilities. Quite possibly the clock speed for the entire CPU will have to be slower because it has to be able to complete that shift and add and store in the same amount of time that will also be given to simple shifts or adds or stores alone.

It may also be that the x86 and/or ARM actually split that complex store instruction into several micro-ops. That would negate my hardware cost and speed points in the previous paragraph, but it will put you back to having extra actual operations in the loop, and instructions that the programmer/compiler can't influence.

DJohn · « **Reply #58 on:** November 13, 2018, 11:01:51 am »

Quote from: legacy on November 13, 2018, 01:28:30 am

Code: [Select]
... for (y = 0; y < image_height; y++) { for (x = 0; x < image_width; x++) { pixel = convolution_get_pixel(kernel, kernel_n, image_in, image_width, x, y); OF0 = get_image_cell_offset(x,y,image_width * 4); image_out[OF0] = pixel; } } }

There's no need for fancy addressing modes here.

Code: [Select]

    stride = 4*image_width;
    next_line = stride;
    image_end = stride*image_height;
    y = 0;
    OF0 = 0;
    while ( OF0 < image_end )
    {
        x = 0;
        while ( OF0 < next_line )
        for (x = 0; x < ; x++)
        {
            pixel          = convolution_get_pixel(kernel, kernel_n, image_in, image_width, x, y);
            image_out[OF0] = pixel;
            OF0 += 4;
            x++;
        }
        next_line += stride;
        y++;
    }
}

And the same sort of thing can be done to get the multiplies out of the convolution addresses (you can get rid of x and y then). Compilers have been applying this sort of transformation automatically for decades.

I'd probably go further than that. If your kernel is separable, it might be best to do the operation in two passes: first with a one-high horizontal kernel, then again with a one-wide vertical kernel. It takes the image reads from width*height*N^2 to width*height*2N for an NxN kernel. It might not be a win for your 3x3 kernel, but it might be, and definitely is when N is large.

For a large kernel, the obvious way of arranging the image in memory (with pixel x,y at 4*x + 4*width*y) can make the cache very unhappy on the vertical pass. In that case, it's better to arrange the image in square blocks of pixels that each fit in a cache line. Random access of pixels becomes messier, but ordered access is still just a matter of adding constants to pointers.

None of this even needs register+immediate addressing. Just having the address in a register is sufficient.

legacy · « **Reply #59 on:** November 13, 2018, 12:08:35 pm »

Quote from: DJohn on November 13, 2018, 11:01:51 am

Code: [Select]
OF0 += 4;

fancy addressing modes ... umm, SLAC is not an addressing mode instruction, it's a common arithmetic instruction that precedes a load/store instruction, it has the same execution time as an addition: 1 clock edge, but it makes the compiler job easier since what you see in the C code is directly translated into assembly! Besides, it costs zero effort for the ISA's implementation, since it's an already existing SHIFT instruction with a little modification.

I see 3 good reasons for it, and Elisabeth loves creating new acronyms

(now I see 4 good reasons for it

)

Edit:
Not mentioning it also helps for sparse matrices!

brucehoult · « **Reply #60 on:** November 13, 2018, 01:11:08 pm »

Quote from: legacy on November 13, 2018, 12:08:35 pm

fancy addressing modes ... umm, SLAC is not an addressing mode instruction, it's a common arithmetic instruction that precedes a load/store instruction, it has the same execution time as an addition: 1 clock edge, but it makes the compiler job easier since what you see in the C code is directly translated into assembly! Besides, it costs zero effort for the ISA's implementation, since it's an already existing SHIFT instruction with a little modification.

So, exactly the same as the ADD instruction on ARM32 then.

NorthGuy · « **Reply #61 on:** November 13, 2018, 03:59:12 pm »

Quote from: brucehoult on November 13, 2018, 06:51:51 am

I compiled your code for x86_64 with gcc -O2:

Code: [Select]
20: 89 df mov %ebx,%edi 22: e8 00 00 00 00 callq 27 <foo+0x27> 27: 85 c0 test %eax,%eax 29: 74 0a je 35 <foo+0x35> 2b: 41 0f b6 44 1d 00 movzbl 0x0(%r13,%rbx,1),%eax 31: 41 89 04 9c mov %eax,(%r12,%rbx,4) 35: 48 83 c3 01 add $0x1,%rbx 39: 48 39 eb cmp %rbp,%rbx 3c: 75 e2 jne 20 <foo+0x20>
Cool. Cure enough it used your fancy addressing modes. It has nine instructions and thirty bytes of code in the loop.

Not very cool. That's just a consequence of how GCC is doing theoretical optimizations rather than gathering to real-world hardware. I'm sure any half-decent compiler pointed into right direction, could come up with something better. As a compiler writer, you would probably agree that the optimizations applied here are not very hard to implement in the compiler:

Code: [Select]

00: 89 df                   mov edi,ebx
02: e8 00 00 00 00          call must_process
07: 85 c0                   test eax,eax
09: 74 09                   jz 0x14
0b: 41 0f b6 04 1e          movzx eax,[r14+rbx]
10: 41 89 04 9f             mov [r15+rbx*4],eax
14: ff c0                   inc ebx
16: 74 e8                   jz 0x00

This is 8 instructions and 24 bytes.

If you move to Windows, where calling conventions are slightly better:

Code: [Select]

00: 89 d9                   mov ecx,ebx
02: e8 00 00 00 00          call must_process
07: 85 c0                   test eax,eax
09: 74 07                   jz 0x12
0b: 0f b6 04 1e             movzx eax,[rsi+rbx]
0f: 89 04 9f                mov [rdi+rbx*4],eax
12: ff c0                   inc ebx
14: 74 ea                   jz 0x00

you still get 8 instructions, but only 22 bytes.

Although, I don't understand why the number of bytes is so important - everyone has tons of RAMs. Speed is the king. And look what your "proper optinmizer" has done for the "best in the world" RISC-V:

Code: [Select]

   101ce:       8522                    mv      a0,s0
   101d0:       fe3ff0ef                jal     ra,101b2 <must_process>
   101d4:       2405                    addiw   s0,s0,1
   101d6:       c501                    beqz    a0,101de <foo+0x28>
   101d8:       00094783                lbu     a5,0(s2)
   101dc:       c09c                    sw      a5,0(s1)
   101de:       0905                    addi    s2,s2,1
   101e0:       0491                    addi    s1,s1,4
   101e2:       fe8996e3                bne     s3,s0,101ce <foo+0x18>

Now, if you run this loop 1,000,000 times while must_process() returns true only 2 or 3 times, these two lines:

Code: [Select]

   101de:       0905                    addi    s2,s2,1
   101e0:       0491                    addi    s1,s1,4

are going to execute at every pass - that is about 2 million needless additions. You really call this "proper optimization"?

I'll take hardware indexing over that any time.

<edit>Sorry, in my snippets, since RBX is shifted, the first instruction should not be "mov edi,rbx" but something like "lea edi,[rbx+r13]" to re-adjust RBX to what the routine being called wants. This will add one or 2 bytes, but won't alter the number of instructions.

rhb · « **Reply #62 on:** November 13, 2018, 04:03:17 pm »

Quote from: legacy on November 13, 2018, 01:28:30 am

Quote from: rhb on November 12, 2018, 11:52:23 pm
I should appreciate an explanation of why a convolution is not a matrix multiply

Different order of the cell's address during multiplication, which makes "SLAC" instruction useful for the EA of each cell (Kernel and image, convoluted)

The remark was rhetorical. To paraphrase Gertrude Stein, "A convolution is a convolution no matter how you do it."

And in the frequency domain it's a multiplication over frequency rather than a multiply-add over space.

This thread sounds like great fun and I wish you well, but I'm going to bow out. I already have enough projects. And it's very clear you know what you're up to. Though if you want input from the reflection seismology side of DSP send me a PM.

Have Fun!
Reg

legacy · « **Reply #63 on:** November 13, 2018, 04:16:10 pm »

Quote from: rhb on November 13, 2018, 04:03:17 pm

Though if you want input from the reflection seismology side of DSP send me a PM.

wow, that was a part of my thesis to get my degree

Thanks, I will for sure contact you by PM about this.

Thanks

brucehoult · « **Reply #64 on:** November 13, 2018, 10:48:17 pm »

Quote from: NorthGuy on November 13, 2018, 03:59:12 pm

Quote from: brucehoult on November 13, 2018, 06:51:51 am
I compiled your code for x86_64 with gcc -O2:

Code: [Select]
20: 89 df mov %ebx,%edi 22: e8 00 00 00 00 callq 27 <foo+0x27> 27: 85 c0 test %eax,%eax 29: 74 0a je 35 <foo+0x35> 2b: 41 0f b6 44 1d 00 movzbl 0x0(%r13,%rbx,1),%eax 31: 41 89 04 9c mov %eax,(%r12,%rbx,4) 35: 48 83 c3 01 add $0x1,%rbx 39: 48 39 eb cmp %rbp,%rbx 3c: 75 e2 jne 20 <foo+0x20>
Cool. Cure enough it used your fancy addressing modes. It has nine instructions and thirty bytes of code in the loop.

Not very cool. That's just a consequence of how GCC is doing theoretical optimizations rather than gathering to real-world hardware. I'm sure any half-decent compiler pointed into right direction, could come up with something better. As a compiler writer, you would probably agree that the optimizations applied here are not very hard to implement in the compiler:

gcc is the compiler most people use, especially in the embedded world, and the one that works on the most different CPU architectures. I'm happy to use llvm if you prefer?

Quote

Although, I don't understand why the number of bytes is so important - everyone has tons of RAMs. Speed is the king.

Not everyone has tons of RAMs. Some people only have a few KB. Even the people with tons of RAMs generally only have at most 32 KB of L1 instruction cache.

As an example of why it is important, if you enable RISC-V variable-length instructions then "CoreMark" fits into the icache on the HiFive Unleashed and runs very well. If you use the fixed length 32 bit instruction encoding then (the hot parts of) CoreMark doesn't fit into 16 KB and it runs about twice slower.

Quote

And look what your "proper optinmizer" has done for the "best in the world" RISC-V:

Not "best in the world". That's not the aim. It's "free and open". That's far more important in the long term. In the case of Linux, being free and open resulted in huge numbers of people working to improve it and eventually being best in the world. It certainly wasn't in 1992 or even 1998.

Quote

Now, if you run this loop 1,000,000 times while must_process() returns true only 2 or 3 times, these two lines:

Code: [Select]
101de: 0905 addi s2,s2,1 101e0: 0491 addi s1,s1,4
are going to execute at every pass - that is about 2 million needless additions. You really call this "proper optimization"?

In the absence of knowing how often must_process() returns true: absolutely.

It's worth converting to a shift and add inside the condition if must_process() returns true less than a third of the time. If it's more than that then you're doing extra work by shifting and adding every time. Complex addressing modes are not free.

But in any case, the difference will be trivial given that you're calling a *function* in the middle of your supposedly critical loop.

NorthGuy · « **Reply #65 on:** November 13, 2018, 11:27:40 pm »

Quote from: brucehoult on November 13, 2018, 10:48:17 pm

In the absence of knowing how often must_process() returns true: absolutely.

It's worth converting to a shift and add inside the condition if must_process() returns true less than a third of the time. If it's more than that then you're doing extra work by shifting and adding every time.

Exactly. The C compiler doesn't have enough information and thus it doesn't know which optimization to apply. So, you may end up applying optimization which actually makes things worse.

Hardware indexing solves this for you. The decision of whether the calculation is needed is postponed until the run-time, which completely eliminates the controversy.

That is true even with superscalar architecture - the CPU only needs to calculate the address if the instruction containing the indexing is about to execute. Although memory access will probably dominate anyway.

Quote from: brucehoult on November 13, 2018, 10:48:17 pm

Complex addressing modes are not free.

Aside of consuming small amount of code points, hardware indexing is free. You simply extend your address generation unit so that instead of one addition (base + offset) it does two (base + shifted index + offset). If this fits between clocks, it doesn't take any extra time.

brucehoult · « **Reply #66 on:** November 14, 2018, 01:01:25 am »

Quote from: NorthGuy on November 13, 2018, 11:27:40 pm

Aside of consuming small amount of code points, hardware indexing is free. You simply extend your address generation unit so that instead of one addition (base + offset) it does two (base + shifted index + offset). If this fits between clocks, it doesn't take any extra time.

It's absolutely not free. It might under some circumstances be cheap enough to be worth it, but it's not free. It's extra circuitry to power, even when instructions aren't using it.

It's not even a question of "does it fit between clocks?". The question is "Is it the thing that is preventing you from making the clock faster?"

The proof is in the pudding. Multiple customers are telling SiFive that at the same performance level our RISC-V cores are coming in at 1/3 the area and 1/3 the energy consumption of the A** cores they evaluated or were previously using. They are telling us this for everything from the E20 (CM0 class), to the E31/E34 (CM3/4 class), to the new E71 (CM7 class) and U71 (CA53/55 class).

Certainly having only simple addressing modes not complex ones doesn't account for all of this but I believe it helps. It obviously doesn't *prevent* it.

A base + scaled index + offset addressing mode has some benefits. No question. But it also has costs, that I think outweigh the benefits.

A better way to do it is to keep simple operations but have a complete second instruction pipeline. That allows you to fold in your extra increments (or extra shift and add in the conditional part of your loop, if you want) to other instructions. You get the same benefit as the complex addressing mode, but you get the *additional* benefit of being able to use that extra hardware to pair instructions in a vast range of other situations as well. Typically you can increase overall performance by 40% to 60% at a hardware cost that is only slightly greater than having an extra ALU lying around that is usable only by addressing modes.

legacy · « **Reply #67 on:** November 14, 2018, 03:37:52 pm »

Quote from: brucehoult on November 14, 2018, 01:01:25 am

It's absolutely not free. It might under some circumstances be cheap enough to be worth it, but it's not free. It's extra circuitry to power, even when instructions aren't using it.

A digital circuit made with MOS transistors only consumes energy on switching from on-off/off-on; the energy consumption is linear with the clock frequency and the voltage applied (squared relation here), while leakage current when a transistor keeps its internal state is neglectable.

Besides, the EA addressing mode we are talking about only work (hence only makes a switching, hence only consume energy) when it's enabled.

Quote from: brucehoult on November 14, 2018, 01:01:25 am

It's not even a question of "does it fit between clocks?". The question is "Is it the thing that is preventing you from making the clock faster?"

The EA addressing mode we are talking about can be calculated by the ALU, which is a mandatory stage of the machine, therefore since circuits are already there (arrange differently, but they are all there, ... shift, add, etc), it literally adds zero cost!

on MIPS-R2K: Fetch -> decode & registers load from RF -> ALU & EA evaluation -> load/store -> WriteBack.

brucehoult · « **Reply #68 on:** November 14, 2018, 09:49:25 pm »

Quote from: legacy on November 14, 2018, 03:37:52 pm

Quote from: brucehoult on November 14, 2018, 01:01:25 am
It's absolutely not free. It might under some circumstances be cheap enough to be worth it, but it's not free. It's extra circuitry to power, even when instructions aren't using it.

A digital circuit made with MOS transistors only consumes energy on switching from on-off/off-on; the energy consumption is linear with the clock frequency and the voltage applied (squared relation here), while leakage current when a transistor keeps its internal state is neglectable.

Uhhhh .. that was true in the 70s and early 80s when transistors were huge and you were working at 1 MHz or 10 MHz.

It hasn't been true for a *long* time. At 28nm, the power consumption of the FU540 (quad core 1.5 GHz 64 bit processor) is something like 90% leakage current, with the power consumption difference between idle (or sleeping) and running flat out is very small. Stopping the clock doesn't save power now. Only actually tuning off the power supply saves power.

It's even worse at smaller nodes.

Quote

Quote from: brucehoult on November 14, 2018, 01:01:25 am
It's not even a question of "does it fit between clocks?". The question is "Is it the thing that is preventing you from making the clock faster?"

The EA addressing mode we are talking about can be calculated by the ALU, which is a mandatory stage of the machine, therefore since circuits are already there (arrange differently, but they are all there, ... shift, add, etc), it literally adds zero cost!

on MIPS-R2K: Fetch -> decode & registers load from RF -> ALU & EA evaluation -> load/store -> WriteBack.

A standard ALU (as in the MIPS) takes two inputs and applies one operation such as AND/OR/XOR, ADD, SHIFT etc to them, producing one output.

Your addressing mode requires two adds and a shift. That means either the standard ALU must be used three times in a row (extra time), or else be extended with an extra shifter and extra adder which are only used by a fraction of instructions (loads and stores using this addressing mode). The latter uses not only extra transistors (extra area, cost, and power) but also extra time as the result of one of the new function units feeds into the next one. You also now need four inputs instead of two. The register file now also needs three read ports instead of two for stores, which is expensive.

NorthGuy · « **Reply #69 on:** November 14, 2018, 11:51:38 pm »

Quote from: brucehoult on November 14, 2018, 09:49:25 pm

At 28nm, the power consumption of the FU540 (quad core 1.5 GHz 64 bit processor) is something like 90% leakage current, with the power consumption difference between idle (or sleeping) and running flat out is very small. Stopping the clock doesn't save power now. Only actually tuning off the power supply saves power.

Doesn't sound very compelling. Xilinx 7-series FPGA are 28nm. The quiescent current is rather low (e.g. 200 mA with Artix-7 XC7A100T), but this go up very quickly when you start massive switching at 600 MHz (such as to 10 A or more with the same Artix-7 XC7A100T).

brucehoult · « **Reply #70 on:** November 15, 2018, 01:38:16 am »

Quote from: NorthGuy on November 14, 2018, 11:51:38 pm

Quote from: brucehoult on November 14, 2018, 09:49:25 pm
At 28nm, the power consumption of the FU540 (quad core 1.5 GHz 64 bit processor) is something like 90% leakage current, with the power consumption difference between idle (or sleeping) and running flat out is very small. Stopping the clock doesn't save power now. Only actually tuning off the power supply saves power.

Doesn't sound very compelling. Xilinx 7-series FPGA are 28nm. The quiescent current is rather low (e.g. 200 mA with Artix-7 XC7A100T), but this go up very quickly when you start massive switching at 600 MHz (such as to 10 A or more with the same Artix-7 XC7A100T).

That's because Xilinx uses the lower frequency and lower leakage 28 HPL process. Not all 28nm is the same, even from the same foundry (TSMC).

rstofer · « **Reply #71 on:** November 15, 2018, 04:27:18 pm »

Quote from: brucehoult on November 14, 2018, 09:49:25 pm

A standard ALU (as in the MIPS) takes two inputs and applies one operation such as AND/OR/XOR, ADD, SHIFT etc to them, producing one output.

Your addressing mode requires two adds and a shift. That means either the standard ALU must be used three times in a row (extra time), or else be extended with an extra shifter and extra adder which are only used by a fraction of instructions (loads and stores using this addressing mode). The latter uses not only extra transistors (extra area, cost, and power) but also extra time as the result of one of the new function units feeds into the next one. You also now need four inputs instead of two. The register file now also needs three read ports instead of two for stores, which is expensive.

One solution is to have multiple copies of the registers. When a register is written, every copy is written at the same time. Separate access is easy when every access datapath has its own copy of the register bank. Usually, this doesn't take all that much BlockRAM because there aren't a whole lot of registers - 16?

Two copies are required in any event if we want to add 2 registers without having an intermediate storage register. Adding a 3rd or 4th copy just isn't a big deal. Conceptually...

The LC-3 project shows 2 copies (implied) on page 7
http://users.ece.utexas.edu/~patt/05f.360N/handouts/360n.appC.pdf

legacy · « **Reply #72 on:** November 15, 2018, 04:45:38 pm »

Quote from: rstofer on November 15, 2018, 04:27:18 pm

Usually, this doesn't take all that much BlockRAM because there aren't a whole lot of registers - 16?

Arise-v2 has 32 registers, but can't use BlockRam it needs to use BRAM, which happens to have two ports.
But this is not a problem since Arise-v2 has two stages for 2+2 registers read per instructions.
Usually, instructions have 2 operands, sometimes (MAC, SLAC, etc..) have 3 operands.

On the new ALU, SLAC has now its twin SRAC (shift right and accumulate).

On Arise-v2, the PC-register always belongs to the register file.

NorthGuy · « **Reply #73 on:** November 15, 2018, 07:46:03 pm »

Quote from: brucehoult on November 15, 2018, 01:38:16 am

Quote from: NorthGuy on November 14, 2018, 11:51:38 pm
Quote from: brucehoult on November 14, 2018, 09:49:25 pm
At 28nm, the power consumption of the FU540 (quad core 1.5 GHz 64 bit processor) is something like 90% leakage current, with the power consumption difference between idle (or sleeping) and running flat out is very small. Stopping the clock doesn't save power now. Only actually tuning off the power supply saves power.

Doesn't sound very compelling. Xilinx 7-series FPGA are 28nm. The quiescent current is rather low (e.g. 200 mA with Artix-7 XC7A100T), but this go up very quickly when you start massive switching at 600 MHz (such as to 10 A or more with the same Artix-7 XC7A100T).

That's because Xilinx uses the lower frequency and lower leakage 28 HPL process. Not all 28nm is the same, even from the same foundry (TSMC).

Low frequency silicon, huh? Very smooth ...

How about Intel processors? A guy measured power consumption of 22/32 nm technology here:

http://blog.stuffedcow.net/2012/10/intel32nm-22nm-core-i5-comparison/

Look at picture 2a/2b and extend the orange "Total power" line until it crosses 0 frequency. Looks like around 10 W (perhaps 15 W). And this includes whole lot more than just leakage. As the frequency increases, the power goes over 100 W. Doesn't look like leakage dominates. Do they use "low frequency silicon" too?

SiliconWizard · « **Reply #74 on:** November 15, 2018, 07:55:33 pm »

Of course leakage doesn't dominate, not at the frequencies those process nodes are intended for anyway.
Obviously if you use a 14nm process for digital ICs running at a couple MHz, leakage could indeed dominate. Not at 1GHz+.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: which Effective Addressing Modes are essential for a HL language? (Read 10653 times)

Share me