Author Topic: Funnel Shifter: where is useful? (Read 6414 times)

DiTBho · « **on:** July 03, 2021, 03:01:40 pm »

Quote

GK110 added a 64-bit “funnel shift” instruction that may be accessed with the following intrinsics:
funnelshift_lc(): returns most significant 32 bits of a left funnel shift.
funnelshift_rc(): returns least significant 32 bits of a right funnel shift.

If I have got it correctly, in simple terms, given a 32 bit machine, "Funnel Shifter" is a 64-bit shift with 32-bit operands and results, because the registers are all 32-bit. Given a 64 bit machine, "Funnel Shifter" is a 128-bit shift, etc ...

I fail to understand where/when this stuff is useful

SiliconWizard · « **Reply #1 on:** July 03, 2021, 04:31:10 pm »

Quote from: DiTBho on July 03, 2021, 03:01:40 pm

Quote
GK110 added a 64-bit “funnel shift” instruction that may be accessed with the following intrinsics:
funnelshift_lc(): returns most significant 32 bits of a left funnel shift.
funnelshift_rc(): returns least significant 32 bits of a right funnel shift.

If I have got it correctly, in simple terms, given a 32 bit machine, "Funnel Shifter" is a 64-bit shift with 32-bit operands and results, because the registers are all 32-bit. Given a 64 bit machine, "Funnel Shifter" is a 128-bit shift, etc ...

Yep; that's what it is.

Quote from: DiTBho on July 03, 2021, 03:01:40 pm

I fail to understand where/when this stuff is useful

Everytime you'd need to do 2N-bit bit shifting on a N-bit CPU. Just think of how many instructions are typically required to do this if you don't have funnel shifters. With a funnel shifter, that's just one instruction. (Edit: on architectures which support carry-in and carry-out in their shifters, that can obviously be done with 2 instructions "only". But on those, for instance RISC-V, that do not, that will require significantly more than 2.)

Apart from shifting 2N-bit numbers, you can also easily implement N-bit rotates with this. You just use the same value for both operands, and shift.

Nominal Animal · « **Reply #2 on:** July 05, 2021, 09:22:28 pm »

Quote from: DiTBho on July 03, 2021, 03:01:40 pm

I fail to understand where/when [funnel shifter] is useful

Bit stream operations.

Many formats (most compressed video streams for example) consist of bit fields not aligned to byte or word boundaries. Using native register sized units and funnel shifting, any field with up to native register width can be extracted using a single funnel shift and a binary AND. To insert, a binary AND (to ensure all bits beyond the value are zero) followed by a single funnel shift and two binary OR operations are needed. Even though this seems more work than just using 8-bit loads and stores, the native register sized accesses with a two-register variable-sized funnel shift is much faster on 32-bit and wider architectures.
Bignum arithmetic.

The double-sized case –– two registers describing an unsigned integer –– is ubiquitous.
For example, if you have N-bit unsigned integer registers, and a N×N→2N-bit multiplication like e.g. x86 and x86-64 have, you can convert most divisions by an integer constant into a single multiplication followed by a funnel shift. (Some constants need an additional addition to the register pair before the funnel shift.)

(The only case where the funnel shift would be N is when the division is by a power of two, and those are trivially implemented via a shift themselves. So, in practice, the tunnel shift used here is arithmetic right shift by K buts, 1 ≤ K < N.)
Larger than register sized fixed point arithmetic.

Fixed point arithmetic with Q bits reserved for the fractional part, with total size larger than a single register, heavily relies on funnel shifting.
On 32 bit architectures with two-register funnel shifter (multi-bit shifts) and a N×N→2N-bit multiplication, Q31.32 fixed point format has range at least -2147483647.999999999 .. +2147483647.999999999 with a resolution of 0.000000000232831 or so.

Multiplying two such numbers requires four multiplications and six additions. If you use the highest bit for sign flag, you need to branch. If you use two's complement format, one of the multiplications is unsigned-unsigned, one is signed-signed, and two are signed-unsigned. Usually, we only have unsigned-unsigned multiplication and signed-signed multiplication, but fortunately, the "fix" for the mixed case is a single-bit funnel shift – i.e., needed twice for each double-register wide fixed point multiplication operation (or once, if you have two extra registers to store the intermediate result in before fixing them, I guess).

If you count the operations and look at the format properties, you'll notice that for the vast majority of use cases, this well outperforms IEEE-754 Binary32 AKA 'float', especially if the architecture provides N×N→2N-bit multiplication and at least one-bit funnel shifting (to fix the signed-unsigned limb multiplications).

I'm sure there are many more, but the above are the ones I immediately recall using and needing funnel shifting for.
(As you can tell, I also heavily benefit from architectures providing a hardware N×N→2N-bit signed and unsigned integer multiplication instructions, too. I don't know what it is called, but I consider it related to ordinary register×register=register multiplication the same way funnel shifts relate to bit shifts within a single register.)

TheCalligrapher · « **Reply #3 on:** July 06, 2021, 12:41:29 am »

Quote from: DiTBho on July 03, 2021, 03:01:40 pm

I fail to understand where/when this stuff is useful

Um... Your own wording already contains the answer to that question. These intrinsics are useful when you need to "losslessly" shift a value that is wider than the widest type directly supported by your implementation. C and C++ provide no operations that would allow you to efficiently implement such shifts by yourself. Hence the need for intrinsics. That's what intrinsics are for.

That's all there is to it.

T3sl4co1l · « **Reply #4 on:** July 06, 2021, 04:37:12 am »

Quote from: TheCalligrapher on July 06, 2021, 12:41:29 am

Um... Your own wording already contains the answer to that question. These intrinsics are useful when you need to "losslessly" shift a value that is wider than the widest type directly supported by your implementation. C and C++ provide no operations that would allow you to efficiently implement such shifts by yourself. Hence the need for intrinsics. That's what intrinsics are for.

That's all there is to it.

Put another way, you can write monstrous shift-and-mask expressions, and pray that the compiler knows what you're trying to do; you're entirely at the mercy of the optimizer, and how much it's able to infer about your expression. Sometimes it may simply give up and write out the naive literal implementation using basic instructions (as if -O0). (I have seen this in avr-gcc for example, which is alas notoriously unoptimized among GCC branches; x86 and ARM I expect -- I hope -- should be able to figure these things out.)

Even for a powerful optimizer, some inferences may simply be missing, or the instruction simply has fundamentally different semantics from C expressions, thus being unreachable by them. In which case, you have no choice but to use the intrinsic, to force the use of that instruction. This may have its own downsides, as the optimizer may not be able to reason about these oddball instructions very well.

So, in short: you're likely using them when optimizing for speed, later in development, after functionality has been nailed down and tested. Intrinsics are nonstandard and platform-dependent, so you will need to repeat these optimizations for each supported compiler and target machine (usually a pile of #defines to select appropriate expressions/statements/functions for each).

Note that a given compiler might support the intrinsic across multiple targets, but fill in with an equivalent library operation when it is unavailable -- effectively emulating the instruction, which may come with additional overhead because of how library code gets inserted. Which is to say, just because an intrinsic might be available, doesn't mean it's a guaranteed win, you still need to do your due diligence (testing, profiling).

Tim

brucehoult · « **Reply #5 on:** July 06, 2021, 07:22:05 am »

Quote from: SiliconWizard on July 03, 2021, 04:31:10 pm

Everytime you'd need to do 2N-bit bit shifting on a N-bit CPU. Just think of how many instructions are typically required to do this if you don't have funnel shifters. With a funnel shifter, that's just one instruction. (Edit: on architectures which support carry-in and carry-out in their shifters, that can obviously be done with 2 instructions "only". But on those, for instance RISC-V, that do not, that will require significantly more than 2.)

It's not *that* many instructions.

On a 32 bit machine, (A concat B) >> C is just (A << (32-C)) | (B >> C) which is at most three instructions if C is a constant or four instructions if C is in a register and you know it's between 1 and 31.

Every current (and almost all planned) RISC-V integer instructions have either two source registers or one register and one constant. Funnel shift needs three inputs.

Just how common does *your* algorithm expect funnel shift to be? Enough that it's worth building a register file with three read ports, just for that instruction?

For most people, it's not going to be anywhere near common enough to be worth it.

The RISC-V BitManip extension does plan to assign opcodes and mnemonics to three instructions that each need three inputs. If people really really want to include these instructions in their CPU core they can, and they can all use the same instruction encodings for them, and gcc and llvm can be taught to generate them if given suitable flags. But they won't be included as standard in general purpose CPUs.

See section 2.9 of https://github.com/riscv/riscv-bitmanip/blob/main-history/bitmanip-draft.pdf

These are not included in what is currently up for ratification in the next month or two.

SiliconWizard · « **Reply #6 on:** July 06, 2021, 04:44:58 pm »

Quote from: brucehoult on July 06, 2021, 07:22:05 am

Quote from: SiliconWizard on July 03, 2021, 04:31:10 pm
Everytime you'd need to do 2N-bit bit shifting on a N-bit CPU. Just think of how many instructions are typically required to do this if you don't have funnel shifters. With a funnel shifter, that's just one instruction. (Edit: on architectures which support carry-in and carry-out in their shifters, that can obviously be done with 2 instructions "only". But on those, for instance RISC-V, that do not, that will require significantly more than 2.)
It's not *that* many instructions.
On a 32 bit machine, (A concat B) >> C is just (A << (32-C)) | (B >> C) which is at most three instructions if C is a constant or four instructions if C is in a register and you know it's between 1 and 31.

Yeah. I didn't say that was many. Just that it was more than 1. 4 instructions is still 4 times that. Not insignificant.

Quote from: brucehoult on July 06, 2021, 07:22:05 am

Every current (and almost all planned) RISC-V integer instructions have either two source registers or one register and one constant. Funnel shift needs three inputs.

That of course would be an issue, but FP instructions do have 3 sources IIRC, so some performance-oriented CPU can do this with integer instructions as well. Of course this has an extra cost.
The 3-source requirement also of course would be there for a barrel funnel shifter. Now you can still implement a funnel shifter with 2 sources only if it can only shift by 1 position at a time. That may be too restrictive to be really useful. Dunno. I guess some particular algorithms that typically have to shift by 1 pos for each iteration could still make use of this.

Quote from: brucehoult on July 06, 2021, 07:22:05 am

Just how common does *your* algorithm expect funnel shift to be? Enough that it's worth building a register file with three read ports, just for that instruction?
For most people, it's not going to be anywhere near common enough to be worth it.

Yes, this would make sense only on specialized CPUs IMO. For general-purpose ones, that's questionable.
The GK110 that the OP mentioned is a GPU. So... in that particular context, anything that can consistently save you even 1 cycle may be worth it.

brucehoult · « **Reply #7 on:** July 06, 2021, 10:56:13 pm »

Quote from: SiliconWizard on July 06, 2021, 04:44:58 pm

Quote from: brucehoult on July 06, 2021, 07:22:05 am
Every current (and almost all planned) RISC-V integer instructions have either two source registers or one register and one constant. Funnel shift needs three inputs.

That of course would be an issue, but FP instructions do have 3 sources IIRC, so some performance-oriented CPU can do this with integer instructions as well. Of course this has an extra cost.

FP instructions are usually a different register file, different execution pipeline.

Multiply-add is by far the most common instruction in floating point code. As such it's well worth building a register file with three read ports. It gets used *all* the time.

Adding a 3rd read port to the integer register file might well add enough extra delay (more gates, more fan-out) to slow your clock speed by a larger percentage than the frequency of use of funnel shift.

Nominal Animal · « **Reply #8 on:** July 07, 2021, 12:34:42 am »

Quote from: brucehoult on July 06, 2021, 07:22:05 am

Just how common does *your* algorithm expect funnel shift to be? Enough that it's worth building a register file with three read ports, just for that instruction?

I noticed that the RISC-V instruction set explicitly describes the register×register →register+register multiplication as form MULH rdh, rs1, rs2 ; MUL rdl, rs1, rs2 being recommended as allowing microarchitectures to fuse the operations if they deem it worthwhile.
(Nice; definitely works for my use cases. I don't need it to be a single instruction; two is well within the cost I'm willing to pay.)

Why is there no similar pair for shifting? Is it just not common enough to warrant it?

No, wait, I see ROL and ROR in the bit manipulation draft. Lessee... those do mean that for example bignum shifts only needs one rotate per limb, and two ANDs (with fixed patterns across all limbs) and two ORs to distribute the result, for any size bignum shift. Applies to fixed point arithmetic the exact same way. Yup: the rotates work just as well as funnel shifting, so extra work would be just waste.

I hadn't realized before how little benefit funnel shifting (across two registers) has compared to binary rotations. I should have realized, because I've never really needed "funnel rotations" for anything; only funnel shifts, or rotations. I need one, not both.

For binary data extraction and packing, if the fields are at most half the register width, you can use an extra register as the two-limb cache, or read/write half-register wide units. I don't recall any bit stream format having fields over 16 bits, so I guess I should have qualified the bit stream stuff using funnel shifting only on 8- and 16-bit arches.

SiliconWizard · « **Reply #9 on:** July 07, 2021, 12:50:52 am »

Quote from: Nominal Animal on July 07, 2021, 12:34:42 am

Quote from: brucehoult on July 06, 2021, 07:22:05 am
Just how common does *your* algorithm expect funnel shift to be? Enough that it's worth building a register file with three read ports, just for that instruction?
I noticed that the RISC-V instruction set explicitly describes the register×register →[ i]register[/i]+register multiplication as form MULH rdh, rs1, rs2 ; MUL rdl, rs1, rs2 being recommended as allowing microarchitectures to fuse the operations if they deem it worthwhile.
(Nice; definitely works for my use cases. I don't need it to be a single instruction; two is well within the cost I'm willing to pay.)

Why is there no similar pair for shifting? Is it just not common enough to warrant it?

As I said earlier, that would be a one-position shifter. But base bit shifting in RISC-V is barrel shifting. You can shift by more than one bit in one instruction. Doing that for pair shifting would require 3 source registers, which is exactly what I said above and what was discussed with Bruce.

Adding a one-position pair shifter (a funnel shifter) in the base instruction set would probably be odd, as other shift instructions are barrel shifting.
But, also as I said earlier, I don't know if a one-position funnel shifter would be really useful or not in practice, or if you'd need a full barrel funnel shifter. So it all depends on this.

Now if this would be useful, adding this pair shifting you have in mind (with only 1 bit shifting at a time) would be no problem indeed and wouldn't cost a whole lot.

Nominal Animal · « **Reply #10 on:** July 07, 2021, 12:15:22 pm »

Quote from: SiliconWizard on July 07, 2021, 12:50:52 am

As I said earlier, that would be a one-position shifter.

No, I meant as in "Why is there no similar suggestion of an instruction pattern, for funnel shifting", for variable bit shift counts. And then I answered myself with "because the instruction set has multi-bit rotates, which mean you don't need pairs of instructions for shifting".

I just realized that in the common cases, in integer division by a constant positive integer, only the upper word needs an arithmetic shift down, and does not involve funnel shifting.

For example, possibly the most common case of them all, dividing int32_t V by ten:
V / 10 == V * (uint32_t)(3435973837) >> 35
which on RV32E using the M extension is, I believe
LW rd, constant
MULHSU rd, rs, rd
SRAi rd, rd, 3
giving you the result in in rd, while keeping V int rs unchanged.

SiliconWizard · « **Reply #11 on:** July 07, 2021, 03:53:17 pm »

Quote from: Nominal Animal on July 07, 2021, 12:15:22 pm

Quote from: SiliconWizard on July 07, 2021, 12:50:52 am
As I said earlier, that would be a one-position shifter.
No, I meant as in "Why is there no similar suggestion of an instruction pattern, for funnel shifting", for variable bit shift counts. And then I answered myself with "because the instruction set has multi-bit rotates, which mean you don't need pairs of instructions for shifting".

It's always interesting how often each is kind of focusing on one particular point, and misunderstandings follow. I was just thinking of multi-bit funnel shifting instructions, which would require 3 sources. So, as Bruce pointed out, they would simply be ruled out by default because this is expensive, and could be justified only in very specific cases and of course as long as there are a number of other instructions that could take advantage of 3 sources. Otherwise, I agree with Bruce that would be a waste for not much.

You mentioned "multi-bit rotates", but the RISC-V *base* instructions set doesn't have any rotate instruction that I know of. It has multi-bit shifting though. Is that what you meant?
The arithmetic shift instruction you're using next is not a rotate either. So, not sure what you meant above exactly with the rotates?

Otherwise, the piece of code you gave for dividing by 10 looks correct to me. Except for the LW instruction, which is a memory load, while you meant loading a constant. The correct assembly for this would be: "li rd, 3435973837" - which is a pseudo-instruction that (given the constant here) would require two instructions

(Note that the "M" extension contains integer divide. Some implementations only partially implement the M extension, but, IIRC, this would be non-standard, but is supported by at least GCC - and probably LLVM, dunno - using a specific option. RV32E does not define a stripped-down M extension, it only defines a reduced register file of 16 registers. And with that said, even if you do have an integer divide instruction available, your piece of code will probably be faster than using a divide, as long as the multiplier doesn't take as many cycles as a divide, which is a common thing.)

As to the instruction fusing suggested for the integer multiplication, that's a general recommendation. You can implement this in various ways, more or less complex. One I used (and that is pretty common as far as I've seen at least on simple implementations) is to issue a full 32x32->64 multiplication for all multiplication instructions and hold the last operands, mode and and result in registers, so that next time the same operation is required, it just fetches the result from the result register, and returns the corresponding half of the result. Simple and effective without requiring any fancy "instruction fusing" logic, and works however far apart the instructions actually are. Just a thought. This is a minor detail.

brucehoult · « **Reply #12 on:** July 07, 2021, 11:58:56 pm »

Quote from: SiliconWizard on July 07, 2021, 03:53:17 pm

Otherwise, the piece of code you gave for dividing by 10 looks correct to me. Except for the LW instruction, which is a memory load, while you meant loading a constant. The correct assembly for this would be: "li rd, 3435973837" - which is a pseudo-instruction that (given the constant here) would require two instructions

The LW could be from a global stored in the initialised memory section and, hopefully, in the SDATA section and finding a place within +/-2K of the GP register, in which case a single LW will suffice.

It seems more illuminating (to me) to give that constant as 0xCCCCCCCD.

SiliconWizard · « **Reply #13 on:** July 08, 2021, 12:17:00 am »

Quote from: brucehoult on July 07, 2021, 11:58:56 pm

Quote from: SiliconWizard on July 07, 2021, 03:53:17 pm
Otherwise, the piece of code you gave for dividing by 10 looks correct to me. Except for the LW instruction, which is a memory load, while you meant loading a constant. The correct assembly for this would be: "li rd, 3435973837" - which is a pseudo-instruction that (given the constant here) would require two instructions

The LW could be from a global stored in the initialised memory section and, hopefully, in the SDATA section and finding a place within +/-2K of the GP register, in which case a single LW will suffice.

Yeah. Of course. But then the instruction would not look like "LW rd, constant". Just a detail, of course. Could have been some pseudo-code and not real code.

Speaking of GP... I am wondering about its use. I got what it was meant for, but I don't think I've seen it used in code I have compiled so far. How do you make use of it using C and GCC?

Quote from: brucehoult on July 07, 2021, 11:58:56 pm

It seems more illuminating (to me) to give that constant as 0xCCCCCCCD.

Ah, that's all a matter of taste.

brucehoult · « **Reply #14 on:** July 08, 2021, 05:03:15 am »

Quote from: SiliconWizard on July 08, 2021, 12:17:00 am

Speaking of GP... I am wondering about its use. I got what it was meant for, but I don't think I've seen it used in code I have compiled so far. How do you make use of it using C and GCC?

It's usually just automatic. You can force the issue using __attribute__((section("foo"))) if you want, but if there's less than 4K of global data then it should just happen.

Code: [Select]

#include <stdint.h>

uint32_t div10_mul = 0xCCCCCCCD;

uint32_t div10(uint32_t V){
  return (V * (uint64_t)div10_mul) >> 35;
}

int main(int argc, char **argv){
  return div10(argc);
}

Code: [Select]

00010106 <div10>:
   10106:       c301a783                lw      a5,-976(gp) # 11818 <div10_mul>
   1010a:       02a7b533                mulhu   a0,a5,a0
   1010e:       810d                    srli    a0,a0,0x3
   10110:       8082                    ret

:
:

Disassembly of section .sdata:

00011810 <_global_impure_ptr>:
   11810:       13e8                    addi    a0,sp,492
   11812:       0001                    nop

00011814 <__dso_handle>:
   11814:       0000                    unimp
        ...

00011818 <div10_mul>:
   11818:       cccd                    beqz    s1,118d2 <__BSS_END__+0x96>
   1181a:       cccc                    sw      a1,28(s1)

0001181c <_impure_ptr>:
   1181c:       13e8                    addi    a0,sp,492
   1181e:       0001                    nop

You do have to look at the linked program to get this, not just at the .o

gf · « **Reply #15 on:** July 08, 2021, 03:12:49 pm »

That's what llvm generates: https://godbolt.org/z/fna6cr3We

SiliconWizard · « **Reply #16 on:** July 08, 2021, 05:06:38 pm »

Quote from: brucehoult on July 08, 2021, 05:03:15 am

Quote from: SiliconWizard on July 08, 2021, 12:17:00 am
Speaking of GP... I am wondering about its use. I got what it was meant for, but I don't think I've seen it used in code I have compiled so far. How do you make use of it using C and GCC?

It's usually just automatic. You can force the issue using __attribute__((section("foo"))) if you want, but if there's less than 4K of global data then it should just happen.
(...)
You do have to look at the linked program to get this, not just at the .o

It does happen at link-time. But saying "it's automatic" is misleading. There's more to know to make this work, and this is why I've never seen gp used while compiling my own code.

So, I had to dig a little deeper.
Two things have to be done:

1. Define the '__global_pointer$' symbol in the linker script. You'll usually take the address in the middle of the 'sdata' section, like this:

Code: [Select]

__global_pointer$ = . + 0x800;Of course, you can locate it in another section if you have reasons to do this for a particular project.

2. Initialize the gp register in the startup code with the address from the '__global_pointer$' symbol.

If the '__global_pointer$' symbol is defined, then the linker will translate data access using gp when appropriate. If it's not defined, then it won't do anything. This is why I never saw this happen; I just only use fully custom linker scripts and didn't know about this.

SiliconWizard · « **Reply #17 on:** July 08, 2021, 05:23:05 pm »

Quote from: gf on July 08, 2021, 03:12:49 pm

That's what llvm generates: https://godbolt.org/z/fna6cr3We

So, it does optimize the divide by 10 with code similar to what Nominal Animal suggested.
This is interesting. As I said, whereas it makes sense in most cases, it's not necessarily optimal in all cases. Yes, it's very common that multiply take significantly fewer cycles than divide on typical implementations, but, on a simple implementation where multiplication would be implemented, like division, computing one bit per cycle, this would actually take more cycles than using a single divide instruction.

The "problem" - and this is the same with GCC - is that currently, contrary to most other targets, there are very few, if any, target-specific tuning options. I think SiFIve has added a couple for their own cores, but that's about it. (Of course that makes sense, since there are still very few deployed implementations of RISC-V on the market.) What that means is, if you use default target options (no '-mtune' anything), developers have taken assumptions (admittedly reasonable ones, but which won't cover all cases) regarding the microarchitecture for optimizations. From what I've seen, the default is assumed to be a classic 5-stage pipeline, and, the divide instruction is assumed to be much more expensive than the multiply instruction. Probably a lot of other assumptions. Bruce may know a lot more about this. But bottom line is, some of the default for optimizations may not be all that great for your particular implementation. And I personally am far from knowing GCC's (or LLVM's) internals enough to be able to write specific "tuning" support for a particular microarchitecture.

brucehoult · « **Reply #18 on:** July 09, 2021, 01:36:39 am »

Quote from: SiliconWizard on July 08, 2021, 05:06:38 pm

Quote from: brucehoult on July 08, 2021, 05:03:15 am
Quote from: SiliconWizard on July 08, 2021, 12:17:00 am
Speaking of GP... I am wondering about its use. I got what it was meant for, but I don't think I've seen it used in code I have compiled so far. How do you make use of it using C and GCC?

It's usually just automatic. You can force the issue using __attribute__((section("foo"))) if you want, but if there's less than 4K of global data then it should just happen.
(...)
You do have to look at the linked program to get this, not just at the .o

It does happen at link-time. But saying "it's automatic" is misleading. There's more to know to make this work, and this is why I've never seen gp used while compiling my own code.

It's automatic if you compile and link normal C code using a simple "gcc foo.c -o foo" command.

If you use things such as your own linker script (not modelled off the default one) or -nostartfiles or the like then of course you can break it. But that takes work :-)

brucehoult · « **Reply #19 on:** July 09, 2021, 02:24:54 am »

Quote from: SiliconWizard on July 08, 2021, 05:23:05 pm

Quote from: gf on July 08, 2021, 03:12:49 pm
That's what llvm generates: https://godbolt.org/z/fna6cr3We

So, it does optimize the divide by 10 with code similar to what Nominal Animal suggested.
This is interesting. As I said, whereas it makes sense in most cases, it's not necessarily optimal in all cases. Yes, it's very common that multiply take significantly fewer cycles than divide on typical implementations, but, on a simple implementation where multiplication would be implemented, like division, computing one bit per cycle, this would actually take more cycles than using a single divide instruction.

This is true, but not significantly more -- just two more instructions/clock cycles in the unsigned case for LUI;ADDI;MULHU;SRLI compared to LI;DIVU if multiply and divide both take 32 cycles -- 6%.

Quote

From what I've seen, the default is assumed to be a classic 5-stage pipeline, and, the divide instruction is assumed to be much more expensive than the multiply instruction. Probably a lot of other assumptions. Bruce may know a lot more about this.

I think you would not go far wrong if you assume that RISC-V gcc is by default tuned for Berkeley "Rocket", as used in SiFive FE-310 and FU-540, Kendryte K210, and others. The gcc port was started right at the beginning of the RISC-V project in 2010 and tracked the evolution of the instruction set and the main implementation at Berkeley.

That's, as you say, a classic RISC 5 stage single-issue in-order pipe. Single cycle latency for arithmetic, three cycle penalty for branch misprediction, 2 cycle latency for full word load from cache, 3 cycles for partial word loads, 5 cycle latency for multiplies, divide latency between 2 and XLEN+1. These numbers are all the same for FE310 and FU540 -- divides take twice longer on the 64 bit CPU, while multiplies take the same as the multiplier does 16 bits per cycles instead of 8.

These assumptions should work pretty well on 4/5/6 pipe stage in-order cores from other organisations such as Andes, GigaDevice, PULP.

They should even work pretty well on small 1/2/3 pipe stage cores, other than those perhaps having a slow multiply.

Where it really breaks down is dual-issue cores. Those are just now starting to make their way into real hardware. I have two such boards -- the BeagleV "Starlight" beta board, and the HiFive Unmatched, both with SiFive U74 cores and both received in May this year. There are also the Western Digital SweRV cores, but those are currently FPGA-only, other than WD's own embedded uses.

SweRV and the U74 both try to make the best of code scheduled for single-issue cores by having two different pipe stages with integer ALUs. Two dependent instructions can issue in the same clock cycle if operands are ready for the first one. The second instruction will execute in the "late ALU". If either of the following instructions is dependent on that late ALU instruction then it also will have to execute in the late ALU. It needs four dependent instructions in a row to get a 1 cycle stall.

ARM's A55 has a similar feature. The A53 doesn't.

Out-of-Order cores of course simply absorb these small-scale scheduling differences.

SiliconWizard · « **Reply #20 on:** July 09, 2021, 04:23:51 pm »

Quote from: brucehoult on July 09, 2021, 01:36:39 am

Quote from: SiliconWizard on July 08, 2021, 05:06:38 pm
Quote from: brucehoult on July 08, 2021, 05:03:15 am
Quote from: SiliconWizard on July 08, 2021, 12:17:00 am
Speaking of GP... I am wondering about its use. I got what it was meant for, but I don't think I've seen it used in code I have compiled so far. How do you make use of it using C and GCC?

It's usually just automatic. You can force the issue using __attribute__((section("foo"))) if you want, but if there's less than 4K of global data then it should just happen.
(...)
You do have to look at the linked program to get this, not just at the .o

It does happen at link-time. But saying "it's automatic" is misleading. There's more to know to make this work, and this is why I've never seen gp used while compiling my own code.

It's automatic if you compile and link normal C code using a simple "gcc foo.c -o foo" command.

If you use things such as your own linker script (not modelled off the default one) or -nostartfiles or the like then of course you can break it. But that takes work :-)

Well, unless you build for hosted environments such as Linux (for which GCC's defaults should work fine), or you just use pre-written linker scripts and startup code (such as those provided by SiFive if you use that), custome linker scripts and startup code is pretty much unavoidable.

I just didn't know how the linker handled this. Now I do.

For anyone interested, I suggest the following read, which goes into details:
https://www.sifive.com/blog/all-aboard-part-3-linker-relaxation-in-riscv-toolchain

I'm curious to see what difference it will make with Coremark. The article above talks about Dhrystone (for which it apparently makes a significant difference). Dunno if the difference is as drastic with Coremark. I'll have to test that.

brucehoult · « **Reply #21 on:** July 10, 2021, 02:01:33 am »

Quote from: SiliconWizard on July 09, 2021, 04:23:51 pm

I just didn't know how the linker handled this. Now I do.

This is good :-)

Interestingly, IBM S/360 and subsequent mainframes do exactly the same thing. They also have 12 bit offsets from a base register.

One difference is that on the 360 if you have more than 4K of global variables then a second or third base register is reserved to point to another 4K area. With only 16 registers to start with this can get a bit hairy! The programmer manages this explicitly using assembler directives. It's still the same today in z/OS.

https://www.ibm.com/docs/en/zos/2.1.0?topic=instruction-how-use-using

brucehoult · « **Reply #22 on:** July 10, 2021, 02:12:57 am »

Quote from: SiliconWizard on July 09, 2021, 04:23:51 pm

I'm curious to see what difference it will make with Coremark. The article above talks about Dhrystone (for which it apparently makes a significant difference). Dunno if the difference is as drastic with Coremark. I'll have to test that.

I don't recall if Coremark uses much in the way of global/static variables.

What I do know is it's quite sensitive to code size and L1 icache size. With 16K of icache on the HiFive1 (and with slow cache line loading from SPI flash) you get much better Coremark using the C extension than not using it, because more of the hot code fits into the cache. I also found that turning on -msave-restore (calling special millicode routines to save registers at the start of each function, and another routine to reload them at the end) results in a higher Coremark despite the extra instructions to call and return from the millicode, and also sometimes saving more registers than necessary (because there are only a few save/restore routines to choose from).

I haven't checked whether there is a measurable effect on the Linux-capable SoCs with more L1 cache and faster L2 cache or DRAM to reload from.

brucehoult · « **Reply #23 on:** July 10, 2021, 02:32:19 am »

Quote from: brucehoult on July 10, 2021, 02:01:33 am

Interestingly, IBM S/360 and subsequent mainframes

It's also interesting how IBM uses exactly the same assembler directive to provide access to struct fields via a pointer.

Quote

In the following example, a new element, NEW, is inserted into a doubly linked list between two existing elements LEFT and RIGHT, where the links are stored as pointers LPTR and RPTR:

Code: [Select]
LEFT USING ELEMENT,R3 RIGHT USING ELEMENT,R6 NEW USING ELEMENT,R1 . . MVC NEW.RPTR,LEFT.RPTR Move previous Right pointer MVC NEW.LPTR,RIGHT.LPTR Move previous Left pointer ST R1,LEFT.RPTR Chain new element from Left ST R1,RIGHT.LPTR Chain new element from Right . . ELEMENT DSECT LPTR DS A Link to left element RPTR DS A Link to right element . .

In this example LPTR has offset 0 and RPTR has offset 4, so ST R1,RIGHT.LPTR expands to ST R1,4(R6) etc.

It's a pity if that can't be ST NEW,RIGHT.LPTR -- I assume they would have shown it like that if you could do it.

SiliconWizard · « **Reply #24 on:** July 10, 2021, 04:23:23 pm »

Quote from: brucehoult on July 10, 2021, 02:12:57 am

Quote from: SiliconWizard on July 09, 2021, 04:23:51 pm
I'm curious to see what difference it will make with Coremark. The article above talks about Dhrystone (for which it apparently makes a significant difference). Dunno if the difference is as drastic with Coremark. I'll have to test that.

I don't recall if Coremark uses much in the way of global/static variables.

I think it depends on how your configure it. IIRC,yYou can build Coremark for 3 different ways of allocating most of its data: on the stack, statically or on the heap. I guess if using the static configuration (which is the one I have used the most), it should make a difference, although I'm not sure all the data it works on can fit within the 12-bit offset for GP. So I'll have to see what happens at link-time.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: Funnel Shifter: where is useful? (Read 6414 times)

Share me