GK110 added a 64-bit “funnel shift” instruction that may be accessed with the following intrinsics:
- funnelshift_lc(): returns most significant 32 bits of a left funnel shift.
- funnelshift_rc(): returns least significant 32 bits of a right funnel shift.
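For anyone who wants to play with the semantics on a PC, here is a minimal C model of the two clamped intrinsics. It is a sketch based on my reading of the description above, not NVIDIA's code: the names funnelshift_lc_model and funnelshift_rc_model are made up, and the (lo, hi, shift) argument order and the clamping of the shift count to 32 are assumptions about how the real intrinsics behave.

```c
#include <stdint.h>

/* Model of a clamped LEFT funnel shift (assumed semantics):
   build the 64-bit value hi:lo, shift it left by min(shift, 32),
   and return the most significant 32 bits. */
uint32_t funnelshift_lc_model(uint32_t lo, uint32_t hi, uint32_t shift)
{
    uint32_t n = shift < 32 ? shift : 32;      /* clamp the shift count */
    uint64_t v = ((uint64_t)hi << 32) | lo;    /* concatenate hi:lo */
    return (uint32_t)((v << n) >> 32);         /* upper 32 bits */
}

/* Model of a clamped RIGHT funnel shift (assumed semantics):
   shift hi:lo right by min(shift, 32) and return the least
   significant 32 bits. */
uint32_t funnelshift_rc_model(uint32_t lo, uint32_t hi, uint32_t shift)
{
    uint32_t n = shift < 32 ? shift : 32;      /* clamp the shift count */
    uint64_t v = ((uint64_t)hi << 32) | lo;    /* concatenate hi:lo */
    return (uint32_t)(v >> n);                 /* lower 32 bits */
}
```

With hi = 0x01234567 and lo = 0x89ABCDEF, a left shift by 8 gives 0x23456789 and a right shift by 8 gives 0x6789ABCD, i.e. a 32-bit window sliding over the 64-bit value.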
If I have got it correctly, in simple terms: on a 32-bit machine, a "funnel shifter" is a 64-bit shift with 32-bit operands and results, because the registers are all 32-bit. On a 64-bit machine, a "funnel shifter" is a 128-bit shift, and so on.
I fail to understand where/when this stuff is useful :-//
Um... Your own wording already contains the answer to that question. These intrinsics are useful when you need to "losslessly" shift a value that is wider than the widest type directly supported by your implementation. C and C++ provide no operations that would allow you to efficiently implement such shifts by yourself. Hence the need for intrinsics. That's what intrinsics are for.
That's all there is to it.
Every time you'd need to do 2N-bit shifting on an N-bit CPU. Just think of how many instructions are typically required to do this if you don't have funnel shifters. With a funnel shifter, that's just one instruction. (Edit: on architectures which support carry-in and carry-out in their shifters, that can obviously be done with "only" 2 instructions. But on those that do not, for instance RISC-V, it will require significantly more than 2.)
It's not *that* many instructions.
On a 32 bit machine, (A concat B) >> C is just (A << (32-C)) | (B >> C) which is at most three instructions if C is a constant or four instructions if C is in a register and you know it's between 1 and 31.
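Spelled out in C, the expression above looks something like the sketch below. The function name funnel_shr32 is just for illustration. The extra test for C == 0 is there because shifting a 32-bit value by 32 is undefined behaviour in C, which is exactly why the "between 1 and 31" condition matters.

```c
#include <stdint.h>

/* Low 32 bits of (a concat b) >> c, using only 32-bit operations.
   a is the upper word, b is the lower word, c must be in 0..31. */
uint32_t funnel_shr32(uint32_t a, uint32_t b, unsigned c)
{
    if (c == 0)
        return b;                       /* avoid the undefined a << 32 */
    return (a << (32 - c)) | (b >> c);  /* subtract, two shifts and an OR */
}
```

If the compiler can prove that C is never 0, the branch goes away and you are left with the three or four instructions mentioned above.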
Every current (and almost every planned) RISC-V integer instruction has either two source registers or one register and one constant. Funnel shift needs three inputs.
Just how common does *your* algorithm expect funnel shift to be? Enough that it's worth building a register file with three read ports, just for that instruction?
For most people, it's not going to be anywhere near common enough to be worth it.
That of course would be an issue, but FP instructions do have 3 sources IIRC, so some performance-oriented CPU can do this with integer instructions as well. Of course this has an extra cost.
I noticed that the RISC-V instruction set specification explicitly recommends the sequence MULH rdh, rs1, rs2 ; MUL rdl, rs1, rs2 for the register×register → register+register multiplication, so that microarchitectures can fuse the two operations if they deem it worthwhile.
(Nice; definitely works for my use cases. I don't need it to be a single instruction; two is well within the cost I'm willing to pay.)
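For reference, this is the kind of C that needs both halves of the product. It is only a sketch: the product64 type and the mul32x32 name are made up for illustration, and I am assuming (not having checked every compiler) that an RV32 compiler with the M extension turns it into the MULHU/MUL pair.

```c
#include <stdint.h>

/* Hypothetical holder for the two halves of a 64-bit product. */
typedef struct {
    uint32_t hi;   /* upper 32 bits, what MULHU would produce */
    uint32_t lo;   /* lower 32 bits, what MUL would produce */
} product64;

/* Full 32x32 -> 64 bit unsigned multiplication, returned as two words. */
product64 mul32x32(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * b;          /* widen before multiplying */
    product64 r = { (uint32_t)(p >> 32), (uint32_t)p };
    return r;
}
```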
Why is there no similar pair for shifting? Is it just not common enough to warrant it?
As I said earlier, that would be a one-position shifter.
No, I meant as in "Why is there no similar suggestion of an instruction pattern, for funnel shifting", for variable bit shift counts. And then I answered myself with "because the instruction set has multi-bit rotates, which mean you don't need pairs of instructions for shifting".
Otherwise, the piece of code you gave for dividing by 10 looks correct to me, except for the LW instruction, which is a memory load, whereas you meant to load a constant. The correct assembly for this would be "li rd, 3435973837", a pseudo-instruction that (given the constant here) requires two instructions.
The LW could be from a global stored in the initialised memory section and, hopefully, in the SDATA section and finding a place within +/-2K of the GP register, in which case a single LW will suffice.
It seems more illuminating (to me) to give that constant as 0xCCCCCCCD.
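To see why that constant works: 0xCCCCCCCD is 2^35 / 10 rounded up (10 × 0xCCCCCCCD = 2^35 + 2), so multiplying by it and shifting right by 35 overestimates V/10 by less than 1/40 for any 32-bit V, which is never enough to push the result past the next integer. A small sketch for checking this on a PC, with made-up names div10_magic and div10_matches:

```c
#include <stdint.h>

/* Divide by 10 with a multiply and a shift.
   0xCCCCCCCD == ceil(2^35 / 10). */
uint32_t div10_magic(uint32_t v)
{
    return (uint32_t)(((uint64_t)v * 0xCCCCCCCDu) >> 35);
}

/* Returns 1 if the shortcut agrees with ordinary division for v. */
int div10_matches(uint32_t v)
{
    return div10_magic(v) == v / 10;
}
```

Looping div10_matches over all 2^32 inputs takes only a few seconds on a desktop machine, if you want to confirm it exhaustively.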
Speaking of GP... I am wondering about its use. I got what it was meant for, but I don't think I've seen it used in code I have compiled so far. How do you make use of it using C and GCC?
#include <stdint.h>

/* Multiplier for division by 10: 2^35 / 10, rounded up */
uint32_t div10_mul = 0xCCCCCCCD;

uint32_t div10(uint32_t V){
    return (V * (uint64_t)div10_mul) >> 35;
}

int main(int argc, char **argv){
    return div10(argc);
}
00010106 <div10>:
10106: c301a783 lw a5,-976(gp) # 11818 <div10_mul>
1010a: 02a7b533 mulhu a0,a5,a0
1010e: 810d srli a0,a0,0x3
10110: 8082 ret
:
:
Disassembly of section .sdata:
00011810 <_global_impure_ptr>:
11810: 13e8 addi a0,sp,492
11812: 0001 nop
00011814 <__dso_handle>:
11814: 0000 unimp
...
00011818 <div10_mul>:
11818: cccd beqz s1,118d2 <__BSS_END__+0x96>
1181a: cccc sw a1,28(s1)
0001181c <_impure_ptr>:
1181c: 13e8 addi a0,sp,492
1181e: 0001 nop
It's usually just automatic. You can force the issue using __attribute__((section("foo"))) if you want, but if there's less than 4K of global data then it should just happen.
(...)
You do have to look at the linked program to get this, not just at the .o
__global_pointer$ = . + 0x800;
Of course, you can locate it in another section if you have reasons to do this for a particular project.
It does happen at link-time. But saying "it's automatic" is misleading. There's more to know to make this work, and this is why I've never seen gp used while compiling my own code.
That's what llvm generates: https://godbolt.org/z/fna6cr3We
So, it does optimize the divide by 10 with code similar to what Nominal Animal suggested.
This is interesting. As I said, while it makes sense in most cases, it's not necessarily optimal in all cases. Yes, it's very common for multiply to take significantly fewer cycles than divide on typical implementations. But on a simple implementation where multiplication, like division, is computed one bit per cycle, this would actually take more cycles than a single divide instruction.
From what I've seen, the default is assumed to be a classic 5-stage pipeline, and the divide instruction is assumed to be much more expensive than the multiply instruction. There are probably a lot of other assumptions. Bruce may know a lot more about this.
It's automatic if you compile and link normal C code using a simple "gcc foo.c -o foo" command.
If you use things such as your own linker script (not modelled off the default one) or -nostartfiles or the like then of course you can break it. But that takes work :-)
I just didn't know how the linker handled this. Now I do. :)
I'm curious to see what difference it will make with Coremark. The article above talks about Dhrystone (for which it apparently makes a significant difference). Dunno if the difference is as drastic with Coremark. I'll have to test that.
Interestingly, IBM S/360 and subsequent mainframes
In the following example, a new element, NEW, is inserted into a doubly linked list between two existing elements LEFT and RIGHT, where the links are stored as pointers LPTR and RPTR:
LEFT USING ELEMENT,R3
RIGHT USING ELEMENT,R6
NEW USING ELEMENT,R1
.
.
MVC NEW.RPTR,LEFT.RPTR Move previous Right pointer
MVC NEW.LPTR,RIGHT.LPTR Move previous Left pointer
ST R1,LEFT.RPTR Chain new element from Left
ST R1,RIGHT.LPTR Chain new element from Right
.
.
ELEMENT DSECT
LPTR DS A Link to left element
RPTR DS A Link to right element
.
.
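For those who don't read S/360 assembler, the same insertion looks roughly like this in C. This is just my translation of the four instructions above: the struct mirrors the ELEMENT DSECT, the name insert_between is made up, and the three pointer arguments play the role of the base registers R3, R6 and R1.

```c
#include <stddef.h>

/* Mirrors the ELEMENT DSECT: two link fields. */
struct element {
    struct element *lptr;   /* link to left element */
    struct element *rptr;   /* link to right element */
};

/* Insert elem between left and right, in the same order as the
   MVC / MVC / ST / ST sequence. */
void insert_between(struct element *left, struct element *right,
                    struct element *elem)
{
    elem->rptr = left->rptr;    /* move previous right pointer */
    elem->lptr = right->lptr;   /* move previous left pointer */
    left->rptr = elem;          /* chain new element from left */
    right->lptr = elem;         /* chain new element from right */
}
```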
I don't recall if Coremark uses much in the way of global/static variables.