Computing > Programming

Funnel Shifter: where is useful?

**DiTBho**:

--- Quote ---GK110 added a 64-bit “funnel shift” instruction that may be accessed with the following intrinsics:

* funnelshift_lc(): returns most significant 32 bits of a left funnel shift.

* funnelshift_rc(): returns least significant 32 bits of a right funnel shift.

--- End quote ---

If I have got it correctly, in simple terms, given a 32 bit machine, "Funnel Shifter" is a 64-bit shift with 32-bit operands and results, because the registers are all 32-bit. Given a 64 bit machine, "Funnel Shifter" is a 128-bit shift, etc ...

I fail to understand where/when this stuff is useful :-//

**SiliconWizard**:

--- Quote from: DiTBho on July 03, 2021, 03:01:40 pm ---

--- Quote ---GK110 added a 64-bit “funnel shift” instruction that may be accessed with the following intrinsics:

* funnelshift_lc(): returns most significant 32 bits of a left funnel shift.

* funnelshift_rc(): returns least significant 32 bits of a right funnel shift.

--- End quote ---

If I have got it correctly, in simple terms, given a 32 bit machine, "Funnel Shifter" is a 64-bit shift with 32-bit operands and results, because the registers are all 32-bit. Given a 64 bit machine, "Funnel Shifter" is a 128-bit shift, etc ...

--- End quote ---

Yep; that's what it is.

--- Quote from: DiTBho on July 03, 2021, 03:01:40 pm ---I fail to understand where/when this stuff is useful :-//

--- End quote ---

Everytime you'd need to do 2N-bit bit shifting on a N-bit CPU. Just think of how many instructions are typically required to do this if you don't have funnel shifters. With a funnel shifter, that's just one instruction. (Edit: on architectures which support carry-in and carry-out in their shifters, that can obviously be done with 2 instructions "only". But on those, for instance RISC-V, that do not, that will require significantly more than 2.)

Apart from shifting 2N-bit numbers, you can also easily implement N-bit rotates with this. You just use the same value for both operands, and shift.

**Nominal Animal**:

--- Quote from: DiTBho on July 03, 2021, 03:01:40 pm ---I fail to understand where/when [funnel shifter] is useful :-//

--- End quote ---

* Bit stream operations.

Many formats (most compressed video streams for example) consist of bit fields not aligned to byte or word boundaries. Using native register sized units and funnel shifting, any field with up to native register width can be extracted using a single funnel shift and a binary AND. To insert, a binary AND (to ensure all bits beyond the value are zero) followed by a single funnel shift and two binary OR operations are needed. Even though this seems more work than just using 8-bit loads and stores, the native register sized accesses with a two-register variable-sized funnel shift is much faster on 32-bit and wider architectures.

* Bignum arithmetic.

The double-sized case –– two registers describing an unsigned integer –– is ubiquitous.

For example, if you have N-bit unsigned integer registers, and a N×N→2N-bit multiplication like e.g. x86 and x86-64 have, you can convert most divisions by an integer constant into a single multiplication followed by a funnel shift. (Some constants need an additional addition to the register pair before the funnel shift.)

(The only case where the funnel shift would be N is when the division is by a power of two, and those are trivially implemented via a shift themselves. So, in practice, the tunnel shift used here is arithmetic right shift by K buts, 1 ≤ K < N.)

* Larger than register sized fixed point arithmetic.

Fixed point arithmetic with Q bits reserved for the fractional part, with total size larger than a single register, heavily relies on funnel shifting.

On 32 bit architectures with two-register funnel shifter (multi-bit shifts) and a N×N→2N-bit multiplication, Q31.32 fixed point format has range at least -2147483647.999999999 .. +2147483647.999999999 with a resolution of 0.000000000232831 or so.

Multiplying two such numbers requires four multiplications and six additions. If you use the highest bit for sign flag, you need to branch. If you use two's complement format, one of the multiplications is unsigned-unsigned, one is signed-signed, and two are signed-unsigned. Usually, we only have unsigned-unsigned multiplication and signed-signed multiplication, but fortunately, the "fix" for the mixed case is a single-bit funnel shift – i.e., needed twice for each double-register wide fixed point multiplication operation (or once, if you have two extra registers to store the intermediate result in before fixing them, I guess).

If you count the operations and look at the format properties, you'll notice that for the vast majority of use cases, this well outperforms IEEE-754 Binary32 AKA 'float', especially if the architecture provides N×N→2N-bit multiplication and at least one-bit funnel shifting (to fix the signed-unsigned limb multiplications).

I'm sure there are many more, but the above are the ones I immediately recall using and needing funnel shifting for.

(As you can tell, I also heavily benefit from architectures providing a hardware N×N→2N-bit signed and unsigned integer multiplication instructions, too. I don't know what it is called, but I consider it related to ordinary register×register=register multiplication the same way funnel shifts relate to bit shifts within a single register.)

**TheCalligrapher**:

--- Quote from: DiTBho on July 03, 2021, 03:01:40 pm ---I fail to understand where/when this stuff is useful :-//

--- End quote ---

Um... Your own wording already contains the answer to that question. These intrinsics are useful when you need to "losslessly" shift a value that is wider than the widest type directly supported by your implementation. C and C++ provide no operations that would allow you to efficiently implement such shifts by yourself. Hence the need for intrinsics. That's what intrinsics are for.

That's all there is to it.

**T3sl4co1l**:

--- Quote from: TheCalligrapher on July 06, 2021, 12:41:29 am ---Um... Your own wording already contains the answer to that question. These intrinsics are useful when you need to "losslessly" shift a value that is wider than the widest type directly supported by your implementation. C and C++ provide no operations that would allow you to efficiently implement such shifts by yourself. Hence the need for intrinsics. That's what intrinsics are for.

That's all there is to it.

--- End quote ---

Put another way, you can write monstrous shift-and-mask expressions, and pray that the compiler knows what you're trying to do; you're entirely at the mercy of the optimizer, and how much it's able to infer about your expression. Sometimes it may simply give up and write out the naive literal implementation using basic instructions (as if -O0). (I have seen this in avr-gcc for example, which is alas notoriously unoptimized among GCC branches; x86 and ARM I expect -- I hope -- should be able to figure these things out.)

Even for a powerful optimizer, some inferences may simply be missing, or the instruction simply has fundamentally different semantics from C expressions, thus being unreachable by them. In which case, you have no choice but to use the intrinsic, to force the use of that instruction. This may have its own downsides, as the optimizer may not be able to reason about these oddball instructions very well.

So, in short: you're likely using them when optimizing for speed, later in development, after functionality has been nailed down and tested. Intrinsics are nonstandard and platform-dependent, so you will need to repeat these optimizations for each supported compiler and target machine (usually a pile of #defines to select appropriate expressions/statements/functions for each).

Note that a given compiler might support the intrinsic across multiple targets, but fill in with an equivalent library operation when it is unavailable -- effectively emulating the instruction, which may come with additional overhead because of how library code gets inserted. Which is to say, just because an intrinsic might be available, doesn't mean it's a guaranteed win, you still need to do your due diligence (testing, profiling).

Tim

Navigation

[0] Message Index

[#] Next page

Go to full version