Author Topic: Arithmetic operation caches (Read 1823 times)

SiliconWizard · « **on:** May 21, 2020, 04:26:06 pm »

I was just toying with the idea of implementing caches for arithmetic operations that last for several cycles, and was wondering if they could be worth considering.

Then while looking up if that idea had already been considered, I found a patent (damn! But it's expired): https://patents.google.com/patent/US5260898A/en

What do you guys think? Could it have any real-life benefit in your opinion, has any of you ever implemented that, and/or have you seen a real processor implementing it?

0db · « **Reply #1 on:** May 21, 2020, 04:48:19 pm »

Quote from: SiliconWizard on May 21, 2020, 04:26:06 pm

Could it have any real-life benefit ?

No.
The Cache is evil

ataradov · « **Reply #2 on:** May 21, 2020, 06:04:17 pm »

There is zero need for that. Compilers are already smart enough to hold on to the results that took a long time to obtain.

SiliconWizard · « **Reply #3 on:** May 21, 2020, 06:57:30 pm »

Well, not really. Compilers can indeed reuse results when they can statically infer that from source code - which is not the same. A lot happens that can't be inferred statically.
(Your argument would almost be akin to saying OoO execution is useless because compilers can already order instructions in the best manner possible.)

Now whether similar or identical operations are actually performed on average on a given processor often enough that it would make a difference, I have no clue. Guess it would have to be benchmarked, which they seem to have done reading the linked patent.

Speaking of OoO, I guess it's probably a lot more effective on average to keep FPU/ALU pipelines reasonably full, rather than trying to reuse old results with caches. Still wondering whether it could have any sizeable benefit. Also wondering whether this Sun patent was ever put to use.

ataradov · « **Reply #4 on:** May 21, 2020, 07:06:45 pm »

Ok, good point. I tied to think of scenarios where cache like this would be useful, and it is hard to imagine any. Realistically the same operations would be performed only if the data is the same. And usually when you know that algorithm will do the same operation on the same data, you tend to optimize it in the code.

Also, if had any chance of improving things, you bet Intel would have though of that.

It is also hard to imagine a cache structure, given it will have to take two arbitrary numbers. Regular caches allow for a good CAM implementation because the structure of the address is predictable. ALU inputs are totally random.

asmi · « **Reply #5 on:** May 21, 2020, 08:25:09 pm »

In RV32M mulh/mul instruction pair over same operands is required if you need to obtain a full 64bit multiplication result, of course it doesn't make any sense to do the multiplication twice, so you can cache last result and return the other half of it during second instruction's execution. Also division instruction is often implemented such that both quotient and remainder are calculated at the same time, so you can cache it too for div/rem instructions pair. Same applies for RV64M versions of these instructions.

ataradov · « **Reply #6 on:** May 21, 2020, 08:26:38 pm »

But that is not a cache, really. It is just a simple logic for specific instructions. It does not care about register values, just the register indices.

Berni · « **Reply #7 on:** May 21, 2020, 08:27:49 pm »

Well most computations are really really quick in the first place.

For example if you have a modern CPU with AVX512 support you can multiply 32 pairs of 16bit numbers in just a single cycle. The more difficult part is actually feeding it data so quickly, so for that reason it has a data memory cache.

A cache large enough to be useful would also likely take up more die area than simply adding an extra instance of the computation logic and so doubling performance right there. Perhaps a cache might provide more performance for things like divide operations that take many cycles, but might still might call for an unpractically large cache and an unpractically large amount of logic to quickly look up the result. Also algorythms that need to run fast tend to avoid division whenever possible in the first place.

That being said some math operations in CPUs do make use of built in lookup tables to speed things up. There is a famous blunder from Intel where the early Pentium CPUs had a mistake in one of these tables and this caused the floating point divide operation to give a ever so slightly wrong result (A few decimal places down) with certain input combinations.

asmi · « **Reply #8 on:** May 21, 2020, 08:34:04 pm »

Quote from: ataradov on May 21, 2020, 08:26:38 pm

But that is not a cache, really. It is just a simple logic for specific instructions. It does not care about register values, just the register indices.

Yes it is a cache by the very definition of it, even if simple one. In my implementation I stored operand values and results for the last mul and div operations, so my implementation doesn't care about the order of these instructions, or even if they follow each other or separated by a ton of non-M code (because the cache is local to M extension implementation). So indices do not matter, only actual operands do. There is nothing in the spec which would prohibit such behavior.

hamster_nz · « **Reply #9 on:** May 21, 2020, 10:07:42 pm »

If you were to implement this, you would only want to look at caching results that take multiple cycles to obtain.

I suspect you would be better off to fuse instructions. It would be much more likely to have a divide/remainder pair, or maybe a mult/add pair for a MAC operation than to repeat the same calculation.

One slight level of caching would be if your multiply/divide unit could remember the last operands and then if the operands are the same, short-circuit to republishing the same result with the tag of the new request. That would be a very cheap option.

Here's an aside from the RISC-V ISA:

Quote

If both the high and low bits of the same product are required, then the recommended code sequence is: MULH/S/U rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in same order and rdh cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies

This provides a nice way to put hidden fingerprints in binaries, by allow you to see this pattern and swap instruction ordering over to encode '1's and '0's... same for all the different versions of the RV32I NOP you can pick - just don't put it in the performance sensitive parts!

SiliconWizard · « **Reply #10 on:** May 21, 2020, 10:20:41 pm »

Quote from: hamster_nz on May 21, 2020, 10:07:42 pm

If you were to implement this, you would only want to look at caching results that take multiple cycles to obtain.

Yes, of course. As I said, it would be for operations that take more than one cycle. I suggest reading the patent to have a better idea.

Quote from: hamster_nz on May 21, 2020, 10:07:42 pm

I suspect you would be better off to fuse instructions. It would be much more likely to have a divide/remainder pair, or maybe a mult/add pair for a MAC operation than to repeat the same calculation.

That's a use case, but that's assuming related/close operations are more likely than just repeated operations in general. That's possible, but that mainly what I don't know on average.

Quote from: hamster_nz on May 21, 2020, 10:07:42 pm

One slight level of caching would be if your multiply/divide unit could remember the last operands and then if the operands are the same, short-circuit to republishing the same result with the tag of the new request. That would be a very cheap option.

Well, that's the idea. Only that what you describe would be a 1-entry cache only, but this is it. Of course it is effective only for repeated operations with no other operation in between.

Add more entries, and you get your cache, which granted would take more resources but is a bit more likely to be useful.

And as I said earlier, now with superscalar, OoO micro-architectures, caches like this are probably useless. But for simpler micro-archs, I'm wondering if they could add some performance while still taking fewer resources than a full superscalar, OoO design.

Someone · « **Reply #11 on:** May 21, 2020, 11:51:25 pm »

Quote from: Berni on May 21, 2020, 08:27:49 pm

For example if you have a modern CPU with AVX512 support you can multiply 32 pairs of 16bit numbers in just a single cycle.

To be accurate its not multiplied in a single cycle, but the vector processor (under ideal circumstances) can accept new data every clock cycle. Latency (while impressive) is several clock cycles. But as you mentioned getting the data to the correct registers can be the bottleneck.

Berni · « **Reply #12 on:** May 22, 2020, 05:55:24 am »

Quote from: Someone on May 21, 2020, 11:51:25 pm

Quote from: Berni on May 21, 2020, 08:27:49 pm
For example if you have a modern CPU with AVX512 support you can multiply 32 pairs of 16bit numbers in just a single cycle.
To be accurate its not multiplied in a single cycle, but the vector processor (under ideal circumstances) can accept new data every clock cycle. Latency (while impressive) is several clock cycles. But as you mentioned getting the data to the correct registers can be the bottleneck.

Well yeah most things in modern x86 CPUs are not true single cycle latency due to all the pipelining for reaching the multi GHz speeds. But in general most CPUs that have hardware multiply instructions can do it just as fast as the other super quick operations such as sum,OR,AND,bitshift...etc. Even something as crappy and ancient as a PIC18F does single cycle multiply (Only 8 bit tho)

Perhaps instructions like Sqrt or Sine might make sense to cache. They only take 1 input so its more likely to get a cache hit while generally taking even more cycles than floating divide. But no idea how heavily they are used in typical workloads, id imagine sqrt being popular.

MK14 · « **Reply #13 on:** May 22, 2020, 11:29:18 pm »

Quote from: SiliconWizard on May 21, 2020, 04:26:06 pm

I was just toying with the idea of implementing caches for arithmetic operations that last for several cycles, and was wondering if they could be worth considering.

Then while looking up if that idea had already been considered, I found a patent (damn! But it's expired): https://patents.google.com/patent/US5260898A/en

What do you guys think? Could it have any real-life benefit in your opinion, has any of you ever implemented that, and/or have you seen a real processor implementing it?

Somewhat often, the incoming (to be processed) data, into a function (or similar), is only BYTE sized. Such as TEXT files/arrays, and/or other smallish values (e.g. Dates 1..31, Ages 1..150, Most MPH Car speeds).
So, consider an encryption routine (or similar).
Let's say (hypothetical, i.e. NOT correct in real life) you take each BYTE of an array or file, and perform...
EncryptionHash += (((CHAR_BYTE x 68) + 12) / 112) and 0xFF. (Pseudocode).
On the one hand, this will take (let's assume DOUBLE's are used), many cycles. E.g. The divide alone, might take 50 cpu cycles, especially on an older cpu. (Ignoring that pipelining/superscaler etc, may slightly/largely eliminate, that delay/time).

So, there are only 256 (char) possible values to calculate (to fill the cache), even if the text array/file is millions of bytes long. Hence (no arithmetic cache), might take 75 (cycles per byte) x 2,000,000 (text bytes to encrypt), = 150 million cpu cycles.
With the arithmetic cache, it might take 1 (cycle, if cache is fast enough) x 2,000,000 = 2 million cycles.
= x75 speedup.
N.B. Ignoring superscaler/pipelining/other-stuff etc, which may significantly change these figures.

This could apply to many actual programs, it could of been pseudorandom number generation, checksum calculation, hashing, database, weather prediction data (temps in deg C), etc etc.
Even if the variable(s), are considerably bigger than shown above. The actual values, may be only limited, to a tiny subset of the available values.
E.g. House numbers, being hashed for a database. Although the variable might be an INT32 type, to allow for very big house numbers. In practice, they will usually be less than 10,000. So, such a hashing algorithm, may only actually see 5,000 different house number types, while processing, tens of millions of addresses.

This technique is sometimes used in software. Where very time-consuming calculations are done, but where similar/same data, is often repeated. Quick hashed, lookup tables, can be used, for already calculated values.
E.g. Electronic circuit simulation, where the voltage hasn't changed (yet, or it has returned to a previous calculated value, e.g. oscillator) at some nodes, so the calculation(s), have already been done.
Despite the millions (approx, depends on algorithm used and resolution etc), of repeated, small time steps, being calculated.

0db · « **Reply #14 on:** May 23, 2020, 09:48:40 am »

Tensor matrices and Quaternions might have some computational advantages on repetitive multiply operations. But only if numbers happen "symmetrically" filled.

0db · « **Reply #15 on:** May 23, 2020, 09:55:47 am »

Quote from: MK14 on May 22, 2020, 11:29:18 pm

pseudorandom

The Galaxy A71 5G boasts a QRNG (Quantum Random Number Generator) chip for advanced security.
It doesn't cache anything, it actually generates every single pseudonumber

In short, the chip features an LED and a CMOS image sensor, and it relies on the probabilistic nature of quantum physics to create truly random numbers with unpredictable patterns. This is achieved by the CMOS image sensor harnessing the quantum randomness from the shot noise emitted by the LED.

brucehoult · « **Reply #16 on:** May 23, 2020, 10:04:00 am »

Quote from: asmi on May 21, 2020, 08:25:09 pm

In RV32M mulh/mul instruction pair over same operands is required if you need to obtain a full 64bit multiplication result, of course it doesn't make any sense to do the multiplication twice, so you can cache last result and return the other half of it during second instruction's execution. Also division instruction is often implemented such that both quotient and remainder are calculated at the same time, so you can cache it too for div/rem instructions pair. Same applies for RV64M versions of these instructions.

It's pretty natural to implement division in a way that produces both quotient and remainder at the same time, so that one makes sense.

Computing the high bits of a multiplication requires calculating the low bits too, as there can be a carry. However, by far the most common operation is putting the result into the same size of variable as the original operands, in which case calculating the high bits is a considerable waste of energy.

So it makes sense, as in Andrew's RISC-V note, to put the high bits instruction first. It can do all the work, and save the low bits result, and if there is a following low bits instruction it can pick those up. But not the other way around.

SiliconWizard · « **Reply #17 on:** May 23, 2020, 01:45:00 pm »

Quote from: brucehoult on May 23, 2020, 10:04:00 am

Computing the high bits of a multiplication requires calculating the low bits too, as there can be a carry. However, by far the most common operation is putting the result into the same size of variable as the original operands, in which case calculating the high bits is a considerable waste of energy.

So it makes sense, as in Andrew's RISC-V note, to put the high bits instruction first. It can do all the work, and save the low bits result, and if there is a following low bits instruction it can pick those up. But not the other way around.

Agreed!


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

EEVblog Electronics Community Forum

Author Topic: Arithmetic operation caches (Read 1823 times)

SiliconWizard

Arithmetic operation caches

0db

Re: Arithmetic operation caches

ataradov

Re: Arithmetic operation caches

SiliconWizard

Re: Arithmetic operation caches

ataradov

Re: Arithmetic operation caches

asmi

Re: Arithmetic operation caches

ataradov

Re: Arithmetic operation caches

Berni

Re: Arithmetic operation caches

asmi

Re: Arithmetic operation caches

hamster_nz

Re: Arithmetic operation caches

SiliconWizard

Re: Arithmetic operation caches

Someone

Re: Arithmetic operation caches

Berni

Re: Arithmetic operation caches

MK14

Re: Arithmetic operation caches

0db

Re: Arithmetic operation caches

0db

Re: Arithmetic operation caches

brucehoult

Re: Arithmetic operation caches

SiliconWizard

Re: Arithmetic operation caches

Share me