Author Topic: "Intermediate precision" floating point? (Read 2438 times)

westfw · « **on:** January 21, 2022, 03:06:10 am »

I've always wondered whether there's a common hack for getting more than 24 bits of mantissa precision out of 32-bit floating point functions.
64-bit floats have 53 bits of mantissa, but it frequently seems like 48 bits (2 * 32-bit FP numbers?) would be plenty. Maybe even a few less (36 bits should give you a "comfortable" 10 decimal digits, right?) Especially if it meant being able to use the existing 32-bit float functions, whether those are SW or HW (suddenly shifting to SW "double" functions on a CPU with 32-bit float hardware is depressing!)
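
As a quick check of the "36 bits gives about 10 decimal digits" estimate, here is a small sketch (Python is used only for illustration, and `decimal_digits` is a made-up helper name). It relies on the usual rule that each mantissa bit is worth log10(2), roughly 0.301, decimal digits.

```python
import math

def decimal_digits(mantissa_bits: int) -> float:
    """Approximate decimal digits carried by a binary mantissa of this width."""
    return mantissa_bits * math.log10(2)

# Widths mentioned in the post: single, the two "intermediate" ideas, double.
for bits in (24, 36, 48, 53):
    print(f"{bits} bits -> {decimal_digits(bits):.2f} decimal digits")
```

By this rule 36 bits lands a little under 11 digits, so the "comfortable 10" guess looks right.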

It seems mathematically possible, but it also seems like it would be especially easy to screw up the "boundary" and lose the precision you were trying to get. :-(

SiliconWizard · « **Reply #1 on:** January 21, 2022, 03:44:59 am »

You could always devise your own FP format with a different number of bits for mantissa and exponent. Of course, you'll need to implement that yourself. It's not going to be compatible with IEEE 754.

You may be interested in tapered FP: https://en.wikipedia.org/wiki/Tapered_floating_point
in particular, Unums: https://en.wikipedia.org/wiki/Unum_(number_format)
(the last variant of that being Posits.)

T3sl4co1l · « **Reply #2 on:** January 21, 2022, 04:49:41 am »

Not sure how you'd implement it; software emulation of existing routines? (i.e. sub out, or trap, FPU instructions, replacing with calls to your custom implementations instead.) That could work for special-case routines that handle everything in the FPU then dump out the final result; but anything that handles the arguments in memory must immediately drop the modified format. And that's probably a very narrow case where such a method is applicable. And, needless to say, it will run far slower than hardware FP.

If double or better is available (in hardware), you're definitely better off using that, truncating the result if needed.

I guess the use case is, if you only have single hardware available, so anything extended beyond that, will be significant software effort anyway? (double on such a system might not be wholly software implemented -- hardware multiply helps considerably, if nothing else -- but division and transcendental functions likely benefit less?)

Assuming such a library were available, well tested, good performing, and ready to go -- I suppose it would be fine? Presumably it could perform better than double on such a system. But therein lies the problem: every compiler and its uncle has IEEE-754 support. Are you going to write and test that entire library yourself? (And you still need to convert back to common formats, sooner or later -- even if you implement every single libc function that handles floats, you can't account for 3rd party libs. Well, fine, you might avoid those in your own project that no one else ever touches. But if it's your own thing, just go and do it however you see fit?)

As for hacks of functions, if the function is continuous*, you could perhaps take the derivative at a few values around the desired value, and interpolate based on that. Note that this takes on the order of N evaluations to test N extended-precision parameters, times 2^B sample points to (hopefully) win an additional B bits of precision. The samples would probably be arranged in a N-dimensional grid, each axis spaced according to the inverse partial derivative of its parameter, so that on average there's about 1LSB between adjacent results, say.

*In a... somewhat roundabout and complicated manner, since this is a discrete function mimicking a real-valued one. The epsilon-delta definition is still good I think, but rather than letting their values be arbitrary, stipulate that they must cover a region ⊇ sampling region. And that the curvature within that region meets certain assumptions, limited by whatever interpolation method we happen to choose (e.g., instead of multilinear interpolation, Bezier splines could be used, of order, at most, corresponding to the width of the sample space).

Note that, if we assume ideal inputs to the function, and the function itself is perfect, but it loses precision on its output (bits have been rounded out): this is the sense in which we can increase resolution. Compare with the situation of dithering an ADC: the random noise smooths out the quantization noise on average, and since the step sizes (between any given adjacent values) generally vary, we can use this to also reduce DNL. We cannot however reduce INL, which manifests as a systematic error over this process. Note that, for random additive noise, the sample size is 2^(2B), because the standard deviation of the average goes as 1/sqrt(N) for N samples; we might hope to have better luck here (with a complete grid, rather than a random sampling of that space).

So, too, we can only hope to get extended precision with respect to what the function itself represents, or is. If it's supposed to represent some more ideal characteristic (like sin(x) say), it can only be as good as its own curve fits. You can't add higher order polynomial terms to a function that simply doesn't have them. (Well, sort of. A sampling process will have the effect of convolving the function with the filter function over the sampled area -- and that won't be exactly the original polynomial, either. I'm not actually sure what this comes out to, offhand.) We can reduce local errors due to rounding (also assuming the rounding is evenly distributed among every operation between input and output), but the systematic error is not removable.

Interesting thought, I guess, if unfortunate: it's probably pretty clear, this would only ever make sense as a last resort. Like, you have some black-box function that you can't even disassemble or RE any other way than by sampling it for values. I mean, even then, you should be able to collect enough data to reconstruct it; for it to still be a reasonably continuous function (that would be susceptible to the above method), it can't be inordinately complex. (Let's say, an implementation of an elliptic function, which will use some manner of approximation, which we could figure out; compare to a function on the Mandelbrot set, which is anything but continuous! Though, knowing that it's a specific type of discontinuous function, we might still be able to solve for it.)

Tim

brucehoult · « **Reply #3 on:** January 21, 2022, 07:46:01 am »

There are some runtime libraries that provide a floating point format represented as the sum of two IEEE doubles. In particular on IBM POWER and PowerPC. It's pretty ugly.

https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic

You could of course do the same using a pair of single precision values instead.
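
A minimal sketch of the pair-of-singles idea, in Python purely for illustration. Python floats are doubles, so the hypothetical helper `f32` rounds after every operation to mimic single-precision hardware. `two_sum` is Knuth's error-free addition, and `ff_add` is a simple "sloppy" float-float add rather than a fully accurate one. All names are invented for this example.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (a double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_sum(a: float, b: float):
    """Return (s, e) with s = fl(a + b) and s + e == a + b exactly."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

def ff_add(x, y):
    """Add two float-float values, each a (hi, lo) pair of singles."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))   # fold in the low words
    hi = f32(s + e)                 # renormalize so that |lo| << |hi|
    lo = f32(e - f32(hi - s))
    return hi, lo

one = (1.0, 0.0)
tiny = (2.0 ** -30, 0.0)
print(f32(1.0 + 2.0 ** -30) == 1.0)   # plain single precision loses the small term
print(ff_add(one, tiny))              # the pair keeps it in the low word
```

The "boundary" worry from the first post shows up in the renormalization step: if the low word is not kept much smaller than the high word, the extra bits quietly disappear.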

If you have hardware integer 32x32->64 (or upper and lower halves) and fast shifts and CLZ then you could fairly efficiently implement a format with 32 bit mantissa and 8 or 16 bits of exponent. Microsoft BASIC on early 8 bit micros used 40 bit floating point -- though of course without hardware multiply support.
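
A rough sketch of how such a format's multiply might look, assuming a made-up layout of (sign, exponent, 32-bit mantissa) with the mantissa's top bit always set. Python integers stand in for the 32x32->64 hardware multiply, and `int.bit_length()` stands in for CLZ. The names `sf_from_float`, `sf_to_float` and `sf_mul` are hypothetical, and exponent range limits (8 or 16 bits) are not enforced here.

```python
import math

MANT_BITS = 32  # mantissa width of this made-up format

def sf_from_float(x: float):
    """Convert a double to (sign, exp, mant); value = sign * mant * 2**exp."""
    if x == 0.0:
        return (1, 0, 0)
    frac, e = math.frexp(abs(x))           # abs(x) = frac * 2**e, 0.5 <= frac < 1
    mant = int(frac * (1 << MANT_BITS))    # truncate to 32 bits, top bit set
    return (-1 if x < 0 else 1, e - MANT_BITS, mant)

def sf_to_float(v) -> float:
    sign, exp, mant = v
    return sign * math.ldexp(mant, exp)

def sf_mul(a, b):
    """Multiply with one 32x32->64 integer product and a normalizing shift."""
    if a[2] == 0 or b[2] == 0:
        return (1, 0, 0)
    prod = a[2] * b[2]                     # 63 or 64 significant bits
    shift = prod.bit_length() - MANT_BITS  # bit_length plays the role of CLZ
    mant = (prod + (1 << (shift - 1))) >> shift   # round to nearest, ties up
    exp = a[1] + b[1] + shift
    if mant >> MANT_BITS:                  # rounding carried out of 32 bits
        mant >>= 1
        exp += 1
    return (a[0] * b[0], exp, mant)

print(sf_to_float(sf_mul(sf_from_float(1.5), sf_from_float(2.5))))
```

Addition is the harder operation in a format like this, since it needs operand alignment and renormalization after cancellation; that is where the fast shifts and CLZ would matter most.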

Whales · « **Reply #4 on:** January 21, 2022, 10:36:50 am »

Crazy thought #1: Extra one bit accuracy at 2x time cost?
You might theoretically be able to get a single extra bit's worth of accuracy by splitting each of your operands into two almost-exact halves (h1 = A/2, h2 = A-h1), running the operations on each half set of operands independently and then summing them together to get the final number. Total cost is approximately 2x computation time, which is better than having to revert to software algorithms, but still pretty shitty for the amount of benefit. This probably also requires specific rounding modes to work?

Crazy thought #2: Throw noise at the problem
Maybe it's worth attacking this statistically by running the same calculations several times, adding a different small amount of tiny noise/error to the start of each one, then averaging the final results. There is probably even proper research into techniques like this, but I don't know what you would call them (maybe Monte Carlo related?)

I can't quite recall, is there something similar to this in ADCs and DACs (to make the noise floor across all frequencies flat?). ie if your ADC had no noise on the input then there would be weird artefacts in its readings.

Fun sidenote: x86 floating point has a set of instructions called x87 where all calculations are done in an 80-bit floating point format internally, then rounded when you copy the numbers out elsewhere. I think I read once that this caused some interesting issues with reproducibility on multitasking operating systems (when context switching to another process: only a lower precision version of the FPU register states was saved to be later restored?).

Kleinstein · « **Reply #5 on:** January 21, 2022, 10:55:56 am »

AFAIR the old MS basic on the C64 used 48 bits.
For an 8-bit micro the 48-bit format may be nice in a few places where the single format is not sufficient. It is not that often that one needs the full 64 bits, and 32-bit floating point is sometimes a bit short, e.g. with frequency measurements or when the math algorithm loses precision / needs higher resolution for an intermediate result.
When doing the math in SW yourself, one is free to use whatever fits, like 56-bit integers or 24-bit fixed point or a 16-bit FP format.

Still, 32-bit µCs would have trouble with alignment. If there is an FPU, one could in theory use double and leave out the lower 16 bits when memory is at a premium. This should still give more than the 24-bit mantissa of the single format. However, this does not help with alignment, and not many 8- or 16-bit CPUs come with an FPU.
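
The "store a double without its lower 16 bits" idea can be sketched as follows, in Python for illustration only. `pack48` and `unpack48` are hypothetical names. Dropping 16 of the 52 stored fraction bits leaves 36, i.e. a 37-bit significand with the implicit leading one. Note that this sketch truncates rather than rounds.

```python
import math
import struct

def pack48(x: float) -> bytes:
    """Keep the top 6 bytes of a big-endian double (drops 16 mantissa bits)."""
    return struct.pack(">d", x)[:6]

def unpack48(b: bytes) -> float:
    """Restore a double by padding the dropped low bytes with zeros."""
    return struct.unpack(">d", b + b"\x00\x00")[0]

x = math.pi
y = unpack48(pack48(x))
print(len(pack48(x)))                  # 6 bytes of storage instead of 8
print(abs(x - y) / x < 2.0 ** -36)     # relative error stays below 2**-36
```

All arithmetic still happens in full double precision; only the stored copy is shortened.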

gf · « **Reply #6 on:** January 21, 2022, 11:06:34 am »

The traditional x87 instruction set of the x86 architecture supports 80-bit extended precision. But today it seems not so common to use it any more, favoring SSE and AVX instructions.

brucehoult · « **Reply #7 on:** January 21, 2022, 11:38:47 am »

Quote from: Whales on January 21, 2022, 10:36:50 am

Crazy thought #1: Extra one bit accuracy at 2x time cost?
You might theoretically be able to get a single extra bit's worth of accuracy by splitting each of your operands into two almost-exact halves (h1 = A/2, h2 = A-h1), running the operations on each half set of operands independently and then summing them together to get the final number. Total cost is approximately 2x computation time, which is better than having to revert to software algorithms, but still pretty shitty for the amount of benefit. This probably also requires specific rounding modes to work?

You get much more than 1 extra bit if your hardware has FMA instructions (as it must to be IEEE 754-2008 compliant) -- you get double the mantissa length.
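
The reason FMA helps is that it exposes the rounding error of a product exactly: with p = fl(a*b), the value fma(a, b, -p) is the part of a*b that did not fit. Below is a sketch in Python for illustration. Python has no single-precision type, so the hypothetical helper `f32` rounds to single, and the exact product is formed in double (24 x 24 bits fits in 53) as a stand-in for a hardware FMA.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (a double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_prod(a: float, b: float):
    """Return (p, e) with p = fl32(a * b) and p + e == a * b exactly.

    On hardware with FMA this would be p = a * b; e = fma(a, b, -p).
    """
    exact = a * b          # exact in double for single-precision inputs
    p = f32(exact)         # what a plain single multiply returns
    e = f32(exact - p)     # the rounding error, itself a single
    return p, e

a = 1.0 + 2.0 ** -12       # exactly representable in single precision
p, e = two_prod(a, a)
print(p, e)                # high and low words of the 48-bit product
```

The pair (p, e) together carries the full 48-bit product, which is the "double the mantissa length" being described. This ignores underflow of the error term.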

David Hess · « **Reply #8 on:** January 21, 2022, 11:16:10 pm »

My solution has been to store floating point numbers as logarithms, so multiplies are adds, powers are multiplies, and adds are inconvenient.
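
A toy sketch of that trade-off, assuming positive values stored as base-2 logarithms (the function names are invented, and Python is used only for illustration). Multiply and power become one add or multiply on the stored value, while addition needs a log and an exponential.

```python
import math

def lns(x: float) -> float:
    """Encode a positive number as its base-2 logarithm."""
    return math.log2(x)

def lns_value(a: float) -> float:
    """Decode back to an ordinary number."""
    return 2.0 ** a

def lns_mul(a: float, b: float) -> float:
    return a + b                       # multiply is an add

def lns_pow(a: float, k: float) -> float:
    return a * k                       # power is a multiply

def lns_add(a: float, b: float) -> float:
    """The inconvenient one: log2(2**a + 2**b), factored to avoid overflow."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))

print(lns_value(lns_mul(lns(8.0), lns(4.0))))   # 8 * 4
print(lns_value(lns_pow(lns(3.0), 2.0)))        # 3 squared, up to rounding
print(lns_value(lns_add(lns(8.0), lns(8.0))))   # 8 + 8
```

A real implementation would typically use a fixed-point log and a lookup table for the correction term in `lns_add`, and would need a separate sign and zero representation.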

westfw · « **Reply #9 on:** January 22, 2022, 01:26:30 am »

Quote

There are some runtime libraries that provide a floating point format represented as the sum of two IEEE doubles. In particular on IBM POWER and PowerPC. It's pretty ugly.
https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic

That seems like exactly what I had in mind. I'm glad to know it exists; too bad it's ugly.

Quote

you could fairly efficiently implement a format with 32 bit mantissa and 8 or 16 bits of exponent.

I hadn't considered a SW floating point at intermediate precision. But I guess the extra 8 bits isn't quite worth bucking a standard :-(

Vtile · « **Reply #10 on:** February 01, 2022, 09:02:06 pm »

I have understood that Intel do have quadruple precision float library for C available. At least it is used in my pocket calculator.

newbrain · « **Reply #11 on:** February 03, 2022, 08:25:13 am »

Quote from: Vtile on February 01, 2022, 09:02:06 pm

I have understood that Intel do have quadruple precision float library for C available. At least it is used in my pocket calculator.

I also have a Swissmicros DM42.
The used library is not only quadruple precision, it's also using base 10 instead of binary.

Berni · « **Reply #12 on:** February 03, 2022, 08:41:13 am »

If single precision float is not enough just simply go for double precision.

Usually modern CPUs with hardware floating point support also work with double precision floats with a slight speed penalty (often 1/2 speed). So doing some proprietary in between precision would actually be a lot slower than just using double precision.

There are also programming languages that support the "decimal float" format that works more like the early calculating machines in decimal digits. This lets you get lots of precision with none of the funky float binary rounding. This is typically used in finance and comes at a significant speed penalty.

brucehoult · « **Reply #13 on:** February 03, 2022, 09:45:58 am »

Quote from: Berni on February 03, 2022, 08:41:13 am

If single precision float is not enough just simply go for double precision.

"simply". At a cost of twice the storage space, and possibly several orders of magnitude lower performance.

Depending on the magnitude of your data, an alternative floating point format may offer more significant digits at the cost of less dynamic range.

Or the "posit" format gives about 1 decimal digit more precision for numbers close to +/- 1, the same precision as IEEE float at around +/- 8 million and +/- 1/8,000,000, and greater dynamic range (far past 10^38) at reduced precision. Posit support is now starting to show up in hardware FP.

Quote

Usually modern CPUs with hardware floating point support also work with double precision floats with a slight speed penalty (often 1/2 speed). So doing some proprietary in between precision would actually be a lot slower than just using double precision.

Usually the same speed, or something like 4 or 5 cycle latency instead of 3 or 4 cycles. But only on "big" hardware -- smartphone or laptop or bigger. Lots of embedded devices have single precision FP hardware only.

Quote

There are also programming languages that support the "decimal float" format that works more like the early calculating machines in decimal digits. This lets you get lots of precision with none of the funky float binary rounding. This is typically used in finance and comes at a significant speed penalty.

Decimal floating point is junk. The numerical properties of decimal rounding are much worse than binary.

OK, so maybe you can design your algorithms so that they never round. But in that case you can do the same with regular binary FP. For financial purposes standard IEEE binary double precision gives you 100% accurate no rounding arithmetic up to $90,071,992,547,409.92 (90 trillion dollars). This is enough for many organisations.
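
The figure comes from counting whole cents: a double represents every integer up to 2^53 exactly, and 2^53 cents is the dollar amount quoted. A small sketch in Python (whose `float` is an IEEE double; `format_cents` is a made-up helper):

```python
LIMIT_CENTS = 2 ** 53   # every integer up to here is exact in a double

def format_cents(cents: int) -> str:
    """Render an integer number of cents as a dollar amount."""
    return f"${cents // 100:,}.{cents % 100:02d}"

print(format_cents(LIMIT_CENTS))                        # $90,071,992,547,409.92
print(float(LIMIT_CENTS) + 1.0 == float(LIMIT_CENTS))   # True: 2**53 + 1 is not representable
print(0.10 + 0.20 == 0.30)    # False: dollars as binary fractions round
print(10.0 + 20.0 == 30.0)    # True: whole cents stay exact
```

The point is that the amounts must be held as whole cents (or whatever the smallest unit is); storing dollars with a fractional part brings the rounding back.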

Back in the days before CPUs with 64 bit integer registers and arithmetic this was often the best and fastest way to calculate with money.

The advantage of decimal FP is mostly just the easier conversion to and from text.

Nominal Animal · « **Reply #14 on:** February 03, 2022, 10:42:56 am »

In hosted environments, GCC provides libquadmath with __float128 and __complex128 type support, which on x86-64 use IEEE 754-2008 Binary128 format, AKA quadruple-precision floating-point.

I often use it to explore numerical stability of certain finicky solutions using much higher precision (and thus less quantization noise and smaller nonlinearities).

For even higher precision, I tend to use GNU Multiple Precision Arithmetic Library, because it is very reliable and robust; GCC itself depends on it at compile time.

Both of them are designed for correctness, and not speed.

Vtile · « **Reply #15 on:** February 03, 2022, 02:09:16 pm »

Quote from: newbrain on February 03, 2022, 08:25:13 am

Quote from: Vtile on February 01, 2022, 09:02:06 pm
I have understood that Intel do have quadruple precision float library for C available. At least it is used in my pocket calculator.
I also have a Swissmicros DM42.
The used library is not only quadruple precision, it's also using base 10 instead of binary.

Ah, yeah that was the catch. Indeed, now when you do mention it, it does return to my memory (decimal math).


EEVblog Electronics Community Forum