Products > Programming

"Intermediate precision" floating point?

(1/4) > >>

westfw:
I've always wondered whether there's a common hack for getting more than 24 bits of mantissa precision out of 32bit floating point functions.
64bit floats have 53bits of mantissa, but it frequently seems like 48 bits (2 * 32bt FP numbers?) would be plenty.  Maybe even a a few less (36 bits should give you a "comfortable" 10 decimal digits, right?)  Especially if it meant being about to use the existing 32bit float functions, whether those are SW or HW (suddenly shifting to SW "double" functions on a CPU with 32bit Float hardware is depressing!)

It seems mathematically possible, but it also seems like it would be especially easy to screw up the "boundary" and lose the precision you were trying to get. :-(

SiliconWizard:
You could always devise your own FP format with a different number of bits for mantissa and exponent. Of course, you'll need to implement that yourself. It's not going to be compatible with IEEE 754.

You may be interested in tapered FP: https://en.wikipedia.org/wiki/Tapered_floating_point
in particular, Unums: https://en.wikipedia.org/wiki/Unum_(number_format)
(the last variant of that being Posits.)

T3sl4co1l:
Not sure how you'd implement it; software emulation of existing routines?  (i.e. sub out, or trap, FPU instructions, replacing with calls to your custom implementations instead.)  That could work for special-case routines that handle everything in the FPU then dump out the final result; but anything that handles the arguments in memory must immediately drop the modified format.  And that's probably a very narrow case where such a method is applicable.  And, needless to say, it will run far slower than hardware FP.

If double or better is available (in hardware), you're definitely better off using that, truncating the result if needed.

I guess the use case is, if you only have single hardware available, so anything extended beyond that, will be significant software effort anyway?  (double on such a system might not be wholly software implemented -- hardware multiply helps considerably, if nothing else -- but division and transcendental functions likely benefit less?)

Assuming such a library were available, well tested, good performing, and ready to go -- I suppose it would be fine?  Presumably it could perform better than double on such a system.  But therein lies the problem: every compiler and its uncle has IEEE-754 support.  Are you going to write and test that entire library yourself?  (And you still need to convert back to common formats, sooner or later -- even if you implement every single libc function that handles floats, you can't account for 3rd party libs.  Well, fine, you might avoid those in your own project that no one else ever touches.  But if it's your own thing, just go and do it however you see fit? :D )

As for hacks of functions, if the function is continuous*, you could perhaps take the derivative at a few values around the desired value, and interpolate based on that.  Note that this takes on the order of N evaluations to test N extended-precision parameters, times 2^B sample points to (hopefully) win an additional B bits of precision.  The samples would probably be arranged in a N-dimensional grid, each axis spaced according to the inverse partial derivative of its parameter, so that on average there's about 1LSB between adjacent results, say.

*In a... somewhat roundabout and complicated manner, since this is a discrete function mimicking a real-valued one.  The epsilon-delta definition is still good I think, but rather than letting their values be arbitrary, stipulate that they must cover a region ⊇ sampling region.  And that the curvature within that region meets certain assumptions, limited by whatever interpolation method we happen to choose (e.g., instead of multilinear interpolation, Bezier splines could be used, of order, at most, corresponding to the width of the sample space).

Note that, if we assume ideal inputs to the function, and the function itself is perfect, but it loses precision on its output (bits have been rounded out): this is the sense in which we can increase resolution.  Compare with the situation of dithering an ADC: the random noise smooths out the quantization noise on average, and since the step sizes (between any given adjacent values) generally vary, we can use this to also reduce DNL.  We cannot however reduce INL, which manifests as a systematic error over this process.  Note that, for random additive noise, the sample size is 2^(2B), because the variance goes as 1/sqrt(N) for N samples; we might hope to have better luck here (with a complete grid, rather than a random sampling of that space)

So, too, we can only hope to get extended precision with respect to what the function itself represents, or is.  If it's supposed to represent some more ideal characteristic (like sin(x) say), it can only be as good as its own curve fits.  You can't add higher order polynomial terms to a function that simply doesn't have them.  (Well, sort of.  A sampling process will have the effect of convolving the function with the filter function over the sampled area -- and that won't be exactly the original polynomial, either.  I'm not actually sure what this comes out to, offhand.)  We can reduce local errors due to rounding (also assuming the rounding is evenly distributed among every operation between input and output), but the systematic error is not removable.

Interesting thought, I guess, if unfortunate: it's probably pretty clear, this would only ever make sense as an last resort.  Like, you have some black-box function that you can't even diassemble or RE any other way than by sampling it for values.  I mean, even then, you should be able to collect enough data to reconstruct it; for it to still be a reasonably continuous function (that would be susceptible to the above method), it can't be inordinately complex.  (Let's say, an implementation of an elliptic function, which will use some manner of approximation, which we could figure out; compare to a function on the Mandelbrot set, which is anything but continuous!  Though, knowing that it's a specific type of discontinuous function, we might still be able to solve for it.)

Tim

brucehoult:
There are some runtime libraries that provide a floating point format represented as the sum of two IEEE doubles. In particular on IBM POWER and PowerPC. It's pretty ugly.

You could of course do the same using a pair of single precision values instead.

If you have hardware integer 32x32->64 (or upper and lower halves) and fast shifts and CLZ then you could fairly efficiently implement a format with 32 bit mantissa and 8 or 16 bits of exponent. Microsoft BASIC on early 8 bit micros used 40 bit floating point -- though of course without hardware multiply support.

Whales:
Crazy thought #1: Extra one bit accuracy at 2x time cost?
You might theoretically be able to get a single extra bit's worth of accuracy by splitting each of your operands into two almost-exact halves (h1 = A/2, h2 = A-h1), running the operations on each half set of operands independently and then summing them together to get the final number.  Total cost is approximately 2x computation time, which is better than having to revert to software algorithms, but still pretty shitty for the amount of benefit.  This probably also requires specific rounding modes to work?

Crazy thought #2: Throw noise at the problem
Maybe it's worth attacking this statistically by running the same calculations several times, adding a different small amount of tiny noise/error to the start of each one, then averaging the final results.  There is probably even proper research into techniques like this, but I don't know what you would call them (maybe Monte Carlo related?)

I can't quite recall, is there something similar to this in ADCs and DACs (to make the noise floor across all frequencies flat?).  ie if your ADC had no noise on the input then there would be weird artefacts in its readings.

Fun sidenote: x86 floating point has a set of instructions call x87 where all calculations are done in an 80-bit floating point format internally, then rounded when you copy the numbers out elsewhere.   I think I read once that this caused some interesting issues with reproducibility on multitasking operating systems (when context switching to another process: only a lower precision version of the FPU register states was saved to be later restored?).

[#] Next page

Go to full version