Author Topic: Does Printing Floating-Point-32 bit Numbers need a 128 bit table?  (Read 1390 times)

0 Members and 1 Guest are viewing this topic.

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3799
  • Country: gb
so, I am playing with myC and floating point 32 bit numbers, with the attempt of writing a function to correctly show.

The following code can be easily converted into C and it's easy enough to describe the algorithm and the inner problem.

Code: [Select]
/*
 * binary floating point reresentation description
 *      ------------   ---   -----------------
 *     | floor_part | | . | | fractional_part |
 *      ------------   ---   -----------------
 *
 *                bit
 * 31 30    23 22                    0
 * S  EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM
 *          23 -----1615-----87-----00
 */

Code: [Select]
private void fp32_fractional_show
(
    my_fp32_t value /* abs(value) < 1.0 */
)
{
    uint32_t  sign;
    uint32_t  mantissa;
    sint32_t  exponent;
    uint64_t  magic0; /* final result */
    uint32_t  magic1;
    uint32_t  magic2;
    uint64_t  magic3;
    uint32_t  i0;
    uint32_t  i1;
    uint32_t  i2;
    boolean_t bit;
    uint32_t  mask;

    /*
     * 2^-1 = 1/2,
     * first cell in the tab
     */

    sign     = value.sign;
    mantissa = value.mantissa;
    exponent = value.exponent;

    magic0 = 0;
    if (exponent isNotEqualTo(-127.0))
    {
        exponent = sint32_abs(exponent);
        i2       = (exponent - 1);
        magic1   = pow2(exponent);
        magic0   = fractional_tab[i2];
        i2++;

        for (i0 = 1; i0 <= 23; i0++)
        {
            i1   = (23 - i0);
            mask = (1 shiftLeft i1);
            bit  = ((mantissa bitwiseAnd mask) shiftRight i1);
            if (bit isEqualTo True)
            {
                magic2  = pow2(i0);
                magic3  = fractional_tab[i2];
                magic0 += magic3;
            }
            i2++;
        }
    }

    dbg_wp(magic0); /* read the result on the debugger console */
}

Nothing special, it uses a per-calculated table for the fractional part, but just look at how it is generated


Code: [Select]
private void fractional_tab_init()
{
    uint32_t  i0;
    uint64_t  magic0;
    uint64_t  magic1;
    uint64_t  magic2;
    boolean_t is_ok;

    is_ok  = True;
    magic0  = 500000000;           /* 32bit */
    magic0  = 5000000000000000000; /* 64bit */
    magic2 = 1;
    for (i0 = 0; i0 < 32; i0++)
    {
        magic1             = (magic0 % 2);
        is_ok              = (is_ok logicalAnd(magic1 isEqualTo 0));
        if (is_ok isEqualTo False)
        {
            /*
             * black voodoo magic
             * adjust the numbers
             */
            magic0  += magic2;
            magic2   = (magic0 / 100);
        }
        fractional_tab[i0] = magic0;

        magic0 = (magic0 / 2);
    }
}

Do you see the black voodoo magic trick? That is the problem I have!

The algorithm uses all the entries from 0 to 31, but with 64 bit, entries from 19 to 31 are truncated, and this introduces numerical errors, which the black voodoo magic trick tries to compensate.

Code: [Select]
00: 05000000000000000000 <- this means 1/2 = 0.5
01: 02500000000000000000
02: 01250000000000000000
03: 00625000000000000000
04: 00312500000000000000
05: 00156250000000000000
06: 00078125000000000000
07: 00039062500000000000
08: 00019531250000000000
09: 00009765625000000000
10: 00004882812500000000
11: 00002441406250000000
12: 00001220703125000000
13: 00000610351562500000
14: 00000305175781250000
15: 00000152587890625000
16: 00000076293945312500
17: 00000038146972656250
18: 00000019073486328125
19: 00000009536743164062 <------------- from here, they are all truncated
20: 00000004768371582031
21: 00000002384185791015
22: 00000001192092895507
23: 00000000596046447753
24: 00000000298023223876
25: 00000000149011611938
26: 00000000074505805969
27: 00000000037252902984
28: 00000000018626451492
29: 00000000009313225746
30: 00000000004656612873
31: 00000000002328306436

Yeah, it's ugly, very ugly and dirty.

Code: [Select]
19: 00000009727478027344 (1) +97274780273
20: 00000004961013793945 (0) +49610137939
21: 00000002530117034911 (0) +25301170349
22: 00000001290359687804 (1) +12903596878
23: 00000000658083440780 (0) +6580834407
24: 00000000335622554797 (0) +3356225547
25: 00000000171167502945 (0) +1711675029
26: 00000000087295426501 (0) +872954265
27: 00000000044520667515 (0) +445206675
28: 00000000022705540432 (1) +227055404
29: 00000000011579825620 (0) +115798256
30: 00000000005905711066 (0) +59057110
31: 00000000003011912643 (1) +30119126

Code: [Select]
   value=0.010000 (floating point 32bit)
     hex=0x3c23d70a (fp32_t)
    sign=0x0
mantissa=0x23d70a .01000111101011100001010
exponent=0xfffffff9, -7

{ 1 + 1/4(2) 1/64(6) 1/128(7) 1/256(8) 1/512(9) 1/2048(11) 1/8192(13) 1/16384(14) 1/32768(15) 1/1048576(20) 1/4194304(22) } * (1/128)

0100000543941854383

so "0.01" is shown as 0.0100000543941854383


Without the black voodoo magic trick:

Code: [Select]
   value=0.010000 (floating point 32bit)
     hex=0x3c23d70a (fp32_t)
    sign=0x0
mantissa=0x23d70a .01000111101011100001010
exponent=0xfffffff9, -7

{ 1 + 1/4(2) 1/64(6) 1/128(7) 1/256(8) 1/512(9) 1/2048(11) 1/8192(13) 1/16384(14) 1/32768(15) 1/1048576(20) 1/4194304(22) } * (1/128)

0099999997764825819

so "0.01" is shown as 0.0099999997764825819


-

WTF, it seems you need more bit, say a 128bit table and math, that's crazy  :-//
(I can implement it, but WTF!)

The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3799
  • Country: gb
Re: Does Printing Floating-Point-32 bit Numbers need a 128 bit table?
« Reply #1 on: August 15, 2022, 12:28:54 am »
Code: [Select]
   value=0.875000 (floating point 32bit)
     hex=0x3f600000 (fp32_t)
    sign=0x0
mantissa=0x600000 .11000000000000000000000
exponent=0xffffffff, -1

{ 1 + 1/2(1) 1/4(2) } * (1/2)

8750000000000000000

value  = (0.0  + 1.0/2.0 + 1.0/4.0 + 1.0/8.0);

at least, values like this are fine and neat ;D
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline abquke

  • Regular Contributor
  • *
  • Posts: 128
  • Country: us
Re: Does Printing Floating-Point-32 bit Numbers need a 128 bit table?
« Reply #2 on: August 15, 2022, 12:36:40 am »
union float_inspection {
    float floatval;
    struct bits{
        uint32_t fraction:23;
        uint32_t exponent:8;
        uint32_t sign:1;
    }bits;
}
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4003
  • Country: nz
Re: Does Printing Floating-Point-32 bit Numbers need a 128 bit table?
« Reply #3 on: August 15, 2022, 03:02:50 am »
WTF, it seems you need more bit, say a 128bit table and math, that's crazy  :-//
(I can implement it, but WTF!)

As discussed in another thread in the last day or two, to correctly print 32 bit IEEE floating point numbers in all cases you need 176 bit unsigned integer arithmetic.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6173
  • Country: fi
    • My home page and email address
Re: Does Printing Floating-Point-32 bit Numbers need a 128 bit table?
« Reply #4 on: August 15, 2022, 04:27:42 am »
A subset of output formats and values (i.e, if output is clamped to within a preset range and precision), can be done with less resources (as also discussed).
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3799
  • Country: gb
Re: Does Printing Floating-Point-32 bit Numbers need a 128 bit table?
« Reply #5 on: August 15, 2022, 08:58:32 am »
I don't read all the topics in this forum, where was discussed? I think I missed it because focused to other things.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3799
  • Country: gb
Re: Does Printing Floating-Point-32 bit Numbers need a 128 bit table?
« Reply #6 on: August 15, 2022, 09:29:16 am »
to correctly print 32 bit IEEE floating point numbers in all cases you need 176 bit unsigned integer arithmetic.

I am thinking about this

floating point:
- difficult to be printed
- difficult to be encoded
- requires normalization
- requires >>64bit unsigned integer arithmetic
- doesn't offer uniform density with numbers
- MAC and MSC are weak, this is a serious problem with Matrix LUP decompositions

and it has weird cases
Code: [Select]
     * Not a Number (NaN): 0 11111111 00001000000000100001000
     * Negative Infinity : 1 11111111 00000000000000000000000
     * Negative Zero     : 1 00000000 00000000000000000000000
     * Infinity          : 0 11111111 00000000000000000000000
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3799
  • Country: gb
Re: Does Printing Floating-Point-32 bit Numbers need a 128 bit table?
« Reply #7 on: August 15, 2022, 09:33:35 am »
union float_inspection
{
    float floatval;
    struct bits
   {
        uint32_t fraction:23;
        uint32_t exponent:8;
        uint32_t sign:1;
    } bits;
}

bitfields are not endian-ness-guaranted, results depend on the target endian), so with C you'd best use mask and shifts.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3799
  • Country: gb
Re: Does Printing Floating-Point-32 bit Numbers need a 128 bit table?
« Reply #8 on: August 15, 2022, 09:35:28 am »
A subset of output formats and values

yes, but it still requires voodoo black magic adjustments in the table  :-//
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4003
  • Country: nz
Re: Does Printing Floating-Point-32 bit Numbers need a 128 bit table?
« Reply #9 on: August 15, 2022, 09:44:11 am »
I don't read all the topics in this forum, where was discussed? I think I missed it because focused to other things.

https://www.eevblog.com/forum/programming/best-thread-safe-printf-and-why-does-printf-need-the-heap-for-f-etc/

You've been participating in the thread.
 

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3799
  • Country: gb
Re: Does Printing Floating-Point-32 bit Numbers need a 128 bit table?
« Reply #10 on: August 15, 2022, 10:07:54 am »
You've been participating in the thread.

Not so much even because ignored, I have contributed only to suggest to rid of printf.
Anyway, at least this topic offers a concrete example that illustrates the problem.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6173
  • Country: fi
    • My home page and email address
Re: Does Printing Floating-Point-32 bit Numbers need a 128 bit table?
« Reply #11 on: August 15, 2022, 12:58:39 pm »
A subset of output formats and values
yes, but it still requires voodoo black magic adjustments in the table  :-//
If you treat the fp32 as a fixed point number with 128 integer bits and 128+24 = 152 fractional bits, and the caller specifies how many decimal digits you want after the decimal point (if any), the only precalculated table you need is the 37 positive powers of ten, and even that is optional.
(You can use either 38 entries of 128 bits each, or encode 10i as a bit pattern to be shifted left by i bits, in which case powers of ten up to 1013 fit in 32 bits, up to 1027 fit in 64 bits, and up to 1041 fit in 96 bits but only up to 1037 are needed.  And they can be precalculated at runtime by repeatedly multiplying by ten, of course.)

No black magic, and very simple and logical tables if any.

The idea is that you first expand the fractional bits to a 156+bit buffer, with the top 4 bits corresponding to an integer part.  If the target architecture has only 32×32=32-bit integer multiplication, or the 32×32=64-bit integer multiplication is costly, use only the bottom 28 bits of each limb, leaving four "unused" (overflow/carry) bits in each 32-bit limb.  By repeatedly multiplying the fractional part by 10, you obtain the next decimal digit in the integer part, which you clear before the next digit.  Because each multiplication moves the least significant bit set by at least one place up, you can ignore the least significant limbs as soon as they become zero.  For rounding, you do nothing if the buffer corresponds to less than half after conversion; increment the decimal representation by one if more than half; and break ties to even if the buffer is exactly half.  This can lead to the entire fractional part becoming zero, in which case you carry one to the integer part.  Note that no constants at all are needed here.

Next, you convert the integer part of the fp32 into the same bit buffer, remembering to add the possible carry from rounding the fractional part.  You then convert the 128-bit unsigned integer described by the bit buffer, to a decimal string into the output buffer.

As discussed in other threads, this can be done either by repeated division by ten (the remainder yielding the digits in increasing order of significance), using an array of precalculated powers of ten and a subtract loop, or by converting the integer in suitable power of ten units (as if converting to a different radix), or by extracting the decimal digits from the most significant end (by comparing to and repeatedly subtracting 1038-k), repeatedly multiplying by 10k.
Or by whatever other method you prefer.

For example, if you store k×1036 for k=0..99, you can use a binary search against the array to find the two highest digits in at most seven comparisons (to find the largest value in the array not greater than the current bit buffer), and one 128-bit subtraction.  Of course, this uses 100×96 bits (the 36 low bits of these 128-bit values are azeros) or 1200 bytes of ROM/Flash.  In the worst case, this ends up doing 133 128-bit comparisons and 19 128-bit subtractions, which isn't that bad, really.

In the medium scale, 1019 < 264, so if the initial integer is less than 1019, you can do the same as above but at only 64 bits, for k×1018, k=0.99.  This uses worst case 63 64-bit comparisons and 9 64-bit subtractions, but another 800 bytes of ROM/Flash in precalculated tables.

Picking the method based on where the most significant bit is, sounds like a very good idea to me.
 
The following users thanked this post: DiTBho

Online DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3799
  • Country: gb
Re: Does Printing Floating-Point-32 bit Numbers need a 128 bit table?
« Reply #12 on: August 16, 2022, 02:53:45 am »
It took 2 hours, but it's done!

Code: [Select]
/projects/lib_math_x32bit_v1 # make run test_mul_pow10

* * * test mul pow10 * * *
0x[00000000.00000000.00000000.00000001] xA
0x[00000000.00000000.00000000.0000000a] xB
0x[00000000.00000000.00000000.0000000a] xC, cycle01 10
0x[00000000.00000000.00000000.00000064] xC, cycle02 100
0x[00000000.00000000.00000000.000003e8] xC, cycle03 1000
0x[00000000.00000000.00000000.00002710] xC, cycle04 10000
0x[00000000.00000000.00000000.000186a0] xC, cycle05 100000
0x[00000000.00000000.00000000.000f4240] xC, cycle06 1000000
0x[00000000.00000000.00000000.00989680] xC, cycle07 10000000
0x[00000000.00000000.00000000.05f5e100] xC, cycle08 100000000
0x[00000000.00000000.00000000.3b9aca00] xC, cycle09 10^9
0x[00000000.00000000.00000002.540be400] xC, cycle10 10^10
0x[00000000.00000000.00000017.4876e800] xC, cycle11 10^11
0x[00000000.00000000.000000e8.d4a51000] xC, cycle12 10^12
0x[00000000.00000000.00000918.4e72a000] xC, cycle13 10^13
0x[00000000.00000000.00005af3.107a4000] xC, cycle14 10^14
0x[00000000.00000000.00038d7e.a4c68000] xC, cycle15 10^15
0x[00000000.00000000.002386f2.6fc10000] xC, cycle16 10^16
0x[00000000.00000000.01634578.5d8a0000] xC, cycle17 10^17
0x[00000000.00000000.0de0b6b3.a7640000] xC, cycle18 10^18
0x[00000000.00000000.8ac72304.89e80000] xC, cycle19 10^19
0x[00000000.00000005.6bc75e2d.63100000] xC, cycle20 10^20
0x[00000000.00000036.35c9adc5.dea00000] xC, cycle21 10^21
0x[00000000.0000021e.19e0c9ba.b2400000] xC, cycle22 10^22
0x[00000000.0000152d.02c7e14a.f6800000] xC, cycle23 10^23
0x[00000000.0000d3c2.1bcecced.a1000000] xC, cycle24 10^24
0x[00000000.00084595.16140148.4a000000] xC, cycle25 10^25
0x[00000000.0052b7d2.dcc80cd2.e4000000] xC, cycle26 10^26
0x[00000000.033b2e3c.9fd0803c.e8000000] xC, cycle27 10^27
0x[00000000.204fce5e.3e250261.10000000] xC, cycle28 10^28
0x[00000001.431e0fae.6d7217ca.a0000000] xC, cycle29 10^29
0x[0000000c.9f2c9cd0.4674edea.40000000] xC, cycle30 10^30
0x[0000007e.37be2022.c0914b26.80000000] xC, cycle31 10^31
0x[000004ee.2d6d415b.85acef81.00000000] xC, cycle32 10^32
0x[0000314d.c6448d93.38c15b0a.00000000] xC, cycle33 10^33
0x[0001ed09.bead87c0.378d8e64.00000000] xC, cycle34 10^34
0x[00134261.72c74d82.2b878fe8.00000000] xC, cycle35 10^35
0x[00c097ce.7bc90715.b34b9f10.00000000] xC, cycle36 10^36
0x[0785ee10.d5da46d9.00f436a0.00000000] xC, cycle37 10^37

{ uadd, usub, umul, udiv, shiftL, shiftR, cmp{LT,LE,EQ,NE,GT,GE,ZE} } operators able to operate on  {32,64,128}bit numbers

Oh, finally I can count from 1 to 10^37  ;D
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 


Share me

Digg  Facebook  SlashDot  Delicious  Technorati  Twitter  Google  Yahoo
Smf