Author Topic: Non-IEEE (software) float option  (Read 4899 times)


Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 4591
  • Country: gb
  • Doing electronics since the 1960s...
Non-IEEE (software) float option
« on: June 12, 2024, 02:47:13 pm »
In the old Z80 etc days, you had e.g.

- Hisoft Pascal, non-IEEE floats, FP divide taking say 500us, and Borland Turbo Pascal may have been the same
- IAR Pascal or C, IEEE floats, FP divide taking say 10ms

Now, aside from IAR compilers back then generating bloated code (the runtimes were often written in C, which is a rubbish way to do stuff like software floats), the IEEE float format does produce much longer runtimes. Part of this may be down to the specific requirements on e.g. rounding.

On the more basic CPUs today there are no hardware floats (I think those are always IEEE compliant) so getting software floats to run several times faster would be useful to a lot of people.

The non-IEEE code is also a lot more compact. In another thread someone mentioned 20k for software floats (arm32?). The Z80 non-IEEE version was under 4k, and Z80 code is not that compact.

Another curious thing is that in Hisoft Pascal, the simple code meant that the 24-bit mantissa was an "exact integer" if you did whole-number add/subtract i.e. if you wanted a 24 bit counter (their ints were only int16) you could use a float and just keep adding 1 and it would do it right, all the way to (2^24)-1, whereas I am not sure IEEE floats do exactly that.

I do see a problem with generic tools like GCC not wanting to do this, instead writing their software floats in C, but that will immediately produce a big performance hit.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 5058
  • Country: nz
Re: Non-IEEE (software) float option
« Reply #1 on: June 12, 2024, 03:01:10 pm »
Quote
In the old Z80 etc days, you had e.g.

- Hisoft Pascal, non-IEEE floats, FP divide taking say 500us, and Borland Turbo Pascal may have been the same
- IAR Pascal or C, IEEE floats, FP divide taking say 10ms

There is for example the FP library used by Arduino on AVR. It's around 5us for add/sub/mul on a 16 MHz chip.

Quote
Another curious thing is that in Hisoft Pascal, the simple code meant that the 24-bit mantissa was an "exact integer" if you did whole-number add/subtract i.e. if you wanted a 24 bit counter (their ints were only int16) you could use a float and just keep adding 1 and it would do it right, all the way to (2^24)-1, whereas I am not sure IEEE floats do exactly that.

Of course they do. Any IEEE implementation (hardware or software) is absolutely guaranteed to give exact results for add/sub/mul of integers out to the limits of 23 bits on single precision or 53 bits in double precision. And in fact to a result of 2^23 or 2^53.  After that the odd numbers can't be represented but you get all the even numbers out to 2^24 or 2^54.

This is a consequence of IEEE implementations being required to give bit exact results for EVERY operation for which the result is representable -- and the nearest value for the rest. I mean for the fundamental operations -- this doesn't apply to trig and logs.
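
As a quick illustration (a minimal test; any IEEE-compliant single precision, hardware or software, behaves this way):

Code: [Select]
#include <stdio.h>

int main(void)
{
    float f = 16777215.0f;          /* 2^24 - 1, exactly representable  */
    printf("%.1f\n", f + 1.0f);     /* 16777216.0 -- still exact        */
    printf("%.1f\n", f + 2.0f);     /* 16777216.0 again: 2^24 + 1 isn't */
                                    /* representable, rounds to even    */
    return 0;
}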
 
The following users thanked this post: newbrain, SiliconWizard

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 4089
  • Country: us
Re: Non-IEEE (software) float option
« Reply #2 on: June 13, 2024, 01:23:02 am »

Quote
Another curious thing is that in Hisoft Pascal, the simple code meant that the 24-bit mantissa was an "exact integer" if you did whole-number add/subtract i.e. if you wanted a 24 bit counter (their ints were only int16) you could use a float and just keep adding 1 and it would do it right, all the way to (2^24)-1, whereas I am not sure IEEE floats do exactly that.

Of course they do. Any IEEE implementation (hardware or software) is absolutely guaranteed to give exact results for add/sub/mul of integers out to the limits of 23 bits on single precision or 53 bits in double precision. And in fact to a result of 2^23 or 2^53.  After that the odd numbers can't be represented but you get all the even numbers out to 2^24 or 2^54.

Hence the use of a single base numeric data type in languages like JavaScript and Matlab.  Of course the "overflow" behavior is different...
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 5058
  • Country: nz
Re: Non-IEEE (software) float option
« Reply #3 on: June 13, 2024, 03:34:38 am »
Quote
Of course they do. Any IEEE implementation (hardware or software) is absolutely guaranteed to give exact results for add/sub/mul of integers out to the limits of 23 bits on single precision or 53 bits in double precision. And in fact to a result of 2^23 or 2^53.  After that the odd numbers can't be represented but you get all the even numbers out to 2^24 or 2^54.

Oops I miscalculated.

With the exponent at the value where the difference between successive numbers is 1.0 (no fractions any more), you get 2^23 or 2^52 integers between 2^23 and 2^24-1, or between 2^52 and 2^53-1.  It is at the next higher exponent value, from 2^24 or 2^53 upwards, that odd numbers can't be represented.

So, yeah, fp32 gives you exact arithmetic on integers to 16,777,216 (2^24) and fp64 to 9,007,199,254,740,992 (2^53). And the same for negative.
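
The same edge in double, as a minimal check:

Code: [Select]
#include <stdio.h>

int main(void)
{
    double d = 9007199254740992.0;   /* 2^53 */
    printf("%.1f\n", d + 1.0);       /* 9007199254740992.0: 2^53 + 1    */
                                     /* isn't representable             */
    printf("%.1f\n", d + 2.0);       /* 9007199254740994.0: even values */
                                     /* stay exact out to 2^54          */
    return 0;
}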
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 4591
  • Country: gb
  • Doing electronics since the 1960s...
Re: Non-IEEE (software) float option
« Reply #4 on: June 13, 2024, 06:05:11 am »
What I was getting at is whether there is a big overhead in coding IEEE compliant floats.

All those 1970s coders implementing non-IEEE floats must have been doing it for a reason. I knew a number of them personally and they were super bright coders. People like Clyde Smith-Stubbs (HI-TECH C) and Dave Nutkins (HiSoft).
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 5058
  • Country: nz
Re: Non-IEEE (software) float option
« Reply #5 on: June 13, 2024, 07:42:34 am »
Quote
What I was getting at is whether there is a big overhead in coding IEEE compliant floats.

Of course there is. Correctly handling NaNs and Infinities and denorms takes a lot of code and slows down every single operation.  It's also very expensive to make sure the last bit is always correct.

Many microcontroller applications of floating point don't need 23 bit accuracy -- if the final result is going to be used to control an output voltage or servo position then it's fine as long as 8-16 accurate bits are available after all calculations.

Just as one example: for an IEEE-compliant multiply you need to do nine 8x8->16 multiplies and add them all up.

If you don't need exact results then you can just do (here a0 and b0 are the most significant mantissa bytes, and r0 is the most significant result byte):

Code: [Select]
a0*b0 -> r0,r1
a1*b0 -> r1,r2
a0*b1 -> r1,r2
a1*b1 -> r2,r3
a0*b2 -> r2,r3
a2*b0 -> r2,r3
If almost-24-bit accuracy is good enough then you're done.

If you have MUL and MULH instructions (giving the low and high halves of the product) rather than a single 8x8->16 instruction, then you can do just the MULH for the last three products above, so you've done 6 MULH plus 3 MUL instead of the 9 MULH plus 9 MUL a full IEEE result needs. That's only half as many multiply instructions as the full IEEE calculation. (A similar argument applies if you don't have MUL at all and are doing shift-and-add.)

Code: [Select]
a2*b1 -> r3,r4
a1*b2 -> r3,r4
a2*b2 -> r4,r5

Full accuracy needs these too.

How inaccurate is it if you stop early? You're ignoring five byte contributions to r3 and three to r4. The r4 sum could be 0xFF + 0xFF + 0xFF = 0x2FD, so it could need to propagate +2 (let's call it +3 if the value is over 0x280) into the r3 sum, which could therefore be as much as 0xFF*5 + 0x3 = 0x4FE.

So your final result could be four or five units low -- you've only got about 21 fully-accurate bits, not 23.

But it will run twice as fast.

(this isn't a fully rigorous analysis, but close enough ...)
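
For illustration, here is the truncated scheme as a C sketch (the function name, the byte-array interface and the 32-bit accumulator are mine; a real implementation on a Z80 or AVR would do this byte-wise):

Code: [Select]
#include <stdint.h>

/* Truncated 24x24-bit mantissa multiply: the six partial products from
   above, keeping only the high byte (the MULH half) of the last three.
   a[0] and b[0] are the MOST significant mantissa bytes.  Returns the
   top 24 bits of the product, which may be a few units low as analysed
   above. */
static uint32_t mul24_hi_trunc(const uint8_t a[3], const uint8_t b[3])
{
    uint32_t r;
    r  = ((uint32_t)a[0] * b[0]) << 8;     /* -> r0,r1          */
    r += (uint32_t)a[1] * b[0];            /* -> r1,r2          */
    r += (uint32_t)a[0] * b[1];            /* -> r1,r2          */
    r += ((uint32_t)a[1] * b[1]) >> 8;     /* -> r2 (MULH only) */
    r += ((uint32_t)a[0] * b[2]) >> 8;     /* -> r2 (MULH only) */
    r += ((uint32_t)a[2] * b[0]) >> 8;     /* -> r2 (MULH only) */
    return r;                              /* r0..r2 in bits 23..0 */
}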
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4392
  • Country: us
Re: Non-IEEE (software) float option
« Reply #6 on: June 13, 2024, 08:29:56 am »
Quote
All those 1970s coders implementing non-IEEE floats must have been doing it for a reason.
Well, there was no IEEE standard in the 70s... :-)

I imagine being able to design floating point formats and functions around the particular features of the micro at hand must have been helpful...
 
The following users thanked this post: voltsandjolts

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 5058
  • Country: nz
Re: Non-IEEE (software) float option
« Reply #7 on: June 13, 2024, 11:07:20 am »
Quote
All those 1970s coders implementing non-IEEE floats must have been doing it for a reason.
Well, there was no IEEE standard in the 70s... :-)

I imagine being able to design floating point formats and functions around the particular features of the micro at hand must have been helpful...

Weirdly, the early Microsoft BASICs (including AppleSoft) used a 5-byte FP format, with an 8-bit exponent and a 31-bit significand. They stored this in a 7-byte symbol table entry, along with a 2-byte variable name (with the variable type fp / int / string / fn encoded in the hi bits).
 

Offline newbrain

  • Super Contributor
  • ***
  • Posts: 1840
  • Country: se
Re: Non-IEEE (software) float option
« Reply #8 on: June 13, 2024, 12:16:48 pm »
Quote
Weirdly [...]
Even more weirdly, the Grundy NewBrain computer used 6 bytes for FP, 5 bytes for the mantissa, 1 bit for sign, and 7 bits for exponent.
The radix was not 2 but 256, so both the range (about 10⁻¹⁵⁰ to 10¹⁵⁰) and the precision were better than the average computer of the time.

I noticed that while at uni: a dynamic, Runge Kutta based, simulation (written in Pascal) converged flawlessly on the NewBrain, but was extremely critical and often failed to converge at all on an Apple][e+ with exactly the same source code.
Nandemo wa shiranai wa yo, shitteru koto dake. (I don't know everything, just the things I know.)
 

Online brucehoult

  • Super Contributor
  • ***
  • Posts: 5058
  • Country: nz
Re: Non-IEEE (software) float option
« Reply #9 on: June 13, 2024, 02:24:38 pm »
Quote
Weirdly [...]
Even more weirdly, the Grundy NewBrain computer used 6 bytes for FP, 5 bytes for the mantissa, 1 bit for sign, and 7 bits for exponent.
The radix was not 2, but 256 - so both range (10¹⁵⁰ - 10⁻¹⁵⁰) and precision were better than average computer at the time.

I noticed that while at uni: a dynamic, Runge Kutta based, simulation (written in Pascal) converged flawlessly on the NewBrain, but was extremely critical and often failed to converge at all on an Apple][e+ with exactly the same source code.

UCSD (therefore Apple) Pascal used a 32-bit "real" in a format "similar to the proposed IEEE standard".
 
The following users thanked this post: newbrain

Offline coppice

  • Super Contributor
  • ***
  • Posts: 10289
  • Country: gb
Re: Non-IEEE (software) float option
« Reply #10 on: June 13, 2024, 02:42:40 pm »
Quote
In the old Z80 etc days, you had e.g.

- Hisoft Pascal, non-IEEE floats, FP divide taking say 500us, and Borland Turbo Pascal may have been the same
- IAR Pascal or C, IEEE floats, FP divide taking say 10ms
There were also a lot of tools that came with two optional floating point libraries - one fast, and one IEEE compliant. The speed difference could be quite big. In many packages the fast and slow libraries both used the IEEE format, but the fast one dodged a lot of the rounding and other checks. In some packages the format in the fast and slow libraries was different, to avoid the overhead of shunting bits around. Even with bigger machines you have options like gcc's -ffast-math, which dodges some of the time-consuming aspects of full IEEE compliance, even when using an IEEE maths unit.
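
As a small illustration of the kind of shortcut -ffast-math licenses (a sketch; the exact transformations depend on the GCC version and target):

Code: [Select]
/* With -ffast-math (which enables -freciprocal-math, among others),
   GCC may compile the division below as a multiply by a precomputed
   reciprocal, roughly x * (1.0f / 480.0f) -- slightly less accurate,
   but much cheaper where divides are slow.  It also drops the
   special-case handling for NaN, Inf and signed zero. */
float scale(float x)
{
    return x / 480.0f;
}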

Quote
I do see a problem with generic tools like GCC not wanting to do this, instead writing their software floats in C, but that will immediately produce a big performance hit.
I don't think anyone expects a shipping GCC compiler will use that generic C code. It's a very useful tool when you are implementing a new ISA with GCC, as it gives you a working floating point package with essentially no effort, and lets you sort out the compiler aspects of floating-point arithmetic quickly. However, once floating point works, you really need to rework the support library to be optimised for the ISA.
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 4591
  • Country: gb
  • Doing electronics since the 1960s...
Re: Non-IEEE (software) float option
« Reply #11 on: June 13, 2024, 03:54:16 pm »
Quote
Even with bigger machines you have options like gcc's -ffast-math, which dodges some of the time-consuming aspects of full IEEE compliance, even when using an IEEE maths unit.

Is that available on GCC v11 for 32F4? I mean does that option actually do something?

It might make a big difference on double floats, which don't use the float hardware.

In fact this whole topic is pretty relevant to double floats, which are rarely supported in hardware.

And there are almost no real-world applications which need all the bits of a float to be perfect. One tends to use floats to get the easy dynamic range. Most "physical" stuff can be done with integers (i.e. fixed point) but you need to be very careful with the ranges.

Also remember that double is the default for most of the float C functions, so a lot of code people write will unknowingly be running much slower... maybe 10x to 100x slower.

« Last Edit: June 13, 2024, 04:01:33 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 10289
  • Country: gb
Re: Non-IEEE (software) float option
« Reply #12 on: June 13, 2024, 04:22:26 pm »
Quote
Even with bigger machines you have options like gcc's -ffast-math, which dodges some of the time-consuming aspects of full IEEE compliance, even when using an IEEE maths unit.

Is that available on GCC v11 for 32F4? I mean does that option actually do something?
I have mostly used it on x86_64 machines, and it often speeds things up quite a bit there. I've never analysed the resulting code to see what it's actually doing. Suck it and see, I guess.
Quote
It might make a big difference on double floats, which don't use the float hardware.

In fact this whole topic is pretty relevant to double floats, which are rarely supported in hardware.

And there are almost no real-world applications which need all the bits of a float to be perfect. One tends to use floats to get the easy dynamic range. Most "physical" stuff can be done with integers (i.e. fixed point) but you need to be very careful with the ranges.

Also remember that double is the default for most of the float C functions, so a lot of code people write will unknowingly be running much slower... maybe 10x to 100x slower.
On MCUs the use of "f" at the end of many things can lead to a substantial improvement in speed. People are generally pretty sloppy about distinguishing float and double in their C code. How often do you see the "f" added to the end of constants in float (not double) code?
 
The following users thanked this post: peter-h

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4392
  • Country: us
Re: Non-IEEE (software) float option
« Reply #13 on: June 13, 2024, 05:03:14 pm »
Quote
I don't think anyone expects a shipping GCC compiler will use that generic C code
AFAIK, Arm Cortex M0 and M0+ are still using the generic C floating point code.  :-(
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 10289
  • Country: gb
Re: Non-IEEE (software) float option
« Reply #14 on: June 13, 2024, 05:50:29 pm »
Quote
I don't think anyone expects a shipping GCC compiler will use that generic C code
AFAIK, Arm Cortex M0 and M0+ are still using the generic C floating point code.  :-(
That may reflect that nobody cares. This would be strange, though, as a lot of MCU applications do most of their work in integer or fixed point, and then do some final calculation in floating point to arrive at a result which is displayed or passed on. Since there are only a few floating point operations in that kind of setup, the speed may not matter too much, but the size of the code probably does. Don't discount floating point on small MCUs. A lot of them are doing just a little of it.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 9651
  • Country: fi
Re: Non-IEEE (software) float option
« Reply #15 on: June 14, 2024, 06:54:17 am »
Quote
the speed may not matter too much, but the size of the code probably does. Don't discount floating point on small MCUs. A lot of them are doing just a little of it.

Agreed. On higher-end MCUs (with HW floating point) one might actually use floating point even in performance-critical parts (though often still not, for reasons such as the extra ISR latency on ARM Cortex-M once lazy stacking of the FP registers gets triggered), but on the small ones everything where timing matters is done in fixed point anyway. Still, even on those, there are plenty of cases where performance is irrelevant: user interfaces as you say, but also calculating long-time averages, running infrequent control loops (e.g. once a second), etc. Think of a heating controller (a one-second timescale is OK due to thermal masses) vs. a motor or switch-mode power supply control loop (tens to hundreds of kHz).

And as such, on a micro without HW FP it doesn't really matter much whether an operation takes 50 or 200us, since even the faster version is too slow to be used willy-nilly. When you use floats in the slow parts of your code, even slower is almost always OK; on the other hand, saving even just half a kilobyte makes a real difference on devices with 16KB or so of flash or less, and as such I prefer size-optimized SW FP implementations.
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 4591
  • Country: gb
  • Doing electronics since the 1960s...
Re: Non-IEEE (software) float option
« Reply #16 on: June 14, 2024, 09:21:51 am »
Quote
I prefer size-optimized SW FP implementations.

and non-IEEE should be a lot smaller. So the option would be useful.

I too use floats in relatively time critical parts but the 32F4 has hardware single floats. I ought to look up the exact circumstances where an inadvertent double operation might be triggered ;)

Quote
On MCUs the use of "f" at the end of many things can lead to a substantial improvement in speed. People are generally pretty sloppy about distinguishing float and double in their C code. How often do you see the "f" added to the end of constants in float (not double) code?

In say

Code: [Select]
#define pi 3.141592654
float x,y;
y = x * pi;

surely the multiply will be single float?

If I did

printf("%7.5f",y);

then y will be promoted to double when it is passed to printf, because that is how variadic functions like printf work (the default argument promotions); correct? And the printf will run about 100x slower... but there is no way to change that unless you have the printf source code (which actually I have, because the newlib printf has various issues... various past threads, like using the heap, not being thread safe, etc).
« Last Edit: June 14, 2024, 09:33:07 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online gf

  • Super Contributor
  • ***
  • Posts: 1467
  • Country: de
Re: Non-IEEE (software) float option
« Reply #17 on: June 14, 2024, 10:18:41 am »
Quote
In say

Code: [Select]
#define pi 3.141592654
float x,y;
y = x * pi;

surely the multiply will be single float?

No, the multiply is double.
Only if both operands were single would the multiply be single as well.
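
Spelled out side by side (illustrative variable names):

Code: [Select]
float x = 1.5f;
float y1 = x * 3.141592654;    /* double constant: x is promoted, the */
                               /* multiply is double, result demoted  */
float y2 = x * 3.141592654f;   /* float constant: single multiply     */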
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 4591
  • Country: gb
  • Doing electronics since the 1960s...
Re: Non-IEEE (software) float option
« Reply #18 on: June 14, 2024, 11:27:16 am »
OK how about

Code: [Select]
float x = 2.5;
float y = 3.5;
y = x * y;
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline GromBeestje

  • Frequent Contributor
  • **
  • Posts: 294
  • Country: nl
Re: Non-IEEE (software) float option
« Reply #19 on: June 14, 2024, 11:51:08 am »
Quote
OK how about

Code: [Select]
float x = 2.5;
float y = 3.5;
y = x * y;

When assigning doubles to floats, they get demoted at assignment time. Thus the multiplication is at single precision.
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 4591
  • Country: gb
  • Doing electronics since the 1960s...
Re: Non-IEEE (software) float option
« Reply #20 on: June 14, 2024, 03:19:32 pm »
Another learning experience!

I have a lot of stuff like

float g_gain_512 = 1.0;
g_gain_512 = ((ain0*512.0/32768.0))/480.0;

So I reckon

- any compile-time constant folding is done in double (hardly matters)
- the ain0* multiply will also be done in double

There are certainly places in my code where stuff is done as a double, unintentionally, but none of them are even remotely time-critical.

In the future I will use

float g_gain_512 = 1.0F;

Is there some easy way to search for all instances of such code? I am using Cube IDE. A regex for a float literal is something like [-+]?[0-9]*\.?[0-9]* but it turns up a load of nonsense. Maybe there is a plug-in which detects such code.

I also worry what happens with e.g.

float fred = 0.5;

The 0.5 is by default double, but how is this stored in the initialised data section? If it is stored as 8 bytes then the startup copy FLASH -> RAM will produce garbage because the RAM space allocated to fred is 4 bytes. How is that handled? It obviously works correctly so the storage of a variable must be as per its type.


« Last Edit: June 14, 2024, 04:31:10 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 476
  • Country: be
Re: Non-IEEE (software) float option
« Reply #21 on: June 14, 2024, 04:34:56 pm »
Quote
float g_gain_512 = 1.0;
g_gain_512 = ((ain0*512.0/32768.0))/480.0;

[...]

In the future I will use

float g_gain_512 = 1.0F;

Better still would be instructing the compiler to use floats:

float g_gain_512 = (ain0 * 512.0f / 32768.0f) / 480.0f;
 

Online gf

  • Super Contributor
  • ***
  • Posts: 1467
  • Country: de
Re: Non-IEEE (software) float option
« Reply #22 on: June 14, 2024, 05:49:42 pm »

Quote
In the future I will use

float g_gain_512 = 1.0F;

...

I also worry what happens with e.g.

float fred = 0.5;

Don't worry. Here, the implicit double -> float conversion happens at compile time.
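
And on the earlier question of finding such spots mechanically: rather than a regex, GCC's -Wdouble-promotion warning flags every implicit float-to-double promotion. For instance (an illustrative file):

Code: [Select]
/* demo.c -- compile with: gcc -Wdouble-promotion -c demo.c */
float scale(float x)
{
    return x * 3.14;   /* warning: implicit conversion from 'float'
                          to 'double' to match other operand */
}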
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 9651
  • Country: fi
Re: Non-IEEE (software) float option
« Reply #23 on: June 15, 2024, 11:36:02 am »
It's a good idea to always use the f suffix on every literal when you want single-precision floats. This is so that you don't need to remember the promotion rules by heart, or look up the standard every time. It's the same thing as using parentheses to explicitly convey the desired order of operations: helpful for yourself and other human readers.

The more you spam the f suffix, the more you remind yourself and others of its importance. And sometimes it's very important for performance.

Quote
and non-IEEE should be a lot smaller. So the option would be useful.

Agreed. Only rarely do you need to interact with other systems in binary floats directly, and in such cases you want the IEEE layout (you could still use a non-IEEE implementation which ignores some corner conditions or rounds slightly incorrectly, but uses the same number of bits for each component, the same radix, and the same layout). But in most cases you don't even need that compatibility; the floats are internal to your code only, so it would be nice to tell the compiler "hey, how this is represented in memory is irrelevant, I just need something with the same order-of-magnitude accuracy as float/double".

Then again... whenever this kind of thing matters, just use fixed point. It's a lot easier to understand, and you can trivially do napkin (or Excel) calculations of ranges and resolutions. That is much more difficult with variable-resolution floating point, and more difficult still if the implementation introduces extra error.

Therefore, I rarely use single-precision floats at all; double has enough range and resolution that in 100% of my use cases (some atom-modeling scientific guy might disagree) I can treat it as ideal real-number storage, but float does not: it manages that in only 99% of my use cases, which means I need to turn my brain on. And if I have to do that, then why wouldn't I use fixed point, which is easier to analyze and verify?

And clearly the designers of C had a similar mindset, as you can see from the default promotions to double, and from all the library functions working on doubles (except the specifically named f variants).
« Last Edit: June 15, 2024, 11:40:27 am by Siwastaja »
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 7527
  • Country: fi
    • My home page and email address
Re: Non-IEEE (software) float option
« Reply #24 on: June 15, 2024, 02:41:43 pm »
Quote
(some atom-modeling scientific guy might disagree)
For all the ones I know –– including at least LAMMPS, Gromacs, VASP, Dalton, SIESTA, CHARMM –– double suffices.

There is the GNU Quad-Precision Math Library, aka libquadmath, which provides 128-bit __float128 and __complex128 types and a wide range of math functions.  I use it sometimes to calculate coefficients for double-precision approximations, and to compare my approximations to "best known values", but that's about it.

When I suspect cancellation error may be an issue –– as in a sum with both large and tiny terms, with the large terms cancelling out ––, either sorting them in descending order of magnitude, or Kahan summation will deal with it.

Quote
The more you spam the f suffix, the more you remind yourself and others of its importance. And sometimes it's very important for performance.
:-+

It also reminds one to use the ..f() functions from <math.h>, for example sinf(), fabsf(), expf(), and so on.
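
For instance (a trivial illustration):

Code: [Select]
#include <math.h>

float attenuate(float x)
{
    return fabsf(sinf(x));   /* stays in single precision throughout */
    /* return fabs(sin(x));     would promote to double and back,
                                pulling in the double routines */
}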



Typical speed-enhanced non-IEEE 754 about-single-precision floating-point libraries use an 8-bit signed exponent and a 24-bit signed mantissa, often in two's complement format, with all bits explicit.  Multiplication yields a 48-bit temporary, of which only the 25 most significant bits matter: 24 for the actual result, and one for rounding.  Similarly, about-double-precision can use either an 8-bit or a 16-bit signed exponent, and a 56-bit or 48-bit mantissa.
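
A multiply in such a format might look like this in C (a rough sketch with made-up type and field names; sign handling, rounding and exponent overflow checks are omitted, and the mantissa is kept unsigned-normalized rather than two's complement to keep it short):

Code: [Select]
#include <stdint.h>

typedef struct { uint32_t mant; int16_t exp; } sfp;  /* mant in [2^23, 2^24) */

static sfp sfp_mul(sfp a, sfp b)
{
    uint64_t p = (uint64_t)a.mant * b.mant;  /* 46..48-bit product     */
    sfp r = { 0, (int16_t)(a.exp + b.exp) };
    if (p & (1ULL << 47)) {                  /* p in [2^47, 2^48)      */
        r.mant = (uint32_t)(p >> 24);        /* keep the top 24 bits   */
        r.exp += 24;
    } else {                                 /* p in [2^46, 2^47)      */
        r.mant = (uint32_t)(p >> 23);
        r.exp += 23;
    }
    return r;                                /* truncated, not rounded */
}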

It is also useful to know that you can represent any finite float value in magnitude (ignoring sign), including subnormals, using a 277-bit fixed point format, Q128.149.  Since its integer and fractional parts are converted to decimal separately, you only need a 160-bit unsigned integer scratch pad for the conversion, or about 20 bytes.  We had a thread about the details some time ago.  double may need up to 2098 bits, Q1024.1074, but that too is doable with a 1088-bit or 136-byte scratch pad.

The fractional part happens to be easier to convert to decimal, via repeated multiplication by 10: that pushes the next fractional digit into the units place, and is why the scratch pad needs about 4 more bits than the full fractional range (i.e. Q4.1074).  Because of rounding, the fractional part may round up, carrying into the integer part, and thus should be converted before the integer part.  The integer part is easiest to convert via repeated division by 10 with remainder, each consecutive remainder yielding the decimal digits from right to left.

"AI" calculations often use either integers or a half-precision floating point format with an 11-bit mantissa (implicit most significant bit, which is 1 for all normals and 0 for subnormals) and a 5-bit exponent.  It has the nice feature that all its finite values, including subnormals, can be represented exactly in magnitude (i.e. ignoring the sign bit) using a 40-bit Q16.24 fixed point format.  Multiplication is only 11-bit by 11-bit with a 22-bit result, of which the 11 high bits remain in the result, and the highest dropped bit determines rounding; addition and subtraction require a 21-bit scratchpad.
 
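A decode sketch for that half-precision format (helper names are mine; the sign bit is ignored and e == 31, i.e. Inf/NaN, is not handled):

Code: [Select]
#include <stdint.h>

/* Decode a binary16 bit pattern into mant * 2^exp, exactly. */
static void half_decode(uint16_t h, uint32_t *mant, int *exp)
{
    uint32_t frac = h & 0x3FFu;        /* 10 stored mantissa bits    */
    int      e    = (h >> 10) & 0x1F;  /* 5-bit biased exponent      */
    if (e == 0) {                      /* subnormal: leading bit 0   */
        *mant = frac;
        *exp  = -24;                   /* 2^(1 - 15 - 10)            */
    } else {                           /* normal: implicit leading 1 */
        *mant = frac | 0x400u;
        *exp  = e - 15 - 10;
    }
}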

