Author Topic: Best thread-safe "printf", and why does printf need the heap for %f etc?  (Read 16033 times)


Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 3905
  • Country: gb
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #25 on: August 14, 2022, 10:59:44 pm »
why printf? why don't people get rid of it once and for all?  :-//
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online IanB

  • Super Contributor
  • ***
  • Posts: 11858
  • Country: us
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #26 on: August 14, 2022, 11:06:27 pm »
Quote
printf() doesn't need to use the heap.

Some printf()s do anyway, but they don't *need* to.

Steele & White's 1990 paper How to Print Floating-Point Numbers Accurately says that you need arbitrary-sized bignums to do the job properly. I proved in 2006 that you don't need to, that in fact the size is bounded by, IIRC, something like 144 bytes, and you need several variables this size. I had correspondence with a local university HoD who put me in touch with Steele, who agreed with my analysis. I expect I still have the correspondence somewhere. However my boss forbade me to publish, so the only implementation was on our Java-to-ARM-native (on BREW) compiler & runtime library "AlcheMo".

I don't understand this. Since an IEEE float has a fixed size and precision, it follows that you will only need a fixed maximum amount of temporary storage to convert it to decimal format. There must surely be some other point to the paper you reference, because on the surface it defies common sense?
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4028
  • Country: nz
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #27 on: August 14, 2022, 11:29:07 pm »
Quote
printf() doesn't need to use the heap.

Some printf()s do anyway, but they don't *need* to.

Steele & White's 1990 paper How to Print Floating-Point Numbers Accurately says that you need arbitrary-sized bignums to do the job properly. I proved in 2006 that you don't need to, that in fact the size is bounded by, IIRC, something like 144 bytes, and you need several variables this size. I had correspondence with a local university HoD who put me in touch with Steele, who agreed with my analysis. I expect I still have the correspondence somewhere. However my boss forbade me to publish, so the only implementation was on our Java-to-ARM-native (on BREW) compiler & runtime library "AlcheMo".

I don't understand this. Since an IEEE float has a fixed size and precision, it follows that you will only need a fixed maximum amount of temporary storage to convert it to decimal format. There must surely be some other point to the paper you reference, because on the surface it defies common sense?

Steele & White correctly show that to do the job properly (i.e. accurately in every case), you need to use multi-word arithmetic, and not just 2 or 3 words but numbers hundreds or thousands of bits long. They gloss over whether there is an upper bound or what it is, and everyone who implements their algorithm uses general heap-based unlimited precision bignum libraries.

Like you, I reasoned that IEEE FP is a finite format and so the size of the bignums needed must be bounded, and I set out to find and prove the limit, and implement their algorithm in a fixed amount of RAM. And that size, for IEEE doubles, is 36 words on a 32 bit machine or 18 words on a 64 bit machine. Or 142 bytes on an 8 bit machine. Per variable, and several variables are needed. I didn't try to minimize the number of variables (especially as speed is important too), but the total storage is less than 1 KB, which was fine for my application on ARM-based mid-2000s mobile phones with minimum 400 KB RAM and more usually 1 or 2 MB. The big gain came from not needing dynamic storage.

If you want to do perfect double precision floating point printing on an ATMega328 then you might want to do more work. But usually people are only using single precision variables on machines like that, which require much less space anyway (22 bytes per bignum variable).
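A hedged back-of-envelope (mine, not necessarily the proof referred to above): in Steele & White style digit generation for IEEE doubles, the scaled integers stay around |minimum exponent| + mantissa bits, roughly 1074 + 52 = 1126 bits, and rounding that up to whole words happens to reproduce the word counts quoted:

```c
#include <assert.h>

/* Hedged, illustrative bound, NOT the poster's actual derivation:
 * |min subnormal exponent| (1074) + explicit mantissa bits (52). */
enum { BIGNUM_BITS = 1074 + 52 };

/* Words needed per bignum variable, rounding up. */
static int words_needed(int bits_per_word) {
    return (BIGNUM_BITS + bits_per_word - 1) / bits_per_word;
}
```

This gives 36 words at 32 bits and 18 words at 64 bits, matching the figures above; at 8 bits it gives 141 bytes, close to the 142 quoted.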
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14440
  • Country: fr
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #28 on: August 14, 2022, 11:41:08 pm »
Oh yes.

And then, I'm not sure it's worth all the trouble, especially on a very small target. Usually if you need to display/input numbers accurate to a given digit, using integers will be much easier and will make more sense.

Sure, in some instances you *absolutely* have to use FP numbers, but it appears that in many cases people use FP just because it is convenient, or they just don't know how to do arithmetic with integers, especially when "decimals" would be involved. Just my 2 cents. Learn proper arithmetic and suddenly a lot of things become much easier, and you'll be a lot less dependent on the code of others.

 
The following users thanked this post: DiTBho

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4028
  • Country: nz
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #29 on: August 14, 2022, 11:52:11 pm »
Quote
Sure, in some instances you *absolutely* have to use FP numbers, but it appears that in many cases people use FP just because it is convenient, or they just don't know how to do arithmetic with integers, especially when "decimals" would be involved. Just my 2 cents. Learn proper arithmetic and suddenly a lot of things become much easier, and you'll be a lot less dependent on the code of others.

It's easier to get things such as PID algorithms stable using FP. The soft float library in Arduino on a 16 MHz AVR takes about 5 µs per add/sub/mul, so you can do around 200k of them per second. No problem doing 10 or so operations for a PID, even if it runs every ms (which it usually doesn't, as physical systems don't react that quickly anyway).

Of course that FP library is designed for speed not IEEE compliance and probably has a few ULP of error, so worrying about completely precise printing is pointless anyway.
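For scale, a float PID step of the sort being discussed is only about ten FP operations, comfortably inside that budget (a sketch; the struct layout and gains are illustrative, not from this thread):

```c
#include <assert.h>

/* Hypothetical minimal float PID state; names are illustrative. */
typedef struct { float kp, ki, kd, integral, prev_err; } pid_state;

/* One PID update: roughly ten float add/sub/mul/div operations. */
static float pid_step(pid_state *p, float setpoint, float measured, float dt) {
    float err = setpoint - measured;
    p->integral += err * dt;                 /* accumulate the I term */
    float deriv = (err - p->prev_err) / dt;  /* finite difference for the D term */
    p->prev_err = err;
    return p->kp * err + p->ki * p->integral + p->kd * deriv;
}
```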
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #30 on: August 15, 2022, 06:17:49 am »
Quote
The sprintf code appears to be here:
https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=newlib/libc/stdio/sprintf.c;h=be66ec6f5d73589e5147afeb7634b9de0daba09d;hb=HEAD
and it looks like it creates a dummy file handle before calling something which does all the work -- see lines 584 and then 591.

That calls _svfprintf_r(). Where could I find that?

Quote
It's easier to get things such as PID algorithms stable using FP.

Yes exactly; there are good reasons for using floats, especially nowadays when an ARM32 does a float multiply etc. in 1 clock cycle. On a Z80 you would be looking at 1-2 ms, which is only 100,000 times slower :) If you were happy with non-IEEE floats then you could improve this by a factor of 10. But double floats are very rarely needed, because their range and accuracy exceed (almost) anything in the physical world by a massive margin, and printing them even less so.

One can do "almost anything physical" using int32, and that's what Apollo used, evidently successfully, but you need to be very clever, know the full range of all variables, etc.

Quote
The soft float library in Arduino on 16 MHz AVR takes about 5 µs per add/sub/mul.

Must be using hardware for that.
« Last Edit: August 15, 2022, 06:41:12 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4028
  • Country: nz
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #31 on: August 15, 2022, 07:35:05 am »
Quote
The soft float library in Arduino on 16 MHz AVR takes about 5 µs per add/sub/mul.

Must be using hardware for that.

No, that's soft float. 5 µs is 80 clock cycles, i.e. roughly 80 instructions, depending on the instructions.

There's an 8x8->16 multiply in 2 cycles. A 24x24 mul needs 9 of those, and a few more adds than that. Let's say 35 cycles. TBH, you could probably just skip some of the low-order muls and adds if you don't care about a few ULP of error.

5 µs probably isn't the worst case for add and subtract, it's just what I've measured in practice on typical data. I'm sure it's a bit longer if subtract (or add) cancels a lot of bits and it needs a lot of shifting to re-normalize. But when that happens you've usually got other problems in your algorithm...  Multiply shouldn't vary by much.
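The nine-partial-product scheme can be written out in C: schoolbook 24x24 multiplication over 8-bit limbs, the shape an AVR-style soft-float core would use (names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Schoolbook 24x24 -> 48 bit multiply built from nine 8x8 -> 16
 * partial products, as a soft-float mantissa multiply would do it. */
static uint64_t mul24(uint32_t a, uint32_t b) {
    uint8_t x[3] = { a & 0xFF, (a >> 8) & 0xFF, (a >> 16) & 0xFF };
    uint8_t y[3] = { b & 0xFF, (b >> 8) & 0xFF, (b >> 16) & 0xFF };
    uint64_t r = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)  /* 9 partial products in total */
            r += (uint64_t)((uint16_t)(x[i]) * y[j]) << (8 * (i + j));
    return r;
}
```

Dropping the lowest-order partial products, as suggested above, trades a few ULP for fewer multiplies.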
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 3905
  • Country: gb
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #32 on: August 15, 2022, 09:37:51 am »
fp64_t Softfloat on an 8-bit MPU? Um, it must be the best idea ever (sarc)  ;D
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4028
  • Country: nz
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #33 on: August 15, 2022, 09:48:17 am »
Quote
fp64_t Softfloat on an 8-bit MPU? Um, it must be the best idea ever (sarc)  ;D

Why not?

Apple's SANE library implemented fully compliant 32 bit, 64 bit, and 80 bit FP on the 8 bit 6502 in the 1980s, API-compatible and bit-for-bit identical with the Macintosh implementation.

https://en.wikipedia.org/wiki/Standard_Apple_Numerics_Environment
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 3905
  • Country: gb
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #34 on: August 15, 2022, 09:57:41 am »
Quote
double floats are very rarely needed because the range and accuracy exceeds (almost) anything in the physical world by a massive margin

Except that in matrix LUP decomposition for solving systems of linear equations (A x = b), you keep accumulating errors on every single MAC and MSC iteration if the A matrix is intrinsically ill-conditioned (1), so things can literally explode into snow-crash-flakes if you don't adopt countermeasures:
  • fp64_t helps, a lot, when the A matrix is pathologically ill-conditioned (weak diagonal)
  • full pivoting helps and does the rest of the job
Only by combining these two can you save your day, and full pivoting on fp64_t is better than full pivoting on fp32_t.


(1) which is a property of your system of linear equations; you can't do anything here, except "trying to swap rows and columns" to see if the computation looks better. If you touch the equations, if you voodoo-black-magic (modify) your numbers, you are solving a different system  :o :o :o
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 3905
  • Country: gb
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #35 on: August 15, 2022, 10:01:16 am »
Quote
Why not?

because it adds more complexity than needed!
Fixedpoint should be simpler, and good enough.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4028
  • Country: nz
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #36 on: August 15, 2022, 10:12:16 am »
Quote
Why not?

because it adds more complexity than needed!

Than needed for what?

Quote
Fixedpoint should be simpler, and good enough.

Good enough for what?

I don't know any reason why people with slow computers should be banned from doing high precision calculations.

Even a 6502 or z80 is massively faster than an HP41CV or HP48 or whatever they made after those.
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 3905
  • Country: gb
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #37 on: August 15, 2022, 10:25:20 am »
Quote
FP on the 8 bit 6502 in the 1980s

Just checked my manual: Intel designed the 8051 in 1981 and released my BASIC51 ROM, with softfloat fp32_t, in 1986.


Probably even Bill Gates contributed something, maybe to the BASIC core, maybe something more.

Yup, it's a very old technology, but the truth is: my Japanese SH3 pocket calculator doesn't use any floating point for calculations. Instead it uses fixedpoint, thanks to a fixedpoint DSP engine embedded with the CPU; it supports systems of up to 8 linear equations, and the results are really good.

Which brings me to the question: does fixedpoint intrinsically solve the LUP problem with pathologically intrinsically ill-conditioned matrices? I have already tried, and the results are not bad, even better than the results with fp64_t.

I have to investigate more, but if this is true, I will get rid of fp64_t entirely for my algorithms.
« Last Edit: August 15, 2022, 10:33:53 am by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 3905
  • Country: gb
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #38 on: August 15, 2022, 10:31:28 am »
Quote
Than needed for what?

Simple implementations, things you can read and understand.
University courses, libraries for softcores, things to be used for hobby, university and similar.

Things where less, neater code is largely better.

I mean, just to print out floating point, you have to voodoo black-magic  :o
That's no good.
« Last Edit: August 15, 2022, 09:12:10 pm by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4028
  • Country: nz
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #39 on: August 15, 2022, 11:00:48 am »
Quote
Than needed for what?

Simple implementations, things you can read and understand.
University courses, libraries for softcores, things to be used for hobby, university and similar.

Things where less and neat code is largely better.

I can make any program as simple and fast as you want, if it doesn't have to be correct.

If you don't mind doing (2.0/7)*7 and getting back 1.999998 instead of 2.000000 then you don't have to worry about it.

Quote
I mean, just to print-out floating-point, you have to voodoo black-magic  :o
That's no good.

It's not voodoo black-magic. It's just converting whatever floating point number you have (decimal or binary) to an exactly equivalent fraction using two integers in the source number base, converting both integers to the destination number base, then doing the long division.

e.g.

123.456 = 123456/1000 = 0b11110001001000000/0b1111101000 = 0b1111011.01110100101111001... = 0b1.11101101110100101111001 x 2^6

So the Float is 0 10000101 11101101110100101111001 i.e. 1 * 2^(133-127) * (1 + 0b0.11101101110100101111001)

The intermediate integers can get long, but it's not *complicated*.
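A minimal sketch of that long division in C, assuming positive values in the normal binary32 range, and round-half-up instead of the full round-to-nearest-even with sticky bits that a real implementation needs:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: build a binary32 from the exact fraction num/den
 * by normalization and bitwise long division. Positive, normal-range
 * inputs only; den may overflow for extreme exponents. */
static float frac_to_float(uint64_t num, uint64_t den) {
    int exp = 0;
    /* Normalize so that 1 <= num/den < 2, tracking the binary exponent. */
    while (num >= 2 * den) { den *= 2; exp++; }
    while (num < den)      { num *= 2; exp--; }
    /* Long division: 1 leading bit + 23 mantissa bits + 1 rounding bit. */
    uint32_t mant = 0;
    for (int i = 0; i < 25; i++) {
        mant <<= 1;
        if (num >= den) { mant |= 1; num -= den; }
        num *= 2;
    }
    mant = (mant + 1) >> 1;                 /* crude round-half-up */
    if (mant >> 24) { mant >>= 1; exp++; }  /* rounding carried out */
    uint32_t bits = ((uint32_t)(exp + 127) << 23) | (mant & 0x7FFFFF);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For 123456/1000 this reproduces the bit pattern worked out above.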
« Last Edit: August 15, 2022, 11:12:33 am by brucehoult »
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6234
  • Country: fi
    • My home page and email address
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #40 on: August 15, 2022, 11:26:03 am »
Quote
Which brings me to the question: does fixedpoint intrinsically solve the LUP problem with pathologically intrinsically ill-conditioned matrices?
The difference between fixed and floating point types is the range.  If you know the range of your values, then a suitable fixed point type can have more precision.

The numerical stability properties in cases like matrix LU decomposition are much harder to quantify.  Some features, like the absolute precision, make it easier to manage (consider machine epsilon, for example), but I do not believe it intrinsically solves the issues with ill-conditioned matrices... with fixed point, the detection of, and the boundary in parameter space for, ill-conditioned matrices is simpler, so it can definitely feel like it solves such problems.

Note that in many linear algebra use cases, one uses mostly additions, subtractions, and multiplications.  If you can use two's complement format, the addition and subtraction become extremely fast – you don't even need to know the location of the decimal point, if you use modular arithmetic for additions and subtractions –, and you can keep your accumulators/sums at "double" precision; that is, you can implement fused multiply-and-add-double-precision type to accumulate the sum of products at double the normal precision, and only afterwards rescale it back to the original fixed point format.
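A sketch of that double-precision-accumulator idea, using a hypothetical Q16.16 type: each 32x32 -> 64 product is summed exactly at Q32.32, and the result is rescaled only once at the end:

```c
#include <stdint.h>

typedef int32_t q16_16;  /* hypothetical Q16.16 fixed-point type */

/* Dot product with a double-precision (Q32.32) accumulator:
 * no intermediate rounding, one rescale at the end. */
static q16_16 dot_q16(const q16_16 *a, const q16_16 *b, int n) {
    int64_t acc = 0;                  /* Q32.32 accumulator */
    for (int i = 0; i < n; i++)
        acc += (int64_t)a[i] * b[i];  /* exact 32x32 -> 64 product */
    return (q16_16)(acc >> 16);       /* single rescale to Q16.16 */
}
```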

For multiplication, you probably want to take a look at the Karatsuba algorithm (replacing one multiplication with a few additions and subtractions). Note that if you work on only positive values (pre-negate and post-negate if necessary), the MSB will always be zero, therefore the sum of two-limb numbers' high and low parts (as used in Karatsuba to convert a multiplication into two additions and a subtraction) can never overflow.  A nifty trick, although not worth it if you have a fast 32×32=64 (or low and high 32 bit results) machine multiplication operation.
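A one-level Karatsuba sketch on 16-bit limbs: three multiplies instead of four (here the 17-bit limb sums are simply absorbed by the 64-bit intermediates):

```c
#include <stdint.h>

/* One level of Karatsuba: 32x32 -> 64 using three 16x16 multiplies
 * instead of the four a schoolbook multiply would need. */
static uint64_t kara32(uint32_t a, uint32_t b) {
    uint32_t a1 = a >> 16, a0 = a & 0xFFFF;
    uint32_t b1 = b >> 16, b0 = b & 0xFFFF;
    uint64_t hi  = (uint64_t)a1 * b1;
    uint64_t lo  = (uint64_t)a0 * b0;
    /* The middle term reuses hi and lo, saving one multiplication. */
    uint64_t mid = (uint64_t)(a1 + a0) * (b1 + b0) - hi - lo;
    return (hi << 32) + (mid << 16) + lo;
}
```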
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4028
  • Country: nz
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #41 on: August 15, 2022, 11:55:03 am »
Quote
123.456 = 123456/1000 = 0b11110001001000000/0b1111101000 = 0b1111011.01110100101111001... = 0b1.11101101110100101111001 x 2^6

So the Float is 0 10000101 11101101110100101111001 i.e. 1 * 2^(133-127) * (1 + 0b0.11101101110100101111001)

And, just to illustrate the reverse direction...

0x42f6e979 = 0100 0010 1111 0110 1110 1001 0111 1001 = 0 10000101 11101101110100101111001 = 0b1111011.01110100101111001 = 0b111101101110100101111001/0b100000000000000000 = 16181625/131072.

Add and subtract 0.5 ULP to that:

32363251/262144 = 123.456005096
32363250/262144 = 123.456001282
32363249/262144 = 123.455997467

So the correct result is 123.45600.

Any apparent complications are just to make this process faster. You can write simple code for it that works, just a bit more slowly.
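The worked example above can be checked mechanically (the constants 0x42f6e979 and 16181625/131072 are the ones from the post):

```c
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a binary32 value. */
static uint32_t float_bits(float f) {
    uint32_t b;
    memcpy(&b, &f, sizeof b);  /* safe type pun */
    return b;
}
```

The mantissa field of 123.456f, with the implicit leading 1 restored, is exactly the numerator 16181625, and 16181625/131072 divides out exactly back to the same float.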
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #42 on: August 15, 2022, 12:31:36 pm »
Q1: Does scanf come into this at all? Earlier it was mentioned that IEEE requires it to create an exact mirror image of printf.

Q2: I wonder where this precision is relevant. It doesn't relate to anything in the physical world, except to make some output look pretty.

#2 can be achieved with very simple code. Many years ago I built a custom product for a customer in the film business. It passed 35mm film over some sprockets, and on one of these was a shaft encoder. The quadrature signals went to a Z80 CTC, which had a special circuit that was just right for this job. Then my box had a bipolar DAC for driving a servo motor for winding the film to a specified position, which could be in frames, feet (an exact multiple of frames), or metres! The last one of course produced a non-exact result, but it was unavoidable. One day the customer turned up and said he wanted to get back to the same position when going forwards in feet and backwards in metres. I spent many days doing that (Z80 asm, int32 maths) :) But it was not really needed; it was just for visual presentation at exhibitions. One could do the same by faking the displayed value if abs(error) < some amount.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6234
  • Country: fi
    • My home page and email address
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #43 on: August 15, 2022, 04:03:57 pm »
Quote
Q1: Does scanf come into this at all? Earlier it was mentioned that IEEE requires it to create an exact mirror image of printf.
scanf() isn't as good as strtod()/strtof() are (as reliable, as good at reporting errors).

Some of the really good approaches use a custom version while generating the number, to handle the requirement of no more error than half a unit in the least significant place (0.5 ULP), with correct rounding (including ties).

Quote
Q2: I wonder where this precision is relevant. It doesn't relate to anything in the physical world, except to make some output look pretty.
Repeatability, for example.

Let's say you make a device that takes a number as an input, does something, and pops out a number.
The user uses the device, and knows that when they use X as input, the result should be Y.

Because the output does not need to be precise, just pretty, a tiny change in the firmware causes the next revision of the device to produce Z instead of Y when inputting X.  Or perhaps they decide to throw away the device and reimplement it in R/Python/Matlab/Octave/C, and scratch their heads when they cannot reproduce the functionality.
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #44 on: August 15, 2022, 04:10:46 pm »
Isn't strtof etc inside a sscanf %f anyway?

I do normally use these (and atoi etc) but sometimes a sscanf %d %d %d to extract 3 integers is quite handy.

Can anyone find the source of the likely ST libc.a scanf lib? Probably it will have the same 1990 copyright message. I doubt (from debugging) that it uses the heap, but it would be nice to be more sure.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14440
  • Country: fr
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #45 on: August 15, 2022, 06:18:53 pm »
Quote
Sure, in some instances you *absolutely* have to use FP numbers, but it appears that in many cases people use FP just because it is convenient, or they just don't know how to do arithmetic with integers, especially when "decimals" would be involved. Just my 2 cents. Learn proper arithmetic and suddenly a lot of things become much easier, and you'll be a lot less dependent on the code of others.

It's easier to get things such as PID algorithms stable using FP.(...)

Sorta. But that's the case I mentioned ("in some instances you *absolutely* have to use FP numbers"). Sure, sometimes integer/fixed-point is not well suited, or would take a lot more dev time while requiring a lot of extra bits due to the limited dynamic range.

But that's my point. Those use cases are almost orthogonal to the need to accurately display and input floating point numbers per se.
Unless maybe you are specifically designing a calculator, in which case using IEEE FP may not be the best idea ever anyway.

(And yes, other than that, I've seen many, many devs just using FP not because they actually needed it, but because they could not figure out how to do simple integer/fixed-point arithmetic at all.)
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #46 on: August 15, 2022, 09:01:21 pm »
For a calculator you probably want what used to be called BCD Reals.

The Z80 has special instructions for that. Historically calculators used 4 bit micros which were fine for BCD maths. Very slow; you could see the time they took :)

I've done plenty of fixed point and it is fine, and 32 bit has a huge dynamic range, vastly more than nearly everything in the physical world needs, but it undeniably takes longer to code. I did it mostly in the old days when floats were much slower. Well-written 16x16=32 multiply, or 32/16=16,16 (quotient and remainder) divide, were quick: around 100us. For appropriately scaled variables I used a lot of 32x32=32 multiplies.

Surprisingly, some products have used int8 maths. I know of one autopilot for light aircraft which does this internally. It works well enough, until something goes out of range; then you get massive underflow, and a resulting burning out of the servo motors.
« Last Edit: August 15, 2022, 09:29:39 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6234
  • Country: fi
    • My home page and email address
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #47 on: August 16, 2022, 04:15:06 am »
On 32-bit architectures, it is interesting to note that 10⁹ < 2³⁰, and 1,999,999,999 < 2³¹.  So, if you have a 32×32=64-bit multiply and 64/32=32,32 division with remainder, you can pack to binary and extract from binary nine decimal digits at a time.  x86, x86-64, RISC-V (rv32gc), armv8a, thumb2, and arm64 at least can do this (according to a trivial test at Compiler Explorer).
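A sketch of the nine-digits-at-a-time extraction: one division with remainder per nine digits (here via 64-bit C operations, which compile to the 64/32=32,32 form on the targets mentioned):

```c
#include <stdint.h>

/* Peel the low nine decimal digits off a binary value with a single
 * divide-with-remainder; the quotient carries the remaining digits. */
static uint32_t low9(uint64_t v, uint64_t *rest) {
    *rest = v / 1000000000u;
    return (uint32_t)(v % 1000000000u);
}
```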

The interesting part of 10⁹ is that it is equal to 2⁹ × 5⁹ = 512 × 125 × 125 × 125, and that 125 = 128 - 3.
In other words, multiplying x by 10⁹ can be implemented as
    ((x * (128-3)) * (128-3) * (128-3)) << 9
where each multiplication by 128-3 is essentially just (x << 7) - (x << 1) - x.  In other words, "multiply 30-bit integer by 100 or 10⁹ and add a 30-bit integer" fused-multiply-adds/subs can be implemented very efficiently even on 8-bit microcontrollers without fast hardware multiplication, as long as they implement bit shifts through carry and additions/subtractions through carry.
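That shift-and-subtract multiplication by 10⁹ can be written out directly; this sketch just applies the x(128-3) step three times and shifts for the remaining 2⁹:

```c
#include <stdint.h>

/* Multiply by 10^9 using only shifts and subtractions:
 * 10^9 = (128-3)^3 * 2^9, and x*(128-3) == (x<<7) - (x<<1) - x.
 * Valid while the final product fits in 64 bits. */
static uint64_t mul_1e9(uint64_t x) {
    for (int i = 0; i < 3; i++)
        x = (x << 7) - (x << 1) - x;  /* x *= 125 */
    return x << 9;                    /* remaining factor 2^9 */
}
```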

It is similarly interesting to note that 1/5, 1/25, 1/125 etc. have repeating bit patterns,
Code: [Select]
5⁻¹ = 0.0011 0011 ...
5⁻² = 0.00001010001111010111 00001010001111010111 ...
5⁻³ = 0.0000001000001100010010011011101001011110001101010011111101111100111011011001000101101000011100101011 0000001000001100010010011011101001011110001101010011111101111100111011011001000101101000011100101011 ...
with lengths of 4, 20, and 100 bits.  In other words, division by 10, 100, or 1000 can be implemented as a multiplication by a repeating reciprocal (of 4, 20, or 100 bit units) and a bit shift right (by 1, 2, or 3 bits, respectively) –– if you don't need the remainder and can deal with a fractional quotient.
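As a concrete instance of the repeating-reciprocal idea: truncating the 0011 pattern of 1/5 to 32 bits and rounding up gives the well-known magic constant 0xCCCCCCCD, which divides any 32-bit value by 10 exactly:

```c
#include <stdint.h>

/* Reciprocal-multiply division by 10: 0xCCCCCCCD == ceil(2^35 / 10),
 * built from the repeating 1100 pattern of 4/5. Exact for all uint32_t. */
static uint32_t div10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

This is the same transformation compilers apply to `x / 10u` when the divisor is a constant.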
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #48 on: August 16, 2022, 06:13:14 am »
That's clever.

Yes; I used to pre-scale integer representations by some power of 2, too.

On an arm32, a single float multiply is 1 clock whereas a divide is 16 clocks (IIRC), and this would be a lot worse for doubles, which is probably another good reason to avoid a printf that converts float to double (even if you have the sources to it and know it has not been hacked with dummy mutexes etc.).
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6234
  • Country: fi
    • My home page and email address
Re: Best thread-safe "printf", and why does printf need the heap for %f etc?
« Reply #49 on: August 16, 2022, 07:05:02 am »
Quote
probably another good reason to avoid a printf [...] which converts float to double.
Right; the requirement for C to promote variadic float arguments to double is an important reason why one should use something else instead.

There's a lot of interesting stuff in the "simple" arithmetic here.  For example, I once worked out, for constant integer divisors, how often (among all finite floats) a division calculated using fused multiply-adds and a constant reciprocal (Brisebarre-Muller-Raina or Markstein algorithm) would give a result that is incorrect by one ULP.

In particular, any single-precision floating point division by an odd integer can be done exactly using the Markstein algorithm, at the cost of one floating-point multiply (by a constant), plus two floating-point fused multiply-adds.  (FP fused multiply-adds are available on ARM VFPv3 and NEON, at least.)

That is, if you have
    result = x / odd_integer_constant;
you can precalculate C1 = 1.0f / odd_integer_constant and C2 = -(float)odd_integer_constant, and do the calculation as
    temp1 = x * C1;
    temp2 = fmaf(C2, temp1, x);
    result = fmaf(C1, temp2, temp1);
where temp2 is the exact remainder x - odd_integer_constant * temp1, thanks to the fused multiply-add.  It also works for powers of two as well as odd integer constants; i.e. whenever
    (odd_integer_constant & 1) || !(odd_integer_constant & (odd_integer_constant - 1))
is true.  Interestingly, there are algorithms that can obtain the reciprocal much faster than doing a floating-point division, so implementing the subset of divisions by an odd integer or power of two via that route might be interesting for some workloads.

(There is an IEEE 2018 paper on fast float reciprocal using fused multiply adds, doi.org/10.1109/IDAACS-SWS.2018.8525803, but lacking IEEE access, I haven't read it.)
« Last Edit: August 16, 2022, 07:07:55 am by Nominal Animal »
 

