I've done some testing for formatting
floats, and it's looking pretty nice (not complete yet, though).
I'm using a temporary work area of 8×32 bits (32 bytes), six of which form a 6×28-bit unsigned integer ("work area", six 28-bit limbs) used in the calculation.
Technically, I'm calling
m28f the fractional type, with limb order most significant first. That is, bit 27 in the first limb corresponds to value 2
-1 = 0.5, bit 0 to 2
-28 = 0.000000003725, and so on. I'm calling
m28i the integral type, with limb order least significant first: bit 0 in the first limb corresponds to 2
0 = 1, bit 27 to 2
27 = 134217728.
When moving the mantissa (with the implicitly set bit 23 (24th) if exponent is nonzero), the mantissa can span two limbs.
The order can seem odd, but the first limb is always the one next to the decimal point. Furthermore, the key operation –– multiply by ten with carry for the fractional part, divide by 10 with remainder for the integer part –– proceeds always towards the first limb, starting at the furthest nonzero limb. Since it is trivial to keep track of the furthest nonzero limb (starting at when the mantissa is moved to the limbs, one only needs to check when it becomes zero, and move to the next closer limb), we do not need to check or conditionally operate on zero limbs; we only deal with the limbs that we need to, and no more.
What's with the 28 bit radix? Well, it turns out that using an extra 14% of memory for the limbs, all base operations stay 32-bit. Extracting a decimal digit from the integer part (divide by 10 with remainder) only requires two 32-bit multiplications, some bit shifts, and additions per active limb; and extracting a decimal digit from the fractional part (multiply by 10 with carry) only requires one, plus some bit shifts and an bitwise and, per active limb. No floating-point or 64-bit operations are needed at all –– my test code does not use any of the compiler support functions (__udivdi3 et cetera) on x86, risc-v rv32gc, or 32-bit ARMs.
To divide a 32-bit number by ten, you simply multiply it by 3435973837 = 0xCCCCCCD, and shift the result right by 35 bits. Most 32-bit architectures have a multiply-high instruction, which returns the 32 high bits, so the result only needs to be shifted right by three bits. (So, a divide by ten with remainder is two 32-bit multiplications (one multiply-high, one normal unsigned multiply), subtraction, and a shift right by three bits.)
When multiplying by ten, the high 4 bits of the 32-bit result forms the carry, which is added to the result of the next higher multiplication. This cannot overflow, because 10×0x0FFFFFFF+9 = 0x9fffffff. So, a multiply by ten with carry is just one 32-bit multiplication and addition; plus a bit shift to extract the carry, and a bitwise and to extract the result.
In summary, we're talking about exact precision and rounding according to IEEE-754 rules (or whatever you want), with something like a dozen or two cycles per decimal digit emitted, plus a few dozen cycles overhead, with a tight upper bound on the memory needed (which can be statically allocated beforehand, or allocated internally on stack via
alloca()/
__builtin_alloca()).
Finally, the exact same procedure applies to double precision, except that the temporary storage needs 39 limbs, and a temporary work area of about 1312 bits or 164 bytes. The mantissa is 52-bit (if subnormal) or 53-bit (high bit implicitly set if exponent is nonzero), so it can span three limbs initially. But the base operations stay exactly the same.
I find this rather exciting, I must say, since it looks like it is
way more efficient than any standard C library implementation I've seen, while still capable of producing the exact same results. I am currently rejecting infinities and NaNs with an error, but they're trivial to add in if someone wants to. I am having a bit of a struggle to decide which rounding modes I want to support; IEEE defaults to round to nearest breaking ties to even, and I like round to nearest breaking ties away from zero, and while others are possible, it might be nice to KISS and omit stuff not needed/used at run time. For example, does anyone need to actually change the tie-breaking mode at runtime? I don't think so, but...