I was thinking of a soft-fp format on 8-bit MCUs with 40-bit mantissa \$m\$, radix 256, and a 7-bit exponent \$p\$, as in \$m \cdot 256^{p - 67} = m \cdot 2^{8 p - 536}\$ with a separate sign bit.
While it would mean the actual precision would vary in steps, cyclically, between 32 and 40 bits, it would eliminate bit shifts. It would seriously speed up additions and subtractions, but also multiplications. Calculating `c += a×b`, if we label the five bytes 0 (least significant) to 4 (most significant), then

```
c4:c3 += a4×b4
c3:c2 += a4×b3 + a3×b4
c2:c1 += a4×b2 + a3×b3 + a2×b4
c1:c0 += a4×b1 + a3×b2 + a2×b3 + a1×b4
c0:u4 += a4×b0 + a3×b1 + a2×b2 + a1×b3 + a0×b4
u4:u3 += a3×b0 + a2×b1 + a1×b2 + a0×b3
u3:u2 += a2×b0 + a1×b1 + a0×b2
u2:u1 += a1×b0 + a0×b1
u1:u0 += a0×b0
```

While `c3` will be nonzero for nonzero products, `c4` may become zero (whenever `a4×b4 < 256`). Then, the `u4:u3` products may need to be calculated, to obtain the eight bits "shifted in" (while decrementing the exponent). Typically 15 multiplications and 30 byte-wise additions are needed (not counting carry rippling, which isn't that common).
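If I were to sketch the typical case in C (names entirely my own; a real implementation would be hand-written assembly or carefully checked compiler output), it might look like this:

```c
#include <stdint.h>

/* Sketch of the typical case of c += a*b on 5-byte significands:
   only the 15 products with i+j >= 4 are computed.  Index 0 is the
   least significant byte; u4 is the guard byte just below c0.
   Column sums are kept in 16-bit accumulators so carries ripple
   only once, at the end. */
static void fmac5(uint8_t c[5], uint8_t *u4,
                  const uint8_t a[5], const uint8_t b[5])
{
    uint16_t col[6] = {0};                 /* columns u4, c0 .. c4 */

    for (int i = 0; i < 5; i++)
        for (int j = 4 - i; j < 5; j++) {  /* products with i+j >= 4 */
            uint16_t p = (uint16_t)a[i] * b[j];
            col[i + j - 4] += p & 0xFF;    /* low byte of the product  */
            col[i + j - 3] += p >> 8;      /* high byte of the product */
        }

    uint16_t t = (uint16_t)*u4 + col[0];   /* fold columns, ripple carry */
    *u4 = (uint8_t)t;
    for (int k = 0; k < 5; k++) {
        t = (uint16_t)c[k] + col[k + 1] + (t >> 8);
        c[k] = (uint8_t)t;
    }
    /* If c[4] ends up zero, the u4:u3 products are computed next,
       and the exponent is decremented, as described above. */
}
```

Even this compact form compiles to a lot of code on an 8-bitter, which is exactly the point below.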
The problem isn't *speed*, but the number of machine instructions this kind of code generates, in my opinion. Most 8-bitters don't have that much Flash. Even if you implement the above using loops, you end up generating many instructions just to compute the addresses. One could write the multiply-add in assembly by hand, but even then it is quite a few instructions. It ends up *bulky*, not slow.
On 32-bit MCUs with a fast 32×32-bit multiplication returning the high or low 32 bits of the product, software floating-point is usually not worth it, because you can do multi-word products like the above. For example, if you have a fixed-point format with 32 fractional bits, 31 integral bits, and a sign bit, then `c = a×b` is

```
c3 = HI(a1×b1)
c2 = HI(a1×b0) + HI(a0×b1) + LO(a1×b1)
c1 = HI(a0×b0) + LO(a0×b1) + LO(a1×b0)
c0 = LO(a0×b0)
```

of which only `c2:c1` forms the result `c`. We only need the most significant bit of `c0` for half-ULP correct rounding, i.e. bit 31 of `LO(a0×b0)`. (When that bit and the least significant bit of `c1` are both set, increment `c1`, carrying over to `c2` and `c3` on overflow. Otherwise it is ignored.) If `c3` is nonzero, the product overflows.
Thus, for modular arithmetic, you only need six 32-bit unsigned multiplications (three low, three high) and some additions; half-ULP correct rounding costs one more low multiplication for `LO(a0×b0)` and a couple of bit checks. Addition and subtraction are just two 32-bit operations each. Division can be implemented either bitwise, or via 32-bit/32-bit division-and-remainder.
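Here is how that might look in C, as a sketch (my own naming; I use `uint64_t` products so each `HI`/`LO` pair is visible as one multiplication, which is what the hardware 32×32 multiply gives you):

```c
#include <stdint.h>

/* Sketch: multiply two Q31.32 magnitudes (a1:a0, b1:b0, value = raw/2^32),
   returning the truncated result c2:c1 in *r1:*r0.
   Returns nonzero if the product overflows the 31 integral bits.
   The half-ULP rounding bit is bit 31 of (uint32_t)a0b0. */
static int q31_32_mul(uint32_t a1, uint32_t a0,
                      uint32_t b1, uint32_t b0,
                      uint32_t *r1, uint32_t *r0)
{
    uint64_t a0b0 = (uint64_t)a0 * b0;   /* only needed for rounding */
    uint64_t a0b1 = (uint64_t)a0 * b1;
    uint64_t a1b0 = (uint64_t)a1 * b0;
    uint64_t a1b1 = (uint64_t)a1 * b1;

    /* Column sums; the upper halves of t1 and t2 are the ripple carries. */
    uint64_t t1 = (a0b0 >> 32) + (uint32_t)a0b1 + (uint32_t)a1b0;
    uint64_t t2 = (a0b1 >> 32) + (a1b0 >> 32) + (uint32_t)a1b1 + (t1 >> 32);
    uint32_t c3 = (uint32_t)(a1b1 >> 32) + (uint32_t)(t2 >> 32);

    *r0 = (uint32_t)t1;   /* c1 */
    *r1 = (uint32_t)t2;   /* c2 */
    return c3 != 0;       /* c3 nonzero: overflow */
}
```

The sign bit is handled separately, exactly as in the byte-wise case.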
Extending to 96 bits,

```
c5 = HI(a2×b2)
c4 = HI(a2×b1) + HI(a1×b2) + LO(a2×b2)
c3 = HI(a2×b0) + HI(a1×b1) + HI(a0×b2) + LO(a2×b1) + LO(a1×b2)
c2 = HI(a1×b0) + HI(a0×b1) + LO(a2×b0) + LO(a1×b1) + LO(a0×b2)
c1 = HI(a0×b0) + LO(a0×b1) + LO(a1×b0)
c0 = LO(a0×b0)
```

For 32 fractional bits, you use `c1`, `c2`, and `c3`. Again, only the most significant bit of `c0` matters, and if `c4` or `c5` is nonzero, the calculation overflows. For modular arithmetic, you only need 13 multiplications.
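The same schoolbook scheme generalizes to any number of 32-bit limbs (n = 3 gives the 96-bit case above). A sketch, again with names of my own:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: full 2n-limb product c = a*b of two n-limb unsigned values,
   least significant limb first.  Each uint64_t product is one hardware
   32x32 multiply supplying the LO and HI words; the modular variant
   simply skips the result columns you do not keep. */
static void mul_limbs(uint32_t *c, const uint32_t *a,
                      const uint32_t *b, size_t n)
{
    for (size_t k = 0; k < 2 * n; k++)
        c[k] = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t carry = 0;
        for (size_t j = 0; j < n; j++) {
            /* Fits in 64 bits: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
            uint64_t p = (uint64_t)a[i] * b[j] + c[i + j] + carry;
            c[i + j] = (uint32_t)p;          /* LO word of the column */
            carry    = (uint32_t)(p >> 32);  /* HI word plus carries  */
        }
        c[i + n] = carry;
    }
}
```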
For 64 fractional bits, you use `c2`, `c3`, and `c4`. For half-ULP correct rounding, you need to calculate the most significant bit of `c1`, but it again only matters when the least significant bit of `c2` is also set.
For branchless ripple carry, after adding to e.g. `c2`, you need to add zero with carry to `c3` (and, if using 64 fractional bits, to `c4`), so there are quite a few additions as well, but they tend to be very fast.
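In C, the branchless ripple is just the usual unsigned-overflow idiom; a sketch:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: add 'addend' into limb k of an n-limb little-endian
   accumulator, then propagate the carry through the remaining limbs
   without conditional branches: after an unsigned c += x, the carry
   out is exactly (c < x). */
static void acc_add(uint32_t *c, size_t n, size_t k, uint32_t addend)
{
    uint32_t carry = ((c[k] += addend) < addend);
    for (size_t j = k + 1; j < n; j++)
        carry = ((c[j] += carry) < carry);
}
```

On Cortex-M and similar, a compiler can often turn the loop body into a single add-with-carry instruction.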
13 multiplications can sound like a lot, but 96 bits is a crapton of precision, easily beating even IEEE 754 double-precision floating point, unless you for some strange reason need the range. I rarely do, even in atomic simulations.
From the above, we can easily extend into *arbitrary precision* numbers, simply by making the number of integral and fractional words – "limbs" – variable, in both directions. Arbitrary-precision numbers like this are not floating-point, because the radix point is always where the two parts (integral and fractional) meet.
Combining both, you can represent real numbers to whatever precision you need as a contiguous sequence of (say, 32-bit unsigned) words, i.e.
$$v = \left(-1\right)^S \, \sum_{k=0}^{N-1} m_k \, 2^{32 ( k + B) }$$
where \$S\$ is the sign bit, \$B\$ is the offset of the least significant word (in units of 32 bits), \$N\$ is the number of words in the value, and \$m_0\$ through \$m_{N-1}\$ contain the bit representation of the value.
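In C, one possible container for this (field names entirely mine) would be:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of a multiprecision value matching the formula above:
   v = (-1)^sign * sum over k of limb[k] * 2^(32*(k + base)). */
typedef struct {
    uint32_t *limb;    /* m_0 .. m_{N-1}, least significant first */
    size_t    nlimbs;  /* N, the number of 32-bit words           */
    int32_t   base;    /* B, offset of limb 0 in 32-bit units     */
    uint8_t   sign;    /* S: 0 = positive, 1 = negative           */
} mpnum;
```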
Arbitrary precision fixed- and floating-point types are generally called *multiprecision* numbers.
These are scary powerful: no limits on what you can do, really. For example, you can easily use these – say, as provided by GNU MPFR (the GNU multiple-precision floating-point library) or GNU MPC (the GNU multiple-precision complex arithmetic library), or any of the many alternatives – to calculate exactly-correct IEEE Binary32 (float), Binary64 (double), Binary128 (quad), or whatever precision you need. In fact, most Linux/BSD/Unix machines have `bc` installed, which is a command-line arbitrary-precision calculator. For example, if you want the first 101 decimal digits of pi, you use the arctan(1) function:
```
echo 'scale=100; 4*a(1)' | bc -l
```

(Yes, `bc` shortens arctan() to `a()`, sine to `s()`, cosine to `c()`, natural logarithm to `l()`, and exponential to `e()`, when the math library is enabled via the `-l` option.)
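For comparison, here's how I'd get the same digits from GNU MPFR in C (a sketch; link with -lmpfr -lgmp):

```c
#include <stdio.h>   /* must come before mpfr.h to get mpfr_printf */
#include <mpfr.h>

int main(void)
{
    mpfr_t pi;
    mpfr_init2(pi, 400);            /* 400 bits: comfortably over 101 decimal digits */
    mpfr_const_pi(pi, MPFR_RNDN);   /* pi, correctly rounded to nearest */
    mpfr_printf("%.100Rf\n", pi);   /* 100 decimals after the leading 3 */
    mpfr_clear(pi);
    return 0;
}
```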
Do let me know if you see any bugs in the above. I've been buggy lately!