"A subset of output formats and values"
Yes, but it still requires voodoo black magic adjustments in the table.
If you treat the fp32 as a fixed-point number with 128 integer bits and 128+24 = 152 fractional bits, and the caller specifies how many decimal digits they want after the decimal point (if any), the only precalculated table you need is the 37 positive powers of ten, and even that is optional.
(You can use either 38 entries of 128 bits each, or encode 10^i as a bit pattern to be shifted left by i bits, in which case powers of ten up to 10^13 fit in 32 bits, up to 10^27 fit in 64 bits, and up to 10^41 fit in 96 bits, although only powers up to 10^37 are needed. And they can be precalculated at runtime by repeatedly multiplying by ten, of course.)
No black magic, and very simple and logical tables if any.
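As a rough sketch of the runtime precalculation under the first option (38 entries of 128 bits, here as four little-endian 32-bit limbs; the u128 type and all names are mine, not from any particular library):

/* A minimal sketch, assuming a 32x32->64-bit multiply is available:
   precalculate 10^0 .. 10^37 as 128-bit values at runtime. */
#include <stdint.h>

typedef struct { uint32_t limb[4]; } u128;  /* limb[0] is least significant */

static u128 pow10_table[38];

static void init_pow10(void)
{
    u128 p = { { 1u, 0u, 0u, 0u } };        /* 10^0 */
    for (int i = 0; i < 38; i++) {
        pow10_table[i] = p;
        uint64_t carry = 0;
        for (int j = 0; j < 4; j++) {       /* multiply p by ten */
            uint64_t t = (uint64_t)p.limb[j] * 10u + carry;
            p.limb[j] = (uint32_t)t;
            carry = t >> 32;
        }
    }
}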
The idea is that you first expand the fractional bits to a 156-bit buffer, with the top 4 bits corresponding to an integer part. If the target architecture has only a 32×32→32-bit integer multiplication, or its 32×32→64-bit integer multiplication is costly, use only the bottom 28 bits of each limb, leaving four "unused" (overflow/carry) bits in each 32-bit limb.

By repeatedly multiplying the fractional part by 10, you obtain the next decimal digit in the integer part, which you clear before extracting the next digit. Because each multiplication moves the least significant set bit up by at least one place, you can ignore the least significant limbs as soon as they become zero.

For rounding, you do nothing if the remaining buffer corresponds to less than half after conversion; you increment the decimal representation by one if it is more than half; and you break ties to even if the buffer is exactly half. This can lead to the entire fractional part becoming zero, in which case you carry one into the integer part. Note that no constants at all are needed here.
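A minimal sketch of that digit loop, assuming full 32-bit limbs with frac[0] as the least significant; here the next digit falls out as the final carry of the top limb, which plays the role of the 4-bit integer part described above:

/* frac[*lo .. n-1] holds the still-nonzero limbs of a fraction valued
   frac / 2^(32*n), i.e. strictly below one.  Multiplying by ten pushes
   the next decimal digit out of the top limb as the final carry. */
static unsigned next_digit(uint32_t *frac, int *lo, int n)
{
    uint64_t carry = 0;
    for (int j = *lo; j < n; j++) {
        uint64_t t = (uint64_t)frac[j] * 10u + carry;
        frac[j] = (uint32_t)t;
        carry = t >> 32;
    }
    while (*lo < n && frac[*lo] == 0)
        (*lo)++;                 /* skip limbs that have become zero */
    return (unsigned)carry;      /* next decimal digit, 0..9 */
}

With this layout the rounding test needs no constants either: the remaining buffer is more than half when the top bit of the most significant limb is set together with any other bit, and exactly half when the top bit alone is set.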
Next, you convert the integer part of the fp32 into the same kind of bit buffer, remembering to add the possible carry from rounding the fractional part. You then convert the 128-bit unsigned integer described by the bit buffer into a decimal string in the output buffer.
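For completeness, a sketch of unpacking the binary32 into such limb buffers, assuming the usual IEEE-754 layout (1 sign, 8 exponent, 23 fraction bits), with the sign handled separately and Inf/NaN left out; the fraction buffer is rounded up to five 32-bit limbs (160 bits) for limb alignment:

/* The significand is mant * 2^(exp-150), so its least significant bit
   lands at bit exp-150+160 of a 288-bit buffer whose low 160 bits are
   the fraction and high 128 bits the integer part. */
#include <string.h>

static void split_fp32(float f, uint32_t ipart[4], uint32_t fpart[5])
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    uint32_t mant = bits & 0x007FFFFFu;
    int exp = (int)((bits >> 23) & 0xFFu);
    if (exp != 0) mant |= 0x00800000u;   /* implicit one for normals */
    else          exp = 1;               /* subnormals share the scale */

    uint32_t buf[9] = { 0 };             /* 160 + 128 = 288 bits */
    int pos = exp + 10;                  /* bit index of mant's LSB */
    uint64_t m = (uint64_t)mant << (pos % 32);
    buf[pos / 32] |= (uint32_t)m;
    if (m >> 32)                         /* spills into the next limb */
        buf[pos / 32 + 1] |= (uint32_t)(m >> 32);

    memcpy(fpart, buf, 5 * sizeof buf[0]);
    memcpy(ipart, buf + 5, 4 * sizeof buf[0]);
}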
As discussed in other threads, this can be done either by repeated division by ten (the remainder yielding the digits in increasing order of significance), using an array of precalculated powers of ten and a subtract loop, or by converting the integer in suitable power-of-ten units (as if converting to a different radix), or by extracting the decimal digits from the most significant end (by comparing to and repeatedly subtracting 10^(38-k)), repeatedly multiplying by 10^k. Or by whatever other method you prefer.
For example, if you store k×10^36 for k=0..99, you can use a binary search against the array to find the two highest digits in at most seven comparisons (to find the largest value in the array not greater than the current bit buffer), and one 128-bit subtraction. Of course, this uses 100×96 bits (the 36 low bits of these 128-bit values are zeros) or 1200 bytes of ROM/Flash. In the worst case, this ends up doing 133 128-bit comparisons and 19 128-bit subtractions, which isn't that bad, really.
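A sketch of that two-digits-at-a-time step, using the same hypothetical u128 limb layout as above; one way to continue after the subtraction is to multiply the remainder (now below 10^36) by 100 and reuse the same table:

/* k_pow10_36[k] holds k*10^36, precalculated like pow10_table above. */
static u128 k_pow10_36[100];

static int cmp_u128(const u128 *a, const u128 *b)
{
    for (int j = 3; j >= 0; j--)
        if (a->limb[j] != b->limb[j])
            return a->limb[j] < b->limb[j] ? -1 : 1;
    return 0;
}

static unsigned top_two_digits(u128 *v)
{
    unsigned lo = 0, hi = 99, k = 0;
    while (lo <= hi) {                    /* at most seven comparisons */
        unsigned mid = (lo + hi) / 2;
        if (cmp_u128(&k_pow10_36[mid], v) <= 0) {
            k = mid;
            lo = mid + 1;
        } else if (mid == 0) {
            break;
        } else {
            hi = mid - 1;
        }
    }
    uint64_t borrow = 0;                  /* one 128-bit subtraction */
    for (int j = 0; j < 4; j++) {
        uint64_t t = (uint64_t)v->limb[j] - k_pow10_36[k].limb[j] - borrow;
        v->limb[j] = (uint32_t)t;
        borrow = (t >> 32) & 1u;
    }
    return k;                             /* digits are k/10 and k%10 */
}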
In the medium scale, 10^19 < 2^64, so if the initial integer is less than 10^19, you can do the same as above but at only 64 bits, using k×10^18 for k=0..99. This uses at worst 63 64-bit comparisons and 9 64-bit subtractions, but another 800 bytes of ROM/Flash in precalculated tables.
Picking the method based on where the most significant bit is sounds like a very good idea to me.
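A sketch of that dispatch, where decimal_64bit() and decimal_128bit() are hypothetical names for the 64-bit and 128-bit routines described above:

static void decimal_64bit(uint64_t w, char *out);
static void decimal_128bit(const u128 *v, char *out);

static void integer_to_decimal(const u128 *v, char *out)
{
    if (v->limb[3] == 0 && v->limb[2] == 0) {    /* fits in 64 bits */
        uint64_t w = ((uint64_t)v->limb[1] << 32) | v->limb[0];
        if (w < 10000000000000000000ull) {       /* below 10^19 */
            decimal_64bit(w, out);
            return;
        }
    }
    decimal_128bit(v, out);
}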