As for binary to BCD, I have no clue how those C wizards do it. In Assembly, there are several ways. Double Dabble is among the slowest. Up to 17-bit, there are several polynomial methods on PICList that are quite fast compared to doing 16-bit division on these 8-bit horse carts.
If you have all relevant powers of ten –– 10, 100 for 8 bits; plus 1000, 10'000 for 16 bits; plus 100'000, 1'000'000, 10'000'000 for 24 bits; plus 100'000'000, 1'000'000'000 for 32 bits –– you can do repeated subtraction, starting at the largest one. For 16 bits, you do a maximum of 6+9+9+9=33 subtractions (of a 16-bit quantity), since the leading digit of 65535 can be up to 6; for 32 bits, 4+8*9=76 subtractions (of a 32-bit quantity), although you can switch down to 24-bit and 16-bit arithmetic at specific steps. So, it isn't fast in the time complexity O() sense, but since each comparison and subtraction tends to be fast and the code compact, it is a viable approach.
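To make the repeated-subtraction idea concrete, here is a minimal C sketch for the 16-bit case (the function name and digit-buffer layout are my own, for illustration; an Assembly version would keep the constants in ROM/Flash as described):

```c
#include <stdint.h>

/* Illustrative sketch: convert a 16-bit value to five ASCII decimal
   digits by repeated subtraction of powers of ten, largest first. */
void u16_to_digits(uint16_t value, char digits[5])
{
    static const uint16_t pow10[4] = { 10000u, 1000u, 100u, 10u };
    for (int i = 0; i < 4; i++) {
        char d = '0';
        while (value >= pow10[i]) {  /* at most 6, 9, 9, 9 iterations */
            value -= pow10[i];
            d++;
        }
        digits[i] = d;
    }
    digits[4] = (char)('0' + value); /* what remains is the ones digit */
}
```

Each inner loop costs one comparison and one subtraction per count of that digit, matching the worst-case tally above.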
When you can treat the value as a fraction (i.e., all bits right of the decimal point), then repeated multiplication by ten ((x<<1)+(x<<3)) is quite efficient. The problem on 8-bit machines is that for integer values, the most significant decimal digit tends to be difficult to extract. For example, for 32 bits, you'd use 100'000'000 = 0b00000101'11110101'11100001'00000000 (since it is the largest power of ten whose digit can span the full 0..9; the larger one, 1'000'000'000, you'd handle by subtraction). You can't just use the most significant byte, because 199'999'999 and 200'000'000 differ only in the least significant 10 bits: each extraction must look at the three most significant bytes, so it is quite slow.
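For the fractional case, here is a small C sketch of the multiply-by-ten extraction (names and the 16-bit fraction width are my choices for illustration): each multiplication pushes the next decimal digit into the bits that overflow above the fraction.

```c
#include <stdint.h>

/* Illustrative sketch: treat 'frac' as a 16-bit binary fraction
   (value = frac / 65536) and emit 'n' decimal digits by repeated
   multiplication by ten, using the (x<<1)+(x<<3) shift-and-add form.
   The next digit appears in the overflow bits above bit 15. */
void frac16_to_digits(uint16_t frac, char *digits, int n)
{
    uint32_t x = frac;
    for (int i = 0; i < n; i++) {
        x = (x << 1) + (x << 3);              /* x *= 10 */
        digits[i] = (char)('0' + (x >> 16));  /* overflow = next digit */
        x &= 0xFFFFu;                         /* keep only the fraction */
    }
}
```

On an 8-bit core the shift-and-add and the overflow-byte extraction stay cheap, which is why this direction is so much friendlier than the integer one.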
For BCD, you can use a variant of the subtraction algorithm that starts at subtractor 64, and halves it for each subtraction (by shifting it one bit right). The constants needed are then 100 and 64 for 8-bit; 64, 6'400, 10'000 for 16-bit; 64, 6'400, 640'000, 16'000'000 for 24-bit; 64, 6'400, 640'000, 64'000'000, 3'200'000'000 for 32-bit. Each decimal digit pair takes 5 steps, so a maximum of 25 steps for 32-bit. Again, simple and relatively compact (although you do need space for those constants, they can be stored in ROM/Flash, with only the currently used one in RAM), but not the fastest known.
When building decimal number strings, the BCD approach produces pairs of decimal digits, 00..99 inclusive. If you have 100 bytes of ROM/Flash available for a lookup table, just store the BCD-encoded value for each of 0..99; you can also use this to encode any two-digit integer to BCD. The low nibble then gives the lower decimal digit, and the upper nibble the upper one.
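A sketch of that 100-byte table in C (here filled at startup for brevity; on a microcontroller it would be a const table in ROM/Flash, and the function names are mine):

```c
#include <stdint.h>

/* Illustrative sketch: entry v (0..99) holds the BCD encoding of v,
   tens digit in the upper nibble, ones digit in the lower nibble. */
static uint8_t bin2bcd[100];

void bin2bcd_init(void)
{
    for (uint8_t v = 0; v < 100; v++)
        bin2bcd[v] = (uint8_t)(((v / 10) << 4) | (v % 10));
}

/* Convert one digit pair (0..99) to two ASCII characters. */
void pair_to_ascii(uint8_t pair, char out[2])
{
    uint8_t bcd = bin2bcd[pair];
    out[0] = (char)('0' + (bcd >> 4));    /* upper nibble: tens */
    out[1] = (char)('0' + (bcd & 0x0F));  /* lower nibble: ones */
}
```

Note the pleasant property that, read in hex, each entry looks like its index: bin2bcd[42] is 0x42.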
For 16-, 24-, and 32-bit conversions, you then need 100+2*4+2*3+2*2+1=119 bytes of ROM/Flash for the constants.
Since 100=64+32+4, 100*x = (x + (x<<3) + (x<<4))<<2, so you can use the 100-byte BCD table to extract pairs of decimal digits from a fractional binary value by repeated multiplication by 100. The pair value (0..99) will be in the overflow byte, so converting it to a pair of digits via the BCD table is fast.
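A C sketch of the multiply-by-100 pair extraction (I picked a 24-bit fraction width for illustration; the overflow lands neatly in the top byte of a 32-bit register):

```c
#include <stdint.h>

/* Illustrative sketch: since 100 = 64+32+4, multiplying by 100 is
   (x + (x<<3) + (x<<4)) << 2.  Treating '*frac' as a 24-bit binary
   fraction, each multiplication by 100 pushes the next pair of
   decimal digits (0..99) into the overflow byte above bit 23. */
uint8_t next_pair(uint32_t *frac)
{
    uint32_t x = *frac;
    x = (x + (x << 3) + (x << 4)) << 2;  /* x *= 100 */
    *frac = x & 0x00FFFFFFu;             /* keep the 24 fraction bits */
    return (uint8_t)(x >> 24);           /* overflow byte: 0..99 */
}
```

Each returned pair is then a single index into the 100-byte BCD table.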
Standard library implementations (newlib, in particular) are not designed or intended to be fast; they are written to be correct. For example, when converting a float or double to a string, if the value is within certain bounds, it is often much, much faster to multiply it by a suitable power of ten (integral or fractional powers, depending on the desired output format), round to an integer, print the integer value, and just insert the decimal point according to the power-of-ten multiplier, than to use printf() or snprintf(). It is only the larger magnitudes (absolute values) whose conversion to decimal/string is problematic, and much of the standard library implementation effort goes into getting those right. As discussed in another programming thread, I'm working on a solution intended for 32-bit architectures (ARM cores) that has a very limited memory footprint, and it already beats all standard library implementations I've seen, while being exactly correct. My only problem is how to avoid having to store the larger positive powers of ten, because for double, there are over 300 of them.
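To illustrate the scale-round-and-reinsert-the-point trick for small magnitudes, here is a hedged C sketch for a fixed two-decimal format (the function name and rounding choice are mine; it is only valid while the scaled value fits an int32_t, and the integer formatting path is what makes it cheaper than a "%f" conversion):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch: print 'value' with two decimals by scaling by
   100, rounding half away from zero to an integer, and inserting the
   decimal point into the integer output.  Only valid for magnitudes
   where value*100 fits an int32_t; larger magnitudes need the full
   conversion path that standard libraries implement. */
int format_fixed2(char *buf, size_t size, float value)
{
    float scaled = value * 100.0f + (value < 0 ? -0.5f : 0.5f);
    int32_t n = (int32_t)scaled;
    int32_t whole = n / 100;
    int32_t frac  = n % 100;
    if (frac < 0) frac = -frac;
    /* Keep the sign when the whole part truncates to 0, e.g. -0.25. */
    const char *sign = (n < 0 && whole == 0) ? "-" : "";
    return snprintf(buf, size, "%s%ld.%02ld", sign, (long)whole, (long)frac);
}
```

Here the remaining snprintf() call only formats integers, which is the fast case; a real implementation would hand-roll even that with the digit routines above.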