At around 10MHz
Is that an MCU limit, a library limit, or a self-imposed limit? That is quite a low ceiling for an M0+, if that is indeed the speed limit.
I should have guessed if it takes 1024 bytes of stack space
You could mention what printf library you are using.
Not that this would help for the case at hand, but...
I used the following code to test how much a tweak to the divmod code would gain if hex were special-cased instead of blindly calling divmod for every base (the MCU is an STM32G031). A class was created to act as a buffer, inheriting from the Print class, so the speed test does not involve the UART hardware. Another base enum value (hexopt) was added, and the divmod loop in the Print class handles that case with a bitwise AND and a shift. The difference in this test was 1.25ms vs 0.760ms, so it is at least something. To take advantage of it one would have to output hex and then either live with that or convert somewhere else, such as on the PC side. Most likely not worth the trouble when what you really want is decimal, but it is an easy speed improvement with hardly any cost.
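To make the mask-and-shift idea concrete, here is a minimal sketch. It is not the Print class code from this post; the function name u32_to_str and the buffer handling are made up for illustration. The point is only that when the base is a runtime value the compiler cannot turn the divide into a shift by itself, so a separate base-16 path avoids the divide entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the Print class from this post).
// Writes the digits of v in the given base into out, NUL-terminated,
// and returns the number of digits written.
// out must hold at least 33 bytes (32 binary digits plus the terminator).
inline std::size_t u32_to_str(std::uint32_t v, unsigned base, char* out)
{
    assert(base >= 2 && base <= 16);
    static const char digits[] = "0123456789ABCDEF";
    char tmp[32];
    std::size_t n = 0;

    if (base == 16) {
        // Fast path: no division at all, just mask and shift.
        do {
            tmp[n++] = digits[v & 0xF];
            v >>= 4;
        } while (v);
    } else {
        // Generic path: one modulo and one divide per digit.
        // On a core without hardware divide these become library calls.
        do {
            tmp[n++] = digits[v % base];
            v /= base;
        } while (v);
    }

    // Digits were produced least significant first, so reverse them.
    for (std::size_t i = 0; i < n; i++) {
        out[i] = tmp[n - 1 - i];
    }
    out[n] = '\0';
    return n;
}
```

Both paths produce the same text for base 16; only the amount of work per digit differs.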
For something like newlib, instead of mucking around in printf/nano-printf, one could make a simple replacement for the divmod library code that adds a check for a divisor of 16 and handles it quickly without having to make the __udivsi3 library call. A one-trick pony, but still one nice trick, which would see some use in the case of printf and hex-formatted numbers. Again, probably not a major improvement, but it is essentially free (unless you are hammering out lots of division elsewhere, in which case the quick divisor test may start to add up). I know there are realities one may have to deal with that will not allow this, but if division were my game I would be looking to move up from an M0.
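A rough sketch of that divisor check, assuming a made-up function name (udivmod_sketch) and result struct rather than the real library symbol or its calling convention. The generic fallback here is a plain shift-subtract loop only so the example is complete; a real library routine is more heavily optimized.

```cpp
#include <cstdint>

// Hypothetical result type and function name, for illustration only.
struct UDivMod {
    std::uint32_t quot;
    std::uint32_t rem;
};

// Unsigned divide with a quick exit for a divisor of 16.
// A divisor of 0 is not handled here.
inline UDivMod udivmod_sketch(std::uint32_t n, std::uint32_t d)
{
    if (d == 16) {
        // The one trick: shift for the quotient, mask for the remainder.
        return { n >> 4, n & 0xF };
    }

    // Generic fallback: bit-by-bit shift-subtract division,
    // 32 iterations regardless of the operands.
    std::uint32_t q = 0;
    std::uint32_t r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);   // bring down the next bit of n
        if (r >= d) {
            r -= d;
            q |= (1u << i);
        }
    }
    return { q, r };
}
```

The cost of the trick is one compare and branch on every call, which is the "quick divisor test" overhead mentioned above.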
I show the test code only because I think it's pretty neat what you can do with C++. The Print class is a single header, duplicates the 'real' C++ cout style pretty well, and is actually quite easy to write once you get the big picture. It also ends up being pretty small; at some point it will start to exceed the size of a standard printf, but you will have a good head start. The following code, with a system timer (lptimer), uart, gpio, using the chrono library, etc., compiles to 4.3k. The biggest plus is that you have code you wrote, understand what it's doing, and can change as the need arises (such as adding the test code to improve the divmod/16).
while(1){
    u32 somenums[8]{
        0xFFFFFFFF, 0xFFFFFFFE, 0xFFFFFFFD, 0xFFFFFFFC,
        0xFFFFFFFB, 0xFFFFFFFA, 0xFFFFFFF8, 0xFFFFFFF7
    };
    auto n{ 0 };
    auto t1 = now(), t2 = now();
    auto summary = [&](u8 base, u32 count){
        board.uart, right,
            fg(WHITE), setwf(10,' '), "base: ", base, endl,
            fg(WHITE), setwf(10,' '), "cpuHz: ", System::cpuHz(), endl,
            fg(WHITE), setwf(10,' '), "et: ", fg(LIME), Lptim1ClockLSI::time_point(t2-t1), endl,
            fg(WHITE), setwf(10,' '), "chars: ", fg(LIME), count, endl2;
    };
    auto test = [&](FMT::FMT_BASE base){
        n = 0;
        sbuf, countclr, base;
        t1 = now();
        for( auto& v : somenums ){ sbuf, '[', n++, "] ", v, endl; }
        t2 = now();
        summary( base == dec ? 10 : 16, sbuf.count() );
    };
    test(dec);
    test(hex);
    test(hexopt);
    delay(10s);
}
base: 10
cpuHz: 16000000
et: 0d00:00:00.001496
chars: 120
base: 16
cpuHz: 16000000
et: 0d00:00:00.001251
chars: 104
base: 16
cpuHz: 16000000
et: 0d00:00:00.000763
chars: 104