OK, understood, and thank you as always for your input.
When I did the PCB design I used their eval kit as a starting point (doesn't everyone?) and then read every word of the hardware manual, which shows what can come out on which pins, etc. The Issue A PCB, which has tons of analog and digital, SPI, USB, Ethernet, etc., worked first time, with no mistakes. Well, almost none, given the crappy documentation by ST. A colleague is doing the "system" software on this one, all in C, and after he spent months fixing bugs in the ST drivers (a well-known topic) we are making good progress.
Yes, the UART issue was solved with interrupts.
I continue to be seriously impressed with the speed of this thing. I wrote a load of code (in Cube) to parse NMEA-type sentences (some of them nonstandard binary) from a GPS. The code is definitely sub-optimal (e.g. I am searching for multiple 8-byte substrings concurrently, by comparing each one against a sliding buffer), yet the overall processing rate, including a load of sscanfs, floats, doubles, uint32, uint64, etc., is just under a megabyte per second. That is about 600x faster than the GPS (a u-blox module) is generating the data.
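For anyone wondering what the sliding-buffer search looks like, here is a minimal sketch in plain C. It is not my actual code: the three 8-byte patterns ("HDR_AAAA" etc.) and the names slider_t, slider_init, slider_push and slider_scan are made up purely for illustration, and it assumes every pattern is exactly 8 bytes long.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAT_LEN      8   /* every pattern is exactly 8 bytes (assumption) */
#define NUM_PATTERNS 3

/* Hypothetical 8-byte patterns, for illustration only. */
static const char patterns[NUM_PATTERNS][PAT_LEN + 1] = {
    "HDR_AAAA", "HDR_BBBB", "HDR_CCCC"
};

typedef struct {
    uint8_t win[PAT_LEN];  /* the last PAT_LEN bytes received, oldest first */
    size_t  filled;        /* number of valid bytes in win, capped at PAT_LEN */
} slider_t;

static void slider_init(slider_t *s)
{
    memset(s, 0, sizeof *s);
}

/* Shift one byte into the window and compare the window against every
 * pattern. Returns the index of the matching pattern, or -1 if none. */
static int slider_push(slider_t *s, uint8_t byte)
{
    memmove(s->win, s->win + 1, PAT_LEN - 1);
    s->win[PAT_LEN - 1] = byte;
    if (s->filled < PAT_LEN)
        s->filled++;
    if (s->filled < PAT_LEN)
        return -1;          /* window not full yet, nothing to compare */

    for (size_t i = 0; i < NUM_PATTERNS; i++) {
        if (memcmp(s->win, patterns[i], PAT_LEN) == 0)
            return (int)i;
    }
    return -1;
}

/* Feed a block of received bytes through the window and count the hits
 * per pattern. State is kept in *s, so a pattern may straddle two calls. */
static void slider_scan(slider_t *s, const uint8_t *data, size_t len,
                        unsigned hits[NUM_PATTERNS])
{
    for (size_t i = 0; i < len; i++) {
        int m = slider_push(s, data[i]);
        if (m >= 0)
            hits[m]++;
    }
}
```

Each incoming byte costs one 7-byte memmove plus one memcmp per pattern, which is the brute-force approach and is why I call it sub-optimal. Because the window lives in the struct, the bytes can be fed in whatever chunks the receive side delivers them.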
The fastest chip I programmed before, not counting some 80x86 asm PC stuff, was probably a Z280 (a now-obscure but interesting and fast chip; according to Zilog I was the first bulk user in Europe) running at 24 MHz, and this is easily 10x faster, or 100x to 1000x on floats. Since then I have been on the Z180 and H8/3xx stuff, which is good enough for most industrial work but is slow; bit-banging an I2C port for an ADS7828 took 600 µs (yes, about 30x the ADC's conversion time!).