If you can't figure out UART on AVR, why on earth would it help you to have a library in C++?
Leave that C++ rubbish for PC bloatcode programmers.
By the way, what's the point of the common thought that the bigger the MCU, the more bullshit overhead code it can handle? This whole idea is why today we need an 8-core processor to make a call with a phone. And it's still slow as hell and has almost no useful functionality, except very sophisticated espionage tools.
Instead of taking advantage of the increasing computational power of today's MCUs, most people simply throw it away, out of silliness and unwillingness to do anything properly - because they'd need to stick their nose into the literature, read and learn. This phenomenon is partially understandable at the corporate level, where there is no room to do things properly even if you wanted to... but it's rather sad that most hobbyists and electronics enthusiasts are throwing their chance away too.
You can never understand how helpful it is to learn something before you actually learn it.
There are so many things today's MCUs can handle, if you don't throw the computational power away with bad coding practice...
See this, my favourite: an AVR mega644, slightly overclocked (32 MHz), plays S3M files from a CF card or IDE HDD through an I2S stereo DAC.
//EDIT: I forgot to answer your last question! Doh. I don't want to pick an example. No library can be good for all applications. Every application's needs are specific. It is up to the programmer to choose whatever fits the application best.
On STM32 I personally prefer to combine the STDPeriph library with direct register access. The heavy register work like initializing peripherals - code that is called once or so - I leave to the library. But checking flags and setting flags I mostly do manually in the registers. (Why would I use a bloaty function to check a single flag, if I know the register and the bit right off the top of my head?)
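To make that split concrete, here is a rough sketch of what I mean. It is not real STDPeriph code: `demo_usart_t`, `demo_usart_init`, `demo_usart_get_flag` and `DEMO_SR_TXE` are made-up stand-ins so that it compiles on a PC, and the bit position is an assumption you should check against your chip's reference manual. On real hardware the register block comes from the vendor's device header and sits at a fixed address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped USART register block.
   On a real chip this struct comes from the vendor's device header. */
typedef struct {
    volatile uint32_t SR;  /* status register */
    volatile uint32_t DR;  /* data register   */
} demo_usart_t;

/* Assumed bit for "transmit data register empty"; verify for your part. */
#define DEMO_SR_TXE (1u << 7)

/* Host-side fake peripheral; real code would use a fixed address instead. */
static demo_usart_t demo_usart1;

/* Library-style, called once: the kind of setup left to the library.
   Here it only puts the fake peripheral into an idle "ready to send" state. */
static void demo_usart_init(demo_usart_t *u)
{
    assert(u != NULL);
    u->DR = 0;
    u->SR = DEMO_SR_TXE;
}

/* Library-style flag getter: a function call just to test one bit. */
static int demo_usart_get_flag(const demo_usart_t *u, uint32_t flag)
{
    return (u->SR & flag) != 0;
}

/* Direct register access: busy-wait on the bit, then write the data register.
   The fake peripheral never clears TXE, so this returns immediately on a PC. */
static inline void demo_usart_putc(demo_usart_t *u, uint8_t c)
{
    while ((u->SR & DEMO_SR_TXE) == 0) {
        /* wait until the transmit register is empty */
    }
    u->DR = c;
}
```

Usage would be `demo_usart_init(&demo_usart1);` once at startup and then `demo_usart_putc(&demo_usart1, 'A');` wherever you send a byte. The getter and the direct test do the same thing; the direct version just skips the call, which only matters when it sits in a hot loop.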
Sure, there are applications where not much code optimization is needed, but I prefer more optimized code. Some would say that such code is not portable. Maybe it isn't, but I don't care! You write optimized code or portable code. These two don't mix well. Mostly I don't need code portability, so I choose the optimized.
By the way, porting code between STM32 chips is not a big deal, as the chips are designed with stunningly good compatibility between them. (Unlike Atmel AVRs... where no two parts are alike... Doh.)