I don't know of any particular book that can help.
What I have noticed is that the ARM chips are getting ever more complex, with multiple cores (often of different types) and increasing horsepower. Even the Teensy 4.1 is now up to a 600 MHz chip with a User Manual that is a concise 3637 pages.
Cypress has some great tutorials for their PSOC chips and their toolchain is pretty nice. It still takes a bunch of time to get things working even if you do use their code. But it's doable.
Moving backwards to the ARM7TDMI series chips like the LPC2148 makes things much easier to approach. Look at
http://jcwren.com/arm/
This is a much older chip, but it is actually possible for mere mortals to understand it at a very low level. The User Manual is a mere 354 pages:
https://www.nxp.com/docs/en/user-guide/UM10139.pdf
I haven't spent a lot of time with FreeRTOS. I have played with it on the Cypress PSOC chips, but there they already have a port of the code and the toolchain sets everything up. I guess if I wanted to play with high end ARM chips, Cypress is about as good as it gets.
There's plenty of griping about the ST tools but I have played with them a bit and they seem to work fine. The big complaint is about bloat and that's clearly the case. The alternative is to gut their code to bare bones for a specific processor and see how it works out.
Here's what is ridiculous: A simple blinking LED on a Teensy 4.1 takes 14,688 bytes of executable code using the Arduino toolchain. That is absolutely insane! It isn't that much better on any other chip using any other toolchain.
In terms of how to architect a large program from scratch, well, that's an advanced topic. It can take years of school to even understand the complexity much less the solution. Yes, implementing FreeRTOS is a good place to start. Not only does it support real-time applications but it cleanly segments code into functional blocks. In many ways, this is encapsulation. A UART process doesn't require periodic calls from main(). Whatever the UART handler does, it is behind a screen and the results are left in a queue.
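To make the "behind a screen" idea concrete, here is a rough sketch in plain C of a handler that drops received bytes into a queue while the application only ever reads from the queue. This is not FreeRTOS code: the ring buffer stands in for the role a FreeRTOS queue would play (roughly what xQueueSendFromISR() and xQueueReceive() do), and every name in it (rx_queue_t, uart_rx_handler, app_read, RX_QUEUE_LEN) is made up for illustration. It also assumes a single producer and a single consumer, and it ignores the memory barrier and interrupt masking details a real port has to deal with.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RX_QUEUE_LEN 16u  /* hypothetical size; one slot is always left empty */

typedef struct {
    uint8_t buf[RX_QUEUE_LEN];
    volatile size_t head;  /* next slot to write; touched only by the producer */
    volatile size_t tail;  /* next slot to read; touched only by the consumer */
} rx_queue_t;

/* The one queue shared between the handler and the application. */
rx_queue_t uart_rx_queue;

void rx_queue_init(rx_queue_t *q)
{
    q->head = 0;
    q->tail = 0;
}

/* Producer side. Returns false (and drops the byte) when the queue is full. */
bool rx_queue_put(rx_queue_t *q, uint8_t byte)
{
    size_t next = (q->head + 1u) % RX_QUEUE_LEN;
    if (next == q->tail) {
        return false;
    }
    q->buf[q->head] = byte;
    q->head = next;
    return true;
}

/* Consumer side. Returns false when there is nothing to read. */
bool rx_queue_get(rx_queue_t *q, uint8_t *out)
{
    if (q->tail == q->head) {
        return false;
    }
    *out = q->buf[q->tail];
    q->tail = (q->tail + 1u) % RX_QUEUE_LEN;
    return true;
}

/* Hypothetical stand-in for the UART receive interrupt: on real hardware the
 * byte would come from the peripheral's data register. */
void uart_rx_handler(uint8_t received_byte)
{
    (void)rx_queue_put(&uart_rx_queue, received_byte);
}

/* Application side: drain whatever has arrived, up to max_len bytes.
 * It never touches the UART itself, only the queue. */
size_t app_read(uint8_t *dst, size_t max_len)
{
    size_t n = 0;
    while (n < max_len && rx_queue_get(&uart_rx_queue, &dst[n])) {
        n++;
    }
    return n;
}
```

On a host you can exercise it by calling rx_queue_init(&uart_rx_queue), feeding a few bytes through uart_rx_handler(), and then calling app_read(). The point is that main() never polls the UART; it only looks at what has been left in the queue.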
If I were starting with FreeRTOS, I would pick a board/chip for which a port was already available.
Check out the PSOC 6
https://www.cypress.com/products/32-bit-arm-cortex-m4-cortex-m0-psoc-6