Besides optimization and language / library differences, ponder this:
A Cortex-M4 is very roughly on par with the CPU of a Pentium PC, or a Nintendo 64.
The memory is very different: much less ROM and RAM on a stock micro -- but similar amounts can be added when external interfaces are available. The peripherals are much more limited as well: on those older machines, a truly huge amount of bandwidth is dedicated to visual output, whether it's a PC's 1MB+ SVGA video card or the N64's RCP.
Ponder this as well:
When I was growing up, I had a 486 PC (25MHz, 32 bits -- slower than a typical M4, but of similar capability otherwise). Among other things, I taught myself 2D and 3D graphics programming, in QBasic.
Now, QBasic runs in 16-bit mode, and is interpreted (even when compiled to an EXE, it works by calling the same run-time subroutines). The run-time library is very heavyweight: every function call is FAR type (a 32-bit segment:offset address, with dozens of clock cycles of overhead to execute the CALL instruction), and even a basic arithmetic expression might consume dozens of these function calls. (The floating point support is particularly slow: the run-time doesn't detect the FPU once up front, so every operation begins with a half-dozen CALLs just to test whether it can use the hardware FPU, or whether it has to branch to another part of the library that calculates the result in software!)
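To make that concrete, here's a rough analogy in C (the real QBasic run-time was 16-bit x86, and rt_add/rt_mul are made-up names -- the function pointers below just stand in for its FAR CALL dispatch):

#include <stdio.h>

static double rt_add(double a, double b) { return a + b; }
static double rt_mul(double a, double b) { return a * b; }

/* Indirect "run-time library" entry points: the compiler can't inline
   through these, so each operation costs a genuine call -- and the
   16-bit far version of such a call was far more expensive still. */
double (*op_add)(double, double) = rt_add;
double (*op_mul)(double, double) = rt_mul;

/* y = 3*x*x + 2*x + 1, one CALL per primitive -- five calls in all. */
double eval_interpreted(double x)
{
    return op_add(op_add(op_mul(op_mul(3.0, x), x), op_mul(2.0, x)), 1.0);
}

/* The same expression open-coded: the math instructions issue directly. */
double eval_direct(double x)
{
    return 3.0 * x * x + 2.0 * x + 1.0;
}

int main(void)
{
    printf("%f %f\n", eval_interpreted(2.0), eval_direct(2.0));  /* 17 17 */
    return 0;
}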
The best performance I got, with a stock QBasic implementation, was something like 20 frames per second with pretty simple models. It wasn't optimal by any means -- I didn't know things in that much depth at the time -- but we're talking on the order of 100 kFLOPS (floating point operations per second) here, out of a 25MHz clock: roughly 250 clock cycles per floating point operation. That's slow! (The 486DX, by the way, completes hardware floating point operations in a few dozen cycles. Not nearly as fast as today's CPUs, but no slouch -- and a great improvement on the ~300 cycles typical of the 8087 FPU, the chip QBasic's arithmetic was designed around.)
For perspective:
I came back to these programs, here and there, over the years. On a Pentium, with fixed-point arithmetic and assembler subroutines for the most intensive functions (the innermost loop in the rendering process, and writing graphics to the video buffer), I approached screen refresh rate: 40 to 70 FPS.
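If fixed-point arithmetic is new to you, the idea is to keep numbers as scaled integers, so all the math runs on the integer unit. A minimal Q16.16 sketch in C (generic, not my original Pentium code):

#include <stdint.h>
#include <stdio.h>

typedef int32_t fix16;   /* Q16.16: integer part in the top 16 bits,
                            fraction in the bottom 16 */

#define FLT2FIX(x) ((fix16)((x) * 65536.0))
#define FIX2FLT(x) ((double)(x) / 65536.0)

/* Addition and subtraction are plain integer instructions; only
   multiply and divide need a widening intermediate and a shift. */
static fix16 fix_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) >> 16);
}

static fix16 fix_div(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a << 16) / b);
}

int main(void)
{
    fix16 x = FLT2FIX(1.5), y = FLT2FIX(2.25);
    printf("%f\n", FIX2FLT(fix_mul(x, y)));   /* 3.375000 */
    printf("%f\n", FIX2FLT(fix_div(x, y)));   /* ~0.666656 */
    return 0;
}

On a CPU where floating point is slow (or absent), trading a few bits of range and precision for integer-speed math like this is an enormous win.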
Later still, I "ported" the whole thing (everything from the outer control loop to the innermost render loop) into pure assembler, this time using floating point with the calculations written out inline (not stuffed away in subroutines). This pushed performance higher still -- though it also helped that I had a more modern CPU by then: a 1666MHz AMD Duron ran it at 450 FPS.
(That fits the entire program, and all its data, entirely into the CPU caches. Only video writes and device I/O have to touch the bus. The CPU is at a disadvantage, still running in 16-bit mode -- but it's clearly capable of executing about one instruction per cycle, on average!)
So, in summary: you can always burden yourself with code that is exponentially slower, yet still semantically correct. (That is: the code still accomplishes the same exact task, but through very different sequences of instructions.) You can always add layers of abstraction, but each added layer slows down your program by some factor. Shake loose from those layers of overhead, and you can harness the pure speed of your CPU. You don't need to write in cryptic languages, like assembly, to achieve this: C does just fine. You do need to avoid excessively heavy libraries and function calls when a simple expression or loop will do.
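As one last, hypothetical illustration of that point, here are two semantically identical C functions (a modern compiler may well see through the pow() call, but the principle stands):

#include <math.h>
#include <stdio.h>

/* "Layered": a general-purpose library call to square each element. */
double sum_squares_layered(const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += pow(v[i], 2.0);   /* general exponentiation, just for x^2 */
    return s;
}

/* Direct: one multiply and one add per element, no library involved. */
double sum_squares_direct(const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += v[i] * v[i];
    return s;
}

int main(void)
{
    double v[] = { 1.0, 2.0, 3.0 };
    printf("%f %f\n", sum_squares_layered(v, 3), sum_squares_direct(v, 3));
    return 0;   /* both print 14.000000 */
}

Same answer, very different instruction streams: one pays a call (and a general-purpose algorithm) per element, the other is arithmetic the compiler can keep in registers.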
Tim