If you want to REALLY understand how a CPU works, write a compiler for it.
I'm not sure about that. You can write a "pretty good" compiler using only a fraction of the features of a CPU.
That depends on the CPU!
RV32I is a nice fit for compiling all of C/C++ [1] and similar languages, with just 37 instructions, every one of which the compiler will use, given suitable source code. Of course, if your program doesn't do any XORs then the XOR instruction won't get used, etc.
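As a quick illustration (my own example, not the parent's): a trivial function needs only a couple of base-ISA instructions. The assembly in the comment is roughly what gcc -O2 targeting rv32i produces, quoted from memory, so treat it as illustrative:

    /* Maps directly onto two RV32I ALU instructions plus the return,
       roughly:
           xor  a0, a0, a1
           addi a0, a0, 4
           ret              (pseudo-instruction for jalr x0, 0(ra))  */
    int f(int a, int b)
    {
        return (a ^ b) + 4;
    }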
In fact, that's part of the anti-CISC argument: designers were adding all sorts of instructions to their architectures that were nearly impossible for compiler writers to use effectively in a program (sure, you could have your aged assembly language programmers use them in libraries, but... that was less interesting).
That is true, but it's even worse than that: those instructions were often just a convenience for assembly language programmers, and actually executed more slowly than a sequence of simpler instructions. Patterson found many examples of this on the VAX-11/780, and Cocke on the IBM System/370.
Your aged assembly language programmers should not even use them in libraries, only in quickly written application code where programmer productivity was more important than machine efficiency.
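If I remember right, the textbook example is the VAX INDEX instruction, which performed one step of a bounds-checked array subscript calculation; the equivalent sequence of simple VAX instructions was reportedly measurably faster on the 11/780. A rough C sketch of what the single instruction computed (the function name is mine, and the operand order and exact formula are from memory, so approximate):

    /* Approximate semantics of one VAX INDEX step: accumulate a
       multi-dimensional subscript and trap if it is out of range.
       Details from memory; illustrative only.                     */
    #include <stdlib.h>

    long vax_index_step(long subscript, long low, long high,
                        long size, long indexin)
    {
        if (subscript < low || subscript > high)
            abort();                  /* the real hardware raised a trap */
        return (indexin + subscript) * size;
    }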
I was very unhappy that the compiler class I took in college spent so little time on optimization, or even on converting generic 3-operand intermediate code into instructions for a specific CPU. But that used to be where a lot of compiler writers stopped.
I took compiler classes at university in 1983/4 and actually worked on back-ends for VAX and M6809.
We certainly covered, and implemented, the most important optimisations, such as constant expression evaluation, strength reduction, common subexpression elimination, and code motion out of loops.
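To make those concrete, here's a small C sketch (mine, not from the course) with each transformation marked; a compiler does this on its intermediate representation, but the effect is the same:

    #define N 16

    int demo(int *a, int n, int x)
    {
        int limit = N * 4;          /* constant expression: folded to 64   */
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int scaled = i * 8;     /* strength reduction: the multiply
                                       becomes an add of 8 per iteration   */
            int bias = x * limit;   /* loop-invariant: hoisted out of loop */
            sum += a[i] + scaled + bias
                 + x * limit;       /* common subexpression: reuses bias   */
        }
        return sum;
    }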
We studied and implemented register allocation, but the algorithms at the time were pretty crude. Patterson has said that RISC-I and RISC-II would never have used register windows if they'd known of a good register allocation algorithm for their compiler.
One of the largest amounts of time was spent on trying to match address-calculation DAGs to all the crazy addressing modes: figuring out when to use plain arithmetic rather than an addressing mode, and when to bail out of matching the whole address-calculation DAG to a single addressing mode, just emit an LEA with what you had so far, and start again from that.
And using complex addressing modes was at odds with common subexpression elimination (and still is today with x86 and Arm scaled indexed addressing).
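A small example of that tension; the assembly in the comments is roughly what optimising compilers emit, quoted from memory, so illustrative only. Fold the address computation into a scaled indexed operand and it gets recomputed at every reference; expose it as an ordinary value and CSE keeps it in a register, but then the fancy addressing mode goes unused:

    struct point { int x, y; };

    /* x86-64 can fold the address into each memory operand, recomputing
       base + i*8 inside both references:
           mov  eax, [rdi + rsi*8]
           add  eax, [rdi + rsi*8 + 4]
       A base+offset machine exposes the address, so CSE computes it once:
           slli t0, a1, 3
           add  t0, a0, t0
           lw   a2, 0(t0)
           lw   a3, 4(t0)
           add  a0, a2, a3                                                */
    int sum_xy(struct point *p, int i)
    {
        return p[i].x + p[i].y;
    }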
Pure RISC with no more than base+offset addressing is SO MUCH easier to generate code for. You just need a few more instructions and a few more temporary values in registers. But those are easily subjected to all the normal optimisations: common subexpression elimination, moving some of it out of loops, etc.
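For example, a strided array access on a base+offset machine is just shifts and adds feeding a plain load, so the usual optimisations handle the address arithmetic too. The RV32-flavoured assembly in the comment is a rough sketch, not compiler output:

    /* With only base+offset addressing the address arithmetic for
       a[i][j] is ordinary instructions, roughly
           slli t0, i, 8        # i * 256 (one row of 64 ints)
           add  t0, a, t0
           slli t1, j, 2
           add  t0, t0, t1
           lw   t2, 0(t0)
       and because they are ordinary instructions, the j*4 part is
       hoisted out of the loop and the i*256 part strength-reduced
       into a pointer bumped by 256 bytes per iteration.            */
    int sum_col(int a[][64], int n, int j)
    {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i][j];
        return s;
    }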
[1] With soft-float used if the program uses floating point; you can add the F or D extensions if FP performance is critical.