Low complexity, cost and power, together with deterministic behaviour, are also key requirements in embedded systems.
You can see that e.g. the Cortex-M7 chips that run at 300-400 MHz are getting deeper pipelines (6 stages), but that is still nowhere near modern Intel chips:
- Modern Intel chips have pipelines on the order of 14 stages. The Pentium 4 (NetBurst) went to something like 30, but Intel reverted that decision in later architectures.
- Deep pipelines require a whole slew of things to extract optimal performance:
1) Branch predictors. Several instructions may already be in flight between fetching a branch and resolving its final outcome, so accurate predictions are important.
2) Branch predictors come in different styles. One of the simplest remembers the last branch result and keeps reapplying it. But with tight loops and a deep pipeline you suffer massively from even a small number of mispredictions. More complex branch predictors use PC-dependent heuristics with multi-level dictionary lookups to get better predictions.
3) Modern processors tuned for performance use speculative execution. This executes the instructions after a branch before its outcome is known, and throws the results away if the prediction was incorrect. Thrown-away results == wasted energy.
4) Programs have lots of dependencies and associated hazards, including false ones.
4a) Data hazards: e.g. don't overwrite a value before all previous instructions have read it (write-after-read). Make sure that if 2 writes happen to the same register, the last value sticks (write-after-write).
4b) Control hazards (e.g. branches, as explained)
4c) Structural hazards: e.g. an integer divider that is not fully pipelined can accept a new operation only once every 32 cycles, but the program issues divisions faster than that.
5) Modern processors use out-of-order execution and bypass false dependencies as much as possible, e.g. with register renaming or the famous Tomasulo's algorithm. This is beneficial for performance but requires more bookkeeping (e.g. re-order buffers).
5a) It does allow multiple instructions to be issued per clock cycle, so more than 1 instruction can complete per clock cycle, provided hazards do not get in the way.
6) Fast processor = fast memory buses = problems. Fast CPUs add multi-level cache structures; you can even see this on microcontrollers like ARM Cortex-M7 chips or the PIC32MZ, which employ program/data caches. Most modern microcontrollers also employ flash accelerators that are in some way also a cache.
7) Despite all these efforts, the amount of parallelism you can exploit in a single-threaded program is limited. You see many CPUs introducing 2-way (or more) "hyperthreading", which interleaves the execution of multiple threads on the same core to extract more multi-threaded performance out of it.
8) Imagine this complex well-oiled machine executing instructions like mad. Then imagine that it also needs to handle exceptions, i.e. it needs to stop the thread it was executing and switch state. Oh, and we also want to do this in a precise manner, i.e. we want to return to the original program once the exception (or interrupt) has finished, with no side effects on any internal state or register of any component of our CPU. To accomplish this precise exception behaviour, the CPU may need to undo or cancel instructions that are in flight, so that it can show the exact content of e.g. the registers at a particular moment in the program.
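To make points 1 and 2 a bit more concrete, here is a minimal Python sketch that counts mispredictions for the branch closing a tight loop, once with the "remember the last result" scheme and once with a 2-bit saturating counter. The function names, the initial predictor states and the flat per-misprediction `flush_penalty` are assumptions for illustration only; real predictors and real flush costs are more involved.

```python
def loop_branch_outcomes(iterations, repeats):
    """Outcomes of the branch that closes a loop: taken (True) on every
    iteration except the last one, for `repeats` runs of the loop."""
    return ([True] * (iterations - 1) + [False]) * repeats


def mispredicts_last_outcome(outcomes):
    """Simplest scheme: predict whatever the branch did last time."""
    prediction = True  # assumed initial state
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # remember the last result
    return misses


def mispredicts_two_bit(outcomes):
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    counter = 3  # assumed initial state: strongly taken
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        # Move one step towards the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


def wasted_cycles(misses, flush_penalty):
    """Toy cost model: every misprediction costs a fixed number of cycles."""
    return misses * flush_penalty


if __name__ == "__main__":
    outcomes = loop_branch_outcomes(iterations=4, repeats=1000)
    for name, count in (("last outcome", mispredicts_last_outcome(outcomes)),
                        ("2-bit counter", mispredicts_two_bit(outcomes))):
        # Assumed penalties: 2 cycles (short pipeline) vs 13 cycles (deep one).
        print(name, count, wasted_cycles(count, 2), wasted_cycles(count, 13))
```

For a 4-iteration loop run 1000 times, the last-outcome scheme mispredicts 1999 times (the loop exit, plus the first iteration of every re-entry), while the 2-bit counter mispredicts 1000 times (the loop exit only). The same miss count gets more expensive as the assumed flush penalty grows with pipeline depth.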
Contrast this with a processor that uses a 2- or 3-stage pipeline, like the AVR, PIC24 or low-end ARM Cortex-M chips. A lot of the "problems" mentioned above are irrelevant at that point. This also makes the system very deterministic, which is what we often want in embedded applications where latency and jitter are important.
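As a rough illustration of the jitter argument, the toy model below compares the best and worst case cycle count of a short code sequence, assuming that a wrongly guessed branch costs `pipeline_depth - 1` cycles and that nothing else (caches, bus contention, interrupts) varies. The function names and the penalty formula are hypothetical simplifications, not a cycle-accurate description of any real core.

```python
def latency_bounds(instructions, branches, flush_penalty):
    """Best case: no branch costs extra. Worst case: every branch
    costs a full flush of `flush_penalty` cycles."""
    best = instructions
    worst = instructions + branches * flush_penalty
    return best, worst


def jitter_cycles(instructions, branches, pipeline_depth):
    """Spread between worst and best case, in cycles.
    Assumption: one flush costs (pipeline_depth - 1) cycles."""
    best, worst = latency_bounds(instructions, branches, pipeline_depth - 1)
    return worst - best


if __name__ == "__main__":
    # A hypothetical 100-instruction handler containing 10 branches.
    for depth in (3, 6, 14, 30):
        print(depth, jitter_cycles(100, 10, depth))
```

Under these assumptions the spread grows linearly with depth: 20 cycles for a 3-stage pipeline versus 130 cycles for a 14-stage one, for the same code.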
Also, a lot of the above points are solved by throwing more and more transistors at the problem. Although modern silicon technologies have transistors in abundance (wiring and power are more often the problem), all those transistors burn power even when not switching, and with a billion of them that adds up, so you see many modern chips employing complex power management strategies. E.g. modern Intel CPUs have 10+ power states per core (not even counting frequency/voltage turbo boosts), plus 10 power states for the package and 5 system states the computer can be in. To make these CPUs run fast, they have become incredibly complex machines that are nowhere near microcontrollers.