Consider a few facts:
1. Signalling rates high enough that the propagation delay of the trace (let alone anything going over cables) is substantial, i.e. > 1/10th of a bit time.
2. That delay necessitates termination, to control reflections and the ISI they cause.
3. A CMOS pin driver (transmitter) pulls to the rails, whatever that may be, 1.8V, 2.5V, 3.3V.
4. The CMOS driver is usually source terminated (whether by setting Rds(on) in the ballpark, or by adding explicit internal or external series resistance), so we avoid DC losses at the load and can use full-swing inputs (CMOS receivers: just plain old inverters with a 30%-70% input voltage threshold, plus the usual ESD protection).
Putting all these facts together, we observe supply current goes up proportionally with signal rate, until, as we approach the delay > 1/2 bit time range, it levels off, at an average supply power of around Vdd^2 / (4 Zo). That is, the mean transmission line voltage is Vdd/2, so we drive pulses of Vdd or Vss into it, dropping (Vdd/2) across the termination resistance: Vdd/(2 Zo) of current, sourced from the supply roughly half the time.
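To put numbers on that, a quick sketch (assuming Zo = 100 ohms and a few common rail voltages; per-line figures, nothing vendor-specific):

```python
# Termination power for a source-terminated full-swing driver at high
# signal rates: mean line voltage is Vdd/2, so the average supply power
# levels off at about Vdd^2 / (4 * Zo).

def termination_power(vdd, zo):
    """Average power (W) once the line delay exceeds ~1/2 bit time."""
    return vdd**2 / (4 * zo)

for vdd in (1.8, 2.5, 3.3):  # typical CMOS IO rails
    p_mw = termination_power(vdd, zo=100) * 1e3
    print(f"Vdd = {vdd} V: ~{p_mw:.1f} mW per line")
```

Tens of milliwatts per pin, before any internal losses, and regardless of how short the trace is.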
This is in addition to internal gate losses (equivalent switching capacitance, the C*Vdd^2*f term), which aren't at all negligible for CMOS at these voltages and rates; that contribution keeps going up proportionally with rate.
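The capacitive term can be sketched similarly (the 10 pF equivalent capacitance and 100 MHz toggle rate are made-up but plausible figures):

```python
# Dynamic switching power: P = C * Vdd^2 * f * alpha, where alpha is the
# activity factor (fraction of cycles that actually toggle).

def switching_power(c_eq, vdd, f, alpha=1.0):
    """Average power (W) from charging/discharging equivalent capacitance."""
    return c_eq * vdd**2 * f * alpha

# Hypothetical driver: 10 pF equivalent capacitance, 3.3 V rail, 100 MHz.
p = switching_power(10e-12, 3.3, 100e6)
print(f"~{p * 1e3:.1f} mW")  # and it doubles if the rate doubles
```

Note this term never levels off the way the termination power does; it tracks the rate all the way up.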
The simple fact is we're driving whole volts into a ~100 ohm transmission line and termination, and doing this at Gbps is intensive.
And that's not to mention the demands placed on supply and grounding: such a transmitter needs multiple adjacent supply and ground pins, and even then might not be adequately bypassed in a packaged device (i.e., due to the inductance of leads or bondwires in a QFP or QFN).
So we would like to solve multiple problems at once.
- We can solve the supply and grounding problem by using steered constant currents. Old-school ECL* does this internally (the input stage is a differential pair), and externally if used appropriately (the outputs are always(?) complementary). LVDS does this explicitly: the transmitter is a current-sourced H-bridge. (The switches aren't perfect, which actually helps, as the softness defines a modest common-mode impedance for the pair, setting CM bias, and to some extent termination, at the same time.)
*Which hasn't gone away at all; ECL is still very much around. I guess it should be called just "school".
I don't run across it much in common modern digital interfaces, but it's still plenty popular for lower-level stuff, precision timing, etc.
- We can solve the equivalent capacitance problem by using ever-smaller fabrication nodes and lower supply voltages. Hence the typical 0.9-1.2V core voltage on anything reasonably complex. We still need level shifters out to the IO pads, but we don't have to do all the logic at the IO voltage.
- And we can solve the signal level problem by simply using less signal. This necessitates a different input stage, usually a complementary differential pair, much like any other rail-to-rail comparator input (and indeed, LVDS receivers can often be (ab)used in this way). The input stage is much more complex, but that's alright: once it's solved, it's solved (i.e., fabs have standard library cells for it, or whatever).
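To illustrate the current-steering and reduced-swing points above with rough numbers (the 3.5 mA and 100 ohm values are LVDS nominals; the comparison against full-swing is my own back-of-envelope):

```python
# LVDS: a constant current steered through a 100-ohm differential
# termination, versus a full-swing driver into the same impedance.

i_lvds = 3.5e-3   # nominal LVDS drive current (A)
r_term = 100.0    # differential termination (ohms)

v_swing = i_lvds * r_term     # differential swing at the receiver
p_term = i_lvds**2 * r_term   # power delivered to the termination
print(f"swing = {v_swing * 1e3:.0f} mV, termination power = {p_term * 1e3:.2f} mW")

# Compare with a 2.5 V full-swing driver into the same line,
# using the Vdd^2 / (4 Zo) figure from earlier:
p_cmos = 2.5**2 / (4 * r_term)
print(f"full-swing: {p_cmos * 1e3:.1f} mW, ratio ~{p_cmos / p_term:.0f}x")
```

(The transmitter's total supply draw is higher than the termination power, of course, since the current source needs headroom; but the draw is constant, which is exactly the supply/grounding benefit.)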
That leaves the DC consumption problem, which is unfortunate, but there are solutions for that as well. MIPI, for example, only drives the bus when data is being transmitted, which saves nicely on mobile devices with intermittent display updates. Think: a scaled-down and greatly sped-up RS-485.
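The saving from bursting is just duty-cycle arithmetic; a sketch (the 5% activity figure is an assumption for illustration, not anything from the MIPI spec):

```python
# Average power of a link that only drives the bus while transmitting.

def average_power(p_active, duty, p_idle=0.0):
    """Average link power given active power, duty cycle, and idle power."""
    return p_active * duty + p_idle * (1.0 - duty)

# Take the earlier Vdd^2 / (4 Zo) figure at 2.5 V into 100 ohms as the
# active power, and assume the bus is busy 5% of the time:
p_active = 2.5**2 / (4 * 100)
print(f"~{average_power(p_active, 0.05) * 1e3:.2f} mW average at 5% activity")
```

Milliwatts down to fractions of a milliwatt, which is the whole point for a battery-powered device.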
Or, I don't know if PCIe does a similar thing or what exactly, but it's AC-coupled, so it physically cannot deliver DC; receivers then need their own DC bias (and probably ALC or DC restore or something like that), and there's even more complexity on top (clock recovery, for starters). But yeah, again, in the Gbps range it's going to be complex, and that's why we have 10s-of-nm core logic to handle all that stuff.
Whereas down in the Mbps range, you wouldn't bother with any of this stuff; it's nowhere near worth the complexity. The loss due to charging (relatively very short) transmission lines to modest supply voltages isn't a big expense, so that's why we stick with plain old CMOS there.
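To put a number on "not a big expense" (the trace capacitance, bit rate, and toggle pattern here are assumptions for illustration):

```python
# At Mbps rates over a short trace, the line just looks like a lumped
# capacitor; the only loss is recharging it, P = C * Vdd^2 * f_rise,
# where f_rise is the rate of rising edges (half the bit rate for a
# worst-case 1010 pattern).

c_trace = 10e-12   # assumed: ~10 cm of PCB trace at ~1 pF/cm
vdd = 3.3
f_rise = 5e6       # 10 Mbps, alternating bits -> 5e6 rising edges/s

p_line = c_trace * vdd**2 * f_rise
print(f"~{p_line * 1e6:.1f} uW")  # about half a milliwatt, worst case
```

Compare that with the tens of milliwatts a terminated full-swing line burns continuously, and the "just use CMOS" choice at these rates is obvious.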
Tim