Not an expert when it comes to discretes, but on chip at RF we can tune out all the capacitance with inductance. The thing that limits our ability to do so up to arbitrary frequencies is the resistance of the terminals (mostly gate resistance), since it sits between the capacitor we want to tune out and the inductor/transmission line we use to do the tuning. I wonder if this is related to what is going on here?
Also, what about mobility and so on? Different doping levels or the like influencing the carrier recombination time of the channel? I would imagine all that happens orders of magnitude faster than the rates at which you can switch these things, but I'm not sure.
Right, in fact we can methodolize that --
1. Understand the network the transistor is embedded within.
2. Determine the intrinsic properties of the transistor, in some reference arrangement. (Remember, surroundings matter, too: whether it's bolted to a heatsink, etc.)
Once we know R, L and C, we can tune out L and C for narrowband purposes, or do our best to peak them for baseband purposes.
Note, switching is a baseband application. It's also large-signal, so we need a model that accommodates that as well, not just a small-signal s-matrix or the like.
3. Device properties. If we have insight into their fabrication*, we can begin to understand physical properties and fundamental limitations.
*Heh, well, you do, in your field of work; the rest of us, down here at DC, well...
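To put rough numbers on the narrowband case, here is a minimal sketch that assumes the input looks like a plain series R-C. It computes the inductance that resonates the capacitance out at one frequency, and the Q that the series resistance leaves you with. The function names and the component values (30 pF, 2 ohms, 100 MHz) are made up for illustration, not measured from any device.

```python
import math

def tuning_inductance(c_farads, f_hz):
    """Inductance that resonates with c_farads at f_hz: L = 1 / (w^2 * C)."""
    w = 2 * math.pi * f_hz
    return 1.0 / (w * w * c_farads)

def series_rc_q(r_ohms, c_farads, f_hz):
    """Q of a capacitance with series resistance: Q = 1 / (w * R * C)."""
    w = 2 * math.pi * f_hz
    return 1.0 / (w * r_ohms * c_farads)

# Hypothetical illustration values, not taken from a datasheet.
C_IN = 30e-12   # input capacitance, farads
R_G = 2.0       # series (gate) resistance, ohms
F0 = 100e6      # frequency to tune at, hertz

L_TUNE = tuning_inductance(C_IN, F0)
Q_IN = series_rc_q(R_G, C_IN, F0)

print(f"L to tune out C at {F0 / 1e6:.0f} MHz: {L_TUNE * 1e9:.1f} nH")
print(f"Q limited by series R: {Q_IN:.1f}")
```

Since Q falls as 1/f for fixed R and C, the series resistance is what eventually makes the tuning pointless as frequency goes up.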

Now, I've had accidental oscillation up at several hundred MHz before. Not a very strong signal (the amplitude being a fraction of total supply -- mind, still way more than I'd want during an EMI scan), but these things are definitely getting squirrely. That's kind of around the fT of the device. Which doesn't mean it's incapable of power gain up there -- it's just not nearly as much as at lower frequencies, and the impedances are all flippy-floppy, making it impossible to amplify a pulse that square. An analogy might be blowing a whistle with molasses: maybe it oscillates, but it takes a lot of effort, so it's not very useful, just annoying when it happens...
Since switching is baseband, we're limited to some fraction of fT, and to obtain reasonable squareness or efficiency, it's a tiny fraction at that -- maybe 1/100th.
Which means, a PRF of some MHz, or pulse widths of 100ns give or take, are reasonable. Which, well, that's what we're seeing here.
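As a back-of-envelope check on that arithmetic, here is a small sketch. The 500 MHz fT is an assumed stand-in for "several hundred MHz", the 50% duty cycle is also an assumption, and the function name is made up for illustration.

```python
def baseband_limits(f_t_hz, fraction=0.01, duty=0.5):
    """Rule-of-thumb switching limits from fT.

    fraction: usable fraction of fT for reasonably square pulses (1/100 here).
    duty:     assumed duty cycle, used to turn the period into a pulse width.
    Returns (pulse repetition frequency in Hz, pulse width in seconds).
    """
    prf_hz = f_t_hz * fraction
    pulse_width_s = duty / prf_hz
    return prf_hz, pulse_width_s

# Assumed fT, standing in for "several hundred MHz".
PRF, PW = baseband_limits(500e6)
print(f"PRF ~ {PRF / 1e6:.1f} MHz, pulse width ~ {PW * 1e9:.0f} ns")
```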

So that's device limits. Once we have a model of pin strays and device R and C, we can optimize the driver and load coupling networks.
I do think device physics, mobility and such, have some effect nearby, but for the most part are well outside the range of interest. If resistances were lower, fT would be higher, and we could push narrowband operation up there, maybe at useful power levels. Baseband operation still wouldn't be up there, but given lower inductance connections, we could at least push it a little higher.
So, I think we're currently at step 0 -- my vote is on uncontrolled mutual source inductance. We have much to cover before we can even begin to observe device limits. Not that a full understanding is very applicable here, or even very possible with these devices; but the first two steps will be of great value.

The big differences between switching and RF types are: impedance range, lead inductances, and power dissipation. Capacitance nonlinearity also tends to be a bit better with RF types.
Impedance, because switching devices use big junctions (lots of width or perimeter), prioritizing low Rds(on) at the expense of capacitance.
Inductance, because leads are easier to work with than pads (e.g. DFNs) or tabs (e.g., RF microstrip packages).
Power, because to get the output moving at the desired rate (against all that capacitance), much load current must flow.
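The current needed to slew a capacitance follows from I = C * dV/dt. A quick sketch, with purely hypothetical numbers (1 nF swung through 100 V in 10 ns) and a made-up function name, shows why the load current gets large:

```python
def slew_current(c_farads, dv_volts, dt_seconds):
    """Average current needed to move a capacitance by dv in dt: I = C * dV/dt."""
    return c_farads * dv_volts / dt_seconds

# Hypothetical switching-device numbers, for illustration only.
I_SLEW = slew_current(1e-9, 100.0, 10e-9)
print(f"Slewing 1 nF by 100 V in 10 ns takes about {I_SLEW:.0f} A")
```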
For example, the PD57006-E is basically a power 2N7002. It has very wide gate and drain terminals, the package can dissipate 10W, and it has similar Rds(on) (ca. 1 ohm) and capacitances (ca. 30pF). Probably the die area is substantially larger, and the layout is optimized for isolation (low Crss) rather than, well, I don't know what a 2N7002 is really optimized for, if anything at all, to be honest... And of course, lower intrinsic resistances. The 2N7002 is dropping off heavily by 50MHz (where its gain is dropping approximately as sqrt(1/f), so that it still has some gain left even at 200 or 400MHz, actually), but this thing is useful beyond 1GHz due to all these differences.
Also interesting that old transistors, like 2N7002, typically have a diffusion characteristic -- resistance and gain going as sqrt(1/f). Newer transistors have solutions for this -- presumably, they have moved from uniform planar interconnects (the gate is a sea of metallization applied over the source metallization and channel oxide), to fractal interconnects (routing in stages, from wide traces off the bond pad, to smaller secondary traces, and so on, finally connecting tiny traces up to individual MOS cells; the resistance from bondwire to any transistor cell is approximately equal). This gives good agreement when modeling a transistor as a lumped equivalent gate resistance and capacitance.
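One way to see the difference is to compare the input impedance of a uniform distributed RC line (my assumed stand-in for the "sea of metallization" gate, open at the far end) against a lumped series R-C. This is only a sketch: the 10 ohm and 30 pF totals are hypothetical, and the function names are made up for illustration. For the distributed line, Z = R * coth(theta) / theta with theta = sqrt(j*w*R*C), which goes as sqrt(1/f) at high frequency and looks like R/3 in series with C at low frequency.

```python
import cmath
import math

def distributed_rc_impedance(r_total, c_total, f_hz):
    """Input impedance of a uniform, open-ended distributed RC line.

    Z = R * coth(theta) / theta, with theta = sqrt(j * w * R * C).
    """
    theta = cmath.sqrt(1j * 2 * math.pi * f_hz * r_total * c_total)
    return r_total / (theta * cmath.tanh(theta))

def lumped_rc_impedance(r_lumped, c_total, f_hz):
    """Input impedance of a single series R-C."""
    return r_lumped + 1 / (1j * 2 * math.pi * f_hz * c_total)

# Hypothetical totals, for illustration only.
R_GATE = 10.0    # total gate resistance, ohms
C_GATE = 30e-12  # total gate capacitance, farads
F_CORNER = 1 / (2 * math.pi * R_GATE * C_GATE)

for mult in (0.01, 1, 100, 10000):
    f = mult * F_CORNER
    zd = distributed_rc_impedance(R_GATE, C_GATE, f)
    zl = lumped_rc_impedance(R_GATE / 3, C_GATE, f)
    print(f"f = {mult:>7} x corner: |Z_dist| = {abs(zd):.4g}, |Z_lumped| = {abs(zl):.4g}")
```

Well below the corner the two agree, which is why a lumped R and C is a good model there; well above it the distributed line keeps falling as sqrt(1/f) while the lumped model flattens out at its resistance.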
Tim