Interesting, I thought the cap was just another way to reduce latency; I didn't know it had other effects. But as I understand it, it acts as a kind of "charge transfer boost": when the input has been at 5V and you set it to zero, about -4.3V is "applied" to b-e with no series resistance (except the cap's parasitics), sucking charge out of the BJT, right?
The 1N4148 also works as a latency reducer (anti-saturation), but it's not as good as a Schottky. I have measured both in practice, in a close cousin of the driver I posted (the N-MOS driver, which has a 2N3904 at the input; I've posted those scope shots in this forum). The Schottky really brings turn-off latency down to virtually zero: from the original 7us, the 1N4148 gets it to 1.75us and the BAT54 to ~0us.
Indeed, as Tim pointed out, charge storage in the BJT isn't always simulated. I had to hunt around for models until I found one that does simulate it; let me dig it up and I'll post it here.
In the original schematic I posted, the sub-circuit made up of Q5 (top NPN), its base resistor and the diode is a "fast MOS turn-off" technique I read about somewhere, in an app note I think. The rest is simple stuff; I just added the Zener to keep Vgs within spec without limiting the current (thus keeping a fast turn-on).
Is the 100 Ohm resistor your simulated load? Is it your true load?
When simulating this kind of circuit, I also usually give the power source some internal resistance and add a little inductance to the power lines, as well as resistance and inductance to the "DC-link" capacitors. For example, if you're switching a few A in a few hundred ns or less with bigger devices such as TO-220, the inductance in the lines from the driver to the FET will matter, as will the inductance between the DC-link caps and the FETs, and so on, and you should take this into account in simulation. Hint: switch 5A in 100ns over 50nH (rough estimate for a 1 inch go-and-return connection distance) -> v = L di/dt = 50nH x 5A / 100ns = 2.5V spike.
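If you want to play with those numbers before touching the simulator, here is a minimal sketch of the v = L di/dt estimate. It assumes a linear current ramp and lumps all the stray inductance into one value; the function name and constants are just for illustration, and the 50nH figure is the same rough guess as above.

```python
def inductive_spike(l_henry, di_amp, dt_sec):
    """Voltage across a parasitic inductance for a linear current ramp: v = L * di/dt."""
    return l_henry * di_amp / dt_sec


# Assumed values from the hint above (rough estimates, not measurements)
L_LOOP = 50e-9   # ~1 inch go-and-return connection, in henries
DI = 5.0         # switched current step, in amperes
DT = 100e-9      # switching time, in seconds

spike = inductive_spike(L_LOOP, DI, DT)
print(f"Estimated spike: {spike:.2f} V")
```

Halving the switching time or doubling the loop length doubles the spike, which is why the layout between the DC-link caps and the FETs matters so much.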