> We do have an ACS781LL ±100A Hall effect sensor in the motor path as well. However, the data from that is not currently used as a brake input, just as feedback to the control system

Here's your problem. (You can, of course, have other problems as well, but this is the most probable primary cause.) IMHO, this pulse-by-pulse limit should be the first thing verified before applying actual power.

A very well tuned feedback loop controlling a PWM may be able to protect the devices, but it's almost unheard of to get the response tuned right from the start of prototyping, hence the need for pulse-by-pulse limiting.
> These FETs are rated at 80A continuous / 200A pulsed, which I had assumed would mean we could get away with switching a load from a 100A supply.
Your DC link cap bank can supply a current limited only by its ESR - likely an order of magnitude more than your nominal supply current (otherwise the cap bank wouldn't be very effective).
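To put numbers on that, here's a quick sanity check. All values below are hypothetical placeholders - plug in your own cap datasheet and layout figures:

```python
# Back-of-envelope estimate of the current a DC link cap bank can push
# into a fault, limited only by its ESR plus the loop resistance.
v_link = 48.0        # DC link voltage, volts (assumed)
esr_per_cap = 0.030  # ESR of one electrolytic, ohms (assumed)
n_caps = 4           # caps in parallel (assumed)
r_loop = 0.002       # busbar/trace resistance + FET Rds(on), ohms (assumed)

esr_bank = esr_per_cap / n_caps          # parallel caps divide the ESR
i_fault = v_link / (esr_bank + r_loop)   # Ohm's law through the fault loop
print(f"bank ESR: {esr_bank*1000:.1f} mohm, fault current: {i_fault:.0f} A")
```

With these made-up numbers, the available fault current is in the thousands of amperes - far beyond anything the 100A supply rating suggests.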
Oversizing the FETs and doing without active current sense is a valid strategy in very small motor controllers (maybe a few watts), but for anything bigger, the cost of the FETs and their drivers becomes much more than the cost of the current sense.
Remember that the motor iron has some saturation current (assume 3 * Inominal if you have no exact information), after which the current rise rate increases and finally shoots up. In this case, energy stored in the DC link cap bank can supply, say, 1000 amperes quite easily, which again may be able to kill the FET in tens to hundreds of microseconds, within one or a few cycles. A feedback loop typically cannot react to this.
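A back-of-envelope di/dt check (with invented values - substitute your own winding data) shows how little time a feedback loop would have:

```python
# Once the iron saturates, current ramps at roughly di/dt = V/L with the
# much smaller saturated inductance. All values are assumptions.
v_link = 48.0      # volts across the winding during the pulse (assumed)
l_normal = 200e-6  # unsaturated winding inductance, henries (assumed)
l_sat = 10e-6      # effective inductance after hard saturation (assumed)

didt_normal = v_link / l_normal        # A/s before saturation
didt_sat = v_link / l_sat              # A/s after saturation
t_to_1000A = 1000 / didt_sat           # seconds to hit 1000 A from saturation
print(f"{didt_sat/1e6:.1f} A/us saturated; 1000 A reached in {t_to_1000A*1e6:.0f} us")
```

With these numbers the current reaches 1000 A in roughly 200 us - consistent with the "tens to hundreds of microseconds" window above, and far faster than a software loop iterating at PWM rate can react.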
A bad way to deal with this is to add tons of control logic such as ramps, soft starts, all kinds of edge case detections, but you'll always have some untreated edge case. Hence, do finish the proper pulse-by-pulse current limiting; it's also easier and simpler to understand.
You may be able to replace the feedback loop completely with a peak current mode control scheme; in some cases it does the torque limiting for you just fine, in which case there's no need for two separate systems.
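If it helps to see what the comparator latch actually does, here's a toy per-cycle simulation - not firmware, and every value is invented. The ON pulse terminates the instant current crosses the threshold, and the latch resets at the start of the next PWM cycle:

```python
# Toy model of pulse-by-pulse (cycle-by-cycle) current limiting on a
# plain RL load, explicit Euler integration. All values are made up.
V, L, R = 48.0, 50e-6, 0.1  # link volts, load inductance/resistance (assumed)
I_LIMIT = 90.0              # comparator trip level, amps (assumed)
F_PWM, DUTY = 20e3, 0.8     # switching frequency and commanded duty (assumed)
DT = 1e-7                   # simulation time step, seconds

i, peak = 0.0, 0.0
steps_per_cycle = int(1 / (F_PWM * DT))
for cycle in range(20):
    tripped = False                       # latch resets each PWM cycle
    for n in range(steps_per_cycle):
        on = (n < DUTY * steps_per_cycle) and not tripped
        if on and i >= I_LIMIT:
            tripped = True                # comparator fires: pulse ends early
            on = False
        v = V if on else 0.0              # freewheel path idealised as 0 V
        i += (v - i * R) / L * DT         # di/dt = (v - i*R)/L
        peak = max(peak, i)
print(f"peak current: {peak:.1f} A (limit {I_LIMIT} A)")
```

Even with the duty cycle commanded far too high, the peak current stays pinned just above the trip level instead of running away toward V/R.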
> Is pulsed current failure consistent with the resulting shorted state of the FET?
Yes. Every other cause has the same result, though, so it's hard to say which it was. Basically, the cause is excessive Vds, excessive Vgs (in either direction! - use a fast scope, e.g. 500MHz, to check both), or excessive power dissipation while fully or partially conducting - see the SOA curve (and remember that it's often rated at Tj = 25 degC, which is completely unrealistic).
> Notably, the FET (package, at any rate) is not hot after failure - thermal camera reports it is only ~10-20°C above ambient.
The fact that they are not hot only proves that a small overload sustained for a long time, causing slow but finally excessive heating, was not the probable cause. Local die heating can happen in microseconds (which is also why traditional fuses, or input supply current limiting before a capacitor bank, cannot protect semiconductors). The amount of energy in such an event may be very small, but enough to destroy the die (milligrams) or even hotspot a part of the die (micrograms worth of material). After this heat spreads to the package, the temperature increase is almost zero.
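The energy scale involved really is tiny. A rough E = m·c·ΔT estimate, using approximate silicon properties:

```python
# Energy needed to heat silicon to destruction: E = m * c * dT.
# c ~ 0.7 J/(g*K) for silicon; dT ~ 1375 K takes it from 25 C to
# roughly its melting point. Die masses are illustrative guesses.
c_si = 0.7        # J/(g*K), specific heat of silicon (approximate)
delta_t = 1375.0  # K, roughly 25 C up to silicon's melting point

for m_grams, label in [(5e-3, "whole ~5 mg die"), (5e-6, "~5 ug hotspot")]:
    e = m_grams * c_si * delta_t
    print(f"{label}: ~{e:.3g} J")
```

A few joules destroys the whole die, and a few millijoules is enough to hotspot part of it - amounts a multi-kilowatt fault current delivers in microseconds, yet far too little to warm the package noticeably afterwards.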
> If you do think that this is the problem, we can restructure the board such that if the current goes above (say) 90A the comparator output from that inhibits the A/B motor signals.
This should always be the first thing you do and verify. I verify it by injecting a test voltage over the shunt (I simply discharge a capacitor through the shunt if I don't have a large enough supply on hand!) or through the hall sensor, before actually applying the DC link voltage. If it works, I have ruled out excessive currents, and have fewer reasons to blow FETs during development. Blown FETs slow down the work, especially if they cause hidden damage to gate drivers and you fail to replace the gate drivers as well...
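For sizing that capacitor-discharge trick, the peak current and time constant of the resulting RC pulse tell you whether the trip circuit will see it. Hypothetical values throughout:

```python
# A pre-charged capacitor dumped through the shunt gives a current pulse
# of peak V0/R_total decaying with tau = R_total*C. Values are guesses.
v0 = 12.0        # capacitor pre-charge voltage, volts (assumed)
c = 4700e-6      # test capacitor, farads (assumed)
r_shunt = 0.001  # shunt resistance, ohms (assumed)
r_wiring = 0.05  # leads + capacitor ESR, ohms - usually dominates (assumed)

r_total = r_shunt + r_wiring
i_peak = v0 / r_total   # initial current through the shunt
tau = r_total * c       # exponential decay time constant
print(f"peak {i_peak:.0f} A, decaying with tau = {tau*1e6:.0f} us")
```

With these numbers the pulse peaks above 200 A for a couple of hundred microseconds - comfortably above a 90A threshold and long enough for a comparator-based trip to fire, with no DC link power anywhere near the FETs.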
> Thanks for the advice about the PWM frequency - we had been using 20kHz in a previous system, but had received advice from the only person we actually know in real life with motor driving experience, and they had suggested a much higher frequency. We'll revert that back down.
Going much over 20kHz might be advisable if dogs start complaining about the high-pitched noise, but even then, going over about 30-40kHz is not worth it. Seeing that EVs, which typically run at or below 20kHz with a high but clearly audible whine, don't cause many complaints, this isn't usually an issue.
100kHz itself isn't impossible to achieve, just hugely nonoptimal for such large motor drives. They have massive inductance, so the ripple current is next to nothing even at lower frequencies. A high switching frequency forces you to increase edge rates to keep switching losses sensible, which is problematic for EMI. OTOH, if you drop to 20kHz, you have reduced the switching loss to 20%, and may be able to slow down the edges more (bringing switching losses back up), or add more aggressive snubbers if EMI ever becomes a problem.
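To illustrate both scalings with made-up but plausible numbers (a buck-style ripple estimate, treating the per-event switching energy as fixed):

```python
# Switching loss scales ~linearly with f_sw, while ripple current in a
# large motor inductance is tiny even at 20 kHz. All values assumed.
v_link = 48.0     # DC link voltage, volts (assumed)
l_motor = 500e-6  # per-phase motor inductance, henries (assumed)
duty = 0.5        # worst-case duty for ripple
e_sw = 200e-6     # energy lost per switching event, joules (assumed)

for f_sw in (20e3, 100e3):
    # peak-to-peak ripple of a buck-style leg: V*D*(1-D)/(L*f)
    ripple = v_link * duty * (1 - duty) / (l_motor * f_sw)
    p_sw = e_sw * 2 * f_sw  # one turn-on + one turn-off per cycle
    print(f"{f_sw/1e3:.0f} kHz: ripple {ripple:.2f} App, switching loss {p_sw:.1f} W")
```

With these numbers, going from 100kHz down to 20kHz cuts switching loss by 5x, while the ripple current - already negligible against a 100A scale - only grows from a fraction of an amp to about an amp.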