The clamping voltage of the MCU (or whatever device) itself will be vastly higher than VDD during an ESD pulse. Don't worry about it. The abs. max. rating is a DC figure only; how long you can apply some overvoltage is not defined. Evidently, it can handle -- well, whatever ESD it can. (Usually 1 or 2kV HBM is the default; this isn't always stated in the datasheet(!!) but may be in supporting documents, e.g. quality reports.)
(Occasionally devices will carry a rating like "VDD+2V for 2ns", which accounts for overshoot on signal traces due to poor signal quality. I don't think I've ever seen a rating for intermediate time scales, i.e. µs or ms. I've also never seen it defined in terms of an ESD waveform alone -- but this is implied by the ESD test itself.)
(Also beware: ESD testing is done either powered down / disconnected, or at best assuming a power cycle immediately afterwards. In a powered circuit, ESD is quite likely to cause CMOS latchup, resulting in high dissipation (or failure) until power cycled.)
You will also occasionally see a current limit figure, which seems contradictory (or redundant): the pin would have to be driven outside the allowed -0.3V to VDD + 0.3V range for any current to flow at all. The meaning here (and usually there's a note to this effect) is that operation up to whatever voltage these currents develop (perhaps 0.6-0.8V beyond the rail, typically?) is permissible, so long as the current stays within this rating. This may also be specified in another location (e.g. I/O characteristics rather than device abs. max.), and may be called "current injection" or something like that.
Note that SMAJ size TVS are overkill for ESD purposes, but are handy if you need surge protection as well (very long wires?), or to shunt higher currents (e.g. as part of cross-wiring fault protection?). That's not to say don't use them; they are certainly adequate for ESD. It's just that there are smaller and cheaper parts available.
Anyway, suppose you want to limit peak current, regardless -- what then? Well, just connect a series resistor between TVS and IO pin. This limits current based on TVS's peak clamping voltage and IO's allowed ESD rating. For example, an ESD pulse clamped down to ~30V, then dropped through 10 ohms, is still less destructive than 2kV HBM -- which most devices are rated for as-is.
I suppose you could cascade another protection stage (this time perhaps using schottky clamping diodes) to get the voltage fully within ratings, or very nearly, as well. But that's a lot of bother for not a lot of value.
---
As for the pushbutton, is ESD even a hazard? Consider using a 5-pin type where the metal cover plate can be grounded directly. Then ESD can never strike the signal line directly. There will still be some induced voltage/current due to sheer proximity, and some leakage along/through the ground plane*, but that's easily dealt with by the other components as mentioned. I might prefer more than 10nF in that case, but some combination of clamp diodes, filter cap, and series resistance (between switch and filter/protection, and between protection and IO) will certainly do the job.
*Assuming the existence of one as you should almost always build on a ground plane!
There's also the non-solution solution: just debounce in software. I'm quite fond of a hysteretic counting method. See:
https://github.com/T3sl4co1l/Reverb/blob/master/main.c#L182
Poll the inputs fairly frequently (you'll usually have a periodic timer interrupt or main() loop cycle of some ~ms where this can be placed). Every time an input reads 1, increment its counter up to a maximum; every time it reads 0, decrement down to a minimum. Only when the min or max has been hit, change the output state. Note this is also a digital low-pass filter of sorts, with delay (and so cutoff frequency) set by the number of counts: the output can toggle no faster than the slew rate of the filter allows. The debounced state can then be used to drive functions like hold-and-repeat (as shown in the linked code), double-click, etc.
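For reference, here's a minimal sketch of that counting method in C. This is not the linked code, just an illustration of the idea; the names `debounce_t`, `debounce_update` and the count limit `DEBOUNCE_MAX` (8 here) are assumptions of mine, so pick a limit to suit your poll rate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed count limit: the output lags a clean edge by this many polls. */
#define DEBOUNCE_MAX 8

typedef struct {
    uint8_t count;  /* saturating counter, 0..DEBOUNCE_MAX */
    bool state;     /* debounced output */
} debounce_t;

/* Call once per poll tick with the raw pin reading.
 * Returns the debounced state, which only changes when the
 * counter reaches one end of its range (hysteresis). */
static bool debounce_update(debounce_t *d, bool raw)
{
    if (raw) {
        if (d->count < DEBOUNCE_MAX) {
            d->count++;
        }
        if (d->count == DEBOUNCE_MAX) {
            d->state = true;
        }
    } else {
        if (d->count > 0) {
            d->count--;
        }
        if (d->count == 0) {
            d->state = false;
        }
    }
    return d->state;
}
```

Call `debounce_update()` from the timer tick with the pin reading, one `debounce_t` per input. At a 1ms tick and a limit of 8, a clean press registers after 8ms, and any bounce shorter than that only nudges the counter without flipping the output.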
If using a keyboard matrix, just perform a complete scan, and then evaluate each position in this way. Whether you want to do one step of a scan per tick, or a full scan quickly during a tick, is up to you; obviously you get less bandwidth in the former case (you don't get a full scan until N ticks have elapsed). Also you can sacrifice synchronicity (i.e., poll and debounce each column of the matrix per pass) to distribute the work more evenly over time.
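The "full scan, then evaluate each position" variant might look something like this in C. It's a sketch under assumptions: a 4x4 matrix, one bitmask of row readings per column, and a count limit of 5 are all made up for illustration, and the hardware scan that fills `raw_cols` is left out since it depends entirely on your wiring.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed matrix size and count limit, for illustration only. */
#define ROWS 4
#define COLS 4
#define KEY_COUNT_MAX 5

static uint8_t key_count[COLS][ROWS];  /* per-key saturating counters */
static bool key_state[COLS][ROWS];     /* per-key debounced states */

/* Call once per tick, after a complete scan.
 * raw_cols[c] holds the row readings for column c, bit r = row r. */
static void matrix_debounce(const uint8_t raw_cols[COLS])
{
    for (uint8_t c = 0; c < COLS; c++) {
        for (uint8_t r = 0; r < ROWS; r++) {
            bool raw = ((raw_cols[c] >> r) & 1u) != 0;
            uint8_t *n = &key_count[c][r];
            if (raw) {
                if (*n < KEY_COUNT_MAX) {
                    (*n)++;
                }
                if (*n == KEY_COUNT_MAX) {
                    key_state[c][r] = true;
                }
            } else {
                if (*n > 0) {
                    (*n)--;
                }
                if (*n == 0) {
                    key_state[c][r] = false;
                }
            }
        }
    }
}
```

To spread the work out instead, you could process a single column per tick; the logic per key is the same, each key just gets updated every `COLS` ticks rather than every tick.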
Anyway, the first time I developed this algorithm, I took a bare wire and dragged it across a rusty pitted plate. Contact and release were read quite comfortably despite the noise.
Probably the worst thing you can do is use pin change interrupts: worst case, radio interference spams the interrupt, throwing the CPU back into the ISR almost as soon as it returns. You definitely need analog filtering if you're using this approach.
Tim