My DPLL was not locking correctly. With it running as expected, I get a consistent 28-cycle latency, the same as with other clock sources.
In theory, Cortex-M4 has a 12-cycle latency from the context save. Then, in my case, the first instructions of the handler are:
4b06 ldr r3, [pc, #24] // Get PORT address
2210 movs r2, #16 // PB4
f8c3 209c str.w r2, [r3, #156] // Write OUTTGL
So, this would take 3-5 cycles to actually write to the peripheral.
And when running at slow speeds (12 MHz), it is actually faster if both the vector table and the handler are in flash. Placing the handler in flash saves 3 clock cycles, and placing the vector table itself in flash saves 1 more.
At fast clocks, placing everything in RAM wins by a lot.
But ultimately all this accounts for only about half of the observed latency. It is possible that the rest is just the PORT being slow.
Full measured latency is 250 ns in the best case scenario, which is about 30 clock cycles at 120 MHz.
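For anyone who wants to redo the arithmetic with their own measurements, here is a minimal sketch of the two calculations used above: converting a measured time to clock cycles, and subtracting the parts that are already explained. The function names are made up for illustration. The 12-cycle entry figure and the 3-5 cycle handler estimate are the numbers from this post, not something the code derives.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a measured time in nanoseconds to CPU clock cycles,
 * rounded to the nearest whole cycle. */
static uint32_t ns_to_cycles(uint32_t ns, uint32_t clock_hz)
{
    assert(clock_hz > 0);
    /* 64-bit intermediate so ns * clock_hz cannot overflow. */
    return (uint32_t)(((uint64_t)ns * clock_hz + 500000000u) / 1000000000u);
}

/* Cycles left over after subtracting exception entry and the
 * handler instructions executed before the pin actually toggles.
 * A negative result means the estimates exceed the measurement. */
static int32_t unexplained_cycles(uint32_t measured_cycles,
                                  uint32_t entry_cycles,
                                  uint32_t handler_cycles)
{
    return (int32_t)measured_cycles - (int32_t)(entry_cycles + handler_cycles);
}
```

With the numbers above, ns_to_cycles(250, 120000000) gives 30, and unexplained_cycles(30, 12, 5) leaves 13 cycles that neither the context save nor the three handler instructions account for.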