Thank you all for further input. Here's a quick update:
The Arduino toolchain does not save assembly code, but it leaves a .elf file lying around in a mysterious temp directory, where it can be disassembled using objdump:
.../arduino/tools/arm-none-eabi-gcc/4.8.3-2014q1/bin/arm-none-eabi-objdump -SC *.elf
Yep, that seems to be the way to go, thanks! Figured that out the other night, and had a look at the assembly code.
In the absence of additional indirection, handler functions on CM3 do not contain any additional prologue, because the ISR hardware matches the programmatic ABI. ISRs don't even have any designation as such; they're just normal C functions with the right names...
The assembly does show a bit of extra prologue code. The CM3 exception entry only stacks r0..r3, r12, lr, pc and xPSR, so the generated code pushes nine further registers onto the stack. That definitely explains a significant part of the extra latency.
FLASH apparently needs 5 cycles (4 Wait states) at 84MHz. There is a 2x128bit flash buffer for "acceleration", but the documentation is a bit too skimpy to figure out exactly how it will behave in any particular situation :-(
It can't be quite that bad. There are some threads where people have timed the execution (from flash) of 1000 sequential NOPs, and it comes out at 1003 cycles or so. Pipelining to the rescue, I assume. But, of course, for looking up the interrupt vector in flash and then jumping to the ISR, there would be a couple of cycles which take the full hit regarding wait states.
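To put rough numbers on this, here is a back-of-envelope sketch in C. It is only an estimate under stated assumptions, not a measurement: the function names are made up for illustration, I assume a single flash access (the vector fetch) pays the full wait states, and I assume a PUSH of N registers costs 1+N cycles, which is my reading of the Cortex-M3 instruction timings.

```c
#include <stdint.h>

/* Convert a cycle count to picoseconds at a given core clock.
 * Integer math, result is truncated. */
uint64_t cycles_to_ps(uint64_t cycles, uint64_t clock_hz)
{
    return cycles * 1000000000000ULL / clock_hz;
}

/* Rough interrupt latency estimate in cycles (hypothetical helper).
 * ASSUMPTIONS, not verified on hardware:
 *   - hw_entry_cycles:   the advertised exception entry cost (12 on CM3)
 *   - flash_wait_states: extra cycles for ONE unbuffered flash access
 *                        (the vector fetch)
 *   - pushed_regs:       registers pushed by the compiler-generated
 *                        prologue, costed as 1 + N cycles for one PUSH */
uint32_t estimate_latency_cycles(uint32_t hw_entry_cycles,
                                 uint32_t flash_wait_states,
                                 uint32_t pushed_regs)
{
    uint32_t push_cycles = pushed_regs ? 1u + pushed_regs : 0u;
    return hw_entry_cycles + flash_wait_states + push_cycles;
}
```

With 12 entry cycles, 4 wait states and 9 pushed registers, this gives 26 cycles, or roughly 310 ns at 84 MHz, versus roughly 143 ns for the bare 12 cycles. Anything beyond that in a measured latency would have to come from somewhere else.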
I would certainly hope that the CPU's vector table is in RAM once the system has booted up (it can be relocated anywhere via a vector table offset register)
I very much doubt it. RAM tends to be more precious than speed.
That part I have not figured out yet. From looking at the generated assembly, or the system libraries' source code, I can't deduce where the vector table actually lives. I probably need to do some runtime debugging to see where the vector table is located, and whether its timer interrupt vector points straight to my handler. I have found some published code which modifies the vector table at runtime, however, so I still tend to think it lives in RAM.
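For that runtime check, something along these lines might do. This is a minimal sketch, not tested on the board: the helper names are hypothetical, and the only hardware facts it relies on are the ARMv7-M memory map (Code region below 0x20000000, SRAM region from 0x20000000 to 0x3FFFFFFF), the 16 system exception slots that precede the external interrupt vectors, and bit 0 of each vector being the Thumb bit.

```c
#include <stdint.h>

/* ARMv7-M system memory map boundaries. */
#define CODE_REGION_END  0x1FFFFFFFu
#define SRAM_REGION_BASE 0x20000000u
#define SRAM_REGION_END  0x3FFFFFFFu

typedef enum { VT_IN_CODE, VT_IN_SRAM, VT_ELSEWHERE } vt_location;

/* Classify a VTOR value: does the vector table sit in the Code region
 * (flash on this chip) or in the SRAM region? */
vt_location classify_vtor(uint32_t vtor)
{
    if (vtor <= CODE_REGION_END)
        return VT_IN_CODE;
    if (vtor >= SRAM_REGION_BASE && vtor <= SRAM_REGION_END)
        return VT_IN_SRAM;
    return VT_ELSEWHERE;
}

/* Byte offset of the vector for external interrupt number irqn.
 * The first 16 entries (initial SP plus system exceptions) come first,
 * and each entry is 4 bytes. */
uint32_t vector_offset(uint32_t irqn)
{
    return (16u + irqn) * 4u;
}

/* Vector entries have bit 0 set to mark Thumb code; clear it to get
 * the address that shows up in the disassembly. */
uint32_t handler_address(uint32_t vector_entry)
{
    return vector_entry & ~1u;
}
```

On the target, the value to feed into classify_vtor() would be read from the VTOR register at 0xE000ED08 (SCB->VTOR in CMSIS). The word at VTOR + vector_offset(irqn), passed through handler_address(), can then be compared against the handler's address in the objdump listing to see whether the vector points straight at it.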
In summary, it seems that I will have to live with the higher latency, unless I dive in very deeply. (ISR in assembly to minimize the context switch overhead; potentially relocate the vector table to RAM...) I had hoped that I had just committed some easily fixable goof, which was responsible for a major delay. Given that there is no simple fix, I'll probably see whether I can arrange things so that I can live with the present latency.
Anyway, a bit of a disappointment that the advertised "12 cycles latency, and we already have taken care of the context switch for you" is not the full truth...