This should only take a single clock cycle to execute, right?
TEST_PORT->BSRRL = TEST_PIN;
No.
It usually takes a minimum of 3 instructions for an ARM to move a bit to a memory location. Something like:
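ldr r1, =TEST_PORT      @ load the port's base address
mov r2, #TEST_PIN       @ load the pin mask (TEST_PIN is already a mask in the C line above)
strh r2, [r1, #BSRRL]   @ store the mask to BSRRL, shown here as an offset from the base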
If you're lucky, the compiler might hoist the first two instructions out of your pin-toggle loop, but you can't really count on that unless you look at the code it produced.
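As a rough sketch of what "outside your loop" means in C, the loop below keeps the register addresses and the mask in locals so that only the stores remain inside the loop body. The names fake_gpio_t and toggle_n are hypothetical, and the struct is a host-side stand-in rather than the vendor's register definition; on a real part you would pass TEST_PORT from the vendor header instead.

```c
#include <stdint.h>

/* Hypothetical stand-in for a GPIO register block. The real layout
 * comes from the vendor header; only the two halves of the
 * set/reset register are modeled here. */
typedef struct {
    volatile uint16_t BSRRL;  /* writing 1 bits sets pins   */
    volatile uint16_t BSRRH;  /* writing 1 bits resets pins */
} fake_gpio_t;

/* Pulse a pin n times. The two register addresses and the mask are
 * loop-invariant, so an optimizing compiler can keep them in CPU
 * registers and emit just two stores (plus loop overhead) per pass.
 * Whether it actually does is something only the generated assembly
 * can confirm. */
void toggle_n(fake_gpio_t *port, uint16_t mask, unsigned n)
{
    volatile uint16_t *set = &port->BSRRL;
    volatile uint16_t *clr = &port->BSRRH;

    while (n--) {
        *set = mask;  /* pin high */
        *clr = mask;  /* pin low  */
    }
}
```

To see what you really got, compile with something like arm-none-eabi-gcc -O2 -S, or disassemble the finished binary with arm-none-eabi-objdump -d, and find the loop in the output.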
On top of that, GPIO is frequently located on the other side of a peripheral bus that runs at a lower speed than the main CPU (and that has to be configured, too). An STM32f403 datasheet I had handy said that it has two APB buses: a "high speed" one that runs at a maximum of 84MHz, and a lower-speed one that runs at up to 42MHz. "Flash wait states" and/or "accelerator" issues show up as well. So you might be getting close to the "order of magnitude slow" that you're seeing, even if you have the CPU clock running at full speed. You shouldn't even be theorizing without at least looking at the assembly language you end up with. (You don't have to KNOW assembly language to be able to get clues from LOOKING at it...)
(and on some chips, each peripheral might have yet another clock that needs to be configured for maximum speed.)
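To put rough numbers on that, here is a back-of-the-envelope estimate. The function name toggle_hz and the figures in the usage note are assumptions for illustration only; the real cycles-per-write count depends on the generated code, the bus clocks, and flash wait states, and has to come from the assembly or from a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate the output square-wave frequency of a pin-toggle loop.
 * One full period needs two register writes (high, then low), so the
 * frequency is the CPU clock divided by twice the cost of one write.
 * cycles_per_write is an assumed average that lumps together the
 * store, bus delays, and loop overhead. */
uint32_t toggle_hz(uint32_t cpu_hz, uint32_t cycles_per_write)
{
    assert(cycles_per_write > 0);  /* avoid dividing by zero */
    return cpu_hz / (2u * cycles_per_write);
}
```

For example, with an assumed 168MHz CPU clock and an assumed 6 cycles per write, toggle_hz(168000000u, 6u) gives 14000000, i.e. 14MHz on the pin, already more than an order of magnitude below the CPU clock.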
Checking the MCO (the clock output pin) seems like a good idea, but it might have limitations too. At least one datasheet I have says:
The selected clock to output onto MCO must not exceed 100 MHz (the maximum I/O speed).
(and then there's the depressing thought that you might misconfigure MCO...)
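One way to stay under that limit is to divide the clock before it reaches the pin. The sketch below picks the smallest divider that works; mco_divider and max_div are hypothetical names, and the range of dividers your chip's MCO prescaler actually offers is an assumption you need to check against its reference manual.

```c
#include <stdint.h>

/* Maximum MCO frequency, taken from the datasheet line quoted above. */
#define MCO_MAX_HZ 100000000u

/* Return the smallest divider in 1..max_div that brings source_hz
 * down to MCO_MAX_HZ or below, or 0 if no divider in that range is
 * enough. The comparison is done as a 64-bit multiply so that
 * integer division cannot round a too-fast clock down to the limit. */
unsigned mco_divider(uint32_t source_hz, unsigned max_div)
{
    for (unsigned d = 1; d <= max_div; d++) {
        if ((uint64_t)MCO_MAX_HZ * d >= source_hz) {
            return d;
        }
    }
    return 0;
}
```

Remember to multiply whatever you measure on the pin by the divider you chose to get back to the source clock frequency.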