Regarding programmable logic, DMA, loop unrolling and other fancy stuff: may I remind folks that the OP asked a specific question about a specific loop in C. If you want to cheat, it's even easier than any of those methods to get 204MHz on an LPC4370 by presenting the basic clock on a GPIO, and I am sure the same goes for many other devices.
Disassembled version...
-O3 code on M4F, 51MHz toggle with 204MHz clock.
1400032a: 0x7219 strb r1, [r3, #8]
1400032c: 0x721a strb r2, [r3, #8]
1400032e: 0xe7fc b.n 0x1400032a <main+18>
I guess r1 holds #1 and r2 holds #0
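For anyone following along, here is a minimal sketch of the kind of C loop that could compile down to two byte stores and a branch like the listing above. This is not the OP's actual source: the function name toggle_pin, the PIN_OFFSET macro and the idea of a byte-wide pin register at offset 8 from some base pointer are all assumptions made for illustration. The loop is bounded so it can be tried on a PC; on the target it would presumably be an endless loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: a byte-wide pin register 8 bytes past some GPIO base.
   The offset only mirrors the "#8" seen in the disassembly. */
#define PIN_OFFSET 8u

/* Write 1 then 0 to the pin register, 'iterations' times.
   'volatile' stops the compiler from merging or dropping the stores. */
static void toggle_pin(volatile uint8_t *gpio_base, size_t iterations)
{
    for (size_t i = 0; i < iterations; i++) {
        gpio_base[PIN_OFFSET] = 1;  /* would match: strb r1, [r3, #8] */
        gpio_base[PIN_OFFSET] = 0;  /* would match: strb r2, [r3, #8] */
    }
}
```

On a PC you can point gpio_base at an ordinary byte array to check the logic; on real hardware it would have to be the address of the actual pin register from the chip's user manual.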
strb is 2 clock cycles on the M0 cores as well as on the M4 core.
The branch is 1 clock cycle plus the time to refill the pipeline; on the M0 core that is 3 cycles in total, and probably the same on the M4 core.
That gives you 7 cycles per loop: 2 cycles on, 5 cycles off. Not sure whether the lack of a 50% duty cycle matters to the OP.
But it seems that code will produce a 29.142857 MHz signal (204/7) at a 2/7 duty cycle (28.57%).
To get 50% you'll need to add 3 nops between the store instructions, giving you a 20.4MHz clock at 50% duty cycle.
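The arithmetic behind those figures can be checked with a couple of small helpers. This is only a sketch: the function names loop_freq_hz and loop_duty_percent are made up here, and the cycle counts fed in are the estimates from this post, not measured values.

```c
#include <assert.h>

/* Output frequency in Hz of a toggle loop that keeps the pin high for
   cycles_high core clocks and low for cycles_low core clocks. */
static double loop_freq_hz(double core_hz, unsigned cycles_high, unsigned cycles_low)
{
    unsigned total = cycles_high + cycles_low;
    assert(total > 0);  /* a loop must take at least one cycle */
    return core_hz / (double)total;
}

/* Duty cycle in percent: the share of the loop the pin spends high. */
static double loop_duty_percent(unsigned cycles_high, unsigned cycles_low)
{
    unsigned total = cycles_high + cycles_low;
    assert(total > 0);
    return 100.0 * (double)cycles_high / (double)total;
}
```

With a 204MHz core clock, loop_freq_hz(204e6, 2, 5) comes out at about 29.14MHz with loop_duty_percent(2, 5) at about 28.57%, and adding 3 nops to the high phase gives loop_freq_hz(204e6, 5, 5) = 20.4MHz at 50%. For comparison, the 51MHz toggle reported with the disassembly works out to 4 cycles per loop (204/51).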
I guess the nice thing about the LPC4370 is that it has 3 cores in total; maybe the M0 subsystem core can talk directly to the pins.
But why use a full core anyway? How fast can the chip's PWM be driven?