ST claims their prefetch and cache system delivers performance equivalent to zero-wait-state flash.
If you read into how ST implemented their prefetch system, you can see it is really just a pitifully small cache with a stupid replacement policy. It results in multiple cache miss penalties for every function call, every loop iteration and, heaven forbid, every constant operand fetch on Cortex-M0/M23 parts. When a cache falls over on basic memory-intensive C library functions like memcpy, memset and strlen, it is a failed cache system. Cortex-M3 and above just have uniformly slow constant fetches.
GD's shadow RAM solution is basically a huge cache whose contents never need to be replaced, giving you true and consistent zero wait states throughout the run.
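To put rough numbers on the thrashing argument, here is a toy model in C. It is only a sketch: the fully associative layout, the FIFO replacement, the line count and the line size are all hypothetical and do not describe any specific ST part, and `toy_cache` and `count_loop_misses` are names made up for illustration. It just counts how many line fills a loop body costs when it is fetched repeatedly.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_LINES 64

/* Toy fully associative instruction cache with FIFO replacement.
 * All parameters are hypothetical, not taken from any datasheet. */
typedef struct {
    uint32_t tag[MAX_LINES];
    int      valid[MAX_LINES];
    size_t   n_lines;     /* number of lines in use, 1..MAX_LINES */
    uint32_t line_bytes;  /* bytes per cache line */
    size_t   next;        /* FIFO pointer: next line to replace */
} toy_cache;

static void toy_cache_init(toy_cache *c, size_t n_lines, uint32_t line_bytes)
{
    if (n_lines < 1) n_lines = 1;
    if (n_lines > MAX_LINES) n_lines = MAX_LINES;
    if (line_bytes < 2) line_bytes = 2;
    c->n_lines = n_lines;
    c->line_bytes = line_bytes;
    c->next = 0;
    for (size_t i = 0; i < MAX_LINES; i++) c->valid[i] = 0;
}

/* Returns 1 on a miss (the line is then filled), 0 on a hit. */
static int toy_cache_access(toy_cache *c, uint32_t addr)
{
    uint32_t tag = addr / c->line_bytes;
    for (size_t i = 0; i < c->n_lines; i++) {
        if (c->valid[i] && c->tag[i] == tag) return 0;
    }
    c->tag[c->next] = tag;
    c->valid[c->next] = 1;
    c->next = (c->next + 1) % c->n_lines;
    return 1;
}

/* Fetch a straight-line loop body of loop_bytes, starting at start,
 * one 16-bit instruction at a time, for the given number of iterations.
 * Returns the total number of line fills (misses). */
static unsigned long count_loop_misses(size_t n_lines, uint32_t line_bytes,
                                       uint32_t start, uint32_t loop_bytes,
                                       unsigned iterations)
{
    toy_cache c;
    unsigned long misses = 0;
    toy_cache_init(&c, n_lines, line_bytes);
    for (unsigned it = 0; it < iterations; it++) {
        for (uint32_t off = 0; off < loop_bytes; off += 2) {
            misses += (unsigned long)toy_cache_access(&c, start + off);
        }
    }
    return misses;
}
```

With 4 lines of 16 bytes, a 64-byte loop run 10 times costs 4 fills in total, but an 80-byte loop costs 50, because each line is evicted just before it is needed again. A shadow RAM that holds the whole image has no such cliff, since nothing is ever evicted.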
Faster than with ST, whose chips are available? I don't understand.
As of now nobody's chips are available, at least within China. GD has redirected virtually all its chips to its domestic market in China, while ST is dealing with uncertainties caused by geopolitical tensions in selling into China, which for ST is a foreign market. Before this chip shortage, GD's products were widely available worldwide, especially through LCSC.
To succeed in business, against an established incumbent, you need to offer something special.
With their pin-compatible portfolio, their products have at least a slightly faster, revised main CPU core. For example, GD32F103 uses a 96MHz Cortex-M3 r2p1 core, GD32E103 uses a 120MHz Cortex-M4F core and GD32VF103 uses a 108MHz RISC-V core, while STM32F103 uses a 72MHz Cortex-M3 r1p1 core. There is also, as above, their shadow RAM architecture. Then they have their non-pin-compatible portfolio.
The copy is done in hardware. For the user it is totally transparent. You don't even have write access to that SRAM; it just appears as normal flash, only very fast.
IMO they really should implement a bit in their shadow RAM interface that allows the shadow RAM to be written while disabling in-application programming of the Flash until the next reset. This means applications that do not need IAP could just put their writable data sections and heap allocations in the Flash address space, instead of wasting time on doing another memcpy from the shadow RAM into the main RAM. Even better if that shadow RAM write control were available on a per-Flash-sector basis.
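For context, the "another memcpy" here is the usual startup copy of initialized data from its load address in the Flash address space to its run address in RAM. A minimal sketch of that step, assuming word-aligned sections (`copy_data_section` is a made-up name for illustration, not taken from any vendor's startup file):

```c
#include <stdint.h>

/* Copy initialized data word by word from its load address (in the
 * Flash address space) to its run address in RAM.
 * Assumes all three pointers are word aligned. */
static void copy_data_section(const uint32_t *load,
                              uint32_t *run_start,
                              const uint32_t *run_end)
{
    uint32_t *dst = run_start;
    while (dst < run_end) {
        *dst++ = *load++;
    }
}
```

In a real reset handler the three arguments would come from linker-script symbols. Some GCC startup files call them `_sidata`, `_sdata` and `_edata`, but the names vary by toolchain, so treat those as an assumption. With a writable shadow RAM as proposed above, this copy could be skipped for sections left in the Flash address space.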
Note that on GD, the flash is actually a second die in the same package. The code for the first 512 KB is loaded from that flash into the shadow SRAM on reset. But if you have to hit the flash itself for data or code, it would be pretty slow.
That second die is really just a QSPI Flash and their Flash interface is just a QSPI controller with hard-coded parameters. GD's newer chips now do use integrated on-die Flash, yet they are still including their famous shadow RAM since it brings performance benefits.
There is this brand called Artery that also makes STM32 pin-compatible microcontrollers using this exact architecture; they even make one of their built-in hard-coded QSPI controllers accessible over the pins.