I suspect one factor behind how fast GPIO can run is the peripheral clock speed. The peripheral clock for GPIOD comes from APB2, which sources its clock from AHB, a.k.a. HCLK. HCLK might be divided down from the main SYSCLK rather than running at full speed. We don't know what OP's clock configuration is, but it seems they are using the Arduino framework, so if they're running the
official WCH one, then we can possibly eliminate this as a problem: the WCH HAL code, when setting up for 48 MHz operation (regardless of whether the HSI or HSE oscillator is the source), sets HPRE, the AHB prescaler, to divide-by-1 (i.e. undivided). So OP may already have HCLK as fast as it can go.
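To put numbers on the prescaler point, here's a trivial sketch of the arithmetic. The helper name hclk_hz is made up for illustration, ahb_div is the plain divisor rather than the raw HPRE bit field, and I'm assuming APB2 adds no further division of its own.

```c
#include <stdint.h>

/* Hypothetical helper: HCLK is SYSCLK divided by the AHB (HPRE) prescaler.
 * ahb_div is the plain divisor (1, 2, 3, ...), not the raw HPRE encoding. */
uint32_t hclk_hz(uint32_t sysclk_hz, uint32_t ahb_div)
{
    return sysclk_hz / ahb_div;
}
```

With SYSCLK at 48 MHz and divide-by-1 this gives 48 MHz; a divide-by-3 setting would drop HCLK to 16 MHz, and GPIO toggling would slow down accordingly.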
1) write to GPIOD->BSHR takes more than one instruction, check the compiler output to see exactly what; but it usually involves loading the constant from memory and writing it to another address
Assuming the GPIOD base address has already been loaded into a CPU register, yes, it takes two instructions to set BSHR: one to load the literal value into a CPU register, and a second to store that value to BSHR.
li a1,0x8
sw a1,16(a0) # GPIOD base addr already in a0
However, if the compiler is being sensible, then with a simple scenario like OP's benchmark, where a pin is simply being toggled in a loop by assigning constant values to BSHR, the compiler will hoist the
li instructions out of the loop, so in effect only a single instruction (the sw) is needed to set the GPIO pin. I seem to recall there was a thread discussing this recently.
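For reference, this is roughly the kind of loop I have in mind, as a minimal sketch written against a stand-in struct so it also compiles off-target. The struct, toggle_pin, and the mask names are all made up for illustration, and the register layout is my assumption (it at least agrees with the offset 16 used for BSHR in the listing above). Bit 3 matches the 0x8 constant from that listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the GPIO register block (assumption: mirrors the on-chip
 * layout; on real hardware you would use the vendor header's GPIOD). */
typedef struct {
    volatile uint32_t CFGLR;
    volatile uint32_t CFGHR;
    volatile uint32_t INDR;
    volatile uint32_t OUTDR;
    volatile uint32_t BSHR;  /* offset 16, as in "sw a1,16(a0)" */
    volatile uint32_t BCR;
    volatile uint32_t LCKR;
} gpio_regs_t;

_Static_assert(offsetof(gpio_regs_t, BSHR) == 16, "BSHR must sit at offset 16");

#define PIN_SET_MASK   (1u << 3)         /* low half of BSHR sets the pin */
#define PIN_RESET_MASK (1u << (3 + 16))  /* high half of BSHR resets it   */

/* Pulse the pin count times. Both masks are compile-time constants, so an
 * optimising compiler is free to load them into registers once, before the
 * loop, leaving only the two stores (plus loop overhead) inside it. */
void toggle_pin(gpio_regs_t *gpio, uint32_t count)
{
    while (count--) {
        gpio->BSHR = PIN_SET_MASK;
        gpio->BSHR = PIN_RESET_MASK;
    }
}
```

Whether the constants actually end up outside the loop depends on the optimisation level, so it's worth checking the disassembly rather than taking my word for it.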
You may also want to try relocating time-critical code into the SRAM. This way you will not be incurring flash wait state penalty.
Yeah, running at 48 MHz necessitates 1 wait state for flash access.
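If OP wants to try the SRAM suggestion, it could look something like the sketch below. Big caveat: the section name ".ramfunc" and the RAMFUNC macro are hypothetical. This only works if the linker script places that section in SRAM and the startup code copies it there from flash, so check what your framework's linker script actually provides. On anything other than a RISC-V build the macro expands to nothing, so the function can still be compiled and tried on a PC.

```c
#include <stdint.h>

/* Assumption: ".ramfunc" is a hypothetical section name that the linker
 * script maps to SRAM and the startup code initialises from flash. */
#if defined(__riscv)
#define RAMFUNC __attribute__((section(".ramfunc"), noinline))
#else
#define RAMFUNC /* host build: no special placement */
#endif

/* Time-critical loop: pulses a pin n times through the given BSHR register.
 * Fetching these instructions from SRAM avoids the flash wait state. */
RAMFUNC void pulse_pin(volatile uint32_t *bshr, uint32_t set_mask, uint32_t n)
{
    const uint32_t reset_mask = set_mask << 16;  /* reset bits live in the high half */
    while (n--) {
        *bshr = set_mask;
        *bshr = reset_mask;
    }
}
```

On target the call would pass the address of the real register, e.g. pulse_pin(&GPIOD->BSHR, 0x8, 1000), assuming the vendor header declares BSHR as a volatile 32-bit field.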