This is still missing a lot of detail. How many different ports are involved? How much delay between them is acceptable, in actual numbers? What clock speed is the STM32 running at?
Something like:
__attribute__((optimize("Ofast")))
void writePorts(uint32_t valA, uint32_t valB, uint32_t valC){
    /* Three back-to-back stores, one per port output data register. */
    GPIOA->ODR = valA;
    GPIOB->ODR = valB;
    GPIOC->ODR = valC;
}
Generates:
8000168: b430 push {r4, r5}
800016a: 4b04 ldr r3, [pc, #16] ; (800017c <writePorts+0x14>)
800016c: 4d04 ldr r5, [pc, #16] ; (8000180 <writePorts+0x18>)
800016e: 4c05 ldr r4, [pc, #20] ; (8000184 <writePorts+0x1c>)
8000170: 60e8 str r0, [r5, #12]
8000172: 60e1 str r1, [r4, #12]
8000174: 60da str r2, [r3, #12]
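Once the port addresses are loaded, the three writes are consecutive STR instructions, so each port is updated just one instruction after the previous one.
IIRC an STR takes 2 CPU cycles on ARM, so the delay between adjacent ports would be ~40 ns at 50 MHz.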
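To plug in your own numbers, here is a minimal sketch of that arithmetic. The function name `port_skew_ns` and its parameters are made up for illustration, and the cycles-per-STR figure is an assumption you should check against your core and bus configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Skew in nanoseconds between the first and the last of `ports`
 * back-to-back stores, assuming every STR costs `cycles_per_str`
 * CPU cycles (an assumption: the real cost depends on the core
 * and on bus wait states). */
static uint32_t port_skew_ns(uint32_t ports, uint32_t cycles_per_str,
                             uint32_t cpu_hz)
{
    assert(ports >= 1 && cpu_hz > 0);

    /* N ports are separated by N-1 store instructions. */
    uint64_t cycles = (uint64_t)(ports - 1) * cycles_per_str;

    /* cycles / cpu_hz seconds, converted to nanoseconds. */
    return (uint32_t)((cycles * 1000000000ULL) / cpu_hz);
}
```

With 2 cycles per STR at 50 MHz, `port_skew_ns(2, 2, 50000000)` gives the 40 ns between adjacent ports, and `port_skew_ns(3, 2, 50000000)` gives 80 ns between the first and the third port.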
Otherwise, use a latch.
You can daisy-chain any number of 74HC595s, send the data as a serial stream over SPI, and update all the outputs at the same instant with the latch clock (pin 12).
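The idea can be sketched in plain C with a small software model of the chain, so it runs on a PC without hardware. Everything here is hypothetical: `CHAIN_LEN`, the `hc595_chain_t` type and the `hc595_*` functions are illustrative names, and the model works at byte granularity only. On a real board `hc595_spi_write` would be your SPI transmit call and `hc595_latch` a pulse on the GPIO wired to pin 12.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHAIN_LEN 3 /* hypothetical: three daisy-chained 74HC595s */

/* Software stand-in for the hardware chain. */
typedef struct {
    uint8_t shift[CHAIN_LEN];   /* shift registers, [0] is nearest the MCU */
    uint8_t outputs[CHAIN_LEN]; /* storage registers driving the pins */
} hc595_chain_t;

/* Model of clocking one byte in over SPI: every byte already in the
 * chain moves one chip further along, and the output pins do not change. */
static void hc595_spi_write(hc595_chain_t *c, uint8_t byte)
{
    assert(c != NULL);
    memmove(&c->shift[1], &c->shift[0], CHAIN_LEN - 1);
    c->shift[0] = byte;
}

/* Model of a latch clock pulse: all chips copy their shift register
 * to their outputs together. */
static void hc595_latch(hc595_chain_t *c)
{
    assert(c != NULL);
    memcpy(c->outputs, c->shift, CHAIN_LEN);
}

/* Update the whole chain so that values[i] appears on chip i.
 * The byte for the farthest chip has to be sent first. */
static void hc595_write_all(hc595_chain_t *c, const uint8_t values[CHAIN_LEN])
{
    assert(c != NULL && values != NULL);
    for (size_t i = CHAIN_LEN; i-- > 0; ) {
        hc595_spi_write(c, values[i]);
    }
    hc595_latch(c); /* single edge, so all outputs switch together */
}
```

The point of the split is that shifting can take as long as it likes. The pins only change on the single latch pulse, so the skew between outputs no longer depends on how fast the CPU can write its ports.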