__attribute__((section(".ramfunc")))void pushBlock(uint16_t color, uint32_t len){
Can't see any difference in RAM usage or speed.
Are you using O2 optimization? The speed difference is dramatic over no optimization at all.
You're right. I used this method in the past on an STM32F103, where it had an interrupt for a fast signal.
Without the ramfunc it would miss some events, but with it, it worked perfectly.
I tested now in a F411 black pill board and IDE 1.6.0.
RAM starts at 0x20000000, flash at 0x08000000. Clearly, it's not being executed from RAM. I don't know why; it requires further research.
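A quick way to confirm where a function actually ended up is to compare its address against the Cortex-M memory map. This is only a sketch: `addr_in_sram` and `addr_in_code` are names I made up, and the ranges are the generic ARMv7-M Code and SRAM regions, not the exact flash and RAM sizes of the F411.

```c
#include <stdbool.h>
#include <stdint.h>

/* Generic ARMv7-M regions (assumption: not the exact F411 memory sizes). */
#define CODE_REGION_END   0x1FFFFFFFu   /* flash at 0x08000000 lives below this */
#define SRAM_REGION_START 0x20000000u
#define SRAM_REGION_END   0x3FFFFFFFu

/* Hypothetical helper: true if the address lies in the SRAM region. */
static bool addr_in_sram(uint32_t addr)
{
    return addr >= SRAM_REGION_START && addr <= SRAM_REGION_END;
}

/* Hypothetical helper: true if the address lies in the Code region (flash). */
static bool addr_in_code(uint32_t addr)
{
    return addr <= CODE_REGION_END;
}
```

On the target you would pass something like `(uint32_t)&pushBlock`. If `addr_in_sram` returns false for it, the section attribute had no effect. One thing worth checking (I haven't verified it for this project) is whether the linker script places a section with that exact name in RAM.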

Anyway, the code can be optimized a LOT. It uses a ton of calls.
Calls are bad for fast code, as they are likely to cause cache misses, forcing new code to be fetched from flash (slow).
First, the parallel port is not aligned:
#define RD_PORT GPIOA
#define RD_PIN GPIO_PIN_4
#define WR_PORT GPIOA
#define WR_PIN GPIO_PIN_3
#define CD_PORT GPIOA // RS PORT
#define CD_PIN GPIO_PIN_2 // RS PIN
#define CS_PORT GPIOA
#define CS_PIN GPIO_PIN_1
#define RESET_PORT GPIOA
#define RESET_PIN GPIO_PIN_0
#define D0_PORT GPIOB
#define D0_PIN GPIO_PIN_0
#define D1_PORT GPIOB
#define D1_PIN GPIO_PIN_1
#define D2_PORT GPIOA
#define D2_PIN GPIO_PIN_15
#define D3_PORT GPIOB
#define D3_PIN GPIO_PIN_3
#define D4_PORT GPIOB
#define D4_PIN GPIO_PIN_4
#define D5_PORT GPIOB
#define D5_PIN GPIO_PIN_5
#define D6_PORT GPIOB
#define D6_PIN GPIO_PIN_6
#define D7_PORT GPIOA
#define D7_PIN GPIO_PIN_5
Thus, the data bits have to be rearranged on every write, using individual mask-and-shift operations:
// Doing this deserves going to coder's hell.
#define write_8(d) { \
/* Reset GPIOA bits 5 (D7) and 15 (D2). Same as (D7_PIN | D2_PIN) << 16, */ \
/* which at least lets you easily see wtf it's doing. */ \
GPIOA->BSRR = 0b1000000000100000u << 16; \
/* Reset GPIOB bits 0 (D0), 1 (D1), 3 (D3), 4 (D4), 5 (D5), 6 (D6). */ \
GPIOB->BSRR = 0b0000000001111011u << 16; \
/* Load D2 (PA15) and D7 (PA5). */ \
GPIOA->BSRR = (((d) & (1<<2)) << 13) \
| (((d) & (1<<7)) >> 2); \
/* Load D0, D1, D3, D4, D5, D6 (same bit positions on GPIOB). */ \
GPIOB->BSRR = (((d) & (1<<0)) << 0) \
| (((d) & (1<<1)) << 0) \
| (((d) & (1<<3)) << 0) \
| (((d) & (1<<4)) << 0) \
| (((d) & (1<<5)) << 0) \
| (((d) & (1<<6)) << 0); \
}
This is the assembly result with O2 optimization: 17 instructions just to load the 8-bit value into the port!
push {r4, r5}
lsls r3, r0, #13
asrs r2, r0, #2
ldr r4, [pc, #36] ; (0x8000710 <write+48>)
ldr r1, [pc, #40] ; (0x8000714 <write+52>)
ldr r5, [pc, #40] ; (0x8000718 <write+56>)
str r5, [r4, #24]
and.w r2, r2, #32
and.w r3, r3, #32768 ; 0x8000
orrs r3, r2
and.w r0, r0, #123 ; 0x7b
mov.w r2, #8060928 ; 0x7b0000
str r2, [r1, #24]
str r3, [r4, #24]
str r0, [r1, #24]
pop {r4, r5}
bx lr
Doing the same, but setting and resetting in a single write, takes 14 instructions:
GPIOA->BSRR = (0b1000000000100000 << 16)
| (((d) & (1<<2)) << 13) | (((d) & (1<<7)) >> 2);
GPIOB->BSRR = (0b0000000001111011 << 16)
| (((d) & (1<<0)) << 0)
| (((d) & (1<<1)) << 0)
| (((d) & (1<<3)) << 0)
| (((d) & (1<<4)) << 0)
| (((d) & (1<<5)) << 0)
| (((d) & (1<<6)) << 0);
lsls r3, r0, #13
asrs r2, r0, #2
and.w r2, r2, #32
and.w r3, r3, #32768 ; 0x8000
orrs r3, r2
ldr r1, [pc, #24] ; (0x800070c <write+44>)
ldr r2, [pc, #28] ; (0x8000710 <write+48>)
orr.w r3, r3, #2147483648 ; 0x80000000
and.w r0, r0, #123 ; 0x7b
orr.w r3, r3, #2097152 ; 0x200000
orr.w r0, r0, #8060928 ; 0x7b0000
str r3, [r1, #24]
str r0, [r2, #24]
bx lr
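To sanity-check the shifts without hardware, the two BSRR words can be computed on a PC. A minimal sketch, assuming the pin mapping listed above; `bsrr_word_a` and `bsrr_word_b` are hypothetical helper names, not part of the library.

```c
#include <stdint.h>

/* Data-pin masks per port; shifted left by 16 they form the reset half of BSRR. */
#define PA_DATA_MASK 0x8020u   /* PA15 (D2), PA5 (D7) */
#define PB_DATA_MASK 0x007Bu   /* PB0, PB1, PB3..PB6 (D0, D1, D3..D6) */

/* Hypothetical helper: combined reset+set word for GPIOA->BSRR. */
static uint32_t bsrr_word_a(uint8_t d)
{
    uint32_t set = (((uint32_t)d & (1u << 2)) << 13)   /* D2 -> PA15 */
                 | (((uint32_t)d & (1u << 7)) >> 2);   /* D7 -> PA5  */
    return ((uint32_t)PA_DATA_MASK << 16) | set;
}

/* Hypothetical helper: combined reset+set word for GPIOB->BSRR. */
static uint32_t bsrr_word_b(uint8_t d)
{
    uint32_t set = (uint32_t)d & PB_DATA_MASK;         /* bits keep their position */
    return ((uint32_t)PB_DATA_MASK << 16) | set;
}
```

The constants line up with the listing: 0x7b and 0x7b0000 for GPIOB, and 0x80000000 | 0x200000 for the GPIOA reset half.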
Aligning the port would make it much faster.
For example, using PA0-PA7 for data takes only 5 instructions:
GPIOA->BSRR = 0xFF << 16; // Reset Data pins
GPIOA->BSRR = d; // Load data
ldr r3, [pc, #12] ; (0x80006f0 <write+16>)
mov.w r2, #16711680 ; 0xff0000
str r2, [r3, #24]
str r0, [r3, #24]
bx lr
Going further, resetting and setting the pins in a single write takes only 4 instructions:
GPIOA->BSRR = (0xFF << 16) | d ; // Reset and load data pins in one write
ldr r3, [pc, #12] ; (0x80006f0 <write+16>)
orr.w r0, r0, #16711680 ; 0xff0000
str r0, [r3, #24]
bx lr
Avoiding the call would be even faster: inlining removes the call instruction at the call site and the return instruction (bx) from the write routine.
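One way to drop the call is to force the write routine inline. This is a sketch, not the library's code: `write_8_inline` is a hypothetical name, and `fake_bsrr` is a stand-in variable so it builds on a PC; on the MCU it would be `GPIOA->BSRR` with data on PA0-PA7.

```c
#include <stdint.h>

/* Stand-in for GPIOA->BSRR so the sketch builds off-target (assumption). */
static volatile uint32_t fake_bsrr;

/* always_inline is a GCC/Clang attribute that forces inlining, so no
   call/return pair is emitted for this routine. */
static inline __attribute__((always_inline)) void write_8_inline(uint8_t d)
{
    fake_bsrr = (0xFFu << 16) | d;   /* clear PA0-PA7 and set the bits of d in one store */
}
```

With this, each use should compile down to the `orr` and `str` from the listing above, plus loading the register address if it isn't already in a register.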
According to the datasheet, setting and resetting in the same write is OK, as the SET bits have priority over the RESET bits.
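The effect of that priority rule can be modelled on a PC. A minimal sketch; `apply_bsrr` is a hypothetical function that mimics what a single BSRR write does to the 16-bit output register, it is not an HAL call.

```c
#include <stdint.h>

/* Hypothetical model of one BSRR write applied to a 16-bit output register.
   Bits 0-15 of bsrr set pins, bits 16-31 reset pins, and set wins when both
   are given for the same pin. */
static uint16_t apply_bsrr(uint16_t odr, uint32_t bsrr)
{
    uint16_t set   = (uint16_t)(bsrr & 0xFFFFu);
    uint16_t reset = (uint16_t)(bsrr >> 16);
    /* Apply the reset bits first, then OR in the set bits so they take priority. */
    return (uint16_t)((odr & (uint16_t)~reset) | set);
}
```

So writing `(0xFF << 16) | d` always leaves the low byte equal to `d`, whatever was there before, and the upper pins are untouched.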

And that's only a small analysis...
