STM32G431 - How can it be that the amount of lines of code in main() affect ISR

#50 Reply
Posted by wek on 13 Nov, 2022 21:28
Quote from: ataradov on 13 Nov, 2022 17:43
I too have never seen TCM being slower and I don't see how it is possible.
We are not talking about the TCM in the CM7, which sits on a processor bus entirely separate from the rest of the system.

CCM SRAM in STM32 denotes different things in different families/models. In the 'G4, it's a single-port RAM reachable through the processor's I and D ports and also through its S port (the latter via an aliased address region); all of these go through the common bus matrix, and it can also be a slave to either of the DMAs. Strangely enough, there's another chunk of RAM, denoted SRAM1, with exactly the same connectivity, and no explanation of how it differs from CCM SRAM.

The bus matrix arbitration is documented only as being "round robin".

JW

#51 Reply
Posted by DavidAlfa on 13 Nov, 2022 21:56
Note that some linker scripts don't separate these ram regions.
I remember the 32F429 showing "RAM: .... length=192K", so the system would use any.

At least in the 429, CCM is only connected to the D-Bus, so it can't be used for instructions nor accessed by any DMA.
Quote
The 64-Kbyte CCM (core coupled memory) data RAM is not part of the bus matrix and can be accessed only through the CPU
But placing ISR data in CCM and ISR instructions in the normal SRAM shouldn't slow it down, as they won't need to fight for access.

Not the case of the 32G431 (Totally different beast!)

#52 Reply
Posted by Siwastaja on 14 Nov, 2022 09:22
Quote from: NorthGuy on 13 Nov, 2022 17:52
Data and Code use different buses. Otherwise it would always be contention between instruction fetches and data access. See the bus matrix picture below.

This picture explains it well. The flash accelerator comes with separate DBUS and IBUS interfaces; the STM32 CCM SRAM does not, so accesses have to be arbitrated.

Who knows about the internals of ARM-provided M7 ITCM (tried 2 minutes in Google to no avail)? The bus is 64 bits but does that mean it's actually kinda-dual port, like the FLASH accelerator pictured? PC-relative literals are usually further away so 64 bit width does not help for that.

#53 Reply
Posted by Siwastaja on 14 Nov, 2022 09:35
Quote from: jnk0le on 13 Nov, 2022 20:36
Quote
The STM32G4 Series category 2 devices feature up to 32 Kbytes SRAM:
• 6 Kbytes SRAM2 (mapped at address 0x2000 4000)
• 10 Kbytes CCM SRAM (mapped at address 0x1000 0000 and end of SRAM2)

Kind of weird optimization idea:

Since PC-relative LDR can take offset of +/- 4095 bytes, which is quite a lot, and considering wek's comment "is there any difference between CCM and SRAM?" it should be possible to place timing-critical routines on the border of these memory segments so that the PC-relative literals fall into SRAM2 and code itself into CCM. Then of course place nothing else into SRAM2 & CCM. Then DBUS accesses would be to SRAM2 and IBUS instruction fetches to CCM, and no arbitration for PC-relative literals.

#54 Reply
Posted by jnk0le on 14 Nov, 2022 10:05
About optimizing the OP code

Quote
8000ed0:   20001558    .word   0x20001558
8000ed4:   20001524    .word   0x20001524
8000ed8:   2000155c    .word   0x2000155c
8000edc:   200014f8    .word   0x200014f8

8000ee4:   20001560    .word   0x20001560
8000ee8:   200000cc    .word   0x200000cc
8000eec:   200000c4    .word   0x200000c4
8000ef0:   20001554    .word   0x20001554
8000ef4:   20001550    .word   0x20001550

8000efc:   2000151c    .word   0x2000151c
8000f00:   2000154c    .word   0x2000154c
8000f04:   20001508    .word   0x20001508
8000f08:   2000150c    .word   0x2000150c
8000f0c:   20001520    .word   0x20001520
8000f10:   20001548    .word   0x20001548
8000f14:   20001544    .word   0x20001544
8000f18:   200000c8    .word   0x200000c8
8000f1c:   200000c0    .word   0x200000c0
8000f20:   20001540    .word   0x20001540
8000f24:   2000153c    .word   0x2000153c
8000f28:   200014f4    .word   0x200014f4

8000f30:   20001518    .word   0x20001518
8000f34:   20001538    .word   0x20001538
8000f38:   20001500    .word   0x20001500
8000f3c:   20001504    .word   0x20001504
8000f40:   200014fc    .word   0x200014fc
8000f44:   200000d0    .word   0x200000d0

All of those seem to be addresses of global variables. Organizing them into structures should greatly reduce those literal loads.

EDIT: many of those are also within 12 bit addw/ldr range to each other but the linkers tend to not be good at this kind of address relaxing

#55 Reply
Posted by lordnoxx on 20 Nov, 2022 21:34
Hi guys I am back. Had a little time during the weekend and worked on a simplified small code that shows the behavior of variable execution time of DMA1 ISR. Together we can now walk through the mini project and find out what is going on. But there is one thing I have to make clear right from the beginning: I am not allowed to post the c-source code of the function cli(). I can only post the disassembly of that function.

Ok here we go....this is the main.c file:

Code: [Select]
#include "stm32g4xx.h"
#include "main.h"
#include "system_setup.h"
#include "usart_char_string.h"
#include "cli.h"
#include "ISR_Handlers.h"

#define TEST 0

#define VOUT 5.0f
#define VOUT_SET ((VOUT/10)/3)*4095.0f * 0.996
#define KPV 300
#define KIV 30
#define KDV 3000

#define IOUT 3.0f
#define IOUT_SET ((IOUT*0.2308f)/3)*4095.0f * 1.0f
#define KPI 10000
#define KII 30000
#define KDI 5000

volatile uint16_t myarray[2]= {0,0};
char fooptr[16];
volatile char debug=0;

volatile unsigned int vref=0, vref_aim=VOUT_SET;
volatile int KpV=KPV, KiV=KIV, KdV=KDV;
volatile int ev=0, e1v=0, cv=0, cv_out=0;
volatile float vp=0, vi=0, vi1=0, vd=0;

volatile unsigned int iref=0, iref_aim=IOUT_SET;
volatile int KpI=KPI, KiI=KII, KdI=KDI;
volatile int ei=0, e1i=0, ci=0, ci_out=0;
volatile float ip=0, ii=0, ii1=0, id=0;

volatile int c_out;

void SysTick_Handler(void);

int main(void)
{
    setup_clock_tree_config();
    setup_GPIO_config();
    SystemCoreClockUpdate();
    clock = SystemCoreClock;
    SysTick_Config(clock/1e3); //1ms
    setup_USART_config();
    setup_Timer1_config();
    setup_ADC1_config();
    setup_interrupt_config();

    GPIOA->ODR |= (1 << GPIO_ODR_OD11_Pos);

    while(1)
    {
        cli();
        if(TEST)
        {
            delay_ms(200);
        }
    }
}

void SysTick_Handler(void)
{
    count_ticks++;
}

void delay_ms(uint32_t ms)
{
    uint32_t start=count_ticks;
    while ((count_ticks-start) < ms);
}
If you wonder about the variable names...yes....the project is all about a digital control loop for a DC/DC converter. And yes...the DMA1 ISR executes the PID algorithms for voltage and current control. This is just a side note for those of you who are interested.

Here is the code of the ISR:

Code: [Select]
void DMA1_Channel1_IRQHandler(void)
{
    if (DMA1->ISR & DMA_ISR_TCIF1)
    {
        DMA1->IFCR |= DMA_IFCR_CTCIF1;
        GPIOC->BSRR |= (1U << 14); //Pin high

        ev = (vref - myarray[0]);
        vp = KpV * ev;
        vi = KiV * ev + vi1;
        if (vi > 4100) vi = 4100;
        if (vi < 0) vi = 0;
        vi1 = vi;
        vd = KdV * (ev - e1v);
        e1v = ev;
        cv = vp + vi + vd;
        if (cv < 0) cv = 0;
        cv_out = cv;
        if (cv_out > 4000) cv_out = 4000;

        ei = (myarray[1] - iref);
        ip = KpI * ei;
        ii = KiI * ei + ii1;
        if (ii > 4100000) ii = 4100000;
        if (ii < 0) ii = 0;
        ii1 = ii;
        id = KdI * (ei - e1i);
        e1i = ei;
        ci = ip + ii + id;
        if (ci < 0) ci = 0;
        ci_out = ci / 1000;
        if (ci_out > 4000) ci_out = 4000;

        c_out = (cv_out - ci_out);
        if (c_out > 4000) c_out = 4000;
        if (c_out < 1500) c_out = 1500;

        GPIOC->BSRR |= (1U << (14 + 16)); //Pin low
    }
}
The main() function and the ISR and also the vector table reside in flash memory. Compiler optimizes for size (-Os). From the attached image you can see that the execution time is 1.842µs with cli() that gets called from main().
If I comment out cli()...

Code: [Select]
while(1)
{
    //cli();
    if(TEST)
    {
        delay_ms(200);
    }
}
...the execution time of the ISR is 1.542µs, shown in the other attached image.

Please see also the attached ZIP. It contains the elf-file and the disassembly (list-file) with the cli() function. On request I can upload an elf file and disassembly without the cli() function

STM32-CUBE-IDE_STM32G431_Minimal_CMSIS_ISR_speed_tests_with_cli-Os.zip

#56 Reply
Posted by DavidAlfa on 20 Nov, 2022 22:15
What happens if you disable the float code and add a fixed nop delay, does the ISR time vary by cli()?

Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14); //Pin high
for(uint16_t i=0;i<10000;i++)
    asm("nop");
GPIOC->BSRR |= (1U << (14 + 16)); //Pin low
Also try making a larger function:
Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14); //Pin high
for(uint16_t i=0;i<40;i++)
{
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
}
GPIOC->BSRR |= (1U << (14 + 16)); //Pin low
This way you discard a FPU HW / library issue.
Does cli use the FPU as well? Any other higher priority IRQ? Disabling the IRQ in any moment?

#57 Reply
Posted by lordnoxx on 22 Nov, 2022 09:11
Quote
What happens if you disable the float code and add a fixed nop delay, does the ISR time vary by cli()?

Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14); //Pin high
for(uint16_t i=0;i<10000;i++)
    asm("nop");
GPIOC->BSRR |= (1U << (14 + 16)); //Pin low

Tried that...Result:
Execution time of the ISR is not dependent on cli().

Quote
Also try making a larger function:
Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14); //Pin high
for(uint16_t i=0;i<40;i++)
{
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
}
GPIOC->BSRR |= (1U << (14 + 16)); //Pin low
This way you discard a FPU HW / library issue.

Tried that too...Result:
Execution time of the ISR is not dependent on cli().

Quote
Does cli use the FPU as well? Any other higher priority IRQ? Disabling the IRQ in any moment?
No, cli() has no float operations going. Also if you check the disassembly of cli() you will not find any FPU instructions. So probably the root cause is not the lazy stacking feature!?!?
No higher priority IRQs.

But...the above tests led me to test again what happens if I tell the compiler not to use the FPU at all. --> Result: Execution time of the ISR is not dependent on cli().
Hmmm...well....does this mean that the issue is nevertheless caused by the FPU?

What can be the next steps to dive deeper into this?

#58 Reply
Posted by DavidAlfa on 22 Nov, 2022 17:31
Toggle the pin between operations; maybe you'll find it's caused by one specific operation?

#59 Reply
Posted by jnk0le on 22 Nov, 2022 19:57
Put all of those in one big struct so the compiler can access them with an ldr offset instead of doing a pcrel load of the address each time.
Quote
Code: [Select]
volatile uint16_t myarray[2]= {0,0};

volatile unsigned int vref=0, vref_aim=VOUT_SET;
volatile int KpV=KPV, KiV=KIV, KdV=KDV;
volatile int ev=0, e1v=0, cv=0, cv_out=0;
volatile float vp=0, vi=0, vi1=0, vd=0;

volatile unsigned int iref=0, iref_aim=IOUT_SET;
volatile int KpI=KPI, KiI=KII, KdI=KDI;
volatile int ei=0, e1i=0, ci=0, ci_out=0;
volatile float ip=0, ii=0, ii1=0, id=0;

volatile int c_out;
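
A minimal sketch of that grouping, assuming only the voltage-loop variables are moved. The names ctl_state_t, g_ctl and voltage_loop_step are made up for illustration and are not from the OP's project. The idea is that the compiler needs one base address and can then reach every member with an immediate offset, instead of fetching a separate literal for each global:

```c
#include <stdint.h>

/* Hypothetical struct: gathers the voltage-loop globals from the quoted code. */
typedef struct {
    volatile uint16_t adc[2];            /* was myarray[] */
    volatile unsigned int vref;
    volatile int KpV, KiV, KdV;
    volatile int ev, e1v, cv, cv_out;
    volatile float vp, vi, vi1, vd;
} ctl_state_t;

/* Hypothetical single instance; gains taken from the KPV/KIV/KDV defines. */
ctl_state_t g_ctl = { .KpV = 300, .KiV = 30, .KdV = 3000 };

/* Same arithmetic as the voltage half of the ISR, but every access goes
 * through one base pointer, so members are addressed as base + offset. */
int voltage_loop_step(ctl_state_t *s)
{
    s->ev = (int)(s->vref - s->adc[0]);
    s->vp = s->KpV * s->ev;
    s->vi = s->KiV * s->ev + s->vi1;
    if (s->vi > 4100) s->vi = 4100;
    if (s->vi < 0)    s->vi = 0;
    s->vi1 = s->vi;
    s->vd = s->KdV * (s->ev - s->e1v);
    s->e1v = s->ev;
    s->cv = s->vp + s->vi + s->vd;
    if (s->cv < 0) s->cv = 0;
    s->cv_out = s->cv;
    if (s->cv_out > 4000) s->cv_out = 4000;
    return s->cv_out;
}
```

In the ISR this would be called as voltage_loop_step(&g_ctl). Whether it actually shortens the ISR on the 'G431 has to be measured; the sketch only shows the data layout.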

#60 Reply
Posted by Jeroen3 on 22 Nov, 2022 20:26
What is your clock tree like? Are you waiting on an asynchronous bus somewhere?

What happens when you run the chip at a speed that needs no wait states? E.g. 16 MHz flat?
Can you then reproduce the results?

Sidenote: GPIOC->BSRR is a write-only register, no need for |=

#61 Reply
Posted by wek on 22 Nov, 2022 22:55
The FPU is probably a red herring; with those nops you've also removed all pcrel loads, which may have an impact through alignment, as jnk0le pointed out above (and maybe others talked about it too).

To test the alignment-dependency theory, insert a single NOP to the ISR (before setting the GPIO pin) and retest, then insert one more etc.

> Sidenote: GPIOC->BSRR is a write only register, no need for |=

Same for DMA1->IFCR . Generally avoid RMW to registers as much as possible/practical. Don't set multiple bits in a single register by a series of RMW, (as you do with FLASH_ACR), perform one single write of the final value.

JW

#62 Reply
Posted by Siwastaja on 23 Nov, 2022 07:44
Quote from: wek on 22 Nov, 2022 22:55
> Sidenote: GPIOC->BSRR is a write only register, no need for |=

Same for DMA1->IFCR . Generally avoid RMW to registers as much as possible/practical.

Made that bigger; it's a pretty important side note when the OP is clearly interested in performance. BSRR, IFCR and other similar registers exist for the purpose of avoiding read-modify-write; the RMW operation is moved to the peripheral side; consider it a simple "hardware accelerator" for modifying peripheral state.
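
To illustrate the point, here is a small host-side model of what a single store to BSRR does. The type gpio_model_t and the function bsrr_write are hypothetical and exist only for this sketch; on the target the whole thing is one plain assignment, shown in the comments.

```c
#include <stdint.h>

/* Hypothetical host-side model of one GPIO port, for illustration only. */
typedef struct {
    uint32_t odr;   /* current state of the 16 output pins */
} gpio_model_t;

/* Models one store to BSRR: bits 0..15 set pins, bits 16..31 reset pins.
 * If both bits for a pin are written as 1, the set bit has priority.
 * The peripheral does the read-modify-write itself, so the CPU only
 * needs a single write, with no prior read of the register. */
void bsrr_write(gpio_model_t *port, uint32_t value)
{
    uint32_t set   = value & 0xFFFFu;
    uint32_t reset = (value >> 16) & 0xFFFFu;
    port->odr = ((port->odr & ~reset) | set) & 0xFFFFu;
}

/* On the target, the equivalent of the OP's pin toggling would be:
 *   GPIOC->BSRR = (1U << 14);         // Pin high, plain write
 *   GPIOC->BSRR = (1U << (14 + 16));  // Pin low, plain write
 * instead of the |= form, which adds a useless load before each store. */
```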

#63 Reply
Posted by DavidAlfa on 23 Nov, 2022 08:11
Why add sidenotes in a micron-size font? Just use the regular size; it's confusing, I didn't even read them as they looked like your personal signature.

Add "__attribute__((aligned(16)))" before the ISR function so it gets 16-byte (128bit) aligned.
It'll waste a little space, but nothing serious, tested it and got a 3.2% increment, from 106KB to 109.3KB.
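
A minimal sketch of what that looks like with GCC or Clang. Example_IRQHandler and is_16_byte_aligned are placeholder names for this example only; the helper just checks an address the way you would check it in the list file.

```c
#include <stdint.h>

volatile uint32_t example_counter;

/* Placeholder handler: the attribute asks the toolchain to start the
 * function on a 16-byte boundary. */
__attribute__((aligned(16)))
void Example_IRQHandler(void)
{
    example_counter++;
}

/* Returns 1 if addr is 16-byte aligned. Bit 0 is cleared first because
 * on Cortex-M a function pointer carries the Thumb bit in bit 0. */
int is_16_byte_aligned(uintptr_t addr)
{
    return ((addr & ~(uintptr_t)1) & 0xFu) == 0;
}
```

For example, an address such as 0x08000790 passes this check while 0x08000784 does not.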

What are you running in cli(), or what code is increasing the ISR time?
Tried your code inside a timer ISR, nothing seems to affect it, got steady 2.56us on a 32F411@100MHz.
The struct idea definitely made a difference, lowering the time to 1.96us.
Are you sure it's not related to the processing ending sooner/later depending on the input data?
What happens if the float input values are never updated, so it always makes the same calcs?

Also, maybe try enabling the Systick timer (1KHz), disabling everything else and running the code there, to discard something strange with the DMA ISR.

#64 Reply
Posted by wek on 23 Nov, 2022 10:47
DavidAlfa,

> The struct idea definitely made a difference, lowering the time [from 2.56us(?)] to 1.96us.

Wow, I wouldn't expect it to be that dramatic.

Can you please revert to the non-struct version, and try a couple of versions with added one/two/etc. _NOP()s before setting of GPIO pin?

Thanks,

JW

#65 Reply
Posted by DavidAlfa on 23 Nov, 2022 11:29
I'm not getting any difference by adding random code anywhere in main(), nor by aligning the code to 16 bytes:

No alignment:
08000784 <TIM3_IRQHandler>

Adding __attribute__((aligned(16))):
08000790 <TIM3_IRQHandler>

Zero difference in execution time.

Edit: It was my scope not having enough precision. Repeated in 10ns/div.

I'm adding nops before setting the pin high, the waveform should remain the same, right? But there's a small difference:
0 nop: 1.96us (="same")
1 nop: +20ns
2 nop: same
3 nop: +20ns
4 nop: same

It seems that adding an odd number of nops adds two extra execution cycles to the following instructions.
But in no way +600ns.
Flash latency is set to 3, I don't think it's a cache miss.
The F411 is way inferior to the G431 so I wouldn't expect it to perform better in any way, but who knows.

#66 Reply
Posted by wek on 23 Nov, 2022 12:43
> I'm adding nops before setting the pin high, the waveform should remain the same, right? But there's a small difference:
> 0 nop: 1.96us (="same")
> 1 nop: +20ns
> 2 nop: same
> 3 nop: +20ns
> 4 nop: same

OK but that's with the variables in struct, right? If you run at 100MHz that 20ns may be 2-3 cycles so it may be one extra FLASH read, or similar.
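
The conversion behind that estimate, as a small helper (ns_to_cycles is a name chosen for this sketch). With a 10ns/div scope reading the result is only good to about a cycle either way:

```c
#include <stdint.h>

/* Convert a measured time difference to core clock cycles:
 * cycles = delta_ns * f_MHz / 1000, rounded to the nearest cycle. */
uint32_t ns_to_cycles(uint32_t delta_ns, uint32_t core_clock_mhz)
{
    return (delta_ns * core_clock_mhz + 500u) / 1000u;
}
```

So ns_to_cycles(20, 100) gives 2 cycles for the 20ns step seen on the 'F411 at 100MHz.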

Can you please try the same with the non-struct i.e. original version? As there are more data reads from the FLASH, the difference may be more pronounced.

> The F411 is way inferior to the G431 so I wouldn't expect it to perform better in any way, but who knows.

From computing point of view the 'F411 is in no way inferior to 'G431, except the max. clock frequency (i.e. the 'F42x or 'F446 are on par with 'G431). The 'G4xx are "better" in the peripheral mix. The 'G4xx are worse in longevity, given the massive increase in peripherals' complexity was paid for by using 45nm technology (vs. the older (read: more robust) 90nm for the 'F4). It's not dramatic, yet; but the pressure is already felt.

JW

#67 Reply
Posted by DavidAlfa on 23 Nov, 2022 14:15
Strange results, but nothing out of this world.

1 nop: -20ns
2 nop: +10ns
3 nop: +10ns
4 nop: -20ns
5 nop: -20ns
6 nop: -20ns
7 nop: -20ns
8 nop: -20ns
9 nop: -20ns

asm(""): -20ns. Yes, an empty asm("").

Code: [Select]
void TIM3_IRQHandler(void)
{
  /* USER CODE BEGIN TIM3_IRQn 0 */
  __HAL_TIM_CLEAR_FLAG(&htim3,TIM_FLAG_UPDATE);
 8000784:   4b7c        ldr   r3, [pc, #496]   ; (8000978 <TIM3_IRQHandler+0x1f4>)
  LED_GPIO_Port->BSRR = LED_Pin;   //Set Pin high
 8000786:   497d        ldr   r1, [pc, #500]   ; (800097c <TIM3_IRQHandler+0x1f8>)
Code: [Select]
void TIM3_IRQHandler(void)
{
 8000784:   b5f0        push  {r4, r5, r6, r7, lr}
  /* USER CODE BEGIN TIM3_IRQn 0 */
  asm("");
  __HAL_TIM_CLEAR_FLAG(&htim3,TIM_FLAG_UPDATE);
 8000786:   4b7c        ldr   r3, [pc, #496]   ; (8000978 <TIM3_IRQHandler+0x1f4>)
  LED_GPIO_Port->BSRR = LED_Pin;   //Set Pin high
 8000788:   497c        ldr   r1, [pc, #496]   ; (800097c <TIM3_IRQHandler+0x1f8>)
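
For anyone reading the two listings side by side: the literal address in the comment column follows from the Thumb rule that a PC-relative load uses the instruction address plus 4, rounded down to a multiple of 4, as its base. A small sketch (thumb_ldr_literal_addr is a name made up for this example) that reproduces the addresses shown above:

```c
#include <stdint.h>

/* Address read by a Thumb "ldr rX, [pc, #imm]" located at insn_addr.
 * Base = (instruction address + 4) rounded down to a 4-byte boundary,
 * so two loads in the same aligned word share the same base. */
uint32_t thumb_ldr_literal_addr(uint32_t insn_addr, uint32_t imm)
{
    uint32_t base = (insn_addr + 4u) & ~3u;
    return base + imm;
}
```

This is why shifting the code by one halfword changes the offset in the second load from #500 to #496 while the literals themselves stay at 0x8000978 and 0x800097c.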

#68 Reply
Posted by wek on 23 Nov, 2022 14:34
Hummm.

Thanks, David.

JW

#69 Reply
Posted by bson on 29 Nov, 2022 19:28
Quote from: Siwastaja on 13 Nov, 2022 13:08
Sounds weird to me, because CCM/ITCM should be equally fast to that code/literal cache - no wait states. Load instructions finish as quickly as possible from ITCM, I don't see how they can be any faster; interface to the flash accelerator or cache should be of equal performance. But I could be wrong.
CCM doesn't have separate I/D buses like flash or SRAM. This means the pipeline can't overlap data accesses with code accesses when data is in CCM, but the flip side is it gives very predictable and deterministic execution times. Every access takes exactly one cycle, and can simply be added up.
