I too have never seen TCM being slower and I don't see how it is possible.
We are not talking about the TCM in the CM7, which sits on a processor bus entirely separate from the rest of the system.
CCM SRAM in STM32 denotes different things in different families/models. In the 'G4 it's a single-port RAM accessed through the I and D ports of the processor, but also through its S port (the latter through a different address-region alias) and then through the common busmatrix, and it can also be a slave to either of the DMAs. Strangely enough, there's another chunk of RAM, denoted SRAM1, with exactly the same connectivity, and no explanation of how it differs from CCM SRAM.
The arbitration of the busmatrix is not documented beyond the statement that it is "round robin".
JW
Note that some linker scripts don't separate these ram regions.
I remember the 32F429 showing "RAM: .... length=192K", so the system would use any.
At least in the 429, CCM is only connected to the D-Bus, thus it can't be used for instructions, nor accessed by any DMA.
The 64-Kbyte CCM (core coupled memory) data RAM is not part of the bus matrix and can be accessed only through the CPU
But placing ISR data in CCM and ISR instructions in the normal SRAM shouldn't slow it down, as they won't need to fight for access.
Not the case of the 32G431 (Totally different beast!)
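A minimal sketch of how ISR-only data might be placed in CCM with GCC. The section name ".ccmram", the IN_CCM macro and isr_state_t are assumptions for illustration, not taken from any post here; the linker script has to provide a matching output section and the startup code has to initialise it. On parts where DMA cannot reach CCM, DMA buffers must stay in normal SRAM.

```c
#include <stdint.h>

/* ".ccmram" is an assumed section name: the linker script must map it
 * into the CCM region. On a non-ARM host build the macro is empty, so
 * the same file still compiles and runs for testing. */
#if defined(__arm__)
#define IN_CCM __attribute__((section(".ccmram")))
#else
#define IN_CCM
#endif

/* Hypothetical ISR-private state: touched only by the CPU, never by DMA. */
typedef struct {
    int32_t error_prev;
    int32_t integral;
} isr_state_t;

IN_CCM static isr_state_t isr_state;

/* Accumulate one error sample; stands in for the real ISR work. */
int32_t isr_accumulate(int32_t error)
{
    isr_state.integral  += error;
    isr_state.error_prev = error;
    return isr_state.integral;
}
```

Checking the map file afterwards shows whether the variable really landed in the CCM address range.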
Data and Code use different buses. Otherwise it would always be contention between instruction fetches and data access. See the bus matrix picture below.
This picture explains it well. Flash accelerator comes with separate DBUS and IBUS interfaces, STM32 CCM SRAM does not but has to arbitrate.
Who knows about the internals of ARM-provided M7 ITCM (tried 2 minutes in Google to no avail)? The bus is 64 bits but does that mean it's actually kinda-dual port, like the FLASH accelerator pictured? PC-relative literals are usually further away so 64 bit width does not help for that.
The STM32G4 Series category 2 devices feature up to 32 Kbytes SRAM:
• 16 Kbytes SRAM1 (mapped at address 0x2000 0000)
• 6 Kbytes SRAM2 (mapped at address 0x2000 4000)
• 10 Kbytes CCM SRAM (mapped at address 0x1000 0000 and end of SRAM2)
Kind of weird optimization idea:
Since PC-relative LDR can take offset of +/- 4095 bytes, which is quite a lot, and considering wek's comment "is there any difference between CCM and SRAM?" it should be possible to place timing-critical routines on the border of these memory segments so that the PC-relative literals fall into SRAM2 and code itself into CCM. Then of course place nothing else into SRAM2 & CCM. Then DBUS accesses would be to SRAM2 and IBUS instruction fetches to CCM, and no arbitration for PC-relative literals.
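Whether a given literal is in range of a PC-relative load can be checked on paper or with a small helper like the one below. This is only a sketch: the function name is made up, and it models the 32-bit Thumb-2 LDR (literal) encoding, where the base is the instruction address plus 4, rounded down to a multiple of 4, and the offset is at most 4095 bytes in either direction. The 16-bit encoding only reaches forward, 0 to 1020 bytes.

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true if a 32-bit Thumb-2 "LDR Rt, [PC, #+/-imm12]" located at
 * insn_addr can reach a literal stored at literal_addr. */
bool ldr_literal_reachable(uint32_t insn_addr, uint32_t literal_addr)
{
    uint32_t base = (insn_addr + 4u) & ~3u;            /* Align(PC, 4) */
    int64_t  off  = (int64_t)literal_addr - (int64_t)base;
    return off >= -4095 && off <= 4095;
}
```

The compiler and assembler decide where literal pools go, so the placement still has to be verified in the disassembly.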
About optimizing the OP code
8000ed0: 20001558 .word 0x20001558
8000ed4: 20001524 .word 0x20001524
8000ed8: 2000155c .word 0x2000155c
8000edc: 200014f8 .word 0x200014f8
8000ee4: 20001560 .word 0x20001560
8000ee8: 200000cc .word 0x200000cc
8000eec: 200000c4 .word 0x200000c4
8000ef0: 20001554 .word 0x20001554
8000ef4: 20001550 .word 0x20001550
8000efc: 2000151c .word 0x2000151c
8000f00: 2000154c .word 0x2000154c
8000f04: 20001508 .word 0x20001508
8000f08: 2000150c .word 0x2000150c
8000f0c: 20001520 .word 0x20001520
8000f10: 20001548 .word 0x20001548
8000f14: 20001544 .word 0x20001544
8000f18: 200000c8 .word 0x200000c8
8000f1c: 200000c0 .word 0x200000c0
8000f20: 20001540 .word 0x20001540
8000f24: 2000153c .word 0x2000153c
8000f28: 200014f4 .word 0x200014f4
8000f30: 20001518 .word 0x20001518
8000f34: 20001538 .word 0x20001538
8000f38: 20001500 .word 0x20001500
8000f3c: 20001504 .word 0x20001504
8000f40: 200014fc .word 0x200014fc
8000f44: 200000d0 .word 0x200000d0
All of those seem to be addresses of global variables. Organizing them in structures should greatly reduce the number of literal loads.
EDIT: many of those are also within the 12-bit addw/ldr offset range of each other, but linkers tend not to be good at this kind of address relaxation.
Hi guys, I am back. I had a little time during the weekend and worked on a small, simplified piece of code that shows the variable execution time of the DMA1 ISR. Together we can now walk through the mini project and find out what is going on. But there is one thing I have to make clear right from the beginning: I am not allowed to post the C source code of the function cli(). I can only post the disassembly of that function.
Ok here we go....this is the main.c file:
#include "stm32g4xx.h"
#include "main.h"
#include "system_setup.h"
#include "usart_char_string.h"
#include "cli.h"
#include "ISR_Handlers.h"
#define TEST 0
#define VOUT 5.0f
#define VOUT_SET ((VOUT/10)/3)*4095.0f * 0.996
#define KPV 300
#define KIV 30
#define KDV 3000
#define IOUT 3.0f
#define IOUT_SET ((IOUT*0.2308f)/3)*4095.0f * 1.0f
#define KPI 10000
#define KII 30000
#define KDI 5000
volatile uint16_t myarray[2]= {0,0};
char fooptr[16];
volatile char debug=0;
volatile unsigned int vref=0, vref_aim=VOUT_SET;
volatile int KpV=KPV, KiV=KIV, KdV=KDV;
volatile int ev=0, e1v=0, cv=0, cv_out=0;
volatile float vp=0, vi=0, vi1=0, vd=0;
volatile unsigned int iref=0, iref_aim=IOUT_SET;
volatile int KpI=KPI, KiI=KII, KdI=KDI;
volatile int ei=0, e1i=0, ci=0, ci_out=0;
volatile float ip=0, ii=0, ii1=0, id=0;
volatile int c_out;
void SysTick_Handler(void);
int main(void)
{
setup_clock_tree_config();
setup_GPIO_config();
SystemCoreClockUpdate();
clock = SystemCoreClock;
SysTick_Config(clock/1e3); //1ms
setup_USART_config();
setup_Timer1_config();
setup_ADC1_config();
setup_interrupt_config();
GPIOA->ODR |= (1 << GPIO_ODR_OD11_Pos);
while(1)
{
cli();
if(TEST)
{
delay_ms(200);
}
}
}
void SysTick_Handler(void)
{
count_ticks++;
}
void delay_ms(uint32_t ms)
{
uint32_t start=count_ticks;
while ((count_ticks-start) < ms);
}
If you wonder about the variable names... yes, the project is all about a digital control loop for a DC/DC converter. And yes, the DMA1 ISR executes the PID algorithms for voltage and current control. This is just a side note for those of you who are interested.
Here is the code of the ISR:
void DMA1_Channel1_IRQHandler(void) {
if (DMA1->ISR & DMA_ISR_TCIF1)
{
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14); //Pin high
ev = (vref - myarray[0]);
vp = KpV * ev;
vi = KiV * ev + vi1;
if (vi > 4100) vi = 4100;
if (vi < 0) vi = 0;
vi1 = vi;
vd = KdV * (ev - e1v);
e1v = ev;
cv = vp + vi + vd;
if (cv < 0) cv = 0;
cv_out = cv;
if (cv_out > 4000) cv_out = 4000;
ei = (myarray[1] - iref);
ip = KpI * ei;
ii = KiI * ei + ii1;
if (ii > 4100000) ii = 4100000;
if (ii < 0) ii = 0;
ii1 = ii;
id = KdI * (ei - e1i);
e1i = ei;
ci = ip + ii + id;
if (ci < 0) ci = 0;
ci_out = ci / 1000;
if (ci_out > 4000) ci_out = 4000;
c_out = (cv_out - ci_out);
if (c_out > 4000) c_out = 4000;
if (c_out < 1500) c_out = 1500;
GPIOC->BSRR |= (1U << (14 + 16)); //Pin low
}
}
The main() function and the ISR and also the vector table reside in flash memory. Compiler optimizes for size (-Os). From the attached image you can see that the execution time is 1.842µs with cli() that gets called from main().
If I comment out cli()...
while(1)
{
//cli();
if(TEST)
{
delay_ms(200);
}
}
...the execution time of the ISR is 1.542µs, shown in the other attached image.
Please see also the attached ZIP. It contains the elf-file and the disassembly (list-file) with the cli() function. On request I can upload an elf file and disassembly without the cli() function
What happens if you disable the float code and add a fixed nop delay, does the ISR time vary by cli()?
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14); //Pin high
for(uint16_t i=0;i<10000;i++) asm("nop");
GPIOC->BSRR |= (1U << (14 + 16)); //Pin low
Also try making a larger function:
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14); //Pin high
for(uint16_t i=0;i<40;i++) {
asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
}
GPIOC->BSRR |= (1U << (14 + 16)); //Pin low
This way you discard a FPU HW / library issue.
Does cli use the FPU as well? Any other higher priority IRQ? Disabling the IRQ in any moment?
> What happens if you disable the float code and add a fixed nop delay, does the ISR time vary by cli()?
> DMA1->IFCR |= DMA_IFCR_CTCIF1;
> GPIOC->BSRR |= (1U << 14); //Pin high
> for(uint16_t i=0;i<10000;i++) asm("nop");
> GPIOC->BSRR |= (1U << (14 + 16)); //Pin low
Tried that...Result:
Execution time of the ISR is not dependent on cli().
> Also try making a larger function:
> [...]
> This way you discard a FPU HW / library issue.
Tried that too...Result:
Execution time of the ISR is not dependent on cli().
> Does cli use the FPU as well? Any other higher priority IRQ? Disabling the IRQ in any moment?
No, cli() performs no float operations. Also, if you check the disassembly of cli() you will not find any FPU instructions. So probably the root cause is not the lazy stacking feature!?
No higher priority IRQs.
But... the above tests led me to test again what happens if I tell the compiler not to use the FPU at all. Result: execution time of the ISR is not dependent on cli().
Hmmm... well... does this mean that the issue is nevertheless caused by the FPU?
What can be the next steps to dive deeper into this?
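One way to take lazy stacking in or out of the picture is to look at the FPCCR register (0xE000EF34 on the Cortex-M4F), where ASPEN is bit 31 and LSPEN is bit 30; both are set after reset. The helpers below are only a sketch with made-up names: they work on a plain register value so they can be tried on a PC, and whether lazy stacking is actually involved here is not established.

```c
#include <stdint.h>
#include <stdbool.h>

#define FPCCR_ASPEN (1u << 31)  /* automatic FP state preservation */
#define FPCCR_LSPEN (1u << 30)  /* lazy FP state preservation */

/* True if the value describes automatic stacking with lazy preservation. */
static inline bool fp_lazy_stacking_enabled(uint32_t fpccr)
{
    uint32_t both = FPCCR_ASPEN | FPCCR_LSPEN;
    return (fpccr & both) == both;
}

/* Same register value with lazy preservation switched off. */
static inline uint32_t fpccr_without_lazy(uint32_t fpccr)
{
    return fpccr & ~FPCCR_LSPEN;
}

/* On the target, with the CMSIS core header (assumed to be available):
 *   FPU->FPCCR = fpccr_without_lazy(FPU->FPCCR);
 * followed by __DSB(); __ISB(); before the first interrupt is enabled. */
```

With LSPEN cleared, the FP registers are pushed at exception entry rather than at the first FP instruction inside the handler, so comparing both settings shows whether the extra time moves out of the measured window.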
Toggle the pin between operations, maybe you find it's happening by some specific operation?
Put all of those in one big struct so the compiler can access them with an ldr offset from a single base register instead of doing a pcrel load of each address every time:
volatile uint16_t myarray[2]= {0,0};
volatile unsigned int vref=0, vref_aim=VOUT_SET;
volatile int KpV=KPV, KiV=KIV, KdV=KDV;
volatile int ev=0, e1v=0, cv=0, cv_out=0;
volatile float vp=0, vi=0, vi1=0, vd=0;
volatile unsigned int iref=0, iref_aim=IOUT_SET;
volatile int KpI=KPI, KiI=KII, KdI=KDI;
volatile int ei=0, e1i=0, ci=0, ci_out=0;
volatile float ip=0, ii=0, ii1=0, id=0;
volatile int c_out;
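A sketch of what the struct version could look like for the voltage half of the loop. The type name vloop_t, the function vloop_step() and passing the ADC sample as a parameter are illustrative choices, not the OP's code; the arithmetic follows the posted ISR.

```c
#include <stdint.h>

/* Hypothetical grouping of the voltage-loop globals into one object, so
 * the compiler needs one base address and small constant offsets. */
typedef struct {
    uint32_t vref;
    int32_t  KpV, KiV, KdV;
    int32_t  ev, e1v, cv, cv_out;
    float    vp, vi, vi1, vd;
} vloop_t;

volatile vloop_t vloop = { .vref = 0, .KpV = 300, .KiV = 30, .KdV = 3000 };

/* Same arithmetic as the voltage part of the ISR, on one base pointer. */
int32_t vloop_step(volatile vloop_t *s, uint16_t adc)
{
    s->ev  = (int32_t)(s->vref - adc);
    s->vp  = (float)(s->KpV * s->ev);
    s->vi  = (float)(s->KiV * s->ev) + s->vi1;
    if (s->vi > 4100.0f) s->vi = 4100.0f;
    if (s->vi < 0.0f)    s->vi = 0.0f;
    s->vi1 = s->vi;
    s->vd  = (float)(s->KdV * (s->ev - s->e1v));
    s->e1v = s->ev;
    s->cv  = (int32_t)(s->vp + s->vi + s->vd);
    if (s->cv < 0) s->cv = 0;
    s->cv_out = s->cv;
    if (s->cv_out > 4000) s->cv_out = 4000;
    return s->cv_out;
}
```

In the ISR this would be called as vloop_step(&vloop, myarray[0]). Keeping every member volatile still forces a load or store per access; only the repeated address loads go away.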
What is your clock tree like? Are you waiting on an asynchronous bus somewhere?
What happens when you run the chip at a speed that needs no wait states? E.g. 16 MHz flat?
Can you then reproduce the results?
Sidenote: GPIOC->BSRR is a write only register, no need for |=
The FPU is probably a red herring: with those nops you've also removed all pcrel loads, which may have an impact through alignment, as jnk0le pointed out above (and maybe others talked about it too).
To test the alignment-dependency theory, insert a single NOP into the ISR (before setting the GPIO pin) and retest, then insert one more, etc.
> Sidenote: GPIOC->BSRR is a write only register, no need for |=
Same for DMA1->IFCR . Generally avoid RMW to registers as much as possible/practical. Don't set multiple bits in a single register by a series of RMW, (as you do with FLASH_ACR), perform one single write of the final value.
JW
> Sidenote: GPIOC->BSRR is a write only register, no need for |=
> Same for DMA1->IFCR . Generally avoid RMW to registers as much as possible/practical.
I made that bigger; it's a pretty important side note when the OP is clearly interested in performance. BSRR, IFCR and other similar registers exist for the purpose of avoiding read-modify-write; the RMW operation is moved to the peripheral side; consider it a simple "hardware accelerator" for modifying peripheral state.
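For illustration, the pin toggling from the ISR written as plain stores. The two helper names are made up; the bit layout is the one the OP's code already relies on (bits 0..15 of BSRR set a pin, bits 16..31 reset it).

```c
#include <stdint.h>

/* Value to write to GPIOx->BSRR to drive a pin high. */
static inline uint32_t bsrr_set_mask(unsigned pin)
{
    return 1u << pin;
}

/* Value to write to GPIOx->BSRR to drive a pin low. */
static inline uint32_t bsrr_reset_mask(unsigned pin)
{
    return 1u << (pin + 16u);
}

/* On the target, with the device header from the OP's project:
 *   DMA1->IFCR  = DMA_IFCR_CTCIF1;       // clear flag, no read needed
 *   GPIOC->BSRR = bsrr_set_mask(14);     // pin high
 *   GPIOC->BSRR = bsrr_reset_mask(14);   // pin low
 */
```

Each line is then a single store instead of a load, an OR and a store across the peripheral bus.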
Why add side notes in micron-size font? Just use the regular size; it's confusing, and I didn't even read them because they looked like your personal signature.
Add "__attribute__((aligned(16)))" before the ISR function so it gets 16-byte (128bit) aligned.
It'll waste a little space, but nothing serious: I tested it and got a 3.2% increase, from 106KB to 109.3KB.
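A small sketch of the attribute in use with GCC or Clang. demo_handler is a placeholder name; on the target it would be the real handler, e.g. DMA1_Channel1_IRQHandler. The helper only exists so the effect can be checked.

```c
#include <stdint.h>

/* Request 16-byte alignment for the first instruction of the function. */
__attribute__((aligned(16))) void demo_handler(void)
{
    /* ISR body goes here */
}

/* Returns 1 if the entry address is 16-byte aligned. Bit 0 is masked
 * off because Thumb function pointers have it set. */
int handler_is_aligned16(void)
{
    uintptr_t addr = (uintptr_t)&demo_handler & ~(uintptr_t)1;
    return (addr & 15u) == 0;
}
```

The address shown for the handler in the list file should then end in 0.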
What are you running in cli(), or what code is increasing the ISR time?
Tried your code inside a timer ISR, nothing seems to affect it, got steady 2.56us on a 32F411@100MHz.
The struct idea definitely made a difference, lowering the time to 1.96us.
Are you sure it's not related to the processing ending sooner/later depending on the input data?
What happens if the float input values are never updated, so it always makes the same calcs?
Also, maybe try enabling the SysTick timer (1 kHz), disabling everything else and running the code there, to rule out something strange with the DMA ISR.
DavidAlfa,
> The struct idea definitely made a difference, lowering the time [from 2.56us(?)] to 1.96us.
Wow, I wouldn't expect it to be that dramatic.
Can you please revert to the non-struct version, and try a couple of versions with added one/two/etc. _NOP()s before setting of GPIO pin?
Thanks,
JW
I'm not getting any difference by adding random code anywhere in main(), nor by aligning the code to 16 bytes:
No alignment:
08000784 <TIM3_IRQHandler>
Adding __attribute__((aligned(16))):
08000790 <TIM3_IRQHandler>
Zero difference in execution time.
Edit: It was my scope not having enough precision. Repeated in 10ns/div.
I'm adding nops before setting the pin high, the waveform should remain the same, right? But there's a small difference:
0 nop: 1.96us (="same")
1 nop: +20ns
2 nop: same
3 nop: +20ns
4 nop: same
It seems adding an odd number of nops adds two extra execution cycles to the following instructions.
But in no way +600ns.
Flash latency is set to 3, I don't think it's a cache miss.
The F411 is way inferior to the G431 so I wouldn't expect it to perform better in any way, but who knows.
> I'm adding nops before setting the pin high, the waveform should remain the same, right? But there's a small difference:
> 0 nop: 1.96us (="same")
> 1 nop: +20ns
> 2 nop: same
> 3 nop: +20ns
> 4 nop: same
OK but that's with the variables in struct, right? If you run at 100MHz that 20ns may be 2-3 cycles so it may be one extra FLASH read, or similar.
Can you please try the same with the non-struct i.e. original version? As there are more data reads from the FLASH, the difference may be more pronounced.
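For converting scope readings into cycles, a tiny helper (the name is made up) that rounds to the nearest whole cycle. At exactly 100 MHz one cycle is 10 ns, so a 20 ns step corresponds to 2 cycles, give or take the scope's resolution.

```c
#include <stdint.h>

/* Convert a duration in nanoseconds to core clock cycles, rounded to the
 * nearest cycle. core_hz is the core clock in Hz. */
uint32_t ns_to_cycles(uint32_t ns, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)ns * core_hz + 500000000ull) / 1000000000ull);
}
```

For example, ns_to_cycles(1960, 100000000u) gives the 1.96us reading above as 196 cycles.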
> The F411 is way inferior to the G431 so I wouldn't expect it to perform better in any way, but who knows.
From computing point of view the 'F411 is in no way inferior to 'G431, except the max. clock frequency (i.e. the 'F42x or 'F446 are on par with 'G431). The 'G4xx are "better" in the peripheral mix. The 'G4xx are worse in longevity, given the massive increase in peripherals' complexity was paid for by using 45nm technology (vs. the older (read: more robust) 90nm for the 'F4). It's not dramatic, yet; but the pressure is already felt.
JW
Strange results, but nothing outside of this world.
1 nop: -20ns
2 nop: +10ns
3 nop: +10ns
4 nop: -20ns
5 nop: -20ns
6 nop: -20ns
7 nop: -20ns
8 nop: -20ns
9 nop: -20ns
asm(""): -20ns. Yeah, an asm("") with nothing in it.
void TIM3_IRQHandler(void)
{
/* USER CODE BEGIN TIM3_IRQn 0 */
__HAL_TIM_CLEAR_FLAG(&htim3,TIM_FLAG_UPDATE);
8000784: 4b7c ldr r3, [pc, #496] ; (8000978 <TIM3_IRQHandler+0x1f4>)
LED_GPIO_Port->BSRR = LED_Pin; //Set Pin high
8000786: 497d ldr r1, [pc, #500] ; (800097c <TIM3_IRQHandler+0x1f8>)
void TIM3_IRQHandler(void)
{
8000784: b5f0 push {r4, r5, r6, r7, lr}
/* USER CODE BEGIN TIM3_IRQn 0 */
asm("");
__HAL_TIM_CLEAR_FLAG(&htim3,TIM_FLAG_UPDATE);
8000786: 4b7c ldr r3, [pc, #496] ; (8000978 <TIM3_IRQHandler+0x1f4>)
LED_GPIO_Port->BSRR = LED_Pin; //Set Pin high
8000788: 497c ldr r1, [pc, #496] ; (800097c <TIM3_IRQHandler+0x1f8>)
Sounds weird to me, because CCM/ITCM should be just as fast as that code/literal cache: no wait states. Load instructions finish as quickly as possible from ITCM, and I don't see how they could be any faster; the interface to the flash accelerator or cache should be of equal performance. But I could be wrong.
CCM doesn't have separate I/D buses like flash or SRAM. This means the pipeline can't overlap data accesses with code accesses when data is in CCM, but the flip side is it gives very predictable and deterministic execution times. Every access takes exactly one cycle, and can simply be added up.