Author Topic: STM32G431 - How can it be that the amount of lines of code in main() affect ISR  (Read 7592 times)


Offline wek

  • Frequent Contributor
  • **
  • Posts: 495
  • Country: sk
I too have never seen TCM being slower and I don't see how it is possible.
We are not talking about the TCM in the CM7, which sits on a processor bus entirely separate from the rest of the system.

CCM SRAM in STM32 denotes different things in different families/models. In the 'G4, it's a single-port RAM accessed through both the I and D ports of the processor, and also through the S port (the latter via a different address-region alias) and then through the common bus matrix; it can also be a slave to either of the DMAs. Strangely enough, there's another chunk of RAM, denoted SRAM1, with exactly the same connectivity, and no explanation of how it differs from CCM SRAM.

The arbitration of the bus matrix is not documented beyond the statement that it is "round robin".

JW
 
The following users thanked this post: Siwastaja

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5931
  • Country: es
Note that some linker scripts don't separate these ram regions.
I remember the 32F429 showing "RAM: .... length=192K", so the system would use any.

At least in the 429, CCM is only connected to the D-Bus, so it can't be used for instructions, nor can it be accessed by any DMA.
Quote
The 64-Kbyte CCM (core coupled memory) data RAM is not part of the bus matrix and can be accessed only through the CPU
But placing ISR data in CCM and ISR instructions in the normal SRAM shouldn't slow it down, as they won't need to fight for access.

Not the case of the 32G431 (Totally different beast!)
« Last Edit: November 13, 2022, 10:01:04 pm by DavidAlfa »
Hantek DSO2x1x            Drive        FAQ          DON'T BUY HANTEK! (Aka HALF-MADE)
Stm32 Soldering FW      Forum      Github      Donate
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8183
  • Country: fi
Data and code use different buses. Otherwise there would always be contention between instruction fetches and data accesses. See the bus matrix picture below.

This picture explains it well. The flash accelerator comes with separate DBUS and IBUS interfaces; the STM32 CCM SRAM does not, so it has to arbitrate.

Who knows about the internals of ARM-provided M7 ITCM (tried 2 minutes in Google to no avail)? The bus is 64 bits but does that mean it's actually kinda-dual port, like the FLASH accelerator pictured? PC-relative literals are usually further away so 64 bit width does not help for that.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8183
  • Country: fi
Quote
The STM32G4 Series category 2 devices feature up to 32 Kbytes SRAM:
• 6 Kbytes SRAM2 (mapped at address 0x2000 4000)
• 10 Kbytes CCM SRAM (mapped at address 0x1000 0000 and end of SRAM2)

Kind of weird optimization idea:

Since PC-relative LDR can take offset of +/- 4095 bytes, which is quite a lot, and considering wek's comment "is there any difference between CCM and SRAM?"  it should be possible to place timing-critical routines on the border of these memory segments so that the PC-relative literals fall into SRAM2 and code itself into CCM. Then of course place nothing else into SRAM2 & CCM. Then DBUS accesses would be to SRAM2 and IBUS instruction fetches to CCM, and no arbitration for PC-relative literals.
 

Offline jnk0le

  • Contributor
  • Posts: 44
  • Country: pl
About optimizing the OP code

Quote
8000ed0:   20001558    .word   0x20001558
 8000ed4:   20001524    .word   0x20001524
 8000ed8:   2000155c    .word   0x2000155c
 8000edc:   200014f8    .word   0x200014f8
 
 8000ee4:   20001560    .word   0x20001560
 8000ee8:   200000cc    .word   0x200000cc
 8000eec:   200000c4    .word   0x200000c4
 8000ef0:   20001554    .word   0x20001554
 8000ef4:   20001550    .word   0x20001550
 
 8000efc:   2000151c    .word   0x2000151c
 8000f00:   2000154c    .word   0x2000154c
 8000f04:   20001508    .word   0x20001508
 8000f08:   2000150c    .word   0x2000150c
 8000f0c:   20001520    .word   0x20001520
 8000f10:   20001548    .word   0x20001548
 8000f14:   20001544    .word   0x20001544
 8000f18:   200000c8    .word   0x200000c8
 8000f1c:   200000c0    .word   0x200000c0
 8000f20:   20001540    .word   0x20001540
 8000f24:   2000153c    .word   0x2000153c
 8000f28:   200014f4    .word   0x200014f4
 
 8000f30:   20001518    .word   0x20001518
 8000f34:   20001538    .word   0x20001538
 8000f38:   20001500    .word   0x20001500
 8000f3c:   20001504    .word   0x20001504
 8000f40:   200014fc    .word   0x200014fc
 8000f44:   200000d0    .word   0x200000d0

all of those seem to be addresses of global variables. Organizing them in structures should greatly reduce those literal loads.

EDIT: many of those are also within 12 bit addw/ldr range to each other but the linkers tend to not be good at this kind of address relaxing
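
A minimal sketch of the struct idea, for illustration only. The type name vloop_t, the field adc (standing in for myarray) and the helper functions are hypothetical; only the variable names and the three gain values come from the OP's main.c. With a single base pointer, each field can be reached with an immediate offset instead of its own literal-pool entry, although what the compiler actually emits has to be checked in the disassembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical grouping of the voltage-loop globals from main.c.
   Field names follow the thread; the struct itself is an assumption. */
typedef struct {
    uint16_t     adc[2];            /* was myarray[2] */
    unsigned int vref, vref_aim;
    int          KpV, KiV, KdV;
    int          ev, e1v, cv, cv_out;
    float        vp, vi, vi1, vd;
} vloop_t;

static volatile vloop_t vloop = { .KpV = 300, .KiV = 30, .KdV = 3000 };

/* The base address is taken once; each field access below can then use
   an immediate offset from that base rather than a separate literal. */
int vloop_error(void)
{
    volatile vloop_t *s = &vloop;
    s->ev = (int)(s->vref - s->adc[0]);
    return s->ev;
}

/* Offset of the last field. It has to stay below 4096 for the 12-bit
   immediate form of LDR/STR to reach every member from the base. */
size_t vloop_max_offset(void)
{
    return offsetof(vloop_t, vd);
}
```

The current loop variables could go into the same struct or a second one; either way the whole thing is far smaller than the 12-bit offset range.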

« Last Edit: November 14, 2022, 10:14:11 am by jnk0le »
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
Hi guys, I am back. I had a little time during the weekend and worked on a small, simplified project that shows the variable execution time of the DMA1 ISR. Together we can now walk through the mini project and find out what is going on. But there is one thing I have to make clear right from the beginning: I am not allowed to post the C source code of the function cli(). I can only post the disassembly of that function.

Ok here we go....this is the main.c file:

Code: [Select]
#include "stm32g4xx.h"
#include "main.h"
#include "system_setup.h"
#include "usart_char_string.h"
#include "cli.h"
#include "ISR_Handlers.h"

#define TEST 0
#define VOUT 5.0f
#define VOUT_SET ((VOUT/10)/3)*4095.0f * 0.996
#define KPV  300
#define KIV  30
#define KDV  3000
#define IOUT 3.0f
#define IOUT_SET ((IOUT*0.2308f)/3)*4095.0f * 1.0f
#define KPI  10000
#define KII  30000
#define KDI  5000

volatile uint16_t myarray[2]= {0,0};
char fooptr[16];
volatile char debug=0;

volatile unsigned int vref=0, vref_aim=VOUT_SET;
volatile int KpV=KPV, KiV=KIV, KdV=KDV;
volatile int ev=0, e1v=0, cv=0, cv_out=0;
volatile float vp=0, vi=0, vi1=0, vd=0;

volatile unsigned int iref=0, iref_aim=IOUT_SET;
volatile int KpI=KPI, KiI=KII, KdI=KDI;
volatile int ei=0, e1i=0, ci=0, ci_out=0;
volatile float ip=0, ii=0, ii1=0, id=0;
volatile int c_out;

void SysTick_Handler(void);

int main(void)
{
setup_clock_tree_config();
setup_GPIO_config();
SystemCoreClockUpdate();
clock = SystemCoreClock;
SysTick_Config(clock/1e3); //1ms
setup_USART_config();
setup_Timer1_config();
setup_ADC1_config();
setup_interrupt_config();

GPIOA->ODR |= (1 << GPIO_ODR_OD11_Pos);

while(1)
{

cli();

if(TEST)
{
delay_ms(200);
}

}
}


void SysTick_Handler(void)
{
count_ticks++;
}

void delay_ms(uint32_t ms)
{
uint32_t start=count_ticks;
while ((count_ticks-start) < ms);
}

If you wonder about the variable names...yes....the project is all about a digital control loop for a DC/DC converter. And yes...the DMA1 ISR executes the PID algorithms for voltage and current control. This is just a side note for those of you who are interested ;)

Here is the code of the ISR:

Code: [Select]
void DMA1_Channel1_IRQHandler(void) {

if (DMA1->ISR & DMA_ISR_TCIF1)
{
DMA1->IFCR |= DMA_IFCR_CTCIF1;

GPIOC->BSRR |= (1U << 14); //Pin high

ev = (vref - myarray[0]);

vp = KpV * ev;
vi = KiV * ev + vi1;

if (vi > 4100) vi = 4100;
if (vi < 0) vi = 0;
vi1 = vi;

vd = KdV * (ev - e1v);
e1v = ev;

cv = vp + vi + vd;

if (cv < 0) cv = 0;

cv_out = cv;
if (cv_out > 4000) cv_out = 4000;

ei = (myarray[1] - iref);

ip = KpI * ei;
ii = KiI * ei + ii1;

if (ii > 4100000) ii = 4100000;
if (ii < 0) ii = 0;
ii1 = ii;

id = KdI * (ei - e1i);
e1i = ei;

ci = ip + ii + id;

if (ci < 0) ci = 0;

ci_out = ci / 1000;
if (ci_out > 4000) ci_out = 4000;

c_out = (cv_out - ci_out);

if (c_out > 4000) c_out = 4000;
if (c_out < 1500) c_out = 1500;

GPIOC->BSRR |= (1U << (14 + 16)); //Pin low

}
}

The main() function and the ISR and also the vector table reside in flash memory. Compiler optimizes for size (-Os). From the attached image you can see that the execution time is 1.842µs with cli() that gets called from main().
If I comment out cli()...

Code: [Select]
while(1)
{
//cli();

if(TEST)
{
delay_ms(200);
}

}

...the execution time of the ISR is 1.542µs, shown in the other attached image.

Please see also the attached ZIP. It contains the elf-file and the disassembly (list-file) with the cli() function. On request I can upload an elf file and disassembly without the cli() function

Find out what you cannot do and then go an do it!
 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5931
  • Country: es
What happens if you disable the float code and add a fixed nop delay, does the ISR time vary by cli()?

Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14);                    //Pin high
for(uint16_t i=0;i<10000;i++) asm("nop");
GPIOC->BSRR |= (1U << (14 + 16));             //Pin low

Also try making a larger function:
Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14);                    //Pin high
for(uint16_t i=0;i<40;i++) {
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
}
GPIOC->BSRR |= (1U << (14 + 16));             //Pin low

This way you discard a FPU HW / library issue.
Does cli use the FPU as well? Any other higher priority IRQ? Disabling the IRQ in any moment?
« Last Edit: November 20, 2022, 10:17:55 pm by DavidAlfa »
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
Quote
What happens if you disable the float code and add a fixed nop delay, does the ISR time vary by cli()?

Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14);                    //Pin high
for(uint16_t i=0;i<10000;i++) asm("nop");
GPIOC->BSRR |= (1U << (14 + 16));             //Pin low

Tried that...Result:
Execution time of the ISR is not dependent on cli().


Quote
Also try making a larger function:
Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14);                    //Pin high
for(uint16_t i=0;i<40;i++) {
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
}
GPIOC->BSRR |= (1U << (14 + 16));             //Pin low

This way you discard a FPU HW / library issue.

Tried that too...Result:
Execution time of the ISR is not dependent on cli().


Quote
Does cli use the FPU as well? Any other higher priority IRQ? Disabling the IRQ in any moment?
No, cli() has no float operations going. Also if you check the disassembly of cli() you will not find any FPU instructions. So probably the root cause is not the lazy stacking feature!?!?  :-//
No higher priority IRQs.

But...the above tests led me to test again what happens if I tell the compiler not to use the FPU at all.--> Result: Execution time of the ISR is not dependent on cli().
Hmmm...well....does this mean that the issue is nevertheless caused by the FPU?

What can be the next steps to dive deeper into this?







 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5931
  • Country: es
Toggle the pin between operations, maybe you find it's happening by some specific operation?
 

Offline jnk0le

  • Contributor
  • Posts: 44
  • Country: pl
Put all of those in one big struct so the compiler can access them with an ldr offset instead of doing a pcrel load of the address each time.
Quote
Code: [Select]
volatile uint16_t myarray[2]= {0,0};

volatile unsigned int vref=0, vref_aim=VOUT_SET;
volatile int KpV=KPV, KiV=KIV, KdV=KDV;
volatile int ev=0, e1v=0, cv=0, cv_out=0;
volatile float vp=0, vi=0, vi1=0, vd=0;

volatile unsigned int iref=0, iref_aim=IOUT_SET;
volatile int KpI=KPI, KiI=KII, KdI=KDI;
volatile int ei=0, e1i=0, ci=0, ci_out=0;
volatile float ip=0, ii=0, ii1=0, id=0;
volatile int c_out;

 

Offline Jeroen3

  • Super Contributor
  • ***
  • Posts: 4078
  • Country: nl
  • Embedded Engineer
    • jeroen3.nl
What is your clock tree like? Are you waiting on an asynchronous bus somewhere?

What happens when you run the chip at a speed that needs no wait states? E.g. 16 MHz flat?
Can you then reproduce the results?

Sidenote: GPIOC->BSRR is a write only register, no need for |=
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 495
  • Country: sk
The FPU is probably a red herring: with those nops you've also removed all the pcrel loads, which may have an impact through alignment, as jnk0le pointed out above (and maybe others talked about it too).

To test the alignment-dependency theory, insert a single NOP to the ISR (before setting the GPIO pin) and retest, then insert one more etc.

> Sidenote: GPIOC->BSRR is a write only register, no need for |=

Same for DMA1->IFCR . Generally avoid RMW to registers as much as possible/practical. Don't set multiple bits in a single register by a series of RMW, (as you do with FLASH_ACR), perform one single write of the final value.


JW
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8183
  • Country: fi
> Sidenote: GPIOC->BSRR is a write only register, no need for |=

Same for DMA1->IFCR . Generally avoid RMW to registers as much as possible/practical.


Made that bigger; it's a pretty important side note when the OP is clearly interested in performance. BSRR, IFCR and other similar registers exist for the purpose of avoiding read-modify-write; the RMW operation is moved to the peripheral side; consider it a simple "hardware accelerator" for modifying peripheral state.
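
To illustrate the point, a minimal sketch of plain-store pin helpers. fake_gpio_t, pin_high and pin_low are made-up names so that it builds on a host; on the target you would write to GPIOC->BSRR from the device header in the same way.

```c
#include <stdint.h>

/* Stand-in for the GPIO register block so the sketch builds on a host.
   This type is an assumption, not the CMSIS definition. */
typedef struct {
    volatile uint32_t BSRR;   /* bits 0..15 set a pin, bits 16..31 reset it */
} fake_gpio_t;

/* Plain store: the register never has to be read back. */
static inline void pin_high(fake_gpio_t *port, unsigned pin)
{
    port->BSRR = (uint32_t)1 << pin;
}

/* Same idea for the reset half of the register. */
static inline void pin_low(fake_gpio_t *port, unsigned pin)
{
    port->BSRR = (uint32_t)1 << (pin + 16u);
}
```

With |= the compiler has to emit a load, an OR and a store for each access; with = it is a single store. The same applies to DMA1->IFCR.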
 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5931
  • Country: es
Why add side notes in a micron-size font? Just use regular size; it's confusing, and I didn't even read them because they looked like your personal signature.

Add "__attribute__((aligned(16)))" before the ISR function so it gets 16-byte (128bit) aligned.
It'll waste a little space, but nothing serious, tested it and got a 3.2% increment, from 106KB to 109.3KB.
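
A minimal sketch of what that looks like, assuming GCC or Clang. aligned_handler and is_aligned16 are hypothetical names; in the real project the attribute would go on the actual ISR.

```c
#include <stdint.h>

/* GCC/Clang attribute: request a 16-byte boundary for the function's
   entry point. */
__attribute__((aligned(16)))
void aligned_handler(void)
{
    /* ISR body would go here */
}

/* Returns 1 if the entry address is 16-byte aligned. Bit 0 is masked off
   because Thumb function pointers carry it as a mode flag. */
int is_aligned16(void (*fn)(void))
{
    uintptr_t addr = (uintptr_t)fn & ~(uintptr_t)1;
    return (addr % 16u) == 0;
}
```

The list file or map file shows the same thing without running anything: the handler's address should end in 0.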

What are you running in cli(), or what code is increasing the ISR time?
Tried your code inside a timer ISR, nothing seems to affect it, got steady 2.56us on a 32F411@100MHz.
The struct idea definitely made a difference, lowering the time to 1.96us.
Are you sure it's not related to the processing ending sooner/later depending on the input data?
What happens if the float input values are never updated, so it always makes the same calcs?

Also, maybe try enabling the Systick timer (1KHz), disabling everything else and running the code there, to discard something strange with the DMA ISR.
« Last Edit: November 23, 2022, 11:30:20 am by DavidAlfa »
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 495
  • Country: sk
DavidAlfa,

> The struct idea definitely made a difference, lowering the time [from 2.56us(?)] to 1.96us.

Wow, I wouldn't expect it to be that dramatic.

Can you please revert to the non-struct version, and try a couple of versions with added one/two/etc. _NOP()s before setting of GPIO pin?

Thanks,

JW

 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5931
  • Country: es
I'm not getting any difference by adding random code anywhere in main(), nor by aligning the code to 16 bytes:

No alignment:
08000784 <TIM3_IRQHandler>

Adding __attribute__((aligned(16))):
08000790 <TIM3_IRQHandler>

Zero difference in execution time.

Edit: It was my scope not having enough precision. Repeated in 10ns/div.

I'm adding nops before setting the pin high, the waveform should remain the same, right? But there's a small difference:
0 nop:  1.96us (="same")
1 nop: +20ns
2 nop: same
3 nop: +20ns
4 nop: same

It seems that adding an odd number of nops adds two additional execution cycles to the following instructions.
But by no way +600ns.
Flash latency is set to 3, I don't think it's a cache miss.
The F411 is way inferior to the G431 so I wouldn't expect it to perform better in any way,  but who knows.
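
The cycle count is just the time delta divided by the clock period. A small helper for that arithmetic, assuming the 100 MHz core clock of the 32F411 test (ns_to_cycles is a made-up name):

```c
#include <stdint.h>

/* Convert a time delta in nanoseconds to core clock cycles,
   rounded to the nearest cycle. */
uint32_t ns_to_cycles(uint32_t delta_ns, uint32_t core_mhz)
{
    return (delta_ns * core_mhz + 500u) / 1000u;
}

/* ns_to_cycles(20, 100)  -> 2  cycles */
/* ns_to_cycles(600, 100) -> 60 cycles */
```

So the 20ns steps are about two cycles each, while a 600ns difference would be around 60 cycles at that clock.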
« Last Edit: November 23, 2022, 11:59:18 am by DavidAlfa »
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 495
  • Country: sk
> I'm adding nops before setting the pin high, the waveform should remain the same, right? But there's a small difference:
> 0 nop:  1.96us (="same")
> 1 nop: +20ns
> 2 nop: same
> 3 nop: +20ns
> 4 nop: same

OK but that's with the variables in struct, right? If you run at 100MHz that 20ns may be 2-3 cycles so it may be one extra FLASH read, or similar.

Can you please try the same with the non-struct i.e. original version? As there are more data reads from the FLASH, the difference may be more pronounced.

> The F411 is way inferior to the G431 so I wouldn't expect it to perform better in any way,  but who knows.

From computing point of view the 'F411 is in no way inferior to 'G431, except the max. clock frequency (i.e. the 'F42x or 'F446 are on par with 'G431). The 'G4xx are "better" in the peripheral mix. The 'G4xx are worse in longevity, given the massive increase in peripherals' complexity was paid for by using 45nm technology (vs. the older (read: more robust) 90nm for the 'F4). It's not dramatic, yet; but the pressure is already felt.

JW
 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5931
  • Country: es
Strange results, but nothing out of this world.

1 nop: -20ns
2 nop: +10ns
3 nop: +10ns
4 nop: -20ns
5 nop: -20ns
6 nop: -20ns
7 nop: -20ns
8 nop: -20ns
9 nop: -20ns

asm("") -20ns  Yeah,asm("Nothing")

Code: [Select]
void TIM3_IRQHandler(void)
{
  /* USER CODE BEGIN TIM3_IRQn 0 */
  __HAL_TIM_CLEAR_FLAG(&htim3,TIM_FLAG_UPDATE);
 8000784: 4b7c      ldr r3, [pc, #496] ; (8000978 <TIM3_IRQHandler+0x1f4>)
  LED_GPIO_Port->BSRR = LED_Pin; //Set Pin high
 8000786: 497d      ldr r1, [pc, #500] ; (800097c <TIM3_IRQHandler+0x1f8>)
Code: [Select]
void TIM3_IRQHandler(void)
{
 8000784: b5f0      push {r4, r5, r6, r7, lr}
  /* USER CODE BEGIN TIM3_IRQn 0 */
  asm("");
  __HAL_TIM_CLEAR_FLAG(&htim3,TIM_FLAG_UPDATE);
 8000786: 4b7c      ldr r3, [pc, #496] ; (8000978 <TIM3_IRQHandler+0x1f4>)
  LED_GPIO_Port->BSRR = LED_Pin; //Set Pin high
 8000788: 497c      ldr r1, [pc, #496] ; (800097c <TIM3_IRQHandler+0x1f8>)
« Last Edit: November 23, 2022, 02:17:31 pm by DavidAlfa »
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 495
  • Country: sk
Hummm.

Thanks, David.

JW
 

Offline bson

  • Supporter
  • ****
  • Posts: 2271
  • Country: us
Sounds weird to me, because CCM/ITCM should be equally fast to that code/literal cache - no wait states. Load instructions finish as quickly as possible from ITCM, I don't see how they can be any faster; interface to the flash accelerator or cache should be of equal performance. But I could be wrong.
CCM doesn't have separate I/D buses like flash or SRAM.  This means the pipeline can't overlap data accesses with code accesses when data is in CCM, but the flip side is it gives very predictable and deterministic execution times.  Every access takes exactly one cycle, and can simply be added up.
 

