Author Topic: STM32G431 - How can it be that the amount of lines of code in main() affect ISR  (Read 7654 times)


Offline lordnoxx (Topic starter)

  • Contributor
  • Posts: 22
  • Country: de
Hi folks,

I am working on a project based on an STM32G431. While developing the firmware for it I encountered a strange phenomenon that should not be possible. After fiddling around with it and making my own efforts at debugging and testing to better understand what is going on, I had to partly give up, as my knowledge of the deeper STM32 secrets is limited. I am a power electronics engineer, and embedded firmware development is not my daily profession. Although I have quite some experience with coding and microcontrollers (8051, PIC, ATmega, STM32), I have reached a point where I need help finding the root cause of the issue.

So I went to the ST Community Forum and asked for help there: https://community.st.com/s/question/0D73W000001nmmASAQ/detail?fromEmail=1&s1oid=00Db0000000YtG6&s1nid=0DB0X000000DYbd&s1uid=0053W000001UI26&s1ext=0&emkind=chatterCommentNotification&emtm=1667370141288

I am still full of hope and confidence that the people there at ST can help me. But I feel I have to reach a bigger crowd of skilled engineers to dig into this. That is why I want to reach out to you here at eevblog too and want to ask kindly for additional help.


STM32G431 - How can it be that the number of lines of code in main() affects the time it takes to execute an interrupt routine?

This question is all about the behavior of the DMA1_Channel1_IRQHandler in my application.
I am working with an STM32G431CBT. It is running at 168MHz. Configured with 4 Flash wait states.
NVIC interrupt priorities are set up using CMSIS functions as follows:

Code: [Select]
SystemCoreClockUpdate();
clock = SystemCoreClock;
SysTick_Config(clock / 1000U); // 1 ms tick; integer divide instead of the double constant 1e3

NVIC_SetPriorityGrouping(0); // all implemented priority bits act as preemption priority

NVIC_DisableIRQ(SysTick_IRQn);
irq_prio = NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 10, 0);
NVIC_SetPriority(SysTick_IRQn, irq_prio);
NVIC_EnableIRQ(SysTick_IRQn);

irq_prio = NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 13, 0);
NVIC_SetPriority(USART2_IRQn, irq_prio);
NVIC_EnableIRQ(USART2_IRQn);

irq_prio = NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 0, 0); //highest priority!!!
NVIC_SetPriority(DMA1_Channel1_IRQn, irq_prio);
NVIC_EnableIRQ(DMA1_Channel1_IRQn);
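
As a cross-check of the priority values above, the following host-side sketch mirrors the arithmetic that CMSIS NVIC_EncodePriority performs, assuming 4 implemented priority bits (the value of __NVIC_PRIO_BITS on the STM32G4). The names encode_priority and PRIO_BITS are illustrative only and are not CMSIS identifiers.

```c
#include <stdint.h>

/* Assumption: 4 implemented priority bits, as on the STM32G4. */
#define PRIO_BITS 4U

/* Illustrative host-side mirror of the CMSIS NVIC_EncodePriority arithmetic. */
static uint32_t encode_priority(uint32_t group, uint32_t preempt, uint32_t sub)
{
    uint32_t g = group & 0x07U;
    /* Number of bits available for preemption and for sub-priority. */
    uint32_t preempt_bits = ((7U - g) > PRIO_BITS) ? PRIO_BITS : (7U - g);
    uint32_t sub_bits = ((g + PRIO_BITS) < 7U) ? 0U : ((g - 7U) + PRIO_BITS);

    return ((preempt & ((1U << preempt_bits) - 1U)) << sub_bits)
         | (sub & ((1U << sub_bits) - 1U));
}
```

With grouping 0 there are no sub-priority bits, so encode_priority(0, 10, 0), encode_priority(0, 13, 0) and encode_priority(0, 0, 0) simply return 10, 13 and 0. The value 0 is the most urgent level, so the DMA1 channel 1 interrupt can preempt both SysTick and USART2.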

A compare match on TIM1->CCR5 triggers the ADC conversion sequence. The ADC triggers DMA1_Channel1 (circular, peripheral-to-memory). DMA1 is set up to issue an interrupt on transfer complete (TCIE).
The DMA1_Channel1_IRQHandler (which has the highest priority) then fires, and a few things are processed and calculated within it.

My code compiles with no errors and warnings. And as far as I can see it all works as intended.
But here comes the strange thing:

Right at the start of DMA1_Channel1_IRQHandler I set a GPIO pin high. At the end of DMA1_Channel1_IRQHandler I set the same pin low. Looking at the generated high-low pulse with an oscilloscope, I can measure the execution time of the DMA1_Channel1_IRQHandler. It takes 1.69µs. But if I shrink the code in main() by a few lines, I can see that the execution time gets shorter!!! 😮 And if I put just an empty while(1) loop in main(), the execution time gets even shorter... down to its lowest --> 1.22µs !!!!😮

How can that be? An ISR should not be affected in such a way by the amount of code in a low priority function just like main().
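
To put the two pulse widths in perspective, they can be converted into core clock cycles at the stated 168MHz. The helper below is a small host-side sketch; us_to_cycles is an illustrative name, and the pulse only covers the code between the two pin writes, not the exception entry before the first write.

```c
#include <stdint.h>

/* Convert a pulse width in microseconds to core clock cycles, rounded to nearest. */
static uint32_t us_to_cycles(double pulse_us, double core_mhz)
{
    return (uint32_t)(pulse_us * core_mhz + 0.5);
}
```

At 168MHz, 1.69µs is about 284 cycles and 1.22µs is about 205 cycles, so the unexplained difference is roughly 79 core cycles per interrupt.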

The ISR looks like this:

Code: [Select]
void DMA1_Channel1_IRQHandler(void)
{
    if(DMA1->ISR & DMA_ISR_TCIF1)  //check we are here cause of a valid interrupt occurred
    {
        DMA1->IFCR |= DMA_IFCR_CTCIF1; //reset interrupt flag

        GPIOC->ODR |= (1U << 14); //Set Pin high

        ev = (v - my_array[0]);
        vp  = A * ev;
        vi  = B * vi1;
        if(vi > 4100) vi=4100;
        if(vi < 0) vi=0;
        vi1 = vi;
        vd  = C * (ev - (e1v+2));
        e1v = ev;
        cv = vp + vi + vd;
        if(cv < 0) cv=0;
        cv_t = cv/2;
        if(cv_t >  4000 ) cv_t=4000;
        /*------------------------------------------------------*/
        ei = (my_array[1] - i);
        ip  = D * ei;
        ii  = E * ii1;
        if(ii > 4100000 ) ii=4100000;
        if(ii < 0 ) ii=0;
        ii1 = ii;
        id  = F * (ei + (e1i-2));
        e1i = ei;
        ci = ip + ii + id;
        if(ci < 0) ci=0;
        ci_t = ci/1000;
        if(ci_t >  4000 ) ci_t=4000;

        cdac_t = (cv_t - ci_t);

        if(cdac_t > 4000) cdac_t=4000;
        if(cdac_t < 1500) cdac_t=1500;

        GPIOC->ODR &= ~(1U << 14); //Set Pin low
    }
}

All the variables are global. Some of them are "int" and others are "float". All variables used in the ISR are declared volatile. I am using the FPU.
The timing shrinks down to 1.22µs for every call to the handler. I verified that with the pulse trigger function of the oscilloscope. No longer pulses occur.



This is what I have done so far to debug this:

Put the interrupt in a different C file than main() --> Did not make a difference

Using the BSRR register instead of ODR for setting the pin high/low ---> Did not make a difference

Next I tried moving the "GPIOC->BSRR |= (1U << (14+16)); //Set Pin low" line upwards in the ISR code to see whether there is a point from which the ISR execution time no longer depends on the amount of code in main(). Of course the measured time gets smaller as the set-high and set-low writes move closer together. But I have to conclude that such a point does not exist, apart from placing set pin high and set pin low right next to each other:

Code: [Select]
"GPIOC->BSRR |= (1U << (14)); //Set Pin high"
"GPIOC->BSRR |= (1U << (14+16)); //Set Pin low"

But in general...No matter where I place "//Set Pin low" there is always a dependency on the amount of code in main() regarding ISR execution time.
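
A side note on the BSRR lines quoted above: BSRR is a write-only set/reset register, so a plain assignment is enough and the read-modify-write of "|=" only adds a bus read. The sketch below computes the two masks on a host; bsrr_set_mask and bsrr_reset_mask are illustrative helper names, not part of any ST header.

```c
#include <stdint.h>

/* Lower half of BSRR (bits 0..15) sets a pin, upper half (bits 16..31) resets it. */
static uint32_t bsrr_set_mask(unsigned pin)
{
    return (uint32_t)1U << pin;
}

static uint32_t bsrr_reset_mask(unsigned pin)
{
    return (uint32_t)1U << (pin + 16U);
}
```

On the target this would be used as "GPIOC->BSRR = bsrr_set_mask(14);" and "GPIOC->BSRR = bsrr_reset_mask(14);". This shortens the measurement overhead slightly but is not expected to explain the dependency on main() by itself.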

For testing I also told the compiler not to use the FPU. This resulted in a much longer ISR execution time, but the issue was still present.

For testing I disabled all interrupts (even SysTick) besides DMA1_Channel1_IRQHandler --> Issue still present.

 :scared: But then I found one interesting effect. Telling the compiler to optimize for speed (-Ofast) instead of size (-Os) solved the problem. No matter how much code is in main() the ISR execution time is always 1.2µs.  :scared:

Since setting the optimization level to -Ofast is more or less a workaround and not a fix for the root of the problem, I have to ask once again: with all that information and background, does anybody have any idea what is wrong here? A bug in the compiler/linker or even in the silicon?

 

Can anybody confirm that behavior?


Find out what you cannot do and then go an do it!
 

Offline Alti

  • Frequent Contributor
  • **
  • Posts: 404
  • Country: 00
Configured with 4 Flash wait states.
You are comparing apples and oranges.

The uC has an ARMv7E-M core, and it works EXACTLY as specified in the Architecture Reference Manual from ARM, its designer. Precisely down to a single tick, to the finest stage of the pipeline (3-stage, btw). On top of that there is ST (ST has nothing to do with ARM), which tied a 128-bit? flash memory onto the buses of that uC (three buses, btw), with a cache (check the ST datasheet for depths). The opcodes are 16-bit or 32-bit, and you cannot align them to 128-bit boundaries using ANSI C! If the pipeline needs to be flushed because of some if, or there is a cache miss, you get stalls.

Please provide a relevant, minimal sample that proves your point (unexpected behavior?). Assembly, alignment to flash row boundaries, cache depth etc. Or maybe you can reproduce this behavior with 0 flash wait states and the cache disabled - even better. Or perhaps start by executing the IRQ from SRAM, which has 0 wait states on most STM32 chips and runs at full throttle.
 

Online DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5965
  • Country: es
Not sure if I understood the issue correctly.
The high pulse width should stay the same (that is the DMA ISR itself), but adding more code to the main loop will extend the time between the pulses, since more code has to be processed before the DMA ISR is called again.
However, if it's the high pulse that's getting longer, check that no other ISR is running concurrently with the same or higher priority.

Hantek DSO2x1x            Drive        FAQ          DON'T BUY HANTEK! (Aka HALF-MADE)
Stm32 Soldering FW      Forum      Github      Donate
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11297
  • Country: us
    • Personal site
I'm not sure about this MCU specifically, but quite often execution time of the code depends on its alignment when executing from flash. This is because of the way prefetch systems work.

So the first thing to check - does the alignment of the IRQ code change when main() changes?

The solution to this is to enable cache or place the critical code in SRAM/TCM.
Alex
 

Online DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5965
  • Country: es
The cache (ART Accelerator) is ST's golden-egg-laying hen, the only reason they can run so fast, very close to 0 wait states.
If not using the cache... get a Z80! Performance will be the same ;)

The instant you get an interrupt, it needs to fetch the ISR code, which will probably cause a cache miss and require a few clocks to fetch the data.
But the first read also fetches the next instructions, so how would the alignment hit so hard as to cause an almost 40% increase?
A few clocks, yeah, but +500ns seems absolutely crazy.
Unless (like always) he didn't mention something important  :popcorn:
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11297
  • Country: us
    • Personal site
That close to 0 wait states thing is mostly marketing and only works in some cases.

The code here has no explicit loops, which is what mostly gets affected by alignment.  But if some of the variables are float, then low level routines might have loops in them.

The easiest way to check this is to place ISR into SRAM and check the performance. I predict it would be consistent and faster than any previous results.
Alex
 

Online eutectique

  • Frequent Contributor
  • **
  • Posts: 401
  • Country: be
But if I shrink the code in the main() by a few lines of code I can see that the execution time gets lower!!! 😮 And if I put just an empty while(1) loop in the main() execution time gets even lower....to its lowest-->1.22µs !!!!😮

You did not say what these lines are, nor shown the variable declarations. To give any definitive answer, it would be nice to see the assembly code of the interrupt handler in two cases: 1.69µs with those few lines in main(), and 1.22µs without.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3148
  • Country: ca
I guess the compiler optimizes some code away. For example, if cdac_t is not used anywhere and is not declared as volatile, there's no reason to keep the code which calculates it. If there's any code in "main" which uses cdac_t, then the code calculating cdac_t must be kept in place and its execution will take some time.
 

Offline peter-h

  • Super Contributor
  • ***
  • Posts: 3734
  • Country: gb
  • Doing electronics since the 1960s...
I see you have used CubeMX to generate all this code.

I "inherited" such a project 2-3 years ago and found the startup code doing weird stuff, e.g. initialising the clocks and the PLL and then calling a function which did all that all over again. Notably, one of these was at the very end of the .s assembler startup code.

So go through the startup carefully.

Also remember that VTOR needs to be 512-aligned.
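
For reference, the 512-byte figure follows from the Cortex-M rule that the vector table must be aligned to its entry count rounded up to a power of two, with a minimum of 32 words. The sketch below works through that rule on a host; vtor_alignment_bytes is an illustrative name, and the count of 102 interrupt lines for the STM32G431 is an assumption to verify against the reference manual.

```c
#include <stdint.h>

/* Required vector table alignment in bytes for a given number of IRQ lines. */
static uint32_t vtor_alignment_bytes(uint32_t num_irqs)
{
    uint32_t entries = num_irqs + 16U;  /* 16 system exception slots come first */
    uint32_t words = 32U;               /* minimum alignment: 32 words */

    while (words < entries) {
        words <<= 1;                    /* round up to the next power of two */
    }
    return words * 4U;                  /* each entry is one 32-bit word */
}
```

With 102 interrupt lines the table has 118 entries, which rounds up to 128 words, so vtor_alignment_bytes(102) gives 512. A table at the start of flash (0x08000000) meets this trivially.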
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11297
  • Country: us
    • Personal site
None of this is related to the original questions in any way.

The original issue is consistent with alignment issues.
Alex
 

Offline lordnoxx (Topic starter)

  • Contributor
  • Posts: 22
  • Country: de
However if it's the high pulse what's getting larger, check there're no other ISR running concurrently with the same or higher priority.

That's exactly the phenomenon. And if you check my initial post --> I already tried to disable all other interrupts. But the issue still exists.
 

Offline lordnoxx (Topic starter)

  • Contributor
  • Posts: 22
  • Country: de
I'm not sure about this MCU specifically, but quite often execution time of the code depends on its alignment when executing from flash. This is because of the way prefetch systems work.

So the first thing to check - does the alignment of the IRQ code change when main() changes?

The solution to this is to enable cache or place the critical code in SRAM/TCM.

Thanks for that hint. I will check that and report back.  :) :-+
 

Offline lordnoxx (Topic starter)

  • Contributor
  • Posts: 22
  • Country: de
But if I shrink the code in the main() by a few lines of code I can see that the execution time gets lower!!! 😮 And if I put just an empty while(1) loop in the main() execution time gets even lower....to its lowest-->1.22µs !!!!😮

You did not say what these lines are, nor shown the variable declarations. To give any definitive answer, it would be nice to see the assembly code of the interrupt handler in two cases: 1.69µs with those few lines in main(), and 1.22µs without.

I will do my best to provide more information and code snippets. Unfortunately, parts of the code involved are the intellectual property of my customer, so I am not allowed to post them here. But the ISR C code and assembly code should be no problem. I will provide those.
 

Offline lordnoxx (Topic starter)

  • Contributor
  • Posts: 22
  • Country: de
I guess the compiler optimizes some code away. For example, if cdac_t is not used anywhere and is not declared as volatile, there's no reason to keep the code which calculates it. If there's any code in "main" which uses cdac_t, then the code calculating cdac_t must be kept in place and its execution will take some time.

cdac_t is volatile and the code that uses it is not removed from main(). So the optimizer does other stuff to the code that speeds up the execution and makes it consistent. It has nothing to do with unused variables/code.
 

Offline lordnoxx (Topic starter)

  • Contributor
  • Posts: 22
  • Country: de
I see you have used CubeMX to generate all this code.

I "inherited" such a project 2-3 years ago and found the startup code doing weird stuff, e.g. initialising the clocks and the PLL and then calling a function which did all that all over again. Notably, one of these was at the very end of the .s assembler startup code.

So go through the startup carefully.

Also remember that VTOR needs to be 512-aligned.

And how is the startup code related to ISR execution time?
How can I check the VTOR alignment?
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11297
  • Country: us
    • Personal site
How can I check the VTOR alignment?
Don't bother, nothing would work at all if it was not aligned correctly.
Alex
 

Offline lordnoxx (Topic starter)

  • Contributor
  • Posts: 22
  • Country: de
I'm not sure about this MCU specifically, but quite often execution time of the code depends on its alignment when executing from flash. This is because of the way prefetch systems work.

So the first thing to check - does the alignment of the IRQ code change when main() changes?

The solution to this is to enable cache or place the critical code in SRAM/TCM.


Here are two assembler snippets from the ISR with and without main().

With main():
Code: [Select]
08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4: 4b73      ldr r3, [pc, #460] ; (8000e94 <DMA1_Channel1_IRQHandler+0x1d0>)
 8000cc6: 681a      ldr r2, [r3, #0]
 8000cc8: 0792      lsls r2, r2, #30
 8000cca: b510      push {r4, lr}
 8000ccc: f140 80c2 bpl.w 8000e54 <DMA1_Channel1_IRQHandler+0x190>
 8000cd0: 685a      ldr r2, [r3, #4]
 8000cd2: 4871      ldr r0, [pc, #452] ; (8000e98 <DMA1_Channel1_IRQHandler+0x1d4>)
 8000cd4: 4971      ldr r1, [pc, #452] ; (8000e9c <DMA1_Channel1_IRQHandler+0x1d8>)




Without main():
Code: [Select]
08000c38 <DMA1_Channel1_IRQHandler>:
 8000c38: 4b73      ldr r3, [pc, #460] ; (8000e08 <DMA1_Channel1_IRQHandler+0x1d0>)
 8000c3a: 681a      ldr r2, [r3, #0]
 8000c3c: 0792      lsls r2, r2, #30
 8000c3e: b510      push {r4, lr}
 8000c40: f140 80c2 bpl.w 8000dc8 <DMA1_Channel1_IRQHandler+0x190>
 8000c44: 685a      ldr r2, [r3, #4]
 8000c46: 4871      ldr r0, [pc, #452] ; (8000e0c <DMA1_Channel1_IRQHandler+0x1d4>)
 8000c48: 4971      ldr r1, [pc, #452] ; (8000e10 <DMA1_Channel1_IRQHandler+0x1d8>)


Prefetch, I-Cache and D-Cache are enabled:

Code: [Select]
FLASH->ACR |= FLASH_ACR_PRFTEN;
FLASH->ACR |= FLASH_ACR_ICEN;
FLASH->ACR |= FLASH_ACR_DCEN; 


[EDIT]: Compiler is set to -Os
[EDIT]: @ataradov: So can you say something about the alignment by looking at the above assembler code listings?
« Last Edit: November 11, 2022, 10:38:38 am by lordnoxx »
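
One thing the listings above do allow is a quick alignment check of the handler's start address in each build (0x08000cc4 with main(), 0x08000c38 without). The sketch below computes the offset of an address within a flash read line; offset_in_line is an illustrative name, and the line width is an assumption (8 bytes for a 64-bit flash read, 16 bytes for a 128-bit one) that has to be checked against the reference manual.

```c
#include <stdint.h>

/* Byte offset of an address within a flash read line.
 * line_bytes must be a power of two (assumed 8 or 16 here). */
static uint32_t offset_in_line(uint32_t addr, uint32_t line_bytes)
{
    return addr & (line_bytes - 1U);
}
```

For 8-byte lines the handler starts at offset 4 with main() and at offset 0 without it; for 16-byte lines the offsets are 4 and 8. Under either assumption the alignment of the handler differs between the two builds, even though the instruction bytes are the same.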
 

Offline lordnoxx (Topic starter)

  • Contributor
  • Posts: 22
  • Country: de
When I set the compiler to -Ofast I get this result:

With main():
Code: [Select]
08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4: 4b80      ldr r3, [pc, #512] ; (8000ec8 <DMA1_Channel1_IRQHandler+0x204>)
 8000cc6: 681a      ldr r2, [r3, #0]
 8000cc8: 0792      lsls r2, r2, #30
 8000cca: f140 80cc bpl.w 8000e66 <DMA1_Channel1_IRQHandler+0x1a2>
 8000cce: 685a      ldr r2, [r3, #4]
 8000cd0: 497e      ldr r1, [pc, #504] ; (8000ecc <DMA1_Channel1_IRQHandler+0x208>)
 8000cd2: 487f      ldr r0, [pc, #508] ; (8000ed0 <DMA1_Channel1_IRQHandler+0x20c>)

Execution time is 1.234us



Without main():
Code: [Select]
08000c38 <DMA1_Channel1_IRQHandler>:
 8000c38: 4b80      ldr r3, [pc, #512] ; (8000e3c <DMA1_Channel1_IRQHandler+0x204>)
 8000c3a: 681a      ldr r2, [r3, #0]
 8000c3c: 0792      lsls r2, r2, #30
 8000c3e: f140 80cc bpl.w 8000dda <DMA1_Channel1_IRQHandler+0x1a2>
 8000c42: 685a      ldr r2, [r3, #4]
 8000c44: 497e      ldr r1, [pc, #504] ; (8000e40 <DMA1_Channel1_IRQHandler+0x208>)
 8000c46: 487f      ldr r0, [pc, #508] ; (8000e44 <DMA1_Channel1_IRQHandler+0x20c>)

Execution time is 1.180us




So two things are interesting here:
1. Compiling with main() and -Ofast results in an ISR execution time as fast as compiling with -Os and without main()
2. With -Ofast, removing main() changes the execution time far less: 1.234us --> 1.180us
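
The spread between the builds with and without main() can be expressed as a relative increase over the faster build, using the figures reported in this thread. The helper below is a host-side sketch and percent_increase is an illustrative name.

```c
/* Relative increase of slow_us over fast_us, in percent. */
static double percent_increase(double fast_us, double slow_us)
{
    return 100.0 * (slow_us - fast_us) / fast_us;
}
```

With -Os the ISR gets about 38.5% slower when main() is present (1.22us to 1.69us). With -Ofast the increase is only about 4.6% (1.180us to 1.234us).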


 

Offline lordnoxx (Topic starter)

  • Contributor
  • Posts: 22
  • Country: de
Next I experimented with placing the ISR in CCM SRAM according to ST's AN4296.

ISR function prototype:
Code: [Select]
void main_DCDC_control_loop_interrupt(void) __attribute__((section (".ccmram")));


The MAP file confirms that the ISR code is placed in CCMRAM and copied there at startup by the startup code.
Code: [Select]
.ccmram         0x0000000010000000      0x290 load address 0x00000000080001d8
                       0x0000000010000000                . = ALIGN (0x4)
                       0x0000000010000000                _sccmram = .
 *(.ccmram)
 .ccmram        0x0000000010000000      0x290 ./Core/Src/ISR_Handlers.o
                       0x0000000010000000                DMA1_Channel1_IRQHandler
 *(.ccmram*)
                       0x0000000010000290                . = ALIGN (0x4)
                       0x0000000010000290                _eccmram = .


These are the results.

-> with main() and -Os: Execution time is 1.508us

-> without main() and -Os: Execution time is 1.430us

-> without main() and -Ofast: Execution time is 1.354us

Hmmm...I did not expect that outcome. I really expected a much bigger reduction in execution time compared to when the ISR code is located in flash memory. Also veeeerryy interesting: with the code in CCM RAM and optimization for speed (-Ofast), the execution time gets down to "only" 1.354us, which is still slower than having the ISR code in flash and optimized with -Ofast (1.180us). I mean...what the f**k?!?!? :-//


« Last Edit: November 11, 2022, 11:21:09 am by lordnoxx »
 

Offline voltsandjolts

  • Supporter
  • ****
  • Posts: 2309
  • Country: gb
Maybe try using an internal timer for calculating ISR execution time.
Pin toggling so fast might be causing issues, but that's just a hunch, I'm not familiar with this mcu and APB setup.
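
A minimal sketch of the timer-based approach, assuming the Cortex-M4 cycle counter (DWT->CYCCNT in CMSIS) is sampled at the start and end of the ISR. Enabling the counter on the target (TRCENA in CoreDebug->DEMCR, CYCCNTENA in DWT->CTRL) is not shown; the host-side helpers below only cover the arithmetic, and elapsed_cycles and cycles_to_ns are illustrative names.

```c
#include <stdint.h>

/* Cycles between two samples of a free-running 32-bit counter.
 * Unsigned subtraction wraps modulo 2^32, so one counter overflow
 * between the samples still gives the right result. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to nanoseconds, rounded to nearest. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000ULL + core_hz / 2U) / core_hz);
}
```

At 168MHz, 205 cycles correspond to about 1220ns. Reading the counter also takes a few cycles, so the result carries a small constant offset, just like the pin toggling does.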
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3148
  • Country: ca
Hmmm...I did not expect that outcome. I really expected a much bigger reduction in execution time compared to when the ISR code is located in flash memory. Also veeeerryy interesting: with the code in CCM RAM and optimization for speed (-Ofast), the execution time gets down to "only" 1.354us, which is still slower than having the ISR code in flash and optimized with -Ofast (1.180us). I mean...what the f**k?!?!? :-//

Usually MCUs expect execution from flash, hence their bus structure is tailored for such execution. When executing from RAM, command fetches and data access compete for the same bus, therefore you get bus congestion and slower execution.

There's no point in looking at few lines of assembler, post the whole ISR including prologue and epilogue.
 

Offline lordnoxx (Topic starter)

  • Contributor
  • Posts: 22
  • Country: de
This is all I can get from the .list file

Code: [Select]
08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4: 4b81      ldr r3, [pc, #516] ; (8000ecc <DMA1_Channel1_IRQHandler+0x208>)
 8000cc6: 681a      ldr r2, [r3, #0]
 8000cc8: 0792      lsls r2, r2, #30
 8000cca: b5f0      push {r4, r5, r6, r7, lr}
 8000ccc: f140 80fc bpl.w 8000ec8 <DMA1_Channel1_IRQHandler+0x204>
 8000cd0: 685a      ldr r2, [r3, #4]
 8000cd2: 4c7f      ldr r4, [pc, #508] ; (8000ed0 <DMA1_Channel1_IRQHandler+0x20c>)
 8000cd4: 497f      ldr r1, [pc, #508] ; (8000ed4 <DMA1_Channel1_IRQHandler+0x210>)
 8000cd6: 4880      ldr r0, [pc, #512] ; (8000ed8 <DMA1_Channel1_IRQHandler+0x214>)
 8000cd8: 4f80      ldr r7, [pc, #512] ; (8000edc <DMA1_Channel1_IRQHandler+0x218>)
 8000cda: f042 0202 orr.w r2, r2, #2
 8000cde: 605a      str r2, [r3, #4]
 8000ce0: 4a7f      ldr r2, [pc, #508] ; (8000ee0 <DMA1_Channel1_IRQHandler+0x21c>)
 8000ce2: 6993      ldr r3, [r2, #24]
 8000ce4: f443 4380 orr.w r3, r3, #16384 ; 0x4000
 8000ce8: 6193      str r3, [r2, #24]
 8000cea: 4b7e      ldr r3, [pc, #504] ; (8000ee4 <DMA1_Channel1_IRQHandler+0x220>)
 8000cec: 681b      ldr r3, [r3, #0]
 8000cee: 8822      ldrh r2, [r4, #0]
 8000cf0: b292      uxth r2, r2
 8000cf2: 1a9b      subs r3, r3, r2
 8000cf4: 600b      str r3, [r1, #0]
 8000cf6: 4b7c      ldr r3, [pc, #496] ; (8000ee8 <DMA1_Channel1_IRQHandler+0x224>)
 8000cf8: 681b      ldr r3, [r3, #0]
 8000cfa: 680a      ldr r2, [r1, #0]
 8000cfc: 4353      muls r3, r2
 8000cfe: ee07 3a90 vmov s15, r3
 8000d02: eef8 7ae7 vcvt.f32.s32 s15, s15
 8000d06: 4b79      ldr r3, [pc, #484] ; (8000eec <DMA1_Channel1_IRQHandler+0x228>)
 8000d08: edc0 7a00 vstr s15, [r0]
 8000d0c: 681b      ldr r3, [r3, #0]
 8000d0e: 680d      ldr r5, [r1, #0]
 8000d10: 4a77      ldr r2, [pc, #476] ; (8000ef0 <DMA1_Channel1_IRQHandler+0x22c>)
 8000d12: 436b      muls r3, r5
 8000d14: ee07 3a90 vmov s15, r3
 8000d18: ed92 7a00 vldr s14, [r2]
 8000d1c: 4b75      ldr r3, [pc, #468] ; (8000ef4 <DMA1_Channel1_IRQHandler+0x230>)
 8000d1e: eef8 7ae7 vcvt.f32.s32 s15, s15
 8000d22: 2500      movs r5, #0
 8000d24: ee77 7a87 vadd.f32 s15, s15, s14
 8000d28: edc3 7a00 vstr s15, [r3]
 8000d2c: ed93 7a00 vldr s14, [r3]
 8000d30: eddf 7a71 vldr s15, [pc, #452] ; 8000ef8 <DMA1_Channel1_IRQHandler+0x234>
 8000d34: eeb4 7ae7 vcmpe.f32 s14, s15
 8000d38: eef1 fa10 vmrs APSR_nzcv, fpscr
 8000d3c: bfc8      it gt
 8000d3e: edc3 7a00 vstrgt s15, [r3]
 8000d42: edd3 7a00 vldr s15, [r3]
 8000d46: eef5 7ac0 vcmpe.f32 s15, #0.0
 8000d4a: eef1 fa10 vmrs APSR_nzcv, fpscr
 8000d4e: bf48      it mi
 8000d50: 601d      strmi r5, [r3, #0]
 8000d52: 681e      ldr r6, [r3, #0]
 8000d54: 6016      str r6, [r2, #0]
 8000d56: 4e69      ldr r6, [pc, #420] ; (8000efc <DMA1_Channel1_IRQHandler+0x238>)
 8000d58: 680a      ldr r2, [r1, #0]
 8000d5a: f8d6 c000 ldr.w ip, [r6]
 8000d5e: 683f      ldr r7, [r7, #0]
 8000d60: eba2 020c sub.w r2, r2, ip
 8000d64: 437a      muls r2, r7
 8000d66: ee07 2a90 vmov s15, r2
 8000d6a: 4a65      ldr r2, [pc, #404] ; (8000f00 <DMA1_Channel1_IRQHandler+0x23c>)
 8000d6c: eef8 7ae7 vcvt.f32.s32 s15, s15
 8000d70: edc2 7a00 vstr s15, [r2]
 8000d74: 6809      ldr r1, [r1, #0]
 8000d76: 6031      str r1, [r6, #0]
 8000d78: edd0 7a00 vldr s15, [r0]
 8000d7c: edd3 6a00 vldr s13, [r3]
 8000d80: ed92 7a00 vldr s14, [r2]
 8000d84: 4b5f      ldr r3, [pc, #380] ; (8000f04 <DMA1_Channel1_IRQHandler+0x240>)
 8000d86: 4860      ldr r0, [pc, #384] ; (8000f08 <DMA1_Channel1_IRQHandler+0x244>)
 8000d88: 4960      ldr r1, [pc, #384] ; (8000f0c <DMA1_Channel1_IRQHandler+0x248>)
 8000d8a: ee77 7aa6 vadd.f32 s15, s15, s13
 8000d8e: ee77 7a87 vadd.f32 s15, s15, s14
 8000d92: eefd 7ae7 vcvt.s32.f32 s15, s15
 8000d96: edc3 7a00 vstr s15, [r3]
 8000d9a: 681a      ldr r2, [r3, #0]
 8000d9c: 2a00      cmp r2, #0
 8000d9e: bfbc      itt lt
 8000da0: 2200      movlt r2, #0
 8000da2: 601a      strlt r2, [r3, #0]
 8000da4: 681b      ldr r3, [r3, #0]
 8000da6: 6003      str r3, [r0, #0]
 8000da8: 6803      ldr r3, [r0, #0]
 8000daa: 4a59      ldr r2, [pc, #356] ; (8000f10 <DMA1_Channel1_IRQHandler+0x24c>)
 8000dac: f5b3 6f7a cmp.w r3, #4000 ; 0xfa0
 8000db0: bfc4      itt gt
 8000db2: f44f 637a movgt.w r3, #4000 ; 0xfa0
 8000db6: 6003      strgt r3, [r0, #0]
 8000db8: 8863      ldrh r3, [r4, #2]
 8000dba: 6812      ldr r2, [r2, #0]
 8000dbc: 4c55      ldr r4, [pc, #340] ; (8000f14 <DMA1_Channel1_IRQHandler+0x250>)
 8000dbe: b29b      uxth r3, r3
 8000dc0: 1a9b      subs r3, r3, r2
 8000dc2: 600b      str r3, [r1, #0]
 8000dc4: 4b54      ldr r3, [pc, #336] ; (8000f18 <DMA1_Channel1_IRQHandler+0x254>)
 8000dc6: 681b      ldr r3, [r3, #0]
 8000dc8: 680a      ldr r2, [r1, #0]
 8000dca: 4353      muls r3, r2
 8000dcc: ee07 3a90 vmov s15, r3
 8000dd0: eef8 7ae7 vcvt.f32.s32 s15, s15
 8000dd4: 4b51      ldr r3, [pc, #324] ; (8000f1c <DMA1_Channel1_IRQHandler+0x258>)
 8000dd6: edc4 7a00 vstr s15, [r4]
 8000dda: 681b      ldr r3, [r3, #0]
 8000ddc: 680e      ldr r6, [r1, #0]
 8000dde: 4a50      ldr r2, [pc, #320] ; (8000f20 <DMA1_Channel1_IRQHandler+0x25c>)
 8000de0: 4373      muls r3, r6
 8000de2: ee07 3a90 vmov s15, r3
 8000de6: ed92 7a00 vldr s14, [r2]
 8000dea: 4b4e      ldr r3, [pc, #312] ; (8000f24 <DMA1_Channel1_IRQHandler+0x260>)
 8000dec: 4e4e      ldr r6, [pc, #312] ; (8000f28 <DMA1_Channel1_IRQHandler+0x264>)
 8000dee: eef8 7ae7 vcvt.f32.s32 s15, s15
 8000df2: ee77 7a87 vadd.f32 s15, s15, s14
 8000df6: edc3 7a00 vstr s15, [r3]
 8000dfa: ed93 7a00 vldr s14, [r3]
 8000dfe: eddf 7a4b vldr s15, [pc, #300] ; 8000f2c <DMA1_Channel1_IRQHandler+0x268>
 8000e02: eeb4 7ae7 vcmpe.f32 s14, s15
 8000e06: eef1 fa10 vmrs APSR_nzcv, fpscr
 8000e0a: bfc8      it gt
 8000e0c: edc3 7a00 vstrgt s15, [r3]
 8000e10: edd3 7a00 vldr s15, [r3]
 8000e14: eef5 7ac0 vcmpe.f32 s15, #0.0
 8000e18: eef1 fa10 vmrs APSR_nzcv, fpscr
 8000e1c: bf48      it mi
 8000e1e: 601d      strmi r5, [r3, #0]
 8000e20: 681d      ldr r5, [r3, #0]
 8000e22: 6015      str r5, [r2, #0]
 8000e24: 4d42      ldr r5, [pc, #264] ; (8000f30 <DMA1_Channel1_IRQHandler+0x26c>)
 8000e26: 680a      ldr r2, [r1, #0]
 8000e28: 682f      ldr r7, [r5, #0]
 8000e2a: 6836      ldr r6, [r6, #0]
 8000e2c: 1bd2      subs r2, r2, r7
 8000e2e: 4372      muls r2, r6
 8000e30: ee07 2a90 vmov s15, r2
 8000e34: 4a3f      ldr r2, [pc, #252] ; (8000f34 <DMA1_Channel1_IRQHandler+0x270>)
 8000e36: eef8 7ae7 vcvt.f32.s32 s15, s15
 8000e3a: edc2 7a00 vstr s15, [r2]
 8000e3e: 6809      ldr r1, [r1, #0]
 8000e40: 6029      str r1, [r5, #0]
 8000e42: edd4 7a00 vldr s15, [r4]
 8000e46: edd3 6a00 vldr s13, [r3]
 8000e4a: ed92 7a00 vldr s14, [r2]
 8000e4e: 4b3a      ldr r3, [pc, #232] ; (8000f38 <DMA1_Channel1_IRQHandler+0x274>)
 8000e50: ee77 7aa6 vadd.f32 s15, s15, s13
 8000e54: ee77 7a87 vadd.f32 s15, s15, s14
 8000e58: eefd 7ae7 vcvt.s32.f32 s15, s15
 8000e5c: edc3 7a00 vstr s15, [r3]
 8000e60: 681a      ldr r2, [r3, #0]
 8000e62: 2a00      cmp r2, #0
 8000e64: bfbc      itt lt
 8000e66: 2200      movlt r2, #0
 8000e68: 601a      strlt r2, [r3, #0]
 8000e6a: 681a      ldr r2, [r3, #0]
 8000e6c: f44f 737a mov.w r3, #1000 ; 0x3e8
 8000e70: fb92 f2f3 sdiv r2, r2, r3
 8000e74: 4b31      ldr r3, [pc, #196] ; (8000f3c <DMA1_Channel1_IRQHandler+0x278>)
 8000e76: 601a      str r2, [r3, #0]
 8000e78: 681a      ldr r2, [r3, #0]
 8000e7a: f5b2 6f7a cmp.w r2, #4000 ; 0xfa0
 8000e7e: bfc4      itt gt
 8000e80: f44f 627a movgt.w r2, #4000 ; 0xfa0
 8000e84: 601a      strgt r2, [r3, #0]
 8000e86: 6802      ldr r2, [r0, #0]
 8000e88: 681b      ldr r3, [r3, #0]
 8000e8a: 1ad2      subs r2, r2, r3
 8000e8c: 4b2c      ldr r3, [pc, #176] ; (8000f40 <DMA1_Channel1_IRQHandler+0x27c>)
 8000e8e: 601a      str r2, [r3, #0]
 8000e90: 681a      ldr r2, [r3, #0]
 8000e92: f5b2 6f7a cmp.w r2, #4000 ; 0xfa0
 8000e96: bfc4      itt gt
 8000e98: f44f 627a movgt.w r2, #4000 ; 0xfa0
 8000e9c: 601a      strgt r2, [r3, #0]
 8000e9e: 6819      ldr r1, [r3, #0]
 8000ea0: f240 52db movw r2, #1499 ; 0x5db
 8000ea4: 4291      cmp r1, r2
 8000ea6: bfdc      itt le
 8000ea8: f240 52dc movwle r2, #1500 ; 0x5dc
 8000eac: 601a      strle r2, [r3, #0]
 8000eae: 4a25      ldr r2, [pc, #148] ; (8000f44 <DMA1_Channel1_IRQHandler+0x280>)
 8000eb0: 6812      ldr r2, [r2, #0]
 8000eb2: 681b      ldr r3, [r3, #0]
 8000eb4: ea43 5302 orr.w r3, r3, r2, lsl #20
 8000eb8: 4a23      ldr r2, [pc, #140] ; (8000f48 <DMA1_Channel1_IRQHandler+0x284>)
 8000eba: 65d3      str r3, [r2, #92] ; 0x5c
 8000ebc: f102 4278 add.w r2, r2, #4160749568 ; 0xf8000000
 8000ec0: 6993      ldr r3, [r2, #24]
 8000ec2: f043 4380 orr.w r3, r3, #1073741824 ; 0x40000000
 8000ec6: 6193      str r3, [r2, #24]
 8000ec8: bdf0      pop {r4, r5, r6, r7, pc}
 8000eca: bf00      nop
 8000ecc: 40020000 .word 0x40020000
 8000ed0: 20001558 .word 0x20001558
 8000ed4: 20001524 .word 0x20001524
 8000ed8: 2000155c .word 0x2000155c
 8000edc: 200014f8 .word 0x200014f8
 8000ee0: 48000800 .word 0x48000800
 8000ee4: 20001560 .word 0x20001560
 8000ee8: 200000cc .word 0x200000cc
 8000eec: 200000c4 .word 0x200000c4
 8000ef0: 20001554 .word 0x20001554
 8000ef4: 20001550 .word 0x20001550
 8000ef8: 45802000 .word 0x45802000
 8000efc: 2000151c .word 0x2000151c
 8000f00: 2000154c .word 0x2000154c
 8000f04: 20001508 .word 0x20001508
 8000f08: 2000150c .word 0x2000150c
 8000f0c: 20001520 .word 0x20001520
 8000f10: 20001548 .word 0x20001548
 8000f14: 20001544 .word 0x20001544
 8000f18: 200000c8 .word 0x200000c8
 8000f1c: 200000c0 .word 0x200000c0
 8000f20: 20001540 .word 0x20001540
 8000f24: 2000153c .word 0x2000153c
 8000f28: 200014f4 .word 0x200014f4
 8000f2c: 4a7a3e80 .word 0x4a7a3e80
 8000f30: 20001518 .word 0x20001518
 8000f34: 20001538 .word 0x20001538
 8000f38: 20001500 .word 0x20001500
 8000f3c: 20001504 .word 0x20001504
 8000f40: 200014fc .word 0x200014fc
 8000f44: 200000d0 .word 0x200000d0
 8000f48: 50000800 .word 0x50000800
 

Offline errorprone

  • Contributor
  • Posts: 39
Not sure if you’ve done this but the vector table also needs to be moved to CCMRAM otherwise the function pointer to the ISR is still read from flash.
 

Offline lordnoxx (Topic starter)

  • Contributor
  • Posts: 22
  • Country: de
Not sure if you’ve done this but the vector table also needs to be moved to CCMRAM otherwise the function pointer to the ISR is still read from flash.

Good point. No, the vector table still resides in flash memory.
But nonetheless, once execution of the ISR has started, the time to walk through the ISR instructions should not depend on the memory the vector table sits in. Right?
I am not concerned about the latency, i.e. the time from the event to the start of execution. I am wondering why the execution time, from start to end of the ISR, is not consistent when other code in the project changes or is added or removed. And while investigating that I found other suspicious behavior that I don't understand, e.g. execution from flash with -Ofast is faster than execution from CCM RAM with -Ofast.

And I recently found another suspicious behavior ---> apparently the execution time also depends on which part of the code is running when the ISR is called. I am working on a simple example to show you that here in the thread.
« Last Edit: November 11, 2022, 03:17:39 pm by lordnoxx »
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8193
  • Country: fi
Usually MCUs expect execution from flash, hence their bus structure is tailored for such execution. When executing from RAM, command fetches and data access compete for the same bus, therefore you get bus congestion and slower execution.

But the CCM / ITCM exists for this reason, it should have an interface of its own, directly to CPU.

Given fairly linear code with few loops, flash with prefetch can be as fast as CCM RAM, though, so clearly that did not help the OP.

With main():
...
Without main():

The asm snippets posted look identical, but did I miss it or did you post the whole function in both cases (with and without main() differences, same optimization level)? Are they exactly the same or not? That is literally the first thing to check.
« Last Edit: November 11, 2022, 03:56:53 pm by Siwastaja »
 

