Author Topic: STM32G431 - How can it be that the amount of lines of code in main() affect ISR  (Read 7546 times)

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
Hi folks,

I am working on a project based on an STM32G431. While developing the firmware for it I encountered a strange phenomenon that should not be possible. After fiddling around with it and doing my own debugging and testing to get a better understanding of what is going on, I had to partly give up, as my knowledge of the deeper STM32 secrets is limited. I am a power electronics engineer and embedded firmware development is not my daily profession. Although I have quite some experience with coding and microcontrollers (8051, PIC, ATmega, STM32), I have come to a point where I need help to find the root cause of the issue.

So I went to the ST Community Forum and asked for help there: https://community.st.com/s/question/0D73W000001nmmASAQ/detail?fromEmail=1&s1oid=00Db0000000YtG6&s1nid=0DB0X000000DYbd&s1uid=0053W000001UI26&s1ext=0&emkind=chatterCommentNotification&emtm=1667370141288

I am still full of hope and confidence that the people there at ST can help me. But I feel I have to reach a bigger crowd of skilled engineers to dig into this. That is why I want to reach out to you here at eevblog too and want to ask kindly for additional help.


STM32G431 - How can it be that the amount of lines of code in main() affect the time it takes to execute an interrupt routine?

This question is all about the behavior of the DMA1_Channel1_IRQHandler in my application.
I am working with an STM32G431CBT. It is running at 168MHz. Configured with 4 Flash wait states.
NVIC interrupt priorities are setup using CMSIS functions as follows:

Code: [Select]
SystemCoreClockUpdate();
clock = SystemCoreClock;
SysTick_Config(clock / 1000U); // 1 ms tick; integer division avoids a conversion through double
 
NVIC_SetPriorityGrouping(0);
 
NVIC_DisableIRQ(SysTick_IRQn);
irq_prio = NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 10, 0);
NVIC_SetPriority(SysTick_IRQn, irq_prio);
NVIC_EnableIRQ(SysTick_IRQn);
 
irq_prio = NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 13, 0);
NVIC_SetPriority(USART2_IRQn, irq_prio);
NVIC_EnableIRQ(USART2_IRQn);
 
irq_prio = NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 0, 0); //highest priority!!!
NVIC_SetPriority(DMA1_Channel1_IRQn, irq_prio);
NVIC_EnableIRQ(DMA1_Channel1_IRQn);

A TIM1->CCR5 match triggers the ADC conversion sequence. The ADC triggers DMA1_Channel1 (circular, peripheral-to-memory). DMA1 is set up to issue an interrupt on transfer complete (TCIE).
The DMA1_Channel1_IRQHandler (which has the highest priority) then fires, and a few things are processed and calculated within it.
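
As a sanity check on the priority setup above, the encoding can be reproduced on a PC. This is only a sketch: `encode_priority` is a made-up helper that follows the logic of the CMSIS `NVIC_EncodePriority` function, and `PRIO_BITS` is an assumption (4 implemented priority bits, which should be verified against the device header).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: number of implemented priority bits (__NVIC_PRIO_BITS). */
#define PRIO_BITS 4U

/* Hypothetical host-side helper that follows the CMSIS NVIC_EncodePriority logic. */
static uint32_t encode_priority(uint32_t group, uint32_t preempt, uint32_t sub)
{
    assert(group <= 7U);

    uint32_t g = group & 7U;
    /* Bits available for preemption priority and for subpriority. */
    uint32_t preempt_bits = ((7U - g) > PRIO_BITS) ? PRIO_BITS : (7U - g);
    uint32_t sub_bits = ((g + PRIO_BITS) < 7U) ? 0U : ((g - 7U) + PRIO_BITS);

    return ((preempt & ((1U << preempt_bits) - 1U)) << sub_bits)
         | (sub & ((1U << sub_bits) - 1U));
}
```

Under these assumptions, grouping 0 leaves all four bits for preemption priority, so the values 10, 13 and 0 pass through unchanged and 0 (the DMA handler) is the most urgent of the three.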

My code compiles with no errors and warnings. And as far as I can see it all works as intended.
But here comes the strange thing:

Right at the start of DMA1_Channel1_IRQHandler I set a GPIO Pin high. At the End of DMA1_Channel1_IRQHandler I set the same Pin low. Looking at the generated high-low pulse with an oscilloscope I can measure the execution time of the DMA1_Channel1_IRQHandler. It takes 1.69µs. But if I shrink the code in the main() by a few lines of code I can see that the execution time gets lower!!! 😮 And if I put just an empty while(1) loop in the main() execution time gets even lower....to its lowest-->1.22µs !!!!😮

How can that be? An ISR should not be affected in such a way by the amount of code in a low priority function just like main().
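
For scale, the two pulse widths can be converted into core clock cycles. A minimal sketch, assuming the 168 MHz core clock stated above; the helper name `ns_to_cycles` is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a pulse width in nanoseconds to core clock cycles,
 * rounded to the nearest cycle. The clock is given in MHz. */
static uint32_t ns_to_cycles(uint32_t width_ns, uint32_t core_mhz)
{
    assert(core_mhz > 0U);
    /* cycles = ns * MHz / 1000; the +500 rounds to nearest. */
    return (uint32_t)(((uint64_t)width_ns * core_mhz + 500U) / 1000U);
}
```

With these numbers, 1690 ns corresponds to about 284 cycles and 1220 ns to about 205 cycles, so the difference under discussion is roughly 79 core cycles per ISR call.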

The ISR looks like this:

Code: [Select]
void DMA1_Channel1_IRQHandler(void)
{
    if(DMA1->ISR & DMA_ISR_TCIF1)  //check we are here cause of a valid interrupt occurred
    {
        DMA1->IFCR |= DMA_IFCR_CTCIF1; //reset interrupt flag

        GPIOC->ODR |= (1U << 14); //Set Pin high

        ev = (v - my_array[0]);
        vp  = A * ev;
        vi  = B * vi1;
        if(vi > 4100) vi=4100;
        if(vi < 0) vi=0;
        vi1 = vi;
        vd  = C * (ev - (e1v+2));
        e1v = ev;
        cv = vp + vi + vd;
        if(cv < 0) cv=0;
        cv_t = cv/2;
        if(cv_t >  4000 ) cv_t=4000;
        /*------------------------------------------------------*/
        ei = (my_array[1] - i);
        ip  = D * ei;
        ii  = E * ii1;
        if(ii > 4100000 ) ii=4100000;
        if(ii < 0 ) ii=0;
        ii1 = ii;
        id  = F * (ei + (e1i-2));
        e1i = ei;
        ci = ip + ii + id;
        if(ci < 0) ci=0;
        ci_t = ci/1000;
        if(ci_t >  4000 ) ci_t=4000;

        cdac_t = (cv_t - ci_t);

        if(cdac_t > 4000) cdac_t=4000;
        if(cdac_t < 1500) cdac_t=1500;

        GPIOC->ODR &= ~(1U << 14); //Set Pin low
    }
}

All the variables are global. Some of them are "int" and others are "float". All variables used in the ISR are declared volatile. I am using the FPU.
The timing shrinks down to 1.22µs for every call to the handler. I verified that with the pulse-width trigger function of the oscilloscope: no pulses longer than that occur.



This is what I have done so far to debug this:

Put the interrupt in a different C file than main() --> Did not make a difference

Using the BSRR register instead of ODR for setting the pin high/low ---> Did not make a difference

Next I tried moving the "GPIOC->BSRR |= (1U << (14+16)); //Set Pin low" upwards in the ISR code to see whether there is a point from which the ISR execution time no longer depends on the amount of code in main(). Of course the measured time gets smaller as set-high and set-low move closer together. But I have to conclude that such a point does not exist, apart from placing set pin high and set pin low right next to each other:

Code: [Select]
GPIOC->BSRR |= (1U << (14)); //Set Pin high
GPIOC->BSRR |= (1U << (14+16)); //Set Pin low

But in general...No matter where I place "//Set Pin low" there is always a dependency on the amount of code in main() regarding ISR execution time.

For testing I also told the compiler not to use the FPU. This resulted in a much longer ISR execution time, but the issue was still present.

For testing I disabled all interrupts (even SysTick) besides DMA1_Channel1_IRQHandler --> Issue still present.

 :scared: But then I found one interesting effect. Telling the compiler to optimize for speed (-Ofast) instead of size (-Os) solved the problem. No matter how much code is in main() the ISR execution time is always 1.2µs.  :scared:

Since setting the optimization level to -Ofast is more of a workaround than a fix for the root cause, I have to ask once again: with all that information and background, does anybody have any idea what is wrong here? A bug in the compiler/linker, or even in the silicon?

 

Can anybody confirm that behavior?


Find out what you cannot do and then go an do it!
 

Offline Alti

  • Frequent Contributor
  • **
  • Posts: 404
  • Country: 00
Configured with 4 Flash wait states.
You are comparing apples and oranges.

The uC has an ARMv7E-M core and this works EXACTLY as specified in the Architecture Reference Manual from ARM, its designer. Precisely down to a single tick, to the finest stage of the pipeline (3-stage, btw). On top of that there is ST (ST has nothing to do with ARM), which tied a 128-bit? flash memory into the buses of that uC (three buses, btw), with a cache (check the ST datasheet for depths). The opcodes are 16-bit or 32-bit and you cannot align them to 128-bit boundaries using ANSI C! If the pipeline needs to be flushed because of some branch (an if) or a cache miss, you get stalls.

Please provide a relevant, minimal sample that proves your point (unexpected behavior?): assembly, alignment to flash row boundaries, cache depth etc. Or maybe you can reproduce this behavior with 0 flash wait states and the cache disabled, even better. Or perhaps start by executing the IRQ from SRAM, which has 0 wait states on most STM32 chips and runs at full throttle.
 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5907
  • Country: es
Not sure if I understood the issue correctly.
The high pulse width should be the same (calling DMA -> ISR response), but adding more code to the main loop will extend the time between the pulses, as it needs to process more code before calling the DMA again.
However, if it's the high pulse that's getting longer, check that no other ISR is running concurrently with the same or higher priority.

Hantek DSO2x1x            Drive        FAQ          DON'T BUY HANTEK! (Aka HALF-MADE)
Stm32 Soldering FW      Forum      Github      Donate
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11257
  • Country: us
    • Personal site
I'm not sure about this MCU specifically, but quite often execution time of the code depends on its alignment when executing from flash. This is because of the way prefetch systems work.

So the first thing to check - does the alignment of the IRQ code change when main() changes?

The solution to this is to enable cache or place the critical code in SRAM/TCM.
Alex
 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5907
  • Country: es
The cache (ART Accelerator) is ST's golden-egg laying hen, the only reason they can work so fast, very close to 0 wait states.
If not using the cache... get a z80! Performance will be the same ;)

The instant you get an ISR, it needs to fetch the code, which will probably cause a cache miss and require a few clocks to fetch the data.
But the first read also fetches the next instructions, so how would the alignment hit so hard as to cause an almost 40% increase?
A few clocks, yeah, but +500ns seems absolutely crazy.
Unless (like always) he didn't mention something important  :popcorn:
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11257
  • Country: us
    • Personal site
That close to 0 wait states thing is mostly marketing and only works in some cases.

The code here has no explicit loops, which is what mostly gets affected by alignment.  But if some of the variables are float, then low level routines might have loops in them.

The easiest way to check this is to place ISR into SRAM and check the performance. I predict it would be consistent and faster than any previous results.
Alex
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 390
  • Country: be
But if I shrink the code in the main() by a few lines of code I can see that the execution time gets lower!!! 😮 And if I put just an empty while(1) loop in the main() execution time gets even lower....to its lowest-->1.22µs !!!!😮

You did not say what these lines are, nor did you show the variable declarations. To give any definitive answer, it would be nice to see the assembly code of the interrupt handler in two cases: 1.69µs with those few lines in main(), and 1.22µs without.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
I guess the compiler optimizes some code away. For example, if cdac_t is not used anywhere and is not declared as volatile, there's no reason to keep the code which calculates it. If there's any code in "main" which uses cdac_t, then the code calculating cdac_t must be kept in place and its execution will take some time.
 

Offline peter-h

  • Super Contributor
  • ***
  • Posts: 3697
  • Country: gb
  • Doing electronics since the 1960s...
I see you have used CubeMX to generate all this code.

I "inherited" such a project 2-3 years ago and found the startup code doing weird stuff, e.g. initialising the clocks and the PLL and then calling a function which did all that all over again. Notably one of these was at the very end of the .s assembler startup code.

So go through the startup carefully.

Also remember that VTOR needs to be 512-aligned.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11257
  • Country: us
    • Personal site
None of this is related to the original question in any way.

The original issue is consistent with alignment issues.
Alex
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
However if it's the high pulse what's getting larger, check there're no other ISR running concurrently with the same or higher priority.

That's exactly the phenomenon. And if you check my initial post --> I already tried to disable all other interrupts. But the issue still exists.
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
I'm not sure about this MCU specifically, but quite often execution time of the code depends on its alignment when executing from flash. This is because of the way prefetch systems work.

So the first thing to check - does the alignment of the IRQ code change when main() changes?

The solution to this is to enable cache or place the critical code in SRAM/TCM.

Thanks for that hint. I will check that and report back.  :) :-+
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
But if I shrink the code in the main() by a few lines of code I can see that the execution time gets lower!!! 😮 And if I put just an empty while(1) loop in the main() execution time gets even lower....to its lowest-->1.22µs !!!!😮

You did not say what these lines are, nor did you show the variable declarations. To give any definitive answer, it would be nice to see the assembly code of the interrupt handler in two cases: 1.69µs with those few lines in main(), and 1.22µs without.

I will do my best to provide more information and code snippets. Unfortunately, parts of the involved code are the intellectual property of my customer, so I am not allowed to post them here. But the ISR C code and assembly code should be no problem. I will provide that.
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
I guess the compiler optimizes some code away. For example, if cdac_t is not used anywhere and is not declared as volatile, there's no reason to keep the code which calculates it. If there's any code in "main" which uses cdac_t, then the code calculating cdac_t must be kept in place and its execution will take some time.

cdac_t is volatile and the code that uses it is not removed from main(). So the optimizer does other stuff to the code that speeds up the execution and makes it consistent. It has nothing to do with unused variables/code.
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
I see you have used CubeMX to generate all this code.

I "inherited" such a project 2-3 years ago and found the startup code doing weird stuff, e.g. initialising the clocks and the PLL and then calling a function which did all that all over again. Notably one of these was at the very end of the .s assembler startup code.

So go through the startup carefully.

Also remember that VTOR needs to be 512-aligned.

And how is the startup code related to ISR execution time?
How can I check the VTOR alignment?
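
One way to answer the alignment question on paper: on ARMv7-M the vector table must be aligned to its own size rounded up to a power of two, with a minimum of 32 words. The sketch below applies that rule on a PC. The helper names are made up, and the interrupt count passed in (102 is used in the note below) is an assumption that should be checked against the reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* Required vector table alignment in bytes for a table holding
 * 16 system entries plus num_irqs device interrupts (hypothetical helper). */
static uint32_t vtor_alignment_bytes(uint32_t num_irqs)
{
    assert(num_irqs <= 496U);        /* architectural maximum on ARMv7-M */

    uint32_t entries = 16U + num_irqs;
    uint32_t words = 32U;            /* architectural minimum: 32 words */
    while (words < entries) {
        words <<= 1U;                /* round up to the next power of two */
    }
    return words * 4U;               /* each entry is one 32-bit word */
}

/* Returns 1 if the given table address satisfies that alignment. */
static int vtor_is_aligned(uint32_t vtor, uint32_t num_irqs)
{
    return (vtor & (vtor_alignment_bytes(num_irqs) - 1U)) == 0U;
}
```

Assuming 102 interrupts, this gives 512 bytes, and a table at the start of flash (0x08000000) passes the check. On the target, the address to test would be the value read from SCB->VTOR.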
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11257
  • Country: us
    • Personal site
How can I check the VTOR alignment?
Don't bother, nothing would work at all if it was not aligned correctly.
Alex
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
I'm not sure about this MCU specifically, but quite often execution time of the code depends on its alignment when executing from flash. This is because of the way prefetch systems work.

So the first thing to check - does the alignment of the IRQ code change when main() changes?

The solution to this is to enable cache or place the critical code in SRAM/TCM.


Here are two assembler snippets from the ISR with and without main().

With main():
Code: [Select]
08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4: 4b73      ldr r3, [pc, #460] ; (8000e94 <DMA1_Channel1_IRQHandler+0x1d0>)
 8000cc6: 681a      ldr r2, [r3, #0]
 8000cc8: 0792      lsls r2, r2, #30
 8000cca: b510      push {r4, lr}
 8000ccc: f140 80c2 bpl.w 8000e54 <DMA1_Channel1_IRQHandler+0x190>
 8000cd0: 685a      ldr r2, [r3, #4]
 8000cd2: 4871      ldr r0, [pc, #452] ; (8000e98 <DMA1_Channel1_IRQHandler+0x1d4>)
 8000cd4: 4971      ldr r1, [pc, #452] ; (8000e9c <DMA1_Channel1_IRQHandler+0x1d8>)




Without main()
Code: [Select]
08000c38 <DMA1_Channel1_IRQHandler>:
 8000c38: 4b73      ldr r3, [pc, #460] ; (8000e08 <DMA1_Channel1_IRQHandler+0x1d0>)
 8000c3a: 681a      ldr r2, [r3, #0]
 8000c3c: 0792      lsls r2, r2, #30
 8000c3e: b510      push {r4, lr}
 8000c40: f140 80c2 bpl.w 8000dc8 <DMA1_Channel1_IRQHandler+0x190>
 8000c44: 685a      ldr r2, [r3, #4]
 8000c46: 4871      ldr r0, [pc, #452] ; (8000e0c <DMA1_Channel1_IRQHandler+0x1d4>)
 8000c48: 4971      ldr r1, [pc, #452] ; (8000e10 <DMA1_Channel1_IRQHandler+0x1d8>)


Prefetch, I-Cache and D-Cache are enabled:

Code: [Select]
FLASH->ACR |= FLASH_ACR_PRFTEN;
FLASH->ACR |= FLASH_ACR_ICEN;
FLASH->ACR |= FLASH_ACR_DCEN; 


[EDIT]: Compiler is set to -Os
[EDIT]: @ataradov: So can you say something about the alignment by looking at the above assembler code listings?
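
The alignment question can be checked directly from the two handler start addresses in the listings above. A minimal sketch: `offset_in_line` is a made-up helper, and the flash read width (8 or 16 bytes are tried in the note below) is an assumption to be confirmed in the reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of addr inside a flash read line of line_bytes bytes.
 * line_bytes must be a power of two (hypothetical helper). */
static uint32_t offset_in_line(uint32_t addr, uint32_t line_bytes)
{
    assert(line_bytes != 0U && (line_bytes & (line_bytes - 1U)) == 0U);
    return addr & (line_bytes - 1U);
}
```

For 0x08000cc4 (with main()) the offset is 4 for both an 8-byte and a 16-byte line. For 0x08000c38 (without main()) it is 0 for an 8-byte line and 8 for a 16-byte line. So the handler does start at a different position within a flash line in the two builds.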
« Last Edit: November 11, 2022, 10:38:38 am by lordnoxx »
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
When I set the compiler to -Ofast I get this result:

With main():
Code: [Select]
08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4: 4b80      ldr r3, [pc, #512] ; (8000ec8 <DMA1_Channel1_IRQHandler+0x204>)
 8000cc6: 681a      ldr r2, [r3, #0]
 8000cc8: 0792      lsls r2, r2, #30
 8000cca: f140 80cc bpl.w 8000e66 <DMA1_Channel1_IRQHandler+0x1a2>
 8000cce: 685a      ldr r2, [r3, #4]
 8000cd0: 497e      ldr r1, [pc, #504] ; (8000ecc <DMA1_Channel1_IRQHandler+0x208>)
 8000cd2: 487f      ldr r0, [pc, #508] ; (8000ed0 <DMA1_Channel1_IRQHandler+0x20c>)

Execution time is 1.234us



Without main():
Code: [Select]
08000c38 <DMA1_Channel1_IRQHandler>:
 8000c38: 4b80      ldr r3, [pc, #512] ; (8000e3c <DMA1_Channel1_IRQHandler+0x204>)
 8000c3a: 681a      ldr r2, [r3, #0]
 8000c3c: 0792      lsls r2, r2, #30
 8000c3e: f140 80cc bpl.w 8000dda <DMA1_Channel1_IRQHandler+0x1a2>
 8000c42: 685a      ldr r2, [r3, #4]
 8000c44: 497e      ldr r1, [pc, #504] ; (8000e40 <DMA1_Channel1_IRQHandler+0x208>)
 8000c46: 487f      ldr r0, [pc, #508] ; (8000e44 <DMA1_Channel1_IRQHandler+0x20c>)

Execution time is 1.180us




So two things are interesting here:
1. Compiling with main() and -Ofast results in an ISR execution time about as fast as compiling with -Os and without main().
2. With -Ofast, the execution time varies much less between the builds with and without main(): 1.234us vs. 1.180us.


 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
Next I experimented with placing the ISR in CCM SRAM according to ST's AN4296.

ISR function prototype:
Code: [Select]
void main_DCDC_control_loop_interrupt(void) __attribute__((section (".ccmram")));


MAP file confirms that the ISR code is placed/copied to CCMRAM at startup by startup code.
Code: [Select]
.ccmram         0x0000000010000000      0x290 load address 0x00000000080001d8
                       0x0000000010000000                . = ALIGN (0x4)
                       0x0000000010000000                _sccmram = .
 *(.ccmram)
 .ccmram        0x0000000010000000      0x290 ./Core/Src/ISR_Handlers.o
                       0x0000000010000000                DMA1_Channel1_IRQHandler
 *(.ccmram*)
                       0x0000000010000290                . = ALIGN (0x4)
                       0x0000000010000290                _eccmram = .


These are the results.

-> with main() and -Os: Execution time is 1.508us

-> without main() and -Os: Execution time is 1.430us

-> without main() and -Ofast: Execution time is 1.354us

Hmmm...I did not expect that outcome. I really expected a much bigger reduction in execution time compared to when the ISR code is located in flash memory. Also veeeerryy interesting: with the code in CCMRAM and optimization for speed (-Ofast), the execution time gets down to "only" 1.354us, which is still slower than having the ISR code in flash and optimized with -Ofast (1.180us). I mean...what the f**k?!?!? :-//


« Last Edit: November 11, 2022, 11:21:09 am by lordnoxx »
 

Offline voltsandjolts

  • Supporter
  • ****
  • Posts: 2300
  • Country: gb
Maybe try using an internal timer for calculating ISR execution time.
Pin toggling so fast might be causing issues, but that's just a hunch, I'm not familiar with this mcu and APB setup.
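
A sketch of the arithmetic behind such a timer-based measurement, written so it can be tried on a PC. It assumes a free-running 32-bit up-counter clocked at the core frequency (on a Cortex-M4 the DWT cycle counter would be one candidate, if it is enabled); the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed between two reads of a free-running 32-bit up-counter.
 * Unsigned subtraction stays correct across a single counter wrap. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert ticks to nanoseconds for a clock given in MHz, rounded to nearest. */
static uint32_t ticks_to_ns(uint32_t ticks, uint32_t core_mhz)
{
    assert(core_mhz > 0U);
    return (uint32_t)(((uint64_t)ticks * 1000U + core_mhz / 2U) / core_mhz);
}
```

In the ISR, the counter would be read where the pin is currently set high and again where it is set low, and the difference stored in a variable that the debugger can inspect. That takes the GPIO write path out of the measurement.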
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Hmmm...I did not expect that outcome. I really expected a much bigger reduction in execution time compared to when the ISR code is located in flash memory. Also veeeerryy interesting: with the code in CCMRAM and optimization for speed (-Ofast), the execution time gets down to "only" 1.354us, which is still slower than having the ISR code in flash and optimized with -Ofast (1.180us). I mean...what the f**k?!?!? :-//

Usually MCUs expect execution from flash, hence their bus structure is tailored for such execution. When executing from RAM, instruction fetches and data accesses compete for the same bus, so you get bus congestion and slower execution.

There's no point in looking at a few lines of assembler; post the whole ISR including prologue and epilogue.
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
This is all I can get from the .list file

Code: [Select]
08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4: 4b81      ldr r3, [pc, #516] ; (8000ecc <DMA1_Channel1_IRQHandler+0x208>)
 8000cc6: 681a      ldr r2, [r3, #0]
 8000cc8: 0792      lsls r2, r2, #30
 8000cca: b5f0      push {r4, r5, r6, r7, lr}
 8000ccc: f140 80fc bpl.w 8000ec8 <DMA1_Channel1_IRQHandler+0x204>
 8000cd0: 685a      ldr r2, [r3, #4]
 8000cd2: 4c7f      ldr r4, [pc, #508] ; (8000ed0 <DMA1_Channel1_IRQHandler+0x20c>)
 8000cd4: 497f      ldr r1, [pc, #508] ; (8000ed4 <DMA1_Channel1_IRQHandler+0x210>)
 8000cd6: 4880      ldr r0, [pc, #512] ; (8000ed8 <DMA1_Channel1_IRQHandler+0x214>)
 8000cd8: 4f80      ldr r7, [pc, #512] ; (8000edc <DMA1_Channel1_IRQHandler+0x218>)
 8000cda: f042 0202 orr.w r2, r2, #2
 8000cde: 605a      str r2, [r3, #4]
 8000ce0: 4a7f      ldr r2, [pc, #508] ; (8000ee0 <DMA1_Channel1_IRQHandler+0x21c>)
 8000ce2: 6993      ldr r3, [r2, #24]
 8000ce4: f443 4380 orr.w r3, r3, #16384 ; 0x4000
 8000ce8: 6193      str r3, [r2, #24]
 8000cea: 4b7e      ldr r3, [pc, #504] ; (8000ee4 <DMA1_Channel1_IRQHandler+0x220>)
 8000cec: 681b      ldr r3, [r3, #0]
 8000cee: 8822      ldrh r2, [r4, #0]
 8000cf0: b292      uxth r2, r2
 8000cf2: 1a9b      subs r3, r3, r2
 8000cf4: 600b      str r3, [r1, #0]
 8000cf6: 4b7c      ldr r3, [pc, #496] ; (8000ee8 <DMA1_Channel1_IRQHandler+0x224>)
 8000cf8: 681b      ldr r3, [r3, #0]
 8000cfa: 680a      ldr r2, [r1, #0]
 8000cfc: 4353      muls r3, r2
 8000cfe: ee07 3a90 vmov s15, r3
 8000d02: eef8 7ae7 vcvt.f32.s32 s15, s15
 8000d06: 4b79      ldr r3, [pc, #484] ; (8000eec <DMA1_Channel1_IRQHandler+0x228>)
 8000d08: edc0 7a00 vstr s15, [r0]
 8000d0c: 681b      ldr r3, [r3, #0]
 8000d0e: 680d      ldr r5, [r1, #0]
 8000d10: 4a77      ldr r2, [pc, #476] ; (8000ef0 <DMA1_Channel1_IRQHandler+0x22c>)
 8000d12: 436b      muls r3, r5
 8000d14: ee07 3a90 vmov s15, r3
 8000d18: ed92 7a00 vldr s14, [r2]
 8000d1c: 4b75      ldr r3, [pc, #468] ; (8000ef4 <DMA1_Channel1_IRQHandler+0x230>)
 8000d1e: eef8 7ae7 vcvt.f32.s32 s15, s15
 8000d22: 2500      movs r5, #0
 8000d24: ee77 7a87 vadd.f32 s15, s15, s14
 8000d28: edc3 7a00 vstr s15, [r3]
 8000d2c: ed93 7a00 vldr s14, [r3]
 8000d30: eddf 7a71 vldr s15, [pc, #452] ; 8000ef8 <DMA1_Channel1_IRQHandler+0x234>
 8000d34: eeb4 7ae7 vcmpe.f32 s14, s15
 8000d38: eef1 fa10 vmrs APSR_nzcv, fpscr
 8000d3c: bfc8      it gt
 8000d3e: edc3 7a00 vstrgt s15, [r3]
 8000d42: edd3 7a00 vldr s15, [r3]
 8000d46: eef5 7ac0 vcmpe.f32 s15, #0.0
 8000d4a: eef1 fa10 vmrs APSR_nzcv, fpscr
 8000d4e: bf48      it mi
 8000d50: 601d      strmi r5, [r3, #0]
 8000d52: 681e      ldr r6, [r3, #0]
 8000d54: 6016      str r6, [r2, #0]
 8000d56: 4e69      ldr r6, [pc, #420] ; (8000efc <DMA1_Channel1_IRQHandler+0x238>)
 8000d58: 680a      ldr r2, [r1, #0]
 8000d5a: f8d6 c000 ldr.w ip, [r6]
 8000d5e: 683f      ldr r7, [r7, #0]
 8000d60: eba2 020c sub.w r2, r2, ip
 8000d64: 437a      muls r2, r7
 8000d66: ee07 2a90 vmov s15, r2
 8000d6a: 4a65      ldr r2, [pc, #404] ; (8000f00 <DMA1_Channel1_IRQHandler+0x23c>)
 8000d6c: eef8 7ae7 vcvt.f32.s32 s15, s15
 8000d70: edc2 7a00 vstr s15, [r2]
 8000d74: 6809      ldr r1, [r1, #0]
 8000d76: 6031      str r1, [r6, #0]
 8000d78: edd0 7a00 vldr s15, [r0]
 8000d7c: edd3 6a00 vldr s13, [r3]
 8000d80: ed92 7a00 vldr s14, [r2]
 8000d84: 4b5f      ldr r3, [pc, #380] ; (8000f04 <DMA1_Channel1_IRQHandler+0x240>)
 8000d86: 4860      ldr r0, [pc, #384] ; (8000f08 <DMA1_Channel1_IRQHandler+0x244>)
 8000d88: 4960      ldr r1, [pc, #384] ; (8000f0c <DMA1_Channel1_IRQHandler+0x248>)
 8000d8a: ee77 7aa6 vadd.f32 s15, s15, s13
 8000d8e: ee77 7a87 vadd.f32 s15, s15, s14
 8000d92: eefd 7ae7 vcvt.s32.f32 s15, s15
 8000d96: edc3 7a00 vstr s15, [r3]
 8000d9a: 681a      ldr r2, [r3, #0]
 8000d9c: 2a00      cmp r2, #0
 8000d9e: bfbc      itt lt
 8000da0: 2200      movlt r2, #0
 8000da2: 601a      strlt r2, [r3, #0]
 8000da4: 681b      ldr r3, [r3, #0]
 8000da6: 6003      str r3, [r0, #0]
 8000da8: 6803      ldr r3, [r0, #0]
 8000daa: 4a59      ldr r2, [pc, #356] ; (8000f10 <DMA1_Channel1_IRQHandler+0x24c>)
 8000dac: f5b3 6f7a cmp.w r3, #4000 ; 0xfa0
 8000db0: bfc4      itt gt
 8000db2: f44f 637a movgt.w r3, #4000 ; 0xfa0
 8000db6: 6003      strgt r3, [r0, #0]
 8000db8: 8863      ldrh r3, [r4, #2]
 8000dba: 6812      ldr r2, [r2, #0]
 8000dbc: 4c55      ldr r4, [pc, #340] ; (8000f14 <DMA1_Channel1_IRQHandler+0x250>)
 8000dbe: b29b      uxth r3, r3
 8000dc0: 1a9b      subs r3, r3, r2
 8000dc2: 600b      str r3, [r1, #0]
 8000dc4: 4b54      ldr r3, [pc, #336] ; (8000f18 <DMA1_Channel1_IRQHandler+0x254>)
 8000dc6: 681b      ldr r3, [r3, #0]
 8000dc8: 680a      ldr r2, [r1, #0]
 8000dca: 4353      muls r3, r2
 8000dcc: ee07 3a90 vmov s15, r3
 8000dd0: eef8 7ae7 vcvt.f32.s32 s15, s15
 8000dd4: 4b51      ldr r3, [pc, #324] ; (8000f1c <DMA1_Channel1_IRQHandler+0x258>)
 8000dd6: edc4 7a00 vstr s15, [r4]
 8000dda: 681b      ldr r3, [r3, #0]
 8000ddc: 680e      ldr r6, [r1, #0]
 8000dde: 4a50      ldr r2, [pc, #320] ; (8000f20 <DMA1_Channel1_IRQHandler+0x25c>)
 8000de0: 4373      muls r3, r6
 8000de2: ee07 3a90 vmov s15, r3
 8000de6: ed92 7a00 vldr s14, [r2]
 8000dea: 4b4e      ldr r3, [pc, #312] ; (8000f24 <DMA1_Channel1_IRQHandler+0x260>)
 8000dec: 4e4e      ldr r6, [pc, #312] ; (8000f28 <DMA1_Channel1_IRQHandler+0x264>)
 8000dee: eef8 7ae7 vcvt.f32.s32 s15, s15
 8000df2: ee77 7a87 vadd.f32 s15, s15, s14
 8000df6: edc3 7a00 vstr s15, [r3]
 8000dfa: ed93 7a00 vldr s14, [r3]
 8000dfe: eddf 7a4b vldr s15, [pc, #300] ; 8000f2c <DMA1_Channel1_IRQHandler+0x268>
 8000e02: eeb4 7ae7 vcmpe.f32 s14, s15
 8000e06: eef1 fa10 vmrs APSR_nzcv, fpscr
 8000e0a: bfc8      it gt
 8000e0c: edc3 7a00 vstrgt s15, [r3]
 8000e10: edd3 7a00 vldr s15, [r3]
 8000e14: eef5 7ac0 vcmpe.f32 s15, #0.0
 8000e18: eef1 fa10 vmrs APSR_nzcv, fpscr
 8000e1c: bf48      it mi
 8000e1e: 601d      strmi r5, [r3, #0]
 8000e20: 681d      ldr r5, [r3, #0]
 8000e22: 6015      str r5, [r2, #0]
 8000e24: 4d42      ldr r5, [pc, #264] ; (8000f30 <DMA1_Channel1_IRQHandler+0x26c>)
 8000e26: 680a      ldr r2, [r1, #0]
 8000e28: 682f      ldr r7, [r5, #0]
 8000e2a: 6836      ldr r6, [r6, #0]
 8000e2c: 1bd2      subs r2, r2, r7
 8000e2e: 4372      muls r2, r6
 8000e30: ee07 2a90 vmov s15, r2
 8000e34: 4a3f      ldr r2, [pc, #252] ; (8000f34 <DMA1_Channel1_IRQHandler+0x270>)
 8000e36: eef8 7ae7 vcvt.f32.s32 s15, s15
 8000e3a: edc2 7a00 vstr s15, [r2]
 8000e3e: 6809      ldr r1, [r1, #0]
 8000e40: 6029      str r1, [r5, #0]
 8000e42: edd4 7a00 vldr s15, [r4]
 8000e46: edd3 6a00 vldr s13, [r3]
 8000e4a: ed92 7a00 vldr s14, [r2]
 8000e4e: 4b3a      ldr r3, [pc, #232] ; (8000f38 <DMA1_Channel1_IRQHandler+0x274>)
 8000e50: ee77 7aa6 vadd.f32 s15, s15, s13
 8000e54: ee77 7a87 vadd.f32 s15, s15, s14
 8000e58: eefd 7ae7 vcvt.s32.f32 s15, s15
 8000e5c: edc3 7a00 vstr s15, [r3]
 8000e60: 681a      ldr r2, [r3, #0]
 8000e62: 2a00      cmp r2, #0
 8000e64: bfbc      itt lt
 8000e66: 2200      movlt r2, #0
 8000e68: 601a      strlt r2, [r3, #0]
 8000e6a: 681a      ldr r2, [r3, #0]
 8000e6c: f44f 737a mov.w r3, #1000 ; 0x3e8
 8000e70: fb92 f2f3 sdiv r2, r2, r3
 8000e74: 4b31      ldr r3, [pc, #196] ; (8000f3c <DMA1_Channel1_IRQHandler+0x278>)
 8000e76: 601a      str r2, [r3, #0]
 8000e78: 681a      ldr r2, [r3, #0]
 8000e7a: f5b2 6f7a cmp.w r2, #4000 ; 0xfa0
 8000e7e: bfc4      itt gt
 8000e80: f44f 627a movgt.w r2, #4000 ; 0xfa0
 8000e84: 601a      strgt r2, [r3, #0]
 8000e86: 6802      ldr r2, [r0, #0]
 8000e88: 681b      ldr r3, [r3, #0]
 8000e8a: 1ad2      subs r2, r2, r3
 8000e8c: 4b2c      ldr r3, [pc, #176] ; (8000f40 <DMA1_Channel1_IRQHandler+0x27c>)
 8000e8e: 601a      str r2, [r3, #0]
 8000e90: 681a      ldr r2, [r3, #0]
 8000e92: f5b2 6f7a cmp.w r2, #4000 ; 0xfa0
 8000e96: bfc4      itt gt
 8000e98: f44f 627a movgt.w r2, #4000 ; 0xfa0
 8000e9c: 601a      strgt r2, [r3, #0]
 8000e9e: 6819      ldr r1, [r3, #0]
 8000ea0: f240 52db movw r2, #1499 ; 0x5db
 8000ea4: 4291      cmp r1, r2
 8000ea6: bfdc      itt le
 8000ea8: f240 52dc movwle r2, #1500 ; 0x5dc
 8000eac: 601a      strle r2, [r3, #0]
 8000eae: 4a25      ldr r2, [pc, #148] ; (8000f44 <DMA1_Channel1_IRQHandler+0x280>)
 8000eb0: 6812      ldr r2, [r2, #0]
 8000eb2: 681b      ldr r3, [r3, #0]
 8000eb4: ea43 5302 orr.w r3, r3, r2, lsl #20
 8000eb8: 4a23      ldr r2, [pc, #140] ; (8000f48 <DMA1_Channel1_IRQHandler+0x284>)
 8000eba: 65d3      str r3, [r2, #92] ; 0x5c
 8000ebc: f102 4278 add.w r2, r2, #4160749568 ; 0xf8000000
 8000ec0: 6993      ldr r3, [r2, #24]
 8000ec2: f043 4380 orr.w r3, r3, #1073741824 ; 0x40000000
 8000ec6: 6193      str r3, [r2, #24]
 8000ec8: bdf0      pop {r4, r5, r6, r7, pc}
 8000eca: bf00      nop
 8000ecc: 40020000 .word 0x40020000
 8000ed0: 20001558 .word 0x20001558
 8000ed4: 20001524 .word 0x20001524
 8000ed8: 2000155c .word 0x2000155c
 8000edc: 200014f8 .word 0x200014f8
 8000ee0: 48000800 .word 0x48000800
 8000ee4: 20001560 .word 0x20001560
 8000ee8: 200000cc .word 0x200000cc
 8000eec: 200000c4 .word 0x200000c4
 8000ef0: 20001554 .word 0x20001554
 8000ef4: 20001550 .word 0x20001550
 8000ef8: 45802000 .word 0x45802000
 8000efc: 2000151c .word 0x2000151c
 8000f00: 2000154c .word 0x2000154c
 8000f04: 20001508 .word 0x20001508
 8000f08: 2000150c .word 0x2000150c
 8000f0c: 20001520 .word 0x20001520
 8000f10: 20001548 .word 0x20001548
 8000f14: 20001544 .word 0x20001544
 8000f18: 200000c8 .word 0x200000c8
 8000f1c: 200000c0 .word 0x200000c0
 8000f20: 20001540 .word 0x20001540
 8000f24: 2000153c .word 0x2000153c
 8000f28: 200014f4 .word 0x200014f4
 8000f2c: 4a7a3e80 .word 0x4a7a3e80
 8000f30: 20001518 .word 0x20001518
 8000f34: 20001538 .word 0x20001538
 8000f38: 20001500 .word 0x20001500
 8000f3c: 20001504 .word 0x20001504
 8000f40: 200014fc .word 0x200014fc
 8000f44: 200000d0 .word 0x200000d0
 8000f48: 50000800 .word 0x50000800
 

Offline errorprone

  • Contributor
  • Posts: 39
Not sure if you’ve done this but the vector table also needs to be moved to CCMRAM otherwise the function pointer to the ISR is still read from flash.
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
Not sure if you’ve done this but the vector table also needs to be moved to CCMRAM otherwise the function pointer to the ISR is still read from flash.

Good point. No, the vector table still resides in flash memory.
But nonetheless, once execution of the ISR has started, the time to walk through the ISR instructions should not depend on which memory the vector table sits in. Right?
I am not concerned about latency, i.e. the time from the event to the start of execution. I am wondering why the execution time, from start to end of the ISR, is not consistent when other code in the project is changed, added or removed. While investigating that I found other suspicious behavior that I don't understand, e.g. execution from flash with -Ofast is faster than execution from CCM RAM with -Ofast.

And I recently found another suspicious behavior: apparently the execution time also depends on where within the code the ISR gets called from. I am working on a simple example to show you that here in the thread.
« Last Edit: November 11, 2022, 03:17:39 pm by lordnoxx »
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8172
  • Country: fi
Usually MCUs expect execution from flash, hence their bus structure is tailored for such execution. When executing from RAM, command fetches and data access compete for the same bus, therefore you get bus congestion and slower execution.

But the CCM / ITCM exists for this reason, it should have an interface of its own, directly to CPU.

Given fairly linear code with few loops, flash with prefetch can be as fast as CCM RAM, though, so clearly that did not help the OP.

With main():
...
Without main():

The asm snippets posted look identical, but did I miss it or did you post the whole function in both cases (with and without main() differences, same optimization level)? Are they exactly the same or not? That is literally the first thing to check.
« Last Edit: November 11, 2022, 03:56:53 pm by Siwastaja »
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11257
  • Country: us
    • Personal site
When placing things into SRAM/CCM through attributes, pay attention to the function disassembly. If it does floating point math, chances are that supporting functions will still be located in the flash.

I would reduce the function to the point where there are no calls (even generated by the compiler internally), but you can still see the difference.
Alex
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Quote
But the CCM / ITCM exists for this reason, it should have an interface of its own, directly to CPU.

Given fairly linear code with few loops, flash with prefetch can be as fast as CCM RAM, though, so clearly that did not help the OP.

OP has found that running code from CCM is slower than running the same code from flash. Apparently, fetching from CCM is slower than fetching from flash. I think this is because of bus contention (i.e. you cannot fetch a command and access data at the same time).

It's easy to test - just run the piece of register-only code from CCM and from flash and see if there's any time difference.
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11257
  • Country: us
    • Personal site
CCM is a completely separate bus, it does not conflict with anything else.

But again, if you placed the code in CCM, but it still calls floating point functions from the flash, it would be slow.
Alex
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8172
  • Country: fi
Quote
OP has found that running code from CCM is slower than running the same code from flash. Apparently, fetching from CCM is slower than fetching from flash. I think this is because of the bus contention (i.e. you cannot fetch a command and access data at the same time).

This is just not physically true. See the datasheet; it's a separate bus. The whole purpose of which is to give zero wait state access, it has to be at least equally fast to flash. Unless OP also put data in the same CCM (I forgot if CCM can be used for data, at all. It sure can for instructions, which is the primary purpose).

Something else is going on. One possibility is some of the code (library calls, maybe compiler-generated memcpy/memset/software float) are still in FLASH, but because of large difference in addresses between CCM vs. FLASH, slower jump instruction (e.g., one using register for addressing, necessitating a load, or a veneer function) is now used.

Carefully examining the assembly, in full, reveals all of this.
« Last Edit: November 11, 2022, 06:41:01 pm by Siwastaja »
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Quote
Unless OP also put data in the same CCM (I forgot if CCM can be used for data, at all. It sure can for instructions, which is the primary purpose).

Data sections can be placed into CCM.

Quote
Something else is going on. One possibility is some of the code (library calls, maybe compiler-generated memcpy/memset/software float) are still in FLASH, but because of large difference in addresses between CCM vs. FLASH, slower jump instruction (e.g., one using register for addressing, necessitating a load, or a veneer function) is now used.

This is definitely a possible reason; however, I do not see any function calls in his code. He gets a substantial time difference, and it would take quite a bit of function calling to explain that.
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11257
  • Country: us
    • Personal site
As I said, the code uses floating point math, it will call helper functions.

It is a simple thing to check - just look at a full disassembly.
Alex
 

Offline eutectique

  • Frequent Contributor
  • **
  • Posts: 390
  • Country: be
Code: [Select]
08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4: 4b80      ldr r3, [pc, #512] ; (8000ec8 <DMA1_Channel1_IRQHandler+0x204>)
Execution time is 1.234us

Address is XXXXc4, 4-byte aligned.


Code: [Select]
08000c38 <DMA1_Channel1_IRQHandler>:
 8000c38: 4b80      ldr r3, [pc, #512] ; (8000e3c <DMA1_Channel1_IRQHandler+0x204>)
Execution time is 1.180us

Address is XXXX38, 8-byte aligned.


The only difference I can see.
 

Offline uer166

  • Frequent Contributor
  • **
  • Posts: 890
  • Country: us
Quote
Code: [Select]
08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4: 4b80      ldr r3, [pc, #512] ; (8000ec8 <DMA1_Channel1_IRQHandler+0x204>)
Execution time is 1.234us

Address is XXXXc4, 4-byte aligned.


Code: [Select]
08000c38 <DMA1_Channel1_IRQHandler>:
 8000c38: 4b80      ldr r3, [pc, #512] ; (8000e3c <DMA1_Channel1_IRQHandler+0x204>)
Execution time is 1.180us

Address is XXXX38, 8-byte aligned.


The only difference I can see.

Flash is 8-byte wide, so maybe that's a clue? There clearly is not enough info from OP to make any determination. Post the entire listing and memory map file.
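
The alignment observation can be checked with a little arithmetic. A minimal sketch, assuming the 64-bit (8-byte) flash read width mentioned in this post; the helper names `natural_alignment` and `flash_word_offset` are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flash read width in bytes (64 bits), as stated in the post above. */
#define FLASH_WORD_BYTES 8u

/* Largest power of two that divides addr, i.e. its natural alignment. */
uint32_t natural_alignment(uint32_t addr)
{
    assert(addr != 0u);           /* zero is divisible by every power of two */
    return addr & (~addr + 1u);   /* isolates the lowest set bit */
}

/* Byte offset of addr inside its flash word; 0 means the entry starts a word. */
uint32_t flash_word_offset(uint32_t addr)
{
    return addr % FLASH_WORD_BYTES;
}
```

For the two listings, `natural_alignment(0x08000cc4)` gives 4 and `natural_alignment(0x08000c38)` gives 8, so the slower build enters the handler in the middle of a flash word (offset 4) while the faster one enters at the start of a word (offset 0).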
 

Offline dmills

  • Super Contributor
  • ***
  • Posts: 2093
  • Country: gb
I am wondering if your without main case might have main() being optimised into a 'stop the processor' command which will leave the cache hot for the next time the ISR is called.

Put significant doings in main and now the ISR stalls on the I & D cache fill from memory? Don't forget that if both the ISR and something that could be interrupted need the FPU then that is a fairly expensive flush to the stack that you don't need if the ONLY code touching the FPU is in the interrupt handler.
 
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Quote
As I said, the code uses floating point math, it will call helper functions.

It is a simple thing to check - just look at a full disassembly.

Yep. It is right here:

This is all I can get from the .list file

Do you see any calls to helper functions?
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11257
  • Country: us
    • Personal site
I missed that post.

In that case I would start removing the code to get a minimal version that still shows the difference. And I would start with floating point instructions, since timing on those depends on the operands.
Alex
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8172
  • Country: fi
Or you can just try to add __attribute__((aligned(8))) to the function definition (and re-check in the listing that it took effect) and see if eutectique & uer166 are on the right track. It doesn't explain the case where CCM is slower, though.
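
A minimal sketch of that suggestion, assuming GCC or Clang. `demo_handler` is a hypothetical stand-in for the real ISR and `entry_is_aligned` is a made-up helper; on the target, the listing or map file remains the authoritative check.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for DMA1_Channel1_IRQHandler; only the attribute matters. */
__attribute__((aligned(8))) void demo_handler(void)
{
}

/* Returns 1 if the entry address of fn is a multiple of boundary, else 0. */
int entry_is_aligned(void (*fn)(void), uintptr_t boundary)
{
    assert(boundary != 0u);
    /* On Thumb targets bit 0 of a function pointer is the Thumb flag, not part
       of the address, so clear it before testing. */
    uintptr_t addr = (uintptr_t)fn & ~(uintptr_t)1u;
    return (addr % boundary) == 0u;
}
```

Calling `entry_is_aligned(demo_handler, 8)` should return 1 when the attribute took effect; the attribute only raises the minimum alignment, so the linker may still place the function on a coarser boundary.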
 

Offline voltsandjolts

  • Supporter
  • ****
  • Posts: 2300
  • Country: gb

The ISR looks like this:

Code: [Select]
void DMA1_Channel1_IRQHandler(void)
{
    if(DMA1->ISR & DMA_ISR_TCIF1)  //check we are here cause of a valid interrupt occurred
    {
        DMA1->IFCR |= DMA_IFCR_CTCIF1; //reset interrupt flag

        GPIOC->ODR |= (1U << 14); //Set Pin high

        ev = (v - my_array[0]);
        vp  = A * ev;
        vi  = B * vi1;
        if(vi > 4100) vi=4100;
        if(vi < 0) vi=0;
        vi1 = vi;
        vd  = C * (ev - (e1v+2));
        e1v = ev;
        cv = vp + vi + vd;
        if(cv < 0) cv=0;
        cv_t = cv/2;
        if(cv_t >  4000 ) cv_t=4000;
        /*------------------------------------------------------*/
        ei = (my_array[1] - i);
        ip  = D * ei;
        ii  = E * ii1;
        if(ii > 4100000 ) ii=4100000;
        if(ii < 0 ) ii=0;
        ii1 = ii;
        id  = F * (ei + (e1i-2));
        e1i = ei;
        ci = ip + ii + id;
        if(ci < 0) ci=0;
        ci_t = ci/1000;
        if(ci_t >  4000 ) ci_t=4000;

        cdac_t = (cv_t - ci_t);

        if(cdac_t > 4000) cdac_t=4000;
        if(cdac_t < 1500) cdac_t=1500;

        GPIOC->ODR &= ~(1U << 14); //Set Pin low
    }
}

All the variables are global. Some of them are "int" and others are "float". All variables used in the ISR are declared volatile. I am using the FPU.

Converting float to int?

Some of the IEEE.754 operations are not supported by hardware and are done by software:
• Remainder
• Round floating-point to integer-value floating-point number
• Binary-to-decimal and decimal-to-binary conversions
• Direct comparison of single-precision and double-precision values
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 495
  • Country: sk
If you want just another guess to be thrown wildly, the CCMRAM may be used at one side of a running DMA, for example.

[EDIT] Another wild guess is, that stack is defined in CCMRAM, which is aliased as top of the SRAM block; and (lazy?) float stacking then interferes with code fetching.

Third wild guess is, that the busmatrix arbitrator inserts a slot whenever access to CCMRAM switches from I-bus to D-bus of the processor. [/EDIT]

Details do matter. There's no point guessing around.

As Alex said above, OP should prepare a minimal but complete code exhibiting the problem and post entirely, with disasm or elf.

JW
« Last Edit: November 12, 2022, 09:59:17 am by wek »
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
Quote
As Alex said above, OP should prepare a minimal but complete code exhibiting the problem and post entirely, with disasm or elf.

This is what I will do. Thanks to all of you who commented and started analyzing the issue. I'll be right back.
 

Offline jnk0le

  • Contributor
  • Posts: 41
  • Country: pl
The execution from ITCM is slower than flash because you hit the von Neumann bottleneck from pcrel loads of constants. FLASH memories are usually equipped with separate code and literal caches.
It's definitely faster on devices like f103 (which can't keep up with a stream of nop.w at 1ws). RISCV is cleaner in this regard.

You can put just the vector table in a TCM that is not occupied by the stack, so as not to incur a wait-stated read twice in a row.
Also align the function entry to the cache line size.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8172
  • Country: fi
Quote
The execution from ITCM is slower than flash because you hit the von Neumann bottleneck from pcrel loads of constants. FLASH memories are usually equipped with separate code and literal caches.

Sounds weird to me, because CCM/ITCM should be equally fast to that code/literal cache - no wait states. Load instructions finish as quickly as possible from ITCM, I don't see how they can be any faster; interface to the flash accelerator or cache should be of equal performance. But I could be wrong.
 

Offline jnk0le

  • Contributor
  • Posts: 41
  • Country: pl
Quote
Sounds weird to me, because CCM/ITCM should be equally fast to that code/literal cache - no wait states. Load instructions finish as quickly as possible from ITCM, I don't see how they can be any faster; interface to the flash accelerator or cache should be of equal performance. But I could be wrong.

You can't fetch instructions and data at the same time from a single-ported memory, hence the von Neumann bottleneck. The core prefetcher can fetch instructions a bit ahead, but that won't do much against a series of such loads.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Quote
FLASH memories are usually equipped with separate code and literal caches.

I didn't know that. If so, this certainly explains why flash is faster than CCM.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8172
  • Country: fi
That also implies flash cache controller has special interface directly to the CPU, allowing two separate addresses to be fetched at the same time. Still suspicious about that.

I have never seen ITCM being slower than flash, although to be fair I have compared those only some dozen times, not hundreds, so maybe there is a case.

Plus it's worth remembering ITCM is a 64-bit wide bus in Cortex-M7, designed and offered by ARM as a standard option, but CCM is ST's own addition. Don't know about the difference. I have used ST's CCM only in one project (on STM32F334), and it was significantly faster than FLASH in those actual interrupt handlers I used.
« Last Edit: November 13, 2022, 05:46:12 pm by Siwastaja »
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11257
  • Country: us
    • Personal site
All those things are much more complicated than a simple single ported memory. I too have never seen TCM being slower and I don't see how it is possible.

There is some issue with the measurement method or something else is going on.
Alex
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Quote
That also implies flash cache controller has special interface directly to the CPU, allowing two separate addresses to be fetched at the same time. Still suspicious about that.

Data and Code use different buses. Otherwise it would always be contention between instruction fetches and data access. See the bus matrix picture below.



 

Offline jnk0le

  • Contributor
  • Posts: 41
  • Country: pl
Quote
I have never seen ITCM being slower than flash, although to be fair I have compared those only some dozen times, not hundreds, so maybe there is a case.

Quote
I too have never seen TCM being slower and I don't see how it is possible.

I think we see here a false positive of microbenchmarking - the interrupt vector/code/data remains in the FLASH caches between runs. There are also quite a lot of pcrel constants.
In a typical scenario it should be faster and, if not, the overhead ought to be low.

If the compiler can somehow be forced to generate MOVW+MOVT instead of pcrel constants, TCM should have exactly the same performance as FLASH, provided there is no contention with the stack or variables in the same memory block.
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3146
  • Country: ca
Quote
... the interrupt vector/code/data remains in FLASH caches between runs.

Exactly! That's what it is.

When you also have main, the ISR code gets removed from the cache while main runs. So, the bigger the main is, the higher the probability of the ISR code being expelled from cache, and hence when the main is big, the ISR code gets slower.

If the main is big enough to expel the ISR code from the cache every time, the flash code should then become slower than CCM code. @OP: you can verify if this is so.
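
The eviction argument can be illustrated with a toy model. This is only a sketch under loud assumptions: a fully associative cache with strict LRU replacement and 32 lines (`CACHE_LINES`), which may not match the real flash accelerator on this part; the line addresses and the function names are invented for the simulation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINES 32u  /* assumed line count; check the reference manual */

typedef struct {
    uint32_t tag[CACHE_LINES];    /* which line address each slot holds */
    uint32_t age[CACHE_LINES];    /* timestamp of the last use */
    int      valid[CACHE_LINES];
    uint32_t clock;
} lru_cache_t;

/* Returns 1 on a hit, 0 on a miss (the line is then loaded). */
int cache_access(lru_cache_t *c, uint32_t line_addr)
{
    unsigned victim = 0;
    c->clock++;
    for (unsigned i = 0; i < CACHE_LINES; i++) {
        if (c->valid[i] && c->tag[i] == line_addr) {
            c->age[i] = c->clock;
            return 1;
        }
    }
    /* Miss: take the first empty slot, otherwise the least recently used one. */
    for (unsigned i = 0; i < CACHE_LINES; i++) {
        if (!c->valid[i]) { victim = i; break; }
        if (c->age[i] < c->age[victim]) victim = i;
    }
    c->valid[victim] = 1;
    c->tag[victim]   = line_addr;
    c->age[victim]   = c->clock;
    return 0;
}

/* Runs the ISR `rounds` times with main() touching main_lines distinct lines
   in between, and returns the number of ISR line misses in the last run. */
unsigned isr_misses(unsigned isr_lines, unsigned main_lines, unsigned rounds)
{
    lru_cache_t c;
    unsigned misses = 0;
    assert(rounds > 0u);
    memset(&c, 0, sizeof c);
    for (unsigned r = 0; r < rounds; r++) {
        misses = 0;
        for (unsigned i = 0; i < isr_lines; i++)
            misses += !cache_access(&c, 0x1000u + i);   /* ISR code lines */
        for (unsigned i = 0; i < main_lines; i++)
            (void)cache_access(&c, 0x2000u + i);        /* main() code lines */
    }
    return misses;
}
```

In this model a 20-line ISR sees no misses in steady state while main() touches 10 lines or fewer between interrupts, and misses on all 20 lines once main() touches 40, which is the qualitative behavior described above.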
« Last Edit: November 13, 2022, 09:02:54 pm by NorthGuy »
 

Offline jnk0le

  • Contributor
  • Posts: 41
  • Country: pl
one more thing:

Quote
The STM32G4 Series category 2 devices feature up to 32 Kbytes SRAM:
• 16 Kbytes SRAM1 (mapped at address 0x2000 0000)
• 6 Kbytes SRAM2 (mapped at address 0x2000 4000)
• 10 Kbytes CCM SRAM (mapped at address 0x1000 0000 and end of SRAM2)

Isn't your linker script joining all those sections to create one linear layout?
In that case there would be contention with the stack (and some bugs once the app grows).
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 495
  • Country: sk
Quote
I too have never seen TCM being slower and I don't see how it is possible.

We are not talking about the TCM in the CM7, which sits on a processor bus entirely separate from the rest of the system.

CCM SRAM in STM32 denotes different things in different families/models. In 'G4, it's a single-port RAM accessed through both I and D but even through S port of processor (the latter through a different address region alias) and then through the common busmatrix, and it can be slave also to either of DMAs. Strangely enough, there's another chunk of RAM, denoted SRAM1, with exactly the same connectivity, and no explanation how would it be different from CCM SRAM.

The arbitration of the busmatrix is not documented in any other way than it is "round robin".

JW
 
The following users thanked this post: Siwastaja

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5907
  • Country: es
Note that some linker scripts don't separate these ram regions.
I remember the 32F429 showing "RAM: .... length=192K", so the system would use any.

At least in the 429, CCM is only connected to the D-Bus, so it can neither be used for instructions nor be accessed by any DMA.
Quote
The 64-Kbyte CCM (core coupled memory) data RAM is not part of the bus matrix and can be accessed only through the CPU
But placing ISR data in CCM and ISR instructions in the normal SRAM shouldn't slow it down, as they won't need to fight for access.

Not the case of the 32G431 (Totally different beast!)
« Last Edit: November 13, 2022, 10:01:04 pm by DavidAlfa »
Hantek DSO2x1x            Drive        FAQ          DON'T BUY HANTEK! (Aka HALF-MADE)
Stm32 Soldering FW      Forum      Github      Donate
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8172
  • Country: fi
Quote
Data and Code use different buses. Otherwise it would always be contention between instruction fetches and data access. See the bus matrix picture below.

This picture explains it well. Flash accelerator comes with separate DBUS and IBUS interfaces, STM32 CCM SRAM does not but has to arbitrate.

Who knows about the internals of ARM-provided M7 ITCM (tried 2 minutes in Google to no avail)? The bus is 64 bits but does that mean it's actually kinda-dual port, like the FLASH accelerator pictured? PC-relative literals are usually further away so 64 bit width does not help for that.
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8172
  • Country: fi
Quote
The STM32G4 Series category 2 devices feature up to 32 Kbytes SRAM:
• 6 Kbytes SRAM2 (mapped at address 0x2000 4000)
• 10 Kbytes CCM SRAM (mapped at address 0x1000 0000 and end of SRAM2)

Kind of weird optimization idea:

Since PC-relative LDR can take offset of +/- 4095 bytes, which is quite a lot, and considering wek's comment "is there any difference between CCM and SRAM?"  it should be possible to place timing-critical routines on the border of these memory segments so that the PC-relative literals fall into SRAM2 and code itself into CCM. Then of course place nothing else into SRAM2 & CCM. Then DBUS accesses would be to SRAM2 and IBUS instruction fetches to CCM, and no arbitration for PC-relative literals.
 

Offline jnk0le

  • Contributor
  • Posts: 41
  • Country: pl
About optimizing the OP code

Quote
8000ed0:   20001558    .word   0x20001558
 8000ed4:   20001524    .word   0x20001524
 8000ed8:   2000155c    .word   0x2000155c
 8000edc:   200014f8    .word   0x200014f8
 
 8000ee4:   20001560    .word   0x20001560
 8000ee8:   200000cc    .word   0x200000cc
 8000eec:   200000c4    .word   0x200000c4
 8000ef0:   20001554    .word   0x20001554
 8000ef4:   20001550    .word   0x20001550
 
 8000efc:   2000151c    .word   0x2000151c
 8000f00:   2000154c    .word   0x2000154c
 8000f04:   20001508    .word   0x20001508
 8000f08:   2000150c    .word   0x2000150c
 8000f0c:   20001520    .word   0x20001520
 8000f10:   20001548    .word   0x20001548
 8000f14:   20001544    .word   0x20001544
 8000f18:   200000c8    .word   0x200000c8
 8000f1c:   200000c0    .word   0x200000c0
 8000f20:   20001540    .word   0x20001540
 8000f24:   2000153c    .word   0x2000153c
 8000f28:   200014f4    .word   0x200014f4
 
 8000f30:   20001518    .word   0x20001518
 8000f34:   20001538    .word   0x20001538
 8000f38:   20001500    .word   0x20001500
 8000f3c:   20001504    .word   0x20001504
 8000f40:   200014fc    .word   0x200014fc
 8000f44:   200000d0    .word   0x200000d0

All of those seem to be addresses of global variables. Organizing them in structures should greatly reduce those literal loads.

EDIT: many of those are also within 12 bit addw/ldr range to each other but the linkers tend to not be good at this kind of address relaxing

« Last Edit: November 14, 2022, 10:14:11 am by jnk0le »
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
Hi guys, I am back. I had a little time over the weekend and worked on a small, simplified project that shows the variable execution time of the DMA1 ISR. Together we can now walk through the mini project and find out what is going on. But there is one thing I have to make clear right from the beginning: I am not allowed to post the C source code of the function cli(). I can only post the disassembly of that function.

Ok here we go....this is the main.c file:

Code: [Select]
#include "stm32g4xx.h"
#include "main.h"
#include "system_setup.h"
#include "usart_char_string.h"
#include "cli.h"
#include "ISR_Handlers.h"

#define TEST 0
#define VOUT 5.0f
#define VOUT_SET ((VOUT/10)/3)*4095.0f * 0.996
#define KPV  300
#define KIV  30
#define KDV  3000
#define IOUT 3.0f
#define IOUT_SET ((IOUT*0.2308f)/3)*4095.0f * 1.0f
#define KPI  10000
#define KII  30000
#define KDI  5000

volatile uint16_t myarray[2]= {0,0};
char fooptr[16];
volatile char debug=0;

volatile unsigned int vref=0, vref_aim=VOUT_SET;
volatile int KpV=KPV, KiV=KIV, KdV=KDV;
volatile int ev=0, e1v=0, cv=0, cv_out=0;
volatile float vp=0, vi=0, vi1=0, vd=0;

volatile unsigned int iref=0, iref_aim=IOUT_SET;
volatile int KpI=KPI, KiI=KII, KdI=KDI;
volatile int ei=0, e1i=0, ci=0, ci_out=0;
volatile float ip=0, ii=0, ii1=0, id=0;
volatile int c_out;

void SysTick_Handler(void);

int main(void)
{
setup_clock_tree_config();
setup_GPIO_config();
SystemCoreClockUpdate();
clock = SystemCoreClock;
SysTick_Config(clock/1e3); //1ms
setup_USART_config();
setup_Timer1_config();
setup_ADC1_config();
setup_interrupt_config();

GPIOA->ODR |= (1 << GPIO_ODR_OD11_Pos);

while(1)
{

cli();

if(TEST)
{
delay_ms(200);
}

}
}


void SysTick_Handler(void)
{
count_ticks++;
}

void delay_ms(uint32_t ms)
{
uint32_t start=count_ticks;
while ((count_ticks-start) < ms);
}

If you wonder about the variable names...yes....the project is all about a digital control loop for a DC/DC converter. And yes...the DMA1 ISR executes the PID algorithms for voltage and current control. This is just a side note for those of you who are interested ;)

Here is the code of the ISR:

Code: [Select]
void DMA1_Channel1_IRQHandler(void) {

if (DMA1->ISR & DMA_ISR_TCIF1)
{
DMA1->IFCR |= DMA_IFCR_CTCIF1;

GPIOC->BSRR |= (1U << 14); //Pin high

ev = (vref - myarray[0]);

vp = KpV * ev;
vi = KiV * ev + vi1;

if (vi > 4100) vi = 4100;
if (vi < 0) vi = 0;
vi1 = vi;

vd = KdV * (ev - e1v);
e1v = ev;

cv = vp + vi + vd;

if (cv < 0) cv = 0;

cv_out = cv;
if (cv_out > 4000) cv_out = 4000;

ei = (myarray[1] - iref);

ip = KpI * ei;
ii = KiI * ei + ii1;

if (ii > 4100000) ii = 4100000;
if (ii < 0) ii = 0;
ii1 = ii;

id = KdI * (ei - e1i);
e1i = ei;

ci = ip + ii + id;

if (ci < 0) ci = 0;

ci_out = ci / 1000;
if (ci_out > 4000) ci_out = 4000;

c_out = (cv_out - ci_out);

if (c_out > 4000) c_out = 4000;
if (c_out < 1500) c_out = 1500;

GPIOC->BSRR |= (1U << (14 + 16)); //Pin low

}
}

The main() function and the ISR and also the vector table reside in flash memory. Compiler optimizes for size (-Os). From the attached image you can see that the execution time is 1.842µs with cli() that gets called from main().
If I comment out cli()...

Code: [Select]
while(1)
{
//cli();

if(TEST)
{
delay_ms(200);
}

}

...the execution time of the ISR is 1.542µs, shown in the other attached image.

Please see also the attached ZIP. It contains the elf-file and the disassembly (list-file) with the cli() function. On request I can upload an elf file and disassembly without the cli() function.

 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5907
  • Country: es
What happens if you disable the float code and add a fixed nop delay, does the ISR time vary by cli()?

Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14);                    //Pin high
for(uint16_t i=0;i<10000;i++) asm("nop");
GPIOC->BSRR |= (1U << (14 + 16));             //Pin low

Also try making a larger function:
Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14);                    //Pin high
for(uint16_t i=0;i<40;i++) {
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
}
GPIOC->BSRR |= (1U << (14 + 16));             //Pin low

This way you discard a FPU HW / library issue.
Does cli use the FPU as well? Any other higher priority IRQ? Disabling the IRQ in any moment?
« Last Edit: November 20, 2022, 10:17:55 pm by DavidAlfa »
 

Offline lordnoxxTopic starter

  • Contributor
  • Posts: 22
  • Country: de
Quote
What happens if you disable the float code and add a fixed nop delay, does the ISR time vary by cli()?

Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14);                    //Pin high
for(uint16_t i=0;i<10000;i++) asm("nop");
GPIOC->BSRR |= (1U << (14 + 16));             //Pin low

Tried that...Result:
Execution time of the ISR is not dependent on cli().


Quote
Also try making a larger function:
Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14);                    //Pin high
for(uint16_t i=0;i<40;i++) {
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
}
GPIOC->BSRR |= (1U << (14 + 16));             //Pin low

This way you discard a FPU HW / library issue.

Tried that too...Result:
Execution time of the ISR is not dependent on cli().


Quote
Does cli use the FPU as well? Any other higher priority IRQ? Disabling the IRQ in any moment?
No, cli() has no float operations going. Also if you check the disassembly of cli() you will not find any FPU instructions. So probably the root cause is not the lazy stacking feature!?!?  :-//
No higher priority IRQs.

But...the above tests led me to test again what happens if I tell the compiler not to use the FPU at all. --> Result: Execution time of the ISR is not dependent on cli().
Hmmm...well....does this mean that the issue is nevertheless caused by the FPU?

What can be the next steps to dive deeper into this?







 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5907
  • Country: es
Toggle the pin between operations, maybe you find it's happening by some specific operation?
 

Offline jnk0le

  • Contributor
  • Posts: 41
  • Country: pl
Put all of those in one big struct so the compiler can access them with an ldr offset instead of doing a pcrel load of the address each time.
Quote
Code: [Select]
volatile uint16_t myarray[2]= {0,0};

volatile unsigned int vref=0, vref_aim=VOUT_SET;
volatile int KpV=KPV, KiV=KIV, KdV=KDV;
volatile int ev=0, e1v=0, cv=0, cv_out=0;
volatile float vp=0, vi=0, vi1=0, vd=0;

volatile unsigned int iref=0, iref_aim=IOUT_SET;
volatile int KpI=KPI, KiI=KII, KdI=KDI;
volatile int ei=0, e1i=0, ci=0, ci_out=0;
volatile float ip=0, ii=0, ii1=0, id=0;
volatile int c_out;
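
A minimal sketch of that grouping for the voltage loop only. The names `vloop_t`, `vloop` and `voltage_loop_step` are hypothetical, and the gains are copied from the KPV/KIV/KDV defines in the posted main.c. With one struct the compiler typically needs a single base address and can reach every member with an immediate offset, though the actual code generation should be confirmed in the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical grouping of the voltage-loop globals quoted above. */
typedef struct {
    uint16_t     adc[2];              /* was myarray[] */
    unsigned int vref;
    int          KpV, KiV, KdV;
    int          ev, e1v, cv, cv_out;
    float        vp, vi, vi1, vd;
} vloop_t;

/* Gains taken from the KPV/KIV/KDV defines; everything else starts at zero. */
volatile vloop_t vloop = { .KpV = 300, .KiV = 30, .KdV = 3000 };

/* Same arithmetic as the voltage half of the posted ISR, via one base pointer. */
void voltage_loop_step(volatile vloop_t *s)
{
    assert(s != 0);
    s->ev = (int)(s->vref - s->adc[0]);
    s->vp = (float)(s->KpV * s->ev);
    s->vi = (float)(s->KiV * s->ev) + s->vi1;
    if (s->vi > 4100) s->vi = 4100;
    if (s->vi < 0)    s->vi = 0;
    s->vi1 = s->vi;
    s->vd  = (float)(s->KdV * (s->ev - s->e1v));
    s->e1v = s->ev;
    s->cv  = (int)(s->vp + s->vi + s->vd);
    if (s->cv < 0) s->cv = 0;
    s->cv_out = s->cv;
    if (s->cv_out > 4000) s->cv_out = 4000;
}
```

The ISR body would then shrink to a call such as `voltage_loop_step(&vloop);` plus the equivalent for the current loop.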

 

Offline Jeroen3

  • Super Contributor
  • ***
  • Posts: 4078
  • Country: nl
  • Embedded Engineer
    • jeroen3.nl
What is your clock tree like? Are you waiting on an asynchronous bus somewhere?

What happens when you run the chip at a speed that needs no wait states, e.g. a flat 16 MHz?
Can you then reproduce the results?

Sidenote: GPIOC->BSRR is a write only register, no need for |=
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 495
  • Country: sk
The FPU is probably a red herring: with those nops you've also removed all the pcrel loads, which may have an impact through alignment, as jnk0le pointed out above (and maybe others talked about it too).

To test the alignment-dependency theory, insert a single NOP to the ISR (before setting the GPIO pin) and retest, then insert one more etc.

> Sidenote: GPIOC->BSRR is a write only register, no need for |=

Same for DMA1->IFCR. Generally, avoid RMW on registers as much as possible/practical. Don't set multiple bits in a single register with a series of RMWs (as you do with FLASH_ACR); perform a single write of the final value.
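
As an illustration of the single-write idea for FLASH_ACR, here is a minimal sketch. `fake_flash_acr` stands in for `FLASH->ACR` so it builds on a PC, and the bit positions are assumptions to be checked against the reference manual (or replaced with the `FLASH_ACR_*` macros from the device header). It reads the register once to keep unrelated bits, edits the copy, and writes once:

```c
#include <stdint.h>

/* Stand-in for FLASH->ACR so the sketch builds on a PC. */
static volatile uint32_t fake_flash_acr;

/* Field positions as assumed here; verify against the reference manual. */
#define ACR_LATENCY_MASK 0xFu
#define ACR_LATENCY_4WS  4u
#define ACR_PRFTEN       (1u << 8)
#define ACR_ICEN         (1u << 9)
#define ACR_DCEN         (1u << 10)

void flash_acr_configure(void)
{
    uint32_t acr = fake_flash_acr;           /* one read */
    acr &= ~ACR_LATENCY_MASK;                /* edit the copy in a CPU register */
    acr |= ACR_LATENCY_4WS | ACR_PRFTEN | ACR_ICEN | ACR_DCEN;
    fake_flash_acr = acr;                    /* one write of the final value */
}
```

Compared with four separate `|=` statements on a volatile register, this is one bus read and one bus write instead of four of each.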


JW
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8172
  • Country: fi
> Sidenote: GPIOC->BSRR is a write only register, no need for |=

> Same for DMA1->IFCR. Generally avoid RMW to registers as much as possible/practical.


Made that bigger; it's a pretty important side note when the OP is clearly interested in performance. BSRR, IFCR and other similar registers exist for the purpose of avoiding read-modify-write; the RMW operation is moved to the peripheral side. Consider it a simple "hardware accelerator" for modifying peripheral state.
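
To make the "RMW moved to the peripheral side" point concrete, here is a small PC-side model of what the GPIO block does with a BSRR write. `model_odr` and `model_bsrr_write()` are illustrative names, not part of any ST header:

```c
#include <stdint.h>

static uint16_t model_odr;   /* models the port's output data register */

/* Models what the GPIO hardware does on a write to BSRR:
   bits 0..15 set outputs, bits 16..31 reset them, zero bits do nothing. */
void model_bsrr_write(uint32_t value)
{
    uint16_t set_mask   = (uint16_t)(value & 0xFFFFu);
    uint16_t reset_mask = (uint16_t)(value >> 16);
    model_odr = (uint16_t)((model_odr & (uint16_t)~reset_mask) | set_mask);
}
```

Because zero bits leave the other pins untouched, a plain `GPIOC->BSRR = mask;` is enough; the `|=` form only adds a pointless read of a write-only register before the store.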
 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5907
  • Country: es
Why add side notes in micron-size font? Just use regular size; it's confusing. I didn't even read them, as they looked like your personal signature.

Add "__attribute__((aligned(16)))" before the ISR function so it gets 16-byte (128bit) aligned.
It'll waste a little space, but nothing serious: I tested it and got an increase of about 3%, from 106KB to 109.3KB.
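
For reference, a minimal sketch of the attribute in use with GCC/Clang. The handler name, the `tick_count` workload and the checking helper are hypothetical; the helper is only there to confirm the alignment at run time:

```c
#include <stdint.h>

static volatile uint32_t tick_count;   /* hypothetical ISR workload */

/* GCC/Clang attribute: place the function's first instruction on a 16-byte boundary. */
__attribute__((aligned(16)))
void aligned_handler_sketch(void)
{
    tick_count = tick_count + 1u;
}

/* Returns 1 if the function entry is 16-byte aligned. Bit 0 is masked off
   because Thumb function pointers carry the Thumb state bit there. */
int entry_is_16_byte_aligned(void (*fn)(void))
{
    return (((uintptr_t)fn & ~(uintptr_t)1u) & 0xFu) == 0u;
}
```

On the target, the symbol address in the map file or disassembly shows the same thing without any helper.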

What are you running on cli(), or what code is increasing the ISR time?
Tried your code inside a timer ISR, nothing seems to affect it, got steady 2.56us on a 32F411@100MHz.
The struct idea definitely made a difference, lowering the time to 1.96us.
Are you sure it's not related to the processing ending sooner/later depending on the input data?
What happens if the float input values are never updated, so it always makes the same calcs?

Also, maybe try enabling the SysTick timer (1 kHz), disabling everything else and running the code there, to rule out something strange with the DMA ISR.
« Last Edit: November 23, 2022, 11:30:20 am by DavidAlfa »
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 495
  • Country: sk
DavidAlfa,

> The struct idea definitely made a difference, lowering the time [from 2.56us(?)] to 1.96us.

Wow, I wouldn't expect it to be that dramatic.

Can you please revert to the non-struct version, and try a couple of versions with one/two/etc. __NOP()s added before setting the GPIO pin?

Thanks,

JW

 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5907
  • Country: es
I'm not getting any difference, either by adding random code anywhere in main or by aligning the code to 16 bytes:

No alignment:
08000784 <TIM3_IRQHandler>

Adding __attribute__((aligned(16))):
08000790 <TIM3_IRQHandler>

Zero difference in execution time.

Edit: It was my scope not having enough precision. Repeated in 10ns/div.

I'm adding nops before setting the pin high, the waveform should remain the same, right? But there's a small difference:
0 nop:  1.96us (="same")
1 nop: +20ns
2 nop: same
3 nop: +20ns
4 nop: same

It seems that adding an odd number of NOPs adds two extra execution cycles to the following instructions.
But in no way +600ns.
Flash latency is set to 3, I don't think it's a cache miss.
The F411 is way inferior to the G431, so I wouldn't expect it to perform better in any way, but who knows.
« Last Edit: November 23, 2022, 11:59:18 am by DavidAlfa »
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 495
  • Country: sk
> I'm adding nops before setting the pin high, the waveform should remain the same, right? But there's a small difference:
> 0 nop:  1.96us (="same")
> 1 nop: +20ns
> 2 nop: same
> 3 nop: +20ns
> 4 nop: same

OK but that's with the variables in struct, right? If you run at 100MHz that 20ns may be 2-3 cycles so it may be one extra FLASH read, or similar.
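
For the arithmetic, a tiny helper that turns a scope delta into CPU cycles (the function name is made up; it rounds to the nearest cycle). At 100 MHz one cycle is 10 ns, so 20 ns is nominally 2 cycles, while the OP's +600 ns at 168 MHz would be on the order of 100 cycles:

```c
#include <stdint.h>

/* Converts a scope delta in nanoseconds to the nearest whole number of CPU cycles. */
uint32_t ns_to_cycles(uint32_t delta_ns, uint32_t core_mhz)
{
    return (delta_ns * core_mhz + 500u) / 1000u;
}
```

With a scope resolution of around 10 ns, a reading of 20 ns can't really distinguish 2 cycles from 3, hence the range.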

Can you please try the same with the non-struct i.e. original version? As there are more data reads from the FLASH, the difference may be more pronounced.

> The F411 is way inferior to the G431 so I wouldn't expect it to perform better in any way,  but who knows.

From a computing point of view, the 'F411 is in no way inferior to the 'G431, except for the max. clock frequency (i.e. the 'F42x or 'F446 are on par with the 'G431). The 'G4xx are "better" in the peripheral mix. The 'G4xx are worse in longevity, given the massive increase in peripherals' complexity was paid for by using 45nm technology (vs. the older (read: more robust) 90nm for the 'F4). It's not dramatic, yet; but the pressure is already felt.

JW
 

Offline DavidAlfa

  • Super Contributor
  • ***
  • Posts: 5907
  • Country: es
Strange results, but nothing out of this world.

1 nop: -20ns
2 nop: +10ns
3 nop: +10ns
4 nop: -20ns
5 nop: -20ns
6 nop: -20ns
7 nop: -20ns
8 nop: -20ns
9 nop: -20ns

asm(""): -20ns. Yeah, asm("nothing").

Code: [Select]
void TIM3_IRQHandler(void)
{
  /* USER CODE BEGIN TIM3_IRQn 0 */
  __HAL_TIM_CLEAR_FLAG(&htim3,TIM_FLAG_UPDATE);
 8000784: 4b7c      ldr r3, [pc, #496] ; (8000978 <TIM3_IRQHandler+0x1f4>)
  LED_GPIO_Port->BSRR = LED_Pin; //Set Pin high
 8000786: 497d      ldr r1, [pc, #500] ; (800097c <TIM3_IRQHandler+0x1f8>)

Code: [Select]
void TIM3_IRQHandler(void)
{
 8000784: b5f0      push {r4, r5, r6, r7, lr}
  /* USER CODE BEGIN TIM3_IRQn 0 */
  asm("");
  __HAL_TIM_CLEAR_FLAG(&htim3,TIM_FLAG_UPDATE);
 8000786: 4b7c      ldr r3, [pc, #496] ; (8000978 <TIM3_IRQHandler+0x1f4>)
  LED_GPIO_Port->BSRR = LED_Pin; //Set Pin high
 8000788: 497c      ldr r1, [pc, #496] ; (800097c <TIM3_IRQHandler+0x1f8>)
« Last Edit: November 23, 2022, 02:17:31 pm by DavidAlfa »
 

Offline wek

  • Frequent Contributor
  • **
  • Posts: 495
  • Country: sk
Hummm.

Thanks, David.

JW
 

Offline bson

  • Supporter
  • ****
  • Posts: 2270
  • Country: us
Sounds weird to me, because CCM/ITCM should be equally fast to that code/literal cache - no wait states. Load instructions finish as quickly as possible from ITCM, I don't see how they can be any faster; interface to the flash accelerator or cache should be of equal performance. But I could be wrong.
CCM doesn't have separate I/D buses like flash or SRAM.  This means the pipeline can't overlap data accesses with code accesses when data is in CCM, but the flip side is it gives very predictable and deterministic execution times.  Every access takes exactly one cycle, and can simply be added up.
 

