Author Topic: STM32G431 - How can it be that the amount of lines of code in main() affect ISR (Read 7654 times)

lordnoxx · « **on:** November 06, 2022, 11:33:43 am »

Hi folks,

I am working on a project based on an STM32G431. While developing the firmware for it I encountered a strange phenomenon that should not be possible. After fiddling around with it and making my own efforts at debugging and testing to better understand what is going on, I had to partly give up, as my knowledge of the deeper STM32 secrets is limited. I am a power electronics engineer and embedded firmware development is not my daily profession. Although I have quite some experience with coding and microcontrollers (8051, PIC, ATmega, STM32), I have come to a point where I need help to find the root cause of the issue.

So I went to the ST Community Forum and asked for help there: https://community.st.com/s/question/0D73W000001nmmASAQ/detail?fromEmail=1&s1oid=00Db0000000YtG6&s1nid=0DB0X000000DYbd&s1uid=0053W000001UI26&s1ext=0&emkind=chatterCommentNotification&emtm=1667370141288

I am still full of hope and confidence that the people there at ST can help me. But I feel I have to reach a bigger crowd of skilled engineers to dig into this. That is why I want to reach out to you here at eevblog too and want to ask kindly for additional help.

STM32G431 - How can it be that the amount of lines of code in main() affect the time it takes to execute an interrupt routine?

This question is all about the behavior of the DMA1_Channel1_IRQHandler in my application.
I am working with an STM32G431CBT. It is running at 168MHz. Configured with 4 Flash wait states.
NVIC interrupt priorities are set up using CMSIS functions as follows:

Code: [Select]

SystemCoreClockUpdate();
clock = SystemCoreClock;
SysTick_Config(clock/1e3);
 
NVIC_SetPriorityGrouping(0);
 
NVIC_DisableIRQ(SysTick_IRQn);
irq_prio = NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 10, 0);
NVIC_SetPriority(SysTick_IRQn, irq_prio);
NVIC_EnableIRQ(SysTick_IRQn);
 
irq_prio = NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 13, 0);
NVIC_SetPriority(USART2_IRQn, irq_prio);
NVIC_EnableIRQ(USART2_IRQn);
 
irq_prio = NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 0, 0); //highest priority!!!
NVIC_SetPriority(DMA1_Channel1_IRQn, irq_prio);
NVIC_EnableIRQ(DMA1_Channel1_IRQn);

TIM1->CCR5 (on match) triggers the ADC conversion sequence. The ADC triggers DMA1_Channel1 (circular, periph-to-mem). DMA1 is set up to issue an interrupt on transfer complete (TCIE).
The DMA1_Channel1_IRQHandler (has highest priority) then fires and a few things are processed and calculated within it.
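
As a cross-check of the priority setup above, the encoding can be worked through on a PC. The following is a minimal host-side sketch that mirrors the logic of the CMSIS NVIC_EncodePriority helper, assuming 4 implemented priority bits (the usual __NVIC_PRIO_BITS value on STM32 Cortex-M4 parts). The names encode_priority and PRIO_BITS are made up for illustration and are not part of the project.

```c
#include <assert.h>
#include <stdint.h>

#define PRIO_BITS 4u  /* assumption: 4 implemented priority bits */

/* Host-side model of priority encoding: splits the implemented bits into
   preemption and sub-priority fields according to the grouping value. */
uint32_t encode_priority(uint32_t group, uint32_t preempt, uint32_t sub)
{
    uint32_t g = group & 0x07u;
    uint32_t preempt_bits = ((7u - g) > PRIO_BITS) ? PRIO_BITS : (7u - g);
    uint32_t sub_bits = ((g + PRIO_BITS) < 7u) ? 0u : ((g + PRIO_BITS) - 7u);

    /* The two fields together never exceed the implemented bits. */
    assert(preempt_bits + sub_bits <= PRIO_BITS);

    return ((preempt & ((1u << preempt_bits) - 1u)) << sub_bits)
         | (sub & ((1u << sub_bits) - 1u));
}
```

Under these assumptions, grouping 0 makes all four bits preemption bits, so the values 10, 13 and 0 end up in the priority field unchanged and DMA1_Channel1 (priority 0) can preempt both SysTick and USART2.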

My code compiles with no errors and warnings. And as far as I can see it all works as intended.
But here comes the strange thing:

Right at the start of DMA1_Channel1_IRQHandler I set a GPIO pin high. At the end of DMA1_Channel1_IRQHandler I set the same pin low. Looking at the generated high-low pulse with an oscilloscope I can measure the execution time of the DMA1_Channel1_IRQHandler. It takes 1.69µs. But if I shrink the code in the main() by a few lines of code I can see that the execution time gets lower!!! 😮 And if I put just an empty while(1) loop in the main() execution time gets even lower....to its lowest-->1.22µs !!!!😮

How can that be? An ISR should not be affected in such a way by the amount of code in a low priority function just like main().
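
To put the two pulse widths into perspective, they can be converted to core clock cycles. This is a small sketch assuming the 168MHz core clock stated above; ns_to_cycles is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a duration in nanoseconds to CPU cycles, rounded to nearest. */
uint32_t ns_to_cycles(uint32_t ns, uint32_t core_mhz)
{
    assert(core_mhz > 0u);
    return (uint32_t)(((uint64_t)ns * core_mhz + 500u) / 1000u);
}
```

At 168MHz the 1.69µs pulse corresponds to about 284 cycles and the 1.22µs pulse to about 205 cycles, so the difference in question is roughly 79 cycles per interrupt.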

The ISR looks like this:

Code: [Select]

void DMA1_Channel1_IRQHandler(void)
{
    if(DMA1->ISR & DMA_ISR_TCIF1)  //check we are here cause of a valid interrupt occurred
    {
        DMA1->IFCR |= DMA_IFCR_CTCIF1;  //reset interrupt flag

        GPIOC->ODR |= (1U << 14);  //Set Pin high

        ev = (v - my_array[0]);
        vp  = A * ev;
        vi  = B * vi1;
        if(vi > 4100) vi=4100;
        if(vi < 0) vi=0;
        vi1 = vi;
        vd  = C * (ev - (e1v+2));
        e1v = ev;
        cv = vp + vi + vd;
        if(cv < 0) cv=0;
        cv_t = cv/2;
        if(cv_t >  4000 ) cv_t=4000;
        /*------------------------------------------------------*/
        ei = (my_array[1] - i);
        ip  = D * ei;
        ii  = E * ii1;
        if(ii > 4100000 ) ii=4100000;
        if(ii < 0 ) ii=0;
        ii1 = ii;
        id  = F * (ei + (e1i-2));
        e1i = ei;
        ci = ip + ii + id;
        if(ci < 0) ci=0;
        ci_t = ci/1000;
        if(ci_t >  4000 ) ci_t=4000;

        cdac_t = (cv_t - ci_t);

        if(cdac_t > 4000) cdac_t=4000;
        if(cdac_t < 1500) cdac_t=1500;

        GPIOC->ODR &= ~(1U << 14); //Set Pin low
    }
}

All the variables are global. Some of them are "int" and others are "float". All variables used in the ISR are declared volatile. I am using the FPU.
The timing shrinks down to 1.22µs for every call to the handler. I verified that with the pulse trigger function of the oscilloscope. No longer pulses occur.

This is what I have done so far to debug this:

Put the interrupt in a different C file than main() --> Did not make a difference

Using the BSRR register instead of ODR for setting the pin high/low ---> Did not make a difference

Next I tried moving the "GPIOC->BSRR |= (1U << (14+16)); //Set Pin low" line upwards in the ISR code to see whether there is a point beyond which the ISR execution time no longer depends on the amount of code in main(). Of course the measured time gets smaller as set-high and set-low move closer together. But I have to conclude that no such point exists, apart from placing set pin high and set pin low right next to each other.

Code: [Select]

"GPIOC->BSRR |= (1U << (14)); //Set Pin high"
"GPIOC->BSRR |= (1U << (14+16)); //Set Pin low"

But in general...No matter where I place "//Set Pin low" there is always a dependency on the amount of code in main() regarding ISR execution time.

For testing I also told the compiler not to use the FPU. This resulted in a much longer ISR execution time, but the issue was still present.

For testing I disabled all interrupts (even SysTick) besides DMA1_Channel1_IRQHandler --> Issue still present.

But then I found one interesting effect. Telling the compiler to optimize for speed (-Ofast) instead of size (-Os) solved the problem. No matter how much code is in main() the ISR execution time is always 1.2µs.

So, as setting the optimization level to -Ofast is more of a workaround than a solution to the root of the problem, I have to ask once again: with all that information and background, does anybody have any idea what is wrong here? A bug in the compiler/linker or even in the silicon?

Can anybody confirm that behavior?

Alti · « **Reply #1 on:** November 06, 2022, 12:28:18 pm »

Quote from: lordnoxx on November 06, 2022, 11:33:43 am

Configured with 4 Flash wait states.

You are comparing apples and oranges.

The uC has an ARMv7E-M core and this works EXACTLY as specified in the Architecture Reference Manual from ARM, its designer. Precisely down to a single tick, to the finest stage of the pipeline (3-stage, btw). On top of that there is ST (ST has nothing to do with ARM), which tied a 128-bit? flash memory into the buses of that uC (three buses btw), with a cache (check the ST datasheet for depths). The opcodes are 16-bit or 32-bit and you cannot align them to 128-bit boundaries by using ANSI C! If the pipeline needs to be flushed because of some 'if' or a cache miss, you get stalls.

Please provide a relevant, minimal sample that proves your point (unexpected behavior?). Assembly, alignment to flash row boundaries, cache depth etc. Or maybe you can reproduce this behavior with 0 flash wait states and the cache disabled, even better. Or perhaps start by executing the IRQ from SRAM; this has 0 wait states on most STM32 chips and runs at full throttle.

DavidAlfa · « **Reply #2 on:** November 06, 2022, 12:30:40 pm »

Not sure if I understood the issue correctly.
The high pulse width should be the same (calling DMA -> ISR response), but adding more code to the main loop will extend the time between the pulses, since it needs to process more code before calling the DMA again.
However, if it's the high pulse that's getting longer, check that no other ISR is running concurrently with the same or higher priority.

ataradov · « **Reply #3 on:** November 06, 2022, 05:02:25 pm »

I'm not sure about this MCU specifically, but quite often execution time of the code depends on its alignment when executing from flash. This is because of the way prefetch systems work.

So the first thing to check - does the alignment of the IRQ code change when main() changes?

The solution to this is to enable cache or place the critical code in SRAM/TCM.
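
One way to test the alignment hypothesis is to pin the handler to a fixed boundary, so that its start address no longer shifts within a fetch line when main() changes. The sketch below uses the GCC/Clang aligned function attribute. The 16-byte line size is an assumption that needs to be checked against the reference manual, and control_step and is_line_aligned are hypothetical stand-ins, not the real handler.

```c
#include <assert.h>
#include <stdint.h>

#define CODE_ALIGN 16  /* assumption: fetch line size in bytes */

_Static_assert((CODE_ALIGN & (CODE_ALIGN - 1)) == 0,
               "CODE_ALIGN must be a power of two");

/* Stand-in for the time-critical handler; the attribute asks the
   toolchain to start the function on a CODE_ALIGN-byte boundary. */
__attribute__((aligned(CODE_ALIGN)))
int32_t control_step(int32_t setpoint, int32_t measured)
{
    return setpoint - measured;
}

/* Returns 1 if a code address starts on a CODE_ALIGN boundary.
   On Thumb targets bit 0 of a function pointer is the Thumb flag,
   so it is masked off before the check. */
int is_line_aligned(uintptr_t code_addr)
{
    return ((code_addr & ~(uintptr_t)1u) % CODE_ALIGN) == 0u;
}
```

If the execution time stops following the size of main() once the handler is pinned like this, alignment is the likely cause. If it still varies, something else is going on.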

DavidAlfa · « **Reply #4 on:** November 06, 2022, 06:51:20 pm »

The cache (ART Accelerator) is ST's golden-egg-laying hen, the only reason they can work so fast, very close to 0 wait states.
If not using the cache... get a Z80! Performance will be the same

The instant you get an ISR it needs to fetch the code, which will probably cause a cache miss and require a few clocks to fetch the data.
But the first read also fetches the next instructions, so how could the alignment hit so hard as to cause an almost 50% increase?
A few clocks, yeah, but +500ns seems absolutely crazy.
Unless (Like always) he didn't mention something important

ataradov · « **Reply #5 on:** November 06, 2022, 06:56:37 pm »

That close to 0 wait states thing is mostly marketing and only works in some cases.

The code here has no explicit loops, which is what mostly gets affected by alignment. But if some of the variables are float, then low level routines might have loops in them.

The easiest way to check this is to place ISR into SRAM and check the performance. I predict it would be consistent and faster than any previous results.

eutectique · « **Reply #6 on:** November 06, 2022, 07:53:42 pm »

Quote from: lordnoxx on November 06, 2022, 11:33:43 am

But if I shrink the code in the main() by a few lines of code I can see that the execution time gets lower!!! 😮 And if I put just an empty while(1) loop in the main() execution time gets even lower....to its lowest-->1.22µs !!!!😮

You did not say what these lines are, nor shown the variable declarations. To give any definitive answer, it would be nice to see the assembly code of the interrupt handler in two cases: 1.69µs with those few lines in main(), and 1.22µs without.

NorthGuy · « **Reply #7 on:** November 06, 2022, 08:53:13 pm »

I guess the compiler optimizes some code away. For example, if cdac_t is not used anywhere and is not declared as volatile, there's no reason to keep the code which calculates it. If there's any code in "main" which uses cdac_t, then the code calculating cdac_t must be kept in place and its execution will take some time.

peter-h · « **Reply #8 on:** November 06, 2022, 09:36:00 pm »

I see you have used CubeMX to generate all this code.

I "inherited" such a project 2-3 years ago and found the startup code doing weird stuff e.g. initialising the clocks and the PLL and then calling a function which did all that all over again. Notably one of these was at the very end of the .s assembler startup code.

So go through the startup carefully.

Also remember that VTOR needs to be 512-aligned.

ataradov · « **Reply #9 on:** November 06, 2022, 09:44:42 pm »

None of this is related to the original questions in any way.

The original issue is consistent with alignment issues.

lordnoxx · « **Reply #10 on:** November 07, 2022, 07:46:16 am »

Quote from: DavidAlfa on November 06, 2022, 12:30:40 pm

However, if it's the high pulse that's getting longer, check that no other ISR is running concurrently with the same or higher priority.

That's exactly the phenomenon. And if you check my initial post --> I already tried to disable all other interrupts. But the issue still exists.

lordnoxx · « **Reply #11 on:** November 07, 2022, 07:47:42 am »

Quote from: ataradov on November 06, 2022, 05:02:25 pm

I'm not sure about this MCU specifically, but quite often execution time of the code depends on its alignment when executing from flash. This is because of the way prefetch systems work.

So the first thing to check - does the alignment of the IRQ code change when main() changes?

The solution to this is to enable cache or place the critical code in SRAM/TCM.

Thanks for that hint. I will check that and report back.

lordnoxx · « **Reply #12 on:** November 07, 2022, 07:52:46 am »

Quote from: eutectique on November 06, 2022, 07:53:42 pm

Quote from: lordnoxx on November 06, 2022, 11:33:43 am
But if I shrink the code in the main() by a few lines of code I can see that the execution time gets lower!!! 😮 And if I put just an empty while(1) loop in the main() execution time gets even lower....to its lowest-->1.22µs !!!!😮

You did not say what these lines are, nor shown the variable declarations. To give any definitive answer, it would be nice to see the assembly code of the interrupt handler in two cases: 1.69µs with those few lines in main(), and 1.22µs without.

I will do my best to provide more information and code snippets. Unfortunately, parts of the code involved are the intellectual property of my customer, so I am not allowed to post them here. But the ISR C code and assembly code should be no problem. I will provide those.

lordnoxx · « **Reply #13 on:** November 07, 2022, 07:55:32 am »

Quote from: NorthGuy on November 06, 2022, 08:53:13 pm

I guess the compiler optimizes some code away. For example, if cdac_t is not used anywhere and is not declared as volatile, there's no reason to keep the code which calculates it. If there's any code in "main" which uses cdac_t, then the code calculating cdac_t must be kept in place and its execution will take some time.

cdac_t is volatile and the code that uses it is not removed from main(). So the optimizer does other stuff to the code that speeds up the execution and makes it consistent. It has nothing to do with unused variables/code.

lordnoxx · « **Reply #14 on:** November 07, 2022, 07:59:59 am »

Quote from: peter-h on November 06, 2022, 09:36:00 pm

I see you have used CubeMX to generate all this code.

I "inherited" such a project 2-3 years ago and found the startup code doing weird stuff e.g. initialising the clocks and the PLL and then calling a function which did all that all over again. Notably one of these was at the very end of the .s assembler startup code.

So go through the startup carefully.

Also remember that VTOR needs to be 512-aligned.

And how is the startup code related to ISR execution time?
How can I check the VTOR alignment?

ataradov · « **Reply #15 on:** November 07, 2022, 08:12:17 am »

Quote from: lordnoxx on November 07, 2022, 07:59:59 am

How can I check the VTOR alignment?

Don't bother, nothing would work at all if it was not aligned correctly.
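
For reference, the check itself is small. This is a sketch assuming the ARMv7-M rule that the vector table must be aligned to its size rounded up to a power of two (minimum 128 bytes), and assuming 102 device interrupts for this part. Both numbers should be confirmed in the reference manual, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Required vector table alignment in bytes for a core with 16 system
   exceptions plus num_irqs device interrupts, 4 bytes per entry. */
uint32_t vtor_alignment_bytes(uint32_t num_irqs)
{
    uint32_t table_bytes = (16u + num_irqs) * 4u;
    uint32_t align = 128u;  /* architectural minimum: 32 words */

    assert(num_irqs <= 496u);  /* upper limit for ARMv7-M */
    while (align < table_bytes) {
        align <<= 1;
    }
    return align;
}

/* Returns 1 if the given table address satisfies that alignment. */
int vtor_is_aligned(uint32_t vtor, uint32_t num_irqs)
{
    return (vtor & (vtor_alignment_bytes(num_irqs) - 1u)) == 0u;
}
```

With 102 interrupts the table has 118 entries (472 bytes), which rounds up to the 512-byte alignment mentioned earlier. A table at the start of flash (0x08000000) satisfies it.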

lordnoxx · « **Reply #16 on:** November 11, 2022, 09:01:52 am »

Quote from: ataradov on November 06, 2022, 05:02:25 pm

I'm not sure about this MCU specifically, but quite often execution time of the code depends on its alignment when executing from flash. This is because of the way prefetch systems work.

So the first thing to check - does the alignment of the IRQ code change when main() changes?

The solution to this is to enable cache or place the critical code in SRAM/TCM.

Here are two assembler snippets from the ISR with and without main().

With main():

Code: [Select]

08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4:	4b73      	ldr	r3, [pc, #460]	; (8000e94 <DMA1_Channel1_IRQHandler+0x1d0>)
 8000cc6:	681a      	ldr	r2, [r3, #0]
 8000cc8:	0792      	lsls	r2, r2, #30
 8000cca:	b510      	push	{r4, lr}
 8000ccc:	f140 80c2 	bpl.w	8000e54 <DMA1_Channel1_IRQHandler+0x190>
 8000cd0:	685a      	ldr	r2, [r3, #4]
 8000cd2:	4871      	ldr	r0, [pc, #452]	; (8000e98 <DMA1_Channel1_IRQHandler+0x1d4>)
 8000cd4:	4971      	ldr	r1, [pc, #452]	; (8000e9c <DMA1_Channel1_IRQHandler+0x1d8>)

Without main()

Code: [Select]

08000c38 <DMA1_Channel1_IRQHandler>:
 8000c38:	4b73      	ldr	r3, [pc, #460]	; (8000e08 <DMA1_Channel1_IRQHandler+0x1d0>)
 8000c3a:	681a      	ldr	r2, [r3, #0]
 8000c3c:	0792      	lsls	r2, r2, #30
 8000c3e:	b510      	push	{r4, lr}
 8000c40:	f140 80c2 	bpl.w	8000dc8 <DMA1_Channel1_IRQHandler+0x190>
 8000c44:	685a      	ldr	r2, [r3, #4]
 8000c46:	4871      	ldr	r0, [pc, #452]	; (8000e0c <DMA1_Channel1_IRQHandler+0x1d4>)
 8000c48:	4971      	ldr	r1, [pc, #452]	; (8000e10 <DMA1_Channel1_IRQHandler+0x1d8>)

Prefetch, I-Cache and D-Cache are enabled:

Code: [Select]

FLASH->ACR |= FLASH_ACR_PRFTEN;	
FLASH->ACR |= FLASH_ACR_ICEN;	
FLASH->ACR |= FLASH_ACR_DCEN;

[EDIT]: Compiler is set to -Os
[EDIT]: @ataradov: So can you say something about the alignment by looking at the above assembler code listings?
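
As a rough check of the alignment question, the two start addresses from the listings above can be compared against an assumed fetch line width. The sketch below is host-side C; the 8-byte and 16-byte widths are assumptions (the real flash read width should be taken from the reference manual), and offset_in_line is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of an address inside a memory line of line_bytes bytes.
   line_bytes must be a power of two. */
uint32_t offset_in_line(uint32_t addr, uint32_t line_bytes)
{
    assert(line_bytes != 0u && (line_bytes & (line_bytes - 1u)) == 0u);
    return addr & (line_bytes - 1u);
}
```

With main(), the handler at 0x08000cc4 starts 4 bytes into a line under either width. Without main(), the handler at 0x08000c38 starts at offset 8 of a 16-byte line, or exactly on an 8-byte boundary. So the alignment does differ between the two builds, even though the instructions are identical.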

lordnoxx · « **Reply #17 on:** November 11, 2022, 10:59:49 am »

When I set the compiler to -Ofast I get this result:

With main():

Code: [Select]

08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4:	4b80      	ldr	r3, [pc, #512]	; (8000ec8 <DMA1_Channel1_IRQHandler+0x204>)
 8000cc6:	681a      	ldr	r2, [r3, #0]
 8000cc8:	0792      	lsls	r2, r2, #30
 8000cca:	f140 80cc 	bpl.w	8000e66 <DMA1_Channel1_IRQHandler+0x1a2>
 8000cce:	685a      	ldr	r2, [r3, #4]
 8000cd0:	497e      	ldr	r1, [pc, #504]	; (8000ecc <DMA1_Channel1_IRQHandler+0x208>)
 8000cd2:	487f      	ldr	r0, [pc, #508]	; (8000ed0 <DMA1_Channel1_IRQHandler+0x20c>)

Execution time is 1.234us

Without main():

Code: [Select]

08000c38 <DMA1_Channel1_IRQHandler>:
 8000c38:	4b80      	ldr	r3, [pc, #512]	; (8000e3c <DMA1_Channel1_IRQHandler+0x204>)
 8000c3a:	681a      	ldr	r2, [r3, #0]
 8000c3c:	0792      	lsls	r2, r2, #30
 8000c3e:	f140 80cc 	bpl.w	8000dda <DMA1_Channel1_IRQHandler+0x1a2>
 8000c42:	685a      	ldr	r2, [r3, #4]
 8000c44:	497e      	ldr	r1, [pc, #504]	; (8000e40 <DMA1_Channel1_IRQHandler+0x208>)
 8000c46:	487f      	ldr	r0, [pc, #508]	; (8000e44 <DMA1_Channel1_IRQHandler+0x20c>)

Execution time is 1.180us

So two things are interesting here:
1. Compiling with main() and -Ofast results in an ISR execution time as fast as compiling with -Os and without main()
2. With -Ofast, the execution time varies much less between compiling with and without main(): 1.234us --> 1.180us
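
The two observations can be put into numbers by expressing the slowdown relative to the faster build. This is a small sketch using the measurements from this thread (1.69us vs 1.22us with -Os, 1.234us vs 1.180us with -Ofast); increase_permille is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Relative increase of slow_ns over fast_ns in parts per thousand,
   truncated towards zero. */
uint32_t increase_permille(uint32_t fast_ns, uint32_t slow_ns)
{
    assert(fast_ns > 0u && slow_ns >= fast_ns);
    return (uint32_t)(((uint64_t)(slow_ns - fast_ns) * 1000u) / fast_ns);
}
```

With -Os the build with main() is about 38.5% slower than the one without (385 permille), while with -Ofast the difference is only about 4.5% (45 permille).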

lordnoxx · « **Reply #18 on:** November 11, 2022, 11:19:30 am »

Next I experimented with placing the ISR in CCM SRAM according to ST's AN4296.

ISR function prototype:

Code: [Select]

void main_DCDC_control_loop_interrupt(void) __attribute__((section (".ccmram")));

MAP file confirms that the ISR code is placed/copied to CCMRAM at startup by startup code.

Code: [Select]

.ccmram         0x0000000010000000      0x290 load address 0x00000000080001d8
                       0x0000000010000000                . = ALIGN (0x4)
                       0x0000000010000000                _sccmram = .
 *(.ccmram)
 .ccmram        0x0000000010000000      0x290 ./Core/Src/ISR_Handlers.o
                       0x0000000010000000                DMA1_Channel1_IRQHandler
 *(.ccmram*)
                       0x0000000010000290                . = ALIGN (0x4)
                       0x0000000010000290                _eccmram = .

These are the results.

-> with main() and -Os: Execution time is 1.508us

-> without main() and -Os: Execution time is 1.430us

-> without main() and -Ofast: Execution time is 1.354us

Hmmm...I did not expect that outcome. I really expected a much larger reduction in execution time compared to when the ISR code is located in flash memory. Also veeeerryy interesting: with the code in CCM SRAM and optimization for speed (-Ofast), execution time gets down to "only" 1.354us, which is still slower than having the ISR code in flash and optimized with -Ofast (1.180us). I mean...what the f**k?!?!?

voltsandjolts · « **Reply #19 on:** November 11, 2022, 12:17:39 pm »

Maybe try using an internal timer for calculating ISR execution time.
Pin toggling so fast might be causing issues, but that's just a hunch, I'm not familiar with this mcu and APB setup.
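
A sketch of the timer-based measurement, assuming the Cortex-M4 DWT cycle counter is used as the internal timer (any free-running 32-bit counter would do). Only the arithmetic is shown so it can be checked on a PC; elapsed_cycles and cycles_to_ns are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* On target, the two samples would come from a free-running counter,
   e.g. reading DWT->CYCCNT at ISR entry and exit after enabling it
   (assumption: TRCENA set in CoreDebug->DEMCR and CYCCNTENA set in
   DWT->CTRL, as named in the CMSIS headers). */

/* Elapsed cycles between two samples of a 32-bit up-counter.
   Unsigned subtraction is modulo 2^32, so a single wrap is handled. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to nanoseconds, truncated towards zero. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz > 0u);
    return (uint32_t)(((uint64_t)cycles * 1000000000u) / core_hz);
}
```

Measuring this way removes the GPIO write path from the result, which would show whether the pin toggling itself contributes to the variation.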

NorthGuy · « **Reply #20 on:** November 11, 2022, 02:18:20 pm »

Quote from: lordnoxx on November 11, 2022, 11:19:30 am

Hmmm...I did not expect that outcome. I really expected a much larger reduction in execution time compared to when the ISR code is located in flash memory. Also veeeerryy interesting: with the code in CCM SRAM and optimization for speed (-Ofast), execution time gets down to "only" 1.354us, which is still slower than having the ISR code in flash and optimized with -Ofast (1.180us). I mean...what the f**k?!?!?

Usually MCUs expect execution from flash, hence their bus structure is tailored for such execution. When executing from RAM, command fetches and data access compete for the same bus, therefore you get bus congestion and slower execution.

There's no point in looking at a few lines of assembler; post the whole ISR including prologue and epilogue.

lordnoxx · « **Reply #21 on:** November 11, 2022, 02:39:55 pm »

This is all I can get from the .list file

Code: [Select]

08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4:	4b81      	ldr	r3, [pc, #516]	; (8000ecc <DMA1_Channel1_IRQHandler+0x208>)
 8000cc6:	681a      	ldr	r2, [r3, #0]
 8000cc8:	0792      	lsls	r2, r2, #30
 8000cca:	b5f0      	push	{r4, r5, r6, r7, lr}
 8000ccc:	f140 80fc 	bpl.w	8000ec8 <DMA1_Channel1_IRQHandler+0x204>
 8000cd0:	685a      	ldr	r2, [r3, #4]
 8000cd2:	4c7f      	ldr	r4, [pc, #508]	; (8000ed0 <DMA1_Channel1_IRQHandler+0x20c>)
 8000cd4:	497f      	ldr	r1, [pc, #508]	; (8000ed4 <DMA1_Channel1_IRQHandler+0x210>)
 8000cd6:	4880      	ldr	r0, [pc, #512]	; (8000ed8 <DMA1_Channel1_IRQHandler+0x214>)
 8000cd8:	4f80      	ldr	r7, [pc, #512]	; (8000edc <DMA1_Channel1_IRQHandler+0x218>)
 8000cda:	f042 0202 	orr.w	r2, r2, #2
 8000cde:	605a      	str	r2, [r3, #4]
 8000ce0:	4a7f      	ldr	r2, [pc, #508]	; (8000ee0 <DMA1_Channel1_IRQHandler+0x21c>)
 8000ce2:	6993      	ldr	r3, [r2, #24]
 8000ce4:	f443 4380 	orr.w	r3, r3, #16384	; 0x4000
 8000ce8:	6193      	str	r3, [r2, #24]
 8000cea:	4b7e      	ldr	r3, [pc, #504]	; (8000ee4 <DMA1_Channel1_IRQHandler+0x220>)
 8000cec:	681b      	ldr	r3, [r3, #0]
 8000cee:	8822      	ldrh	r2, [r4, #0]
 8000cf0:	b292      	uxth	r2, r2
 8000cf2:	1a9b      	subs	r3, r3, r2
 8000cf4:	600b      	str	r3, [r1, #0]
 8000cf6:	4b7c      	ldr	r3, [pc, #496]	; (8000ee8 <DMA1_Channel1_IRQHandler+0x224>)
 8000cf8:	681b      	ldr	r3, [r3, #0]
 8000cfa:	680a      	ldr	r2, [r1, #0]
 8000cfc:	4353      	muls	r3, r2
 8000cfe:	ee07 3a90 	vmov	s15, r3
 8000d02:	eef8 7ae7 	vcvt.f32.s32	s15, s15
 8000d06:	4b79      	ldr	r3, [pc, #484]	; (8000eec <DMA1_Channel1_IRQHandler+0x228>)
 8000d08:	edc0 7a00 	vstr	s15, [r0]
 8000d0c:	681b      	ldr	r3, [r3, #0]
 8000d0e:	680d      	ldr	r5, [r1, #0]
 8000d10:	4a77      	ldr	r2, [pc, #476]	; (8000ef0 <DMA1_Channel1_IRQHandler+0x22c>)
 8000d12:	436b      	muls	r3, r5
 8000d14:	ee07 3a90 	vmov	s15, r3
 8000d18:	ed92 7a00 	vldr	s14, [r2]
 8000d1c:	4b75      	ldr	r3, [pc, #468]	; (8000ef4 <DMA1_Channel1_IRQHandler+0x230>)
 8000d1e:	eef8 7ae7 	vcvt.f32.s32	s15, s15
 8000d22:	2500      	movs	r5, #0
 8000d24:	ee77 7a87 	vadd.f32	s15, s15, s14
 8000d28:	edc3 7a00 	vstr	s15, [r3]
 8000d2c:	ed93 7a00 	vldr	s14, [r3]
 8000d30:	eddf 7a71 	vldr	s15, [pc, #452]	; 8000ef8 <DMA1_Channel1_IRQHandler+0x234>
 8000d34:	eeb4 7ae7 	vcmpe.f32	s14, s15
 8000d38:	eef1 fa10 	vmrs	APSR_nzcv, fpscr
 8000d3c:	bfc8      	it	gt
 8000d3e:	edc3 7a00 	vstrgt	s15, [r3]
 8000d42:	edd3 7a00 	vldr	s15, [r3]
 8000d46:	eef5 7ac0 	vcmpe.f32	s15, #0.0
 8000d4a:	eef1 fa10 	vmrs	APSR_nzcv, fpscr
 8000d4e:	bf48      	it	mi
 8000d50:	601d      	strmi	r5, [r3, #0]
 8000d52:	681e      	ldr	r6, [r3, #0]
 8000d54:	6016      	str	r6, [r2, #0]
 8000d56:	4e69      	ldr	r6, [pc, #420]	; (8000efc <DMA1_Channel1_IRQHandler+0x238>)
 8000d58:	680a      	ldr	r2, [r1, #0]
 8000d5a:	f8d6 c000 	ldr.w	ip, [r6]
 8000d5e:	683f      	ldr	r7, [r7, #0]
 8000d60:	eba2 020c 	sub.w	r2, r2, ip
 8000d64:	437a      	muls	r2, r7
 8000d66:	ee07 2a90 	vmov	s15, r2
 8000d6a:	4a65      	ldr	r2, [pc, #404]	; (8000f00 <DMA1_Channel1_IRQHandler+0x23c>)
 8000d6c:	eef8 7ae7 	vcvt.f32.s32	s15, s15
 8000d70:	edc2 7a00 	vstr	s15, [r2]
 8000d74:	6809      	ldr	r1, [r1, #0]
 8000d76:	6031      	str	r1, [r6, #0]
 8000d78:	edd0 7a00 	vldr	s15, [r0]
 8000d7c:	edd3 6a00 	vldr	s13, [r3]
 8000d80:	ed92 7a00 	vldr	s14, [r2]
 8000d84:	4b5f      	ldr	r3, [pc, #380]	; (8000f04 <DMA1_Channel1_IRQHandler+0x240>)
 8000d86:	4860      	ldr	r0, [pc, #384]	; (8000f08 <DMA1_Channel1_IRQHandler+0x244>)
 8000d88:	4960      	ldr	r1, [pc, #384]	; (8000f0c <DMA1_Channel1_IRQHandler+0x248>)
 8000d8a:	ee77 7aa6 	vadd.f32	s15, s15, s13
 8000d8e:	ee77 7a87 	vadd.f32	s15, s15, s14
 8000d92:	eefd 7ae7 	vcvt.s32.f32	s15, s15
 8000d96:	edc3 7a00 	vstr	s15, [r3]
 8000d9a:	681a      	ldr	r2, [r3, #0]
 8000d9c:	2a00      	cmp	r2, #0
 8000d9e:	bfbc      	itt	lt
 8000da0:	2200      	movlt	r2, #0
 8000da2:	601a      	strlt	r2, [r3, #0]
 8000da4:	681b      	ldr	r3, [r3, #0]
 8000da6:	6003      	str	r3, [r0, #0]
 8000da8:	6803      	ldr	r3, [r0, #0]
 8000daa:	4a59      	ldr	r2, [pc, #356]	; (8000f10 <DMA1_Channel1_IRQHandler+0x24c>)
 8000dac:	f5b3 6f7a 	cmp.w	r3, #4000	; 0xfa0
 8000db0:	bfc4      	itt	gt
 8000db2:	f44f 637a 	movgt.w	r3, #4000	; 0xfa0
 8000db6:	6003      	strgt	r3, [r0, #0]
 8000db8:	8863      	ldrh	r3, [r4, #2]
 8000dba:	6812      	ldr	r2, [r2, #0]
 8000dbc:	4c55      	ldr	r4, [pc, #340]	; (8000f14 <DMA1_Channel1_IRQHandler+0x250>)
 8000dbe:	b29b      	uxth	r3, r3
 8000dc0:	1a9b      	subs	r3, r3, r2
 8000dc2:	600b      	str	r3, [r1, #0]
 8000dc4:	4b54      	ldr	r3, [pc, #336]	; (8000f18 <DMA1_Channel1_IRQHandler+0x254>)
 8000dc6:	681b      	ldr	r3, [r3, #0]
 8000dc8:	680a      	ldr	r2, [r1, #0]
 8000dca:	4353      	muls	r3, r2
 8000dcc:	ee07 3a90 	vmov	s15, r3
 8000dd0:	eef8 7ae7 	vcvt.f32.s32	s15, s15
 8000dd4:	4b51      	ldr	r3, [pc, #324]	; (8000f1c <DMA1_Channel1_IRQHandler+0x258>)
 8000dd6:	edc4 7a00 	vstr	s15, [r4]
 8000dda:	681b      	ldr	r3, [r3, #0]
 8000ddc:	680e      	ldr	r6, [r1, #0]
 8000dde:	4a50      	ldr	r2, [pc, #320]	; (8000f20 <DMA1_Channel1_IRQHandler+0x25c>)
 8000de0:	4373      	muls	r3, r6
 8000de2:	ee07 3a90 	vmov	s15, r3
 8000de6:	ed92 7a00 	vldr	s14, [r2]
 8000dea:	4b4e      	ldr	r3, [pc, #312]	; (8000f24 <DMA1_Channel1_IRQHandler+0x260>)
 8000dec:	4e4e      	ldr	r6, [pc, #312]	; (8000f28 <DMA1_Channel1_IRQHandler+0x264>)
 8000dee:	eef8 7ae7 	vcvt.f32.s32	s15, s15
 8000df2:	ee77 7a87 	vadd.f32	s15, s15, s14
 8000df6:	edc3 7a00 	vstr	s15, [r3]
 8000dfa:	ed93 7a00 	vldr	s14, [r3]
 8000dfe:	eddf 7a4b 	vldr	s15, [pc, #300]	; 8000f2c <DMA1_Channel1_IRQHandler+0x268>
 8000e02:	eeb4 7ae7 	vcmpe.f32	s14, s15
 8000e06:	eef1 fa10 	vmrs	APSR_nzcv, fpscr
 8000e0a:	bfc8      	it	gt
 8000e0c:	edc3 7a00 	vstrgt	s15, [r3]
 8000e10:	edd3 7a00 	vldr	s15, [r3]
 8000e14:	eef5 7ac0 	vcmpe.f32	s15, #0.0
 8000e18:	eef1 fa10 	vmrs	APSR_nzcv, fpscr
 8000e1c:	bf48      	it	mi
 8000e1e:	601d      	strmi	r5, [r3, #0]
 8000e20:	681d      	ldr	r5, [r3, #0]
 8000e22:	6015      	str	r5, [r2, #0]
 8000e24:	4d42      	ldr	r5, [pc, #264]	; (8000f30 <DMA1_Channel1_IRQHandler+0x26c>)
 8000e26:	680a      	ldr	r2, [r1, #0]
 8000e28:	682f      	ldr	r7, [r5, #0]
 8000e2a:	6836      	ldr	r6, [r6, #0]
 8000e2c:	1bd2      	subs	r2, r2, r7
 8000e2e:	4372      	muls	r2, r6
 8000e30:	ee07 2a90 	vmov	s15, r2
 8000e34:	4a3f      	ldr	r2, [pc, #252]	; (8000f34 <DMA1_Channel1_IRQHandler+0x270>)
 8000e36:	eef8 7ae7 	vcvt.f32.s32	s15, s15
 8000e3a:	edc2 7a00 	vstr	s15, [r2]
 8000e3e:	6809      	ldr	r1, [r1, #0]
 8000e40:	6029      	str	r1, [r5, #0]
 8000e42:	edd4 7a00 	vldr	s15, [r4]
 8000e46:	edd3 6a00 	vldr	s13, [r3]
 8000e4a:	ed92 7a00 	vldr	s14, [r2]
 8000e4e:	4b3a      	ldr	r3, [pc, #232]	; (8000f38 <DMA1_Channel1_IRQHandler+0x274>)
 8000e50:	ee77 7aa6 	vadd.f32	s15, s15, s13
 8000e54:	ee77 7a87 	vadd.f32	s15, s15, s14
 8000e58:	eefd 7ae7 	vcvt.s32.f32	s15, s15
 8000e5c:	edc3 7a00 	vstr	s15, [r3]
 8000e60:	681a      	ldr	r2, [r3, #0]
 8000e62:	2a00      	cmp	r2, #0
 8000e64:	bfbc      	itt	lt
 8000e66:	2200      	movlt	r2, #0
 8000e68:	601a      	strlt	r2, [r3, #0]
 8000e6a:	681a      	ldr	r2, [r3, #0]
 8000e6c:	f44f 737a 	mov.w	r3, #1000	; 0x3e8
 8000e70:	fb92 f2f3 	sdiv	r2, r2, r3
 8000e74:	4b31      	ldr	r3, [pc, #196]	; (8000f3c <DMA1_Channel1_IRQHandler+0x278>)
 8000e76:	601a      	str	r2, [r3, #0]
 8000e78:	681a      	ldr	r2, [r3, #0]
 8000e7a:	f5b2 6f7a 	cmp.w	r2, #4000	; 0xfa0
 8000e7e:	bfc4      	itt	gt
 8000e80:	f44f 627a 	movgt.w	r2, #4000	; 0xfa0
 8000e84:	601a      	strgt	r2, [r3, #0]
 8000e86:	6802      	ldr	r2, [r0, #0]
 8000e88:	681b      	ldr	r3, [r3, #0]
 8000e8a:	1ad2      	subs	r2, r2, r3
 8000e8c:	4b2c      	ldr	r3, [pc, #176]	; (8000f40 <DMA1_Channel1_IRQHandler+0x27c>)
 8000e8e:	601a      	str	r2, [r3, #0]
 8000e90:	681a      	ldr	r2, [r3, #0]
 8000e92:	f5b2 6f7a 	cmp.w	r2, #4000	; 0xfa0
 8000e96:	bfc4      	itt	gt
 8000e98:	f44f 627a 	movgt.w	r2, #4000	; 0xfa0
 8000e9c:	601a      	strgt	r2, [r3, #0]
 8000e9e:	6819      	ldr	r1, [r3, #0]
 8000ea0:	f240 52db 	movw	r2, #1499	; 0x5db
 8000ea4:	4291      	cmp	r1, r2
 8000ea6:	bfdc      	itt	le
 8000ea8:	f240 52dc 	movwle	r2, #1500	; 0x5dc
 8000eac:	601a      	strle	r2, [r3, #0]
 8000eae:	4a25      	ldr	r2, [pc, #148]	; (8000f44 <DMA1_Channel1_IRQHandler+0x280>)
 8000eb0:	6812      	ldr	r2, [r2, #0]
 8000eb2:	681b      	ldr	r3, [r3, #0]
 8000eb4:	ea43 5302 	orr.w	r3, r3, r2, lsl #20
 8000eb8:	4a23      	ldr	r2, [pc, #140]	; (8000f48 <DMA1_Channel1_IRQHandler+0x284>)
 8000eba:	65d3      	str	r3, [r2, #92]	; 0x5c
 8000ebc:	f102 4278 	add.w	r2, r2, #4160749568	; 0xf8000000
 8000ec0:	6993      	ldr	r3, [r2, #24]
 8000ec2:	f043 4380 	orr.w	r3, r3, #1073741824	; 0x40000000
 8000ec6:	6193      	str	r3, [r2, #24]
 8000ec8:	bdf0      	pop	{r4, r5, r6, r7, pc}
 8000eca:	bf00      	nop
 8000ecc:	40020000 	.word	0x40020000
 8000ed0:	20001558 	.word	0x20001558
 8000ed4:	20001524 	.word	0x20001524
 8000ed8:	2000155c 	.word	0x2000155c
 8000edc:	200014f8 	.word	0x200014f8
 8000ee0:	48000800 	.word	0x48000800
 8000ee4:	20001560 	.word	0x20001560
 8000ee8:	200000cc 	.word	0x200000cc
 8000eec:	200000c4 	.word	0x200000c4
 8000ef0:	20001554 	.word	0x20001554
 8000ef4:	20001550 	.word	0x20001550
 8000ef8:	45802000 	.word	0x45802000
 8000efc:	2000151c 	.word	0x2000151c
 8000f00:	2000154c 	.word	0x2000154c
 8000f04:	20001508 	.word	0x20001508
 8000f08:	2000150c 	.word	0x2000150c
 8000f0c:	20001520 	.word	0x20001520
 8000f10:	20001548 	.word	0x20001548
 8000f14:	20001544 	.word	0x20001544
 8000f18:	200000c8 	.word	0x200000c8
 8000f1c:	200000c0 	.word	0x200000c0
 8000f20:	20001540 	.word	0x20001540
 8000f24:	2000153c 	.word	0x2000153c
 8000f28:	200014f4 	.word	0x200014f4
 8000f2c:	4a7a3e80 	.word	0x4a7a3e80
 8000f30:	20001518 	.word	0x20001518
 8000f34:	20001538 	.word	0x20001538
 8000f38:	20001500 	.word	0x20001500
 8000f3c:	20001504 	.word	0x20001504
 8000f40:	200014fc 	.word	0x200014fc
 8000f44:	200000d0 	.word	0x200000d0
 8000f48:	50000800 	.word	0x50000800

errorprone · « **Reply #22 on:** November 11, 2022, 02:54:02 pm »

Not sure if you’ve done this but the vector table also needs to be moved to CCMRAM otherwise the function pointer to the ISR is still read from flash.

lordnoxx · « **Reply #23 on:** November 11, 2022, 03:15:57 pm »

Quote from: errorprone on November 11, 2022, 02:54:02 pm

Not sure if you’ve done this but the vector table also needs to be moved to CCMRAM otherwise the function pointer to the ISR is still read from flash.

Good point. No, the vector table still resides in flash memory.
But nonetheless, once execution of the ISR starts, the time to walk through the ISR instructions should not depend on the memory the vector table sits in. Right?
I am not concerned about execution delay, i.e. the time from the event to the start of execution. I am wondering why the execution time, from start to end of the ISR, is not consistent when other code in the project is changed, added or removed. And while investigating that I found other suspicious behavior that I don't understand, e.g. execution from flash with -Ofast is faster than execution from CCM RAM with -Ofast.

And I recently found another suspicious behavior ---> apparently the execution time also depends on which part of the code is interrupted when the ISR fires. I am working on a simple example to show you that here in the thread.

Siwastaja · « **Reply #24 on:** November 11, 2022, 03:55:22 pm »

Quote from: NorthGuy on November 11, 2022, 02:18:20 pm

Usually MCUs expect execution from flash, hence their bus structure is tailored for such execution. When executing from RAM, command fetches and data access compete for the same bus, therefore you get bus congestion and slower execution.

But the CCM / ITCM exists for this reason, it should have an interface of its own, directly to CPU.

Given fairly linear code with few loops, flash with prefetch can be as fast as CCM RAM, though, so clearly that did not help the OP.

Quote from: lordnoxx on November 11, 2022, 10:59:49 am

With main():
...
Without main():

The asm snippets posted look identical, but did I miss it or did you post the whole function in both cases (with and without main() differences, same optimization level)? Are they exactly the same or not? That is literally the first thing to check.

