STM32G431 - How can it be that the amount of lines of code in main() affect ISR

lordnoxx · « **on:** November 06, 2022, 11:33:43 am »

Hi folks,

I am working on a project based on an STM32G431. While developing the firmware I ran into a strange phenomenon that should not be possible. After fiddling with it and doing my own debugging and testing to understand what is going on, I had to partly give up, as my knowledge of the deeper STM32 secrets is limited. I am a power electronics engineer and embedded firmware development is not my daily profession. Although I have quite some experience with coding and microcontrollers (8051, PIC, ATmega, STM32), I have reached a point where I need help to find the root cause of the issue.

So I went to the ST Community Forum and asked for help there: https://community.st.com/s/question/0D73W000001nmmASAQ/detail?fromEmail=1&s1oid=00Db0000000YtG6&s1nid=0DB0X000000DYbd&s1uid=0053W000001UI26&s1ext=0&emkind=chatterCommentNotification&emtm=1667370141288

I am still full of hope and confidence that the people there at ST can help me. But I feel I have to reach a bigger crowd of skilled engineers to dig into this. That is why I want to reach out to you here at eevblog too and want to ask kindly for additional help.

STM32G431 - How can it be that the number of lines of code in main() affects the time it takes to execute an interrupt routine?

This question is all about the behavior of the DMA1_Channel1_IRQHandler in my application.
I am working with an STM32G431CBT. It is running at 168 MHz. Configured with 4 Flash wait states.
NVIC interrupt priorities are set up using CMSIS functions as follows:

Code:

SystemCoreClockUpdate();
clock = SystemCoreClock;
SysTick_Config(clock/1e3);
 
NVIC_SetPriorityGrouping(0);
 
NVIC_DisableIRQ(SysTick_IRQn);
irq_prio = NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 10, 0);
NVIC_SetPriority(SysTick_IRQn, irq_prio);
NVIC_EnableIRQ(SysTick_IRQn);
 
irq_prio = NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 13, 0);
NVIC_SetPriority(USART2_IRQn, irq_prio);
NVIC_EnableIRQ(USART2_IRQn);
 
irq_prio = NVIC_EncodePriority(NVIC_GetPriorityGrouping(), 0, 0); //highest priority!!!
NVIC_SetPriority(DMA1_Channel1_IRQn, irq_prio);
NVIC_EnableIRQ(DMA1_Channel1_IRQn);

A TIM1->CCR5 compare match triggers the ADC conversion sequence. The ADC triggers DMA1_Channel1 (circular, peripheral-to-memory). DMA1 is set up to issue an interrupt on transfer complete (TCIE).
The DMA1_Channel1_IRQHandler (highest priority) then fires, and a few things are processed and calculated within it.

My code compiles with no errors and warnings. And as far as I can see it all works as intended.
But here comes the strange thing:

Right at the start of DMA1_Channel1_IRQHandler I set a GPIO pin high. At the end of DMA1_Channel1_IRQHandler I set the same pin low. Looking at the generated high-low pulse with an oscilloscope I can measure the execution time of the DMA1_Channel1_IRQHandler. It takes 1.69µs. But if I shrink the code in the main() by a few lines of code I can see that the execution time gets lower!!! 😮 And if I put just an empty while(1) loop in the main() execution time gets even lower....to its lowest-->1.22µs !!!!😮

How can that be? An ISR should not be affected in this way by the amount of code in a low-priority function such as main().
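
For scale, the two pulse widths can be converted to core clock cycles. This is a minimal host-side sketch, assuming the 168 MHz core clock stated above; the helper name ns_to_cycles is made up for illustration:

```c
#include <stdint.h>

/* Convert a pulse width in nanoseconds to core clock cycles,
 * rounded to the nearest cycle. core_mhz is the core clock in MHz. */
static inline uint32_t ns_to_cycles(uint32_t pulse_ns, uint32_t core_mhz)
{
    /* cycles = ns * MHz / 1000; the 64-bit intermediate avoids overflow */
    return (uint32_t)(((uint64_t)pulse_ns * core_mhz + 500u) / 1000u);
}

/* Values from this post: 168 MHz core clock, 1.69 us and 1.22 us pulses.
 *   ns_to_cycles(1690, 168)  -> 284 cycles
 *   ns_to_cycles(1220, 168)  -> 205 cycles
 */
```

So the variation between the two builds is roughly 79 core clock cycles per ISR call.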

The ISR looks like this:

Code:

    void DMA1_Channel1_IRQHandler(void)
    {
     
     
     if(DMA1->ISR & DMA_ISR_TCIF1)  //check we are here cause of a valid interrupt occurred
     {
    	DMA1->IFCR |= DMA_IFCR_CTCIF1;	//reset interrupt flag
     
    	GPIOC->ODR |= (1U << 14);	//Set Pin high
     
    	ev = (v - my_array[0]);
    	vp  = A * ev;					
    	vi  = B * vi1;			
    	if(vi > 4100) vi=4100;					
    	if(vi < 0) vi=0;
    	vi1 = vi;
    	vd  = C * (ev - (e1v+2));			
    	e1v = ev;						
    	cv = vp + vi + vd;				
    	if(cv < 0) cv=0;
    	cv_t = cv/2;	
    	if(cv_t >  4000 ) cv_t=4000;
     	/*------------------------------------------------------*/
    	ei = (my_array[1] - i);
    	ip  = D * ei;					
    	ii  = E * ii1;			
    	if(ii > 4100000 ) ii=4100000;
    	if(ii < 0 ) ii=0;
    	ii1 = ii;
    	id  = F * (ei + (e1i-2));			
    	e1i = ei;						
    	ci = ip + ii + id;				
    	if(ci < 0) ci=0;
    	ci_t = ci/1000;					
    	if(ci_t >  4000 ) ci_t=4000;
     
    	cdac_t = (cv_t - ci_t);
     
    	if(cdac_t > 4000) cdac_t=4000;
    	if(cdac_t < 1500) cdac_t=1500;
     
    	GPIOC->ODR &= ~(1U << 14); //Set Pin low
      
     }
    }

All the variables are global. Some of them are "int" and others are "float". All variables used in the ISR are declared volatile. I am using the FPU.
The timing shrinks down to 1.22µs for every call to the handler. I verified that with the pulse trigger function of the oscilloscope. No longer pulses occur.

This is what I have done so far to debug this:

Put the interrupt in a different C file than main() --> Did not make a difference

Using the BSRR register instead of ODR for setting the pin high/low ---> Did not make a difference

Next I tried moving the "GPIOC->BSRR |= (1U << (14+16)); //Set Pin low" line upwards in the ISR code, to see whether there is a point from which the ISR execution time no longer depends on the amount of code in main(). Of course the measured time gets smaller as the set-high and set-low statements move closer together. But I have to conclude that no such point exists, except when set pin high and set pin low are placed right next to each other:

Code:

GPIOC->BSRR |= (1U << (14)); //Set Pin high
GPIOC->BSRR |= (1U << (14+16)); //Set Pin low

But in general...No matter where I place "//Set Pin low" there is always a dependency on the amount of code in main() regarding ISR execution time.
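
A side note on the BSRR variant: BSRR is a write-only set/reset register, so a plain write is enough and the |= only adds a read before the write. This is a minimal sketch, assuming the register is passed in by address so it can also be tried on a host; pin_set and pin_reset are hypothetical helper names, not part of any ST library:

```c
#include <stdint.h>

/* Set a pin via BSRR: writing 1 to bit n (0..15) sets ODR bit n. */
static inline void pin_set(volatile uint32_t *bsrr, unsigned pin)
{
    *bsrr = 1u << pin;            /* plain write, no read-modify-write */
}

/* Reset a pin via BSRR: writing 1 to bit n+16 clears ODR bit n. */
static inline void pin_reset(volatile uint32_t *bsrr, unsigned pin)
{
    *bsrr = 1u << (pin + 16u);
}

/* On the target this would be used as, e.g.:
 *   pin_set(&GPIOC->BSRR, 14);
 *   ... code to be timed ...
 *   pin_reset(&GPIOC->BSRR, 14);
 */
```

This does not explain the dependency on main(), it only removes one bus read from each end of the measured pulse.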

For testing I also told the compiler not to use the FPU. This resulted in a much longer ISR execution time, but the issue was still present.

For testing I disabled all interrupts (even SysTick) besides DMA1_Channel1_IRQHandler --> Issue still present.

But then I found one interesting effect. Telling the compiler to optimize for speed (-Ofast) instead of size (-Os) solved the problem. No matter how much code is in main() the ISR execution time is always 1.2µs.

Since setting the optimization level to -Ofast is more of a workaround than a solution to the root of the problem, I have to ask once again: with all that information and background, does anybody have any idea what is wrong here? A bug in the compiler/linker or even in the silicon?

Can anybody confirm that behavior?

Alti · « **Reply #1 on:** November 06, 2022, 12:28:18 pm »

Quote from: lordnoxx on November 06, 2022, 11:33:43 am

Configured with 4 Flash wait states.

You are comparing apples and oranges.

The uC has an ARMv7E-M core and this works EXACTLY as specified in the Architecture Reference Manual from ARM, its designer. Precisely down to a single tick, to the finest stage of the pipeline (3-stage, btw). On top of that there is ST (ST has nothing to do with ARM), which tied a 128-bit? flash memory into the buses of that uC (three buses btw), with a cache (check the ST datasheet for depths). The opcodes are 16-bit or 32-bit and you cannot align them to 128-bit boundaries using ANSI C! If the pipeline needs to be flushed because of some "if", or there is a cache miss, you get stalls.

Please provide a relevant, minimal sample that proves your point (unexpected behavior?): assembly, alignment to flash row boundaries, cache depth etc. Or maybe you can reproduce this behavior with 0 flash wait states and the cache disabled - even better. Or perhaps start by executing the IRQ handler from SRAM; that has 0 wait states on most STM32 chips and runs at full throttle.

DavidAlfa · « **Reply #2 on:** November 06, 2022, 12:30:40 pm »

Not sure if I understood the issue correctly.
The high pulse width should stay the same (DMA -> ISR response), but adding more code to the main loop will extend the time between the pulses, since more code has to be processed before the DMA is triggered again.
However if it's the high pulse that's getting larger, check there are no other ISRs running concurrently with the same or higher priority.

ataradov · « **Reply #3 on:** November 06, 2022, 05:02:25 pm »

I'm not sure about this MCU specifically, but quite often execution time of the code depends on its alignment when executing from flash. This is because of the way prefetch systems work.

So the first thing to check - does the alignment of the IRQ code change when main() changes?

The solution to this is to enable cache or place the critical code in SRAM/TCM.

DavidAlfa · « **Reply #4 on:** November 06, 2022, 06:51:20 pm »

The cache (ART Accelerator) is ST's golden-egg-laying hen, the only reason they can run so fast, very close to 0 wait states.
If not using the cache... get a Z80! Performance will be the same

The instant an ISR fires, its code needs to be fetched, which will probably cause a cache miss and take a few clocks.
But the first read also fetches the following instructions, so how would the alignment hit so hard as to cause a roughly 40% increase?
A few clocks, yeah, but +500ns seems absolutely crazy.
Unless (like always) he didn't mention something important

ataradov · « **Reply #5 on:** November 06, 2022, 06:56:37 pm »

That close to 0 wait states thing is mostly marketing and only works in some cases.

The code here has no explicit loops, which is what mostly gets affected by alignment. But if some of the variables are float, then low level routines might have loops in them.

The easiest way to check this is to place ISR into SRAM and check the performance. I predict it would be consistent and faster than any previous results.

eutectique · « **Reply #6 on:** November 06, 2022, 07:53:42 pm »

Quote from: lordnoxx on November 06, 2022, 11:33:43 am

But if I shrink the code in the main() by a few lines of code I can see that the execution time gets lower!!! 😮 And if I put just an empty while(1) loop in the main() execution time gets even lower....to its lowest-->1.22µs !!!!😮

You did not say what these lines are, nor did you show the variable declarations. To give any definitive answer, it would be nice to see the assembly code of the interrupt handler in two cases: 1.69µs with those few lines in main(), and 1.22µs without.

NorthGuy · « **Reply #7 on:** November 06, 2022, 08:53:13 pm »

I guess the compiler optimizes some code away. For example, if cdac_t is not used anywhere and is not declared as volatile, there's no reason to keep the code which calculates it. If there's any code in "main" which uses cdac_t, then the code calculating cdac_t must be kept in place and its execution will take some time.

peter-h · « **Reply #8 on:** November 06, 2022, 09:36:00 pm »

I see you have used CubeMX to generate all this code.

I "inherited" such a project 2-3 years ago and found the startup code doing weird stuff e.g. initialising the clocks and the PLL and then calling a function which did all that all over again. Notably one of these was at the very end of the .s assembler startup code.

So go through the startup carefully.

Also remember that VTOR needs to be 512-aligned.

ataradov · « **Reply #9 on:** November 06, 2022, 09:44:42 pm »

None of this is related to the original questions in any way.

The original issue is consistent with alignment issues.

lordnoxx · « **Reply #10 on:** November 07, 2022, 07:46:16 am »

Quote from: DavidAlfa on November 06, 2022, 12:30:40 pm

However if it's the high pulse that's getting larger, check there are no other ISRs running concurrently with the same or higher priority.

That's exactly the phenomenon. And if you check my initial post --> I already tried to disable all other interrupts. But the issue still exists.

lordnoxx · « **Reply #11 on:** November 07, 2022, 07:47:42 am »

Quote from: ataradov on November 06, 2022, 05:02:25 pm

I'm not sure about this MCU specifically, but quite often execution time of the code depends on its alignment when executing from flash. This is because of the way prefetch systems work.

So the first thing to check - does the alignment of the IRQ code change when main() changes?

The solution to this is to enable cache or place the critical code in SRAM/TCM.

Thanks for that hint. I will check that and report back.

lordnoxx · « **Reply #12 on:** November 07, 2022, 07:52:46 am »

Quote from: eutectique on November 06, 2022, 07:53:42 pm

Quote from: lordnoxx on November 06, 2022, 11:33:43 am
But if I shrink the code in the main() by a few lines of code I can see that the execution time gets lower!!! 😮 And if I put just an empty while(1) loop in the main() execution time gets even lower....to its lowest-->1.22µs !!!!😮

You did not say what these lines are, nor did you show the variable declarations. To give any definitive answer, it would be nice to see the assembly code of the interrupt handler in two cases: 1.69µs with those few lines in main(), and 1.22µs without.

I will do my best to provide more information and code snippets. Unfortunately parts of the involved code are intellectual property of my customer, so I am not allowed to post them here. But the ISR C code and assembly code should be no problem. I will provide that.

lordnoxx · « **Reply #13 on:** November 07, 2022, 07:55:32 am »

Quote from: NorthGuy on November 06, 2022, 08:53:13 pm

I guess the compiler optimizes some code away. For example, if cdac_t is not used anywhere and is not declared as volatile, there's no reason to keep the code which calculates it. If there's any code in "main" which uses cdac_t, then the code calculating cdac_t must be kept in place and its execution will take some time.

cdac_t is volatile and the code that uses it is not removed from main(). So the optimizer does other stuff to the code that speeds up the execution and makes it consistent. It has nothing to do with unused variables/code.

lordnoxx · « **Reply #14 on:** November 07, 2022, 07:59:59 am »

Quote from: peter-h on November 06, 2022, 09:36:00 pm

I see you have used CubeMX to generate all this code.

I "inherited" such a project 2-3 years ago and found the startup code doing weird stuff e.g. initialising the clocks and the PLL and then calling a function which did all that all over again. Notably one of these was at the very end of the .s assembler startup code.

So go through the startup carefully.

Also remember that VTOR needs to be 512-aligned.

And how is the startup code related to ISR execution time?
How can I check the VTOR alignment?

ataradov · « **Reply #15 on:** November 07, 2022, 08:12:17 am »

Quote from: lordnoxx on November 07, 2022, 07:59:59 am

How can I check the VTOR alignment?

Don't bother, nothing would work at all if it was not aligned correctly.
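
For completeness, the alignment can still be checked with a few lines. This is a minimal sketch, assuming the Cortex-M4 rule that the table is aligned to its entry count rounded up to a power of two (minimum 32 words), and assuming 118 entries (16 system exceptions + 102 IRQs) for this part; both helper names are made up for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* Required vector table alignment in bytes: number of entries rounded
 * up to a power of two, times 4 bytes per entry, with a minimum of
 * 32 words (128 bytes). */
static inline uint32_t vtor_alignment(uint32_t n_entries)
{
    uint32_t words = 32u;
    while (words < n_entries) {
        words <<= 1;
    }
    return words * 4u;
}

/* True if the given VTOR value satisfies that alignment. */
static inline bool vtor_is_aligned(uint32_t vtor, uint32_t n_entries)
{
    return (vtor & (vtor_alignment(n_entries) - 1u)) == 0u;
}

/* On the target (assumed entry count, see above):
 *   bool ok = vtor_is_aligned(SCB->VTOR, 118u);
 */
```

With 118 entries this gives the 512-byte alignment mentioned earlier, and a table at the start of flash (0x08000000) trivially satisfies it.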

lordnoxx · « **Reply #16 on:** November 11, 2022, 09:01:52 am »

Quote from: ataradov on November 06, 2022, 05:02:25 pm

I'm not sure about this MCU specifically, but quite often execution time of the code depends on its alignment when executing from flash. This is because of the way prefetch systems work.

So the first thing to check - does the alignment of the IRQ code change when main() changes?

The solution to this is to enable cache or place the critical code in SRAM/TCM.

Here are two assembler snippets from the ISR with and without main().

With main():

Code:

08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4:	4b73      	ldr	r3, [pc, #460]	; (8000e94 <DMA1_Channel1_IRQHandler+0x1d0>)
 8000cc6:	681a      	ldr	r2, [r3, #0]
 8000cc8:	0792      	lsls	r2, r2, #30
 8000cca:	b510      	push	{r4, lr}
 8000ccc:	f140 80c2 	bpl.w	8000e54 <DMA1_Channel1_IRQHandler+0x190>
 8000cd0:	685a      	ldr	r2, [r3, #4]
 8000cd2:	4871      	ldr	r0, [pc, #452]	; (8000e98 <DMA1_Channel1_IRQHandler+0x1d4>)
 8000cd4:	4971      	ldr	r1, [pc, #452]	; (8000e9c <DMA1_Channel1_IRQHandler+0x1d8>)

Without main()

Code:

08000c38 <DMA1_Channel1_IRQHandler>:
 8000c38:	4b73      	ldr	r3, [pc, #460]	; (8000e08 <DMA1_Channel1_IRQHandler+0x1d0>)
 8000c3a:	681a      	ldr	r2, [r3, #0]
 8000c3c:	0792      	lsls	r2, r2, #30
 8000c3e:	b510      	push	{r4, lr}
 8000c40:	f140 80c2 	bpl.w	8000dc8 <DMA1_Channel1_IRQHandler+0x190>
 8000c44:	685a      	ldr	r2, [r3, #4]
 8000c46:	4871      	ldr	r0, [pc, #452]	; (8000e0c <DMA1_Channel1_IRQHandler+0x1d4>)
 8000c48:	4971      	ldr	r1, [pc, #452]	; (8000e10 <DMA1_Channel1_IRQHandler+0x1d8>)

Prefetch, I-Cache and D-Cache are enabled:

Code:

FLASH->ACR |= FLASH_ACR_PRFTEN;	
FLASH->ACR |= FLASH_ACR_ICEN;	
FLASH->ACR |= FLASH_ACR_DCEN;

[EDIT]: Compiler is set to -Os
[EDIT]: @ataradov: So can you say something about the alignment by looking at the above assembler code listings?
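
One way to read the alignment out of the listings above is to compare the two entry addresses modulo the flash read width. This is a minimal host-side sketch; the 8-byte (64-bit) and 16-byte (128-bit) line widths are assumptions to be checked against the reference manual, and line_offset and print_isr_offsets are made-up names:

```c
#include <stdint.h>
#include <stdio.h>

/* Offset of an address within an aligned line of line_bytes bytes
 * (line_bytes must be a power of two). */
static inline uint32_t line_offset(uint32_t addr, uint32_t line_bytes)
{
    return addr & (line_bytes - 1u);
}

/* Print the offsets of the two ISR entry points from the listings. */
void print_isr_offsets(void)
{
    const uint32_t with_main    = 0x08000cc4u;  /* -Os, with main()    */
    const uint32_t without_main = 0x08000c38u;  /* -Os, without main() */

    printf("64-bit line:  %u vs %u\n",
           (unsigned)line_offset(with_main, 8u),
           (unsigned)line_offset(without_main, 8u));
    printf("128-bit line: %u vs %u\n",
           (unsigned)line_offset(with_main, 16u),
           (unsigned)line_offset(without_main, 16u));
}
```

For these two addresses the handler starts 4 bytes into an 8-byte line in the build with main() and exactly on an 8-byte boundary in the build without, so the two builds do not share the same alignment.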

lordnoxx · « **Reply #17 on:** November 11, 2022, 10:59:49 am »

When I set the compiler to -Ofast I get this result:

With main():

Code:

08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4:	4b80      	ldr	r3, [pc, #512]	; (8000ec8 <DMA1_Channel1_IRQHandler+0x204>)
 8000cc6:	681a      	ldr	r2, [r3, #0]
 8000cc8:	0792      	lsls	r2, r2, #30
 8000cca:	f140 80cc 	bpl.w	8000e66 <DMA1_Channel1_IRQHandler+0x1a2>
 8000cce:	685a      	ldr	r2, [r3, #4]
 8000cd0:	497e      	ldr	r1, [pc, #504]	; (8000ecc <DMA1_Channel1_IRQHandler+0x208>)
 8000cd2:	487f      	ldr	r0, [pc, #508]	; (8000ed0 <DMA1_Channel1_IRQHandler+0x20c>)

Execution time is 1.234us

Without main():

Code:

08000c38 <DMA1_Channel1_IRQHandler>:
 8000c38:	4b80      	ldr	r3, [pc, #512]	; (8000e3c <DMA1_Channel1_IRQHandler+0x204>)
 8000c3a:	681a      	ldr	r2, [r3, #0]
 8000c3c:	0792      	lsls	r2, r2, #30
 8000c3e:	f140 80cc 	bpl.w	8000dda <DMA1_Channel1_IRQHandler+0x1a2>
 8000c42:	685a      	ldr	r2, [r3, #4]
 8000c44:	497e      	ldr	r1, [pc, #504]	; (8000e40 <DMA1_Channel1_IRQHandler+0x208>)
 8000c46:	487f      	ldr	r0, [pc, #508]	; (8000e44 <DMA1_Channel1_IRQHandler+0x20c>)

Execution time is 1.180us

So two things are interesting here:
1. Compiling with main() and -Ofast results in an ISR execution time as fast as compiling with -Os and without main()
2. With -Ofast, the execution time varies much less between the builds with and without main(): 1.234us --> 1.180us

lordnoxx · « **Reply #18 on:** November 11, 2022, 11:19:30 am »

Next I experimented with placing the ISR in CCM SRAM according to ST's AN4296.

ISR function prototype:

Code:

void main_DCDC_control_loop_interrupt(void) __attribute__((section (".ccmram")));

MAP file confirms that the ISR code is placed/copied to CCMRAM at startup by startup code.

Code:

.ccmram         0x0000000010000000      0x290 load address 0x00000000080001d8
                       0x0000000010000000                . = ALIGN (0x4)
                       0x0000000010000000                _sccmram = .
 *(.ccmram)
 .ccmram        0x0000000010000000      0x290 ./Core/Src/ISR_Handlers.o
                       0x0000000010000000                DMA1_Channel1_IRQHandler
 *(.ccmram*)
                       0x0000000010000290                . = ALIGN (0x4)
                       0x0000000010000290                _eccmram = .

These are the results.

-> with main() and -Os: Execution time is 1.508us

-> without main() and -Os: Execution time is 1.430us

-> without main() and -Ofast: Execution time is 1.354us

Hmmm...I did not expect that outcome. I really expected a much bigger reduction in execution time compared to when the ISR code is located in flash memory. Also veeeerryy interesting: with the code in CCM RAM and optimization for speed (-Ofast), the execution time gets down to "only" 1.354us, which is still slower than having the ISR code in flash and optimized with -Ofast (1.180us). I mean...what the f**k?!?!?

voltsandjolts · « **Reply #19 on:** November 11, 2022, 12:17:39 pm »

Maybe try using an internal timer for calculating ISR execution time.
Pin toggling so fast might be causing issues, but that's just a hunch, I'm not familiar with this mcu and APB setup.
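
On a Cortex-M4 the DWT cycle counter is one candidate for such an internal timer. This is a minimal sketch, assuming the CMSIS device header stm32g4xx.h and the STM32G431xx device define are available in the target build; cyccnt_init, cyccnt_now and cycles_elapsed are made-up helper names:

```c
#include <stdint.h>

#ifdef STM32G431xx
#include "stm32g4xx.h"   /* CMSIS device header, provides DWT and CoreDebug */

/* Enable the DWT cycle counter once at startup. */
static inline void cyccnt_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the trace block */
    DWT->CYCCNT = 0u;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start counting */
}

/* Read the free-running 32-bit cycle counter. */
static inline uint32_t cyccnt_now(void)
{
    return DWT->CYCCNT;
}
#endif

/* Elapsed cycles between two counter reads; unsigned subtraction
 * stays correct across a single 32-bit wrap of the counter. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* In the ISR (hypothetical usage, isr_cycles being a volatile global):
 *   uint32_t t0 = cyccnt_now();
 *   ... control loop ...
 *   isr_cycles = cycles_elapsed(t0, cyccnt_now());
 */
```

Reading the counter adds a couple of cycles itself, but it takes the GPIO writes out of the measurement.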

NorthGuy · « **Reply #20 on:** November 11, 2022, 02:18:20 pm »

Quote from: lordnoxx on November 11, 2022, 11:19:30 am

Hmmm...I did not expect that outcome. I really expected a much bigger reduction in execution time compared to when the ISR code is located in flash memory. Also veeeerryy interesting: with the code in CCM RAM and optimization for speed (-Ofast), the execution time gets down to "only" 1.354us, which is still slower than having the ISR code in flash and optimized with -Ofast (1.180us). I mean...what the f**k?!?!?

Usually MCUs expect execution from flash, hence their bus structure is tailored for such execution. When executing from RAM, command fetches and data access compete for the same bus, therefore you get bus congestion and slower execution.

There's no point in looking at a few lines of assembler; post the whole ISR including prologue and epilogue.

lordnoxx · « **Reply #21 on:** November 11, 2022, 02:39:55 pm »

This is all I can get from the .list file

Code:

08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4:	4b81      	ldr	r3, [pc, #516]	; (8000ecc <DMA1_Channel1_IRQHandler+0x208>)
 8000cc6:	681a      	ldr	r2, [r3, #0]
 8000cc8:	0792      	lsls	r2, r2, #30
 8000cca:	b5f0      	push	{r4, r5, r6, r7, lr}
 8000ccc:	f140 80fc 	bpl.w	8000ec8 <DMA1_Channel1_IRQHandler+0x204>
 8000cd0:	685a      	ldr	r2, [r3, #4]
 8000cd2:	4c7f      	ldr	r4, [pc, #508]	; (8000ed0 <DMA1_Channel1_IRQHandler+0x20c>)
 8000cd4:	497f      	ldr	r1, [pc, #508]	; (8000ed4 <DMA1_Channel1_IRQHandler+0x210>)
 8000cd6:	4880      	ldr	r0, [pc, #512]	; (8000ed8 <DMA1_Channel1_IRQHandler+0x214>)
 8000cd8:	4f80      	ldr	r7, [pc, #512]	; (8000edc <DMA1_Channel1_IRQHandler+0x218>)
 8000cda:	f042 0202 	orr.w	r2, r2, #2
 8000cde:	605a      	str	r2, [r3, #4]
 8000ce0:	4a7f      	ldr	r2, [pc, #508]	; (8000ee0 <DMA1_Channel1_IRQHandler+0x21c>)
 8000ce2:	6993      	ldr	r3, [r2, #24]
 8000ce4:	f443 4380 	orr.w	r3, r3, #16384	; 0x4000
 8000ce8:	6193      	str	r3, [r2, #24]
 8000cea:	4b7e      	ldr	r3, [pc, #504]	; (8000ee4 <DMA1_Channel1_IRQHandler+0x220>)
 8000cec:	681b      	ldr	r3, [r3, #0]
 8000cee:	8822      	ldrh	r2, [r4, #0]
 8000cf0:	b292      	uxth	r2, r2
 8000cf2:	1a9b      	subs	r3, r3, r2
 8000cf4:	600b      	str	r3, [r1, #0]
 8000cf6:	4b7c      	ldr	r3, [pc, #496]	; (8000ee8 <DMA1_Channel1_IRQHandler+0x224>)
 8000cf8:	681b      	ldr	r3, [r3, #0]
 8000cfa:	680a      	ldr	r2, [r1, #0]
 8000cfc:	4353      	muls	r3, r2
 8000cfe:	ee07 3a90 	vmov	s15, r3
 8000d02:	eef8 7ae7 	vcvt.f32.s32	s15, s15
 8000d06:	4b79      	ldr	r3, [pc, #484]	; (8000eec <DMA1_Channel1_IRQHandler+0x228>)
 8000d08:	edc0 7a00 	vstr	s15, [r0]
 8000d0c:	681b      	ldr	r3, [r3, #0]
 8000d0e:	680d      	ldr	r5, [r1, #0]
 8000d10:	4a77      	ldr	r2, [pc, #476]	; (8000ef0 <DMA1_Channel1_IRQHandler+0x22c>)
 8000d12:	436b      	muls	r3, r5
 8000d14:	ee07 3a90 	vmov	s15, r3
 8000d18:	ed92 7a00 	vldr	s14, [r2]
 8000d1c:	4b75      	ldr	r3, [pc, #468]	; (8000ef4 <DMA1_Channel1_IRQHandler+0x230>)
 8000d1e:	eef8 7ae7 	vcvt.f32.s32	s15, s15
 8000d22:	2500      	movs	r5, #0
 8000d24:	ee77 7a87 	vadd.f32	s15, s15, s14
 8000d28:	edc3 7a00 	vstr	s15, [r3]
 8000d2c:	ed93 7a00 	vldr	s14, [r3]
 8000d30:	eddf 7a71 	vldr	s15, [pc, #452]	; 8000ef8 <DMA1_Channel1_IRQHandler+0x234>
 8000d34:	eeb4 7ae7 	vcmpe.f32	s14, s15
 8000d38:	eef1 fa10 	vmrs	APSR_nzcv, fpscr
 8000d3c:	bfc8      	it	gt
 8000d3e:	edc3 7a00 	vstrgt	s15, [r3]
 8000d42:	edd3 7a00 	vldr	s15, [r3]
 8000d46:	eef5 7ac0 	vcmpe.f32	s15, #0.0
 8000d4a:	eef1 fa10 	vmrs	APSR_nzcv, fpscr
 8000d4e:	bf48      	it	mi
 8000d50:	601d      	strmi	r5, [r3, #0]
 8000d52:	681e      	ldr	r6, [r3, #0]
 8000d54:	6016      	str	r6, [r2, #0]
 8000d56:	4e69      	ldr	r6, [pc, #420]	; (8000efc <DMA1_Channel1_IRQHandler+0x238>)
 8000d58:	680a      	ldr	r2, [r1, #0]
 8000d5a:	f8d6 c000 	ldr.w	ip, [r6]
 8000d5e:	683f      	ldr	r7, [r7, #0]
 8000d60:	eba2 020c 	sub.w	r2, r2, ip
 8000d64:	437a      	muls	r2, r7
 8000d66:	ee07 2a90 	vmov	s15, r2
 8000d6a:	4a65      	ldr	r2, [pc, #404]	; (8000f00 <DMA1_Channel1_IRQHandler+0x23c>)
 8000d6c:	eef8 7ae7 	vcvt.f32.s32	s15, s15
 8000d70:	edc2 7a00 	vstr	s15, [r2]
 8000d74:	6809      	ldr	r1, [r1, #0]
 8000d76:	6031      	str	r1, [r6, #0]
 8000d78:	edd0 7a00 	vldr	s15, [r0]
 8000d7c:	edd3 6a00 	vldr	s13, [r3]
 8000d80:	ed92 7a00 	vldr	s14, [r2]
 8000d84:	4b5f      	ldr	r3, [pc, #380]	; (8000f04 <DMA1_Channel1_IRQHandler+0x240>)
 8000d86:	4860      	ldr	r0, [pc, #384]	; (8000f08 <DMA1_Channel1_IRQHandler+0x244>)
 8000d88:	4960      	ldr	r1, [pc, #384]	; (8000f0c <DMA1_Channel1_IRQHandler+0x248>)
 8000d8a:	ee77 7aa6 	vadd.f32	s15, s15, s13
 8000d8e:	ee77 7a87 	vadd.f32	s15, s15, s14
 8000d92:	eefd 7ae7 	vcvt.s32.f32	s15, s15
 8000d96:	edc3 7a00 	vstr	s15, [r3]
 8000d9a:	681a      	ldr	r2, [r3, #0]
 8000d9c:	2a00      	cmp	r2, #0
 8000d9e:	bfbc      	itt	lt
 8000da0:	2200      	movlt	r2, #0
 8000da2:	601a      	strlt	r2, [r3, #0]
 8000da4:	681b      	ldr	r3, [r3, #0]
 8000da6:	6003      	str	r3, [r0, #0]
 8000da8:	6803      	ldr	r3, [r0, #0]
 8000daa:	4a59      	ldr	r2, [pc, #356]	; (8000f10 <DMA1_Channel1_IRQHandler+0x24c>)
 8000dac:	f5b3 6f7a 	cmp.w	r3, #4000	; 0xfa0
 8000db0:	bfc4      	itt	gt
 8000db2:	f44f 637a 	movgt.w	r3, #4000	; 0xfa0
 8000db6:	6003      	strgt	r3, [r0, #0]
 8000db8:	8863      	ldrh	r3, [r4, #2]
 8000dba:	6812      	ldr	r2, [r2, #0]
 8000dbc:	4c55      	ldr	r4, [pc, #340]	; (8000f14 <DMA1_Channel1_IRQHandler+0x250>)
 8000dbe:	b29b      	uxth	r3, r3
 8000dc0:	1a9b      	subs	r3, r3, r2
 8000dc2:	600b      	str	r3, [r1, #0]
 8000dc4:	4b54      	ldr	r3, [pc, #336]	; (8000f18 <DMA1_Channel1_IRQHandler+0x254>)
 8000dc6:	681b      	ldr	r3, [r3, #0]
 8000dc8:	680a      	ldr	r2, [r1, #0]
 8000dca:	4353      	muls	r3, r2
 8000dcc:	ee07 3a90 	vmov	s15, r3
 8000dd0:	eef8 7ae7 	vcvt.f32.s32	s15, s15
 8000dd4:	4b51      	ldr	r3, [pc, #324]	; (8000f1c <DMA1_Channel1_IRQHandler+0x258>)
 8000dd6:	edc4 7a00 	vstr	s15, [r4]
 8000dda:	681b      	ldr	r3, [r3, #0]
 8000ddc:	680e      	ldr	r6, [r1, #0]
 8000dde:	4a50      	ldr	r2, [pc, #320]	; (8000f20 <DMA1_Channel1_IRQHandler+0x25c>)
 8000de0:	4373      	muls	r3, r6
 8000de2:	ee07 3a90 	vmov	s15, r3
 8000de6:	ed92 7a00 	vldr	s14, [r2]
 8000dea:	4b4e      	ldr	r3, [pc, #312]	; (8000f24 <DMA1_Channel1_IRQHandler+0x260>)
 8000dec:	4e4e      	ldr	r6, [pc, #312]	; (8000f28 <DMA1_Channel1_IRQHandler+0x264>)
 8000dee:	eef8 7ae7 	vcvt.f32.s32	s15, s15
 8000df2:	ee77 7a87 	vadd.f32	s15, s15, s14
 8000df6:	edc3 7a00 	vstr	s15, [r3]
 8000dfa:	ed93 7a00 	vldr	s14, [r3]
 8000dfe:	eddf 7a4b 	vldr	s15, [pc, #300]	; 8000f2c <DMA1_Channel1_IRQHandler+0x268>
 8000e02:	eeb4 7ae7 	vcmpe.f32	s14, s15
 8000e06:	eef1 fa10 	vmrs	APSR_nzcv, fpscr
 8000e0a:	bfc8      	it	gt
 8000e0c:	edc3 7a00 	vstrgt	s15, [r3]
 8000e10:	edd3 7a00 	vldr	s15, [r3]
 8000e14:	eef5 7ac0 	vcmpe.f32	s15, #0.0
 8000e18:	eef1 fa10 	vmrs	APSR_nzcv, fpscr
 8000e1c:	bf48      	it	mi
 8000e1e:	601d      	strmi	r5, [r3, #0]
 8000e20:	681d      	ldr	r5, [r3, #0]
 8000e22:	6015      	str	r5, [r2, #0]
 8000e24:	4d42      	ldr	r5, [pc, #264]	; (8000f30 <DMA1_Channel1_IRQHandler+0x26c>)
 8000e26:	680a      	ldr	r2, [r1, #0]
 8000e28:	682f      	ldr	r7, [r5, #0]
 8000e2a:	6836      	ldr	r6, [r6, #0]
 8000e2c:	1bd2      	subs	r2, r2, r7
 8000e2e:	4372      	muls	r2, r6
 8000e30:	ee07 2a90 	vmov	s15, r2
 8000e34:	4a3f      	ldr	r2, [pc, #252]	; (8000f34 <DMA1_Channel1_IRQHandler+0x270>)
 8000e36:	eef8 7ae7 	vcvt.f32.s32	s15, s15
 8000e3a:	edc2 7a00 	vstr	s15, [r2]
 8000e3e:	6809      	ldr	r1, [r1, #0]
 8000e40:	6029      	str	r1, [r5, #0]
 8000e42:	edd4 7a00 	vldr	s15, [r4]
 8000e46:	edd3 6a00 	vldr	s13, [r3]
 8000e4a:	ed92 7a00 	vldr	s14, [r2]
 8000e4e:	4b3a      	ldr	r3, [pc, #232]	; (8000f38 <DMA1_Channel1_IRQHandler+0x274>)
 8000e50:	ee77 7aa6 	vadd.f32	s15, s15, s13
 8000e54:	ee77 7a87 	vadd.f32	s15, s15, s14
 8000e58:	eefd 7ae7 	vcvt.s32.f32	s15, s15
 8000e5c:	edc3 7a00 	vstr	s15, [r3]
 8000e60:	681a      	ldr	r2, [r3, #0]
 8000e62:	2a00      	cmp	r2, #0
 8000e64:	bfbc      	itt	lt
 8000e66:	2200      	movlt	r2, #0
 8000e68:	601a      	strlt	r2, [r3, #0]
 8000e6a:	681a      	ldr	r2, [r3, #0]
 8000e6c:	f44f 737a 	mov.w	r3, #1000	; 0x3e8
 8000e70:	fb92 f2f3 	sdiv	r2, r2, r3
 8000e74:	4b31      	ldr	r3, [pc, #196]	; (8000f3c <DMA1_Channel1_IRQHandler+0x278>)
 8000e76:	601a      	str	r2, [r3, #0]
 8000e78:	681a      	ldr	r2, [r3, #0]
 8000e7a:	f5b2 6f7a 	cmp.w	r2, #4000	; 0xfa0
 8000e7e:	bfc4      	itt	gt
 8000e80:	f44f 627a 	movgt.w	r2, #4000	; 0xfa0
 8000e84:	601a      	strgt	r2, [r3, #0]
 8000e86:	6802      	ldr	r2, [r0, #0]
 8000e88:	681b      	ldr	r3, [r3, #0]
 8000e8a:	1ad2      	subs	r2, r2, r3
 8000e8c:	4b2c      	ldr	r3, [pc, #176]	; (8000f40 <DMA1_Channel1_IRQHandler+0x27c>)
 8000e8e:	601a      	str	r2, [r3, #0]
 8000e90:	681a      	ldr	r2, [r3, #0]
 8000e92:	f5b2 6f7a 	cmp.w	r2, #4000	; 0xfa0
 8000e96:	bfc4      	itt	gt
 8000e98:	f44f 627a 	movgt.w	r2, #4000	; 0xfa0
 8000e9c:	601a      	strgt	r2, [r3, #0]
 8000e9e:	6819      	ldr	r1, [r3, #0]
 8000ea0:	f240 52db 	movw	r2, #1499	; 0x5db
 8000ea4:	4291      	cmp	r1, r2
 8000ea6:	bfdc      	itt	le
 8000ea8:	f240 52dc 	movwle	r2, #1500	; 0x5dc
 8000eac:	601a      	strle	r2, [r3, #0]
 8000eae:	4a25      	ldr	r2, [pc, #148]	; (8000f44 <DMA1_Channel1_IRQHandler+0x280>)
 8000eb0:	6812      	ldr	r2, [r2, #0]
 8000eb2:	681b      	ldr	r3, [r3, #0]
 8000eb4:	ea43 5302 	orr.w	r3, r3, r2, lsl #20
 8000eb8:	4a23      	ldr	r2, [pc, #140]	; (8000f48 <DMA1_Channel1_IRQHandler+0x284>)
 8000eba:	65d3      	str	r3, [r2, #92]	; 0x5c
 8000ebc:	f102 4278 	add.w	r2, r2, #4160749568	; 0xf8000000
 8000ec0:	6993      	ldr	r3, [r2, #24]
 8000ec2:	f043 4380 	orr.w	r3, r3, #1073741824	; 0x40000000
 8000ec6:	6193      	str	r3, [r2, #24]
 8000ec8:	bdf0      	pop	{r4, r5, r6, r7, pc}
 8000eca:	bf00      	nop
 8000ecc:	40020000 	.word	0x40020000
 8000ed0:	20001558 	.word	0x20001558
 8000ed4:	20001524 	.word	0x20001524
 8000ed8:	2000155c 	.word	0x2000155c
 8000edc:	200014f8 	.word	0x200014f8
 8000ee0:	48000800 	.word	0x48000800
 8000ee4:	20001560 	.word	0x20001560
 8000ee8:	200000cc 	.word	0x200000cc
 8000eec:	200000c4 	.word	0x200000c4
 8000ef0:	20001554 	.word	0x20001554
 8000ef4:	20001550 	.word	0x20001550
 8000ef8:	45802000 	.word	0x45802000
 8000efc:	2000151c 	.word	0x2000151c
 8000f00:	2000154c 	.word	0x2000154c
 8000f04:	20001508 	.word	0x20001508
 8000f08:	2000150c 	.word	0x2000150c
 8000f0c:	20001520 	.word	0x20001520
 8000f10:	20001548 	.word	0x20001548
 8000f14:	20001544 	.word	0x20001544
 8000f18:	200000c8 	.word	0x200000c8
 8000f1c:	200000c0 	.word	0x200000c0
 8000f20:	20001540 	.word	0x20001540
 8000f24:	2000153c 	.word	0x2000153c
 8000f28:	200014f4 	.word	0x200014f4
 8000f2c:	4a7a3e80 	.word	0x4a7a3e80
 8000f30:	20001518 	.word	0x20001518
 8000f34:	20001538 	.word	0x20001538
 8000f38:	20001500 	.word	0x20001500
 8000f3c:	20001504 	.word	0x20001504
 8000f40:	200014fc 	.word	0x200014fc
 8000f44:	200000d0 	.word	0x200000d0
 8000f48:	50000800 	.word	0x50000800

errorprone · « **Reply #22 on:** November 11, 2022, 02:54:02 pm »

Not sure if you’ve done this but the vector table also needs to be moved to CCMRAM otherwise the function pointer to the ISR is still read from flash.

lordnoxx · « **Reply #23 on:** November 11, 2022, 03:15:57 pm »

Quote from: errorprone on November 11, 2022, 02:54:02 pm

Not sure if you’ve done this but the vector table also needs to be moved to CCMRAM otherwise the function pointer to the ISR is still read from flash.

Good point. No, the vector table still resides in flash memory.
But nonetheless, once execution of the ISR has started, the time to walk through the ISR instructions should not depend on the memory the vector table sits in. Right?
I am not concerned about execution delay, i.e. the time from the event to the start of execution. I am wondering why the execution time, from start to end of the ISR, is not consistent when other code in the project is changed, added or removed. And while investigating that I found other suspicious behavior that I don't understand, e.g. execution from flash with -Ofast is faster than execution from CCM RAM with -Ofast.

And I recently found another suspicious behavior ---> apparently the execution time also depends on where within the code the ISR is called from. I am working on a simple example to show you that here in the thread.

Siwastaja · « **Reply #24 on:** November 11, 2022, 03:55:22 pm »

Quote from: NorthGuy on November 11, 2022, 02:18:20 pm

Usually MCUs expect execution from flash, hence their bus structure is tailored for such execution. When executing from RAM, command fetches and data access compete for the same bus, therefore you get bus congestion and slower execution.

But the CCM / ITCM exists for this reason, it should have an interface of its own, directly to CPU.

Given fairly linear code with few loops, flash with prefetch can be as fast as CCM RAM, though, so clearly that did not help the OP.

Quote from: lordnoxx on November 11, 2022, 10:59:49 am

With main():
...
Without main():

The asm snippets posted look identical, but did I miss it or did you post the whole function in both cases (with and without main() differences, same optimization level)? Are they exactly the same or not? That is literally the first thing to check.

ataradov · « **Reply #25 on:** November 11, 2022, 05:02:49 pm »

When placing things into SRAM/CCM through attributes, pay attention to the function disassembly. If it does floating point math, chances are that supporting functions will still be located in the flash.

I would reduce the function to the point where there are no calls (even generated by the compiler internally), but you can still see the difference.

NorthGuy · « **Reply #26 on:** November 11, 2022, 06:01:40 pm »

Quote from: Siwastaja on November 11, 2022, 03:55:22 pm

But the CCM / ITCM exists for this reason, it should have an interface of its own, directly to CPU.

Given fairly linear code with few loops, flash with prefetch can be as fast as CCM RAM, though, so clearly that did not help the OP.

OP has found that running code from CCM is slower than running the same code from flash. Apparently, fetching from CCM is slower than fetching from flash. I think this is because of the bus contention (i.e. you cannot fetch a command and access data at the same time).

It's easy to test - just run the piece of register-only code from CCM and from flash and see if there's any time difference.

ataradov · « **Reply #27 on:** November 11, 2022, 06:26:37 pm »

CCM is a completely separate bus, it does not conflict with anything else.

But again, if you placed the code in CCM, but it still calls floating point functions from the flash, it would be slow.

Siwastaja · « **Reply #28 on:** November 11, 2022, 06:38:26 pm »

Quote from: NorthGuy on November 11, 2022, 06:01:40 pm

OP has found that running code from CCM is slower than running the same code from flash. Apparently, fetching from CCM is slower than fetching from flash. I think this is because of the bus contention (i.e. you cannot fetch a command and access data at the same time).

This is just not physically true. See the datasheet; it's a separate bus. The whole purpose of which is to give zero wait state access, it has to be at least equally fast to flash. Unless OP also put data in the same CCM (I forgot if CCM can be used for data, at all. It sure can for instructions, which is the primary purpose).

Something else is going on. One possibility is some of the code (library calls, maybe compiler-generated memcpy/memset/software float) are still in FLASH, but because of large difference in addresses between CCM vs. FLASH, slower jump instruction (e.g., one using register for addressing, necessitating a load, or a veneer function) is now used.

Carefully examining the assembly, in full, reveals all of this.

NorthGuy · « **Reply #29 on:** November 11, 2022, 07:30:47 pm »

Quote from: Siwastaja on November 11, 2022, 06:38:26 pm

Unless OP also put data in the same CCM (I forgot if CCM can be used for data, at all. It sure can for instructions, which is the primary purpose).

Data sections can be placed into CCM.

Quote from: Siwastaja on November 11, 2022, 06:38:26 pm

Something else is going on. One possibility is some of the code (library calls, maybe compiler-generated memcpy/memset/software float) are still in FLASH, but because of large difference in addresses between CCM vs. FLASH, slower jump instruction (e.g., one using register for addressing, necessitating a load, or a veneer function) is now used.

This is definitely a possible reason, however I do not see any function calls in his code. He gets substantial time difference, should take quite a bit of function calling.

ataradov · « **Reply #30 on:** November 11, 2022, 07:43:23 pm »

As I said, the code uses floating point math, it will call helper functions.

It is a simple thing to check - just look at a full disassembly.

eutectique · « **Reply #31 on:** November 11, 2022, 07:48:26 pm »

Quote from: lordnoxx on November 11, 2022, 10:59:49 am

Code: [Select]
08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4:	4b80      	ldr	r3, [pc, #512]	; (8000ec8 <DMA1_Channel1_IRQHandler+0x204>)

Execution time is 1.234us

Address is XXXXc4, 4-byte aligned.

Quote from: lordnoxx on November 11, 2022, 10:59:49 am

Code: [Select]
08000c38 <DMA1_Channel1_IRQHandler>:
 8000c38:	4b80      	ldr	r3, [pc, #512]	; (8000e3c <DMA1_Channel1_IRQHandler+0x204>)

Execution time is 1.180us

Address is XXXX38, 8-byte aligned.

The only difference I can see.

uer166 · « **Reply #32 on:** November 11, 2022, 08:31:38 pm »

Quote from: eutectique on November 11, 2022, 07:48:26 pm

Quote from: lordnoxx on November 11, 2022, 10:59:49 am
Code: [Select]
08000cc4 <DMA1_Channel1_IRQHandler>:
 8000cc4:	4b80      	ldr	r3, [pc, #512]	; (8000ec8 <DMA1_Channel1_IRQHandler+0x204>)

Execution time is 1.234us

Address is XXXXc4, 4-byte aligned.

Quote from: lordnoxx on November 11, 2022, 10:59:49 am
Code: [Select]
08000c38 <DMA1_Channel1_IRQHandler>:
 8000c38:	4b80      	ldr	r3, [pc, #512]	; (8000e3c <DMA1_Channel1_IRQHandler+0x204>)

Execution time is 1.180us

Address is XXXX38, 8-byte aligned.

The only difference I can see.

Flash is 8-byte wide, so maybe that's a clue? There clearly is not enough info from OP to make any determination. Post the entire listing and memory map file.
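The alignment observation above can be checked mechanically. The following is a minimal host-side sketch, not code from the OP's project; `max_alignment` is a hypothetical helper name. It reports the largest power-of-two boundary an address sits on, applied to the two ISR entry addresses from the quoted listings.

```c
#include <assert.h>
#include <stdint.h>

/* Largest power-of-two alignment (in bytes, capped at 'cap') that 'addr' satisfies.
 * Hypothetical helper for illustration only. */
uint32_t max_alignment(uint32_t addr, uint32_t cap)
{
    assert(cap != 0 && (cap & (cap - 1)) == 0);  /* cap must be a power of two */
    uint32_t a = 1;
    while (a < cap && (addr % (a * 2)) == 0)
        a *= 2;
    return a;
}
```

With a cap of 16, `max_alignment(0x08000cc4, 16)` gives 4 and `max_alignment(0x08000c38, 16)` gives 8, matching the "4-byte aligned" and "8-byte aligned" remarks. Whether that difference accounts for the timing gap is still only a hypothesis at this point in the thread.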

dmills · « **Reply #33 on:** November 11, 2022, 10:56:19 pm »

I am wondering if your without main case might have main() being optimised into a 'stop the processor' command which will leave the cache hot for the next time the ISR is called.

Put significant doings in main and now the ISR stalls on the I & D cache fill from memory? Don't forget that if both the ISR and something that could be interrupted need the FPU then that is a fairly expensive flush to the stack that you don't need if the ONLY code touching the FPU is in the interrupt handler.

NorthGuy · « **Reply #34 on:** November 12, 2022, 12:42:42 am »

Quote from: ataradov on November 11, 2022, 07:43:23 pm

As I said, the code uses floating point math, it will call helper functions.

It is a simple thing to check - just look at a full disassembly.

Yep. It is right here:

Quote from: lordnoxx on November 11, 2022, 02:39:55 pm

This is all I can get from the .list file

Do you see any calls to helper functions?

ataradov · « **Reply #35 on:** November 12, 2022, 12:48:50 am »

I missed that post.

In that case I would start removing the code to get a minimal version that still shows the difference. And I would start with floating point instructions, since timing on those depends on the operands.

Siwastaja · « **Reply #36 on:** November 12, 2022, 08:15:24 am »

Or you can just try to add __attribute__((aligned(8))) to the function definition (and re-check from the listing it is effective) and see if eutectique & uer are on the right track. Doesn't explain the case where CCM is slower, though.
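A minimal sketch of the suggestion above, assuming a GCC or Clang toolchain (the `aligned` function attribute is a compiler extension). `demo_handler` and `entry_is_aligned` are hypothetical names; on the real project the attribute would go on `DMA1_Channel1_IRQHandler`, and the result should still be confirmed in the .list file.

```c
#include <assert.h>
#include <stdint.h>

/* GCC/Clang extension: request 8-byte alignment for the function entry point. */
__attribute__((aligned(8)))
void demo_handler(void)
{
    /* body of the time-critical routine would go here */
}

/* Returns 1 if the entry address of 'fn' is a multiple of 'bytes'. */
int entry_is_aligned(void (*fn)(void), uintptr_t bytes)
{
    assert(bytes != 0);
    /* On Thumb targets bit 0 of a function pointer is the Thumb flag, so mask it off. */
    uintptr_t addr = (uintptr_t)fn & ~(uintptr_t)1;
    return (addr % bytes) == 0;
}
```

Calling `entry_is_aligned(demo_handler, 8)` is a quick runtime cross-check; the attribute only raises the minimum alignment, so the linker may still place the function on a larger boundary.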

voltsandjolts · « **Reply #37 on:** November 12, 2022, 09:08:55 am »

Quote from: lordnoxx on November 06, 2022, 11:33:43 am

The ISR looks like this:

Code: [Select]

void DMA1_Channel1_IRQHandler(void)
{
	if(DMA1->ISR & DMA_ISR_TCIF1)	//check we are here cause of a valid interrupt occurred
	{
		DMA1->IFCR |= DMA_IFCR_CTCIF1;	//reset interrupt flag
		GPIOC->ODR |= (1U << 14);	//Set Pin high

		ev = (v - my_array[0]);
		vp = A * ev;
		vi = B * vi1;
		if(vi > 4100) vi=4100;
		if(vi < 0) vi=0;
		vi1 = vi;
		vd = C * (ev - (e1v+2));
		e1v = ev;
		cv = vp + vi + vd;
		if(cv < 0) cv=0;
		cv_t = cv/2;
		if(cv_t > 4000 ) cv_t=4000;
		/*------------------------------------------------------*/
		ei = (my_array[1] - i);
		ip = D * ei;
		ii = E * ii1;
		if(ii > 4100000 ) ii=4100000;
		if(ii < 0 ) ii=0;
		ii1 = ii;
		id = F * (ei + (e1i-2));
		e1i = ei;
		ci = ip + ii + id;
		if(ci < 0) ci=0;
		ci_t = ci/1000;
		if(ci_t > 4000 ) ci_t=4000;
		cdac_t = (cv_t - ci_t);
		if(cdac_t > 4000) cdac_t=4000;
		if(cdac_t < 1500) cdac_t=1500;

		GPIOC->ODR &= ~(1U << 14);	//Set Pin low
	}
}

All the variables are global. Some of them are "int" and others are "float". All variables used in the ISR are declared volatile. I am using the FPU.

Converting float to int?

Some of the IEEE.754 operations are not supported by hardware and are done by software:
• Remainder
• Round floating-point to integer-value floating-point number
• Binary-to-decimal and decimal-to-binary conversions
• Direct comparison of single-precision and double-precision values

wek · « **Reply #38 on:** November 12, 2022, 09:30:19 am »

If you want just another guess to be thrown wildly, the CCMRAM may be used at one side of a running DMA, for example.

[EDIT] Another wild guess is, that stack is defined in CCMRAM, which is aliased as top of the SRAM block; and (lazy?) float stacking then interferes with code fetching.

Third wild guess is, that the busmatrix arbitrator inserts a slot whenever access to CCMRAM switches from I-bus to D-bus of the processor. [/EDIT]

Details do matter. There's no point guessing around.

As Alex said above, OP should prepare a minimal but complete code exhibiting the problem and post entirely, with disasm or elf.

JW

lordnoxx · « **Reply #39 on:** November 12, 2022, 09:59:19 pm »

Quote from: wek on November 12, 2022, 09:30:19 am

As Alex said above, OP should prepare a minimal but complete code exhibiting the problem and post entirely, with disasm or elf.

This is what I will do. Thanks to all of you who commented and started analyzing the issue. I'll be right back.

jnk0le · « **Reply #40 on:** November 13, 2022, 11:38:32 am »

The execution from ITCM is slower than flash because you hit the von neumann bottleneck from pcrel loads of constants. FLASH memories are usually equipped with separate code and literal caches.
It's definitely faster on devices like f103 (which can't keep up with stream of nop.w at 1ws). RISCV is cleaner in this regard.

You can put only the vector table in TCM, that is not occupied by stack, as to not cause waitstated read twice in a row.
Also align function entry to a cacheline size.

Siwastaja · « **Reply #41 on:** November 13, 2022, 01:08:27 pm »

Quote from: jnk0le on November 13, 2022, 11:38:32 am

The execution from ITCM is slower than flash because you hit the von neumann bottleneck from pcrel loads of constants. FLASH memories are usually equipped with separate code and literal caches.

Sounds weird to me, because CCM/ITCM should be equally fast to that code/literal cache - no wait states. Load instructions finish as quickly as possible from ITCM, I don't see how they can be any faster; interface to the flash accelerator or cache should be of equal performance. But I could be wrong.

jnk0le · « **Reply #42 on:** November 13, 2022, 02:29:16 pm »

Quote from: Siwastaja on November 13, 2022, 01:08:27 pm

Sounds weird to me, because CCM/ITCM should be equally fast to that code/literal cache - no wait states. Load instructions finish as quickly as possible from ITCM, I don't see how they can be any faster; interface to the flash accelerator or cache should be of equal performance. But I could be wrong.

You can't fetch instructions and data at the same time from a single ported memory, hence von neumann bottleneck. The core prefetcher can fetch instructions a bit ahead but that wont do much with series of such loads.

NorthGuy · « **Reply #43 on:** November 13, 2022, 04:32:50 pm »

Quote from: jnk0le on November 13, 2022, 11:38:32 am

FLASH memories are usually equipped with separate code and literal caches.

I didn't know that. If so, this certainly explains why flash is faster than CCM.

Siwastaja · « **Reply #44 on:** November 13, 2022, 05:40:12 pm »

That also implies flash cache controller has special interface directly to the CPU, allowing two separate addresses to be fetched at the same time. Still suspicious about that.

I have never seen ITCM being slower than flash, although to be fair I have compared those only some dozen times, not hundreds, so maybe there is a case.

Plus it's worth remembering ITCM is a 64-bit wide bus in Cortex-M7, designed and offered by ARM as a standard option, but CCM is ST's own addition. Don't know about the difference. I have used ST's CCM only in one project (on STM32F334), and it was significantly faster than FLASH in those actual interrupt handlers I used.

ataradov · « **Reply #45 on:** November 13, 2022, 05:43:34 pm »

All those things are much more complicated than a simple single ported memory. I too have never seen TCM being slower and I don't see how it is possible.

There is some issue with the measurement method or something else is going on.

NorthGuy · « **Reply #46 on:** November 13, 2022, 05:52:11 pm »

Quote from: Siwastaja on November 13, 2022, 05:40:12 pm

That also implies flash cache controller has special interface directly to the CPU, allowing two separate addresses to be fetched at the same time. Still suspicious about that.

Data and Code use different buses. Otherwise it would always be contention between instruction fetches and data access. See the bus matrix picture below.

jnk0le · « **Reply #47 on:** November 13, 2022, 08:18:48 pm »

Quote from: Siwastaja on November 13, 2022, 05:40:12 pm

I have never seen ITCM being slower than flash, although to be fair I have compared those only some dozen times, not hundreds, so maybe there is a case.

Quote from: ataradov on November 13, 2022, 05:43:34 pm

I too have never seen TCM being slower and I don't see how it is possible.

I think we see here a false positive of microbenchmarking - the interrupt vector/code/data remains in FLASH caches between runs. There is also quite a lot of pcrel constants.
In typical scenario it should be faster and if not, the overhead ought to be low.

If compiler can be somehow forced to generate MOVW+MOVT instead of pcrel constants, TCM should have exactly the same performance as FLASH, if there is no contention with stack or variables in same memory block.

NorthGuy · « **Reply #48 on:** November 13, 2022, 08:35:24 pm »

Quote from: jnk0le on November 13, 2022, 08:18:48 pm

... the interrupt vector/code/data remains in FLASH caches between runs.

Exactly! That's what it is.

When you also have main, the ISR code gets removed from the cache while main runs. So, the bigger the main is, the higher the probability of the ISR code being expelled from cache, and hence when the main is big, the ISR code gets slower.

If the main is big enough to expel the ISR code from the cache every time, the flash code should then become slower than CCM code. @OP: you can verify if this is so.
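The eviction argument above can be illustrated with a toy model. This is a sketch under loud assumptions: a fully associative cache with LRU replacement, 32 lines of 32 bytes, and made-up code addresses. None of these figures come from the thread, and the real flash accelerator may behave differently; the point is only to show how a bigger main loop can push the ISR out of a small cache.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINES 32u  /* assumed number of cache lines, not from the thread */
#define LINE_BYTES  32u  /* assumed line size in bytes, not from the thread */

typedef struct {
    uint32_t tag[CACHE_LINES];
    uint32_t last_use[CACHE_LINES];
    int      valid[CACHE_LINES];
    uint32_t now;
} cache_t;

/* Touch one address; returns 1 on a miss, 0 on a hit. */
static int cache_touch(cache_t *c, uint32_t addr)
{
    uint32_t tag = addr / LINE_BYTES;
    uint32_t victim = 0;
    c->now++;
    for (uint32_t i = 0; i < CACHE_LINES; i++) {
        if (c->valid[i] && c->tag[i] == tag) {
            c->last_use[i] = c->now;
            return 0;
        }
    }
    /* Miss: use a free line if there is one, else evict the least recently used. */
    for (uint32_t i = 0; i < CACHE_LINES; i++) {
        if (!c->valid[i]) { victim = i; break; }
        if (c->last_use[i] < c->last_use[victim]) victim = i;
    }
    c->valid[victim] = 1;
    c->tag[victim] = tag;
    c->last_use[victim] = c->now;
    return 1;
}

/* Fetch 'bytes' of straight-line code starting at 'base'; returns the miss count. */
static unsigned run_code(cache_t *c, uint32_t base, uint32_t bytes)
{
    unsigned misses = 0;
    for (uint32_t off = 0; off < bytes; off += LINE_BYTES)
        misses += (unsigned)cache_touch(c, base + off);
    return misses;
}

/* ISR misses once the pattern "main loop, then ISR" has repeated a few times. */
unsigned isr_misses_steady_state(uint32_t main_bytes, uint32_t isr_bytes)
{
    cache_t c;
    unsigned misses = 0;
    assert(isr_bytes > 0);
    memset(&c, 0, sizeof c);
    for (int round = 0; round < 8; round++) {
        run_code(&c, 0x08001000u, main_bytes);          /* hypothetical main-loop address */
        misses = run_code(&c, 0x08000c00u, isr_bytes);  /* hypothetical ISR address */
    }
    return misses;
}
```

In this model a 640-byte ISR sees no misses when the main loop is empty or small enough that both fit in the cache together, and misses on every line once the main loop alone fills the cache. That is the qualitative behavior described above; the actual cycle cost per miss depends on the wait states and prefetch.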

jnk0le · « **Reply #49 on:** November 13, 2022, 08:36:54 pm »

one more thing:

Quote

The STM32G4 Series category 2 devices feature up to 32 Kbytes SRAM:
• 16 Kbytes SRAM1 (mapped at address 0x2000 0000)
• 6 Kbytes SRAM2 (mapped at address 0x2000 4000)
• 10 Kbytes CCM SRAM (mapped at address 0x1000 0000 and end of SRAM2)

Isn't your linker script joining all those sections to create one linear layout?
That would be contention with stack in such case (and some bugs once the app grows)

wek · « **Reply #50 on:** November 13, 2022, 09:28:42 pm »

Quote from: ataradov on November 13, 2022, 05:43:34 pm

I too have never seen TCM being slower and I don't see how it is possible.

We are not talking about TCM in CM7 which is on an entirely separate bus of processor than the rest of the system.

CCM SRAM in STM32 denotes different things in different families/models. In 'G4, it's a single-port RAM accessed through both I and D but even through S port of processor (the latter through a different address region alias) and then through the common busmatrix, and it can be slave also to either of DMAs. Strangely enough, there's another chunk of RAM, denoted SRAM1, with exactly the same connectivity, and no explanation how would it be different from CCM SRAM.

The arbitration of the busmatrix is not documented in any other way than it is "round robin".

JW

DavidAlfa · « **Reply #51 on:** November 13, 2022, 09:56:56 pm »

Note that some linker scripts don't separate these ram regions.
I remember the 32F429 showing "RAM: .... length=192K", so the system would use any.

At least in the 429, CCM is only connected to the D-Bus, thus it can neither be used for instructions nor accessed by any DMA.

Quote

The 64-Kbyte CCM (core coupled memory) data RAM is not part of the bus matrix and can be accessed only through the CPU

But placing ISR data in CCM and ISR instructions in the normal SRAM shouldn't slow it down, as they won't need to fight for access.

Not the case of the 32G431 (Totally different beast!)

Siwastaja · « **Reply #52 on:** November 14, 2022, 09:22:12 am »

Quote from: NorthGuy on November 13, 2022, 05:52:11 pm

Data and Code use different buses. Otherwise it would always be contention between instruction fetches and data access. See the bus matrix picture below.

This picture explains it well. Flash accelerator comes with separate DBUS and IBUS interfaces, STM32 CCM SRAM does not but has to arbitrate.

Who knows about the internals of ARM-provided M7 ITCM (tried 2 minutes in Google to no avail)? The bus is 64 bits but does that mean it's actually kinda-dual port, like the FLASH accelerator pictured? PC-relative literals are usually further away so 64 bit width does not help for that.

Siwastaja · « **Reply #53 on:** November 14, 2022, 09:35:04 am »

Quote from: jnk0le on November 13, 2022, 08:36:54 pm

Quote
The STM32G4 Series category 2 devices feature up to 32 Kbytes SRAM:
• 6 Kbytes SRAM2 (mapped at address 0x2000 4000)
• 10 Kbytes CCM SRAM (mapped at address 0x1000 0000 and end of SRAM2)

Kind of weird optimization idea:

Since PC-relative LDR can take offset of +/- 4095 bytes, which is quite a lot, and considering wek's comment "is there any difference between CCM and SRAM?" it should be possible to place timing-critical routines on the border of these memory segments so that the PC-relative literals fall into SRAM2 and code itself into CCM. Then of course place nothing else into SRAM2 & CCM. Then DBUS accesses would be to SRAM2 and IBUS instruction fetches to CCM, and no arbitration for PC-relative literals.
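To make the PC-relative literal mechanism concrete, here is a small host-side sketch that decodes the 16-bit Thumb `LDR Rt, [PC, #imm]` encoding and computes where the literal lives. The type and function names are hypothetical; the opcode `4b80` and the addresses are taken from the listings quoted earlier in the thread. The 16-bit form shown here only reaches 0 to 1020 bytes forward, while the +/- 4095 range applies to the 32-bit encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 16-bit Thumb "LDR Rt, [PC, #imm]" (bit pattern 01001 Rt imm8). */
typedef struct {
    unsigned rt;      /* destination register number */
    uint32_t offset;  /* byte offset, imm8 * 4 */
} ldr_literal_t;

ldr_literal_t decode_ldr_literal(uint16_t opcode)
{
    assert((opcode & 0xF800u) == 0x4800u);  /* must be the 16-bit LDR (literal) form */
    ldr_literal_t d;
    d.rt = (unsigned)(opcode >> 8) & 0x7u;
    d.offset = (uint32_t)(opcode & 0xFFu) * 4u;
    return d;
}

/* Literal address: Align(PC, 4) + offset, where PC reads as instruction address + 4. */
uint32_t literal_address(uint32_t insn_addr, uint32_t offset)
{
    uint32_t pc = insn_addr + 4u;
    return (pc & ~3u) + offset;
}
```

Decoding `0x4b80` gives `r3` and an offset of 512, and `literal_address(0x08000cc4, 512)` lands on `0x08000ec8`, the same pool address the disassembler prints. Every such load is a data-side read of the memory holding the code, which is why the placement of the literal pool matters for the idea above.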

jnk0le · « **Reply #54 on:** November 14, 2022, 10:05:20 am »

About optimizing the OP code

Quote

8000ed0:   20001558    .word   0x20001558
8000ed4:   20001524    .word   0x20001524
8000ed8:   2000155c    .word   0x2000155c
8000edc:   200014f8    .word   0x200014f8

8000ee4:   20001560    .word   0x20001560
8000ee8:   200000cc    .word   0x200000cc
8000eec:   200000c4    .word   0x200000c4
8000ef0:   20001554    .word   0x20001554
8000ef4:   20001550    .word   0x20001550

8000efc:   2000151c    .word   0x2000151c
8000f00:   2000154c    .word   0x2000154c
8000f04:   20001508    .word   0x20001508
8000f08:   2000150c    .word   0x2000150c
8000f0c:   20001520    .word   0x20001520
8000f10:   20001548    .word   0x20001548
8000f14:   20001544    .word   0x20001544
8000f18:   200000c8    .word   0x200000c8
8000f1c:   200000c0    .word   0x200000c0
8000f20:   20001540    .word   0x20001540
8000f24:   2000153c    .word   0x2000153c
8000f28:   200014f4    .word   0x200014f4

8000f30:   20001518    .word   0x20001518
8000f34:   20001538    .word   0x20001538
8000f38:   20001500    .word   0x20001500
8000f3c:   20001504    .word   0x20001504
8000f40:   200014fc    .word   0x200014fc
8000f44:   200000d0    .word   0x200000d0

all of those seem to be addresses of global variables. Organizing them in structures should greatly reduce those literal loads.

EDIT: many of those are also within 12 bit addw/ldr range to each other but the linkers tend to not be good at this kind of address relaxing

lordnoxx · « **Reply #55 on:** November 20, 2022, 09:34:16 pm »

Hi guys I am back. Had a little time during the weekend and worked on a simplified small code that shows the behavior of variable execution time of DMA1 ISR. Together we can now walk through the mini project and find out what is going on. But there is one thing I have to make clear right from the beginning: I am not allowed to post the c-source code of the function cli(). I can only post the disassembly of that function.

Ok here we go....this is the main.c file:

Code: [Select]

#include "stm32g4xx.h"
#include "main.h"
#include "system_setup.h"
#include "usart_char_string.h"
#include "cli.h"
#include "ISR_Handlers.h"

#define TEST 0
#define VOUT 5.0f
#define VOUT_SET ((VOUT/10)/3)*4095.0f * 0.996
#define KPV  300
#define KIV  30
#define KDV  3000
#define IOUT 3.0f
#define IOUT_SET ((IOUT*0.2308f)/3)*4095.0f * 1.0f
#define KPI  10000
#define KII  30000
#define KDI  5000

volatile uint16_t myarray[2]= {0,0};
char fooptr[16];
volatile char debug=0;

volatile unsigned int vref=0, vref_aim=VOUT_SET;
volatile int KpV=KPV, KiV=KIV, KdV=KDV;
volatile int ev=0, e1v=0, cv=0, cv_out=0;
volatile float vp=0, vi=0, vi1=0, vd=0;

volatile unsigned int iref=0, iref_aim=IOUT_SET;
volatile int KpI=KPI, KiI=KII, KdI=KDI;
volatile int ei=0, e1i=0, ci=0, ci_out=0;
volatile float ip=0, ii=0, ii1=0, id=0;
volatile int c_out;

void SysTick_Handler(void);

int main(void)
{
	setup_clock_tree_config();
	setup_GPIO_config();
	SystemCoreClockUpdate();
	clock = SystemCoreClock;
	SysTick_Config(clock/1e3); //1ms
	setup_USART_config();
	setup_Timer1_config();
	setup_ADC1_config();
	setup_interrupt_config();

	GPIOA->ODR |= (1 << GPIO_ODR_OD11_Pos);

	while(1)
	{

		cli();

		if(TEST)
		{
			delay_ms(200);
		}

	}
}


void SysTick_Handler(void)
{
	count_ticks++;
}

void delay_ms(uint32_t ms)
{
	uint32_t start=count_ticks;
	while ((count_ticks-start) < ms);
}

If you wonder about the variable names...yes....the project is all about a digital control loop for a DC/DC converter. And yes...the DMA1 ISR executes the PID algorithms for voltage and current control. This just as a side note for those of you who are interested

Here is the code of the ISR:

Code: [Select]

void DMA1_Channel1_IRQHandler(void) {

	if (DMA1->ISR & DMA_ISR_TCIF1)
	{
		DMA1->IFCR |= DMA_IFCR_CTCIF1;

		GPIOC->BSRR |= (1U << 14);			//Pin high

		ev = (vref - myarray[0]);

		vp = KpV * ev;
		vi = KiV * ev + vi1;

		if (vi > 4100)	vi = 4100;
		if (vi < 0)		vi = 0;
		vi1 = vi;

		vd = KdV * (ev - e1v);
		e1v = ev;

		cv = vp + vi + vd;

		if (cv < 0)	cv = 0;

		cv_out = cv;
		if (cv_out > 4000)	cv_out = 4000;

		ei = (myarray[1] - iref);

		ip = KpI * ei;
		ii = KiI * ei + ii1;

		if (ii > 4100000)	ii = 4100000;
		if (ii < 0)			ii = 0;
		ii1 = ii;

		id = KdI * (ei - e1i);
		e1i = ei;

		ci = ip + ii + id;

		if (ci < 0)	ci = 0;

		ci_out = ci / 1000;
		if (ci_out > 4000)	ci_out = 4000;

		c_out = (cv_out - ci_out);

		if (c_out > 4000)	c_out = 4000;
		if (c_out < 1500)	c_out = 1500;

		GPIOC->BSRR |= (1U << (14 + 16));			//Pin low

	}
}

The main() function and the ISR and also the vector table reside in flash memory. Compiler optimizes for size (-Os). From the attached image you can see that the execution time is 1.842µs with cli() that gets called from main().
If I comment out cli()...

Code: [Select]

while(1)
{
	//cli();

	if(TEST)
	{
		delay_ms(200);
	}

}

...the execution time of the ISR is 1.542µs, shown in the other attached image.

Please see also the attached ZIP. It contains the elf-file and the disassembly (list-file) with the cli() function. On request I can upload an elf file and disassembly without the cli() function

DavidAlfa · « **Reply #56 on:** November 20, 2022, 10:15:19 pm »

What happens if you disable the float code and add a fixed nop delay, does the ISR time vary by cli()?

Code: [Select]

DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14);                    //Pin high
for(uint16_t i=0;i<10000;i++) asm("nop");
GPIOC->BSRR |= (1U << (14 + 16));             //Pin low

Also try making a larger function:

Code: [Select]

DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14);                    //Pin high
for(uint16_t i=0;i<40;i++) {
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
    asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop");
}
GPIOC->BSRR |= (1U << (14 + 16));             //Pin low

This way you discard a FPU HW / library issue.
Does cli use the FPU as well? Any other higher priority IRQ? Disabling the IRQ in any moment?

lordnoxx · « **Reply #57 on:** November 22, 2022, 09:11:18 am »

Quote

What happens if you disable the float code and add a fixed nop delay, does the ISR time vary by cli()?

Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1;
GPIOC->BSRR |= (1U << 14);                    //Pin high
for(uint16_t i=0;i<10000;i++) asm("nop");
GPIOC->BSRR |= (1U << (14 + 16));             //Pin low

Tried that...Result:
Execution time of the ISR is not dependent on cli().

Quote

Also try making a larger function:
Code: [Select]
DMA1->IFCR |= DMA_IFCR_CTCIF1; GPIOC->BSRR |= (1U << 14); //Pin high for(uint16_t i=0;i<40;i++) { asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop"); asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop"); asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop"); asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop"); asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop"); asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop"); asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop"); asm("nop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop\nnop"); } GPIOC->BSRR |= (1U << (14 + 16)); //Pin low
This way you discard a FPU HW / library issue.

Tried that too...Result:
Execution time of the ISR is not dependent on cli().

Quote

Does cli use the FPU as well? Any other higher priority IRQ? Disabling the IRQ in any moment?

No, cli() has no float operations going. Also if you check the disassembly of cli() you will not find any FPU instructions. So probably the root cause is not the lazy stacking feature!?!?

No higher priority IRQs.

But...the above tests led me to test again what happens if I tell the compiler not to use the FPU at all. --> Result: Execution time of the ISR is not dependent on cli().
Hmmm...well....does this mean that the issue is nevertheless caused by the FPU?

What can be the next steps to dive deeper into this?

DavidAlfa · « **Reply #58 on:** November 22, 2022, 05:31:33 pm »

Toggle the pin between operations, maybe you find it's happening by some specific operation?

jnk0le · « **Reply #59 on:** November 22, 2022, 07:57:45 pm »

Put all of those in one big struct so the compiler can access them with an ldr offset instead of doing a pcrel load of the address each time

Quote

Code: [Select]
volatile uint16_t myarray[2]= {0,0};

volatile unsigned int vref=0, vref_aim=VOUT_SET;
volatile int KpV=KPV, KiV=KIV, KdV=KDV;
volatile int ev=0, e1v=0, cv=0, cv_out=0;
volatile float vp=0, vi=0, vi1=0, vd=0;

volatile unsigned int iref=0, iref_aim=IOUT_SET;
volatile int KpI=KPI, KiI=KII, KdI=KDI;
volatile int ei=0, e1i=0, ci=0, ci_out=0;
volatile float ip=0, ii=0, ii1=0, id=0;
volatile int c_out;
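A minimal sketch of the struct idea, covering only the voltage loop. The type `vloop_t`, the object `vloop`, and the function `voltage_loop_step` are hypothetical names, and `adc[]` stands in for `myarray[]`; the arithmetic follows the OP's ISR. With a single base address the compiler may reach each field with a register-plus-offset load or store instead of fetching a separate address from the literal pool for every global, though the generated code should be checked in the disassembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical grouping of the voltage-loop globals into one object. */
typedef struct {
    uint16_t     adc[2];              /* stands in for myarray[] */
    unsigned int vref;
    int          KpV, KiV, KdV;
    int          ev, e1v, cv, cv_out;
    float        vp, vi, vi1, vd;
} vloop_t;

volatile vloop_t vloop = { .KpV = 300, .KiV = 30, .KdV = 3000 };

/* Voltage half of the ISR body, written against one base pointer. */
void voltage_loop_step(volatile vloop_t *s)
{
    assert(s != 0);
    s->ev = (int)(s->vref - s->adc[0]);
    s->vp = (float)(s->KpV * s->ev);
    s->vi = (float)(s->KiV * s->ev) + s->vi1;
    if (s->vi > 4100) s->vi = 4100;
    if (s->vi < 0)    s->vi = 0;
    s->vi1 = s->vi;
    s->vd = (float)(s->KdV * (s->ev - s->e1v));
    s->e1v = s->ev;
    s->cv = (int)(s->vp + s->vi + s->vd);
    if (s->cv < 0) s->cv = 0;
    s->cv_out = s->cv;
    if (s->cv_out > 4000) s->cv_out = 4000;
}
```

In the real handler the body would be inlined into `DMA1_Channel1_IRQHandler` with `&vloop` as the base; the current loop could live in the same struct or a second one.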

Jeroen3 · « **Reply #60 on:** November 22, 2022, 08:26:26 pm »

What is your clock tree like? Are you waiting on an asynchronous bus somewhere?

What happens when you run the chip at a speed that needs no wait states? E.g. 16 MHz flat?
Can you then reproduce the results?

Sidenote: GPIOC->BSRR is a write only register, no need for |=

wek · « **Reply #61 on:** November 22, 2022, 10:55:58 pm »

FPU is probably red herring, with those nops you've removed also all pcrel loads, which may have impact through alignment, as jnk0le pointed out above (and maybe others talked about it too).

To test the alignment-dependency theory, insert a single NOP to the ISR (before setting the GPIO pin) and retest, then insert one more etc.

> Sidenote: GPIOC->BSRR is a write only register, no need for |=

Same for DMA1->IFCR . Generally avoid RMW to registers as much as possible/practical. Don't set multiple bits in a single register by a series of RMW, (as you do with FLASH_ACR), perform one single write of the final value.

JW

Siwastaja · « **Reply #62 on:** November 23, 2022, 07:44:11 am »

Quote from: wek on November 22, 2022, 10:55:58 pm

> Sidenote: GPIOC->BSRR is a write only register, no need for |=

Same for DMA1->IFCR . Generally avoid RMW to registers as much as possible/practical.

Made that bigger, it's pretty important side note when the OP clearly is interested about performance. BSRR, IFCR and other similar registers exist for the purpose of avoiding read-modify-write; the RMW operation is moved to the peripheral side; consider it a simple "hardware accelerator" of modifying peripheral state.
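The point about BSRR can be shown with a small host-side model of what a single write to the register does to the output data register. `bsrr_write` is a hypothetical function for illustration, not a CMSIS call: the lower 16 bits of the written value set pins, the upper 16 bits reset pins, and set wins if both are requested for the same pin.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of a GPIO port: effect of one write to BSRR on ODR. */
uint32_t bsrr_write(uint32_t odr, uint32_t value)
{
    assert((odr >> 16) == 0);                       /* ODR only has 16 pin bits */
    uint32_t set_mask   = value & 0xFFFFu;          /* BS0..BS15 */
    uint32_t reset_mask = (value >> 16) & 0xFFFFu;  /* BR0..BR15 */
    odr &= ~reset_mask;  /* reset first ... */
    odr |= set_mask;     /* ... so that set wins if both are requested */
    return odr & 0xFFFFu;
}
```

On the target this means the ISR can use plain assignments, `GPIOC->BSRR = (1U << 14);` for pin high and `GPIOC->BSRR = (1U << (14 + 16));` for pin low, and the same goes for `DMA1->IFCR = DMA_IFCR_CTCIF1;`. The `|=` form adds a peripheral read that contributes nothing.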

DavidAlfa · « **Reply #63 on:** November 23, 2022, 08:11:47 am »

Why adding sidenotes in micron size font? Just use regular size, it's confusing, I didn't even read them as looked like your personal signature.

Add "__attribute__((aligned(16)))" before the ISR function so it gets 16-byte (128bit) aligned.
It'll waste a little space, but nothing serious, tested it and got a 3.2% increment, from 106KB to 109.3KB.

What are you running in cli(), or what code is increasing the ISR time?
Tried your code inside a timer ISR, nothing seems to affect it, got steady 2.56us on a 32F411@100MHz.
The struct idea definitely made a difference, lowering the time to 1.96us.
Are you sure it's not related to the processing ending sooner/later depending on the input data?
What happens if the float input values are never updated, so it always makes the same calcs?

Also, maybe try enabling the SysTick timer (1 kHz), disabling everything else and running the code there, to rule out something strange with the DMA ISR.

wek · « **Reply #64 on:** November 23, 2022, 10:47:39 am »

DavidAlfa,

> The struct idea definitely made a difference, lowering the time [from 2.56us(?)] to 1.96us.

Wow, I wouldn't expect it to be that dramatic.

Can you please revert to the non-struct version, and try a couple of versions with one, two, etc. __NOP()s added before setting the GPIO pin?

Thanks,

JW

DavidAlfa · « **Reply #65 on:** November 23, 2022, 11:29:30 am »

I'm not getting any difference by adding random code anywhere in main, nor by aligning the code to 16 bytes:

No alignment:
08000784 <TIM3_IRQHandler>

Adding __attribute__((aligned(16))):
08000790 <TIM3_IRQHandler>

Zero difference in execution time.

Edit: It was my scope not having enough precision. Repeated in 10ns/div.

I'm adding nops before setting the pin high, the waveform should remain the same, right? But there's a small difference:
0 nop: 1.96us (="same")
1 nop: +20ns
2 nop: same
3 nop: +20ns
4 nop: same

It seems that adding an odd number of nops adds two extra execution cycles to the following instructions.
But in no way +600ns.
Flash latency is set to 3, I don't think it's a cache miss.
The F411 is way inferior to the G431 so I wouldn't expect it to perform better in any way, but who knows.

wek · « **Reply #66 on:** November 23, 2022, 12:43:50 pm »

> I'm adding nops before setting the pin high, the waveform should remain the same, right? But there's a small difference:
> 0 nop: 1.96us (="same")
> 1 nop: +20ns
> 2 nop: same
> 3 nop: +20ns
> 4 nop: same

OK but that's with the variables in struct, right? If you run at 100MHz that 20ns may be 2-3 cycles so it may be one extra FLASH read, or similar.
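
As a quick sanity check on these numbers, scope readings can be converted to core cycles with a one-liner. This is just a sketch and `ns_to_cycles` is a made-up helper name.

```c
#include <stdint.h>

/* Converts a duration in nanoseconds to core clock cycles, rounded to the
 * nearest cycle. 64-bit intermediate avoids overflow of ns * core_hz. */
uint32_t ns_to_cycles(uint32_t ns, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)ns * core_hz + 500000000u) / 1000000000u);
}
```

At 100MHz a 20ns step comes out as 2 cycles, and the 1.96us and 2.56us ISR times quoted above are 196 and 256 cycles, so the struct change is worth about 60 cycles. With a 10ns scope resolution each reading is only good to roughly one cycle either way.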

Can you please try the same with the non-struct i.e. original version? As there are more data reads from the FLASH, the difference may be more pronounced.

> The F411 is way inferior to the G431 so I wouldn't expect it to perform better in any way, but who knows.

From computing point of view the 'F411 is in no way inferior to 'G431, except the max. clock frequency (i.e. the 'F42x or 'F446 are on par with 'G431). The 'G4xx are "better" in the peripheral mix. The 'G4xx are worse in longevity, given the massive increase in peripherals' complexity was paid for by using 45nm technology (vs. the older (read: more robust) 90nm for the 'F4). It's not dramatic, yet; but the pressure is already felt.

JW

DavidAlfa · « **Reply #67 on:** November 23, 2022, 02:15:24 pm »

Strange results, but nothing out of this world.

1 nop: -20ns
2 nop: +10ns
3 nop: +10ns
4 nop: -20ns
5 nop: -20ns
6 nop: -20ns
7 nop: -20ns
8 nop: -20ns
9 nop: -20ns

asm(""): -20ns. Yes, an empty asm("") statement.

Code: [Select]

void TIM3_IRQHandler(void)
{
  /* USER CODE BEGIN TIM3_IRQn 0 */
  __HAL_TIM_CLEAR_FLAG(&htim3,TIM_FLAG_UPDATE);
 8000784:	4b7c      	ldr	r3, [pc, #496]	; (8000978 <TIM3_IRQHandler+0x1f4>)
  LED_GPIO_Port->BSRR = LED_Pin; //Set Pin high
 8000786:	497d      	ldr	r1, [pc, #500]	; (800097c <TIM3_IRQHandler+0x1f8>)

Code: [Select]

void TIM3_IRQHandler(void)
{
 8000784:	b5f0      	push	{r4, r5, r6, r7, lr}
  /* USER CODE BEGIN TIM3_IRQn 0 */
  asm("");
  __HAL_TIM_CLEAR_FLAG(&htim3,TIM_FLAG_UPDATE);
 8000786:	4b7c      	ldr	r3, [pc, #496]	; (8000978 <TIM3_IRQHandler+0x1f4>)
  LED_GPIO_Port->BSRR = LED_Pin; //Set Pin high
 8000788:	497c      	ldr	r1, [pc, #496]	; (800097c <TIM3_IRQHandler+0x1f8>)

wek · « **Reply #68 on:** November 23, 2022, 02:34:15 pm »

Hummm.

Thanks, David.

JW

bson · « **Reply #69 on:** November 29, 2022, 07:28:03 pm »

Quote from: Siwastaja on November 13, 2022, 01:08:27 pm

> Sounds weird to me, because CCM/ITCM should be equally fast to that code/literal cache - no wait states. Load instructions finish as quickly as possible from ITCM, I don't see how they can be any faster; interface to the flash accelerator or cache should be of equal performance. But I could be wrong.

CCM doesn't have separate I/D buses like flash or SRAM. This means the pipeline can't overlap data accesses with code accesses when data is in CCM, but the flip side is it gives very predictable and deterministic execution times. Every access takes exactly one cycle, and can simply be added up.

