Author Topic: How are micros programmed in real world situations? (Read 30300 times)

Howardlong · « **Reply #75 on:** April 27, 2015, 05:50:53 pm »

Quote from: nctnico on April 27, 2015, 02:04:15 pm

I'd start by checking the clock frequency and then remove most of the instructions.

Clock is 204MHz, measured on both CLK0 and CLK2 pins.

Howardlong · « **Reply #76 on:** April 27, 2015, 06:29:25 pm »

Quote from: Howardlong on April 27, 2015, 05:50:53 pm

Quote from: nctnico on April 27, 2015, 02:04:15 pm
I'd start by checking the clock frequency and then remove most of the instructions.

Clock is 204MHz, measured on both CLK0 and CLK2 pins.

If I do this:

Code: [Select]

__RAMFUNC(RAM2) static void DownConvert3(NCOSTRUCT *pncos,SAMPLETYPE *pstIn,SAMPLETYPE *pstOutI,SAMPLETYPE *pstOutQ, int nNumSamples)
{
	int x=0x400F4000; // For twiddling diagnostic bits

	__asm__ __volatile__
	(
		"\n\t"
		"movs r2,#0					\n\t" // Literals for GPIO performance diagnostics
		"movs r3,#1					\n\t"
		"\nloopy:					\n\t"
		""
		"strb.w r3,[%[X],#100]		\n\t" // GPIO on
		"strb.w r2,[%[X],#100]		\n\t" // GPIO off
		"b loopy					\n\t"
		:
		: [X]"r" (x)
		: "r2","r3"
	);
}

I get this:

STRB is either one or two cycles, one if following a previous load or store, two otherwise. B is two.

loopy:
STRB // 2 cycles
STRB // 1 cycle
B loopy // 2 cycles

So far so good.
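
As a rough cross-check of that count, the rule of thumb can be written as a tiny model. This is only a sketch under the timing rule stated in this post (a store costs 2 cycles, or 1 when it follows another load or store; an unconditional branch costs 2). `loop_cycles` and `cycles_to_ns` are made-up helper names, and the model deliberately ignores fetch alignment and bus stalls.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_STORE, OP_BRANCH } op_t;

/* Cycles for one pass of a loop body, using the rule of thumb from the post:
 * a store costs 1 cycle if the previous op was a load/store, otherwise 2;
 * an unconditional branch costs 2.
 * The loop wraps around, so ops[0] is treated as following ops[n-1]. */
static unsigned loop_cycles(const op_t *ops, size_t n)
{
    assert(ops != NULL && n > 0);
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        op_t prev = ops[(i + n - 1) % n];
        if (ops[i] == OP_BRANCH)
            total += 2;
        else
            total += (prev == OP_STORE) ? 1 : 2;
    }
    return total;
}

/* Convert a cycle count to nanoseconds for a core clock given in MHz. */
static double cycles_to_ns(unsigned cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return cycles * 1000.0 / clock_mhz;
}
```

For the store, store, branch loop this gives 5 cycles, which is about 24.5ns per pass at 204MHz. The same model predicts 7 cycles for a loop of four stores and a branch, so anything slower than that on the scope points at something the model leaves out.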

Now if I do this:

Code: [Select]

	int x=0x400F4000; // For twiddling diagnostic bits

	__asm__ __volatile__
	(
		"\n\t"
		"movs r2,#0					\n\t" // Literals for GPIO performance diagnostics
		"movs r3,#1					\n\t"
		"\nloopy:					\n\t"
		""
		"strb.w r3,[%[X],#100]		\n\t" // GPIO on
		"strb.w r2,[%[X],#100]		\n\t" // GPIO off
		"strb.w r3,[%[X],#100]		\n\t" // GPIO on
		"strb.w r2,[%[X],#100]		\n\t" // GPIO off
		"b loopy					\n\t"
		:
		: [X]"r" (x)
		: "r2","r3"
	);

I get this:

Errr? No comprende.

andersm · « **Reply #77 on:** April 27, 2015, 06:53:20 pm »

Look in the documentation if your chip uses a code prefetch mechanism. The small loop may fit into the prefetch buffer, while the large one requires a refill. In some chips the prefetch can be turned off for increased determinism.

Howardlong · « **Reply #78 on:** April 27, 2015, 07:03:46 pm »

Quote from: AndyC_772 on April 27, 2015, 02:13:05 pm

How long does it take if you duplicate all the instructions between setting the GPIO and clearing it? Does executing the code twice take an extra 47 cycles, or 36, or somewhere in between?

Just did this, had to tweak the CBZ as it's limited in branch range, but no matter.

I measure 74 cycles on the scope (363ns unrolled by six vs the previous 230ns unrolled by three), but it's doing twice as much work. I could shave off a little more by combining VLDMs and VSTMs. So that's equivalent to 37 cycles if we were only doing three rather than six.

While I am super happy it worked, indeed I have been going backwards and forwards trying to explain it myself, would anyone care to explain why?

!!!

Edit: Combined VLDMs and VSTMs and used up even more of the FP registers(!) as a result and now it's at 71 cycles, so a 24% improvement on yesterday. But I still have no clue why unrolling further had such a massive impact, usually it's the law of diminishing returns after about three or four.
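
For anyone following the numbers, the scope readings convert to cycles as sketched below. It assumes the 204MHz core clock measured earlier in the thread; `ns_to_cycles` is just an illustrative helper name.

```c
#include <assert.h>

/* Convert a duration read off the scope (in ns) into core clock cycles,
 * for a core clock given in MHz. */
static double ns_to_cycles(double ns, double clock_mhz)
{
    assert(ns >= 0.0 && clock_mhz > 0.0);
    return ns * clock_mhz / 1000.0;
}
```

With a 204MHz clock, 363ns works out to roughly 74 cycles and 230ns to roughly 47 cycles, which matches the figures quoted above.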

Howardlong · « **Reply #79 on:** April 27, 2015, 07:45:17 pm »

Quote from: Jeroen3 on April 27, 2015, 03:28:24 pm

There are several reasons why assembler code that is assumed to take 38 cycles takes 47 clocks. A few of them are:
- Bus wait states, for when you're accessing slower-clocked domains. Such as GPIO or anything on lower-clocked APB.
- Flash wait states, remember flash isn't 32 bit wide, but mostly 128 bit, so multiple instructions fit one flash fetch, and you can get out-of-sync with your gpio set/reset. Refer to memory barriers for this.
- Flash prefetching/caching. Characteristics are highly hardware dependent.

Measuring execution time using GPIO is ambiguous. Compare it with the CCNT to see what happens.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211i/Bihcgfcf.html (demo url, not sure what hardware you have, but this register exists in most arms)
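
On Cortex-M3/M4 parts the closest equivalent is the DWT cycle counter (CYCCNT). The sketch below shows one way to enable and read it, assuming the device actually implements the counter. The register addresses are the ARMv7-M ones; the function names are hypothetical, and the registers are passed in as pointers purely so the logic can also be exercised off-target.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-M debug register addresses (Cortex-M3/M4). */
#define DEMCR_ADDR       0xE000EDFCu  /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR    0xE0001000u  /* DWT control register */
#define DWT_CYCCNT_ADDR  0xE0001004u  /* DWT cycle counter */

#define DEMCR_TRCENA        (1u << 24) /* enables the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)  /* starts the cycle counter */

/* Enable the trace block, zero the counter, then start it counting. */
static void cyccnt_enable(volatile uint32_t *demcr,
                          volatile uint32_t *dwt_ctrl,
                          volatile uint32_t *cyccnt)
{
    assert(demcr && dwt_ctrl && cyccnt);
    *demcr |= DEMCR_TRCENA;
    *cyccnt = 0;
    *dwt_ctrl |= DWT_CTRL_CYCCNTENA;
}

/* Unsigned subtraction gives the right answer even if the 32-bit
 * counter wrapped once between the two readings. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On target the three pointers would be the addresses above cast to `volatile uint32_t *`, with the counter read immediately before and after the code under test. The reads themselves cost a few cycles, so it is worth timing an empty section first and subtracting that.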

If your main goal is to program a fast assembly routine, create a test kit in a simulator with an ideal environment: no bus waits, no flash waits, no prefetching, no pipelining. This way you only test your code, without also measuring hardware features whose behaviour changes with your code.

As mentioned previously, the target is an LPC4370 Cortex M4F, and I've tried to simplify this by (a) running code from on chip SRAM and (b) using different on-chip SRAM blocks for code and data. There are no wait states. Only the single M4 core is running, there is no DMA or IRQ going on. The timing is jitter free and consistent.

Good point regarding GPIO clocking. Regretfully the GPIO clock is wired to the core clock in this case, running at 204MHz.

Your tips regarding flash are worth knowing too, although I'm not sure how much it applies to the flashless LPC4370 which uses a proprietary quad SPI flash interface.

Regarding the use of a synthetic environment, I don't have one, but I'm not sure of the benefit if the pipelining is removed, you're running your code on a non-representative platform. It seems to me that understanding the pipelines and interleaving instructions is key to getting maximum performance. Please correct me if I've misunderstood though.

Thank you for your input, it is appreciated.

Howardlong · « **Reply #80 on:** April 27, 2015, 07:50:01 pm »

Quote from: andersm on April 27, 2015, 06:53:20 pm

Look in the documentation if your chip uses a code prefetch mechanism. The small loop may fit into the prefetch buffer, while the large one requires a refill. In some chips the prefetch can be turned off for increased determinism.

The code's running in on-chip SRAM without wait states. The only prefetch as far as I can see on the LPC4370 is on the SPIFI flash interface and what's already part of the M4 design, but please correct me if I've misinterpreted or missed something.

This is a good discussion, I appreciate the input.

miguelvp · « **Reply #81 on:** April 28, 2015, 01:43:40 am »

This video might help, it's for an ARM Cortex M3 PSoC 5LP but it gets the point across what can cause delays.

C++ vs Assembly vs Verilog.

Howardlong · « **Reply #82 on:** April 28, 2015, 10:56:33 am »

Quote from: miguelvp on April 28, 2015, 01:43:40 am

This video might help, it's for an ARM Cortex M3 PSoC 5LP but it gets the point across what can cause delays.

C++ vs Assembly vs Verilog.

OK, thanks.

Firstly, to be clear, in general I agree that assembler is very much a last resort, but there are very occasional edge cases where you need to get under the hood.

When we got into the meat, about 08:20, I thought that the pipeline bit was glossed over from about 10:15. At about 10:50, with the scope connected, it shows five total cycles. We know that an unconditional branch will always take at least 2 cycles as the pipeline is flushed. So that leaves 3 cycles for the two STRBs. One takes 1 cycle and the other (apparently) takes 2. The TRM is opened up and he shows the pipeline diagram, but he doesn't explain what is happening and why the stall is happening, "or something like that" as he says, twice. At 14:10 he says that it's stalling on both STRBs but does not give a convincing explanation IMHO. Then he goes straight onto configurable logic elements.

From the M3 Tech Ref Manual, regarding all scalar STRs including STRB

Neighboring load and store single instructions can pipeline their address and data phases.
This enables these instructions to complete in a single execution cycle.

and

STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed
in the initial cycle, and the data store is performed at the same time as the next instruction
is executing. If the store is to the store buffer, and the store buffer is full or not enabled,
the next instruction is delayed until the store can complete. If the store is not to the store
buffer, for example to the Code segment, and that transaction stalls, the impact on timing
is only felt if another load or store operation is executed before completion.

miguelvp · « **Reply #83 on:** April 28, 2015, 02:31:43 pm »

I remembered that video because I was doing something similar and wanted to optimize it. But in the end C++ wasn't the winner, and neither was the hardware approach: since it wasn't under software control, it didn't have much utility to me.

I ended up with assembly using a nop in between them to compensate for the branch cycle and using another register pointing to the same output port to avoid stalls or something like that.

I'll try to dig up the code, but I was using an M0 so I don't know if that will impact things, and I don't recall what frequency I achieved at 50% duty cycle.

Howardlong · « **Reply #84 on:** April 28, 2015, 06:36:28 pm »

I figured this out after some suggestions from elsewhere (I have some seriously nerdy friends on FB), but it leaves a further question.

Due to a branch limit error when unrolling to 6 from the original 3, I changed a conditional branch "cbz" to "cbnz plus b".

Original:

Code: [Select]

		cbz r5,loopexit1
	loop1:
		/****** START ******/
		// Several vxx instructions
		/****** END ******/

		subs r5,#1
		bne loop1

New:

Code: [Select]

		cbnz r5,loop1
		b loopexit1
	loop1:
		/****** START ******/
		// Several vxx instructions
		/****** END ******/

		subs r5,#1
		bne loop1

and miraculously it works, exactly 38 cycles rather than the 47 I was getting. I have no idea why you lose 9 cycles the first way. That explains why unrolling to 6 worked, it had nothing to do with the unrolling and everything to do with the cbz instruction before the loop. The cbz isn't even in the loop, so why it has any effect on the loop speed still evades me.

andersm · « **Reply #85 on:** April 28, 2015, 07:06:54 pm »

It could affect the alignment of the first instruction in the loop. IIRC at least the Cortex-M3 fetches instructions in aligned 32-bit chunks, and if an instruction straddles a fetch boundary you'll incur at least one extra memory fetch penalty. Compare the addresses of the instructions, and try adding NOPs or using an assembler pseudo-op to force alignment in the slower case.

Howardlong · « **Reply #86 on:** April 28, 2015, 07:18:47 pm »

andersm I owe you a beer!

Old:

Code: [Select]

100803f4: 0xfb06c615   mls r6, r6, r5, r12
100803f8: 0xb3dd       cbz r5, 0x10080472 <loopexit61>
100803fa: 0xf8883064   strb.w r3, [r8, #100]   ; 0x64

New:

Code: [Select]

100803f4: 0xfb06c615   mls r6, r6, r5, r12
100803f8: 0xb905       cbnz r5, 0x100803fc <DownConvert6+72>
100803fa: 0xe03b       b.n 0x10080474 <loopexit61>
100803fc: 0xf8883064   strb.w r3, [r8, #100]   ; 0x64

Edit: had some time to think about this, and it feels like such a schoolboy error in retrospect, as I was already aware that Thumb-2 code mixes 16-bit and 32-bit instructions. Just need to figure out a reasonably maintainable and clear way to get this implemented.
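
The difference between the two disassembly listings can be checked mechanically. This is a minimal sketch that assumes, as suggested above, that instructions are fetched in aligned 32-bit words; `straddles_word` is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if an instruction of size_bytes (2 or 4) starting at addr
 * spans two aligned 32-bit fetch words, 0 if it fits inside one. */
static int straddles_word(uint32_t addr, unsigned size_bytes)
{
    assert(size_bytes == 2 || size_bytes == 4);
    uint32_t first_word = addr / 4u;
    uint32_t last_word  = (addr + size_bytes - 1u) / 4u;
    return first_word != last_word;
}
```

In the old listing the 32-bit `strb.w` sits at 0x100803fa, so it straddles two fetch words; in the new one it sits at 0x100803fc and fits in a single word.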

andersm · « **Reply #87 on:** April 29, 2015, 05:34:06 am »

With the GNU assembler you can use the ".balign 4" directive to align the location counter to 32 bits, padding with NOPs.

Howardlong · « **Reply #88 on:** April 29, 2015, 11:13:09 am »

Quote from: andersm on April 29, 2015, 05:34:06 am

With the GNU assembler you can use the ".balign 4" directive to align the location counter to 32 bits, padding with NOPs.

OK, this works, but I'm not sure of the scope of .balign, it looks like it just applies to the next "allocation", whether it be a line of code or data, because further down at the branch it slips back to unpadded thumb instructions. This is not a problem, just as long as I'm aware of it.
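
One way to picture what `.balign 4` does at the single point where it appears: it pads the location counter up to the next multiple of four and nothing more. The sketch below just computes that padding; `balign_padding` is a made-up helper, not anything provided by the assembler.

```c
#include <assert.h>
#include <stdint.h>

/* Number of padding bytes needed to bring addr up to the next multiple
 * of align. align must be a power of two, as in ".balign 4". */
static uint32_t balign_padding(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1u)) == 0);
    return (align - (addr & (align - 1u))) & (align - 1u);
}
```

For a loop label that would otherwise land at 0x100803fa this gives 2 bytes, which is one 16-bit NOP. An address that is already word aligned gets no padding, and anything after the directive is laid out as usual, which is consistent with later instructions slipping back off word boundaries.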

This also fixes the "problem" with the GPIO twiddling: those instructions were also not word aligned. If I have a string of adjacent alternate GPIO on and off instructions that are not word aligned, the effect is that every third one causes a stall, so I think we can say that in the worst case, not having 32-bit (word-width) instructions word aligned can have an impact of 33%.

I've applied this to the polyphase decimator too, that was also not word aligned, a 16% improvement as a result, down from 75 to 63 cycles total for two 8 tap FIRs.
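
The percentages quoted here follow from simple ratios, sketched below. The 33% figure assumes each stall costs exactly one extra cycle, so three single-cycle stores take four cycles; that is my reading of the scope trace rather than something taken from the TRM. Both helper names are illustrative.

```c
#include <assert.h>

/* Percentage saved going from `before` cycles to `after` cycles. */
static double percent_reduction(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}

/* Percentage overhead of `actual` cycles relative to the ideal `base`. */
static double percent_increase(double base, double actual)
{
    assert(base > 0.0);
    return 100.0 * (actual - base) / base;
}
```

Going from 75 to 63 cycles is a 16% reduction, and four cycles instead of three is a 33% overhead.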

There is now not any more to squeeze out of these, the processing time is as expected.

Thanks again!

Howardlong · « **Reply #89 on:** April 29, 2015, 12:18:21 pm »

A quick note on how performance is affected depending on the memory used for code.

I ran 100 iterations of the dual FIR filter and measured the time taken.

33.7us RAMLoc128 (shared with data)
33.0us RAMLoc72
33.7us RAMAHB32
89.0us SPIFI (quad SQPI flash, running at the default 40.8MHz)
54.9us SPIFI (quad SQPI flash, running at 102MHz[Maximum allowed])

Debug environment for this is two LPC Link2's, one as debugger the other as target, and the cat has found a new enclosure for the debugger.

