Author Topic: Sometimes it really pays to look at the compiler's assembly output...  (Read 1199 times)


Offline HwAoRrDkTopic starter

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Here is a great example of how examining the assembly code generated by your compiler and applying some lateral thinking can benefit both the size and execution speed of some code.

I've been writing some code that uses the CAN peripheral of an STM8 micro. I have a CAN message-reception interrupt routine that needs to extract the message ID, length and data of each received message. On the STM8's CAN peripheral, this involves selecting the receive mailbox 'page' of registers, and reading out the aforementioned data from them. Part of this process is determining whether you have a standard or extended ID, and reading from the appropriate registers accordingly (2 of the 4 MIDn registers for 11-bit std. IDs, all 4 for 29-bit ext. IDs).

The first stab I took at the code that reads the message ID from the registers took the following form. Nothing complicated, read each 8-bit value, shift them along, concatenate into a single 32-bit value - ho-hum, just the kind of code you write in your sleep. Just have to remember to mask off the RTR and IDE flags, as they occupy a couple of bits in the first ID register.

Code: [Select]
if(bit_is_set(CAN_RXMB_MIDR1, CAN_RXMB_MIDR1_IDE)) {
    msg.id = ((uint32_t)(CAN_RXMB_MIDR1 & ~(_BV(CAN_RXMB_MIDR1_RTR) | _BV(CAN_RXMB_MIDR1_IDE))) << 24) |
             ((uint32_t)CAN_RXMB_MIDR2 << 16) |
             ((uint32_t)CAN_RXMB_MIDR3 << 8) |
             CAN_RXMB_MIDR4;
} else {
    msg.id = ((uint32_t)(CAN_RXMB_MIDR1 & ~(_BV(CAN_RXMB_MIDR1_RTR) | _BV(CAN_RXMB_MIDR1_IDE))) << 6) |
             ((uint32_t)CAN_RXMB_MIDR2 >> 2);
}

But, examining the assembly output for this code led to a surprise:

Code: [Select]
ld a, 0x542a
ld xl, a
ld a, 0x542a
ld xh, a
ld a, 0x542b
clrw y
ld (0x18, sp), a
clr (0x17, sp)
clr (0x16, sp)
clr (0x15, sp)
ld a, xh
and a, #0x9f
ld yl, a
ldw (0x1b, sp), y
clr (0x1a, sp)
clr (0x19, sp)
ld a, xl
bcp a, #0x40
jreq 00104$
ld a, (0x1c, sp)
clrw x
clr (0x1a, sp)
ldw y, (0x17, sp)
ldw (0x15, sp), y
clr (0x18, sp)
clr (0x17, sp)
or a, (0x15, sp)
ld (0x01, sp), a
ld a, xl
or a, (0x18, sp)
ld (0x04, sp), a
ld a, xh
or a, (0x17, sp)
ld (0x03, sp), a
ld a, (0x1a, sp)
or a, (0x16, sp)
ld (0x02, sp), a
ld a, 0x542c
ld yh, a
clrw x
swapw x
ld a, (0x04, sp)
ld (0x1c, sp), a
ld a, yh
or a, (0x03, sp)
ld (0x1b, sp), a
ld a, xl
or a, (0x02, sp)
ld (0x1a, sp), a
ld a, xh
or a, (0x01, sp)
ld (0x19, sp), a
ld a, 0x542d
clrw y
clrw x
or a, (0x1c, sp)
rlwa y
or a, (0x1b, sp)
ld yh, a
ld a, xl
or a, (0x1a, sp)
rlwa x
or a, (0x19, sp)
ld xh, a
ldw (0x07, sp), y
ldw (0x05, sp), x
jra 00105$
00104$:
ld a, #0x06
00156$:
sll (0x1c, sp)
rlc (0x1b, sp)
rlc (0x1a, sp)
rlc (0x19, sp)
dec a
jrne 00156$
ldw x, (0x17, sp)
ldw y, (0x15, sp)
srlw y
rrcw x
srlw y
rrcw x
ld a, xl
or a, (0x1c, sp)
ld (0x18, sp), a
ld a, xh
or a, (0x1b, sp)
ld (0x17, sp), a
ld a, yl
or a, (0x1a, sp)
ld (0x16, sp), a
ld a, yh
or a, (0x19, sp)
ld (0x05, sp), a
ldw y, (0x17, sp)
ldw (0x07, sp), y
ld a, (0x16, sp)
ld (0x06, sp), a

Oof, that's a lot of code! :o This won't do, especially as it's in an interrupt handler, which you want to keep as short and quick as possible.

So I got to thinking: the layout of the MIDRn registers is such that, regardless of whether we're dealing with a standard (11-bit) or extended (29-bit) ID, it always has the MSB of the ID in the first register... and the STM8 is a big-endian architecture... Why not just treat the set of registers as one large 32- or 16-bit register? I can alias the first register as a pointer to a uint32_t or uint16_t, and simply read the whole thing in one go!

Code: [Select]
/* In a header file: */
#define _SFR16(mem_addr) (*(const volatile uint16_t *)(mem_addr))
#define _SFR32(mem_addr) (*(const volatile uint32_t *)(mem_addr))

if(bit_is_set(CAN_RXMB_MIDR1, CAN_RXMB_MIDR1_IDE)) {
    msg.id = _SFR32(&CAN_RXMB_MIDR1) &
             ~((uint32_t)(_BV(CAN_RXMB_MIDR1_RTR) | _BV(CAN_RXMB_MIDR1_IDE)) << 24);
} else {
    msg.id = (_SFR16(&CAN_RXMB_MIDR1) &
              ~((uint16_t)(_BV(CAN_RXMB_MIDR1_RTR) | _BV(CAN_RXMB_MIDR1_IDE)) << 8)) >> 2;
}
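Incidentally, the reason the one-shot read can still mask off RTR and IDE correctly is that masking the assembled 32-bit word is equivalent to masking MIDR1 before concatenating. A quick host-side sanity check of that equivalence, with made-up register contents (RTR and IDE occupy bits 5 and 6 of MIDR1, i.e. mask 0x60):

```c
#include <stdint.h>

/* Original approach: mask RTR|IDE (0x60) out of MIDR1, then concatenate. */
uint32_t id_mask_first(const uint8_t midr[4]) {
    return ((uint32_t)(midr[0] & ~0x60u) << 24) |
           ((uint32_t)midr[1] << 16) |
           ((uint32_t)midr[2] << 8)  |
            (uint32_t)midr[3];
}

/* One-shot approach: concatenate everything, then mask the top byte. */
uint32_t id_mask_after(const uint8_t midr[4]) {
    uint32_t whole = ((uint32_t)midr[0] << 24) | ((uint32_t)midr[1] << 16) |
                     ((uint32_t)midr[2] << 8)  |  (uint32_t)midr[3];
    return whole & ~((uint32_t)0x60 << 24);
}
```

Both functions always agree, which is what licenses doing a single wide read and cleaning up afterwards.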

This drastically shrinks the compiled assembly output:

Code: [Select]
ld a, 0x542a
bcp a, #0x40
jreq 00104$
ldw x, #0x542a
ldw y, x
ldw y, (0x2, y)
ldw x, (x)
ld a, xh
and a, #0x9f
ld xh, a
ldw (0x03, sp), y
ldw (0x01, sp), x
jra 00105$
00104$:
ldw x, 0x542a
ld a, xh
and a, #0x9f
ld xh, a
srlw x
srlw x
clrw y
ldw (0x03, sp), x
ldw (0x01, sp), y

We've gone from 97 instructions to 23. I estimate that, taking into account that the original code was about a 70/30 split between single-cycle and two-cycle instructions, we've chopped about 110 cycles off the overall interrupt code; at a CPU speed of 16 MHz, this saves nearly 7 µs of execution time. Given that the entire ISR executed in approx. 29 µs, that's a significant 24% time saving! :-+
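For anyone checking the arithmetic, a trivial sketch using the (approximate) figures from above:

```c
/* Back-of-envelope check of the timing estimate: ~110 cycles saved at
 * 16 MHz, against a ~29 microsecond total ISR (figures from the post). */
static double saved_us(double cycles, double mhz) {
    return cycles / mhz;               /* cycles at mhz MHz -> microseconds */
}

static double saved_pct(double us, double total_us) {
    return us / total_us * 100.0;      /* fraction of the total ISR time */
}
```

110 / 16 = 6.875 µs, and 6.875 / 29 ≈ 23.7%, matching the numbers quoted.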

So, don't be like me and mindlessly write some naive, inefficient code when all along the endianness of your architecture and register layout were giving big hints about how one should be doing things. :)
 
The following users thanked this post: mikerj, Siwastaja, rhodges

Offline T3sl4co1l

  • Super Contributor
  • ***
  • Posts: 21688
  • Country: us
  • Expert, Analog Electronics, PCB Layout, EMC
    • Seven Transistor Labs
Re: Sometimes it really pays to look at the compiler's assembly output...
« Reply #1 on: February 22, 2020, 05:03:09 pm »
What -O setting?

Tim
Seven Transistor Labs, LLC
Electronic design, from concept to prototype.
Bringing a project to life?  Send me a message!
 

Offline HwAoRrDkTopic starter

  • Super Contributor
  • ***
  • Posts: 1478
  • Country: gb
Re: Sometimes it really pays to look at the compiler's assembly output...
« Reply #2 on: February 22, 2020, 06:21:53 pm »
None, because the compiler is SDCC and that setting doesn't exist. ;D

But to respond to the intent of the question: the default optimisation option for SDCC is 'balanced', and this was compiled with that. I haven't tried compiling the original code with the 'speed' or 'size' optimisation levels, but I doubt it would make much difference. SDCC doesn't seem to have as sophisticated an optimiser as GCC.

Anyway, I think it's more the difference of approach to the problem that leads to the drastic improvement, rather than any compiler cleverness (or lack thereof).
 

Offline rhodges

  • Frequent Contributor
  • **
  • Posts: 306
  • Country: us
  • Available for embedded projects.
    • My public libraries, code samples, and projects for STM8.
Re: Sometimes it really pays to look at the compiler's assembly output...
« Reply #3 on: February 22, 2020, 06:41:47 pm »
I agree 100%. When writing STM8 code, I like to browse the assembly listings, mostly for library code, but sometimes even in main code. As you explained, sometimes making changes to your C code will help a lot. And sometimes I decide, "Screw that, I'm using inline assembly so the compiler CAN NOT mess it up."
Currently developing STM8 and STM32. Past includes 6809, Z80, 8086, PIC, MIPS, PNX1302, and some 8748 and 6805. Check out my public code on github. https://github.com/unfrozen
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14481
  • Country: fr
Re: Sometimes it really pays to look at the compiler's assembly output...
« Reply #4 on: February 22, 2020, 06:51:44 pm »
Looking at the generated assembly always pays off for time-critical sections.

Now of course things can get particularly inefficient when manipulating 32-bit values on an 8-bitter, and yes the compiler still matters. SDCC seems to be translating your C code pretty "literally" here, whereas a good optimizing compiler would do much better than this. I don't know whether it would do as well as with your hand-modified version, but surely better. Unfortunately, C compilers for 8-bit targets tend to be relatively limited, so I don't have anything better than SDCC available with which I could do some testing...

One thing you could try with SDCC is to replace your bit manipulation expression with a struct with bit fields instead. SDCC may compile that more efficiently than manual bit shifting.
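A hypothetical sketch of that suggestion (the field names are invented here; note that bit-field order and packing are implementation-defined, so the overlay would have to be checked against SDCC's actual behaviour before trusting it on hardware):

```c
#include <stdint.h>

/* Made-up overlay for a MIDR1-style register: the top ID bits plus the
 * RTR and IDE flags. Bit-field layout is implementation-defined, so this
 * must be verified against the compiler's output before use on hardware. */
struct midr1_bits {
    uint8_t id_hi : 5;  /* upper bits of the message ID */
    uint8_t rtr   : 1;  /* remote transmission request flag */
    uint8_t ide   : 1;  /* extended-ID flag */
    uint8_t       : 1;  /* unused */
};
```

The appeal is that `m.ide` and `m.id_hi` compile to single mask/shift operations, with no hand-written masking in the C source.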

 

Offline ajb

  • Super Contributor
  • ***
  • Posts: 2607
  • Country: us
Re: Sometimes it really pays to look at the compiler's assembly output...
« Reply #5 on: February 22, 2020, 08:11:24 pm »
One fine point of what's going on here is that the base register definitions are certainly volatile-qualified, so in the first example the compiler is severely limited in the amount of optimization it can do at any optimization level.  The casts to uint32_t in the first snippet in the OP will not necessarily change this, as they are applied after the pointer dereferencing built into the register definitions (maybe some compilers would use such a cast as a hint for optimizations that get around the normal requirements of volatile, but that sounds risky enough that I would want my compiler to NOT do so as a rule).  You might be able to get around that by taking the address of the registers and re-casting it to a non-volatile pointer.  It would be interesting to try different variations of that and see what kind of assembly you get.
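As a sketch of that last idea, here's what the re-cast might look like, using a fake register block so it runs on a host (`fake_regs` and `read_id_nonvolatile` are invented for illustration; on real hardware, casting away volatile is only safe if the mailbox contents are guaranteed stable while you read them):

```c
#include <stdint.h>

/* Stand-in for four volatile-qualified CAN ID registers. */
volatile uint8_t fake_regs[4] = {0x12, 0x34, 0x56, 0x78};

uint32_t read_id_nonvolatile(void) {
    /* Re-cast through a non-volatile pointer: the compiler may now
     * coalesce or reorder the four loads, which volatile forbids. */
    const uint8_t *p = (const uint8_t *)fake_regs;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```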
 

Offline T3sl4co1l

  • Super Contributor
  • ***
  • Posts: 21688
  • Country: us
  • Expert, Analog Electronics, PCB Layout, EMC
    • Seven Transistor Labs
Re: Sometimes it really pays to look at the compiler's assembly output...
« Reply #6 on: February 23, 2020, 01:35:11 am »
Quote from: HwAoRrDk on February 22, 2020, 06:21:53 pm
But to respond to the intent of the question: the default optimisation option for SDCC is 'balanced', and this was with that. Haven't tried compiling the original code with 'speed' or 'size' optimisation levels, but I doubt it would make much difference. SDCC doesn't seem to have as sophisticated an optimiser as GCC.

Anyway, I think it's more the difference of approach to the problem that leads to the drastic improvement, rather than any compiler cleverness (or lack thereof).

Ah...

Sometimes the compiler is smart enough to do things like this; it depends on the internal structure, and how many conditions they put into the optimizer.  (GCC and Clang use an intermediate format to reason about optimization, which is then converted to the final target instruction set with little or no additional optimization.)

An example where this optimization will fail might be when the intermediate format assumes efficient indexed indirect addressing -- accesses to structs and arrays (and both) are always(??) implemented in literal fashion, using indexed indirect addressing modes when available -- but if the target lacks them, it will constantly perform pointer arithmetic to deal with the indexing.  Of... more historic interest, the Z80 and 8086 have some of these modes, but they're pretty slow at it (10-20 cycles), so it's not always the fastest route, or the most compact (the instructions are much longer, too).

It looks like the STM8 is... 6502-ish?  Doesn't have more registers, or many more anyway, but more can be used as pointers?  So that'll be a lot of faffing around regardless.  Hmm, SP-relative indexed?  That's handy.

Optimizers also tend to be lazy, so that an operation that they most definitely can improve, they might fail on, just because they run out of time to do it.

Last crazy example I made was a bit-shuffling operation; specifically, packing 24-bit color into a 5-6-5 LCD format.  I wrote a one-liner (well, one statement, sprawled over half a dozen lines for readability...) and GCC shat out something like 200 words, of what looked to be the expression pretty much verbatim.  Much like your first example, just longer.  Putting intermediate steps in temp variables cut that by about half.
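For reference, the 24-bit to 5-6-5 packing Tim describes is typically a one-liner like this (a generic formulation, not his exact code):

```c
#include <stdint.h>

/* Pack 24-bit RGB (8 bits per channel) into the common RGB565 LCD
 * format: red in the top 5 bits, green in the middle 6, blue in the
 * bottom 5. Each channel is truncated by dropping its low bits. */
static inline uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b) {
    return (uint16_t)(((r & 0xF8u) << 8) |  /* RRRRR000 -> bits 15..11 */
                      ((g & 0xFCu) << 3) |  /* GGGGGG00 -> bits 10..5  */
                       (b >> 3));           /* BBBBB    -> bits 4..0   */
}
```

Breaking an expression like this into temp variables, as Tim notes, can make life easier for a weaker optimizer even though the two forms are semantically identical.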

Seems likely, GCC would've improved your particular expression, but clearly SDCC gave up.  Probably there's a scale factor where each one gives up, and my example was past GCC's limit, and yours was past SDCC's.

Do try the other optimization levels -- those settings sound analogous to -O1/-O2, -O3 and -Os.  Maybe they'll use more powerful/aggressive tricks, or just "think" about it for longer.

In any case, it's definitely a thing, optimizing code for the target -- ultimately, you're only writing C code for a purpose.  When that's specific to a platform, you can write statements that more closely reflect the platform, and should be easier to optimize as a result.  (But aren't always.  Do keep an open mind to alternatives, and try several when you need to squeeze out those extra bytes/cycles.  The compiler certainly isn't as clever as you are, and may not pick up on the approach you were thinking of.)

The emphasis on knowing the compiler, and choosing judicious optimization settings, and platform size and performance, reflects the reality -- your time is far more expensive than the chips are, so you rarely if ever have a justifiable reason to optimize this heavily.

(I've played with this a few times myself, fortunately I need no justification as I'm not getting paid to code, and with good reason. :P )

And of course for cross-platform code, you can, say, write headers for each platform -- #if _CPU_STM8 ... #elif _CPU_AVR ... etc.  You can always keep a copy of straightforward, well-commented, but maybe not terribly optimal, code in the #else clause -- that way your code always works on some platform, but also works particularly well on the enumerated platforms.
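A minimal sketch of that per-platform header pattern (the `_CPU_STM8` macro and `READ_ID32` name are invented for illustration; only the portable #else path is exercised on a host):

```c
#include <stdint.h>

#if defined(_CPU_STM8)
/* Big-endian target: alias the four registers as one 32-bit read. */
#define READ_ID32(p) (*(const uint32_t *)(p))
#else
/* Portable fallback: assemble the value byte by byte, assuming the
 * most significant byte comes first (as in the STM8 register layout). */
static uint32_t READ_ID32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
#endif
```

The fallback is slower but works everywhere, so the code never silently breaks on an un-enumerated platform.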

The challenge then of course being, keep all versions in sync when that code inevitably needs updating.

Tim
« Last Edit: February 23, 2020, 01:37:54 am by T3sl4co1l »
Seven Transistor Labs, LLC
Electronic design, from concept to prototype.
Bringing a project to life?  Send me a message!
 

