Author Topic: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer  (Read 10146 times)


Offline serisman

  • Regular Contributor
  • *
  • Posts: 100
  • Country: us
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #25 on: June 15, 2020, 12:30:01 am »
Also, why do you think STM8 is better than both (that hasn't necessarily been my experience).

In general, it seems like 8051 gets a bad rap lots of times and I'm not entirely sure why.  Is it the split memory architecture?  Or the lack of a good open source C compiler?  Yes, I know about SDCC (and use it all the time), but it really isn't all that efficient/optimized in the code it generates.  But, it is good enough for high level things, and one can always drop down to inline assembly for the places where speed and/or size efficiency is actually important.

Speaking of program size, 8051 is actually pretty optimal in this regard.  There are a lot of 1 byte instructions in addition to the 2 byte instructions, and very few 3 byte instructions.

Let's compare STM8 to MCS-51 then. SDCC supports both. For the comparison, I'll assume we need a few KB of RAM (i.e. large memory model for mcs51, medium memory model for stm8) and want full reentrancy as in the C standard (i.e. --stack-auto option for mcs51).
The STM8 has good support for pointers (flat address space, registers x and y), while MCS-51 has to juggle with memory spaces and go through dptr. Also, the STM8 has stackpointer-relative addressing. And the SDCC stm8 port has more fancy optimizations than the mcs51 one.
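As a concrete illustration of what full reentrancy buys: as I understand it, the mcs51 port statically overlays function locals by default, so recursive code like the hypothetical function below is exactly the case that --stack-auto (or SDCC's __reentrant keyword) exists for. The function itself is portable C and can be checked on any compiler:

```c
#include <stdint.h>

/* Reentrancy as in the C standard: each activation gets its own locals.
   With static overlaying, the two simultaneously-live recursive calls
   below would clobber each other's state; stack-allocated locals
   (--stack-auto / __reentrant on mcs51) make it work. */
uint16_t fib(uint8_t n) {
    if (n < 2)
        return n;
    /* two activations are live at the same time here */
    return fib(n - 1) + fib(n - 2);
}
```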

Looking at a benchmark, we can see what this means (Dhrystone, STM8AF at 16 MHz vs C8051 at 24.5 MHz):

stm8 code size is half of mcs51:
https://sourceforge.net/p/sdcc/code/HEAD/tree/trunk/sdcc-extra/historygraphs/dhrystone-stm8-size.svg
https://sourceforge.net/p/sdcc/code/HEAD/tree/trunk/sdcc-extra/historygraphs/dhrystone-mcs51-size.svg

Despite the C8051 being single-cycle and having 50% higher clock speed, the STM8 is 85% faster:
https://sourceforge.net/p/sdcc/code/HEAD/tree/trunk/sdcc-extra/historygraphs/dhrystone-stm8-score.svg
https://sourceforge.net/p/sdcc/code/HEAD/tree/trunk/sdcc-extra/historygraphs/dhrystone-mcs51-score.svg

The graphs are from SDCC, where they are used to track code size and speed to quickly notice regressions.

Thanks for the links, although to be honest, they seem to be more about how good SDCC is for a particular architecture over time than about the architecture itself.  It looks like more work is being put into STM8, so it is on an upward trajectory (i.e. smaller code size and faster execution), while MCS51 may have had some regressions introduced that speak to a downward trajectory (i.e. larger code size and slower execution).

Also, turning on --stack-auto for re-entrancy as well as using the medium or large memory models seems to go against the SDCC defaults and recommendations.  For the projects I have worked on, there wasn't a need to go with either of those.  Potentially those options are contributing dramatically to the increased code size and lower performance.  My observation from porting code originally written for AVR is that it usually compiles to a smaller size when ported to 8051.  But it could also be that I am optimizing it in the process and it would be smaller regardless of destination architecture.

I fully agree that SDCC is not particularly good at generating optimal code for the MCS51 architecture.  What I'm not sure about is whether that is inherent to the 8051 architecture, or just because SDCC tries to work across so many architectures that it is hard to optimize for any one.  Or is it just because some architectures have had more interest, and therefore more work done on optimizing them?  Also, I'm not trying to pick on SDCC; I am very thankful it exists, and I can appreciate how difficult it must be to create and maintain a complex multi-architecture compiler.

I will say, however, that after looking at the generated code for MCS51, one has to wonder if the authors actually read through and understood the full 8051 instruction set.

Something simple like:
Code: [Select]
uint8_t i = 8;
do { /* ... */ } while (--i);

should compile down to something like this (4 bytes, 6 cycles) (Note: all cycle counts here and below are for the N76E003; other MCUs are even better cycle-wise):
Code: [Select]
mov r7,#0x08 ; 2 bytes, 2 cycles
00101$:
; ...
djnz r7,00101$ ; 2 bytes, 4 cycles

Why then does SDCC generate this (7 bytes, 8 cycles)?  Does SDCC not know about the DJNZ (decrement and jump if not zero) instruction?:
Code: [Select]
mov r7,#0x08 ; 2 bytes, 2 cycles
00101$:
; ...
mov a,r7 ; 1 byte, 1 cycle
dec a ; 1 byte, 1 cycle
mov r7,a ; 1 byte, 1 cycle
jnz 00101$ ; 2 bytes, 3 cycles

Or, consider this:
Code: [Select]
char __code lookup[] = {'0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'};
char low_nibble_to_hex(uint8_t nibble) {
    return lookup[nibble & 0xF];
}

for which SDCC generates (22 bytes, 31 cycles):
Code: [Select]
mov r7,dpl ; 2 bytes, 2 cycles
anl ar7,#0x0f ; 3 bytes, 4 cycles
mov r6,#0x00 ; 2 bytes, 2 cycles
mov a,r7 ; 1 byte, 2 cycles
add a,#_lookup ; 2 bytes, 2 cycles
mov dpl,a ; 2 bytes, 2 cycles
mov a,r6 ; 1 byte, 1 cycle
addc a,#(_lookup >> 8) ; 2 bytes, 2 cycles
mov dph,a ; 2 bytes, 2 cycles
clr a ; 1 byte, 1 cycle
movc a,@a+dptr ; 1 byte, 4 cycles
mov dpl,a ; 2 bytes, 2 cycles
ret ; 1 byte, 5 cycles

which could be as simple as (11 bytes, 19 cycles).  Does SDCC not know about the MOV DPTR, #address instruction?:
Code: [Select]
mov a,dpl ; 2 bytes, 3 cycles
anl a,#0x0f ; 2 bytes, 2 cycles
mov dptr,#_lookup ; 3 bytes, 3 cycles
movc a,@a+dptr ; 1 byte, 4 cycles
mov dpl,a ; 2 bytes, 2 cycles
ret ; 1 byte, 5 cycles

And, how about this:
Code: [Select]
void print(char __xdata *string) {
    char c = *string;
    while (c != 0) {
        // ...
        string++;
        c = *string;
    }
}

for which SDCC generates (23 bytes, 40 cycles, 28 cycles in the loop):
Code: [Select]
mov r6,dpl ; 2 bytes, 2 cycles
mov  r7,dph ; 2 bytes, 2 cycles
movx a,@dptr ; 1 byte, 4 cycles
mov r5,a ; 1 byte, 1 cycle
00101$:
mov a,r5 ; 1 byte, 1 cycle
jz 00104$ ; 2 bytes, 3 cycles
; ...
inc r6 ; 1 byte, 3 cycles
cjne r6,#0x00,00116$ ; 3 bytes, 4 cycles
inc r7 ; 1 byte, 3 cycles
00116$:
mov dpl,r6 ; 2 bytes, 2 cycles
mov dph,r7 ; 2 bytes, 2 cycles
movx a,@dptr ; 1 byte, 4 cycles
mov r5,a ; 1 byte, 1 cycle
sjmp 00101$ ; 2 bytes, 3 cycles
00104$:
ret ; 1 byte, 5 cycles

which could be as simple as this (8 bytes, 20 cycles, 11 cycles in the loop). Does SDCC not know about the INC DPTR instruction?:
Code: [Select]
movx a,@dptr ; 1 byte, 4 cycles
00101$:
jz 00104$ ; 2 bytes, 3 cycles
; ...
inc dptr ; 1 byte, 1 cycle
movx a,@dptr ; 1 byte, 4 cycles
sjmp 00101$ ; 2 bytes, 3 cycles
00104$:
ret ; 1 byte, 5 cycles

Or, how about this:
Code: [Select]
uint8_t div8(uint8_t a) {
    return a / 8;
}

for which SDCC generates this (17+ bytes, 29+ cycles):
Code: [Select]
mov r7,dpl ; 2 bytes, 4 cycles
mov r6,#0x00 ; 2 bytes, 2 cycles
mov __divsint_PARM_2,#0x08 ; 3 bytes, 3 cycles
mov (__divsint_PARM_2 + 1),r6 ; 2 bytes, 3 cycles
mov dpl,r7 ; 2 bytes, 4 cycles
mov dph,r6 ; 2 bytes, 4 cycles
ljmp __divsint ; 3 bytes, 4 cycles (plus unknown bytes/cycles inside __divsint and a final 1 byte, 5 cycles for ret)

which could be as simple as this (10 bytes, 16 cycles).  Why does SDCC need to farm this out to a helper function?:
Code: [Select]
mov a, dpl ; 2 bytes, 3 cycles
mov b, #0x08 ; 3 bytes, 3 cycles
div ab ; 2 bytes, 3 cycles
mov dpl, a ; 2 bytes, 2 cycles
ret ; 1 byte, 5 cycles

or even as simple as this (9 bytes, 14 cycles).  Does SDCC not know that / 8 is the same as >> 3?:
Code: [Select]
mov a,dpl ; 2 bytes, 3 cycles
swap a ; 1 byte, 1 cycle
rl a ; 1 byte, 1 cycle
anl a,#0x1f ; 2 bytes, 2 cycles
mov dpl,a ; 2 bytes, 2 cycles
ret ; 1 byte, 5 cycles

Sorry if some of these seem a bit contrived, but they are all subsets of things I have had to manually optimize around in a recent project I have been working on.  And, they go to show how much of an impact a compiler implementation can have on the program size and execution speed.  I don't have experience using them, but from what I have read the IAR and Keil compilers generate better / more optimized code than SDCC.  Again, not trying to bash SDCC, just pointing out that what a particular compiler generates isn't the end-all be-all of a given processor architecture.

It would be an interesting exercise to take some real world code and hand optimize it to certain architectures to compare more real-world results.
 

Offline serisman

  • Regular Contributor
  • *
  • Posts: 100
  • Country: us
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #26 on: June 15, 2020, 01:07:35 am »
I'm curious why you prefer Padauk over 8051? […]

In general, it seems like 8051 gets a bad rap lots of times and I'm not entirely sure why.  Is it the split memory architecture?  Or the lack of a good open source C compiler?  Yes, I know about SDCC (and use it all the time), but it really isn't all that efficient/optimized in the code it generates.  But, it is good enough for high level things, and one can always drop down to inline assembly for the places where speed and/or size efficiency is actually important.

Padauk having fewer memory spaces makes the architecture somewhat cleaner.  On MCS-51, any pointer read / write has to go through support functions (or the programmer has to manually specify the memory space, i.e. use non-standard extensions of C).  For Padauk, the problem exists for pointer reads only.  It would be good if SDCC had better tracking of pointers, so the use of support functions could be optimized out in more cases.  But often that would be very hard to do (e.g. passing pointers to a function defined in a different source file).

I have to admit that to some degree, this cleanliness in architecture comes at the cost of loss of power: The Padauks are limited to far less memory than MCS-51.


Yes, exactly.  Most of the Padauks are limited to less SRAM than even the MCS-51's directly accessible lower 128 bytes.  And the upper 128 bytes of SRAM in the MCS-51 is usually mostly taken up by the stack anyway, so is there really much difference, other than that the Padauks don't even have the option of using the slightly slower, harder-to-reach extended RAM?

I have found that when I am writing code for the MCS-51, I usually know what memory type I am targeting anyway, so it hasn't been much of an issue really.  I usually leave the 128/256 bytes of SRAM for registers, scratch pad, function arguments, highly utilized small global variables, and the stack.  Everything else goes in XRAM.  And it should be clear when you need to read from program code instead of SRAM/XRAM.  So, I just use the __code or __xdata qualifiers whenever I am referencing a pointer, unless I really don't know where I am pointing to.  Sometimes that means two different versions of a function, which sometimes is less code (and always faster) than using generic pointers and figuring it out at run-time.
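The "pick the memory space explicitly" style described here can be sketched in C. __code and __xdata are SDCC extensions; the fallback macros and the hex_encode helper below are illustrative inventions of this sketch, and the macros just let the same logic build and be checked on a host compiler:

```c
/* Sketch: explicit SDCC memory-space qualifiers.
   On a non-SDCC (host) compiler the qualifiers compile away,
   so the logic itself can still be tested. */
#ifndef __SDCC
#define __code
#define __xdata
#endif

#include <stdint.h>

/* Table lives in flash on the 8051; plain const data on a host. */
static __code const char hex_digits[] = "0123456789ABCDEF";

/* Explicitly-typed pointer: no generic-pointer helper calls needed. */
uint8_t hex_encode(__xdata char *dst, uint8_t byte) {
    dst[0] = hex_digits[byte >> 4];   /* movc read from code space */
    dst[1] = hex_digits[byte & 0x0F];
    dst[2] = '\0';
    return 2; /* characters written, excluding the terminator */
}
```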

AVR has the same issue in that to read program code you have to use different instructions, hence you will see PROGMEM and associated helpers scattered through AVR/Arduino code.  I actually like the SDCC __code attribute way of doing it much better.

And it looks like the Padauks also require different instructions (LDTABH/LDTABL) for this, which are only available on the 15-bit (or higher?) MCUs.

Surely the Padauk, with its limited SRAM, is no better than the 8051 in terms of memory architecture.  Even comparing the 8051 to other architectures, I don't find the split memory to be all that big of a deal really.  In fact, it is kind of liberating.  The internal 128/256 bytes of SRAM can be thought of as a giant 'register' pool or scratch pad, and then one can use the larger XRAM for normal things like global variables that aren't accessed as often, or arrays that have to be accessed indirectly anyway.

The 8051's instructions for accessing XRAM really aren't that different from what other architectures require for indirect access (i.e. mov dptr,#address; movx a,@dptr; or movx @dptr,a; inc dptr).  This is very similar to AVR's X, Y, Z registers, which are used as indirect pointers into SRAM.  Most 8051 MCUs these days support dual dptrs as well.  Maybe the biggest limitation is that the stack has to fit within the internal 256 bytes of SRAM, but I haven't found that to be too much of a limit so far.

Though it is somewhat unfortunate that there are so many different ways of handling dual dptr.  There is a document suggesting splitting the existing mcs51 backend in SDCC into 5 different ones to cover the most common variants of dual dptr handling (https://sourceforge.net/p/sdcc/wiki/8051%20Variants/).  But even then there are many more variants not yet covered.  Manufacturers don't even stick to one single way within their own product lines.

Yes, the lack of a standard dual dptr implementation is unfortunate.  Splitting the MCS51 backend into 5 different ones sounds less than ideal.  Just spit-balling, what about just passing a flag in that defines a specific dual dptr variant with the default (if no flag is passed) being to not use a second dptr?

I agree that these Padauk MCUs are interesting and have their place, but I think they shine in a different area than 8051's and other MCUs.  To me, the main benefit is the low cost, the lower power consumption, and the fact that they are good enough for a lot of simple things.  But, the small SRAM/Flash size and lack of hardware peripherals certainly can be a limitation in many projects where spending a little bit more goes a long way.

The small SRAM size is clearly a limitation. The architecture tops out at 512B, but all devices I know about have at most 256B. Code memory is far less limited: the architecture supports up to 8KW of 16-bit memory, i.e. 16 KB, though all devices I know about have at most 4KW. I am not sure yet about the peripheral situation: the Padauk FPPA approach (i.e. hardware multi-threading) makes it possible to do a lot of stuff in software that would otherwise need a peripheral.

The largest code memory of any Padauk IC I have been interested in so far is 3KW of 15-bit words with 256 bytes of SRAM (PFS173).  For code, this isn't too big of a limitation really, but I'm curious how efficiently data is stored.  If I want to store an array of bytes in program code, does it have to use a full 15-bit word for each byte?

As for peripherals, yes the Padauk FPPA solution sounds interesting, but it will rarely if ever be better than dedicated hardware (for existing standards at least).  I can spin up an I2C slave that operates at 400kHz (actually seems to work fine up to at least 800 kHz) with just a few lines of code on a N76E003 ($0.20) IC.  Most of the time my program gets access to the full 16MHz and only gets interrupted occasionally when I2C data is ready to process.  Try doing that with a Padauk IC.  The FPPA approach, however, will take the 8MHz clock speed and divide it into effectively two 4MHz processors, so the main program code is going to run at slower speed, and the peripheral is going to have to use bit banging with a maximum of a 4 MHz clock as well.  This could be really interesting for custom defined protocols, but for already established protocols, I fail to see how that would ever be better (aside from the potentially cheaper cost of the Padauk).  Now, if we are talking about actual multitasking (not just duplicating a communications protocol), that could get interesting, although with the program size and memory limits being what they are, I'm not sure what would really even fit.
 

Offline serisman

  • Regular Contributor
  • *
  • Posts: 100
  • Country: us
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #27 on: June 15, 2020, 03:35:13 am »
I don't have experience using them, but from what I have read the IAR and Keil compilers generate better / more optimized code than SDCC.  Again, not trying to bash SDCC, just pointing out that what a particular compiler generates isn't the end-all be-all of a given processor architecture.

I just downloaded the evaluation version of Keil C51 and put the 4 above examples into it (slightly modified for Keil syntax).  While not perfect, Keil generates much closer to what I posted as the ideal version in each case (some differences in how registers are used compared to SDCC).

Test code for Keil:
Code: [Select]
unsigned char code lookup[] = {'0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'};
unsigned char high_nibble_to_hex(unsigned char byte) {
    return lookup[byte >> 4];
}
unsigned char low_nibble_to_hex(unsigned char byte) {
    return lookup[byte & 0x0F];
}

void print(unsigned char xdata *string) {
    unsigned char c = *string;
    while (c != 0) {
        // ...
        string++;
        c = *string;
    }
}

unsigned char div8(unsigned char a) {
    return a / 8;
}

char xdata str_buf[10];

void main() {
    unsigned char i = div8(64);
    do {
        str_buf[0] = high_nibble_to_hex(i);
        str_buf[1] = low_nibble_to_hex(i);
        str_buf[2] = 0x00;
        print(str_buf);
    } while (--i);
}

Keil compiles it to:
Code: [Select]
C:0x0003: void print(unsigned char xdata *string) {
; unsigned char c = *string;
mov dpl, r7
mov dph, r6
movx a,@dptr
mov r7,a
; while (c != 0) {
0009:
mov a,r7
jz 0011
; string++;
inc dptr
; c = *string;
movx a,@dptr
mov r7,a
; }
sjmp 0009
0011:
; }
ret

C:0x0012: unsigned char high_nibble_to_hex(unsigned char byte) {
; return lookup[byte >> 4];
mov a,r7
swap a
anl a,#0x0f
mov dptr,#lookup
movc a,@a+dptr
mov r7,a
ret

C:0x001c: unsigned char low_nibble_to_hex(unsigned char byte) {
; return lookup[byte & 0x0F];
mov a,r7
anl a,#0x0f
mov dptr,#lookup
movc a,@a+dptr
mov r7,a
ret

C:0x0025: unsigned char div8(unsigned char a) {
; return a / 8;
mov a,r7
rrc a
rrc a
rrc a
anl a,#0x1f
mov r7,a
ret

C:0x0800: void main() {
; unsigned char i = div8(64);
mov r7,#0x40
lcall div8(C:0025)
mov r5,ar7
; do {
C:0x0807
; str_buf[0] = high_nibble_to_hex(i);
mov r7,ar5
lcall high_nibble_to_hex(C:0012)
mov dptr,#0x0000 (&str_buf[0])
mov a,r7
movx @dptr,a
; str_buf[1] = low_nibble_to_hex(i);
mov r7,ar5
lcall low_nibble_to_hex(C:001C)
mov dptr,#0x0001 (&str_buf[1])
mov a,r7
movx @dptr,a
; str_buf[2] = 0x00;
clr a
inc dptr
movx @dptr,a
; print(str_buf);
mov r6,#0x00 (&str_buf[0])
mov r7,#0x00 (&str_buf[0])
lcall print(C:0003)
; } while (--i);
djnz r5,C:0x0807
; }
ret

Here is the exact same code compiled by SDCC:
Code: [Select]
_high_nibble_to_hex: unsigned char high_nibble_to_hex(unsigned char byte) {
; return lookup[byte >> 4];
mov a,dpl
swap a
anl a,#0x0f
mov dptr,#_lookup
movc a,@a+dptr
; }
mov dpl,a
ret

_low_nibble_to_hex: unsigned char low_nibble_to_hex(unsigned char byte) {
mov r7,dpl
; return lookup[byte & 0x0F];
anl ar7,#0x0f
mov r6,#0x00
mov a,r7
add a,#_lookup
mov dpl,a
mov a,r6
addc a,#(_lookup >> 8)
mov dph,a
clr a
movc a,@a+dptr
; }
mov dpl,a
ret

_print: void print(unsigned char __xdata *string) {
; unsigned char c = *string;
mov r6,dpl
mov  r7,dph
movx a,@dptr
mov r5,a
; while (c != 0) {
00101$:
mov a,r5
jz 00104$
; string++;
inc r6
cjne r6,#0x00,00116$
inc r7
00116$:
; c = *string;
mov dpl,r6
mov dph,r7
movx a,@dptr
mov r5,a
sjmp 00101$
00104$:
; }
ret

_div8: unsigned char div8(unsigned char a) {
mov r7,dpl
; return a / 8;
mov r6,#0x00
mov __divsint_PARM_2,#0x08
mov (__divsint_PARM_2 + 1),r6
mov dpl,r7
mov dph,r6
ljmp __divsint

_main: void main() {
; unsigned char i = div8(64);
mov dpl,#0x40
lcall _div8
mov r7,dpl
; do {
00101$:
; str_buf[0] = high_nibble_to_hex(i);
mov dpl,r7
push ar7
lcall _high_nibble_to_hex
mov r6,dpl
pop ar7
mov dptr,#_str_buf
mov a,r6
movx @dptr,a
; str_buf[1] = low_nibble_to_hex(i);
mov dpl,r7
push ar7
lcall _low_nibble_to_hex
mov r6,dpl
mov dptr,#(_str_buf + 0x0001)
mov a,r6
movx @dptr,a
; str_buf[2] = 0x00;
mov dptr,#(_str_buf + 0x0002)
clr a
movx @dptr,a
; print(str_buf);
mov dptr,#_str_buf
lcall _print
pop ar7
; while (--i);
djnz r7,00101$
; }
ret

Interestingly enough, in this case SDCC did actually do some of the same optimizations, although not as many as Keil.  It's really weird that it chose to optimize high_nibble_to_hex and low_nibble_to_hex so differently.  In one case SDCC uses the 'mov dptr,#_lookup' shorthand, in the other case it blows it out into several instructions (Keil does it optimally for both).  The print function still isn't using 'inc dptr' for SDCC (but does for Keil).  SDCC is still using a division helper (but Keil does the shorter/quicker >> 3 syntax).  Both SDCC and Keil use 'mov dptr,#str_buf' syntax and 'djnz' in main (not sure why SDCC wasn't using it in my earlier tests), although SDCC has some extra push/pop code that Keil doesn't need (SDCC uses dpl/dph for function first argument where Keil seems to use r7/r6).

So, a lot of it definitely seems to come down to how optimized the compiler is for the given architecture.
« Last Edit: June 15, 2020, 05:58:59 am by serisman »
 

Offline greenpossum

  • Frequent Contributor
  • **
  • Posts: 408
  • Country: au
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #28 on: June 15, 2020, 04:19:10 am »
I assume you changed code to __code when moving from Keil to SDCC.
 

Offline serisman

  • Regular Contributor
  • *
  • Posts: 100
  • Country: us
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #29 on: June 15, 2020, 04:25:53 am »
I assume you changed code to __code when moving from Keil to SDCC.
Yes, and xdata to __xdata.  Otherwise they were the same.
 

Offline serisman

  • Regular Contributor
  • *
  • Posts: 100
  • Country: us
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #30 on: June 15, 2020, 04:29:47 am »
And, sorry for taking this thread so far off topic.  Although I do find it interesting comparing these low cost MCUs at an instruction set level.  They each have their pros/cons for sure.  And, learning when to use one vs. another is surely useful information to have.
 

Offline spth

  • Regular Contributor
  • *
  • Posts: 163
  • Country: de
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #31 on: June 15, 2020, 10:10:49 am »
And it looks like the Padauks also require different instructions (LDTABH/LDTABL) for this, which are only available on the 15-bit (or higher?) MCUs.

[…]

The largest code memory of any Padauk IC I have been interested in so far is 3KW of 15-bit words with 256 bytes of SRAM (PFS173).  For code, this isn't too big of a limitation really, but I'm curious how efficiently data is stored.  If I want to store an array of bytes in program code, does it have to use a full 15-bit word for each byte?

Accessing code memory is indeed different from accessing data memory. However, having two types of memory to read from, and one to write to makes code generation easier than on mcs51 where there are more.

To store objects in code memory, SDCC currently uses one word per byte. That is unlikely to change for pdk13 and pdk14, and probably also pdk15.  However, a future pdk16 backend is likely to use one word for two bytes (reading via ldtabl and ldtabh).
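The one-word-per-byte vs. two-bytes-per-word trade-off can be sketched in portable C. The packing scheme below (low byte in the low half of the word) is purely illustrative, not SDCC's actual layout; the two read paths correspond roughly to what ldtabl/ldtabh would fetch:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative packing: two data bytes per 16-bit code word. */
static void pack_bytes(const uint8_t *src, size_t n, uint16_t *words) {
    for (size_t i = 0; i < n; i++) {
        if (i % 2 == 0)
            words[i / 2] = src[i];                  /* low half  ("ldtabl") */
        else
            words[i / 2] |= (uint16_t)src[i] << 8;  /* high half ("ldtabh") */
    }
}

/* Read byte i back out of the packed word array. */
static uint8_t read_packed(const uint16_t *words, size_t i) {
    uint16_t w = words[i / 2];
    return (i % 2 == 0) ? (uint8_t)w : (uint8_t)(w >> 8);
}
```

With one word per byte, a 1 KB table costs 1 KW of code memory; packed this way, it costs 512 words at the price of the extra unpacking read.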
 

Offline spth

  • Regular Contributor
  • *
  • Posts: 163
  • Country: de
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #32 on: June 15, 2020, 07:32:28 pm »
Thanks for the links, although to be honest, they seem to be more about how good SDCC is for a particular architecture over time than about the architecture itself.  It looks like more work is being put into STM8, so it is on an upward trajectory (i.e. smaller code size and faster execution), while MCS51 may have had some regressions introduced that speak to a downward trajectory (i.e. larger code size and slower execution).
Yes. In particular there were some changes in the front-end that stm8 apparently adapted to better. And of course there is still potential for more machine-independent optimizations in SDCC.
Quote
Also, turning on --stack-auto for re-entrancy as well as using the medium or large memory models seems to go against the SDCC defaults and recommendations.  For the projects I have worked on, there wasn't a need to go with either of those. 
For Dhrystone, we need the large memory model, as it uses about 5KB of RAM.
 

Offline ebclr

  • Super Contributor
  • ***
  • Posts: 2328
  • Country: 00
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #33 on: June 16, 2020, 08:19:30 am »
You did a good job showing that the one to be blamed is the SDCC compiler, not the 8051.  I always use Keil.  People who use SDCC are the same ones that use Linux: they don't want to pay for software, so they prefer to put up with a lot of restrictions and a much less friendly environment on a free thing, just to be "a freedom guy".
 

Offline spth

  • Regular Contributor
  • *
  • Posts: 163
  • Country: de
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #34 on: June 16, 2020, 08:31:37 am »
You did a good job showing that the one to be blamed is the SDCC compiler, not the 8051.  I always use Keil.  People who use SDCC are the same ones that use Linux: they don't want to pay for software, so they prefer to put up with a lot of restrictions and a much less friendly environment on a free thing, just to be "a freedom guy".

However, there are further aspects:
* You apparently confuse free-as-in-beer with free-as-in-freedom.
* Major MCS-51 hardware vendors paid Keil a lot of money to provide Keil licenses at no cost to users of their hardware, so often there is no monetary advantage in using SDCC.
* Keil claims ANSI-C compliance. Not only is the ANSI C standard ancient (1989) and superseded by later versions, but Keil also does a poor job at ANSI-C compliance. SDCC, on the other hand, has reasonable support for the historic ANSI C89/ISO C90, ISO C95, and ISO C99 standards as well as the current ISO C11/C17 standard, and is already working on support for the future C2x standard.
« Last Edit: June 16, 2020, 09:28:25 am by spth »
 

Offline spth

  • Regular Contributor
  • *
  • Posts: 163
  • Country: de
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #35 on: June 16, 2020, 09:57:40 am »
Yes, the lack of a standard dual dptr implementation is unfortunate.  Splitting the MCS51 backend into 5 different ones sounds less than ideal.  Just spit-balling, what about just passing a flag in that defines a specific dual dptr variant with the default (if no flag is passed) being to not use a second dptr?

This would not be a full split (like mcs51 vs ds390).  Rather, it would be more like the z80-and-related backends (z80, gbz80, z180, ez80_z80, tlcs90, r2k, r3k), which still share most of the code.

So it would not result in much code-duplication in the compiler.

However, one would want different standard libraries, so standard library functions can take advantage of dual dptr.

But, IMO, at the moment, other improvements for the mcs51 backend are more urgent.

P.S.: AFAIK, Keil uses dual dptr for standard library functions only, not for code generation.
 

Offline spth

  • Regular Contributor
  • *
  • Posts: 163
  • Country: de
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #36 on: June 16, 2020, 10:26:37 am »
Interestingly enough, in this case SDCC did actually do some of the same optimizations, although not as many as Keil.  It's really weird that it chose to optimize high_nibble_to_hex and low_nibble_to_hex so differently.  In one case SDCC uses the 'mov dptr,#_lookup' shorthand, in the other case it blows it out into several instructions (Keil does it optimally for both).  The print function still isn't using 'inc dptr' for SDCC (but does for Keil).  SDCC is still using a division helper (but Keil does the shorter/quicker >> 3 syntax).  Both SDCC and Keil use 'mov dptr,#str_buf' syntax and 'djnz' in main (not sure why SDCC wasn't using it in my earlier tests), although SDCC has some extra push/pop code that Keil doesn't need (SDCC uses dpl/dph for function first argument where Keil seems to use r7/r6).

So, a lot of it definitely seems to come down to how optimized the compiler is for the given architecture.

I think you are seeing two main problems here:

1) SDCC sometimes makes bad choices as to what to put into which register. It doesn't sufficiently take into account that some variables are often used as pointers, so they would be better off in dptr, etc. This is due to using a register allocator that at its core is still a simple linear-scan allocator. The problem is known to SDCC developers, who came up with a better register allocator. Most ports have been converted to use the new allocator (AFAIK mcs51 and ds390 are the only remaining ports that still use the old one).

2) In the division, SDCC somehow doesn't notice that the left operand is nonnegative. SDCC used to have an optimization for cases like this in the front-end, but it was buggy (it used to affect only sizeof in rare cases, but with ISO C11 _Generic, it became a bigger issue). So that optimization was removed and replaced by other optimizations in later stages of the compiler. Apparently in your case this doesn't work, and the division (of an int, due to integer promotion) is done signed (the /8 to >> 3 optimization is only valid for unsigned operands). The best solution would be for SDCC to implement generalized constant propagation, but something simpler should work in your case.
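The promotion issue is easy to see in plain C. For nonnegative values, /8 and >>3 agree; but once the operand is a (promoted, signed) int they differ, since C division truncates toward zero while an arithmetic right shift rounds toward negative infinity. (Right-shifting a negative value is implementation-defined, though near-universally arithmetic on common compilers; the function names below are just for illustration.)

```c
#include <stdint.h>

/* 'a / 8' promotes a to (signed) int first, but since a uint8_t value
   is always nonnegative, shift and division agree here: */
uint8_t div8_div(uint8_t a)   { return (uint8_t)(a / 8); }
uint8_t div8_shift(uint8_t a) { return (uint8_t)(a >> 3); }

/* For a genuinely negative operand they do NOT agree:
   -9 / 8 == -1 (truncation toward zero), while
   -9 >> 3 == -2 (arithmetic shift, rounds toward -infinity). */
int signed_div(int x)   { return x / 8; }
int signed_shift(int x) { return x >> 3; }
```

This is why the compiler may only substitute the shift once it can prove the operand nonnegative.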
 

Offline serisman

  • Regular Contributor
  • *
  • Posts: 100
  • Country: us
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #37 on: June 16, 2020, 05:05:43 pm »
And it looks like the Padauks also require different instructions (LDTABH/LDTABL) for this, which are only available on the 15-bit (or higher?) MCUs.

[…]

The largest code memory of any Padauk IC I have been interested in so far is 3KW of 15-bit words with 256 bytes of SRAM (PFS173).  For code, this isn't too big of a limitation really, but I'm curious how efficiently data is stored.  If I want to store an array of bytes in program code, does it have to use a full 15-bit word for each byte?

Accessing code memory is indeed different from accessing data memory. However, having two types of memory to read from, and one to write to makes code generation easier than on mcs51 where there are more.

To store objects in code memory, SDCC currently uses one word per byte. That is unlikely to change for pdk13 and pdk14, and probably also pdk15.  However, a future pdk16 backend is likely to use one word for two bytes (reading via ldtabl and ldtabh).

I was curious how it was even possible to reference program code on PDK13/14, which don't have the LDTABH/LDTABL instructions.  I found the SDCC code for __gptrget, which is certainly an interesting solution to the problem (manipulating the SP and jumping to code that places a value in the accumulator and returns).  But this is nowhere near as clean or efficient as what can be accomplished on any MCS51 MCU, where a simple mov dptr,#base_addr; mov a,#offset; movc a,@a+dptr sequence can be used.  Is there a reason that PDK15 in SDCC isn't using the simpler LDTABL instruction?
 

Offline serisman

  • Regular Contributor
  • *
  • Posts: 100
  • Country: us
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #38 on: June 16, 2020, 05:12:46 pm »
Also, turning on --stack-auto for re-entrancy as well as using the medium or large memory models seems to go against the SDCC defaults and recommendations.  For the projects I have worked on, there wasn't a need to go with either of those. 
For Dhrystone, we need the large memory model, as it uses about 5KB of RAM.

Ahh.  Yeah, my projects either haven't needed that much RAM to begin with, or are mostly using RAM as a buffer, which means I have been able to get away with the small memory model so far.  I could certainly see that more complex projects with heavy data processing or deeply nested function requirements would benefit from a different architecture.  I would probably reach for a 32-bit ARM at that point (which can be obtained for as low as $0.40 or so).
 

Offline serisman

  • Regular Contributor
  • *
  • Posts: 100
  • Country: us
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #39 on: June 16, 2020, 05:26:29 pm »
You did a good job showing that the one to be blamed is the SDCC compiler, not the 8051. I always use Keil. People who use SDCC are the same ones that use Linux: they don't want to pay for software, so they accept a lot of restrictions and a much less friendly environment on a free thing, just to be "supposed to be a freedom guy".

However, there are further aspects:
* You apparently confuse free-as-in-beer with free-as-in-freedom.
* Major MCS-51 hardware vendors paid Keil a lot of money to provide Keil licenses at no cost to users of their hardware, so often there is no monetary advantage in using SDCC.
* Keil claims ANSI-C compliance. Not only is the ANSI C standard ancient (1989), superseded by later versions, but Keil also does a poor job at ANSI-C compliance. SDCC on the other hand has reasonable support for the historic ANSI C89/ISO C90, ISO C95, and ISO C99 standards and the current ISO C11/C17 standard, and is already working on support for the future C2x standard.

Yes, I agree.

I was not trying to blame SDCC.  I really was just trying to point out that a compiler's implementation can have a dramatic impact on performance and perception of an architecture, but it isn't the final word.

As I already stated, I am really thankful that we have SDCC as an option to begin with.  It isn't perfect (but neither is Keil).  And as long as one is aware of the limitations, they can usually be worked around.  Yes, ideally SDCC would be better, but that requires someone knowledgeable enough and caring enough to spend their time enhancing it.

I have paid for compilers in the past and would do so again if/when it made sense.  I remember buying Borland Turbo C/C++ back in the day, when it came on several 3.5" floppies.  Nowadays we thankfully have better open source options (GCC).  I don't like that Keil doesn't even show a price on their website.  You have to request a quote, which immediately turns me off as a consumer.  And their artificial limit of 2KB of program code for the evaluation version is way too restrictive to entice me to even give it a fair shake.  Maybe if that limit were 8KB or 16KB or so I would be more interested in giving it a real evaluation (i.e. learn it and use it for a real project).  I only downloaded it the other day to see if it was any better at generating more optimal code.
 

Offline spth

  • Regular Contributor
  • *
  • Posts: 163
  • Country: de
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #40 on: June 16, 2020, 05:27:37 pm »
Is there a reason that PDK15 in SDCC isn't using the simpler LDTABL instruction?

In the long term, the pdk15 backend will get ldtabl support.
However, for now it just wasn't worth the effort required:

* We already have a working solution with the stack hack (needed for pdk13 and pdk14 anyway)
* ldtabl needs its operand to be 16-bit aligned
* We don't want to change the alignment of all pointers to 16 bit in SDCC, as this would waste space in structs where a pointer follows members that are an odd number of bytes.
* So some pointers will not be aligned.
* This means we need a temporary 16-bit-aligned location we can use for pointers - this location would need to be saved on interrupts, increasing interrupt latency
* For most efficient code, we'd want to track alignment, i.e. SDCC should know which pointers are actually 16-bit aligned, so it could use ldtabl for those.
* On the other hand, for better support of __sfr16, we are in a similar situation (16-bit-aligned operand required for some instructions).
 

Offline serisman

  • Regular Contributor
  • *
  • Posts: 100
  • Country: us
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #41 on: June 16, 2020, 05:30:54 pm »
Yes, the lack of a standard dual dptr implementation is unfortunate.  Splitting the MCS51 backend into 5 different ones sounds less than ideal.  Just spit-balling, what about just passing a flag in that defines a specific dual dptr variant with the default (if no flag is passed) being to not use a second dptr?

This would not be a full split (like mcs51 vs ds390). Rather more like the z80-and-related backends (z80, gbz80, z180, ez80_z80, tlcs90, r2k, r3k), which still share most of the code.

So it would not result in much code-duplication in the compiler.

However, one would want different standard libraries, so standard library functions can take advantage of dual dptr.

But, IMO, at the moment, other improvements for the mcs51 backend are more urgent.

P.S.: AFAIK, Keil uses dual dptr for standard library functions only, not for code generation.

Ok, interesting.

I could see where dual dptr would be easier to implement in standard library functions (i.e. memcpy could potentially benefit from it) even if it isn't ready to be implemented for user code.

And, I agree that other improvements for MCS51 are more urgent than dual dptr.  I have rarely had the need for dual dptr to begin with, but I run across the need for other optimizations all the time.
 

Offline serisman

  • Regular Contributor
  • *
  • Posts: 100
  • Country: us
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #42 on: June 16, 2020, 05:36:34 pm »
Interestingly enough, in this case SDCC did actually do some of the same optimizations, although not as many as Keil.  It's really weird that it chose to optimize high_nibble_to_hex and low_nibble_to_hex so differently.  In one case SDCC uses the 'mov dptr,#_lookup' shorthand, in the other case it blows it out into several instructions (Keil does it optimally for both).  The print function still isn't using 'inc dptr' for SDCC (but does for Keil).  SDCC is still using a division helper (but Keil does the shorter/quicker >> 3 syntax).  Both SDCC and Keil use 'mov dptr,#str_buf' syntax and 'djnz' in main (not sure why SDCC wasn't using it in my earlier tests), although SDCC has some extra push/pop code that Keil doesn't need (SDCC uses dpl/dph for function first argument where Keil seems to use r7/r6).

So, a lot of it definitely seems to come down to how optimized the compiler is for the given architecture.

I think you are seeing two main problems here:

1) SDCC sometimes makes bad choices as to what to put into which register. It doesn't sufficiently take into account that some variables are often used as pointers, so they would be better off in dptr, etc. This is due to using a register allocator that at its core is still a simple linear scan allocator. The problem is known to SDCC developers, who came up with a better register allocator. Most ports have been converted to use the new allocator (AFAIK mcs51 and ds390 are the only remaining ports that still use the old one).

Honestly, you are above my head here (I am not a compiler expert), but it sounds good to me.   Do you know if there is an intended timetable for converting MCS51 to the use the new register allocator?

2) In the division, SDCC somehow doesn't notice that the left operand is nonnegative. SDCC used to have an optimization for cases like this in the front-end, but it was buggy (it used to affect only sizeof in rare cases, but with ISO C11 _Generic, it became a bigger issue). So that optimization was removed, and replaced by other optimizations in later stages of the compiler. Apparently in your case, this doesn't work, and the division (of an int, due to integer promotion) is done signed (the /8 to >> 3 optimization is only valid for unsigned variables). The best solution would be for SDCC to implement generalized constant propagation, but something simpler should work in your case.

Yeah, makes sense.  I was only using unsigned variables, but apparently SDCC doesn't realize that or have that optimization at the moment.  I can always write the code as >> 3 (or << 3 for multiplication) and SDCC will generate better code.  It is just less clear what my code is trying to accomplish when written that way.
 

Offline serisman

  • Regular Contributor
  • *
  • Posts: 100
  • Country: us
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #43 on: June 16, 2020, 05:38:24 pm »
Is there a reason that PDK15 in SDCC isn't using the simpler LDTABL instruction?

In the long term, the pdk15 backend will get ldtabl support.
However, for now it just wasn't worth the effort required:

* We already have a working solution with the stack hack (needed for pdk13 and pdk14 anyway)
* ldtabl needs its operand to be 16-bit aligned
* We don't want to change the alignment of all pointers to 16 bit in SDCC, as this would waste space in structs where a pointer follows members that are an odd number of bytes.
* So some pointers will not be aligned.
* This means we need a temporary 16-bit-aligned location we can use for pointers - this location would need to be saved on interrupts, increasing interrupt latency
* For most efficient code, we'd want to track alignment, i.e. SDCC should know which pointers are actually 16-bit aligned, so it could use ldtabl for those.
* On the other hand, for better support of __sfr16, we are in a similar situation (16-bit-aligned operand required for some instructions).

Cool, thanks for the explanation.
 

Offline spth

  • Regular Contributor
  • *
  • Posts: 163
  • Country: de
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #44 on: June 16, 2020, 05:46:54 pm »
Do you know if there is an intended timetable for converting MCS51 to the use the new register allocator?
There is no timetable. The closest to a timetable SDCC has is considering a bug as release-critical, but there is no equivalent for features and lesser bugs.
Quote
2) In the division, SDCC somehow doesn't notice that the left operand is nonnegative. SDCC used to have an optimization for cases like this in the front-end, but it was buggy (it used to affect only sizeof in rare cases, but with ISO C11 _Generic, it became a bigger issue). So that optimization was removed, and replaced by other optimizations in later stages of the compiler. Apparently in your case, this doesn't work, and the division (of an int, due to integer promotion) is done signed (the /8 to >> 3 optimization is only valid for unsigned variables). The best solution would be for SDCC to implement generalized constant propagation, but something simpler should work in your case.

Yeah, makes sense.  I was only using unsigned variables, but apparently SDCC doesn't realize that or have that optimization at the moment.  I can always write the code as >> 3, (or << 3 for multiplication) and SDCC will generate better code.  It is just less clear what my code is trying to accomplish when written that way.

With a quick test I just found: For this case, dividing by an unsigned number works, i.e. / 8u instead of / 8. Then integer promotion promotes the left operand to unsigned instead of int, so SDCC still recognizes it as division of an unsigned number.
 
The following users thanked this post: serisman

Offline serisman

  • Regular Contributor
  • *
  • Posts: 100
  • Country: us
Re: EEVblog #1306 (1 of 5): 3 Cent Padauk Micro - Open Source Programmer
« Reply #45 on: June 16, 2020, 06:00:58 pm »
2) In the division, SDCC somehow doesn't notice that the left operand is nonnegative. SDCC used to have an optimization for cases like this in the front-end, but it was buggy (it used to affect only sizeof in rare cases, but with ISO C11 _Generic, it became a bigger issue). So that optimization was removed, and replaced by other optimizations in later stages of the compiler. Apparently in your case, this doesn't work, and the division (of an int, due to integer promotion) is done signed (the /8 to >> 3 optimization is only valid for unsigned variables). The best solution would be for SDCC to implement generalized constant propagation, but something simpler should work in your case.

Yeah, makes sense.  I was only using unsigned variables, but apparently SDCC doesn't realize that or have that optimization at the moment.  I can always write the code as >> 3, (or << 3 for multiplication) and SDCC will generate better code.  It is just less clear what my code is trying to accomplish when written that way.

With a quick test I just found: For this case, dividing by an unsigned number works, i.e. / 8u instead of / 8. Then integer promotion promotes the left operand to unsigned instead of int, so SDCC still recognizes it as division of an unsigned number.

Thanks!  I'll have to try that out later.  I haven't been using that syntax for constants, but it makes sense why it might be required.
« Last Edit: June 16, 2020, 06:04:17 pm by serisman »
 

