Author Topic: Recursion - do you use it in the real world? (Read 42126 times)

Cerebus · « **Reply #100 on:** September 07, 2017, 05:40:18 pm »

Quote from: brucehoult on September 07, 2017, 05:35:55 pm

Everyone is entitled to their own opinions, but not their own facts.

Whether that is true is a matter of opinion.

Bruce Abbott · « **Reply #101 on:** September 07, 2017, 05:41:00 pm »

Quote from: NorthGuy on September 07, 2017, 04:31:55 pm

This is not necessarily true.

Take for example, PIC16 - 16-level deep hardware call stack, very small memory. Seems like it would be stupid to use recursion. XC8 doesn't even support recursion with default settings.

PIC16 hardware stack is for return addresses only, you can't store data on it. Now you may be wondering - how does XC8 store local variables if it can't use the stack? It just puts them in fixed RAM locations. 'Local' doesn't have to mean 'stack'.

Quote

But, on the other hand, you have very small memory. How much data can you really fit into 1024 bytes? When your data is so small, its processing is unlikely to require huge recursion depth. May be you'll need 5-7 levels.

Modern PIC16F's have up to 4K RAM, and many low-end ARM MCUs have a similar amount (but use it far less efficiently). But so what? This small amount of RAM may be perfectly adequate, except perhaps for this one function. So let's say you can get 5-7 levels with recursion, or 20-30 without. If your application needs 20 levels...

Who am I kidding? If your code is too bloated you just use a more powerful chip! I hear Raspberry Pi is the minimum now...

Mechatrommer · « **Reply #102 on:** September 07, 2017, 06:10:53 pm »

Quote from: brucehoult on September 07, 2017, 05:35:55 pm

Everyone is entitled to their own opinions, but not their own facts .... I'm the *only* one here who is bothering to establish facts to back up my opinions, by actually compiling and running code -- the results of which have been (at least for this example) that the recursive code is both smaller and faster than the iterative code.

you are not the *only* one in this *world*. take me as example.. is the *only* one that i know in at least 100km radius from where i'm sitting, who actually compiled, runned and in house performance tested code to choose which one will suit my need. long time ago, it is said quicksort (iirc) is the fastest sort, but when i coded, built and tested all, it turned out shellsort (gonnet baeza variant) is the fastest in my particular system and data set, so i still use it until today, no matter what the internet said... having said that... your fact is only applicable to your case only (machine, code and usage), but other people's fact might be different from yours.

NorthGuy · « **Reply #103 on:** September 07, 2017, 06:19:02 pm »

Quote from: brucehoult on September 07, 2017, 05:39:54 pm

That's my code (compiled for RISC-V), not hamster_nz's code. You said " The hamster_nz's code uses "%" operation. The compiler (it's the same GCC everywhere, right?) implements it as a division. In your code, the compiler replaces division with multiplication, which makes hamster_nz's code way slower compared to yours."

Sorry. My bad. I assumed the code posted after the sentence "Your function compiles to 146 bytes of code" would be hamster_nz's code.

brucehoult · « **Reply #104 on:** September 07, 2017, 06:37:23 pm »

Quote from: Mechatrommer on September 07, 2017, 06:10:53 pm

Quote from: brucehoult on September 07, 2017, 05:35:55 pm
Everyone is entitled to their own opinions, but not their own facts .... I'm the *only* one here who is bothering to establish facts to back up my opinions, by actually compiling and running code -- the results of which have been (at least for this example) that the recursive code is both smaller and faster than the iterative code.
you are not the *only* one in this *world*. take me as example.. is the *only* one that i know in at least 100km radius from where i'm sitting, who actually compiled, runned and in house performance tested code to choose which one will suit my need. long time ago, it is said quicksort (iirc) is the fastest sort, but when i coded, built and tested all, it turned out shellsort (gonnet baeza variant) is the fastest in my particular system and data set, so i still use it until today, no matter what the internet said... having said that... your fact is only applicable to your case only (machine, code and usage), but other people's fact might be different from yours.

For this usage, of course.

Sasa said "Not only it [recursion] waste resources, but it is as well extremely slow. Printing on screen or file is anyway slow, but try to convert millions of numbers to memory stream and you will notice difference."

Absolutely correct about printing to screen or file ... the code speed is irrelevant.

So I made a test that did exactly what (s)he suggested: converted 10 million negative and 10 million positive numbers into a memory stream (array). Recursive was faster on everything from ARM A series through to i7.

Alas, you fundamentally can't run this test on a small PIC or AVR because it requires something around a 13 MB memory array.

OK, you could output always to the same small buffer. Pretty unrealistic, but it could be done.

Small PICs are fundamentally unsuited to running C code, and especially recursive code, so I accept that my version would probably suck there. The AVR on the other hand is perfectly fine for C and the recursive version should do well. I don't happen to have an AVR handy in Moscow -- mine are back in NZ. I could try it on the SiFive FE310 RISCV microcontroller board I have (reusing a small buffer, as there is only 16 KB RAM).

I'd expect recursive to be better on *any* ARM, from a 1980s Archimedes on, and certainly on anything from an ARM7TDMI and up.

sasa · « **Reply #105 on:** September 07, 2017, 07:24:56 pm »

Quote from: brucehoult on September 07, 2017, 05:35:55 pm

Everyone is entitled to their own opinions, but not their own facts.

Not opinion, but attitude based on 30+ years experience. And the fact is that one methodology is not suitable in all cases.

At end, the only fair judges are customers. They want accurate, fast and stable application/project, making his main job easier or even possible. If they experiencing slowness, many bugs and crashes - your career is finished. I'm not talking about games or anything trivial, but about sensitive applications - one glitch may cost many people life, or millions Euro lost...

Thus, there is no place for vanity and experimentation - you have to use what is the most suitable, proved (certified) and passed extreme testing before entire project get a green light for commercial use.

Quote

I'm seeing a lot of opinions that recursive code is bigger and or slower than iterative code. I'm the *only* one here who is bothering to establish facts to back up my opinions, by actually compiling and running code -- the results of which have been (at least for this example) that the recursive code is both smaller and faster than the iterative code.

Showing valid examples where recursion is necessary, more appropriate and more clear and no body will claim otherwise. Thus lies its magnificence and strength.

Factorial, Fibonacci numbers, continued fractions, converting integers to string and similar, etc, are just examples where recursion can be used, nothing else. Its iterative solutions are much stable and much faster, in a word much suitable solutions. That is not my opinion, but fact.

Quote

Twice smaller code for a little more stack usage is a good trade-off in most circumstances. Bigger code is sitting there 100% of the time, whether in RAM or flash/ROM. A few bytes of stack usage is temporary, and as long as you have it in the first place is shared with other code running before and afterwards.

No body will persuade me that trivial: code + stack push + recursive call is faster than code +loop jump. It is simply impossible. You are very well aware of that.

However, if we look at any "divide and conquer" sorting algorithm, we will clearly see that recursion is justified and have full sense - as result we have efficient, fast and clear code.

brucehoult · « **Reply #106 on:** September 07, 2017, 07:54:16 pm »

Quote from: sasa on September 07, 2017, 07:24:56 pm

Quote from: brucehoult on September 07, 2017, 05:35:55 pm
Everyone is entitled to their own opinions, but not their own facts.

Not opinion, but attitude based on 30+ years experience. And the fact is that one methodology is not suitable in all cases.

32.75 years *commercial* experience here, plus valuable experience as a student before that.

I certainly agree that one methodology is not suitable in all cases. That's why I choose the most suitable one, as appropriate.

Quote

Quote
I'm seeing a lot of opinions that recursive code is bigger and or slower than iterative code. I'm the *only* one here who is bothering to establish facts to back up my opinions, by actually compiling and running code -- the results of which have been (at least for this example) that the recursive code is both smaller and faster than the iterative code.

Showing valid examples where recursion is necessary, more appropriate and more clear and no body will claim otherwise. Thus lies its magnificence and strength.

Factorial, Fibonacci numbers, continued fractions, converting integers to string and similar, etc, are just examples where recursion can be used, nothing else. Its iterative solutions are much stable and much faster, in a word much suitable solutions. That is not my opinion, but fact.

No, it's opinion. I've demonstrated that the facts are otherwise, at least for converting integers to strings, on a range of machines. The recursive solution is smaller code and faster. If you dispute this, show me smaller and faster iterative code. It shouldn't be that hard. All my cards are on the table.

Quote

Quote
Twice smaller code for a little more stack usage is a good trade-off in most circumstances. Bigger code is sitting there 100% of the time, whether in RAM or flash/ROM. A few bytes of stack usage is temporary, and as long as you have it in the first place is shared with other code running before and afterwards.

No body will persuade me that trivial: [stack push + recursive call] is faster than [loop jump]. It is simply impossible. You are very well aware of that.

The examples I've posted here *demonstrate* the opposite is true. If you're not persuaded by demonstrations that you can easily reproduce yourself -- the essence of the scientific method -- then certainly nothing anyone says will correct your wrong opinions.

But consider this: it is not simply as you paint it. The iterative solutions are not just a loop jump. They have to save data into memory and retrieve it later, just as the recursive solution does. Worse, they require at *least* two loops instead of effectively one.

NorthGuy · « **Reply #107 on:** September 07, 2017, 07:59:28 pm »

Quote from: brucehoult on September 07, 2017, 05:39:54 pm

The ARM compiler uses multiplication for both of our functions. The RISC-V compiler uses division for both of our functions.

I would suggest using this function for speed comparisons:

Code: [Select]

void print_dec(int n) {
  unsigned u, d;
  char ch, *p, buf[12];

  u = n;
  if (n < 0) {
    putchar('-');
    u = -n;
  }
  
  p = buf;
  *p++ = 0;

  while(1) {
    d = u/10;
    *p++ = u - d*10 + '0';
    if (u = d) continue;
    while (ch = *--p) putchar(ch);
    return;
  }
}

IMHO, it is most similar to yours, but doesn't use the recursion.

sasa · « **Reply #108 on:** September 07, 2017, 08:20:33 pm »

Quote from: brucehoult on September 07, 2017, 06:37:23 pm

Sasa said "Not only it [recursion] waste resources, but it is as well extremely slow. Printing on screen or file is anyway slow, but try to convert millions of numbers to memory stream and you will notice difference."

Absolutely correct about printing to screen or file ... the code speed is irrelevant.

So I made a test that did exactly what (s)he suggested: converted 10 million negative and 10 million positive numbers into a memory stream (array). Recursive was faster on everything from ARM A series through to i7.

(It is he, as it is in profile... Anyway...)

How you exactly provide these tests? Did you store random numbers in array then read it from it, used random integer function call before each call to test subject functions, or simply convert two constant 10 mil. times?

As well, testing environment is very important - ensure your main thread have the highest priority and if possible disable all background...

That, makes a big difference.

brucehoult · « **Reply #109 on:** September 07, 2017, 08:28:37 pm »

Quote from: NorthGuy on September 07, 2017, 07:59:28 pm

Quote from: brucehoult on September 07, 2017, 05:39:54 pm
The ARM compiler uses multiplication for both of our functions. The RISC-V compiler uses division for both of our functions.

I would suggest using this function for speed comparisons:

Code: [Select]
void print_dec(int n) { unsigned u, d; char ch, *p, buf[12]; u = n; if (n < 0) { putchar('-'); u = -n; } p = buf; *p++ = 0; while(1) { d = u/10; *p++ = u - d*10 + '0'; if (u = d) continue; while (ch = *--p) putchar(ch); return; } }
IMHO, it is most similar to yours, but doesn't use the recursion.

Strange code. Why not?

Code: [Select]

void print_dec(int n) {
  unsigned u, d;
  char ch, *p, buf[12];

  u = n;
  if (n < 0) {
    putchar('-');
    u = -n;
  }
  
  p = buf;
  *p++ = 0;

  do {
    d = u/10;
    *p++ = u - d*10 + '0';
  } while (u = d);
  while (ch = *--p) putchar(ch);
}

The Thumb2 code is 96 bytes long, the same as hamster_nz's 2nd try (compared to 52 bytes for mine):

Code: [Select]

00000000 <print_dec>:
   0:	b57f      	push	{r0, r1, r2, r3, r4, r5, r6, lr}
   2:	1e04      	subs	r4, r0, #0
   4:	4b14      	ldr	r3, [pc, #80]	; (58 <print_dec+0x58>)
   6:	681a      	ldr	r2, [r3, #0]
   8:	461e      	mov	r6, r3
   a:	9203      	str	r2, [sp, #12]
   c:	da03      	bge.n	16 <print_dec+0x16>
   e:	202d      	movs	r0, #45	; 0x2d
  10:	4264      	negs	r4, r4
  12:	f7ff fffe 	bl	0 <putchar>
  16:	4911      	ldr	r1, [pc, #68]	; (5c <print_dec+0x5c>)
  18:	f10d 0501 	add.w	r5, sp, #1
  1c:	2300      	movs	r3, #0
  1e:	f88d 3000 	strb.w	r3, [sp]
  22:	fba4 2301 	umull	r2, r3, r4, r1
  26:	3430      	adds	r4, #48	; 0x30
  28:	08db      	lsrs	r3, r3, #3
  2a:	eb03 0283 	add.w	r2, r3, r3, lsl #2
  2e:	eba4 0442 	sub.w	r4, r4, r2, lsl #1
  32:	f805 4b01 	strb.w	r4, [r5], #1
  36:	461c      	mov	r4, r3
  38:	2b00      	cmp	r3, #0
  3a:	d1f2      	bne.n	22 <print_dec+0x22>
  3c:	f815 0d01 	ldrb.w	r0, [r5, #-1]!
  40:	b110      	cbz	r0, 8 <putchar+0x8>
  42:	f7ff fffe 	bl	0 <putchar>
  46:	e7f9      	b.n	3c <print_dec+0x3c>
  48:	9a03      	ldr	r2, [sp, #12]
  4a:	6833      	ldr	r3, [r6, #0]
  4c:	429a      	cmp	r2, r3
  4e:	d001      	beq.n	54 <print_dec+0x54>
  50:	f7ff fffe 	bl	0 <__stack_chk_fail>
  54:	b004      	add	sp, #16
  56:	bd70      	pop	{r4, r5, r6, pc}
  58:	00000000 	.word	0x00000000
  5c:	cccccccd 	.word	0xcccccccd

I'll check the speed later when I 'm where my ARM boards are.

brucehoult · « **Reply #110 on:** September 07, 2017, 08:35:32 pm »

Quote from: sasa on September 07, 2017, 08:20:33 pm

Quote from: brucehoult on September 07, 2017, 06:37:23 pm

Sasa said "Not only it [recursion] waste resources, but it is as well extremely slow. Printing on screen or file is anyway slow, but try to convert millions of numbers to memory stream and you will notice difference."

Absolutely correct about printing to screen or file ... the code speed is irrelevant.

So I made a test that did exactly what (s)he suggested: converted 10 million negative and 10 million positive numbers into a memory stream (array). Recursive was faster on everything from ARM A series through to i7.

(It is he, as it is in profile... Anyway...)

How you exactly provide these tests? Did you store random numbers in array then read it from it, used random integer function call before each call to test subject functions, or simply convert two constant 10 mil. times?

I posted the entire test framework source code here, along with the exact compile command used and compiler version. You can duplicate it yourself.

Quote

As well, testing environment is very important - ensure your main thread have the highest priority and if possible disable all background...

That, makes a big difference.

Y'know, my job is working on code performance optimisation for a major electronics company with one of the biggest market shares in phones.

I've got some idea how to compare different versions of code to see which one is best. And it all has to pass code review, often not only internally but also by engineers at Google (e.g. AOSP) or Microsoft (e.g. CoreCLR).

janoc · « **Reply #111 on:** September 07, 2017, 08:38:05 pm »

Quote from: sasa on September 07, 2017, 07:24:56 pm

No body will persuade me that trivial: code + stack push + recursive call is faster than code +loop jump. It is simply impossible. You are very well aware of that.

You are forgetting that many modern compilers will do a tail call optimization and thus there is no stack push! (assuming arguments fit into registers) The entire recursive call is optimized away and replaced by a jump.

And brucehoult has demonstrated that even non tail-recursive function could be faster than iterative version - the overhead of the function call is simply less than the extra code that the non-recursive version needs.

Assumptions are dangerous.

(and excellent work, brucehoult

)

brucehoult · « **Reply #112 on:** September 07, 2017, 09:24:32 pm »

Quote from: NorthGuy on September 07, 2017, 07:59:28 pm

Quote from: brucehoult on September 07, 2017, 05:39:54 pm
The ARM compiler uses multiplication for both of our functions. The RISC-V compiler uses division for both of our functions.

I would suggest using this function for speed comparisons:

Code: [Select]
void print_dec(int n) { unsigned u, d; char ch, *p, buf[12]; u = n; if (n < 0) { putchar('-'); u = -n; } p = buf; *p++ = 0; while(1) { d = u/10; *p++ = u - d*10 + '0'; if (u = d) continue; while (ch = *--p) putchar(ch); return; } }
IMHO, it is most similar to yours, but doesn't use the recursion.

Congratulations! Comes in almost exactly the same as the recursive one.

Yours (median of five runs)

real   0m5.489s
user   0m5.220s
sys   0m0.260s

vs mine (likewise)

real   0m5.472s
user   0m5.170s
sys   0m0.290s

I'd say it's too close to call. We're down to experimental error.

Your compiled code is still 1.85x bigger than mine :-)

sasa · « **Reply #113 on:** September 07, 2017, 09:35:28 pm »

Quote from: brucehoult on September 07, 2017, 08:35:32 pm

I posted the entire test framework source code here, along with the exact compile command used and compiler version. You can duplicate it yourself.
..
Y'know, my job is working on code performance optimisation for a major electronics company with one of the biggest market shares in phones.

OK, I accept the challenge. One part of my job is to make source code optimization for some critical low level functions and algorithms. And I do not rely too much on compiler's optimization, as I use the same code almost unchanged on many platforms, including MCUs. So we will see...

I suggest following:

1. To change function according to C itoc function with one change

char *itoc_contester_version(int n, char *s) {

s - pointer to the first resulting character

Resulting pointer to the null-terminated string, pointing exactly on the last, null-terminator that will help to easy continue storing next converted number

It is assumed result is in decimal base

2. To avoid any compiler/platform specific code, pure C.

BTW. NorthGuy is on the right track - using pointer's arithmetic is the key.

hamster_nz · « **Reply #114 on:** September 07, 2017, 09:44:33 pm »

Quote from: brucehoult on September 07, 2017, 05:35:55 pm

Quote from: sasa on September 07, 2017, 04:55:33 pm
Quote from: brucehoult on September 07, 2017, 03:36:01 pm

You're looking at the trees and missing the forest!
...

There are a lot of people here with assumptions and prejudices about what is fast, what is slow, what is efficient, what is wasteful.

It would be good to check them sometimes :-)

Not sure what you are talking about. My attitude about recursion you can read at end of page 1 of this thread.

With 30+ years of professional experience, everybody have his own attitude about everything, especially in his own profession.

Everyone is entitled to their own opinions, but not their own facts. I'm seeing a lot of opinions that recursive code is bigger and or slower than iterative code. I'm the *only* one here who is bothering to establish facts to back up my opinions, by actually compiling and running code -- the results of which have been (at least for this example) that the recursive code is both smaller and faster than the iterative code.

Twice smaller code for a little more stack usage is a good trade-off in most circumstances. Bigger code is sitting there 100% of the time, whether in RAM or flash/ROM. A few bytes of stack usage is temporary, and as long as you have it in the first place is shared with other code running before and afterwards.

The problem of "optimized" is a multi-dimensional one, code size, stack size, speed, power, target architecture, code maintainability.... the costs you put on these are weighted to favor the recursive solution, and every time somebody 'betters the recursive solution you move the goal posts.

We are talking about stack usage...

The goalpost change "But it is about total memory usage (code AND stack)!"

So we get something that is a dozen bytes more code, but a greater saving in stack, taking in it below the recursive solution

The goalpost move "But it is about speed"

Then the goalpost move "Code is more expensive than stack". If I have an application that only uses a fraction of my ROM, then larger code costs nothing if performance is the key constraint.

Sure, recursion is a workable solution to the problem - it produces the correct output, but it is only optimal for the use case where it is optimal - it won't be the fastest if code size isn't a constraint, or use the least RAM if that is a constraint.

And even as a recursive function it is sub-optimal if you have to print both signed and unsigned values: For example, you could improve performance (measured by code size & speed) by splitting it into two functions, avoiding the sign test for each digit:

Code: [Select]

void print_unsigned(unsigned u) 
    unsigned d = u/10, ch = u - d*10 + '0';
    if (d != 0) print_unsigned(d);
    putchar(ch);
}

void print_signed(int n) {
    unsigned u = n;
    if (n < 0) {
        putchar('-');
        print_unsigned((unsigned)(-n)) 
    } else {
        print_unsigned((unsigned)(n)) 
    }
}

How does recursion go if you want to include leading spaces, to keep it printing 11 characters for each number?

splin · « **Reply #115 on:** September 08, 2017, 01:15:32 am »

For comparison, the assembler version of print_dec() from "The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors" by Joseph Yiu:

Code: [Select]

; PutDecLoop; Function to display Decimal value
;------------------------------------------------
PutDec FUNCTION
; Subroutine to display register value in decimal
; Input R0 = value to be displayed.
; Since it is 32 bit, the maximum number of character
; in decimal format, including null termination is 11
; According AAPCS, a function can change R0-R3, R12
; So we use these registers for processing.
; R0 - Input value
; R1 - Divide value (10)
; R2 - Divide result
; R3 - Remainder
; R12- Text buffer pointer

        PUSH {R4, LR}           ; Save register values
        MOV  R12, SP            ; Copy current Stack Pointer to R12
        SUB  SP, SP, #12        ; Reserved 12 bytes as text buffer
        MOVS R1, #0             ; Null character
        STRB R1,[R12, #-1]!     ; Put a null character at end of text buffer
                                ; buffer,pre-indexed
        MOVS R1, #10            ; Set divide value
PutDecLoop
        UDIV R2, R0, R1         ; R2 = R0 / 10 = divide result
        MUL  R4, R2, R1         ; R4 = R2 * 10 = Value - remainder (multiple of 10)
        SUB  R3, R0, R4         ; R3 = R0 - (R2 * 10) = remainder
        ADDS R3, #'0'           ; convert to ASCII (R3 can only be 0-9)
        STRB R3,[R12, #-1]!     ; Put ascii character in text buffer
                                ; pre-indexed
        MOVS R0, R2             ; Set R0 = Divide result and set Z flag if R2=0
        BNE PutDecLoop          ; If R0(R2) is already 0, then there
                                ; is no more digit
        MOV  R0, R12            ; Put R0 to starting location of text
                                ; buffer
        BL Puts                 ; Display the result using Puts
        ADD  SP, SP, #12        ; Restore stack location
        POP  {R4, PC}           ; Return
ENDFUNC

I reckon that the 17 instructions take 42 bytes and uses 12 bytes of stack +8 to save R4 and LR (but it doesn't handle negative numbers). He could have saved a few more bytes by using R3 rather than R12 and using R0 instead of R3. He could also have saved a few cycles by not saving LR and jumping to Puts() at the end of the function rather than calling it (after restoring registers and SP) - but that is perhaps going a step too far for most...

Ok. the comments could have been a bit better, but I'm sure its not just me who doesn't find that very much harder to read and understand than the C version - but it would be harder to write (by me) not having much experience writing Cortex assembler. This isn't a thread about ARM assembler though so apologies for the diversion.

[EDIT] Detabbed code

Mechatrommer · « **Reply #116 on:** September 08, 2017, 01:55:19 am »

Quote from: janoc on September 07, 2017, 08:38:05 pm

You are forgetting that many modern compilers will do a tail call optimization and thus there is no stack push! (assuming arguments fit into registers) The entire recursive call is optimized away and replaced by a jump.

"assuming"? assuming or "hoping" compiler will do its magic is not a wise thing to do, from a programmer standpoint, imho... what if arguments are not fit in registers? "assuming" will create very minimal threat if you are working on a fixed familiar system everytime... but not when on many different systems...

NorthGuy · « **Reply #117 on:** September 08, 2017, 02:01:24 am »

Quote from: brucehoult on September 07, 2017, 09:24:32 pm

I'd say it's too close to call. We're down to experimental error.

Your compiled code is still 1.85x bigger than mine :-)

So true. But I think the comparison would better reflect the merits of recursion if you compiled it with -fno-stack-protector so that the code wouldn't be affected by stack protection paranoia.

Or, if you insist on stack protection, then you could compile both variants with -fstack-protector-all.

sasa · « **Reply #118 on:** September 08, 2017, 09:22:52 am »

To whom it may concern,

I have briefly made this morning optimized version. I do not have a time currently to write much, some notes are in the code. I hope I did not miss to add someone code. Anyway, can easily added later and cleaned by anyone interested.

Results are made on old Intel Atom, Linux 64-bit, GCC 6.3.0. Clean environment (though, not enough) - only console and calculator, rest is closed, no mouse is moved and no key pressed. The first dummy test is performed, to avoid latency effect on ENTER press during starting application.

@ brucehoult
I did not made and corrections on your original code except inserted pointer to directly write data th the array and add null-terminator. I have added your modified function, similarly hamster_nz also noted.

itoc_ref_1 - Your original code

itoc_ref_2 - I have split signed and unsigned parts of your code and made some corrections - much faster.

itoc_NorthGuy_1 - NorthGuy's iterative function. Fast, however continue and some other code make a problem to compiler to optimized it better

itoc_sasa_1 - My almost top optimized iterative function. Still can be checked some variations which on this or other platforms may work faster.

Attention: Look at your original code (ref_1) and notice that wrong type for ch and implicit conversion may cost you some large percentage of speed!

As notable from tables below, my function is notable faster, (see tables or test if it is hard to believe) when compiler use maximum optimization, in comparison to your original version. That imply that is still possible to optimize this source code for this particular platform and compiler, however I currently I have no time for games....

Difference in size when compiled with maximum optimization is only dozen bytes, IIRC. Try any compiler level optimization and you will probably notice similar. Try it on embedded systems, as well.

I have implemented almost all I have commented before, on your recursive and my own iterative code.

I believe all parallels pro et contra for using some methods and techniques are cleared through this thread.

Default optimization:

Code: [Select]

gcc -o printdec printdec.c -Wall -pedantic-errors
itoc_ref_1      : 4254 ms
itoc_ref_1      : 4255 ms
itoc_ref_1      : 4254 ms
itoc_ref_2      : 4064 ms
itoc_ref_2      : 4062 ms
itoc_ref_2      : 4062 ms
itoc_NorthGuy_1 : 4246 ms
itoc_NorthGuy_1 : 4245 ms
itoc_NorthGuy_1 : 4248 ms
itoc_sasa_1     : 3731 ms
itoc_sasa_1     : 3730 ms
itoc_sasa_1     : 3731 ms

Maximum optimization:

Code: [Select]

gcc -o printdec printdec.c -Wall -pedantic-errors -O
itoc_ref_1      : 3279 ms
itoc_ref_1      : 3265 ms
itoc_ref_1      : 3270 ms
itoc_ref_2      : 2786 ms
itoc_ref_2      : 2784 ms
itoc_ref_2      : 2785 ms
itoc_NorthGuy_1 : 2789 ms
itoc_NorthGuy_1 : 2790 ms
itoc_NorthGuy_1 : 2789 ms
itoc_sasa_1     : 2237 ms
itoc_sasa_1     : 2238 ms
itoc_sasa_1     : 2239 ms

Compilable modified code is attached.

westfw · « **Reply #119 on:** September 08, 2017, 09:35:41 am »

Quote

No body will persuade me that trivial: code + stack push + recursive call is faster than code +loop jump. It is simply impossible.

Well, perhaps more detailed analysis is in order. Here's the meat of the recursive version of decout:

Code: [Select]

void decout(int n) {

{
  int rem = n % 10;   // extract the least significant digit of the number.
  n /= 10;             // remove that digit from the overall number
  if (n > 0) {         // If there are any more digits in the number
    decout(n);         //  Then print them first
  }
  putchar(rem + '0');  // Convert the digit to ascii and print
}

And one of the non-recursive versions (also trimmed):

Code: [Select]

void print_dec_2(int n) {
  do {
    unsigned int t = u/10;
    digits[i++] = u % 10;
    u = t;
  } while(i<10 && u != 0);

  while(i) {
    putchar(digits[--i]+'0');
  }
}

Notice that the recursive version does not contain ANY loops or any explicit stores. Essentially, each loop iteration branch that needs to be done by the non-recursive version is replaced by a call or return (BL or pop of sp) (which takes the same time, probably), and each store/retrieval of a character to the buffer in non-recursive is replaced by a POP (which also takes the same time.) The recursive version may lose a bit of time doing the sign check each time, and the non-recursive version may lose a bit by having to maintain its own counter or pointer, but at some fundamental level they are doing exactly the same thing - N multiplies/divides, M adds/subtracts, O loads and stores, and P branches. As the assembly dumps show, the compiler optimizes the recursive version sufficiently that it does not need any "extra" data saved on the stack, other than the PCs and the extra bytes in each digit.

Quote

We're down to experimental error.

Indeed. Thanks to Bruce for doing the grunt-work. This is not supposed to show that recursion is always "better" in any way than an iterative method; but it DOES show that recursion is not necessarily extremely expensive, either.

Quote

You can usually add saving the base pointer to that stack too

Doesn't happen in this case. Compilers have gotten very good at not creating "stack frames" when they're not needed, and CPU/ABI combinations have gotten pretty good at ensure that simple functions don't usually need them.

Quote

2. You always check value is negative.

I'll give you that one. But this is a micro-optimization...

Quote

1. You always send unsigned value in every call but first time. Internal conversion from signed to unsigned all the time.

signed/unsigned is usually not an actual conversion, but more a signal to the compiler on how to treat condition flags. So these "casts" don't actually add any instructions or execution time.

Quote

3. Every recursion put on stack all variables. You spend more of the memory

As shown above, in this example "all of the variables" consists only of the extracted digit, which you'd have to save somewhere anyway.

Code: [Select]

wrong type for ch and implicit conversion may cost you some large percentage of speed!Show me where in the assembly language, please (see (1) above.) The use of int vs char for the individual digits is an "interesting" optimization - on most 8bit CPUs (maybe including x86), using char would be desirable to save registers and possible compute time in multiply/divide. On most 32bit RISC CPUs (and ARM in particular), using "int" probably avoids un-needed "covert word to byte" instructions.

(There ought to be some "rule" that distinguishes between variables that would have to be saved anyway, and variables that are local (in a stack frame, or in registers that need saving) but not actually part of the "recursive context." Functions that have much more local context than recursive context would be poor candidates for recursion.)

(It's worth noting that the analysis above loses validity on other CPUs, where "call" and "branch" do NOT have the same cost. On bigger AVRs, call and return each take 4 cycles, and branches only 1 or 2. OTOH, an AVR could store each digit in only a single byte on the stack, with the "push" being significantly cheaper than storing to local variables.)

The USUAL excuse for recursion is that it makes the code much simpler, more compact, and easier to understand at the SOURCE CODE level. That's vaguely true in this example ("print the rest of the number, print the last digit" vs "extract all the digits and print them backwards"), and not really true at all for the factorial/fibonacci examples. But it gets real obvious when trying to do trees or divide-and-conquer sort algorithms without recursion. Towers of Hanoi is probably a good example - anyone want to turn this Hanoi code into an iterative version?

Code: [Select]

void hanoifun(int n, char fr, char tr, char ar)//fr=from rod,tr =to rod, ar=aux rod
{
    if (n == 1)
    {
        printf("\n Move disk 1 from rod %c to rod %c", fr, tr);
        return;
    }
    hanoifun(n-1, fr, ar, tr);
    printf("\n Move disk %d from rod %c to rod %c", n, fr, tr);
    hanoifun(n-1, ar, tr, fr);
}

Siwastaja · « **Reply #120 on:** September 08, 2017, 10:24:19 am »

Quote from: brucehoult on September 07, 2017, 05:35:55 pm

Twice smaller code for a little more stack usage is a good trade-off in most circumstances. Bigger code is sitting there 100% of the time, whether in RAM or flash/ROM. A few bytes of stack usage is temporary, and as long as you have it in the first place is shared with other code running before and afterwards.

What I encounter often when doing embedded is that I'm having shitloads of excess flash ROM, sitting there doing nothing, while the RAM is the bottleneck. It's not untypical to have 128k of flash for a 10k program, and only 4k of RAM with about 1k available for stack.

This strange balance is because flash rom is cheap, so why not sell it so that it's definitely not running out; while SRAM is freaking expensive and dominating the MCU price.

Actually, I have never ran out of flash rom in any MCU project, which is a bit surprising, since I've been doing quite a lot of things. But RAM is scarce.

Usually (but not always), the importance ends up in this order in my embedded projects:
- RAM usage
- Speed
- Program memory usage

So, to calculate relevant "figure of merits", I'd introduce some heavy coefficients for the cost of stack bytes vs. the cost of program memory bytes, like 10x.

Your mileage may vary.

brucehoult · « **Reply #121 on:** September 08, 2017, 11:05:18 am »

Quote from: Siwastaja on September 08, 2017, 10:24:19 am

Quote from: brucehoult on September 07, 2017, 05:35:55 pm
Twice smaller code for a little more stack usage is a good trade-off in most circumstances. Bigger code is sitting there 100% of the time, whether in RAM or flash/ROM. A few bytes of stack usage is temporary, and as long as you have it in the first place is shared with other code running before and afterwards.

What I encounter often when doing embedded is that I'm having shitloads of excess flash ROM, sitting there doing nothing, while the RAM is the bottleneck. It's not untypical to have 128k of flash for a 10k program, and only 4k of RAM with about 1k available for stack.

This strange balance is because flash rom is cheap, so why not sell it so that it's definitely not running out; while SRAM is freaking expensive and dominating the MCU price.

Actually, I have never ran out of flash rom in any MCU project, which is a bit surprising, since I've been doing quite a lot of things. But RAM is scarce.

Usually (but not always), the importance ends up in this order in my embedded projects:
- RAM usage
- Speed
- Program memory usage

So, to calculate relevant "figure of merits", I'd introduce some heavy coefficients for the cost of stack bytes vs. the cost of program memory bytes, like 10x.

Your mileage may vary.

I agree with you. But do consider that we're talking about 80 bytes of stack space difference (on an ARM, only 30 on AVR (or 40 on Mega) which is small even when you've only got 1024 bytes to start with.

Also, most programs do some calculations and then print results. That stack space is being used only during the result printing. The calculation part is free to use it. (different story if you want to do debug print from the middle of the calculation -- but that's development time and you may be able to afford one bigger device for dev work).

brucehoult · « **Reply #122 on:** September 08, 2017, 11:26:44 am »

Quote from: westfw on September 08, 2017, 09:35:41 am

The USUAL excuse for recursion is that it makes the code much simpler, more compact, and easier to understand at the SOURCE CODE level. That's vaguely true in this example ("print the rest of the number, print the last digit" vs "extract all the digits and print them backwards"), and not really true at all for the factorial/fibonacci examples. But it gets real obvious when trying to do trees or divide-and-conquer sort algorithms without recursion. Towers of Hanoi is probably a good example - anyone want to turn this Hanoi code into an iterative version?

Code: [Select]
void hanoifun(int n, char fr, char tr, char ar)//fr=from rod,tr =to rod, ar=aux rod { if (n == 1) { printf("\n Move disk 1 from rod %c to rod %c", fr, tr); return; } hanoifun(n-1, fr, ar, tr); printf("\n Move disk %d from rod %c to rod %c", n, fr, tr); hanoifun(n-1, ar, tr, fr); }

I'll reply to the other points later, but in the meantime try this :-) :-) :-)

Code: [Select]

typedef unsigned state;
int cto(state n){return n&1?cto(n>>1)+1:0;}

void hanoifun(int n, char fr, char tr, char ar) { //fr=from rod,tr =to rod, ar=aux rod
    char r[] = {fr,tr,ar};
    for (state s=1; s<((state)1)<<n; ++s)
        printf("\n Move disk %d from rod %c to rod %c",
               cto(~s)+1, r[(s&s-1)%3], r[((s|s-1)+1)%3]);
}

OBVIOUSLY, this is intended to be obfuscated code,not "good" code. But it works, producing byte-for-byte the same output.

NorthGuy · « **Reply #123 on:** September 08, 2017, 01:53:44 pm »

Quote from: sasa on September 08, 2017, 09:22:52 am

To whom it may concern,

Does your C compiler insert the stack protector bloat in all the functions with local array as brucehoult's does? This certainly screws up the results to some extent.

I don't think you can compare the non-recursive routines which can build the output in place with the recursive routines which spit the characters out one by one. You want to make recursive and non-recursive cases as similar as possible. Then your tests will separate the effect of the recursion in the purest form.

brucehoult · « **Reply #124 on:** September 08, 2017, 02:01:06 pm »

Quote from: NorthGuy on September 08, 2017, 02:01:24 am

Quote from: brucehoult on September 07, 2017, 09:24:32 pm
I'd say it's too close to call. We're down to experimental error.

Your compiled code is still 1.85x bigger than mine :-)

So true. But I think the comparison would better reflect the merits of recursion if you compiled it with -fno-stack-protector so that the code wouldn't be affected by stack protection paranoia.

Or, if you insist on stack protection, then you could compile both variants with -fstack-protector-all.

That's true. I picked simple -Os at the start (and -march=armv7 if your compiler doesn't default to that) and didn't want to go mucking with flags halfway through.

I could do that, but then I'd split my recursive code into two functions (one for signed and one for unsigned, with one calling the other) to even things up and get both faster code and two useful functions for the price of one :-)


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Poll

How often have you come across recursion?

Author Topic: Recursion - do you use it in the real world? (Read 42126 times)

Share me