I don't see how recursion is "natural" here.
Because you took a trivial example of an iteration over an array/list. Try something a bit more involved, such as searching for a given file in a directory hierarchy.
You can either futz with the stack by hand (in each directory push all subdirectories onto stack, compare files, if not found start popping subdirectories from the stack and repeating the procedure for each of them until the stack is empty) or you write a simple recursive function where you compare the files (the "first elements") and if the file isn't found you call the same procedure on each of the subdirectories (the "rest") ...
I dare to say that the recursive solution will be much shorter and more readable, plus you don't need to implement a stack data structure.
Same if you want to build an expression evaluator (e.g. for a calculator). You can do it with stack, but unless you are building an RPN calculator, it is not a particularly natural fit - the infix notation is more naturally represented as a tree and tree traversal is easier done recursively again.
There's no reason to use recursion where it is not needed (e.g. where you necessarily want to use tail recursion because you feel it's cool). Similarly, there's no reason not to use recursion where the recursion helps (e.g. no reason to avoid recursion because you're scared of how dangerous it is). This is just as with everything else - forget the rules and use your brain cells instead.
Sure, no dispute with that. I am merely pointing out that recursion is far from an useless concept that has no use in embedded programming.
Hell if you have a test case you can even make sure it doesn't shit the stack. LLVM at least has a half decent optimiser that can actually handle this well (unlike GCC which is a hairy testicle).
GCC had tail call optimization from before LLVM even existed. I am not going to claim to know which one performs better, but the feature is there since a long time.
You C guys really need to go on a mental excursion and write some Haskell, Scheme or Erlang or something and come back having licked the toad of glory even if you never touch it again.
This. I have started with imperative programming (Basic, Z80 assembly, Pascal, C/C++ ...) as most people but getting exposed to the ideas behind languages like Lisp or (more recently) Haskell certainly did make me a much better programmer. Not because I am trying to write Lisp/Haskell in every project but simply because it teaches techniques that are super useful even when writing assembly code or C. For example pure functions, writing code that doesn't do side effects -> super helpful when dealing with multithreaded code (if a code doesn't have internal state/side effects it is implicitly thread safe). Such code also tends to be easier to maintain, because its effects are easier to test and verify.
Or that tail recursion - it is usually taught as a concept with Scheme or Lisp but it is very useful even in C. Or declarative programming in the style of Prolog - again a concept that is very useful even if you never touch Prolog in your life.
Also to note, the stack and stack frames are a compromise in the machine model which happens to leak into the language abstraction, not a required part of the machine or the language. There doesn't have to be turtles all the way down until they topple with a stack overflow; there can be one well defined turtle. Some embedded systems guys may recognise this if they have dirty assembly fingers. There was another thread on that where all the C programmers whinged about goto (or JMP as it used to be called)
Yep. On the other hand, stack is pretty essential for implementing many things, such as function calls or interrupts. You
can do without, but it is usually not fun. So it helps having support for it.
I don't see how recursion is "natural" here.
Because you took a trivial example of an iteration over an array/list. Try something a bit more involved, such as searching for a given file in a directory hierarchy.
But searching the directory is not the case where the recursion cannot be removed through the tail call optimization. Here recursion greatly simplifies things.
At the same time, nearly any case where the recursion can be removed through the end call optimization will be more or less trivial. So, not much need for recursion in such cases.
This is exactly the distinction I wanted to make here:
Back to the recursion.
Also to note, the stack and stack frames are a compromise in the machine model which happens to leak into the language abstraction, not a required part of the machine or the language. There doesn't have to be turtles all the way down until they topple with a stack overflow; there can be one well defined turtle. Some embedded systems guys may recognise this if they have dirty assembly fingers. There was another thread on that where all the C programmers whinged about goto (or JMP as it used to be called)
Yep. On the other hand, stack is pretty essential for implementing many things, such as function calls or interrupts. You can do without, but it is usually not fun. So it helps having support for it.
As an experiment, I wrote a code generator (compiler is a bit too advanced a word!) for the pic16 architecture around 2000 that used coroutines for multi tasking and multi threading. It had a bastardisation of basic and pascal as a language. I got bored of it and never finished it but the approach is definitely viable and not terribly inflexible. If you have a complex system, you will have hundreds of coroutines though which makes a debugging nightmare. Synchronisation was a bitch as well.
Quote from: brucehoult on Yesterday at 11:11:51 PM>
Quote from: westfw on Yesterday at 09:17:57 PMI used to like a recusive decimal output function, even on moderately small systems.
Between the array and the code for the extra loop the recursive version is simpler and cleaner and the memory use is little different.
Actually, the memory use is a little depressing on most modern processors. The "temporary array" would require N bytes for an N digit number. A typical 32bit CPU will required at least 8*N bytes of stack space: word-wide pushes of each digit plus word-wide pushes of the PC for each recursive call. And it's harder to add formatting to the recursive version.
But it is cute and simple for basic conversion. (OTTH a typical 32bit CPU will usually have several KBytes (or much more) of stack, so...)
Actually, the memory use is a little depressing on most modern processors. The "temporary array" would require N bytes for an N digit number. A typical 32bit CPU will required at least 8*N bytes of stack space: word-wide pushes of each digit plus word-wide pushes of the PC for each recursive call. And it's harder to add formatting to the recursive version.
But it is cute and simple for basic conversion. (OTTH a typical 32bit CPU will usually have several KBytes (or much more) of stack, so...)
You can usually add saving the base pointer to that stack too, so make that 12*N - so that is more memory for one level pf recursion than it takes to buffer the the entire output.
More instructions to make the new stack frame, more time....
Suddenly
void print_dec(int n) {
int i;
unsigned int u;
char digits[10];
if(n < 0) {
u = -n;
putchar('-');
} else {
u = n;
}
for(i = 0; i < 10; i++) {
digits[9-i] = u%10+'0';
u/=10;
}
for(i = 0; i < 9; i++) {
if(digits[i] != '0')
break;
}
for(; i < 10; i++) {
putchar(digits[i]);
}
}
isn't looking so terrible.
Actually, the memory use is a little depressing on most modern processors. The "temporary array" would require N bytes for an N digit number. A typical 32bit CPU will required at least 8*N bytes of stack space: word-wide pushes of each digit plus word-wide pushes of the PC for each recursive call. And it's harder to add formatting to the recursive version.
But it is cute and simple for basic conversion. (OTTH a typical 32bit CPU will usually have several KBytes (or much more) of stack, so...)
You can usually add saving the base pointer to that stack too, so make that 12*N - so that is more memory for one level pf recursion than it takes to buffer the the entire output.
More instructions to make the new stack frame, more time....
Suddenly
void print_dec(int n) {
int i;
unsigned int u;
char digits[10];
if(n < 0) {
u = -n;
putchar('-');
} else {
u = n;
}
for(i = 0; i < 10; i++) {
digits[9-i] = u%10+'0';
u/=10;
}
for(i = 0; i < 9; i++) {
if(digits[i] != '0')
break;
}
for(; i < 10; i++) {
putchar(digits[i]);
}
}
isn't looking so terrible.
Let's try, shall we?
Using arm-linux-gnueabihf-gcc 5.4.0 20160609 compiling for ARMv7 Thumb2 with -Os (what I almost always use for everything) your code compiles to 140 bytes of code and uses 20 bytes of stack, for 160 total.
My recursive code (below) compiles to 52 bytes of code and uses 8 bytes of stack per digit printed, for a maximum of 80 bytes of stack. That's 132 bytes total.
My code+stack is less than your code! I need 60 bytes more of stack than you do.
Incidentally, both versions happen to use one 2-byte NOP for alignment...
I've tested both versions against printf on x86 for all 2^32 input values (takes about 5 min). All ok.
void print_dec(int n) {
unsigned u = n;
if (n < 0) {
putchar('-');
u = -n;
}
unsigned d = u/10, ch = u - d*10 + '0';
if (d != 0) print_dec(d);
putchar(ch);
}
00000000 <print_dec>:
0: b510 push {r4, lr}
2: 1e04 subs r4, r0, #0
4: da03 bge.n e <print_dec+0xe>
6: 202d movs r0, #45 ; 0x2d
8: 4264 negs r4, r4
a: f7ff fffe bl 0 <putchar>
e: 4a08 ldr r2, [pc, #32] ; (30 <print_dec+0x30>)
10: fba4 2302 umull r2, r3, r4, r2
14: 3430 adds r4, #48 ; 0x30
16: 08d8 lsrs r0, r3, #3
18: 230a movs r3, #10
1a: fb03 4410 mls r4, r3, r0, r4
1e: b108 cbz r0, 6 <print_dec+0x6>
20: f7ff fffe bl 0 <print_dec>
24: 4620 mov r0, r4
26: e8bd 4010 ldmia.w sp!, {r4, lr}
2a: f7ff bffe b.w 0 <putchar>
2e: bf00 nop
30: cccccccd .word 0xcccccccd
Just for yuks I tried the same code on 32 bit RISCV. It compiles to 66 bytes of code. Unfortunately, the ABI says the stack pointer must always be 16 byte aligned, so it uses 16 bytes per level or 160 maximum even though it's storing less data than that. So my worst case total is 226 bytes of code+stack.
Using riscv32-unknown-elf-gcc 6.1.0 with -Os
Your function compiles to 146 bytes of code, and uses 32 bytes of stack, total 178 bytes.
00010222 <print_dec>:
10222: 1141 addi sp,sp,-16
10224: c422 sw s0,8(sp)
10226: c226 sw s1,4(sp)
10228: c606 sw ra,12(sp)
1022a: 842a mv s0,a0
1022c: 00055a63 bgez a0,10240 <print_dec+0x1e>
10230: 8581a503 lw a0,-1960(gp) # 1b910 <_impure_ptr>
10234: 02d00593 li a1,45
10238: 40800433 neg s0,s0
1023c: 4510 lw a2,8(a0)
1023e: 3f55 jal 101f2 <__sputc_r>
10240: 45a9 li a1,10
10242: 02b45533 divu a0,s0,a1
10246: 02b47433 remu s0,s0,a1
1024a: 03040413 addi s0,s0,48
1024e: c111 beqz a0,10252 <print_dec+0x30>
10250: 3fc9 jal 10222 <print_dec>
10252: 8581a503 lw a0,-1960(gp) # 1b910 <_impure_ptr>
10256: 85a2 mv a1,s0
10258: 40b2 lw ra,12(sp)
1025a: 4422 lw s0,8(sp)
1025c: 4492 lw s1,4(sp)
1025e: 4510 lw a2,8(a0)
10260: 0141 addi sp,sp,16
10262: bf41 j 101f2 <__sputc_r>
One reason for the 14 bytes more of code is this toolchain inlined putchar to putc, which needs to load stdout from a global and then pass it. I didn't look closely enough to figure it out, but I think that might also be the reason for saving three registers not just two (not that it makes any difference with mandated 16 byte alignment)
Let's try, shall we?
If you are not adverse to 'compact & unreadable' code this is most likely the same size as the recursive code, and most likely quicker too - but fudging ugly:
void print_dec_2(int n) {
unsigned int u = n, i = 0;
char digits[10];
if(n < 0) {
u = -u;
putchar('-');
}
do {
unsigned int t = u/10;
digits[i++] = u-t*10;
u = t;
} while(i<10 && u != 0);
while(i) {
putchar(digits[--i]+'0');
}
}
It could be made smaller by removing the "i<10 &&" test, which traps overflows if ints need greater than 10 characters - e.g. a 64 bit int - but better safe than sorry.
Let's try, shall we?
If you are not adverse to 'compact & unreadable' code this is most likely the same size as the recursive code, and most likely quicker too - but fudging ugly:
void print_dec_2(int n) {
unsigned int u = n, i = 0;
char digits[10];
if(n < 0) {
u = -u;
putchar('-');
}
do {
unsigned int t = u/10;
digits[i++] = u-t*10;
u = t;
} while(i<10 && u != 0);
while(i) {
putchar(digits[--i]+'0');
}
}
It could be made smaller by removing the "i<10 &&" test, which traps overflows if ints need greater than 10 characters - e.g. a 64 bit int - but better safe than sorry.
Congratulations, that's better at 96 bytes of code only 44 more than the recursive version. I took out the i<10 test as that's really better done at compile time with a test on sizeof(int)*CHAR_BIT or UINT_MAX or some such.
I do agree that's less easily understandable than the recursive version, but for me at least it's much more quickly understandable than your first version!
00000000 <print_dec_2>:
0: b57f push {r0, r1, r2, r3, r4, r5, r6, lr}
2: 1e04 subs r4, r0, #0
4: 4b14 ldr r3, [pc, #80] ; (58 <print_dec_2+0x58>)
6: 681a ldr r2, [r3, #0]
8: 461e mov r6, r3
a: 9203 str r2, [sp, #12]
c: da03 bge.n 16 <print_dec_2+0x16>
e: 202d movs r0, #45 ; 0x2d
10: 4264 negs r4, r4
12: f7ff fffe bl 0 <putchar>
16: 4811 ldr r0, [pc, #68] ; (5c <print_dec_2+0x5c>)
18: 2500 movs r5, #0
1a: fba4 2300 umull r2, r3, r4, r0
1e: 3501 adds r5, #1
20: eb0d 0205 add.w r2, sp, r5
24: 08db lsrs r3, r3, #3
26: eb03 0183 add.w r1, r3, r3, lsl #2
2a: eba4 0441 sub.w r4, r4, r1, lsl #1
2e: f802 4c01 strb.w r4, [r2, #-1]
32: 461c mov r4, r3
34: 2b00 cmp r3, #0
36: d1f0 bne.n 1a <print_dec_2+0x1a>
38: 3d01 subs r5, #1
3a: f81d 0005 ldrb.w r0, [sp, r5]
3e: 3030 adds r0, #48 ; 0x30
40: f7ff fffe bl 0 <putchar>
44: 2d00 cmp r5, #0
46: d1f7 bne.n 38 <print_dec_2+0x38>
48: 9a03 ldr r2, [sp, #12]
4a: 6833 ldr r3, [r6, #0]
4c: 429a cmp r2, r3
4e: d001 beq.n 54 <print_dec_2+0x54>
50: f7ff fffe bl 0 <__stack_chk_fail>
54: b004 add sp, #16
56: bd70 pop {r4, r5, r6, pc}
58: 00000000 .word 0x00000000
5c: cccccccd .word 0xcccccccd
My recursive code (below) compiles to 52 bytes of code and uses 8 bytes of stack per digit printed, for a maximum of 80 bytes of stack. That's 132 bytes total.
void print_dec(int n) {
unsigned u = n;
if (n < 0) {
putchar('-');
u = -n;
}
unsigned d = u/10, ch = u - d*10 + '0';
if (d != 0) print_dec(d);
putchar(ch);
}
...
This is as well proper example why should not force recursion even that is "naturally" obvious.
1. You always send unsigned value in every call but first time. Internal conversion from signed to unsigned all the time.
2. You always check value is negative. As known from point 1, that is possible only once. Lost time for useless check.
3. Every recursion put on stack all variables. You spend more of the memory
Not only it waste resources, but it is as well extremely slow. Printing on screen or file is anyway slow, but try to convert millions of numbers to memory stream and you will notice difference.
Connected example: "Print reversed entered text and check it is a palindrome. Naturally, everybody will put a finger on last letter and start to copy each letter from right to the left.
No recursion! It is the same principle with resulted string from converting number to string, thus recursion is only more complicated thing.
Challenge 2 (MCU related): "Assume there is no memory for additional temporary string nor stack for recursion. Convert integer and send result through serial port."
I will immediately reject every youngster application for the job if fail to see solution in 1 second. As well, the best for him and any company is that burn his diploma out...
My recursive code (below) compiles to 52 bytes of code and uses 8 bytes of stack per digit printed, for a maximum of 80 bytes of stack. That's 132 bytes total.
void print_dec(int n) {
unsigned u = n;
if (n < 0) {
putchar('-');
u = -n;
}
unsigned d = u/10, ch = u - d*10 + '0';
if (d != 0) print_dec(d);
putchar(ch);
}
...
This is as well proper example why should not force recursion even that is "naturally" obvious.
1. You always send unsigned value in every call but first time. Internal conversion from signed to unsigned all the time.
2. You always check value is negative. As known from point 1, that is possible only once. Lost time for useless check.
3. Every recursion put on stack all variables. You spend more of the memory
Not only it waste resources, but it is as well extremely slow. Printing on screen or file is anyway slow, but try to convert millions of numbers to memory stream and you will notice difference.
You're looking at the trees and missing the forest!
I'm fully aware of your three points. They make no difference. Try it!
#define N 10000000
char buf[N*24]; // more than enough
char *p = buf;
__attribute__ ((noinline))
void myputchar(char ch) {
*p++ = ch;
}
void print_dec(int n) {
unsigned u = n;
if (n < 0) {
myputchar('-');
u = -n;
}
unsigned d = u/10, ch = u - d*10 + '0';
if (d != 0) print_dec(d);
myputchar(ch);
}
int main() {
for (int i=-N; i<=N; ++i) print_dec(i);
return 0;
}
On a Raspberry Pi 3, compiled gcc -std=c99 -Os -mthumb -march=armv7
real 0m5.996s
user 0m5.680s
sys 0m0.310s
With hamster_nz's most recent code substituted:
real 0m6.312s
user 0m6.020s
sys 0m0.290s
Whose code is faster?
There are a lot of people here with assumptions and prejudices about what is fast, what is slow, what is efficient, what is wasteful.
It would be good to check them sometimes :-)
Btw, on an Odroid XU4 (2.0 GHz ARM A15):
hamster_nz's iterative code:
real 0m2.197s
user 0m1.860s
sys 0m0.325s
My recursive code:
real 0m2.128s
user 0m1.770s
sys 0m0.355s
Closer, but the recursive is faster.
On an i7 3770:
hamster_nz's iterative code:
real 0m0.898s
user 0m0.876s
sys 0m0.020s
My recursive code:
real 0m0.856s
user 0m0.848s
sys 0m0.008s
Close, but again recursive has the edge.
hmmm 4 pages about recursive function... here's my take... recursion is a tool. if you think you need it, then you use it. if you think you dont, then dont forbid others from using it... if you are taboo about recursion, you can replace recursive function with few (2, 3 or 4) functions that can emulate recursion... but by using the recursive tool you need to be aware of the pros and cons...
recursive function pros:
1) simpler code, hence managing and understanding it will be alot easier, hence less prone to logic/semantics error. hence in accordance to kiss principle.
cons:
1) slower execution due to multiple self function calls (can be hundreds of recursive function calls) which in effect will execute housekeeping job a normal program did when jumping (calling) another (modular) function, such as preparing stack and pointers to arguments etc.
2) require deep stack memory space to store return points and local variables. a big no no to limited resource mcu..
with non-recursive functions, the pro and cons is the other way around except con #1 because they can still call hundreds of non-recursive function calls. more difficult to maintain, less readable etc, but requiring close to nothing stack space.
anyway. i cared to reply just because i see missing option in the polls. i want to vote... use recursion depending on the need or when appropriate, there should be no definable time limit where you should use or not use recursion. but its not in the poll list... recursion is a tool, understand it and learn its drawback is the way to go. speculating without really understanding it is a complete waste of time, much like what usually happened in religious discussions, you need to understand your enemy to love or hate them in the right way. imho...
Whose code is faster?
There are a lot of people here with assumptions and prejudices about what is fast, what is slow, what is efficient, what is wasteful.
It would be good to check them sometimes :-)
To be fair, your test does not really test what you want to test. The hamster_nz's code uses "%" operation. The compiler (it's the same GCC everywhere, right?) implements it as a division. In your code, the compiler replaces division with multiplication, which makes hamster_nz's code way slower compared to yours.
If you want a fair comparison, take your code, convert it into non-recursive version without touching anything else, then do the tests.
The hamster_nz's code uses "%" operation. The compiler (it's the same GCC everywhere, right?) implements it as a division. In your code, the compiler replaces division with multiplication, which makes hamster_nz's code way slower compared to yours.
I just thought it is worth noting that, at some levels of optimization, GCC will do strength reduction on expensive operations with constants (e.g. '% 8' ==> ' & 0x7'), but it clearly hasn't here (I'm assuming, I haven't bothered to read the assembler).
2) require deep stack memory space to store return points and local variables. a big no no to limited resource mcu..
This is not necessarily true.
Take for example, PIC16 - 16-level deep hardware call stack, very small memory. Seems like it would be stupid to use recursion. XC8 doesn't even support recursion with default settings.
But, on the other hand, you have very small memory. How much data can you really fit into 1024 bytes? When your data is so small, its processing is unlikely to require huge recursion depth. May be you'll need 5-7 levels. Compared to this, the 16-level deep stack looks quite adequate.
your code compiles to 140 bytes of code and uses 20 bytes of stack, for 160 total.
My recursive code (below) compiles to 52 bytes of code and uses 8 bytes of stack per digit printed, for a maximum of 80 bytes of stack...
I need 60 bytes more of stack than you do.
Or to put it another way, you need
4 times more stack than he does. But this is a fairly trivial example. How does it scale up? Who's stack will blow out first?
Most MCUs have a lot more ROM than RAM, and many have even more limited stack space. But hey, just because this is the Microcontroller & FPGA forum doesn't mean...
On an i7 3770:
hamster_nz's iterative code:
real 0m0.898s
user 0m0.876s
sys 0m0.020s
My recursive code:
real 0m0.856s
user 0m0.848s
sys 0m0.008s
Close, but again recursive has the edge.
... that you can't try to win the argument by throwing an i7 at it!
You're looking at the trees and missing the forest!
...
There are a lot of people here with assumptions and prejudices about what is fast, what is slow, what is efficient, what is wasteful.
It would be good to check them sometimes :-)
Not sure what you are talking about. My attitude about recursion you can read at end of page 1 of this thread.
With 30+ years of professional experience, everybody have his own attitude about everything, especially in his own profession. But I'm the liberal one in using different methods and different solutions in different situations, as long it is the most suitable way - I do not force either as the best a priori motivated by some subjective reasons.
If this is only an exercise to prove that everything can be transformed to recursion, yes, it can. But the question is (looking the whole picture) is it the most suitable solution?
2) require deep stack memory space to store return points and local variables. a big no no to limited resource mcu..
This is not necessarily true.
Take for example, PIC16 - 16-level deep hardware call stack, very small memory. Seems like it would be stupid to use recursion. XC8 doesn't even support recursion with default settings.
But, on the other hand, you have very small memory. How much data can you really fit into 1024 bytes? When your data is so small, its processing is unlikely to require huge recursion depth. May be you'll need 5-7 levels. Compared to this, the 16-level deep stack looks quite adequate.
if your recursive function doesnt store any local variables (i cant think of any recursive function that doesnt need local/temporary variable(s)), then 16 deep calls (give or take) is feasible, but if the function store say 2 variables, then (16/3) deep calls is feasible, ymmv. unless of course if a compiler can do a nice trick of saving local variables in heap memory. yes my statement is not entirely true but i'm talking what is generally recursion is used for. i dont think i will have a need for recursion in 16 deep stack processor, even if i do, recursive function that i usually used will not be runable in such tiny system. more like only for few led blink and some simple arithmetics. recursion is usually used in more complex data structures / algorithm afaik, other than simplicity, will be also because of performance reason. if you only have 1024 items in a list (which as you said pic16 can maximally hold on to), i dont think linear search or O(n^2) bubblesort on 1-16MHz processor will give much impact on performance, so recursive binary search/ O(N.log(N)) sort will not be necessary imho, but... ymmv...
Whose code is faster?
There are a lot of people here with assumptions and prejudices about what is fast, what is slow, what is efficient, what is wasteful.
It would be good to check them sometimes :-)
To be fair, your test does not really test what you want to test. The hamster_nz's code uses "%" operation. The compiler (it's the same GCC everywhere, right?) implements it as a division. In your code, the compiler replaces division with multiplication, which makes hamster_nz's code way slower compared to yours.
If you want a fair comparison, take your code, convert it into non-recursive version without touching anything else, then do the tests.
Here we have yet another assumption confidently stated, without data to back it up. And, again, incorrectly.
Here is hamster_nz's code, compiled (you really could do this yourself in about 30 seconds, you know...):
00010418 <print_dec>:
10418: b57f push {r0, r1, r2, r3, r4, r5, r6, lr}
1041a: 1e05 subs r5, r0, #0
1041c: da03 bge.n 10426 <print_dec+0xe>
1041e: 202d movs r0, #45 ; 0x2d
10420: 426d negs r5, r5
10422: f7ff fff1 bl 10408 <myputchar>
10426: 4a0f ldr r2, [pc, #60] ; (10464 <print_dec+0x4c>)
10428: ae01 add r6, sp, #4
1042a: 2400 movs r4, #0
1042c: fba5 0102 umull r0, r1, r5, r2
10430: 3401 adds r4, #1
10432: 2c0a cmp r4, #10
10434: ea4f 03d1 mov.w r3, r1, lsr #3
10438: eb06 0104 add.w r1, r6, r4
1043c: eb03 0083 add.w r0, r3, r3, lsl #2
10440: eba5 0540 sub.w r5, r5, r0, lsl #1
10444: f801 5c01 strb.w r5, [r1, #-1]
10448: d002 beq.n 10450 <print_dec+0x38>
1044a: 461d mov r5, r3
1044c: 2b00 cmp r3, #0
1044e: d1ed bne.n 1042c <print_dec+0x14>
10450: 3c01 subs r4, #1
10452: 5d30 ldrb r0, [r6, r4]
10454: 3030 adds r0, #48 ; 0x30
10456: b2c0 uxtb r0, r0
10458: f7ff ffd6 bl 10408 <myputchar>
1045c: 2c00 cmp r4, #0
1045e: d1f7 bne.n 10450 <print_dec+0x38>
10460: b004 add sp, #16
10462: bd70 pop {r4, r5, r6, pc}
10464: cccccccd .word 0xcccccccd
Where is the division? There is none. The compiler implements divide/mod by 10 by multiplying by 0xcccccccd, exactly the same as in my code.
You're looking at the trees and missing the forest!
...
There are a lot of people here with assumptions and prejudices about what is fast, what is slow, what is efficient, what is wasteful.
It would be good to check them sometimes :-)
Not sure what you are talking about. My attitude about recursion you can read at end of page 1 of this thread.
With 30+ years of professional experience, everybody have his own attitude about everything, especially in his own profession.
Everyone is entitled to their own opinions, but not their own facts. I'm seeing a lot of opinions that recursive code is bigger and or slower than iterative code. I'm the *only* one here who is bothering to establish facts to back up my opinions, by actually compiling and running code -- the results of which have been (at least for this example) that the recursive code is both smaller and faster than the iterative code.
Twice smaller code for a little more stack usage is a good trade-off in most circumstances. Bigger code is sitting there 100% of the time, whether in RAM or flash/ROM. A few bytes of stack usage is temporary, and as long as you have it in the first place is shared with other code running before and afterwards.
Where is the division?
here:
Your function compiles to 146 bytes of code, and uses 32 bytes of stack, total 178 bytes.
00010222 <print_dec>:
10222: 1141 addi sp,sp,-16
10224: c422 sw s0,8(sp)
10226: c226 sw s1,4(sp)
10228: c606 sw ra,12(sp)
1022a: 842a mv s0,a0
1022c: 00055a63 bgez a0,10240 <print_dec+0x1e>
10230: 8581a503 lw a0,-1960(gp) # 1b910 <_impure_ptr>
10234: 02d00593 li a1,45
10238: 40800433 neg s0,s0
1023c: 4510 lw a2,8(a0)
1023e: 3f55 jal 101f2 <__sputc_r>
10240: 45a9 li a1,10
10242: 02b45533 divu a0,s0,a1
10246: 02b47433 remu s0,s0,a1
1024a: 03040413 addi s0,s0,48
1024e: c111 beqz a0,10252 <print_dec+0x30>
10250: 3fc9 jal 10222 <print_dec>
10252: 8581a503 lw a0,-1960(gp) # 1b910 <_impure_ptr>
10256: 85a2 mv a1,s0
10258: 40b2 lw ra,12(sp)
1025a: 4422 lw s0,8(sp)
1025c: 4492 lw s1,4(sp)
1025e: 4510 lw a2,8(a0)
10260: 0141 addi sp,sp,16
10262: bf41 j 101f2 <__sputc_r>
That's my code (compiled for RISC-V), not hamster_nz's code. You said " The hamster_nz's code uses "%" operation. The compiler (it's the same GCC everywhere, right?) implements it as a division. In your code, the compiler replaces division with multiplication, which makes hamster_nz's code way slower compared to yours."
The ARM compiler uses multiplication for both of our functions. The RISC-V compiler uses division for both of our functions.