So I decided to compile the code and test it; here are the results on an Intel processor:
// debug
Result: 37.393936155, cycles: 7350 fixed point 16.16
Result: 37.393940, cycles: 133 floating point IEEE 754
// profile
Result: 37.393936155, cycles: 4981
Result: 37.393940, cycles: 81
But it is wrong [grin] - the original poster's problem was "I need to divide 2 long ints up to an arbitrary precision".
The code is a guideline; he can expand it by changing the precision of the integral and fractional parts.
He will also need to change the fractional constant to a 5000... value that fits in the number of bits he needs.
edit: 5000... constant explanation at the end of this post.
Also, use more than one unsigned long to widen the operation, concatenating the accumulator and results.
I wouldn't do it in C because it's inefficient; assembly targeted at the very spartan instruction set that the PIC has will yield better results.
One more thing: to get those results I added my code to an existing program I was working on that uses many threads, which might affect the precision counter, so the real cycle counts could be much lower. I didn't want to do a one-off or make a new project just to test the results.
Now I agree that 16.16 should be enough for anybody (well, I have used 4.68 on an FPGA fractal generator....) - but in my code I can change how many decimal digits I want to extract to an arbitrary number and keep on extracting digits:
...
...
while (place >= -100)
    putchar('0' + divide_extract_next_digit());
...
37.3939393939393939393939393939393939393939393939393939393939393939393939393939393939393939393939393939
Sure, it might be off by one least significant count (as it always rounds down), but it can keep extracting valid digits to the end of time if needed...
As a check - what do you get for 3/400000001? I get
0.00000000749999998125000004687499988281250029296874926757812683105468292236329269409176826477057933807355165481612086295969784260075539349811151625472120936319697659200755851998110370004724074988189812
(as does 'bc')
Very impressive! I will test your code to see what it generates in assembly and how fast it is.
I'm just trying to get people to think at the system level instead of relying on C code generation that makes sense programming-wise; without knowing the underlying hardware, you are not doing yourself a favour by not understanding these concepts. Lots of people shy away from binary math, when in reality it is easier than decimal math: just 2 values per digit instead of 10.
As for converting a fixed-point binary fraction to decimal, it is not hard; the code is above and it works. (Hint: the third loop does that.)
Sorry, I meant doing it precisely was "hard" (as opposed to correct to 4dp), and as I mentioned in my original reply, I define "hard" as similarly difficult to the OP's question.
But that's the thing: it's not hard at all. You place the maximum 500000.... constant that fits into your working memory model (an unsigned long in the sample above), but you could extend it ad infinitum by changing the precision value. For example, if you place it in an unsigned long long (quad word), then
you can fit 5000000000000000000 into it and change the precision to 64, obtaining an accuracy of 2^-64 (5.4210108624275221703311375920553e-20) instead of just 2^-16 - but it all comes down to what you need.
edit: 5000... constant explanation at the end of this post.
Why do a long division in base 10 when the processor is base 2? Base-2 long division is easier too: if the divisor fits into the remainder you put a 1 and shift; if not, you just shift, because the digit is 0.
You could explain to me how easy adding numbers is, but it's not particularly relevant if you end up with something that's of no use to solving the problem at hand. For example, consider two different calculations, one of which works out to 0.10000000000000..., and another that works out to 0.099999999999... You want to print this number to 1 decimal place. Clearly, the first should print 0.1, the second should print 0.0. This is trivial to achieve correctly using base-10 division code.
Now, how do we achieve this with binary division and then decimal conversion second? 0.099999999999 and 0.10000000000000 in binary are:
0.0001100110011001100110011001100110011001100110011...
0.0001100110011001100110011001100110011000100000000...
So you need words and words of binary expansion before you can decide what the very first decimal digit is. One tenth has an infinite recurring binary expansion, never forget this. A contrived example for sure, but one that I think justifies my uneasiness with the approach in general. It's just unnecessarily wrong when a single line using AVR's standard compiler can do this with absolute precision.
You just add them as you would any integers, as long as they are aligned (i.e. both numbers 16.16). The result stays correct because it doesn't matter where the fraction starts, just like in base 10: if you ignore the dot you can add the numbers and place the dot afterwards (figuratively, since there is no dot unless you want to print the result).
And no, I'm not incrementing things by 2^-16 steps; I'm doing a long division to obtain the integer, then the fraction, then converting the fraction to decimal, in three loops with lots of comments so you can follow.
Why have you got numbers like 000015258 in your comments, and nothing like that in the code itself? It's confusing. All I know is that your very own sample output shows an error in the fifth decimal place, so there has to be some sort of approximation going on here. When asked a problem as simple as outputting a division to 6 dp, something that could be done with perfect precision in one line in AVR's standard compiler, just making such approximations is a bit rough, I suspect.
As I mentioned, my code does involve performing divides and getting remainders; if I were to write this code I'd use your divide code. I'd just pull out once I hit the fractional part, and keep the remainder, and start working digit by digit. All I ever need to do in addition to your code is * 10, which is trivial. My code will have longer run-time, of course, but still easily within OP's spec.
It has to do with the precision you work with. Single-precision IEEE 754, for example, has a precision of 2^-24, but you can change the range via the exponent, so the value can be shifted by about 128 places either way.
My example had a precision of 2^-16 (0.0000152587890625), but since I chose an unsigned 32-bit variable, it truncates the precision so it can fit: (.)999984742 in 000015258 increments. I can change the accumulator to a quad word (64 bits unsigned), along with the precision variable, and increase the precision by many orders of magnitude at the cost of longer computations and more resources - so it depends on what you need. As for my constraints, in the text under Caveats I state that this routine as coded will be accurate from 0.0 to 65536.0-0.0001 in 0.0001 steps.
That said, it can be expanded by using more storage to deal with the extra bits, at the cost of the time needed to get the result, by using a quad word as mentioned above.
Being fixed point, the precision is stuck there: no shifting by exponent, unless you want to add that into the function and make it floating point. Shifting in binary is the same as moving the decimal point in decimal. We tend to think in decimal, so we think of shifting as multiplying or dividing by 2, but we are actually moving the fractional point of a number that happens to be base 2.
Thanks for your code, by the way - I will look into it, since it looks like a nice compromise between both methods.
Edit: for clarification, the 500000... constant represents 2^-1 (1/2, or 0.5), so when you shift that constant right it represents 2^-2 (1/4, or 0.25); each further right shift gives you the next negative power of two for the fractional part.