Author Topic: Divide algorithm  (Read 18480 times)

0 Members and 1 Guest are viewing this topic.

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8221
  • Country: 00
Re: Divide algorithm
« Reply #50 on: March 16, 2014, 11:42:31 am »
Quote
If you optimize it by removing all divides and multiplies,

Whether that (removing divides and multipies) is optimization or not depends on the particular chip in question.
================================
https://dannyelectronics.wordpress.com/
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8221
  • Country: 00
Re: Divide algorithm
« Reply #51 on: March 16, 2014, 02:48:34 pm »
Quote
The code you wrote will call the following elegant pieces of code,

When the numerator is 2^N, that piece of code collapses nicely, and can be used to quickly generate tuning words at a given frequency for DDS.
================================
https://dannyelectronics.wordpress.com/
 

Offline miguelvp

  • Super Contributor
  • ***
  • Posts: 5550
  • Country: us
Re: Divide algorithm
« Reply #52 on: March 16, 2014, 05:04:20 pm »
On a increasingly off-topic note, all this made me interested in looking at the assembler for my own project I'm working on using an AVR32 chip. AVR-GCC does optimization for free, and (I'm not an AVR fanboy, honest) I was very impressed by some of the things it did. Here's the C code:

Code: [Select]
uint32_t valu = ...;
valu = ((valu - 10178195) * 71) / 1940 - 20;

Now, for background, the AVR32 can do 32-bit multiply in 1 clock, 32-bit times 32-bit to 64-bit output in 2 clocks, but 32-bit divide takes 17 clocks. So the assembler output, with my commentary added. 8000406c is the highlight for me:

Code: [Select]
The input valu is r8, the output ends up in r11.
r11 = r8 * 71, done as (r8 + r8 << 3) << 3 - r8 for some reason
8000404e: f0 08 00 3b add r11,r8,r8<<0x3 ; *8(r8)+1(r8) = *9(r8)
80004052: a3 7b        lsl r11,0x3        ; *8(r8) = *72(r8)
80004054: 10 1b        sub r11,r8         ; -1(r8) = *71(r8)

I guess the highlight for me is the replacement of 17-clock division with 2-clock 64-bit multiply, keeping only the upper 32 bits of that multiplication (plus some other miscellaneous stuff that takes a few more clock cycles). I guess the multiplicand used is sort of the reciprocal of the denominator? The coolest thing is that the only reason I have that weird formula is to approximate a multiplication by a fractional number (I didn't realise division was so expensive); but it turns out I can just use muls.d and take the upper register and get it done in one hit. Learning assembler tricks from your compiler? Awesome!

I added comments following the code what it's doing for the r8*71.

Also the reason the compiler was able to optimize the division is because you are dividing by a constant. If you had a variable on the denominator there is no much the compiler can do.

It seems like  AVR-GCC is really working hard to optimize the code by looking what are constants and what not, seems it would have been cheaper to use the muls in one clock instead of that shift, add, shift, subs? strange. Might be a power optimization if the clock savings are not that much.

Edit:

The divide is done by multiplying with the inverse of 1940

0.00051546391752577319587628865979381

In binary (fixed point 0. implied at the left)
(.)0000000000 1000 0111 0010 0000 0011 0010 1010 1101
so it's the constant 0x872032AD (2267034285) shifted to the right 10 bits.


Pretty clever compiler, although it will be really dividing by
1939.9999992077755453971883799719
because of the binary limitations to store fractions and the inverse represented is:
2267034285/(1024*4294967296) = 0.00051546391773626965004950761795044
where 2267034285 is the constant the compiler figured out (0x872032AD), 1024 is the 10 bit right shift and 4294967296 is 2^32 needed to change the meaning of the constant to 2^-32.

Reason being for the inaccuracy is that 1940 = 2*2*5*97 and you can't store 1/5 or 1/97 very well in binary format.

Edit again: that said, you won't find a problem since if you divide max unsigned int32 by either value you'll get the same result 2213900 so the precision should not affect 32 bit unsigned operations by themselves.
« Last Edit: March 16, 2014, 06:48:36 pm by miguelvp »
 

Offline dannyf

  • Super Contributor
  • ***
  • Posts: 8221
  • Country: 00
Re: Divide algorithm
« Reply #53 on: March 17, 2014, 12:01:27 am »
Code: [Select]
r11 = r8 * 71, done as (r8 + r8 << 3) << 3 - r8 for some reason
(r8 + r8 << 3) = r8 + 8*r8 = 9*r8

(r8 + r8 << 3) << 3 = 9 * r8 * 8 = 72 * r8

(r8 + r8 << 3) << 3 - r8 = 72 * r8 - r8 = 71 * r8.

This CAN be quite inefficient on some chips.
================================
https://dannyelectronics.wordpress.com/
 

Offline miguelvp

  • Super Contributor
  • ***
  • Posts: 5550
  • Country: us
Re: Divide algorithm
« Reply #54 on: March 17, 2014, 01:47:03 am »
Code: [Select]
r11 = r8 * 71, done as (r8 + r8 << 3) << 3 - r8 for some reason
(r8 + r8 << 3) = r8 + 8*r8 = 9*r8

(r8 + r8 << 3) << 3 = 9 * r8 * 8 = 72 * r8

(r8 + r8 << 3) << 3 - r8 = 72 * r8 - r8 = 71 * r8.

This CAN be quite inefficient on some chips.

I already noted all that on my reply but I disagree on being inefficient on ANY chip. It takes around 3 cycles on his, maybe others will take 4 cycles because they don't have a shift & add instruction.

I haven't seen a single mcu/cpu that can't do shifts, adds and subs fast (even through the carry flag). So I think the inefficiency concern is a bit overstated. Or if there is such a processor it's a pile of bat dung. because shifts adds subtracts and flags are the core of all processors out there.

But his chip can do a mul 32bit in 1 cycle so it's strange the compiler decided to use ~3|4cycles instead of the mul instruction. Then again, the mul instruction might need setup, to load the constant (71), do the multiplication then extract the answer so it's probably the same time and if the ALU doing the mul takes more power, then shifting will be better. And it looks like both approaches will take around 8 bytes of instruction set.

Edit: just setting up to call a function and processing the result will actually burn more program space than doing the shifts add, subs just to burn more cycles with a generic function.

But everything comes at a price, code maintenance if not well documented can be a pain if using "elegant" solutions, might take the next person a long time to ramp up. Then again, it might be a good test if to keep him or not. Clarification "as a programmer that is"

« Last Edit: March 17, 2014, 05:14:13 am by miguelvp »
 

Offline westfw

  • Super Contributor
  • ***
  • Posts: 4196
  • Country: us
Re: Divide algorithm
« Reply #55 on: March 17, 2014, 03:14:54 am »
Quote
There's no "divide long by byte" code,
Yep; this (mixed size arithmetic) is one of the common areas where an assembly language programmer can do better than the C compiler (although, doing the primitive operations in C is another route, I guess.)

As for AVR multiply, it use registers R0/R1, which avr-gcc normally reserves for other purposes (R1 is "known zero"), so the setup/cleanup time for a multiply is relatively significant.  (now, I wonder why R1 is still the "known zero" instead of, say, R15.  It may have originated before the multiply instruction, but nowadays... (all the R0..R15 registers are sort of "second class registers" and don't get used too much by the compiler...))

 


Share me

Digg  Facebook  SlashDot  Delicious  Technorati  Twitter  Google  Yahoo
Smf