Fractions and floating point math?

Fractions and floating point math?
Posted by rexxar on 08 Oct, 2014 03:32
I'm using an ATtiny25 for a project, and since floating point math has to be done in software, invoking it takes up a lot of program space. I need to multiply a value by .625 at one point, but if I try to do that, I run out of memory. However, if I multiply by (5/8), everything goes fine.

I would have thought that 5/8 would get turned into .625 in the background, but it seems it doesn't? I'm a bit confused how the program handles that value without using floating point math. The only thing I can figure is that it's taking 'x * (5 / 8 )' and multiplying x by 5, then dividing by 8, basically just ignoring the parentheses, is that correct?

Sorry if this is a dumb question, but I've not actually studied programming or floating point math.

#1 Reply
Posted by Kremmen on 08 Oct, 2014 05:06
Depends on what exactly you write on the source line. Assuming you wrote it the way you put it in your post, i.e.

x = x * (5/8);

then that is a pure integer operation - no "fractions" involved at all. To the compiler the expression 5/8 is an arithmetical division of 2 integer constants, not a number representation system.
There is such a thing as fixed point representation where you have an implied decimal point and that system can handle fractions as decimal numbers with a more limited range than the floating point system.

#2 Reply
Posted by Dago on 08 Oct, 2014 05:11
It might help if you'd tell more about what you are calculating and how often and what kind of precision you need.

There are quite a lot of "tricks" you can use to avoid floating point calculations such as fixed point arithmetic Kremmen mentioned. It is often also possible to scale values so that you don't need to use fractional number but can use integers or even better; twos exponents. But how and when to use these tricks depends on the application.

#3 Reply
Posted by rs20 on 08 Oct, 2014 05:19
To answer your question directly, if you did this code:

int x = ...;
int y = x * (5/8);

Integer division is used, as hinted at by others. If you perform 5/8 with integer division, you get 0. Therefore, you're calculating y = x * 0. Which simplifies to y = 0. Very fast, but probably not what you wanted :-). Of course, you can do y = (x * 5) / 8, which will give an integer answer within one LSB of what you want (assuming the initial x*5 operation doesn't overflow your int).

#4 Reply
Posted by Psi on 08 Oct, 2014 06:20
Quote from: rexxar on 08 Oct, 2014 03:32
I need to multiply a value by .625 at one point,

Assuming the value your using isn't huge you can.. multiple it up by a fixed amount, do the calc as integer math, then divide it back down.
I'm using the value 123 as an example and a 1024 multiply factor.

// declare some variables
uint8_t value;
uint32_t temp; // note: this variable MUST be large enough to store the number after its been multiplied up, 4billion should do it .
uint8_t answer;

// set an example number
value=123;

// Do the calculation as integer math...
// The value 640 was pre-calculated.. 0.625 * 1024 = 640.
// need to cast so the calculation is done as 32bit
temp = (uint32_t)(value) * 640;

// now we do a fast divide using right shifts to counteract the previous 1024 multiple. (10 shifts gives a 1024 divide)
// This gives us the answer.
answer = (temp >> 10);

Here's the actual math
0.625 * 1024 = 640
123 * 640 = 78720
78720 / 1024 = 76

As long as you double check that all the calculations can fit inside the variables your using then this approach works.
I picked the value 1024 to get all of 0.625 moved past the decimal point.
For 0.625 the multiply has to be larger than 1000 or you will truncating bits off the end.
Using powers of 2 makes the multiple/divide easy to do with shift ops.
Note: This example uses unsigned variables, some mods will be needed if you want to deal with negative numbers.

#5 Reply
Posted by mikerj on 08 Oct, 2014 09:22
0.625 is a nice number because it can be expressed by a small number of binary factors:

0.625 = 2^-1 + 2^-3

This means you can multiply by 0.625 with just two shift operations and an addition:

Code: [Select]
Multiply 'x' by 0.625, result is placed in 'out'. out = x >> 1; out += x >> 3;
This will be pretty fast in this case, and is the simplest to write and understand. However, if you have many terms, and you are using 16 or 32 bit variables on an 8 bit micro then there may be significant savings in execution time to be made by accumulating the shifts:

Code: [Select]
out = temp = x >> 1; temp >>= 2; out += temp;
In this case it won't be worth the effort, and will likely be slower in fact.

I wrote a small Windows utility many years ago that calculates all the binary factors of an arbitrary multiplier/divisor to get a given degree of accuracy (not all fractions are easily expressible as a sum of binary weights), and then prints out all the shift operations. If anyone is interested I'll post it.

#6 Reply
Posted by rs20 on 08 Oct, 2014 09:46
Doing all that shifting probably isn't much more efficient than going (x * 5) / 8, especially since any half-decent compiler ought to optimize the / 8 away (right?). (x * 5) >> 3 if you want to be pedantic, it's one less operation that two shifts and an addition. (Your method will never overflow though, that's a good feature).

#7 Reply
Posted by nuno on 08 Oct, 2014 09:47
I usually do an "optimization" on this calculation

y = (x * 5) / 8

which is, I round it to the nearest integer, which gives better accuracy

y = (x * 5 + 4) / 8 (for x >= 0)

Generalizing,

#define SCALE_BY_FRACTION(x, n, d) ( (x) * (n) + (d) / 2) / (d) )

I leave you the exercise of making the negative/positive input version

#8 Reply
Posted by Psi on 08 Oct, 2014 09:55
Quote from: rs20 on 08 Oct, 2014 09:46
Doing all that shifting probably isn't much more efficient than going (x * 5) / 8, especially since any half-decent compiler ought to optimize the / 8 away (right?).

yeah, but its always good to understand whats happening and why its fast.

#9 Reply
Posted by jpb on 08 Oct, 2014 10:19
Quote from: Psi on 08 Oct, 2014 09:55
Quote from: rs20 on 08 Oct, 2014 09:46
Doing all that shifting probably isn't much more efficient than going (x * 5) / 8, especially since any half-decent compiler ought to optimize the / 8 away (right?).

yeah, but its always good to understand whats happening and why its fast.
Shifting is generally very fast anyway. It is a while since I did any ARM assembler but my memory is that it is able to shift and add in the same instruction or at least in the same number of cycles as adding. Multiplying is generally going to be much slower.

But nowadays the bottleneck is getting in and out of memory and the cost of doing any instructions on the registers themselves is trivial in comparison. I was told in one lecture that whereas in the past we had maths coprocessors like the 8087, now the cost of transferring between the main processor and the coprocessor is such as to negate any gain in having dedicated hardware.

#10 Reply
Posted by macboy on 08 Oct, 2014 16:05
Quote from: rs20 on 08 Oct, 2014 09:46
Doing all that shifting probably isn't much more efficient than going (x * 5) / 8, especially since any half-decent compiler ought to optimize the / 8 away (right?). (x * 5) >> 3 if you want to be pedantic, it's one less operation that two shifts and an addition. (Your method will never overflow though, that's a good feature).
Actually if you really want to be pedantic, realise that one operation (e.g. multiply) does not equate to one instruction. In this case, the ATTiny 25 has no integer multiply instructions, so it relies on software to do it. So your "one less operation" is defeated by a multiple-instruction multiply operation. Bit shifts, OTOH, are single-clock instructions. If you want hardware multiply, drop ATTiny and go ATMega.

#11 Reply
Posted by IanB on 08 Oct, 2014 16:35
Quote from: Psi on 08 Oct, 2014 06:20
Code: [Select]
// Do the calculation as integer math... // The value 640 was pre-calculated.. 0.625 * 1024 = 640. // need to cast so the calculation is done as 32bit temp = (uint32_t)(value) * 640;

A slightly more human readable way of writing this (less "line noise") would be to write:

Code: [Select]
temp = value * 640ul;
The "ul" suffix forces 640 to be an unsigned long constant, which will cause the whole expression to be evaluated as such.

#12 Reply
Posted by Rick Law on 08 Oct, 2014 18:18
First,
While it was pointed out earlier to do the math in 32 bits, but it was not clearly explained. For someone who doesn't yet understand int vs float for (x*5/8), it may be beneficial to explained:
The multiply is done first to preserve as much precision as feasible, but multiply int16 with another number could overflow the int16. So the multiply should be done in 32 bit.

Second,
The following is perhaps easier to understand than shifting and can be generalized. The approach is to do the multiply first (to preserve the precision):
int y=something;
int x = ((y*5L)/8L); // force to do the multiply first so precision is preserved but not rounded
In this version, fraction is preserved until the last divide. The last divide truncates the fraction.

Third, rounding (for positive)

int y=something;
int x = (((y*10L*5L)/8L+5L)/10L;

You can see that the factor is 10*5/8 instead of 5/8. For rounding, we normally add a 0.5. But we are doing it in integer, so 0.5 is not feasible. To get around that, multiply the expression by 10 first, now we can add a 5 to round it to the tens and divide it back down by 10.

Fourth, rounding (for positive or negative)

int y=something;
int x = (((y*10L*5L)/8L+(y<0 ? -5L : 5L))/10L;

You can see the same 10*5/8. But instead of just adding 5L, it does a test first:It adds -5 if y is negative, otherwise it adds +5.

Fifth, something for you to try
The code (and output):

int y1=24094; // a number that overflows on multiply and needs rounding
int y2=-y1; // same number in negative
int result1,result2,result3,result4,result5,result6,result7;
Serial.print(" y1=");Serial.println(y1);
Serial.print(" y2=");Serial.println(y2);
Serial.println("Result when done in float expression:");
Serial.print(" y1*5.0/8.0=");Serial.println((float) y1*5.0/8.0);
Serial.print(" y2*5.0/8.0=");Serial.println((float) y2*5.0/8.0);
Serial.println();
Serial.print(" R1 forcing eval constant first=");Serial.println( result1 = (y1 * ( 5 / 8 )) );
Serial.print(" R2 multiply in int16 first cause overflow=");Serial.println( result2 = (y1 * 5) / 8 );
Serial.print(" R3 multiply in int32 eliminiated overflow=");Serial.println( result3 = ((y1 * 5L) / 8L) );
Serial.println();
Serial.print(" R4 +rounding a + =");Serial.println( result4 = ((y1*10L*5L)/8L + 5L)/10L );
Serial.print(" R5 +rounding a - =");Serial.println( result5 = ((y2*10L*5L)/8L + 5L)/10L );
Serial.print(" R6 +-rounding a + =");Serial.println( result6 = ((y1*10L*5L)/8L + (y1<0 ? -5L : 5L))/10L );
Serial.print(" R7 +-rounding a - =");Serial.println( result7 = ((y2*10L*5L)/8L + (y2<0 ? -5L : 5L))/10L );

You should see that result6 and result7 give you the same results for + and - with rounding. Complicated as that expression looks, it is still many times faster than doing it in float. This is what you should see with the serial output:

y1=24094
y2=-24094
Result when done in float expression:
y1*5.0/8.0=15058.75
y2*5.0/8.0=-15058.75

R1 forcing eval constant first=0
R2 multiply in int16 first cause overflow=-1325
R3 multiply in int32 eliminiated overflow=15058

R4 +rounding a + =15059
R5 +rounding a - =-15058
R6 +-rounding a + =15059
R7 +-rounding a - =-15059

Hope this helps in your understanding and future code development...

Rick

#13 Reply
Posted by mikerj on 08 Oct, 2014 20:45
Quote from: Rick Law on 08 Oct, 2014 18:18
First,
While it was pointed out earlier to do the math in 32 bits, but it was not clearly explained. For someone who doesn't yet understand int vs float for (x*5/8), it may be beneficial to explained:
The multiply is done first to preserve as much precision as feasible, but multiply int16 with another number could overflow the int16. So the multiply should be done in 32 bit.

In general terms yes. On an ATTiny, 32 bit multiplications are a good way of losing large numbers of clock cycles.

#14 Reply
Posted by Rick Law on 09 Oct, 2014 02:07
Quote from: mikerj on 08 Oct, 2014 20:45
Quote from: Rick Law on 08 Oct, 2014 18:18
First,
While it was pointed out earlier to do the math in 32 bits, but it was not clearly explained. For someone who doesn't yet understand int vs float for (x*5/8), it may be beneficial to explained:
The multiply is done first to preserve as much precision as feasible, but multiply int16 with another number could overflow the int16. So the multiply should be done in 32 bit.

In general terms yes. On an ATTiny, 32 bit multiplications are a good way of losing large numbers of clock cycles.

The important part is the OP needs to understand is why multiply must occur first, and then to ascertain if multiply must be done in 32 bit.

As to whether the multiply should be converted to bit-shifting and add, or to incorporate rounding or not: Once the ideas or expressions (shown in my demo code result3,4,5,6,7) are understood, shift vs multiply, rounding or not, etc., are simple trade off decision between speed and code readability.

The "something (code) for you to try" demonstrates unless one is sure the expected range of numbers are small enough to avoid overflow, or is large enough that (y/8) precision lost is okay, multiply must be done first and 32bit multiply must be used.

So, he should have some fruit for thought.

#15 Reply
Posted by mikerj on 09 Oct, 2014 13:22
Quote from: Rick Law on 09 Oct, 2014 02:07
Quote from: mikerj on 08 Oct, 2014 20:45
Quote from: Rick Law on 08 Oct, 2014 18:18
First,
While it was pointed out earlier to do the math in 32 bits, but it was not clearly explained. For someone who doesn't yet understand int vs float for (x*5/8), it may be beneficial to explained:
The multiply is done first to preserve as much precision as feasible, but multiply int16 with another number could overflow the int16. So the multiply should be done in 32 bit.

In general terms yes. On an ATTiny, 32 bit multiplications are a good way of losing large numbers of clock cycles.

The important part is the OP needs to understand is why multiply must occur first, and then to ascertain if multiply must be done in 32 bit.

As to whether the multiply should be converted to bit-shifting and add, or to incorporate rounding or not: Once the ideas or expressions (shown in my demo code result3,4,5,6,7) are understood, shift vs multiply, rounding or not, etc., are simple trade off decision between speed and code readability.

The "something (code) for you to try" demonstrates unless one is sure the expected range of numbers are small enough to avoid overflow, or is large enough that (y/8) precision lost is okay, multiply must be done first and 32bit multiply must be used.

So, he should have some fruit for thought.

You are missing the point, the shift and add solution does not simply replace the multiply by a shift operation in a simple scaling operation. By breaking down the multiplier into a binary polynomial and dealing with each term individually, no scaling is involved.

Not only does this eliminate the multiply itself, it also eliminates the 32 bit intermediate result that the scaling operation produces.

On micros that have a fast 32 bit integer multiply and a barrel shifter, then performing a fractional multiply by scaling and then dividing (using shifts) is almost always the fastest solution. On basic 8 bit micros with no hardware multiply and and only single bit shifts this is usually not the most efficient way. However, it does depend on the number of terms in the binary polynomial you use which is a tradeoff of accuracy and speed (for values that either don't have a finite number of terms or that have a large number of terms).

#16 Reply
Posted by larsdenmark on 09 Oct, 2014 13:56
Quote from: mikerj on 08 Oct, 2014 09:22
0.625 is a nice number because it can be expressed by a small number of binary factors:
Code: [Select]
out = x >> 1; out += x >> 3;

If I try this with x=101 I get out=62. Since 101*0.625 is 63.125 I would expect 63 (or even 64) as the result, but not 62.

I would do:
out = x << 2 + x;
out = out >> 3;
bit it requires that out and x are able to hold the extra bits without overflow.

#17 Reply
Posted by mikerj on 09 Oct, 2014 17:45
Quote from: larsdenmark on 09 Oct, 2014 13:56
Quote from: mikerj on 08 Oct, 2014 09:22
0.625 is a nice number because it can be expressed by a small number of binary factors:
Code: [Select]
out = x >> 1; out += x >> 3;

If I try this with x=101 I get out=62. Since 101*0.625 is 63.125 I would expect 63 (or even 64) as the result, but not 62.

I would do:
out = x << 2 + x;
out = out >> 3;
bit it requires that out and x are able to hold the extra bits without overflow.

Which is scaling and dividing i.e. (x*5) / 8, albeit with a nice small scaling factor and using shifts which likely makes this very suitable for the OP's application.

The basic binary polynomial method is prone to rounding errors since you discard bits when you shift right. However the results are still good enough for a lot of applications (e.g. you couldn't use this on a GPS satellite, but it wouldn't make any difference if an ECU calculated the engine temperature with an error of 0.1C ). You often have to put up with errors when scaling and dividing anyway, if the multiplier doesn't divide exactly into a scaling factor that you can use within your intermediate result size limits.

To remove the rounding errors under all conditions you'd need to accumulate the most significant fractional bit that gets discarded at each stage (i.e. the halves) which is an extra annoyance, but can still work out faster than a multiply.

if( x & 1 ) frac++;
out = x >> 1;
if( x & 4 ) frac++;
out += x >> 3;
out += frac>>1;

#18 Reply
Posted by larsdenmark on 09 Oct, 2014 19:38
Quote from: mikerj on 09 Oct, 2014 17:45
if( x & 1 ) frac++;
out = x >> 1;
if( x & 4 ) frac++;
out += x >> 3;
out += frac>>1;

This looks very nice indeed. I've been looking briefly at the instruction set for the attiny25 and it seems as if its "logical shift right" supports the carry flag, which means that you - using assembler - could avoid the if-statements above and simply accumulate the frac variable based on the value of the carry flag after having shifted to the right. That should be pretty fast.

#19 Reply
Posted by Rick Law on 09 Oct, 2014 22:14
Quote from: mikerj on 09 Oct, 2014 13:22

...

You are missing the point, the shift and add solution does not simply replace the multiply by a shift operation in a simple scaling operation. By breaking down the multiplier into a binary polynomial and dealing with each term individually, no scaling is involved.
...

I might have missed/misunderstood your post, or you have misunderstood mine.

I was referring to doing x*5 by left shift: x*5 = x*4+x = (x<<2)+x and
doing result/8 by right shift: result/8 = result>>3
so x*5/8 = ((x<<2)+x)>>3

This would need 32 bit to ensure no overflow, but a lot easier to the eyes. Of course, the easiest to read is (x*5)/8. Either way, understand the difference between (int) x*5/8 and (float) x*5/8, and understanding why/when error would occur would be good for the OP's coding skill development.

In my opinion, except in rare cases where every tick counts, I would take readability over the very hard to read codes you suggested:
if( x & 1 ) frac++;
out = x >> 1;
if( x & 4 ) frac++;
out += x >> 3;
out += frac>>1;
I hate to debug this; or (for whatever reason) have to to modify the factor from 5/8 to something else. One of the method to increase reliability is to reduce complexity. This is too complex in my book unless in the rare case where every tick counts.

Rick

#20 Reply
Posted by nuno on 09 Oct, 2014 23:31
Quote from: mikerj on 09 Oct, 2014 17:45
if( x & 1 ) frac++;
out = x >> 1;
if( x & 4 ) frac++;
out += x >> 3;
out += frac>>1;

If you have headroom in the datatype's size, I guess you could just add 1/2 of the divisor to x at the start, that would be faster.

#21 Reply
Posted by miguelvp on 10 Oct, 2014 03:56
You are wasting your time, I put some shift multiply and divide algorithms and they were deemed tricks and shortcuts instead of math.
It's hard for decimal people to think what binary math is like and it's just voodoo to them.

I even explained how to work with floating point formats both float and double.

#22 Reply
Posted by IanB on 10 Oct, 2014 04:11
Quote from: nuno on 09 Oct, 2014 23:31
If you have headroom in the datatype's size, I guess you could just add 1/2 of the divisor to x at the start, that would be faster.

Having headroom in the data type's size is always going to be the easiest approach unless the hardware is horribly limited.

For instance, look again at multiplying 101 by 0.625. In binary we are multiplying 1100101 by 0.101.

If we normalize the fraction on the right we have 1.01, so we can multiply by 1.01 and divide by 10 later.

Multiplying by 1.01:

1100101 + 11001 = 1111110

Dividing the result by 10:

1111110 / 10 = 111111 (= 63 decimal).

Compare with the floating point exact result: 101 * 0.625 = 63.125

The overall summary of this thread is that careful use of fixed point arithmetic is nearly always going to perform better than floating point arithmetic on restricted hardware. (For instance, Honeywell process automation systems were designed long ago to do their control calculations using scaled 12 bit fixed point quantities.)