Author Topic: silly question, but .. how do you compare fp values to zero?  (Read 1847 times)


Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3915
  • Country: gb
do you define a small epsilon, say epsilon = 1e-9, and a floating point number(x) is zero if and only if:

Code: [Select]
              is_zero(x) = ( -epsilon < (x) && (x) <= +epsilon )

something like that?


Code: [Select]
| 2.000000 -1.000000 7.000000 6.000000 |
| 0.000000 1.500000 -4.500000 0.000000 |
| 0.000000 0.000000 -0.000000 0.000000 |
solution:
    infinite solutions

I am writing a polymorphic solver for arbitrary linear systems, and with floating-point numbers I need the is_EqualTo_zero operator to handle singular matrices (those with no solution and those with infinitely many solutions).


In this example, there is a cell that contains "-0.000000": a tiny negative residual that is not exactly zero, hence it won't pass the naive "if (p_cell->value == 0.0)" test.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3915
  • Country: gb
Re: silly question, but .. how do you compare fp values to zero?
« Reply #1 on: July 23, 2022, 01:59:15 pm »
(
   Oops: in C language
)
 

Offline golden_labels

  • Super Contributor
  • ***
  • Posts: 1209
  • Country: pl
Re: silly question, but .. how do you compare fp values to zero?
« Reply #2 on: July 23, 2022, 02:36:04 pm »
Floating-point math doesn’t have an equality relation,(1) and the similarity relation(2) depends on the calculation domain. In other words, how you compare results of FP calculations for similarity is bound to what you do, what kind of errors you expect, and what you can tolerate.

If in your calculations 0 can be obtained only through multiplicative operations, then you can safely compare using == and !=. That’s because the only way to obtain a result similar to 0 involves multiplication by a literal 0. I have never heard of a single FP system in which that operation is not equivalent to returning exact 0, and it certainly is in IEEE-754.(3) The same holds for division by ±infinity (if that operation works), and if NaN appears, it will never compare equal to a literal 0.

Otherwise: closeness to 0. That is the solution you proposed, which is more commonly expressed as abs(x) < epsilon. In special cases where the calculations are certain to never end up on one of the sides (negative or positive), it becomes x < epsilon or x > -epsilon.(4) The remaining question is the value of epsilon. That is fully dependent on the scale of your calculations, and there is no golden rule: you just need to choose the parameter properly. In examples people often use arbitrary values, like 1e-9, but that’s not exactly right. If you are operating on scales of 1e+30, that may be way too small. If you are operating with numbers in the 1e-30 range, quite obviously even your inputs would already be considered “zero”. Usually six to ten orders of magnitude below the minimum value of the calculation range, or somewhere near it, is a good initial guess.


(1) Except for “was foo explicitly set to bar” and very specific, precisely controlled, platform-dependent optimizations.
(2) Which is reflexive and symmetric, but — unlike equivalence or equality — not transitive (a trap for many beginners).
(3) Most likely what you’re dealing with.
(4) You want the minus sign on epsilon, because that’s a constant.
« Last Edit: July 23, 2022, 02:46:54 pm by golden_labels »
People imagine AI as T1000. What we got so far is glorified T9.
 
The following users thanked this post: artag, DiTBho

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4034
  • Country: nz
Re: silly question, but .. how do you compare fp values to zero?
« Reply #3 on: July 23, 2022, 02:39:41 pm »
I use the following function:

Code: [Select]
int fp_eq(float reference, float actual, float relErr)
{
  // if near zero, do absolute error instead.
  float absErr = relErr * ((fabsf(reference) > relErr) ? fabsf(reference) : relErr);
  return fabsf(actual - reference) < absErr;
}
 
The following users thanked this post: DiTBho

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3915
  • Country: gb
Re: silly question, but .. how do you compare fp values to zero?
« Reply #4 on: July 23, 2022, 03:11:11 pm »
If in your calculations 0 can be obtained only through multiplicative operations

There are a lot of "multiply and accumulate" and "multiply and subtract" operations during the pivoting loops of the Gauss-reduction function.
That's what causes the epsilon residuals.

The question remaining is the value of epsilon. That is something that is fully dependent on the scale of your calculations and there is no golden rule

Worse still, it is a polymorphic library, and epsilon is a property exclusive to floating point: you don't have this problem with sint128_t matrices. Anyway, I will try to create an autoscale function to get some reference, in order to estimate the best epsilon for each linear problem described by a floating-point matrix.

Thanks  :-+
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6260
  • Country: fi
    • My home page and email address
Re: silly question, but .. how do you compare fp values to zero?
« Reply #5 on: July 24, 2022, 05:28:25 am »
I use -ε ≤ x ≤ ε, or |x| ≤ ε, so that the user can specify an exact zero if they so desire.  In this form, ε is then the largest nonnegative value the user considers insignificant/zero.  So, something along the lines of
Code: [Select]
static inline int  int_zero(int x, ...) { return !x; }
static inline int  long_zero(long x, ...) { return !x; }
static inline int  float_zero(float x, float eps) { return fabsf(x) <= eps; }
static inline int  double_zero(double x, double eps) { return fabs(x) <= eps; }
#define  fp_zero(x, eps) (_Generic((x), float: float_zero, double: double_zero, int: int_zero, long: long_zero)(x, eps))
(where the inline is completely irrelevant to the compiler; it is only my personal preference, to indicate to us humans that this is a helper or accessor function.  Functions marked just static are "plain" internal functions.)

More importantly, I try hard to pick algorithms that are numerically stable.  The second chapter of old Numerical Recipes in C deals with exactly this problem.  A common measure to quantify stability is condition number.

In this case, we have problems of the type Ax=b, where x is a vector with unknown components, b is a vector of the same size with known components, and A is a matrix of coefficients.  If we know the pseudoinverse A⁺ of A (AA⁺A = A), then x=A⁺b+(I-A⁺A)w for an arbitrary vector w.  If A has full column rank, then (I-A⁺A) is a zero matrix and w is a zero vector.

The two numerically most stable approaches to solving this type of problem are singular value decomposition and QR decomposition.  The former finds A=UΣVᵀ, with U and V real orthogonal matrices (so their transposes are their inverses) and Σ having non-negative values on the diagonal, so that A⁺=VΣ⁺Uᵀ (and is trivial to compute).  The latter finds Aᵀ=QR, where Q is an orthogonal matrix (its transpose is its inverse) and R is an upper triangular matrix.  If we use S for the square upper triangular submatrix of R, then we can calculate w=(Sᵀ)⁻¹b directly via forward substitution; then, if we append the correct number of zero elements to w, the solution we seek is x=Qw.

Of these, QR decomposition involves fewer operations, and can be implemented using Givens rotations or Householder transformations.  (While the Wikipedia article describes it in terms of the Gram–Schmidt process, obtaining the orthonormal basis of A in Q, that is not a numerically stable way to do it: the Gram–Schmidt process itself is not numerically stable, and the arbitrary order in which one chooses the vectors greatly affects the result.)

A slight variant of QR, rank-revealing QR algorithm, can obtain the rank of A while doing the decomposition, revealing whether the solution x has free variables or not (and possibly how many).

For this particular problem type – a polymorphic library for solving these – I very much like the QR approach, because it simplifies to a small number of "pseudo-primitive operations" on the matrix, so one can implement them all for all supported types, and obtain a fast and numerically stable implementation.

Apologies for not answering the stated question, and waffling on about the underlying problem!  :-[
(I've recently been reminded that this habit of mine greatly annoys some people.)
« Last Edit: July 24, 2022, 05:30:18 am by Nominal Animal »
 
The following users thanked this post: artag, golden_labels, bill_c, DiTBho

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3915
  • Country: gb
Re: silly question, but .. how do you compare fp values to zero?
« Reply #6 on: July 24, 2022, 09:46:46 am »
The two numerically most stable approaches to solving this type of problem are singular value decomposition and QR decomposition

At the moment I have a full-pivoting Gauss elimination. Not that bad, because it's not too sensitive to the roundoff errors that come from these two computer-arithmetic operations
  • mac: a += b * c   (roundoff errors)
  • msc: a -= b * c   (roundoff errors)
At each computation loop it iteratively finds the greatest-magnitude element for the pivot, so the effect of these roundoff errors tends to be less severe.

The algorithm is also a kind of LU decomposition that transforms the matrix |Ab| of size (n, n+1) into a triangular matrix, so the number of non-zero rows of that matrix gives you the rank of A.

The matrix |Ab| is the matrix A augmented on the right by the vector b of known terms.

I added a trick to fully manage the three cases
  • unique solution
  • no solution
  • infinite solutions

Code: [Select]
+ app.app_session
| 1.000000 1.000000 -1.000000 5.000000 |
| 0.000000 1.000000 -5.000000 8.000000 |
| 0.000000 0.000000 1.000000 -1.000000 |
++ lib_math_matrix_gauss_v1.matrix_gauss_solve
+++ lib_math_matrix_gauss_v1.matrix_gauss_reduce_ref
++++ lib_math_matrix_gauss_v1.matrix_gauss_ref_get
+++ lib_math_matrix_gauss_v1.matrix_gauss_unknowns_get
| 1.000000 1.000000 -1.000000 5.000000 |
| 0.000000 1.000000 -5.000000 8.000000 |
| 0.000000 0.000000 1.000000 -1.000000 |
solution:
| 1.000000 |
| 3.000000 |
| -1.000000 |

Code: [Select]
+ app.app_session
| 1.000000 1.000000 1.000000 2.000000 |
| 0.000000 1.000000 -3.000000 1.000000 |
| 2.000000 1.000000 5.000000 0.000000 |
++ lib_math_matrix_gauss_v1.matrix_gauss_solve
+++ lib_math_matrix_gauss_v1.matrix_gauss_reduce_ref
++++ lib_math_matrix_gauss_v1.matrix_gauss_ref_get
| 2.000000 1.000000 5.000000 0.000000 |
| 0.000000 1.000000 -3.000000 1.000000 |
| 0.000000 0.000000 0.000000 1.500000 |
solution:
       no_solution

Code: [Select]
+ app.app_session
| 1.000000 0.000000 -7.000000 -5.000000 |
| 0.000000 2.000000 -3.000000 1.000000 |
| 0.000000 0.000000 -0.000000 0.000000 |
++ lib_math_matrix_gauss_v1.matrix_gauss_solve
+++ lib_math_matrix_gauss_v1.matrix_gauss_reduce_ref
++++ lib_math_matrix_gauss_v1.matrix_gauss_ref_get
| 1.000000 0.000000 -7.000000 -5.000000 |
| 0.000000 2.000000 -3.000000 1.000000 |
| 0.000000 0.000000 0.000000 0.000000 |
solution:
       infinite solutions
 
The following users thanked this post: artag, Nominal Animal

Offline artag

  • Super Contributor
  • ***
  • Posts: 1070
  • Country: gb
Re: silly question, but .. how do you compare fp values to zero?
« Reply #7 on: July 24, 2022, 09:53:36 am »

Apologies for not answering the stated question, and waffling on about the underlying problem!  :-[
(I've recently been reminded that this habit of mine greatly annoys some people.)

Not everyone. The sidetracks are often the most interesting and educational aspects of a thread. I despise Stack Overflow for deliberately killing them.
 
The following users thanked this post: DiTBho

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3915
  • Country: gb
Re: silly question, but .. how do you compare fp values to zero?
« Reply #8 on: July 24, 2022, 09:56:17 am »
A slight variant of QR, rank-revealing QR algorithm, can obtain the rank of A while doing the decomposition, revealing whether the solution x has free variables or not (and possibly how many).

I think even an LU decomposition can measure the rank. In my case, the iterative approach immediately terminates when it finds the first free variable.

It doesn't measure "how many"; it just ends if rank < n. This is a defect, but I am more interested in the "unique solution" case, if any  :-//
 
The following users thanked this post: Nominal Animal

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3915
  • Country: gb
Re: silly question, but .. how do you compare fp values to zero?
« Reply #9 on: July 24, 2022, 10:07:59 am »
For this particular problem type – a polymorphic library for solving these – I very much like the QR approach, because it simplifies to a small number of "pseudo-primitive operations" on the matrix, so one can implement them all for all supported types, and obtain a fast and numerically stable implementation.

I will look at the QR approach.

Just a note about what I already have.

These are all the pseudo-operators involved:
Code: [Select]
    eval_abs = p_matrix->context.method.eval.abs; /* absolute value */
    eval_clr = p_matrix->context.method.eval.clr; /* set it as zero */
    eval_div = p_matrix->context.method.eval.div; /* division */
    eval_msc = p_matrix->context.method.eval.msc; /* a -= b * c */
    cmp_isgt = p_matrix->context.method.cmp.isgt; /* compare, is it greater than ? */
    cmp_is_0 = p_matrix->context.method.cmp.is_0; /* compare, is it zero ? */
(yup, mac is no longer in use: it got simplified, and now it only uses msc  ;D )


As you can see, there are just a few pseudo-primitive operations on the matrix, and they should already be extendible to complex numbers.

eval.abs ---> use the modulus r: z = r exp(iθ)
cmp.isgt ---> use the modulus r: z = r exp(iθ)


(otherwise, how can you say that a complex number A is greater than a complex number B?)
 

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3915
  • Country: gb
Re: silly question, but .. how do you compare fp values to zero?
« Reply #10 on: July 24, 2022, 10:16:28 am »
Code: [Select]
private void app1()
{
    fp64_t vector0_data[4];
    fp64_t matrix2_data[3][4] = /* unique solution */
    {
        { 1.0, 1.0, -1.0, 5.0  },
        { 0.0, 1.0, -5.0, 8.0  },
        { 0.0, 0.0, 1.0,  -1.0 }
    };
    matrix_t   matrix;
    p_matrix_t p_matrix;
    matrix_t   vector;
    p_matrix_t p_vector;

    p_matrix = get_address(matrix);
    p_vector = get_address(vector);

    matrix_init(p_matrix, 3, 4, matrix2_data, matrix_cell_is_fp64);
    matrix_init(p_vector, 3, 1, vector0_data, matrix_cell_is_fp64);

    matrix_show(p_matrix); /* |Ab| */

    matrix_gauss_solve(p_matrix, p_vector); /* |A| x = b : x=? */

    matrix_show(p_matrix); /* triag(|Ab|) */
    matrix_show(p_vector); /* x */
}

This is what the main program looks like.

matrix_cell_is_fp64 is a callback function: it assigns the pseudo-operators to the matrix context methods. So you need to write a callback like this for every data type you want to use.

Next step: matrix_cell_is_cplxfp64  ;D
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6260
  • Country: fi
    • My home page and email address
Re: silly question, but .. how do you compare fp values to zero?
« Reply #11 on: July 24, 2022, 11:05:37 am »
At the moment I have a full-pivoting-Gauss-elimination. Not that bad, because it's not too much sensitive to the roundoff errors
Right; as I don't know the exact algorithm, I cannot know if it is better or worse than QR wrt. numerical stability, obviously.

The algorithm is also a kind of LU decomposition
Okay, that makes a lot of sense.  LU decomposition is often used to solve Ax=b for x, as it is more efficient (takes fewer multiplications and additions) than QR decomposition.

If you are careful about how you do the permutation (no multiplication, just reordering rows/columns), and you solve the linear system in two steps, first solving Ly=Pb via forward substitution, and then Ux=y for x via back substitution, you should be close to the numerical stability of QR decomposition, I think.

In my case, the iterative approach immediately terminates when it finds the first free variable.
Sure, it is a reasonable approach when solving systems of linear equations (Ax=b for x), I do agree.

I will look at the QR approach.
If you intend to implement linear least squares fitting, then it will be useful.

If your library already has the features you need while using LU decomposition, I wouldn't bother.

Even the zero comparison check being discussed in this thread can be simplified to
    scale = reciprocal(value);
    if (scale == 0.0) value_was_effectively_zero();
where for float and double types,
Code: [Select]
static inline float  float_reciprocal(float v) { float r = 1.0f / v; return isfinite(r) ? r : 0.0f; }
static inline double  double_reciprocal(double v) { double r = 1.0 / v; return isfinite(r) ? r : 0.0; }
i.e., do the key operation on the value, and check whether the operation yields "garbage" (errors or other non-finite/non-numerical results).  This catches both the case where the value is so close to zero that its reciprocal cannot be represented, and the case where the value is so large in magnitude that its reciprocal is effectively zero.  Neither is something you want in matrix A when solving a system of equations, that's for sure.
 
The following users thanked this post: DiTBho

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3915
  • Country: gb
Re: silly question, but .. how do you compare fp values to zero?
« Reply #12 on: July 25, 2022, 06:01:07 pm »
    With Full Pivoting the algorithm does both "row swaps" (equation permutation) and "column swaps" (variable permutation), selecting the largest-magnitude element as the pivot by interchanging rows as well as columns ... it really chooses *any* element from the so-far "unreduced" lower-right submatrix.

    But it has a cost: it needs an "order" vector that keeps track of the order of the unknowns, since when you swap two columns you are also swapping two solution variables; mathematically, this needs to be written as a "permutation matrix" P.

    A = L U P
    L U (P x) = b

    With no permutations P looks like
    100
    010
    001 
« Last Edit: July 27, 2022, 09:32:08 pm by DiTBho »
 

Offline Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11631
  • Country: my
  • reassessing directives...
Re: silly question, but .. how do you compare fp values to zero?
« Reply #13 on: July 25, 2022, 07:56:03 pm »
https://noobtuts.com/cpp/compare-float-values
Code: [Select]
#ifndef epsilon
#define epsilon 1e-32
#endif
or if you are perfectionist...
Code: [Select]
is_zero(x) {
    return isinf(1.0 / x); /* NaN never compares equal to anything, so test for infinity instead */
}
your question is actually imho... a grey area... ymmv.
Nature: Evolution and the Illusion of Randomness (Stephen L. Talbott): Its now indisputable that... organisms “expertise” contextualizes its genome, and its nonsense to say that these powers are under the control of the genome being contextualized - Barbara McClintock
 
The following users thanked this post: DiTBho

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6260
  • Country: fi
    • My home page and email address
Re: silly question, but .. how do you compare fp values to zero?
« Reply #14 on: July 26, 2022, 10:08:18 am »
But it has a cost: it needs an "order" vector that tells about the order of the unknowns, since when you swap two columns you are also swapping two solution-variables, so you need the "order" vector, which mathematically needs to be written as "permutation matrix" P.
Exactly.  In programming, these can be implemented in various ways, with a lookup array for both probably the simplest.  In other words, if [row*rowstride + col*colstride] is the index of the element at row row, column col, then one uses [rowoffset[row] + coloffset[col]] instead, with rowoffset[] initialised to multiples of rowstride and coloffset[] to multiples of colstride, assuming a dense matrix representation.

Code: [Select]
#ifndef epsilon
#define epsilon 1e-32
#endif
No.  You should not suggest a random value to be used as an epsilon.  That is just wrong.

Here are much better suggestions:
    0.000000000000000000000000000000000000000000001f  (smallest positive nonzero float squared)
    0.000000000000000000000000000000000000002938737f  (smallest positive nonzero float with a finite reciprocal)
    0.000000000000000000000026469782f  (smallest positive float with a nonzero square)
    0.000000000000000000054210115f  (smallest positive float with a nonzero square and finite squared reciprocal)

These have at least a basis in IEEE-754 floating-point properties, and are not just "magic" numbers one should use without understanding their meaning or effect.

Something like the last one is very useful when doing e.g. vector algebra in 3D using float, as the squared vector length and 1/(squared vector length) are very commonly used scalar quantities, and they should remain nonzero and finite to be useful.

 
The following users thanked this post: DiTBho

Offline Mechatrommer

  • Super Contributor
  • ***
  • Posts: 11631
  • Country: my
  • reassessing directives...
Re: silly question, but .. how do you compare fp values to zero?
« Reply #15 on: July 26, 2022, 10:48:44 am »
Code: [Select]
#ifndef epsilon
#define epsilon 1e-32
#endif
No.  You should not suggest a random value to be used as an epsilon.  That is just wrong.

Here are much better suggestions:
    0.000000000000000000000000000000000000000000001f  (smallest positive nonzero float squared)
    0.000000000000000000000000000000000000002938737f  (smallest positive nonzero float with a finite reciprocal)
    0.000000000000000000000026469782f  (smallest positive float with a nonzero square)
    0.000000000000000000054210115f  (smallest positive float with a nonzero square and finite squared reciprocal)

These have at least a basis in IEEE-754 floating-point properties, and are not just "magic" numbers one should use without understanding their meaning or effect.
proof? how does it make a difference between 1e-9 (OP), 5e-3 (i linked earlier from an experienced programmer, which i find too large for my taste), 1e-32 (random by me) or the suggested 1e-45 above? they are just value comparisons in the end.

Something like the last one is very useful when doing e.g. vector algebra in 3D using float, as and 1/ (squared vector length and reciprocal of squared vector length, respectively) are very commonly used scalar quantities, and they should remain nonzero and finite to be useful.
and how does the OP know what kind of user will use his polymorphic library? i was suggesting (#ifndef), and so does the link (default value argument), that the end user who has more knowledge about their system should define their own appropriate epsilon value. btw, i only just learned it's called epsilon in this thread; usually i just define it as SMALL_VALUE, or otherwise a BIG_VALUE to limit the calculated range, to whatever i think suitable for my particular program. hence is_zero(x) returns whether x < SMALL_VALUE, and vice versa. semantically they are the same; i just might get away with a terminology violation? ;D (ps: thats how a mechatrommer do it :P)
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4034
  • Country: nz
Re: silly question, but .. how do you compare fp values to zero?
« Reply #16 on: July 26, 2022, 10:49:31 am »
But it has a cost: it needs an "order" vector that tells about the order of the unknowns, since when you swap two columns you are also swapping two solution-variables, so you need the "order" vector, which mathematically needs to be written as "permutation matrix" P.
Exactly.  In programming, these can be implemented in various ways, with a lookup array for both possibly the simplest one.  In other words, if [row*rowstride + col*colstride] is the index to element on row row, column col, then one uses [rowoffset[row] + coloffset[col]] instead, with rowoffset[] initialised to multiples of rowstride, and coloffset[] to multiples of colstride, assuming a dense matrix representation.

Code: [Select]
#ifndef epsilon
#define epsilon 1e-32
#endif
No.  You should not suggest a random value to be used as an epsilon.  That is just wrong.

Here are much better suggestions:
    0.000000000000000000000000000000000000000000001f  (smallest positive nonzero float squared)
    0.000000000000000000000000000000000000002938737f  (smallest positive nonzero float with a finite reciprocal)
    0.000000000000000000000026469782f  (smallest positive float with a nonzero square)
    0.000000000000000000054210115f  (smallest positive float with a nonzero square and finite squared reciprocal)

These have at least a basis in IEEE-754 floating-point properties, and are not just "magic" numbers one should use without understanding their meaning or effect.

Something like the last one is very useful when doing e.g. vector algebra in 3D using float, as and 1/ (squared vector length and reciprocal of squared vector length, respectively) are very commonly used scalar quantities, and they should remain nonzero and finite to be useful.

Uhhh .. let's look at your final suggestion...

The next higher IEEE double precision number after 1.0 is...

1.00000000000000022204460492503

If you are looking for an answer of 1.0 and you get that and subtract them then you get...

0.00000000000000022204460492503

Let's compare that to your suggested eplison:

0.00000000000000022204460492503
0.000000000000000000054210115

The only way you can get your epsilon as an absolute difference is for numbers in the vicinity of 1/5000 or smaller.

You can NEVER get your epsilon as a relative difference e.g. a/b can never give a non-zero number that is 1.000000000000000000054210115 or anywhere near to it, let alone smaller.

I would suggest instead the square root of the smallest relative difference between successive doubles (the next double after 1.0, minus 1), i.e. sqrt(0.00000000000000022204460492503), which is about 1.5x10^-8.
 
The following users thanked this post: DiTBho

Offline DiTBhoTopic starter

  • Super Contributor
  • ***
  • Posts: 3915
  • Country: gb
Re: silly question, but .. how do you compare fp values to zero?
« Reply #17 on: July 26, 2022, 11:27:34 am »
I would suggest instead the square root of the smallest relative difference between successive doubles, i.e. sqrt(0.00000000000000022204460492503), which is about 1.5x10^-8

Yup, I can collect this information during the evaluation loops, so it will be somewhat "adaptive"
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6260
  • Country: fi
    • My home page and email address
Re: silly question, but .. how do you compare fp values to zero?
« Reply #18 on: July 26, 2022, 01:05:09 pm »
proof? how does it make a difference between 1e-9 (OP), 5e-3 (i linked earlier from an experienced programmer, which i find too large for my taste), 1e-32 (random by me) or the suggested 1e-45 above? they are just value comparisons in the end.
As discussed earlier, the epsilon is a value that is supposed to be supplied by the user of the library, because only they know what the numbers refer to.

I even showed explicitly why I use <= instead of < in the comparison: it gives the user an intuitive description of absolute epsilon –– the largest positive value that is insignificant, qualitatively equivalent to zero.  It even works when the user explicitly sets epsilon to zero; hopefully they know what they are doing then.

Uhhh .. let's look at your final suggestion...

The next higher IEEE double precision number after 1.0 is...
Which part of float (which refers explicitly to IEEE 754 Binary32, i.e. single precision) did you miss?

The point in my suggestions was that they have at least some kind of logic in them, and are not just random constants without context.
In other words, the numeric value was not the core of my suggestion; the text was.
 
The following users thanked this post: Mechatrommer, DiTBho

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4034
  • Country: nz
Re: silly question, but .. how do you compare fp values to zero?
« Reply #19 on: July 26, 2022, 01:25:26 pm »
Uhhh .. let's look at your final suggestion...

The next higher IEEE double precision number after 1.0 is...
Which part of float (which refers explicitly to IEEE 754 Binary32, i.e. single precision) did you miss?

Your float-derived epsilon values were orders of magnitude too small to be useful with double. Using them after a series of float calculations would be completely ludicrous, too. For that you *would* want something around 10^-3 to 10^-4 at the end of a series of calculations such as solving a matrix.
 
The following users thanked this post: DiTBho

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6260
  • Country: fi
    • My home page and email address
Re: silly question, but .. how do you compare fp values to zero?
« Reply #20 on: July 26, 2022, 02:20:29 pm »
Using them after a series of Float calculations would be completely ludicrous. For that you *would* want something around 10^-3 to 10^-4 at the end of a series of calculations such as solving a matrix.
The point was exactly that the choice of epsilon is not just a number; it has to have some qualitative meaning to the user.  You, too, are now making the grave error of throwing around magic numbers without explaining why.  (I know why, you know why, most reading this do not know why.)

The 0.000000000000000000054210115f is the smallest one I'd use, and only with single-precision 2D/3D vector algebra, for the exact reasons I listed.  It works absolutely fine for e.g. raytracing using single precision.  (For OpenGL and similar, that have significant limitations in their Z buffer, you may consider a much larger epsilon, exactly because of the Z buffer range limitations.  These involve differentiating numbers close to 1 from exactly 1.)

For single-precision linear algebra, we need to examine the operations performed.  If we assume a square n×n matrix, LU decomposition involves sums of products.  The substitution passes both involve a sum of products.  So, a reasonable first estimate for an absolute epsilon would be the largest positive number that when raised to third or fourth power, yields zero.  For single-precision IEEE-754 floating point, these are 0.0000000000000008881784f ≃ 1e-15 and 0.0000000000051448788f ≃ 5e-12.  However, if we consider their reciprocals (infinite) too, then they are 0.00000000000014323644f ≃ 1e-13 and 0.00000000023283064 ≃ 2e-10 instead.

In practice, domain cancellation in the sums often reduces precision.  (As an example of this, calculate the sum of 2e30, 1e-30, and -2e30.  Unless your summation does something clever like reordering the terms in order of descending magnitude, you'll get a result of zero instead of 1e-30, even when using double precision.)  So, the above must be understood in terms of "when doing these kinds of operations, values smaller than this will produce zero or infinity".
For domain cancellation, a typical "fudge factor" is about one order of decimal magnitude, or three or four bits.  To be precise, it depends on the number of terms in the sums, but also on the variance of the terms in the sums; so a crude first estimate is often used.

A practical estimate for the absolute epsilon would be the largest positive value that when cubed, is still zero; and the reciprocal of the cube, squared, is still infinite.  (You need to examine the various ways of solving Ax=b, condition numbers, and exactly when the calculation may fail, to decide for yourself if that is a reasonable model for the complexity of the operations involved, with the "fudge factor" applied to deal with domain cancellation and similar rounding errors in the addition and subtraction operations.)
That is, it is the largest value that we want to treat as zero when dealing with a calculation that involves the squared reciprocal of its cube.  This would be 0.000052322018 ≃ 5e-5 (still for single-precision floating-point).  Thus, with the fudge factor applied, you end up with around 5e-4.

This is how you reason towards a reasonable absolute epsilon.  You do not just pull it out of your hat, and rely on your authority or experience that it will work.  There should always be a reason you can put in a comment next to where you assign the value to your absolute epsilon in the code.  And I do expect any numerically sane code to have that comment, too.  Otherwise, it is just somebody's best guess.
« Last Edit: July 26, 2022, 02:22:32 pm by Nominal Animal »
 
The following users thanked this post: Mechatrommer, SiliconWizard, DiTBho

