Author Topic: Do you see the error in this code? I sure as hell do not. Cost DM 1200 million  (Read 7573 times)


Offline ez24Topic starter

  • Super Contributor
  • ***
  • Posts: 3082
  • Country: us
  • L.D.A.
FYI this error cost DM 1200 million  (I have no idea what this is in US $ )


strict precondition 1:
    {Set."x"=FLPT and Set."y"=INT16 and -32768 <= x <= +32767}
strict precondition 2:
    {Set."x"=FLPT and Set."y"=INT16 and int(x) in INT16}
strict precondition 3:
    {Set."x"=FLPT and Set."y"=INT16 and int(x) in Set."y"}
program code (convert floating point x to fixed point integer y):
    y := int(x)
postcondition:
    {Set."x"=FLPT and Set."y"=INT16 and y=int(x)}

for more details see:

http://www.cas.mcmaster.ca/%7Ebaber/TechnicalReports/Ariane5/Ariane5.htm

(I hope someone can explain in layman's words what this is  :-DD )
YouTube and Website Electronic Resources ------>  https://www.eevblog.com/forum/other-blog-specific/a/msg1341166/#msg1341166
 

Offline PartialDischarge

  • Super Contributor
  • ***
  • Posts: 1611
  • Country: 00
It seems that the error was in the conversion from floating point to fixed point. One has to be very careful in doing these things; done without a range check it is a conversion that, strictly speaking, doesn't make sense, and it is a very bad programming practice.

I've programmed some communications algorithms (a software PLL with I and Q phases, for example) in assembly, using fixed point instead of C's floating point for execution speed, and the thing is nowhere near trivial.
Most programmers lack knowledge of the secrets behind these two types, especially the good old signed fixed point and its forgotten Qa.b notation.


« Last Edit: July 29, 2016, 09:16:25 am by MasterTech »
 

Offline PartialDischarge

  • Super Contributor
  • ***
  • Posts: 1611
  • Country: 00
I'll give some examples  :)

* 0.1, float(0.1) and double(0.1) are NOT the same numbers

* a fixed point signed 16-bit number (Q0.15) goes from -32768 to +32767, and it is not symmetrical, so you cannot represent with it
an analog quantity of +1 to -1. It only goes from +0.999969 to -1. If this is not taken into account an overflow may occur. Reading the posted link it seems that this could have happened, i.e. if the routine thinks that the highest positive number (+1) is +32768, it will actually represent the largest negative number, -32768

* floating point accuracy is limited!!! See attached code from matlab and the weird result.

* Never compare two floating point numbers for equality. For example, there is a system-defined constant for pi called M_PI with lots and lots of decimals, but this code:

 float x = M_PI;
 if ((x - M_PI) == 0) return 1;
 else return 0;

returns 0, because x holds only the float rounding of the double constant M_PI.  :D






 
« Last Edit: July 29, 2016, 09:25:34 am by MasterTech »
 

Offline Maxlor

  • Frequent Contributor
  • **
  • Posts: 565
  • Country: ch
The function looks correct (although I have never seen that syntax, no idea what it is.) It does exactly what it says it does, under given preconditions.

The problem is with how this function was called, i.e. without checking the preconditions and handling the case where they aren't met. The default handler apparently ended the program, which usually is a reasonable thing to do (and often, the only thing to do) but in this case had some rather explody consequences.

A precondition is a condition, or a check, that must hold (i.e., the check must pass) before a function is called. The function only guarantees its behaviour if the preconditions hold, and if they don't, it might behave in an undocumented way, or crash, or throw an exception.

Similarly, a postcondition is a condition that holds after the function returns. If it breaks, it's usually because there's an error in the function's implementation.

Generally, when writing a function, it's quite possible that there are sets of input parameters for which the function doesn't make sense, and you don't have enough control over the compiler to prohibit those sets of parameters. An example is a square root function with a signature like this: float sqrt(float value); With these types it's not possible in C to limit value to non-negative values, but that is a precondition of the function. (Note: it cannot return complex values, since the return type is specified as float as well.) When sqrt(-42) is called, the precondition is violated, which is clearly a programming error outside the sqrt() function. The reasonable thing to do here is to abort() or throw an exception so the error can be easily identified and fixed. Or maybe, if this behaviour is documented, return some special value like 0; although such behaviour is likely more trouble than it's worth in the long run, since it hides errors, making them much harder to track down.


With safety-conscious code, it's not uncommon to see preconditions and postconditions written down as code, and for the compiler to generate explicit checks for them, so that errors can be found as quickly as possible. But even if no code is generated to check them, all functions can be thought of as having pre- and postconditions. Sometimes they're documented, sometimes they're merely implied.
 
The following users thanked this post: Kilrah

Online Dr. Frank

  • Super Contributor
  • ***
  • Posts: 2384
  • Country: de
FYI this error cost DM 1200 million  (I have no idea what this is in US $ )

program code (convert floating point x to fixed point integer y):
    y := int(x)

(I hope someone can explain in layman's words what this is  :-DD )

Well, DM = Deutsche Mark (the German currency before the European Union's Euro).
That is about 600 million Euro, or about 700 million US$ as of today.

The description is very cryptic.

The critical code simply is
    y:= int(x)
which is not faulty in itself, only in the context that x is a real (floating point) number and y is an integer.

Such assignments are logically dangerous, as it has to be checked beforehand whether the value of x fits the range accepted by the integer function int(). In this case, it would have to be in the range [-32768 ... +32767], determined by y (16-bit signed integer).
Otherwise, a runtime error (defined by the compiler) might occur, or y may be filled with garbage.

That's a freshman's programming error, and bad SW testing.

It reminds me of the Mars Climate Orbiter crash, where two different SW teams mixed imperial and metric units.

Frank 

 
 
 

Offline Cervisia

  • Regular Contributor
  • *
  • Posts: 83
  • Country: 00
That web page does not do a good job of explaining the error.

See https://en.wikipedia.org/wiki/Cluster_(spacecraft).
 
The following users thanked this post: XFDDesign

Offline bktemp

  • Super Contributor
  • ***
  • Posts: 1616
  • Country: de
What would be the best solution for such a problem? Clipping the value to the nearest possible value (-32768 or 32767)?
Shutting down the whole computer sounds like the worst solution to me.
 

Offline Maxlor

  • Frequent Contributor
  • **
  • Posts: 565
  • Country: ch
What would be the best solution for such a problem? Clipping the value to the nearest possible value (-32768 or 32767)?
Shutting down the whole computer sounds like the worst solution to me.
The error is external to the function, so it should be handled externally. If it isn't, one might run into trouble no matter what the function does, be it crash, return 0, return the clamped value, do modulo arithmetic, throw an exception... it's all wrong.

If you find yourself writing code like this, throw an exception if your language supports that, return an error code if your signature allows for that (or a return value signifying an error, like NaN), otherwise just abort() (in other words, crash.)
 

Offline suicidaleggroll

  • Super Contributor
  • ***
  • Posts: 1453
  • Country: us
What would be the best solution for such a problem? Clipping the value to the nearest possible value (-32768 or 32767)?
Shutting down the whole computer sounds like the worst solution to me.

When the inputs to a function are out of bounds, there is a problem upstream from the function.  It's not the function's fault, but how the function responds is very important.  Far and away the worst thing you can do is proceed with the calculation anyway and return a value that seems correct, but is wrong.  As long as you don't do that, the "correct" behavior depends on the situation.  Usually returning an error code would be the right approach, but if the system doesn't support that, generally the best you can do is crash.  Clipping the input to the rail might be the right approach, but it is VERY situation-specific, and should not be the default response, as that can make things 10x worse in the wrong situation.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Ada throws an exception if the integer overflows.
 

Offline Kalvin

  • Super Contributor
  • ***
  • Posts: 2145
  • Country: fi
  • Embedded SW/HW.
Ada throws an exception if the integer overflows.
Yes. But what happens if the exception is not handled properly? The software will probably crash and be restarted, or it will hang until the watchdog restarts it.
 

Offline bktemp

  • Super Contributor
  • ***
  • Posts: 1616
  • Country: de
So the code shown here was fine, and the actual problem was either the missing input range checking or the wrong exception handling?
 

Offline Kalvin

  • Super Contributor
  • ***
  • Posts: 2145
  • Country: fi
  • Embedded SW/HW.
So the code shown here was fine, and the actual problem was either the missing input range checking or the wrong exception handling?

The input range checking was there, and the resulting exception was handled so that the computer was shut down.

Here is a copy of the report: https://www.ima.umn.edu/~arnold/disasters/ariane5rep.html

Searching for the term "exception" will eventually lead to the following paragraphs:

Quote
Although the source of the Operand Error has been identified, this in itself did not cause the mission to fail. The specification of the exception-handling mechanism also contributed to the failure. In the event of any kind of exception, the system specification stated that: the failure should be indicated on the databus, the failure context should be stored in an EEPROM memory (which was recovered and read out for Ariane 501), and finally, the SRI processor should be shut down.

It was the decision to cease the processor operation which finally proved fatal. Restart is not feasible since attitude is too difficult to re-calculate after a processor shutdown; therefore the Inertial Reference System becomes useless. The reason behind this drastic action lies in the culture within the Ariane programme of only addressing random hardware failures. From this point of view exception - or error - handling mechanisms are designed for a random hardware failure which can quite rationally be handled by a backup system.

...

An underlying theme in the development of Ariane 5 is the bias towards the mitigation of random failure. The supplier of the SRI was only following the specification given to it, which stipulated that in the event of any detected exception the processor was to be stopped. The exception which occurred was not due to random failure but a design error. The exception was detected, but inappropriately handled because the view had been taken that software should be considered correct until it is shown to be at fault. The Board has reason to believe that this view is also accepted in other areas of Ariane 5 software design. The Board is in favour of the opposite view, that software should be assumed to be faulty until applying the currently accepted best practice methods can demonstrate that it is correct.

This means that critical software - in the sense that failure of the software puts the mission at risk - must be identified at a very detailed level, that exceptional behaviour must be confined, and that a reasonable back-up policy must take software failures into account.
 

Offline Len

  • Frequent Contributor
  • **
  • Posts: 547
  • Country: ca
So the code shown here was fine, and the actual problem was either the missing input range checking or the wrong exception handling?

In short, yes, the cause was bad exception handling and unexpected input values.
More details are in the report of the Ariane 501 Inquiry Board, available here:
http://esamultimedia.esa.int/docs/esa-x-1819eng.pdf
DIY Eurorack Synth: https://lenp.net/synth/
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
Ada throws an exception if the integer overflows.
Yes. But what happens if the exception is not handled properly? The software will probably crash and will be restarted, or it will hangup until the watchdog will restart it.

At least the Ada programmer would have had an opportunity to consider the exception.  Maybe there is no solution, maybe there is.
I don't know much about Ada but I do know that it is being used in critical systems where failure isn't an option.

Furthermore, the programmer could have defined a data type for the 'sensor' output that limited its range, in effect sanitizing the reading.  Still, the reading must have been an exception so maybe sanitizing it is the wrong thing to do.

In the version of Ada that I am playing with, I can't get within 64 of 2^31 before it throws an exception.  The value 2^31-65 is the largest positive value that converts without an exception.  I don't know why that is...

In any event, Ada may not be capable of running on 16 bit machines so it might not have been an option.

 

Offline Kilrah

  • Supporter
  • ****
  • Posts: 1852
  • Country: ch
At least the Ada programmer would have had an opportunity to consider the exception.  Maybe there is no solution, maybe there is.
I don't see why.
There was error handling, and it did what people told it to because they didn't think it through correctly; the language is irrelevant. Obviously that case was never encountered at a time when "the programmer would have had an opportunity to consider the exception", because if the system had crashed during testing they'd for sure have investigated as well.
 

Offline Kalvin

  • Super Contributor
  • ***
  • Posts: 2145
  • Country: fi
  • Embedded SW/HW.
In the version of Ada that I am playing with, I can't get within 64 of 2^31 before it throws an exception.  The value 2^31-65 is the largest positive value that converts without an exception.  I don't know why that is...

In any event, Ada may not be capable of running on 16 bit machines so it might not have been an option.

The single precision IEEE floating point number has a 24-bit mantissa and an 8-bit exponent. I guess that 2**31 - 65 is the maximum integer whose single precision representation still fits in the 31 bits.

The Ada GNAT compiler is available even for the 8-bit Atmel ATmega328 etc. microcontrollers, so a 16-bit machine should not be a problem.
 

Offline CatalinaWOW

  • Super Contributor
  • ***
  • Posts: 5231
  • Country: us
Answers on this thread show a problem common to many programmers.  The "solution" is to throw an exception, halt the code or something similar.  Meanwhile, in the real world you are a few meters in the air, with hot flames and thousands of tons of fuel and oxidizer.

The universal first step in writing good code is to understand what the system is trying to accomplish.  Languages, data types, A/D converters, and other constructs are then tools to accomplish that.  The entire chain must be remembered.  The complexity of software often leads people to consider SW in isolation, thinking in terms of language purity, or ignoring hardware interfaces.  I can't count the times that I have been in the lab integrating some complex system and received a dumb look from a programmer when asked "What format did you send to device X? Was it BCD, 2's complement, big endian, little endian, and so on?"  The problem isn't so much that they didn't know the answer to the question, but that they didn't even know a question was there.
 
The following users thanked this post: janoc

Offline bktemp

  • Super Contributor
  • ***
  • Posts: 1616
  • Country: de
The report is an interesting read.
It looks like the programmers of the individual sub-systems did exactly what was requested. Maybe they even got information about the ranges of the input data, so they skipped the range checking. The same applies to the exception handler: it was only designed to handle random hardware errors, because the software was considered correct. :wtf:
So the people writing the requirements obviously missed some facts, leading to faulty software development.
 
The following users thanked this post: Kilrah

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9890
  • Country: us
In the version of Ada that I am playing with, I can't get within 64 of 2^31 before it throws an exception.  The value 2^31-65 is the largest positive value that converts without an exception.  I don't know why that is...

In any event, Ada may not be capable of running on 16 bit machines so it might not have been an option.

The single precision IEEE floating point number has 24-bit mantissa and 8-bit exponent. I guess that the 2**31 - 65 is the maximum integer presentation of the 24-bit single precision IEEE floating point normalized number which will fit in the 31 bits.

The Ada GNAT compiler is available even for the 8-bit Atmel Atmega328 etc. microcontrollers, so the 16-bits should not be a problem.

I hadn't even thought to try GNAT with an Atmega.  Interesting!  I am looking at it in the context of ARM processors but at the moment, I am just messing around in Linux.  I'm just using gedit or nano and the command line.  I have tried a couple of "hello world" kinds of things on one of my ARM boards and it seems to work fine.


Exception handling for embedded processors is a little bizarre.  What to do?  It's not like there are people around to intercede.  In the nuclear business, measurements are generally taken by 3 independent means.  Actions are taken on best-out-of-three.  But even there, the default solution may be to shut down the reaction.  Not a satisfactory answer for something that flies.
 

Offline rs20

  • Super Contributor
  • ***
  • Posts: 2318
  • Country: au
The exception handling strategy did not cause this failure. The mistaken precondition, and the impossible attempt to squish a value into an int16 that couldn't fit, did.

I don't understand why people are saying "the exception wasn't handled properly". The code was incapable of handling the large function parameter; the code was fatally flawed. What the hell are you supposed to do if your function comes back to you and says "my inputs are out of bounds, I cannot tell you where to point the rocket"? If all three computers are concurring on that conclusion, then you're just screwed. The handler can't say "oh, didn't need that number anyway, let's just soldier on then"*. Come on. Assuming that only one computer has had a hardware fault, and disabling that processor so that the remaining two can (hypothetically) continue to fly the rocket, may have been an incorrect assumption in this case, but it's the correct approach when there is a hardware failure. Handling a failure of all three computers is almost impossible by definition unless someone can foresee the particular failure (in which case, why not just fix the software?), and there is no way to handle the exception properly either, so that assumption didn't cause the failure.

Also, as a note of context, the software had flown successfully in the past, and the assumption that the preconditions were true did hold on past flights. An upgraded rocket generated larger numbers (higher lateral accelerations, from memory?), and they didn't catch the fact that the precondition was no longer satisfied. To be fair, this is a bit of a subtle case -- hardware evolving to generate conditions that software, designed for an earlier version of the hardware, could not have expected -- although I'd hope they had some sort of integration testing with virtual sensors and real rocket hardware to do a mock flight each time.

* Now technically, they could have had a fallback navigation system to defer to, written by a totally different parallel team, etc, etc. But if that's what you're suggesting, be explicit about it because that is a huge undertaking to prepare that defense across all software, even for a space agency.
« Last Edit: July 29, 2016, 07:50:42 pm by rs20 »
 

Offline Marco

  • Super Contributor
  • ***
  • Posts: 6721
  • Country: nl
The primary problem I see is running with assertions on in production code on both the primary and backup.

What's the point? Just try to keep going, the self destruct is autonomous any way ... the worst that can happen is self destruct, the best that can happen is that it manages to keep flying.

The handler can't say "oh, didn't need that number anyway, let's just soldier on then"*. Come on.

It can and it should; maybe some code down the line has a sanity check to prevent bogus input, maybe after the next course correction the error will not trigger again, maybe a lot of things. A "may not self destruct" abort is better than a "will self destruct" abort.

The bug is simply a bug, but this is a fundamental flaw.
« Last Edit: July 29, 2016, 08:21:50 pm by Marco »
 

Online Kjelt

  • Super Contributor
  • ***
  • Posts: 6460
  • Country: nl
The one and only mistake was reusing software from an old product in a new product without decent testing; this is the result: the first product (batch) fails.
 
The following users thanked this post: Kilrah

Offline suicidaleggroll

  • Super Contributor
  • ***
  • Posts: 1453
  • Country: us
The bug is simply a bug, but this is a fundamental flaw.

But it's not a bug, the software is doing exactly what it's supposed to do.  The problem is the person who decided to load software designed for a smaller rocket onto a bigger one without checking the acceptable input ranges first.
 

Offline Marco

  • Super Contributor
  • ***
  • Posts: 6721
  • Country: nl
Meh.

Doesn't remove the fact that a "may not self destruct" abort is better than a "will self destruct" abort.
 

