Author Topic: UK NATS Outage - August 2023 - Report  (Read 1384 times)


Offline AndyBeez (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 856
  • Country: nu
UK NATS Outage - August 2023 - Report
« on: September 06, 2023, 02:47:32 pm »
On 28th August 2023, a major outage of a component system of the United Kingdom’s air traffic control environment brought chaos to British airports. Unable to file flight plans, airlines were forced to reschedule, postpone or cancel flights. The knock-on effect left aircraft and aircrews in the wrong place, with disruption lasting for days afterwards. Airlines have already paid out millions of pounds in compensation. This outage was not the result of a cyber attack but, instead, a killer combination of garbage data in and an unexpected exception out.

So what happened? In summary, a flight plan for an aircraft transiting UK airspace contained duplicate waypoints. Unable to cope with this ambiguity, the system for administering UK flight plan data showed the ‘fault’ light, the backup kicked in, showed another 'fault' light, and everything stopped.

Surprisingly, waypoints [or aircraft navigation points] are not required to have unique names; rather, the UK’s Flight Plan Reception Suite Automated (FPRSA-R) system, on detecting such a discrepancy, was designed to divert the flight plan to a real human for clarification. Unfortunately, when a real ambiguous flight plan was filed, by the French no less, the software “threw an exception.” I'm sure someone suggested turning it off and then on again, but the damage to UK air travel had already been done. The NATS CEO subsequently claimed this was a "one in 15 million event", which is about the same odds as being struck by lightning.
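For anyone who wants the "divert to a human" idea in concrete form, here is a minimal Python sketch. It is not the FPRSA-R logic, which is not public in this thread; `FlightPlan`, `Triage`, `duplicate_names` and `triage` are hypothetical names, and the waypoint strings are made up. It only shows the principle: an ambiguous plan is parked with a reason, and the rest of the queue carries on.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FlightPlan:
    # Hypothetical minimal flight plan: an ID plus an ordered list of waypoint names.
    plan_id: str
    waypoints: list[str]


@dataclass
class Triage:
    # Plans that can be processed automatically, and plans parked for a human.
    accepted: list[FlightPlan] = field(default_factory=list)
    manual_review: list[tuple[FlightPlan, str]] = field(default_factory=list)


def duplicate_names(plan: FlightPlan) -> list[str]:
    """Return waypoint names that occur more than once, in first-seen order."""
    counts = Counter(plan.waypoints)
    return [name for name in dict.fromkeys(plan.waypoints) if counts[name] > 1]


def triage(plans: list[FlightPlan]) -> Triage:
    """Accept unambiguous plans; park ambiguous ones with a reason and carry on."""
    result = Triage()
    for plan in plans:
        dupes = duplicate_names(plan)
        if dupes:
            result.manual_review.append((plan, f"duplicate waypoint names: {dupes}"))
        else:
            result.accepted.append(plan)
    return result


if __name__ == "__main__":
    outcome = triage([
        FlightPlan("P1", ["AAAAA", "BBBBB", "CCCCC"]),
        FlightPlan("P2", ["DDDDD", "EEEEE", "DDDDD"]),  # same name twice
    ])
    print([p.plan_id for p in outcome.accepted])
    print([(p.plan_id, why) for p, why in outcome.manual_review])
```

The design choice worth noting is that the ambiguous plan becomes a data item in a review list rather than an exception that propagates upwards.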

A preliminary report of this incident is now out in the public domain. Software Architects and students of computer system resilience may want to give this a read.

ATS Major Incident - Preliminary Report - Full PDF

https://publicapps.caa.co.uk/docs/33/NERL%20Major%20Incident%20Investigation%20Preliminary%20Report.pdf

TLDR? Some salient points 4U :blah:

It was like this… :o
Quote
On 28th August 2023, significant disruption was experienced across UK airspace following an incident affecting part of the technical infrastructure that supports NATS’ safe controlling of aircraft. In keeping with its primary purpose, NATS delivered a safe operation throughout. However, the reduced levels of flights that resulted from the measures needed to maintain safety due to the technical incident caused significant disruption to the UK aviation system.

Then it failed dis-gracefully…   :(
Quote
Safety critical software systems are designed to always fail safely. This means that in the event they cannot proceed in a demonstrably safe manner, they will move into a state that requires manual intervention. In this case the software within the FPRSA-R subsystem was unable to establish a reasonable course of action that would preserve safety and so raised a critical exception.

l'exception exceptional... :wtf:
Quote
A critical exception is, broadly speaking, an exception of last resort after exploring all other handling options. Critical exceptions can be raised as a result of software logic or hardware faults, but essentially mark the point at which the affected system cannot continue.

So we then took a log dump…  :P
Quote
Having raised a critical exception the FPRSA-R primary system wrote a log file into the system log. It then correctly placed itself into maintenance mode and the C&M system identified that the primary system was no longer available. In the event of a failure of a primary system the backup system is designed to take over processing seamlessly. In this instance the backup system took over processing flight plan messages. As is common in complex real-time systems the backup system software is located on separate hardware with separate power and data feeds.

The backup system failed ‘by design’…   :-X
Quote
Therefore, on taking over the duties of the primary server, the backup system applied the same logic to the flight plan with the same result. It subsequently raised its own critical exception, writing a log file into the system log and placed itself into maintenance mode.
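The primary/backup behaviour quoted above is a textbook common-mode failure: identical code fed the identical message fails identically. The Python sketch below is a toy model under loud assumptions: `process`, `run_node` and `run_with_failover` are hypothetical, the "DUPLICATE" marker merely stands in for whatever the real trigger was, and whether it is safe to set a message aside in a real ATC system is a separate question the sketch does not answer.

```python
from collections import deque


class CriticalException(Exception):
    """Stand-in for the 'critical exception' described in the report."""


def process(message: str) -> str:
    # Hypothetical processing logic, shared by primary and backup.
    if "DUPLICATE" in message:
        raise CriticalException(f"cannot resolve {message!r}")
    return f"processed {message}"


def run_node(queue: deque, held: list, quarantine: bool) -> tuple[list, bool]:
    """Drain the queue until a message raises. Returns (outputs, still_healthy)."""
    outputs = []
    while queue:
        message = queue[0]  # peek: the message stays queued until handled
        try:
            outputs.append(process(message))
        except CriticalException:
            if quarantine:
                held.append(queue.popleft())  # set the offending message aside
            return outputs, False  # this node drops into maintenance mode
        queue.popleft()
    return outputs, True


def run_with_failover(messages, quarantine: bool):
    """Run primary, then backup if needed. Returns (outputs, held, unprocessed)."""
    queue, held, outputs = deque(messages), [], []
    for _node in ("primary", "backup"):
        node_outputs, healthy = run_node(queue, held, quarantine)
        outputs.extend(node_outputs)
        if healthy:
            break
    return outputs, held, list(queue)


if __name__ == "__main__":
    msgs = ["FPL-1", "FPL-2-DUPLICATE", "FPL-3"]
    print(run_with_failover(msgs, quarantine=False))
    print(run_with_failover(msgs, quarantine=True))
```

Without quarantine, the backup picks up the same message and stops at the same place, leaving everything behind it unprocessed. With quarantine, the backup starts from the next message and the held one waits for a human.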

Buggers hunt…  >:(
Quote
Now that the root cause has been identified further work needs to be undertaken to trace back through the development and testing of the FPRSA-R sub-system to understand whether the combination of events that led to the incident could have been mitigated at some point in the software development cycle. It is our understanding from the manufacturer that the specific area of software related to this investigation is unique to NATS.

We can repair it with a stability fix… ;D
Quote
A permanent software change by the manufacturer within the FPRSA-R sub-system which will prevent the critical exception from recurring for any flight plan that triggers the conditions that led to the incident. This change will prevent the software from finding a duplicate waypoint that could cause an incident.

This change will prevent the software from finding a duplicate waypoint that could cause another identical incident.

« Last Edit: September 06, 2023, 02:53:37 pm by AndyBeez »
 
The following users thanked this post: tridac

Offline tridac

  • Regular Contributor
  • *
  • Posts: 115
  • Country: gb
Re: UK NATS Outage - August 2023 - Report
« Reply #1 on: September 09, 2023, 03:26:31 pm »
Any critical system that crashes, for whatever reason, on receipt of invalid data is not fit for purpose. If duplicate data was received, why couldn't the system just reject it, send a critical error note to the originator, and continue? Was it heavy system load at holiday time lining up all the holes in the cheese, or was it never tested thoroughly enough? Not a robust enough system design by half; it should have been designed to fail gracefully. Sounds like yet another example of a government service sold off to save money. Flying on a wing and a prayer, most likely...
Test gear restoration, hardware and software projects...
 
The following users thanked this post: AndyBeez

Online Marco

  • Super Contributor
  • ***
  • Posts: 6723
  • Country: nl
Re: UK NATS Outage - August 2023 - Report
« Reply #2 on: September 10, 2023, 03:48:29 pm »
That can't be the whole story. There are almost certainly many other conditions that make a flight unroutable and cause it to be individually rejected. There must be some specific reason, which they don't want to share, why this one wasn't individually rejected. An uncaught exception or something similarly silly wouldn't be enough for them to be this evasive, so my guess would be that it triggered a priority override inserted by a security agency which screwed everything up.
« Last Edit: September 10, 2023, 03:55:22 pm by Marco »
 
The following users thanked this post: AndyBeez

Offline tridac

  • Regular Contributor
  • *
  • Posts: 115
  • Country: gb
Re: UK NATS Outage - August 2023 - Report
« Reply #3 on: September 10, 2023, 06:09:54 pm »
Could be I guess, but more likely a design failure. The report says that the primary system shut down, followed by the backup for the same reason. A tacit admission of design failure. A basic rule in safety-critical software is that all input data is validated before being accepted by the system. If that had been done, the data would have been rejected, for any reason of doubt, the originator informed and perhaps one flight cancelled. One key factor is that this happened at what was probably peak system load. In the past, running out of resources, like memory starvation, might have caused the system to thrash until it fell over, but I would not have thought that likely these days. The standard of software these days can be dire, time and money pressure overriding quality and attention to detail, with predictable results...
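A rough Python sketch of the validate-before-accept rule, with rejections reported back to the originator. Everything here is an assumption for illustration: the `callsign` and `route` fields, the `MAX_WAYPOINTS` limit and the function names are invented, and real flight plan formats are far richer than a dictionary.

```python
from dataclasses import dataclass

MAX_WAYPOINTS = 100  # illustrative limit only, not a real value


@dataclass(frozen=True)
class Verdict:
    accepted: bool
    reasons: tuple[str, ...] = ()


def validate(raw) -> Verdict:
    """Check a raw submission before anything downstream is allowed to see it."""
    if not isinstance(raw, dict):
        return Verdict(False, ("not a mapping",))
    reasons = []
    callsign = raw.get("callsign")
    if not isinstance(callsign, str) or not callsign.strip():
        reasons.append("missing callsign")
    route = raw.get("route")
    if not isinstance(route, list) or len(route) < 2:
        reasons.append("route needs at least two waypoints")
    elif len(route) > MAX_WAYPOINTS:
        reasons.append("route too long")
    elif not all(isinstance(w, str) and w for w in route):
        reasons.append("route contains an empty or non-text waypoint")
    return Verdict(accepted=not reasons, reasons=tuple(reasons))


def ingest(raw_plans):
    """Accept valid plans; collect (originator, reasons) for the rest and continue."""
    accepted, rejections = [], []
    for raw in raw_plans:
        verdict = validate(raw)
        if verdict.accepted:
            accepted.append(raw)
        else:
            origin = raw.get("callsign", "<unknown>") if isinstance(raw, dict) else "<unknown>"
            rejections.append((origin, verdict.reasons))
    return accepted, rejections


if __name__ == "__main__":
    good = {"callsign": "TEST1", "route": ["AAAAA", "BBBBB"]}
    bad = {"callsign": "", "route": ["AAAAA"]}
    print(ingest([good, bad, "garbage"]))
```

Validation of this kind only catches what the rules anticipate, so it complements rather than replaces a safe path for input that passes the checks and still cannot be processed.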
 
The following users thanked this post: AndyBeez

Offline AndyBeez (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 856
  • Country: nu
Re: UK NATS Outage - August 2023 - Report
« Reply #4 on: September 11, 2023, 08:56:23 pm »
Whatever the root cause of the meltdown, I am sure there are plenty of people denying responsibility for the job they were paid very well to do. I wonder if other ATC operators have the same duplicate waypoint vulnerability in their codebase? Maybe they should check? Someone must have asked the question, what if there is crap data? I guess the response from those in the know was, that's not crap data, it's just not invalid data. Which is like specifying that an input field has to accept any UTF-16 character because someone may have changed their name to the turd emoji :-//


The take-home for any software student is that system meltdowns have consequences. This from the press release on the RYANAIR website. The CEO of the Irish airline, Michael O'Leary, is certainly not a man to voice his opinions in diplomatic language.

Quote
This Preliminary NATS Report is factually inaccurate. It ridiculously understates the number of flights that were cancelled or delayed through the NATS system failure on Mon 28th Aug last. In Ryanair’s case, we suffered over 370 flight cancellations (over 63,000 passengers), and more than 1,500 flight delays over the 2 days (Mon & Tues over 270,000 passengers delayed).

This whitewash report, which understates the number of flights cancellations and flight delays, fails to explain why 1 inaccurate flight plan brought down not just the NATS ATC system, but also the backup system.

This Report, which is full of false figures about flight cancellations and delays, and avoids any explanation of why NATS backup system failed so spectacularly will not solve this problem unless NATS accepts responsibility for its incompetence and reimburses airlines and passengers for the avoidable right to care expenses they suffered due to NATS failure on Mon 28th Aug last.
Full release: https://corporate.ryanair.com/news/ryanair-rejects-nats-whitewash-report/
 

Offline TimFox

  • Super Contributor
  • ***
  • Posts: 7954
  • Country: us
  • Retired, now restoring antique test equipment
Re: UK NATS Outage - August 2023 - Report
« Reply #5 on: September 11, 2023, 09:21:12 pm »
Generic problem:
Complex software systems can be tested in advance to see what happens for a foreseeable fault or data error.
However, apparently no one thought of that particular data error before it happened.
 
The following users thanked this post: AndyBeez

Offline tggzzz

  • Super Contributor
  • ***
  • Posts: 19517
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: UK NATS Outage - August 2023 - Report
« Reply #6 on: September 11, 2023, 09:44:36 pm »
Quote
Generic problem:
Complex software systems can be tested in advance to see what happens for a foreseeable fault or data error.
However, apparently no one thought of that particular data error before it happened.

Old adage, which I'm sure you are aware of: "you can't test quality into a product".

Software weeklies believing the TDD dogma can't understand that.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 
The following users thanked this post: AndyBeez

Offline TimFox

  • Super Contributor
  • ***
  • Posts: 7954
  • Country: us
  • Retired, now restoring antique test equipment
Re: UK NATS Outage - August 2023 - Report
« Reply #7 on: September 11, 2023, 09:51:33 pm »
They also don't have enough imagination to think of every possible (or even plausible) fault in a complex system.
Who'd a thunkit that a flight plan would have duplicate waypoints?  That shouldn't happen.
 
The following users thanked this post: AndyBeez

Offline AndyBeez (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 856
  • Country: nu
Re: UK NATS Outage - August 2023 - Report
« Reply #8 on: September 11, 2023, 11:21:22 pm »
Quote
Old adage, which I'm sure you are aware of: "you can't test quality into a product".
Software weeklies believing the TDD dogma can't understand that.

One issue with TDD (test-driven development) is that code, and exception catching, is built only to pass the test cases. Thinking outside of the box and considering the other what-ifs is not required. Rather like an EE not designing for thermal resilience because this is not to be tested for. Until a unit at the customer overheats; then heatsinks become an 'evolving' requirement. TDD is a rather minimalist methodology when done 'correctly'. No-one drops the ball because there is no ball to drop.

Waypoints with duplicate names have been known about for a long time, so this should have been a test case. It's not as if anyone would have been short of test data. For once I have to agree with Michael O'Leary; how one ambiguous flight plan could crash the backup system is hard to comprehend.

Think NATS > Think Fukushima.
« Last Edit: September 11, 2023, 11:25:33 pm by AndyBeez »
 

Offline tggzzz

  • Super Contributor
  • ***
  • Posts: 19517
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: UK NATS Outage - August 2023 - Report
« Reply #9 on: September 12, 2023, 12:15:26 am »
Quote
Old adage, which I'm sure you are aware of: "you can't test quality into a product".
Software weeklies believing the TDD dogma can't understand that.
One issue with TDD (test-driven development) is that code, and exception catching, is built only to pass the test cases. Thinking outside of the box and considering the other what-ifs is not required. Rather like an EE not designing for thermal resilience because this is not to be tested for. Until a unit at the customer overheats; then heatsinks become an 'evolving' requirement. TDD is a rather minimalist methodology when done 'correctly'. No-one drops the ball because there is no ball to drop.

I have, too many times, heard TDD/XP/Agile people say in all seriousness "it passes the tests therefore it is working".

TDD/XP/Agile has its good features, just as Waterfall has its good features. The associated ignorant dogmatic zeal sustains the idiotic misbehaviour.
 
The following users thanked this post: AndyBeez

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 14488
  • Country: fr
Re: UK NATS Outage - August 2023 - Report
« Reply #10 on: September 12, 2023, 03:50:43 am »
Quote
Old adage, which I'm sure you are aware of: "you can't test quality into a product".
Software weeklies believing the TDD dogma can't understand that.
One issue with TDD (test-driven development) is that code, and exception catching, is built only to pass the test cases.

Yes, it is one pitfall, which is a consequence of a method that focuses on one particular aspect of things - so just like with other methods that revolve around one principle only, teams will just naturally tend to focus on this one aspect.

The main root problem, beyond being too obsessively focused on tests, is that it assumes it's actually possible to design tests that fully cover requirements, and that while doing so, people are less likely to introduce errors than when implementing functionalities themselves. That's just not so, and it's flawed.

One merit I see with this approach is if tests are designed by people different from those who implement. Not because tests are per se all that meaningful, but mostly because that's now two related parts of the design that are confronted with one another, and that's always good. Note that you'd get the same benefit, potentially with more insights, by making two different people/teams implement the same feature, and comparing the 'outputs' of the two implementations vs. a common set of 'inputs'.
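The "two teams, same inputs, compare outputs" idea is usually called differential testing. A small Python sketch, assuming a toy task (finding repeated names in a list) and two hypothetical implementations, `dupes_a` and `dupes_b`, written in deliberately different styles; the waypoint strings are made up.

```python
import random
from collections import Counter


def dupes_a(names):
    """Team A: count occurrences."""
    counts = Counter(names)
    return {name for name, count in counts.items() if count > 1}


def dupes_b(names):
    """Team B: sort, then compare neighbours."""
    ordered = sorted(names)
    return {a for a, b in zip(ordered, ordered[1:]) if a == b}


def compare(impl_a, impl_b, trials=1000, seed=0):
    """Feed both implementations the same random inputs; return the inputs they disagree on."""
    rng = random.Random(seed)
    alphabet = ["AAAAA", "BBBBB", "CCCCC", "DDDDD"]  # made-up waypoint names
    disagreements = []
    for _ in range(trials):
        names = [rng.choice(alphabet) for _ in range(rng.randint(0, 6))]
        if impl_a(names) != impl_b(names):
            disagreements.append(names)
    return disagreements


if __name__ == "__main__":
    print(len(compare(dupes_a, dupes_b)), "disagreements")
```

Agreement between the two does not prove either is right, since both teams may have misread the same spec, but any disagreement is a cheap pointer to an input somebody needs to look at.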

And then, beyond the flaw of assuming that tests will adhere to the specs more closely than the implementation, there's the major problem of the specs themselves - against which you design said tests.
In a large majority of software defects, the root cause is with faulty specs, or even the absence thereof. No amount of TDD is going to help with that.
 
The following users thanked this post: AndyBeez

Online Marco

  • Super Contributor
  • ***
  • Posts: 6723
  • Country: nl
Re: UK NATS Outage - August 2023 - Report
« Reply #11 on: September 12, 2023, 07:20:14 am »
Quote
all input data is validated before being accepted by the system.

Wouldn't really help in this case as described: it was ostensibly valid, but an unexpected corner case.

Assuming each flight can be handled separately, I'd fork, limit the amount of memory the process could use, and never route more than fits in memory at the same time, worst case. Segfault, OOM, time out, error, garbage output ... flight goes on the failed heap and you move on. Always assume your code is garbage. It's so little data by modern computer standards that just doing process isolation makes most sense, if possible.
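A minimal Python sketch of that per-flight isolation, POSIX only. The worker script, the JSON plan shape, the limits and the function names are all invented for illustration, and the worker's duplicate check merely stands in for "any crash". It uses `subprocess.run` with a timeout plus `resource.setrlimit` in the child; `preexec_fn` is not safe in multi-threaded parents, so treat this as a sketch rather than production advice.

```python
import json
import resource
import subprocess
import sys

# Hypothetical worker: reads one plan as JSON on stdin, prints a JSON result.
WORKER = r"""
import json, sys
plan = json.load(sys.stdin)
if len(set(plan["route"])) != len(plan["route"]):
    raise RuntimeError("ambiguous route")  # stands in for any crash
print(json.dumps({"id": plan["id"], "hops": len(plan["route"]) - 1}))
"""

MEMORY_LIMIT_BYTES = 512 * 1024 * 1024  # illustrative value
TIMEOUT_SECONDS = 5                     # illustrative value


def _limit_memory():
    # Runs in the child between fork and exec. Sets the soft address-space
    # limit, capped at the existing hard limit.
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    limit = MEMORY_LIMIT_BYTES if hard == resource.RLIM_INFINITY else min(MEMORY_LIMIT_BYTES, hard)
    resource.setrlimit(resource.RLIMIT_AS, (limit, hard))


def route_isolated(plan):
    """Route one plan in its own process. Returns (result, error)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", WORKER],
            input=json.dumps(plan), capture_output=True, text=True,
            timeout=TIMEOUT_SECONDS, preexec_fn=_limit_memory,
        )
    except subprocess.TimeoutExpired:
        return None, "timeout"
    if proc.returncode != 0:
        return None, f"worker exited with {proc.returncode}"
    try:
        return json.loads(proc.stdout), None
    except json.JSONDecodeError:
        return None, "garbage output"


def route_all(plans):
    """Failed plans go on the 'failed heap'; everything else keeps moving."""
    routed, failed = [], []
    for plan in plans:
        result, error = route_isolated(plan)
        if error is None:
            routed.append(result)
        else:
            failed.append((plan["id"], error))
    return routed, failed


if __name__ == "__main__":
    print(route_all([
        {"id": "P1", "route": ["AAAAA", "BBBBB", "CCCCC"]},
        {"id": "P2", "route": ["AAAAA", "BBBBB", "AAAAA"]},
    ]))
```

The parent never runs the routing code itself, so a crash, a hang or nonsense output in the worker costs one flight plan instead of the whole service.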

I still can't imagine they don't have a codepath for failed routing already other than panic and stop everything.
« Last Edit: September 12, 2023, 07:38:49 am by Marco »
 
The following users thanked this post: AndyBeez

