UK NATS Outage - August 2023 - Report
AndyBeez:
On the 28th August 2023, a major outage of a component system of the United Kingdom’s air traffic control environment brought chaos to British airports. Unable to file flight plans, airlines were forced to reschedule, postpone or cancel flights. The knock-on effect left aircraft and aircrews in the wrong place, with disruption lasting for days after. Airlines have already paid out millions of pounds in compensation. This outage was not the result of a cyber attack but, instead, a killer combination of garbage data in and an unexpected exception out.
So what happened? In summary, a flight plan for an aircraft transiting UK airspace contained duplicate waypoints. Unable to cope with this ambiguity, the system for administering UK flight plan data showed the ‘fault’ light, the backup kicked in, showed another ‘fault’ light, and everything stopped.
Surprisingly, waypoints [or aircraft navigation points] are not required to have unique names; rather, the UK’s Flight Plan Reception Suite Automated - Replacement (FPRSA-R) system, when detecting such a discrepancy, was designed to divert the flight plan to a real human for clarification. Unfortunately, when a real ambiguous flight plan was filed, by the French no less, the software “threw an exception.” I'm sure someone suggested turning it off and then on again, but the damage to UK air travel had already been done. The NATS CEO subsequently claimed this was a "one in 15 million event", which is about the same odds as being struck by lightning.
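For anyone who wants to picture the "divert to a human" behaviour described above, here is a minimal Python sketch. It is not the FPRSA-R logic, which is not public in this thread; the names `Triage` and `triage_route` and the waypoint identifiers are made up for illustration. The only point is that an ambiguous route produces an ordinary result flagged for manual review, rather than an exception.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Triage:
    """Outcome of checking one route (hypothetical structure)."""
    needs_human: bool
    duplicates: list = field(default_factory=list)


def triage_route(waypoints):
    """Flag a route for manual review if any waypoint name repeats.

    Nothing is raised here: a duplicate name is treated as a normal,
    expected outcome that a person has to resolve.
    """
    counts = Counter(waypoints)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    return Triage(needs_human=bool(duplicates), duplicates=duplicates)


# Made-up waypoint names; "AAAAA" appears twice in the second route.
clean = triage_route(["AAAAA", "BBBBB", "CCCCC"])
ambiguous = triage_route(["AAAAA", "BBBBB", "AAAAA"])
```

With these inputs `clean.needs_human` is `False`, while `ambiguous.needs_human` is `True` and `ambiguous.duplicates` is `["AAAAA"]`.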
A preliminary report of this incident is now in the public domain. Software architects and students of computer system resilience may want to give it a read.
ATS Major Incident - Preliminary Report - Full PDF
https://publicapps.caa.co.uk/docs/33/NERL%20Major%20Incident%20Investigation%20Preliminary%20Report.pdf
TLDR? Some salient points 4U :blah:
It was like this… :o
--- Quote ---On 28th August 2023, significant disruption was experienced across UK airspace following an incident affecting part of the technical infrastructure that supports NATS’ safe controlling of aircraft. In keeping with its primary purpose, NATS delivered a safe operation throughout. However, the reduced levels of flights that resulted from the measures needed to maintain safety due to the technical incident caused significant disruption to the UK aviation system.
--- End quote ---
Then it failed dis-gracefully… :(
--- Quote ---Safety critical software systems are designed to always fail safely. This means that in the event they cannot proceed in a demonstrably safe manner, they will move into a state that requires manual intervention. In this case the software within the FPRSA-R subsystem was unable to establish a reasonable course of action that would preserve safety and so raised a critical exception.
--- End quote ---
l'exception exceptional... :wtf:
--- Quote ---A critical exception is, broadly speaking, an exception of last resort after exploring all other handling options. Critical exceptions can be raised as a result of software logic or hardware faults, but essentially mark the point at which the affected system cannot continue.
--- End quote ---
So we then took a log dump… :P
--- Quote ---Having raised a critical exception the FPRSA-R primary system wrote a log file into the system log. It then correctly placed itself into maintenance mode and the C&M system identified that the primary system was no longer available. In the event of a failure of a primary system the backup system is designed to take over processing seamlessly. In this instance the backup system took over processing flight plan messages. As is common in complex real-time systems the backup system software is located on separate hardware with separate power and data feeds.
--- End quote ---
The backup system failed ‘by design’… :-X
--- Quote ---Therefore, on taking over the duties of the primary server, the backup system applied the same logic to the flight plan with the same result. It subsequently raised its own critical exception, writing a log file into the system log and placed itself into maintenance mode.
--- End quote ---
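The two quotes above describe what is often called a "poison message": the same input, fed to identical logic on separate hardware, takes out both nodes. The toy Python simulation below shows that effect and one common mitigation, which is to park the offending message after the first failure. Everything in it (`Node`, `run_pair`, the `ambiguous` flag) is a hypothetical stand-in, not a description of how NATS or the manufacturer built or fixed the real system.

```python
class CriticalException(Exception):
    """Stand-in for the 'exception of last resort' in the report."""


def process(plan):
    # Hypothetical rule: plans marked ambiguous cannot be handled.
    if plan.get("ambiguous"):
        raise CriticalException(plan["id"])
    return "processed " + plan["id"]


class Node:
    def __init__(self, name):
        self.name = name
        self.in_maintenance = False

    def handle(self, plan):
        try:
            return process(plan)
        except CriticalException:
            self.in_maintenance = True  # node takes itself out of service
            raise


def run_pair(plans, quarantine_on_failure):
    """Feed plans to a primary/backup pair running identical logic."""
    nodes = [Node("primary"), Node("backup")]
    done, quarantined = [], []
    for plan in plans:
        for node in nodes:
            if node.in_maintenance:
                continue
            try:
                done.append(node.handle(plan))
                break
            except CriticalException:
                if quarantine_on_failure:
                    quarantined.append(plan["id"])  # park it for a human
                    break
                # otherwise the next node retries the very same plan
    down = [n.name for n in nodes if n.in_maintenance]
    return done, quarantined, down


plans = [{"id": "FP1"}, {"id": "FP2", "ambiguous": True}, {"id": "FP3"}]
naive = run_pair(plans, quarantine_on_failure=False)
guarded = run_pair(plans, quarantine_on_failure=True)
```

In the naive run both nodes end up in maintenance and FP3 is never processed. In the guarded run only the primary goes down, FP2 is parked, and the backup still processes FP3.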
Buggers hunt… >:(
--- Quote ---Now that the root cause has been identified further work needs to be undertaken to trace back through the development and testing of the FPRSA-R sub-system to understand whether the combination of events that led to the incident could have been mitigated at some point in the software development cycle. It is our understanding from the manufacturer that the specific area of software related to this investigation is unique to NATS.
--- End quote ---
We can repair it, do a stability fix… ;D
--- Quote ---A permanent software change by the manufacturer within the FPRSA-R sub-system which will prevent the critical exception from recurring for any flight plan that triggers the conditions that led to the incident. This change will prevent the software from finding a duplicate waypoint that could cause an incident.
--- End quote ---
tridac:
Any critical system that crashes, for whatever reason, on receipt of invalid data is not fit for purpose. If duplicate data was received, why couldn't the system just reject it, send a critical error note to the originator, and continue? Was it heavy system load at holiday time lining up all the holes in the cheese, or was it never tested thoroughly enough? Not a robust enough system design by half; it should have been designed to fail gracefully. Sounds like yet another example of a government service sold off to save money. Flying on a wing and a prayer, most likely...
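A rough Python sketch of the "reject it, tell the originator, and continue" idea follows. It is only an illustration: `intake`, `strict_handler`, the `originator` field and the rule that a repeated waypoint name counts as invalid are all assumptions made here, and whether such a plan really is "invalid" is a separate question. The point is the shape of the loop: one bad message is isolated and the queue keeps moving.

```python
def strict_handler(plan):
    """Hypothetical check: refuse routes with repeated waypoint names."""
    route = plan.get("route", [])
    if len(route) != len(set(route)):
        raise ValueError("duplicate waypoint names in route")


def intake(plans, handler, notify):
    """Process each plan in isolation so one rejection cannot stop the rest."""
    accepted, rejected = [], []
    for plan in plans:
        try:
            handler(plan)
        except ValueError as exc:
            # Tell whoever filed it, record the rejection, carry on.
            notify(plan.get("originator", "unknown"), str(exc))
            rejected.append(plan.get("id"))
            continue
        accepted.append(plan.get("id"))
    return accepted, rejected


notes = []
queue = [
    {"id": "FP1", "originator": "ops-a", "route": ["AAAAA", "BBBBB"]},
    {"id": "FP2", "originator": "ops-b", "route": ["AAAAA", "BBBBB", "AAAAA"]},
    {"id": "FP3", "originator": "ops-c", "route": ["CCCCC", "DDDDD"]},
]
result = intake(queue, strict_handler, lambda who, why: notes.append((who, why)))
```

Here FP1 and FP3 are accepted, FP2 is rejected, and `notes` holds a single entry addressed to `ops-b`.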
Marco:
That can't be the whole story. There are almost certainly far more conditions that make a flight unroutable and get it individually rejected. There must be some specific reason this one wasn't individually rejected that they don't want to share. An uncaught exception or something similarly silly wouldn't be enough for them to be this evasive, so my guess would be that it triggered a priority override inserted by a security agency, which screwed everything up.
tridac:
Could be, I guess, but more likely a design failure. The report says that the primary system shut down, followed by the backup for the same reason. A tacit admission of design failure. A basic rule in safety-critical software is that all input data is validated before being accepted by the system. If that had been done, the data would have been rejected for any reason of doubt, the originator informed, and perhaps one flight cancelled. One key factor is that this happened at what was probably peak system load. In the past, running out of resources, like memory starvation, might have caused the system to thrash until it fell over, but I would not have thought that likely these days. The standard of software these days can be dire, with time and money pressure overriding quality and attention to detail, with predictable results...
AndyBeez:
Whatever the root cause of the meltdown, I am sure there are plenty of people denying responsibility for the job they were paid very well to do. I wonder if other ATC operators have the same duplicate waypoint vulnerability in their codebase? Maybe they should check? Someone must have asked the question, what if there is crap data? I guess the response from those in the know was, that's not crap data, it's just not invalid data. Which is like specifying that an input field has to accept any UTF-16 character because someone may have changed their name to the turd emoji :-//
The take-home for any software student is that system meltdowns have consequences. This from the press release on the RYANAIR website. The CEO of the Irish airline, Michael O'Leary, is certainly not a man to voice his opinions in diplomatic language.
--- Quote ---This Preliminary NATS Report is factually inaccurate. It ridiculously understates the number of flights that were cancelled or delayed through the NATS system failure on Mon 28th Aug last. In Ryanair’s case, we suffered over 370 flight cancellations (over 63,000 passengers), and more than 1,500 flight delays over the 2 days (Mon & Tues over 270,000 passengers delayed).
This whitewash report, which understates the number of flights cancellations and flight delays, fails to explain why 1 inaccurate flight plan brought down not just the NATS ATC system, but also the backup system.
This Report, which is full of false figures about flight cancellations and delays, and avoids any explanation of why NATS backup system failed so spectacularly will not solve this problem unless NATS accepts responsibility for its incompetence and reimburses airlines and passengers for the avoidable right to care expenses they suffered due to NATS failure on Mon 28th Aug last.
--- End quote ---
Full release: https://corporate.ryanair.com/news/ryanair-rejects-nats-whitewash-report/#:~:text=Ryanair%2C%20the%20UK%27s%20no.1,Mon%2028th%20Aug%20last.&text=NATS%20claims%20just%20575%20flights,UK%20that%20day%20were%20delayed.