On 28 August 2023, a major outage of a component system of the United Kingdom’s air traffic control environment brought chaos to British airports. Unable to file flight plans, airlines were forced to reschedule, postpone or cancel flights. The knock-on effect left aircraft and aircrews in the wrong place, with disruption lasting for days afterwards. Airlines have already paid out millions of pounds in compensation. This outage was not the result of a cyber attack but of a killer combination: garbage data in and an unexpected exception out.
So what happened? In summary, a flight plan for an aircraft transiting UK airspace contained duplicate waypoints. Unable to cope with this ambiguity, the system for administering UK flight plan data showed the ‘fault’ light, the backup kicked in, showed another ‘fault’ light, and everything stopped.
Surprisingly, waypoints (or aircraft navigation points) are not required to have unique names. Instead, the UK’s Flight Plan Reception Suite Automated (FPRSA-R) system was designed to divert a flight plan to a real human for clarification when it detects such a discrepancy. Unfortunately, when a real ambiguous flight plan was filed, by the French no less, the software “threw an exception.” I'm sure someone suggested turning it off and then on again, but the damage to UK air travel had already been done. The NATS CEO subsequently claimed this was a "one in 15 million event", which is about the same odds as being struck by lightning.
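To make the intended behaviour concrete, here is a minimal sketch of "spot the ambiguity and hand it to a human instead of falling over". It is purely illustrative and is not the FPRSA-R logic: the `Triage` type, the `triage_flight_plan` function and the waypoint names are all made up for this example, and a real system would compare far more than names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Triage:
    accepted: bool    # True if the plan can be processed automatically
    reason: str = ""  # Why a human needs to look at it, if not accepted


def triage_flight_plan(waypoints: list[str]) -> Triage:
    """Flag a route for manual review if any waypoint name appears more than once."""
    counts = Counter(waypoints)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        # Ambiguous input is an expected case, so return a result rather than raise.
        return Triage(False, "duplicate waypoint names: " + ", ".join(duplicates))
    return Triage(True)


# Hypothetical route in which the name ALPHA appears twice.
result = triage_flight_plan(["ALPHA", "BRAVO", "ALPHA", "CHARLIE"])
print(result.accepted, result.reason)
```

The design point is simply that a known kind of bad input gets an ordinary return value and a queue for manual review, leaving exceptions for the genuinely unexpected.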
A preliminary report of this incident is now out in the public domain. Software Architects and students of computer system resilience may want to give this a read.
NATS Major Incident - Preliminary Report - Full PDF
https://publicapps.caa.co.uk/docs/33/NERL%20Major%20Incident%20Investigation%20Preliminary%20Report.pdf
TLDR? Some salient points 4U
It was like this…
On 28th August 2023, significant disruption was experienced across UK airspace following an incident affecting part of the technical infrastructure that supports NATS’ safe controlling of aircraft. In keeping with its primary purpose, NATS delivered a safe operation throughout. However, the reduced levels of flights that resulted from the measures needed to maintain safety due to the technical incident caused significant disruption to the UK aviation system.
Then it failed dis-gracefully…
Safety critical software systems are designed to always fail safely. This means that in the event they cannot proceed in a demonstrably safe manner, they will move into a state that requires manual intervention. In this case the software within the FPRSA-R subsystem was unable to establish a reasonable course of action that would preserve safety and so raised a critical exception.
l'exception exceptional...
A critical exception is, broadly speaking, an exception of last resort after exploring all other handling options. Critical exceptions can be raised as a result of software logic or hardware faults, but essentially mark the point at which the affected system cannot continue.
So we then took a log dump…
Having raised a critical exception the FPRSA-R primary system wrote a log file into the system log. It then correctly placed itself into maintenance mode and the C&M system identified that the primary system was no longer available. In the event of a failure of a primary system the backup system is designed to take over processing seamlessly. In this instance the backup system took over processing flight plan messages. As is common in complex real-time systems the backup system software is located on separate hardware with separate power and data feeds.
The backup system failed ‘by design’…
Therefore, on taking over the duties of the primary server, the backup system applied the same logic to the flight plan with the same result. It subsequently raised its own critical exception, writing a log file into the system log and placed itself into maintenance mode.
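Why separate hardware did not help is easier to see in a toy model. The sketch below is an assumption-laden illustration, not the NATS architecture: `Node`, `CriticalException`, `process_with_failover` and `process_with_quarantine` are hypothetical names, and the "duplicate name" check merely stands in for whatever logic the real system applied. The first function replays the same message on the backup, so both nodes end up in maintenance mode. The second shows one commonly discussed alternative, setting the offending message aside for a human rather than replaying it; it is not the fix described in the report.

```python
class CriticalException(Exception):
    """Raised when a node cannot find a safe way to continue."""


class Node:
    def __init__(self, name):
        self.name = name
        self.in_maintenance = False

    def process(self, message):
        # Stand-in for the real processing logic, identical on every node.
        if len(set(message)) != len(message):
            self.in_maintenance = True  # fail safe: stop and wait for a human
            raise CriticalException(f"{self.name}: cannot process {message!r}")
        return f"{self.name} processed {message!r}"


def process_with_failover(nodes, message):
    """Try each healthy node in turn, replaying the same message on the next one."""
    for node in nodes:
        if node.in_maintenance:
            continue
        try:
            return node.process(message)
        except CriticalException:
            continue  # same input, same logic: the next node fails the same way
    return None  # no healthy node could process the message


def process_with_quarantine(nodes, message, quarantine):
    """Alternative: after one failure, hold the message instead of replaying it."""
    for node in nodes:
        if node.in_maintenance:
            continue
        try:
            return node.process(message)
        except CriticalException:
            quarantine.append(message)  # park it for manual review
            return None
    return None


bad_plan = ["ALPHA", "BRAVO", "ALPHA"]  # made-up route with a repeated name

primary, backup = Node("primary"), Node("backup")
process_with_failover([primary, backup], bad_plan)
print(primary.in_maintenance, backup.in_maintenance)  # both nodes are now down

primary, backup = Node("primary"), Node("backup")
held = []
process_with_quarantine([primary, backup], bad_plan, held)
print(primary.in_maintenance, backup.in_maintenance, len(held))
```

Redundant hardware protects against hardware faults. Against a deterministic failure triggered by the input itself, the backup is just a second place for the same thing to go wrong.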
Buggers hunt…
Now that the root cause has been identified further work needs to be undertaken to trace back through the development and testing of the FPRSA-R sub-system to understand whether the combination of events that led to the incident could have been mitigated at some point in the software development cycle. It is our understanding from the manufacturer that the specific area of software related to this investigation is unique to NATS.
We can repair it, do a stability fix…
A permanent software change by the manufacturer within the FPRSA-R sub-system will prevent the critical exception from recurring for any flight plan that triggers the conditions that led to the incident. This change will prevent the software from finding a duplicate waypoint that could cause another identical incident.