UK air traffic control meltdown

38 minute read Published: 2023-09-08

Comments on reddit and Hacker News. Other posts: part two and part three.

UPDATE: The flight plan which triggered the chaos (but was valid) has been identified!


On 28 August 2023 NATS, the UK's air traffic control operator, suffered a major technical incident. The BBC reports that more than 2000 flights were cancelled and the cost has been estimated at over £100 million GBP. The incident probably affected hundreds of thousands of people.

The press initially reported the cause was a faulty flight plan: "UK air traffic control: inquiry into whether French error caused failure" (The Times) and in typical Mail Online reporting style: "Did blunder by French airline spark air traffic control issues? Officials probe if a single badly filed travel plan caused UK's entire flight-control system to collapse in worst outage for a decade - with 1,000 flights cancelled and chaos set to last DAYS".

So what happened? These are notes on my reading of the incident report:

NATS Major Incident Preliminary Report
Flight Plan Reception Suite Automated (FPRSA-R) Sub-system Incident 28th August 2023
pdf.

NATS is a "public-private" company in the UK that is responsible for all of the UK's air traffic control:

Air Traffic Control (ATC) is the provision and operation of a safe system for controlling and monitoring aircraft. [..]
aircraft [..] are required to file a flight plan. [..]
ATC ensures that aircraft are safely separated laterally and vertically.

What went wrong

The start of the sequence of events leading to the incident can be tracked back to the point at which a flight plan was entered into the flight planning system.

[Airlines] submit the plan into Eurocontrol’s Integrated Initial Flight Plan Processing System (IFPS). [..]

If the submitted flight plan is accepted by IFPS, i.e. it is compliant with IFPS defined parameters [...] this is sufficient for a flight to depart with local ATC approval. The flight plan will be sent from IFPS to all relevant ANSPs who need to manage the flight. [..]

Within the NATS En-route operations at Swanwick Centre, the data is passed to FPRSA-R. The FPRSA-R sub-system exists to convert the data received from IFPS (in a format known as ATS Data Exchange Presentation, ADEXP) into a format that is compatible with the UK National Airspace System (NAS). NAS is the flight data processing system which contains all of the relevant airspace and routings. [..]

FPRSA-R has a primary and backup system monitored both by dedicated Control and Monitoring (C&M) systems and also an aggregated central C&M system. Further resilience is provided by NAS storing 4 hours of previously filed flight data to allow the operation to continue in the event of the loss of automatic processing of flight data. [..]

In addition to the technical resilience provided by backup systems, and the 4 hours of stored flight data, there is operational contingency available to allow safe service to continue. This is provided through the ability to input flight data manually, directly into NAS using a manual input system.

To summarise:

The NATS ATC System was operating normally. [..]

[On] 28 August the airline submitted an ICAO4444 compliant flight plan into Eurocontrol’s flight planning distribution system, IFPS.

ICAO stands for International Civil Aviation Organization, a United Nations agency. An ICAO4444 flight plan looks like this:

(FPL-TTT123-IS
-C550/L-SDE1E2GHIJ3J5RWZ/SB1D1
-KPWM1225
-N0440F310 SSOXS5 SSOXS DCT BUZRD
DCT SEY DCT HTO J174 ORF J121
CHS EESNT LUNNI1
-KJAX0214 KMCO
-PBN/A1L1B1C1D1O1T1 NAV/Z1 GBAS
DAT/1FANS2PDC SUR/260B RSP180
DOF/220501 REG/N123A SEL/BPAM
CODE/A05ED7)

Such messages are in a format that is meant to be read by machines, but also by humans if necessary. The format is spec'd over many many pages of PDF, but is roughly:

( FPL-ACID-Flt Rules Flight Type
- AC Type/Wake Cat-
Equip.&Capability
- Departure EOBT
- Speed Altitude [sp] Route
- Destination ETE [sp]
Alternate(s)
- Other Information )

The route part (in this example: N0440F310 SSOXS5 SSOXS DCT BUZRD DCT SEY DCT HTO J174 ORF J121 CHS EESNT LUNNI1) encodes an overall speed (here N0440 meaning 440 knots), an overall altitude (here F310 which means "Flight Level 310" which means 310 × 100 ft (can also be in km)), and a sequence of waypoints (referenced by name) separated by a description of how to get from the previous waypoint to the next one, usually by referencing a "known route" by name.

The flight plan was accepted by IFPS [..]
the aircraft was cleared to depart at 04:00. [..]

At 08:32 the flight plan was received by NATS’ FPRSA-R sub-system from Eurocontrol’s IFPS system. This is consistent with the 4 hour rule mentioned above. The purpose of the FPRSA-R software is to extract the UK portion of the flight plan [..]

The flight plans delivered to FPRSA-R by IFPS are converted from [..] ICAO4444 to [..] ADEXP. ADEXP is a European-wide flight plan specification that includes, amongst other data, additional geographical waypoints within the European region specific to the route of a flight. For flights transiting through UK airspace, rather than landing in the UK, this will include additional waypoints outside of UK airspace required for its onward journey. Following this conversion the ADEXP version of a flight plan includes, amongst other aspects, the original ICAO4444 flight plan plus an additional list of waypoints and other data.

ADEXP looks like this:

-TITLE IFPL
-BEGIN ADDR
  -FAC LIIRZEZX
  [...]
  -FAC LYZZEBXX
-END ADDR
-ADEP EDDF
-ADES LGTS
-ARCID KIM1
-ARCTYP B738
-CEQPT SDGRWY
-EOBD 170729
-EOBT 0715
-FILTIM 280832
-IFPLID AT00441635
-ORIGIN -NETWORKTYPE SITA -FAC FRAOXLH
-SEQPT C
-WKTRC M
-PBN B2
-REG DABHM
-SEL KMGJ
-SRC FPL
-TTLEET 0210
-RFL F330
-SPEED N0417
-FLTRUL I
-FLTTYP S
-ROUTE N0417F330 ANEKI8L ANEKI Y163 NATOR UN850 TRA UP131 RESIA Q333
BABAG UN606 PEVAL DCT PETAK UL607 PINDO UM603 EDASI
-ALTRNT1 LBSF
-BEGIN RTEPTS
  -PT -PTID EDDF -FL F004 -ETO 170729073000
  -PT -PTID RID -FL F100 -ETO 170729073404
  -PT -PTID ANEKI -FL F210 -ETO 170729073856
  -PT -PTID NEKLO -FL F214 -ETO 170729073911
  -PT -PTID BADLI -FL F248 -ETO 170729074118
  -PT -PTID PABLA -FL F279 -ETO 170729074348
  -PT -PTID HERBI -FL F308 -ETO 170729074624
  -PT -PTID NATOR -FL F330 -ETO 170729074911
  -PT -PTID TITIX -FL F330 -ETO 170729075154
  -PT -PTID TRA -FL F330 -ETO 170729075323
  -PT -PTID ARGAX -FL F330 -ETO 170729080055
  -PT -PTID RESIA -FL F330 -ETO 170729080731
  -PT -PTID UNTAD -FL F330 -ETO 170729081243
  -PT -PTID DIKEM -FL F330 -ETO 170729081627
  -PT -PTID ROKIB -FL F330 -ETO 170729081824
  -PT -PTID BABAG -FL F330 -ETO 170729082816
  -PT -PTID PEVAL -FL F330 -ETO 170729082916
  -PT -PTID PETAK -FL F330 -ETO 170729091754
  -PT -PTID PINDO -FL F330 -ETO 170729093322
  -PT -PTID EDASI -FL F165 -ETO 170729094347
  -PT -PTID LGTS -FL F000 -ETO 170729095713
-END RTEPTS
-SID ANEKI8L
-ATSRT Y163 ANEKI NATOR
-ATSRT UN850 NATOR TRA
-ATSRT UP131 TRA RESIA
-ATSRT Q333 RESIA BABAG
-ATSRT UN606 BABAG PEVAL
-DCT PEVAL PETAK
-ATSRT UL607 PETAK PINDO
n -ATSRT UM603 PINDO EDASI

You can read about ADEXP in the official spec. Some notable fields (page 48):

Adexp Primary FieldKindSyntaxSemantic
routeb'-' "ROUTE" {LIM_CHAR}Complete ICAO Field 15 information containing speed, RFL and route (conforming to the syntax given in Ref. [3]).
rteptsc'-' "BEGIN" "RTEPTS" { pt I ad / vec} '-' "END" "RTEPTS"List of route points. May also contain an aerodrome identifier.

In the example, we have the ICAO route:

-ROUTE N0417F330 ANEKI8L ANEKI Y163 NATOR UN850 TRA UP131 RESIA Q333 BABAG UN606 PEVAL DCT PETAK UL607 PINDO UM603 EDASI

(9 waypoints, 11 if you add the start and end waypoints)

Visually, routes look like: some route

(You can play around with flight plans at flightplandatabase.com, a website for people who like playing with flight simulators.)

We can indent the "route" parts between the waypoints in the ICAO plan to make things clearer:

N0417F330
  ANEKI8L 
  ANEKI 
    Y163
  NATOR
    UN850
  TRA
    UP131
  RESIA
    Q333
  BABAG
    UN606
  PEVAL
    DCT
  PETAK
    UL607
  PINDO
    UM603
  EDASI

E.g. ANEKI Y163 NATOR means "go from waypoint ANEKI to waypoint NATOR via the route Y163". DCT means "direct".

The ADEXP format has more waypoints, along with more precision about altitude and estimated time at each waypoint:

-BEGIN RTEPTS
-PT -PTID EDDF  -FL F004 -ETO 170729073000
-PT -PTID RID   -FL F100 -ETO 170729073404
-PT -PTID ANEKI -FL F210 -ETO 170729073856
-PT -PTID NEKLO -FL F214 -ETO 170729073911
-PT -PTID BADLI -FL F248 -ETO 170729074118
-PT -PTID PABLA -FL F279 -ETO 170729074348
-PT -PTID HERBI -FL F308 -ETO 170729074624
-PT -PTID NATOR -FL F330 -ETO 170729074911
-PT -PTID TITIX -FL F330 -ETO 170729075154
-PT -PTID TRA   -FL F330 -ETO 170729075323
-PT -PTID ARGAX -FL F330 -ETO 170729080055
-PT -PTID RESIA -FL F330 -ETO 170729080731
-PT -PTID UNTAD -FL F330 -ETO 170729081243
-PT -PTID DIKEM -FL F330 -ETO 170729081627
-PT -PTID ROKIB -FL F330 -ETO 170729081824
-PT -PTID BABAG -FL F330 -ETO 170729082816
-PT -PTID PEVAL -FL F330 -ETO 170729082916
-PT -PTID PETAK -FL F330 -ETO 170729091754
-PT -PTID PINDO -FL F330 -ETO 170729093322
-PT -PTID EDASI -FL F165 -ETO 170729094347
-PT -PTID LGTS  -FL F000 -ETO 170729095713
-END RTEPTS

(21 waypoints)

We can mark which of the ADEXP waypoints have a corresponding waypoint in the ICAO plan (with a +) and which are implicit (with a |):

EDDF   |
RID    |
ANEKI  + 
NEKLO  |
BADLI  |
PABLA  |
HERBI  |
NATOR  +
TITIX  |
TRA    +
ARGAX  |
RESIA  +
UNTAD  |
DIKEM  |
ROKIB  |
BABAG  +
PEVAL  |
PETAK  +
PINDO  +
EDASI  +
LGTS   |

Note that the ICAO waypoints do not contain the start and end, since in the original ICAO format these are specified in other fields (so it would waste space to list them again in this list).

The ADEXP waypoints plan included two waypoints along its route that were geographically distinct but which have the same designator.

This means there were two lines like:

-PT -PTID RESIA -FL F330 -ETO 170729080731

that had the same PTID string like "RESIA".

Although there has been work by ICAO and other bodies to eradicate non-unique waypoint names there are duplicates around the world. In order to avoid confusion latest standards state that such identical designators should be geographically widely spaced. In this specific event, both of the waypoints were located outside of the UK, one towards the beginning of the route and one towards the end; approximately 4000 nautical miles apart.

4000 nautical miles is 7408km. Here is an arc of that length on the globe: 4000 nautical miles on a globe

Once the ADEXP file had been received, the FPRSA-R software commenced searching for the UK airspace entry point in the waypoint information per the ADEXP flight plan, commencing at the first line of that waypoint data. FPRSA-R was able to specifically identify the character string as it appeared in the ADEXP flight plan text.

The programming style is very imperative in non domain structured programming style. Furthermore, the description sounds like the procedure is working directly on the textual representation of the flight plan, rather than a data structure parsed from the text file. This would be quite worrying, but it might also just be how it is explained.

Having correctly identified the entry point, the software moved on to search for the exit point from UK airspace in the waypoint data.

Having completed those steps,

This part of the code identified entry and exit waypoints to UK airspace in the list of ADEXP waypoints.

FPRSA-R then searches the ICAO4444 section of the ADEXP file.

It seems at this point, having identified the entry and exit points from the list of ADEXP waypoints, it will try to extract the UK portion of the flight plan from the ICAO route. Why does it do this? User t0mas88 on Hacker News explains:

They use the ADEXP to determine which part of the route is in the UK. Because the auto generated points are ATC area handover points. So this data is the best way so see which part of the route is within the UK airspace. Then it needs to find the ICAO part that corresponds, because the controller needs to use the ICAO plan that the pilot has.

If the controller sees other (auto generated) waypoints that the pilots don't have you get problems during operation. A simple example is that controllers can tell pilots to fly in a straight line to a specific point on their filed route (and do so quite often). The pilot is expected to continue the filed route from that point onwards.

They can also tell a pilot to fly direct to some random other point (this also happens but less often). The pilot is then not expected to pick up a route after that point.

The radio instruction for both is exactly the same, the only difference is whether the point is part of the planned route or not. So the controller needs to see the exact same route as the pilots have, not one with additional waypoints added by the IFPS system.

Back to the report:

It initially searches from the beginning of that data, to find the identified UK airspace entry point. This was successfully found. Next, it searches backwards, from the end of that section, to find the UK airspace exit point. This did not appear in that section of the flight plan so the search was unsuccessful. As there is no requirement for a flight plan to contain an exit waypoint from a Flight Information Region (FIR) or a country’s airspace, the software is designed to cope with this scenario.

Therefore, where there is no UK exit point explicitly included, the software logic utilises the waypoints as detailed in the ADEXP file to search for the next nearest point beyond the UK exit point. This was also not present.

The software therefore moved on to the next waypoint.

OK, so I think this is what is going on, the situation looked something like this:

           4       2        8         5              1           9
ICAO:  F------Q--------T--------O-----------P---------------Y--------U

ADEXP: F   S  Q    C   T   A    O  E  X     P   W   B   Q   Y        U
                       UK  UK   UK UK UK    UK  UK

Here the ICAO route has waypoints (represented by capital letters) separated by known routes (numbers). On the bottom we have the ADEXP waypoints. The ADEXP waypoints that are located in the UK airspace are marked with UK.

This search was successful as a duplicate identifier appeared in the flight plan.

What should the software have done? Well, Q is clearly not the waypoint we are searching for, we are searching for waypoint Y, since T 8 O 5 P 1 Y is the smallest segment of the ICAO plan that contains all the UK waypoints.

It's important to note here that the original algorithm is buggy; it is perfectly possible to unambiguously extract the UK portion of this example flight plan; see below. And this is likely the case for the flight plan that caused the meltdown too.

(Side note: How are UK waypoints identified? Are there any UK waypoints that have a duplicate somewhere else in the world? Either there is special code to handle this case, in which case the "duplicate waypoint name" problem is well known to the engineers, or it doesn't, and this is another potential problem.)

Having found an entry and exit point, with the latter being the duplicate and therefore geographically incorrect, the software could not extract a valid UK portion of flight plan between these two points. This is the root cause of the incident. We can therefore rule out any cyber related contribution to this incident.

Safety critical software systems are designed to always fail safely. This means that in the event they cannot proceed in a demonstrably safe manner, they will move into a state that requires manual intervention.

We are left wondering if, had the misidentified waypoint been in a more plausible geographic location, the code might not have thrown an exception and passed along wrong data to ATCOs.

In this case the software within the FPRSA-R subsystem was unable to establish a reasonable course of action that would preserve safety and so raised a critical exception. A critical exception is, broadly speaking, an exception of last resort after exploring all other handling options. Critical exceptions can be raised as a result of software logic or hardware faults, but essentially mark the point at which the affected system cannot continue.

if (end_idx < start_idx) {
  throw new Exception("This should never happen");
} else {
  ...
}

It sounds like the exception was raised in a later portion of the code, which converts the plan to an internal format for NAS. This part probable failed because the entry/exit pair doesn't specify a valid portion of the ICAO flight plan, but it's a shame we don't have more detail about how the actual exception occurred.

Clearly a better way to handle this specific logic error would be for FPRSA-R to identify and remove the message and avoid a critical exception. However, since flight data is safety critical information that is passed to ATCOs the system must be sure it is correct and could not do so in this case. It therefore stopped operating, avoiding any opportunity for incorrect data being passed to a controller. The change to the software will now remove the need for a critical exception to be raised in these specific circumstances.

Having raised a critical exception the FPRSA-R primary system wrote a log file into the system log. It then correctly placed itself into maintenance mode and the C&M system identified that the primary system was no longer available. In the event of a failure of a primary system the backup system is designed to take over processing seamlessly. In this instance the backup system took over processing flight plan messages. As is common in complex real-time systems the backup system software is located on separate hardware with separate power and data feeds.

Therefore, on taking over the duties of the primary server, the backup system applied the same logic to the flight plan with the same result. It subsequently raised its own critical exception, writing a log file into the system log and placed itself into maintenance mode.

At this point with both the primary and backup FPRSA-R sub-systems having failed safely the FPRSA-R was no longer able to automatically process flight plans. It required restoration to normal service through manual intervention. The entire process described above, from the point of receipt of the ADEXP message to both the primary and backup sub-systems moving into maintenance mode, took less than 20 seconds. 08:32 therefore marks the point at which the automatic processing of flight plans ceased and the 4 hour buffer to manual flight plan input commenced. The steps taken to restore the FPRSA-R sub-system are described in section 5 of this report.

Then support teams tried to fix things, but unfortunately it took longer than the 4 hours they had:

The 1st Line support team were alerted to the incident through the C&M systems that directly monitor operational systems as well as through direct feedback from the Operational teams using the FPRSA-R sub-system at the time. The initial response for the team followed standard recovery processes using the centralised C&M systems to restart the sub-system. Following multiple attempts to restore the service, which were unsuccessful, the 2nd Line engineering team was mobilised and supported the on-site engineers remotely via video link.

have you tried turning it off and on again?

As sharp-eyed reddit user voidstarcpp points out, it's interesting to contrast these two parts of the report:

However, since flight data is safety critical information that is passed to ATCOs the system must be sure it is correct and could not do so in this case. It therefore stopped operating, avoiding any opportunity for incorrect data being passed to a controller.

and

The initial response for the team followed standard recovery processes using the centralised C&M systems to restart the sub-system.

If the system crashed in such a way as to signal the possibility of a critical error, one that could send incorrect data to ATCOs if flight plan processing continued, then, since the team didn't know what was wrong, why did they try to restart processing flight plans?

The on-call teams working remotely with the on-site engineering teams followed a staged analysis, involving increasingly detailed procedures to attempt to resolve the issue, none of which were successful. As per standard escalation procedures, 2nd Line engineers were engaged to provide further access to advanced diagnostics and logging capabilities.

It doesn't say how long it took, but the manufacturer of the FPRSA-R system was eventually called:

Additional support was then requested from the Technical Design team and sub-system manufacturer as 1st and 2nd Line support had been unable to restore the service or identify the precise root cause, which was unusual. The manufacturer was able to offer further expertise including analysis of lower-level software logs which led to identification of the likely flight plan that had caused the software exception. Through understanding which flight plan had caused the incident the manufacturer was able to provide the precise sequence of actions necessary to recover the system in a controlled and safe manner.

The system was eventually restored, but unfortunately the knock-on effects by that point were already disastrous.

The manufacturer is an Austrian company, Frequentis AG:

An FPRSA sub-system has existed in NATS for many years and in 2018 the previous FPRSA sub- system was replaced with new hardware and software manufactured by Frequentis AG, one of the leading global ATC System providers. The manufacturer’s ATC products are operating in approximately 150 countries and they hold a world-leading position in aeronautical information management (AIM) and message handling systems.

The "Nobody ever gets fired for buying IBM" defence.

We can find a few job ads related to air traffic control systems at Frequentis AG on their careers page Programming languages used: Ada, C++, Java, Python, with Java being the most common. The code above sounds like it could have been written in any of these languages, but Ada would at least be safer than the others in other ways.

Thoughts

Things that went wrong:

  1. The software that processes flight plans (FPRSA-R) was written in a buggy way.
  2. The software and system are not properly tested.
  3. The FPRSA-R system has bad failure modes

The software was buggy

The software was incapable of extracting the UK portion of the ICAO flight plan, even though the flight plan was apparently valid (at least according to IFPS).

NATS officials are trying to spin this as:

An air traffic meltdown in Britain was caused by a "one in 15 million" event, the boss of traffic control provider NATS said, as initial findings showed how a single flight plan with two identically labelled markers caused the chaos.

"This was a one in 15 million chance. We've processed 15 million flight plans with this system up until this point and never seen this before," NATS CEO Martin Rolfe told the BBC, as airlines stepped up calls for compensation for the breakdown. Reuters

The system was put in place in 2018, so what Martin Rolfe is saying here is that this sort of thing only had a chance of occurring "once every 5 years", which is apparently an acceptable frequency for having a complete air traffic control meltdown.

The system was poorly tested

For a safety critical system, I would expect all the steps in the processing of the flight plan to be tested, and extracting the UK part of a flight plan is quite an important step. Testing failed to expose the bug in this case, maybe because the testing also did not consider duplicate waypoint names.

Fuzzing may have prevented this this bug/crash. By bombarding such a system with randomly generated flight plans, you can see if any of them cause bad failure modes: a crashed system where one doesn't know immediately what went wrong. By inspecting which sorts of flight plans cause problems, it would become apparent that those with duplicate waypoint identifiers in the ADEXP portion cannot be processed properly. I would assume all flight plans need to have their UK part extracted, so the faulty code is not buried deep in some branching code. So fuzzing would hit this code path, unless it is always blocked at some earlier point in the pipeline (by some validation code maybe). But this would be apparent to the testers: if no inputs make it all the way through the system, it's clear the system isn't being well tested.

The FPRSA-R system has bad failure modes

All systems can malfunction, so the important thing is that they malfunction in a good way and that those responsible are prepared for malfunctions.

A single flight plan caused a problem, and the entire FPRSA-R system crashed, which means no flight plans are being processed at all. If there is a problem with a single flight plan, it should be moved to a separate slower queue, for manual processing by humans. NATS acknowledges this in their "actions already undertaken or in progress":

The addition of specific message filters into the data flow between IFPS and FPRSA-R to filter out any flight plans that fit the conditions that caused the incident.

When FPRSA-R did crash, it did so in an obscure way. This is a system which processes flight plans, yet the relevant flight plan was only found in "lower-level software logs". If there is an error processing a flight plan, which brings down the whole system, a notification (including the flight plan) should immediately be sent to some monitoring team.

NATS was also not prepared for an FPRSA-R system failure. The 1st and 2nd Line support engineers were not able to locate the low-level log files. This has been fixed:

An operating instruction has been put in place to allow prompt recovery of the FPRSA-R sub-system if the same circumstances recur. Each of the technical operators have been trained to implement the new process. With enhanced monitoring in place, additional engineering expertise will also be present to oversee the activity.

But, the low-level log files inside the magic FPRSA-R box provided by Frequentis AG was not the only place where the flight plan existed. Indeed, when the primary crashed, the queue of flight plans was redirected to the backup, which failed in the exact same way on the exact same flight plan. It stands to reason that if the system that is meant to process flight plans has crashed, that might be because of a problematic flight plan. So they could have inspected these for anomalies, or manually removed some from the head of the queue, to see if this fixed the problem. In fact they did start processing the flights manually:

The fallback procedures for this incident include specific actions for the manual entry of flight plan data into the system and manual coordination of flights between sectors. In the event of a reversion to manual process, it is necessary to reduce the UK traffic flow by implementing air traffic control restrictions.

But maybe this process didn't simultaneously drain the flight plan queue that feeds FPRSA-R; if it did, then one of the attempts to restart the system (after some flight plans had been manually processed) would have succeeded.

Possible lack of formal verification

As reddit user DontWannaMissAFling points out:

But what's wild to me is that something as safety critical as air traffic control apparently isn't using proven techniques like formal verification, model checking to eliminate these classes of bugs entirely.

Like as an industry we use TLA+ to stop AWS from having downtime or Xboxes segfaulting, but not to keep planes in the air?

I agree that it certainly doesn't sound like any formal verification was used in this case (for this step, in this system), and the report doesn't mention anything. Using formal verification might have helped here.

But it's possible formal verification was used, but faulty code still made its way into production: end-2-end formal verification for large systems is still in its infancy. We'll have to wait for the result of the enquiry to know more.

Humans were kept safe at all times

I'd like to note (as does NATS in the report) that despite all the problems highlighted above, planes in the air over the UK were still safe at all times. They were being monitored by experienced ATCOs, which monitor planes by their known flight plan, radio, radar and vision. The consequence of all this was not that any human lives were put in danger, it's simply that far fewer flights could take off in the first place, or had to be diverted away from UK airspace. NATS did the right thing (reducing the number of flights), and kept everybody safe. And that's great.

Transparent reporting

Because ultimately these systems are about keeping humans safe from harm, it's also great that they released a report that was quite transparent and detailed. I think this is important for critical infrastructure.

The boss of Ryanair has slammed a report on the flights chaos seen over the bank holiday as "rubbish".

Michael O'Leary claimed the findings "downplay the impact on the aviation industry" and said the report was "full of excuses".

I don't agree with this, I think the report was quite factual, and I don't think it was in the scope of the initial report to analyse to what extent there had been failings at NATS.

How to code this properly

So, how can we avoid this bug?

(Note: This post explain how to code up a correct solution (in full knowledge of the duplicate waypoint issue). See this post for an explanation of how the bug might have been avoided even in the absence of such knowledge.)

Let's recap the problem. There are two sequences of waypoints:

Because the ICAO plan doesn't need to include the waypoints at which it enters/exits an air traffic control region, extracting the segment of the ICAO flight plan corresponding to the UK portion of the flight is not entirely trivial. Of course, if we take the entire ICAO flight plan, it already contains the UK portion, but what we really want is the smallest such segment. It's interesting to note here that a flight could possibly enter UK airspace, and then exit it again, and enter it again. We'll ignore this, that is, we will just find a single contiguous segment that contains all UK portions of the flight, since this is what the original code seemed to do.

I'm unsure why this task attempts to do this only using the ADEXP data, rather than consulting a database about how waypoints and flight segments intersect UK airspace. It seems strange, but let's move on.

Note that it is impossible to achieve this task with the ICAO flight plan alone (and no knowledge of routes), even if you know for each waypoint if it is in the UK or not. Indeed you could even be in a situation where none of the waypoints in the ICAO route are in the UK, for example when the flight plan clips a small portion of the UK between two of the ICAO waypoints.

So this is why the ADEXP waypoint list is used, and the assumption here, I assume, is that the ADEXP list contains all the waypoints, and that furthermore, waypoint granularity is such that if adjacent waypoints both don't intersect UK airspace, then the segment between them doesn't either.

The mistake of the faulty algorithm described above is to try to work on both the ICAO data and the ADEXP data as they are, maintaining pointers into each of them, updating them with vague and wrong invariants in the background of the programmer's mind. This is a recipe for bugs. Instead, the first thing to do is to reconcile the data and then carefully extract the UK portion from that.

So we create a data structure for a plan:

-- A flight plan, with segments between points 'p' via routes 'r'.
data Plan p r
  = End p
  | Leg p r (Plan p r)

(This is Haskell code, but the ideas apply to most languages. Full code is in this repo.)

This says that a Plan p r has either arrived at its destination End p, or consists of a segment starting from p, via r, and the rest of the plan: Leg p r rest.

We can now define all the sorts of flight plan data we will deal with:

type ICAO     p r = Plan p r         -- points and routes, no intermediate waypoints
type ADEXP    p   = Plan p [p]       -- points and intermediate waypoints, no route data
type Combined p r = Plan p (Via p r) -- all the data combined

data Via p r = Via
  { route   :: r,
    through :: [p]
  }
  deriving stock (Show)

Here Combined is our reconciled flight plan, it combined all the information form ICAO and ADEXP. We can project a Plan back down to ICAO or ADEXP:

projectICAO :: Combined p r -> ICAO p r
projectICAO = mapRoutes (.route)

projectADEXP :: Combined p r -> ADEXP p
projectADEXP = mapRoutes (.through)

mapRoutes :: (r -> r') -> Plan p r -> Plan p r'
mapRoutes _ (End p) = End p
mapRoutes f (Leg p r rest) = Leg p (f r) (mapRoutes f rest)

We'll assume we have already parsed the data into the data structures above. This is just a matter of reading the spec carefully and turning it into code, and hopefully something the FPRSA-R did correctly, though as noted previously it might be working on the text version directly.

Now we write our reconciliation function. For ICAO and ADEXP to reconcile, the start and end points must match. When reconciling a leg of a flight plan, a certain amount of waypoints can be skipped in the ICAO plan, and the rest of them must reconcile with the rest of the flight plan:

reconcile :: (Eq p) => ICAO p r -> [p] -> [Combined p r]
reconcile (End p) [p']             | p == p' = pure (End p)
reconcile (Leg p r rest) (p' : ps) | p == p' = do
  (through, restAdexp) <- splits ps
  recoRest <- reconcile rest restAdexp
  pure (Leg p Via {route = r, through} recoRest)
reconcile _ _ = []

-- All the ways to snap a list in two. That is, 'split xs' returns all pairs
-- '(ys, zs)' where 'xs == ys ++ zs'.
splits :: [a] -> [([a], [a])]
splits xs = zip (inits xs) (tails xs)

Note that the function produces all the possible reconciliations. This is because reconciliations are not necessarily unique because waypoints can appear more than once. By calculating all the possible reconciliations, we'll know if the data is ambiguous, and flag those flight plans for manual processing.

Next, we extract the UK portion of the flight plan. This is done in 3 steps:

  1. Remove all legs at the start which don't cross into UK airspace.
  2. Traverse the legs which are in UK airspace.
  3. Once the rest of the flight plan is never again in the UK, cut it short.

Each function calls the next step in sequence. Note that we return a NonUkPlan error when the system reaches the end of the plan without having found a UK part. By having a compiler which checks that pattern-matches are covering, the possible failures arise naturally while coding.

-- Extract the UK part of the flight.
ukSegment :: (p -> Bool) -> Combined p r -> Either Err (Combined p r)
ukSegment uk (End p)
  | nonUkPlan uk (End p) = Left NonUkPlan
  | otherwise = pure (End p)
ukSegment uk plan@(Leg _ _ rest) =
  if nonUkLeg uk plan
    then ukSegment uk rest
    else pure (flyUK uk plan)

-- Fly the UK part of the flight.
flyUK :: (p -> Bool) -> Combined p r -> Combined p r
flyUK _ (End end) = End end
flyUK uk (Leg p v rest)
  | nonUkPlan uk rest = Leg p v (afterUK rest)
  | otherwise = Leg p v (flyUK uk rest)

-- Skip the rest of the flight.
afterUK :: Combined p r -> Combined p r
afterUK plan = End (start plan)

These use some small functions:

-- The next leg of the journey doesn't fly through the UK.
nonUkLeg :: (p -> Bool) -> Combined p r -> Bool
nonUkLeg uk (End p) = not (uk p)
nonUkLeg uk (Leg p v _) = not (uk p) && not (any uk v.through)

-- The whole plan isn't in the UK.
nonUkPlan :: (a -> Bool) -> Combined a r -> Bool
nonUkPlan uk plan = all (nonUkLeg uk) (legs plan)

legs :: Plan p r -> [Plan p r]
legs (End p) = [End p]
legs plan@(Leg _ _ rest) = plan : legs rest

start :: Plan p r -> p
start (End p) = p
start (Leg p _ _) = p

Putting it all together, we get:

ukPartOfICAO :: (Eq p) => (p -> Bool) -> ICAO p r -> [p] -> Either Err (ICAO p r)
ukPartOfICAO uk icao adexp = case reconcile icao adexp of
  [plan] -> projectICAO <$> ukSegment uk plan
  []     -> Left CannotReconcileIcaoAdexp
  _      -> Left AmbiguousReconciliationsOfIcaoAdexp

We collected the following errors while coding:

data Err
  = NonUkPlan
  | CannotReconcileIcaoAdexp
  | AmbiguousReconciliationsOfIcaoAdexp

Let's test it with our example:

           4       2        8         5              1           9
ICAO:  F------Q--------T--------O-----------P---------------Y--------U

ADEXP: F   S  Q    C   T   A    O  E  X     P   W   B   Q   Y        U
                       UK  UK   UK UK UK    UK  UK
inUK = (`elem` ["T", "A", "O", "E", "X", "P", "W"])
icao = ("F", 4) ~> ("Q", 2) ~> ("T", 8) ~> ("O", 5) ~> ("P", 1) ~> ("Y", 9) ~> End "U"
adexp = ["F", "S", "Q", "C", "T", "A", "O", "E", "X", "P", "W", "B", "Q", "Y", "U"]

infixr 6 ~>
(~>) (p, r) = Leg p r

And try this at the REPL:

λ> ukPortionOfICAO inUK icao adexp
Right (Leg "T" 8 (Leg "O" 5 (Leg "P" 1 (End "Y"))))

We can see that this is the correct result:

                                 UK portion of ICAO
                       ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
           4       2        8         5              1           9
ICAO:  F------Q--------T--------O-----------P---------------Y--------U

ADEXP: F   S  Q    C   T   A    O  E  X     P   W   B   Q   Y        U
                       UK  UK   UK UK UK    UK  UK

The waypoint Q is a duplicate in the ADEXP list, but the system still returns the correct portion of the ICAO flight path. Crisis averted! The fact that there is a duplicate identifier in this case is immaterial, the ICAO and ADEXP data still reconcile unambiguously, and the correct sub-route is well-defined.

How large can flight plans get? Well here is a flight plan from London to Sydney that contains a total of 158 waypoints, and about a third of them appear in the ICAO route: London to Sydney flight plan ukPortionOfICAO returns practically instantly for such a flight plan.

Full code here.)

Edit 2023-09-13: So the actual flight plan has been identified!

Tonight we were wondering why nobody had identified the flight which caused the UK air traffic control crash so we worked it out. It was FBU (French Bee) 731 from LAX/KLAX to ORY/LFPO.

It passed two waypoints called DVL on its expanded flight plan: Devil's Lake, Wisconsin, US, and Deauville, Normandy, FR (an intermediate on airway UN859).

https://www.flightaware.com/live/flight/FBU731/history/20230828/0255Z/KLAX/LFPO

Credit to @marksteward and @benelsen for doing much of the legwork here.

It's flight plan was:

ORCKA5 LAS Q70 BLIPP Q842 WINEN MLF J107 OCS CZI DIK DVL VBI VESRU YXL ELVEL PIGLA 5800N/07000W MIBNO MAXAR 6200N/05000W 6200N/04000W 6000N/03000W 5800N/02000W SUNOT KESIX APSOV SOSIM L15 KEPAD L151 KIDLI UN859 GWC UN859 LGL A34 BOBSA

I don't know what the actual ADEXP data was, but the above says that DVL is the duplicate identifier, and that it would appear when ADEXP expands the UN859 route, near the end of the flight. So the example in the main text is consistent with this configuration:

Comments on reddit and Hacker News
Other posts: part two and part three.