Domain structured programming (UK ATC meltdown)

11 minute read Published: 2023-09-15

This is part 3 of my blog posts about the UK air traffic control meltdown. You might want to read part one and part two.

Having had some time to process some of the comments about my first post, I realise now that I was not clear in communicating my thoughts about the programming style, and how that contributed to the existence of the bug. I will have another go now. The point I was trying to make was:

With a better programming style, the bug might not have been made even if the code is only meant for flight plans with no duplicate waypoints. Of course, I also agree that better specs/requirements would have helped too.

The original post attracted comments of the sort:

What I hope to convince you of in this post is that the programming style does matter: the bug might not have occurred when using "domain structured programming", even if the programmer was not aware of the duplicate waypoint issue. And the ideas behind this sort of programming style have been around for decades.

Domain structured programming

In the original version of my notes on the bug, I used the word "imperative" to describe what I thought the original code might look like. Using this word was a mistake. What I was really meant was that it was not "domain structured programming", by which I roughly mean:

By programming in this style the various manipulations by the program all make sense in the domain, and are thus less likely to "go wrong", especially for reasons that are irrelevant to the problem at hand.

When invariants are expressed in terms of the domain the "modelling gap" is small. Programs not written in this style have invariants too, but because the programming objects don't correspond to the domain objects, the invariants are harder to formulate and reasoning about them is more brittle. Compare:

(1) Cutting a flight plan short at a point where the rest of the plan is not in the UK results in a flight plan that fully contains the UK part.

and

(2) Iterating rightwards in the ADEXP list from the known UK exit point until a match is found in the ICAO plan produces the index marking the end of the UK part.

Statement (1) is in domain structured programming style, it is obvious and unconditionally true. Note that I'm not just referring here to the fact that one can use formal verification to prove that some formalisation of this statement is true, I'm saying that this is an obviously true statement in the domain of air traffic control.

Statement (2) refers to objects that are domain-irrelevant, such as array indices. Judging its validity requires thought and it breaks down in the presence of duplicate waypoint names.

I think what I'm advocating for here is basically what you get when you mix

Programming in this style will lead to code that is more robust, because its correctness is based on the logical structure of the domain. The code is also less affected by unknowns (e.g. possibility of duplicate waypoint names), because the correctness of the code is based on known facts about the domain.

A smaller modelling gap also ensures that errors, should they occur, are usually "closer"2 to the route cause, and this can make failures less serious.

Another look at the problem

Let us from now on assume that we do not know that a flight plan can contain duplicate waypoints and that they are unique, and look at the problem anew.

We are given two versions of the flight plan, ICAO and ADEXP, each containing information the other does not have. Our goal is to extract the UK part of the ICAO flight plan, using data from the ADEXP plan.

One possibility, which sounds like what the original code did, is to keep the data as it is and perform various operations until we are satisfied we have found two indices start_index and end_index which delimit the UK portion of the flight plan. The process is not manipulating domain objects, it is manipulating array indices. The end result is an end_index value for which end_index < start_index. This doesn't correspond to a sub-flight-plan at all. The code has "gone wrong", and not noticed: the invariant that the pair of indices represent a valid sub-plan has been violated. Noticing that something has gone wrong is important: the closer you are to the root cause, the more likely the exception will contain the information that is relevant to those who are dealing with the crash. It may even have failed gracefully at this point.

The "domain structured programming" way to proceed is to perform operations on a flight plan, one that unifies the information from the ICAO and ADEXP plans. This therefore involves a first step which works out how the ICAO and ADEXP plans "match up".

In the original post this was achieved by a function reconcile :: ICAO p r -> [p] -> [Combined p r]. If waypoint names are unique then plans can only match up in at most one way, so I might3 have defined something like this instead:

reconcile' :: (Eq p) => ICAO p r -> [p] -> Maybe (Combined p r)
reconcile' (End p) [p'] | p == p' = pure (End p)
reconcile' (Leg p r rest) (p' : ps) | p == p' = do
  let (skipped, restAdexp) = span (\x -> x /= start rest) ps
  recoRest <- reconcile' rest restAdexp
  pure (Leg p Via {route = r, through = skipped} recoRest)
reconcile' _ _ = Nothing

This function is simpler than reconcile from the original post and does the obvious thing: when reconciling a leg of the ICAO plan with the ADEXP list, it will skip all the ADEXP points that are not equal to the next point in the ICAO plan, and then reconciles the rest of the plan recursively.

This function is an example of "Parse, don’t validate". Parsing is how semantically meaningful domain values are produced from less meaningful data, and so is often a necessary first step in structuring the solution in terms of the domain. In fact I wouldn't be surprised if the original code violated precisely this slogan, having code that validates that the plans are consistent, but not parsing them into a reconciled data structure. This validation code would look much like the code of reconcile' above, but just performing checks and not returning a value.

The function reconcile' is wrong because it doesn't consider the possibility of duplicate waypoints, but much less wrong than the original bug because of the programming style. It will either:

The code for extracting the UK part of the plan is the same as in the original post, this is quite simple once we have the version of the flight plan which combines all the information we need. All that changes is pattern matching on a Maybe-value instead of a list:

ukPartOfICAO' :: (Eq p) => (p -> Bool) -> ICAO p r -> [p] -> Either Err (ICAO p r)
ukPartOfICAO' uk icao adexp = case reconcile' icao adexp of
  Just plan -> projectICAO <$> ukSegment uk plan
  _ -> Left CannotReconcileIcaoAdexp

For the original example this version of the code simply works correctly:

λ> ukPortionOfICAO' inUK icao adexp
Right (Leg "T" 8 (Leg "O" 5 (Leg "P" 1 (End "Y"))))
                                 UK portion of ICAO
                       ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
           4       2        8         5              1           9
ICAO:  F------Q--------T--------O-----------P---------------Y--------U

ADEXP: F   S  Q    C   T   A    O  E  X     P   W   B   Q   Y        U
                       UK  UK   UK UK UK    UK  UK

The bug would therefore have been completely avoided even though the spec might not have mentioned the duplicate waypoint issue. The reason it still works is that the the duplicate waypoint doesn't get in the way of reconciling the inputs, it is just irrelevant.

The reconciliation is also unique in this case, so the processing of the flight plan is completely correct. The code here differs from the code of the previous post only when there are multiple reconciliations: the first post has code which fails with an error, whereas this code will use the first reconciliation (which might be the wrong one). Since reconcile' settles on the first matching waypoint it is trying to find, to have gone wrong it would have to match some w in the ICAO list with w1 when it should have matched it with w2, for two waypoints w1 and w2 in the ADEXP list, appearing in that order. This would mean that there is a leg of the ICAO flight plan which skips over w1 and ends at w2. Since waypoints with duplicate names are supposed to be very far apart, and flight plan legs are not supposed to be over very long distances, this shouldn't happen. Indeed this requirement is in place precisely to avoid flight plans being ambiguous.

Full code here.

To summarise:
This coding style represents the objects of the domain with data-structures. It operates on these values with semantically meaningful operations, e.g. "removing the first leg of a flight plan results in a flight plan". It's correctness is based on reasoning patterns from the domain. So in the end you get a valid flight plan, not indices into a list which doesn't mean anything.

In my opinion, it is important that we slowly move towards such a coding style. I would hope to see this style used more and more in industry, especially in critical software.

Notes:


Many thanks to Arnaud Spiwack and Csongor Kiss for providing feedback on a draft of this post.


1

I was tempted to just refer to the style I am advocating for as "Domain-Driven Design" (DDD), but I don't think this term, as it has come to be understood, is referring to quite the same idea. DDD, from what I've seen, is more about structuring interfaces, for example naming methods. I'm not sure "DDD" really prescribes how those methods should be written, in a more technical sense.

2

Closer conceptually, and in terms of the call graph/code. Failing close to the root cause is why assertions are useful for example, even though this can be counter-intuitive at first ("why instruct the progam to fail more?"). This is also why JavaScript can be so hard to debug: it does its damdest to not fail at all at every turn.

3

Of course I know it's impossible for me to know what I would have coded had someone not shouted "DUPLICATE WAYPOINTS" in my ear, but I am trying my best.