This is part 3 of my blog posts about the UK air traffic control meltdown. You might want to read part one and part two.
Having had some time to process some of the comments about my first post, I realise now that I was not clear in communicating my thoughts about the programming style, and how that contributed to the existence of the bug. I will have another go now. The point I was trying to make was:
With a better programming style, the bug might not have been made even if the code is only meant for flight plans with no duplicate waypoints. Of course, I also agree that better specs/requirements would have helped too.
The original post attracted comments of the sort:
- What is even meant by "imperative" programming? How is it relevant to the rest of the post?
- Why blame the bug on programming style, when it is just that the original code was not aware of the possibility of waypoints with duplicate names? The original algorithm (probably) works fine on flight plans with no duplicate waypoints.
What I hope to convince you of in this post is that the programming style does matter: the bug might not have occurred when using "domain structured programming", even if the programmer was not aware of the duplicate waypoint issue. And the ideas behind this sort of programming style have been around for decades.
Domain structured programming
In the original version of my notes on the bug, I used the word "imperative" to describe what I thought the original code might look like. Using this word was a mistake. What I was really meant was that it was not "domain structured programming", by which I roughly mean:
- representing domain objects with data-structures as faithfully as possible,
- operating on them with functions that make sense in the domain, and that are known to preserve certain invariants,
- building up a solution by composing such functions.
By programming in this style the various manipulations by the program all make sense in the domain, and are thus less likely to "go wrong", especially for reasons that are irrelevant to the problem at hand.
When invariants are expressed in terms of the domain the "modelling gap" is small. Programs not written in this style have invariants too, but because the programming objects don't correspond to the domain objects, the invariants are harder to formulate and reasoning about them is more brittle. Compare:
(1) Cutting a flight plan short at a point where the rest of the plan is not in the UK results in a flight plan that fully contains the UK part.
and
(2) Iterating rightwards in the ADEXP list from the known UK exit point until a match is found in the ICAO plan produces the index marking the end of the UK part.
Statement (1) is in domain structured programming style, it is obvious and unconditionally true. Note that I'm not just referring here to the fact that one can use formal verification to prove that some formalisation of this statement is true, I'm saying that this is an obviously true statement in the domain of air traffic control.
Statement (2) refers to objects that are domain-irrelevant, such as array indices. Judging its validity requires thought and it breaks down in the presence of duplicate waypoint names.
I think what I'm advocating for here is basically what you get when you mix
- Structured programming, which advocates using the correct data-structures for solving a problem, and structuring the code in analogy with those data structures,
- Domain-driven design, which argues that the language of the domain should be used in the programming source code1,
- Parse, don’t validate, or "Type-driven design", which argues that properties/invariants should be represented by proper data-strctures ("make illegal state unrepresentable"), rather than merely validated at the boundaries.
Programming in this style will lead to code that is more robust, because its correctness is based on the logical structure of the domain. The code is also less affected by unknowns (e.g. possibility of duplicate waypoint names), because the correctness of the code is based on known facts about the domain.
A smaller modelling gap also ensures that errors, should they occur, are usually "closer"2 to the route cause, and this can make failures less serious.
Another look at the problem
Let us from now on assume that we do not know that a flight plan can contain duplicate waypoints and that they are unique, and look at the problem anew.
We are given two versions of the flight plan, ICAO and ADEXP, each containing information the other does not have. Our goal is to extract the UK part of the ICAO flight plan, using data from the ADEXP plan.
One possibility, which sounds like what the original code did, is to keep the
data as it is and perform various operations until we are satisfied we have
found two indices start_index
and end_index
which delimit the UK portion of
the flight plan. The process is not manipulating domain objects, it is
manipulating array indices. The end result is an end_index
value for which
end_index < start_index
. This doesn't correspond to a sub-flight-plan at all.
The code has "gone wrong", and not noticed: the invariant that the pair of
indices represent a valid sub-plan has been violated. Noticing that something
has gone wrong is important: the closer you are to the root cause, the more
likely the exception will contain the information that is relevant to those who
are dealing with the crash. It may even have failed gracefully at this point.
The "domain structured programming" way to proceed is to perform operations on a flight plan, one that unifies the information from the ICAO and ADEXP plans. This therefore involves a first step which works out how the ICAO and ADEXP plans "match up".
In the original post this was achieved by a function reconcile :: ICAO p r -> [p] -> [Combined p r]
. If waypoint names are unique then plans can only match
up in at most one way, so I might3 have defined something like this instead:
reconcile' :: (Eq p) => ICAO p r -> [p] -> Maybe (Combined p r)
reconcile' (End p) [p'] | p == p' = pure (End p)
reconcile' (Leg p r rest) (p' : ps) | p == p' = do
let (skipped, restAdexp) = span (\x -> x /= start rest) ps
recoRest <- reconcile' rest restAdexp
pure (Leg p Via {route = r, through = skipped} recoRest)
reconcile' _ _ = Nothing
This function is simpler than reconcile
from the original post and does the
obvious thing: when reconciling a leg of the ICAO plan with the ADEXP list, it
will skip all the ADEXP points that are not equal to the next point in the ICAO
plan, and then reconciles the rest of the plan recursively.
This function is an example of "Parse, don’t
validate".
Parsing is how semantically meaningful domain values are produced from less
meaningful data, and so is often a necessary first step in structuring the
solution in terms of the domain. In fact I wouldn't be surprised if the original
code violated precisely this slogan, having code that validates that the plans
are consistent, but not parsing them into a reconciled data structure. This
validation code would look much like the code of reconcile'
above, but just
performing checks and not returning a value.
The function reconcile'
is wrong because it doesn't consider the possibility
of duplicate waypoints, but much less wrong than the original bug because of the
programming style. It will either:
- return
Nothing
because it can't create a valid domain object (a flight plan), in which case it knows it should now fail gracefully, or, - succeed correctly, or,
- succeed incorrectly: but this is now made much more unlikely (see below) and far more understandable: the input data is ambiguous.
The code for extracting the UK part of the plan is the same as in the original
post, this is quite simple once we have the version of the flight plan which
combines all the information we need. All that changes is pattern matching on a
Maybe
-value instead of a list:
ukPartOfICAO' :: (Eq p) => (p -> Bool) -> ICAO p r -> [p] -> Either Err (ICAO p r)
ukPartOfICAO' uk icao adexp = case reconcile' icao adexp of
Just plan -> projectICAO <$> ukSegment uk plan
_ -> Left CannotReconcileIcaoAdexp
For the original example this version of the code simply works correctly:
λ> ukPortionOfICAO' inUK icao adexp
Right (Leg "T" 8 (Leg "O" 5 (Leg "P" 1 (End "Y"))))
UK portion of ICAO
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫
4 2 8 5 1 9
ICAO: F------Q--------T--------O-----------P---------------Y--------U
ADEXP: F S Q C T A O E X P W B Q Y U
UK UK UK UK UK UK UK
The bug would therefore have been completely avoided even though the spec might not have mentioned the duplicate waypoint issue. The reason it still works is that the the duplicate waypoint doesn't get in the way of reconciling the inputs, it is just irrelevant.
The reconciliation is also unique in this case, so the processing of the flight
plan is completely correct. The code here differs from the code of the previous
post only when there are multiple reconciliations: the first post has code which
fails with an error, whereas this code will use the first reconciliation (which
might be the wrong one). Since reconcile'
settles on the first matching
waypoint it is trying to find, to have gone wrong it would have to match some
w
in the ICAO list with w1
when it should have matched it with w2
, for two
waypoints w1
and w2
in the ADEXP list, appearing in that order. This would
mean that there is a leg of the ICAO flight plan which skips over w1
and ends
at w2
. Since waypoints with duplicate names are supposed to be very far apart,
and flight plan legs are not supposed to be over very long distances, this
shouldn't happen. Indeed this requirement is in place precisely to avoid flight
plans being ambiguous.
Full code here.
To summarise:
This coding style represents the objects of the domain
with data-structures. It operates on these values with semantically meaningful
operations, e.g. "removing the first leg of a flight plan results in a flight
plan". It's correctness is based on reasoning patterns from the domain. So in
the end you get a valid flight plan, not indices into a list which doesn't mean
anything.
In my opinion, it is important that we slowly move towards such a coding style. I would hope to see this style used more and more in industry, especially in critical software.
Notes:
- It's possible to code in this style in most languages.
- It's possible to not code in this style in all languages.
- In my opinion, some languages lend themselves more to this style than others.
- This style most certainly doesn't prevent all bugs, just some of them.
- There are many other techniques/tools to prevent bugs.
- Static types are not a prerequisite to this style (though they do help in my opinion). I've seen these ideas espoused by the Clojure community, for example.
Many thanks to Arnaud Spiwack and Csongor Kiss for providing feedback on a draft of this post.
I was tempted to just refer to the style I am advocating for as "Domain-Driven Design" (DDD), but I don't think this term, as it has come to be understood, is referring to quite the same idea. DDD, from what I've seen, is more about structuring interfaces, for example naming methods. I'm not sure "DDD" really prescribes how those methods should be written, in a more technical sense.
Closer conceptually, and in terms of the call graph/code. Failing close to the root cause is why assertions are useful for example, even though this can be counter-intuitive at first ("why instruct the progam to fail more?"). This is also why JavaScript can be so hard to debug: it does its damdest to not fail at all at every turn.
Of course I know it's impossible for me to know what I would have coded had someone not shouted "DUPLICATE WAYPOINTS" in my ear, but I am trying my best.