Air disasters and software; not such a tenuous link

United 173

It’s ten past five in the evening of the 28th of December, 1978. In the skies above Portland International Airport, Oregon, there’s a DC-8, with 189 souls on board, coming in to land. As they lower the landing gear there’s a loud thump – both heard and felt in the cockpit. So it’s hardly a surprise when a couple of minutes later they reject Portland Approach’s request to continue their descent, and instead say they’ll climb back to 5,000 feet because “we got gear problem”.

Landing gear problems are surprisingly common, but a relatively safe fault to develop. Aircraft are designed explicitly to land without any undercarriage at all, and can do so perfectly safely.

So it’s quite a surprise when, after about an hour of normal radio transmissions, United 173 makes a Mayday call, and hits powerlines in suburban Portland a few seconds afterward, crashing into a wooded area of a residential district.

In the crash, 10 people were killed, and 24 seriously injured – the figures are remarkably low, helped in part because the 304th air rescue team were running an exercise at the time, and happened to be airborne with full equipment, out of Portland International.

Air Crashes

The notion that there’s a link between air disasters and software is at first glance a pretty tenuous one. The idea of actually doing a talk on the subject seems even more bizarre, especially at JSConf EU, a conference that’s well-known for impressive technical talks and live demonstrations. Needless to say, I was surprised myself when it was accepted, especially given my colleague, Lloyd Watkin, was also accepted. This post is really the prose form of my talk, and should probably be read alongside the slides. You’re welcome to take the slides from GitHub, along with this prose, and perform the talk yourself if you like.

I became fascinated with air accident investigation reports about fifteen years ago, when my then-CTO suggested they were excellent reading for software developers. He pointed out that aircraft are complex systems, and any complex system is essentially pretty similar. There are a bunch of components, and they interact in various ways. Software engineers, like aircraft designers, try to minimise the ways these components interact, and ensure their interactions are well understood. We both solve problems by using redundant components, too. Often, relatively minor incidents lie at the core of failures in both cases and cause chains of errors that sometimes result in serious incidents – and we both call those crashes.

But aircraft crashes – actually even relatively minor incidents – are scrutinised to an astonishing degree by a team of specialists, who look at all the factors involved and make recommendations for individual airlines and the industry as a whole. While it’d be great if every wobble of a software project generated the same level of investigation, we can still learn a lot from the reports these people have written.

United 173

So what happened to United 173? It turns out that with the full Cockpit Voice Recorder transcript and the telemetry from the aircraft, it’s made astonishingly plain. When the aircraft took off from Denver, it had 46,700 pounds of fuel. We know from the detailed flight plan and the telemetry that it had around 14,800 pounds left as it approached Portland. We also know that the burn rate when it was circling at 5,000 feet with the landing gear down would have been 220 pounds per minute.

This is consistent with the first report I’m aware of, when the Flight Engineer notes at 17:38 that there is 7,000 pounds remaining. He reports 5,000 a few minutes later at 17:46. Just five minutes afterward, the Captain tells the Flight Engineer to prepare a Landing Card based on 4,000 pounds of fuel on touchdown. Based on the burn rate, we know this is roughly what the aircraft has now; the Captain is clearly unaware of the fuel levels, and the Flight Engineer says they don’t have enough fuel for that – but the Captain dismisses him. Four minutes later, the First Officer – who’s actually flying the plane – requests the fuel level, and the Flight Engineer responds with 4,000 pounds. It’s now 17:55.
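As a rough sanity check on those figures, here is a minimal Python sketch of the arithmetic. It is my own illustration, not anything from the NTSB report: it assumes the burn rate stays constant at the 220 pounds per minute quoted above, it takes the Flight Engineer’s readings at face value, and the function names are made up for the example.

```python
from datetime import datetime, timedelta

# Burn rate quoted above for holding at 5,000 feet with the gear down.
# Treating it as constant is a simplifying assumption.
BURN_RATE_LB_PER_MIN = 220.0


def clock(hour, minute):
    """Build a timestamp on the day of the accident (28 December 1978)."""
    return datetime(1978, 12, 28, hour, minute)


def projected_fuel(reading_lb, reading_time, at_time,
                   burn_rate=BURN_RATE_LB_PER_MIN):
    """Estimate fuel remaining at `at_time`, given an earlier gauge reading."""
    elapsed_min = (at_time - reading_time).total_seconds() / 60
    return max(0.0, reading_lb - burn_rate * elapsed_min)


def projected_exhaustion(reading_lb, reading_time,
                         burn_rate=BURN_RATE_LB_PER_MIN):
    """Estimate when the tanks run dry, given a gauge reading."""
    return reading_time + timedelta(minutes=reading_lb / burn_rate)


if __name__ == "__main__":
    # 5,000 lb reported at 17:46; what is left five minutes later?
    print(projected_fuel(5000, clock(17, 46), clock(17, 51)))  # 3900.0

    # 7,000 lb reported at 17:38; when does it run out?
    print(projected_exhaustion(7000, clock(17, 38)).strftime("%H:%M"))  # 18:09
```

On those assumptions, the 17:46 reading projects to about 3,900 pounds at 17:51, which is why a plan to touch down with 4,000 pounds was already impossible. The 17:38 reading projects to empty tanks at about 18:09. The real gauges and burn rate were only approximate, but anyone doing this sum in the cockpit would have seen roughly half an hour of flying time left.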

Nearly ten minutes later, at 18:03, the Captain says to Portland Approach that he’ll be landing with “[…] about 4,000 well, make it 3,000 pounds of fuel”. Again, it’s clear that the Captain has no idea what the fuel level is, and hasn’t registered it’s going down fast. The First Officer has, however, and warns the Captain just three minutes later that they’re going to lose an engine. The Captain’s response seems mostly bewilderment, but a minute later he finally requests landing clearance from PDX.

By 18:09, the fuel level is a mere 1,000 pounds. The Flight Engineer is trying to keep the engines going by cross-feeding fuel between the tanks, but after four minutes the engines are starved, and cut out – and the plane crashes.

The Problem

The NTSB report for the accident blamed Captain Malburn McBroom for failing both to observe the fuel levels and to heed the warnings from his crew, though the report also noted a contributing factor of “crew failure”, as they felt that the crew weren’t adequately flagging the issue.

It noted, too, that “this accident exemplifies a recurring problem” – one of the investigators, Alan Diehl, had realised there was a common theme running through a number of accidents over a long period of time.

Captain McBroom was stripped of his license and never flew again. Flight Engineer Forrest “Frostie” Mendenhall was sadly killed on impact.

The NTSB report was certainly correct – there are a number of similar accidents, from the Tu-124 Neva River crash in 1963 to the 1972 Eastern 401 Florida Everglades crash and of course the largest crash in aviation history – the Tenerife Airport Disaster.

Tenerife

It’s reasonable to describe the events of the 27th March 1977 as one of the longest cascade errors in history. For our purposes, it begins at lunchtime that day, when Canary Islands separatists called to warn there was a bomb placed in the terminal building at Las Palmas. The bomb exploded, injuring only one person, but due to a second bomb threat the airport remained closed while it was thoroughly searched.

Aircraft in flight were diverted to Los Rodeos, Tenerife’s relatively tiny airport now called Tenerife North Airport. Due to its relatively high altitude and location, it regularly suffered from low clouds which would come across the airport at ground level.

By 6pm that evening, it was playing host to a considerable number of aircraft, which filled the main apron. Blocking them all was a 747 flying KL 4805, with 248 souls on board. During the wait for Las Palmas to reopen, this was being refuelled in order to reduce the waiting time at Las Palmas and fit with safety regulations – however the airport actually reopened during this refuel. Eventually, the aircraft was cleared to back-taxi, rolling all the way down the only runway, in order to position for take-off. The slides include diagrams adapted from one on Wikipedia.

Behind it was another 747, flying PA 1736, with 396 souls. Once the KLM aircraft had unblocked them, they too were cleared to back-taxi, but only down to exit 3. This would get them clear of the runway at the earliest point they could rejoin the congested taxiway running alongside the runway. As the Pan Am aircraft taxied down the runway, dense low cloud obscured nearly the entire length, and they missed exit 3 entirely, carrying on toward exit 4.

When the KLM aircraft had completed the 180-degree turn at the end of the runway, the Captain – an experienced 747 flight instructor called van Zanten – started moving the throttles. His First Officer immediately noticed and flagged the error, pointing out they didn’t have ATC clearance yet. They asked and had ATC clearance granted – but this is not take-off clearance. They repeated the instructions back, and added, “we are at take-off” – to which they heard only the response “OK”.

They did not hear – due to radio interference – the tower also saying “Stand by for take-off, I will call you”, nor did they hear a worried Pan Am response explaining they were still on the runway. Instead they heard only the Tower’s response to Pan Am, saying “Ah, papa alpha one seven three six report the runway clear.”

It’s hard to know if the KLM crew realised this was a request, and not the report itself. At least the Captain thought it was clear – the Flight Engineer queried it and was dismissed by the Captain (and possibly the First Officer too; the CVR is unclear). In any case, they were already moving down the runway. The Pan Am crew saw the other aircraft with only a very short time to react, putting on full throttle and trying to manoeuvre down exit 4. The KLM flight was by this time moving too fast to stop – instead, the Captain pulled back as hard as possible – dragging the tail along the ground in a “tailstrike” – and got airborne.

But it wasn’t enough – the landing gear and engines impacted the Pan Am flight, causing significant damage to both aircraft and arresting the KLM flight’s forward motion. The KLM aircraft disintegrated, its full fuel tanks causing a huge fire and its debris scattered along the runway, killing all on board.

Firecrews – as disorientated by the cloud as anyone else – initially thought only one aircraft was involved, assuming that the Pan Am aircraft was just another piece of burning debris. When they did realise the situation, 62 people had escaped, one being killed when she walked into the exhaust from the engines, still going at full power.

In total, 248 people on the KLM flight died, and 335 on the Pan Am flight.

Crew Resource Management

In both cases, the primary problem was a lack of what’s now called “Crew Resource Management” (CRM). This is, essentially, a set of doctrines and techniques for ensuring a high degree of effectiveness within a team, particularly in a crisis situation. It explicitly de-emphasises the leader as sole authority, instead promoting effective use of every member of the team. It consists of four key areas:

Situational Awareness

Situational Awareness is essentially what it sounds like – ensuring that everyone on the team understands the current situation, and ensuring that missing information about the situation is identified and obtained. In the case of United 173, this means ensuring that the entire crew were properly aware of the amount of flight time the fuel situation was allowing; in the case of KL 4805 at Tenerife, it means ensuring that they got positive information that they had take-off permission, and ensuring positive information on the location of PA 1736.

Planning and Decision Making

In CRM, planning is done at the team level, and not by the leader. Decisions are, of course, made by a single person, but after discussion by the team whenever possible.

Communications

Communication both within the team and outside is, of course, a vital component of any activity. In CRM, communicating problems is particularly important, but not communicating irrelevant information can often be as (if not more) important. When in United 173 the First Officer and Flight Engineer were discussing the fuel, they should have flagged it with the Captain and ensured they had a clear response that the Captain understood their concern. Instead, much of the cockpit discussion is over essentially trivial matters, such as whether or not to wear coats.

The communication in Tenerife was found to be so poor that flight crews no longer use the word “take-off” on radio – they now refer to “departure”, and the phrase is only used once, by the tower, when granting take-off permission.

Teamwork

A key factor if you have more than one person is effective division of labour. When there’s a crisis, it’s natural behaviour for everyone to seek to work on the problem – a common phrase used in CRM training now is “someone has to fly the plane”. However, it’s best if each member of the team has a defined role or set of tasks, and the others trust each other to get on with things, supporting each other through their own tasks.

An easy trap to fall into is to think of CRM and similar practices as “leadership” or “management” training, but it’s not – in any given team there can only be one leader, but everyone in the team has to practise effective teamwork.

After the United 173 crash, NASA gave a presentation to the aviation industry, and CRM rapidly became standard practice. United themselves started including it in their training regime in 1980, well before it became mandatory. This would turn out to have a dramatic effect.

United 232

There are hundreds of incidents which have resulted in no fatalities through CRM and CRM is practised daily in every airline cockpit in the world, so finding examples of where it’s worked is fairly uninteresting. A quick YouTube search for landing gear failure will find countless videos of aircraft landing perfectly well with damaged landing gear or no landing gear at all, for example.

But one accident in particular stands out as a case where an incident that should – by all that was known at the time – have resulted in close to 100% fatalities was essentially turned around by excellent Crew Resource Management.

On the 19th of July 1989 a three-engined DC-10 flying from Denver to Chicago was cruising normally when – at 15:16 local time – engine number two, mounted at the base of the vertical stabiliser, suffered a catastrophic turbine fan fracture, caused by a manufacturing fault. The tail was punctured by razor sharp fragments of metal in several places.

The crew reacted perfectly. Captain Al Haynes and Second Officer Dudley Dvorak worked through the checklist for engine failure. Step one was to reduce the throttle – but the lever was stuck solidly. This being the case, they cut the fuel supply, firewalling the engine from the rest of the system. After 14 seconds, the engine fire was under control.

The damage to the tailplane was severe, causing the aircraft to have a tendency to roll to the right – this would only be a problem, however, if they lost aileron control.

As Al Haynes finished working on the immediate problem, First Officer Bill Records called to him saying the aircraft wasn’t responding to his input. Al Haynes turned and saw the control yoke had been pulled all the way to counteract the rolling descent. Rapidly, the two pilots worked to correct the roll, discovering they had little or no hydraulics at all, and therefore not only no aileron, but no elevator, rudder, speed brakes, flaps, or wheel brakes.

They could operate the throttles of the two remaining engines to regain some control, but since the middle lever was immobile, this was best done by Dudley Dvorak, who also needed to talk to United’s maintenance crews to discover the correct procedure for such a case – it was substantially beyond their training. Al Haynes – who maintained his sense of humour throughout the flight – comments that “we didn’t do this thing on my last [performance check in a simulator].”

Dudley, in turn, was having an almost comical conversation with maintenance, who were understandably finding it difficult to grasp that all three independent hydraulic systems had failed – thought to be a probability of a billion to one.

After a few minutes, something almost out of a Hollywood script happened – a first class passenger walked in to help. He introduces himself as Denny Fitch. But unlike most passengers, he’s actually one of United’s DC-10 flight instructors, and has just finished giving a class of pilots a course on how to react to every emergency possible.

In particular, he was also very familiar with JAL 123.

Japan Airlines 123

In fairness, probably any flight instructor – maybe any pilot – knows the tale of JAL 123. It ranks as the most serious single-aircraft crash in history. The aircraft involved – a 747 Short Range variant – took off in the early evening of the 12th of August 1985 from Tokyo, Japan, packed with 524 people. 7 years previously, the same aircraft had been involved in another accident – a tailstrike on landing.

Unrecognised at the time was that the repair required to the aft bulkhead had been botched – a patch which didn’t quite fit was simply cut to size, rendering the repair suitable only for 10,000 compression cycles. By this flight, the aircraft had flown over 12,000 times, and this time, as the aircraft gained altitude, the bulkhead gave way explosively.

The air rushed through into the tail section, and literally blew the vertical stabiliser off the airframe – the rudder and much of the fin was simply gone. In addition, this caused – like United 232 would suffer 4 years later – total hydraulic loss.

The flight crew soon realised that their only choice was to control the plane using “differential thrust” – using the engines individually to turn and control the plane. But without a vertical stabiliser the plane was extremely close to uncontrollable. Despite their heroic efforts – which haven’t been replicated in a simulator – they were unable to keep the plane flying, and it crashed into mountainous and relatively remote terrain after 35 minutes.

Although helicopters found the crash, it was after dark, and there were no signs of survivors. Miraculously, there were in fact 4 survivors.

United 232

Back in the cockpit of United 232, Al Haynes clearly had no illusions about his own chances. While discussing evacuation plans with a member of the cabin crew, he comments “but I really have my doubts you’ll see us standing up, honey.” However, for much of the flight he retains his humour, commenting cheerfully, “Won’t this be a fun landing?”

They’d elected to land at Sioux City Gateway, a relatively small airport, whose runway 33 was suitable for landing a DC-10. The controller on duty, Kevin Bachmann, had relocated there from a larger airport to escape the relatively high stress – he was going to regret that choice today. When he commented that a particular course change he suggested would keep the aircraft away from the city, Al Haynes replied, “Whatever you do, keep us away from the city.”

Dudley Dvorak handed over throttle control to Denny Fitch, standing behind the two pilots, in part because Denny “[seemed] to know what he was doing”, and in part because this freed Dudley to work on communication with the maintenance teams – who in fact had nothing to offer. Al Haynes and Bill Records continued to fly the plane using the yoke and pedals, despite this having no effect – partly because there was always the hope that it might have some effect occasionally, though it never did, and partly I suspect as a way of handling the stress. Denny Fitch, now in the Flight Engineer’s chair, was in effect controlling the plane solo.

As the flight progressed, the crew discuss every aspect of the flight. An oft-cited example is the discussion about the landing gear. They unanimously agreed that – given it seemed they had a reasonable chance of making the airfield – the landing gear should be lowered, despite noting this would work against them if they were forced to ditch on a soft surface. They then discussed the two different methods of lowering the landing gear, electing to crank it down rather than letting it drop in order to release the outer ailerons, and, hopefully, a dribble of hydraulic fluid. Crew members flag potential problems that haven’t been raised – Bill Records notes stopping will be a problem, Dudley Dvorak checks the visible damage during a moment of relative calm, and so on.

As the aircraft approached Sioux City, it almost seemed that a miracle had taken place. Certainly Kevin Bachmann had become cautiously optimistic that the crew might actually land the plane. He grants formal landing clearance, “on any runway.” (Al Haynes responds with laughter – “you want to be particular and make it a runway, huh?”) He even handles a last-minute runway change to the long-closed runway 22 with aplomb, despite it being where the emergency equipment is currently waiting. “That’s a closed runway, sir, that’ll work, sir. We’re getting the equipment off the runway.”

The plane came in fast – a normal landing would be 140 knots, with a sink rate of 200 feet per minute. United 232 was coming in at 215 knots and 1,850 feet per minute – both increasing. Al Haynes called to Denny Fitch, “Pull the power back!” Denny responds with “No, I can’t pull them or we’ll lose it, that’s what’s turning you.” Al Haynes says OK.

50 meters from the ground, the right wing starts to drop, and someone shouts, “Left throttle, left, left, left, left…”

Someone exclaims, “God!”, and the CVR ends, with the sound of impact.

The plane’s right wing came into contact with the ground, closely followed by the right engine (number 3). The spilt fuel ignited almost immediately, and the impact slammed the tail of the aircraft into the ground, breaking it off. The change in weight caused the broken rear of the plane to rise, slamming the nose into the ground. The plane, now missing its right wing and tail, bounced along the runway, slewing right into cornfields. The cockpit section broke off, slamming first class into the ground, before it, too, broke clear. The mid-section, streaming dense smoke and flames, lazily rolled over, coming to rest upside down in a cloud of smoke and dust.

Kevin Bachmann sees the crash from the tower, goes downstairs, and in his own words, “had a little cry”.

But amazingly, though 112 people died in the crash – a third of these due to smoke inhalation – 184 survived.

And as for the crew – after 35 minutes the rescue teams noticed that a waist-high section of debris which they’d taken to be a section of nose-cone had a hand sticking out, and waving to them. The hand was Dudley Dvorak’s, and this crumpled mess held not only Dudley, but all four members of the flight crew. Three of them – Dudley Dvorak, Bill Records, and Al Haynes – would return to flight operations within just a few months – Al Haynes was flying by October that year. Denny Fitch was told by doctors that he would never fly again, due to nerve damage in his arm.

Just under a year later, Captain Dennis E Fitch took the controls of a DC-10 to take off from Honolulu, Hawai’i.

Crew Resource Management

It’s fair to say that the vast majority of developers and sysadmins will never have responsibility for other people’s lives – though of course some do. Our problems are generally more mundane, our responsibilities thankfully less. But even so, following the basic tenets of CRM – ensuring the team understands the situation, planning collectively and ensuring decisions are based on that, communicating effectively, and making the best use of each individual in the team – is vital both to avoid crises and to handle them when they do arise.