If you look closely, you can see a “NETWORK FAILURE” message among the departures. Failures can happen. I work in the IT area and everyday I have to deal with the concepts of Redundancy, Back Up, Storage, High Availability, Disaster Recovery, etc. What it is really strange in this case, is not the failure itself but the fact that the error message appears on the display. This is what I consider a dual mistake: a communication and a design error. Let me explain what I mean.
That message doesn’t contain any useful information for a passenger departing from the Spanish airport. It answer no question but creates confusion: since travellers are not aware of the type of failure, they don’t know if the message refers to something within the display (is the airport network down? are departures affected by some kind of routes network problem? etc.) or outside it (the source of the information displayed at the Terminal B). Under an Security point of view, providing that message is risky too: if the failure is the consequence of a hacker attack, giving him the confirmation that the hack was succesfull is not a clever idea. Next time he could achieve a DoS (Denial Of Service) basing on the first successful attack. So, programmers, LAN and IT managers at the airports should prevent some error messages to be broadcasted.
Under a design point of view, a network failure is a symptom that something in the “chain” has failed: there was a Single Point of Failure (SPF), the Business Continuity Plan (BCP) did not succeed, the Back Up plan did not work, the configuration was not correctly implemented, the Hardware was obsolete or at full capacity, etc. There can be many reasons for a failure (or a network one). For sure, they must be avoided, especially if the network is used to trasmit mission critical information: in this case, a fault can be catastrophic. Risk Management should be performed, in order to assess those assets that must be hardened, to mitigate the risk of loss or deterioration of the assets, and to monitor the risk in accordance with a particular metric in order to keep it to an “acceptable level”. Even if the flying operations and the Air Traffic Control are those fields where Aviation Safety focus more often, the IT department of an airport must be seriously taken in consideration. Even if applying effective countermeasures and contingency plans can cost a lot, underestimate the damage that can be inflicted by a poorly maintained Local Area Network or Hardware Component could lead to a disaster. A few examples: On Apr. 20, 2002 a power supply problem makes the Rome Fiumicino Tower mute betweek 4.40 and 5.20 LT. On Mar 16, 2003, a network failure causes a radar black out at Rome ACC based in Ciampino around 22.00LT: all intercontinental flights to Fiumicino are diverted to Malpensa, Rome Radar switches to procedural control and take off are blocked until midnight. On Aug. 2007 a malfunctioning NIC (Network Interface Card), which allowed computer to interconnect to the LAN (Local Area Network), on a single desktop computer of the immigration control in the Tom Bradley International Terminal at LAX, experiences a failure. A total system failure affecting other computer of the same immigration system occurs at 14.00LT and lasts some 9 hours. All international flights are delayed by some hours. Thousands passengers have to wait for hours at the airport. A second outage on the Customs systems is caused by a power supply failure. Customs computers with a life of about 4 years were at their four-year phase and had to be replaced. In July 2008, a failure of the Dublin airport radar system causes fear and many grouned flights. Tracks vanish from the controllers’ radar screens. The first failure lasts 10 minutes, the second time the controllers have to close the airport to all inbound flights. As a consequence, 200 flights are delayed, diverted or cancelled. Ryanair, that is the main airport’s user, claims that more than 13.000 passengers are affected with a cost to the airline of about 1 million GBP. The shutdown was caused by a faulty network interface card (once again) but was actually a double fault, since the LAN recovery failed too. The following is an excerpt of an interesting article on the Dublin event published by the Irish Times (http://www.irishtimes.com/newspaper/ireland/2008/0920/1221835128140.html):
When it subsequently emerged that there had been a series of faults in the radar system since June 2nd, Ryanair called on the Department of Transport “and Ireland’s useless aviation regulator” to explain why there was no contingency plan for the repeated IAA computer system failures at Dublin airport.
Aer Lingus chief executive Dermot Mannion suggested that a back-up system may be needed if the upheaval was not to repeat itself, but industry sources said a back-up system would cost as much to install as an initial system.
However, yesterday’s Report of the Irish Aviation Authority into the ATM System Malfunction at Dublin Airport maintained that while “worldwide, air navigation service providers cannot rule out the possibility of failures” the IAA was “confident that the measures recommended by the system supplier Thales ATM and now being implemented will minimise the effect of a recurrence of like or similar failures of its ATM system in the future”.
The report revealed that the root cause of the failures at Dublin airport was a faulty network interface card and that all of the Dublin failures had the same root cause.
It concluded that the failure was not “a single point of failure” but was caused by a double failure – a hardware failure of the network interface card and a failure of the local area network recovery mechanism.
The IAA said the system had been “stable” since July 9th and added: “IAA engineering, air traffic control, safety, support and management staff worked around the clock to resolve the issues as quickly as possible.”
Thales ATM, suppliers of the radar system at Dublin airport, recommended:
• That additional network monitoring be undertaken. Monitoring tools and a “passive analyser” should be installed for the early identification of any similar malfunctions. This work has been completed.
• That a software programme to protect the local area network recovery mechanism be developed. This programme is currently being tested.
• That changes in procedures in relation to hardware testing be made before insertion in the operational system. These changes have been implemented.
• Thales ATM is also studying other potential improvements in the network design to prevent a recurrence.
• A spokeswoman for the IAA said it and Thales ATM had jointly supplied engineers to work on the problem. While it did not expect to have its costs refunded by Thales ATM, neither did it expect a bill from the company for its time.