Author Topic: British Airways' Computer Failure being blamed on an Electrical Engineer ...  (Read 13430 times)


Offline Avacee (Topic starter)

  • Supporter
  • ****
  • Posts: 299
  • Country: gb
Apologies if there's already a thread going about this ....

http://www.bbc.co.uk/news/business-40159202

--- Quote ---
British Airways' parent company has confirmed that human error caused an IT meltdown that left 75,000 passengers stranded over the Bank Holiday.
Willie Walsh, boss of IAG, said that an electrical engineer disconnected the uninterruptible power supply which shut down BA's data centre.
--- End ---

So crap DR plan, crap failover, crap resilience, crap electrical architecture .. and what would have happened if said UPS had failed during normal operation?
A design that allows the unplugging of one thing to cripple a company for nearly 3 days .. just wow :p

And I wonder how many people will be fired.....

« Last Edit: June 05, 2017, 06:52:57 pm by Avacee »
 
The following users thanked this post: Koen

Offline fourtytwo42

  • Super Contributor
  • ***
  • Posts: 1185
  • Country: gb
  • Interested in all things green/ECO NOT political
Frankly, if anybody believes their face-saving story I shall be surprised. I hope their insurers wring the truth out of them and the associated managers/board members are brought to book, but that would be utopia!
 
The following users thanked this post: Electro Detective

Online MK14

  • Super Contributor
  • ***
  • Posts: 4536
  • Country: gb
http://www.theregister.co.uk/2017/06/05/british_airways_critical_path_analysis/

tl;dr
Badly (or terribly) designed IT system(s), too little or poor testing, a partly or massively cut-back (sacked/made redundant) UK workforce, work outsourced to India, by possibly incompetent top management.

So what happened was no surprise really.
« Last Edit: June 05, 2017, 07:31:31 pm by MK14 »
 

Offline Tom45

  • Frequent Contributor
  • **
  • Posts: 556
  • Country: us
"boss of IAG, said that an electrical engineer disconnected the uninterruptible power supply which shut down BA's data centre"

What else could the boss say?

"It failed because of our management's lack of planning and testing for reliability" isn't ever going to be said by anyone in management.
 

Offline stj

  • Super Contributor
  • ***
  • Posts: 2155
  • Country: gb
That's funny,
the initial story was that the staff in the Indian data centre didn't know how to switch over to the backup server because of a lack of training!!!
 

Offline julian1

  • Frequent Contributor
  • **
  • Posts: 735
  • Country: au
I think the article has been edited. It's now simply "engineer":

"boss of IAG, said an engineer disconnected a power supply, with the major damage caused by a surge when it was reconnected."
 

Offline madires

  • Super Contributor
  • ***
  • Posts: 7764
  • Country: de
  • A qualified hobbyist ;)
No hot-standby data center? It would be less expensive than the outage. :palm:
 

Offline Avacee (Topic starter)

  • Supporter
  • ****
  • Posts: 299
  • Country: gb
BA does have a secondary Data Centre. It's about 1 mile away from the main DC.
The problem was not only disconnecting the UPS but then turning the power back on in the wrong order.
It appears some files were corrupted during the incorrect restart procedure and these were then sync'd to the backup site, taking that down too.  :palm:
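That's the sort of thing even a crude integrity check in the sync job might have caught. A rough Python sketch of the idea (paths and file names are made up, and real replication products do this very differently): only push a file to the standby site if it still matches a known-good checksum.

Code: [Select]
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    # Checksum of a file's contents, read in chunks.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def replicate(src: Path, dst: Path, known_good: dict) -> None:
    # Copy each file to the standby site only if it still matches the
    # last verified-good checksum; anything else is skipped and reported.
    for name, good in known_good.items():
        f = src / name
        if not f.exists() or sha256(f) != good:
            print("NOT replicating %s: failed integrity check" % name)
            continue
        shutil.copy2(f, dst / name)

# Hypothetical usage; 'known_good' would come from the last verified backup.
# replicate(Path("/primary/data"), Path("/standby/data"),
#           {"bookings.dat": "ab12..."})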
 

Offline Rick Law

  • Super Contributor
  • ***
  • Posts: 3441
  • Country: us
I see some job openings at BA due to impending departures...

"Never take a job where success is defined by the lack of failures."

(Not a quote from someone famous.  Just my thoughts...)
 

Offline Jr460

  • Regular Contributor
  • *
  • Posts: 142
We once had the transfer switch for the UPS blow up. I never went down to the basement to see it; I was too busy making sure things came back up clean once some bypass switches were thrown.

Doing DR is hard.

We do all kinds of things in the servers to withstand faults: RAID or mirrored boot disks, redundant power supplies from different feeds, ECC memory, OSes that can disable a failed CPU and continue, hot-swap IO cards, redundant networking links to different switches, redundant Fibre Channel with two completely separate SANs, and clustering to allow processes to fail over to another cluster member. Oh, and the same stuff at a parallel site. The list goes on. However, one stupid move by an application writer can make all of it a moot point.

We had a new app coming in. It pulled data from a message queue, processed it, wrote some things to a database, and then sent results out to a different queue. The database could be set up with real-time replication, so one would think the app would run active/active. Yes, they said, but because of licensing it would be more like active/standby. I asked the stupid question of what needs to be done to switch active sites. "Oh, easy: stop the process on the east machine, copy a few data files over to the west machine, and then start the process on the west machine."

I asked, "What if the east site is a smoking hole in the ground, can I get those files from a backup?" Now the designer of the app starts looking a bit nervous. Seems they kept active state in those few data files, and losing them would mean lots of man-hours to get things back in sync. I then had to ask, "You have database connections, why not keep that state in a replicated table?" Seems they had never thought of that, since they developed and tested on an offsite PC sitting under someone's desk with no database. But now they would have to re-code parts of the app for that to happen, and nobody had the budget for it.   |O

All of our slick fault handling, undone first by laziness and then by budget.
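For what it's worth, the replicated-table fix is only a few lines of work. A rough Python sketch (sqlite3 is used here purely as a stand-in for whatever replicated database the site pair already runs; the table and function names are made up): keep the active state in a table the database replicates, so failover has nothing to copy off a box that may no longer exist.

Code: [Select]
import json
import sqlite3

# Sketch only: sqlite3 stands in for the real replicated database.
def save_state(conn: sqlite3.Connection, app: str, state: dict) -> None:
    # Upsert the app's active state so the replica at the other site has it too.
    conn.execute("CREATE TABLE IF NOT EXISTS app_state "
                 "(app TEXT PRIMARY KEY, state TEXT)")
    conn.execute("INSERT INTO app_state (app, state) VALUES (?, ?) "
                 "ON CONFLICT(app) DO UPDATE SET state = excluded.state",
                 (app, json.dumps(state)))
    conn.commit()

def load_state(conn: sqlite3.Connection, app: str) -> dict:
    # On failover the standby site just reads its replica of the table;
    # there are no local files to fetch from the dead primary.
    row = conn.execute("SELECT state FROM app_state WHERE app = ?",
                       (app,)).fetchone()
    return json.loads(row[0]) if row else {}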
 

Offline boffin

  • Supporter
  • ****
  • Posts: 1027
  • Country: ca
There was a pretty big power outage in downtown Vancouver, Canada about 10 years ago; it took out about 1/3 of the city core. One of the largest data centres downtown fired up their generator, only to find out that there was no water pressure to water-cool it, as the failure had taken out the city's downtown pumps too. The generator lasted about 15 minutes until it overheated.
 

Offline stj

  • Super Contributor
  • ***
  • Posts: 2155
  • Country: gb
That was actually predictable:
no power = no pumping stations.  :palm:

Somebody should have been sacked!
 

Offline floobydust

  • Super Contributor
  • ***
  • Posts: 6979
  • Country: ca
Nobody wants to test a big offline UPS, too scary. Everything is running fine, who wants to pull the plug and see if the UPS transfers and runs?

I worked in a power plant; the switchgear blew up and we lost station service. The 50 kW UPS was stone dead. It apparently worked when the plant was new but was never tested in the years afterwards.

The Westinghouse DCS used DRAM for program storage with lithium battery backup, so we had a few hours until it was all dead. Panic to back up everything.
Nobody had foreseen a 485MW power generating station with no power scenario.
 

Offline CJay

  • Super Contributor
  • ***
  • Posts: 4136
  • Country: gb
One of the major constituent parts of the business I work for is DR.

Although we design DR plans that work (after the BT Tunnel fire, for example, major clients invoked their DR plans and were 'business as usual' in one of our DR suites within two hours), it's ultimately up to the client to implement and regularly test their plans.

It makes not one jot of difference how large their IT estate is, if they're scared to or unable to test then it's just a matter of time until something like this happens.

It's amazing how many companies think they have a DR plan but never test it, in part or in full, or have a DR plan that fails when they test it because their software and hardware are just so badly implemented.

@floobydust, you said it yourself: whichever moron made the decision to ignore the UPS testing and maintenance should have been taken out back and shot.
 

Offline tggzzz

  • Super Contributor
  • ***
  • Posts: 19494
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Nobody had foreseen a 485MW power generating station with no power scenario.

The nightmare scenario in the UK is if the entire grid and all power stations go down. How do you get the first power station back online, given that it requires significant amounts of electricity to start up?

The solution is to use two pumped storage hydroelectricity stations as reserve "black start" capability.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online MK14

  • Super Contributor
  • ***
  • Posts: 4536
  • Country: gb
Nobody had foreseen a 485MW power generating station with no power scenario.

The nightmare scenario in the UK is if the entire grid and all power stations go down. How do you get the first power station back online, given that it requires significant amounts of electricity to start up?

The solution is to use two pumped storage hydroelectricity stations as reserve "black start" capability.

Disclaimer: I don't know that much about the UK's power grid.

My limited understanding is that each nuclear power station has to have fairly extensive backup generators (for safety reasons), independent of the national grid (to which it may lose its connection, or which may itself be without power, during something terrible).
That's because each nuclear power station needs power to keep cooling the core, among other things, to avoid a possible meltdown, especially under certain fault conditions such as when the control rods have been damaged/lost, etc.

So presumably that is powerful enough to restart each Nuclear power station.

BUT the National Grid, if there were a HUGE number of power generation failures, may be difficult to restart (I guess).

EDIT:
Also, I think there are a number of wind farms and maybe other types of "free" energy, possibly even solar, which maybe could give some juice as well.
« Last Edit: June 06, 2017, 08:03:45 am by MK14 »
 

Online MK14

  • Super Contributor
  • ***
  • Posts: 4536
  • Country: gb
***deleted***

Look, some things you just don't joke about.
 
The following users thanked this post: CJay

Offline CJay

  • Super Contributor
  • ***
  • Posts: 4136
  • Country: gb
Nobody had foreseen a 485MW power generating station with no power scenario.

The nightmare scenario in the UK is if the entire grid and all power stations go down. How do you get the first power station back online, given that it requires significant amounts of electricity to start up?

The solution is to use two pumped storage hydroelectricity stations as reserve "black start" capability.

They usually have decent sized gas or diesel backup generators to bootstrap with.
 

Offline Electro Detective

  • Super Contributor
  • ***
  • Posts: 2715
  • Country: au
Call center person with awesome suntan was overworked.. marketing some investment product on the phone, accepting credit card payment for a gas bill,
flogging a new phone plan, watching pawn etc
when the 'unforeseen' incident happened, and not sure what UPS was implied,
computer back up procedure or Ebay delivery issue.

Let's blame it on the electrical engineer (and give him a good payoff)

or get an actor in our COT (circle of trust) to 'play good engineer gone bad' with broken family, drinking problem, incurable EEVblog addict,
throw in a call girl affair and tie her to some alleged obscure ancient re-incarnated cult cell affiliation BS,   

therefore distance ourselves, and let the news media  :bullshit: trash him  :-- :-- :--

Corporate problems need corporate solutions... yesterday  :-+
« Last Edit: June 08, 2017, 02:15:20 am by Electro Detective »
 

Offline tggzzz

  • Super Contributor
  • ***
  • Posts: 19494
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
So presumably that is powerful enough to restart each Nuclear power station.

Spot the presumption :)

Quote
BUT the National Grid, if there were a HUGE number of power generation failures, may be difficult to restart (I guess).

That's the definition of "black start"!

It is a shame that youngsters have never "bootstrapped" a computer.

Quote
Also, I think there are a number of wind farms and maybe other types of "free" energy, possibly even solar, which maybe could give some juice as well.

They are unreliable.

If you look at the wind power stats on Gridwatch then, as a rule of thumb, x% of the time they are generating less than x% of their peak output. That means 1% of the time (about 3 days/year) they are generating less than 1% of peak, and 10% of the time (about a month/year) they are generating less than 10% of their peak. The other necessary piece of information is how long that lasts; if there is a blocking high pressure system over the UK, then wind output is very low for days at a time.
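That rule of thumb is easy to sanity-check. A quick Python sketch (the CSV file and column name are only assumptions about whatever export you actually grab from Gridwatch): load a time series of wind output and count how often it sits below x% of the peak sample.

Code: [Select]
import csv

def fraction_below(samples, pct_of_peak):
    # Fraction of samples where output is below pct_of_peak % of the peak sample.
    peak = max(samples)
    threshold = peak * pct_of_peak / 100.0
    return sum(1 for s in samples if s < threshold) / len(samples)

# Hypothetical usage against a Gridwatch CSV export with a "wind" column (MW):
# with open("gridwatch_wind.csv") as f:
#     wind = [float(row["wind"]) for row in csv.DictReader(f)]
# for x in (1, 5, 10, 20):
#     print("below %d%% of peak for %.1f%% of the time"
#           % (x, 100 * fraction_below(wind, x)))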
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Online MK14

  • Super Contributor
  • ***
  • Posts: 4536
  • Country: gb
So presumably that is powerful enough to restart each Nuclear power station.

Spot the presumption :)

Quote
BUT the National Grid, if there were a HUGE number of power generation failures, may be difficult to restart (I guess).

That's the definition of "black start"!

It is a shame that youngsters have never "bootstrapped" a computer.

Quote
Also, I think there are a number of wind farms and maybe other types of "free" energy, possibly even solar, which maybe could give some juice as well.

They are unreliable.

If you look at the wind power stats on Gridwatch then, as a rule of thumb, x% of the time they are generating less than x% of their peak output. That means 1% of the time (about 3 days/year) they are generating less than 1% of peak, and 10% of the time (about a month/year) they are generating less than 10% of their peak. The other necessary piece of information is how long that lasts; if there is a blocking high pressure system over the UK, then wind output is very low for days at a time.

Fairly recently we have had part of the NHS knocked out through hacking.
Later, British Airways (this thread) was knocked out for a few days, probably due to negligence (at the top management of British Airways I guess, and maybe lower).
Terrorist attack(s) could knock things out (in theory) or cause them to be shut down.
Very bad weather sometimes knocks things out, etc. etc.

So if terrible computer bug(s) or hardware mistakes and/or poor management and/or hacking/terrorists/etc. were to strike the National Grid, maybe we could get the scenario you describe.

Maybe they could use a car battery to "bootstrap" the National Grid? (Joke.)

Thinking about it more, I believe we can get substantial power from France, via huge cables which are already used to "sell" electricity between the countries (maybe even the rest of Europe), if I remember correctly.
Maybe that would be powerful enough to restart the UK's National Grid.
(Assuming France has not been hit by the same fault(s)/weather/hacking/terrorist/etc. incident.)
 

Offline madires

  • Super Contributor
  • ***
  • Posts: 7764
  • Country: de
  • A qualified hobbyist ;)
BA does have a secondary Data Centre. It's about 1 mile away from the main DC.
The problem was not only disconnecting the UPS but then turning the power back on in the wrong order.
It appears some files were corrupted during the incorrect restart procedure and these were then sync'd to the backup site, taking that down too.  :palm:

Systems can recover from an outage, run for a moment and crash again. Such basic fault conditions have to be considered for a proper design.
 

Offline madires

  • Super Contributor
  • ***
  • Posts: 7764
  • Country: de
  • A qualified hobbyist ;)
Nobody wants to test a big offline UPS, too scary. Everything is running fine, who wants to pull the plug and see if the UPS transfers and runs?

The problem with that attitude is that backup systems will break when not run regularly. The risk of test runs can be reduced by using a BIG dummy load. Anyway, a lot of data center outages are caused by power problems, like issues that were overlooked because test runs were never performed, or a lack of capacity planning.
 

Offline CJay

  • Super Contributor
  • ***
  • Posts: 4136
  • Country: gb
Maybe they could use a car battery to "bootstrap" the National Grid? (Joke.)

Thinking about it more, I believe we can get substantial power from France, via huge cables which are already used to "sell" electricity between the countries (maybe even the rest of Europe), if I remember correctly.
Maybe that would be powerful enough to restart the UK's National Grid.
(Assuming France has not been hit by the same fault(s)/weather/hacking/terrorist/etc. incident.)

Heh, joke maybe, and perhaps not quite a bloke with a pair of jump leads, but I guess it'd be possible at some level to jump-start a backup generator with a car battery and bootstrap the chain that way. I'd be pretty worried if it came to that, but given recent events...

It'd have to be a pretty bleak day to need electricity from the continent to restart our grid, but I don't see why it'd not be possible; it'd only take a station or two to be restarted before we had enough power to restart the rest, and as already mentioned, we've got Dinorwig and other renewable solutions too :)
 

Online MK14

  • Super Contributor
  • ***
  • Posts: 4536
  • Country: gb
Maybe they could use a car battery to "bootstrap" the National Grid? (Joke.)

Thinking about it more, I believe we can get substantial power from France, via huge cables which are already used to "sell" electricity between the countries (maybe even the rest of Europe), if I remember correctly.
Maybe that would be powerful enough to restart the UK's National Grid.
(Assuming France has not been hit by the same fault(s)/weather/hacking/terrorist/etc. incident.)

Heh, joke maybe, and perhaps not quite a bloke with a pair of jump leads, but I guess it'd be possible at some level to jump-start a backup generator with a car battery and bootstrap the chain that way. I'd be pretty worried if it came to that, but given recent events...

It'd have to be a pretty bleak day to need electricity from the continent to restart our grid, but I don't see why it'd not be possible; it'd only take a station or two to be restarted before we had enough power to restart the rest, and as already mentioned, we've got Dinorwig and other renewable solutions too :)

In practice, when situations like that arise, solutions are normally found, even if some quite strange events take place.

E.g. I think a lot of military ships (battleships etc., the fairly large ones), and probably large commercial ships, boast about having the electricity-generating ability to power a modestly sized town, so maybe that would be another option.

Japan (Fukushima), when they had their huge nuclear disaster (somewhat recently), effectively lost all power, so they had to sort something out there, as it was the only means of restarting the cooling systems.
Although I think a major part of the disaster was that they didn't succeed in getting the power back quickly enough, so there was some kind of meltdown or something.

I think there are fairly large emergency power generators available, so they perhaps could be used to start up one of the turbines, gradually. Then the one rotating generator could power/restart the rest of the plant.

I guess a real Electrical Power Plant Expert, would know what can and can't work, in practice.

Some people (not me, but others tell me about it) worry about a giant EMP pulse (military, nuclear bombs, or even space weather via the sun) knocking out a country's (or even many countries') electrical systems big time.

There are so many millions of things that *MIGHT* go wrong with society, but in practice most/all of them WON'T occur. It perhaps is best to get on with life and NOT worry about such things too much.

E.g. Mad Cow Disease causing a large number of deaths/injuries, the Year 2000 millennium bug, the bird flu epidemic, the SARS epidemic, Trump causing World War 3 (although maybe he needs longer?), AIDS wiping out most of the world's population, etc. etc.
 

