EEVblog Electronics Community Forum
General => General Technical Chat => Topic started by: Avacee on June 05, 2017, 06:50:56 pm
-
Apologies if there's already a thread going about this ....
http://www.bbc.co.uk/news/business-40159202
--- Quote ---
British Airways' parent company has confirmed that human error caused an IT meltdown that left 75,000 passengers stranded over the Bank Holiday.
Willie Walsh, boss of IAG, said that an electrical engineer disconnected the uninterruptible power supply, which shut down BA's data centre.
--- End ---
So crap DR plan, crap failover, crap resilience, crap electrical architecture .. and what would have happened if said UPS had failed during normal operation?
A design that allows the unplugging of one thing to cripple a company for nearly 3 days .. just wow :p
And I wonder how many people will be fired.....
-
Frankly, if anybody believes their face-saving story I shall be surprised. I hope their insurers wring the truth out of them and the associated managers/board members are brought to book, but that would be utopia!
-
http://www.theregister.co.uk/2017/06/05/british_airways_critical_path_analysis/
tl;dr
Badly or terribly designed IT systems, too little or poor testing, a partly or massively cut-back (sacked/made redundant) UK workforce, work outsourced to India, and possibly incompetent top management.
So what happened was no surprise really.
-
"boss of IAG, said that an electrical engineer disconnected the uninterruptible power supply which shut down BA's data centre"
What else could the boss say?
"It failed because of our management's lack of planning and testing for reliability" isn't ever going to be said by anyone in management.
-
That's funny,
the initial story was that the staff in the Indian data centre didn't know how to switch over to the backup server because of a lack of training!!!
-
I think the article has been edited. It's now simply "engineer":
"boss of IAG, said an engineer disconnected a power supply, with the major damage caused by a surge when it was reconnected."
-
No hot-stand-by data center? Would be less expensive than the outage. :palm:
-
BA does have a secondary Data Centre. It's about 1 mile away from the main DC.
The problem was not only disconnecting the UPS but then turning the power back on in the wrong order.
It appears some files were corrupted during the incorrect restart procedure and these sync'd to the backup site and took that down too. :palm:
-
I see some job opening at BA due to impending departures...
"Never take a job where success is defined by the lack of failures."
(Not a quote from someone famous. Just my thoughts...)
-
We once had the transfer switch for the UPS blow up. I never went down to the basement to see it; I was too busy making sure things came back up clean once some bypass switches were thrown.
Doing DR is hard.
We do all kinds of things in the servers to withstand faults: RAID or mirrored boot disks, redundant power supplies from different feeds, ECC memory, OSes that can disable a failed CPU and continue, hot-swap IO cards, redundant network links to different switches, redundant Fibre Channel with two completely separate SANs, clustering to allow processes to fail over to another cluster member. Oh, and the same stuff at a parallel site. The list goes on. However, one stupid move by an application writer can make all of it a moot point.
We had a new app coming in. It pulled data from a message queue, processed it, wrote some things to a database, and then sent results out to a different queue. The database could be set up with real-time replication, so one would think the app would run active/active. Yes, they said, but because of licensing it would be more active/standby. I asked the stupid question of what needs to be done to switch active sites. "Oh, easy: stop the process on the east machine, copy a few data files over to the west machine, and then start the process on the west machine."
I asked, "What if the east site is a smoking hole in the ground, can I get those files from a backup?" Now the designer of the app starts looking a bit nervous. Seems they had active state in those few data files, and losing them would mean lots of man-time to get things back in sync. I then had to ask, "You have database connections, why not keep that state in a replicated table?" Seems they never thought of that, since they developed and tested on an offsite PC sitting under someone's desk with no database. But now they would have to re-code parts of the app for that to happen, and no one had the budget for it. |O
All of our slick fault-handling undone, first by laziness and then by budget.
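To illustrate the "replicated table" point (purely a sketch, nothing to do with the real app: table and column names are invented, and SQLite stands in for the replicated database so it runs standalone): keep the checkpoint state in a table the existing database replication already carries, and a site switch needs no file copies at all.
--- Code: (Python) ---
# Toy sketch: keep "active state" in a replicated table instead of local files.
# SQLite stands in for the real (replicated) database; names are invented.
import sqlite3

def open_state_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS app_state (key TEXT PRIMARY KEY, value TEXT NOT NULL)")
    return db

def save_checkpoint(db, key, value):
    # An upsert into a replicated table: the database replication ships it to
    # the other site, so failing over needs no "copy a few data files" step.
    db.execute("INSERT OR REPLACE INTO app_state (key, value) VALUES (?, ?)", (key, value))
    db.commit()

def load_checkpoint(db, key):
    row = db.execute("SELECT value FROM app_state WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

db = open_state_store()
save_checkpoint(db, "last_processed_msg", "msg-000123")
print(load_checkpoint(db, "last_processed_msg"))   # -> msg-000123
--- End code ---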
-
There was a pretty big power outage in downtown Vancouver, Canada, about 10 years ago; it took out about 1/3 of the city core. One of the largest data centres downtown fired up their generator, only to find out that there was no water pressure to water-cool it, as the failure had taken out the city's downtown pumps too. The generator lasted about 15 mins until it overheated.
-
that was actually predictable,
no power = no pumping stations. :palm:
somebody should have been sacked!
-
Nobody wants to test a big offline UPS, too scary. Everything is running fine, who wants to pull the plug and see if the UPS transfers and runs?
I worked in a power plant; the switchgear blew up and we lost station service. The 50kW UPS was stone dead. It apparently worked when the plant was new but was never tested in the years afterwards.
The Westinghouse DCS used DRAM for program storage with lithium batteries, so we had a few hours until it was all dead. Panic to back up everything.
Nobody had foreseen a 485MW power generating station with no power scenario.
-
One of the major constituent parts of the business I work for is DR.
Although we design DR plans that work (BT tunnel fire: major clients invoked their DR plan and were 'business as usual' in one of our DR suites within two hours, for example), it's ultimately up to the client to implement and regularly test their plans.
It makes not one jot of difference how large their IT estate is, if they're scared to or unable to test then it's just a matter of time until something like this happens.
It's amazing how many companies think they have a DR plan but never test it, in part or in full, or have a DR plan that fails when they test it because their software and hardware are just so badly implemented.
@floobydust, you said it yourself, whichever moron made the decisions to ignore the UPS testing and maintenance should have been taken out back and shot.
-
Nobody had foreseen a 485MW power generating station with no power scenario.
The nightmare scenario in the UK is if the entire grid and all power stations go down. How do you get the first power station back online, given that it requires significant amounts of electricity to startup?
The solution is to use two pumped storage hydroelectricity stations as reserve "black start" capability.
-
Nobody had foreseen a 485MW power generating station with no power scenario.
The nightmare scenario in the UK is if the entire grid and all power stations go down. How do you get the first power station back online, given that it requires significant amounts of electricity to startup?
The solution is to use two pumped storage hydroelectricity stations as reserve "black start" capability.
Disclaimer: I don't know that much about the UK's power grid.
My limited understanding is that each nuclear power station has to have fairly extensive backup generators (for safety reasons), independent of the national grid (which it may lose its connection to, or which may itself be without power, during something terrible).
That's because each nuclear power station needs power to keep cooling the core and other systems, to avoid a possible meltdown, especially under certain fault conditions, such as when the control rods have been damaged/lost etc.
So presumably that is powerful enough to restart each Nuclear power station.
BUT the National Grid, if there were a HUGE number of power generation failures, may be difficult to restart (I guess).
EDIT:
Also I think there are a number of wind farms and maybe other types of "free" energy, possibly even solar, which maybe could give some juice as well.
-
***deleted***
Look, some things you just don't joke about.
-
Nobody had foreseen a 485MW power generating station with no power scenario.
The nightmare scenario in the UK is if the entire grid and all power stations go down. How do you get the first power station back online, given that it requires significant amounts of electricity to startup?
The solution is to use two pumped storage hydroelectricity stations as reserve "black start" capability.
They usually have decent sized gas or diesel backup generators to bootstrap with.
-
Call center person with awesome suntan was overworked.. marketing some investment product on the phone, accepting credit card payment for a gas bill,
flogging a new phone plan, watching pawn etc
when the 'unforeseen' incident happened, and not sure what UPS was implied,
computer back up procedure or Ebay delivery issue.
Let's blame it on the electrical engineer (and give him a good payoff)
or get an actor in our COT (circle of trust) to 'play good engineer gone bad' with broken family, drinking problem, incurable EEVblog addict,
throw in a call girl affair and tie her to some alleged obscure ancient re-incarnated cult cell affiliation BS,
therefore distance ourselves, and let the news media :bullshit: trash him :-- :-- :--
Corporate problems need corporate solutions... yesterday :-+
-
So presumably that is powerful enough to restart each Nuclear power station.
Spot the presumption :)
BUT the National Grid, if there were a HUGE number of power generation failures, may be difficult to restart (I guess).
That's the definition of "blackstart!".
It is a shame that youngsters have never "bootstrapped" a computer.
Also I think there are a number of wind farms and maybe other types of "free" energy, possibly even solar, which maybe could give some juice as well.
They are unreliable.
If you look at the wind power stats on gridwatch, then - as a rule of thumb - then x% of the time they are generating less than x% of their peak output. That means 1% of the time (3 days/year) they are generating less than 1% of peak, 10% of the time (1month/year) they are generating less than 10% of their peak. The other necessary piece of information is how long that lasts; if there is a blocking high pressure over the UK, then the wind output is very low for days at a time.
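If anyone wants to sanity-check that rule of thumb against the gridwatch data themselves, the calculation is trivial; the numbers below are invented purely to show what is being computed.
--- Code: (Python) ---
# Check the "x% of the time, output is below x% of peak" rule of thumb.
# 'samples' would be metered wind output in MW (e.g. from a gridwatch CSV);
# the values here are made up just to demonstrate the calculation.
samples = [1200, 300, 80, 4500, 2500, 150, 9000, 600, 50, 7000]
peak = 10000.0   # assumed peak/installed capacity in MW

def fraction_of_time_below(samples, peak, percent):
    # Fraction of samples where output is below percent% of peak.
    threshold = peak * percent / 100.0
    return sum(1 for s in samples if s < threshold) / len(samples)

for pct in (1, 10, 25, 50):
    frac = fraction_of_time_below(samples, peak, pct)
    print("below %2d%% of peak for %4.0f%% of the time" % (pct, frac * 100))
--- End code ---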
-
So presumably that is powerful enough to restart each Nuclear power station.
Spot the presumption :)
BUT the National Grid, if there were a HUGE number of power generation failures, may be difficult to restart (I guess).
That's the definition of "blackstart!".
It is a shame that youngsters have never "bootstrapped" a computer.
Also I think there are a number of wind farms and maybe other types of "free" energy, possibly even solar, which maybe could give some juice as well.
They are unreliable.
If you look at the wind power stats on gridwatch, then - as a rule of thumb - then x% of the time they are generating less than x% of their peak output. That means 1% of the time (3 days/year) they are generating less than 1% of peak, 10% of the time (1month/year) they are generating less than 10% of their peak. The other necessary piece of information is how long that lasts; if there is a blocking high pressure over the UK, then the wind output is very low for days at a time.
Fairly recently we have had part of the NHS knocked out through hacking.
Later, British Airways (this thread) was knocked out for a few days, probably due to negligence (at British Airways' top management, I guess, and maybe lower down).
Terrorist attack(s) could knock out things (in theory) or cause them to be shut down.
Very bad weather sometimes knocks things out etc etc
So if terrible computer bugs and/or hardware mistakes and/or poor management and/or hacking/terrorism/etc were to strike the National Grid,
maybe we could get the scenario you describe.
Maybe they could use a car battery to "bootstrap" the National Grid? (Joke).
Thinking about it more, I believe we can get substantial power from France via huge cables which are already used to "sell" electricity between the countries (maybe even the rest of Europe), if I remember correctly.
Maybe that would be powerful enough to restart the UK's National Grid.
(Assuming France has not been hit by the same Fault(s)/Weather/Hacking/Terrorist/etc incident).
-
BA does have a secondary Data Centre. It's about 1 mile away from the main DC.
The problem was not only disconnecting the UPS but then turning the power back on in the wrong order.
It appears some files were corrupted during the incorrect restart procedure and these sync'd to the backup site and took that down too. :palm:
Systems can recover from an outage, run for a moment and crash again. Such basic fault conditions have to be considered for a proper design.
-
Nobody wants to test a big offline UPS, too scary. Everything is running fine, who wants to pull the plug and see if the UPS transfers and runs?
The problem with that attitude is that backup systems will break when not run regularly. The risk of test runs can be reduced by using a BIG dummy load. Anyway, a lot of data centre outages are caused by power problems, like overlooked issues from never performing test runs, or lack of capacity planning.
-
Maybe they could use a car battery to "bootstrap" the National Grid? (Joke).
Thinking about it more, I believe we can get substantial power from France via huge cables which are already used to "sell" electricity between the countries (maybe even the rest of Europe), if I remember correctly.
Maybe that would be powerful enough to restart the UK's National Grid.
(Assuming France has not been hit by the same Fault(s)/Weather/Hacking/Terrorist/etc incident).
Heh, joke maybe and perhaps not quite a bloke with a pair of jumpleads but I guess it'd be possible at some level to jumpstart a backup generator with a car battery and bootstrap the chain that way, I'd be pretty worried if it came to that but given recent events...
It'd have to be a pretty bleak day to need electricity from the continent to restart our grid but I don't see why it'd not be possible, it'd only take a station or two to be restarted before we had enough power to restart the rest and as already mentioned, we've got Dinorwig and other renewable solutions too :)
-
Maybe they could use a car battery to "bootstrap" the National Grid? (Joke).
Thinking about it more, I believe we can get substantial power from France via huge cables which are already used to "sell" electricity between the countries (maybe even the rest of Europe), if I remember correctly.
Maybe that would be powerful enough to restart the UK's National Grid.
(Assuming France has not been hit by the same Fault(s)/Weather/Hacking/Terrorist/etc incident).
Heh, joke maybe and perhaps not quite a bloke with a pair of jumpleads but I guess it'd be possible at some level to jumpstart a backup generator with a car battery and bootstrap the chain that way, I'd be pretty worried if it came to that but given recent events...
It'd have to be a pretty bleak day to need electricity from the continent to restart our grid but I don't see why it'd not be possible, it'd only take a station or two to be restarted before we had enough power to restart the rest and as already mentioned, we've got Dinorwig and other renewable solutions too :)
In practice, when situations like that arise, solutions are normally found, even if some quite strange events take place.
E.g. I think a lot of military ships (battleships etc., fairly large ones), and probably large commercial ships, boast about having the electricity-generating ability to power a modestly sized town, so maybe that would be another option.
Japan (Fukushima), when they had their huge nuclear disaster (somewhat recently), effectively lost all power, so they had to sort something out there, as it was the only means of restarting the cooling systems.
Although I think a major part of the disaster was that they didn't succeed in getting the power back quickly enough, so there was some kind of meltdown or something.
I think there are fairly large emergency power generators available, so they perhaps could be used to start up one of the turbines, gradually. Then the one rotating generator could power/restart the rest of the plant.
I guess a real electrical power plant expert would know what can and can't work in practice.
Some people (not me, but others tell me about it) worry about a giant EMP pulse (military, nuclear bombs, or even space weather from the sun) knocking out a country's (or even many countries') electrical systems big time.
There are so many millions of things that *MIGHT* go wrong with society, but in practice most/all of them WON'T occur. It perhaps is best to get on with life and NOT worry about such things too much.
E.g. Mad Cow Disease causing a large number of deaths/injuries, the Year 2000 millennium bug, a bird flu epidemic, the SARS epidemic, Trump causing World War 3 (although maybe he needs longer?), AIDS wiping out most of the world's population, etc etc.
-
E.g. I think a lot of military ships (battleships etc., fairly large ones), and probably large commercial ships, boast about having the electricity-generating ability to power a modestly sized town, so maybe that would be another option.
A few of those powerplants were also installed in universities, Manchester Uni had a share in a training reactor out near Risley for a while, I had been told it was a modified variant of a submarine powerplant but I'm not sure that's true.
I believe a lot of battleships are hybrids, they've some form of powerplant driving huge electric motors, makes sense for it to be nuclear, I don't think you can match the energy density...
I think there are fairly large emergency power generators available, so they perhaps could be used to start up one of the turbines, gradually. Then the one rotating generator could power/restart the rest of the plant.
I've called in one or two when there've been major outages on client sites; impressive pieces of equipment capable of a couple of hundred kW, delivered and connected in a couple of hours, and they're not much bigger than a garden shed. I guess you might need something other than 415 V three-phase to kickstart a power station though.
There are so many millions of things that *MIGHT* go wrong with society, but in practice most/all of them WON'T occur. It perhaps is best to get on with life and NOT worry about such things too much.
Far better to get on with life and only worry about the things you can influence I reckon.
-
Maybe they could use a car battery to "bootstrap" the National Grid? (Joke).
Thinking about it more, I believe we can get substantial power from France via huge cables which are already used to "sell" electricity between the countries (maybe even the rest of Europe), if I remember correctly.
Maybe that would be powerful enough to restart the UK's National Grid.
(Assuming France has not been hit by the same Fault(s)/Weather/Hacking/Terrorist/etc incident).
Heh, joke maybe and perhaps not quite a bloke with a pair of jumpleads but I guess it'd be possible at some level to jumpstart a backup generator with a car battery and bootstrap the chain that way, I'd be pretty worried if it came to that but given recent events...
It'd have to be a pretty bleak day to need electricity from the continent to restart our grid but I don't see why it'd not be possible, it'd only take a station or two to be restarted before we had enough power to restart the rest and as already mentioned, we've got Dinorwig and other renewable solutions too :)
In practice, when situations like that arise, solutions are normally found, even if some quite strange events take place.
E.g. I think a lot of military ships (battleships etc., fairly large ones), and probably large commercial ships, boast about having the electricity-generating ability to power a modestly sized town, so maybe that would be another option.
Japan (Fukushima), when they had their huge nuclear disaster (somewhat recently), effectively lost all power, so they had to sort something out there, as it was the only means of restarting the cooling systems.
Although I think a major part of the disaster was that they didn't succeed in getting the power back quickly enough, so there was some kind of meltdown or something.
I think there are fairly large emergency power generators available, so they perhaps could be used to start up one of the turbines, gradually. Then the one rotating generator could power/restart the rest of the plant.
I guess a real electrical power plant expert would know what can and can't work in practice.
Some people (not me, but others tell me about it) worry about a giant EMP pulse (military, nuclear bombs, or even space weather from the sun) knocking out a country's (or even many countries') electrical systems big time.
There are so many millions of things that *MIGHT* go wrong with society, but in practice most/all of them WON'T occur. It perhaps is best to get on with life and NOT worry about such things too much.
E.g. Mad Cow Disease causing a large number of deaths/injuries, the Year 2000 millennium bug, a bird flu epidemic, the SARS epidemic, Trump causing World War 3 (although maybe he needs longer?), AIDS wiping out most of the world's population, etc etc.
Firstly, the precautions you take should be based on the probability of X happening times the consequences when X happens. If it isn't restored quickly enough, total loss of electricity => mass deaths, possibly the majority of the population.
Secondly, knowledgeable Electrical Engineers have considered how to restore power. I mentioned the "blackstart" mechanisms, and you can google for that term rather than speculate.
Finally, a second "Carrington Event" is a nightmare scenario; it would probably take out a hemisphere, not just one country. Nobody really knows how the grid would cope, but a mini version produced the Quebec blackout in March 1989.
-
Far better to get on with life and only worry about the things you can influence I reckon.
... to ensure that someone knowledgeable has plans and mechanisms, and that there is nothing preventing them from being actioned. The first is dealt with by professionals; the second has to be enabled by politicians. Guess which is more problematic.
-
... to ensure that someone knowledgeable has plans and mechanisms, and that there is nothing preventing them from being actioned. The first is dealt with by professionals; the second has to be enabled by politicians. Guess which is more problematic.
Absolutely. I'm working on a contract now where people are paid to worry about generation and distribution, onshore and off; it always brings a wry smile when they come to ask for help and then apologise to me for not knowing enough about IT to do my job.
*edit* Googling 'blackstart grid' produces some fascinating documents.
-
There are so many millions of things that *MIGHT* go wrong with society, but in practice most/all of them WON'T occur. It perhaps is best to get on with life and NOT worry about such things too much.
Far better to get on with life and only worry about the things you can influence I reckon.
Although that's the advice I am apparently giving here, have I been MOSTLY following that advice in practice?
tl;dr
No!
If you DON'T believe me, ask me about some of the current/past news highlights, e.g. the NHS/British Airways IT failings, etc etc, and you will find I am NOT practicing what I preach all the time :-DD
-
Firstly, the precautions you take should be based on the probability of X happening times the consequences when X happens. If it isn't restored quickly enough, total loss of electricity => mass deaths, possibly the majority of the population.
Secondly, knowledgeable Electrical Engineers have considered how to restore power. I mentioned the "blackstart" mechanisms, and you can google for that term rather than speculate.
Finally, a second "Carrington Event" is a nightmare scenario; it would probably take out a hemisphere, not just one country. Nobody really knows how the grid would cope, but a mini version produced the Quebec blackout in March 1989.
...Based on probability of X ...
Yes, we humans SHOULD be using that technique when we go about our daily lives.
But in many cases we DON'T really use that method, or think we do but get it WAY wrong in practice.
E.g. someone steps out in front of your car, which is peacefully/legally driving along the road at 30 MPH, because they weren't looking: they were talking/texting on a mobile phone about a highly unlikely event that would endanger their lives (the latest terror attack), and so they almost get themselves killed by something considerably more likely to kill them in practice: a road accident caused by walking while talking/texting on a mobile phone.
If British Airways and/or its IT department had taken into account the consequences of their IT systems dramatically breaking in an extended, multi-day total outage, they would/should have spent the necessary money, had the right equipment, and kept on hand the necessary (ideally in-house) competent, trained IT workers to prevent the kind of thing which sent the British Airways IT systems into total meltdown for a number of days.
-
One of the more amusing blackstart-ish things I have seen was a sleeper train down from somewhere in Scotland....
The on-car emergency batteries for the lighting and such had been allowed to go so flat that the charger was tripping offline when switched on, and the train could not leave without that system working. This was causing a certain amount of stress to all concerned (passengers get rather sarky under these circumstances).
The solution?
Take a reel of cable, strip the ends, and jumper from an aux battery power connection point in one car to the aux power connector on the one with the flat batteries. This got the battery voltage up far enough to get the charger to engage. Creative, but of course the batteries were still mostly flat, so they would in fact have had little emergency lighting in that car for some time (the charge lights were green, however, so the train was allowed to leave).
My favourite DR cockup involves a hospital, with backup plant installed on the roof, fuel tanks underground, and a directive from the senior manglement that "ONLY directly life critical medical equipment is to be connected to the protected power", sounds reasonable until you realise that that does not include the fuel lift pumps for the generators (Tests with a load bank were fine because the pumps were running off the mains, take the mains away however....)!
Germany had an interesting one involving solar power a few years back. At the time the solar inverters were set up to generate real power only, which is reasonable when solar is a small proportion of your total generation..... Well, they had a day where this was NOT the case, and most of the required capacity was being covered by the installed solar capacity, which as it turns out is fine until you have a major fault in the medium-voltage network and need something to provide **LOTS** of kVAR to clear the fault. The solar inverters have all sorts of safety measures, and as the grid voltage started to collapse they tripped offline in a cascading failure!
Modern inverters for that market can produce reactive power as well as real.
73 Dan.
-
Back in the '80s, my then employer, Racal, had big backup generators and fuel stores at most of its sites - the chairman, (Sir) Ernie Harrison, had these installed back in the '70s so he wouldn't be held to ransom by the 3-day week (caused by the miners' and power workers' strikes), or so the folk history went.
Anyway, while I was working there we had major problems with the incoming mains supply to the R&D building; we were at maximum load for the supply cable and the phase loadings weren't very well matched, so we kept blowing supply fuses (in fact it got so bad that the service inlet partially melted due to continued long-term fuse dissipation). Each time the supply failed (which got to every couple of days due to the deteriorating conditions), there would be a brief pause and then the generator would kick in, and peace and light would be restored to the lab.
There was however one flaw with the system - when the supply failed someone had to remember to phone security (in the other building) to go round and open the doors to the generator house. The doors had louvres but nowhere near big enough to adequately cool the diesel engine. Inevitably one day it got forgotten resulting in an awful lot of steam, loud banging noises and a lot of running about and cursing! ;D
The human element will often bite you in the end.
P.S. It wasn't quite as much fun for me; one of my jobs was managing the lab PDP-11/44, and its add-on backup battery supply only lasted about 5 minutes at best! It often took that long to get people to go back to their terminals, save their edits and log off so I could shut it down! ::)
-
The human element will often bite you in the end.
You have probably heard this one already, but I heard it from the person who experienced it themselves.
His IT department dealt with absolutely computer-illiterate people in the company.
One day a person said that their computer was faulty. The fault was that the "coffee cup holder" function had broken.
At first he said "I don't know what you are talking .......... Wait, wait, you don't mean ..."
Yes, it was the button which they thought brought out the coffee cup holder, on the front of the CD or DVD unit.
**********************************************************************************************************
Apparently at a company (a UK one, I think), there was a new director, very high up in the company, who discovered that she could NOT gain entry to the IT server room. She was furious about this, because she was near the top of the company (a senior director), so she insisted on being given security access rights to the IT server room, despite having absolutely no business in there and knowing next to nothing about computers.
This was granted.
Then one day all the servers suddenly stopped working and the entire IT system had been knocked out. This was very strange, because there were many backup systems, and lots of UPS and stuff, so it should not really have been possible.
It turned out that the senior director had been walking past the IT server room when she suddenly had a mobile phone call, so she decided to go into the server room to take the call. But she could not stand the noise, which was disturbing her call, so she went round and unplugged the power leads from all the terribly noisy servers in the room.
Her access to the IT server room was blocked, and she may even have been disciplined or sacked after that. It knocked out the entire set of IT servers and caused terrible (expensive) problems.
Disclaimer: I repeated this story from memory, and may have got the exact details slightly wrong and/or slightly changed it. I think the gist is correct.
-
My favourite DR cockup involves a hospital, with backup plant installed on the roof, fuel tanks underground, and a directive from the senior manglement that "ONLY directly life critical medical equipment is to be connected to the protected power", sounds reasonable until you realise that that does not include the fuel lift pumps for the generators (Tests with a load bank were fine because the pumps were running off the mains, take the mains away however....)!
We had a similar problem at one of our datacentres a couple of years ago, when Hurricane Sandy hit New York. The backup generators were on the roof, but due to New York City fire code requirements, the fuel tanks and pumps were in the basement. Power went out... generators started... basement flooded... fuel pumps died... generators died...
I'm not sure if that was before or after they found the shark swimming around the lobby!
-
I was caught up in the BA shenanigans in Tokyo a week ago Sunday. Five minutes before getting in the cab to go to the airport at 5am to return home I had a message telling me that my flight was cancelled, and that I should rebook online or call up.
I still went to the airport and they rebooked me there and then onto an ANA flight direct back home, which got me back 2.5h late, so it could have worked out a lot worse.
BA has been cost cutting for the past three or four years, to such an extent that in economy there's no difference between them and the worst of the low cost carriers.
A key method BA use to cut costs is to outsource everything. I've worked on both sides of the fence for outsourcing in my career, both as vendor and as customer, and the only time I've seen outsourcing work for the better is in very clear, specific and limited scenarios. Wholesale outsourcing inevitably leads to loss of expertise and experience that simply can't be handed over in four weeks or read from a library of manuals.
This gets us back to the commodity methods of human resourcing, where there's an assumption that you can lift anyone off the street with the right words on their CV/resume and expect them to provide the same service. I think all of us know how that really works out in practice.
I am absolutely certain that whether or not the original problem was caused by an employee or a contractor (they are currently pointing the finger at their outsourced facilities company), their systems would have been up in a fraction of the time had they not sacked several hundred of their employees over recent months and years and handed the keys over to outsourcers.
Finally, having no functioning DR plan, and their two DCs barely 1km from each other, is so completely foreign to anyone involved in enterprise DR planning that it's risible.
-
Why would things sync in a fail over data center? Syncing is for backups and for repairs. Failover should be 100% transparent, except for a human telling the front ends "1 response/ack instead of 2 is enough for now" (front ends obviously also redundant, but designed to have no state which can't simply be lost).
Thinking in terms of primary/secondary sets you up for failure.
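To make that concrete, here's roughly the shape of it (just a toy sketch of the idea, nothing BA-specific; dicts stand in for the two sites): the front end writes to both data centres and normally insists on two acks, and during an incident an operator simply drops the required ack count to one.
--- Code: (Python) ---
# Sketch: a front end that writes to two data centres and normally requires
# both to acknowledge. An operator can drop the requirement to 1 ack during an
# incident; nothing else changes. Plain dicts stand in for the real sites.
class FrontEnd:
    def __init__(self, backends, required_acks=2):
        self.backends = backends
        self.required_acks = required_acks   # 2 normally, 1 when degraded

    def write(self, key, value):
        acks = 0
        for site in self.backends:
            try:
                site[key] = value            # real code: a network call to that DC
                acks += 1
            except Exception:
                pass                         # a dead site just doesn't ack
        if acks < self.required_acks:
            raise RuntimeError("only %d ack(s), need %d" % (acks, self.required_acks))
        return acks

dc_east, dc_west = {}, {}
fe = FrontEnd([dc_east, dc_west])
fe.write("booking:42", "LHR-JFK")            # normal operation: needs both sites
fe.required_acks = 1                         # operator degrades to non-redundant
--- End code ---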
-
the blame here lies 100% on management and the software developers, 0% on the electrical engineer.
-
Why would things sync in a fail over data center? Syncing is for backups and for repairs. Failover should be 100% transparent, except for a human telling the front ends "1 response/ack instead of 2 is enough for now" (front ends obviously also redundant, but designed to have no state which can't simply be lost).
Thinking in terms of primary/secondary sets you up for failure.
It depends on what your recovery point and recovery time objectives are, as defined by what's acceptable to the business. It's pretty common to have data synchronously replicated to prevent data loss, but it may take some agreed amount of time to get your passive on line.
Achieving true active/active across DCs, particularly with synchronous replication (i.e. no data loss), is hard and expensive in today's hugely complex multi-vendor distributed systems. For a true 24/7 operation like BA, where their business depends on it, they could accept perhaps a few minutes' downtime for failover. Several days, not so much.
-
It's pretty common to have data synchronously replicated to prevent data loss, but it may take some agreed amount of time to get your passive on line.
Just don't make it passive. If you're doing it synchronously, just have the front end lock up when they don't respond synchronously ... problem solved. Recovery point? What recovery, there is no recovery. There is only switching from redundant to non redundant.
For a 3-day outage of BA you can afford to just take a baseball bat to the complex multi-vendor world and force them to do things sanely. Instead they will say "let's just do things the industry-standard way, but properly this time ... we model this as a once-in-X-years occurrence if we do things properly" (with X, by pure coincidence, being just large enough to stick with industry-standard behaviour). Standard behaviour in IT is insane. For instance, let's use C for a few more decades; what's a few more trillion worth of damages, after all.
-
It is sad how easy it could be for terrorists to crash the power system here in Europe.
Just a handful of products from eBay would be required. :scared:
-
the blame here lies 100% on management and the software developers, 0% on the electrical engineer.
Engineers are the ones to blame and throw under the bus.
Corporate leadership, management and execs are accountable for nothing.
It's unfortunate that this song is the plight of the engineer...
-
It's pretty common to have data synchronously replicated to prevent data loss, but it may take some agreed amount of time to get your passive on line.
Just don't make it passive. If you're doing it synchronously, just have the front end lock up when they don't respond synchronously ... problem solved. Recovery point? What recovery, there is no recovery. There is only switching from redundant to non redundant.
As I already stated, there is a significant cost to engineering something to be synchronous and active/active. (This may be why their DCs are so close: to mitigate the performance impact of the latency.)
There is also a significant risk in operating true synchronous active/active, and that is common mode failure. If you make a change on one system it's immediately replicated on the other. If that change is catastrophic, you have nothing.
For outfits like Google and Facebook, not generally being transactional systems, it doesn't matter if there's loss of data or inconsistent results. This makes their fundamental business continuity and redundancy designs far simpler despite the sheer size.
FWIW, there are still some peripheral, but customer-facing BA systems that have yet to be fully recovered.
What remains to be seen is how a change at one DC apparently disabled their disaster recovery plan, assuming they had one of course.
-
There is also a significant risk in operating true synchronous active/active, and that is common mode failure. If you make a change on one system it's immediately replicated on the other. If that change is catastrophic, you have nothing.
Exactly! The design of HA systems starts with the communication protocols.
-
Years ago an IT manager at IBM never had a maintenance system to replace batteries in the numerous UPS systems. One day when there was a power outage, everything crashed, with some systems taking many hours to recover. The IT manager had no idea about UPS systems despite the fact he had a bachelor's degree in IT.
The IT manager also "managed" 72 servers. I was just an engineer in an electronics department, but was given the task to clean up the mess. I managed to consolidate all the servers down to 11 servers, eliminating the jobs of three full time server administrators, and reducing the power bill.
Based upon this and other experiences with hopeless IT managers, it does not surprise me that some fundamental issue shut down British Airways systems.
Fortunately the company I work for now has a highly competent IT manager. He is a pleasure to work with: proactive, knowledgeable and smart, not reactive and clueless like many of them. Most electronic engineers, unlike most IT professionals, understand control theory. In my opinion, control theory is important in running an efficient IT department in a large organisation.
-
There is also a significant risk in operating true synchronous active/active, and that is common mode failure. If you make a change on one system it's immediately replicated on the other. If that change is catastrophic, you have nothing.
You have backups and a transaction log. Storage is cheap, the data to re-initialize your "secondary" is still there ... initializing is simply not the default way to fail over.
-
There is also a significant risk in operating true synchronous active/active, and that is common mode failure. If you make a change on one system it's immediately replicated on the other. If that change is catastrophic, you have nothing.
You have backups and a transaction log. Storage is cheap, the data to re-initialize your "secondary" is still there ... initializing is simply not the default way to fail over.
That is not synchronous. Transaction logs are only dumped periodically. You therefore risk data loss.
If your RPO and RTO will support it, that may be acceptable. Certainly I've encountered that as a solution. There are other ways nowadays such as mirroring.
In the DCs I do work in, I have both synchronous mirroring _and_ automated periodic (every 15mins) transaction log shipping configured for production systems. There is no need to reinitialise, everything is already set up and ready to go.
That's great, but it didn't stop a fat fingered operator at our outsourced provider accidentally taking out the wrong VM host today, losing five systems. When they came back up fifteen minutes later, thankfully all five recovered and resynced perfectly in under a minute automatically with no data loss. I like to think that last bit was by design.
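For anyone who hasn't met log shipping, the mechanism is roughly this (a toy version only: a plain file copy stands in for whatever the real DBMS and replication tooling actually do, and the paths are invented):
--- Code: (Python) ---
# Toy illustration of periodic transaction log shipping: every transaction is
# appended to a local log, and a scheduled job copies the log to the standby.
# A plain file copy stands in for the real DBMS mechanism; paths are invented.
import pathlib
import shutil
import time

LOG = pathlib.Path("primary_txn.log")
STANDBY = pathlib.Path("standby_txn.log")

def record_transaction(entry):
    with LOG.open("a") as f:
        f.write("%d %s\n" % (time.time(), entry))   # durable before acking the client

def ship_log():
    # In real setups this runs every N minutes; anything written after the last
    # shipment is what you stand to lose if the primary site disappears (the RPO).
    shutil.copyfile(LOG, STANDBY)

record_transaction("UPDATE booking SET seat='12A' WHERE id=42")
ship_log()
print(STANDBY.read_text())
--- End code ---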
-
There's a reason I *NEVER* unplug a server myself; the look of panic on the faces of the admins when they pull the power from the wrong machine is priceless though.
-
There's a reason I *NEVER* unplug a server myself; the look of panic on the faces of the admins when they pull the power from the wrong machine is priceless though.
There are those who have and those who will! The problem with outsourcers is that it's difficult to get a straight story out of them, because there are typically financial implications if SLAs are not met. If you can get to the engineer directly, preferably on the phone, before the account managers get to them, you stand a chance of figuring out what -really- went wrong. This is a key reason why there is such distrust of outsource providers.
-
http://www.bbc.co.uk/news/business-40750168
BA estimates the cost of the "IT Failure" to be £58 million ... ouch!
Still awaiting the final report on who did what and why it took so long to recover but the cynic in me thinks it'll be a master-class in CYAABSE (Cover Your Arse And Blame Someone Else)
-
the blame here lies 100% on management and the software developers, 0% on the electrical engineer.
I have spent an entire career worrying about electrical systems, including UPS systems for computer facilities. Management has responsibility for providing guidance and funding but the engineers run the system. They are responsible for switching procedures and should be responsible for maintenance and testing.
It is darn hard to design a system that will provide UPS protection while still being able to test the UPS. Batteries in series are a serious problem. One defective cell and the chain is out. Two parallel chains out and the bank probably overloads. How many of the parallel chains should we design to lose? AGM batteries are maintenance free and non-maintainable. They are designed to fail in this application. Glass case batteries (like those used in nuclear plants and telephone exchanges) are the only acceptable choice. They are pricey and it takes a lot of space to provide for servicing. Batteries are interesting; it is impossible to make them safe!
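For anyone who wants to put numbers on the series/parallel question, a back-of-the-envelope sketch (cell reliability and counts are invented, and independent failures are assumed, which real battery strings certainly are not):
--- Code: (Python) ---
# Back-of-envelope battery bank reliability. A series string only works if
# every cell works; the bank survives if at least k of its n strings survive.
# Cell reliability and counts are invented; independent failures assumed.
from math import comb

def string_reliability(cell_r, cells_per_string):
    return cell_r ** cells_per_string          # all series cells must survive

def bank_reliability(string_r, n_strings, min_strings_needed):
    # Probability that at least min_strings_needed of n_strings survive (binomial).
    return sum(comb(n_strings, k) * string_r**k * (1 - string_r)**(n_strings - k)
               for k in range(min_strings_needed, n_strings + 1))

r_string = string_reliability(cell_r=0.999, cells_per_string=192)   # ~0.82 per string
for n in (2, 3, 4):
    print(n, "strings, tolerating one lost:",
          round(bank_reliability(r_string, n, n - 1), 4))
--- End code ---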
How about parallel UPSs? Do we try that? Should they individually be capable of supporting the entire load? How many parallel generators? Each fully capable? Best 2 out of 3? Do we put the cooling system on separate generators?
How much money have you got? I can show you a way to spend it! Lots of it! And, still, there may be common modes of failure.
If an engineer threw a switch and the systems went down, it's a training issue. There wasn't a procedure or the engineer didn't follow it.
-
I just have one question... Bank Holiday? ???
-
I just have one question... Bank Holiday? ???
It means a national public holiday in the UK, for everyone. It's called that because, in the old days, banks were closed on those days. Hence Bank Holidays (= vacation).
-
the blame here lies 100% on management and the software developers, 0% on the electrical engineer.
I have spent an entire career worrying about electrical systems, including UPS systems for computer facilities. Management has responsibility for providing guidance and funding but the engineers run the system. They are responsible for switching procedures and should be responsible for maintenance and testing.
the software developers are 100% at fault for not designing a system that can tolerate infrastructure failures. and 100% at fault for not testing such scenarios.
the engineer is 0% at fault.
it is simply not possible to make 100% reliable systems. you design for failure, and expect it. you design your system to work even when they happen.
if you don't, your design was faulty from the beginning even before the failure happened.
-
the software developers are 100% at fault for not designing a system that can tolerate infrastructure failures. and 100% at fault for not testing such scenarios.
I presume you have been taught the "Byzantine Generals" problem, for which there is no solution in the general case.
I presume you know why "two-phase transactions" are used - and that they are insufficient, so that sometimes you have to use three-phase transactions - and that they are insufficient, so that sometimes you have to use four-phase transactions -...
Some software/infrastructure problems are extremely difficult to solve. The "split brain" problem is one of them.
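For anyone who hasn't met it, two-phase commit in miniature (an in-process toy only, with no timeouts, crashes or network partitions modelled, which is precisely where the real difficulty, and the need for the extra phases, comes from):
--- Code: (Python) ---
# Miniature two-phase commit: the coordinator asks every participant to
# prepare; only if all vote yes does it tell them all to commit, else abort.
# In-process toy only -- no timeouts, crashes or partitions are modelled.
class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name, self.will_vote_yes, self.state = name, will_vote_yes, "idle"

    def prepare(self):
        self.state = "prepared" if self.will_vote_yes else "aborted"
        return self.will_vote_yes

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]      # phase 1: voting
    if all(votes):
        for p in participants:
            p.commit()                               # phase 2: global commit
        return True
    for p in participants:
        p.abort()                                    # phase 2: global abort
    return False

east, west = Participant("east"), Participant("west", will_vote_yes=False)
print(two_phase_commit([east, west]), east.state, west.state)   # False aborted aborted
--- End code ---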
it is simply not possible to make 100% reliable systems. you design for failure, and expect it. you design your system to work even when they happen.
if you don't, your design was faulty from the beginning even before the failure happened.
Agreed, subject to the caveat above.
-
AFAICS the incident arose from a failure to understand that syncing is not backup.
Syncing actually doubles the chances of data loss, because it takes only one fault on either of two systems to corrupt the data on both. Thus, it halves the MTBF.
I meet this frequently with bosses who insist on having a RAID array on a small server. When pressed to say why they insist on it, they say that it's to protect the data. Quite likely a lot of such systems are installed with no actual backup hardware on the assumption that none is needed. ::)
In fact, having any kind of RAID increases the need for dependable backups, because you typically cannot read individual disks from an array. So, if something goes wrong with the server, the data becomes inaccessible. If you'd used a standard disk format you could have just popped a disk into a workstation and read it.
The same applies to people syncing data with cloud services. They still need a backup. I'm sure a lot of users think they don't, though.
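The arithmetic behind "it halves the MTBF", for anyone who wants to poke at it (toy numbers, and it assumes independent failures and that a corrupting fault on either box propagates to both copies):
--- Code: (Python) ---
# The "syncing halves the MTBF" arithmetic: for corruption faults that get
# replicated, either box failing takes out both copies, so failure rates add.
# Numbers are invented; independent, constant failure rates assumed.
def combined_mtbf(*mtbfs):
    # Rate of "at least one fails" is the sum of the individual rates.
    return 1.0 / sum(1.0 / m for m in mtbfs)

single = 100_000.0                                # hours, hypothetical single system
print(combined_mtbf(single, single))              # 50000.0 -- half the MTBF of one box
print(combined_mtbf(single, single, single))      # ~33333 -- a third synced copy
--- End code ---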
-
I remember a situation with a bank on Christmas Eve: a node in a cluster tripped, one of the system administrators attempted to power it back on, and it took the UPS out. We of course got blamed for the outage, because the RCA identified a short in the node's power supply.
-
That's actually a good point regarding a jumbo jet in the DC. There are a few great big data centres around Heathrow airport. Who the hell builds a data centre next to the world's largest game of lawn darts!?
-
It still doesn't deal with why they failed to resume business at an alternate site.
yep. this is an architectural error. the system was doomed to fail before they ever powered it on.
if your system depends on a UPS saving your ass from power failure, then you deserve what you get. BA got it.
hope it was worth the pennies they saved on actual testing.
-
It's not just the data center design, all software has to support a switch over. Presumably BA (like a ton of other companies) run some very old software not capable of sharing data with stand-by systems in real time or supporting a fast switch over. Replacing those legacy systems running core services is very expensive. And the MBAs hate spending money on things that already work somehow >:D
-
an electrical engineer disconnected the uninterruptible power supply which shut down BA's data centre.
:-DD
https://www.youtube.com/watch?v=FC0pT9xg1oI
-
It's not just the data center design, all software has to support a switch over. Presumably BA (like a ton of other companies) run some very old software not capable of sharing data with stand-by systems in real time or supporting a fast switch over. Replacing those legacy systems running core services is very expensive. And the MBAs hate spending money on things that already work somehow >:D
As corporate structures get more complex, the accounting codes used for business units and projects get more and more finely specified. At some point, when the accounts are reviewed, they get visibility based on how much money the business units bring in. Only some accounting centres bring in money: everything else is, by definition, a cost of doing business.
These centers that represent business costs inevitably become the targets for cost reduction. Redundant infrastructure, engineering, etc. easily make the list. Testing becomes, by definition, a cost that can be avoided, because it becomes difficult for a director to justify $100,000 per year or whatever to test something that apparently works just fine.
Anyway, it's reassuring that an engineer was blamed for this failure. How inspiring it is to see that management retains its 100% success rating!
-
It still doesn't deal with why they failed to resume business at an alternate site.
Probably some vital piece of equipment at the main site needed to be operated to transfer control/data to the backup.
Happened before with an Australian telco
http://whrl.pl/RNiFz
I work in Riverside. I was onsite minutes after the power was cut to the floor.
The story goes like this, from the horses' mouths:
• The AirCond guys (Leibert) were doing routine maintenance and switched over from water pump 1 to water pump 2 and a strainer burst under the floor.
• This lifted the floor tile due to the pressure and sprayed water all over the power distribution system and the AAPT transmission and voice switch system
• The Fire guys shut power to the static power switch on the floor and cut the UPS supplied to the third party room
• I saw the AAPT voice switch first hand, and its main CPU section has a burnt-out section in its main backbone the size of an A1 piece of paper. That's what the smell and smoke was from.
• The AAPT guys won't allow the power to be turned back on until the UPS is dried out and the floor is dried.
• Apparently most services didn't fail over, due to the power cut to the transmission kit itself.
The (data) transmission system was needed to send a signal to the remote site to take over.
-
The last I heard from Cruz (BA PHB) was that they had a power failure at both sites.
Frankly I doubt it'll ever be made public what happened. Cruz is an even more psychopathic cost cutter than his predecessor Willie Walsh, who is now group CEO.
I don't buy that it was specifically "software developers", which seems to point the finger at nameless droids pushing out lines of code every day. This is an architectural, infrastructure and process problem. Your average application software developer nowadays doesn't know the difference between resilience and disaster recovery, or even think about it when typing code. That's someone else's problem these days, such are the joys of layer upon layer of abstraction.
-
Sounds about right.
My day job is (unfortunately) the architecture of such systems in a different sector, and I sit between the cost-cutting CXO layer and the mix of outsourced and internal development and operations, none of which care about the others' concerns. The CXO cares only about the capex, opex and hiding failures and risks from other CXO roles. The development teams care about lunch time, Friday and whether or not they can Google themselves out of a sticky spot. The operations team are concerned with playing CYA and blaming as much as possible on no capital and on what the development team churn out. This entire thing is built on 25 years of legacy written by the lowest bidder, and someone comes up and says "hey we're going to do a merger with another 10k users; can you bend it some more?". When we raise a risk and the minimal credible protection against it, someone says it's too expensive, too difficult or too radical to resolve. This is what leads to failures.
The only consolation is the money is damn good for putting up with this shit day in day out :-+
-
The last I heard from Cruz (BA PHB) was that they had a power failure at both sites.
So both sites lost both their redundant supplies and none of the generators worked? Smells fishy. Wait, they had redundant supplies and generator backup.. right? RIGHT?
Ahh, BA..
-
If it's anything like our big UPS's, they like to blow up when called into duty literally 1-2 seconds before the generator kicks in just to smite you.
-
When we raise a risk and the minimal credible protection against it, someone says it's too expensive, too difficult or too radical to resolve. This is what leads to failures.
The only consolation is the money is damn good for putting up with this shit day in day out :-+
All of which makes companies like the one I work for essential (disaster recovery is a *large* part of our business with ~6000 seats across the country)
-
The last I heard from Cruz (BA PHB) was that they had a power failure at both sites.
So both sites lost both their redundant supplies and none of the generators worked? Smells fishy. Wait, they had redundant supplies and generator backup.. right? RIGHT?
Ahh, BA..
At the very least, we're getting nowhere near the full story, and I'd suggest we never will. Even if they wanted to (which I seriously doubt they do) there'll probably be some gagging clause in an outsourcer agreement somewhere.
-
According to today's news the system is down again today. I wonder who will get the blame this time?
-
http://www.telegraph.co.uk/business/2017/08/02/british-airways-apologises-total-chaos-heathrow-due-system-failure/
-
According to today's news the system is down again today. I wonder who will get the blame this time?
Apparently, it was the same electrical engineer's fault, even though he hasn't been there in months and was sleeping at the time of the outages.
Because he failed to notice that the computer servers, which the electrical system he briefly worked on was powering, were poorly designed, prone to failure, and had an inadequate and too-slow backup/recovery system in place.
The bigger causes due to things like lack of expenditure, missing duplication/backups, excessive outsourcing (especially abroad), and very poor upper management control of the computer systems.
These were also NOT addressed/reported by the same electrical engineer.
tl;dr
We need to urgently improve the training given to electrical engineers.
-
The bigger causes due to things like lack of expenditure, missing duplication/backups, excessive outsourcing (especially abroad), and very poor upper management control of the computer systems.
I think you'll find we have no need to offshore IT incompetence, we have our own world leading incompetents born, bred and educated right here in the UK
-
I think you'll find we have no need to offshore IT incompetence, we have our own world leading incompetents born, bred and educated right here in the UK
Let's say a computer system is developed, in 2005, to do something, such as the flight bookings system.
Let's also assume exactly 100 people are needed to develop it.
If you exclusively develop it in-house, some of the original 2005 workforce, maybe between 25 and 75 of them, will still be working at the computer centre, which can be very useful, as they will know a lot about the flight booking system.
But if you outsourced all the work abroad in 2005, and the computer flight booking system goes horribly wrong in 2017, you will potentially have 0 people (now) who developed it and/or know much about it, which can cause terrible problems.
E.g. You say to them, where is the source code ?
Answer...don't know.
What computer language was it written in ?
Answer...don't know.
Well how do you fix any errors which occur ?
Answer...don't know.
I hope you get the picture.
-
But if you outsourced all the work abroad in 2005, and the computer flight booking system goes horribly wrong in 2017, you will potentially have 0 people (now) who developed it and/or know much about it, which can cause terrible problems.
E.g. You say to them, where is the source code ?
Answer...don't know.
What computer language was it written in ?
Answer...don't know.
Well how do you fix any errors which occur ?
Answer...don't know.
I hope you get the picture.
Oh i absolutely do, been there, done that.
Two organisations stick in my mind from direct experience of working on their legacy systems that had to be maintained, one because it was production, the other because it contained data that was evidence in a legal battle. (Data General something or other and a DEC PDP11 with *masses* of unobtanium external storage)
Neither had any staff who knew the systems, on a hardware or software level, the company they'd outsourced support to at least knew the hardware (the company I worked for) but software was off limits for the contract.
One of them knew where they could find two of the programmers who were happy to contract a few days here and there when needed (funnily enough one of the programmers had retired but got bored and took a job with the post office and one of his collections was the organisation in question) and had documentation to the hilt, the other had nothing.
Guess which one failed after a datacentre move...
-
Oh i absolutely do, been there, done that.
Two organisations stick in my mind from direct experience of working on their legacy systems that had to be maintained, one because it was production, the other because it contained data that was evidence in a legal battle. (Data General something or other and a DEC PDP11 with *masses* of unobtanium external storage)
Neither had any staff who knew the systems, on a hardware or software level, the company they'd outsourced support to at least knew the hardware (the company I worked for) but software was off limits for the contract.
One of them knew where they could find two of the programmers who were happy to contract a few days here and there when needed (funnily enough one of the programmers had retired but got bored and took a job with the post office and one of his collections was the organisation in question) and had documentation to the hilt, the other had nothing.
Guess which one failed after a datacentre move...
One of the craziest stories I've heard (from a book or other solid source) was about a VAX 11/780.
Apparently the VAX 11/780 is amazingly well engineered (the CPU and much of the rest of the ICs are mostly plain TTL). Not only is TTL often extremely reliable, but the computer is well ventilated and has multiple (duplicated) systems in some parts, so it can run on and on and on, potentially without maintenance.
So there was this VAX 11/780, quietly performing its duties for a large organization, which got forgotten about. It was left on and running.
Anyway, somehow the internal building maintenance people built a wall around it and effectively sealed/hid the VAX 11/780 behind this solid wall.
It was found, a rather long time later, still switched on and still running its software, doing whatever it was supposed to do.
But the people who knew what it did had long left the company, and no one knew about it.
-
Sounds like the kind of story I'd wish was true. But I doubt it. Space in datacenters is valuable.
It would take me a very long time to chase the story down. But the following is either the SAME story or a similar one; I'm NOT sure which. E.g. I could be confused (mis-remembering, and it was not a VAX 11/780 but some other old server/computer).
https://www.theregister.co.uk/2001/04/12/missing_novell_server_discovered_after/
EDIT:
It is possible that it is a computer folklore story (i.e. not really true), hence why different computer systems are mentioned between the story I originally heard (a VAX 11/780) and the story I linked to (potentially a different computer system).
-
A VAX 11/780 isn't the sort of thing you could lose easily, but I have found machines stashed down the back of server racks, under floors and in ceiling voids above tiles during data centre moves. Most of the time nobody knew what they did or who put them there, so if they were working we'd relocate them too and put them neatly in a rack.
On one occasion we had to hunt down a ProLiant 1600 that had died. We knew it was a PL1600 because it had an iLO that we could talk to, but nobody knew where it was. It turned out to be sat behind a row of filing cabinets in what used to be an accountant's office; it had been there for several years, running some flavour of Unix with a failed mirror set, until its one working disk died and everyone noticed it was gone.
The most unnerving thing I found was an undocumented and well tucked away patch cable that seemed to bypass the firewalls in a pharmaceuticals company.
-
This happens to a lot of companies. We lost a switch cabinet on a factory floor. It turned out someone had built a sub-building underneath it, and it was hidden just above the roof, invisible from eye level, as it was mounted on one of the roof struts. It literally took all day to find it because it had been erased from the floor plan by the site office.
iLOs are fun. One outfit I worked for connected all the blade chassis' iLO ports to a blade switch in one chassis. There was a power outage event in the DC rack (so much for having your own racked UPSes). At that point it became evident that the chassis with the switch in it happened to be the one with a failed supply, and its other power supply was connected to the dead bus in the DC. At exactly the same time they realised that the power-up policy after a failure for every sodding blade was to stay off. This policy had been set in the office while they were testing, because the blades are so noisy.
-
This lost computers thing is going to get even worse. :palm:
Just imagine dozens of Raspberry Pis hiding all over the place, doing their thing until they fail. It might be a good idea to install some kind of paging arrangement by default, so as to be able to find the ones still running.
This lost active equipment situation is not new. I remember a vacuum pump running for years (decades?) in a large bakery. The pump was hidden under a massive layer of dusty/gunky bakery residue until the day it failed. It took a whole day to a) notice that there actually was a failed vacuum pump somewhere and b) then find out where it might be. Great fun.