OK, time to weigh in on what has been done and why, and what caused these outages.
* The server is a dedicated server... with a twist. The actual server is a virtual machine on a dedicated server; it has 100% of the host's resources at its disposal and is not shared. The reason for this is that the provider Dave chose insists it be configured this way and will not allow us to run directly on the hardware. Their reasoning is that if a hardware failure occurs, they can simply move the virtual system to a new host and boot it up. IMO this reasoning is flawed: anyone familiar with Linux knows it is not like Windows; you can move an install to new hardware and, short of adjusting a few things like fstab entries or the Ethernet configuration, it will just work.
* Because the server is virtual, there is another potential point of data loss: when I/Os are pending in the virtualisation layer between the virtual machine and the physical hardware. Soft buffers, if you will. There is nothing we can do about this short of moving to another DC that will give us full control over the dedicated hardware.
* The database uses MyISAM for most of its tables. We experimented earlier on with converting some of the larger tables to InnoDB to improve performance and reliability, and that did help; most of the critical tables are now on InnoDB. But on tables that are thrashed, such as the 'users online' table, the conversion caused performance penalties.
* The tables that are crashing are the low-priority ones that only store temporary data, such as the online table; again, this is because these tables are thrashed. I may look at turning these into HEAP/MEMORY tables to avoid this in future, at the cost that reboots will erase their contents (a sketch of both engine changes follows this list).
* The provider does indeed have backup power hardware, but due to a wiring fault they had a catastrophic failure that forced them to re-wire a large portion of the data centre's backup systems. This affected multiple providers, not just Dave's; some of the biggest names in hosting were also taken offline, causing outages for tens of thousands of clients according to the report (see Edit 2).
* Adding further redundancy by means of a UPS in the rack would be insanely expensive: first the cost of the rack space, then the hardware. I doubt the provider would even allow Dave to do this on his current package. While an outage of a few hours is a pain in the ass, it is sometimes unavoidable unless you have redundant servers and a low TTL on your DNS, or a reverse proxy in another physical location (which itself adds a point of failure). Every option here drives the cost up.
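For anyone curious, the engine conversions mentioned above come down to single ALTER TABLE statements. A minimal sketch, with made-up table names:

```sql
-- Critical, long-lived data: InnoDB gives crash recovery and row-level locking.
ALTER TABLE posts ENGINE=InnoDB;

-- Throwaway, heavily-thrashed data such as the 'users online' table could
-- instead live entirely in RAM. Fast, but contents are lost on restart,
-- which is acceptable for temp data.
ALTER TABLE online ENGINE=MEMORY;
```

One caveat on the MEMORY engine: it uses table-level locking and fixed-length rows (no TEXT/BLOB columns), so it only suits small, simple tables like the online list.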
Dave's server is monitored 24/7, but since I provide this service to him as a favour, I cannot prioritise it over my paying clients or my family time. If outages are detected and I am available to fix the issue, downtime will normally be much shorter. Both outages, including the database problems, were detected, since I check every 5 minutes that the forum and the website are loading correctly; I just wasn't available at the time to look into it.
As for backups, I will discuss this with Dave; I should be able to accommodate him with an offsite backup to one of my spare servers in AU, again at no cost.
Edit: A MySQL slave is also being considered.
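A minimal sketch of what setting up a slave involves, with placeholder hostnames and credentials (the binlog file and position come from SHOW MASTER STATUS on the master, and each server also needs a unique server-id in my.cnf):

```sql
-- On the master: an account the slave replicates through.
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'slave.example.com' IDENTIFIED BY 'secret';

-- On the slave: point it at the master and start replicating.
CHANGE MASTER TO
  MASTER_HOST='master.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='secret',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4;
START SLAVE;
```

The payoff is a warm copy of the database that limits data loss if the master's disk, or the virtual layer above it, eats a write.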
Edit 2: The report on the outage, which puts the impact at tens of thousands of clients:
http://newswire.net/newsroom/financial/00078336-bluehost-service-out.html