Author Topic: Server Outage 12-3-18  (Read 11657 times)


Offline EEVblogTopic starter

  • Administrator
  • *****
  • Posts: 37734
  • Country: au
    • EEVblog
Server Outage 12-3-18
« on: March 11, 2018, 10:36:46 pm »
The server was down all last night. Hostgator's response attached.




 

Offline frozenfrogz

  • Frequent Contributor
  • **
  • Posts: 936
  • Country: de
  • Having fun with Arduino and Raspberry Pi
Re: Server Outage 12-3-18
« Reply #1 on: March 11, 2018, 10:44:54 pm »
The server used to work, then it took an arrow in the knee.

Nice to have it back!
He’s like a trained ape. Without the training.
 

Offline SparkyFX

  • Frequent Contributor
  • **
  • Posts: 676
  • Country: de
Re: Server Outage 12-3-18
« Reply #2 on: March 11, 2018, 10:59:05 pm »
Strange, but nothing an unattended dist-upgrade of certain packages couldn't have caused.
Nice to have the eevblog online again.
Support your local planet.
 

Offline hermit

  • Frequent Contributor
  • **
  • Posts: 482
  • Country: us
Re: Server Outage 12-3-18
« Reply #3 on: March 11, 2018, 11:48:19 pm »
I've used cPanel for years without this kind of mystery reconfiguration.  @gnif has some forensic work to do from the sounds of it.  Not that he won't find it was some 'unscheduled help' from the hosting company.  ;)
 

Offline Specmaster

  • Super Contributor
  • ***
  • Posts: 14483
  • Country: gb
Re: Server Outage 12-3-18
« Reply #4 on: March 12, 2018, 10:32:23 am »
I notice that it's still not always handling quotes correctly?
Who let Murphy in?

Brymen-Fluke-HP-Thurlby-Thander-Tek-Extech-Black Star-GW-Avo-Kyoritsu-Amprobe-ITT-Robin-TTi
 

Offline gnif

  • Administrator
  • *****
  • Posts: 1676
  • Country: au
Re: Server Outage 12-3-18
« Reply #5 on: March 12, 2018, 10:37:32 am »
I've used cPanel for years without this kind of mystery reconfiguration.  @gnif has some forensic work to do from the sounds of it.  Not that he won't find it was some 'unscheduled help' from the hosting company.  ;)

What they failed to mention was that they deployed the server to use DHCP. They then moved away from DHCP when they moved data centers and never updated Dave's "managed" server, so when their DHCP server went offline and the DHCP lease expired, this server went offline too (and who knows how many more).

They then tried to perform a `yum update`, which failed as cPanel blacklists the perl packages. They assumed this was due to the "incompatible" CentOS repositories. Rather than contacting Dave or myself about what they wanted to do, they forcibly removed the kernel and swapped it out for a different one. Their reason for changing the kernel was valid, but the assumption that they could just do this was extremely unprofessional.

Edit: I just confirmed by restoring the config from backup that it was indeed configured by HG to use DHCP
Code: [Select]
# cat etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE="eth0"
BOOTPROTO="dhcp"
IPV6INIT="yes"
MTU="1500"
NM_CONTROLLED="no"
ONBOOT="yes"
TYPE="Ethernet"
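
For comparison, a static version of the same file might look roughly like the sketch below. This is an illustration only: the key names are the usual CentOS network-scripts ones, and every address is a placeholder from the documentation range 192.0.2.0/24, not the server's real value.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (static sketch, placeholder addresses)
DEVICE="eth0"
# "none" means no DHCP client runs, so there is no lease that can expire.
BOOTPROTO="none"
# Placeholder address, prefix length, gateway and resolver (assumptions).
IPADDR="192.0.2.10"
PREFIX="24"
GATEWAY="192.0.2.1"
DNS1="192.0.2.53"
NM_CONTROLLED="no"
ONBOOT="yes"
TYPE="Ethernet"
```

With a file like this the interface no longer depends on a DHCP server being reachable when the lease comes up for renewal.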

Here is the bash history of their session and my comments.

Code: [Select]
  501  2018-03-12 07:54:25 yum list installed | grep *redacted*
  502  2018-03-12 07:54:34 yum erase kernel-*redacted*
  503  2018-03-12 07:55:10 yum erase kernel-*redacted* kernel-*redacted*
  504  2018-03-12 07:55:18 yum update kernel
  505  2018-03-12 07:56:52 reboot

Note they did not even check why it couldn't talk to the network, they just blindly assumed a kernel fault, and uninstalled it  :palm:
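
A check of the interface config before touching any packages could be as small as the sketch below. The function names `bootproto_of` and `uses_dhcp` are made up for illustration; the only assumption about the file is that it contains a `BOOTPROTO=` line like the one shown earlier.

```shell
# Print the BOOTPROTO value from an ifcfg-style file given as the first argument.
bootproto_of() {
    # Drop the key, strip the double quotes, keep the last definition if repeated.
    sed -n 's/^BOOTPROTO=//p' "$1" | tr -d '"' | tail -n 1
}

# Succeed if the interface config relies on a DHCP server.
uses_dhcp() {
    [ "$(bootproto_of "$1")" = "dhcp" ]
}

# Usage (hypothetical):
#   uses_dhcp /etc/sysconfig/network-scripts/ifcfg-eth0 && echo "eth0 depends on DHCP"
```

If that prints a warning on a box that has lost its address, the DHCP server is a far more likely suspect than the kernel.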

Code: [Select]
  506  2018-03-12 07:58:28 ping www.goog
  507  2018-03-12 07:58:33 ping www.google.com
  508  2018-03-12 07:58:37 history
  509  2018-03-12 08:00:14 yum repolist

Ok, failed to ping google... checked bash history, not sure why. Then checked the repositories installed into yum.

Code: [Select]
  510  2018-03-12 08:06:05 cat /etc/sysconfig/network-scripts/ifcfg-eth0

Great, ok, finally looking at the network configuration. Anything else related to altering this file is missing from the history; I can only assume it was done in another terminal, or by remote transfer via SCP or similar.

Code: [Select]
  511  2018-03-12 08:10:31 for ip in $(ip a s eth0 | awk -F'[ /]' '/inet.*global/ {print $6}') ; do echo -ne "${ip}\t" ; ping -c3 -I ${ip} 8.8.8.8 2>&1 > /dev/null && echo "UP" || echo "DOWN" ; done
  512  2018-03-12 08:13:34 yum update

A simple script to check for connectivity and then... `yum update`  :palm:
The server was back online at this point... their work was done. If they wanted to do anything more, they should have checked first.
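
For anyone who found the one-liner in entry 511 hard to read, here is a rough multi-line equivalent. It is a sketch, not what HG ran: the function names and the `PROBE` hook are invented for illustration, and the address listing uses `ip -4 -o addr show` instead of the original awk field split. One detail worth noting is that in the original, `2>&1` comes before `> /dev/null`, so ping's error messages would still reach the terminal; the version below silences both streams.

```shell
# List the global IPv4 addresses on an interface (default eth0), one per line.
list_addresses() {
    ip -4 -o addr show dev "${1:-eth0}" scope global |
        awk '{ split($4, a, "/"); print a[1] }'
}

# Default probe: three pings to 8.8.8.8 sourced from the given address.
ping_from() {
    ping -c 3 -I "$1" 8.8.8.8
}

# Read addresses on stdin and print "<address><TAB>UP" or "<address><TAB>DOWN".
# PROBE is a hypothetical hook so the loop can be exercised without a network.
report_addresses() {
    probe="${PROBE:-ping_from}"
    while read -r addr; do
        # Redirect stdout first, then point stderr at it, to silence both.
        if "$probe" "$addr" </dev/null >/dev/null 2>&1; then
            printf '%s\tUP\n' "$addr"
        else
            printf '%s\tDOWN\n' "$addr"
        fi
    done
}

# Usage: list_addresses eth0 | report_addresses
```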
« Last Edit: March 12, 2018, 10:50:34 am by gnif »
 
The following users thanked this post: EEVblog, hwj-d

Offline Specmaster

  • Super Contributor
  • ***
  • Posts: 14483
  • Country: gb
Re: Server Outage 12-3-18
« Reply #6 on: March 12, 2018, 10:49:33 am »
Thank god we have someone who understands the system because all of that to me was just gobbledegook  :phew:
« Last Edit: March 12, 2018, 10:52:56 am by Specmaster »
Who let Murphy in?

Brymen-Fluke-HP-Thurlby-Thander-Tek-Extech-Black Star-GW-Avo-Kyoritsu-Amprobe-ITT-Robin-TTi
 

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23018
  • Country: gb
Re: Server Outage 12-3-18
« Reply #7 on: March 12, 2018, 10:51:21 am »
So much experience with stuff like this it's unreal. The first thing that popped into my head when I read wilfred's post is that I bet it was the hosting company with a wrecking ball rather than "user error". What a turd!

Nice to know that managed hosting companies are still as thoroughly incompetent as they were a decade ago.  :palm:

Also why remove DHCP? Their IPAM is probably an Excel sheet they email to each other  :palm:
 

Offline hwj-d

  • Frequent Contributor
  • **
  • Posts: 676
  • Country: de
  • save the children - chase the cabal
Re: Server Outage 12-3-18
« Reply #8 on: March 12, 2018, 10:54:17 am »
Thanks for your work. I missed it for the whole of daylight Sunday, which is night in Australia of course, without knowing what had happened. I'm glad to have you back.

A small official hint via Twitter and/or FB, or mail, would be useful.  ;)
(€: I passed through all infrastructure possibilities, ping, traceroute, dns, proxies, etc for this  ;D )

 :-+
« Last Edit: March 12, 2018, 11:03:29 am by hwj-d »
 

Offline gnif

  • Administrator
  • *****
  • Posts: 1676
  • Country: au
Re: Server Outage 12-3-18
« Reply #9 on: March 12, 2018, 10:56:13 am »
Thanks for your work. I missed it for the whole of daylight Sunday, which is night in Australia of course, without knowing what had happened. I'm glad to have you back.

A small official hint via Twitter and/or FB, or mail, would be useful.  ;)

 :-+

You're welcome. I can't post on Dave's Twitter when things here are down but I can post on my own if people care to follow it. https://twitter.com/HostFission
I should actually list this here too, I have a patreon for those that want to support the volunteer aspect of my work: https://www.patreon.com/gnif
« Last Edit: March 12, 2018, 11:02:56 am by gnif »
 
The following users thanked this post: hwj-d, BillB

Offline hwj-d

  • Frequent Contributor
  • **
  • Posts: 676
  • Country: de
  • save the children - chase the cabal
Re: Server Outage 12-3-18
« Reply #10 on: March 12, 2018, 11:11:30 am »
Can't this be automated, maybe a mail if the server is down for more than 2h? Don't know ...  :)
 

Offline EEVblogTopic starter

  • Administrator
  • *****
  • Posts: 37734
  • Country: au
    • EEVblog
Re: Server Outage 12-3-18
« Reply #11 on: March 12, 2018, 11:16:55 am »
I should actually list this here too, I have a patreon for those that want to support the volunteer aspect of my work: https://www.patreon.com/gnif

 :-+
 
The following users thanked this post: hwj-d

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23018
  • Country: gb
Re: Server Outage 12-3-18
« Reply #12 on: March 12, 2018, 11:19:15 am »
Can't this be automated, maybe a mail if the server is down for more than 2h? Don't know ...  :)

Take a look at https://www.pingdom.com/ ... that can post on twitter if everything goes down :)
 

Offline gnif

  • Administrator
  • *****
  • Posts: 1676
  • Country: au
Re: Server Outage 12-3-18
« Reply #13 on: March 12, 2018, 11:44:30 am »
Can't this be automated, maybe a mail if the server is down for more than 2h? Don't know ...  :)

Take a look at https://www.pingdom.com/ ... that can post on twitter if everything goes down :)

I have monitoring software set up to keep an eye on Dave's server, but it experiences ICMP packet loss so often that it generates tons of false positives. This could also be a sign of HG's network being overloaded.

I did have an alert for the outage, but due to the time (1am) and the continual false alerts from this server I incorrectly assumed it was another false positive. It was only when I got an email from Dave that I realized it was an actual outage and acted.

In short, pingdom will be plastering twitter with outage notifications when there isn't one...
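
One common way to tame false positives from lossy ICMP is to alert only after several consecutive failed checks. The sketch below illustrates that idea; it is not how the monitoring here is actually configured, and the function name `should_alert`, the "ok"/"fail" input format and the default threshold of 3 are all assumptions.

```shell
# Succeed (meaning "raise an alert") only if the most recent checks form a
# failure streak at least as long as the threshold (first argument, default 3).
# Reads one result per line on stdin, oldest first: "ok" or "fail".
should_alert() {
    threshold="${1:-3}"
    streak=0
    while read -r result; do
        if [ "$result" = "fail" ]; then
            streak=$((streak + 1))
        else
            # Any successful check resets the streak.
            streak=0
        fi
    done
    [ "$streak" -ge "$threshold" ]
}

# Usage (hypothetical log of check results):
#   tail -n 10 checks.log | should_alert 3 && echo "server looks down"
```

A single dropped ping then stays quiet, while a sustained outage still trips the alert after a few check intervals.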
 
The following users thanked this post: hwj-d

Offline hwj-d

  • Frequent Contributor
  • **
  • Posts: 676
  • Country: de
  • save the children - chase the cabal
Re: Server Outage 12-3-18
« Reply #14 on: March 12, 2018, 11:51:33 am »

Take a look at https://www.pingdom.com/ ...

Of course, I already noticed that the forum couldn't be reached globally.  :)
But this doesn't replace a notification mail from the server itself.

I do not insist on this service, and it's more work. This is just a suggestion.
 

Offline bd139

  • Super Contributor
  • ***
  • Posts: 23018
  • Country: gb
Re: Server Outage 12-3-18
« Reply #15 on: March 12, 2018, 11:54:15 am »
Can't this be automated, maybe mail if the server is down more than 2h ? Dont know ...  :)

Take a look at https://www.pingdom.com/ ... that can post on twitter if everything goes down :)

I have monitoring software set up to keep an eye on Dave's server, but it experiences ICMP packet loss so often that it generates tons of false positives. This could also be a sign of HG's network being overloaded.

I did have an alert for the outage, but due to the time (1am) and the continual false alerts from this server I incorrectly assumed it was another false positive. It was only when I got an email from Dave that I realized it was an actual outage and acted.

In short, pingdom will be plastering twitter with outage notifications when there isn't one...

Ack you're doomed then. I wish thee luck :)
 

Offline hwj-d

  • Frequent Contributor
  • **
  • Posts: 676
  • Country: de
  • save the children - chase the cabal
Re: Server Outage 12-3-18
« Reply #16 on: March 12, 2018, 11:57:56 am »
Quote
I did have an alert for the outage, but due to the time (1am) and the continual false alerts from this server I incorrectly assumed it was another false positive. It was only when I got an email from Dave that I realized it was an actual outage and acted.

Got it. Have understanding for that.
 

Offline Specmaster

  • Super Contributor
  • ***
  • Posts: 14483
  • Country: gb
Re: Server Outage 12-3-18
« Reply #17 on: March 12, 2018, 11:58:59 am »
Thank god that we don't suffer outage that much then.
Who let Murphy in?

Brymen-Fluke-HP-Thurlby-Thander-Tek-Extech-Black Star-GW-Avo-Kyoritsu-Amprobe-ITT-Robin-TTi
 

Offline PA0PBZ

  • Super Contributor
  • ***
  • Posts: 5127
  • Country: nl
Re: Server Outage 12-3-18
« Reply #18 on: March 12, 2018, 12:02:38 pm »
A small official hint via Twitter and/or FB, or mail, would be useful.  ;)

hmm...

https://twitter.com/eevblog/status/972800200199647232
Keyboard error: Press F1 to continue.
 
The following users thanked this post: hwj-d

Offline hwj-d

  • Frequent Contributor
  • **
  • Posts: 676
  • Country: de
  • save the children - chase the cabal
Re: Server Outage 12-3-18
« Reply #19 on: March 12, 2018, 12:22:28 pm »
A small official hint via Twitter and/or FB, or mail, would be useful.  ;)

hmm...

https://twitter.com/eevblog/status/972800200199647232

Yes.  :-/O
Now I will follow Dave, in case the server fails again.
 

Offline station240

  • Supporter
  • ****
  • Posts: 967
  • Country: au
Re: Server Outage 12-3-18
« Reply #20 on: March 12, 2018, 12:52:33 pm »
gnif and Dave, send HostGator a bill for the repair work, with a note that says "this could have been avoided if you had actually contacted us"

So many clowns in the web hosting industry, it's like a race to the bottom for who can have the cheapest plans and the least capable technical staff.
 

Offline mnementh

  • Super Contributor
  • ***
  • Posts: 17541
  • Country: us
  • *Hiding in the Dwagon-Cave*
Re: Server Outage 12-3-18
« Reply #21 on: March 12, 2018, 12:57:08 pm »
A small official hint via Twitter and/or FB, or mail, would be useful.  ;)

hmm...

https://twitter.com/eevblog/status/972800200199647232

Yes.  :-/O
Now I will follow Dave, in case the server fails again.

That seems reasonable... except that many of us don't Twit because of the inverse exponential SNR. And a reasonable fear that some of the Twit-in-Chief's verbal ejaculate might get on us.


mnem
Thanks for all the fish. ;)
alt-codes work here:  alt-0128 = €  alt-156 = £  alt-0216 = Ø  alt-225 = ß  alt-230 = µ  alt-234 = Ω  alt-236 = ∞  alt-248 = °
 

Offline rrinker

  • Super Contributor
  • ***
  • Posts: 2046
  • Country: us
Re: Server Outage 12-3-18
« Reply #22 on: March 16, 2018, 05:37:32 pm »
I don't use Twitter because I thought the whole idea was kind of stupid when it first came out.

But wow, can't even get a thread on the server being down without some snarky political comment sneaking in.  |O |O
 

Offline gnif

  • Administrator
  • *****
  • Posts: 1676
  • Country: au
Re: Server Outage 12-3-18
« Reply #23 on: May 10, 2018, 03:35:47 am »
Sorry about that short outage; I was making some server changes and missed a configuration directive.

Edit: A few more were caused by a stray semicolon in the Puppet-generated nginx config. That should be the last of it for today though :D
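
Syntax slips like a stray semicolon can usually be caught by testing the config before reloading. The sketch below shows the general pattern with `nginx -t` and `nginx -s reload`; it is not the deployment process used on this server, and the `safe_reload` name and the `NGINX_BIN` override are invented for illustration.

```shell
# Reload nginx only if the current config passes its own syntax test.
# NGINX_BIN is a hypothetical override, handy for dry runs and testing.
safe_reload() {
    nginx_bin="${NGINX_BIN:-nginx}"
    if "$nginx_bin" -t; then
        # Config parsed cleanly, so signal the running master to reload.
        "$nginx_bin" -s reload
    else
        echo "nginx config test failed; not reloading" >&2
        return 1
    fi
}

# Usage: safe_reload || echo "fix the generated config first"
```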
« Last Edit: May 10, 2018, 05:42:58 am by gnif »
 

Online IanJ

  • Supporter
  • ****
  • Posts: 1606
  • Country: scotland
  • Full time EE & Youtuber
    • IanJohnston.com
Re: Server Outage 12-3-18
« Reply #24 on: May 10, 2018, 06:13:30 am »
What they failed to mention was that they deployed the server to use DHCP,

1and1 did that to me about a year ago on my server that I had been using for 2 years.......several calls later and several techs later they fixed it. They couldn't give me a reason why the static IP setup suddenly changed to DHCP.
I've since ditched 1and1.

Ian.
Ian Johnston - Original designer of the PDVS2mini || Author of the free WinGPIB app.
Website - www.ianjohnston.com
YT Channel (electronics repairs & projects): www.youtube.com/user/IanScottJohnston, Twitter (X): https://twitter.com/IanSJohnston
 

