OK, time to weigh in on what has been done and why, and what caused these outages.
* The server is a dedicated server... with a twist. The actual server is a virtual machine on a dedicated server; it has 100% of the host's resources at its disposal and is not shared. The reason for this is that the provider Dave chose insists it be configured this way and will not allow us to run directly on the hardware. Their reasoning is that if a hardware failure occurs, they can simply move the virtual system to a new host and boot it up. IMO this reasoning is flawed: anyone familiar with Linux knows it is not like Windows; you can move an install to new hardware and, short of adjusting a few things like fstab entries or the Ethernet configuration, it will just work.
* Because the server is virtual, there is another potential point of data loss: when I/Os are pending in the virtualisation layer between the virtual machine and the physical hardware. Soft buffers, if you will. There is nothing we can do about this short of moving to another DC that will give us full control over the dedicated hardware.
* The database uses MyISAM for most of its tables. We experimented earlier on with converting some of the larger tables to InnoDB to improve performance and reliability, and that did help; most of the critical tables are now on InnoDB. But on tables that are thrashed, such as the 'users online' table, the conversion caused performance penalties.
* The tables that are crashing are the low-priority ones that only store temporary data, such as the online table; again, this is because these tables are thrashed. I may look at turning these into HEAP/MEMORY tables to avoid this in future, at the cost that reboots will erase their contents (a sketch of both engine changes follows this list).
* The provider does indeed have backup power hardware, but due to a wiring fault they had a catastrophic failure that forced them to re-wire a large portion of the data centre's backup systems. This affected multiple providers, not just Dave's; some of the biggest names in hosting were also taken offline, causing outages for tens of thousands of clients according to the report (see Edit 2).
* Adding further redundancy by means of a UPS in the rack would be insanely expensive: first the cost of the rack space, then the hardware. I doubt the provider would even allow Dave to do this on his current package. While an outage of a few hours is a pain in the ass, it is sometimes unavoidable unless you have redundant servers and a low TTL on your DNS, or a reverse proxy in another physical location (which itself adds a point of failure). Every option here drives the cost up.
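For anyone curious, the engine conversions mentioned above come down to single ALTER TABLE statements. A minimal sketch, with made-up table names:

```sql
-- Critical, long-lived data: InnoDB gives crash recovery and row-level locking.
ALTER TABLE posts ENGINE=InnoDB;

-- Throwaway, heavily-thrashed data such as the 'users online' table could
-- instead live entirely in RAM. Fast, but contents are lost on restart,
-- which is acceptable for temp data.
ALTER TABLE online ENGINE=MEMORY;
```

One caveat on the MEMORY engine: it uses table-level locking and fixed-length rows (no TEXT/BLOB columns), so it only suits small, simple tables like the online list.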
Dave's server is monitored 24/7, but since I provide this service to him as a favour, I cannot prioritise it over my paying clients or my family time. If outages are detected and I am available to fix the issue, downtime will normally be much shorter. Both outages, including the database problems, were detected, since I check every 5 minutes that the forum and the website are loading correctly; I just wasn't available at the time to look into it.
As for backups, I will discuss this with Dave; I should be able to accommodate him with an offsite backup to one of my spare servers in AU, again at no cost.
Edit: A MySQL slave is also being considered.
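A minimal sketch of what setting up a slave involves, with placeholder hostnames and credentials (the binlog file and position come from SHOW MASTER STATUS on the master, and each server also needs a unique server-id in my.cnf):

```sql
-- On the master: an account the slave replicates through.
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'slave.example.com' IDENTIFIED BY 'secret';

-- On the slave: point it at the master and start replicating.
CHANGE MASTER TO
  MASTER_HOST='master.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='secret',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4;
START SLAVE;
```

The payoff is a warm copy of the database that limits data loss if the master's disk, or the virtual layer above it, eats a write.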
Edit 2: The report on the outage, which puts the impact at tens of thousands of clients:
http://newswire.net/newsroom/financial/00078336-bluehost-service-out.html