What happens when your RAM goes bad and slowly corrupts data over time, showing itself only once the errors have built up to critical mass? All the garbage is copied perfectly down the line, again and again, and your last clean backups will be months or even years old. Not testing anything is setting yourself up for failure. You're not the first and you won't be the last. In the end, there has to be a monkey checking that the recovery process output lines up with the input.
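That "monkey check" can be automated. A minimal sketch in Python (the function names and directory layout are my own invention, not any particular backup tool's API): hash every file in the original tree and in the restored tree, and report anything that doesn't line up.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large files never need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Return relative paths whose restored copy is missing or differs from the source."""
    mismatches = []
    for src in source_dir.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file() or sha256_of(src) != sha256_of(restored):
            mismatches.append(str(rel))
    return mismatches
```

Note this only proves the restore matches the source *as it is now* — which is exactly the limitation discussed below: if the source was already silently corrupted, the comparison still passes.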
Besides, anyone who isn't nervous about their backups is complacent and will come unstuck eventually.
Hi Mr S.
A bit more detail here. Overall I support your sentiment and ideals for 'testing' of backups and system maintainability (in an ideal world). I thoroughly stand by what I have said, though I have been tongue in cheek on some points - forgive me.
If it helps, what I found over my career in the 'business' was that what is 'ideal' in theory simply does not, and could not, work in practice. Because of this, one has to develop robust and sustainable strategies to maintain reliability, continuity and accuracy / availability of data - as well as the option of reversion to earlier states if needed, with the minimum of time delay and offline status. It would be very nice, in an ideal world, to be able to test every backup in a physical sense. This is, in practically all cases, simply not feasible.
Take for instance a scenario I have a lot of experience with: a 999 (911 to our Yank friends) emergency control room. If we look at the system as a whole, there may well be three slightly different versions of the system running in a single control room in an attempt to remove single points of failure, with deliberate non-commonality even within each of the 'semi-duplicate' main systems (as mentioned below). Along with this, there is also real-time logging to a failover site - from EACH copy of the system...
... so, for each 'failsafe-minimised' copy of the main system running (and again - this is repeated for EACH instance of the C&C!), let us say there is a main server, a failover server and appropriate storage attached however needed. The system runs various flavours of database, GIS (Geographic Information Systems), scripted advice systems (the scripts the operators rely on to advise patients where needed, depending on the qualification of the operator), a vehicle radio and crew audio / video interface system, a system status management module (an overwatch for all vehicle placements and call geographics, with historical-data-based prediction units for vehicle repositioning) and, of course, the main command and control interface. There are many more parts to the system as well, which we can simply refer to as management, information and monitoring systems.

The main command and control system is the main player here; let us say it is from company A. It is the overseer and controller / caller of the functionality of the various system parts. The database is from company B and, although called and commanded by the C&C (command and control), is independently managed by its own main application. The GIS is from company C and tightly integrated into the C&C; it also has a completely independent supervisor / overseer and, indeed, its own database storage system. This storage and management package may not be from the same supplier as database B! (Ideally yes; in the real world, often no.) Again, this is integrated at a high level into the C&C (A). The vehicle communication system is from company D, again heavily integrated into C&C A. It may well have a bespoke data logging system of its own. It will communicate with the physical radio comms stack, which in turn communicates with the various GSM or radio-based systems and their command and control networks.
The scripted patient advice system - let's call this E - is again tightly integrated into C&C A, but is in reality a separate module with a full internal management package and database (which, hopefully, is on a platform used by another part of the system, but often is not, as it may be bespoke). In the background is a stack of monitoring and overwatch systems. These systems tend to be so specialised that no one supplier provides a complete one-stop shop, or indeed could do that. They rely on many, many parts all working semi-autonomously, but under a master C&C.
As you can imagine, the backups for these systems are HUGELY complex and disparate and can involve many pathways and timings.
To add to all the above, the system also runs on hardware with an OS - or OSes! The hardware and OS are updated in line with needs, fixes and security patches where needed, and all of these aspects can impact the system. As you can imagine, there are also external links to the outside world via firewalls and gateways, each with its own OS, firmware etc.
It is simply IMPOSSIBLE to 'test' a backup of the system, even if one had a complete and identical setup and infrastructure. It would also be impossible to apply a full system load scenario to any restoration. If a data block is verified and hashed at backup time, it will restore so long as that verification is intact. If the data was corrupted prior to backup time and not flagged by the management software of that part of the system, it will be corrupted in the backup (but correctly hashed and verified) and will also restore 'correctly'! Unless the system is fully functional and operable you cannot evaluate it, and it is simply not possible to have a complete, identical, highly complex clone of the system to test with.

Many subsystems run their own off-site cloud backups that are specific to that subsystem and fully integrity checked, as the backup arm of the application is part and parcel of it. Other parts may use a specific backup application, whereas the database modules may use yet another backup application and system. Transaction logging may, and does, present complex multiple real-time backups.

Add to this that many of the workstations will run differing specifications and OS patch levels, and multiple failover servers will often run differing levels of system state, hardware and firmware - because identical OS, hardware, BIOS etc. that fails at a level below the main application may well suffer exactly the same failure on failsafe changeover: a classic single point of failure (SPF). This is why workstations are often split 50:50 as to hardware type and patch level of OS and drivers. A failure arising from an interaction of ANY part on a workstation may well affect ALL of them, unless one has the split hardware to remove another SPF. This is, of course, necessary to allow for, and attempt to achieve, the 'ideal world' removal of single points of failure. (Hah!) One can only try...
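The point about hash-time verification can be sketched in a few lines (a toy illustration in Python; the data and function names are invented, not any real backup product's API). The backup tool hashes whatever it is handed, so data corrupted *before* backup time still verifies perfectly and restores perfectly - as garbage:

```python
import hashlib

def backup(data: bytes) -> tuple[bytes, str]:
    """Take a copy and record the hash of the data *as it was handed over*."""
    return data, hashlib.sha256(data).hexdigest()

def restore(copy: bytes, recorded_hash: str) -> bytes:
    """Verify the copy against the hash recorded at backup time, then restore it."""
    assert hashlib.sha256(copy).hexdigest() == recorded_hash, "corrupted in storage"
    return copy

# Data silently corrupted BEFORE backup time:
corrupted = b"pat1ent rec0rd"   # should have been b"patient record"
copy, h = backup(corrupted)     # hashed and verified... of the corrupt data
restored = restore(copy, h)     # passes every integrity check
assert restored == corrupted    # and restores the garbage perfectly
```

The hash guarantees the copy matches its input; it says nothing about whether the input was sane.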
For that to be possible, the parts would all have to be fabricated from Unobtainium!
Again, to 'test' a backup (well, multiple interdependent backupS) is simply not feasible. One cannot take a system offline for a test restore - which would also be quite good fun (not), as the system would have to be wound down and backed up in all forms prior to the test restore, and then restored back to the state it was in before the test in the first place.
.......
Add to this that, without load, the system would be highly UNLIKELY to reveal any issues at all if data corruption occurred prior to backup (your RAM degradation hypothesis, for example - although there would be no non-ECC memory anywhere near such a system; even in the late 90s, a section or module of ECC memory could be taken offline automatically, in real time, if issues were flagged - all transparently by the server). Enterprise backup solutions are extremely (incredibly) good at flagging any discrepancy and run a huge number of checks on each data pack. Errors that do occur are almost certainly corruption or deletion PRIOR to backup, or corruption AFTER backup and AFTER verify, whilst in storage.
The same 'problems' as above, and the same real-world inability to 'test' backups, occur in many systems - such as ATC (Air Traffic Control) etc.
With this level of complexity and mission-critical - in fact, LIFE-critical, if you boil it down - status, it is simply NOT possible to test the integrity of backups after the fact. The multitude of protection systems must be in place to interrogate data integrity and structure PRIOR to backup time. Testing DURING backup is an added bonus.
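One way to interrogate integrity PRIOR to backup time - sketched here in Python with invented names, as a pattern rather than any particular product's mechanism - is for the application to checksum each record when it writes it (a known-good moment), then re-check those checksums as a gate before the backup tool is allowed to copy anything:

```python
import hashlib

def record_checksums(rows: dict[int, bytes]) -> dict[int, str]:
    """Checksum each record at write time, while the application knows it is good."""
    return {k: hashlib.sha256(v).hexdigest() for k, v in rows.items()}

def prebackup_check(rows: dict[int, bytes], checksums: dict[int, str]) -> list[int]:
    """Return ids whose current bytes no longer match the write-time checksum."""
    return [k for k, v in rows.items()
            if hashlib.sha256(v).hexdigest() != checksums.get(k)]

rows = {1: b"alpha", 2: b"bravo"}
sums = record_checksums(rows)
rows[2] = b"br@vo"                         # bit-rot between write time and backup time
assert prebackup_check(rows, sums) == [2]  # flagged BEFORE it poisons the backup
```

Unlike a hash computed at backup time, a checksum recorded at write time can catch corruption that crept in between the two - which is exactly the window that matters.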
Hope that helps. The real world is a beee-atch!! But it can be suitably tamed and controlled.
One may ask then - HOW does one restore in the event of a system failure? Well... firstly, I have never had a complete system failure, apart from a power issue in the mid-90s (it's all the fault of APC and a vacuum cleaner, ho hum)... the system can still function with multiple sections unavailable. There are redundant parts to most sections: the GIS is backed up by a secondary mapping system, the patient advice by crib cards on the desks, the radio system is fully redundant and backed up by personal cell units... etc. etc. The whole system can be offloaded to another site if needed.
How would we bring a system back up ? ..... it is like how a hedgehog screws... very carefully...
.. no, seriously, we bring it up one piece at a time, test each section, and then bring another unit online. The systems are never totally powered down either; thanks to the multiple (slightly different) copies of systems and the failovers, sections can be taken down and updated, or new updates tested, in relative safety. Minimising SPFs is the key, really. There is so much more to a system of such complexity than a single backup - a single 'backup' does not exist. One has to plan for remedial action after failure, but the key thing is to plan and invest in multiple redundancy and continuity in the first place. A bit like an airliner... and in the same way, when a system (well, subsystem) fails, you investigate and add redundancy along with a solution where possible. That's why every desk has a stack of note-taking forms, pencils and pens, a cellphone or two, a set of maps and mapbooks, torches etc., and that is why printers run continuously at the back of the room producing paper summaries of ongoing jobs... thank god we never needed them.
So, with the best will in the world, no, it is not essential to 'test' backups... providing the appropriate steps are taken.
Jeez, I am so glad I retired