Author Topic: Are your backups up to date?  (Read 10035 times)


Offline Freelander

  • Regular Contributor
  • *
  • Posts: 150
  • Country: 00
Re: Are your backups up to date?
« Reply #75 on: November 23, 2017, 06:10:44 pm »
That's funny. I have some family members who work close to NHS IT. It's a shitfest and a half. Glad I work in private sector finance.

They need to bring it all in house and run it like they ran Spine 2.

I would agree! When we had NHSnet (1) with BT and Cable and Wireless it was very, very good. For security there was a national team of security managers, each with a portion of the country to administer - mine was Northwest UK. ANYTHING that health authorities / trusts wanted to do with regard to connecting to other entities, and vice versa, had to go through one of us. There were effectively two national firewalls (BT and C&W), and even applying a rule came down to our NHSIA team member. We were also 'inspectors' and enforcers for total network security - all aspects, including backups and system integrity - and had the power to 'pull the plug' on a non-conforming trust (it was threatened but never actually needed to be used). We pen-tested the systems at our leisure (which was fun) - a great way to pass an evening and far better than 'telly'  ::)
Our favourite pastime was locking the Finance Director of a trust or HA out of his systems completely by cracking his password and then changing it. It was usually ridiculously simple and very worthwhile, because the Finance Directors were virtually always the IT directors as well (ridiculous, yes, but that's the way it was - they knew bugger all about IT but controlled the purse strings).
Many an IT department manager would contact us to... errrrr... how shall we put this... 'stitch up' the Finance Director when he refused to pay for a vital upgrade because he hadn't got a clue what it actually did or was for, and usually didn't believe the IT manager - these 'computer' things were expensive, it had worked OK for three years, and he ran Windows 95 at home with no worries... Oh, what joy when he had to call me or one of the other guys on the national team, because we could lock them out and keep them out for not following protocols. The money usually flowed in the right direction very rapidly from then on. A nice message on their screen saying something like 'please contact your Chief Executive or the NHSIA' always, for some reason, resulted in us being contacted rather than the Chief Exec.
I am glad I am out of it - I left in 2004, when they disbanded the agency. They reap what they sow; it's gone to ratshit now, as far as I can gather. Same old problems: lack of money and lack of wages. Prior to the NHSIA I managed IT at Lancs Ambulance Service. We were the highest paid IT staff in the country - by far (1995-2000). That was only because we had a chief exec with his head screwed on, and because we (there were only two of us originally) put forward concise business cases to update and develop the systems - and save costs! We were using MD Unix systems over 9600 baud serial and paying nearly 200 grand a year in maintenance. We factored in our wage costs and expenses against a complete change to NT / Compaq and won the case. We even got to be one of the main pilot schemes for what is now called NHS Direct. A million quid's worth of hardware and software, all specced, built and set up in house, and all the staff we needed to do the job - it was like being in a sweetie shop  :-DD. My assistant manager was paid more than the assistant finance director, and we never, ever had a staff retention problem. None of the staff were below MCSE qualification (we were Microsoft only). All training paid for in house.
Then the government came along and wanted to make a 'joined up' national system like something only a Jedi knight could envisage... which was, of course, the beginning of the end. Ah, but I digress... sad to hear it is still bad in many areas. :palm:
 

Offline Freelander

  • Regular Contributor
  • *
  • Posts: 150
  • Country: 00
Re: Are your backups up to date?
« Reply #76 on: November 23, 2017, 09:32:25 pm »

What happens when your RAM goes corrupt and slowly corrupts data over time, which only shows itself after a while when the errors build up to critical mass? All the garbage is copied perfectly down the line, again and again, and your last clean backups will be months or even years old. Not testing anything is setting yourself up for failure. You're not the first and won't be the last. In the end, there has to be a monkey checking that the recovery process output lines up with the input.

Besides, anyone not nervous about his backups is complacent and will fall eventually  :box:

Hi Mr S.
A bit more detail here. Overall I support your sentiment and ideals for 'testing' of backups and system maintainability (in an ideal world). I thoroughly stand by what I have said, though I have been tongue in cheek on some points - forgive me  ;).
If it helps, what I found over my career in the 'business' was that what is 'ideal' in theory simply does not and could not work in practice. Because of this, one has to develop robust and sustainable strategies to maintain reliability, continuity and accuracy / availability of data, as well as the option of reversion to earlier states if needed, with the minimum of delay and downtime. It would be very nice, in an ideal world, to be able to test every backup in a physical sense. This is, in practically all cases, simply not feasible.
Take, for instance, a scenario I have a lot of experience with: a 999 (911 to our Yank friends) emergency control room. Looking at the system as a whole, there may well be three slightly different versions of the system running in a single control room in an attempt to remove single points of failure, with deliberate non-commonality even within each of those 'semi-duplicate' main systems (as mentioned below). Along with this there is real-time logging to a failover site - from EACH copy of the system...  :phew:

So, for each 'failsafe-minimised' copy of the main systems (and again, this is repeated for EACH instance of the C&C!), let us say there is a main server, a failover server and whatever storage is attached. The system runs various flavours of database; GIS (Geographic Information Systems); scripted advice systems (the scripts the operators can rely on to advise patients where needed, depending on the qualification of the operator); a vehicle radio and crew audio / video interface; a system status management module (an overwatch of all vehicle placements and call geographics, with prediction units based on historical data for vehicle repositioning); and, of course, the main command and control interface. There are many more parts to the system that we can simply lump together as management, information and monitoring systems.

The main command and control system is the main player here - let us say it is from company A. It is the overseer and controller / caller of the various system parts. The database is from company B and, although called and commanded by the C&C (command and control), is independently managed by its own application. The GIS is from company C and is tightly integrated with the C&C; it also has a completely independent supervisor / overseer and, indeed, its own database storage system. That storage and management package may not be from the same supplier as database B (ideally yes; in the real world, often no). Again, this is integrated at a high level into C&C A. The vehicle communication system is from company D, again heavily integrated into C&C A; it may well have a bespoke data-logging system of its own. It communicates with the physical radio comms stack, which in turn talks to the various GSM or radio-based systems and their command and control networks. The scripted patient advice system - let's call it E - is again tightly integrated into C&C A, but is in reality a separate module with a full internal management package and database (which hopefully sits on a platform used by another part of the system, but often does not, as it may be bespoke). In the background is a stack of monitoring and overwatch systems.

These systems tend to be so specialised that no one supplier provides, or indeed could provide, a complete one-stop shop. They rely on many, many parts all working semi-autonomously, but under a master C&C.
As you can imagine, the backups for these systems are HUGELY complex and disparate and can involve many pathways and timings.
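To make that disparity a bit more concrete, here is a minimal Python sketch (purely illustrative - every subsystem name, method and interval below is hypothetical, not from any real control room) of the sort of inventory you end up keeping just to know which of the separate backup pathways has failed to complete inside its expected window:

[code]
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class BackupJob:
    """One backup pathway owned by one subsystem / vendor."""
    subsystem: str            # e.g. "C&C core (vendor A)", "GIS store (vendor C)"
    method: str               # e.g. "vendor agent", "transaction log ship"
    interval: timedelta       # how often this pathway is supposed to complete
    last_completed: datetime  # last known successful run

def overdue_jobs(jobs: List[BackupJob], now: datetime) -> List[BackupJob]:
    """Return pathways whose last successful run is older than their interval."""
    return [j for j in jobs if now - j.last_completed > j.interval]

if __name__ == "__main__":
    now = datetime(2017, 11, 23, 21, 0)
    jobs = [
        BackupJob("C&C core (vendor A)", "vendor agent", timedelta(hours=4),
                  datetime(2017, 11, 23, 18, 0)),
        BackupJob("Incident DB (vendor B)", "transaction log ship", timedelta(minutes=15),
                  datetime(2017, 11, 23, 20, 50)),
        BackupJob("GIS store (vendor C)", "file-level snapshot", timedelta(hours=24),
                  datetime(2017, 11, 22, 2, 0)),
        BackupJob("Radio / crew comms (vendor D)", "bespoke logger export", timedelta(hours=12),
                  datetime(2017, 11, 23, 6, 0)),
    ]
    for j in overdue_jobs(jobs, now):
        print(f"OVERDUE: {j.subsystem} via {j.method}")
[/code]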
To add to all the above, the system runs on hardware with an OS - or OSes! - and the hardware and OS are updated in line with needs, fixes and security patches; all of these aspects of the hardware and OS can impact the system. As you can imagine, there are also external links to the outside world via firewalls and gateways, each with their own OS, firmware, etc.
It is simply IMPOSSIBLE to 'test' a backup of the system, even if one had a complete and identical setup and infrastructure. It would also be impossible to apply a full-system load scenario to any restoration. If a data block is verified and hashed at backup time, it will restore as long as that verification is intact. If the data was corrupted prior to backup time and not flagged by the management software of that part of the system, it will be corrupted in the backup (but correctly hashed and verified) and will also restore correctly! Unless the system is fully functional and operable you cannot evaluate it, and it is simply not possible to have a complete, identical clone of such a highly complex system to test with.

Many subsystems run their own off-site cloud backups that are specific to that system and fully integrity-checked, because the backup arm of the application is part and parcel of it. Other parts may use one specific backup application, whereas the database modules may use another. Transaction logging can, and does, present complex multiple real-time backups. Add to this that many of the workstations run differing specifications and OS patch levels, and multiple failover servers will often run differing levels of system state / hardware / firmware, because identical OS, hardware, BIOS, etc. that fails at a level below the main application may well suffer exactly the same failure after a failsafe changeover - a classic single point of failure (SPF). This is why workstations are often split 50:50 by hardware type and by OS and driver patch level: a failure caused by an interaction of ANY part on a workstation may well affect ALL of them unless the hardware is split, removing another SPF. All of this is, of course, necessary to attempt the 'ideal world' removal of single points of failure (hah!) - one can only try... for that to be fully possible, the parts would all have to be fabricated from Unobtainium!
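To illustrate the hashing point in miniature (a simplified sketch, not how any particular enterprise product actually works): a digest recorded at backup time will happily verify data that was already corrupt before the backup ran, while it does catch corruption that happens in storage afterwards.

[code]
import hashlib

def take_backup(data: bytes):
    """'Back up' a block and record its SHA-256 digest at backup time."""
    return data, hashlib.sha256(data).hexdigest()

def restore_verifies(stored: bytes, recorded_digest: str) -> bool:
    """Verification proves the block is what was backed up, not that it was healthy."""
    return hashlib.sha256(stored).hexdigest() == recorded_digest

good = b"incident 12345: closed, crew returned"
silently_corrupted = b"incident 12345: closed, crew ret\x00rned"  # damaged BEFORE backup ran

# Case 1: corruption before backup -> digest matches, restore 'succeeds', data is still bad.
stored, digest = take_backup(silently_corrupted)
print(restore_verifies(stored, digest))   # True  - verified garbage

# Case 2: corruption after backup, while in storage -> the verify pass flags it.
stored, digest = take_backup(good)
rotted = stored[:-1] + b"?"
print(restore_verifies(rotted, digest))   # False - discrepancy caught
[/code]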
Again, to 'test' a backup (well, multiple interdependent backups) is simply not feasible. One cannot take a system offline for a test of a backup restore, which would also be quite good fun (not): the system would have to be wound down and backed up in all its forms prior to the test restore, and then restored again back to the state it was in before the whole exercise started. :wtf:  :palm: Add to this that, without load, the system would be highly UNLIKELY to reveal any issues at all if data corruption had occurred prior to backup (your RAM degradation hypothesis, for example - although there would be no non-ECC memory anywhere near that system; even in the late 90s, a section or module of ECC memory could be taken offline automatically, in real time, if issues were flagged, all handled transparently by the server). Enterprise backup solutions are extremely (incredibly) good at flagging any discrepancy and run a huge number of checks on each data pack. The errors that do occur are almost certainly corruption or deletion PRIOR to backup, or corruption AFTER backup and AFTER verify, whilst in storage.
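A periodic 'scrub' pass is what catches that second class of error (rot or deletion while in storage, after a clean verify): re-hash each stored pack against the digest recorded when the backup was taken. A rough sketch under assumed names, not any vendor's actual mechanism - and note that, as above, it still cannot tell you whether the data was healthy before it was ever backed up.

[code]
import hashlib
import os
from typing import Dict, List

def scrub(catalogue: Dict[str, str], pack_dir: str) -> List[str]:
    """Re-hash every stored pack; report those that no longer match the digest
    recorded at backup time (i.e. corruption or deletion while in storage)."""
    problems = []
    for name, recorded_digest in catalogue.items():
        path = os.path.join(pack_dir, name)
        if not os.path.exists(path):
            problems.append(f"{name}: missing from storage")
            continue
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != recorded_digest:
                problems.append(f"{name}: digest mismatch (in-storage corruption)")
    return problems
[/code]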
The same 'problems' as above, and the same real-world inability to 'test' backups, occur in many other systems - ATC (air traffic control), for example.
With this level of complexity and mission-critical status - in fact, LIFE-critical systems if you boil it down - it is simply NOT possible to test the integrity of backups after the fact. The multitude of protection systems must be in place to interrogate data integrity and structure PRIOR to backup time. Verification DURING backup is an added bonus.
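As a toy illustration of 'interrogate PRIOR to backup' (all the check names here are hypothetical stand-ins for real application-level checks): gate the backup job on a set of integrity checks, so a known-bad state never becomes tonight's 'verified' copy.

[code]
from typing import Callable, List, Tuple

def run_backup_guarded(checks: List[Tuple[str, Callable[[], bool]]],
                       take_backup: Callable[[], None]) -> bool:
    """Run every integrity check first; only take the backup if all of them pass."""
    failures = [name for name, check in checks if not check()]
    if failures:
        print("Backup ABORTED - failed checks:", ", ".join(failures))
        return False
    take_backup()
    return True

if __name__ == "__main__":
    # Hypothetical stand-ins for real checks (DB consistency scan, record counts
    # against yesterday's trend, GIS tile index audit, ...).
    checks = [
        ("db_referential_integrity",  lambda: True),
        ("record_count_vs_yesterday", lambda: True),
        ("gis_index_audit",           lambda: False),   # simulate a failed check
    ]
    run_backup_guarded(checks, lambda: print("backup taken"))
[/code]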
Hope that helps. The real world is a beee-atch!! But it can be suitably tamed and controlled.

One may ask, then: HOW does one restore in the event of a system failure? Well, firstly, I never had a complete system failure apart from a power issue in the mid 90s (it's all the fault of APC and a vacuum cleaner, ho hum). The system can still function with multiple sections unavailable. There are redundant parts to most sections: the GIS is backed up by a secondary mapping system, the patient advice is backed up by crib cards on the desks, the radio system is fully redundant and backed up by personal cell units, etc., etc. The whole system can be offloaded to another site if needed.
How would we bring a system back up? It's like how a hedgehog screws... very carefully... :popcorn: No, seriously: we bring it up one piece at a time, test each section, and then bring the next unit online. The systems are never totally powered down either; thanks to the multiple systems (slightly different copies) and failovers, sections can be taken down and updated, or new updates tested, in relative safety. Minimising SPFs is the key, really. There is so much more to a system of this complexity than a single backup - a single 'backup' does not exist. One has to plan for remedial action after failure, but the key thing is to plan and invest in multiple redundancy and continuity in the first place. A bit like an airliner... and in the same way, when a system (well, subsystem) fails, you investigate and add redundancy along with a solution where possible. That's why every desk has a stack of note-taking forms, pencils and pens, a cellphone or two, a set of maps and mapbooks, torches, etc., and that's why printers run continuously at the back of the room producing summary paper backups of ongoing jobs... thank god we never needed them.  :o
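For what it's worth, that 'one piece at a time' bring-up boils down to an ordered sequence with a health check gating each step. A sketch only, with a hypothetical ordering - the real sequence depends entirely on the system in question:

[code]
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], None], Callable[[], bool]]  # name, start, health check

def staged_bring_up(stages: List[Stage]) -> bool:
    """Start each subsystem in order; halt the moment one fails its health check,
    rather than lighting the whole lot at once."""
    for name, start, healthy in stages:
        print(f"starting {name} ...")
        start()
        if not healthy():
            print(f"{name} failed its health check - halting bring-up here")
            return False
        print(f"{name} OK")
    return True

if __name__ == "__main__":
    noop, ok = (lambda: None), (lambda: True)   # placeholders for real start/check hooks
    stages = [
        ("database cluster",            noop, ok),
        ("GIS / mapping",               noop, ok),
        ("radio & crew comms",          noop, ok),
        ("command & control front end", noop, ok),
    ]
    staged_bring_up(stages)
[/code]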

So, with the best will in the world, no, it is not essential to 'test' backups... provided the appropriate steps are taken.

Jeez, I am so glad I retired   :phew:
 

