This article is a story from my past. I used to work as a sysadmin for a company that ran an online shop that sold computer hardware and software.
In the back, dozens of employees used a terminal server to work with the ERP software that managed all goods and trade. Terminal servers and databases were critical for the business process of selling our products. When one of these systems failed, dozens of employees couldn't do their work, and customers couldn't buy anything anymore.
How long do we have?
So we sat down with the owner of the business process and asked: "Well, think about a situation where the system is broken and not usable anymore. How long do you have to recover until the damage to the company would be devastating?" The manager thought about it for some time and said: "Two hours, max!"
We sat down with the guys running the ERP application and asked them: "How long would it take to reinstall the application and configure the terminal server if we had to recover from a fatal system failure?" The answer was: "Two days, minimum!" I guess you can see the gap here.
For me, the first priority was to bring down the restore time from two days to under two hours. Reinstalling the operating system and applications and then restoring data backups wasn't an option because it would take too long. I chose an image backup application that can perform a complete system backup, meaning the operating system, installed applications, libraries, and data. Using this software, I would be able to restore the system without any manual installation tasks.
What do we need?
A classic setup is to run the backup at night when no one is working. In case of failure, we would have to restore the image from last night or maybe an even older version. In this scenario, a test showed that we needed five hours for a complete restore. That's better than two days but still not good enough. And there is another issue hidden in this approach: if the system crashed in the evening and we had to restore from last night's backup, we would lose all changes from the current day. That was not acceptable for our online shop. Our first thought was to run an incremental backup every hour. In that case, we would lose only the last hour of data. But what happens when the issue that breaks our system is already in the backup image by the time we discover the problem? Well, here is what we did.
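To make the trade-off between nightly and hourly backups concrete, here is a minimal Python sketch of the arithmetic. The function name `data_lost` and the example timestamps are made up for illustration; they are not part of any backup product.

```python
from datetime import datetime, timedelta

def data_lost(failure_time, backup_times):
    """Return how much work is lost when restoring the newest backup
    taken at or before the failure."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        raise ValueError("no backup exists before the failure")
    return failure_time - max(usable)

# Hypothetical day: one nightly backup at 01:00 versus a backup every full hour.
day = datetime(2024, 1, 10)
nightly = [day + timedelta(hours=1)]
hourly = [day + timedelta(hours=h) for h in range(24)]

# The system crashes in the evening, at 18:30.
failure = day + timedelta(hours=18, minutes=30)

print("nightly backup, work lost:", data_lost(failure, nightly))  # 17:30:00
print("hourly backup, work lost: ", data_lost(failure, hourly))   # 0:30:00
```

With a single nightly backup, an evening crash costs most of the working day. With hourly increments, the loss is bounded by one hour, as long as the newest backup is actually usable.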
We sat down again and discussed how long it takes to discover a devastating issue that would lead to a recovery of the system. We agreed on a timeframe of four hours. In summary, we wanted to lose as little work as possible, restore in two hours at most, and make sure that an issue wouldn't already be in the backup when we discovered it. With this information, I set up a backup approach that software vendors refer to as instant restore, shadow restore, preemptive restore, or a similar term. We ran incremental backup jobs every hour and restored the backups in the background to a new virtual machine. Every full hour, we had a standby system ready that was four hours back in time and just needed to be finished. So if I chose to restore the incremental from one hour ago, it would take less time than a complete system restore because only the small increments had to be restored to the almost-ready virtual machine.
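The logic behind this setup can be sketched in a few lines of Python. This is only my simplified model of the idea, not the behavior of any particular vendor's product. The function `plan_restore`, the constant `DETECTION_WINDOW`, and the timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Assumption: the four-hour detection window agreed on in the story.
DETECTION_WINDOW = timedelta(hours=4)

def plan_restore(now, fault_introduced_at, increments):
    """Return (staged_base, increments_to_apply).

    staged_base is the state the standby VM already holds: the newest
    increment that is at least DETECTION_WINDOW old. increments_to_apply
    are the newer increments taken before the fault was introduced.
    """
    cutoff = now - DETECTION_WINDOW
    old_enough = [t for t in increments if t <= cutoff]
    if not old_enough:
        raise ValueError("no increment is old enough to serve as staged base")
    base = max(old_enough)
    if fault_introduced_at <= base:
        # The fault went unnoticed for longer than the detection window.
        raise ValueError("fault predates the staged base; restore an older image")
    to_apply = sorted(t for t in increments if base < t < fault_introduced_at)
    return base, to_apply

# Hypothetical incident: hourly increments from 00:00 to 14:00,
# a bad update at 12:15, and the problem noticed at 14:20.
day = datetime(2024, 1, 10)
increments = [day + timedelta(hours=h) for h in range(15)]
now = day + timedelta(hours=14, minutes=20)
fault = day + timedelta(hours=12, minutes=15)

base, to_apply = plan_restore(now, fault, increments)
restore_point = to_apply[-1] if to_apply else base
print("staged base:  ", base)                 # 2024-01-10 10:00:00
print("restore point:", restore_point)        # 2024-01-10 12:00:00
print("work lost:    ", now - restore_point)  # 2:20:00
```

In this made-up incident, only two small increments have to be applied on top of the standby machine, which is why finishing the restore is so much faster than restoring a complete image from scratch.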
And the effort paid off
One day, I was on vacation, having a barbecue and some beer, when I got a call from my colleague. He told me that the terminal server with the ERP application was broken due to a failed update, and that the guy who ran the update had forgotten to take a snapshot first.
The only thing I needed to tell my colleague was to shut down the broken machine, find the UI of our backup/restore system, and then identify the restore job. Finally, I told him how to choose the timestamp from the last four hours at which the restore should finish. The restore finished 30 minutes later, and the system was ready to be used again. We were back in action after a total of 30 minutes, and only the work from the last two hours or so was lost! Awesome! Now, back to vacation.
So keep in mind: time and money for a well-suited backup and recovery solution are well spent when it comes to an emergency restore.