This article is a story from my past. I used to work as a sysadmin for a company that ran an online shop that sold computer hardware and software.
In the back, dozens of employees used a terminal server to work with the ERP software that managed all goods and trade. Terminal servers and databases were critical for the business process of selling our products. When one of these systems failed, dozens of employees couldn't do their work, and customers couldn't buy anything anymore.
How long do we have?
So we sat down with the owner of the business process and asked: "Well, think about a situation where the system is broken and not usable anymore. How long do you have to recover until the damage to the company would be devastating?" The manager thought about it for some time and said: "Two hours, max!"
We sat down with the guys running the ERP application and asked them: "How long would it take to reinstall the application and configure the terminal server if we had to recover from a fatal system failure?" The answer was: "Two days, minimum!" I guess you can see the gap here.
For me, the first priority was to bring down the restore time from two days to under two hours. Reinstalling the operating system and applications and then restoring data backups wasn't an option because it would take too long. I chose an image backup application that can perform a complete system backup, meaning the operating system, installed applications, libraries, and data. Using this software, I would be able to restore the system without any manual installation tasks.
What do we need?
A classic setup is to run the backup at night when no one is working. In case of failure, we would have to restore the image from last night or maybe an even older version. A test of this scenario showed that we needed five hours for a complete restore. That's better than two days but still not good enough. And there is another issue hidden in this approach: if the system crashed in the evening and we had to restore from last night's backup, we would lose all changes from the current day. That was not acceptable for our online shop. Our first thought was to run an incremental backup every hour. In that case, we would lose only the last hour of data. But what happens when the issue that breaks our system is already in the backup image by the time we discover the problem? Answering that question led to the approach we finally chose.
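To put rough numbers on the trade-off between the nightly and the hourly schedule, here is a minimal Python sketch. The targets and measurements come from the story; the function names are made up for illustration, and the worst case assumes that a failure hits just before the next backup would have run.

```python
# Targets and measurements taken from the story. The function names are
# illustrative only and do not belong to any backup product.
MAX_DOWNTIME_HOURS = 2            # "Two hours, max!"
MANUAL_REINSTALL_HOURS = 48       # "Two days, minimum!"
NIGHTLY_IMAGE_RESTORE_HOURS = 5   # measured in the restore test


def worst_case_data_loss_hours(backup_interval_hours: float) -> float:
    """Work lost if the system fails just before the next backup runs.

    Assumption: backups run at a fixed interval, so at worst one full
    interval of changes is missing from the newest backup.
    """
    if backup_interval_hours <= 0:
        raise ValueError("backup interval must be positive")
    return backup_interval_hours


def meets_downtime_target(restore_hours: float, max_downtime_hours: float) -> bool:
    """True if the restore fits into the downtime the business can tolerate."""
    return restore_hours <= max_downtime_hours


if __name__ == "__main__":
    print("nightly backup, worst-case loss (h):", worst_case_data_loss_hours(24))
    print("hourly backup, worst-case loss (h):", worst_case_data_loss_hours(1))
    print("manual reinstall within target:",
          meets_downtime_target(MANUAL_REINSTALL_HOURS, MAX_DOWNTIME_HOURS))
    print("nightly image restore within target:",
          meets_downtime_target(NIGHTLY_IMAGE_RESTORE_HOURS, MAX_DOWNTIME_HOURS))
```

The sketch only covers data loss and restore time. It says nothing about whether the fault itself is already part of the newest backup, which is the harder question.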
We sat down again and discussed how long it takes to discover a devastating issue that would lead to a recovery of the system. We agreed on a timeframe of four hours. In summary, we wanted to lose as little work as possible, restore in two hours at most, and make sure that an issue wouldn't already be in the backup when we discovered it. With this information, I set up a backup approach that software vendors refer to as instant restore, shadow restore, preemptive restore, or a similar term. We ran incremental backup jobs every hour and restored the backups in the background to a new virtual machine. Every hour on the hour, we had a system ready that was four hours back in time and just needed to be finished. So if I chose to restore the incremental from one hour ago, it would take less time than a complete system restore because only the small increments had to be restored to the almost-ready virtual machine.
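The idea can be sketched in a few lines of Python. This is a simplified model under my own assumptions, not the logic of any particular backup product: backups land exactly on the hour, the standby virtual machine always sits four hours behind the newest backup, and all function names and the example date are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional

BACKUP_INTERVAL = timedelta(hours=1)    # hourly incremental backups
DETECTION_WINDOW = timedelta(hours=4)   # agreed time to discover an issue


def standby_state(last_backup: datetime) -> datetime:
    """Point in time that the pre-restored standby VM represents."""
    return last_backup - DETECTION_WINDOW


def restore_points(last_backup: datetime) -> List[datetime]:
    """Standby state plus every hourly increment up to the newest backup."""
    steps = int(DETECTION_WINDOW / BACKUP_INTERVAL)
    base = standby_state(last_backup)
    return [base + i * BACKUP_INTERVAL for i in range(steps + 1)]


def latest_clean_point(last_backup: datetime,
                       fault_introduced: datetime) -> Optional[datetime]:
    """Newest restore point taken before the fault, or None if there is none."""
    clean = [p for p in restore_points(last_backup) if p < fault_introduced]
    return clean[-1] if clean else None


def increments_to_apply(last_backup: datetime, target: datetime) -> List[datetime]:
    """Increments still to be replayed onto the standby VM to reach target."""
    points = restore_points(last_backup)
    if target not in points:
        raise ValueError("target is not an available restore point")
    # points[0] is the state the standby VM already has, so skip it.
    return [p for p in points[1:] if p <= target]


if __name__ == "__main__":
    last_backup = datetime(2024, 1, 1, 14, 0)       # hypothetical example date
    fault = datetime(2024, 1, 1, 12, 30)            # e.g. a failed update
    target = latest_clean_point(last_backup, fault)
    print("restore to:", target)
    print("increments to replay:", len(increments_to_apply(last_backup, target)))
```

The point of the model is that the expensive part, the full image restore, has already happened in the background. An emergency restore only replays at most four small increments, and any restore point taken before the fault remains selectable as long as the fault is noticed within the four-hour window.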
And the effort paid off
One day, I was on vacation, having a barbecue and some beer, when I got a call from my colleague telling me that the terminal server with the ERP application was broken due to a failed update, and the guy who ran the update had forgotten to take a snapshot first.
The only thing I needed to tell my colleague was to shut down the broken machine, find the UI of our backup/restore system, and then identify the restore job. Finally, I told him how to choose the timestamp within the last four hours to which the system should be restored. The restore finished 30 minutes later, and the system was ready to be used again. We were back in action, and only the work from the last two hours or so was lost! Awesome! Now, back to vacation.
Wrap up
So keep in mind: time and money for a well-suited backup and recovery solution are well spent when it comes to an emergency restore.
About the author
Jörg has been a sysadmin for over ten years now. His fields of operation include Virtualization (VMware), Linux System Administration and Automation (RHEL), Firewalling (Forcepoint), and Load Balancing (F5). He is a member of the Red Hat Accelerators Community and author of his personal blog at https://www.my-it-brain.de.