This article is a story from my past. I used to work as a sysadmin for a company that ran an online shop that sold computer hardware and software.
In the back, dozens of employees used a terminal server to work with the ERP software that managed all goods and trade. Terminal servers and databases were critical for the business process of selling our products. When one of these systems failed, dozens of employees couldn't do their work, and customers couldn't buy anything anymore.
How long do we have?
So we sat down with the owner of the business process and asked: "Well, think about a situation where the system is broken and not usable anymore. How long do you have to recover until the damage to the company would be devastating?" The manager thought about it for some time and said: "Two hours, max!"
We then sat down with the guys running the ERP application and asked them: "How long would it take to reinstall the application and configure the terminal server if we had to recover from a fatal system failure?" The answer was: "Two days, minimum!" I guess you can see the gap here.
For me, the first priority was to bring the restore time down from two days to under two hours. Reinstalling the operating system and applications and then restoring data backups wasn't an option because it took too long. I chose an image backup application that can perform a complete system backup, meaning the operating system, installed applications, libraries, and data. Using this software, I would be able to restore the system without any manual installation tasks.
What do we need?
A classic setup is to run the backup at night when no one is working. In case of failure, we would have to restore the image from last night or maybe an even older version. In this scenario, a test showed that we needed five hours for a complete restore. That's better than two days but still not good enough. And there is another issue hidden in this approach: If the system crashed in the evening and we had to restore from last night's backup, we would lose all changes from the current day. That was not acceptable for our online shop.
Our first thought was to run an incremental backup every hour. In that case, we would lose only the last hour of data. But what happens when the issue that breaks our system is already in the backup image by the time we discover the problem? Well, here is what we did.
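The trade-off above can be put into numbers. The following is a minimal sketch, not the tooling we used: the two-hour target, the two-day reinstall, and the five-hour restore come from the story, while the 01:00 backup time and the 18:00 crash time in the usage note are made-up values for illustration.

```python
# Figures quoted in the story.
MAX_DOWNTIME_H = 2.0      # "Two hours, max!" from the process owner
REINSTALL_H = 48.0        # "Two days, minimum!" for a manual reinstall
NIGHTLY_RESTORE_H = 5.0   # measured complete restore from the nightly image


def downtime_gap_h(restore_hours: float) -> float:
    """Hours by which a restore overshoots the downtime target (0.0 if it fits)."""
    return max(0.0, restore_hours - MAX_DOWNTIME_H)


def data_loss_h(failure_h: float, first_backup_h: float, interval_h: float) -> float:
    """Hours of work lost when restoring the most recent backup.

    failure_h      -- time of the crash, in hours since midnight
    first_backup_h -- time of day of one backup run (hypothetical schedule)
    interval_h     -- hours between two backup runs (24 = nightly, 1 = hourly)
    """
    # Time elapsed since the last backup run before the crash.
    return (failure_h - first_backup_h) % interval_h
```

With a hypothetical nightly run at 01:00 and a crash at 18:00, `data_loss_h(18.0, 1.0, 24.0)` gives 17 hours of lost work, and `downtime_gap_h(NIGHTLY_RESTORE_H)` shows the restore still overshoots the target by three hours. Hourly backups shrink the data loss to under an hour but do nothing for the restore time on their own.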
We sat down again and discussed how long it would take to discover a devastating issue that would force us to recover the system. We agreed on a timeframe of four hours. In summary, we wanted to lose as little work as possible, restore in two hours at most, and make sure that an issue wouldn't already be in the backup when we discovered it. With this information, I set up a backup approach that software vendors refer to as instant restore, shadow restore, preemptive restore, or a similar term. We ran incremental backup jobs every hour and restored the backups in the background to a new virtual machine. At each full hour, we had a system ready that was four hours back in time and only needed the remaining increments applied to finish it. So if I chose to restore the incremental from one hour ago, it would take less time than a complete system restore because only the small increments had to be restored to the almost-ready virtual machine.
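The selection logic behind this setup can be sketched in a few lines. This is an illustration under assumptions, not the backup product's API: the function names are hypothetical, backups are assumed to run exactly on the full hour, and the staged virtual machine is assumed to sit at the oldest restore point in the four-hour window.

```python
from datetime import datetime, timedelta
from typing import List

DISCOVERY_WINDOW = timedelta(hours=4)  # agreed time to notice a fatal issue
BACKUP_INTERVAL = timedelta(hours=1)   # hourly incremental backups


def hourly_restore_points(now: datetime) -> List[datetime]:
    """Restore points inside the discovery window, oldest first.

    The first entry is the state the staged VM is already restored to.
    """
    last_full_hour = now.replace(minute=0, second=0, microsecond=0)
    steps = DISCOVERY_WINDOW // BACKUP_INTERVAL  # 4 hourly steps
    return [last_full_hour - i * BACKUP_INTERVAL for i in range(steps, -1, -1)]


def choose_restore_point(points: List[datetime], issue_started: datetime) -> datetime:
    """Pick the newest restore point taken before the issue appeared."""
    clean = [p for p in points if p < issue_started]
    if not clean:
        raise ValueError("issue predates every staged restore point")
    return max(clean)


def increments_to_apply(points: List[datetime], target: datetime) -> int:
    """Number of hourly increments to replay on top of the staged VM."""
    return (target - points[0]) // BACKUP_INTERVAL
```

As a made-up example: if a bad update ran at 12:45 and the failure was noticed at 14:20, the restore points are 10:00 through 14:00, the newest clean one is 12:00, and only two increments have to be replayed on the staged machine instead of a complete image.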
And the effort paid off
One day, I was on vacation, having a barbecue and some beer, when I got a call from my colleague. He told me that the terminal server with the ERP application was broken due to a failed update, and the guy who ran the update had forgotten to take a snapshot first.
All I needed to tell my colleague was to shut down the broken machine, open the UI of our backup/restore system, and find the restore job. Then I explained how to choose the timestamp, within the last four hours, at which the restore should finish. Thirty minutes later, the restore was done and the system was ready to be used again. Only the work from the last two hours or so was lost! Awesome! Now, back to vacation.
Wrap up
So keep in mind: Time and money for a well-suited backup and recovery solution are well spent when it comes to an emergency restore.
About the author
Jörg has been a Sysadmin for over ten years now. His fields of operation include Virtualization (VMware), Linux System Administration and Automation (RHEL), Firewalling (Forcepoint), and Loadbalancing (F5). He is a member of the Red Hat Accelerators Community and author of his personal blog at https://www.my-it-brain.de.