It was a dark and stormy summer afternoon in Denver…
I was working on several projects simultaneously for a small company that had been carved out of a larger one that had gone out of business. The smaller company had inherited some of the bigger company's infrastructure, and all the headaches along with it. That day, I had some additional consultants working with me on a project to migrate email service from a large proprietary onsite cluster to a cloud provider, while at the same time, I was working on reconfiguring a massive storage array.
At some point, I clicked the wrong button.
All of a sudden, I started getting calls. The CIO and the consultants were standing in front of my desk. The email servers were effectively down: they responded, but could not access the backing storage. I didn't know it yet, but I had deleted the storage pool for the active email servers.
My vision blurred into a tunnel, and my stomach fell into a bottomless pit. I struggled to breathe. I did my best to maintain a poker face as the executives and consultants watched impatiently. I scanned logs and messages looking for clues. I ran tests on all the components to find the source of the issue and came up with nothing. The data seemed to be gone, and panic was setting in.
I pushed back from the desk and excused myself to use the restroom. Closing and latching the door behind me, I contemplated my fate for a moment, then splashed cold water on my face and took a deep breath. Then it dawned on me: earlier, I had set up an active mirror of that storage pool. The data was all there; I just needed to reconnect it.
I returned to my desk and couldn't help a bit of a smirk. A couple of commands, a couple of clicks, and a sip of coffee. About five minutes of testing, and I could say, "Sorry, guys. Should be good now." The whole thing had happened in about 30 minutes.
We've all been there
Everyone makes mistakes, even the most senior and venerable engineers and systems administrators. We're all human. It just so happens that, as a sysadmin, a small mistake can cause very visible problems in a moment, and with them, panic. This is normal, though. What separates the hero from the unemployed in that moment can be just a few simple things.
When an incident occurs, it can be tempting to focus on who's at fault; blame is something we know how to do, and it can even offer some relief if we can tell ourselves it's not our fault. But blame accomplishes nothing, and in a moment of crisis it's counterproductive: it distracts us from finding a solution and creates even more stress.
Backups, backups, backups
This is just one of the times when having a backup saved the day for me, and for a client. Every sysadmin I've ever worked with will tell you the same thing—always have a backup. Do regular backups. Make backups of configurations you are working on. Make a habit of creating a backup as the first step in any project. There are some great articles here on Enable Sysadmin about the various things you can do to protect yourself.
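One way to make "backup first" a reflex is to script the habit. Here's a minimal sketch in Python that copies a config file to a timestamped backup before you touch it. The Postfix path is just a hypothetical example, and this is a seatbelt for ad hoc edits, not a substitute for real, scheduled, off-host backups:

```python
#!/usr/bin/env python3
"""Minimal sketch: copy a config file to a timestamped backup before editing it."""
import shutil
from datetime import datetime
from pathlib import Path


def backup_before_editing(config: Path) -> Path:
    """Copy `config` next to itself with a timestamp suffix; return the copy's path."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = config.with_name(f"{config.name}.{stamp}.bak")
    shutil.copy2(config, backup)  # copy2 also preserves permissions and timestamps
    return backup


if __name__ == "__main__":
    # Hypothetical path -- substitute the file you are actually about to change.
    saved = backup_before_editing(Path("/etc/postfix/main.cf"))
    print(f"Rollback copy at {saved}; safe to start editing.")
```

Five seconds of typing up front, and the "undo button" exists before you need it.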
Another good practice is to never work on production systems until you have tested the change. This may not always be possible, but when it is, the extra time and effort are well worth it on the rare occasion that a change produces an unexpected result; it spares you the panic of wondering where you saved your most recent resume. Having a plan and being prepared can go a long way toward avoiding these very stressful situations.
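What "test first" looks like varies by shop, but the shape is usually the same: stage, verify, then promote. Here's a minimal sketch of that gate; the hostnames are made up, and deploy() and smoke_test() are placeholders for whatever tooling you actually use:

```python
#!/usr/bin/env python3
"""Minimal sketch of a "staging first" gate: verify a change on a test host
before it ever touches production."""
import sys

# Hypothetical hostnames -- substitute your own staging and production systems.
STAGING = "mail-staging.example.com"
PRODUCTION = "mail01.example.com"


def deploy(host: str) -> None:
    """Placeholder: push the change to `host` with your real tooling (Ansible, ssh, etc.)."""
    print(f"Deploying change to {host}")


def smoke_test(host: str) -> bool:
    """Placeholder: return True only if the service still behaves on `host`."""
    print(f"Running smoke tests against {host}")
    return True


if __name__ == "__main__":
    deploy(STAGING)
    if not smoke_test(STAGING):
        sys.exit("Staging failed; production was never touched, so no panic required.")
    deploy(PRODUCTION)
```

The point isn't the tooling; it's that production is unreachable until something else has survived the change first.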
Breathe in, breathe out
The panic response in humans is related to the "fight or flight" reflex, which served our ancestors so well. It's a really useful resource for avoiding saber-toothed tigers (and angry CFOs), but not so much for understanding and solving complex technical problems. Once we understand that panic is normal but not really helpful, we can recognize it and find a way to overcome it in the moment.
The simplest way to tame the impulse to black out and flee is to take a deep breath (or several). Studies have shown that simple breathing exercises and meditation can improve our general outlook and our ability to focus on a specific task. There is also evidence that temperature changes can make a difference; something as simple as a splash of water on the face or an ice-cold beverage can calm a panic. These things work for me.
Walk the path of troubleshooting, one step at a time
Once we have convinced ourselves that the world is not going to end immediately, we can focus on solving the problem. Take the situation one element, one step at a time to find what went wrong, then apply the solution(s) systematically. Again, it's important to focus on the problem and solution in front of you rather than worrying about things you can't do anything about right now or what might happen later. Remember, blame is not helpful, and that includes blaming yourself.
Most often, when I focus on the problem this way, I find that I forget to panic, and I do even better work on the solution. Many times, I have found solutions in this state that I wouldn't otherwise have seen or thought of.
Take five
Another thing that's easy to forget when you've been working on a problem: give yourself a break. Drink some water. Take a short walk. Rest your brain for a couple of minutes. Hunger, thirst, and fatigue can lead to less clear thinking and, you guessed it, panic.
Time to face the music
My last piece of advice, though certainly not the least important: if you are responsible for an incident, be honest about what happened. It will benefit you in both the short and the long term.
During the early years of the space program, the directors and engineers at NASA established a routine of getting together to go over what went wrong and how to improve for the next time. The same thing happens in the military, emergency management, and healthcare fields. It's also considered good agile/DevOps practice. Some of the smartest, highest-strung engineers, administrators, and managers I've known and worked with, people with millions of dollars and thousands of lives in their area of responsibility, have insisted on the importance of learning lessons from mistakes and incidents. It's a mark of a true professional to own up to mistakes and work to improve.
It's hard to lose face, but your colleagues will appreciate you taking responsibility and working to improve the team, and I promise you will rest better and handle the next problem better if you look at these situations as learning opportunities.
Accidents and mistakes can't ever be avoided entirely, but hopefully, you will find some of this advice useful the next time you face an unexpected challenge.
About the author
Glen Newell has been solving problems with technology for 20 years. As a Systems Engineer and administrator, he’s built and managed servers for Web Services, Healthcare, Finance, Education, and a wide variety of enterprise applications. He’s been working with and promoting open source technologies and methods for his entire career and loves to share what he learns and help people understand technology.