Sysadmin tales: How to keep calm and not panic when things break
It was a dark and stormy summer afternoon in Denver…
I was working on several projects simultaneously for a small company that had been carved out of a larger one that had gone out of business. The smaller company had inherited some of the bigger company's infrastructure, and all the headaches along with it. That day, I had some additional consultants working with me on a project to migrate email service from a large proprietary onsite cluster to a cloud provider, while at the same time, I was working on reconfiguring a massive storage array.
At some point, I clicked the wrong button.
All of a sudden, I started getting calls. The CIO and the consultants were standing in front of my desk. The email servers were completely offline—they responded, but could not access the backing storage. I didn't know it yet, but I had deleted the storage pool for the active email servers.
My vision blurred into a tunnel, and my stomach fell into a bottomless pit. I struggled to breathe. I did my best to maintain a poker face as the executives and consultants watched impatiently. I scanned logs and messages looking for clues. I ran tests on all the components to find the source of the issue and came up with nothing. The data seemed to be gone, and panic was setting in.
I pushed back from the desk and excused myself to use the restroom. Closing and latching the door behind me, I contemplated my fate for a moment, then splashed cold water on my face and took a deep breath. Then it dawned on me: earlier, I had set up an active mirror of that storage pool. The data was all there; I just needed to reconnect it.
I returned to my desk and couldn't help a bit of a smirk. A couple of commands, a couple of clicks, and a sip of coffee. About five minutes of testing, and I could say, "Sorry, guys. Should be good now." The whole thing had happened in about 30 minutes.
We've all been there
Everyone makes mistakes, even the most senior and venerable engineers and systems administrators. We're all human. It just so happens that, as a sysadmin, a small mistake in a moment can cause very visible problems, and, PANIC. This is normal, though. What separates the hero from the unemployed in that moment, can be just a few simple things.
When an incident occurs, focusing on who's at fault can be tempting; blame is something we know how to do and can do something about, and it can even offer some relief if we can tell ourselves it's not our fault. But in fact, blame accomplishes nothing and can be counterproductive in a moment of crisis—it can distract us from finding a solution to the problem, and create even more stress.
Backups, backups, backups
This is just one of the times when having a backup saved the day for me, and for a client. Every sysadmin I've ever worked with will tell you the same thing—always have a backup. Do regular backups. Make backups of configurations you are working on. Make a habit of creating a backup as the first step in any project. There are some great articles here on Enable Sysadmin about the various things you can do to protect yourself.
Another good practice is to never work on production systems until you have tested the change. This may not always be possible, but if it is, the extra effort and time will be well worth it for the rare occasions when you have an unexpected result, so you can avoid the panic of wondering where you might have saved your most recent resume. Having a plan and being prepared can go a long way to avoiding those very stressful situations.
Breathe in, breathe out
The panic response in humans is related to the "fight or flight" reflex, which served our ancestors so well. It's a really useful resource for avoiding saber tooth tigers (and angry CFOs), but not so much for understanding and solving complex technical problems. Understanding that it's normal but not really helpful, we can recognize it and find a way to overcome it in the moment.
The simplest way we can tame the impulse to blackout and flee is to take a deep breath (or several). Studies have shown that simple breathing exercises and meditation can improve our general outlook and ability to focus on a specific task. There is also evidence that temperature changes can make a difference; something as simple as a splash of water on the face or an ice-cold beverage can calm a panic. These things work for me.
Walk the path of troubleshooting, one step at a time
Once we have convinced ourselves that the world is not going to end immediately, we can focus on solving the problem. Take the situation one element, one step at a time to find what went wrong, then take that and apply the solution(s) systematically. Again, it's important to focus on the problem and solution in front of you rather than worrying about things you can't do anything about right now or what might happen later. Remember, blame is not helpful, and that includes blaming yourself.
Most often, when I focus on the problem, I find that I forget to panic, and I can do even better work on the solution. Many times, I have found solutions I wouldn't have seen or thought of otherwise in this state.
Another thing that's easy to forget is that, when you've been working on a problem, it's important to give yourself a break. Drink some water. Take a short walk. Rest your brain for a couple of minutes. Hunger, thirst, and fatigue can lead to less clear thinking and, you guessed it, panic.
Time to face the music
My last piece of advice—though certainly not the least important—is, if you are responsible for an incident, be honest about what happened. This will benefit you for both the short and long term.
During the early years of the space program, the directors and engineers at NASA established a routine of getting together and going over what went wrong and what and how to improve for the next time. The same thing happens in the military, emergency management, and healthcare fields. It's also considered good agile/DevOps practice. Some of the smartest, highest-strung engineers, administrators, and managers I've known and worked with—people with millions of dollars and thousands of lives in their area of responsibility—have insisted on the importance of learning lessons from mistakes and incidents. It's a mark of a true professional to own up to mistakes and work to improve.
It's hard to lose face, but not only will your colleagues appreciate you taking responsibility and working to improve the team, but I promise you will rest better and be able to manage the next problem better if you look at these situations as learning opportunities.
Accidents and mistakes can't ever be avoided entirely, but hopefully, you will find some of this advice useful the next time you face an unexpected challenge.
[ Want to test your sysadmin skills? Take a skills assessment today. ]