I would like to share a personal horror story from when I was a newly-minted sysadmin. I was hired into my first sysadmin job with some Linux skills, but no system administration experience. It was a true junior position, and I was learning on the job while being trained by senior staff. The small team I was on supported the primary web servers for a university and its central IT department, to which my team belonged.
I’d been there no longer than a couple of weeks when I was asked to make a small change to the Apache server configuration hosting the IT department’s website. The senior staff had trained me for exactly this situation. I knew how to make the change, commit it to our version control, and then roll the change out to the servers. I opened the Apache documentation too, just to be sure, and had it in front of me. I could do this!
I made the change, and double-checked it against the documentation. I committed it to Subversion, and rolled the change out to the servers. Satisfied at a job well done, I added my notes to the ticket, and then closed the request.
Also, I forgot an angle bracket.
A few minutes later, the change propagated to the servers, Apache restarted (or tried to), and the website for our department came crashing down. As I frantically tried to roll back my changes—I didn’t know what I’d broken, and could not, in the heat of the moment, remember how to get older versions out of Subversion—I could hear two coworkers talking in a cube nearby.
"Is our website down?"
"Haha, yeah, it looks like it."
Eye roll, snarky comment, nudge nudge wink wink.
I sank lower in my cube, both shame and embarrassment adding to my panic as I finally retrieved the older version, rolled it out to the web servers, and verified that the site had come back up.
Later that afternoon, I was still in my cube, shame and embarrassment still in full effect, but panic replaced by fear. Two or three weeks into my job, I’d taken down one of our highest-traffic sites, after having been trained and trusted to do the job. Certainly, I was not going to be employed for long. That fear was only compounded when my boss’ boss’ boss showed up in my cube.
"Don’t worry," she said. I can only wonder what face of pure terror I made when she came into view.
"Don’t worry. We’re not mad at you. You made a mistake, and you fixed it. And now you have learned and you will not make that mistake again."
She was right. Sure, it was a simple typo, but a syntax check would have caught it, and I had not run one. This lesson has followed me. I have never again committed code or configuration to a production service without testing it first.
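A minimal sketch of that kind of pre-rollout syntax check, in Python. It is illustrative only and not the tooling used in this story: `run_syntax_check` and `APACHE_CONFIGTEST` are hypothetical names, and the sketch assumes `apachectl` is on the PATH and reads the configuration that was just edited.

```python
import subprocess
import sys


def run_syntax_check(check_cmd):
    """Run a syntax-check command and report whether it passed.

    Returns a (passed, output) tuple. A command that cannot be found
    counts as a failure, so a missing checker never lets a change through.
    """
    try:
        result = subprocess.run(check_cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return False, "command not found: " + check_cmd[0]
    # Collect both streams, since checkers differ in where they report.
    output = (result.stdout + result.stderr).strip()
    return result.returncode == 0, output


# Assumption: Apache's own config test is the right check for this change.
APACHE_CONFIGTEST = ["apachectl", "configtest"]


def check_before_commit(check_cmd=APACHE_CONFIGTEST):
    """Print the checker's output and return True only if it passed."""
    passed, output = run_syntax_check(check_cmd)
    if output:
        print(output, file=sys.stderr)
    return passed
```

Calling `check_before_commit()` before committing, and stopping when it returns `False`, is the kind of step that would flag a missing angle bracket before it ever reached the servers.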
I’ve shared this story before, most notably in an article about DevOps, in support of a culture of continual learning and experimentation. That response from management is a partial example of a blameless postmortem, which is also a key feature of Google’s Site Reliability Engineering culture.
Humans make mistakes. Failures will occur. In a safe, blame-free culture, team members can learn from their mistakes, and as a result, services can be hardened against the same problems, and the team and organization as a whole can grow.
About the author
Chris Collins is an SRE at Red Hat and a Community Moderator for Opensource.com. He is a container and container orchestration, DevOps, and automation evangelist, and will talk with anyone interested in those topics for far too long and with much enthusiasm.