I would like to share a personal horror story from when I was a newly-minted sysadmin. I was hired into my first sysadmin job with some Linux skills, but no system administration experience. It was a true junior position, and I was learning on the job while being trained by senior staff. The small team I was on supported the primary web servers for a university and its central IT department, to which my team belonged.
I’d been there no longer than a couple of weeks when I was asked to make a small change to the Apache server configuration hosting the IT department’s website. I had been trained by the senior staff in this situation. I knew how to make the change, commit it to our version control, and then roll the change out to the servers. I opened the Apache documentation too, just to be sure, and had it in front of me. I could do this!
I made the change, and double-checked it against the documentation. I committed it to Subversion, and rolled the change out to the servers. Satisfied at a job well done, I added my notes to the ticket, and then closed the request.
Also, I forgot an angle bracket.
A few minutes later, the change propagated to the servers, Apache restarted (or tried to), and the website for our department came crashing down. As I frantically tried to roll back my changes—I didn’t know what I’d broken, and could not, in the heat of the moment, remember how to get older versions out of Subversion—I could hear two coworkers talking in a cube nearby.
"Is our website down?"
"Haha, yeah, it looks like it."
Eye roll, snarky comment, nudge nudge wink wink.
I sunk lower in my cube, both shame and embarrassment adding to my panic as I finally retrieved the older version, rolled it out to the web servers, and verified that the site had come back up.
Later that afternoon, I was still in my cube, shame and embarrassment still in full effect, but panic replaced by fear. Two or three weeks into my job, I’d taken down one of our highest traffic sites, having been trained and trusted to do the job. Certainly, I was not going to be employed for long. That fear was only compounded when my boss’ boss’ boss showed up in my cube.
"Don’t worry," she said. I can only wonder what face of pure terror I made when she came into view.
"Don’t worry. We’re not mad at you. You made a mistake, and you fixed it. And now you have learned and you will not make that mistake again."
She was right. Sure, this was a simple typo, but I did not do a syntax check. This lesson has followed me. I never again committed code or configuration to a production service without testing.
I’ve shared this story before, most notably in an article about DevOps and in support of a culture of continual learning and experimentation. This same response from management is a partial example of a blameless postmortem, a key feature of the Google Site Reliability Engineering culture as well.
Humans make mistakes. Failures will occur. In a safe, blame-free culture, team members can learn from their mistakes, and as a result, services can be hardened against the same problems, and the team and organization as a whole can grow.