I would like to share a personal horror story from when I was a newly-minted sysadmin. I was hired into my first sysadmin job with some Linux skills, but no system administration experience. It was a true junior position, and I was learning on the job while being trained by senior staff. The small team I was on supported the primary web servers for a university and its central IT department, to which my team belonged.
I’d been there no longer than a couple of weeks when I was asked to make a small change to the Apache server configuration hosting the IT department’s website. I had been trained by the senior staff in this situation. I knew how to make the change, commit it to our version control, and then roll the change out to the servers. I opened the Apache documentation too, just to be sure, and had it in front of me. I could do this!
I made the change, and double-checked it against the documentation. I committed it to Subversion, and rolled the change out to the servers. Satisfied at a job well done, I added my notes to the ticket, and then closed the request.
Also, I forgot an angle bracket.
A few minutes later, the change propagated to the servers, Apache restarted (or tried to), and the website for our department came crashing down. As I frantically tried to roll back my changes—I didn’t know what I’d broken, and could not, in the heat of the moment, remember how to get older versions out of Subversion—I could hear two coworkers talking in a cube nearby.
"Is our website down?"
"Haha, yeah, it looks like it."
Eye roll, snarky comment, nudge nudge wink wink.
I sunk lower in my cube, both shame and embarrassment adding to my panic as I finally retrieved the older version, rolled it out to the web servers, and verified that the site had come back up.
Later that afternoon, I was still in my cube, shame and embarrassment still in full effect, but panic replaced by fear. Two or three weeks into my job, I’d taken down one of our highest traffic sites, having been trained and trusted to do the job. Certainly, I was not going to be employed for long. That fear was only compounded when my boss’ boss’ boss showed up in my cube.
"Don’t worry," she said. I can only wonder what face of pure terror I made when she came into view.
"Don’t worry. We’re not mad at you. You made a mistake, and you fixed it. And now you have learned and you will not make that mistake again."
She was right. Sure, this was a simple typo, but I did not do a syntax check. This lesson has followed me. I never again committed code or configuration to a production service without testing.
I’ve shared this story before, most notably in an article about DevOps and in support of a culture of continual learning and experimentation. This same response from management is a partial example of a blameless postmortem, a key feature of the Google Site Reliability Engineering culture as well.
Humans make mistakes. Failures will occur. In a safe, blame-free culture, team members can learn from their mistakes, and as a result, services can be hardened against the same problems, and the team and organization as a whole can grow.
저자 소개
Chris Collins is an SRE at Red Hat and a Community Moderator for Opensource.com. He is a container and container orchestration, DevOps, and automation evangelist, and will talk with anyone interested in those topics for far too long and with much enthusiasm.
유사한 검색 결과
Bridging the gap: Red Hat Academy shaping open source talent in APAC
From banking to tech: Bradley's Red Hat journey
Mailbag: Managers, Technical Debt | Compiler
Hardy Hardware | Compiler: Legacies
채널별 검색
오토메이션
기술, 팀, 인프라를 위한 IT 자동화 최신 동향
인공지능
고객이 어디서나 AI 워크로드를 실행할 수 있도록 지원하는 플랫폼 업데이트
오픈 하이브리드 클라우드
하이브리드 클라우드로 더욱 유연한 미래를 구축하는 방법을 알아보세요
보안
환경과 기술 전반에 걸쳐 리스크를 감소하는 방법에 대한 최신 정보
엣지 컴퓨팅
엣지에서의 운영을 단순화하는 플랫폼 업데이트
인프라
세계적으로 인정받은 기업용 Linux 플랫폼에 대한 최신 정보
애플리케이션
복잡한 애플리케이션에 대한 솔루션 더 보기
가상화
온프레미스와 클라우드 환경에서 워크로드를 유연하게 운영하기 위한 엔터프라이즈 가상화의 미래