Book review: Site Reliability Engineering
Maybe this is too obvious for others out there, but a book I would recommend for sysadmins is Site Reliability Engineering (SRE), edited by Betsy Beyer, Chris Jones, et al. It’s not an obscure choice by any means; it might be one of the best-known titles among sysadmins everywhere. I recommend it because it’s easy to ignore but, I think, game-changing in its own right.
For years, I’d ignored this SRE book on the basis that anything Google-scale could not possibly apply to what I did on a day-to-day basis. I reasoned that the masses of online discussion could be chalked up to the fanboys and fangirls. Certainly, after a decade as a sysadmin, nothing truly new would be included in what was essentially a sysadmin handbook.
I was wrong. When I did finally pick up a copy and start to read it, my mind was changed within a few chapters. No, there’s no magical recipe for perfect system administration. Yes, it describes a job that focuses heavily on programming rather than "traditional" system administration. No, it is not a manual about how to be a system administrator.
Site Reliability Engineering describes exactly the challenges facing my team. We’re handling more servers per sysadmin than ever before—a ratio of hundreds-to-one where ten years ago it was dozens-to-one. Even with better automation tools and increased scripting, trying to handle that scale is challenging, and a new workflow has to be developed to deal with the load.
SREs are arguably not sysadmins as we know the term, but they are the next generation of operations staff. This book lays out well-thought-out steps for transitioning a team from traditional sysadmins to SREs, including the skills needed, the practices to put in place as a team, and the leadership policies that support and reinforce these changes. It is well worth the read, even for an individual contributor.