Ten years ago, Bobby Dorlus didn't even know what site reliability engineering (SRE) was. "It was more or less either you were on the operations, development, or architecture side of the house, or you were in management," he says.
Many organizations may not even refer to their engineers as SREs; they may just integrate the practices into their current software engineering process. According to Google, "SRE is what you get when you treat operations as if it's a software problem." The core of reliability engineering is strengthening the human side of the software problem.
Bobby Dorlus has spent the last eight and a half years as a Staff SRE at Twitter in what he calls "one of the most fruitful technical experiences of my 20 years in tech." He has always loved solving problems. As an SRE, he can be as analytical and conceptual as humanly possible. To him, there's no problem too great when trying to transform a somewhat reliable system into an incredibly reliable system.
I recently sat down with Bobby to explore the basic roles and responsibilities of an SRE and where it fits in today's enterprise architecture.
Marjorie Freeman: How do SREs interact with software engineers at your organization?
Bobby Dorlus: At Twitter, SREs have a slightly different role than SREs at other companies. They are not only highly proficient systems engineers, but they also work more closely with application developers. We also have SREs working for the platform services team building the infrastructure that the applications depend on. In addition, there are SREs that work more closely with our software engineering partners who are primarily responsible for developing code.
The focus area that SREs and software engineers most often collaborate on is enhancing and maintaining the user experience in compliance with defined service-level objectives (SLOs), which are agreed upon by both groups. The SRE team is there to ensure that the infrastructure is designed to support each SLO, and that applications are developed and features are deployed in a way that doesn't hinder the user experience.
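To make the idea of an SLO concrete, here is a minimal sketch of how an availability target can be turned into an error budget, meaning the number of failed requests a service can absorb before it misses its objective. The 99.9% target, the request counts, and the function names are illustrative assumptions, not figures or tooling from Twitter.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Return how many requests may fail while still meeting the SLO.

    slo_target is a fraction, e.g. 0.999 for a hypothetical 99.9% target.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    if total_requests < 0:
        raise ValueError("total_requests must be non-negative")
    # round() absorbs floating-point noise in (1 - slo_target).
    return round(total_requests * (1.0 - slo_target))


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> int:
    """Return the unspent error budget; a negative value means the SLO was missed."""
    return error_budget(slo_target, total_requests) - failed_requests


# Hypothetical month: one million requests, 250 of which failed.
print(error_budget(0.999, 1_000_000))
print(budget_remaining(0.999, 1_000_000, 250))
```

A team could use a number like this to decide when to slow feature rollouts and spend time on reliability work instead.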
As an SRE, a strong DevOps foundation and software development skills are a must. It also doesn't hurt to have a firm understanding of systems design and automation. Automation is key in today's enterprise, and as an SRE, you know that if a system creates toil, that usually means something should be automated. This is where SREs will work with service owners to develop ways to automate away toil to reduce operational burden.
Marjorie: Where do you see overlap between engineers and architects?
Bobby: I have worked for organizations where there are separate departments for architects and engineers. At Twitter, there isn't really a structure that separates our architect and engineer teams. In fact, there's also the potential to leverage the capabilities of each role, depending on the project. In general, though, the more senior-level the role, the more your responsibilities fall under an architectural umbrella.
Architects build the designs; engineers execute those designs. Because engineers execute on the designs crafted by architects, they can pinpoint if something isn't working and needs to be tweaked. Being a Staff SRE, I often architect solutions to solve a specific business need. Resource estimation is a great example of a way to address a need and design a solution. In a recent case, the goal was to identify any inefficiencies in the datacenter and to architect a solution to take advantage of running different workloads on the same bare-metal machine.
In addition to architecting solutions, I also draft documents and work to get buy-in from my peers and engineering leadership—which is similar to the roles and responsibilities of the traditional architect. But as an engineer, I also am involved in the implementation process.
Some engineering roles are essentially architects without necessarily having it in their title.
Marjorie: What is the most interesting project or assignment that you've ever worked on as an SRE?
Bobby: Twitter runs a very large, hyperscale infrastructure. A lot of people aren't aware that Twitter builds its own computers and datacenters. As cool as that is, as an engineer, you've got to ensure the company is getting the most efficiency out of the hardware it's building inside of the datacenter. One of the roles of an SRE on Twitter's Compute Platform team is to maintain a certain amount of computing capacity for future demand and to support Platform Engineering initiatives that increase the use of underutilized resources within our datacenters.
Earlier, I mentioned a recent resource estimation project I was involved in. In the process of reviewing our resource consumption within our datacenters, we discovered an opportunity to increase capacity within our compute platform while also increasing utilization within our storage platforms. While executing the project, after running in-depth analyses, we found that at least 50% of bare-metal computing resources on storage machines were underutilized. So imagine your computer has 100 slots for CPUs, and only 40 of them are being used. This means you have 60 CPUs that are not doing anything. This is where I come in and figure out how to take advantage of those underutilized resources on those storage machines. The goal of this project was to be able to run two different types of workloads on the same bare-metal machines.
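The arithmetic in the 100-slot example above can be sketched in a few lines. The function name is hypothetical, and the figures are the illustrative ones from the example, not measurements from Twitter's fleet.

```python
from typing import Tuple


def idle_capacity(total_cpus: int, used_cpus: int) -> Tuple[int, float]:
    """Return (idle CPU count, idle fraction of the machine)."""
    if total_cpus <= 0:
        raise ValueError("total_cpus must be positive")
    if not 0 <= used_cpus <= total_cpus:
        raise ValueError("used_cpus must be between 0 and total_cpus")
    idle = total_cpus - used_cpus
    return idle, idle / total_cpus


# The example from the interview: 100 CPU slots, 40 in use.
idle, fraction = idle_capacity(100, 40)
print(f"{idle} idle CPUs ({fraction:.0%} of the machine)")
```

In this example the idle share is 60%, which is consistent with the "at least 50%" underutilization the analysis found.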
I know it might not sound very attractive in the sense of the virtualization and containerization that we have nowadays, but just imagine the containerization we implemented was all at the operating system and kernel level where we used systemd to actually introduce those isolation mechanisms. It was a tricky project, but one of those experiences that, when you see its lasting impacts, makes it well worth the complexity.
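The interview does not describe the actual configuration, so the following is only a rough sketch of what systemd-based isolation can look like in general. It renders a slice unit that uses two standard systemd resource-control settings, `CPUQuota=` and `MemoryMax=`, to cap a second workload sharing a host. The description text, the limits, and the helper name are all assumptions made for illustration.

```python
def render_slice_unit(description: str, cpu_quota_percent: int, memory_max: str) -> str:
    """Render the text of a systemd slice unit with CPU and memory caps.

    cpu_quota_percent follows systemd's CPUQuota= convention: 100 means one
    full CPU, so 6000 would allow up to 60 CPUs' worth of time.
    memory_max is passed through as-is, e.g. "64G".
    """
    if cpu_quota_percent <= 0:
        raise ValueError("cpu_quota_percent must be positive")
    lines = [
        "[Unit]",
        f"Description={description}",
        "",
        "[Slice]",
        f"CPUQuota={cpu_quota_percent}%",
        f"MemoryMax={memory_max}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical limits for a compute workload co-located on a storage host.
unit_text = render_slice_unit("Compute workload on storage hosts", 6000, "64G")
print(unit_text)
```

Services placed in such a slice would share the caps, leaving the remaining CPU and memory for the machine's primary storage workload.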
Marjorie: How did you get started, and how do you improve your tech skills?
Bobby: I'm originally from South Florida. My parents are originally from Haiti, and I am a first-generation Haitian American. Because my family immigrated to the United States, we didn't always have the same access that many people have to entering the workforce or attending college. But I didn't let that interfere with my passion. I attended two years of technical school and started my self-taught journey toward what soon became a very fruitful career. I may not have had the financial resources to attend a four-year university, but the beauty of technology is it's hands-on and affords you the opportunity to learn as you "do."
I've always enjoyed being a problem solver, taking things apart and fixing them. Getting into computer science is obviously the best way to learn how something works from the lowest level—the operating system, the functionality of the kernel—up to the highest level—hosting software and running systems at a scale that people can depend on—while enjoying yourself at the same time.
My desire to learn definitely hasn't lost its momentum. I often tell my mentees, "learn how to learn." Just because you learn something new today doesn't mean that that same knowledge will be enough tomorrow. Architects also must keep the hands-on skills they've cultivated throughout their careers sharp.
Out of my 20 years in technology, the most impactful certification I ever completed was the Red Hat Certified Engineer (RHCE). I completed that certification in 2006, and one thing I vividly remember was the amount of work they wanted me to be able to demonstrate in a matter of only a few hours. Being able to break down a problem and solve it in a reasonable amount of time helps you land the kind of roles that are truly in demand today, like DevOps engineer, infrastructure engineer, SRE, and so on. No matter what your title says, if you specialize in building infrastructure and implementing architectures, you are solving problems with the end user in mind.
My philosophy is: If it isn't broken, you may not have to fix it, but you can still make it better.