SREs on a Plane
One of the things I love about my job is that I get to explore the tech and tools that make our lives easier. But easier doesn't always mean simpler. Take the history of flight. Circumnavigating the globe in 80 days isn't as impossible as it seemed back in Jules Verne's day, but air travel has transformed dramatically since then. We've gone from crazy contraptions with heavy flapping wings to sleek jets with autopilot and advanced systems. Likewise, in the world of operations, we're shifting from static configuration to infrastructure as code as we build cloud-based SaaS platforms and AI tools that people depend on every day to keep operations running smoothly. But as these systems get more complex, who keeps us airborne in the hybrid cloud?
00:50 — INTRO ANIMATION
Cloud computing, and more specifically a platform like Kubernetes, is like this idea of a journey around the globe. We have infinite variables to consider. It can seem easy to just hop on a plane and arrive at your destination, but you're not the one flying the plane. There's a pilot in the cockpit with thousands of logged hours, experience with weather patterns, weight distribution, navigation tools, and automated systems. This is essentially the role of the SRE. Site Reliability Engineering is a software development discipline that combines both systems and software engineering principles to optimize the reliability, efficiency, and scalability of complex systems. SRE teams use and build software to manage systems, solve problems, and automate operations tasks. So, to get an idea of the expertise of an SRE, let's talk to one.
Hey Chris, how are you?
Hey, Candace. I've been really digging into the concept of SREs and I wanted to get your viewpoint and understand from your perspective, some of the details around Site Reliability Engineering. So, why don't you just start with the very basics. From your point of view, how would you describe what is an SRE and what do SREs do?
Sure, so SRE is a Site Reliability Engineer and it's basically a DevOps position. So, we have two basic functions behind the role, the development side and the Ops side. So, for the Ops side, that's pretty easy. What we do is we pay attention to the alerts that are coming in from our cluster. So, anything that might be broken on the clusters, we get alerts for those. And then we also look at any sort of customer issues that may be coming in through tickets. And then our development work revolves around a couple of different things. So one, we do feature work where customers ask us to add certain features to the OpenShift Dedicated product and we work on those. And then, of course, we want to do development work to reduce Ops pain. And so, we do a lot of automation and things to help us lower the number of alerts we get on any given day, to help the customer issues that we see, automate those away, things like that.
What happens when something goes wrong? I mean, you're getting all these alerts, there's a lot happening. Can you walk us through that process?
When things go really wrong, we call them incidents. And we have an incident response process where basically we get a team of our SREs together who work to resolve the issue. And then we also have a post-mortem process. So, we write up an RCA doc, a Root Cause Analysis, and we go through, in a post-mortem review meeting, what went wrong and how long it took us to resolve certain things. And we talk a lot about creating action items to either decrease time to resolution for the next time this thing goes wrong or completely automate the problem away altogether.
I think that's an awesome process and the post-mortem or retrospective view. I mean, that's applied in various sort of disciplines, development as well. I'm interested in that part of learning what went wrong, Root Cause Analysis. That itself can be difficult. I've worked on systems where an initial issue creates a flood of alerts and those all become distractions from the real issue.
I've also worked on systems where we've tried to work on that closed loop remediation. You talked about automating it so it never goes away. I can't help but think, can't we just write some code and AI that does all this magic for you?
It would be really awesome if we could just insert an AI module into our clusters and have all the problems go away. But unfortunately, that's not really how it works. AI isn't magic. It's really humans writing code. And so, what we have to do as SREs is we have to have the problems happen and then look at the problems and say, "What kind of code can we write here to make these problems go away or to make these problems less bad when they happen again?"
I love that mindset of a little bit of experimentation, maybe in some manual work to figure out what's happening and then apply that through all of your experience, develop code, deploy an operator that can take that learning and automate that learning into helping you run production environments. It feels to me, and I know we've been experimenting with how this could work, that you could take that same concept and expand it out to a broader community. And thinking about the open source community development model, how do we apply that to operations?
Absolutely, so that's what we mean when we were talking about operate first. Operate first is the idea that we are taking our own products and we are deploying them at some large scale in order to be able to give feedback to the developers about what it's like to work in operations with those products deployed at a large scale. So, we have a lot of initiatives that help that feedback loop with our OpenShift developers. For example, we are asking our developers to come in and do shadowing with the SRE team so that they can see what it's like to run their product at scale and they can see what it's like to be getting alerts from the clusters and they can see the sorts of requests that we get from customers and things like that. And they can take that information and that knowledge back to their teams and say, "You know, okay, this one alert is really noisy. Let's try to make sure this alert is less noisy for our SRE team. Or, our SRE team wasn't getting alerted at all about this one issue that we think is really important. So, maybe we should put in alerts for these sorts of issues."
I mean, I can say from my own personal experience, having lived more on the developer side, you don't always appreciate the challenges that you're introducing for operations in the code that you're writing. So, I really love that notion of learning from one another and, to take it a step further, how we can change the systems and improve those systems, learning from the operations experience, informing the developers of the platform. So, this is a community effort. This is how we collaborate and this is the beauty of open source development. Really great conversation. I'm so glad to have an opportunity to learn more from you about a day in the life of an SRE. Thank you so much, Candace.
Thank you so much for having me, it was lovely.
The history of flight is long and storied, and in the beginning it was 100% manual, but over time it's become more and more digitally automated. And now we have autopilot, which largely flies the plane. But when unexpected conditions arise, you're still trusting your life to a human expert. The pilot's feedback and the operations data are vital for ground teams, and this helps to continually optimize systems for safer and more efficient travel for all airlines and all passengers.
08:26 — OUTRO ANIMATION
Meet the guest
Senior Site Reliability Engineer, Red Hat
Red Hat’s approach to site reliability engineering (SRE)
Organizations can move more efficiently with cloud services supported by site reliability engineering practices that make IT work at scale. Learn more about our approach
Why use cloud services instead of self-managed infrastructure?
A growing number of organizations are looking to cloud services to help evolve and accelerate their digital transformation plans. But why? What benefits do cloud services offer? Read the blog post
Presented by Red Hat
For 25 years, Red Hat has been bringing open source technologies to the enterprise. From the operating system to containers, we believe in building better technology together–and celebrating the unsung heroes who are remaking our world from the command line up.