Building Practical Self-healing IT
No matter how well we plan for outages and service incidents, we can't always account for everything that might possibly go wrong, especially when we're relying on increasingly distributed and dynamic architecture. In a traditional Ops flow, when there's an incident, remediating a problem may take a while. A ticket has to be created. It has to be routed to support. A support tech has to investigate the issue. All these bottlenecks mean that it takes a lot of time to fix issues. But what if our infrastructure could fix itself?
00:33 — INTRO ANIMATION
Self-healing infrastructure is a lofty goal. But like anything, we have to deal with practical things first: mapping your incident workflow to automation, identifying the root cause of problems, and then fixing them. To do this, a good first step is event-driven automation. Event-driven automation applies automated processes to respond to events generated by a user or system. By combining event monitoring, streaming, intelligence, and IT automation, we can respond more quickly to events, reduce operational toil, and improve reliability, without requiring human intervention. With event-driven automation, we have the ability to make self-healing work. But to get a better picture of how event-driven automation can help us set the stage for better self-healing applications and services in the near future, let's talk to somebody who's actually built something like this today, Mike Dobozy. Hey Mike, great to have you.
Hey Chris, good to talk with you today.
So I know you've got great experience in working with customers, building this kind of self-healing infrastructure, and I know there's a vision of a complex closed-loop self-healing system. That's complicated. So how do you break it down and where do we begin?
Really, when you're talking about self-healing infrastructures, you've got four main technologies that you're talking about. One is what we call the event producer. That is a monitoring tool or an agent, or potentially emails, that detects a failure and basically tells other systems, hey, I've got something going on on this system. That failure event goes through a messaging subsystem. It could be Kafka, it could be a bunch of other things. We see Kafka a lot at clients these days. Then once the message goes through, it gets consumed by a layer that does intelligent routing. And what that does is, based off of rules, or metadata, or a simple set of criteria, we decide, you know what, this event should be remediated by this playbook and the automation platform, if you will. And the automation platform is the most important thing. It is the thing that is ultimately responsible for remediating issues. Now that kind of process, the pattern matching that exists in the intelligent routing piece, right now that's very simple stuff at a lot of the clients that we see doing self-healing infrastructure. But what you're doing is you're setting the stage for more complicated things down the line. Because the process is designed to be so open-ended, you can start, build up your cost savings and your benefits within the organization, and then, once you've actually done that, work on more complex things down the line.
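As a rough illustration of the intelligent routing layer described here, the following minimal Python sketch matches incoming events against a simple set of rules and picks a playbook. Everything in it is an assumption made for illustration: the Event and Rule types, the rule conditions, and the playbook names are hypothetical, and the in-memory list stands in for a real messaging subsystem such as Kafka.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Event:
    """A failure event emitted by an event producer (monitoring tool, agent, ...)."""
    source: str
    kind: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Rule:
    """Maps events that satisfy a predicate to the name of a remediation playbook."""
    name: str
    matches: Callable[[Event], bool]
    playbook: str


# Hypothetical rules and playbook names, for illustration only.
RULES = [
    Rule("disk-full",
         lambda e: e.kind == "disk_usage" and e.metadata.get("percent", 0) >= 90,
         "cleanup_disk.yml"),
    Rule("service-down",
         lambda e: e.kind == "service_down",
         "restart_service.yml"),
]


def route(event: Event, rules=RULES) -> Optional[str]:
    """Return the playbook of the first matching rule, or None if no rule applies."""
    for rule in rules:
        if rule.matches(event):
            return rule.playbook
    return None


def consume(events, run_playbook):
    """Stand-in for the message consumer: route each event, collect the unmatched ones."""
    unhandled = []
    for event in events:
        playbook = route(event)
        if playbook is None:
            unhandled.append(event)  # no rule matched; leave it for a human to triage
        else:
            run_playbook(playbook, event)  # hand off to the automation platform
    return unhandled
```

In this sketch, run_playbook is whatever callable hands work to the automation platform. Keeping the rules as plain data is one way to start with very simple pattern matching and swap in something more sophisticated later.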
I like the simple kind of steps: you've got events coming out of the infrastructure. Essentially, we're talking about event-driven automation, but it very much follows event-driven architecture from software design patterns. And then there's the importance of flexibility when it comes to the response and remediation. I think we understand that reasonably well when you talk about, say, a single cluster of one platform, but when you really think about your entire infrastructure, the flexibility to be able to touch any part of that infrastructure, and essentially run the right playbooks to respond to those events, becomes really important. We talk a lot about observability in the industry, and observability to me is critical. That's part of that event source that you're describing, and it's critical to the success of any self-healing infrastructure.
Yeah. So that is one of the fun aspects of when you actually build out a self-healing infrastructure. Let's say you build it out to a very large estate: you're going to get a ton of data coming in. And so one of the key parts of any sort of AI Ops engagement is a filtering process, a cleansing process, to say, let me take the data that's valuable, that I want to use to train models, and let's toss the rest of it. Once you get to that point, the other thing that's really important, that you alluded to, is traceability and observability. You're going to have a ton of events coming through the system, and you want to be able to see at any given moment, hey, what is the system actually doing? When we see a bunch of events coming in that are basically all the same type, we aggregate them together, and we treat them as a single event. What we're trying to do is keep load off of the automation platform that we have as part of this process. That way, you don't have 10,000 events coming into the automation platform, all asking to be remediated. You have far fewer than that because they've been aggregated together.
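The aggregation step described here can be sketched in a few lines of Python. This is only one possible approach, assuming that "the same type" means an identical kind field and that events are grouped into fixed time windows. The Event and AggregatedEvent types and the 60-second default are hypothetical choices for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str
    kind: str
    timestamp: float  # seconds since some reference point


@dataclass
class AggregatedEvent:
    kind: str
    count: int
    sources: list
    first_seen: float
    last_seen: float


def aggregate(events, window_seconds=60.0):
    """Collapse events of the same kind that fall into the same fixed time window."""
    buckets = defaultdict(list)
    for event in events:
        window = int(event.timestamp // window_seconds)
        buckets[(event.kind, window)].append(event)

    aggregated = []
    # Sort by (kind, window) so the output order is deterministic.
    for (kind, _window), group in sorted(buckets.items()):
        aggregated.append(AggregatedEvent(
            kind=kind,
            count=len(group),
            sources=sorted({e.source for e in group}),
            first_seen=min(e.timestamp for e in group),
            last_seen=max(e.timestamp for e in group),
        ))
    return aggregated
```

With this shape, a burst of thousands of identical alerts inside one window reaches the automation platform as a single request that still records how many events and which sources it represents.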
I can imagine just the complexity. I've got experience with trying to debug complex scenarios. And when you get to root cause analysis, it's not uncommon that the actual single root cause is a trigger for hundreds or thousands, or even hundreds of thousands of events, which if you responded to each of those in kind with your automation, or your playbooks, you're actually producing more load on the system and maybe even aggravating the actual problem. And so that notion of correlating events, I see a bunch of leaf issues here, but the root cause is over here, I think that is where the intelligence comes in. This is great. I mean, it's a long-term effort. We've spent years, even in the academic world, looking at building autonomic systems and developing concepts of monitor, analyze, plan, execute in a closed loop remediation. Today, you're bringing that to life and we're seeing even how that can transition into the future with data and AI and building an intelligent autonomic system. So, Mike, this has been fantastic. Thank you so much.
Of course, of course, happy to talk with you, Chris.
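The correlation idea raised in the conversation, many leaf alerts tracing back to one root cause, can be sketched with a dependency map. This is a deliberately simple illustration and not a description of any particular product: it assumes a known, acyclic map of which component depends on which, and it treats a failing component as a root cause candidate only when nothing upstream of it is also failing. The function name and the component names in the usage example are hypothetical.

```python
def find_root_causes(failing, depends_on):
    """Return failing components that have no failing dependency upstream.

    failing:    iterable of component names that currently have alerts
    depends_on: dict mapping a component to the components it depends on
                (assumed to be acyclic)
    """
    failing = set(failing)

    def has_failing_upstream(component):
        seen = set()
        stack = list(depends_on.get(component, ()))
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in failing:
                return True
            # Keep walking: a healthy-looking dependency may hide a failing one behind it.
            stack.extend(depends_on.get(dep, ()))
        return False

    return sorted(c for c in failing if not has_failing_upstream(c))


# Hypothetical topology: web -> api -> db -> storage
topology = {"web": ["api"], "api": ["db"], "db": ["storage"], "cache": []}
candidates = find_root_causes({"web", "api", "db"}, topology)
```

Here candidates contains only "db", so a single remediation would be attempted there instead of three separate ones for web, api, and db.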
Event-driven automation can give us a better handle on complicated IT systems with more data, more context, and faster remediation. But if we just respond mechanically to events, we risk never actually solving the problems. And having access to more and more data brings its own set of challenges. For self-healing infrastructure, we need to solve the more complex problems of root cause analysis, and combining AI and machine learning with infrastructure automation will help with this. Achieving the goal of AI Ops, where autonomic systems instantly fix themselves, is still a bit out on the horizon, but like any complex problem, we have to start with small changes and easy solutions to build toward the bigger picture.
07:56 — OUTRO ANIMATION
Meet the guest
Mike Dobozy, Principal Architect, Red Hat
Presented by Red Hat
For 25 years, Red Hat has been bringing open source technologies to the enterprise. From the operating system to containers, we believe in building better technology together–and celebrating the unsung heroes who are remaking our world from the command line up.