Building practical self-healing IT

December 1, 2021 | Technically Speaking Team | Automation and management

When IT operations fail, it would be great if our infrastructure could simply fix itself. But how practical is self-healing infrastructure? DevOps has evolved with AI/ML, and adding self-healing capabilities to existing infrastructure can have immediate benefits. But for more complicated problems, how do we cut through the noise and make it simpler to solve problems? In this episode, Red Hat CTO Chris Wright is joined by Mike Dobozy to talk about how event-driven automation is helping us realize a future with truly closed-loop automation and autonomic systems.

Transcript

00:01 - Chris Wright
No matter how well we plan for outages and service incidents, we can't always account for everything that might possibly go wrong. Especially when we're relying on increasingly distributed and dynamic architecture. In a traditional Ops flow, when there's an incident, remediating a problem may take a while. A ticket has to be created. It has to be routed to support. A support tech has to investigate the issue. All these bottlenecks mean that it takes a lot of time to fix issues. But what if our infrastructure could fix itself?

00:33 - Host
INTRO ANIMATION

00:41 - Chris Wright
Self-healing infrastructure is a lofty goal. But like anything, we have to deal with practical things first. Mapping your incident workflow to automation, identifying the root cause of problems, and then fixing them. To do this, a good first step is event-driven automation. Event-driven automation applies automated processes to respond to events generated by a user or system. By combining event monitoring, streaming, intelligence, and IT automation, we can respond more quickly to events, reduce operational toil, and improve reliability, without requiring human intervention. With event-driven automation, we have the ability to make self-healing work. But to get a better picture of how event-driven automation can help us set the stage for better self-healing applications and services in the near future, let's talk to somebody who's actually built something like this today, Mike Dobozy. Hey Mike, great to have you.

01:45 - Mike Dobozy
Hey Chris, good to talk with you today.

01:48 - Chris Wright
So I know you've got great experience in working with customers, building this kind of self-healing infrastructure, and I know there's a vision of a complex closed-loop self-healing system. That's complicated. So how do you break it down and where do we begin?

02:13 - Mike Dobozy
Really when you're talking about self-healing infrastructures, you've got four main technologies that you're talking about. One is what we call the event producer. That is a monitoring tool or an agent, or potentially emails that detect a failure, and basically tell other systems, like, hey I've got something going on on this system. That event, the failure event goes through a messaging subsystem, could be Kafka, it could be a bunch of other things. We see Kafka a lot, like at clients these days. Then once the actual message goes through, it actually gets consumed by a layer that does intelligent routing. And what that does is, based off of like rules, or metadata, or a simple set of criteria, we decide, you know what, like this event is, actually should be, remediated by this playbook and the automation platform, if you will. And the automation platform is the most important thing. It is the thing that is ultimately responsible for remediating issues. Now that kind of process, the pattern matching that exists in the intelligent routing piece, right now, that's like very simple, very simple stuff. And like on the line, a lot of the clients that we see doing self-healing infrastructure, but what you're doing is you're setting the stage for more complicated things down the line. Because the process is designed to be so open-ended that you can start, build your cost-save, build up your kind of like benefits within the, within the organization, and then, once you've actually done that, work on kind of like more complex things down the line.
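
The four pieces Mike describes (event producer, messaging subsystem, intelligent routing, automation platform) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system discussed in the episode: an in-memory `Queue` stands in for a message bus such as Kafka, the event kinds and playbook names are hypothetical, and `run_playbook` is a placeholder for a call into a real automation platform.

```python
from dataclasses import dataclass, field
from queue import Queue
from typing import Callable, List, Optional


@dataclass
class Event:
    """A failure event emitted by an event producer (monitoring tool, agent)."""
    host: str
    kind: str                      # hypothetical labels, e.g. "disk_full"
    metadata: dict = field(default_factory=dict)


@dataclass
class Rule:
    """A simple routing rule: a match predicate plus the playbook to run."""
    matches: Callable[[Event], bool]
    playbook: str                  # hypothetical playbook name


# Hypothetical rule table for the intelligent routing layer.
RULES = [
    Rule(lambda e: e.kind == "disk_full", "cleanup_tmp.yml"),
    Rule(
        lambda e: e.kind == "service_down" and e.metadata.get("service") == "httpd",
        "restart_httpd.yml",
    ),
]


def route(event: Event) -> Optional[str]:
    """Return the playbook of the first matching rule, or None if nothing matches."""
    for rule in RULES:
        if rule.matches(event):
            return rule.playbook
    return None


def run_playbook(playbook: str, event: Event) -> str:
    """Placeholder for the automation platform; a real one would execute the playbook."""
    return f"ran {playbook} on {event.host}"


def drain(bus: Queue) -> List[str]:
    """Consume every event on the bus, route it, and remediate or escalate."""
    results = []
    while not bus.empty():
        event = bus.get()
        playbook = route(event)
        if playbook is None:
            results.append(f"no rule for {event.kind} on {event.host}; escalate to a human")
        else:
            results.append(run_playbook(playbook, event))
    return results
```

Because the rules are just data, more sophisticated matching can later replace the predicates without touching the producer, the bus, or the automation step.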

03:40 - Chris Wright
I like that the simple kind of steps you've got events coming out of the infrastructure. Essentially, we're talking about event-driven automation, but it follows very much event-driven architecture for software design patterns. And the importance of flexibility when you come to the response remediation. And I think we understand reasonably well when you talk about a, say a single cluster of one platform, but when you really think about your entire infrastructure, the flexibility to be able to touch any part of that infrastructure, and essentially run the right playbooks to respond to those events becomes really important. We talk a lot about observability in the industry and, and observability to me is critical. That's part, that's part of that event source that you're describing, is critical to the success of any self-healing infrastructure.

04:36 - Mike Dobozy
Yeah. So that is kind of like the, the, one of the fun aspects of like, when you actually build out a self-healing infrastructure, and let's say you actually build out to a, build it out to a very large estate, you're going to get a ton of data coming in. And so, like one of the, one of the key parts of any sort of, like, AI Ops engagement is a filtering process. So like a cleansing process to say, let me actually like, take the data that's valuable that I want to use to train models. And let's actually, like, toss the rest of it. Once you get to that point, too, like, the other thing that's really important that you alluded to, is like traceability and observability. So one of the things, you're going to have a ton of events coming through the system, you want to be able to actually like see at any given moment, Hey, what is the system actually doing? When we see a bunch of events that are actually like coming in, that are basically all the same type, we aggregate them together. And so we all, we treat them as kind of like a single event. What we're trying to do is actually keep load off of the automation platform that we have as part of this process. And so that way, you know, you don't have 10,000 events coming into the automation platform, all asking to be remediated. You have many, many less than that because they've been aggregated together.
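
The aggregation step Mike mentions can be illustrated with a short Python sketch. It assumes, for simplicity, that each raw event is a `(host, kind)` pair and that "the same type" means the same `kind`; a real deployment would probably also bound the grouping by a time window, which is left out here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

RawEvent = Tuple[str, str]  # (host, kind); a hypothetical minimal event shape


def aggregate(events: Iterable[RawEvent]) -> List[dict]:
    """Collapse events of the same kind into one aggregated event per kind."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for host, kind in events:
        grouped[kind].append(host)
    # One summary per kind, so the automation platform sees far fewer requests.
    return [
        {"kind": kind, "count": len(hosts), "hosts": sorted(set(hosts))}
        for kind, hosts in grouped.items()
    ]
```

With this shape, thousands of identical alerts reach the automation platform as a single request carrying the list of affected hosts.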

05:46 - Chris Wright
I can imagine just the complexity. I've got experience with trying to debug complex scenarios. And when you get to root cause analysis, it's not uncommon that the actual single root cause is a trigger for hundreds or thousands, or even hundreds of thousands of events, which if you responded to each of those in kind with your automation, or your playbooks, you're actually producing more load on the system and maybe even aggravating the actual problem. And so that notion of correlating events, I see a bunch of leaf issues here, but the root cause is over here, I think that is where the intelligence comes in. This is great. I mean, it's a long-term effort. We've spent years, even in the academic world, looking at building autonomic systems and developing concepts of monitor, analyze, plan, execute in a closed loop remediation. Today, you're bringing that to life and we're seeing even how that can transition into the future with data and AI and building an intelligent autonomic system. So, Mike, this has been fantastic. Thank you so much.
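
One simple way to picture the correlation Chris describes, where many leaf issues trace back to one root cause, is to walk a dependency map. The sketch below is only an illustration under assumptions: the `DEPENDS_ON` topology and component names are hypothetical, each component has at most one upstream dependency, and the map has no cycles. It is a plain rule-based walk, not the AI-driven analysis discussed in the episode.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical topology: each component maps to the one component it depends on.
DEPENDS_ON: Dict[str, Optional[str]] = {
    "switch-1": None,
    "db-1": "switch-1",
    "app-1": "db-1",
    "app-2": "db-1",
    "app-3": None,
}


def root_causes(failing: Iterable[str]) -> Set[str]:
    """Return the failing components whose upstream dependency is not failing."""
    failing_set = set(failing)
    roots = set()
    for component in failing_set:
        current = component
        # Climb the dependency chain while the upstream component is also failing.
        while DEPENDS_ON.get(current) in failing_set:
            current = DEPENDS_ON[current]
        roots.add(current)
    return roots
```

Remediating only the returned components avoids firing a playbook for every leaf alert, which is the extra load on the system that the conversation warns about.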

07:03 - Mike Dobozy
Of course, of course, happy to talk with you, Chris.

07:08 - Chris Wright
Event-driven automation can give us a better handle on complicated IT systems with more data, more context and faster remediation. But if we just respond mechanically to events, we risk never actually solving the problems. And having access to more and more data brings its own set of challenges. For self-healing infrastructure, we need to solve the more complex problems of root cause analysis, and combining AI and machine learning and infrastructure automation will help with this. Achieving the goal of AI Ops, where autonomic systems instantly fix themselves, is still a bit out on the horizon, but like any complex problem, we have to start with small changes and easy solutions to build toward the bigger picture.

07:56 - Host
OUTRO ANIMATION

About the show

Technically Speaking

What’s next for enterprise IT? No one has all the answers, but CTO Chris Wright knows the tech experts and industry leaders who are working on them.