Red Hat is continually innovating, and part of that innovation is research aimed at solving the problems our customers face. That innovation is driven through the Office of the CTO and includes OpenShift, OpenShift Container Storage, and use cases such as hybrid cloud, privacy concerns in AI, and data caching. We recently interviewed Marcel Hild, Software Engineering Manager in the AI Center of Excellence in the Office of the CTO here at Red Hat, about these very topics.
Can you tell me a little bit about what you've done with OpenShift 4?
With AIOps we've been working closely with the OpenShift team to get some of the operations data from the customer side to Red Hat, and to help analyze, visualize, and make sense of all this data. With OpenShift 4, one of the key changes is that we now have always-connected clusters that send data back to Red Hat at a five-minute interval, so that we can hopefully find issues before the customer experiences them.
The step before troubleshooting.
Hopefully so. It also augments troubleshooting, and it also augments development, because we identify bugs. That's all in the grand scheme of it.
Can you explain a little bit how you troubleshoot so it would make sense to someone who doesn't work daily with this technology?
Think of a huge cluster. A huge OpenShift deployment where you have 100 nodes, and you have thousands of containers going up and down every minute. Nobody from a traditional background could grasp this complexity. When we started out with mainframes, you had one server which had one database on it. Then we virtualized these machines, and suddenly we had a physical server with virtual machines on top of it, and now we have containers running on virtual machines and on bare-metal installations. So the overall complexity of today's IT deployments and IT operations is just a magnitude larger.
And that's only the operational part of the infrastructure. Take in all the microservices that make up a modern application nowadays, take in software-defined networks, et cetera, and it's just overwhelming. Nobody can really grasp all that. What AIOps is trying to do is make understanding and visualizing that complexity a little bit easier, so that you find that needle in the haystack with greater precision.
And how do you do that?
By applying machine learning techniques to our operational data. These techniques are ubiquitous in the financial sector, where you have stock market prices and, once there's a blip somewhere, some machine is triggered to sell your stock. They're ubiquitous in the health sector, where we find anomalies to detect cancer in scans. But we haven't applied them to our very own industry, the IT industry, where we still have these large deployments and are trying to keep them under control with manual tools.
It's about applying these already-proven techniques to our internal and IT operations.
Right now it's a combination of manpower and machine. How difficult is it to fuse those two, and combine those two efforts? Are there many challenges involved in that?
There are some challenges involved. One challenge is the mindset.
We had this DevOps movement, where we married development and operations to a new role, the DevOps person. Nowadays a developer can deploy stuff on OpenShift, like an Ops person does, and an Ops person uses development techniques, and we have DevOps.
The same must happen for the data science world and the DevOps world, so we get AI DevOps or Data Science DevOps.
Using some techniques from the data science world, we can analyze time-series data, which in essence is metrics from the monitoring perspective. The data scientists have tools to find patterns in this time-series data. But we need to bring both roles together, so that the data scientists know the language the DevOps people speak and the problem domain they're dealing with, and the DevOps people understand what data science tools are available and how they can help them.
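To make the idea concrete, here is a minimal, hypothetical sketch of the kind of pattern-finding a data scientist might apply to monitoring metrics: flagging points in a time series that deviate sharply from a rolling baseline. The function name, window size, and threshold are illustrative assumptions, not Red Hat tooling.

```python
import statistics

def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates more than `threshold`
    standard deviations from the mean of the previous `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mean = statistics.mean(recent)
        stdev = statistics.pstdev(recent)
        if stdev > 0 and abs(values[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

# A mostly flat CPU-usage series (percent) with one spike at index 10.
cpu = [50, 51, 49, 50, 52, 51, 50, 49, 51, 50, 95, 50, 51]
print(detect_anomalies(cpu))  # → [10]
```

Real AIOps pipelines use far more sophisticated models, but the principle is the same: learn what "normal" looks like from the data itself, then surface the deviations instead of asking a human to watch thousands of dashboards.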
You're saying that more communication between these different moving parts would aid the technology's evolution?
Yes, more communication between the roles, because the data is already there. If you install an OpenShift cluster, there's no lack of data. But we haven't made concerted efforts to analyze that data.
If you look at the AI world, you have a lot of data sets out there that are images, and you can apply techniques to identify cats in the images. Or, if you have a lot of voice recordings, you can train a voice recognition model.
But there's a real scarcity of operational open data sets. Try to find a data set that has just logging data in it, where there might be an outage of some sort. It's not there. So that's another initiative that we at Red Hat are trying to push forward now, working with open source communities to provide these open data sets. Apply what we did with software: open up the software world, and also open up the data world with regard to IT data.
How easy is it to open up these worlds?
Right. It's about convincing people that there's value in doing so.
And is that difficult to do?
Well, yeah. You need to talk to people. And they say, "Yeah, there might be some IP in the data, or some proprietary stuff, I don't know." I wouldn't say it's impossible to do, but it sometimes needs some convincing. As I said earlier, the DevOps people and the data science people haven't come together yet. So they might not see a lot of value in opening up and creating these data sets, because they haven't seen the tools that would help them in their day-to-day work.
When you're working with a hybrid cloud, does that aspect of the data challenge change? Does that matter?
What we're trying to do with OpenShift and with the whole Red Hat portfolio is to make the experience of your hybrid cloud installation, even for on-premise installations, as seamless and as pleasurable as in the public cloud. And you can only do that by tapping into the data from all your customers.
That's the main advantage that the public cloud providers have. They see the workloads from all their customers on their systems, so they can continually tune and improve the performance of their services for all customers, because they have overall knowledge of the logs, the metrics, all the operational data from the workloads running there. If Red Hat strives to create the same experience for the hybrid cloud people, we need the logging data, the metrics data, the operational data from the on-premise installations of our products running in those data centers. Then we send them back to Red Hat, analyze them with AIOps techniques, and make the experience as pleasurable and as seamless as possible.
It goes back to detecting issues before they happen, knowing before the customer sees them live. That's our first step for OpenShift 4, where we have this always-connected cluster. We see the OpenShift 4 clusters running for customers, running on laptops, running in data centers. And we can see we have an issue here, it's happening for 10% of our customers, so let's dig into that before the other 90% of our customers also see it.
I suppose that goes to trust, then. Would you agree?
Yes. Red Hat is already trusted by our partners and our customers. They are getting online patches from us; Red Hat is already in the data center as a trustworthy partner. And in your data center, you know, you don't only have Red Hat software running. On top of Red Hat's OpenShift Container Platform, you would also have database servers, or software from other independent software vendors, doing the customer's actual work and running there.
If we're talking about AIOps, we need the logging data and operational data from that software as well. It's unlikely that a customer sends data back to each and every software vendor in the data center. It's more likely they send it back to a single trusted partner, namely Red Hat. So maybe we get into a position where we also have to provide a data platform for customers, where the customer sends back operational data not only from our products.