Red Hat is continually innovating, and part of that innovation includes researching and striving to solve the problems our customers face. That innovation is driven in part through the Office of the CTO and spans Red Hat OpenShift, Red Hat OpenShift Container Storage, and use cases such as the open hybrid cloud, artificial intelligence, and machine learning. We recently interviewed Michael Clifford, Data Scientist in the Office of the CTO here at Red Hat, about these very topics.
Your title is Data Scientist, right?
What's that mean in terms of working with OpenShift 4, and with the hybrid cloud?
Working in this domain is really twofold.
On one side, if we want to provide infrastructure for other companies that want to run machine learning workloads, we're working as the beta testers.
Then, on the other side, there's the question: how do we actually implement some kind of intelligence into the applications that are running on the OpenShift Container Platform?
For example, one of the big, cool features of OpenShift 4 is automatic updates. But how do you actually know when an update is happening automatically on hundreds of thousands of servers at a time? You need some kind of intelligent automation to manage that process.
That's one of the things we worked on early: both testing how other users would use our infrastructure, from the data scientist's perspective, and implementing the intelligent applications that run behind some of that infrastructure.
So your role is to analyze what's happening, then come up with ways to make it less disruptive. Is that accurate?
Exactly. During an update, we say, "Oh, something strange is happening during this update; let's roll back before anything breaks."
What are some of the data science tools you use to detect that?
Basically, you're ingesting all the data from all the updates that have occurred in the past. Then the machine learning model essentially learns what it looks like when a thing is updating normally. As a new update happens, we continually compare it to our model, and if something starts to deviate in any significant way, we say, "okay, let's flag this and roll it back."
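The approach Clifford describes, learning a baseline from past updates and flagging significant deviations, can be sketched in a few lines. This is a minimal illustration using a simple standard-deviation threshold; the metric values, function names, and threshold here are hypothetical, not Red Hat's actual implementation:

```python
import statistics

def fit_baseline(past_updates):
    """Learn what a 'normal' update looks like from historical metrics.

    past_updates: list of per-update metric values (e.g. error rates).
    Returns the mean and standard deviation of the historical data.
    """
    return statistics.mean(past_updates), statistics.stdev(past_updates)

def is_anomalous(current_value, mean, stdev, threshold=3.0):
    """Flag a live update if its metric deviates from the baseline
    by more than `threshold` standard deviations."""
    z_score = abs(current_value - mean) / stdev
    return z_score > threshold

# Hypothetical error-rate samples from past, healthy updates:
history = [0.01, 0.02, 0.015, 0.012, 0.018, 0.011, 0.016]
mean, stdev = fit_baseline(history)

# A live update whose metric falls far outside the normal range is
# flagged, which would trigger the rollback described above:
is_anomalous(0.25, mean, stdev)    # far outside the baseline: flag it
is_anomalous(0.014, mean, stdev)   # within the normal range: let it run
```

In a real deployment this comparison would run continuously against streaming telemetry, and the model would be something richer than a z-score, but the control flow is the same: learn normal, watch for deviation, roll back.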
And you're talking about hundreds of thousands of updates to monitor.
One of the things about working in the AIOps area is that even though there's a lot of data, it's sometimes not very clean data. With a lot of data science projects, people have this idea that you get a file that's very cleanly defined, and you can do your exploratory analysis and all kinds of other stuff on it. With these live, machine-generated, real-time data sets, things can be all over the place.
So the bigger challenge with this particular project is less the machine learning algorithm that's put into place, than the infrastructure required to parse the data — to get enough data that's meaningful, and convert it to a format that is actually usable and ingestible by machine learning tools.
What's generating this hard-to-manage data?
The data wasn't generated with machine learning in mind. There's a lot of post-processing and pre-processing that has to happen between capturing this massive amount of data and turning it into a format that can actually be used for intelligence.
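As a rough illustration of that kind of pre-processing, consider turning raw, inconsistent machine-generated log lines into a uniform schema that ML tooling can ingest. The log format and field names below are invented for the example; they are not OpenShift's actual telemetry:

```python
import re

# Hypothetical raw log lines: inconsistent fields, missing values,
# and outright garbage -- the kind of messy real-time data described above.
raw_lines = [
    'ts=1618300000 op=update phase=apply duration_ms=420',
    'ts=1618300005 op=update phase=verify',          # missing duration
    'garbage line that does not parse at all',
    'ts=1618300011 op=update phase=apply duration_ms=397',
]

KV_PATTERN = re.compile(r'(\w+)=(\S+)')

def parse_line(line):
    """Turn one key=value log line into a dict, or None if unusable."""
    fields = dict(KV_PATTERN.findall(line))
    if 'ts' not in fields or 'op' not in fields:
        return None  # drop lines that lack the minimum structure
    # Normalize types and fill gaps so every record has the same schema.
    return {
        'ts': int(fields['ts']),
        'op': fields['op'],
        'phase': fields.get('phase', 'unknown'),
        'duration_ms': float(fields['duration_ms'])
                       if 'duration_ms' in fields else None,
    }

records = [r for r in (parse_line(line) for line in raw_lines)
           if r is not None]
```

The unparseable line is discarded and the record with a missing duration is kept with an explicit `None`, leaving a clean, consistent list of records that downstream tools can actually consume.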
With that kind of data, is it harder to decide what is useful data for machine learning, and what is just something that has to be managed?
A lot of times, especially with this type of stuff, you will have to go back and talk to a subject matter expert. Like somebody who's actually working on the OpenShift 4 updates, and you can say, "This variable seems like it would be very informative. Is this something that we should use?" And they'll say, "No, this is generated by something that you're trying to predict anyways, so it'll be a circular prediction."
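The "circular prediction" problem described here is what data scientists usually call target leakage: a feature that is generated by the very outcome you are trying to predict. A tiny illustration, with made-up column names:

```python
# Hypothetical feature table for predicting whether an update failed.
rows = [
    {'cpu_spike': 0.9, 'rollback_triggered': 1, 'update_failed': 1},
    {'cpu_spike': 0.2, 'rollback_triggered': 0, 'update_failed': 0},
]

TARGET = 'update_failed'
# 'rollback_triggered' only exists *because* the update failed -- it is
# generated by the outcome we want to predict, so training on it would
# be exactly the circular prediction a subject matter expert would flag.
LEAKY = {'rollback_triggered'}

features = [
    {k: v for k, v in row.items() if k != TARGET and k not in LEAKY}
    for row in rows
]
```

Spotting which columns are leaky usually is not obvious from the data alone, which is why the conversation with the subject matter expert matters.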
I think that's just a big part of the practice of data science — a lot of looking at the data, but also talking with subject matter experts to determine the right thing to do.