What is MLOps?

Machine learning operations (MLOps) is a set of workflow practices aiming to streamline the process of deploying and maintaining machine learning (ML) models.

Inspired by DevOps and GitOps principles, MLOps seeks to establish a continuous process for integrating ML models into software development. By adopting MLOps, data scientists, engineers, and IT teams can work together to keep machine learning models accurate and up to date by streamlining the iterative training loop. This enables continuous monitoring, retraining, and deployment, allowing models to adapt to changing data and maintain their performance over time.

Machine learning models make predictions by detecting patterns in data. Once a model is in use and is exposed to newer data it was not trained on, a problem called “data drift” arises. Data drift happens naturally over time, as the statistical properties of the data used to train an ML model become outdated, and it can negatively impact a business if not addressed and corrected.

To limit the impact of drift, it’s important for organizations to monitor their models and keep a high level of prediction accuracy. Applying MLOps practices can benefit a team by increasing the quality and accuracy of a predictive model while simplifying the management process and making data scientists more efficient.
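One way to put a number on drift is to compare how a feature was distributed in the training data with how it is distributed in production. The sketch below uses the population stability index (PSI), a commonly used drift statistic. It is an illustration rather than a prescribed method: the function name, the sample data, and the 0.2 alert threshold are all assumptions chosen for the example.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
) -> float:
    """Compare two samples of one feature; larger values mean more drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def bucket_shares(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Clamp values outside the reference range into the edge buckets.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_shares = bucket_shares(reference)
    cur_shares = bucket_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


# Hypothetical data: the live sample has the same shape but is shifted by 3.
training_sample = [x / 100 for x in range(1000)]
live_sample = [x / 100 + 3.0 for x in range(1000)]

psi_same = population_stability_index(training_sample, training_sample)
psi_shifted = population_stability_index(training_sample, live_sample)

DRIFT_THRESHOLD = 0.2  # assumed rule of thumb; tune per use case
needs_retraining = psi_shifted > DRIFT_THRESHOLD
```

Comparing a sample with itself gives a PSI of zero, while the shifted sample scores well above the assumed threshold. In practice a check like this would run on a schedule for each monitored feature.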

Here are some specific ways that MLOps can benefit an organization:

Reproducibility: Organizations can rely on consistent reproducibility of ML experiments as an MLOps framework helps track and manage changes to the code, data, and configuration files associated with different models. 

Continuous integration and continuous deployment (CI/CD): MLOps frameworks integrate with CI/CD pipelines, allowing for automated testing, validation, and deployment. In turn, this expedites development and delivery cycles and encourages a culture of continuous improvement.

Increased collaboration and faster timelines: MLOps enables team members to work together effectively while eliminating bottlenecks and increasing productivity. Furthermore, when manual tasks become automated, organizations can deploy more models faster and iterate on them more frequently to provide the best accuracy.

Cost savings: Making the ongoing adjustments and enhancements required to maintain an accurate ML model is tedious, especially if it’s done manually. Automating with MLOps helps organizations save resources that would otherwise fund time-consuming manual work. It also minimizes the risk of manual errors and shortens the time to value by streamlining the deployment process.

Improved governance and compliance: MLOps practices enable organizations to enforce security measures and ensure compliance with data privacy regulations. Monitoring performance and accuracy also ensures that model drift can be tracked as new data is integrated and proactive measures can be taken to maintain a high level of accuracy over time.
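To illustrate the reproducibility benefit listed above, the following sketch derives a single fingerprint from the three things an experiment depends on: code version, data, and configuration. It uses only the Python standard library. The function name, the commit ID, and the sample values are hypothetical, and a real MLOps framework would track far more metadata than this.

```python
import hashlib
import json


def run_fingerprint(code_version: str, data_bytes: bytes, config: dict) -> str:
    """Return a stable ID for one combination of code, data, and configuration."""
    digest = hashlib.sha256()
    digest.update(code_version.encode("utf-8"))
    digest.update(hashlib.sha256(data_bytes).digest())
    # sort_keys makes the hash independent of dictionary ordering.
    digest.update(json.dumps(config, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


# Hypothetical experiment inputs.
config = {"learning_rate": 0.01, "epochs": 20, "seed": 42}
data = b"id,amount,label\n1,9.99,0\n2,250.00,1\n"

baseline = run_fingerprint("a1b2c3d", data, config)
reordered = run_fingerprint("a1b2c3d", data, dict(reversed(list(config.items()))))
changed_config = run_fingerprint("a1b2c3d", data, {**config, "epochs": 30})
```

Two runs share a fingerprint only if their code, data, and configuration all match, so a stored fingerprint tells a team whether a result can be regenerated from the same inputs.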

Adopting an MLOps practice takes away the tedious manual labor involved in looking after a machine learning model while ensuring its ongoing performance and reliability. By streamlining collaboration between different teams, an MLOps practice fosters agile development and data-driven decision making within organizations. 

MLOps allows industries of all kinds to automate and simplify the ML development process. Use cases include using MLOps for:

Predictive maintenance: predicting equipment failure and scheduling maintenance proactively.

Fraud detection: building and deploying models that continuously monitor transactions for suspicious activity.

Natural language processing (NLP): ensuring that applications such as chatbots, translators, and other tools built on large language models (LLMs) perform effectively and reliably.

Computer vision: supporting tasks like medical image analysis, object detection, and autonomous driving. 

Anomaly detection: detecting variations from the norm in various contexts such as network security, industrial processes, and IoT devices.

Healthcare: deploying models for disease diagnosis, patient outcome prediction, and medical imaging analysis. 

Retail: managing inventory, forecasting demand, optimizing prices and enhancing the customer shopping experience.

MLOps can be considered an evolution of DevOps, and is based on the same foundational concepts of collaboration, automation, and continuous improvement, applied to developing ML models. MLOps and DevOps share the goal of improving collaboration with the IT operations team, with whom they must work closely in order to manage and maintain software or an ML model throughout its life cycle.

While DevOps focuses on automating routine operational tasks and standardizing environments for development and deployment, MLOps is more experimental in nature and focuses on exploring ways to manage and maintain data pipelines. Because the data used in ML models is constantly evolving, the model itself must evolve alongside it, which requires ongoing adaptation and fine-tuning.

Testing, deployment, and production look different for MLOps than they do for DevOps. One reason is that ML project teams often include data scientists who may not specialize in software engineering, but focus their efforts on exploratory data analysis, model development, and experimentation. Some of the tasks involved in MLOps that typically aren’t accounted for in DevOps include:

  • Testing for data validation, trained model quality evaluation and model validation.
  • Building a multi-step pipeline to automatically retrain and deploy an ML model as it receives new data.
  • Tracking summary statistics of your data and monitoring online performance of the model to communicate when values deviate from expectations.
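The data validation and summary-statistic tasks in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the schema, column names, value ranges, and tolerance are all hypothetical, and production pipelines typically rely on a dedicated validation library instead of hand-written checks.

```python
from statistics import mean

# Hypothetical schema: column name -> (expected type, allowed min, allowed max).
SCHEMA = {
    "amount": (float, 0.0, 10_000.0),
    "age": (int, 18, 120),
}


def validate_rows(rows: list) -> list:
    """Return a human-readable problem for every schema violation found."""
    problems = []
    for i, row in enumerate(rows):
        for column, (expected_type, low, high) in SCHEMA.items():
            value = row.get(column)
            if not isinstance(value, expected_type):
                problems.append(
                    f"row {i}: {column} is missing or not {expected_type.__name__}"
                )
            elif not low <= value <= high:
                problems.append(f"row {i}: {column}={value} outside [{low}, {high}]")
    return problems


def mean_deviates(values: list, expected_mean: float, tolerance: float) -> bool:
    """Flag a batch whose mean moves further from the training mean than allowed."""
    return abs(mean(values) - expected_mean) > tolerance


batch = [
    {"amount": 25.0, "age": 34},
    {"amount": -5.0, "age": 41},     # negative amount
    {"amount": 80.0, "age": "n/a"},  # wrong type
]
problems = validate_rows(batch)
alert = mean_deviates([25.0, 30.0, 400.0], expected_mean=30.0, tolerance=20.0)
```

Here `problems` holds one entry for the negative amount and one for the non-integer age, and `alert` is true because the batch mean has moved far from the assumed training mean.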

Lastly, when it comes to continuous integration and continuous deployment (CI/CD) in MLOps, CI is no longer only about testing and validating code and components (as it is in DevOps), but also means testing and validating data, data schemas, and models. CD is no longer only about a single software package or service, but about a system (an ML training pipeline) that should automatically deploy another service (a model prediction service).
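A CI stage that validates models as well as code might include a promotion gate like the one sketched below. It compares a candidate model's evaluation report with the one for the model currently in production. The `EvalReport` fields, the function name, and the thresholds are assumptions for illustration, not part of any specific CI product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    accuracy: float         # measured on a held-out evaluation set
    p95_latency_ms: float   # 95th percentile prediction latency


def gate(
    candidate: EvalReport,
    production: EvalReport,
    max_accuracy_drop: float = 0.01,
    max_latency_ms: float = 100.0,
) -> list:
    """Return the reasons a candidate should be blocked; empty means promote."""
    reasons = []
    if candidate.accuracy < production.accuracy - max_accuracy_drop:
        reasons.append("accuracy regressed")
    if candidate.p95_latency_ms > max_latency_ms:
        reasons.append("latency budget exceeded")
    return reasons


# Hypothetical evaluation results.
production = EvalReport(accuracy=0.91, p95_latency_ms=40.0)
good_candidate = EvalReport(accuracy=0.93, p95_latency_ms=45.0)
bad_candidate = EvalReport(accuracy=0.85, p95_latency_ms=130.0)

good_result = gate(good_candidate, production)
bad_result = gate(bad_candidate, production)
```

A pipeline would fail the build whenever the returned list is not empty, in the same way a failing unit test blocks a code change.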

There’s no single way to build and operationalize ML models, but there is a consistent need to gather and prepare data, develop models, turn models into AI-enabled intelligent applications, and derive revenue from those applications.

Red Hat® OpenShift® includes key capabilities to enable MLOps in a consistent, 5-step manner across data centers, public cloud computing, and edge computing:

Step 1: Gather/prep data

Collect, clean, and label structured or unstructured data into a suitable format for training and testing ML models.

Step 2: Model training

ML models are trained in Jupyter notebooks on Red Hat OpenShift.

Step 3: Automation

Red Hat OpenShift Pipelines offers event-driven, continuous integration capability that helps package ML models as container images.

Step 4: Deploy

Red Hat OpenShift GitOps automates the deployment of ML models at scale, anywhere, whether that’s public, private, hybrid, or at the edge.

Step 5: Monitor

Using the tools provided by our ecosystem partners, your team can monitor your models and update them with retraining and redeployment as needed. As new data is ingested, the process loops back to step 1, moving continuously and automatically through the 5 steps.
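The loop formed by the five steps above can be sketched independently of any platform. The Python below is a toy illustration only: it does not use OpenShift, OpenShift Pipelines, or OpenShift GitOps APIs. Every function name is hypothetical, the "model" is a one-feature threshold classifier, and the 0.9 accuracy trigger is an assumed value.

```python
from statistics import mean


def gather_and_prepare(raw: list) -> list:
    """Step 1: drop obviously invalid records (here, negative feature values)."""
    return [(x, y) for x, y in raw if x >= 0]


def train(data: list) -> dict:
    """Step 2: fit a threshold classifier midway between the two class means."""
    positive = mean(x for x, y in data if y == 1)
    negative = mean(x for x, y in data if y == 0)
    return {"threshold": (positive + negative) / 2}


def package(model: dict, version: int) -> dict:
    """Step 3: stand-in for building a versioned, deployable artifact."""
    return {"version": version, **model}


def deploy(artifact: dict, registry: dict) -> None:
    """Step 4: stand-in for rolling the artifact out to a serving environment."""
    registry["live"] = artifact


def monitor(registry: dict, labeled_batch: list) -> float:
    """Step 5: measure the live model's accuracy on newly labeled data."""
    threshold = registry["live"]["threshold"]
    return mean(int((x > threshold) == bool(y)) for x, y in labeled_batch)


registry: dict = {}
history = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (-1.0, 0)]
deploy(package(train(gather_and_prepare(history)), version=1), registry)

# Newly ingested data has shifted upward, so the live model degrades.
fresh = [(11.0, 0), (12.0, 0), (18.0, 1), (19.0, 1)]
accuracy_before = monitor(registry, fresh)

if accuracy_before < 0.9:  # assumed trigger for looping back to step 1
    deploy(package(train(gather_and_prepare(fresh)), version=2), registry)

later_batch = [(10.5, 0), (13.0, 0), (17.0, 1), (20.0, 1)]
accuracy_after = monitor(registry, later_batch)
```

In this toy run, accuracy falls to 0.5 on the shifted data and recovers to 1.0 once the loop retrains and redeploys version 2.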

Whether you’re in an exploratory stage of integrating machine learning within your organization or you’ve been working with ML pipelines for a while, it can be helpful to understand how your workflows and processes fit into the broader scope of MLOps. The maturity of a machine learning process is typically categorized into 1 of 3 levels, depending on how much automation is present in the workflow. 

MLOps level 0: Everything is manual

Teams just starting out with machine learning typically operate with a completely manual workflow. At this stage, data scientists who create the model are disconnected from engineers who serve the model, and every step of the process (data prep, model training, packaging, deploying, and monitoring) is executed by hand. There is no continuous integration (CI), nor is there continuous deployment (CD). New model versions are deployed infrequently, and when a new model is deployed there is a greater chance that it fails to adapt to changes.

MLOps level 1: Automated ML pipeline

It makes sense to start introducing automation to the workflow if the model needs to proactively adjust to new factors. With an automated pipeline, fresh data is looped in for continuous training (CT). This gives the model access to the most relevant information for prediction services.
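Continuous training needs a rule for deciding when to retrain. One simple option, sketched below, combines a data-volume trigger with a performance trigger. The class name, field names, and default values are assumptions for illustration; real pipelines often add schedule-based or drift-based triggers as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainPolicy:
    min_new_samples: int = 1_000  # assumed volume of fresh data worth a retrain
    min_accuracy: float = 0.90    # assumed accuracy target for the live model

    def should_retrain(self, new_samples: int, live_accuracy: float) -> bool:
        """Retrain when enough fresh data has arrived or accuracy has slipped."""
        return (
            new_samples >= self.min_new_samples
            or live_accuracy < self.min_accuracy
        )


policy = RetrainPolicy()
steady_state = policy.should_retrain(new_samples=200, live_accuracy=0.94)
enough_data = policy.should_retrain(new_samples=1_500, live_accuracy=0.94)
accuracy_slipped = policy.should_retrain(new_samples=200, live_accuracy=0.82)
```

Keeping the rule in one small object makes the trigger easy to review and adjust without touching the training code itself.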

MLOps level 2: Automated CI/CD system

At this stage, updates to the ML model are rapid and reliable. The model is retrained with fresh data daily, if not hourly, and updates are deployed on thousands of servers simultaneously. This system allows data scientists and engineers to work together in a single, collaborative setting.

Build vs buy

Resources and timeline are both factors to consider when deciding whether to build or buy an MLOps platform. It can take over a year to build a functioning ML infrastructure, and even longer to figure out how to build a pipeline that actually produces value for your organization. Furthermore, maintaining an infrastructure requires lifecycle management and a dedicated team. If your team doesn’t have the skill set, or the bandwidth to acquire it, investing in an end-to-end MLOps platform may be the best solution.

Red Hat OpenShift AI includes key capabilities to enable MLOps in a consistent way across data centers, public cloud computing, and edge computing. It provides a single, consistent, enterprise-ready application platform that brings together data scientists and application developers to simplify the integration of AI into applications securely, consistently, and at scale.

Kubeflow is a Kubernetes-native, open-source framework for developing, managing, and running machine learning (ML) workloads. Running Kubeflow on OpenShift can help standardize machine learning operations by organizing projects while leveraging the power of cloud computing. Some of the key use cases for Kubeflow include data prep, model training, evaluation, optimization, and deployment.
