Machine learning operations (MLOps) is a set of workflow practices aiming to streamline the process of deploying and maintaining machine learning (ML) models.
Inspired by DevOps and GitOps principles, MLOps seeks to establish a continuous process for integrating ML models into software development. By adopting MLOps, data scientists, engineers, and IT teams can work in sync to keep machine learning models accurate and up to date by streamlining the iterative training loop. This enables continuous monitoring, retraining, and deployment, allowing models to adapt to changing data and maintain peak performance over time.
Adopting an MLOps practice takes away the tedious manual labor involved in looking after a machine learning model while ensuring its ongoing performance and reliability. By streamlining collaboration between different teams, an MLOps practice fosters agile development and data-driven decision making within organizations.
MLOps allows industries of all kinds to automate and simplify the ML development process. Use cases include:
- Predictive maintenance: predicting equipment failure and scheduling maintenance proactively.
- Fraud detection: building and deploying models that continuously monitor transactions for suspicious activity.
- Natural language processing (NLP): ensuring that applications such as chatbots and translators, as well as large language models (LLMs), perform effectively and reliably.
- Computer vision: supporting tasks like medical image analysis, object detection, and autonomous driving.
- Anomaly detection: detecting variations from the norm in contexts such as network security, industrial processes, and IoT devices.
- Healthcare: deploying models for disease diagnosis, patient outcome prediction, and medical imaging analysis.
- Retail: managing inventory, forecasting demand, optimizing prices, and enhancing the customer shopping experience.
MLOps can be considered an evolution of DevOps, and is based on the same foundational concepts of collaboration, automation, and continuous improvement, applied to developing ML models. MLOps and DevOps share the goal of improving collaboration with the IT operations team, with whom practitioners must work closely in order to manage and maintain software or an ML model throughout its life cycle.
While DevOps focuses on automating routine operational tasks and standardizing environments for development and deployment, MLOps is more experimental in nature and focuses on exploring ways to manage and maintain data pipelines. Because the data used in ML models is constantly evolving, the model itself must evolve alongside it, which requires ongoing adaptation and fine tuning.
Testing, deployment, and production look different for MLOps than they do for DevOps. This is why, in an ML project, teams often include data scientists who may not specialize in software engineering, but focus their efforts on exploratory data analysis, model development, and experimentation. Some of the tasks involved in MLOps that typically aren’t accounted for in DevOps include:
- Testing for data validation, trained model quality evaluation and model validation.
- Building a multi-step pipeline to automatically retrain and deploy an ML model as it receives new data.
- Tracking summary statistics of your data and monitoring the online performance of the model to communicate when values deviate from expectations.
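To make the monitoring task in this list concrete, here is a minimal sketch of tracking summary statistics and flagging a deviation. It assumes a single numeric feature and a deliberately simple rule (the live mean is compared with the training mean, measured in training standard deviations). The names `FeatureBaseline`, `build_baseline`, `has_drifted`, and the threshold of 3 are illustrative assumptions, not part of any specific MLOps product.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class FeatureBaseline:
    """Summary statistics recorded from the training data for one feature."""
    mean: float
    std: float


def build_baseline(values):
    """Record the mean and sample standard deviation of a training feature."""
    return FeatureBaseline(mean=mean(values), std=stdev(values))


def has_drifted(baseline, live_values, max_shift_in_std=3.0):
    """Return True when the live mean is more than `max_shift_in_std`
    training standard deviations away from the training mean.

    The threshold is an assumed default; real systems tune it per feature.
    """
    live_mean = mean(live_values)
    if baseline.std == 0:
        # A constant training feature: any change in the mean counts as drift.
        return live_mean != baseline.mean
    shift = abs(live_mean - baseline.mean) / baseline.std
    return shift > max_shift_in_std


# Made-up transaction amounts, used only for illustration.
training_amounts = [10.0, 12.0, 11.0, 9.0, 13.0]
baseline = build_baseline(training_amounts)

print(has_drifted(baseline, [10.5, 11.5, 9.5]))   # False: close to training data
print(has_drifted(baseline, [40.0, 42.0, 41.0]))  # True: far from training data
```

In practice, a check like this would run on a schedule against production data and raise an alert or trigger retraining when it returns `True`.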
Lastly, when it comes to continuous integration and continuous deployment (CI/CD) in MLOps, CI is no longer only about testing and validating code and components (as it is in DevOps), but also means testing and validating data, data schemas, and models. CD is no longer about a single software package or service, but about a system (an ML training pipeline) that should automatically deploy another service (a model prediction service).
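As a rough illustration of what a CI check on data schemas and models might look like, the sketch below validates rows against an expected schema and then checks model accuracy against a minimum. The schema, the `ci_gate` function, and the 0.9 accuracy threshold are all hypothetical choices made for this example.

```python
# Hypothetical schema for a fraud-detection dataset: column name -> Python type.
EXPECTED_SCHEMA = {"amount": float, "country": str, "is_fraud": int}


def validate_schema(rows, schema):
    """Return a list of schema violations (empty when every row is valid)."""
    errors = []
    for index, row in enumerate(rows):
        for column, expected_type in schema.items():
            if column not in row:
                errors.append(f"row {index}: missing column '{column}'")
            elif not isinstance(row[column], expected_type):
                errors.append(
                    f"row {index}: '{column}' should be {expected_type.__name__}"
                )
    return errors


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def ci_gate(rows, predictions, labels, min_accuracy=0.9):
    """Collect data and model problems; an empty list means the gate passes."""
    problems = validate_schema(rows, EXPECTED_SCHEMA)
    score = accuracy(predictions, labels)
    if score < min_accuracy:
        problems.append(f"accuracy {score:.2f} is below {min_accuracy:.2f}")
    return problems


good_rows = [{"amount": 12.5, "country": "DE", "is_fraud": 0}]
bad_rows = [{"amount": "12.5", "country": "DE"}]

print(ci_gate(good_rows, [0, 1, 1, 0], [0, 1, 1, 0]))  # []
print(ci_gate(bad_rows, [0, 1, 1, 1], [0, 1, 1, 0]))
```

A CI job could fail the build whenever `ci_gate` returns a non-empty list, so that neither malformed data nor an underperforming model reaches deployment.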
There’s no single way to build and operationalize ML models, but there is a consistent need to gather and prepare data, develop models, turn models into AI enabled applications, and derive revenue from those applications.
Red Hat® OpenShift® includes key capabilities to enable MLOps in a consistent, 5-step manner across data centers, public cloud computing, and edge computing:
Step 1: Gather/prep data
Collect, clean, and label structured or unstructured data into a suitable format for training and testing ML models.
Step 2: Model training
ML models are trained in Jupyter notebooks on Red Hat OpenShift.
Step 3: Automation
Red Hat OpenShift Pipelines offers event-driven, continuous integration capability that helps package ML models as container images.
Step 4: Deploy
Red Hat OpenShift GitOps automates the deployment of ML models at scale, anywhere–whether that’s public, private, hybrid, or on the edge.
Step 5: Monitor
Using the tools provided by our ecosystem partners, your team can monitor your models and update them with retraining and redeployment as needed. As new data is ingested, the process loops back to step 1, continuously and automatically moving through the 5 steps indefinitely.
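The looping nature of the 5 steps can be sketched in a few lines of plain Python. This is only a conceptual model: every function is a hypothetical stand-in, the registry address is a placeholder, and nothing here calls OpenShift, OpenShift Pipelines, or OpenShift GitOps.

```python
def gather_and_prepare(state):
    # Step 1: stand-in for collecting, cleaning, and labeling new data.
    state["dataset_version"] += 1
    return state


def train(state):
    # Step 2: stand-in for training a model on the latest dataset.
    state["model_version"] = f"model-v{state['dataset_version']}"
    return state


def package(state):
    # Step 3: stand-in for packaging the model as a container image.
    # The registry host is a placeholder, not a real endpoint.
    state["image"] = f"registry.example.com/{state['model_version']}"
    return state


def deploy(state):
    # Step 4: stand-in for rolling the image out to the serving environment.
    state["deployed_image"] = state["image"]
    return state


def monitor(state):
    # Step 5: stand-in for observing the deployed model; here we only log it.
    state["history"].append(state["deployed_image"])
    return state


STEPS = [gather_and_prepare, train, package, deploy, monitor]


def run_cycles(cycles):
    """Run the 5 steps in order, then loop back to step 1 for each new cycle."""
    state = {"dataset_version": 0, "history": []}
    for _ in range(cycles):
        for step in STEPS:
            state = step(state)
    return state


final_state = run_cycles(2)
print(final_state["history"])
# ['registry.example.com/model-v1', 'registry.example.com/model-v2']
```

Each cycle produces a new model version from a new dataset version, which mirrors how ingesting new data restarts the process at step 1.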
Whether you’re in an exploratory stage of integrating machine learning within your organization or you’ve been working with ML pipelines for a while, it can be helpful to understand how your workflows and processes fit into the broader scope of MLOps. The maturity of a machine learning process is typically categorized into 1 of 3 levels, depending on how much automation is present in the workflow.
MLOps level 0: Everything is manual
Teams just starting out with machine learning typically operate with a completely manual workflow. At this stage, data scientists who create the model are disconnected from engineers who serve the model, and every step of the process (data prep, model training, automating, deploying, and monitoring) is executed without automation. There is no continuous integration (CI), nor is there continuous deployment (CD). New model versions are deployed infrequently, and when a new model is deployed there is a greater chance that it fails to adapt to changes.
MLOps level 1: Automated ML pipeline
It makes sense to start introducing automation to the workflow if the model needs to proactively adjust to new factors. With an automated pipeline, fresh data is looped in for continuous training (CT)–this allows the model to access the most relevant information for prediction services.
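One way to picture continuous training is as a trigger that decides when the pipeline should retrain. The sketch below assumes two triggers, the amount of fresh data and a drop in live accuracy. Both criteria and their default thresholds are illustrative assumptions, and `retraining_reasons` is a hypothetical helper.

```python
def retraining_reasons(new_rows, live_accuracy, min_new_rows=1000, min_accuracy=0.9):
    """Return the reasons to retrain; an empty list means no retraining is needed.

    The thresholds are assumed defaults and would be tuned per use case.
    """
    reasons = []
    if new_rows >= min_new_rows:
        reasons.append("enough new data")
    if live_accuracy < min_accuracy:
        reasons.append("accuracy below threshold")
    return reasons


print(retraining_reasons(new_rows=1500, live_accuracy=0.95))  # ['enough new data']
print(retraining_reasons(new_rows=10, live_accuracy=0.80))    # ['accuracy below threshold']
print(retraining_reasons(new_rows=10, live_accuracy=0.95))    # []
```

An automated pipeline would evaluate a check like this on a schedule or on each data arrival, and start a training run whenever the list is not empty.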
MLOps level 2: Automated CI/CD system
At this stage, updates to the ML model are rapid and reliable. The model is retrained with fresh data daily, if not hourly, and updates are deployed on thousands of servers simultaneously. This system allows data scientists and engineers to work together in a single, collaborative setting.
Build vs buy
Resources and timeline are both factors to consider when deciding whether to build or buy an MLOps platform. It can take over a year to build a functioning ML infrastructure, and even longer to figure out how to build a pipeline that actually produces value for your organization. Furthermore, maintaining an infrastructure requires lifecycle management and a dedicated team. If your team doesn’t have the required skill set, or the bandwidth to acquire it, investing in an end-to-end MLOps platform may be the best solution.
Red Hat OpenShift AI includes key capabilities to enable MLOps in a consistent way across data centers, public cloud computing, and edge computing. It provides a single, consistent, enterprise-ready application platform that brings together data scientists and application developers to simplify the integration of AI into applications securely, consistently, and at scale.
Kubeflow is a Kubernetes-native, open-source framework for developing, managing, and running machine learning (ML) workloads. Running Kubeflow on OpenShift can help standardize machine learning operations by organizing projects while leveraging the power of cloud computing. Some of the key use cases for Kubeflow include data prep, model training, evaluation, optimization, and deployment.