What is Kubeflow?


Kubeflow is a Kubernetes-native, open-source framework for developing, managing, and running machine learning (ML) workloads. As an AI/ML platform, it brings together several tools covering the main AI/ML use cases: data exploration, data pipelines, model training, and model serving.

Kubeflow gives data scientists access to those capabilities through a portal that provides high-level abstractions for interacting with these tools. This means data scientists do not need to learn the low-level details of how Kubernetes integrates with each tool. Kubeflow itself is designed specifically to run on Kubernetes and fully embraces many of its key concepts, including the operator model.

Kubeflow solves many of the challenges involved in orchestrating machine learning pipelines by providing a set of tools and APIs that simplify the process of training and deploying ML models at scale. A "pipeline" denotes an ML workflow, including the components of the workflow and how those components interact. Kubeflow can accommodate the needs of multiple teams in one project and allows those teams to work from any infrastructure. This means that data scientists can train and serve ML models from the cloud of their choice, including IBM Cloud, Google Cloud, Amazon Web Services (AWS), or Azure.
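Since a pipeline is essentially a graph of components and the dependencies between them, its execution order can be sketched in a few lines of plain Python. This is a conceptual illustration only; the component names are hypothetical and not part of any Kubeflow API:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each component lists the components it depends on.
pipeline = {
    "preprocess": set(),        # no upstream dependencies
    "train": {"preprocess"},    # needs preprocessed data
    "evaluate": {"train"},      # needs a trained model
    "deploy": {"evaluate"},     # deploy only after evaluation passes
}

def run_order(dag):
    """Return one valid execution order for the pipeline's components."""
    return list(TopologicalSorter(dag).static_order())
```

A pipeline engine such as Kubeflow Pipelines does this ordering for you (and also handles retries, artifact passing, and running each component as a container on the cluster).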

Overall, Kubeflow standardizes machine learning operations (MLOps) by organizing projects while leveraging the power of cloud computing. Some of the key use cases for Kubeflow include data preparation, model training, evaluation, optimization, and deployment.


Why run ML workloads on Kubernetes?

Kubernetes is key to accelerating the ML lifecycle because it provides data scientists the agility, flexibility, portability, and scalability they need to train, test, and deploy ML models.

Scalability: Kubernetes allows users to scale ML workloads up or down, depending on demand. This ensures that machine learning pipelines can accommodate large-scale processing and training without interfering with other elements of the project. 

Efficiency: Kubernetes optimizes resource allocation by scheduling workloads onto nodes based on their availability and capacity. By ensuring that computing resources are being utilized with intention, users can expect a reduction in cost and an increase in performance.

Portability: Kubernetes provides a standardized, platform-agnostic environment that allows data scientists to develop one ML pipeline and deploy it across multiple environments and cloud platforms. This means not having to worry about compatibility issues and vendor lock-in.

Fault tolerance: With built-in fault tolerance and self-healing capabilities, users can trust Kubernetes to keep ML pipelines running even in the event of a hardware or software failure.

What are the components of Kubeflow?

Kubeflow brings together several components, each covering a stage of the ML workflow:

  1. The Kubeflow Central Dashboard offers an authenticated web interface for accessing Kubeflow and its ecosystem components. Serving as a centralized hub, it aggregates the user interfaces of various tools and services within the cluster, providing a unified access point for managing your machine learning platform.
  2. Kubeflow integrates with Jupyter Notebooks, providing an interactive environment for data exploration, experimentation, and model development. Notebooks support various programming languages, including Python, R, and Scala, and allow users to create and execute ML workflows in a collaborative and reproducible manner.
  3. Kubeflow Pipelines enable users to define and execute complex ML workflows as directed acyclic graphs (DAGs). Kubeflow Pipelines provide a way to orchestrate and automate the end-to-end process of data preprocessing, model training, evaluation, and deployment which promotes reproducibility, scalability, and collaboration in ML projects. The Kubeflow Pipelines SDK is a collection of Python packages, allowing users to define and execute their machine learning workflows with precision and efficiency.
  4. The Kubeflow Training Operator provides tools for training machine learning models at scale. This includes support for distributed training using frameworks like TensorFlow, PyTorch, and XGBoost. Users can leverage Kubernetes' scalability and resource management capabilities to train models efficiently across clusters of machines.
  5. KServe (formerly KFServing), Kubeflow's model serving component, allows users to deploy trained ML models as scalable, production-ready services. It provides a consistent interface for serving models using popular frameworks like TensorFlow Serving, Seldon Core, or custom inference servers. Models can be deployed in real-time or batch processing scenarios, serving predictions over HTTP endpoints.
  6. Kubeflow Metadata is a centralized repository for tracking and managing metadata associated with ML experiments, runs, and artifacts. It provides a consistent view of ML metadata across the entire workflow, enabling reproducibility, collaboration, and governance in ML projects.
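To illustrate what "serving predictions over HTTP endpoints" means in practice, here is a minimal, stdlib-only sketch of a prediction endpoint. The `/predict` route and the trivial "model" are illustrative assumptions; a real deployment would use KServe or another inference server and load an actual trained model:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Trivial stand-in "model": sums the features.
    # A real server would run inference with a trained model here.
    return {"prediction": sum(features)}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep request logging quiet

def serve(port=0):
    """Start the server on a background thread; port=0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A client would POST JSON such as `{"features": [1, 2, 3]}` to `/predict` and receive a JSON prediction back; model servers like KServe expose the same request/response pattern, with versioning, autoscaling, and batching layered on top.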

In addition, Kubeflow provides web-based user interfaces (UIs) for monitoring and managing ML experiments, model training jobs, and inference services. These UIs offer visualizations, metrics, and logs to help users track the progress of their ML workflows, troubleshoot issues, and make informed decisions.

As it embraces the Kubernetes operator model, Kubeflow is extensible and supports customization to adapt to specific use cases and environments. Users can integrate additional components, such as data preprocessing tools, feature stores, monitoring solutions, and external data sources, to enhance the capabilities of their ML workflows.

Kubeflow and Red Hat OpenShift

Red Hat® OpenShift® is the trusted, comprehensive, and consistent platform for application development, deployment, and management across all environments. With DevOps capabilities (e.g., OpenShift Pipelines, OpenShift GitOps, and Red Hat Quay) and integration with hardware accelerators, Red Hat OpenShift enables better collaboration and accelerates the delivery of AI-powered applications.

Red Hat OpenShift AI provides a visual editor, based on Kubeflow Pipelines, for creating and automating data science pipelines and experimentation. OpenShift AI is an integrated MLOps platform for building, training, deploying, and monitoring AI-enabled applications and predictive and foundation models at scale across hybrid cloud environments. It automates and simplifies the iterative process of integrating ML models into software development, production rollout, monitoring, retraining, and redeployment for continued prediction accuracy.

Red Hat OpenShift is available natively on IBM Cloud, Google Cloud, AWS, and Azure, allowing users to automate the management of Kubernetes clusters and to build, deploy, and scale applications quickly on a production-ready application platform.

