In the fast-paced world of artificial intelligence/machine learning (AI/ML), the biggest challenge isn't just building a model—it's managing the data that powers it. We’ve all been there: a model's performance shifts or a data analysis yields inconsistent results, and you’re left wondering, "Wait, which version of the dataset did I use for this training run or report?"

But there’s good news. If you’re working in a Red Hat environment, our new AI quickstart solves this exact problem. It combines the orchestration power of Red Hat OpenShift AI with the "Git-for-data" versioning capabilities of lakeFS.

The quickstart is a cohesive ecosystem in which each layer solves a specific challenge in the modern AI lifecycle. Instead of a rigid list of tools, imagine a workflow where infrastructure, data management, and model development are smoothly integrated.

Here’s a breakdown of why this quickstart is a game-changer and how you can get started.

The foundation: Red Hat OpenShift

At the base of everything is Red Hat OpenShift. While Kubernetes can often feel like a DIY project, Red Hat OpenShift provides the hardened, enterprise-grade foundation. It handles the heavy lifting of scalability and security so that when your AI models move from a local laptop to production, the underlying infrastructure can handle the load with ease.

The intelligence layer: OpenShift AI

Sitting directly on top of that foundation is Red Hat OpenShift AI. This is where the actual "work" happens for data scientists. Rather than jumping between fragmented tools, this layer consolidates the entire environment—Jupyter Notebooks for experimentation, model serving for deployment, and automated pipelines—into a single, unified dashboard. It effectively bridges the gap between writing code and delivering a functional AI service.

The versioning engine: lakeFS

Data management is where lakeFS shines. In traditional development, we use Git to version code. lakeFS brings that same logic to the data itself. By acting as a data control plane over your object storage, it allows you to branch, commit, and revert datasets just as easily as you would a script.

Because lakeFS is built for multimodal data, it treats structured tables, semistructured JSON, and unstructured images or metadata with the same level of version control. This helps ensure that every AI model is reproducible—if a model behaves unexpectedly, you can simply "roll back" the data to the exact state it was in when the model was trained.
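To make the branch-commit-revert idea concrete, here is a minimal, hedged sketch of wrapping the lakectl CLI from Python. It assumes lakectl is installed and configured against your lakeFS server; the repository name ("fraud-data") and branch names are hypothetical placeholders, and you should check the lakeFS documentation for exact command flags.

```python
"""Sketch: Git-style data operations via the lakectl CLI.

Assumptions: lakectl is installed and configured for your lakeFS
server; repository and branch names below are hypothetical.
"""
import subprocess


def branch_create_cmd(repo: str, branch: str, source: str = "main") -> list[str]:
    # Create a zero-copy branch from `source`, e.g. a sandbox for an experiment.
    return ["lakectl", "branch", "create",
            f"lakefs://{repo}/{branch}",
            "--source", f"lakefs://{repo}/{source}"]


def commit_cmd(repo: str, branch: str, message: str) -> list[str]:
    # Commit the current state of the branch, much like `git commit`.
    return ["lakectl", "commit", f"lakefs://{repo}/{branch}", "-m", message]


def revert_cmd(repo: str, branch: str, ref: str) -> list[str]:
    # Roll the branch back to a prior commit if bad data slipped in.
    return ["lakectl", "branch", "revert", f"lakefs://{repo}/{branch}", ref]


def run(cmd: list[str]) -> None:
    # Executes against a live lakeFS server; not invoked in this sketch.
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # To run for real: run(branch_create_cmd("fraud-data", "exp-new-features"))
    print(" ".join(branch_create_cmd("fraud-data", "exp-new-features")))
```

The wrapper only builds command lists; calling `run()` hands them to a configured lakectl, so the same three verbs cover sandboxing, snapshotting, and rollback.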

What’s inside the quickstart?

The Fraud-Detection-data-versioning-with-lakeFS quickstart isn't just a "hello world" demo. It’s a full lifecycle workflow based on a real-world fraud detection use case. When you run through it, you’ll learn how to:

  • Train a model in isolation: Use lakeFS branches to create a sandbox for your data. You can experiment with new data preprocessing techniques without affecting the main production dataset branch.
  • Version your artifacts: Every time you run a training pipeline, the quickstart shows you how to commit the specific state of your data. This gives you full reproducibility: if a model behaves strangely three months from now, you can "check out" the exact data used to train it.
  • Automate with pipelines: Integrate lakeFS directly into OpenShift AI pipelines. The pipeline doesn't just run code; it creates a data snapshot at every step.
  • Serve models efficiently: Deploy your trained fraud detection model using OpenShift AI’s single-model serving platform, with the model weights pulled directly from a versioned lakeFS repository.
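The "train in isolation" step above relies on lakeFS exposing each branch as a key prefix through its S3 gateway, so a pipeline can switch between production data and a sandbox with a one-line URI change. The sketch below shows that path convention; the repository and branch names are hypothetical.

```python
"""Sketch: addressing branch-scoped data through the lakeFS S3 gateway.

lakeFS serves objects as s3://<repository>/<ref>/<path>, where <ref>
is a branch name or a commit ID. Names below are hypothetical.
"""


def gateway_uri(repo: str, ref: str, path: str) -> str:
    # Using a commit ID as `ref` pins the data immutably for reproducibility.
    return f"s3://{repo}/{ref}/{path}"


# Production training reads from main...
prod_data = gateway_uri("fraud-data", "main", "transactions/train.parquet")

# ...while an experiment reads the same logical path on its own branch.
exp_data = gateway_uri("fraud-data", "exp-preprocessing", "transactions/train.parquet")

print(prod_data)  # s3://fraud-data/main/transactions/train.parquet
print(exp_data)   # s3://fraud-data/exp-preprocessing/transactions/train.parquet
```

Because the branch is just part of the URI, any S3-compatible client or pipeline step can be pointed at the sandbox without copying data.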

Why this matters for machine learning operations (MLOps)

Most teams treat data as a "live" entity that is constantly changing, which can quickly become challenging for both auditing and debugging. However, by using lakeFS as an AI data control plane on Red Hat OpenShift, organizations gain powerful tools to help manage their data lifecycles more effectively.

A primary benefit of this setup is zero-copy branching, which lets teams create a test copy of a massive dataset (even a 1TB one) in milliseconds, without duplicating the underlying data. This capability naturally extends into data CI/CD practices: teams can use "pre-merge hooks" that validate data quality before any information reaches the training pipeline. And if a bad data ingestion ruins a model, instant rollbacks let users immediately revert the data repository to its previous state.
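As an illustration of what such a pre-merge hook can look like, here is a hedged sketch of a lakeFS actions file, which lives under the `_lakefs_actions/` prefix of the repository. The action name, branch, and webhook URL are hypothetical placeholders; consult the lakeFS hooks documentation for the exact schema.

```yaml
# Hypothetical example: _lakefs_actions/pre-merge-validation.yaml
name: pre-merge data quality gate
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: validate_training_data
    type: webhook
    properties:
      # Placeholder for a data-validation service running in the cluster
      url: http://data-validator.my-project.svc:8080/validate
```

With a hook like this in place, a merge to main fails unless the validation service approves the incoming data, so bad data never reaches the production branch that training pipelines read from.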

Furthermore, this architecture is designed to manage all AI data formats, including unstructured data and metadata. Teams can easily access and apply version control to any format, whether it's structured, semi-structured, or unstructured, alongside its associated metadata. Ultimately, this comprehensive control is vital for compliance. For highly regulated industries like financial services, healthcare, telecommunications (telco), and the public sector, this system provides a verifiable chain of custody for training data, helping ensure that every model can be traced back to a specific, immutable dataset snapshot.

Business impact: Beyond technical reproducibility

While the technical advantages of this architecture are clear, its impact on the enterprise is also significant. By introducing lakeFS-based data versioning, ML teams can deliver up to 2-3 times the number of models with a smaller team. Because this system eliminates environment drift, rework, and dataset duplication, teams are able to scale their output without needing to increase headcount.

This increased throughput is largely driven by faster experimentation cycles. Using zero-copy branching, teams can immediately test new datasets and features instead of waiting for large volumes of data to be duplicated or infrastructure to be provisioned, reducing testing time by as much as 80%. Additionally, because logical branching avoids the physical replication of massive datasets, organizations can prevent unnecessary growth in cloud storage, helping to potentially lower overall infrastructure costs.

Beyond speed and cost, enterprises can also benefit from improved risk management and operational stability. Immutable data commits establish a verifiable chain of custody for training data, so that every deployed model can be traced back to a specific data snapshot. This level of compliance and audit readiness is crucial for meeting internal governance requirements and supporting highly regulated industries. Finally, if faulty data does enter the pipelines, instant rollbacks lower incident recovery time, helping teams avoid operational disruptions and costly retraining cycles.

Get started

The benefit of the quickstart is its "ready-to-run" design, which makes it easier for non-experts to get started. To begin, you'll need access to an OpenShift cluster with OpenShift AI installed. While standard user access is sufficient for most tasks, you'll need cluster-admin permissions if you choose to configure an optional model registry. Deployment is straightforward: the repository includes a Makefile and automation scripts that handle the heavy lifting of deploying lakeFS and object storage, configuring the S3 gateway, and setting up the necessary data connections within your OpenShift AI project.

Once the prerequisites are in place, the workflow follows a simple sequence. First, you'll create a Data Science Project directly within the OpenShift AI dashboard. Next, run the provided setup script to initialize your lakeFS repositories and object storage buckets. With the environment prepared, you can launch a workbench and clone the quickstart repository. Finally, by following the provided notebooks, you'll learn how to branch your data, train the model, and see how lakeFS tracks every change.

The Fraud-Detection-data-versioning-with-lakeFS quickstart serves as an entry point for teams aiming to professionalize their AI operations, helping them transition away from manual data management toward a highly reliable, version-controlled architecture. 

Ready to try it? Check out the Fraud-Detection-data-versioning-with-lakeFS quickstart in our official repository.


About the authors

Sean has been (back) at Red Hat since 2020 working with strategic Red Hat ecosystem partners to co-create integrated product solutions and get them to market.

