Imagine that after 60 hours of training a large language model (LLM) on an 8x NVIDIA H100 GPU cluster costing $55 an hour, your job fails at 90% completion. You must restart from your last checkpoint, which was saved 3 hours ago, wasting $165 in compute costs and delaying model deployment. This scenario isn't hypothetical; it's a daily reality for organizations running distributed AI training workloads in production environments.

LLM training represents one of the most compute-intensive workloads in modern AI infrastructure. With GPU clusters costing thousands of dollars per day and training jobs running for days or weeks, any interruption can result in catastrophic financial losses and project delays.

This article explores the challenges of distributed model training, examines the limitations of existing periodic checkpointing approaches, and introduces just-in-time (JIT) checkpointing, a new capability coming to Red Hat OpenShift AI 3.2 that protects your training investments while enabling new operational patterns like GPU-as-a-service and sustainable AI training practices.

Problem: Training failures are expensive

The financial and operational impact of training failures extends far beyond individual job restarts. Industry studies show that a substantial fraction of GPU compute is wasted. For example, one study found many training jobs running at less than 50% GPU utilization. Others show that interruptions and slowdowns caused by failures or stragglers can stretch job completion to roughly twice the planned time or more. Taken together, findings across large-scale machine learning (ML) clusters suggest that 30% or more of GPU spending may be lost to idle time, interruptions, and inefficiencies.

Real-world cost impact

Consider a financial services organization training a fraud detection model on an 8 GPU cluster:

  • Training duration: 72 hours planned
  • GPU cost: $55/hour for 8 GPU cluster (AWS p5.48xlarge)
  • Total planned cost: $3,960 (72 hours × $55/hour)
  • Failure at hour 60: Must restart from last checkpoint (potentially hours back)
  • Lost progress: 3 hours from last checkpoint = $165 wasted
  • With 4 failures a week: $660 in lost compute costs weekly per training job

The business impact compounds:

  • Delayed model deployment
  • Missed market opportunities
  • Reduced data scientist productivity

Challenges in shared cluster environments

The problem intensifies in shared cluster environments, where Kueue-enabled scheduling introduces preemption scenarios:

  • Users submit training jobs without guaranteed resource availability
  • Higher-priority jobs can preempt running training workloads
  • Infrastructure maintenance requires graceful job termination
  • Node failures and resource rebalancing interrupt training
  • GPU underutilization when preempted jobs release more resources than priority jobs need, leaving idle capacity

In these environments, the lack of resilient model checkpointing means that any interruption (planned or unplanned) can result in significant training progress loss. Additionally, without the ability to dynamically scale training jobs, clusters experience GPU underutilization when preemption occurs.

Periodic checkpointing and its limitations

To mitigate training failures, most organizations implement periodic checkpointing, saving the complete training state at fixed intervals. This captures model parameters, optimizer states, learning rate schedules, and training progress, enabling training to resume from the last saved checkpoint after interruptions.

How periodic checkpointing works

Periodic checkpointing saves training state based on:

  • Step intervals: Every N training steps (for example, every 500 steps)
  • Epoch intervals: After each complete pass through the dataset

For example, a typical configuration might save checkpoints every epoch:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/mnt/checkpoints",
    save_strategy="epoch",  # Save after each epoch
    save_total_limit=5,     # Keep only the 5 most recent checkpoints
)
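
The same API supports step-interval saving. The following is a minimal sketch, assuming a checkpoint every 500 steps:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/mnt/checkpoints",
    save_strategy="steps",  # Save every N steps instead of every epoch
    save_steps=500,         # N = 500 training steps
    save_total_limit=5,     # Keep only the 5 most recent checkpoints
)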

Critical limitations of periodic checkpointing

While periodic checkpointing provides basic protection, it suffers from several critical limitations.

Windows of vulnerability

Checkpoints save at fixed intervals, creating gaps where failures result in lost progress.

  • If checkpoints save every epoch and each epoch takes about an hour, then a failure at 58 minutes loses nearly an hour of training
  • For an 8 GPU cluster at $55 an hour, that's $53 in wasted compute time for each failure
  • With multiple failures, losses compound quickly

Training interruption

Current periodic checkpoint implementations use synchronous saves that block training progress during write operations:

  • Large models (with over 70 billion parameters) can take 5-15 minutes to checkpoint
  • During this time, GPUs sit idle, wasting compute resources
  • For a $55/hour cluster, 10 minutes of idle time = $9 wasted
  • Over a 72 hour training run with 10 checkpoints, that's $92 in idle GPU time

Note: PyTorch has addressed this with asynchronous distributed checkpointing and safetensors format support, which enable non-blocking checkpoint saves. HuggingFace Transformers will adopt PyTorch's async checkpoint capabilities in future releases. JIT checkpointing also addresses this issue using asynchronous CUDA streams.
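
As a rough illustration (not the Transformers integration itself), the sketch below assumes a recent PyTorch release that exposes torch.distributed.checkpoint.async_save, and uses a toy model in place of a real distributed setup; the checkpoint path is a placeholder:

import torch
import torch.distributed.checkpoint as dcp

# Toy stand-ins; in practice these are your FSDP/DDP-wrapped training objects.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters())

state_dict = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}

# async_save returns a future and writes in the background,
# so the training loop is not blocked for the full duration of the save.
future = dcp.async_save(state_dict, checkpoint_id="/mnt/checkpoints/step-1000")

# ... continue running training steps here ...

future.result()  # Wait only when the checkpoint must be durable (e.g., before exit).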

Unpredictable preemption

In shared clusters with Kueue scheduling, preemption timing is unpredictable:

  • A job might be preempted 30 seconds after the last epoch checkpoint
  • Or 58 minutes into an epoch that takes about an hour
  • Users have no control over when preemption occurs
  • The result: highly variable progress loss, ranging from seconds to hours

Storage and I/O overhead

  • Large model checkpoints can reach 100-500 GB or more
  • Frequent checkpointing creates significant I/O pressure on shared storage
  • Storage capacity fills quickly with multiple checkpoint versions
  • Network bandwidth consumption impacts other workloads

Incomplete failure protection

Periodic checkpoints only save during "safe" intervals:

  • SIGTERM signals during preemption may arrive between checkpoints
  • Infrastructure failures are unpredictable
  • Node evictions and resource rebalancing don't align with checkpoint schedules

The need for a better solution

These limitations create a fundamental tension:

  • Checkpoint too frequently: Wastes GPU time with idle periods and excessive I/O
  • Checkpoint too infrequently: Risks losing significant training progress on failures

Organizations need a checkpointing solution that:

  • Saves training state precisely when needed (on termination signals)
  • Minimizes GPU idle time during checkpoint operations
  • Protects against unpredictable preemption and infrastructure events
  • Reduces storage overhead and I/O pressure
  • Works seamlessly in shared cluster environments with dynamic resource allocation

The solution: Just-in-time (JIT) checkpointing

Just-in-time checkpointing represents a paradigm shift from interval-based to event-driven checkpoint management. Instead of relying on fixed epoch/step intervals, JIT checkpointing triggers immediate checkpoint saving upon receiving termination signals (SIGTERM), ensuring minimal training time loss during infrastructure events.

How JIT checkpointing works

The core innovation lies in signal handling, asynchronous execution using CUDA streams, and graceful termination.

  1. Signal handler registration: Training process registers a SIGTERM handler on startup
  2. Graceful termination period: Kubernetes sends SIGTERM before terminating pods (configurable using terminationGracePeriodSeconds)
  3. Immediate checkpoint trigger: Handler triggers asynchronous checkpoint save using a separate CUDA stream
  4. Asynchronous execution with a separate CUDA stream: A dedicated CUDA stream handles the checkpoint copy, so the save does not wait for the current step to complete or block the GPU
  5. Checkpoint completion: Checkpoint completes within the grace period before pod termination
  6. Automatic resume: Jobs automatically detect and resume from the saved checkpoint on restart

Asynchronous checkpointing can be achieved in a distributed training environment with CUDA streams:

Figure: Distributed JIT checkpoint architecture
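
The following is a minimal, single-node sketch of that flow in plain PyTorch, assuming a CUDA-capable node and a hypothetical jit_checkpoint helper; the production implementation is more involved, but the shape is the same: register a SIGTERM handler, stage state to host memory on a dedicated CUDA stream, and persist it before terminationGracePeriodSeconds expires.

import signal
import threading
import torch

# Set by the SIGTERM handler, checked from the training loop.
checkpoint_requested = threading.Event()

def handle_sigterm(signum, frame):
    # Keep the handler trivial: just record that Kubernetes asked us to stop.
    checkpoint_requested.set()

# Step 1: register the handler when the training process starts.
signal.signal(signal.SIGTERM, handle_sigterm)

# Step 4: a dedicated CUDA stream so device-to-host copies can overlap with
# compute on the default stream (pinned host buffers would be used in practice).
ckpt_stream = torch.cuda.Stream()

def jit_checkpoint(model, optimizer, step, path="/mnt/checkpoints/jit.pt"):
    """Hypothetical helper: stage state to CPU on the side stream, then persist it."""
    with torch.cuda.stream(ckpt_stream):
        cpu_model_state = {
            k: v.detach().to("cpu", non_blocking=True)
            for k, v in model.state_dict().items()
        }
    ckpt_stream.synchronize()  # Make sure all copies have landed before writing.
    torch.save(
        {"model": cpu_model_state, "optimizer": optimizer.state_dict(), "step": step},
        path,
    )  # Step 5: the write must finish within terminationGracePeriodSeconds.

# Inside the training loop:
# if checkpoint_requested.is_set():
#     jit_checkpoint(model, optimizer, step)
#     raise SystemExit(0)  # Step 6: on restart, detect the checkpoint and resume.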

How JIT checkpointing overcomes periodic checkpointing limitations

Both approaches aim to solve the same underlying challenges. Here's how periodic and JIT checkpointing compare:

Windows of vulnerability

  • Periodic checkpointing: Failures between checkpoints lose progress (up to full interval).
  • JIT checkpointing: Checkpoint triggered on SIGTERM. Loss limited to grace period only.

Training interruption

  • Periodic checkpointing: Synchronous saves block GPU for 5-15 minutes during training.
  • JIT checkpointing: Asynchronous checkpoint using CUDA streams, a non-blocking operation.

Unpredictable preemption

  • Periodic checkpointing: Preemption timing is random, resulting in highly variable loss.
  • JIT checkpointing: Checkpoint is triggered by preemption signal for consistent protection.

Storage overhead

  • Periodic checkpointing: Frequent checkpoints mean excessive I/O and storage.
  • JIT checkpointing: Checkpoints only on termination events reduce overhead.

Cost savings

Returning to our financial services example: on the 8-GPU cluster without JIT checkpointing, assume epoch-based periodic checkpoints where each epoch takes about an hour:

  • Preemption 58 minutes into an epoch loses 58 minutes of progress (the last checkpoint was at the end of the previous epoch)
  • $53 wasted per preemption ($55/hour × 0.97 hours)
  • $213 each week wasted (with 4 preemptions in a week)

Compare that to JIT checkpointing:

  • Preemption triggers SIGTERM, so the checkpoint completes within the configured grace period. For a 70B-parameter model, it's safe to assume around 3 to 5 minute iterations plus 5 to 10 minute checkpoint saves, for a grace period of just 10 minutes.
  • $9 wasted per preemption ($55/hour × 10/60 hours)
  • $37/week wasted (with 4 preemptions in a week)

That's a savings of $176 every week, or about $9,100 a year.
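
The arithmetic behind those figures is straightforward (the dollar amounts above are rounded):

HOURLY_RATE = 55           # $/hour for the 8-GPU cluster
PREEMPTIONS_PER_WEEK = 4

periodic_loss = HOURLY_RATE * (58 / 60)  # ~$53 lost per preemption
jit_loss = HOURLY_RATE * (10 / 60)       # ~$9 lost per preemption (10-minute grace period)

weekly_savings = (periodic_loss - jit_loss) * PREEMPTIONS_PER_WEEK  # ~$176
annual_savings = weekly_savings * 52                                # ~$9,100

print(f"Weekly savings: ${weekly_savings:.0f}, annual savings: ${annual_savings:.0f}")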

Use cases and common scenarios

JIT checkpointing enables new operational patterns and solves critical production challenges.

Kueue preemption protection

  • Higher-priority jobs trigger SIGTERM to lower-priority jobs
  • JIT checkpointing saves state before preemption
  • Jobs resume seamlessly when resources become available

Planned maintenance windows

Infrastructure teams can drain nodes gracefully, and resume jobs automatically after maintenance is complete.

Resource rebalancing

Cluster autoscaling triggers pod evictions, and JIT checkpointing preserves the training state without progress loss.

GPU-as-a-service

  • Organizations can offer elastic GPU resources
  • Users don't need guaranteed allocations
  • Workloads gracefully yield to higher-priority requests

Combining JIT with periodic checkpointing

For maximum resilience, JIT checkpointing works in combination with periodic checkpointing: use periodic checkpoints to protect against unexpected failures (such as node crashes or power loss), and JIT checkpoints to protect against planned events (such as preemption, maintenance, or scaling).
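
As an illustration of how the two strategies can be layered in user code today, the sketch below keeps the epoch-based TrainingArguments from earlier and adds a small Hugging Face TrainerCallback that, once SIGTERM arrives, asks the Trainer to save at the next step boundary. This is a simplified stand-in, not the OpenShift AI implementation:

import signal
import threading
from transformers import TrainerCallback

_terminating = threading.Event()
signal.signal(signal.SIGTERM, lambda signum, frame: _terminating.set())

class SaveOnSigtermCallback(TrainerCallback):
    """Request a checkpoint at the next step boundary once SIGTERM arrives."""

    def on_step_end(self, args, state, control, **kwargs):
        if _terminating.is_set():
            control.should_save = True           # Trigger a save now...
            control.should_training_stop = True  # ...then stop the training loop.
        return control

# trainer = Trainer(model=..., args=training_args, callbacks=[SaveOnSigtermCallback()])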

Red Hat OpenShift AI integration with Kubeflow Trainer v2

Red Hat OpenShift AI 3.2 introduces Kubeflow Trainer v2 with native support for resilient model checkpointing. You can enable the feature using the feature flag in the Kubeflow SDK. This feature currently supports HuggingFace transformers and TRL trainers (SFTTrainer, Trainer, Seq2SeqTrainer, DPOTrainer, PPOTrainer, RewardTrainer, and so on).

The SDK automatically injects checkpoint configuration into your training function at runtime. It detects existing checkpoints and resumes training automatically, and manages graceful shutdown during preemption and termination events.

That means no code changes are needed: checkpoint behavior is defined in the SDK rather than scattered across codebases. Your existing training scripts work without modification, and your jobs automatically detect and resume from checkpoints.

Coming soon: Full JIT checkpointing support

The resilient model checkpointing capabilities demonstrated in this article are expected to be fully integrated into future releases of Red Hat OpenShift AI through Kubeflow Trainer v2. In addition, we're working on an enhanced UI and user experience, including:

  • Model training dashboard with real-time progress tracking and checkpoint visibility
  • TrainJob lifecycle management so you can pause, resume, and monitor training jobs from the UI
  • TrainJob scaling so you can horizontally scale jobs based on available resources

Resilient model checkpointing with JIT capabilities represents a fundamental shift in distributed AI training economics. With JIT checkpointing, you can:

  • Eliminate vulnerability windows: Checkpoint triggered on SIGTERM signals, not arbitrary intervals
  • Enable asynchronous execution: CUDA streams enable non-blocking checkpoint saves, preventing corruption
  • Minimize lost progress: Progress loss is limited to the configured grace period (minutes instead of hours)
  • Activate event-driven protection: Checkpoints save precisely when needed, reducing storage overhead
  • Deliver measurable ROI: Up to $9,100/year savings per 8-GPU cluster instance

For more information, read the official OpenShift AI documentation, and check out the distributed-workloads GitHub repository for examples.


About the author

I'm a Senior Software Engineer at Red Hat working on Kubeflow and distributed AI training. I have nearly seven years in the industry across AI startups and cloud infrastructure at Ericsson, and I now focus on making large-scale model training resilient and efficient on Kubernetes using Kubeflow.
