Imagine that after 60 hours of training a large language model (LLM) on an 8x NVIDIA H100 GPU cluster costing $55 an hour, your job fails at 90% completion. You must restart from your last checkpoint, saved 3 hours earlier, wasting $165 in compute costs and delaying model deployment. This scenario isn't hypothetical; it's a daily reality for organizations running distributed AI training workloads in production environments.
LLM training is one of the most compute-intensive workloads in modern AI infrastructure. With GPU clusters costing thousands of dollars per training run and jobs running for days or weeks, any interruption can result in catastrophic financial losses and project delays.
This article explores the challenges of distributed model training, examines the limitations of existing periodic checkpointing approaches, and introduces just-in-time (JIT) checkpointing, a new capability coming to Red Hat OpenShift AI 3.2 that protects your training investments while enabling new operational patterns like GPU-as-a-service and sustainable AI training practices.
Problem: Training failures are expensive
The financial and operational impact of training failures extends far beyond individual job restarts. Industry studies consistently show that a substantial fraction of GPU compute is wasted: one study found many training jobs operating at less than 50% GPU utilization, while others show that interruptions and slowdowns caused by failures or stragglers can stretch job completion to roughly twice the planned time or more. Taken together, findings across large-scale machine learning (ML) clusters suggest that 30% or more of GPU spending may be lost to idle time, interruptions, and inefficiencies.
Real-world cost impact
Consider a financial services organization training a fraud detection model on an 8-GPU cluster:
- Training duration: 72 hours planned
- GPU cost: $55/hour for an 8-GPU cluster (AWS p5.48xlarge)
- Total planned cost: $3,960 (72 hours × $55/hour)
- Failure at hour 60: Must restart from last checkpoint (potentially hours back)
- Lost progress: 3 hours from last checkpoint = $165 wasted
- With 4 failures a week: $660 in lost compute costs weekly per training job
The business impact compounds:
- Delayed model deployment
- Missed market opportunities
- Reduced data scientist productivity
Challenges in shared cluster environments
The problem intensifies in shared cluster environments, where Kueue-enabled scheduling introduces preemption scenarios:
- Users submit training jobs without guaranteed resource availability
- Higher-priority jobs can preempt running training workloads
- Infrastructure maintenance requires graceful job termination
- Node failures and resource rebalancing interrupt training
- GPU underutilization when preempted jobs release more resources than the higher-priority jobs need, leaving idle capacity
In these environments, the lack of resilient model checkpointing means that any interruption (planned or unplanned) can result in significant training progress loss. Additionally, without the ability to dynamically scale training jobs, clusters experience GPU underutilization when preemption occurs.
Periodic checkpointing and its limitations
To mitigate training failures, most organizations implement periodic checkpointing, saving the complete training state at fixed intervals. This captures model parameters, optimizer states, learning rate schedules, and training progress, enabling training to resume from the last saved checkpoint after interruptions.
How periodic checkpointing works
Periodic checkpointing saves training state based on:
- Step intervals: Every N training steps (for example, every 500 steps)
- Epoch intervals: After each complete pass through the dataset
For example, a typical configuration might save checkpoints every epoch:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/mnt/checkpoints",
    save_strategy="epoch",   # Save after each epoch
    save_total_limit=5,      # Keep only the 5 most recent checkpoints
)

Critical limitations of periodic checkpointing
While periodic checkpointing provides basic protection, it suffers from several critical limitations.
Windows of vulnerability
Checkpoints save at fixed intervals, creating gaps where failures result in lost progress.
- If checkpoints save every epoch and each epoch takes about an hour, then a failure at 58 minutes loses nearly an hour of training
- For an 8 GPU cluster at $55 an hour, that's $53 in wasted compute time for each failure
- With multiple failures, losses compound quickly
Training interruption
Current periodic checkpoint implementations use synchronous saves that block training progress during write operations:
- Large models (with over 70 billion parameters) can take 5-15 minutes to checkpoint
- During this time, GPUs sit idle, wasting compute resources
- For a $55/hour cluster, 10 minutes of idle time = $9 wasted
- Over a 72-hour training run with 10 checkpoints, that's roughly $92 of idle GPU time
Note: PyTorch has addressed this with asynchronous distributed checkpointing and safetensors format support, which enable non-blocking checkpoint saves. HuggingFace Transformers will adopt PyTorch's async checkpoint capabilities in future releases. JIT checkpointing also addresses this issue using asynchronous CUDA streams.
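For reference, here's a minimal sketch of what a non-blocking save looks like with PyTorch's torch.distributed.checkpoint module (available in recent PyTorch releases). This is my own illustration, not the OpenShift AI implementation: the toy model, the single-process gloo group, and the checkpoint path are stand-ins for a real multi-node job.

import os

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
import torch.nn as nn

# A single-process group and a toy model stand in for a real distributed job.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                        rank=0, world_size=1)
model = nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# async_save returns a future, so the training loop keeps running while the
# checkpoint is written in the background.
os.makedirs("/tmp/checkpoints", exist_ok=True)
state_dict = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
checkpoint_future = dcp.async_save(state_dict, checkpoint_id="/tmp/checkpoints/step_500")

# ... training continues here ...

checkpoint_future.result()  # block only when the save must be complete
dist.destroy_process_group()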
Unpredictable preemption
In shared clusters with Kueue scheduling, preemption timing is unpredictable:
- A job might be preempted 30 seconds after the last epoch checkpoint
- Or 58 minutes into an epoch that takes about an hour
- Users have no control over when preemption occurs
- The result is highly variable progress loss, ranging from seconds to hours
Storage and I/O overhead
- Large model checkpoints can reach 100-500 GB or more
- Frequent checkpointing creates significant I/O pressure on shared storage
- Storage capacity fills quickly with multiple checkpoint versions
- Network bandwidth consumption impacts other workloads
Incomplete failure protection
Periodic checkpoints only save during "safe" intervals:
- SIGTERM signals during preemption may arrive between checkpoints
- Infrastructure failures are unpredictable
- Node evictions and resource rebalancing don't align with checkpoint schedules
The need for a better solution
These limitations create a fundamental tension:
- Checkpoint too frequently: Wastes GPU time with idle periods and excessive I/O
- Checkpoint too infrequently: Risks losing significant training progress on failures
Organizations need a checkpointing solution that:
- Saves training state precisely when needed (on termination signals)
- Minimizes GPU idle time during checkpoint operations
- Protects against unpredictable preemption and infrastructure events
- Reduces storage overhead and I/O pressure
- Works seamlessly in shared cluster environments with dynamic resource allocation
The solution: Just-in-time (JIT) checkpointing
Just-in-time checkpointing represents a paradigm shift from interval-based to event-driven checkpoint management. Instead of relying on fixed epoch/step intervals, JIT checkpointing triggers immediate checkpoint saving upon receiving termination signals (SIGTERM), ensuring minimal training time loss during infrastructure events.
How JIT checkpointing works
The core innovation lies in signal handling, asynchronous execution using CUDA streams, and graceful termination.
- Signal handler registration: Training process registers a SIGTERM handler on startup
- Graceful termination period: Kubernetes sends SIGTERM before terminating pods (configurable using terminationGracePeriodSeconds)
- Immediate checkpoint trigger: Handler triggers asynchronous checkpoint save using a separate CUDA stream
- Asynchronous execution with a separate CUDA stream: Creates a dedicated CUDA stream for checkpointing so the save doesn't wait for the current step to complete and doesn't block the GPU
- Checkpoint completion: Checkpoint completes within the grace period before pod termination
- Automatic resume: Jobs automatically detect and resume from the saved checkpoint on restart
Asynchronous checkpointing can be achieved in a distributed training environment with CUDA streams.
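The following simplified sketch shows the idea; it's an illustration rather than the OpenShift AI implementation, and it assumes a CUDA device, a toy model, and illustrative paths. A SIGTERM handler copies the model state to the CPU on a dedicated CUDA stream and then persists it, so the save doesn't contend with the training stream:

import os
import signal
import sys

import torch
import torch.nn as nn

# Toy model and optimizer stand in for the real training job.
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
global_step = 0

# Dedicated CUDA stream so checkpoint copies don't block the training stream.
checkpoint_stream = torch.cuda.Stream()

def save_jit_checkpoint(path="/mnt/checkpoints/jit"):
    # Copy model state to the CPU on the checkpoint stream, then persist it.
    os.makedirs(path, exist_ok=True)
    with torch.cuda.stream(checkpoint_stream):
        cpu_state = {name: tensor.detach().to("cpu", non_blocking=True)
                     for name, tensor in model.state_dict().items()}
    checkpoint_stream.synchronize()  # wait only for the checkpoint copies
    torch.save({"model": cpu_state,
                "optimizer": optimizer.state_dict(),
                "step": global_step},
               os.path.join(path, f"step_{global_step}.pt"))

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM before pod termination; the save must finish
    # within terminationGracePeriodSeconds.
    save_jit_checkpoint()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)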
How JIT checkpointing overcomes periodic checkpointing limitations
Both approaches aim to solve the same underlying challenges. Here's how periodic and JIT checkpointing compare:
Windows of vulnerability
- Periodic checkpointing: Failures between checkpoints lose progress (up to full interval).
- JIT checkpointing: Checkpoint triggered on SIGTERM. Loss limited to grace period only.
Training interruption
- Periodic checkpointing: Synchronous saves block GPU for 5-15 minutes during training.
- JIT checkpointing: Asynchronous checkpoint using CUDA streams, a non-blocking operation.
Unpredictable preemption
- Periodic checkpointing: Preemption timing is random, resulting in highly variable loss.
- JIT checkpointing: Checkpoint is triggered by preemption signal for consistent protection.
Storage overhead
- Periodic checkpointing: Frequent checkpoints mean excessive I/O and storage.
- JIT checkpointing: Checkpoints only on termination events reduce overhead.
Cost savings
Returning to our financial services example, with an 8-GPU cluster and no JIT checkpointing, assume epoch-based periodic checkpoints with roughly an hour per epoch:
- Preemption after 58 minutes loses 58 minutes of progress (the last checkpoint was at the end of the previous epoch)
- $53 wasted per preemption ($55/hour × 0.97 hours)
- $213 wasted each week (with 4 preemptions in a week)
Compare that to JIT checkpointing:
- Preemption triggers SIGTERM, so the checkpoint completes within the configured grace period. For a 70B parameter model, it's reasonable to assume iterations of around 3 to 5 minutes plus checkpoint saves of 5 to 10 minutes, for a grace period of just 10 minutes.
- $9 wasted per preemption ($55/hour × 10/60 hours)
- $37 wasted each week (with 4 preemptions in a week)
That's a savings of $176 every week, or about $9,100 a year.
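As a quick check of these figures, here's the arithmetic as a standalone snippet, using the assumptions above ($55/hour, 58 minutes lost per preemption without JIT, a 10-minute grace period with JIT, and 4 preemptions per week):

GPU_COST_PER_HOUR = 55.0
PREEMPTIONS_PER_WEEK = 4

# Without JIT: lose everything since the last epoch checkpoint (~58 minutes).
loss_without_jit = GPU_COST_PER_HOUR * (58 / 60)              # ~ $53 per preemption
weekly_without_jit = loss_without_jit * PREEMPTIONS_PER_WEEK  # ~ $213 per week

# With JIT: lose at most the termination grace period (~10 minutes).
loss_with_jit = GPU_COST_PER_HOUR * (10 / 60)                 # ~ $9 per preemption
weekly_with_jit = loss_with_jit * PREEMPTIONS_PER_WEEK        # ~ $37 per week

weekly_savings = weekly_without_jit - weekly_with_jit         # ~ $176 per week
annual_savings = weekly_savings * 52                          # ~ $9,100 per year
print(f"Weekly savings: ${weekly_savings:.0f}, annual: ${annual_savings:.0f}")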
Use cases and common scenarios
JIT checkpointing enables new operational patterns and solves critical production challenges.
Kueue preemption protection
- Higher-priority jobs trigger SIGTERM to lower-priority jobs
- JIT checkpointing saves state before preemption
- Jobs resume seamlessly when resources become available
Planned maintenance windows
Infrastructure teams can drain nodes gracefully, and resume jobs automatically after maintenance is complete.
Resource rebalancing
Cluster autoscaling triggers pod evictions, and JIT checkpointing preserves the training state without progress loss.
GPU-as-a-service
- Organizations can offer elastic GPU resources
- Users don't need guaranteed allocations
- Workloads gracefully yield to higher-priority requests
Combining JIT with periodic checkpointing
For maximum resilience, JIT checkpointing works in combination with periodic checkpointing. Use periodic checkpoints to protect against unexpected failures (such as node crashes, power loss). Use JIT checkpoints to protect against planned events (such as preemption, maintenance, scaling).
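As an illustration of combining the two with Hugging Face Transformers, here's a sketch (not the OpenShift AI implementation): it assumes the model and train_dataset objects from your existing training script, and the checkpoint paths are illustrative.

import os
import signal
import sys

from transformers import Trainer, TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

# Periodic protection: epoch checkpoints guard against unexpected failures.
training_args = TrainingArguments(
    output_dir="/mnt/checkpoints",
    save_strategy="epoch",
    save_total_limit=5,
)

# model and train_dataset come from your existing training script.
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)

# JIT-style protection: save immediately when SIGTERM arrives (preemption,
# maintenance, scaling) instead of waiting for the next epoch boundary.
def handle_sigterm(signum, frame):
    trainer.save_model("/mnt/checkpoints/jit")  # model weights
    trainer.save_state()                        # trainer state to output_dir
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

# Resume from the most recent periodic checkpoint in output_dir, if one exists.
last_checkpoint = (get_last_checkpoint(training_args.output_dir)
                   if os.path.isdir(training_args.output_dir) else None)
trainer.train(resume_from_checkpoint=last_checkpoint)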
Red Hat OpenShift AI integration with Kubeflow Trainer v2
Red Hat OpenShift AI 3.2 introduces Kubeflow Trainer v2 with native support for resilient model checkpointing. You can enable the feature using the feature flag in the Kubeflow SDK. This feature currently supports Hugging Face Transformers and TRL trainers (SFTTrainer, Trainer, Seq2SeqTrainer, DPOTrainer, PPOTrainer, RewardTrainer, and so on).
The SDK automatically injects checkpoint configuration into your training function at runtime. It detects existing checkpoints and resumes training automatically, and manages graceful shutdown during preemption and termination events.
That means there's no need for code changes: checkpoint behavior is defined in the SDK rather than scattered across code bases. Your existing training scripts work without modification, and your jobs automatically detect and resume from checkpoints.
Coming soon: Full JIT checkpointing support
The resilient model checkpointing capabilities demonstrated in this article are expected to be fully integrated into future releases of Red Hat OpenShift AI through Kubeflow Trainer v2. In addition, we're working on an enhanced UI and user experience, including:
- Model training dashboard with real-time progress tracking and checkpoint visibility
- TrainJob lifecycle management so you can pause, resume, and monitor training jobs from the UI
- TrainJob scaling so you can horizontally scale jobs based on available resources
Resilient model checkpointing with JIT capabilities represents a fundamental shift in distributed AI training economics. With JIT checkpointing, you can:
- Eliminate vulnerability windows: Checkpoint triggered on SIGTERM signals, not arbitrary intervals
- Enable asynchronous execution: CUDA streams enable non-blocking checkpoint saves, preventing corruption
- Minimize lost progress: Progress loss is limited to the configured grace period (minutes instead of hours)
- Activate event-driven protection: Checkpoints save precisely when needed, reducing storage overhead
- Deliver measurable ROI: Up to $9,100/year savings per 8-GPU cluster instance
For more information, read the official OpenShift AI documentation, and check out the distributed-workloads GitHub repository for examples.
About the author
I'm a Senior Software Engineer at Red Hat working on Kubeflow and distributed AI training. I have nearly seven years in the industry across AI startups and cloud infrastructure at Ericsson, and I now focus on making large-scale model training resilient and efficient on Kubernetes using Kubeflow.