Imagine that after 60 hours of training a large language model (LLM) on an 8x NVIDIA H100 GPU cluster costing $55 an hour, your job fails at 90% completion. You must restart from your last checkpoint, which was saved 3 hours ago, wasting $165 in compute costs and delaying model deployment. This scenario isn't hypothetical; it's a daily reality for organizations running distributed AI training workloads in production environments.

LLM training represents one of the most compute-intensive workloads in modern AI infrastructure. With GPU clusters costing thousands of dollars per day and training jobs running for days or weeks, any interruption can result in catastrophic financial losses and project delays.

This article explores the challenges of distributed model training, examines the limitations of existing periodic checkpointing approaches, and introduces just-in-time (JIT) checkpointing, a new capability coming to Red Hat OpenShift AI 3.2 that protects your training investments while enabling new operational patterns like GPU-as-a-service and sustainable AI training practices.

Problem: Training failures are expensive

The financial and operational impact of training failures extends far beyond individual job restarts. Industry studies show that a substantial fraction of GPU compute is wasted. For example, one study found many training jobs running at less than 50% GPU utilization. Others show that interruptions and slowdowns caused by failures or stragglers can stretch job completion to roughly twice the planned time or more. Taken together, findings across large-scale machine learning (ML) clusters suggest that 30% or more of GPU spending may be lost to idle time, interruptions, and inefficiencies.

Real-world cost impact

Consider a financial services organization training a fraud detection model on an 8 GPU cluster:

  • Training duration: 72 hours planned
  • GPU cost: $55/hour for 8 GPU cluster (AWS p5.48xlarge)
  • Total planned cost: $3,960 (72 hours × $55/hour)
  • Failure at hour 60: Must restart from last checkpoint (potentially hours back)
  • Lost progress: 3 hours from last checkpoint = $165 wasted
  • With 4 failures a week: $660 in lost compute costs weekly per training job

The business impact compounds:

  • Delayed model deployment
  • Missed market opportunities
  • Reduced data scientist productivity

Challenges in shared cluster environments

The problem intensifies in shared cluster environments, where Kueue-enabled scheduling introduces preemption scenarios:

  • Users submit training jobs without guaranteed resource availability
  • Higher-priority jobs can preempt running training workloads
  • Infrastructure maintenance requires graceful job termination
  • Node failures and resource rebalancing interrupt training
  • GPU underutilization when preempted jobs release more resources than priority jobs need, leaving idle capacity

In these environments, the lack of resilient model checkpointing means that any interruption (planned or unplanned) can result in significant training progress loss. Additionally, without the ability to dynamically scale training jobs, clusters experience GPU underutilization when preemption occurs.

Periodic checkpointing and its limitations

To mitigate training failures, most organizations implement periodic checkpointing, saving the complete training state at fixed intervals. This captures model parameters, optimizer states, learning rate schedules, and training progress, enabling training to resume from the last saved checkpoint after interruptions.

How periodic checkpointing works

Periodic checkpointing saves training state based on:

  • Step intervals: Every N training steps (for example, every 500 steps)
  • Epoch intervals: After each complete pass through the dataset

For example, a typical configuration might save checkpoints every epoch:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/mnt/checkpoints",
    save_strategy="epoch",  # Save after each epoch
    save_total_limit=5,     # Keep only the 5 most recent checkpoints
)
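
The same API supports step-interval saving. The following is a minimal sketch, assuming a checkpoint every 500 steps:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/mnt/checkpoints",
    save_strategy="steps",  # Save every N steps instead of every epoch
    save_steps=500,         # N = 500 training steps
    save_total_limit=5,     # Keep only the 5 most recent checkpoints
)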

Critical limitations of periodic checkpointing

While periodic checkpointing provides basic protection, it suffers from several critical limitations.

Windows of vulnerability

Checkpoints save at fixed intervals, creating gaps where failures result in lost progress.

  • If checkpoints save every epoch and each epoch takes about an hour, then a failure at 58 minutes loses nearly an hour of training
  • For an 8 GPU cluster at $55 an hour, that's $53 in wasted compute time for each failure
  • With multiple failures, losses compound quickly

Training interruption

Current periodic checkpoint implementations use synchronous saves that block training progress during write operations:

  • Large models (with over 70 billion parameters) can take 5-15 minutes to checkpoint
  • During this time, GPUs sit idle, wasting compute resources
  • For a $55/hour cluster, 10 minutes of idle time = $9 wasted
  • Over a 72 hour training run with 10 checkpoints, that's $92 in idle GPU time

Note: PyTorch has addressed this with asynchronous distributed checkpointing and safetensors format support, which enable non-blocking checkpoint saves. HuggingFace Transformers will adopt PyTorch's async checkpoint capabilities in future releases. JIT checkpointing also addresses this issue using asynchronous CUDA streams.
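
As a rough illustration (not the Transformers integration itself), the sketch below assumes a recent PyTorch release that exposes torch.distributed.checkpoint.async_save, and uses a toy model in place of a real distributed setup; the checkpoint path is a placeholder:

import torch
import torch.distributed.checkpoint as dcp

# Toy stand-ins; in practice these are your FSDP/DDP-wrapped training objects.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters())

state_dict = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}

# async_save returns a future and writes in the background,
# so the training loop is not blocked for the full duration of the save.
future = dcp.async_save(state_dict, checkpoint_id="/mnt/checkpoints/step-1000")

# ... continue running training steps here ...

future.result()  # Wait only when the checkpoint must be durable (e.g., before exit).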

Unpredictable preemption

In shared clusters with Kueue scheduling, preemption timing is unpredictable:

  • A job might be preempted 30 seconds after the last epoch checkpoint
  • Or 58 minutes into an epoch that takes about an hour
  • Users have no control over when preemption occurs
  • The result: highly variable progress loss, ranging from seconds to hours

Storage and I/O overhead

  • Large model checkpoints can reach 100-500 GB or more
  • Frequent checkpointing creates significant I/O pressure on shared storage
  • Storage capacity fills quickly with multiple checkpoint versions
  • Network bandwidth consumption impacts other workloads

Incomplete failure protection

Periodic checkpoints only save during "safe" intervals:

  • SIGTERM signals during preemption may arrive between checkpoints
  • Infrastructure failures are unpredictable
  • Node evictions and resource rebalancing don't align with checkpoint schedules

The need for a better solution

These limitations create a fundamental tension:

  • Checkpoint too frequently: Wastes GPU time with idle periods and excessive I/O
  • Checkpoint too infrequently: Risks losing significant training progress on failures

Organizations need a checkpointing solution that:

  • Saves training state precisely when needed (on termination signals)
  • Minimizes GPU idle time during checkpoint operations
  • Protects against unpredictable preemption and infrastructure events
  • Reduces storage overhead and I/O pressure
  • Works seamlessly in shared cluster environments with dynamic resource allocation

The solution: Just-in-time (JIT) checkpointing

Just-in-time checkpointing represents a paradigm shift from interval-based to event-driven checkpoint management. Instead of relying on fixed epoch/step intervals, JIT checkpointing triggers immediate checkpoint saving upon receiving termination signals (SIGTERM), ensuring minimal training time loss during infrastructure events.

How JIT checkpointing works

The core innovation lies in signal handling, asynchronous execution using CUDA streams, and graceful termination.

  1. Signal handler registration: Training process registers a SIGTERM handler on startup
  2. Graceful termination period: Kubernetes sends SIGTERM before terminating pods (configurable using terminationGracePeriodSeconds)
  3. Immediate checkpoint trigger: Handler triggers asynchronous checkpoint save using a separate CUDA stream
  4. Asynchronous execution with a separate CUDA stream: A dedicated CUDA stream handles the checkpoint copy, so the save does not wait for the current step to complete or block the GPU
  5. Checkpoint completion: Checkpoint completes within the grace period before pod termination
  6. Automatic resume: Jobs automatically detect and resume from the saved checkpoint on restart

Asynchronous checkpointing can be achieved in a distributed training environment with CUDA streams:

Figure: Distributed JIT checkpoint architecture
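
The following is a minimal, single-node sketch of that flow in plain PyTorch, assuming a CUDA-capable node and a hypothetical jit_checkpoint helper; the production implementation is more involved, but the shape is the same: register a SIGTERM handler, stage state to host memory on a dedicated CUDA stream, and persist it before terminationGracePeriodSeconds expires.

import signal
import threading
import torch

# Set by the SIGTERM handler, checked from the training loop.
checkpoint_requested = threading.Event()

def handle_sigterm(signum, frame):
    # Keep the handler trivial: just record that Kubernetes asked us to stop.
    checkpoint_requested.set()

# Step 1: register the handler when the training process starts.
signal.signal(signal.SIGTERM, handle_sigterm)

# Step 4: a dedicated CUDA stream so device-to-host copies can overlap with
# compute on the default stream (pinned host buffers would be used in practice).
ckpt_stream = torch.cuda.Stream()

def jit_checkpoint(model, optimizer, step, path="/mnt/checkpoints/jit.pt"):
    """Hypothetical helper: stage state to CPU on the side stream, then persist it."""
    with torch.cuda.stream(ckpt_stream):
        cpu_model_state = {
            k: v.detach().to("cpu", non_blocking=True)
            for k, v in model.state_dict().items()
        }
    ckpt_stream.synchronize()  # Make sure all copies have landed before writing.
    torch.save(
        {"model": cpu_model_state, "optimizer": optimizer.state_dict(), "step": step},
        path,
    )  # Step 5: the write must finish within terminationGracePeriodSeconds.

# Inside the training loop:
# if checkpoint_requested.is_set():
#     jit_checkpoint(model, optimizer, step)
#     raise SystemExit(0)  # Step 6: on restart, detect the checkpoint and resume.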

How JIT checkpointing overcomes periodic checkpointing limitations

Both approaches aim to solve the same underlying challenges. Here's how periodic and JIT checkpointing compare:

Windows of vulnerability

  • Periodic checkpointing: Failures between checkpoints lose progress (up to full interval).
  • JIT checkpointing: Checkpoint triggered on SIGTERM. Loss limited to grace period only.

Training interruption

  • Periodic checkpointing: Synchronous saves block GPU for 5-15 minutes during training.
  • JIT checkpointing: Asynchronous checkpoint using CUDA streams, a non-blocking operation.

Unpredictable preemption

  • Periodic checkpointing: Preemption timing is random, resulting in highly variable loss.
  • JIT checkpointing: Checkpoint is triggered by preemption signal for consistent protection.

Storage overhead

  • Periodic checkpointing: Frequent checkpoints mean excessive I/O and storage.
  • JIT checkpointing: Checkpoints only on termination events reduce overhead.

Cost savings

Returning to our financial services example: on the 8-GPU cluster without JIT checkpointing, assume epoch-based periodic checkpoints where each epoch takes about an hour:

  • Preemption 58 minutes into an epoch loses 58 minutes of progress (the last checkpoint was at the end of the previous epoch)
  • $53 wasted per preemption ($55/hour × 0.97 hours)
  • $213 each week wasted (with 4 preemptions in a week)

Compare that to JIT checkpointing:

  • Preemption triggers SIGTERM, so the checkpoint completes within the configured grace period. For a 70B-parameter model, it's safe to assume around 3 to 5 minute iterations plus 5 to 10 minute checkpoint saves, for a grace period of just 10 minutes.
  • $9 wasted per preemption ($55/hour × 10/60 hours)
  • $37/week wasted (with 4 preemptions in a week)

That's a savings of $176 every week, or about $9,100 a year.
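
The arithmetic behind those figures is straightforward (the dollar amounts above are rounded):

HOURLY_RATE = 55           # $/hour for the 8-GPU cluster
PREEMPTIONS_PER_WEEK = 4

periodic_loss = HOURLY_RATE * (58 / 60)  # ~$53 lost per preemption
jit_loss = HOURLY_RATE * (10 / 60)       # ~$9 lost per preemption (10-minute grace period)

weekly_savings = (periodic_loss - jit_loss) * PREEMPTIONS_PER_WEEK  # ~$176
annual_savings = weekly_savings * 52                                # ~$9,100

print(f"Weekly savings: ${weekly_savings:.0f}, annual savings: ${annual_savings:.0f}")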

Use cases and common scenarios

JIT checkpointing enables new operational patterns and solves critical production challenges.

Kueue preemption protection

  • Higher-priority jobs trigger SIGTERM to lower-priority jobs
  • JIT checkpointing saves state before preemption
  • Jobs resume seamlessly when resources become available

Planned maintenance windows

Infrastructure teams can drain nodes gracefully, and resume jobs automatically after maintenance is complete.

Resource rebalancing

Cluster autoscaling triggers pod evictions, and JIT checkpointing preserves the training state without progress loss.

GPU-as-a-service

  • Organizations can offer elastic GPU resources
  • Users don't need guaranteed allocations
  • Workloads gracefully yield to higher-priority requests

Combining JIT with periodic checkpointing

For maximum resilience, JIT checkpointing works in combination with periodic checkpointing: use periodic checkpoints to protect against unexpected failures (such as node crashes or power loss), and JIT checkpoints to protect against planned events (such as preemption, maintenance, or scaling).
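
As an illustration of how the two strategies can be layered in user code today, the sketch below keeps the epoch-based TrainingArguments from earlier and adds a small Hugging Face TrainerCallback that, once SIGTERM arrives, asks the Trainer to save at the next step boundary. This is a simplified stand-in, not the OpenShift AI implementation:

import signal
import threading
from transformers import TrainerCallback

_terminating = threading.Event()
signal.signal(signal.SIGTERM, lambda signum, frame: _terminating.set())

class SaveOnSigtermCallback(TrainerCallback):
    """Request a checkpoint at the next step boundary once SIGTERM arrives."""

    def on_step_end(self, args, state, control, **kwargs):
        if _terminating.is_set():
            control.should_save = True           # Trigger a save now...
            control.should_training_stop = True  # ...then stop the training loop.
        return control

# trainer = Trainer(model=..., args=training_args, callbacks=[SaveOnSigtermCallback()])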

Red Hat OpenShift AI integration with Kubeflow Trainer v2

Red Hat OpenShift AI 3.2 introduces Kubeflow Trainer v2 with native support for resilient model checkpointing. You can enable the feature using the feature flag in the Kubeflow SDK. This feature currently supports HuggingFace transformers and TRL trainers (SFTTrainer, Trainer, Seq2SeqTrainer, DPOTrainer, PPOTrainer, RewardTrainer, and so on).

The SDK automatically injects checkpoint configuration into your training function at runtime. It detects existing checkpoints and resumes training automatically, and manages graceful shutdown during preemption and termination events.

That means no code changes are needed: checkpoint behavior is defined in the SDK rather than scattered across codebases. Your existing training scripts work without modification, and your jobs automatically detect and resume from checkpoints.

Coming soon: Full JIT checkpointing support

The resilient model checkpointing capabilities demonstrated in this article are expected to be fully integrated into future releases of Red Hat OpenShift AI through Kubeflow Trainer v2. In addition, we're working on an enhanced UI and user experience, including:

  • Model training dashboard with real-time progress tracking and checkpoint visibility
  • TrainJob lifecycle management so you can pause, resume, and monitor training jobs from the UI
  • TrainJob scaling so you can horizontally scale jobs based on available resources

Resilient model checkpointing with JIT capabilities represents a fundamental shift in distributed AI training economics. With JIT checkpointing, you can:

  • Eliminate vulnerability windows: Checkpoint triggered on SIGTERM signals, not arbitrary intervals
  • Enable asynchronous execution: CUDA streams enable non-blocking checkpoint saves, preventing corruption
  • Minimize lost progress: Progress loss is limited to the configured grace period (minutes instead of hours)
  • Activate event-driven protection: Checkpoints save precisely when needed, reducing storage overhead
  • Deliver measurable ROI: Up to $9,100/year savings per 8-GPU cluster instance

For more information, read the official OpenShift AI documentation, and check out the distributed-workloads GitHub repository for examples.


About the author

I'm a Senior Software Engineer at Red Hat working on Kubeflow and distributed AI training. I have nearly seven years in the industry across AI startups and cloud infrastructure at Ericsson, and I now focus on making large-scale model training resilient and efficient on Kubernetes using Kubeflow.
