The modern era of AI training, particularly for large models, faces simultaneous demands for computational scale and strict data privacy. Traditional machine learning (ML) requires centralizing training data, which creates significant hurdles around data privacy, security, and the cost of moving large data volumes.
This challenge is magnified across heterogeneous global infrastructure spanning multicloud, hybrid cloud, and edge environments, where organizations must train models on datasets that remain distributed while protecting data privacy.
Federated learning (FL) addresses this challenge by moving model training to the data. Remote clusters or devices (collaborators/clients) train models locally on their private data and share only model updates, not the raw data, with a central server (aggregator). This helps protect data privacy end to end. The approach is crucial for privacy-sensitive or data-intensive scenarios such as those found in healthcare, retail, industrial automation, and software-defined vehicles (SDVs) with advanced driver-assistance systems (ADAS) and autonomous driving (AD) functionality, such as lane departure warning, adaptive cruise control, and driver fatigue monitoring.
To manage and orchestrate these distributed computation units, we utilize the federated learning custom resource definition (CRD) of Open Cluster Management (OCM).
OCM: The foundation for distributed operations
OCM is a Kubernetes multicluster orchestration platform and an open source CNCF Sandbox project.
OCM employs a hub-spoke architecture and uses a pull-based model.
- Hub cluster: This acts as the central control plane (OCM Control Plane) responsible for orchestration.
- Managed (spoke) clusters: These are remote clusters where workloads are deployed.
Managed clusters pull their desired state from and report status back to the hub. OCM provides APIs like ManifestWork and Placement to schedule workloads. We’ll cover more federated learning API details below.
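For example, a minimal ManifestWork created in a managed cluster's namespace on the hub is pulled and applied by that cluster's agent. The sketch below is generic: the Deployment name and image are placeholders for an FL client workload, not the actual add-on components.

```yaml
# The work agent on managed cluster "cluster1" pulls this ManifestWork
# from the hub and applies the resources listed under workload.manifests.
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: fl-client-example        # placeholder name
  namespace: cluster1            # the managed cluster's namespace on the hub
spec:
  workload:
    manifests:
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: fl-client        # hypothetical FL client workload
          namespace: default
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: fl-client
          template:
            metadata:
              labels:
                app: fl-client
            spec:
              containers:
                - name: client
                  image: quay.io/example/fl-client:latest  # placeholder image
```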
We'll now look at why and how the distributed cluster management design of OCM aligns closely with the requirements of deploying and managing FL contributors.
Native integration: OCM as the FL orchestrator
1. Architectural alignment
The combination of OCM and FL is effective because of their structural congruence: both systems share the same foundational design of a hub-spoke architecture and a pull-based protocol, so OCM supports FL natively.
| OCM component | FL component | Function |
| --- | --- | --- |
| OCM hub control plane | Aggregator/server | Orchestrates state and aggregates model updates. |
| Managed cluster | Collaborator/client | Pulls desired state/global model, trains locally, and pushes updates. |
2. Flexible placement for multiactor client selection
OCM’s core operational advantage is its ability to automate client selection in FL setups through flexible cross-cluster scheduling. The OCM Placement API implements sophisticated, multicriteria policies, providing efficiency and privacy compliance simultaneously.
The Placement API enables integrated client selection based on the following factors:
- Data locality (privacy criterion): FL workloads are scheduled only to managed clusters that claim to have the necessary private data.
- Resource optimization (efficiency criterion): The OCM scheduling strategy offers flexible policies that assess multiple factors at once, selecting clusters not only on data presence but also on advertised attributes like CPU/memory availability (see the Placement sketch after this list).
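A Placement combining both criteria could look like the following sketch. The placement name, namespace, and the `fl-dataset` label are illustrative assumptions; the prioritizer names belong to OCM's built-in set.

```yaml
# Select up to 3 clusters that advertise the required dataset (privacy
# criterion) and rank them by allocatable CPU and memory (efficiency
# criterion). Assumes a ManagedClusterSetBinding exists in this namespace.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: fl-clients               # illustrative name
  namespace: fl-demo             # illustrative namespace
spec:
  numberOfClusters: 3
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            fl-dataset: mnist    # assumed label advertising data locality
  prioritizerPolicy:
    mode: Exact
    configurations:
      - scoreCoordinate:
          builtIn: ResourceAllocatableCPU
        weight: 2
      - scoreCoordinate:
          builtIn: ResourceAllocatableMemory
        weight: 1
```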
3. Secure communication between collaborator and aggregator via OCM add-on registration
The FL add-on collaborator is deployed on the managed clusters and leverages OCM’s add-on registration mechanism to establish protected, encrypted communication with the aggregator on the hub. Upon registration, each collaborator add-on automatically obtains certificates from the OCM hub. These certificates authenticate and encrypt all model updates exchanged during FL, enabling confidentiality, integrity, and privacy across multiple clusters.
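In practice, enabling an OCM add-on for a given cluster means creating a ManagedClusterAddOn in that cluster's namespace on the hub. The add-on name below is an assumption about how the FL collaborator add-on might be registered, not a confirmed identifier.

```yaml
# Enables the (hypothetically named) federated-learning add-on on managed
# cluster "cluster1"; registration triggers certificate issuance from the
# hub, securing aggregator/collaborator traffic.
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ManagedClusterAddOn
metadata:
  name: federated-learning       # assumed add-on name
  namespace: cluster1            # the managed cluster's namespace on the hub
spec:
  installNamespace: open-cluster-management-agent-addon
```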
Together, these mechanisms provide integrated client selection based on both data locality and resource capacity, while keeping all traffic between the aggregator and collaborators authenticated and encrypted.
The FL training lifecycle: OCM-driven scheduling
A dedicated federated learning controller manages the FL training lifecycle across multiple clusters. The controller uses CRDs to define the workflow, supports popular FL runtimes such as Flower and OpenFL, and is extensible to other runtimes.
The OCM-managed workflow proceeds through defined stages:
| Step(s) | OCM/FL phase | Description |
| --- | --- | --- |
| 0 | Prerequisite | The federated learning add-on is installed. The FL application is available as a Kubernetes-deployable container. |
| 1 | FederatedLearning CR | A custom resource is created on the hub, defining the framework (e.g., flower), the number of training rounds (each round is one full cycle in which clients train locally and return updates for aggregation), the required number of available training contributors, and the model storage configuration (e.g., a PersistentVolumeClaim (PVC) path). See the sketch after this table. |
| 2, 3, 4 | Waiting and scheduling | The resource status is "Waiting". The server (aggregator) is initialized on the hub, and the OCM controller uses Placement to schedule clients (collaborators). |
| 5, 6 | Running | The status changes to "Running". Clients pull the global model, train it locally on private data, and synchronize model updates back to the aggregator. The training rounds parameter determines how many times this phase repeats. |
| 7 | Completed | The status reaches "Completed". Validation can be performed by deploying Jupyter notebooks to verify the model's performance against the entire aggregated dataset (e.g., confirming it predicts all Modified National Institute of Standards and Technology (MNIST) digits). |
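As a minimal sketch of the step 1 resource, assuming field names that mirror the description in the table (the API group, version, and spec layout are illustrative, not the confirmed CRD schema):

```yaml
# Hypothetical FederatedLearning CR; all field names below are assumptions
# chosen to mirror the fields described in the table above.
apiVersion: federation-ai.open-cluster-management.io/v1alpha1  # assumed group/version
kind: FederatedLearning
metadata:
  name: mnist-training
  namespace: fl-demo
spec:
  framework: flower              # FL runtime (e.g., flower or openfl)
  server:
    rounds: 3                    # number of global training rounds
    minAvailableClients: 2       # required number of training contributors
    storage:
      type: PersistentVolumeClaim
      name: model-storage        # PVC for the aggregated model
      path: /models/mnist        # storage path for the model
  client:
    placement: fl-clients        # references the Placement sketched earlier
```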
Red Hat Advanced Cluster Management: Enterprise control and operational value for FL environments
The core APIs and architecture provided by OCM serve as the foundation of Red Hat Advanced Cluster Management for Kubernetes. Red Hat Advanced Cluster Management provides lifecycle management for a homogeneous FL platform (Red Hat OpenShift) across a heterogeneous infrastructure footprint. Running the FL controller on Red Hat Advanced Cluster Management provides additional benefits beyond what OCM alone offers. Red Hat Advanced Cluster Management delivers centralized visibility, policy-driven governance, and lifecycle management across multicluster estates, significantly enhancing the manageability of distributed and FL environments.
1. Observability
Red Hat Advanced Cluster Management provides unified observability across distributed FL workflows, enabling operators to monitor training progress, cluster status, and cross-cluster coordination from a single, consistent interface.
2. Enhanced connectivity and security
The FL CRD supports protected communication between the aggregator and clients through TLS-enabled channels. It also offers flexible networking options beyond NodePort—including LoadBalancer, Route, and other ingress types—providing protected and adaptable connectivity across heterogeneous environments.
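As an illustration, the transport could be selected through the custom resource itself. The fragment below extends the earlier hypothetical FederatedLearning sketch and is an assumption rather than the confirmed schema.

```yaml
# Hypothetical fragment of the FederatedLearning spec: expose the
# aggregator over an OpenShift Route with TLS instead of a NodePort.
spec:
  server:
    listener:
      type: Route                  # alternatives: NodePort, LoadBalancer
      tls:
        enabled: true              # encrypt aggregator/client traffic
        secretName: fl-server-tls  # assumed Secret holding the certificate
```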
3. End-to-end ML lifecycle integration with Red Hat Advanced Cluster Management and Red Hat OpenShift AI
By leveraging Red Hat Advanced Cluster Management with OpenShift AI, enterprises can build a complete FL workflow—from model prototyping and distributed training to validation and production deployment—within a unified platform.
Wrap up
FL is transforming AI by moving model training directly to the data, effectively resolving the friction between computational scale, data transfer, and strict privacy requirements. Here we've highlighted how Red Hat Advanced Cluster Management provides the orchestration, protection, and observability needed to manage complex distributed Kubernetes environments.
Get in touch with Red Hat today to explore how you can empower your organization with federated learning.
About the author
Andreas Spanner leads Red Hat’s Cloud Strategy & Digital Transformation efforts across Australia and New Zealand. Spanner has worked on a wide range of initiatives across different industries in Europe, North America and APAC including full-scale ERP migrations, HR, finance and accounting, manufacturing, supply chain logistics transformations and scalable core banking strategies to support regional business growth strategies. He has an engineering degree from the University of Ravensburg, Germany.