The modern era of AI training, particularly for large models, faces simultaneous demands for computational scale and strict data privacy. Traditional machine learning (ML) requires centralizing the training data, which creates significant hurdles around data privacy, security, and the cost of moving large data volumes.
This challenge is magnified across heterogeneous global infrastructure spanning multicloud, hybrid cloud, and edge environments, where organizations must train models on datasets that remain distributed while still protecting data privacy.
Federated learning (FL) addresses this challenge by moving the model training to the data. Remote clusters or devices (collaborators/clients) train models locally on their private data and share only model updates, not the raw data, back to a central server (aggregator). This helps protect data privacy end to end. The approach is crucial for privacy-sensitive or high-data-volume scenarios in healthcare, retail, industrial automation, and software-defined vehicles (SDVs) with advanced driver-assistance systems (ADAS) and autonomous driving (AD) functionality, such as lane departure warning, adaptive cruise control, and driver fatigue monitoring.
To manage and orchestrate these distributed computation units, we utilize the federated learning custom resource definition (CRD) of Open Cluster Management (OCM).
OCM: The foundation for distributed operations
OCM is a Kubernetes multicluster orchestration platform and an open source CNCF Sandbox project.
OCM employs a hub-spoke architecture and uses a pull-based model.
- Hub cluster: This acts as the central control plane (OCM Control Plane) responsible for orchestration.
- Managed (spoke) clusters: These are remote clusters where workloads are deployed.
Managed clusters pull their desired state from the hub and report status back to it. OCM provides APIs like ManifestWork and Placement to schedule workloads. We'll cover more federated learning API details below.
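For illustration, here is a minimal ManifestWork that delivers a workload to one managed cluster; the namespace and the wrapped Deployment are placeholders rather than part of the FL add-on:

```yaml
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: example-workload
  namespace: cluster1          # namespace of the target managed cluster on the hub
spec:
  workload:
    manifests:                 # raw Kubernetes manifests the cluster agent pulls and applies
      - apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: hello
          namespace: default
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: hello
          template:
            metadata:
              labels:
                app: hello
            spec:
              containers:
                - name: hello
                  image: quay.io/podman/hello   # placeholder image
```

The agent on cluster1 pulls this resource, applies the embedded manifests, and reports their status back to the hub, which is the same pull pattern the FL collaborators rely on.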
We'll now look at why and how the distributed cluster management design of OCM aligns closely with the requirements of deploying and managing FL contributors.
Native integration: OCM as the FL orchestrator
1. Architectural alignment
The combination of OCM and FL is effective due to their fundamental structural congruence. OCM natively supports FL because both systems share an identical foundational design: the hub-spoke architecture and a pull-based protocol.
| OCM component | FL component | Function |
|---|---|---|
| OCM Hub Control Plane | Aggregator/Server | Orchestrates state and aggregates model updates. |
| Managed Cluster | Collaborator/Client | Pulls desired state/global model, trains locally, and pushes updates. |
2. Flexible placement for multiactor client selection
OCM's core operational advantage is its ability to automate client selection in FL setups through flexible cross-cluster scheduling. Using the OCM Placement API, operators can implement sophisticated, multicriteria selection policies that deliver efficiency and privacy compliance simultaneously.
The Placement API enables integrated client selection based on the following factors (a minimal example follows this list):
- Data locality (privacy criterion): FL workloads are scheduled only to managed clusters that claim to have the necessary private data.
- Resource optimization (efficiency criterion): The OCM scheduling strategy offers flexible policies that enable the combined assessment of multiple factors. It selects clusters not only based on data presence but also on advertised attributes like CPU/memory availability.
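As a sketch of how these criteria can combine, the Placement below selects up to three clusters that advertise the training data and favors those with the most allocatable CPU and memory. The label key fl.example.com/has-training-data is a hypothetical convention for data locality, not an OCM built-in:

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: fl-collaborators
  namespace: fl-project
spec:
  numberOfClusters: 3                        # how many collaborators to select
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            fl.example.com/has-training-data: "true"   # hypothetical data-locality label
  prioritizerPolicy:
    mode: Exact
    configurations:
      - scoreCoordinate:
          builtIn: ResourceAllocatableCPU    # prefer clusters with spare CPU
        weight: 1
      - scoreCoordinate:
          builtIn: ResourceAllocatableMemory # prefer clusters with spare memory
        weight: 1
```

The predicate enforces the privacy criterion as a hard filter, while the prioritizers rank the remaining candidates by resource capacity.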
3. Secure communication between collaborator and aggregator via OCM add-on registration
The FL add-on collaborator is deployed on the managed clusters and leverages OCM’s add-on registration mechanism to establish protected, encrypted communication with the aggregator on the hub. Upon registration, each collaborator add-on automatically obtains certificates from the OCM hub. These certificates authenticate and encrypt all model updates exchanged during FL, enabling confidentiality, integrity, and privacy across multiple clusters.
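Enabling the collaborator on a given cluster follows OCM's standard add-on pattern. A minimal sketch is shown below; the add-on name federated-learning is an assumption for illustration and depends on how the FL add-on is published:

```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ManagedClusterAddOn
metadata:
  name: federated-learning     # assumed add-on name
  namespace: cluster1          # namespace of the managed cluster on the hub
spec:
  installNamespace: open-cluster-management-agent-addon   # where the agent runs on the spoke
```

During registration, the hub signs a client certificate for the add-on through the standard certificate signing request (CSR) flow, which is what authenticates and encrypts the collaborator-to-aggregator channel described above.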
Combined with Placement-based scheduling, this registration flow assigns AI training tasks only to adequately resourced clusters that hold the right data, while keeping all model-update traffic authenticated and encrypted.
The FL training lifecycle: OCM-driven scheduling
A dedicated Federated Learning Controller manages the FL training lifecycle across multiple clusters. The controller uses CRDs to define the workflow, supports popular FL runtimes such as Flower and OpenFL, and is extensible to other frameworks.
The OCM-managed workflow proceeds through defined stages:
| Steps | OCM/FL phase | Description |
|---|---|---|
| 0 | Prerequisite | The federated learning add-on is installed. The FL application is available as a Kubernetes-deployable container. |
| 1 | FederatedLearning CR | A custom resource is created on the hub (see the sketch after this table), defining the framework (e.g., flower), the number of training rounds (each round being one full cycle in which clients train locally and return updates for aggregation), the required number of available training contributors, and the model storage configuration (e.g., a PersistentVolumeClaim (PVC) path). |
| 2, 3, 4 | Waiting and scheduling | The resource status is "Waiting". The server (aggregator) is initialized on the hub, and the OCM controller uses Placement to schedule clients (collaborators). |
| 5, 6 | Running | The status changes to "Running". Clients pull the global model, train it locally on private data, and synchronize model updates back to the aggregator. The training rounds parameter determines how many times this phase repeats. |
| 7 | Completed | The status reaches "Completed". Validation can be performed by deploying Jupyter notebooks to verify the model's performance against the entire aggregated dataset (e.g., confirming it predicts all Modified National Institute of Standards and Technology (MNIST) digits). |
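To make step 1 concrete, here is a sketch of what such a custom resource could look like. The API group and field names are illustrative, inferred from the description above rather than copied from the published CRD:

```yaml
apiVersion: federation.open-cluster-management.io/v1alpha1   # illustrative API group
kind: FederatedLearning
metadata:
  name: mnist-training
  namespace: fl-project
spec:
  framework: flower            # FL runtime, e.g., flower or openfl
  server:
    rounds: 10                 # one round = local training on each client + aggregation
    minAvailableClients: 3     # required number of available training contributors
    storage:
      pvcName: model-storage   # illustrative: PVC that holds the aggregated model
      path: /models/mnist      # illustrative: path within the PVC
  client:
    placementRef:
      name: fl-collaborators   # Placement that selects the collaborator clusters
```

Once applied, the controller drives the resource through the Waiting, Running, and Completed phases described above.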
Red Hat Advanced Cluster Management: Enterprise control and operational value for FL environments
The core APIs and architecture provided by OCM serve as the foundation of Red Hat Advanced Cluster Management for Kubernetes. Red Hat Advanced Cluster Management provides lifecycle management for a homogeneous FL platform (Red Hat OpenShift) across a heterogeneous infrastructure footprint. Running the FL controller on it adds benefits beyond what OCM alone offers: centralized visibility, policy-driven governance, and lifecycle management across multicluster estates, which significantly enhance the manageability of distributed FL environments.
1. Observability
Red Hat Advanced Cluster Management provides unified observability across distributed FL workflows, enabling operators to monitor training progress, cluster status, and cross-cluster coordination from a single, consistent interface.
2. Enhanced connectivity and security
The FL CRD supports protected communication between the aggregator and clients through TLS-enabled channels. It also offers flexible networking options beyond NodePort, including LoadBalancer, Route, and other ingress types, providing protected and adaptable connectivity across heterogeneous environments (a Route example follows).
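On OpenShift, for example, the aggregator can be exposed through a Route with TLS passthrough so that FL traffic stays encrypted all the way to the aggregator pod; the Service name and port below are placeholders:

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: fl-aggregator
  namespace: fl-project
spec:
  to:
    kind: Service
    name: fl-aggregator        # placeholder aggregator Service
  port:
    targetPort: grpc           # placeholder port name on the Service
  tls:
    termination: passthrough   # TLS is terminated by the aggregator itself
```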
3. End-to-end ML lifecycle integration with Red Hat Advanced Cluster Management and Red Hat OpenShift AI
By leveraging Red Hat Advanced Cluster Management with OpenShift AI, enterprises can build a complete FL workflow—from model prototyping and distributed training to validation and production deployment—within a unified platform.
Wrap up
FL is transforming AI by moving model training directly to the data, effectively resolving the friction between computational scale, data transfer, and strict privacy requirements. Here we've highlighted how Red Hat Advanced Cluster Management provides the orchestration, protection, and observability needed to manage complex distributed Kubernetes environments.
Get in touch with Red Hat today to explore how you can empower your organization with federated learning.
About the author
Andreas Spanner leads Red Hat’s Cloud Strategy & Digital Transformation efforts across Australia and New Zealand. Spanner has worked on a wide range of initiatives across different industries in Europe, North America and APAC including full-scale ERP migrations, HR, finance and accounting, manufacturing, supply chain logistics transformations and scalable core banking strategies to support regional business growth strategies. He has an engineering degree from the University of Ravensburg, Germany.