The modern era of AI training, particularly for large models, faces simultaneous demands for computational scale and strict data privacy. Traditional machine learning (ML) requires centralizing the training data, which creates significant hurdles around data privacy, security, and the sheer volume of data that must be moved.

This challenge is magnified across heterogeneous global infrastructure spanning multicloud, hybrid cloud, and edge environments, where organizations must train models on datasets that remain distributed while protecting data privacy.

Privacy-preserving AI training

Federated learning (FL) addresses this challenge by moving model training to the data. Remote clusters or devices (collaborators/clients) train models locally on their private data and share only model updates (not the raw data) back to a central server (aggregator). This helps protect data privacy end to end. The approach is crucial for privacy-sensitive or high-data-volume scenarios found in healthcare, retail, industrial automation, and software-defined vehicles (SDV) with advanced driver-assistance systems (ADAS) and autonomous driving (AD) functionality, such as lane departure warning, adaptive cruise control, and driver fatigue monitoring.
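
Concretely, the aggregator combines the clients' model updates each round, typically by weighted averaging. As a point of reference (the exact strategy depends on the FL framework in use), the widely used FedAvg rule weights each client's locally trained parameters by its share of the total training samples:

$$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k}, \qquad n = \sum_{k=1}^{K} n_k$$

Here, $w_{t+1}^{k}$ are the parameters returned by client $k$ after local training in round $t$, and $n_k$ is the size of its private dataset. Only these parameters, never the raw data, leave the client.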

To manage and orchestrate these distributed computation units, we utilize the federated learning custom resource definition (CRD) of Open Cluster Management (OCM).

OCM: The foundation for distributed operations

OCM is a Kubernetes multicluster orchestration platform and an open source CNCF Sandbox project.

OCM employs a hub-spoke architecture and uses a pull-based model.

  1. Hub cluster: This acts as the central control plane (OCM Control Plane) responsible for orchestration.
  2. Managed (spoke) clusters: These are remote clusters where workloads are deployed.

Managed clusters pull their desired state from and report status back to the hub. OCM provides APIs like ManifestWork and Placement to schedule workloads. We’ll cover more federated learning API details below.
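
For orientation, here is a minimal, generic ManifestWork sketch; the name and the ConfigMap payload are illustrative rather than the FL add-on's actual output. A ManifestWork is created on the hub in the namespace that represents the target managed cluster, and that cluster's agent pulls and applies the embedded manifests:

```yaml
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
  name: fl-demo-config        # illustrative name
  namespace: cluster1         # hub namespace representing the target managed cluster
spec:
  workload:
    manifests:                # ordinary Kubernetes resources for the spoke agent to apply
      - apiVersion: v1
        kind: ConfigMap
        metadata:
          name: fl-demo
          namespace: default
        data:
          rounds: "5"
```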

We'll now look at why and how the distributed cluster management design of OCM aligns closely with the requirements of deploying and managing FL contributors.

Native integration: OCM as the FL orchestrator

1. Architectural alignment

The combination of OCM and FL is effective due to their fundamental structural congruence. OCM natively supports FL because both systems share an identical foundational design: the hub-spoke architecture and a pull-based protocol.

Architecture and concepts: mapping OCM and FL

| OCM component | FL component | Function |
| --- | --- | --- |
| OCM Hub Control Plane | Aggregator/Server | Orchestrates state and aggregates model updates. |
| Managed Cluster | Collaborator/Client | Pulls desired state/global model, trains locally, and pushes updates. |

2. Flexible placement for multiactor client selection

OCM’s core operational advantage is its ability to automate client selection in FL setups through its flexible cross-cluster scheduling. Using the OCM Placement API, operators can implement sophisticated, multicriteria policies that deliver efficiency and privacy compliance simultaneously.

The Placement API enables integrated client selection based on the following factors (a hedged example follows the list):

  • Data locality (privacy criterion): FL workloads are scheduled only to managed clusters that claim to have the necessary private data.
  • Resource optimization (efficiency criterion): The OCM scheduling strategy offers flexible policies that enable the combined assessment of multiple factors. It selects clusters not only based on data presence but also on advertised attributes like CPU/memory availability. 
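
A sketch of such a policy is below. The data-locality label key is hypothetical (clusters would advertise data availability through labels or cluster claims), and the Placement assumes a ManagedClusterSetBinding exists in its namespace:

```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: fl-clients
  namespace: fl-demo
spec:
  numberOfClusters: 3                         # how many collaborators to select
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            fl.example.com/dataset: mnist     # hypothetical data-locality label
  prioritizerPolicy:
    mode: Additive
    configurations:
      - scoreCoordinate:
          builtIn: ResourceAllocatableMemory  # prefer clusters with spare memory
        weight: 2
```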

3. Secure communication between collaborator and aggregator via OCM add-on registration

The FL add-on collaborator is deployed on the managed clusters and leverages OCM’s add-on registration mechanism to establish protected, encrypted communication with the aggregator on the hub. Upon registration, each collaborator add-on automatically obtains certificates from the OCM hub. These certificates authenticate and encrypt all model updates exchanged during FL, enabling confidentiality, integrity, and privacy across multiple clusters.
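
Enabling the add-on on a particular cluster follows OCM's standard ManagedClusterAddOn pattern. A minimal sketch, assuming the add-on is published under the name federated-learning (illustrative):

```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ManagedClusterAddOn
metadata:
  name: federated-learning    # illustrative add-on name
  namespace: cluster1         # hub namespace of the managed cluster to enable
spec:
  installNamespace: open-cluster-management-agent-addon
```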

Together, these mechanisms assign AI training tasks only to adequately resourced, authenticated clusters, combining data-locality-aware client selection with protected aggregator-collaborator communication.

The FL training lifecycle: OCM-driven scheduling

A dedicated Federated Learning Controller was developed to manage the FL training lifecycle across multiple clusters. The controller uses CRDs to define the workflows, supports popular FL runtimes such as Flower and OpenFL, and is extensible.

Workflow of FL operating under OCM management

The OCM-managed workflow proceeds through defined stages:

| Steps | OCM/FL phase | Description |
| --- | --- | --- |
| 0 | Prerequisite | The federated learning add-on is installed. The FL application is available as a Kubernetes-deployable container. |
| 1 | FederatedLearning CR | A custom resource is created on the hub, defining the framework (e.g., flower), the number of training rounds (each round is one full cycle in which clients train locally and return updates for aggregation), the required number of available training contributors, and the model storage configuration (e.g., a PersistentVolumeClaim (PVC) path). A hedged sketch of such a resource follows this table. |
| 2, 3, 4 | Waiting & Scheduling | The resource status is "Waiting". The server (aggregator) is initialized on the hub, and the OCM controller uses Placement to schedule clients (collaborators). |
| 5, 6 | Running | The status changes to "Running". Clients pull the global model, train it locally on private data, and synchronize model updates back to the aggregator. The training rounds parameter determines how many times this phase repeats. |
| 7 | Completed | The status reaches "Completed". Validation can be performed by deploying Jupyter Notebooks to verify the model's performance against the entire aggregated dataset (e.g., confirming it predicts all Modified National Institute of Standards and Technology (MNIST) digits). |
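
To make step 1 concrete, here is a hedged sketch of what such a FederatedLearning custom resource could look like. The API group, version, and field names are assumptions for illustration, not the add-on's published schema:

```yaml
apiVersion: federation.open-cluster-management.io/v1alpha1  # assumed group/version
kind: FederatedLearning
metadata:
  name: mnist-demo
  namespace: fl-demo
spec:
  framework: flower             # FL runtime (e.g., flower or openfl)
  server:
    rounds: 5                   # number of global training rounds
    minAvailableClients: 3      # required number of training contributors
    storage:
      type: PersistentVolumeClaim
      name: fl-model-pvc        # PVC where the aggregated model is stored
      path: /models
  client:
    placementRef:
      name: fl-clients          # Placement that selects the collaborators
```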

Red Hat Advanced Cluster Management: Enterprise control and operational value for FL environments

The core APIs and architecture provided by OCM serve as the foundation of Red Hat Advanced Cluster Management for Kubernetes. Red Hat Advanced Cluster Management provides lifecycle management for a homogeneous FL platform (Red Hat OpenShift) across a heterogeneous infrastructure footprint. Running the FL controller on Red Hat Advanced Cluster Management provides additional benefits beyond what OCM alone offers: centralized visibility, policy-driven governance, and lifecycle management across multicluster estates, significantly enhancing the manageability of distributed FL environments.

1. Observability

Red Hat Advanced Cluster Management provides unified observability across distributed FL workflows, enabling operators to monitor training progress, cluster status, and cross-cluster coordination from a single, consistent interface.

2. Enhanced connectivity and security

The FL CRD supports protected communication between the aggregator and clients through TLS-enabled channels. It also offers flexible networking options beyond NodePort, including LoadBalancer, Route, and other ingress types, providing protected and adaptable connectivity across heterogeneous environments.
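
As one illustration of the Route option on OpenShift, a TLS passthrough Route could expose the aggregator while keeping encryption intact all the way to the pod; the Service name and port name below are assumptions:

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: fl-aggregator
  namespace: fl-demo
spec:
  to:
    kind: Service
    name: fl-aggregator         # hypothetical aggregator Service
  port:
    targetPort: grpc            # assumed port name on the Service
  tls:
    termination: passthrough    # TLS terminates at the aggregator, not the router
```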

3. End-to-end ML lifecycle integration with Red Hat Advanced Cluster Management and Red Hat OpenShift AI

By leveraging Red Hat Advanced Cluster Management with OpenShift AI, enterprises can build a complete FL workflow, from model prototyping and distributed training to validation and production deployment, within a unified platform.

Wrap up

FL is transforming AI by moving model training directly to the data, effectively resolving the friction between computational scale, data transfer, and strict privacy requirements. Here we've highlighted how Red Hat Advanced Cluster Management provides the orchestration, protection, and observability needed to manage complex distributed Kubernetes environments.

Get in touch with Red Hat today to explore how you can empower your organization with federated learning.

About the author

Andreas Spanner leads Red Hat’s Cloud Strategy & Digital Transformation efforts across Australia and New Zealand. Spanner has worked on a wide range of initiatives across different industries in Europe, North America, and APAC, including full-scale ERP migrations; HR, finance, and accounting; manufacturing; supply chain logistics transformations; and scalable core banking strategies to support regional business growth. He has an engineering degree from the University of Ravensburg, Germany.
