In-service upgrades for telco 5G cloud-native core CaaS infrastructure with no service disruption

March 24, 2025 | Dmitry Muznikas | 4-minute read

The rapid evolution of 5G networks to bare-metal Kubernetes deployments has placed immense demands on telco infrastructure. Cloud-native core components hosted on Container-as-a-Service (CaaS) platforms provide the agility and scalability required for modern networks. However, upgrading these infrastructures poses unique challenges, especially to ensure high availability (HA) and service continuity for communication service providers (CSPs).

In this blog, we’ll explore a typical 5G core deployment architecture, the ramifications of upgrading CaaS infrastructure, and best practices for telco operators working within demanding operational constraints.

5G core site deployment: architecture and high availability

To achieve HA, 5G core cloud-native function (CNF) elements are deployed across at least two independent CaaS clusters in separate geographical locations. An advanced deployment might span three sites to enable tolerance for double-site failure, a crucial factor for disaster recovery or managing high-traffic events.

Each cluster hosts its application CNFs independently, ensuring redundancy and fault tolerance. Peering between CNFs enables seamless session processing and availability, even during disruption.

One critical aspect of such deployments is hardware dimensioning. This determines how many compute nodes can be upgraded simultaneously without jeopardizing HA. Insufficient or low spare capacity slows the upgrade process, while careful planning ensures smoother execution and a better upgrade experience without operational disruptions.
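As a rough illustration of the dimensioning question, the sketch below estimates how many worker nodes can be out of service at the same time. The function names, the peak-load percentage, and the failure reserve are hypothetical inputs chosen for illustration; real dimensioning also has to account for CNF-specific placement rules.

```python
def max_parallel_upgrades(total_workers: int, peak_load_pct: int,
                          failure_reserve: int = 1) -> int:
    """Estimate how many worker nodes can be taken out of service at once.

    total_workers   -- worker nodes in the cluster
    peak_load_pct   -- share of total capacity needed at peak, in percent (assumed input)
    failure_reserve -- nodes kept spare for unplanned failures during the upgrade
    """
    if not 0 <= peak_load_pct <= 100:
        raise ValueError("peak_load_pct must be between 0 and 100")
    # Integer ceiling of total_workers * peak_load_pct / 100.
    nodes_needed = -(-total_workers * peak_load_pct // 100)
    spare = total_workers - nodes_needed - failure_reserve
    return max(spare, 0)


def batches_required(total_workers: int, parallel: int) -> int:
    """Number of upgrade batches needed when `parallel` nodes are upgraded together."""
    if parallel <= 0:
        raise ValueError("no spare capacity: nodes cannot be upgraded in service")
    return -(-total_workers // parallel)


if __name__ == "__main__":
    parallel = max_parallel_upgrades(total_workers=40, peak_load_pct=70, failure_reserve=2)
    print(parallel, batches_required(40, parallel))
```

With these example numbers, a 40-node cluster that needs 70% of its capacity at peak and keeps two nodes in reserve can upgrade 10 nodes at a time, which gives four batches.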

Telco upgrade operational principles

Telco operators follow strict protocols to minimize disruption and reduce the risk of service outages during maintenance activities. Infrastructure upgrades are among the most complex and impactful changes a CSP performs. They are executed within maintenance windows (MW), typically limited to four hours at night. Work is usually organized around the window as follows:

Before the maintenance window:
- Perform preparatory tasks that won’t impact network operations, such as running validation scripts and monitoring
- Ensure all pre-upgrade steps, such as backups, are completed to maximize efficiency during the maintenance window
During the maintenance window:
- Execute upgrade tasks that may temporarily affect the network state
- Validate the network element to ensure full functionality post-upgrade
After the maintenance window:
- Conduct long-term validation to ensure stability, such as a graceful increase in subscriber traffic
- Monitor performance metrics and address residual issues promptly

Upgrading the CaaS infrastructure

Upgrading a bare-metal CaaS infrastructure is a complex, yet essential, operation that ensures the Kubernetes platform remains current, stable, and secure. The process must be carefully orchestrated to maintain HA and to minimize service disruptions, particularly given the scale and operational sensitivity of the telco environment.

The upgrade workflow sequentially upgrades control plane components, worker nodes, and storage nodes. Each phase needs special consideration to address the unique requirements of the infrastructure and CNF application functionality:

In-service software upgrade (ISSU):
- Ensures maximum network availability without downtime by maintaining ongoing CNF traffic
In-place software upgrade:
- Requires traffic offloading and shifts CSP services to an HA site before upgrades
- Typically faster, but involves service interruptions and additional orchestration

Both options have advantages and disadvantages, but this article focuses on ISSU, the state most CSPs aim to achieve due to its potential to deliver uninterrupted services.

ISSU relies on Kubernetes backward compatibility between the control plane and the worker nodes. This compatibility keeps the Kubernetes API stable between the new version of the master node services and the kubelet services on worker nodes that have not yet been upgraded.

In this way, the CNF application can continue to process traffic and Kubernetes operations while all workloads move gracefully to the new release. This stability, combined with upgrading workers in batches, fits CSP operational rules and allows large data center upgrades to be split across multiple MWs.
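The compatibility requirement can be expressed as a simple pre-check. The sketch below assumes the upstream Kubernetes version skew policy, under which a kubelet must not be newer than the API server and may lag it by a limited number of minor versions. The default of three minor versions is an assumption that matches recent upstream releases; older releases allowed two, so verify the value for your distribution. The function names are hypothetical.

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Parse strings such as 'v1.29.3' or '1.29' into (major, minor)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def kubelet_skew_ok(apiserver: str, kubelet: str, max_minor_skew: int = 3) -> bool:
    """Return True if the kubelet version is compatible with the API server.

    max_minor_skew is an assumption: how many minor versions a kubelet may
    lag behind the API server. Check the value for your platform.
    """
    api_major, api_minor = parse_major_minor(apiserver)
    kubelet_major, kubelet_minor = parse_major_minor(kubelet)
    if api_major != kubelet_major:
        return False
    skew = api_minor - kubelet_minor
    # A kubelet newer than the API server (negative skew) is never allowed.
    return 0 <= skew <= max_minor_skew


if __name__ == "__main__":
    # Control plane already upgraded, workers still on the previous releases.
    print(kubelet_skew_ok("v1.29.2", "v1.27.9"))
    print(kubelet_skew_ok("v1.27.0", "v1.28.1"))
```

A check like this can run before each maintenance window to confirm that workers left on the old release between windows remain within the supported skew.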

Kubernetes API compatibility also places requirements on the CNFs themselves: every Kubernetes API a CNF uses must be compatible with the target release. These applications are composed of multiple components, including Helm charts for deployment and Custom Resource Definitions (CRDs) for managing configurations specific to telco workloads.

This usually results in a round of CNF application upgrades to ensure forward compatibility with the next Kubernetes release, even before the CaaS infrastructure upgrade starts.
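One way to find out whether a CNF needs such an upgrade is to scan its rendered manifests for API versions that the target Kubernetes release no longer serves. The sketch below is illustrative only: the removal table lists a few well-known upstream removals and is not complete, the manifests are assumed to be already parsed into dictionaries (for example from the output of `helm template`), and the function and resource names are hypothetical.

```python
from typing import Dict, List, Tuple

# (apiVersion, kind) -> first Kubernetes (major, minor) release that no longer serves it.
# Partial table for illustration; consult the upstream deprecation guide for the full list.
REMOVED_IN: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("apiextensions.k8s.io/v1beta1", "CustomResourceDefinition"): (1, 22),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("autoscaling/v2beta1", "HorizontalPodAutoscaler"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def find_removed_apis(manifests: List[dict],
                      target: Tuple[int, int]) -> List[Tuple[str, str, str]]:
    """Return (kind, name, apiVersion) for resources not served by `target`."""
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion", ""), manifest.get("kind", ""))
        removed_in = REMOVED_IN.get(key)
        if removed_in is not None and target >= removed_in:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append((key[1], name, key[0]))
    return findings


if __name__ == "__main__":
    rendered = [
        {"apiVersion": "policy/v1beta1", "kind": "PodDisruptionBudget",
         "metadata": {"name": "example-cnf-pdb"}},
        {"apiVersion": "apps/v1", "kind": "Deployment",
         "metadata": {"name": "example-cnf"}},
    ]
    for kind, name, api_version in find_removed_apis(rendered, target=(1, 25)):
        print(f"{kind}/{name} uses {api_version}, which the target release no longer serves")
```

Any finding means the CNF's charts or CRDs need to move to a supported API version before the platform upgrade begins.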

Step-by-step process for upgrading a 5G site

Here's an example of an upgrade sequence:

CNF application upgrade:
- Upgrade CNF applications to ensure forward compatibility with the next Kubernetes release
Control plane upgrade:
- Upgrade CaaS control plane and Kubernetes masters while ensuring cluster stability
Worker node upgrade:
- Draining nodes: Safely transition workloads before upgrading
- Batch upgrades: Nodes are grouped into batches and upgraded in parallel or serially, depending on the following CNF elements:
  - Host group configurations
  - Pod disruption budget definitions
  - Pod termination grace period definitions
  - Available failover capacity
Storage node upgrade:
- Upgrade storage nodes serially to preserve HA and maintain replication factors
- Ensure uninterrupted operations and data integrity
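The batching step in the sequence above can be sketched as a small planner. This is a deliberately conservative simplification, not how any particular platform groups nodes: it assumes that every pod on every node in a batch is unavailable at the same time, and it only places a node in a batch if no application would exceed its allowed number of unavailable pods. The node names, application names, and limits are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def plan_batches(node_pods: Dict[str, Dict[str, int]],
                 max_unavailable: Dict[str, int],
                 max_batch_size: int) -> List[List[str]]:
    """Greedily group nodes into upgrade batches.

    node_pods       -- node name -> {application: number of its pods on that node}
    max_unavailable -- application -> pods that may be down at once (from its
                       pod disruption budget); applications not listed are unconstrained
    max_batch_size  -- nodes that spare capacity allows to be upgraded together
    """
    batches: List[List[str]] = []
    usage: List[Counter] = []  # pods taken down per application, per batch

    def fits(used: Counter, pods: Dict[str, int]) -> bool:
        return all(
            app not in max_unavailable or used[app] + count <= max_unavailable[app]
            for app, count in pods.items()
        )

    for node in sorted(node_pods):
        pods = node_pods[node]
        if not fits(Counter(), pods):
            raise ValueError(f"{node} alone exceeds a disruption budget")
        for batch, used in zip(batches, usage):
            if len(batch) < max_batch_size and fits(used, pods):
                batch.append(node)
                used.update(pods)
                break
        else:
            # No existing batch can take this node; start a new one.
            batches.append([node])
            usage.append(Counter(pods))
    return batches


if __name__ == "__main__":
    plan = plan_batches(
        node_pods={
            "worker-1": {"amf": 1, "smf": 1},
            "worker-2": {"amf": 1},
            "worker-3": {"smf": 1},
            "worker-4": {"upf": 1},
        },
        max_unavailable={"amf": 1, "smf": 1},
        max_batch_size=2,
    )
    print(plan)
```

For the example input, the planner pairs worker-1 with worker-4 and worker-2 with worker-3, because putting worker-1 and worker-2 in the same batch would take down two `amf` pods at once. In practice a drain evicts pods one at a time and waits for replacements, so real batches can often be larger than this estimate.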

What factors influence upgrade duration?

Your maintenance window must be long enough for your upgrade to complete. Many factors can influence the duration of an upgrade, and some are unique to your environment. However, there are common factors to consider and investigate, including:

CNF application characteristics:
- Pods with lower resource requirements are easier to reschedule
- Flexible anti-affinity rules simplify rescheduling
- Hardware-agnostic pods reduce complexity
Infrastructure characteristics:
- Adequate failover nodes allow smoother transitions
- Pre-cordoned nodes improve predictability
- The time needed to drain CNF pods and to upgrade and reboot nodes directly affects the completion time of each worker node batch
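To relate these factors to the maintenance window, a back-of-the-envelope estimate can help. The sketch below assumes batches run one after another and that each batch takes the sum of its drain, upgrade, and reboot times. All durations are hypothetical example values; measure them in your own environment.

```python
def batches_per_window(window_min: int, drain_min: int, upgrade_min: int,
                       reboot_min: int, validation_min: int = 0) -> int:
    """How many worker batches fit into one maintenance window.

    validation_min is time set aside within the window for post-upgrade checks.
    """
    batch_min = drain_min + upgrade_min + reboot_min
    if batch_min <= 0:
        raise ValueError("batch duration must be positive")
    usable_min = window_min - validation_min
    return max(usable_min // batch_min, 0)


def windows_needed(total_batches: int, per_window: int) -> int:
    """Maintenance windows required to complete all worker batches."""
    if per_window <= 0:
        raise ValueError("a single batch does not fit into the window")
    return -(-total_batches // per_window)  # integer ceiling


if __name__ == "__main__":
    # Four-hour window, 30 minutes kept for validation (example values).
    per_window = batches_per_window(window_min=240, drain_min=20, upgrade_min=25,
                                    reboot_min=15, validation_min=30)
    print(per_window, windows_needed(total_batches=10, per_window=per_window))
```

With these example values, each batch takes 60 minutes, three batches fit into a four-hour window, and ten batches need four windows.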

In-service upgrades in the real world

Upgrading telco 5G cloud-native core infrastructure is a sophisticated operation that demands meticulous planning and precise execution. Every phase of the process, from CNF application upgrades to worker and storage node updates, requires careful orchestration to ensure high availability and service continuity.

Success hinges on thorough planning, encompassing hardware dimensioning, Kubernetes API compatibility, and pre-upgrade validations. Properly evaluating these factors allows operators to anticipate challenges, streamline the upgrade process, and keep operations running smoothly.

Multiple factors influence the speed and efficiency of upgrades, including CNF characteristics like pod resource requirements, anti-affinity rules, and hardware dependencies, as well as infrastructure considerations like failover capacity and node batching strategies. A structured approach to managing these variables ensures that telco operators can effectively meet their maintenance windows and reduce disruptions.

For more information about Red Hat's telco services, visit the telco industry page.

About the author

Dmitry Muznikas

Principal Product Manager

Dmitry Muznikas is a Principal Product Manager at Red Hat with extensive experience in cloud infrastructure, 5G networks, and Telco-specific technologies. With a career spanning over 16 years, Dmitry has played a pivotal role in driving product strategies that align cutting-edge technologies with the unique demands of the telecommunications industry.
Currently, one of Dmitry's focus points is enabling seamless Cloud infrastructure upgrades and migrations for Communication Service Providers (CSPs) to cloud-native architectures.
