The rapid evolution of 5G networks to bare-metal Kubernetes deployments has placed immense demands on telco infrastructure. Cloud-native core components hosted on Container-as-a-Service (CaaS) platforms provide the agility and scalability required for modern networks. However, upgrading these infrastructures poses unique challenges, especially when it comes to ensuring high availability (HA) and service continuity for communication service providers (CSPs).
In this blog, we’ll explore a typical 5G core deployment architecture, the ramifications of upgrading CaaS infrastructure, and best practices for telco operators working within demanding operational constraints.
5G core site deployment: architecture and high availability
To achieve HA, 5G core cloud-native function (CNF) elements are deployed across at least two independent CaaS clusters in separate geographical locations. An advanced deployment might span three sites to enable tolerance for double-site failure, a crucial factor for disaster recovery or managing high-traffic events.
Each cluster hosts its application CNFs independently, ensuring redundancy and fault tolerance. Peering between CNFs enables seamless session processing and availability, even during disruption.
One critical aspect of such deployments is hardware dimensioning, which determines how many compute nodes can be upgraded simultaneously without jeopardizing HA. Insufficient spare capacity slows the upgrade process, because fewer nodes can be taken out of service at once. Careful planning allows a smoother upgrade without operational disruptions.
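To make the dimensioning idea concrete, here is a minimal sketch of the arithmetic. The function name, the notion of "nodes needed at peak," and the failure reserve are illustrative assumptions rather than part of any CaaS product; real dimensioning also has to account for CPU pinning, huge pages, SR-IOV, and similar constraints.

```python
def max_parallel_upgrades(total_nodes: int,
                          nodes_needed_at_peak: int,
                          failure_reserve: int = 1) -> int:
    """Return how many worker nodes can be out of service at the same time.

    total_nodes:          worker nodes in the cluster
    nodes_needed_at_peak: nodes required to carry peak CNF load (assumed known)
    failure_reserve:      nodes held back for unplanned failures (assumption)
    """
    spare = total_nodes - nodes_needed_at_peak - failure_reserve
    # No spare capacity means no node can be upgraded without risking HA.
    return max(spare, 0)


if __name__ == "__main__":
    # Hypothetical cluster: 40 workers, 32 needed at peak, 2 held in reserve.
    print(max_parallel_upgrades(40, 32, failure_reserve=2))
```

With these hypothetical numbers, six nodes can be upgraded in parallel. A cluster with no spare capacity returns zero, which is the situation the dimensioning exercise is meant to avoid.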
Telco upgrade operational principles
Telco operators follow strict protocols to minimize disruption and reduce the risk of service outages during maintenance activities. Infrastructure upgrades represent a CSP's most complex and impactful change. They are executed within maintenance windows (MW), typically limited to four hours at night. For example:
- Before the maintenance window:
  - Perform preparatory tasks that won’t impact network operations, such as running validation scripts and monitoring
  - Ensure all pre-upgrade steps, such as backups, are completed to maximize efficiency during the maintenance window
- During the maintenance window:
  - Execute upgrade tasks that may temporarily affect the network state
  - Validate the network element to ensure full functionality post-upgrade
- After the maintenance window:
  - Conduct long-term validation to ensure stability, such as a graceful increase in subscriber traffic
  - Monitor performance metrics and address residual issues promptly
Upgrading the CaaS infrastructure
Upgrading a bare-metal CaaS infrastructure is a complex, yet essential, operation that ensures the Kubernetes platform remains current, stable, and secure. The process must be carefully orchestrated to maintain HA and to minimize service disruptions, particularly given the scale and operational sensitivity of the telco environment.
The upgrade workflow sequentially upgrades control plane components, worker nodes, and storage nodes. Each phase needs special consideration to address the unique requirements of the infrastructure and CNF application functionality. There are two broad approaches to performing the upgrade:
- In-service software upgrade (ISSU):
  - Ensures maximum network availability without downtime by maintaining ongoing CNF traffic
- In-place software upgrade:
  - Requires traffic offloading and shifts CSP services to an HA site before upgrades
  - Typically faster, but involves service interruptions and additional orchestration
Both options have advantages and disadvantages, but this article focuses on ISSU, the approach most CSPs aim for because of its potential to deliver uninterrupted services.
ISSU fundamentally relies on Kubernetes backward compatibility in the control plane. This compatibility keeps the Kubernetes API stable between the new version of the master node services and the kubelet services on worker nodes that have not yet been upgraded.
In this way, the CNF application can continue processing traffic and process Kubernetes operations, while gracefully moving all workloads to the new releases. This stability and worker batch approach fits CSP operational rules, and allows splitting large data center upgrades into multiple MWs.
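The compatibility rule can be expressed as a simple version-skew check. The sketch below assumes the kubelet must never be newer than the API server and may lag it by a limited number of minor versions. That limit depends on the Kubernetes release and on your CaaS vendor, so it is passed in as a parameter instead of being hardcoded. The function names are illustrative.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Parse a version such as 'v1.29.3' or '1.29' into (major, minor)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def kubelet_skew_ok(api_server: str, kubelet: str, max_minor_skew: int) -> bool:
    """Check that a worker's kubelet is within the allowed skew of the API server.

    max_minor_skew is an assumption to take from your platform's documentation.
    """
    api_major, api_minor = parse_minor(api_server)
    kubelet_major, kubelet_minor = parse_minor(kubelet)
    if api_major != kubelet_major:
        return False
    skew = api_minor - kubelet_minor
    # A negative skew means the kubelet is newer than the control plane.
    return 0 <= skew <= max_minor_skew


if __name__ == "__main__":
    # Control plane already upgraded, worker batch still on the old release.
    print(kubelet_skew_ok("v1.29.0", "v1.28.6", max_minor_skew=2))
```

A check like this can run before each maintenance window to confirm that the remaining worker batches can safely stay on the old release until their turn.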
The fundamentals of Kubernetes API compatibility also apply to CNFs: they must remain compatible with every Kubernetes API they use. These applications are composed of multiple components, including Helm charts for deployment and Custom Resource Definitions (CRDs) for managing configurations specific to telco workloads.
This usually results in a round of CNF application upgrades to ensure forward compatibility with the next Kubernetes release, even before starting the CaaS infrastructure upgrade.
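One way to spot CNFs that need this preparatory upgrade is to scan their rendered manifests for API versions that no longer exist in the target release. The sketch below assumes the manifests are already parsed into dictionaries (for example, from the output of `helm template`). Its table of removed APIs is a deliberately short, illustrative subset, so consult the deprecation guide for your target release for the full list.

```python
# Illustrative subset of upstream API removals: (apiVersion, kind) -> release
# in which the API stopped being served. Not an exhaustive list.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): "1.22",
    ("apiextensions.k8s.io/v1beta1", "CustomResourceDefinition"): "1.22",
    ("policy/v1beta1", "PodDisruptionBudget"): "1.25",
    ("batch/v1beta1", "CronJob"): "1.25",
}


def _as_tuple(version: str) -> tuple[int, int]:
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def find_removed_apis(manifests: list[dict], target_version: str) -> list[str]:
    """Return a message for each manifest that uses an API absent in the target."""
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed_in = REMOVED_APIS.get(key)
        if removed_in and _as_tuple(target_version) >= _as_tuple(removed_in):
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append(f"{key[1]}/{name} uses {key[0]}, removed in {removed_in}")
    return findings


if __name__ == "__main__":
    # Hypothetical CNF resources.
    rendered = [
        {"apiVersion": "policy/v1beta1", "kind": "PodDisruptionBudget",
         "metadata": {"name": "amf-pdb"}},
        {"apiVersion": "apps/v1", "kind": "Deployment",
         "metadata": {"name": "amf"}},
    ]
    for finding in find_removed_apis(rendered, "1.25"):
        print(finding)
```

Any finding means the CNF's Helm chart or CRDs must be updated before the CaaS upgrade begins.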
Step-by-step process for upgrading a 5G site
Here's an example of an upgrade sequence:
- CNF application upgrades to ensure forward compatibility
- Control plane upgrade:
  - Upgrade CaaS control plane and Kubernetes masters while ensuring cluster stability
- Worker node upgrade:
  - Draining nodes: Safely transition workloads before upgrading
  - Batch upgrades: Nodes are grouped into batches and upgraded in parallel or serially, depending on the following CNF elements:
    - Host group configurations
    - Pod disruption budget definitions
    - Pod termination grace period definitions
    - Available failover capacity
- Storage node upgrade:
  - Upgrade storage nodes serially to preserve HA and maintain replication factors
  - Ensure uninterrupted operations and data integrity
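The worker batching step can be sketched as follows. This is a simplified illustration, not how any particular CaaS platform computes its batches. It assumes each node belongs to exactly one host group and that the only constraint is a per-group limit on nodes taken out of service at once. Pod disruption budgets are not modeled here, because Kubernetes enforces them itself when pods are evicted during a node drain.

```python
from collections import defaultdict


def plan_batches(nodes: dict[str, str], max_per_group: int) -> list[list[str]]:
    """Group worker nodes into upgrade batches.

    nodes:         maps node name -> host group (hypothetical labels)
    max_per_group: most nodes of one host group allowed in a single batch,
                   derived from that group's failover capacity (assumption)
    """
    if max_per_group < 1:
        raise ValueError("max_per_group must be at least 1")

    remaining = defaultdict(list)
    for name, group in nodes.items():
        remaining[group].append(name)

    batches = []
    while any(remaining.values()):
        batch = []
        for members in remaining.values():
            # Take up to max_per_group nodes from each host group.
            batch.extend(members[:max_per_group])
            del members[:max_per_group]
        batches.append(batch)
    return batches


if __name__ == "__main__":
    workers = {"w1": "dpdk", "w2": "dpdk", "w3": "dpdk",
               "w4": "general", "w5": "general"}
    for number, batch in enumerate(plan_batches(workers, max_per_group=1), 1):
        print(number, batch)
```

Nodes within a batch can be upgraded in parallel, while the batches themselves run serially. Raising `max_per_group` shortens the upgrade but requires more spare capacity in each host group.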
What factors influence upgrade duration?
Your maintenance window must be long enough for your upgrade to complete. Many factors can influence the duration of an upgrade, and some are unique to your environment. However, there are common factors to consider and investigate, including:
- CNF application characteristics:
  - Pods with lower resource requirements are easier to reschedule
  - Flexible anti-affinity rules simplify rescheduling
  - Hardware-agnostic pods reduce complexity
- Infrastructure characteristics:
  - Adequate failover nodes allow smoother transitions
  - Pre-cordoned nodes improve predictability
  - The duration of CNF pod draining, upgrading, and rebooting nodes directly affects the completion time of the worker compute batch
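These factors can be combined into a rough estimate of how many maintenance windows a site upgrade needs. The sketch below is a back-of-the-envelope calculation under stated assumptions: every batch takes the same time, batches run one after another, and part of each window is reserved for post-upgrade validation. The four-hour default matches the typical window mentioned earlier, and the other numbers are hypothetical.

```python
import math


def windows_needed(num_batches: int,
                   minutes_per_batch: int,
                   window_minutes: int = 240,
                   validation_minutes: int = 30) -> int:
    """Estimate how many maintenance windows are needed for all worker batches.

    minutes_per_batch:  drain + upgrade + reboot time of one batch (measured
                        or estimated in your environment)
    validation_minutes: time kept free at the end of each window (assumption)
    """
    usable = window_minutes - validation_minutes
    if minutes_per_batch > usable:
        raise ValueError("a single batch does not fit into one window")
    batches_per_window = usable // minutes_per_batch
    return math.ceil(num_batches / batches_per_window)


if __name__ == "__main__":
    # Hypothetical site: 12 batches, 45 minutes each
    # (15 drain + 20 upgrade + 10 reboot).
    print(windows_needed(num_batches=12, minutes_per_batch=45))
```

With these inputs, four batches fit into each window, so the site needs three maintenance windows. Shortening pod drain time or enlarging batches reduces that number.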
In-service upgrades in the real world
Upgrading Telco 5G Cloud-Native Core infrastructure is a sophisticated operation that demands meticulous planning and precise execution. Every phase of the process, from CNF application upgrades to worker and storage node updates, requires careful orchestration to ensure high availability and service continuity.
Success hinges on thorough planning, encompassing hardware dimensioning, Kubernetes API compatibility, and pre-upgrade validations. Properly evaluating these factors allows operators to anticipate challenges, streamline the upgrade process, and keep operations running smoothly.
Multiple factors influence the speed and efficiency of upgrades, including CNF characteristics like pod resource requirements, anti-affinity rules, and hardware dependencies, as well as infrastructure considerations like failover capacity and node batching strategies. A structured approach to managing these variables ensures that telco operators can effectively meet their maintenance windows and reduce disruptions.
For more information about Red Hat's telco services, visit the telco industry page.
About the author
Dmitry Muznikas is a Principal Product Manager at Red Hat with extensive experience in cloud infrastructure, 5G networks, and Telco-specific technologies. With a career spanning over 16 years, Dmitry has played a pivotal role in driving product strategies that align cutting-edge technologies with the unique demands of the telecommunications industry.
Currently, one of Dmitry's focus points is enabling seamless Cloud infrastructure upgrades and migrations for Communication Service Providers (CSPs) to cloud-native architectures.