In my previous article, I discussed the typical IT journey of a telco company and the architectural requirements of the telco industry. In this article, I explore two key concerns for the telco cloud: availability and resilience.
While some enterprises can tolerate some loss of service, the telco world has almost no tolerance for downtime. The usual target is "five nines" availability (99.999% uptime), which allows only about five minutes of downtime per year. To meet this requirement, a telco maintains software backups that can be restored to specific points in time, spare hardware capacity, geographical redundancy, and support for quick recovery from disasters.
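To make the "nines" concrete, the following minimal sketch converts an availability target into a yearly downtime budget. The function name and the choice of an average 365.25-day year are assumptions made for illustration, not part of any telco standard.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, leap years included (assumption)


def allowed_downtime_minutes(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for a number of nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** -nines  # for example, 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 7):
        print(f"{n} nines: {allowed_downtime_minutes(n):10.2f} minutes per year")
```

Each additional nine cuts the budget by a factor of ten, which is why moving from three nines to five nines changes the design problem from "recover within hours" to "recover within minutes."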
Resiliency and the telco cloud
End-to-end resiliency is a combination of the capabilities of a cloud platform, the applications that run on it, and the underlying hardware. The prerequisites of the cloud platform are inherited as requirements by the other components in the system. Applications must be designed to be resilient and have application-specific capabilities. Redundancy must be built into the application software architecture, and geo-redundancy capabilities must also be supported.
Outside of software, the underlying hardware must also be designed to be resilient and highly available. Hardware redundancy is one way to achieve this.
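As a rough illustration of why both layering and redundancy matter, the sketch below applies two textbook formulas: layers that all must work multiply their availabilities, while redundant replicas fail only if every replica fails. It assumes failures are independent, which real systems only approximate, and the function names and sample figures are hypothetical.

```python
from math import prod
from typing import Iterable


def series_availability(layers: Iterable[float]) -> float:
    """Availability of a stack in which every layer must be up."""
    return prod(layers)


def redundant_availability(unit_availability: float, replicas: int) -> float:
    """Availability of N interchangeable replicas, assuming independent failures."""
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** replicas


if __name__ == "__main__":
    # Hypothetical figures: hardware, platform, and application at 99.9% each.
    print(f"stack of three layers: {series_availability([0.999] * 3):.6f}")
    # One 99% server versus two or three redundant ones.
    for n in (1, 2, 3):
        print(f"{n} replica(s) at 99%: {redundant_availability(0.99, n):.6f}")
```

The stack is always less available than its weakest layer, so each layer typically needs redundancy of its own to bring the end-to-end figure back up.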
High availability and the telco cloud
All components of the cloud architecture must be designed with a policy of high availability.
High availability is essential for customers, especially in rainy-day cases such as an abrupt node reboot or a power outage. A highly available cluster is robust enough to keep working normally in such circumstances, and it can be recovered afterward. High availability in a Kubernetes system must be offered at the following levels:
- Kubernetes cluster high availability
- etcd cluster high availability
- keepalived floating IP, to avoid a single point of failure
- Local registry redundancy
- Chart repository redundancy
- Persistent storage redundancy
- Service discovery high availability
In a cloud based on Kubernetes, redundancy of the controller managers and the Kubernetes nodes is built into the system. Each of the three controller nodes includes the controller manager, scheduler, and proxy. Redundancy is achieved among the three servers.
The nodes on which workloads run are also designed with redundancy. In addition, in a robust and secure architecture, there are often worker nodes designated as interfaces to the network so that only they can interface with the outside world. These nodes, often referred to as edge nodes, are redundant with each other and provide a highly available interface between the outside world and the internal network.
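One reason control planes are commonly built with three nodes is that etcd uses the Raft consensus algorithm, which needs a majority of members to agree before it accepts a write. The following sketch shows that arithmetic. The function names are illustrative and are not part of any etcd or Kubernetes API.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of members that can fail while a majority remains."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for size in range(1, 8):
        print(
            f"{size} members: quorum {quorum_size(size)}, "
            f"tolerates {tolerated_failures(size)} failure(s)"
        )
```

A three-member cluster survives one failure, and adding a fourth member does not raise that number, which is why odd cluster sizes are generally preferred.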
Self-healing environments
When an issue occurs in a specific area of a system, it can often be mitigated by a self-healing environment. This can include:
- Process-level healing: Process supervision triggers correction in the event of a failure.
- Container-level healing: A restart policy is defined, and liveness probes detect failures. When needed, the container is restarted.
- Pod-level healing: If a host becomes unavailable, the affected pods are rescheduled and restarted on another host.
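The container-level case can be sketched as a small supervision loop: a probe is polled repeatedly, and the container is restarted after several consecutive failures. This is a simplified simulation, similar in spirit to the failure threshold of a Kubernetes liveness probe, and not how the kubelet is actually implemented. The `Container` class, the `supervise` function, and the default threshold of 3 are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Container:
    """Hypothetical stand-in for a running container."""
    name: str
    restarts: int = 0


def supervise(
    probe: Callable[[], bool],
    container: Container,
    checks: int,
    failure_threshold: int = 3,
) -> int:
    """Poll `probe` `checks` times and count a restart after
    `failure_threshold` consecutive failures. Returns the restart count."""
    consecutive_failures = 0
    for _ in range(checks):
        if probe():
            consecutive_failures = 0  # a healthy check resets the streak
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            container.restarts += 1  # stand-in for a real restart action
            consecutive_failures = 0
    return container.restarts
```

Requiring consecutive failures rather than a single one keeps a brief hiccup from triggering an unnecessary restart, at the cost of a slightly slower reaction to a real failure.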
Disasters can happen
Business continuity plans must account for potential disasters, ranging from loss of data to a natural disaster that wipes out an entire local system. A good telco operator has a business continuity plan and rehearses it periodically. The business continuity plan addresses all aspects of recovery, from the moment a disaster is declared, through bringing up a geo-redundant site, until the original site can be restored. The recovery plan spans from the bring-up of hardware through the cloud, until all applications hosted by the cloud are brought up and continuity of operations is confirmed from an alternate site.
The strength of a recovery plan depends on proper data retention, meaning periodic backups that allow restoration to specific points in time. It also depends on how long the system takes to recover.
As part of a disaster recovery program, a platform must also be able to recover from disaster based on the backups that have been performed. The customer can define a cadence of backup snapshots and, in the event of a disaster, roll back to the relevant point in time.
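Rolling back to "the relevant point in time" usually means choosing the newest snapshot taken at or before the moment you want to return to. The sketch below shows that selection under the assumption that snapshots are identified only by their timestamps. The function name and the sample six-hour cadence are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional, Sequence


def snapshot_for(target: datetime, snapshots: Sequence[datetime]) -> Optional[datetime]:
    """Return the newest snapshot taken at or before `target`, or None
    if every snapshot is newer than `target`."""
    ordered = sorted(snapshots)
    index = bisect_right(ordered, target)  # count of snapshots <= target
    return ordered[index - 1] if index else None


if __name__ == "__main__":
    # Hypothetical six-hour snapshot cadence.
    taken = [datetime(2024, 1, 1, hour) for hour in (0, 6, 12, 18)]
    corruption_detected = datetime(2024, 1, 1, 7, 30)
    print(snapshot_for(corruption_detected, taken))
```

The gap between the chosen snapshot and the target time is the data that cannot be recovered, so the snapshot cadence effectively sets the worst-case data loss.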
A backup plan includes successful snapshots of all releases of all applications in the cloud, and a backup of underlying infrastructure and configurations. Restoring the platform from data corruption or other logical errors may not require re-installation and may not result in downtime for unaffected elements, depending on the recovery strategy of the applications and the hardware infrastructure.
Aiming for zero downtime
The goal of near-zero downtime is achievable with a well-designed, highly available architecture. In the event of failure in one component, there's a smooth transition to a redundant component, so the customer can restore the failed component while service continues.
Availability of a geo-redundant site contributes to high availability because, in the event of a disaster, customers can fail over to a geo-redundant site. A well-defined recovery plan means the original site can be reinstated after hardware has been restored.
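The failover and reinstatement described above can be reduced to a simple rule: serve from the highest-priority site that is currently healthy. The following is a minimal sketch of that rule only. The site names and the `active_site` function are hypothetical, and a real failover also involves data replication and traffic redirection, which are not shown.

```python
from typing import Mapping, Optional, Sequence


def active_site(priority: Sequence[str], healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first healthy site in priority order, or None if all are down.
    Sites missing from `healthy` are treated as unhealthy."""
    for site in priority:
        if healthy.get(site, False):
            return site
    return None


if __name__ == "__main__":
    order = ["original-site", "geo-redundant-site"]
    # Disaster at the original site: traffic moves to the geo-redundant site.
    print(active_site(order, {"original-site": False, "geo-redundant-site": True}))
    # Original site restored: it is reinstated as the active site.
    print(active_site(order, {"original-site": True, "geo-redundant-site": True}))
```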
For more information about Red Hat's telco services, visit our Telco industry page.
About the author
With over two decades of experience in the telco world, in positions ranging from software engineer and system engineer to marketing and product management, Amy has a broad perspective on where the wind blows in the telco world. She has grown with the industry from legacy systems, through virtualization, and to the cloud. In the past few years, Amy has developed a keen interest in security in the real world. She has lectured in different venues and across diverse fields. A curious person, she is always open to meeting new people and hearing new ideas.