Getting telco observability right with Red Hat

March 21, 2025 | 4-minute read | Observability

Senior Principal Product Manager

In the introductory article of this series, we discussed fault management and performance management (FM/PM) and why it is a critical domain for telco operators. An important part of FM/PM is observability, because you can't effectively manage what you can't see.

Observability needs a single pane of glass covering the entire network and its services, including the hardware and software components. The challenge is integrating diverse monitoring, telemetry, and alerting components into one interface that can provide real-time insights across the entire telecom infrastructure. In this article, we look at some of the requirements telco operators most often ask for when evaluating observability and monitoring solutions for their networks.

Real-time metrics feedback loop

For a 5G telecom service provider network to function optimally, events and metrics must be correlated in a timely manner across the different layers and protocols in the telco network, including RAN, core, and transport entities. This correlation depends on how quickly faults are propagated to the management systems (local and remote) and how fast the response and feedback propagate back to the origin.

A good observability solution completes this propagation loop within the prescribed time budget. A real-time (or at least near-real-time) response is essential for the proper functioning and performance of a 5G network. Any degradation in the network must be noticed quickly by the operator monitoring it.

Real-time metrics feedback loops are the backbone of modern observability and resilience strategies, particularly in industries like telecom, where rapid adjustments are key to maintaining service quality.
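To make the idea of a time budget concrete, here is a minimal sketch that measures one trip around the loop (fault raised, fault received by management, corrective action applied) and compares it with a budget. The `LoopTiming` type, its field names, and the timestamps and budgets are hypothetical and chosen only for illustration; real budgets depend on the operator and the service.

```python
from dataclasses import dataclass


@dataclass
class LoopTiming:
    """Timestamps, in seconds, for one fault travelling around the feedback loop."""
    fault_raised: float      # fault detected at the network function
    received_by_mgmt: float  # fault arrives at the management system
    action_applied: float    # corrective feedback takes effect at the origin


def loop_latency(timing: LoopTiming) -> float:
    """Total time from the fault being raised to the feedback being applied."""
    return timing.action_applied - timing.fault_raised


def within_budget(timing: LoopTiming, budget_s: float) -> bool:
    """True if the whole loop completed within the given time budget."""
    return loop_latency(timing) <= budget_s


# Hypothetical values: 0.25 s to reach management, 0.5 s more for the response.
timing = LoopTiming(fault_raised=100.0, received_by_mgmt=100.25, action_applied=100.75)
print(loop_latency(timing), within_budget(timing, budget_s=1.0))
```

Tracking the intermediate timestamp as well makes it possible to see whether the outbound propagation or the return path consumes most of the budget.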

On-prem long-term log storage

Telco networks generate massive amounts of real-time data, such as call detail records (CDRs), packet flows, and signaling data. Collecting, storing, and processing this data in real time to identify and respond to issues is a challenge. Storing logs on-premises for the long term also enables a comprehensive view by correlating logs, metrics, and traces in one cohesive interface, which helps with root cause analysis and faster troubleshooting. In addition, regulations in many jurisdictions require telcos to retain logs for months, if not years.

Logging helps with network optimization and SLA monitoring. It enriches data for AI/ML models for predictive analysis and anomaly detection. Last but not least, logs play an important role in troubleshooting and finding patterns for rare issues.

Identifying the data retention requirements of the various cloud applications helps in choosing the right storage solution, such as object storage like Ceph. Log management tools like ELK, Searchstack, and Loki need to be able to access, ingest, aggregate, and process logs, with indexing, compression, and encryption as required. Telcos must treat long-term log storage as both a regulatory requirement and a strategic asset for better observability, security, and optimization.

Metrics-based autoscaling

Horizontal pod autoscaling using custom metrics allows scaling pods based on metrics beyond the default CPU or memory usage, such as application-specific metrics like number of sessions, data throughput, or business metrics.

Consider this use case of dynamic scaling in a 5G network:

Metrics collected: CPU and memory usage, throughput, and network-specific traffic patterns
Analysis: If traffic exceeds 80% capacity on a specific slice, the system predicts a potential SLA breach
Feedback action: Automatically spin up additional virtualized RAN nodes to handle the load

Such use cases need custom metrics to be observed by the telco cloud. Metric collectors, like Prometheus, can be configured to scrape custom metrics data.
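As a rough illustration of how a custom metric drives scaling, the sketch below applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric value / target metric value), with no change when the ratio is within a tolerance of 1. The metric (sessions per pod), the replica limits, and the numbers are hypothetical, and the real autoscaler has additional behavior (such as stabilization windows) that this sketch leaves out.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Replica count suggested by the HPA-style ratio rule for one metric."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    # Within the tolerance band around 1.0, leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical custom metric: average active sessions per pod, with a target of 800.
print(desired_replicas(current_replicas=4, current_value=1200, target_value=800))
```

With four pods averaging 1200 sessions against a target of 800, the rule asks for six pods; a reading of 840 stays inside the tolerance band and leaves the count at four.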

Variety of management interfaces

A typical telco service provider has its own customized (usually hierarchical) management interfaces. Management components expect incoming metrics, alerts, or logs to be in a specific format, and through specific protocols and interfaces.

The challenge for cloud metric servers and their clients is to keep their implementation generic enough to cater to various management interfaces, yet easy to customize to each telco operator's needs. Retention times, secure interfaces, packet sizes, and protocol versions are some of the tunable parameters needed to support these requirements.
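One way to picture "generic, but easy to customize" is a single internal metric representation with pluggable formatters, one per management interface. The sketch below is an assumption-laden illustration: the `Metric` type, the `FORMATTERS` registry, and the `export` function are hypothetical names, and the two output formats (JSON and a line in the style of the Prometheus text exposition format, without label-value escaping) stand in for whatever formats a given operator's management layer expects.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Metric:
    """Generic internal representation of one metric sample."""
    name: str
    value: float
    labels: dict = field(default_factory=dict)


def to_json(metric: Metric) -> str:
    """Serialize the metric as a JSON object with stable key order."""
    return json.dumps({"name": metric.name, "value": metric.value,
                       "labels": metric.labels}, sort_keys=True)


def to_prometheus_text(metric: Metric) -> str:
    """Render the metric as a Prometheus-style text line (no label escaping)."""
    labels = ",".join(f'{key}="{val}"' for key, val in sorted(metric.labels.items()))
    if labels:
        return f"{metric.name}{{{labels}}} {metric.value}"
    return f"{metric.name} {metric.value}"


# Hypothetical registry: adding a management interface means adding a formatter.
FORMATTERS = {"json": to_json, "prometheus": to_prometheus_text}


def export(metric: Metric, fmt: str) -> str:
    """Format a metric for the named management interface."""
    if fmt not in FORMATTERS:
        raise ValueError(f"unsupported format: {fmt}")
    return FORMATTERS[fmt](metric)


sample = Metric(name="active_sessions", value=12.5, labels={"slice": "embb"})
print(export(sample, "prometheus"))
print(export(sample, "json"))
```

Tunables such as protocol version or maximum packet size could be passed to each formatter in the same way, keeping the collection side unchanged.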

Regulatory compliance audit support

Telecom observability and monitoring solutions must align with a variety of regulatory compliance standards to ensure proper auditing, data integrity, and lawful data management. These compliance standards aim to protect sensitive information, enable lawful interception, and provide sufficient transparency to meet legal, business, and operational requirements.

Some common requirements for observability systems include:

Enable detailed logging of all intercepted communications
Support secure and auditable access pathways for authorized entities
Ensure the confidentiality, integrity, and availability of monitoring data
Maintain audit logs that track system access, modifications, and operational changes

Telco networks generate massive amounts of observability data, making real-time compliance challenging.
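To illustrate the integrity requirement for audit logs, here is a minimal sketch of one common technique, hash chaining, in which each entry stores the hash of the previous one so that a later modification becomes detectable. This is only an illustration and not a description of any specific product or compliance standard; the entry fields (`actor`, `action`) and function names are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list, actor: str, action: str) -> None:
    """Append an audit entry that is chained to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    body = {"actor": actor, "action": action, "prev_hash": prev_hash}
    log.append({**body, "hash": _digest(body)})


def verify(log: list) -> bool:
    """Recompute every hash and check that the chain links are intact."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {key: entry[key] for key in ("actor", "action", "prev_hash")}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, "operator-a", "login")
append_entry(audit_log, "operator-a", "changed alert threshold")
print(verify(audit_log))
```

Editing or deleting an earlier entry breaks the chain, so `verify` returns False for a tampered log.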

Granularity in parameters

The granularity of parameters in a telco monitoring solution significantly impacts its effectiveness, efficiency, and the value it provides. In this context, "granularity" refers to the level of detail or resolution at which data is collected, processed, and analyzed. Choosing the right level of granularity is critical for addressing the unique demands of a telco network.

Detailed metrics, such as per-second packet latency or per-user session throughput, offer deep visibility into network performance. This is useful for diagnosing specific issues, such as jitter in VoIP calls or individual user experience. Aggregated metrics provide a high-level overview, like average network latency per hour. These are suitable for identifying long-term trends but may overlook transient issues.
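The trade-off can be seen with a small worked example. The sketch below builds one hour of hypothetical per-second latency samples that include a five-second spike, then compares the hourly average with what the per-second data reveals. The sample values and the 100 ms threshold are invented for illustration.

```python
from statistics import mean


def summarize(samples_ms: list, threshold_ms: float) -> dict:
    """Compare the aggregated view (mean) with the fine-grained view (max, breaches)."""
    return {
        "mean_ms": mean(samples_ms),
        "max_ms": max(samples_ms),
        "seconds_over_threshold": sum(1 for s in samples_ms if s > threshold_ms),
    }


# Hypothetical hour of per-second latency: steady 10 ms with a 5-second spike to 600 ms.
per_second_latency_ms = [10.0] * 3595 + [600.0] * 5

summary = summarize(per_second_latency_ms, threshold_ms=100.0)
print(summary)
```

The hourly mean stays close to 11 ms, well under the threshold, while the per-second data shows five seconds at 600 ms, which is the kind of transient issue an aggregated metric can hide.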

Observability in telco

Observability is a requirement for telco networks, to ensure compliance and quality of service. Getting observability right, however, isn't easy. Red Hat serves the telco industry, and provides solutions that meet and exceed expectations. For more information about our telco services, visit our telco industry page.

About the author

Deepak Sreenivas

Senior Principal Product Manager

Deepak has worked at Red Hat since 2023 as a Product Manager for cloud telco platforms. Before that, he was with Nokia and Ericsson, working in software development and solution architecture for radio and core network products. His recent interests are telco observability and the AI/ML technology and tooling that support it.