How to implement observability in your IT architecture
IT has crossed a complexity threshold where additional automation and monitoring are needed. This new environment requires a new way of thinking about visibility, so many organizations are embracing the concept of observability.
[ Learn how IT modernization can help alleviate technical debt. ]
Observability is a way to keep track of system and application health and performance in the cloud-native age in order to keep those systems and applications up and running. According to Gartner:
"Observability is the evolution of monitoring into a process that offers insight into digital business applications, speeds innovation, and enhances customer experience."
Or, as Charity Majors, Liz Fong-Jones, and George Miranda explain in Observability Engineering:
"Put simply, our definition of 'observability' for software systems is a measure of how well you can understand and explain any state your system can get into, no matter how novel or bizarre... It is about how people interact with and try to understand their complex systems."
The three core pillars of observability are traces, metrics, and logs. In this article, I'll take a deeper look into the three pillars and how they can make IT architecture more effective. All three provide different types of data, delivering valuable insights into system and application health.
Pillar 1: Tracing
Distributed tracing is an essential method of tracking the performance of application requests, most commonly to monitor microservices.
A distributed trace is a record of a single service call corresponding to a request from an individual user. The trace starts with an initial span, called a "parent span." The request also triggers downstream subcalls to other services, generating a tree structure of multiple "child" spans. Together, these spans represent the entire path of the request, including all the subcalls, enabling developers, site reliability engineers (SREs), IT ops, and DevOps users to identify performance issues.
Visualizing and analyzing these traces is essential to determine system behavior. For that, trace data is sent to a third-party solution such as an application performance management (APM) tool or an open source distributed tracing tool (for example, Jaeger).
The trace should include the following metadata about each request:
- The instance of the service or application that is being called
- The container runtime it was running on
- The invoked business method
- The performance of the specific request
- The overall results of each of the above actions
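To make the span metadata above concrete, here is a minimal sketch in plain Python (not a real tracing library); the field names are illustrative assumptions rather than a standard schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    """One unit of work in a distributed trace."""
    trace_id: str              # shared by every span in the same request
    span_id: str
    parent_id: Optional[str]   # None for the root ("parent") span
    service: str               # instance of the service being called
    runtime: str               # container runtime it was running on
    method: str                # invoked business method
    duration_ms: float         # performance of this specific request
    status: str = "OK"         # overall result of the call
    children: List["Span"] = field(default_factory=list)

# A root span with one downstream subcall forms a tree of spans:
root = Span("t1", "s1", None, "checkout", "cri-o", "placeOrder", 42.0)
child = Span("t1", "s2", "s1", "payments", "cri-o", "charge", 18.5)
root.children.append(child)
```

The child span shares the root's `trace_id` and points back to it via `parent_id`, which is what lets a tracing backend reassemble the full request path.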
One of the top challenges in tracing is data sampling. You cannot collect a trace of every single transaction; this would result in too much data because your application can receive millions of requests. So the bottom line with tracing is that you must carefully decide what to trace.
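A common compromise is head-based probabilistic sampling: decide once, at the root span, whether to keep a trace, retaining only a fixed fraction of requests. A sketch, assuming a deterministic hash of the trace ID so every service makes the same keep/drop decision (the sampling rate is an arbitrary choice):

```python
import hashlib

SAMPLE_RATE = 0.01  # trace roughly 1 in 100 requests (arbitrary choice)

def should_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Head-based sampling: hash the trace ID so all services in the
    request path make the same decision for the same trace."""
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (digest % 10_000) < rate * 10_000
```

Because the decision depends only on the trace ID, downstream services need no coordination to agree on whether a given request is being traced.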
[ Check out Red Hat Portfolio Architecture Center for a wide variety of reference architectures you can use. ]
Pillar 2: Metrics
Metrics are the measurement of a specific activity over a specified interval of time. They are used to gain insight into the performance of a system or application.
A metric typically consists of a timestamp, name, value, and dimensions. The dimensions are a set of key-value pairs that describe additional metadata about the metric.
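As a sketch, a single metric data point with those four parts might look like the following (the field and dimension names are illustrative assumptions):

```python
import time

# One metric data point: timestamp, name, value, and dimensions.
metric = {
    "timestamp": time.time(),
    "name": "http_requests_total",
    "value": 1027,
    "dimensions": {          # key-value metadata about the measurement
        "method": "GET",
        "status": "200",
        "region": "us-east-1",
    },
}
```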
Examples of metrics include:
- System metrics: Such as CPU, memory, and disk usage
- Infrastructure metrics: For example, data from cloud providers
- Application metrics: Including APM or error tracking
- User and web tracking scripts: Such as data from web analytics
- Business metrics: Like customer sign-ups
Prometheus is the most commonly used metrics-based open source monitoring solution.
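Prometheus scrapes metrics over HTTP in a plain-text exposition format, where dimensions appear as labels in curly braces. A sketch of rendering one such sample line without any client library:

```python
def exposition_line(name: str, labels: dict, value: float) -> str:
    """Render one sample in Prometheus' text exposition format,
    e.g. http_requests_total{code="200",method="get"} 1027"""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = exposition_line("http_requests_total",
                       {"method": "get", "code": "200"}, 1027)
```

In practice you would use an official Prometheus client library, which also handles registration, concurrency, and the `/metrics` HTTP endpoint.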
Pillar 3: Logging
Logs are records of events that happen at discrete points in time on specific systems. The data is stored in a log file, which typically takes one of three formats: plain text, structured, or binary.
Logs can monitor system and application health and provide a historical record to support troubleshooting, such as during a system outage or application downtime.
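Of the three formats, structured logs (usually JSON) are the easiest for observability tooling to parse and index. A minimal sketch using only Python's standard library (the field names are an illustrative choice):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed")
```

Each line is now machine-parseable, so a log aggregator can filter and correlate on fields such as `level` or `logger` instead of grepping free text.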
Examples of logs include:
- System and server logs: Such as syslog
- Firewall and network system logs
- Application logs: For example, Log4j
- Platform and web server logs: Including Apache, Nginx, and databases
Observability is different from APM
Although it has grown out of the APM market, observability is more than just APM with a new name and marketing approach. The most crucial factor differentiating observability from APM is that observability includes three distinct monitoring approaches—tracing, metrics, and logs—while APM provides tracing alone. By collecting and aggregating these various types of data from multiple sources, observability offers a much broader view of the overall system and application health and performance, with the ability to gain much deeper insights into potential performance issues.
Another important distinction is that open source tools are the foundation of observability, but not APM. While some APM vendors have recently open-sourced the client side of their stack, the server side of all the popular commercial APM solutions is still proprietary.
These distinctions do not mean that observability and APM are unconnected. Application performance management can still be an important component of an observability implementation.
[ Related reading: How we designed observability for a hybrid cloud platform ]
Why is observability important?
You might be wondering how and when to use observability. Is it required, and how should you consider it when planning your architecture? The following may help you answer these questions.
The complexity of cloud-native applications is the main challenge driving the adoption of observability. Cloud architectures are strikingly different from traditional on-premises datacenter and virtualization-based architectures.
Modern applications typically have more services and components, with more complex topology and a much faster pace of change. Distributed architectures are larger, more complicated, and constantly changing due to factors such as short-lived containers, autoscaling, dynamic adjustment of cluster size, and function-as-a-service. Observability meets the need to make orchestrated systems such as Kubernetes more comprehensible.
[ Related reading: How to architect distributed scalable 5G with observability ]
Another challenge driving the need for observability is the unpredictable nature of cloud-native environments. Increasingly, all systems are unpredictable because of the many moving parts. These new technologies with new behaviors can cause unforeseen problems.
In an extensive system, something can always be broken; a perfectly running system is no longer a realistic expectation. Observability provides a high-level view with the ability to drill down and see what is happening in the application.
The number of applications and components in a microservices architecture is high, and consequently, there is a growing amount of relevant data being collected by monitoring tools—much more data than can possibly be handled by traditional monitoring systems. Observability helps collect, aggregate, and analyze data to understand the multitude of transactions between services and deliver a comprehensive view of system and application health.
In the past, system administrators managed system and application performance. Today, there are a variety of roles that all need access to information about the health of the IT environment, including developers, DevOps teams, and SREs. Observability provides a broader view of the current state of the environment to meet the disparate needs of this expanding audience of IT professionals.
Observability meets a need for more open source tools to monitor system and application performance. The market for traditional monitoring, like APM, has been dominated by proprietary vendors. However, as cloud-native environments become more complex, it becomes increasingly difficult and economically impractical for commercial proprietary monitoring vendors to cover the many needs of diverse organizations. The market has hit an inflection point, and it is becoming more efficient for vendors to collaborate on an open source core and differentiate themselves with features further up the stack.
In addition, IT organizations are moving toward open source rather than placing blocks of proprietary code into their systems.
The Cloud Native Computing Foundation (CNCF) has helped incubate several open source observability projects. Among the first of these was Prometheus, an extremely successful and popular metrics-based monitoring tool.
Choosing an observability system
You have a range of options when selecting an observability system. You can purchase everything from a vendor and send all your data to a third party, or on the other end of the spectrum, you can build, deploy, and run your own observability system.
Embarking on an observability initiative requires you to make several choices. You need to consider how to architect systems to be observable and also how to architect the observability system itself.
You can also choose between manual and automatic instrumentation. The recommended approach is automatic instrumentation, using, for example, a Java agent or framework support to instrument your code for you. Instrumenting manually provides more control, but it can be unwieldy and error-prone.
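To illustrate the manual end of the spectrum, instrumenting a function by hand can be as simple as a timing decorator that reports each call to a collector; an agent or framework integration would add equivalent hooks automatically. A sketch (the `record` callback and function names are hypothetical):

```python
import functools
import time

def timed(record):
    """Manually instrument a function: report its name and duration
    (in milliseconds) to the supplied record callback."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                record(fn.__name__, (time.perf_counter() - start) * 1000)
        return wrapper
    return decorator

samples = []

@timed(lambda name, ms: samples.append((name, ms)))
def place_order():
    time.sleep(0.01)

place_order()
```

Every function you want measured needs its own decorator, which is exactly the kind of repetitive, easy-to-miss work that automatic instrumentation eliminates.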
No matter which route you choose, look for the following capabilities in an observability system:
The most important consideration when implementing an observability system is to send data out of the observed system. One of the basic architectural rules of observability is that you must separate the solution from the system it monitors so that you can still monitor it, even during a system outage. You can have probes within the monitored system, but the data collected by the probes must be exported to a standalone monitoring solution.
It's important to understand the data collected and how it connects to the entire architecture. Key performance indicators (KPIs) can tell you how your architecture is performing. For example, what happens when a new service is released, and what impact does it have on your architecture and your ability to do business? These are important pieces of information, driven by the data generated throughout the architecture. Observability is not only about collecting the data. You need to be able to correlate and analyze the data from multiple sources to provide deep insight into system health.
An observability system must be stable and robust, with minimal changes. Because the applications it monitors behave unpredictably, the observability system itself needs exemplary architecture and practices to ensure continuity and stability.
Make appropriate choices for a stable system. For example, perhaps a Kubernetes cluster hosts your applications. If you host your observability stack in the same environment, what happens when it goes down? Will you lose your ability to observe, analyze, and investigate the issues?
Simplicity and automation
An observability solution should be as close to off-the-shelf as possible. You can customize it within the configuration, but not with bespoke coding on top of the observability system. If you make too many tweaks, you risk destabilizing the observability system itself. Automate as many aspects of your observability solution as possible and avoid manual intervention whenever you can.
A hybrid cloud approach requires a container platform you can match to workload requirements to provide greater choice, agility, and adaptability. Despite these advantages, managing a hybrid cloud still has its headaches. Keeping pace with application services and their underlying resources in a highly complex deployment can be a real challenge.
[ The Red Hat Hybrid Cloud Console allows you to monitor and optimize your technology landscape through a single lens, including public and private clouds, on-premises and at the edge. ]
Navigate the shifting technology landscape. Read An architect's guide to multicloud infrastructure.