
Previously in The Quest for Operations Intelligence, the focus was on what log aggregation can deliver and how to improve on it. One conclusion was that full situational awareness in IT requires logs, metrics, configuration, and event information, correlated for easy one-stop analysis when problems arise.

While we covered logs, metrics, and configuration in depth, we left events undefined at the time. What are events, and what can we use them for in our quest for operations happiness?

Event happiness

Those most affected by this quest are the system administrators, who are on call when things go wrong in your infrastructure. When the call comes in the middle of the night, log aggregation and metrics can save precious time in finding the cause of the failure.

The question is, what's happened to bring the system administrator to his post in the deep dark of night?

System monitoring has discovered a failure in the infrastructure and generated an alert. The alert triggered an action, which sent a message asking the sysadmin to respond to the issue.

Depending on the organizational structure, responsibility for monitoring can sit in different places: with the IT ops team itself, a dedicated monitoring team, the security team, or possibly a cross-functional group. The core of this activity is to perform checks on as many parts of your infrastructure as possible.

These checks are the unit testing of your IT infrastructure.

They are often pieces of code or scripts that validate that a critical part of the infrastructure is working in a general sense (e.g., checking the HTTPD service status by opening TCP connections to ports 80 and 443). These checks can also be very specific, such as downloading the main web page of a server and checking that static objects match a previously recorded SHA-256 hash.
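As a rough illustration of the two kinds of check described above, here is a minimal Python sketch. The function names `tcp_port_open` and `content_matches` are hypothetical, as is the 3-second default timeout; downloading the static object is left out, so the hash check simply takes the bytes that were fetched.

```python
import hashlib
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """General check: can a TCP connection to host:port be established?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def content_matches(content: bytes, expected_sha256: str) -> bool:
    """Specific check: does the content hash to the previously recorded value?"""
    return hashlib.sha256(content).hexdigest() == expected_sha256.lower()
```

A monitoring script could call `tcp_port_open(host, 80)` and `tcp_port_open(host, 443)` for the general check, and `content_matches` for each static object whose hash was recorded earlier.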

These checks use metrics to verify that certain parameters stay within safe thresholds (e.g., flagging CPU usage above 90% for more than one minute), or they detect certain messages in the generated logs, such as any critical message or a specific message known to be the symptom of a coming outage.
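A sustained-threshold check and a log-message check might look like the following sketch. It assumes metric samples arrive as `(timestamp_in_seconds, value)` pairs in chronological order and that critical log lines contain the word `CRITICAL` or `FATAL`; both the data shape and the names `breached_for` and `critical_lines` are assumptions made for illustration.

```python
import re
from typing import Iterable, List, Tuple


def breached_for(samples: Iterable[Tuple[float, float]],
                 threshold: float = 90.0,
                 min_duration: float = 60.0) -> bool:
    """True if the value stayed above threshold for longer than min_duration seconds."""
    start = None  # timestamp at which the current breach began
    for ts, value in samples:
        if value > threshold:
            if start is None:
                start = ts
            if ts - start > min_duration:
                return True
        else:
            start = None  # the breach ended, so reset the timer
    return False


# Assumed severity keywords; real log formats vary.
CRITICAL = re.compile(r"\b(CRITICAL|FATAL)\b")


def critical_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that carry a critical severity keyword."""
    return [line for line in lines if CRITICAL.search(line)]
```

Requiring the breach to last longer than `min_duration` keeps a short CPU spike from paging anyone; any sample at or below the threshold resets the timer.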

Figure 1. Log aggregation enriching checks and monitoring.

Beyond monitoring

Monitoring hardly ever goes beyond checks and metrics gathering. Adding log aggregation complements it by supplying the context required to understand any deviation the checks detect. The process is shown in Figure 1.

What's the difference between an event, an alert, and the other components involved?

If we think it through, an event is almost anything that happens on the system, whether executed by a user or an administrator, scheduled, or run automatically. The important ones are the failed checks from monitoring. These require quick analysis to keep the system running (ITSM incident management) and, when possible, an attempt at auto-remediation.

At the same time, events are critical for triggering searches in metrics and logs in order to find the root cause, prepare a change, and apply it so the problem is resolved (ITIL problem management). The next step is to uncover patterns in metrics and logs so that incidents can be avoided.

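One way to picture an event triggering a search is to pull every log record in a time window around the event. The sketch below assumes records are `(datetime, message)` pairs already collected by the log aggregator; the function name `logs_around` and the default window of five minutes before and one minute after are hypothetical choices.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Record = Tuple[datetime, str]


def logs_around(event_time: datetime,
                records: Iterable[Record],
                before: timedelta = timedelta(minutes=5),
                after: timedelta = timedelta(minutes=1)) -> List[Record]:
    """Return the records whose timestamp falls inside the window around the event."""
    window_start = event_time - before
    window_end = event_time + after
    return [r for r in records if window_start <= r[0] <= window_end]
```

The window is wider before the event than after it because the symptoms of a failure usually precede the failed check that reports it.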
Another interesting event is the moment a new version of an app is deployed. The information revealed by metrics and logs can help uncover performance issues or errors before changes are fully rolled out (assuming A/B testing and/or canary testing).

Coupling CI/CD pipelines with IT Intelligence can provide great operational benefits. A similar use case is the moment a software update is applied to an operating system, IaaS, PaaS, CMP, or any other critical component of your infrastructure.

Finally, our job as system administrators is to keep our users happy. Being able to debug their requests with just the "Request Id" can bring some operations happiness to IT. Monitoring and events are the perfect match for log, metrics, and configuration aggregation, giving much better situational awareness while supercharging your operational intelligence.
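To make the "Request Id" idea concrete, here is a small sketch that collects every aggregated log line carrying a given ID. It assumes each line embeds the ID as `request_id=<value>`, which is an invented format; the function name `trace_request` and the source names are likewise hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed log convention: the ID appears as "request_id=<letters, digits, _ or ->".
REQUEST_ID = re.compile(r"request_id=(?P<rid>[\w-]+)")


def trace_request(request_id: str,
                  sources: Dict[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """Return (source_name, line) pairs for every line tagged with request_id."""
    hits = []
    for name, lines in sources.items():
        for line in lines:
            match = REQUEST_ID.search(line)
            if match and match.group("rid") == request_id:
                hits.append((name, line))
    return hits
```

Passing a dictionary such as `{"web": web_lines, "db": db_lines}` returns the matching lines from each source, so a single user request can be followed across components.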
