First, a big thank-you to Thomas Wiest and the OpenShift Online/Dedicated teams. Their input was key to this post.
Red Hat runs and operates hosted OpenShift solutions using the same codebase as OpenShift Container Platform. There are two hosted service offerings: OpenShift Online, which is a shared multi-tenant service, and OpenShift Dedicated, where we offer a dedicated hosted environment for single-tenant use.
All of this is monitored and managed with tooling that has been developed and deployed by Red Hat, and the approach has been refined through the experience of running OpenShift Online (v2) and now OpenShift Dedicated and Online (v3). This monitoring is currently based around Zabbix. NB: watch for future posts on the future of CloudForms and on the cm-ops project, which is underway to add alerting and thresholds to CloudForms as well.
All of the code can be found in the following git-repo:
https://github.com/openshift/openshift-tools
NOTE: Some of the Standard Operating Procedures (SOPs) referred to in the public git repo are not publicly visible. They live in a private git repo because they are specific to Red Hat Operations.
The tooling itself has all been designed to be deployed and operated as containers, and the complete installation has been automated using Ansible. This lets users stand up a stand-alone, single-machine deployment on which they can develop monitoring, metrics, and alerts, and test any changes. Once a new threshold or metric is developed, it can be pushed into the stg branch and then promoted to the int and then prod branches.
For more information about how to build an all-in-one branch please see:
https://github.com/openshift/openshift-tools/blob/stg/docs/local_development_monitoring.adoc
Today, this solution has been adopted by some customers who do not already have monitoring and alerting capabilities on premises, either via self-installation or by working with Red Hat Consultancy based on the open code. Please note that Red Hat does not officially support the code in the openshift-tools repository. This code is in active use and may break from time to time. Also, no effort is being made to keep it backwards-compatible.
The rest of this post is not going to explore the architectural decisions made, how monitoring works or components like the Zagg in detail, but will instead focus on alerts and thresholds.
Alerts and Thresholds
Alerts and thresholds can be reused by enterprises that already have an event and alert infrastructure in place. For example, many organisations already have tooling such as IBM Netcool, CA Unified Infrastructure Monitoring, Solarwinds, BMC TrueSight or one of the many other solutions out there.
If you use one of these tools, you will want to:
- Harvest the counters that need to be tracked.
- Harvest the thresholds that are being alerted.
- Harvest the metrics to collect.
All of these counters and thresholds are configured by Ansible as part of the openshift-tools monitoring installation, so let’s walk through a couple of examples. The configuration is all stored as Ansible variables for the playbooks, so I am looking at the following directory in the git repo:
https://github.com/openshift/openshift-tools/tree/stg/ansible/roles/os_zabbix/vars
First, let’s define a couple of terms that are used in the playbooks. In Zabbix, an item is referred to by its “key”. An item represents a piece of data we want to receive: a metric.
Also in Zabbix, a trigger is a logical expression that defines a problem threshold and is used to “evaluate” data received in items. The part that defines what to evaluate is called the trigger’s expression.
Based on this, we can look at a key from one of the Ansible variable files, for example:
$ cat template_docker.yml |grep docker.storage.data.space.percent_available
- key: docker.storage.data.space.percent_available
expression: "{Template Docker:docker.storage.data.space.percent_available.max(#2)}<5 or {Template Docker:docker.storage.data.space.available.max(#2)}<5" # < 5% or < 5GB
expression: "{Template Docker:docker.storage.data.space.percent_available.max(#2)}<10 or {Template Docker:docker.storage.data.space.available.max(#2)}<10" # < 10% or < 10GB
Here you can see that we have defined a key (item) and then two expressions (triggers). The triggers fire when the storage space available to the docker daemon’s graph driver on a node drops below 5% or 5GB, and below 10% or 10GB, respectively.
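If you want to carry a threshold like this over to another alerting tool, it helps to split the trigger expression into its parts. The following is a minimal Python sketch, not code from the openshift-tools repo. The `Clause` type and `parse_clauses` function are names made up for illustration, and the regular expression assumes the simple `{template:key.function(parameter)}<operator><constant>` shape seen in the example above rather than the full Zabbix expression grammar.

```python
import re
from typing import List, NamedTuple

# Assumption: each clause looks like {Template:item.key.func(param)}<op><number>,
# as in the template_docker.yml example. More complex expressions are not handled.
CLAUSE_RE = re.compile(
    r"\{(?P<template>[^:{}]+):(?P<key>[^{}]+)\.(?P<func>\w+)\((?P<param>[^)]*)\)\}"
    r"\s*(?P<op><=|>=|<>|<|>|=)\s*(?P<threshold>\d+(?:\.\d+)?)"
)


class Clause(NamedTuple):
    """One comparison inside a trigger expression (hypothetical helper type)."""
    template: str    # e.g. "Template Docker"
    key: str         # the item key being evaluated
    func: str        # e.g. "max"
    param: str       # e.g. "#2", meaning the last two collected values
    op: str          # comparison operator
    threshold: float


def parse_clauses(expression: str) -> List[Clause]:
    """Return every {template:key.func(param)} comparison found in the expression."""
    return [
        Clause(m["template"], m["key"], m["func"], m["param"], m["op"],
               float(m["threshold"]))
        for m in CLAUSE_RE.finditer(expression)
    ]


if __name__ == "__main__":
    expr = ("{Template Docker:docker.storage.data.space.percent_available.max(#2)}<5"
            " or {Template Docker:docker.storage.data.space.available.max(#2)}<5")
    for clause in parse_clauses(expr):
        print(clause.key, clause.func, clause.param, clause.op, clause.threshold)
```

The sketch ignores the `or`/`and` operators that join the clauses, so you would still need to record how the clauses are combined when recreating the alert in your own tool.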
You can pull out all of the expressions to see all of the thresholds that are currently alerted for on the OpenShift Online and Dedicated platforms with a simple bit of “grepping” once you have cloned the repo:
# git clone https://github.com/openshift/openshift-tools
# cd openshift-tools
# git show-branch
[stg] …
# cat ./ansible/roles/os_zabbix/vars/* |grep -B1 -A2 expression
………………<snip>...............
Using this method, you should be able to collect all of the expression thresholds that are alerted on in the platform and replicate them in your own OpenShift Container Platform deployment.
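If you would rather have the keys and expressions as data than as grep output, the same harvesting can be sketched in Python. This is an illustration under assumptions, not part of openshift-tools: `harvest` and `harvest_dir` are hypothetical names, the scan is a simple line-by-line match on `key:` and `expression:` rather than a real YAML parse, and it assumes the variable files end in `.yml` like `template_docker.yml`.

```python
import re
from pathlib import Path
from typing import Dict, List

# Matches lines such as "- key: some.item.key" or 'expression: "..." # comment'.
FIELD_RE = re.compile(r"^\s*(?:-\s*)?(?P<name>key|expression):\s*(?P<value>.+)$")


def clean_value(raw: str) -> str:
    """Strip surrounding quotes and any trailing comment from a field value."""
    raw = raw.strip()
    if raw[:1] in ('"', "'"):
        end = raw.find(raw[0], 1)
        if end != -1:
            return raw[1:end]
    # Unquoted value: treat " #" as the start of a trailing comment.
    return raw.split(" #", 1)[0].strip()


def harvest(text: str) -> Dict[str, List[str]]:
    """Collect item keys and trigger expressions from one file's text."""
    found: Dict[str, List[str]] = {"key": [], "expression": []}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            found[match["name"]].append(clean_value(match["value"]))
    return found


def harvest_dir(vars_dir: str) -> Dict[str, List[str]]:
    """Collect keys and expressions from every *.yml file in a directory."""
    found: Dict[str, List[str]] = {"key": [], "expression": []}
    for path in sorted(Path(vars_dir).glob("*.yml")):
        part = harvest(path.read_text())
        for name in found:
            found[name].extend(part[name])
    return found
```

From the root of the cloned repo, you could then call `harvest_dir("ansible/roles/os_zabbix/vars")` and feed the resulting expressions into whatever format your own alerting tool expects. For anything beyond a quick inventory, a proper YAML parser would be the safer choice.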
How the Metrics for the Keys (Items) are Generated
In the first example above, we looked at the docker daemon storage space and the alerts that fire when it drops below 5 or 10 percent/GB. To fire these alerts, we need to collect the raw metrics from the hosts. You can find the agents that collect these values in the same repo, this time in the following folder:
https://github.com/openshift/openshift-tools/blob/stg/scripts/monitoring
For the docker daemon metrics, you can see the following script, which is simply run from the crontab:
https://github.com/openshift/openshift-tools/blob/stg/scripts/monitoring/cron-send-docker-metrics.py
The script calls a utility script as below to collect the information and then posts this to the custom Zabbix Collector (which is beyond the scope of this blog post).
https://github.com/openshift/openshift-tools/blob/stg/openshift_tools/monitoring/dockerutil.py
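To make the link between the collected metrics and the triggers concrete, here is a minimal Python sketch of the idea. It is not the code in `cron-send-docker-metrics.py` or `dockerutil.py`: the function names are hypothetical, the inputs are plain byte counts rather than anything read from docker, and treating a GB as 1024**3 bytes is an assumption. Only the two item keys and the `max(#2)` comparison come from the trigger expression shown earlier.

```python
from typing import Dict, Sequence

GB = 1024 ** 3  # assumption: binary gigabytes; the real scripts may differ


def docker_storage_metrics(total_bytes: int, used_bytes: int) -> Dict[str, float]:
    """Build the two values the triggers evaluate, keyed by Zabbix item key."""
    available = total_bytes - used_bytes
    return {
        "docker.storage.data.space.percent_available": 100.0 * available / total_bytes,
        "docker.storage.data.space.available": available / GB,
    }


def max_of_last(history: Sequence[float], count: int) -> float:
    """Mimic Zabbix max(#count): the largest of the last `count` collected values."""
    return max(history[-count:])


def trigger_fires(percent_history: Sequence[float],
                  gb_history: Sequence[float],
                  threshold: float) -> bool:
    """Evaluate '{...percent_available.max(#2)}<N or {...available.max(#2)}<N'."""
    return (max_of_last(percent_history, 2) < threshold
            or max_of_last(gb_history, 2) < threshold)
```

Because the trigger uses `max(#2)`, a single low reading is not enough to fire it. Both of the last two collected values must be under the threshold, which filters out one-off dips.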
Summary
We have looked at the openshift-tools repo and the OpenShift platform-level monitoring, metrics and alerts that are configured and collected as part of the Red Hat OpenShift cloud services. Hopefully this will allow users who are looking to deploy OpenShift on premises to harvest this knowledge and plug it into their existing alerting and eventing systems.
Further reading
To complement the OpenShift platform monitoring, also make sure to check out application-level monitoring, either using tools such as Prometheus (the Fabric8 tooling has some great quick starts on this)
or from partners such as:
http://www.coscale.com/blog/openshift-monitoring
https://blog.openshift.com/appdynamics-integration-with-openshift/
https://newrelic.com/openshift
https://blog.openshift.com/openshift-ecosystem-using-sysdig-monitor-openshift
Plus many more.
Thanks for reading this far.
Chris