How do you know if something bad is happening in your cluster? How do you know that a node is down, an application isn’t responding, or the storage backing a PVC has “disappeared”? If your answer to any of those is “when the users tell us there’s an error”, then it may be time to reevaluate your monitoring and alerting strategy.
Fortunately, OpenShift has built-in tools for doing just this. With only a small amount of work you can ensure that you’re receiving the proper alerts and warnings so that you can, hopefully, avoid any sticky situations. This week we are joined by Brian Gottfried, from Red Hat Consulting, to focus on Alertmanager: how to configure it and how to tune its settings so that you get neither too many alerts nor too few.
As always, please see the list below for additional links to specific topics, questions, and supporting materials for the episode!
If you’re interested in more streaming content, please subscribe to the OpenShift.tv streaming calendar to see the upcoming episode topics and to receive any schedule changes. If you have questions or topic suggestions for the Ask an OpenShift Admin Office Hour, please contact us via Discord, Twitter, or come join us live, Wednesdays at 11am EDT / 1500 UTC, on YouTube and Twitch.
Episode 31 recorded stream:
Use this link to jump directly to where we start talking about today’s topic.
Supporting links for today:
- A question from Twitter about disconnected installs and using an ImageContentSourcePolicy (ICSP). While the ICSP is necessary to map images from their original, connected locations to the new, disconnected ones, there are some things on the roadmap to make disconnected installs a better overall experience. We also talked about disconnected installs in episode 13 if you want more information.
- Another Twitter-inspired topic this week: load balancers for OpenShift. There are a number of options available for load balancing OpenShift API and Ingress traffic; the best one is the one that works for you!
Questions answered during the stream:
- Can the timezone be set for the cluster? Unfortunately not, but you can track the RFE here.
- What are the architecture and components of the OpenShift Monitoring service? The architecture diagram used can be found in the docs here. There are multiple components, including Prometheus for data export and collection, Thanos for aggregation and reduction, and, when using Advanced Cluster Management, Observatorium for historical views.
- Alert fatigue is real and you should be careful of it when configuring your system!
- How do I enable user workload monitoring? It’s done by setting enableUserWorkload: true in the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace; see the docs here.
- What role does Thanos play? It’s important for aggregating metrics across multiple Prometheus instances, for example when user workload monitoring is enabled in the cluster.
- What persistent storage should I use for long-term data retention? Brian answers this during the stream, and the docs explain how to configure persistent storage.
- Is user workload monitoring with Istio and mTLS going to be supported? This is most likely because you cannot modify alerts and monitoring in the system namespaces.
- Is there a way to get individual container metrics from a Pod with multiple containers? You would need to configure each container to expose a different metrics endpoint, then configure ServiceMonitors for each of them.
- How do I configure Alertmanager to send to Mattermost? You’ll need to use a webhook, either via a community plugin for Alertmanager or one you create yourself.
- Is it possible to configure per-namespace alert addresses? Yes, this should be possible using labels. There’s an example here.
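For the user workload monitoring question above, here is a minimal sketch of the ConfigMap involved. It assumes the ConfigMap name (cluster-monitoring-config) and key (enableUserWorkload) used by recent OpenShift 4 releases, so check the docs for your version. The Python function name is ours, purely for illustration.

```python
import json


def user_workload_monitoring_configmap() -> dict:
    """Build a ConfigMap manifest that turns on user workload monitoring.

    Assumption: the cluster reads monitoring settings from the
    "config.yaml" key of cluster-monitoring-config in openshift-monitoring.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "cluster-monitoring-config",
            "namespace": "openshift-monitoring",
        },
        # The value is itself a small YAML document stored as a string.
        "data": {"config.yaml": "enableUserWorkload: true\n"},
    }


if __name__ == "__main__":
    # JSON is accepted by `oc apply -f -`, so no YAML library is needed.
    print(json.dumps(user_workload_monitoring_configmap(), indent=2))
```

If the script is saved as, say, enable_uwm.py, the output can be piped to `oc apply -f -`. Be careful if the ConfigMap already exists in your cluster: applying this manifest as-is would replace any other settings already stored under config.yaml.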
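For the multi-container metrics question above, this is a rough sketch of what "a ServiceMonitor per container" could look like. It assumes each container's metrics port is exposed through a Service under its own port name. The app label, port names, and resource names are hypothetical.

```python
import json


def service_monitor(name: str, namespace: str, app_label: str,
                    port_name: str, path: str = "/metrics") -> dict:
    """Build a ServiceMonitor that scrapes one named Service port."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Selects the Service (not the Pod) by label.
            "selector": {"matchLabels": {"app": app_label}},
            # "port" refers to the port *name* defined on the Service.
            "endpoints": [{"port": port_name, "path": path, "interval": "30s"}],
        },
    }


if __name__ == "__main__":
    # Hypothetical Pod with two containers, each with its own metrics port.
    for port in ("web-metrics", "sidecar-metrics"):
        manifest = service_monitor(f"my-app-{port}", "my-project", "my-app", port)
        print(json.dumps(manifest, indent=2))
```

Each printed manifest targets a single port, so the resulting metrics can be told apart by endpoint.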
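For the Mattermost question above, if you go the "create your own" route, the core of the relay is translating the Alertmanager webhook payload into the JSON body that a Mattermost incoming webhook expects. The sketch below shows only that translation plus a simple POST; the function names are ours, the webhook URL is whatever Mattermost generates for you, and a real relay would wrap this in an HTTP handler with error handling.

```python
import json
import urllib.request


def to_mattermost(payload: dict) -> dict:
    """Turn an Alertmanager webhook payload into a Mattermost message body."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unknown alert")
        namespace = labels.get("namespace", "-")
        # Alerts differ in which annotation carries the human-readable text.
        detail = (annotations.get("summary")
                  or annotations.get("message")
                  or annotations.get("description", ""))
        lines.append(
            f"**[{status}] {name}** (namespace: {namespace}) {detail}".rstrip()
        )
    return {"text": "\n".join(lines) or "Alertmanager notification with no alerts"}


def post_to_mattermost(webhook_url: str, payload: dict) -> None:
    """POST the translated message to a Mattermost incoming webhook URL."""
    body = json.dumps(to_mattermost(payload)).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
```

Keeping the translation in its own function makes it easy to test the message format without a Mattermost server.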
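For the per-namespace question above, label-based routing boils down to one child route per namespace, each matching on the namespace label and pointing at its own receiver. Here is a small sketch that generates such a routing tree. The receiver and namespace names are hypothetical, and it uses the older match syntax; newer Alertmanager releases also offer matchers.

```python
import json


def namespace_routes(receivers_by_namespace: dict, default_receiver: str) -> dict:
    """Build an Alertmanager `route` section keyed on the namespace label."""
    return {
        # Alerts that match no child route fall through to this receiver.
        "receiver": default_receiver,
        "routes": [
            {"match": {"namespace": namespace}, "receiver": receiver}
            for namespace, receiver in sorted(receivers_by_namespace.items())
        ],
    }


if __name__ == "__main__":
    route = namespace_routes(
        {"team-a": "team-a-email", "team-b": "team-b-email"},
        default_receiver="default",
    )
    print(json.dumps({"route": route}, indent=2))
```

Each receiver named here still has to be defined in the receivers section of the Alertmanager configuration, with the email address or webhook for that team.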