How do you know if something bad is happening in your cluster? How do you know that a node is down, an application isn’t responding, or the storage backing a PVC has “disappeared”? If your answer to any of those is “when the users tell us there’s an error”, then it may be time to reevaluate your monitoring and alerting strategy.
Fortunately, OpenShift has built-in tools for doing just this. With only a small amount of work you can ensure that you’re receiving the proper alerts and warnings so that you can, hopefully, avoid any sticky situations. This week we are joined by Brian Gottfried, from Red Hat Consulting, to focus on Alertmanager: how to configure it and how to customize the settings to avoid both too many alerts and too few.
As always, please see the list below for additional links to specific topics, questions, and supporting materials for the episode!
If you’re interested in more streaming content, please subscribe to the OpenShift.tv streaming calendar to see the upcoming episode topics and to receive any schedule changes. If you have questions or topic suggestions for the Ask an OpenShift Admin Office Hour, please contact us via Discord, Twitter, or come join us live, Wednesdays at 11am EDT / 1500 UTC, on YouTube and Twitch.
Episode 31 recorded stream:
Use this link to jump directly to where we start talking about today’s topic.
Supporting links for today:
- A question from Twitter about disconnected installs and using an ImageContentSourcePolicy (ICSP). While the ICSP is necessary to map image locations from their original, connected registries to their new disconnected ones, there are some things on the roadmap to make disconnected a better overall experience. We also talked about disconnected installs in episode 13 if you want more information.
- Another Twitter-inspired topic this week: load balancers for OpenShift. There are a number of options available for load balancing OpenShift API and Ingress traffic; the best one is the one that works for you!
Questions answered during the stream:
- Can the timezone be set for the cluster? Unfortunately not, but you can track the RFE here.
- What is the architecture and components of the OpenShift Monitoring service? The architecture diagram used can be found in the docs here. There are multiple components, including Prometheus for metrics export and collection, Thanos for aggregation and deduplication, and, when using Advanced Cluster Management, Observatorium for historical views.
- Alert fatigue is real and you should be careful of it when configuring your system!
- How do I enable user workload monitoring? It’s done by setting enableUserWorkload: true in the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace, see the docs here.
- What role does Thanos play? It’s important for aggregating metrics across multiple Prometheus instances, for example when user workload monitoring is enabled in the cluster.
- What persistent storage should I use for long term data retention? Brian answers this during the stream; the docs explain how to configure persistent storage.
- Is user workload monitoring with Istio and mTLS going to be supported? Most likely not, because you cannot modify alerts and monitoring in the system namespaces.
- Is there a way to get individual container metrics from a Pod with multiple containers? You would need to configure each container to expose a different metrics endpoint, then configure ServiceMonitors for each of them.
- How do I configure Alertmanager to send to Mattermost? You’ll need to use a webhook via a community plugin for Alertmanager or create your own.
- Is it possible to configure per-namespace alert addresses? Yes, this should be possible using labels. There’s an example here.
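To make the user workload monitoring answer above concrete: per the OpenShift documentation, enabling it comes down to a single ConfigMap in the openshift-monitoring namespace. A minimal sketch:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Enables the separate user workload monitoring stack
    # (its own Prometheus instances, aggregated via Thanos)
    enableUserWorkload: true
```

Once applied, the monitoring components for user-defined projects are deployed automatically, and your workloads can be scraped via ServiceMonitor and PodMonitor resources in their own namespaces.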
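For the multi-container question above, the pattern is one scrape endpoint per container, tied together in a single ServiceMonitor. The following sketch is illustrative; the names (myapp, app-metrics, sidecar-metrics) are hypothetical placeholders for your own Service port names:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp                  # hypothetical name
  namespace: myapp-namespace   # a user namespace with workload monitoring enabled
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    # One endpoint entry per container; each "port" refers to a named
    # port on the Service, which maps to that container's metrics port.
    - port: app-metrics
      path: /metrics
    - port: sidecar-metrics
      path: /metrics
```

Each container must expose its own metrics port for this to work, since Prometheus scrapes ports, not containers.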
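The last two answers, routing alerts per namespace and sending to Mattermost via a webhook, both come down to the Alertmanager configuration. A minimal sketch, with hypothetical names and a placeholder URL (a Mattermost incoming webhook expects a different payload than Alertmanager sends, which is why a community bridge or your own adapter is needed in between):

```yaml
route:
  receiver: default
  routes:
    # Alerts carrying this namespace label go to the team's receiver;
    # everything else falls through to the default.
    - match:
        namespace: team-a        # hypothetical namespace
      receiver: team-a-webhook
receivers:
  - name: default
  - name: team-a-webhook
    webhook_configs:
      # Placeholder: point this at your Alertmanager-to-Mattermost
      # bridge, not directly at the Mattermost incoming webhook.
      - url: "https://alertmanager-bridge.example.com/team-a"
```

On OpenShift this configuration lives in the alertmanager-main secret in the openshift-monitoring namespace, and can also be edited through the web console.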