This blog looks at troubleshooting a problematic OpenShift Container Platform update.
If you're having trouble with an OpenShift update, start by inspecting the Cluster Version Operator (CVO), which automates the update process. You can check the CVO to see how the update is progressing and how individual components are responding to the update request.
Gathering data on the cluster version is the first step. You can do this with the following commands:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        False         76m     Cluster version is 4.12.0
$ oc get clusterversion -o yaml
Run oc get clusterversion -o yaml multiple times during the update process. Update status messages will be displayed in the YAML output.
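If you prefer not to scan the full YAML, you can also pull out just the ClusterVersion conditions, which carry the update status messages. The jsonpath below is one possible way to format them, shown as a sketch:
# print the type, status, and message of each ClusterVersion condition
$ oc get clusterversion version -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'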
Inspecting common update issues
There are several potential causes for an update to be problematic. This article covers some of the common issues and a few resources to remember while debugging.
For instance, look at the status of the nodes in your cluster by running the following command:
$ oc get nodes
NAME                        STATUS   ROLES                  AGE   VERSION
ip-1-128-45.ec2.internal    Ready    control-plane,master   97m   v1.25.4+77bec7a
ip-1-150-124.ec2.internal   Ready    control-plane,master   97m   v1.25.4+77bec7a
ip-1-167-126.ec2.internal   Ready    worker                 90m   v1.25.4+77bec7a
ip-1-186-97.ec2.internal    Ready    worker                 90m   v1.25.4+77bec7a
ip-1-212-120.ec2.internal   Ready    control-plane,master   97m   v1.25.4+77bec7a
Sometimes nodes are not Ready, which can be due to a lack of resources on a node or a problem with kube-proxy or kubelet. These can all cause issues during an update.
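If a node is stuck in a NotReady state, its conditions and kubelet logs are usually the quickest way to find out why. For example (substitute your own node name):
# show node conditions, resource pressure, and recent events
$ oc describe node <node>
# pull the kubelet logs from that node
$ oc adm node-logs <node> -u kubelet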
Check the Cluster Operator status and make sure all Operators are available. If they are not, check the logs of the operator pods to debug the issue further. Sometimes an Operator is degraded, but that does not necessarily mean the update has failed or is not working. Run oc get clusteroperator to see the status of the Cluster Operators. In the example below, the etcd operator is degraded:
$ oc get clusteroperator
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console                                    4.12.0    True        False         False      86m
control-plane-machine-set                  4.12.0    True        False         False      96m
dns                                        4.12.0    True        False         False      97m
etcd                                       4.12.0    False       False         True       25m     <----
image-registry                             4.12.0    True        False         False      91m
kube-apiserver                             4.12.0    True        False         False      86m
kube-controller-manager                    4.12.0    True        False         False      94m
kube-scheduler                             4.12.0    True        False         False      94m
kube-storage-version-migrator              4.12.0    True        False         False      98m
machine-api                                4.12.0    True        False         False      93m
machine-approver                           4.12.0    True        False         False      97m
machine-config                             4.12.0    True        False         False      96m
network                                    4.12.0    True        False         False      100m
node-tuning                                4.12.0    True        False         False      97m
openshift-apiserver                        4.12.0    True        False         False      93m
openshift-controller-manager               4.12.0    True        False         False      93m
operator-lifecycle-manager                 4.12.0    True        False         False      97m
operator-lifecycle-manager-catalog         4.12.0    True        False         False      98m
operator-lifecycle-manager-packageserver   4.12.0    True        False         False      92m
service-ca                                 4.12.0    True        False         False      98m
storage                                    4.12.0    True        False         False      97m
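When an Operator is degraded, its conditions and the logs of its operator pod usually explain why. The commands below sketch this for the etcd example; the namespace and deployment name are specific to etcd and will differ for other Operators:
# show the Degraded condition messages for the Operator
$ oc describe clusteroperator etcd
# check the operator's own pods and read their logs for the underlying error
$ oc get pods -n openshift-etcd-operator
$ oc logs -n openshift-etcd-operator deployment/etcd-operator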
You can also check if any MachineConfigPools (MCP) are paused. An MCP associates MachineConfigs with Nodes. Pausing an MCP prevents the MachineConfigOperator (MCO) from updating the nodes associated with the MCP.
$ oc get mcp <pool-id> -o jsonpath='{.spec.paused}'
$ oc get mcp <pool-id> -o yaml | yq '.spec.paused'
Sometimes, the MCP is paused for an EUS-to-EUS update or a canary update. If neither is underway, unpause the MCP with the following command:
$ oc patch mcp <pool-id> -p '{"spec": {"paused": false}}' --type=merge
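After unpausing, you can watch the pool to confirm that the Machine Config Operator resumes updating its nodes:
# UPDATING should turn True while nodes roll, then UPDATED returns to True
$ oc get mcp <pool-id> -w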
The maxUnavailable parameter is another MachineConfigPool setting that must be configured properly. It defines the maximum number or percentage of machines in the pool that can be unavailable at the same time. When a MachineConfig change needs to be applied across the cluster, it is rolled out incrementally to avoid service disruptions, and some machines become temporarily unavailable while the new configuration is applied. The maxUnavailable parameter caps how many, so that enough machines remain available to handle the cluster's workload.
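As an illustration only (the worker pool name and the value 2 below are assumptions, not recommendations), you can inspect and adjust the setting like this:
# check the current maxUnavailable (empty output means the default of 1)
$ oc get mcp worker -o jsonpath='{.spec.maxUnavailable}'
# example only: allow two machines in the worker pool to be unavailable at once
$ oc patch mcp worker -p '{"spec": {"maxUnavailable": 2}}' --type=merge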
Check whether the cluster uses Pod Disruption Budgets (PDBs). A PDB specifies the minimum number of replicas that must remain up at a time. If PDBs are used in the cluster, make sure they do not prevent pods from draining. Look at the minAvailable and maxUnavailable values; for example, a PDB with minAvailable=1 on an application that runs only one replica will block the pod from draining.
$ oc get pdb
$ oc get pdb -n <namespace>
# list all PDBs with minAvailable >= 1
$ oc get pdb -A -o json | jq -r '.items[] | select(.spec.minAvailable >= 1) | [.metadata.name, .metadata.namespace, .spec.minAvailable]'
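Another way to spot a blocking budget, assuming you only care about PDBs that currently allow zero disruptions, is to filter on the disruptionsAllowed status field:
# list PDBs that currently allow no disruptions (these will block node drains)
$ oc get pdb -A -o json | jq -r '.items[] | select(.status.disruptionsAllowed == 0) | [.metadata.namespace, .metadata.name] | @tsv'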
If pods take longer to drain from the nodes, the update can take longer, or updates may be paused until the disruption is resolved.
Refer to the OpenShift documentation to calculate an approximate update time.
Use the following commands to identify the nodes that are taking longer to drain. If possible, you can force-drain the node.
# find the node which is taking longer to drain
$ oc get co machine-config -o yaml
# see which nodes are cordoned
$ oc get nodes
# find the machine-config-daemon which is running on that cordoned node
$ oc get pods -n openshift-machine-config-operator -o wide
# look in that container's logs to check why the drain could be stalled - like unable to unmount a PVC or an issue with a PDB
$ oc logs machine-config-daemon-zhd4l -c machine-config-daemon -n openshift-machine-config-operator
# check which nodes have high CPU and memory use
$ oc adm top nodes
# find the pods running on those nodes
$ oc get pods -o wide | grep <node>
# check the logs for the node
# if it is allowed, then force drain the node.
Cordon the node to prevent new workloads from being scheduled on it. Drain the node, and then uncordon it.
$ oc adm cordon <node>
$ oc adm drain <node> --grace-period=20 --ignore-daemonsets --force=true --delete-emptydir-data --timeout=60s
$ oc adm uncordon <node>
Checking for Conditional updates
Since OpenShift 4.10, Conditional updates provide guidance on update paths that are not recommended. Suppose you want to update to a new release like 4.12.102 to fix a security CVE. If 4.12.102 is not recommended for your cluster, then check whether that version is a Conditional update. Use the following commands to do this:
$ oc adm upgrade
Cluster version is 4.12.0
Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12 (available channels: candidate-4.12, candidate-4.13, eus-4.12, fast-4.12, fast-4.13, stable-4.12)
Recommended updates: <----- will also include 'Recommended=True' CONDITIONAL UPDATES
VERSION IMAGE
4.12.19 quay.io/openshift-release-dev/ocp-release@sha256:41fd42cc8b9f86fc86cc8763dcf27e976299ff632a336d393b8e643bd8a5f967
4.12.18 quay.io/openshift-release-dev/ocp-release@sha256:8465e416a403cec2e6887c8aebe783b976f46f81d513890f17037b652b143de5
4.12.17 quay.io/openshift-release-dev/ocp-release@sha256:7ca5f8aa44bbc537c5a985a523d87365eab3f6e72abc50b7be4caae741e093f4
.... quay.io/openshift-release-dev/ocp-release@sha256:db976910d909373b1136261a5479ed18ec08c93971285ff760ce75c6217d3943
4.12.9 quay.io/openshift-release-dev/ocp-release@sha256:96bf74ce789ccb22391deea98e0c5050c41b67cc17defbb38089d32226dba0b8
4.12.8 quay.io/openshift-release-dev/ocp-release@sha256:28358de024c01a449b28f27fb4c122f15eb292a2becdf7c651511785c867884a
Additional updates which are not recommended based on your cluster configuration are available, to view those re-run the command with --include-not-recommended.
Each Conditional update version is listed with a short description of the bug that makes it not recommended. The cluster admin needs to decide whether the bug is minor enough to proceed with the update.
$ oc adm upgrade --include-not-recommended
Cluster version is 4.12.0
Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12 (available channels: candidate-4.12, candidate-4.13, eus-4.12, fast-4.12, fast-4.13, stable-4.12)
Recommended updates:
VERSION IMAGE
4.12.19 quay.io/openshift-release-dev/ocp-release@sha256:41fd42cc8b9f86fc86cc8763dcf27e976299ff632a336d393b8e643bd8a5f967
4.12.18 quay.io/openshift-release-dev/ocp-release@sha256:8465e416a403cec2e6887c8aebe783b976f46f81d513890f17037b652b143de5
......
4.12.8 quay.io/openshift-release-dev/ocp-release@sha256:28358de024c01a449b28f27fb4c122f15eb292a2becdf7c651511785c867884a
Supported but not recommended updates: <----- not recommended CONDITIONAL UPDATES
Version: 4.12.102
Image: quay.io/openshift-release-dev/ocp-release@sha256:800d1e39d145664975a3bb7cbc6e674fbf78e3c45b5dde9ff2c5a11a8690c87b
Recommended: False
Reason: LeakedMachineConfigBlocksMCO
Message: Machine Config Operator stalls when encountering orphaned KubeletConfig or ContainerRuntimeConfig resources. https://issues.redhat.com/browse/OCPNODE-11502
Version: 4.12.5
Image: quay.io/openshift-release-dev/ocp-release@sha256:fd65cebce150bac3c622e30e7f762d3173575ae3541b3a7648819cb63e9b63a4
Recommended: False
Reason: LeakedMachineConfigBlocksMCO
Message: Machine Config Operator stalls when encountering orphaned KubeletConfig or ContainerRuntimeConfig resources. https://issues.redhat.com/browse/OCPNODE-1502
Note: conditional update versions are fully supported.
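If you review the linked issue and decide the risk is acceptable, the update must be started explicitly, because the CVO will not follow a not-recommended path on its own. Using the hypothetical 4.12.102 release from this example:
# explicitly accept a conditional (not recommended) update target
$ oc adm upgrade --to=4.12.102 --allow-not-recommended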
It is also possible that the problem is unrelated to the update itself and is caused by something else, like a hardware or firmware incompatibility. In this case, the cluster admin can seek help from Red Hat Support by opening a support case. It is helpful to provide debugging information about your cluster when you open the case.
The oc adm must-gather CLI command collects the information from your cluster that is needed for debugging issues into a local directory. Create a compressed file from the must-gather directory and attach it to your support case on the Red Hat Customer Portal. You can find more information on using the oc adm must-gather command in the OpenShift documentation.
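A minimal sequence for collecting and packaging the data looks like this; the destination directory is just an example:
# collect debugging data from the cluster into a local directory
$ oc adm must-gather --dest-dir=/tmp/must-gather
# compress the directory before attaching it to the support case
$ tar czf must-gather.tar.gz /tmp/must-gather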
About the author
Subin Modeel is a principal technical product manager at Red Hat.