This is the second part of our series of blogs on the OpenShift sandboxed containers Operator. The first part is available from the following link. In this blog, we want to show what you can do when things go wrong. An OpenShift cluster is a complex system, and many pieces need to work together. Sometimes things get out of order, and a little help is needed to get everything back into the right order. We have chosen the most common problems and describe what those mean and how they can be solved.
Kata Runtime Installation and Uninstallation on OpenShift Nodes
Here is a quick recap of the Kata Containers runtime deployment workflow, which will be useful for troubleshooting.
The installation workflow gets triggered by the creation of KataConfig CR. Alternatively, uninstallation workflow gets triggered by the deletion of KataConfig CR.
For each of the worker nodes in the MachineConfigPool (MCP), the MachineConfigOperator (MCO) performs a set of steps as described in the following diagram to install or remove the sandboxed-containers extension:
For detailed understanding of the install process including creation of MachineConfigPool(MCP) and MachineConfig (MC) refer to the previous blog in the series - https://cloud.redhat.com/blog/openshift-sandboxed-containers-operator-from-zero-to-hero-the-hard-way
Troubleshooting
“Logs” is your friend to troubleshoot issues with installation or uninstallation.
To understand the key logs that are required for troubleshooting, let’s consider the following diagram depicting the major components:
You can use the following table as a cheat sheet describing how to collect the required info for each of the components:
Component |
Commands |
Node |
Get the cluster nodes and current status (Ready, NotReady etc..) oc get nodes |
Sandbox Containers Operator |
Get the operator POD oc get pods -n openshift-sandboxed-containers-operator You can also use the following snippet to get the POD name POD_NAME=$(oc get pods -l control-plane=controller-manager -n openshift-sandboxed-containers-operator -o jsonpath="{.items[0].metadata.name}") View Logs oc logs $POD_NAME -n openshift-sandboxed-containers-operator |
KataConfig CR |
Get the KataConfig CR status oc describe kataconfigs <kataconfig-cr-name> You can also use the following snippet to get only the status: oc get kataconfigs <kataconfig-cr-name> -o jsonpath='{.status.*}' |
Machine Config Operator |
Get MachineConfigPool and current status oc get mcp Get MachineConfig and current status oc get mc View Logs Find out the machine config daemon pods running on each node. oc get pods -n openshift-machine-config-operator -o wide | grep machine-config-daemon You can also use the following snippet to get the machine-config-daemon and the corresponding nodes: oc get pods -l k8s-app=machine-config-daemon -n openshift-machine-config-operator -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName View logs of specific machine config daemon oc logs <pod-name> -n openshift-machine-config-operator -c machine-config-daemon |
CRI-O and Kata runtime logs |
Use ‘journalctl’ on specific node oc debug node/<node-name> ... To use host binaries, run `chroot /host` … sh-4.4# chroot /host sh-4.4# journalctl -u crio Use ‘oc adm’ command oc adm node-logs <node_name> -u crio If SSH is configured then you can SSH to the node and check the logs. For more details refer to: https://docs.openshift.com/container-platform/4.8/support/troubleshooting/troubleshooting-crio-issues.html You can also use OpenShift Logging to centrally collect logs from all the nodes and create dashboards in Kibana: https://docs.openshift.com/container-platform/4.8/logging/cluster-logging-visualizer.html |
Let’s go through a sample scenario and troubleshoot it using the above commands.
Scenario
You have created a KataConfig Custom Resource(CR) and are monitoring the progress of installation by watching the CR description:
$ oc describe kataconfig example-kataconfig
Spec:
Kata Config Pool Selector: <nil>
Status:
Installation Status:
Is In Progress: True
Completed:
Failed:
Inprogress:
Binaries Install Nodes List:
ip-10-0-140-61.us-east-2.compute.internal
Kata Image:
Prev Mcp Generation: 2
Runtime Class:
Total Nodes Count: 3
[..snip..]
You realize that the installation status is in-progress ( Is In Progress: True) for a long time and appears to be stuck for one node: ip-10-0-140-61.us-east-2.compute.internal under “Binaries Install Nodes List.”
Viewing the operator logs is a good first step to understand what is happening as part of the Kata runtime deployment process.
Get the Operator POD
$ oc get pods -n openshift-sandboxed-containers-operator
NAME READY STATUS RESTARTS AGE
sandboxed-containers-operator-controller-manager-65c57c9d5s7nkx 2/2 Running 0 16h
View the POD logs
$ oc logs sandboxed-containers-operator-controller-manager-65c57c9d5s7nkx -n openshift-sandboxed-containers-operator
2021-07-02T10:02:50.699Z INFO controllers.KataConfig Creating MachineConfig for Custom Resource
2021-07-02T10:02:50.699Z INFO controllers.KataConfig Getting MachineConfigPool Name
2021-07-02T10:02:50.699Z INFO controllers.KataConfig Getting MachineConfigPool Name
2021-07-02T10:02:50.699Z INFO controllers.KataConfig Getting MachineConfigPool Name
2021-07-02T10:02:50.699Z INFO controllers.KataConfig Waiting for MachineConfigPool to be fully updated
2021-07-02T10:03:05.704Z INFO controllers.KataConfig Reconciling KataConfig in OpenShift Cluster
2021-07-02T10:03:05.705Z INFO controllers.KataConfig Kata installation in progress
2021-07-02T10:03:05.705Z INFO controllers.KataConfig Getting MachineConfigPool Name
2021-07-02T10:03:05.705Z INFO controllers.KataConfig creating RHCOS extension MachineConfig
2021-07-02T10:03:05.705Z INFO controllers.KataConfig Creating MachineConfig for Custom Resource
2021-07-02T10:03:05.705Z INFO controllers.KataConfig Getting MachineConfigPool Name
2021-07-02T10:03:05.705Z INFO controllers.KataConfig Getting MachineConfigPool Name
2021-07-02T10:03:05.705Z INFO controllers.KataConfig Getting MachineConfigPool Name
2021-07-02T10:03:05.705Z INFO controllers.KataConfig Waiting for MachineConfigPool to be fully updated
As you can see from the logs, the operator has created the MachineConfig (MC) and is waiting for the MachineConfigPool (MCP) to be updated.
Viewing the MCP status confirms that the “worker” MCP is updating, as shown below:
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-26065a51be10defb9236a9e86726bf1e True False False 3 3 3 0 103m
worker rendered-worker-95ae3283c7dfd0ecdf2594414e1f083c False True False 3 0 0 0 103m
Let’s view the node status:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-134-242.us-east-2.compute.internal Ready master 109m v1.21.0-rc.0+1622f87
ip-10-0-140-61.us-east-2.compute.internal NotReady,SchedulingDisabled worker 102m v1.21.0-rc.0+1622f87
ip-10-0-178-80.us-east-2.compute.internal Ready worker 103m v1.21.0-rc.0+1622f87
ip-10-0-184-83.us-east-2.compute.internal Ready master 108m v1.21.0-rc.0+1622f87
ip-10-0-217-214.us-east-2.compute.internal Ready worker 101m v1.21.0-rc.0+1622f87
ip-10-0-242-199.us-east-2.compute.internal Ready master 109m v1.21.0-rc.0+1622f87
The node status for “ip-10-0-140-61.us-east-2.compute.internal” shows “NotReady.”
Let’s continue to observe the status for some time. If the node status doesn’t change to “Ready,” the MachineConfigPool “UPDATING” status will still be “True” and the KataConfig CR description will continue to show in-progress and appear to be stuck
A good first step is to look at the machine-config-daemon logs for the specific node (ip-10-0-140-61.us-east-2.compute.internal) to know the status of extension installation.
Get the machine-config-daemon POD for the specific node:
$ oc get pods -l k8s-app=machine-config-daemon -n openshift-machine-config-operator -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
NAME NODE
machine-config-daemon-5zzcf ip-10-0-134-242.us-east-2.compute.internal
machine-config-daemon-gb7s8 ip-10-0-140-61.us-east-2.compute.internal
machine-config-daemon-mmjjj ip-10-0-178-80.us-east-2.compute.internal
machine-config-daemon-rxjb8 ip-10-0-184-83.us-east-2.compute.internal
machine-config-daemon-txvpz ip-10-0-242-199.us-east-2.compute.internal
View the POD logs:
$ oc logs machine-config-daemon-gb7s8 -n openshift-machine-config-operator -c machine-config-daemon
[..snip..]
I0704 15:23:58.131569 4107 update.go:1099] Applying extensions : ["update" "--install" "kata-containers"]
I0704 15:23:58.131656 4107 rpm-ostree.go:258] Running captured: rpm-ostree update --install kata-containers
[..snip..]
The logs do not indicate any issue with package installation.
You will need to check the node for errors. Check the node console for any boot time errors. If the node is pingable and direct SSH is configured, then SSH to the node, and check for kubelet or crio errors. Resolve any errors and check the node status.
Once the node status changes to “Ready,” all the related objects will reflect the updated status.
For example, the KataConfig CR description will show the node in the completed list:
Status:
Installation Status:
Is In Progress: True
Completed:
Completed Nodes Count: 1
Completed Nodes List:
ip-10-0-140-61.us-east-2.compute.internal
Failed:
Inprogress:
Binaries Install Nodes List:
ip-10-0-178-80.us-east-2.compute.internal
Kata Image:
Prev Mcp Generation: 2
Runtime Class:
Total Nodes Count: 3
[..snip..]
Also, the MCP READYMACHINECOUNT and UPDATEDMACHINECOUNT status will get updated to reflect the same:
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-26065a51be10defb9236a9e86726bf1e True False False 3 3 3 0 108m
worker rendered-worker-95ae3283c7dfd0ecdf2594414e1f083c False True False 3 1 1 0 108m
Summary
In this last part of our blog series about the OpenShift sandboxed containers Operator, we showed how to troubleshoot the most common problems that can occur. We looked at one exemplary scenario and used a step-by-step process to show the root cause. In the next release of the Operator, we will provide an image that can be passed to the ‘oc adm must-gather’ command to collect even more information that will help solve all kinds of possible problems. We hope you have a great experience with the OpenShift sandboxed containers Operator and have no need to use what you have learned in this blog.
About the authors
Jens Freimann is a Software Engineering Manager at Red Hat with a focus on OpenShift sandboxed containers and Confidential Containers. He has been with Red Hat for more than six years, during which he has made contributions to low-level virtualization features in QEMU, KVM and virtio(-net). Freimann is passionate about Confidential Computing and has a keen interest in helping organizations implement the technology. Freimann has over 15 years of experience in the tech industry and has held various technical roles throughout his career.
More like this
Browse by channel
Automation
The latest on IT automation for tech, teams, and environments
Artificial intelligence
Updates on the platforms that free customers to run AI workloads anywhere
Open hybrid cloud
Explore how we build a more flexible future with hybrid cloud
Security
The latest on how we reduce risks across environments and technologies
Edge computing
Updates on the platforms that simplify operations at the edge
Infrastructure
The latest on the world’s leading enterprise Linux platform
Applications
Inside our solutions to the toughest application challenges
Original shows
Entertaining stories from the makers and leaders in enterprise tech
Products
- Red Hat Enterprise Linux
- Red Hat OpenShift
- Red Hat Ansible Automation Platform
- Cloud services
- See all products
Tools
- Training and certification
- My account
- Customer support
- Developer resources
- Find a partner
- Red Hat Ecosystem Catalog
- Red Hat value calculator
- Documentation
Try, buy, & sell
Communicate
About Red Hat
We’re the world’s leading provider of enterprise open source solutions—including Linux, cloud, container, and Kubernetes. We deliver hardened solutions that make it easier for enterprises to work across platforms and environments, from the core datacenter to the network edge.
Select a language
Red Hat legal and privacy links
- About Red Hat
- Jobs
- Events
- Locations
- Contact Red Hat
- Red Hat Blog
- Inclusion at Red Hat
- Cool Stuff Store
- Red Hat Summit