Kuryr-Kubernetes, often shortened to Kuryr, is a Container Network Interface (CNI) plug-in that aims to provide connectivity for OpenShift Pods using OpenStack Neutron. CNI is the Cloud Native Computing Foundation standard that defines how container networking plug-ins should behave. Kuryr also implements OpenShift Services using Octavia load balancers created on the OpenStack cloud that hosts the OpenShift cluster. As a result, a single Software-Defined Networking layer serves both OpenStack and OpenShift, which eliminates the double encapsulation that occurs when other CNI plug-ins are used.
During OpenShift 4.9 planning, the Kuryr team realized that many usability and debuggability headaches haunt users running OpenShift clusters with Kuryr. Because those deployments are heavily affected by any issue in the underlying OpenStack cloud, it is critical that administrators can easily tell whether a problem they see comes from Kuryr or from OpenStack. Those conclusions led us to make improving the user experience our goal for 4.9. This blog post lists the enhancements we implemented to that end.
Kuryr logs are noisy because every OpenShift Pod or Service creation triggers a number of synchronized operations that both kuryr-controller and kuryr-cni have to perform. In 4.9, we focused on making the logs more readable, which meant addressing several categories of problems.
Getting rid of useless messages
We have audited the logs for messages that convey very little information to the user. For example, in 4.9 you will no longer see every /metrics call logged at INFO level. That kind of logging noise is gone.
It was common for users to include false-positive WARNING log messages when reporting a problem with a completely different root cause. This decreases users' trust in the product (warnings and tracebacks are scary) and increases support complexity (you have to scroll through harmless messages to find the real problem). In 4.9 we have removed as many of these false positives as possible. This includes increasing the keystoneauth connection pool size to remove the "WARNING urllib3.connectionpool [-] Connection pool is full, discarding connection" messages that showed up when Kuryr was making a large number of OpenStack calls. It also includes making sure that kuryr-cni containers can be stopped gracefully without printing a ton of tracebacks about broken connections.
Clearly communicating reasons for failures
As Kuryr depends heavily on the OpenStack cloud it runs on, it is critical that it does everything possible to help the administrator pinpoint the root cause of a problem. In 4.9 we paid special attention to improving how such issues are communicated in the logs. For example, Kuryr pods now clearly state the reason for a failing liveness or readiness probe, so the user does not need to scroll through the logs to find the original reason for a container restart.
Sometimes what looks like a Kuryr problem has its root cause in misbehavior of OpenStack services that Kuryr cannot do anything about. An example is an Octavia load balancer stuck in the PENDING_UPDATE state. If that happens, there is nothing Kuryr can do, as such LBs are immutable; that is, the Octavia API will return 409 Conflict to any request made. The only solution is manual intervention by the administrator in the Octavia DB. To make sure that such problems are directed to the correct team or administrator, Kuryr now prints clear log messages when it encounters a critical problem with OpenStack that it cannot fix on its own:
Loadbalancer 192d5391-35b9-4b69-a37f-8a58b5a3c085 is stuck in PENDING_UPDATE status for several minutes. This is unexpected and indicates problem with OpenStack Octavia. Please contact your OpenStack administrator.
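To illustrate the kind of check that leads to such a message, here is a minimal sketch in Python. It is not Kuryr's actual code: the function name, its parameters, and the way the "pending since" timestamp is tracked are all hypothetical. The 8-minute threshold is the one mentioned later in this post for the related metrics.

```python
from datetime import datetime, timedelta, timezone

# Assumption: reuse the 8-minute threshold this post mentions for the metrics.
STUCK_AFTER = timedelta(minutes=8)


def stuck_message(lb_id, status, pending_since, now):
    """Return a log message if the load balancer looks stuck, else None.

    Hypothetical helper: `pending_since` is when the load balancer was
    first seen in its current PENDING_* status.
    """
    if not status.startswith('PENDING_'):
        return None  # Not in a transitional state, nothing to report.
    if now - pending_since < STUCK_AFTER:
        return None  # Still within the time we are willing to wait.
    return (
        f'Loadbalancer {lb_id} is stuck in {status} status for several '
        'minutes. This is unexpected and indicates problem with OpenStack '
        'Octavia. Please contact your OpenStack administrator.'
    )


if __name__ == '__main__':
    now = datetime(2021, 10, 1, 12, 0, tzinfo=timezone.utc)
    print(stuck_message('192d5391-35b9-4b69-a37f-8a58b5a3c085',
                        'PENDING_UPDATE', now - timedelta(minutes=10), now))
```

The point of the sketch is that the message is only produced after a generous wait, so a load balancer that is merely slow does not trigger it.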
Alerts on critical failures
The next improvement is closely related to the issues described in the previous section. In addition to making sure that the log messages shown on unrecoverable issues are clear, we decided to raise a Prometheus alert when they happen, mainly because such problems may require manual intervention by the OpenStack administrators. This means that in 4.9, Kuryr has two new metrics:
# Incremented when an Octavia load balancer is stuck in an unexpected state.
self.load_balancer_readiness = prometheus_client.Counter(
    'kuryr_load_balancer_readiness', 'This counter is increased when '
    'Kuryr notices that an Octavia load balancer is stuck in an '
    'unexpected state', registry=self.registry)
# Incremented when a Neutron port does not become ACTIVE in time.
self.port_readiness = prometheus_client.Counter(
    'kuryr_port_readiness', 'This counter is increased when Kuryr '
    'times out waiting for Neutron to move port to ACTIVE',
    registry=self.registry)
These counters are incremented each time Kuryr encounters an OpenStack resource that has been stuck in an unexpected state for more than 8 minutes. This applies to Octavia load balancers in a PENDING_* state and to Neutron subports that remain DOWN.
Each time one of these counters is incremented, the corresponding Prometheus alert, KuryrLoadBalancerNotReady or KuryrPortNotReady, is raised for 20 minutes, clearly signaling a problem with the cluster. Once the issue is resolved, the alert clears itself within 20 minutes.
It is worth noting that the Octavia and Neutron problems described above are being worked on by the OpenStack team, and our tests indicate that OSP 16.1.7 will be free of the issues causing them.
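The 20-minute behavior is easier to see with a small simulation. The sketch below assumes the alert fires whenever the counter has grown within a trailing 20-minute window; this is an assumption about how such an alert could be evaluated, not the actual alerting rule shipped with Kuryr, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: the alert looks at a trailing 20-minute window.
WINDOW = timedelta(minutes=20)


def alert_firing(increment_times, now, window=WINDOW):
    """Return True if the counter was incremented within the trailing window."""
    return any(now - window < t <= now for t in increment_times)


if __name__ == '__main__':
    t0 = datetime(2021, 10, 1, 12, 0)
    increments = [t0]  # Kuryr noticed one stuck resource at 12:00.
    for minutes in (5, 19, 25):
        now = t0 + timedelta(minutes=minutes)
        print(minutes, alert_firing(increments, now))
```

Under this assumption, a single increment keeps the alert active for 20 minutes, and if no further increments arrive, the alert clears on its own once the last increment falls out of the window.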
Monitoring of critical OpenStack resources
Some of the OpenStack resources Kuryr creates are critical for the OpenShift cluster to function. We decided to put effort into monitoring them so that alerts can be raised quickly if something goes wrong. In this iteration, the resources that got our special attention are the Octavia load balancers created for the OpenShift API and DNS. In 4.9, Kuryr actively monitors those LBs and raises a Prometheus alert when any of them has a low number of members or is in a state that indicates an issue. We hope this will improve the resilience of OpenShift on OpenStack running with Kuryr.
These are not all of the general UX improvements that we have on our plate. One of the main enhancements planned for future releases is the use of Kubernetes Events, which will allow us to communicate the state of OpenStack resources directly on the OpenShift resources they represent. This means that a simple oc describe svc <name> will give you an overview of the state of the Octavia load balancer being created for that Service, as well as information about any problems Kuryr may have had when creating it. Our aim is to make sure that running an OpenShift cluster with Kuryr gives users the performance benefit of no double encapsulation without introducing any additional support complexity compared to other CNI plug-ins.
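As a rough illustration of what such a check could look like, here is a minimal sketch in Python. It is not Kuryr's implementation: the LoadBalancerView class, the find_problems function, and the minimum member count are hypothetical, and the sketch only looks at the Octavia provisioning status and the number of members.

```python
from dataclasses import dataclass


@dataclass
class LoadBalancerView:
    """Hypothetical, simplified view of an Octavia load balancer."""
    name: str
    provisioning_status: str  # For example ACTIVE, PENDING_UPDATE or ERROR.
    member_count: int


def find_problems(lb, min_members=2):
    """Return a list of human-readable problems found for the load balancer.

    Assumption: `min_members` is an illustrative threshold, not a value
    taken from Kuryr.
    """
    problems = []
    if lb.provisioning_status != 'ACTIVE':
        problems.append(
            f'{lb.name}: provisioning status is {lb.provisioning_status}')
    if lb.member_count < min_members:
        problems.append(
            f'{lb.name}: only {lb.member_count} member(s), '
            f'expected at least {min_members}')
    return problems


if __name__ == '__main__':
    for lb in (LoadBalancerView('api', 'ACTIVE', 3),
               LoadBalancerView('dns', 'PENDING_UPDATE', 1)):
        print(lb.name, find_problems(lb))
```

A monitor built along these lines would raise an alert whenever the returned list is non-empty for one of the critical load balancers.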
About the author
Michał Dulko is one of the engineers working on kuryr-kubernetes in OpenShift with the OpenStack team. Dulko has been involved in the OpenStack community since the Juno release and served as Cinder and Kuryr core reviewer and Kuryr PTL throughout the journey. His professional interests are HA solutions, cluster management and building reliable distributed systems.