Image composed of map tiles created by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL. Map tiles are, from top to bottom: Boston, MA; Raleigh, NC; and Phoenix, AZ.

According to the OpenShift installation guide,

You can deploy an OpenShift Container Platform 4 cluster to both on-premise hardware and to cloud hosting services, but all of the machines in a cluster must be in the same datacenter or cloud hosting service.

Even within a single data center or cloud region, there are cases where an OpenShift cluster's nodes span multiple failure domains, whether those are power domains, like a subset of generator or UPS feeds in a data center, or a cloud provider's availability zones within a region. Kubernetes provides node selection and node affinity mechanisms that give applications the ability to span failure domains and keep running through a planned or unplanned outage.

Set up and label nodes

The test cluster for this demonstration includes six nodes, named wkr0, mcp0, wkr1, mcp1, wkr2, and mcp2. For illustration purposes only, we pretend these nodes span three separate data centers in three cities in two geographical regions:

$ oc label node wkr0 topology.kubernetes.io/region=us-east topology.kubernetes.io/zone=bos
$ oc label node mcp0 topology.kubernetes.io/region=us-east topology.kubernetes.io/zone=bos

$ oc label node wkr1 topology.kubernetes.io/region=us-east topology.kubernetes.io/zone=rdu
$ oc label node mcp1 topology.kubernetes.io/region=us-east topology.kubernetes.io/zone=rdu

$ oc label node wkr2 topology.kubernetes.io/region=us-mntn topology.kubernetes.io/zone=phx
$ oc label node mcp2 topology.kubernetes.io/region=us-mntn topology.kubernetes.io/zone=phx

Additionally, to demonstrate adding hardware hints to a subset of nodes, all wkr* nodes are labeled as having faster SSD storage available:

$ oc label nodes wkr0 wkr1 wkr2 disktype=ssd

Assign virtual machines to nodes

Virtual machines in OpenShift follow similar node selection and affinity criteria to Pods, with one notable exception: Pods can select a node by name using nodeName, but this is not implemented for VirtualMachine resources. There are a number of use cases for nodeSelector and affinity rules; we will start with one of the simplest.

NodeSelector

The nodeSelector for a VirtualMachine belongs at the same level as the domain object, under the path spec.template.spec as seen here:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: boston
spec:
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: bos
[ remainder of VM omitted ]

This nodeSelector requires the VM to run on one of the nodes labeled with the zone bos. In this case, either wkr0 or mcp0 may run this VM.

$ oc get vmi boston

NAME AGE PHASE IP NODENAME READY
boston 7m16s Running 10.129.2.107 wkr0 True

Should work need to be done on wkr0, start by cordoning and draining the node as outlined in the OpenShift Understanding node rebooting documentation:

$ oc adm cordon wkr0

node/wkr0 cordoned

$ oc adm drain wkr0 --ignore-daemonsets --delete-emptydir-data --force

node/wkr0 already cordoned

[ skipping updates of all the evicted pods ]

error when evicting pods/"virt-launcher-boston-pnbcd" -n "database" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

[ skipping repeats of above message ]

pod/virt-launcher-boston-pnbcd evicted
node/wkr0 drained

While the drain command runs, it outputs error messages showing that it cannot immediately evict the VM's virt-launcher Pod. Behind the scenes, the eviction request has triggered a VM migration, which we can see afterward with:

$ oc get vmim
NAME PHASE VMI
kubevirt-evacuation-5xg8b Succeeded boston

Next, check that the migrated VM landed on the other bos node, mcp0:

$ oc get vmi

NAME AGE PHASE IP NODENAME READY
virtualmachineinstance.kubevirt.io/boston 8m27s Running 10.131.0.34 mcp0 True

To return the node to service, use:

$ oc adm uncordon wkr0

The nodeSelector field could also be used with the disktype or topology.kubernetes.io/region labels, or even with multiple labels at once, provided the required logic is AND:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: boston
spec:
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: bos
        disktype: ssd
[ remainder of VM omitted ]

Only the node wkr0 satisfies both nodeSelector labels, so attempts to migrate the VM will result in a failed migration:

$ oc get vmi,vmim

NAME AGE PHASE IP NODENAME READY
virtualmachineinstance.kubevirt.io/boston 6m21s Running 10.129.3.182 wkr0 True

NAME PHASE VMI
virtualmachineinstancemigration.kubevirt.io/kubevirt-migrate-vm-k77r5 Failed boston

Assign virtual machines to nodes using affinity rules

When more nuanced control is required, affinity rules come into play. Affinity rules fall into three categories: nodeAffinity, podAffinity, and podAntiAffinity. The first behaves much like the nodeSelector above, but with more options. All three categories further subdivide into preferredDuringSchedulingIgnoredDuringExecution and requiredDuringSchedulingIgnoredDuringExecution. "Ignored during execution" means these rules cannot affect the behavior of running VMs; in other words, changing a node's labels while VMs are running will not cause a migration. The difference between preferred and required is whether the scheduler makes a best-effort attempt to schedule according to the weighted selectors (preferred), or requires all selectors to be true and fails to schedule the VM if that is impossible (required).

As an example, we can convert the nodeSelector above, the one that produced the failed migration, into a preferredDuringSchedulingIgnoredDuringExecution rule, giving a weight of 75 to staying in Boston and a weight of 50 to having an SSD disk:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: boston
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - bos
            weight: 75
          - preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
            weight: 50
[ remainder of VM omitted ]

As before, the VM schedules on wkr0, which is the only node that satisfies both conditions:

$ oc get vmi,vmim
NAME AGE PHASE IP NODENAME READY
virtualmachineinstance.kubevirt.io/boston 4m54s Running 10.129.3.188 wkr0 True

Now when the VM is migrated, it is allowed to run on node mcp0, which is in Boston but does not carry the SSD label:

$ oc get vmi,vmim

NAME AGE PHASE IP NODENAME READY
virtualmachineinstance.kubevirt.io/boston 11m Running 10.131.0.56 mcp0 True

NAME PHASE VMI
virtualmachineinstancemigration.kubevirt.io/kubevirt-migrate-vm-jqdl4 Succeeded boston

Pod affinity and anti-affinity

The podAffinity selector covers cases where a VM must be kept on the same node, or in the same availability zone or region, as a related service. An example might be a latency-sensitive front-end application that should run on the same node as its corresponding back-end service. The following Pod and VM definitions will always place the Pod on node mcp0 and allow the VM to migrate between wkr0 (preferred due to its disktype=ssd label) and mcp0.

apiVersion: v1
kind: Pod
metadata:
  name: httpd
  labels:
    app: low-latency
spec:
  nodeName: mcp0
  containers:
  - name: httpd
    image: httpd
    imagePullPolicy: IfNotPresent
---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: back-end
spec:
  template:
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - low-latency
            topologyKey: topology.kubernetes.io/zone
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
            weight: 50
[ remainder of VM omitted ]

An example of this running would look like the following:

$ oc get vmi

NAME AGE PHASE IP NODENAME READY
back-end 12m Running 10.129.3.15 wkr0 True

$ oc get pod httpd -o wide

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
httpd 1/1 Running 0 30m 10.131.0.60 mcp0 <none> <none>

Note that changes to the Pod do not affect the running VM. As an example, we delete the httpd Pod and recreate it on mcp2:

$ oc get po httpd -o wide

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
httpd 1/1 Running 0 2m9s 10.128.3.198 mcp2 <none> <none>

The back-end VM stays running where it was, but if we migrate it now, it follows the Pod into the phx zone:

$ virtctl migrate back-end

VM back-end was scheduled to migrate

$ oc get vmi
NAME AGE PHASE IP NODENAME READY
back-end 18m Running 10.128.4.26 wkr2 True

Finally, consider the case where a clustered application like a database requires three cluster members, and, for maximum protection, it is desirable to keep them all in separate zones. Translated to an anti-affinity rule, this looks something like:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: db01
spec:
  template:
    metadata:
      labels:
        app: database
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - database
            topologyKey: topology.kubernetes.io/zone

A collection of VMs with the above anti-affinity rule and the app: database label will arrange themselves onto nodes in the bos, rdu, and phx zones:

$ oc get vmi

NAME AGE PHASE IP NODENAME READY
db01 58m Running 10.130.2.173 wkr1 True
db02 58m Running 10.128.4.27 wkr2 True
db03 57m Running 10.131.0.62 mcp0 True

Caveats

As mentioned above, none of the affinity rules currently have any effect during execution: a running VM will continue running even if its affinity rules suggest it should migrate to another node. For both nodeSelector and affinity rules, it is not possible to alter the set of rules applied to a VirtualMachineInstance and then migrate it according to the new rules. Instead, a shutdown and restart of the VM's OS is required to propagate the changes. For a single VM, this could mean some minutes of interruption. For a clustered application like a database, this could still allow an admin to work around planned maintenance or unplanned emergencies without interrupting the clustered service. Work to update the KubeVirt API to allow propagating nodeSelector and affinity rules is scheduled for a future release, and can be tracked here.

On the subject of future work, this blog was written based on version 4.11 of the OpenShift Virtualization operator. In 4.12, an additional mechanism will be available to control the distribution of virtual machines across a cluster: topology spread constraints. The Kubernetes documentation explains how this works for Pods today.
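As a sketch only, based on the Kubernetes Pod API for topology spread constraints and assuming the standard topologySpreadConstraints field is accepted under spec.template.spec of a VM (the maxSkew value and labels here are illustrative), such a constraint might look like:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: db01
spec:
  template:
    metadata:
      labels:
        app: database
    spec:
      # Spread VMs carrying the app: database label evenly across zones,
      # tolerating a difference of at most one VM between any two zones.
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: database
[ remainder of VM omitted ]
```

Compared with the podAntiAffinity rule above, a spread constraint would allow additional VMs to schedule once every zone is occupied, keeping the counts balanced, rather than refusing to schedule a fourth VM outright.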

Conclusion

Whether your goal is to avoid losing service during standard maintenance, or to make sure certain VMs always have particular hardware available, node selection and affinity is the way to go.

For more documentation about virtual machines and node assignment, see the upstream documentation at the KubeVirt User Guide.