Our previous blog post discussed the challenges of using persistent volumes with peer-pods and introduced the CSI wrapper as a potential solution.

This post dives deeper into the various components that make up the persistent volume solution in peer-pods.

Intercepting the CSI plugins in peer-pods

To use persistent volumes in peer-pods, the CSI Wrapper approach intercepts the CSI Plugins in the control plane (CSI Controller Plugin) and on the worker node (CSI Node Plugin). With the CSI Wrapper injected into the CSI Plugins, volumes can be attached, detached, mounted, and unmounted for containers in peer-pods without any code changes to the original CSI Plugins.

One key aspect of this solution is intercepting the CSI Plugins.

Intercepting the CSI Controller Plugin

As discussed in the previous blog post, CSI volume drivers implement the CSI interface. This implementation includes a cluster-level StatefulSet or Deployment that facilitates communication with the Kubernetes controllers. This component is the CSI Controller Plugin. We inject a container into the CSI Controller Plugin to handle peer-pods persistent volumes.

The key component for intercepting the CSI Controller Plugin is:

  • CSI Controller Plugin Wrapper – A container injected in the CSI Controller Plugin to hook and filter the CSI API calls.

Figure 1: CSI Controller Plugin interception concept

Note the following: 

  • The CSI Driver Container and CSI Sidecars are originally linked through the Unix Domain Socket (UDS) /csi/csi.sock.
  • The CSI Controller Plugin Wrapper container sits between the CSI Driver Container and CSI Sidecars in the CSI Controller Plugin pod.
  • All the API calls to the CSI Driver Container will be hooked and filtered by the CSI Controller Plugin Wrapper container.
  • The CSI Controller Plugin Wrapper will revise the request or response on demand specifically for peer-pods.

Figure 2 shows how the CSI Controller Plugin Wrapper hooks and filters the API calls by adding an extra UDS, /csi/csi-controller-wrapper.sock:

Figure 2: CSI Controller Plugin interception with UDS

Note the following: 

  • The CSI Sidecars in the CSI Controller Plugin communicate with the Kubernetes api-server via HTTP.
  • The CSI Controller Plugin Wrapper container communicates with the CSI Sidecars via the UDS /csi/csi-controller-wrapper.sock.
  • The CSI Driver Container communicates with the CSI Controller Plugin Wrapper via the UDS /csi/csi.sock.

The following is a list of CSI APIs implemented in the CSI Controller Plugin:

  • CreateVolume – Creates a persistent volume through the infrastructure API.
  • DeleteVolume – Deletes a persistent volume through the infrastructure API.
  • ControllerPublishVolume – Attaches a persistent volume to a worker node through the infrastructure API.
  • ControllerUnpublishVolume – Detaches a persistent volume from a worker node through the infrastructure API.

All are hooked and filtered in the CSI Controller Plugin Wrapper container. 

Figure 3 shows an example of how the CSI API is hooked when creating and attaching a persistent volume in peer-pods: 

Figure 3: APIs intercepted in CSI Controller Plugin

Note the following: 

  • The CreateVolume API – Goes from the CSI Sidecars to the CSI Controller Plugin Wrapper, and then on to the CSI Driver Container without change, so the existing function in the CSI Driver Container creates the persistent volume.
  • The ControllerPublishVolume API – The CSI Controller Plugin Wrapper hooks this call and does not pass it to the CSI Driver Container, because AttachVolume must not attach the persistent volume to the worker node. Instead, the volume is attached to the peer-pods VM through a different flow, Cache and Replay, which is explained in later sections.
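
The routing described above can be sketched in a few lines of Python. This is a minimal in-process model, not the csi-wrapper code: FakeDriver, ControllerWrapper, and the dictionary-shaped requests and responses are hypothetical stand-ins for the gRPC services that actually talk over the two sockets.

```python
class FakeDriver:
    """Hypothetical stand-in for the CSI Driver Container."""

    def __init__(self):
        self.calls = []

    def create_volume(self, request):
        self.calls.append("CreateVolume")
        return {"volume_id": "vol-" + request["name"]}

    def controller_publish_volume(self, request):
        self.calls.append("ControllerPublishVolume")
        return {"publish_context": {}}


class ControllerWrapper:
    """Sits between the sidecars and the driver, like the wrapper container."""

    def __init__(self, driver):
        self.driver = driver
        self.cached = []  # intercepted requests kept for a later replay

    def create_volume(self, request):
        # Passed through unchanged: the driver creates the volume as usual.
        return self.driver.create_volume(request)

    def controller_publish_volume(self, request):
        # Intercepted: cache the request instead of attaching to the worker node.
        self.cached.append(("ControllerPublishVolume", request))
        return {"publish_context": {}}  # placeholder success response


driver = FakeDriver()
wrapper = ControllerWrapper(driver)
created = wrapper.create_volume({"name": "data"})
wrapper.controller_publish_volume(
    {"volume_id": created["volume_id"], "node_id": "worker-1"}
)
print(driver.calls)  # the driver only ever sees the pass-through call
```

In this model the driver records a single CreateVolume call, while the attach request stays in the wrapper's cache.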

Intercepting the CSI Node Plugin

CSI volume drivers also have a node-level DaemonSet named the CSI Node Plugin to facilitate communication with every Kubelet instance. We inject a container into the CSI Node Plugin to handle peer-pods persistent volumes.

The key component for intercepting the CSI Node Plugin is:

  • CSI Node Plugin Wrapper – A container injected in the CSI Node Plugin to hook and filter the CSI API calls.

Figure 4 shows the concept of the CSI Node Plugin:

Figure 4: CSI Node Plugin interception concept

Note the following:

  • The CSI Driver Container in the CSI Node Plugin pod and Kubelet originally communicate via the UDS /csi/csi.sock.
  • The CSI Node Plugin Wrapper container is added between the CSI Driver Container in the CSI Node Plugin pod and Kubelet.
  • All the API calls to the CSI Driver Container will be hooked and filtered by the CSI Node Plugin Wrapper container.
  • The CSI Node Plugin Wrapper will revise the request or response on demand specifically for peer-pods.

Figure 5-1 shows how the CSI Node Plugin Wrapper hooks and filters the API calls via the UDS /csi/csi.sock and /csi/csi-controller-wrapper.sock:

Figure 5-1: CSI Node Plugin interception via UDS

Note that this is similar to hooking in the CSI Controller Plugin.

The following is a list of CSI APIs implemented in the CSI Node Plugin:

  • NodeStageVolume – Mounts a persistent volume to a global path on a worker node.
  • NodeUnstageVolume – Unmounts a persistent volume from the global path on a worker node.
  • NodePublishVolume – Mounts a persistent volume from the global path on a worker node to a pod.
  • NodeUnpublishVolume – Unmounts a persistent volume from a pod.

These CSI APIs will be hooked and filtered in the CSI Node Plugin Wrapper container. The CSI actions will be cached and then replayed.

Besides the CSI Driver Container, the CSI Node Plugin Wrapper also talks with the cloud-api-adaptor through the UDS /run/peerpod/hypervisor.sock.

Figure 5-2 shows how the CSI Node Plugin Wrapper communicates with the cloud-api-adaptor to read necessary peer-pods information, such as the peer-pod's ID, and caches it: 

Figure 5-2: The CSI Node Plugin communicates with cloud-api-adaptor via UDS

Cache and replay the CSI actions

Since persistent volume attaching and mounting happens on the pod VM rather than the worker node VM for peer-pods, we’ll cache some of the CSI Actions in the CSI Controller Plugin and CSI Node Plugin and replay them on the pod VM later.

The caching mechanism is achieved via the following components (illustrated in Figure 6):

  • PeerpodVolume – A custom resource definition (CRD) object that stores the cached information, such as the peer-pod ID.
  • PeerpodVolume Controller – A Kubernetes controller to watch for PeerpodVolume status changes and take corresponding steps to replay the CSI actions.

Cache

Figure 6 shows what CSI Actions will be cached in PeerpodVolume and their order. 

Figure 6: CSI Actions caching

The list of actions includes the following: 

  • CreateVolume – A PeerpodVolume object is created during the CreateVolume CSI Action.
  • ControllerPublishVolume – The PeerpodVolume object is updated during ControllerPublishVolume to cache the AttachVolume CSI Action.
  • NodeStageVolume – The PeerpodVolume object is updated during NodeStageVolume to cache the CSI Action that mounts the volume on the peer-pods VM.
  • NodePublishVolume – The PeerpodVolume object is updated during NodePublishVolume to cache the CSI Action that mounts the volume into the pod.
  • PeerPod Instance running – The PeerpodVolume object is updated to the PeerPod Instance running status when the peer-pods VM is created.
  • PeerPod Instance ready – A PeerpodVolume Controller on the worker node reads the peer-pod ID, sets it in the PeerpodVolume object, and updates the PeerpodVolume to indicate that the PeerPod Instance is ready.
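
One way to picture the cached record is as an object that accumulates requests while moving through a fixed sequence of states. The sketch below is illustrative only: the field names and state labels are assumptions chosen to mirror the list above, not the schema of the real PeerpodVolume CRD.

```python
from dataclasses import dataclass, field

# Hypothetical state labels, in the order the steps above occur.
STATES = [
    "volumeCreated",
    "controllerPublishCached",
    "nodeStageCached",
    "nodePublishCached",
    "peerPodRunning",
    "peerPodReady",
]


@dataclass
class PeerpodVolume:
    volume_id: str
    state: str = STATES[0]
    peer_pod_id: str = ""
    cached_requests: dict = field(default_factory=dict)

    def advance(self, new_state, request=None, peer_pod_id=""):
        """Move to the next state, refusing to skip or reorder steps."""
        expected = STATES[STATES.index(self.state) + 1]
        if new_state != expected:
            raise ValueError(f"expected {expected}, got {new_state}")
        if request is not None:
            self.cached_requests[new_state] = request
        if peer_pod_id:
            self.peer_pod_id = peer_pod_id
        self.state = new_state


pv = PeerpodVolume(volume_id="vol-1")  # created during CreateVolume
pv.advance("controllerPublishCached", request={"node_id": "worker-1"})
pv.advance("nodeStageCached", request={"staging_path": "/staging"})
pv.advance("nodePublishCached", request={"target_path": "/target"})
pv.advance("peerPodRunning")
pv.advance("peerPodReady", peer_pod_id="podvm-123")
```

Modelling the order explicitly makes it easy to see that the three cached requests are all in place before the instance is marked ready.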

Replay

After the PeerPod instance is in running status, the CSI Node Plugin running on the peer-pods VM monitors the PeerpodVolume object via the PeerpodVolume Controller and performs the replay procedure once the PeerPod instance is ready.

Figure 7 shows the flow:

Figure 7: CSI Actions replaying

Note the following: 

  • ControllerPublishVolume – The Replay ControllerPublishVolume CSI Action will attach the created persistent volume to the peer-pods VM instead of the worker node VM.
  • NodeStageVolume – The Replay NodeStageVolume CSI Action will mount the persistent volume to a global path on the peer-pods VM.
  • NodePublishVolume – The Replay NodePublishVolume CSI Action will mount the persistent volume from the global path on the peer-pods VM to the pod.
  • StartContainer – The components on the peer-pods VM will start the container when the persistent volume is mounted to the pod.
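
The replay order can be sketched as follows. This is a simplified model under stated assumptions: the replay function and the shape of the cached record are hypothetical, and the real replay is driven by the PeerpodVolume Controller reacting to status changes rather than by a single function call.

```python
REPLAY_ORDER = ["ControllerPublishVolume", "NodeStageVolume", "NodePublishVolume"]


def replay(volume):
    """Return the steps performed for the peer-pods VM, in order."""
    if not volume["ready"]:
        return []  # nothing is replayed before the ready status is set
    target = volume["peer_pod_id"]
    steps = []
    for action in REPLAY_ORDER:
        # Each cached request is re-issued against the peer-pods VM,
        # not the worker node VM.
        request = dict(volume["cached"][action], target=target)
        steps.append((action, request["target"]))
    # The container starts only after the volume is mounted to the pod.
    steps.append(("StartContainer", target))
    return steps


volume = {
    "ready": False,
    "peer_pod_id": "podvm-123",
    "cached": {name: {"volume_id": "vol-1"} for name in REPLAY_ORDER},
}
skipped = replay(volume)  # instance not ready yet, so no steps run
volume["ready"] = True
steps = replay(volume)
```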

The next task is to examine the detailed flow and compare the persistent volume workflow in standard pods versus peer-pods.

Use case example 

When using a persistent volume in a pod, a PersistentVolume (PV) is predefined and a PersistentVolumeClaim (PVC) is referenced in the pod descriptor, as in the Persistent Volume Example. Define the PV and PVC as shown below. Note that the storageClassName is defined by the CSI volume driver, so its value varies depending on the driver in use.

apiVersion: v1
kind: PersistentVolume
metadata:
 name: my-pv-volume
 labels:
   type: local
spec:
 storageClassName: my-csi-volume-driver-name
 capacity:
   storage: 10Gi
 accessModes:
   - ReadWriteOnce
 hostPath:
   path: "/mnt/data"

 

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
 name: my-pv-claim
spec:
 storageClassName: my-csi-volume-driver-name
 accessModes:
   - ReadWriteOnce
 resources:
   requests:
     storage: 3Gi
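
To see why these two manifests pair up, recall that Kubernetes binds a claim to a volume only when the storage class matches, the volume offers the requested access modes, and the volume's capacity covers the request. The check below is a simplified illustration of that rule, not the Kubernetes binder: parse_quantity and can_bind are hypothetical helpers, only binary suffixes are handled, and selectors and volume modes are ignored.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(text):
    """Convert a quantity such as '10Gi' to bytes (binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def can_bind(pv_spec, pvc_spec):
    """Apply the three matching conditions described above."""
    same_class = pv_spec["storageClassName"] == pvc_spec["storageClassName"]
    modes_ok = set(pvc_spec["accessModes"]) <= set(pv_spec["accessModes"])
    capacity = parse_quantity(pv_spec["capacity"]["storage"])
    requested = parse_quantity(pvc_spec["resources"]["requests"]["storage"])
    return same_class and modes_ok and capacity >= requested


# The relevant fields of the two manifests above.
pv_spec = {
    "storageClassName": "my-csi-volume-driver-name",
    "capacity": {"storage": "10Gi"},
    "accessModes": ["ReadWriteOnce"],
}
pvc_spec = {
    "storageClassName": "my-csi-volume-driver-name",
    "accessModes": ["ReadWriteOnce"],
    "resources": {"requests": {"storage": "3Gi"}},
}
print(can_bind(pv_spec, pvc_spec))
```

Here the 3Gi request fits within the 10Gi volume and the other fields match, so the claim can bind.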

Use the PVC in a pod, as seen below:

apiVersion: v1
kind: Pod
metadata:
 name: my-pv-pod
spec:
 volumes:
   - name: my-pv-storage
     persistentVolumeClaim:
       claimName: my-pv-claim
 containers:
   - name: my-pv-container
     image: nginx
     ports:
       - containerPort: 80
         name: "http-server"
     volumeMounts:
       - mountPath: "/usr/share/nginx/html"
         name: my-pv-storage

The next section covers the flow in standard pods.

Workflow in standard pods

Creating a volume

Figure 8 shows the workflow for creating a persistent volume:

Figure 8: Workflow when creating a volume

The flow consists of the following steps:

  • The user provides a persistent volume claim (PVC) descriptor and creates it in the API server.
  • The external-provisioner sidecar container in the CSI Controller Plugin watches the PVC and issues a CreateVolumeRequest call to the CSI socket.
  • The csi-driver-container in the CSI Controller Plugin listens on the CSI socket, calls CreateVolume, and reports the newly created volume back through the CSI socket.
  • The external-provisioner sidecar container in the CSI Controller Plugin creates a persistent volume (PV) and updates the PVC to be bound. The VolumeAttachment object is created by the controller-manager.

Next is the flow for attaching volumes.

Attaching a volume

Figure 9 shows the workflow when attaching the persistent volume:

Figure 9: Workflow when attaching a volume

The flow consists of the following steps:

  • The external-attacher sidecar container in the CSI Controller Plugin watches for the VolumeAttachments and submits a ControllerPublishVolume RPC call to the csi-driver-container.
  • The csi-driver-container gets the ControllerPublishVolume and calls AttachVolume.
  • The external-attacher updates the VolumeAttachment status.

Next is the volume mounting flow.

Mounting a volume

Figure 10 shows the workflow when mounting a persistent volume:

Figure 10: Workflow when mounting a volume

The flow consists of the following:

  • Kubelet waits for the volume to be attached and submits a NodeStageVolume call (which formats the volume and mounts it on the node at the staging directory) to the CSI-Node-Plugin.
  • The CSI-Node-Plugin gets the NodeStageVolume call and mounts the volume to the path /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount, then responds to Kubelet.
  • Kubelet calls NodePublishVolume (which mounts the volume to the pod's directory).
  • The CSI-Node-Plugin performs the NodePublishVolume action and mounts the volume to /var/lib/kubelet/pods/<pod-uuid>/volumes/kubernetes.io~csi/<pvc-name>/mount.
  • Kubelet starts the container of the pod with the provisioned volume.
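
The two mount locations in the steps above follow a fixed pattern, which the small helper below spells out. It is only a sketch: staging_path and publish_path are hypothetical names, the templates are copied from the paths quoted above, and the exact layout can differ between Kubernetes versions.

```python
from pathlib import PurePosixPath

KUBELET_ROOT = PurePosixPath("/var/lib/kubelet")


def staging_path(pv_name):
    """Global (staging) mount point used by NodeStageVolume."""
    return (
        KUBELET_ROOT / "plugins" / "kubernetes.io" / "csi" / "pv"
        / pv_name / "globalmount"
    )


def publish_path(pod_uuid, volume_name):
    """Per-pod mount point used by NodePublishVolume."""
    return (
        KUBELET_ROOT / "pods" / pod_uuid / "volumes" / "kubernetes.io~csi"
        / volume_name / "mount"
    )


# Example values; the pod UUID here is made up for illustration.
print(staging_path("my-pv-volume"))
print(publish_path("0a1b2c3d", "my-pv-claim"))
```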

Bringing the workflow together

Figure 11 shows the end-to-end workflow when creating, attaching, and mounting the persistent volume to a pod.

Figure 11: The full process for creating, attaching, and mounting persistent volumes

The flow consists of the following:

  • The user provides a persistent volume claim (PVC) descriptor and creates it on the API server.
  • An external-provisioner sidecar container in the CSI Controller Plugin watches the PVC and issues a CreateVolumeRequest call to the CSI socket.
  • A csi-driver-container in the CSI Controller Plugin listens on the CSI socket, calls CreateVolume, and reports the newly created volume back through the CSI socket.
  • An external-provisioner sidecar container in the CSI Controller Plugin creates a persistent volume (PV) and updates the PVC to be bound. The VolumeAttachment object is created by the controller-manager.
  • An external-attacher sidecar container in the CSI Controller Plugin watches for VolumeAttachments and submits a ControllerPublishVolume RPC call to the csi-driver-container.
  • A csi-driver-container gets ControllerPublishVolume and calls AttachVolume.
  • An external-attacher updates the VolumeAttachment status.
  • Kubelet waits for the volume to be attached and submits NodeStageVolume (formats and mounts the volume to the node at the staging directory) to the CSI-Node-Plugin.
  • The CSI-Node-Plugin gets the NodeStageVolume call and mounts it to path /var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount, then responds to Kubelet.
  • Kubelet calls NodePublishVolume (which mounts the volume to the pod's directory).
  • The CSI-Node-Plugin performs NodePublishVolume and mounts the volume to /var/lib/kubelet/pods/<pod-uuid>/volumes/kubernetes.io~csi/<pvc-name>/mount.
  • Kubelet starts the container of the pod with the provisioned volume.

The next topic is the workflow in peer-pods.

Workflow in peer-pods

Creating a persistent volume with peer-pods 

The CSI Controller Plugin Wrapper (orange in the diagram) is injected between the CSI Sidecars (external provisioner, external attacher) and CSI Driver Container. It hooks the CreateVolumeRequest in the CreateVolume CSI Interface and performs the following actions:

  • The CSI Controller Plugin Wrapper passes through the CreateVolumeRequest to the CSI Driver Container and calls CreateVolume in the CSI Driver Container to create the persistent volume.
  • The CSI Controller Plugin Wrapper creates the PeerpodVolume CR object and saves it to the api-server.
  • The CSI Controller Plugin Wrapper returns the VolumeCreated response to the CSI Sidecar (external provisioner).

Figure 12 below illustrates this process: 

Figure 12: Creating a persistent volume in the CSI Controller Plugin

Cache-attaching volume actions to peer-pods

When attaching a persistent volume, the CSI Controller Plugin Wrapper (orange in the diagram) hooks the ControllerPublishVolume in the CSI Interface and performs the following actions:

  • The CSI Controller Plugin Wrapper does not pass the ControllerPublishVolume call through, so AttachVolume is not called in the CSI Driver Container.
  • The CSI Controller Plugin Wrapper updates the PeerpodVolume CR object by adding the original node ID (it will be updated to the peer-pod ID in a later phase). This information is used later, when the CSI Action is replayed, to attach the persistent volume to the corresponding peer-pods VM.
  • The CSI Controller Plugin Wrapper returns a fake VolumeAttached response to the CSI Sidecar (external attacher).
  • The CSI Controller Plugin Wrapper embeds a synchronized PeerpodVolume Controller, which monitors changes to the PeerpodVolume CR object and calls ControllerPublishVolume to attach the persistent volume to the peer-pods VM instance. This is a replay action.
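
The node ID substitution at the heart of this step can be sketched as below. Everything here is hypothetical: cache_attach, replay_attach, and the record layout only model the behaviour described in the list, with a plain dictionary standing in for the PeerpodVolume CR and a callable standing in for the driver's attach operation.

```python
def cache_attach(record, request):
    """Intercept ControllerPublishVolume: store the request, attach nothing."""
    record["node_id"] = request["node_id"]  # original worker node ID
    record["attach_request"] = dict(request)
    return {"publish_context": {}}  # placeholder success for the external attacher


def replay_attach(record, peer_pod_id, attach):
    """Replay the cached request with the peer-pods VM as the target."""
    record["node_id"] = peer_pod_id  # worker node ID replaced by the peer-pod ID
    request = dict(record["attach_request"], node_id=peer_pod_id)
    return attach(request)


attached = []  # collects what the (stand-in) driver is asked to attach
record = {"volume_id": "vol-1"}
response = cache_attach(record, {"volume_id": "vol-1", "node_id": "worker-1"})
replay_attach(record, "podvm-123", attach=attached.append)
```

The only attach request that reaches the stand-in driver names the peer-pods VM, while the cached copy still shows the worker node that the external attacher originally asked for.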

Figure 13 illustrates this process: 

Figure 13: Attaching persistent volume in CSI Controller Plugin

Cache-mounting volume actions to peer-pods

The CSI Node Plugin Wrapper (orange in the diagram) is injected between the CSI Node Plugin and Kubelet on the worker node. It hooks the NodeStageVolume and NodePublishVolume CSI Interface, communicates with cloud-api-adaptor, and performs the following actions:

  • The CSI Node Plugin Wrapper hooks NodeStageVolume and does not pass the API call to the CSI Node Plugin, so the persistent volume is not mounted to the worker node VM.
  • The CSI Node Plugin Wrapper updates the PeerpodVolume CR object by adding the NodeStageVolume CSI Action.
  • The CSI Node Plugin Wrapper returns a fake response for NodeStageVolume to Kubelet and the api-server.
  • The CSI Node Plugin Wrapper hooks NodePublishVolume and does not pass the API call to the CSI Node Plugin, so the persistent volume is not mounted to the pod.
  • The CSI Node Plugin Wrapper updates the PeerpodVolume CR object by adding the NodePublishVolume CSI Action.
  • The CSI Node Plugin Wrapper returns a fake response for NodePublishVolume to Kubelet and the api-server.
  • The CSI Node Plugin Wrapper watches the PeerpodVolume CR object and identifies the running peer-pods instance. It retrieves the peer-pod ID and sets it in the PeerpodVolume CR, then sets the peer-pods instance to ready.
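
The worker-node side of this flow can be modelled roughly as follows. The sketch makes several assumptions: NodeWrapper and its methods are hypothetical, a dictionary stands in for the PeerpodVolume CR, and the lookup_peer_pod_id callable stands in for the query to the cloud-api-adaptor, whose real interface is not shown in this post.

```python
class NodeWrapper:
    """Hypothetical model of the CSI Node Plugin Wrapper on the worker node."""

    def __init__(self, record, lookup_peer_pod_id):
        self.record = record  # stands in for the PeerpodVolume CR
        self.lookup_peer_pod_id = lookup_peer_pod_id  # stands in for cloud-api-adaptor

    def node_stage_volume(self, request):
        self.record["actions"].append(("NodeStageVolume", request))
        return {}  # empty success response for Kubelet; nothing is mounted

    def node_publish_volume(self, request):
        self.record["actions"].append(("NodePublishVolume", request))
        return {}  # empty success response for Kubelet; nothing is mounted

    def mark_ready(self, pod_name):
        """Record the peer-pod ID and flag the instance as ready."""
        peer_pod_id = self.lookup_peer_pod_id(pod_name)
        if peer_pod_id is None:
            return False  # peer-pods VM not running yet; try again later
        self.record["peer_pod_id"] = peer_pod_id
        self.record["ready"] = True
        return True


instances = {}  # pod name -> peer-pod ID, filled in once the VM exists
record = {"actions": [], "ready": False}
wrapper = NodeWrapper(record, instances.get)
wrapper.node_stage_volume({"volume_id": "vol-1"})
wrapper.node_publish_volume({"volume_id": "vol-1"})
before = wrapper.mark_ready("my-pv-pod")  # VM not created yet
instances["my-pv-pod"] = "podvm-123"
after = wrapper.mark_ready("my-pv-pod")
```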

This is illustrated in Figure 14: 

Figure 14: Mounting a persistent volume in the CSI Node Plugin on the worker node

Mounting persistent volumes to containers in peer-pods

The CSI Node Plugin Wrapper (orange in the diagram) is injected between the CSI Node Plugin and Kubelet on the peer-pods VM. It embeds the PeerpodVolume Controller, watches for status changes on the PeerpodVolume CR object, and performs the corresponding actions according to that status:

  • The CSI Node Plugin Wrapper watches the PeerpodVolume CR, performs a NodeStageVolume CSI Action, and calls the corresponding API in the CSI Node Plugin to mount the persistent volume to a global path on the peer-pods VM.
  • The CSI Node Plugin Wrapper watches the PeerpodVolume CR, performs a NodePublishVolume CSI Action, and calls the corresponding API in the CSI Node Plugin to mount the persistent volume to the pod on the peer-pods VM.

This is illustrated in Figure 15: 

Figure 15: Mounting a persistent volume in the CSI Node Plugin on peer-pods

Bringing it all together and end-to-end architecture

Figure 16 shows the overall flow for the caching and replaying procedure:

Figure 16: Cache and replay procedures

Here is a review of the detailed steps: 

  • CreateVolume – A PeerpodVolume object is created during the CreateVolume CSI Action.
  • Cache ControllerPublishVolume – The PeerpodVolume object is updated during ControllerPublishVolume to cache the AttachVolume CSI Action.
  • Cache NodeStageVolume – The PeerpodVolume object is updated during NodeStageVolume to cache the CSI Action that mounts the volume on the peer-pods VM.
  • Cache NodePublishVolume – The PeerpodVolume object is updated during NodePublishVolume to cache the CSI Action that mounts the volume into the pod.
  • Cache PeerPod Instance running status – The PeerpodVolume object is updated to the PeerPod Instance running status when the peer-pods VM is created.
  • Cache PeerPod Instance ready status – A PeerpodVolume Controller on the worker node reads the peer-pod ID, sets it in the PeerpodVolume object, and updates the PeerpodVolume to indicate that the PeerPod Instance is ready.
  • Replay ControllerPublishVolume – A Replay ControllerPublishVolume CSI Action attaches the created persistent volume to the peer-pods VM instead of the worker node VM.
  • Replay NodeStageVolume – A Replay NodeStageVolume CSI Action mounts the persistent volume to a global path on the peer-pods VM.
  • Replay NodePublishVolume – A Replay NodePublishVolume CSI Action mounts the persistent volume from the global path on the peer-pods VM to the pod.
  • StartContainer – The kata-agent component on the peer-pods VM starts the container when the persistent volume is mounted to the pod.

Peer-pods and CSI Plugins

This blog post took a deeper dive into the technical implementation of persistent storage in a peer-pods solution. It covered how the CSI Plugins in the control plane and on the worker node are intercepted, and it discussed the new CRD that helps cache and replay CSI actions.

In addition, it explained the workflow when creating, attaching, and mounting a persistent volume to a container in peer-pods.

The next blog post will provide hands-on instructions for deploying and running the persistent volume in peer-pods.

 


About the authors

Qi Feng is an architect of cloud-native infrastructure and Confidential Computing in IBM Cloud and Systems. He is the maintainer of the Cloud API Adaptor of Confidential Containers. He is a big fan of open source and has contributed to various CNCF communities in addition to CoCo.

Da Li works in the area of confidential containers, is a maintainer of the CNCF Confidential Containers cloud-api-adaptor project, and focuses on csi-wrapper, podvm image builds, and e2e test pipelines.

Yohei works on enhancing the performance and security of software stacks for IBM Z. He has contributed to various open source projects related to cloud and security, and has recently been contributing to the Confidential Containers project, a CNCF sandbox project.

Lei Li is a Senior Software Engineer at IBM Systems, working on Confidential Computing in IBM Cloud and focusing on the implementation of Confidential Computing technology.
