Introduction

In a prior blog post, we illustrated some best practices for choosing which metrics to use when monitoring applications. We used those same metrics to construct dashboards and to predict when something might break. We also discussed how, in certain cases, it is possible to intervene preemptively and prevent the failure from occurring (preventive maintenance).

It is the responsibility of the SRE team to automate these preventive maintenance operations as much as possible. Typically, the metric that best lends itself to this purpose is saturation (of a given resource).

When it comes to storage, saturation can be calculated as the ratio of used volume space to the total size of the volume in question.

Preventive maintenance, in this case, would be to increase the size of the volume when the saturation reaches a certain threshold.

If using persistent storage, it is recommended to create a Prometheus alert like the following:

    - alert: "Storage Saturation"
      labels:
        severity: critical
      annotations:
        summary: Storage usage over 75%
        persistentvolumeclaim: {{` "{{ $labels.persistentvolumeclaim }}" `}}
        namespace: {{` "{{ $labels.namespace }}" `}}
      expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.75
      for: 1m

This alert will notify an SRE when any volume crosses the 75% utilization threshold (this threshold is, of course, adjustable to your needs). When notified, an SRE has the chance to expand the volume.
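The rule above is a fragment, and its annotations use Helm escaping (the backtick-wrapped expressions) so that Helm leaves the Prometheus templates untouched. As a minimal sketch, assuming the Prometheus instance is managed by the Prometheus Operator (as OpenShift Monitoring is), the same alert could be wrapped in a complete PrometheusRule resource. The resource name, namespace, group name, and the description annotation are placeholders chosen for illustration, not values from the original rule:

```yaml
# Sketch of a standalone PrometheusRule (no Helm escaping needed here).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: storage-saturation          # hypothetical name
  namespace: my-app                 # hypothetical namespace
spec:
  groups:
    - name: storage.rules           # hypothetical group name
      rules:
        - alert: StorageSaturation
          # Ratio of used bytes to capacity, evaluated per PVC
          expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.75
          # The condition must hold for one minute before the alert fires
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: Storage usage over 75%
            description: "PVC {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is above 75% usage."
```

Both metrics carry the persistentvolumeclaim and namespace labels, so the division matches series per PVC and the alert fires once for each saturated volume.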

Expanding the size of persistent storage volumes is a relatively new feature in Kubernetes. It is available with CSI storage drivers that implement volume expansion, and it is also supported by some in-tree storage drivers.
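Driver support alone is not sufficient: Kubernetes only accepts a resize request for a PersistentVolumeClaim whose StorageClass has allowVolumeExpansion set to true. A minimal sketch of such a StorageClass follows; the class name and the provisioner are placeholders, so substitute the name of the CSI driver actually installed in your cluster:

```yaml
# Sketch of a StorageClass that permits volume expansion.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable               # hypothetical class name
provisioner: csi.example.com     # placeholder; use your CSI driver's name
allowVolumeExpansion: true       # without this, PVC resize requests are rejected
```

With this in place, a volume is expanded by raising spec.resources.requests.storage on the PersistentVolumeClaim. Shrinking a volume is not supported.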

By combining the ability to monitor persistent volumes with the ability to expand them, we can build preventive maintenance automation that expands volumes as they approach capacity. The remainder of this article introduces an operator-based approach to automating this kind of preventive maintenance.

The volume-expander-operator

The volume-expander-operator is an operator that monitors volumes, and when the saturation of a volume crosses a given threshold, it expands that volume. Expanding volumes, in some situations, requires that attached pods also be restarted. This functionality is also integrated into the operator.
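For background on the restart requirement: when the underlying storage has been resized but the file system has not yet been grown, Kubernetes reports this through a condition on the PersistentVolumeClaim. The fragment below sketches what the status of such a claim can look like. The capacity value is illustrative, and whether the operator relies on this condition to decide when to restart pods is an assumption, not something this article states:

```yaml
# Illustrative status of a PVC that is waiting for a file system resize.
status:
  phase: Bound
  capacity:
    storage: 1Gi                 # still reports the old size
  conditions:
    - type: FileSystemResizePending
      status: "True"
      message: Waiting for user to (re-)start a pod to finish file system resize of volume on node.
```

Once the file system has been resized, the condition is removed and status.capacity reflects the new size.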

For a volume to be monitored by the volume-expander-operator, one has to add the following annotation to the PersistentVolumeClaim:

    volume-expander-operator.redhat-cop.io/autoexpand: "true"

Once enabled, the volume-expander-operator will start polling the platform Prometheus instance that is part of OpenShift Monitoring and calculating the ratio between these two metrics: kubelet_volume_stats_used_bytes and kubelet_volume_stats_capacity_bytes.

This ratio represents the volume saturation, and the following annotations placed on the PersistentVolumeClaim determine when the volume should be expanded:

| Annotation | Description | Default value |
| --- | --- | --- |
| volume-expander-operator.redhat-cop.io/expand-threshold-percent | Saturation percentage that triggers the expansion of the volume | 80 |
| volume-expander-operator.redhat-cop.io/expand-by-percent | Percentage of the existing volume size by which the volume is expanded | 25 |
| volume-expander-operator.redhat-cop.io/expand-up-to | Upper bound on the size of the volume | |
| volume-expander-operator.redhat-cop.io/polling-frequency | Frequency at which the operator queries Prometheus for metrics | 30s |

Each time the saturation crosses the threshold, the volume is expanded by the configured percentage, until its size reaches the upper bound.

The following is an example of a PVC configured for auto-expansion:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: to-be-expanded
      annotations:
        volume-expander-operator.redhat-cop.io/autoexpand: "true"
        volume-expander-operator.redhat-cop.io/expand-threshold-percent: "85"
        volume-expander-operator.redhat-cop.io/expand-by-percent: "20"
        volume-expander-operator.redhat-cop.io/polling-frequency: "1m"
        volume-expander-operator.redhat-cop.io/expand-up-to: "20Gi"
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: "1Gi"

Installation

The volume-expander-operator can be installed via OperatorHub or via a Helm chart.

Instructions for both supported installation methods can be found in the operator's documentation.

Conclusion

The purpose of the volume-expander-operator is to relieve human operators of the toil of monitoring volumes and expanding them when needed. It also provides some protection against runaway processes that would otherwise increase the consumption of storage space indefinitely. However, it is always recommended to set a quota on the relevant storage class to prevent this situation from occurring.
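As a minimal sketch of such a quota, a ResourceQuota can cap the total storage that claims of one storage class may request within a namespace. The storage class name (expandable), the quota name, the namespace, and the limits below are all placeholders:

```yaml
# Sketch: per-namespace quota on a single storage class.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: expandable-storage-quota   # hypothetical name
  namespace: my-app                # hypothetical namespace
spec:
  hard:
    # Sum of storage requests across all PVCs that use the "expandable" class
    expandable.storageclass.storage.k8s.io/requests.storage: 100Gi
    # Optional: maximum number of PVCs that use that class
    expandable.storageclass.storage.k8s.io/persistentvolumeclaims: "10"
```

Because the quota is enforced on the requested size of each claim, an expansion that would push the namespace past the limit is rejected, whether a person or an automated process requested it.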

It is our hope that this operator will increase the adoption rate of stateful workloads on OpenShift.


About the author

Raffaele is a full-stack enterprise architect with 20+ years of experience. Raffaele started his career in Italy as a Java architect, then gradually moved into integration architect and enterprise architect roles. Later he moved to the United States, where he eventually became an OpenShift architect for Red Hat consulting services, acquiring, in the process, knowledge of the infrastructure side of IT.

Currently, Raffaele holds a consulting position as a cross-portfolio application architect with a focus on OpenShift. For most of his career, Raffaele has worked with large financial institutions, which has given him an understanding of enterprise processes and of the security and compliance requirements of large enterprise customers.

Raffaele has become part of the CNCF TAG Storage and contributed to the Cloud Native Disaster Recovery whitepaper.

Recently Raffaele has been focusing on how to improve the developer experience by implementing internal development platforms (IDP).
