Red Hat Advanced Cluster Management for Kubernetes (RHACM) Governance provides an extensible framework for enterprises to introduce their own security and configuration policies and apply them to managed OpenShift or Kubernetes clusters. For more information on RHACM policies, I recommend reading the "Applying Policy-Based Governance at Scale Using Templates" and "Comply to standards using policy-based governance" blogs.

This multi-part blog series will showcase several techniques you can apply when using templates in your RHACM Policies. In part one, I reviewed practices you can use to make your templates more readable and easier to maintain.

Part two of this series will discuss more advanced template functionality and extended use cases for using Policies to manage clusters.

Validate the cluster state

Users typically view RHACM Policies as the mechanism to apply day-2 configuration to a cluster: configuring authentication, creating infra nodes, configuring cluster workloads, and installing operators, among numerous other day-2 tasks. In part one of this series, I discussed using templating to make these configurations more dynamic and the policies easier to maintain.

A policy to install an Operator using the Operator Lifecycle Manager (OLM) might consist of a Namespace definition, an OperatorGroup, and a Subscription. Applying these three objects will result in OLM installing the specified Operator. Once those three objects exist, the Policy will report a compliant status. Compliance is only an indicator that the objects have been created as specified, not that the Operator has successfully installed and is running.
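
For reference, here is a minimal sketch of the install Policy's ConfigurationPolicy template for OpenShift GitOps. The channel and catalog source values are assumptions; adjust them for your environment. Because openshift-operators already provides a cluster-wide OperatorGroup, only the Subscription is needed in this case; an Operator installed into its own namespace would also need the Namespace and OperatorGroup objects. The acm-policy label is included so the health check shown later can match the Subscription.

apiVersion: policy.open-cluster-management.io/v1
kind: ConfigurationPolicy
metadata:
 name: gitops-operator-install
spec:
 remediationAction: enforce
 severity: low
 object-templates:
 - complianceType: musthave
   objectDefinition:
     apiVersion: operators.coreos.com/v1alpha1
     kind: Subscription
     metadata:
       labels:
         acm-policy: gitops-operator
       name: openshift-gitops-operator
       namespace: openshift-operators
     spec:
       channel: latest
       installPlanApproval: Automatic
       name: openshift-gitops-operator
       source: redhat-operators
       sourceNamespace: openshift-marketplace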

RHACM Policies can also run in an "inform" mode, in which you can extend the Policy to validate the state of objects in the cluster. This additional functionality opens up a very powerful set of tools, setting RHACM apart from other GitOps cluster management tooling. As a cluster manager, you can ensure all components are healthy across your entire fleet of clusters by viewing the status in RHACM.

I'll review how to implement this when installing an Operator like OpenShift GitOps. In addition to the Policy to enforce creating the Subscription, you can add a Policy to verify the health of the Operator. The example below will validate the health of the Subscription, the Operator Deployment, and the ArgoCD instance itself.

apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
 name: gitops-operator-health
 namespace: bry-tam-policies
spec:
 disabled: false
 policy-templates:
 - objectDefinition:
    apiVersion: policy.open-cluster-management.io/v1
    kind: ConfigurationPolicy
    metadata:
      name: gitops-operator-health
    spec:
      object-templates:
      - complianceType: musthave
        objectDefinition:
          apiVersion: operators.coreos.com/v1alpha1
          kind: Subscription
          metadata:
            labels:
              acm-policy: gitops-operator
            namespace: openshift-operators
          status:
            state: AtLatestKnown
      - complianceType: musthave
        objectDefinition:
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            labels:
              olm.owner: '{{ (lookup "operators.coreos.com/v1alpha1" "Subscription" "openshift-operators" "openshift-gitops-operator").status.currentCSV }}'
            namespace: openshift-operators
          status:
            availableReplicas: 1
            conditions:
            - status: "True"
              type: Available
            readyReplicas: 1
            replicas: 1
            updatedReplicas: 1
      - complianceType: musthave
        objectDefinition:
          apiVersion: argoproj.io/v1alpha1
          kind: ArgoCD
          metadata:
            namespace: openshift-gitops
          status:
            applicationController: Running
            applicationSetController: Running
            dex: Running
            notificationsController: Running
            phase: Available
            redis: Running
            repo: Running
            server: Running
            ssoConfig: Success
      remediationAction: inform
      severity: high


With this Policy, you can determine not only that the objects required to install the Operator were created, but also that the Operator installed and is running successfully. When combined with Policy Dependencies, you can confirm the Operator requirements are met before creating CustomResources.
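
As a sketch of that pattern, the Policy below declares a dependency on the gitops-operator-health Policy above and will not apply its templates until that Policy reports Compliant. The AppProject it creates is only a placeholder CustomResource for illustration.

apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
 name: gitops-custom-resources
 namespace: bry-tam-policies
spec:
 disabled: false
 dependencies:
 - apiVersion: policy.open-cluster-management.io/v1
   compliance: Compliant
   kind: Policy
   name: gitops-operator-health
   namespace: bry-tam-policies
 policy-templates:
 - objectDefinition:
     apiVersion: policy.open-cluster-management.io/v1
     kind: ConfigurationPolicy
     metadata:
       name: gitops-custom-resources
     spec:
       remediationAction: enforce
       severity: low
       object-templates:
       - complianceType: musthave
         objectDefinition:
           apiVersion: argoproj.io/v1alpha1
           kind: AppProject
           metadata:
             name: cluster-config
             namespace: openshift-gitops
           spec:
             sourceRepos:
             - '*'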

RHACM inform policies can identify other cluster issues, not just the health of day-2 configurations. Common cluster health problems, such as the one described in kcs-645901, can be identified in Policies, making cluster administrators aware of potential problems before users are impacted. The example below becomes non-compliant if an openshift-marketplace Job or InstallPlan contains the indicated status conditions. An upcoming addition to this series will look at how to use policies to correct issues such as this automatically.

---
kind: Job
apiVersion: batch/v1
metadata:
 namespace: openshift-marketplace
status:
 conditions:
  - type: Failed
    status: 'True'
    reason: DeadlineExceeded
    message: Job was active longer than specified deadline
 failed: 1

---
apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
metadata:
 generateName: install-
status:
 bundleLookups:
  - conditions:
      - reason: JobIncomplete
        status: 'True'
        type: BundleLookupPending
      - message: Job was active longer than specified deadline
        reason: DeadlineExceeded
        status: 'True'
        type: BundleLookupFailed
 conditions:
  - message: >-
      bundle unpacking failed. Reason: DeadlineExceeded, and Message: Job was active longer than specified deadline
    reason: InstallCheckFailed
    status: 'False'
    type: Installed
 phase: Failed


Note that the Job and InstallPlan definitions above are bare manifests rather than complete Policies; to apply them to your clusters, you must create a PolicyGenerator configuration.
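
A minimal sketch of that configuration follows, assuming the Job and InstallPlan snippets above are saved as job-failed.yaml and installplan-failed.yaml alongside the generator (the file and policy names are illustrative). The mustnothave complianceType makes the generated inform policy non-compliant whenever a matching object exists on the cluster.

apiVersion: policy.open-cluster-management.io/v1
kind: PolicyGenerator
metadata:
 name: marketplace-health
policyDefaults:
 namespace: bry-tam-policies
 remediationAction: inform
 severity: high
 complianceType: mustnothave
policies:
- name: marketplace-install-health
  manifests:
  - path: job-failed.yaml
  - path: installplan-failed.yaml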

Enabling new capabilities with object-templates-raw

A new capability was added to ConfigurationPolicies in RHACM 2.7.2 and 2.8: object-templates-raw. This new feature allows you to use if statements, assign values to variables, and make use of ranges.

All of the templating discussed to this point has returned a string or a single value. object-templates-raw supports advanced templating use cases by allowing a policy to generate the entire list of object templates as a YAML string.

The example from part one included setting the default value for the number of replicas on the IngressController based on the number of infra nodes found. However, it did not configure the nodeSelector or tolerations to support running on the infra nodes. Consider how using raw templates allows you to solve this fully.

apiVersion: policy.open-cluster-management.io/v1
kind: ConfigurationPolicy
metadata:
 name: ingressoperator-default
spec:
 remediationAction: enforce
 severity: low
 object-templates-raw: |
  - complianceType: musthave
    objectDefinition:
      apiVersion: operator.openshift.io/v1
      kind: IngressController
      metadata:
        name: default
        namespace: openshift-ingress-operator
      spec:
        httpEmptyRequestsPolicy: Respond
  {{- $infraCount := (len (lookup "v1" "Node" "" "" "node-role.kubernetes.io/infra").items) }}
  {{- if ne $infraCount 0 }}
        nodePlacement:
          nodeSelector: 
            matchLabels:
              node-role.kubernetes.io/infra: ""
          tolerations:
          - operator: Exists
            key: node-role.kubernetes.io/infra
  {{- end }}
        replicas: {{ ($infraCount | default 2) | toInt }}


When you apply the policy to the cluster, if there are zero infra nodes ($infraCount is 0), the whole spec.nodePlacement block will not be part of the IngressController configuration. Once infra nodes are added to the cluster, the policy is reevaluated and the configuration is updated.

Raw templating also allows you to create more advanced objects where some information processing must be completed before generating the objectDefinition. In the example below, I create a multiline string for the Thanos configuration using information from the OpenShift Data Foundation deployment on the cluster. The Thanos configuration is then processed and encoded for storage in the thanos.yaml key of the Secret generated by the policy.

apiVersion: policy.open-cluster-management.io/v1
kind: ConfigurationPolicy
metadata:
 name: thanos-secret
spec:
 remediationAction: enforce
 severity: high
 object-templates-raw: |
  {{- /* read the bucket data and noobaa endpoint access data */ -}}
  {{- $objBucket := (lookup "objectbucket.io/v1alpha1" "ObjectBucket" "" "obc-openshift-storage-obc-observability") }}
  {{- $awsAccess := (lookup "v1" "Secret" "openshift-storage" "noobaa-admin") }}
  {{- /* create the thanos config file as a template */ -}}
  {{- $thanosConfig := `
    type: s3
    config:
      bucket: %[1]s
      endpoint: %[2]s
      insecure: true
      access_key: %[3]s
      secret_key: %[4]s`
  }}
  {{- /* create the secret using the thanos configuration template created above. */ -}}
  - complianceType: mustonlyhave
    objectDefinition:
      apiVersion: v1
      kind: Secret
      metadata:
        name: thanos-object-storage
        namespace: open-cluster-management-observability
      type: Opaque
      data:
        thanos.yaml: {{ (printf $thanosConfig $objBucket.spec.endpoint.bucketName 
                                              $objBucket.spec.endpoint.bucketHost 
                                              ($awsAccess.data.AWS_ACCESS_KEY_ID | base64dec) 
                                              ($awsAccess.data.AWS_SECRET_ACCESS_KEY | base64dec)
                          ) | base64enc }}


Using range to generate objects in policies

The range function creates a loop over an array, slice, map, or channel. Use this feature to loop through a list of static values, the return of the lookup function, or parts of an object, such as the labels on a Deployment. Each iteration of the loop can be assigned to a variable using the form {{ range $myItem := $list }}{{ $myItem.property }}{{ end }}, or accessed through the dot (.) context variable using the form {{ range $list }}{{ .property }}{{ end }}. An optional else clause, as in {{ range $myItem := $list }}{{ $myItem.property }}{{ else }}empty list{{ end }}, executes when $list is empty.
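
As a brief sketch of the variable form, the following policy loops over every Namespace carrying a hypothetical team=platform label and stamps out a ConfigMap in each one:

apiVersion: policy.open-cluster-management.io/v1
kind: ConfigurationPolicy
metadata:
 name: team-namespace-defaults
spec:
 remediationAction: enforce
 severity: low
 object-templates-raw: |
  {{- range $ns := (lookup "v1" "Namespace" "" "" "team=platform").items }}
  - complianceType: musthave
    objectDefinition:
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: team-defaults
        namespace: {{ $ns.metadata.name }}
      data:
        owner: platform
  {{- end }}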

This approach can be useful for creating policies that would generate many objectDefinitions, as in the ConfigMap sketch above. The next example loops through all Pods in the portworx namespace and identifies failed pods whose names contain kvdb. Pods matching this condition are removed from the cluster.

apiVersion: policy.open-cluster-management.io/v1
kind: ConfigurationPolicy
metadata:
 name: portworx-failed-pod-cleaner
spec:
 remediationAction: enforce
 severity: low
 object-templates-raw: |
  {{- /* find Portworx pods in terminated state */ -}}
  {{- range $pp := (lookup "v1" "Pod" "portworx" "").items }}
    {{- /* if the pod is blocked because it is in node shutdown we should delete the pod */ -}}
    {{- if and (eq $pp.status.phase "Failed") 
               (contains "kvdb" $pp.metadata.name) }}
  - complianceType: mustnothave
    objectDefinition:
      apiVersion: v1
      kind: Pod
      metadata:
        name: {{ $pp.metadata.name }}
        namespace: {{ $pp.metadata.namespace }}
    {{- end }}
  {{- end }}


Expanding on the earlier example that checks the health of OpenShift GitOps, the policy below uses range to check every ArgoCD instance on a cluster, along with a range over a list of label selectors to validate each Deployment, verifying that all components are healthy and contain the expected number of replicas.

apiVersion: policy.open-cluster-management.io/v1
kind: ConfigurationPolicy
metadata:
 name: argocd-instance-status
spec:
 remediationAction: inform
 severity: high
 object-templates-raw: |
  ## Get all the ArgoCD instances we are checking health for  
  {{- range $argo := (lookup "argoproj.io/v1alpha1" "ArgoCD" "" "").items }}
    ## list all of the lookups for Argo deployments  
    {{- $selectors := list "app.kubernetes.io/name=argocd-applicationset-controller"
                          (printf "app.kubernetes.io/name=%s-dex-server" $argo.metadata.name)
                          (printf "app.kubernetes.io/name=%s-notifications-controller" $argo.metadata.name)
                          (printf "app.kubernetes.io/name=%s-redis" $argo.metadata.name)
                          (printf "app.kubernetes.io/name=%s-repo-server" $argo.metadata.name)
                          (printf "app.kubernetes.io/name=%s-server" $argo.metadata.name)
    }}
    
    ## ensure ArgoCD is reporting healthy 
  - complianceType: musthave
    objectDefinition:
      apiVersion: argoproj.io/v1alpha1
      kind: ArgoCD
      metadata:
        namespace: {{ $argo.metadata.namespace }}
      status:
        server: Running
        notificationsController: Running
        applicationController: Running
        applicationSetController: Running
        ssoConfig: Success
        repo: Running
        dex: Running
        phase: Available
        redis: Running

    ## ensure all deployments are healthy in each argo instance 
    {{- range $sel := $selectors }}
      {{- $dep := (lookup "apps/v1" "Deployment" $argo.metadata.namespace "" $sel).items }}
  - complianceType: musthave
    objectDefinition:
      kind: Deployment
      apiVersion: apps/v1
      metadata:
        namespace: {{ $argo.metadata.namespace }}
        labels:
          {{ $sel | replace "=" ": " }}
      status:
      {{- if gt (len $dep) 0 }}
        {{- $dp := (index $dep 0) }}
        replicas: {{ $dp.spec.replicas }}
        updatedReplicas: {{ $dp.spec.replicas }}
        readyReplicas: {{ $dp.spec.replicas }}
        availableReplicas: {{ $dp.spec.replicas }}
        conditions:
          - type: Available
            status: 'True'
      {{- end }}
    {{- end }}
    {{- end }}


Wrap up

In part one of this series, I outlined several template functions and examples to make your templates easier to read and maintain. In part two, I looked at validating cluster health with policies and using object-templates-raw to expand templates for more complex use cases.