Disaster can strike at the most unexpected time, and it can be anything from mistakenly deleted data to an entire cloud region, or a critical service within that region, going down. There are many commercial and "freemium" solutions designed to mitigate this danger.
In the event of a disaster, one question customers often ask is "How do I make sure business can resume as quickly as possible with the least amount of downtime and data loss?"
This article demonstrates a low-cost solution that uses static volume provisioning with Red Hat OpenShift Service on AWS (ROSA) and AWS EFS.
In the grand scheme of disaster recovery, replication of persistent volumes alone is not enough when you take a holistic view of all the other services that applications must interact with to fully function. For example, an application may need to communicate with a third-party vendor API, a data store service such as AWS RDS, an application running on a virtual machine, and so on. Most, if not all, critical dependencies must be taken into account when designing a regional disaster recovery (DR) plan.
This solution works best for workloads with storage performance requirements that can be satisfied with NFS.
This solution may be applied to any of the following scenarios:
- Data protection
- Application migration
- Replacing a cluster due to misconfiguration (Please contact Red Hat support before making a cluster replacement decision!)
- Data proximity: exposing read-only data to workloads deployed in other regions
- Regional failover due to a region or critical service within that region being down
- Protection from accidental data deletion, for example, when the entire storage service or a persistent volume is deleted by mistake
- Business continuity wargaming
If you need a refresher on the topic of disaster recovery in the cloud, read this article from AWS.
Security best practices
AWS EFS is a shared NFS storage service. It's imperative that security best practices, such as file permissions and access policies, are applied to the EFS instance so that application teams can access only their assigned directories and only team members with elevated privileges can access the entire EFS directory tree. For example, you might limit specific types of access to certain IP ranges, AWS Principals, Roles, and so on.
In the implementation section of this article, I use slightly relaxed policies for demonstration purposes. In a production environment, you should apply more restrictive policies while ensuring that data access by application teams is not hindered.
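As one example of network-level restriction, a minimal Ansible sketch that allows NFS traffic only from trusted ranges might look like the following. It assumes the amazon.aws collection is installed, and the variables (efs_vpc_id, worker_subnet_cidr, bastion_cidr) are illustrative placeholders, not part of the article's implementation:

```yaml
---
# Minimal sketch: a security group that permits NFS (TCP 2049) only from
# the OpenShift worker subnet and the bastion/CI host. Attach this group to
# the EFS mount targets. All variable names are illustrative assumptions.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Create a security group that restricts NFS access to trusted CIDRs
      amazon.aws.ec2_security_group:
        name: efs-primary-nfs
        description: NFS from OpenShift worker subnets and the bastion/CI host only
        vpc_id: "{{ efs_vpc_id }}"
        rules:
          - proto: tcp
            from_port: 2049
            to_port: 2049
            cidr_ip: "{{ worker_subnet_cidr }}"
          - proto: tcp
            from_port: 2049
            to_port: 2049
            cidr_ip: "{{ bastion_cidr }}"
        state: present
```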
Solution overview
I'll discuss the solution in two phases: before and after the disaster.
Phase I: pre-disaster (disaster readiness)
During the readiness phase, in conjunction with recovery point objective (RPO) and recovery time objective (RTO) targets, we develop the fallback plan that will be executed in the event of a disaster. Keep in mind that such a plan must be periodically tested to identify any cracks that may have been introduced over time by technology maturity and lifecycle changes.
In this phase, most, if not all, application deployments and network traffic are directed at the primary region.
At a high level, the process looks like this:
- Provision the primary OpenShift cluster in Region A.
- Provision the secondary OpenShift cluster in Region B. This step can wait until a disaster occurs if RPO and RTO allow for the time it takes to provision a new ROSA cluster, and apply day-2 configurations.
- 45 to 60 minutes for OpenShift self-managed and ROSA Classic
- 15 to 30 minutes for ROSA with Hosted Control Planes (HCP)
- Day 2 configurations should not take more than 15 minutes with GitOps
- Provision EFS-Primary in Region A. Apply appropriate SecurityGroup rules to allow NFS traffic from the OpenShift-Primary cluster and the bastion or CI/CD host.
- Provision EFS-Secondary in Region B. Apply appropriate SecurityGroup rules to allow NFS traffic from the OpenShift-Secondary cluster and the bastion or CI/CD host.
- Enable replication from EFS-Primary to EFS-Secondary.
- If you intend to enable dynamic volume provisioning as well, configure the AWS EFS CSI Driver Operator. I caution against doing this unless it's intended only for non-critical workloads, because this solution does not support dynamically provisioned volumes (those created through a StorageClass).
- Implement the automation process (Ansible, for example) for tenants (app teams) to request static persistent volumes on OpenShift-Primary.
The volume-create.yaml playbook works as follows:
- Take in required user inputs, including AWS credentials, Git credentials, efs_primary_hostname, business_unit, ocp_primary_cluster_name, application_name, pvc_name, pvc_size, namespace, and ocp_primary_login_command
- Validate user inputs for character lengths, OpenShift cluster-admin permission, and so on
- On EFS-Primary, create the volume directory tree as: /<prefix>/<business_unit>/<application_name>/<namespace>/<pvc_name>
- Using the predefined PV/PVC template, replace parameters such as <volume_name>, <volume_namespace>, <volume_nfs_server>, and <volume_nfs_path>, and save the generated manifest to a directory local to the repository (a sample generated manifest appears after this list).
- Apply the PV/PVC manifest to OpenShift-Primary. The namespace is created if it does not exist.
- Commit and push the PV/PVC manifest to a Git repository. The PV/PVC manifest file path is <playbook-dir>/PV-PVCs/primary/<business_unit>/<ocp_primary_cluster_name>/<application_name>/<namespace>/pv-pvc_<pvc_name>.yaml
- Wrap the volume-create.sh process into a proper CI pipeline with user inputs provided in the form of job parameters.
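Below is a minimal sketch of what a generated static PV/PVC pair might look like. The names, EFS hostname, and path are placeholders following the directory convention above, not the article's actual template output:

```yaml
# Illustrative output of the PV/PVC template; all names, the hostname, and the path are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-finance-app1-ns1-data
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  # Static provisioning: an empty storageClassName keeps dynamic provisioners out of the picture.
  storageClassName: ""
  nfs:
    server: fs-0123456789abcdef0.efs.us-east-1.amazonaws.com
    path: /finance/app1/ns1/data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: ns1
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 10Gi
  # Bind this claim explicitly to the pre-created PV.
  volumeName: pv-finance-app1-ns1-data
```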
- Implement the automation process (Ansible, for example) for restoring the static volumes on OpenShift-Secondary.
The volume-restore.yaml playbook works as follows:
- Take in required user inputs, including AWS credentials, Git credentials, efs_primary_hostname, efs_secondary_hostname, ocp_secondary_login_command.
- Stop the EFS replication and wait until EFS-Secondary is write-enabled.
- Recursively scan the PV-PVCs/primary/* directory and list all volume manifests used for OpenShift-Primary. For each persistent-volume manifest, replace the EFS-Primary hostname with that of EFS-Secondary (a minimal sketch of this substitution follows this list).
- Apply the secondary PV/PVC manifests on OpenShift-Secondary.
- Commit the secondary PV/PVC manifests to Git. The PV/PVC manifest file path is <playbook-dir>/PV-PVCs/secondary/<business_unit>/<cluster_name>/<application_name>/<namespace>/pv-pvc_<pvc_name>.yaml
- Wrap the volume-restore.sh process into a proper CI pipeline with user inputs provided as job parameters.
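The substitution step described above could look roughly like the following sketch. The directory layout and file naming mirror the article's conventions, while the variable names (efs_primary_hostname, efs_secondary_hostname) are placeholders:

```yaml
---
# Minimal sketch of the hostname swap during restore.
# The real playbook then applies the secondary manifests to OpenShift-Secondary
# and commits them to Git; only the find-and-rewrite step is shown here.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Find all primary PV/PVC manifests
      ansible.builtin.find:
        paths: "{{ playbook_dir }}/PV-PVCs/primary"
        patterns: "pv-pvc_*.yaml"
        recurse: true
      register: primary_manifests

    - name: Ensure the corresponding secondary directory exists
      ansible.builtin.file:
        path: "{{ item.path | dirname | replace('/primary/', '/secondary/') }}"
        state: directory
        mode: "0755"
      loop: "{{ primary_manifests.files }}"

    - name: Write a secondary copy with the EFS-Secondary hostname substituted
      ansible.builtin.copy:
        content: "{{ lookup('ansible.builtin.file', item.path) | replace(efs_primary_hostname, efs_secondary_hostname) }}"
        dest: "{{ item.path | replace('/primary/', '/secondary/') }}"
        mode: "0644"
      loop: "{{ primary_manifests.files }}"
```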
- To test, run the volume-create.sh pipeline to provision a few persistent volumes on OpenShift-Primary.
- Deploy a few stateful (with static volumes) applications on OpenShift-Primary.
- Verify pods are able to mount their respective persistent volumes at the specified directory and write some data to them (see the test pod sketch below).
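To verify a mount end to end, a throwaway pod such as the minimal sketch below can write a file to the volume and read it back. The namespace, claim name, and image are placeholders, not part of the article's implementation:

```yaml
# Minimal sketch of a verification pod; namespace and claimName are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: volume-test
  namespace: ns1
spec:
  restartPolicy: Never
  containers:
    - name: writer
      image: registry.access.redhat.com/ubi9/ubi-minimal
      # Write a timestamp to the mounted volume and echo it back.
      command: ["/bin/sh", "-c", "date > /data/dr-test.txt && cat /data/dr-test.txt"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data
```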
Phase II: post-disaster (disaster recovery)
The recovery phase is when the Secondary region takes over and becomes Primary, applications are redeployed (if they are not already), and network traffic is rerouted.
At a high level, the process looks like this:
- Provision OpenShift-Secondary in Region B, as needed.
- Integrate OpenShift-Secondary with the EFS-Secondary instance, as needed.
- Verify network connectivity.
- Dynamic volume provisioning can be enabled as well. However, dynamically provisioned volumes should be used only for non-critical workloads that do not require regional disaster recovery (a sample StorageClass appears after this list).
- Run the volume-restore.sh pipeline to restore static volumes onto OpenShift-Secondary. This process scans the <playbook-dir>/PV-PVCs/primary/<cluster_name>/* directory and creates a corresponding PV/PVC for each manifest found, then saves the resulting volume manifests in <playbook-dir>/PV-PVCs/secondary/<cluster_name>*, where cluster_name is the name of the primary cluster.
- Redeploy your applications on OpenShift-Secondary. For the deployment time to comply with DR RPO/RTO objectives, it's recommended that application container images be stored in an external image registry and that the continuous delivery (CD) part of the CI/CD flow be handled by GitOps tools such as ArgoCD.
- Reroute network traffic to the Secondary Region.
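As noted in the steps above, dynamic provisioning can coexist with this solution for non-critical workloads. A typical StorageClass for the AWS EFS CSI driver looks roughly like the sketch below; the file system ID and base path are placeholders:

```yaml
# Illustrative StorageClass for the AWS EFS CSI driver; fileSystemId and basePath are placeholders.
# Volumes created this way are NOT covered by the static-volume DR flow described in this article.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-dynamic
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap          # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0
  directoryPerms: "700"
  basePath: "/dynamic"
reclaimPolicy: Delete
volumeBindingMode: Immediate
```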
Implementation
If this solution seems useful to you, review the implementation steps that follow. These are the prerequisites:
- Basic understanding of NFS, AWS, OpenShift
- An AWS account
- Permission to create and replicate EFS services
- Two ROSA clusters, Primary and Secondary
- A bastion host with the following software packages: python3.11, python3.11-pip, ansible, aws-cli-v2, nfs-utils, nmap, unzip, openshift-cli
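Before running the playbooks, it helps to confirm that the bastion can reach the EFS mount targets over NFS. A minimal sketch, with efs_primary_hostname as a placeholder variable:

```yaml
---
# Minimal sketch: confirm that TCP 2049 (NFS) on EFS is reachable from the bastion.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Wait for the EFS mount target to accept NFS connections
      ansible.builtin.wait_for:
        host: "{{ efs_primary_hostname }}"
        port: 2049
        timeout: 10
```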
Conclusion
We've demonstrated, in the Kubernetes world, how one can achieve regional failover of application state using ROSA, AWS EFS, static volume provisioning, and Ansible automation. This approach can be taken even further by adding Event-Driven Ansible to the mix to remove or minimize the need for human intervention in the volume-create and volume-restore cycle. A similar approach could be applied in a Microsoft Azure environment with Azure Red Hat OpenShift (ARO) and Azure Files.