OpenStack provides scale and redundancy at the infrastructure layer, giving high availability to applications built to run in a horizontally scaling cloud computing environment. It was designed for applications that are “designed for failure”, and it deliberately excluded features that would enable traditional enterprise applications, for fear of limiting its scalability and compromising its initial goals. These traditional enterprise applications demand continuous operation and fast, automatic recovery in the event of an infrastructure-level failure. While an increasing number of enterprises look to OpenStack as the infrastructure platform for their forward-looking applications, they are also looking to simplify operations by consolidating their legacy application workloads on it.
As part of the On-Ramp to Enterprise OpenStack program, Red Hat, in collaboration with Intel, Cisco and Dell, has been working on delivering a high availability solution for such enterprise workloads running on top of OpenStack. This work provides an initial implementation of the instance high availability proposal that we put forward in the past and is included in the recently released Red Hat Enterprise Linux OpenStack Platform 7.
In putting forward this original proposal it was posited that there are three key capabilities to any solution endeavoring to provide workload high availability in a cloud or virtualization environment:
- A monitoring capability to detect when a given compute node has failed and trigger handling of the failure.
- A fencing capability to remove the relevant compute node from the environment.
- A recovery capability to orchestrate the rescuing of instances from the failed compute node.
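To make the relationship between these three capabilities concrete, here is a minimal Python sketch of a detect, fence, recover sequence. It is a toy model, not the code shipped in the product: the `ComputeNode` class and the `detect_failed`, `fence`, `recover` and `handle_failures` functions are all hypothetical names invented for illustration, and the round-robin placement is an assumption (the real scheduling decision belongs to Nova).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ComputeNode:
    """Hypothetical stand-in for a compute node and the instances on it."""
    name: str
    alive: bool = True
    powered_on: bool = True
    instances: List[str] = field(default_factory=list)


def detect_failed(nodes: List[ComputeNode]) -> List[ComputeNode]:
    """Monitoring: report the nodes that no longer respond."""
    return [node for node in nodes if not node.alive]


def fence(node: ComputeNode) -> None:
    """Fencing: remove the node from the environment (modelled as power-off)."""
    node.powered_on = False


def recover(failed: ComputeNode, targets: List[ComputeNode]) -> Dict[str, str]:
    """Recovery: move each instance from the failed node to a healthy one."""
    if not targets:
        raise RuntimeError("no healthy compute node available")
    placement = {}
    for index, instance in enumerate(failed.instances):
        target = targets[index % len(targets)]  # assumed round-robin placement
        target.instances.append(instance)
        placement[instance] = target.name
    failed.instances = []
    return placement


def handle_failures(nodes: List[ComputeNode]) -> Dict[str, str]:
    """Run the three steps in order; a node is always fenced before recovery."""
    healthy = [node for node in nodes if node.alive]
    placement = {}
    for node in detect_failed(nodes):
        fence(node)
        placement.update(recover(node, healthy))
    return placement
```

The ordering is the point of the sketch: instances are only rescued after the failed node has been fenced, so the same instance can never be running in two places at once.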
Rather than re-inventing the wheel inside the OpenStack projects themselves, it is possible to deploy and manage an OpenStack environment with these capabilities using traditional high availability tools such as Pacemaker, without compromising the scalability of the overall platform. This is the approach used to deliver instance-level high availability in RHEL OpenStack Platform 7. You can view a demonstration of the solution in action, as previously shown at Red Hat Summit in partnership with Dell and Intel, here:
http://www.youtube.com/watch?v=aJqgyP54Xgk
In this implementation, monitoring is performed using the NovaCompute Pacemaker resource agent, while fencing and recovery are handled by the fence_compute Pacemaker fence agent and the NovaEvacuate resource agent. These three new components were all co-engineered by the High Availability and OpenStack Compute teams at Red Hat and are provided in updated resource-agents and fence-agents packages for Red Hat Enterprise Linux 7.1.
Monitoring
In a traditional Pacemaker deployment, each node in a cluster runs the full stack of services for ensuring high availability, including Pacemaker and Corosync. The traditional HA setup, as delivered via the RHEL High Availability add-on, supports up to 16 nodes. In contrast, a typical OpenStack deployment has many hundreds, or even thousands, of compute nodes that need to be monitored. To close the scalability gap, the Red Hat HA team designed and developed pacemaker_remote from the ground up.
By using pacemaker_remote it is possible to continue adding compute nodes and connecting them to the Pacemaker cluster running on the OpenStack controller nodes without running into the 16-node limit, thus keeping all of the nodes in a single administrative domain. The compute nodes do not become full members of the cluster and do not need to run the full Pacemaker or Corosync stacks. Instead they run only pacemaker_remote and integrate with the cluster as remote nodes.
This eases the process of scaling out the compute cluster while still allowing us to provide some neat high availability functions, including monitoring compute nodes for failures and automating recovery of the virtual machines running on them when failures occur. To do this, the Pacemaker cluster running on the controller nodes monitors pacemaker_remoted on each compute node to confirm it is “alive”. In turn, on the compute node itself, pacemaker_remoted monitors the state of a number of services including the Neutron and Ceilometer agents, Libvirt, and of course the nova-compute service itself. If an issue is detected in one of these services, pacemaker_remote will endeavour to recover it independently. If this fails, however, or if pacemaker_remote stops responding entirely, fencing and recovery operations are triggered.
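The escalation logic described above (try to recover a service locally, and only fence when that fails or the remote daemon goes silent) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Pacemaker is implemented: `check_node`, the `restart` callback and the service names in `SERVICES` are hypothetical, chosen only to mirror the services named in the text.

```python
from typing import Callable, Dict

# Hypothetical service names, loosely following the services named in the text.
SERVICES = ["neutron-agent", "ceilometer-agent", "libvirtd", "nova-compute"]


def check_node(
    remote_alive: bool,
    service_ok: Dict[str, bool],
    restart: Callable[[str], bool],
) -> str:
    """Return 'healthy', 'recovered' or 'fence' for a single compute node."""
    if not remote_alive:
        # The remote daemon stopped responding, so nothing can be checked
        # or repaired locally: escalate straight to fencing.
        return "fence"
    outcome = "healthy"
    for name in SERVICES:
        if service_ok.get(name, False):
            continue
        if restart(name):
            outcome = "recovered"  # local recovery succeeded, keep checking
        else:
            return "fence"  # local recovery failed: escalate
    return outcome
```

For example, a node whose `libvirtd` check fails but restarts cleanly yields `"recovered"`, while the same node with a failing restart yields `"fence"`.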
Fencing
In the event that a compute node fails, Pacemaker powers it off using fence_ipmilan (other fencing mechanisms will be supported in the future). While the node is powering down, the fence_compute fence agent loops, waiting for Nova to also recognize that the failed host is down. This is necessary because OpenStack Compute (Nova) will not let an evacuation be initiated until it recognizes that the node being evacuated is down. In the near future, it will be possible for the fence agent to use the force-down API call (formerly referred to as “mark host down”), introduced in OpenStack “Liberty”, to proactively tell Nova that the node is down and speed up this part of the process.
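The waiting step is a plain polling loop with a timeout. The sketch below shows the general shape of such a loop in Python; it is not the fence_compute source. The `wait_until_down` function and its `get_state` callback are hypothetical: in a real agent `get_state` would query Nova for the compute service state, which is deliberately left out here, and the timeout and interval defaults are arbitrary.

```python
import time
from typing import Callable


def wait_until_down(
    host: str,
    get_state: Callable[[str], str],
    timeout: float = 120.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until get_state(host) returns 'down'; return False on timeout."""
    waited = 0.0
    while True:
        if get_state(host) == "down":
            return True  # Nova agrees the host is down: evacuation may start
        if waited >= timeout:
            return False  # give up and let the caller decide what to do
        sleep(interval)
        waited += interval
```

Passing `sleep` as a parameter is only a convenience for testing the loop without real delays. A force-down style call would, in effect, replace the polling with a single request that sets the state directly.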
Recovery
Once Nova has recognized that the node is down, in response to either the original failure or Pacemaker explicitly powering the node off, the fence agent initiates a call to nova host-evacuate, which triggers Nova to restart all of the virtual machines that were running on the failed compute node on a new one. In the future it may be desirable to have an image property or flavor extra specification that can be used to explicitly “opt in” to this functionality only for traditional application workloads that need it.
In this implementation we assume that impacted virtual machines are either using shared ephemeral storage, for example Ceph, or were booted from volumes. These characteristics make it possible to recover the instances, including their on-disk state, even when the host on which they were originally running has gone down permanently. An out-of-the-box RHEL OpenStack Platform 7 deployment uses Ceph for this purpose.
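The storage assumption amounts to a simple predicate on each instance. The Python sketch below expresses it; the `Instance` class, its fields, the `keeps_disk_state` function and the `SHARED_BACKENDS` set are hypothetical names used for illustration and do not correspond to Nova data structures.

```python
from dataclasses import dataclass

# Assumption: backends whose ephemeral disks are reachable from any host.
SHARED_BACKENDS = {"ceph"}


@dataclass
class Instance:
    """Hypothetical description of where an instance keeps its disk."""
    name: str
    boot_from_volume: bool
    ephemeral_backend: str  # for example "ceph" or "local"


def keeps_disk_state(instance: Instance) -> bool:
    """True if the on-disk state survives the permanent loss of the host."""
    return instance.boot_from_volume or instance.ephemeral_backend in SHARED_BACKENDS
```

An instance with a local ephemeral disk that was not booted from a volume fails this check: it could still be rebuilt elsewhere, but its on-disk state would not come with it.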
If pacemaker_remote is also successful in powering the node back on, the node will be returned to the pool of available compute resources when the Nova heartbeat process discovers its return to operation.
The combination of these monitoring, fencing, and recovery capabilities provides a solution that makes it easier than ever to migrate traditional, business-critical applications that require high availability to OpenStack.
Want to try it out for yourself? Sign up for an evaluation of Red Hat Enterprise Linux OpenStack Platform today! Existing users can find instructions on manually enabling high availability for their compute nodes in the Red Hat Knowledgebase. We would love to get more feedback on this feature as we work on integrating these capabilities and more into the RHEL OpenStack Platform director (based on the "TripleO" project) to provide full automation.
Want to learn more about moving instances around an OpenStack environment? Don’t know the difference between cold migration, live migration, and evacuation? Catch my presentation - “Dude, this isn’t where I parked my instance!?” - at OpenStack Summit Tokyo!
About the author
Steve Gordon is senior director of Product Management at Red Hat.