Most people think of Kubernetes and OpenShift as hosting “cloud native” applications, where cloud native refers to a system that expects failure and can compensate automatically, for example, using horizontal scaling. But does that apply to the infrastructure hosting OpenShift? What about OpenShift itself?
In this episode we are joined by Christian Hernandez, Technical Marketing Manager for Red Hat, to look at high availability for OpenShift, including what it expects from the infrastructure, what it’s capable of providing for applications, and some example scenarios.
As always, please see the list below for additional links to specific topics, questions, and supporting materials for the episode!
If you’re interested in more streaming content, please subscribe to the Red Hat livestreaming calendar to see the upcoming episode topics and to receive any schedule changes. If you have questions or topic suggestions for the Ask an OpenShift Admin Office Hour, please contact us via Discord, Twitter, or come join us live, Wednesdays at 11am EDT / 1500 UTC, on YouTube and Twitch.
Episode 38 recorded stream:
Use this link to jump directly to where we start talking about today’s topic.
This week’s top of mind topics:
- Late last week Red Hat and Nutanix jointly announced a strategic partnership and support for OpenShift on Nutanix AOS, including a certified CSI provisioner! This is just the first step toward much more; be sure to watch for more information in the future!
- We talked a bit about how the National Security Agency (NSA) recently released a hardening guide for Kubernetes, including how OpenShift already meets many of those guidelines out of the box. You can find more information about Red Hat’s perspective on the NSA hardening guide in the blog post here.
- The last top of mind topic this week was around namespaces, in particular, when should you create additional namespaces and what purpose do they serve? You can listen to answers from all of us here in the stream.
Questions answered and topics discussed during the stream:
- Can I run dev and prod in the same cloud? Yes, just be cognizant of what you’re protecting against and your failure domains. For example, is it ok for the applications to go down if the entire cloud provider is down? Or, do you need a multi-cloud strategy?
- One of the first steps to putting an effective high availability - and disaster recovery - strategy in place is to identify which risks you’re mitigating. Without a clear understanding of the goals, you can accidentally omit an important scenario that you wanted to protect against.
- What is HA and how is it different than disaster recovery (DR)? HA is usually targeted at keeping an application running rather than recovering it after going down.
- Do you consider a failed pod to be an HA event? Yes, of course, because we want to ensure that the application continues to run when some of its capacity is lost.
- As hyperscaler deployments have become more widespread, applications have evolved to provide their own high availability. The important aspect is to strike a balance between what’s provided by the infrastructure and the application.
- How can I restore a cluster from just an etcd backup? It’s possible, but complicated because the restored etcd won’t be aware that it’s actually a new cluster.
- We talk about two KCS articles, one providing recommended practices for high availability and another with guidance for multi-site clusters.
- With two sites, how can I achieve HA with OpenShift? OpenShift 4 requires three control plane nodes, which means that with only two sites, one of them will always have a majority of nodes. If that site fails, then you’re in a DR scenario, so HA with two sites is, at best, complex.
- One option is to use two OpenShift clusters with a global load balancer. We talked about this in depth during the stream, including how this is the recommended method from other vendors as well.
- Scheduling hints, such as (anti)affinity, node selectors, and scheduler profiles can all affect high availability.
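As an illustration of one such scheduling hint, here is a sketch of pod anti-affinity that spreads an application's replicas across nodes; the names, labels, and image are hypothetical placeholders, not values from the stream:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical application name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      affinity:
        podAntiAffinity:
          # Prefer scheduling each replica on a different node so a
          # single node failure does not take down every copy.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: example-app
              topologyKey: kubernetes.io/hostname
      containers:
      - name: example-app
        image: registry.example.com/example-app:latest  # placeholder image
```

Using `preferredDuringScheduling` rather than `requiredDuringScheduling` keeps the constraint soft, so the scheduler can still place all replicas if fewer nodes are available than replicas.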
- “Kubernetes doesn’t fix a broken application design”. Simply using Kubernetes to deploy and manage a containerized application doesn’t make it magical. Operations teams need to work closely with applications teams to provide the right resources - and capabilities - at the right layer.
- We reinforce this during the stream by talking about how containers can still be used to make the application deployment and configuration process easier, even without Kubernetes. It makes sense to deploy the application to the infrastructure that provides the right services, like HA, for its needs rather than blindly prioritizing Kubernetes over everything.
- Did you know that, by default, it takes 5 minutes for Kubernetes to reschedule workloads from a failed node? Node and pod health checks, including liveness probes, are a critical component of ensuring that the application remains available, even when non-obvious failures happen.
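As a sketch of both points above, the pod spec fragment below adds a liveness probe and shortens the default 5-minute (300-second) eviction delay for unreachable nodes; the endpoint, port, and timings are illustrative assumptions, not tuned recommendations:

```yaml
spec:
  # Override the default 300s tolerationSeconds so workloads on an
  # unreachable node are rescheduled after 60 seconds instead of 5 minutes.
  tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60
  containers:
  - name: example-app                    # hypothetical container
    image: registry.example.com/example-app:latest  # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz                   # assumed health endpoint
        port: 8080
      initialDelaySeconds: 10            # allow the app time to start
      periodSeconds: 15                  # probe every 15 seconds
      failureThreshold: 3                # restart after 3 consecutive failures
```

This catches the non-obvious failures mentioned above: a process that is running but no longer serving traffic fails the probe and is restarted, while the toleration handles whole-node failures faster than the default.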
- It’s important to work with the application team. The infrastructure cannot be the only way that the application achieves HA with Kubernetes. Both teams have to cooperate!
- When using two clusters deployed in two locations, are there any recommendations or guidelines for keeping the application data in sync? You can use either infrastructure level replication, i.e. asynchronous storage replication, or application level replication, e.g. CockroachDB, which is a cloud-native, distributed SQL database.