Brushing Off Failure In Distributed Databases
About the episode
Ever been so frustrated with the options available that you build your own? Ben Darnell, Chief Architect and Co-Founder of Cockroach Labs, shares how his dissatisfaction with distributed databases led to the creation of CockroachDB. To build a distributed database that not only plans for but expects failures, they needed to implement the Raft consensus algorithm. Getting it up and running was a tough technical challenge. But the result was an incredibly resilient database.
Find out why Netflix uses CockroachDB for their databases.
Can you have access to a globally available database at the speed of a regional one? Check out how Cockroach Labs accomplishes this with global tables.
00:00 — Burr Sutter
I have been thinking about so many different software projects and software systems that I have built throughout my lifetime and throughout my career.
As you can imagine, when you engage in those pieces of software, you had the ideas down on paper, you had what was in your head. But at some point you had to put hands on keyboard, you had to put key and keystrokes in and code into the system. And of course, the actual software had to come to life.
Most importantly, that software had to make it to production and become a living, breathing thing that users engaged with, where you could get real feedback.
But all along that journey, things fundamentally changed, from the moment you had it as an idea in your head, to software actually built and compiled and ready to run, and of course, software and a system that users actually saw.
I've been thinking about that as I talk to Ben Darnell, co-founder and CTO of Cockroach Labs, the distributed database company. I find that name, Cockroach Labs, so very interesting.
00:57 — Ben Darnell
The name's been really good. It's a really distinctive, memorable name, unlike some of our peers in the database space. Once you hear CockroachDB, you'll always remember us. But when you get outside of tech circles, it can be a little more of a liability.
I was applying for a mortgage a couple of years ago and got asked whether I owned an exterminator company.
01:16 — Burr Sutter
I can imagine you out there with a backpack on and a little spray can, making sure that you fix all the bugs in someone's house.
This is Code Comments, an original podcast from Red Hat. I'm Burr Sutter, and here again is Ben Darnell.
01:36 — Ben Darnell
So the story actually starts about 10 years ago at another startup called Viewfinder. We were building a mobile photo sharing app. I was with Spencer Kimball and Peter Mattis, who would go on to be my co-founders at Cockroach Labs.
We'd all spent a lot of time at Google, and this was one of our first experiences building an app from scratch, outside of Google.
So we were looking around for databases to use. We were really disappointed in everything that we saw. We had gotten used to Google's in-house technologies, like GFS and Bigtable and all these massively scalable and highly available systems.
We were looking at what our options were out in the real world, and nothing really satisfied us. We were looking at things like sharding MySQL or Cassandra, or HBase, or DynamoDB.
We saw all these things as having a lot of compromises for our use case, in comparison to some of the things that we'd used before.
So Spencer, my co-founder, actually wrote the first version of the CockroachDB design doc while we were at Viewfinder. Peter and I had to talk him down because we couldn't build a world class database and the next big photo sharing app at the same time.
02:44 — Burr Sutter
You obviously discovered while at Google there were these limitations in the actual database server that you were using there. But at the same time, when you came out of Google, I don't know if you landed directly at the startup at that point in time, but you certainly were finding limitations with the open source ecosystem as it stood.
Meaning, you looked at MySQL, you looked at Cassandra, you looked at PostgreSQL, et cetera, and you found limits there. Can you tell us more about that journey from Google all the way into that startup, and where you were looking at the open source ecosystem around databases?
03:13 — Ben Darnell
Sure. I actually went through a number of startups in between Google and this photo sharing startup. I was at FriendFeed right before they got acquired by Facebook and a startup called Brizzly, which got acquired by AOL. Then I had been at Dropbox for a couple of years.
All of these startups, they all had something in common. They were all using sharded MySQL as their primary data store. So I got to see, really firsthand how painful that could be at large scale.
I knew that that was not something that I was eager to sign up for again, because we saw that as the system grows, all these different shards, they all fail independently.
But when you get up to Dropbox's scale, for example, there were database shards failing on a weekly basis and needing manual intervention to clean that up.
Whereas at Google, with Bigtable and systems like that, we just didn't have to worry about individual failures in the same way.
One thing that really defines the CockroachDB approach is that at scale, failures become very frequent. You can have small partial failures on a daily or weekly basis.
It's important to make these non-events. That means that, instead of having a big process for failing over from a primary database to a secondary, you have an automated process that can move a small chunk of data over from its primary to its secondary, transparently and without losing any data.
This is a big difference with a lot of traditional databases, is that failover comes with a risk of losing the last little bit of data whenever you flip the switch.
But in CockroachDB, everything is replicated consistently, so that the failure doesn't lose any data. Therefore, you can have this failover process be much more automated and faster and less disruptive because you know it's not going to lose any of your data.
04:51 — Burr Sutter
One of the things that's fundamental to the architecture is that you guys have a distributed system. That means there could be network partitions. There could be network failure in between any of the nodes. There could be a node failure. Meaning, one of the actual database engines just goes away. The virtual machine it's in or the actual hardware it's on, just dies.
Can you talk more about how that consensus is reached? What algorithm are you using within this context?
05:15 — Ben Darnell
The whole system is built on a distributed consensus algorithm to bring all the nodes in the cluster into sync. The specific algorithm that we use is called Raft.
It was very recent when we first started using it. It was just a couple of years old at that point. The classic distributed consensus algorithm is called Paxos. It's notoriously complex and hard to understand.
Raft came along in the last decade and promised to be a much easier to understand and implement consensus algorithm. So, that's what we chose to use in CockroachDB.
The essential idea in all of these distributed consensus algorithms is they're all more or less built around the idea of running a lot of little elections.
You typically have a leader. In CockroachDB, we have a leader for each range of data. The data's divided up into many chunks, called ranges. Each of those has its own leader, so that this leadership responsibility can be spread across all of the nodes in the cluster.
Whenever you want to write to the database, that essentially gets put to a vote of all the replicas of that piece of data, and you need a majority of those to come back.
So, that's what gives you the ability to survive failures. You don't need all of the nodes to respond. You just need two out of three or three out of five.
So, you're able to tolerate one or two node failures at any given time. And then you have enough nodes that survive that have the data, that are able to then replace any copies that get lost.
This is the key mechanism that we use to ensure that your data is always safe, no matter what happens to the underlying machines and hardware.
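The majority rule Ben describes can be sketched in a few lines of Go. This is only an illustration of the arithmetic, not CockroachDB's code; the function names `quorum` and `tolerated` are made up for this example.

```go
package main

import "fmt"

// quorum returns the smallest number of replicas that forms a majority.
func quorum(replicas int) int {
	return replicas/2 + 1
}

// tolerated returns how many replicas can fail while a majority survives.
func tolerated(replicas int) int {
	return replicas - quorum(replicas)
}

func main() {
	for _, n := range []int{3, 5} {
		fmt.Printf("%d replicas: need %d votes, tolerate %d failure(s)\n",
			n, quorum(n), tolerated(n))
	}
}
```

With three replicas the write needs two votes and survives one failure; with five it needs three votes and survives two.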
06:49 — Burr Sutter
So what Ben is telling us is that when you actually have distributed work, a distributed piece of software around multiple machines, especially with a network in between, things can go badly, things can break.
What you want to make sure is that your consensus algorithm, your consensus protocol is verifying that enough of the machines acknowledge that they have written the data properly. If you've done that, you can feel very secure and very safe that your software and your data has been persisted correctly.
Conceptualizing an algorithm on paper, certainly designing it out is one thing. But actually building it into a piece of software where you have to put hands on keyboard and create the code is a whole different thing altogether.
Ben told me about his actual experience, where he actually started laying down the keystrokes and building out that code around the consensus algorithm and how that fit into the overall perspective of what they needed to build for Cockroach Labs.
07:44 — Ben Darnell
I started working on Cockroach in earnest, actually at a hack week, when we were working at Square, and Cockroach was just an open source project at this time.
I had this hack week blocked off. The first code that I wrote for CockroachDB was actually an implementation of Raft. In that week, I was able to implement, I would say probably the first 80% of the Raft algorithm.
Then it turns out that we'd spend upwards of the next year working on what I'd say is the last 20%. So getting started, we were able to implement the happy path, where you have a fixed set of nodes. There's no failures or there's certain types of failures, and everything is proceeding fairly normally and everything's working and certain types of failures are tolerated.
But then where it really gets tricky is when you have to deal with changing the membership of the cluster.
The simple implementation of a distributed consensus algorithm always starts just talking about the membership of the cluster being a fixed thing supplied from the outside, but if you want to run this at scale, you've got to have nodes entering and leaving the cluster in a fairly automated way. So, you need to be able to safely add and remove nodes.
This is where a lot of the subtlety comes in. You need to make sure that the nodes that are being removed from service are properly removed, and they're able to sync up with all their peers before they go, so that you don't leave the cluster in a fragile state.
You need to introduce new nodes in a way that everyone agrees on who gets a vote, essentially, in those elections.
You've got to make sure that the membership roster of the cluster is very consistently maintained.
On paper, it tends to be a small fraction of what the protocol looks like. There's maybe a page or two out of a 15-page paper on Raft, devoted to this topic, but in terms of implementation complexity, it's much hairier than it looks.
09:33 — Burr Sutter
I do appreciate the point you made there, about the happy path and the fact that something as simple as how many nodes there are to begin with and how nodes maybe join the cluster or leave the cluster over time. That represents a much harder, a much more challenging set of algorithms, if you will, and things to figure out as you pull all those elements together.
09:51 — Ben Darnell
Yeah. This next 20% took us the better part of a year. But then in the course of that year, we ended up connecting with the team working on etcd at CoreOS.
They had one of the most mature Go language implementations of Raft at the time, and they were kind of going through the same thing.
They had implemented the first 80 plus percent of the protocol, and they were working through all of the final edge cases. So, they were realizing how much work was still ahead of them and us and how much overlap there was. So, they suggested that we join forces and share a single implementation.
10:26 — Burr Sutter
You mentioned that you were doing this over at Square. You were able to contribute that code to an open source project, it sounds like, at the time. But there were other maturing implementations for the Raft protocol in Go, written in the Golang programming language, that was etcd, and also from HashiCorp. Is that correct?
10:43 — Ben Darnell
Yeah. When we were starting the CockroachDB project, well, there were a lot of implementations of Raft in Go at the time, actually. Because of the timing of the Raft paper and the introduction of the Go programming language, it turns out that a lot of people have used Raft as a starter project to learn the Go language, which is an interesting trend that we saw in whatever year this was, about 2014.
So before I started implementing Raft from scratch by myself, I looked at the two major competing implementations from HashiCorp and CoreOS. They weren't quite suitable for us.
One of the things that's unusual about CockroachDB and the way that we use distributed consensus algorithms is that we run a lot of different instances of the algorithm.
So in CockroachDB, we divide the data up into many thousands of independent ranges, each of which is a separate instantiation of the Raft protocol. Whereas in something like etcd or Consul, you really just have one instance of Raft for the entire system.
So, we're very concerned about the scalability of the protocol implementation. We saw that both etcd and Consul's Raft implementations would do things like start up goroutines in the background for the protocol.
This kind of background processing, when you multiply it by a thousand or ten thousand different instances of the algorithm running in parallel, this can really be a drain on the node's resources. So, we needed to be able to run many instances of Raft without the overhead of a goroutine per instance.
That was the original reason why I started implementing the protocol from scratch, even though there were these existing high quality implementations out there.
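One way to picture running many Raft instances without a goroutine per instance is to have the caller drive every instance from a single loop. The sketch below assumes each instance exposes a `tick` method; the `group` type and `tickAll` function are hypothetical names for illustration and are not taken from CockroachDB, etcd, or Consul.

```go
package main

import "fmt"

// group is a hypothetical stand-in for one Raft instance (one range).
type group struct {
	id    int
	ticks int
}

// tick advances the group's logical clock by one step. A real
// implementation would check election and heartbeat timeouts here.
func (g *group) tick() { g.ticks++ }

// tickAll drives every group from the caller's goroutine, so the number
// of goroutines stays constant no matter how many groups exist.
func tickAll(groups []*group) {
	for _, g := range groups {
		g.tick()
	}
}

func main() {
	groups := make([]*group, 10000)
	for i := range groups {
		groups[i] = &group{id: i}
	}
	for round := 0; round < 3; round++ {
		tickAll(groups)
	}
	fmt.Println(len(groups), "groups,", groups[0].ticks, "ticks each, driven by one goroutine")
}
```

The design choice here is that the instance holds only state and the caller owns scheduling, so ten thousand instances cost ten thousand small structs rather than ten thousand background goroutines.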
12:13 — Burr Sutter
Well, one thing I really love about the story you're telling us here is that it started with a proprietary solution to a hard, hard problem, in this case, a distributed computing, a distributed database.
Then as you moved out of that organization into more startups, you then found that this did not exist in this world. What's amazing about this from an open source perspective, is you as a software engineer, had that itch you wanted to scratch. You had that problem you wanted to solve.
You did not see a perfect solution in the space, even if there were other open source projects that were already implementing the Raft protocol in Go, in the Go programming language, like the folks over at etcd and HashiCorp.
Then you found that collaborative opportunity with some of these other players in the market to actually really build out that element of your application architecture.
I would like to double-click and drill down just for a moment, on that multi-Raft idea, the multi-Raft protocol, because you've mentioned that the current open source implementations out there were single Raft implementations. They had those background goroutines. They were somewhat problematic, at least at great scale.
Can you describe for, let's say our average listener... who's only perhaps just worked with a relational database before and not necessarily thought too much about sharding per se or the fact that there's these different indexing solutions, what does that really look like from the user's perspective, that multi-Raft protocol, that ability to basically use Raft around, I assume, what is a group of rows in the database?
13:33 — Ben Darnell
Well, for one thing, to an end user of CockroachDB, it doesn't really look like much because it's all happening under the hood. It's all basically invisible.
But in terms of what we mean when we say multi-Raft, this is about the fact that one server in your CockroachDB cluster may be responsible for 10, 20, 30,000 ranges.
So, you're running 30,000 instances of the Raft protocol in a 10-node cluster, for example. You're going to have a lot of overlap in the membership of those Raft instances.
One of the things that Raft has to do, is it has to send heartbeats between all the different peers of the node. So, you have these heartbeats going back and forth with these nodes just essentially asking each other, are you alive and able to respond?
These messages are flying back and forth, but in the CockroachDB case, you don't want to have 30,000 heartbeat messages flying back and forth every second. That just wastes all your computing resources.
You want to be able to say, "I see that I've got 10,000 Raft groups, but I've only got nine actual peers. So, I want to be able to send nine heartbeats instead of 30,000."
That's the kind of thing that we do in multi-Raft, is we recognize the higher level structure and try to coalesce and consolidate a lot of this background activity across the many different independent Raft groups that exist under the hood.
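The coalescing idea can be sketched by counting heartbeats two ways: once per Raft group and peer, and once per distinct peer node. This is a simplified model, not CockroachDB's implementation; the function names, the node names, and the assumption that each group has two remote replicas are all invented for the example.

```go
package main

import "fmt"

// naiveHeartbeats counts one heartbeat per (group, peer) pair.
func naiveHeartbeats(groupPeers map[int][]string) int {
	total := 0
	for _, peers := range groupPeers {
		total += len(peers)
	}
	return total
}

// coalescedHeartbeats counts one heartbeat per distinct peer node,
// regardless of how many groups are shared with that node.
func coalescedHeartbeats(groupPeers map[int][]string) int {
	seen := make(map[string]bool)
	for _, peers := range groupPeers {
		for _, p := range peers {
			seen[p] = true
		}
	}
	return len(seen)
}

func main() {
	// Nine hypothetical remote peers, as in a 10-node cluster.
	peers := make([]string, 9)
	for i := range peers {
		peers[i] = fmt.Sprintf("n%d", i+2)
	}
	// 30,000 groups, each assumed to have two remote replicas.
	groupPeers := make(map[int][]string)
	for g := 0; g < 30000; g++ {
		groupPeers[g] = []string{peers[g%9], peers[(g+1)%9]}
	}
	fmt.Println("per-group heartbeats:", naiveHeartbeats(groupPeers))
	fmt.Println("coalesced heartbeats:", coalescedHeartbeats(groupPeers))
}
```

Under these assumptions the per-group count is 60,000 per heartbeat interval, while the coalesced count is 9, one per peer.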
14:48 — Burr Sutter
I can see how this is incredibly valuable as the number of nodes grows, as well as the volume of data in that database grows. You want to make sure that it's not so chatty across the wire and there's not so much interaction, as you mentioned, just the heartbeat going back and forth.
That does add a ton of overhead, both at the CPU level as well as the network chattiness, for these things to simply say, "Hey, I'm still alive. I'm still okay. We don't need to elect another leader at this point in time."
Can you tell us more about that last 20%? If we had 80% done in a single week, and you were working with these new partners, what did it take to actually get us the rest of the way? What kind of issues did you bump into? What kind of problems did you solve?
15:25 — Ben Darnell
Yeah. The last 20% is all about chasing down all the corner cases and edge cases. In the Raft protocol, as we implemented it, it turned out to be very safe, in the sense that we never had a lot of problems with the protocol doing a bad thing. But we did have problems with liveness, the way that under adverse conditions, it would fail to do anything at all.
Trying to maximize that availability was really the long tail of work to get to a hundred percent done.
Of course, you can never make it perfectly available. This is one of the things we learned from the CAP theorem. Raft is ultimately not an available protocol in the CAP theorem's terminology. It's a consistent protocol. So, it does have times where it can't make progress, but trying to get as close to that ideal as we can in practical, real world situations.
16:18 — Burr Sutter
Well, you mentioned CAP theorem there. Certainly, a lot of people have definitely heard about that. There's always the concept of, well, you get two out of three in that scenario, which is, we know, not exactly accurate, but the concept of consistency versus availability when facing a network partition.
So, the network partition being a key aspect of that, the failure between the different nodes.
So in that case, you mentioned that availability versus consistency. It sounds like CockroachDB focuses on the consistency side and Raft algorithm, Raft protocol definitely had you focus on that.
16:47 — Ben Darnell
That's right. The CAP theorem says that in the face of a partition... We're a distributed system. We can't choose to be partition intolerant because partitions will always happen. So, that means we have to choose... when there is a partition, we have to choose whether we're going to keep consistency or availability.
So when push comes to shove, when we have to choose one or the other, we always choose consistency. But within that framework, the CAP theorem's trade-offs only apply in a fairly narrow window of situations.
So, the vast majority of the time when we're doing things to improve the availability of the system, we're not anywhere near the limits where the CAP theorem starts to apply.
17:27 — Burr Sutter
Getting to 1.0 is always an amazing moment. Your software, your creation now has to face real end users, and getting your very first customer for a database is not so easy.
It is a non-trivial thing that you're asking people to do. As you know, data is one of the most important assets in your organization.
Ben told me a story about one of their biggest and earliest customers having a failure that resulted in data loss and how his team rallied to address it.
17:52 — Ben Darnell
There were actually two different versions of the Raft paper. The first version described a very complicated membership change protocol called joint consensus, which was very robust, but was very complex to implement.
Then there was a second paper on Raft, which introduced a major simplification in this area. It said that you didn't need the joint consensus protocol as long as you were only ever adding or removing one node at a time. We said, "That's great. We'll just implement the simpler version."
The problem is that, if you want to move a replica instead of just adding or reducing capacity, that means you have to model it as an add followed by a remove or a remove followed by an add.
If you model it as a remove followed by an add, which happened to be the way that we implemented it at first, then you're in this very fragile state in between.
You originally have a three replica configuration, and you need two out of three to be able to make progress. If you go down to two replicas, then all of a sudden you can't handle any more failures. You need two out of two to be a majority, because if you just have one vote in a two member cluster, then that's a tie.
So, there's this window in between going down from three to two and then coming back up to three, where you're vulnerable. You can have one loss that can lead you to losing your quorum.
This sounds like it's a risk of going down from three to two before coming back up. It actually works out in the other way as well. If you go from three nodes, add the fourth node before removing one of the nodes, that can lead to problems, especially in a multi-region or multi-data center case because then you find yourself with four nodes instead of three. And if you have three data centers that you're working with, then two of those nodes have to be in the same data center, and they could suffer correlated failures.
That's something that we saw here, where one of these customer clusters experienced two failures at the same time because of a region or availability zone data center outage.
Then, because it happened at just the wrong time, when the cluster had temporarily up-replicated into this more fragile state, they lost quorum and that range was broken until we could get in to manually fix it.
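Both fragile intermediate states can be checked with a small model. The sketch below places replicas in zones and asks whether a majority survives the loss of one zone. It is a simplified illustration, assuming a whole zone fails at once; the zone names and the `survivesLoss` helper are hypothetical.

```go
package main

import "fmt"

// quorum returns the smallest number of replicas that forms a majority.
func quorum(n int) int { return n/2 + 1 }

// survivesLoss reports whether a majority of replicas remains after every
// replica in the failed zone is lost. Each element of placement is the
// zone that holds one replica.
func survivesLoss(placement []string, failed string) bool {
	alive := 0
	for _, zone := range placement {
		if zone != failed {
			alive++
		}
	}
	return alive >= quorum(len(placement))
}

func main() {
	steady := []string{"a", "b", "c"}        // normal three-replica state
	removeFirst := []string{"a", "b"}        // after remove, before add
	addFirst := []string{"a", "a", "b", "c"} // after add, before remove

	fmt.Println("steady, lose zone a:      ", survivesLoss(steady, "a"))
	fmt.Println("remove-first, lose zone a:", survivesLoss(removeFirst, "a"))
	fmt.Println("add-first, lose zone a:   ", survivesLoss(addFirst, "a"))
}
```

Only the steady state keeps its quorum. With two replicas any single loss is fatal, and with four replicas in three zones, losing the zone that holds two of them leaves two survivors where three are needed.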
In order to fix this, we did some quick fixes at first, to try and repair the immediate damage to get this user cluster back up and running.
The long-term fix was to go back to this feature of joint consensus that was featured in the original Raft paper, but was kind of deprecated and almost removed from the second edition because it wasn't seen as being practically necessary.
Again, coming back to multi-Raft and the way that we have the large number of Raft protocol instances undergoing constant up and down replications, we were much more vulnerable to this risk from doing the one replica at a time change.
So, we needed a protocol that could actually guarantee that it was going to be correct, even in the face of failures during the membership change process.
This ended up not getting fixed until, I think it was about two years after 1.0. It was just a very sizeable project. We needed to lay the groundwork over a long period of time to be able to make that transition.
But in the end, again, because we were working with the etcd team on this, the work that we were doing here was also able to benefit that project as well.
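The core rule of joint consensus, as described in the Raft paper, is that while a membership change is in flight, a decision needs separate majorities of both the old and the new configuration. The sketch below shows only that rule; it is not the etcd or CockroachDB implementation, and the names `hasMajority` and `jointCommit` and the node IDs are invented for illustration.

```go
package main

import "fmt"

// hasMajority reports whether the members of config that voted yes
// form a majority of config.
func hasMajority(config []string, yes map[string]bool) bool {
	count := 0
	for _, id := range config {
		if yes[id] {
			count++
		}
	}
	return count >= len(config)/2+1
}

// jointCommit applies the joint consensus rule: during a membership
// change, an entry commits only with a majority of the old configuration
// and a majority of the new configuration.
func jointCommit(oldCfg, newCfg []string, yes map[string]bool) bool {
	return hasMajority(oldCfg, yes) && hasMajority(newCfg, yes)
}

func main() {
	oldCfg := []string{"n1", "n2", "n3"}
	newCfg := []string{"n1", "n2", "n4"} // n3 is being replaced by n4

	both := map[string]bool{"n1": true, "n2": true}
	oldOnly := map[string]bool{"n2": true, "n3": true}

	fmt.Println("n1+n2 vote yes:", jointCommit(oldCfg, newCfg, both))
	fmt.Println("n2+n3 vote yes:", jointCommit(oldCfg, newCfg, oldOnly))
}
```

In this example the votes from n1 and n2 satisfy both configurations, while the votes from n2 and n3 form a majority of the old configuration only, so nothing commits. The replica swap happens as one change, with no intermediate two-replica or four-replica majority rule.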
21:23 — Burr Sutter
What I find interesting about this part of the story is that you guys actually saw that there was a great need for something that other people perceived as optional. Right?
You mentioned the atomic replica changes, the replication happening and having guarantees, if you will, as to how those replicas get started up, get shut down and making sure that they are happening, let's say in this case, across multiple availability zones.
I want to know more about that, because that suggests to me that other people had not gotten to the level of complexity, in terms of their implementation with the Raft protocol that you guys had. They had not pushed it enough in production to see some of these edge cases occur in a real production environment, with a real distributed database engine.
Why do you think you guys were the first ones to really stumble across that and realize optionality or being optional was not really optional?
22:09 — Ben Darnell
I think it comes down to the fact that when you're operating at scale, everything has to be automated.
There's an assumption in a lot of these papers. The reason that the Raft authors decided that you didn't really need the more complex protocol is because they were envisioning adding a replica to the system to be kind of a rare manual event.
It would be an operator somewhere, pushing a button. That operator would presumably know to look at the network status and not do it if there's any signs of trouble. So, it's only going to be executed under best case conditions.
So, it's just not considered a completely routine operation, like it has to be when you're operating at massive scale.
22:52 — Burr Sutter
Well, you mentioned that you guys discovered that you needed this capability by working with a real world customer who was actually having a real world struggle at that moment.
I'm kind of curious to know. What did your world look like at the moment you guys realized there was a real customer with a real problem or losing data? They had downtime. Of course, they wanted to get back up again and get back to processing again.
Was it a whole company, hair-on-fire, "we got to ring the fire alarm and get focused on it?" Can you describe the environment and how you guys reacted to that situation? Then of course, then discovered, this optional thing is really not that optional. We need to really understand how to make this better.
23:29 — Ben Darnell
Yeah. So obviously as a database and especially a transactional database, making the kinds of guarantees that we claim to offer for availability and consistency, obviously losing customer data is just the worst thing we can do. So, we want to throw all available resources at the problem whenever it happens.
This became the top priority. We were cranking out tools on the fly, to try and do a little surgery on the data, on disks, to try and bring the cluster back up as quickly as possible.
Then the long term fix happened at a less extreme pace, because we wouldn't be able to ship that feature before a lot of testing and our next major release anyway.
24:14 — Burr Sutter
I like how you mentioned you guys were engineering new tools, basically on the fly, to basically say, "Hey, execute this, execute that."
This doesn't sound like these were just simple bash shell scripts that you had to create for the customer. These were new Go algorithms, new Go routines that you guys had to create, and they had to execute those executables. Right?
24:30 — Ben Darnell
That's right. Yeah. This was new Go code that was written and being built into the CockroachDB binary. So, we were shipping them new releases of CockroachDB on the fly.
24:41 — Burr Sutter
Oh, well, there's an aspect of agile you don't hear about every day! We're patching our actual core engine and shipping the whole thing yet again. So, you're doing multiple releases within an hour, in this case, it sounds like, or a few hours. About how long was the window that you guys were working in this state?
24:55 — Ben Darnell
Fortunately, because of the way data is divided up in CockroachDB, this did not manifest as a total outage for the customer. The cluster was still 99 plus percent available. It was just that there were certain records that were unavailable. So, the customer was able to get by in this state.
It ended up taking us... I don't remember exactly. It was a couple of days, I think, to get it all done.
25:26 — Burr Sutter
Well, you bring up a great point there about the fact that not everything was down. I think that's something that I missed initially, and we probably should talk about a little bit more.
The name Cockroach suggests this thing is unkillable. Most of the application and the users' applications in the database were up and running. Everything was okay, except for a few isolated records. That's related to the way you guys actually distribute the data. Therefore, it was only this one subset that had problems.
That's incredibly powerful. Just that alone is a powerful story. Where if the customer's continuing on trucking, but they just had a problem in one area, that's a massive win for the customer as well as a testament, if you will, to the architecture and the solution you guys have created.
But I got to ask you this particular element, because I'm very curious: Normally in the case of software engineering, when we run into a failure, often we learn from that. We actually learn more from our failures than we do from our successes, in many cases.
So based on that concept, that you had to go through some learning, tell me how this plays out. What lessons did you guys learn? How did it impact your journey as a company and a corporation, as a team, as you guys continue to learn from that experience as well as others?
26:31 — Ben Darnell
Well, one lesson from this: it turns out that the issue around atomic replication changes was something that we had kind of made a note of early on.
We filed an issue in our issue tracker on GitHub, and then kind of forgot about it because it seemed like it was such a remote possibility.
One of the lessons from this is that, it really is important to go through your backlog of old issues because there may be real things hiding there.
If something doesn't feel right and it makes you think you need to file one issue about it, there's a very real chance that it'll come back to bite you later.
27:12 — Burr Sutter
Ben's real world experience shows that software in the hands of real end users might be used in ways not previously anticipated, and that a rare manual event, a remote possibility like adding a replica to the database, can turn into an edge case that matters. As we all know, software's never finished. The happy path is one thing, but it's those edge cases that often get us.
I really love hearing these stories of real world software deliveries, stories just like Ben's about founding Cockroach Labs, because we can all learn from them.
We want to thank Ben Darnell for coming on the show. Thanks also to Aika Zikibayeva, Joe Gomes, Michael Waite, John Gibson and Victoria Lawton.
This episode was produced by Brent Simoneaux and Caroline Creaghead. Our sound designer is Christian Prohom. Our audio team includes Leigh Day, Stephanie Wonderlick, Mike Esser, Johan Philippine, Kim Huang, Nick Burns, Aaron Williamson, Karen King, Jared Oats, Rachel Ertel, Devin Pope, Matias Faundez, Mike Compton, Ocean Matthews, Alex Traboulsi, and Victoria Lawton.
I'm Burr Sutter and this is Code Comments, an original podcast from Red Hat. Thank you so much for listening. I hope you'll join us again next time.
What we’re doing together
The partnership between Cockroach Labs and Red Hat allows customers to grow their business in a cloud-native infrastructure. As a distributed database, CockroachDB has the same shared-nothing architecture as Kubernetes, offering customers seamless deployment in OpenShift environments.