Legacies | The Legend Of Hadoop
In 2006, Hadoop hit the scene, and quickly became a media darling. Nearly two decades later, typing the term into a search engine will return questions about its continued relevance, or possible lack thereof.
Is Hadoop still important? Where is it most visible today? The Compiler team dives hard into the project, and how it forever changed the way we look at data.
00:01 — Kim Huang
Around 2010, I was touring the campus of an unnamed tech company when I heard a strange word for the first time—Hadoop. Everyone was talking about it and they were really excited. I didn't know what it was. Actually, it seemed like the entire tech industry was talking about Hadoop. There were blogs and articles, and then seemingly overnight, no one was. Everyone had moved on to the next thing. And while new innovation is great and all, it made me wonder: What happened to Hadoop?
00:45 — Angela Andrews
This is Compiler, an original podcast from Red Hat.
00:49 — Brent Simoneaux
I am Brent Simoneaux.
00:50 — Angela Andrews
And I'm Angela Andrews. We go beyond the buzzwords and jargon and simplify tech topics.
01:01 — Brent Simoneaux
Today we're exploring Hadoop's rise to prominence and where it stands today.
01:07 — Angela Andrews
Producer Kim Huang is here to get us started.
01:10 — Kim Huang
Listeners may not think of Apache Hadoop as legacy software, but it's been around for a while, since 2006. It's an open-source project under the Apache Software Foundation, and it was the brainchild of Doug Cutting and Mike Cafarella. Hadoop stands for a "high availability distributed object-oriented platform." And while there is an acronym, the name came from Doug's son's toy animal, which was an elephant, which is why the logo for Hadoop is an elephant. Despite the name, the Hadoop that people refer to is actually a software framework for data storage. And when we say data, we mean everything—large data sets, media files, text files, everything. (01:54): It helps me to think about Hadoop as a dam or a reservoir of water. So when you think about the water in a reservoir, you can maintain it, you can test it, you can manage it, you can access it when you need it, but it's on standby if you don't need it. (02:11): Hadoop was lauded as revolutionary, but after a while, it faded into the background. Today, the number of new users is shrinking, and it's looked upon as a dated technology. I wanted to dive deeper into Hadoop's origin. To do that, I brought back someone who is an old friend of the show, Sherard Griffin. Listeners might remember him from the episode on technical debt. Sherard is a Red Hatter, and he wrote the book on Hadoop. Well, not THE book, but A book on its popularity in the early 2000s.
02:46 — Sherard Griffin
When you think about Hadoop, it became really, really popular because it allowed you to collect all types of data. And for those of us who've been in the industry for a while, you'll remember the days of Facebook and the days of even before that, where it's like all this data is being collected, YouTube. We were told, "Just collect the data. You'll figure out what to do with it later. We need a system that can just collect every single piece of information about every single customer in our system, and we don't know what we're going to do, but by God—5, 10 years from now, we'll figure something out to do with the data."
03:23 — Kim Huang
Hadoop gave companies something they never had before: a method to store large data sets, incoming data from users and store it until they knew what to do with it. I've heard from a number of technologists about Hadoop, and the advanced capabilities of Hadoop, in their opinion, gave birth to what we understand to be data science today. So analysts could tap into large amounts of data and develop models for predictive analytics or for machine learning, and it all sounds really nice. But after a while, people started to see some challenges with Hadoop and its main selling point.
04:05 — Sherard Griffin
And so what happened, Hadoop was really popular because it allowed you to collect all of that data in a way that if you needed to, you could do something with it. Now, what customers quickly realized is, okay, we're starting to store petabytes and petabytes of information, but we don't necessarily have the right tools to be able to process all of it.
04:27 — Kim Huang
Remember, this is the early aughts. Cloud computing was a thing, but it wouldn't become an integral part of IT infrastructure as we know it today until much later on. Not to mention the effects on IT operations when dealing with a Hadoop implementation.
04:46 — Sherard Griffin
If you ever installed Hadoop bare metal, it's quite a number of IT tickets to scale out your infrastructure to be able to process additional data, and the data was coming in faster than they could scale out the infrastructure. And so the reason we wrote the book was because customers were coming across this challenge, and they would come to us and say, "Hey, how do we build out the right infrastructure that allows us to scale it out?"
05:10 — Brent Simoneaux
Angela, I'm curious if you have any experience with Hadoop. Do you remember this moment that Kim is describing from the early aughts?
05:18 — Angela Andrews
I doop. Yes. It was one of—
05:24 — Angela Andrews
It was all the rage. It was big data. You heard about it out there in the world. It had never landed specifically in my datacenter, because we were still all on-prem back then. There was no cloud yet. We had a private cloud using virtualization, but the amount of data that we're talking about that Hadoop usually works for was out of our grasp. It could very well have possibly been that there were some professors using it somewhere, scraping together whatever they could for storage. But, personally, I only heard of it as this buzzword for big data.
06:04 — Kim Huang
I imagine it's around the same time that Facebook is coming to prominence. You had that kind of capability, and this was a new time where both technologies, like social media as a medium and also this new kind of object-oriented storage infrastructure was coming about. Right? So you had these two things happening at the same time. Am I off or is that right?
06:28 — Angela Andrews
It sounds about right. The timeline is such that when we're talking about social media, that was all the rage at that time, when you get the Facebooks, and... There are others out there that were around in 2008. But this is where you started to hear about the sheer amount of data that these platforms were responsible for, for mining and trying to figure out, "Well, now that we have it, what do we do with it?" And again, it was just one of those things that you just heard about. (07:02): Think of Facebook, it's literally like the pack rat of data. You're holding onto all this information back then and not really able to figure out what to do with it? It just seems like, I want to see where this episode is going because maybe that's why.
07:24 — Brent Simoneaux
I kind of get the logic, though.
07:29 — Brent Simoneaux
I get this. It's like, "Oh, I might need this later. Let me go ahead and save it." I don't know what this is useful for now, but in the future, it might be. I think I get the logic.
07:40 — Kim Huang
Maybe my stepdad's garage was the right metaphor, instead of a reservoir.
07:43 — Angela Andrews
It was the perfect metaphor.
07:45 — Brent Simoneaux
It actually is, yeah.
07:46 — Angela Andrews
For pack rats everywhere.
07:50 — Kim Huang
Well, I actually have a copy of Sherard's book that he wrote, and it's really interesting. But he says something that really caught my attention. He likened Hadoop to a hammer, and when one has a hammer, other things, well, everything starts to look like a nail. Sherard gave us an example.
08:12 — Sherard Griffin
I remember the very first time I used Hadoop in a production environment. We were building out an application for a company of mine, and we thought Hadoop was the right thing. We were just told, "Hey, you guys have to store... You're building a multi-tenanted cloud product, and you have to use Hadoop because that's the way these things are done." (08:33): And so what did we do? We had all of our teams build out this fancy infrastructure with Hadoop and, boy oh boy, we knew that if we just stored all the data in HDFS [Apache Hadoop distributed file system], then we'd be able to solve every single query that came in, whether it was real time or whether it was batch, we could do it. (08:49): That architecture lasted about 6 months before we ripped out half of it, because we realized Hadoop is not going to save us in terms of allowing us to do anything we want. We still have to really understand what Hadoop is good at.
09:04 — Angela Andrews
Did they figure it out? Imagine putting all of this money and resources into something that 6 months later you go, yeah, no, this ain't it.
09:18 — Angela Andrews
But it's awesome that they were nimble enough to be able to say, this is really not the answer for us. It is an answer, but not an answer for what we're trying to do with this product.
09:29 — Kim Huang
Absolutely. What next?
09:31 — Brent Simoneaux
Well, I like what he says there at the end. We really have to understand what Hadoop is good at, and I feel like not enough of us, myself included, ask that question at the beginning.
09:46 — Kim Huang
Yes. And to your point, Angela, implementations of Hadoop, they aren't cheap, and you need the personnel also to know how to work with Hadoop and all the different layers of platforms and applications that run on top of it. So it poses kind of this conundrum of, hey, there's this thing that can solve the problems that are associated with big data and this new age of big data, but it may not be the tool that you need. You may not figure that out until you're already starting the work—which is messy. But, it is nice to have that pivot and be able to understand that maybe this is not the right tool.
10:29 — Brent Simoneaux
So Hadoop is really popular. It's all the rage. It's sort of people think it's going to solve a lot of these big data problems.
10:39 — Brent Simoneaux
What happens next, Kim?
10:41 — Kim Huang
Well, we heard from Sherard about the challenges that a Hadoop implementation can present, but where does the legend of Hadoop stand today? Well, we'll find that out next. (11:04): I would like to introduce Michael Wells. He's an engineering technologist at Dell Technologies. When he spoke about Hadoop, the story he shared was similar to what we heard from Sherard.
11:16 — Michael Wells
Most companies aren't implementing Hadoop themselves anymore. They're more consuming Hadoop as a service, relying on somebody else to implement it and manage it for them. Or—a step beyond that—they're not even aware of the presence of Hadoop, and they're consuming services that are built on top of it.
11:40 — Angela Andrews
Now that makes more sense to me.
11:44 — Angela Andrews
Hadoop hasn't gone away, per se. HASS.
11:52 — Brent Simoneaux
I was like, where are you going with that?
11:54 — Angela Andrews
But it's still there. You know, Avocado. We're still there. It hasn't gone anywhere. But now that we think about out there in the cloud where I'm assuming they're talking about we're consuming it as a service, there are other cloud vendor services that are smarter in dealing with Hadoop and you're utilizing those services. Hadoop is the underpinnings, but these cloud-native services have kind of taken the wheel at this point.
12:25 — Angela Andrews
I hope I'm heading in the right direction.
12:27 — Angela Andrews
Okay. All right. That's what it sounds like he's saying here.
12:30 — Kim Huang
100% what he's saying. Yeah. According to both Michael and Sherard, Hadoop's popularity started to subside when other technologies started coming into play, and the ability to scale—kind of what Sherard was talking about earlier, and also the flexibility—those two things were important on building on the success of Hadoop. But those other tools and those other projects, they're basically being stacked on top of Hadoop. The software framework itself became less relevant over time.
13:01 — Angela Andrews
But did it, though?
13:02 — Kim Huang
Yeah. Why did people stop talking about it so suddenly? What do you think, Angela? I don't know.
13:07 — Angela Andrews
Well, it's still there.
13:09 — Kim Huang
Yeah, it's still there.
13:09 — Angela Andrews
We're hearing that it is still there, but there are other services on top of it that make it more accessible. So Hadoop is this huge framework, and then there are things over top of it, cloud-native services on top of it, that know how to use it. They're the smarts, where Hadoop was just this one big thing and it's all this data. Something had to be built to make it make sense, to be able to get into this data, to get the kind of data out of it that you're putting into it. How are we filtering all of this data to disparate services somewhere? It's still great that it does what it does, but that's not enough. You need these other cloud-native services that do these things, very particular things that can utilize this service. It hasn't gone anywhere, but no one's in the business of managing and maintaining it anymore, because the cloud providers pretty much have cornered the market on that.
14:10 — Kim Huang
But there's more to the story, mainly because like you said earlier, Angela, implementing Hadoop affects more than scalability. Here's Michael again.
14:20 — Michael Wells
It's not that nobody's talking about it. I think the conversations have just moved into smaller circles. So a lot of organizations rushed to implement Hadoop and realized, this is difficult or this is expensive. Data scientists aren't cheap. Data lakes, because you can throw anything into them, they grow very, very big, very, very fast. And if you never reach the point of being able to determine the value of the data, you've got a lot of investment in infrastructure, and you're not seeing any return out of that data. So I think a lot of that has shifted into services and service providers, and a lot of the conversation kind of shifted to not so much from Hadoop, but shifted less to how you're storing the data and more to what you're doing with the data.
15:20 — Brent Simoneaux
So the conversation that we were talking about before where it's like, let's just collect as much as possible and we'll do something later, it's like it's flipped to like, okay, no we really need to think about what are we doing with this data?
15:36 — Angela Andrews
Because data is money.
15:37 — Kim Huang
It most certainly is.
15:39 — Angela Andrews
Petabytes cost money. So gone are the days when you're just storing and trying to figure it out. You have to be very prescriptive, because every piece of data that comes along may not be what's going to make you your money back. Let's just store it all, but you only need this much data to get some value out of everything that's coming through. I think organizations have just gotten much smarter in understanding that data is king, but we need to be very thoughtful about what data we need to be concerned with to make our product or our business successful. And that's where I think managing that data and picking and choosing that data has become more the new hotness, as opposed to just store it all and we'll figure it out. No, we're going to figure this out first.
16:34 — Kim Huang
Yeah, operational strategy.
16:36 — Angela Andrews
Before we store it all.
16:38 — Kim Huang
So then there's the million-dollar question. Right? This series is about legacy technology. Is it still worth it for technologists to learn Hadoop?
16:50 — Kim Huang
I've done my own research, and the answers I got are mixed.
16:56 — Kim Huang
Yeah. Some say it's beneficial to understand it while others say learning everything on top of it, or the things on top of it—kind of like Spark, for example, is one that is really popular—they're more useful because large Hadoop implementations are mostly a thing of the past. For Michael, it's important for people to think about what they're passionate about and not laser focus on what they need to learn or what they feel like they need to learn.
17:31 — Michael Wells
Those are the types of things you're going to get burnout on. I don't want to learn this, but I have to learn this. So you're not going to invest the time in mastery that you should, and it's going to be exhausting. If you can focus on those things that you are passionate about, it makes it a lot easier. You enjoy reading the latest articles about new advancements and new technologies and new design patterns and how people are doing things. When it's exciting like that, then that continuing education isn't a chore.
18:05 — Kim Huang
Brent, Angela, I have a question for you both: How do you feel about continuing education, and how do you approach what to learn versus what's just popular right now? How do you figure out what steps to take and what to learn next?
18:23 — Angela Andrews
That is a good question. Brent, I'm going to have to let you go first.
18:29 — Brent Simoneaux
I was going to say—
18:29 — Angela Andrews
I need to mull that one over.
18:32 — Brent Simoneaux
I was going to say, this is an Angela question. (18:34): Well, I think something that Michael was saying earlier about not being laser-focused on things that you "need" to learn. I think that that's important, but I want to emphasize not being laser-focused on it. Because there are some things that you need to know and it sucks to learn and you don't enjoy it. But you just do it; you learn how to do it and you kind of move through it. (19:02): I think what he's saying right there though is that if you do that too much, then learning is going to be a drag and you're going to get burnt out and you're not going to want to do it anymore. So you need to also have something that you're really excited about and something that you really genuinely enjoy learning and that you're curious about.
19:26 — Angela Andrews
That is true. Thank you for that. Thank you for inspiring my answer. (19:31): So we cannot spend a lot of time deep diving into all the things. It's not possible. It's not feasible. There's not enough hours in the day. So it's better that we are driven by the things that excite us, and it might be something... This quarter, it's... I just had a brain fart. Oh, literally, it depends on what's going on in the market, in technology, on what you're working on at the moment—that tends to drive where your interests are. You always want to follow your heart, where you think things are going and where you find value. This quarter, it's generative AI, and next quarter it may be something totally different. But what you have to do is, knowing you can't deep dive into everything, but because you're always staying curious, you'll learn a little bit about this and understand how something works with this. You'll pick and choose because you want to be able to have an intelligent conversation depending on what your role is, because especially if it comes across your desk, you really want to be able to understand what's going on out in the markets and what your customers are using, and all of your other stakeholders are thinking about whatever this technology is. (20:57): So staying curious is a huge part of it, but also understanding you can't be a subject matter expert in everything. Follow your heart. I agree with Brent. What are your passions? What sings to you? Those are the things that you should spend your time on. Everything else is—let me just read a little bit here. Let me pick and choose.
21:20 — Brent Simoneaux
Yeah, there's something—I love losing an afternoon or losing a few hours at night to just following something that I'm just curious about in the moment, or just researching something or playing with something just because I'm curious in that moment about it. Sometimes I think I'm an overachiever or something and I feel like I have to set myself on this regimented learning schedule. And there's a time and there's a place for that, but now I'm going to sing the praises of losing an afternoon to a curiosity.
22:01 — Angela Andrews
I second it. I think we should all do that, though. I think it's something that all of us, even our listeners—stay curious. There's always something that interests you and give it the time it deserves.
22:15 — Kim Huang
I want to bring it back to Michael, because the hype machine for Hadoop was going strong at the early aughts. And then it sort of petered out a bit once the implementations were in place, once people kind of understood the technology and they realized that there was kind of more going on than what they bargained for. But what does the story of Hadoop, how it started and where it is today, what does it tell us about the tech industry?
22:44 — Michael Wells
It tells us that the IT industry has a very short memory. I've been in IT for over 20 years now, and even just in my relatively short time, I have learned that things come full circle. You look at the mainframe days and centralization and dummy terminals, and there's a lot of value in centralizing the data and centralizing the processing. But then we move into workstations and distributed computing and separating things out, and then they come back together again and then they separate back out again. And so every few years, somebody goes, "Oh, I've got this new idea," and if you really look closely at it, it's probably not that new of an idea.
23:32 — Angela Andrews
If he didn't tell it from the mountaintop.
23:34 — Angela Andrews
That is so true. So as he's talking, I'm thinking about the technologies that I've seen that are coming through and a lot of them are just improvements or rehashing or rebranding or rearchitecting, of things that you've seen 15, 20 years ago. There's nothing new under the sun. And in a lot of respects, it sounds like what Michael's saying—you stay around long enough, you're going to see it come back in a different iteration.
24:09 — Brent Simoneaux
So Kim, Angela, I'm curious how we're thinking about Hadoop now. It seems like maybe it's not something that we're going to be interacting with or seeing day-to-day right now, but what can we maybe learn from this and what are you both taking away?
24:32 — Angela Andrews
My takeaway is, we're learning that the newest technology is great, but it is not the tool for everything that comes down the pipe, right? We should be very aware and wary of things that come in and want to be this end-all, be-all until we fully understand what these use cases are for. (24:57): I think staying aware of said things is very important, but I also think that since we're not fortune-tellers, we should be saying to ourselves, "Okay, technology, this is what's happening right now. It is steeped in X. How are we going to see these new iterations come through?" (25:19): So I've learned that there's nothing new under the sun. I think new technologists should be aware. Do your homework, understand what Hadoop or any other tool was made for and why folks are using it. But, you don't have to spend a whole lot of time on it, because as things come along and build themselves on top of these "legacy type tools," you really need to know where your work is and what you are doing. As long as you have an understanding, you don't have to be a subject matter expert in distributed file systems and whatnot to get your job done. You should respect it, you should understand it, but we can't know all the things.
26:02 — Brent Simoneaux
Kim, how are you thinking about this?
26:06 — Kim Huang
Michael's parting words for us are, to me, the definition of legacy technology. Something that builds upon the success and innovation of the thing that came before it. (26:17): If anything, Hadoop is in this series because it's both an inspiring story and a cautionary tale at the same time. The hype machine was definitely running, and it was warranted because it's an open source project that changed the way companies and everyone else looked at data. It gave data scientists a playground to kind of grow their capabilities and by extension their profession. But, some organizations went all in before knowing what their goals were and how it would affect their infrastructure in the long term.
26:52 — Brent Simoneaux
You don't say.
26:53 — Kim Huang
Angela said it: No one's a fortune-teller. No one can predict the future, what innovations are around the corner. But following a trend without a strategy can be very costly. Sometimes the hammer is just a hammer, and it's up to us as technologists to figure out what's a nail and what isn't one.
27:12 — Angela Andrews
And that's on everything. I am so glad we did this episode, because it's showing folks that technology and hype are real, and we can't get dragged into it. (27:28): I'm just curious about, as folks are listening to this podcast, what are they thinking? We want to know what you're thinking about trends and tech trends and what's old is new. You have to share your thoughts with us. Hit us up on Twitter @RedHat. Use the hashtag #CompilerPodcast. Tell us about legacy technology or definitely—what's the hype machine technology you're dealing with now and why? And what's your use for it? Again, you have to be able to discern what is really a nail. Right? I think Kim put a nice point on that, no pun intended. So definitely tell us what you're thinking. Hit us up and share with us. We'd appreciate it.
28:12 — Brent Simoneaux
And that does it for this episode of Compiler.
28:22 — Angela Andrews
Today's episode was produced by Kim Huang and Caroline Creaghead.
28:26 — Brent Simoneaux
A big, big thank you to our guests, Sherard Griffin and Michael Wells.
28:31 — Angela Andrews
Victoria Lawton may not have all the answers, but by God 5 or 10 years from now, she'll have it all figured out.
28:38 — Brent Simoneaux
Our audio engineer is Robyn Edgar. Special thanks to Shawn Cole. Our theme song was composed by Mary Ancheta.
28:46 — Angela Andrews
Our audio team includes Leigh Day, Stephanie Wonderlick, Mike Esser, Nick Burns, Aaron Williamson, Karen King, Jared Oates, Rachel Ertel, Devin Pope, Matias Foundez, Mike Compton, Ocean Matthews, Paige Johnson and Alex Traboulsi.
29:03 — Brent Simoneaux
If you like today's episode, please follow the show. Rate it, leave it a review, share it with someone you know. It really does help us out.
29:12 — Angela Andrews
We love that you listen. Thanks for hanging in there. We will talk to you soon. Bye, everybody.
29:17 — Brent Simoneaux
All right, see you next time.