Organizations across a range of industries are sitting on treasure troves of data. A business intelligence team with the right tools can mine this data and unearth insights that could lead to new products and services and improve customer service and retention rates. With such rich opportunities mere queries away, it’s no wonder that IT departments are increasingly concerned with how data is used for analytics.
We sat down with Shadi Shahin, Red Hat’s Business Intelligence and Data Warehousing team lead, to learn more about how data virtualization can help organizations achieve their analytics goals and boost their business results.
Data virtualization seems to be gaining momentum as a mainstream technology. As a business intelligence practitioner and leader at Red Hat, what are your thoughts about data virtualization?
The rise of data virtualization starts with the realization that the traditional data warehouse is no longer sufficient to meet many operational objectives. All we would essentially be doing when we create a data warehouse or data lake is replicating massive extract, transform, and load (ETL) work and cost. With data virtualization, we’re trying to get as close to the source as possible and only copy, move, and transform data when necessary. That doesn’t stop us from amassing data. The data lake concept doesn’t go away, because hemorrhaging data isn’t something we want to do either, but amassing data isn’t our default. Storage costs may be low, but storing data doesn’t get us the analytical output. What we’re trying to derive is the value of that output. Transform data only when we need to, move data only when we need to. That’s the guiding principle for my team.
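To make the "query at the source, copy only when necessary" idea concrete, here is a minimal sketch in Python. It is only a toy illustration and not how any particular data virtualization product works. The two in-memory SQLite databases, the table names, and the `customer_view` function are all hypothetical stand-ins for separate source systems and a virtual view over them.

```python
import sqlite3


def make_source(schema_sql, insert_sql, rows):
    """Create a stand-alone in-memory database standing in for one source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema_sql)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


# Hypothetical source system 1: account records.
crm = make_source(
    "CREATE TABLE accounts (customer_id INTEGER PRIMARY KEY, name TEXT)",
    "INSERT INTO accounts VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# Hypothetical source system 2: support tickets.
support = make_source(
    "CREATE TABLE tickets (ticket_id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT)",
    "INSERT INTO tickets VALUES (?, ?, ?)",
    [(10, 1, "open"), (11, 1, "closed"), (12, 2, "open")],
)


def customer_view(customer_id):
    """Answer a question by querying each source at request time.

    Nothing is copied into a warehouse first; the combined record
    exists only for the duration of the call.
    """
    account = crm.execute(
        "SELECT name FROM accounts WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    if account is None:
        return None
    open_tickets = support.execute(
        "SELECT COUNT(*) FROM tickets WHERE customer_id = ? AND status = 'open'",
        (customer_id,),
    ).fetchone()[0]
    return {"customer_id": customer_id, "name": account[0], "open_tickets": open_tickets}


print(customer_view(1))
```

Because the view reads the sources on every call, a new ticket written to the support system shows up in the next `customer_view` result without any load job in between. The trade-off is that every question pays the cost of querying the sources, which is one reason ETL still has a place for heavy, repeated reporting.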
You had been using extract, transform, load (ETL) technology for your data integration. What made you evaluate data virtualization?
It’s been an evolution. A plethora of tools out there are supposed to be self-service (the Qliks, Tableaus, and SAPs of the world), and we need to keep up. When self-service BI tools proliferate, data likely proliferates too. We can lose control and governance. We may no longer know what happens with the data that people are procuring. We wanted to have a unified data layer for self-service BI tools to consume data.
Silos of information only tell us a portion of the customer story. We have an incomplete picture without understanding every aspect of how customers interact with our company and what’s going on in their own organization. For example, let’s say a customer has asked a salesperson a question. Building a complete picture of the customer and answering the question fully using ETL methods could take months. We’d go to our various constituents and construct the requirements. Then we’d write a bunch of ETL and integrate the data, all the time contending with massive data volume, data complexity, business rules, and business logic. Then we’d finally get back to the customer. By then, the customer has found a different way to solve their problem and moved on, their problem has changed and they’re interested in answering a different question, or they don’t want to hear the answer because it was the wrong question or they’re not prepared to hear it.
For this reason and many others, speed to market is our top priority. Once we discovered that the traditional single system would not facilitate speed, we realized the traditional warehouse and reporting structure wouldn’t meet our needs. Data virtualization is one tool we’re using to operate in parallel and move faster.
We have three primary use cases. First, there’s exploration. You don’t understand the data, and I don’t understand the data. Data virtualization can give us visibility into multiple systems at once. We don’t have to wait months for data in siloed systems to be integrated before we can start playing with it. Even better, while we’re exploring the data, we can start moving the data to set up for the second use case or step in the process. Second, there’s investigation. We are done exploring the data and are ready to consume it and establish a reporting framework. This is for the business analysts who are doing the investigation work. Third, we build data into an operational framework. We have explored it, investigated it, built platforms on top of it, and now can push that information out to constituents because it’s consistent and unified. It’s no longer the wrong question and wrong answer. Now we know the data we have, what data we need, and which answers it can give us.
What is the relationship between data virtualization and ETL? Are they mutually exclusive data integration strategies?
I think they are complementary and transitional. Sometimes you do data virtualization, and sometimes you do ETL. Most of the time you do both.
If you need fully integrated, secure data (for example, SEC filings), you can’t be loose with that integration. Data virtualization probably isn’t the right answer. ETL probably is. But when you’re in exploration mode and only looking to discover what the data looks like so you can mine it, ETL probably isn’t the right answer. Data virtualization is probably better.
It’s a journey. Many times, you start at exploration and end at prescriptive, predictive, or basic reporting, which will need ETL in time. Data virtualization enables you to go on that journey without the upfront cost.
Data virtualization and ETL can work well with each other. They help provide speed to market to get answers faster while enabling you to build rigorous reporting, too.
How do you evaluate data virtualization platforms? Which capabilities and features are most important?
I’ll talk first about our approach to evaluation and then the most important features.
Our approach was to perform two kinds of analyses, one for functional requirements and one for nonfunctional requirements. First, functional requirements are about how the platform performs and what it can do for me from an ease-of-use and capabilities standpoint. Is the platform what it claims to be? For that, we did both a paper and hands-on evaluation, a bake-off between the tools in the market. Second, we looked at nonfunctional requirements. We thought about the business and asked, what would IT care about? Our answer: scalability, security, cloud deployability, cost, support, and whether it’s open source.
When it came to prioritizing features, we weighted the different things we care about today and in the future. For example, today we care more about scalability than ease of use. We want to deploy an enterprise-grade tool. Three years from now, however, we may not care as much about certain features versus others. We assigned a grade for every line item for both today and three years from now. Based on the scoring, we came to the conclusion that Red Hat JBoss Data Virtualization is what we ought to use, not only from a nonfunctional standpoint but also from a functional one. We believe it’s above and beyond what our competitors are doing. Even if we fall short on ease of use, we can evolve that. We can’t evolve a non-enterprise platform that can’t scale into an enterprise platform.
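The weighting approach described above can be sketched as a simple weighted scorecard. The Python below is a minimal illustration under stated assumptions, not the team's actual scoring model. The criteria, weights, grades, and platform names are all invented for the example, and the weighted average is just one plausible way to combine line-item grades.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One line item, weighted separately for each time horizon."""
    name: str
    weight_today: float
    weight_in_3_years: float


# Hypothetical weights: scalability matters most today, ease of use later.
CRITERIA = [
    Criterion("scalability", weight_today=5, weight_in_3_years=3),
    Criterion("ease_of_use", weight_today=2, weight_in_3_years=4),
]

# Hypothetical grades on a 0-10 scale for two made-up platforms.
PLATFORM_A = {"scalability": 9, "ease_of_use": 5}
PLATFORM_B = {"scalability": 4, "ease_of_use": 9}


def weighted_score(grades, criteria, horizon):
    """Return the weighted average grade for 'today' or 'in_3_years'."""
    if horizon not in ("today", "in_3_years"):
        raise ValueError(f"unknown horizon: {horizon!r}")
    total = 0.0
    total_weight = 0.0
    for criterion in criteria:
        if horizon == "today":
            weight = criterion.weight_today
        else:
            weight = criterion.weight_in_3_years
        total += weight * grades[criterion.name]
        total_weight += weight
    return total / total_weight


for label, grades in (("A", PLATFORM_A), ("B", PLATFORM_B)):
    today = weighted_score(grades, CRITERIA, "today")
    later = weighted_score(grades, CRITERIA, "in_3_years")
    print(f"Platform {label}: today={today:.2f}, in 3 years={later:.2f}")
```

Scoring each horizon separately makes the trade-off visible: with these invented numbers the scalable platform wins clearly today, while the two platforms end up nearly tied three years out, when ease of use carries more weight.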
What is your advice to your peers in the industry about using a data virtualization platform?
I embarked on this journey a couple of years ago to move us from a reporting and data warehousing team to a true business intelligence and analytics team. Having gone through that process and spoken with a lot of my peers, the key is to start. Don’t be scared. But don’t believe the hype either. There isn’t a single tool out there that solves all of your problems. Your architecture has to be multi-tiered.
Whether you have Hadoop or a traditional warehouse, you need multiple ways to solve issues. Data virtualization is a way to solve some problems faster and buy you options to decide later. We have a saying: “decide late.” Instead of spending money on other solutions, use data virtualization to build on what you already do and have today. If you apply data virtualization on top of your traditional warehouse and plug it in with your other systems, it can provide you with simpler integration pretty much out of the box. There’s some engineering involved, but it can be much faster than standing up a full new stack. Data virtualization can enable you to take steps beyond where you are without spending a lot of money and time. It enables you to figure out your needs. Are your needs really around massive amounts of data? Or are you struggling with data complexity? Then maybe you should invest more in skill and governance than in a new tool. Data virtualization helps you to answer these questions.
You have more knowledge in-house than you think you do. People should not be as afraid of data virtualization as of other solutions because it doesn’t displace the warehouse; it’s not disruptive to your environment. Data virtualization may actually become part of your architecture in the long term, but it’ll guide you. So start today. Try it. That’s my advice.