Data explosion, or data implosion?

March 25, 2019 | 3-minute read

Data. It’s here. It’s everywhere. It is, as I’ve said before, the dawn of the Data Economy. With data permeating our lives—and business—it makes sense to say that data, and the ability to use it wisely for insights, is important to many organizations’ success. In this post, I’ll examine some of the barriers organizations face in the Data Economy and how they’re overcoming them.

In Red Hat’s Storage group, we’ve spoken to dozens of organizations about their data analytics challenges, and there are two significant macro trends we have observed:

Multiple copies of datasets: Organizations are struggling with traditional data analytics architectures that tightly couple compute and storage resources. The implication of this tight coupling is that datasets must be copied to every compute cluster that needs them. While these architectures have definite performance benefits, they require significant capital and operational expenditure to maintain.
Resource contention: Organizations are struggling with resource contention in their data analytics infrastructure. All too often, multiple business units compete for the same resources, which frustrates the data scientists and data engineers who rely on that analytics infrastructure to do their jobs.

Businesses are increasingly relying on data to help them track all aspects of their operations. CEOs and boards of directors expect to have the latest information at their fingertips. Data scientists are charged with making sense of reams of data for executive management. They don’t want to be bogged down with how their tools and data analytics infrastructure produce results. It’s the job of data engineers and platform teams to figure out the infrastructure plumbing and toolsets that keep data scientists productive. It’s the data engineer who most acutely feels the two pains described above.

The tight coupling of compute and storage resources ensures data locality, which can mean higher performance than otherwise possible. But significant challenges arise in maintaining copies of petabyte-sized datasets across clusters and organizations. In our research, it was not uncommon to find customers who were maintaining dozens of copies of datasets. Aside from the obvious storage expenditure, maintaining up-to-date copies was a manual and time-consuming task. Imagine trying to maintain 50+ synchronized copies of petabyte-sized datasets!
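To put rough numbers on that, here is a back-of-the-envelope sketch. The 1 PB dataset size and the count of 50 copies are illustrative assumptions taken from the scenario above, not measurements, and the calculation ignores replication or erasure-coding overhead inside each storage system.

```python
# Illustrative numbers only: the dataset size and copy count are assumptions
# chosen to match the "50+ copies of petabyte-sized datasets" scenario.
DATASET_PB = 1.0   # size of one dataset, in petabytes (assumed)
COPIES = 50        # per-cluster copies that must be kept in sync (assumed)


def capacity_pb(dataset_pb: float, copies: int) -> float:
    """Raw capacity needed to hold `copies` full copies of one dataset."""
    return dataset_pb * copies


duplicated = capacity_pb(DATASET_PB, COPIES)  # one full copy per cluster
shared = capacity_pb(DATASET_PB, 1)           # a single shared copy

print(f"duplicated: {duplicated:.0f} PB, shared: {shared:.0f} PB")
print(f"capacity avoided: {duplicated - shared:.0f} PB")
```

Under these assumptions the duplicated layout needs 50 PB against 1 PB for a single shared copy, before counting the effort of keeping the copies synchronized.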

Resource contention can be a major limiter for multiple organizations sharing a common infrastructure. Employing the right resources at the right time can mean the difference between runaway blockbuster sales and missed opportunities.

What is needed is a solution that delivers true multi-tenant workload isolation while providing a shared data context at the same time.

Shared data context

Let’s first delve into what a shared data context means. A shared data context provides centralized pool(s) of storage to support data analytics workflows. The clear advantage of this is the elimination of the need to copy data sets to various analytics clusters.

Quite often, though, we hear about concerns with this approach. The primary question is whether customers will be able to hit similar levels of performance with a centralized shared storage resource. This is an excellent question, and at Red Hat we did some independent testing of our own. You can find the results in the reference architecture we recently published. Please see the link at the end of this blog.

In Red Hat’s specific solution to this problem, we’ve built a stack with S3A, the open source Hadoop connector that lets Hadoop-ecosystem applications read and write data on an S3-compatible endpoint through the Hadoop file system API, in place of the Hadoop Distributed File System (HDFS). The reference architecture previously mentioned employs S3A together with Red Hat Ceph Storage.
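As a rough illustration of what pointing S3A at an S3-compatible endpoint involves, the sketch below assembles the relevant Hadoop settings as a plain Python dictionary. The `fs.s3a.*` property names and the `spark.hadoop.` prefix are standard Hadoop and Spark settings. The endpoint URL, the credentials, and the two helper functions are hypothetical placeholders, and path-style access is an assumption that suits many Ceph object gateway setups, not a setting taken from the reference architecture.

```python
from typing import Dict


def s3a_properties(endpoint: str, access_key: str, secret_key: str,
                   use_ssl: bool = True) -> Dict[str, str]:
    """Hadoop S3A settings for an S3-compatible endpoint (hypothetical helper)."""
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Assumption: address buckets as <endpoint>/<bucket> rather than
        # <bucket>.<endpoint>, which avoids needing wildcard DNS.
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }


def as_spark_conf(props: Dict[str, str]) -> Dict[str, str]:
    """Spark passes any `spark.hadoop.*` setting through to Hadoop."""
    return {f"spark.hadoop.{key}": value for key, value in props.items()}


# Placeholder endpoint and credentials, for illustration only.
conf = as_spark_conf(
    s3a_properties("http://ceph-rgw.example.com:8080",
                   "EXAMPLE_ACCESS_KEY", "EXAMPLE_SECRET_KEY", use_ssl=False)
)
for key in sorted(conf):
    print(key)
```

With settings like these applied to each analytics cluster, jobs would address the shared datasets with `s3a://bucket/path` URIs instead of cluster-local HDFS paths, so every cluster reads the same copy.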

Multi-tenant workload isolation

The second issue is how to provide access to analytics clusters across an organization’s business units, which often have competing demands. Red Hat can provide this multi-tenant workload isolation in two ways, by using:

A Kubernetes container infrastructure using Red Hat OpenShift Container Platform. This approach allows data analysts to spin up analytics workloads more quickly, helping them become more agile and productive. It’s still early days for containerized data analytics applications, but things are developing fast in this space. We already have customers running Apache Spark on a containerized OpenShift infrastructure, and we are working with the major data analytics vendors to enable containerized data analytics applications.
Virtual machines (VMs) deployed on Red Hat OpenStack Platform. The advantage with this approach is that there is minimal porting of applications, so you can be up and running quickly. Many customers are in production with this architecture today, realizing the benefits of analytics workload isolation.
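For the container-based option, one common Kubernetes mechanism for keeping tenants from starving each other is a namespace per business unit with a `ResourceQuota` attached. This post does not say which mechanisms the reference architecture relies on, so treat the sketch below as an assumption. The namespace names and the CPU and memory figures are hypothetical.

```python
import json
from typing import Dict, Tuple


def tenant_quota(namespace: str, cpu: str, memory: str) -> dict:
    """Build a Kubernetes ResourceQuota manifest for one tenant namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                # Cap the total CPU and memory that all pods in the
                # namespace may request or be limited to.
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
            }
        },
    }


# Hypothetical business units and their (CPU, memory) caps.
TENANTS: Dict[str, Tuple[str, str]] = {
    "marketing-analytics": ("32", "128Gi"),
    "finance-analytics": ("64", "256Gi"),
}

manifests = [tenant_quota(ns, cpu, mem) for ns, (cpu, mem) in TENANTS.items()]
print(json.dumps(manifests, indent=2))
```

Each generated manifest caps what one business unit's analytics jobs can consume, so a burst of work in one namespace cannot take resources reserved for another.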

By combining S3A with Red Hat Ceph Storage along with Red Hat OpenShift or Red Hat OpenStack Platform, organizations can provide more consistent service levels to data scientists and data engineers while rationalizing storage requirements and reducing dataset duplication. With this reference architecture from Red Hat, organizations can tame their data explosion while sidestepping a data implosion.

You can find a full copy of the Red Hat data analytics infrastructure reference architecture

To find out more, please see this short video on Red Hat’s data analytics infrastructure solution:

For a slightly deeper dive, check out “Breaking down data silos with Red Hat infrastructure.”

If you’re attending the Strata Data Conference in San Francisco this week, stop by Red Hat’s booth to learn more about the solutions we’ve developed to make analytics easier.