INTRODUCTION
Traditional data analytics infrastructure is under stress due to the enormous volume of captured data and the need to share finite resources among teams of data scientists and data analysts. New technology and computing capabilities have created a revolution in the amount of data that can be retained and in the kinds of insights that can be garnered. However, divergent objectives have emerged between analytics teams, who want their own dedicated clusters, and the underlying data platform teams, who would prefer shared datacenter infrastructure. In response, some data platform teams are offering each team its own Apache Hadoop Distributed File System (HDFS) cluster. Unfortunately, this approach results in expensive duplication of large datasets to each individual cluster, because HDFS is not traditionally shared between different clusters.
Public cloud providers like Amazon Web Services (AWS) offer a more compelling model, where analytics clusters can be rapidly deployed and decommissioned on demand, with each cluster able to share datasets in a common object storage repository. Emulating this architectural pattern, several leading large enterprises have used Red Hat® private cloud platforms to deploy agile analytics clusters that share datasets in a common Ceph® object storage repository compatible with the AWS Simple Storage Service (S3) API.
Based on the pioneering success of these efforts, Red Hat, Quanta Cloud Technology (QCT), and Intel Corporation sought to quantify the performance and cost ramifications of decoupling compute from storage for big data analytics infrastructure. They were aided by the Hadoop S3A filesystem client connector, which lets Hadoop-compatible analytics workloads read and write data directly in an S3-compatible object store, augmenting or replacing HDFS. Using a shared data lake based on Red Hat Ceph Storage, compute workloads and storage can be independently managed and scaled, providing multitenant workload isolation with a shared data context.
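To illustrate how an analytics cluster can consume such a shared data lake, the following is a minimal PySpark sketch that points the Hadoop S3A client at a Ceph object gateway and reads a shared dataset. The endpoint URL, credentials, bucket, and path shown here are hypothetical placeholders, not values from this reference architecture.

```python
# Minimal sketch: configuring the Hadoop S3A client from Spark to use a
# Ceph object gateway as a shared data lake. The endpoint, credentials,
# and bucket/path names below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shared-data-lake-example")
    # Standard Hadoop S3A properties, passed through Spark's spark.hadoop.* prefix
    .config("spark.hadoop.fs.s3a.endpoint", "http://rgw.example.com:8080")
    .config("spark.hadoop.fs.s3a.access.key", "EXAMPLE_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "EXAMPLE_SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Every cluster configured this way sees the same bucket, so no per-cluster
# HDFS copy of the dataset is required.
events = spark.read.parquet("s3a://shared-datalake/events/")
events.groupBy("event_type").count().show()
```

Because each cluster carries only configuration rather than data, additional clusters running different tool versions, or serving different tenants, can be brought up against the same bucket without creating another multipetabyte copy.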
RED HAT DATA ANALYTICS INFRASTRUCTURE SOLUTION
After conversations with more than 30 companies, Red Hat identified a host of issues with sharing large analytics clusters. Teams are frequently frustrated because someone else’s job prevents their job from finishing on time, potentially impacting service-level agreements (SLAs). Moreover, some teams want the stability of older analytics tool versions on their clusters, whereas their peers might need to load the latest and greatest tool releases. As a result, many teams demand their own separate and specifically tailored analytics cluster so that their jobs are not competing for resources with other teams.
With traditional Hadoop, each separate analytics cluster typically has its own dedicated HDFS datastore. To provide access to the same data from different Hadoop/HDFS clusters, the platform team frequently must copy very large datasets between the clusters and try to keep them consistent and up to date. As a result, companies maintain many separate, fixed analytics clusters (more than 50 at one company Red Hat interviewed²), each with its own redundant copy in HDFS of potentially petabytes of data. Keeping datasets updated between clusters requires an error-prone maze of scripts. The cost of maintaining 5, 10, or 20 copies of multipetabyte datasets on the various clusters is prohibitive to many companies in terms of both capital expenses (CapEx) and operating expenses (OpEx).