By Brent Compton, Senior Director, Technical Marketing, Red Hat Cloud Storage and HCI
Breaking down barriers to innovation.
Breaking down data silos.
These are arguably two of the top items on many enterprises’ wish lists. In the world of analytics infrastructure, a solution to these needs is often described as "multi-tenant workload isolation with shared storage." Several public-cloud-based analytics solutions provide this. However, many large Red Hat customers run large-scale analytics in their own data centers and were unable to solve these problems with their existing on-premises analytics infrastructure. They turned to Red Hat private cloud platforms as their analytics infrastructure and achieved just this: multi-tenant workload isolation with shared storage. To be clear, Red Hat is not providing these customers with analytics tools. Instead, it is welcoming these analytics tools onto the same Red Hat infrastructure platforms that run many of their other enterprise workloads.
Traditional on-premises analytics infrastructures do not provide on-demand provisioning for the short-running analytics workloads that data scientists frequently need. In addition, traditional HDFS-based infrastructures do not share storage between analytics clusters. As such, traditional analytics infrastructures often don’t meet the competing needs of multiple teams that require different types of clusters, all with access to common data sets. Individual teams can end up competing for the same set of cluster resources, which causes congestion in busy analytics clusters and leads to frustration and delays in getting insights from their data.
As a result, a team may demand its own separate analytics cluster so its jobs aren’t competing for resources with other teams, and so it can tailor the cluster to its own workload needs. Without a shared storage repository, this can lead to multiple analytics cluster silos, each with its own copy of data. Net result? Cost duplication and the burden of maintaining and tracking multiple data set copies.
An answer to these challenges? Bring your analytics workloads onto a common, scalable infrastructure.
Red Hat has seen customers solve these challenges by breaking down traditional Hadoop silos and bringing analytics workloads onto a common, private cloud infrastructure running in today’s enterprise data centers. At its core is Red Hat Ceph Storage, our massively scalable, software-defined object storage platform, which enables organizations to more easily share large-scale data sets between analytics clusters. The on-demand provisioning of virtualized analytics clusters is enabled through Red Hat OpenStack Platform. Additionally, early adopters are deploying Apache Spark in Kubernetes-orchestrated, container-based clusters via Red Hat OpenShift Container Platform. Delivery and support are provided by the IT experts at Red Hat Consulting, based on documented leading practices, to help establish an optimal architecture for our clients’ unique requirements.
Key benefits to customers
- Get answers faster. By enabling teams to elastically provision their own dedicated analytics compute resources via Red Hat OpenStack Platform, teams have avoided cluster resource competition in order to better meet service-level agreements (SLAs). And teams can spin up these new analytics clusters without lengthy data-hydration delays (made possible by accessing shared data sets on Red Hat Ceph Storage).
- Remove roadblocks. Empower teams of data scientists to use the analytics tools/versions they need through dynamically provisioned data labs and workload clusters (while still accessing shared data sets).
- Hybrid cloud versatility. Enable your query authors to use the same S3 syntax in their queries, whether running on a private cloud or public cloud. Spark and other popular analytics tools can use the Hadoop S3A client to access data in S3-compatible object storage, in place of native HDFS. Ceph is the most popular S3-compatible open-source object storage backend for OpenStack.
- Cut costs associated with data set duplication. In traditional Hadoop/Spark HDFS clusters, data is not shared. If a data scientist wants to analyze data sets that exist in two different clusters, they may need to copy data sets from one cluster to the other. This can result in duplicate costs for multi-PB data sets that must be copied among many analytics clusters.
- Reduce risks of maintaining duplicate data sets. Maintaining duplicate data sets is time-consuming and prone to error, and it can also result in incomplete or inaccurate insights being derived from stale data.
- Scale costs based on requirements. In traditional Hadoop/Spark HDFS clusters, capacity is added by procuring more HDFS nodes with a fixed ratio of CPU and storage capacity. With Red Hat data analytics infrastructure, customers can provision compute servers separately from a common storage pool. Because storage capacity is no longer locked to compute cores, companies can scale storage costs independently of compute costs, each according to need.
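To make the hybrid cloud point above more concrete, here is a minimal Python sketch of how the same `s3a://` data set reference could be used against either a private or a public object store, with only the S3A client settings changing. The property names are the standard Hadoop S3A settings, passed to Spark through its `spark.hadoop.` prefix. The endpoints, bucket name, object prefix, and credentials are hypothetical placeholders rather than values from any real deployment, and the path-style setting is shown only as a common choice for Ceph Object Gateway setups.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ObjectStore:
    """Connection details for an S3-compatible endpoint (all values hypothetical)."""
    endpoint: str
    access_key: str
    secret_key: str
    path_style: bool  # assumption: path-style bucket addressing for the private store


def s3a_spark_conf(store: ObjectStore) -> Dict[str, str]:
    """Return Spark settings that point the Hadoop S3A client at `store`."""
    return {
        "spark.hadoop.fs.s3a.endpoint": store.endpoint,
        "spark.hadoop.fs.s3a.access.key": store.access_key,
        "spark.hadoop.fs.s3a.secret.key": store.secret_key,
        "spark.hadoop.fs.s3a.path.style.access": str(store.path_style).lower(),
    }


def dataset_uri(bucket: str, key: str) -> str:
    """Build the s3a:// URI a query references; it does not depend on the endpoint."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


# Hypothetical endpoints: an on-premises Ceph Object Gateway and a public cloud S3 region.
private_cloud = ObjectStore(
    "http://ceph-rgw.example.internal:8080", "PRIVATE_KEY", "PRIVATE_SECRET", True
)
public_cloud = ObjectStore(
    "https://s3.us-east-1.amazonaws.com", "PUBLIC_KEY", "PUBLIC_SECRET", False
)

# The data set reference used in queries is identical in both environments.
uri = dataset_uri("analytics-shared", "/clickstream/2018/")
for name, store in (("private", private_cloud), ("public", public_cloud)):
    conf = s3a_spark_conf(store)
    print(name, conf["spark.hadoop.fs.s3a.endpoint"], uri)
```

In a PySpark job, each key/value pair could be applied with `SparkSession.builder.config(key, value)` before reading from `uri`. The query itself would not need to change when the cluster is pointed at a different object store.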
Innovation for today's data needs
As data continues to grow, organizations should have a supporting infrastructure that can break down data silos and enable teams to access and use information in more agile ways. Red Hat platforms can foster greater agility, efficiency, and savings, a nice combination for today’s data-driven organizations looking to build analytics applications across the open hybrid cloud.
You can also find our blog post that covers other news from the Strata conference and upstream community projects here. For more details on empirical test results, see here. For a video whiteboard of these topics, see here. Finally, to learn more, visit www.redhat.com/bigdata.