Data. It’s here. It’s everywhere. It is, as I’ve said before, the dawn of the Data Economy. With data permeating our lives and our businesses, the ability to use it wisely for insights is important to many organizations’ success. In this post, I’ll examine some of the barriers organizations face in the Data Economy and how they’re overcoming them.
In Red Hat’s Storage group, we’ve spoken to dozens of organizations about their data analytics challenges, and there are two significant macro trends we have observed:
- Multiple copies of datasets: Organizations are struggling with traditional data analytics architectures that tightly couple compute and storage resources. The implication of this tight coupling is that datasets must be copied across compute clusters. While these architectures have definite performance benefits, they require significant capital and operational expenditure to be maintained.
- Resource contention: Organizations are struggling with resource contention in their data analytics infrastructure. All too often, multiple business units compete for the same resources leading to frustration for data scientists and data engineers who rely upon their analytics infrastructure for their jobs.
Businesses are increasingly relying on data to help them track all aspects of their operations. CEOs and boards of directors expect to have the latest information at their fingertips. Data scientists are charged with making sense of reams and reams of data for executive management. They don’t want to be bogged down with how their tools and data analytics infrastructure produce results. It’s the job of data engineers and platform teams to figure out the infrastructure plumbing and toolsets that keep data scientists productive. And it’s the data engineers who most acutely feel the two pains described above.
The tight coupling of compute and storage resources ensures data locality, which can mean higher performance than otherwise possible. But significant challenges arise in maintaining copies of petabyte-sized datasets across clusters and organizations. In our research, it was not uncommon to find customers who were maintaining dozens of copies of datasets. Aside from the obvious storage expenditure, maintaining up-to-date copies was a manual and time-consuming task. Imagine trying to maintain 50+ synchronized copies of petabyte-sized datasets!
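To put rough numbers on that, here is a back-of-the-envelope sketch. The 50 copies and the petabyte-sized dataset come from the paragraph above; the 10 Gbit/s link speed is purely a hypothetical figure chosen for illustration, and the calculation ignores replication or erasure-coding overhead on either side.

```python
PETABYTE = 10**15  # bytes (decimal petabyte)
SECONDS_PER_DAY = 86_400


def duplicated_footprint_bytes(dataset_bytes: int, copies: int) -> int:
    """Raw capacity consumed when every cluster keeps its own full copy."""
    return dataset_bytes * copies


def full_copy_seconds(dataset_bytes: int, link_bits_per_second: float) -> float:
    """Time to move one full copy over a link running at full line rate."""
    return dataset_bytes * 8 / link_bits_per_second


dataset = 1 * PETABYTE   # one petabyte-sized dataset
copies = 50              # "50+ synchronized copies"
link = 10 * 10**9        # ASSUMPTION: hypothetical 10 Gbit/s link

total = duplicated_footprint_bytes(dataset, copies)
days = full_copy_seconds(dataset, link) / SECONDS_PER_DAY

print(f"{copies} copies occupy {total / PETABYTE:.0f} PB of raw capacity")
print(f"One full refresh of a single copy takes about {days:.1f} days")
```

Even under these generous assumptions, a single full refresh is measured in days, which is why keeping dozens of copies in sync by hand is so painful.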
Resource contention can be a major limiter for multiple organizations sharing a common infrastructure. Employing the right resources at the right time can mean the difference between runaway blockbuster sales and missed opportunities.
What is needed is a solution that delivers true multi-tenant workload isolation while providing a shared data context at the same time.
Shared data context
Let’s first delve into what a shared data context means. A shared data context provides one or more centralized pools of storage to support data analytics workflows. The clear advantage is that datasets no longer need to be copied to each analytics cluster.
Quite often, though, we hear about concerns with this approach. The primary question is whether customers will be able to hit similar levels of performance with a centralized shared storage resource. This is an excellent question, and at Red Hat we did some independent testing of our own. You can find the results in the reference architecture we recently published. Please see the link at the end of this blog.
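In Red Hat’s specific solution to this problem, we’ve built a stack with S3A, the open source Hadoop file system connector that lets Hadoop-ecosystem applications read and write data through an S3-compatible endpoint instead of the Hadoop Distributed File System (HDFS). The reference architecture previously mentioned employs S3A together with Red Hat Ceph Storage, which provides that S3-compatible object storage endpoint.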
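As a rough illustration of what pointing an analytics job at a shared object store can look like, the sketch below builds the S3A settings a Spark job would need. The `fs.s3a.*` property names are standard Hadoop S3A options, and the `spark.hadoop.` prefix is how Spark passes them through to Hadoop. The endpoint URL, credentials, and the helper name `s3a_spark_conf` are hypothetical placeholders, not values from the reference architecture.

```python
def s3a_spark_conf(endpoint: str, access_key: str, secret_key: str,
                   use_ssl: bool = True) -> dict:
    """Build Spark settings that direct the S3A connector at an
    S3-compatible endpoint (hypothetical helper for illustration)."""
    hadoop_props = {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Path-style addressing is commonly needed for non-AWS endpoints.
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": "true" if use_ssl else "false",
    }
    # Spark forwards any "spark.hadoop.*" setting to the Hadoop configuration.
    return {f"spark.hadoop.{key}": value for key, value in hadoop_props.items()}


# ASSUMPTION: placeholder endpoint and credentials, for illustration only.
conf = s3a_spark_conf("http://object-store.example.com:8080",
                      "EXAMPLE_ACCESS_KEY", "EXAMPLE_SECRET_KEY",
                      use_ssl=False)

for key in sorted(conf):
    print(key)
```

Each key/value pair can then be handed to `SparkSession.builder.config(key, value)`, after which every analytics cluster addresses the same data as `s3a://<bucket>/<path>` rather than keeping a local HDFS copy.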
Multi-tenant workload isolation
The second issue is how to provide access to analytics clusters across an organization’s business units, which often have competing demands. Red Hat can provide this multi-tenant workload isolation in two ways, by using:
- A Kubernetes container infrastructure using Red Hat OpenShift Container Platform. This approach allows data analysts to spin up analytics workloads more quickly, helping them become more agile and productive. It’s still early days for containerized data analytics applications, but things are developing fast in this space. We already have customers running Apache Spark on a containerized OpenShift infrastructure, and we are working with the major data analytics vendors to enable containerized data analytics applications.
- Virtual machines (VMs) deployed on Red Hat OpenStack Platform. The advantage with this approach is that there is minimal porting of applications, so you can be up and running quickly. Many customers are in production with this architecture today, realizing the benefits of analytics workload isolation.
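For the container-based option, one common Kubernetes mechanism for keeping business units from starving each other is a per-namespace `ResourceQuota`. The post does not say which mechanism the reference architecture relies on, so treat the following purely as an illustrative assumption. The namespace name, quota name, limits, and the helper `resource_quota_manifest` are all hypothetical.

```python
import json


def resource_quota_manifest(namespace: str, cpu_requests: str,
                            memory_requests: str, cpu_limits: str,
                            memory_limits: str) -> dict:
    """Build a Kubernetes ResourceQuota that caps the total compute one
    tenant namespace can claim (hypothetical helper for illustration)."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "analytics-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu_requests,
                "requests.memory": memory_requests,
                "limits.cpu": cpu_limits,
                "limits.memory": memory_limits,
            }
        },
    }


# ASSUMPTION: example tenant and sizes, chosen only for illustration.
manifest = resource_quota_manifest("marketing-analytics",
                                   cpu_requests="40", memory_requests="256Gi",
                                   cpu_limits="80", memory_limits="512Gi")

# Kubernetes accepts JSON manifests as well as YAML.
print(json.dumps(manifest, indent=2))
```

With one such quota per business unit, each team's Spark jobs run inside their own budget while still reading from the same shared storage pool.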
By combining S3A with Red Hat Ceph Storage along with Red Hat OpenShift or Red Hat OpenStack Platform, organizations can provide more consistent service levels to data scientists and data engineers while rationalizing storage requirements and reducing dataset duplication. With this reference architecture from Red Hat, organizations can tame their data explosion while sidestepping a data implosion.
You can find a full copy of the Red Hat data analytics infrastructure reference architecture here.
To find out more about the Red Hat data analytics infrastructure solution, please see this short video on Red Hat's data analytics infrastructure solution:
For a slightly deeper dive, check out “Breaking down data silos with Red Hat infrastructure.”
If you’re attending the Strata Data Conference in San Francisco this week, stop by Red Hat’s booth to learn more about the solutions we’ve developed to make analytics easier.