This Storage Tutorial was filmed live at Spark Summit East.

Our host, Brian Chang, is joined by Peter Wang, president of Continuum, along with show regulars Irshad Raihan and Greg Kleiman of Red Hat Big Data. Peter fills the group in on the buzz he is hearing at the conference, as well as which big data use cases he's seeing best supported on Spark. Read on for an excerpt of the conversation, but check out the video for the full discussion.

What is Continuum Analytics?
Continuum Analytics supports the use of open source data science tools, primarily around the Python programming language. Many of the core libraries in Python for data and scientific computing were written by principals at Continuum, and we've been heavily involved in PyData and in promoting the use of Python for data and analytics.

Tell our viewers what Spark is, and what are you hearing about it at this conference?
It's very exciting; this is my first time at the summit! I'm really excited to see the energy around the technology stack and around the things happening with Spark. The most interesting thing for me is that the Python world has been involved in high-end, very large data science and analytics workloads for a long time, but the rise of Hadoop was a separate sort of thing. Python and R were outsiders in the Hadoop ecosystem. What's interesting with Spark is that they are working really hard to ensure Python and R are native in the technology stack. It goes all the way down to the design of the underlying components in Spark, whether it is the scheduler or the resilient distributed dataset (RDD); all of these things are exposed nicely in Python.
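To make that concrete, here is a minimal PySpark sketch of driving the RDD API directly from Python. It assumes a local Spark installation, and the data and operations are purely illustrative:

```python
# Minimal sketch: Spark's RDD (resilient distributed dataset) API,
# driven directly from Python via PySpark. Assumes PySpark is installed
# and a local Spark runtime is available.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

# The scheduler distributes these partitions across local workers;
# the Python API mirrors the Scala API call-for-call.
numbers = sc.parallelize(range(1, 101), numSlices=4)

evens_squared = (numbers
                 .filter(lambda n: n % 2 == 0)  # transformation: lazy
                 .map(lambda n: n * n))         # transformation: lazy

total = evens_squared.reduce(lambda a, b: a + b)  # action: runs the job
print(total)  # 171700

sc.stop()
```

The transformations stay lazy until the `reduce` action fires, at which point Spark's scheduler plans and executes the distributed job; the Python programmer never leaves Python.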

There's great energy, great buzz here. The show floor is certainly smaller than Strata+Hadoop, so you feel like this is an event that will grow as time goes on. A lot of the energy behind Spark is because it has taken the storage efficiencies of Hadoop and made them more accessible to a wider audience. A lot of people were not thrilled about doing MapReduce jobs in Java; they'd rather do them in Python, but that connection was tenuous. Now with Spark, and with Python being a first-class citizen in the Spark ecosystem, a lot of people, at least the Python folks I've talked to with Hadoop workloads, are excited about that.

Tell us about those high-end workloads you mentioned.
There are a lot of people doing traditional cluster-level workloads using Red Hat in the cluster and Python to drive the computation. As Hadoop has emerged, and Spark has emerged on top of Hadoop, we're seeing a lot of these people doing exploratory data science and analytics with Python on a workstation, but then they have to port that work to larger-scale equipment. There's a workflow impediment, a mismatch, between the work they can do on their machine, which doesn't have a petabyte of storage attached to it, and the work they do at scale. To ease that moving back and forth between the subset and the full dataset, we've built tools like Anaconda Cluster that smooth the transition, but the actual storage of the bits… at the end of the day, we all know that when you do computation at scale you have to move code to data.
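A rough sketch of that workstation-to-cluster handoff might look like the following. This is not Anaconda Cluster itself, just an illustration of the workflow mismatch Peter describes; the file paths and column names are hypothetical:

```python
# Sketch of the two-stage workflow: explore a local subset with pandas,
# then rerun the same logic at scale with PySpark, where the code
# moves to the data. Paths and column names are hypothetical.
import pandas as pd
from pyspark.sql import SparkSession

# --- Workstation: a sample small enough for local memory ---
sample = pd.read_csv("events_sample.csv")
local_counts = sample.groupby("user_id").size()

# --- Cluster: the full dataset living on distributed storage ---
spark = SparkSession.builder.appName("scale-out").getOrCreate()
events = spark.read.csv("hdfs:///data/events/*.csv", header=True)
cluster_counts = events.groupBy("user_id").count()
cluster_counts.write.parquet("hdfs:///data/event_counts")
spark.stop()
```

The analysis logic is nearly identical in both halves, but the port still has to happen by hand, which is exactly the friction tooling like Anaconda Cluster aims to reduce.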

So where the data sits is an important place. How the data is formatted, what file systems it lives on, and what walled gardens are built around it all limit what you can do. It's unfortunate. Your storage should be flexible; it should give you scale and resiliency without limiting what you can do.

Watch the video for more of the conversation!

