
This Storage Tutorial was filmed live at Spark Summit East.

 

Our host, Brian Chang, is joined by Peter Wang, president of Continuum, along with show regulars Irshad Raihan and Greg Kleiman of Red Hat Big Data. Peter fills the group in on the buzz he is hearing at the conference, as well as the big data use cases he sees best supported on Spark. Read on for an excerpt of the conversation, and check out the video for the full discussion.

What is Continuum Analytics?
Continuum Analytics supports the use of open source data science tools, primarily around the Python programming language. Many of the core libraries in Python for data and scientific computing were written by principals at Continuum, and we’ve been heavily involved in PyData and in promoting the use of Python for data and analytics.

Tell our viewers what Spark is, and what are you hearing about Spark at this conference?
It’s very exciting, this is my first time at the summit! I’m really excited to see the energy around the technology stack and around the things happening with Spark. The most interesting thing for me is that the Python world has been involved in high-end, very large data science and data analytics workloads for a long time, but the rise of Hadoop was a separate sort of thing. Python and R were outsiders in the Hadoop ecosystem. What’s interesting with Spark is that they are working really hard to ensure Python and R are native in the technology stack. It goes down even to the design of the underlying components in Spark: whether it is the scheduler or the resilient data structures, all of these things are exposed nicely in Python.

There’s great energy, great buzz here. The show floor is certainly smaller than Strata+Hadoop, so you feel like this is an event that will grow as time goes on. A lot of the energy behind Spark is because it has taken the storage efficiencies of Hadoop and made them more accessible to a wider audience. A lot of people were not thrilled about doing MapReduce jobs in Java. They’d rather do them in Python, but that connection was tenuous. Now, with Python being a first-class citizen in the Spark ecosystem, a lot of people, at least the Python folks I’ve talked to with Hadoop workloads, are excited about that.

Tell us about those high-end workloads you talked about.
There are a lot of people doing traditional cluster-level workloads, using Red Hat in the cluster and using Python to drive the computation. As Hadoop has emerged, and Spark has emerged on top of Hadoop, we’re seeing a lot of these people doing exploratory data science and analytics with Python on a workstation, but then they have to port to larger-scale equipment. There’s a workflow impediment, a mismatch there, between the work they can do on their machine, which doesn’t have a petabyte of storage attached to it, and the work they need to do at scale. They do the work on a subset and then do the work at scale, and for that moving back and forth we’ve built tools like Anaconda Cluster that ease the transition. But the actual storage of the bits... at the end of the day, we all know that when you do computation at scale you have to move code to data.

So where the data sits is important. How the data is formatted, what file systems it lives on, and what walled gardens are built around it all limit what you can do. It’s unfortunate. Your storage should be flexible; it should give you scale and resiliency without limiting what you can do.

Watch the video for more of the conversation!

