This Storage Tutorial was filmed live at Spark Summit East.
Our host, Brian Chang, is joined by Peter Wang, president of Continuum, along with show regulars Irshad Raihan and Greg Kleiman of Red Hat Big Data. Peter fills the group in on the buzz he is hearing at the conference, as well as the kinds of big data use cases he sees best supported on Spark. Read on for an excerpt of the conversation, and check out the video for the full discussion.
What is Continuum Analytics?
Continuum Analytics supports the use of open source data science tools, primarily around the Python programming language. Many of the core libraries in Python for data and scientific computing were written by principals at Continuum, and we’ve been heavily involved in PyData and in promoting the use of Python for data and analytics.
Tell our viewers what Spark is. What are you hearing about Spark at this conference?
It’s very exciting; this is my first time at the summit! I’m really excited to see the energy around the technology stack and around the things happening with Spark. The most interesting thing for me is that the Python world has been involved in high-end, very large data science and data analytics workloads for a long time, but the rise of Hadoop was a separate sort of thing. Python and R were outsiders in the Hadoop ecosystem. What’s interesting with Spark is that they are working really hard to ensure Python and R are native in the technology stack. It goes down even to the design of the underlying components in Spark: whether it is the scheduler or the resilient data structure, all of these things are exposed nicely in Python.
There’s great energy, great buzz here. The show floor is certainly smaller than Strata+Hadoop, so you feel like this is an event that will grow as time goes on. A lot of the energy behind Spark is because it has taken the storage efficiencies of Hadoop and made them more accessible to a wider audience. A lot of people were not thrilled about doing MapReduce jobs in Java; they’d rather do them in Python, but that connection was tenuous. Now, with Python being a first-class citizen in the Spark ecosystem, a lot of people, at least the Python folks I’ve talked to with Hadoop workloads, are excited about that.
Tell us about those high-end workloads you mentioned.
There are a lot of people doing traditional cluster-level workloads who use Red Hat in the cluster and Python to drive the computation. As Hadoop has emerged, and Spark has emerged on top of Hadoop, we’re seeing a lot of these people doing exploratory data science and analytics with Python on a workstation, but then they have to port to larger-scale equipment. There’s a workflow impediment, a mismatch, between the workload they can do on their machine, which doesn’t have a petabyte of storage attached to it, and the work they need to do at scale. They do the work on a subset and then do the work at scale, and for that moving back and forth we’ve built tools like Anaconda Cluster that ease the transition. But the actual storage of the bits... at the end of the day, we all know that when you do computation at scale you have to move code to data.
So where the data sits is important. How the data is formatted, what file systems it lives on, and what walled gardens are built around it all limit what you can do. It’s unfortunate. Your storage should be flexible; it should give you scale and resiliency without limiting what you can do.
Watch the video for more of the conversation!