Data Services for the open hybrid cloud deliver on the promise of cloud-native infrastructure

September 29, 2020 | Mike Piech | 4-minute read

Data is often the elephant in the room. It is obvious that applications are useless without data, that data is no less important now than it was at the dawn of computing, and that there’s no end in sight to the exponential growth of data. The term “exponential” is tossed about rather flippantly these days — it’s easy to lose sight of its basic mathematical implications — but some analysts suggest that more data will be created in the next three years than has been created in the last thirty.
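
To make that mathematical implication concrete, here is a minimal sketch that assumes total data volume follows a pure exponential curve with a fixed doubling time. The model, the function names, and the sample doubling times are illustrative assumptions, not figures from any analyst. Under that model, the data created in the next three years outpaces the data created in the last thirty once the doubling time is roughly three years or less.

```python
def cumulative_data(t_years: float, doubling_years: float) -> float:
    """Total data in existence at time t, in units of today's total (t = 0)."""
    return 2.0 ** (t_years / doubling_years)


def created_between(start: float, end: float, doubling_years: float) -> float:
    """Data created between two points in time, in units of today's total."""
    return cumulative_data(end, doubling_years) - cumulative_data(start, doubling_years)


def next_exceeds_past(next_years: float, past_years: float, doubling_years: float) -> bool:
    """True if the coming window creates more data than the past window did."""
    upcoming = created_between(0.0, next_years, doubling_years)
    previous = created_between(-past_years, 0.0, doubling_years)
    return upcoming > previous


# Assumed doubling times, chosen only for illustration.
for doubling in (2.0, 3.0, 4.0):
    print(doubling, next_exceeds_past(3.0, 30.0, doubling))
```

With these assumed values, the check holds for doubling times of two and three years and fails for four, which shows how much the claim depends on the growth rate.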

Most people in technology are familiar with Moore’s Law, originally an observation that the number of transistors on a chip doubles every two years, which roughly translates into compute capability doubling commensurately. The specific phenomenon of transistor density doubling held for many years but eventually flattened as various physical asymptotes were approached. However, pulling the camera back and looking at the bigger picture, compute capability continued its trajectory thanks to other contributing factors such as better parallelization.

So what does this mean for data and storage? One important analogy to consider is that, just as we couldn’t keep getting exponential growth in compute by simply increasing transistor density, a given enterprise is unlikely to get sustained value from its growing data simply by adding more storage arrays to its network. Different ways of dealing with data, analogous to parallelization and other ways of accelerating compute despite flattened transistor density, are needed going forward.

Data Services in the Cloud-Native World

Enter the world of cloud-native data services. “Cloud-native” is perhaps a bit of an overloaded term in industry buzz-speak, but it is at this point reasonably well established as implying the use of fine-grained modularization (containers) and a means of automating the orchestration of large numbers of modules (Kubernetes).

Containers have enabled developers to structure applications as composites of many small modules (microservices), bringing benefits such as easier, more rapid incremental innovation with less risk and disruption, as well as greater operational flexibility and resilience when capacity and placement needs evolve. Red Hat OpenShift brings all this together in a Kubernetes-based enterprise cloud platform for development and operations.

With large numbers of small immutable workloads being constantly spun up and down in a microservices environment, the assumption of static, long-running data connections becomes problematic.

In the old world of monolith-to-monolith, application-to-database architectures, the overhead of establishing a connection wasn’t a big deal, because a connection, once made, stayed open for a long time. Now there is an impedance mismatch between monolithic data stores and distributed, fine-grained workloads.

Technologies like Ceph (and its enterprise counterpart Red Hat OpenShift Container Storage) bridge this gap and match existing and new storage hardware through a software-defined abstraction that enables microservices to get the fast, automatic attach and detach they need.

Data at Rest, Data in Motion, and Data in Action

But it’s not just about connecting to simple storage. Of course, the need for traditional storage functions such as backups, replication, and security doesn’t go away in a cloud-native data services world. They are just initiated and managed in new ways, in many cases much more automatically.

This is where Ceph’s software-defined storage capabilities are a powerful complement to Kubernetes’ machinery for dynamically provisioning workloads with the right persistence functionality. Many of these capabilities are about data “at rest.”
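
As a rough illustration of what dynamic provisioning looks like from the workload's side, the sketch below builds a Kubernetes PersistentVolumeClaim manifest, the object a workload uses to request storage from a storage class. The claim name, the size, and the storage class name `example-ceph-block` are hypothetical placeholders. The storage classes actually available depend on how a given cluster and its Ceph-backed storage are configured.

```python
import json


def build_pvc(name: str, size: str, storage_class: str) -> dict:
    """Return a PersistentVolumeClaim manifest as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteOnce: the volume is mounted read-write by a single node.
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# All three values below are hypothetical placeholders.
pvc = build_pvc("orders-db-data", "10Gi", "example-ceph-block")
manifest = json.dumps(pvc, indent=2)
print(manifest)
```

Because Kubernetes accepts JSON manifests as well as YAML, the printed output could be saved to a file and passed to `kubectl apply -f`. The cluster's provisioner for that storage class would then create and bind a matching volume without anyone allocating it by hand.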

Applications often pull data from multiple sources to carry out a task, and increasingly such aggregation is expected on demand — last night’s batch job is already stale. This is an area where the data services approach really shines — developers can rely on Kubernetes automation to dynamically connect data sources, sometimes streaming with Apache Kafka, sometimes triggering serverless functions with events, to handle data “in motion.”
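
The idea of aggregating several sources on demand, instead of waiting for a batch job, can be sketched with nothing but the Python standard library. This is a simplified stand-in, not Kafka or a serverless framework: the `Event` type, the source names, and the sample records are all hypothetical, and the in-memory lists take the place of what might be Kafka topics in a real deployment.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    timestamp: float  # seconds since some shared reference point
    source: str       # hypothetical source name
    payload: str


def merge_streams(*streams: Iterable[Event]) -> Iterator[Event]:
    """Lazily interleave streams that are each already ordered by timestamp."""
    return heapq.merge(*streams, key=lambda event: event.timestamp)


# Hypothetical sources standing in for live feeds.
orders = [Event(1.0, "orders", "order placed"), Event(4.0, "orders", "order shipped")]
clicks = [Event(2.0, "clicks", "viewed item"), Event(3.0, "clicks", "added to cart")]

merged = list(merge_streams(orders, clicks))
for event in merged:
    print(event.timestamp, event.source, event.payload)
```

`heapq.merge` pulls one event at a time from each source, so a consumer sees a single time-ordered view as events arrive and never has to wait for a complete batch.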

When that disparate data has been brought together, it can have impact. A data service can populate a list of recommended next actions. A trained model can help identify whether a lung X-ray indicates potential cancer. A continuously learning model can help a self-driving car avoid a pedestrian. This is data “in action.”

The Future: AI/ML

Even in our COVID-impacted reality, machine learning continues to be a strong driver of the expanding need for data capabilities, both in raw capacity and in new functionality. Model training entails aggregating large amounts of data (the larger the better) in a temporary structure. A mature learning environment likely has a sophisticated data pipeline that feeds a training regimen executed on a regular basis for continuous model refinement. All of this motivates the need for a new sort of data processing platform.

Red Hat has been incubating such a platform in the Open Data Hub open source project. Open Data Hub combines Ceph, Kubeflow, Apache Spark, Jupyter, Kafka, Seldon, Argo CD, and other open source projects to create a comprehensive yet pluggable and configurable environment to support a variety of machine learning use cases. We use it today underneath Red Hat Insights, and it has been used by Red Hat Consulting in a number of customer deployments. Look for continued development in this area!

Conclusion

For operations folks, storage has long been a critical infrastructure element to get right. That is even more true today. For developers, storage has long been something buried deep in the infrastructure that they probably didn’t care about (until it broke). Today, the mutually reinforcing drivers of microservices and machine learning demand a new approach, with data capabilities expressed as cloud-native data services that empower the developer and delight the operator.

Onward to the open hybrid cloud!

About the author

Mike Piech

Vice President & General Manager, Cloud Storage & Data Services

Imaginative but reality-grounded product exec with a passion for surfacing the relevant essence of complex technology. Strong technical understanding complemented by ability to explain, excite, and lead. Driven toward challenge and the unknown.
