We at Red Hat are proud to have the opportunity to work with so many interesting and innovative organizations. One such group is the Mass Open Cloud (MOC), a non-profit initiative that includes universities, government organizations and businesses, and provides reliable and cost-effective storage to support both its public and private clouds built on Red Hat OpenStack Platform. In addition to OpenStack, the MOC has deployed Red Hat Ceph Storage as the storage foundation for its innovative research and big data analytics. This post showcases the importance of Ceph storage in the work Red Hat is doing with the MOC.
Collaboration Between MOC and Ceph
The MOC was formed in large part to offer organizations, including businesses, universities, governments and nonprofits, a way to store large amounts of data and extract meaningful insights from it. The goal of the MOC is to provide these groups with a common, cloud-based infrastructure on which researchers can store, share and analyze data. However, most public clouds are built in a closed environment and operated by a single provider, which limits flexibility and is not ideal for the organizations that build on the MOC. The MOC needed to create a public cloud that is inexpensive, efficient and highly scalable, so it made sense to turn to open source solutions.
The MOC chose Red Hat OpenStack Platform as the underlying infrastructure foundation because it is cost-effective and can support a large number of contributors. It quickly became clear that a storage solution was also needed, and the MOC worked with Red Hat Consulting to deploy Red Hat Ceph Storage. The MOC runs three storage clusters - a production environment, a research and experimentation cluster and an internal testing cluster - and Ceph allows it to expand its storage to meet researchers’ ever-growing needs for developing innovative, data-intensive applications while also performing detailed analysis. Ceph Storage also provides rapid recovery from issues and high reliability for critical research project data.
Northeast Storage Exchange
One of the advantages that open source technologies and Ceph storage give the MOC is the ability to build innovative data solutions without having to rely on new technology platforms. One such project is the Northeast Storage Exchange (NESE), which is funded by a recently awarded National Science Foundation (NSF) grant that is helping to fund a national cloud testbed for the research and development of new cloud computing platforms. Specifically, NESE allows advanced researchers, including physicists, biochemists and others, to generate large amounts of data and actually have the room to store it. This matters because computers and sensors have become so much faster and better that we are now able to collect larger quantities of data than ever before. Within this data could live the answers to big, potentially humanity-changing questions, such as potential cures for cancer. The issue with such enormous amounts of data is storage: where to store it, and how to store it in a way that is both cost-effective and accessible. Researchers had been finding that their data was either scattered in such a way that it was difficult to run computations on, or that some of it was being thrown away with no way to determine what was being disposed of.
NESE works to solve the issue of data storage for science by offering a giant central data repository accessible to many universities. It allows multiple researchers from multiple universities to both store and access data for the advancement of scientific research, which is critical for scientists of any discipline doing research on data. With NESE, a researcher can gather the data they need and then layer other applications on top of it, such as analysis through artificial intelligence (AI) and machine learning (ML), to glean insights. With NESE running Ceph storage, the data stored is replicated across multiple drives, which also takes care of the issue of backing up the data. NESE is significant because it is one of the first times that open source software has been used on a data store of this scale. Ceph storage gave the researchers the opportunity to store massive amounts of data in a cost-effective way and in a manner that can be easily layered for easier data abstraction, to advance what is often mission-critical scientific research. With the NSF grant, this research will be able to continue and expand.
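To make the idea of replicating data across multiple drives a little more concrete, here is a minimal sketch in Python. It is not how Ceph places data (Ceph uses its own placement algorithm, CRUSH); it swaps in plain rendezvous hashing as a simple stand-in. The function name, the drive names and the choice of three copies are all assumptions made for illustration.

```python
import hashlib
from typing import List


def place_replicas(object_name: str, drives: List[str], copies: int = 3) -> List[str]:
    """Pick `copies` distinct drives for an object using rendezvous hashing.

    Hypothetical illustration only: Ceph itself uses CRUSH, not this scheme.
    """
    if copies > len(drives):
        raise ValueError("not enough drives for the requested number of copies")

    def score(drive: str) -> int:
        # Every (drive, object) pair gets a stable pseudo-random score.
        digest = hashlib.sha256(f"{drive}:{object_name}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest-scoring drives hold the copies, so no drive is chosen twice.
    return sorted(drives, key=score, reverse=True)[:copies]


# Hypothetical drive names and object name.
drives = [f"drive-{i}" for i in range(6)]
replicas = place_replicas("experiment-42.dat", drives)
```

In this toy model, if one of the chosen drives fails, the other copies stay where they were and only one new copy has to be placed on another drive.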
Datacenter-Data-Delivery Network (D3N)
In addition to NESE, the MOC research team is creating a datacenter-data-delivery network, D3N, which is a novel multi-layer cooperative caching architecture for object stores that is currently in production. It is designed to accelerate big data analytics workloads that have strong locality traits and limited network connectivity between compute clusters and data storage. One of the biggest advantages for an organization is the speed at which it can glean insights from the data it has, in addition to how useful these insights will be. However, the more data you collect, the harder it can be to actually use that data - somewhat of a Catch-22. To help with large-scale data analysis, it is fairly common to use data lakes, which are large repositories that store and share terabyte and petabyte data sets. D3N - based on Ceph - improves the performance of big data jobs running in analytics clusters by increasing the speed at which reads and writes take place in the data lake. There are three components to the D3N architecture:
Cache servers, to which client requests are directed; they act as proxies for the back-end object store and store data locally for re-use
Lookup service, so researchers can look up what they need from local servers
Heartbeat service, which tracks the set of active caches
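The three components above can be sketched as a small in-memory model. This is an illustrative toy in Python, not the actual D3N implementation: every class and method name is hypothetical, the heartbeat timeout is an assumed value, and routing each key to a cache with a hash is just one simple way a lookup service might work.

```python
import hashlib
import time
from typing import Dict, List, Optional


class BackendObjectStore:
    """Stand-in for the data lake's back-end object store (in-memory, hypothetical)."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = dict(objects)
        self.reads = 0  # number of requests that reached the back end

    def get(self, key: str) -> bytes:
        self.reads += 1
        return self._objects[key]


class CacheServer:
    """Proxy for the back end that keeps a local copy of data for re-use."""

    def __init__(self, cache_id: str, backend: BackendObjectStore) -> None:
        self.cache_id = cache_id
        self._backend = backend
        self._local: Dict[str, bytes] = {}
        self.hits = 0

    def get(self, key: str) -> bytes:
        if key in self._local:  # served locally, no trip to the back end
            self.hits += 1
            return self._local[key]
        data = self._backend.get(key)  # miss: fetch once, then keep a copy
        self._local[key] = data
        return data


class HeartbeatService:
    """Tracks the set of active caches from their most recent heartbeat."""

    def __init__(self, timeout_s: float = 5.0) -> None:  # timeout is an assumption
        self._timeout_s = timeout_s
        self._last_seen: Dict[str, float] = {}

    def beat(self, cache_id: str, now: Optional[float] = None) -> None:
        self._last_seen[cache_id] = time.monotonic() if now is None else now

    def active_caches(self, now: Optional[float] = None) -> List[str]:
        now = time.monotonic() if now is None else now
        return sorted(c for c, t in self._last_seen.items()
                      if now - t <= self._timeout_s)


class LookupService:
    """Tells a client which active cache server to ask for a given key."""

    def __init__(self, heartbeat: HeartbeatService,
                 caches: Dict[str, CacheServer]) -> None:
        self._heartbeat = heartbeat
        self._caches = caches

    def locate(self, key: str, now: Optional[float] = None) -> CacheServer:
        active = self._heartbeat.active_caches(now)
        if not active:
            raise RuntimeError("no active cache servers")
        # Stable hash so the same key maps to the same cache while membership holds.
        digest = hashlib.sha256(key.encode()).digest()
        return self._caches[active[int.from_bytes(digest[:8], "big") % len(active)]]
```

In this model a client asks the lookup service for a cache, reads through it, and repeated reads of the same object are served locally instead of travelling back to the object store. Caches that stop sending heartbeats drop out of the active set and no longer receive requests.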
The data stored in the data lake can be thought of as a funnel, so the narrower the access, the harder it will be to retrieve the data. If you cache over a wider network in different cloud data centers - such as what Ceph allows for - the data can be accessed across multiple data centers, so anybody anywhere in the world can access it much faster. More on D3N can be found in the latest issue of Red Hat’s Research Quarterly.
While storage is an immensely important foundation for gaining insights from data, one of the biggest advantages of Ceph is that it is open source. The advantages of running data lakes on open source technology move beyond the technology itself to establishing a holistic research culture. There has historically been a massive barrier for grad students entering the field. Through universities’ collaboration with the MOC, Red Hat and open source tools, researchers can work together faster than they could in closed silos, allowing for greater accessibility and collaboration.
About the authors
Hugh Brock is the Research Director for Red Hat, coordinating Red Hat research and collaboration with universities, governments, and industry worldwide. A Red Hatter since 2002, Hugh brings intimate knowledge of the complex relationship between upstream projects and shippable products to the task of finding research to bring into the open source world.