Many modern enterprises have initiatives underway to modernize their IT infrastructures, and today that means moving workloads to cloud environments, whether they be public or private. On the surface, moving data platforms to a cloud environment shouldn’t be a difficult undertaking: Leverage cloud APIs to provision instances, and use those instances like their bare-metal brethren. For popular analytics workloads, this means running storage services in those instances that are specific to analytics and continuing a siloed approach to storage. This is the equivalent of lift and shift for data-intensive apps, a shortcut approach undertaken by some organizations when migrating an enterprise app to a cloud when they don’t have the luxury of adopting a more contemporary application architecture.
The following data platform principles pertain to moving legacy data platforms to either a public cloud or a private cloud. The private cloud storage platform discussed is Ceph, of course, a popular open-source storage platform used in building private clouds for a variety of data-intensive workloads, including MySQL DBaaS and Spark/Hadoop analytics-as-a-service.
Elasticity is one of the key benefits of cloud infrastructure, and running storage services inside your instances definitely cramps your ability to take advantage of it. For example, let’s say you have an analytics cluster consisting of 100 instances and the resident HDFS cluster has a utilization of 80 percent. Even though you could terminate 10 of those instances and still have sufficient storage space, you would need to rebalance the data, which is often undesirable. You will also be out of luck if months later you also realize that you’re only using half the compute resources of that cluster. If the infrastructure teams make a new instance flavor available, say with fancy GPUs for your hungry machine-learning applications, it’ll be much harder to start consuming them if it entails the migration of storage services.
This is why companies like Netflix decided to use object storage as the source of truth for the analytics applications, as detailed in my previous post What about locality? It enables them to expand, and contract, workload-specific clusters as dictated by their resource requirements. Need a quick cluster with lots of nodes to chew through a one-time ETL? No problem. Need transient data labs for data scientists by day only to relinquish those resources for use for reporting after hours? Easy peasy.
Before departing on the journey to cloudify an organization’s data platform architecture, an important first step is assessing the capabilities of the cloud infrastructure, public or private, you intend to consume to make sure it provides the features that are most important to data-intensive applications. World-class data infrastructure provides their tenants with a number of fundamental building blocks that lend power and flexibility to the applications that will sit atop them.
Persistent block storage
Not all data is big, and it’s important to provide persistent block storage for data sets that are well served by database workhorses like MySQL and Postgres. An obvious example is the database used by the Hive metastore. With all these workload clusters being provisioned and deprovisioned, it’s often desirable to have them interact with a common metadata service. For more details about how persistent block storage fits into the dizzying array of architectural decisions facing database administrators, I suggest a read of our MySQL reference architecture.
I also suggest infrastructure teams learn how to collapse persistent block storage performance and spatial capacity into a single dimension, all while providing deterministic performance. For this, I recommend watching the session I gave with several of my colleagues at Red Hat Summit last year.
Sometimes we need to access data fast, really fast, and the best way to realize that is with locally attached SSDs. Most clouds make this possible with special instance flavors, modeled after the i3 instances provided by Amazon EC2. In OpenStack, the equivalent would be instances where the hypervisor uses PCIe passthrough for NVMe devices. These devices are best leveraged by applications that handle their own replication and fault tolerance, good examples being Cassandra and Clustered Elasticsearch. Fast local devices are also useful for scratch space for intermediate shuffle data that doesn’t fit in memory, or even S3A buffer files.
Machine learning frameworks like TensorFlow, Torch, and Caffe can all benefit from GPU acceleration. With the burgeoning popularity of these frameworks, it’s important that infrastructure cater to them by providing instances flavors infused with GPU goodness. In OpenStack, this can be accomplished by passing through entire GPU devices in a similar fashion detailed in the Local SSD section, or by using GPU virtualization technologies like Intel GVT-g or NVIDIA GRID vGPU. OpenStack developers have been diligently integrating these technologies, I’d recommend operations folk understand how to deploy them once these features mature.
In both public and private clouds, deploying multiple analytics clusters backed by object storage is becoming increasingly popular. In the private cloud, a number of things are important to prepare a Ceph object store for data intensive applications.
Bucket sharding was enabled by default with the advent of Red Hat Ceph Storage 3.0. This feature spreads a bucket’s indexes across multiple shards with corresponding RADOS objects. This is great for increasing the write throughput of a particular bucket, but comes at the expense of LIST operations. This is because of the way index entries are interleaved, and needing to gather entries before replying to a LIST request. Today, the S3A filesystem client performs many LIST requests, and as such it is advantageous to disable bucket sharding with rgw_override_bucket index_max_shards set to 1.
Bucket indexes on SSD
The Ceph object gateway uses distinct pools for objects and indexes, and as such those pools can be mapped to different device classes. Due to the S3A filesystem client’s heavy usage of LIST operations, it’s highly recommended that index pools be mapped to OSDs sitting on SSDs for fast access. In many cases, this can be achieved even on existing hardware by using the remaining space on devices that house OSD journals.
Due to the immense storage requirements of data platforms, erasure coding for the data section of objects is a no brainer. Compared to 3x replication that’s common with HDFS, erasure coding reduced the required storage by 50%. When tens of petabytes are involved, that amounts to big savings! Most folks will probably end up using either 4+2 or 8+3 and spreading chunks across hosts using ruleset-failure-domain=host.