Anatomy of the S3A filesystem client

19 juillet 20189 minutes (temps de lecture)

Amazon introduced their Simple Storage Service (S3) in March 2006, which proved to be a watershed moment that ushered in the era of cloud computing services. It wasn’t long before folks started trying to use Amazon S3 in conjunction with Apache Hadoop; in fact, the first attempt was the S3 block filesystem, which was completed before the end of the year. This early integration stored data in a way that facilitated fast renames and deletes, but came at the expense of not being able to access data that had been written to S3 directly. Accessing data written by other applications was highly desirable, and by 2008 the S3N, or S3 Native Filesystem, was merged into Apache Hadoop. The following year, Amazon introduced Elastic MapReduce, which included Amazon’s own close-sourced S3 filesystem client.

Netflix was an early adopter of Elastic MapReduce, which was used to analyze data to improve streaming quality. This workload was one of three detailed in Amazon’s press release announcing Netflix’s intent to migrate a variety of applications to Amazon. Fast forward to 2013 and “any data set worth retaining” was stored in S3. To the best of my knowledge, this was the first public reference of a multi-cluster data platform that used S3 for shared storage. With all the chips on the table, Netflix and other cloud heavyweights started seriously thinking about how to better leverage the S3 API and improve S3 client performance. This led to the development of the successor to S3N, named S3A. The development of the S3A filesystem client has manifested as a series of phases and is still seeing loads of active development from the likes of Netflix, Cloudera, Hortonworks, and a whole host of others. Each phase of development is tracked in the Hadoop JIRA:

Phase I, available in Hadoop 2.7.0
Phase II, available in Hadoop 2.8.1
Phase III, available in Hadoop 2.9.0 / 3.0.0
Phase IV, available in Hadoop 3.1.0
Phase V, targeting Hadoop 3.2.0

Downstream Hadoop distributions have done a terrific job of ensuring that juicy features are expeditiously backported, so even if your vendors distribution is based on an older Hadoop version, it’s likely that they are ready to consume data stored in S3.

Working with Ceph

Back in the early days of Ceph, we quickly came to the realization that it would be useful to allow folks to use the S3 API to interact with their Ceph storage infrastructure. By doing so, we’d be able to leverage the might of the Amazon ecosystem, including all the SDKs and tools written to interact with Amazon S3. The Ceph object gateway was conceived for this application and is an essential ingredient in providing private infrastructure operators a means to extend cloud object storage modalities into their data center. The S3 API has an ever-expanding set of calls, and keeping up requires a lot of hard work. To keep us honest, we actively develop a functional test suite affectionately named s3-tests. The test suite is so useful that it’s even been adopted by other storage vendors to ensure their products can have the same high level of fidelity with the S3 API—How flattering!

So Ceph and S3 are like two high school buddies, and that’s great. But what's required to ensure maximum fidelity with Amazon S3?

Endpoints

The Amazon S3 API provides a high-level container for objects which, in S3 parlance, is called a bucket. All objects are stored in exactly one bucket, and each bucket is mapped to a single Amazon S3 region. If you send a GET request for an object that lives in a bucket in another region, you’ll get a 302 redirect to the proper region. PUT requests sent to the wrong region will fail and be provided with the endpoint of the region the bucket calls home. If you have multiple Ceph clusters and want to emulate this behavior, you can do so by having each cluster be a distinct zone, in a distinct zonegroup, but with a common realm. For more details, refer to Ceph multi-site documentation.

If data engineers have a level of understanding about different endpoints available to them, then they can set fs.s3a.endpoint in a properties file such that it directs requests to the correct Ceph cluster or Amazon S3 region.

If you want to be helpful and know that an analytics cluster will only be talking to a single Ceph cluster, and not Amazon S3, then you might set the fs.s3a.endpoint in the core-site.xml. If you plan on having multiple Ceph clusters, or communicating with a Ceph cluster and the public cloud, then you have a few options. One is to set a default f3.s3a.endpoint in the analytics clusters’ core-site.xml to either an Amazon S3 regional endpoint or a Ceph endpoint. Application owners can still override this default with a properties file.

With S3A from Apache Hadoop 2.8.1 users can define different S3A properties for different buckets. This is handy for applications that might operate on data sets in one or more Ceph cluster or Amazon S3 region.

Access control

Now that there is a level of understanding around endpoints, we can move on to authentication and access control. Amazon S3 has evolved over the years to provide more flexible control over who has access to what. Coarsely, the evolution can be broken down into multiple approaches.

S3 ACLs
Bucket policy with S3 users
STS issued temporary credentials

The first approach is table stakes for any storage system that wants to allow applications to use the S3 API to interact with them. Each user has an access key and a secret key which they use to sign requests. Buckets always belong to a single user, and buckets can be specified as either public or private. Public buckets is the only means of sharing data between users, which is pretty inflexible. Ceph has supported basic S3 ACLs since the Ceph object gateway was introduced, and it’s not uncommon to see this be the only means of providing access control with other systems that tout an S3 compatible API. These access and secret keys can be mapped to the fs.s3a.access.key and fs.s3a.secret.key parameters in core-site.xml, in a properties file submitted with an application, or the most secure option: storing them in encrypted files using the Hadoop Credentials API.

Ceph doesn’t stop there. With Red Hat Ceph Storage 3.0, based on upstream Luminous, we added support for bucket policy, which allows cross user bucket sharing. This means one user's private bucket can be shared with another user through bucket policy.

This is all good and well, but sometimes you want to manage access control for groups of users, instead of having to update a bunch of bucket policies whenever a user needs to be removed from a group. Amazon S3 doesn’t yet have its own notion of groups. Instead, Amazon has IAM, which is a means of enforcing role-based access control. IAM supports users, groups, and roles. This means you can create policies for a group, instead of having policy entries for each individual user. Unfortunately, IAM groups cannot be principals in bucket policies. IAM roles are similar to a user, but they do not have credentials. IAM roles delegate to some other authentication provider, and that authentication provider decides if the request is allowed to assume a particular role.

This is where Amazon STS comes in. STS issues temporary authentication tokens, which are mapped to IAM roles, and that mapping is attested by an external identity provider. As such, the external credentials provider can attest to whether a particular user can receive a token that allows that user to assume a IAM role. The S3A filesystem client supports the use of these tokens by changing the S3A Credentials Provider and setting the fs.s3a.session.token parameter in addition to fs.s3a.access.key and fs.s3a.secret.key. The upstream community is currently working on STS and IAM support in the Ceph object gateway, which will bring all this goodness to folks interacting with Ceph object stores. If this is important to your organization, we’re interested in getting feedback that helps us prioritize STS actions like AssumeRoleWithSAML and AssumeRoleWithWebIdentity.

Bucket prefix vs. path

The AWS Java SDK for S3 default is to use bucket prefix notation when sending requests and, by extension, so does the S3A filesystem client. If your Ceph object gateway endpoint is s3.example.com, then requests are sent to bucket.s3.example.com. To ensure your Ceph storage infrastructure behaves the same way as Amazon S3, you’ll need to configure a few things on the infrastructure side. The first is including the rgw_dns_name parameter in the [rgw] or[global] block of your ceph.conf configuration file. The value of this parameter in this scenario would be s3.example.com. Now, in order for the client to resolve the bucket subdomain, you’ll also need a wildcard DNS record in the form of *.s3.example.com that resolves to your gateway virtual IP address. To support SSL with bucket prefix notation, you’ll need to use a certificate with a wildcard subject alternative name (SAN) wherever it is being used to terminate TLS.

If you’re not on the infrastructure side of things, and you want to consume Ceph object infrastructure where this machinery hasn’t been configured, then you can opt for path style access by setting fs.s3a.path.style.access to true. In this configuration, requests will be sent to s3.example.com/bucket instead of bucket.s3.example.com.

There is a third trick, which might be helpful in unusual scenarios, and that's to create a bucket with uppercase characters. Because DNS isn’t case sensitive, the SDK automatically switches over to path style access.

Encryption

We talked a bit about SSL/TLS in the previous section, but only insofar as how to configure things on the infrastructure side. The S3A filesystem client will default to using SSL/TLS. Changing this behavior is done by switching the fs.s3a.connection.ssl.enabled parameter to false. This is useful in scenarios like testing, or when SSL isn’t yet configured for your Ceph storage infrastructure. In a production environment, the adage “dance like nobody's watching, and encrypt like everyone is,” still applies. Check out this blog post for a thorough walk through on configuring SSL/TLS for you Ceph object storage system.

In addition to transport encryption, objects can be encrypted. Amazon S3 provides two categories of object encryption: (1) encryption at the client and (2) server-side encryption. The S3A filesystem client does not support using client-side encryption, but it does support all three varieties of server-side encryption. The three server-side encryption options are:

SSE-S3: Keys managed internally by Amazon S3
SSE-C: Keys managed by the client and passed to Amazon S3 to encrypt/decrypt requests
SSE-KMS: Keys managed through Amazon’s Key Management Service (KMS)

With the advent of Red Hat Ceph Storage 3.0, we support the SSE-C flavor of server-side encryption. Configuring the S3A filesystem client to use it is a simple affair, involving only two parameters: (1) fs.s3a.server-side-encryption-algorithm should be set to SSE-C and (2) the value of fs.s3a.server-side-encryption.key should be set to your secret key. I suspect most folks will want to do this by way of a properties file.

Upstream Ceph also has support for SSE-KMS, which is intended to be integrated with OpenStack Barbican for key management. I imagine it’s only a matter of time before this is also supported in Red Hat Ceph Storage.

Fast upload

There are two main ways of performing writes with the S3A filesystem client. The default mode buffers uploads to disk before sending a request to Amazon S3; it's important to make sure that the directory you are buffering to is large and fast. I prefer to specify a directory supported by a local SSD. Even with fast media, this can be expensive from an IO perspective, so an alternative is to use in memory buffers. The S3A filesystem client offers two in memory buffer options: (1) one using arrays and (2) one using byte buffers. You have to be careful when using memory buffers, because it's easy to run out if you don't properly size both your JVM and YARN containers. If the available memory permits you to use in memory buffers, they can be much faster than their disk-based brethren.

Giving it a try

There's an easy way to learn how to use the S3A filesystem client with Ceph, and this section will walk you through setting up a development environment using Minishift. If you’ve never used Minishift before, it’s relatively painless to set up. The guides for installation on Mac OSX, Windows, and Linux can be found here.

Once you’ve installed Minishift on your local system, drop into a terminal and fire it up.

minishift start

Now that we have a Minishift environment running, we’re going to get Ceph Nano running so we can use it as a S3A endpoint. To get started, you'll need to download a couple of YAML files: ceph-nano.yml and ceph-rgw-keys.yml. Once those files are downloaded to your working directory, we can use the oc command line utility to deploy Ceph Nano:

oc --as system:admin adm policy add-scc-to-user anyuid \
  system:serviceaccount:myproject:default
oc create -f ceph-rgw-keys.yml
oc create -f ceph-nano.yml
oc expose pod ceph-nano-0 --type=NodePort

You should have a Ceph Nano service running at this point and may be wondering how you're going to interact with it. Jupyter Notebooks have become wildly popular in the analytics and data-science communities because of their ability to create a single artifact, the notebook, which contains both code and documentation. As if that wasn’t enough, you can interact with them from the comfort of your web browser. Here at Red Hat, we’ve been fostering an awesome community called radanalytics.io. They’re hard at work empowering intelligent application development on OpenShift. They’ve provided a base notebook application that I used as a starting point for playing with S3A from Jupyter, because its container image is neatly bundled with Spark and PySpark libraries. You can get Jupyter up and running in your Minishift environment with just a few commands:

oc new-app https://github.com/mmgaggle/ceph-notebook \
  -e JUPYTER_NOTEBOOK_PASSWORD=developer \
  -e RGW_API_ENDPOINT=$(minishift openshift service ceph-nano-0 --url) \
  -e JUPYTER_NOTEBOOK_X_INCLUDE=http://mmgaggle-bd.s3.amazonaws.com/ceph-notebook.ipynb

oc env --from=secret/ceph-rgw-keys dc/ceph-notebook
oc expose svc/ceph-notebook
oc status

The "oc status" command will provide a http://ceph-notebook-myproject.$(IP).nip.io URL. Load that into your browser and follow along with your browser!