Crunchy Data PostgreSQL and Red Hat OpenShift Data Foundation
There are certain scenarios where read-only replicas are not necessary. In those cases, OpenShift Data Foundation can be an effective solution for providing storage replication and high availability for PostgreSQL. OpenShift Data Foundation features built-in replication for block, file, and object data. This ability allows PostgreSQL to be deployed as a standalone node with OpenShift Data Foundation providing storage replication (default 3x). In the event of a node failure or any other failure, the PostgreSQL pod will be rescheduled quickly to another node. The pod will bind to another OpenShift Data Foundation node with the persistent data intact.
The sections that follow describe how the Crunchy PostgreSQL Operator works together with the OpenShift Data Foundation Operator.
The OpenShift Data Foundation Operator
The OpenShift Data Foundation Operator is a single meta-operator that provides one interface for installing and managing three components:
- Ceph, as the storage building block providing all storage needs.
- Rook, an operator to manage storage (Ceph) in a Kubernetes cluster.
- NooBaa, an operator for a true multicloud object gateway and management.
Direct attached storage
With Ceph as the storage technology for OpenShift Data Foundation, performance directly correlates to the type of storage provided to the Ceph components. Direct-attached storage, now available from multiple public cloud vendors, offers compelling performance. AWS provides Storage Optimized Instances, offering storage devices directly connected to the instance. Importantly, these devices are not shared among other instances. With support for direct-attached storage in OpenShift Data Foundation 4.3, the Ceph object storage daemon (OSD) pods consume storage locally attached to the node, instance, or virtual machine (VM) where the OSD pods are running.
For performance testing, Red Hat and Crunchy Data engineers chose to have three worker nodes dedicated to the Crunchy Data pods (the PostgreSQL databases) and three worker nodes dedicated to providing storage to these databases via OpenShift Data Foundation (the storage cluster). Nodes were configured as follows:
- Database nodes used AWS m5.4xlarge instances, each with 64GB of RAM and 16 vCPUs.
- Storage nodes used AWS i3en.2xlarge instances, each with 64GB of RAM, eight vCPUs, and two direct-attached 2.3TB NVMe devices.
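As a rough illustration of what this storage layout provides, the following sketch computes raw capacity and an upper bound on usable capacity under the default 3x replication mentioned earlier. It is a back-of-the-envelope estimate only: it assumes all devices are fully dedicated to Ceph and ignores Ceph metadata overhead and free-space reserves, so real usable capacity will be lower.

```python
# Rough capacity sketch for the storage cluster described above.
STORAGE_NODES = 3      # i3en.2xlarge instances (from the text)
DEVICES_PER_NODE = 2   # direct-attached NVMe devices per instance
DEVICE_SIZE_TB = 2.3   # size of each NVMe device, as stated in the text
REPLICA_COUNT = 3      # default replication factor


def raw_capacity_tb(nodes: int, devices_per_node: int, device_size_tb: float) -> float:
    """Total raw capacity across all devices."""
    return nodes * devices_per_node * device_size_tb


def usable_capacity_tb(raw_tb: float, replicas: int) -> float:
    """Upper bound on usable capacity; ignores Ceph overhead and reserves."""
    return raw_tb / replicas


raw = raw_capacity_tb(STORAGE_NODES, DEVICES_PER_NODE, DEVICE_SIZE_TB)
usable = usable_capacity_tb(raw, REPLICA_COUNT)
print(f"raw: {raw:.1f} TB, usable at {REPLICA_COUNT}x replication: about {usable:.1f} TB")
```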
Tests ran on two separate clusters. One cluster was created within a single AWS Availability Zone (us-west-2b). The second cluster was spread across three AWS Availability Zones (us-west-2a/b/c). Figure 3 shows the single availability zone layout. Figure 4 illustrates the layout across multiple availability zones.
Figure 3. Single availability zone layout
Figure 4. Layout for multiple availability zones
As shown, each OpenShift Data Foundation worker node contains two Ceph OSD pods. Each of these pods uses a single NVMe device, whether it is in a single or multiple availability zone environment. For the nodes running Crunchy Data PostgreSQL, each pod was given four vCPUs and 16GB of memory for both requests and limits. With three PostgreSQL pods per node, the configuration uses 12 vCPUs and 48GB of RAM on each database node. This approach leaves spare resources on each node for OpenShift itself, or for cases where pods need to be migrated.
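The per-node resource arithmetic can be sketched as follows. The sketch assumes the nine PostgreSQL pods used in the tests are spread evenly over the three database nodes, and it compares against the nominal instance size. In practice, the resources allocatable to pods are somewhat lower than the instance totals because OpenShift reserves capacity for system components.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    vcpus: int
    memory_gb: int


NODE = Resources(vcpus=16, memory_gb=64)  # m5.4xlarge, nominal instance size
POD = Resources(vcpus=4, memory_gb=16)    # requests and limits per PostgreSQL pod
DATABASES = 9
DATABASE_NODES = 3

# Pods are assumed to be spread evenly across the database nodes.
pods_per_node = DATABASES // DATABASE_NODES
used = Resources(POD.vcpus * pods_per_node, POD.memory_gb * pods_per_node)
headroom = Resources(NODE.vcpus - used.vcpus, NODE.memory_gb - used.memory_gb)

print(f"pods per node: {pods_per_node}")
print(f"used per node: {used.vcpus} vCPUs, {used.memory_gb}GB")
print(f"nominal headroom per node: {headroom.vcpus} vCPUs, {headroom.memory_gb}GB")
```

Under these assumptions, the nominal headroom on each node is the size of one additional PostgreSQL pod.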
Test workload and process
In Red Hat testing, Sysbench was used to benchmark the performance of the cluster. Engineers chose Sysbench because it closely emulates an online web application in processing input/output (I/O) operations. A small bash script spread the Crunchy instances (PostgreSQL pods) equally among the three application worker nodes. Another bash script started statistics collection and Sysbench pods via Kubernetes jobs.
The initial Sysbench job loaded data into all the databases. With each database containing 400 tables of 1,000,000 rows, the resulting database saved to storage was 100GB in size. Once the data was loaded into all the databases, a five-minute Sysbench warm-up ran in parallel on all nine Crunchy instances. The warm-up job consisted of 70% read operations and 30% write operations.
Once the warm-up phase was complete, ten consecutive runs were started, each executing in parallel on all nine databases. Each run consisted of 70% reads and 30% writes and lasted 10 minutes, followed by one minute of rest. In total, engineers collected 100 minutes of Sysbench workload on each database, in parallel across the nine Crunchy instances.
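The timing of the test schedule can be checked with a short calculation. The measured workload time follows directly from the figures above. The total elapsed time is an estimate that assumes a rest period follows every run, including the last one, and excludes the initial data load.

```python
WARMUP_MINUTES = 5   # warm-up before the measured runs
RUNS = 10            # consecutive measured runs
RUN_MINUTES = 10     # duration of each run
REST_MINUTES = 1     # pause after each run


def workload_minutes(runs: int, run_minutes: int) -> int:
    """Minutes of measured Sysbench workload per database."""
    return runs * run_minutes


def elapsed_minutes(warmup: int, runs: int, run_minutes: int, rest: int) -> int:
    """Approximate wall-clock time, assuming a rest follows every run."""
    return warmup + runs * (run_minutes + rest)


measured = workload_minutes(RUNS, RUN_MINUTES)
elapsed = elapsed_minutes(WARMUP_MINUTES, RUNS, RUN_MINUTES, REST_MINUTES)
print(f"measured workload per database: {measured} minutes")
print(f"approximate elapsed time: {elapsed} minutes")
```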
Tests were run simultaneously on two different clusters. One cluster was configured entirely within a single availability zone, while the other cluster was spread across three availability zones in the AWS region. The tests were always run in parallel on all nine databases (Crunchy PostgreSQL pods) on the three nodes. Figures 5 and 6 illustrate performance for a single zone configuration as measured from PostgreSQL.
Figure 5 shows average transactions per second (TPS) for the nine databases, the total TPS for the whole cluster, as well as average latency and 95th percentile latency per database.
Figure 5. Transactions per second and latency for a single availability zone configuration
Figure 6 shows read and write I/O operations per second (IOPS) across the Ceph block devices (RADOS Block Devices, or RBDs).
Figure 6. Read and write performance for a single availability zone configuration
Figure 7 shows TPS and latency data for a multizone configuration, with performance results that are highly similar to those obtained in a single availability zone (Figure 5).
Figure 7. TPS and latency for a multiple availability zone configuration
Figure 8 shows OSD performance per node for a multiple availability zone configuration. Again, these results closely matched those for a single availability zone (Figure 6).
Figure 8. OSD per node performance for a multiple availability zone configuration
In summary, engineers made the following observations based on the testing:
- The OSD per node shows a stable and consistent spread of I/Os between the databases. This behavior demonstrates that Ceph handles the RBD volumes equally among all the databases.
- Testing produced an average of roughly 50 TPS per database and 450 TPS for the entire cluster (all nine databases). These results are consistent with expectations based on the type of test (Sysbench 70r/30w), database size, the CPU and RAM resources per Crunchy pod (PostgreSQL database), and the number of Crunchy pods per node.
- Importantly, OpenShift Data Foundation kept the same performance capabilities when moving from a single availability zone to running across multiple availability zones. This finding is remarkable because the devices Ceph used were spread across three instances, with each instance located in a different availability zone. Latency is typically lowest within a single availability zone (e.g., servers located in the same rack or the same row in the data center). Latency is generally higher across multiple availability zones (e.g., servers in different rooms in the data center or in different data centers altogether). OpenShift Data Foundation made these differences essentially moot.
- Resources were configured with headroom for flexibility. From the Red Hat OpenShift perspective, resources were retained in each instance to provide for failover. From the OpenShift Data Foundation perspective, the devices that the Ceph OSDs used averaged 90% utilization, leaving headroom for accommodating device failures.
Resilience and business continuity
Resilience is one of the greatest challenges for any enterprise application, and this is no different for a PostgreSQL database. Performance during failover is another critical aspect because services must continue to operate at sufficient performance, even while underlying infrastructure services are down or recovering from failure. The Crunchy PostgreSQL Operator and the OpenShift Data Foundation Operator work together to provide resilience that can help with business continuity. This section provides results from testing and validating three common failure scenarios:
- Simulating human operator error (deleting a Ceph OSD pod)
- Simulating maintenance operations (rebooting an OpenShift Data Foundation instance)
- Simulating a node failure (shutting down an OpenShift Data Foundation instance)
All the scenarios were evaluated with the same infrastructure, cluster, and databases that were used for the performance testing described above. Engineers also ran the same Sysbench workload continuously while triggering these failover scenarios.
Simulating human operator error
Human error is a common source of downtime, and an important aspect to simulate and understand. OSDs are the building blocks of the Ceph data plane, and the Ceph cluster consumes and aggregates all the OSDs into a logical storage layer for the application. In OpenShift Data Foundation, an OSD pod corresponds to a storage device that the Ceph cluster consumes. Accidentally deleting an OSD pod would result in losing a storage device momentarily. The AWS i3en.2xlarge instances used in testing each have two direct-attached NVMe devices. Because three nodes were used for OpenShift Data Foundation and each ran two OSDs, the cluster had six OSDs in total.
The Rook operator monitors all the Ceph components, and the configuration used Kubernetes deployments for the OSD pods. As such, deleting an OSD pod causes Kubernetes to immediately bring up another OSD to support the same NVMe device the deleted OSD pod had used before. In testing, the time it took for a new OSD pod to start averaged 2-4 seconds. In fact, recovery occurred so quickly in our tests that the Ceph cluster never required any kind of rebuild process, and I/O operations were barely impacted.
As shown in Figure 9, the same Sysbench workload was run 10 times within the single availability zone cluster. During each run, engineers failed one of the Ceph OSDs in the cluster by deleting one of the OSD pods. A different OSD pod was deleted in a random manner on each run at different points of the run timeline. The impact on performance was minimal, as demonstrated by the nearly identical performance with and without the pod deletion.
Figure 9. TPS with and without OSD pod deletion across 10 runs
Simulating maintenance operations
Planned downtime for maintenance operations like security patching can also impact operations. In this scenario we concentrate on rebooting a server, or in our case, an AWS instance. An instance reboot may be caused by scheduled maintenance or human error. The OpenShift Data Foundation cluster consisted of three AWS i3en.2xlarge instances with two OSDs each. As such, rebooting a node resulted in taking two Ceph OSDs down for 1-2 minutes. Kubernetes recognizes when the AWS instance reboots, notices when it is back up, and restarts all pods on it.
This failure scenario gave engineers the opportunity to study the impact on the workload (in terms of TPS), as well as the recovery time for the Ceph cluster after losing 1/3 of its OSDs. The impact of the recovery on the workload was also of interest. The scenario was run 10 times, as with the previous tests. The Sysbench workload had an average I/O pause of approximately 10 seconds on all nine databases during the 10 runs.
To test the impact on recovery time, engineers needed to run the Sysbench workload for a longer period. While previous runs lasted 10 minutes, testing for this scenario increased the workload time to 65 minutes. Results from the testing are shown in Figure 10.
Figure 10. Rebooting an instance with two OSD pods
The red line represents the TPS performance per database while running without failures. The blue line represents the TPS performance per database while running during a reboot of an AWS instance (losing two Ceph OSDs). Importantly, these results closely mirror those obtained when a single OSD was deleted. For all of the test runs, the time to complete the recovery was slightly less than 60 minutes. In this testing, there was negligible impact on the performance of the Sysbench workload during a Ceph cluster recovery, even though a third of the cluster devices were impacted.
Simulating maintenance operations with more OSDs
Importantly, the OpenShift Data Foundation cluster under test represented a bare minimum from a Ceph cluster perspective. With only six devices for the OpenShift Data Foundation cluster (two in each AWS Elastic Compute Cloud (EC2) instance), this test temporarily removed 1/3 of the devices. Larger OpenShift Data Foundation/Ceph clusters—with more nodes, more devices per node, or both—would realize a smaller impact. This is true not only for the impact on the transactions, but also for the recovery time.
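The effect of cluster size on the share of devices lost during a single-node reboot can be illustrated with a small calculation. The three-node case matches the cluster under test. The larger node counts are chosen for illustration, and the sketch assumes every storage node carries the same number of OSDs.

```python
from fractions import Fraction


def fraction_offline(storage_nodes: int, osds_per_node: int, nodes_down: int = 1) -> Fraction:
    """Share of the cluster's OSDs that go offline when `nodes_down` nodes reboot."""
    total_osds = storage_nodes * osds_per_node
    return Fraction(nodes_down * osds_per_node, total_osds)


for nodes in (3, 6, 9):
    share = fraction_offline(nodes, osds_per_node=2)
    print(f"{nodes} storage nodes: {share} of OSDs offline during a one-node reboot")
```

With two OSDs per node, the share of devices taken offline falls from 1/3 with three nodes to 1/9 with nine nodes.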
Because Ceph can handle significantly more OSDs than were deployed in the test cluster, engineers wanted to explore the impact on performance with more OSDs. An additional six AWS EC2 i3en.2xlarge instances were added to the OpenShift Data Foundation cluster. The same three instances were retained for the PostgreSQL databases (via the Crunchy PostgreSQL Operator). In fact, the databases remained up and running while the additional storage instances were added. Figure 11 illustrates the impact of adding an additional six OSD nodes on TPS, I/O pause, and the full recovery of the cluster.
Figure 11. Adding more OSDs to a cluster dramatically improves performance, reduces I/O pause, and achieves full recovery in less time.
As shown, average TPS per database went from 42.6 TPS to 109.18 TPS, an increase of more than 2.5 times that moved the bottleneck to the compute resources on the instances running the database pods. More importantly, the I/O pause went from an average of 10 seconds with only three OpenShift Data Foundation instances to an average of 2 seconds with the six additional instances. The full recovery time of the cluster went from roughly 59 minutes with three OpenShift Data Foundation nodes to 23 minutes with nine nodes.
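The relative improvements can be derived directly from the figures quoted above. The following sketch only restates the reported numbers as ratios and introduces no new measurements.

```python
def improvement(before: float, after: float, higher_is_better: bool) -> float:
    """Return the improvement factor between two measurements."""
    return after / before if higher_is_better else before / after


# Figures reported for three versus nine storage nodes.
tps_gain = improvement(42.6, 109.18, higher_is_better=True)
io_pause_reduction = improvement(10, 2, higher_is_better=False)
recovery_speedup = improvement(59, 23, higher_is_better=False)

print(f"TPS per database: {tps_gain:.2f}x higher")
print(f"I/O pause: {io_pause_reduction:.1f}x shorter")
print(f"Full recovery: {recovery_speedup:.2f}x faster")
```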
These results demonstrate the power of decoupled software-defined storage. Storage resources can be scaled independently from computational (database) resources, potentially realizing dramatic performance gains.
Simulating a node failure
Node failure was the final resilience scenario tested. This scenario is particularly important because the AWS EC2 storage instances have storage devices that are directly attached to these virtual instances for much better performance. This increased performance comes with a caveat—if the AWS instance is shut down and then started again, it will get new storage devices attached to it. Note that this is distinct from rebooting an AWS instance with direct-attached storage. If an AWS storage instance is rebooted (either via the AWS console or the operating system) it will retain the same storage devices (as discussed in the section on simulating maintenance operations).
With OpenShift Data Foundation and its inherent Ceph data replication, other copies of the data are distributed across other OSDs. In the event of an instance shutdown, OpenShift Data Foundation can automatically update the new direct-attached devices to be part of the cluster.
For the instance shutdown test, engineers reverted to a three-node OpenShift Data Foundation cluster. With the same Sysbench workload running as before, they performed an instance shutdown from the AWS console (though it can also be done from the operating system). The process is summarized here, and full details are provided in the Appendix.
- Under load, engineers shut down one of the OpenShift Data Foundation nodes.
- After one minute, they used the console to restart the instance. When the instance came back online, it had two new NVMe devices.
- Engineers then logged into the instance and updated symbolic links for the two new NVMe devices.
- Next, the OpenShift Local Storage Operator (LSO) LocalVolume custom resource (CR) was updated with the new NVMe device names.
Once the CR was saved, new symbolic links were created by the LSO, and the OpenShift Data Foundation Operator started to take action. In our testing, the new pods were created in seconds. At this point, Ceph began using the two new OSDs and started to copy placement group data to them, making the new NVMe devices part of the existing OpenShift Data Foundation cluster. Resiliency was restored to 100% (three copies). This rebuild process was similar to what happened in our second test case (“rebooting an instance”) and it had no impact on the performance of the running workloads.
Running Crunchy Data PostgreSQL databases on Red Hat OpenShift Data Foundation provides a simple, general solution to business continuity during infrastructure failure and data protection against loss. The Crunchy PostgreSQL Operator provides a convenient mechanism to configure primary and replica database nodes for database failover and workload scalability through multiple read-only database instances.
In instances where additional read-only database replicas are not required, resilience and data redundancy can be achieved by using OpenShift Data Foundation along with the Crunchy PostgreSQL Operator. Internal OpenShift Data Foundation data replication provides highly available PostgreSQL. That replication can be easily extended across multiple AWS availability zones with little or no performance penalty to the application in terms of both TPS and latency, even during failover events. As a common storage services layer for Red Hat OpenShift, OpenShift Data Foundation can also provide this same data protection to other applications running in the same Red Hat OpenShift cluster.
Appendix: Instance shutdown and recovery procedure
This section illustrates the instance shutdown testing process used in Red Hat testing in detail.
To execute the test, the workload was started and engineers waited several minutes before using the AWS console to shut down one of our OpenShift Data Foundation nodes. They then waited another minute and used the AWS console to restart the instance. Once the instance restarted and was running in “Ready” state, two of the OSD pods were shown as not being in the “Running” state.
When the AWS i3en.2xlarge instance came back online, it had two new NVMe devices. First, engineers needed to change the OpenShift Local Storage Operator (LSO) Persistent Volumes (PVs) to point to the new NVMe devices. To do this, engineers logged into the instance that was shut down. After gaining access to the node, the engineers checked the symbolic links that the LSO had created before the node shutdown. The links are located at /mnt/local-storage/localblock and are named nvme1n1 and nvme2n1.
After the shutdown, the targets of these symbolic links were missing. These links had to be deleted and replaced with links to the new NVMe devices that AWS provided to the new i3en.2xlarge instance once it was up and running again.
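The check and replacement described above can be sketched in Python. This is a minimal illustration rather than the procedure the engineers used: the helper names are hypothetical, and the paths of the new devices are not given in the text and would need to be supplied by the operator.

```python
import os
from pathlib import Path


def find_dangling_links(directory: str) -> list[str]:
    """Return names of symbolic links in `directory` whose targets no longer exist."""
    dangling = []
    for entry in sorted(Path(directory).iterdir()):
        # is_symlink() inspects the link itself; exists() follows it to the target.
        if entry.is_symlink() and not entry.exists():
            dangling.append(entry.name)
    return dangling


def replace_link(link_path: str, new_target: str) -> None:
    """Delete an existing link (dangling or not) and point it at `new_target`."""
    if os.path.lexists(link_path):
        os.remove(link_path)
    os.symlink(new_target, link_path)
```

On the storage node, `find_dangling_links("/mnt/local-storage/localblock")` would be expected to report the two stale links, after which `replace_link` could be called once per link with the path of the corresponding new NVMe device. Both operations require sufficient privileges on the node.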
With the new NVMe devices, the LSO LocalVolume custom resource can be updated using: oc edit LocalVolume local-block -n local-storage
In the devicePaths array of the CR, the paths still reference the “old” NVMe devices that the AWS instance had prior to the shutdown (the first two devices listed, in this case).
The old devices can now be replaced with the two new devices that are associated with the new i3en.2xlarge instance.
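The edit to the devicePaths array amounts to swapping two entries in a list. The sketch below models that change on a dictionary shaped like a LocalVolume CR. The CR layout is an assumption based on the Local Storage Operator API, the device paths are placeholders rather than the values from the test cluster, and `replace_device_paths` is a hypothetical helper.

```python
import copy


def replace_device_paths(local_volume: dict, replacements: dict) -> dict:
    """Return a copy of the CR with old device paths swapped for new ones."""
    updated = copy.deepcopy(local_volume)
    for device_set in updated["spec"]["storageClassDevices"]:
        device_set["devicePaths"] = [
            replacements.get(path, path) for path in device_set["devicePaths"]
        ]
    return updated


# Hypothetical CR fragment; every device path below is a placeholder.
local_volume = {
    "apiVersion": "local.storage.openshift.io/v1",
    "kind": "LocalVolume",
    "metadata": {"name": "local-block", "namespace": "local-storage"},
    "spec": {
        "storageClassDevices": [
            {
                "storageClassName": "localblock",
                "volumeMode": "Block",
                "devicePaths": [
                    "/dev/disk/by-id/OLD-DEVICE-1",
                    "/dev/disk/by-id/OLD-DEVICE-2",
                    "/dev/disk/by-id/OTHER-NODE-DEVICE-1",
                    "/dev/disk/by-id/OTHER-NODE-DEVICE-2",
                ],
            }
        ]
    },
}

updated = replace_device_paths(
    local_volume,
    {
        "/dev/disk/by-id/OLD-DEVICE-1": "/dev/disk/by-id/NEW-DEVICE-1",
        "/dev/disk/by-id/OLD-DEVICE-2": "/dev/disk/by-id/NEW-DEVICE-2",
    },
)
print(updated["spec"]["storageClassDevices"][0]["devicePaths"])
```

Paths belonging to the other storage nodes are left untouched, which mirrors the manual edit: only the entries for the restarted instance change.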
Once the CR is saved, new symbolic links will be created by the LSO.
Once the symbolic links were recreated, the Ceph OSDs that were using the old devices needed to be deleted. OpenShift Data Foundation and the Rook operator then recreated new OSDs and new OSD pods. The OpenShift Data Foundation operator has a specific, built-in job template for this use case. Based on the OSD pods that were not in the “Running” state, OSD numbers 4 and 5 needed to be deleted.
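Identifying which OSDs to remove can be sketched as a small parser over a pod listing. The sample listing below is invented for illustration and is not output from the test cluster. The pod naming pattern (rook-ceph-osd-<id>-...) and the status values are assumptions about how Rook names OSD pods, and `failed_osd_ids` is a hypothetical helper.

```python
import re

# Matches OSD pods such as "rook-ceph-osd-4-..." but not "rook-ceph-osd-prepare-...".
OSD_POD = re.compile(r"^rook-ceph-osd-(\d+)-")


def failed_osd_ids(pod_listing: str) -> list[int]:
    """Return the IDs of OSD pods whose STATUS column is not 'Running'."""
    ids = []
    for line in pod_listing.strip().splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "NAME":
            continue  # skip the header and malformed lines
        name, status = fields[0], fields[2]
        match = OSD_POD.match(name)
        if match and status != "Running":
            ids.append(int(match.group(1)))
    return sorted(ids)


# Invented sample in the general shape of a pod listing.
SAMPLE = """
NAME                                READY   STATUS             RESTARTS   AGE
rook-ceph-osd-0-aaaaaaaaaa-aaaaa    1/1     Running            0          3h
rook-ceph-osd-1-bbbbbbbbbb-bbbbb    1/1     Running            0          3h
rook-ceph-osd-2-cccccccccc-ccccc    1/1     Running            0          3h
rook-ceph-osd-3-dddddddddd-ddddd    1/1     Running            0          3h
rook-ceph-osd-4-eeeeeeeeee-eeeee    0/1     CrashLoopBackOff   5          3h
rook-ceph-osd-5-ffffffffff-fffff    0/1     CrashLoopBackOff   5          3h
rook-ceph-osd-prepare-xxxxx-yyyyy   0/1     Completed          0          3h
"""

print(failed_osd_ids(SAMPLE))
```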
Once the two jobs were completed, the OSDs disappeared, but the deployments, and thus the pods, still existed. These had to be removed.
Once the deployments were deleted, triggering a new reconcile of the Ceph cluster CR caused the OpenShift Data Foundation and Rook operators to take note and compare to the original OpenShift Data Foundation setup. They noticed that two OSDs were missing and started deployment. The easiest way to perform this operation is to update the settings in the CR via:
oc edit cephcluster ocs-storagecluster-cephcluster -n openshift-storage
Then look for “storageClassDeviceSets” and update the first count from 2 to 0. It does not matter which storageClassDeviceSets entry is updated, because OpenShift Data Foundation always compares the CR to what was initially installed and restores it to the same level. Editing the CR again will show that the count is back at 2.
In this example, two new deployments for OSD 4 and 5 were created and the corresponding osd-prepare and OSD pods were up and running. It is important to mention that the number of OSDs in the Ceph cluster can impact the timing of when you will actually see the new OSD pods being created.