Issue #21 July 2006

Data sharing with a Red Hat GFS storage cluster


This article is an updated version of an article first published in our April 2005 issue.

Introduction

Linux® server clustering is an important technique that provides scalable performance and high availability for IT services. These services quite often require that data be shared between servers. Even small companies often have many computers, including desktops and servers, that must share data. Hence, data sharing is a requirement for both small and large companies.

Some services have static data that can easily be copied between servers. In this scenario, each server in the cluster hosts its own copy of all the data. Other services rely on rapidly changing, dynamic data, which is much more difficult to duplicate between servers. For example, databases and file services (accessed through interfaces such as SQL, NFS, or CIFS) would have to distribute new information synchronously to all other servers after each write. This would lead to very long response times and an extremely high network load. Another disadvantage is the higher cost of maintaining duplicate copies and the associated increase in system management complexity.

What these applications really need is access to a single data store that can be read from and written to by all servers simultaneously. The use of a file server (network attached storage server) supporting the NFS and CIFS protocols is the traditional approach for this kind of shared data. Linux, of course, offers these popular data sharing protocols. This solution is suitable for some applications, but a single file server often becomes a performance bottleneck and single-point-of-failure in the complete system.

For truly scalable and simplified data sharing, every server in the cluster should have direct access to the storage device, and each server should be able to read from and write to the data store simultaneously. ATIX developed such a solution--the com.oonics Diskless Shared Root Cluster--with Red Hat® Global File System (GFS) as its foundation.

Red Hat GFS overview

Red Hat GFS lets servers share files with a common file system on a Storage Area Network (SAN). With most SAN deployments, only one server at a time can have access to a disk or logical volume. Red Hat GFS creates a common file system across multiple SAN disks or volumes and makes this file system available to multiple servers in a cluster. Data integrity is protected by coordinating access to files so that reads and writes are consistent between servers. Availability is improved by making the file system accessible to all servers in the cluster. By eliminating the need for multiple copies of data, Red Hat GFS increases performance, reduces management complexity, and reduces costs with consolidated storage resources.

Red Hat GFS internals

Red Hat GFS was created as a 64-bit cluster file system. It enables several servers to simultaneously connect to a SAN and access a common, shared file system with standard UNIX/POSIX file system semantics.
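
As a minimal sketch of how such a shared file system is created and mounted (the volume path /dev/VG_foo/LV_foo, the cluster name alpha, and the mount point /mnt/gfs01 are illustrative assumptions, not part of any particular installation):

    # Create a GFS file system with three journals (one per cluster node),
    # using the DLM lock protocol; "alpha" is the cluster name and "gfs01"
    # the file system name used in the lock table.
    gfs_mkfs -p lock_dlm -t alpha:gfs01 -j 3 /dev/VG_foo/LV_foo

    # Mount the file system on every node; all nodes see the same data.
    mount -t gfs /dev/VG_foo/LV_foo /mnt/gfs01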

Red Hat GFS is a journaling file system, and each cluster node is allocated its own journal. Changes to the file system metadata are first written into the journal and then onto the file system itself, as with other journaling file systems. In case of a node failure, file system consistency can be recovered by replaying the metadata operations. Optionally, both data and metadata can be journaled.
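
Because every node that mounts the file system needs its own journal, adding a server typically also means adding a journal. A brief sketch, assuming the file system from the previous example is mounted at /mnt/gfs01 and the underlying volume has spare space:

    # Add one journal so that an additional cluster node can mount the file system.
    gfs_jadd -j 1 /mnt/gfs01

    # Optionally let new files below a directory journal their data as well
    # (metadata-only journaling is the default).
    gfs_tool setflag inherit_jdata /mnt/gfs01/critical-data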

Red Hat GFS saves its file system descriptors in inodes that are allocated dynamically (referred to as dynamic inodes or dinodes). Each dinode is placed in its own file system block (4096 bytes is the standard file system block size in Linux kernels). In a cluster file system without this capability, pooling multiple dinodes into one block would lead to increased competition for block accesses and false contention between servers accessing the file system at the same time. For space efficiency and fewer disk accesses, Red Hat GFS saves (stuffs) file data into the dinode itself if the file is small enough to fit. In this case, only one block access is necessary to access smaller files. For bigger files, Red Hat GFS uses a "flat file" structure: all pointers in a dinode have the same depth, so there are only direct, indirect, or double indirect pointers, and the tree height grows as much as necessary to store the file data, as shown in Figure 1.

"Extendible hashing" (ExHash) is used to save the index structure for directories. For every filename a multi-bit hash is saved as an index in the hash table. The corresponding pointer in the table points at a "leaf node." Every leaf node can be referenced by multiple pointers. If a hash table leaf node becomes too small to save the directory entries, the size of the whole hash table is doubled. If one leaf node is too small, it splits up into two leaf nodes of the same size. If there are only a few directory entries, the directory information is saved within the dinode block, just like file data. This data structure lets each directory search be performed in a number of disk accesses proportional to the depth of the very flat, extendible hashing tree structure. For very large directories with thousands or millions of files, only a small number of disk accesses are required to find the directory entry.

Over the ongoing development of Red Hat GFS, new features like file ACLs, quota support, direct I/O (dio), and asynchronous I/O (aio) have been added. Direct I/O and asynchronous I/O, used in conjunction with Red Hat GFS, can greatly accelerate database performance.
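
As an illustration of the direct I/O support, GFS attributes can be set per file or per directory with gfs_tool; the paths below are hypothetical examples:

    # Force direct I/O for an existing file, bypassing the page cache.
    gfs_tool setflag directio /mnt/gfs01/oradata/system01.dbf

    # Let all files created below a directory inherit the direct I/O attribute.
    gfs_tool setflag inherit_directio /mnt/gfs01/oradata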

Fig.1. GFS metadata structure

Structure

Figure 2 shows the structure of a typical Red Hat GFS storage cluster. The Red Hat GFS file system is mapped onto shared disk volumes, which are constructed from one or more independent storage units. The servers are connected over one or more Fibre Channel or iSCSI data paths to the shared disk volumes in the SAN. The individual cluster servers are also connected via one or more data paths to an Ethernet network. Thus every server can directly access the storage arrays onto which the shared disk volumes are mapped. This greatly increases I/O system performance and provides scalability far beyond what can be achieved with a single NAS server.

Fig.2. GFS storage cluster

The servers in the Red Hat GFS storage cluster run Red Hat Enterprise Linux as the operating system. The Red Hat Cluster Suite architecture provides the basis for cluster communication, the locking layer, and clustered volume management.

Cluster layer

The current Red Hat GFS version 6.1 fits perfectly into Red Hat's cluster architecture. Each component of the architecture follows the primary design principle of symmetry: every node is equal. There is no single-point-of-failure or bottleneck by design, so the architecture offers optimal scalability. Following these design principles, every node runs an instance of the Cluster Configuration System daemon (CCSD), which provides the static cluster configuration to the upper-layer cluster services. The CCSD instances on all nodes synchronize configuration changes made by the administrator.
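
The configuration that CCSD distributes lives in a single XML file, /etc/cluster/cluster.conf, on every node. The following fragment is only a rough sketch for a hypothetical three-node cluster named alpha that uses an APC power switch for fencing; all names, ports, and addresses are illustrative:

    <?xml version="1.0"?>
    <cluster name="alpha" config_version="1">
      <clusternodes>
        <clusternode name="node1" votes="1">
          <fence>
            <method name="power">
              <device name="apc1" port="1"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="node2" votes="1">
          <fence>
            <method name="power">
              <device name="apc1" port="2"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="node3" votes="1">
          <fence>
            <method name="power">
              <device name="apc1" port="3"/>
            </method>
          </fence>
        </clusternode>
      </clusternodes>
      <fencedevices>
        <fencedevice name="apc1" agent="fence_apc" ipaddr="10.0.0.100" login="admin" passwd="secret"/>
      </fencedevices>
    </cluster>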

The Cluster Manager (CMAN) builds the basic cluster infrastructure and is represented by the CMAN kernel module running on every cluster node. CMAN receives the cluster configuration information from the CCSD and manages the cluster membership and all services that are running on each cluster node. CMAN is responsible for sending periodic "HELLO" messages over the network to check the health of the cluster nodes. In the case of a node failure (i.e., its "HELLO" messages are not received by the other nodes for a predefined number of seconds), the unresponsive node is removed from the cluster. To prevent file system inconsistency, a failed node using services like Red Hat GFS or CLVM has to be fenced off from the surviving nodes in the SAN. Fencing ensures that a failed node cannot perform any more I/O operations to the underlying storage infrastructure. This function is provided by the fencing service (fenced) on every node, which either disables the failed node's path to the storage array or resets the failed node.
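
For testing purposes, the configured fencing action can also be triggered by hand; a one-line sketch using a node name from the hypothetical configuration above:

    # Manually fence a node; this invokes the fence agent configured for it in cluster.conf.
    fence_node node3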

CMAN also prevents another type of problem that occurs when a failure in the cluster communication network results in network partitioning. Without CMAN, the cluster could be divided into two groups of computers--two independent clusters with the same name that can no longer see the computers in the other group. This situation is called the "split brain" problem and could result in major data loss, because both clusters could write to the same storage resources without common lock management. To prevent this self-destructive behavior, CMAN only allows a cluster to be active when it holds a quorum. Each node can be assigned one or more votes, allowing different expected service levels for each node, and quorum is reached when a group of cluster nodes holds the majority of all votes in the cluster.
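
As a sketch of the quorum arithmetic for the hypothetical three-node cluster above, where each node holds one vote:

    # quorum = floor(total_votes / 2) + 1; with 3 votes, quorum = 2,
    # so the cluster stays active as long as two nodes can see each other.
    # The current membership and quorum state can be inspected with:
    cman_tool status
    cman_tool nodes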

Concurrent access to a shared cluster resource must be coordinated by a locking mechanism. In a Red Hat GFS storage cluster, each node has the ability to access the same physical file system blocks, and the lock manager assures data consistency of the file system. A new general-purpose, highly available distributed lock manager (DLM) can be used with Red Hat GFS 6.1. Every node runs a part of the overall locking service and manages the locks it creates. When a node is working on files for which it masters the locks, no external resource needs to be consulted and the locks are granted immediately, resulting in optimal file system performance. When multiple servers acquire the same lock, the performance is comparable to using an external lock manager. Because locking requests are distributed across all nodes, the possible bottleneck of a single or redundant lock manager has been eliminated, resulting in better performance and scalability. The Grand Unified Lock Manager (GULM) also remains available for Red Hat GFS version 6.1.

The Cluster Logical Volume Manager (CLVM) is an LVM2-based, lock-manager-aware cluster volume manager. CLVM allows multiple servers to share access to a storage volume on a SAN. CLVM virtualizes the storage units (for example, /dev/sda) and aggregates them into a single logical volume (/dev/VG_foo/LV_foo). Multiple devices can be combined through the device mapper, which supports concatenation. Changes in the volume manager configuration are visible to all cluster servers in the SAN, and CLVM allows volumes to be resized online. CLVM also supports multi-path I/O, allowing single failures in the Fibre Channel or iSCSI SAN path to be tolerated. Clustered volume mirroring is scheduled for release with Red Hat Enterprise Linux 4, Update 4.
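
A minimal sketch of building such a clustered volume, assuming two SAN LUNs visible as /dev/sda and /dev/sdb and the clvmd daemon running on every node (device and volume names are illustrative):

    # Initialize the SAN LUNs as LVM2 physical volumes.
    pvcreate /dev/sda /dev/sdb

    # Create a clustered volume group (-c y) so that metadata changes
    # are coordinated across all nodes.
    vgcreate -c y VG_foo /dev/sda /dev/sdb

    # Carve out a logical volume to hold the shared GFS file system.
    lvcreate -L 200G -n LV_foo VG_foo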

Scalability

A classic SAN environment consists of services and applications that run on individual servers. These services and applications are each generally limited to running on a particular server, even when connected to other servers via a high-speed SAN. If the hardware to which a particular application is limited is no longer sufficient, the application generally cannot exploit the additional memory, processing power, or storage capacity contained within other servers in the SAN. This limitation exists because services and applications on the SAN are generally unable to share data with any concurrency.

Applications that can run in parallel and share data concurrently across many servers in the SAN are much easier to scale. A database environment such as Oracle's 10g RAC using Red Hat GFS as the underlying file system is an excellent example. Red Hat GFS-based clusters offering data services to applications through NFS and Apache are also good examples. In case of a capacity shortage, new components, including servers and storage, can easily be integrated into the system until the required capacity is achieved. The concurrent sharing of data within a SAN-based storage pool not only removes the need for the laborious duplication of data to multiple servers but also offers elegant scaling possibilities. As the data requirements of a parallel application grow, the common storage pool can be expanded, and the shared data offered by the Red Hat GFS file system is immediately available to all servers.
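
When capacity runs short, the shared pool can be grown online; a sketch that reuses the hypothetical volume and mount point from the earlier examples:

    # Extend the clustered logical volume by 100 GB.
    lvextend -L +100G /dev/VG_foo/LV_foo

    # Grow the mounted GFS file system into the new space (run on one node).
    gfs_grow /mnt/gfs01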

Availability

The availability of the complete system is an important aspect of providing IT services. To achieve Class 3 availability (99% to 99.9% uptime), it is necessary to eliminate every single-point-of-failure. For Class 4 availability (99.9% to 99.99% uptime), it is necessary to have a high-availability cluster, mirrored data, and a second data center for disaster recovery. The services must run on multiple servers at different locations, and the failure of one server or even a whole data center must not interrupt access to the services for more than a short period of time.

Red Hat GFS storage clusters can be connected to the central storage system through redundant I/O paths in the SAN. This allows the Red Hat GFS-based SAN environment to survive the failure of individual infrastructure components like switches, host bus adapters, and cables. I/O multi-pathing can be implemented with the capabilities offered natively by CLVM through its interaction with the device-mapper functionality in RHEL 4.
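
One common way to realize such redundant I/O paths on RHEL 4 is the device-mapper multipath layer; the following is only a hedged sketch, since device names such as mpath0 depend on the local configuration:

    # List the multipath devices assembled by device-mapper.
    multipath -ll

    # Build the clustered volume group on the multipath device rather than
    # on an individual path such as /dev/sda.
    pvcreate /dev/mapper/mpath0
    vgcreate -c y VG_foo /dev/mapper/mpath0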

Red Hat GFS storage clusters can be replicated from server-to-server over a WAN using various third-party tools like BakBone's NetVault Replicator or Oracle's DataGuard. Of course, array-based replication is supported by most enterprise-ready storage arrays like those offered by EMC, HP, IBM, and Network Appliance.

Host-based mirroring of data in a Red Hat GFS cluster is a component of the Cluster Logical Volume Manager scheduled for release later in 2006. Of course, various levels of RAID protection, including mirroring, are offered by all modern storage arrays and are fully supported by Red Hat GFS.

Lock management is offered in two variants that support redundancy. GULM is an external lock manager that can be run with redundant lock manager servers. DLM is a distributed lock manager that offers high availability by design.

Application and server failover in a Red Hat GFS environment can be provided by Red Hat Cluster Suite. Red Hat Cluster Suite provides the basic fencing, management interfaces, and other failover capabilities for Red Hat GFS. Red Hat Cluster Suite is included with the Red Hat GFS subscription. Red Hat GFS is also integrated with Service Guard for Linux offered by Hewlett Packard. Oracle RAC environments with versions 9i and 10g are also able to provide the failover mechanisms for Red Hat GFS-based Oracle implementations.

LAN-free backup

A data backup in most IT environments is normally done from backup client machines either over the LAN to a dedicated backup server or LAN-free from the application server directly to the backup device. With Red Hat GFS it is possible to have a server in the SAN be designated as a backup server for the entire Red Hat GFS cluster. This capability is available since every connected server using Red Hat GFS has access to all data and all file systems in the cluster. The backup server is able to accomplish a backup during ongoing operations without affecting the performance or any other capabilities of the Red Hat GFS clustered application servers. It is also useful to generate snapshots or clones of volumes using the hardware snapshot capabilities of many storage products. These snapshot volumes can be mounted and backed up by a Red Hat GFS backup server. To enable this capability, Red Hat GFS includes a file system quiesce capability to ensure a consistent data state. To quiesce means that all accesses to the file system are halted after a file system sync operation. This ensures that all metadata and data is written to the storage unit in a consistent state before the snapshot is taken.
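
A sketch of the quiesce step around a hardware snapshot, assuming a GFS file system mounted at /mnt/gfs01; the snapshot itself is array-specific and shown only as a placeholder:

    # Quiesce the file system: flush data and metadata, then block new writes.
    gfs_tool freeze /mnt/gfs01

    # ... trigger the hardware snapshot on the storage array here ...

    # Resume normal operation.
    gfs_tool unfreeze /mnt/gfs01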

Diskless shared root clustering

As all servers in a Red Hat GFS storage cluster access their data through a shared storage area network, additional servers can be added to easily scale the server capacity. Hence, each server can be viewed as just another resource in the pool of available servers. ATIX developed this GFS storage cluster concept further and created the diskless shared root cluster. In a diskless shared root cluster, the system data and operating system images are all on the shared storage. Therefore, server and storage can be seen as effectively independent of each other. The result is that no server needs a local hard disk. Instead, each server is stateless and can boot directly from the SAN. Both application data and the operating system images are shared over the SAN where the root (/) partition for all cluster nodes is the same, which greatly simplifies system management requirements. Changes across the cluster only have to occur one time and they are immediately valid for all servers in the Red Hat GFS cluster.

Scalability

This strict separation of server and storage also makes server scaling independent of storage scaling. If a diskless shared root cluster needs more CPU resources, one simply adds new stateless nodes on the fly. From a scalability perspective, a shared root cluster is equivalent to a shared storage cluster.

Availability

The separation of the server from the storage with a Red Hat GFS-based diskless shared root environment allows for all of the data about the architecture and the content of the cluster to be consolidated in a central resource. An outage of a single server does not influence the data itself and no data--not even the operating system--needs to be restored or reconstructed on any failed servers. Another node will simply take over the services of the failed node. The fact that all data is consolidated into a central resource greatly reduces the mean time to repair of a cluster node. The replacement server simply has to be added and powered on; it instantaneously takes over all the duties of the failed server. The result is a much higher availability for the entire cluster.

Care should always be taken when building a highly available cluster that uses a central data store and a GFS-based diskless shared root architecture. Data should always be made highly available with modern storage virtualization technologies like volume management, combined with standard storage array data protection capabilities. Furthermore, if the data needs to be replicated to a separate array one can either use a storage system to synchronously replicate all data or technologies like mirrored CLVM volumes (soon to be available with Red Hat GFS 6.1). The availability of a diskless shared root cluster is proportional to the availability of the central storage and network infrastructure.

Management

One of the key advantages of a diskless shared root cluster is its ease of management. Compared to other cluster concepts, management and operation are straightforward. With classic application clusters, the complexity of management is proportional to the number of nodes in the cluster: any change has to be rolled out on every node. As a consequence, such a cluster requires a single point of control, that is, an instance that can manage the whole cluster. That instance has to be made highly available in order not to become a single point of failure.

With a Red Hat GFS-based diskless shared root cluster, one can change any information on any node and the change is automatically seen by all nodes. No error-prone replication processes are required to propagate changes to the other nodes. Software updates can be made on one node, and every other node immediately benefits from the changes. The result is that all cluster nodes are equal and any node can act as the single point of control.

Single system image

A concept that spans the entire cluster is the single system image, or SSI.

[Pfister98] defines an SSI as an illusion, created by software or hardware, that a collection of computing elements is a single computing resource. Pfister also shows that every SSI has its limits and depends on the observer's or application's point of view. As a consequence, an SSI can exist at different levels: for example, an application and subsystem level SSI, an operating system kernel level SSI, or a hardware level SSI. The diskless shared root cluster provides a file system SSI, which is part of the kernel level SSI. To provide this SSI in a standardized way, the Open Shared Root project was founded (see [osr]). In addition to the design of a diskless shared root cluster, the project also addresses the boot process, so that nodes with different hardware can boot into the same diskless shared root cluster.

In combination with a kernel level SSI, applications run as if they were working on a single virtualized symmetric multiprocessing (SMP) server. The administrator does not have to care about which process is running on which node or how efficiently individual resources are being utilized; all of these functions are handled by the SSI. A very important part of the kernel level SSI is the shared root.

Constructing shared root disk clusters with Red Hat GFS is quite hardware- and kernel-version-specific. This feature should only be deployed with the help of Red Hat professional services or Red Hat advanced partners like ATIX GmbH.

Reference: Munich International Trade Fairs

Messe München International (Munich International Trade Fair - MMI), which organizes about 40 trade fairs for capital goods, consumer goods, and new technologies, is one of the world's leading trade fair companies. Over 30,000 exhibitors from more than 90 countries and over two million visitors from about 180 countries take part in the events in Munich every year. MMI also organizes trade fairs in Asia and North and South America. With five subsidiary companies abroad and 75 foreign branches dealing with 89 countries, MMI has a worldwide network.

MMI is clearly set for further growth, and as the business expands, so must the IT infrastructure. By the middle of 2005, it was apparent that the existing infrastructure for providing web services was no longer fully able to cope with the increased requirements. However, any further scaling of the system would have involved great expense. At that time the web services were based on a Linux cluster using the NFS file system. The operating system used on the cluster nodes was Debian.

Currently, MMI registers about two million visits per month across its 51 hosted websites, with visits predicted to rise by 400%-500% per year. Online services offered by MMI include the full range of functionality a visitor or exhibitor needs. Exhibitors can book all services, from booth space and catering to Internet and power connectivity. Visitors can book their tickets online and plan their time at the fair with a personalized web-based organizer. Even the ticket checks and gates at the fair's entrance are operated on the web services cluster system.

"We had previously always had very good experience of Linux," explains Martina Ritzer of MMI. "With the new solution, we really wanted to continue to benefit from the flexibility and vendor-independence of Linux, and have a scalable overall solution which was also professionally supported and properly certified by the main hardware and software providers."

MMI managers had already heard about the opportunities offered by the Red Hat Global File System (GFS). MMI turned to Red Hat to sound out the possibility of using a dynamic cluster system for its web services. Red Hat brought ATIX, its Advanced Partner for clustering and storage, on board. ATIX specializes entirely in highly scalable IT platforms for use in data centers and, even when the project first started, had extensive experience with Red Hat GFS, going back to the time when the preceding technology was developed and marketed under the ownership of Sistina. For MMI, ATIX developed the "diskless shared root cluster," an overall solution with Red Hat GFS at its center that meets all of MMI's requirements.

The starting configuration for the new cluster at MMI has 16 nodes: HP ProLiant servers, each with two Intel Xeon 2.8-3.2 GHz processors. Red Hat Cluster Suite is responsible for ensuring that the services of a failed server are taken over by another and for distributing the load between the cluster nodes.

At the heart of the system is Red Hat GFS, which allows parallel file system access by all cluster nodes to a central storage system. The server cluster is linked via SAN to an HP EVA storage system.

Already, the configuration of the cluster at MMI provides a high-performance and flexible solution using the latest technology. The innovative next step is to completely do away with hard disks in the cluster servers and to boot them directly from the storage system. This configuration is very easily scalable. New resources in the form of new server hardware can simply be added on the "plug and play" principle because the operating system is also installed centrally on the storage system. In addition, it makes maintaining the operating system much easier because there is only ever one version of the operating system to update.

The complete separation of cluster nodes and central disk storage means that all information about the structure and content of the cluster is consolidated in the central storage system. This means that if one server fails, no information is affected which has to be reinstated. This reduces the restart time for a cluster node to a minimum, because only the server hardware has to be replaced in order to return the system to its normal status. This increases the overall availability of the cluster.

Backup is a big issue in every enterprise storage installation. Using ATIX's com.oonics backup tools, consistent tape copies of the whole system can be produced while the system is running. To achieve consistent backup sets, ATIX com.oonics backup uses the snapshot function of the HP EVA 5000. The cluster is remotely monitored and administered using the ATIX solution com.oonics GrayHead.

Migration to the new cluster took place quickly and--thanks to ATIX's experience--very smoothly and largely without problems. The combination of standard hardware and open-source software proved to be highly effective and reliable.

"With the new cluster system, we have an extremely high-performance solution which offers us maximum scalability for the future," says Martina Ritzer. "We run MySQL, Tomcat, PHP, and email services in the cluster as well as FTP, CVS, and our staging front-end software. All the systems are performing brilliantly with the new operating system and the new architecture using the Red Hat Global File System. We are also really well equipped for future growth. Resource scaling using the plug and play principle is a whole new experience for us."

Fig.3. MMI cluster environment

MMI's cluster environment is divided into several subclusters, each of which has its own purpose: one for PHP applications, one for Java applications, one for mail, and so on (the 16th node is the backup node and is not shown). Each subcluster node is also a member of the main cluster, so it is possible to move nodes from one subcluster to another if performance issues or other reasons make this necessary.

Changing a node's role or subcluster membership is done by rebooting that particular node; meanwhile, all cluster applications continue to operate as usual. Because all nodes operate without local hard drives, any node can take on any role.

From the outside IP world, and from the Java or PHP application designer's point of view, the main cluster and subclusters appear to be just one single machine.

Summary

By using an ATIX com.oonics diskless shared root cluster based on Red Hat Enterprise Linux and GFS, Munich International Trade Fairs has achieved excellent performance, availability, scalability, and reduced complexity in a way that was unattainable with NFS. Data sharing via GFS has allowed Munich International Trade Fairs to reduce management complexity and scale performance to meet customer demand while using a low-cost, Linux-based hardware and software infrastructure. Management costs could be reduced to as little as 30% of former costs.

Red Hat GFS can also accelerate other storage-intensive applications, including Oracle database clusters, software development (builds and compiles), and high performance computing. Learn more about Red Hat GFS at redhat.com.

Bibliography

[Pfister98] Gregory F. Pfister - In Search of Clusters, 1998
[Teig04] David Teigland - Symmetric Cluster Architecture and Component Technical Specifications, 2004
[RHGFS] Red Hat Global File System, 2006
[osr] Open Shared Root Project, 2006
[HlGr0501] Mark Hlawatschek, Marc Grimme - Data Sharing with a GFS Storage Cluster, 2005
[HlGr0502] Mark Hlawatschek, Marc Grimme - Der Diskless Shared Root Cluster, 2005

About the authors

Marc Grimme, a graduate computer scientist, Mark Hlawatschek, a graduate engineer, and Thomas Merz, a graduate engineer, are the three founders of ATIX GmbH. Their main focus lies in the development and implementation of tailor-made enterprise storage solutions based on SAN/NAS. Additionally, they are responsible for the realization of highly scalable IT infrastructures for enterprise applications based on Linux. They also develop their own products, such as the modular com.oonics Enterprise IT Platform, which includes the com.oonics Diskless Shared Root Cluster.

ATIX has been the first--and so far only--Red Hat Advanced Partner in Europe for storage, clustering, and the Red Hat cluster file system (GFS) since the beginning of 2005.