Issue #9 July 2005

Best practices with the Red Hat GFS

In this article, we discuss best practices for Red Hat® Global File System (GFS), including when to use GFS instead of NFS, tips on configuring hardware for use in a Red Hat GFS cluster, and how GFS quotas and context-dependent path names can be used to simplify system management.

Introduction

Red Hat GFS is a cluster file system for Red Hat® Enterprise Linux® that allows multiple servers to share the same files on a storage area network. GFS uses its own on-disk metadata and sophisticated clustering techniques to let Red Hat Enterprise Linux servers access shared storage devices directly, providing scalable storage performance and capacity. It incurs significantly less overhead than the Network File System (NFS). In the next section, we outline the differences between GFS and NFS and make recommendations on how to use these technologies.

NFS versus Red Hat GFS

The previous article, Red Hat GFS vs. NFS: Improving performance and scalability, compared Red Hat GFS to NFS in configurations where the underlying storage hardware is similar. However, it's important to remember that GFS can provide a direct connection between a server and shared storage devices, whereas the NFS protocol requires that clients attach to storage devices through a server. The NFS server layer introduces additional overhead and can act as a bottleneck that limits performance and scalability. In contrast, GFS can use fast storage area network (SAN) technology like Fibre Channel to scale up storage bandwidth and I/O operations per second. Let's compare standard NFS client-server networks to a Red Hat Enterprise Linux cluster using GFS and a SAN across several relevant metrics.

Client scalability

Red Hat GFS can scale up to 300 or more client machines with direct connections to the SAN, whereas NFS can generally support 10 to 20 clients for bandwidth-intensive workloads and perhaps 30 to 50 clients for less I/O-intensive workloads. GFS is fundamentally more scalable than NFS, but it does require a SAN infrastructure and a sophisticated cluster locking and membership protocol running between GFS clients. These clustering protocols require the clients to act in a tightly synchronized way, so network, server, and SAN faults must be processed before normal file access resumes.

Bandwidth between clients and storage

With Red Hat GFS, bandwidth between the GFS machines and shared storage is limited only by the SAN infrastructure and the storage hardware. Additional bandwidth can be added incrementally with more storage arrays and SAN network ports. In contrast, with NFS, the NFS server itself is the primary bottleneck to scaling bandwidth between storage and the NFS clients: the maximum bandwidth is determined by how much a particular server can provide. In addition, the NFS protocol itself imposes much more network processing overhead than GFS, further limiting NFS bandwidth scalability.

Complexity at large scale

Large Red Hat Enterprise Linux clusters with dozens or hundreds of machines that must share data can do so quite easily with GFS. In contrast, with NFS, multiple NFS servers are often required to scale up to the required I/O bandwidth and operations per second. Data placement and synchronization between these different NFS servers is complex and imposes additional management overhead on the system administrator.

POSIX semantics

NFS clients may cache write data for a few seconds or more to improve performance. In the NFS v2 and v3 protocol versions, it is quite possible that the NFS server will not know this, and for short periods of time the NFS server and client version of file data can be out of sync. This behavior does not conform to POSIX file access semantics, so that standard UNIX applications generally cannot be run across a group of NFS clients if data is to be shared using files. In contrast, Red Hat GFS file write and read accesses follow strict POSIX semantics, so that a write to a file on one machine in a GFS cluster is always visible to another machine that later reads that file. By following standard POSIX semantics, GFS allows standard UNIX applications to be run across the cluster.
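
For example (the file name here is hypothetical), a write completed on one GFS node is immediately visible to a subsequent read on another node:

n01# echo "order complete" >> /gfs/shared/status.log
n02# cat /gfs/shared/status.log
order complete

The same sequence on a pair of NFS v2 or v3 clients may return stale data on the second machine until the first client's write cache has been flushed to the server.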

These four factors, along with the additional hardware costs and tight cluster synchronization required by GFS, yield the following recommendations regarding a GFS deployment:

  • Deploy GFS when you have a large number (more than 10 for bandwidth-intensive I/O, more than 50 for moderate intensity I/O workloads) of Red Hat Enterprise Linux machines that must access, share, and often change a set of shared data.
  • Deploy GFS when many Red Hat Enterprise Linux machines must run POSIX applications that share data.
  • Deploy GFS if partitioning data among several NFS servers adds too much additional complexity and data duplication.
  • Deploy GFS when the network processing overheads and loose semantics of NFS are not acceptable.
  • Deploy NFS when your performance requirements are low, you have only a few client machines sharing files in a loosely coupled way, and low cost is important.

Let's now consider some best practices with respect to deploying hardware for use in a GFS cluster.

Hardware best practices for GFS

Red Hat GFS requires that the following hardware components (in addition to standard servers and networking hardware) be present for it to run properly:

  1. Shared network block storage visible to all GFS machines in a cluster
  2. Fencing hardware that resets or reboots GFS machines that are no longer communicating with or connected to the cluster

Shared storage

Shared block storage can be provided in several ways, with a variety of cost and performance trade-offs. In general, large GFS clusters that require a high storage area network port count have significantly higher per-port costs, because large Fibre Channel switches (known as Director-class switches) cost more per port.

A small GFS cluster (two to eight machines) with, for example, four machines can be directly attached to a 4-port storage array without a switch, significantly reducing costs. This configuration is shown in Figure 1. Mid-sized storage arrays with up to eight ports are also readily available from a variety of storage vendors, and a configuration without switches is still reasonable for this number of machines.

Figure 1. Small GFS Cluster

A medium-sized GFS cluster (9 to 32 machines) is best configured for shared storage with a Fibre Channel switch whose port count is approximately the number of machines plus the number of storage array ports, with an additional three to four ports for some minimal growth. If significant growth in the number of machines is possible, both small- and medium-sized clusters should use switches to accommodate the additional port count growth. An example of a medium-sized GFS cluster with 10 servers and four storage ports using a 16-port Fibre Channel switch is shown in Figure 2.

Figure 2. Medium GFS cluster

Large GFS clusters (more than 32 machines) should use large port count Fibre Channel switches (known as Director-class switches) that provide significant internal redundancy to handle fan, power supply, and switch component failures. A whole switch failure is costly in large GFS clusters because it brings down a large number of servers at once. This is unacceptable in most enterprise computing environments. Figure 3 shows a large 140-port Director switch connecting 128 machines to 8 storage arrays.

Figure 3. Large GFS clusters

Fencing hardware

Fencing hardware is required for both Red Hat GFS and Red Hat Cluster Manager, a high availability application failover product. When a machine can no longer communicate with other machines in the GFS cluster due to hardware or software failures, it must be fenced to prevent its access to shared storage. This is accomplished either with external power switches (such as the APC MasterSwitch) or via external server management interfaces like HP iLO (Integrated Lights-Out) or Dell DRAC (Dell Remote Access Controller). These remote server management tools let a GFS cluster reset or reboot machines as part of a controlled fencing process to recover from a component failure and continue processing. The advantage of these out-of-band management techniques is that they can handle nearly all failure conditions and generally guarantee that the cluster continues processing after a fault.
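
The same fencing path can be exercised by hand when validating a new configuration. The sketch below assumes the fence_node utility shipped with the Red Hat cluster packages, which invokes the fence agent (for example, fence_apc or fence_ilo) defined for the named machine in the cluster configuration; exact commands and options vary by GFS release.

n01# fence_node n02

If the fence device is configured correctly, n02 is power-cycled or reset and the remaining machines resume normal file system activity.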

Making the best use of GFS features

Certain GFS features may be particularly useful in a given situation. In this section we describe several of these features and their usage.

GFS quota management

Red Hat GFS provides an effective quota management system for both users and groups through its gfs_quota command. Particular effort has been made to make GFS quota updates fast and scalable. File system quotas are used to limit the amount of file system space a user or group can use. For GFS, a user or group does not have a quota limit until one is set, but GFS keeps track of the space used by each user and group even when there are no limits in place. GFS updates quota information in a transactional way, so system crashes do not require quota usage to be reconstructed. To prevent a performance slowdown, a GFS node synchronizes updates to the quota file only periodically. This "fuzzy" quota accounting can allow users or groups to slightly exceed the set limit. To minimize this, GFS dynamically reduces the synchronization period as a "hard" quota limit is approached.
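
For example, assuming the sync action described in the GFS Administrator's Guide, an administrator can force a node to write its local quota accounting to the on-disk quota file immediately rather than waiting for the next periodic synchronization:

gfs_quota sync -f /gfs

Here /gfs is the mount point of the GFS file system whose quota file should be brought up to date.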

Two quota settings are available for each user ID (UID) or group ID (GID): a hard limit and a warn limit. A hard limit is the maximum amount of space that can be used; the file system will not let the user or group use more than that amount of disk space. A hard limit value of zero means that no limit is enforced. A warn limit is usually a value less than the hard limit; the file system will notify the user or group when the warn limit is reached. A warn limit value of zero means that no limit is enforced. Limits are set using the gfs_quota command, which only needs to be run on a single node where GFS is mounted.

To set a hard limit for a user or group, the following commands can be used:

gfs_quota limit -u <user> -l <size> -f <mountpoint>
gfs_quota limit -g <group> -l <size> -f <mountpoint>

To set warn limits for a user or group, the following commands can be used:

gfs_quota warn -u <user> -l <size> -f <mountpoint>
gfs_quota warn -g <group> -l <size> -f <mountpoint>

<size> indicates the quota size in MB, while <mountpoint> specifies the GFS file system to which the actions apply. For example, to set the hard limit for user Bert to 1024 MB (1 GB) on file system /gfs, the following command can be used:

gfs_quota limit -u Bert -l 1024 -f /gfs

To set the warn limit for group ID 21 to 50 kilobytes on file system /gfs, use the following command (the -k flag specifies that the size is given in kilobytes rather than the default megabytes):

gfs_quota warn -g 21 -l 50 -k -f /gfs
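
To check the values in effect, the get and list actions described in the GFS Administrator's Guide can be used; the exact output format varies by release. The first command below displays the limits and current usage for a single user, and the second displays the entire quota file for the mounted file system:

gfs_quota get -u Bert -f /gfs
gfs_quota list -f /gfs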

Context-Dependent Path Names (CDPN)

Context-Dependent Path Names (CDPNs) allow symbolic links to be created that point to destination files or directories that can vary based upon a context-sensitive variable. The variables are resolved to real files or directories each time an application follows the symbolic link. The resolved value of the link depends on certain machine or user attributes. CDPN variables can be used in any path name, not just with symbolic links. However, the CDPN variable name cannot be combined with other characters to form an actual directory or file name. The CDPN variable must be used alone as one segment of a complete path.
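
For example (the link and directory names here are hypothetical), a link whose target is @hostname/scratch is valid because the variable stands alone as one path segment:

ln -s @hostname/scratch data

A target such as scratch_@hostname, in which the variable is combined with other characters, is treated as a literal name and is not resolved.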

In the following symbolic link operation:

ln -s <target> <linkname>

<target> specifies an existing file or directory on a file system while <linkname> specifies a name to represent the real file or directory on the other end of the link.

In the following variable symbolic link command:

ln -s <variable> <linkname>

<variable> specifies a special reserved name from a list of values (refer to Table 1. CDPN variable values) to represent the resolved name for multiple existing files or directories. This string is not the name of an actual file or directory itself. (The real files or directories must be created in a separate step using names that correlate with the type of variable used.) <linkname> specifies a name that will be seen and used by applications and will be followed to get to one of the multiple real files or directories. When <linkname> is followed, the destination depends on the type of variable and the node or user following the link.

Variable    Description
@hostname   Resolves to a real file or directory named with the hostname string produced by the output of the following command: uname -n
@mach       Resolves to a real file or directory named with the machine type string produced by the output of the following command: uname -m
@os         Resolves to a real file or directory named with the operating system name string produced by the output of the following command: uname -s
@sys        Resolves to a real file or directory named with the combined machine type and OS release strings produced by the output of the following command: echo `uname -m`_`uname -s`
@uid        Resolves to a real file or directory named with the user ID string produced by the output of the following command: id -u
@gid        Resolves to a real file or directory named with the group ID string produced by the output of the following command: id -g
Table 1. CDPN variable values

Consider a cluster with three machines with hostnames n01, n02, and n03. Applications on each node use the directory /gfs/log/, but the administrator wants to create separate log directories for each node. To do this, no actual log directory is created; instead, an @hostname CDPN link is created with the name log. Individual directories /gfs/n01/, /gfs/n02/, and /gfs/n03/ are created to be the actual directories used when each node references /gfs/log/. The following Linux command sequence and output help explain CDPN usage.

n01# cd /gfs 
n01# mkdir n01 n02 n03 
n01# ln -s @hostname log 
n01# ls -l /gfs 
lrwxrwxrwx 1 root root 9 Apr 25 14:04 log -> @hostname/ 
drwxr-xr-x 2 root root 3864 Apr 25 14:05 n01/ 
drwxr-xr-x 2 root root 3864 Apr 25 14:06 n02/ 
drwxr-xr-x 2 root root 3864 Apr 25 14:06 n03/ 

n01# touch /gfs/log/fileA 
n02# touch /gfs/log/fileB 
n03# touch /gfs/log/fileC 
n01# ls /gfs/log/ 
fileA 
n02# ls /gfs/log/
fileB 
n03# ls /gfs/log/ 
fileC
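
The other variables in Table 1 work the same way. As a brief sketch (the directory names are hypothetical), the @uid variable can give each user a private area under a common path:

n01# cd /gfs
n01# mkdir 500 501
n01# ln -s @uid scratch

A user whose id -u reports 500 then reaches /gfs/500/ through /gfs/scratch/, while a user whose UID is 501 reaches /gfs/501/.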

Summary

Red Hat GFS provides fast, scalable access to shared storage. In this article we have offered guidance on hardware configurations, command usage, and when to use GFS versus NFS. You can learn more about GFS in the GFS 6.0 and GFS 6.1 Administrator's Guides. Further information on constructing and configuring SANs can be found in the book Building SANs with Brocade Fabric Switches by Josh Judd, et al.

About the author

From 1990 to May 2000, Matthew O'Keefe taught and performed research in storage systems and parallel simulation software as a professor of electrical and computer engineering at the University of Minnesota. He founded Sistina Software in May of 2000 to develop storage infrastructure software for Linux, including the Global File System (GFS) and the Linux Logical Volume Manager (LVM). Sistina was acquired by Red Hat in December 2003, where Matthew now directs storage software strategy.