[Linux-cluster] cluster architecture/filesystem suggestions wanted
bergman at merctech.com
Mon Aug 10 16:16:13 UTC 2009
Hello,
I've been testing an RHCS (CentOS 5.2) cluster with GFS1 for a while and
am about to transition the cluster to production, so I'd appreciate
a quick review of the architecture and filesystem choices. I've got some
concerns about GFS (1 & 2) stability and performance vs. ext3fs, but the
increased flexibility of a clustered filesystem has a lot of advantages.
If there are fundamental stability advantages to a design that does not
cluster the filesystems (i.e., one that uses GFS in lock_nolock mode or ext3fs),
that would override any performance consideration.
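For what it's worth, the lock_nolock alternative doesn't require reformatting: if I have the mount options right, an existing GFS volume can be mounted with clustered locking disabled by overriding the lock protocol at mount time (device and mount point below are placeholders):

```
# Mount an existing GFS volume with no cluster locking.
# WARNING: safe only while exactly one node has the volume mounted;
# lock_nolock mounts from two nodes at once will corrupt the filesystem.
mount -t gfs -o lockproto=lock_nolock /dev/vg_data/lv_images /mnt/images
```

That keeps the option of switching back to lock_dlm later if the volume ever needs to be shared.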
Assuming that stability is not an issue, my basic question in choosing
an architecture is whether performance is better with GFS and multiple
cluster nodes accessing the same data (gaining some CPU and network load
balancing at the cost of the GFS locking penalty), or with each volume
served from a single server via NFS (using RHCS solely for fail-over).
Obviously, I don't expect anyone to provide definitive answers or data
that's unique to our environment, but I'd highly appreciate your views
on the architecture choices.
Background:
Our lab does basic science research on software to process medical
images. There are about 40 lab members, with about 15~25 logged in
at any given time. Most people will be logged into multiple servers
at once, with their home directory and all data directories provided
via NFS at this time.
The workload is divided between a software development environment
(compile/test cycles) and image processing. The software development
process is interactive, and includes algorithm testing which requires
reading/writing multi-MB files. There's a reasonably high performance
expectation for interactive work, less so for the testing phase.
Many lab members also mount filesystems from the servers on their
desktop machines via Samba, and for those mounts there is a high
performance expectation.
The image processing is very strongly CPU-bound, but involves reading
many image files in the 1 to 50MB range, and writing results files
in the same range, along with smaller index and metadata files. The
image processing is largely non-interactive, so the I/O performance
is not critical.
The RHCS cluster will be used for infrastructure services (not as a
compute resource for image processing, not as login servers, not as
compilation servers). The primary services to be run on the clustered
machines are:
network file sharing (NFS, Samba)
SVN repository
backup server (bacula, to fibre-attached tape drive)
Wiki
nagios
None of those services require much CPU. The network file sharing
could benefit from load balancing, so that the NFS and Samba clients have
multiple network paths to the storage, but the NFS and Samba protocols
are not well suited to using RHCS as a load balancer, so this may not
be possible (using LVS or a front-end hardware load balancer is not an
option at this time... HA is much more important than load balancing).
The goals of using RHCS and clustering those functions are (in order of
importance):
stability of applications
high availability of applications
performance
expandability of filesystems (i.e., expand volumes at the SAN, LUN,
LVM, and filesystem layers)
expandability of servers (add more servers to the cluster, with
machines dedicated to functions, as a crude form of load
balancing)
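On the filesystem-expandability goal: both GFS and ext3 can be grown online on top of LVM, so that goal doesn't strongly favor either design. A sketch with placeholder volume names, assuming the ext3 filesystem was created with online-resize support (the RHEL5 default):

```
# Grow the LV first (placeholder VG/LV names), then the filesystem, online.
lvextend -L +500G /dev/vg_data/lv_images

# GFS: grow via the mounted mount point
gfs_grow /mnt/images

# ext3: online resize of the mounted filesystem
resize2fs /dev/vg_data/lv_images
```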
The computing environment consists of:
2 RHCS servers
fibre attached to storage and backup tape device
~15TB EMC fibre-attached storage
~14TB fibre and iSCSI attached storage in the near future
4 compute servers
currently accessing storage via NFS, could be
fibre-attached and configured as cluster members
35 compute servers
NFS-only access to storage, possibly iSCSI in the
future, no chance of fibre attachment
As I see it, there are 3 possible architecture choices:
[1] infrastructure-only GFS+NFS
the 2 cluster nodes share storage via GFS, and
act as NFS servers to all compute servers
+ load balancing of some services
- complexity of GFS
- performance of shared GFS storage
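For concreteness, the shared-GFS piece of options [1] and [2] comes down to something like the following (cluster name, filesystem name, and device are placeholders; -j should match the number of nodes that will mount it):

```
# Clustered GFS1 filesystem: DLM locking, lock table "<clustername>:<fsname>",
# one journal per mounting node (2 here, for the two infrastructure nodes).
gfs_mkfs -p lock_dlm -t labcluster:images -j 2 /dev/vg_data/lv_images
```

Adding fibre-attached compute nodes later (option [2]) means adding journals with gfs_jadd before those nodes can mount.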
[2] shared storage/NFS
2 cluster nodes and 4 fibre-attached compute servers
share storage via GFS (all machines are RHCS nodes, but
the compute nodes do not provide infrastructure services,
just use cluster membership for GFS file access)
each GFS node is potentially an NFS server (via a VIP) to
the 35 compute servers that are not on the fibre SAN
+ potentially faster access to data for 4 fibre-attached
compute servers
- potentially slow access to data for 4 fibre-attached
compute servers due to GFS locking
+ increased stability over 2 node cluster
- increased complexity
[3] exclusive storage/NFS
filesystems are formatted as ext3fs, exclusively mounted
to one of the 2 infrastructure cluster nodes at a time,
each filesystem mount also includes a child (dependent)
function for the node to be an NFS server, all compute nodes
access data via NFS
+ reliability of filesystem
+ performance of filesystem
- potential for corruption in case of non-exclusive access
- decreased flexibility due to exclusive use
- no potential for load balancing across cluster nodes
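As a sketch of what option [3] looks like in cluster.conf (all names, paths, and addresses below are placeholders), the ext3 mount, the NFS export, and the service VIP can be expressed as nested resources so they fail over as a unit:

```
<!-- Hypothetical RHCS service: ext3 mount, NFS export, and VIP move together -->
<service autostart="1" domain="infra" name="nfs_images">
  <fs device="/dev/vg_data/lv_images" fstype="ext3" mountpoint="/export/images"
      name="images_fs" force_unmount="1" self_fence="1">
    <nfsexport name="images_export">
      <nfsclient name="lab_nets" target="10.0.0.0/24" options="rw,sync"/>
    </nfsexport>
  </fs>
  <ip address="10.0.0.100" monitor_link="1"/>
</service>
```

The force_unmount/self_fence options on the fs resource are the usual guard against the non-exclusive-access corruption case listed as a risk above.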
I'm very interested in getting your opinion of the choices, and would like
to learn about other ideas that I may have overlooked.
Thanks,
Mark
----
Mark Bergman voice: 215-662-7310
mark.bergman at uphs.upenn.edu fax: 215-614-0266
System Administrator Section of Biomedical Image Analysis
Department of Radiology University of Pennsylvania
PGP Key: https://www.rad.upenn.edu/sbia/bergman