[Linux-cluster] which is better gfs2 and ocfs2?

Jeff Sturm jeff.sturm at eprize.com
Sat Mar 12 17:46:15 UTC 2011


> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com]
> On Behalf Of Alan Brown
> Sent: Friday, March 11, 2011 7:07 PM
>
> Personal observation: GFS and GFS2 currently have utterly rotten
> performance for activities involving many small files, such as NFS
> exporting /home via NFS sync mounts. They also fail miserably if there
> are a lot of files in a single directory (more than 500-700, with
> things getting unusable beyond about 1500 files)

While I certainly agree there are common scenarios in which GFS performs
slowly (backup by rsync is one), your characterization of GFS
performance within large directories isn't completely fair.

Here's a test I just ran on a cluster node, immediately after rebooting,
joining the cluster and mounting a GFS filesystem:

	[root at cluster1 76]# time ls
	00076985.ts  28d80a9c.ts  52b778d2.ts  7f50762b.ts  a9c5f908.ts  d39d0032.ts
	00917c3e.ts  28de643b.ts  532d3fd7.ts  7f5dea46.ts  a9e0328b.ts  d3bcc9fb.ts
	...
	289d2764.ts  527b6f37.ts  7f3e5c9a.ts  a989df77.ts  d36c57fc.ts
	28c3aa38.ts  52ab865f.ts  7f3e9278.ts  a9aa3dba.ts  d392d793.ts

	real    0m0.034s
	user    0m0.008s
	sys     0m0.004s

	[root at cluster1 76]# ls | wc -l
	1970

The key is that only a few locks are needed to list the directory:

	[root at cluster1 76]# gfs_tool counters /tb2
	
	                                  locks 32
	                             locks held 25

Running "ls -l" on the same directory takes a bit longer (by a factor of
about 20):

	[root at cluster1 76]# time ls -l
	total 1970
	-rw-r----- 1 root root 42 Mar  2 12:01 00076985.ts
	-rw-r----- 1 root root 42 Mar  2 12:01 00917c3e.ts
	-rw-r----- 1 root root 42 Mar  2 12:01 00b60c66.ts
	...
	-rw-r----- 1 root root 42 Mar  2 12:01 ffc02edd.ts
	-rw-r----- 1 root root 42 Mar  2 12:01 ffefd00a.ts
	-rw-r----- 1 root root 42 Mar  2 12:01 fff80ff6.ts

	real    0m0.641s
	user    0m0.032s
	sys     0m0.032s

presumably because it has to acquire quite a few additional locks:

	[root at cluster1 76]# gfs_tool counters /tb2
	
	                                  locks 3972
	                             locks held 3965
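
The difference is easy to see at the system call level.  Here's a
minimal C sketch (not GFS code, just roughly the POSIX calls the two
commands boil down to): a plain listing only reads directory entries
from the one directory inode, while the long listing stats every name,
and on GFS each of those stats has to acquire additional per-file locks.

	#define _POSIX_C_SOURCE 200809L
	#include <stdio.h>
	#include <string.h>
	#include <dirent.h>
	#include <fcntl.h>
	#include <sys/stat.h>

	int main(int argc, char **argv)
	{
	    const char *path = (argc > 1) ? argv[1] : ".";
	    int long_listing = (argc > 2);      /* any second argument = "ls -l" mode */

	    DIR *dir = opendir(path);
	    if (dir == NULL) {
	        perror("opendir");
	        return 1;
	    }

	    int dfd = dirfd(dir);               /* directory fd for fstatat() */
	    struct dirent *ent;
	    unsigned long count = 0;

	    while ((ent = readdir(dir)) != NULL) {
	        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
	            continue;
	        count++;
	        if (long_listing) {
	            /* The per-entry stat is what drives the extra per-file
	             * lock traffic seen in the counters above. */
	            struct stat st;
	            if (fstatat(dfd, ent->d_name, &st, AT_SYMLINK_NOFOLLOW) == 0)
	                printf("%10lld  %s\n", (long long)st.st_size, ent->d_name);
	        } else {
	            printf("%s\n", ent->d_name);  /* readdir() alone: no per-file stat */
	        }
	    }

	    printf("%lu entries\n", count);
	    closedir(dir);
	    return 0;
	}

Run it against a large GFS directory with and without the extra
argument and the gfs_tool lock counters should diverge much as they do
for "ls" versus "ls -l".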

For better or worse, "ls -l" (or the aliased "ls --color=tty" that Red
Hat users get by default, which is comparably expensive because it also
stats every entry) is a very common operation for interactive users, and
those users often come away with an immediate negative impression of GFS
as a result.  In my personal opinion:

- Decades of work on Linux have optimized local filesystems and the
system call path to the point that system call overhead is treated as
negligible for most applications.  Running "ls -l" within a large
directory is a slow, expensive operation on any system, but if it
"feels" fast enough (in terms of wall clock time, not compute cycles)
there's little incentive to optimize it further.  The same goes for
application software: it's shocking to me how many unnecessary system
calls our own applications make, often by way of libraries such as
glibc.

- Cluster filesystems require a lot of network communication to maintain
perfect consistency.  The network protocols used by (e.g.) DLM to
maintain this consistency are probably several orders of magnitude
slower than the cache-coherency machinery of an SMP system.  It follows
that assumptions about stat() performance on a local filesystem do not
necessarily hold on a clustered filesystem, and application performance
can suffer as a result.

- Overcoming this may involve significant changes to the Linux system
call interface (assuming there won't be a hardware solution anytime
soon).  For example, the traditional stat() interface limits us to one
file's metadata per system call, and on a clustered filesystem stat()
often triggers a synchronous network round-trip via the locking
protocol.  A stat()-like interface that could look up many files at
once would be an improvement, but it would be relatively difficult to
adopt because it entails changing the kernel, the libraries, and
application software (a rough sketch of such an interface follows after
this list).

- Ethernet is a terrible medium for a distributed locking protocol.
Ethernet is well suited to applications that need high bandwidth and
are not particularly sensitive to latency; DLM needs very little
bandwidth but is extremely sensitive to latency.  Better hardware exists
for this (e.g.
http://www.dolphinics.com/products/pemb-sci-d352.html), but alas
Ethernet is ubiquitous, and as far as I am aware little work has been
done in the cluster community to support alternative interconnects.
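
To make the third point above concrete, here is a rough sketch of the
kind of interface I mean.  It is entirely hypothetical -- no such system
call exists, the names and signature are invented, and the body below is
only a userspace emulation so the sketch compiles; the interesting part
would be a kernel implementation that batches the lock traffic:

	#define _POSIX_C_SOURCE 200809L
	#include <errno.h>
	#include <fcntl.h>
	#include <stddef.h>
	#include <sys/stat.h>

	struct stat_batch_entry {
	    const char *name;       /* path, relative to dirfd       */
	    struct stat st;         /* filled in on success          */
	    int err;                /* per-entry errno, 0 on success */
	};

	/*
	 * Hypothetical "multi-stat": return metadata for 'count' names in a
	 * single call.  Emulated here with a loop of fstatat(); a real
	 * kernel implementation could let a clustered filesystem acquire
	 * the needed locks in bulk instead of one synchronous round-trip
	 * per file.
	 */
	int stat_batch(int dirfd, struct stat_batch_entry *entries, size_t count)
	{
	    size_t ok = 0;
	    for (size_t i = 0; i < count; i++) {
	        if (fstatat(dirfd, entries[i].name, &entries[i].st,
	                    AT_SYMLINK_NOFOLLOW) == 0) {
	            entries[i].err = 0;
	            ok++;
	        } else {
	            entries[i].err = errno;
	        }
	    }
	    return (int)ok;         /* number of entries successfully stat'ed */
	}

A tool like "ls -l" could then fill an array with one readdir() pass and
fetch all the metadata in a single call, rather than one stat() per name.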

As an example, while running a "du" command on my GFS mount point, I
observed the Ethernet traffic peak:

	12:20:33 PM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
	12:20:38 PM      eth0   3517.60   3520.60 545194.80 631191.20      0.00      0.00      0.00

So a few thousand packets per second is the best this cluster node could
muster.  Average packet sizes are less than 200 bytes each way.  I'm
sure I could bring in my network experts and improve these results
somewhat, maybe with hardware that supports TCP offloading, but you'd
never improve this by more than perhaps an order of magnitude because
you're hitting the limits of what Ethernet hardware can do.
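
To put rough numbers on that (simple arithmetic from the figures quoted
above -- the "ls -l" lock counts and the sar sample -- nothing more):

	#include <stdio.h>

	int main(void)
	{
	    /* "ls -l" above: 3965 locks held after 0.641 s of wall time. */
	    double per_lock_us = 0.641 / 3965.0 * 1e6;      /* ~162 us per lock */

	    /* sar sample above: bytes/s divided by packets/s. */
	    double rx_bytes_per_pkt = 545194.80 / 3517.60;  /* ~155 bytes */
	    double tx_bytes_per_pkt = 631191.20 / 3520.60;  /* ~179 bytes */

	    printf("~%.0f us of wall time per lock during 'ls -l'\n", per_lock_us);
	    printf("~%.0f bytes per packet received, ~%.0f sent\n",
	           rx_bytes_per_pkt, tx_bytes_per_pkt);
	    return 0;
	}

That works out to roughly 160 microseconds of wall time per lock and
well under 200 bytes per packet: a latency-bound workload that extra
bandwidth does little to help.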

In summary, the state of the art in Linux clustered filesystems is
unlikely to change much until we write application software that is more
careful about its system call usage, redesign the system call interface
to work better with distributed locking protocols, or start using
hardware that provides distributed shared memory far more efficiently
than Ethernet can.  Until one of those things happens, many users are
bound to be unimpressed with GFS and similar clustered filesystems, and
they will remain a niche technology.

-Jeff





