[Linux-cluster] I/O errors and performance in GFS mounts

Fri Jan 4 18:20:24 UTC 2008

Greetings,

I have a two-node cluster based on Itanium2 machines using GFS for
shared storage with fibre channel as transport. The whole setup has been
working OK for three months now, and I have another two-node setup which
is also working OK, except for some fibre issues (see below); other
clustering applications are working OK.

I'm using clvmd and I've set up two LVs, mkfs'ed them with GFS and mounted
them on both nodes without any problems (based, of course, on the cman
cluster definitions). I've set up manual fencing since I don't have
proper fencing devices available at this time.

For the past couple of days I've been seeing a lot of I/O errors on the
GFS mounts, for example when using df to look at the available space on
local mounts, and of course when ls'ing the shares. Sometimes df also
reports incorrect size information (for example only 677 MB used when
the share holds about 60 GB).
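
A minimal sketch of how the "used" figure could be logged on each node
for later comparison, assuming Python is available on the nodes. It
computes used space the same way df does (total blocks minus free
blocks, times the fragment size) via os.statvfs. The function names are
illustrative, and any mount path passed in is an assumption, not part of
the setup described here.

```python
import os


def used_bytes(f_frsize, f_blocks, f_bfree):
    """Used space in bytes: (total blocks - free blocks) * fragment size."""
    return (f_blocks - f_bfree) * f_frsize


def mount_usage(path):
    """Return the used bytes the kernel reports for the filesystem at path."""
    st = os.statvfs(path)
    return used_bytes(st.f_frsize, st.f_blocks, st.f_bfree)
```

Running mount_usage on the same GFS mount point from both nodes at
roughly the same time would show whether the two nodes disagree about
the filesystem's usage.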

This problem occurs on only one of the two nodes at a time, and
it is mostly random. The cluster hosts IMAP (Dovecot) and SMTP (Postfix)
services, which become unusable (except for non-local mail transport in
Postfix) when these I/O errors appear.

Searching dmesg and syslog for errors turns up a steady stream of
messages such as this one:

<error>
GFS: can't mount proto = lock_dlm, table = mail:inbox, hostdata =
...
</error>

Where ... varies from:

kernel unaligned access to 0xfffffffffffffffd, ip=0xa000000100187d81
mount[2200]: error during unaligned kernel access

to

mount[5221]: NaT consumption 2216203124768 [4]
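
A minimal sketch for tallying these messages per node, assuming Python
is available. The regular expressions are taken from the messages quoted
above; the category labels and function names are illustrative only.

```python
import re
from collections import Counter

# Patterns based on the kernel messages quoted above.
# The dictionary keys are arbitrary labels chosen for this sketch.
PATTERNS = {
    "gfs_mount_failure": re.compile(r"GFS: can't mount proto"),
    "unaligned_access": re.compile(r"unaligned (kernel )?access"),
    "nat_consumption": re.compile(r"NaT consumption"),
}


def classify(line):
    """Return the label of the first pattern matching line, or None."""
    for label, pattern in PATTERNS.items():
        if pattern.search(line):
            return label
    return None


def count_errors(lines):
    """Count how many lines fall into each error category."""
    counts = Counter()
    for line in lines:
        label = classify(line)
        if label is not None:
            counts[label] += 1
    return counts
```

Passing the lines of dmesg or syslog output from each node to
count_errors would give a rough per-node count of each message type,
which may help show whether the two variants cluster on one node.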

I'm aware that unaligned kernel accesses are not a bug, but rather a
well-handled inconsistency, but these seem to interfere with GFS far too
much. I fsck'ed the filesystems and this seemed to help a little, but
ls'ing the GFS filesystems is still slow.

We chose GFS over HA NFS, but we're running into these kinds of
performance problems. Some of our problems are due to fibre issues, for
example unexpected LOOP DOWNs, but this time it looks more like a
software issue. I'm running kernel 2.6.18 on Debian Etch.

I would like to know if any of you have run into this problem. Maybe
I'm missing some critical part of my cluster setup.

Regards,
Jose