RHCS/GFS2 support team,
I would like to inform you about a serious GFS2 problem we encountered last week.
Please find a detailed description below. I have enclosed a tarfile containing
detailed information about this problem.
The two-node cluster is used as a test cluster without any load.
Only functionality is tested; no performance tests are run. The RHCS services
that run on this cluster are rather standard.
Within two days we had two occurrences of this problem, and the two
were very similar.
On the 2nd node, a Perl script tried to write some info to a file on
the GFS2 filesystem, but the process hung at that point. From the GFS2
lockdump info we saw one W-lock associated with an inode, and it
turned out that the inode was a directory on GFS2. Every command executed on
that file (e.g. ls -l) or on that directory (e.g. du <dirname>) also
hung.
All of the hung processes were in the D state (uninterruptible sleep).
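For anyone who wants to check for the same symptom, here is a minimal sketch of how processes in the D state can be listed by reading /proc/<pid>/stat. It is an illustration only, not the method we used to collect the enclosed data, and it assumes a Python 3 interpreter is available on the node. The function names parse_stat and find_d_state are hypothetical.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' instead of on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid_text, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), comm, state


def find_d_state(proc_root="/proc"):
    """Return a list of (pid, comm) for processes in uninterruptible sleep."""
    hung = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (IOError, OSError):
            continue  # the process exited while we were scanning
        if state == "D":
            hung.append((pid, comm))
    return hung
```

Calling find_d_state() on the affected node would return the PID and command name of each process stuck in the D state at that moment.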
However, from the 1st node all files and directories were accessible without
any problem. Even ls -lR executed on the 1st node from the top of the GFS2
filesystem traversed the full directory tree without problems.
We suspect that the offending directory still has a W-lock but that the
lock no longer has an owner.
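To illustrate the kind of check behind this suspicion, here is a rough sketch that scans a glock dump for glocks with waiting holders. It assumes the dump follows the format described in the upstream kernel documentation (gfs2-glocks.txt): "G:" lines for glocks, indented "H:" lines for holders, holder flag H meaning granted and W meaning waiting. The dump produced on CentOS 5.7 may differ, and reading our "W-lock" as a holder with the W flag is itself an assumption. The function name waiting_glocks is hypothetical.

```python
import re

# "G:  s:<state> n:<type>/<number-in-hex> f:<flags> ..."
GLOCK_RE = re.compile(r"^G:\s+s:(\S+)\s+n:(\d+)/([0-9a-fA-F]+)\s+f:(\S*)")
# " H: s:<state> f:<flags> e:<error> p:<pid> [<command>] ..."
HOLDER_RE = re.compile(
    r"^\s*H:\s+s:(\S+)\s+f:(\S*)\s+e:(-?\d+)\s+p:(-?\d+)\s+\[([^\]]*)\]"
)


def waiting_glocks(dump_text):
    """Return the glocks in dump_text that have at least one waiting holder."""
    glocks = []
    current = None
    for line in dump_text.splitlines():
        m = GLOCK_RE.match(line)
        if m:
            current = {
                "state": m.group(1),
                "type": int(m.group(2)),   # 2 is the inode glock type
                "number": m.group(3),      # hexadecimal, kept as text
                "granted": [],
                "waiting": [],
            }
            glocks.append(current)
            continue
        m = HOLDER_RE.match(line)
        if m and current is not None:
            flags = m.group(2)
            holder = (int(m.group(4)), m.group(5))  # (pid, command)
            if "H" in flags:
                current["granted"].append(holder)
            elif "W" in flags:
                current["waiting"].append(holder)
    return [g for g in glocks if g["waiting"]]
```

A glock reported with entries in "waiting" but none in "granted" has no local holder. That alone does not prove the lock is orphaned, since the holder may be on the other node.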
So it does not look like a 'global' filesystem hang; it seems to be
a local problem on the 2nd node. Most of the GFS2 filesystem remains
accessible from the 2nd node as well, except the directory with the lock.
Needless to say, this makes the application unavailable.
We are unable to reproduce the problem.
1st occurrence. After collecting information, we rebooted the 2nd node and after
the reboot it joined the 1st node in the cluster without any problem.
2nd occurrence. This happened 2 days later in the same way on the same node. After
collecting information, we now also ran gfs2_fsck on the GFS2 filesystem
before letting it join the cluster. No errors, orphans, or corruption were reported.
After the fsck we started the cluster software on the 2nd node and the 2nd
node joined the cluster without any problem.
Additional information (gfs2_lockdump, gfs2_hangalyzer, sysrq-t info, etc.) was
collected in a tarball (enov_additional_info.tar).
Additional information in additional_info.tar
- enov_clusterinfo_app2.txt.gz, containing:
  - gfs2_hangalyzer output from 2nd node
  - cman_tool <version, status, services, -af nodes>
  - group_tool <-v, dump, dump fence, dump gfs2>
  - ccs_tool <lsnode, lsfence>
  - openais-cfgtool -s
  - clustat -fl
  - Process status information of all processes
  - gfs2_tool gettune /gfsdata
- enov_debugfs_dlm_app2.tar.gz, containing a compressed tarball of the dlm
  directory from the debugfs filesystem on the 2nd node
2-node cluster running CentOS 5.7, with Red Hat Cluster Suite and GFS2.
Latest updates for OS and RHCS/GFS2 (as of Jan 8, 2012) are installed.
Kernel version 2.6.18-274.12.1.el5PAE.
One GFS2 filesystem (20G) on HP/LeftHand Networks iSCSI SAN volume.
iSCSI initiator version 220.127.116.112-10.el5.
Thanking you in advance for your cooperation.
If you need additional information to help solve this problem, please let me know.
With kind regards,
Sr. Engineer at E.Novation