
[Linux-cluster] (no subject)

RHCS/GFS2 support team,


I would like to inform you about a serious GFS2 problem we encountered last week.
Please find a detailed description below. I have enclosed a tarfile containing
detailed information about this problem.



        The two-node cluster is used as a test cluster without any load.
        Only functionality is tested, no performance tests. The RHCS services
        that run on this cluster are rather standard services.
        In a 2-day timeframe we had two occurrences of this problem, which were
        both very similar.

        On the 2nd node, a Perl script tried to write some info to a file on
        the GFS2 filesystem, but the process hung at that point. From the GFS2
        lockdump info we saw one W-lock associated with an inode, and it turned
        out that the inode was a directory on GFS2. Every command executed on
        that file (e.g. ls -l) or on that directory (e.g. du <dirname>)
        resulted in a hang of the process. The hung processes were all in
        D-state (uninterruptible sleep).
        However, from the 1st node all files and directories were accessible
        without any problem. Even ls -lR executed on the 1st node from the top
        of the GFS2 filesystem traversed the full directory tree without
        problems. We suspect that the offending directory holds a W-lock for
        which there is no lock owner anymore.
        So it does not look like a 'global' filesystem hang, but rather a
        local problem on the 2nd node: the major part of the GFS2 filesystem
        is still accessible from the 2nd node, except the directory with the
        lock. Needless to say, this makes the application unavailable.
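        For reference, the way we spotted the hung processes and the glock can
        be sketched with the commands below (a sketch only: the filesystem name
        "mycluster:gfsdata" and the debugfs mount step are examples, not our
        actual configuration):

```shell
# List processes stuck in uninterruptible sleep (D state), together
# with the kernel function they are waiting in (WCHAN column).
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# Dump the GFS2 glocks through debugfs (mount it first if needed).
# "mycluster:gfsdata" is an example <clustername>:<fsname> pair;
# substitute the name of the affected filesystem.
mount -t debugfs none /sys/kernel/debug 2>/dev/null
cat /sys/kernel/debug/gfs2/mycluster:gfsdata/glocks
```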


        We are unable to reproduce the problem.


        1st occurrence: after collecting information, we rebooted the 2nd
        node, and after the reboot it joined the 1st node in the cluster
        without any problem.

        2nd occurrence: this happened 2 days later in the same way on the same
        node. After collecting information, we now also ran gfs2_fsck on the
        GFS2 filesystem before letting the node join the cluster. No errors,
        orphans, or corruption were reported.
        After the fsck we started the cluster software on the 2nd node and it
        joined the cluster without any problem.
        Additional information (gfs2_lockdump, gfs2_hangalyzer, sysrq-t info,
        etc.) was collected in a tarball (enov_additional_info.tar).


Additional information in enov_additional_info.tar:

- enov_clusterinfo_app2.txt.gz containing:
                        - /etc/cluster.conf
                        - gfs2_hangalyzer output from the 2nd node
                        - cman_tool <version, status, services, -af nodes>
                        - group_tool <-v, dump, dump fence, dump gfs2>
                        - ccs_tool <lsnode, lsfence>
                        - openais-cfgtool -s
                        - clustat -fl
                        - process status information of all processes
                        - gfs2_tool gettune /gfsdata
- enov_sysrq-t_app2.txt.gz
- enov_glocks_app2.txt.gz
- enov_debugfs_dlm_app2.tar.gz: compressed tarball of the dlm directory from
  the debugfs filesystem on the 2nd node
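The collection steps above can be sketched as a small shell script (a sketch
only: the /tmp path and output file names are examples, and any tool missing
on a node simply leaves its error message in the corresponding file):

```shell
#!/bin/sh
# Sketch of the diagnostic collection; names and paths are examples.
OUT=/tmp/enov_diag.$$
mkdir -p "$OUT"

# Cluster state. Each command's stderr is captured alongside stdout,
# so an unavailable tool just records an error in its file.
cman_tool status    > "$OUT/cman_status.txt"   2>&1
cman_tool services  > "$OUT/cman_services.txt" 2>&1
group_tool -v       > "$OUT/group_tool.txt"    2>&1
clustat -fl         > "$OUT/clustat.txt"       2>&1

# Process states, including any D-state (uninterruptible) hangs.
ps -eo pid,stat,wchan:32,cmd > "$OUT/ps.txt" 2>&1

# GFS2 glock dumps via debugfs, if mounted (glob path is an example).
cp /sys/kernel/debug/gfs2/*/glocks "$OUT/" 2>/dev/null

# Bundle everything up for the support case.
tar -cf "$OUT.tar" -C "${OUT%/*}" "${OUT##*/}"
echo "collected in $OUT.tar"
```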


        2-node cluster running CentOS 5.7, with Red Hat Cluster Suite and GFS2.
        Latest updates for OS and RHCS/GFS2 (as of Jan 8, 2012) are installed.
        Kernel version 2.6.18-274.12.1.el5PAE.
        One GFS2 filesystem (20G) on an HP/LeftHand Networks iSCSI SAN volume.
        iSCSI initiator version


Thanking you in advance for your cooperation.
If you need additional information to help solve this problem, please let me know.


With kind regards,
G. Wieberdink
Sr. Engineer at E.Novation

gert wieberdink enovation nl


Attachment: enov_additional_info.tar
Description: enov_additional_info.tar
