
[Linux-cluster] GFS + DRBD Problems



Hi,

I appear to be experiencing a strange compound problem that is proving rather difficult to troubleshoot, so I'm hoping someone here can spot something I haven't.

I have a 2-node cluster with Open Shared Root on GFS on DRBD. A single node mounts GFS OK and works, but after a while it seems to just block for disk, very much as if it had started trying to fence the other node and is waiting for acknowledgement. There are no fence devices defined (so this could be a possibility), but the other node was never powered up in the first place, so it is somewhat beyond me why it might suddenly decide to try to fence it. This usually happens after a period of idleness. If the node is in use, this doesn't seem to happen, but leaving it alone for half an hour causes it to block for disk I/O.

Unfortunately, it doesn't end there. When an attempt is made to dual-mount the GFS file system before the secondary is fully up to date (but is connected and syncing), the 2nd node to join notices an inconsistency and withdraws from the cluster. In the process, GFS gets corrupted, and the only way to get it to mount again on either node is to repair it with fsck.
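For what it's worth, dual-primary GFS on DRBD needs allow-two-primaries in the net section, and DRBD 8.x also lets you pin down what happens on a detected split-brain. A minimal sketch of the relevant fragment, assuming DRBD 8.x syntax and a hypothetical resource name "r1" (the split-brain policies shown are just one conservative choice, not a recommendation for this particular setup):

```
resource r1 {
  net {
    allow-two-primaries;               # required for dual-primary / GFS
    after-sb-0pri discard-zero-changes; # split-brain, neither was primary
    after-sb-1pri discard-secondary;    # split-brain, one was primary
    after-sb-2pri disconnect;           # split-brain, both were primary
  }
}
```

If the inconsistency the second node sees is really a half-synced disk being promoted, policies alone won't fix it, but they at least make DRBD's reaction deterministic.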

I'm not sure if this is a problem with my cluster setup or not, but I cannot see why the nodes would fail to find each other and get DLM working. Console logs seem to indicate that everything is in fact OK, and the nodes are connected directly via a cross-over cable.

If the nodes are in sync by the time GFS tries to mount, the mount succeeds, but everything grinds to a halt shortly afterwards - so much so that the only way to get things moving again is to hard-reset one of the nodes, preferably the 2nd one to join.
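To avoid mounting before the sync completes, the disk states can be read out of /proc/drbd (the ds: field) and the second mount delayed until both sides report UpToDate. A minimal sketch in Python, assuming the DRBD 8.x /proc/drbd format; the function names are my own:

```python
import re

def drbd_disk_states(proc_drbd_text):
    """Extract the local/peer disk states (the ds: field) per DRBD
    minor number from DRBD 8.x /proc/drbd output."""
    states = {}
    for line in proc_drbd_text.splitlines():
        m = re.match(r"\s*(\d+):.*\bds:(\w+)/(\w+)", line)
        if m:
            states[int(m.group(1))] = (m.group(2), m.group(3))
    return states

def safe_to_dual_mount(states, minor):
    """Only mount GFS on the second node once both sides are UpToDate."""
    return states.get(minor) == ("UpToDate", "UpToDate")

# Example /proc/drbd snippet while the secondary is still syncing:
sample = """\
version: 8.0.6 (api:86/proto:86)
 1: cs:SyncTarget st:Secondary/Primary ds:Inconsistent/UpToDate C r---
"""
print(safe_to_dual_mount(drbd_disk_states(sample), 1))  # False while syncing
```

In a boot script the same check is just a loop over `drbdadm dstate <resource>` until it prints UpToDate/UpToDate, but the parsing above shows which fields matter.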

Here is where the second thing that seems wrong happens - the first node doesn't just lock up at this point, as one might expect (when a connected node disappears, e.g. due to a hard reset, the cluster is supposed to try to fence it until it cleanly rejoins - and it can't possibly fence the other node, since I haven't configured any fencing devices yet). That doesn't seem to happen: the first node carries on as if nothing had happened. This is possibly connected to the fact that by this point GFS is corrupted and has to be fsck-ed at the next boot. This part may be a cluster setup issue, so I'll raise it on the cluster list, although it seems to be a DRBD-specific peculiarity - a nearly identical cluster.conf on a SAN (the only difference being the block device specification) doesn't have this issue.

The cluster.conf is as follows:
<?xml version="1.0"?>
<cluster config_version="18" name="sentinel">
        <cman two_node="1" expected_votes="1"/>
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="sentinel1c" nodeid="1" votes="1">
                        <com_info>
                                <rootsource name="drbd"/>
                                <!--<chrootenv  mountpoint      = "/var/comoonics/chroot"
                                                fstype          = "ext3"
                                                device          = "/dev/sda2"
                                                chrootdir       = "/var/comoonics/chroot"
                                />-->
                                <syslog name="localhost"/>
                                <rootvolume     name            = "/dev/drbd1"
                                                mountopts       = "noatime,nodiratime,noquota"
                                />
                                <eth    name    = "eth0"
                                        ip      = "10.0.0.1"
                                        mac     = "00:0B:DB:92:C5:E1"
                                        mask    = "255.255.255.0"
                                        gateway = ""
                                />
                                <fenceackserver user    = "root"
                                                passwd  = "secret"
                                />
                        </com_info>
                        <fence>
                                <method name="1"/>
                        </fence>
                </clusternode>
                <clusternode name="sentinel2c" nodeid="2" votes="1">
                        <com_info>
                                <rootsource name="drbd"/>
                                <!--<chrootenv  mountpoint      = "/var/comoonics/chroot"
                                                fstype          = "ext3"
                                                device          = "/dev/sda2"
                                                chrootdir       = "/var/comoonics/chroot"
                                />-->
                                <syslog name="localhost"/>
                                <rootvolume     name            = "/dev/drbd1"
                                                mountopts       = "noatime,nodiratime,noquota"
                                />
                                <eth    name    = "eth0"
                                        ip      = "10.0.0.2"
                                        mac     = "00:0B:DB:90:4E:1B"
                                        mask    = "255.255.255.0"
                                        gateway = ""
                                />
                                <fenceackserver user    = "root"
                                                passwd  = "secret"
                                />
                        </com_info>
                        <fence>
                                <method name="1"/>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices/>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>
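As an aside, since both <fence> methods above are empty and <fencedevices/> is bare, one way to rule fencing in or out as the cause of the blocking is to wire in manual fencing explicitly. A sketch of the fragments involved, assuming the stock fence_manual agent (device name "manual" is my own choice); the node then blocks visibly in fenced until fence_ack_manual is run, rather than hanging silently:

```
<fencedevices>
        <fencedevice agent="fence_manual" name="manual"/>
</fencedevices>

<!-- and per clusternode, e.g. for sentinel1c: -->
<fence>
        <method name="1">
                <device name="manual" nodename="sentinel1c"/>
        </method>
</fence>
```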

Getting to the logs can be a bit difficult with OSR (they get reset on reboot, and it's rather hard to get at them without rebooting once the node stops responding), so I don't have those at the moment.

Any suggestions would be welcome at this point.

TIA.

Gordan

