
Re: [Linux-cluster] GFS + DRBD Problems



Gordan Bobic wrote:
As I thought, the problem I'm seeing is indeed multi-part. The first part is now resolved: the system clock was out of date until ntpd synced it up, and the resulting large time jumps seem to have made dlm choke.

Now for part 2:

The two nodes connect, certainly well enough to sync up DRBD. That stage goes through fine. They start cman and the other cluster components, but it would appear they never actually find each other.

When mounting the shared file system:

Node 1:
GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=0: Looking at journal...
GFS: fsid=sentinel:root.0: jid=0: Acquiring the transaction lock...
GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
GFS: fsid=sentinel:root.0: jid=0: Replayed 54 of 197 blocks
GFS: fsid=sentinel:root.0: jid=0: replays = 54, skips = 36, sames = 107
GFS: fsid=sentinel:root.0: jid=0: Journal replayed in 1s
GFS: fsid=sentinel:root.0: jid=0: Done
GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
GFS: fsid=sentinel:root.0: jid=1: Done
GFS: fsid=sentinel:root.0: Scanning for log elements...
GFS: fsid=sentinel:root.0: Found 0 unlinked inodes
GFS: fsid=sentinel:root.0: Found quota changes for 7 IDs
GFS: fsid=sentinel:root.0: Done


Node 2:
GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=0: Looking at journal...
GFS: fsid=sentinel:root.0: jid=0: Acquiring the transaction lock...
GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
GFS: fsid=sentinel:root.0: jid=0: Replayed 6 of 6 blocks
GFS: fsid=sentinel:root.0: jid=0: replays = 6, skips = 0, sames = 0
GFS: fsid=sentinel:root.0: jid=0: Journal replayed in 1s
GFS: fsid=sentinel:root.0: jid=0: Done
GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
GFS: fsid=sentinel:root.0: jid=1: Done
GFS: fsid=sentinel:root.0: Scanning for log elements...
GFS: fsid=sentinel:root.0: Found 0 unlinked inodes
GFS: fsid=sentinel:root.0: Found quota changes for 2 IDs
GFS: fsid=sentinel:root.0: Done

Unless I'm reading this wrong, they are both trying to use JID 0.
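
To check this reading mechanically rather than by eye, here is a minimal Python sketch. It is only an illustration: the function name and the trimmed log excerpts are made up for the example, and it assumes the log lines look exactly like the ones quoted above. It collects the journal IDs for which each node logged a replay and prints the ones both nodes replayed.

```python
import re

# Matches e.g. "GFS: fsid=sentinel:root.0: jid=0: Replaying journal..."
REPLAY_RE = re.compile(r"GFS: fsid=\S+: jid=(\d+): Replaying journal")


def replayed_journals(log_text):
    """Return the set of journal IDs for which a replay was logged."""
    return {int(m.group(1)) for m in REPLAY_RE.finditer(log_text)}


# Trimmed copies of the two mount logs quoted above.
NODE1_LOG = """\
GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
"""
NODE2_LOG = """\
GFS: fsid=sentinel:root.0: jid=0: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=0: Replaying journal...
GFS: fsid=sentinel:root.0: jid=1: Trying to acquire journal lock...
GFS: fsid=sentinel:root.0: jid=1: Looking at journal...
"""

# Journals for which both nodes logged a replay.
both = replayed_journals(NODE1_LOG) & replayed_journals(NODE2_LOG)
print(sorted(both))
```

With the excerpts above, journal 0 shows up on both nodes, which matches the reading that they both go for JID 0.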

The second node to join generally chokes at some point during the boot, but AFTER it mounted the GFS volume. On the booted node, cman_tool status says:

# cman_tool status
Version: 6.0.1
Config Version: 20
Cluster Name: sentinel
Cluster Id: 28150
Cluster Member: Yes
Cluster Generation: 4
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Quorum: 1
Active subsystems: 6
Flags: 2node
Ports Bound: 0
Node name: sentinel1c
Node ID: 1
Multicast addresses: 239.192.109.100
Node addresses: 10.0.0.1

So the second node never joined.
I know for a fact that the network connection between them is working, as they sync DRBD.
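
For repeated checks, the status output can be compared against the number of configured nodes with a few lines of Python. This is a rough sketch under assumptions: the helper name is hypothetical, the parsing assumes the plain "Key: value" layout shown above, and the expected node count of 2 is simply taken from the two clusternode entries in cluster.conf.

```python
def parse_cman_status(text):
    """Turn 'Key: value' lines from cman_tool status into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


# Trimmed copy of the cman_tool status output quoted above.
STATUS = """\
Cluster Name: sentinel
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected votes: 1
Total votes: 1
Quorum: 1
Flags: 2node
Node name: sentinel1c
"""

CONFIGURED_NODES = 2  # assumption: two <clusternode> entries in cluster.conf

status = parse_cman_status(STATUS)
missing = CONFIGURED_NODES - int(status["Nodes"])
print(f"{status['Node name']}: {missing} configured node(s) not joined")
```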

cluster.conf is here:

<?xml version="1.0"?>
<cluster config_version="20" name="sentinel">
        <cman two_node="1" expected_votes="1"/>
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="sentinel1c" nodeid="1" votes="1">
                        <com_info>
                                <rootsource name="drbd"/>
                                <!--<chrootenv  mountpoint      = "/var/comoonics/chroot"
                                                fstype          = "ext3"
                                                device          = "/dev/sda2"
                                                chrootdir       = "/var/comoonics/chroot"
                                />-->
                                <syslog name="localhost"/>
                                <rootvolume     name            = "/dev/drbd1"
                                                mountopts       = "defaults,noatime,nodiratime,noquota"
                                />
                                <eth    name    = "eth0"
                                        ip      = "10.0.0.1"
                                        mac     = "00:0B:DB:92:C5:E1"
                                        mask    = "255.255.255.0"
                                        gateway = ""
                                />
                                <fenceackserver user    = "root"
                                                passwd  = "password"
                                />
                        </com_info>
                        <fence>
                                <method name = "1">
                                        <device name = "sentinel1d"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="sentinel2c" nodeid="2" votes="1">
                        <com_info>
                                <rootsource name="drbd"/>
                                <!--<chrootenv  mountpoint      = "/var/comoonics/chroot"
                                                fstype          = "ext3"
                                                device          = "/dev/sda2"
                                                chrootdir       = "/var/comoonics/chroot"
                                />-->
                                <syslog name="localhost"/>
                                <rootvolume     name            = "/dev/drbd1"
                                                mountopts       = "defaults,noatime,nodiratime,noquota"
                                />
                                <eth    name    = "eth0"
                                        ip      = "10.0.0.2"
                                        mac     = "00:0B:DB:90:4E:1B"
                                        mask    = "255.255.255.0"
                                        gateway = ""
                                />
                                <fenceackserver user    = "root"
                                                passwd  = "password"
                                />
                        </com_info>
                        <fence>
                                <method name = "1">
                                        <device name = "sentinel2d"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_drac" ipaddr="192.168.254.252" login="root" name="sentinel1d" passwd="password"/>
                <fencedevice agent="fence_drac" ipaddr="192.168.254.253" login="root" name="sentinel2d" passwd="password"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>
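
One thing that can be ruled out mechanically is a mismatch between the fence device names referenced by the nodes and the ones defined under fencedevices. The following Python sketch is only an illustration: the function name is made up, and the embedded config is a trimmed copy of the one above with the com_info sections, logins and passwords left out. It lists any (node, device) pair whose device is not defined.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf above (com_info and credentials left out).
CONF = """\
<cluster config_version="20" name="sentinel">
  <clusternodes>
    <clusternode name="sentinel1c" nodeid="1" votes="1">
      <fence><method name="1"><device name="sentinel1d"/></method></fence>
    </clusternode>
    <clusternode name="sentinel2c" nodeid="2" votes="1">
      <fence><method name="1"><device name="sentinel2d"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_drac" ipaddr="192.168.254.252" name="sentinel1d"/>
    <fencedevice agent="fence_drac" ipaddr="192.168.254.253" name="sentinel2d"/>
  </fencedevices>
</cluster>
"""


def undefined_fence_devices(conf_xml):
    """Return (node, device) pairs whose device has no <fencedevice> entry."""
    root = ET.fromstring(conf_xml)
    defined = {d.get("name") for d in root.iter("fencedevice")}
    missing = []
    for node in root.iter("clusternode"):
        for dev in node.iter("device"):
            if dev.get("name") not in defined:
                missing.append((node.get("name"), dev.get("name")))
    return missing


print(undefined_fence_devices(CONF))
```

For this config the list comes back empty, so the device names at least line up. It says nothing about whether the fence agent itself works against the hardware.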

What could be causing the nodes not to join the cluster?

A bit of additional information. When both nodes come up at the same time, they actually sort out the journals between them correctly. One gets 0, the other 1.

But almost immediately afterwards, this happens on the 2nd node:
dlm: closing connection to node 1
dlm: connect from non cluster node

shortly followed by DRBD keeling over:

drbd1: Handshake successful: DRBD Network Protocol version 86
drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
drbd1: conn( WFConnection -> WFReportParams )
drbd1: Discard younger/older primary did not found a decision
Using discard-least-changes instead
drbd1: State change failed: Device is held open by someone
drbd1: state = { cs:WFReportParams st:Primary/Unknown ds:UpToDate/DUnknown r--- }
drbd1: wanted = { cs:WFReportParams st:Secondary/Unknown ds:UpToDate/DUnknown r--- }
drbd1: helper command: /sbin/drbdadm pri-lost-after-sb
drbd1: Split-Brain detected, dropping connection!
drbd1: self 866625728B4E10B9:E4C3366683AFBC6B:ED24F75CC7B3F4A5:EFFAB6EF6A3CC469
drbd1: peer 572F799325FDF21D:E4C3366683AFBC6B:ED24F75CC7B3F4A4:EFFAB6EF6A3CC469
drbd1: conn( WFReportParams -> Disconnecting )
drbd1: helper command: /sbin/drbdadm split-brain
drbd1: error receiving ReportState, l: 4!
drbd1: asender terminated
drbd1: tl_clear()
drbd1: Connection closed
drbd1: conn( Disconnecting -> StandAlone )
drbd1: receiver terminated
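
The self and peer identifier strings in the split-brain report are easier to compare field by field than by eye. A small Python sketch for that follows. It is an illustration only: the function name is made up, and the meaning of each field is not assumed, so the indices are purely positional.

```python
# Identifier strings copied from the "self" and "peer" log lines above.
SELF = "866625728B4E10B9:E4C3366683AFBC6B:ED24F75CC7B3F4A5:EFFAB6EF6A3CC469"
PEER = "572F799325FDF21D:E4C3366683AFBC6B:ED24F75CC7B3F4A4:EFFAB6EF6A3CC469"


def differing_fields(self_ids, peer_ids):
    """Return the positions (0-based) at which the two ID lists differ."""
    pairs = zip(self_ids.split(":"), peer_ids.split(":"))
    return [i for i, (a, b) in enumerate(pairs) if a != b]


print(differing_fields(SELF, PEER))  # prints [0, 2]
```

So the first field differs completely, the third differs only in its last hex digit, and the second and fourth are identical on both nodes.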

At this point the 1st node seems to lock up, but despite fencing being set up, the 2nd node doesn't get powered down. The fencing device is a DRAC III ERA/O. Rebooting the 2nd node puts it back to trying to use JID 0, which is already used by the 1st node, and things go wrong again.

I'm sure I must be missing something obvious here, but for the life of me I cannot see what.

Gordan

