
[Linux-cluster] DRBD8 and GFS issues



Hello guys,

I'm trying to run a two-node cluster using DRBD 8.x and GFS 1.x
on RHEL 5.2 x86_64.

The problem is: when one machine goes down (node2), node1 stops working
(even a simple 'ls -l' on the shared mount point does not work) until the
second machine returns.

I set up GFS this way:

# gfs_mkfs -t hotsite:gfs-00 -p lock_dlm -j 2 /dev/drbd0
# mount -v /dev/drbd0 /test

I cause a failure on the second node this way:
# echo 1 > /proc/sys/kernel/sysrq
# echo b > /proc/sysrq-trigger

==============================================================================
$ cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="hotsite" config_version="4">

<cman two_node="1" expected_votes="1"/>

<fence_daemon post_join_delay="60">
</fence_daemon>

<clusternodes>
<clusternode name="drdb_hotsite-1" nodeid="1">
        <fence>
                <method name="single">
                        <device name="gnbd" ipaddr="192.168.0.3"/>
                </method>
        </fence>
</clusternode>
<clusternode name="drdb_hotsite-2" nodeid="2">
        <fence>
                <method name="single">
                        <device name="gnbd" ipaddr="192.168.0.3"/>
                </method>
        </fence>
</clusternode>
</clusternodes>

<fencedevices>
        <fencedevice name="manual" agent="fence_manual"/>
</fencedevices>
</cluster>
==============================================================================
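
One detail in the config above that can be checked mechanically: both
clusternode entries reference a fence device named "gnbd", while the
fencedevices section only defines one named "manual". The Python sketch
below lists such unresolved references. The function name
undefined_fence_devices and the trimmed SAMPLE string are illustrative
assumptions, not part of any cluster tool, and whether this mismatch is
what makes fenced report failures is only a guess.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf shown above (same names and attributes).
SAMPLE = """<cluster name="hotsite" config_version="4">
<clusternodes>
<clusternode name="drdb_hotsite-1" nodeid="1">
  <fence><method name="single"><device name="gnbd" ipaddr="192.168.0.3"/></method></fence>
</clusternode>
<clusternode name="drdb_hotsite-2" nodeid="2">
  <fence><method name="single"><device name="gnbd" ipaddr="192.168.0.3"/></method></fence>
</clusternode>
</clusternodes>
<fencedevices>
  <fencedevice name="manual" agent="fence_manual"/>
</fencedevices>
</cluster>"""


def undefined_fence_devices(conf_xml):
    """Return {node name: [fence device names used but not defined]}."""
    root = ET.fromstring(conf_xml)
    # Names declared under <fencedevices>.
    defined = {fd.get("name") for fd in root.findall("./fencedevices/fencedevice")}
    missing = {}
    for node in root.findall("./clusternodes/clusternode"):
        # Names referenced by this node's fence methods.
        used = [dev.get("name") for dev in node.findall("./fence/method/device")]
        unresolved = [name for name in used if name not in defined]
        if unresolved:
            missing[node.get("name")] = unresolved
    return missing


if __name__ == "__main__":
    for node, names in undefined_fence_devices(SAMPLE).items():
        print(node, "references undefined fence device(s):", ", ".join(names))
```

To run it against the real file, read /etc/cluster/cluster.conf in binary
mode and pass the bytes to the same function.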

Here are the logs from node1:

Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: PingAck did not arrive in time.
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 ) 
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: asender terminated
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Terminating asender thread
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: short read expecting header on sock: r=-512
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Creating new current UUID
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block now.
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Connection closed
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: helper command: /sbin/drbdadm outdate-peer
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: outdate-peer helper broken, returned 0
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0:  old = { cs:NetworkFailure st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0:  new = { cs:Unconnected st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: conn( NetworkFailure -> Unconnected ) 
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: receiver terminated
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: receiver (re)started
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0:  old = { cs:Unconnected st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0:  new = { cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 19:59:08 hotsite-bsb-la-1 kernel: drbd0: conn( Unconnected -> WFConnection ) 
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] The token was lost in the OPERATIONAL state. 
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes). 
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). 
Jun 11 19:59:08 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering GATHER state from 2. 
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: drdb_hotsite-2 not a cluster member after 0 sec post_fail_delay
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering GATHER state from 0. 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Creating commit token because I am the rep. 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Saving state aru 31 high seq received 31 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Storing new sequence id for ring 168 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering COMMIT state. 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering RECOVERY state. 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] position [0] member 192.168.0.3: 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] previous ring seq 356 rep 192.168.0.3 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] aru 31 high delivered 31 received flag 1 
Jun 11 19:59:12 hotsite-bsb-la-1 kernel: dlm: closing connection to node 2
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Did not need to originate any messages in recovery. 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] Sending initial ORF token 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM  ] CLM CONFIGURATION CHANGE 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM  ] New Configuration: 
Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM  ] 	r(0) ip(192.168.0.3)  
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM  ] Members Left: 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM  ] 	r(0) ip(192.168.0.4)  
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM  ] Members Joined: 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM  ] CLM CONFIGURATION CHANGE 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM  ] New Configuration: 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM  ] 	r(0) ip(192.168.0.3)  
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM  ] Members Left: 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM  ] Members Joined: 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [SYNC ] This node is within the primary component and will provide service. 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering OPERATIONAL state. 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CLM  ] got nodejoin message 192.168.0.3 
Jun 11 19:59:12 hotsite-bsb-la-1 openais[2939]: [CPG  ] got joinlist message from node 1 
Jun 11 19:59:17 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:17 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 19:59:22 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:22 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 19:59:27 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 19:59:27 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
.....
Jun 11 20:01:32 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 20:01:37 hotsite-bsb-la-1 fenced[2956]: fencing node "drdb_hotsite-2"
Jun 11 20:01:37 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering GATHER state from 11. 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Creating commit token because I am the rep. 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Saving state aru 14 high seq received 14 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Storing new sequence id for ring 16c 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering COMMIT state. 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering RECOVERY state. 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] position [0] member 192.168.0.3: 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] previous ring seq 360 rep 192.168.0.3 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] aru 14 high delivered 14 received flag 1 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] position [1] member 192.168.0.4: 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] previous ring seq 360 rep 192.168.0.4 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] aru 9 high delivered 9 received flag 1 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Did not need to originate any messages in recovery. 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] Sending initial ORF token 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] CLM CONFIGURATION CHANGE 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] New Configuration: 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] 	r(0) ip(192.168.0.3)  
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] Members Left: 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] Members Joined: 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] CLM CONFIGURATION CHANGE 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] New Configuration: 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] 	r(0) ip(192.168.0.3)  
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] 	r(0) ip(192.168.0.4)  
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] Members Left: 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] Members Joined: 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] 	r(0) ip(192.168.0.4)  
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [SYNC ] This node is within the primary component and will provide service. 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering OPERATIONAL state. 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] got nodejoin message 192.168.0.4 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CLM  ] got nodejoin message 192.168.0.3 
Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [CPG  ] got joinlist message from node 1 
Jun 11 20:01:42 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Trying to acquire journal lock...
Jun 11 20:01:42 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Looking at journal...
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Handshake successful: Agreed network protocol version 88
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Considering state change from bad state. Error would be: 'Refusing to be Primary while peer is not outdated'
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0:  old = { cs:WFConnection st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0:  new = { cs:WFReportParams st:Primary/Unknown ds:UpToDate/DUnknown s--- }
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: conn( WFConnection -> WFReportParams ) 
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Starting asender thread (from drbd0_receiver [526])
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: data-integrity-alg: <not-used>
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Outdated ) 
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block now.
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: tl_clear()
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: susp( 1 -> 0 ) 
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: peer( Secondary -> Primary ) 
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent ) 
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Began resync as SyncSource (will sync 548864 KB [137216 bits set]).
Jun 11 20:05:04 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block now.
Jun 11 20:05:05 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Acquiring the transaction lock...
Jun 11 20:05:07 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Replaying journal...
Jun 11 20:05:07 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Replayed 0 of 1 blocks
Jun 11 20:05:07 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: replays = 0, skips = 0, sames = 1
Jun 11 20:05:10 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Journal replayed in 5s
Jun 11 20:05:10 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Done
Jun 11 20:05:20 hotsite-bsb-la-1 kernel: drbd0: Resync done (total 15 sec; paused 0 sec; 36588 K/sec)
Jun 11 20:05:20 hotsite-bsb-la-1 kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) 
Jun 11 20:05:20 hotsite-bsb-la-1 kernel: drbd0: Writing meta data super block now.
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: Trying to join cluster "lock_dlm", "hotsite:gfs-00"
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: dlm: Using TCP for communications
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: Joined cluster. Now mounting FS...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=0: Trying to acquire journal lock...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=0: Looking at journal...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=0: Done
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Trying to acquire journal lock...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Looking at journal...
Jun 11 20:07:03 hotsite-bsb-la-1 kernel: GFS: fsid=hotsite:gfs-00.0: jid=1: Done
Jun 11 20:07:25 hotsite-bsb-la-1 kernel: dlm: connecting to 2
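
To make the fencing loop in the log easier to see, a short Python sketch
like the one below can count the 'fence ... failed' lines per node and
report the first and last timestamp. The function name
summarize_fence_failures is made up for illustration, and the year is an
arbitrary placeholder needed only for parsing, because syslog lines do
not carry one.

```python
import re
from datetime import datetime

# Matches lines such as:
# Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed
FENCE_FAILED = re.compile(
    r'^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ fenced\[\d+\]: fence "([^"]+)" failed'
)


def summarize_fence_failures(lines, year=2000):
    """Return {node: (failure count, first timestamp, last timestamp)}.

    year is a placeholder (assumption): syslog omits it, and it only
    matters for turning the stamp into a datetime.
    """
    summary = {}
    for line in lines:
        match = FENCE_FAILED.match(line)
        if not match:
            continue
        stamp, node = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        count, first, _last = summary.get(node, (0, when, when))
        summary[node] = (count + 1, first, when)
    return summary


if __name__ == "__main__":
    # Three of the failure lines from the log above, plus one unrelated line.
    sample = [
        'Jun 11 19:59:12 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed',
        'Jun 11 19:59:17 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed',
        'Jun 11 20:01:37 hotsite-bsb-la-1 fenced[2956]: fence "drdb_hotsite-2" failed',
        'Jun 11 20:01:40 hotsite-bsb-la-1 openais[2939]: [TOTEM] entering OPERATIONAL state. ',
    ]
    for node, (count, first, last) in summarize_fence_failures(sample).items():
        print(node, count, "failures between", first.time(), "and", last.time())
```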

Thanks!

-- 
Tiago Cruz
http://everlinux.com
Linux User #282636


