[Linux-cluster] SAN with GFS2 on RHEL 6 beta: STONITH right after start

Andrew Beekhof andrew at beekhof.net
Wed Jul 28 08:25:09 UTC 2010


On Wed, Jul 28, 2010 at 9:58 AM, Köppel Benedikt (LET)
<benedikt.koeppel at let.ethz.ch> wrote:
> Hello
>
> I have two nodes running RHEL 6 beta 2 and configured corosync as follows.
> Both nodes have access to a SAN disk. The disk is partitioned into /dev/sdb1
> for SBD STONITH and /dev/sdb2 for data. /dev/sdb2 holds a GFS2 filesystem on an
> LVM logical volume (vg01/lv00). For the configuration, I followed the Clusters from Scratch PDF
> from clusterlabs.
>
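
(For anyone reproducing this: a layout like that is usually prepared
roughly as below. Only a sketch -- the LVM names match the
configuration further down, the "pcmknode:data1s" lock table name is
taken from the GFS2 line in the second log, and exact flags may
differ on your version.)

  pvcreate /dev/sdb2
  vgcreate vg01 /dev/sdb2
  lvcreate -n lv00 -l 100%FREE vg01
  # -p lock_dlm for cluster locking, -t <clustername>:<fsname>,
  # -j = one journal per node that will mount the filesystem
  mkfs.gfs2 -p lock_dlm -t pcmknode:data1s -j 2 /dev/vg01/lv00
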
> As soon as I start the two nodes, one of them gets immediately fenced and shut
> down.

Are these the Pacemaker packages that ship with RHEL?
Because external/sbd isn't a fencing device we ship on RHEL, I'm
having trouble imagining how your nodes are being fenced.
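
For reference, a shared sbd partition has to be initialized once and
then watched by an sbd daemon on every node; roughly (a sketch with
the standard sbd tool -- option spellings vary between sbd versions):

  # write the sbd header and node slots to the partition (run once)
  sbd -d /dev/sdb1 create
  # inspect the header and the allocated slots
  sbd -d /dev/sdb1 dump
  sbd -d /dev/sdb1 list
  # every node must run a watcher that polls its own slot
  sbd -d /dev/sdb1 -W watch

external/sbd only writes a poison message into the victim's slot; if
no sbd watcher is running on the victim, nothing ever acts on it.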

> I can see in the logs that the fenced node was trying to mount the FS when it
> got shot down. I have no clue why this happens. Can anyone give me a hint on how
> to fix my cluster?
>
>
> Configuration:
> [root@pcmknode-1 ~]# crm configure show
> node pcmknode-1
> node pcmknode-2
> primitive WebFS ocf:heartbeat:Filesystem \
>        params device="/dev/vg01/lv00" directory="/data_1" fstype="gfs2"
> primitive dlm ocf:pacemaker:controld \
>        params configdir="/config" \
>        op monitor interval="120s"
> primitive gfs-control ocf:pacemaker:controld \
>        params daemon="gfs_controld.pcmk" args="-g 0" \
>        op monitor interval="120s"
> primitive resSBD stonith:external/sbd \
>        params sbd_device="/dev/sdb1"
> clone WebFSClone WebFS
> clone dlm-clone dlm \
>        meta interleave="true" target-role="Started"
> clone gfs-clone gfs-control \
>        meta interleave="true" target-role="Started"
> location cli-prefer-WebFS WebFSClone \
>        rule $id="cli-prefer-rule-WebFS" inf: #uname eq pcmknode-1 and date lt "2010-07-27 21:53:10Z"
> colocation WebFS-with-gfs-control inf: WebFSClone gfs-clone
> colocation gfs-with-dlm inf: gfs-clone dlm-clone
> order start-WebFS-after-gfs-control inf: gfs-clone WebFSClone
> order start-gfs-after-dlm inf: dlm-clone gfs-clone
> property $id="cib-bootstrap-options" \
>        dc-version="1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe" \
>        cluster-infrastructure="openais" \
>        expected-quorum-votes="2" \
>        stonith-enabled="true" \
>        stonith-timeout="30s" \
>        no-quorum-policy="ignore"
>
>
>
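Two details in that configuration are worth a look before the logs.
The cli-prefer-WebFS constraint is a leftover from an earlier "crm
resource migrate" (its date rule has already expired), and the
controld configdir is unusual: it normally points at the configfs
mount, /sys/kernel/config, so configdir="/config" only works if
configfs is really mounted there on both nodes. Cleaning up the
stale constraint, assuming the crm shell from pacemaker 1.1:

  # drop the stale migration constraint
  crm resource unmigrate WebFSClone
  # or remove it directly
  crm configure delete cli-prefer-WebFS
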
> These are the logs:
> pcmknode-1: /var/log/messages
> ~ snip ~
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: native_color: Resource WebFS:0 cannot run anywhere
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: custom_action: Action dlm:0_stop_0 on pcmknode-2 is unrunnable (offline)
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: custom_action: Marking node pcmknode-2 unclean
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: custom_action: Action gfs-control:0_stop_0 on pcmknode-2 is unrunnable (offline)
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: custom_action: Marking node pcmknode-2 unclean
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: stage6: Scheduling Node pcmknode-2 for STONITH
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: native_stop_constraints: dlm:0_stop_0 is implicit after pcmknode-2 is fenced
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: native_stop_constraints: gfs-control:0_stop_0 is implicit after pcmknode-2 is fenced
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: find_compatible_child: Colocating gfs-control:1 with dlm:1 on pcmknode-1
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: clone_rsc_order_lh: Interleaving dlm:1 with gfs-control:1
> Jul 28 00:46:32 pcmknode-1 crmd: [2629]: info: te_fence_node: Executing reboot fencing operation (30) on pcmknode-2 (timeout=30000)
> Jul 28 00:46:32 pcmknode-1 crmd: [2629]: info: te_rsc_command: Initiating action 22: stop WebFS:1_stop_0 on pcmknode-1 (local)
> Jul 28 00:46:32 pcmknode-1 crmd: [2629]: info: do_lrm_rsc_op: Performing key=22:3:0:2419bb70-dce6-4a0e-b649-fae2b0f21b8d op=WebFS:1_stop_0 )
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: find_compatible_child: Colocating dlm:0 with gfs-control:0 on pcmknode-2
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: clone_rsc_order_lh: Interleaving gfs-control:0 with dlm:0
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: find_compatible_child: Colocating dlm:1 with gfs-control:1 on pcmknode-1
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: clone_rsc_order_lh: Interleaving gfs-control:1 with dlm:1
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: find_compatible_child: Colocating WebFS:1 with gfs-control:1 on pcmknode-1
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: clone_rsc_order_lh: Interleaving gfs-control:1 with WebFS:1
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Leave resource resSBD   (Started pcmknode-1)
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Stop resource dlm:0     (pcmknode-2)
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Leave resource dlm:1    (Started pcmknode-1)
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Stop resource gfs-control:0     (pcmknode-2)
> Jul 28 00:46:32 pcmknode-1 cib: [2900]: info: write_cib_contents: Archived previous version as /var/lib/heartbeat/crm/cib-23.raw
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Leave resource gfs-control:1    (Started pcmknode-1)
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Leave resource WebFS:0  (Stopped)
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: notice: LogActions: Restart resource WebFS:1        (Started pcmknode-1)
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: WARN: process_pe_message: Transition 3: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-4.bz2
> Jul 28 00:46:32 pcmknode-1 pengine: [2628]: info: process_pe_message: Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" to identify issues.
> Jul 28 00:46:32 pcmknode-1 cib: [2900]: info: write_cib_contents: Wrote version 0.127.0 of the CIB to disk (digest: af7f98fa70bd2ef644e8e70d6f2ceea9)
> Jul 28 00:46:32 pcmknode-1 cib: [2900]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.vx6ONO (digest: /var/lib/heartbeat/crm/cib.kGkBp3)
> Jul 28 00:46:32 pcmknode-1 Filesystem[2902]: INFO: Running stop for /dev/vg01/lv00 on /data_1
> Jul 28 00:46:32 pcmknode-1 Filesystem[2902]: INFO: Trying to unmount /data_1
> Jul 28 00:46:35 pcmknode-1 stonith-ng: [2624]: ERROR: remote_op_query_timeout: Query 8f1eeecf-4832-4430-b8e8-41a645675c58 for pcmknode-2 timed out
> Jul 28 00:46:35 pcmknode-1 stonith-ng: [2624]: ERROR: remote_op_timeout: Action reboot (8f1eeecf-4832-4430-b8e8-41a645675c58) for pcmknode-2 timed out
> Jul 28 00:46:35 pcmknode-1 stonith-ng: [2624]: info: remote_op_done: Notifing clients of 8f1eeecf-4832-4430-b8e8-41a645675c58 (reboot of pcmknode-2 from 540c61a4-d351-40c7-aa60-efd445097180 by (null)): 0, rc=-7
> Jul 28 00:46:35 pcmknode-1 stonith-ng: [2624]: info: stonith_notify_client: Sending st_fence-notification to client 2629/cc57856c-5357-4343-95a9-712771f711ae
> Jul 28 00:46:35 pcmknode-1 crmd: [2629]: info: log_data_element: tengine_stonith_callback: StonithOp <remote-op state="0" st_target="pcmknode-2" st_op="reboot" />
> Jul 28 00:46:35 pcmknode-1 crmd: [2629]: info: tengine_stonith_callback: Stonith operation 2/30:3:0:2419bb70-dce6-4a0e-b649-fae2b0f21b8d: Operation timed out (-7)
> Jul 28 00:46:35 pcmknode-1 crmd: [2629]: ERROR: tengine_stonith_callback: Stonith of pcmknode-2 failed (-7)... aborting transition.
> Jul 28 00:46:35 pcmknode-1 crmd: [2629]: info: abort_transition_graph: tengine_stonith_callback:402 - Triggered transition abort (complete=0) : Stonith failed
> Jul 28 00:46:35 pcmknode-1 crmd: [2629]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
> Jul 28 00:46:35 pcmknode-1 crmd: [2629]: info: update_abort_priority: Abort action done superceeded by restart
> Jul 28 00:46:35 pcmknode-1 crmd: [2629]: info: tengine_stonith_notify: Peer pcmknode-2 was terminated (reboot) by (null) for pcmknode-1 (ref=8f1eeecf-4832-4430-b8e8-41a645675c58): Operation timed out
> ~ snip ~
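
The two stonith-ng ERRORs above are the interesting part: the query
for a device capable of fencing pcmknode-2 timed out, i.e. stonith-ng
found no usable fencing device within the 30s stonith-timeout (rc=-7
is "operation timed out"). You can poke at that directly with
stonith_admin (a sketch; option names may differ between 1.1.x
builds):

  # which devices does stonith-ng think can fence pcmknode-2?
  stonith_admin --list pcmknode-2
  # exercise the whole fencing path by hand
  stonith_admin --reboot pcmknode-2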
>
>
>
> pcmknode-2: /var/log/messages
> ~ snip ~
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Received ringid(192.168.1.186:620) seq 91
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering 90 to 91
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering MCAST message with seq 91 to pending delivery queue
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Received ringid(192.168.1.186:620) seq 92
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering 91 to 92
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering MCAST message with seq 92 to pending delivery queue
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] mcasted message added to pending queue
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering 92 to 93
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering MCAST message with seq 93 to pending delivery queue
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [CPG   ] got procjoin message from cluster node -1147763583
> Jul 28 00:46:29 pcmknode-2 cib: [2542]: debug: cib_process_xpath: cib_query: //nvpar[@name='terminate'] does not exist
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Received ringid(192.168.1.186:620) seq 93
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] releasing messages up to and including 92
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [CPG   ] got mcast request on 0x1b072a0
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Received ringid(192.168.1.186:620) seq 94
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering 93 to 94
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering MCAST message with seq 94 to pending delivery queue
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] mcasted message added to pending queue
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] releasing messages up to and including 93
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering 94 to 95
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Delivering MCAST message with seq 95 to pending delivery queue
> Jul 28 00:46:29 pcmknode-2 kernel: : dlm: got connection from -1164540799
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] Received ringid(192.168.1.186:620) seq 95
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] releasing messages up to and including 94
> Jul 28 00:46:29 pcmknode-2 corosync[2535]:   [TOTEM ] releasing messages up to and including 95
> Jul 28 00:46:29 pcmknode-2 kernel: GFS2: fsid=pcmknode:data1s.0: Joined cluster. Now mounting FS...
> -- that was the last message in the log.
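
So pcmknode-2 died in the middle of mounting the GFS2 filesystem --
"Now mounting FS..." is the last thing it wrote. One way to find out
whether the sbd path itself works, assuming an sbd daemon really is
running on the target, is to deliver a harmless test message by hand:

  # from pcmknode-1: write a test message into pcmknode-2's slot
  sbd -d /dev/sdb1 message pcmknode-2 test
  # pcmknode-2's sbd daemon should log that it received it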
>
>
>
>
> So, how can I fix my cluster? What exactly is the problem?
>
>
> Thanks,
> Benedikt
>



