[Linux-cluster] Re: Fencing test

Thu Jan 8 18:39:10 UTC 2009

On Mon, Jan 5, 2009 at 12:11 PM, Paras pradhan <pradhanparas at gmail.com> wrote:
> hi,
>
> On Mon, Jan 5, 2009 at 8:23 AM, Rajagopal Swaminathan
> <raju.rajsand at gmail.com> wrote:
>> Greetings,
>>
>> On Sat, Jan 3, 2009 at 4:18 AM, Paras pradhan <pradhanparas at gmail.com> wrote:
>>>
>>> Here I am using 4 nodes.
>>>
>>> Node 1) That runs luci
>>> Node 2) This is my iSCSI shared storage where my virtual machine(s) reside
>>> Node 3) First node in my two node cluster
>>> Node 4) Second node in my two node cluster
>>>
>>> All of them are connected simply to an unmanaged 16 port switch.
>>
>> Luci does not need a separate node to run; it can run on one of the
>> member nodes (node 3 | 4).
>
> OK.
>
>>
>> what does clustat say?
>
> Here is my clustat o/p:
>
> -----------
>
> [root@ha1lx ~]# clustat
> Cluster Status for ipmicluster @ Mon Jan  5 12:00:10 2009
> Member Status: Quorate
>
>  Member Name                  ID   Status
>  ------ ----                  ---- ------
>  10.42.21.29                     1 Online, rgmanager
>  10.42.21.27                     2 Online, Local, rgmanager
>
>  Service Name        Owner (Last)        State
>  ------- ----        ----- ------        -----
>  vm:linux64          10.42.21.27         started
> [root@ha1lx ~]#
> ------------------------
>
>
> 10.42.21.27 is node3 and 10.42.21.29 is node4
>
>
>
>>
>> Can you post your cluster.conf here?
>
> Here is my cluster.conf
>
> --
> [root@ha1lx cluster]# more cluster.conf
> <?xml version="1.0"?>
> <cluster alias="ipmicluster" config_version="8" name="ipmicluster">
>        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>        <clusternodes>
>                <clusternode name="10.42.21.29" nodeid="1" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="fence2"/>
>                                </method>
>                        </fence>
>                </clusternode>
>                <clusternode name="10.42.21.27" nodeid="2" votes="1">
>                        <fence>
>                                <method name="1">
>                                        <device name="fence1"/>
>                                </method>
>                        </fence>
>                </clusternode>
>        </clusternodes>
>        <cman expected_votes="1" two_node="1"/>
>        <fencedevices>
>                <fencedevice agent="fence_ipmilan" ipaddr="10.42.21.28"
> login="admin" name="fence1" passwd="admin"/>
>                <fencedevice agent="fence_ipmilan" ipaddr="10.42.21.30"
> login="admin" name="fence2" passwd="admin"/>
>        </fencedevices>
>        <rm>
>                <failoverdomains>
>                        <failoverdomain name="myfd" nofailback="0" ordered="1" restricted="0">
>                                <failoverdomainnode name="10.42.21.29" priority="2"/>
>                                <failoverdomainnode name="10.42.21.27" priority="1"/>
>                        </failoverdomain>
>                </failoverdomains>
>                <resources/>
>                <vm autostart="1" domain="myfd" exclusive="0" migrate="live"
> name="linux64" path="/guest_roots" recovery="restart"/>
>        </rm>
> </cluster>
> ------
>
>
> Here:
>
> 10.42.21.28 is IPMI interface in node3
> 10.42.21.30 is IPMI interface in node4
>
>>
>> When you pull out the network cable *and* plug it back in on, say, node 3,
>> what messages (if any) appear in /var/log/messages on node 4?
>> (Sorry for the repetition, but the messages are necessary to make any
>> sense of the situation.)
>>
>
> OK, here is the log on node 4 after I disconnect the network cable on node3.
>
> -----------
>
> Jan  5 12:05:24 ha2lx openais[4988]: [TOTEM] The token was lost in the
> OPERATIONAL state.
> Jan  5 12:05:24 ha2lx openais[4988]: [TOTEM] Receive multicast socket
> recv buffer size (288000 bytes).
> Jan  5 12:05:24 ha2lx openais[4988]: [TOTEM] Transmit multicast socket
> send buffer size (262142 bytes).
> Jan  5 12:05:24 ha2lx openais[4988]: [TOTEM] entering GATHER state from 2.
> Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] entering GATHER state from 0.
> Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Creating commit token
> because I am the rep.
> Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Saving state aru 76 high
> seq received 76
> Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Storing new sequence id
> for ring ac
> Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] entering COMMIT state.
> Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] entering RECOVERY state.
> Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] position [0] member 10.42.21.29:
> Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] previous ring seq 168 rep
> 10.42.21.27
> Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] aru 76 high delivered 76
> received flag 1
> Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Did not need to originate
> any messages in recovery.
> Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] Sending initial ORF token
> Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] CLM CONFIGURATION CHANGE
> Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] New Configuration:
> Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.29)
> Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] Members Left:
> Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.27)
> Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] Members Joined:
> Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] CLM CONFIGURATION CHANGE
> Jan  5 12:05:28 ha2lx kernel: dlm: closing connection to node 2
> Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] New Configuration:
> Jan  5 12:05:28 ha2lx fenced[5004]: 10.42.21.27 not a cluster member
> after 0 sec post_fail_delay
> Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.29)
> Jan  5 12:05:28 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0:
> jid=1: Trying to acquire journal lock...
> Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] Members Left:
> Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] Members Joined:
> Jan  5 12:05:28 ha2lx openais[4988]: [SYNC ] This node is within the
> primary component and will provide service.
> Jan  5 12:05:28 ha2lx openais[4988]: [TOTEM] entering OPERATIONAL state.
> Jan  5 12:05:28 ha2lx openais[4988]: [CLM  ] got nodejoin message 10.42.21.29
> Jan  5 12:05:28 ha2lx openais[4988]: [CPG  ] got joinlist message from node 1
> Jan  5 12:05:28 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0:
> jid=1: Looking at journal...
> Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0:
> jid=1: Acquiring the transaction lock...
> Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0:
> jid=1: Replaying journal...
> Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0:
> jid=1: Replayed 0 of 0 blocks
> Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0:
> jid=1: Found 0 revoke tags
> Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0:
> jid=1: Journal replayed in 1s
> Jan  5 12:05:29 ha2lx kernel: GFS2: fsid=ipmicluster:guest_roots.0: jid=1: Done
> ------------------
>
> Now when I plug the cable back into node3, node 4 reboots. Here is the
> log I quickly grabbed on node4:
>
>
> --
> Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] entering GATHER state from 11.
> Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] Saving state aru 1d high
> seq received 1d
> Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] Storing new sequence id
> for ring b0
> Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] entering COMMIT state.
> Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] entering RECOVERY state.
> Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] position [0] member 10.42.21.27:
> Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] previous ring seq 172 rep
> 10.42.21.27
> Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] aru 16 high delivered 16
> received flag 1
> Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] position [1] member 10.42.21.29:
> Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] previous ring seq 172 rep
> 10.42.21.29
> Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] aru 1d high delivered 1d
> received flag 1
> Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] Did not need to originate
> any messages in recovery.
> Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] CLM CONFIGURATION CHANGE
> Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] New Configuration:
> Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.29)
> Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] Members Left:
> Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] Members Joined:
> Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] CLM CONFIGURATION CHANGE
> Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] New Configuration:
> Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.27)
> Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.29)
> Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] Members Left:
> Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ] Members Joined:
> Jan  5 12:07:12 ha2lx openais[4988]: [CLM  ]    r(0) ip(10.42.21.27)
> Jan  5 12:07:12 ha2lx openais[4988]: [SYNC ] This node is within the
> primary component and will provide service.
> Jan  5 12:07:12 ha2lx openais[4988]: [TOTEM] entering OPERATIONAL state.
> Jan  5 12:07:12 ha2lx openais[4988]: [MAIN ] Killing node 10.42.21.27
> because it has rejoined the cluster with existing state
> Jan  5 12:07:12 ha2lx openais[4988]: [CMAN ] cman killed by node 2
> because we rejoined the cluster without a full restart
> Jan  5 12:07:12 ha2lx gfs_controld[5016]: groupd_dispatch error -1 errno 11
> Jan  5 12:07:12 ha2lx gfs_controld[5016]: groupd connection died
> Jan  5 12:07:12 ha2lx gfs_controld[5016]: cluster is down, exiting
> Jan  5 12:07:12 ha2lx dlm_controld[5010]: cluster is down, exiting
> Jan  5 12:07:12 ha2lx kernel: dlm: closing connection to node 1
> Jan  5 12:07:12 ha2lx fenced[5004]: cluster is down, exiting
> -------
>
>
> Also here is the log of node3:
>
> --
> [root@ha1lx ~]# tail -f /var/log/messages
> Jan  5 12:07:24 ha1lx openais[26029]: [TOTEM] entering OPERATIONAL state.
> Jan  5 12:07:24 ha1lx openais[26029]: [CLM  ] got nodejoin message 10.42.21.27
> Jan  5 12:07:24 ha1lx openais[26029]: [CLM  ] got nodejoin message 10.42.21.27
> Jan  5 12:07:24 ha1lx openais[26029]: [CPG  ] got joinlist message from node 2
> Jan  5 12:07:27 ha1lx ccsd[26019]: Attempt to close an unopened CCS
> descriptor (4520670).
> Jan  5 12:07:27 ha1lx ccsd[26019]: Error while processing disconnect:
> Invalid request descriptor
> Jan  5 12:07:27 ha1lx fenced[26045]: fence "10.42.21.29" success
> Jan  5 12:07:27 ha1lx kernel: GFS2: fsid=ipmicluster:guest_roots.1:
> jid=0: Trying to acquire journal lock...
> Jan  5 12:07:27 ha1lx kernel: GFS2: fsid=ipmicluster:guest_roots.1:
> jid=0: Looking at journal...
> Jan  5 12:07:28 ha1lx kernel: GFS2: fsid=ipmicluster:guest_roots.1: jid=0: Done
> ----------------
>
>> HTH
>>
>> With warm regards
>>
>> Rajagopal
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>
>
> Thanks a lot
>
> Paras.
>

In an attempt to solve the fencing issue in my two-node cluster, I tried
running fence_ipmilan by hand to check whether fencing works at all. It
fails, and I need to know what the problem is.

-
[root@ha1lx ~]# fence_ipmilan -a 10.42.21.28 -o off -l admin -p admin
Powering off machine @ IPMI:10.42.21.28...ipmilan: Failed to connect after 30 seconds
Failed
[root@ha1lx ~]#
---------------

Here 10.42.21.28 is the IP address assigned to the IPMI interface of
node3 (ha1lx), and I am running this command on that same host.
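
To narrow down where the connection fails, the checks below could be run
in order. This is only a sketch, not a diagnosis: it assumes ping,
fence_ipmilan and ipmitool are installed, that the BMC answers ICMP, and
that the address and credentials are the ones quoted above. The helper
names (run, ipmi_checks) and the BMC_IP/BMC_USER/BMC_PASS/RUN variables
are made up for illustration. It uses "-o status" instead of "-o off" so
nothing gets powered down, and it only prints the commands unless RUN=1
is set. The "-P" (lanplus) retry is a guess that the BMC may want IPMI
v2.0; check the fence_ipmilan man page for your version.

```shell
#!/bin/sh
# Dry-run sketch: prints each check and executes it only when RUN=1.
# Defaults are the values quoted in this thread; override via environment.
BMC_IP="${BMC_IP:-10.42.21.28}"
BMC_USER="${BMC_USER:-admin}"
BMC_PASS="${BMC_PASS:-admin}"

run() {
    # Echo the command line; execute it only when RUN=1 is set.
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

ipmi_checks() {
    # 1. Is the BMC address reachable from this host at all?
    run ping -c 3 "$BMC_IP"
    # 2. Non-destructive query through the fence agent (status, not off).
    run fence_ipmilan -a "$BMC_IP" -l "$BMC_USER" -p "$BMC_PASS" -o status
    # 3. Same query with ipmitool over the LAN, bypassing the fence agent.
    run ipmitool -I lan -H "$BMC_IP" -U "$BMC_USER" -P "$BMC_PASS" chassis power status
    # 4. Retry the fence agent with lanplus in case the BMC requires it.
    run fence_ipmilan -P -a "$BMC_IP" -l "$BMC_USER" -p "$BMC_PASS" -o status
}

ipmi_checks
```

Running the same checks from the other node (ha2lx) against 10.42.21.28
would also be worth a try. In the cluster, node3 is always fenced by
node4 and never by itself, and on some boards where the BMC shares a
network port with the host, the host may be unable to reach its own BMC.
Whether that applies to this hardware is an assumption to verify.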

Thanks
Paras.