[Linux-cluster] node reboots during stop of cluster application (oracle) and unable to relocate cluster application between nodes

Christopher Chen muffaleta at gmail.com
Mon May 11 14:34:13 UTC 2009


I hope you're planning to expand to at least a 3-node cluster before you  
go into production. You know two-node clusters are inherently  
unstable, right? I assume you've read the architectural overview of how  
the cluster suite achieves quorum.

A cluster requires (n/2)+1 votes to continue to operate. If you restart or  
otherwise remove a machine from a two-node cluster, you've lost quorum,  
and by definition you've dissolved your cluster while you're in that  
state.
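
For example (just a sketch, not a drop-in config): with three nodes each
contributing one vote, cman expects three votes and quorum works out to
(3/2)+1 = 2, so any single node can be rebooted or fenced without
dissolving the cluster. The node names below are placeholders, and the
fence blocks are omitted for brevity:

    <cman expected_votes="3"/>
    <clusternodes>
        <clusternode name="node1" nodeid="1" votes="1"/>
        <clusternode name="node2" nodeid="2" votes="1"/>
        <clusternode name="node3" nodeid="3" votes="1"/>
    </clusternodes>

You can confirm the live vote count and quorum on a member with
"cman_tool status".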

I'm pretty sure the behavior you are describing is proper.
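
(As a quick way to see what's going on while you test, something like
"clustat" on each node will show membership and the state of the oracle
service, and "clusvcadm -r oracle -m psfhost1" should relocate it by
hand; double-check the exact syntax against the man pages, I'm going
from memory here.)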

Time flies like an arrow.
Fruit flies like a banana.

On May 11, 2009, at 4:08, "Viral .D. Ahire" <CISPLengineer.hz at ril.com>  
wrote:

> Hi,
>
> I have configured a two-node cluster on Red Hat 5. The problem is that  
> when I relocate, restart or stop the running cluster service between the  
> two nodes, the node gets fenced and the server restarts. On the other  
> side, the server that takes over the cluster service leaves the cluster  
> and its cluster manager (cman) stops automatically, so it is also fenced  
> by the other server.
>
> I observed that this problem occurs while stopping the cluster  
> service (oracle).
>
> Please help me to resolve this problem.
>
> The log messages and cluster.conf file are given below.
> -------------------------
> /etc/cluster/cluster.conf
> -------------------------
> <?xml version="1.0"?>
> <cluster config_version="59" name="new_cluster">
>     <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>     <clusternodes>
>         <clusternode name="psfhost1" nodeid="1" votes="1">
>             <fence>
>                 <method name="1">
>                     <device name="cluster1"/>
>                 </method>
>             </fence>
>         </clusternode>
>         <clusternode name="psfhost2" nodeid="2" votes="1">
>             <fence>
>                 <method name="1">
>                     <device name="cluster2"/>
>                 </method>
>             </fence>
>         </clusternode>
>     </clusternodes>
>     <cman expected_votes="1" two_node="1"/>
>     <fencedevices>
>         <fencedevice agent="fence_ilo" hostname="ilonode1"  
> login="Administrator" name="cluster1" passwd="9M6X9CAU"/>
>         <fencedevice agent="fence_ilo" hostname="ilonode2"  
> login="Administrator" name="cluster2" passwd="ST69D87V"/>
>     </fencedevices>
>     <rm>
>         <failoverdomains>
>             <failoverdomain name="poy-cluster" ordered="0"  
> restricted="0">
>                 <failoverdomainnode name="psfhost1" priority="1"/>
>                 <failoverdomainnode name="psfhost2" priority="1"/>
>             </failoverdomain>
>         </failoverdomains>
>         <resources>
>             <ip address="10.2.220.2" monitor_link="1"/>
>             <script file="/etc/init.d/httpd" name="httpd"/>
>             <fs device="/dev/cciss/c1d0p3" force_fsck="0"  
> force_unmount="0" fsid="52427" fstype="ext3" mountpoint="/app"  
> name="app" options="" self_fence="0"/>
>             <fs device="/dev/cciss/c1d0p4" force_fsck="0"  
> force_unmount="0" fsid="39388" fstype="ext3" mountpoint="/opt"  
> name="opt" options="" self_fence="0"/>
>             <fs device="/dev/cciss/c1d0p1" force_fsck="0"  
> force_unmount="0" fsid="62307" fstype="ext3" mountpoint="/data"  
> name="data" options="" self_fence="0"/>
>             <fs device="/dev/cciss/c1d0p2" force_fsck="0"  
> force_unmount="0" fsid="47234" fstype="ext3" mountpoint="/OPERATION"  
> name="OPERATION" options="" self_fence="0"/>
>             <script file="/etc/init.d/orcl" name="Oracle"/>
>         </resources>
>         <service autostart="0" name="oracle" recovery="relocate">
>             <fs ref="app"/>
>             <fs ref="opt"/>
>             <fs ref="data"/>
>             <fs ref="OPERATION"/>
>             <ip ref="10.2.220.2"/>
>             <script ref="Oracle"/>
>         </service>
>     </rm>
> </cluster>
>
>
>
>
>
>
>
> -----------------------
> /var/log/messages
> -----------------------
> The following logs were captured while relocating the cluster service (oracle) between nodes.
> Node-1
>
> May  2 16:17:58 psfhost2 clurgmgrd[3793]: <notice> Starting stopped  
> service service:oracle
> May  2 16:17:58 psfhost2 kernel: kjournald starting.  Commit  
> interval 5 seconds
> May  2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount  
> count reached, running e2fsck is recommended
> May  2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p3, internal  
> journal
> May  2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with  
> ordered data mode.
> May  2 16:17:58 psfhost2 kernel: kjournald starting.  Commit  
> interval 5 seconds
> May  2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount  
> count reached, running e2fsck is recommended
> May  2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p4, internal  
> journal
> May  2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with  
> ordered data mode.
> May  2 16:17:58 psfhost2 kernel: kjournald starting.  Commit  
> interval 5 seconds
> May  2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount  
> count reached, running e2fsck is recommended
> May  2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p1, internal  
> journal
> May  2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with  
> ordered data mode.
> May  2 16:17:59 psfhost2 kernel: kjournald starting.  Commit  
> interval 5 seconds
> May  2 16:17:59 psfhost2 kernel: EXT3-fs warning: maximal mount  
> count reached, running e2fsck is recommended
> May  2 16:17:59 psfhost2 kernel: EXT3 FS on cciss/c1d0p2, internal  
> journal
> May  2 16:17:59 psfhost2 kernel: EXT3-fs: mounted filesystem with  
> ordered data mode.
> May  2 16:17:59 psfhost2 avahi-daemon[3661]: Registering new address  
> record for 10.2.220.2 on eth0.
> May  2 16:18:00 psfhost2 in.rdiscd[5945]: setsockopt  
> (IP_ADD_MEMBERSHIP): Address already in use
> May  2 16:18:00 psfhost2 in.rdiscd[5945]: Failed joining addresses
> May  2 16:18:11 psfhost2 clurgmgrd[3793]: <notice> Service  
> service:oracle started
> May  2 16:19:17 psfhost2 kernel: bnx2: eth1 NIC Link is Down
> May  2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering GATHER  
> state from 11.
> May  2 16:19:26 psfhost2 openais[3275]: [TOTEM] Saving state aru 1b  
> high seq received 1b
> May  2 16:19:26 psfhost2 openais[3275]: [TOTEM] Storing new sequence  
> id for ring 90
> May  2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering COMMIT state.
> May  2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering RECOVERY  
> state.
> May  2 16:19:26 psfhost2 openais[3275]: [TOTEM] position [0] member 10.2.220.6 
> :
> May  2 16:19:26 psfhost2 openais[3275]: [TOTEM] previous ring seq  
> 140 rep 10.2.220.6
> May  2 16:19:26 psfhost2 openais[3275]: [TOTEM] aru 9 high delivered  
> 9 received flag 1
> May  2 16:19:26 psfhost2 openais[3275]: [TOTEM] position [1] member 10.2.220.7 
> :
> May  2 16:19:26 psfhost2 openais[3275]: [TOTEM] previous ring seq  
> 136 rep 10.2.220.7
> May  2 16:19:26 psfhost2 openais[3275]: [TOTEM] aru 1b high  
> delivered 1b received flag 1
> May  2 16:19:26 psfhost2 openais[3275]: [TOTEM] Did not need to  
> originate any messages in recovery.
> May  2 16:19:26 psfhost2 openais[3275]: [CLM  ] CLM CONFIGURATION  
> CHANGE
> May  2 16:19:26 psfhost2 openais[3275]: [CLM  ] New Configuration:
> May  2 16:19:27 psfhost2 openais[3275]: [CLM  ]     r(0)  
> ip(10.2.220.7)
> May  2 16:19:27 psfhost2 openais[3275]: [CLM  ] Members Left:
> May  2 16:19:27 psfhost2 openais[3275]: [CLM  ] Members Joined:
> May  2 16:19:27 psfhost2 openais[3275]: [CLM  ] CLM CONFIGURATION  
> CHANGE
> May  2 16:19:27 psfhost2 openais[3275]: [CLM  ] New Configuration:
> May  2 16:19:27 psfhost2 openais[3275]: [CLM  ]     r(0)  
> ip(10.2.220.6)
> May  2 16:19:27 psfhost2 openais[3275]: [CLM  ]     r(0)  
> ip(10.2.220.7)
> May  2 16:19:27 psfhost2 openais[3275]: [CLM  ] Members Left:
> May  2 16:19:27 psfhost2 openais[3275]: [CLM  ] Members Joined:
> May  2 16:19:27 psfhost2 openais[3275]: [CLM  ]     r(0)  
> ip(10.2.220.6)
> May  2 16:19:27 psfhost2 openais[3275]: [SYNC ] This node is within  
> the primary component and will provide service.
> May  2 16:19:27 psfhost2 openais[3275]: [TOTEM] entering OPERATIONAL  
> state.
> May  2 16:19:27 psfhost2 openais[3275]: [CLM  ] got nodejoin message 10.2.220.6
> May  2 16:19:27 psfhost2 openais[3275]: [CLM  ] got nodejoin message 10.2.220.7
> May  2 16:19:27 psfhost2 openais[3275]: [CPG  ] got joinlist message  
> from node 2
> May  2 16:19:29 psfhost2 kernel: bnx2: eth1 NIC Link is Up, 1000  
> Mbps full duplex, receive & transmit flow control ON
> May  2 16:19:31 psfhost2 kernel: bnx2: eth1 NIC Link is Down
> May  2 16:19:35 psfhost2 kernel: bnx2: eth1 NIC Link is Up, 100 Mbps  
> full duplex, receive & transmit flow control ON
> May  2 16:19:42 psfhost2 kernel: dlm: connecting to 1
> May  2 16:20:36 psfhost2 ccsd[3265]: Update of cluster.conf complete  
> (version 57 -> 59).
> May  2 16:20:43 psfhost2 clurgmgrd[3793]: <notice> Reconfiguring
> May  2 16:21:15 psfhost2 clurgmgrd[3793]: <notice> Stopping service  
> service:oracle
> May  2 16:21:25 psfhost2 avahi-daemon[3661]: Withdrawing address  
> record for 10.2.220.7 on eth0.
> May  2 16:21:25 psfhost2 avahi-daemon[3661]: Leaving mDNS multicast  
> group on interface eth0.IPv4 with address 10.2.220.7.
> May  2 16:21:25 psfhost2 avahi-daemon[3661]: Joining mDNS multicast  
> group on interface eth0.IPv4 with address 10.2.220.2.
> May  2 16:21:25 psfhost2 clurgmgrd: [3793]: <err> Failed to remove 10.2.220.2
> May  2 16:21:40 psfhost2 openais[3275]: [TOTEM] entering RECOVERY  
> state.
> May  2 16:21:40 psfhost2 openais[3275]: [TOTEM] position [0] member  
> 127.0.0.1:
> May  2 16:21:40 psfhost2 openais[3275]: [TOTEM] previous ring seq  
> 144 rep 10.2.220.6
> May  2 16:21:40 psfhost2 openais[3275]: [TOTEM] aru 31 high  
> delivered 31 received flag 1
> May  2 16:21:40 psfhost2 openais[3275]: [TOTEM] Did not need to  
> originate any messages in recovery.
> May  2 16:21:40 psfhost2 openais[3275]: [TOTEM] Sending initial ORF  
> token
> May  2 16:21:40 psfhost2 openais[3275]: [CLM  ] CLM CONFIGURATION  
> CHANGE
> May  2 16:21:40 psfhost2 openais[3275]: [CLM  ] New Configuration:
> May  2 16:21:40 psfhost2 openais[3275]: [CLM  ]     r(0) ip(127.0.0.1)
> May  2 16:21:40 psfhost2 openais[3275]: [CLM  ] Members Left:
> May  2 16:21:40 psfhost2 openais[3275]: [CLM  ]     r(0)  
> ip(10.2.220.7)
> May  2 16:21:40 psfhost2 openais[3275]: [CLM  ] Members Joined:
> May  2 16:21:40 psfhost2 openais[3275]: [CLM  ] CLM CONFIGURATION  
> CHANGE
> May  2 16:21:40 psfhost2 openais[3275]: [CLM  ] New Configuration:
> May  2 16:21:40 psfhost2 openais[3275]: [CLM  ]     r(0) ip(127.0.0.1)
> May  2 16:21:40 psfhost2 openais[3275]: [CLM  ] Members Left:
> May  2 16:21:40 psfhost2 openais[3275]: [CLM  ] Members Joined:
> May  2 16:21:40 psfhost2 openais[3275]: [SYNC ] This node is within  
> the primary component and will provide service.
> May  2 16:21:40 psfhost2 openais[3275]: [TOTEM] entering OPERATIONAL  
> state.
> May  2 16:21:40 psfhost2 openais[3275]: [MAIN ] Killing node  
> psfhost2 because it has rejoined the cluster without cman_tool join
> May  2 16:21:40 psfhost2 openais[3275]: [CMAN ] cman killed by node  
> 2 because we rejoined the cluster without a full restart
> May  2 16:21:40 psfhost2 fenced[3291]: cman_get_nodes error -1 104
> May  2 16:21:40 psfhost2 kernel: clurgmgrd[3793]: segfault at  
> 0000000000000000 rip 0000000000408c4a rsp 00007fff3c4a9e20 error 4
> May  2 16:21:40 psfhost2 fenced[3291]: cluster is down, exiting
> May  2 16:21:40 psfhost2 groupd[3283]: cman_get_nodes error -1 104
> May  2 16:21:40 psfhost2 dlm_controld[3297]: cluster is down, exiting
> May  2 16:21:40 psfhost2 gfs_controld[3303]: cluster is down, exiting
> May  2 16:21:40 psfhost2 clurgmgrd[3792]: <crit> Watchdog: Daemon  
> died, rebooting...
> May  2 16:21:40 psfhost2 kernel: dlm: closing connection to node 1
> May  2 16:21:40 psfhost2 kernel: dlm: closing connection to node 2
> May  2 16:21:40 psfhost2 kernel: md: stopping all md devices.
> May  2 16:21:41 psfhost2 kernel: uhci_hcd 0000:01:04.4: HCRESET not  
> completed yet!
> May  2 16:24:55 psfhost2 syslogd 1.4.1: restart.
> May  2 16:24:55 psfhost2 kernel: klogd 1.4.1, log source = /proc/ 
> kmsg started.
> May  2 16:24:55 psfhost2 kernel: Linux version 2.6.18-53.el5 (brewbuilder at hs20-bc1-7.build.redhat.com 
> ) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)) #1 SMP Wed Oct
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster