[Linux-cluster] Re: Suggestion for backbone network maintenance

Thu Oct 8 13:04:20 UTC 2009

On Wed, Oct 7, 2009 at 5:03 PM, Gianluca Cecchi
<gianluca.cecchi at gmail.com> wrote:

> Hello,
> RHEL 5.3 cluster with 2 nodes and a quorum disk with heuristics. The nodes
> are in different sites.
> At the moment I have this in cluster.conf:
>
>         <quorumd device="/dev/mapper/mpath6" interval="5" label="oraquorum"
>                  log_facility="local4" log_level="7" tko="16" votes="1">
>                 <heuristic interval="2" program="ping -c1 -w1 10.4.5.250"
>                            score="1" tko="20"/>
>         </quorumd>
>
> [snip]
>
It seems it doesn't work as I expected: you have to restart the qdisk
daemon manually for it to pick up the changes.
I would have expected the cluster manager to notify it when you run
ccs_tool update, but qdiskd does not seem to have any kind of reload
function (judging by the init script options, at least).
Also, in my situation it is better to have both nodes up and running.
In fact, when you restart qdiskd it takes about 2 minutes and 10 seconds
to re-register and count as one vote out of three.
A few seconds before that happens, I get the emergency message saying that
quorum was lost, and my services (FS and IP) are abruptly stopped, then
restarted once quorum is regained.

So the successful steps are, at least in my case:

node 1 and 2 both up and running cluster services
node1
- edit cluster.conf, incrementing the version number and setting tko=1500
- ccs_tool update /etc/cluster/cluster.conf
- cman_tool version -r <new_version>   (is this still necessary?)
- service qdiskd restart; sleep 2; service qdiskd start
(sometimes, due to a bug in qdiskd, it doesn't start right away, even if you
do a stop/start; so to be safe I issue a second start command just after
the first attempt.
More precisely: bug https://bugzilla.redhat.com/show_bug.cgi?id=485199
I'm on cman 2.0.98-1.el5_3.1 to simulate my prod cluster, and this bug seems
to be first fixed in RHEL 5.4 with cman-2.0.115-1.el5, then superseded a
few days later by an important fix,
2.0.115-1.el5_4.2 <https://rhn.redhat.com/rhn/software/packages/details/Overview.do?pid=493748>
)
Anyway, after about 2 minutes and 10 seconds qdiskd finishes its
initialization and synchronises with the one on the other node.

Now I can go to node2 and run on it
- service qdiskd restart; sleep 2; service qdiskd start

This way both nodes are aligned with the qdiskd changes.
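
The steps above can be sketched as a small shell script. This is only a
sketch under assumptions: the run, restart_qdiskd, node1_apply and
node2_apply helpers and the DRY_RUN, CONF and NEW_VERSION variables are
made-up names, and the default version number 2 is a placeholder. By
default it only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set explicitly.
DRY_RUN="${DRY_RUN:-1}"
CONF="${CONF:-/etc/cluster/cluster.conf}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Restart qdiskd, then issue a second start in case the first attempt
# did not take (the workaround used above for bug 485199).
restart_qdiskd() {
    run service qdiskd restart
    run sleep 2
    run service qdiskd start
}

# node1: propagate the already edited cluster.conf, then restart qdiskd.
# $1 is the new version number that was written into cluster.conf.
node1_apply() {
    run ccs_tool update "$CONF"
    run cman_tool version -r "$1"
    restart_qdiskd
}

# node2: run only after qdiskd on node1 has finished initializing
# (about 2 minutes and 10 seconds in the case described above).
node2_apply() {
    restart_qdiskd
}

# Placeholder version number; use the one actually set in cluster.conf.
node1_apply "${NEW_VERSION:-2}"
```

Running it with DRY_RUN=0 and NEW_VERSION set to the real version would
execute the commands on node1; node2_apply is meant to be called on node2
afterwards.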

In my case I can then shut down node2 and wait for the network people to
tell me that maintenance is finished, and then re-apply the initial
configuration.
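
To get a rough idea of how long a maintenance window these settings can
ride out, a back-of-the-envelope calculation may help. The sketch below
assumes that a timeout is roughly interval multiplied by tko, and that
tko=1500 is set on the heuristic with its interval left at 2; both are
assumptions, not something confirmed above. window_seconds is a made-up
helper name.

```shell
#!/bin/sh
# window_seconds INTERVAL TKO
# Approximate number of seconds of consecutive failures tolerated,
# assuming the timeout is simply interval * tko.
window_seconds() {
    echo $(( $1 * $2 ))
}

echo "heuristic, interval=2 tko=20:   $(window_seconds 2 20)s"
echo "heuristic, interval=2 tko=1500: $(window_seconds 2 1500)s"
echo "quorumd,   interval=5 tko=16:   $(window_seconds 5 16)s"
```

Under those assumptions, tko=1500 would cover about 3000 seconds (50
minutes) of failed pings, against 40 seconds with the original tko=20.
This does not account for the time qdiskd needs to re-register after a
restart.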

Comments are, again, welcome ;-)

Gianluca