[Linux-cluster] Adding a stop timeout to a VM service using 'ccs'

Thu Mar 20 02:12:08 UTC 2014

Hi

On Wednesday 19 of March 2014 21:26:56 Digimer wrote:
> On 19/03/14 07:45 PM, Digimer wrote:
> > On 19/03/14 06:31 PM, Chris Feist wrote:
> >> On 03/18/2014 08:27 PM, Digimer wrote:
> >>> Hi all,
> >>> 
> >>>    I would like to tell rgmanager to give more time for VMs to stop. I
> >>> want this:
> >>> 
> >>> <vm name="vm01-win2008" domain="primary_n01" autostart="0"
> >>> path="/shared/definitions/" exclusive="0" recovery="restart"
> >>> max_restarts="2"
> >>> restart_expire_time="600">
> >>> 
> >>>    <action name="stop" timeout="10m" />
> >>> 
> >>> </vm>
> >>> 
> >>> I already use ccs to create the entry:
> >>> 
> >>> <vm name="vm01-win2008" domain="primary_n01" autostart="0"
> >>> path="/shared/definitions/" exclusive="0" recovery="restart"
> >>> max_restarts="2"
> >>> restart_expire_time="600"/>
> >>> 
> >>> via:
> >>> 
> >>> ccs -h localhost --activate --sync --password "secret" \
> >>>   --addvm vm01-win2008 \
> >>>   --domain="primary_n01" \
> >>>   path="/shared/definitions/" \
> >>>   autostart="0" \
> >>>   exclusive="0" \
> >>>   recovery="restart" \
> >>>   max_restarts="2" \
> >>>   restart_expire_time="600"
> >>> 
> >>> I'm hoping it's a simple additional switch. :)
> >> 
> >> Unfortunately currently ccs doesn't support setting resource actions.
> >> However it's my understanding that rgmanager doesn't check timeouts
> >> unless __enforce_timeouts is set to "1".  So you shouldn't be seeing a
> >> vm resource go to failed if it takes a long time to stop.  Are you
> >> trying to make the vm resource fail if it takes longer than 10 minutes
> >> to stop?
> > 
> > I was afraid you were going to say that. :(
> > 
> > The problem is that after calling 'disable' against the VM service,
> > rgmanager waits two minutes. If the service isn't closed in that time,
> > the server is forced off (at least, this was the behaviour when I last
> > tested this).
> > 
> > The concern is that, by default, windows installs queue updates to
> > install when the system shuts down. During this time, windows makes it
> > very clear that you should not power off the system during the updates.
> > So if this timer is hit, and the VM is forced off, the guest OS can be
> > damaged.
> > 
> > Of course, we can debate the (lack of) wisdom of this behaviour, and
> > although I already document this concern (and even warn people to check
> > for updates before stopping the server), that's not sufficient. If a user
> > doesn't read the warning, or simply forgets to check, the consequences
> > can be non-trivial.
> > 
> > If ccs can't be made to add this attribute, and if the behaviour
> > persists (I will test shortly after sending this reply), then I will
> > have to edit the cluster.conf directly, something I am loath to do if at
> > all avoidable.
> > 
> > Cheers
> 
> Confirmed;
> 
> I called disable on a VM with gnome running, so that I could abort the
> VM's shut down.
> 
> an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
> Wed Mar 19 21:06:29 EDT 2014
> Local machine disabling vm:vm01-rhel6...Success
> Wed Mar 19 21:08:36 EDT 2014
> 
> 2 minutes and 7 seconds, then rgmanager forced-off the VM. Had this been
> a windows guest in the middle of installing updates, it would be highly
> likely to be screwed now.

Is this really the best way to handle such an event?

From what I remember, Windows can (or could; I don't have any 'modern' Windows
lying around) be told to shut down without updating. Maybe a wiser approach
would be to make the stop event (which I believe is delivered to the guest as
an ACPI power button press) trigger a shutdown without updates.

Keep in mind that doing system updates on a timer is dangerous, regardless of
the actual time.

regards
Pavel Herrmann

> To confirm, I changed the config to:
> 
> <vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2"
> name="vm01-rhel6" path="/shared/definitions/" recovery="restart"
> restart_expire_time="600">
>    <action name="stop" timeout="10m"/>
> </vm>
> 
> Then I repeated the test:
> 
> an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
> Wed Mar 19 21:13:18 EDT 2014
> Local machine disabling vm:vm01-rhel6...Success
> Wed Mar 19 21:23:31 EDT 2014
> 
> 10 minutes and 13 seconds before the cluster killed the server, much
> less likely to interrupt an in-progress OS update (truth be told, I plan
> to set 30 minutes).
> 
> I understand that this blocks other processes, but in an HA environment,
> I'd strongly argue that safe > speed.
> 
> digimer
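
Since ccs can't add the <action> child, the edit to cluster.conf has to be
done some other way. Below is a minimal sketch of scripting that edit rather
than doing it by hand, assuming Python with the standard xml.etree module is
available on the node. The function name set_vm_stop_timeout, the sample
cluster name "example" and the starting config_version of 7 are made up for
illustration; only the <vm> attributes and the <action name="stop"
timeout="10m"/> child come from the thread. It works on the config text only.
Validating the result (for example with ccs_config_validate) and pushing it
to the other nodes is left to whatever procedure you already use.

```python
import xml.etree.ElementTree as ET


def set_vm_stop_timeout(conf_xml: str, vm_name: str, timeout: str) -> str:
    """Return cluster.conf text with a stop-action timeout set on one <vm>."""
    root = ET.fromstring(conf_xml)

    # Find the <vm> element by name, wherever it sits in the tree.
    vm = next((v for v in root.iter("vm") if v.get("name") == vm_name), None)
    if vm is None:
        raise ValueError(f"no <vm name={vm_name!r}> in config")

    # Reuse an existing stop action if there is one, otherwise create it.
    action = next(
        (a for a in vm.findall("action") if a.get("name") == "stop"), None
    )
    if action is None:
        action = ET.SubElement(vm, "action", {"name": "stop"})
    action.set("timeout", timeout)

    # A changed cluster.conf needs a higher config_version than the old one.
    root.set("config_version", str(int(root.get("config_version", "0")) + 1))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical minimal config; a real cluster.conf has many more sections.
    sample = """<cluster name="example" config_version="7">
  <rm>
    <vm name="vm01-rhel6" domain="primary_n01" autostart="0"
        path="/shared/definitions/" exclusive="0" recovery="restart"
        max_restarts="2" restart_expire_time="600"/>
  </rm>
</cluster>"""
    print(set_vm_stop_timeout(sample, "vm01-rhel6", "10m"))
```

Running it a second time with a different value (say "30m") updates the
existing <action> instead of adding a duplicate, and bumps config_version
again.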