[Linux-cluster] Adding a stop timeout to a VM service using 'ccs'

Thu Mar 20 01:26:56 UTC 2014

On 19/03/14 07:45 PM, Digimer wrote:
> On 19/03/14 06:31 PM, Chris Feist wrote:
>> On 03/18/2014 08:27 PM, Digimer wrote:
>>> Hi all,
>>>
>>>    I would like to tell rgmanager to give more time for VMs to stop. I
>>> want this:
>>>
>>> <vm name="vm01-win2008" domain="primary_n01" autostart="0"
>>> path="/shared/definitions/" exclusive="0" recovery="restart"
>>> max_restarts="2"
>>> restart_expire_time="600">
>>>    <action name="stop" timeout="10m" />
>>> </vm>
>>>
>>> I already use ccs to create the entry:
>>>
>>> <vm name="vm01-win2008" domain="primary_n01" autostart="0"
>>> path="/shared/definitions/" exclusive="0" recovery="restart"
>>> max_restarts="2"
>>> restart_expire_time="600"/>
>>>
>>> via:
>>>
>>> ccs -h localhost --activate --sync --password "secret" \
>>>   --addvm vm01-win2008 \
>>>   --domain="primary_n01" \
>>>   path="/shared/definitions/" \
>>>   autostart="0" \
>>>   exclusive="0" \
>>>   recovery="restart" \
>>>   max_restarts="2" \
>>>   restart_expire_time="600"
>>>
>>> I'm hoping it's a simple additional switch. :)
>>
>> Unfortunately currently ccs doesn't support setting resource actions.
>> However it's my understanding that rgmanager doesn't check timeouts
>> unless __enforce_timeouts is set to "1".  So you shouldn't be seeing a
>> vm resource go to failed if it takes a long time to stop.  Are you
>> trying to make the vm resource fail if it takes longer than 10 minutes
>> to stop?
>
> I was afraid you were going to say that. :(
>
> The problem is that after calling 'disable' against the VM service,
> rgmanager waits two minutes. If the service isn't closed in that time,
> the server is forced off (at least, this was the behaviour when I last
> tested this).
>
> The concern is that, by default, Windows installs queue updates to
> install when the system shuts down. During this time, Windows makes it
> very clear that you should not power off the system during the updates.
> So if this timer is hit, and the VM is forced off, the guest OS can be
> damaged.
>
> Of course, we can debate the (lack of) wisdom of this behaviour, and
> although I already document this concern (and even warn people to check
> for updates before stopping the server), that's not sufficient. If a
> user doesn't read the warning, or simply forgets to check, the
> consequences can be non-trivial.
>
> If ccs can't be made to add this attribute, and if the behaviour
> persists (I will test shortly after sending this reply), then I will
> have to edit the cluster.conf directly, something I am loath to do if at
> all avoidable.
>
> Cheers

Confirmed;

I called disable on a VM with GNOME running, so that I could abort the
VM's shutdown.

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:06:29 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:08:36 EDT 2014

2 minutes and 7 seconds, then rgmanager forced off the VM. Had this been
a Windows guest in the middle of installing updates, it would be highly
likely to be screwed now.

To confirm, I changed the config to:

<vm autostart="0" domain="primary_n01" exclusive="0" max_restarts="2" 
name="vm01-rhel6" path="/shared/definitions/" recovery="restart" 
restart_expire_time="600">
   <action name="stop" timeout="10m"/>
</vm>
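
For anyone who would rather script that edit than open cluster.conf by
hand, here is a minimal sketch in Python. It is not something ccs does;
the helper name add_stop_timeout, the cluster name "an-cluster" and the
starting config_version of 7 are all made up for illustration, and it
works on a string so it can be tried without touching a live config.

```python
import xml.etree.ElementTree as ET


def add_stop_timeout(conf_xml, vm_name, timeout="10m"):
    """Return cluster.conf text with a stop-timeout action on the named <vm>.

    Hypothetical helper for illustration; not part of ccs or rgmanager.
    """
    root = ET.fromstring(conf_xml)
    vm = root.find(".//vm[@name='%s']" % vm_name)
    if vm is None:
        raise ValueError("no <vm> named %r in the config" % vm_name)

    # Reuse an existing stop action if there is one, otherwise create it.
    action = vm.find("action[@name='stop']")
    if action is None:
        action = ET.SubElement(vm, "action", {"name": "stop"})
    action.set("timeout", timeout)

    # Bump config_version so the change is seen as a newer config.
    root.set("config_version", str(int(root.get("config_version", "0")) + 1))
    return ET.tostring(root, encoding="unicode")


# Sample input: the cluster name and config_version are assumptions.
sample = """<cluster name="an-cluster" config_version="7">
  <rm>
    <vm name="vm01-rhel6" domain="primary_n01" autostart="0"
        path="/shared/definitions/" exclusive="0" recovery="restart"
        max_restarts="2" restart_expire_time="600"/>
  </rm>
</cluster>"""

updated = add_stop_timeout(sample, "vm01-rhel6")
print(updated)
```

The result would still need to be validated and pushed to the other
nodes; on a cman-based cluster I'd expect that to be ccs_config_validate
followed by 'cman_tool version -r', but treat that as an assumption about
your setup rather than part of the test above.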

Then I repeated the test:

an-c05n01:~# date; clusvcadm -d vm:vm01-rhel6; date
Wed Mar 19 21:13:18 EDT 2014
Local machine disabling vm:vm01-rhel6...Success
Wed Mar 19 21:23:31 EDT 2014

10 minutes and 13 seconds before the cluster killed the server, much
less likely to interrupt an in-progress OS update (truth be told, I plan
to set 30 minutes).
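
As a sanity check on the two timings quoted in this message, the
arithmetic can be redone from the 'date' lines themselves. This is only
a sketch; the function name elapsed is made up, and it assumes both
lines of a pair are in the same time zone (true here, both are EDT).

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"


def elapsed(start, end):
    """Seconds between two `date` output lines that share a time zone."""
    def parse(line):
        parts = line.split()
        del parts[4]  # drop the zone name; both lines use the same one
        return datetime.strptime(" ".join(parts), FMT)
    return int((parse(end) - parse(start)).total_seconds())


runs = {
    "default": ("Wed Mar 19 21:06:29 EDT 2014",
                "Wed Mar 19 21:08:36 EDT 2014"),
    "stop timeout=10m": ("Wed Mar 19 21:13:18 EDT 2014",
                         "Wed Mar 19 21:23:31 EDT 2014"),
}
for label, (start, end) in runs.items():
    minutes, seconds = divmod(elapsed(start, end), 60)
    print("%s: %dm %02ds" % (label, minutes, seconds))
```

That gives 127 seconds for the default run and 613 seconds with the
10m stop timeout, so in both cases the VM was killed a few seconds
after the timer expired.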

I understand that this blocks other processes, but in an HA environment, 
I'd strongly argue that safe > speed.

digimer

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?