[Linux-cluster] Node Failure Detection Problems

Mon Mar 20 17:51:45 UTC 2006

On Sun, Mar 19, 2006 at 09:38:12PM +0000, James Firth wrote:
> James Firth wrote:
> >Hi,
> >
> >I have some questions on configuring and tuning heartbeats and 
> >node-failure detection.
> 
> Further to my earlier mail, I am also having problems with exported gnbd 
> devices on node failure.  I want gnbd to give up trying to reconnect on 
> node failure, but it insists on retrying ad infinitum, causing services 
> that use imported gnbd volumes to lock up.

Are you exporting the gnbds in cached or uncached mode (with the -c
option or not)? In uncached mode, you should be able to run
"gnbd_import -Or <gnbd>". It won't actually remove the device if it is
open, but it should cause all the pending IOs to fail.  In uncached mode,
after your timeout, all the IOs should get flushed, assuming that gnbd can
fence the server. If this isn't happening, can you please send me a more
complete description of your gnbd setup and problem, including the output
of the following set of commands, run after the server node fails.

# gnbd_import -l
# gnbd_import -Or <dead_gnbd>
(You might get a message saying "gnbd_import: waiting for all users to close
device <dead_gnbd>". This is fine. Just Ctrl-C to stop waiting. In this case,
the gnbd device will not be removed because something still has it open, but
all the pending IOs should have returned as failed, and all future IOs will
fail)
# gnbd_import -l
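
If it is easier to script this than to type it, the three commands above
could be wrapped roughly as follows. This is only a sketch: the function
name, the log path, and the 30 second cap are made up for illustration, and
it assumes GNU coreutils `timeout` is installed to stand in for the manual
Ctrl-C.

```shell
#!/bin/sh
# Sketch: capture gnbd client state before and after forcing removal.
# Assumptions (not part of gnbd itself): the function name, the default
# log path, the 30 second cap, and the availability of coreutils `timeout`.

collect_gnbd_state() {
    dev=$1                          # name of the dead gnbd device
    log=${2:-/tmp/gnbd-state.log}   # hypothetical output file

    {
        echo "== gnbd_import -l (before) =="
        gnbd_import -l

        echo "== gnbd_import -Or $dev =="
        # The removal may wait for users to close the device; give up
        # waiting after 30 seconds instead of pressing Ctrl-C by hand.
        timeout 30 gnbd_import -Or "$dev"
        echo "exit status: $?"

        echo "== gnbd_import -l (after) =="
        gnbd_import -l
    } > "$log" 2>&1
}
```

Run it as root on the client after the server node fails, for example
`collect_gnbd_state <dead_gnbd>`, and send the resulting log file.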

-Ben

> Regards,
> James Firth