[Linux-cluster] Re: Csnap instantiation and failover using libdlm

Fri Oct 22 00:27:46 UTC 2004

On Thursday 21 October 2004 17:56, Benjamin Marzinski wrote:
> Um.. I just realized that there's a problem here.
> If the agent dies but the server doesn't, the lock will get revoked.
> While this won't interfere with the clients currently connected to
> the server, any new client (or client that gets disconnected) will
> think that there is no server, and promote its server to master....
> and data corruption will follow.
>
> As far as I can tell, the way to ensure that this doesn't happen is
> to have the server process take out the lock. That way the lock won't
> be freed unless the server process dies. Agreed?

No, the way to ensure this is to have the server die if its control 
socket goes away.
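
As a rough illustration of that rule, here is a minimal Python sketch, not
csnap code. It assumes the agent holds the other end of a stream socket and
that EOF on that socket means the agent is gone. The names
control_socket_gone and serve are made up for this example.

```python
import select
import socket


def control_socket_gone(sock, timeout=0.0):
    """Return True if the peer (assumed to be the agent) closed the socket."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return False  # nothing pending, so the connection still looks alive
    try:
        # Peek so that real control messages stay queued for the normal reader.
        data = sock.recv(1, socket.MSG_PEEK)
    except ConnectionResetError:
        return True
    return data == b""  # a zero-length read means the peer shut down


def serve(control_sock, handle_one_request):
    """Handle requests until the control socket goes away, then stop.

    handle_one_request is a hypothetical callback that returns False when
    there is nothing left to do. Returns the number of requests handled.
    """
    served = 0
    while not control_socket_gone(control_sock):
        if not handle_one_request():
            break
        served += 1
    return served
```

A real server would exit the process at the point where serve returns, so
that nothing keeps writing after the agent has gone.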

However, you have pointed out why it's bad for the new server to rely 
only on the lock to decide when it's safe to start processing requests, 
or even to recover the journal: there may still be writes in flight 
from the old server.  If a server dies but its node is still in the 
cluster, the new server's agent has to regard that as a valid reason 
for fencing the node.  This can only be handled properly at the 
membership level, not at the lock level.
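
To make the decision concrete, here is a small Python sketch of the check
the new server's agent might apply. It is only an illustration under stated
assumptions: the function names, the members set and the fenced set are
hypothetical, standing in for whatever the membership layer actually
reports.

```python
def needs_fencing(old_server_alive, old_node, members):
    """A dead server on a node that is still a cluster member may have
    writes in flight, so that node has to be fenced."""
    return (not old_server_alive) and old_node in members


def may_start_serving(holds_lock, old_server_alive, old_node, members, fenced):
    """Holding the lock is necessary but not sufficient.

    The new server may recover the journal and take requests only once the
    old server's node has either left the cluster or been fenced.
    """
    if not holds_lock or old_server_alive:
        return False
    return old_node not in members or old_node in fenced
```

For example, with the lock held, the old server dead and its node still a
member, may_start_serving stays False until that node shows up in the
fenced set.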

> If that's the case, should the server also be responsible for
> contacting the agents in the appropriate service group and getting
> the client information?

It's not the case, so we don't have to worry about it.

The only interesting argument I know of for moving infrastructure 
details into the server is to get rid of one daemon, but daemons are 
cheap, particularly if they sleep nearly all the time like the agent 
does.  It's better to keep the agent and server separate and 
specialized for the time being.

Regards,

Daniel