
Re: [Linux-cluster] GFS over AOE without fencing?

On Apr 20, 2007, at 3:49 AM, Kadlecsik Jozsi wrote:
I have little information about how resilient the protocol is. However, 
in a unit we have with a bad disk, I've had the cec connection 
spontaneously drop mid-command.  I'm sure they're working to fix this, 
but it doesn't bode well for something as critical as fencing.  I'm also 
unclear on whether a dropped connection generates a non-zero exit code 
(i.e. whether it is even detectable).

The fence_coraid script I wrote uses expect in perl. So if the cec 
connection fails (at any point) it is detected and reported by the script.
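
For readers who want to see the idea, here is a minimal sketch of the same
principle in Python rather than Perl/expect. It is not the fence_coraid
script. It assumes only that the session command prints some confirmation
text on success; the function name run_and_confirm, the return codes, and
the stand-in commands are all hypothetical. The point is that the exit code
alone is not trusted: a session that dies early, hangs, or never prints the
confirmation is reported as a failure.

```python
import subprocess
import sys


def run_and_confirm(argv, expected, timeout=30.0):
    """Return 0 only if the command exits cleanly AND printed `expected`.

    Any other outcome (non-zero exit, hang, failure to start, or missing
    confirmation text) returns 1, so the caller never mistakes a dropped
    session for a successful fence.
    """
    try:
        result = subprocess.run(
            argv,
            stdin=subprocess.DEVNULL,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 1  # session hung; treat as a failed fence
    except OSError:
        return 1  # command could not be started at all
    if result.returncode != 0:
        return 1  # command itself reported an error
    if expected not in result.stdout:
        return 1  # exited "cleanly" but never confirmed the change
    return 0


if __name__ == "__main__":
    # Stand-in command (hypothetical); a real agent would run its own
    # session command here and look for that tool's confirmation text.
    demo = [sys.executable, "-c", "print('mask set')"]
    sys.exit(run_and_confirm(demo, "mask set"))
```

The actual cec command line is deliberately left out, since the sketch does
not depend on it.
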

Also of interest is whether these masks are saved over reboot.  I think they are, but it's probably worth checking.

Also, on APCs, the fence_apc script has the benefit that the APC 
switches do not allow more than one concurrent telnet connection, which 
effectively serializes fence requests.  With the cec, not so much.

This is problematic: the requests are not serialized at all, and two 
concurrent cec sessions are totally mixed: a command issued in one cec 
session appears in the other (letter by letter). Yes, this is a real issue.
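
As a partial illustration of serializing sessions, the sketch below takes
an exclusive advisory lock before a session starts. It assumes a POSIX host,
and the name exclusive_session and the lock path are hypothetical. Note the
limitation: flock only serializes sessions started from the same host. It
does nothing for nodes racing from different machines, which would need a
cluster-wide lock that this sketch does not provide.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_session(lock_path):
    """Hold an exclusive advisory lock for the duration of the block.

    Only processes on this host that use the same lock file are
    serialized; other hosts are unaffected.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no one else holds it
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

Usage would look like "with exclusive_session('/var/lock/fence.lock'):"
wrapped around whatever starts the session (the path is an example only).
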

In our current setup, we use over 25 lblades on four shelves.  We are adding a fifth this weekend, with an additional six lblades.  As you can imagine, fencing in this situation becomes complex, and it becomes more so if you want to detect dynamically which lblades to fence.  I will leave it to the reader to envision the alternative: manually updating fencing scripts across the cluster for each lblade addition.
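
To make the per-shelf bookkeeping concrete, here is a small sketch that
groups devices by shelf so that a fencing pass can visit each shelf once.
It assumes the lblades are visible under the usual Linux AoE naming
e<shelf>.<slot> (for example e1.0); the function name group_by_shelf and
the sample names are hypothetical, and how the list of names is discovered
is left open.

```python
import re
from collections import defaultdict

# AoE device names on Linux look like "e<shelf>.<slot>", e.g. "e1.0".
_AOE_NAME = re.compile(r"^e(\d+)\.(\d+)$")


def group_by_shelf(device_names):
    """Map shelf number -> sorted list of slot numbers on that shelf."""
    shelves = defaultdict(list)
    for name in device_names:
        match = _AOE_NAME.match(name)
        if match is None:
            raise ValueError(f"not an AoE device name: {name!r}")
        shelves[int(match.group(1))].append(int(match.group(2)))
    return {shelf: sorted(slots) for shelf, slots in sorted(shelves.items())}


if __name__ == "__main__":
    # Sample names only; a real list would come from the running system.
    print(group_by_shelf(["e1.0", "e0.3", "e1.2", "e0.1"]))
```

Adding a shelf then changes only the input list, not the fencing logic.
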

Also, this fences the entire Coraid device in a way that must be manually
cleared if it gets left masked.  This is a real possibility where multiple
nodes are racing to fence each other--especially on multiple Coraid shelves (as
it must be done per shelf).

Since we use our Coraids for non-GFS boot volumes as well, this is also
problematic for us, since a stale mask entry keeps us from booting.

The masking disallows access to the logical blades only. The host is 
still able to connect to the Coraid box over cec and re-enable its 
access rights to the lblades.

This certainly complicates setups that attempt to use the Coraid as a root device.  I don't like the idea of having to include cec and expect in an initrd.

Jayson Vantuyl
Systems Architect
