[Linux-cluster] GFS over AOE without fencing?

Jayson Vantuyl jvantuyl at engineyard.com
Fri Apr 20 09:18:25 UTC 2007


On Apr 20, 2007, at 3:49 AM, Kadlecsik Jozsi wrote:
>> I have little information about how resilient the protocol is; however,
>> in a unit we have with a bad disk, I've had the cec connection
>> spontaneously drop mid-command.  I'm sure they're working to fix this,
>> but it doesn't bode well for something as critical as fencing.  I'm also
>> unclear on whether a dropped connection generates a non-zero exit code
>> (i.e. is even detectable).
>
> The fence_coraid script I wrote uses expect in perl. So if the cec
> connection fails (at any point) it is detected and reported by the script.
Also of interest is whether these masks are saved over reboot.  I
think they are, but it's probably worth checking.
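
On the detection question quoted above, here is a rough Python sketch of what "detectable" could look like from the caller's side. It treats a fence attempt as successful only if the helper command exits zero, finishes within a timeout, and prints an expected confirmation string. The function name, the confirmation text, and the helper command are all hypothetical; nothing here reflects how cec or fence_coraid actually behave.

```python
import subprocess


def run_fence_command(argv, expect_text, timeout_s=30.0):
    """Run a fence helper and return (ok, reason).

    ok is True only when the command exits 0 within timeout_s seconds
    AND expect_text appears in its standard output. Anything else
    (timeout, failure to start, non-zero exit, missing confirmation)
    is reported as a failure. argv and expect_text are placeholders
    for whatever helper and confirmation string a site actually uses.
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    except OSError as exc:
        return False, f"could not start: {exc}"
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}"
    if expect_text not in proc.stdout:
        return False, "expected confirmation not seen"
    return True, "ok"
```

The point of checking the output as well as the exit code is that a session which drops mid-command might still exit cleanly, so a zero exit status alone is not taken as proof that the mask was applied.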

>> Also, on APCs, the fence_apc script has the benefit that the APC
>> switches do not allow more than one concurrent telnet connection, which
>> effectively serializes fence requests.  With the cec, not so much.
>
> This is problematic: the requests are not serialized at all, and two
> concurrent cec sessions are totally mixed: a command issued in one cec
> session appears in the other (letter by letter). Yes, this is a real issue.
In our current setup, we use over 25 lblades on four shelves.  We
are adding a fifth shelf this weekend, with an additional six lblades.
As you can imagine, fencing in this situation becomes complex.
Additionally, if you want to dynamically detect which lblades to
fence, this becomes fairly complex quickly.  The alternative, manually
updating fencing scripts across the cluster for each lblade addition,
I leave to the reader to envision.
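
To illustrate the dynamic-detection side, here is a minimal Python sketch that groups AoE targets by shelf, since fencing has to be done per shelf. It assumes each lblade shows up on the host under the aoe driver's usual e<shelf>.<slot> naming (for example as entries in /dev/etherd); the function name is made up for this example, and how the resulting list is fed to a fence agent is left open.

```python
import re
from collections import defaultdict

# Matches whole-device AoE names such as "e1.0"; partitions like
# "e1.0p1" and control files like "discover" do not match.
_AOE_NAME = re.compile(r"^e(\d+)\.(\d+)$")


def group_by_shelf(device_names):
    """Map shelf number -> sorted list of slot numbers.

    device_names is any iterable of names, e.g. the result of
    os.listdir("/dev/etherd") on a host with the aoe driver loaded
    (an assumption of this sketch). Non-matching names are skipped.
    """
    shelves = defaultdict(list)
    for name in device_names:
        match = _AOE_NAME.match(name)
        if match is None:
            continue
        shelves[int(match.group(1))].append(int(match.group(2)))
    return {shelf: sorted(slots) for shelf, slots in sorted(shelves.items())}
```

With something like this, adding a shelf changes the discovered list rather than every fencing script, though each node still only sees the targets visible to it.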

>> Also, this fences the entire Coraid device in a way that must be manually
>> cleared if it gets left masked.  This is a real possibility where multiple
>> nodes are racing to fence each other--especially on multiple Coraid
>> shelves (as it must be done per shelf).
>>
>> Since we use our Coraids for non-GFS boot volumes as well, this is also
>> problematic for us, since a stale mask entry keeps us from booting.
>
> The masking disallows access to the logical blades only. The host is
> still able to connect to the Coraid box over cec and re-enable its
> access rights to the lblades.
This certainly complicates setups that attempt to use the Coraid as a  
root device.  I don't like the idea of having to include cec and  
expect in an initrd.

-- 
Jayson Vantuyl
Systems Architect
Engine Yard
jvantuyl at engineyard.com

