On Apr 20, 2007, at 3:49 AM, Kadlecsik Jozsi wrote:
I have little information about how resilient the protocol is; however,
in a unit we have with a bad disk, I've had the cec connection
spontaneously drop mid-command. I'm sure they're working to fix this,
but it doesn't bode well for something as critical as fencing. I'm also
unclear on whether a dropped connection generates a non-zero exit code
(i.e. is even detectable).
The fence_coraid script I wrote uses expect in perl. So if the cec
connection fails (at any point) it is detected and reported by the script.
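
For anyone wondering what "detected and reported" can look like in practice, here is a minimal sketch of the general pattern. It is not the fence_coraid script itself (that one is Perl with Expect and is not reproduced here). The function name run_checked, the exit codes, and the idea of matching a completion marker in the output are all assumptions made for illustration. It also uses plain pipes, whereas a real cec session is interactive and would most likely need a pty, which is what expect provides.

```python
import subprocess
import sys


def run_checked(argv, commands, expect_marker, timeout=10.0):
    """Run a session and return 0 only if it demonstrably completed.

    argv, commands and expect_marker are hypothetical inputs: the command
    line to start the session, the text to feed it, and a string that must
    appear in its output before the session counts as successful.
    """
    try:
        proc = subprocess.run(
            argv,
            input=commands,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return 2  # session hung and was killed after the timeout
    except OSError:
        return 3  # session could not be started at all
    if proc.returncode != 0:
        return 1  # session exited with an error
    if expect_marker not in proc.stdout:
        return 1  # session ended before the expected output appeared
    return 0


if __name__ == "__main__":
    # Stand-in command; a real fence agent would start its console session here.
    status = run_checked([sys.executable, "-c", "print('done')"], "", "done")
    sys.exit(status)
```

The point of the design is that a dropped connection never looks like success: anything short of seeing the expected output maps to a non-zero exit code that the fencing daemon can act on.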
Also of interest is whether these masks are saved across reboots. I think they are, but it's probably worth checking.
Also, on APCs, the fence_apc script has the benefit that the APC
switches do not allow more than one concurrent telnet connection, which
effectively serializes fence requests. With the cec, not so much.
This is problematic: the requests are not serialized at all. Two
concurrent cec sessions are completely intermixed: a command issued in one
cec session appears in the other (letter by letter). Yes, this is a real issue.
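
One partial mitigation is to have the fence agent take an exclusive lock per shelf before opening its cec session. The sketch below is only that, a sketch: shelf_lock and the lock file name are hypothetical, and an flock on a local file serializes only sessions started from the same host. Nodes racing each other from different machines would still need a cluster-wide lock, which this does not provide.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def shelf_lock(shelf, lock_dir=None):
    """Hold an exclusive advisory lock for one shelf while the body runs.

    The lock file name is a made-up convention for this example. A second
    caller on the same host blocks in flock() until the first one is done.
    """
    if lock_dir is None:
        lock_dir = tempfile.gettempdir()
    path = os.path.join(lock_dir, f"fence_coraid_shelf{shelf}.lock")
    with open(path, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


if __name__ == "__main__":
    with shelf_lock(0):
        # The cec session for shelf 0 would run here, one at a time per host.
        print("holding lock for shelf 0")
```

Locking per shelf rather than globally keeps fence operations against different shelves from waiting on each other, which matters since the masking has to be done per shelf anyway.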
In our current setup, we use over 25 lblades on four shelves. We are adding a fifth shelf this weekend, with an additional six lblades. As you can imagine, fencing in this situation becomes complex, and dynamically detecting which lblades to fence makes it more complex still. I will leave it to the reader to envision the alternative: manually updating fencing scripts across the cluster for each lblade addition.
Also, this fences the entire Coraid device in a way that must be manually
cleared if it gets left masked. This is a real possibility where multiple
nodes are racing to fence each other--especially on multiple Coraid shelves (as
it must be done per shelf).
Since we use our Coraids for non-GFS boot volumes as well, this is also
problematic for us, since a stale mask entry keeps us from booting.
The masking disallows access to the logical blades only. The host is
still able to connect to the Coraid box over cec and re-enable its
access rights to the lblades.
This certainly complicates setups that attempt to use the Coraid as a root device. I don't like the idea of having to include cec and expect in an initrd.