[Linux-cluster] Fencing the gulm master node: problem

Alban Crequy alban.crequy at seanodes.com
Fri Jul 1 08:36:46 UTC 2005


Hello,

I want to test the fencing capability of GFS by unplugging the network of
a node, but I run into a problem when the node I unplug is the gulm master.

I am using the RPMs:
  - GFS-6.0.2.20-2
  - GFS-modules-smp-6.0.2.20-2

I have an 8-node cluster (sam21, sam22, ..., sam28). I mount a GFS 
filesystem on all nodes at /mnt/gfs.

My config is:

----->8-------->8-------->8-------->8---
# fence.ccs
fence_devices {
         admin {
                 agent="fence_manual"
         }
}

# cluster.ccs
cluster {
         name="sam"
         lock_gulm {
                 servers=["sam21", "sam22", "sam23", "sam24", "sam25"]
         }
}

# nodes.ccs
nodes {
         sam21.toulouse {
                 ip_interfaces {
                 eth0 = "192.168.0.121"
                 }
                 fence {
                         human {
                                 admin {
                                         ipaddr = "192.168.0.121"
                                 }
                         }
                 }
         }
# etc. for sam22 ... sam28
----->8-------->8-------->8-------->8---

I want to check that the unplugged node is fenced and that its locks are 
released only when I run "fence_ack_manual" (and not before).

In order to know when the locks are released, I wrote a small program:

----->8-------->8-------->8-------->8---
// lock.c: hold an exclusive flock() on a file in the GFS mount and
// keep writing to it, one line per second.
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>

int main(void)
{
   char buf[64];
   int fd, error, p, counter = 0;
   int pid = getpid();

   fd = open("/mnt/gfs/lock-test.tmp", O_RDWR|O_CREAT, S_IREAD|S_IWRITE);
   if (fd == -1) {
      printf("ERROR: open failed.\n");
      return 1;
   }
   error = flock(fd, LOCK_EX);   /* blocks until the exclusive lock is granted */
   if (error == -1) {
      printf("ERROR: lock failed.\n");
      return 1;
   }
   while (1) {
      counter++;
      printf("writing... pid %d : %d\n", pid, counter);
      buf[0] = 0;
      p = sprintf(buf, "pid %d : %d\n", pid, counter);
      write(fd, buf, p);
      sleep(1);
   }
}
----->8-------->8-------->8-------->8---


First test (which works):
- I run my lock.c program on sam26 (not a gulm server)
- The lock is acquired on sam26
- I run my lock.c program on all other nodes
- The other nodes wait for the lock
- I unplug sam26 and wait until the gulm master (sam21) wants to fence sam26
- The gulm master (sam21) asks me to run fence_ack_manual
- The lock is not taken by any other node
- I run fence_ack_manual
- The lock held by the unplugged node (sam26) is released and taken by another node
=> So when I unplug a node which is not a gulm server, everything works correctly.
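
As a side note, to verify from another node whether the flock has really 
been released, a non-blocking probe could be used. This is only a minimal 
sketch under my assumptions (it reuses the same test file as lock.c and 
relies on GFS propagating flock() cluster-wide, which the test above 
already depends on); it is not part of my test program:

----->8-------->8-------->8-------->8---
// probe.c (sketch): try to take the flock without blocking, to see
// whether another node still holds it.
#include <errno.h>
#include <stdio.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
   int fd = open("/mnt/gfs/lock-test.tmp", O_RDWR);
   if (fd == -1) {
      printf("ERROR: open failed.\n");
      return 1;
   }
   if (flock(fd, LOCK_EX | LOCK_NB) == -1) {
      if (errno == EWOULDBLOCK)
         printf("lock is still held elsewhere\n");
      else
         printf("ERROR: flock failed.\n");
      close(fd);
      return 1;
   }
   printf("lock was free, we got it\n");
   flock(fd, LOCK_UN);
   close(fd);
   return 0;
}
----->8-------->8-------->8-------->8---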


Second test (which doesn't work):
- I run my lock.c program on sam21 (the gulm master)
- The lock is acquired on sam21
- I run my lock.c program on all other nodes
- The other nodes wait for the lock
- I unplug sam21 and wait until a new gulm master (sam22) wants to fence the 
old master (sam21)
- The new gulm master (sam22) asks me to run fence_ack_manual, BUT the lock 
is released immediately. I did not run fence_ack_manual and the lock is 
already released. This is my problem.

I read the bug reports [1][2] and the advisory RHBA-2005:466-11 [3], which 
says "Fixed a problem in which a gulm lock server ran on GFS clients after 
the master server died." But I am already running GFS-6.0.2.20-2.

[1] https://bugzilla.redhat.com/beta/show_bug.cgi?id=148029
[2] https://bugzilla.redhat.com/beta/show_bug.cgi?id=149119
[3] http://rhn.redhat.com/errata/RHBA-2005-466.html

Is this a bug, or am I misunderstanding the fencing mechanism?

The syslogs on the new gulm master (sam22) are:

----->8-------->8-------->8-------->8---
Jul  1 09:06:52 sam22 lock_gulmd_core[4195]: Failed to receive a timely 
heartbeat reply from Master. (t:1120201612489192 mb:1)
Jul  1 09:07:07 sam22 lock_gulmd_core[4195]: Failed to receive a timely 
heartbeat reply from Master. (t:1120201627509192 mb:2)
Jul  1 09:07:22 sam22 lock_gulmd_core[4195]: Failed to receive a timely 
heartbeat reply from Master. (t:1120201642529191 mb:3)
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: I see no Masters, So I am 
Arbitrating until enough Slaves talk to me.
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to 
slave sam23.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to 
slave sam28.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to 
slave sam26.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to 
slave sam25.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to 
slave sam24.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to 
slave sam22.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send quorum update to 
slave sam27.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: LastMaster 
sam21.toulouse:192.168.0.121, is being marked Expired.
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership 
update "Expired" about sam21.toulouse to slave sam23.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership 
update "Expired" about sam21.toulouse to slave sam28.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership 
update "Expired" about sam21.toulouse to slave sam26.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_LTPX[4197]: New Master at 
sam22.toulouse:192.168.0.122
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership 
update "Expired" about sam21.toulouse to slave sam25.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership 
update "Expired" about sam21.toulouse to slave sam24.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership 
update "Expired" about sam21.toulouse to slave sam22.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Could not send membership 
update "Expired" about sam21.toulouse to slave sam27.toulouse
Jul  1 09:07:37 sam22 lock_gulmd_core[4195]: Forked [4882] fence_node 
sam21.toulouse with a 0 pause.
Jul  1 09:07:37 sam22 lock_gulmd_core[4882]: Gonna exec fence_node 
sam21.toulouse
Jul  1 09:07:37 sam22 fence_node[4882]: Performing fence method, human, on 
sam21.toulouse.
Jul  1 09:07:37 sam22 fence_manual: Node 192.168.0.121 requires hard reset. 
  Run "fence_ack_manual -s 192.168.0.121" after power cycling the machine.
Jul  1 09:07:38 sam22 lock_gulmd_core[4195]: Still in Arbitrating: Have 1, 
need 3 for quorum.
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Still in Arbitrating: Have 2, 
need 3 for quorum.
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: New Client: idx:5 fd:10 from 
(192.168.0.124:sam24.toulouse)
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Member update message Logged in 
about sam24.toulouse to sam23.toulouse is lost because node is in OM
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Member update message Logged in 
about sam24.toulouse to sam28.toulouse is lost because node is in OM
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Member update message Logged in 
about sam24.toulouse to sam25.toulouse is lost because node is in OM
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Member update message Logged in 
about sam24.toulouse to sam27.toulouse is lost because node is in OM
Jul  1 09:07:39 sam22 lock_gulmd_LT000[4196]: Attached slave 
sam24.toulouse:192.168.0.124 idx:2 fd:7 (soff:3 connected:0x8)
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Still in Arbitrating: Have 2, 
need 3 for quorum.
Jul  1 09:07:39 sam22 lock_gulmd_core[4195]: Still in Arbitrating: Have 2, 
need 3 for quorum.
Jul  1 09:07:41 sam22 lock_gulmd_core[4195]: Now have Slave quorum, going 
full Master.
Jul  1 09:07:41 sam22 lock_gulmd_core[4195]: New Client: idx:6 fd:11 from 
(192.168.0.123:sam23.toulouse)
Jul  1 09:07:41 sam22 lock_gulmd_core[4195]: Member update message Logged in 
about sam23.toulouse to sam25.toulouse is lost because node is in OM
Jul  1 09:07:41 sam22 lock_gulmd_LTPX[4197]: Logged into LT000 at 
sam22.toulouse:192.168.0.122
Jul  1 09:07:41 sam22 lock_gulmd_LT000[4196]: New Client: idx 3 fd 8 from 
(192.168.0.122:sam22.toulouse)
Jul  1 09:07:41 sam22 lock_gulmd_LTPX[4197]: Finished resending to LT000
Jul  1 09:07:41 sam22 lock_gulmd_LT000[4196]: Attached slave 
sam23.toulouse:192.168.0.123 idx:4 fd:9 (soff:2 connected:0xc)
Jul  1 09:07:41 sam22 lock_gulmd_LT000[4196]: New Client: idx 5 fd 10 from 
(192.168.0.123:sam23.toulouse)
Jul  1 09:07:41 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Trying to 
acquire journal lock...
Jul  1 09:07:41 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Looking at 
journal...
Jul  1 09:07:41 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Acquiring the 
transaction lock...
Jul  1 09:07:42 sam22 lock_gulmd_LT000[4196]: New Client: idx 6 fd 11 from 
(192.168.0.124:sam24.toulouse)
Jul  1 09:07:45 sam22 lock_gulmd_core[4195]: New Client: idx:3 fd:7 from 
(192.168.0.126:sam26.toulouse)
Jul  1 09:07:45 sam22 lock_gulmd_core[4195]: Member update message Logged in 
about sam26.toulouse to sam25.toulouse is lost because node is in OM
Jul  1 09:07:45 sam22 lock_gulmd_LT000[4196]: New Client: idx 7 fd 12 from 
(192.168.0.126:sam26.toulouse)
Jul  1 09:07:46 sam22 lock_gulmd_core[4195]: New Client: idx:7 fd:12 from 
(192.168.0.128:sam28.toulouse)
Jul  1 09:07:46 sam22 lock_gulmd_core[4195]: Member update message Logged in 
about sam28.toulouse to sam25.toulouse is lost because node is in OM
Jul  1 09:07:46 sam22 lock_gulmd_LT000[4196]: New Client: idx 8 fd 13 from 
(192.168.0.128:sam28.toulouse)
Jul  1 09:07:47 sam22 lock_gulmd_core[4195]: New Client: idx:8 fd:13 from 
(192.168.0.125:sam25.toulouse)
Jul  1 09:07:47 sam22 lock_gulmd_LT000[4196]: Attached slave 
sam25.toulouse:192.168.0.125 idx:9 fd:14 (soff:1 connected:0xe)
Jul  1 09:07:47 sam22 lock_gulmd_LT000[4196]: New Client: idx 10 fd 15 from 
(192.168.0.125:sam25.toulouse)
Jul  1 09:07:49 sam22 lock_gulmd_core[4195]: New Client: idx:9 fd:14 from 
(192.168.0.127:sam27.toulouse)
Jul  1 09:07:49 sam22 lock_gulmd_LT000[4196]: New Client: idx 11 fd 16 from 
(192.168.0.127:sam27.toulouse)
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Replaying 
journal...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Replayed 0 of 
2 blocks
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: replays = 0, 
skips = 1, sames = 1
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Journal 
replayed in 9s
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=0: Done
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=7: Trying to 
acquire journal lock...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=7: Busy
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=6: Trying to 
acquire journal lock...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=6: Busy
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=5: Trying to 
acquire journal lock...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=5: Busy
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=7: Trying to 
acquire journal lock...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=7: Busy
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=6: Trying to 
acquire journal lock...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=6: Busy
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=5: Trying to 
acquire journal lock...
Jul  1 09:07:50 sam22 kernel: GFS: fsid=sam:grp_gfs.1: jid=5: Busy
----->8-------->8-------->8-------->8---

Sincerely,

Alban Crequy



