
Re: [Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow



Thanks again, Bob.

No kernel panic on any of the nodes. I had to cold boot all 3 nodes to get the cluster going (it might have been a fencing issue, but I am not 100% sure, since we use only SCSI fencing until we agree on a secondary fencing method). What is 'scary' is that the gfs_grow command paralyzed that volume on all 3 nodes, and I couldn't access it, unmount it, or run gfs_fsck from any of the nodes. We will do more testing on this. By the way, do you have a suggested "safe" method of growing and shrinking the volume other than what is noted in the 5.2 documentation (we followed the RHEL manual)? If the GFS volume hangs, what is the best way to try to unmount it from the node (would 'gfs_freeze' have helped)?

I checked all nodes with service gfs status, service clvmd status and service cman status. On node4, service clvmd status hangs while displaying the active volumes:

From node3 - service clvmd status:
clvmd (pid 8892) is running...
active volumes: LogVol00 LogVol01 gfs_sda1 gfs_sdb1

Here is the node4 response to service clvmd status:
clvmd (pid 4829) is running... (and it hangs)
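
A status command that hangs like this can be run under a timeout so a check across nodes does not block forever. This is only a minimal sketch, assuming Python 3.7 or later is available on the node; the helper name status_with_timeout and the 15-second default are made up for illustration. One caveat: if the command is stuck in uninterruptible I/O wait inside the kernel, it may not be killable, and the helper itself could still block while cleaning up.

```python
import subprocess

def status_with_timeout(cmd, timeout_s=15):
    """Run cmd (a list of arguments) and return (returncode, stdout).

    Returns None if the command does not finish within timeout_s seconds.
    The helper name and the default timeout are illustrative assumptions.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child on timeout before raising.
        return None
    return proc.returncode, proc.stdout

# Example use on a node (not executed here):
#   for name in ("cman", "clvmd", "gfs"):
#       result = status_with_timeout(["service", name, "status"])
#       print(name, "HUNG" if result is None else result)
```

Returning None for a timeout keeps "the command hung" distinct from "the command ran and reported a failure", which is the distinction that matters between node3 and node4 here.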

On node3 I didn't get any messages in dmesg, but I got this in /var/log/messages:

Sep 26 10:40:17 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: Unmount seems to be stalled. Dumping lock state...

Sep 26 10:40:17 dev03 kernel: Glock (5, 22)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (1, 2)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1 2

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (2, 24)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 5

Sep 26 10:40:17 dev03 kernel:   gl_state = 1

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = yes

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = 1

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Inode: busy

Sep 26 10:40:17 dev03 kernel: Glock (2, 22)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1 2

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = 1

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (5, 24)

Sep 26 10:40:17 dev03 kernel:   gl_flags =

Sep 26 10:40:17 dev03 kernel:   gl_count = 2

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = no

Sep 26 10:40:17 dev03 kernel:   req_bh = no

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = yes

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Holder

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 3

Sep 26 10:40:17 dev03 kernel:     gh_flags = 5 7

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 1 6 7

Sep 26 10:40:17 dev03 kernel: Glock (5, 21)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (5, 23)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (1, 1)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (2, 21)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1 2

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = 1

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (2, 25)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = 1

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (2, 23)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = 1

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel: Glock (5, 25)

Sep 26 10:40:17 dev03 kernel:   gl_flags = 1

Sep 26 10:40:17 dev03 kernel:   gl_count = 3

Sep 26 10:40:17 dev03 kernel:   gl_state = 3

Sep 26 10:40:17 dev03 kernel:   req_gh = yes

Sep 26 10:40:17 dev03 kernel:   req_bh = yes

Sep 26 10:40:17 dev03 kernel:   lvb_count = 0

Sep 26 10:40:17 dev03 kernel:   object = no

Sep 26 10:40:17 dev03 kernel:   new_le = no

Sep 26 10:40:17 dev03 kernel:   incore_le = no

Sep 26 10:40:17 dev03 kernel:   reclaim = no

Sep 26 10:40:17 dev03 kernel:   aspace = no

Sep 26 10:40:17 dev03 kernel:   ail_bufs = no

Sep 26 10:40:17 dev03 kernel:   Request

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5

Sep 26 10:40:17 dev03 kernel:   Waiter2

Sep 26 10:40:17 dev03 kernel:     owner = -1

Sep 26 10:40:17 dev03 kernel:     gh_state = 0

Sep 26 10:40:17 dev03 kernel:     gh_flags = 0

Sep 26 10:40:17 dev03 kernel:     error = 0

Sep 26 10:40:17 dev03 kernel:     gh_iflags = 2 4 5
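
The lock dump above is long, so it can help to condense it to one record per glock before comparing nodes. The following is a rough sketch, assuming Python 3 and the exact line layout shown in this message (top-level fields indented two spaces after "kernel: ", holder fields indented four). The names parse_glock_dump and summarize are invented for illustration, and the numeric gl_state values are reported as-is with no interpretation attached.

```python
from collections import Counter

# A few lines copied from the dump above, used as sample input.
SAMPLE = """\
Sep 26 10:40:17 dev03 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: Unmount seems to be stalled. Dumping lock state...
Sep 26 10:40:17 dev03 kernel: Glock (2, 24)
Sep 26 10:40:17 dev03 kernel:   gl_flags = 1
Sep 26 10:40:17 dev03 kernel:   gl_state = 1
Sep 26 10:40:17 dev03 kernel:   req_gh = yes
Sep 26 10:40:17 dev03 kernel:   Request
Sep 26 10:40:17 dev03 kernel:     owner = -1
Sep 26 10:40:17 dev03 kernel:   Inode: busy
Sep 26 10:40:17 dev03 kernel: Glock (5, 24)
Sep 26 10:40:17 dev03 kernel:   gl_flags =
Sep 26 10:40:17 dev03 kernel:   gl_state = 3
Sep 26 10:40:17 dev03 kernel:   req_gh = no
Sep 26 10:40:17 dev03 kernel:   Holder
"""

def parse_glock_dump(lines):
    """Return one dict per 'Glock (type, number)' entry in the log lines."""
    glocks = []
    for raw in lines:
        _, sep, rest = raw.rstrip().partition("kernel: ")
        if not sep or not rest.strip():
            continue  # not a kernel line, or an empty one
        indent = len(rest) - len(rest.lstrip())
        text = rest.strip()
        if indent == 0:
            if text.startswith("Glock ("):
                gtype, number = text[len("Glock ("):-1].split(",")
                glocks.append({"id": (int(gtype), int(number)),
                               "fields": {}, "sections": []})
            continue
        if not glocks or indent != 2:
            continue  # holder-level fields (deeper indent) are skipped
        key, eq, value = text.partition("=")
        if eq:
            glocks[-1]["fields"][key.strip()] = value.strip()
        else:
            # Section markers such as Request, Waiter2, Holder, "Inode: busy"
            glocks[-1]["sections"].append(text)
    return glocks

def summarize(glocks):
    """Count glocks per gl_state and list those with a pending request."""
    by_state = Counter(g["fields"].get("gl_state", "?") for g in glocks)
    pending = [g["id"] for g in glocks if g["fields"].get("req_gh") == "yes"]
    return by_state, pending

by_state, pending = summarize(parse_glock_dump(SAMPLE.splitlines()))
print("glocks per gl_state:", dict(by_state))
print("glocks with req_gh = yes:", pending)
```

Feeding the full /var/log/messages excerpt through the same functions should make it quicker to spot the entries that differ from the rest, such as Glock (2, 24) with "Inode: busy" and Glock (5, 24) with a Holder.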



----------------------------------------------------------------------------------------------------------------

Here are the dmesg lines, starting with node4; gfs_sdb1 is the "bad volume":

Sep 26 09:11:46 dev04 kernel: Joined cluster. Now mounting FS...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=0: Trying to acquire journal lock...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=0: Busy

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=1: Trying to acquire journal lock...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=1: Looking at journal...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=1: Done

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2: Trying to acquire journal lock...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2: Looking at journal...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2: Acquiring the transaction lock...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2: Replaying journal...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2: Replayed 0 of 9 blocks

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2: replays = 0, skips = 2, sames = 7

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2: Journal replayed in 1s

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=2: Done

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: Scanning for log elements...

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: Found 0 unlinked inodes

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: Found quota changes for 0 IDs

Sep 26 09:11:46 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: Done

Sep 26 09:11:46 dev04 kernel: ACPI: PCI Interrupt 0000:01:04.2[B] -> GSI 22 (level, low) -> IRQ 233

Sep 26 09:11:47 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=0: Trying to acquire journal lock...

Sep 26 09:11:47 dev04 gfs_controld[6351]: gfs_sdb1 finish: needs recovery jid 0 nodeid 1 status 1

Sep 26 09:11:47 dev04 kernel: GFS: fsid=test1_cluster:gfs_sdb1.1: jid=0: Busy


-------------------

On Fri, Sep 26, 2008 at 10:33 AM, Bob Peterson <rpeterso redhat com> wrote:
----- "Alan A" <alan zg gmail com> wrote:
| This is worse than I thought. The entire cluster is hanging upon the
| restart command issued from the Conga - lucy box. I tried bringing the
| gfs service down on node2 (lucy) with: service gfs stop (we are not
| running rgmanager), and I got:
| FATAL: Module gfs is in use.

Hi Alan,

It sounds like conga can't reboot the cluster because the GFS file
system is still mounted, or is in use.  I don't know much about conga,
so forgive my ignorance there.  You may need to unmount the gfs file
system before you reboot.  The dmesg you sent looked perfectly normal
to me.  Those are normal openais messages.  I'm more interested to
see if there were any "File system withdrawn" messages, general protection
faults, or kernel panic messages or other serious kernel errors on any
of the nodes in the cluster just around the time of the first failure.

This is just a wild guess, but I'm guessing that there was some kind
of error, like a kernel panic that occurred a while back.  That caused
the node to be fenced.  Perhaps the SCSI fencing locked up the device
somehow so none of the nodes can use it.  If that's the case, you
should be able to log in to each of the nodes, unmount the gfs file
systems that are mounted, manually, and then reboot them.
If it doesn't let you unmount them, it might be because some process
is still using the GFS file system.  For example, if you're using
NFS to export the GFS file system, you probably need to do
service nfs stop before it will let you unmount the gfs, then reboot.

So I would comb through the /var/log/messages of each node looking
for an error message regarding the node being fenced, withdrawn, panic,
SCSI errors, or any kind of serious errors that occurred around the
time where you first had the problem.
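
The log search described above could be sketched roughly as follows. This is an illustration only, assuming Python 3; the function name scan_log and the regular expressions are guesses at useful keywords (fence, withdraw, panic, general protection fault, SCSI errors), not the exact strings the kernel or the cluster daemons print, so they will likely need adjusting.

```python
import re

# Keyword patterns are illustrative assumptions, not exact message formats.
PATTERNS = {
    "fence": re.compile(r"fenc", re.I),
    "withdraw": re.compile(r"withdr", re.I),
    "panic": re.compile(r"panic", re.I),
    "gpf": re.compile(r"general protection", re.I),
    "scsi": re.compile(r"scsi.*(error|fail)", re.I),
}

def scan_log(lines):
    """Return (line_number, label, text) for each line matching a pattern.

    Only the first matching label is recorded per line.
    """
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label, line.rstrip()))
                break
    return hits

# Example use on each node (not executed here):
#   with open("/var/log/messages") as log:
#       for lineno, label, text in scan_log(log):
#           print(lineno, label, text)
```

Because syslog lines start with a timestamp, the printed hits can then be narrowed by eye to the period around the first failure.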

Regards,

Bob Peterson
Red Hat Clustering & GFS

--
Linux-cluster mailing list
Linux-cluster redhat com
https://www.redhat.com/mailman/listinfo/linux-cluster



--
Alan A.
