[Linux-cluster] Restarting GFS2 without reboot

Vladimir Melnik v.melnik at uplink.ua
Tue Nov 26 08:19:00 UTC 2013


Dear colleagues,

Your advice would be greatly appreciated.

I have another small GFS2 cluster: two nodes connected to the same
iSCSI target.

Tonight something happened, and now neither node can work with the mounted
filesystem anymore.

Processes that already had files open on the filesystem still have them open
and are working with them, but I can't open new files; I can't even list the
files on the mountpoint with "ls".
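
For what it's worth, these are the sort of checks I can run from the shell on
the stuck node (the sysfs path is the one gfs_controld reports below; the
debugfs file only exists where debugfs is mounted, so this is just a sketch):

# "1" here would mean gfs_controld has blocked the filesystem pending recovery
cat /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block

# glock state for the filesystem, if debugfs is available
head /sys/kernel/debug/gfs2/ckvm1_pod1:psv4/glocks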

Both nodes are joined:

Node  Sts   Inc   Joined               Name
   1   M    388   2013-11-26 03:43:01  ***
   2   M    360   2013-11-11 07:39:22  ***

Here's what "gfs_control dump" says:

1384148367 logging mode 3 syslog f 160 p 6 logfile p 6
/var/log/cluster/gfs_controld.log

1384148367 gfs_controld 3.0.12.1 started

1384148367 cluster node 1 added seq 364

1384148367 cluster node 2 added seq 364

1384148367 logging mode 3 syslog f 160 p 6 logfile p 6
/var/log/cluster/gfs_controld.log

1384148367 group_mode 3 compat 0

1384148367 setup_cpg_daemon 14

1384148367 gfs:controld conf 2 1 0 memb 1 2 join 2 left

1384148367 run protocol from nodeid 1

1384148367 daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1

1384148372 client connection 5 fd 16

1384148372 join: /mnt/psv4 gfs2 lock_dlm ckvm1_pod1:psv4
rw,noatime,nodiratime /dev/dm-0

1384148372 psv4 join: cluster name matches: ckvm1_pod1

1384148372 psv4 process_dlmcontrol register 0

1384148372 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 2 left

1384148372 psv4 add_change cg 1 joined nodeid 2

1384148372 psv4 add_change cg 1 we joined

1384148372 psv4 add_change cg 1 counts member 2 joined 1 remove 0 failed 0

1384148372 psv4 wait_conditions skip for zero started_count

1384148372 psv4 send_start cg 1 id_count 2 om 0 nm 2 oj 0 nj 0

1384148372 psv4 receive_start 2:1 len 104

1384148372 psv4 match_change 2:1 matches cg 1

1384148372 psv4 wait_messages cg 1 need 1 of 2

1384148372 psv4 receive_start 1:2 len 104

1384148372 psv4 match_change 1:2 matches cg 1

1384148372 psv4 wait_messages cg 1 got all 2

1384148372 psv4 pick_first_recovery_master old 1

1384148372 psv4 sync_state first_recovery_needed master 1

1384148372 psv4 create_old_nodes 1 jid 0 ro 0 spect 0 kernel_mount_done 0
error 0

1384148372 psv4 create_new_nodes 2 ro 0 spect 0

1384148372 psv4 create_new_journals 2 gets jid 1

1384148373 psv4 receive_first_recovery_done from 1 master 1
mount_client_notified 0

1384148373 psv4 start_kernel cg 1 member_count 2

1384148373 psv4 set /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block to 0

1384148373 psv4 set open /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block
error -1 2

1384148373 psv4 client_reply_join_full ci 5 result 0
hostdata=jid=1:id=2447518500:first=0

1384148373 client_reply_join psv4 ci 5 result 0

1384148373 psv4 wait_recoveries done

1384148373 uevent add gfs2 /fs/gfs2/ckvm1_pod1:psv4

1384148373 psv4 ping_kernel_mount 0

1384148373 psv4 receive_mount_done from 1 result 0

1384148373 psv4 wait_recoveries done

1384148373 uevent change gfs2 /fs/gfs2/ckvm1_pod1:psv4

1384148373 psv4 recovery_uevent jid 1 ignore

1384148373 uevent online gfs2 /fs/gfs2/ckvm1_pod1:psv4

1384148373 psv4 ping_kernel_mount 0

1384148373 mount_done: psv4 result 0

1384148373 psv4 receive_mount_done from 2 result 0

1384148373 psv4 wait_recoveries done

1385430013 cluster node 1 removed seq 368

1385430013 gfs:controld conf 1 0 1 memb 2 join left 1

1385430013 gfs:mount:psv4 conf 1 0 1 memb 2 join left 1

1385430013 psv4 add_change cg 2 remove nodeid 1 reason 3

1385430013 psv4 add_change cg 2 counts member 1 joined 0 remove 1 failed 1

1385430013 psv4 stop_kernel

1385430013 psv4 set /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block to 1

1385430013 psv4 check_dlm_notify nodeid 1 begin

1385430013 psv4 process_dlmcontrol notified nodeid 1 result -11

1385430013 psv4 check_dlm_notify result -11 will retry nodeid 1

1385430013 psv4 check_dlm_notify nodeid 1 begin

1385430013 psv4 process_dlmcontrol notified nodeid 1 result 0

1385430013 psv4 check_dlm_notify done

1385430013 psv4 send_start cg 2 id_count 2 om 1 nm 0 oj 0 nj 1

1385430013 psv4 receive_start 2:2 len 104

1385430013 psv4 match_change 2:2 matches cg 2

1385430013 psv4 wait_messages cg 2 got all 1

1385430013 psv4 sync_state first_recovery_msg

1385430013 psv4 set_failed_journals jid 0 nodeid 1

1385430013 psv4 wait_recoveries jid 0 nodeid 1 unrecovered

1385430013 psv4 start_journal_recovery jid 0

1385430013 psv4 set /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/recover to 0

1385430044 cluster node 1 added seq 372

1385430044 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 1 left

1385430044 psv4 add_change cg 3 joined nodeid 1

1385430044 psv4 add_change cg 3 counts member 2 joined 1 remove 0 failed 0

1385430044 psv4 check_dlm_notify done

1385430044 psv4 send_start cg 3 id_count 3 om 1 nm 1 oj 1 nj 0

1385430044 cpg_mcast_joined retried 1 start

1385430044 gfs:controld conf 2 1 0 memb 1 2 join 1 left

1385430044 psv4 receive_start 2:3 len 116

1385430044 psv4 match_change 2:3 matches cg 3

1385430044 psv4 wait_messages cg 3 need 1 of 2

1385430044 psv4 receive_start 1:4 len 116

1385430044 psv4 match_change 1:4 matches cg 3

1385430044 receive_start 1:4 add node with started_count 3

1385430044 psv4 wait_messages cg 3 need 1 of 2

1385430088 cluster node 1 removed seq 376

1385430088 gfs:controld conf 1 0 1 memb 2 join left 1

1385430088 gfs:mount:psv4 conf 1 0 1 memb 2 join left 1

1385430088 psv4 add_change cg 4 remove nodeid 1 reason 3

1385430088 psv4 add_change cg 4 counts member 1 joined 0 remove 1 failed 1

1385430088 psv4 check_dlm_notify nodeid 1 begin

1385430088 psv4 process_dlmcontrol notified nodeid 1 result 0

1385430088 psv4 check_dlm_notify done

1385430088 psv4 send_start cg 4 id_count 2 om 1 nm 0 oj 1 nj 0

1385430088 psv4 receive_start 2:4 len 104

1385430088 psv4 match_change 2:4 skip 3 already start

1385430088 psv4 match_change 2:4 matches cg 4

1385430088 psv4 wait_messages cg 4 got all 1

1385430088 psv4 sync_state first_recovery_msg

1385430088 psv4 set_failed_journals no journal for nodeid 1

1385430088 psv4 wait_recoveries jid 0 nodeid 1 unrecovered

1385430092 cluster node 1 added seq 380

1385430092 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 1 left

1385430092 psv4 add_change cg 5 joined nodeid 1

1385430092 psv4 add_change cg 5 counts member 2 joined 1 remove 0 failed 0

1385430092 psv4 check_dlm_notify done

1385430092 psv4 send_start cg 5 id_count 3 om 1 nm 1 oj 1 nj 0

1385430092 cpg_mcast_joined retried 1 start

1385430092 gfs:controld conf 2 1 0 memb 1 2 join 1 left

1385430092 psv4 receive_start 2:5 len 116

1385430092 psv4 match_change 2:5 matches cg 5

1385430092 psv4 wait_messages cg 5 need 1 of 2

1385430092 psv4 receive_start 1:6 len 116

1385430092 psv4 match_change 1:6 matches cg 5

1385430092 receive_start 1:6 add node with started_count 4

1385430092 psv4 wait_messages cg 5 need 1 of 2

1385430143 cluster node 1 removed seq 384

1385430143 gfs:mount:psv4 conf 1 0 1 memb 2 join left 1

1385430143 psv4 add_change cg 6 remove nodeid 1 reason 3

1385430143 psv4 add_change cg 6 counts member 1 joined 0 remove 1 failed 1

1385430143 psv4 check_dlm_notify nodeid 1 begin

1385430143 gfs:controld conf 1 0 1 memb 2 join left 1

1385430143 psv4 process_dlmcontrol notified nodeid 1 result 0

1385430143 psv4 check_dlm_notify done

1385430143 psv4 send_start cg 6 id_count 2 om 1 nm 0 oj 1 nj 0

1385430143 psv4 receive_start 2:6 len 104

1385430143 psv4 match_change 2:6 skip 5 already start

1385430143 psv4 match_change 2:6 matches cg 6

1385430143 psv4 wait_messages cg 6 got all 1

1385430143 psv4 sync_state first_recovery_msg

1385430143 psv4 set_failed_journals no journal for nodeid 1

1385430143 psv4 wait_recoveries jid 0 nodeid 1 unrecovered

1385430181 cluster node 1 added seq 388

1385430181 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 1 left

1385430181 psv4 add_change cg 7 joined nodeid 1

1385430181 psv4 add_change cg 7 counts member 2 joined 1 remove 0 failed 0

1385430181 psv4 check_dlm_notify done

1385430181 psv4 send_start cg 7 id_count 3 om 1 nm 1 oj 1 nj 0

1385430181 cpg_mcast_joined retried 1 start

1385430181 gfs:controld conf 2 1 0 memb 1 2 join 1 left

1385430181 psv4 receive_start 2:7 len 116

1385430181 psv4 match_change 2:7 matches cg 7

1385430181 psv4 wait_messages cg 7 need 1 of 2

1385430181 psv4 receive_start 1:8 len 116

1385430181 psv4 match_change 1:8 matches cg 7

1385430181 receive_start 1:8 add node with started_count 5

1385430181 psv4 wait_messages cg 7 need 1 of 2


I can't reboot the nodes, as they're pretty busy, but of course I'd like to
get that GFS2 filesystem working again.
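
If I read the "gfs_control dump" output right, jid 0 (node 1's journal) is
still marked unrecovered and the filesystem is still blocked, so these are the
usual cluster tools I've been looking at so far; please take it as a sketch of
what I'm poking at, not as something I'm sure is the right lever:

# cluster membership and group state
cman_tool status
cman_tool nodes
group_tool ls

# per-mount state for psv4 as the control daemons see it
gfs_control ls
dlm_tool ls

# is this node still waiting to fence node 1?
fence_tool ls

# recovery-related flags for the mount (gfs_controld set "block" to 1 above)
ls /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/
cat /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block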

Here's what I got in the log file when it happened:

Nov 26 03:40:11 host2 corosync[2596]:   [TOTEM ] A processor failed, forming
new configuration.

Nov 26 03:40:12 host2 kernel: connection1:0: ping timeout of 5 secs expired,
recv timeout 5, last rx 5576348348, last ping 5576353348, now 5576358348

Nov 26 03:40:12 host2 kernel: connection1:0: detected conn error (1011)

Nov 26 03:40:13 host2 iscsid: Kernel reported iSCSI connection 1:0 error
(1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)

Nov 26 03:40:13 host2 corosync[2596]:   [CMAN  ] quorum lost, blocking
activity

Nov 26 03:40:13 host2 corosync[2596]:   [QUORUM] This node is within the
non-primary component and will NOT provide any services.

Nov 26 03:40:13 host2 corosync[2596]:   [QUORUM] Members[1]: 2

Nov 26 03:40:13 host2 corosync[2596]:   [TOTEM ] A processor joined or left
the membership and a new membership was formed.

Nov 26 03:40:13 host2 corosync[2596]:   [CPG   ] chosen downlist: sender
r(0) ip(192.168.1.2) ; members(old:2 left:1)

Nov 26 03:40:13 host2 corosync[2596]:   [MAIN  ] Completed service
synchronization, ready to provide service.

Nov 26 03:40:13 host2 kernel: dlm: closing connection to node 1

Nov 26 03:40:13 host2 kernel: GFS2: fsid=ckvm1_pod1:psv4.1: jid=0: Trying to
acquire journal lock...

Nov 26 03:40:44 host2 iscsid: connection1:0 is operational after recovery (3
attempts)

Nov 26 03:40:44 host2 corosync[2596]:   [TOTEM ] A processor joined or left
the membership and a new membership was formed.

Nov 26 03:40:44 host2 corosync[2596]:   [CMAN  ] quorum regained, resuming
activity

Nov 26 03:40:44 host2 corosync[2596]:   [QUORUM] This node is within the
primary component and will provide service.

Nov 26 03:40:44 host2 corosync[2596]:   [QUORUM] Members[2]: 1 2

Nov 26 03:40:44 host2 corosync[2596]:   [QUORUM] Members[2]: 1 2

Nov 26 03:40:44 host2 corosync[2596]:   [CPG   ] chosen downlist: sender
r(0) ip(192.168.1.1) ; members(old:1 left:0)

Nov 26 03:40:44 host2 corosync[2596]:   [MAIN  ] Completed service
synchronization, ready to provide service.

Nov 26 03:40:44 host2 gfs_controld[2727]: receive_start 1:4 add node with
started_count 3

Nov 26 03:40:44 host2 fenced[2652]: receive_start 1:4 add node with
started_count 2

Nov 26 03:41:26 host2 corosync[2596]:   [TOTEM ] A processor failed, forming
new configuration.

Nov 26 03:41:28 host2 corosync[2596]:   [CMAN  ] quorum lost, blocking
activity

Nov 26 03:41:28 host2 corosync[2596]:   [QUORUM] This node is within the
non-primary component and will NOT provide any services.

Nov 26 03:41:28 host2 corosync[2596]:   [QUORUM] Members[1]: 2

Nov 26 03:41:28 host2 corosync[2596]:   [TOTEM ] A processor joined or left
the membership and a new membership was formed.

Nov 26 03:41:28 host2 corosync[2596]:   [CPG   ] chosen downlist: sender
r(0) ip(192.168.1.2) ; members(old:2 left:1)

Nov 26 03:41:28 host2 corosync[2596]:   [MAIN  ] Completed service
synchronization, ready to provide service.

Nov 26 03:41:28 host2 kernel: dlm: closing connection to node 1

Nov 26 03:41:29 host2 kernel: connection1:0: ping timeout of 5 secs expired,
recv timeout 5, last rx 5576425428, last ping 5576430428, now 5576435428

Nov 26 03:41:29 host2 kernel: connection1:0: detected conn error (1011)

Nov 26 03:41:30 host2 iscsid: Kernel reported iSCSI connection 1:0 error
(1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)

Nov 26 03:41:32 host2 corosync[2596]:   [TOTEM ] A processor joined or left
the membership and a new membership was formed.

Nov 26 03:41:32 host2 corosync[2596]:   [CMAN  ] quorum regained, resuming
activity

Nov 26 03:41:32 host2 corosync[2596]:   [QUORUM] This node is within the
primary component and will provide service.

Nov 26 03:41:32 host2 corosync[2596]:   [QUORUM] Members[2]: 1 2

Nov 26 03:41:32 host2 corosync[2596]:   [QUORUM] Members[2]: 1 2

Nov 26 03:41:32 host2 corosync[2596]:   [CPG   ] chosen downlist: sender
r(0) ip(192.168.1.1) ; members(old:1 left:0)

Nov 26 03:41:32 host2 corosync[2596]:   [MAIN  ] Completed service
synchronization, ready to provide service.

Nov 26 03:41:32 host2 fenced[2652]: receive_start 1:6 add node with
started_count 2

Nov 26 03:41:32 host2 gfs_controld[2727]: receive_start 1:6 add node with
started_count 4

Nov 26 03:41:37 host2 iscsid: connection1:0 is operational after recovery (1
attempts)

Nov 26 03:42:19 host2 kernel: connection1:0: ping timeout of 5 secs expired,
recv timeout 5, last rx 5576475399, last ping 5576480399, now 5576485399

Nov 26 03:42:19 host2 kernel: connection1:0: detected conn error (1011)

Nov 26 03:42:20 host2 iscsid: Kernel reported iSCSI connection 1:0 error
(1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)

Nov 26 03:42:21 host2 corosync[2596]:   [TOTEM ] A processor failed, forming
new configuration.

Nov 26 03:42:23 host2 corosync[2596]:   [CMAN  ] quorum lost, blocking
activity

Nov 26 03:42:23 host2 corosync[2596]:   [QUORUM] This node is within the
non-primary component and will NOT provide any services. Nov 26 03:42:23
host2 corosync[2596]:   [QUORUM] Members[1]: 2

Nov 26 03:42:23 host2 corosync[2596]:   [TOTEM ] A processor joined or left
the membership and a new membership was formed.

Nov 26 03:42:23 host2 corosync[2596]:   [CPG   ] chosen downlist: sender
r(0) ip(192.168.1.2) ; members(old:2 left:1)

Nov 26 03:42:23 host2 corosync[2596]:   [MAIN  ] Completed service
synchronization, ready to provide service.

Nov 26 03:42:23 host2 kernel: dlm: closing connection to node 1

Nov 26 03:42:41 host2 kernel: INFO: task kslowd001:2942 blocked for more
than 120 seconds.

Nov 26 03:42:41 host2 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.

Nov 26 03:42:41 host2 kernel: kslowd001     D 000000000000000b     0  2942
2 0x00000080

Nov 26 03:42:41 host2 kernel: ffff88086b29d958 0000000000000046
0000000000000102 0000005000000002

Nov 26 03:42:41 host2 kernel: fffffffffffffffc 000000000000010e
0000003f00000002 fffffffffffffffc

Nov 26 03:42:41 host2 kernel: ffff88086b29bab8 ffff88086b29dfd8
000000000000fb88 ffff88086b29bab8

Nov 26 03:42:41 host2 kernel: Call Trace:

Nov 26 03:42:41 host2 kernel: [<ffffffff814ffec5>]
rwsem_down_failed_common+0x95/0x1d0

Nov 26 03:42:41 host2 kernel: [<ffffffff81500056>]
rwsem_down_read_failed+0x26/0x30

Nov 26 03:42:41 host2 kernel: [<ffffffff8127e634>]
call_rwsem_down_read_failed+0x14/0x30

Nov 26 03:42:41 host2 kernel: [<ffffffff814ff554>] ? down_read+0x24/0x30

Nov 26 03:42:41 host2 kernel: [<ffffffffa06046d2>] dlm_lock+0x62/0x1e0 [dlm]

Nov 26 03:42:41 host2 kernel: [<ffffffff8127cd04>] ? vsnprintf+0x484/0x5f0

Nov 26 03:42:41 host2 kernel: [<ffffffffa06564e1>] gdlm_lock+0xf1/0x130
[gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffffa06565f0>] ? gdlm_ast+0x0/0xe0
[gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffffa0656520>] ? gdlm_bast+0x0/0x50
[gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffffa063a385>] do_xmote+0x1a5/0x280
[gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffff8127cf14>] ? snprintf+0x34/0x40

Nov 26 03:42:41 host2 kernel: [<ffffffffa063a551>] run_queue+0xf1/0x1d0
[gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffffa063a8de>] gfs2_glock_nq+0x21e/0x3d0
[gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffffa063ac71>]
gfs2_glock_nq_num+0x61/0xa0 [gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffffa064eca3>]
gfs2_recover_work+0x93/0x7b0 [gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffff8105b483>] ?
perf_event_task_sched_out+0x33/0x80

Nov 26 03:42:41 host2 kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320

Nov 26 03:42:41 host2 kernel: [<ffffffffa063ac69>] ?
gfs2_glock_nq_num+0x59/0xa0 [gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffff8106335b>] ?
enqueue_task_fair+0xfb/0x100

Nov 26 03:42:41 host2 kernel: [<ffffffff81108093>]
slow_work_execute+0x233/0x310

Nov 26 03:42:41 host2 kernel: [<ffffffff811082c7>]
slow_work_thread+0x157/0x360

Nov 26 03:42:41 host2 kernel: [<ffffffff810920d0>] ?
autoremove_wake_function+0x0/0x40

Nov 26 03:42:41 host2 kernel: [<ffffffff81108170>] ?
slow_work_thread+0x0/0x360

Nov 26 03:42:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0

Nov 26 03:42:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20

Nov 26 03:42:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0

Nov 26 03:42:41 host2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20

Nov 26 03:42:41 host2 kernel: INFO: task gfs2_quotad:2950 blocked for more
than 120 seconds.

Nov 26 03:42:41 host2 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.

Nov 26 03:42:41 host2 kernel: gfs2_quotad   D 0000000000000001     0  2950
2 0x00000080

Nov 26 03:42:41 host2 kernel: ffff88086afdfc20 0000000000000046
0000000000000000 ffffffffa0605f4d

Nov 26 03:42:41 host2 kernel: 0000000000000000 ffff88106c505800
ffff88086afdfc50 ffffffffa0604708

Nov 26 03:42:41 host2 kernel: ffff88086afddaf8 ffff88086afdffd8
000000000000fb88 ffff88086afddaf8

Nov 26 03:42:41 host2 kernel: Call Trace:

Nov 26 03:42:41 host2 kernel: [<ffffffffa0605f4d>] ?
dlm_put_lockspace+0x1d/0x40 [dlm]

Nov 26 03:42:41 host2 kernel: [<ffffffffa0604708>] ? dlm_lock+0x98/0x1e0
[dlm]

Nov 26 03:42:41 host2 kernel: [<ffffffffa0637570>] ?
gfs2_glock_holder_wait+0x0/0x20 [gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffffa063757e>]
gfs2_glock_holder_wait+0xe/0x20 [gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90

Nov 26 03:42:41 host2 kernel: [<ffffffffa0637570>] ?
gfs2_glock_holder_wait+0x0/0x20 [gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffff814feb58>]
out_of_line_wait_on_bit+0x78/0x90

Nov 26 03:42:41 host2 kernel: [<ffffffff81092110>] ?
wake_bit_function+0x0/0x50

Nov 26 03:42:41 host2 kernel: [<ffffffffa06394f5>] gfs2_glock_wait+0x45/0x90
[gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffffa063a8f7>] gfs2_glock_nq+0x237/0x3d0
[gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffff8107eabb>] ?
try_to_del_timer_sync+0x7b/0xe0

Nov 26 03:42:41 host2 kernel: [<ffffffffa0653658>]
gfs2_statfs_sync+0x58/0x1b0 [gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffff814fe75a>] ?
schedule_timeout+0x19a/0x2e0

Nov 26 03:42:41 host2 kernel: [<ffffffffa0653650>] ?
gfs2_statfs_sync+0x50/0x1b0 [gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffffa064b9d7>]
quotad_check_timeo+0x57/0xb0 [gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffffa064bc64>] gfs2_quotad+0x234/0x2b0
[gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffff810920d0>] ?
autoremove_wake_function+0x0/0x40

Nov 26 03:42:41 host2 kernel: [<ffffffffa064ba30>] ? gfs2_quotad+0x0/0x2b0
[gfs2]

Nov 26 03:42:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0

Nov 26 03:42:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20

Nov 26 03:42:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0

Nov 26 03:42:41 host2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20

Nov 26 03:42:54 host2 iscsid: connect to 192.168.1.161:3260 failed (No route
to host)

Nov 26 03:43:00 host2 iscsid: connect to 192.168.1.161:3260 failed (No route
to host)

Nov 26 03:43:01 host2 corosync[2596]:   [TOTEM ] A processor joined or left
the membership and a new membership was formed.

Nov 26 03:43:01 host2 corosync[2596]:   [CMAN  ] quorum regained, resuming
activity

Nov 26 03:43:01 host2 corosync[2596]:   [QUORUM] This node is within the
primary component and will provide service.

Nov 26 03:43:01 host2 corosync[2596]:   [QUORUM] Members[2]: 1 2

Nov 26 03:43:01 host2 corosync[2596]:   [QUORUM] Members[2]: 1 2

Nov 26 03:43:01 host2 corosync[2596]:   [CPG   ] chosen downlist: sender
r(0) ip(192.168.1.1) ; members(old:1 left:0)

Nov 26 03:43:01 host2 corosync[2596]:   [MAIN  ] Completed service
synchronization, ready to provide service.

Nov 26 03:43:01 host2 gfs_controld[2727]: receive_start 1:8 add node with
started_count 5

Nov 26 03:43:01 host2 fenced[2652]: receive_start 1:8 add node with
started_count 2

Nov 26 03:43:03 host2 iscsid: connection1:0 is operational after recovery (5
attempts)

Nov 26 03:44:41 host2 kernel: INFO: task kslowd001:2942 blocked for more
than 120 seconds.

Nov 26 03:44:41 host2 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.

Nov 26 03:44:41 host2 kernel: kslowd001     D 000000000000000b     0  2942
2 0x00000080

Nov 26 03:44:41 host2 kernel: ffff88086b29d958 0000000000000046
0000000000000102 0000005000000002

Nov 26 03:44:41 host2 kernel: fffffffffffffffc 000000000000010e
0000003f00000002 fffffffffffffffc

Nov 26 03:44:41 host2 kernel: ffff88086b29bab8 ffff88086b29dfd8
000000000000fb88 ffff88086b29bab8

Nov 26 03:44:41 host2 kernel: Call Trace:

Nov 26 03:44:41 host2 kernel: [<ffffffff814ffec5>]
rwsem_down_failed_common+0x95/0x1d0

Nov 26 03:44:41 host2 kernel: [<ffffffff81500056>]
rwsem_down_read_failed+0x26/0x30

Nov 26 03:44:41 host2 kernel: [<ffffffff8127e634>]
call_rwsem_down_read_failed+0x14/0x30

Nov 26 03:44:41 host2 kernel: [<ffffffff814ff554>] ? down_read+0x24/0x30

Nov 26 03:44:41 host2 kernel: [<ffffffffa06046d2>] dlm_lock+0x62/0x1e0 [dlm]

Nov 26 03:44:41 host2 kernel: [<ffffffff8127cd04>] ? vsnprintf+0x484/0x5f0

Nov 26 03:44:41 host2 kernel: [<ffffffffa06564e1>] gdlm_lock+0xf1/0x130
[gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffffa06565f0>] ? gdlm_ast+0x0/0xe0
[gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffffa0656520>] ? gdlm_bast+0x0/0x50
[gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffffa063a385>] do_xmote+0x1a5/0x280
[gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffff8127cf14>] ? snprintf+0x34/0x40

Nov 26 03:44:41 host2 kernel: [<ffffffffa063a551>] run_queue+0xf1/0x1d0
[gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffffa063a8de>] gfs2_glock_nq+0x21e/0x3d0
[gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffffa063ac71>]
gfs2_glock_nq_num+0x61/0xa0 [gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffffa064eca3>]
gfs2_recover_work+0x93/0x7b0 [gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffff8105b483>] ?
perf_event_task_sched_out+0x33/0x80

Nov 26 03:44:41 host2 kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320

Nov 26 03:44:41 host2 kernel: [<ffffffffa063ac69>] ?
gfs2_glock_nq_num+0x59/0xa0 [gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffff8106335b>] ?
enqueue_task_fair+0xfb/0x100

Nov 26 03:44:41 host2 kernel: [<ffffffff81108093>]
slow_work_execute+0x233/0x310

Nov 26 03:44:41 host2 kernel: [<ffffffff811082c7>]
slow_work_thread+0x157/0x360

Nov 26 03:44:41 host2 kernel: [<ffffffff810920d0>] ?
autoremove_wake_function+0x0/0x40

Nov 26 03:44:41 host2 kernel: [<ffffffff81108170>] ?
slow_work_thread+0x0/0x360

Nov 26 03:44:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0

Nov 26 03:44:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20

Nov 26 03:44:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0

Nov 26 03:44:41 host2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20

Nov 26 03:44:41 host2 kernel: INFO: task gfs2_quotad:2950 blocked for more
than 120 seconds.

Nov 26 03:44:41 host2 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.

Nov 26 03:44:41 host2 kernel: gfs2_quotad   D 0000000000000001     0  2950
2 0x00000080

Nov 26 03:44:41 host2 kernel: ffff88086afdfc20 0000000000000046
0000000000000000 ffffffffa0605f4d

Nov 26 03:44:41 host2 kernel: 0000000000000000 ffff88106c505800
ffff88086afdfc50 ffffffffa0604708

Nov 26 03:44:41 host2 kernel: ffff88086afddaf8 ffff88086afdffd8
000000000000fb88 ffff88086afddaf8

Nov 26 03:44:41 host2 kernel: Call Trace:

Nov 26 03:44:41 host2 kernel: [<ffffffffa0605f4d>] ?
dlm_put_lockspace+0x1d/0x40 [dlm]

Nov 26 03:44:41 host2 kernel: [<ffffffffa0604708>] ? dlm_lock+0x98/0x1e0
[dlm]

Nov 26 03:44:41 host2 kernel: [<ffffffffa0637570>] ?
gfs2_glock_holder_wait+0x0/0x20 [gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffffa063757e>]
gfs2_glock_holder_wait+0xe/0x20 [gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90

Nov 26 03:44:41 host2 kernel: [<ffffffffa0637570>] ?
gfs2_glock_holder_wait+0x0/0x20 [gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffff814feb58>]
out_of_line_wait_on_bit+0x78/0x90

Nov 26 03:44:41 host2 kernel: [<ffffffff81092110>] ?
wake_bit_function+0x0/0x50

Nov 26 03:44:41 host2 kernel: [<ffffffffa06394f5>] gfs2_glock_wait+0x45/0x90
[gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffffa063a8f7>] gfs2_glock_nq+0x237/0x3d0
[gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffff8107eabb>] ?
try_to_del_timer_sync+0x7b/0xe0

Nov 26 03:44:41 host2 kernel: [<ffffffffa0653658>]
gfs2_statfs_sync+0x58/0x1b0 [gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffff814fe75a>] ?
schedule_timeout+0x19a/0x2e0

Nov 26 03:44:41 host2 kernel: [<ffffffffa0653650>] ?
gfs2_statfs_sync+0x50/0x1b0 [gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffffa064b9d7>]
quotad_check_timeo+0x57/0xb0 [gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffffa064bc64>] gfs2_quotad+0x234/0x2b0
[gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffff810920d0>] ?
autoremove_wake_function+0x0/0x40

Nov 26 03:44:41 host2 kernel: [<ffffffffa064ba30>] ? gfs2_quotad+0x0/0x2b0
[gfs2]

Nov 26 03:44:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0

Nov 26 03:44:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20

Nov 26 03:44:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0

Nov 26 03:44:41 host2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20

What would you do in this situation? Is it possible to restart GFS2 without
rebooting the nodes?
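
In case unmounting is the only option short of a full reboot, the sequence I
have in mind is something like this (mount options taken from the gfs_control
dump above), though I suspect the umount itself will hang while jid 0 stays
unrecovered:

# see what still has files open on the mount
fuser -vm /mnt/psv4

# then, on this node only, try to unmount and mount again
umount /mnt/psv4
mount -t gfs2 -o rw,noatime,nodiratime /dev/dm-0 /mnt/psv4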

Thank you very much for any help.

-- 

V.Melnik
