[Linux-cluster] Restarting GFS2 without reboot
Vladimir Melnik
v.melnik at uplink.ua
Tue Nov 26 08:19:00 UTC 2013
Dear colleagues,

Your advice will be greatly appreciated.
I have another small GFS2 cluster: two nodes connected to the same
iSCSI target.
Tonight something happened, and now neither node can work with the mounted
filesystem anymore.
Processes that already had files open on the filesystem still hold them open
and keep working with them, but I can't open new files; I can't even list
the files on the mountpoint with "ls".
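A quick way to see the symptom (a sketch, nothing cluster-specific assumed): the stuck tasks sit in uninterruptible "D" state, which is exactly what an unresponsive GFS2 mount produces.

```shell
# List tasks in uninterruptible sleep (D state); on a wedged GFS2 mount
# this typically shows gfs2_quotad, kslowd, and anything touching the mount.
ps -eo pid,stat,wchan:30,comm | awk 'NR==1 || $2 ~ /^D/'
```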
Both nodes are joined:
Node Sts Inc Joined Name
1 M 388 2013-11-26 03:43:01 ***
2 M 360 2013-11-11 07:39:22 ***
Here is what "gfs_control dump" says:
1384148367 logging mode 3 syslog f 160 p 6 logfile p 6
/var/log/cluster/gfs_controld.log
1384148367 gfs_controld 3.0.12.1 started
1384148367 cluster node 1 added seq 364
1384148367 cluster node 2 added seq 364
1384148367 logging mode 3 syslog f 160 p 6 logfile p 6
/var/log/cluster/gfs_controld.log
1384148367 group_mode 3 compat 0
1384148367 setup_cpg_daemon 14
1384148367 gfs:controld conf 2 1 0 memb 1 2 join 2 left
1384148367 run protocol from nodeid 1
1384148367 daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
1384148372 client connection 5 fd 16
1384148372 join: /mnt/psv4 gfs2 lock_dlm ckvm1_pod1:psv4
rw,noatime,nodiratime /dev/dm-0
1384148372 psv4 join: cluster name matches: ckvm1_pod1
1384148372 psv4 process_dlmcontrol register 0
1384148372 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 2 left
1384148372 psv4 add_change cg 1 joined nodeid 2
1384148372 psv4 add_change cg 1 we joined
1384148372 psv4 add_change cg 1 counts member 2 joined 1 remove 0 failed 0
1384148372 psv4 wait_conditions skip for zero started_count
1384148372 psv4 send_start cg 1 id_count 2 om 0 nm 2 oj 0 nj 0
1384148372 psv4 receive_start 2:1 len 104
1384148372 psv4 match_change 2:1 matches cg 1
1384148372 psv4 wait_messages cg 1 need 1 of 2
1384148372 psv4 receive_start 1:2 len 104
1384148372 psv4 match_change 1:2 matches cg 1
1384148372 psv4 wait_messages cg 1 got all 2
1384148372 psv4 pick_first_recovery_master old 1
1384148372 psv4 sync_state first_recovery_needed master 1
1384148372 psv4 create_old_nodes 1 jid 0 ro 0 spect 0 kernel_mount_done 0
error 0
1384148372 psv4 create_new_nodes 2 ro 0 spect 0
1384148372 psv4 create_new_journals 2 gets jid 1
1384148373 psv4 receive_first_recovery_done from 1 master 1
mount_client_notified 0
1384148373 psv4 start_kernel cg 1 member_count 2
1384148373 psv4 set /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block to 0
1384148373 psv4 set open /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block
error -1 2
1384148373 psv4 client_reply_join_full ci 5 result 0
hostdata=jid=1:id=2447518500:first=0
1384148373 client_reply_join psv4 ci 5 result 0
1384148373 psv4 wait_recoveries done
1384148373 uevent add gfs2 /fs/gfs2/ckvm1_pod1:psv4
1384148373 psv4 ping_kernel_mount 0
1384148373 psv4 receive_mount_done from 1 result 0
1384148373 psv4 wait_recoveries done
1384148373 uevent change gfs2 /fs/gfs2/ckvm1_pod1:psv4
1384148373 psv4 recovery_uevent jid 1 ignore
1384148373 uevent online gfs2 /fs/gfs2/ckvm1_pod1:psv4
1384148373 psv4 ping_kernel_mount 0
1384148373 mount_done: psv4 result 0
1384148373 psv4 receive_mount_done from 2 result 0
1384148373 psv4 wait_recoveries done
1385430013 cluster node 1 removed seq 368
1385430013 gfs:controld conf 1 0 1 memb 2 join left 1
1385430013 gfs:mount:psv4 conf 1 0 1 memb 2 join left 1
1385430013 psv4 add_change cg 2 remove nodeid 1 reason 3
1385430013 psv4 add_change cg 2 counts member 1 joined 0 remove 1 failed 1
1385430013 psv4 stop_kernel
1385430013 psv4 set /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block to 1
1385430013 psv4 check_dlm_notify nodeid 1 begin
1385430013 psv4 process_dlmcontrol notified nodeid 1 result -11
1385430013 psv4 check_dlm_notify result -11 will retry nodeid 1
1385430013 psv4 check_dlm_notify nodeid 1 begin
1385430013 psv4 process_dlmcontrol notified nodeid 1 result 0
1385430013 psv4 check_dlm_notify done
1385430013 psv4 send_start cg 2 id_count 2 om 1 nm 0 oj 0 nj 1
1385430013 psv4 receive_start 2:2 len 104
1385430013 psv4 match_change 2:2 matches cg 2
1385430013 psv4 wait_messages cg 2 got all 1
1385430013 psv4 sync_state first_recovery_msg
1385430013 psv4 set_failed_journals jid 0 nodeid 1
1385430013 psv4 wait_recoveries jid 0 nodeid 1 unrecovered
1385430013 psv4 start_journal_recovery jid 0
1385430013 psv4 set /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/recover to 0
1385430044 cluster node 1 added seq 372
1385430044 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 1 left
1385430044 psv4 add_change cg 3 joined nodeid 1
1385430044 psv4 add_change cg 3 counts member 2 joined 1 remove 0 failed 0
1385430044 psv4 check_dlm_notify done
1385430044 psv4 send_start cg 3 id_count 3 om 1 nm 1 oj 1 nj 0
1385430044 cpg_mcast_joined retried 1 start
1385430044 gfs:controld conf 2 1 0 memb 1 2 join 1 left
1385430044 psv4 receive_start 2:3 len 116
1385430044 psv4 match_change 2:3 matches cg 3
1385430044 psv4 wait_messages cg 3 need 1 of 2
1385430044 psv4 receive_start 1:4 len 116
1385430044 psv4 match_change 1:4 matches cg 3
1385430044 receive_start 1:4 add node with started_count 3
1385430044 psv4 wait_messages cg 3 need 1 of 2
1385430088 cluster node 1 removed seq 376
1385430088 gfs:controld conf 1 0 1 memb 2 join left 1
1385430088 gfs:mount:psv4 conf 1 0 1 memb 2 join left 1
1385430088 psv4 add_change cg 4 remove nodeid 1 reason 3
1385430088 psv4 add_change cg 4 counts member 1 joined 0 remove 1 failed 1
1385430088 psv4 check_dlm_notify nodeid 1 begin
1385430088 psv4 process_dlmcontrol notified nodeid 1 result 0
1385430088 psv4 check_dlm_notify done
1385430088 psv4 send_start cg 4 id_count 2 om 1 nm 0 oj 1 nj 0
1385430088 psv4 receive_start 2:4 len 104
1385430088 psv4 match_change 2:4 skip 3 already start
1385430088 psv4 match_change 2:4 matches cg 4
1385430088 psv4 wait_messages cg 4 got all 1
1385430088 psv4 sync_state first_recovery_msg
1385430088 psv4 set_failed_journals no journal for nodeid 1
1385430088 psv4 wait_recoveries jid 0 nodeid 1 unrecovered
1385430092 cluster node 1 added seq 380
1385430092 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 1 left
1385430092 psv4 add_change cg 5 joined nodeid 1
1385430092 psv4 add_change cg 5 counts member 2 joined 1 remove 0 failed 0
1385430092 psv4 check_dlm_notify done
1385430092 psv4 send_start cg 5 id_count 3 om 1 nm 1 oj 1 nj 0
1385430092 cpg_mcast_joined retried 1 start
1385430092 gfs:controld conf 2 1 0 memb 1 2 join 1 left
1385430092 psv4 receive_start 2:5 len 116
1385430092 psv4 match_change 2:5 matches cg 5
1385430092 psv4 wait_messages cg 5 need 1 of 2
1385430092 psv4 receive_start 1:6 len 116
1385430092 psv4 match_change 1:6 matches cg 5
1385430092 receive_start 1:6 add node with started_count 4
1385430092 psv4 wait_messages cg 5 need 1 of 2
1385430143 cluster node 1 removed seq 384
1385430143 gfs:mount:psv4 conf 1 0 1 memb 2 join left 1
1385430143 psv4 add_change cg 6 remove nodeid 1 reason 3
1385430143 psv4 add_change cg 6 counts member 1 joined 0 remove 1 failed 1
1385430143 psv4 check_dlm_notify nodeid 1 begin
1385430143 gfs:controld conf 1 0 1 memb 2 join left 1
1385430143 psv4 process_dlmcontrol notified nodeid 1 result 0
1385430143 psv4 check_dlm_notify done
1385430143 psv4 send_start cg 6 id_count 2 om 1 nm 0 oj 1 nj 0
1385430143 psv4 receive_start 2:6 len 104
1385430143 psv4 match_change 2:6 skip 5 already start
1385430143 psv4 match_change 2:6 matches cg 6
1385430143 psv4 wait_messages cg 6 got all 1
1385430143 psv4 sync_state first_recovery_msg
1385430143 psv4 set_failed_journals no journal for nodeid 1
1385430143 psv4 wait_recoveries jid 0 nodeid 1 unrecovered
1385430181 cluster node 1 added seq 388
1385430181 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 1 left
1385430181 psv4 add_change cg 7 joined nodeid 1
1385430181 psv4 add_change cg 7 counts member 2 joined 1 remove 0 failed 0
1385430181 psv4 check_dlm_notify done
1385430181 psv4 send_start cg 7 id_count 3 om 1 nm 1 oj 1 nj 0
1385430181 cpg_mcast_joined retried 1 start
1385430181 gfs:controld conf 2 1 0 memb 1 2 join 1 left
1385430181 psv4 receive_start 2:7 len 116
1385430181 psv4 match_change 2:7 matches cg 7
1385430181 psv4 wait_messages cg 7 need 1 of 2
1385430181 psv4 receive_start 1:8 len 116
1385430181 psv4 match_change 1:8 matches cg 7
1385430181 receive_start 1:8 add node with started_count 5
1385430181 psv4 wait_messages cg 7 need 1 of 2
I can't reboot the nodes, as they're pretty busy, but of course I'd like to
get that GFS2 filesystem working again.
Here is what I got in the log file when it happened:
Nov 26 03:40:11 host2 corosync[2596]: [TOTEM ] A processor failed, forming
new configuration.
Nov 26 03:40:12 host2 kernel: connection1:0: ping timeout of 5 secs expired,
recv timeout 5, last rx 5576348348, last ping 5576353348, now 5576358348
Nov 26 03:40:12 host2 kernel: connection1:0: detected conn error (1011)
Nov 26 03:40:13 host2 iscsid: Kernel reported iSCSI connection 1:0 error
(1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
Nov 26 03:40:13 host2 corosync[2596]: [CMAN ] quorum lost, blocking
activity
Nov 26 03:40:13 host2 corosync[2596]: [QUORUM] This node is within the
non-primary component and will NOT provide any services.
Nov 26 03:40:13 host2 corosync[2596]: [QUORUM] Members[1]: 2
Nov 26 03:40:13 host2 corosync[2596]: [TOTEM ] A processor joined or left
the membership and a new membership was formed.
Nov 26 03:40:13 host2 corosync[2596]: [CPG ] chosen downlist: sender
r(0) ip(192.168.1.2) ; members(old:2 left:1)
Nov 26 03:40:13 host2 corosync[2596]: [MAIN ] Completed service
synchronization, ready to provide service.
Nov 26 03:40:13 host2 kernel: dlm: closing connection to node 1
Nov 26 03:40:13 host2 kernel: GFS2: fsid=ckvm1_pod1:psv4.1: jid=0: Trying to
acquire journal lock...
Nov 26 03:40:44 host2 iscsid: connection1:0 is operational after recovery (3
attempts)
Nov 26 03:40:44 host2 corosync[2596]: [TOTEM ] A processor joined or left
the membership and a new membership was formed.
Nov 26 03:40:44 host2 corosync[2596]: [CMAN ] quorum regained, resuming
activity
Nov 26 03:40:44 host2 corosync[2596]: [QUORUM] This node is within the
primary component and will provide service.
Nov 26 03:40:44 host2 corosync[2596]: [QUORUM] Members[2]: 1 2
Nov 26 03:40:44 host2 corosync[2596]: [QUORUM] Members[2]: 1 2
Nov 26 03:40:44 host2 corosync[2596]: [CPG ] chosen downlist: sender
r(0) ip(192.168.1.1) ; members(old:1 left:0)
Nov 26 03:40:44 host2 corosync[2596]: [MAIN ] Completed service
synchronization, ready to provide service.
Nov 26 03:40:44 host2 gfs_controld[2727]: receive_start 1:4 add node with
started_count 3
Nov 26 03:40:44 host2 fenced[2652]: receive_start 1:4 add node with
started_count 2
Nov 26 03:41:26 host2 corosync[2596]: [TOTEM ] A processor failed, forming
new configuration.
Nov 26 03:41:28 host2 corosync[2596]: [CMAN ] quorum lost, blocking
activity
Nov 26 03:41:28 host2 corosync[2596]: [QUORUM] This node is within the
non-primary component and will NOT provide any services.
Nov 26 03:41:28 host2 corosync[2596]: [QUORUM] Members[1]: 2
Nov 26 03:41:28 host2 corosync[2596]: [TOTEM ] A processor joined or left
the membership and a new membership was formed.
Nov 26 03:41:28 host2 corosync[2596]: [CPG ] chosen downlist: sender
r(0) ip(192.168.1.2) ; members(old:2 left:1)
Nov 26 03:41:28 host2 corosync[2596]: [MAIN ] Completed service
synchronization, ready to provide service.
Nov 26 03:41:28 host2 kernel: dlm: closing connection to node 1
Nov 26 03:41:29 host2 kernel: connection1:0: ping timeout of 5 secs expired,
recv timeout 5, last rx 5576425428, last ping 5576430428, now 5576435428
Nov 26 03:41:29 host2 kernel: connection1:0: detected conn error (1011)
Nov 26 03:41:30 host2 iscsid: Kernel reported iSCSI connection 1:0 error
(1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
Nov 26 03:41:32 host2 corosync[2596]: [TOTEM ] A processor joined or left
the membership and a new membership was formed.
Nov 26 03:41:32 host2 corosync[2596]: [CMAN ] quorum regained, resuming
activity
Nov 26 03:41:32 host2 corosync[2596]: [QUORUM] This node is within the
primary component and will provide service.
Nov 26 03:41:32 host2 corosync[2596]: [QUORUM] Members[2]: 1 2
Nov 26 03:41:32 host2 corosync[2596]: [QUORUM] Members[2]: 1 2
Nov 26 03:41:32 host2 corosync[2596]: [CPG ] chosen downlist: sender
r(0) ip(192.168.1.1) ; members(old:1 left:0)
Nov 26 03:41:32 host2 corosync[2596]: [MAIN ] Completed service
synchronization, ready to provide service.
Nov 26 03:41:32 host2 fenced[2652]: receive_start 1:6 add node with
started_count 2
Nov 26 03:41:32 host2 gfs_controld[2727]: receive_start 1:6 add node with
started_count 4
Nov 26 03:41:37 host2 iscsid: connection1:0 is operational after recovery (1
attempts)
Nov 26 03:42:19 host2 kernel: connection1:0: ping timeout of 5 secs expired,
recv timeout 5, last rx 5576475399, last ping 5576480399, now 5576485399
Nov 26 03:42:19 host2 kernel: connection1:0: detected conn error (1011)
Nov 26 03:42:20 host2 iscsid: Kernel reported iSCSI connection 1:0 error
(1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state (3)
Nov 26 03:42:21 host2 corosync[2596]: [TOTEM ] A processor failed, forming
new configuration.
Nov 26 03:42:23 host2 corosync[2596]: [CMAN ] quorum lost, blocking
activity
Nov 26 03:42:23 host2 corosync[2596]: [QUORUM] This node is within the
non-primary component and will NOT provide any services.
Nov 26 03:42:23 host2 corosync[2596]: [QUORUM] Members[1]: 2
Nov 26 03:42:23 host2 corosync[2596]: [TOTEM ] A processor joined or left
the membership and a new membership was formed.
Nov 26 03:42:23 host2 corosync[2596]: [CPG ] chosen downlist: sender
r(0) ip(192.168.1.2) ; members(old:2 left:1)
Nov 26 03:42:23 host2 corosync[2596]: [MAIN ] Completed service
synchronization, ready to provide service.
Nov 26 03:42:23 host2 kernel: dlm: closing connection to node 1
Nov 26 03:42:41 host2 kernel: INFO: task kslowd001:2942 blocked for more
than 120 seconds.
Nov 26 03:42:41 host2 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 03:42:41 host2 kernel: kslowd001 D 000000000000000b 0 2942
2 0x00000080
Nov 26 03:42:41 host2 kernel: ffff88086b29d958 0000000000000046
0000000000000102 0000005000000002
Nov 26 03:42:41 host2 kernel: fffffffffffffffc 000000000000010e
0000003f00000002 fffffffffffffffc
Nov 26 03:42:41 host2 kernel: ffff88086b29bab8 ffff88086b29dfd8
000000000000fb88 ffff88086b29bab8
Nov 26 03:42:41 host2 kernel: Call Trace:
Nov 26 03:42:41 host2 kernel: [<ffffffff814ffec5>]
rwsem_down_failed_common+0x95/0x1d0
Nov 26 03:42:41 host2 kernel: [<ffffffff81500056>]
rwsem_down_read_failed+0x26/0x30
Nov 26 03:42:41 host2 kernel: [<ffffffff8127e634>]
call_rwsem_down_read_failed+0x14/0x30
Nov 26 03:42:41 host2 kernel: [<ffffffff814ff554>] ? down_read+0x24/0x30
Nov 26 03:42:41 host2 kernel: [<ffffffffa06046d2>] dlm_lock+0x62/0x1e0 [dlm]
Nov 26 03:42:41 host2 kernel: [<ffffffff8127cd04>] ? vsnprintf+0x484/0x5f0
Nov 26 03:42:41 host2 kernel: [<ffffffffa06564e1>] gdlm_lock+0xf1/0x130
[gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa06565f0>] ? gdlm_ast+0x0/0xe0
[gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa0656520>] ? gdlm_bast+0x0/0x50
[gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa063a385>] do_xmote+0x1a5/0x280
[gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff8127cf14>] ? snprintf+0x34/0x40
Nov 26 03:42:41 host2 kernel: [<ffffffffa063a551>] run_queue+0xf1/0x1d0
[gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa063a8de>] gfs2_glock_nq+0x21e/0x3d0
[gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa063ac71>]
gfs2_glock_nq_num+0x61/0xa0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa064eca3>]
gfs2_recover_work+0x93/0x7b0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff8105b483>] ?
perf_event_task_sched_out+0x33/0x80
Nov 26 03:42:41 host2 kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
Nov 26 03:42:41 host2 kernel: [<ffffffffa063ac69>] ?
gfs2_glock_nq_num+0x59/0xa0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff8106335b>] ?
enqueue_task_fair+0xfb/0x100
Nov 26 03:42:41 host2 kernel: [<ffffffff81108093>]
slow_work_execute+0x233/0x310
Nov 26 03:42:41 host2 kernel: [<ffffffff811082c7>]
slow_work_thread+0x157/0x360
Nov 26 03:42:41 host2 kernel: [<ffffffff810920d0>] ?
autoremove_wake_function+0x0/0x40
Nov 26 03:42:41 host2 kernel: [<ffffffff81108170>] ?
slow_work_thread+0x0/0x360
Nov 26 03:42:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
Nov 26 03:42:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Nov 26 03:42:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Nov 26 03:42:41 host2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Nov 26 03:42:41 host2 kernel: INFO: task gfs2_quotad:2950 blocked for more
than 120 seconds.
Nov 26 03:42:41 host2 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 03:42:41 host2 kernel: gfs2_quotad D 0000000000000001 0 2950
2 0x00000080
Nov 26 03:42:41 host2 kernel: ffff88086afdfc20 0000000000000046
0000000000000000 ffffffffa0605f4d
Nov 26 03:42:41 host2 kernel: 0000000000000000 ffff88106c505800
ffff88086afdfc50 ffffffffa0604708
Nov 26 03:42:41 host2 kernel: ffff88086afddaf8 ffff88086afdffd8
000000000000fb88 ffff88086afddaf8
Nov 26 03:42:41 host2 kernel: Call Trace:
Nov 26 03:42:41 host2 kernel: [<ffffffffa0605f4d>] ?
dlm_put_lockspace+0x1d/0x40 [dlm]
Nov 26 03:42:41 host2 kernel: [<ffffffffa0604708>] ? dlm_lock+0x98/0x1e0
[dlm]
Nov 26 03:42:41 host2 kernel: [<ffffffffa0637570>] ?
gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa063757e>]
gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
Nov 26 03:42:41 host2 kernel: [<ffffffffa0637570>] ?
gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff814feb58>]
out_of_line_wait_on_bit+0x78/0x90
Nov 26 03:42:41 host2 kernel: [<ffffffff81092110>] ?
wake_bit_function+0x0/0x50
Nov 26 03:42:41 host2 kernel: [<ffffffffa06394f5>] gfs2_glock_wait+0x45/0x90
[gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa063a8f7>] gfs2_glock_nq+0x237/0x3d0
[gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff8107eabb>] ?
try_to_del_timer_sync+0x7b/0xe0
Nov 26 03:42:41 host2 kernel: [<ffffffffa0653658>]
gfs2_statfs_sync+0x58/0x1b0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff814fe75a>] ?
schedule_timeout+0x19a/0x2e0
Nov 26 03:42:41 host2 kernel: [<ffffffffa0653650>] ?
gfs2_statfs_sync+0x50/0x1b0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa064b9d7>]
quotad_check_timeo+0x57/0xb0 [gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffffa064bc64>] gfs2_quotad+0x234/0x2b0
[gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff810920d0>] ?
autoremove_wake_function+0x0/0x40
Nov 26 03:42:41 host2 kernel: [<ffffffffa064ba30>] ? gfs2_quotad+0x0/0x2b0
[gfs2]
Nov 26 03:42:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
Nov 26 03:42:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Nov 26 03:42:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Nov 26 03:42:41 host2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Nov 26 03:42:54 host2 iscsid: connect to 192.168.1.161:3260 failed (No route
to host)
Nov 26 03:43:00 host2 iscsid: connect to 192.168.1.161:3260 failed (No route
to host)
Nov 26 03:43:01 host2 corosync[2596]: [TOTEM ] A processor joined or left
the membership and a new membership was formed.
Nov 26 03:43:01 host2 corosync[2596]: [CMAN ] quorum regained, resuming
activity
Nov 26 03:43:01 host2 corosync[2596]: [QUORUM] This node is within the
primary component and will provide service.
Nov 26 03:43:01 host2 corosync[2596]: [QUORUM] Members[2]: 1 2
Nov 26 03:43:01 host2 corosync[2596]: [QUORUM] Members[2]: 1 2
Nov 26 03:43:01 host2 corosync[2596]: [CPG ] chosen downlist: sender
r(0) ip(192.168.1.1) ; members(old:1 left:0)
Nov 26 03:43:01 host2 corosync[2596]: [MAIN ] Completed service
synchronization, ready to provide service.
Nov 26 03:43:01 host2 gfs_controld[2727]: receive_start 1:8 add node with
started_count 5
Nov 26 03:43:01 host2 fenced[2652]: receive_start 1:8 add node with
started_count 2
Nov 26 03:43:03 host2 iscsid: connection1:0 is operational after recovery (5
attempts)
Nov 26 03:44:41 host2 kernel: INFO: task kslowd001:2942 blocked for more
than 120 seconds.
Nov 26 03:44:41 host2 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 03:44:41 host2 kernel: kslowd001 D 000000000000000b 0 2942
2 0x00000080
Nov 26 03:44:41 host2 kernel: ffff88086b29d958 0000000000000046
0000000000000102 0000005000000002
Nov 26 03:44:41 host2 kernel: fffffffffffffffc 000000000000010e
0000003f00000002 fffffffffffffffc
Nov 26 03:44:41 host2 kernel: ffff88086b29bab8 ffff88086b29dfd8
000000000000fb88 ffff88086b29bab8
Nov 26 03:44:41 host2 kernel: Call Trace:
Nov 26 03:44:41 host2 kernel: [<ffffffff814ffec5>]
rwsem_down_failed_common+0x95/0x1d0
Nov 26 03:44:41 host2 kernel: [<ffffffff81500056>]
rwsem_down_read_failed+0x26/0x30
Nov 26 03:44:41 host2 kernel: [<ffffffff8127e634>]
call_rwsem_down_read_failed+0x14/0x30
Nov 26 03:44:41 host2 kernel: [<ffffffff814ff554>] ? down_read+0x24/0x30
Nov 26 03:44:41 host2 kernel: [<ffffffffa06046d2>] dlm_lock+0x62/0x1e0 [dlm]
Nov 26 03:44:41 host2 kernel: [<ffffffff8127cd04>] ? vsnprintf+0x484/0x5f0
Nov 26 03:44:41 host2 kernel: [<ffffffffa06564e1>] gdlm_lock+0xf1/0x130
[gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa06565f0>] ? gdlm_ast+0x0/0xe0
[gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa0656520>] ? gdlm_bast+0x0/0x50
[gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa063a385>] do_xmote+0x1a5/0x280
[gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff8127cf14>] ? snprintf+0x34/0x40
Nov 26 03:44:41 host2 kernel: [<ffffffffa063a551>] run_queue+0xf1/0x1d0
[gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa063a8de>] gfs2_glock_nq+0x21e/0x3d0
[gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa063ac71>]
gfs2_glock_nq_num+0x61/0xa0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa064eca3>]
gfs2_recover_work+0x93/0x7b0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff8105b483>] ?
perf_event_task_sched_out+0x33/0x80
Nov 26 03:44:41 host2 kernel: [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
Nov 26 03:44:41 host2 kernel: [<ffffffffa063ac69>] ?
gfs2_glock_nq_num+0x59/0xa0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff8106335b>] ?
enqueue_task_fair+0xfb/0x100
Nov 26 03:44:41 host2 kernel: [<ffffffff81108093>]
slow_work_execute+0x233/0x310
Nov 26 03:44:41 host2 kernel: [<ffffffff811082c7>]
slow_work_thread+0x157/0x360
Nov 26 03:44:41 host2 kernel: [<ffffffff810920d0>] ?
autoremove_wake_function+0x0/0x40
Nov 26 03:44:41 host2 kernel: [<ffffffff81108170>] ?
slow_work_thread+0x0/0x360
Nov 26 03:44:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
Nov 26 03:44:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Nov 26 03:44:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Nov 26 03:44:41 host2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Nov 26 03:44:41 host2 kernel: INFO: task gfs2_quotad:2950 blocked for more
than 120 seconds.
Nov 26 03:44:41 host2 kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 03:44:41 host2 kernel: gfs2_quotad D 0000000000000001 0 2950
2 0x00000080
Nov 26 03:44:41 host2 kernel: ffff88086afdfc20 0000000000000046
0000000000000000 ffffffffa0605f4d
Nov 26 03:44:41 host2 kernel: 0000000000000000 ffff88106c505800
ffff88086afdfc50 ffffffffa0604708
Nov 26 03:44:41 host2 kernel: ffff88086afddaf8 ffff88086afdffd8
000000000000fb88 ffff88086afddaf8
Nov 26 03:44:41 host2 kernel: Call Trace:
Nov 26 03:44:41 host2 kernel: [<ffffffffa0605f4d>] ?
dlm_put_lockspace+0x1d/0x40 [dlm]
Nov 26 03:44:41 host2 kernel: [<ffffffffa0604708>] ? dlm_lock+0x98/0x1e0
[dlm]
Nov 26 03:44:41 host2 kernel: [<ffffffffa0637570>] ?
gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa063757e>]
gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
Nov 26 03:44:41 host2 kernel: [<ffffffffa0637570>] ?
gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff814feb58>]
out_of_line_wait_on_bit+0x78/0x90
Nov 26 03:44:41 host2 kernel: [<ffffffff81092110>] ?
wake_bit_function+0x0/0x50
Nov 26 03:44:41 host2 kernel: [<ffffffffa06394f5>] gfs2_glock_wait+0x45/0x90
[gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa063a8f7>] gfs2_glock_nq+0x237/0x3d0
[gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff8107eabb>] ?
try_to_del_timer_sync+0x7b/0xe0
Nov 26 03:44:41 host2 kernel: [<ffffffffa0653658>]
gfs2_statfs_sync+0x58/0x1b0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff814fe75a>] ?
schedule_timeout+0x19a/0x2e0
Nov 26 03:44:41 host2 kernel: [<ffffffffa0653650>] ?
gfs2_statfs_sync+0x50/0x1b0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa064b9d7>]
quotad_check_timeo+0x57/0xb0 [gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffffa064bc64>] gfs2_quotad+0x234/0x2b0
[gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff810920d0>] ?
autoremove_wake_function+0x0/0x40
Nov 26 03:44:41 host2 kernel: [<ffffffffa064ba30>] ? gfs2_quotad+0x0/0x2b0
[gfs2]
Nov 26 03:44:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
Nov 26 03:44:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Nov 26 03:44:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
Nov 26 03:44:41 host2 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
What would you do in this situation? Is it possible to restart GFS2 without
rebooting the nodes?
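What I'm tempted to try is sketched below. This is untested and assumes the stock RHEL 6 cluster tools; the unmount lines are deliberately commented out, and the cluster-only commands are guarded so the snippet is a no-op elsewhere.

```shell
# Untested recovery sketch for a wedged GFS2 mount (RHEL 6 cluster stack
# assumed). First inspect lockspace and gfs_controld state:
command -v dlm_tool    >/dev/null 2>&1 && dlm_tool ls      # DLM lockspace state
command -v gfs_control >/dev/null 2>&1 && gfs_control ls   # gfs_controld view
# If journal recovery for jid 0 never completes, the only non-reboot option
# I can see is a lazy unmount, then a fresh mount once node 1 is fenced:
# umount -l /mnt/psv4
# mount /mnt/psv4
true  # keep the sketch's exit status clean on a non-cluster machine
```

I'm not sure a lazy unmount is even safe here while the DLM still thinks journal recovery is pending, which is the part I'd like advice on.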
Thank you very much for any help.
--
V.Melnik