[Linux-cluster] Restarting GFS2 without reboot

Steven Whitehouse swhiteho at redhat.com
Tue Nov 26 09:59:34 UTC 2013


Hi,

On Tue, 2013-11-26 at 10:19 +0200, Vladimir Melnik wrote:
> Dear colleagues,
> 
>  
> 
> Your advice will be greatly appreciated.
> 
>  
> 
> I have another small GFS2 cluster: two nodes connected to the same
> iSCSI target.
> 
>  
> 
> Tonight something happened and now both nodes are unable to work with
> the mounted filesystem anymore.
> 
>  
> 
> Processes that already have files open on the filesystem keep working
> with them, but I can’t open new files; I can’t even list the files on
> the mountpoint with the “ls” command.
> 
Looking at the logs, it appears that journal recovery has got stuck on
one of the nodes, since the kernel is complaining that kslowd has been
blocked for a long time.
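
If it is any help, something like the following read-only checks would
show where recovery has got to. The fsname here is taken from your
gfs_control dump, and the glocks file needs debugfs to be mounted, so
adjust as required:

  grep . /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/*   # block/recover state exported to gfs_controld
  mount -t debugfs none /sys/kernel/debug             # only if debugfs is not already mounted
  cat /sys/kernel/debug/gfs2/ckvm1_pod1:psv4/glocks   # glocks stuck mid-transition show up here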

So that suggests that the other node is currently fenced, and only one
node is working anyway. If that is not the case then something has got
rather confused somehow. What kind of fencing is in use here?
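
Just as a pointer, and assuming the usual cman/fenced tools are
installed, the membership and fence domain state can be checked on each
node with something like:

  cman_tool status    # quorum state and votes as cman sees them
  cman_tool nodes     # cluster membership (the listing you posted below)
  fence_tool ls       # fence domain members and any victim still pending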

I also noticed that gfs2_quotad was complaining too - that tends to be
the first thing to complain when the filesystem cannot make progress. It
handles both statfs and quota updates, so it runs periodically even when
quotas are not in use. So it is just an indicator that things are slow,
and the cause is most likely elsewhere.
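
As an aside, the sync periods it works to are just gfs2 mount options
(if your gfs2 version supports them), so an fstab line along these
lines is where they come from - the figures below are only placeholders:

  # statfs_quantum/quota_quantum control how often gfs2_quotad syncs
  /dev/dm-0  /mnt/psv4  gfs2  rw,noatime,nodiratime,statfs_quantum=30,quota_quantum=60  0 0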

The other question is what caused the node to try to fence the other
one in the first place; that is not immediately clear from the logs.

However, you may well have to reboot one or more nodes in order to
clear this condition, depending on exactly what the problem is.
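
If you do want to capture a bit more state before resorting to that,
and assuming sysrq is enabled, a dump of the blocked tasks to the
kernel log is usually worth having:

  echo 1 > /proc/sys/kernel/sysrq    # enable sysrq if it is not already
  echo w > /proc/sysrq-trigger       # 'w' dumps all tasks in uninterruptible sleep
  dmesg | tail -n 200                # the traces appear in the kernel log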

I did spot a note in the logs about the connection to the storage being
lost, and that would certainly be enough to cause a problem on whichever
node lost access. Are you running qdisk on that iSCSI storage? It would
help if you could post your configuration.
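
For example, cluster.conf itself plus the output of the following would
cover most of it (mkqdisk is only relevant if a quorum disk really is
configured):

  cat /etc/cluster/cluster.conf   # fencing and any <quorumd> section, minus anything sensitive
  mkqdisk -L                      # lists quorum disk labels found on shared storage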

Steve.


>  
> 
> Both nodes are joined:
> 
>  
> 
> Node  Sts   Inc   Joined               Name
> 
>    1   M    388   2013-11-26 03:43:01  ***
> 
>    2   M    360   2013-11-11 07:39:22  ***
> 
>  
> 
> That’s what “gfs_control dump” says:
> 
>  
> 
> 1384148367 logging mode 3 syslog f 160 p 6 logfile p
> 6 /var/log/cluster/gfs_controld.log
> 
> 1384148367 gfs_controld 3.0.12.1 started
> 
> 1384148367 cluster node 1 added seq 364
> 
> 1384148367 cluster node 2 added seq 364
> 
> 1384148367 logging mode 3 syslog f 160 p 6 logfile p
> 6 /var/log/cluster/gfs_controld.log
> 
> 1384148367 group_mode 3 compat 0
> 
> 1384148367 setup_cpg_daemon 14
> 
> 1384148367 gfs:controld conf 2 1 0 memb 1 2 join 2 left
> 
> 1384148367 run protocol from nodeid 1
> 
> 1384148367 daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
> 
> 1384148372 client connection 5 fd 16
> 
> 1384148372 join: /mnt/psv4 gfs2 lock_dlm ckvm1_pod1:psv4
> rw,noatime,nodiratime /dev/dm-0
> 
> 1384148372 psv4 join: cluster name matches: ckvm1_pod1
> 
> 1384148372 psv4 process_dlmcontrol register 0
> 
> 1384148372 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 2 left
> 
> 1384148372 psv4 add_change cg 1 joined nodeid 2
> 
> 1384148372 psv4 add_change cg 1 we joined
> 
> 1384148372 psv4 add_change cg 1 counts member 2 joined 1 remove 0
> failed 0
> 
> 1384148372 psv4 wait_conditions skip for zero started_count
> 
> 1384148372 psv4 send_start cg 1 id_count 2 om 0 nm 2 oj 0 nj 0
> 
> 1384148372 psv4 receive_start 2:1 len 104
> 
> 1384148372 psv4 match_change 2:1 matches cg 1
> 
> 1384148372 psv4 wait_messages cg 1 need 1 of 2
> 
> 1384148372 psv4 receive_start 1:2 len 104
> 
> 1384148372 psv4 match_change 1:2 matches cg 1
> 
> 1384148372 psv4 wait_messages cg 1 got all 2
> 
> 1384148372 psv4 pick_first_recovery_master old 1
> 
> 1384148372 psv4 sync_state first_recovery_needed master 1
> 
> 1384148372 psv4 create_old_nodes 1 jid 0 ro 0 spect 0
> kernel_mount_done 0 error 0
> 
> 1384148372 psv4 create_new_nodes 2 ro 0 spect 0
> 
> 1384148372 psv4 create_new_journals 2 gets jid 1
> 
> 1384148373 psv4 receive_first_recovery_done from 1 master 1
> mount_client_notified 0
> 
> 1384148373 psv4 start_kernel cg 1 member_count 2
> 
> 1384148373 psv4 set /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block to
> 0
> 
> 1384148373 psv4 set
> open /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block error -1 2
> 
> 1384148373 psv4 client_reply_join_full ci 5 result 0
> hostdata=jid=1:id=2447518500:first=0
> 
> 1384148373 client_reply_join psv4 ci 5 result 0
> 
> 1384148373 psv4 wait_recoveries done
> 
> 1384148373 uevent add gfs2 /fs/gfs2/ckvm1_pod1:psv4
> 
> 1384148373 psv4 ping_kernel_mount 0
> 
> 1384148373 psv4 receive_mount_done from 1 result 0
> 
> 1384148373 psv4 wait_recoveries done
> 
> 1384148373 uevent change gfs2 /fs/gfs2/ckvm1_pod1:psv4
> 
> 1384148373 psv4 recovery_uevent jid 1 ignore
> 
> 1384148373 uevent online gfs2 /fs/gfs2/ckvm1_pod1:psv4
> 
> 1384148373 psv4 ping_kernel_mount 0
> 
> 1384148373 mount_done: psv4 result 0
> 
> 1384148373 psv4 receive_mount_done from 2 result 0
> 
> 1384148373 psv4 wait_recoveries done
> 
> 1385430013 cluster node 1 removed seq 368
> 
> 1385430013 gfs:controld conf 1 0 1 memb 2 join left 1
> 
> 1385430013 gfs:mount:psv4 conf 1 0 1 memb 2 join left 1
> 
> 1385430013 psv4 add_change cg 2 remove nodeid 1 reason 3
> 
> 1385430013 psv4 add_change cg 2 counts member 1 joined 0 remove 1
> failed 1
> 
> 1385430013 psv4 stop_kernel
> 
> 1385430013 psv4 set /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/block to
> 1
> 
> 1385430013 psv4 check_dlm_notify nodeid 1 begin
> 
> 1385430013 psv4 process_dlmcontrol notified nodeid 1 result -11
> 
> 1385430013 psv4 check_dlm_notify result -11 will retry nodeid 1
> 
> 1385430013 psv4 check_dlm_notify nodeid 1 begin
> 
> 1385430013 psv4 process_dlmcontrol notified nodeid 1 result 0
> 
> 1385430013 psv4 check_dlm_notify done
> 
> 1385430013 psv4 send_start cg 2 id_count 2 om 1 nm 0 oj 0 nj 1
> 
> 1385430013 psv4 receive_start 2:2 len 104
> 
> 1385430013 psv4 match_change 2:2 matches cg 2
> 
> 1385430013 psv4 wait_messages cg 2 got all 1
> 
> 1385430013 psv4 sync_state first_recovery_msg
> 
> 1385430013 psv4 set_failed_journals jid 0 nodeid 1
> 
> 1385430013 psv4 wait_recoveries jid 0 nodeid 1 unrecovered
> 
> 1385430013 psv4 start_journal_recovery jid 0
> 
> 1385430013 psv4 set /sys/fs/gfs2/ckvm1_pod1:psv4/lock_module/recover
> to 0
> 
> 1385430044 cluster node 1 added seq 372
> 
> 1385430044 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 1 left
> 
> 1385430044 psv4 add_change cg 3 joined nodeid 1
> 
> 1385430044 psv4 add_change cg 3 counts member 2 joined 1 remove 0
> failed 0
> 
> 1385430044 psv4 check_dlm_notify done
> 
> 1385430044 psv4 send_start cg 3 id_count 3 om 1 nm 1 oj 1 nj 0
> 
> 1385430044 cpg_mcast_joined retried 1 start
> 
> 1385430044 gfs:controld conf 2 1 0 memb 1 2 join 1 left
> 
> 1385430044 psv4 receive_start 2:3 len 116
> 
> 1385430044 psv4 match_change 2:3 matches cg 3
> 
> 1385430044 psv4 wait_messages cg 3 need 1 of 2
> 
> 1385430044 psv4 receive_start 1:4 len 116
> 
> 1385430044 psv4 match_change 1:4 matches cg 3
> 
> 1385430044 receive_start 1:4 add node with started_count 3
> 
> 1385430044 psv4 wait_messages cg 3 need 1 of 2
> 
> 1385430088 cluster node 1 removed seq 376
> 
> 1385430088 gfs:controld conf 1 0 1 memb 2 join left 1
> 
> 1385430088 gfs:mount:psv4 conf 1 0 1 memb 2 join left 1
> 
> 1385430088 psv4 add_change cg 4 remove nodeid 1 reason 3
> 
> 1385430088 psv4 add_change cg 4 counts member 1 joined 0 remove 1
> failed 1
> 
> 1385430088 psv4 check_dlm_notify nodeid 1 begin
> 
> 1385430088 psv4 process_dlmcontrol notified nodeid 1 result 0
> 
> 1385430088 psv4 check_dlm_notify done
> 
> 1385430088 psv4 send_start cg 4 id_count 2 om 1 nm 0 oj 1 nj 0
> 
> 1385430088 psv4 receive_start 2:4 len 104
> 
> 1385430088 psv4 match_change 2:4 skip 3 already start
> 
> 1385430088 psv4 match_change 2:4 matches cg 4
> 
> 1385430088 psv4 wait_messages cg 4 got all 1
> 
> 1385430088 psv4 sync_state first_recovery_msg
> 
> 1385430088 psv4 set_failed_journals no journal for nodeid 1
> 
> 1385430088 psv4 wait_recoveries jid 0 nodeid 1 unrecovered
> 
> 1385430092 cluster node 1 added seq 380
> 
> 1385430092 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 1 left
> 
> 1385430092 psv4 add_change cg 5 joined nodeid 1
> 
> 1385430092 psv4 add_change cg 5 counts member 2 joined 1 remove 0
> failed 0
> 
> 1385430092 psv4 check_dlm_notify done
> 
> 1385430092 psv4 send_start cg 5 id_count 3 om 1 nm 1 oj 1 nj 0
> 
> 1385430092 cpg_mcast_joined retried 1 start
> 
> 1385430092 gfs:controld conf 2 1 0 memb 1 2 join 1 left
> 
> 1385430092 psv4 receive_start 2:5 len 116
> 
> 1385430092 psv4 match_change 2:5 matches cg 5
> 
> 1385430092 psv4 wait_messages cg 5 need 1 of 2
> 
> 1385430092 psv4 receive_start 1:6 len 116
> 
> 1385430092 psv4 match_change 1:6 matches cg 5
> 
> 1385430092 receive_start 1:6 add node with started_count 4
> 
> 1385430092 psv4 wait_messages cg 5 need 1 of 2
> 
> 1385430143 cluster node 1 removed seq 384
> 
> 1385430143 gfs:mount:psv4 conf 1 0 1 memb 2 join left 1
> 
> 1385430143 psv4 add_change cg 6 remove nodeid 1 reason 3
> 
> 1385430143 psv4 add_change cg 6 counts member 1 joined 0 remove 1
> failed 1
> 
> 1385430143 psv4 check_dlm_notify nodeid 1 begin
> 
> 1385430143 gfs:controld conf 1 0 1 memb 2 join left 1
> 
> 1385430143 psv4 process_dlmcontrol notified nodeid 1 result 0
> 
> 1385430143 psv4 check_dlm_notify done
> 
> 1385430143 psv4 send_start cg 6 id_count 2 om 1 nm 0 oj 1 nj 0
> 
> 1385430143 psv4 receive_start 2:6 len 104
> 
> 1385430143 psv4 match_change 2:6 skip 5 already start
> 
> 1385430143 psv4 match_change 2:6 matches cg 6
> 
> 1385430143 psv4 wait_messages cg 6 got all 1
> 
> 1385430143 psv4 sync_state first_recovery_msg
> 
> 1385430143 psv4 set_failed_journals no journal for nodeid 1
> 
> 1385430143 psv4 wait_recoveries jid 0 nodeid 1 unrecovered
> 
> 1385430181 cluster node 1 added seq 388
> 
> 1385430181 gfs:mount:psv4 conf 2 1 0 memb 1 2 join 1 left
> 
> 1385430181 psv4 add_change cg 7 joined nodeid 1
> 
> 1385430181 psv4 add_change cg 7 counts member 2 joined 1 remove 0
> failed 0
> 
> 1385430181 psv4 check_dlm_notify done
> 
> 1385430181 psv4 send_start cg 7 id_count 3 om 1 nm 1 oj 1 nj 0
> 
> 1385430181 cpg_mcast_joined retried 1 start
> 
> 1385430181 gfs:controld conf 2 1 0 memb 1 2 join 1 left
> 
> 1385430181 psv4 receive_start 2:7 len 116
> 
> 1385430181 psv4 match_change 2:7 matches cg 7
> 
> 1385430181 psv4 wait_messages cg 7 need 1 of 2
> 
> 1385430181 psv4 receive_start 1:8 len 116
> 
> 1385430181 psv4 match_change 1:8 matches cg 7
> 
> 1385430181 receive_start 1:8 add node with started_count 5
> 
> 1385430181 psv4 wait_messages cg 7 need 1 of 2
> 
>  
> 
> I can’t reboot the nodes, as they’re pretty busy, but, of course, I’d
> like to get that GFS2 filesystem working again.
> 
>  
> 
> Here’s what I got in the log file when that happened:
> 
>  
> 
> Nov 26 03:40:11 host2 corosync[2596]:   [TOTEM ] A processor failed,
> forming new configuration.
> 
> Nov 26 03:40:12 host2 kernel: connection1:0: ping timeout of 5 secs
> expired, recv timeout 5, last rx 5576348348, last ping 5576353348, now
> 5576358348
> 
> Nov 26 03:40:12 host2 kernel: connection1:0: detected conn error
> (1011)
> 
> Nov 26 03:40:13 host2 iscsid: Kernel reported iSCSI connection 1:0
> error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state
> (3)
> 
> Nov 26 03:40:13 host2 corosync[2596]:   [CMAN  ] quorum lost, blocking
> activity
> 
> Nov 26 03:40:13 host2 corosync[2596]:   [QUORUM] This node is within
> the non-primary component and will NOT provide any services.
> 
> Nov 26 03:40:13 host2 corosync[2596]:   [QUORUM] Members[1]: 2
> 
> Nov 26 03:40:13 host2 corosync[2596]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> 
> Nov 26 03:40:13 host2 corosync[2596]:   [CPG   ] chosen downlist:
> sender r(0) ip(192.168.1.2) ; members(old:2 left:1)
> 
> Nov 26 03:40:13 host2 corosync[2596]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> 
> Nov 26 03:40:13 host2 kernel: dlm: closing connection to node 1
> 
> Nov 26 03:40:13 host2 kernel: GFS2: fsid=ckvm1_pod1:psv4.1: jid=0:
> Trying to acquire journal lock...
> 
> Nov 26 03:40:44 host2 iscsid: connection1:0 is operational after
> recovery (3 attempts)
> 
> Nov 26 03:40:44 host2 corosync[2596]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> 
> Nov 26 03:40:44 host2 corosync[2596]:   [CMAN  ] quorum regained,
> resuming activity
> 
> Nov 26 03:40:44 host2 corosync[2596]:   [QUORUM] This node is within
> the primary component and will provide service.
> 
> Nov 26 03:40:44 host2 corosync[2596]:   [QUORUM] Members[2]: 1 2
> 
> Nov 26 03:40:44 host2 corosync[2596]:   [QUORUM] Members[2]: 1 2
> 
> Nov 26 03:40:44 host2 corosync[2596]:   [CPG   ] chosen downlist:
> sender r(0) ip(192.168.1.1) ; members(old:1 left:0)
> 
> Nov 26 03:40:44 host2 corosync[2596]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> 
> Nov 26 03:40:44 host2 gfs_controld[2727]: receive_start 1:4 add node
> with started_count 3
> 
> Nov 26 03:40:44 host2 fenced[2652]: receive_start 1:4 add node with
> started_count 2
> 
> Nov 26 03:41:26 host2 corosync[2596]:   [TOTEM ] A processor failed,
> forming new configuration.
> 
> Nov 26 03:41:28 host2 corosync[2596]:   [CMAN  ] quorum lost, blocking
> activity
> 
> Nov 26 03:41:28 host2 corosync[2596]:   [QUORUM] This node is within
> the non-primary component and will NOT provide any services.
> 
> Nov 26 03:41:28 host2 corosync[2596]:   [QUORUM] Members[1]: 2
> 
> Nov 26 03:41:28 host2 corosync[2596]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> 
> Nov 26 03:41:28 host2 corosync[2596]:   [CPG   ] chosen downlist:
> sender r(0) ip(192.168.1.2) ; members(old:2 left:1)
> 
> Nov 26 03:41:28 host2 corosync[2596]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> 
> Nov 26 03:41:28 host2 kernel: dlm: closing connection to node 1
> 
> Nov 26 03:41:29 host2 kernel: connection1:0: ping timeout of 5 secs
> expired, recv timeout 5, last rx 5576425428, last ping 5576430428, now
> 5576435428
> 
> Nov 26 03:41:29 host2 kernel: connection1:0: detected conn error
> (1011)
> 
> Nov 26 03:41:30 host2 iscsid: Kernel reported iSCSI connection 1:0
> error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state
> (3)
> 
> Nov 26 03:41:32 host2 corosync[2596]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> 
> Nov 26 03:41:32 host2 corosync[2596]:   [CMAN  ] quorum regained,
> resuming activity
> 
> Nov 26 03:41:32 host2 corosync[2596]:   [QUORUM] This node is within
> the primary component and will provide service.
> 
> Nov 26 03:41:32 host2 corosync[2596]:   [QUORUM] Members[2]: 1 2
> 
> Nov 26 03:41:32 host2 corosync[2596]:   [QUORUM] Members[2]: 1 2
> 
> Nov 26 03:41:32 host2 corosync[2596]:   [CPG   ] chosen downlist:
> sender r(0) ip(192.168.1.1) ; members(old:1 left:0)
> 
> Nov 26 03:41:32 host2 corosync[2596]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> 
> Nov 26 03:41:32 host2 fenced[2652]: receive_start 1:6 add node with
> started_count 2
> 
> Nov 26 03:41:32 host2 gfs_controld[2727]: receive_start 1:6 add node
> with started_count 4
> 
> Nov 26 03:41:37 host2 iscsid: connection1:0 is operational after
> recovery (1 attempts)
> 
> Nov 26 03:42:19 host2 kernel: connection1:0: ping timeout of 5 secs
> expired, recv timeout 5, last rx 5576475399, last ping 5576480399, now
> 5576485399
> 
> Nov 26 03:42:19 host2 kernel: connection1:0: detected conn error
> (1011)
> 
> Nov 26 03:42:20 host2 iscsid: Kernel reported iSCSI connection 1:0
> error (1011 - ISCSI_ERR_CONN_FAILED: iSCSI connection failed) state
> (3)
> 
> Nov 26 03:42:21 host2 corosync[2596]:   [TOTEM ] A processor failed,
> forming new configuration.
> 
> Nov 26 03:42:23 host2 corosync[2596]:   [CMAN  ] quorum lost, blocking
> activity
> 
> Nov 26 03:42:23 host2 corosync[2596]:   [QUORUM] This node is within
> the non-primary component and will NOT provide any services.
> 
> Nov 26 03:42:23 host2 corosync[2596]:   [QUORUM] Members[1]: 2
> 
> Nov 26 03:42:23 host2 corosync[2596]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> 
> Nov 26 03:42:23 host2 corosync[2596]:   [CPG   ] chosen downlist:
> sender r(0) ip(192.168.1.2) ; members(old:2 left:1)
> 
> Nov 26 03:42:23 host2 corosync[2596]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> 
> Nov 26 03:42:23 host2 kernel: dlm: closing connection to node 1
> 
> Nov 26 03:42:41 host2 kernel: INFO: task kslowd001:2942 blocked for
> more than 120 seconds.
> 
> Nov 26 03:42:41 host2 kernel: "echo 0
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> Nov 26 03:42:41 host2 kernel: kslowd001     D 000000000000000b     0
> 2942      2 0x00000080
> 
> Nov 26 03:42:41 host2 kernel: ffff88086b29d958 0000000000000046
> 0000000000000102 0000005000000002
> 
> Nov 26 03:42:41 host2 kernel: fffffffffffffffc 000000000000010e
> 0000003f00000002 fffffffffffffffc
> 
> Nov 26 03:42:41 host2 kernel: ffff88086b29bab8 ffff88086b29dfd8
> 000000000000fb88 ffff88086b29bab8
> 
> Nov 26 03:42:41 host2 kernel: Call Trace:
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff814ffec5>]
> rwsem_down_failed_common+0x95/0x1d0
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff81500056>]
> rwsem_down_read_failed+0x26/0x30
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff8127e634>]
> call_rwsem_down_read_failed+0x14/0x30
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff814ff554>] ? down_read
> +0x24/0x30
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa06046d2>] dlm_lock+0x62/0x1e0
> [dlm]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff8127cd04>] ? vsnprintf
> +0x484/0x5f0
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa06564e1>] gdlm_lock
> +0xf1/0x130 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa06565f0>] ? gdlm_ast+0x0/0xe0
> [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa0656520>] ? gdlm_bast
> +0x0/0x50 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa063a385>] do_xmote
> +0x1a5/0x280 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff8127cf14>] ? snprintf
> +0x34/0x40
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa063a551>] run_queue
> +0xf1/0x1d0 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa063a8de>] gfs2_glock_nq
> +0x21e/0x3d0 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa063ac71>] gfs2_glock_nq_num
> +0x61/0xa0 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa064eca3>] gfs2_recover_work
> +0x93/0x7b0 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff8105b483>] ?
> perf_event_task_sched_out+0x33/0x80
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff810096f0>] ? __switch_to
> +0xd0/0x320
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa063ac69>] ? gfs2_glock_nq_num
> +0x59/0xa0 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff8106335b>] ? enqueue_task_fair
> +0xfb/0x100
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff81108093>] slow_work_execute
> +0x233/0x310
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff811082c7>] slow_work_thread
> +0x157/0x360
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff810920d0>] ?
> autoremove_wake_function+0x0/0x40
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff81108170>] ? slow_work_thread
> +0x0/0x360
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff8100c140>] ? child_rip
> +0x0/0x20
> 
> Nov 26 03:42:41 host2 kernel: INFO: task gfs2_quotad:2950 blocked for
> more than 120 seconds.
> 
> Nov 26 03:42:41 host2 kernel: "echo 0
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> Nov 26 03:42:41 host2 kernel: gfs2_quotad   D 0000000000000001     0
> 2950      2 0x00000080
> 
> Nov 26 03:42:41 host2 kernel: ffff88086afdfc20 0000000000000046
> 0000000000000000 ffffffffa0605f4d
> 
> Nov 26 03:42:41 host2 kernel: 0000000000000000 ffff88106c505800
> ffff88086afdfc50 ffffffffa0604708
> 
> Nov 26 03:42:41 host2 kernel: ffff88086afddaf8 ffff88086afdffd8
> 000000000000fb88 ffff88086afddaf8
> 
> Nov 26 03:42:41 host2 kernel: Call Trace:
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa0605f4d>] ? dlm_put_lockspace
> +0x1d/0x40 [dlm]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa0604708>] ? dlm_lock
> +0x98/0x1e0 [dlm]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa0637570>] ?
> gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa063757e>]
> gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff814feaaf>] __wait_on_bit
> +0x5f/0x90
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa0637570>] ?
> gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff814feb58>]
> out_of_line_wait_on_bit+0x78/0x90
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff81092110>] ? wake_bit_function
> +0x0/0x50
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa06394f5>] gfs2_glock_wait
> +0x45/0x90 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa063a8f7>] gfs2_glock_nq
> +0x237/0x3d0 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff8107eabb>] ?
> try_to_del_timer_sync+0x7b/0xe0
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa0653658>] gfs2_statfs_sync
> +0x58/0x1b0 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff814fe75a>] ? schedule_timeout
> +0x19a/0x2e0
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa0653650>] ? gfs2_statfs_sync
> +0x50/0x1b0 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa064b9d7>] quotad_check_timeo
> +0x57/0xb0 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa064bc64>] gfs2_quotad
> +0x234/0x2b0 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff810920d0>] ?
> autoremove_wake_function+0x0/0x40
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffffa064ba30>] ? gfs2_quotad
> +0x0/0x2b0 [gfs2]
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
> 
> Nov 26 03:42:41 host2 kernel: [<ffffffff8100c140>] ? child_rip
> +0x0/0x20
> 
> Nov 26 03:42:54 host2 iscsid: connect to 192.168.1.161:3260 failed (No
> route to host)
> 
> Nov 26 03:43:00 host2 iscsid: connect to 192.168.1.161:3260 failed (No
> route to host)
> 
> Nov 26 03:43:01 host2 corosync[2596]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> 
> Nov 26 03:43:01 host2 corosync[2596]:   [CMAN  ] quorum regained,
> resuming activity
> 
> Nov 26 03:43:01 host2 corosync[2596]:   [QUORUM] This node is within
> the primary component and will provide service.
> 
> Nov 26 03:43:01 host2 corosync[2596]:   [QUORUM] Members[2]: 1 2
> 
> Nov 26 03:43:01 host2 corosync[2596]:   [QUORUM] Members[2]: 1 2
> 
> Nov 26 03:43:01 host2 corosync[2596]:   [CPG   ] chosen downlist:
> sender r(0) ip(192.168.1.1) ; members(old:1 left:0)
> 
> Nov 26 03:43:01 host2 corosync[2596]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> 
> Nov 26 03:43:01 host2 gfs_controld[2727]: receive_start 1:8 add node
> with started_count 5
> 
> Nov 26 03:43:01 host2 fenced[2652]: receive_start 1:8 add node with
> started_count 2
> 
> Nov 26 03:43:03 host2 iscsid: connection1:0 is operational after
> recovery (5 attempts)
> 
> Nov 26 03:44:41 host2 kernel: INFO: task kslowd001:2942 blocked for
> more than 120 seconds.
> 
> Nov 26 03:44:41 host2 kernel: "echo 0
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> Nov 26 03:44:41 host2 kernel: kslowd001     D 000000000000000b     0
> 2942      2 0x00000080
> 
> Nov 26 03:44:41 host2 kernel: ffff88086b29d958 0000000000000046
> 0000000000000102 0000005000000002
> 
> Nov 26 03:44:41 host2 kernel: fffffffffffffffc 000000000000010e
> 0000003f00000002 fffffffffffffffc
> 
> Nov 26 03:44:41 host2 kernel: ffff88086b29bab8 ffff88086b29dfd8
> 000000000000fb88 ffff88086b29bab8
> 
> Nov 26 03:44:41 host2 kernel: Call Trace:
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff814ffec5>]
> rwsem_down_failed_common+0x95/0x1d0
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff81500056>]
> rwsem_down_read_failed+0x26/0x30
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff8127e634>]
> call_rwsem_down_read_failed+0x14/0x30
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff814ff554>] ? down_read
> +0x24/0x30
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa06046d2>] dlm_lock+0x62/0x1e0
> [dlm]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff8127cd04>] ? vsnprintf
> +0x484/0x5f0
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa06564e1>] gdlm_lock
> +0xf1/0x130 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa06565f0>] ? gdlm_ast+0x0/0xe0
> [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa0656520>] ? gdlm_bast
> +0x0/0x50 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa063a385>] do_xmote
> +0x1a5/0x280 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff8127cf14>] ? snprintf
> +0x34/0x40
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa063a551>] run_queue
> +0xf1/0x1d0 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa063a8de>] gfs2_glock_nq
> +0x21e/0x3d0 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa063ac71>] gfs2_glock_nq_num
> +0x61/0xa0 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa064eca3>] gfs2_recover_work
> +0x93/0x7b0 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff8105b483>] ?
> perf_event_task_sched_out+0x33/0x80
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff810096f0>] ? __switch_to
> +0xd0/0x320
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa063ac69>] ? gfs2_glock_nq_num
> +0x59/0xa0 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff8106335b>] ? enqueue_task_fair
> +0xfb/0x100
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff81108093>] slow_work_execute
> +0x233/0x310
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff811082c7>] slow_work_thread
> +0x157/0x360
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff810920d0>] ?
> autoremove_wake_function+0x0/0x40
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff81108170>] ? slow_work_thread
> +0x0/0x360
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff8100c140>] ? child_rip
> +0x0/0x20
> 
> Nov 26 03:44:41 host2 kernel: INFO: task gfs2_quotad:2950 blocked for
> more than 120 seconds.
> 
> Nov 26 03:44:41 host2 kernel: "echo 0
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> Nov 26 03:44:41 host2 kernel: gfs2_quotad   D 0000000000000001     0
> 2950      2 0x00000080
> 
> Nov 26 03:44:41 host2 kernel: ffff88086afdfc20 0000000000000046
> 0000000000000000 ffffffffa0605f4d
> 
> Nov 26 03:44:41 host2 kernel: 0000000000000000 ffff88106c505800
> ffff88086afdfc50 ffffffffa0604708
> 
> Nov 26 03:44:41 host2 kernel: ffff88086afddaf8 ffff88086afdffd8
> 000000000000fb88 ffff88086afddaf8
> 
> Nov 26 03:44:41 host2 kernel: Call Trace:
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa0605f4d>] ? dlm_put_lockspace
> +0x1d/0x40 [dlm]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa0604708>] ? dlm_lock
> +0x98/0x1e0 [dlm]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa0637570>] ?
> gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa063757e>]
> gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff814feaaf>] __wait_on_bit
> +0x5f/0x90
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa0637570>] ?
> gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff814feb58>]
> out_of_line_wait_on_bit+0x78/0x90
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff81092110>] ? wake_bit_function
> +0x0/0x50
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa06394f5>] gfs2_glock_wait
> +0x45/0x90 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa063a8f7>] gfs2_glock_nq
> +0x237/0x3d0 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff8107eabb>] ?
> try_to_del_timer_sync+0x7b/0xe0
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa0653658>] gfs2_statfs_sync
> +0x58/0x1b0 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff814fe75a>] ? schedule_timeout
> +0x19a/0x2e0
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa0653650>] ? gfs2_statfs_sync
> +0x50/0x1b0 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa064b9d7>] quotad_check_timeo
> +0x57/0xb0 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa064bc64>] gfs2_quotad
> +0x234/0x2b0 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff810920d0>] ?
> autoremove_wake_function+0x0/0x40
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffffa064ba30>] ? gfs2_quotad
> +0x0/0x2b0 [gfs2]
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff81091d66>] kthread+0x96/0xa0
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff81091cd0>] ? kthread+0x0/0xa0
> 
> Nov 26 03:44:41 host2 kernel: [<ffffffff8100c140>] ? child_rip
> +0x0/0x20
> 
>  
> 
> What would you do in a case like this? Is it possible to restart GFS2
> without rebooting the nodes?
> 
>  
> 
> Thank you very much for any help.
> 
>  
> 
> -- 
> 
> V.Melnik
> 
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster




