[Linux-cluster] Possible problem with gfs2

Digimer lists at alteeve.ca
Thu Oct 24 17:51:14 UTC 2013


Usually I see that message when the storage backing GFS2 fails or when a
pending fence is taking a long time and DLM is blocking.
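
If you want to check for that, something like this should show the fence
and lockspace state on a stock CentOS 6 / cman node (a rough sketch; it
assumes the default tooling and debugfs mount point, so adjust as needed):

  cman_tool status        # overall cluster and quorum state
  fence_tool ls           # fence domain; a lingering "wait state" suggests a pending fence
  group_tool ls           # fence/dlm/gfs groups and whether any are blocked
  mount -t debugfs none /sys/kernel/debug
  cat /sys/kernel/debug/gfs2/*/glocks   # glocks with waiters show what is stuck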

Can you paste your cluster.conf please, along with log messages starting
a little before the GFS2 issue began?
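
On a stock CentOS 6 cluster the config normally lives in
/etc/cluster/cluster.conf and the cluster components log to
/var/log/messages, so assuming the default paths, something like this
should gather what we need:

  cat /etc/cluster/cluster.conf
  grep -iE 'gfs2|dlm|fenc|corosync|cman' /var/log/messages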

On 24/10/13 13:38, Juan Pablo Lorier wrote:
> Hi,
> 
> I'm new to working with clusters and I've started setting up Samba with HA.
> For that, I have a "storage server" from Supermicro with 16 2TB drives and
> dual server slots that access the drives directly.
> I've set up an md RAID 6 with the drives and created an LVM volume on top of
> the RAID, with one LV and a GFS2 partition, to create common storage for
> the servers. Only one of the servers is a member of the cluster right now,
> running CentOS 6.4.
> From time to time, some users get corrupted files when they copy to the
> file server. I asked the Samba list about it and they pointed me to
> this list for help.
> The error I've found is this:
> 
> Oct 22 17:50:04 nas kernel: INFO: task smbd:5994 blocked for more than 120 seconds.
> Oct 22 17:50:04 nas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 22 17:50:04 nas kernel: smbd          D 0000000000000000     0  5994   4870 0x00000084
> Oct 22 17:50:04 nas kernel: ffff88017c61b948 0000000000000086 ffff8801bb44b500 0000000000000000
> Oct 22 17:50:04 nas kernel: 00000000ffffffff 00000000ffffffff ffff88017c61b998 ffffffff8105b4d3
> Oct 22 17:50:04 nas kernel: ffff880175b03058 ffff88017c61bfd8 000000000000fb88 ffff880175b03058
> Oct 22 17:50:04 nas kernel: Call Trace:
> Oct 22 17:50:04 nas kernel: [<ffffffff8105b4d3>] ? perf_event_task_sched_out+0x33/0x80
> Oct 22 17:50:04 nas kernel: [<ffffffffa0557570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa055757e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffff814fef1f>] __wait_on_bit+0x5f/0x90
> Oct 22 17:50:04 nas kernel: [<ffffffffa0557570>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffff814fefc8>] out_of_line_wait_on_bit+0x78/0x90
> Oct 22 17:50:04 nas kernel: [<ffffffff81092190>] ? wake_bit_function+0x0/0x50
> Oct 22 17:50:04 nas kernel: [<ffffffffa05594f5>] gfs2_glock_wait+0x45/0x90 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa055a8f7>] gfs2_glock_nq+0x237/0x3d0 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa055c5c9>] gfs2_inode_lookup+0x129/0x300 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa054ea4d>] ? gfs2_dirent_search+0x16d/0x1a0 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa054efbe>] gfs2_dir_search+0x5e/0x80 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa055c32e>] gfs2_lookupi+0xde/0x1e0 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa0559ca8>] ? do_promote+0x208/0x330 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa055c39d>] ? gfs2_lookupi+0x14d/0x1e0 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffffa0569c76>] gfs2_lookup+0x36/0xd0 [gfs2]
> Oct 22 17:50:04 nas kernel: [<ffffffff81193ed7>] ? d_alloc+0x137/0x1b0
> Oct 22 17:50:04 nas kernel: [<ffffffff81189935>] do_lookup+0x1a5/0x230
> Oct 22 17:50:04 nas kernel: [<ffffffff81189ccd>] __link_path_walk+0x20d/0x1030
> Oct 22 17:50:04 nas kernel: [<ffffffff814a1d1a>] ? inet_recvmsg+0x5a/0x90
> Oct 22 17:50:04 nas kernel: [<ffffffff8118ad7a>] path_walk+0x6a/0xe0
> Oct 22 17:50:04 nas kernel: [<ffffffff8118af4b>] do_path_lookup+0x5b/0xa0
> Oct 22 17:50:04 nas kernel: [<ffffffff8118bbb7>] user_path_at+0x57/0xa0
> Oct 22 17:50:04 nas kernel: [<ffffffff81092150>] ? autoremove_wake_function+0x0/0x40
> Oct 22 17:50:04 nas kernel: [<ffffffff811807ec>] vfs_fstatat+0x3c/0x80
> Oct 22 17:50:04 nas kernel: [<ffffffff8118095b>] vfs_stat+0x1b/0x20
> Oct 22 17:50:04 nas kernel: [<ffffffff81180984>] sys_newstat+0x24/0x50
> Oct 22 17:50:04 nas kernel: [<ffffffff810d6ce2>] ? audit_syscall_entry+0x272/0x2a0
> Oct 22 17:50:04 nas kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> 
> 
> 
> Can anyone help me debug this? I'm trying to find out whether this is a
> GFS2 problem or whether it may be related to the hardware. I've started a
> GFS2 fsck, but it takes too long and I'd rather try to figure out the
> possible cause before taking the server out of production for more than
> 2 days (that is what I'm estimating it may take, based on how long it
> took to partially complete pass 1).
> Regards,
> 
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
