
[Linux-cluster] Problems with GFS2 faulting.



Hello,

I've been having MAJOR issues with GFS2 faulting during extremely simple operations. My test bed uses GFS2 for /home. The storage side is two iSCSI targets with DRBD replicating the data across both, and the client side uses open-iscsi to attach the storage, multipathing across both storage nodes. I have also tried this setup without DRBD and multipath, going straight to iSCSI, and the problem is exactly the same.

What triggers the error: I have an Apache server running my home page, which is simply Dokuwiki. Dokuwiki uses flat files for everything, so it relies on file locking for the wiki data files.

After I edit documents a few times (the same document or different ones) within a short period, roughly 30 minutes, GFS2 dumps a stack trace, faults, and brings both nodes to a complete halt.
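The access pattern described above could be approximated without Dokuwiki by a sketch like the following. It assumes the workload reduces to repeated lock, append, sync, unlock cycles using POSIX (fcntl) locks on a file on the shared mount; that is an assumption about the workload, not Dokuwiki's actual locking code, and it is not confirmed to trigger the fault. The names locked_append and edit_loop are hypothetical.

```python
import fcntl
import os


def locked_append(path, text):
    """Take an exclusive POSIX lock, append text, sync to disk, then unlock."""
    with open(path, "a") as f:
        # lockf() uses fcntl()-style record locks; blocks until the lock is granted.
        fcntl.lockf(f, fcntl.LOCK_EX)
        try:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)


def edit_loop(path, edits, tag):
    """Simulate `edits` consecutive page saves, each tagged with `tag`."""
    for i in range(edits):
        locked_append(path, f"{tag} edit {i}\n")
    return edits
```

Running edit_loop against the same file from both client nodes at once, each with a different tag, would roughly mimic two web servers saving the same wiki page.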

The two client servers sharing the GFS2 mount point are KVM guests on two physical virtualization hosts; the storage servers are bare-metal iSCSI targets with DRBD replication between them over gigabit Ethernet.

The cluster glue consists of the clustering PPA stack for Ubuntu 10.04.1: pacemaker, dlm_controld.pcmk, and gfs_controld.pcmk.

The only way to get even one node back online is to forcibly take down ALL nodes at once, because gfs_controld.pcmk will not die, dlm_controld.pcmk will not die, /home cannot be unmounted, and even open-iscsi cannot be stopped; they all just throw more stack traces and timeouts. After taking them down, I bring one node back up, set up not to re-activate the GFS2 mount but to load gfs_controld.pcmk, fsck the filesystem, and finally re-enable the mount before bringing the second node back online.

Obviously taking all nodes down means 100% downtime during that period of recovery, so this completely fails.

gfs_controld.pcmk is set up to start with the command-line arguments: -g 0 -l 0 -o 0

Here are the stack traces I get when it faults:

Jan 13 03:31:27 cweb1 kernel: [1387920.160141] INFO: task flush-251:1:27497 blocked for more than 120 seconds.
Jan 13 03:31:27 cweb1 kernel: [1387920.160802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 13 03:31:27 cweb1 kernel: [1387920.161474] flush-251:1 D 0000000000000002 0 27497 2 0x00000000
Jan 13 03:31:27 cweb1 kernel: [1387920.161479] ffff88004854fa10 0000000000000046 0000000000015bc0 0000000000015bc0
Jan 13 03:31:27 cweb1 kernel: [1387920.161483] ffff880060bf03b8 ffff88004854ffd8 0000000000015bc0 ffff880060bf0000
Jan 13 03:31:27 cweb1 kernel: [1387920.161485] 0000000000015bc0 ffff88004854ffd8 0000000000015bc0 ffff880060bf03b8
Jan 13 03:31:27 cweb1 kernel: [1387920.161488] Call Trace:
Jan 13 03:31:27 cweb1 kernel: [1387920.161508] [<ffffffffa022e730>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161516] [<ffffffffa022e73e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161540] [<ffffffff81558fcf>] __wait_on_bit+0x5f/0x90
Jan 13 03:31:27 cweb1 kernel: [1387920.161547] [<ffffffffa022fd9d>] ? do_promote+0xcd/0x290 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161555] [<ffffffffa022e730>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161558] [<ffffffff81559078>] out_of_line_wait_on_bit+0x78/0x90
Jan 13 03:31:27 cweb1 kernel: [1387920.161575] [<ffffffff810843c0>] ? wake_bit_function+0x0/0x40
Jan 13 03:31:27 cweb1 kernel: [1387920.161582] [<ffffffffa022f971>] gfs2_glock_wait+0x31/0x40 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161590] [<ffffffffa0230975>] gfs2_glock_nq+0x2a5/0x360 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161597] [<ffffffffa022f064>] ? gfs2_glock_put+0x104/0x130 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161606] [<ffffffffa02497f2>] gfs2_write_inode+0x82/0x190 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161614] [<ffffffffa02497ea>] ? gfs2_write_inode+0x7a/0x190 [gfs2]
Jan 13 03:31:27 cweb1 kernel: [1387920.161629] [<ffffffff811661d4>] writeback_single_inode+0x2b4/0x3d0
Jan 13 03:31:27 cweb1 kernel: [1387920.161631] [<ffffffff81166745>] writeback_sb_inodes+0x195/0x280
Jan 13 03:31:27 cweb1 kernel: [1387920.161638] [<ffffffff81061671>] ? dequeue_entity+0x1a1/0x1e0
Jan 13 03:31:27 cweb1 kernel: [1387920.161641] [<ffffffff81166f60>] writeback_inodes_wb+0xa0/0x1b0
Jan 13 03:31:27 cweb1 kernel: [1387920.161643] [<ffffffff811672ab>] wb_writeback+0x23b/0x2a0
Jan 13 03:31:27 cweb1 kernel: [1387920.161648] [<ffffffff81075f3c>] ? lock_timer_base+0x3c/0x70
Jan 13 03:31:27 cweb1 kernel: [1387920.161651] [<ffffffff8116748c>] wb_do_writeback+0x17c/0x190
Jan 13 03:31:27 cweb1 kernel: [1387920.161653] [<ffffffff81076050>] ? process_timeout+0x0/0x10
Jan 13 03:31:27 cweb1 kernel: [1387920.161656] [<ffffffff811674f3>] bdi_writeback_task+0x53/0xf0
Jan 13 03:31:27 cweb1 kernel: [1387920.161667] [<ffffffff8110e9c6>] bdi_start_fn+0x86/0x100
Jan 13 03:31:27 cweb1 kernel: [1387920.161669] [<ffffffff8110e940>] ? bdi_start_fn+0x0/0x100
Jan 13 03:31:27 cweb1 kernel: [1387920.161671] [<ffffffff81084006>] kthread+0x96/0xa0
Jan 13 03:31:27 cweb1 kernel: [1387920.161680] [<ffffffff810131ea>] child_rip+0xa/0x20
Jan 13 03:31:27 cweb1 kernel: [1387920.161683] [<ffffffff81083f70>] ? kthread+0x0/0xa0
Jan 13 03:31:27 cweb1 kernel: [1387920.161685] [<ffffffff810131e0>] ? child_rip+0x0/0x20
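
To make a trace like the one above easier to read, a small sketch along these lines could pull out the call-chain frames. It assumes the frame format shown in this log, and it treats frames prefixed with "?" as addresses found on the stack but not confirmed as part of the call chain. The names FRAME, parse_frames and reliable_frames are hypothetical helpers, not part of any existing tool.

```python
import re

# One frame looks like: [<ffffffffa022e73e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
# An optional "? " before the symbol marks an unconfirmed frame;
# the trailing "[module]" is present only for symbols from loadable modules.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def parse_frames(text):
    """Return (symbol, module_or_None, is_confirmed) for every frame in text."""
    return [
        (m.group(2), m.group(3), m.group(1) is None)
        for m in FRAME.finditer(text)
    ]


def reliable_frames(text, module=None):
    """Return confirmed symbols, optionally restricted to one kernel module."""
    return [
        name
        for name, mod, confirmed in parse_frames(text)
        if confirmed and (module is None or mod == module)
    ]
```

Calling reliable_frames(log_text, "gfs2") on the trace would list only the confirmed GFS2 frames, which here points at the writeback thread waiting on a glock.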

--
Eric Renfro

