[Linux-cluster] RHEL5 GFS2 - 2 node - node fenced when writing

nrbwpi at gmail.com nrbwpi at gmail.com
Wed Jun 27 22:35:57 UTC 2007


Thanks for your reply.

I switched the hardware over to Fedora Core 6, brought the system up to
date, and configured it the same as before with GFS2. Running uname gives
the following kernel string: "Linux fu2 2.6.20-1.2952.fc6 #1 SMP Wed May 16
18:18:22 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux".

The same fencing occurred after several hours of writing zeros to the volume
with dd in 250MB files.  This time, however, I noticed a kernel panic on the
fenced node.  The kernel output in /var/log/messages is below.  Could this
be a hardware configuration issue, or a bug in the kernel?
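
For reference, the write load was just a loop around dd. A minimal sketch
of the kind of script used is below; the mount points and iteration count
are illustrative, not the exact values from our runs:

#!/bin/bash
# Sketch of the test load: write zeros in 250MB files to each GFS2
# mount point. Mount points and iteration count are illustrative.
for mnt in /mnt/001vg_gfs /mnt/002vg_gfs /mnt/003vg_gfs /mnt/004vg_gfs; do
    for i in $(seq 1 100); do
        # 250MB per file: 250 blocks of 1MB each
        dd if=/dev/zero of="$mnt/zeros-$i.dat" bs=1M count=250
    done
done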



#####################################

Kernel panic

#####################################

Jun 26 10:00:41 fu2 kernel: ------------[ cut here ]------------
Jun 26 10:00:41 fu2 kernel: kernel BUG at lib/list_debug.c:67!
Jun 26 10:00:41 fu2 kernel: invalid opcode: 0000 [1] SMP
Jun 26 10:00:41 fu2 kernel: last sysfs file: /devices/pci0000:00/0000:00:02.0/0000:06:00.0/0000:07:00.0/0000:08:00.0/0000:09:00.0/irq
Jun 26 10:00:41 fu2 kernel: CPU 7
Jun 26 10:00:41 fu2 kernel: Modules linked in: lock_dlm gfs2 dlm configfs ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp xfs rfcomm l2cap bluetooth sunrpc ipv6 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi dm_multipath video sbs i2c_ec i2c_core dock button battery asus_acpi backlight ac parport_pc lp parport sg ata_piix libata pcspkr bnx2 ide_cd cdrom serio_raw dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc shpchp megaraid_sas sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Jun 26 10:00:41 fu2 kernel: Pid: 4142, comm: gfs2_logd Not tainted 2.6.20-1.2952.fc6 #1
Jun 26 10:00:41 fu2 kernel: RIP: 0010:[<ffffffff80341368>]  [<ffffffff80341368>] list_del+0x21/0x5b
Jun 26 10:00:41 fu2 kernel: RSP: 0018:ffff81011e247d00  EFLAGS: 00010082
Jun 26 10:00:41 fu2 kernel: RAX: 0000000000000058 RBX: ffff81011aa40000 RCX: ffffffff8057fc58
Jun 26 10:00:41 fu2 kernel: RDX: ffffffff8057fc58 RSI: 0000000000000000 RDI: ffffffff8057fc40
Jun 26 10:00:41 fu2 kernel: RBP: ffff81012da3f7c0 R08: ffffffff8057fc58 R09: 0000000000000001
Jun 26 10:00:41 fu2 kernel: R10: 0000000000000000 R11: ffff81012fd9d0c0 R12: ffff81011aa40f70
Jun 26 10:00:41 fu2 kernel: R13: ffff810123fb1a00 R14: ffff810123fb05d8 R15: 0000000000000036
Jun 26 10:00:41 fu2 kernel: FS:  0000000000000000(0000) GS:ffff81012fdb47c0(0000) knlGS:0000000000000000
Jun 26 10:00:41 fu2 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jun 26 10:00:41 fu2 kernel: CR2: 00002aaaadfbe008 CR3: 0000000042c20000 CR4: 00000000000006e0
Jun 26 10:00:41 fu2 kernel: Process gfs2_logd (pid: 4142, threadinfo ffff81011e246000, task ffff810121d35800)
Jun 26 10:00:41 fu2 kernel: Stack:  ffff810123fb1a00 ffffffff802cc6e7 0000003c00000000 ffff81012da3f7c0
Jun 26 10:00:41 fu2 kernel:  000000000000003c ffff810123fb0400 0000000000000000 ffff810123fb1a00
Jun 26 10:00:41 fu2 kernel:  ffff81012da3f800 ffffffff802cc8be ffff810123fb07e8 ffff810123fb0400
Jun 26 10:00:41 fu2 kernel: Call Trace:
Jun 26 10:00:41 fu2 kernel:  [<ffffffff802cc6e7>] free_block+0xb1/0x142
Jun 26 10:00:41 fu2 kernel:  [<ffffffff802cc8be>] cache_flusharray+0x7d/0xb1
Jun 26 10:00:41 fu2 kernel:  [<ffffffff8020765f>] kmem_cache_free+0x1ef/0x20c
Jun 26 10:00:41 fu2 kernel:  [<ffffffff88445628>] :gfs2:databuf_lo_before_commit+0x576/0x5c6
Jun 26 10:00:41 fu2 kernel:  [<ffffffff88443acf>] :gfs2:gfs2_log_flush+0x11e/0x2d3
Jun 26 10:00:41 fu2 kernel:  [<ffffffff88438310>] :gfs2:gfs2_logd+0xab/0x15b
Jun 26 10:00:41 fu2 kernel:  [<ffffffff88438265>] :gfs2:gfs2_logd+0x0/0x15b
Jun 26 10:00:41 fu2 kernel:  [<ffffffff80297a1e>] keventd_create_kthread+0x0/0x6a
Jun 26 10:00:41 fu2 kernel:  [<ffffffff802318bd>] kthread+0xd0/0xff
Jun 26 10:00:41 fu2 kernel:  [<ffffffff8025aec8>] child_rip+0xa/0x12
Jun 26 10:00:41 fu2 kernel:  [<ffffffff80297a1e>] keventd_create_kthread+0x0/0x6a
Jun 26 10:00:41 fu2 kernel:  [<ffffffff802317ed>] kthread+0x0/0xff
Jun 26 10:00:41 fu2 kernel:  [<ffffffff8025aebe>] child_rip+0x0/0x12
Jun 26 10:00:41 fu2 kernel:
Jun 26 10:00:41 fu2 kernel: Code: 0f 0b eb fe 48 8b 07 48 8b 50 08 48 39 fa 74 12 48 c7 c7 97
Jun 26 10:00:41 fu2 kernel: RIP  [<ffffffff80341368>] list_del+0x21/0x5b
Jun 26 10:00:41 fu2 kernel:  RSP <ffff81011e247d00>
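
If I'm reading the trace right, the BUG at lib/list_debug.c:67 is the
kernel's CONFIG_DEBUG_LIST sanity check firing inside list_del(), i.e. a
corrupted linked list was detected while gfs2_logd was flushing the log.
For anyone who wants to map the gfs2 offsets in the trace back to source
lines, something like the sketch below should work; it assumes the
kernel-debuginfo package matching 2.6.20-1.2952.fc6 is installed and that
the paths follow the usual Fedora debuginfo layout:

#!/bin/bash
# Resolve an oops offset to a source line with gdb and the matching
# debuginfo. Assumes kernel-debuginfo for this exact kernel is
# installed; adjust paths if your layout differs.
KVER=2.6.20-1.2952.fc6
GFS2_DBG=/usr/lib/debug/lib/modules/$KVER/kernel/fs/gfs2/gfs2.ko.debug

# Which source line does the faulting gfs2 frame correspond to?
gdb --batch "$GFS2_DBG" -ex 'list *(databuf_lo_before_commit+0x576)'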


On 6/7/07, Steven Whitehouse <swhiteho at redhat.com> wrote:
>
> Hi,
>
> The version of GFS2 in RHEL5 is rather old. Please use Fedora, the
> upstream kernel, or wait until RHEL 5.1 is out. This should solve the
> problem that you are seeing.
>
> Steve.
>
> On Wed, 2007-06-06 at 19:27 -0400, nrbwpi at gmail.com wrote:
> > Hello,
> >
> > Installed RHEL5 on a new two-node cluster with shared FC storage. The
> > two shared storage boxes are each split into 6.9TB LUNs, for a total
> > of four 6.9TB LUNs. Each machine is connected via a single 100Mb
> > connection to a network switch and a single FC connection to an FC
> > switch. The four LUNs have LVM on them with GFS2, and the file
> > systems are mountable from each box. When a scripted dd writes zeros
> > in 250MB files from each box to file systems on different LUNs, one
> > of the nodes in the cluster is fenced by the other. File size does
> > not seem to matter.
> >
> > My first guess at the problem was the heartbeat timeout in openais.
> > In the cluster.conf below I added the totem line to raise the timeout
> > to 10 seconds; this, however, did not resolve the problem. Both boxes
> > are running the latest updates from up2date as of two days ago.
> >
> > Below is the cluster.conf and what is seen in the logs.  Any
> > suggestions would be greatly appreciated.
> >
> > Thanks!
> >
> > Neal
> >
> >
> >
> > ##########################################
> >
> > Cluster.conf
> >
> > ##########################################
> >
> >
> > <?xml version="1.0"?>
> > <cluster alias="storage1" config_version="4" name="storage1">
> >         <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> >         <clusternodes>
> >                 <clusternode name="fu1" nodeid="1" votes="1">
> >                         <fence>
> >                                 <method name="1">
> >                                         <device name="apc4" port="1" switch="1"/>
> >                                 </method>
> >                         </fence>
> >                         <multicast addr="224.10.10.10" interface="eth0"/>
> >                 </clusternode>
> >                 <clusternode name="fu2" nodeid="2" votes="1">
> >                         <fence>
> >                                 <method name="1">
> >                                         <device name="apc4" port="2" switch="1"/>
> >                                 </method>
> >                         </fence>
> >                         <multicast addr="224.10.10.10" interface="eth0"/>
> >                 </clusternode>
> >         </clusternodes>
> >         <cman expected_votes="1" two_node="1">
> >                 <multicast addr="224.10.10.10"/>
> >                 <totem token="10000"/>
> >         </cman>
> >         <fencedevices>
> >                 <fencedevice agent="fence_apc" ipaddr="192.168.14.193" login="apc" name="apc4" passwd="apc"/>
> >         </fencedevices>
> >         <rm>
> >                 <failoverdomains/>
> >                 <resources/>
> >         </rm>
> > </cluster>
> >
> >
> > #####################################################
> >
> > /var/log/messages
> >
> > #####################################################
> >
> > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] The token was lost in the OPERATIONAL state.
> > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
> > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
> > Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] entering GATHER state from 2.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering GATHER state from 0.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Creating commit token because I am the rep.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Saving state aru 6e high seq received 6e
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering COMMIT state.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering RECOVERY state.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] position [0] member 192.168.14.195:
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] previous ring seq 16 rep 192.168.14.195
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] aru 6e high delivered 6e received flag 0
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Did not need to originate any messages in recovery.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Storing new sequence id for ring 14
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Sending initial ORF token
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
> > Jun  5 20:19:34 fu1 kernel: dlm: closing connection to node 2
> > Jun  5 20:19:34 fu1 fenced[5367]: fu2 not a cluster member after 0 sec post_fail_delay
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.195)
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
> > Jun  5 20:19:34 fu1 fenced[5367]: fencing node "fu2"
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.197)
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
> > Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the primary component and will provide service.
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.195)
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
> > Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the primary component and will provide service.
> > Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering OPERATIONAL state.
> > Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] got nodejoin message 192.168.14.195
> > Jun  5 20:19:34 fu1 openais[5351]: [CPG  ] got joinlist message from node 1
> > Jun  5 20:19:36 fu1 fenced[5367]: fence "fu2" success
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Trying to acquire journal lock...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Trying to acquire journal lock...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Looking at journal...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Trying to acquire journal lock...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Trying to acquire journal lock...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Looking at journal...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Looking at journal...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Looking at journal...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Acquiring the transaction lock...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Replaying journal...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Replayed 0 of 0 blocks
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Found 0 revoke tags
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Journal replayed in 1s
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Done
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Acquiring the transaction lock...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Replaying journal...
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Replayed 0 of 0 blocks
> > Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Found 0 revoke tags
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Journal replayed in 1s
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Done
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Acquiring the transaction lock...
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Acquiring the transaction lock...
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Replaying journal...
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Replayed 222 of 223 blocks
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Found 1 revoke tags
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Journal replayed in 1s
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Done
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Replaying journal...
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Replayed 438 of 439 blocks
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Found 1 revoke tags
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Journal replayed in 1s
> > Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Done
> >
> >