Re: [Linux-cluster] cluster instability



GS R wrote:


On 6/16/08, Shawn Hood <shawnlhood gmail com> wrote:

    All,

    This message was sent out to my office, so the voice may seem a bit
    odd.  We have a 4-node cluster running RHEL4U6 on Dell PowerEdge
    1950s.  Fencing is done via DRAC.

    Using packages (from RHN):

    cman-kernel-smp-2.6.9-53.13
    cman-1.0.17-0.el4_6.5
    ccs-1.0.11-1.el4_6.1
    fence-1.32.50-2.el4_6.1
    lvm2-cluster-2.02.27-2.el4_6.2
    dlm-kernel-smp-2.6.9-52.9
    dlm-kernheaders-2.6.9-52.9

    Our cluster became unstable on Saturday morning.  Apparently hugin
    stopped sending out heartbeats, causing it to be fenced.  hugin was
    under heavy load (load average ~10) at the time:

    03:30:02 AM         6       453      9.35     10.29     10.51
    03:40:01 AM        12       465     11.02     11.00     10.75
    03:50:02 AM         3       446      9.75     10.80     10.86
    04:00:01 AM         5       430      9.23      9.47     10.07
    Average:            7       455     10.19     10.32     10.28

    04:09:35 AM       LINUX RESTART

    As you can see, hugin was fenced at 4:09.  The other nodes then began
    logging the following:

    Jun 14 04:08:06 munin kernel: CMAN: Initiating transition, generation 58
    Jun 14 04:08:21 munin kernel: CMAN: Initiating transition, generation 59
    Jun 14 04:08:36 munin kernel: CMAN: Initiating transition, generation 60
    Jun 14 04:08:51 munin kernel: CMAN: Initiating transition, generation 61
    Jun 14 04:09:06 munin kernel: CMAN: too many transition restarts - will die
    Jun 14 04:09:06 munin kernel: CMAN: we are leaving the cluster. Inconsistent cluster view

I guess this is related to a network issue, although network utilization was low when this was logged.
The node does not appear to be receiving messages.
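
For anyone comparing their own logs against the excerpt quoted above, here is a minimal sketch that pulls the "Initiating transition" lines out of syslog text and reports the gap between consecutive transitions. The function name `transition_gaps`, the regular expression, and the `year` parameter are illustrative assumptions and not part of any cluster tool; syslog timestamps carry no year, so one has to be supplied.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Jun 14 04:08:06 munin kernel: CMAN: Initiating transition, generation 58
TRANSITION_RE = re.compile(
    r"^\s*(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"CMAN: Initiating transition, generation (\d+)"
)


def transition_gaps(lines, year=2008):
    """Return (generation, seconds since the previous transition) pairs.

    `year` is an assumption: classic syslog timestamps omit it.
    """
    previous = None
    gaps = []
    for line in lines:
        match = TRANSITION_RE.match(line)
        if not match:
            continue  # skip anything that is not a transition message
        stamp = datetime.strptime(
            f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
        )
        if previous is not None:
            seconds = int((stamp - previous).total_seconds())
            gaps.append((int(match.group(2)), seconds))
        previous = stamp
    return gaps


# The four transition lines from the quoted log excerpt.
sample = [
    "Jun 14 04:08:06 munin kernel: CMAN: Initiating transition, generation 58",
    "Jun 14 04:08:21 munin kernel: CMAN: Initiating transition, generation 59",
    "Jun 14 04:08:36 munin kernel: CMAN: Initiating transition, generation 60",
    "Jun 14 04:08:51 munin kernel: CMAN: Initiating transition, generation 61",
]

for generation, seconds in transition_gaps(sample):
    print(f"generation {generation}: {seconds}s after the previous transition")
```

On the quoted lines this shows each transition restarting at a regular interval until the node gives up and leaves the cluster. Running it over a real file would be something like `transition_gaps(open("/var/log/messages"))`, assuming that is where kernel messages land on your nodes.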


I suspect you've hit this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=444751


There's a patch in the bugzilla, and a workaround program you can run which should help if you can't upgrade the kernel module (see comment #10).

--

Chrissie
