
[Linux-cluster] Two-node cluster unpatched B doesn't see patched A

The most recent set of patches for RHCS, comprising:

RHBA-2008:0093    dlm-kernel bug fix update
RHBA-2008:0092    cman-kernel bug fix update
RHBA-2008:0060    cman bug fix update
RHBA-2008:0095    gnbd-kernel bug fix update
RHBA-2008:0096    GFS-kernel bug fix update
RHSA-2008:0055    Important: kernel security and bug fix update

has resulted in a problem in my two-node (production) cluster. Let me explain ;-)

I have a three-node test cluster where I install all patches before rolling them into my two-node production cluster. I know, I know, they're not the same, and that is the only difference I can see that might explain what has happened here (a first in two years). In the three-node cluster (which, just to complicate things, had only two active nodes at the time), I rolled these patches through the two nodes without taking the whole cluster down. That is:

1. Stop all cluster services on Node A. Disable auto-start using chkconfig <cluster-service-name> off. Services stop successfully, Node A leaves the cluster, and Node B continues running all shared cluster services (GFS, Fibre Channel-connected shared storage, HP MSA1000).
2. Patch Node A, reboot to the new kernel, re-install the HP-supplied QLogic driver, edit /etc/modprobe.conf for failover settings, rebuild the initrd file for the QLogic drivers, reboot, re-enable auto-start of cluster services, and reboot once more; the cluster re-forms.
3. Repeat Steps 1 and 2 for Node B
4. Cluster is restored to normal operation, both nodes fully patched.
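
The service handling in steps 1 and 2 above can be sketched as a dry-run script that only prints the commands instead of running them. The service names and their order are assumptions (a typical RHEL 4 era cman/GFS stack), since only <cluster-service-name> is given above; substitute the init scripts actually installed on the node.

```shell
#!/bin/sh
# Dry run: print the service commands for taking one node out of the
# cluster (step 1) and for re-enabling auto-start afterwards (end of step 2).
# Nothing is executed; review the output before running any of it by hand.

# ASSUMPTION: service names and stop order for a RHEL 4 era cman/GFS stack.
# Adjust to match the init scripts actually installed on the node.
SERVICES="rgmanager gfs clvmd fenced cman ccsd"

plan_node_offline() {
    # Stop services from the top of the stack down.
    for svc in $SERVICES; do
        echo "service $svc stop"
    done
    # Keep them from starting during the patch and driver reboots.
    for svc in $SERVICES; do
        echo "chkconfig $svc off"
    done
}

plan_node_online() {
    # Re-enable auto-start; the final reboot then starts the services.
    for svc in $SERVICES; do
        echo "chkconfig $svc on"
    done
}

echo "# Step 1 (before patching):"
plan_node_offline
echo "# End of step 2 (after the driver rebuild):"
plan_node_online
```

Printing the plan first makes it easy to compare what was done on the test cluster with what is about to be done on production.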

On my production cluster, which uses a Quorum Disk in place of the third node, I completed steps 1 and 2 on Node A, but the cluster did NOT re-form. cman on Node A sends out its advertisement, and I can see that Node B receives it (by looking at the tcpdump traces), but Node B never responds.

So: before I take down Node B (which is currently the only one running my production services), can someone either (a) explain why the cluster is not re-forming, or (b) assure me that by restoring both systems to the same patch level, the cluster WILL re-form properly? (Which raises the question: why did my test cluster survive the patch process when my production cluster didn't? Same versions of everything...)
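
One way to confirm that both nodes really are at the same patch level once Node B has been patched is to capture the package versions on each node and compare the two lists. This is a minimal sketch: the package names come from the advisories listed above, while the file names and the compare_pkgs helper are hypothetical.

```shell
#!/bin/sh
# On each node, capture the versions of the packages touched by the
# advisories, for example (file name is a placeholder):
#   rpm -q kernel cman cman-kernel dlm-kernel GFS-kernel gnbd-kernel > /tmp/nodeA.pkgs
# Copy both files to one machine, then run:
#   compare_pkgs /tmp/nodeA.pkgs /tmp/nodeB.pkgs

# compare_pkgs FILE_A FILE_B (hypothetical helper)
# Prints only the lines that differ. Lines unique to FILE_A start in
# column 1; lines unique to FILE_B are indented by a tab.
compare_pkgs() {
    a_sorted=$(mktemp) || return 1
    b_sorted=$(mktemp) || return 1
    # comm requires sorted input.
    sort "$1" > "$a_sorted"
    sort "$2" > "$b_sorted"
    # -3 suppresses the column of lines common to both files.
    comm -3 "$a_sorted" "$b_sorted"
    rm -f "$a_sorted" "$b_sorted"
}
```

Empty output means the two lists match. While only Node A is patched, differences are expected; the check is meant for after both nodes have been updated.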

Thanks in advance, and best regards,

   /Harry Sutton, RHCA
    Hewlett-Packard Company

