The most recent set of patches for RHCS, comprising:

  RHBA-2008:0093 dlm-kernel bug fix update
  RHBA-2008:0092 cman-kernel bug fix update
  RHBA-2008:0060 cman bug fix update
  RHBA-2008:0095 gnbd-kernel bug fix update
  RHBA-2008:0096 GFS-kernel bug fix update
  RHSA-2008:0055 Important: kernel security and bug fix update

has resulted in a problem in my two-node (production) cluster. Let me explain ;-)
I have a three-node test cluster where I install all patches before rolling them into my (two-node) production cluster; I know, I know, they're not the same, and that's the only difference I can see in what has happened here (a first in two years). In the three-node cluster (which, just to complicate things, only had two active nodes at the time), I rolled these patches through the two nodes without taking the whole cluster down. That is:
1. Stop all cluster services on Node A. Disable auto-start using chkconfig <cluster-service-name> off. Services stop successfully, Node A leaves the cluster, Node B continues running all shared cluster services (GFS, Fibre-channel-connected shared storage, HP MSA1000).
2. Patch Node A, reboot to new kernel, re-install HP-supplied QLogic driver, edit /etc/modprobe.conf for failover settings, rebuild initrd file for QLogic drivers, reboot, re-enable auto-start of cluster services, reboot once more and the cluster re-forms.
3. Repeat Steps 1 and 2 for Node B.
4. Cluster is restored to normal operation, both nodes fully patched.

On my production cluster, which uses a Quorum Disk in place of the third node, I completed Steps 1 and 2 on Node A, but the cluster did NOT re-form. cman sends out its advertisement, and I can see that Node B receives it (by looking at the tcpdump traces), but Node B never responds.
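
For anyone wanting to reproduce the sequence, the service handling in Steps 1 and 2 above can be sketched as a dry-run script. This is only an illustration, not the exact commands I ran: the service list and its ordering are assumptions based on the usual RHEL 4 Cluster Suite init scripts (a Quorum Disk setup would also involve its own daemon), and the function names are hypothetical. The patching, QLogic driver and initrd work in between is site-specific and is left out.

```python
# Hypothetical service list: assumed RHEL 4 Cluster Suite init scripts,
# listed in the order they would be stopped. Adjust to match your nodes.
STOP_ORDER = ["rgmanager", "gfs", "clvmd", "fenced", "cman", "ccsd"]


def leave_cluster_commands(services=STOP_ORDER):
    """Step 1: stop each cluster service, then disable its auto-start."""
    cmds = [f"service {name} stop" for name in services]
    cmds += [f"chkconfig {name} off" for name in services]
    return cmds


def rejoin_cluster_commands(services=STOP_ORDER):
    """End of Step 2: re-enable auto-start in start order, then reboot."""
    cmds = [f"chkconfig {name} on" for name in reversed(services)]
    cmds.append("reboot")
    return cmds


if __name__ == "__main__":
    # Dry run only: print the commands instead of executing them.
    for cmd in leave_cluster_commands() + rejoin_cluster_commands():
        print(cmd)
```

The script only prints commands, so it can be reviewed against the output of chkconfig --list on the node before anything is run by hand.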
So: before I take down Node B (which is currently the only one running my production services), can someone either (a) explain why the cluster is not re-forming, or (b) assure me that by restoring both systems to the same patch level, the cluster WILL re-form properly? (Which raises the question: why did my test cluster survive the patch process when my production cluster didn't? Same versions of everything.)
Thanks in advance, and best regards,

/Harry Sutton, RHCA
Hewlett-Packard Company