[Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman

Thu Jul 26 17:52:58 UTC 2012

Ah, hehe, ignore my PS in the other reply then. :)

digimer

On 07/26/2012 01:20 PM, DIMITROV, TANIO wrote:
> Sorry, sent the message to the wrong address
>
>
> The reason I don't want to reboot/fence the node is that my nodes are actually semi-independent - each one writes to its local file system which is then backed up on the other node when it becomes available.
>
> So, the only way to rejoin the cluster is to start the CPG sequence from 0 (clean state), either by rebooting the node or by restarting CMAN?
>
>
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, July 26, 2012 12:47 PM
> To: DIMITROV, TANIO
> Cc: linux clustering
> Subject: Re: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman
>
> For automatic recovery, you have to use power fencing. Fabric fencing
> (like fencing at a SAN switch) is perfectly safe, but it requires human
> intervention.
>
> The problem is that the messages passed around the cluster in the closed
> process group (CPG) are sequenced. Once a node falls out of sequence, it
> needs to be restarted. To automate this, power fence the node. When it
> boots back up, it should automatically rejoin the cluster with a clean
> state.
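
To make the sequencing point concrete, here is a toy Python model. It is only a sketch under loud assumptions: the Member class, its fields, and the simulate() helper are hypothetical names invented for illustration, and this is not how corosync or CPG is actually implemented. It only shows why a member that has missed messages cannot simply carry on, and why coming back with clean state lets it rejoin.

```python
class Member:
    """Toy cluster member that accepts messages only in strict sequence.

    Hypothetical illustration; not corosync's real data structures.
    """

    def __init__(self, name):
        self.name = name
        self.next_seq = 0    # sequence number this member expects next
        self.in_sync = True  # False once a gap in the sequence is seen

    def receive(self, seq):
        """Return True if the message is accepted, False if it is rejected."""
        if not self.in_sync:
            return False
        if seq != self.next_seq:
            # A gap means messages were missed; the member is out of sequence.
            self.in_sync = False
            return False
        self.next_seq += 1
        return True

    def restart(self, group_next_seq):
        """Model a reboot or cman restart: discard state, rejoin cleanly.

        Assumption: the rejoining member simply adopts the group's position.
        """
        self.next_seq = group_next_seq
        self.in_sync = True


def simulate():
    node1, node2 = Member("node1"), Member("node2")

    # Healthy cluster: both members see messages 0 and 1.
    for seq in (0, 1):
        node1.receive(seq)
        node2.receive(seq)

    # Network break: only node1 sees messages 2 and 3.
    for seq in (2, 3):
        node1.receive(seq)

    # Link returns: message 4 reaches both, but node2 still expects 2.
    node1.receive(4)
    accepted_after_heal = node2.receive(4)
    in_sync_before_restart = node2.in_sync

    # Power fence or cman restart: node2 comes back with clean state.
    node2.restart(node1.next_seq)
    node1.receive(5)
    accepted_after_restart = node2.receive(5)

    return (accepted_after_heal, in_sync_before_restart,
            accepted_after_restart, node2.next_seq)


if __name__ == "__main__":
    print(simulate())
```

In this model node2 rejects everything after the gap until restart() is called, which is the role a power fence followed by a normal boot plays in the real cluster.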
>
> May I ask why you're so careful to avoid a restart? The whole idea of
> clustering is to have no/minimal interruption of service during a node
> failure.
>
> Digimer
>
> On 07/26/2012 12:04 PM, DIMITROV, TANIO wrote:
>> Thanks Digimer,
>>
>> Yes, this works but it cannot be done automatically - and that's my problem.
>> I'm trying to figure out why CMAN is being killed. What if I use a SAN switch as a fencing device to block access to the SAN? My node won't be rebooted, so won't I run into the same situation?
>> Is it at all possible for the node to rejoin the cluster without rebooting or restarting CMAN?
>> And if it is not, what about the SAN switch fencing scenario?
>>
>>
>>
>> -----Original Message-----
>> From: Digimer [mailto:lists at alteeve.ca]
>> Sent: Thursday, July 26, 2012 11:48 AM
>> To: linux clustering
>> Cc: DIMITROV, TANIO
>> Subject: Re: [Linux-cluster] RHEL 6 two-node cluster - nodes killing each other's cman
>>
>> On 07/26/2012 11:44 AM, DIMITROV, TANIO wrote:
>>> Hello,
>>> I'm testing a RHEL 6.2 cluster using CMAN.
>>> It is a two-node cluster with no shared data. The problem is that if there is a connectivity problem between the nodes, each of them continues working stand-alone, which is OK (no shared data, manual fencing). But when the connection comes back up, the nodes kill each other's cman instances:
>>>
>>> Jul 26 13:58:05.000 node1 corosync[15771]: cman killed by node 2 because we were killed by cman_tool or other application
>>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: cluster is down, exiting
>>> Jul 26 13:58:05.000 node1 gfs_controld[15900]: daemon cpg_dispatch error 2
>>> Jul 26 13:58:05.000 node1 dlm_controld[15848]: cluster is down, exiting
>>>
>>> Can this be avoided somehow?
>>>
>>> Thanks in advance!
>>
>> Use real fencing.
>>
>> The problem is, I believe, that the CPG messages fall out of sync. You
>> could try stopping cman on one node, reconnecting the network, and then
>> starting cman on that node again.
>>
>
>

-- 
Digimer
Papers and Projects: https://alteeve.com