[Linux-cluster] Re: SMP and GFS

On Mon, Oct 03, 2005 at 10:31:02AM +0100, Patrick Caulfield wrote:
> Axel Thimm wrote:
> > On Mon, Oct 03, 2005 at 07:59:22AM +0100, Patrick Caulfield wrote:
> > 
> >>Axel Thimm wrote:
> >>
> >>>On Thu, Jul 14, 2005 at 04:57:51PM -0400, Manuel Bujan wrote:
> >>>
> >>>
> >>>>Is there any issue I should be aware of if SMP is enabled in
> >>>>my kernel? What if I compile my kernel to be pre-emptible? Any problem with that and GFS?
> >>>>
> >>
> >>Pre-emptible kernels will not work with GFS, that's certain.
> > 
> > 
> > My report was on a RHEL4 kernel.
> ...but you did ask about pre-emptible kernels :)

No, I didn't, that was Manuel Bujan 6 weeks ago. ;)

I replied that I saw the same einval messages on a RHEL4 kernel.

> The important messages here are these:
> > Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the cluster : Missed too many heartbeats (P:kernel)
> > Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the cluster : No response to messages (P:kernel)
> showing that a node has been kicked out of the cluster for not responding
> quickly enough to messages. You could try increasing the value in
> /proc/cluster/config/cman/max_retries
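
For reference, a minimal sketch of how that tunable could be inspected and
raised from a script. The path is the one quoted above. The function names
are hypothetical, and the sketch assumes the file holds a single integer
and is writable by root, which should be checked on the node first.

```python
from pathlib import Path

# Path quoted in the message above; present only where cman is loaded.
MAX_RETRIES_PATH = Path("/proc/cluster/config/cman/max_retries")


def read_tunable(path=MAX_RETRIES_PATH):
    """Return the current value (assumes the file holds one integer)."""
    return int(Path(path).read_text().strip())


def raise_tunable(new_value, path=MAX_RETRIES_PATH):
    """Write new_value only if it is larger than the current value.

    Returns (old_value, value_after_the_call).
    """
    if new_value < 1:
        raise ValueError("new_value must be a positive integer")
    old_value = read_tunable(path)
    if new_value > old_value:
        Path(path).write_text(f"{new_value}\n")
    return old_value, read_tunable(path)
```

Calling `raise_tunable(n)` as root on each node would apply the change until
the next reboot. A suitable `n` is not given in this thread and depends on
how long the heartbeat delays last.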

I know, but that doesn't explain the einval messages, or does it? Put
differently: the einval messages show that the dual Xeon box had some
issues with sockets, and its being kicked out could be just a symptom
of that.

Also, the RHEL4 box should not kernel panic (all nodes involved have
the same config, but only the panicking node has dual Xeons on EM64T;
the other two are dual Opterons. All run the same SMP RHEL4 kernel).

At that time the dual Xeon was doing a backup over this interface at
25-30 MB/sec. That could explain the delayed or dropped UDP heartbeat
packets. Can it explain the "send einval to 1" messages and the
kernel panic?
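
To see whether the einval messages come before or after the removals, the
syslog could be scanned for both kinds of line and listed in file order. A
rough sketch follows. The regular expression is written against the two
CMAN lines quoted above, and einval lines are matched only on the substring
"send einval", since their full format is not shown in this thread. The
names `CMAN_REMOVAL` and `scan_log` are hypothetical.

```python
import re

# Matches the CMAN removal lines quoted earlier in this message.
CMAN_REMOVAL = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"CMAN: removing node (?P<node>\S+) from the cluster\s*:\s*"
    r"(?P<reason>.+?)\s*$"
)


def scan_log(lines):
    """Return removal and einval events in the order they appear."""
    events = []
    for line in lines:
        match = CMAN_REMOVAL.match(line)
        if match:
            events.append({
                "kind": "removal",
                "stamp": match["stamp"],
                "reporter": match["host"],
                "node": match["node"],
                "reason": match["reason"],
            })
        elif "send einval" in line:
            # Format unknown beyond this substring, so keep the raw line.
            events.append({"kind": "einval", "raw": line.strip()})
    return events


if __name__ == "__main__":
    sample = [
        "Sep 30 05:08:33 zs03 kernel: CMAN: removing node zs02 from the "
        "cluster : Missed too many heartbeats (P:kernel)",
        "Sep 30 05:08:39 zs03 kernel: CMAN: removing node zs01 from the "
        "cluster : No response to messages (P:kernel)",
    ]
    for event in scan_log(sample):
        print(event)
```

If the einval lines consistently precede the first removal, that would
support the idea that the socket trouble came first and the missed
heartbeats followed from it.
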
Axel.Thimm at ATrpms.net

