
Re: [Linux-cluster] cman bad generation number



On Wed, 2005-01-12 at 00:58, Patrick Caulfield wrote:
> On Tue, Jan 11, 2005 at 05:00:46PM -0800, Daniel McNeil wrote:
> > On Tue, 2005-01-11 at 00:56, Patrick Caulfield wrote:
> > > On Wed, Dec 22, 2004 at 09:33:39AM -0800, Daniel McNeil wrote:
> > > > How long does cman stay up in your testing?
> > > 
> > > With the higher priority on the heartbeat thread I got 5 days before iSCSI died
> > > on me again... This isn't quite the same load as yours but it is on 8 busy nodes.
> > 
> > I have not seen 5 days yet on my set.  See my email from yesterday.
> > Is the code to have higher priority for the heartbeat thread
> > already checked in?  I restarted my test yesterday and it is
> > still going, but it usually has trouble after 50 hours or so.
> > 
> 
> It's rev 1.45 of membership.c checked in on the 7th Jan. If that hasn't fixed it
> I'll have to dabble with realtime things as it does seem now that the threads
> are not being woken up, even though the timer is firing.

I'm running from code as of Jan 4th, so I do not have that change.
I'll update my code.

Two nodes died last night while running my tests with these settings:
echo "9" > /proc/cluster/config/cman/max_retries
echo "1" > /proc/cluster/config/cman/hello_timer

Here's the console output from the 3 nodes:

cl030:
CMAN: no HELLO from cl031a, removing from the cluster
CMAN: node cl032a is not responding - removing from the cluster
CMAN: quorum lost, blocking activity

cl031:
CMAN: node cl030a is not responding - removing from the cluster
CMAN: node cl032a is not responding - removing from the cluster
                                                                                
SM:  Assertion failed on line 67 of file
/Views/redhat-cluster/cluster/cman-kernel/src/sm_membership.c
SM:  assertion:  "node"
SM:  time = 115176056
                                                                                
Kernel panic - not syncing: SM:  Record message above and reboot.
                                                                                
Message from syslogd cl031 at Wed Jan 12 01:17:57 2005 ...
Record message above and reboot. syncing: SM:

cl032:
CMAN: too many transition restarts - will die

Daniel



