[Linux-cluster] Re: ccsd problem

Tue Mar 4 15:22:38 UTC 2008

On Tue, 2008-03-04 at 11:49 +0100, Rudolf Gabler wrote:
> Hi Lon,
> 
> Sorry to bother you directly, but although I subscribed to this list I
> cannot post to it (I tried several times).

Ok.  I've CC'd the list.

> My problem: we are running 3 shared-root GFS clusters (2 on x86_64, 1 on
> ia64), and after the last upgrade we are seeing the following messages
> (example from cluster 1):
> 
> Mar  4 10:33:12 bldsrv3 ccsd[1664]: Invalid descriptor specified (-111).
> Mar  4 10:33:12 bldsrv3 ccsd[1664]: Someone may be attempting something evil.
> Mar  4 10:33:12 bldsrv3 ccsd[1664]: Error while processing get: Invalid request descriptor
> Mar  4 10:33:12 bldsrv3 ccsd[1664]: Invalid descriptor specified (-111).
> Mar  4 10:33:12 bldsrv3 ccsd[1664]: Someone may be attempting something evil.
> Mar  4 10:33:12 bldsrv3 ccsd[1664]: Error while processing get: Invalid request descriptor
> Mar  4 10:33:12 bldsrv3 ccsd[1664]: Invalid descriptor specified (-21).

I've seen this before - I'll try to dig up what I know.

> As far as I understand this, the problem occurs because a connection to
> ccsd fails ("ccs_test connect" in one of the /usr/share/cluster scripts)
> because of too many open connections (more than 30?).

That could be, and it would make sense.  You might have found the source
of the problem.

> When I run ccs_test connect several times, I get, e.g., a descriptor 5 times
> and then "connection refused" 6 times. The descriptor numbers start at zero
> and increment, and the thing I don't understand is the huge ccsd activity.

> After a fresh boot the descriptor number reaches around 1 million after
> one day of running. Is this intended (normal behavior)? Maybe it's related
> to a cman upgrade.

The ccs descriptors are non-decreasing.  They're not "file descriptors",
and they increment by a huge number each time.  Don't worry about what
the value is ;)  The max # of open descriptors is fixed @ compile-time.
I think there are a couple things we should do:

 * Increase the limit (as you noted, the max is 30).
 * Make the scripts calling ccs_test retry when an error is received.
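
For the second item, something along these lines might work.  It's only a
sketch: retry() is a made-up helper name, the try count and delay are
arbitrary, and it assumes ccs_test exits non-zero when the connect is
refused.

```shell
#!/bin/sh
# Hypothetical helper (not part of the cluster scripts): run a command,
# retrying a limited number of times if it fails.
# Usage: retry MAX_TRIES DELAY_SECONDS command [args...]
retry() {
    max_tries=$1
    delay=$2
    shift 2
    try=1
    while :; do
        # Run the command; stop as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        # Give up once the allowed number of attempts is used.
        if [ "$try" -ge "$max_tries" ]; then
            return 1
        fi
        try=$((try + 1))
        # Wait a little so other callers can release their descriptors.
        sleep "$delay"
    done
}
```

A script would then call e.g. "retry 5 1 ccs_test connect" instead of
calling ccs_test directly, and only treat it as an error if all tries fail.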

-- Lon