[Linux-cluster] Cluster failure, dlm overload


A five-node cluster sharing several GFS filesystems is suffering complete hangs of all filesystem activity, about once a week. The hangs started several weeks ago, after more than three years in service. The cluster recovers only after all cluster nodes are restarted ;-)

When a hang occurs, the dlm send and receive processes show very high CPU consumption, and network traffic is also about ten times the normal level.

A Wireshark capture of the network traffic on the DLM port shows thousands of messages per second. In particular, every "request message" is answered with a "request reply" where errno=EBADR. Lookup messages seem OK.
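
In case it helps anyone compare numbers, here is a rough sketch of how such a capture could be summarized once the packets have been exported to simple records. The record layout (timestamp in seconds, message type, errno name) and the labels "request_reply" and "EBADR" are assumptions made for this illustration, not an actual Wireshark export format.

```python
from collections import Counter


def summarize(records):
    """Summarize DLM traffic records.

    records: iterable of (timestamp_seconds, msg_type, errno_name) tuples.
    This tuple layout is hypothetical; adapt it to the real export.
    """
    per_second = Counter()   # messages seen in each one-second bucket
    replies = 0              # total "request reply" messages
    ebadr_replies = 0        # replies carrying errno EBADR

    for ts, msg_type, errno_name in records:
        per_second[int(ts)] += 1
        if msg_type == "request_reply":
            replies += 1
            if errno_name == "EBADR":
                ebadr_replies += 1

    peak = max(per_second.values(), default=0)
    ratio = ebadr_replies / replies if replies else 0.0
    return {"peak_per_second": peak, "replies": replies, "ebadr_ratio": ratio}


if __name__ == "__main__":
    # Tiny made-up sample, only to show the call.
    sample = [
        (0.1, "request", None),
        (0.2, "request_reply", "EBADR"),
        (0.9, "lookup", None),
        (1.5, "request_reply", "EBADR"),
        (1.6, "request_reply", None),
    ]
    print(summarize(sample))
```

A peak rate in the thousands together with an EBADR ratio close to 1.0 would match what we see during a hang.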

The cluster runs a somewhat outdated software version, the one shipped with the Red Hat 2.6.18 kernel, and it cannot easily be upgraded.

Any suggestions are welcome.

