[Linux-cluster] GFS + DRBD Problems

Mon Mar 3 17:06:19 UTC 2008

On Mon, 3 Mar 2008, gordan at bobich.net wrote:

> I have a 2-node cluster with Open Shared Root on GFS on DRBD. A single node 
> mounts GFS OK and works, but after a while seems to just block for disk.

[...]

> This usually happens after a period of idleness. If the node is used, this 
> doesn't seem to happen, but leaving it alone for half an hour causes it 
> to block for disk I/O.

I've done a bit more digging, and the processes that hang seem to do so, 
as expected, in disk sleep state.

For example, when trying to log in, sshd hangs. Its status (from /proc) 
is:

Name:   sshd
State:  D (disk sleep)
SleepAVG:       97%
[...]
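
To check whether sshd is the only process stuck this way, something like the following might help. It is only a rough sketch: the function name list_d_state is made up for illustration, and it assumes the kernel exposes /proc/<pid>/wchan, which names the kernel function a sleeping process is waiting in. The optional argument exists only so the function can be pointed at a copy of /proc.

```shell
#!/bin/sh
# list_d_state (hypothetical helper): print pid, name and wait channel
# for every process whose State in /proc/<pid>/status is D.
list_d_state() {
    proc_root="${1:-/proc}"   # defaults to the live /proc
    for status in "$proc_root"/[0-9]*/status; do
        # Processes can exit while we iterate, so skip unreadable entries.
        [ -r "$status" ] || continue
        state=$(awk '/^State:/ {print $2; exit}' "$status" 2>/dev/null)
        [ "$state" = "D" ] || continue
        dir=${status%/status}
        pid=${dir##*/}
        name=$(awk '/^Name:/ {print $2; exit}' "$status" 2>/dev/null)
        # wchan may be missing or empty; print "-" in that case.
        wchan=$(cat "$dir/wchan" 2>/dev/null)
        printf '%s\t%s\t%s\n' "$pid" "$name" "${wchan:--}"
    done
}
```

Running list_d_state with no arguments on the hung node should list every process in disk sleep. If they all share the same wait channel, that would point at one common lock or device rather than at sshd itself.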

The only open file handles it has are:
# ls -la /proc/9643/fd/
total 0
dr-x------ 2 root root  0 Mar  3 16:41 .
dr-xr-xr-x 5 root root  0 Mar  3 16:41 ..
lrwx------ 1 root root 64 Mar  3 16:42 0 -> /dev/null
lrwx------ 1 root root 64 Mar  3 16:42 1 -> /dev/null
lrwx------ 1 root root 64 Mar  3 16:42 2 -> /dev/null
lrwx------ 1 root root 64 Mar  3 16:42 3 -> socket:[118904]
lrwx------ 1 root root 64 Mar  3 16:42 4 -> /cdsl.local/var/run/utmp

I am guessing that it's the utmp that is blocking things, but I'm not 
sure. I can read-write the /var/run/utmp file just fine (/var/run is 
symlinked to /cdsl.local/var/run).
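
One way to test the utmp guess would be to look for locks held on that file in /proc/locks. The sketch below is an assumption-laden illustration: locks_on_file is a made-up name, and it only shows locks the local kernel knows about. I am not sure that GFS glocks or locks held by the other node would appear there at all.

```shell
#!/bin/sh
# locks_on_file (hypothetical helper): print the /proc/locks entries
# whose inode matches the given file. Only the inode is compared, so a
# file with the same inode number on another device would also match.
locks_on_file() {
    file=$1
    locks_table="${2:-/proc/locks}"   # defaults to the live lock table
    inode=$(ls -di -- "$file" 2>/dev/null | awk '{print $1}')
    [ -n "$inode" ] || return 1
    # Each entry carries a major:minor:inode field; find it by shape
    # rather than by position, since blocked entries have an extra "->".
    awk -v ino="$inode" '{
        for (i = 1; i <= NF; i++) {
            n = split($i, id, ":")
            if (n == 3 && id[3] == ino) { print; next }
        }
    }' "$locks_table"
}
```

For example, locks_on_file /cdsl.local/var/run/utmp would print any matching entries, and the fifth column of a non-blocked entry is the PID holding the lock. No output would mean no locally visible lock on the file, which would not by itself rule out a cluster-level lock.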

The socket is a TCP socket, so I cannot see that being a disk block issue.

As for /dev/null, I didn't think that could be flock-ed...

Looking at cman_tool status and /proc/drbd, both seem to be in order and 
saying everything is working.

Any ideas as to what could be causing these bogus disk-sleep lock-ups?

Gordan