[Linux-cluster] problem with deadlocked processes (D)

Peter Sopko Peter_Sopko at tempest.sk
Wed Apr 4 12:17:32 UTC 2007


Hi,

Today a strange thing occurred: on both of our cluster nodes, a lot of
processes suddenly became stuck in the D state (uninterruptible sleep,
usually waiting on I/O). This has happened once before (six months ago),
and a simple reboot resolved it. Since it has now appeared again, I don't
want to just reboot; I would like to find out why this is happening, but I
have no idea where to start. There is nothing unusual in /var/log/messages.
The only symptoms are that some directories cannot be removed and that a lot
of processes are stuck.

Some information about our configuration:

The storage array is an MSA 1000. There are two cluster nodes, which are
connected to each other and are both connected to the storage array. The
cluster is used only as a web/mail server: apache, courier-imap, postfix,
spamassassin, clamav, mysql, ...

The last time this happened, only courier-imap processes (imapd, pop3d)
were stuck; today it is also Apache's httpd.

Currently, 261 of 472 processes are stuck in the D state.
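
A count like this can be reproduced from saved ps output. The following is
only a minimal sketch in Python, assuming the usual "PID TTY STAT TIME
COMMAND" column layout of ps afx; the function name tally_d_state is
illustrative and not part of any tool.

```python
from collections import Counter


def tally_d_state(ps_text):
    """Summarise `ps afx`-style text.

    Returns (d_count, total, by_command), where by_command counts the
    D-state processes per executable name.
    Assumes the column order PID TTY STAT TIME COMMAND.
    """
    total = 0
    by_command = Counter()
    for line in ps_text.splitlines():
        fields = line.split(None, 4)
        # Skip the header, blank lines and wrapped continuation lines:
        # a real process line has five fields and starts with a numeric PID.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        total += 1
        if fields[2].startswith("D"):
            # Drop the tree-drawing prefix ("\_", "|") added by `ps f`.
            words = fields[4].lstrip("|\\_ ").split()
            by_command[words[0] if words else "?"] += 1
    return sum(by_command.values()), total, by_command
```

Calling tally_d_state() on the full ps afx output returns the number of
D-state processes, the total number of processes, and a per-command
breakdown, which shows at a glance which services are affected.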

Here are some examples taken with ps afx (the XXXXX are just filtered e-mail
addresses):

3024 ?        Ss     7:14 /usr/libexec/postfix/master
 4544 ?        S      0:17  \_ tlsmgr -l -t unix -u
15922 ?        S      0:00  \_ proxymap -t unix -u
15943 ?        D      0:00  \_ virtual -t unix
16129 ?        D      0:00  \_ virtual -t unix
16153 ?        D      0:00  \_ virtual -t unix
16261 ?        S      0:00  \_ proxymap -t unix -u
16262 ?        D      0:00  \_ virtual -t unix
16269 ?        D      0:00  \_ virtual -t unix
16271 ?        D      0:00  \_ virtual -t unix
16424 ?        D      0:00  \_ virtual -t unix
19138 ?        D      0:00  \_ virtual -t unix
19147 ?        D      0:00  \_ virtual -t unix
19153 ?        D      0:00  \_ virtual -t unix
19205 ?        D      0:00  \_ virtual -t unix
19835 ?        S      0:00  \_ pickup -l -t fifo -u
19919 ?        S      0:00  \_ qmgr -l -t fifo -u
19920 ?        S      0:00  \_ pipe -n filter -t unix flags=Rq user=filter argv=/data/spam/spamfilter.sh -f ${sender} --{recipient}
20346 ?        Ss     0:00  |   \_ /bin/sh /data/spam/spamfilter.sh -f XXXXX -- XXXXX
20353 ?        S      0:00  |       \_ cat
20354 ?        D      0:00  |       \_ /bin/sh /data/spam/spamfilter.sh -f XXXXX -- XXXXX


3039 ?        Ss     0:03 /usr/sbin/httpd
15674 ?        D      0:37  \_ /usr/sbin/httpd
15675 ?        D      0:32  \_ /usr/sbin/httpd
15676 ?        D      0:30  \_ /usr/sbin/httpd
15677 ?        D      0:34  \_ /usr/sbin/httpd
15678 ?        D      0:31  \_ /usr/sbin/httpd
15679 ?        S      0:33  \_ /usr/sbin/httpd
15680 ?        D      0:34  \_ /usr/sbin/httpd
15681 ?        D      0:34  \_ /usr/sbin/httpd
30808 ?        D      0:15  \_ /usr/sbin/httpd
30809 ?        D      0:15  \_ /usr/sbin/httpd
30810 ?        D      0:13  \_ /usr/sbin/httpd
30825 ?        D      0:16  \_ /usr/sbin/httpd
30827 ?        D      0:15  \_ /usr/sbin/httpd
30828 ?        D      0:17  \_ /usr/sbin/httpd
30829 ?        D      0:14  \_ /usr/sbin/httpd
30830 ?        D      0:14  \_ /usr/sbin/httpd
30831 ?        D      0:17  \_ /usr/sbin/httpd
30832 ?        D      0:12  \_ /usr/sbin/httpd
30835 ?        D      0:15  \_ /usr/sbin/httpd
30840 ?        D      0:12  \_ /usr/sbin/httpd
20441 ?        D      0:00  \_ /usr/sbin/httpd
20500 ?        D      0:00  \_ /usr/sbin/httpd
20501 ?        S      0:00  \_ /usr/sbin/httpd


Any idea where to start with debugging and looking for the reason this is
happening? I find it quite weird that everything was fine for more than six
months and now, all of a sudden, it starts doing this.

Any help would be greatly appreciated.

Thanks

Peter Sopko, IT Security Consultant
Tempest a.s.
Slovak Republic




