[Cluster-devel] qdiskd hangs cluster activity

Fabio Massimo Di Nitto fabbione at ubuntu.com
Mon Oct 29 07:19:30 UTC 2007


Hi Lon,

I found a very interesting bug that manages to hang the entire cluster.

The setup is a 3-node cluster with nothing fancy running at all (I will be able
to show it to you next Wednesday, as it lives on my laptop ;)).

<quorumd label="test1">
 <heuristic program="ping 192.168.1.1 -c1 -t1" score="1" interval="2" tko="3"/>
</quorumd>

test1 is a 1GB AoE device shared between the 3 nodes.

The cluster starts without problems. The trouble begins after firing up qdiskd -f -d:

qdiskd -f -d
[12681] debug: Loading configuration information
[12681] debug: Heuristic: 'ping 192.168.1.1 -c1 -t1' score=1 interval=2 tko=3
[12681] debug: 1 heuristics loaded
[12681] debug: Quorum Daemon: 1 heuristics, 1 interval, 10 tko, 0 votes
open_partition: seek: Invalid argument
qdisk_validate: open of /dev/sda2 for RDWR failed: Illegal seek
qdisk_verify: Illegal seek
[12681] info: Quorum Partition: /dev/etherd/e1.0 Label: test1
[12681] info: Quorum Daemon Initializing
[12682] info: Heuristic: 'ping 192.168.1.1 -c1 -t1' UP
[12681] debug: Node 2 is UP
[12681] debug: Node 3 is UP
[12681] info: Initial score 1/1
[12681] info: Initialization complete
[12681] notice: Score sufficient for master operation (1/1; required=1); upgrading
[12681] debug: Making bid for master
[12681] info: Assuming master role
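
For reference, the "required=1" in the log above is consistent with the default
minimum score described in qdisk(5), floor((n+1)/2), where n is the sum of all
heuristic scores. A minimal sketch of that rule follows; the function names are
hypothetical and are not taken from the qdiskd source:

```c
/*
 * Hypothetical helpers, not the actual qdiskd code.
 * Assumes the default rule from qdisk(5): min_score = floor((n + 1) / 2),
 * where n is the sum of the scores of all configured heuristics.
 */
static int default_min_score(int total_score)
{
    /* Integer division floors for non-negative values. */
    return (total_score + 1) / 2;
}

/* Return 1 if the current score is enough to stay "alive", 0 otherwise. */
static int score_sufficient(int score, int total_score)
{
    return score >= default_min_score(total_score);
}
```

With the single heuristic above (score=1), default_min_score(1) is 1, so a
score of 1/1 is sufficient.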

A few seconds after the node assumes the master role, it hangs. The other nodes
follow within a matter of seconds.

aisexec is stalled in recv(..

No way to recover. kill -9 all over is required.

Attached are qdiskd straces from all 3 nodes, started at the exact same
time.

Fabio

PS I wonder if we are hitting this:

from qdisk/disk.c:

        /*
         * All IOs must be of size which is a multiple of 512.  Here we
         * just add in enough extra to accommodate.
         * XXX - if the on-disk offsets don't provide enough room we're cooked!
         */
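
To illustrate what that comment is worried about, here is a minimal sketch of
rounding an I/O length up to a multiple of 512 and checking whether the padded
size still fits in the space reserved on disk. The helper names and the
slot_size parameter are assumptions for illustration only, not the code in
qdisk/disk.c:

```c
#include <stddef.h>

#define SECTOR_SIZE 512u   /* taken from the comment above */

/* Hypothetical helper: round len up to the next multiple of SECTOR_SIZE. */
static size_t round_up_to_sector(size_t len)
{
    return (len + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;
}

/*
 * Hypothetical check: return 1 if the padded I/O still fits in the
 * slot_size bytes reserved between two on-disk offsets, 0 otherwise.
 * The 0 case is the "we're cooked" situation the comment describes.
 */
static int padded_io_fits(size_t len, size_t slot_size)
{
    return round_up_to_sector(len) <= slot_size;
}
```

For example, a 700-byte block would be padded to 1024 bytes, which no longer
fits if only 512 bytes were reserved for it.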

-- 
I'm going to make him an offer he can't refuse.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: qdisk.logs.tar.bz2
Type: application/x-bzip
Size: 10859 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20071029/300ad994/attachment.bin>

