[Cluster-devel] qdiskd hangs cluster activity

Hi Lon,

I found a very interesting bug that manages to hang the entire cluster.

Setup is a 3-node cluster, with no fancy stuff running at all (I will be able
to show it to you next Wed as it lives on my laptop ;)).

<quorumd label="test1">
 <heuristic program="ping -c1 -t1" score="1" interval="2" tko="3"/>

test1 is a 1GB AoE device shared between the 3 nodes.

The cluster starts without problems. After firing up qdiskd -f -d:

qdiskd -f -d
[12681] debug: Loading configuration information
[12681] debug: Heuristic: 'ping -c1 -t1' score=1 interval=2 tko=3
[12681] debug: 1 heuristics loaded
[12681] debug: Quorum Daemon: 1 heuristics, 1 interval, 10 tko, 0 votes
open_partition: seek: Invalid argument
qdisk_validate: open of /dev/sda2 for RDWR failed: Illegal seek
qdisk_verify: Illegal seek
[12681] info: Quorum Partition: /dev/etherd/e1.0 Label: test1
[12681] info: Quorum Daemon Initializing
[12682] info: Heuristic: 'ping -c1 -t1' UP
[12681] debug: Node 2 is UP
[12681] debug: Node 3 is UP
[12681] info: Initial score 1/1
[12681] info: Initialization complete
[12681] notice: Score sufficient for master operation (1/1; required=1); upgra
[12681] debug: Making bid for master
[12681] info: Assuming master role

A few seconds after the node assumes the master role, it hangs. The others
follow within a matter of seconds.

aisexec is stalled in recv(..

No way to recover. kill -9 on all nodes is required.

Attached is a qdiskd strace from all 3 nodes, started at the exact same time.


PS I wonder if we are hitting this:

from qdisk/disk.c:

         * All IOs must be of size which is a multiple of 512.  Here we
         * just add in enough extra to accommodate.
         * XXX - if the on-disk offsets don't provide enough room we're cooked!
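
To make the concern concrete, here is a rough sketch of the kind of padding that comment describes, and of the overlap check that would tell us whether we are "cooked". This is not the code from qdisk/disk.c: the helper names and the SECTOR_SIZE macro are made up for illustration, and the only thing taken from the source is the "multiple of 512" rule.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 512 comes from the disk.c comment; the macro name is mine. */
#define SECTOR_SIZE 512u

/* Hypothetical helper: round len up to the next multiple of SECTOR_SIZE. */
static size_t round_up_to_sector(size_t len)
{
    size_t padded = ((len + SECTOR_SIZE - 1) / SECTOR_SIZE) * SECTOR_SIZE;

    /* Sanity check: result is aligned and never smaller than the input. */
    assert(padded % SECTOR_SIZE == 0 && padded >= len);
    return padded;
}

/*
 * Hypothetical check: does a padded I/O that starts at offset still end
 * at or before next_offset (the start of the next on-disk structure)?
 * Returns 1 if it fits, 0 if the padding would spill into the neighbour.
 */
static int padded_io_fits(size_t offset, size_t len, size_t next_offset)
{
    return offset + round_up_to_sector(len) <= next_offset;
}
```

For example, a 500-byte structure at offset 0 pads to 512 bytes and fits in front of a neighbour at offset 512, while a 513-byte structure pads to 1024 bytes and would not. Whether the real on-disk layout leaves that much room is exactly what I have not verified.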


