
[Linux-cluster] Known limits ?



Hello GFS team,

I'm trying to run a GFS filesystem on ~32 nodes, but there is a problem when I start the daemons.
My nodes are named sam38 through sam70.
- I run ccsd on all nodes (using init script) at the same time, and it's ok.
- I run cman on all nodes (using init script) at the same time, and it's ok. "cman_tool nodes" tells me all nodes have joined the cluster.
- I run fenced on all nodes, one by one, one per second, and it fails from sam57 to sam70.
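
The staggered fenced start in the last step can be sketched roughly as below. This is an illustration only, not the exact script used: it assumes passwordless ssh as root to each node, and the function names and the DRY_RUN variable are hypothetical. The node names and the init script path come from this message.

```shell
#!/bin/sh
# Print the node names sam38 .. sam70, one per line.
node_names() {
    i=38
    while [ "$i" -le 70 ]; do
        echo "sam$i"
        i=$((i + 1))
    done
}

# Start fenced on each node in turn, pausing one second between nodes.
# Assumption: root ssh access to every node. With DRY_RUN=1 the commands
# are printed instead of executed, so the order can be checked first.
staggered_start() {
    for node in $(node_names); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh root@$node /etc/init.d/fenced start"
        else
            ssh "root@$node" /etc/init.d/fenced start
            sleep 1
        fi
    done
}
```

Setting DRY_RUN=1 before calling staggered_start prints the commands without touching any node.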

From the last one that succeeds (sam56), I see:
[root sam56 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 4 -
[21 19 9 8 7 1 2 3 4 6 11 13 16 23 26 27 28 32 33 25 20 14]
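
To see how many nodes actually made it into the fence domain, the bracketed member list can be counted. A minimal sketch, assuming the fence domain is the only service listed (as in the output above); count_members is a hypothetical helper name.

```shell
#!/bin/sh
# Count the node IDs in the bracketed member list(s) read from stdin.
# Assumption: only one service (the fence domain) is listed; with several
# services the counts of all member lists would be added together.
count_members() {
    # Keep lines starting with '[', strip the brackets, count the IDs.
    grep '^\[' | sed 's/[][]//g' | wc -w | tr -d ' '
}
```

Run as `count_members < /proc/cluster/services`. For the list shown above it gives 22.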

Then, on sam57, I get in /var/log/messages:
Jul 20 18:38:17 sam57 kernel: CMAN: got WAIT barrier not in phase 1 TRANSITION.44 (2)

when trying to run fenced.

Then I run "/etc/init.d/fenced stop", and I get:
Jul 20 18:47:44 sam57 fenced[28722]: process_events: service leave failed
Jul 20 18:47:44 sam57 fenced: shutdown succeeded


When I start it again:
Jul 20 18:47:45 sam57 fenced[28964]: fence_domain_add: service set level failed


After this step, I stopped everything on sam38 (fenced/ccsd/cman) to see whether taking one node out would let me get another one in, but I got this strange message on sam39:
Jul 20 19:14:04 sam39 kernel: CMAN: node sam38 has been removed from the cluster : No response to messages
Jul 20 19:14:12 sam39 kernel: SM: 00000001 process_recovery_barrier status=-104


In a previous experiment, starting fenced on all nodes at the same time led to a global cluster failure (nothing responded any more): this is why I tried to start them one by one.


All nodes are 64-bit, and I use the latest "cluster" code from the STABLE branch of the CVS. My configuration file is standard and works for a few nodes (tested many times on 5 nodes with no problem). The only unusual setting is this:
<fence_daemon post_join_delay="30"></fence_daemon>

Why does the fenced daemon fail to start where cman succeeded? I thought it was just a service like any other, built on top of CMAN. Also, what are the known limits of the cluster infrastructure in terms of number of nodes?

Do you have any advice on how to get rid of this problem? In particular, at this point, does it change anything to choose DLM instead of Gulm? (I think not, but I'd rather be sure.)

Any idea is welcome.

--
Mathieu

