
Re: [Linux-cluster] Inconsistent cluster view, shutdown, kernel panic



Moreno Baricevic wrote:
> 
> Hello,
> 
> we are trying to install GFS (cluster-1.02 on vanilla 2.6.16.16) on a
> CentOS cluster of 70 "diskless" nodes.
> 
> The structure is something like this:
> 
> +---+   GNBD-SERVERS                  GNBD CLIENTS
> |   |-----[node63]-----[node64 node65 node66 node67 node68 node69]
> | S |.....
> | A |.....
> | N |-----[node07]-----[node08 node09 node10 node11 node12 node13]
> |   |-----[node00]-----[node01 node02 node03 node04 node05 node06]
> +---+
> 
> All the nodes have a gigabit NIC and all the nodes see each other.
> Only the gnbd-servers have a fiber adapter to connect to the SAN.
> 
> Everything works fine as long as we test on 33 nodes: 9 nodes with the
> fiber adapter (acting as both GFS nodes and gnbd-servers) and 24 gnbd
> clients (connected to 4 of the gnbd-servers). "Fine" means that we have
> been able to mount and use the GFS filesystem.
> 
> When we try to start cman on 39 nodes (or worse, when we try with 63
> nodes), roughly half of the nodes soon get this:
> 
>     "kernel panic - not syncing: membership stopped responding"
> 
> We tried to increase CMAN_CLUSTER_TIMEOUT and CMAN_QUORUM_TIMEOUT
> (/etc/init.d/cman), but the problem persists.
> 
> We tried to boot the nodes 10 at a time, with a 2-minute delay between
> groups. As soon as we reach quorum (or one of the timeouts?) the
> nodes start collapsing due to "Inconsistent cluster view", "Shutdown",
> "No response to messages".
> 
> We also tried the patch supplied as a solution for bug report 187777,
> but nothing changed.
> 
> Is there a limit on the number of nodes, a timeout, or any other issue
> that we didn't consider?


To be honest, cman has never been tested beyond 32 nodes, to my knowledge. For
large clusters you may well be better off using gulm, at least in the short term.
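
As a rough aid for reasoning about when the staged boot described above
would cross the quorum line, here is a minimal sketch. It assumes one vote
per node and a simple-majority quorum (expected votes / 2 + 1, integer
division); the real cluster.conf may assign votes differently, and the
function names are hypothetical, not part of cman.

```python
import math


def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum under a simple-majority rule.

    Assumption: every node carries exactly one vote.
    """
    return expected_votes // 2 + 1


def batch_reaching_quorum(total_nodes: int, batch_size: int) -> int:
    """1-based index of the startup batch at which quorum is first reached.

    Assumption: every node in a batch joins successfully.
    """
    return math.ceil(quorum_threshold(total_nodes) / batch_size)


if __name__ == "__main__":
    # Node counts mentioned in the report; batches of 10 as described.
    for nodes in (33, 39, 63):
        print(
            f"{nodes} nodes: quorum at {quorum_threshold(nodes)} votes, "
            f"reached in batch {batch_reaching_quorum(nodes, 10)}"
        )
```

Under these assumptions, comparing the batch number against the time at
which nodes start dropping out may help tell a quorum-related failure from
a timeout-related one.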

> Here you can find the cluster.conf, logs from surviving and dead nodes,
> tcpdump for UDP:6809, nodes' /proc/cluster/{status,nodes,services}:
> 
>     http://www.democritos.it/~baro/gfs-test/
> 
> There's a lot of stuff, let me know if you need something more specific.

I'll have a look through those logs... but it may take me some time!

Thanks,

Patrick

-- 

patrick

