[Linux-cluster] Maximum number of nodes

Jeff Sturm jeff.sturm at eprize.com
Tue May 4 18:14:20 UTC 2010


> -----Original Message-----
> > What is the maximum number of nodes that will work within a single
> > cluster?
> >
> The limit is 16 nodes.
> 
> > From where do the limitations come? GFS2? Qdisk? What if not using
> > qdisk? What if not using GFS2?
> >
> > Thank you!
> The limit is down to what we can reasonably test, and thus what is
> supported. The theoretical limit is much higher and it may be possible
> to raise the supported node limits in future.

As a data point, we run a production cluster with 24 nodes and haven't
observed any adverse effects.  Our cluster has grown slowly from a
smaller initial deployment by adding nodes to meet capacity.

(However we don't run a Red Hat OS nor are we a Red Hat support
customer.  Therefore we are relying on our own resources, community
goodwill, and a little bit of luck to keep this thing running.)

A few notes about our deployment...

CMAN, based on OpenAIS (now Corosync), is a remarkably efficient cluster
monitor.  Virtual Synchrony is an elegant protocol and appears to scale
well in practice.  We don't observe significant overhead on our network
interfaces due to cluster traffic, and we don't see erratic behavior due
to latency as the cluster grows (though totem parameters may have to be
adjusted at some point).  That said, the cluster is only as good as your
network, and your network *must* handle IP multicast properly.  (If you
ever suspect a network is faulty somewhere, try running a cluster on it.
Your suspicions may be quickly confirmed!)
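
(A concrete example of the totem tuning I mean: with cman these knobs
live in cluster.conf.  The values below are purely illustrative--not
what we run--but they show where the token timeout and retransmit
count are set, so a brief network or scheduling hiccup on a busy node
doesn't get it declared dead:

  <!-- illustrative values only; check the cman/openais docs for
       the defaults and limits on your release -->
  <totem token="21000" token_retransmits_before_loss_const="10"/>

As always, try this sort of change on a test cluster first.)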

DLM and GFS are not part of CMAN, but work alongside it.   I don't know
what limits they may have.  I suspect we'd reach throughput limits on
our SAN before anything else if we tried to grow our cluster
significantly--we're already at several thousand iops sustained, and the
SAN is a specialized component that doesn't scale just by adding cluster
nodes.  DLM scalability depends highly on the application--when we
made our app locality-aware, locking problems went away.
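
(To make "locality-aware" a bit more concrete: roughly speaking, if
each node works under its own directory on the shared filesystem, the
DLM locks for those inodes stay cached on that node instead of
bouncing around the cluster.  A hypothetical sketch of the idea--not
our actual paths or code:

  # each node confines its writes to its own subtree
  WORKDIR=/gfs01/work/$(hostname -s)
  mkdir -p "$WORKDIR"

Our real change was application-specific, but the principle was the
same: stop many nodes from writing the same files and directories at
once.)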

We have many GFS filesystems, not just one, and none of them span all
nodes of our cluster.  The largest one is mounted across 22 nodes.
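
(One practical detail when a GFS filesystem spans many nodes: every
node that mounts it needs its own journal, and the journal count is
chosen at mkfs time (more can be added later with gfs_jadd).  So a
filesystem meant for ~22 nodes would be created with something like
this--device and names hypothetical:

  gfs_mkfs -p lock_dlm -t mycluster:data01 -j 24 /dev/vg0/data01

leaving a couple of spare journals for growth.  The GFS2 equivalents
are mkfs.gfs2 and gfs2_jadd.)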

I don't plan to increase the node count of our cluster much further, if
at all.  With increasing multicore hardware available, we're more likely
to scale up by replacing nodes with 8-way or 16-way units.  (I'm curious
to know what the practical limits are, but don't plan to really push the
envelope in a production cluster.)

Jeff





