[Linux-cluster] GFS2 cluster node is running very slow

Steven Whitehouse swhiteho at redhat.com
Thu Mar 31 14:56:08 UTC 2011


Hi,

On Thu, 2011-03-31 at 10:36 -0400, David Hill wrote:
> Hi Steve,
> 
> 	The service is degrading ... it's going slower and slower before becoming totally unresponsive.
> The only entries in the log appear when we reboot the whole cluster.
> 
> Thank you for your interest in this issue :)
> 
> Dave
> 
> 
Well, my first red flag in the info you've reported is that the fs is 98%
full. When GFS2 tries to allocate blocks, it searches through the
resource groups for free space, skipping those in use by other nodes.
With a filesystem that full, this can result in long search times,
since a large number of the resource groups will be full. I'd recommend
not going above, say, 80% full as a general rule.
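
Purely to illustrate the cost described above, here is a minimal C sketch of
that kind of resource-group scan. It is not the real GFS2 code path (the real
logic behind gfs2_inplace_reserve_i() is far more involved); the struct rgrp
fields and the find_rgrp_with_space() helper are invented for the example:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified view of a resource group: each one covers a
 * slice of the filesystem and tracks how many blocks are still free. */
struct rgrp {
    unsigned long free_blocks;   /* free data blocks in this group       */
    bool          held_remotely; /* lock currently held by another node? */
};

/* Sketch of the allocation search: walk the resource groups, skip the ones
 * another node has locked, and skip the ones that are simply full.  On a
 * nearly full filesystem almost every group fails the second test, so the
 * loop runs much longer before it finds usable space. */
static struct rgrp *find_rgrp_with_space(struct rgrp *rgrps, size_t count,
                                         unsigned long blocks_needed)
{
    for (size_t i = 0; i < count; i++) {
        struct rgrp *rg = &rgrps[i];

        if (rg->held_remotely)               /* in use by another node: skip */
            continue;
        if (rg->free_blocks < blocks_needed) /* full (the 98% case): skip    */
            continue;
        return rg;                           /* found space: reserve and use */
    }
    return NULL;                             /* scanned the whole fs, no fit */
}

int main(void)
{
    /* Toy layout: most groups are full, one is locked by another node. */
    struct rgrp rgrps[] = {
        { .free_blocks = 0,    .held_remotely = false },
        { .free_blocks = 500,  .held_remotely = true  },
        { .free_blocks = 0,    .held_remotely = false },
        { .free_blocks = 2048, .held_remotely = false },
    };
    struct rgrp *rg = find_rgrp_with_space(rgrps, 4, 100);

    if (rg)
        printf("found space in resource group %ld\n", (long)(rg - rgrps));
    else
        printf("no resource group with enough free space\n");
    return 0;
}

The point is only that the number of groups rejected inside that loop grows as
the filesystem fills up, which is why staying below the 80% mark keeps block
allocation cheap.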

Also, by not having the fs really full, you are less likely to
run into fragmentation issues. However, I would point out the
following bug:
https://bugzilla.redhat.com/show_bug.cgi?id=683155

which has caused similar problems in the past and will soon be fixed. I
still can't quite figure out why the problem should only show up on some
nodes and not others; perhaps the unaffected nodes are the ones which have
already reserved a resource group with lots of free space, and the
remaining nodes can't find one of those?

Either way, those would be the first two things that I'd look into in
order to track this down,

Steve.


> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Steven Whitehouse
> Sent: 31 mars 2011 10:25
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS2 cluster node is running very slow
> 
> Hi,
> 
> On Thu, 2011-03-31 at 10:14 -0400, David Hill wrote:
> > These directories are all on the same mount ... with a total size of 1.2TB!
> > /mnt/gfs is the mount
> > /mnt/gfs/scripts/appl01
> > /mnt/gfs/scripts/appl02
> > /mnt/gfs/scripts/appl03
> > /mnt/gfs/scripts/appl04
> > /mnt/gfs/scripts/appl05
> > /mnt/gfs/scripts/appl06
> > /mnt/gfs/scripts/appl07
> > /mnt/gfs/scripts/appl08
> > 
> > All files accessed by the application are within its own folder/subdirectory.
> > No file is ever accessed by more than one node.
> > 
> > I'm going to suggest splitting, but this also brings another issue:
> > 
> > - We have a daily GFS lockout now...  We need to reboot the whole cluster to solve the issue.
> > 
> I'm not sure what you mean by that. What actually happens? Is it just
> the filesystem that goes slow? Do you get any messages
> in /var/log/messages? Do any nodes get fenced, or does that fail too?
> 
> Steve.
> 
> > This is going bad.
> > 
> > -----Original Message-----
> > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Alan Brown
> > Sent: 31 mars 2011 07:21
> > To: linux clustering
> > Subject: Re: [Linux-cluster] GFS2 cluster node is running very slow
> > 
> > David Hill wrote:
> > > Hi Steve,
> > > 
> > > 	We seem to be experiencing some new issues now... With 4 nodes, only one was slow, but with 3 nodes, 2 of them are now slow.
> > > Two nodes are doing 20 KB/s and one is doing 2 MB/s ...  Seems like all nodes will end up with poor performance.
> > > All nodes are locking files in their own directory: /mnt/application/tomcat-1, /mnt/application/tomcat-2 ...
> > 
> > Just to clarify:
> > 
> > Are these directories on the same filesystem or are they on individual 
> > filesystems?
> > 
> > If the former, try splitting into separate filesystems.
> > 
> > Remember that one node will become the filesystem master and everything 
> > else will be slower when accessing that filesystem.
> > 
> > > I'm out of ideas on this one.
> > > 
> > > Dave
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of David Hill
> > > Sent: 30 mars 2011 11:42
> > > To: linux clustering
> > > Subject: Re: [Linux-cluster] GFS2 cluster node is running very slow
> > > 
> > > Hi Steve,
> > > 
> > > 	I think you're right about the glock ... There were MANY more of these.
> > > We're using a new server with totally different hardware.  We've done many tests
> > > before posting to the mailing list, like:
> > > - copying files from the problematic node to the other nodes without using the problematic mount: everything is fine (7 MB/s)
> > > - reading from the problematic mount on the "broken" node is fine too (21 MB/s)
> > > So, at this point, I doubt the problem is the network infrastructure behind the node (or the network adapter), because everything is going smoothly in every respect, BUT
> > > we cannot use the /mnt on the broken node because it's not usable.  The last time I tried to copy a file to that /mnt it was doing 5 KB/s while
> > > all the other nodes are doing OK at 7 MB/s ...
> > > 
> > > Whenever we do the test, it doesn't seem to go higher than 200 KB/s ...
> > > 
> > > But still, we can transfer to all nodes at a decent speed from that host.
> > > We can transfer to the SAN at a decent speed.
> > > 
> > > CPU is 0% used.
> > > Memory is 50% used.
> > > Network is 0% used.
> > > 
> > > The only difference between that host and the others is that the MySQL database is hosted locally and its storage is on the same SAN ... but even with this,
> > > mysqld is using only 2 Mbit/s on the loopback, a little bit of memory and almost NO CPU.
> > > 
> > > 
> > > Here is a capture of the system:
> > > top - 15:39:51 up  7:40,  1 user,  load average: 0.08, 0.13, 0.11
> > > Tasks: 343 total,   1 running, 342 sleeping,   0 stopped,   0 zombie
> > > Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu1  :  0.1%us,  0.0%sy,  0.0%ni, 99.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu2  :  0.1%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu3  :  0.2%us,  0.0%sy,  0.0%ni, 99.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu5  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu6  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu7  :  0.1%us,  0.0%sy,  0.0%ni, 99.8%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu8  :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu9  :  0.1%us,  0.0%sy,  0.0%ni, 99.9%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu10 :  0.0%us,  0.0%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu13 :  0.2%us,  0.0%sy,  0.0%ni, 99.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu14 :  0.1%us,  0.1%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu15 :  0.4%us,  0.1%sy,  0.0%ni, 99.4%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu16 :  0.1%us,  0.0%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu17 :  0.4%us,  0.1%sy,  0.0%ni, 99.4%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu18 :  0.2%us,  0.0%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu19 :  0.6%us,  0.1%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu20 :  0.2%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu21 :  0.6%us,  0.1%sy,  0.0%ni, 99.2%id,  0.1%wa,  0.0%hi,  0.1%si,  0.0%st
> > > Cpu22 :  0.2%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> > > Cpu23 :  0.1%us,  0.0%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
> > > Mem:  32952896k total,  2453956k used, 30498940k free,   256648k buffers
> > > Swap:  4095992k total,        0k used,  4095992k free,   684160k cached
> > > 
> > > 
> > > It's a monster for what it does.  Could it be that it's so much faster than the other nodes that it kills itself?
> > > 
> > > The server is CentOS 5.5.
> > > The filesystem is 98% full (31 GB remaining out of 1.2 TB) ... but if that is an issue, why are all the other nodes running smoothly with no issues, except that one?
> > > 
> > > 
> > > Thank you for the reply,
> > > 
> > > Dave
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Steven Whitehouse
> > > Sent: 30 mars 2011 07:48
> > > To: linux clustering
> > > Subject: Re: [Linux-cluster] GFS2 cluster node is running very slow
> > > 
> > > Hi,
> > > 
> > > On Wed, 2011-03-30 at 01:34 -0400, David Hill wrote:
> > >> Hi guys,
> > >>
> > >>  
> > >>
> > >> I’ve found this in /sys/kernel/debug/gfs2/fsname/glocks
> > >>
> > >>  
> > >>
> > >> H: s:EX f:tW e:0 p:22591 [jsvc] gfs2_inplace_reserve_i+0x451/0x69a [gfs2]
> > >> H: s:EX f:tW e:0 p:22591 [jsvc] gfs2_inplace_reserve_i+0x451/0x69a [gfs2]
> > >> H: s:EX f:W e:0 p:806 [pdflush] gfs2_write_inode+0x57/0x152 [gfs2]
> > >>
> > > This doesn't mean anything without a bit more context. Were these all
> > > queued against the same glock? If so, which glock was it?
> > > 
> > >>  
> > >>
> > >> The application running is Confluence and it has 184 threads.  The other
> > >> nodes work fine, but that specific node is having issues obtaining
> > >> locks when it’s time to write?
> > >>
> > > That does sound a bit strange. Are you using a different network card on
> > > the slow node? Have you checked to see if there is too much traffic on
> > > that network link?
> > > 
> > > Also, how full was the filesystem and which version of GFS2 are you
> > > using (i.e. RHELx, Fedora X or CentOS or....)?
> > > 
> > > 
> > > Steve.
> > > 
> > >>  
> > >>
> > >> Dave
> > >>
> > >> From: linux-cluster-bounces at redhat.com
> > >> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of David Hill
> > >> Sent: 29 mars 2011 21:00
> > >> To: linux-cluster at redhat.com
> > >> Subject: [Linux-cluster] GFS2 cluster node is running very slow
> > >>
> > >> Hi guys,
> > >>
> > >>  
> > >>
> > >>                 We have a GFS2 cluster consisting of 3 nodes.  At this
> > >> point, everything is going smoothly.  Now, we added a new node with more
> > >> CPUs and the exact same configuration, but all transactions on the mount
> > >> run very slowly.
> > >>
> > >>  
> > >>
> > >> Copying a file to the mount runs at about 25 KB/s, whereas on the three
> > >> other nodes everything goes smoothly at about 7 MB/s.
> > >>
> > >> CPU on all nodes is mostly idle, and all the cluster processes are kind
> > >> of sleeping. 
> > >>
> > >>  
> > >>
> > >> We’ve tried the ping_pong.c from apache and it seems to be able to
> > >> write/read lock files at a decent rate.
> > >>
> > >>  
> > >>
> > >> There are other mounts on the system using the same FC
> > >> card/fibers/switches/SAN, and all of these are also working at a decent
> > >> speed...
> > >>
> > >>  
> > >>
> > >> I’ve been reading a good part of the day, and I can’t seem to find a
> > >> solution.
> > >>
> > >>
> > >> David C. Hill
> > >>
> > >> Linux System Administrator - Enterprise
> > >>
> > >> 514-490-2000#5655
> > >>
> > >> http://www.ubi.com