head node has an extremely high load average.

Doll, Margaret Ann margaret_doll at brown.edu
Wed Jun 26 19:59:20 UTC 2013


The users' home directories are exported over NFS to the compute nodes.
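
For what it is worth, here is a minimal Python sketch of how the top header quoted below could be read. It has not been run on this cluster. On Linux, the load average counts tasks in uninterruptible sleep (state D, usually waiting on disk or NFS I/O) as well as runnable ones, so a high load with a high %wa points at I/O rather than CPU. The function names and the 50% iowait threshold are my own assumptions for illustration, not part of any tool.

```python
import os
import re


def parse_top_header(text):
    """Return (1-minute load average, iowait percent) from top's summary lines."""
    flat = " ".join(text.split())  # undo any line wrapping added by mail clients
    load = re.search(r"load average:\s*([\d.]+)", flat)
    wait = re.search(r"([\d.]+)\s*%?\s*wa\b", flat)
    if load is None or wait is None:
        raise ValueError("could not find load average or iowait in top output")
    return float(load.group(1)), float(wait.group(1))


def classify(load1, iowait_pct, ncpu, iowait_threshold=50.0):
    """Rough triage; the 50% iowait threshold is an arbitrary assumption."""
    if load1 <= ncpu:
        return "ok"
    return "io-bound" if iowait_pct >= iowait_threshold else "cpu-bound"


def uninterruptible_tasks(proc="/proc"):
    """List (pid, command) for processes currently in state D (Linux only)."""
    found = []
    try:
        entries = os.listdir(proc)
    except OSError:
        return found  # no /proc here, e.g. not a Linux host
    for pid in filter(str.isdigit, entries):
        try:
            with open(os.path.join(proc, pid, "stat"), errors="replace") as fh:
                stat = fh.read()
        except OSError:
            continue  # the process exited while we were scanning
        # Format is "pid (comm) state ..."; comm may itself contain ")".
        head, _, tail = stat.rpartition(")")
        if tail.split()[:1] == ["D"]:
            found.append((int(pid), head.partition("(")[2]))
    return found
```

Fed the figures from the original post (load 13.65, 91.7%wa, 2 CPUs), `classify` returns "io-bound". Running `uninterruptible_tasks()` on the head node while the load is high would show which processes are stuck waiting, for example the nfsd threads.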

On Wed, Jun 26, 2013 at 3:35 PM, Jonathan Billings <jsbillin at umich.edu> wrote:

> Hello,
>
> Is your head node an NFS server, and are the jobs writing to the NFS share?
>
>
> On Wed, Jun 26, 2013 at 3:27 PM, Doll, Margaret Ann <margaret_doll at brown.edu> wrote:
>
> > I have a computer cluster running Rocks 5.2, CentOS 6.
> >
> > The head node is overloaded.  There are 2 CPUs on the head node.
> >
> > top - 14:27:49 up 1 day,  6:11,  6 users,  load average: 13.65, 14.12, 13.92
> > Tasks: 168 total,   3 running, 163 sleeping,   0 stopped,   2 zombie
> > Cpu(s):  1.2%us,  1.9%sy,  0.0%ni,  0.0%id, 91.7%wa,  1.0%hi,  4.1%si,  0.0%st
> > Mem:   2053088k total,  2001464k used,    51624k free,    74476k buffers
> > Swap:  1020116k total,      388k used,  1019728k free,  1638076k cached
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >  2515 nobody    15   0  218m 3176 1048 S  2.3  0.2   8:46.23 gmetad
> >  2967 root      15   0     0    0    0 S  2.0  0.0   0:20.31 nfsd
> >  2970 root      15   0     0    0    0 R  1.0  0.0   0:20.60 nfsd
> >  3110 nobody    15   0  198m  20m 3360 S  0.3  1.0   4:22.71 gmond
> > 29788 mad       15   0 90736 2336 1084 S  0.3  0.1   0:02.91 sshd
> >     1 root      15   0 10372  684  572 S  0.0  0.0   0:00.51 init
> >     2 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/0
> >     3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
> >     4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
> >
> > Everyone is logged off of the head node.  Four jobs are running on the
> > compute nodes, but I believe they are non-parallel jobs, which should cause
> > no traffic on the head node.  The load average on each of the compute nodes
> > is less than 8.  Each compute node has 8 CPUs.
> >
> > How can I find the problem?   I have seen the zombies go as high as 2 on
> > the head node; most of the time there are 0 zombies.
> >
> > I did reboot the head node, but the problem comes back fairly quickly.
> > --
> > redhat-list mailing list
> > unsubscribe mailto:redhat-list-request at redhat.com?subject=unsubscribe
> > https://www.redhat.com/mailman/listinfo/redhat-list
> >
>
>
>
> --
> Jonathan Billings <jsbillin at umich.edu>
> College of Engineering - CAEN - Unix and Linux Support


