Frequent Buildsystem outages

Over the last week we've seen a large increase in buildsystem outages.
Some lasted only a few minutes, others a couple of hours.  Most of the
longer outages were due to us probing and testing new things.

There are a number of issues we're working on right now, and while it's not
totally clear whether last week's mass rebuild caused some of these issues to
come out of the woodwork, it did put more load on the boxes in question.

We've been able to mitigate some of these issues to cause as little impact
on the environment as possible, but without more hardware (which is on the
way) it's likely we'll continue to have outages from time to time.

At its core we have 3 main issues that, at this time, seem unrelated but
all of which caused at least 1 outage last week.

1) Load on the db and the number of connections to it cause other
applications to stop working.  We've blocked search engine crawlers (via
robots.txt) and disabled a few db-heavy selects to help mitigate this issue
for now.

2) Our NFS server has, on a couple of occasions, had lockd fail and leave
its port open.  This leaves lockd unable to restart.  Specifying a
different port allows lockd to restart, but because the kernel is unaware
of this new port, rpcinfo still reports the wrong port and clients can't
connect to it.  An update to nfs-utils was suggested, and with this new
version we have yet to see this problem.  It's still a bit early, but
hopefully this is fixed.

3) Machine instability.  Unfortunately the physical machine running both
the nfs share and releng1 (where bodhi lives) has been rebooting
unexpectedly.  I've submitted a kernel bug on this, but unfortunately we
just don't have much information to go on:


We've got a logger on the console but have yet to capture anything.
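
For the lockd problem in item 2, the mismatch between the port rpcinfo
advertises and the port lockd is really listening on can be checked
mechanically.  The following is only a minimal sketch, not something we
run today: it parses the text printed by `rpcinfo -p` and reports whether
the port you know lockd was restarted on is missing from the nlockmgr
entries.  The function names, the sample output and the port numbers
(32803, 4001) are all made up for illustration.

```python
# Hypothetical sample of `rpcinfo -p` output; the ports are illustrative only.
SAMPLE = """\
   program vers proto   port  service
    100000    2   tcp    111  portmapper
    100021    1   udp  32803  nlockmgr
    100021    3   tcp  32803  nlockmgr
"""


def nlockmgr_ports(rpcinfo_output):
    """Return the set of ports that the rpcinfo text lists for nlockmgr."""
    ports = set()
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Data rows have five columns: program, vers, proto, port, service.
        # The header row also has five columns but its last one is "service".
        if len(fields) == 5 and fields[4] == "nlockmgr":
            ports.add(int(fields[3]))
    return ports


def lockd_port_mismatch(rpcinfo_output, actual_port):
    """True if rpcinfo does not advertise the port lockd is actually using."""
    return actual_port not in nlockmgr_ports(rpcinfo_output)


if __name__ == "__main__":
    # Assumed scenario: lockd was restarted by hand on port 4001.
    if lockd_port_mismatch(SAMPLE, 4001):
        print("rpcinfo advertises a stale lockd port; clients will fail")
```

In practice the text would come from running `rpcinfo -p` against the NFS
server rather than from a hard-coded sample.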

As always, we appreciate your understanding.  We're working on it and hope
the instabilities will level out soon.


