Some ongoing issues

Mon Jul 6 16:15:47 UTC 2009

So we've got some ongoing issues in our environment right now.  Some of
them are unrelated.  All are being worked on, but there is one in
particular I want to discuss on the list here because I just don't know
what changed.

The problem:

When one of the fas servers goes offline, most of our other apps get taken
offline.

The way it is supposed to work:

When one of our fas servers goes offline, haproxy takes it out of the
farm and sends all requests to the fas server that is still online.  This
may generate a few errors for a short time, but it generally goes
unnoticed.

The details:

When one of the fas servers goes offline, haproxy hangs on that
connection.  The application servers hang or possibly try to re-use
connections to fas, which causes the number of httpd processes on the app
servers to skyrocket.  They ultimately hit MaxClients, which takes
everything offline.  This happens fairly quickly, in a matter of seconds.
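
To put a rough number on "a matter of seconds", here is a back-of-the-
envelope sketch in Python.  The figures are made up for illustration
(a MaxClients of 256, 56 workers already busy, 40 requests per second
that end up waiting on fas); they are not measurements from our app
servers, and the function name is just something I picked.

```python
def seconds_until_max_clients(max_clients, busy_workers, hung_requests_per_sec):
    """Estimate how long until every httpd worker is stuck.

    Assumes each incoming request that needs fas pins one worker
    and never releases it while the backend is hung.
    """
    free_workers = max_clients - busy_workers
    return free_workers / hung_requests_per_sec


# Illustrative numbers only, not taken from production.
estimate = seconds_until_max_clients(
    max_clients=256, busy_workers=56, hung_requests_per_sec=40
)
print(estimate)
```

If the real numbers are anywhere near these, the workers run out in a
handful of seconds, which would match what we are seeing.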

It seems that even after haproxy flags the fas server as dead (which
takes about 15s), any connections that were open to that server at the
time aren't killed.  They just hang.
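
One way to keep a hung connection from pinning a worker forever is a
timeout on the client side.  The sketch below uses only the Python
standard library, and I am not claiming this is what python-fedora does
or should do.  The "hung" backend is a stand-in: a local socket that
accepts connections at the kernel level but never answers.
fetch_with_timeout is a hypothetical helper name.

```python
import socket
import time


def fetch_with_timeout(host, port, timeout_s):
    """Return the response bytes, or None if nothing arrives in time."""
    # The timeout given here also applies to later send/recv calls.
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.sendall(b"GET / HTTP/1.0\r\n\r\n")
        try:
            return conn.recv(4096)
        except socket.timeout:
            return None


# Stand-in for a hung server: it listens, so connects succeed,
# but nothing ever reads the request or sends a reply.
hung = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
hung.bind(("127.0.0.1", 0))
hung.listen(1)
port = hung.getsockname()[1]

start = time.monotonic()
result = fetch_with_timeout("127.0.0.1", port, timeout_s=0.5)
elapsed = time.monotonic() - start
hung.close()
```

With a timeout in place the caller gets control back after roughly
timeout_s seconds and can return an error, instead of holding an httpd
process open indefinitely.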

Outstanding questions:

What changed?

Does python-fedora (our primary interface to fas) now do something
differently with keepalive?

Is anyone else using haproxy seeing this same issue?  I still have
redispatch enabled.

	-Mike