How much downtime do we afford for nagios?

Sun Apr 27 17:21:38 UTC 2008

On Mon, 28 Apr 2008, Nigel Jones wrote:

> On Sun, April 27, 2008 11:01 pm, Jeroen van Meeuwen wrote:
> > Nigel Jones wrote:
> >> Looking through my email, from what I can recall there are no false
> >> positives.  xen6 had to be power-cycled which caused all the other
> >> collateral notifications.
> >>
> >
> > Collateral notifications can be caught using service dependencies and
> > parent hosts. Do we currently use any?
> I believe we do, but it wouldn't have helped in this case (I've done a bit
> more digging)
>
> Half the notifications came from the external nagios instance on noc2,
> while the xen6/db alerts came from the internal nagios instance. Another
> reason why I like the current setup and don't think we should change a
> thing :)
>
> Also, the UNKNOWN alerts weren't that bad; they were a precursor to the
> box having to be restarted. Only in this case were the up/down alerts a
> little useless.  However, I'd sooner keep them as is, because otherwise we
> run the risk of not noticing a box down immediately and getting everyone
> under the moon asking "why can't I access fedoraproject.org... it's down,
> your OS can't be that good".

One thing I would like implemented is event handlers.  Some things
(probably not this thing) could be handled automatically for us.
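
To make the idea concrete, here is a rough sketch of what a service event
handler could look like. It assumes the handler command is defined to pass
the standard $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ macros
as arguments. The restart command, the httpd service and the "third soft
attempt" threshold are placeholders for illustration, not anything we have
deployed.

```python
#!/usr/bin/env python
"""Sketch of a Nagios service event handler (illustrative only)."""
import subprocess
import sys

# Hypothetical restart command; swap in whatever suits the service.
RESTART_CMD = ["/sbin/service", "httpd", "restart"]


def should_restart(state, state_type, attempt, soft_attempt=3):
    """Decide whether the handler should attempt a restart.

    state       -- value of $SERVICESTATE$ (OK, WARNING, UNKNOWN, CRITICAL)
    state_type  -- value of $SERVICESTATETYPE$ (SOFT or HARD)
    attempt     -- value of $SERVICEATTEMPT$ (current check attempt)
    """
    if state != "CRITICAL":
        # Leave OK, WARNING and UNKNOWN alone.
        return False
    if state_type == "HARD":
        # The problem is confirmed, so try a restart.
        return True
    # While still SOFT, act only on one chosen retry (assumed: the third),
    # so a restart can happen before notifications go out.
    return state_type == "SOFT" and int(attempt) == soft_attempt


def main(argv):
    state, state_type, attempt = argv[1:4]
    if should_restart(state, state_type, attempt):
        return subprocess.call(RESTART_CMD)
    return 0


# Only act when invoked with exactly the three macro arguments.
if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(main(sys.argv))
```

With something like this, a single CRITICAL blip would be ignored, a service
that stays CRITICAL would get one restart attempt before anyone is paged, and
UNKNOWN results would still reach us untouched.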

	-Mike