Server Monitoring - A replacement for Nagios?

Thu Jul 31 02:59:04 UTC 2008

Okay, so while this was intended to be a primary discussion point for 
tomorrow's Infrastructure meeting, we had a little bit of discussion first 
in #fedora-admin, and then in #fedora-meeting, regarding Zabbix, a tool 
like Nagios that I began to set up for testing this week.

In summary, the discussion ended positively: we think it will do the job 
quite well, and we now need to sit down and work out whether we want to 
try implementing it on a limited scale in parallel with Nagios (to act 
as a comparison).

The related part of the agenda for tomorrow will now be:
* Do we want to push this into a limited trial (say 10 key-ish machines 
in our infrastructure)?
* How long would such a trial last?
* What are we going to use as a metric for such a trial?
* Are there other concerns?

Personally I'd like to see this as a step forward in revamping 
sysadmin-noc so we can reduce the work load on members in sysadmin-main.

Review the log below.

-- Nigel

10:43 < mmcgrath> G: so whats your take on how big the zabbix db will 
get?  Should we put it on db1 or on its own box?
10:43 < mmcgrath> if its on its own (probably the same one zabbix is on) 
we're lowering points of failure, but we might have to re-spec noc1 and 
noc2.
10:43 < G> mmcgrath: I'm not sure
10:44 < G> this is where it's great to have people like wakko666 and 
jcollie who use it at DAYJOB
10:44 < mmcgrath> yeah.
10:44 < G> maybe we need a nocdb1
10:44 < mmcgrath> If it needs to be pretty quick once it gets full of 
stuff, we might just want to put it on db1.
10:44 < mmcgrath> if it stays light though, we'll probably just keep it 
localhost to noc1 and give noc1 more ram / disk space.
10:45 < wakko666> mmcgrath:  the rate of growth for the zabbix DB 
directly depends on the poll rates for all of the checks
10:45 < mmcgrath> wakko666: if you don't mind my asking.... how many 
hosts do you have and how big is the db?
10:45 < G> yeah, from what I can tell also, zabbix does it's own 
housekeeping to try and consolidate some of the data
10:45 < mmcgrath> and how much stuff do you monitor?  pretty default 
stuff?  or more then the default.
10:46 < wakko666> we have 50 hosts in production, and another 100 hosts 
outside that across two zabbix nodes
10:47 < mmcgrath> wakko666: is the two zabbix nodes for high 
availability or was it because one zabbix node couldn't handle the traffic?
10:47 < wakko666> it's because they're in different locations
10:47 < mmcgrath> <nod>
10:47 < mmcgrath> Do you have a dedicated db?  How big is the raw database?
10:48 < fchiulli_> mmcgrath: I'm assuming that part of the discussion 
will be whether to have more than one zabbix monitoring host.
10:48 < wakko666> mmcgrath: we've got a dedicated mysql db for each 
node.  the production data is currently around 10-20 GB, the 
non-production node sits at around 40-50 GB
10:48 < wakko666> the key thing to note is that zabbix keeps data in two 
forms, with tunable knobs for each.
10:48 < dgilmore> wakko666: over what time period?
10:48 < mmcgrath> wakko666: you don't happen to have sar data for those 
hosts you could give to me would you?  :)
10:49 < wakko666> dgilmore: we're at about 3-4 months right now
10:49 < mmcgrath> I suppose we can start out small and move it later... 
its not really that big of a risk.
10:49 < wakko666> mmcgrath:  unfortunately, today was my last day there. 
i was "reorganized" out of a job. ;-)
10:49 < dgilmore> wakko666: so your anticipating up to 80gb for 
production a year?
10:49 < mmcgrath> wakko666: doah, well... hope all is well.
10:50 < wakko666> dgilmore: sort of.   as i was saying, there are two 
knobs. poll data, and trend data.
10:50 < wakko666> typically, we keep all polled data for about 7 days 
worth, then only keep trend data after that
10:50 < ricky> mmcgrath: IT's in now :-)
10:50 < dgilmore> much like cacti does
10:50 < mmcgrath> ricky: hilarious.
10:50 < wakko666> mmcgrath:  yeah, i'll probably be fine.  though, i 
wouldn't mind finding a spot at RH.  ;-)
10:51 < G> wakko666: wait a second, I thought if you setup multiple 
nodes they could share the same tasks?
10:51 < mmcgrath> and, correct me if I'm wrong, but zabbix doesn't store 
RRD right?  the graphs come from the database?
10:51 < G> mmcgrath: correct from what I can tell
10:51 < wakko666> mmcgrath: correct.  graphs are auto-generated, not 
RRD.  so you can create new graphs and they're autopopulated with old data
10:52 < wakko666> G: yes, nodes share the data from the tasks.  the 
zabbix-agent.conf and zabbix-server.conf help configure which node 
performs the polling
10:52 < mmcgrath> wakko666: were you using auto-recovery services?
10:52 < G> wakko666: k, so it's one big db and you just assign hosts to 
each node?
10:53 < wakko666> mmcgrath: auto-recovery?  not sure what you mean.  
perhaps you mean auto-discovery?
10:53 < G> wakko666: remote commands :)
10:53 < mmcgrath> wakko666: like if httpd dies on an app server, have 
zabbix restart it.
10:53 < wakko666> G: can be, or you can set up a db per node, or db on 
some nodes and not others.  it's pretty flexible
10:54 < wakko666> mmcgrath: ah ha!  yeah, you can have zabbix execute 
commands on healthcheck failure
10:54 < wakko666> really, the big limitation of zabbix is a couple of things
10:54 < G> I'd like to see noc1/noc2 share the zabbix checks
10:54 < wakko666> currently, in 1.4, there's no repeated notifications.  
one notify is all you get.
10:54 < G> wakko666: yeah, I noticed that
10:54 < wakko666> (it's coming in 1.6, which is due in Sept)
10:54 < mmcgrath> G: yeah, I'm totally fine re-thinking how we have our 
noc's setup.  The big things I want are:
10:55 < mmcgrath> paged alerts when a service is not available.
10:55 < mmcgrath> and email alerts when an individual service in a farm 
goes down.
10:55 < G> mmcgrath: yeah
10:55 < wakko666> mmcgrath:  yup, no troubles doing those, and you'll 
likely get finer granularity than with nagios
10:55 < mmcgrath> that got kind of tricky in one nagios instance.
10:55 < G> yep, exactly
10:56 < mmcgrath> well, and even tricker in one nagios instance in PHX :)
10:56 < mmcgrath> wakko666: if there's some services that noc1 can't get 
to but noc2 can, can you tell zabbix to always check those with noc2?
10:56 < G> mmcgrath: the nice thing is, is that you can run the 
zabbix-server on more than one server, and the web interface on totally 
different servers
10:57 < wakko666> yeah... with multiple nodes, you define checks per 
node.  so you'd configure a particular host on noc2's zabbix node.
10:57 < G> yeah, thats what we really want
10:57 < mmcgrath> yep.
10:57 < G> actually #fedora-meeting is free, shall we have an impromptu 
there?
10:57 < wakko666> works for me.
10:58 < mmcgrath> G: sure
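
As a rough sketch of the sizing point wakko666 makes above (database growth follows the poll rates, with raw polled data kept for about 7 days and only trend data after that), something like the following Python could give a first estimate. The per-row byte sizes, the one-trend-row-per-item-per-hour granularity, and the example item count are assumptions for illustration only, not figures from the discussion.

```python
SECONDS_PER_DAY = 86_400


def estimate_db_bytes(items, poll_interval_s, history_days, trend_days,
                      history_row_bytes=50, trend_row_bytes=130):
    """Rough steady-state size of the polled and trend data, in bytes.

    history_row_bytes and trend_row_bytes are assumed placeholder values;
    measure them on a real installation before trusting the result.
    """
    # Raw polled values: one row per item per poll, kept for history_days.
    polls_per_day = SECONDS_PER_DAY // poll_interval_s
    history_rows = items * polls_per_day * history_days
    # Trend data: assumed to be one aggregated row per item per hour.
    trend_rows = items * 24 * trend_days
    return history_rows * history_row_bytes + trend_rows * trend_row_bytes


if __name__ == "__main__":
    # Hypothetical example: 5000 items polled every 30 seconds,
    # 7 days of raw data and one year of trend data.
    size = estimate_db_bytes(items=5000, poll_interval_s=30,
                             history_days=7, trend_days=365)
    print(f"approx. {size / 1e9:.1f} GB")
```

The useful property is the proportionality: halving the poll interval doubles the raw-data term, while the trend term depends only on the number of items and the retention period.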

-- Discussion moved to #fedora-meeting --

10:58 -!- G changed the topic of #fedora-meeting to: sysadmin-noc - 
System Monitoring Needs
10:58 < mmcgrath> W00t
10:58 < G> ricky: dgilmore: jcollie: you folks around?
10:58 < mmcgrath> G: so I want zabbix to monitor when new versions of my 
packages are around, build them, and push them via bodhi when new 
versions are out :)
10:58  * mmcgrath runs
10:59 < wakko666> lol
10:59 < G> mmcgrath: haha :)
10:59 < ricky> G: pongish
10:59 < G> okay, so if you open your hym books to 
http://publictest3.fedoraproject.org/zabbix/overview.php we have a 
basic-ish setup atm
11:00 < wakko666> looks like the basic Linux Server template...
11:00 < G> wakko666: yeah :)
11:00 < G> wakko666: except I started moving some of the specific checks 
like apache into other templates and started linking them
11:00 < dgilmore> G: not really
11:01 < wakko666> G: that works.  one suggestion:  copy the default 
graphs for Zabbix Server into the Linux Server template so you get some 
default graphing for each host
11:01 < mmcgrath> G: any luck getting ahold of fchuili?
11:02 < G> argh, I meant to ping him back before
11:02 < G> dgilmore: no problem :)
11:02 < G> wakko666: ricky: you have accounts there now, irc nick/test
11:02 < mmcgrath> I'll drop him an email
11:03 < G> wakko666: they were in default settings iirc
11:03 < G> oh maybe not
11:03 < mmcgrath> G: you don't happen to know if we can plug this in to 
FAS do you?
11:03  * dgilmore will note he tried zabbix 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaannd founnnnnnnnnnnnnnnnnnnnd  
ituseleeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeess hard to configure and 
didnt work right
11:04 < G> wakko666: okay, done that now
11:04  * dgilmore wonders when ajax will gggget 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx  
fied:)pretty please
11:04 < G> mmcgrath: I don't think so, kinda like cacti in a way
11:04 < mmcgrath> dgilmore: thats so funny, G set this up in a matter of 
hours and have no problems at all ;-)
11:05 < wakko666> i'm not sure about plugging the auth into FAS, but 
it's PHP so at the very least it should be hackable
11:05 < dgilmore> mmcgrath: monitoring localhost worked
11:05 < dgilmore> mmcgrath: but that was it
11:05 < G> 
http://publictest3.fedoraproject.org/zabbix/charts.php?period=86400&dec=0&inc=0&left=0&right=0&stime=yyyymmddhhmm&from=0&groupid=0&hostid=10017&graphid=5
11:05 < ricky> Worst case, we put it behind basic auth.
11:05 < mmcgrath> dgilmore: time for another look :)
11:05 < G> I like the stuff like that
11:05 < mmcgrath> ricky: yeah, thats what I was thinking
11:05 < dgilmore> mmcgrath: it was about 3 or 4 months ago i think
11:05 < G> dgilmore: I got 4 hosts monitored in no time, only trouble 
was iptables on the pt machines :)
11:06 < wakko666> one note:  stacked graphs occasionally don't render 
quite right.  sometimes zabbix leaves white space between data sets
11:06 < G> wakko666: yeah, but it still shows the trend quite nicely
11:06 < wakko666> G: agreed.
11:07 < wakko666> for the web servers, setting up app-specific web 
checks is great, and fairly easy to do
11:07 < G> hmmm whats on PT7, it takes a bit of beating
11:08 < G> 
http://publictest3.fedoraproject.org/zabbix/charts.php?period=43200&dec=0&inc=0&left=0&right=0&stime=yyyymmddhhmm&from=0&groupid=0&hostid=10026&graphid=31
11:08 < G> okay, so I think the first thing is:
11:08 < G> What are our requirements?
11:08 < ricky> pt7 looks fine to me
11:08 < G> I can easily add the following:
11:08 < ricky> Ah, it's back in the green now.
11:09 < G> -> Equal checking abilities to nagios (i.e. the type of checks)
11:09 < ricky> Could you walk us through the processes of adding a 
complex check?
11:09 < G> -> Ability to send out e-mails/pagers
11:09 < ricky> And also, is there any sort of equivalent screen to 
https://admin.fedoraproject.org/nagios/cgi-bin//status.cgi?host=all&servicestatustypes=28&hoststatustypes=15 
in nagios?
11:09 < mmcgrath> and what is the difference between a "web" check and 
just your normal check?
11:09 < mmcgrath> how do we write custom plugins?
11:09 < mmcgrath> why is the sky blue?
11:09 < G> -> Ability to customise stuff
11:10 < ricky> (As little information as possible - a view with *just* 
what problems are going on)
11:10 < G> ricky: yes
11:10 < wakko666> mmcgrath:  by 'web check'  i mean,  a semi-intelligent 
check of a web-app, where you can set up a series of steps for it to 
check through such as "hit koji.fp.org, click packages, click builds, etc"
11:10 < G> 
http://publictest3.fedoraproject.org/zabbix/tr_status.php?onlytrue=true&noactions=false&compact=false&select=false&txt_select=&sort=priority
11:10 < dgilmore> can i just edit a nice easy to read config file to do 
things?
11:11 < G> dgilmore: and break it while you try to work out why it broke
11:11 < wakko666> dgilmore:  all config is done through the zabbix web gui
11:11 < G> errr
11:11 < G> and break it and spend ages working out why you broke it
11:11 < wakko666> ricky:  the zabbix equivalent to that nagios screen is 
the Monitoring -> Overview screen, though under Screens, you can set up a 
customized view as well.
11:12 < dgilmore> wakko666: to me thats really bad
11:12 < G> wakko666: the triggers = true page is like that too
11:12 < G> (the link I pasted just before)
11:12  * dgilmore personally doesnt like configuring though a web gui.  
maybe why zabbix did not work out for me
11:12 < wakko666> dgilmore:  it's a different paradigm.  i don't equate 
different to bad. not having config files doesn't strike me as a flaw.
11:13 < abadger1999> wakko666: makes it harder to manage it via puppet.
11:14 < wakko666> abadger1999: yes and no.  there are config files for 
the polling server daemon and the client-side agent.   at $dayjob, i 
push the agent configs via puppet
11:14 < abadger1999> <nod.
11:15 < ricky> I guess abadger1999 was referring to things like 
configuration for specific checks and things like that
11:15 < G> wakko666: custom checks are defined in the agent config right?
11:15 < wakko666> to me, the big thing with zabbix is that it's 
essential to back up the db, and export your configs on a regular 
basis.  it's painful to spend hours setting zabbix up, and have your db 
get corrupted and have to do all that work all over again
11:16 < ricky> So the exciting question: What problems that we're seeing 
with nagios does zabbix solve?
11:16 < wakko666> G: custom checks can be one of two things.   custom 
zabbix-agent checks, and zabbix server-side remote checks
11:16 < G> wakko666: oh thats extra nifyt
11:16 < ricky> One thing is combining cacti functionality - what else?
11:16 < G> *nifty
11:16 < G> ricky: distributed monitoring :)
11:17 < ricky> Can you elaborate a bit? :-)
11:17 < G> and has Brett pointed out before, complex checks
11:17 < wakko666> for me, zabbix does templates and rapid configuration 
of new hosts significantly better than nagios
11:17 < G> errr complex web checks
11:17 < G> yeah, the templating looks _REALLY_ good
11:18 < wakko666> zabbix also is more granular than both cacti and 
nagios.  the default network traffic checks are done every 5 seconds
11:18 < mmcgrath> G: I take it it has similar workflow that nagios has?  
(not that we used it?)
11:18 < G> build a profile of the typical application server apply the 
template to all the app servers and your home free
11:18 < ricky> Do you have a link where I can see the templating 
coolness in action?
11:18 < mmcgrath> but outage happens, someone ack's it and starts working?
11:18 < G> mmcgrath: ack etc? yeah
11:18 < f13> darn, I have to leave, but I'm really interested in what 
platform wins out.  Particularly interested in zenoss vs zabbix
11:18 < ricky> Because right now, I'm visualizing hostgroups in nagios
11:18 < wakko666> mmcgrath: yes. same basic workflow
11:19 < wakko666> f13:  i vote zabbix over zenoss simply because zabbix 
doesn't use rpath
11:19 < ricky> f13: zenoss = zope :-(
11:19 < mmcgrath> wakko666: G: how hard is it to script outages?
11:19 < G> that'd be something brett would have to answer
11:20 < f13> wakko666: there is that.
11:20 < f13> ricky: good point.
11:20 < f13> zenoss had something going for it in that previous 
cacti/nagios stuff would work with it, or so was the claim
11:20 < wakko666> outages are the one thing about zabbix that is a bit 
unclear to me.  i think the best analogue is to disable monitoring (a 
single drop-down box), or to acknowledge the alert
11:21  * ricky still hasn't figured out where he can see templates
11:21 < wakko666> being that zabbix doesn't do repeated alerts, you'll 
only get a single "down" page anyway...
11:21 < mmcgrath> wakko666: as in its difficult to schedule an outage 
ahead of time?
11:21 < G> ricky: 
http://publictest3.fedoraproject.org/zabbix/hosts.php?groupid=0&config=2
11:21 < wakko666> ricky: Configuration > Items or Triggers.  there's a 
Template drop-down
11:21 < ricky> Aha
11:22 < wakko666> mmcgrath: yeah, basically.  as far as i've seen, 
zabbix doesn't yet have the concept of scheduled outages.  a service is 
either up or down, and not much beyond that
11:23 < wakko666> i suspect that may be on their todo list for the next 
version, though
11:23 < ricky> So where can I see the linkage between a template and the 
checks for that template?
11:23 < G> I don't think it's an exact issue
11:23 < G> ricky: Items
11:23 < jcollie> you could always shut down the zabbix server :)
11:24 < wakko666> ricky: the expression column will have the template 
name in it
11:24 < mmcgrath> I've only looked a little bit but... how well does 
service deps work?
11:24 < wakko666> ricky:  err... not expression column... the name column.
11:24 < ricky> I think I got it
11:24 < wakko666> mmcgrath: dependencies are dead easy.
11:24 < G> mmcgrath: it'd appear you can add multiple dependences per 
trigger
11:25 < G> 
http://publictest3.fedoraproject.org/zabbix/triggers.php?form=update&triggerid=10043&hostid=10001
11:25 < wakko666> if you check apache on host A, but that check goes 
through router B, you add a dependency on the apache check so that the 
check doesn't execute unless the checks for router B are passing.
11:28 < mmcgrath> So really
11:29 < mmcgrath> G: how about this...  We give it a quick talk tomorrow 
at the meeting there.  If there's no blockers or major opposition.  We 
get it on noc1 and get to work?
11:29 < G> mmcgrath: so your happy with what I've done on pt3 so far?
11:30 < mmcgrath> Yeah so far.  I'd like to see it monitoring a couple 
of things along side nagios, both sending notifications, and see how it 
does in production.
11:30 < mmcgrath> so not spending a ton of time on it, but monitoring a 
few critical bits that frequently have problems.
11:30 < G> in that case sure, except if we are putting into production, 
I guess we should grab Jeff's 0.4.6 update and put it in f-i until it 
appears in epel
11:31 < G> I'll be happy to lead that task
11:31 < G> wakko666: jcollie: you both in sysadmin-noc?
11:31 < mmcgrath> G: excellent.
11:31 < wakko666> G: applying now. :)
11:31 < G> I'll sponsor you :)
11:32 < wakko666>  yay!  :-)
11:32 < G> mmcgrath: I think we'll leave the internal authentication for 
now, I'll leave the main part readable by everyone, and add accounts for 
everyone in sysadmin-main/noc thats active
11:33 < G> wakko666: done
11:33 < mmcgrath> G: thats fine.
11:34 < G> okay, so adjourned until the inframeeting 2000UTC tomorrow :)
11:34 < ricky> How can I trigger a check?
11:34 < wakko666> ricky:  turn off the service that it's checking.   ;-)
11:34 < ricky> Oh.
11:35 < wakko666> you can also just flip the logic of the trigger.
11:35 -!- G changed the topic of #fedora-meeting to: Channel is used by 
various Fedora groups and committees for their regular meetings | Note 
that meetings often get logged | For questions about using Fedora please 
ask in #fedora | See 
http://fedoraproject.org/wiki/Communicate/FedoraMeetingChannel for 
meeting schedule
11:36 < G> I'll post a log to the infra-list soon so people can have a 
read before the main meeting
11:37 < mmcgrath> G: good idea