Sending Alerts With Graphite Graphs From Nagios

Source: Internet
Author: User

Disclaimer

The way I'm doing this relies on a feature I wrote for Graphite that was only recently merged to trunk, so at the time of writing that feature isn't in a stable release. Hopefully it'll be in 0.9.10. Until then, you can at least test this setup using Graphite's trunk version.

Oh yeah, the new feature is the ability to send graph images (not links) via email. I surfaced this feature in Graphite through the graph menu that pops up when you click on a graph in Graphite, but implemented it such that it's pretty easy to call from a script (which I also wrote; you'll see if you read the post).

Also, note that I assume you already know Nagios, how to install new command scripts, and all that. It's really easy to figure this stuff out in Nagios, and it's well documented elsewhere, so I don't cover anything here but the configuration of this new feature.

The idea

I'm not a huge fan of Nagios, to be honest. As far as I know, nobody really is. We all just use it because it's there, and the alternatives are either overkill, unstable, too complex, or just don't provide much value for all the extra overhead that comes with them (whether that's config overhead, administrative overhead, processing overhead, or whatever, depending on the specific alternative you're looking at). So... Nagios it is.

One thing that *is* pretty nice about Nagios is that configuration is really dead simple. Another thing is that you can do pretty much whatever you want with it, and write code in any language you want to get things done. We'll take advantage of these two features to actually do a couple of things:

    • Monitor a metric by polling Graphite for it directly.
    • Tell Nagios to fire off a script that'll go get the graph for the problematic metric, and send email with the graph embedded in it to the configured contacts.
    • Record that we sent the alert back in Graphite, so we can overlay those events on the corresponding metric graph and verify that alerts are going out when they should, that the outgoing alerts are hitting your phone without delay, etc.

The Candy

Just to be clear, we're going to set things up so you can get alert messages from Nagios that look like this:

[Image: alert email from Nagios with the Graphite graph embedded]

And you'll also be able to track those alert events in Graphite, in graphs that look like this (note the vertical lines; those are the alert events):

[Image: Graphite graph with vertical lines marking the alert events]

Defining Contacts

In production, it's possible that the proper contacts and contact groups already exist. For testing (and maybe production) you might find that you want to limit who receives Graphite graphs in email notifications. To test things out, I defined:

    • A new contact template that's configured specifically to receive the Graphite graphs. Without this, no graphs.
    • A new contact that uses the template.
    • A new contact group containing said contact.

For testing, you can create a test contact template in templates.cfg:

define contact{
        name                            graphite-contact
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r,f,s
        host_notification_options       d,u,r,f,s
        service_notification_commands   notify-svcgraph-by-email
        host_notification_commands      notify-host-by-email
        register                        0
        }

You'll notice a few things here:

    • This is not a contact, only a template (note the 'register 0').
    • Any contact defined using this template will be notified of service issues with the command 'notify-svcgraph-by-email', which we'll define in a moment.

In contacts.cfg, you can now define an individual contact that uses the graphite-contact template we just assembled:

define contact{
        contact_name    graphiteuser
        use             graphite-contact
        alias           Graphite User
        email           someone@example.com
        }

Of course, you'll want to change the 'email' attribute here, even for testing.

Once done, you also want to have a contact group set up that contains this new 'graphiteuser', so that you can add users to the group to expand the testing, or evolve things into production. This is also done in contacts.cfg:

define contactgroup{
        contactgroup_name       graphiteadmins
        alias                   Graphite Administrators
        members                 graphiteuser
        }

Defining a Service

Also for testing, you can set up a test service. It's necessary in this case to bypass default settings, which seek to not bombard contacts by sending an email for every single aberrant check. Since the end result of this test is to see an email, we want to get an email for every check where the value is in any way out of bounds. In templates.cfg put this:

define service{
    name                        test-service
    use                         generic-service
    passive_checks_enabled      0
    contact_groups              graphiteadmins
    check_interval              20
    retry_interval              2
    notification_options        w,u,c,r,f
    notification_interval       30
    first_notification_delay    0
    flap_detection_enabled      1
    max_check_attempts          2
    register                    0
    }

Again, the key point here is to ensure that no notifications are ever silenced, deferred, or delayed by Nagios in any way, for any reason. You probably don't want this in production. The other point is that when you set up an alert for a service that uses 'test-service' in its definition, the alerts will go to our previously defined 'graphiteadmins'.

To make use of this service, I've defined a service in 'localhost.cfg' that will require further explanation, but first let's just look at the definition:

define service{
        use                             test-service
        host_name                       localhost
        service_description             Some Important Metric
        _GRAPHURL                       "http://graphite.example.com/render?width=800&from=-1hours&until=now&target=graphite.path.to.target"
        check_command                   check_graphite_data!24!36
        notifications_enabled           1
        }

There are two new things we need to understand when looking at this definition:

    • What is 'check_graphite_data'?
    • What is '_GRAPHURL'?

These questions are answered in the following sections.

In addition, you should know that the value for _GRAPHURL is intended to come straight from the Graphite dashboard. Go to your dashboard, pick a graph of a single metric, grab the URL for the graph, and paste it in (and double-quote it).
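
If you'd rather assemble that URL than copy it from the dashboard, a minimal Python sketch might look like the following. The query parameters (width, from, until, target) are the ones in the example service definition above; the function name build_graph_url and its defaults are hypothetical, made up for illustration.

```python
from urllib.parse import urlencode


def build_graph_url(base, target, width=800, frm="-1hours", until="now"):
    """Assemble a Graphite render URL for a single metric.

    build_graph_url is a hypothetical helper; only the query
    parameters themselves come from the example service definition.
    """
    query = urlencode(
        {"width": width, "from": frm, "until": until, "target": target}
    )
    return f"{base.rstrip('/')}/render?{query}"


url = build_graph_url("http://graphite.example.com", "graphite.path.to.target")
print(url)
```

Pasting the printed URL into a browser is a quick way to confirm it renders the single metric you expect before handing it to Nagios.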

Defining the 'check_graphite_data' Command

This command relies on a small script written by the folks at Etsy, which can be found on GitHub: https://github.com/etsy/nagios_tools/blob/master/check_graphite_data

Here's the commands.cfg definition for the command:

# 'check_graphite_data' command definition
define command{
        command_name    check_graphite_data
        command_line    $USER1$/check_graphite_data -u $_SERVICEGRAPHURL$ -w $ARG1$ -c $ARG2$
        }

The 'command_line' attribute calls the check_graphite_data script we got from GitHub earlier. The '-u' flag is a URL, and this is actually using the custom object attribute '_GRAPHURL' from our service definition. You can see more about custom object variables here: http://nagios.sourceforge.net/docs/3_0/customobjectvars.html. The short story is, since we defined _GRAPHURL in a service definition, it gets prepended with 'SERVICE', and the underscore in '_GRAPHURL' moves to the front, giving you '$_SERVICEGRAPHURL$'. More on how that works at the link provided.
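
The naming rule is easy to get wrong by hand, so here is a small Python sketch of it. This only illustrates the rule described above; custom_macro is a hypothetical helper, not part of Nagios or of either script in this post.

```python
def custom_macro(object_type, var_name):
    """Return the Nagios macro name for a custom object variable.

    object_type is "host", "service" or "contact"; var_name is the
    variable as written in the object definition, e.g. "_GRAPHURL".
    Hypothetical helper, for illustration only.
    """
    if not var_name.startswith("_"):
        raise ValueError("custom variables must start with an underscore")
    # The underscore moves to the front, followed by the object type
    # and then the variable name, all in upper case.
    return "$_{}{}$".format(object_type.upper(), var_name[1:].upper())


print(custom_macro("service", "_GRAPHURL"))  # $_SERVICEGRAPHURL$
```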

The '-w' and '-c' flags to check_graphite_data are 'warning' and 'critical' thresholds, respectively, and they correlate to the positions of the service definition's 'check_command' arguments (so, check_graphite_data!24!36 maps to 'check_graphite_data -u <url> -w 24 -c 36').
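
To make the argument mapping concrete, here's a rough Python sketch of what the substitution amounts to, plus a toy threshold check. Both functions are hypothetical. In particular, classify assumes that higher values are worse and uses the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL); it is not the Etsy script's actual logic, which may compare differently.

```python
def expand_check_command(check_command, command_line):
    """Substitute $ARG1$, $ARG2$, ... from a '!'-separated check_command."""
    name, *args = check_command.split("!")
    for position, value in enumerate(args, start=1):
        command_line = command_line.replace(f"$ARG{position}$", value)
    return name, command_line


def classify(value, warning, critical):
    """Map a metric value to a Nagios exit code.

    Assumption: higher is worse. The real check script may differ.
    """
    if value >= critical:
        return 2  # CRITICAL
    if value >= warning:
        return 1  # WARNING
    return 0  # OK


name, line = expand_check_command(
    "check_graphite_data!24!36",
    "$USER1$/check_graphite_data -u $_SERVICEGRAPHURL$ -w $ARG1$ -c $ARG2$",
)
print(line)
print(classify(30, warning=24, critical=36))
```

Nagios fills in $USER1$ and $_SERVICEGRAPHURL$ itself, so the sketch leaves those macros alone.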

Defining the 'notify-svcgraph-by-email' Command

This command relies on a script that I wrote in Python called 'sendgraph.py', which also lives on GitHub: https://gist.github.com/1902478

The script does two things:

    • It emails the graph that corresponds to the metric being checked by Nagios, and
    • It pings back to Graphite to record the alert itself as an event, so that you can define a graph for, say, 'Apache Load', and if you want you can also overlay the alert events on top of the 'Apache Load' metric and vet that alerts are going out when you expect. It's also a good test to see that you're actually getting the alerts this script tries to send, and that they're not being dropped or seriously delayed.
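
For readers who want a feel for those two steps without the Graphite trunk feature, here is a standalone Python sketch. It is not how sendgraph.py works: sendgraph.py relies on the Graphite email feature mentioned in the disclaimer, whereas this sketch swaps in the standard library's email package to embed a PNG, and formats the event as a line of Carbon's plaintext protocol ("path value timestamp"). All function names and the metric path are hypothetical.

```python
import html
import time
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_alert_email(sender, recipient, service, state, png_bytes):
    """Build an HTML email with the graph embedded inline (not attached as a link)."""
    msg = MIMEMultipart("related")
    msg["Subject"] = f"{state}: {service}"
    msg["From"] = sender
    msg["To"] = recipient
    body = (
        f"<p>{html.escape(service)} is {html.escape(state)}</p>"
        '<img src="cid:graph">'
    )
    msg.attach(MIMEText(body, "html"))
    image = MIMEImage(png_bytes, _subtype="png")
    image.add_header("Content-ID", "<graph>")  # matches cid:graph above
    image.add_header("Content-Disposition", "inline", filename="graph.png")
    msg.attach(image)
    return msg


def carbon_line(metric_path, value=1, timestamp=None):
    """Format one data point for Carbon's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric_path} {value} {timestamp}\n"


# Placeholder bytes stand in for a PNG fetched from the graph URL.
message = build_alert_email(
    "nagios@example.com", "someone@example.com",
    "Some Important Metric", "CRITICAL", b"placeholder-png-bytes",
)
event = carbon_line("alerts.some_important_metric", 1, 1330000000)
```

Actually delivering these would take something like smtplib.SMTP(...).send_message(message) and a socket write of event.encode() to Carbon's plaintext listener (port 2003 by default); both are left out so the sketch runs without a mail server or a Graphite instance.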

To make use of the script in Nagios, let's define the command that actually sends the alert:

define command{
    command_name    notify-svcgraph-by-email
    command_line    /path/to/sendgraph.py -u "$_SERVICEGRAPHURL$" -t $CONTACTEMAIL$ -n "$SERVICEDESC$" -s $SERVICESTATE$
    }

A couple of quick notes:

    • Notice that you need to double-quote any variables in the 'command_line' that might contain spaces.
    • For a definition of the command line flags, see sendgraph.py's --help output.
    • Just to close the loop, note that notify-svcgraph-by-email is the 'service_notification_commands' value in our initial contact template (the very first listing in this post).
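
To see why the quoting matters, here's a small Python sketch that splits a command line the way a shell roughly would and parses the four flags used above. The parser is a hypothetical stand-in: the flag letters come from the command definition, but the dest names and help strings are my guesses based on the macros passed in, so treat sendgraph.py's own --help as the authority.

```python
import argparse
import shlex


def build_parser():
    """Hypothetical stand-in for sendgraph.py's option parsing."""
    parser = argparse.ArgumentParser(prog="sendgraph.py")
    parser.add_argument("-u", dest="url", required=True, help="Graphite graph URL")
    parser.add_argument("-t", dest="to", required=True, help="recipient address")
    parser.add_argument("-n", dest="name", required=True, help="service description")
    parser.add_argument("-s", dest="state", required=True, help="service state")
    return parser


quoted = (
    '-u "http://graphite.example.com/render?width=800&target=a.b" '
    '-t someone@example.com -n "Some Important Metric" -s CRITICAL'
)
args = build_parser().parse_args(shlex.split(quoted))
print(args.name)  # Some Important Metric

# Without the quotes, the description falls apart into three arguments.
unquoted = shlex.split("-n Some Important Metric")
print(unquoted)  # ['-n', 'Some', 'Important', 'Metric']
```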

Fire It Up

Fire up your Nagios daemon to take it for a spin. For testing, make sure you set the check_graphite_data thresholds to numbers that are pretty much guaranteed to trigger an alert when Graphite is polled. Hope this helps! If you have questions, first make sure you're using Graphite's 'trunk' branch, and not 0.9.9, and then give me a shout in the comments.
