Implementing a custom monitoring stack with Nagios, InfluxDB and Grafana

At Netvlies we had been using Nagios as our monitoring stack for quite some time. It was a hell of a job to maintain (let alone to set up). It did its job quite well, until it just wasn’t enough. Its interface is crappy and its configuration was a mess; manually editing config files in vim is not an inviting job. Furthermore, we wanted to extend our monitoring beyond hardware and middleware metrics to application and security metrics as well. So the search began for new tools and setups. These were our requirements:

  • an app (Android/iOS) for alerting
  • possibility for writing custom checks
  • easy configuration for adding/altering checks
  • monitoring data must be accessible through API or database
  • possibility to assign alerts to someone
  • no SaaS/cloud solution. Our data is our own.

There are many, many tools, to name a few: Nagios, Icinga, Oculus, Sensu, New Relic, Flapjack, Zabbix, Cacti, Graphite, Grafana, Kibana, … and many more. But none of them seemed to have all the desired features. I realized that I needed to combine different tools into a stack to meet all of the requirements. Many tools also overlap with each other, so I decided to split up the responsibilities for the different tasks, following the separation of concerns principle. While googling around I found that this approach seems to have been (partly) adopted already by some companies (like Elastic and InfluxData). Combining their toolsets with others and looking at what I needed, I quickly distilled the following layers.

The monitoring stack

  • checks: the gathering of metrics
  • storage: plain/raw result storage
  • aggregation: enrichment, calculation or humanizing of values
  • view: displaying the aggregated results
  • alerting: parallel to the view layer; action may be required by someone
  • reporting: needed for long-term decision-making

With this stack defined, it’s just a matter of going to the candy store and picking the best solution for each layer. Below is our resulting stack, with some explanation of why we chose each solution.

View
I quickly fell in love with the ELK stack (Elasticsearch, Logstash, Kibana) and the TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor). But there is also a new kid on the block called Grafana (which actually started as a fork of Kibana). If you’ve already tried Grafana yourself, you won’t need me to tell you that it is just awesome. Grafana is an easily configurable dashboard that supports multiple data sources and shows all kinds of near-realtime metrics. Have a look at Grafana and you’ll know what I’m writing about. So the view layer was quickly chosen.

Checks and storage
We didn’t have much time to set up a completely new stack for all of our +/- 50 servers, so I wanted to reuse the Nagios checking mechanisms through NRPE and store their metrics in an acceptable format. I found InfluxDB the ideal candidate for this. It could also have been Elasticsearch, but InfluxDB seemed better suited to purely metric data, and it has default retention policies out of the box (and I’m lazy). Furthermore, Grafana has great support for InfluxDB, with user-friendly query editors.
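To give an idea of what "reusing the Nagios checks" means in terms of data: a Nagios plugin prints a status text, optionally followed by a pipe and performance data in the form label=value[UOM];warn;crit;min;max. Below is a minimal sketch of how that output can be turned into metrics. The names `Metric` and `parse_plugin_output` are made up for this illustration; this is not how Graphios does it internally.

```python
import re
from typing import Dict, NamedTuple, Optional


class Metric(NamedTuple):
    value: float
    unit: str
    warn: Optional[float]
    crit: Optional[float]


# label=value[UOM];warn;crit;min;max (the label may be single-quoted)
_PERF_RE = re.compile(r"('[^']+'|[^\s=]+)=(-?\d+(?:\.\d+)?)([^\s;]*)((?:;[^\s;]*)*)")


def _to_float(text: str) -> Optional[float]:
    """Return a float, or None for empty or range-style thresholds."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_plugin_output(line: str) -> Dict[str, Metric]:
    """Split 'status text | perfdata' and return the perfdata as metrics."""
    _, _, perfdata = line.partition("|")
    metrics = {}
    for label, value, unit, rest in _PERF_RE.findall(perfdata):
        # rest looks like ';warn;crit;min;max'; pad so missing entries are empty
        thresholds = rest.split(";")[1:] + ["", ""]
        metrics[label.strip("'")] = Metric(
            float(value), unit, _to_float(thresholds[0]), _to_float(thresholds[1])
        )
    return metrics


if __name__ == "__main__":
    sample = "DISK OK - free space: / 3326 MB (56%); | /=2643MB;5948;5958;0;5968"
    print(parse_plugin_output(sample))
```

Thresholds written as ranges (for example 10:20) are simply returned as None here, which is good enough for a sketch.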

After choosing the storage backend we needed to transfer the Nagios data to InfluxDB. Graphios turned out to be the missing puzzle piece. Graphios is great, but it did not store the service state (OK, Critical, Warning, Unknown). For this reason I forked the repo; in my fork Graphios stores the service state in a field called “state_id”. You can check here if you’re interested.
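As an illustration of the idea, here is a rough sketch of how a check result plus its state could be written as one InfluxDB line-protocol point. The state numbers are the standard Nagios plugin return codes; the measurement name `nagios`, the `host`/`service` tags and the function names are assumptions for this example and not necessarily the schema the fork uses.

```python
from typing import Dict

# Standard Nagios plugin return codes.
STATE_IDS = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

MEASUREMENT = "nagios"  # assumed measurement name, pick your own


def _escape(text: str) -> str:
    """Escape commas, equals signs and spaces in tag values and field keys."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line(host: str, service: str, state: str,
            fields: Dict[str, float], timestamp_s: int) -> str:
    """Build one line-protocol point: measurement,tags fields timestamp."""
    tags = f"host={_escape(host)},service={_escape(service)}"
    values = [f"{_escape(key)}={float(val)}" for key, val in sorted(fields.items())]
    # Integer fields get an 'i' suffix; unknown state names fall back to UNKNOWN.
    values.append(f"state_id={STATE_IDS.get(state.upper(), 3)}i")
    # Line protocol timestamps are in nanoseconds by default.
    return f"{MEASUREMENT},{tags} {','.join(values)} {timestamp_s * 1_000_000_000}"


if __name__ == "__main__":
    print(to_line("web01", "Disk /", "critical", {"used": 2643}, 1500000000))
```

Storing the state as a number makes it easy to colour or threshold a Grafana panel on it.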

Staying in Nagios land still left us with the configuration mess. To ease the pain we installed Nconf to manage the Nagios configuration. In Nconf every check, host, command, etc. is managed through a web interface, and it has some powerful templating as well. Configuration is generated with a single click and validated in a pre-flight mode, after which it can be deployed to the Nagios runtime. It took some time to migrate all of our existing configuration into Nconf, but it was worthwhile now that we can add a new server with default checks in just a few seconds.

Alerting
We used the aNag app for alerting, which is an excellent app for the Android platform. Unfortunately there is nothing like it for iOS. Furthermore, the app gives no way to see or discuss the actions taken on an alert, so a kind of chat client would be easier. For this we found HipChat very useful: we dump all alerts into it, where they can be delegated to the right person or replied to with “false positive”, “working on it”, etc. We used HipSaint to hook up HipChat to Nagios.
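HipSaint takes care of the actual delivery for us. Purely to illustrate what such a hook does, here is a small sketch that turns a Nagios alert into a chat notification payload. The field names (`message`, `message_format`, `color`, `notify`) follow the HipChat v2 room notification format as I remember it, so treat them as an assumption; the function name and the colour mapping are made up for this example.

```python
import json

# Assumed mapping from Nagios state to a chat message colour.
COLORS = {"OK": "green", "WARNING": "yellow", "CRITICAL": "red", "UNKNOWN": "gray"}


def build_notification(host: str, service: str, state: str, output: str) -> str:
    """Return a JSON body describing one alert as a room notification."""
    state = state.upper()
    payload = {
        "message": f"{state}: {service} on {host} - {output}",
        "message_format": "text",
        "color": COLORS.get(state, "gray"),
        # Only ping people for states that may need someone to act.
        "notify": state in ("CRITICAL", "UNKNOWN"),
    }
    return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    print(build_notification("web01", "Disk /", "critical", "free space: / 120 MB (2%)"))
```

The resulting body would be POSTed to the chat service by whatever hook you use; sending it is left out on purpose.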

Aggregating/Reporting
Currently we don’t have use cases where aggregation is useful, but once we do I guess I will be looking into Logstash. Reporting is not used yet either, but it should be easy once requested, as there are InfluxDB client libraries for many languages.
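Even without a client library a report is within reach, because InfluxDB (0.9 and 1.x) answers InfluxQL queries over HTTP on its /query endpoint. The sketch below only builds such a query URL and flattens the JSON response into rows. The field name `value`, the function names and the example measurement are assumptions for illustration; no request is actually sent.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode


def build_query_url(base_url: str, database: str, measurement: str, days: int = 30) -> str:
    """Build a /query URL that asks for daily means over the last `days` days."""
    query = (
        f'SELECT MEAN("value") FROM "{measurement}" '
        f"WHERE time > now() - {int(days)}d GROUP BY time(1d)"
    )
    # epoch=s asks InfluxDB to return timestamps as Unix seconds.
    return f"{base_url}/query?" + urlencode({"db": database, "q": query, "epoch": "s"})


def rows_from_response(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten the results/series/values nesting into a list of row dicts."""
    rows = []
    for result in response.get("results", []):
        for series in result.get("series", []):
            for values in series["values"]:
                rows.append(dict(zip(series["columns"], values)))
    return rows


if __name__ == "__main__":
    print(build_query_url("http://127.0.0.1:8086", "graphios", "web01.disk"))
```

Fetching the URL with any HTTP client and passing the decoded JSON to `rows_from_response` gives rows that are easy to dump into a CSV or a spreadsheet.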

 

Concluding

Grafana is just awesome to see in action, and it is easy to sell because it is more tangible than something abstract like InfluxDB. I’m also very enthusiastic about the TICK and ELK stacks, as both of them apply some kind of separation of concerns. The one tool that does it all doesn’t exist, and if there were a tool that came close it would be way too fat (and expensive as well). The best way to handle monitoring is to accept that it should be seen as a stack; implementing your own will give you the right tool for each job.

15 thoughts on “Implementing a custom monitoring stack with Nagios, InfluxDB and Grafana”

  1. Hello Makri,

    just found this article and it’s exactly what I’m trying to do right now. Could you spare a few minutes to help me configure the connection between Nagios and InfluxDB? I’m at a dead end and have no idea what other configuration to try.

    I’m trying to use Graphios to send data to the Graphite backend of InfluxDB, but I don’t know how to configure the Graphite section of InfluxDB correctly to store this data.

    Greetings from Switzerland, and thanks in advance for a reply!

    1. Hi Michael,

      Graphios supports many backends (Graphite, InfluxDB, etc.). You need to enable just one of them. Below is the InfluxDB section of my Graphios config. Although InfluxDB has moved on to newer versions, the influxdb09 backend should still work with the latest release.

      enable_influxdb09 = True

      # Extra tags to add to metrics, like data center location etc.
      # Only valid for 0.9
      #influxdb_extra_tags = {"location": "la"}

      # Comma separated list of server:ports
      # defaults to 127.0.0.1:8086 (:8087 if using SSL).
      #influxdb_servers = 127.0.0.1:8087

      # SSL, defaults to False
      #influxdb_use_ssl = True

      # Database-name, defaults to nagios
      influxdb_db = graphios

      # Credentials (required)
      influxdb_user =
      influxdb_password =

      In InfluxDB you should create a database and user matching the settings above.
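For reference, a minimal sketch of the InfluxQL statements involved, generated from the same values you put in the Graphios config. The user name and password in the example are placeholders, and the helper name is made up; run the resulting statements in the influx shell or through the HTTP API.

```python
from typing import List


def setup_statements(database: str, user: str, password: str) -> List[str]:
    """Return InfluxQL statements that create a database and a user for it."""
    if "'" in password or '"' in user or '"' in database:
        # Keep the sketch simple: refuse values that would need escaping.
        raise ValueError("quotes are not supported in this sketch")
    return [
        f'CREATE DATABASE "{database}"',
        f"CREATE USER \"{user}\" WITH PASSWORD '{password}'",
        f'GRANT ALL ON "{database}" TO "{user}"',
    ]


if __name__ == "__main__":
    # "graphios" matches influxdb_db above; the credentials are placeholders.
    for statement in setup_statements("graphios", "graphios_writer", "change-me"):
        print(statement)
```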

      1. I am trying to set up the same stack that you describe in your article. I notice that the #influxdb_servers = 127.0.0.1:8087 entry is commented out and the user name / password options are blank.

        Do you also enable the line protocol, or is this section sufficient with all others set to false?

        What listener did you configure on the InfluxDB side?

        Thank you for writing this!

        1. Alan, sorry for my late reply; holidays are meant for being off the grid for a while 🙂
          The section as shown should suffice in a default setup. InfluxDB listens on port 8086 by default.

          The only thing you need to do is create a user on the InfluxDB server and enter its credentials in the config file mentioned above.

  2. Nice post, and agreed that that stack sounds like a solid fit, picking & choosing the best from each layer. That said, you didn’t really delve into the ‘how’ of the implementation. Care to share?

    1. Since the setup is already done and InfluxDB is under heavy development, I doubt that a complete setup guide would work 100%.

      From my experience, I found the Nagios configuration the hardest part. The other components are pretty well documented on their own sites. Nagios is also the layer I have the most doubts about, since it is a kind of fossil. If you’re starting fresh and don’t have many servers, I would advise you to take a look at Telegraf for metric collection, which is a lot easier to set up and has more and better metrics out of the box.

  3. Hi,
    Thanks for such a great document. I was also able to configure this using InfluxDB, but the InfluxDB version supported by Graphios is 0.9, which is too old now and lacks many features. Is it possible to make it support the latest versions?
    Thanks again. Cheers!!

  4. Hi Team,
    Thanks for such a great write-up. It’s really helpful.
    We have similar requirements to those mentioned above and are planning to build the stack from scratch.
    Can you please advise whether Nagios is a good option, or whether we should consider some other tool as well?

    1. Nagios is the most doubtful layer, for sure 🙂 For me it is a legacy I have to work with. Configuring Nagios can cause lots of headaches. So if you’re setting up from scratch, and only for a few servers or so, I would advise using Telegraf, which is easier to set up and has more and better metrics.

    1. Hi Riccardo,
      Thanks for your suggestion. It would surely be a better choice if you’re starting from scratch. The point is that all of our 40 servers were already set up with Nagios (too much hassle to reconfigure them all). For a new server stack I’m building right now I use Telegraf for metrics (very easy to configure), and the upcoming release of Grafana 4 will add alerting, which makes both Nagios and Icinga obsolete (at least in our case).

      1. Ok, I will install Icinga 2, Telegraf and collectd to understand which is best for my processes. I’m interested in network monitoring. Do you have any ideas for that?

        1. Not entirely sure which part of networking you mean, but for traffic the Netlink plugin for collectd or the built-in network plugin in Telegraf should suffice.

  5. This is the exact solution I was looking for!!
    Can you post the sequence in which this stack should be brought up? I am new to this tool stack.
