Implementing a custom monitoring stack with Nagios, InfluxDB and Grafana

At Netvlies we had been using Nagios as our monitoring stack for quite some time. It was a hell of a job to maintain (let alone set up). Nevertheless it did its job quite well, until it just wasn’t enough anymore. Its interface is clunky and its configuration was a mess: manually editing config files in vim is not an inviting job. Furthermore, we wanted to extend our monitoring stack to cover not just hardware and middleware metrics, but application and security metrics as well. From that point a search began for new tools and setups. These were our requirements:

  • an app (Android/iOS) for alerting
  • the possibility to write custom checks
  • easy configuration for adding/altering checks
  • monitoring data must be accessible through an API or database
  • the possibility to assign alerts to someone
  • no SaaS/cloud solution; our data is our own

There are many, many tools out there, to name a few of them: Nagios, Icinga, Oculus, Sensu, New Relic, Flapjack, Zabbix, Cacti, Graphite, Grafana, Kibana, … and many more. But none of them seemed to have all the desired features. I realized that I needed to create a stack of different tools to meet all of the requirements. Many tools also overlap when compared to each other, so I decided to split the responsibilities for the different tasks, following the separation of concerns principle. While googling around I found that this paradigm has already been (partly) adopted by some companies (like Elastic and InfluxData). Combining their toolsets and others with what I needed, a distillation into different layers was quickly made.

The monitoring stack

  • checks: the gathering of metrics
  • storage: plain/raw result storage
  • aggregation: enrichment, calculation or humanizing of values
  • view: displaying the aggregated results
  • alerting: parallel to the view layer; action may be required by someone
  • reporting: needed for long-term decision making

With this stack in mind, picking the best solution for each layer is like going to the candy store. Below is our resulting stack, with some explanation of why we chose each specific solution.

View
I quickly fell in love with the ELK stack (Elasticsearch, Logstash, Kibana) and the TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor). But there is also a new kid on the block called Grafana (which is actually a fork of Kibana). If you’ve already tried Grafana yourself, you don’t need me to tell you that it is just awesome. Grafana is an easily configurable dashboard with multiple datasources that shows all kinds of near-realtime metrics. Have a look at Grafana and you’ll know what I’m writing about. So the view layer was quickly chosen.

Checks and storage
We didn’t have that much time to set up a completely new stack for all of our +/- 50 servers, so I wanted to reuse the Nagios checking mechanisms through NRPE and store their metrics in an acceptable format. I found InfluxDB the most ideal candidate for this. It could also have been Elasticsearch, but InfluxDB seemed better suited for purely metrical data and comes with default retention schemes out of the box (and I’m lazy). Furthermore, Grafana has great support for InfluxDB, with user-friendly query editors.
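To give an idea of how little work that retention part is: a retention policy is a single InfluxQL statement. The database and policy names below are just placeholders, not our actual setup:

    -- hypothetical example: keep the raw Nagios metrics for a year in a database called "nagios"
    CREATE RETENTION POLICY "one_year" ON "nagios" DURATION 52w REPLICATION 1 DEFAULT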

After choosing the storage backend we needed to transfer the Nagios data to InfluxDB. For this we found that Graphios was the missing puzzle piece. Graphios is great, but it lacked one thing: storing the service state (OK, Warning, Critical, Unknown). For this reason I forked the repo; in my fork Graphios stores the service state in a field called “state_id”. You can check it here if you’re interested.

Staying in Nagios land still left us with the configuration mess. To ease the pain we installed Nconf to manage the Nagios configuration. In Nconf every check, host, command, etc. is managed through a web interface, and it has some powerful templating as well. The configuration is generated with a single click and validated in a pre-flight mode, after which it can be deployed to the Nagios runtime. It took some time to migrate all of the existing configuration into Nconf, but it was worthwhile: we now have the possibility to add a new server with default checks in just a few seconds.

Alerting
We used the aNag app for alerting, which is an excellent app for the Android platform. Unfortunately there is nothing like it for iOS. Furthermore, alerts in aNag cannot be seen or discussed by the whole team, so some kind of chat client would be easier. For this we found HipChat very useful: alerts are dumped into a room where they can be delegated to the right person, or replied to as “false positive”, “working on it”, etc. We used HipSaint to hook HipChat up to Nagios.

Aggregating/Reporting
Currently we don’t have use cases where aggregation is useful yet, but once we do I guess I’ll be looking into Logstash. Reporting is not used yet either, but should be easy once requested, as there are client libraries for InfluxDB in many languages.
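As an impression, a minimal reporting script with the influxdb Python client could look something like the sketch below. The database name, measurement and field are placeholders; adjust them to whatever Graphios writes in your setup:

    # minimal sketch: pull a week of load averages out of InfluxDB for a simple report
    # assumptions: database "nagios" and measurement "load1" are placeholders for your own schema
    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="localhost", port=8086, database="nagios")
    result = client.query('SELECT mean("value") FROM "load1" WHERE time > now() - 7d GROUP BY time(1d)')
    for point in result.get_points():
        print(point["time"], point["mean"])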


Concluding

Grafana is just awesome to see in action, and it is easy to sell because it is more tangible than something abstract like InfluxDB. I’m also very enthusiastic about the TICK and ELK stacks, as both apply some kind of separation of concerns. The one tool that does it all doesn’t exist, and any tool that came close would be way too fat (and expensive as well). The best way to handle monitoring is to accept that it should be seen as a stack; implementing your own will give you the right tool for each job.

WordPress vulnerability monitoring

Due to recurring security issues with WordPress I wanted some kind of vulnerability monitoring for our WordPress installations. Our current monitoring setup is implemented with good old fossil Nagios. Despite the many better alternatives, Nagios still does a great job of alerting me through the aNag app. In this post I’ll describe a simple Nagios setup for continuously monitoring WordPress vulnerabilities. It’s pretty straightforward if you already know Nagios. Nevertheless, the scripts I wrote to wrap things together should be (more or less) reusable in any other setup (ELK?).

Scanning WordPress is done with the wpscan tool, which can be downloaded from http://wpscan.org. The output of this tool is stored, transformed and displayed the Nagios way. As a prerequisite you need to install this tool on your server (hint: use the RVM method) and have PHP/MySQL installed for the different subcommands that will be called.

Nagios configuration

First we’re going to write the command, service and service template in Nagios. The check interval is set to once every 24 hours, and by using a parent service (template) you can easily create more checks for other WordPress instances. I copied the configuration from my generated config, which is created by Nconf, but it shouldn’t make any difference if you use this kind of config directly.
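A stripped-down sketch of what the service template and service definition could look like (all names, most values and the notes_url path here are just examples, not the actual Nconf output):

    # sketch only; names, contact groups and the notes_url path are assumptions
    define service {
        name                    wpsecurity-template     ; parent template for all WordPress checks
        register                0
        check_interval          1440                    ; once every 24 hours (in minutes)
        retry_interval          60
        max_check_attempts      2
        check_period            24x7
        notification_period     24x7
        contact_groups          admins
    }

    define service {
        use                     wpsecurity-template
        host_name               webserver01
        service_description     wpsecurity_example.com
        check_command           check_wpsecurity!https://www.example.com
        notes_url               /nagios/wpsecurity.php?host=$HOSTNAME$&service=$SERVICEDESC$  ; path to the PHP viewer, adjust to your setup
    }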

The actual command(s)

When you look closely at the configured command you will see a lot of piping. I could have put everything into one command, but that wouldn’t be reusable in a possible future ELK setup. It also makes it a bit easier to adapt these scripts to your needs, as each script’s purpose is more obvious.

The actual scanning is done with /opt/wpscan/wpscan.rb --update -u $ARG1$. The output of wpscan is intended to be read by humans, so we need to convert it into something that Nagios understands. Note that machine-readable output should become available out of the box, as it is announced for the 3.0 release of wpscan (https://github.com/wpscanteam/wpscan/issues/198). Until then we use our own filter for the transformation, which is defined in $USER1$/wpsecurity_filter. This filter processes the wpscan output into a JSON string, which is passed to $USER1$/wpsecurity_message; that script exits with the right exit code and message for Nagios. See the gists below for this filtering and messaging to Nagios:
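In essence the two steps boil down to something like the following sketch (the “[!]” marker and the JSON layout are assumptions about the wpscan output, not the exact scripts):

    # --- wpsecurity_filter (sketch) ---
    # Reduces wpscan's human-readable report on stdin to a small JSON summary on stdout.
    # Assumption: wpscan flags its findings with a leading "[!]".
    import json
    import sys

    findings = [line.strip() for line in sys.stdin if line.lstrip().startswith("[!]")]
    json.dump({"count": len(findings), "findings": findings}, sys.stdout)

The message step can then be as simple as:

    # --- wpsecurity_message (sketch) ---
    # Reads the JSON summary from stdin, prints a one-line status and exits with a Nagios code.
    import json
    import sys

    report = json.load(sys.stdin)
    count = report.get("count", 0)
    if count:
        print("CRITICAL - %d possible WordPress vulnerabilities found" % count)
        sys.exit(2)  # CRITICAL in Nagios terms
    print("OK - no known WordPress vulnerabilities found")
    sys.exit(0)      # OK in Nagios terms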

If you’ve read closely, you might already have noticed a missing command, namely $USER1$/wpsecurity_store $HOSTNAME$ $SERVICEDESC$. The reason for mentioning it separately is that it is optional. From the outside it just moves STDIN to STDOUT, seemingly doing nothing. Internally it stores the output in a MySQL database, which can be accessed with a custom PHP script from the Nagios dashboard (by using the “notes_url” configuration directive). See the gist below for this storage:
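A rough sketch of what such a store step could look like; the use of Python with pymysql, the credentials and the table layout are just placeholder assumptions:

    # --- wpsecurity_store (sketch) ---
    # Passes stdin through to stdout unchanged, while keeping a copy of the (ANSI-coloured)
    # wpscan report in MySQL so a separate PHP page (linked via notes_url) can display it.
    # Assumptions: credentials, database and table layout are placeholders.
    import sys
    import pymysql

    host, service = sys.argv[1], sys.argv[2]   # $HOSTNAME$ and $SERVICEDESC$ passed in by Nagios
    payload = sys.stdin.read()
    sys.stdout.write(payload)                  # behave like a plain pass-through

    conn = pymysql.connect(host="localhost", user="nagios", password="secret", db="wpsecurity")
    with conn.cursor() as cursor:
        # REPLACE assumes a unique key on (host, service) so only the latest report is kept
        cursor.execute(
            "REPLACE INTO scan_results (host, service, output) VALUES (%s, %s, %s)",
            (host, service, payload),
        )
    conn.commit()
    conn.close()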

By using different subcommands we can easily replace the filter later on, when wpscan 3.0 is released. The wpsecurity_store step is also optional, as it outputs exactly what it receives (so you can leave it out if you don’t want storage; also remove the notes_url from the Nagios config if you do so).
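Glued together, the complete check can then be defined as a single Nagios command along the following lines (a sketch: the command name and the exact position of the optional store step are assumptions):

    # sketch of the full pipeline as a Nagios command definition; command_name is an assumption
    define command {
        command_name    check_wpsecurity
        command_line    /opt/wpscan/wpscan.rb --update -u $ARG1$ | $USER1$/wpsecurity_store $HOSTNAME$ $SERVICEDESC$ | $USER1$/wpsecurity_filter | $USER1$/wpsecurity_message
    }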

Vulnerabilities overview

In the Nagios dashboard there is only room for a single sentence describing the state of a service, and just stating the number of found vulnerabilities isn’t enough. That’s why we stored the output in MySQL. With the notes_url config we can redirect to a separate PHP file which displays the output generated by wpscan. Please adjust it to your needs 🙂 as it is just the bare minimum. The example below uses 2 PHP includes which can be downloaded from the SensioLabs repo https://github.com/sensiolabs/ansi-to-html.

Nagios dashboard

If all went well (see /var/log/nagios/nagios.log for errors), you should see something like the screenshot below. Note the document icon to the right of the service description; that link will lead you to the second screenshot.


[Screenshot: nagios dashboard]

[Screenshot: wordpress vulnerability monitoring overview]