At Netvlies we used Nagios as our monitoring stack for quite some time. It was a hell of a job to maintain (let alone to set up). It did its job quite well, until it just wasn’t enough. Its interface is crappy and our configuration was a mess; manually editing config files in vim is not an inviting job. Furthermore, we wanted to extend our monitoring beyond hardware and middleware metrics to cover application and security metrics as well. So a search began for new tools and setups. These were our requirements:
- an app (Android/iOS) for alerting
- possibility for writing custom checks
- easy configuration for adding/altering checks
- monitoring data must be accessible through API or database
- possibility to assign alerts to someone
- no SaaS/cloud solution. Our data is our own.
There are many, many tools. To name a few: Nagios, Icinga, Oculus, Sensu, New Relic, Flapjack, Zabbix, Cacti, Graphite, Grafana, Kibana, and many more. But none of them seemed to have all the desired features, so I realized that I needed to build a stack of different tools to meet all of the requirements. Many tools also overlap each other in functionality, so I decided to split up the responsibilities for the different tasks, following the separation of concerns principle. While googling around I found that this approach already seems to be (partly) adopted by some companies (like Elastic and InfluxData). Combining their toolsets with others and looking at what I needed, I quickly distilled the following layers.
The monitoring stack
|Layer|Purpose|
|---|---|
|checks|the gathering of metrics|
|storage|plain/raw result storage|
|aggregation|enrichment, calculation or humanizing values|
|view|displaying the aggregated results|
|alerting|parallel to the view layer; action may be required by someone|
|reporting|needed for long-term decision making|
With this stack in mind, it’s just a matter of going to the candy store and picking the best solution for each layer. Below is our resulting stack, with some explanation of why we chose each specific solution.
I quickly fell in love with the ELK stack (Elasticsearch, Logstash, Kibana) and the TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor). But there is also a new kid on the block called Grafana (which actually started as a fork of Kibana). If you’ve already tried Grafana yourself, you don’t need me to tell you that it is just awesome. Grafana is an easily configurable dashboard that supports multiple data sources and shows all kinds of near-realtime metrics. Have a look at Grafana and you’ll know what I’m writing about. So the view layer was quickly chosen.
Checks and storage
We didn’t have much time to set up a completely new stack for all of our roughly 50 servers, so I wanted to reuse the Nagios checking mechanisms through NRPE and store its metrics in an acceptable format. I found InfluxDB the ideal candidate for this. It could also have been Elasticsearch, but InfluxDB seemed better suited for purely metric data and has default retention policies out of the box (and I’m lazy). Furthermore, Grafana has great support for InfluxDB, with user-friendly query editors.
After choosing the storage backend we needed to transfer the Nagios data to InfluxDB. Graphios turned out to be the missing puzzle piece. Graphios is great, but it did not store the service state (OK, Warning, Critical, Unknown). For this reason I forked the repo; in my fork, Graphios stores the service state in a field called “state_id”. You can check here if you’re interested.
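To make the idea concrete, here is a minimal sketch of how a Nagios check result could be turned into InfluxDB line protocol with the service state stored next to the value. This is not the Graphios code or my fork of it; the function names and the tag layout (`host`, `metric`) are assumptions for illustration. Only the “state_id” field name and the standard Nagios state codes come from the setup described above.

```python
import re

# Standard Nagios plugin states, keyed by name, mapped to their exit codes.
STATE_IDS = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

# One perfdata item has the shape: label=value[UOM];warn;crit;min;max
PERF_ITEM = re.compile(
    r"(?P<label>'[^']+'|[^\s=]+)=(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[^\s;]*)"
)


def escape_measurement(text):
    """Escape commas and spaces, which are special in measurement names."""
    return re.sub(r"([, ])", r"\\\1", text)


def escape_tag(text):
    """Escape commas, equals signs and spaces in tag keys and values."""
    return re.sub(r"([,= ])", r"\\\1", text)


def to_line_protocol(host, service, state, perfdata, timestamp):
    """Turn one Nagios service check result into line protocol strings.

    Hypothetical layout: the service is the measurement, host and perfdata
    label are tags, and value plus state_id are fields.
    """
    state_id = STATE_IDS[state]
    lines = []
    for item in PERF_ITEM.finditer(perfdata):
        label = item.group("label").strip("'")
        value = float(item.group("value"))
        lines.append(
            f"{escape_measurement(service)},"
            f"host={escape_tag(host)},metric={escape_tag(label)} "
            f"value={value},state_id={state_id}i {timestamp}"
        )
    return lines
```

With the state stored as an integer field, a dashboard can colour or filter a panel on it directly. Note that the timestamp here is in seconds, so the write to InfluxDB would need its precision set to seconds.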
Staying in Nagios land still left us with the configuration mess. To ease the pain we installed Nconf to manage the Nagios configuration. In Nconf every check, host, command, etc. is managed through a web interface, and it has some powerful templating as well. Configuration is generated with a single click and validated in a pre-flight mode, after which it can be deployed to the Nagios runtime. It took some time to migrate all of our existing configuration into Nconf, but it was worthwhile: we can now add a new server with default checks in just a few seconds.
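To give an idea of what “a new server with default checks” boils down to, here is a rough sketch that renders Nagios object definitions from a set of defaults. It is not how Nconf works internally; the template names (`generic-host`, `generic-service`), the check list and the `check_nrpe!...` commands are assumptions that would have to match the templates and commands defined in your own setup.

```python
# Hypothetical default checks: service description -> check command.
DEFAULT_CHECKS = {
    "Load": "check_nrpe!check_load",
    "Disk": "check_nrpe!check_disk",
    "Users": "check_nrpe!check_users",
}


def define(kind, directives):
    """Render a single Nagios object definition block."""
    body = "\n".join(
        f"    {key:<24}{value}" for key, value in directives.items()
    )
    return f"define {kind} {{\n{body}\n}}\n"


def render_server(host_name, address, checks=DEFAULT_CHECKS):
    """Render a host definition plus one service definition per check."""
    blocks = [
        define("host", {
            "use": "generic-host",  # assumed host template name
            "host_name": host_name,
            "address": address,
        })
    ]
    for description, command in checks.items():
        blocks.append(define("service", {
            "use": "generic-service",  # assumed service template name
            "host_name": host_name,
            "service_description": description,
            "check_command": command,
        }))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(render_server("web01", "10.0.0.5"))
```

The point of the sketch is the shape of the work: one host plus a fixed set of services per server is repetitive enough to be generated from a template instead of typed by hand in vim.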
We used the aNag app for alerting, which is an excellent app for the Android platform. Unfortunately there is nothing like it for iOS. Furthermore, the app gives no way to see or discuss the actions taken on an alert, so some kind of chat client would be easier. For this we found HipChat very useful: we dump all alerts into it, where they can be delegated to the right person or replied to with “false positive”, “working on it”, etc. We used HipSaint to hook up HipChat to Nagios.
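As an illustration of what such a chat alert could carry, here is a small sketch that builds a notification payload from a Nagios alert. This is not how HipSaint works internally. The field names are modeled on HipChat’s room notification API as I remember it and should be treated as assumptions, as should the colour mapping and the choice to notify only on critical alerts. The sketch only builds the payload and does not send anything.

```python
# Assumed mapping from Nagios service state to a chat message colour.
STATE_COLORS = {
    "OK": "green",
    "WARNING": "yellow",
    "CRITICAL": "red",
    "UNKNOWN": "gray",
}


def build_notification(host, service, state, output):
    """Build a chat notification payload for one Nagios service alert."""
    return {
        "message": f"[{state}] {host} / {service}: {output}",
        "message_format": "text",
        "color": STATE_COLORS.get(state, "gray"),
        # Hypothetical policy: only ping people for critical alerts.
        "notify": state == "CRITICAL",
    }
```

Keeping host, service and state at the start of the message makes it easy to reply to a specific alert in the room with “working on it” or “false positive”.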
Currently we don’t have any use cases where aggregation is useful, but once we do, I guess I will look into Logstash. Reporting is not used yet either, but it should be easy once requested, as there are many client libraries for InfluxDB in different languages.
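As a sketch of what a first report could look like, the function below computes a time-weighted availability percentage from stored “state_id” samples. Fetching the samples from InfluxDB is left out on purpose; the input format (a list of `(timestamp, state_id)` pairs) and the function name are assumptions for illustration.

```python
def availability(samples):
    """Return the percentage of time a service was in state OK (state_id 0).

    samples: iterable of (timestamp_in_seconds, state_id) pairs. Each state
    is assumed to hold until the next sample; the last sample only closes
    the final interval.
    """
    ordered = sorted(samples)
    total = 0
    ok = 0
    for (start, state_id), (end, _next_state) in zip(ordered, ordered[1:]):
        duration = end - start
        total += duration
        if state_id == 0:
            ok += duration
    if total == 0:
        raise ValueError("need at least two samples with different timestamps")
    return 100.0 * ok / total
```

For example, four one-minute intervals of which one is critical give an availability of 75 percent.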
Grafana is just awesome to see in action, and it is easy to sell because it is more tangible than something abstract like InfluxDB. I’m also very enthusiastic about the TICK and ELK stacks, as both of them apply some kind of separation of concerns. The one tool that does it all doesn’t exist, and if anything came close it would be way too fat (and expensive as well). The best way to handle monitoring is to accept that it should be seen as a stack; assembling your own gives you the right tool for each job.