We have a lot of data to parse through at 37signals. Our internal stats application, Dash, does the majority of heavy data lifting for us, including reports, application health, CI builds, and much more. Our Campfire bot, Tally, happily pings us when a build fails, when deploys are fired off, and when Nagios alerts pop up.
I had a problem though: I needed to have all of this data open constantly to absorb it. Either I had to look at the pages on Dash…
…solutions. The components are designed to integrate seamlessly with widely deployed solutions such as Nagios and Cacti, and are delivered in the form of templates, plugins, and scripts.
* MySQL 5.6 compatibility for InnoDB graphs (bug 1124292)
* Added performance data to Nagios plugins (bugs 1090145, 1102687)
* Added UTC option to pmp-check-mysql-replication-delay to be compatible with pt-heartbeat 2.1.8+ (bug 1103364)
* Added 1-second granularity to pmp-check-mysql-deadlocks …
And yet, how many of you are still using Nagios?
There are great advances in monitoring at the moment, and I enjoy watching them as someone who greatly benefits from them.
Yet, I'm worried that all these advances still don't focus enough on the single thing that's supposed to use them: humans.
There's lots of work going on to make monitoring technology more accessible, yet I feel like we haven't solved the first problem at hand: to make monitoring …
…existing levels. This is an artefact of an industry-wide cargo culting of the alerting levels from Nagios, and these levels may not make sense in a modern monitoring pipeline with distinctly compartmentalised stages.
For example, the Nagios plugin development guidelines state that UNKNOWN from a check can mean:
Invalid command line arguments were supplied to the plugin
Low-level failures internal to the plugin (such as unable to fork, or open a tcp socket) that prevent it from …
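The exit-code convention behind these levels can be sketched in Python. Nagios plugins report status through their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The check itself, the `check_value` helper, and the `--value`, `--warn`, and `--crit` arguments below are hypothetical and exist only for illustration; the point is that invalid arguments map to UNKNOWN rather than to a real alert level.

```python
import argparse

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_value(value, warn, crit):
    """Map a measured value onto a Nagios status (hypothetical helper)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def main(argv=None):
    """Parse arguments, run the check, and return the exit code to use."""
    parser = argparse.ArgumentParser(description="Illustrative check")
    parser.add_argument("--value", type=float, required=True)
    parser.add_argument("--warn", type=float, default=80.0)
    parser.add_argument("--crit", type=float, default=90.0)
    try:
        args = parser.parse_args(argv)
    except SystemExit:
        # argparse exits with status 2 on bad arguments, which Nagios
        # would read as CRITICAL; report UNKNOWN instead.
        print("UNKNOWN - invalid command line arguments")
        return UNKNOWN
    status = check_value(args.value, args.warn, args.crit)
    print(f"{LABELS[status]} - value={args.value}")
    return status

# A real plugin would finish with: raise SystemExit(main())
```

Called as `main(["--value", "85"])`, the sketch returns the WARNING code, while a malformed argument list returns UNKNOWN without ever running the check.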
When using Nagios, the check + notification infrastructure are generally collapsed into one compartment (with the exception of NRPE).
Many monitoring pipelines start out with the data collection + storage infrastructure decoupled from the check infrastructure. Monitoring checks query the same targets that are being graphed, but:
Because the check intervals don't necessarily match up to the data collection intervals, it can be hard to correlate monitoring alerts to features …
…Percona XtraDB Cluster node status using the following example command for Nagios config:
define command{
    command_name check_pxc_node_status
    command_line /usr/lib64/nagios/plugins/pmp-check-mysql-status -l $USER3$ -p $USER4$ -H $HOSTADDRESS$ -x wsrep_local_state -C '!=' -w 4
}
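What that command asks the plugin to do can be sketched in Python: take one status variable, compare it against a threshold with the given operator, and raise the matching level when the comparison holds. This is not the plugin's implementation. The `evaluate` function and its arguments are hypothetical, and the sketch assumes that `-C '!='` with `-w 4` means "warn when wsrep_local_state is not 4", where 4 is the Synced state.

```python
import operator

# Comparison operators accepted by this sketch (an assumed subset).
COMPARATORS = {
    "!=": operator.ne,
    "==": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios exit codes


def evaluate(value, comparator, warn=None, crit=None):
    """Return a Nagios status for value, checking crit before warn."""
    compare = COMPARATORS[comparator]
    if crit is not None and compare(value, crit):
        return CRITICAL
    if warn is not None and compare(value, warn):
        return WARNING
    return OK
```

With these assumptions, `evaluate(4, "!=", warn=4)` is OK for a synced node, and any other state value yields WARNING.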
To install packages you can use one of the following commands:
yum install percona-nagios-plugins percona-cacti-templates
apt-get install percona-nagios-plugins percona-cacti-templates …
…Tasseo is another one of them, a successful experiment of having an at-a-glance dashboard with the most important metrics in one convenient overview.
It'll still be a while until we see the ancient tools like Nagios, Icinga and others improve, but the competition is ramping up. Sensu is one open source alternative to keep an eye on.
I'm looking forward to seeing how the monitoring space evolves over the next two years.
Make a resolution to automate in 2013. Attend one of our Chef Introductory workshops in a location near you. Workshop coursework covers:
Local workstation setup with Chef and connection to a Chef Server.
Use Chef to automate installation of a Nagios server as a real-world example.
Automate other common system tasks with Chef, including: User management and sudo permissions, NTP (including a local NTP server) and SMTP relaying with postfix. Each exercise will be instructor-led, …
…new streaming API. Everything you need to know about the new and exciting API.
Best quote on scalability I have come across: "Design for 20x capacity, implement for 3x capacity, deploy for ~1.5x capacity."
§ DevOps Borat:
Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet.
We reordered our versioning with the release of 10.12.0 to facilitate point releases that just contain bug fixes. Here is your first one.
MVPs
Phil Dibowitz contributed awesome improvements to knife that were released in 10.14.0, such as batching for knife cookbook upload -a.
Nagios cookbook hero Tim Smith helped find a couple of the bugs that we fixed in this release.
We're grateful you are both a part of the Chef community, and you are the co-MVPs for this release!