For replication delay monitoring (e.g. with Nagios), 1-second granularity is plenty.
Typically, you would only alert after several seconds of delay had been observed.
Naturally, some other factors can affect the delay and accuracy of this system (pub/sub time, time to issue the SELECT, etc.), but for the purpose of isolating sub-optimal processes at the millisecond level, this approach was extremely helpful.
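For the Nagios side of this, a conventional threshold check is enough. Below is a minimal sketch of a Nagios-style replication delay check in Python; the heartbeat table and column names (heartbeat.heartbeat, ts), connection details, and thresholds are assumptions for illustration, not the millisecond-level tool described above.

```python
#!/usr/bin/env python3
"""Nagios-style replication delay check (illustrative sketch).

Assumes a pt-heartbeat-style table on the replica whose newest `ts`
row is continuously updated on the master; table/column names and
connection settings below are placeholders.
"""
import sys
import pymysql

WARN_SECONDS = 5    # hypothetical warning threshold
CRIT_SECONDS = 30   # hypothetical critical threshold

def main():
    try:
        conn = pymysql.connect(host="replica.example.com", user="monitor",
                               password="secret", database="heartbeat")
        with conn.cursor() as cur:
            # Delay = now (replica clock) minus last heartbeat written on the master.
            cur.execute("SELECT UNIX_TIMESTAMP() - UNIX_TIMESTAMP(MAX(ts)) FROM heartbeat")
            delay = float(cur.fetchone()[0])
    except Exception as exc:
        print(f"REPLICATION UNKNOWN: {exc}")
        sys.exit(3)

    if delay >= CRIT_SECONDS:
        print(f"REPLICATION CRITICAL: {delay:.0f}s behind")
        sys.exit(2)
    if delay >= WARN_SECONDS:
        print(f"REPLICATION WARNING: {delay:.0f}s behind")
        sys.exit(1)
    print(f"REPLICATION OK: {delay:.0f}s behind")
    sys.exit(0)

if __name__ == "__main__":
    main()
```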
Stay tuned for a followup post where I'll share the tool and go over its …
…on-premises monitoring solutions. The components are designed to integrate seamlessly with widely deployed solutions such as Nagios, and are delivered as templates, plugins, and scripts that make it easy to monitor performance.
The post "Percona Monitoring Plugins 1.1.3. Addressed CVE-2014-2569." appeared first on the MySQL Performance Blog.
You can scale your Nagios horizontally. Nagios can be really performant if you don't use notifications, acknowledgements, downtime, or parenting. Nagios executes static groups of checks efficiently, so scale the machines you run Nagios on horizontally and use Flapjack to aggregate events from all your Nagios instances and send alerts.
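As a rough illustration of that aggregation pattern, here is a sketch that pushes a Nagios-style check result onto a central Redis queue for Flapjack to consume. The queue name ("events") and the JSON fields shown follow Flapjack v1's event format as I understand it; treat them as assumptions and verify against the docs for your version.

```python
#!/usr/bin/env python3
"""Forward a check result to a central Flapjack-style event queue (sketch).

The Redis list name and JSON fields are assumptions modeled on
Flapjack v1's documented event format; verify before relying on them.
"""
import json
import time
import redis

def send_event(entity, check, state, summary,
               redis_host="flapjack-redis.example.com"):
    event = {
        "entity": entity,        # host the check ran against
        "check": check,          # check name, e.g. "load"
        "type": "service",
        "state": state,          # "ok" | "warning" | "critical" | "unknown"
        "summary": summary,
        "time": int(time.time()),
    }
    r = redis.Redis(host=redis_host, port=6379, db=0)
    # Each Nagios shard pushes onto the same list; Flapjack consumes
    # and aggregates the events before notifying anyone.
    r.lpush("events", json.dumps(event))

if __name__ == "__main__":
    send_event("web01.example.com", "load", "critical", "load average 12.4")
```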
You can run multiple check execution engines in production. Nagios is well suited to some monitoring tasks. Sensu is well suited to others. …
At 5:25 p.m. CT, Nagios alerted us that two database hosts and two bigdata hosts were down. A few seconds later, Nagios notified us that 10 additional hosts were down. A "help" notification was posted in Campfire and all our teams followed the documented procedure to join a predefined (private) Jabber chat.
One immediate effect of the original problem was that we lost both our internal DNS servers. To address this we added two backup DNS servers to the virtual server on the load …
Roughly three weeks later, Nagios started screaming in the routing team's internal chat room every five minutes, for days at a time. Some nodes in the cluster saw their memory balloon and never gave it back to the OS. The nodes wouldn't crash as fast as before; instead, they'd grow close to the ulimit we'd set and hover there, taunting Nagios, and us by extension.
Clearly, I needed to do more work.
I Just Keep on Bleeding and I Won't Die
First Attempt …
For those familiar with Nagios, standard Sensu checks are compatible with Nagios checks. So if you know of a Nagios check that does what you need, you can stop right here and go grab that. Otherwise, let's continue.
The exit status of a Sensu check should be:
0: OK
1: warning
2: critical
3 or more: unknown
A Sensu check also outputs text describing the state to stdout or stderr.
Example outputs of check-ram.rb:

Exit Status | Output
0           | CheckRAM OK: 65% free RAM …
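To make the contract concrete, here is a minimal check in the same spirit as check-ram.rb, written as a Python sketch. It assumes a Linux host exposing MemTotal/MemAvailable in /proc/meminfo, and the thresholds are arbitrary examples; the exit-status and output conventions are the part that matters.

```python
#!/usr/bin/env python3
"""Minimal Nagios/Sensu-compatible RAM check (illustrative sketch)."""
import sys

WARN_PCT = 20   # warn when free RAM drops below 20%
CRIT_PCT = 10   # critical below 10%

def meminfo():
    """Parse /proc/meminfo into a dict of values in kB."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])
    return info

def main():
    try:
        mem = meminfo()
        free_pct = 100 * mem["MemAvailable"] / mem["MemTotal"]
    except Exception as exc:
        print(f"CheckRAM UNKNOWN: {exc}")
        sys.exit(3)

    if free_pct < CRIT_PCT:
        print(f"CheckRAM CRITICAL: {free_pct:.0f}% free RAM")
        sys.exit(2)
    if free_pct < WARN_PCT:
        print(f"CheckRAM WARNING: {free_pct:.0f}% free RAM")
        sys.exit(1)
    print(f"CheckRAM OK: {free_pct:.0f}% free RAM")
    sys.exit(0)

if __name__ == "__main__":
    main()
```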
§ DevOps Borat :
Law of devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet.
§ DownloadMoreRAM. Just downloaded 4GB of RAM to my iPhone, 10, $0.
§ Sayings 2.0 :
Never judge an app by its icon
A watched status update never gets liked.
Close, but no WiFi.
Configure checks in Nagios, but configure a contact that drops the alerts
Read Nagios's state out of a file + parse it
Aggregate the checks by regex, and alert if a certain percentage of them are critical (a rough sketch of this follows below)
It's a godsend for people who manage large Nagios instances, but it starts falling down if you've got multiple independent Nagios instances (shards) that are checking the same thing.
You still end up with a situation where each of your shards alert if the shared entity they're …
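Here is a minimal sketch of the file-parsing approach described above: it reads Nagios's status file, picks out servicestatus blocks whose description matches a regex, and goes critical when the proportion of critical services crosses a threshold. The status-file path, regex, and threshold are assumptions; the key=value block layout follows typical Nagios 3/4 status.dat files.

```python
#!/usr/bin/env python3
"""Aggregate Nagios service states by regex from status.dat (sketch)."""
import re
import sys

STATUS_FILE = "/var/cache/nagios3/status.dat"   # adjust for your install
SERVICE_RE = re.compile(r"^HTTP")                # which checks to aggregate
CRIT_THRESHOLD = 0.5                             # alert if >50% are critical

def parse_servicestatus(path):
    """Yield a dict of key=value pairs from each servicestatus { ... } block."""
    block = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("servicestatus {"):
                block = {}
            elif line == "}" and block is not None:
                yield block
                block = None
            elif block is not None and "=" in line:
                key, value = line.split("=", 1)
                block[key] = value

def main():
    matched = [s for s in parse_servicestatus(STATUS_FILE)
               if SERVICE_RE.search(s.get("service_description", ""))]
    if not matched:
        print("AGGREGATE UNKNOWN: no services matched")
        sys.exit(3)

    critical = [s for s in matched if s.get("current_state") == "2"]
    ratio = len(critical) / len(matched)
    msg = f"{len(critical)}/{len(matched)} matching services critical"

    if ratio >= CRIT_THRESHOLD:
        print(f"AGGREGATE CRITICAL: {msg}")
        sys.exit(2)
    print(f"AGGREGATE OK: {msg}")
    sys.exit(0)

if __name__ == "__main__":
    main()
```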
We have a lot of data to parse through at . Our internal stats application, , does the majority of heavy data lifting for us, including reports, application health, CI builds, and much more. Our Campfire bot named Tally happily pings us when a build fails, deploys are fired off, and when Nagios alerts pop up.
I had a problem though: I needed to have all of this data open constantly to absorb it. Either I had to look at the pages on Dash…
And yet, how many of you are still using Nagios?
There are great advances in monitoring at the moment, and I enjoy watching them as someone who greatly benefits from them.
Yet, I'm worried that all these advances still don't focus enough on the single thing that's supposed to use them: humans.
There's lots of work going on to solve problems to make monitoring technology more accessible, yet I feel like we haven't solved the first problem at hand: to make monitoring …