24 April 2014

The Ruby Reflector

Topic

Nagios


For replication delay monitoring (e.g. with Nagios), 1 second granularity is plenty

Typically, you would only alert after several seconds of delay were noticed

Naturally, other factors can affect the delay and accuracy of this system (pub/sub time, time to issue the select, etc.), but for the purpose of isolating some sub-optimal processes at the millisecond level, this approach was extremely helpful.
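As a rough illustration of alerting only after several seconds of delay, here is a minimal Ruby sketch that maps a measured delay (in whole seconds) to a Nagios-style plugin result. The method name and the 5-second and 30-second thresholds are assumptions made for this example; the excerpt does not specify them.

```ruby
# Hypothetical thresholds in seconds; the excerpt only says "several seconds".
WARN_DELAY = 5
CRIT_DELAY = 30

# Map a replication delay to a Nagios-style plugin result:
# [exit_code, message], where 0 = OK, 1 = WARNING, 2 = CRITICAL.
def replication_delay_result(delay_seconds, warn: WARN_DELAY, crit: CRIT_DELAY)
  if delay_seconds >= crit
    [2, "CRITICAL: replication delay #{delay_seconds}s"]
  elsif delay_seconds >= warn
    [1, "WARNING: replication delay #{delay_seconds}s"]
  else
    [0, "OK: replication delay #{delay_seconds}s"]
  end
end
```

A plugin wrapper would print the message and exit with the code, so a one-second blip stays OK and only a sustained lag raises an alert.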

Stay tuned for a followup post where I'll share the tool and go over its …

mysqlperformanceblog.com Read

…on-premises monitoring solutions. The components are designed to integrate seamlessly with widely deployed solutions such as Nagios, Cacti and Zabbix and are delivered in the form of templates, plugins, and scripts which make it easy to monitor MySQL performance.

The post Percona Monitoring Plugins 1.1.3. Addressed CVE-2014-2569. appeared first on MySQL Performance Blog.

mysqlperformanceblog.com Read

You can scale your Nagios horizontally. Nagios can be really performant if you don't use notifications, acknowledgements, downtime, or parenting. Nagios executes static groups of checks efficiently, so scale the machines you run Nagios on horizontally and use Flapjack to aggregate events from all your Nagios instances and send alerts.

You can run multiple check execution engines in production. Nagios is well suited to some monitoring tasks. Sensu is well suited to others. …

holmwood.id.au Read
By Taylor of Signal vs. Noise 4 months ago.

At 5:25 p.m. CT, Nagios alerted us that two database and two bigdata hosts were down. A few seconds later Nagios notified us that 10 additional hosts were down. A "help" notification was posted in Campfire and all our teams followed the documented procedure to join a predefined (private) Jabber chat.

One immediate effect of the original problem was that we lost both our internal DNS servers. To address this we added two backup DNS servers to the virtual server on the load …

37signals.com Read
By Fred of Heroku 6 months ago.

Roughly three weeks later, Nagios started screaming in the routing team's internal chat room every five minutes, for days at a time. Some nodes in the cluster had their memory bubble up, and never gave it back to the OS. The nodes wouldn't crash as fast as before; instead, they'd grow close to the ulimit we'd set and hover there, taunting Nagios, and us by extension.

Clearly, I needed to do more work.

I Just Keep on Bleeding and I Won't Die

First Attempt …

blog.heroku.com Read
On Programblings 6 months ago.

For those familiar with Nagios, standard Sensu checks are compatible with Nagios checks. So if you know of a Nagios check that does what you need, you can stop right here and go grab that. Otherwise, let's continue.

The exit status of a Sensu check should be:

0: ok

1: warning

2: critical

3 or more: unknown

A Sensu check also outputs text describing the state to stdout or stderr.

Example outputs of check-ram.rb:

Exit Status   Output
0             CheckRAM OK: 65% free RAM
…
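The exit-status convention above can be sketched in plain Ruby. This is not the actual check-ram.rb: the method names and the 10% and 5% thresholds are assumptions for illustration, and only the message format follows the example output shown.

```ruby
# Exit statuses as listed above (3 stands in for "3 or more").
EXIT_CODES = { ok: 0, warning: 1, critical: 2, unknown: 3 }.freeze

# Decide the check state from the percentage of free RAM.
# The warn/crit defaults are hypothetical, not taken from check-ram.rb.
def ram_check(free_percent, warn: 10, crit: 5)
  return [:unknown, "CheckRAM UNKNOWN: no reading"] if free_percent.nil?

  state = if free_percent < crit
            :critical
          elsif free_percent < warn
            :warning
          else
            :ok
          end
  [state, "CheckRAM #{state.to_s.upcase}: #{free_percent}% free RAM"]
end

# Print the message to stdout (or another IO) and return the exit status.
def report(state, message, out: $stdout)
  out.puts(message)
  EXIT_CODES.fetch(state)
end
```

A script built on this would end with something like `exit report(*ram_check(free_percent))`, where `free_percent` comes from whatever measurement the check performs.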

programblings.com Read
On Labnotes 9 months ago.

§ DevOps Borat :

Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet.

§ DownloadMoreRAM . Just downloaded 4GB of RAM to my iPhone, 10 sec, $ 0.

§ Sayings 2.0 :

Never judge an app by its icon

A watched status update never gets liked.

Close, but no WiFi.

blog.labnotes.org Read

Configure checks in Nagios, but configure a contact that drops the alerts

Read Nagios's state out of a file + parse it

Aggregate the checks by regex, and alert if a percentage is critical
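The steps above can be sketched roughly as follows. The parser assumes a simplified status file made of `servicestatus { ... }` blocks with `host_name=`, `service_description=` and `current_state=` lines. The method names and the 50% default threshold are hypothetical and not taken from any particular tool.

```ruby
CRITICAL = 2 # state value treated as "critical" in this sketch

# Parse the (assumed, simplified) status file text into
# { "host/service" => current_state }.
def parse_status(text)
  checks = {}
  text.scan(/servicestatus\s*\{(.*?)\}/m) do |(body)|
    fields = body.scan(/^\s*(\w+)=(.*)$/).to_h
    name = "#{fields['host_name']}/#{fields['service_description']}"
    checks[name] = fields["current_state"].to_i
  end
  checks
end

# Fraction of the checks whose name matches `pattern` that are critical.
def critical_ratio(checks, pattern)
  matching = checks.select { |name, _state| name =~ pattern }
  return 0.0 if matching.empty?

  matching.count { |_name, state| state == CRITICAL }.to_f / matching.size
end

# Alert only when at least `threshold` of the matching checks are critical.
def aggregate_alert?(checks, pattern, threshold: 0.5)
  critical_ratio(checks, pattern) >= threshold
end
```

For example, `aggregate_alert?(parse_status(File.read(path)), %r{^web\d+/HTTP$})` would stay quiet while a single web host fails and alert once half of them are critical, with `path` pointing at the status file.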

It's a godsend for people who manage large Nagios instances, but it starts falling down if you've got multiple independent Nagios instances (shards) that are checking the same thing.

You still end up with a situation where each of your shards alerts if the shared entity they're …

holmwood.id.au Read
By Nick of Signal vs. Noise 12 months ago.

We have a lot of data to parse through at 37signals. Our internal stats application, Dash, does the majority of heavy data lifting for us, including reports, application health, CI builds, and much more. Our Campfire bot named Tally happily pings us when a build fails, deploys are fired off, and when Nagios alerts pop up.

I had a problem though: I needed to have all of this data open constantly to absorb it. Either I had to look at the pages on Dash…

37signals.com Read
On paperplanes 1 year ago.

And yet, how many of you are still using Nagios?

There are great advances in monitoring at the moment, and I enjoy watching them as someone who greatly benefits from them.

Yet, I'm worried that all these advances still don't focus enough on the single thing that's supposed to use them: humans.

There's lots of work going on to solve problems to make monitoring technology more accessible, yet I feel like we haven't solved the first problem at hand: to make monitoring …

paperplanes.de Read