I've been playing with monit[1] to monitor a server which has been having some problems. Specifically, a service would stop responding to TCP requests, so although it was still "running" it was no longer any use. monit is able to make connections periodically and check whether or not they're being accepted and giving the expected response.
It comes with an example configuration for monitoring the system (memory/swap/cpu usage, and load averages). I have it monitoring load averages based on that configuration, but I'm getting a lot of alarm reports every day and I need to tweak the config. The existing config is fairly self-explanatory:
    if loadavg (1min) > 4 for 2 cycles then alert
    if loadavg (5min) > 2 for 2 cycles then alert
In other words, if the 1 or 5 min load average exceeds the level shown for two or more consecutive tests (which take place every 2 mins) I'll get an alert.
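To make the "for 2 cycles" behaviour concrete, here is a small Python sketch of how I understand it. This is a simplified model, not monit's actual implementation: the function name `alert_cycles` and the sample values are made up for illustration, and it assumes one alert per run of consecutive failures.

```python
def alert_cycles(samples, threshold, cycles=2):
    """Return the indices of the checks at which an alert would fire.

    Simplified model (an assumption, not monit's code): an alert fires
    once when the value has exceeded the threshold for `cycles`
    consecutive checks, and the counter resets as soon as one check
    comes back at or below the threshold.
    """
    fired = []
    streak = 0
    for index, value in enumerate(samples):
        if value > threshold:
            streak += 1
        else:
            streak = 0
        if streak == cycles:
            fired.append(index)
    return fired


# Hypothetical 1min load averages, one sample per 2-minute cycle.
one_min_samples = [1.0, 4.5, 3.0, 4.2, 4.8, 5.0, 2.0]
print(alert_cycles(one_min_samples, threshold=4))  # -> [4]
```

The single spike at the second check does not alert; only the sustained run starting at the fourth check does.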
My question is: what would be sensible load averages to set the thresholds at, or are they essentially meaningless and I should ditch them? The server has an AMD Athlon(tm) II Neo N36L Dual-Core Processor, so as I understand it a load average of anything up to 2 means the CPU is under-utilised anyway?
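A rough way to check that reasoning is to divide the load average by the number of cores. The sketch below is only an illustration: `load_per_core` is a hypothetical helper, and it treats load as purely CPU demand, which is a simplification because on Linux the load average also counts tasks waiting on I/O.

```python
import os


def load_per_core(load_average, cores):
    """Return the load average divided by the number of cores.

    Under the simplifying assumption that load is all CPU demand,
    a result below 1.0 suggests there is spare CPU capacity.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    cores = os.cpu_count() or 1
    one_min, five_min, fifteen_min = os.getloadavg()
    print("cores:", cores)
    print("1min load per core:", load_per_core(one_min, cores))
    print("5min load per core:", load_per_core(five_min, cores))
```

On the dual-core box above, a load average of 2 would come out as 1.0 per core, i.e. roughly fully used rather than overloaded.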
Mark Rogers wrote:
    if loadavg (1min) > 4 for 2 cycles then alert
    if loadavg (5min) > 2 for 2 cycles then alert
In other words, if the 1 or 5 min load average exceeds the level shown for two or more consecutive tests (which take place every 2 mins) I'll get an alert.
My question is: what would be sensible load averages to set the thresholds at, or are they essentially meaningless and I should ditch them? The server has an AMD Athlon(tm) II Neo N36L Dual-Core Processor, so as I understand it a load average of anything up to 2 means the CPU is under-utilised anyway?
I think my values for your situation are 10 (1min) and 4 (5min), but I might be misreading my configuration. As I understood it, Brett on IRC suggested warning at numcores+1 and alerting at numcores*2 for the 1min.
They're not the most meaningful of data, but there are few cases where you want them high, so I'd set some sort of alarm. Just don't take it as the only measure.
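Brett's rule of thumb is simple arithmetic, sketched below in Python. The function name `load_thresholds` and the dictionary keys are made up for illustration; the rule itself (warn at numcores+1, alert at numcores*2 for the 1min average) is as I understood it from IRC.

```python
def load_thresholds(num_cores):
    """Return 1min load-average thresholds from the rule of thumb:
    warn at numcores + 1, alert at numcores * 2.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return {"warn": num_cores + 1, "alert": num_cores * 2}


# Dual-core machine, as in the original question.
print(load_thresholds(2))  # -> {'warn': 3, 'alert': 4}
```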
Hope that helps,
On 22/08/11 16:16, MJ Ray wrote:
Mark Rogers wrote:
    if loadavg (1min) > 4 for 2 cycles then alert
    if loadavg (5min) > 2 for 2 cycles then alert
I think my values for your situation are 10 (1min) and 4 (5min), but I might be misreading my configuration. As I understood it, Brett on IRC suggested warning at numcores+1 and alerting at numcores*2 for the 1min.
OK, I'll try your config of 10 and 4 and see how I get on. Brett's comments are in line with what I'd expect, but I'm already alerting at 4 (numcores*2). The purpose of this exercise is to reduce the alerts, because they're making more important alerts harder to spot and I don't believe load is a major problem at the moment.
Thanks for the help.