Hullo there,
I'm troubleshooting a Debian server that is intermittently freezing up and consuming 100% CPU. The problems began about five days ago. At around this time two things changed: the exim configuration, and the IP address of a related server (one this server relays some of its mail to). There have also been various package upgrades. The changes to exim were minimal (just a change of relay address), so they are probably unrelated, but I thought I'd better mention them.
Munin shows a graphical report of the load here: http://aleph1.co.uk/munin/aleph1.co.uk/stoneboat.aleph1.co.uk/load.html
The strange thing is how it was spiking like mad on the 12th/13th, then calmed down for a few days until yesterday.
I can't seem to isolate any particular process on the server when this happens. Restarting various services can bring the load down, but there is no indication that those processes are going particularly mad beforehand. I can't find anything in the server logs reporting anything amiss. There seems to be plenty of free memory and swap space on the system.
I've been running htop and trying to catch what starts the freezing, but it seems different every time - a process will show as using 100% CPU, then htop itself goes sluggish, and when it next refreshes there are several processes queued up at 100% CPU. It can then take up to two minutes to type a command into an ssh terminal. iotop doesn't show anything significant either. Unfortunately these tools are also subject to the same lag, so are probably not showing the state at the time of the freeze anyway.
I've tried running tools like iostat, but fail miserably to understand the resulting jumble of numbers.
If anyone can point me to better diagnostic tools, or give any hint of where to look for the culprit of this behaviour, I'd be most grateful.
Cheers,
Jenny
On 18/12/13 11:22, Jenny Hopkins wrote:
I'm troubleshooting a Debian server that is intermittently freezing up and consuming 100% CPU. The problems began about five days ago. At around this time two things changed: the exim configuration, and the IP address of a related server (one this server relays some of its mail to). There have also been various package upgrades. The changes to exim were minimal (just a change of relay address), so they are probably unrelated, but I thought I'd better mention them.
[]
I've been running htop and trying to catch what starts the freezing, but it seems different every time - a process will show as using 100% CPU, then htop itself goes sluggish, and when it next refreshes there are several processes queued up at 100% CPU. It can then take up to two minutes to type a command into an ssh terminal. iotop doesn't show anything significant either. Unfortunately these tools are also subject to the same lag, so are probably not showing the state at the time of the freeze anyway.
I've tried running tools like iostat, but fail miserably to understand the resulting jumble of numbers.
If anyone can point me to better diagnostic tools, or give any hint of where to look for the culprit of this behaviour, I'd be most grateful.
A guess. Occasionally one of my laptops will get locked up with the processor at 100%. Sometimes when this happens, htop won't tell me what's hogging the processor. I eventually tracked down the problem using top instead of htop. My problem showed up as the "wa" figure in top's CPU summary line being very high. Basically this meant that the machine was locked waiting (hence "wa") for disk input/output (i/o).
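If you want to catch that figure without sitting watching a live screen, something like this should do it (both tools are in the standard Debian packages; the exact column layout varies a bit between versions):

    # one snapshot of top's summary lines - "wa" is the percentage of CPU
    # time spent waiting on I/O
    top -b -n 1 | head -n 5

    # or watch the "wa" column once a second, five samples
    vmstat 1 5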
If it is this, you may find that something like lsof (LiSt Open Files) helps. Alternatively, I've found sudo iftop useful sometimes, if the busy thing is network related. Use -i interfacename (e.g. -i eth1) to specify a particular network device.
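Off the top of my head, the sort of invocations I mean (the path and interface name below are just examples):

    # list open files under a directory (can be slow on big trees)
    sudo lsof +D /var/spool/exim4

    # per-connection network traffic on a given interface
    sudo iftop -i eth0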
Hope that helps.
Good luck
Steve
On 19 December 2013 00:31, steve-ALUG@hst.me.uk wrote:
[]
A guess. Occasionally one of my laptops will get locked up with the processor at 100%. Sometimes when this happens, htop won't tell me what's hogging the processor. I eventually tracked down the problem using top instead of htop. My problem showed up as the "wa" figure in top's CPU summary line being very high. Basically this meant that the machine was locked waiting (hence "wa") for disk input/output (i/o).
If it is this, you may find that something like lsof (LiSt Open Files) helps. Alternatively, I've found sudo iftop useful sometimes, if the busy thing is network related. Use -i interfacename (e.g. -i eth1) to specify a particular network device.
Thanks Steve. I've set up a cron job to capture the output of various commands and store it in a timestamped file, and will add lsof to the list. If the ssh connection to the server hangs for too long to catch anything red-handed, I can get in afterwards and consult the latest file if need be.
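For the record, it's roughly along these lines (the paths and the command list here are illustrative rather than exactly what's on the box):

    #!/bin/sh
    # snapshot.sh - dump diagnostic output into a timestamped file
    OUTDIR=/var/log/snapshots
    OUT="$OUTDIR/snapshot-$(date +%Y%m%d-%H%M%S).txt"
    mkdir -p "$OUTDIR"
    {
      echo "== uptime ==";  uptime
      echo "== top ==";     top -b -n 1 | head -n 30
      echo "== vmstat ==";  vmstat 1 3
      echo "== iostat ==";  iostat -x 1 3
      # open-file count per command, top 20
      echo "== lsof ==";    lsof -n | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn | head -n 20
    } > "$OUT" 2>&1

with a crontab entry to run it every five minutes:

    */5 * * * * /usr/local/bin/snapshot.sh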
Yesterday teatime the server decided to resume normality. It's boiled down to two possible causes.
First, a reply from Bytemark this morning: one of the hard drives on our host had failed; they spotted it yesterday, replaced the drive and rebuilt the RAID.
Second, we were subject to a SYN flooding attack; this was in dmesg: "TCP: Possible SYN flooding on port 25. Sending cookies. Check SNMP counters."
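For anyone else who hits that message: as far as I can tell, the counters it refers to are the ones netstat -s reports, so something like this shows whether it is still happening (exact wording varies by kernel version):

    netstat -s | grep -iE 'syn|listen'

    # and confirm syncookies are actually enabled
    sysctl net.ipv4.tcp_syncookies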
I've added the log files to our daily backup, so that if it happens again I can check the dmesg backups for SYN flooding warnings.
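Something as simple as this covers the dmesg part (the path is just an example; note the escaped %, which crontab needs):

    # daily dated copy of dmesg, in root's crontab
    0 6 * * * dmesg > /var/backups/dmesg-$(date +\%F).txt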
Aye, well, the day wasn't entirely wasted yesterday, as quite a few cobwebs were uncovered and cleaned up as we tried to find the cause of the load.
Thanks,
Jenny