Hullo there,
I'm troubleshooting a Debian server that is intermittently freezing up and consuming 100% CPU. The problems began about five days ago. At around this time two things changed: the exim configuration, and the IP address of a related server (one this server relays some of its mail to). There have also been various package upgrades. The changes to exim were minimal (just a change of relay address), so they are probably unrelated, but I thought I'd better mention them.
Munin shows a graphical report of the load here: http://aleph1.co.uk/munin/aleph1.co.uk/stoneboat.aleph1.co.uk/load.html
The strange thing is how it was spiking like mad on the 12th/13th, then calmed down for a few days until yesterday.
I can't seem to isolate any particular process on the server when this happens. Restarting various services can bring the load down, but there is no indication that those processes are going particularly mad beforehand. I can't find anything in the server logs reporting anything amiss. There seems to be plenty of free memory and swap space on the system.
I've been running htop and trying to catch what starts the freezing, but it seems different every time - a process will show as using 100% CPU, then htop itself goes sluggish, and when it next refreshes there are several processes queued up at 100% CPU. It can then take up to two minutes to type a command into an ssh terminal. iotop doesn't show anything significant either. Unfortunately these tools are also subject to the same lag, so are probably not showing the state at the time of the freeze anyway.
I've tried running tools like iostat, but fail miserably to understand the resulting jumble of numbers.
If anyone can point me to better diagnostic tools, or give any hint of where to look for the culprit of this behaviour, I'd be most grateful.
Cheers,
Jenny
On 18/12/13 11:22, Jenny Hopkins wrote:
I'm troubleshooting a Debian server that is intermittently freezing up and consuming 100% CPU. The problems began about five days ago. At around this time two things changed: the exim configuration, and the IP address of a related server (one this server relays some of its mail to). There have also been various package upgrades. The changes to exim were minimal (just a change of relay address), so they are probably unrelated, but I thought I'd better mention them.
[]
I've been running htop and trying to catch what starts the freezing, but it seems different every time - a process will show as using 100% CPU, then htop itself goes sluggish, and when it next refreshes there are several processes queued up at 100% CPU. It can then take up to two minutes to type a command into an ssh terminal. iotop doesn't show anything significant either. Unfortunately these tools are also subject to the same lag, so are probably not showing the state at the time of the freeze anyway.
I've tried running tools like iostat, but fail miserably to understand the resulting jumble of numbers.
If anyone can point me to better diagnostic tools, or give any hint of where to look for the culprit of this behaviour, I'd be most grateful.
A guess. Occasionally one of my laptops will get locked up with the processor at 100%. Sometimes when this happens, htop won't tell me what's hogging the processor. I eventually tracked down the problem using top instead of htop. My problem showed up as the "wa" figure in top's CPU summary line being very high. Basically this meant that the machine was locked waiting (hence "wa") for disk input/output (i/o).
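If you want to catch that figure without sitting watching a live screen, something like this should do it (both tools are in the standard Debian packages; the exact column layout varies a bit between versions):

    # one snapshot of top's summary lines - "wa" is the percentage of CPU
    # time spent waiting on I/O
    top -b -n 1 | head -n 5

    # or watch the "wa" column once a second, five samples
    vmstat 1 5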
If it is this, you may find that something like lsof (LiSt Open Files) helps. Alternatively, I've found sudo iftop useful sometimes, if the busy thing is network related. Use -i interfacename (e.g. -i eth1) to specify a particular network device.
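Off the top of my head, the sort of invocations I mean (the path and interface name below are just examples):

    # list open files under a directory (can be slow on big trees)
    sudo lsof +D /var/spool/exim4

    # per-connection network traffic on a given interface
    sudo iftop -i eth0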
Hope that helps.
Good luck
Steve
On 19 December 2013 00:31, steve-ALUG@hst.me.uk wrote:
[]
A guess. Occasionally one of my laptops will get locked up with the processor at 100%. Sometimes when this happens, htop won't tell me what's hogging the processor. I eventually tracked down the problem using top instead of htop. My problem showed up as the "wa" figure in top's CPU summary line being very high. Basically this meant that the machine was locked waiting (hence "wa") for disk input/output (i/o).
If it is this, you may find that something like lsof (LiSt Open Files) helps. Alternatively, I've found sudo iftop useful sometimes, if the busy thing is network related. Use -i interfacename (e.g. -i eth1) to specify a particular network device.
Thanks Steve. I've set up a cron job to capture the output of various commands and store it in a timestamped file, and will add lsof to the list. If the ssh connection to the server hangs for too long to catch anything red-handed, I can get in afterwards and consult the latest file if need be.
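For the record, it's roughly along these lines (the paths and the command list here are illustrative rather than exactly what's on the box):

    #!/bin/sh
    # snapshot.sh - dump diagnostic output into a timestamped file
    OUTDIR=/var/log/snapshots
    OUT="$OUTDIR/snapshot-$(date +%Y%m%d-%H%M%S).txt"
    mkdir -p "$OUTDIR"
    {
      echo "== uptime ==";  uptime
      echo "== top ==";     top -b -n 1 | head -n 30
      echo "== vmstat ==";  vmstat 1 3
      echo "== iostat ==";  iostat -x 1 3
      # open-file count per command, top 20
      echo "== lsof ==";    lsof -n | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn | head -n 20
    } > "$OUT" 2>&1

with a crontab entry to run it every five minutes:

    */5 * * * * /usr/local/bin/snapshot.sh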
Yesterday teatime the server decided to resume normality. It's boiled down to two possible causes.
First, a reply from Bytemark this morning: one of the hard drives on our host had failed; they spotted it yesterday, replaced the drive and rebuilt the RAID.
Second, we were subject to a SYN flooding attack; this was in dmesg: "TCP: Possible SYN flooding on port 25. Sending cookies. Check SNMP counters."
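For anyone else who hits that message: as far as I can tell, the counters it refers to are the ones netstat -s reports, so something like this shows whether it is still happening (exact wording varies by kernel version):

    netstat -s | grep -iE 'syn|listen'

    # and confirm syncookies are actually enabled
    sysctl net.ipv4.tcp_syncookies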
I've added the log files to our daily backup, so that if it happens again I can check the dmesg backups for SYN flooding warnings.
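Something as simple as this covers the dmesg part (the path is just an example; note the escaped %, which crontab needs):

    # daily dated copy of dmesg, in root's crontab
    0 6 * * * dmesg > /var/backups/dmesg-$(date +\%F).txt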
Aye, well, the day wasn't entirely wasted yesterday, as quite a few cobwebs were uncovered and cleaned up as we tried to find the cause of the load.
Thanks,
Jenny