Hi ALUG,

I've been managing a Xen hypervisor for about nine months now and occasionally I get problems with DomUs running out of memory. The clients are all students and most of them are running WordPress blogs. The instances of OOM errors I've seen seem to be related to Apache and students serving up large image files. I'm thinking that this shouldn't really be happening; serving up a few ~1MB files shouldn't really be causing Apache to make the OS exhaust *all* available memory, should it? Even if it were servicing several requests simultaneously.

Some details:

* The host is a Dell PowerEdge x86_64 system with 32 GB RAM
* The host OS is Debian 6.0
* We're running Xen 4.0.1 from Debian
* The guests all run Debian 6.0
* Each guest has 15 GB of storage, 512 MB RAM, and 1 GB of swap
* We currently have ~40 guests running

Below is some console output from a DomU that suffered this problem earlier today. You can see that the OOM killer killed Apache. And I'm guessing it killed sshd too, as I couldn't connect to the guest. I couldn't find any errors in Xen's logs.

Any thoughts on what might be going on here?

And come September we'll have a short window in which we could alter our setup. Any suggestions for better ways of providing virtual machines? Perhaps alternative hypervisors? Or some mechanism other than hypervisors?

Cheers,
Richard
(Working from home: <http://pic.twitter.com/NetsgOS2>)

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Richard Lewis
ISMS, Computing
Goldsmiths, University of London
t: +44 (0)20 7078 5134
j: ironchicken@jabber.earth.li
@: lewisrichard
s: richardjlewis
http://www.richardlewis.me.uk/
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

[ 1656.306197] rs:main Q:Reg invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
[ 1656.306215] rs:main Q:Reg cpuset=/ mems_allowed=0
[ 1656.306222] Pid: 1135, comm: rs:main Q:Reg Not tainted 2.6.32-5-xen-amd64 #1
[ 1656.306230] Call Trace:
[ 1656.306242] [<ffffffff810b7318>] ? oom_kill_process+0x7f/0x23f
[ 1656.306251] [<ffffffff810b783c>] ? __out_of_memory+0x12a/0x141
[ 1656.306259] [<ffffffff810b7993>] ? out_of_memory+0x140/0x172
[ 1656.306268] [<ffffffff810bb742>] ? __alloc_pages_nodemask+0x4ec/0x5fe
[ 1656.306278] [<ffffffff810bcca9>] ? __do_page_cache_readahead+0x9b/0x1b4
[ 1656.306286] [<ffffffff810bcdde>] ? ra_submit+0x1c/0x20
[ 1656.306294] [<ffffffff810b5a66>] ? filemap_fault+0x17d/0x2f6
[ 1656.306302] [<ffffffff810cba22>] ? __do_fault+0x54/0x3c3
[ 1656.306313] [<ffffffff8130c7d1>] ? __wait_on_bit_lock+0x76/0x84
[ 1656.306323] [<ffffffff8100c3a5>] ? __raw_callee_save_xen_pud_val+0x11/0x1e
[ 1656.306333] [<ffffffff810cdda8>] ? handle_mm_fault+0x3b8/0x80f
[ 1656.306342] [<ffffffff8100ecf2>] ? check_events+0x12/0x20
[ 1656.306351] [<ffffffff8130fb26>] ? do_page_fault+0x2e0/0x2fc
[ 1656.306360] [<ffffffff8130d9c5>] ? page_fault+0x25/0x30
[ 1656.306366] Mem-Info:
[ 1656.306370] Node 0 DMA per-cpu:
[ 1656.306376] CPU 0: hi: 0, btch: 1 usd: 0
[ 1656.306381] Node 0 DMA32 per-cpu:
[ 1656.306388] CPU 0: hi: 186, btch: 31 usd: 75
[ 1656.306397] active_anon:55210 inactive_anon:55287 isolated_anon:1350
[ 1656.306398] active_file:10 inactive_file:11 isolated_file:26
[ 1656.306400] unevictable:0 dirty:0 writeback:171 unstable:0
[ 1656.306401] free:1180 slab_reclaimable:922 slab_unreclaimable:2187
[ 1656.306402] mapped:16 shmem:7 pagetables:7768 bounce:0
[ 1656.306422] Node 0 DMA free:2032kB min:80kB low:100kB high:120kB active_anon:6168kB inactive_anon:6312kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:14868kB mlocked:0kB dirty:0kB writeback:8kB mapped:20kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:76kB kernel_stack:16kB pagetables:292kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:3 all_unreclaimable? yes
[ 1656.306456] lowmem_reserve[]: 0 489 489 489
[ 1656.306467] Node 0 DMA32 free:2688kB min:2788kB low:3484kB high:4180kB active_anon:214672kB inactive_anon:214836kB active_file:40kB inactive_file:44kB unevictable:0kB isolated(anon):5400kB isolated(file):104kB present:500960kB mlocked:0kB dirty:0kB writeback:676kB mapped:44kB shmem:28kB slab_reclaimable:3688kB slab_unreclaimable:8672kB kernel_stack:1232kB pagetables:30780kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:328 all_unreclaimable? yes
[ 1656.306501] lowmem_reserve[]: 0 0 0 0
[ 1656.306512] Node 0 DMA: 4*4kB 1*8kB 0*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 2040kB
[ 1656.306537] Node 0 DMA32: 232*4kB 2*8kB 1*16kB 0*32kB 1*64kB 3*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 2688kB
[ 1656.306561] 8628 total pagecache pages
[ 1656.306566] 8586 pages in swap cache
[ 1656.306571] Swap cache stats: add 1078026, delete 1069440, find 120149/196407
[ 1656.306579] Free swap = 4kB
[ 1656.306583] Total swap = 1048568kB
[ 1656.308293] 131072 pages RAM
[ 1656.308302] 4044 pages reserved
[ 1656.308307] 22558 pages shared
[ 1656.308311] 123779 pages non-shared
[ 1656.308317] Out of memory: kill process 489 (apache2) score 125474 or a child
[ 1656.308324] Killed process 964 (apache2)
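A back-of-envelope calculation suggests why a 512 MB prefork guest can be driven into swap even though each image is only ~1 MB: it's the number of simultaneous Apache children that dominates, not the file size. The numbers below are assumptions (a typical WordPress-loaded child RSS and a stock prefork MaxClients of 150), not measurements from these guests:

```shell
#!/bin/sh
# Worst-case prefork footprint on a 512 MB guest (all figures assumed).
GUEST_RAM_MB=512
CHILD_RSS_MB=45      # assumed per-child resident size with PHP loaded
MAX_CLIENTS=150      # common stock prefork default

WORST_CASE_MB=$((CHILD_RSS_MB * MAX_CLIENTS))
echo "worst case: ${WORST_CASE_MB} MB against ${GUEST_RAM_MB} MB of RAM"

# A safer ceiling: leave ~100 MB for the OS and MySQL, divide the rest
SAFE_CLIENTS=$(( (GUEST_RAM_MB - 100) / CHILD_RSS_MB ))
echo "suggested MaxClients: ${SAFE_CLIENTS}"
```

With those assumed figures the worst case is more than ten times the guest's RAM, so a burst of slow simultaneous requests is enough to trigger the OOM killer without any leak at all.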
On Wed, Jul 11, 2012 at 04:53:06PM +0100, Richard Lewis wrote:
Below is some console output from a DomU that suffered this problem earlier today. You can see that the OOM killer killed Apache. And I'm guessing it killed sshd too as I couldn't connect to the guest. I couldn't find any errors in Xen's logs.
Any thoughts on what might be going on here?
The oom killer doesn't necessarily kill what is using the RAM; it guesses at the best thing to kill. See http://linux-mm.org/OOM_Killer for more information on that.

What I'd suggest is that you put a job into cron that runs every 5 minutes and grabs a complete process list showing how much RAM each process is using. You should then be able to see where the memory is going and what is using it. I suspect something has a memory leak, but it may not be Apache itself causing the problem.

Adam
At Wed, 11 Jul 2012 18:57:39 +0100, Adam Bower wrote:
On Wed, Jul 11, 2012 at 04:53:06PM +0100, Richard Lewis wrote:
Below is some console output from a DomU that suffered this problem earlier today. You can see that the OOM killer killed Apache. And I'm guessing it killed sshd too as I couldn't connect to the guest. I couldn't find any errors in Xen's logs.
Any thoughts on what might be going on here?
The oom killer doesn't necessarily kill what is using the RAM; it guesses at the best thing to kill. See http://linux-mm.org/OOM_Killer for more information on that.
What I'd suggest is that you put a job into cron that runs every 5 minutes and grabs a complete process list showing how much RAM each process is using. You should then be able to see where the memory is going and what is using it. I suspect something has a memory leak, but it may not be Apache itself causing the problem.
Thanks for the suggestion. It's been executing a script along these lines every couple of minutes for the last few hours:

#!/bin/bash
LOG=/var/log/rjl-mem-info.log
NOW=`date +"%Y-%m-%d %H:%M:%S"`

echo "============================" >> $LOG
echo $NOW >> $LOG
ps aux | awk '{print $4, $10, $11}' | sort -rn | head -30 >> $LOG
free -m >> $LOG

So far swap usage has remained almost entirely static at 20 MB, and RAM usage has fluctuated between about 400 and 500 MB.

Of course, this has reminded me that these VMs are using pre-fork Apache, so I see 10 Apache processes. Typically, 7 of them report that they are using ~9% of the RAM, and 3 that they are using ~8%. Then there's also the parent Apache process using ~2%.

I'm not (so far) seeing any other processes using any significant amount of RAM, apart from MySQL. But that seems fairly static at 1.7%.

And, of course, it's also failed to misbehave :-(

Cheers,
Richard

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Richard Lewis
ISMS, Computing
Goldsmiths, University of London
t: +44 (0)20 7078 5134
j: ironchicken@jabber.earth.li
@: lewisrichard
s: richardjlewis
http://www.richardlewis.me.uk/
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
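One wrinkle with per-process figures is that prefork splits Apache's footprint across ten children, so the total is easy to misjudge. Summing %MEM by command name gives a single Apache figure; a sketch, assuming standard `ps aux` output where field 4 is %MEM and field 11 is the command:

```shell
#!/bin/sh
# Sum %MEM per command name so 10 prefork children show as one apache2 total.
ps aux | awk 'NR > 1 { mem[$11] += $4 }
              END { for (cmd in mem) printf "%6.1f%% %s\n", mem[cmd], cmd }' |
    sort -rn | head -10
```

On the numbers quoted above (7 children at ~9%, 3 at ~8%, plus the ~2% parent) that total comes to roughly 89%, i.e. Apache alone is already close to the guest's 512 MB even when idle-ish.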
On Wed, Jul 11, 2012 at 10:11:03PM +0100, Richard Lewis wrote:
I'm not (so far) seeing any other processes using any significant amount of RAM, apart from MySQL. But that seems fairly static at 1.7%.
That suggests it may be a single event that causes something to eat memory all of a sudden. I'm afraid you'll just have to keep waiting in this case :)

Part of the point of this exercise is simply to see whether memory usage stays constant over time or suddenly grows, linearly or exponentially, which might help you after it has gone wrong again.

Adam
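When it does go wrong, the `free -m` snapshots already being logged can be reduced to a used-memory time series, which makes the shape of the growth (flat, linear, or sudden) obvious at a glance. A sketch, assuming each snapshot in the log is a timestamp line followed by `free -m` output, as in the script quoted earlier in the thread:

```shell
#!/bin/sh
# Print "timestamp used-MB" for each logged free -m snapshot.
LOG=/var/log/rjl-mem-info.log   # path from the logging script above

awk '/^20[0-9][0-9]-/ { ts = $0 }
     /^Mem:/          { print ts, $3 " MB used" }' "$LOG"
```

Plotting or even just eyeballing that output around the time of an OOM event should show whether usage crept up or ballooned in one step.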
On 11 July 2012 23:08, Adam Bower <adam@thebowery.co.uk> wrote:
On Wed, Jul 11, 2012 at 10:11:03PM +0100, Richard Lewis wrote:
I'm not (so far) seeing any other processes using any significant amount of RAM, apart from MySQL. But that seems fairly static at 1.7%.
That suggests it may be a single event that causes something to eat memory all of a sudden. I'm afraid you'll just have to keep waiting in this case :)
Part of the point of this exercise is simply to see whether memory usage stays constant over time or suddenly grows, linearly or exponentially, which might help you after it has gone wrong again.
I had an OOM killer problem on one of my VMs hosted at Bytemark for weeks before I managed to trace the problem: bots trawling the trac directory of an Apache site. Banning them with robots.txt fixed it.

I had one script running every five minutes that checked memory usage, and if it was above a certain amount it sent all sorts of memory usage/process data to an output file. From this I could see it was always Apache (even though, as Adam says, the OOM killer was randomly killing anything it could to reclaim memory), and from there I started monitoring the Apache connections until I found it always stuck listening to googlebots. Memory usage would balloon from a few hundred MB to > 800 within minutes, when the OOM killer kicked in, making it very hard to pinpoint the problem.

I've still got scriptage if that helps.

Jenny
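A threshold-triggered dump along Jenny's lines might look something like the sketch below; the threshold, log path, and choice of what to capture are all assumptions, not her actual scriptage:

```shell
#!/bin/sh
# Hypothetical sketch: dump diagnostics only when memory use trips a threshold.
LOG=/var/log/mem-dump.log
THRESHOLD_MB=400   # assumed trip point for a 512 MB guest

# Used memory excluding buffers/cache, from the "-/+ buffers/cache" line
# of free -m (the layout on Debian 6's procps).
USED_MB=$(free -m | awk '/buffers\/cache/ { print $3 }')

if [ "$USED_MB" -gt "$THRESHOLD_MB" ]; then
    {
        date
        ps aux --sort=-rss | head -30        # biggest memory consumers
        netstat -tn | grep ':80 ' | head -30 # who Apache is talking to
    } >> "$LOG"
fi
```

Run from cron every few minutes, this keeps the log quiet in normal operation but captures the process list and HTTP connections in the minutes before the OOM killer strikes, which is exactly the window that is otherwise lost.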
At Sun, 29 Jul 2012 11:04:26 +0100, Jenny Hopkins wrote:
On 11 July 2012 23:08, Adam Bower <adam@thebowery.co.uk> wrote:
On Wed, Jul 11, 2012 at 10:11:03PM +0100, Richard Lewis wrote:
I'm not (so far) seeing any other processes using any significant amount of RAM, apart from MySQL. But that seems fairly static at 1.7%.
That suggests it may be a single event that causes something to eat memory all of a sudden. I'm afraid you'll just have to keep waiting in this case :)
Part of the point of this exercise is simply to see whether memory usage stays constant over time or suddenly grows, linearly or exponentially, which might help you after it has gone wrong again.
I had an OOM killer problem on one of my VMs hosted at Bytemark for weeks before I managed to trace the problem: bots trawling the trac directory of an Apache site. Banning them with robots.txt fixed it. I had one script running every five minutes that checked memory usage, and if it was above a certain amount it sent all sorts of memory usage/process data to an output file. From this I could see it was always Apache (even though, as Adam says, the OOM killer was randomly killing anything it could to reclaim memory), and from there I started monitoring the Apache connections until I found it always stuck listening to googlebots. Memory usage would balloon from a few hundred MB to > 800 within minutes, when the OOM killer kicked in, making it very hard to pinpoint the problem.
I've still got scriptage if that helps.
Thanks for sharing your experiences. I'll consider your reply evidence of interest in the thread and so provide a brief update.

The VM in question did eventually go on to misbehave in exactly the same way as before. I restarted it and checked my log file, which reported complete memory saturation (RAM and swap) by lots and lots of Apache processes.

As a result, I had a look at the Apache configuration and decided to do some performance tuning, especially of the Keep-Alive settings. All the settings were at their default values. The Keep-Alive timeout is possibly the most significant; I changed that from 15s to 3s, which will hopefully get Apache processes out of the way quicker in future. This particular VM has been running problem-free for around two weeks now.

I suppose the take-home message is: don't try and blame your virtualisation hypervisor before you've actually tuned your pre-fork Web server sensibly. I've effectively re-discovered something which has been common knowledge since about September 1993.

Richard

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Richard Lewis
ISMS, Computing
Goldsmiths, University of London
t: +44 (0)20 7078 5134
j: ironchicken@jabber.earth.li
@: lewisrichard
s: richardjlewis
http://www.richardlewis.me.uk/
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
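For anyone else running a 512 MB guest, the relevant knobs live in Apache 2.2's config (Debian 6 ships 2.2 with the prefork MPM). The values below are illustrative only; the 3s Keep-Alive timeout is the one change described above, the rest are assumed:

```apache
# /etc/apache2/apache2.conf (Apache 2.2, prefork MPM; values illustrative)
KeepAlive On
KeepAliveTimeout 3          # down from Debian's shipped 15s, as above
MaxKeepAliveRequests 100

<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients           10   # assumed: ~10 children at ~40 MB fits in 512 MB
    MaxRequestsPerChild 500   # recycle children to bound any slow leaks
</IfModule>
```

Capping MaxClients to what the RAM can actually hold turns a would-be OOM into queued requests instead, which is usually the better failure mode for a small VM.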
participants (3)

- Adam Bower
- Jenny Hopkins
- Richard Lewis