Hi!
My backup system has recently started causing me some grief.
I'm running backuppc backing up to an IDE drive in a Netgear SC101 SAN storage box. There are two drives in this device, and the second one is being talked to by Simple Backup, and is working fine.
The main/bigger drive is /dev/nbd1 and backuppc has started hanging whilst using it. It doesn't happen straight away - it's several hours after the backup has been started. If I power everything down, wait and switch it back on again, all the drives spin up and seem to work ok again.
The SC101 is attached via the SC101-nbd project driver. http://code.google.com/p/sc101-nbd/
It has all been working fine until recently. Nothing's changed, apart from installing routine security updates (inc kernel updates) I'm running ubuntu 10.04.
I'm currently doing a deep check on the disk using a tool from Ultimate boot Cd appropriate for the drive manufacturer, but a quick check said the disk's OK.
Any ideas why it's hanging like this? Once hung, I can't even ls the disk - it's completely blocked. I tried reformatting it yesterday, and erasing and reinstalling backuppc, but it crashed again.
Some of Syslog output below - there's many more after this one, but this is the first. Possible candidates: Disk SAN controller Main PC faulty Disk sleeping and failing to wake again.
Any ideas guys? Cheers Steve
Jan 11 07:00:28 myserver ntpd[1950]: kernel time sync status change 6001 Jan 11 07:02:01 myserver CRON[11758]: (logcheck) CMD ( if [ -x /usr/sbin/logcheck ]; then nice -n10 /usr/sbin/logcheck; fi) Jan 11 07:04:11 myserver kernel: [45960.212092] INFO: task jbd2/nbd1-8:12704 blocked for more than 120 seconds. Jan 11 07:04:11 myserver kernel: [45960.212100] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 11 07:04:11 myserver kernel: [45960.212104] jbd2/nbd1-8 D 00005645 0 12704 2 0x00000000 Jan 11 07:04:11 myserver kernel: [45960.212113] c6ce9de4 00000046 f7490000 00005645 00000000 c08964e0 c47b9c24 c08964e0 Jan 11 07:04:11 myserver kernel: [45960.212123] 3c1ac4f3 0000299c c08964e0 c08964e0 c47b9c24 c08964e0 c08964e0 d7e7d800 Jan 11 07:04:11 myserver kernel: [45960.212131] 00000000 0000299c c47b9980 c47b9980 c30094e0 c6ce9e2c c6ce9df4 c05b71a1 Jan 11 07:04:11 myserver kernel: [45960.212140] Call Trace: Jan 11 07:04:11 myserver kernel: [45960.212157] [<c05b71a1>] io_schedule+0x61/0xa0 Jan 11 07:04:11 myserver kernel: [45960.212165] [<c01d4b9d>] sync_page+0x3d/0x50 Jan 11 07:04:11 myserver kernel: [45960.212171] [<c05b795d>] __wait_on_bit+0x4d/0x70 Jan 11 07:04:11 myserver kernel: [45960.212176] [<c01d4b60>] ? sync_page+0x0/0x50 Jan 11 07:04:11 myserver kernel: [45960.212181] [<c01d4dc1>] wait_on_page_bit+0x91/0xa0 Jan 11 07:04:11 myserver kernel: [45960.212187] [<c0170e90>] ? wake_bit_function+0x0/0x50 Jan 11 07:04:11 myserver kernel: [45960.212193] [<c01d5051>] wait_on_page_writeback_range+0xa1/0x110 Jan 11 07:04:11 myserver kernel: [45960.212201] [<c023ca2e>] ? bio_alloc_bioset+0x2e/0xf0 Jan 11 07:04:11 myserver kernel: [45960.212209] [<c02da977>] ? jbd2_journal_write_metadata_buffer+0x277/0x370 Jan 11 07:04:11 myserver kernel: [45960.212215] [<c01d511a>] filemap_fdatawait+0x5a/0x70 Jan 11 07:04:11 myserver kernel: [45960.212221] [<c02d2bae>] jbd2_journal_commit_transaction+0x51e/0x1030 Jan 11 07:04:11 myserver kernel: [45960.212230] [<c01311b8>] ? default_spin_lock_flags+0x8/0x10 Jan 11 07:04:11 myserver kernel: [45960.212236] [<c0164c78>] ? try_to_del_timer_sync+0x68/0xb0 Jan 11 07:04:11 myserver kernel: [45960.212241] [<c02d93d5>] kjournald2+0x95/0x1c0 Jan 11 07:04:11 myserver kernel: [45960.212247] [<c0170e40>] ? autoremove_wake_function+0x0/0x50 Jan 11 07:04:11 myserver kernel: [45960.212252] [<c02d9340>] ? kjournald2+0x0/0x1c0 Jan 11 07:04:11 myserver kernel: [45960.212256] [<c0170bb4>] kthread+0x74/0x80 Jan 11 07:04:11 myserver kernel: [45960.212261] [<c0170b40>] ? kthread+0x0/0x80 Jan 11 07:04:11 myserver kernel: [45960.212267] [<c010a467>] kernel_thread_helper+0x7/0x10 Jan 11 07:04:11 myserver kernel: [45960.212278] INFO: task flush-43:16:11670 blocked for more than 120 seconds. Jan 11 07:04:11 myserver kernel: [45960.212281] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 11 07:04:11 myserver kernel: [45960.212285] flush-43:16 D 00000484 0 11670 2 0x00000000 Jan 11 07:04:11 myserver kernel: [45960.212291] c46d3b34 00000046 c07eaa20 00000484 00000000 c08964e0 ef5735a4 c08964e0 Jan 11 07:04:11 myserver kernel: [45960.212300] f5f3884b 0000299a c08964e0 c08964e0 ef5735a4 c08964e0 c08964e0 e8566a00 Jan 11 07:04:11 myserver kernel: [45960.212308] f5f2b230 0000299a ef573300 c2e0afe0 eb6accc0 c46d3b74 c46d3b94 c02d1ec5 Jan 11 07:04:11 myserver kernel: [45960.212316] Call Trace: Jan 11 07:04:11 myserver kernel: [45960.212324] [<c02d1ec5>] do_get_write_access+0x245/0x440 Jan 11 07:04:11 myserver kernel: [45960.212330] [<c0170e90>] ? wake_bit_function+0x0/0x50 Jan 11 07:04:11 myserver kernel: [45960.212336] [<c02d2238>] jbd2_journal_get_write_access+0x28/0x40 Jan 11 07:04:11 myserver kernel: [45960.212343] [<c02bb3fe>] __ext4_journal_get_write_access+0x2e/0x60 Jan 11 07:04:11 myserver kernel: [45960.212349] [<c02bd611>] ext4_mb_mark_diskspace_used+0x61/0x2f0 Jan 11 07:04:11 myserver kernel: [45960.212354] [<c02c38a6>] ext4_mb_new_blocks+0x306/0x590 Jan 11 07:04:11 myserver kernel: [45960.212360] [<c02b8613>] ? ext4_ext_find_extent+0x253/0x280 Jan 11 07:04:11 myserver kernel: [45960.212365] [<c02baf12>] ext4_ext_get_blocks+0x472/0x700 Jan 11 07:04:11 myserver kernel: [45960.212373] [<c0298f98>] ext4_get_blocks+0xf8/0x2c0 Jan 11 07:04:11 myserver kernel: [45960.212379] [<c02997c9>] mpage_da_map_blocks+0x99/0x410 Jan 11 07:04:11 myserver kernel: [45960.212385] [<c02b0d90>] ? ext4_journal_start_sb+0xc0/0xf0 Jan 11 07:04:11 myserver kernel: [45960.212390] [<c0296917>] ? ext4_meta_trans_blocks+0xb7/0xd0 Jan 11 07:04:11 myserver kernel: [45960.212396] [<c0299dc9>] ext4_da_writepages+0x289/0x580 Jan 11 07:04:11 myserver kernel: [45960.212403] [<c01489d9>] ? update_curr+0x169/0x2a0 Jan 11 07:04:11 myserver kernel: [45960.212410] [<c01dc66c>] do_writepages+0x1c/0x40 Jan 11 07:04:11 myserver kernel: [45960.212417] [<c0232a15>] writeback_single_inode+0xd5/0x380 Jan 11 07:04:11 myserver kernel: [45960.212422] [<c02331f9>] writeback_sb_inodes+0x149/0x230 Jan 11 07:04:11 myserver kernel: [45960.212427] [<c023364a>] writeback_inodes_wb+0x8a/0x160 Jan 11 07:04:11 myserver kernel: [45960.212432] [<c0233903>] wb_writeback+0x1e3/0x230 Jan 11 07:04:11 myserver kernel: [45960.212438] [<c0233a81>] wb_do_writeback+0x131/0x140 Jan 11 07:04:11 myserver kernel: [45960.212443] [<c0233acb>] bdi_writeback_task+0x3b/0xb0 Jan 11 07:04:11 myserver kernel: [45960.212448] [<c01eb9e0>] ? bdi_start_fn+0x0/0xc0 Jan 11 07:04:11 myserver kernel: [45960.212453] [<c01eba3d>] bdi_start_fn+0x5d/0xc0 Jan 11 07:04:11 myserver kernel: [45960.212457] [<c01eb9e0>] ? bdi_start_fn+0x0/0xc0 Jan 11 07:04:11 myserver kernel: [45960.212461] [<c0170bb4>] kthread+0x74/0x80 Jan 11 07:04:11 myserver kernel: [45960.212466] [<c0170b40>] ? kthread+0x0/0x80 Jan 11 07:04:11 myserver kernel: [45960.212471] [<c010a467>] kernel_thread_helper+0x7/0x10 Jan 11 07:05:01 myserver CRON[11897]: (root) CMD (if [ -x /usr/bin/vnstat ] && [ `ls /var/lib/vnstat/ | wc -l` -ge 1 ]; then /usr/bin/vnstat -u; fi) Jan 11 07:06:11 myserver kernel: [46080.212093] INFO: task jbd2/nbd1-8:12704 blocked for more than 120 seconds. Jan 11 07:06:11 myserver kernel: [46080.212101] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 11 07:06:11 myserver kernel: [46080.212106] jbd2/nbd1-8 D 00005645 0 12704 2 0x00000000 Jan 11 07:06:11 myserver kernel: [46080.212114] c6ce9de4 00000046 f7490000 00005645 00000000 c08964e0 c47b9c24 c08964e0 Jan 11 07:06:11 myserver kernel: [46080.212124] 3c1ac4f3 0000299c c08964e0 c08964e0 c47b9c24 c08964e0 c08964e0 d7e7d800 Jan 11 07:06:11 myserver kernel: [46080.212132] 00000000 0000299c c47b9980 c47b9980 c30094e0 c6ce9e2c c6ce9df4 c05b71a1 Jan 11 07:06:11 myserver kernel: [46080.212140] Call Trace: Jan 11 07:06:11 myserver kernel: [46080.212156] [<c05b71a1>] io_schedule+0x61/0xa0 Jan 11 07:06:11 myserver kernel: [46080.212165] [<c01d4b9d>] sync_page+0x3d/0x50 Jan 11 07:06:11 myserver kernel: [46080.212171] [<c05b795d>] __wait_on_bit+0x4d/0x70 Jan 11 07:06:11 myserver kernel: [46080.212175] [<c01d4b60>] ? sync_page+0x0/0x50 Jan 11 07:06:11 myserver kernel: [46080.212180] [<c01d4dc1>] wait_on_page_bit+0x91/0xa0 Jan 11 07:06:11 myserver kernel: [46080.212187] [<c0170e90>] ? wake_bit_function+0x0/0x50 Jan 11 07:06:11 myserver kernel: [46080.212192] [<c01d5051>] wait_on_page_writeback_range+0xa1/0x110 Jan 11 07:06:11 myserver kernel: [46080.212200] [<c023ca2e>] ? bio_alloc_bioset+0x2e/0xf0 Jan 11 07:06:11 myserver kernel: [46080.212207] [<c02da977>] ? jbd2_journal_write_metadata_buffer+0x277/0x370 Jan 11 07:06:11 myserver kernel: [46080.212213] [<c01d511a>] filemap_fdatawait+0x5a/0x70 Jan 11 07:06:11 myserver kernel: [46080.212219] [<c02d2bae>] jbd2_journal_commit_transaction+0x51e/0x1030 Jan 11 07:06:11 myserver kernel: [46080.212228] [<c01311b8>] ? default_spin_lock_flags+0x8/0x10 Jan 11 07:06:11 myserver kernel: [46080.212234] [<c0164c78>] ? try_to_del_timer_sync+0x68/0xb0 Jan 11 07:06:11 myserver kernel: [46080.212239] [<c02d93d5>] kjournald2+0x95/0x1c0 Jan 11 07:06:11 myserver kernel: [46080.212245] [<c0170e40>] ? autoremove_wake_function+0x0/0x50 Jan 11 07:06:11 myserver kernel: [46080.212250] [<c02d9340>] ? kjournald2+0x0/0x1c0 Jan 11 07:06:11 myserver kernel: [46080.212254] [<c0170bb4>] kthread+0x74/0x80 Jan 11 07:06:11 myserver kernel: [46080.212259] [<c0170b40>] ? kthread+0x0/0x80 Jan 11 07:06:11 myserver kernel: [46080.212265] [<c010a467>] kernel_thread_helper+0x7/0x10 Jan 11 07:06:11 myserver kernel: [46080.212277] INFO: task flush-43:16:11670 blocked for more than 120 seconds. Jan 11 07:06:11 myserver kernel: [46080.212282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jan 11 07:06:11 myserver kernel: [46080.212285] flush-43:16 D 00000484 0 11670 2 0x00000000 Jan 11 07:06:11 myserver kernel: [46080.212292] c46d3b34 00000046 c07eaa20 00000484 00000000 c08964e0 ef5735a4 c08964e0 Jan 11 07:06:11 myserver kernel: [46080.212300] f5f3884b 0000299a c08964e0 c08964e0 ef5735a4 c08964e0 c08964e0 e8566a00 Jan 11 07:06:11 myserver kernel: [46080.212309] f5f2b230 0000299a ef573300 c2e0afe0 eb6accc0 c46d3b74 c46d3b94 c02d1ec5 Jan 11 07:06:11 myserver kernel: [46080.212317] Call Trace: Jan 11 07:06:11 myserver kernel: [46080.212325] [<c02d1ec5>] do_get_write_access+0x245/0x440 Jan 11 07:06:11 myserver kernel: [46080.212331] [<c0170e90>] ? wake_bit_function+0x0/0x50 Jan 11 07:06:11 myserver kernel: [46080.212337] [<c02d2238>] jbd2_journal_get_write_access+0x28/0x40 Jan 11 07:06:11 myserver kernel: [46080.212344] [<c02bb3fe>] __ext4_journal_get_write_access+0x2e/0x60 Jan 11 07:06:11 myserver kernel: [46080.212349] [<c02bd611>] ext4_mb_mark_diskspace_used+0x61/0x2f0 Jan 11 07:06:11 myserver kernel: [46080.212354] [<c02c38a6>] ext4_mb_new_blocks+0x306/0x590 Jan 11 07:06:11 myserver kernel: [46080.212359] [<c02b8613>] ? ext4_ext_find_extent+0x253/0x280 Jan 11 07:06:11 myserver kernel: [46080.212364] [<c02baf12>] ext4_ext_get_blocks+0x472/0x700 Jan 11 07:06:11 myserver kernel: [46080.212372] [<c0298f98>] ext4_get_blocks+0xf8/0x2c0 Jan 11 07:06:11 myserver kernel: [46080.212377] [<c02997c9>] mpage_da_map_blocks+0x99/0x410 Jan 11 07:06:11 myserver kernel: [46080.212383] [<c02b0d90>] ? ext4_journal_start_sb+0xc0/0xf0 Jan 11 07:06:11 myserver kernel: [46080.212388] [<c0296917>] ? ext4_meta_trans_blocks+0xb7/0xd0 Jan 11 07:06:11 myserver kernel: [46080.212393] [<c0299dc9>] ext4_da_writepages+0x289/0x580 Jan 11 07:06:11 myserver kernel: [46080.212400] [<c01489d9>] ? update_curr+0x169/0x2a0 Jan 11 07:06:11 myserver kernel: [46080.212407] [<c01dc66c>] do_writepages+0x1c/0x40 Jan 11 07:06:11 myserver kernel: [46080.212413] [<c0232a15>] writeback_single_inode+0xd5/0x380 Jan 11 07:06:11 myserver kernel: [46080.212418] [<c02331f9>] writeback_sb_inodes+0x149/0x230 Jan 11 07:06:11 myserver kernel: [46080.212422] [<c023364a>] writeback_inodes_wb+0x8a/0x160 Jan 11 07:06:11 myserver kernel: [46080.212427] [<c0233903>] wb_writeback+0x1e3/0x230 Jan 11 07:06:11 myserver kernel: [46080.212433] [<c0233a81>] wb_do_writeback+0x131/0x140 Jan 11 07:06:11 myserver kernel: [46080.212438] [<c0233acb>] bdi_writeback_task+0x3b/0xb0 Jan 11 07:06:11 myserver kernel: [46080.212443] [<c01eb9e0>] ? bdi_start_fn+0x0/0xc0 Jan 11 07:06:11 myserver kernel: [46080.212447] [<c01eba3d>] bdi_start_fn+0x5d/0xc0 Jan 11 07:06:11 myserver kernel: [46080.212451] [<c01eb9e0>] ? bdi_start_fn+0x0/0xc0 Jan 11 07:06:11 myserver kernel: [46080.212455] [<c0170bb4>] kthread+0x74/0x80 Jan 11 07:06:11 myserver kernel: [46080.212459] [<c0170b40>] ? kthread+0x0/0x80 Jan 11 07:06:11 myserver kernel: [46080.212464] [<c010a467>] kernel_thread_helper+0x7/0x10 Jan 11 07:08:11 myserver kernel: [46200.212124] INFO: task jbd2/nbd1-8:12704 blocked for more than 120 seconds.
On 11/01/13 12:16, steve-alug@hst.me.uk wrote:
Hi!
My backup system has recently started causing me some grief.
[]
Done some checking on the disk, and although smart says it's OK, it also says that there are hundreds or corrected and uncorrected disk errors, both figures way above the listed acceptable upper bound, and yet, smart still says it's OK. Anyway, trying with a different disk and see what happens.
On 12/01/13 01:40, steve-ALUG@hst.me.uk wrote:
On 11/01/13 12:16, steve-alug@hst.me.uk wrote:
Hi!
My backup system has recently started causing me some grief.
[]
Done some checking on the disk, and although smart says it's OK, it also says that there are hundreds or corrected and uncorrected disk errors, both figures way above the listed acceptable upper bound, and yet, smart still says it's OK. Anyway, trying with a different disk and see what happens.
To be honest on a modern disk I reject now if there are any Pending or Reallocated Sectors. The Call Trace you are seeing I think is because the JBD2 driver module is hung waiting on something...most likely I/O to the faulty disk. Either that or you have a network issue which is stalling access to the network block device, but the fact that you can see errors in SMART would point to the disk first.
Are logs available on the Netgear device...would be good to see if there are any IDE command timeouts that coincide with this happening on "myserver"
main@lists.alug.org.uk http://www.alug.org.uk/ http://lists.alug.org.uk/mailman/listinfo/main Unsubscribe? See message headers or the web site above!
On 12/01/13 09:25, Wayne Stallwood wrote:
On 12/01/13 01:40, steve-ALUG@hst.me.uk wrote:
On 11/01/13 12:16, steve-alug@hst.me.uk wrote:
Hi!
My backup system has recently started causing me some grief.
[]
Done some checking on the disk, and although smart says it's OK, it also says that there are hundreds or corrected and uncorrected disk errors, both figures way above the listed acceptable upper bound, and yet, smart still says it's OK. Anyway, trying with a different disk and see what happens.
To be honest on a modern disk I reject now if there are any Pending or Reallocated Sectors. The Call Trace you are seeing I think is because the JBD2 driver module is hung waiting on something...most likely I/O to the faulty disk. Either that or you have a network issue which is stalling access to the network block device, but the fact that you can see errors in SMART would point to the disk first.
Are logs available on the Netgear device...would be good to see if there are any IDE command timeouts that coincide with this happening on "myserver"
No logs unfortunately. I've replaced the suspect disk, and the backup is still up and running after a day and a bit, whearas, with the old disk, it'd hung within a day. Time will tell though.
Using DBAN to wipe the suspect disk succeeded with no errors - ah well - I can't trust that disk so it's for the high jump anyway!
Cheers Steve