I had a RAID5 array set up as /dev/md1 using 4 external USB3 drives.
Following a reboot (I assume, although to be honest I'm not sure what caused it) the array hasn't come back up correctly:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 super 1.2 level 5, 512k chunk, algorithm 2 [4/0] [____]

md0 : active raid5 sda1[0] sdb1[1] sdd1[3] sdc1[2] 5860535808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
Note: /dev/md0 is a separate array and is working fine; md1 should comprise sd[fghi]1
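For anyone wanting to spot this state automatically, here is a minimal sketch (not something from this thread) that reads /proc/mdstat-style text and reports arrays whose active-device count is below the expected count, i.e. the [4/0] above. The function name degraded_arrays is made up for illustration.

```shell
#!/bin/sh
# Sketch: list md arrays with fewer active devices than expected.
# Reads /proc/mdstat-style text on stdin. degraded_arrays is a hypothetical name.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember which array this entry describes
        match($0, /\[[0-9]+\/[0-9]+\]/) {      # e.g. [4/0] = 4 expected, 0 active
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0)
                print name ": " n[2] " of " n[1] " devices active"
        }
    '
}

# Usage: degraded_arrays < /proc/mdstat
```

It only looks at the [expected/active] counter, so it says nothing about why the devices dropped out.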
All the "missing" partitions are present, e.g.:

$ sudo fdisk -l /dev/sdf

Disk /dev/sdf: 2000.4 GB, 2000398934016 bytes
81 heads, 63 sectors/track, 765633 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x2c2d5341

   Device Boot      Start         End      Blocks   Id  System
/dev/sdf1            2048  3907029167  1953513560   fd  Linux RAID autodetect
(same applies for the other three disks).
How do I get mdadm to add my disks into the array, both now and after a reboot?
On 20/12/12 12:48, Mark Rogers wrote:
I had a RAID5 array set up as /dev/md1 using 4 external USB3 drives.
Following a reboot (I assume, although to be honest I'm not sure what caused it) the array hasn't come back up correctly:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 super 1.2 level 5, 512k chunk, algorithm 2 [4/0] [____]

md0 : active raid5 sda1[0] sdb1[1] sdd1[3] sdc1[2] 5860535808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
Note: /dev/md0 is a separate array and is working fine; md1 should comprise sd[fghi]1
something like
sudo mdadm --assemble /dev/md1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1
perhaps?
Does this thread help?
http://ubuntuforums.org/showthread.php?t=923253
Steve
On 20/12/12 13:21, steve-ALUG@hst.me.uk wrote:
something like
sudo mdadm --assemble /dev/md1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1
perhaps?
$ sudo mdadm --assemble /dev/md1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1
mdadm: /dev/md1 is already in use.
Does this thread help?
No :-(
This might be relevant:
$ sudo grep md /var/log/syslog | head
Dec 20 07:06:55 FileServer kernel: [1699255.264012] EXT4-fs (md1): error count: 2
Dec 20 07:06:55 FileServer kernel: [1699255.264017] EXT4-fs (md1): initial error at 1354776072: ext4_journal_start_sb:327
Dec 20 07:06:55 FileServer kernel: [1699255.264021] EXT4-fs (md1): last error at 1354776072: ext4_journal_start_sb:327
Dec 20 10:44:59 FileServer kernel: [1712339.827622] Buffer I/O error on device md1, logical block 0
Dec 20 12:38:03 FileServer kernel: [1719124.235840] Buffer I/O error on device md1, logical block 1465036400
Dec 20 12:38:04 FileServer kernel: [1719124.252048] Buffer I/O error on device md1, logical block 1465036400
Dec 20 12:38:04 FileServer kernel: [1719124.261477] Buffer I/O error on device md1, logical block 1465036414
.. although it might just be complaining that md1 isn't consistent, which it won't be as it has no drives in it...
On 20/12/12 15:16, Mark Rogers wrote:
This might be relevant:
$ sudo grep md /var/log/syslog | head
Dec 20 07:06:55 FileServer kernel: [1699255.264012] EXT4-fs (md1): error count: 2
Dec 20 07:06:55 FileServer kernel: [1699255.264017] EXT4-fs (md1): initial error at 1354776072: ext4_journal_start_sb:327
Dec 20 07:06:55 FileServer kernel: [1699255.264021] EXT4-fs (md1): last error at 1354776072: ext4_journal_start_sb:327
OK, a bit more progress...
It appears that the uptime was >20 days which means I didn't reboot it when I thought I had. I just rebooted it and the array is back.
syslog shows errors like those above as far back as it goes (7 days) so I can't see what caused it - I haven't been using the array so wouldn't have noticed it fail. I'll have to monitor it and see.
Mark
On 20/12/12 15:27, Mark Rogers wrote:
It appears that the uptime was >20 days which means I didn't reboot it when I thought I had. I just rebooted it and the array is back.
Hmm, it seems to have happened again, only this time it's still recent enough to be in the logs. Coincidentally (or otherwise) the uptime on the box is now 19 days, so similar to last time.
As things stand: I have two RAID5 arrays. One (md0) comprises four local disks and is fine; the other (md1) comprises four disks (partitions sd[fghi]1) in a USB3 caddy and has vanished. At 13:29 yesterday I received five emails from mdadm telling me, respectively, of a FailSpare event on sdi1, Fail events on sdf1, sdg1 and sdh1, and finally a Fail event on the array md1.
I wasn't in the office when this happened but having come in this morning the external drive is up and running and according to fdisk all four partitions are currently present, but md1 is still inaccessible. Output from mdstat/fdisk/syslog included below.
My interpretation is that (a) "something" happened to cause all the drives to become inaccessible at once, (b) the "something" resolved itself (within a very short period of time - 2s in this case), (c) the array didn't come back up (not that I'd necessarily expect it to, but I might like to find a way to achieve this in future).
Two questions: (1) How can I bring the array back up without restarting the server, and (2) any clues as to what the "something" was and how to avoid it?
Hopefully relevant info follows (with no further comment from me)...
Mark
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid5 super 1.2 level 5, 512k chunk, algorithm 2 [4/0] [____]

md0 : active raid5 sdc1[2] sdd1[3] sda1[0] sdb1[1] 5860535808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>

$ sudo fdisk -l /dev/sd[fghi]
[...]
   Device Boot      Start         End      Blocks   Id  System
/dev/sdf1            2048  3907029167  1953513560   fd  Linux RAID autodetect
[...]
/dev/sdg1            2048  3907029167  1953513560   fd  Linux RAID autodetect
[...]
/dev/sdh1            2048  3907029167  1953513560   fd  Linux RAID autodetect
[...]
/dev/sdi1            2048  3907029167  1953513560   fd  Linux RAID autodetect
$ cat /var/log/syslog.1
[...]
Jan 8 13:29:43 FileServer kernel: [1635170.968066] usb 6-2: USB disconnect, device number 2
Jan 8 13:29:43 FileServer kernel: [1635170.968072] sd 6:0:0:2: Device offlined - not ready after error recovery
Jan 8 13:29:43 FileServer kernel: [1635170.968092] sd 6:0:0:2: [sdh] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.968097] sd 6:0:0:2: [sdh] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
Jan 8 13:29:43 FileServer kernel: [1635170.968104] sd 6:0:0:2: [sdh] CDB: Read(10): 28 40 de c4 0b 28 00 00 f0 00
Jan 8 13:29:43 FileServer kernel: [1635170.968123] end_request: I/O error, dev sdh, sector 3737389864
Jan 8 13:29:43 FileServer kernel: [1635170.968568] sd 6:0:0:2: rejecting I/O to offline device
Jan 8 13:29:43 FileServer kernel: [1635170.968917] sd 6:0:0:2: rejecting I/O to offline device
Jan 8 13:29:43 FileServer kernel: [1635170.969272] sd 6:0:0:2: rejecting I/O to offline device
Jan 8 13:29:43 FileServer kernel: [1635170.969681] sd 6:0:0:2: rejecting I/O to offline device
Jan 8 13:29:43 FileServer kernel: [1635170.969973] sd 6:0:0:2: rejecting I/O to offline device
Jan 8 13:29:43 FileServer kernel: [1635170.970263] sd 6:0:0:2: rejecting I/O to offline device
Jan 8 13:29:43 FileServer kernel: [1635170.970583] sd 6:0:0:0: [sdf] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.970588] sd 6:0:0:0: [sdf] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 8 13:29:43 FileServer kernel: [1635170.970594] sd 6:0:0:0: [sdf] CDB: Read(10): 28 00 de c4 0b a8 00 00 f0 00
Jan 8 13:29:43 FileServer kernel: [1635170.970612] end_request: I/O error, dev sdf, sector 3737389992
Jan 8 13:29:43 FileServer kernel: [1635170.971069] sd 6:0:0:1: [sdg] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.971074] sd 6:0:0:1: [sdg] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 8 13:29:43 FileServer kernel: [1635170.971081] sd 6:0:0:1: [sdg] CDB: Read(10): 28 20 de c4 09 b8 00 00 f0 00
Jan 8 13:29:43 FileServer kernel: [1635170.971100] end_request: I/O error, dev sdg, sector 3737389496
Jan 8 13:29:43 FileServer kernel: [1635170.971832] sd 6:0:0:3: [sdi] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.971837] sd 6:0:0:3: [sdi] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 8 13:29:43 FileServer kernel: [1635170.971844] sd 6:0:0:3: [sdi] CDB: Read(10): 28 60 de c4 09 58 00 00 f0 00
Jan 8 13:29:43 FileServer kernel: [1635170.971865] end_request: I/O error, dev sdi, sector 3737389400
Jan 8 13:29:43 FileServer kernel: [1635170.972329] sd 6:0:0:3: [sdi] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.972334] sd 6:0:0:3: [sdi] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 8 13:29:43 FileServer kernel: [1635170.972341] sd 6:0:0:3: [sdi] CDB: Read(10): 28 60 de c4 0a 48 00 00 f0 00
Jan 8 13:29:43 FileServer kernel: [1635170.972360] end_request: I/O error, dev sdi, sector 3737389640
Jan 8 13:29:43 FileServer kernel: [1635170.972854] sd 6:0:0:3: [sdi] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.972858] sd 6:0:0:3: [sdi] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 8 13:29:43 FileServer kernel: [1635170.972864] sd 6:0:0:3: [sdi] CDB: Read(10): 28 60 de c4 0b 38 00 00 f0 00
Jan 8 13:29:43 FileServer kernel: [1635170.972881] end_request: I/O error, dev sdi, sector 3737389880
Jan 8 13:29:43 FileServer kernel: [1635170.973282] sd 6:0:0:3: [sdi] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.973286] sd 6:0:0:3: [sdi] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 8 13:29:43 FileServer kernel: [1635170.973292] sd 6:0:0:3: [sdi] CDB: Write(10): 2a 60 de c4 09 58 00 00 60 00
Jan 8 13:29:43 FileServer kernel: [1635170.973309] end_request: I/O error, dev sdi, sector 3737389400
Jan 8 13:29:43 FileServer kernel: [1635170.973617] md/raid:md1: Disk failure on sdi1, disabling device.
Jan 8 13:29:43 FileServer kernel: [1635170.973620] md/raid:md1: Operation continuing on 3 devices.
Jan 8 13:29:43 FileServer kernel: [1635170.973626] sd 6:0:0:3: [sdi] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.973635] sd 6:0:0:3: [sdi] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 8 13:29:43 FileServer kernel: [1635170.973641] sd 6:0:0:3: [sdi] CDB: Read(10): 28 60 de c4 0c 28 00 00 f0 00
Jan 8 13:29:43 FileServer kernel: [1635170.973679] end_request: I/O error, dev sdi, sector 3737390120
Jan 8 13:29:43 FileServer kernel: [1635170.973707] md/raid:md1: read error not correctable (sector 3737388072 on sdi1).
Jan 8 13:29:43 FileServer kernel: [1635170.973734] md/raid:md1: read error not correctable (sector 3737388080 on sdi1).
Jan 8 13:29:43 FileServer kernel: [1635170.973743] md/raid:md1: read error not correctable (sector 3737388088 on sdi1).
Jan 8 13:29:43 FileServer kernel: [1635170.973759] md/raid:md1: read error not correctable (sector 3737388096 on sdi1).
Jan 8 13:29:43 FileServer kernel: [1635170.973785] md/raid:md1: read error not correctable (sector 3737388104 on sdi1).
Jan 8 13:29:43 FileServer kernel: [1635170.973799] md/raid:md1: read error not correctable (sector 3737388112 on sdi1).
Jan 8 13:29:43 FileServer kernel: [1635170.973818] md/raid:md1: read error not correctable (sector 3737388120 on sdi1).
Jan 8 13:29:43 FileServer kernel: [1635170.973850] md/raid:md1: read error not correctable (sector 3737388128 on sdi1).
Jan 8 13:29:43 FileServer kernel: [1635170.973868] md/raid:md1: read error not correctable (sector 3737388136 on sdi1).
Jan 8 13:29:43 FileServer kernel: [1635170.973880] md/raid:md1: read error not correctable (sector 3737388144 on sdi1).
Jan 8 13:29:43 FileServer kernel: [1635170.974029] sd 6:0:0:3: [sdi] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.974034] sd 6:0:0:3: [sdi] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 8 13:29:43 FileServer kernel: [1635170.974038] sd 6:0:0:3: [sdi] CDB: Read(10): 28 60 de c4 0d 18 00 00 f0 00
Jan 8 13:29:43 FileServer kernel: [1635170.974075] end_request: I/O error, dev sdi, sector 3737390360
Jan 8 13:29:43 FileServer kernel: [1635170.974277] sd 6:0:0:3: [sdi] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.974293] sd 6:0:0:3: [sdi] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 8 13:29:43 FileServer kernel: [1635170.974308] sd 6:0:0:3: [sdi] CDB: Read(10): 28 60 de c4 0e 08 00 00 f0 00
Jan 8 13:29:43 FileServer kernel: [1635170.974343] end_request: I/O error, dev sdi, sector 3737390600
Jan 8 13:29:43 FileServer kernel: [1635170.974546] sd 6:0:0:3: [sdi] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.974564] sd 6:0:0:3: [sdi] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 8 13:29:43 FileServer kernel: [1635170.974579] sd 6:0:0:3: [sdi] CDB: Read(10): 28 60 de c4 0e f8 00 00 f0 00
Jan 8 13:29:43 FileServer kernel: [1635170.974630] end_request: I/O error, dev sdi, sector 3737390840
Jan 8 13:29:43 FileServer kernel: [1635170.974817] sd 6:0:0:3: [sdi] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.974839] sd 6:0:0:3: [sdi] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 8 13:29:43 FileServer kernel: [1635170.974876] sd 6:0:0:3: [sdi] CDB: Read(10): 28 60 de c4 0f e8 00 00 f0 00
Jan 8 13:29:43 FileServer kernel: [1635170.974907] end_request: I/O error, dev sdi, sector 3737391080
Jan 8 13:29:43 FileServer kernel: [1635170.975091] sd 6:0:0:3: [sdi] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.975108] sd 6:0:0:3: [sdi] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 8 13:29:43 FileServer kernel: [1635170.975127] sd 6:0:0:3: [sdi] CDB: Read(10): 28 60 de c4 10 d8 00 00 80 00
Jan 8 13:29:43 FileServer kernel: [1635170.975159] end_request: I/O error, dev sdi, sector 3737391320
Jan 8 13:29:43 FileServer kernel: [1635170.976385] sd 6:0:0:2: rejecting I/O to offline device
Jan 8 13:29:43 FileServer kernel: [1635170.976659] md/raid:md1: Disk failure on sdh1, disabling device.
Jan 8 13:29:43 FileServer kernel: [1635170.976662] md/raid:md1: Operation continuing on 2 devices.
Jan 8 13:29:43 FileServer kernel: [1635170.977277] md/raid:md1: Disk failure on sdg1, disabling device.
Jan 8 13:29:43 FileServer kernel: [1635170.977280] md/raid:md1: Operation continuing on 1 devices.
Jan 8 13:29:43 FileServer kernel: [1635170.978051] md/raid:md1: Disk failure on sdf1, disabling device.
Jan 8 13:29:43 FileServer kernel: [1635170.978053] md/raid:md1: Operation continuing on 0 devices.
Jan 8 13:29:43 FileServer kernel: [1635170.978809] md: md1: data-check done.
Jan 8 13:29:43 FileServer kernel: [1635170.979427] RAID conf printout:
Jan 8 13:29:43 FileServer kernel: [1635170.979433] --- level:5 rd:4 wd:0
Jan 8 13:29:43 FileServer kernel: [1635170.979437] disk 0, o:0, dev:sdf1
Jan 8 13:29:43 FileServer kernel: [1635170.979441] disk 1, o:0, dev:sdg1
Jan 8 13:29:43 FileServer kernel: [1635170.979445] disk 2, o:0, dev:sdh1
Jan 8 13:29:43 FileServer kernel: [1635170.979449] disk 3, o:0, dev:sdi1
Jan 8 13:29:43 FileServer kernel: [1635170.992052] RAID conf printout:
Jan 8 13:29:43 FileServer kernel: [1635170.992057] --- level:5 rd:4 wd:0
Jan 8 13:29:43 FileServer kernel: [1635170.992062] disk 0, o:0, dev:sdf1
Jan 8 13:29:43 FileServer kernel: [1635170.992065] disk 1, o:0, dev:sdg1
Jan 8 13:29:43 FileServer kernel: [1635170.992068] disk 3, o:0, dev:sdi1
Jan 8 13:29:43 FileServer kernel: [1635170.992076] RAID conf printout:
Jan 8 13:29:43 FileServer kernel: [1635170.992078] --- level:5 rd:4 wd:0
Jan 8 13:29:43 FileServer kernel: [1635170.992081] disk 0, o:0, dev:sdf1
Jan 8 13:29:43 FileServer kernel: [1635170.992083] disk 1, o:0, dev:sdg1
Jan 8 13:29:43 FileServer kernel: [1635170.992086] disk 3, o:0, dev:sdi1
Jan 8 13:29:43 FileServer kernel: [1635171.012042] RAID conf printout:
Jan 8 13:29:43 FileServer kernel: [1635171.012051] --- level:5 rd:4 wd:0
Jan 8 13:29:43 FileServer kernel: [1635171.012057] disk 1, o:0, dev:sdg1
Jan 8 13:29:43 FileServer kernel: [1635171.012061] disk 3, o:0, dev:sdi1
Jan 8 13:29:43 FileServer kernel: [1635171.012069] RAID conf printout:
Jan 8 13:29:43 FileServer kernel: [1635171.012072] --- level:5 rd:4 wd:0
Jan 8 13:29:43 FileServer kernel: [1635171.012074] disk 1, o:0, dev:sdg1
Jan 8 13:29:43 FileServer kernel: [1635171.012077] disk 3, o:0, dev:sdi1
Jan 8 13:29:43 FileServer kernel: [1635171.032038] RAID conf printout:
Jan 8 13:29:43 FileServer kernel: [1635171.032043] --- level:5 rd:4 wd:0
Jan 8 13:29:43 FileServer kernel: [1635171.032047] disk 1, o:0, dev:sdg1
Jan 8 13:29:43 FileServer kernel: [1635171.032054] RAID conf printout:
Jan 8 13:29:43 FileServer kernel: [1635171.032057] --- level:5 rd:4 wd:0
Jan 8 13:29:43 FileServer kernel: [1635171.032060] disk 1, o:0, dev:sdg1
Jan 8 13:29:43 FileServer kernel: [1635171.060015] RAID conf printout:
Jan 8 13:29:43 FileServer kernel: [1635171.060020] --- level:5 rd:4 wd:0
Jan 8 13:29:43 FileServer mdadm[1354]: Fail event detected on md device /dev/md1, component device /dev/sdg1
Jan 8 13:29:43 FileServer mdadm[1354]: Fail event detected on md device /dev/md1, component device /dev/sdh1
Jan 8 13:29:43 FileServer kernel: [1635171.373837] md: unbind<sdf1>
Jan 8 13:29:43 FileServer kernel: [1635171.373942] md: export_rdev(sdf1)
Jan 8 13:29:43 FileServer mdadm[1354]: Fail event detected on md device /dev/md1
Jan 8 13:29:43 FileServer kernel: [1635171.530156] md: unbind<sdg1>
Jan 8 13:29:43 FileServer kernel: [1635171.540089] md: export_rdev(sdg1)
Jan 8 13:29:43 FileServer kernel: [1635171.549711] md: unbind<sdh1>
Jan 8 13:29:43 FileServer kernel: [1635171.564043] md: export_rdev(sdh1)
Jan 8 13:29:43 FileServer kernel: [1635171.567696] md: unbind<sdi1>
Jan 8 13:29:43 FileServer kernel: [1635171.584038] md: export_rdev(sdi1)
Jan 8 13:29:43 FileServer mdadm[1354]: FailSpare event detected on md device /dev/md1, component device /dev/sdi1
Jan 8 13:29:43 FileServer mdadm[1354]: RebuildFinished event detected on md device /dev/md1
Jan 8 13:29:43 FileServer mdadm[1354]: Fail event detected on md device /dev/md1, component device /dev/sdf1
Jan 8 13:29:43 FileServer mdadm[1354]: SpareActive event detected on md device /dev/md1, component device /dev/sdg1
Jan 8 13:29:43 FileServer mdadm[1354]: SpareActive event detected on md device /dev/md1, component device /dev/sdh1
Jan 8 13:29:45 FileServer kernel: [1635173.360025] usb 6-2: new high-speed USB device number 3 using xhci_hcd
Jan 8 13:29:45 FileServer kernel: [1635173.381147] usb 6-2: ep 0x81 - rounding interval to 32768 microframes, ep desc says 0 microframes
Jan 8 13:29:45 FileServer kernel: [1635173.381158] usb 6-2: ep 0x2 - rounding interval to 32768 microframes, ep desc says 0 microframes
Jan 8 13:29:45 FileServer kernel: [1635173.384334] scsi7 : usb-storage 6-2:1.0
Jan 8 13:29:46 FileServer kernel: [1635174.385997] scsi 7:0:0:0: Direct-Access WDC WD20 EARX-00PASB0 PQ: 0 ANSI: 2 CCS
Jan 8 13:29:46 FileServer kernel: [1635174.386284] scsi 7:0:0:1: Direct-Access WDC WD20 EARX-00PASB0 PQ: 0 ANSI: 2 CCS
Jan 8 13:29:46 FileServer kernel: [1635174.386571] scsi 7:0:0:2: Direct-Access WDC WD20 EARX-00PASB0 PQ: 0 ANSI: 2 CCS
Jan 8 13:29:46 FileServer kernel: [1635174.386843] scsi 7:0:0:3: Direct-Access WDC WD20 EARX-00PASB0 PQ: 0 ANSI: 2 CCS
Jan 8 13:29:46 FileServer kernel: [1635174.387945] sd 7:0:0:0: Attached scsi generic sg6 type 0
Jan 8 13:29:46 FileServer kernel: [1635174.390973] sd 7:0:0:1: Attached scsi generic sg7 type 0
Jan 8 13:29:46 FileServer kernel: [1635174.391731] sd 7:0:0:0: [sdf] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Jan 8 13:29:46 FileServer kernel: [1635174.392563] sd 7:0:0:0: [sdf] Write Protect is off
Jan 8 13:29:46 FileServer kernel: [1635174.392571] sd 7:0:0:0: [sdf] Mode Sense: 28 00 00 00
Jan 8 13:29:46 FileServer kernel: [1635174.393512] sd 7:0:0:2: Attached scsi generic sg8 type 0
Jan 8 13:29:46 FileServer kernel: [1635174.394209] sd 7:0:0:0: [sdf] No Caching mode page present
Jan 8 13:29:46 FileServer kernel: [1635174.394530] sd 7:0:0:0: [sdf] Assuming drive cache: write through
Jan 8 13:29:46 FileServer kernel: [1635174.395241] sd 7:0:0:3: Attached scsi generic sg9 type 0
Jan 8 13:29:46 FileServer kernel: [1635174.397299] sd 7:0:0:0: [sdf] No Caching mode page present
Jan 8 13:29:46 FileServer kernel: [1635174.397566] sd 7:0:0:0: [sdf] Assuming drive cache: write through
Jan 8 13:29:46 FileServer kernel: [1635174.410408] sdf: sdf1
Jan 8 13:29:46 FileServer kernel: [1635174.410931] sd 7:0:0:1: [sdg] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Jan 8 13:29:46 FileServer kernel: [1635174.411088] sd 7:0:0:2: [sdh] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Jan 8 13:29:46 FileServer kernel: [1635174.411276] sd 7:0:0:3: [sdi] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Jan 8 13:29:46 FileServer kernel: [1635174.412160] sd 7:0:0:1: [sdg] Write Protect is off
Jan 8 13:29:46 FileServer kernel: [1635174.412169] sd 7:0:0:1: [sdg] Mode Sense: 28 00 00 00
Jan 8 13:29:46 FileServer kernel: [1635174.413011] sd 7:0:0:2: [sdh] Write Protect is off
Jan 8 13:29:46 FileServer kernel: [1635174.413019] sd 7:0:0:2: [sdh] Mode Sense: 28 00 00 00
Jan 8 13:29:46 FileServer kernel: [1635174.413879] sd 7:0:0:1: [sdg] No Caching mode page present
Jan 8 13:29:46 FileServer kernel: [1635174.414192] sd 7:0:0:1: [sdg] Assuming drive cache: write through
Jan 8 13:29:46 FileServer kernel: [1635174.414784] sd 7:0:0:3: [sdi] Write Protect is off
Jan 8 13:29:46 FileServer kernel: [1635174.414792] sd 7:0:0:3: [sdi] Mode Sense: 28 00 00 00
Jan 8 13:29:46 FileServer kernel: [1635174.415662] sd 7:0:0:2: [sdh] No Caching mode page present
Jan 8 13:29:46 FileServer kernel: [1635174.416074] sd 7:0:0:2: [sdh] Assuming drive cache: write through
Jan 8 13:29:46 FileServer kernel: [1635174.417177] sd 7:0:0:3: [sdi] No Caching mode page present
Jan 8 13:29:46 FileServer kernel: [1635174.417528] sd 7:0:0:3: [sdi] Assuming drive cache: write through
Jan 8 13:29:46 FileServer kernel: [1635174.420515] sd 7:0:0:0: [sdf] No Caching mode page present
Jan 8 13:29:46 FileServer kernel: [1635174.420837] sd 7:0:0:0: [sdf] Assuming drive cache: write through
Jan 8 13:29:46 FileServer kernel: [1635174.421144] sd 7:0:0:0: [sdf] Attached SCSI disk
Jan 8 13:29:46 FileServer kernel: [1635174.421416] sd 7:0:0:1: [sdg] No Caching mode page present
Jan 8 13:29:46 FileServer kernel: [1635174.421791] sd 7:0:0:1: [sdg] Assuming drive cache: write through
Jan 8 13:29:46 FileServer kernel: [1635174.437545] sdg: sdg1
Jan 8 13:29:46 FileServer kernel: [1635174.439003] sd 7:0:0:3: [sdi] No Caching mode page present
Jan 8 13:29:46 FileServer kernel: [1635174.439361] sd 7:0:0:3: [sdi] Assuming drive cache: write through
Jan 8 13:29:46 FileServer kernel: [1635174.440001] sd 7:0:0:2: [sdh] No Caching mode page present
Jan 8 13:29:46 FileServer kernel: [1635174.448689] sd 7:0:0:2: [sdh] Assuming drive cache: write through
Jan 8 13:29:46 FileServer kernel: [1635174.467770] sdi: sdi1
Jan 8 13:29:46 FileServer kernel: [1635174.493225] sdh: sdh1
Jan 8 13:29:46 FileServer kernel: [1635174.495137] sd 7:0:0:1: [sdg] No Caching mode page present
Jan 8 13:29:46 FileServer kernel: [1635174.504732] sd 7:0:0:1: [sdg] Assuming drive cache: write through
Jan 8 13:29:46 FileServer kernel: [1635174.513962] sd 7:0:0:1: [sdg] Attached SCSI disk
Jan 8 13:29:46 FileServer kernel: [1635174.523285] sd 7:0:0:3: [sdi] No Caching mode page present
Jan 8 13:29:46 FileServer kernel: [1635174.531683] sd 7:0:0:3: [sdi] Assuming drive cache: write through
Jan 8 13:29:46 FileServer kernel: [1635174.541065] sd 7:0:0:3: [sdi] Attached SCSI disk
Jan 8 13:29:46 FileServer kernel: [1635174.548886] sd 7:0:0:2: [sdh] No Caching mode page present
Jan 8 13:29:46 FileServer kernel: [1635174.558578] sd 7:0:0:2: [sdh] Assuming drive cache: write through
Jan 8 13:29:46 FileServer kernel: [1635174.567333] sd 7:0:0:2: [sdh] Attached SCSI disk
Jan 8 14:00:06 FileServer kernel: [1636994.816122] Buffer I/O error on device md1, logical block 751866032
Jan 8 14:00:06 FileServer kernel: [1636994.824972] lost page write due to I/O error on md1
Jan 8 14:00:06 FileServer kernel: [1636994.824996] JBD2: Detected IO errors while flushing file data on md1-8
Jan 8 14:00:06 FileServer kernel: [1636994.825053] Aborting journal on device md1-8.
Jan 8 14:00:06 FileServer kernel: [1636994.834255] Buffer I/O error on device md1, logical block 732463104
Jan 8 14:00:06 FileServer kernel: [1636994.845362] lost page write due to I/O error on md1
Jan 8 14:00:06 FileServer kernel: [1636994.845422] JBD2: I/O error detected when updating journal superblock for md1-8.
Jan 8 14:00:35 FileServer kernel: [1637023.164090] Buffer I/O error on device md1, logical block 0
Jan 8 14:00:35 FileServer kernel: [1637023.173657] lost page write due to I/O error on md1
Jan 8 14:00:35 FileServer kernel: [1637023.173680] EXT4-fs error (device md1): ext4_journal_start_sb:327: Detected aborted journal
Jan 8 14:00:35 FileServer kernel: [1637023.173692] EXT4-fs (md1): Remounting filesystem read-only
Jan 8 14:00:35 FileServer kernel: [1637023.173697] EXT4-fs (md1): previous I/O error to superblock detected
Jan 8 14:00:35 FileServer kernel: [1637023.174810] Buffer I/O error on device md1, logical block 0
Jan 8 14:00:35 FileServer kernel: [1637023.174833] lost page write due to I/O error on md1
Jan 8 14:00:35 FileServer kernel: [1637023.174916] EXT4-fs (md1): ext4_da_writepages: jbd2_start: 1024 pages, ino 94180140; err -30
Jan 8 20:00:01 FileServer kernel: [1658589.772603] EXT4-fs (md1): previous I/O error to superblock detected
Jan 8 20:00:01 FileServer kernel: [1658589.792110] Buffer I/O error on device md1, logical block 0
Jan 8 20:00:01 FileServer kernel: [1658589.803060] lost page write due to I/O error on md1
Jan 8 20:00:01 FileServer kernel: [1658589.803085] EXT4-fs error (device md1): ext4_find_entry:935: inode #94179991: comm BackupPC_dump: reading directory lblock 0
Jan 8 20:00:01 FileServer kernel: [1658589.951838] EXT4-fs (md1): previous I/O error to superblock detected
Jan 8 20:00:01 FileServer kernel: [1658589.964244] Buffer I/O error on device md1, logical block 0
Jan 8 20:00:01 FileServer kernel: [1658589.976811] lost page write due to I/O error on md1
Jan 8 20:00:02 FileServer kernel: [1658589.976830] EXT4-fs error (device md1): ext4_find_entry:935: inode #94179991: comm BackupPC_dump: reading directory lblock 0
Jan 8 20:00:02 FileServer kernel: [1658590.037737] EXT4-fs (md1): previous I/O error to superblock detected
Jan 8 20:00:02 FileServer kernel: [1658590.050969] Buffer I/O error on device md1, logical block 0
Jan 8 20:00:02 FileServer kernel: [1658590.064255] lost page write due to I/O error on md1
Jan 8 20:00:02 FileServer kernel: [1658590.064278] EXT4-fs error (device md1): ext4_find_entry:935: inode #94179991: comm BackupPC_dump: reading directory lblock 0
Jan 8 21:00:01 FileServer kernel: [1662189.887900] EXT4-fs (md1): previous I/O error to superblock detected
Jan 8 21:00:01 FileServer kernel: [1662189.902258] Buffer I/O error on device md1, logical block 0
Jan 8 21:00:01 FileServer kernel: [1662189.916412] lost page write due to I/O error on md1
Jan 8 21:00:01 FileServer kernel: [1662189.916458] EXT4-fs error (device md1): ext4_find_entry:935: inode #94179991: comm BackupPC_dump: reading directory lblock 0
Jan 8 21:00:02 FileServer kernel: [1662190.000254] EXT4-fs (md1): previous I/O error to superblock detected
Jan 8 21:00:02 FileServer kernel: [1662190.015366] Buffer I/O error on device md1, logical block 0
Jan 8 21:00:02 FileServer kernel: [1662190.030395] lost page write due to I/O error on md1
Jan 8 21:00:02 FileServer kernel: [1662190.030413] EXT4-fs error (device md1): ext4_find_entry:935: inode #94179991: comm BackupPC_dump: reading directory lblock 0
Jan 8 21:00:02 FileServer kernel: [1662190.073627] EXT4-fs (md1): previous I/O error to superblock detected
Jan 8 21:00:02 FileServer kernel: [1662190.089554] Buffer I/O error on device md1, logical block 0
Jan 8 21:00:02 FileServer kernel: [1662190.105590] lost page write due to I/O error on md1
[...]
On 09/01/13 09:01, Mark Rogers wrote:
Two questions: (1) How can I bring the array back up without restarting the server,
I'm not an expert at this!
http://www.tcpdump.com/kb/os/linux/starting-and-stopping-raid-arrays.html says "mdadm --assemble --scan" but I think that'll scan everything, and as you have one that's already working, I think you can do
mdadm --assemble /dev/md1 --scan
If that doesn't work, the above website lists how to find the UUID of the array and restart the array using that. HTH
and (2) any clues as to what the "something" was and how to avoid it?
[...]
$ cat /var/log/syslog.1
[...]
Jan 8 13:29:43 FileServer kernel: [1635170.968066] usb 6-2: USB disconnect, device number 2
Jan 8 13:29:43 FileServer kernel: [1635170.968072] sd 6:0:0:2: Device offlined - not ready after error recovery
Jan 8 13:29:43 FileServer kernel: [1635170.968092] sd 6:0:0:2: [sdh] Unhandled error code
Jan 8 13:29:43 FileServer kernel: [1635170.968097] sd 6:0:0:2: [sdh] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
I'm guessing the "something" that caused the error is the USB disconnect. I think all the rest of the log is just showing how the system is trying to recover. Does anything before give a clue as to why it disconnected? Loose wire, prying fingers? Dunno - grasping at straws!
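To see whether the disconnects recur at a regular interval, one option is to pull just those lines out of the logs. This is only a sketch: the function name usb_disconnects is made up, and the pattern assumes the kernel message format shown in the log above.

```shell
#!/bin/sh
# Sketch: print kernel "USB disconnect" lines from syslog-style files (or stdin).
# usb_disconnects is a hypothetical name; the pattern is based on the log lines above.
usb_disconnects() {
    grep -E 'kernel: .*usb [0-9.-]+: USB disconnect' "$@"
}

# Usage: usb_disconnects /var/log/syslog /var/log/syslog.1
```

Comparing the timestamps of successive hits against the uptime might show whether the roughly 19 to 20 day gap is a pattern or a coincidence.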
HTH Steve
On 9 January 2013 10:11, steve-ALUG@hst.me.uk wrote:
I'm not an expert at this!
More than I am though!
http://www.tcpdump.com/kb/os/linux/starting-and-stopping-raid-arrays.html
says "mdadm --assemble --scan" but I think that'll scan everything, and as you have one that's already working, I think you can do
mdadm --assemble /dev/md1 --scan
If that doesn't work, the above website lists how to find the UUID of the array and restart the array using that. HTH
It didn't quite work, as the array was in use, and also --query didn't work as expected:

$ sudo mdadm --query /dev/sdf1
/dev/sdf1: is not an md array
/dev/sdf1: device 0 in 4 device unknown raid5 array. Use mdadm --examine for more detail.

However, this got me the UUID:

$ sudo mdadm --examine /dev/sdf1
/dev/sdf1:
[...]
Array UUID : e6d6be87:9fb1d819:d44509f4:46512dd5

I added that to /etc/mdadm/mdadm.conf:

ARRAY /dev/md1 UUID=e6d6be87:9fb1d819:d44509f4:46512dd5
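As a rough sketch of how that ARRAY line could be generated rather than typed by hand: the helper below (array_line is a made-up name) reads `mdadm --examine` output on stdin and prints a line in the same form as the one above. It assumes the "Array UUID :" label shown in the output above.

```shell
#!/bin/sh
# Sketch: turn `mdadm --examine` output into an mdadm.conf ARRAY line.
# array_line is a hypothetical helper; $1 is the md device name to use.
array_line() {
    awk -v dev="$1" '
        /Array UUID :/ { print "ARRAY " dev " UUID=" $NF; exit }   # UUID is the last field
    '
}

# Usage: sudo mdadm --examine /dev/sdf1 | array_line /dev/md1
```

mdadm can also print ARRAY lines itself with `mdadm --examine --scan`, although its output carries more fields than the minimal line used here.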
Then, as I still couldn't re-assemble the array (because it was in use), I used the following steps to get back online:

$ sudo umount /dev/md1                   # Unmount the array
$ sudo mdadm -S /dev/md1                 # Stop the array
$ sudo mdadm --assemble /dev/md1 --scan  # Reassemble the array
$ sudo mount /dev/md1                    # Re-mount the array (based on fstab config)
I assume those four steps will suffice in future now that the array UUID is in mdadm.conf, but it looks like I have to wait another 19 days to find out...
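The four steps could be wrapped in one function so they are at hand next time. This is a sketch only: reassemble_array and the RUN switch are made up for illustration, and by default it just prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the four recovery steps as one function.
# RUN is a hypothetical switch: left unset it defaults to "echo" (print only);
# set RUN="" to really execute the commands (needs root).
RUN=${RUN-echo}

reassemble_array() {
    md=$1                                   # e.g. /dev/md1
    $RUN umount "$md" &&                    # unmount the filesystem on the array
    $RUN mdadm --stop "$md" &&              # stop the array (long form of -S)
    $RUN mdadm --assemble "$md" --scan &&   # reassemble using mdadm.conf
    $RUN mount "$md"                        # re-mount (based on fstab config)
}

# Usage (dry run):  reassemble_array /dev/md1
# Usage (for real): sudo env RUN= sh this_script.sh   (script name is hypothetical,
#                   and the script would need to call reassemble_array itself)
```

Each step only runs if the previous one succeeded, so a failed umount (for example because something still has files open) stops the sequence before the array is touched.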
I'm guessing the "something" that caused the error is the USB disconnect. I think all the rest of the log is just showing how the system is trying to recover. Does anything before give a clue as to why it disconnected? Loose wire, prying fingers? Dunno - grasping at straws!
No loose fingers, cables seem OK, so no clues. I wonder if the cheap ("eBay special") USB3 card may be part of it.
How vulnerable is the array to corruption from this process? I'm not too concerned about occasionally having to go through these steps, but I would worry a lot more about losing the data on the array.
As far as I know write caching is disabled (having it enabled would be a pretty bad thing under the circumstances; how do I check?), but otherwise are mdadm/RAID5 and ext4 pretty safe?
On 09/01/13 10:58, Mark Rogers wrote:
How vulnerable is the array to corruption from this process? I'm not too concerned about occasionally having to go through these steps, but I would worry a lot more about losing the data on the array.
I spotted from the logs you're using ext4. Providing your kernel version is 2.6.30 or above it should be as robust as ext3.
(See http://en.wikipedia.org/wiki/Ext4#Delayed_allocation_and_potential_data_loss for why the kernel version matters, if interested)
I think there are more fault-tolerant file systems, but I don't know if it's worth swapping to one. ext3 and ext4 are both journalled file systems so should be reasonably fault-tolerant.
As far as I know write caching is disabled (having it enabled would be a pretty bad thing under the circumstances; how do I check?), but otherwise are mdadm/RAID5 and ext4 pretty safe?
Two sorts of write caching: Hardware and Software.
I doubt that you could turn Hardware Write Caching off, and from the sounds of it, the disk remains powered even after the whoopsie, so any pending disk writes should get through once sent from disk controller.
Re: Software write caching.
sudo hdparm -i /dev/sdf
will show you if it's on or off. You're looking for "WriteCache=". Repeat for the other drives too!
I got the above from: http://www.linuxquestions.org/questions/debian-26/how-can-i-permanently-turn...
and it also tells you how to turn it off too. HTH
On 9 January 2013 14:14, steve-ALUG@hst.me.uk wrote:
I spotted from the logs you're using ext4. Providing your kernel version is 2.6.30 or above it should be as robust as ext3.
As I understand it the biggest issue is RAID5 itself; a failure to flush cached writes physically to disk will do horrible things to the RAID array.
I doubt that you could turn Hardware Write Caching off, and from the sounds of it, the disk remains powered even after the whoopsie, so any pending disk writes should get through once sent from disk controller.
It's a good point that writes held in the hardware cache should still reach the disk, given that the drive retains power. (The caddy is on a UPS as well, which will help in other circumstances.)
Re: Software write caching.
sudo hdparm -i /dev/sdf
$ sudo hdparm -i /dev/sdf
/dev/sdf:
 HDIO_GET_IDENTITY failed: Invalid argument
I assume that's due to it being on USB?
However:
$ sudo hdparm -I /dev/sdf
Commands/features:
    Enabled Supported:
       *    SMART feature set
            Security Mode feature set
       *    Power Management feature set
            Write cache
.. if that helps (I assume that's just confirming that hardware write caching is disabled).
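A rough way to read that listing mechanically, assuming hdparm -I marks enabled features with a "*" in the Enabled column (which is how the output above appears to be laid out) and prints one feature per line. The function name write_cache_state is hypothetical.

```shell
# Hypothetical helper: read `hdparm -I` output on stdin and report whether the
# "Write cache" feature line carries the "*" enabled marker.
# Assumption: one feature per line, with "*" as the first field when enabled.
write_cache_state() {
    awk '
        /Write cache/ {
            found = 1
            if ($1 == "*") print "enabled"; else print "disabled"
            exit
        }
        # No "Write cache" line at all: the drive or bridge did not report it.
        END { if (!found) print "unknown" }
    '
}
```

For example: sudo hdparm -I /dev/sdf | write_cache_state. With the listing above, where "Write cache" has no "*", this would report "disabled".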
On 09/01/13 14:59, Mark Rogers wrote:
$ sudo hdparm -i /dev/sdf
/dev/sdf:
 HDIO_GET_IDENTITY failed: Invalid argument
I assume that's due to it being on USB?
I guess
However:
$ sudo hdparm -I /dev/sdf
Commands/features:
    Enabled Supported:
       *    SMART feature set
            Security Mode feature set
       *    Power Management feature set
            Write cache
.. if that helps (I assume that's just confirming that hardware write caching is disabled).
I'm not sure. I suspect it's software, but I can't find out by googling.
Is it worth trying sudo hdparm -W /dev/sdf
This also says if write caching is on or off and you can use sudo hdparm -W 1 /dev/sdf to turn on write caching or sudo hdparm -W 0 /dev/sdf to turn it off
if hdparm -i doesn't work though, these might not work. I've googled for ages and I can't find any other way of telling if write caching is on or off - sorry!
HTH Steve
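To repeat the hdparm -W check across all the member drives, something along these lines might do. It is a sketch under assumptions: the function name and the HDPARM override are invented for illustration, and the parsing assumes hdparm -W prints a line like " write-caching =  0 (off)". As noted above, it may not work at all through a USB bridge, in which case it reports "unknown".

```shell
# Sketch: print the write-cache setting for each drive given as an argument.
# HDPARM is an illustrative override so the loop can be tried without root.
check_write_cache() {
    for dev in "$@"; do
        printf '%s: ' "$dev"
        "${HDPARM:-hdparm}" -W "$dev" 2>/dev/null | awk '
            /write-caching/ {
                found = 1
                # Last field is assumed to be "(on)" or "(off)".
                if ($NF == "(on)") print "on"; else print "off"
            }
            # hdparm failed or printed nothing recognisable.
            END { if (!found) print "unknown" }
        '
    done
}
```

Usage would be along the lines of: sudo sh -c '. ./check.sh; check_write_cache /dev/sdf /dev/sdg /dev/sdh /dev/sdi', where check.sh is whatever file the function is saved in.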
On 09/01/13 17:50, steve-ALUG@hst.me.uk wrote:
if hdparm -i doesn't work though, these might not work. I've googled for ages and I can't find any other way of telling if write caching is on or off - sorry!
A thought. I don't know if it will work, but there's a Gnome utility called "Disk" in my version of Ubuntu's menu (don't know the package name) that may give you info on the disk if you select it in the list of disks down the left hand side. Alternatively, if you have a system info program installed, or install one, that may tell you if write caching is on.
HTH Steve
On 9 January 2013 17:50, steve-ALUG@hst.me.uk wrote:
Is it worth trying sudo hdparm -W /dev/sdf
I had the same thought, and it gives the same result (that write caching is off).
if hdparm -i doesn't work though, these might not work. I've googled for ages and I can't find any other way of telling if write caching is on or off - sorry!
Thanks for trying! I also did a fair bit of Googling yesterday and came away far from sure I understand the difference between software and hardware write caching!
A thought. I don't know if it will work, but there's a Gnome utility called "Disk" in my version of Ubuntu's menu (don't know the package name) that may give you info on the disk if you select it in the list of disks down the left hand side.
Alternatively, if you have a system info program installed, or install one, that may tell you if write caching is on.
It's a server install so no Gnome tools available. However I'm pretty sure they're going to just be GUI wrappers for hdparm.
I'm pretty sure that the defaults are for software write caching to be off, so I don't think this is something I need to worry about.