I have a small server, a ThinkServer TS140, for serving files via SFTP to my colleagues, as well as a few other services. I run Ubuntu 14.04 from a 500 GB drive backed up nightly offsite, and have a separate RAID10 array of 4x 4 TB WD Reds for the storage.
It had all been running fine for over a year until yesterday, when I received an email saying the array was degraded.
" A DegradedArray event had been detected on md device /dev/md/0. P.S. The /proc/mdstat file currently contains the following: Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] md0 : active raid10 sdb1[0] sde1[3] sdc1[4] 7813772288 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U] unused devices: <none>"
The first thing I did was back up the most important files to an external drive via rsync. (I have run out of space for a full 8 TB external backup and can't afford to buy any more drives just now; we also keep these more important files offsite.)
mdadm -D /dev/md0 shows that my third drive, sdd1, was removed. I checked it with the SMART tools to see if the drive was failing and saw nothing, even after running a self-test. I didn't check for disk errors on the drive itself as I'm not sure how that works in a RAID setup. I did try re-adding the disk to the array and it immediately picked it up and started recovering.
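For what it's worth, the checking and re-adding was roughly along these lines (device names from my box, so adjust to suit; this is a sketch rather than an exact transcript):

  sudo smartctl -H /dev/sdd           # overall SMART health verdict
  sudo smartctl -t long /dev/sdd      # start a long self-test...
  sudo smartctl -a /dev/sdd           # ...and read the results and attributes later

  sudo mdadm --manage /dev/md0 --re-add /dev/sdd1   # put the dropped member back (--add if --re-add refuses)
  cat /proc/mdstat                                  # watch the recovery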
This is what mdadm --examine showed for the RAID status *before* I tried re-adding the removed drive: http://pastebin.com/P4Q4YhN7
I can't see anything wrong here; the only difference I noticed was in the sdd section, which shows Array State AAAA instead of the AA.A the others show (. = missing).
This morning I woke up to another email with the same message, and the disk sdd1 had been removed again. I am adding the drive again to see what happens, but I suspect the same result.
Here's the mdadm --detail now: http://pastebin.com/CkBm6SbH
dmesg since a recent reboot, with a lot of the repeats taken out: http://pastebin.com/jzWnYizV
1) Should I run any more tests on the drive itself? Any specific suggestions? Do you think it could be a hardware fault unrelated to the actual drive? I'm nervous of starting to swap things around to test this.
2) The drives are in warranty so I have contacted WD to ask for a replacement, although if the SMART tests don't show anything I don't know if they'll replace the drive.
3) Should I tell my colleagues we're going to have to switch the machine off until the new drive arrives? (I am paranoid of a second drive failing before I get a new one.)
Any advice on how to test the drive/setup is appreciated. Please be aware I'm no Linux guru; my field is something completely different, and I just happen to love learning about Linux and have found it very useful to complement my office setup and work.
Many thanks
John
On 10/06/16 12:36, John Cohen wrote:
I have a small server, a ThinkServer TS140, for serving files via SFTP to my colleagues, as well as a few other services. I run Ubuntu 14.04 from a 500 GB drive backed up nightly offsite, and have a separate RAID10 array of 4x 4 TB WD Reds for the storage.
It had all been running fine for over a year until yesterday, when I received an email saying the array was degraded.
"A DegradedArray event had been detected on md device /dev/md/0.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid10 sdb1[0] sde1[3] sdc1[4]
      7813772288 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]

unused devices: <none>"
The first thing I did was back up the most important files to an external drive via rsync. (I have run out of space for a full 8 TB external backup and can't afford to buy any more drives just now; we also keep these more important files offsite.)
Good!
mdadm -D /dev/md0 shows that my third drive, sdd1, was removed. I checked it with the SMART tools to see if the drive was failing and saw nothing, even after running a self-test. I didn't check for disk errors on the drive itself as I'm not sure how that works in a RAID setup. I did try re-adding the disk to the array and it immediately picked it up and started recovering.
This is what mdadm --examine showed for the RAID status *before* I tried re-adding the removed drive: http://pastebin.com/P4Q4YhN7
I can't see anything wrong here; the only difference I noticed was in the sdd section, which shows Array State AAAA instead of the AA.A the others show (. = missing).
This morning I woke up to another email with the same message, and the disk sdd1 had been removed again. I am adding the drive again to see what happens, but I suspect the same result.
Here's the mdadm --detail now: http://pastebin.com/CkBm6SbH
dmesg since a recent reboot, with a lot of the repeats taken out: http://pastebin.com/jzWnYizV
- Should I run any more tests on the drive itself? Any specific
suggestions? Do you think it could be a hardware fault unrelated to the actual drive? I'm nervous of starting to swap things around to test this.
If/when the drive unmounts with an error, run fsck on it. Was it drive sdd1? If so: sudo fsck /dev/sdd1
and see if it passes or fails. If it was me, I might be tempted to reformat that particular drive. I'd probably use a disk tool to do a "slow" format, one that writes zeros to every sector, which means any bad sectors should be detected and marked as bad. Actually, I'd be very tempted to run a disk test on that disk. I'm sure there are many disk tests available; I think there are some on the System Rescue CD live CD.
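If you'd rather do it from the command line than a boot CD, something like this would do the write-every-sector job (assuming the suspect drive really is /dev/sdd and it has been removed from the array first; both commands destroy everything on it):

  sudo badblocks -wsv /dev/sdd              # destructive write-mode surface test (-w), with progress (-s) and verbose output (-v)
  sudo dd if=/dev/zero of=/dev/sdd bs=1M    # or a plain zero-fill; the drive should remap any bad sectors it finds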
- The drives are in warranty so I have contacted WD to ask for a
replacement, although if the SMART tests don't show anything I don't know if they'll replace the drive
Unless or until you can show some errors on the drive (other than it being dropped from the RAID array), I don't know if you'll have any joy.
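It might strengthen the claim if you attach the drive's SMART error log and attribute table to the RMA request, roughly (device name assumed):

  sudo smartctl -l error /dev/sdd     # ATA error log, if anything has been recorded
  sudo smartctl -A /dev/sdd           # attributes: reallocated/pending sector counts etc.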
- Should I tell my colleagues we're going to have to switch the
machine off until the new drive arrives? (I am paranoid of a second drive failing before I get a new one)
It depends on how valuable the data is. If it's vital data, then yes, wait until you can get a new disk. If it's that vital, perhaps you need to add another disk to the RAID array (five in total) so that it is a bit more robust?
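A fifth disk would sit there as a hot spare; as far as I know md pulls a spare in automatically and rebuilds onto it when a member drops out. Roughly, after partitioning it like the others (the /dev/sdf1 name is only an example):

  sudo mdadm --manage /dev/md0 --add /dev/sdf1   # goes in as a spare while all four members are healthy
  sudo mdadm --detail /dev/md0                   # should now list the new device as a spare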
Any advice on how to test the drive/setup is appreciated. Please be aware I'm no Linux guru; my field is something completely different, and I just happen to love learning about Linux and have found it very useful to complement my office setup and work.
http://www.ultimatebootcd.com/ https://www.system-rescue-cd.org/SystemRescueCd_Homepage
Sorry I can't give you a blow-by-blow; Ultimate Boot CD, for example, has many hard disk tools, including some Western Digital-specific ones, and I've never run them.
How to run them safely? Disconnect the good drives, leaving only the bad one connected. Be 100% certain it's the bad one, and only the bad one, that is connected. Boot from the live CD/DVD and follow the menu options.
Actually, I'd suggest downloading Ultimate Boot CD and burning it to a disc, then trying the WD utilities on the bad disk. Be aware that these may be destructive tests - i.e. they may delete the data on the drive. Once that's happened, you'd have to recreate the partition and re-add the drive to the array if it passes.
If in doubt, don't do it!
Good luck.
Steve
On Fri, Jun 10, 2016 at 12:36:48PM +0100, John Cohen wrote:
dmesg since a recent reboot, with a lot of the repeats taken out: http://pastebin.com/jzWnYizV
Any advice on how to test the drive/setup is appreciated. Please be aware I'm no Linux guru; my field is something completely different, and I just happen to love learning about Linux and have found it very useful to complement my office setup and work.
I had a bit of a look at dmesg. I didn't google all the errors, but to me it looks like you possibly have a bad cable or SATA port. You could try swapping the SATA cables and ports (although don't do it while the RAID isn't rebuilt!) and see if the error follows the drive, the cable or the SATA port.
There is also some suggestion that it might be the mode the hardware is running in, or some kind of power management problem.
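Two quick things you can look at without opening the case (device name assumed): the drive's CRC error counter tends to climb when a cable or connector is at fault, and the kernel's SATA link power management setting is easy to read off:

  sudo smartctl -A /dev/sdd | grep -i crc                        # UDMA CRC errors usually point at cabling, not the platters
  cat /sys/class/scsi_host/host*/link_power_management_policy    # aggressive settings like min_power can cause odd link resets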
Adam
If/when the drive unmounts with an error, run fsck on it. Was it drive sdd1? If so: sudo fsck /dev/sdd1
Thanks, I didn't know whether having a RAID setup affected the way I could check for errors on the individual disks. I'll definitely give it a go then.
I had a bit of a look at dmesg. I didn't google all the errors, but to me it looks like you possibly have a bad cable or SATA port. You could try swapping the SATA cables and ports (although don't do it while the RAID isn't rebuilt!) and see if the error follows the drive, the cable or the SATA port.
There is also some suggestion that it might be the mode the hardware is running in, or some kind of power management problem.
WD got back to me saying to just send the drive off without any further questions. Before I do that I'll check the cables/SATA ports as suggested. It would be very annoying if I got a new drive and the problem persisted! I'm away as of tomorrow but will report back when I return. Cheers
I had a bit of a look at dmesg. I didn't google all the errors, but to me it looks like you possibly have a bad cable or SATA port. You could try swapping the SATA cables and ports (although don't do it while the RAID isn't rebuilt!) and see if the error follows the drive, the cable or the SATA port.
There is also some suggestion that it might be the mode the hardware is running in, or some kind of power management problem.
I looked a bit more into dmesg and the SATA cables as you suggested. I had originally just checked the connectors, but didn't think too much of the kink in the cable itself. It turns out straightening the cable did the trick. I've ordered some new cables and will replace the dodgy one, but the RAID has now been in active sync for the last 4 days!
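For peace of mind I'm also planning to run an occasional consistency check on the array; as far as I can tell that's just a matter of poking md's sync_action interface (Ubuntu's mdadm package apparently ships a checkarray cron job that does much the same thing monthly):

  echo check | sudo tee /sys/block/md0/md/sync_action   # start a background check of the whole array
  cat /proc/mdstat                                      # shows the check progress
  cat /sys/block/md0/md/mismatch_cnt                    # ideally 0 once it finishes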
Cheers
John