On 10/06/16 12:36, John Cohen wrote:
I have a small server Thinkserver ts140 for serving files via sftp to my colleagues, as well as a few other services. I run ubuntu 14.04 from a 500gig drive backed up nightly offsite, and have a separate raid10 array 4x 4TB (WD Reds) for the storage.
They've been running fine for over a year until yesterday, when I received an email saying the array was degraded.
" A DegradedArray event had been detected on md device /dev/md/0.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid10 sdb1[0] sde1[3] sdc1[4]
      7813772288 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]
unused devices: <none>"
First thing I did was back up the most important files to an external drive via rsync. (I have run out of space for a full 8TB external drive backup and can't afford to buy any more drives just now. We also keep these more important files offsite.)
Good!
mdadm -D /dev/md0 shows that my third drive sdd1 was removed. I checked the smart tools to see if the drive was failing and saw nothing, even after running a self-test. I didn't check for disk errors on the drive itself as I'm not sure how this works in a raid setup. I did try re-adding the disk to the raid and it immediately picked it up and started recovering.
This is what mdadm --examine showed for the raid status *before* I tried re-adding the removed drive: http://pastebin.com/P4Q4YhN7
I can't see anything wrong here, the only difference I saw was in the sdd section showing Array State AAAA instead of AA.A as the others were (. = missing)
This morning I woke up to another email with the same message, and the disk sdd1 was removed again. I am adding the drive again to see what happens, but suspect the same results.
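For anyone who would rather check the array state directly than wait for the next mdadm email, here is a minimal Python sketch that flags degraded arrays from /proc/mdstat-style text. It assumes the usual mdstat layout, where an underscore in the status field (as in [UU_U]) marks a missing member. The function name degraded_arrays is made up for illustration, and the sample text is the status quoted in the email above.

```python
import re

# Sample taken from the DegradedArray email quoted in this thread.
SAMPLE = """Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid10 sdb1[0] sde1[3] sdc1[4]
      7813772288 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: status} for every array with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        # A line such as "md0 : active raid10 ..." starts a new array entry.
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            current = header.group(1)
        # The member status looks like [UUUU]; "_" means a member is missing.
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded[current] = status.group(1)
    return degraded


print(degraded_arrays(SAMPLE))  # {'md0': 'UU_U'}
```

On the live machine the same function could be fed the contents of /proc/mdstat, for example from a cron job, instead of the sample string.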
Here's the mdadm --detail now: http://pastebin.com/CkBm6SbH
dmesg since a recent reboot, with a lot of the repeats taken out: http://pastebin.com/jzWnYizV
- Should I run any more tests on the drive itself? Any specific suggestions? Do you think it could be a hardware fault unrelated to the actual drive? I'm nervous of starting to swap things around to test this.
If/when the drive unmounts with an error, run fsck on it. Was it drive sdd1? If so, run
sudo fsck /dev/sdd1
and see if it passes or fails. If it was me, I might be tempted to reformat that particular drive. I'd probably use a disk tool to do that so that I could do a "slow" format, one that writes to every sector, zeroing them. Because each sector is written to, any bad sectors may be detected and marked as bad. Actually, I'd be very tempted to run a disk test on that disk. I'm sure there are many disk tests available; I think there are some on the SystemRescueCd live CD.
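As a gentler first step than a zeroing format, a read-only pass over the disk can at least show whether any region fails to read. Below is a minimal Python sketch of that idea, not a replacement for badblocks or the vendor tools. The function name read_scan and the 1 MiB block size are my own choices for illustration. It would need root to open a raw device, and reads may be served from the kernel's cache rather than the platters, so treat a clean result as a hint only.

```python
import os


def read_scan(path, block_size=1024 * 1024):
    """Read `path` from start to end without writing anything.

    Returns (bytes_read, bad_offsets), where bad_offsets lists the byte
    offsets of blocks that raised an I/O error.
    """
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)  # read-only: never modifies the disk
    try:
        # Seeking to the end gives the size of a regular file or, on Linux,
        # of a block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                chunk = os.pread(fd, want, offset)
            except OSError:
                # Record the failing block and move on to the next one.
                bad_offsets.append(offset)
                offset += want
                continue
            if not chunk:
                break  # unexpected end of device
            bytes_read += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

Usage would be something like read_scan("/dev/sdX"), with sdX standing in for whichever device you have confirmed is the suspect one. Any entries in the returned list would be worth quoting to WD.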
- The drives are in warranty so I have contacted WD to ask for a replacement, although if the SMART tests don't show anything I don't know if they'll replace the drive
Unless or until you can show some errors on the drive (other than it being removed from the raid array), I don't know if you'll have any joy.
- Should I tell my colleagues we're going to have to switch the machine off until the new drive arrives? (I am paranoid of a second drive failing before I get a new one)
It depends on how valuable the data is. If it's vital data, then yes, wait until you can get a new disk. If it's that vital, perhaps you need to add another disk to the raid array (5 in total) so that it is a bit more robust?
Any advice as to what I could do to test the drive/setup is appreciated. Please be aware I'm no linux guru, my field is in something completely different and I just happen to love learning about linux and have found it very useful to complement my office setup and work.
http://www.ultimatebootcd.com/
https://www.system-rescue-cd.org/SystemRescueCd_Homepage
Sorry I can't give you a blow-by-blow, as Ultimate Boot CD for example has many hard disk tools, some of them Western Digital specific. I've never run them.
How to run safely? Disconnect the good drives, leaving only the bad one connected. Be 100% certain it's the bad one, and only the bad one connected. Boot from the live cd/dvd, follow the menu options.
Actually, I'd suggest downloading Ultimate Boot CD and burning it to a disc, then trying the WD utilities on the bad disk. Be aware that these may be destructive tests, i.e. they may delete the data on the drive. Once that's happened, you'd have to recreate and re-add the drive to the array if it passes.
If in doubt, don't do it!
Good luck.
Steve