I have a small server, a ThinkServer TS140, for serving files via SFTP to my colleagues, as well as a few other services. I run Ubuntu 14.04 from a 500 GB drive backed up nightly offsite, and have a separate RAID10 array of 4x 4 TB WD Reds for the storage.
It had all been running fine for over a year until yesterday, when I received an email saying the array was degraded.
" A DegradedArray event had been detected on md device /dev/md/0. P.S. The /proc/mdstat file currently contains the following: Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] md0 : active raid10 sdb1[0] sde1[3] sdc1[4] 7813772288 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U] unused devices: <none>"
The first thing I did was back up the most important files to an external drive via rsync. (I have run out of space for a full 8 TB external backup and can't afford to buy any more drives just now; we also keep these more important files offsite.)
mdadm -D /dev/md0 shows that my third drive, sdd1, was removed. I checked it with the SMART tools to see if the drive was failing and saw nothing, even after running a self-test. I didn't check for disk errors on the drive itself as I'm not sure how that works in a RAID setup. I did try re-adding the disk to the array and it immediately picked it up and started recovering.
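For what it's worth, the checking and re-adding was roughly along these lines (device names from my box, so adjust to suit; this is a sketch rather than an exact transcript):

  sudo smartctl -H /dev/sdd           # overall SMART health verdict
  sudo smartctl -t long /dev/sdd      # start a long self-test...
  sudo smartctl -a /dev/sdd           # ...and read the results and attributes later

  sudo mdadm --manage /dev/md0 --re-add /dev/sdd1   # put the dropped member back (--add if --re-add refuses)
  cat /proc/mdstat                                  # watch the recovery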
This is what mdadm --examine showed for the RAID status *before* I tried re-adding the removed drive: http://pastebin.com/P4Q4YhN7
I can't see anything wrong here; the only difference I noticed was in the sdd section, which shows Array State AAAA instead of the AA.A the others show (. = missing).
This morning I woke up to another email with the same message, and the disk sdd1 had been removed again. I am adding the drive again to see what happens, but I suspect the same result.
Here's the mdadm --detail now: http://pastebin.com/CkBm6SbH
dmesg since a recent reboot, with a lot of the repeats taken out: http://pastebin.com/jzWnYizV
1) Should I run any more tests on the drive itself? Any specific suggestions? Do you think it could be a hardware fault unrelated to the actual drive? I'm nervous of starting to swap things around to test this.
2) The drives are in warranty so I have contacted WD to ask for a replacement, although if the SMART tests don't show anything I don't know if they'll replace the drive.
3) Should I tell my colleagues we're going to have to switch the machine off until the new drive arrives? (I am paranoid of a second drive failing before I get a new one.)
Any advice on how to test the drive/setup is appreciated. Please be aware I'm no Linux guru; my field is something completely different, and I just happen to love learning about Linux and have found it very useful to complement my office setup and work.
Many thanks
John
On 10/06/16 12:36, John Cohen wrote:
I have a small server, a ThinkServer TS140, for serving files via SFTP to my colleagues, as well as a few other services. I run Ubuntu 14.04 from a 500 GB drive backed up nightly offsite, and have a separate RAID10 array of 4x 4 TB WD Reds for the storage.
It had all been running fine for over a year until yesterday, when I received an email saying the array was degraded.
"A DegradedArray event had been detected on md device /dev/md/0.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid10 sdb1[0] sde1[3] sdc1[4]
      7813772288 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]

unused devices: <none>"
The first thing I did was back up the most important files to an external drive via rsync. (I have run out of space for a full 8 TB external backup and can't afford to buy any more drives just now; we also keep these more important files offsite.)
Good!
mdadm -D /dev/md0 shows that my third drive, sdd1, was removed. I checked it with the SMART tools to see if the drive was failing and saw nothing, even after running a self-test. I didn't check for disk errors on the drive itself as I'm not sure how that works in a RAID setup. I did try re-adding the disk to the array and it immediately picked it up and started recovering.
This is what mdadm --examine showed for the RAID status *before* I tried re-adding the removed drive: http://pastebin.com/P4Q4YhN7
I can't see anything wrong here; the only difference I noticed was in the sdd section, which shows Array State AAAA instead of the AA.A the others show (. = missing).
This morning I woke up to another email with the same message, and the disk sdd1 had been removed again. I am adding the drive again to see what happens, but I suspect the same result.
Here's the mdadm --detail now: http://pastebin.com/CkBm6SbH
dmesg since a recent reboot, with a lot of the repeats taken out: http://pastebin.com/jzWnYizV
- Should I run any more tests on the drive itself? Any specific
suggestions? Do you think it could be a hardware fault unrelated to the actual drive? I'm nervous of starting to swap things around to test this.
If/when the drive unmounts with an error, run fsck on it. Was it drive sdd1? If so: sudo fsck /dev/sdd1
and see if it passes or fails. If it was me, I might be tempted to reformat that particular drive. I'd probably use a disk tool to do a "slow" format, one that writes zeros to every sector, which means any bad sectors should be detected and marked as bad. Actually, I'd be very tempted to run a disk test on that disk. I'm sure there are many disk tests available; I think there are some on the System Rescue CD live CD.
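If you'd rather do it from the command line than a boot CD, something like this would do the write-every-sector job (assuming the suspect drive really is /dev/sdd and it has been removed from the array first; both commands destroy everything on it):

  sudo badblocks -wsv /dev/sdd              # destructive write-mode surface test (-w), with progress (-s) and verbose output (-v)
  sudo dd if=/dev/zero of=/dev/sdd bs=1M    # or a plain zero-fill; the drive should remap any bad sectors it finds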
- The drives are in warranty so I have contacted WD to ask for a
replacement, although if the SMART tests don't show anything I don't know if they'll replace the drive
Unless or until you can show some errors on the drive (other than it being dropped from the RAID array), I don't know if you'll have any joy.
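It might strengthen the claim if you attach the drive's SMART error log and attribute table to the RMA request, roughly (device name assumed):

  sudo smartctl -l error /dev/sdd     # ATA error log, if anything has been recorded
  sudo smartctl -A /dev/sdd           # attributes: reallocated/pending sector counts etc.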
- Should I tell my colleagues we're going to have to switch the
machine off until the new drive arrives? (I am paranoid of a second drive failing before I get a new one)
It depends on how valuable the data is. If it's vital data, then yes, wait until you can get a new disk. If it's that vital, perhaps you need to add another disk to the RAID array (five in total) so that it is a bit more robust?
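A fifth disk would sit there as a hot spare; as far as I know md pulls a spare in automatically and rebuilds onto it when a member drops out. Roughly, after partitioning it like the others (the /dev/sdf1 name is only an example):

  sudo mdadm --manage /dev/md0 --add /dev/sdf1   # goes in as a spare while all four members are healthy
  sudo mdadm --detail /dev/md0                   # should now list the new device as a spare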
Any advice on how to test the drive/setup is appreciated. Please be aware I'm no Linux guru; my field is something completely different, and I just happen to love learning about Linux and have found it very useful to complement my office setup and work.
http://www.ultimatebootcd.com/ https://www.system-rescue-cd.org/SystemRescueCd_Homepage
Sorry I can't give you a blow-by-blow; Ultimate Boot CD, for example, has many hard disk tools, including some Western Digital-specific ones, and I've never run them.
How to run them safely? Disconnect the good drives, leaving only the bad one connected. Be 100% certain it's the bad one, and only the bad one, that is connected. Boot from the live CD/DVD and follow the menu options.
Actually, I'd suggest downloading Ultimate Boot CD and burning it to a disc, then trying the WD utilities on the bad disk. Be aware that these may be destructive tests - i.e. they may delete the data on the drive. Once that's happened, you'd have to recreate the partition and re-add the drive to the array if it passes.
If in doubt, don't do it!
Good luck.
Steve
On Fri, Jun 10, 2016 at 12:36:48PM +0100, John Cohen wrote:
dmesg since a recent reboot, with a lot of the repeats taken out: http://pastebin.com/jzWnYizV
Any advice on how to test the drive/setup is appreciated. Please be aware I'm no Linux guru; my field is something completely different, and I just happen to love learning about Linux and have found it very useful to complement my office setup and work.
I had a bit of a look at dmesg. I didn't google all the errors, but to me it looks like you possibly have a bad cable or SATA port. You could try swapping the SATA cables and ports (although don't do it while the RAID isn't rebuilt!) and see if the error follows the drive, the cable or the SATA port.
There is also some suggestion that it might be the mode the hardware is running in, or some kind of power management problem.
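Two quick things you can look at without opening the case (device name assumed): the drive's CRC error counter tends to climb when a cable or connector is at fault, and the kernel's SATA link power management setting is easy to read off:

  sudo smartctl -A /dev/sdd | grep -i crc                        # UDMA CRC errors usually point at cabling, not the platters
  cat /sys/class/scsi_host/host*/link_power_management_policy    # aggressive settings like min_power can cause odd link resets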
Adam
If/when the drive unmounts with an error, run fsck on it. Was it drive sdd1? If so: sudo fsck /dev/sdd1
Thanks, I didn't know whether having a RAID setup affected the way I could check for errors on the individual disks. I'll definitely give it a go then.
I had a bit of a look at dmesg. I didn't google all the errors, but to me it looks like you possibly have a bad cable or SATA port. You could try swapping the SATA cables and ports (although don't do it while the RAID isn't rebuilt!) and see if the error follows the drive, the cable or the SATA port.
There is also some suggestion that it might be the mode the hardware is running in, or some kind of power management problem.
WD got back to me saying to just send the drive off without any further questions. Before I do that I'll check the cables/SATA ports as suggested. It would be very annoying if I got a new drive and the problem persisted! I'm away as of tomorrow but will report back when I return. Cheers
I had a bit of a look at dmesg. I didn't google all the errors, but to me it looks like you possibly have a bad cable or SATA port. You could try swapping the SATA cables and ports (although don't do it while the RAID isn't rebuilt!) and see if the error follows the drive, the cable or the SATA port.
There is also some suggestion that it might be the mode the hardware is running in, or some kind of power management problem.
I looked a bit more into dmesg and the SATA cables as you suggested. I had originally just checked the connectors, but didn't think too much of the kink in the cable itself. It turns out straightening the cable did the trick. I've ordered some new cables and will replace the dodgy one, but the RAID has now been in active sync for the last 4 days!
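For peace of mind I'm also planning to run an occasional consistency check on the array; as far as I can tell that's just a matter of poking md's sync_action interface (Ubuntu's mdadm package apparently ships a checkarray cron job that does much the same thing monthly):

  echo check | sudo tee /sys/block/md0/md/sync_action   # start a background check of the whole array
  cat /proc/mdstat                                      # shows the check progress
  cat /sys/block/md0/md/mismatch_cnt                    # ideally 0 once it finishes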
Cheers
John