I have a small server, a ThinkServer TS140, for serving files via SFTP to my colleagues, as well as a few other services. It runs Ubuntu 14.04 from a 500 GB drive that is backed up nightly offsite, and has a separate RAID 10 array of 4x 4 TB WD Reds for the storage.
The drives had been running fine for over a year until yesterday, when I received an email saying the array was degraded:
" A DegradedArray event had been detected on md device /dev/md/0. P.S. The /proc/mdstat file currently contains the following: Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] md0 : active raid10 sdb1[0] sde1[3] sdc1[4] 7813772288 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U] unused devices: <none>"
The first thing I did was back up the most important files to an external drive via rsync. (I have run out of space for a full 8 TB external backup and can't afford to buy any more drives just now; we also keep these more important files offsite.)
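In case it matters, this is roughly the rsync command I used (the paths below are just placeholders, not my real layout):

    # copy the important folders to the external USB drive,
    # preserving permissions/times and showing progress
    rsync -avh --progress /mnt/raid/important/ /media/external/backup/important/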
mdadm -D /dev/md0 showed that my third drive, sdd1, had been removed. I checked SMART to see if the drive was failing and saw nothing, even after running a self-test. I didn't check for disk errors on the drive itself as I'm not sure how that works in a RAID setup. I did try re-adding the disk to the array and it immediately picked it up and started recovering.
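For reference, these are more or less the commands I ran to test the drive and re-add it (going from memory, so the exact options might be slightly off):

    # long SMART self-test on the drive that dropped out, then check the results
    smartctl -t long /dev/sdd
    smartctl -a /dev/sdd

    # put the removed partition back into the array and watch it rebuild
    mdadm /dev/md0 --re-add /dev/sdd1
    cat /proc/mdstat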
This is what mdadm --examine showed for the RAID status *before* I tried re-adding the removed drive: http://pastebin.com/P4Q4YhN7
I can't see anything wrong here; the only difference I noticed was in the sdd section, which shows Array State AAAA while the others show AA.A (. = missing).
This morning I woke up to another email with the same message, and sdd1 had been removed again. I am re-adding the drive to see what happens, but I suspect the same result.
Here's the mdadm --detail output now: http://pastebin.com/CkBm6SbH
dmesg since a recent reboot, with a lot of the repeats taken out: http://pastebin.com/jzWnYizV
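For completeness, the pastebin outputs above were gathered with commands along these lines (again from memory, and the device list is how I remember it):

    # per-device metadata for all four members, captured before the first re-add
    mdadm --examine /dev/sd[bcde]1

    # current array status
    mdadm --detail /dev/md0

    # kernel log since the reboot (I trimmed the repeated lines by hand)
    dmesg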
1) Should I run any more tests on the drive itself? Any specific suggestions? Do you think it could be a hardware fault unrelated to the actual drive? I'm nervous about starting to swap things around to test this.
2) The drives are in warranty, so I have contacted WD to ask for a replacement, although if the SMART tests don't show anything I don't know whether they'll replace the drive.
3) Should I tell my colleagues we're going to have to switch the machine off until the new drive arrives? (I am paranoid about a second drive failing before I get a new one.)
Any advice on what I could do to test the drive/setup is appreciated. Please be aware I'm no Linux guru; my field is something completely different and I just happen to love learning about Linux and have found it very useful to complement my office setup and work.
Many thanks
John