I have a small server, a ThinkServer TS140, for serving files via SFTP to my colleagues, as well as a few other services. It runs Ubuntu 14.04 from a 500 GB drive that is backed up nightly offsite, and has a separate RAID 10 array of 4x 4 TB WD Reds for the storage.
The drives had been running fine for over a year until yesterday, when I received an email saying the array was degraded:
" A DegradedArray event had been detected on md device /dev/md/0. P.S. The /proc/mdstat file currently contains the following: Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] md0 : active raid10 sdb1[0] sde1[3] sdc1[4] 7813772288 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U] unused devices: <none>"
The first thing I did was back up the most important files to an external drive via rsync. (I have run out of space for a full 8 TB external backup and can't afford to buy any more drives just now; we also keep these more important files offsite.)
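In case it matters, this is roughly the rsync command I used (the paths below are just placeholders, not my real layout):

    # copy the important folders to the external USB drive,
    # preserving permissions/times and showing progress
    rsync -avh --progress /mnt/raid/important/ /media/external/backup/important/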
mdadm -D /dev/md0 showed that my third drive, sdd1, had been removed. I checked SMART to see if the drive was failing and saw nothing, even after running a self-test. I didn't check for disk errors on the drive itself as I'm not sure how that works in a RAID setup. I did try re-adding the disk to the array and it immediately picked it up and started recovering.
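For reference, these are more or less the commands I ran to test the drive and re-add it (going from memory, so the exact options might be slightly off):

    # long SMART self-test on the drive that dropped out, then check the results
    smartctl -t long /dev/sdd
    smartctl -a /dev/sdd

    # put the removed partition back into the array and watch it rebuild
    mdadm /dev/md0 --re-add /dev/sdd1
    cat /proc/mdstat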
This is what mdadm --examine showed for the RAID status *before* I tried re-adding the removed drive: http://pastebin.com/P4Q4YhN7
I can't see anything wrong here; the only difference I noticed was in the sdd section, which shows Array State AAAA while the others show AA.A (. = missing).
This morning I woke up to another email with the same message, and sdd1 had been removed again. I am re-adding the drive to see what happens, but I suspect the same result.
Here's the mdadm --detail output now: http://pastebin.com/CkBm6SbH
dmesg since a recent reboot, with a lot of the repeats taken out: http://pastebin.com/jzWnYizV
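For completeness, the pastebin outputs above were gathered with commands along these lines (again from memory, and the device list is how I remember it):

    # per-device metadata for all four members, captured before the first re-add
    mdadm --examine /dev/sd[bcde]1

    # current array status
    mdadm --detail /dev/md0

    # kernel log since the reboot (I trimmed the repeated lines by hand)
    dmesg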
1) Should I run any more tests on the drive itself? Any specific suggestions? Do you think it could be a hardware fault unrelated to the actual drive? I'm nervous about starting to swap things around to test this.
2) The drives are in warranty, so I have contacted WD to ask for a replacement, although if the SMART tests don't show anything I don't know whether they'll replace the drive.
3) Should I tell my colleagues we're going to have to switch the machine off until the new drive arrives? (I am paranoid about a second drive failing before I get a new one.)
Any advice on what I could do to test the drive/setup is appreciated. Please be aware I'm no Linux guru; my field is something completely different and I just happen to love learning about Linux and have found it very useful to complement my office setup and work.
Many thanks
John