I have 4x2TB disks configured for RAID5.
Initially they were in a USB3 external caddy but this never worked correctly - the raid kept dropping offline before it completed building the array.
I then switched to eSATA (same caddy) and that improved things but I still failed to build the array completely. So they're now in a new HP microserver.
Until now I assumed the issues were connectivity but since I still have the same problem there must be a disk issue of some kind. However SMART is reporting healthy even after longer self tests.
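For reference, the long tests were run and checked with something along these lines (smartctl from smartmontools; adjust the device name as appropriate):
smartctl -t long /dev/sdd      # start an extended (long) SMART self-test
smartctl -l selftest /dev/sdd  # check the self-test log once it has finished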
Do I just write this off as a duff disk or can I investigate this further?
syslog reports thus:
Oct 4 01:30:09 backup kernel: [49309.671201] sd 4:0:0:0: [sdd] Unhandled error code
Oct 4 01:30:09 backup kernel: [49309.671217] sd 4:0:0:0: [sdd]
Oct 4 01:30:09 backup kernel: [49309.671222] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Oct 4 01:30:09 backup kernel: [49309.671229] sd 4:0:0:0: [sdd] CDB:
Oct 4 01:30:09 backup kernel: [49309.671233] Read(10): 28 00 d9 2e b1 f0 00 04 00 00
Oct 4 01:30:09 backup kernel: [49309.671253] end_request: I/O error, dev sdd, sector 3643716080
Oct 4 01:30:09 backup kernel: [49309.671264] md/raid:md0: read error not correctable (sector 3643714032 on sdd1).
Oct 4 01:30:09 backup kernel: [49309.671274] md/raid:md0: Disk failure on sdd1, disabling device.
Oct 4 01:30:09 backup kernel: [49309.671274] md/raid:md0: Operation continuing on 2 devices.
Oct 4 01:30:09 backup kernel: [49309.671310] md/raid:md0: read error not correctable (sector 3643714040 on sdd1).
Oct 4 01:30:09 backup kernel: [49309.671316] md/raid:md0: read error not correctable (sector 3643714048 on sdd1).
Oct 4 01:30:09 backup kernel: [49309.671321] md/raid:md0: read error not correctable (sector 3643714056 on sdd1).
Oct 4 01:30:09 backup kernel: [49309.671326] md/raid:md0: read error not correctable (sector 3643714064 on sdd1).
Oct 4 01:30:09 backup kernel: [49309.671331] md/raid:md0: read error not correctable (sector 3643714072 on sdd1).
Oct 4 01:30:09 backup kernel: [49309.671336] md/raid:md0: read error not correctable (sector 3643714080 on sdd1).
Oct 4 01:30:09 backup kernel: [49309.671341] md/raid:md0: read error not correctable (sector 3643714088 on sdd1).
Oct 4 01:30:09 backup kernel: [49309.671346] md/raid:md0: read error not correctable (sector 3643714096 on sdd1).
Oct 4 01:30:09 backup kernel: [49309.671351] md/raid:md0: read error not correctable (sector 3643714104 on sdd1).
Oct 4 01:30:09 backup kernel: [49310.107789] md: md0: recovery done.
Oct 4 01:30:09 backup kernel: [49310.170142] RAID conf printout:
Oct 4 01:30:09 backup kernel: [49310.170156] --- level:5 rd:4 wd:2
Oct 4 01:30:09 backup kernel: [49310.170163] disk 0, o:1, dev:sdb1
Oct 4 01:30:09 backup kernel: [49310.170167] disk 1, o:1, dev:sdc1
Oct 4 01:30:09 backup kernel: [49310.170171] disk 2, o:0, dev:sdd1
Oct 4 01:30:09 backup kernel: [49310.170175] disk 3, o:1, dev:sde1
Oct 4 01:30:09 backup kernel: [49310.170279] RAID conf printout:
Oct 4 01:30:09 backup kernel: [49310.170292] --- level:5 rd:4 wd:2
Oct 4 01:30:09 backup kernel: [49310.170299] disk 0, o:1, dev:sdb1
Oct 4 01:30:09 backup kernel: [49310.170304] disk 1, o:1, dev:sdc1
Oct 4 01:30:09 backup kernel: [49310.170309] disk 2, o:0, dev:sdd1
Oct 4 01:30:09 backup kernel: [49310.170322] RAID conf printout:
Oct 4 01:30:09 backup kernel: [49310.170325] --- level:5 rd:4 wd:2
Oct 4 01:30:09 backup kernel: [49310.170329] disk 0, o:1, dev:sdb1
Oct 4 01:30:09 backup kernel: [49310.170332] disk 1, o:1, dev:sdc1
Oct 4 01:30:09 backup kernel: [49310.170336] disk 2, o:0, dev:sdd1
Oct 4 01:30:09 backup sSMTP[10706]: Unable to locate mail
Oct 4 01:30:09 backup sSMTP[10706]: Cannot open mail:25
Oct 4 01:30:09 backup mdadm[3415]: Fail event detected on md device /dev/md0, component device /dev/sdd1
Oct 4 01:30:09 backup kernel: [49310.172571] RAID conf printout:
Oct 4 01:30:09 backup kernel: [49310.172578] --- level:5 rd:4 wd:2
Oct 4 01:30:09 backup kernel: [49310.172585] disk 0, o:1, dev:sdb1
Oct 4 01:30:09 backup kernel: [49310.172589] disk 1, o:1, dev:sdc1
Oct 4 01:30:09 backup mdadm[3415]: RebuildFinished event detected on md device /dev/md0
Mark
On 08/10/13 16:52, Mark Rogers wrote:
I have 4x2TB disks configured for RAID5.
Initially they were in a USB3 external caddy but this never worked correctly - the raid kept dropping offline before it completed building the array.
I then switched to eSATA (same caddy) and that improved things but I still failed to build the array completely. So they're now in a new HP microserver.
Until now I assumed the issues were connectivity but since I still have the same problem there must be a disk issue of some kind. However SMART is reporting healthy even after longer self tests.
Do I just write this off as a duff disk or can I investigate this further?
syslog reports thus:
Oct 4 01:30:09 backup kernel: [49309.671201] sd 4:0:0:0: [sdd] Unhandled error code
Oct 4 01:30:09 backup kernel: [49309.671217] sd 4:0:0:0: [sdd]
Oct 4 01:30:09 backup kernel: [49309.671222] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Oct 4 01:30:09 backup kernel: [49309.671229] sd 4:0:0:0: [sdd] CDB:
Oct 4 01:30:09 backup kernel: [49309.671233] Read(10): 28 00 d9 2e b1 f0 00 04 00 00
Oct 4 01:30:09 backup kernel: [49309.671253] end_request: I/O error, dev sdd, sector 3643716080
Oct 4 01:30:09 backup kernel: [49309.671264] md/raid:md0: read error not correctable (sector 3643714032 on sdd1).
Oct 4 01:30:09 backup kernel: [49309.671274] md/raid:md0: Disk failure on sdd1, disabling device.
{SNIP}
http://www.ultimatebootcd.com/ http://www.sysresccd.org/SystemRescueCd_Homepage
I'd be tempted to download one of these rescue CDs - Ultimate Boot CD, I think, has loads of HDD diagnostic tests on it, including some manufacturer-specific utilities that you can run to test and low-level format drives.
I'd be tempted to boot to the CD with only one drive connected, and then run the manufacturer diags on that disk, doing destructive testing and/or a low-level format if available. Make sure the whole disk gets written to - it will take a while. Once tested, if it passes, reformat and restore your data.
Then try the other disk.
Destructive testing will of course destroy your data - backup first.
If you can't find any drive-specific tests, you could try one of the disk wiping utilities and get it to wipe the whole disk. This will write to the whole disk multiple times, and will show you if there are any errors. My favourite is Darik's Boot and Nuke (DBAN). Wipe the disk with a comprehensive wipe, which will take several hours. If there are no errors, format & restore your data.
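If you're booted into a Linux rescue environment anyway, badblocks can do much the same job - a destructive write-and-verify pass over the whole disk that reports any bad sectors it finds. Something like:
badblocks -wsv /dev/sdX
Again, this destroys everything on the disk.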
Alternatively, if finances allow, dump the disks and get new ones.
HTH
Steve
On 8 October 2013 17:55, steve-ALUG@hst.me.uk wrote:
http://www.ultimatebootcd.com/ http://www.sysresccd.org/SystemRescueCd_Homepage
Good thought, I have UBCD kicking around but didn't think about using it to test the disks.
Destructive testing will of course destroy your data - backup first.
As it happens these disks are empty (as you may have surmised by the fact I'm trying to build a RAID array on them) but that's a welcome reminder for anyone else who might follow this!
Alternatively, if finances allow, dump the disks and get new ones.
I'm edging towards this. There's a good chance that the disks are still in warranty though, so I'd like to prove whether they're faulty.
Mark
On 9 October 2013 08:40, Mark Rogers mark@quarella.co.uk wrote:
Good thought, I have UBCD kicking around but didn't think about using it to test the disks.
So far I have tried every tool under the sun (it seems), without getting anywhere. The WD tools just refuse to test the disks ("missing test tracks", error 0229, which nobody seems to have a solution for but which apparently isn't the fault of the disks). Even DBAN crashed whenever I tried to wipe the disks, but I have solved that one (booting from a USB drive seems to confuse it, and you have to pull the drive out at the "detecting USB devices" (or similar) stage of the boot process). I wonder now whether something similar might fix the WD diagnostics problems (I'm about 0.05% into a quick wipe at the moment, so I've got ~10 hours before I can do much else with them).
When did hard disks get so complicated?
On 09/10/13 13:08, Mark Rogers wrote:
When did hard disks get so complicated?
They haven't, but unless you are used to it, it's easy to miss something in the smartctl output.
In the first error you posted it seemed that /dev/sdd was reporting the error, so can you paste the full output of smartctl -a /dev/sdd here? The reported errors only appear when things get really bad; there is stuff to look for in the regular stats that is a warning sign of trouble. Post it here and I (or someone else) will guide you through it.
On 10 October 2013 07:04, Wayne Stallwood ALUGlist@digimatic.co.uk wrote:
On 09/10/13 13:08, Mark Rogers wrote:
When did hard disks get so complicated?
They haven't, but unless you are used to it, it's easy to miss something in the smartctl output.
Fair point :-)
In the first error you posted it seemed that /dev/sdd was reporting the error, so can you paste the full output of smartctl -a /dev/sdd here?
As the output is quite long I've pasted it here: http://pastebin.com/vMpY6qP4
The reported errors only appear when things get really bad; there is stuff to look for in the regular stats that is a warning sign of trouble. Post it here and I (or someone else) will guide you through it.
One thing that I have noticed is high Load_Cycle_Count values (one drive is in excess of 785,000), which is apparently due to the way these WD drives are set up, causing them to power down after 8 seconds of idle. I have now tweaked this (using wdidle3 from a DOS disk) to 5min, but given that a typical full-life value is apparently supposed to be about 300k and I've hit 785k in ~7000 hours (~10 months) of power-on time, this is a tad worrying... I don't think it can explain the problems I was having though, especially given that the drive with (by far) the highest value was not the one that apparently had issues.
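For anyone wanting to check their own drives, something like this shows the relevant SMART counters (adjust the device name as needed):
smartctl -A /dev/sdb | egrep 'Power_On_Hours|Load_Cycle_Count'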
I did successfully DBAN the drives (zero-fill only) and all drives report Reallocated_Sector_Ct=0 (if I'm reading the SMART output correctly). I am now in the process of recreating the RAID5 array (~10 hours to go).
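For completeness, the array creation and progress check were along these lines (device names as in the earlier syslog, so adjust to suit):
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
cat /proc/mdstat    # shows the build/resync progress and estimated time remaining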
Thanks for all the help
Mark
On 10/10/13 09:18, Mark Rogers wrote:
I did successfully DBAN the drives (zero-fill only) and all drives report Reallocated_Sector_Ct=0 (if I'm reading the SMART output correctly). I am now in the process of recreating the RAID5 array (~10 hours to go). Thanks for all the help Mark
Reassuring that DBAN could write to all of the disk, but weird that the disk tests wouldn't run.
I'm not an expert on smartctl output. I have literally just read up about it here. I MAY HAVE MISSED SOMETHING!
http://smartmontools.sourceforge.net/man/smartctl.8.html Skip down to -A, --attributes.
Value column is the current value - lower (closer to zero) is worse. Worst column is the worst it's ever been - lower is worse. Thresh is the point at which the attribute is seen to be failing or failed. Pre-fail means that if this attribute is too low, then this is a sign that the disk is failing. Updated shows when this value is updated. When Failed shows when this attribute failed, e.g. now, previously, or "-" for OK. Yours are all OK. Ignore the Raw Value column - it's manufacturer dependent and may be misleading.
Reallocated Sector Count is:
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
Value and Worst are 200 which is above the threshold of 140, so these are OK. Raw Value of 0 may mean anything, but could be no sectors reallocated. "-" says it's OK.
I can't see anything wrong with these, DBAN worked, and yet RAID5 won't work. I'm confused.
Let us know how the RAIDing goes!
Good luck Steve
PS - IMHO, disks have always been complicated. e.g. Master/Slave/CS jumpers. Size, speed, format jumpers. PATA/IDE/ATA (various versions)/SCSI (various versions)/SATA (various versions)/USB disk.
DOS Device drivers to overcome BIOS limitations. Windows Device drivers to overcome BIOS/DOS/Windows limitations.
Low and high level formats. Availability or not of manufacturer disk tools. Varying boot sectors and drive formats, and partitions etc etc etc
Revisiting the 1st post in this thread...
On 08/10/13 16:52, Mark Rogers wrote:
I have 4x2TB disks configured for RAID5.
Initially they were in a USB3 external caddy but this never worked correctly - the raid kept dropping offline before it completed building the array.
So at this point it could be disk, or caddy, or USB issue...
I then switched to eSATA (same caddy)
so at this point it could be disk or caddy, but not USB....
and that improved things but I still failed to build the array completely. So they're now in a new HP microserver.
New microserver? New as in brand new? If yes, then the same problems mean it's unlikely to be the new server, or the caddy, which leaves the disk suspect... Unless it's a compatibility issue between your OS and the disk, or the RAID s/w and the disk.
Until now I assumed the issues were connectivity but since I still have the same problem there must be a disk issue of some kind.
Seems to be a fair conclusion.
However SMART is reporting healthy even after longer self tests.
Which, to me, seems OK.
Do I just write this off as a duff disk or can I investigate this further?
syslog reports thus:
Oct 4 01:30:09 backup kernel: [49309.671201] sd 4:0:0:0: [sdd] Unhandled error code
Oct 4 01:30:09 backup kernel: [49309.671217] sd 4:0:0:0: [sdd]
Oct 4 01:30:09 backup kernel: [49309.671222] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
Oct 4 01:30:09 backup kernel: [49309.671229] sd 4:0:0:0: [sdd] CDB:
Oct 4 01:30:09 backup kernel: [49309.671233] Read(10): 28 00 d9 2e b1 f0 00 04 00 00
Oct 4 01:30:09 backup kernel: [49309.671253] end_request: I/O error, dev sdd, sector 3643716080
Oct 4 01:30:09 backup kernel: [49309.671264] md/raid:md0: read error not correctable (sector 3643714032 on sdd1).
Oct 4 01:30:09 backup kernel: [49309.671274] md/raid:md0: Disk failure on sdd1, disabling device.
[SNIP] So it works, then it stops working. Could it be a timeout issue? I.e. it writes to the disk, then spends a while writing to the other one; in the meantime, the first one has powered down and won't spin back up again?
Try disabling power saving, disks spinning down, etc. Grasping at straws: is the power supply at your location reliable? Could it be brown-outs messing things up? Are you using the latest O/S, with all the patches installed?
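For the power saving, something like this might do it, assuming the drives honour the standard APM/standby commands (I gather some WD Greens don't):
hdparm -B 255 /dev/sdd    # disable Advanced Power Management, if the drive supports it
hdparm -S 0 /dev/sdd      # disable the standby (spin-down) timer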
Good luck!
Steve
On 10 October 2013 15:07, steve-ALUG@hst.me.uk wrote:
New microserver? New as in brand new?
Yes, brand new. 250GB HDD as supplied with the microserver has Ubuntu 12.04 (server, amd64) on it, then 4x2TB drives installed to be used as RAID5 array for storing backups.
So it works, then it stops working. Could it be a timeout issue? I.e. it writes to the disk, then spends a while writing to the other one; in the meantime, the first one has powered down and won't spin back up again?
Possibly. I don't know what would cause mdadm to "give up", but it might not expect the drives to power down after 8s of inactivity. (Aside: anyone with WD "green" disks would be advised to check whether they are running up large Load_Cycle_Count values; anecdotal evidence (i.e. I read it in a forum somewhere) suggests other manufacturers have this "feature" too. Any figure over 1000 (except for an old disk) should be considered high, I believe.)
Try disabling power saving, disks spinning down etc.
From my efforts with Google, these drives don't like having the power saving turned off, but setting it to 300s should suffice, which is what I have done.
Grasping at straws: is the power supply at your location reliable? Could it be brown-outs messing things up?
Anything is possible but if so it's only this set of disks that's showing it. I have other mdadm-based RAID arrays in other boxes in the same building (not sure off the top of my head whether any are RAID5, most will be RAID1).
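(A quick cat /proc/mdstat on each box would tell me - it lists every md device together with its RAID level and member disks.)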
Are you using the latest O/S, with all the patches installed?
If latest LTS counts as "latest" then yes (including up to date on patches).
5hrs remaining until the RAID build is due to complete so I guess I'll know in the morning...
Thanks everyone for the suggestions and advice.
Mark
On 10 October 2013 15:38, Mark Rogers mark@quarella.co.uk wrote:
(Aside: anyone with WD "green" disks would be advised to check whether they are running up large Load_Cycle_Count values; anecdotal evidence (i.e. I read it in a forum somewhere) suggests other manufacturers have this "feature" too. Any figure over 1000 (except for an old disk) should be considered high, I believe.)
I just found another box here with 4x2TB drives (in 2xRAID1 config) with Load_Cycle_Count values well in excess of 1 million...
The "correct" fix is to use the wdidle3 tool from DOS (Google it) but I also found a Linux version here: http://idle3-tools.sourceforge.net/
.. which looks to work.
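If anyone else tries it, usage seems to be along these lines - though do check the documentation, since the set option takes a raw timer value rather than seconds:
idle3ctl -g /dev/sdb    # read the current idle3 timer setting
idle3ctl -d /dev/sdb    # disable the idle3 timer altogether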
On 10/10/13 15:47, Mark Rogers wrote:
The "correct" fix is to use the wdidle3 tool from DOS (Google it) but I also found a Linux version here: http://idle3-tools.sourceforge.net/
.. which looks to work.
AFAIK WD Greens aren't supported in RAID configurations for this reason, or at least that used to be the case.
On Thu, 10 Oct 2013 22:11:49 +0100 Wayne Stallwood ALUGlist@digimatic.co.uk allegedly wrote:
On 10/10/13 15:47, Mark Rogers wrote:
The "correct" fix is to use the wdidle3 tool from DOS (Google it) but I also found a Linux version here: http://idle3-tools.sourceforge.net/
.. which looks to work.
AFAIK WD Greens aren't supported in RAID configurations for this reason, or at least that used to be the case.
Oh bugger...
Guess what disks I chose for both my desktop upgrade and my new build backup server.
On checking my disks, this is what I see.
Desktop 2TB WDC WD20EARX purchased January this year
/dev/sda Power_On_Hours 2143 Load_Cycle_Count 148616
Server (built as RAID 1) 2 X 2TB WDC WD20EARX purchased May of this year.
/dev/sda Power_On_Hours 3388 Load_Cycle_Count 386649
/dev/sdb Power_On_Hours 3389 Load_Cycle_Count 386087
For comparison I checked an old 1TB USB disk which has been attached to an NSLU2 running 24/7 for well over three years. That disk is also a Caviar Green, but the model number is WDC WD10EAVS.
/dev/sda Power_On_Hours 29888 Load_Cycle_Count 107
Some difference!
All the sites I have read as a result of this discussion here suggest that whilst 300,000 load cycles is probably a reasonable lifetime max (so my 6 month old server is fsckd) up to a million may be possible.
So, given that I should probably change my shiny new disks PDQ, I'll ask the same question Mark did - what can people recommend for a server which is primarily a NAS backup (it is also my DNS server and a DLNA server for my MP3 and MP4 files)?
Mick
---------------------------------------------------------------------
Mick Morgan gpg fingerprint: FC23 3338 F664 5E66 876B 72C0 0A1F E60B 5BAD D312 http://baldric.net
---------------------------------------------------------------------
On 12 October 2013 14:16, mick mbm@rlogin.net wrote:
All the sites I have read as a result of this discussion here suggest that whilst 300,000 load cycles is probably a reasonable lifetime max (so my 6 month old server is fsckd) up to a million may be possible.
My interpretation of the 300,000 lifetime max was that this was an "expected" maximum, ie it would be predicted that a disk would reach this level in normal usage over its lifetime. In designing a disk to spin down more often it should be expected to have a higher load cycle count in normal use than a "normal" disk. The maximum design lifetime seems to be 1,000,000, so anything up to that shouldn't really give any cause for concern (in my reading of this), and I will repeat that I have a RAID5 array running on disks at 1,800,000 cycles that hasn't shown any sign of problems (although I will be replacing those disks as a caution now that I've become aware of it).
So I wouldn't worry too much about your disks, but at the same time I won't personally take any responsibility for your data :-)
If I had a RAID1 array, as you do, I would replace one of the disks, and in future I shall return to my principle of using disks from different manufacturers in my RAID1 arrays.
So, given that I should probably change my shiny new disks PDQ, I'll ask the same question Mark did - what can people recommend for a server which is primarily a NAS backup (it is also my DNS server and a DLNA server for my MP3 and MP4 files)?
I'm also still very much looking for recommendations here though. I am seriously considering the "better the devil you know" route and getting WD drives and tuning them accordingly, for fear of buying a different brand which has a similar quirk that I know nothing about. The fact that my disks currently seem fine at 1,800,000 load cycles gives me some confidence that the drives are themselves well made.
On Mon, 14 Oct 2013 09:16:17 +0100 Mark Rogers mark@quarella.co.uk allegedly wrote:
My interpretation of the 300,000 lifetime max was that this was an "expected" maximum, ie it would be predicted that a disk would reach this level in normal usage over its lifetime. In designing a disk to spin down more often it should be expected to have a higher load cycle count in normal use than a "normal" disk. The maximum design lifetime seems to be 1,000,000, so anything up to that shouldn't really give any cause for concern (in my reading of this),
But with a NAS that is always on, a load cycle count of nearly 390,000 in 6 months points to a disk lifetime of around 16 months. I was sort of hoping for about 3 years (which is about what I expect of a modern disk).
and I will repeat that I have a RAID5 array running on disks at 1,800,000 cycles that hasn't shown any sign of problems (although I will be replacing those disks as a caution now that I've become aware of it).
So would I.
So I wouldn't worry too much about your disks, but at the same time I won't personally take any responsibility for your data :-)
:-)
If I had a RAID1 array, as you do, I would replace one of the disks, and in future I shall return to my principle of using disks from different manufacturers in my RAID1 arrays.
I have decided to order one new disk to replace my desktop and I will now re-use the old desktop disk in the RAID array. I have also taken the plunge and (successfully it would seem) used Christophe Bothamy’s idle3tool utility to switch off the idle3 timer on my disks. Of course it remains to be seen what longer term effect this will have.
I'm also still very much looking for recommendations here though. I am seriously considering the "better the devil you know" route and getting WD drives and tuning them accordingly, for fear of buying a different brand which has a similar quirk that I know nothing about. The fact that my disks currently seem fine at 1,800,000 load cycles gives me some confidence that the drives are themselves well made.
You may be right. But I've ordered a Seagate for the desktop. I'm still open to suggestions for the RAID box.
Cheers
Mick
---------------------------------------------------------------------
Mick Morgan gpg fingerprint: FC23 3338 F664 5E66 876B 72C0 0A1F E60B 5BAD D312 http://baldric.net
---------------------------------------------------------------------
On 14 October 2013 13:04, mick mbm@rlogin.net wrote:
On Mon, 14 Oct 2013 09:16:17 +0100 Mark Rogers mark@quarella.co.uk allegedly wrote:
My interpretation of the 300,000 lifetime max was that this was an "expected" maximum, ie it would be predicted that a disk would reach this level in normal usage over its lifetime. In designing a disk to spin down more often it should be expected to have a higher load cycle count in normal use than a "normal" disk. The maximum design lifetime seems to be 1,000,000, so anything up to that shouldn't really give any cause for concern (in my reading of this),
But with a NAS that is always on, a load cycle count of nearly 390,000 in 6 months points to a disk lifetime of around 16 months. I was sort of hoping for about 3 years (which is about what I expect of a modern disk).
Indeed, this isn't good; however, if you make the changes to the idle timer (which you have done) that should stop this getting substantially worse, meaning that the life of the disk "should" be fine in its current usage. In other words, as long as you make the change then my personal opinion is that you caught it early enough not to need to replace the disk, although if budget is no issue then replacing one makes sense.
I am likely to standardise on RAID1 going forward because it makes it easier to routinely replace one disk periodically (eg every 18 months if you assume 3 year lifespan as you have suggested). You can still do that with RAID5 (eg replace one of my 4 disks every 9 months) but RAID1 is just so much simpler.
Mark
{snippus maximus}
Out of interest, I wonder if
hdparm -I /dev/sda
shows anything interesting, adjusting to your drive names as appropriate.
I decided to google for "kernel: [49309.671222] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT"
This thread was quite interesting http://ubuntuforums.org/archive/index.php/t-1681924.html Post by ashikaga at February 10th, 2011, 11:06 PM:
"Or, you're trying to use the Western Digital Green drives, and they're taking too long to respond and getting kicked out of the array. That's right - all the drives are WD20EARS, which I realise in hindsight was not a great choice for a RAID configuration!
I haven't yet had time to try replacing cables or switching SATA ports around; however, as suggested in another thread, I tried disabling NCQ on all three drives and re-created the array, and this morning it seemed to have re-built successfully! All three drives came up as 'active sync'. I might try failing the other two drives and re-adding them to make sure it still works, but assuming it does, is this likely to be a reliable fix? Any significant problems with disabling NCQ? (The server is never going to be under heavy load, just for home-use.)"
So, it seems that some WD Green drives have had problems, that seem to be solved (at least once) by disabling NCQ - Native Command Queuing.
NCQ - Native Command Queuing works like this: If you ask for sectors 2, 4, 3, 1, then without NCQ you get them in that order, but with NCQ, you get them in the order the drive thinks is most efficient - e.g. 1, 2, 3, 4.
I've seen a few posts like this one:
http://serverfault.com/questions/305890/poor-linux-software-raid-5-performan...
which says that NCQ is a bad thing in RAID 5 as it slows things down. I can see this could be an issue if both drives were asked for the info and returned it in a different order.
It appears that NCQ can be disabled by
$ echo 1 > /sys/block/sda/device/queue_depth
$ echo 1 > /sys/block/sdb/device/queue_depth
(adjusting drive names appropriately)
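You can check the current setting with:
cat /sys/block/sda/device/queue_depth
(1 means NCQ is effectively off; 31 seems to be a common default with it enabled.)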
Or more permanently as in this thread:
https://groups.google.com/forum/#!topic/linux.debian.user/kINSsYgJIes
It also occurs to me that perhaps the drives have acoustic management (i.e. quiet mode) which may slow things down or cause problems. If it's still not working, perhaps try to turn that off too.
Good luck & keep us posted!
Steve
On 10 October 2013 22:27, steve-ALUG@hst.me.uk wrote:
Out of interest, I wonder if
hdparm -I /dev/sda
shows anything interesting, adjusting to your drive names as appropriate.
Define "interesting"!
See http://pastebin.com/bAxaa1BH
This thread was quite interesting http://ubuntuforums.org/archive/index.php/t-1681924.html
Thanks for that, interesting reading.
So, it seems that some WD Green drives have had problems, that seem to be solved (at least once) by disabling NCQ - Native Command Queuing.
OK, it sounds like it wouldn't do me any harm to disable it (hdparm confirms it's enabled). However the two steps of performing a zero-fill, then setting the idle time to 300s, prior to rebuilding the array have resulted in a fully functioning array this morning. My instinct is that it was the idle timeout change that fixed it, particularly in light of the thread you found.
I do have another RAID5 array in another box, also using WD drives (identical drives I think) that never presented an issue in building or in use, although I suspect now that the discs have taken an undue hammering and need to be pensioned off early.
It appears that NCQ can be disabled by
$ echo 1 > /sys/block/sda/device/queue_depth
$ echo 1 > /sys/block/sdb/device/queue_depth
I note that hdparm -Q can report and set the queue depth (currently set to 31). Any suggestions from anyone as to which way I *should* disable NCQ?
It also occurs to me that perhaps the drives have acoustic management (i.e. quiet mode) which may slow things down or cause problems. If it's still not working, perhaps try to turn that off too.
As far as I can tell these drives don't have configurable acoustic management; they're just tuned for low energy use.
Meanwhile, can anyone recommend a good choice of desktop drives for a simple RAID5 array? By "simple" I mean for basically a NAS box storing various files for occasional use in an office with two people in it, so not heavy usage. Obviously a set of SAS disks in a decent server would be preferable but I don't have the budget for that, and I know what the "I" in RAID stands for...
On Fri, 11 Oct 2013 09:00:37 +0100 Mark Rogers mark@quarella.co.uk wrote:
Meanwhile, can anyone recommend a good choice of desktop drives for a simple RAID5 array? By "simple" I mean for basically a NAS box storing various files for occasional use in an office with two people in it, so not heavy usage. Obviously a set of SAS disks in a decent server would be preferable but I don't have the budget for that, and I know what the "I" in RAID stands for...
I use WD Red drives in my Edimax NAS purely because they were 'I' ;-)
On 11/10/13 14:21, Chris Walker wrote:
Mark Rogersmark@quarella.co.uk wrote:
Meanwhile, can anyone recommend a good choice of desktop drives for a simple RAID5 array? By "simple" I mean for basically a NAS box storing various files for occasional use in an office with two people in it, so not heavy usage. Obviously a set of SAS disks in a decent server would be preferable but I don't have the budget for that, and I know what the "I" in RAID stands for...
I use WD Red drives in my Edimax NAS purely because they were 'I' ;-)
I'm not sure either of you know what the "I" really stands for today.
The researchers at Berkeley may have originally defined it as "Inexpensive" but in the commercial world it is more generally accepted to stand for "Independent"
On 12 October 2013 13:42, Wayne Stallwood ALUGlist@digimatic.co.uk wrote:
The researchers at Berkeley may have originally defined it as "Inexpensive" but in the commercial world it is more generally accepted to stand for "Independent"
For me, the I=Inexpensive bit matters, because it's about saying "buy cheap disks, expect them to fail, build in redundancy so it doesn't matter when they do". Expecting the disks to fail is important, because then the redundancy is taken seriously - it's not just "a nice to have, just in case, but as it'll never happen I won't check it's set up correctly".
If I could, I would buy disks at half the price that failed twice as often and rotate them more frequently. Although in that case I don't think I'd consider RAID5 to be a sufficient level of redundancy.
On 14/10/13 09:06, Mark Rogers wrote:
For me, the I=Inexpensive bit matters, because it's about saying "buy cheap disks, expect them to fail, build in redundancy so it doesn't matter when they do". Expecting the disks to fail is important, because then the redundancy is taken seriously - it's not just "a nice to have, just in case, but as it'll never happen I won't check it's set up correctly".
If I could, I would buy disks at half the price that failed twice as often and rotate them more frequently. Although in that case I don't think I'd consider RAID5 to be a sufficient level of redundancy.
Funny, I'd rather have the opposite, even with redundancy in my array. :D
Unfortunately, because drive sizes have increased much faster than unrecoverable read error rates have improved, the statistical likelihood of recovering completely from a failed drive when you have member sizes of, say, 2TB is now so low that there is almost no point counting RAID 5 as fault tolerant.
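Back-of-the-envelope, assuming the commonly quoted consumer-drive figure of one unrecoverable read error per 10^14 bits read: rebuilding a 4x2TB RAID 5 after a failure means reading the remaining 3 x 2TB = 6TB, which is roughly 4.8 x 10^13 bits, so the chance of hitting at least one unrecoverable error during the rebuild is about 1 - (1 - 10^-14)^(4.8x10^13), somewhere around 38%.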
On Sun, Oct 20, 2013 at 04:33:23PM +0100, Wayne Stallwood wrote:
On 14/10/13 09:06, Mark Rogers wrote:
If I could, I would buy disks at half the price that failed twice as often and rotate them more frequently. Although in that case I don't think I'd consider RAID5 to be a sufficient level of redundancy.
Funny, I'd rather have the opposite, even with redundancy in my array. :D
Unfortunately, because drive sizes have increased much faster than unrecoverable read error rates have improved, the statistical likelihood of recovering completely from a failed drive when you have member sizes of, say, 2TB is now so low that there is almost no point counting RAID 5 as fault tolerant.
Pretty much. I build storage (SANs) for a living and our most recent software release no longer allows RAID 5 on the large SATA drives, due to the increased risk of a double disk failure during rebuild. It's worth noting the same applies to a double disk RAID 1 set as well.
(I think I've said this here before, but it's worth mentioning again.)
J.
On 20 October 2013 16:33, Wayne Stallwood ALUGlist@digimatic.co.uk wrote:
Unfortunately, because drive sizes have increased much faster than unrecoverable read error rates have improved, the statistical likelihood of recovering completely from a failed drive when you have member sizes of, say, 2TB is now so low that there is almost no point counting RAID 5 as fault tolerant.
There are lots of ways data can get lost that RAID of any kind doesn't help with (viruses, accidental deletion, etc). In my view, RAID is a convenience that can save the hassle of rebuilding everything from backups. Of course the backups are on RAID too, but at least there are now multiple copies.
RAID1 has the advantage from a data recovery point of view that each disk should contain all the data so tools like photorec etc stand a decent chance of recovering a lot of data from them even after a failure, which RAID5 doesn't give you.
Ie: If a disk fails, you have a copy on the second. If that also fails, you have your backups. If they're out of date or failed, you have two disks you stand a good chance of recovering data from. If not, well...
I'm not really sure what better options there are. In light of Jonathan's comments:
I build storage (SANs) for a living and our most recent software release no longer allows RAID 5 on the large SATA drives, due to the increased risk of a double disk failure during rebuild. It's worth noting the same applies to a double disk RAID 1 set as well.
.. I think maybe I'll take my strategy of having two disks on RAID1 and instead of replacing a disk every year, I'll add a new disk every year until I've reached the capacity of the hardware (4 disks).
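With mdadm I believe that is just a case of something like the following, which turns the newly added disk into an extra mirror rather than leaving it as a spare (I'd want to test it before relying on it):
mdadm /dev/md0 --add /dev/sdX1
mdadm --grow /dev/md0 --raid-devices=3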
Aside: You will recall the RAID5 array that I was moving everything off due to disks having 1.8M cycles (cf. expected lifetime of 300k, design lifetime of 1M)? One of the disks has started to report errors via SMART. (The data has all been transferred to my 2x3TB RAID1 array; I guess it was probably the process of copying it that triggered the errors that SMART detected.) At the moment the RAID5 array is still showing healthy, and indeed even the disk (sdc) is showing healthy in SMART, but my guess would be that it's on its way out?
smartctl output: http://pastebin.com/ufMMHdFu syslog: http://pastebin.com/RbXywUui
And finally: I just checked the SMART data for my two new ("identical") disks: http://pastebin.com/D09ZkVz9 Can anyone explain why the output from the two disks is so different from each other, given they're the same model?
On Mon, 21 Oct 2013 09:20:47 +0100 Mark Rogers mark@quarella.co.uk allegedly wrote:
And finally: I just checked the SMART data for my two new ("identical") disks: http://pastebin.com/D09ZkVz9 Can anyone explain why the output from the two disks is so different from each other, given they're the same model?
Firmware appears to be different.
Mick
---------------------------------------------------------------------
Mick Morgan gpg fingerprint: FC23 3338 F664 5E66 876B 72C0 0A1F E60B 5BAD D312 http://baldric.net
---------------------------------------------------------------------
On 21 October 2013 11:25, mick mbm@rlogin.net wrote:
Firmware appears to be different.
I noticed that but they look fundamentally different, as if they're different drives by different manufacturers rebadged with the same model number? The firmware references don't even look similar.
They were both bought in the same order from the same supplier at the same time so I'd have expected them to be close. Having two different disks suits me (a flaw in the design of one of them is less likely to affect the other, otherwise resulting in failures of both drives at around the same time). One doesn't seem to have the same level of SMART capability enabled though - I'm not sure if I just need to enable something (SMART itself is enabled); it suggests it can't run self-tests, which would be a bit odd?
On 11/10/13 09:00, Mark Rogers wrote:
On 10 October 2013 22:27, steve-ALUG@hst.me.uk wrote:
Out of interest, I wonder if
hdparm -I /dev/sda
shows anything interesting, adjusting to your drive names as appropriate.
Define "interesting"!
Well I thought it was interesting :-)
So, it seems that some WD Green drives have had problems, that seem to be solved (at least once) by disabling NCQ - Native Command Queuing.
OK, it sounds like it wouldn't do me any harm to disable it (hdparm confirms it's enabled). However the two steps of performing a zero-fill, then setting the idle time to 300s, prior to rebuilding the array have resulted in a fully functioning array this morning. My instinct is that it was the idle timeout change that fixed it, particularly in light of the thread you found.
Indeed. Yay it's working! :-))))
{}
It appears that NCQ can be disabled by
$ echo 1 > /sys/block/sda/device/queue_depth
$ echo 1 > /sys/block/sdb/device/queue_depth
I note that hdparm -Q can report and set the queue depth (currently set to 31). Any suggestions from anyone as to which way I *should* disable NCQ?
The thread had this solution from Stephan Seitz at 29/07/2009:
"No, not sysctl.conf, use /etc/sysfs.conf (package sysfsutils). You can then add the two lines:
block/sda/device/queue_depth=1
block/sdb/device/queue_depth=1"
I'm sure any of them will work, but the "echo" won't survive a reboot. I don't know if the hdparm setting will survive a reboot either, or if you'll need to schedule it to run on startup. The above is supposed to work. Dunno though, as I've not tried it.
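One simple way to make the echo approach stick, assuming Ubuntu still runs /etc/rc.local at boot, would be to add the two lines there, above the final "exit 0":
echo 1 > /sys/block/sda/device/queue_depth
echo 1 > /sys/block/sdb/device/queue_depth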
Glad it's working. Long may it continue.
Steve