I'm having problems understanding what is happening on my RAID array.
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[0] sdb1[1](F)
      976629568 blocks super 1.2 [2/1] [U_]
To me, that suggests that sdb has failed; however, syslog contains lots of errors for sda but no mention of sdb.
Amongst the things I've learned this morning, is the fact that I'm not getting any notifications; this only came to light because a VM that is hosted on this RAID array has lost the ability to write to one of its virtual disks (but not all of them).
Should I conclude that both disks have failed? What is the best route to recovery here? I.e. which disk do I swap out first for a new one? Logically, swapping out sda is the only thing that makes sense to me, but advice welcomed.
smartctl reports that neither disk had SMART enabled, which is odd because I thought they had, so there's no information there. I have enabled it this morning but don't want to try doing self-tests at this stage unless someone more knowledgeable than me says I should.
Mark
-- Mark Rogers // More Solutions Ltd (Peterborough Office) // 0844 251 1450 Registered in England (0456 0902) @ 13 Clarke Rd, Milton Keynes, MK1 1LG
On 28/05/13 09:04, Mark Rogers wrote:
I'm having problems understanding what is happening on my RAID array.
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[0] sdb1[1](F)
      976629568 blocks super 1.2 [2/1] [U_]
To me, that suggests that sdb has failed,
That's the way I would interpret it too. mdadm says sdb1 has failed, so it's not using it. I googled, and found http://ubuntuforums.org/showthread.php?t=1677577 which led me to http://unthought.net/Software-RAID.HOWTO/Software-RAID.HOWTO-6.html#ss6.2 and http://tldp.org/FAQ/Linux-RAID-FAQ/x37.html#failrecover
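As a rough sketch of how that status line can be read, the "(F)" suffix is what marks a failed member. This parses a pasted copy of the output rather than the live /proc/mdstat, so it's safe to run anywhere:

```shell
# A pasted copy of the mdstat line from the original post.
mdstat='md0 : active raid1 sda1[0] sdb1[1](F)'

# Members are listed as name[slot], with "(F)" appended to failed ones.
failed=$(echo "$mdstat" | grep -o '[a-z0-9]*\[[0-9]*\](F)' | cut -d'[' -f1)
echo "failed members: $failed"   # -> failed members: sdb1
```

The `[2/1] [U_]` part says the array wants 2 members but only 1 is up: each `U` is a working member and each `_` a missing one.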
however, syslog contains lots of errors on sda but no mention of sdb.
Perhaps it's not using it any more, so it doesn't show up.
[SNIP]
Should I conclude that both disks have failed?
If you're getting error messages for sda then there may well be a fault there. sdb may well be faulty too, given that mdadm has marked it as failed.
What is the best route to recovery here?
Before doing anything else, BACK EVERYTHING UP. Check everything backed up, and check that you can read the backup.
Ie which disk do I swap out first with a new disk? Logically swapping out sda is the only thing that makes sense to me but advice welcomed.
NO!!!!!! If sdb is marked as failed, mdadm won't be using it. The only drive with current data on it will be sda. If you swap it out, you'll have nothing left.
TBC
Steve
On 28/05/13 09:04, Mark Rogers wrote: [SNIP]
smartctl reports that neither disk had SMART enabled, which is odd because I thought they had, so there's no information there. I have enabled it this morning but don't want to try doing self-tests at this stage unless someone more knowledgeable than me says I should.
If you have a working backup, then I don't know that non-destructive smart tests would be a problem. I suppose it also makes a difference what sort of errors appear in the log - does it sound like a catastrophic drive hardware failure, or something that fsck could fix?
NB, Read to the end before doing anything!
I had some problems with S/W RAID. I have Ubuntu, and I used the program called Disk Utility (that's the menu entry for it - I don't know what the actual name is). I clicked on the array, then checked and repaired it. I then re-added the faulty disk to the array, then re-checked the whole array.
Once that check had finished, the whole thing seemed to work for a bit, but then errors started recurring, so I decided to swap out both drives.
There are at least 2 ways to do this. If your hardware can cope with 4 drives plugged in at the same time, you could do this: format and partition the new drives so that their partitions are at least the same size as those on the current drives. Also ensure that the partitions are marked as active and bootable if the original ones are marked like this.
Increase the size of the raid array from 2 devices to 4 with no spares. Add the new drives as members of the raid array. I think you use mdadm to do this, but I don't exactly recall how. - GOOGLE IT!
Once these drives are added, mdadm should synchronise all the data from the existing drives onto the new ones. You can then use mdadm to remove the initial drives from the raid array, and reduce the raid array size back down to 2.
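Roughly, that sequence might look like this with mdadm. This is a sketch only: the new-disk device names (sdc/sdd) are assumptions, and every command here modifies a live array, so check the man page and your own device names before running any of it:

```shell
# Sketch only: device names are assumptions; these commands alter a live array.
sfdisk -d /dev/sda | sfdisk /dev/sdc     # copy the partition table to a new disk
sfdisk -d /dev/sda | sfdisk /dev/sdd

mdadm --grow /dev/md0 --raid-devices=4   # allow 4 active members
mdadm --manage /dev/md0 --add /dev/sdc1  # add the new disks and let them sync
mdadm --manage /dev/md0 --add /dev/sdd1
cat /proc/mdstat                         # wait here until the resync finishes

mdadm --manage /dev/md0 --fail /dev/sda1     # then retire the old disks
mdadm --manage /dev/md0 --remove /dev/sda1
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1
mdadm --grow /dev/md0 --raid-devices=2       # shrink back to a 2-disk mirror
```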
DO NOT REBOOT!
If these disks are the boot device, I think you need to do something to update the initramfs, otherwise the system won't boot. I've had a discussion in ALUG about this before. I'll root it out in a minute.
An alternative, if you can't run 4 disks at the same time, is to swap out sdb and replace it with a new one. Format and partition it, as described above, then add it to the array and wait for it to resync. Update initramfs. Shut down, then swap out sda. Add the 2nd replacement disk. Format and partition it. Add it to the array. Wait for the resync to finish. Update initramfs. Reboot and you're done.
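The "update initramfs" step after each resync would be something like this (a sketch, with Debian/Ubuntu paths assumed):

```shell
# Sketch only: run after each replacement disk has finished resyncing.
cat /proc/mdstat       # confirm the array shows [UU] again
mdadm --detail --scan  # check the current array definition...
# ...and make sure /etc/mdadm/mdadm.conf contains a matching ARRAY line,
# then rebuild the initramfs so the system can assemble md0 at boot:
update-initramfs -u
```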
Any resyncs will take a long time!
A third alternative, if you have complete backup, is to just remove the old disks, install new disks, then restore your backup. I don't know exactly how to go about this though, esp if the disks include your boot partition.
HTH TBC Steve
initramfs see alug
And finally...
From a previous post entitled: Reassembling a RAID array
<Quote> Hi,
When I had a raid problem, I googled and found loads of stuff about how to recover a disk, so there's loads of info out there - perhaps a confusing amount!
This one may help. http://ubuntuforums.org/showthread.php?t=1950154
From it, you could use
sudo mdadm --examine /dev/sd*
which will tell you details about the disks, including the array UUIDs, from which you should be able to work out how many arrays you had.
If all's well, you could try
mdadm --detail --scan
mdadm --assemble --scan
The latter will try to work out which disks are part of which array, and add them.
If that doesn't work, you may have to manually add the disks to the array using mdadm, but I can't find an example to hand of how to do that.
You'll have to ensure that your mdadm.conf file is up to date too, otherwise this won't mount after a boot.
I don't know if this step is required in all situations, but I had to do
sudo update-initramfs -u
which updates the initial RAM file system used when booting the system so that it takes the new array into account. I'm booting from RAID disks, though, so this may not apply to you.
Hope that helps Steve </Quote>
And it might be worth looking up my previous problems with RAID I've discussed on ALUG, and any other RAID threads. Good luck!
Steve
OK, well I may have taken the wrong path here then, but...
I hadn't seen any replies when I decided to replace sda, given that sda was clearly showing errors in syslog (that looked pretty fatal to me: http://paste.ubuntu.com/5709656/ - although with the benefit of hindsight these may be errors on sdb which don't reference sdb by name, mixed with errors on sda which do reference it by name - advice welcomed on that one!)
I have sda now in a USB caddy where it doesn't even appear to exist as far as my desktop is concerned.
I separately have a 2TB disk pulled from somewhere it wasn't needed, onto which I have created a new 1TB partition to match that on sda/sdb, and installed it alongside sdb and included it in the array. The rebuild started fine but I then started to get more errors: http://paste.ubuntu.com/5709668/
/proc/mdstat now reports:
md0 : active raid1 sda1[2](S) sdb1[1]
      976629568 blocks super 1.2 [2/1] [_U]
My take on all of this is that (the old) sda is dead and has gone unnoticed, and now sdb has a problem.
The RAID array houses several virtual machines. It isn't backed up as such, although critical data on the individual VMs is backed up separately. I'd really like to get as much back of this as I can because otherwise I'm going to have to recreate about a dozen VMs, although I'm realistic about my chances. As things stand the array is mounted but giving errors in places, so I'm copying off what I can get access to before I go any further.
All the comments appreciated, even if I did press ahead without reading them - I have pretty much confirmed now that sda is dead so any hope of data recovery lies on sdb. If only I had logs going back further to see what the sequence of events was (or, for that matter, I was receiving mdadm notifications, something to investigate once I get this back up and running).
Mark
On 28/05/13 11:44, Mark Rogers wrote:
OK, well I may have taken the wrong path here then, but...
I hadn't seen any replies when I decided to replace sda, given that sda was clearly showing errors in syslog (that looked pretty fatal to me: http://paste.ubuntu.com/5709656/ - although with the benefit of hindsight these may be errors on sdb which don't reference sdb by name, mixed with errors on sda which do reference it by name - advice welcomed on that one!)
Don't know - sorry.
I have sda now in a USB caddy where it doesn't even appear to exist as far as my desktop is concerned.
Are you using Ubuntu, or something else? If Ubuntu, does it show in Disks utility, or fdisk? It may be present but not automatically mounted, as raid software may be confused by it now being external - bit of a guess there!
I separately have a 2TB disk pulled from somewhere it wasn't needed, onto which I have created a new 1TB partition to match that on sda/sdb, and installed it alongside sdb and included it in the array. The rebuild started fine but I then started to get more errors: http://paste.ubuntu.com/5709668/
Yes, definite impression of hardware errors on your original sdb.
/proc/mdstat now reports:
md0 : active raid1 sda1[2](S) sdb1[1]
      976629568 blocks super 1.2 [2/1] [_U]
I think that means that what is now showing up as sda1 is a Spare - it can be part of the raid array, but isn't currently. I suspect, but I'm not sure, that sdb1 is the same sdb1 that you had before, and it's now the main/1st element of the raid array. sda1 is the new disk you added. It is NOT being used yet.
My take on all of this is that (the old) sda is dead and has gone unnoticed, and now sdb has a problem.
My take is sdb very probably has a problem. I don't know for sure about the old sda.
The RAID array houses several virtual machines. It isn't backed up as such, although critical data on the individual VMs is backed up separately. I'd really like to get as much back of this as I can because otherwise I'm going to have to recreate about a dozen VMs, although I'm realistic about my chances. As things stand the array is mounted but giving errors in places, so I'm copying off what I can get access to before I go any further.
Indeed - carry on backing up/copy from.
All the comments appreciated, even if I did press ahead without reading them - I have pretty much confirmed now that sda is dead so any hope of data recovery lies on sdb. If only I had logs going back further to see what the sequence of events was (or, for that matter, I was receiving mdadm notifications, something to investigate once I get this back up and running).
I'd suggest that you continue backing up everything you can. Then, I'd suggest you disconnect both sdb (the original one) and the new 2TB disk. Reinsert the original sda back into its original place (i.e. not in the caddy). Reboot and see if the raid array restarts but in degraded mode (i.e. it knows it's missing a disk).
I hope/suspect it's sdb that's been causing the problems. IF you find that sda works by itself, then (assuming you have everything copied off the old sdb), I'd suggest that you reformat & repartition the 2TB disk and add it to your original sda as part of the raid array. I suggest using mdadm and making sure that it's an active part of the array, not a spare - a spare is no use in a 2 disk raid.
HTH Steve
I'd also suggest looking at
http://ubuntuforums.org/showthread.php?t=1950154
It lists all sorts of things to try, inc tools to check SMART status of drive etc, how to reassemble raid array etc.
HTH Steve
On 28 May 2013 12:42, steve-ALUG@hst.me.uk wrote:
Are you using Ubuntu, or something else? If Ubuntu, does it show in Disks utility, or fdisk? It may be present but not automatically mounted, as raid software may be confused by it now being external - bit of a guess there!
Kubuntu 13.04.
KDE Partition Manager can't see it, nor can fdisk. Indeed it's not listed in /dev/sd*
I think that means that what is now showing up as sda1 is Spare - it can be part of the raid array, but isn't currently.
OK, thanks. I'll do some more digging.
I'd suggest that you continue backing up everything you can.
Strangely, an rsync of everything onto another server completed without errors. That makes no sense. I assume that some of the files are corrupt and have been copied as such but I haven't found a good quick way to test them yet (VirtualBox VM disk images).
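One rough way to sanity-check the copy (a sketch with placeholder paths - it confirms the copy matches what could be read off the array, though not that the source itself was intact):

```shell
# Hypothetical sketch: paths are placeholders for the mounted array and the
# backup. Checksum everything under the source, then verify the copy.
(cd /mnt/md0 && find . -type f -exec md5sum {} + | sort -k2) > /tmp/array.md5
(cd /srv/backup && md5sum -c --quiet /tmp/array.md5) && echo "copy matches"
```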
Then, I'd suggest you disconnect both sdb (the original one) and the new 2TB disk. Reinsert the original sda back into its original place (i.e. not in the caddy). Reboot and see if the raid array restarts but in degraded mode (i.e. it knows it's missing a disk).
Sounds like a reasonable plan, I'll give this a go.
I hope/suspect it's sdb that's been causing the problems.
I share your hope but not your optimism! I'll let you know how it goes.
Update:
I haven't removed sdb yet. Looking at the error I'm getting, READ_FPDMA_QUEUED could well indicate a driver or controller issue rather than a failed drive. Indeed, smartctl doesn't seem to indicate a drive issue that I can see (although I admit to finding smartctl output very hard to interpret): http://paste.ubuntu.com/5710000/
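For what it's worth, a short self-test plus the handful of SMART attributes most often tied to real media trouble can be pulled out like this (a sketch only; /dev/sdb is an assumed device name):

```shell
# Sketch only: the device name is an assumption.
smartctl -t short /dev/sdb    # queue a short (~2 minute) self-test
sleep 150                     # give it time to complete
smartctl -l selftest /dev/sdb # read the self-test log
# Non-zero raw values on these attributes are the usual signs of failing media:
smartctl -A /dev/sdb | grep -Ei 'Reallocated|Pending|Uncorrect'
```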
I'm going to try replacing cables etc and seeing if that makes a difference before swapping the drive out, as I am as sure as I can be that sda is doing a damn good impression of a dodo right now (but worst case it's still available to work on if I need to).
Mark
OK, time to hold my hand up and admit to being an idiot.
I went to remove sda from the caddy to discover that the disk wasn't properly located. Plug it in properly and voila, I can see it. It's now back in the host server where, following Steve's advice, I confirmed I could see it on its own, so it now has my new drive alongside it. However, when I try to rebuild the array I get the same errors as above.
At the moment this is feeling like a controller issue? (Or maybe PSU?)
Mark
On 28/05/13 14:06, Mark Rogers wrote:
OK, time to hold my hand up and admit to being an idiot.
Been There, Done That, Got The T-Shirt :-)
I went to remove sda from the caddy to discover that the disk wasn't properly located. Plug it in properly and voila, I can see it. It's now back in the host server where, following Steve's advice, I confirmed I could see it on its own, so it now has my new drive alongside it. However, when I try to rebuild the array I get the same errors as above.
Could you be a bit more specific as to what the errors are?
At the moment this is feeling like a controller issue? (Or maybe PSU?)
Could be a controller, or possibly both disks have failed somehow. Could it be software config somehow? I guess it could perhaps be PSU, if it wasn't supplying correct power to the drives, but I'd think that was unlikely.
1st things first - can you successfully copy any/all info off sda? If so, that solves your data preservation issue.
How are you adding the new drive to the array and triggering a rebuild?
Cheers Steve
On 28 May 2013 14:40, steve-ALUG@hst.me.uk wrote:
Could you be a bit more specific as to what the errors are?
At a cursory glance, as per http://paste.ubuntu.com/5709668/, although with "ata4" replaced by "ata3"
Could be a controller, or possibly both disks have failed somehow. Could it be software config somehow? I guess it could perhaps be PSU, if it wasn't supplying correct power to the drives, but I'd think that was unlikely.
I've just put one of the disks into my USB caddy again and have successfully mounted the raid partition (as read-only). Trying to copy files off I'm getting errors:
May 28 15:03:22 localhost kernel: [1123077.252040] sd 12:0:0:0: [sdc] Unhandled sense code
May 28 15:03:22 localhost kernel: [1123077.252052] sd 12:0:0:0: [sdc]
May 28 15:03:22 localhost kernel: [1123077.252058] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 28 15:03:22 localhost kernel: [1123077.252064] sd 12:0:0:0: [sdc]
May 28 15:03:22 localhost kernel: [1123077.252068] Sense Key : Medium Error [current]
May 28 15:03:22 localhost kernel: [1123077.252076] sd 12:0:0:0: [sdc]
May 28 15:03:22 localhost kernel: [1123077.252081] Add. Sense: Unrecovered read error
May 28 15:03:22 localhost kernel: [1123077.252087] sd 12:0:0:0: [sdc] CDB:
May 28 15:03:22 localhost kernel: [1123077.252090] Read(10): 28 00 03 2b dc 00 00 00 f0 00
May 28 15:03:22 localhost kernel: [1123077.252108] end_request: critical target error, dev sdc, sector 53206016
That seems to suggest that a hardware problem in the host server isn't the issue.
Both disks failing in similar ways at around the same time seems unlikely too, unless there was a power surge or something (this box doesn't go through a UPS).
1st things first - can you successfully copy any/all info off sda? If so, that solves your data preservation issue.
Working on it. Copy: yes; successfully: not sure.
How are you adding the new drive to the array and triggering a rebuild?
mdadm --manage /dev/md0 --add /dev/sda1 (or sdb1, depending which drive I have swapped out).
On 28/05/13 15:16, Mark Rogers wrote:
On 28 May 2013 14:40, steve-ALUG@hst.me.uk wrote:
Could you be a bit more specific as to what the errors are?
At a cursory glance, as per http://paste.ubuntu.com/5709668/, although with "ata4" replaced by "ata3"
OK
Could be a controller, or possibly both disks have failed somehow. Could it be software config somehow? I guess it could perhaps be PSU, if it wasn't supplying correct power to the drives, but I'd think that was unlikely.
I've just put one of the disks into my USB caddy again and have successfully mounted the raid partition (as read-only). Trying to copy files off I'm getting errors:
May 28 15:03:22 localhost kernel: [1123077.252040] sd 12:0:0:0: [sdc] Unhandled sense code
May 28 15:03:22 localhost kernel: [1123077.252052] sd 12:0:0:0: [sdc]
May 28 15:03:22 localhost kernel: [1123077.252058] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
May 28 15:03:22 localhost kernel: [1123077.252064] sd 12:0:0:0: [sdc]
May 28 15:03:22 localhost kernel: [1123077.252068] Sense Key : Medium Error [current]
May 28 15:03:22 localhost kernel: [1123077.252076] sd 12:0:0:0: [sdc]
May 28 15:03:22 localhost kernel: [1123077.252081] Add. Sense: Unrecovered read error
May 28 15:03:22 localhost kernel: [1123077.252087] sd 12:0:0:0: [sdc] CDB:
May 28 15:03:22 localhost kernel: [1123077.252090] Read(10): 28 00 03 2b dc 00 00 00 f0 00
May 28 15:03:22 localhost kernel: [1123077.252108] end_request: critical target error, dev sdc, sector 53206016
That seems to suggest that a hardware problem in the host server isn't the issue.
Googling some of those errors took me to this: http://www.linuxquestions.org/questions/linux-general-1/problem-mounting-che...
The error there was a corrupted superblock, and various fscks didn't fix it, but http://www.cgsecurity.org/wiki/TestDisk did.
I guess a corrupted superblock would make sense - both disks would look wrong, and it could have been caused by a power loss. Worth a look??
Both disks failing in similar ways at around the same time seems unlikely too, unless there was a power surge or something (this box doesn't go through a UPS).
1st things first - can you successfully copy any/all info off sda? If so, that solves your data preservation issue.
Working on it. Copy: yes; successfully: not sure.
Good luck!
How are you adding the new drive to the array and triggering a rebuild?
mdadm --manage /dev/md0 --add /dev/sda1 (or sdb1, depending which drive I have swapped out).
Seems fair enough, but you may need to remove sda1 or sdb1 (depending) and then use --assemble to force the new drive to be active and not a spare.
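The remove-and-re-add might look roughly like this (a sketch only, with assumed device names; check mdadm --detail afterwards to confirm the member shows as rebuilding rather than "(S)"):

```shell
# Sketch only: device names are assumptions; run against the real array.
mdadm --manage /dev/md0 --remove /dev/sda1   # drop the stuck spare
mdadm --manage /dev/md0 --add /dev/sda1      # re-add it as an active member
mdadm --detail /dev/md0                      # should show "spare rebuilding"
cat /proc/mdstat                             # and a recovery progress bar
```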
Good luck! Steve
On 28 May 2013 20:19, steve-ALUG@hst.me.uk wrote:
Googling some of those errors took me to this: http://www.linuxquestions.org/questions/linux-general-1/problem-mounting-che...
The error there was corrupted superblock, and various fscks didn't fix it, but http://www.cgsecurity.org/wiki/TestDisk
Thanks, I'll give that a go. I've used TestDisk (and PhotoRec from the same place) many times in the past so I "trust" them with my data.
It didn't look like the obvious solution previously because I can access "most" of the data on the disks. Indeed, I now have 5 of my VMs up and running, apparently (not tested well enough to be sure yet) without problems, although they all had some level of (virtual) disk corruption that the O/S tools were able to handle. I have another VM which is booting but lots of system files are missing (I say missing, they're in lost+found but putting them all back in the right places would take months!).
I think I'm at a point where I can get a day's "real" work done so that will have to take priority and I'll come back to analysing the disks later in the week.
Thanks for all the suggestions.
Mark