[Alug] RAID0 all gone wrong :-/

22 Jul 2002


Hi all
I've got a 140GB RAID0 across 4 disks of differing sizes. It's on Red Hat 
7.1, so that's kernel 2.4.2.
Unfortunately this afternoon the RAID has gone kaput :-/
As instructed, I used 'persistent superblocks', which means that the 
kernel finds the array on boot and tries to initialise the device without me 
having to run raidstart. On boot, or if I run raidstart myself, I get 
the following messages:
Jul 21 23:41:17 giles kernel: autodetecting RAID arrays
Jul 21 23:41:17 giles kernel: (read) hde1's sb offset: 45030080hde: 
dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jul 21 23:41:17 giles kernel: hde: dma_intr: error=0x40 { 
UncorrectableError }, LBAsect=90060223, sector=90060160
Jul 21 23:41:17 giles kernel: end_request: I/O error, dev 21:01 (hde), 
sector 90060160
Jul 21 23:41:18 giles kernel: md: disabled device hde1, could not read 
superblock.
Jul 21 23:41:18 giles kernel: md: could not read hde1's sb, not importing!
Jul 21 23:41:18 giles kernel: could not import hde1!
...I'm guessing from this that the drive (a 40GB IBM) has conveniently 
developed a bad block right where raidtools put its superblock.
So I thought rebuilding the superblock might be a good plan:
[root@giles /]# mkraid /dev/md0
handling MD device /dev/md0
analyzing super-block
disk 0: /dev/hde1, 45030163kB, raid superblock at 45030080kB
mkraid: aborted, see the syslog and /proc/mdstat for potential clues.
syslog said:
Jul 22 02:25:21 giles kernel: hde: read_intr: status=0x59 { DriveReady 
SeekComplete DataRequest Error }
Jul 22 02:25:21 giles kernel: hde: read_intr: error=0x40 { 
UncorrectableError }, LBAsect=90060223, sector=90060160
Jul 22 02:25:21 giles kernel: end_request: I/O error, dev 21:01 (hde), 
sector 90060160
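For what it's worth, the numbers in the mkraid output and the syslog lines agree with each other. The sketch below is only a sanity check, and it assumes the usual md 0.90 rule that the persistent superblock sits in the last 64 kB-aligned 64 kB block of the partition. The constant and function names are made up for illustration and are not raidtools identifiers.

```python
# Assumption: the superblock lives in the last 64 kB-aligned 64 kB block
# of the partition (my understanding of the md 0.90 layout).
MD_RESERVED_KB = 64  # hypothetical name for the reserved block size

def sb_offset_kb(partition_kb: int) -> int:
    """Round the size down to a 64 kB boundary, then step back one block."""
    return (partition_kb & ~(MD_RESERVED_KB - 1)) - MD_RESERVED_KB

def kb_to_sector(offset_kb: int) -> int:
    """Convert a 1 kB offset into a 512-byte sector number."""
    return offset_kb * 2

hde1_kb = 45030163             # size mkraid reported for /dev/hde1
offset = sb_offset_kb(hde1_kb)
print(offset)                  # 45030080, as in "raid superblock at 45030080kB"
print(kb_to_sector(offset))    # 90060160, as in "sector 90060160"
```

So the failing sector is the first sector of the superblock itself. The LBAsect value in the log is 63 higher than the sector value, which would fit a partition that starts at sector 63, though that is a guess.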
From reading around various HOWTOs, it seems the superblock only exists 
so that modern kernels can automount the RAIDs at boot time. They didn't 
exist in the old days.
So I tried to frig it by editing /etc/raidtab, setting 
persistent-superblock to 0 and running 'raid0run'. This seems to work - 
I get lots of syslog messages saying it's investigating the drives, and I 
can mount the md0 device. But only some directories list, the values are 
all wrong, and I get kernel messages like:
Jul 21 23:41:25 giles kernel: attempt to access beyond end of device
Jul 21 23:41:25 giles kernel: 09:00: rw=0, want=326333420, limit=143733920
Running e2fsck on it (in read-only mode) produces loads of inode errors, 
eventually exiting with
Error while iterating over blocks in inode 2932821: Illegal indirect 
block found
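To put the "attempt to access beyond end of device" message in perspective, here is a rough conversion of its two figures. It assumes the 2.4 kernel reports both want and limit in 1 kB blocks, and the helper name is only illustrative.

```python
# Figures copied from the kernel message above.
# Assumption: both values are counted in 1 kB blocks.
want_kb = 326333420
limit_kb = 143733920

def kb_to_gib(kb: int) -> float:
    """Convert a count of 1 kB blocks to GiB."""
    return kb / (1024 * 1024)

print(f"array size: {kb_to_gib(limit_kb):.1f} GiB")
print(f"requested:  {kb_to_gib(want_kb):.1f} GiB")
print(f"overshoot:  {want_kb / limit_kb:.2f}x the device size")
```

If the units are right, the filesystem is asking for a block more than twice as far out as the array extends, which looks more like metadata being read from the wrong place than a single bad sector.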
...my /etc/raidtab is correct, drives are in the correct order and I 
have backups of /var/log/messages showing superblock addresses when it 
worked (earlier today), so I'm sure I must be able to mark the bad 
sectors then get it to put a new superblock somewhere... but I've no 
idea how!
I know I might end up losing a few files on bad sectors - I can live 
with that - but losing a whole raid over 4 disks seems a bit too much.
Can anyone help?
If not, can anyone point me to some better online resources? I've tried 
the HOWTOs on linuxdoc.org and there's a Red Hat HOWTO, but it doesn't go 
into this much detail... I'm unable to find any raidtools documentation 
beyond the man pages... nor a mailing list...?
Cheers
Neil
