Hi all
I've got a 140GB RAID0 across 4 disks of differing sizes. It's on RedHat 7.1, so that's kernel 2.4.2.
Unfortunately this afternoon the raid has gone kaput :-/
As instructed, I used 'persistent superblocks', which means that the kernel finds the array on boot and tries to initialise the device without me having to run raidstart. On boot, or if I run raidstart myself, I get the following messages:
Jul 21 23:41:17 giles kernel: autodetecting RAID arrays
Jul 21 23:41:17 giles kernel: (read) hde1's sb offset: 45030080hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Jul 21 23:41:17 giles kernel: hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=90060223, sector=90060160
Jul 21 23:41:17 giles kernel: end_request: I/O error, dev 21:01 (hde), sector 90060160
Jul 21 23:41:18 giles kernel: md: disabled device hde1, could not read superblock.
Jul 21 23:41:18 giles kernel: md: could not read hde1's sb, not importing!
Jul 21 23:41:18 giles kernel: could not import hde1!
...I'm guessing from this that the drive (40GB IBM) has conveniently developed a bad block right where raidtools put its superblock.
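The numbers in the log can be used to check that guess. The sketch below assumes the `sector=` value in the I/O error is counted in 512-byte sectors from the start of the hde1 partition, and that the "sb offset" is in kB; both are assumptions about the kernel's units, and the variable names are only illustrative.

```python
# Values copied from the kernel messages above.
SB_OFFSET_KB = 45030080     # "(read) hde1's sb offset"
FAILED_SECTOR = 90060160    # "sector=" in the I/O error (assumed partition-relative)
LBA_SECTOR = 90060223       # "LBAsect=" in the I/O error (assumed whole-disk LBA)

SECTOR_BYTES = 512          # assumption: sectors are 512 bytes
KB = 1024

# Convert the failing sector to a kB offset within the partition.
failed_offset_kb = FAILED_SECTOR * SECTOR_BYTES // KB

print("failing offset (kB):", failed_offset_kb)
print("matches superblock offset:", failed_offset_kb == SB_OFFSET_KB)

# The gap between the two sector numbers would be the partition's start sector.
print("LBAsect - sector:", LBA_SECTOR - FAILED_SECTOR)
```

Under those assumptions the failing sector lands exactly on the superblock offset, and the difference of 63 sectors is what you would expect if hde1 starts at sector 63 of the disk.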
So I thought rebuilding the superblock might be a good plan:
[root@giles /]# mkraid /dev/md0
handling MD device /dev/md0
analyzing super-block
disk 0: /dev/hde1, 45030163kB, raid superblock at 45030080kB
mkraid: aborted, see the syslog and /proc/mdstat for potential clues.
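The superblock position mkraid reports is not arbitrary. As a rough sketch, assuming the old-style (0.90) md layout in which 64 kB is reserved at the end of each member and the superblock sits at the start of the last 64 kB-aligned block that leaves a full 64 kB free, the offset follows from the partition size alone. The function name here is hypothetical.

```python
RESERVED_KB = 64  # assumption: 0.90-format md reserves 64 kB at the end of each member


def sb_offset_kb(partition_kb: int) -> int:
    """Round the size down to a 64 kB boundary, then step back one 64 kB block."""
    return (partition_kb & ~(RESERVED_KB - 1)) - RESERVED_KB


hde1_kb = 45030163  # partition size reported by mkraid
print("computed superblock offset (kB):", sb_offset_kb(hde1_kb))
print("agrees with mkraid:", sb_offset_kb(hde1_kb) == 45030080)
```

If that layout assumption holds, the superblock location is fixed by the partition size, so mkraid will keep trying to use the same unreadable spot unless the partition size changes.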
syslog said:
Jul 22 02:25:21 giles kernel: hde: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Jul 22 02:25:21 giles kernel: hde: read_intr: error=0x40 { UncorrectableError }, LBAsect=90060223, sector=90060160
Jul 22 02:25:21 giles kernel: end_request: I/O error, dev 21:01 (hde), sector 90060160
From reading around various HOWTOs, it seems the superblock only exists so that modern kernels can automount the RAIDs at boot time. Superblocks didn't exist in the old days. So I tried to frig it by editing /etc/raidtab, setting persistent-superblock to 0 and running 'raid0run'. This seems to work - I get lots of syslog messages saying it's investigating the drives and I can mount the md0 device. But only some directories list, the values are all wrong, and I get kernel messages like:
Jul 21 23:41:25 giles kernel: attempt to access beyond end of device
Jul 21 23:41:25 giles kernel: 09:00: rw=0, want=326333420, limit=143733920
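To put those two numbers in perspective, here is a small sketch that converts them to sizes. It assumes `want` and `limit` are both counted in 1 kB blocks, which is an assumption about how this kernel reports them.

```python
# Values copied from the "attempt to access beyond end of device" message.
WANT_KB = 326333420    # block the filesystem asked for (assumed 1 kB units)
LIMIT_KB = 143733920   # size of md0 as the kernel sees it (assumed 1 kB units)

KB_PER_GIB = 1024 * 1024

limit_gib = LIMIT_KB / KB_PER_GIB
want_gib = WANT_KB / KB_PER_GIB

print(f"md0 size:        {limit_gib:.1f} GiB")
print(f"requested block: {want_gib:.1f} GiB into the device")
print(f"ratio want/limit: {WANT_KB / LIMIT_KB:.2f}")
```

On that reading, the device size is in line with a roughly 140GB array, while the requested block is more than twice as far in as the device is long. That suggests the filesystem metadata being read back is garbage rather than the array merely being a little too small.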
Running e2fsck on it (in read-only mode) produces loads of inode errors, eventually exiting with:
Error while iterating over blocks in inode 2932821: Illegal indirect block found
...my /etc/raidtab is correct, drives are in the correct order and I have backups of /var/log/messages showing superblock addresses when it worked (earlier today), so I'm sure I must be able to mark the bad sectors then get it to put a new superblock somewhere... but I've no idea how!
I know I might end up losing a few files on bad sectors - I can live with that - but losing a whole raid over 4 disks seems a bit too much.
Can anyone help?
If not, can anyone point me to some better online resources? I've tried the HOWTOs on linuxdoc.org, and there's a RedHat HOWTO, but it doesn't go into this much detail... I'm unable to find any raidtools documentation beyond the man pages... nor a mailing list...?
Cheers
Neil