One of our servers did a strange thing today.
Suddenly, at about half eleven this morning, it remounted / as read-only. I have seen filesystems do this on boot when fsck was unable to repair something automatically, but this happened with an uptime of 80 days. Nothing useful was written to the logs (presumably because they are on the same mount point and hence also read-only).
Seeing as it was in a read-only state anyway, I ran fsck on the ext3 filesystem and it came back with a large number (hundreds) of errors, some of which needed manual confirmation before they could be fixed.
However, / being read-only had made a mess of some of the processes that machine runs, so I gave it a reboot, and everything came back as expected, with an empty lost+found.
I ran smartctl on all 4 disks of the RAID 5 array: no errors have been logged and the disks look healthy. mdadm shows all 4 disks as being in normal mode and the array as healthy.
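For anyone wanting to run the same checks, something like the following should do it (the device names here are just examples, adjust for your own box):

  smartctl -H -l error /dev/sda    # overall health verdict plus the SMART error log
  mdadm --detail /dev/md0          # array state and the status of each member disk
  cat /proc/mdstat                 # quick summary of all md arrays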
I am going to run extended offline SMART tests on the disks over the weekend when the machine is less busy, but in the meantime can somebody confirm that damaged filesystems remounting themselves as read-only is normal behaviour, and if so, what detects and schedules this?
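For reference, my understanding is that the extended test is kicked off per disk and the results read back afterwards with something like:

  smartctl -t long /dev/sda        # start the extended offline self-test
  smartctl -l selftest /dev/sda    # view the self-test log once it has finished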
Also, any ideas as to what (apart from faulty RAM, which again I am going to have to wait until the weekend to test) might have caused such widespread corruption in the first place?
Wayne,
Wayne Stallwood wrote:
One of our servers did a strange thing today.
Suddenly, at about half eleven this morning, it remounted / as read-only. I have seen filesystems do this on boot when fsck was unable to repair something automatically, but this happened with an uptime of 80 days.
[...]
I am going to run extended offline SMART tests on the disks over the weekend when the machine is less busy, but in the meantime can somebody confirm that damaged filesystems remounting themselves as read-only is normal behaviour, and if so, what detects and schedules this?
From the mount man page, in the ext2/ext3 options section:
errors=continue / errors=remount-ro / errors=panic
    Define the behaviour when an error is encountered. (Either ignore errors and just mark the file system erroneous and continue, or remount the file system read-only, or panic and halt the system.) The default is set in the filesystem superblock, and can be changed using tune2fs(8).
You can use the dumpe2fs program to show you what the default behaviour on errors is set to in the superblock.
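For example, something along these lines (assuming your root filesystem is on /dev/md0, adjust to suit):

  dumpe2fs -h /dev/md0 | grep -i errors    # show the current errors= behaviour in the superblock
  tune2fs -e remount-ro /dev/md0           # change it (continue, remount-ro or panic)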
HTH
JD
Thanks Jon for your response; that was not a feature I was aware of.
Anyway, after taking the machine in question offline over the weekend and running memtest, we found an error on one of the DIMMs (it took 6 passes before it started to show). Given that we can't find much else wrong, the theory is that an intermittent/thermal memory issue may have caused the disk corruption, as although this server is ECC capable it did not have ECC RAM installed.
Running now with new memory and crossed fingers
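As an aside, if anyone wants to check what error correction their board actually has fitted, something like this (run as root) will report it; the exact output obviously varies from box to box:

  dmidecode -t memory | grep -i "error correction"    # e.g. "Error Correction Type: None" vs "Single-bit ECC"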
Wayne Stallwood wrote:
Thanks Jon for your response; that was not a feature I was aware of.
Anyway, after taking the machine in question offline over the weekend and running memtest, we found an error on one of the DIMMs (it took 6 passes before it started to show). Given that we can't find much else wrong, the theory is that an intermittent/thermal memory issue may have caused the disk corruption, as although this server is ECC capable it did not have ECC RAM installed.
Running now with new memory and crossed fingers
Wayne,
I have a live mail filter that exhibits this behaviour on one partition, and I can't afford to take it offline for 5 minutes, never mind a weekend. It filters well over 500K messages a month of customers' email. (Yes, the second one is nearly ready!)
I'd like to hear if your new memory fixes the problem, as the only cure atm is a cron reboot at 4am.
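Nothing clever about it, just an entry in root's crontab roughly like this:

  0 4 * * * /sbin/shutdown -r now    # reboot every day at 4am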
Cheers, Laurie.