smartd[6119]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
smartd[6119]: Device: /dev/hda, 1 Offline uncorrectable sectors
Do the above error messages indicate any serious (potential) problems?
On Mon, Aug 21, 2006 at 06:45:18PM +0100, Barry Samuels wrote:
> smartd[6119]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
> smartd[6119]: Device: /dev/hda, 1 Offline uncorrectable sectors
> Do the above error messages indicate any serious (potential) problems?
My gut instinct is that the error message you've got is telling you "disk is fubar". If I were you I'd be backing up right about now "just in case" (if you can back up, that is) before trying to fix the problem.
The smartctl tool should be able to tell you exactly what the disk thinks the problem is (or indeed if it thinks there is a problem in the first place).
Thanks Adam
On Mon, 2006-08-21 at 19:03 +0100, Adam Bower wrote:
> On Mon, Aug 21, 2006 at 06:45:18PM +0100, Barry Samuels wrote:
> > smartd[6119]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
> > smartd[6119]: Device: /dev/hda, 1 Offline uncorrectable sectors
> > Do the above error messages indicate any serious (potential) problems?
> My gut instinct is that the error message you've got is telling you "disk is fubar". If I were you I'd be backing up right about now "just in case" (if you can back up, that is) before trying to fix the problem.
I'll second Adam's advice and add a bit more info.
What has happened is that the drive has detected a bad sector and is unable to remap it to an area reserved for replacing bad sectors without losing the contents of that sector (normally because the ECC data is unreadable).
If the drive manages to read the data even once, the sector will move from the pending count to the reallocated count and all data will be intact.
If the drive never gets to read the data then you can manually force the drive to remap the sector anyway (losing the contents, naturally), either by using manufacturer-specific tools or by following these (slightly scary) instructions: http://smartmontools.sourceforge.net/BadBlockHowTo.txt
But in my opinion leaving it as an offline sector is the safest thing to do; with the manufacturer tools (or the above instructions) it is very easy to kill the whole file-system.
This doesn't automatically mean the disk is dead or even dying. A small number of reallocated sectors over the lifetime of a disk is almost expected and usually transparent. Unless, as in your case, you are running the monitoring daemon and the ECC data is damaged, you would never know.
As Adam suggested, backing up your data is a good first step. Then I'd keep an eye out for more problems in the logs and perhaps schedule regular on-line tests. If you see more of the same error, or notice the reallocated sector count steadily increasing, then it is definitely time to change the disk.
What I wouldn't do before verifying that you have a good backup, is run any of the extended tests. If there is a mechanical or thermal problem with the drive then the extra stress of running the extended tests may push it over the edge and it could fail completely.
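Keeping an eye on the logs, as suggested above, could be partly automated. The following is a minimal Python sketch, not a definitive tool: it assumes the smartd messages have exactly the wording quoted in the original post, and the name parse_smartd_messages is made up for illustration.

```python
import re

# Matches the two smartd message forms quoted in this thread.
# Assumption: other smartd versions may word these messages differently.
SMARTD_LINE = re.compile(
    r"smartd\[(?P<pid>\d+)\]: Device: (?P<device>\S+), "
    r"(?P<count>\d+) "
    r"(?P<kind>Currently unreadable \(pending\)|Offline uncorrectable) sectors"
)


def parse_smartd_messages(text):
    """Return a list of (device, kind, count) tuples found in text."""
    return [
        (m.group("device"), m.group("kind"), int(m.group("count")))
        for m in SMARTD_LINE.finditer(text)
    ]


sample = (
    "smartd[6119]: Device: /dev/hda, 1 Currently unreadable (pending) sectors\n"
    "smartd[6119]: Device: /dev/hda, 1 Offline uncorrectable sectors\n"
)

for device, kind, count in parse_smartd_messages(sample):
    print(device, kind, count)
```

Fed a real log file, the same function would give a list of counts over time, which is what you would watch for a steady increase.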
On 21-Aug-06 Adam Bower wrote:
> On Mon, Aug 21, 2006 at 06:45:18PM +0100, Barry Samuels wrote:
> > smartd[6119]: Device: /dev/hda, 1 Currently unreadable (pending) sectors
> > smartd[6119]: Device: /dev/hda, 1 Offline uncorrectable sectors
> > Do the above error messages indicate any serious (potential) problems?
> My gut instinct is that the error message you've got is telling you "disk is fubar". If I were you I'd be backing up right about now "just in case" (if you can back up, that is) before trying to fix the problem.
> The smartctl tool should be able to tell you exactly what the disk thinks the problem is (or indeed if it thinks there is a problem in the first place).
> Thanks Adam
I'm not familiar with SMART/smartd; but experience with ordinary ext2 and e2fsck whispered "Short Read" in my ear when I saw the above. Also known as "Bad blocks". In which case you have a bad block somewhere.
If that's the case, then all is not lost, since the rest of the disk is probably fine after the filesystem is mended.
But in any case, Adam's advice is good: mistrust your drive; it may get worse; it may fail altogether sometime. So back it up and think of maybe replacing the disk (but also think of other sources of the problem, e.g. dodgy RAM, disk controller or motherboard, since these can write corrupt data to the disk).
Also don't forget that the season of Limothrips cerealium (Cereal thrips, aka thunderflies) is only just finished. These very tiny insects are very prevalent in East Anglia, especially in wheat growing areas, and are flushed out in vast numbers by harvesting and thundery weather. They can get inside literally anything, no matter how firmly sealed you think it is!
See http://en.wikipedia.org/wiki/Thysanoptera for a brief account of some hardware problems (by no means an exhaustive list ... ) that they can cause.
If you find what looks like a sprinkling of black particles on surfaces, inside picture frames, etc., then you've probably got some inside your computer too.
Good luck! Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) Ted.Harding@nessie.mcc.ac.uk
Fax-to-email: +44 (0)870 094 0861
Date: 21-Aug-06                                       Time: 20:28:08
------------------------------ XFMail ------------------------------
On 21-Aug-06 Ted Harding wrote:
> [...] Also don't forget that the season of Limothrips cerealium (Cereal thrips, aka thunderflies) is only just finished. These very tiny insects are very prevalent in East Anglia, especially in wheat growing areas, and are flushed out in vast numbers by harvesting and thundery weather. They can get inside literally anything, no matter how firmly sealed you think it is!
And, for a bizarre case of trouble caused by infestation (and a bizarre cure), see
http://www.fire.org.uk/BBC_News/News2003/September/bbc300903g.htm
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) Ted.Harding@nessie.mcc.ac.uk
Fax-to-email: +44 (0)870 094 0861
Date: 21-Aug-06                                       Time: 20:38:22
------------------------------ XFMail ------------------------------
On Mon, 2006-08-21 at 20:28 +0100, Ted.Harding@nessie.mcc.ac.uk wrote:
> So back it up and think of maybe replacing the disk (but also think of other sources of the problem, e.g. dodgy RAM, disk controller or motherboard, since these can write corrupt data to the disk).
The Smartmon tools are talking directly to the firmware on the drive; it is this firmware that has recorded and reported the error. It's not the same as a filesystem error and in this case cannot be caused by RAM, disk controllers, cables or motherboard.*
I get what you are saying though: filesystem errors can be caused by all the things you mention, but in this case the error has been detected by something that is running at a raw hardware level and isn't even aware of the filesystem on the disk.
* Well actually that's not strictly true: powering down the drive in the middle of a write could make it think it has a bad sector, but given the write speed of a modern drive I don't think it is at all likely on modern hardware.
On 21-Aug-06 Wayne Stallwood wrote:
> On Mon, 2006-08-21 at 20:28 +0100, Ted.Harding@nessie.mcc.ac.uk wrote:
> > So back it up and think of maybe replacing the disk (but also think of other sources of the problem, e.g. dodgy RAM, disk controller or motherboard, since these can write corrupt data to the disk).
> The Smartmon tools are talking directly to the firmware on the drive; it is this firmware that has recorded and reported the error. It's not the same as a filesystem error and in this case cannot be caused by RAM, disk controllers, cables or motherboard.*
> I get what you are saying though: filesystem errors can be caused by all the things you mention, but in this case the error has been detected by something that is running at a raw hardware level and isn't even aware of the filesystem on the disk.
> * Well actually that's not strictly true: powering down the drive in the middle of a write could make it think it has a bad sector, but given the write speed of a modern drive I don't think it is at all likely on modern hardware.
Thanks for the clarifications, Wayne -- as I said, I'm not familiar with SMART stuff, so was to an extent guessing. I'm sure your spelling out of where the messages came from will help Barry!
Cheers, Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) Ted.Harding@nessie.mcc.ac.uk
Fax-to-email: +44 (0)870 094 0861
Date: 21-Aug-06                                       Time: 22:49:17
------------------------------ XFMail ------------------------------
If you're not interested in the results of my previous posting about wireless cards then skip to the next bit. I finally bought two PCI cards via the link I mentioned. These were D-Link DWL-G520 revision B cards, which have an Atheros chipset, and after compiling and installing the Madwifi modules everything just works. The reason that they were cheap is probably that they appeared to be for the French market, with all instructions in French. I didn't need the instructions or the CD so it wasn't a problem.
Right, back to smartd. First, thanks to all for the useful information.
smartd is set up to run a short test every day and a long test once a week. I tried smartctl -l error which listed no errors and smartctl -H which said it passed the health test. So is there a problem or ain't there?
The hard drive is backed up every day so I'm not really worried about it.
What I find irritating is that the messages in my original post are injected into the logs every half-hour. As the saying goes 'I do not wish to know that' - well perhaps just the once.
On Tue, 2006-08-22 at 16:47 +0100, Barry Samuels wrote:
> What I find irritating is that the messages in my original post are injected into the logs every half-hour. As the saying goes 'I do not wish to know that' - well perhaps just the once.
I assume you are sure that it is not a new event each time it appears in the logs?
Any chance of seeing the output of smartctl -a /dev/whatever?
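One way to tell a repeated report from a new event is to check whether the sector count in the message ever changes between reports. A minimal Python sketch of that idea follows; the function name find_count_changes and the timestamps in the sample are hypothetical, made up for illustration and not taken from Barry's logs.

```python
def find_count_changes(reports):
    """Return the reports whose count differs from the previous report
    of the same kind. The first report of each kind is always included.

    reports: iterable of (timestamp, kind, count) tuples in log order.
    """
    last_seen = {}
    changes = []
    for timestamp, kind, count in reports:
        if last_seen.get(kind) != count:
            changes.append((timestamp, kind, count))
        last_seen[kind] = count
    return changes


# Hypothetical half-hourly reports (timestamps invented for the example).
reports = [
    ("09:00", "pending", 1),
    ("09:30", "pending", 1),
    ("10:00", "pending", 1),
    ("10:30", "pending", 2),
]

for change in find_count_changes(reports):
    print(change)
```

If only the first report shows up as a change, the half-hourly messages are the same event being repeated; a later entry with a higher count would be a new one.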
On 22/08/06 21:44:11, Wayne Stallwood wrote:
> I assume you are sure that it is not a new event each time it appears in the logs?
> Any chance of seeing the output of smartctl -a /dev/whatever?
================================================================================
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Maxtor MaXLine Plus II
Device Model:     Maxtor 7Y250P0
Serial Number:    Y60W7AME
Firmware Version: YAR41BW0
User Capacity:    251,000,193,024 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Wed Aug 23 08:52:54 2006 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 363) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 107) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   180   180   063    Pre-fail  Always   -           29657
  4 Start_Stop_Count        0x0032   253   253   000    Old_age   Always   -           248
  5 Reallocated_Sector_Ct   0x0033   253   253   063    Pre-fail  Always   -           0
  6 Read_Channel_Margin     0x0001   253   253   100    Pre-fail  Offline  -           0
  7 Seek_Error_Rate         0x000a   253   252   000    Old_age   Always   -           0
  8 Seek_Time_Performance   0x0027   253   249   187    Pre-fail  Always   -           35514
  9 Power_On_Minutes        0x0032   253   253   000    Old_age   Always   -           71h+07m
 10 Spin_Retry_Count        0x002b   253   252   157    Pre-fail  Always   -           0
 11 Calibration_Retry_Count 0x002b   253   252   223    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   253   253   000    Old_age   Always   -           54
192 Power-Off_Retract_Count 0x0032   253   253   000    Old_age   Always   -           0
193 Load_Cycle_Count        0x0032   253   253   000    Old_age   Always   -           0
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always   -           13
195 Hardware_ECC_Recovered  0x000a   253   252   000    Old_age   Always   -           9052
196 Reallocated_Event_Count 0x0008   253   253   000    Old_age   Offline  -           0
197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline  -           0
198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0008   199   199   000    Old_age   Offline  -           0
200 Multi_Zone_Error_Rate   0x000a   253   252   000    Old_age   Always   -           0
201 Soft_Read_Error_Rate    0x000a   253   252   000    Old_age   Always   -           2
202 TA_Increase_Count       0x000a   253   252   000    Old_age   Always   -           0
203 Run_Out_Cancel          0x000b   253   252   180    Pre-fail  Always   -           1
204 Shock_Count_Write_Opern 0x000a   253   252   000    Old_age   Always   -           0
205 Shock_Rate_Write_Opern  0x000a   253   252   000    Old_age   Always   -           0
207 Spin_High_Current       0x002a   253   252   000    Old_age   Always   -           0
208 Spin_Buzz               0x002a   253   252   000    Old_age   Always   -           0
209 Offline_Seek_Performnce 0x0024   194   191   000    Old_age   Offline  -           0
 99 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline  -           0
100 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline  -           0
101 Unknown_Attribute       0x0004   253   253   000    Old_age   Offline  -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         8         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
================================================================================
There it is.
Hmm pretty strange output if you ask me.
So it's a pretty new disk then? (Or it's a very infrequently used machine.) The power-on hours counter seems a little low (unless it has gone round the clock, which some drives do). It just seems odd that you have managed 54 power cycles in 71 hours.
Temperature must have the wrong multiplier or something; either that or the machine lives in a fridge.
What confuses me is that if you had an Offline Uncorrectable or Current Pending Sector Offline event then it should still be there (or it should have been eventually read and become a Reallocated Event). But all these counters are at zero.
It'd be nice to compare the spin-up time to another drive of the same model/capacity; it looks high compared to my drives, but it may be expressed in different units (or be misreported). In theory at least, if it went too far out of spec then it would have caused an alert status on the drive.
Also keep an eye on that Hardware ECC recovered. I am sure it shouldn't be that high (if the drive has indeed only done 71 hours' work). If the counter has rolled over then it is acceptable to have some of these over hundreds or thousands of operating hours (my drives don't even report it).
On 23/08/06 22:47:20, Wayne Stallwood wrote:
> Hmm pretty strange output if you ask me.
> So it's a pretty new disk then? (Or it's a very infrequently used machine.) The power-on hours counter seems a little low (unless it has gone round the clock, which some drives do). It just seems odd that you have managed 54 power cycles in 71 hours.
> Temperature must have the wrong multiplier or something; either that or the machine lives in a fridge.
> What confuses me is that if you had an Offline Uncorrectable or Current Pending Sector Offline event then it should still be there (or it should have been eventually read and become a Reallocated Event). But all these counters are at zero.
> It'd be nice to compare the spin-up time to another drive of the same model/capacity,
Funny you should say that! Oops!
There are two identical drives, same make, model and capacity, in the machine and I did the test on the wrong one. The normal working drive is hda and hdb is mirroring hda as a backup. I did the first test on hdb which is quite new and is normally powered down except for once a night when the mirroring is done.
The temperature was probably so low because the drive had only just spun up for the test.
> it looks high compared to my drives, but it may be expressed in different units (or be misreported). In theory at least, if it went too far out of spec then it would have caused an alert status on the drive.
> Also keep an eye on that Hardware ECC recovered. I am sure it shouldn't be that high (if the drive has indeed only done 71 hours' work). If the counter has rolled over then it is acceptable to have some of these over hundreds or thousands of operating hours (my drives don't even report it).
This is the report for hda:
================================================================================
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG SP2514N
Serial Number:    S08BJ1SYA14693
Firmware Version: VF100-33
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Thu Aug 24 09:30:16 2006 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command
                                        from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                 (4898) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  81) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always   -           9
  3 Spin_Up_Time            0x0007   084   066   025    Pre-fail  Always   -           9984
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           279
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always   -           0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline  -           9866
  9 Power_On_Half_Minutes   0x0032   100   100   000    Old_age   Always   -           32h+33m
 10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always   -           0
 11 Calibration_Retry_Count 0x0012   253   002   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           169
190 Unknown_Attribute       0x0022   148   121   000    Old_age   Always   -           30
194 Temperature_Celsius     0x0022   148   121   000    Old_age   Always   -           30
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always   -           114726757
196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           1
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline  -           1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always   -           0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always   -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      3899         -
# 2  Short offline       Completed: read failure       90%      3875         -
# 3  Short offline       Completed: read failure       90%      3851         -
# 4  Short offline       Completed: read failure       90%      3827         -
# 5  Short offline       Completed: read failure       90%      3803         -
# 6  Short offline       Completed: read failure       90%      3791         -
# 7  Extended offline    Completed: read failure       90%      3780         -
# 8  Short offline       Completed: read failure       40%      3779         -
# 9  Short offline       Completed without error       00%      3755         -
#10  Short offline       Completed without error       00%      3731         -
#11  Short offline       Completed without error       00%      3707         -
#12  Short offline       Completed without error       00%      3683         -
#13  Short offline       Completed without error       00%      3659         -
#14  Short offline       Completed without error       00%      3635         -
#15  Extended offline    Completed without error       00%      3613         -
#16  Short offline       Completed without error       00%      3611         -
#17  Short offline       Completed without error       00%      3587         -
#18  Short offline       Completed without error       00%      3563         -
#19  Short offline       Completed without error       00%      3539         -
#20  Short offline       Completed without error       00%      3515         -
#21  Short offline       Completed without error       00%      3490         -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
================================================================================
Perhaps this might make more sense and at least you have something to compare it with. :-))
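The self-test log in the hda report shows when the tests started failing. As a rough illustration, the Python sketch below pulls the rows out of such a log and finds the earliest failing entry. It assumes the row layout is as printed by the smartctl 5.36 output above; the function names and the three-row sample (rows 7 to 9 of the report) are chosen here for the example only.

```python
import re

# One row of the "SMART Self-test log" table, as laid out in the report above.
SELFTEST_ROW = re.compile(
    r"#\s*(?P<num>\d+)\s+(?P<test>Short|Extended) offline\s+"
    r"(?P<status>Completed.*?)\s+(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+(?P<lba>\S+)"
)


def parse_selftest_log(text):
    """Return one dict per self-test row found in text."""
    return [
        {
            "num": int(m.group("num")),
            "test": m.group("test"),
            "status": m.group("status"),
            "hours": int(m.group("hours")),
        }
        for m in SELFTEST_ROW.finditer(text)
    ]


def first_failure_hours(rows):
    """Lowest LifeTime(hours) value among rows that did not complete cleanly."""
    failed = [r["hours"] for r in rows if r["status"] != "Completed without error"]
    return min(failed) if failed else None


sample = """\
# 7  Extended offline    Completed: read failure       90%      3780         -
# 8  Short offline       Completed: read failure       40%      3779         -
# 9  Short offline       Completed without error       00%      3755         -
"""

rows = parse_selftest_log(sample)
print(first_failure_hours(rows))
```

On the full log from the report, every test from hour 3779 onwards reports a read failure, while everything up to hour 3755 completed without error.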
I think, based on the number of ECC errors in the 4000-hour run time, plus the uncorrectable offline event, plus the hardware read errors, in addition to the fact that it is not passing the short offline test, this drive is poorly enough to justify warranty replacement.
It is still under warranty with Samsung, so if it were me I would run from the mirror disk, do a secure wipe of any sensitive data and send this drive back under RMA.
RMAs for consumer-grade gear can take time, so whether you are comfortable with running without the protection of the mirror for what is likely to be 2-3 weeks depends on what other arrangements you have for backups.
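The counters this assessment leans on can be picked out of the attribute table mechanically. The following Python sketch is only an illustration under stated assumptions: it expects rows in the ten-column layout that smartctl 5.36 printed above, the names parse_attributes, nonzero_watched and WATCHED are invented here, and the sample is four rows copied from the hda report.

```python
# Attributes discussed in this thread as signs of bad sectors.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_attributes(table_text):
    """Map attribute name to its RAW_VALUE string.

    Rows sharing a name (e.g. Unknown_Attribute) overwrite each other,
    which is acceptable for this sketch.
    """
    attrs = {}
    for line in table_text.splitlines():
        fields = line.split()
        # A data row starts with a numeric ID and has at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = fields[9]
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is a number above zero."""
    return {
        name: int(attrs[name])
        for name in WATCHED
        if name in attrs and attrs[name].isdigit() and int(attrs[name]) > 0
    }


sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always   -           0
  9 Power_On_Half_Minutes   0x0032   100   100   000    Old_age   Always   -           32h+33m
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           1
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline  -           1
"""

print(nonzero_watched(parse_attributes(sample)))
```

Run against the hdb report instead, the result would be empty, since all three counters there are zero.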
On 24/08/06 22:14:37, Wayne Stallwood wrote:
> I think, based on the number of ECC errors in the 4000-hour run time, plus the uncorrectable offline event, plus the hardware read errors, in addition to the fact that it is not passing the short offline test, this drive is poorly enough to justify warranty replacement.
> It is still under warranty with Samsung, so if it were me I would run from the mirror disk, do a secure wipe of any sensitive data and send this drive back under RMA.
> RMAs for consumer-grade gear can take time, so whether you are comfortable with running without the protection of the mirror for what is likely to be 2-3 weeks depends on what other arrangements you have for backups.
Thanks to everyone who tried to help with this, and especially Wayne, whose contribution was particularly useful.
I've contacted the vendor with regard to a possible replacement.
If I move over to the mirror drive and return the possibly faulty one I still have my DAT tape backup drive which does a backup every night.
Thanks again