New subject: [Alug] Hard disks - how to tell if one (which one) is about to fail?

1 Feb 2003


      Over the last day or so I have found that occasionally things wait for
disk I/O longer than I would normally expect, and before the data is
returned there is a noticeable click coming from one (or both) of the
hard disks, presumably as the head is returned to track 0.
Sometimes this click goes hand in hand with a kernel message like:
hdb: timeout waiting for DMA
ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Sometimes the above messages are accompanied by one saying the kernel has
reset the IDE bus, and sometimes there is no message from the kernel at
all, suggesting that the drive itself is attempting some kind of recovery.
I am guessing from this that one of the disks is on its way out, but the
question is which one?  The hdb mentioned above isn't a guide, as the
kernel logs DMA timeouts for hda too.
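
One rough way to compare the two drives would be to count the timeout
lines per device in the kernel log. A minimal Python sketch, assuming the
log lines look like the message quoted above; the function name and the
sample log are made up for illustration and do not come from any real tool:

```python
import re
from collections import Counter

# Matches lines such as "hdb: timeout waiting for DMA". The device name
# pattern (hda, hdb, ...) is an assumption based on the message quoted above.
TIMEOUT_RE = re.compile(r"\b(hd[a-z]): timeout waiting for DMA")

def count_dma_timeouts(lines):
    """Return a Counter mapping device name to number of DMA timeout lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Made-up sample log, for illustration only.
sample = [
    "hdb: timeout waiting for DMA",
    "ide_dmaproc: chipset supported ide_dma_timeout func only: 14",
    "hda: timeout waiting for DMA",
    "hdb: timeout waiting for DMA",
]
print(count_dma_timeouts(sample))
```

A lopsided count would only be a hint, since both drives may share one
cable and controller.
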
When booting, the BIOS reports both drives as S.M.A.R.T. capable and status OK,
so I tried installing some Linux S.M.A.R.T. software - smartsuite.  Can anyone
help me with the interpretation of the output below?
----
# smartctl -a /dev/hda
Device: WDC WD1200JB-00CRA1  Supports ATA Version 5
Drive supports S.M.A.R.T. and is enabled
Check S.M.A.R.T. Passed.
General Smart Values: 
Off-line data collection status: (0x84)	Offline data collection activity was 
    				suspended by an interrupting command
Self-test execution status:      (  40)	The self-test routine was interrupted
    				by the host with a hard or soft reset
Total time to complete off-line 
data collection: 		 (4680) Seconds
Offline data collection 
Capabilities: 			 (0x3b)SMART EXECUTE OFF-LINE IMMEDIATE
    				Automatic timer ON/OFF support
    				Suspend Offline Collection upon new
    				command
    				Offline surface scan supported
    				Self-test supported
Smart Capablilities:           (0x0003)	Saves SMART data before entering
    				power-saving mode
    				Supports SMART auto save timer
Error logging capability:        (0x01)	Error logging supported
Short self-test routine 
recommended polling time: 	 (   2) Minutes
Extended self-test routine 
recommended polling time: 	 (  87) Minutes
Vendor Specific SMART Attributes with Thresholds:
Revision Number: 16
Attribute                    Flag     Value Worst Threshold Raw Value
(  1)Raw Read Error Rate     0x000b   200   200   051       0
(  3)Spin Up Time            0x0007   103   102   021       5616
(  4)Start Stop Count        0x0032   100   100   040       47
(  5)Reallocated Sector Ct   0x0033   199   199   140       1
(  7)Seek Error Rate         0x000b   100   253   051       0
(  9)Power On Hours          0x0032   098   098   000       2185
( 10)Spin Retry Count        0x0013   100   253   051       0
( 11)Calibration Retry Count 0x0013   100   253   051       0
( 12)Power Cycle Count       0x0032   100   100   000       47
(196)Reallocated Event Count 0x0032   199   199   000       1
(197)Current Pending Sector  0x0012   200   200   000       2
(198)Offline Uncorrectable   0x0012   200   200   000       2
(199)UDMA CRC Error Count    0x000a   200   200   000       153
(200)Unknown Attribute       0x0009   200   200   051       2
SMART Error Log:
SMART Error Logging Version: 1
No Errors Logged
# smartctl -a /dev/hdb
Device: Maxtor 5T040H4  Supports ATA Version 6
Drive supports S.M.A.R.T. and is enabled
Check S.M.A.R.T. Passed.
General Smart Values: 
Off-line data collection status: (0x00)	Offline data collection activity was
    				never started
Self-test execution status:      (   0)	The previous self-test routine completed
    				without error or no self-test has ever 
    				been run
Total time to complete off-line 
data collection: 		 (  30) Seconds
Offline data collection 
Capabilities: 			 (0x1b)SMART EXECUTE OFF-LINE IMMEDIATE
    				Automatic timer ON/OFF support
    				Suspend Offline Collection upon new
    				command
    				Offline surface scan supported
    				Self-test supported
Smart Capablilities:           (0x0003)	Saves SMART data before entering
    				power-saving mode
    				Supports SMART auto save timer
Error logging capability:        (0x01)	Error logging supported
Short self-test routine 
recommended polling time: 	 (   2) Minutes
Extended self-test routine 
recommended polling time: 	 (  25) Minutes
Vendor Specific SMART Attributes with Thresholds:
Revision Number: 16
Attribute                    Flag     Value Worst Threshold Raw Value
(  1)Raw Read Error Rate     0x000a   253   252   000       6
(  3)Spin Up Time            0x0027   196   190   063       17692
(  4)Start Stop Count        0x0032   253   253   000       478
(  5)Reallocated Sector Ct   0x0033   178   178   063       190
(  6)Read Channel Margin     0x0001   253   253   100       0
(  7)Seek Error Rate         0x000a   116   115   000       188
(  8)Seek Time Preformance   0x0027   251   248   187       44139
(  9)Power On Hours          0x0032   246   246   000       18579
( 10)Spin Retry Count        0x002b   235   222   223       13
( 11)Calibration Retry Count 0x002b   253   252   223       0
( 12)Power Cycle Count       0x0032   252   252   000       409
(196)Reallocated Event Count 0x0008   253   253   000       0
(197)Current Pending Sector  0x0008   253   253   000       0
(198)Offline Uncorrectable   0x0008   253   253   000       0
(199)UDMA CRC Error Count    0x0008   199   199   000       1
(200)Unknown Attribute       0x000a   253   252   000       0
(201)Unknown Attribute       0x000a   253   252   000       28
(202)Unknown Attribute       0x000a   253   252   000       0
(203)Unknown Attribute       0x000b   253   252   180       0
(204)Unknown Attribute       0x000a   253   252   000       0
(205)Unknown Attribute       0x000a   253   252   000       0
(207)Unknown Attribute       0x002a   243   238   000       7
(208)Unknown Attribute       0x002a   248   243   000       4
(209)Unknown Attribute       0x0024   253   253   000       0
( 96)Unknown Attribute       0x0004   253   253   000       0
( 97)Unknown Attribute       0x0004   253   253   000       0
( 98)Unknown Attribute       0x0004   253   253   000       0
( 99)Unknown Attribute       0x0004   253   253   000       0
(100)Unknown Attribute       0x0004   253   253   000       0
(101)Unknown Attribute       0x0004   253   253   000       0
SMART Error Log:
SMART Error Logging Version: 1
Error Log Data Structure Pointer: 05
ATA Error Count: 501
Non-Fatal Count: 0
Error Log Structure 2:
DCR   FR   SC   SN   CL   SH   D/H   CR   Timestamp
 08   00   02   3f   64   34    f0   c8     354178
 08   00   80   f1   89   2c    f0   c8     354178
 08   00   10   71   8a   2c    f0   c8     354178
 08   00   70   83   8a   2c    f0   c8     354178
 08   00   80   f3   8a   2c    f0   c8     354186
 00   84   00   f3   8a   2c    e0   51     955260
Error Log Structure 3:
DCR   FR   SC   SN   CL   SH   D/H   CR   Timestamp
 08   00   14   95   7b   2c    f0   c8     354174
 08   00   80   a9   7b   2c    f0   c8     354175
 08   00   80   29   7c   2c    f0   c8     354175
 08   00   02   69   a4   21    f0   c8     354175
 08   00   01   01   00   00    b0   08     354240
 00   04   01   01   00   00    b0   51     955270
Error Log Structure 5:
DCR   FR   SC   SN   CL   SH   D/H   CR   Timestamp
 08   00   08   3f   b8   99    f0   ca     104366
 08   00   02   3f   a4   1e    f0   ca     104366
 08   00   08   6f   54   f8    f0   ca     104365
 08   00   08   af   b7   fc    f0   ca     104365
 08   00   01   01   00   00    b0   08     104442
 00   04   01   01   00   00    b0   51     1217309
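
As a starting point for reading the attribute tables, the rows can be
checked mechanically. The sketch below is a minimal Python example, assuming
the usual S.M.A.R.T. convention that an attribute counts as failed when its
normalised Value (or, historically, its Worst) is at or below the Threshold.
The regular expression and the function name are illustrative and only
tested against the row layout shown above:

```python
import re

# Pattern for one attribute row of the tables above, e.g.
# "( 10)Spin Retry Count        0x002b   235   222   223       13"
ROW_RE = re.compile(
    r"^\(\s*(\d+)\)(.+?)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)

def flag_attributes(lines):
    """Return (id, name, value, worst, threshold) for rows at or below threshold."""
    flagged = []
    for line in lines:
        m = ROW_RE.match(line.strip())
        if not m:
            continue  # skip headers and anything that is not an attribute row
        attr_id, name = int(m.group(1)), m.group(2).strip()
        value, worst, threshold = (int(m.group(i)) for i in (3, 4, 5))
        if value <= threshold or worst <= threshold:
            flagged.append((attr_id, name, value, worst, threshold))
    return flagged

# A few rows copied from the hdb listing above.
hdb_rows = [
    "(  5)Reallocated Sector Ct   0x0033   178   178   063       190",
    "(  8)Seek Time Preformance   0x0027   251   248   187       44139",
    "( 10)Spin Retry Count        0x002b   235   222   223       13",
    "( 11)Calibration Retry Count 0x002b   253   252   223       0",
]
for row in flag_attributes(hdb_rows):
    print(row)
```

On these sample rows it prints only the Spin Retry Count row, whose Worst
(222) is below its Threshold (223). The Raw Value column is vendor specific
and is not interpreted here.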