Over the last day or so I have found that occasionally things wait for disk I/O longer than I would normally expect, and before the data is returned there is a noticeable click coming from one (or both) of the hard disks, presumably as the head is returned to track 0.
Sometimes this click goes hand in hand with a kernel message like:
hdb: timeout waiting for DMA
ide_dmaproc: chipset supported ide_dma_timeout func only: 14
Sometimes the above messages are accompanied by one saying the kernel has reset the IDE bus, and sometimes there is no message from the kernel at all, suggesting that the drive itself is attempting some kind of recovery.
I am guessing from this that one of the disks is on its way out, but the question is which one? The hdb mentioned above isn't a guide as the kernel also logs DMA timeouts for hda too.
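One rough way to compare the two drives is to count how often each one appears in the kernel's DMA timeout messages. The sketch below is only an illustration: the function name `count_dma_timeouts` is made up for this example, and it assumes the log lines contain the same "hdX: timeout waiting for DMA" text quoted above.

```python
import re
from collections import Counter

# Matches kernel lines such as "hdb: timeout waiting for DMA".
DMA_TIMEOUT = re.compile(r"\b(hd[a-z]): timeout waiting for DMA")


def count_dma_timeouts(lines):
    """Return a Counter mapping device name -> number of DMA timeout lines."""
    counts = Counter()
    for line in lines:
        match = DMA_TIMEOUT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, modelled on the message quoted above.
sample = [
    "kernel: hdb: timeout waiting for DMA",
    "kernel: ide_dmaproc: chipset supported ide_dma_timeout func only: 14",
    "kernel: hda: timeout waiting for DMA",
    "kernel: hdb: timeout waiting for DMA",
]

for device, hits in count_dma_timeouts(sample).most_common():
    print(f"{device}: {hits} DMA timeouts")
```

The function accepts any iterable of lines, so an open log file can be passed in directly; where the kernel log lives depends on the syslog setup, so the path is left to the reader.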
When booting, the BIOS reports both drives as S.M.A.R.T. capable with status OK, so I tried installing some Linux S.M.A.R.T. software - smartsuite. Can anyone help me with the interpretation of the output below?
----
# smartctl -a /dev/hda
Device: WDC WD1200JB-00CRA1 Supports ATA Version 5
Drive supports S.M.A.R.T. and is enabled
Check S.M.A.R.T. Passed.

General Smart Values:
Off-line data collection status: (0x84) Offline data collection activity was suspended by an interrupting command
Self-test execution status: ( 40) The self-test routine was interrupted by the host with a hard or soft reset
Total time to complete off-line data collection: (4680) Seconds
Offline data collection Capabilities: (0x3b)SMART EXECUTE OFF-LINE IMMEDIATE
                                      Automatic timer ON/OFF support
                                      Suspend Offline Collection upon new command
                                      Offline surface scan supported
                                      Self-test supported
Smart Capablilities: (0x0003) Saves SMART data before entering power-saving mode
                              Supports SMART auto save timer
Error logging capability: (0x01) Error logging supported
Short self-test routine recommended polling time: ( 2) Minutes
Extended self-test routine recommended polling time: ( 87) Minutes

Vendor Specific SMART Attributes with Thresholds:
Revision Number: 16
Attribute                     Flag   Value Worst Threshold Raw Value
(  1)Raw Read Error Rate      0x000b 200   200   051       0
(  3)Spin Up Time             0x0007 103   102   021       5616
(  4)Start Stop Count         0x0032 100   100   040       47
(  5)Reallocated Sector Ct    0x0033 199   199   140       1
(  7)Seek Error Rate          0x000b 100   253   051       0
(  9)Power On Hours           0x0032 098   098   000       2185
( 10)Spin Retry Count         0x0013 100   253   051       0
( 11)Calibration Retry Count  0x0013 100   253   051       0
( 12)Power Cycle Count        0x0032 100   100   000       47
(196)Reallocated Event Count  0x0032 199   199   000       1
(197)Current Pending Sector   0x0012 200   200   000       2
(198)Offline Uncorrectable    0x0012 200   200   000       2
(199)UDMA CRC Error Count     0x000a 200   200   000       153
(200)Unknown Attribute        0x0009 200   200   051       2

SMART Error Log:
SMART Error Logging Version: 1
No Errors Logged

# smartctl -a /dev/hdb
Device: Maxtor 5T040H4 Supports ATA Version 6
Drive supports S.M.A.R.T. and is enabled
Check S.M.A.R.T. Passed.

General Smart Values:
Off-line data collection status: (0x00) Offline data collection activity was never started
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run
Total time to complete off-line data collection: ( 30) Seconds
Offline data collection Capabilities: (0x1b)SMART EXECUTE OFF-LINE IMMEDIATE
                                      Automatic timer ON/OFF support
                                      Suspend Offline Collection upon new command
                                      Offline surface scan supported
                                      Self-test supported
Smart Capablilities: (0x0003) Saves SMART data before entering power-saving mode
                              Supports SMART auto save timer
Error logging capability: (0x01) Error logging supported
Short self-test routine recommended polling time: ( 2) Minutes
Extended self-test routine recommended polling time: ( 25) Minutes

Vendor Specific SMART Attributes with Thresholds:
Revision Number: 16
Attribute                     Flag   Value Worst Threshold Raw Value
(  1)Raw Read Error Rate      0x000a 253   252   000       6
(  3)Spin Up Time             0x0027 196   190   063       17692
(  4)Start Stop Count         0x0032 253   253   000       478
(  5)Reallocated Sector Ct    0x0033 178   178   063       190
(  6)Read Channel Margin      0x0001 253   253   100       0
(  7)Seek Error Rate          0x000a 116   115   000       188
(  8)Seek Time Preformance    0x0027 251   248   187       44139
(  9)Power On Hours           0x0032 246   246   000       18579
( 10)Spin Retry Count         0x002b 235   222   223       13
( 11)Calibration Retry Count  0x002b 253   252   223       0
( 12)Power Cycle Count        0x0032 252   252   000       409
(196)Reallocated Event Count  0x0008 253   253   000       0
(197)Current Pending Sector   0x0008 253   253   000       0
(198)Offline Uncorrectable    0x0008 253   253   000       0
(199)UDMA CRC Error Count     0x0008 199   199   000       1
(200)Unknown Attribute        0x000a 253   252   000       0
(201)Unknown Attribute        0x000a 253   252   000       28
(202)Unknown Attribute        0x000a 253   252   000       0
(203)Unknown Attribute        0x000b 253   252   180       0
(204)Unknown Attribute        0x000a 253   252   000       0
(205)Unknown Attribute        0x000a 253   252   000       0
(207)Unknown Attribute        0x002a 243   238   000       7
(208)Unknown Attribute        0x002a 248   243   000       4
(209)Unknown Attribute        0x0024 253   253   000       0
( 96)Unknown Attribute        0x0004 253   253   000       0
( 97)Unknown Attribute        0x0004 253   253   000       0
( 98)Unknown Attribute        0x0004 253   253   000       0
( 99)Unknown Attribute        0x0004 253   253   000       0
(100)Unknown Attribute        0x0004 253   253   000       0
(101)Unknown Attribute        0x0004 253   253   000       0

SMART Error Log:
SMART Error Logging Version: 1
Error Log Data Structure Pointer: 05
ATA Error Count: 501
Non-Fatal Count: 0

Error Log Structure 2:
DCR FR SC SN CL SH D/H CR Timestamp
08  00 02 3f 64 34 f0  c8 354178
08  00 80 f1 89 2c f0  c8 354178
08  00 10 71 8a 2c f0  c8 354178
08  00 70 83 8a 2c f0  c8 354178
08  00 80 f3 8a 2c f0  c8 354186
00  84 00 f3 8a 2c e0  51 955260

Error Log Structure 3:
DCR FR SC SN CL SH D/H CR Timestamp
08  00 14 95 7b 2c f0  c8 354174
08  00 80 a9 7b 2c f0  c8 354175
08  00 80 29 7c 2c f0  c8 354175
08  00 02 69 a4 21 f0  c8 354175
08  00 01 01 00 00 b0  08 354240
00  04 01 01 00 00 b0  51 955270

Error Log Structure 5:
DCR FR SC SN CL SH D/H CR Timestamp
08  00 08 3f b8 99 f0  ca 104366
08  00 02 3f a4 1e f0  ca 104366
08  00 08 6f 54 f8 f0  ca 104365
08  00 08 af b7 fc f0  ca 104365
08  00 01 01 00 00 b0  08 104442
00  04 01 01 00 00 b0  51 1217309
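
As a first pass at reading the attribute tables, it may help to pick out any attribute whose normalised Value or Worst figure has reached its Threshold, since that is the usual S.M.A.R.T. convention for an attribute having failed (a Threshold of 000 is treated here as "never fails"). The sketch below is a minimal illustration, not part of smartsuite: `parse_attributes` and `below_threshold` are names made up for this example, and the sample rows are three lines copied from the hdb output above.

```python
import re
from typing import NamedTuple


class Attribute(NamedTuple):
    ident: int
    name: str
    value: int
    worst: int
    threshold: int
    raw: int


# One attribute row: "(id)Name  0xFLAG  value  worst  threshold  raw".
ROW = re.compile(
    r"^\(\s*(\d+)\)\s*(.+?)\s+0x[0-9a-fA-F]{4}"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)


def parse_attributes(text):
    """Parse attribute rows from smartctl-style text, skipping other lines."""
    attrs = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            ident, name, value, worst, threshold, raw = match.groups()
            attrs.append(Attribute(int(ident), name, int(value),
                                   int(worst), int(threshold), int(raw)))
    return attrs


def below_threshold(attrs):
    """Return attributes whose value or worst is at or below a non-zero threshold."""
    return [a for a in attrs
            if a.threshold > 0 and min(a.value, a.worst) <= a.threshold]


# Three rows copied from the /dev/hdb output above.
HDB_SAMPLE = """\
(  5)Reallocated Sector Ct    0x0033 178 178 063 190
( 10)Spin Retry Count         0x002b 235 222 223 13
(199)UDMA CRC Error Count     0x0008 199 199 000 1
"""

for attr in below_threshold(parse_attributes(HDB_SAMPLE)):
    print(f"({attr.ident}) {attr.name}: value={attr.value} "
          f"worst={attr.worst} threshold={attr.threshold}")
```

On these three rows the only attribute reported is Spin Retry Count on hdb, whose Worst figure of 222 is below its Threshold of 223. A check like this only narrows down where to look; the raw counts and the error log still need reading by hand.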
On Sat, Feb 01, 2003 at 02:01:55PM +0000, Steve Fosdick wrote:
I am guessing from this that one of the disks is on its way out, but the question is which one? The hdb mentioned above isn't a guide as the kernel also logs DMA timeouts for hda too.
I had this with my IBM disk recently. You could try running badblocks on your disks to see if they have duff sectors. Alternatively, I used this tool from IBM (non-free software) to test my disk: http://www.hgst.com/hdd/support/download.htm. They don't fully support non-IBM disks but say that they can do a certain amount of testing. When I used this software it detected some bad sectors and reallocated them after a low-level format, and the disk appears to be OK again now *fingers crossed*
You may want to give it a try anyhow. The output it gives is very simple to understand compared to the Linux tools. Well, the Linux tools may be easy to interpret the data from, but I was getting worried that the disk was about to die entirely, and of course you only ever install these things about 2 minutes before the disk catches fire ;)
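For the badblocks suggestion, a read-only scan is the cautious starting point: without the -n or -w options badblocks only reads from the device. The sketch below just builds the command line; the helper name `badblocks_command` is made up for this example, and /dev/hdb is used only because it is one of the two drives in question.

```python
import shlex


def badblocks_command(device, show_progress=True):
    """Build the argv for a read-only badblocks scan of `device`."""
    argv = ["badblocks", "-v"]   # -v: verbose output
    if show_progress:
        argv.append("-s")        # -s: show the progress of the scan
    argv.append(device)
    return argv


# Print the command rather than running it; a real scan needs root
# and can take a long time on a large disk.
print(shlex.join(badblocks_command("/dev/hdb")))
```

The resulting list can be handed to `subprocess.run` as it stands if the scan is to be started from a script.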
Adam
Before you get into a panic (or the kernel does) try re-seating your IDE cables. The heat cycling in the computer can cause things to work loose. I find this to be a particular problem with SCSI cables.
On a related theme, if you have a suspect card, it is worth taking a pencil eraser and rubbing over the contacts.
On Sun, 02 Feb 2003 12:32:04 -0000 (GMT) raph@panache.demon.co.uk wrote:
Before you get into a panic (or the kernel does) try re-seating your IDE cables. The heat cycling in the computer can cause things to work loose. I find this to be a particular problem with SCSI cables.
On a related theme, if you have a suspect card, it is worth taking a pencil eraser and rubbing over the contacts.
Thanks for the advice, I will remember this for the next time.
In the meantime the disk seems to have fixed itself after a power cycle. My previous experience with failing disks at work suggested that power cycles were sometimes fatal - that the drive would spin down and then refuse to start again when power was re-applied. However, my hand was forced in this case, as when I tried to reboot the PC to load a different kernel the BIOS couldn't find the hard disks. Cycling the power seems to have restored things to normal.
Steve.