On 03-Aug-05 Stuart Bailey wrote:
I'm using a SCSI 72GB tape drive on a Fedora Core 3 system which has recently been updated with the latest kernel (about 2 weeks ago). The tape backup (using tar -cf /dev/st0 ...) had been working fine for about 5 months, and was used to restore data at that time.
I have just discovered that the backup has been failing recently. When I run tar -cvf /dev/st0 ... only 2048 bytes are backed up before this message is displayed:
tar: /dev/st0: Wrote only 2048 of 10240 bytes
tar: Error is not recoverable: exiting now
If I then run tar -tvf /dev/st0, I get all the files up to the point at which the error message was generated.
Any ideas what may have gone wrong? Are there any tools to run diagnostics on the tape unit?
A few questions/suggestions.
1. Did this trouble start concurrently with the kernel upgrade? If so, possibly the cause is there. Can you re-instate the previous kernel and see if it still gives trouble? If it's the kernel upgrade then I don't have useful ideas.
2. Do you get the same problem regardless of which tape you put in the drive? If it's just one tape, then there may be a defect on the tape itself. But if it's independent of the tape, and it's not the kernel, then this points to the tape drive itself.
3. If it's not the tape, then try putting a spare (i.e. potentially disposable) tape in the drive and raw-writing to it:
a) Set up a test file with decipherable structure:
awk 'BEGIN{for(i=1;i<=1000000;i++){printf("%07d\n",i)}}' > testfile
which will give you a test file with 8000000 bytes (7 for each integer plus a newline, so 8 bytes per integer).
b) raw-write this to the tape in various ways, e.g.:
dd if=testfile of=/dev/st0 bs=512 count=8
which will write 4096 bytes to the device in 8 blocks of 512 bytes.
c) raw-read it back (the tape must be at its start; /dev/st0 normally rewinds when it is closed, but "mt -f /dev/st0 rewind" makes sure), e.g.:
dd if=/dev/st0 bs=512 count=8
[or e.g. bs=4096 count=1]
and see how far it gets. If only 2048 bytes of the 4096 got written to the tape, then the last line printed to the console would be "0000256" (2048 bytes at 8 bytes per line).
d) Vary the above with different values for "bs" and "count".
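Steps a) to d) can be rehearsed without touching the drive. What follows is only a sketch: DEV, TOTAL and fake_tape.img are names made up for the illustration, and DEV defaults to a scratch file so that nothing is written to a real tape unless you point it at /dev/st0 yourself. It writes 8192 bytes rather than 4096 so that "bs=8192" can be included.

```shell
# DEV stands in for the tape device (an assumption for dry-running). It
# defaults to a scratch file; set DEV=/dev/st0 only with a disposable tape.
DEV=${DEV:-./fake_tape.img}
TOTAL=8192    # bytes written per trial, so that bs=8192 still fits

# a) Test file: 1000000 lines, each 7 digits plus a newline (8 bytes).
awk 'BEGIN { for (i = 1; i <= 1000000; i++) printf("%07d\n", i) }' > testfile
size=$(($(wc -c < testfile)))
echo "testfile: $size bytes"

# The last line of an intact read-back is TOTAL/8, zero-padded to 7 digits.
expected=$(printf '%07d' $((TOTAL / 8)))

# b)-d) Write TOTAL bytes and read them back with block sizes on both
# sides of 2048.
results=""
for bs in 512 2048 4096 8192; do
    count=$((TOTAL / bs))
    dd if=testfile of="$DEV" bs="$bs" count="$count" 2>/dev/null
    last=$(dd if="$DEV" bs="$bs" count="$count" 2>/dev/null | tail -n 1)
    if [ "$last" = "$expected" ]; then verdict=ok; else verdict=SHORT; fi
    echo "bs=$bs count=$count last line: $last ($verdict)"
    results="$results$verdict "
done
```

Against the scratch file every trial should come back "ok". Against the real drive, drop the "2>/dev/null" so that any complaint from dd or the driver stays visible, and note which block sizes (if any) come back SHORT.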
The fact that your tape error says that only 2048 bytes were written (out of tar's default record of 10240 bytes, i.e. 20 blocks of 512) suggests that the mechanism may be using a block size of 2048 bytes and that only one block got written. Where this failure to move on to the next block arises, however, is not clear. It may be a hardware fault in the drive (an internal buffer of 2048 bytes that is not being recycled); a failure to communicate with the drive (e.g. the drive's "handshake" announces that the write has cleared and it is ready for the next block, but the handshake is not being read and acted on); the kernel using a 2048-byte block of RAM as a buffer but not recycling it; etc.
Trying values of "bs" both above 2048 (e.g. "bs=4096" or "bs=8192") and at or below it (e.g. "bs=512" or "bs=2048") may help to discriminate between these possibilities.
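When a read-back does come up short, the exact byte count is the number to look at: if it sticks at 2048 whatever "bs" was used, that points at a fixed 2048-byte limit somewhere along the chain. A minimal sketch of such a check follows. The helper report_readback and the *.sim file names are hypothetical, and the short read is simulated with head rather than taken from a drive.

```shell
# report_readback is a hypothetical helper: given a capture of what came back
# from the tape and the reference data, print the byte count, the last line,
# and whether the capture matches the start of the reference byte for byte.
report_readback() {
    capture=$1
    reference=$2
    bytes=$(($(wc -c < "$capture")))
    last=$(tail -n 1 "$capture")
    echo "bytes read back: $bytes"
    echo "last line:       $last"
    if head -c "$bytes" "$reference" | cmp -s - "$capture"; then
        echo "content matches the first $bytes bytes of $reference"
    else
        echo "content differs from $reference"
    fi
}

# Reference data in the same layout as the test file, shortened to 512 lines.
awk 'BEGIN { for (i = 1; i <= 512; i++) printf("%07d\n", i) }' > reference.sim

# Simulate a drive that kept only the first 2048 bytes of a 4096-byte write.
head -c 2048 reference.sim > readback.sim

report_readback readback.sim reference.sim
# On real hardware the capture would instead come from something like:
#   dd if=/dev/st0 bs=512 count=8 > readback.sim
```

For the simulated case this reports 2048 bytes and a last line of "0000256". A capture that stops at a different count, or one that differs from the reference, would suggest something other than a clean cut at a 2048-byte boundary.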
Hoping this provides a useful pointer or two!
Best wishes, Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) Ted.Harding@nessie.mcc.ac.uk
Fax-to-email: +44 (0)870 094 0861
Date: 04-Aug-05                                       Time: 10:22:08
------------------------------ XFMail ------------------------------