OK, a replacement box has been sent to site, but I'm still investigating this as best I can via SSH.
I have managed to "force" the commit error by just repeatedly running scripts which access the flash, and this time I've done it with "tail -f /var/log/syslog" running in another session; that has given me the following log at time of failure:
Jul 12 14:16:31 mybox kernel: usb 1-1.1: USB disconnect, device number 3
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Unhandled error code
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Result: hostbyte=0x07 driverbyte=0x00
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] CDB: cdb[0]=0x2a: 2a 00 00 5b b1 50 00 00 08 00
Jul 12 14:16:31 mybox kernel: end_request: I/O error, dev sdb, sector 6009168
Jul 12 14:16:31 mybox kernel: Buffer I/O error on device sdb2, logical block 737909
Jul 12 14:16:31 mybox kernel: lost page write due to I/O error on sdb2
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Unhandled error code
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Result: hostbyte=0x01 driverbyte=0x00
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] CDB: cdb[0]=0x2a: 2a 00 00 5b b1 90 00 00 10 00
Jul 12 14:16:31 mybox kernel: end_request: I/O error, dev sdb, sector 6009232
Jul 12 14:16:31 mybox kernel: Buffer I/O error on device sdb2, logical block 737917
Jul 12 14:16:31 mybox kernel: lost page write due to I/O error on sdb2
Jul 12 14:16:31 mybox kernel: Buffer I/O error on device sdb2, logical block 737918
Jul 12 14:16:31 mybox kernel: lost page write due to I/O error on sdb2
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Unhandled error code
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Result: hostbyte=0x01 driverbyte=0x00
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] CDB: cdb[0]=0x2a: 2a 00 00 6f 5f 68 00 00 08 00
Jul 12 14:16:31 mybox kernel: end_request: I/O error, dev sdb, sector 7298920
Jul 12 14:16:31 mybox kernel: Buffer I/O error on device sdb2, logical block 899128
Jul 12 14:16:31 mybox kernel: lost page write due to I/O error on sdb2
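For anyone reading the log: opcode 0x2a is a SCSI WRITE(10), and in the kernel's SCSI headers hostbyte 0x07 is DID_ERROR and 0x01 is DID_NO_CONNECT, so these are writes failing because the device went away rather than the card reporting a bad sector. Below is a rough sketch for decoding the CDB bytes by hand; decode_cdb10 and hostbyte_name are hypothetical helper names I made up for illustration, not existing tools, and only the two host bytes seen in this log are mapped.

```shell
#!/bin/bash
# Sketch only: decode a 10-byte SCSI CDB as printed in the kernel log.
# decode_cdb10 and hostbyte_name are hypothetical helpers, not real tools.

decode_cdb10() {
    local lba len
    # Split the space-separated hex bytes into $1..$10 (no glob characters
    # can occur in hex bytes, so unquoted splitting is safe here).
    set -- $1
    if [ "$#" -ne 10 ]; then
        echo "expected 10 CDB bytes, got $#" >&2
        return 1
    fi
    # Bytes 2-5 (here $3..$6) hold the big-endian logical block address.
    lba=$(( (0x$3 << 24) | (0x$4 << 16) | (0x$5 << 8) | 0x$6 ))
    # Bytes 7-8 (here $8 and $9) hold the big-endian transfer length in blocks.
    len=$(( (0x$8 << 8) | 0x$9 ))
    echo "opcode=0x$1 lba=$lba blocks=$len"
}

hostbyte_name() {
    # Only the values that appear in the log above are mapped.
    case "$1" in
        0x01) echo DID_NO_CONNECT ;;
        0x07) echo DID_ERROR ;;
        *)    echo unknown ;;
    esac
}

decode_cdb10 "2a 00 00 5b b1 50 00 00 08 00"
hostbyte_name 0x07
```

The decoded address for the first CDB is 6009168, which matches the sector in the end_request line that follows it, and the 16-block write in the second CDB lines up with the two consecutive "lost page write" entries.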
Also in the log, a few minutes earlier but while I was trying to reproduce this problem, were:
Jul 12 14:04:52 mybox kernel: usb 1-1.1: reset high speed USB device number 3 using orion-ehci
Jul 12 14:04:53 mybox kernel: usb 1-1.1: reset high speed USB device number 3 using orion-ehci
Any suggestions as to how to fix this? Does this look like a card (media) issue, a hardware issue, or a software issue?
In case anyone is interested, my "reboot when the filesystem vanishes script" is now:
#!/bin/bash
while : ; do
    sleep 60
    if [ $? != 0 ] ; then
        break
    fi
done
echo Rebooting
echo b > /proc/sysrq-trigger
All attempts to detect whether the filesystem was there failed, but just relying on my call to sleep returning an error code if sleep "vanishes" seems to have worked.
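An untested alternative, offered only as a sketch: since the log shows a USB disconnect, the block device entry should also disappear from sysfs, and that can be checked with shell builtins alone, so nothing has to be loaded from the flash. This assumes the device really is sdb, as in the log, and that it does not simply re-enumerate under the same name; device_present and SYSFS_BLOCK are hypothetical names of my own.

```shell
#!/bin/bash
# Sketch only: succeeds while the kernel still lists the block device.
# SYSFS_BLOCK is overridable purely so the check can be tried off-target.
SYSFS_BLOCK=${SYSFS_BLOCK:-/sys/block}

device_present() {
    # [ -e ... ] is a shell builtin, so no binary is read from the flash.
    [ -e "$SYSFS_BLOCK/$1" ]
}

# Possible use inside the loop (assumes the flash is sdb, as in the log):
#   device_present sdb || break
```

Whether this fires any earlier than the failing sleep would need testing on the box itself.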