I have an ARM box which was working fine until it went to site. It's running Debian and I only have access to it via SSH (although at the moment I have someone on site who can force it to reboot).
Twice today I've received the error "kernel:journal commit I/O error" to my SSH terminal, after which point the root filesystem gets unmounted and I can't do anything even though the unit is still "up". E.g.:
$ ls
-bash: /bin/ls: No such file or directory
Any ideas what would be causing this? It's an expensive SD card in these units (not a cheap no-brand)
This is the unit in case anyone is interested, it's running Debian 6: http://www.newit.co.uk/shop/proddetail.php?prod=DreamPlug_N
I'd also be interested to know if there's anything I can do down an SSH connection to force it to reboot, given the lack of access to the filesystem to even run "sudo" never mind "reboot". I'm guessing that sending Alt-Ctrl-Del or Alt-SysRq-B via SSH are out...
On 11/07/12 17:42, Mark Rogers wrote:
I'd also be interested to know if there's anything I can do down an SSH connection to force it to reboot, given the lack of access to the filesystem to even run "sudo" never mind "reboot". I'm guessing that sending Alt-Ctrl-Del or Alt-SysRq-B via SSH are out...
Solution in theory here: http://rackerhacker.com/2009/01/29/linux-emergency-reboot-or-shutdown-with-m...
It turns out that Alt-SysRq-B commands can be sent via SSH. However, you need to be root to do it, and I wasn't logged in as root when the filesystem disappeared, and of course now there's no way to achieve that (that I can think of anyway).
When I do regain access, is there any way I can make /proc/sys/kernel/trigger and /proc/sysrq-trigger writable by anyone other than root? Or do I need to make sure I'm always logged in as root until I find the root cause of this problem?
Mark
On 11/07/12 18:01, Mark Rogers wrote:
When I do regain access, is there any way I can make /proc/sys/kernel/trigger and /proc/sysrq-trigger writable by anyone other than root? Or do I need to make sure I'm always logged in as root until I find the root cause of this problem?
Typo there, I meant /proc/sys/kernel/sysrq
Anyway, yes I can make it writable by all users then write to it as a normal user.
So, for testing, rather than just echo b > /proc/sysrq-trigger (instant reboot, no sync etc) I tried the more graceful:
echo r > /proc/sysrq-trigger # take control away from X, pointless as server is headless
echo e > /proc/sysrq-trigger # SIGTERM to all processes
echo i > /proc/sysrq-trigger # SIGKILL to all processes
echo s > /proc/sysrq-trigger # Sync
echo u > /proc/sysrq-trigger # Unmount
echo b > /proc/sysrq-trigger # Reboot
Except that (doh!) after SIGTERM I lost access because (I assume) I'd just killed the SSH process. So once again I call guy on site to force a reboot....
However I think the principle is OK (unmounting the disk doesn't matter when it's gone anyway) so I just need to make sure that I have the right permissions on /proc/sys/kernel/sysrq - what's the best way to maintain this past a reboot?
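As a rough sketch of the same idea without the SIGTERM/SIGKILL steps that killed the SSH session, the remaining letters can be sent from a small helper. The function name sysrq_send and its arguments are made up for illustration; it uses only the bash builtin echo, so it does not depend on anything in /bin still being reachable.

```shell
#!/bin/bash
# Sketch only: sysrq_send is a hypothetical helper, not part of any package.
# Usage: sysrq_send TRIGGER_FILE KEY...
# Writes each SysRq command letter to TRIGGER_FILE in the order given.
sysrq_send() {
    local trigger=$1 key
    shift
    for key in "$@"; do
        echo "sysrq: $key"        # progress message, builtin echo only
        echo "$key" > "$trigger"  # one command letter per write
    done
}
```

Something like "sysrq_send /proc/sysrq-trigger s u b" (run by a user who can write to the trigger) would then request a sync, an emergency read-only remount and a reboot, skipping r, e and i. There is deliberately no pause between the letters, because an external sleep may not be available once the filesystem has gone.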
None of this "solves" the core problem of course....
Thinking about this, a better solution would be to run a script like this (as root) on startup:
#!/bin/bash
while [ -d ~/var ] ; do sleep 60; done
echo b > /proc/sysrq-trigger
Does that look "safe"? I'm reluctant to test something so drastic without some oversight by someone more sane than me!
What would happen during normal system shutdown/reboot? Will this script have been killed by the time the kernel unmounts the filesystem?
What's the "correct" way to start a script like this as root on system startup?
Mark
On 12/07/12 10:20, Mark Rogers wrote:
Thinking about this, a better solution would be to run a script like this (as root) on startup:
#!/bin/bash
while [ -d /var ] ; do sleep 60; done
echo b > /proc/sysrq-trigger
OK, I tried this, and the error occurred, and the script starts spewing
/root/fswatchdog.sh: line 3: /bin/sleep: No such file or directory
to the terminal; clearly /var is still there even though /bin/sleep isn't.
If I try:
echo xxx > /var/xxx
I get:
-bash: /var/xxx: Read-only file system
... again suggesting that /var is there (the same is true when I test /bin, /etc, etc). I can't think of a way to look at the contents of any files though; cat, ls, etc are all gone at this stage so I can't see what the filesystem "looks like".
Any suggestions as to how I can investigate this further welcomed, for the time being I'm going to get the box rebooted and try "while [ -f /bin/sleep ] ; do" (or maybe -r ?) instead to see if that's any better.
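One possible way to look around when cat and ls are gone is to lean on bash builtins only (read, printf and filename globbing), since the shell itself is evidently still running. The helper names bi_cat and bi_ls below are invented for this sketch, and whether the reads succeed will depend on what the kernel still has cached.

```shell
#!/bin/bash
# Sketch only: bi_cat and bi_ls are hypothetical names; both use bash
# builtins exclusively, so no external binary has to be loaded from disk.

# Print a file line by line (roughly what cat would do for text files).
bi_cat() {
    local line
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
    done < "$1"
}

# List the non-hidden entries of a directory via globbing.
bi_ls() {
    local entry
    for entry in "${1%/}"/*; do
        [ -e "$entry" ] || continue   # skip the literal pattern if nothing matched
        printf '%s\n' "${entry##*/}"
    done
}
```

For example, "bi_ls /bin" and "bi_cat /proc/mounts" might show whether directory entries are still visible and how the kernel currently sees the root mount.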
[Note: my last email had ~/var in the script rather than /var, fixed above.]
Mark
On Wed, Jul 11, 2012 at 05:42:05PM +0100, Mark Rogers wrote:
Any ideas what would be causing this? It's an expensive SD card in these units (not a cheap no-brand)
When you say units, how many are affected? I'd be considering trying another SD card if it's a single unit. If you have several doing it I'd still consider another SD card, just in case there's some reason that they are not compatible.
Adam
On 11/07/12 19:10, Adam Bower wrote:
When you say units, how many are affected?
I have three units, only one is affected but it's also the only one that's left the office.
I'd be considering trying another SD card if it's a single unit.
My instinct is the same, although that means getting it back from site.
Thinking about this last night, it seems odd that the filesystem isn't being remounted read-only, and that after a reboot it's coming back (apparently) healthy. It's as if the card is being unmounted somehow; it is an external card (i.e. it's mounted in an external SD card slot) but nobody is doing anything with it. Do the symptoms seem consistent with the card being ejected?
Is the root filesystem the only filesystem on the device ?
If not are the other filesystems unaffected ?
If you do have another filesystem on there you could potentially build some static versions of useful tools such as dmesg so you can at least find out what is going on after the event.
but mostly it looks like what Adam said :)
With stuff like this the best you can do is haul it back and rig it up in the office with some proper debugging attached and hope it still decides to play up again.
On 11/07/12 20:20, Wayne Stallwood wrote:
Is the root filesystem the only filesystem on the device ?
Yes.
If you do have another filesystem on there you could potentially build some static versions of useful tools such as dmesg so you can at least find out what is going on after the event.
It crossed my mind to put some things into a ram disk on startup but these systems aren't over-endowed with RAM anyway!
With stuff like this the best you can do is haul it back and rig it up in the office with some proper debugging attached and hope it still decides to play up again.
Yeah, I'm coming to that conclusion myself, but it's politically awkward and I'm reluctant to bite that bullet unless I'm sure.
OK, a replacement box has been sent to site, but I'm still investigating this as best I can via SSH.
I have managed to "force" the commit error by just repeatedly running scripts which access the flash, and this time I've done it with "tail -f /var/log/syslog" running in another session; that has given me the following log at time of failure:
Jul 12 14:16:31 mybox kernel: usb 1-1.1: USB disconnect, device number 3
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Unhandled error code
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Result: hostbyte=0x07 driverbyte=0x00
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] CDB: cdb[0]=0x2a: 2a 00 00 5b b1 50 00 00 08 00
Jul 12 14:16:31 mybox kernel: end_request: I/O error, dev sdb, sector 6009168
Jul 12 14:16:31 mybox kernel: Buffer I/O error on device sdb2, logical block 737909
Jul 12 14:16:31 mybox kernel: lost page write due to I/O error on sdb2
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Unhandled error code
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Result: hostbyte=0x01 driverbyte=0x00
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] CDB: cdb[0]=0x2a: 2a 00 00 5b b1 90 00 00 10 00
Jul 12 14:16:31 mybox kernel: end_request: I/O error, dev sdb, sector 6009232
Jul 12 14:16:31 mybox kernel: Buffer I/O error on device sdb2, logical block 737917
Jul 12 14:16:31 mybox kernel: lost page write due to I/O error on sdb2
Jul 12 14:16:31 mybox kernel: Buffer I/O error on device sdb2, logical block 737918
Jul 12 14:16:31 mybox kernel: lost page write due to I/O error on sdb2
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Unhandled error code
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Result: hostbyte=0x01 driverbyte=0x00
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] CDB: cdb[0]=0x2a: 2a 00 00 6f 5f 68 00 00 08 00
Jul 12 14:16:31 mybox kernel: end_request: I/O error, dev sdb, sector 7298920
Jul 12 14:16:31 mybox kernel: Buffer I/O error on device sdb2, logical block 899128
Jul 12 14:16:31 mybox kernel: lost page write due to I/O error on sdb2
Also in the log, a few minutes earlier but while I was trying to reproduce this problem, were:
Jul 12 14:04:52 mybox kernel: usb 1-1.1: reset high speed USB device number 3 using orion-ehci
Jul 12 14:04:53 mybox kernel: usb 1-1.1: reset high speed USB device number 3 using orion-ehci
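To see at a glance whether USB resets and disconnects precede the block errors, the syslog can be filtered down to just those message types. This is only a sketch: the function name usb_storage_events is invented here, and the patterns are taken from the wording of the kernel messages quoted above, so other kernels may phrase them differently.

```shell
#!/bin/bash
# Sketch only: usb_storage_events is a hypothetical helper name.
# Prints, in original order, the USB reset/disconnect lines and the I/O error
# lines from the given log file(s), or from stdin if no file is given.
usb_storage_events() {
    grep -E 'USB disconnect|reset high speed USB device|I/O error' "$@"
}
```

Running "usb_storage_events /var/log/syslog" while the filesystem is still healthy (or against a copy of the log taken elsewhere) should then show the ordering of the bus events relative to the errors on sdb.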
Any suggestions as to how to fix this? Is it looking like a card (media) issue, a hardware issue or a software issue?
In case anyone is interested, my "reboot when the filesystem vanishes script" is now:
#!/bin/bash
while : ; do
  sleep 60
  if [ $? != 0 ] ; then
    break
  fi
done
echo Rebooting
echo b > /proc/sysrq-trigger
All attempts to detect whether the filesystem was there failed, but just relying on my call to sleep returning an error code if sleep "vanishes" seems to have worked.
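The same trick can be written a little more generally, as a sketch: loop on any external "probe" command and trigger the reboot as soon as it returns a non-zero status. The names fs_watchdog and wait_for_probe_failure, and the idea of passing the trigger path as an argument, are assumptions made for illustration rather than anything from the script above.

```shell
#!/bin/bash
# Sketch only: function names and argument order are hypothetical.

# Run the probe command repeatedly until it fails. An external command such
# as sleep returns non-zero once bash can no longer find or execute it.
wait_for_probe_failure() {
    while "$@"; do
        :
    done
}

# Usage: fs_watchdog TRIGGER_FILE PROBE_COMMAND [ARGS...]
fs_watchdog() {
    local trigger=$1
    shift
    wait_for_probe_failure "$@"
    echo "Rebooting"
    echo b > "$trigger"   # immediate reboot, no sync
}
```

Called as "fs_watchdog /proc/sysrq-trigger sleep 60" this should behave like the loop above. Note that any non-zero exit from the probe triggers the reboot, including sleep being killed by a signal, so this is no safer than the original in that respect.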
On 12/07/12 15:26, Mark Rogers wrote:
OK, a replacement box has been sent to site, but I'm still investigating this as best I can via SSH.
I have managed to "force" the commit error by just repeatedly running scripts which access the flash, and this time I've done it with "tail -f /var/log/syslog" running in another session; that has given me the following log at time of failure:
Jul 12 14:16:31 mybox kernel: usb 1-1.1: USB disconnect, device number 3
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Unhandled error code
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] Result: hostbyte=0x07 driverbyte=0x00
Jul 12 14:16:31 mybox kernel: sd 0:0:0:1: [sdb] CDB: cdb[0]=0x2a: 2a 00 00 5b b1 50 00 00 08 00
[snip]
Is the card reader on the end of the USB bus (either on board or externally) ? What is usb device number 3 and can you give us the output of lsusb please ?
On 14/07/12 09:56, Wayne Stallwood wrote:
Is the card reader on the end of the USB bus (either on board or externally) ? What is usb device number 3 and can you give us the output of lsusb please ?
To the best of my knowledge, it's connected via onboard USB. (It's definitely not external!)
I'll get the output of lsusb once the box gets back to me, which should be today.
I'm leaning towards a hardware issue at the moment, which gives me limited time to play with it once it lands on my desk, as I need to get it back to NewIT to look at. So if there are any other things to try, just fire them at me and I'll give them a go before it gets sent back out. I'd rather not ship it back to the supplier if there's any chance this is a code issue (whether my own or within the O/S). I only became aware of problems after an apt-get dist-upgrade, but there were several other (software) changes around that time, and of course all of them just meant more writes to the SD card than usual and could have triggered a hardware issue.
On 16/07/12 09:23, Mark Rogers wrote:
I'll get the output of lsusb once the box gets back to me, which should be today.
Output from lsusb -v (for the device in question).
Bus 001 Device 003: ID 05e3:0726 Genesys Logic, Inc. SD Card Reader
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  idVendor           0x05e3 Genesys Logic, Inc.
  idProduct          0x0726 SD Card Reader
  bcdDevice           99.10
  iManufacturer           0
  iProduct                1 USB Storage
  iSerial                 2 000000009910
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           32
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0x80
      (Bus Powered)
    MaxPower              500mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     80 Bulk-Only
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x02  EP 2 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
Device Qualifier (for other device speed):
  bLength                10
  bDescriptorType         6
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  bNumConfigurations      1
Device Status:     0x0000
  (Bus Powered)
On 16/07/12 15:07, Mark Rogers wrote:
On 16/07/12 09:23, Mark Rogers wrote:
I'll get the output of lsusb once the box gets back to me, which should be today.
Output from lsusb -v (for the device in question).
[snip]
Right given the storage controller is the one that drops off the bus and THEN you get block errors on the SD card connected to it....
I'd say most likely a hardware issue...either that or there is some compatibility problem with the SD cards you are using..the USB to SD card bridge is detecting it and failing rather ungracefully by falling off the bus.
Or finally, you may have other things on USB, one of which is loading up the supply so that the storage controller glitches as a result.
Is there anything attached via USB that wasn't when it was in your office on initial tests ?
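As a rough check on the power theory, the MaxPower figures that each device declares in lsusb -v can be added up. This is a sketch with an invented function name (declared_max_power_ma), and it only sums what devices claim in their descriptors, which is not the same as what they actually draw.

```shell
#!/bin/bash
# Sketch only: declared_max_power_ma is a hypothetical helper name.
# Reads "lsusb -v" style output on stdin and prints the sum, in mA, of every
# "MaxPower" value found (declared maximum, not measured current).
declared_max_power_ma() {
    awk '$1 == "MaxPower" { sub(/mA$/, "", $2); total += $2 }
         END { print total + 0 }'
}
```

It could be fed with "lsusb -v 2>/dev/null | declared_max_power_ma" on both a known-good unit and the suspect one; a difference between the two would at least show that something extra is plugged in.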
On 17/07/12 19:05, Wayne Stallwood wrote:
I'd say most likely a hardware issue...either that or there is some compatibility problem with the SD cards you are using..the USB to SD card bridge is detecting it and failing rather ungracefully by falling off the bus.
NewIT have swapped the unit out, having determined that the CPU was reporting an unusually slow clock speed:
SoC: Kirkwood 88F6281_A0
CPU running @ 8Mhz L2 running @ 2Mhz
I haven't been able to work out where he got that information from though?
So I have a new unit here with my original SD card waiting for me to run some tests on it.
Is there anything attached via USB that wasn't when it was in your office on initial tests ?
No, at least not to my knowledge. Maybe someone had plugged their phone into it on site to charge it; they'd never admit it if they had....