Several times over the past few weeks I've had my PC (Ubuntu 16.10*) apparently lock up: desktop stops changing (clock stops updating) but the mouse (usually) continues to move. Sometimes I can SSH in, although more often I can get as far as entering username/password but never reach a terminal prompt.
It can happen while I'm actively using it (mid-sentence), or if I go away and come back it could have locked in my absence, often relatively shortly after it was left (the stopped clock is quite handy...)
I found messages in syslog relating to nouveau so I replaced the graphics card (as a fairly cheap swap out). That made no difference (actually if anything it got worse, although I'd hesitate to say so for sure).
Rather than mess around I bought a new PC and transferred my system disk to it, still figuring it felt like a hardware issue. Except that the problem has followed across onto the new PC. The only things moved across are the new graphics card and three hard disks.
It no longer feels like a hardware issue....
How do I diagnose?
Or am I just looking at a re-install?
* Note that this PC has been upgraded through several versions; it wasn't a clean 16.10 install. The only "major" change recently has been a shift from Chrome to Opera as the primary browser (and since I tend to have dozens of tabs open and most of my work involves a browser in some way, that does count as a major change).
Mark
On Wed, Mar 22, 2017 at 01:42:55PM +0000, Mark Rogers wrote:
Rather than mess around I bought a new PC and transferred my system disk to it, still figuring it felt like a hardware issue. Except that the problem has followed across onto the new PC. The only things moved across are the new graphics card and three hard disks.
I've had this when I had a failing disk that would just stop responding.
Makes it very hard to launch any programs.
Thanks Adam
On 22 March 2017 at 16:22, Adam Bower adam@thebowery.co.uk wrote:
I've had this when I had a failing disk that would just stop responding.
Makes it very hard to launch any programs.
It wasn't launching programs that was the issue though (I do know what you mean). The whole GUI had stopped responding (couldn't drag apps around etc), and the biggest tell was that the clock had stopped updating. The fact that the mouse still moved is what hinted at a graphics card issue, as I can imagine that working at a different level (that and finding references to nouveau in the logs).
On a whim I have installed the nVidia proprietary drivers to see if that helps. We'll see if it becomes more stable. If not I'm going to get better at note taking (eg recording the exact time it locks up from the frozen clock, so I can see exactly what syslog and other logs say for that time). Until I replaced the PC I had just (wrongly!) accepted it was a PC issue and not kept any proper notes.
A lot will depend on whether it locks up tonight I guess...
On Wed, Mar 22, 2017 at 05:47:44PM +0000, Mark Rogers wrote:
It wasn't launching programs that was the issue though (I do know what you mean). The whole GUI had stopped responding (couldn't drag apps around etc), and the biggest tell was that the clock had stopped updating. The fact that the mouse still moved is what hinted at a graphics card issue, as I can imagine that working at a different level (that and finding references to nouveau in the logs).
Right, but if your swap partition is on that disk and you can't read/write any temp files etc. etc. then you get pretty much exactly the symptoms you describe in that everything appears to have stopped working but you can still move the mouse.
Your problem was that you couldn't launch any programs or get back to any of the ones you were running. That you could connect via ssh but could not log in also very much suggests it could be this: your system is still running, it just won't let you do anything that might involve reading from or writing to the file system.
I would suggest that, since you know the system is still running (you can connect via ssh) and the stopped clock gives you a good idea of when it failed, you check whether anything was logged to disk after the failure time. You may want to set up a periodic task to write various information to disk and see if that shows anything up. In fact a simple script writing continually with a sleep every minute with the output of dmesg to a network mounted filesystem or to a port on another computer which has netcat running on it may help show up any more detail about the problem.
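A minimal sketch of the kind of script described above, assuming a POSIX shell. The function names, the log path, the host name and the port are all made up for illustration; the only idea taken from the message is "dump dmesg once a minute to somewhere that is not the suspect disk".

```sh
#!/bin/sh
# Sketch only: periodically append kernel messages to a log that lives
# off the suspect disk. All paths, hosts and ports are hypothetical.

# Append one timestamped snapshot to file $1. Any further arguments are
# the command to capture; the default is the last 50 lines of dmesg.
snapshot_once() {
    dest=$1
    shift
    [ "$#" -gt 0 ] || set -- sh -c 'dmesg | tail -n 50'
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        "$@" 2>&1
    } >> "$dest"
}

# Take a snapshot of dmesg into file $1 every $2 seconds (default 60)
# until interrupted.
snapshot_loop() {
    while :; do
        snapshot_once "$1"
        sleep "${2:-60}"
    done
}

# Usage (not run here; /mnt/nfs is an assumed network mount):
#   snapshot_loop /mnt/nfs/lockup-dmesg.log 60
#
# Netcat alternative: pipe dmesg to a listener on another machine.
# "loghost" and 9999 are placeholders, and listener flags differ
# between netcat variants, so check the local man page.
#   dmesg | nc loghost 9999
```

If the log on the network mount keeps growing after the desktop clock has stopped, the kernel is still alive and the last few snapshots show what it was complaining about.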
Adam
On 22 March 2017 at 21:31, Adam Bower adam@thebowery.co.uk wrote:
Right, but if your swap partition is on that disk and you can't read/write any temp files etc. etc. then you get pretty much exactly the symptoms you describe in that everything appears to have stopped working but you can still move the mouse.
I know the symptoms you mean, and have seen them before; I'm pretty sure that the clock doesn't stop in that scenario though.
However, if it happens again I should be able to check: either it should be logged in syslog or there'll be nothing at all in the logs after that point (if the root partition was also affected and switched to read-only). So it's definitely something I'll check out if it fails again (it didn't fail overnight, but that means nothing).
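The check described here (is anything logged after the time shown on the frozen clock?) could be sketched as below. This assumes the traditional syslog line format ("Mar 22 13:42:55 host ..."); `log_since` is a hypothetical helper name, and the date and time in the usage comment are placeholders rather than a real failure time.

```sh
#!/bin/sh
# Sketch only: show what was logged at or after the moment the clock froze.

# Print lines from a traditional-format syslog file ($1) stamped on
# month $2, day $3, at or after time $4 (HH:MM:SS, 24-hour clock).
log_since() {
    awk -v mon="$2" -v day="$3" -v t="$4" \
        '$1 == mon && $2 + 0 == day + 0 && $3 >= t' "$1"
}

# Usage (not run here; the time is whatever the frozen clock showed):
#   log_since /var/log/syslog Mar 22 13:40:00
#
# On a systemd machine the journal can be queried in a similar way:
#   journalctl --since "2017-03-22 13:40:00"
```

No output at all would be consistent with logging having stopped at the freeze, for example because the root partition went read-only; any output is the place to start reading.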
Note that the new PC only has (for now anyway) 8GB RAM, where the old one had 16GB, so the new one should be making more use of swap and more likely to fail if sudden loss of swap is the problem.
In fact a simple script writing continually with a sleep every minute with the output of dmesg to a network mounted filesystem or to a port on another computer which has netcat running on it may help show up any more detail about the problem.
That's a good suggestion, I'll give that a go if it fails again. (Obviously it makes more sense to do it before it fails, but at this point I'm optimistic that maybe the driver switch may be enough. If it fails again I'll know that it wasn't and that more failures will follow that I can capture.)
I'll also do some disk SMART checks.
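For the SMART checks, a rough sketch assuming the smartmontools package is installed and the disk is an ATA drive. `smart_passed` is a hypothetical helper, and `/dev/sda` is a placeholder to be replaced with each of the three disks in turn.

```sh
#!/bin/sh
# Sketch only: interpret the overall health verdict from smartctl.

# Read `smartctl -H` output on stdin and succeed only if the overall
# health line reports PASSED (the wording smartctl uses for ATA drives).
smart_passed() {
    grep -q 'overall-health self-assessment test result: PASSED'
}

# Usage (not run here; needs root):
#   sudo smartctl -H /dev/sda | smart_passed && echo "health check passed"
#   sudo smartctl -a /dev/sda         # full attribute and error-log dump
#   sudo smartctl -t short /dev/sda   # start a short self-test
```

A PASSED verdict does not rule a disk out, so the attribute table and error log from `smartctl -a` are still worth reading for reallocated or pending sectors.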
On 23 March 2017 at 09:51, Mark Rogers mark@more-solutions.co.uk wrote:
So it's definitely something I'll check out if it fails again (it didn't fail overnight, but that means nothing).
It's the best part of a week since that last comment, and I've been out of the office for most of it. I came back in today and the PC was still running fine.
I'll update if I see the problem again but for the sake of the archives it otherwise looks like in this case it was the nouveau driver causing the issue, and replacing it with the nVidia proprietary driver has fixed it.
I've been lurking and following this because I have had a lock-up too. Not the same though so I decided to stick my head up and post...
A few days ago I came to my desktop PC (Linux Mint 18.1, KDE version) and found the screen black. Power lights on and case fans running but no response to KB or mouse. I hadn't used it for several hours and can't remember what I did last but it being bed time I hit the power switch.
Next day it booted OK and various apps left open were restored and worked, but KDE kept popping up warnings about being unable to write to config files. I should have made a note, but didn't. I assumed the incorrect shutdown the previous evening had corrupted something, so did an orderly shut down and reboot.
Back to Black, nothing on screen at all.
Rebooted with a live distro in a USB port and it worked OK, so not a hardware problem.
Rebooted and this time chose recovery mode. Nothing in the ASH menu seemed helpful so finally I ran FSCK and accepted all the suggestions. Success.
But, a few days later, back to black. This time I do know what I did last. That morning the desktop seemed sluggish. It hadn't been shut down for a while; I've got in the habit of using Suspend because I use the machine irregularly and it's handy having it come to life when I want it. So I set it to reboot while I went for lunch, and didn't go back until after dinner, about 8 hours later. It should have been sat at the login screen, but clearly wasn't.
This time I hit Reset rather than Power. When it came back up I got a simple terminal screen telling me the previous desktop session hadn't shut down properly and advising I switch to another terminal and run <loginctl unlock-sessions>. That seems to have worked OK.
Does anyone know if this odd behaviour stems from me using Suspend? Haven't dared since, it's either on or shutdown now!
-- Phil Thane
www.pthane.co.uk phil@pthane.co.uk 01767 449759 07582 750607 Twitter @pthane
On Wednesday, 29 March 2017 08:56:21 BST Mark Rogers wrote:
[SNIP]
On 29/03/17 09:57, Phil Thane wrote:
[SNIP]
I doubt that it's a suspend problem. I think that if suspend goes wrong, it just won't wake up. Suspend is usually OK though.
I would guess at an intermittent hardware error or lack of disk space. It could be overheating (are the fans working?) or a disk error. Force an fsck again, and check the SMART status info.
Disk space? Use df -h to see if there's enough space on each partition.
Is it overheating due to clogging up with dust, or is there a problem with the graphics card?
Anyway, good luck!
Steve
Thanks for the advice Steve,
FSCK reports no errors.
df -h report:
phil@phil-desktop ~ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            4.0G     0  4.0G   0% /dev
tmpfs           807M  9.3M  798M   2% /run
/dev/sda1       909G   85G  778G  10% /
tmpfs           4.0G   92K  4.0G   1% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs           4.0G     0  4.0G   0% /sys/fs/cgroup
cgmfs           100K     0  100K   0% /run/cgmanager/fs
tmpfs           807M     0  807M   0% /run/user/121
tmpfs           807M   12K  807M   1% /run/user/1000
So plenty of disk space.
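Reading the df table by eye works for a one-off; if the check is going to be repeated, something like the sketch below could flag nearly full filesystems automatically. `full_filesystems` is a hypothetical helper name and the 90% default threshold is an arbitrary assumption.

```sh
#!/bin/sh
# Sketch only: list filesystems at or above a usage threshold.

# Read `df -P` (or `df -h`) output on stdin and print "use% mountpoint"
# for every filesystem at or above the threshold in $1 (default 90).
# Mount points containing spaces are not handled.
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)
        if (use + 0 >= limit + 0) print $5, $6
    }'
}

# Usage (not run here):
#   df -P | full_filesystems 90
# Running out of inodes can also stop files being written even when
# there is free space; `df -i` reports inode usage per filesystem.
```

Run against the table above with the default threshold, nothing would be printed, since the fullest filesystem is / at 10%.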
Fans running and not clogged up.
PC runs fine, but I haven't tried Suspend again...
-- Phil Thane
www.pthane.co.uk phil@pthane.co.uk 01767 449759 07582 750607 Twitter @pthane
On Thursday, 30 March 2017 21:15:31 BST steve-ALUG@hst.me.uk wrote:
[SNIP]
main@lists.alug.org.uk http://www.alug.org.uk/ https://lists.alug.org.uk/mailman/listinfo/main Unsubscribe? See message headers or the web site above!
Hi,
On 31 Mar 19:43, Phil Thane wrote:
[SNIP]
OK - so last time I had this, it actually happened very shortly after logging in to X11. Turned out there were some breakages in my /run directory, which I found out from the backend of syslog / journalctl -f.
After fixing permissions I stopped having the "interesting" fail modes.
Thanks,
On 04/04/17 11:12, Brett Parker wrote:
[SNIP] OK - so last time I had this, it actually happened very shortly after logging in to X11. Turned out there were some breakages in my /run directory, which I found out from the backend of syslog / journalctl -f.
After fixing permissions I stopped having the "interesting" fail modes.
Thanks,
OK, stupid question: Isn't the entirety of /run a tmpfs temporary file system, and as such, just held in memory? Any fixes to any files under /run would just disappear on reboot, no? Or were there configuration problems somewhere else which then caused problems with files under /run?
Cheers Steve
On 05 Apr 16:08, steve-ALUG@hst.me.uk wrote:
On 04/04/17 11:12, Brett Parker wrote:
[SNIP] OK - so last time I had this, it actually happened very shortly after logging in to X11. Turned out there were some breakages in my /run directory, which I found out from the backend of syslog / journalctl -f.
After fixing permissions I stopped having the "interesting" fail modes.
Thanks,
OK, stupid question: Isn't the entirety of /run a tmpfs temporary file system, and as such, just held in memory? Any fixes to any files under /run would just disappear on reboot, no? Or were there configuration problems somewhere else which then caused problems with files under /run?
Sometimes, sometimes not. My laptop didn't have /run that way for a while, and that certainly caused me some headaches.
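Whether /run is a tmpfs on a given machine can be checked rather than assumed. A small sketch, assuming a Linux system with /proc/mounts; `mount_fstype` is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch only: report the filesystem type behind a mount point.

# Print the filesystem type of mount point $1 as listed in a
# mounts-format file ($2, default /proc/mounts). The last matching
# entry wins, since later mounts shadow earlier ones. Prints nothing
# if $1 is not a mount point.
mount_fstype() {
    awk -v mp="$1" '
        $2 == mp { fstype = $3 }
        END { if (fstype != "") print fstype }
    ' "${2:-/proc/mounts}"
}

# Usage (not run here):
#   mount_fstype /run
# util-linux offers the same lookup directly:
#   findmnt -n -o FSTYPE /run
```

If this prints tmpfs, anything fixed by hand under /run is gone after a reboot; if it prints nothing, /run is an ordinary directory on its parent filesystem and stale or wrongly owned files there can survive across boots.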