 
            Hi all
Anyone else experiencing hard lockups since approx December 2023 on Ubuntu / Kubuntu?
I've got the same hardware, and there have been no hardware changes. I suspect a kernel / driver upgrade; the dpkg logs show this entry from 12th December:
2023-12-12 09:47:49 upgrade linux-image-generic:amd64 5.15.0.89.86 5.15.0.91.88
Does this mean that I went from version 5.15.0.89.86 to version 5.15.0.91.88? How does one revert back to test for stability?
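On reading that log entry: as far as I know, an "upgrade" line in /var/log/dpkg.log is laid out as date, time, action, package, previously installed version, new version. Below is a minimal Python sketch that splits the line quoted above under that assumption. The names parse_dpkg_upgrade and DpkgUpgrade are made up for illustration. Note that linux-image-generic is a metapackage, so this shows the metapackage moving on; the kernel images themselves are separate linux-image-5.15.0-NN-generic packages.

```python
from typing import NamedTuple


class DpkgUpgrade(NamedTuple):
    """Fields of one 'upgrade' line (hypothetical helper type)."""
    timestamp: str
    package: str
    arch: str
    old_version: str
    new_version: str


def parse_dpkg_upgrade(line: str) -> DpkgUpgrade:
    """Split one 'upgrade' line from dpkg.log into its fields.

    Assumes the layout: date time action package[:arch] old new.
    """
    date, time, action, pkg, old, new = line.split()
    if action != "upgrade":
        raise ValueError(f"not an upgrade line: {line!r}")
    name, _, arch = pkg.partition(":")
    return DpkgUpgrade(f"{date} {time}", name, arch, old, new)


# The line quoted in the post above.
line = ("2023-12-12 09:47:49 upgrade linux-image-generic:amd64 "
        "5.15.0.89.86 5.15.0.91.88")
entry = parse_dpkg_upgrade(line)
print(f"{entry.package}: {entry.old_version} -> {entry.new_version}")
```

So yes, on that reading the metapackage went from 5.15.0.89.86 to 5.15.0.91.88 on 12th December.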
When the lockup occurs, it seems that all graphics processes freeze, mouse freezes. No blinking lights on the keyboard. I vaguely remember that sometimes I can switch to one of the other terminals (what used to be tty2-6) but nothing seems to respond.
When I tried to remove the newer kernel a couple of weeks ago with the command below, apt wanted to remove lots of other things, so I never proceeded.
sudo apt-get purge linux-image-5.15.0-91-generic
I had journal logging to Konsole back then and saw some messages (I have a photo on my phone) about PCIe Bus Error, BadTLP and BadDLLP, timeouts, uncorrected fatal errors with AER, and nvme0: frozen state.
How does one even know if their NVMe drive is failing? AFAIK, smartctl doesn't work on them in the same way it does on spinning drives. Or is this a kernel bug?
Thanks, Srdjan
 
            On Wed, Jan 31, 2024 at 10:00:38PM +0000, Srdjan Todorovic wrote:
> Hi all
> Anyone else experiencing hard lockups since approx December 2023 on Ubuntu / Kubuntu?
> I've got the same hardware, and there have been no hardware changes. I suspect a kernel / driver upgrade; the dpkg logs show this entry from 12th December:
> 2023-12-12 09:47:49 upgrade linux-image-generic:amd64 5.15.0.89.86 5.15.0.91.88
> Does this mean that I went from version 5.15.0.89.86 to version 5.15.0.91.88? How does one revert back to test for stability?
I'm running Kernel: 5.15.0-91-generic x86_64 on my desktop system which is also a server, no lockups:-
chris@esprimo$ uptime
 07:12:59 up 37 days, 13:46, 3 users, load average: 0.11, 0.16, 0.11
It has two SSDs, one of which is nvme:-
Filesystem     Type 1M-blocks   Used  Avail Use% Mounted on
/dev/nvme0n1p2 ext4     48122  14669  30938  33% /
/dev/nvme0n1p3 ext4    896121 318393 532136  38% /home
/dev/sdb1      ext4    937804 295991 594105  34% /bak
/dev/sda1      ext4      9980    168   9284   2% /boot
/dev/sda2      ext4    109536  64903  39025  63% /scratch
So I don't think there's an obvious kernel bug.
 
            On Wed, Jan 31, 2024 at 10:00:38PM +0000, Srdjan Todorovic wrote:
> How does one even know if their NVMe drive is failing? AFAIK, smartctl doesn't work on them in the same way it does on spinning drives. Or is this a kernel bug?
There's a command nvme or nvme-cli that may help.
https://github.com/linux-nvme/nvme-cli
I'd just install the older kernel, boot that manually and see what happens; it could be a bug introduced in a newer kernel that is affecting your hardware.
Adam
 
            I forgot to mention, the lockups happen more often when graphics is being used, e.g. steam games, obsidian graph view, sometimes YouTube videos.
So I may need to look into the Nvidia drivers too. Yes, tainted kernel.
On Thu, 1 Feb 2024, 09:29 Adam Bower, adam@thebowery.co.uk wrote:
> On Wed, Jan 31, 2024 at 10:00:38PM +0000, Srdjan Todorovic wrote:
> > How does one even know if their NVMe drive is failing? AFAIK, smartctl doesn't work on them in the same way it does on spinning drives. Or is this a kernel bug?
> There's a command nvme or nvme-cli that may help.
> https://github.com/linux-nvme/nvme-cli
> I'd just install the older kernel and boot that manually and see what
> happens, it could be a bug introduced in a newer kernel that is affecting
> your hardware.
> Adam
In terms of installing older kernels, when I tried (I don't understand the Ubuntu way of doing this), apt seemed to want to uninstall a lot of things, and I wasn't sure if I'd have a working system afterwards.
Before December, the machine was incredibly stable no matter what I threw at it.
To unsubscribe send an email to main-leave@lists.alug.org.uk http://www.alug.org.uk/ Unsubscribe? See message headers or the web site above!
 
            On Thu, 1 Feb 2024 at 09:45, Srdjan Todorovic todorovic.s@googlemail.com wrote:
> In terms of installing older kernels, when I tried (I don't understand the Ubuntu way of doing this), apt seemed to want to uninstall a lot of things, and I wasn't sure if I'd have a working system afterwards.
> Before December, the machine was incredibly stable no matter what I threw at it.
Ok so if I try to install the older kernel:
sudo apt-get install linux-image-5.15.0-88-generic
It says:
linux-image-5.15.0-88-generic is already the newest version (5.15.0-88.98).
However, uname lists this as the current running kernel: 5.15.0-94-generic
I am not confident enough with grub / grub2 to know how to make it boot the old one permanently - wasn't it the case that the config files are no longer config files but are now actually scripts?
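On the grub question: my understanding is that /etc/default/grub is still a plain settings file, and it is /boot/grub/grub.cfg that gets generated by the scripts in /etc/grub.d when update-grub runs. GRUB_DEFAULT can be set to a menu entry title, with ">" separating a submenu from the entry inside it. The Python sketch below only builds that value and shows the edit on a sample string; it does not touch any real file. The helper names and the menu titles ("Advanced options for Ubuntu", "Ubuntu, with Linux ...") are assumptions and should be checked against the menuentry lines in the machine's own grub.cfg before use.

```python
def grub_default_for(kernel: str, distro: str = "Ubuntu") -> str:
    """Build a 'submenu>entry' title path for GRUB_DEFAULT.

    The title wording is an assumption; it must match grub.cfg exactly.
    """
    return f"Advanced options for {distro}>{distro}, with Linux {kernel}"


def set_grub_default(config_text: str, value: str) -> str:
    """Return config_text with its GRUB_DEFAULT line replaced (or appended)."""
    new_line = f'GRUB_DEFAULT="{value}"'
    lines = config_text.splitlines()
    for i, existing in enumerate(lines):
        if existing.startswith("GRUB_DEFAULT="):
            lines[i] = new_line
            break
    else:
        # No GRUB_DEFAULT line was present, so add one at the end.
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Sample contents standing in for /etc/default/grub (illustrative only).
sample = "GRUB_DEFAULT=0\nGRUB_TIMEOUT_STYLE=hidden\nGRUB_TIMEOUT=0\n"
updated = set_grub_default(sample, grub_default_for("5.15.0-88-generic"))
print(updated)
```

After making the equivalent change by hand in /etc/default/grub, running sudo update-grub should regenerate grub.cfg. To try the old kernel once without changing anything, picking it from the "Advanced options" submenu at boot should be enough.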
However I did install nvme-cli, and just got this:
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 48 C (321 Kelvin)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 1%
endurance group critical warning summary: 0
data_units_read : 68,600,474
data_units_written : 68,516,189
host_read_commands : 304,422,351
host_write_commands : 428,570,321
controller_busy_time : 2,254
power_cycles : 1,189
power_on_hours : 2,483
unsafe_shutdowns : 20
media_errors : 0
num_err_log_entries : 3,566
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 48 C (321 Kelvin)
Temperature Sensor 2 : 52 C (325 Kelvin)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
Particularly of note is the large number for 'num_err_log_entries'. I'm doing some googling to work out whether this is bad (if anyone knows already, please let me know).
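For anyone wanting to keep an eye on these counters over time, here is a rough Python sketch that pulls the integer fields out of the smart-log text above. Which fields to treat as alarming is my assumption, not something established in this thread: I have picked critical_warning and media_errors as the ones that point at the drive itself, and left num_err_log_entries as informational, since it counts entries in the controller's error log and nvme error-log /dev/nvme0 should show what those entries actually are. The function name and the ALARM_FIELDS tuple are made up for illustration.

```python
def parse_smart_log(text: str) -> dict[str, int]:
    """Return the purely numeric 'key : value' fields of smart-log text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        digits = value.strip().replace(",", "")
        # Skip values with units or extra text, e.g. "48 C (321 Kelvin)".
        if digits.isdigit():
            fields[key.strip()] = int(digits)
    return fields


# A few lines copied from the output in the post above.
sample = """\
critical_warning : 0
temperature : 48 C (321 Kelvin)
unsafe_shutdowns : 20
media_errors : 0
num_err_log_entries : 3,566
"""

# Assumption: these are the fields worth alarming on when non-zero.
ALARM_FIELDS = ("critical_warning", "media_errors")

log = parse_smart_log(sample)
alarms = [name for name in ALARM_FIELDS if log.get(name, 0) != 0]
print("alarms:", alarms)
print("error log entries (informational):", log["num_err_log_entries"])
```

On the numbers posted, neither of the two alarm fields is non-zero, so under the assumption above the error log entries would be the thing to inspect next.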
Helpful pointers appreciated, thanks!
Srdjan


