That Nasty Linux Kernel Lockup Bug Is Still Unresolved

Written by Michael Larabel in Linux Kernel on 18 December 2014 at 10:27 AM EST.
Nearly a month ago, during the Linux 3.18 release candidates, kernel developers uncovered a worrisome regression. Now, with the Linux 3.19 merge window nearly over, that issue has yet to be firmly addressed.

Throughout the Linux 3.18 kernel cycle, and likely affecting Linux 3.17 too, there has been a nasty kernel lock-up issue. It was first widely reported by Red Hat's Dave Jones, who has spent the last several weeks bisecting kernels, testing patches, and trying to figure out the root cause. Other kernel developers have also been able to reproduce the problem and various kernel patches have been proposed, but as of this morning the issue is still present in Git master. As reported a few weeks ago in the last Phoronix article on the matter, it looks like the issue might be related to the Xen code within the Linux kernel.

There have been many mailing list posts in the "frequent lockups in 3.18rc4" thread, but no conclusion. The most recent post by Dave Jones was this morning:
Bah, I was getting all optimistic. I came home this evening to a locked up machine. Serial console had a *lot* more traces than usual though. Full log below. The 12xxx.xxxxxx traces we seemed to recover from, followed by silence for a while, before the real fun begins at 157xx.xxxxxx
...
That's the end of the thread at the time of writing.

Earlier this week, Linus Torvalds was looking at a potentially related issue within the kernel. Linus noted, "there's something funny going on there. Anyway, I've looked at the page fault patch, and I mentioned this last time it came up: there's a nasty possible kernel loop in the 'retry' case if there's also a fatal signal pending, and we're returning to kernel mode rather than returning to user mode." Linus came up with an (untested) patch for that issue and then replied, "So after looking at this more, I'm actually really convinced that this was a pretty nasty bug. I'm *not* convinced that it's necessarily [Dave Jones'] bug, but I still think it could be." Judging by Dave's emails after that point, this wasn't the root problem, but it looks like another bug was squashed in the process, or at least more kernel code was cleaned up.

Stay tuned to Phoronix for any further firm leads on this Linux kernel lockup issue. At least the issue doesn't appear to be too widespread, and I haven't yet encountered it on the many systems that automatically test the Linux kernel daily.