Kernel dumps (Linus Torvalds; Theodore Tso)

From: Theodore Tso <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: source line numbers with x86_64 modules? [Was: Re: [patch]
Date: Sat, 10 Jan 2009 21:16:39 UTC
Message-ID: <fa.TlxGHVyUD5ikshbviJPn+wBFvZY@ifi.uio.no>

On Sat, Jan 10, 2009 at 01:21:06PM -0500, Mike Snitzer wrote:
> > In practice i rarely see bugfixes that were debugged via kdump. Normal
> > oops based fixes outnumber kdump based fixes by a ratio of 100:1 or worse
> > - and kdump is readily available these days - just nobody configures it.
>
> So you're telling me Red Hat doesn't rely on kdump at enterprise
> customer installations?  I find that hard to believe.  Few enterprise
> customers allow defects to be debugged on-site; sometimes collecting a
> crash dump is all you can hope for to make progress.  I have to
> believe you know this fairly well; if not with direct experience then
> through your co-workers?  Or am I living in Ingo's version of Linux
> hell where kdump is actually useful?

In my experience, there are very few kernel versions and hardware for
which kdump works.  I've talked to the people who have to make kdump
work, and every 12-18 months, when a new set of enterprise kernels
comes out, they have to go and fix kdump so it works again for the set
of hardware that they care about, and for the kernel version involved.
Part of the problem is one which has infected nearly every single RAS
technology out there, from kdump to Systemtap, which is that the people who
architect and fund these RAS technologies delude themselves into
thinking that they only have to worry about making it work for
enterprise kernels and enterprise users, and to hell with everyone
else --- specifically, kernel developers, who don't matter since
they aren't enterprise users.  Heck, until July of last year,
Systemtap wouldn't even ***compile*** out of the box on a
non-enterprise distribution like Ubuntu or Debian.  And I still have
yet to make kdump work on a Thinkpad, although I've tried.

Since pretty much no one uses these RAS technologies except enterprise
users, and no one bothers to make it easy for kernel developers,
kernel developers have developed alternate mechanisms for debugging
the Linux kernel --- and they don't involve using Systemtap or kdump,
because in practice, those tools don't work for them at all, or it's
too hard to make them work.

And this becomes a vicious cycle; since no one has bothered to spend
time making RAS technologies work for everyday use by kernel
developers, bitrot inevitably sets in, and so the RAS developers get
no help from other kernel developers, who are busy fixing their own
problems via different means; and so the RAS developers hunker down,
and spend even more time fixing the bitrot and complaining that no one
helps them or takes them seriously, and the problem gets worse and
worse and worse --- until now there are people who are busily
developing alternatives to Systemtap, just because too many RAS
architects and developers had their priorities wrong, and forgot
to focus on everyday kernel developers instead of just enterprise
users.

It's very sad, and it means a lot of investment gets wasted, and work
is getting duplicated as a result.

Oh, well.

							- Ted

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: source line numbers with x86_64 modules? [Was: Re: [patch]
Date: Sat, 10 Jan 2009 23:00:28 UTC
Message-ID: <fa.X9Rn8X8NGvmrgrO+aSMvKR3XCas@ifi.uio.no>

On Sat, 10 Jan 2009, Andi Kleen wrote:
>
> I think that's mostly because kexec from arbitrary context is a
> somewhat unstable concept.

I think that's the understatement of the year.

We have tons of problems with standard suspend-to-ram, and that's when the
suspend sequence has done its best to make everything quiescent. Expecting
that we can reinitialize all the hardware at some random point when things
are going haywire is "optimistic" at best.

So of course it will work on some hardware and not others.

I think we've been fairly successful at keeping a running system for
_most_ of our bugs. Even when things go bad with X running, it's quite
often possible to ssh in over the network (although it's often better if
you were already connected) and see the dump.

Not always, obviously. Many dumps really are painful. I'm hoping that
kernel-mode-setting will at least give us the oops message _more_ of the
time.

As far as I'm concerned, digital cameras have been more useful than kernel
dumps to kernel debugging.

			Linus

From: Theodore Tso <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: source line numbers with x86_64 modules? [Was: Re: [patch]
Date: Sun, 11 Jan 2009 20:46:28 UTC
Message-ID: <fa./H6rYnULBkS9TN29CCG4WMEWPIE@ifi.uio.no>

On Sun, Jan 11, 2009 at 06:11:35PM +0800, Andreas Dilger wrote:
> I'm sad that netconsole/netdump never made it big.  It was fairly useful,
> and extending the eth drivers to add the polling mode was trivial to do.
> We were using that for a few years, but it got replaced by kdump and it
> appears to be less usable IMHO.

The netdump I'm familiar with had the misfeature that it didn't do
packet retransmission, so when it was used on a customer network with
any amount of traffic, packets would get dropped and the crash dump
would utterly fail.  I honestly can't remember which enterprise distro
shipped it, but I can't say I was terribly impressed.  :-(

							- Ted
