BUG() (Linus Torvalds)

Index Home About Blog

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [BK PATCH] USB changes for 2.5.34
Original-Message-ID: <Pine.LNX.4.33.0209091714330.2069-100000@penguin.transmeta.com>
Date: Tue, 10 Sep 2002 00:14:51 GMT
Message-ID: <fa.oaq5dnv.h4i4bb@ifi.uio.no>

Greg, please don't do this

> ChangeSet@1.614, 2002-09-05 08:33:20-07:00, greg@kroah.com
>   USB: storage driver: replace show_trace() with BUG()

that BUG() thing is _way_ out of line, and has killed a few of my machines
several times for no good reason. It actively hurts debuggability, because
the machine is totally dead after it, and the whole and ONLY point of
BUG() messages is to help debugging and make it clear that we can't handle
something.

In this case, we _can_ handle it, and we're much better off with a machine
that works and that you can look up the messages with than killing it.

Rule of thumb: BUG() is only good for something that never happens and
that we really have no other option for (ie state is so corrupt that
continuing is deadly).

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [linux-usb-devel] Re: [BK PATCH] USB changes for 2.5.34
Original-Message-ID: <Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com>
Date: Tue, 10 Sep 2002 01:42:40 GMT
Message-ID: <fa.m5v6d2v.150g9on@ifi.uio.no>

On 10 Sep 2002, Alan Cox wrote:
>
> I'd have thought you may well want the reverse. If the user didnt pick
> the kernel debugging, don't die on software check option you want to
> blow up.

No, thanks. If it's fatal, it's a BUG(), and you blow regardless.

If it isn't fatal, there is no excuse for blowing up. EVER.  That's
_especially_ true for some random user who didn't ask for, and can't
handle debugging. If it's useful information that the developer believes
he wants, it shouldn't be conditional at all.

> If they are debugging or its < 2.6.0-rc1 you want it to show
> the stack and keep going

You definitely want to keep going regardless. A BUG() that takes out the
machine is just not useful, because users who aren't ready to debug it
can't even make any reports except "it stops" (which this one did if you
were under X - the machine was just _dead_).

Basically, with the amount of locking we have, a BUG() or a blow-up just
about anywhere is lethal. Most sequences (especially in drivers, but
inside filesystems etc too) tend to hold spinlocks etc that just makes it
a bad idea to BUG() out unless you really really have to, since the
machine is not likely to survive and be able to write good reports to disk
etc at pretty much any point.

(It used to be that you could take a fault just about anywhere except for
in interrupt handlers, and Linux would try its damndest to clean up and
continue as if nothing had happened. Those days are sadly gone, and
trapping and depending on killing the process seldom works well any more).

On the whole, it's a lot better to just print out a message (and call
traces are often very useful) and continue. That's not always possible, of
course, and a lot of BUG() and BUG_ON() cases are perfectly valid simply
because sometimes there isn't anything you can do except kill the machine
and try to inform the user.

I think the historical kernel behaviour ("trap and kill and continue"
historically worked so fine for _both_ major bugs and for "random sanity
test" cases) has caused us to be a bit lazy about this sometimes.

			Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [linux-usb-devel] Re: [BK PATCH] USB changes for 2.5.34
Original-Message-ID: <Pine.LNX.4.44.0209091842400.1714-100000@home.transmeta.com>
Date: Tue, 10 Sep 2002 01:49:33 GMT
Message-ID: <fa.m7eqdav.17gs80n@ifi.uio.no>

On Mon, 9 Sep 2002, Linus Torvalds wrote:
>
> On the whole, it's a lot better to just print out a message (and call
> traces are often very useful) and continue. That's not always possible, of
> course, and a lot of BUG() and BUG_ON() cases are perfectly valid simply
> because sometimes there isn't anything you can do except kill the machine
> and try to inform the user.

Note that from an implementation standpoint I suspect that a "trap and
continue" thing can easily be pretty much exactly as the current BUG()
with a flag somewhere, say in the "third byte" of the "ud2" instruction.

That would also make it easy to dynamically change the behaviour (ie some
people might want to explicitly make even the "warnings" fatal - a kernel
version of -Werror), and the implementation should be trivial:

#define TRAP_INSTRUCTION( lethal )	\
	__asm__ __volatile__( "ud2\n"	\
			"\t.byte %0\n"	\
			"\t.word %c1\n" \
			"\t.long %c2\n" \
			: :"i" (lethal), "i" (__LINE__), "i" (__FILE__))

and then you have

	#define BUG()	TRAP_INSTRUCTION(1)
	#define WARN()	TRAP_INSTRUCTION(0)

or something like that (where the non-lethal version just increments eip
by 9 to jump over the extended ud2 and the information pointers).

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [BK PATCH] USB changes for 2.5.34
Original-Message-ID: <Pine.LNX.4.44.0209091926320.1911-100000@home.transmeta.com>
Date: Tue, 10 Sep 2002 02:50:44 GMT
Message-ID: <fa.m6v4cqv.14069gm@ifi.uio.no>

Short and sweet:

  If I have a BUG() that I need to remove and replace with a printk()
  (which I did) in order to even be able to debug the dang thing, then
  that BUG()  is a mistake. No ifs, buts or maybe's accepted. The one and
  _only_ point of BUG() is to help debugging, and if it doesn't, then it
  should not be there.

Longer:

On Mon, 9 Sep 2002, Matthew Dharm wrote:
>
> (3) Given that we think we've fixed all of these bad initiators, the BUG()
> really is something that should never happen.

You can't have it both ways.

Either it really never happens.

Or it _does_ happen, in which case you want sane debuggable output.

In the latter case, BUG() is the wrong thing to do, exactly because it
tends to bring the system down.  In the first case it obviously doesn't
matter and isn't even relevant, and the best thing to do would be to
remove it altogether. In neither case is BUG() useful.

I'm personally in X 99% of the time except for the reasonably rare case
when I'm chasing down some bug I know I can reproduce and I want the
kernel to have access to the console.

And I doubt I'm alone in that. I suspect most people who use Linux in any
interesting situation (and no, I don't think servers are very interesting
from most standpoints) tend to do this. Agreed?

That means that the BUG() output won't even be _visible_ to 99% of all
cases. And if it causes the machine to hang, it is now not only invisible,
it cannot even be gotten hold of some other way.

Which basically means that a driver that uses BUG() for something that
isn't certain to be fatal anyway is a BUGGY driver.

If my X session is hung, that's already bad (most people will give up at
that point). If I can't even log in remotely, that's _disastrous_, since
now the BUG() causes me to not to be able to debug it at all. The BUG()
basically made itself irrlevant.

The fact is, BUG() is almost always the wrong thing to do. And it's
almost _guaranteed_ to be the wrong thing to do in a driver, since a
driver won't know what locks the rest of the kernel is holding, _and_
since it's almost always possible for a driver to try to return an error
code instead.

In short:

  Either you want debugging (in which case BUG() is the wrong thing to
  do), or you don't want debugging (in which case BUG() is the wrong thing
  to do). You can choose either, but in neither case is BUG() acceptable.

  BUG() is fine for _fundamental_ problems, where you don't have any other
  choice, and where the machine really is effectively dead anyway. If the
  VM notices that it's lists are corrupt, that's a BUG() thing. We don't
  have much choice. If the scheduler notices that it's running on another
  CPU than it thought it was running on, that's a BUG() thing.

Now, the kernel problem may be that BUG() is a bit too easy to use, and
the alternatives are not. We should fix that. But we shouldn't fix it by
using BUG() in places where it definitely doesn't belong.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [BK PATCH] USB changes for 2.5.34
Original-Message-ID: <Pine.LNX.4.44.0209091951550.1911-100000@home.transmeta.com>
Date: Tue, 10 Sep 2002 03:00:12 GMT
Message-ID: <fa.m7vadiv.170888h@ifi.uio.no>

On Mon, 9 Sep 2002, Linus Torvalds wrote:
>
> Which basically means that a driver that uses BUG() for something that
> isn't certain to be fatal anyway is a BUGGY driver.

There are probably good exceptions to this rule, like any rule.

For example, if the request queue (or some other fundamental internal data
structure) is found to be corrupted, I couldn't really fault a driver for
BUG()'ing out on it. It's not as if the driver could sanely even do an
end_request() in that case.

But even broken hardware is not a reason for a BUG(). For example, if the
driver notices that some part of the controller is hung hard (ie provable
hardware problem), the last thing it should do is to bring the system
down. It should fail the IO, and maybe turn itself off, but let the system
continue otherwise.

One of the best things I ever did from a debuggability standpoint was to
almost never use panic() in the base kernel, and make various kernel page
faults etc just try to kill the offender and go on.

Sometimes that "try to continue" approach ends up being nasty too (the
problem repeats and you end up with a dead machine that scrolls infinite
bug reports off the screen, making them really hard to catch), but on the
whole it tends to make things much easier to debug ("oops, I just lost my
floppy drive, let's save the messages on the harddisk and reboot").

			Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [linux-usb-devel] Re: [BK PATCH] USB changes for 2.5.34
Original-Message-ID: <Pine.LNX.4.44.0209100947481.2842-100000@home.transmeta.com>
Date: Tue, 10 Sep 2002 16:52:02 GMT
Message-ID: <fa.m7feeiv.16gg98t@ifi.uio.no>

On Tue, 10 Sep 2002, David Brownell wrote:
> > In short:
> >
> >   Either you want debugging (in which case BUG() is the wrong thing to
> >   do), or you don't want debugging (in which case BUG() is the wrong thing
> >   to do). You can choose either, but in neither case is BUG() acceptable.
>
> Or in even shorter sound bite format:  "Just say no to BUG()s."

Well, the thing is, BUG() _is_ sometimes useful. It's a dense and very
convenient way to say that something catastrophic happened.

And actually, outside of drivers and filesystems you can often know (or
control) the number of locks the surrounding code is holding, and then a
BUG() may not be as lethal. At which point the normal "oops and kill the
process" action is clearly fine - the machine is still perfectly usable.

(In fact, on UP a BUG() tends to be quite usable just about anywhere
except in an interrupt handler: there may be some local locks like
directory semaphores etc that are held and not released, but _most_ of the
time the machine is quite usable. SMP really does make things harder to
debug even quite apart from the races it introduces. Sad.)

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [linux-usb-devel] Re: [BK PATCH] USB changes for 2.5.34
Original-Message-ID: <Pine.LNX.4.44.0209101132320.3280-100000@home.transmeta.com>
Date: Tue, 10 Sep 2002 18:41:51 GMT
Message-ID: <fa.m7u6f2v.1500boq@ifi.uio.no>

On Tue, 10 Sep 2002, David S. Miller wrote:
>
>    IMO we should have ASSERT() and OHSHIT(),
>
> I fully support the addition of an OHSHIT() macro.

Oh, please no. We'd end up with endless asserts in the networking layer,
just because David would find it amusing.

I can just see it now - code bloat hell.

And no, I still don't like ASSERT().

I think the approach should clearly spell what the trouble level is:

	DEBUG(x != y, "x=%d, y=%d\n", x, y);

	WARN(x != y, "crap happens: x=%d y=%d\n", x, y);

	FATAL(x != y, "Aiee: x=%d y=%d\n", x, y);

where the DEBUG one gets compiled out normally (or has some nice per-file
way of being enabled/disabled - a perfect world would expose the on/off in
devicefs as a per-file entity when kernel debugging is on), WARN continues
but writes a message (and normally does _not_ get compiled out), and FATAL
is like our current BUG_ON().

All would print out the filename and line number, the message, and the
backtrace.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [linux-usb-devel] Re: [BK PATCH] USB changes for 2.5.34
Original-Message-ID: <Pine.LNX.4.44.0209101236270.7106-100000@home.transmeta.com>
Date: Tue, 10 Sep 2002 19:40:47 GMT
Message-ID: <fa.m9egd2v.17ga9oo@ifi.uio.no>

On Tue, 10 Sep 2002, Oliver Xymoron wrote:
>
> Which still leaves the question, does it really make sense for
> FATAL/BUG to forcibly kill the machine?

No. It should only be "locally fatal", and it should clearly just do what
BUG() does now - kill the process.

But that implies very much that you really cannot use FATAL() in general
at all, since it would be illegal to use whenever some caller holds some
non-local locks (which is almost always the case for most "peripheral
code").

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [linux-usb-devel] Re: [BK PATCH] USB changes for 2.5.34
Original-Message-ID: <Pine.LNX.4.44.0209101156510.7106-100000@home.transmeta.com>
Date: Tue, 10 Sep 2002 19:03:56 GMT
Message-ID: <fa.mau2div.140088o@ifi.uio.no>

On 10 Sep 2002, Alan Cox wrote:
>
> It drops you politely into the kernel debugger, you fix up the values
> and step over it. If you want to debug with zen mind power and printk
> feel free. For the rest of us BUG() is fine on SMP

Ok, a show of hands..

Of the millions (whatever) of Linux machines, how many have a kernel
debugger attached? Count them.

In other words, if a user is faced with a dead machine with no other way
to even know what BUG() triggered than to try to set up a cross debugger,
just how useful is that BUG()? I claim it is pretty useless - simply
because 99+% of all people won't even make a bug report in that case,
they'll just push the reset button and consider Linux unreliable.

In other news, the approach that shows up in the kernel logs might just
eventually be noticed and acted upon (especially if the machine acts
strange and kills processes).

So I claim a BUG() that locks up the machine is useless. If the user can't
just run ksymoops and email out the BUG message, that BUG() is _not_ fine
on SMP.

It has nothing to do with zen mind power or printk's. It has everything to
do with the fact that a working machine is about a million times easier to
debug on than a dead one, _and_ is a lot more likely to get acted upon by
most users.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: What policy for BUG_ON()?
Original-Message-ID: <Pine.LNX.4.58.0408301718040.2295@ppc970.osdl.org>
Date: Tue, 31 Aug 2004 00:27:34 GMT
Message-ID: <fa.gsea069.a24ohr@ifi.uio.no>

On Mon, 30 Aug 2004, Adrian Bunk wrote:
>
> Let me try to summarize the different options regarding BUG_ON,
> concerning whether the argument to BUG_ON might contain side effects,
> and whether it should be allowed in some "do this only if you _really_
> know what you are doing" situations to let BUG_ON do nothing.
>
> Options:
> 1. BUG_ON must not be defined to do nothing
> 1a. side effects are allowed in the argument of BUG_ON
> 1b. side effects are not allowed in the argument of BUG_ON
> 2. BUG_ON is allowed to be defined to do nothing
> 2a. side effects are allowed in the argument of BUG_ON
> 2b. side effects are not allowed in the argument of BUG_ON
>
> It would be good if there was a decision which of the four choices
> should become documented policy.

I'd suggest we strongly discourage side-effects in BUG_ON().

That said, it might be safest to just go for 1b - we make side-effects of
BUG_ON() be _documented_ as a bug, but just for safety, I'd suggest doing

	#define BUG_ON(x) (void)(x)

anyway, if somebody wants to compile without debugging. That will still
make the side-effects happen if somebody has them (and if there are none,
the compiler will not generate any code anyway).

In other words: nobody should _depend_ on the side effects, and we should
flame people who have them. A quick grep shows this:

	mm/slab.c:      BUG_ON(spin_trylock(&cachep->spinlock));

which makes no real sense - it looks like slab really wants to use
"spin_is_locked()".

I could make some sparse extension that warns about side effects. I
already calculate the "side-effectiveness" of an expression for other
reasons..

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: What policy for BUG_ON()?
Original-Message-ID: <Pine.LNX.4.58.0408310945580.2295@ppc970.osdl.org>
Date: Tue, 31 Aug 2004 16:54:53 GMT
Message-ID: <fa.guem0u7.820ppn@ifi.uio.no>

On Tue, 31 Aug 2004, Albert Cahalan wrote:
>
> The normal expectation for non-debug builds
> would be this:
>
> #define BUG_ON(x)

No, this is bad, for one big reason: it generates compiler warnings if 'x'
happens to be the only thing that uses some value. You get things like
"unused variable 'mode'" etc in perfectly good code.

For example, the code might look something like

	int count = 0;

	for_each_online_cpu(cpu) {
		... do something ..
		.. update count ..
	}

	BUG_ON(!count);

and if you now compile on UP and with debugging enabled, the compiler
might complain that you're computing "count" but never _using_ the value.

This is generally why you should have macros like this not become empty,
but become something that the compiler can compile away. Which is why I'd
much rather see

	#define BUG_ON(x) (void)(x)

regardless of any side-effect issues - it's a way to let the compiler
optimize the thing away, but still show that something was used at least
"conceptually"..

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [2/3] 2.6.22-rc2: known regressions v2
Date: Thu, 24 May 2007 19:51:58 UTC
Message-ID: <fa.gaD/+HhjslaxEsxVKvydm+VFVAo@ifi.uio.no>

On Thu, 24 May 2007, Ingo Molnar wrote:
>
> i very much agree that this kmalloc_index() one shouldnt be called a
> "BUG: ", but if you look at the majority of WARN_ON() instances they are
> checks for clear, serious kernel bugs.

I _still_ disagree.

There's a huge difference between "You killed my father, prepare to die",
and "Btw, I didn't like that, but I'll just continue".

And that's the difference between BUG_ON() and WARN_ON().

And dammit, the kernel message should make that CLEAR. It's totally
idiotic to call both "BUG". One is a clear BUG, the other is a "uhhuh,
something unexpected happened, but I know how to continue".

So stop this idiotic "we should call them both the same". If they actually
were the same, they'd both be called BUG_ON().

		Linus

Index Home About Blog