Newsgroups: fa.linux.kernel From: Linus Torvalds <torvalds@transmeta.com> Subject: Re: [BK PATCH] USB changes for 2.5.34 Original-Message-ID: <Pine.LNX.4.33.0209091714330.2069-100000@penguin.transmeta.com> Date: Tue, 10 Sep 2002 00:14:51 GMT Message-ID: <fa.oaq5dnv.h4i4bb@ifi.uio.no> Greg, please don't do this > ChangeSet@1.614, 2002-09-05 08:33:20-07:00, greg@kroah.com > USB: storage driver: replace show_trace() with BUG() that BUG() thing is _way_ out of line, and has killed a few of my machines several times for no good reason. It actively hurts debuggability, because the machine is totally dead after it, and the whole and ONLY point of BUG() messages is to help debugging and make it clear that we can't handle something. In this case, we _can_ handle it, and we're much better off with a machine that works and that you can look up the messages with than killing it. Rule of thumb: BUG() is only good for something that never happens and that we really have no other option for (ie state is so corrupt that continuing is deadly). Linus Newsgroups: fa.linux.kernel From: Linus Torvalds <torvalds@transmeta.com> Subject: Re: [linux-usb-devel] Re: [BK PATCH] USB changes for 2.5.34 Original-Message-ID: <Pine.LNX.4.44.0209091832160.1714-100000@home.transmeta.com> Date: Tue, 10 Sep 2002 01:42:40 GMT Message-ID: <fa.m5v6d2v.150g9on@ifi.uio.no> On 10 Sep 2002, Alan Cox wrote: > > I'd have thought you may well want the reverse. If the user didnt pick > the kernel debugging, don't die on software check option you want to > blow up. No, thanks. If it's fatal, it's a BUG(), and you blow regardless. If it isn't fatal, there is no excuse for blowing up. EVER. That's _especially_ true for some random user who didn't ask for, and can't handle debugging. If it's useful information that the developer believes he wants, it shouldn't be conditional at all. > If they are debugging or its < 2.6.0-rc1 you want it to show > the stack and keep going You definitely want to keep going regardless. A BUG() that takes out the machine is just not useful, because users who aren't ready to debug it can't even make any reports except "it stops" (which this one did if you were under X - the machine was just _dead_). Basically, with the amount of locking we have, a BUG() or a blow-up just about anywhere is lethal. Most sequences (especially in drivers, but inside filesystems etc too) tend to hold spinlocks etc that just makes it a bad idea to BUG() out unless you really really have to, since the machine is not likely to survive and be able to write good reports to disk etc at pretty much any point. (It used to be that you could take a fault just about anywhere except for in interrupt handlers, and Linux would try its damndest to clean up and continue as if nothing had happened. Those days are sadly gone, and trapping and depending on killing the process seldom works well any more). On the whole, it's a lot better to just print out a message (and call traces are often very useful) and continue. That's not always possible, of course, and a lot of BUG() and BUG_ON() cases are perfectly valid simply because sometimes there isn't anything you can do except kill the machine and try to inform the user. I think the historical kernel behaviour ("trap and kill and continue" historically worked so fine for _both_ major bugs and for "random sanity test" cases) has caused us to be a bit lazy about this sometimes. Linus Newsgroups: fa.linux.kernel From: Linus Torvalds <torvalds@transmeta.com> Subject: Re: [linux-usb-devel] Re: [BK PATCH] USB changes for 2.5.34 Original-Message-ID: <Pine.LNX.4.44.0209091842400.1714-100000@home.transmeta.com> Date: Tue, 10 Sep 2002 01:49:33 GMT Message-ID: <fa.m7eqdav.17gs80n@ifi.uio.no> On Mon, 9 Sep 2002, Linus Torvalds wrote: > > On the whole, it's a lot better to just print out a message (and call > traces are often very useful) and continue. That's not always possible, of > course, and a lot of BUG() and BUG_ON() cases are perfectly valid simply > because sometimes there isn't anything you can do except kill the machine > and try to inform the user. Note that from an implementation standpoint I suspect that a "trap and continue" thing can easily be pretty much exactly as the current BUG() with a flag somewhere, say in the "third byte" of the "ud2" instruction. That would also make it easy to dynamically change the behaviour (ie some people might want to explicitly make even the "warnings" fatal - a kernel version of -Werror), and the implementation should be trivial: #define TRAP_INSTRUCTION( lethal ) \ __asm__ __volatile__( "ud2\n" \ "\t.byte %0\n" \ "\t.word %c1\n" \ "\t.long %c2\n" \ : :"i" (lethal), "i" (__LINE__), "i" (__FILE__)) and then you have #define BUG() TRAP_INSTRUCTION(1) #define WARN() TRAP_INSTRUCTION(0) or something like that (where the non-lethal version just increments eip by 9 to jump over the extended ud2 and the information pointers). Linus Newsgroups: fa.linux.kernel From: Linus Torvalds <torvalds@transmeta.com> Subject: Re: [BK PATCH] USB changes for 2.5.34 Original-Message-ID: <Pine.LNX.4.44.0209091926320.1911-100000@home.transmeta.com> Date: Tue, 10 Sep 2002 02:50:44 GMT Message-ID: <fa.m6v4cqv.14069gm@ifi.uio.no> Short and sweet: If I have a BUG() that I need to remove and replace with a printk() (which I did) in order to even be able to debug the dang thing, then that BUG() is a mistake. No ifs, buts or maybe's accepted. The one and _only_ point of BUG() is to help debugging, and if it doesn't, then it should not be there. Longer: On Mon, 9 Sep 2002, Matthew Dharm wrote: > > (3) Given that we think we've fixed all of these bad initiators, the BUG() > really is something that should never happen. You can't have it both ways. Either it really never happens. Or it _does_ happen, in which case you want sane debuggable output. In the latter case, BUG() is the wrong thing to do, exactly because it tends to bring the system down. In the first case it obviously doesn't matter and isn't even relevant, and the best thing to do would be to remove it altogether. In neither case is BUG() useful. I'm personally in X 99% of the time except for the reasonably rare case when I'm chasing down some bug I know I can reproduce and I want the kernel to have access to the console. And I doubt I'm alone in that. I suspect most people who use Linux in any interesting situation (and no, I don't think servers are very interesting from most standpoints) tend to do this. Agreed? That means that the BUG() output won't even be _visible_ to 99% of all cases. And if it causes the machine to hang, it is now not only invisible, it cannot even be gotten hold of some other way. Which basically means that a driver that uses BUG() for something that isn't certain to be fatal anyway is a BUGGY driver. If my X session is hung, that's already bad (most people will give up at that point). If I can't even log in remotely, that's _disastrous_, since now the BUG() causes me to not to be able to debug it at all. The BUG() basically made itself irrlevant. The fact is, BUG() is almost always the wrong thing to do. And it's almost _guaranteed_ to be the wrong thing to do in a driver, since a driver won't know what locks the rest of the kernel is holding, _and_ since it's almost always possible for a driver to try to return an error code instead. In short: Either you want debugging (in which case BUG() is the wrong thing to do), or you don't want debugging (in which case BUG() is the wrong thing to do). You can choose either, but in neither case is BUG() acceptable. BUG() is fine for _fundamental_ problems, where you don't have any other choice, and where the machine really is effectively dead anyway. If the VM notices that it's lists are corrupt, that's a BUG() thing. We don't have much choice. If the scheduler notices that it's running on another CPU than it thought it was running on, that's a BUG() thing. Now, the kernel problem may be that BUG() is a bit too easy to use, and the alternatives are not. We should fix that. But we shouldn't fix it by using BUG() in places where it definitely doesn't belong. Linus Newsgroups: fa.linux.kernel From: Linus Torvalds <torvalds@transmeta.com> Subject: Re: [BK PATCH] USB changes for 2.5.34 Original-Message-ID: <Pine.LNX.4.44.0209091951550.1911-100000@home.transmeta.com> Date: Tue, 10 Sep 2002 03:00:12 GMT Message-ID: <fa.m7vadiv.170888h@ifi.uio.no> On Mon, 9 Sep 2002, Linus Torvalds wrote: > > Which basically means that a driver that uses BUG() for something that > isn't certain to be fatal anyway is a BUGGY driver. There are probably good exceptions to this rule, like any rule. For example, if the request queue (or some other fundamental internal data structure) is found to be corrupted, I couldn't really fault a driver for BUG()'ing out on it. It's not as if the driver could sanely even do an end_request() in that case. But even broken hardware is not a reason for a BUG(). For example, if the driver notices that some part of the controller is hung hard (ie provable hardware problem), the last thing it should do is to bring the system down. It should fail the IO, and maybe turn itself off, but let the system continue otherwise. One of the best things I ever did from a debuggability standpoint was to almost never use panic() in the base kernel, and make various kernel page faults etc just try to kill the offender and go on. Sometimes that "try to continue" approach ends up being nasty too (the problem repeats and you end up with a dead machine that scrolls infinite bug reports off the screen, making them really hard to catch), but on the whole it tends to make things much easier to debug ("oops, I just lost my floppy drive, let's save the messages on the harddisk and reboot"). Linus Newsgroups: fa.linux.kernel From: Linus Torvalds <torvalds@transmeta.com> Subject: Re: [linux-usb-devel] Re: [BK PATCH] USB changes for 2.5.34 Original-Message-ID: <Pine.LNX.4.44.0209100947481.2842-100000@home.transmeta.com> Date: Tue, 10 Sep 2002 16:52:02 GMT Message-ID: <fa.m7feeiv.16gg98t@ifi.uio.no> On Tue, 10 Sep 2002, David Brownell wrote: > > In short: > > > > Either you want debugging (in which case BUG() is the wrong thing to > > do), or you don't want debugging (in which case BUG() is the wrong thing > > to do). You can choose either, but in neither case is BUG() acceptable. > > Or in even shorter sound bite format: "Just say no to BUG()s." Well, the thing is, BUG() _is_ sometimes useful. It's a dense and very convenient way to say that something catastrophic happened. And actually, outside of drivers and filesystems you can often know (or control) the number of locks the surrounding code is holding, and then a BUG() may not be as lethal. At which point the normal "oops and kill the process" action is clearly fine - the machine is still perfectly usable. (In fact, on UP a BUG() tends to be quite usable just about anywhere except in an interrupt handler: there may be some local locks like directory semaphores etc that are held and not released, but _most_ of the time the machine is quite usable. SMP really does make things harder to debug even quite apart from the races it introduces. Sad.) Linus Newsgroups: fa.linux.kernel From: Linus Torvalds <torvalds@transmeta.com> Subject: Re: [linux-usb-devel] Re: [BK PATCH] USB changes for 2.5.34 Original-Message-ID: <Pine.LNX.4.44.0209101132320.3280-100000@home.transmeta.com> Date: Tue, 10 Sep 2002 18:41:51 GMT Message-ID: <fa.m7u6f2v.1500boq@ifi.uio.no> On Tue, 10 Sep 2002, David S. Miller wrote: > > IMO we should have ASSERT() and OHSHIT(), > > I fully support the addition of an OHSHIT() macro. Oh, please no. We'd end up with endless asserts in the networking layer, just because David would find it amusing. I can just see it now - code bloat hell. And no, I still don't like ASSERT(). I think the approach should clearly spell what the trouble level is: DEBUG(x != y, "x=%d, y=%d\n", x, y); WARN(x != y, "crap happens: x=%d y=%d\n", x, y); FATAL(x != y, "Aiee: x=%d y=%d\n", x, y); where the DEBUG one gets compiled out normally (or has some nice per-file way of being enabled/disabled - a perfect world would expose the on/off in devicefs as a per-file entity when kernel debugging is on), WARN continues but writes a message (and normally does _not_ get compiled out), and FATAL is like our current BUG_ON(). All would print out the filename and line number, the message, and the backtrace. Linus Newsgroups: fa.linux.kernel From: Linus Torvalds <torvalds@transmeta.com> Subject: Re: [linux-usb-devel] Re: [BK PATCH] USB changes for 2.5.34 Original-Message-ID: <Pine.LNX.4.44.0209101236270.7106-100000@home.transmeta.com> Date: Tue, 10 Sep 2002 19:40:47 GMT Message-ID: <fa.m9egd2v.17ga9oo@ifi.uio.no> On Tue, 10 Sep 2002, Oliver Xymoron wrote: > > Which still leaves the question, does it really make sense for > FATAL/BUG to forcibly kill the machine? No. It should only be "locally fatal", and it should clearly just do what BUG() does now - kill the process. But that implies very much that you really cannot use FATAL() in general at all, since it would be illegal to use whenever some caller holds some non-local locks (which is almost always the case for most "peripheral code"). Linus Newsgroups: fa.linux.kernel From: Linus Torvalds <torvalds@transmeta.com> Subject: Re: [linux-usb-devel] Re: [BK PATCH] USB changes for 2.5.34 Original-Message-ID: <Pine.LNX.4.44.0209101156510.7106-100000@home.transmeta.com> Date: Tue, 10 Sep 2002 19:03:56 GMT Message-ID: <fa.mau2div.140088o@ifi.uio.no> On 10 Sep 2002, Alan Cox wrote: > > It drops you politely into the kernel debugger, you fix up the values > and step over it. If you want to debug with zen mind power and printk > feel free. For the rest of us BUG() is fine on SMP Ok, a show of hands.. Of the millions (whatever) of Linux machines, how many have a kernel debugger attached? Count them. In other words, if a user is faced with a dead machine with no other way to even know what BUG() triggered than to try to set up a cross debugger, just how useful is that BUG()? I claim it is pretty useless - simply because 99+% of all people won't even make a bug report in that case, they'll just push the reset button and consider Linux unreliable. In other news, the approach that shows up in the kernel logs might just eventually be noticed and acted upon (especially if the machine acts strange and kills processes). So I claim a BUG() that locks up the machine is useless. If the user can't just run ksymoops and email out the BUG message, that BUG() is _not_ fine on SMP. It has nothing to do with zen mind power or printk's. It has everything to do with the fact that a working machine is about a million times easier to debug on than a dead one, _and_ is a lot more likely to get acted upon by most users. Linus Newsgroups: fa.linux.kernel From: Linus Torvalds <torvalds@osdl.org> Subject: Re: What policy for BUG_ON()? Original-Message-ID: <Pine.LNX.4.58.0408301718040.2295@ppc970.osdl.org> Date: Tue, 31 Aug 2004 00:27:34 GMT Message-ID: <fa.gsea069.a24ohr@ifi.uio.no> On Mon, 30 Aug 2004, Adrian Bunk wrote: > > Let me try to summarize the different options regarding BUG_ON, > concerning whether the argument to BUG_ON might contain side effects, > and whether it should be allowed in some "do this only if you _really_ > know what you are doing" situations to let BUG_ON do nothing. > > Options: > 1. BUG_ON must not be defined to do nothing > 1a. side effects are allowed in the argument of BUG_ON > 1b. side effects are not allowed in the argument of BUG_ON > 2. BUG_ON is allowed to be defined to do nothing > 2a. side effects are allowed in the argument of BUG_ON > 2b. side effects are not allowed in the argument of BUG_ON > > It would be good if there was a decision which of the four choices > should become documented policy. I'd suggest we strongly discourage side-effects in BUG_ON(). That said, it might be safest to just go for 1b - we make side-effects of BUG_ON() be _documented_ as a bug, but just for safety, I'd suggest doing #define BUG_ON(x) (void)(x) anyway, if somebody wants to compile without debugging. That will still make the side-effects happen if somebody has them (and if there are none, the compiler will not generate any code anyway). In other words: nobody should _depend_ on the side effects, and we should flame people who have them. A quick grep shows this: mm/slab.c: BUG_ON(spin_trylock(&cachep->spinlock)); which makes no real sense - it looks like slab really wants to use "spin_is_locked()". I could make some sparse extension that warns about side effects. I already calculate the "side-effectiveness" of an expression for other reasons.. Linus Newsgroups: fa.linux.kernel From: Linus Torvalds <torvalds@osdl.org> Subject: Re: What policy for BUG_ON()? Original-Message-ID: <Pine.LNX.4.58.0408310945580.2295@ppc970.osdl.org> Date: Tue, 31 Aug 2004 16:54:53 GMT Message-ID: <fa.guem0u7.820ppn@ifi.uio.no> On Tue, 31 Aug 2004, Albert Cahalan wrote: > > The normal expectation for non-debug builds > would be this: > > #define BUG_ON(x) No, this is bad, for one big reason: it generates compiler warnings if 'x' happens to be the only thing that uses some value. You get things like "unused variable 'mode'" etc in perfectly good code. For example, the code might look something like int count = 0; for_each_online_cpu(cpu) { ... do something .. .. update count .. } BUG_ON(!count); and if you now compile on UP and with debugging enabled, the compiler might complain that you're computing "count" but never _using_ the value. This is generally why you should have macros like this not become empty, but become something that the compiler can compile away. Which is why I'd much rather see #define BUG_ON(x) (void)(x) regardless of any side-effect issues - it's a way to let the compiler optimize the thing away, but still show that something was used at least "conceptually".. Linus From: Linus Torvalds <torvalds@linux-foundation.org> Newsgroups: fa.linux.kernel Subject: Re: [2/3] 2.6.22-rc2: known regressions v2 Date: Thu, 24 May 2007 19:51:58 UTC Message-ID: <fa.gaD/+HhjslaxEsxVKvydm+VFVAo@ifi.uio.no> On Thu, 24 May 2007, Ingo Molnar wrote: > > i very much agree that this kmalloc_index() one shouldnt be called a > "BUG: ", but if you look at the majority of WARN_ON() instances they are > checks for clear, serious kernel bugs. I _still_ disagree. There's a huge difference between "You killed my father, prepare to die", and "Btw, I didn't like that, but I'll just continue". And that's the difference between BUG_ON() and WARN_ON(). And dammit, the kernel message should make that CLEAR. It's totally idiotic to call both "BUG". One is a clear BUG, the other is a "uhhuh, something unexpected happened, but I know how to continue". So stop this idiotic "we should call them both the same". If they actually were the same, they'd both be called BUG_ON(). Linus |
|