Regression tracking (Linus Torvalds)

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: RFC: PCI quirks update for 2.6.16
Date: Mon, 11 Dec 2006 01:22:32 UTC
Message-ID: <fa.SpTiSvkolita6Nc5oa2Q2oNboHo@ifi.uio.no>

On Sun, 10 Dec 2006, Chris Wedgwood wrote:
>
> Well, it's not clear to me that reverting to a quirk that pokes *all*
> VIA pci devices on all machines is safe, it's not even clear if it was
> a good idea to merge this.

I'm just saying that the stable tree should never merge anything that can
possibly cause a regression.

> Well, I think the current 2.6.16.x release series is already broken on
> some other subset of hardware.

That's not the point. If it was broken on some subset of hardware, as long
as it's not a REGRESSION from 2.6.16, that's better than _changing_ the
breakage. And no, it doesn't really matter how many machines are affected
(ie it's not better to have a "smaller" set of cases that break, unless
it's a _strict_ subset).
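
[Editor's sketch, not part of the original message. It illustrates the
subset rule above under the assumption that each machine can be
represented by a plain string label. The function names and the sample
labels are hypothetical.]

```python
def introduces_regression(broken_before, broken_after):
    """True if any machine that worked before the patch is broken after it."""
    return bool(set(broken_after) - set(broken_before))


def is_strict_improvement(broken_before, broken_after):
    """True only if the new breakage is a proper (strict) subset of the old."""
    return set(broken_after) < set(broken_before)


# Hypothetical machines that were already broken before the patch.
before = {"via-a", "via-b", "via-c"}

# Fewer machines break, but "via-d" used to work: still a regression.
smaller_but_different = {"via-d"}

# Only previously broken machines remain broken: a real improvement.
strict_subset = {"via-a"}
```

[With these sets, the "smaller" case counts as a regression even though
it breaks one machine instead of three, while the strict-subset case
does not.]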

The reason? It's better to be _dependable_ than to work on a maximum
number of machines. This is why _regressions_ are always much worse than
old bugs. It's much better to have "it didn't work before, and it still
doesn't work" than to have "it used to work, but now it broke".

Because people for whom something used to work should always be able to
update to a new kernel without having to constantly worry.

So for the _stable_ series, if you don't understand the problem 100%, and
you don't know that something really fixes something and never causes
regressions, such a patch simply SHOULD NOT be applied. It's that easy.

(And the argument that it "fixes more than it breaks" is a total garbage
argument for several reasons:

 - you don't actually know that. You may have a lot of reports about
   breakage that you think will be fixed (so you _think_ it fixes a lot),
   but by definition you won't have any clue AT ALL about how much it will
   break, since nobody will have tested it. The machines that weren't
   broken before generally won't even bother to upgrade, so you'll find
   out only much later.

 - machines that didn't use to work well before are much less important
   than machines that worked fine. People don't _expect_ them to work,
   people don't have a history of them working. So if you fix ten machines
   that didn't work before, but you break one that _did_ work before,
   that's _still_ not actually a good deal. Because angst-wise, you
   actually lost on it.

So please revert anything that is even slightly open for debate in the
stable series. The whole point of the stable series is to be _stable_, and
regressions are bad.

			Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: RFC: PCI quirks update for 2.6.16
Date: Mon, 11 Dec 2006 01:28:42 UTC
Message-ID: <fa.d46fNNOSE0aB5/Rw74HZT9tmmLw@ifi.uio.no>

On Mon, 11 Dec 2006, Adrian Bunk wrote:
>
> If life was that easy...  ;-)

No. Life _is_ that easy.

If the 2.6.16 stable tree took a patch that was questionable, and we don't
know what the right answer to it is from the _regular_ tree, then the
patch violated the stable tree rules in the first place and should just be
reverted.

Once people know what the right answer is (and by "know", I mean: "not
guess") from the regular tree having been tested with it, and people
understanding the problem, then it can be re-instated.

But if you're just guessing, and people don't _know_ the right answer,
then just revert the whole questionable area.  The patch shouldn't have
been there in the first place.

It really _is_ that simple.

Either it's a stable tree or it isn't.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.21
Date: Thu, 26 Apr 2007 15:48:00 UTC
Message-ID: <fa.Ocu+hJQs7/Ll+UnEXTnGHg/IDWk@ifi.uio.no>

On Thu, 26 Apr 2007, Adrian Bunk wrote:
>
> There is a conflict between Linus trying to release kernels every
> 2 months and releasing with few regressions.

No.

Regressions _increase_ with longer release cycles. They don't get fewer.

The fact is, we have a -stable series for a reason. The reason is that the
normal development kernel can work in three ways:

 (a) long release cycles, with two subcases:
	(a1) huge changes (ie a "long development series"). This is what we
	     used to have. There's no way to even track the regressions,
	     because things just change too much.
	(a2) keep the development limited, just stretch out the
	     "stabilization phase". This simply *does*not*work*. You might
	     want it to work, but it's against human psychology. People
	     get bored, and start wasting their time discussing esoteric
	     scheduler issues which weren't regressions at all.
 (b) Short and staggered release cycle: keep changes limited (like a2),
     but recognize when it gets counter-productive, and cut a release so
     that the stable team can continue with it, while most developers (who
     wouldn't have worked on the stable kernel _anyway_) don't get
     frustrated.

And yes, we've gone for (b). With occasional "I'm not taking any half-way
scary things at _all_" releases, like 2.6.20 was.

> Trying to avoid regressions might in the worst case result in an -rc12
> and 4 months between releases. If the focus is on avoiding regressions
> this has to be accepted.

No. You are ignoring the reality of development. The reality is that you
have to balance things. If you have a four-month release cycle, where
three and a half months are just "wait for reports to trickle in from
testers", you simply won't get _anything_ done. People will throw their
hands up in frustration and go somewhere else.

> And a serious delay of the next regression-merge window due to unfixed
> regressions might even have the positive side effect of more developers
> becoming interested in fixing the current regressions for getting their
> shiny new regressions^Wfeatures faster into Linus' tree.

No. Quite the reverse.

If we have a problem right now

> 0 regressions is never realistic (especially since many regressions
> might not be reported during -rc), but IMHO we could do much better than
> what happened in 2.6.20 and 2.6.21.

2.6.20 was actually really good. Yes, it had some regressions, but I do
believe that it was one of the least buggy releases we've had. The process
_worked_.

2.6.21 was much less pleasant, but the timer thing really was

> I'm not satisfied with the result, and the world won't stop turning when
> I'm not tracking 2.6.22-rc regressions.

True. However, it's sad that you feel like you can't bother to track them.
They were _very_ useful. The fact that you felt they weren't is just
because I think you had unrealistic expectations, and you think that the
stable people shouldn't have to have anything to do.

You're maintaining 2.6.16 yourself - do you not see what happens when you
decide that "zero regressions" is the target? You have to stop
development. And while that may sound like a good thing at any particular
time, it's a total *disaster* in the long run (not even very long,
actually: in the two-to-three release cycle kind of run), because while
you are in a "regression fix" mode, people still go on developing, and
you're just causing problems for the _next_ release by holding things up
too long.

That's the *real* reality: 5 to 7 _million_ lines of diffs in a release
every two to three months. Do you really think those changes stop just
because of a release process? No. If you drag out the releases to be 4+
months, you'll just have 10-15 million lines of changes instead (or, more
likely, you'll have developers who can't be bothered any more, and you may
have just 2 million lines, and three years later you have a kernel that
isn't relevant any more. Look at any of the other Unixes).

In other words, there's a _reason_ we have staggered development. We have
the "crazy development trees" (aka -mm and various other trees), we have
the "development tree" (aka Linus' tree), and we have the -stable tree. If
the stable tree has a dozen known issues that they'll have to sort out
over the next two months, that's *fine*. That's kind of the point of the
stable tree.

And you would help them with the 2.6.22-stable releases if you'd maintain
that list. Even if it is _designed_ not to go down to zero.

I suspect that you got overly optimistic from the fact that 2.6.20 really
_was_ an easy release. It was designed that way. You feel that it was bad
or average, but that's actually because of _your_ unrealistic
expectations, not because there was anything wrong with 2.6.20.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.21
Date: Thu, 26 Apr 2007 18:43:41 UTC
Message-ID: <fa.TrW8kXhisjX8yuHO1qxUNk5Gen8@ifi.uio.no>

On Thu, 26 Apr 2007, Diego Calleja wrote:
>
> From my humble POV, it's a problem that all this discussion was generated
> on what Adrian does or stop doing. Apparently, unless Adrian posts his
> list of know regressions, most of the people doesn't look at the bugzilla
> at all. Maybe it'd be useful to create a per-release bug tracker in the
> bugzilla or collect them into one of the a kernel.org's wiki, to make easier
> to follow the current state of all the "important" regressions.

Any web-based interface is a no-no. It's one reason I don't use bugzilla a
lot. If I can't get it by email, it doesn't exist, as far as I'm
concerned.

I bet that's true even of a lot of people who are more "web oriented" than
I am. They may look at webpages, but getting notified by email is still
the wakeup call. There's a difference between "active and directed pushing
to the involved people" and "the resource exists, that people could look
at".

So it would have to be more than just a wiki or a bugzilla entry. It would
have to have that weekly email status thing, and I think that it needs to
have some human who tries to find messages on the kernel mailing list too,
and make a first-level judgement on the bugs. Adrian was doing a good job.

But it doesn't necessarily need somebody with intimate knowledge of the
kernel. In fact, almost everybody who *does* have intimate knowledge tends
to have so in a very specific area (nobody knows everything - and that
very much includes people like me and Andrew too) and may be skewed in
other ways too, so a "generalist" is probably more useful than somebody
who is a "deep coder" in some subsystem.

And it almost certainly doesn't have/prefer to be _one_ person. I suspect
that this is something where it actually might be better to have some
collection of people interested in it, and yes, perhaps editing a wiki is
part of the process, but with at least that "automated email" thing going
on in addition (and it needs to go to the people involved, not just the
kernel mailing list - so part of it is not just gathering the reports
themselves, but also gathering target addresses from MAINTAINERS files and
perhaps git logs etc).
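
[Editor's sketch, not part of the original message. It shows one way the
"directed" part could work: map the files named in a report to the
people responsible for them. The OWNERS table is a hypothetical stand-in
for data gathered from MAINTAINERS files and git logs, it does not parse
the real MAINTAINERS format, and all addresses are made up.]

```python
# Hypothetical path-prefix -> contacts table (assumption, not real data).
OWNERS = {
    "drivers/net/": ["netdev-maintainer@example.org"],
    "drivers/net/e1000/": ["e1000-maintainer@example.org"],
    "kernel/time/": ["timers-maintainer@example.org"],
}


def recipients_for(paths, owners=OWNERS):
    """Return the sorted contacts for the given file paths.

    Each path is matched against the longest owning prefix, so a report
    reaches the most specific maintainer rather than everybody.
    """
    found = set()
    for path in paths:
        matches = [prefix for prefix in owners if path.startswith(prefix)]
        if matches:
            most_specific = max(matches, key=len)
            found.update(owners[most_specific])
    return sorted(found)
```

[A report touching a path with no known owner yields an empty list,
which is where the human judgement described above would have to step
in.]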

And yes, it's quite possibly a good way to get into kernel development -
it definitely helps to know about programming, but as mentioned, I don't
think it is something where you even need to know specifically about
*kernel* programming per se.

For example, I don't think it was an accident that Adrian (who has been
involved in kernelnewbies, janitors and the trivial tree) was the one who
picked it up. That's exactly the kind of involvement that the regression
tracking is all about!

(In fact, I think regression tracking might be "easier" to get into than
actually getting into some of the janitorial projects, exactly because
it's less about coding. The fact that a person who tracks regressions
might then *also* indirectly get into the code itself would just be a big
additional bonus!)

So yes, some automation can almost certainly help (especially if there are
multiple people involved), but I think a lot of it is that "human
judgement" and ability to group things and communicate. Are there any
kernel janitors/newbies/mentors that can think of people who would want to
do something like this?

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.21
Date: Thu, 26 Apr 2007 21:13:52 UTC
Message-ID: <fa./Ew5OFYulAuejKhWiyNLp7DNqv8@ifi.uio.no>

On Thu, 26 Apr 2007, Diego Calleja wrote:
>
> Bugzilla sucks quite a lot at email, but you can answer emails and they get
> into the bugzilla database; and there're two mailing lists (listed in
> Documentation/HOWTO) that send notifications about every new bug
> added/modified- I know it's not the perfect email interface every hacker
> wants, but it's better than nothing.

No, it's *not* better than nothing.

The thing is, these reports MUST NOT go to "everybody". If they do, that
is actually *worse* than nothing, because people will just ignore them
entirely, since they aren't "directed".

The emails need to be directed to the appropriate parties, not go to
everybody. There is nobody who is interested in seeing all regressions,
except perhaps me and Andrew. Most *real* developers (as opposed to people
like me, who are integrators, not "real developers") want to be notified
about problems in *their* area, and if it's just automation that sends out
everything, it just dilutes the value of the thing, to the point where
people will ignore it even for the cases when they happen to be related to
what they do.

> I suggested some time ago that it'd be useful to send every new bug
> notification from bugme-new to the LKML (and/or other lists).

I don't know a lot of developers who actually read LKML. I know a lot of
people who look for interesting subject lines and interesting people, but
read LKML in the sense of reading everything? Not likely.

That's why I think Adrian did a great job: he took the "noise" and made it
something worth looking at! And part of that is very much to make it
directed to only relevant parties (yes, they *also* got cc'd to
linux-kernel, but people would get them in their personal mailboxes and
*not* feel like it was just noise that didn't matter to them!)

> I can understand Adrian's resign. Bugzilla is crap, but there're users
> reporting bugs there and willing to cooperate to fix them, and they're
> not getting listened.

I personally refuse to have anything at all with bugzilla. The interface
is so horrible that it's just not worth my time. I know there are a few
people who use it productively, but I'm always amazed that they can do
that.

The *big* problem with bugzilla is that it's such a "detail-oriented"
thing. It's fine if you have *one* bug that you're tracking. But whenever
that's not the case, it's almost totally useless.

Let me put it another way: I would never use a source control system that
forces me to look at my 22,000 files one at a time. I think such a system
is fundamentally broken, because it makes it impossible to get the big
picture ("what changed in the last week" kind of thing). The same is true
of bugzilla: if you *know* which bug you're looking at, it's good. For
anything else, it's almost worse than useless, exactly because there is no
way to get an overview.

> There're even a few description of patches (ie: "line
> 6 in foo.c is wrong and it breaks our testing, it should read like this:")
> that have been sitting there for *years* and not getting merged.

.. and you claim that this shows that developers don't listen. I'd say it
shows the exact *opposite*: the users don't listen. There's a lot more
users than developers, and bugzilla is pretty much designed to let the
users "report and forget", which is exactly the *wrong* thing to do,
because it puts the onus on the developer.

(I've said this before, but I'll say it again: one thing that would
already make bugzilla better is to just always drop any bug reports that
are more than a week old and haven't been touched. It wouldn't need *much*
touching, but if a reporter cannot be bothered to say "still true with
current snapshot" once a week, then it shouldn't be seen as being somehow
up to those scarce resources we call "developers" to have to go through
it).
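
[Editor's sketch, not part of the original message. It illustrates the
"drop anything untouched for a week" rule, assuming each report is a
dict with a hypothetical "last_touched" timestamp. This is not a
Bugzilla API.]

```python
from datetime import datetime, timedelta

# One week without any activity makes a report stale (the rule above).
STALE_AFTER = timedelta(days=7)


def drop_stale(reports, now):
    """Keep only reports that somebody touched within the last week."""
    return [r for r in reports if now - r["last_touched"] <= STALE_AFTER]


# Hypothetical sample data.
now = datetime(2007, 4, 26)
reports = [
    {"id": 1, "last_touched": datetime(2007, 4, 24)},  # confirmed recently
    {"id": 2, "last_touched": datetime(2007, 3, 1)},   # reported and forgotten
]
fresh = drop_stale(reports, now)
```

[Here only report 1 survives. A reporter keeps a bug alive simply by
touching it again, for example with "still true with current
snapshot".]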

So there are probably things that bugzilla could do to become more useful,
but I don't see that happening. We'd need either a smarter/better
bugzilla, or somebody who actually turns noise into real information.
Adrian did that (although in fairness to others, other people definitely
do it too. Dave Jones, for example. Very useful).

> So I, or anyone else, could try to do Adrian's job. But if Adrian (a guy
> that sends patches to make global functions static 8) got tired
> of doing that job, I suspect that I, or anyone else would also got
> tired of it even sooner.

I do agree - one of the problems with the job is not that it's thankless
(I think we've had at least ten kernel developers, very much including me,
talking about how _useful_ it is), but there is definitely a lack of
glamour and probably interest.

I think it could be more interesting if part of the job was doing the
tools. Tools *are* important. Most of my actual _development_ for the last
couple of years has been on "git", not the kernel, but I think I was more
productive that way, so I don't think that's wasted time at all.

So yes, automation would be a good idea, but I don't think bugzilla is it.

> There're other big projects with probably more bug reports than linux,
> they don't work this way, and they look more successful in their bug
> handling.

Well, one thing to keep in mind is that the kernel really does have a
*lot* more development going on than most other projects.

I don't think you'll find another project that has about six megabytes of
diffs every release (every two months). That's really one of the
fundamental issues - things really *happen* in the kernel. A *lot* of
things. You can't take a breather - I can do "stabilization releases"
every once in a while, and Andrew can kick out patches he decides aren't
ready to be merged rather than maintain them in his tree (and he does do
that), but the kernel simply tends to have a different *scale* than other
projects.

And almost all hard bugs are about hardware interactions. Drivers. Big
iron. Things like that - ie unlike something like a compiler, you can
seldom say "this test-case crashes". Yes, that does happen for the kernel
too, but those are the *easy* bugs. Those generally get fixed in a day or
two.

So I really don't think you can compare to "other projects". They simply
don't have these issues.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.21
Date: Thu, 26 Apr 2007 17:21:20 UTC
Message-ID: <fa.fRjTjZrL95SPmA2PQE46sR9FGmU@ifi.uio.no>

On Thu, 26 Apr 2007, Adrian Bunk wrote:
>
> They get frustrated because they focussed on developing new features
> instead of fixing regressions, and now it takes longer until their new
> features get merged because no one fixed the regressions...

I agree. That's part of it. But part of it is not just the "it's 2 months
until the next release", part of it is also very much a "nothing has
happened in the normal kernel for the last 8 weeks, this is boring, so
I'll do my own exciting stuff".

So one _fundamental_ issue is that all the people who aren't directly
involved with a particular regression are simply bored. And bored is not
good. You want people productive - and that means that you want an
active development kernel that they can work with, since they aren't going
to help with the regressions anyway.

This is why the -stable tree is so useful. It's not only that users want a
stable tree - it allows people who do *not* have regressions on their
plate to not be stuck twiddling their thumbs - they can be on the regular
kernel.

> I'm not saying it always have to be 4 months.

I'm saying that four months wouldn't even have *helped* in the case of
2.6.21.

Do you really think bugs get fixed faster just because there wasn't a
release? Quite the reverse. Bugs get _found_ faster thanks to a release
(simply because you tend to get more information thanks to more users),
giving the stable people more information, causing the bugs to be able to
be found and fixed _more_quickly_ in the stable release than if we had
waited for four months to release 2.6.21.

The two last weeks of 2.6.21-rc were almost entirely "wasted", apart from
getting the e1000 issue at least resolved (which was the reason for that
delay, so I'm not complaining - I'm just saying that not a lot of people
actually were able to _help_ with regressions during that time, and for
some of them, we might well be better off with more information about the
issue).

Did we fix other bugs? Yes. There was one long-time bug (since 2.6.15 or
something) that happened to come in during that time, and we had some
cleanups, we had MIPS bugs, we found some networking issues etc etc. But
the amount of combined effort people put on it was pretty weak.

> "wait for reports to trickle in from testers" is exactly the opposite of
> our problem.

I disagree. Quite often, having 5 people report the same thing is actually
more useful (because you see a pattern) than having one known regression
that you don't know _why_ that regression happened. And that's the case we
had for most of them.

You have things like the maintainer (see Oliver's reply, for example)
simply unable to reproduce it, and needing more information. It
*does*not*matter* that the original report may be old. If you need more
information, you need more information, and a two-month-old report isn't
any better just because it's two months old.

At some point, you need to say: we're not making progress, need to release
it, that might get us *off* this stuck situation.

That's the part you seem unable to accept. You think that "we have a
listed regression" means that you should be able to fix it. Not so. We
*often* need more information.

> But I am not happy with the current state of released kernels.

So you're going to help exactly how? By stopping to help? Or kvetching
about developers that can't figure out why something regressed?

Sure, that makes tons of sense, Adrian.

NOT.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.21
Date: Thu, 26 Apr 2007 17:45:00 UTC
Message-ID: <fa.6vnnEbaNvt0CJIfKjONIyYEtLd0@ifi.uio.no>

On Thu, 26 Apr 2007, Bill Davidsen wrote:
>
> If the result is fixing things which then don't get fixed in mainline, as
> Adrian notes

That whole premise is flawed. The *rule* for the stable tree is that
things don't get merged into the stable tree unless they are fixed in
mainline already.

We had that problem in the 2.4.x / 2.5.x split. I think we learnt our
lesson.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.21
Date: Fri, 27 Apr 2007 00:22:55 UTC
Message-ID: <fa.eBvTMCbuowlKW68hTChOrePNL1k@ifi.uio.no>

On Fri, 27 Apr 2007, Thomas Gleixner wrote:
>
> Maybe we need to coordinate changes better. 2.6.21 got three big updates
> which affected suspend/resume - one of them is my fault. But fiddling
> out which one of those - we had nested problems as well - makes it quite
> hard to grok them in time, especially if they happen only on one
> reporters system.

Yes. _If_ we had known how painful the timer changes would end up being,
we'd probably have done them separately from everything else.

That is the kind of thing that looks obvious in hindsight: merge stuff
that is questionable and scary alone, and don't do anything else that
release cycle.

But while the timer code is obviously pretty core, I think everybody
expected it to be a lot easier to merge (and it had existed as patches in
various forms for some time).

So we simply didn't know beforehand that it was going to cause the kinds
of regressions it did cause (and in fact, some of the regressions were
initially blamed on other things entirely - some of them looked like IO
regressions).

Water under the bridge. It's also easy to say in hindsight that something
should have been merged separately and been given a release cycle all its
own.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.21
Date: Thu, 26 Apr 2007 16:40:55 UTC
Message-ID: <fa.4jcPxsz9ZNKPBfTsanghvvWSeeM@ifi.uio.no>

On Thu, 26 Apr 2007, Jan Engelhardt wrote:
>
> I really appreciate the lot of -rcs, especially if there are so many
> intrusive changes/regressions. Like Andrew, I have a feeling that it
> gets buggier, but at least, it seems to be made up every ... two
> releases.

I wouldn't say that, but yes, there is at least *some* tendency to not
merge scary stuff after a painful release.

For example, I can certainly say that after 2.6.21, I'm likely to be very
unhappy merging something that isn't "obviously safe". I knew the timer
changes were potentially painful, I just hadn't realized just *how*
painful they would be (we had some SATA/IDE changes too, of course, it's
not all just about the timers, those just ended up being more noticeable
to me than some of the other things were).

> About 2.6.21 - will see, rc has been to my liking.

I actually hope that 2.6.21 isn't even all that bad, despite all the
worries about it. And I may be complaining about the problems the timers
caused, but it was definitely something that was not only worth it, it was
overdue - and those NO_HZ issues had been brewing literally for years. So
considering issues like that, I think we're actually doing fairly well.

One of the bigger issues is that I think -mm (and I'm pretty sure Andrew
will agree with me on this) has really had a rather spotty history. It's
been unstable enough at times that I suspect people have largely stopped
testing it, with just the most die-hard testers running -mm.

So -mm is still very useful just because *Andrew* tests it, and finds all
kinds of issues with it, but I literally suspect that Andrew himself is
personally a big part of that, which is kind of wasteful - we should be
able to spread out the pain more. Andrew is also too damn polite when
something goes wrong ;)

So we should have somebody like Christoph running -mm, and when things
break, we'll just sic Christoph on whoever broke it, and teach people
proper fear and respect! As it is, I think people tend to send things to
-mm a bit *too* eagerly, because there is no downside - Andrew is a "cheap
date" testing-wise, and always puts out ;)

			Linus
