From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair
Date: Wed, 18 Apr 2007 17:25:05 UTC
Message-ID: <fa.yYXoQg2aAh1kKN9OtcS5k4NxbMg@ifi.uio.no>

On Wed, 18 Apr 2007, Matt Mackall wrote:
>
> On Wed, Apr 18, 2007 at 07:48:21AM -0700, Linus Torvalds wrote:
> > And "fairness by euid" is probably a hell of a lot easier to do than
> > trying to figure out the wakeup matrix.
>
> For the record, you actually don't need to track a whole NxN matrix
> (or do the implied O(n**3) matrix inversion!) to get to the same
> result.

I'm sure you can do things differently, but the reason I think "fairness
by euid" is actually worth looking at is that it's pretty much the
*identical* issue that we'll have with "fairness by virtual machine" and a
number of other "container" issues.

The fact is:

 - "fairness" is *not* about giving everybody the same amount of CPU time
   (scaled by some niceness level or not). Anybody who thinks that is
   "fair" is just being silly and hasn't thought it through.

 - "fairness" is multi-level. You want to be fair to threads within a
   thread group (where "process" may be one good approximation of what a
   "thread group" is, but not necessarily the only one).

   But you *also* want to be fair in between those "thread groups", and
   then you want to be fair across "containers" (where "user" may be one
   such container).

So I claim that anything that cannot be fair by user ID is actually really
REALLY unfair. I think it's absolutely humongously STUPID to call
something the "Completely Fair Scheduler", and then just be fair on a
thread level. That's not fair AT ALL! It's the antithesis of being fair!

So if you have 2 users on a machine running CPU hogs, you should *first*
try to be fair among users. If one user then runs 5 programs, and the
other one runs just 1, then the *one* program should get 50% of the CPU
time (the user's fair share), and the five programs should get 10% of CPU
time each. And if one of them uses two threads, each thread should get 5%.

So you should see one thread get 50% CPU (single thread of one user), 4
threads get 10% CPU (their fair share of that user's time), and 2 threads
get 5% CPU (the fair share within that thread group!).
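
As a toy illustration of that arithmetic (purely a sketch that divides the
machine evenly at each level of a made-up user/process/thread hierarchy -
nothing to do with actual scheduler code):

	#include <stdio.h>

	int main(void)
	{
		/* two users -> 50% of the machine each */
		double user_share = 100.0 / 2;

		/* user A runs one single-threaded program */
		printf("user A, program 1: %.0f%%\n", user_share);

		/* user B runs five programs -> 10% each */
		double prog_share = user_share / 5;
		for (int i = 1; i <= 4; i++)
			printf("user B, program %d: %.0f%%\n", i, prog_share);

		/* ...and the fifth program has two threads -> 5% each */
		printf("user B, program 5, thread 1: %.0f%%\n", prog_share / 2);
		printf("user B, program 5, thread 2: %.0f%%\n", prog_share / 2);
		return 0;
	}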

Any scheduling argument that just considers the above to be "7 threads
total" and gives each thread 14% of CPU time "fairly" is *anything* but
fair. It's a joke if that kind of scheduler then calls itself CFS!

And yes, that's largely what the current scheduler will do, but at least
the current scheduler doesn't claim to be fair! So the current scheduler
is a lot *better* if only in the sense that it doesn't make ridiculous
claims that aren't true!

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair
Date: Wed, 18 Apr 2007 19:42:00 UTC
Message-ID: <fa.LidW6EY8/ShOvo+8rYZuN1kBmIw@ifi.uio.no>

On Wed, 18 Apr 2007, Ingo Molnar wrote:
>
> perhaps a more fitting term would be 'precise group-scheduling'. Within
> the lowest level task group entity (be that thread group or uid group,
> etc.) 'precise scheduling' is equivalent to 'fairness'.

Yes. Absolutely. Except I think that at least if you're going to name
something "complete" (or "perfect" or "precise"), you should also admit
that groups can be hierarchical.

The "threads in a process" thing is a great example of a hierarchical
group. Imagine if X was running as a collection of threads - then each
server thread would no longer be more important than the clients! But if
you have a mix of "bags of threads" and "single process" kind
applications, then very arguably the single thread in a single traditional
process should get as much time as the "bag of threads" process gets
total.

So it really should be a hierarchical notion, where each thread is owned
by one "process", and each process is owned by one "user", and each user
is in one "virtual machine" - there's at least three different levels to
this, and you'd want to schedule this thing top-down: virtual machines
should be given CPU time "fairly" (which doesn't need to mean "equally",
of course - nice-values could very well work at that level too), and then
within each virtual machine users or "scheduling groups" should be
scheduled fairly, and then within each scheduling group the processes
should be scheduled, and within each process threads should equally get
their fair share at _that_ level.
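
One way to picture that hierarchy, as a minimal sketch with invented
structure names (not the kernel's data structures): each level keeps a
parent pointer and a weight, and a task's share is just the product of its
fraction at every level on the way up.

	/* Sketch only: hypothetical types, not kernel code. */
	struct sched_node {
		struct sched_node *parent;	/* NULL at the top ("virtual machine") level */
		unsigned int weight;		/* nice-like weight at this level */
		unsigned int children_weight;	/* sum of the weights of all children */
	};

	/* Walk up: thread -> process -> user -> virtual machine. */
	static double node_share(const struct sched_node *n)
	{
		double share = 1.0;

		for (; n->parent; n = n->parent)
			share *= (double)n->weight / n->parent->children_weight;
		return share;
	}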

And no, I don't think we necessarily need to do something quite that
elaborate. But I think that's the kind of "obviously good goal" to keep in
mind. Can we perhaps _approximate_ something like that by other means?

For example, maybe we can approximate it by spreading out the statistics:
right now you have things like

 - last_ran, wait_runtime, sum_wait_runtime..

be per-thread things. Maybe some of those can be spread out, so that you
put a part of them in the "struct vm_struct" thing (to approximate
processes), part of them in the "struct user" struct (to approximate the
user-level thing), and part of it in a per-container thing for when/if we
support that kind of thing?

IOW, I don't think the scheduling "groups" have to be explicit boxes or
anything like that. I suspect you can make do with just heuristics that
penalize the same "struct user" and "struct vm_struct" for getting too
much scheduling time, and you'll get the same _effect_.
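
A very rough sketch of what "spreading out the statistics" could look like
(field and structure names invented here purely for illustration): each
thread still keeps its own counters, but also charges the CPU time it uses
to shared per-process, per-user and per-container accounting.

	/* Sketch only: hypothetical structures, not existing kernel ones. */
	struct group_stats {
		unsigned long long used_runtime;	/* CPU time charged by all members */
		unsigned long long wait_runtime;	/* time the group is still "owed" */
	};

	struct task_accounting {
		struct group_stats *process;	/* shared by all threads of the process */
		struct group_stats *user;	/* shared by all tasks of the uid */
		struct group_stats *container;	/* optional per-VM / per-container level */
	};

	static void charge_runtime(struct task_accounting *acct,
				   unsigned long long delta)
	{
		acct->process->used_runtime += delta;
		acct->user->used_runtime += delta;
		if (acct->container)
			acct->container->used_runtime += delta;
	}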

And I don't think it's wrong to look at the "one hundred processes by the
same user" case as being an important case. But it should not be the
*only* case or even necessarily the *main* case that matters. I think a
benchmark that literally does

	#include <stdlib.h>
	#include <sys/types.h>
	#include <unistd.h>

	int main(void)	/* run as root so that the setuid() calls succeed */
	{
		pid_t pid = fork();
		if (pid < 0)
			exit(1);
		if (pid) {
			/* parent: one CPU hog running as uid 500 */
			if (setuid(500) < 0)
				exit(2);
			for (;;)
				/* Do nothing */;
		}
		/* child: two CPU hogs running as uid 501 */
		if (setuid(501) < 0)
			exit(3);
		fork();
		for (;;)
			/* Do nothing in two processes */;
	}

is a really valid benchmark: if the scheduler gives 25%
of time to each of the two processes of user 501, and 50% to user 500,
then THAT is a good scheduler.

If somebody wants to actually write and test the above as a test-script,
and add it to a collection of scheduler tests, I think that could be a
good thing.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Announce] [patch] Modular Scheduler Core and Completely Fair
Date: Wed, 18 Apr 2007 19:24:41 UTC
Message-ID: <fa.iYEabutgip3cPap+b83WLV/bjtg@ifi.uio.no>

On Wed, 18 Apr 2007, Ingo Molnar wrote:
>
> But note that most of the reported CFS interactivity wins, as surprising
> as it might be, were due to fairness between _the same user's tasks_.

And *ALL* of the CFS interactivity *losses* and complaints have been
because it did the wrong thing _between different users' tasks_.

So what's your point? Your point was that when people try it out as a
single user, it is indeed fair. But that's no point at all, since it
totally missed _my_ point.

The problem with X scheduling is exactly that "other user" kind of thing.

The problem with kernel thread starvation due to user threads getting all
the CPU time is exactly the same issue.

As long as you think that all threads are equal, and should be treated
equally, you CANNOT make it work well. People can say "ok, you can renice
X", but the whole problem stems from the fact that you're trying to be
fair based on A TOTALLY INVALID NOTION of what "fair" is.

> In the typical case, 99% of the desktop CPU time is executed either as X
> (root user) or under the uid of the logged in user, and X is just one
> task.

So? You are ignoring the argument again. You're totally bringing up a red
herring:

> Even with a bad hack of making X super-high-prio, interactivity as
> experienced by users still sucks without having fairness between the
> other 100-200 user tasks that a desktop system is typically using.

I didn't say that you should be *unfair* within one user group. What kind
of *idiotic* argument are you trying to put forth?

OF COURSE you should be fair "within the user group". Nobody contests that
the "other 100-200 user tasks" should be scheduled fairly _amongst
themselves_.

The only point I had was that you cannot just lump all threads together
and say "these threads are equally important". The 100-200 user tasks may
be equally important, and should get equal amounts of preference, but that
has absolutely _zero_ bearing on the _single_ task run in another
"scheduling group", ie by other users or by X.

I'm not arguing against fairness. I'm arguing against YOUR notion of
fairness, which is obviously bogus. It is *not* fair to try to give out
CPU time evenly!

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [REPORT] cfs-v4 vs sd-0.44
Date: Sun, 22 Apr 2007 23:27:09 UTC
Message-ID: <fa.dsaijbCZBYD8eYz06FjWY4OKIpw@ifi.uio.no>

On Sun, 22 Apr 2007, Juliusz Chroboczek wrote:
>
> Why not do it in the X server itself?  This will avoid controversial
> policy in the kernel, and have the added advantage of working with
> X servers that don't directly access hardware.

It's wrong *wherever* you do it.

The X server should not be re-niced. It was done in the past, and it was
wrong then (and caused problems - we had to tell people to undo it,
because some distros had started doing it by default).

If you have a single client, the X server is *not* more important than the
client, and indeed, renicing the X server causes bad patterns: just
because the client sends a request does not mean that the X server should
immediately be given the CPU as being "more important".

In other words, the things that make it important that the X server _can_
get CPU time if needed are all totally different from the X server being
"more important". The X server is more important only in the presence of
multiple clients, not on its own! Needing to renice it is a hack for a bad
scheduler, and shows that somebody doesn't understand the problem!

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [REPORT] cfs-v4 vs sd-0.44
Date: Sat, 21 Apr 2007 16:36:09 UTC
Message-ID: <fa.m4HsERO9PY6ExXMJbXZP/nwENNM@ifi.uio.no>

On Sat, 21 Apr 2007, Willy Tarreau wrote:
>
> If you remember, with 50/50, I noticed some difficulty forking many
> processes. I think that during a fork(), the parent has a higher probability
> of forking other processes than the child. So at least, we should use
> something like 67/33 or 75/25 for parent/child.

It would be even better to simply have the rule:
 - child gets almost no points at startup
 - but when a parent does a "waitpid()" call and blocks, it will spread
   out its points to the children (the "vfork()" blocking is another case
   that is really the same).

This is a very special kind of "priority inversion" logic: you give higher
priority to the things you wait for. Not because of holding any locks, but
simply because a blocking waitpid really is a damn big hint that "ok, the
child now works for the parent".
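
A rough sketch of that rule, with invented helpers (only the shape of the
bookkeeping, not a real implementation): the child starts with nothing, and
the blocking parent hands its leftover points to its children on the way to
sleep.

	/* Sketch only: hypothetical task structure and helper. */
	struct task {
		struct task **children;
		int nr_children;
		long points;		/* whatever surplus credit the scheduler tracks */
	};

	static void donate_points_on_wait(struct task *parent)
	{
		if (parent->nr_children == 0 || parent->points <= 0)
			return;

		/* spread the parent's remaining points over its children */
		long share = parent->points / parent->nr_children;
		for (int i = 0; i < parent->nr_children; i++)
			parent->children[i]->points += share;
		parent->points = 0;	/* the parent is about to block anyway */
	}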

		Linus


From: Ingo Molnar <mingo@elte.hu>
Newsgroups: fa.linux.kernel
Subject: Re: [REPORT] cfs-v4 vs sd-0.44
Date: Sat, 21 Apr 2007 16:55:08 UTC
Message-ID: <fa.UofOdO5MLsXvHFSAccWpeyIUSiQ@ifi.uio.no>

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> It would be even better to simply have the rule:
>  - child gets almost no points at startup
>  - but when a parent does a "waitpid()" call and blocks, it will spread
>    out its points to the children (the "vfork()" blocking is another case
>    that is really the same).
>
> This is a very special kind of "priority inversion" logic: you give
> higher priority to the things you wait for. Not because of holding any
> locks, but simply because a blocking waitpid really is a damn big hint
> that "ok, the child now works for the parent".

yeah. One problem I can see with the implementation of this though is
that shells typically do nonspecific waits - for example bash does this
on a simple 'ls' command:

  21310 clone(child_stack=0,  ...) = 21399
  ...
  21399 execve("/bin/ls",
  ...
  21310 waitpid(-1, <unfinished ...>

the PID is -1 so we don't actually know which task we are waiting for. We
could use the first entry from the p->children list, but that looks like
too specific a hack to me. It should catch most of the
synchronous-helper-task cases though.

	Ingo


From: "Ulrich Drepper" <drepper@gmail.com>
Newsgroups: fa.linux.kernel
Subject: Re: [REPORT] cfs-v4 vs sd-0.44
Date: Sat, 21 Apr 2007 18:09:47 UTC
Message-ID: <fa.RFpVhdcz6PeA9ZhuCvihQhVPDkI@ifi.uio.no>

On 4/21/07, Ingo Molnar <mingo@elte.hu> wrote:
> on a simple 'ls' command:
>
>   21310 clone(child_stack=0,  ...) = 21399
>   ...
>   21399 execve("/bin/ls",
>   ...
>   21310 waitpid(-1, <unfinished ...>
>
> the PID is -1 so we dont actually know which task we are waiting for.

That's a special case.  Most programs don't do this.  In fact, in
multi-threaded code you'd better never do it, since such an unqualified
wait might catch the child another thread is waiting for (particularly bad
if one thread uses system()).

And even in the case of bash, we can probably change the code to use a
qualified wait in case there are no other children.  This is known at
any time, and I expect that most of the time there are no background
processes.  At least in shell scripts.
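
For illustration, the difference is only which pid gets passed to
waitpid() - a minimal userspace example, not bash's actual code:

	#include <sys/types.h>
	#include <sys/wait.h>

	/* Unqualified wait: reaps *any* child, so the kernel cannot tell
	   which task the shell is really waiting for. */
	pid_t reap_any(void)
	{
		return waitpid(-1, NULL, 0);
	}

	/* Qualified wait: names the specific child - the case a
	   "yield to the waited-for task" heuristic could use directly. */
	pid_t reap_child(pid_t child)
	{
		return waitpid(child, NULL, 0);
	}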


From: "Ulrich Drepper" <drepper@gmail.com>
Newsgroups: fa.linux.kernel
Subject: Re: [REPORT] cfs-v4 vs sd-0.44
Date: Sat, 21 Apr 2007 19:50:23 UTC
Message-ID: <fa.x0YzH2aYoZUj5bgQHh18iLeGqUs@ifi.uio.no>

On 4/21/07, Kyle Moffett <mrmacman_g4@mac.com> wrote:
> It might be nice if it was possible to actively contribute your CPU
> time to a child process.  For example:
> int sched_donate(pid_t pid, struct timeval *time, int percentage);

If you do this, and it has been requested many times, then please
generalize it.  We have the same issue with futexes.  If a FUTEX_WAIT
call is issued, the remaining time in the slot should be given to the
thread currently owning the futex.  For non-PI futexes this needs an
extension of the interface, but I would be up for that.  It can have
big benefits on the throughput of an application.


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [REPORT] cfs-v4 vs sd-0.44
Date: Sat, 21 Apr 2007 23:38:04 UTC
Message-ID: <fa.eRBqNevkDEaGPLFHWULkY2+jSHM@ifi.uio.no>

On Sat, 21 Apr 2007, Ulrich Drepper wrote:
>
> If you do this, and it has been requested many times, then please
> generalize it.  We have the same issue with futexes.  If a FUTEX_WAIT
> call is issued, the remaining time in the slot should be given to the
> thread currently owning the futex.

And how the hell do you imagine you'd even *know* what thread holds the
futex?

The whole point of the "f" part of the mutex is that it's fast, and we
never see the non-contended case in the kernel.

So we know who *blocks*, but we don't know who actually didn't block.

		Linus


From: "Ulrich Drepper" <drepper@gmail.com>
Newsgroups: fa.linux.kernel
Subject: Re: [REPORT] cfs-v4 vs sd-0.44
Date: Sun, 22 Apr 2007 01:51:49 UTC
Message-ID: <fa.nWavHeDCILn/j08Q0wSJlB+OCUA@ifi.uio.no>

On 4/21/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> And how the hell do you imagine you'd even *know* what thread holds the
> futex?

We know this in most cases.  This is information recorded, for
instance, in the mutex data structure.  You might have missed my "the
interface must be extended" part.  This means the PID of the owning
thread will have to be passed done.  For PI mutexes this is not
necessary since the kernel already has access to the information.


> The whole point of the "f" part of the mutex is that it's fast, and we
> never see the non-contended case in the kernel.

See above.  Believe me, I know how futexes work.  But I also know what
additional information we collect.  For mutexes and in part for
rwlocks we know which thread owns the sync object.  In that case we
can easily provide the kernel with the information.


From: "Ulrich Drepper" <drepper@gmail.com>
Newsgroups: fa.linux.kernel
Subject: Re: [REPORT] cfs-v4 vs sd-0.44
Date: Sun, 22 Apr 2007 07:17:55 UTC
Message-ID: <fa.i5QHyEJ5fWkV04xCC6dOLX9GoWw@ifi.uio.no>

On 4/22/07, William Lee Irwin III <wli@holomorphy.com> wrote:
> I'm just looking for what people want the API to be here. With that in
> hand we can just go out and do whatever needs to be done.

I think a sched_yield_to is one interface:

   int sched_yield_to(pid_t);

For futex(), the extension is needed for the FUTEX_WAIT operation.  We
need a new operation FUTEX_WAIT_FOR or so which takes another (the
fourth) parameter which is the PID of the target.

For FUTEX_LOCK_PI we need no extension.  The futex value is the PID of
the current owner.  This is required for the whole interface to work
in the first place.
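
For concreteness, the proposed interfaces might be declared roughly like
this - both are hypothetical, neither existed in the kernel, and the exact
argument layout here is an assumption:

	#include <sys/types.h>

	/* Proposed: give the rest of this task's timeslice to a specific task. */
	int sched_yield_to(pid_t target);

	/* Proposed futex extension: like FUTEX_WAIT, but the caller also names
	   the thread it believes currently owns the futex, so the kernel can
	   hand the waiter's remaining timeslice to that owner.  (Hypothetical
	   wrapper; the real futex() multiplexer would carry the extra tid.) */
	int futex_wait_for(int *uaddr, int expected, pid_t owner_tid);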


From: "Ulrich Drepper" <drepper@gmail.com>
Newsgroups: fa.linux.kernel
Subject: Re: [REPORT] cfs-v4 vs sd-0.44
Date: Sun, 22 Apr 2007 16:17:01 UTC
Message-ID: <fa.zgU7VMUoshuFs9tJP2xednVJess@ifi.uio.no>

On 4/22/07, William Lee Irwin III <wli@holomorphy.com> wrote:
> On Sun, Apr 22, 2007 at 12:17:31AM -0700, Ulrich Drepper wrote:
> > For futex(), the extension is needed for the FUTEX_WAIT operation.  We
> > need a new operation FUTEX_WAIT_FOR or so which takes another (the
> > fourth) parameter which is the PID of the target.
> > For FUTEX_LOCK_PI we need no extension.  The futex value is the PID of
> > the current owner.  This is required for the whole interface to work
> > in the first place.
>
> We'll have to send things out and see what sticks here. There seems to
> be some pickiness above.

I know Rusty will shudder since it makes futexes yet more complicated
(although only if the user wants it), but if you introduce the concept
of "yield to" then this extension really makes sense, and it is quite a
simple extension.  Plus: I'm the most affected by the change, since I
have to change code to use it, and I'm fine with it.

Oh, last time I didn't mention the cases of waitpid()/wait4()/waitid()
explicitly naming a process to wait on.  I think it's clear that those
cases should also be changed to use yield-to if possible.  I don't have a
good suggestion for what to do when the call waits for any child.
Perhaps yielding to the last created one is fine.  If delays through
reading on a pipe are recognized as well and handled with yield-to, then
the time slot will automatically be forwarded to the first runnable
process in the pipe sequence.  I.e.,
running

    grep foo /etc/passwd | cut -d: -f2 | crack

probably will create 'crack' last.  Giving it the remainder of the time
slot should result in recognizing that it waits for 'cut', which in turn
waits for 'grep'.  So in the end 'grep' gets the time slot.  It seems
quite complicated from the outside, but I can imagine quite good
results from this.


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [REPORT] cfs-v4 vs sd-0.44
Date: Mon, 23 Apr 2007 15:59:02 UTC
Message-ID: <fa.T68WLnw24+ovLr+ZNc0KvOLQ2Ok@ifi.uio.no>

On Mon, 23 Apr 2007, Nick Piggin wrote:
> > If you have a single client, the X server is *not* more important than the
> > client, and indeed, renicing the X server causes bad patterns: just
> > because the client sends a request does not mean that the X server should
> > immediately be given the CPU as being "more important".
>
> If the client is doing some processing, and the user moves the mouse, it
> feels much more interactive if the pointer moves rather than waits for
> the client to finish processing.

.. yes. However, that should be automatically true if the X process just
has "enough CPU time" to merit being scheduled.

Which it normally should always have, exactly because it's an
"interactive" process (regardless of how the scheduler is done - any
scheduler should always give sleepers good latency. The current one
obviously does it by giving interactivity-bonuses, CFS does it by trying
to be fair in giving out CPU time).

The problem tends to be the following scenario:

 - the X server is very CPU-busy, because it has lots of clients
   connecting to it, and it's not getting any "bonus" for doing work for
   those clients (ie it uses up its time-slice and thus becomes "less
   important" than other processes, since it's already gotten its "fair"
   slice of CPU - never mind that it was really unfair to not give it
   more)

 - there is some process that is *not* dependent on X, that can (and does)
   run, because X has spent its CPU time serving others.

but the point I'm trying to make is that X shouldn't get more CPU-time
because it's "more important" (it's not: and as noted earlier, thinking
that it's more important skews the problem and makes for too *much*
scheduling). X should get more CPU time simply because it should get its
"fair CPU share" relative to the *sum* of the clients, not relative to any
client individually.

Once you actually do give the X server "fair share" of the CPU, I'm sure
that you can still get into bad situations (trivial example: make clients
that on purpose do X requests that are expensive for the server, but are
cheap to generate). But it's likely not going to be an issue in practice
any more.

Scheduling is not something you can do "perfectly". There's no point in
even trying. To do "perfect" scheduling, you'd have to have ESP and know
exactly what the user expects and know the future too. What you should aim
for is the "obvious cases".

And I don't think anybody really disputes the fact that a process that
does work for other processes "obviously" should get the CPU time skewed
towards it (and away from the clients - not from non-clients!). I think
the only real issue is that nobody really knows how to do it well (or at
all).

I think the "schedule by user" would be reasonable in practice - not
perfect by any means, but it *does* fall into the same class of issues:
users are not in general "more important" than other users, but they
should be treated fairly at the user level, not on a per-process basis.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [REPORT] cfs-v4 vs sd-0.44
Date: Mon, 23 Apr 2007 20:05:11 UTC
Message-ID: <fa.op9JQfY4rEdKvsCu5w7vGT2e/6E@ifi.uio.no>

On Mon, 23 Apr 2007, Ingo Molnar wrote:
>
> The "give scheduler money" transaction can be both an "implicit
> transaction" (for example when writing to UNIX domain sockets or
> blocking on a pipe, etc.), or it could be an "explicit transaction":
> sched_yield_to(). This latter I've already implemented for CFS, but it's
> much less useful than the really significant implicit ones, the ones
> which will help X.

Yes. It would be wonderful to get it working automatically, so please say
something about the implementation..

The "perfect" situation would be that when somebody goes to sleep, any
extra points it had could be given to whoever it woke up last. Note that
for something like X, it means that the points are 100% ephemeral: it gets
points when a client sends it a request, but it would *lose* the points
again when it sends the reply!

So it would only accumulate "scheduling points" while multiple clients
are actively waiting for it, which actually sounds like exactly the right
thing. However, I don't really see how to do it well, especially since the
kernel cannot actually match up the client that gave some scheduling
points to the reply that X sends back.

There are subtle semantics with these kinds of things: especially if the
scheduling points are only awarded when a process goes to sleep, if X is
busy and continues to use the CPU (for another client), it wouldn't give
any scheduling points back to clients and they really do accumulate with
the server. Which again sounds like it would be exactly the right thing
(both in the sense that the server that runs more gets more points, but
also in the sense that we *only* give points at actual scheduling events).

But how do you actually *give/track* points? A simple "last woken up by
this process" thing that triggers when it goes to sleep? It might work,
but on the other hand, especially with more complex things (and networking
tends to be pretty complex) the actual wakeup may be done by a software
irq. Do we just say "it ran within the context of X, so we assume X was
the one that caused it?" It probably would work, but we've generally tried
very hard to avoid accessing "current" from interrupt context, including
bh's..
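
A rough sketch of that "last woken up by this process" bookkeeping, with
invented field names, and ignoring the softirq question entirely:

	/* Sketch only: hypothetical fields, not the scheduler's real ones. */
	struct task {
		struct task *last_wakee;	/* the task this one most recently woke up */
		long points;			/* surplus scheduling credit */
	};

	/* called from the wakeup path */
	static void note_wakeup(struct task *waker, struct task *wakee)
	{
		waker->last_wakee = wakee;
	}

	/* called when a task blocks: hand leftover points to whoever it woke last */
	static void on_sleep(struct task *t)
	{
		if (t->points > 0 && t->last_wakee) {
			t->last_wakee->points += t->points;
			t->points = 0;
		}
	}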

			Linus
