Light-weight processes (David S. Miller; Larry McVoy; Zack Weinberg)

Subject: Re: kernel thread support - LWP's 
Date:   Wed, 14 Jul 1999 19:10:08 -0600
From: Larry McVoy <lm@bitmover.com>
Newsgroups: fa.linux.kernel

: Are there plans in the near future to support proper LWP's in the Linux
: kernel ?  By this I mean multiple threads of execution within the same
: process id, not multiple processes sharing the same VM, etc.  Can I
: encourage whoever is considering doing this that it would be a very
: good thing to do ? :)

Sure you could if you had so much as a shred of data which supported
the idea that it would be a good thing to do.  In a short discussion I
had with Linus about this a month or so ago, he pointed out something
that should have been obvious, that cloned processes which share VM
also share the page tables and hence the TLB resources.  Why is that
important?  Because it was the one remaining thing that I could see as
a legit argument for supporting LWPs.  Given that that isn't an issue,
can you think of a single technical reason why LWP's would be better?

I'll warn you up front that I've chewed over this topic at length with
people like Steve Kleiman, the architect of the Solaris threading
model and the guy that taught me much of what I know about operating
systems, and even he isn't convinced that the Solaris model is worth it.
If the clone() model had been around, I'm 90% sure he would have gone
with that.

So do you have any supporting data which makes a case that LWP's would
be better than the current model?  I'm willing to believe there is such
data, but at this point I'm at a loss as to what it could be.

Subject: Re: kernel thread support - LWP's 
Date:   Thu, 15 Jul 1999 00:31:40 -0600
From: Larry McVoy <lm@bitmover.com>
Newsgroups: fa.linux.kernel

: > I'll warn you up front that I've chewed over this topic at length with
: > people like Steve Kleiman, the architect of the Solaris threading
: > model and the guy that taught me much of what I know about operating
: > systems, and even he isn't convinced that the Solaris model is worth it.
: > If the clone() model had been around, I'm 90% sure he would have gone
: > with that.
: 
: Do you have any inklings as to why they'd have gone this way ?

Yes.  It's a much nicer model.  If you've worked in both, the Linux (aka
Plan 9) model is trivial - you already know how to do everything.  The only
new parts are the parts that are actually new - such as locking to prevent
data modification races.  In the LWP world, you now get to learn about 
two kinds of processes, you have to invent new tools to monitor, debug, 
and administer these processes, you have to learn about thread scheduling
versus process scheduling, thread signals versus process signals, etc.  

It's all a crock of doo doo that was deemed to be necessary because
"context switches cost too much".  

First of all, it's extremely rare that anyone who says "context switches
cost too much" actually knows what they cost, or knows if they are even
within an order of magnitude of being a critical performance bottleneck.

Second, for those rare cases where they actually do cost too much,
that's only on crappy operating systems.  The last time I checked, Linux
process context switches were faster than Solaris LWP context switches.
So much for that argument.

Third, the context switch time on any half way decent OS quickly becomes
dominated by cache misses caused by rebuilding the real process context.
Context switch benchmarks typically (lmbench is a notable exception,
shameless plug) do not measure anything but the OS context switch -
i.e., how fast can I switch from one thread of execution to the next.
If you get this code path small enough, you can fit the whole thing in
the L1 cache like Linux does and get ~1 microsecond context switches.
And the user level scheduler people just love to brag about numbers that
are even better (though typically not much).  But that's just benchmarking
crap - people don't context switch to do nothing, they context switch to
do some work.  That work turns into cache misses.  If you miss 5 or 6
times, you're up to the Linux kernel level context.  If you are talking
about those wonderful userlevel schedulers, you are probably at the context
switch time in 2-3 cache misses.  Do you really think you are going to
context switch in and do less than a half dozen cache misses before you
go to the next thread?  I doubt it.  If I'm right, then the real context
switch time is actually dominated by rebuilding your cache foot print,
NOT by the context switch time: regardless of whether we are talking about
kernel level, user level, whatever.  So it's a marketing argument that 
the context switches are the issue, not a real argument.  

Srk, in private conversations with me, admitted that threads were heavily
oversold and 99% of what you wanted to do you could do with processes and
mmap() and the remaining 1% could be done with clones.

: > So do you have any supporting data which makes a case that LWP's would
: > be better than the current model?  I'm willing to believe there is such
: > data, but at this point I'm at a loss as to what it could be.
: 
: Hmmm.  I can't claim to have any better exposure than yourself, I was
: asking more from a consumer point of view.
: 
: My perspective on LWP's is from a Solaris background so your comments are
: very interesting, indeed, given the respect the Solaris model has earned.

The Solaris model is fine except that it doesn't need to exist.  They could
support clone() just fine if they wanted to.

: I think the differences I see aren't so much technical as the way they're
: interacted with (and perceived by someone who's been around Unix too long).
: For example, I'd expect a ps (or ls in /proc) to list LWP's as separate
: processes (I assume), when it ought not to.  The impression of different
: pid's is of different processes, which the clone model breaks.

I personally think the other way is busted.  I think the Linux way
is correct.  Suppose you have a N thread program, done with LWP's on
Solaris and clones on Linux.  You want to see if it is unbalanced on
an N cpu system.  On Linux, you'd use the same tools that you'd use if
it were a bunch of cooperating processes (because it is), i.e., you'd
fire up top(1) and look at 'em.  On Solaris, if you did the same thing,
all you'd see is one busy process but you couldn't tell (using standard
tools) which thread was the busy one and which ones were hanging around
doing nothing.

I'm sure other people will argue that they don't want their top listings
or ps listings ``cluttered up'' with all those threads.  I don't agree. 
That attitude says that threads are really light, it's OK to have so
many that listing them all would be too busy.  BS.  Those threads cost
and cost big time.  Every thread is, at the very least, a 1 or 2 page
stack, typically 8 to 16KB.  You got a 1000 threads?  Cool, that's 16MB 
of stacks.  Great idea.  Not.

I said a long time ago:  ``Threads are like salt.  You like salt, I
like salt, but we eat a lot more pasta than salt.''  The thread guys
are trying to tell you that a diet of salt is a good idea.  They are wrong,
don't listen, eat more pasta and be happy.

Subject: Re: kernel thread support - LWP's 
Date:   Thu, 15 Jul 1999 11:36:36 -0600
From: Larry McVoy <lm@bitmover.com>
Newsgroups: fa.linux.kernel

: At 12:31 AM 7/15/99 -0600, Larry McVoy wrote:
: >Second, for those rare cases where they actually do cost too much,
: >that's only on crappy operating systems.  The last time I checked, Linux
: >process context switches were faster than Solaris LWP context switches.
: >So much for that argument.
: 
: Since you've obviously talked to a lot of good people on this, I was
: wondering if you could talk about the only issue I haven't heard you bring
: up which is frequently brought up by the LWP/user-thread-scheduler folks.
: What about kernel run-queue length?  It seems that I've heard the argument
: made that LWP's keep you from spending a long time in the kernel scheduler,
: which I could see might actually be a good thing.

I just ran some benchmarks to make sure that Linux does the right thing.
I was on Linux 2.2.9, on a 400MHz Celeron.   The context switch time
doesn't vary over the range of 40..200 processes.  What this means
is that by the time we have 40 active processes, adding another 40 or
another 160 for that matter, made no difference.  The actual numbers are
8 usecs for 0 sized processes (just context switching) and 54 usecs for
processes that also touch 16KB between each context switch.

So once again, on a decent OS, we just aren't seeing any data which 
supports the idea that user level is measurably better.   Yeah, I'll
bet those user level schedulers can context switch in .1 usecs or
something like that, but given that a cache miss is closer to .2 usecs,
the context switch time is just noise - it's the memory subsystem which
will dominate.
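
As a rough illustration of the arithmetic in this message, the sketch below adds the cost of the cache misses taken after a switch to the raw switch time. The function names are made up for illustration, and the figures you would feed in (0.1 usecs for a user-level switch, 0.2 usecs per cache miss) are the ballpark numbers quoted above, not new measurements.

```c
/* Effective cost of one context switch, in microseconds: the raw switch
 * plus the cache misses taken while rebuilding the working set. */
double effective_switch_usecs(double raw_switch_usecs,
                              unsigned cache_misses,
                              double miss_usecs)
{
    return raw_switch_usecs + (double)cache_misses * miss_usecs;
}

/* Fraction (0.0 to 1.0) of the effective cost that is due to cache
 * misses rather than to the switch itself. */
double miss_share(double raw_switch_usecs,
                  unsigned cache_misses,
                  double miss_usecs)
{
    double total = effective_switch_usecs(raw_switch_usecs,
                                          cache_misses, miss_usecs);
    if (total <= 0.0)
        return 0.0;
    return ((double)cache_misses * miss_usecs) / total;
}
```

With a 0.1 usec switch followed by 10 misses at 0.2 usecs each, the misses account for roughly 95% of the 2.1 usec total, which is the sense in which the switch time is noise.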

I've heard all these arguments and I've also spent a lot of time thinking
about this because I sort of want to believe that people who I respect
at Sun weren't nuts.  But in spite of that, there is no technical evidence
that they weren't - the whole threading thing was just overblown and 
they added WAY too much code and WAY too many interfaces to do something
that Rob Pike and Linus managed to do in one interface (rfork() and clone()).
If you ever go look at the Linux code for all this stuff, it's wonderful,
it's like this

	clone(... int flags ...)
	{
		clone_vm(... flags ...);
		clone_signals(... flags ...);
		clone_pwd(... flags ...);
	}

	and then each of the resources are like this

	clone_vm()
	{
		if (flags & CLONE_VM) {
			vm->refcount++;
			return (0);
		}
		/* otherwise create a new VM and set it up to be COW */
	}

It's so damn clean, it's clearly the right way to go about doing this.
I just about fell out of my chair in admiration when I read that code -
say what you will about various parts of the Linux kernel, this part is
just beautiful.
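
The share-or-copy pattern in that pseudocode can be modelled in user space. The following is a toy sketch, not kernel code: `toy_vm`, `toy_task`, `TOY_CLONE_VM` and both functions are hypothetical names, and the non-shared path makes an eager copy where the kernel would set up copy-on-write.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_CLONE_VM 0x1u   /* hypothetical flag, stands in for CLONE_VM */

struct toy_vm {
    int  refcount;          /* number of tasks using this address space */
    char data[64];          /* stand-in for the address space contents */
};

struct toy_task {
    struct toy_vm *vm;
};

/* Share the parent's VM when the flag is set; otherwise give the child
 * its own copy.  Returns 0 on success, -1 on allocation failure. */
int toy_clone_vm(const struct toy_task *parent, struct toy_task *child,
                 unsigned flags)
{
    if (flags & TOY_CLONE_VM) {
        parent->vm->refcount++;     /* sharing is just one more reference */
        child->vm = parent->vm;
        return 0;
    }

    /* Eager copy here; the real thing would mark the pages copy-on-write. */
    struct toy_vm *copy = malloc(sizeof *copy);
    if (copy == NULL)
        return -1;
    memcpy(copy->data, parent->vm->data, sizeof copy->data);
    copy->refcount = 1;
    child->vm = copy;
    return 0;
}

/* Drop this task's reference and free the VM when the last user is gone. */
void toy_release_vm(struct toy_task *task)
{
    if (--task->vm->refcount == 0)
        free(task->vm);
    task->vm = NULL;
}
```

The same shape would repeat for each resource (signal handlers, working directory and so on), which is why one flags argument is enough to describe everything between a plain fork and a thread.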

And the icing on the cake is that Linux is tight and small enough that
doing things the right way actually works - you can have just one process
abstraction and it works.  The other OS's are giving you two abstractions
because they have slow processes.  The Rob Pike quote on this is great:
``If you think you need threads then your processes are too fat.''

Subject: Re: kernel thread support - LWP's 
Date:   Thu, 15 Jul 1999 11:46:26 -0600
From: Larry McVoy <lm@bitmover.com>
Newsgroups: fa.linux.kernel

: On Wed, Jul 14, 1999 at 07:10:08PM -0600, Larry McVoy wrote:
: > Given that that isn't an issue,
: > can you think of a single technical reason why LWP's would be better?
: 
: It's been a couple of years since I used threads on Linux, so this may be
: out of date, but WTH.
: 
: If you want to send a signal, you usually want to send it to a process,
: not a particular thread.  

Nonsense.  When I do

    $ cmd | cmd2 | cmd3

and then hit ^C, I certainly do not want to kill one process, I want to 
kill all of them.  There is this age old concept called process groups 
which makes this work.  And process groups work just fine for killing a 
related group of cloned processes.
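
A minimal sketch of the process-group approach, assuming a POSIX system. The function name `kill_group_demo` and the cap of 16 children are arbitrary choices for the example. Plain fork() is used to keep it short; the claim in this message is that the same kill() call on the negated group ID works for a group of cloned processes.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 16   /* arbitrary cap for the example */

/* Fork n children into one fresh process group, send SIGTERM to the whole
 * group, and return how many children were terminated by that signal.
 * Returns -1 on bad arguments or if a fork() failed. */
int kill_group_demo(int n)
{
    pid_t pids[MAX_CHILDREN];
    pid_t pgid = 0;
    int created = 0, killed = 0, failed = 0;

    if (n < 1 || n > MAX_CHILDREN)
        return -1;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            failed = 1;
            break;
        }
        if (pid == 0) {
            signal(SIGTERM, SIG_DFL);  /* make sure SIGTERM terminates us */
            setpgid(0, pgid);          /* pgid 0 means "start my own group" */
            for (;;)
                pause();               /* sleep until a signal arrives */
        }
        if (created == 0)
            pgid = pid;                /* the first child leads the group */
        setpgid(pid, pgid);            /* set it here too, to avoid a race */
        pids[created++] = pid;
    }

    if (created > 0)
        kill(-pgid, SIGTERM);          /* negative PID: signal the whole group */

    for (int i = 0; i < created; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM)
            killed++;
    }
    return failed ? -1 : killed;
}
```

The caller stays outside the new group, so only the children receive the signal.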

: With the LWP model, there is a single PID,
: shared by all the threads, making it easy to send the signal.  Under
: Linux, you have n PIDs for n threads.  Which PID do you signal?

This is just silly beyond words.  The Linux model is _clearly_ the
superset of the LWP model.  Under Linux I can kill a specific thread
or all of the threads, using the same interfaces Unix has had since v6
or earlier.  Under the LWP model, I can kill all the LWPs.  I can't kill
a specific one.  And that's really a drag - maybe I want to use SIGUSR1
or SIGHUP to turn on debugging on a per thread basis.  Under Linux, that
just works, with no new code to be written, no new commands to be added,
no new model to be understood.
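
To make the per-thread case concrete, here is a hedged sketch for Linux using glibc's clone() wrapper. It starts two tasks that share the caller's address space (CLONE_VM), sends SIGTERM to one PID only, and checks that the other task is still running. `kill_one_clone_demo`, `start_clone`, `idle_task` and the 64KB stack size are hypothetical choices for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for clone() and CLONE_VM in <sched.h> */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)  /* arbitrary stack size for each clone */

/* Body of each cloned task: sleep until a signal terminates it. */
static int idle_task(void *arg)
{
    (void)arg;
    signal(SIGTERM, SIG_DFL);   /* no CLONE_SIGHAND, so this is private */
    for (;;)
        pause();
    return 0;
}

/* Start one task sharing our address space.  Returns its PID or -1. */
static pid_t start_clone(char **stack_out)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;
    *stack_out = stack;
    /* The stack grows down on most architectures, so pass its top. */
    pid_t pid = clone(idle_task, stack + STACK_SIZE, CLONE_VM | SIGCHLD, NULL);
    if (pid < 0) {
        free(stack);
        *stack_out = NULL;
    }
    return pid;
}

/* Clone two tasks, kill only the first, and report whether the second
 * survived.  Returns 1 if it did, 0 if not, -1 on setup failure. */
int kill_one_clone_demo(void)
{
    char *stack_a = NULL, *stack_b = NULL;
    pid_t a = start_clone(&stack_a);
    pid_t b = start_clone(&stack_b);
    int survived = -1;

    if (a > 0 && b > 0) {
        kill(a, SIGTERM);       /* signal one specific task by its PID */
        waitpid(a, NULL, 0);    /* reap it */
        /* With WNOHANG, waitpid() returns 0 while b is still running. */
        survived = (waitpid(b, NULL, WNOHANG) == 0);
    } else if (a > 0) {
        kill(a, SIGKILL);
        waitpid(a, NULL, 0);
    }
    if (b > 0) {
        kill(b, SIGKILL);       /* clean up the surviving task */
        waitpid(b, NULL, 0);
    }
    free(stack_a);              /* safe: both tasks have been reaped */
    free(stack_b);
    return survived;
}
```

Nothing here is thread-specific: kill(2) and waitpid(2) are the same calls one would use on any child process.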

: With the LWP model, you can fork() a process, and the new process can
: contain duplicates of all the parent's threads.  That would seem to be a
: challenge with the clone model.  

Err, umm, I guess I'd want to see a real world example of somebody wanting
to do this to really understand the need.  Until then, this sounds sort of
like a made up example to win an argument.

: With the m-to-n thread model used by Solaris, you write an application
: using however many threads makes sense for that application.  At run
: time, you can specify how many LWPs you want to use, and the thread
: library handles the multiplexing of the application's user-level threads
: to the OS's LWPs.  This makes it easier to write a multithreaded
: application that doesn't overload a two-processor system, but which can
: scale up to a 64-processor system.  

On Linux, it doesn't matter.  Use the right number of threads from the
beginning, the OS is more than capable of context switching them fast
enough on a 2 processor system.

: It is probably possible to implement this m-to-n model on top of clones,

But why would you want to?  The only reason is if your processes are 
so bloody slow that you must use threads.  Suppose for a moment that
process context switches and thread context switches both cost the
same, hell, let's say they both cost 0.  Then what possible reason
would there be to have user level threads?

--lm

Subject: Re: kernel thread support - LWP's 
Date:   Thu, 15 Jul 1999 13:13:24 -0600
From: Larry McVoy <lm@bitmover.com>
Newsgroups: fa.linux.kernel

: depends on the definition of nicer, clone is less complicated to implement
: except when you start dealing with signaling threads and implementing a
: M to N mapping thread library.

I ask again - why do you need to do this?  Is there any other reason than
the cost of a thread context switch vs the cost of process context switch?
If the answer is no and process context switch approximates thread context
switch, then the whole necessity for the complicated two level model goes
away.

: >It's all a crock of doo doo that was deemed to be necessary because
: >"context switches cost too much".  
: 
: well, context switches are painful as is any kernel crossing in high
: performance computing. imagine user level networking on high speed
: connections that can have round trip times in the ~50us range (this is
: a software implementation in our lab, SGI's GSN is committed to round
: trip times of around 7us roundtrip hardware latency), if you

I've (a) spent a great deal of time thinking about this very issue, and
(b) worked on GSN at SGI, and (c) am under contract with LLNL working
on exactly this issue, amongst others.  I'm pretty in tune with the
problem space and I don't see that it has any bearing on the discussion
at all.  If you are going to context switch for each packet, you can
kiss your performance good bye whether you are context switching threads
or processes.  Neither are fast enough to hit the needed 10 usec round
trip time that all the HPC folks like LLNL want.

: you could say that going to an event loop programming model would be
: much more appropriate for this system, but if you can get a thread
: model to handle this type of load, then do it.

Agreed with the first part, couldn't agree with the second part - it ain't
happening - the context switches will be kernel level context switches
whether they are "threads" or "processes" since the event generated
is a kernel level event.  Yeah, you can deliver the packet into user
space directly, but have fun getting the kernel to tell your user level
scheduler to run a new thread.  Sure it can be done, and has been done,
but an old quote of mine is "Architect: someone who knows the difference
between what could be done and what should be done".  My architect hat
says this is not "a should be done", your view may be different.

: so, i am really confused by this with the sample program attached below
: and run against the MIT pthreads user level package and the glibc
: linuxthreads implementation shows a bit of a bigger difference than you
: suggest over a 2.2.9 kernel.

That's a threads package problem.  Our numbers agree nicely on the 3.5
usecs time - I get 1.75 usecs for a 2 process context switch plus about
2.7 usecs for overhead of passing a word through pipes.  So if the glibc
is getting 20 usecs or so, they are busted.  But that has nothing to do
with our discussion here: my claim is that if the process performance is
good enough, the whole need for threads as a concept goes away - a thread
is just a different set of attributes on the process, i.e., shared VM,
signals, PWD, whatever.  And I'm assuming that you are happy with the 
3 usecs number and that's about the same as the process number.  So where's
the need for threads?

: the cache miss domination argument doesn't seem to make a lot of sense
: unless you want to believe you are flushing the whole cache which you
: aren't because you are sharing the VM space between the threads.

Caches aren't infinite in size.  Yeah, it's true that benchmarks fit
nicely in the L1 cache so we see these nice 1-3 usec context switch
numbers.  But those numbers go up by a factor of ~100 when you add 32KB
worth of data | instructions to the work load performed by each thread.
Your threads do do something, right?  So if you have N threads with a
cache working set of C bytes, then you will start generating cache misses
somewhere around N * C sized L1 caches.  As soon as you fall out of the
L1 cache, the numbers start to get dominated by the cache miss time.
If you look at the graphs you can get from lmbench, you can see the
effects of cache sizes on context switches.  I measure context switch
time as a function of number of processes and process working set, and
plot the results.  Yeah, the 2 process, 0 size numbers are ~1 usec, but
the 2 process 32K sized ones are 22 usecs, the 8 process 32K ones are
100 usecs, etc.  So once again, people can misuse benchmarks to try and
make their point but I think we are trying to get at the truth here, not
at better benchmark numbers.   And the truth is that context switch times
are _not_ represented by the 2 process, 0 size case.  That's a benchmark.
In the real world, you switch to that context to do something and that
is going to have a cost that can exceed the context switch by multiple
orders of magnitude.

Subject: Re: linux-kernel-digest V1 #4149 
Date:   Thu, 15 Jul 1999 16:33:57 -0600
From: Larry McVoy <lm@bitmover.com>
Newsgroups: fa.linux.kernel

: I could just as easily give you a situation that lends itself to
: threads.  What about when I have a process with several threads handling
: I/O events, and I get a SIGIO.  I do NOT want any one particular thread to
: receive the signal (that thread may be busy with a really long queue of
: I/Os to process already).  I do NOT want all threads to receive the signal
: (one is fine, thanks).  I want the signal to go to a thread (any thread)
: that can handle it.  THAT is what POSIX thread semantics do for signal
: handling.

This is completely orthogonal to the threads/process discussion.  The same
problem exists (and has been solved multiple ways) for processes.  I have
N processes (in Apache, for example), 1 or more of which may be blocked
in select waiting for the I/O.  The original BSD implementation of select()
woke up all of them.  Many operating systems have implemented "wake up 1"
semantics, where only one of the N blocked processes is awoken.  There are
other ways to solve the problem as well.

The point is that this isn't a thread issue - it's a generic issue.  If you
solve it for processes, you've also solved it for Linux threads (aka clones).
If you don't solve it for processes, I suppose you can conjure up a world
view in which you decide to only solve it for threads, but that's a hack,
and it's the sort of hack which is extremely unlikely to be acceptable by
Linus (or any other reasonable architect).  The right answer is to solve it
for the general case; then it works for threads and it works for problems
that don't need the thread model (HTTP servers, for example).

: Different models for different needs, but processes and process groups
: can't do everything.  Not everyone who wrote POSIX is stupid.

Linus said at BALUG not too long ago that there are two kinds of standards,
the good kind and the bad kind.  The good kind is when you've reached a 
point that you are happy with a part of the system and are willing to 
commit to not breaking those interfaces and semantics.  Writing down the
interfaces in a formal way and assuring people that they will continue
to work is a standard.  P1003.1 is an almost perfect example of this.
These sorts of standards are Good Things (tm) and are to be encouraged.

The other kind of standard is created when a set of interfaces don't
exist, but there is a need for them to exist.  Rather than wait for 
industry to figure out the right answer, a committee of people ``decide''
on the right answer.  These standards are almost universally bad things.
You can't, you simply can't, figure it all out in advance.  You need to
wait and see how things are used, how they are implemented, what works,
what doesn't.  A really brilliant mind, with a lot of experience, left
alone to do the whole thing can sometimes (it's rare but it does happen)
write down the right answer.  But a committee of people just doesn't get
it right.  The later 1003.x standards are examples of this.  There are
all sorts of hacks in there that just aren't right.  If the POSIX guys
could do the thread stuff all over again, after understanding the Linux
(aka Plan 9) clone() interface, I'd lay 10:1 odds that the POSIX threads
interface would be very, very different.

: > This is just silly beyond words.  The Linux model is _clearly_ the
: > superset of the LWP model.  Under Linux I can kill a specific thread
: > or all of the threads, using the same interfaces Unix has had since v6
: > or earlier.  Under the LWP model, I can kill all the LWPs.
: 
: Actually, you can kill any one that is not blocking the signal at the
: point that you send it (or one that has it masked but is blocked in
: sigwait waiting to handle the signal).  How do I do that under Linux right
: now?  

No, you can kill _all_ that are not blocking the signal.  How do I use kill(2) or
kill(1) to kill a specific one of them?

: I'm working on a patch that lets me do that... but it does add to
: the interface quite a bit, because now I need to distinguish between
: killing a whole process (group of tasks cloned with CLONE_PID) or killing
: a task.  

I think CLONE_PID is a crock of doo doo and should be shot dead.  It's the
wrong answer.  They aren't one process, they are N processes.  Pretending
that they are one process is just the wrong answer.

: Not from outside the process, anyway.  A process is a single entity to
: outsiders... under normal operation (not debugging), one process playing
: with the internal details of how another process is implemented seems
: pretty dumb to me.  

Using that logic, we should just remove the kill(2) interface.   I doubt
that, upon reflection, you really want to do that.

Subject: Re: kernel thread support - LWP's
Date:   Fri, 16 Jul 1999 01:28:25 -0400
From: Zack Weinberg <zack@rabi.columbia.edu>
Newsgroups: fa.linux.kernel

Khimenko Victor wrote:
> Hmm. I know it's not really related to linux-kernel but still... WHY is glibc
> getting 20 usecs or so, and can it be shrunk down to 3 usecs? Usually
> people (like Apache developers, for example) do not use clone(), they use
> pthreads: clone is nice and all but it's non-portable...

pthread_create() has to do a hell of a lot more work than just clone(). 
Part of this is the ugly implementation, part is (I think) intrinsic to the
attempt to stack POSIX thread semantics on top of clone().

There is an internal thread used by libpthread.  It's called the "manager". 
It gets created the first time you call pthread_create.  It sits blocked
in select() on a pipe.

Assuming the manager is already active, pthread_create writes a message down
that pipe saying "please create another thread".  The manager wakes up and
creates a new child, which involves a pile of work both before and after the
call. In the end it kicks the original thread back awake with a signal.

The minimum sequence is

<calling thread>
write
sigsuspend
<switch to manager>
select returns
getppid
read
geteuid
mmap, twice
clone
<switch to child>
getpid
<back in manager>
kill
select
<calling thread is awake again>

That's _minimum_.  Add at least two calls to sched_setscheduler (one parent,
one child) if you run RT threads.  There might be N calls to waitpid in
between getppid and read.  Plus there's a bunch of copying data around in
user space.  There are some mutexes that have to be acquired and dropped. 
Etc.

Why is it doing all this crap?  Because of another requirement of the POSIX
thread spec, which hasn't got any air in the conversation yet: Any thread in
the process can do the equivalent of waitpid for any other thread in the
process.  Linux doesn't allow that; you can only wait for a task that you
yourself spawned (unless you are init).  The workaround is to make all the
threads children of the "manager".  It waits for them and passes back exit
statuses to the threads calling pthread_join.
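
The restriction described above is easy to observe. In the sketch below (the name `sibling_wait_errno` is hypothetical, and plain fork() stands in for clone()), one child tries to wait for its sibling and the call is expected to fail with ECHILD, because only the parent may wait for a given child.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork child A, then fork sibling B and have B try to waitpid() for A.
 * Returns the errno that B observed (expected: ECHILD), 0 if the wait
 * unexpectedly did not fail, or -1 on setup failure. */
int sibling_wait_errno(void)
{
    int result = -1, status;

    pid_t a = fork();
    if (a < 0)
        return -1;
    if (a == 0) {
        for (;;)
            pause();            /* A just sleeps until it is killed */
    }

    pid_t b = fork();
    if (b == 0) {
        /* B is A's sibling, not its parent, so this should fail. */
        int err = (waitpid(a, NULL, WNOHANG) < 0) ? errno : 0;
        _exit(err);             /* small errno values fit in an exit status */
    }

    if (b > 0 && waitpid(b, &status, 0) == b && WIFEXITED(status))
        result = WEXITSTATUS(status);

    kill(a, SIGKILL);           /* clean up A; only its parent can reap it */
    waitpid(a, NULL, 0);
    return result;
}
```

This is the behavior the manager works around: by making every thread its own child, it becomes the one task that is allowed to collect all of their exit statuses.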

There's another lovely detail: The "initial thread" (the one that was
executing before the first call to pthread_create) isn't a child of the
manager; it's the manager's parent.  Per POSIX, the manager has to notice when the
initial thread goes away and kill all the others.  That's why there is a
getppid in the above.  The select times out every two seconds just so it can
check.

---

If there were a way to specify - ideally at clone time - that any task in
some group (process group would be fine with me) can wait for this child,
then the above gunk could be reduced to clone() plus the work of setting up
the child's stack (the mmaps) and user-level context (data for
pthread_setspecific, per-thread errno, etc.)  And that would get you a hell
of a lot closer to 3usec thread create times.  If you had a
CLONE_VM_WITH_NEW_STACK that created a new stack for the child (still visible
to both) then more overhead would go away, plus clone() wouldn't need a
special assembly stub that only works if you use CLONE_VM.

It would also be handy to have a "disown" call which had the effect of
immediately reparenting the target process to init.  Currently "detached
threads" have to be waited for too.

zw

Date:   Sat, 11 Sep 1999 02:50:04 -0700
From: Larry McVoy <lm@bitmover.com>
Subject: Re: linux threads vs. solaris threads
Newsgroups: fa.linux.kernel

On Sat, Sep 11, 1999 at 11:04:43AM +0000, Hrafnkell Eiriksson wrote:
> Few days ago in my concurrent programming class the teacher stated
> that the time it takes the OS to switch between threads (i.e. threads 
> created with pthread_create()/clone(CLONE_VM|CLONE_FILES etc)) was
> probably higher in Linux than in Solaris (and other commercial Unices)
> on the same hardware.
>
> (and no we were not confusing Linux threads and Solaris 
> userspace threads, we were talking about kernel level threads in
> both systems).

In that case, I doubt it.  Linux is a substantially lighter weight system
than Solaris.  Given that you are talking about kernel supported threads,
the work that Linux does is pretty much the same as the work that Solaris
does.  

> I found the lmbench benchmark that tries to measure the time it takes
> to do a context switch. Checking the lmbench source shows that it
> creates new processes with fork() and measures the time it takes
> to do a context switch between them. I don't think that this measurement
> is a fair measurement for the time it takes to switch between threads
> as there is no need to flush the TLB which has to be done when doing a
> switch between processes if I have understood the purpose of a TLB
> correctly (but I might be wrong as I don't understand this well enough).

Almost all (maybe 100%) modern hardware has context ids in the TLB.  The
context ID is basically the same as a process id.  So there is no need to
flush the TLB, the entries are different for each process.

> So I concluded that lmbench was not the right tool to measure this and
> answer my question.

Don't be so quick to dismiss it.  Running lmbench's context switch code on
the same hardware but on different operating systems tells you something:
it tells you how fast each OS can switch the _heaviest_ weight process
it has.  That is itself interesting.  It's true that there can be lighter
weight objects in the system, but see below for a rationale why this is
not an issue.

> As far as I understand, switching between threads in Linux means
> changing the PC counter, the stackpointer and restoring the registers
> and FP state. Solaris probably does something similar.

That's about right.  So if both OS's are saving and restoring exactly
the same state, why should there be any difference in performance?

The answer is that the state saving is really a very small part of the
context switch path in Solaris, at least.  If that was not true then
the context switch times between user level threads and the context
switch times between kernel level threads would be close to identical.
They aren't.

So what's the cost difference?  The difference is the cost of getting to
the point in the kernel where you decide to switch.  That cost is 
substantially higher in Solaris.  Solaris is a fine grain threaded system
with 1000's of kernel level locks.  A number of those locks are taken
on the way to the context switch code.  That and other code present in
Solaris but not present in Linux accounts for the difference.

This is a clear example of why scaling can be viewed as "bad".  It comes
with a cost.  While it is true that a fine grain threaded OS can take
better advantage of more processors, it's also true that it takes worse
advantage of fewer processors.  They are robbing Peter to pay Paul.

> Can anyone here help me determine if my teacher is right or wrong on
> this (or isn't there a right/wrong answer? :) or point me
> to webpages, articles, books or benchmarking tools?

You can take the lmbench timing harness and trivially create a 
benchmark which uses threads instead of processes and see what
the numbers are.

> Also, is there any more "context switch" involved in switching 
> between threads in Linux than in Solaris?  A Linux thread
> created with the clone() system call has the same virtual 
> memory and same file descriptors as other threads running
> with it within the same process so there isn't really any
> context to switch between except changing the state of
> the CPU.

Something to note: Linux actually shares page tables between processes
when the processes are sharing the same virtual address space.  The
page tables are an attribute of the VM, not of the process.  There's
normally a 1:1 mapping but when you clone there is a N:1 mapping.  

That's really cool because it means that Linux cloned processes are
every bit as lightweight as Solaris threads, there is no technical
reason for that not to be true.
-- 
---
Larry McVoy            	   lm@bitmover.com           http://www.bitmover.com/lm

Date:   Sat, 11 Sep 1999 03:36:40 -0700
From: "David S. Miller" <davem@redhat.com>
Subject: Re: linux threads vs. solaris threads
Newsgroups: fa.linux.kernel

   Date:   Sat, 11 Sep 1999 02:50:04 -0700
   From: Larry McVoy <lm@bitmover.com>

   Almost all (maybe 100%) modern hardware has context ids in the TLB.

Unfortunately, the cpu with the largest market share, x86, does not.

So if you aren't context switching between threads sharing the same
address space, the whole TLB is in fact flushed.

But other than x86, I cannot think of any other cpu in use today which
does not have TLB context ids.

Later,
David S. Miller
davem@redhat.com
