Microkernels (Linus Torvalds)

Index Home About Blog

From: torvalds@transmeta.com (Linus Torvalds)
Newsgroups: comp.os.linux.development.system
Subject: Re: Byte Article by barryn@bix.com
Date: 11 Mar 1998 00:44:52 GMT

In article <3CD2CEB661C6ADF3.A3A2A06B9C19614F.266B53C41F2561CD@library-proxy.airnews.net>,
Eric Lee Green <e_l_green@hotmail.com> wrote:

>But why mess with a MkLinux port when an L4 port is faster and has a
>better microkernel? See: http://os.inf.tu-dresden.de/L4/ for more
>info.

The L4 kernel is certainly a lot better designed than Mach, and has none
of the cruft.

One interesting issue in the design of the L4 kernel is that it is _not_
meant to be particularly portable: it very much admits that on a
microkernel level there are lots of reasons why you want to avoid
portability due to performance issues.

>Looks interesting. There was a average 8%-10% performance loss
>compared to "native" Linux, but they note that much of that could be
>recouped by re-writing various "native" Linux services as L4
>services. As an example they created a "pipe" microkernel service
>which actually ran FASTER than "native" Linux pipes. I suspect that
>the overhead can be brought down to less than 5% compared to a
>monolithic kernel, at which point the maintainability and ease of
>porting of a microkernel-based OS kernel will make it quite
>attractive.

"Maintainability" and "ease of porting" of microkernels are both pipe
dreams that have absolutely no basis in reality.  They are widespread
beliefs, but they are beliefs that are spread by certain establishments
without having any factual basis behind them - rather like the religious
beliefs in the middle ages ("everything revolves around the earth" and
"microkernels are more portable" are roughly equivalent in many senses).

Linux is already more portable than many microkernels, and the Linux
memory management is in a class of its own.  Mach is probably one of the
most ported microkernels, and mach is a great example of a rats nest of
horrible code that nobody understands, much less maintains.

Almost _all_ the problems with porting an OS are due to device drivers
once you get past the initial hurdles (which linux got past several
years ago now).  There are some bootstrapping issues (getting a working
compiler, getting a working boot sequence etc) that can definitely be
nasty, but they are nasty regardless of whether you have a microkernel
or not.

		Linus

From: torvalds@transmeta.com (Linus Torvalds)
Newsgroups: gnu.misc.discuss,comp.os.linux.misc,comp.os.linux.x
Subject: Re: Tanenbaum (was Re: X11R6.4 -- What are your thoughts?)
Date: 9 Apr 1998 20:29:47 GMT

In article <6git4l$4hv$2@justus.ecc.lu>,
Stefaan A Eeckels <Stefaan.Eeckels@ecc.lu> wrote:
>
>That's mainly because Linux is a 'traditional monolithic kernel'
>design instead of a microkernel. Current academic thoughts
>on kernel development are firmly in the microkernel camp.

Actually, it seems to be that _current_ academic thoughts have started
moving away from microkernels.  They are much more varied than they were
5 years ago, with people still being MK proponents (although even the MK
proponents tend to call them "nanokernels" in order to distance
themselves from horrors like Mach), but they've had their own backlash
for more traditional systems, and there are also more people doing even
stranger things (exokernels).

At least from what I've seen, it's no longer a "microkernels or die"
attitude, but a much more varied (and somewhat saner) climate. But maybe
I've been talking to the wrong people.

		Linus

From: torvalds@transmeta.com (Linus Torvalds)
Newsgroups: comp.sys.mac.advocacy,comp.sys.next.advocacy,comp.os.linux.advocacy
Subject: Re: Linus Torvalds on micro-kernals
Date: 28 Feb 1999 18:37:25 GMT

In article <36D56879.2E02664F@cadence.com>,
Simon Kinahan  <simonk@cadence.com> wrote:
>tharsos@bigfoot.com wrote:
>>
>> In comp.sys.next.advocacy Nathan Urban <nurban@crib.corepower.com> wrote:
>> : No comment on my own position, but I just ran across this on Slashdot:
>>
>> : "I'm not saying that they were knowingly dishonest, perhaps they were
>> : simply stupid."
>>
>> :   -- Linus Torvalds, commenting on those who really thought Microkernels
>> : were wise. (Open Sources, 1999 O'Reilly and Associates)

Note that the above is actually part of the polite version: my original
stuff called microkernels an excercise in masturbation, and the O'Reilly
people changed that to "self-gratification".

>Thats pretty unfair of Linus, to be honest. Microkernels have advantages,
>they just tend not to be very apparent on uniprocessor machines, or to have
>much bearing on the things most users do.

Umm.  Microkernels don't have any advantages I'm aware of.  All the
much-touted "advantages" have either been outright lies ("simplicity")
or things that have very little to do with microkernels ("modularity").
Some of them are just ridiculous and obviously dishonest
("performance").

Whether the microkernel proponents are knowingly lying or just misguided
(or outright evil) is still open to discussion, though.

Basically, pro-microkernel arguments tend to all be rather simplistic,
and always ignore the other part of the equation. That is, in my
opinion, the worst kind of academic lie: soemthing that you try to make
sound plausible, without actually having any critical thought. Examples:

 - Microkernels are simple, because each part is simple.

   Bzzt! Wrong! The above argument completely ignores the fundamental
   issue of interaction between the parts, which is where all the
   complexity comes in in the first place. Basically, they trade a
   "easy" difficulty for the truly hard one.

 - Microkernels are modular, because they can be written in modules.

   Bzzt! Dishonest! Sure, microkernels can be modular, but so can
   monolithic kernels. It's not an issue of microkernel vs monolithic,
   it's an issue of programming. But the microkernel people try to imply
   that this is somehow a microkernel issue.

   What a lying bunch of incompetents.

 - Microkernels are as fast or faster.

   Bzzt! Dishonest.  Usually the argument goes that in theory, you can
   spend a lot of time speeding up a microkernel to the point where the
   speed difference is megligible, and then the other so-called
   "advantages" of microkernels will make up for the rest.

   The even more dishonest answer is that you can optimize your
   microkernel so that it is faster than some other (productized)
   monolithic kernel.

   Dishonest: it assumes that nothing can be done on the monolithic
   kernel.  That's like saying "if I ate steroids for 15 years, I would
   be stronger than my neigbour who doesn't eat steroids, so I must be
   stronger".

 - The list goes on.

The good part is that many people _have_ realized that the microkernel
promises were mostly just pretty but empty words on a grant application.
So microkernels aren't any longer considered "de rigeur" in the OS
community.

			Linus

From: torvalds@transmeta.com (Linus Torvalds)
Newsgroups: comp.os.qnx,comp.sys.amiga.misc,comp.os.linux.misc
Subject: Re: Amiga, QNX, Linux and Revolution
Date: 3 Sep 1999 19:03:10 GMT

In article <37cf9cb5.2992973@news.demon.co.uk>, John Birch <nospam> wrote:
>
>QNX does a number of things right that Linux does flat wrong (true
>_uncrashable_ (almost) micro kernel, real time performance etc)

Ehhh..

Sure, the QNX microkernel is pretty uncrashable. But have you ever asked
yourself why? Maybe because it doesn't do all that much.

Put it in a general-purpose system, do some real work with it, open it
up to people who aren't polite, and see what happens. Not many people
care that the microkernel hasn't crashed when everything else has.

>I'm sure you're right, the problem for QNX is that few people know how
>good it is because it is so expensive (aimed at a different market).

It's good for that market.  But think about that _really_ means for a
moment.  Don't make the mistake of extrapolating goodness in a very
specialized market into goodness a more real-life and much less
constrained market.

		Linus

From: torvalds@transmeta.com (Linus Torvalds)
Newsgroups: comp.os.linux.development.system
Subject: Re: The Future of LINUX ??
Date: 1 Apr 1998 22:25:19 GMT

In article <vc7ogyl2nv6.fsf@jupiter.cs.uml.edu>,
Albert D. Cahalan <acahalan@jupiter.cs.uml.edu> wrote:
>mpa@squawk.klue.on.ca (Marco Anglesio) writes:
>>
>> Why did linux go with a monolithic kernel instead of a microkernel?
>> Because it started that way.
>
>Nope. This is a FAQ. Microkernels are academic bullshit.
>Burn any books you have that are written by Andrew Tanenbaum.

No, please don't. The "Operating Systems: Design and Implementation"
book is a really good book (I've only read the first edition, but I
assume the second one makes sense too). Well written, understandable,
and brings out the right points.

The fact that ast is a microkernel proponent doesn't mean that he
couldn't write well.

		Linus

Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Subject: Re: Multics Concepts For the Contemporary Computing World
From: torvalds@penguin.transmeta.com (Linus Torvalds)
Message-ID: <1057178824.74582@palladium.transmeta.com>
Date: Wed, 02 Jul 2003 20:46:45 GMT

In article <kscudb.krn1.ln@acer>, Morten Reistad  <mrr@reistad.priv.no> wrote:
>
>The Linux people have the nfs server still in user mode last I saw.

Nope. That was five years ago. Nobody uses the user-space server for
serious NFS serving any more, even though it _is_ useful for
experimenting with user-space filesystems (ie "ftp filesystem" or
"source control filesystem").

><rant>
>Why do the file systems have to be so tightly integrated in the "ring0"
>core? This is one subsystem that screams for standard callouts and
>"ring1" level.
></rant off>

Because only naive people think you can do it efficiently any other way.

Face it, microkernels and message passing on that level died a long time
ago, and that's a GOOD THING.

Most of the serious processing happens outside the filesystem (ie the
VFS layer keeps track of name caches, stat caches, content caches etc),
and all of those data structures are totally filesystem-independent (in
a well-designed system) and are used heavily by things like memory
management.  Think mmap - the content caches are exposed to user space
etc.  But that's not the only thing - the name cache is used extensively
to allow people to see where their data comes from (think "pwd", but on
steroids), and none of this is anything that the low-level filesystem
should ever care about.

At the same time, all those (ring0 - core) filesystem data structures
HAVE TO BE MADE AVAILABLE to the low-level filesystem for any kind of
efficient processing.  If you think we're going to copy file contents
around, you're just crazy.  In other words, the filesystem has to be
able to directly access the name cache, and the content caches. Which in
turn means that it has to be ring0 (core) too.

If you don't care about performance, you can add call-outs and copy-in
and copy-out etc crap. I'm telling you that you would be crazy to do it,
but judging from some of the people in academic OS research, you
wouldn't be alone in your own delusional world of crap.

Sorry to burst your bubble.

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Multics Concepts For the Contemporary Computing World
Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Message-ID: <1057348540.574919@palladium.transmeta.com>
Date: Fri, 04 Jul 2003 12:55:40 -0700

Douglas H. Quebbeman wrote:
> Linus Torvalds <torvalds@penguin.transmeta.com> wrote:
>>
>> If you don't care about performance, you can add call-outs and copy-in
>> and copy-out etc crap. I'm telling you that you would be crazy to do it,
>> but judging from some of the people in academic OS research, you
>> wouldn't be alone in your own delusional world of crap.
>
> I think most of us care about performance.
>
> But I always thought one of the main reasons for chasing the
> ever-increasing hardware performance curve was to make it possible
> to write code in a high-level manner, and have it run fast enough
> to be useful.

Why would you ever do that for an operating system, though?

When you make your own application slower, that's _your_ problem, and
the rest of the world doesn't really mind. If they find your application
useful enough, they'll use it. And if it's too slow, they might not. Not
everybody can just buy faster hardware.

However, when you make your OS slow, you make _everything_ slow.

Also, what's the point of writing low-level code in a high-level manner?
High-level code hides the details and does a lot of things automatically
for the programmer, but in an OS you are constrained by hardware and
security issues, and a lot of the time you absolutely MUST NOT hide the
details.

>        Getting away from the bit-twiddling and relying on
> higher-level constructs makes it possible for us to capitalize
> on our past efforts more efficiently. Each new layer brings new
> metaphors that permit programmers to get their tasks done faster.

But the OS _is_ one of those layers. The whole point of layering is
you use independent concepts on top of each other to make the higher
levels more pleasant to use.

The important part here is "independent". The concepts should be
clearly above or below each other, not smushed together into a
unholy mess of 'every single abstraction you can think of'.

Leave the OS be. Put your abstractions on top of the solid ground
of an OS that performs well.

> I always hoped these benefits would be available not only to
> applications programmers, but to us systems programmers as well.

If you want OS protection, you work outside the OS. It's that simple.

You can do a lot of "system programming" outside the OS. Look at X,
or look at any number of server applications (apache etc). But don't
make the mistake of thinking that because a lot of services _should_
be done outside the kernel, that means that you should do all of them.

Filesystems are just about _the_ most performance critical thing in
an operating system. They tie intimately with pretty much everything.

In particular, filesystems are a hell of a lot more important than
message passing - you're better off implementing message passing on
top of a filesystem than you are the other way around.

In short: when I get a faster machine, I'd rather use that extra speed
to make the machine more useful to me, than waste it on stuff where it
doesn't help and where it makes no sense.

                        Linus

From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Multics Concepts For the Contemporary Computing World
Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Message-ID: <1057372217.336719@palladium.transmeta.com>
Date: Fri, 04 Jul 2003 19:30:16 -0700

Nick Maclaren wrote:
>>
>>Why would you ever do that for an operating system, though?
>
> To increase its extensibility by a factor of 10, its debuggability
> by a factor of 100, its reliability by a factor of 1,000 and its
> security by a factor of 10,000.  No, I am NOT joking - those numbers
> are achievable.

If you're not joking, you have your work cut out for you. The proof is
in the pudding, and I hereby claim that you're seriously naïve if you
truly believe that.

But hey, prove me wrong. I'm pragmatic: I'd happily be proven wrong
by somebody actually showing something. It sounds like so much hot
air to me so far, though.

> Just for the record, Linux is at least as good as most commercial
> systems in those respects, so I am talking generically.

Completely ignoring all performance issues, I will tell you why you're
wrong on all counts: it's a hell of a lot more difficult to write, debug
and validate communications than it is to do the same for a monolithic
program.

In short, you're making the _classic_ microkernel mistake. You are
comparing apples to oranges. Your argument goes like this:

 - by making the filesystem an independent entity, it becomes simpler
   than the full OS would become, and as such it is more easily written
   and debugged.

This is a stupid and completely illogical comparison. You compare one
subsystem to the whole, and you think the comparison is valid. Yes, the
filesystem itself might be easier to write THAN THE WHOLE OS, and it
might be easier to debug THAN THE WHOLE OS. But do you see the fallacy?

The system as a whole actually got _harder_ to debug, because when you
split the thing up, you added a layer of communication and you _removed_
the possibility to trivially debug the two parts together as one.

You hamstrung the parts by not allowing them to share data and thus you
introduced the problem of data coherency that didn't even exist in the
original design because the original design didn't need them. The original
design could share locks and data freely.

To be blunt: have you tried debugging deadlocks and raceconditions in
threaded applications that use multiple address spaces?

Put it another way: have you tried debugging asynchronous systems
with complex interactions? It's not as simple as debugging one part
on its own. A lot of the problems only show up as emergent behaviour.

For example, one of the most interesting parts of filesystem behaviour
is the handling of low-memory situations and the shrinking of the caches
involved. You can avoid it by never caching anything in the filesystem,
but that tends to be a bad idea.

My favourite analogy is one of the brain. The complexity of the brain
is _bigger_ than the sum of the complexity of each neuron. Because the
_real_ complexity is not in the neurons themselves (and no, they aren't
exactly trivial, but some people think they understand them reasonably
well), but in the patterns of interactions that happen: the feedback
cycles that keeps the activity in balance.

This is the kind of complexity you really really don't want to debug.
It's wonderful when it works, but part of why it's so wonderful is the
fact that it's almost impossible to understand _how_ it really works,
and when things go wrong you have a really hard time fixing them.

So when you say that message passing increases debuggability by two
orders of magnitude, I laugh derisively in your direction.

I claim that you want to start communicating between independent modules
no sooner than you absolutely HAVE to, and that you should avoid splitting
things up until you really need to, because that communication complexity
often swamps the complexity of the actual pieces involved in it.

                Linus

From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Multics Concepts For the Contemporary Computing World
Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Message-ID: <1057375190.35698@palladium.transmeta.com>
Date: Fri, 04 Jul 2003 20:19:49 -0700

Peter da Silva wrote:
>
> Linus, you're already doing it for your operating system. How much do you
> think you could speed up Linux by getting rid of memory protection and
> multiuser protection? Your system call overhead could go down from
> microseconds to nanoseconds, the Amiga "system call" was four instruction
> long.

No. That's a user-visible API thing, and as such unacceptable. If the
OS doesn't give protection to the user programs, it is not in my opinion
in the least interesting.

So what you are suggesting doesn't make sense. It's like getting the
wrong answer - it doesn't make for a faster system, simply because
system is now no longer doing the right thing.

However, _within_ the OS, protection doesn't buy you much. We have
some internal debugging facilities that slow things down enormously
if you enable them, but they are literally only meant for debugging,
and they play second fiddle to the design.

And NOT having protection within the kernel itself is actually a
huge win. And while performance is important, the big win is that
without protection you can simply solve a lot of problems in a much
more straightforward manner.

> You put up with the performance hit because you get a corresponding
> benefit.

Clearly any engineering problem always ends up being a cost/benefit
analysis. That's what makes it engineering, not science.

So in that sense what you're saying is a tautology.

At the same time, I claim that you're just wrong. Because this is one of
the areas where it is NOT a question of cost vs benefit, but one of the
few areas where it is a question of simple user requirements. Not having
protection between programs and users is simply not an option.

> The same thing can be said for all the different layering mechanisms
> used inside the kernel.

Actually, in almost all cases the benefit of layering in the kernel is
 - better performance
 - less duplication

Really. Layering done right makes it _easier_ to write good code, and
layering should never be a performance issue unless it is badly
designed.

STL is an example of good layering design (in C++): it's very much
designed so that the compiler at least in theory - and reasonably often
in practice too - can do the right thing with little or no performance
downside.

Similarly, the examples of layering that Linux uses extensively (starting
from the use of C over assembly, to having various ground rules on how to
write architecture-neutral code, to having things like a VFS layer that
handles most of the common code in filesystems) literally do improve
performance. If they didn't, they'd be badly designed.

For example, the VFS layer gives filesystems a generic notion of a page
cache for maintaining caches of the file (and directory) contents. Yes,
the low-level filesystems could do it themselves, but not only does this
layering avoid duplicated work, it actually improves performance, because
it means that we have a _global_ cache replacement policy that very much
outperforms something that could only work on one filesystem at a time.

Similarly, when I rewrite the basic VM subsystem for the first Linux port
to the alpha, and had to virtualize the page tables, I actually ended up
improving performance even on x86. Why? Because the layering itself didn't
add any overhead (trivial macros and inline functions used to hide the
differences), but by being done right, it made the code easier to follow,
and actually made some bad decisions in the original code clear.

The same goes for the choice of a C compiler over hand-written assembly.
Nobody sane these days claims that handwritten assembly will outperform
a good compiler, especially if the code is allowed to use inline asms
for stuff that the compiler can't handle well natively.

See? It's a total and incorrect MYTH that layering should be bad for
performance.  Only _bad_ layering is bad for performance.

This is why I rage against silly people who try to impose bad layering.
It's easy to recognize bad layering: when it results in a clear performance
problem, it is immediately clear that the layering was misdesigned.

And in the end, that is my beef with microkernels. They clearly
_are_ badly designed, since the design ends up not only having
fundamental performance problems, but actually makes a lot of things
more complex as opposed to less. QED.

> So the question isn't "why would you do that for an operating system", the
> answer to that is easy: "because it makes the OS more stable, and makes
> driver development easier, and allows you to expand what a non-root user
> is allowed to do". The question is, "how much overhead are we talking
> about" or "what are the tradeoffs" or "where would you start?"...

No. I claim that trying to split off the filesystems out of the low-level
kernel is a layering violation, because it is clearly a case of bad
layering.

And performance is part of it, but so is (very much) the fact that it
also makes it much harder to maintain coherent data structures and graceful
caches.

>> Also, what's the point of writing low-level code in a high-level manner?
>
> What's the point of writing the OS in C instead of assembly?

Your argument is nonsense.

I very much point to the use of C instead of assembly as a performance
IMPROVING thing, and thus clearly good layering. I claim that a good
C compiler will outperform a human on any reasonably sized project,
and that the advantages are obvious both from a performance and a
maintenance standpoint.

>> Leave the OS be. Put your abstractions on top of the solid ground
>> of an OS that performs well.
>
> Loadable drivers.
> File systems.
> Pseudodevices.
> Layered file systems.

None of these are even visible in a performance analysis.

I don't much like loadable drivers myself, and I don't use them, but nobody
has actually ever shown the theoretical downside (because they are loaded
into the virtual kernel address space rather than in the 1:1 mapping that
the main kernel uses, they have somewhat worse ITLB behaviour in theory).

The others do not have any performance problems, and as mentioned things
like the FS virtualization actually has performance benefits.

> On the Amiga, where "inside the kernel" you had high level abstractions
> for all the interfaces (a file system, for example, was just a program
> that registered itself as a file system and accepted file system
> messages), people were producing some amazingly cool and *clean*
> system-level objects... and they didn't even have OS source available!

You're not arguing against my point at all.

Good abstraction has zero performance penalty, and can make the system
much more pleasant to work with. And I claim that this is the DEFINITION
of good abstraction.

Microkernels and message passing fail this definition very badly. (The
Amiga had a "sort of" message passing, but since in real life it was
nothing but a function call with a pointer, that was message passing on
the _abstract_ level, with no performance downside. That's fine as an
abstraction, but it obviously only works if there is no memory protection)

                        Linus

From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Multics Concepts For the Contemporary Computing World
Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Message-ID: <1057423832.645700@palladium.transmeta.com>
Date: Sat, 05 Jul 2003 09:50:32 -0700

Nick Maclaren wrote:

> OK.  So why can't I get access to the kernel data structures from my
> program?  I really would be able to like to place my data structures
> for best effect, control the TLBs and so on.  I could get MUCH better
> performance under many circumstances that way!

Bad example.

Because you _can_ get access to the TLB entries (well, the page tables,
or on processors with fixed entries like PPC, the BAT registers) from
your program. The interfaces are there exactly because some programs
literally do care that much, and care about things like specific physical
pages.

It's a pain to do, and very few programs care enough to use it. But
big databases want to control their TLB behaviour, so they have access
to what the kernel calls "hugetlb" - basically you can bypass a large
part of the VM by mapping a hugetlb area. It won't swap for you, and
it will have various architecture-defined limitations (ie on x86 the
area will have to be either 2M or 4M aligned/sized).

And other system programs need access to raw physical pages, for AGP
mapping and starting DMA from user space. The interfaces exist, it's
just that they tend to not be used very widely because they are clearly
not portable, and they are a pain to use and administer (all of these
have security issues).

The rule is: provide an abstraction, but if you really need to, be
willing to _break_ the abstraction.

                Linus

From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Multics Concepts For the Contemporary Computing World
Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Message-ID: <1057426667.567100@palladium.transmeta.com>
Date: Sat, 05 Jul 2003 10:37:47 -0700

Linus Torvalds wrote:

> Bad example.
>
> Because you _can_ get access to the TLB entries (well, the page tables,
> or on processors with fixed entries like PPC, the BAT registers) from
> your program. [ ... ]

Actually, thinking about it, it's not a bad example.

Yes, you can control the TLB from user space to some degree, because we
can make the interfaces available for it. But no, you can't control the
placement of the rest of the kernel data structures to make sure that
you don't get any interaction between the kernel TLB accesses and your
own user TLB accesses.

That's a problem with separation in general: some things that might be
trivial if they weren't separate end up being very hard to do if you split
them up, because there are no sane interfaces for them. When you can't
just directly interface to the internal data structures, you sometimes
end up being screwed.

So the kernel ends up exporting _some_ functionality that is obvious
enough, but at some point you can't get any more because the interfaces
to export internals get too hairy and pointless.

Somebody (sorry, forget who) mentioned a Tandem "filesystem outside the
OS" example. But I bet that ended up putting the block IO layer inside the
filesystem, and didn't support very many alternate filesystems - once you
split it up that way, it ends up often being very hard to do things that
others find trivial, exactly because the split makes it hard to access state
on the "other side".

So without knowing the Tandem OS, I will just make a wild stab at guessing
what the split resulted in: the filesystem process maintained all caches
internally to itself, and the system probably had some interface to set
aside <n>% of memory for that filesystem process for caching. That's not
the only way to do things, but it's a fairly obvious approach, and the
alternatives would probably tend to get rather complicated.

This is why you don't want to split too early. At the same time you _do_
want to split at some point, and the point should be the one that has the
simplest requirements for the interfaces you provide.

I maintain that if you want to make something that looks like UNIX, the
split has to be at the system call interface level and nowhere else. Using
message passing at a lower level ends up being just plain stupid.

If you want to make a system that doesn't care about compatibility, you
may have more freedom. Of course, with that freedom comes the fact that
nobody actually wants to use your system, which gives you even _more_
freedom, since then you don't have to worry about those pesky users and
their real-life problems.

                        Linus

From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Microkernels are not "all or nothing". Re: Multics Concepts For 
	the Contemporary Computing World
Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Message-ID: <1057474131.367543@palladium.transmeta.com>
Date: Sat, 05 Jul 2003 23:48:50 -0700

Peter da Silva wrote:

>> No. That's a user-visible API thing, and as such unacceptable.
>
> What's a user-visible API?

Your program crashing when somebody else does something bad. That's
a rather basic user-visible API - it's just that the API is not "yours".

Which is why it is unacceptable. It doesn't matter that you parade a
lot of OS's where it was acceptable - those OS's are dead and forgotten,
and largely exactly _because_ of the fact that they had a serious
user API bug - namely the API of letting other users screw you over.

(Of course, that's not how it is described in the API sections. It
gets described in nicer ways, so instead of "other users can screw
you over", it's called "fast message passing" or similar)

> Then there's things like gethostbyname(), which *seems* like it should be
> a system call like open(), but it's not... so why is open() in the kernel?
> The whole mechanism of namei() could easily be pulled out into libc and
> all of a sudden you could open("http://ftp.uu.net/..", "r").

Actually, you can't.

The reason you can't pull namei() into libc is the same reason you can't
put things like TCP/IP into libc. It needs to interact with other programs
and maintain consistency across those other programs - without other people
being able to screw things over. Again, notice how the security part is very
much part of the API.

(Again, systems like PalmOS 4 that didn't have the security part _do_
actually do TCP/IP effectively on a library level. As did some others.
But history has time and again shown that having insecure systems simply
isn't acceptable - even if it starts out acceptable, it will later become
unacceptable when the uses grow).

In other words, there are two things that distinguish system calls:

 - protection domains ("other users can't screw up your world view"). This
   is usually shared data or HW access.

 - performance ("you can't afford to emulate it on top of other system
   calls")

These two (and _only_ these two - if you ignore binary compatibility issues
that may force other decisions on you) are what determine whether it is a
"true" system call or a library.

This explains, for example, why "gethostbyname()" isn't a system call: while
it does require a protection domain (you can't let other processes mess up
your view of name lookups), it does not have the performance requirement. As
a result, it is universally implemented on top of other trusted mechanisms,
ie reading trusted configuration files (and then acting on them, often by
using a trusted connection mechanism).

See?

Apply those two simple criteria to the system calls, and none of them will
be surprising. They all flow from those two things (and legacy. Never under-
estimate the importance of legacy. It easily gets forgotten in theoretical
arguments, but it often ends up subsuming everything else).

>> If the
>> OS doesn't give protection to the user programs, it is not in my opinion
>> in the least interesting.
>
> Your life depends on operating systems like that, every day.

So? My life depends even more every day on plants that generate the oxygen
I breathe. What does that have to do with anything? It doesn't mean that
I should find plants at all interesting for what I do.

Btw, even in the embedded world people are more and more looking to general
purpose OS's. The main reason they don't use them is that hardware used to
be expensive. It still is, but...

> That argument can turn around and bite you, for someone else "the right
> thing" may require a microkernel.

Hey, go wild. I don't care. Use them if you want. But I came in to this
discussion as a response to this:

        <rant>
        Why do the file systems have to be so tightly integrated in the "ring0"
        core? This is one subsystem that screams for standard callouts and
        "ring1" level.
        </rant off>

and I wanted to inform the ranter that he was, indeed, ranting, and clearly
naïve, since the problem absolutely DOES NOT "scream for standard callouts".
I've given several reason why you absolutely do not want to have callouts.

Go ahead play with your microkernels, life-sustaining or not. I'm answering
the "Why do the file systems have to be so tightly integrated", and trying
to explain why anybody who rants about it is _wrong_.

Because if you care about performance, and you want a traditional OS
(general purpose, and with memory protection, not some embedded thing that
can't be used anywhere else), then you absolutely do want that tight
integration between VM and filesystems and system calls.

> This is like the Minix debate ten yours ago. You're looking at a system
> that isn't designed for production use and think that it's proof that
> you can't build a system like that for production use.

You're ignoring facts.

Lots of people have designed microkernels for production use. They all
failed as general-purpose operating systems.

Not minix. Every d*mn last one.

>> And NOT having protection within the kernel itself is actually a
>> huge win. And while performance is important, the big win is that
>> without protection you can simply solve a lot of problems in a much
>> more straightforward manner.
>
> Right. Look at the Palm, for example. PalmOS 4 has no protection, and
> there's all kinds of clever things people have written that change the
> user interface in clever ways... but that are no longer possible on PalmOS
> 5 which has protection.

Exactly. Because even on a small hand-held device, you in the end absolutely
require memory protection, and that secure API.

Go back to my list of two requirements for system calls. Ponder.

> In operating systems where this kind of thing is possible, you do get some
> amazingly useful tools that run in this kind of twilight zone.

This is indeed my argument why you do _not_ want to have protection domains
within the OS itself. Because that twilight zone without the protection
can be a very productive environment.

But I stand by the two requirements. Call them arbitrary, I don't care. I'll
call them my "axioms" of operating systems. You can choose different axioms,
it doesn't have to be the same world we live in.

>> See? It's a total and incorrect MYTH that layering should be bad for
>> performance.  Only _bad_ layering is bad for performance.
>
> You're arguing for the wrong side here.

No. I'm arguing for the side that thinks that layering is often good, but
that does not mean that ALL layering is good.

The rant I answered to thought that all layering was good ('screams for
standard callouts and "ring1" level'). I'm arguing that that is not the
case. Layering is often bad - if it means that you have to use cumbersome
communication channels to forward information that otherwise would just
have been available to you naturally.

So you have to look at when abstraction and layering pays off. And one
of THE most important parts there is performance.

Copying data around is almost always a no-no. As they say: most programs
can be viewed as an exercise in caching - and OS's even more so.

>> And performance is part of it, but so is (very much) the fact that it
>> also makes it much harder to maintain coherent data structures and
>> graceful caches.
>
> Why is that so? The file system can still be given access to the same pool
> of blocks without having access to data structures it doesn't need access
> to. Yes, this gives you a little less protection from bugs in the file
> system, but it does limit that to blocks it's got write access to.

No, you need not just the blocks, you need the actual cache chain data
structures themselves. For doing things like good read-ahead, you need
to be able to (efficiently) look up trivial things like "is that block
already in the cache".

So you need not only the data, you need the _tags_ too.

In other words, your filesystem needs to have access to the whole
disk cache layer, not just the contents. Or it will not perform well.

But hey, feel free to disagree with me. I know my approach works, and
scales well from small machines to machines with hundreds of nodes.

And I know it works well, not just because of Linux, but because it
has shown itself to work well for others too.

>> Microkernels and message passing fail this definition very badly. (The
>> Amiga had a "sort of" message passing, but since in real life it was
>> nothing but a function call with a pointer,
>
> Um, no, it was enqueuing a message on a doubly linked list and setting a
> flag that the scheduler noticed when the process later performed a context
> switch (which may have been immediately following, but didn't have to be).

It was still just passing a pointer around.  No protection, no nothing.

Which very much makes it uninteresting.

>> that was message passing on
>> the _abstract_ level, with no performance downside. That's fine as an
>> abstraction, but it obviously only works if there is no memory
>> protection)
>
> Then let there be no memory protection, for most components.
>
> Let the message passing be implemented as a macro. When you compile a
> component for ring zero, that macro generates a zero-overhead call.
>
> Compile it for ring 1, and it uses a lightweight system call to pass the
> message to ring 0 or accept it from ring zero.

Wrong.

You're making a serious logical flaw: you're claiming that the message
passing approach has no performance impact, because you could just compile
it out.

And that's not true, because there's no way in HELL you're going to "message
pass" a complex data structure like a hash chain efficiently.

So the performance impact comes from the fact that by limiting yourself
to a message passing model, you're limiting what you can do. Yes, there
is no overhead from the compiler ("we can compile the message away, and
just pass the pointer"), but there doesn't need to be any: you've
added the overhead by making it illegal to pass around arbitrary data
structures that cannot be efficiently marshalled into a message.

Which makes your whole argument totally invalid. Sorry.

You have to KNOW that there is no memory protection in order to be able
to write efficient code. This is why nobody ever was able to just make
Amiga-OS secure. You could only protect so much, until the whole house
of cards came tumbling down. So a lot of the '040-based hacks protected
certain areas, to give slightly improved debugging, but nobody could
take it further from there, without giving up the whole OS.

You can rant and rave against it all you like, but you know I'm right.

                        Linus

From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Microkernels are not "all or nothing". Re: Multics Concepts For 
	the Contemporary Computing World
Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Message-ID: <1057888744.661663@palladium.transmeta.com>
Date: Thu, 10 Jul 2003 18:59:04 -0700

Peter da Silva wrote:

> Ah, this one finally shows up. I'll try and avoid repeating myself.
>
> In article <1057474131.367543@palladium.transmeta.com>,
> Linus Torvalds  <torvalds@osdl.org> wrote:
>> (Of course, that's not how it is described in the API sections. It
>> gets described in nicer ways, so instead of "other users can screw
>> you over", it's called "fast message passing" or similar)
>
> Or "having the whole kernel in one protection domain"?

What does that have to do with the API? Internal kernel implementation
is something that should be considered a black box. Nobody should care,
except for the few kernel developers.

Why do you care about the internal workings of the OS? If you envision
the OS being so big and complex that it needs to be split up into many
different protection domains, then you're doing something wrong.

From a maintenance standpoint, the kernel is a fairly big thing, but
that can be attacked successfully by just having good abstractions,
and has nothing to do with message passing or with protection domains.
There are no "potentially untrusted parties" that _force_ you to have
a protection domain.

>> The reason you can't pull namei() into libc is the same reason you can't
>> put things like TCP/IP into libc. It needs to interact with other
>> programs and maintain consistency across those other programs - without
>> other people being able to screw things over.
>
> Could you elaborate on this? Not all UNIX-like operating systems implement
> either chroot() or traverse checking, but even if you demand those
> continue to work they can be implemented by making the handle at the start
> of a link be part of the authorization required to get a handle at the end
> of a link.

The real issue with name lookup is the cross-domain coherency requirements.
Details like starting points etc are largely a non-issue, and are pretty
much abritrary - they are unimportant, and can be handled many different
ways. Unix has "cwd" and "root directory", and system calls to change both,
but that's not important or relevant to the real issue.

The real issue is all about making sure that filesystem accesses are
consistent. It's the software equivalent of memory ordering of a CPU:
each execution context has an internal ordering that is purely sequential,
but more importantly you also have certain cross-domain ordering
requirements.

As with CPU's, you can obviously play games with the ordering consistency,
up to and including requireing explicit ordering primitives. It makes it
slightly harder to think about some problems, but not a lot of people need
to care, since 99% of it is hidden inside the OS or in special threading
libraries.

(NOTE! When it comes to filesystem consistency, I'm not talking about when
the data actually hits the disk - that's a separate ordering issue, largely
orthogonal to the issue of what the "cache consistency model" is).

But some consistency model has to be there, and because most (almost all)
programmers - even the good ones - have serious problems with concurrency,
almost all filesystems end up going for the strictest consistency possible.
Anything else is just too painful for application programmers. And
unlike CPU memory ordering, consistency of opening and reading/writing a
file are things that "normal" programmers have to care about.

As an example: it just gets very very confusing if one process creates a
file, then signals another process that tries to open the file, and the
file (or its contents) haven't yet appeared. And yes, this can happen in
real life over NFS data contents with caching clients and makes things
like automatic load balancing interesting: as I'm sure you know NFSv2
commonly has so-called "open-close" consistency, ie file open operations
end up being consistency points, but normal read/write operations do not.

Some of that consistency can obviously be done on a library level, so don't
get riled up (yet).

Are we in agreement so far? We need quite strict filesystem consistency,
loose consistency guarantees (like NFSv2 across different clients, which
is still fairly strict when it comes to name data) make life very nasty
indeed.

In short: in practice, you REALLY want to have consistent filesystem
namespaces, even if you are willing to _maybe_ play a bit fast-and-loose
with the contents.

Now, go and research the issue of consistent cluster filesystems. The
executive summary boils down roughly to "They're damn hard to make". And
pretty much ALL of the problems are in the consistency area. It's just
very hard to synchronize independent entities.

And _that_ consistency is what I mean by "shared data". A filesystem has
a lot of shared data between a lot of different independent entities in
the system - lots of totally unrelated user processes, where the only
thing that really makes them come together is the user at the other end
of the keyboard. And that user gets REALLY ANGRY if he writes a file in
one window, and it doesn't show up when he does a "ls" in another window.

And this shared data is not read-only. Not at all. And it's critical that
you don't allow direct access to the raw data, since if your OS allows
raw data access to normal users, you'll have nasty things like wildly
spreading unstoppable viruses and other "fun stuff".

See? THIS is where a system call is imperative. It has both of the
critical requirement:
 - shared data structures (not just "caches". _shared_ caches).
 - performance

In other words, the shared data structures part mean that you literally
cannot sanely do it on a library level, without making the complexity
space go up by several orders of magnitude. You'll have to solve the whole
"high-performance coherent cluster filesystem" problem, which while it is
certainly worthwhile solving, hasn't been done satisfactorily yet.

And performance of "simple" operations like "read", "write", "create" and
"delete" etc are clearly rather important and fundamental. Which means
that unlike something like "gethostbyname()" you cannot easily just
implement them using other primitives.

And yes, the "system call" may be just a message to another protected
domain. That's just an implementation detail - it doesn't change the
fundamental problem space.

And _that_ is why you cannot do namei() in a library.

Are you satisfied?

> You forgot the most important one:
>
>    - tradition

I didn't forget it. I explicitly mentioned legacy as often trumping all
other considerations.

After all, that's why I don't care for a lot of "OS research": exactly
because they forget that legacy matters. Quite often a lot more than
anything else. The building block for Linux - and a large reason for
its success - is that I considered UNIX compatibility to be paramount.

>> This explains, for example, why "gethostbyname()" isn't a system call:
>> while it does require a protection domain (you can't let other processes
>> mess up your view of name lookups), it does not have the performance
>> requirement.
>
> Webservers have found gethostbyname() is expensive enough they disable it
> for logging.

Absolutely. That's an application-level optimization, and one that takes
into account that some things are more expensive than others.

But this is not a "system call" vs "library" issue. Making it a system
call does not inherently speed up anything. And I never claimed it does:
in fact, I claim that you want to avoid protection domain crossing as
much as possible, exactly because it tends to _cost_ you in performance.

So the issue is not that system calls are faster: they're not. They are
more _controlled_, but that usually means that you want to avoid them
as much as possible if you don't need the control.

For example - if you don't want to see real-time logs, you're also better
off buffering the log up internally in the server, and then occasionally
writing it out in one bigger chunk. Exactly because you want to avoid the
"synchronization cost" with shared data structures that system calls are
all about.

So hostname lookup is NOT made faster by a system call: it is faster by
delaying it, and then doing many of them at once, and caching them. And
that shouldn't be done with a system call either, since we don't have any
real consistency requirements (it's a cache, not a shared data structure),
so creating a system call for it would only complicate the system and slow
things down.

> I would have thought you'd be in favor of letting the application cache
> that data.

But I _am_.

See above. Repeat after me: "Caching is good. Consistency is hard."

>> Btw, even in the embedded world people are more and more looking to
>> general purpose OS's. The main reason they don't use them is that
>> hardware used to be expensive. It still is, but...
>
> It sounds like you've fallen into a common misconception, that "real time"
> means "fast". It doesn't.

What planet are you from? Apparently one where they didn't teach people
to read. Where did I say "real time"? I said "embedded".

The embedded people (whether real-time or not, and as far as I can tell,
most of them are _not_) are often constrained by a combination costs and
special external requirements.  The cost requirement in particular often
meant that you're forced to use marginal hardware (not in the quality
sense, but in the sense that it is the smallest/cheapest possible platform).

Which traditionally has meant that they can't afford to run a general-
purpose OS: either because the OS itself is too expensive ("we can make
more money off customers that aren't as cost-sensitive") or because the
OS assumes that hardware isn't as constrained ("we run on _real_ machines,
not your 8-bit microcontroller").

But open source makes the OS cost a non-issue (development cost obviously
is still there for the required customization), and hardware is getting
faster and better faster while the requirements for a lot of embedded
development is going up (quiz: how many embedded projects needed to connect
wirelessly to the internet ten years ago?), which is making the embedded
world increasingly look towards general-purpose OS's, since there just
isn't much justification for the cost of maintaining special odd-ball ones.

THAT is my argument. My viewpoint is obviously biased, since that's the
side I'd be seeing, but I don't think I'm horribly off-course in stating
that "even in the embedded world people are more and more looking to
general-purpose OS's".

Note the lack of real-time mentioned anywhere. Almost all "real time"
people end up actually being perfectly ok with "fast enough", and very
very few people are rally _hard_ real time.

                Linus

From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Microkernels are not "all or nothing". Re: Multics Concepts For 
	the Contemporary Computing World
Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Message-ID: <1057948947.265605@palladium.transmeta.com>
Date: Fri, 11 Jul 2003 11:42:26 -0700

Peter da Silva wrote:
>
> If namei() in user mode performs the same operations as namei() in kernel
> mode, how do you get an inconsistent state?

I'm not going to bother belabouring the issue, since you do not seem to
understand it at all. But I'll try one more time:

We have shared data structures with multiple writers and multiple readers.

We don't trust the writers or readers, which means that
 - they cannot be allowed to access locks directly (instant DoS if they
   do something bad).
 - they cannot be allowed to write to shared data structures directly
   (instant filesystem corruption and major security issues)
 - they cannot in general even be allowed to _read_ shared data structures
   directly, since many of those shared data structures may contain data
   that the process must not be able to see. A trivial example of this is
   the unix "execute" bit on directories, but a much more generally valid
   example is something like a hash chain - many CS algorithms depend on
   "global data".

So what do you end up with? You either have to solve all these problems
by a total re-architecting of the filesystem: and as I tried to explain
to you, the problems are EXACTLY the same as the ones you have in a
secure coherent distributed  filesystem where you can't even trust all
the nodes. Nobody has so far been able to solve that problem very well.

Or you end up waking up, and doing the sane thing, and just doing the real
work in a separate protection domain: ie you need a system call. At which
point all of your problems go away, and filesystems end up being pretty
simple things. Comparatively.

In other words, the only sane approach is to do it in a system call. But
I'm not going to bother to argue the point any further with you. If you
don't want to accept the facts, nothing I can say will force you to.

Yes, you can "simplify" the problem space by allowing pure read-operations
in user space, and forcing people to go to the system call approach only
on creates etc. That avoids one _huge_ chunk of the problem, since for read-
only accesses you don't have to have direct access to locks, and you can
depend on things like sequence numbers and optimistic algorithms. They do
tend to have bad worst-case behaviour (livelocks), but hey, I doubt you care
any more. You're not looking for a sane implementation.

So I assume that is what you want to do - just mmap the whole filesystem
read-only into the client, and then doing all lookups that way?

And you could do _some_ filesystems that way (it's no different than having
a read-only database for "direct lookups"), but no, you cannot do a UNIX
"namei()".  See about the security issues of just being able to _see_ other
peoples name data. Also, realize that even "namei()" actually writes to the
filesystem, through accessed bits etc. So even if you try to limit yourself
to a read-only model, it won't be actually "namei()".

Or think about "access()" in a threaded environment. Which uses "namei()"
too. Or ponder the "interesting" behaviour on switching UID's/groups. Or
filesystems that aren't local, or have more dynamic behaviour (/proc). The
list goes on and on and on.

I suspect that with a _lot_ of effort you could probably overcome these
issues, and you'd end up with a very fragile and complex system. Or a very
slow one.

But yeah, in the end user-space is obviously "turing-complete", so in that
sense everything is _possible_. It's just bad engineering.

                        Linus

From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Microkernels are not "all or nothing". Re: Multics Concepts For 
	the Contemporary Computing World
Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Message-ID: <1057949228.284389@palladium.transmeta.com>
Date: Fri, 11 Jul 2003 11:47:07 -0700

Russell Williams wrote:
>
> The obvious case where "same protection domain for performance" conflicts
> with "separate domains for crash resistence" is for device drivers that
> need high bandwidth or low latency but are not system critical. I want to
> buy a hot-pluggable firewire video device or MIDI controller and I don't
> want to put my system at risk from bugs in those drivers (and I don't want
> to reboot after the install, either)..

The problem there is that most device drivers don't crash the system by
following wild pointers.

They crash the system by just not doing the right thing to the hardware
(or the hardware itself being buggy). The system ends up crashing because
the data off the harddisk is corrupt, which is REALLY BAD, but has nothing
to do with protection domains, and everything to do with the fact that
hardware is often flaky and complex and badly documented.

Yes, the "wild pointer" thing does happen, but it's never been even _near_
the top of my worries wrt device drivers.

Locking up the PCI bus because you got the hardware in a strange state,
now _that_ is nasty.

> It's easy to design I/O subsystems that can load drivers either inside or
> outside the kernel.  Getting high performance from drivers outside the
> kernel is much harder.

Performance does end up being a big issue too, yes. Especially since
device drivers (unlike most other things) often care deeply about
things like exact physical addresses, so MMU's are often in the way.

                        Linus

From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Microkernels are not "all or nothing". Re: Multics Concepts For 
	the Contemporary Computing World
Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Message-ID: <1058036184.864306@palladium.transmeta.com>
Date: Sat, 12 Jul 2003 11:56:24 -0700

Peter da Silva wrote:
>
> Ah, I'm beginning to see where you're confused.

I think it is you who are confused, because..

>                                    You're talking about
> taking the existing namei() code, or something very similar, and moving
> it to the user level. I'm talking about implementing namei() at the
> user level using operations on handles. It doesn't access any of the
> shared data structures directly, and I even described how a handoff
> mechanism can be used to keep it from stepping outside the bounds
> allowed by the existing implementation.

... this is totally uninteresting, since you can already do this on UNIX, and
you have been able to do it for a long long time.

The fact that nobody (*) _does_ do it only shows how useless it is.

It's not hard. It has problems with execute-only subdirectories, but
since nobody really cares, why should anybody ever have spent the time
making this approach work any better?

It's inefficient and not very interesting. It uses the normal UNIX
"handle", ie a file descriptor. I tested it on Linux, but as far as
I know, it works pretty much on any UNIX that has fchdir()

        #include <fcntl.h>
        #include <signal.h>
        #include <stdlib.h>

        sigset_t everything;

        int namei(int basefd, const char **components)
        {
                int oldcwd;
                int retval;
                sigset_t old_sigset;

                oldcwd = open(".", O_RDONLY);
                sigprocmask(SIG_BLOCK, &everything, &old_sigset);
                retval = basefd;

                while (*components) {
                        int fd;

                        fchdir(retval);
                        fd = open(*components, O_RDONLY);
                        close(retval);
                        if (fd < 0)
                                break;
                        retval = fd;
                        components++;
                }

                fchdir(oldcwd);
                sigprocmask(SIG_SETMASK, &old_sigset, NULL);
                return retval;
        }

        int main(int arcg, char **argv)
        {
                sigfillset(&everything);
                return namei(open(".", O_RDONLY), argv+1);
        }

Was this what you wanted?

(*) "Nobody" in the sense of "it's so rare that you don't even know about
it". It's indeed occasionally used for things like emulation, where you
have different rules about pathnames.

> This *may* slow down namei() in some cases, since you'll need to make N
> kernel transitions to resolve an N-level-deep path, but you can cache
> the results in libc and avoid a lot of calls to namei() that are
> currently made.

It slows things down A LOT.

Caching things in libc has the problem that it is only a local cache, which
makes it a lot less effective than the in-kernel global cache.

In other words, you get a lot more misses in the libc cache, since there
are a lot of startup costs. But if you want to do it, the program above
will actually compile and work.  Go wild with it.

[ It has other fundamental problems too: the locking that the _real_ name
  lookup has to do is a lot less efficient, and you have to constantly
  translate the "handle" into its internal format which has some locking
  issues of its own. When you just give the whole pathname (which is what
  everybody wants to do anyway) to the system call, the OS can just do a lot
  better job at it.

  The beauty with the sane approach (which is to do the full lookup in the
  kernel) is that you get the performance you want, and if you want to do
  just one component at a time, you can do so by hand. ]

I repeat: the reasons UNIX does path lookup like it does - as part of a
system call - is that it's just good engineering. What you suggest is not.
Can we stop this pointless discussion now?

                Linus

From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Microkernels are not "all or nothing". Re: Multics Concepts For 
	the Contemporary Computing World
Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Message-ID: <1058046799.46836@palladium.transmeta.com>
Date: Sat, 12 Jul 2003 14:53:18 -0700

Peter da Silva wrote:

> Um, no, fchdir() has unacceptable side effects (it changes the current
> directory),

Those are not unacceptable side effects, those are the whole _point_ of it.

Read my example. You can trivially read-and-restore the current working
directory with fchdir. My example did exactly that - and it was signal-
safe because I blocked everything.

>         has at a minimum twice the system call overhead even if
> you don't use message-passing techniques to implement the system calls
> to cut down the overhead further, it doesn't work from kernel mode, and
> it keeps the N^2 behaviour in large directories.

The point being that all of these are fixable, BUT NOBODY WANTS TO DO
WHAT YOU PROPOSE!

Part of "good engineering" is realizing that you shouldn't do things
that don't add any value.

                Linus

From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Microkernels are not "all or nothing". Re: Multics Concepts For 
	the Contemporary Computing World
Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Message-ID: <1058126751.4314@palladium.transmeta.com>
Date: Sun, 13 Jul 2003 13:05:50 -0700

Peter da Silva wrote:
>
> The thing is, this is an illustration of the kinds of things that are
> easy with a different OS design, not the goal. The "it's easy" part is the
> interesting bit.

No. These are things that an academic might find interesting, but that
have absolutely _zero_ to do with real life.

The fact is, I could write a

        int fd_open(int basefd, const char *path, int flags)

in about 30 lines of code in the kernel. Your "OS design" argument is
totally flawed: it is indeed a lot _easier_ to do these kinds of things
in a traditional monolithic kernel, because you don't ever have to worry
about what other kernel servers you need to touch etc. There's no
"documented protocol" that you have to extend to do what you want to do:
you just do it.

I showed you how to do it without even a new system call, and you didn't
understand my point. It's _trivial_ to do. It didn't take many lines
in user space (admittedly not a very efficient implementation), and the
only reason it would take a few more lines in kernel space is that if
I were to make it into a system call, I would _not_ make it use fchdir()
or anything like that, I'd do the efficient thing of accessing all the
internal kernel "struct file *" entries directly.

And I'm not unique. I'm not the only person who could write a 30-line
"fd_open()". The VFS layer infrastructure in Linux is quite powerful, but
I bet it would be trivial to do in BSD or any of the other UNIXes out
there.

So clearly, the fact that it hasn't been done is NOT because of your "OS
design" issues.

If I were to add a "fd_open()", who would use it?

Because, in the end, absolutely the _only_ thing that matters for an OS
is the people who use it. The "goodness" in a design is purely and utterly
in how well it _works_in_real_life_.

Which is, in the end, the real reason Linux has been successful. I don't
have an agenda of crazy OS ideas to push. Because I don't care. I think
people who have an "agenda" to push (and I very much mean people like you
who seem to think that "design" is somehow more important than getting the
job actually _done_) are just being silly.

Good design (the _truly_ good kind) is whatever helps you get the work
done. It's not "abstraction", it's not "microkernels", it's not "crazy
notion of the day". It's all the small (and often boring) details that
help your day-to-day work and maintenance.

And what you're pushing isn't the good stuff. You're pushing the pipe
dream, the "this is how I think it should be done if the world was
different" kind of thing.

Welcome to the real world.

                Linus

From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Microkernels are not "all or nothing". Re: Multics Concepts For 
	the Contemporary Computing World
Newsgroups: alt.os.multics,comp.arch,alt.folklore.computers
Message-ID: <1058141522.950697@palladium.transmeta.com>
Date: Sun, 13 Jul 2003 17:12:02 -0700

Peter da Silva wrote:
>
> Yes, you could, and you could write all the other calls that are involved,
> but YOU have to do it. This gets back to your comment about it being of
> interest only to "a few kernel hackers".

I don't see your point at _all_.

Pretty much anybody could do it. After all, it's likely a lot easier in
Linux than in your microkernel world. It's slightly more complicated these
days than in "traditional UNIX" because Linux supports private name-spaces
etc concepts that means that name lookup is context-dependent, but it's
quite literally a question of copying the code in "path_lookup()", and
replacing the logic that checks for initial slash character to use

        nd->mnt = mntget(file->f_vfsmnt);
        nd->dentry = dget(file->f_dentry);

instead of getting the mountpoint/initial startpoint information from
"current->fs->{root|pwd}[mnt]".

That will literally make a 5-line function that does the actual lookup,
the rest of the overhead is for the system call interfaces.

Yeah, somebody who isn't intimately involved with the VFS layer will
likely stumble around for a while, but no less so than in any other big
project. And they'll likely stumble around a lot _less_ than if they
had to update all different filesystem servers to a new filesystem
protocol in a microkernel.

You totally ignored my argument that it is _easier_ to do this in a
traditional kernel.

Why do you think most everybody writes monolithic software? It's easier
to visualize and follow the flow of control. And that's true very much
not just for experts, but for beginners too. This would be a totally
appropriate exercise for a first-year university OS student. Not because
writing the 30-odd lines is important in itself (it isn't), but it's
a fairly simple introduction to how the VFS layer works.

>                       even if someone else can use it, they
> have to patch the kernel and rebuild it every time they upgrade, because
> the kernel's not organised to make it easy.

That's a load of bull. Building a kernel is not hard, and you can even
use loadable modules to test things like new system calls out if you
really want to.

More importantly, if it actually ends up being a useful feature that
real applications want to use, I'd be happy to integrate it. I don't mind
the notion of a "fd_open()" myself - it's been discussed before, but the
projects that end up looking up the path by hand (things like Apache do
it in certain configurations because they want to verify each path
component) seem to almost universally consider portability to be much more
important than "clever features".

That's the part you seem to be totally ignoring. I'm trying to tell you
that there is no inherent "goodness" to clever new system calls. In fact,
the reverse is generally true. You should generally avoid adding interfaces
"just because you can", and you should strive for a stable base.

Boring? Maybe. But practical. It's the reason x86 has flourished, too.
That damn compatibility ends up being worth _more_ to application writers
than pretty much _any_ clever feature you can throw them.

                        Linus

Index Home About Blog