Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408261402400.2304@ppc970.osdl.org>
Date: Thu, 26 Aug 2004 23:29:44 GMT
Message-ID: <fa.gudtte8.828qpm@ifi.uio.no>

On Thu, 26 Aug 2004, Martin J. Bligh wrote:
>
> I think what you're saying is that they'd both return positive, right?

No. I'd say that a file would look like a file, even if it has attributes.

It wouldn't show as a directory at all - unless you start looking at
attributes. Because it really _is_ a file, and its "directory aspect" is
really nothing but a way to make its named streams visible.

So you really should consider it a perfectly regular file, and so only
S_ISREG() will return true, and S_ISDIR() will return false.
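
As an illustration of the stat-level view described here, a minimal
userspace sketch follows. The function and enum names are made up for this
example. It only shows how an application tells files from directories
today through the st_mode type bits; under the model discussed in this
message, a file with attributes would still land in the KIND_FILE branch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Illustrative names only; nothing here is a kernel or libc interface. */
enum object_kind { KIND_ERROR, KIND_FILE, KIND_DIR, KIND_OTHER };

/* Classify a path by the type bits that stat() reports in st_mode. */
enum object_kind classify_path(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return KIND_ERROR;
    if (S_ISREG(st.st_mode))
        return KIND_FILE;       /* a file with attributes would stay here */
    if (S_ISDIR(st.st_mode))
        return KIND_DIR;
    return KIND_OTHER;
}

/* Print the classification, one path per line. */
void print_kind(const char *path)
{
    static const char *const names[] = { "error", "file", "directory", "other" };

    printf("%s: %s\n", path, names[classify_path(path)]);
}
```

Calling print_kind() on a regular file reports "file" and on /tmp reports
"directory"; exactly one of S_ISREG() and S_ISDIR() is true for any object.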

Think of it this way: when you add a named attribute to current files
using the xattr interfaces, do you start thinking of the file as a
directory? No. Even though it actually now is a starting point for finding
more information.

It's just that thanks to its directory aspect, people and apps that
_care_ about the attributes suddenly can trivially access them. You can
access them in shell scripts, you can access them in programs, you can
access them in perl. With no special knowledge necessary.

Remember: in the file-as-directory model, we're always talking about just
_one_ object. That one object just happens to have named streams or
attributes or whatever you want to call them associated with it, and you
can _see_ them through the normal directory interfaces. But that doesn't
really make the object a "directory". It's still a file.

In other words, the "directory" part is just a _view_ into the file. A
view that potentially exposes a lot _more_ of the file, but we're still
talking about the same file.

In contrast, a S_IFDIR-like _directory_ is something else entirely. When
you view the things in that, you aren't looking at data "inside" the
directory. You're looking at something totally independent.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408261149510.2304@ppc970.osdl.org>
Date: Thu, 26 Aug 2004 19:20:34 GMT
Message-ID: <fa.gutpuef.8i0rpt@ifi.uio.no>

On Thu, 26 Aug 2004, Rik van Riel wrote:
>
> So you'd have both a file and a directory that just happen
> to have the same name ?  How would this work in the dcache?

There would be only one entry in the dcache. The lookup will select
whether it opens the file or the directory based on O_DIRECTORY (and
usage, of course - if it's in the middle of a path, it obviously needs to
be opened as a directory regardless).
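
For reference, a small sketch of what O_DIRECTORY does on a stock kernel
without any file-as-directory support. The helper name is invented for this
example. Today the flag simply makes open() fail with ENOTDIR on a regular
file; under the proposal above, the same call is the one that would select
the directory view instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to open `path` in its directory aspect.
 * Returns 0 if the open succeeded, or the errno value if it failed.
 * (Hypothetical helper, not an existing API.)
 */
int open_directory_view(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return errno;   /* ENOTDIR for a regular file on current kernels */
    close(fd);
    return 0;
}
```

On an ordinary filesystem this returns 0 for a real directory and ENOTDIR
for a regular file.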

That's not the problem. The problem from a dcache standpoint ends up being
when the file has a link, and you have two paths to the same sub-file
through two different ways:

	.. create file 'x' with named stream 'y' ...
	ln x z
	ls -l x/y z/y	/* it's the same attribute!! */

but this is actually exactly the same thing that we already have with
mounts, ie it is equivalent (from a dentry standpoint) to

	.. create directory 'x' with file 'y' ..
	mkdir z
	mount --bind x z
	ls -l x/y z/y	/* It's the same file!! */

so none of this is really anything "new" from a dcache standpoint.

Except for all the details, of course ;)
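
The "ln x z" half of the comparison can be checked from userspace with
plain hard links. This is only a sketch with invented helper names, and it
covers the link case, not the bind-mount case (which needs privileges): two
paths are the same object when they share a device and inode number.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create file `x` and a second hard link `z` to it. Returns 0 on success. */
int make_link_pair(const char *x, const char *z)
{
    int fd;

    assert(x != NULL && z != NULL);
    fd = open(x, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    close(fd);
    unlink(z);                  /* ignore failure: z may not exist yet */
    return link(x, z);
}

/* Return 1 if both paths name the same inode, 0 if not, -1 on error. */
int same_object(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

After make_link_pair("x", "z"), same_object("x", "z") returns 1, so any
stream reached as x/y and z/y would hang off the same inode.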

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408261217140.2304@ppc970.osdl.org>
Date: Fri, 27 Aug 2004 02:59:01 GMT
Message-ID: <fa.gsu1tmd.aicqhj@ifi.uio.no>

On Thu, 26 Aug 2004, Rik van Riel wrote:
>
> Hmmm, I just straced  "cp /bin/bash /tmp".
> One line stood out as a potential problem:
>
> open("/tmp/bash", O_WRONLY|O_CREAT|O_LARGEFILE, 0100755) = 4
>
> What do we do with O_CREAT ?
>
> Do we always allow both a directory and a file to be created with
> the same name ?

Either I am confused, or you are.

To me, a filesystem that allows this thing doesn't really _have_ the
concept of "directory vs file". It's just a "filesystem object", and it
can act as _both_ a directory and a file.

So when you create "/tmp/bash" - assuming /tmp supports the file-as-dir
semantics at all, you're just creating a new inode - the same you always
have. When you write to that inode, it writes to the default stream.
There's no special cases here.

Now, after you have your regular /tmp/bash, you can then start adding
named streams to it, ie you can do

	open("/tmp/bash/icon", O_WRONLY|O_CREAT|O_LARGEFILE, 0755);

and that will create the "icon" named stream. See?

So "/tmp/bash" is _not_ two different things. It is _one_ entity, that
contains both a standard data stream (the "file" part) _and_ pointers to
other named streams (the "directory" part).
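
To see what the second open() does on a filesystem without this support,
here is a hedged sketch. The helper name is made up, and O_LARGEFILE is left
out to keep the example portable. On an ordinary filesystem, walking through
a regular file fails with ENOTDIR; with file-as-directory semantics the same
call is the one that would create the stream.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Try to create "<container>/<stream>" in the manner of the open() call
 * quoted above. Returns 0 on success or the errno value on failure.
 * (Hypothetical helper for illustration.)
 */
int create_named_stream(const char *container, const char *stream)
{
    char path[4096];
    int n, fd;

    assert(container != NULL && stream != NULL);
    n = snprintf(path, sizeof(path), "%s/%s", container, stream);
    if (n < 0 || (size_t)n >= sizeof(path))
        return ENAMETOOLONG;
    fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return errno;   /* ENOTDIR when the container is a plain file */
    close(fd);
    return 0;
}
```

With a regular file as the container, this returns ENOTDIR on filesystems
that have no named-stream support.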

Hey, think of it as a wave-particle duality. Both "modes" exist at the
same time, and cannot be separated from each other. Which one you see
depends entirely on your "experiment", ie how you open the file.

> Does this create a new class of "symlink attack" style security
> holes ?

I don't believe so.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408261315110.2304@ppc970.osdl.org>
Date: Thu, 26 Aug 2004 20:54:25 GMT
Message-ID: <fa.gstttmb.ai4qhh@ifi.uio.no>

On Thu, 26 Aug 2004, Rik van Riel wrote:
>
> Thinking about it some more, how would file managers and file choosers
> handle this situation ?

The same way they already handle it on other platforms that support it? By
taking advantage of it..

People literally use this for icon and preview information, so some of the
stuff shows up very much in file managers. And I'd assume that a normal
double-click would just open the main file, and you can right-click for
management information, including opening the file.

If you want an entity that acts as a directory, you create a directory.

Directories don't go away - you still have S_ISDIR() and S_ISREG(), and
they still return information. A file that has associated information is
still a _file_, and people should treat it that way, it's just that it
also has a list of named sub-streams. You can open it as a directory, but
the stat information clearly says it is a file, and the "directory" view
is very much associated with _that_ file.

I definitely don't think you want the file manager to act as if a named
stream is exactly the same as a stand-alone file. They have all the same
operations, but there's no question that there are differences too.
Especially on a conceptual level - but for most filesystems there are
likely real technical differences too.

For example, it's likely that most filesystems would _not_ allow linking
of a named stream anywhere else. And you might not be able to change the
permissions or date on the named stream either, since it may or may not
have a separate date/permission thing from the container.

So don't believe that just because the named streams are _named_ like real
files, that they suddenly have any existence beyond the container that
they are part of.

There may be other limitations too - again depending on how the filesystem
actually implements named streams. It might not support more than one
level of naming, for example - so you might not be able to create a
directory structure within the named streams, for example.

In short: the fact that the VFS layer exposes the _capability_ to see the
named streams as a full POSIX filesystem of its own does _not_ mean that
the low-level filesystem necessarily allows the full possible semantics.
So you shouldn't design your apps that know about named streams to think
they are normal directories.

The directory thing is just a very powerful naming scheme, and one that
fits into the existing UNIX model.

> Do we really want to have a file paradigm that's different
> from the other OSes out there ?

Different from what other OSes?

Last I looked, Windows, Solaris, and OSX all supported named streams. What
other OS's are out there that you care about?

In other words, this is _not_ a different paradigm from what others do.
The discussion is whether we want to implement it at all, and more
importantly about syntactic issues (ie we clearly already implement
extended attributes, just with a much more limited syntactic power).

We don't have to go all the way. Solaris has "openat()", which is kind of
a half-way there - not really directly available in the same namespace,
but at least the result is available as a real file interface (as opposed
to the Linux "xattr" interfaces that are _totally_ special-cased system
calls).
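
To make the contrast concrete, a rough sketch for Linux with glibc follows.
The helper names and the "user.icon" attribute name are assumptions made for
this example. The first function reads the default stream through the normal
file interface; the second has to go through the dedicated setxattr() and
getxattr() calls, which ordinary tools such as cat or cp never issue.
Filesystems without user xattr support return an error (typically ENOTSUP),
which the sketch just passes back.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Read up to `len` bytes of the default stream via open()/read(). */
long read_default_stream(const char *path, void *buf, size_t len)
{
    ssize_t n;
    int fd, saved;

    assert(path != NULL && buf != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;
    n = read(fd, buf, len);
    saved = errno;
    close(fd);
    return n < 0 ? -saved : (long)n;
}

/*
 * Store one attribute and read it back with the special-purpose calls.
 * Returns the number of bytes read back, or a negative errno value.
 */
long xattr_roundtrip(const char *path, const char *name,
                     const void *value, size_t len,
                     void *out, size_t outlen)
{
    ssize_t n;

    assert(path != NULL && name != NULL && value != NULL && out != NULL);
    if (setxattr(path, name, value, len, 0) != 0)
        return -errno;
    n = getxattr(path, name, out, outlen);
    return n < 0 ? -errno : (long)n;
}
```

The two functions reach data attached to the same inode, but only the first
works with tools that know nothing beyond open() and read().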

In other words - the paradigm is already there. It's just that currently
it's pretty much unusable under Linux, because it requires so much
specialized knowledge that it's not worth it to modify existing apps. And
the interfaces are really very limited, so even if you _do_ end up using
the specialized Linux interfaces, you can't actually do a lot of what you
might want to do.

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408261132150.2304@ppc970.osdl.org>
Date: Thu, 26 Aug 2004 19:13:05 GMT
Message-ID: <fa.gsu1u68.ai8q1m@ifi.uio.no>

On Thu, 26 Aug 2004, Denis Vlasenko wrote:
>
> Is it possible to sufficiently hide "dirs inside files"
> so that old tools will be unable to see them?

Certainly possible.

> I just checked:
>
> ls -d /foo  does lstat64("/foo", ...)
> ls -d /foo/ does lstat64("/foo", ...)
> 	but
> ls -d /foo/. does lstat64("/foo/.", ...)
>
> Will it work out if "dir inside file" will only be visible when referred as "file/."?

That would likely be the _easiest_ approach for the kernel, and does solve
the problem with some apps knowingly removing the trailing '/'.

Note that we could try this out with existing filesystems with very
minimal changes:

 - make directory bind mounts work on top of files ("graft_tree()")
 - make open_namei() and friends _not_ do the mount-point following for the
   last component if it's a non-directory.
 - probably some trivial fixups I haven't thought about. There might be
   some places that use "S_ISDIR()" to check for whether something can be
   looked up, but the main path walking already just checks whether there
   is a ".lookup" operation or not.
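
The difference between the two checks in the last item can be sketched with
a toy model. These structures are stand-ins invented for illustration, not
the kernel's real definitions: a type-based test rejects a hybrid file,
while a capability-based test accepts anything that supplies a lookup
method.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_IFREG = 1, TOY_IFDIR = 2 };

struct toy_inode;

/* Hypothetical stand-in for the kernel's inode_operations. */
struct toy_inode_ops {
    struct toy_inode *(*lookup)(struct toy_inode *dir, const char *name);
};

struct toy_inode {
    unsigned int mode;                  /* TOY_IFREG or TOY_IFDIR */
    const struct toy_inode_ops *ops;    /* may be NULL */
};

/* Check by file type: a hybrid (regular file with streams) is rejected. */
int can_descend_by_type(const struct toy_inode *inode)
{
    assert(inode != NULL);
    return inode->mode == TOY_IFDIR;
}

/* Check by capability: anything with a lookup method can be walked into. */
int can_descend_by_lookup(const struct toy_inode *inode)
{
    assert(inode != NULL);
    return inode->ops != NULL && inode->ops->lookup != NULL;
}

/* A lookup method that finds nothing; enough to mark an inode walkable. */
struct toy_inode *empty_lookup(struct toy_inode *dir, const char *name)
{
    (void)dir;
    (void)name;
    return NULL;
}
```

An inode with mode TOY_IFREG and a lookup method fails the first check and
passes the second, which is why stray S_ISDIR()-style tests are the places
that would need fixing up.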

This would already allow people to "try out" how different applications
would react to a file that can show up both as a directory and a file. The
patch might end up being less than 25 lines or so, the difficulty is in
finding all the right places.

Al, anything I missed?

(And yes, it's a quick hack, but it's a quick hack that would probably
mimic a good part of what we would have to do internally in the VFS layer
to support this notion anyway).

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040826191323.GY21964@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 26 Aug 2004 19:28:34 GMT
Message-ID: <fa.ndtfait.1t1cpjd@ifi.uio.no>

[reposted in thread]

On Thu, Aug 26, 2004 at 11:46:33AM -0700, Linus Torvalds wrote:
> Note that we could try this out with existing filesystems with very
> minimal changes:
>
>  - make directory bind mounts work on top of files ("graft_tree()")
>  - make open_namei() and friends _not_ do the mount-point following for the
>    last component if it's a non-directory.
>  - probably some trivial fixups I haven't thought about. There might be
>    some places that use "S_ISDIR()" to check for whether something can be
>    looked up, but the main path walking already just checks whether there
>    is a ".lookup" operation or not.
>
> This would already allow people to "try out" how different applications
> would react to a file that can show up both as a directory and a file. The
> patch might end up being less than 25 lines or so, the difficulty is in
> finding all the right places.

The real issue is what to do with unlink() et al. on these guys.  Note
that "unlink is OK if all we have there is a bunch of directory mounts"
won't work well - we have no good way to check that condition.

Even funnier one is what we do if we have directory mounted there *and*
have something mounted on stuff in that directory.

Yes, that's one of the probable directions for such stuff, but there's a
lot of fun semantics questions and answers to them will matter a lot.

Hey, if we lose the "can't unlink/rmdir/rename over something that is
a mountpoint in other life" - I'm happy and we can get a lot of much
more interesting stuff to work.  It will take some work (e.g. making
sure we can find all vfsmounts over given mountpoint and sorting out
the locking issues, which won't be trivial), but the main obstacle in
that direction is not in architecture - it's in SuS and tradition; as
a matter of fact, our life would be much easier if we stopped trying
to give -EBUSY here and just dissolved all subtrees mounted on anything
that has that dentry.


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040826203228.GZ21964@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 26 Aug 2004 20:59:55 GMT
Message-ID: <fa.net7b2s.1t18p3c@ifi.uio.no>

On Thu, Aug 26, 2004 at 08:13:23PM +0100, viro@parcelfarce.linux.theplanet.co.uk wrote:

> Hey, if we lose the "can't unlink/rmdir/rename over something that is
> a mountpoint in other life" - I'm happy and we can get a lot of much
> more interesting stuff to work.  It will take some work (e.g. making
> sure we can find all vfsmounts over given mountpoint and sorting out
> the locking issues, which won't be trivial), but the main obstacle in
> that direction is not in architecture - it's in SuS and tradition; as
> the matter of fact, our life would be much easier if we stopped trying
> to give -EBUSY here and just dissolved all subtrees mounted on anything
> that has that dentry.

Argh...  OK, now I remember why I went for -EBUSY for unlink() (we obviously
are not bound by SuS on that one).  Consider the following scenario:
	* local file foo got something else bound on it for a while
	* we are tight on space - time to clean up
	* oh, look - contents of foo is junk
	* rm foo
	* ... oh, fuck, there goes the underlying file.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408261344150.2304@ppc970.osdl.org>
Date: Thu, 26 Aug 2004 23:54:18 GMT
Message-ID: <fa.gsu5uea.aicrpg@ifi.uio.no>

On Thu, 26 Aug 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> Argh...  OK, now I remember why I went for -EBUSY for unlink() (we obviously
> are not bound by SuS on that one).  Consider the following scenario:
> 	* local file foo got something else bound on it for a while
> 	* we are tight on space - time to clean up
> 	* oh, look - contents of foo is junk
> 	* rm foo
> 	* ... oh, fuck, there goes the underlying file.

Hey, that's a valid reason for doing -EBUSY for normal bind-mounts, but it
actually _is_ what we want for an "implied-by-way-of-container-mount".
After all, when you do a "rm foo", you do mean "remove the container foo".

I replied to your earlier off-list mail in private, so let's re-iterate
for the list: the easiest way to handle this is to just have a "mount
option", and have "MNT_ALLOWUNLINK" that gets set for containers, and that
users could possibly choose to set for regular mounts too (as a mount
option) if they really want to (and if we want them to).

So there's no reason we'd have to drop existing mount behaviour only
because we also have special files that look like mountpoints.
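
A toy model of that rule, with made-up structures and a made-up bit value
for the proposed MNT_ALLOWUNLINK flag (none of this is real kernel code):
unlink keeps returning -EBUSY when any ordinary mount sits on the dentry,
and goes ahead only when every mount there carries the flag.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MNT_ALLOWUNLINK 0x1u    /* made-up value for the proposed flag */

struct toy_vfsmount {
    unsigned int mnt_flags;
    struct toy_vfsmount *next;      /* next mount on the same dentry */
};

struct toy_dentry {
    struct toy_vfsmount *mounts;    /* NULL when nothing is mounted here */
};

/* Return 0 if unlink may proceed, -EBUSY if a normal mount is in the way. */
int toy_may_unlink(const struct toy_dentry *dentry)
{
    const struct toy_vfsmount *m;

    assert(dentry != NULL);
    for (m = dentry->mounts; m != NULL; m = m->next)
        if (!(m->mnt_flags & TOY_MNT_ALLOWUNLINK))
            return -EBUSY;          /* existing bind-mount behaviour */
    return 0;                       /* no mounts, or container mounts only */
}
```

A dentry carrying only container mounts can be unlinked; one ordinary bind
mount on the same dentry is enough to bring back -EBUSY.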

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: [some sanity for a change] possible design issues for hybrids
Original-Message-ID: <20040826212853.GA21964@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 26 Aug 2004 21:45:45 GMT
Message-ID: <fa.n3svar2.1j1spb6@ifi.uio.no>

[subject changed and *please* let's keep the wankfest out of that branch;
we are talking about possible ways to handle hybrids, *NOT* their desirability,
effect on DARPA funding, size of KDE developers' genitals experienced by
Aunt Tillie, etc.]

On Thu, Aug 26, 2004 at 01:47:37PM -0700, Linus Torvalds wrote:
> Hey, that's a valid reason for doing -EBUSY for normal bind-mounts, but it
> actually _is_ what we want for an "implied-by-way-of-container-mount".
> After all, when you do a "rm foo", you do mean "remove the container foo".
>
> I replied to your earlier off-list mail in private, so let's re-iterate
> for the list: the easiest way to handle this is to just have a "mount
> option", and have "MNT_ALLOWUNLINK" that gets set for containers, and that
> users could possibly choose to set for regular mounts too (as a mount
> option) if they really want to (and if we want them to).
>
> So there's no reason we'd have to drop existing mount behaviour only
> because we also have special files that look like mountpoints.

All right, let's see where that would take us.

1) we would need to find all vfsmounts over given dentry.  Probably a cyclic
list (we want to check if there are normal mounts/bindings among those and
we want to dissolve them if there's none).

2) we would need to do something about locking, since mount trees in other
guys' namespaces are protected by semaphores of their own.

3) what do we do on umount(2)?  We can get a bunch of vfsmounts hanging off
it.  MNT_DETACH will have no problems, but normal umount() is a different
story.  Note that it's not just hybrid-related problem - implementing the
mount traps will cause the same kind of trouble.

4) OK, we have those hybrids and want to create vfsmounts when crossing a
mountpoint.  When do they go away, anyway?  When we don't reference them
anymore?  Right now "attached to mount tree" == "+1 to refcount" and detaching
happens explicitly - outside of the "dropping the final reference" path.
Might become a locking issue.

5) Creation of these vfsmounts: fs should somehow tell us whether it wants
one or not (at the very least, we should stop *somewhere*).  Can we use
the same dentry/inode?  I'm not sure and I really doubt that we'd like that.

6) if it's a method, where should it live, *especially* if we want them on
device nodes.  Note that inode_operations belongs to underlying fs, so it's
not a particularly good place for the device case.

7) automount folks want partially shared mount trees (well, mirrored,
actually).  The basic idea is that while namespace boundary is a trust
boundary, we might want to be able to say "I trust this guy to handle
that subtree under /home/stuff/foobar/mounts".  It's the same situation
as with shared mappings (I want separate address space, but I'm willing
to share that chunk of memory with other process), except that we want
it to be allowed to become asymmetric (shared r/o mapping with somebody
else having it r/w).   That stuff (and mount traps) is the next pending
major work on the mount trees.  We probably want hybrids work to go with it,
since they affect the same data structures and need more or less the same
changes.
[And yes, this is the open season on design discussions for shared subtrees -
automount folks are welcome to join]

8) what should happen when something is mounted on top of directory-over-file?
How do we treat such beasts?  What are the implications?

9) how do we recognize such mountpoints in the path lookups?  It *is* a
hot path, so we should be careful in that area; the impact will be felt
by everything in the system.

10) how do we deal with directories, anyway?  Mixing "attributes" with
normal directory contents is going to be fun, what with lseek() insanity.
That's not an issue for hybrids, but it is one for anybody who wants
any sort of common metadata exported that way.  Note that it's really
important for fs writers - having two different pieces of code to export
the same information (for directories and non-directories resp.) is going
to become a prime breeding ground for bugs.

11) if we go for your "here's stuff that belongs in device node viewed
as directory", how would that play with fs metadata exporters?  Again,
due to the insanity of lseek() on directories it's *very* hard to deal
with unions, when parts of directory come from different chunks of code.

That's it for starters.  Technical answers/questions/comments are welcome.
Generic masturbation => over there in the parent thread, please.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [some sanity for a change] possible design issues for hybrids
Original-Message-ID: <Pine.LNX.4.58.0408261436480.2304@ppc970.osdl.org>
Date: Thu, 26 Aug 2004 22:25:50 GMT
Message-ID: <fa.guedu6c.82oq1i@ifi.uio.no>

[ This is quite possibly just impossible and buggy, but here's my
  implementation notes. You asked for them. ]

On Thu, 26 Aug 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> All right, let's see where that would take us.
>
> 1) we would need to find all vfsmounts over given dentry.  Probably a cyclic
> list (we want to check if there are normal mounts/bindings among those and
> we want to dissolve them if there's none).

Not per-inode? dentries are a bit more memory-constrained than inodes, and
we only need this for filesystems that want to support it, so we wouldn't
need to put this information in each dentry.

Since the vfsmounts have back-pointers to the dentry they are mounted on,
you can still do a "per-dentry" traversal, by just doing the inode list
and checking the dentry pointer. No?

Alternatively, we could just put the list on an external hash-chain
entirely, and hash off the dentry. It depends on how often we end up
needing it.

Or we could just put it in the dentry itself. I'd hate to make it any
bigger than it already is, but maybe it doesn't matter that much.

> 2) we would need to do something about locking, since mount trees in other
> guys' namespaces are protected by semaphores of their own.

Ok, I'll admit that I don't know how to handle namespaces. These things
should just go into a global namespace, and I was kind of assuming it
would happen automatically in "lookup_mnt()" or something like that. A
special case in lookup_mnt which says something like "if you didn't find a
vfsmount, we create a new one for you".

It should be reasonably easy to create new ones on-the-fly, since we'd
have all the information (the parent vfsmount comes stated, and the
vfsmount we create would point to the same things that the "base" one
would).

> 3) what do we do on umount(2)?  We can get a bunch of vfsmounts hanging off
> it.  MNT_DETACH will have no problems, but normal umount() is a different
> story.  Note that it's not just hybrid-related problem - implementing the
> mount traps will cause the same kind of trouble,

Don't allow umount. It's not something the user can unmount - the mount is
"implied" in the file.

> 4) OK, we have those hybrids and want to create vfsmounts when crossing a
> mountpoint.  When do they go away, anyway?  When we don't reference them
> anymore?  Right now "attached to mount tree" == "+1 to refcount" and detaching
> happens explicitly - outside of the "dropping the final reference" path.
> Might become a locking issue.

Ahh. Umm.. Yes. I think this might be the real problem. Unless I seriously
glossed something over when I blathered about the "create the vfsmount on
the fly" thing above ;)

> 5) Creation of these vfsmounts: fs should somehow tell us whether it wants
> one or not (at the very least, we should stop *somewhere*).  Can we use
> the same dentry/inode?  I'm not sure and I really doubt that we'd like that.

Why not? When doing the ->lookup() operation, the filesystem would create
the vfsmount and bind it to the current vfsmount. That guarantees that it
has a vfsmount, and will mean that it will show up positive with the
"d_mountpoint()" query, which in turn will cause us to do the
"lookup_mnt()".

Which in turn will create the other vfsmounts as needed, if you have
multiple namespaces.

So I _think_ creation is easy. Getting rid of the dang things might be
harder.

> 6) if it's a method, where should it live, *especially* if we want them on
> device nodes.  Note that inode_operations belongs to underlying fs, so it's
> not particularly good place for device case.

Why not just let the existing .lookup method initialize the mount-point
thing? After that, it's all in the VFS layer (I'd hate to have filesystems
mess around with vfsmounts - they'll just get it wrong).

> 7) automount folks want partially shared mount trees (well, mirrored,
> actually).

I don't think you can get partial sharing on one of these puppies. You'd
always have one vfsmount per namespace (well, lazily created, so maybe in
practice you'd see a lot fewer).

> 8) what should happen when something is mounted on top of directory-over-file?
> How do we treat such beasts?  What are the implications?

Allow file-on-file mounts - it will just totally hide the thing (in that
namespace, at least). But don't allow the dir-on-file thing (that we
already don't allow).

> 9) how do we recognize such mountpoints in the path lookups?  It *is* a
> hot path, so we should be careful in that area; the impact will be felt
> by everything in the system.

I don't think you'll have any special cases. Same d_mountpoint(), same
lookup_mnt().

> 10) how do we deal with directories, anyway?  Mixing "attributes" with
> normal directory contents is going to be fun, what with lseek() insanity.

You couldn't get at the attributes that way anyway, so I think the point
is moot. The "real" directory always takes over.

Crazy people could try to just use the regular "xattrs" interfaces if they
really want attributes on directories. You wouldn't ever be able to use
the "easy" one.

> 11) if we go for your "here's stuff that belongs in device node viewed
> as directory", how would that play with fs metadata exporters?  Again,
> due to the insanity of lseek() on directories it's *very* hard to deal
> with unions, when parts of directory come from different chunks of code.

Don't go there. See above. Directories would be just plain directories,
you could never see their metadata. Same goes for at least symlinks, and
possibly other filetypes too (ie at least initially, a block or character
special device will just take over the whole "file_operations", which
includes "readdir", so it's actually hard to have the filesystem do
anything about those).

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: [some sanity for a change] possible design issues for hybrids
Original-Message-ID: <20040826223625.GB21964@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 26 Aug 2004 22:59:04 GMT
Message-ID: <fa.n2t5b30.1h1mp38@ifi.uio.no>

On Thu, Aug 26, 2004 at 03:04:21PM -0700, Linus Torvalds wrote:

> > 2) we would need to do something about locking, since mount trees in other
> > guys' namespaces are protected by semaphores of their own.
>
> Ok, I'll admit that I don't know how to handle namespaces. These things
> should just go into a global namespace, and I was kind of assuming it
> would happen automatically in "lookup_mnt()" or something like that. A
> special case in lookup_mnt which says something like "if you didn't find a
> vfsmount, we create a new one for you".
>
> It should be reasonably easy to create new ones on-the-fly, since we'd
> have all the information (the parent vfsmount comes stated, and the
> vfsmount we create would point to the same things that the "base" one
> would).

Erm...  What do we do upon unlink()?  I'm killing a file, fs it's in is
mounted in a dozen places (no namespaces, just chroot jails, whatever).
We need to find all vfsmounts to be killed by that.

And BTW that's an argument against anchoring that list in inode - unlink()
on foo should not screw bar/... even if bar and foo are links to the same
file.  So we'll need to check for dentry match anyway.

> > 3) what do we do on umount(2)?  We can get a bunch of vfsmounts hanging off
> > it.  MNT_DETACH will have no problems, but normal umount() is a different
> > story.  Note that it's not just hybrid-related problem - implementing the
> > mount traps will cause the same kind of trouble,
>
> Don't allow umount. It's not something the user can unmount - the mount is
> "implied" in the file.

See below.

> > 4) OK, we have those hybrids and want to create vfsmounts when crossing a
> > mountpoint.  When do they go away, anyway?  When we don't reference them
> > anymore?  Right now "attached to mount tree" == "+1 to refcount" and detaching
> > happens explicitly - outside of the "dropping the final reference" path.
> > Might become a locking issue.
>
> Ahh. Umm.. Yes. I think this might be the real problem. Unless I seriously
> glossed something over when I blathered about the "create the vfsmount on
> the fly" thing above ;)

> > 5) Creation of these vfsmounts: fs should somehow tell us whether it wants
> > one or not (at the very least, we should stop *somewhere*).  Can we use
> > the same dentry/inode?  I'm not sure and I really doubt that we'd like that.
>
> Why not? When doing the ->lookup() operation, the filesystem would create
> the vfsmount and bind it to the current vfsmount. That guarantees that it
> has a vfsmount, and will mean that it will show up positive with the
> "d_mountpoint()" query, which in turn will cause us to do the
> "lookup_mnt()".

Several paragraphs below you are saying that you don't like fs messing with
vfsmounts.  Use of ->lookup() would mean that we should not only create
and attach vfsmounts from within fs code, but would actually have to make
->lookup() return vfsmount+dentry, AFAICS.

> > 6) if it's a method, where should it live, *especially* if we want them on
> > device nodes.  Note that inode_operations belongs to underlying fs, so it's
> > not particularly good place for device case.
>
> Why not just let the existing .lookup method initialize the mount-point
> thing? After that, it's all in the VFS layer (I'd hate to have filesystems
> mess around with vfsmounts - they'll just get it wrong).

> Allow file-on-file mounts - it will just totally hide the thing (in that
> namespace, at least). But don't allow the dir-on-file thing (that we
> already don't allow).

Err...  What about dir-on-dir-that-is-on-file?  I.e. mount on foo/. when foo
is a file?

> > 9) how do we recognize such mountpoints in the path lookups?  It *is* a
> > hot path, so we should be careful in that area; the impact will be felt
> > by everything in the system.

> I don't think you'll have any special cases. Same d_mountpoint(), same
> lookup_mnt().

See above on the use of ->lookup().


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [some sanity for a change] possible design issues for hybrids
Original-Message-ID: <Pine.LNX.4.58.0408261538030.2304@ppc970.osdl.org>
Date: Thu, 26 Aug 2004 23:11:32 GMT
Message-ID: <fa.gse5u6e.a2cq1s@ifi.uio.no>

On Thu, 26 Aug 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
> >
> > It should be reasonably easy to create new ones on-the-fly, since we'd
> > have all the information (the parent vfsmount comes stated, and the
> > vfsmount we create would point to the same things that the "base" one
> > would).
>
> Erm...  What do we do upon unlink()?  I'm killing a file, fs it's in is
> mounted in a dozen of places (no namespaces, just chroot jails, whatever).
> We need to find all vfsmounts to be killed by that.

But that should be trivial: that's what the per-inode vfsmount list was
(your first question in the last email).

> And BTW that's an argument against anchoring that list in inode - unlink()
> on foo should not screw bar/... even if bar and foo are links to the same
> file.  So we'll need to check for dentry match anyway.

And again - I talked about this in the previous email. Even if you anchor
the list in "struct inode", or you do it with a totally external
hash-list, you'll always have the "vfsmount->mnt_mountpoint" pointer to
point to the dentry. So you can just iterate over the list, and
cherry-pick the ones that point to the dentry you are removing.
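
The cherry-picking described above can be sketched roughly as follows. This is a toy model, not kernel code: the structures carry only the fields the argument needs, and the per-inode list link (next_on_inode), the dead flag and the function name are hypothetical. Only mnt_mountpoint is a name taken from the discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields used here. */
struct dentry {
	const char *name;
};

struct vfsmount {
	struct dentry *mnt_mountpoint;   /* dentry this mount sits on */
	struct vfsmount *next_on_inode;  /* hypothetical per-inode list link */
	int dead;                        /* set once the mount is detached */
};

/*
 * Walk the per-inode list and detach only the mounts whose mountpoint
 * is the dentry being unlinked.  Mounts hanging off other links to the
 * same inode are left alone.  Returns the number of mounts detached.
 */
int kill_mounts_on_dentry(struct vfsmount *head, struct dentry *victim)
{
	int killed = 0;

	assert(victim != NULL);
	for (struct vfsmount *m = head; m != NULL; m = m->next_on_inode) {
		if (m->mnt_mountpoint == victim && !m->dead) {
			m->dead = 1;
			killed++;
		}
	}
	return killed;
}
```

With "foo" and "bar" as two links to one inode, unlinking "foo" then kills only the mounts whose mnt_mountpoint is the "foo" dentry, which is the property the objection about bar/... asks for.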

>
> > > 3) what do we do on umount(2)?  We can get a bunch of vfsmounts hanging off
> > > it.  MNT_DETACH will have no problems, but normal umount() is a different
> > > story.  Note that it's not just hybrid-related problem - implementing the
> > > mount traps will cause the same kind of trouble,
> >
> > Don't allow umount. It's not something the user can unmount - the mount is
> > "implied" in the file.
>
> See below.
>
> > > 4) OK, we have those hybrids and want to create vfsmounts when crossing a
> > > mountpoint.  When do they go away, anyway?  When we don't reference them
> > > anymore?  Right now "attached to mount tree" == "+1 to refcount" and detaching
> > > happens explicitly - outside of the "dropping the final reference" path.
> > > Might become a locking issue.
> >
> > Ahh. Umm.. Yes. I think this might be the real problem. Unless I seriously
> > glossed something over when I blathered about the "create the vfsmount on
> > the fly" thing above ;)
>
> > > 5) Creation of these vfsmounts: fs should somehow tell us whether it wants
> > > one or not (at the very least, we should stop *somewhere*).  Can we use
> > > the same dentry/inode?  I'm not sure and I really doubt that we'd like that.
> >
> > Why not? When doing the ->lookup() operation, the filesystem would create
> > the vfsmount and bind it to the current vfsmount. That guarantees that it
> > has a vfsmount, and will mean that it will show up positive with the
> > "d_mountpoint()" query, which in turn will cause us to do the
> > "lookup_mnt()".
>
> Several paragraphs below you are saying that you don't like fs messing with
> vfsmounts.  Use of ->lookup() would mean that we should not only create
> and attach vfsmounts from within fs code, but would actually have to make
> ->lookup() return vfsmount+dentry, AFAICS.

No, lookup would just return the dentry, but the dentry would already be
filled in with the mount-point information.

And you can do that with a simple vfs helper function, ie the filesystem
itself would just need to do

	pseudo_mount(dentry, inode);

thing - which just fills in dentry->d_mountpoint with a new vfsmount
thing. It would allocate a new root dentry (for the pseudo-mount) and a
new vfsmount, and make dentry->d_mountpoint point to it.

IOW, the filesystem itself would never mess around with d_mountpoint
itself.

> Err...  What about dir-on-dir-that-is-on-file?  I.e. mount on foo/. when foo
> is a file?

Hmm.. We might as well allow it, I suspect. It's not like it should hurt.
We'd end up following the mount-chain twice, but we already have that
issue with multi-mount cases..

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: [some sanity for a change] possible design issues for hybrids
Original-Message-ID: <20040826225308.GC21964@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 26 Aug 2004 23:09:33 GMT
Message-ID: <fa.n2dbbit.1ghcojd@ifi.uio.no>

On Thu, Aug 26, 2004 at 03:45:09PM -0700, Linus Torvalds wrote:
> No, lookup would just return the dentry, but the dentry would already be
> filled in with the mount-point information.
>
> And you can do that with a simple vfs helper function, ie the filesystem
> itself would just need to do
>
> 	pseudo_mount(dentry, inode);
>
> thing - which just fills in dentry->d_mountpoint with a new vfsmount
> thing. It would allocate a new root dentry (for the pseudo-mount) and a
> new vfsmount, and make dentry->d_mountpoint point to it.

What dentry->d_mountpoint?  No such thing...

Note that we can't get vfsmount by dentry - that's the point of having these
guys in the first place.  So I'm not sure what you are trying to do here -
dentry + inode is definitely not enough to attach any vfsmounts anywhere.

That's not about namespaces - same fs mounted in several places will give
the same problem - one dentry, many vfsmounts.  And we obviously *can't*
have one vfsmount for all of them - if the same fs is mounted on /foo and
/bar, we will have the same dentry for /foo/splat and /bar/splat.  So
what should we get for /foo/splat/. and /bar/splat/.?  Same dentry *and*
same vfsmount?  I'd expect .. from the former to give /foo and from the
latter - /bar...
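
A small model of why ".." needs the vfsmount and not just the dentry may help here. It is only a sketch under assumptions: the structures are toy versions with the three mount fields the argument relies on, and dotdot_sketch() is a hypothetical simplification of what the real path walk does when it leaves the root of a mount.

```c
#include <assert.h>
#include <stddef.h>

struct dentry {
	const char *name;
	struct dentry *d_parent;
};

struct vfsmount {
	struct vfsmount *mnt_parent;     /* mount we hang off (itself for the root) */
	struct dentry *mnt_mountpoint;   /* dentry in the parent mount we sit on */
	struct dentry *mnt_root;         /* root dentry of this mount */
};

/* A position in the tree is a (vfsmount, dentry) pair, never a dentry alone. */
struct path_sketch {
	struct vfsmount *mnt;
	struct dentry *dentry;
};

/* Resolve ".." from p. */
struct path_sketch dotdot_sketch(struct path_sketch p)
{
	assert(p.mnt != NULL && p.dentry != NULL);

	/* At the root of a mount: step down onto the mountpoint below it. */
	while (p.dentry == p.mnt->mnt_root && p.mnt->mnt_parent != p.mnt) {
		p.dentry = p.mnt->mnt_mountpoint;
		p.mnt = p.mnt->mnt_parent;
	}
	/* Then move to the parent dentry, unless we are at the global root. */
	if (p.dentry != p.mnt->mnt_root)
		p.dentry = p.dentry->d_parent;
	return p;
}
```

If the pseudo-mounts over splat have different parents (the mount on /foo and the mount on /bar), ".." from each lands on the same dentry but a different vfsmount. A single shared vfsmount would have only one mnt_parent, so one of the two would come out in the wrong place.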


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [some sanity for a change] possible design issues for hybrids
Original-Message-ID: <Pine.LNX.4.58.0408261619230.2304@ppc970.osdl.org>
Date: Thu, 26 Aug 2004 23:39:18 GMT
Message-ID: <fa.gte7tmf.b2aqht@ifi.uio.no>

On Thu, 26 Aug 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> What dentry->d_mountpoint?  No such thing...

Sorry - set "dentry->d_mounted++" + "add vfsmount/dentry to hashes".

Yes, it's not a direct list off the dentry, but it effectively is the same
thing.

So basically: the "d_mounted++" just makes sure we get into
"lookup_mnt()". That's where we will usually find the actual mount thing.

And that's also where the special case comes in: if we _don't_ find the
mount thing there, that's where we need to create it. That will only
happen if somebody looks it up using another namespace, though, so it
should be rare.

And when it does happen, we can just create a new vfsmount - we have all
the information there (we'll have to walk the per-inode-or-whatever
vfsmount list to find all the information to populate the thing with, of
course. But we need that list _anyway_, so it should be a fairly
straightforward special case).

			Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: [some sanity for a change] possible design issues for hybrids
Original-Message-ID: <20040826234048.GD21964@parcelfarce.linux.theplanet.co.uk>
Date: Fri, 27 Aug 2004 00:00:21 GMT
Message-ID: <fa.n4tdbaq.1h1eore@ifi.uio.no>

On Thu, Aug 26, 2004 at 04:24:51PM -0700, Linus Torvalds wrote:
> So basically: the "d_mounted++" just makes sure we get into
> "lookup_mnt()". That's where we will usually find the actual mount thing.
>
> And that's also where the special case comes in: if we _don't_ find the
> mount thing there, that's where we need to create it. That will only
> happen if somebody looks it up using another namespace, though, so it
> should be rare.

No.  Trivial example:

mount --bind /foo /bar
mount /dev/sda1 /bar/baz

do lookup for /foo/baz.  No namespaces involved, no vfsmounts found, d_mounted
positive and we certainly do *not* want anything to be created at that point.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [some sanity for a change] possible design issues for hybrids
Original-Message-ID: <Pine.LNX.4.58.0408261652240.2304@ppc970.osdl.org>
Date: Fri, 27 Aug 2004 00:14:30 GMT
Message-ID: <fa.gte9um8.b24rhm@ifi.uio.no>

On Fri, 27 Aug 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:

> On Thu, Aug 26, 2004 at 04:24:51PM -0700, Linus Torvalds wrote:
> > So basically: the "d_mounted++" just makes sure we get into
> > "lookup_mnt()". That's where we will usually find the actual mount thing.
> >
> > And that's also where the special case comes in: if we _don't_ find the
> > mount thing there, that's where we need to create it. That will only
> > happen if somebody looks it up using another namespace, though, so it
> > should be rare.
>
> No.  Trivial example:
>
> mount --bind /foo /bar
> mount /dev/sda1 /bar/baz
>
> do lookup for /foo/baz.  No namespaces involved, no vfsmounts found, d_mounted
> positive and we certainly do *not* want anything to be created at that point.

Right. We obviously need to mark the dentry somehow, and only do the
"create vfsmount" special case in this special case. If we didn't do that,
then it wouldn't be a special case, now would it?

So clearly lookup_mnt() needs to check the dentry in the failure case. The
marking could be in any of three places
 - mark the dentry itself by just using a dentry flag ("DCACHE_AUTOVFSMNT")
   or by having a dentry operation for this.
 - mark the inode itself (same logic as dentry)
 - look up the first vfsmount (on the inode list), and look if that one is
   of the automatic type.

Clearly we should not _always_ create a vfsmount, that would just break
the existing logic.
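
A rough sketch of the first option (a dentry flag checked only in the failure case) might look like this. Everything here is a toy model: the flat table stands in for the mount hash, add_mount() and lookup_mnt_sketch() are hypothetical helpers, and the value given to DCACHE_AUTOVFSMNT is made up. Only the flag name, d_mounted and the lookup_mnt() idea come from the thread.

```c
#include <assert.h>
#include <stddef.h>

#define DCACHE_AUTOVFSMNT 0x1u  /* name from the mail; the value is made up */
#define MAX_MOUNTS 16

struct dentry {
	unsigned int d_flags;
	int d_mounted;               /* >0: something is mounted on this dentry */
};

struct vfsmount {
	struct vfsmount *mnt_parent;
	struct dentry *mnt_mountpoint;
	int automatic;               /* created on the fly, not by mount(2) */
};

/* A flat table stands in for the kernel's mount hash. */
static struct vfsmount mount_table[MAX_MOUNTS];
static int nr_mounts;

struct vfsmount *add_mount(struct vfsmount *parent, struct dentry *dentry,
			   int automatic)
{
	struct vfsmount *m;

	assert(nr_mounts < MAX_MOUNTS);
	m = &mount_table[nr_mounts++];
	m->mnt_parent = parent;
	m->mnt_mountpoint = dentry;
	m->automatic = automatic;
	dentry->d_mounted++;
	return m;
}

/* Find the mount on (parent, dentry); auto-create only for marked dentries. */
struct vfsmount *lookup_mnt_sketch(struct vfsmount *parent, struct dentry *dentry)
{
	for (int i = 0; i < nr_mounts; i++) {
		struct vfsmount *m = &mount_table[i];

		if (m->mnt_parent == parent && m->mnt_mountpoint == dentry)
			return m;
	}
	/* Failure case: the bind-mount example must still get NULL here. */
	if (dentry->d_flags & DCACHE_AUTOVFSMNT)
		return add_mount(parent, dentry, 1);
	return NULL;
}
```

In the "mount --bind /foo /bar; mount /dev/sda1 /bar/baz" example, baz has d_mounted set but no flag, so a lookup through /foo still fails without creating anything.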

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: [some sanity for a change] possible design issues for hybrids
Original-Message-ID: <20040827010147.GE21964@parcelfarce.linux.theplanet.co.uk>
Date: Fri, 27 Aug 2004 01:17:32 GMT
Message-ID: <fa.n4d7aas.1ghkpre@ifi.uio.no>

On Thu, Aug 26, 2004 at 04:57:01PM -0700, Linus Torvalds wrote:
> Right. We obviously need to mark the dentry somehow, and only do the
> "create vfsmount" special case in this special case. If we didn't do that,
> then it wouldn't be a special case, now would it?
>
> So clearly lookup_mnt() needs to check the dentry in the failure case. The
> marking could be in any of three places
>  - mark the dentry itself by just using a dentry flag ("DCACHE_AUTOVFSMNT")
>    or by having a dentry operation for this.
>  - mark the inode itself (same logic as dentry)
>  - look up the first vfsmount (on the inode list), and look if that one is
>    of the automatic type.
>
> Clearly we should not _always_ create a vfsmount, that would just break
> the existing logic.

Hmm...  IOW, you are suggesting to use normal trigger for lookup_mnt() and
then have extra check in lookup_mnt() failure exit.  That will probably
work (and I'd prefer to mark dentry here).

Freeing these guys will not be fun.  The final reference could be given
up under very different locking conditions - it's certainly a blocking
operation (it can trigger final umount, among other things), but having
it trigger removal from mount tree(s) can get ugly.  We might be able to
tweak that, but it will probably require major surgery in several
places.

Right now we assume that vfsmount lock *and* per-namespace semaphore are
held when detaching from the tree (attaching what could be the first at
given dentry --- same + ->i_sem on its inode).  We also assume that vfsmount is
detached before refcount reaches zero.  The thing is, that final refcount
can be dropped while holding namespace semaphore.  It's not fatal (we'll
need to hold onto these guys until dropping that semaphore or add an analog
of mntput() for places under namespace sem), but it can get messy.
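
The "hold onto these guys until dropping that semaphore" variant could be sketched along these lines. This is purely illustrative: the deferred list, the sem_held flag and both function names are assumptions made up for the sketch, and real code would need actual locking rather than a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vfsmount {
	int mnt_count;                /* reference count */
	bool released;                /* set once the final release has run */
	struct vfsmount *defer_next;  /* hypothetical link on the deferred list */
};

struct namespace_sketch {
	bool sem_held;                /* stands in for the namespace semaphore */
	struct vfsmount *deferred;    /* mounts whose last reference is gone */
};

/* Stands in for the blocking final release (may trigger a final umount). */
static void release_mount(struct vfsmount *m)
{
	m->released = true;
}

/* Drop a reference; if it was the last one under the semaphore, park it. */
void mntput_sketch(struct namespace_sketch *ns, struct vfsmount *m)
{
	assert(m->mnt_count > 0);
	if (--m->mnt_count > 0)
		return;
	if (ns->sem_held) {
		m->defer_next = ns->deferred;
		ns->deferred = m;
		return;
	}
	release_mount(m);
}

/* Drop the semaphore first, then do the releases that were parked. */
void namespace_unlock_sketch(struct namespace_sketch *ns)
{
	struct vfsmount *m = ns->deferred;

	ns->deferred = NULL;
	ns->sem_held = false;
	while (m != NULL) {
		struct vfsmount *next = m->defer_next;

		release_mount(m);
		m = next;
	}
}
```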

Situation with unlink(): we need to (a) check that all vfsmounts over
given point are "hybrid" ones (locking is not a problem at that point)
and (b) after successful operation (unlink() and rename() alike, AFAICS)
we want to go and kill everything under the victim dentry.  *That*
can get very ugly, since fs could be mounted in a lot of places and/or
in a lot of namespaces.

Note that we also need to do in VFS what NFS tries to do in revalidation -
if dentry is invalidated, everything under it should be detached and
dropped.  So it's not something new - we do need something that would
be called outside of any superblock/namespace/vfsmount locks (both
unlink() and revalidation qualify) and would kill the vfsmounts that
become unreachable.

Locking rules for attaching might need adjustment (the logics with ->i_sem
is that we want !d_mountpoint() stay true around call of ->unlink(),
->rmdir() and ->rename(); so cloning or replacing doesn't need that
protection.  Now it might become interesting - we'll need to be careful
in pivot_root(2) and possibly some other places).

I'll see what can be done (it'll definitely take careful looking at the
code), but there is a chance of things getting very ugly.

BTW, shared subtrees *are* compatible with that puppy - there's a lot
of corner cases to look at, but AFAICS they are all doable.  Lifetime
rules can get ugly in some of them, but that's on par with the situation
without sharing.

One thing that looks like a bad interface: we get forcible "use same
dentry for file and directory" with that design.  That can lead to
a big can of worms - without a large audit I wouldn't bet a dime on
ability to get that right (and same applies to reiser4 code as-is,
for the same reasons).  No idea how bad it will turn out to be - right
now all I can say is that we might run into trouble and that we almost
certainly will get extra "be careful not to do <list of things>" rules
out of that.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408251314260.17766@ppc970.osdl.org>
Date: Thu, 26 Aug 2004 05:52:54 GMT
Message-ID: <fa.it81n88.144kbji@ifi.uio.no>

On Wed, 25 Aug 2004, Christoph Hellwig wrote:
>
> For one thing _I_ didn't decide about xattrs anyway.  And I still
> haven't seen a design from you on -fsdevel how you try to solve the
> problems with files as directories.

Hey, files-as-directories are one of my pet things, so I have to side with
Hans on this one. I think it just makes sense. A hell of a lot more sense
than xattrs, anyway, since it allows scripts etc standard tools to touch
the attributes.

It's the UNIX way.

And yes, the semantics can _easily_ be solved in very unixy ways.

One way to solve it is to just realize that a final slash at the end
implies pretty strongly that you want to treat it as a directory. So what
you do is:

 - without the slash, a file-as-dir won't open with O_DIRECTORY (ENOTDIR)
 - with the slash, it won't open _without_ O_DIRECTORY (EISDIR)

Problem solved. Very user-friendly, and very intuitive.
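
As a sketch, the two rules reduce to one check on the path string and the open flag. The function name is hypothetical and a plain bool stands in for testing O_DIRECTORY in the open flags; it only applies to an object that is already known to be a file-as-dir.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/*
 * Decide how a file-as-dir object may be opened.  Returns 0 if the open
 * may proceed, or a negative errno value following the two rules above.
 */
int hybrid_open_check(const char *path, bool want_directory)
{
	size_t len;
	bool trailing_slash;

	assert(path != NULL);
	len = strlen(path);
	trailing_slash = len > 0 && path[len - 1] == '/';

	if (!trailing_slash && want_directory)
		return -ENOTDIR;   /* "foo" names the file aspect */
	if (trailing_slash && !want_directory)
		return -EISDIR;    /* "foo/" names the directory aspect */
	return 0;
}
```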

Will it potentially break something? Sure. Do we care? Me, I'll take that
kind of extension _any_ day over xattrs, that are fundamentally flawed in
my opinion and totally useless. The argument that applications like "tar"
won't understand the file-as-directory thing is _flawed_, since legacy
apps won't understand xattrs either.

Oh, add a O_NOXATTRS flag to force a path lookup to only use regular
directories, the same way we have O_NOFOLLOW and friends. That allows
people to see the difference, if they care (ie a file server might decide
that it doesn't want to expose things like this).

I never liked the xattr stuff. It makes little sense, and is totally
useless for 99.9999% of everything. I still don't see the point of it,
except for samba. Ugly.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408251723540.17766@ppc970.osdl.org>
Date: Thu, 26 Aug 2004 06:02:49 GMT
Message-ID: <fa.iuo5ng7.17kobbl@ifi.uio.no>

On Thu, 26 Aug 2004, Mikulas Patocka wrote:
>
> Stupid question: who will use it? And why?
>
> Anyone can write a userspace library, that implements function
> set_attribute(char *file, char *attribute, char *value), that creates
> directory ".attr/file" in file's directory and stores attribute there.
> (and you can get list of attributes from shell too:
> ls `echo "$filename" |sed 's/\/\([^\/]*\)$/\/\.attr\/\1/'`
> ). There's no need to add extra functionality to kernel and filesystem.

...and the above is, roughly, what I understand samba etc falls back on.

The problem ends up being that the above isn't in any way safe from people
moving files around (oops, where did those attributes go?) nor does it
have any consistency guarantees. So it only works well if _one_
application does this, and that application follows all the locking rules.
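
To make the "moving files around" problem concrete, here is a small sketch of the path mapping the quoted sed expression performs. The function name and buffer interface are made up for illustration; the ".attr" layout is the one proposed in the quoted text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<dir>/.attr/<name>" from "<dir>/<name>", mirroring the quoted
 * sed expression.  Returns 0 on success, -1 if the path contains no '/'
 * or the result does not fit in out.
 */
int attr_dir_path(const char *file, char *out, size_t outlen)
{
	const char *slash;
	int n;

	assert(file != NULL && out != NULL);
	slash = strrchr(file, '/');
	if (slash == NULL)
		return -1;

	n = snprintf(out, outlen, "%.*s/.attr/%s",
		     (int)(slash - file), file, slash + 1);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The attribute location is derived from the file's name only. After a plain rename() or mv of the file, the same function yields a different path, and nothing has moved the old attribute directory along with it.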

Is it enough? It may have to be.

> The only way xattrs are useful is that backup/restore software doesn't
> have to know about every filesystem with its specific attributes and
> every magic ioctl for setting them. Instead it can save/restore
> filesystem-specific attributes without understanding what they mean.
> However there's no reason why an application should use them. And no
> application does.

If no application does, then why back them up? Why implement them in the
first place?

In other words - some apps obviously do want to use them. Sadly.

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040825204240.GI21964@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 26 Aug 2004 05:52:49 GMT
Message-ID: <fa.n7cnbar.1nhoorf@ifi.uio.no>

On Wed, Aug 25, 2004 at 01:22:55PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 25 Aug 2004, Christoph Hellwig wrote:
> >
> > For one thing _I_ didn't decide about xattrs anyway.  And I still
> > haven't seen a design from you on -fsdevel how you try to solve the
> > problems with files as directories.
>
> Hey, files-as-directories are one of my pet things, so I have to side with
> Hans on this one. I think it just makes sense. A hell of a lot more sense
> than xattrs, anyway, since it allows scripts etc standard tools to touch
> the attributes.
>
> It's the UNIX way.

Not if you allow link(2) on them.  And not if you design and market your
stuff as a general-purpose backdoor into kernel.  Note how *EVERY* *DAMN*
*OPERATION* is made possible to override by "plugins".  Which is the reason
for deadlocks in question, BTW.

Don't fool yourself - that's what Hans is selling.  Target market: ISV.
Marketed product: a set of hooks, the wider the better, no matter how
little sense it makes.  The reason for doing that outside of core kernel:
bypassing any review and being able to control the product being sold (see
above).

Shame that it got an actual filesystem mixed in with the marketing plans
and general insanity...


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408251348240.17766@ppc970.osdl.org>
Date: Thu, 26 Aug 2004 06:03:28 GMT
Message-ID: <fa.it7to0c.144garu@ifi.uio.no>

On Wed, 25 Aug 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> On Wed, Aug 25, 2004 at 01:22:55PM -0700, Linus Torvalds wrote:
> >
> > It's the UNIX way.
>
> Not if you allow link(2) on them.

Heh. I don't think that's a very strong argument against being "unixy",
considering how traditional unix _used_ to handle directories.

mkdir/rmdir/rename only came later. Now, obviously they did come later for
a good reason, but still..

The interesting part is that thanks to the dcache, we should be perfectly
able to actually _see_ circular links etc, so some of the problems with
linking directories should actually be quite solvable - something that is
_not_ true for a traditional UNIX VFS layer.

Of course, the dcache introduces some new problems of its own wrt
directory aliasing, but I don't actually think that should be fundamental
either. Treating them more as a "static mountpoint" from an FS angle and
less as a traditional Unix hardlink should be doable, I'd have thought.

(Also, it's entirely possible that the filesystem may not support some of
the more esoteric linking/renaming operations. For example, in a
traditional xattrs setup where the xattr is linked on-disk with the file
it is associated with, you simply _can't_ link it somewhere else, or
rename it to any other directory. That's not a VFS layer issue, obviously,
but I thought I'd bring up the point that file-as-dir cases may have
limitations that normal files don't have).

>  And not if you design and market your stuff as a general-purpose
> backdoor into kernel.

Now that's a separate argument, and not one I'm personally interested in
arguing at least right now. I haven't actually looked at the reiser4 code,
so I'm really _only_ arguing against special-case attributes.

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040825212518.GK21964@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 26 Aug 2004 06:03:26 GMT
Message-ID: <fa.n6t9aqu.1k1apb8@ifi.uio.no>

On Wed, Aug 25, 2004 at 02:00:01PM -0700, Linus Torvalds wrote:
> Of course, the dcache introduces some new problems of its own wrt
> directory aliasing, but I don't actually think that should be fundamental
> either. Treating them more as a "static mountpoint" from an FS angle and
> less as a traditional Unix hardlink should be doable, I'd have thought.

Yeah, if we ditch the "mountpoints are busy and untouchable" stuff.  Which
I'd love to, but it's a hell of a visible (and admin-visible) change.

FWIW, current deadlocks are unrelated to actual operation succeeding.
Look: we have sys_link() making sure that parent of target is a directory
(PATH_LOOKUP, in a "it has ->lookup()" sense), then locking target's parent,
then checking that it has ->link() (everyone on reiser4 does) and then
checking that source (old link to file) is *not* a directory (in S_ISDIR
sense).  Then we lock source.

Note that currently it's OK - we get "all non-directories are always locked
after all directories".  With filesystem that provides hybrid objects with
non-NULL ->link() it's not true and we are in deadlock country.  Before
we get anywhere near fs code.
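
One way to picture the problem is as a classic lock-order inversion. The sketch below is a deliberately tiny model, not the kernel's locking: objects are just integer ids, and both function names are hypothetical. It records the order in which the link path described above takes its two locks and checks whether two such calls take the same pair in opposite order.

```c
#include <assert.h>
#include <stdbool.h>

/* Ids of the two objects locked by one link() call, in locking order. */
struct lock_pair {
	int first;
	int second;
};

/* link(source, parent/name): lock the target's parent, then the source. */
struct lock_pair link_lock_order(int source_id, int target_parent_id)
{
	struct lock_pair p = { target_parent_id, source_id };

	return p;
}

/* Two tasks can deadlock if they take the same two locks in opposite order. */
bool abba(struct lock_pair a, struct lock_pair b)
{
	return a.first != a.second &&
	       a.first == b.second && a.second == b.first;
}
```

With plain files and directories a source can never also be somebody's target parent, so the inversion cannot be constructed. With two hybrids G and H, link(G, H/x) racing against link(H, G/y) locks H then G in one task and G then H in the other.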

I'm not saying that this particular instance is hard to fix, but it wasn't
even looked at.  All it would take is checking the description of current
locking scheme and looking through the proof of correctness (present in the
tree).  That's the first point where said proof breaks if we have hybrids.
And it's what, about 4 screenfuls of text?

I have no problems with discussing such stuff and no problems with having it
merged if it actually works.  But let's start with something better than
"let's hope nothing breaks if we just add such objects and do nothing else,
'cause hybrid files/directories are good, mmmkay?"


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040826003055.GO21964@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 26 Aug 2004 06:02:47 GMT
Message-ID: <fa.n9t1b2q.1l1ip3e@ifi.uio.no>

On Thu, Aug 26, 2004 at 01:11:52AM +0100, Jamie Lokier wrote:
> Is this a problem if we treat entering a file-as-directory as crossing
> a mount point (i.e. like auto-mounting)?

Yes - mountpoints can't be e.g. unlinked.  Moreover, having directory
mounted on non-directory is also an interesting situation.

> Simply doing a path walk would lock the file and then cross the mount
> point to a directory.

*Ugh*

What would happen if you open that directory or chdir there?  If it's
"underlying file stays locked" - we are in even more obvious deadlocks.

> A way to ensure that preserves the lock order is to require that the
> metadata is in a different filesystem to its file (i.e. not crossing a
> bind mount to the same filesystem).
>
> That has the side effect of preventing hard links between metadata
> files and non-metadata, which in my opinion is fine.

We don't actually need a different fs - different vfsmount will do just fine.

> The strict order is ensured by preventing bind mounts which create a
> path cycle containing a file->metadata edge.  One way to ensure that
> is to prevent mounts on the metadata filesystems, but the rule doesn't
> have to be that strict.  This condition only needs to be checked in
> the mount() syscall.

You really don't want to lock mountpoint on path lookup, so I don't see
how that would be relevant - it's a hell to clean up, for one thing
(I've crossed ten mountpoints on the way, when do I unlock them and
how do I prevent deadlocks from that?)  Besides, different namespaces
can have completely different mount trees, so tracking down all that
stuff would be hell in its own right.

The main issue I see with all schemes in that direction (and something
like that could be made workable) is the semantics of unlink() on
mountpoints.  *Especially* with users being able to see attributes of
files they do not own (e.g. reiser4 mode/uid/gid stuff).  Ability to
pin down any damn file on the system and make it impossible to replace
is not something you want to give to any user.
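
The pinning problem follows directly from the existing rule that a mountpoint cannot be unlinked. A minimal sketch of that rule, with a toy dentry and a hypothetical function name:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct dentry {
	int d_mounted;   /* >0: something is mounted on this dentry */
};

/* Refuse to unlink a dentry that currently serves as a mountpoint. */
int may_unlink_sketch(const struct dentry *victim)
{
	assert(victim != NULL);
	if (victim->d_mounted > 0)
		return -EBUSY;
	return 0;
}
```

If merely looking inside a file's attribute directory makes the file a mountpoint, then any user able to do that lookup flips the answer from 0 to -EBUSY for a file they do not own.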


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040826031347.GQ21964@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 26 Aug 2004 06:14:00 GMT
Message-ID: <fa.nadbait.1qhgpjd@ifi.uio.no>

On Thu, Aug 26, 2004 at 02:00:49AM +0100, Jamie Lokier wrote:
> viro@parcelfarce.linux.theplanet.co.uk wrote:
> > > Is this a problem if we treat entering a file-as-directory as crossing
> > > a mount point (i.e. like auto-mounting)?
> >
> > Yes - mountpoints can't be e.g. unlinked.  Moreover, having directory
> > mounted on non-directory is also an interesting situation.
>
> Ok, so can we make it so mountpoints can be unlinked? :)

User-visible change of behaviour and IIRC a SuS violation on top of that.

> I think the underlying file does not stay locked, and once you've
> entered it as a directory, it can be unlinked.

So why lock it at all in that case?

> I didn't mean locking a chain of mountpoints, I meant the temporary
> state where two dentries and/or inodes are locked, parent and child,
> during a path walk.  However I'm not very familiar with that part of
> the VFS and I see that the current RCU dcache might not lock that much
> during a path walk.

Never had been needed on crossing mountpoints, actually.

> I agree, users shouldn't be able to pin down a file.
>
> I think unlink() should succeed on a file while something is visiting
> inside its metadata directory.

See above.  Again, the fundamental problem with that is allowing unlink
and friends on a mountpoint.  I would love to do that, but it always
generated -EBUSY on all Unices.  Linux got a bit more users and userland
code than Plan 9 - they can afford such changes, but...

And yes, from the kernel POV it's trivial to do - witness the MNT_DETACH
codepath in umount - it's much simpler than "normal" umount exactly because
it doesn't try to emulate old "it's busy, can't umount" behaviour.

With umount we could introduce "don't bother with that shit" flag.  With
unlink() we would have to make that default behaviour to be useful.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408251516390.17766@ppc970.osdl.org>
Date: Thu, 26 Aug 2004 06:03:11 GMT
Message-ID: <fa.itobn8a.14k6bjg@ifi.uio.no>

On Wed, 25 Aug 2004, Matt Mackall wrote:
> >
> > It's the UNIX way.
>
> I thought the UNIX way is "everything's a file", not "everything's a
> directory".

It really was. Directories were historically largely just files too,
although with the special "lookup" operation.

Historic unix didn't have readdir/rmdir/mkdir/rename or really much _any_
special directory handling. Directories were just files, and you read them
like files.

Of course, even in that early unix, "directories" were very much a
reality even apart from the fact that they happened to be implemented
pretty much like files. Nobody has ever claimed that the UNIX way is
"everything is _one_ file", after all ;)

> > Will it potentially break something? Sure. Do we care? Me, I'll take that
> > kind of extension _any_ day over xattrs, that are fundamentally flawed in
> > my opinion and totally useless.
>
> There's always the option that they're both broken.

Yes. Highly likely. However, something like that _does_ end up what a
Windows fileserver wants. IOW, even if it's broken, _something_ is likely
forced on us by that nasty thing we call "real users". Damn them.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408260919380.2304@ppc970.osdl.org>
Date: Thu, 26 Aug 2004 16:49:56 GMT
Message-ID: <fa.gtentmf.b22qht@ifi.uio.no>

On Thu, 26 Aug 2004, Christoph Hellwig wrote:
>
> > Are you saying that with reiser4, you can open a device or fifo with
> > O_DIRECTORY?
>
> That's what I thought, but as far as I can follow the code this is not
> actually true.

It should be possible to do, though. There's nothing really different in
making the "default" (unnamed) fork be a special device or a fifo.

And it would be perfectly ok for O_DIRECTORY to open such a file, as long
as it opens the directory branch, not the special device.

I advocated (long ago) something like this for /dev handling, just because
I think it would make sense to have

	/dev/hda	<- special file
	/dev/hda/part1	<- partition 1 (aka /dev/hda1)

which just seems like a very obvious and intuitive interface to me. Of
course, we have so much legacy in /dev that there's no real point to doing
this, but it's still an appealing approach, I think.

But I do take Al's concerns seriously. I like the notion of supporting
"containers", but there are undoubtedly serious issues in the notion. I
don't strictly know exactly _how_ to implement it sanely (I can talk about
using the vfsmnt structure all I like, but the fact is, it's a different
thing from a normal mount, and there may be serious problems indeed
there).

Still, I really do like the idea of merging the notion of file and
directory into one notion of "container". I absolutely _detest_ files with
internal structure that tools have to know about (ie I hate seeing all
those embedded formats that I can't use "grep" on - MIME being one case).
I'd much rather see a "group of files"  and a "file with a grouping of
information".

(Now, flattening that "group of files" is obviously needed for serial
protocols, so I think MIME/tar/xxxx are fine for _transporting_ data, but
I'm saying that outside of transport I really prefer a "collection of
files" approach).

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040825205957.GJ21964@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 26 Aug 2004 06:03:20 GMT
Message-ID: <fa.n8d5bj2.1mhmoj4@ifi.uio.no>

On Wed, Aug 25, 2004 at 01:24:36PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 25 Aug 2004, Christoph Hellwig wrote:
> >
> > Over the last at least five years we've taken as much as possible
> > semantics out of the filesystems and into the VFS layer, thus having
> > a separation between the semantical layer (VFS) and the low level
> > filesystem.  Your attributes are absolutely a VFS thing and as such
> > should not happen at the filesystem layer, and no, that doesn't mean
> > they're bad per se, I just think they are a rather bad fit for Linux.
>
> Now this I agree with, in the sense that I think that if we want to
> support this, it should be supported at a VFS layer.

ACK.  However, I'm still not seeing *ANYTHING* that would look like a workable
scheme in presence of hardlinks.  Show me how to make that deadlock- and
race-free and we might very well do it in VFS.

_That_ is what's missing and it's needed no matter where it's implemented.
You want hybrid objects - you want to solve that one.  So far I've seen
nothing workable.


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040829183629.GP21964@parcelfarce.linux.theplanet.co.uk>
Date: Sun, 29 Aug 2004 18:37:57 GMT
Message-ID: <fa.n9dpb33.1phqp37@ifi.uio.no>

On Sun, Aug 29, 2004 at 11:28:42AM -0700, Hans Reiser wrote:
> just use a view, and skip the options on the system calls.  if you cd to
> /nometas/your_home_directory_path you don't see the metafiles.  Why is a
> view better than a syscall flag?  Because it lets the user choose what
> he wants without recompiling to do it.  This kind of a view requires no
> coding because you can just mount the root filesystem two ways, one with
> the -nopseudos mount option, and one without it.

*What*?

OK, now I want detailed explanation of the reasons why that doesn't create
cache coherency problems.

Do you have an analysis of locking in the entire thing?


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040829185744.GQ21964@parcelfarce.linux.theplanet.co.uk>
Date: Sun, 29 Aug 2004 18:59:07 GMT
Message-ID: <fa.natfbj4.1q10oj6@ifi.uio.no>

On Sun, Aug 29, 2004 at 07:36:29PM +0100, viro@parcelfarce.linux.theplanet.co.uk wrote:
> > he wants without recompiling to do it.  This kind of a view requires no
> > coding because you can just mount the root filesystem two ways, one with
> > the -nopseudos mount option, and one without it.
>
> *What*?
>
> OK, now I want detailed explanation of the reasons why that doesn't create
> cache coherency problems.
>
> Do you have an analysis of locking in the entire thing?

And I am very, very serious about that - we are talking about very nasty
minefield and design choices in that area have fundamental impact on the
entire layer, wherever it is located.

It's *NOT* something that you can leave until later and hope it somehow
falls into place - it can be merged in steps, but you MUST know the goal
on that level.  To rearchitect later might be possible (even though you
will have a hell of a time avoiding plugins breakage), but it will be *hard*.


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040829212700.GA16297@parcelfarce.linux.theplanet.co.uk>
Date: Sun, 29 Aug 2004 21:32:15 GMT
Message-ID: <fa.n2stc2t.1m1qojd@ifi.uio.no>

On Sun, Aug 29, 2004 at 01:06:41PM -0700, Hans Reiser wrote:

> How about if you educate me on the problems you see for a bit before I
> respond? I think it might help us move into a constructive discussion.

A slew of cache coherency issues (and memory consumption on top of that).
Right now we have a single dentry tree for all fs instances, no matter
how many times it and its subtrees are visible in the tree.  Which,
obviously, avoids all that crap.  As soon as we start trying to have
multiple trees over the same fs, we are in for a *lot* of fun.

The basic question is how do you propagate the changes from one tree
to another and what do you do when change to "invisible" entry happens
via a tree where it is visible.  Unfortunately, all obvious solutions
either hit the pathname resolution and hit it *hard* (both in scalability
and in price of uncontended case) or create a mess for stuff like
NFS silly-rename semantics, VFAT aliases, etc.

Saying that some unspecified data structures might work doesn't help,
obviously - any variant will have tradeoffs of its own and has to be
discussed individually.  fsdevel is there for exactly that sort of
stuff, so if you have any specific proposals you want to discuss, you
are more than welcome.

Again, right now all cache coherency issues are sidestepped by having a single
instance of dentry tree per superblock.  Each mount/binding is represented by
vfsmount and _refers_ to a (sub)tree of dentry tree of that fs - dentries
and inodes are the same (ditto for private data structures owned by filesystem,
obviously).

So we have a forest of dentry trees (one per filesystem) and forest of
vfsmount trees (one tree per namespace).  Each vfsmount corresponds to
a chunk of namespace - think of the full user-visible tree cut into
pieces by mountpoints.  For every piece we know
	a) what fs it's from (->mnt_sb)
	b) what subtree of that fs it corresponds to (->mnt_root)
	c) what piece it's attached to (->mnt_parent)
	d) where in that piece it's attached (->mnt_mountpoint)
	e) where to find its siblings and children
	f) which namespace (== tree in vfsmount forest) it's from

Point in a namespace is determined by pair (vfsmount, dentry) - which
piece we are looking at and where in that piece we are.  Pathname lookups
operate on such pairs - as long as we do not cross into another piece
we just step from dentry to dentry, when we cross mountpoint towards root,
we flip to (mnt_parent, mnt_mountpoint) first, when we cross mountpoint
into mounted filesystem, we do hash lookup (lookup_mnt()) by our pair,
find vfsmount with such mnt_parent and mnt_mountpoint and step into
(vfsmount, vfsmount->mnt_root).
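
A minimal user-space sketch of the "towards root" half of that walk, assuming toy
structures that carry only the fields named above. The names path_pos and
step_up are hypothetical, the hash lookup done by lookup_mnt() in the other
direction is not modeled, and none of this is the kernel's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; only the fields the description above relies on. */
struct dentry {
    struct dentry *d_parent;         /* root of a filesystem points to itself */
};

struct vfsmount {
    struct vfsmount *mnt_parent;     /* piece we are attached to; topmost points to itself */
    struct dentry   *mnt_mountpoint; /* where in the parent piece we are attached */
    struct dentry   *mnt_root;       /* root of the subtree this piece shows */
};

/* A point in the namespace: which piece, and where in that piece. */
struct path_pos {
    struct vfsmount *mnt;
    struct dentry   *dentry;
};

/* One ".." step (hypothetical helper). */
static struct path_pos step_up(struct path_pos p)
{
    assert(p.mnt != NULL && p.dentry != NULL);

    /* At the root of a piece: flip to (mnt_parent, mnt_mountpoint) first. */
    while (p.dentry == p.mnt->mnt_root && p.mnt->mnt_parent != p.mnt) {
        p.dentry = p.mnt->mnt_mountpoint;
        p.mnt = p.mnt->mnt_parent;
    }

    /* Inside a piece we just step from dentry to dentry. */
    if (p.dentry != p.mnt->mnt_root)
        p.dentry = p.dentry->d_parent;

    return p;
}
```

Because the comparison is against mnt_root rather than the filesystem root, the
same step also behaves for a binding whose mnt_root is a subdirectory.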

Lifetime of vfsmounts is controlled by a simple refcount.  Having vfsmount
attached to tree contributes +1, so does having it pinned down by lookup/
chdir/chroot/opened file.  When the last vfsmount goes away, filesystem is
shut down (basically, that's what happens on umount).  Binding simply creates
a vfsmount pointing to target and attaches it at source.

Cloning a namespace copies the vfsmount tree of parent process' namespace
and flips vfsmounts of our root and cwd into the corresponding nodes in that
copy.  Normally, fork()/clone() just increments namespace refcount instead
of cloning it, so all children share namespace with parent.

When all processes in a namespace are gone its refcount goes to zero.  At
that point we dissolve all attachments in its vfsmount tree (which drops
a reference to each vfsmount and can in turn lead to fs shutdowns).

Lazy umount does similar operation with specified subtree - it detaches
all pieces mounted anywhere in that subtree and drops reference to each.
Ones that are not busy will be shut down immediately, ones that are will
go away as soon as they stop being busy.
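
The refcount rules above, including the lazy umount case, can be modeled in a
few lines. This is only a sketch under simplifying assumptions: toy_sb, toy_mnt
and the toy_* helpers are hypothetical names, there is no locking, and "shut
down" is just a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_sb {
    int  nr_mounts;   /* vfsmounts still referring to this filesystem */
    bool shut_down;   /* set when the last vfsmount goes away */
};

struct toy_mnt {
    int            count;     /* +1 while attached, +1 per pin (cwd, open file, ...) */
    bool           attached;  /* still part of some vfsmount tree */
    struct toy_sb *sb;
};

/* Pin the vfsmount, e.g. for an opened file. */
static void toy_mntget(struct toy_mnt *m)
{
    assert(m != NULL && m->count > 0);
    m->count++;
}

/* Drop one reference; the last one takes the filesystem down with it
 * if no other vfsmount refers to it. */
static void toy_mntput(struct toy_mnt *m)
{
    assert(m != NULL && m->count > 0);
    if (--m->count == 0 && --m->sb->nr_mounts == 0)
        m->sb->shut_down = true;
}

/* Lazy umount: detach now, let remaining pins keep it alive. */
static void toy_lazy_umount(struct toy_mnt *m)
{
    assert(m != NULL && m->attached);
    m->attached = false;
    toy_mntput(m);    /* the reference held by being attached */
}
```

A busy vfsmount survives toy_lazy_umount() with its remaining pins, and the
filesystem is shut down only when the last pin is dropped.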

vfsmounts can have flags of their own - one of the pending projects is to
allow individual vfsmounts to be read-only.  Noexec, nosuid and nodev are
already per-mountpoint.

Tree of vfsmounts is protected by
	a) vfsmount spinlock (protects hash lookups, mostly)
	b) per-namespace semaphore (that one is obviously blocking)
	c) in cases when we can go from 0 to 1 vfsmounts with given dentry
as a mountpoint, we also hold ->d_inode->i_sem on the mountpoint-to-be
(that closes races between rmdir/mount, rename/mount, etc.)

Dentry tree is messier - these days we have dcache_lock and RCU stuff.
If you can help getting the documentation out of these guys, you've got
a *lot* of thanks from a lot of people.  As it is, it's RTFS country.
I can answer specific questions there, but I won't even try to produce
readable manual covering the entire thing.

Directory operations have exclusion based on ->i_sem - see
Documentation/filesystems/directory-locking for description and proof
of correctness.

For operations that can destroy an object (unlink/rmdir/overwriting rename)
we try to unhash the victim dentry first, so the filesystems that can't
handle unlinked-but-busy stuff can detect such attempt by seeing the
victim still hashed _and_ can be sure that if it's not hashed, nobody will
come and see it (lookup would have to go to filesystem code, since we
have the sucker unhashed and we have exclusion there).  If operation is
unsuccessful, we rehash the victim.

Files involved:
	fs/dcache.c
	fs/super.c
	fs/inode.c
	fs/namespace.c
	include/linux/dcache.h
	include/linux/fs.h
	include/linux/mount.h

If you need more details on any specific area (or gaps in the above) - just
ask.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408291431070.2295@ppc970.osdl.org>
Date: Sun, 29 Aug 2004 21:54:20 GMT
Message-ID: <fa.gsea0eb.a24o9r@ifi.uio.no>

On Sun, 29 Aug 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> A slew of cache coherency issues (and memory consumption on top of that).
> Right now we have a single dentry tree for all fs instances, no matter
> how many times it and its subtrees are visible in the tree.  Which,
> obviously, avoids all that crap.  As soon as we start trying to have
> multiple trees over the same fs, we are in for a *lot* of fun.

Al, I think you should make the argument a bit more specific, because I
doubt a lot of people understand just what the problems are with aliased
names. Just a few examples of the problems involved will illuminate things
very well, I think. People who haven't been intimate with the name caches
probably simply don't understand why you worry so much.

I'll start out with some trivial examples, just to let Hans and others get
an idea about what the issues are. I think examples of problems are often
better ways to explain them than the abstract issues themselves.

Aliases: let's say that you have filename "a" hard-linked to filename "b",
and you have a directory structure of streams under there. So you have

	a/file1
	a/dir1/file2
	a/dir2/file3

and (through the hard-link with "b") you have aliases of all these same
names available as "b/file1", "b/dir1/file2" etc.

Now, imagine that you have two processes doing

	mv a/dir1 a/dir2/newdir

and

	mv b/dir2 b/dir1/newdir

at the same time. Both of them MUST NOT SUCCEED, for pretty obvious
reasons (you'd have moved two directories within each other, and now
neither would be accessible any more).

How do you handle locking for this situation?
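
To make the failure concrete, here is a toy model with bare parent pointers (the
names node, is_ancestor, reaches_root and move_checked are hypothetical, and
this is not how the VFS represents anything). Each move is guarded by an
ancestor check, which is enough when the moves are serialized. If both checks
run before either assignment, which is what two unsynchronized views of the same
tree allow, both pass and the two directories end up inside each other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
    struct node *parent;   /* NULL for the root */
};

/* True if 'a' is 'b' itself or one of its ancestors.  The walk is bounded
 * so it still terminates on a tree that has already been corrupted into a loop. */
static bool is_ancestor(const struct node *a, const struct node *b)
{
    for (int steps = 0; b != NULL && steps < 1024; steps++, b = b->parent)
        if (a == b)
            return true;
    return false;
}

/* True if walking up from 'n' ever arrives at 'root'. */
static bool reaches_root(const struct node *n, const struct node *root)
{
    return is_ancestor(root, n);
}

/* Check-then-move.  Only safe if nothing else changes the tree between
 * the check and the assignment. */
static bool move_checked(struct node *victim, struct node *new_parent)
{
    assert(victim != NULL && new_parent != NULL);
    if (is_ancestor(victim, new_parent))
        return false;      /* would move a directory into itself */
    victim->parent = new_parent;
    return true;
}
```

Serialized, the second move_checked() call is refused. Interleaved as
check, check, move, move, neither dir1 nor dir2 reaches the root any more.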

Another interesting case is what happens when you have looked up and cached
the filename "a/file1" and then another process does "rm b/file1". How do
you update the _other_ cached copy, since they had two different names,
but _both_ names went away at the same time. Also again, how do you handle
locking?

The general VFS layer has a lot of rules, and avoids these problems by
simply never having aliases between two directories. If the same directory
shows up multiple times (which can happen with bind mounts), they have the
exact same dentry for the directory, it's just found through two different
vfsmount instances. That's why vfsmounts exist - they allow the same name
cache entry to show up in different places at the same time.

So when we do a bind mount, and the same directory shows up under two
different names "a" and "b", and we do a "rm b/file1", it _automatically_
disappears from "a/file1" too, simply by virtue of "a" and "b" literally
being the same dentry. No aliasing ever happens, and this makes coherency
and locking much easier (which is not to say that they are trivial, but
they are pretty damn clear in comparison to the alternatives).

What Al (and others) worries about is that the reiser4 name handling has
_none_ of these issues figured out and protected against. You can protect
against them by taking very heavy locks (you can trivially protect against
all races by taking one large lock around any operation), but the fact is,
that is just not an option for high-performance name lookup.

These aliasing/locking rules need to be global and well-thought-out. Not
just fix the two examples above, but be shown to be safe _in_general_.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408291641070.2295@ppc970.osdl.org>
Date: Sun, 29 Aug 2004 23:59:49 GMT
Message-ID: <fa.gsee0mb.a20phr@ifi.uio.no>

On Sun, 29 Aug 2004, Trond Myklebust wrote:
>
> So could you explain what is stopping us from reducing the whole problem
> to the bind mount problem? IOW have "a/" be a directory that acts as if
> it is dynamically bind mounted on top of the file "a".

Hey, I suggested people do exactly that. See the other thread between me
and Al Viro on exact implementation issues - even some code snippets ;)

The problem really ends up being directories with attributes (where we
can't just overmount the existing directory). That's where "openat()"
helps us.

(The other problem is the purely _practical_ problem of reiser4 going
behind the VFS layer, and thus _not_ getting the aliasing and locking
right, but that's an implementation issue in my book, and nothing really
fundamental).

> Is it just the fantasy of supporting hard-links across "stream
> boundaries" (as in "touch a b; ln b a/b; ln a b/a")? I'm pretty sure
> nobody wants to have to add cyclic graph detection to their filesystems
> anyway. 8-)

It's easy enough to do the graph detection at the VFS layer, exactly
because of the density of the dentry graph.

(Of course, nfs-exporting a filesystem breaks that density, thanks to the
"lookup by fh" stuff, so we might not allow it for an NFS client).

I suspect most filesystems wouldn't allow hard-links across "stream
boundaries" in the first place. I _suspect_ that most stream
implementations would bind the attributes more tightly than that to the
file that owns them. reiser4 might be the only one that might ever support
something like that.

> What other issues would need to be addressed?

 - Whether we want it in the first place
 - whether we need the "separate namespace" thing that O_XATTR and
   openat() brings us (they certainly solve the problem, but maybe it is
   solvable another way too)
 - how to actually test this out in practice (ie getting reiser4 to do the
   proper thing wrt the VFS layer, but preferably _also_ having another
   filesystem like NFSv4 or cifs that actually uses this and shows what
   the problems are).
 - whether it makes any sense at all unless we also make at least a few
   other filesystems support it, so that people start using it as an
   "expected feature" rather than a "works only on a couple of machines".

And probably others.

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408291919450.2295@ppc970.osdl.org>
Date: Mon, 30 Aug 2004 02:35:34 GMT
Message-ID: <fa.guefvuj.82qopj@ifi.uio.no>

On Sun, 29 Aug 2004, Trond Myklebust wrote:
>
> Well, yes there has to be a distinction between a true bind mount which
> actually covers the file or directory, and something like the stream
> "bind mount" which doesn't.
>
> The stream "bind mount" is just there to allow you to root the
> attributes in a single tree. It can be made functionally entirely
> equivalent to the openat(), but uses pathname semantics (e.g., "//") to
> denote the attribute fork instead of an extra function call.

Using '//' would be nice, but would break real apps. If I remember
correctly, POSIX specifies that '//' can be special at the _beginning_ of
a path, but in the middle, it has to act like a single '/'.

And that's not just theory - it's quite common for programs to just
concatenate a directory name (which may or may not end with a slash) with
another path-name that starts with a slash. So you _will_ see existing
scripts and programs using things like "/usr/include//sys/type.h", and
they'd break if "//" would switch from "regular namespace" to "attribute
namespace".
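
A small sketch of how such a path arises and how it is expected to resolve
today. The helpers join_naive and collapse_slashes are hypothetical; the
collapsing rule follows the POSIX wording that runs of slashes act as one, with
exactly two leading slashes left alone because their meaning is
implementation-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* What many programs do: glue a directory and a name together as-is. */
static int join_naive(const char *dir, const char *name, char *out, size_t outsz)
{
    assert(dir != NULL && name != NULL && out != NULL);
    if (strlen(dir) + strlen(name) + 1 > outsz)
        return -1;
    snprintf(out, outsz, "%s%s", dir, name);
    return 0;
}

/* Collapse runs of '/' into one, keeping exactly two leading slashes. */
static int collapse_slashes(const char *in, char *out, size_t outsz)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL && outsz > 0);

    if (in[0] == '/' && in[1] == '/' && in[2] != '/') {
        if (outsz < 3)
            return -1;
        out[o++] = '/';
        out[o++] = '/';
        i = 2;
    }
    for (; in[i] != '\0'; i++) {
        if (in[i] == '/' && o > 0 && out[o - 1] == '/')
            continue;              /* skip the repeated slash */
        if (o + 1 >= outsz)
            return -1;             /* no room for this byte plus the NUL */
        out[o++] = in[i];
    }
    out[o] = '\0';
    return 0;
}
```

Joining "/usr/include/" with "/sys/type.h" yields the doubled slash from the
example above, and existing programs rely on it meaning the same file as the
collapsed form.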

So I don't see any way to extend pathname semantics to distinguish between
"directory contents" and "directory attribute stream".

> > It's easy enough to do the graph detection at the VFS layer, exactly
> > because of the density of the dentry graph.
>
> Don't you end up having to lock the entire paths b/c/d and a/e/f in
> order to prevent "ln a b/c/d/a; ln b a/e/f/b"?

That's not the problem - since it's in memory, we can just get the dcache
lock, and do it locked for at least local filesystems.

However, being prodded by Andries, I think I'm wrong _anyway_. Since the
dcache is only "dense" down one path to the root, and doesn't contain all
the alternate ways of getting to a particular directory, I came to the
conclusion that the VFS layer can't actually do cyclic detection after
all...

So together with the fact that nobody really _wants_ hardlinks to
directories, I think the right answer is "no". It's not a problem as long
as the attributes streams are always tied to the file/directory they are
attributes of - then the "directory link" is really just a file link, and
can't cause any cycles.

> >  - how to actually test this out in practice (ie getting reiser4 to do the
> >    proper thing wrt the VFS layer, but preferably _also_ having another
> >    filesystem like NFSv4 or cifs that actually uses this and shows what
> >    the problems are).
>
> As I said, NFSv4 can be made ready pretty quickly: Bruce is already
> finishing up the xattr implementation.

Do we have any servers that implement it? I think NFSv4 might be a good
test-case if so.

> >  - whether it makes any sense at all unless we also make at least a few
> >    other filesystems support it, so that people start using it as an
> >    "expected feature" rather than a "works only on a couple of machines".
>
> NTFS? ;-)

Hey, I see the smiley, but I'd still like to point out that not many
people use it under Linux, and while I think writing to it might be stable
these days, I don't believe named streams are necessarily going to
materialize all that quickly..

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040830030157.GE16297@parcelfarce.linux.theplanet.co.uk>
Date: Mon, 30 Aug 2004 03:03:21 GMT
Message-ID: <fa.n6dfbqe.1nhgob2@ifi.uio.no>

On Sun, Aug 29, 2004 at 07:31:49PM -0700, Linus Torvalds wrote:
> And that's not just theory - it's quite common for programs to just
> concatenate a directory name (which may or may not end with a slash) with
> another path-name that starts with a slash. So you _will_ see existing
> scripts and programs using things like "/usr/include//sys/type.h", and
> they'd break if "//" would switch from "regular namespace" to "attribute
> namespace".
>
> So I don't see any way to extend pathname semantics to distinguish between
> "directory contents" and "directory attribute stream".

I do, actually.  There might be a way and it is kinda-sorta similar to
your openat() variant, but lives in normal namespace.  No, I'm not too
fond of that, but since we are discussing weird variants anyway...

	a) associated directory tree of object is not automounted on top
of it.  Instead of that, we always do detached vfsmount (and do it on demand -
see below)
	b) we have a bunch of pseudo-symlinks in /proc/<pid>/fd/ - same
kind as what we already have there, but instead of (file->f_vfsmount,
file->f_dentry) they lead to associated vfsmount (allocated if needed).
Once we get a reference to such guy, we can
	* do further lookups
	* chdir there and poke around
	* hell, we can even bind it someplace (that will require slight change
in attach_mnt() logics, but it's not hard) and get it permanently mounted

Since it's not attached anywhere, normal GC logics works just fine.  And
yes, they are usable from scripts, etc. -
	exec 42<foo/bar/baz
	cat /proc/self/fd/#42/whatever/crap/you/want
and enjoy.

> However, being prodded by Andries, I think I'm wrong _anyway_. Since the
> dcache is only "dense" down one path to the root, and doesn't contain all
> the alternate ways of getting to a particular directory, I came to the
> conclusion that the VFS layer can't actually do cyclic detection after
> all...

<blinks>
<rereads a bunch of earlier postings>

Oh, _that_'s what you meant...  No, we definitely have no chance in hell
to catch loops, dcache or not.  It costs too much - we need to examine
a *lot* of nodes to do that and all of them would have to be read at some
point.  We either need entire fs tree locked in core or we might have to
reread it on every rename().  The former will kill us on memory use, the
latter - on amount of IO.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408292042380.2295@ppc970.osdl.org>
Date: Mon, 30 Aug 2004 03:58:27 GMT
Message-ID: <fa.gue40mc.a2ipho@ifi.uio.no>

On Mon, 30 Aug 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
> >
> > So I don't see any way to extend pathname semantics to distinguish between
> > "directory contents" and "directory attribute stream".
>
> I do, actually.  There might be a way and it is kinda-sorta similar to
> your openat() variant, but lives in normal namespace.  No, I'm not too
> fond of that, but since we are discussing weird variants anyway...
>
> 	a) associated directory tree of object is not automounted on top
> of it.  Instead of that, we always do detached vfsmount (and do it on demand -
> see below)
> 	b) we have a bunch of pseudo-symlinks in /proc/<pid>/fd/ - same
> kind as what we already have there, but instead of (file->f_vfsmount,
> file->f_dentry) they lead to associated vfsmount (allocated if needed).
> Once we get a reference to such guy, we can
> 	* do further lookups
> 	* chdir there and poke around
> 	* hell, we can even bind it someplace (that will require slight change
> in attach_mnt() logics, but it's not hard) and get it permanently mounted

Well, the above _is_ the same as "openat()", really. It's just using a
filesystem starting point to emulate a new system call. Same thing
conceptually. You could do pretty much any system call as a filesystem
action if you wanted to ;)

I don't disagree with doing so - as a way to expose the new system call to
scripts. But I don't think you're being entirely intellectually honest if
you think this suddenly makes it be "one namespace". It's still a
secondary namespace rooted in an entry in the normal ones - exactly like
"openat()".

For a non-script, a native "openat()" interface would be more efficient
and less confusing, and conceptually no different from yours.  No reason
we couldn't have both, since they are 100% equivalent and would share the
same code anyway...

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040830044637.GF16297@parcelfarce.linux.theplanet.co.uk>
Date: Mon, 30 Aug 2004 04:50:33 GMT
Message-ID: <fa.n5thcqj.1l1upb5@ifi.uio.no>

On Sun, Aug 29, 2004 at 08:55:02PM -0700, Linus Torvalds wrote:
> Well, the above _is_ the same as "openat()", really. It's just using a
> filesystem starting point to emulate a new system call. Same thing
> conceptually. You could do pretty much any system call as a filesystem
> action if you wanted to ;)

Umm...  Yes and no - open() is not the only thing you can do there.  You
can emulate some of the other stuff with open() + fchdir() + syscall, but...
Note that it also gives you access to other tasks' files (provided that
tasks are yours, so you can get to their /proc/.../fd).

> I don't disagree with doing so - as a way to expose the new system call to
> scripts. But I don't think you're being entirely intellectually honest if
> you think this suddenly makes it be "one namespace". It's still a
> secondary namespace rooted in an entry in the normal ones - exactly like
> "openat()".

Well - it *does* expose these objects to all normal syscalls.  E.g. you
can unlink() a component in there without
	fd2 = open(".", ...);
	fd3 = openat(fd, ".", ....);
	fchdir(fd3);
	unlink("blah");
	fchdir(fd2);
and similar horrors (now have fun adding locking for multithreaded process,
etc.).

In that sense it is, indeed, the same namespace.

> For a non-script, a native "openat()" interface would be more efficient
> and less confusing, and conceptually no different from yours.  No reason
> we couldn't have both, since they are 100% equivalent and would share the
> same code anyway...

I'm not sure that openat() is the right interface for e.g. fileservers.
And no, I'm not saying that above is suitable for them - IMO neither variant
is what they need.  The thing is, fileserver almost certainly wants to
create and remove objects.  And you either end up with new syscalls for
doing *that* relative to opened fd or you do tons of fchdir(), which
makes benefits of openat() dubious in the best case.

Arguments about O_NOFOLLOW on the intermediate stages are bullshit, IMNSHO -
if they want to make some parts of tree inaccessible, they should simply
mkdir /tmp/FOAD; chmod 0 /tmp/FOAD; mount --bind /tmp/FOAD <blocked path>
in the namespace their daemon is running in.  And forget all that crap
about filtering pathnames and blocking symlinks on intermediate stages
(the latter is obviously worthless without the former since one can simply
substitute the symlink body in the pathname).


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040830063331.GI16297@parcelfarce.linux.theplanet.co.uk>
Date: Mon, 30 Aug 2004 06:36:55 GMT
Message-ID: <fa.n7d9cig.1ihmoj0@ifi.uio.no>

On Mon, Aug 30, 2004 at 05:46:37AM +0100, viro@parcelfarce.linux.theplanet.co.uk wrote:
> Arguments about O_NOFOLLOW on the intermediate stages are bullshit, IMNSHO -
> if they want to make some parts of tree inaccessible, they should simply
> mkdir /tmp/FOAD; chmod 0 /tmp/FOAD; mount --bind /tmp/FOAD <blocked path>
> in the namespace their daemon is running in.  And forget all that crap
> about filtering pathnames and blocking symlinks on intermediate stages
> (the latter is obviously worthless without the former since one can simply
> substitute the symlink body in the pathname).

Ehh...  After looking at that for a while...  No, it's not that simple
and removing the stuff that way won't do what these guys want, at least
not without something else.  Frankly, what I've seen worries me a lot -
it looks like there is a missing primitive here that would be saner
than this sort of filtering.

It appears that most of this stuff would be covered by a fast way to tell
if the resulting object belongs to given subtree.  That could be arranged
(not without some changes, but doable), but I'm not sure that it's enough
to cover the stuff they are really trying to do.  It does look like an
interesting problem and current solutions certainly suck.  And I very
much doubt that "do a lookup if it doesn't run into anything that could
be too tricky for our pathname-based checks, otherwise let's do it step-by-step
from userland" is the right approach here.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408301052040.2295@ppc970.osdl.org>
Date: Mon, 30 Aug 2004 18:30:32 GMT
Message-ID: <fa.gsds163.a2aphh@ifi.uio.no>

On Mon, 30 Aug 2004, Paul Stewart wrote:
>
> Here's another take on the same theme.  To see attrs on files, one can
> either use a newly developed application which can use special new
> syscalls/flags on syscalls as Paul Jackson recommends.  However from an
> old shell or application one can also open the attribute node on
> /home/myself/foo.txt by checking out /attr/home/myself/foo.txt/, which
> points to the "as directory" node on the filesystem that foo.txt points
> to.

This is the same idea as Al Viro's /proc/self/fd/#42/attr issue, except
yours has two fundamental problems: races and ambiguities.

If you open a filename in some "secondary" tree (be it /proc or //attr or
whatever) based on the filename in the primary one, you have two issues
that you need to work out:

 - how do you handle a name change in the primary tree at the same time as
   lookup
 - how do you handle the ambiguity of
	//attr/usr/bin/emacs/icon
   (is that the "icon" attribute on "/usr/bin/emacs", or is it perhaps the
   "emacs/icon" attribute on "/usr/bin").

The ambiguity can be handled by saying that attributes only have one
component (ie only the _last_ component of a lookup is the attribute
name). But the race between primary tree and secondary tree cannot be
handled in a normal name-space.
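
The "last component only" rule amounts to splitting at the final slash. A
minimal sketch, assuming the secondary-tree prefix has already been stripped;
split_attr is a hypothetical helper, not an existing interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split "object/attr" at the last '/'.  Returns a pointer to the attribute
 * name inside 'path' and stores the length of the object part, or returns
 * NULL if there is no slash or nothing after it. */
static const char *split_attr(const char *path, size_t *object_len)
{
    const char *slash;

    assert(path != NULL && object_len != NULL);

    slash = strrchr(path, '/');
    if (slash == NULL || slash[1] == '\0')
        return NULL;

    *object_len = (size_t)(slash - path);
    return slash + 1;
}
```

For "/usr/bin/emacs/icon" this picks the "icon" attribute of "/usr/bin/emacs"
and rules out the "emacs/icon" reading. It does nothing about the rename race.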

What Al did was to avoid both by "fixing" the attribute lookup point with
another open - _exactly_ the same way "openat()" handles it. So Al's
naming convention avoids both the ambiguity and the primary tree name
change races by first opening the primary tree file, and then explicitly
using that file as the "anchor" in the secondary tree. He did it in /proc,
where we obviously already do export an open fd as an anchor-point.
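
The anchoring idea can be sketched with openat() as it was later standardized in
POSIX.1-2008. This shows only a plain relative lookup from an already-open
directory; the attribute-namespace flag discussed in this thread is not used,
and read_via_anchor is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to len-1 bytes of 'name', resolved relative to the already-open
 * directory 'anchor_fd' rather than to any pathname.  Returns the number of
 * bytes read, or -1 on failure. */
static ssize_t read_via_anchor(int anchor_fd, const char *name,
                               char *buf, size_t len)
{
    int fd;
    ssize_t n;

    assert(name != NULL && buf != NULL && len > 0);

    fd = openat(anchor_fd, name, O_RDONLY);
    if (fd < 0)
        return -1;

    n = read(fd, buf, len - 1);
    close(fd);
    if (n >= 0)
        buf[n] = '\0';
    return n;
}
```

Once the anchor is open, renaming the directory in the primary tree does not
change what the lookup resolves against, which is the property a pathname in a
secondary tree cannot offer.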

> The strange part of this idea is that the /attr filesystem wouldn't be
> conventionally browsable.

That may make it non-intuitive to use, but that's not the real problem.
See above.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408271902410.14196@ppc970.osdl.org>
Date: Sat, 28 Aug 2004 11:07:22 GMT
Message-ID: <fa.iu7tlgb.1748abp@ifi.uio.no>

On Fri, 27 Aug 2004, Rik van Riel wrote:
>
> Thing is, there is no way to distinguish between what are
> virtual files and what are actual streams hidden inside a
> file.  You don't know what should and shouldn't be backed
> up...

I think that lack of distinguishing power is more serious for
directories. The more I think about it, the more I wonder whether
Solaris did things right - having a special operation to "cross the
boundary".

I suspect Solaris did it that way because it's a hell of a lot easier to
do it like that, but regardless, it would solve the issue of real
directories having both real children _and_ the "extra streams".

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408281038510.2295@ppc970.osdl.org>
Date: Sat, 28 Aug 2004 18:14:04 GMT
Message-ID: <fa.gutm0eh.8i0o9j@ifi.uio.no>

On Sat, 28 Aug 2004, Helge Hafting wrote:
>
> > I think that lack of distinguishing power is more serious for
> > directories. The more I think about it, the more I wonder whether
> > Solaris did things right - having a special operation to "cross the
> > boundary".
> >
> > I suspect Solaris did it that way because it's a hell of a lot easier to
> > do it like that, but regardless, it would solve the issue of real
> > directories having both real children _and_ the "extra streams".
>
> There are many ways of doing this. Several extra streams to a directory
> that aren't ordinary files in the directory?

Well.. Yes. We already have "." and "..", which are "special extra
streams" in a sense. However, people expect them, and know to ignore them.
The same wouldn't be true of new naming.

> It seems to me that we can get a lot of nice functionality in a simpler way:
> Instead of thinking about a number of streams attached to something
> that is either an ordinary file or directory, just say that the only
> change will be that a directory may have a _single_ file stream in
> addition to being a plain directory.

That doesn't really help us. What would the name be, and how could you
avoid clashes?

> If the VFS is to be extended in order to support file-as-directory (or
> vice versa) then hopefully it can be done in a simple way.

I'm pretty confident that we can extend the VFS layer to support named
streams (see the technical discussion with Al, rather than the flames in
this thread). I also clearly believe that it is worth it, but I'm starting
to wonder if we should have a special open flag to make people select the
stream.

If you look at the Solaris interface, the _nice_ part about "openat()" is
that you can do something like

	file = open(filename, O_RDONLY);
	if (file < 0)
		return -ENOENT;
	icon = openat(file, "icon", O_RDONLY | O_XATTR);
	if (icon < 0)
		icon = default_icon_file;
	..

and it will work regardless of whether "filename" is a directory or a
regular file, if I've understood correctly.

Now, I think that makes sense for several reasons:
 - single case
 - race-free (think "stat()" vs "fstat()" races).
 - I think we want to do "openat()" regardless of whether we ever
   support extended attributes or not ("openat()" is nice for doing
   "namei()" in user space even in the absence of any attributes or
   named streams).

So what we can do is
 - implement openat() regardless, and expect to do the Solaris thing for
   it if we ever do streams.
 - _also_ support the "implied named attributes" for regular files, so
   that you don't have to use "openat()" to access them.

Comments? Does anybody hate "openat()" for any reason (regardless of
attributes)? We can easily support it, we'd just need to pass in the file
to use as part of the "nameidata" thing or add an argument (it would also
possibly be cleaner if we made "fs->pwd" be a "struct file").

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040828182954.GJ21964@parcelfarce.linux.theplanet.co.uk>
Date: Sat, 28 Aug 2004 18:33:05 GMT
Message-ID: <fa.n7tfar5.1n10pb9@ifi.uio.no>

On Sat, Aug 28, 2004 at 11:09:38AM -0700, Linus Torvalds wrote:
> Comments? Does anybody hate "openat()" for any reason (regardless of
> attributes)? We can easily support it, we'd just need to pass in the file
> to use as part of the "nameidata" thing or add an argument (it would also
> possibly be cleaner if we made "fs->pwd" be a "struct file").

What would your openat() produce?  Normal struct file?  Then what's going
to be its vfsmount/dentry and what will they be attached to?


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408281132480.2295@ppc970.osdl.org>
Date: Sat, 28 Aug 2004 18:47:56 GMT
Message-ID: <fa.gue60eb.82go9p@ifi.uio.no>

On Sat, 28 Aug 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> What would your openat() produce?  Normal struct file?  Then what's going
> to be its vfsmount/dentry and what will they be attached to?

Normal file descriptor, exactly like "open()".

And it's going to have all the same vfsmount/dentry thing it would have if
you looked it up the whole way. I don't understand your question..

Ignore the O_XATTR thing for a while, and assume it's just a convenient
combination of "fchdir + open" (plus "fchdir back" of course, but that's
beside the point - openat() doesn't ever really change cwd).
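
The "fchdir + open + fchdir back" equivalence can be sketched in user space
as below. This is only an illustration of the semantics, not how a kernel
would implement openat(): the function name is hypothetical, it handles only
flags that need no mode argument, and because it briefly changes the
process-wide cwd it is not safe in a multi-threaded program.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical emulation of openat(dirfd, name, flags) for existing
 * files. Returns a file descriptor, or -1 with errno set.
 */
int openat_via_fchdir(int dirfd, const char *name, int flags)
{
	int saved_cwd = open(".", O_RDONLY);	/* remember where we were */
	if (saved_cwd < 0)
		return -1;

	int fd = -1;
	int err;
	if (fchdir(dirfd) == 0) {
		fd = open(name, flags);		/* relative to dirfd now */
		err = errno;
		/* the "fchdir back" step; treat failure as an error */
		if (fchdir(saved_cwd) != 0 && fd >= 0) {
			err = errno;
			close(fd);
			fd = -1;
		}
	} else {
		err = errno;
	}

	close(saved_cwd);
	if (fd < 0)
		errno = err;
	return fd;
}
```

A real openat() gives the same result without ever touching cwd, which is
what makes it usable from threads.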

Going back to O_XATTR: that would end up doing the "special vfsmount"
magic at the beginning of the lookup. If the dentry you started with
wasn't marked D_HYBRID, it would just return -ENOTDIR.

So we could do openat() _without_ any of the lookup_mnt() etc special
cases. This interface is independent of whether we want to expose the
attributes through a normal lookup - we can do either, both, or neither as
we choose.

The nice thing about "openat()" is
 - people can definitely find uses for it even without attributes.
 - it's "portable". Well, at least somebody else does the same thing,
   which is nice for user-space developers. You don't use a Linux-only
   interface, you use a Linux/Solaris one, which makes a lot of people a
   lot more happy.

Remember, portability has always been very important to Linux, and
Linux-only features while nice are certainly not as nice as features you
can also find in other places.

NIH is a disease.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408281201290.2295@ppc970.osdl.org>
Date: Sat, 28 Aug 2004 19:20:27 GMT
Message-ID: <fa.gte9vma.b2kohq@ifi.uio.no>

On Sat, 28 Aug 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> On Sat, Aug 28, 2004 at 11:44:52AM -0700, Linus Torvalds wrote:
> > Going back to O_XATTR: that would end up doing the "special vfsmount"
> > magic at the beginning of the lookup. If the dentry you started with
> > wasn't marked D_HYBRID, it would just return -ENOTDIR.
>
> OK, let me restate the question - what do we get from pwd if we do
> fchdir() to such beast?

We'll see the "attribute path". We could (if we want to) mark the point
where we walked into attribute space somehow (since we can see it by just
looking at the vfsmount: d_root/d_mountpoint being the same), but even if
we don't, we'll get a sane-looking path.

(It's not just "getcwd()", it's /proc interfaces too, for opening files,
and we already have the notion of magic markers like "(deleted)" to show
human-readable information).

The question about what to do at the "attribute point" (and there may
actually be several, if the filesystem supports attributes on attribute
files) likely depends on whether we support the previously discussed
"lookup()" magic for attributes.

So if we do support it, I think we should just make the attribute point
look exactly like a normal directory, since that path would work as-is. If
we don't support it (and at directory points), we might want to just
consider it a special root.

NOTE! Anybody who tries to use the "getcwd()" string as a real path is
already broken - you have pathname permissions that may not make it
possible to look up the path you see.

So we have multiple options:

 - support "file/attribute" lookup: show the path as-is, so you'd show it
   as
	"/path/to/file/attribute"

 - alternatively, even if you _do_ support the normal lookup, show it with
   a double slash (which will still be a valid path), just as a visual
   clue:
	"/path/to/file//attribute"

 - for directories, or if we do _not_ support the extended lookup format,
   we could show it the same way we show deleted files, something like
   this:
	"/path/to/file/attribute (attr)"

 - using "http notation" for non-standard-namespaces (we already do this
   for sockets and pipes, for example)

	"attr:attribute@/path/to/file"

pick your poison.
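
For comparison, the four display styles listed above can be produced by a
small formatter. This is a user-space sketch only; the enum and function
names are hypothetical and nothing here reflects an existing kernel
interface.

```c
#include <stdio.h>

/* The four display options from the list above (names are made up). */
enum attr_style {
	ATTR_PLAIN,		/* "/path/to/file/attribute" */
	ATTR_DOUBLE_SLASH,	/* "/path/to/file//attribute" */
	ATTR_MARKER,		/* "/path/to/file/attribute (attr)" */
	ATTR_URL_LIKE		/* "attr:attribute@/path/to/file" */
};

/*
 * Write the display form of attribute "attr" of "file" into buf.
 * Returns what snprintf() returns, or -1 for an unknown style.
 */
int format_attr_path(char *buf, size_t len, enum attr_style style,
		     const char *file, const char *attr)
{
	switch (style) {
	case ATTR_PLAIN:
		return snprintf(buf, len, "%s/%s", file, attr);
	case ATTR_DOUBLE_SLASH:
		return snprintf(buf, len, "%s//%s", file, attr);
	case ATTR_MARKER:
		return snprintf(buf, len, "%s/%s (attr)", file, attr);
	case ATTR_URL_LIKE:
		return snprintf(buf, len, "attr:%s@%s", attr, file);
	}
	return -1;
}
```

Only the first two forms are usable as pathnames again; the last two are
purely informational, like the "(deleted)" marker.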

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408282205170.2295@ppc970.osdl.org>
Date: Sun, 29 Aug 2004 05:16:35 GMT
Message-ID: <fa.gte5vme.b28ohu@ifi.uio.no>

On Sat, 28 Aug 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> OK, forget getcwd().  What does lookup of .. do from that point?  *Especially*
> for stuff you've got from regular files.  That's the decision that needs to
> be made.

I think that will decide on whether we expose attributes through the
normal namespace or not.

If we do expose them in the normal namespace, then ".." should work the
way the namespace looks: if you do ".." on the "attribute directory" of a
file, you get the directory that the file was in. Ie an old-style
user-space "getcwd()" would give the right path (well, an old-style
user-space getcwd() would probably refuse the file on the basis that it is
S_IFREG, but ignoring that..)

If we _don't_ expose it in the normal namespace, we should either
just error out (logically you'd get the file itself, but I really don't
want to have ".." return a non-directory, because _that_ really might
confuse things), or you'd just return the same directory (ie it would be a
"local root" in the namespace you got moved to).

So let's try to be self-consistent with how we expose it in the normal
namespace.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <Pine.LNX.4.58.0408291523130.2295@ppc970.osdl.org>
Date: Sun, 29 Aug 2004 22:38:33 GMT
Message-ID: <fa.gsu406d.aieo1p@ifi.uio.no>

[ Linux-kernel cc'd, because I don't think the question is stupid, and I
  can't even fully answer the kNFSd thing other than point to it as a
  problem. ]

On Mon, 30 Aug 2004, Grzegorz Kulewski wrote:
>
> Sorry if my question is stupid, but why can't we deal with (hard)links to
> directories in (nearly) same way we deal with bind mounts (= making
> exactly one object representing target and only referencing to it)?

On a VFS level we could, these days, I think. But realize that bind mounts
and the vfsmounts are pretty recent things.

We don't have any filesystems that support the notion, though, and we
don't have any interfaces for the filesystem to tell us about it right
now. The VFS layer could try to figure it out on its own from aliasing
information, so the latter may be a non-issue, but the former is why
nobody does it.

And even if Linux _these days_ could handle hardlinked directories, the
fact is that they would cause slightly more memory usage (due to the
vfsmounts), and that nobody else can handle such filesystems - including
older versions of Linux. So nobody would likely use the feature (not to
mention that nobody is even really asking for it ;).

And the lack of filesystem support is not theoretical. It's not easy to
just retrofit directory hardlinks on a UNIX filesystem. The ".." entry
actually exists on _disk_ on traditional unix filesystems, and with
hardlinks on directories, that's a real problem. A hardlinked directory
has multiple parents.

Also, while the VFS layer no longer cares (to it, ".." is purely virtual,
and it never uses it), the NFS export routines still do actually want to
get the on-disk parent. A filesystem that can't do that may be unable to
be exported with full semantics (ie you might get ESTALE errors after
server reboots, although you'd have to ask somebody with more kNFSd
knowledge than me on exactly why that is the case ;)

			Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040829232856.GC16297@parcelfarce.linux.theplanet.co.uk>
Date: Sun, 29 Aug 2004 23:31:28 GMT
Message-ID: <fa.n6ddc2u.1lhioj2@ifi.uio.no>

On Sun, Aug 29, 2004 at 03:37:16PM -0700, Linus Torvalds wrote:
> > Sorry if my question is stupid, but why can't we deal with (hard)links to
> > directories in (nearly) same way we deal with bind mounts (= making
> > exactly one object representing target and only referencing to it)?
>
> On a VFS level we could, these days, I think. But realize that bind mounts
> and the vfsmounts are pretty recent things.

Bindings won't replace hardlinks.
	a) lifetime rules and keeping stuff busy
	b) who's bound on top?  Note that for real hardlinks to directories
(not just "directory on top of file" hybrids) it's a serious question
	c) for real hardlinks we would want at least rename() working


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Original-Message-ID: <Pine.LNX.4.58.0408311252150.2295@ppc970.osdl.org>
Subject: Re: silent semantic changes with reiser4
Date: Tue, 31 Aug 2004 22:15:57 GMT
Message-ID: <fa.gsu2164.aicphg@ifi.uio.no>

On Tue, 31 Aug 2004, Horst von Brand wrote:
>
> You do need extra tools anyway, placing them in the kernel is cheating (and
> absolutely pointless, IMHO).

I agree.

There's no point to having the kernel export information that is already
inherent in the main stream.

I've seen all these examples of exposing MP3 ID information as a "side
stream", and that's TOTALLY POINTLESS! The information is already there,
it's in a standard format, and exporting it as a stream buys you
absolutely nothing.

Where named attributes make sense is when they are _independent_ data.
Data that is only tangentially related to the main stream itself, and
which the main stream _cannot_ encompass because of some real technical
issue.

In a graphical environment, the "icon" stream is a good example of this.
It literally has _nothing_ to do with the data in the main stream. The
only linkage is a totally non-technical one, where the user wanted to
associate a secondary stream with the main stream _without_ altering the
main one. THAT is where named streams make sense.

But if you want to look at one particular file inside a tar-file, do so in
user space. There are zero advantages to exposing it as a side-stream, and
there are absolutely _tons_ of disadvantages.

In short, named streams only make sense if:
 - they are tied to the file some way that is _independent_ of the file
   contents (since if it's dependent on the file contents, you're just a
   ton better off regenerating it with a caching server)
 - there are serious reasons to keep the lookup synchronized (since if
   there isn't such a reason, you're just better off with a separate
   shadow tree in user space)

And realize that the "separate shadow tree" actually works very well.
That's how version control systems like CVS have always worked. It's
certainly how you can make icon information work too. If you use a tool
for accessing the data, the tool can maintain coherency and you'll never
care about the side stream.
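
A minimal sketch of the shadow-tree idea, assuming a made-up layout in which
the side data for "dir/name" lives in "dir/.icons/name" (in the same spirit
as the per-directory "CVS" subdirectory). The ".icons" name and the function
name are hypothetical:

```c
#include <stdio.h>
#include <string.h>

/*
 * Map a file path to the path of its shadow entry:
 *   "/home/u/doc.txt" -> "/home/u/.icons/doc.txt"
 *   "doc.txt"         -> ".icons/doc.txt"
 * Returns 0 on success, -1 if the result does not fit in "out".
 */
int shadow_path(char *out, size_t len, const char *path)
{
	const char *slash = strrchr(path, '/');
	int n;

	if (slash)
		/* copy the directory part, then insert the shadow dir */
		n = snprintf(out, len, "%.*s/.icons/%s",
			     (int)(slash - path), path, slash + 1);
	else
		n = snprintf(out, len, ".icons/%s", path);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A tool that moves or copies files would call this for both the old and the
new path and move the shadow entry along with the file, which is the
"tool maintains coherency" part.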

Which means that normally we really don't _want_ named streams. In 99% of
all cases we can use equally good - and _much_ simpler - tool-based
solutions.

Which means that the only _real_ technical issue for supporting named
streams really ends up being things like samba, which want named streams
just because the work they do fundamentally is about them, for externally
dictated reasons. Doing named streams for any other reason is likely just
being stupid.

Once you do decide that you have to do named streams, you might then
decide to use them for convenient things like icons. But it should very
much be a secondary issue at that point.

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: Using fs views to isolate untrusted processes: I need an assistant 
	architect in the USA for Phase I of a DARPA funded linux kernel 
	project
Original-Message-ID: <20040826042936.GR21964@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 26 Aug 2004 06:11:22 GMT
Message-ID: <fa.nadbar3.1ohspb7@ifi.uio.no>

On Thu, Aug 26, 2004 at 12:16:43AM -0400, Kyle Moffett wrote:
> I'm well aware of the technique, but I was wondering if there was any
> extra VFS baggage associated with a normal bind mount that might
> be eliminated by restricting a different version of a bind mount to only
> files.  That's why I asked later if anybody had benchmarked the bind
> mount system to see how well it would scale to 1000 bound files and
> directories.  If it's not a performance issue then I really don't care
> less, but I have a somewhat old box that must make do as a fileserver,
> so I'm very interested in maximizing the performance. I don't care much
> about extra RAM consumption, only about CPU and bus usage.

Files and directories are not different in that respect - the only overhead
is the price of a hash lookup when crossing the binding in either case.  1000
bindings shouldn't be a problem - it's 3--5 per hash chain.  Wrt memory,
it's one struct vfsmount allocated per binding - IOW, about 80Kb total
for 1000 of those.


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: Using fs views to isolate untrusted processes: I need an assistant 
	architect in the USA for Phase I of a DARPA funded linux kernel 
	project
Original-Message-ID: <20040826050145.GT21964@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 26 Aug 2004 06:13:43 GMT
Message-ID: <fa.nbtbaar.1o1oprf@ifi.uio.no>

On Thu, Aug 26, 2004 at 12:52:37AM -0400, Kyle Moffett wrote:
> Where would I increase the hash size if I wanted to increase the number
> of bindings by an order of magnitude or so?  I'm very interested in
> pursuing this possibility, because when combined with the procedure I
> described earlier, plus a little bit of extra work with capabilities
> and such, it's very easy to build incredibly flexible and basically
> indestructible chroot environments with not much code.

*shrug*

fs/namespace.c::mnt_init().  Right now it allocates 1 page for hash table
(order = 0), you can easily raise that.  You might want to try and change
the order of checks in lookup_mnt() loop - depending on your setup it
might speed things up, but I doubt that it would be a noticeable win.
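
The effect of raising the order can be estimated with a little arithmetic.
The sketch below is a back-of-the-envelope model, not the kernel code: it
assumes 4096-byte pages, a bucket that is a two-pointer list head (16 bytes
on a 64-bit machine, 8 on 32-bit), and perfectly uniform hashing. Function
names are hypothetical.

```c
#include <stddef.h>

/* Number of buckets in a table of (page_size << order) bytes. */
size_t mnt_hash_buckets(size_t page_size, unsigned order, size_t bucket_size)
{
	return (page_size << order) / bucket_size;
}

/* Average entries per chain, assuming a uniform hash. */
double avg_chain_length(size_t mounts, size_t buckets)
{
	return (double)mounts / (double)buckets;
}
```

Under these assumptions, order 0 with 16-byte buckets gives 256 buckets, so
1000 bindings average about 3.9 per chain. 10000 bindings at order 0 would
average about 39 per chain, while order 3 (2048 buckets) brings that back to
about 4.9. With 8-byte buckets every figure is halved.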


Newsgroups: fa.linux.kernel
From: "Theodore Ts'o" <tytso@mit.edu>
Subject: Re: silent semantic changes with reiser4
Original-Message-ID: <20040902125417.GA12118@thunk.org>
Date: Thu, 2 Sep 2004 13:55:37 GMT
Message-ID: <fa.d38rduc.1pl0t3e@ifi.uio.no>

On Wed, Sep 01, 2004 at 01:51:40PM -0700, Jeremy Allison wrote:
> > So you're saying SCP, CVS, Subversion, Bitkeeper, Apache and rsyncd
> > will _all_ lose part of a Word document when they handle it on a
> > Windows box?
> >
> > Ouch!
>
> Yep. It's the meta data that Word stores in streams that will get lost.

And this is why I believe that using streams in applications is, well,
ill-advised.  Indeed, one of my concerns with providing streams
support is that application authors may make the mistake of using it,
and we will be back to the bad old days (when MacOS made this mistake)
where you will need to binhex files before you ftp them (and unbinhex
them on the other side) --- and if you forget, the resulting file will
be useless.

I understand why the Samba folks want this feature very badly;
however, hopefully other projects will know enough *not* to use
streams once they become available in Linux....

						- Ted


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: The argument for fs assistance in handling archives (was: silent 
	semantic changes with reiser4)
Original-Message-ID: <20040902220027.GD23987@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 2 Sep 2004 22:11:06 GMT
Message-ID: <fa.n4thaam.1l1mpra@ifi.uio.no>

On Thu, Sep 02, 2004 at 11:48:06PM +0200, Frank van Maarseveen wrote:
> mount is nice for root, clumsy for user. And a rather complicated
> way of accessing data the kernel has knowledge about in the first
> place. For filesystem images, cd'ing into the file is the most
> obvious concept for file-as-a-dir IMHO.

The hell it is.

a) kernel has *NO* *FUCKING* *KNOWLEDGE* of fs type contained on a device.
b) kernel has no way to guess which options to use
c) fs _type_ is a fundamental part of mount - device(s) (if any) involved
are arguments to be interpreted by that particular fs driver.
d) permissions required for that lovely operation (and questions like
whether we force nosuid/noexec, etc.) are a nightmare to define.

Frankly, the longer that thread grows, the more obvious it becomes that
file-as-a-dir is a solution in search of problem.  Desperate search, at
that.


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: The argument for fs assistance in handling archives (was: silent 
	semantic changes with reiser4)
Original-Message-ID: <20040902220640.GE23987@parcelfarce.linux.theplanet.co.uk>
Date: Fri, 3 Sep 2004 03:08:46 GMT
Message-ID: <fa.n6d3aas.1mhoprc@ifi.uio.no>

On Fri, Sep 03, 2004 at 12:02:42AM +0200, Frank van Maarseveen wrote:
> On Thu, Sep 02, 2004 at 11:00:27PM +0100, viro@parcelfarce.linux.theplanet.co.uk wrote:
> >
> > The hell it is.
> >
> > a) kernel has *NO* *FUCKING* *KNOWLEDGE* of fs type contained on a device.
>
> excuse me, but how does the kernel mount the root fs?

By trying all fs types it has registered in a more or less random (OK, defined
by order of fs type registration, which is kinda-sorta deterministic at
boot time) order.  With no flags, unless you pass them explicitly in kernel
command line.  Fs types list can also be set explicitly in the command line.
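
The probing loop described above can be modelled in user space roughly as
follows. This is a toy: the struct, the "magic" prefix check and the
function names are all hypothetical stand-ins, since a real driver decides
for itself whether it can mount a device. The point is only that the first
registered type that accepts the device wins.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a registered filesystem type. */
struct fs_type {
	const char *name;
	const char *magic;	/* stand-in for the driver's own check */
};

/* Returns 0 if type "t" accepts the image, -1 otherwise. */
static int try_mount(const struct fs_type *t, const char *image, size_t len)
{
	size_t n = strlen(t->magic);

	return (len >= n && memcmp(image, t->magic, n) == 0) ? 0 : -1;
}

/*
 * Try every type in registration order and return the first one
 * that mounts, or NULL if none does.
 */
const struct fs_type *mount_root(const struct fs_type *types, size_t ntypes,
				 const char *image, size_t len)
{
	for (size_t i = 0; i < ntypes; i++)
		if (try_mount(&types[i], image, len) == 0)
			return &types[i];
	return NULL;
}
```

If two types would both accept the same image, the result depends purely on
registration order, which is why an explicit type list on the command line
is the only way to make the choice deliberate.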

Next question?


Newsgroups: fa.linux.kernel
From: "Theodore Ts'o" <tytso@mit.edu>
Subject: Re: silent semantic changes in reiser4 (brief attempt to document the 
	idea of what reiser4 wants to do with metafiles and why
Original-Message-ID: <20040909090342.GA30303@thunk.org>
Date: Thu, 9 Sep 2004 16:43:54 GMT
Message-ID: <fa.d3opc6k.1r5esr0@ifi.uio.no>

On Wed, Sep 08, 2004 at 12:09:52AM +0200, Robin Rosenberg wrote:
> Maybe file/./attribute then. /. on a file is currently meaningless. That does
> not avoid the unpleasant fact that has been brought up by others (only to be
> ignored), that the directory syntax does not allow metadata on directories.

*Not* that I am endorsing the idea of being able to access metadata
via a standard pathname --- I continue to believe that named streams
are a bad idea that will be an attractive nuisance to application
developers, and if we must do them, then Solaris's openat(2) API is
the best way to proceed --- HOWEVER, if people are insistent on being
able to do this via standard pathnames, and not introducing a new
system call, I would suggest /|/ as the separator as the third least
worst option.  Why?

Any such scheme will violate POSIX and SUS, since we are stealing from
the filename namespace, and thus could cause a previously working
program to stop working --- however, assuming that we don't care about
this, the vertical bar is the least likely to collide with existing
file usages, because of its status as a shell meta-character (i.e.,
pipe).  This means that in order to use it on the shell command line,
programs will have to quote it:

	cat /home/tytso/word.doc/\|/meta/silly-stupid-metadata-or-named-stream

This may seem to be inconvenient, but one very good thing about this
is that PHP and existing Perl scripts already treat pathnames
that contain pipes with a certain amount of suspicion --- and this is
a good thing!  Otherwise, programs that take input from untrusted
sources (say, URL's or http form posts), may convert such input into a
metadata access, and that may be a very, very, very bad thing.  (For
example, it may mean that you will have accidentally allowed a web
user to read or possibly modify an ACL with whatever privileges of the
CGI-perl or php script.)  By using a pipe character, it avoids this
problem, since secure CGI scripts must be already checking for the
pipe character anyway.
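
As a sketch of the kind of check meant here, assuming the "/|/" separator
proposed in this message (no kernel implements it) and hypothetical function
names: a program handling untrusted input refuses anything containing a
pipe, and a program that does want the stream splits on the separator.

```c
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if untrusted input contains no pipe character and so can
 * never be turned into a metadata access under this scheme, else 0.
 */
int untrusted_path_ok(const char *input)
{
	return strchr(input, '|') == NULL;
}

/*
 * Returns a pointer to the stream name after the first "/|/"
 * separator, or NULL if the path names no stream.
 */
const char *stream_name(const char *path)
{
	const char *sep = strstr(path, "/|/");

	return sep ? sep + 3 : NULL;
}
```

A real script would apply the first check alongside its other input
validation, not instead of it.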

> I'm not convinced that totally transparent access to meta-data actually
> benefits anyone. If metadata is that useful (which I believe) it may well be
> worth fixing those apps that need, and can use them. The rest should just
> ignore it, even lose it.

Totally agreed.  As I said above, I would prefer openat(2) to trying
to do this within a standard pathname, and I would prefer not doing it
at all, since aside from Samba, which is simply trying to maintain
backwards compatibility with a Really Bad Idea, the number of
protocols and data formats (ftp, tar, zip, gzip, cpio, etc., etc.,
etc.) that would need to be revamped is huge.

						- Ted

