From: Al Viro <viro@ftp.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: Race between "mount" uevent and /proc/mounts?
Date: Wed, 26 Oct 2005 11:15:47 UTC
Message-ID: <fa.im3c9e4.1i60tgq@ifi.uio.no>
Original-Message-ID: <20051026111506.GQ7992@ftp.linux.org.uk>

On Wed, Oct 26, 2005 at 02:27:10PM +0400, Sergey Vlasov wrote:
> > ... said event happens to be a piece of junk with ill-defined semantics.
>
> Hmm, and what should be the proper semantics for such an event?

> Currently the "mount" uevent signals that the device is busy

But it does not.  The thing that is fundamentally wrong about it is
that it doesn't match any of the real objects.  It assumes that
	* there is such thing as "filesystem of the device"
	* there is such thing as "mountpoint of the filesystem"
	* that no two filesystems use the same device
	* that no filesystem would use more than one device
	* that the thing we get on mountpoint is fully determined by fs
mounted there.

All of these assumptions used to be true for v7.  Guess what?  The world
has changed.  _None_ of the above is true anymore.

First of all, the fundamental property of filesystem is its type.  And
that's the only universal property these objects have.  Everything else
is type-dependent.

*Some* types happen to use one or more block devices.  How they use those
depends on the type.  E.g. ext3 can span two devices (journal being the
second one).

Some fs types claim devices they use exclusively.  Some do not; e.g. stuff
that does online resizing via secondary fs a-la ext2meta will, by design,
coexist with normal fs on the same device.

Each fs has a directory tree.  Pieces of these trees can be glued together
into unified tree; the same subtree can be seen in many places, several
different subtrees can be mounted even when the entire tree is not.
Moreover, different processes can see different mount trees.  Filesystem
can be active even if not present in mount trees of any processes - it can
be kept alive by e.g. open files on it; that's what happens if we do
umount -l and something in the subtree is still busy.

_That_ is the reality and any reasonable system of events should match it.
The objects are:

* fs types: flat set, depending on the kernel config
* active filesystems: each belongs to fs type, each has associated directory
tree and files in that tree.
* mounts: each maps a subtree of some filesystem.
* mount trees (aka namespaces): trees of mounts, providing a unified directory
tree as seen by processes; they glue together the subtrees from individual
mounts.
* block devices: used by many things in many ways; a lot of active filesystems
happen to use them; the number and kind of use depends on fs type.
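
As an illustration only (not part of the original message), the object list
above can be sketched as a small data model.  All class names, field names and
device paths below are hypothetical; the point is merely that devices,
filesystems and mounts are separate objects with many-to-many relations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FsType:
    name: str  # the one universal property of a filesystem


@dataclass
class Filesystem:
    """An active filesystem: an instance of a type with its own tree."""
    fstype: FsType
    devices: tuple = ()  # zero, one or several block devices


@dataclass
class Mount:
    """Maps one subtree of a filesystem onto a mountpoint."""
    fs: Filesystem
    subtree: str      # path inside the filesystem, "/" for all of it
    mountpoint: str


@dataclass
class Namespace:
    """A mount tree as seen by some set of processes."""
    mounts: list = field(default_factory=list)


def filesystems_on(device, active):
    """All active filesystems that use the given block device."""
    return [fs for fs in active if device in fs.devices]


def mounts_of(fs, ns):
    """All mounts in a namespace that show some subtree of fs."""
    return [m for m in ns.mounts if m.fs is fs]


# Hypothetical setup: an ext3 with an external journal, an ext2meta-style
# secondary fs sharing the first device, and a fs with no device at all.
data = Filesystem(FsType("ext3"), devices=("/dev/sda1", "/dev/sdb1"))
resize = Filesystem(FsType("ext2meta"), devices=("/dev/sda1",))
proc = Filesystem(FsType("proc"))
active = [data, resize, proc]

ns = Namespace([
    Mount(data, "/", "/mnt"),
    Mount(data, "/home", "/home"),  # same fs, different subtree
    Mount(proc, "/", "/proc"),
])

print(len(filesystems_on("/dev/sda1", active)))  # two filesystems, one device
print(len(mounts_of(data, ns)))                  # one filesystem, two mounts
print(len(mounts_of(resize, ns)))                # active, yet not mounted here
```

In this toy setup none of "the filesystem of the device", "the mountpoint of
the filesystem" or "the device of the filesystem" has a single answer.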

Semantics for events depends on which objects you are interested in.
Existing ones do not match _any_ of the real objects and I have no
idea what exactly had been intended for them.  I've asked gregkh, but
he didn't remember that either.  Apparently they are used by different
people as (bad) approximations to different things.  Which doesn't work
well.  And until somebody cares to describe what exactly are they trying
to watch the situation obviously won't improve.


From: Al Viro <viro@ftp.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: Race between "mount" uevent and /proc/mounts?
Date: Wed, 26 Oct 2005 19:31:41 UTC
Message-ID: <fa.ip409m7.1h6ct8n@ifi.uio.no>
Original-Message-ID: <20051026192858.GR7992@ftp.linux.org.uk>

On Wed, Oct 26, 2005 at 04:34:17PM +0200, Kay Sievers wrote:
> > Semantics for events depends on which objects you are interested in.
> > Existing ones do not match _any_ of the real objects and I have no
> > idea what exactly had been intended for them.  I've asked gregkh, but
> > he didn't remember that either.  Apparently they are used by different
> > people as (bad) approximations to different things.  Which doesn't work
> > well.  And until somebody cares to describe what exactly are they trying
> > to watch the situation obviously won't improve.
>
> They are actually events for claim/release of a block device. As uevents
> are bound to kobjects we needed to send these events from an existing device
> which is the blockdev itself.
>
> Sure, the event itself, has nothing to do with a filesystem. The names are
> like this for historical reasons and "CLAIM/RELEASE" may be less confusing.
> The events are used as a trigger to rescan /proc/mounts instead of polling
> it constantly.

But that makes no sense.  /proc/*/mounts changes when mount tree changes.
Which is obviously not an event happening to block devices.  Moreover,
changes of mount tree may involve no changes in the set of active filesystems
or be separated in time from such changes by arbitrary intervals.

Looks like seriously wrong assumptions in userland code working with these
events...  _IF_ you want to keep track of /proc/*/mounts changes, the obvious
solution would be to implement ->poll() for them.  However, if you are
really interested in block devices, keep in mind that
	* getting them claimed happens before your event is generated
	* eventually the filesystem claiming them becomes active (or doesn't,
if mount fails)
	* eventually an active fs may (or may not) become visible in mount
tree.
	* not every umount leads to deactivation
	* deactivation can happen long after the fs is no longer present in
mount tree
	* fs may become visible in mount tree again without being deactivated
and activated again - (mount /dev/foo /mnt; exec </mnt/bar; umount -l /mnt;
sleep 100; mount /dev/foo /tmp/barf) in case of block filesystem will do just
that; fs gets activated, mounted, unmounted and mounted again 100 seconds
later.
	* deactivated fs gives up its claim on device(s).  Incidentally,
your UMOUNT event is triggered before either thing happens; any amount of
IO on the device(s) can happen after it.
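
As an illustration only (not part of the original message), the ordering in
the list above can be walked through with a toy model.  It assumes, as a
simplification, that a filesystem stays active while anything (a mount or an
open file) holds a reference to it; the class and method names are
hypothetical and do not correspond to kernel functions.

```python
class BlockFs:
    """Toy model of a block filesystem kept active by references."""

    def __init__(self, device):
        self.device = device
        self.refs = 0   # mounts plus open files
        self.log = []   # what happened, in order

    def _get(self):
        if self.refs == 0:
            # The device is claimed first, then the fs becomes active.
            self.log.append("claim " + self.device)
            self.log.append("activate")
        self.refs += 1

    def _put(self):
        self.refs -= 1
        if self.refs == 0:
            # Only deactivation gives up the claim on the device.
            self.log.append("deactivate")
            self.log.append("release " + self.device)

    def mount(self, where):
        self._get()
        self.log.append("mount " + where)

    def umount_lazy(self, where):
        # The umount is visible before any deactivation takes place.
        self.log.append("umount " + where)
        self._put()

    def open_file(self):
        self._get()

    def close_file(self):
        self._put()


# mount /dev/foo /mnt; exec </mnt/bar; umount -l /mnt; mount /dev/foo /tmp/barf
fs = BlockFs("/dev/foo")
fs.mount("/mnt")
fs.open_file()
fs.umount_lazy("/mnt")
fs.mount("/tmp/barf")
fs.close_file()
print(fs.log)
```

The log shows one claim and one activation but two mounts, so a listener that
equates "mount" with "device became busy" would be wrong about the second one.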

Oh, and there are things other than filesystems that can (and do) claim
block devices.

So what's really going on?  If you want to know when device gets busy, you
need events in fs/block_dev.c and no expectation regarding /proc/mounts.
If you want to know when mount tree changes, you need events on attach_mnt/
detach_mnt (and I would seriously suggest ->poll() rather than wanking with
events).  If you want something more complex, you might or might not be
SOL, depending on what you are trying to achieve.
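
As an illustration only (not part of the original message), this is roughly
what a ->poll()-based watcher looks like from userland.  It assumes a kernel
in which /proc/*/mounts is pollable and reports a change as a priority event,
which proc(5) documents for Linux 2.6.15 and later; the function name is
hypothetical.

```python
import select


def watch_mounts(path="/proc/self/mounts", timeout_ms=None):
    """Yield the mount table now, then again after each reported change.

    The file stays open between reads, so a change that happens while the
    caller is busy is still reported by the next poll.  With a timeout, the
    generator ends when no change arrives in time.
    """
    with open(path) as f:
        poller = select.poll()
        # A change in the mount tree shows up as POLLPRI (and POLLERR).
        poller.register(f.fileno(), select.POLLPRI | select.POLLERR)
        while True:
            f.seek(0)
            yield f.read()
            if not poller.poll(timeout_ms):
                return  # timed out without a change
```

A caller would iterate, e.g. "for table in watch_mounts(): ...", and diff
successive tables.  Note that this tracks the mount tree of one namespace
only; it says nothing about when a block device is claimed or released.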


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 12/18] shared mount handling: bind and rbind
Date: Wed, 09 Nov 2005 19:00:51 UTC
Message-ID: <fa.fvpn4bf.hg27at@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511091054290.3247@g5.osdl.org>

On Wed, 9 Nov 2005, Ram Pai wrote:
>
> And 'umount .' really doesn't make sense. What does it mean? umount the
> current mount? or umount of the mount that is mounted on this dentry?

"umount <directory>" _absolutely_ makes sense, whether "directory" is "."
or something else. People do it all the time.

Now, if it doesn't unmount the last thing mounted on top of ".", then
that's a misfeature. It might be a misfeature in the mount program (it
might scan /etc/mounts top-to-bottom rather than the other way), but the
kernel should also support it.

> no. I said application _should_not_ depend on it, because it is a
> undefined semantics.

It's definitely neither unusual nor undefined. I do all my umounts by
directory (in fact, doing it by anything else really _is_ badly defined,
since a block device can be mounted in many places), and the only sane
semantics would be to peel off the last mount on that directory.

Now, that doesn't necessarily mean that "list_add_tail()" is wrong. But
if we add new mounts to the end, then umount should remove them from the end too,
no?

		Linus


From: Al Viro <viro@ftp.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 12/18] shared mount handling: bind and rbind
Date: Wed, 09 Nov 2005 19:26:52 UTC
Message-ID: <fa.ie40968.1q6gtom@ifi.uio.no>
Original-Message-ID: <20051109192607.GA7992@ftp.linux.org.uk>

On Wed, Nov 09, 2005 at 10:59:47AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 9 Nov 2005, Ram Pai wrote:
> >
> > And 'umount .' really doesn't make sense. What does it mean? umount the
> > current mount? or umount of the mount that is mounted on this dentry?
>
> "umount <directory>" _absolutely_ makes sense, whether "directory" is "."
> or something else. People do it all the time.

With current (and all previous, actually) tree umount . is usually -EBUSY.

The case Mikulas is talking about is much uglier - it's "mount on top
of current directory, then umount .".  _That_ (i.e. when . is overmounted)
happens to work.  And semantics is really, really not well-defined.

Situation with overmounts is nasty - it's *not* just a chain, unfortunately.
I certainly intended it to be such.  However, we can get a *tree* of
overmounts due to side-effects I've missed back then.

We really need it sanitized (and that's what I'm doing right now), but
yes, it *will* cause user-visible changes.  Incidentally, one of those
will be that umount . will work...

The trouble begins since we allow to attach vfsmounts to the *middle* of
overmount chain.  I.e.

mount foo /tmp
cd /tmp
mount bar .
mount baz .

will end up with *two* vfsmounts having root of foo as mountpoint and having
the same mnt_parent.  Which one is seen depends on phase of moon - the only
answer is "whichever is first in mnt_hash chain".  Which is certainly not a
sane answer.  We need explicit rules dealing with effect of overmounts;
anything that seriously relies on details of current behaviour in that sort
of corner cases is very definitely broken.
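
As an illustration only (not part of the original message), the ambiguity can
be reproduced with a toy hash chain.  The lookup below returns the first entry
matching (parent, mountpoint), and the two insertion policies are hypothetical
stand-ins for adding at the head or at the tail of the chain; none of this is
the kernel's actual code.

```python
def attach(chain, name, parent, mountpoint, at_tail):
    """Add a mount to the chain, at the tail or at the head."""
    entry = {"name": name, "parent": parent, "mountpoint": mountpoint}
    if at_tail:
        chain.append(entry)
    else:
        chain.insert(0, entry)


def lookup_mnt(chain, parent, mountpoint):
    """Return the first mount in the chain attached at (parent, mountpoint)."""
    for mnt in chain:
        if mnt["parent"] == parent and mnt["mountpoint"] == mountpoint:
            return mnt["name"]
    return None


def build(at_tail):
    # "mount bar ." and "mount baz ." with the root of foo as cwd: both end
    # up with the same parent and the same mountpoint.
    chain = []
    attach(chain, "bar", "foo", "root of foo", at_tail)
    attach(chain, "baz", "foo", "root of foo", at_tail)
    return chain


print(lookup_mnt(build(at_tail=True), "foo", "root of foo"))   # bar
print(lookup_mnt(build(at_tail=False), "foo", "root of foo"))  # baz
```

Which of the two is seen depends purely on chain order, which is an
implementation detail rather than a defined rule.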

And "we allow" above should be read as "Al had not thought about that mess
back in 2001" ;-/  Current behaviour in that sort of setup is an accident -
as soon as it gets to such forked chains of overmounts sanity exits stage
left.  To be fixed...
