Index Home About Blog
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402162034280.30742@home.osdl.org>
Date: Tue, 17 Feb 2004 05:15:05 GMT
Message-ID: <fa.idvvhjl.1jge92d@ifi.uio.no>

[ Al cc'd, because while I'm pretty certain that he agrees with me 100% on
  the craziness of case-insensitive name lookups, he may have some input
  on the "samba helper" function approach. That input may well boil down
  to "Linus is crazy", of course. Wouldn't be the first time ;)

  Andrew - you really should assume that case insensitivity is a hell of a
  lot more costly than you think it is, and forget that particular idea.
  Let's see if there are acceptable half-measures. ]

On Tue, 17 Feb 2004 tridge@samba.org wrote:
>
> Given how much pain the "kernel is agnostic to charset encoding"
> attitude has cost me in terms of programming pain, I thought I should
> de-cloak from lurk mode and put my 2c into the UTF-8 issue.
>
> Personally I think that eventually the Linux kernel will have to
> embrace the interpretation of the byte streams that applications have
> given it, despite the fact that this will be very painful and
> potentially quite complex.

I seriously doubt it. There just isn't any point.

>		 The reason is that I think that eventually
> the Linux kernel will need to efficiently support a userspace policy
> of case-insensitivity and the only way to do case-insensitive filename
> operations is to interpret those byte streams as a particular
> encoding.

The thing is, if you want to do efficient user-space case-insensitive
lookups, that is a _completely_ different matter from having the kernel do
case-insensitivity.

Kernel-level case insensitivity is a total disaster, and your "very
painful and potentially quite complex" assertion is the understatement of
the year. The thing is, you can't sanely do dentry caching, since the case
insensitivity has to be per-open or at least per-process (you MUST NOT be
case-insensitive in a POSIX process).

So the only way to do case-insensitive names is to do all lookups very
slowly. I'm willing to bet that WNT opens files a hell of a lot slower
than Linux does, and one big portion of that is exactly the fact that
Linux can do a really good job with the dentry cache.

And that _depends_ on a well-defined and unique filename setup (by
changing the hashing function and compare function, a filesystem can do a
limited kind of case-insensitivity right now in Linux, but then it will
have to be not only fairly slow, but also case-insensitive for _everybody_
which is unacceptable in a mixed POSIX/samba environment).

In other words, just forget the whole notion. The only set people who have
any reason at _all_ to want it is the samba team, and we can solve the
samba-specific problems other ways.

Just take that as a simple fact - case insensitivity in the kernel is such
a horribly bad idea, that you really shouldn't go there.

With that destructive criticism out of the way, let's look at somewhat
more constructive approaches, ie some way to allow certain processes that
need it better help in their quest for case insensitivity.

Let's start with some assumptions:

 - MOST name lookups are likely results of some kind of "readdir()"
   lookup, and tend to have the case right in the first place. So that
   should go fast. Maybe Tridge has some statistics on this one?

 - samba probably has certain pretty well-defined special patterns for
   what it wants to do with a filename, so you probably don't need a
   generic "everything that takes a filename should be case-insensitive",
   and it would be acceptable to have a few _very_ specific system calls.

With those assumptions out of the way, we could think of an interface that
exports some partial functionality of the "lookup_path()" code the kernel
as a special system call. In particular, something that takes an input
pathname, and is able to stop at any point of the name when a lookup
fails.

So some variation of the interface

	int magic_open(
		/* Input arguments */
		const char *pathname,
		unsigned long flags,
		mode_t mode,

		/* output arguments */
		int *fd,
		struct stat *st,
		int *successful_path_length);

ie the system call would:

 - look up as far into the pathname (using _exact_ lookup) as possible
 - return the error code of the last failure
 - the "flags" could be extended so that you can specify that you mustn't
   traverse ".." or symlinks (ie those would count as failures)

but also:

 - fill in the "struct stat" information for the last _successful_
   pathname component.
 - fill in the "fd" with a fd of the last _successful_ pathname component.
 - tell how much of the pathname it could traverse.

so that the user can do a "readdir" and try to "fix up" the problem
without having to restart the whole thing. For the (hopefully common case)
where the cases match, this would just boil down to an "open with stat
information" thing.

We'd need something more interesting to guarantee unique filename on file
create, possibly even including letting a trusted process maintain some
locks in the VFS layer. The point being that the kernel can _help_ some
specific usage, but making case-insensitive names be part of the VFS layer
proper is not acceptable.

I suspect we can do case-insensitive names faster than WNT even with a
fairly complex user-mode interface. Just because _not_ having them in the
kernel allows us to have much faster default behaviour.

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402170704210.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 15:14:50 GMT
Message-ID: <fa.j3c6ptd.1d2akor@ifi.uio.no>

On Tue, 17 Feb 2004 tridge@samba.org wrote:
>
> From memory, the patch added new classes of dentries to the current
> "+ve" and "-ve" dentries. It added concepts like a "-ve
> case-insensitive" dentry and a "-ve case-sensitive" dentry. It
> certainly adds more code in trying to deal with these variants, but I
> see no reason why it should be significantly computationally less
> efficient.

Yes, we could add context sensitivity to the dcache with a context
bitmask.

However, it's _not_ correct.

It assumes that there is only one way to do lower/upper case, which just
isn't true. What about different locales that have different case rules?
Your "one bit per dentry" becomes "one bit per locale per dentry". That's
just horribly hard to do.

I don't know how Windows does it, so maybe this thing is hardcoded, and
you don't even want "true" case insensitivity. How "correct" is Windows?

(And don't even bother telling me about the translation table in NTFS
volumes - I'm not interested. This would have to work on a sane filesystem
to be useful, even for samba.)

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402170833110.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 17:00:57 GMT
Message-ID: <fa.j2s8qlc.1cikk0s@ifi.uio.no>

On Tue, 17 Feb 2004, Linus Torvalds wrote:
>
> It assumes that there is only one way to do lower/upper case, which just
> isn't true. What about different locales that have different case rules?
> Your "one bit per dentry" becomes "one bit per locale per dentry". That's
> just horribly hard to do.

It's also hard to know what to do when there are two filenames that
literally _are_ the same when not comparing cases. Which can obviously
happen under Linux - you'd have a case-sensitive app that creates a both
"makefile" and "Makefile", and now you have a case-insensitive app that
looks it up (or worse, removes it), and what the *heck* is the dcache now
supposed to really do?

This is why I'd hate for the generic Linux dcache to know about case
sensitivity, and I'd be a lot happier having a separate path (which isn't
as speed-critical) that can be used to help implement helper functions for
doing case-insensitive things.

That way the bugs and strange behaviour would be all be limited to the
case-insensitive special code, and not pollute the "sane" side.

For example, I fundamentally can't easily do an atomic exclusive
case-insensitive "create" or "rename", but we _could_ expose things like
directory generation counts to the special interfaces, and thus allow at
least "local-atomic" operations (but they would _not_ be atomic over a
network, to give you an idea of the kinds of _fundamental_ limitations
there are here).

That's why I'd advocate having a few very special system calls for doing
the operations that samba (and I'll throw wine into the pot too) wants to
do. So you could literally do an atomic create with something like

 - regular atomic create of random case-_sensitive_ name using something
   tempnam()-like (use a prefix that is invalid on windows or something:
   make the first character be 0xff or whatever).
 - "read directory local sequence count"
 - readdir to make sure that the new name is still unique even in the
   case-insensitive sense
 - "atomic move conditionally on the local sequence count still being X"

The thing is, we can do hack like the above, and yes, we could do them all
inside the kernel, and give user space a reasonably nice interface with
"pseudo-atomic" behaviour (ie it will _not_ be atomic if multiple clients
do this over NFS, but I doubt you care).

But it wouldn't be "open()" and "rename()". It would be a totally separate
kernel path. It would be in the "case-insensitivity-module". It would be
_outside_ the regular VFS layer, although it would have some visibility
into it (ie it could follow dentries on its own, and know about the RCU
etc locking rules).

We can even allow that case-insensitive module to set some flags in the
dentries (so that you can create negative dentries that have a flag set
"this is negative for all cases").

Trust me, this is much less intrusive, and a lot easier to debug too. It
won't be as fast as the regular path operations, but depending on what the
common cases are (hopefully "look up name that is exact"), it would likely
not be horrible either. And it could probably be debugged as a real
module, without impacting any existing code, which would make it a lot
easier to create.

See where I'm going? Would this be acceptable to you? Are there any samba
people who are knowledgeable about the VFS-layer and have the time/energy
to try something like this?

Al? What do you think?

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <20040217194414.GP8858@parcelfarce.linux.theplanet.co.uk>
Date: Tue, 17 Feb 2004 19:45:36 GMT
Message-ID: <fa.nkrug4t.1s7q0ah@ifi.uio.no>

On Tue, Feb 17, 2004 at 08:57:40AM -0800, Linus Torvalds wrote:
> Trust me, this is much less intrusive, and a lot easier to debug too. It
> won't be as fast as the regular path operations, but depending on what the
> common cases are (hopefully "look up name that is exact"), it would likely
> not be horrible either. And it could probably be debugged as a real
> module, without impacting any existing code, which would make it a lot
> easier to create.
>
> See where I'm going? Would this be acceptable to you? Are there any samba
> people who are knowledgeable about the VFS-layer and have the time/energy
> to try something like this?
>
> Al? What do you think?

What will protect your generation counts during the operation itself?
->i_sem?

If anything, I'd suggest doing it as
	cretinous_rename(dir_fd, name1, name2)
with the following semantics:

	* if directory had been changed since open() that gave us dir_fd -
-EFOAD
	* otherwise, rename name1 to name2 (no cross-directory renames here).

No need to expose generation counts to userland - we can just compare the
count at open() time with that at operation time.  The rest can be done
in userland (including creation of files).

We _definitely_ don't want to put "UTF-8 case-insensitive comparison" anywhere
near the kernel - it's insane.  If samba wants it, they get to pay the price,
both in performance and keeping butt-ugly code (after all, the goal of project
is to imitate butt-ugly system for butt-ugly clients).  The same goes for Wine.

And we really don't want to encourage those who port Windows userland in
not fixing the idiotic semantics.  As for Lindows... let's just say that
I can't find any way to describe what I really think of those clowns, their
intellect and their morals that wouldn't lead to a lawsuit from them.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402171153460.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 20:15:52 GMT
Message-ID: <fa.j4s4r5c.1ei8lgs@ifi.uio.no>

On Tue, 17 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> What will protect your generation counts during the operation itself?
> ->i_sem?

Yes. You have to take it anyway, so why not?

> If anything, I'd suggest doing it as
> 	cretinous_rename(dir_fd, name1, name2)
> with the following semantics:
>
> 	* if directory had been changed since open() that gave us dir_fd -
>   -EFOAD
> 	* otherwise, rename name1 to name2 (no cross-directory renames here).

Sure, that works.

> No need to expose generation counts to userland - we can just compare the
> count at open() time with that at operation time.  The rest can be done
> in userland (including creation of files).

Note that I'm not sure we would expose generation counts at all to user
space: we might keep all of this inside the "crapola windows behaviour"
module, and user space could actually see some easier highlevel interface.
Something like yours, but I suspect we'd want to see what the whole
user-level loop would look like to know what the architecture should be
like.

I do believe we'd need to have some way to "refresh" the fd in your
example, without restarting the whole lookup. So that when the user gets
EFOAD, it can do

	refresh(fd);
	readdir(fd);
	/* Check that nothing clashes */
	goto try_again;

or similar. So the generation count _semantics_ would be exposed, even if
the numbers themselves would be hidden inside the kernel.

> We _definitely_ don't want to put "UTF-8 case-insensitive comparison" anywhere
> near the kernel - it's insane.  If samba wants it, they get to pay the price,
> both in performance and keeping butt-ugly code (after all, the goal of project
> is to imitate butt-ugly system for butt-ugly clients).  The same goes for Wine.

I agree. We'd need to let user space do the equality comparisons, I just
don't see how to sanely do it in kernel land.

> And we really don't want to encourage those who port Windows userland in
> not fixing the idiotic semantics.  As for Lindows... let's just say that
> I can't find any way to describe what I really think of those clowns, their
> intellect and their morals that wouldn't lead to a lawsuit from them.

Heh.

I suspect most people don't care that much, but I also suspect that
projects like samba have to have a "anal mode" where they really act like
Windows, even when it's "wrong". People can then choose to say "screw that
idiocy", but by just _having_ a very compatible mode you deflect a lot of
criticism. Regardless of whether people want the anal mode or not in real
life.

Backwards compatibility is King. It's _hugely_ important. It's one of the
most important things to me in the kernel, and by the same logic I do see
that it is important to others as well - even when the backwards
compatibility ends up being inherited from a broken Windows setup. So
while I hate case-insensitive names, I do understand that people want to
have some way to emulate the braindamage for some _really_ "ass-backwards"
compatibility reasons.

So I think it's worth some pain, as long as we keep that compatibility
from starting to encrust the _good_ stuff.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402171221120.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 20:30:23 GMT
Message-ID: <fa.j3buqda.1c26k8u@ifi.uio.no>

On Tue, 17 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> > 	refresh(fd);
>
> lseek(fd, 0, 0);

Yes. We can make that implicitly refresh, I'm certainly ok with that.

> > I suspect most people don't care that much, but I also suspect that
> > projects like samba have to have a "anal mode" where they really act like
> > Windows, even when it's "wrong". People can then choose to say "screw that
> > idiocy", but by just _having_ a very compatible mode you deflect a lot of
> > criticism. Regardless of whether people want the anal mode or not in real
> > life.
>
> Umm...  Samba deals with Windows clients.  Windows software allegedly being
> ported to Linux is a different story and in that case there's no excuse for
> demanding case-insensitive operations.

"wine". It's not porting, it's emulation.

But yes, I agree, I don't see any other cases where we want it.

We basically want to support broken clients - whether they be on the other
side of the network, or the other side of an emulation interface. That is
the only valid reason to do this crap.

It's a fairly sizeable reason, though. On another front ("World
Domination, Fast!") we'll try to fix the problem another way, but there's
nothing wrong with fighting on multiple fronts if you have the man-power.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402171314320.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 21:25:36 GMT
Message-ID: <fa.j4c0q5d.1d24kgr@ifi.uio.no>

On Tue, 17 Feb 2004, Robin Rosenberg wrote:
>
> On Tuesday 17 February 2004 17.57, Linus Torvalds wrote:
> [case-insanesititvity proposal ///]
> > See where I'm going? Would this be acceptable to you? Are there any samba
> > people who are knowledgeable about the VFS-layer and have the time/energy
> > to try something like this?
>
> So the same guy that strongly insist that a file is a string of bytes and nothing else,
> now thinks it is sane to even think of "case" of a byte. That's impossible unless you
> actually DO believe its a bunch of characters.  What is it?

Which part of my argument don't you understand?

The kernel proper thinks it's just a stream of bytes, and all the existing
interfaces do likewise.

But we'd have a kernel helper module to let samba do what it already does
now, except help it do so more efficiently?

The fact that _I_ think pathnames are just a nice stream of bytes sadly
doesn't make Windows clients do the same. Some day when I'm King Of The
World, and I can outlaw windows clients, we'll finally get rid of the
braindamage, but until then I'm pragmatic enough to say "let's help out
the poor samba people who have to deal with the crap day in and day out".

What's your problem with that?

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402171531570.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 23:46:26 GMT
Message-ID: <fa.j5ceqla.1e22k0u@ifi.uio.no>

On Wed, 18 Feb 2004 tridge@samba.org wrote:
>
> I think you're making it sound much harder than it really is.

I think I'm just making the mistake of assuming that anybody would care to
do it "right", while everybody really only cares to get it be compatible
with Windows.

For example, if you only want to be compatible with Windows, you don't
have to worry about UCS-4, you only have the UCS-2 part, which means that
you can do a silly array-lookup based thing or something.

> We just add a VFS hook in the filesystems. The filesystem chooses the
> encoding specific comparison function. If the filesystem doesn't
> provide one then don't do case insensitivity. If the filesystem does
> provide one (for example NTFS, JFS) then use it. Then all I need to do
> is convince one of the filesystem maintainers to add a mount time
> option to specify the case table (for example by specifying the name
> of a file in the filesystem that holds it).

Ugh. What a horrible kludge, and it won't work without "preparing" the
filesystem at mount-time. I'd much rather leave the translation table in
user space, and just give it as an argument to the "look up case
insensitive" special thing.

That would mean that we can hold the directory semaphore over the whole
thing, which would simplify _my_ kludge, since there would be no need to
worry about user space having separate stages.

The hard part would be negative dentries. We'd have to invalidate all
"case-insensitive" negative dentries when creating any new file in a
directory, and that would be something the generic VFS layer would have to
know about, and that might be unacceptable to Al.

		Linus


Newsgroups: fa.linux.kernel
From: hpa@zytor.com (H. Peter Anvin)
Subject:  Re: UTF-8 and case-insensitivity
Original-Message-ID:  <c0uj52$3mg$1@terminus.zytor.com>
Date: Wed, 18 Feb 2004 02:39:19 GMT
Message-ID: <fa.hllao6j.uleibv@ifi.uio.no>

Followup to:  <16434.41376.453823.260362@samba.org>
By author:    tridge@samba.org
In newsgroup: linux.dev.kernel
>
>  > I don't know how Windows does it, so maybe this thing is hardcoded, and
>  > you don't even want "true" case insensitivity.
>
> NTFS has a 128k table on disk, created at mkfs time and indexed by the
> UCS2 character.

So you're hosed if anyone uses characters outside the UCS-2 character
set...

> The interesting thing about this table is that it doesn't seem to
> vary between different locales as one might expect. I have checked 3
> locales so far (Swedish, Japanese and English) and all have the same
> 128k table. I should check a few more locales to see if it really is
> the same everywhere. Contact me off-list if you have a NTFS
> filesystem created in a different locale and would be willing to run
> a test program against it to see if the table is different from the
> one we have in Samba.

There is a "standard" table, which is published by the Unicode
consortium.  However, the "standard" table isn't what you want in
certain locales, e.g. Turkish.

> There is stuff in the charset handling of every locale that does vary
> in windows, but it isn't the case table, its the "valid characters"
> map used to determine what characters are allowed when converting
> strings into legacy multi-byte encodings. Even I don't think that the
> kernel will ever have to deal with that crap unless someone is foolish
> enough to port Samba into the kernel (several people have actually
> done that despite the insanity of the idea, but they all did an
> absolutely terrible job of it and certainly didn't take care to get
> all the charset handling right).
>
> > How "correct" is Windows?
>
> from my rather limited point of view I always have to assume that
> windows is "correct", unless I can show that its behaviour leads to
> data loss, a security hole or something equally extreme.

Well, we don't want to support a bunch of hacks to make it behave like
Windows if what Windows does doesn't make sense.  If so you should use
a metalayer where you canonicalize the filenames and don't store
"Makefile" on the disk; store "makefile" and keep the "real" filename
stashed elsewhere, perhaps an EA.

	-hpa



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402171859570.2686@home.osdl.org>
Date: Wed, 18 Feb 2004 03:07:42 GMT
Message-ID: <fa.j5curtk.1e2mmok@ifi.uio.no>

On Wed, 18 Feb 2004, H. Peter Anvin wrote:
>
> Well, we don't want to support a bunch of hacks to make it behave like
> Windows if what Windows does doesn't make sense.

I'd disagree, for a very simple reason: case-insensitivity itself simply
does not make sense, so the _only_ reason for having a bunch of hacks is
literally to support windows file exports and nothing else.

I obviously agree with the fact that we should _not_ put those hacks into
the VFS layer proper - we should keep them as a separate thing, and we
should make it clear that it makes no sense _except_ for Windows
compatibility.

Think of it as nothing more than a binary compatibility layer, the same
way we have hooks to support "lcall 7,0" for binary compatibility with
some silly (and much less interesting) x86 OSes through external modules.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402171919240.2686@home.osdl.org>
Date: Wed, 18 Feb 2004 03:31:04 GMT
Message-ID: <fa.j3sqqtk.1diinok@ifi.uio.no>

On Tue, 17 Feb 2004, H. Peter Anvin wrote:
>
> Well, this is also true :)  I still say it belongs in userspace.

The thing is, I do agree with Tridge on one simple fact: it's very hard
indeed to do atomic file operations from user space.

That's not necessarily a problem if samba is the only process accessing
the directories in question, since then samba could do all locking
internally and make sure that it never does anything inconsistent.

However, clearly people who run samba on a machine want to potentially
_also_ export that same filesystem as a NFS volume, as a way to have both
Windows and UNIX clients access the same data. And that pretty much means
that other people _will_ access the directories, and that samba can't do
its internal locking in that kind of environment.

This is why I am sympathetic to the need to add _some_ kind of support
for this. And the only common place ends up being the kernel.

> For 100% bug-compatibility with Windows, though, it is probably
> worthwhile to have the filename in the native filesystem be not what a
> Windows user would see, but rather the normalized filename.  That makes
> a userspace implementation much easier.

Oh, absolutely. But that's something that samba can easily do internally:
it can choose to just entirely ignore filenames that aren't normalized, or
it can export it on the wire (obviously in the normalized UCS-2 format),
and just consider non-normalized names to be another "case". In fact,
that's what the naive implementation would do anyway, so that's not any
added complexity.

(And samba clearly _cannot_ show the client a non-normalized name per se,
since the smb protocol ends up using UCS-2).

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: RE: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402171619010.2154@home.osdl.org>
Date: Wed, 18 Feb 2004 00:29:54 GMT
Message-ID: <fa.j2s4q5i.1ci8kgm@ifi.uio.no>

On Tue, 17 Feb 2004, Robert White wrote:
>
> OK, so I wrote the below, but then in the summary I realized that there was
> a significant factor that doesn't fit in with the rest of the post.  Case
> insensitivity, and more generally locale equivalence rules, is a security
> nightmare.  Consider the number of different file names that "su" could map
> to if you apply case insensitivity (4) and/or worse yet the various accents
> and umlats (?,etc) that sort-equivalent for "u" in some locales.  The user
> types "su" and runs "S(u-umlat)" etc.

This is but one reason why I will _refuse_ to make case insensitivity
magically start happening on regular "open()" etc calls.

You'd literally have to use a _different_ system call to do a
case-insensitive file open. Exactly because anything else would be very
confusing to existing apps (and thus be potential security holes).

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402181422180.2686@home.osdl.org>
Date: Wed, 18 Feb 2004 22:27:10 GMT
Message-ID: <fa.j3cor5e.1c2gn0g@ifi.uio.no>

On Thu, 19 Feb 2004 tridge@samba.org wrote:
>
> The second basic fact that I think is relevant is that its not
> possible to do case-insensitive filesystem operations efficiently
> without the filesystem having knowledge of the fact that you want a
> case-insensitive lookup.

That's not my problem. That is _your_ problem, and I don't care. I
disagree violently with the notion that we would push this down to a
filesystem level.

Sorry, but there are limits to how much we care about broken operating
systems.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402181427230.2686@home.osdl.org>
Date: Wed, 18 Feb 2004 22:31:09 GMT
Message-ID: <fa.j3ser5j.1di6n0l@ifi.uio.no>

On Wed, 18 Feb 2004, Linus Torvalds wrote:
>
> That's not my problem. That is _your_ problem, and I don't care. I
> disagree violently with the notion that we would push this down to a
> filesystem level.
>
> Sorry, but there are limits to how much we care about broken operating
> systems.

Side note: this only matters for cold cache entries anyway, so I doubt
you'll see any performance improvement on a file server from passing the
brain damage down to the lower levels.

And I bet the performance advantages of _not_ doing native case
insensitivity are likely to dominate hugely.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402181511420.18038@home.osdl.org>
Date: Wed, 18 Feb 2004 23:21:50 GMT
Message-ID: <fa.iggdfbj.1l008a1@ifi.uio.no>

On Thu, 19 Feb 2004 tridge@samba.org wrote:
>
>  > Why do you focus on linear directory scans?
>
> Because a large number of file operations are on filenames that don't
> exist. I have to *prove* they don't exist.

And you only need to do that ONCE per name.

There is zero reason to do it over and over again, and there is zero
reason to push case insensitivity deep into the filesystem.

Have you checked how many filesystems we have? Hint:

	ls -l fs/ | grep '^d' | wc

The thing is, you have to realize that Windows-compatibility is very very
much second-class. If you refuse to realize that, you can't argue
effectively, because you are arguing for things that simply WILL NOT
happen.

So instead of having this crazy windows-centric idea, I would suggest you
try to come up with ways to make it easier for you. I can tell you already
that it won't be everything you want or need, but quite frankly, your
choice is between _nada_ and something reasonable.

So give it up. We're not making the same STUPID mistakes that Microsoft
has done.

		Linus



Newsgroups: fa.linux.kernel
From: "Theodore Ts'o" <tytso@mit.edu>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <20040219024426.GA3901@thunk.org>
Date: Thu, 19 Feb 2004 02:47:23 GMT
Message-ID: <fa.e4bve95.img22f@ifi.uio.no>

On Thu, Feb 19, 2004 at 12:01:53PM +1100, tridge@samba.org wrote:
> The problem is that Samba isn't the only program to be accessing these
> directories. Multi-protocol file servers and file servers where users
> also have local access are common. That means we can't assume that
> some other filesystem user hasn't created a file which matches in a
> case-insensitive manner. That means we need to do an awful lot of
> directory scans.

Actually, not necessarily.  What if Samba gets notifications of all
filename renames and creates in the directory, so that after the
initial directory scan, it can keep track of what filenames are
present in the directory?  It can then "prove the negative", as you
put it, without having to continuously do directory scans.

Yeah, there can be some race conditions, but Samba already has to deal
with the race condition where it tries to create "MaKeFiLe" either
just before or just after a Posix process creates "Makefile".

						- Ted


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402190759550.1222@ppc970.osdl.org>
Date: Thu, 19 Feb 2004 16:06:50 GMT
Message-ID: <fa.gttvuug.9iirpk@ifi.uio.no>

On Thu, 19 Feb 2004, Jamie Lokier wrote:

> Linus, while I agree with you wholeheartedly on everything else in
> this thread - how can Samba only do that lookup ONCE per name if a
> client is issuing many requests for non-existent opens or stats?

While I'm not willing to push case insensitivity deep into the
filesystems, I _am_ willing to entertain the notion of an extra flag to a
dcache entry that the regular VFS operations ignore (apart from clearing
it when they change anything and having to flush them under some
circumstances), which would basically be "this dentry has been judged
unique in a case-insensitive environment".

So assuming nobody else is touching the directory, the case-insensitive
special module could create these kinds of dentries to its hearts content
when it does a lookup.

> Example: A client has a search path for executables or libraries.
>
> Each time SomeThing.DLL is looked up by the client, it will issue an
> open() for each entry in the path, until it finds the file it wants.
>
> For each request, Samba must readdir() every directory in the path
> until the file is found.
>
> If a directory doesn't change between requests, Samba can use dnotify
> to cache the negative lookups.
>
> However, if any change occurs in a directory, or if the directory is
> not dnotify-capable, Samba is not allowed to cache these negative
> results: It has to do the readdir() for _every_ request.

But this is exactly what I _am_ willing to entertain: have some limited
special logic inside the kernel (but outside the VFS layer proper), that
allows samba to use special interfaces that avoids this.

For example, the rule can be that _any_ regular dentry create will
invalidate all the "case-insensitive" dentries. Just to be simple about
it. But if samba is the only thing that accesses a certain directory (or
the directory is not written to, like / and /usr etc usually behave), the
"windows hack" interface will be able to populate it with its fake
dentries all it wants.

Or something like this. Basically, I'm convinced that the problem _can_ be
solved without going deep into the VFS layer. Maybe I'm wrong. But I'd
better not be, because we're definitely not going to screw up the VFS
layer for Windows.

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <Pine.LNX.4.58.0402190853500.1222@ppc970.osdl.org>
Date: Thu, 19 Feb 2004 16:51:48 GMT
Message-ID: <fa.gttnuua.9i6rpu@ifi.uio.no>

On Thu, 19 Feb 2004, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > For example, the rule can be that _any_ regular dentry create will
> > invalidate all the "case-insensitive" dentries. Just to be simple about
> > it.
>
> If that's the rule, then with exactly the same algorithmic efficiency,
> readdir+dnotify can be used to maintain the cache in userspace
> instead.  There is nothing gained by using the helper module in that case.

Wrong.

Because the dnotify would trigger EVEN FOR SAMBA OPERATIONS.

Think about it. Think about samba doing a "rename()" within the directory.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <Pine.LNX.4.58.0402191124080.1270@ppc970.osdl.org>
Date: Thu, 19 Feb 2004 19:46:01 GMT
Message-ID: <fa.grtpve9.bi4r9r@ifi.uio.no>

Ok,
 I think I've got it. Here's an algorithm that will have "perfect"
behaviour under normal circumstances as long as you've got enough memory.

Admittedly the "you've got enough memory" part is a downside, but it's so
_damn_ clean and simple that it is, I think, a reasonable trade-off.
Besides, if you want good file serving numbers, you'd better have enough
memory anyway.

Basic approach: add two bits to the VFS dentry flags. That's all that is
needed. Then you have two new system calls:

 - set_bit_one(dirfd)
 - set_bit_two_if_one_is_set(dirfd);
 - check_or_create_name(dirfd, name, case_table_pointer, newfd);

The VFS rule is:
 - all new dentries start off with the two magic bits clear
 - whenever we shrink a dentry, we clear the two magic bits in the parent

and that is _all_ the VFS layer ever does. Even Al won't find this
obnoxious (yeah, we might clear the bits after a timeout on things that
need re-validation, but that's in the noise).

The "set_bit_one()" system call will set one of the magic bits (with the
dcache lock held) in the dentry that is pointed to by the file descriptor.
Nothing more.

The "set_bit_two_if_one_is_set()" system call will set the _other_ magic
bit (with the dcache lock held) in the dentry, if the first bit is set.
Otherwise it will just return.

Let's leave the "check_or_create_name()" thing for now, and see how we can
use this in user space (and realize that we only do this on cache failure,
so this is the "slow case"):

	set_bit_one(dir);
	lseek(dir, 0, SEEK_SET);
	while (readdir(dir, de)) {
		stat(de->d_name);
		.. might also compare the name here with whatever it is
		   working on right now..
	}
	set_bit_two_if_one_is_set(dirfd);

Notice what the above does? After the above loop, bit two will be set IFF
the dentry cache now contains every single name in the directory.
Otherwise it will be clear. Bit two will basically be a "dcache complete"
bit.

Now, let's go to "check_or_create_name()", which can thus do:

 - for each name in the dcache name list, compare the dang thing
   without case.
 - return "lookup succeeded" (the file descriptor of the thing it
   successfully looked up) if a match with a positive dentry occurs.
 - check bit two, return -ENOCACHE if it was clear.
 - create the new dentry with the new name and the new file descriptor
   inode, and return success.

Notice? Basically _ZERO_ changes to the VFS layer, together with basically
perfect hot-cache-case behaviour.

Yeah, yeah, the above is probably glossing over a lot of issues (there's a
race if somebody does both the "readdir loop" and the "create" case at the
same time, so that would need a lock around it in user space, but please
realize that the readdir loop only happens if the "check_or_create()"
thing fails, so the readdir loop should basically never happen in the
hot-cache case.

And the above allows perfect behaviour even for new filenames that we have
never seen before (ie a create of a new file with a random name). At least
as long as the dcache for that directory remains "complete" (which it will
do, until the kernel needs to throw something out).

Am I a super-intelligent bastard, or am I a complete nincompoop? You
decide.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <Pine.LNX.4.58.0402191150120.1270@ppc970.osdl.org>
Date: Thu, 19 Feb 2004 19:50:21 GMT
Message-ID: <fa.gsde065.b2gqhv@ifi.uio.no>

On Thu, 19 Feb 2004, Linus Torvalds wrote:
>
> Basic approach: add two bits to the VFS dentry flags. That's all that is
> needed. Then you have two new system calls:
                        ^^^
>  - set_bit_one(dirfd)
>  - set_bit_two_if_one_is_set(dirfd);
>  - check_or_create_name(dirfd, name, case_table_pointer, newfd);

 [ deletia ]

> Am I a super-intelligent bastard, or am I a complete nincompoop? You
> decide.

I think my lack of counting ability basically answers that question.

Damn.

		Linus "complete nincompoop" Torvalds


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <Pine.LNX.4.58.0402191202350.1439@ppc970.osdl.org>
Date: Thu, 19 Feb 2004 20:01:50 GMT
Message-ID: <fa.gtdptug.a2kqpk@ifi.uio.no>

On Thu, 19 Feb 2004, H. Peter Anvin wrote:
>
> How about a compomise - super-intelligent complete nincompoop bastard?

Ok, but in the meantime I think I can save face by saying that you only
need two system calls, by simply making a "lseek(fd, 0, SEEK_SET)"
implicitly set the first bit. So then the "set second bit if first is set"
just becomes a "dcache fill complete" notifier.

So I'll take half credit.

		Linus "super-complete bastard" Torvalds


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <Pine.LNX.4.58.0402191217050.1439@ppc970.osdl.org>
Date: Thu, 19 Feb 2004 20:20:17 GMT
Message-ID: <fa.grtpu6l.bikqhh@ifi.uio.no>

On Thu, 19 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:

> On Thu, Feb 19, 2004 at 11:48:50AM -0800, Linus Torvalds wrote:
> > The VFS rule is:
> >  - all new dentries start off with the two magic bits clear
> >  - whenever we shrink a dentry, we clear the two magic bits in the parent
> >
> > and that is _all_ the VFS layer ever does. Even Al won't find this
> > obnoxious (yeah, we might clear the bits after a timeout on things that
> > need re-validation, but that's in the noise).
>
> > Notice what the above does? After the above loop, bit two will be set IFF
> > the dentry cache now contains every single name in the directory.
> > Otherwise it will be clear. Bit two will basically be a "dcache complete"
> > bit.
>
> What about dentry getting dropped in the middle of that loop _and_
> another task setting the first bit again before the loop ends?

Hey, you snipped the part where I said that the application has to have
its own locking around the loop and around the lookup to avoid races.

We can avoid that requirement by using sequence numbers and making it a
bit more complex, but the simple version was for samba only (ie "only one
app that wants this").

Realize that none of this makes the internal kernel (or filesystem) data
structures be wrong, so even if the app has a bug and doesn't do the right
locking, at worst that just results in problems for that application, not
for the rest of the system.

But yes, if we want to make others use this, we'd need to have the kernel
actually support some kind of locking, probably by just making the whole
readdir loop be inside the kernel itself (at which point we can use the
inode semaphore for this).

The "dcache full" bit could be potentially useful regardless of any
case-ignorant operating system emulation crap, although I don't see any
really obvious applications (we could speed up regular "readdir()", but we
don't have the d_offset thing, so..)

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <Pine.LNX.4.58.0402191226240.1439@ppc970.osdl.org>
Date: Thu, 19 Feb 2004 20:32:23 GMT
Message-ID: <fa.gstnuek.aimq9g@ifi.uio.no>

On Thu, 19 Feb 2004, Linus Torvalds wrote:
> >
> > What about dentry getting dropped in the middle of that loop _and_
> > another task setting the first bit again before the loop ends?
>
> Hey, you snipped the part where I said that the application has to have
> its own locking around the loop and around the lookup to avoid races.

[ That, btw, implies that we do need to make the "set bit one" a system
  call of its own, so that somebody elses "lseek(fd, 0, SEEK_SET)" wouldn't
  mess up. Mea culpa. ]

Anyway, if we're willing to make some other changes to the VFS layer, we
could make all of this a bit more efficient by _not_ requiring the actual
filesystem lookup to take place.

If we had a flag that allowed a dentry to not have a d_inode pointer, but
still _not_ be considered automatically negative, then we could just make
a loop that fills the dcache directly from the readdir() data inside the
kernel, without calling down to the filesystem to look up the inode.

That would save a _lot_ of memory - quite often we'd only need the dentry
itself.

That would require a third bit in the VFS dentry flags (something like
D_DENTRY_LIKELY_POSITIVE), and would require that "d_lookup()" not just
assume that a dentry without an inode is always negative (check the new
flag, and if so, do the filesystem lookup when the lookup actually
happens).

Doesn't look _too_ bad, and considering the potential memory savings (and
not having to seek around the disk to look up the inode data), it would
probably be worth thinking about at least as a "second stage".

So then we could have a dcache that is fully populated, even though the
actual inode data hasn't been loaded yet.

Comments?

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <20040219204515.GG31035@parcelfarce.linux.theplanet.co.uk>
Date: Thu, 19 Feb 2004 20:48:31 GMT
Message-ID: <fa.n3cpb2p.1gh6o3d@ifi.uio.no>

On Thu, Feb 19, 2004 at 12:32:55PM -0800, Linus Torvalds wrote:
> Anyway, if we're willing to make some other changes to the VFS layer, we
> could make all of this a bit more efficient by _not_ requiring the actual
> filesystem lookup to take place.
>
> If we had a flag that allowed a dentry to not have a d_inode pointer, but
> still _not_ be considered automatically negative, then we could just make
> a loop that fills the dcache directly from the readdir() data inside the
> kernel, without calling down to the filesystem to look up the inode.
>
> That would save a _lot_ of memory - quite often we'd only need the dentry
> itself.

> So then we could have a dcache that is fully populated, even though the
> actual inode data hasn't been loaded yet.
>
> Comments?

*Ugh*

	That will cause all sorts of nastiness for filesystems that _have_
case-insensitive lookups.  Remember the crap we had to deal with to avoid
multiple dentries for directory?  It will come back, AFAICS.

	Another thing I really don't like is that we now get real lookups
on hashed dentry.  That potentially changes a lot and can lead to very
interesting results for some filesystems.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <Pine.LNX.4.58.0402191255540.1439@ppc970.osdl.org>
Date: Thu, 19 Feb 2004 21:26:29 GMT
Message-ID: <fa.gudnv6j.92mrhj@ifi.uio.no>

On Thu, 19 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> > So then we could have a dcache that is fully populated, even though the
> > actual inode data hasn't been loaded yet.
> >
> > Comments?
>
> *Ugh*
>
> 	That will cause all sorts of nastiness for filesystems that _have_
> case-insensitive lookups.  Remember the crap we had to deal with to avoid
> multiple dentries for directory?  It will come back, AFAICS.


No no. Look at how this works:
 - only one dentry actually exists. It is marked "tentative", which means
   that nobody will use it as-such without doing a lookup on it. It has
   zero impact on aliases etc, because it's really just a place-holder: it
   doesn't point to any inodes at all, it only says "there may or may not
   be a file here"

   NOTE! This dentry is in no way case-insensitive. It happens to have
   _exactly_ the contents (and hash) that the readdir entry had, but it
   has no meaning outside of that.

 - each caller of "__d_lookup()" will have to check if it's a tentative
   dentry and basically ignore it if so.

   There aren't that many of them, and I think it all comes together in
   "do_lookup()", which may be the _only_ place that actually cares right
   now. Look at how that works right now:

		dentry = __d_lookup(..);
		if (!dentry)
			goto needs_lookup;	/* This case will allocate a whole
					new dentry and use that for lookup */

		/* NEW CASE! */
		if (dentry->d_flags & D_TENTATIVE)
			goto needs_lookup_with_this_dentry;
	done:
		path->mnt = mnt;
		path->dentry = dentry;
		return 0;

	/*
	 * NEW CASE!!
	 *
	 * Unhash the tentative one, and look up a real one.
	 */
	needs_lookup_with_this_dentry:
		d_drop(dentry);
		dentry = NULL;

	/* OLD REGULAR CASE */
	needs_lookup:
		...

In other words, neither the low-level filesystem NOR anything else really
ever sees the tentative dentry (the above is the really stupid approach: a
slightly more clever one will avoid the "real_lookup()" alloc_dentry()
thing and just use the tentative dentry after having unhashed it and
verified that it's the only user).

See? Nobody actually ever sees the "raw dentry". They all go through
__d_lookup(), and the rule would be:

 - if "d_lookup()" sees a tentative dentry, it will just unhash it and
   drop it (it has the dcache lock, so it can do that)
 - all callers of "__d_lookup()" will have to check for D_TENTATIVE, and
   decide what to do with it. I think there are exactly _three_ callers,
   and one of them is d_lookup() itself.

See? Very very minimal impact that I can see (really, the biggest part
would be to do the dentry re-use in the better version of "do_lookup()" -
that would mean some re-organization, but maybe that optimization isn't
even worth it).

Or did I miss anything?

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <Pine.LNX.4.58.0402191334410.1439@ppc970.osdl.org>
Date: Thu, 19 Feb 2004 21:42:57 GMT
Message-ID: <fa.gttjumi.9iuq1i@ifi.uio.no>

On Thu, 19 Feb 2004, Linus Torvalds wrote:
>
> No no. Look at how this works:
>  - only one dentry actually exists.

That was really badly phrased. There can be _millions_ of these things,
but they are all "unique" - they have zero impact on each other, and have
no linkages. They never shadow any existing dentries (ie when we create
these, we'd obviously never create a tentative dentry with the same name
as an existing _valid_ dentry), and they are never visible to the
filesystem.

So it's not that "only one dentry" exists, but that that this tentative
dentry only exists as a unique marker of "a dentry of this name _may_
exist".

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <Pine.LNX.4.58.0402191340080.1439@ppc970.osdl.org>
Date: Thu, 19 Feb 2004 21:47:23 GMT
Message-ID: <fa.gru1uue.bicrpm@ifi.uio.no>

On Thu, 19 Feb 2004, Linus Torvalds wrote:
>
> See? Nobody actually ever sees the "raw dentry". They all go through
> __d_lookup(), and the rule would be:
>
>  - if "d_lookup()" sees a tentative dentry, it will just unhash it and
>    drop it (it has the dcache lock, so it can do that)
>  - all callers of "__d_lookup()" will have to check for D_TENTATIVE, and
>    decide what to do with it. I think there are exactly _three_ callers,
>    and one of them is d_lookup() itself.

Actually, I've got a better setup: instead of having a D_TENTATIVE flag in
the dentry flags, just do

	#define TENTATIVE_INODE ((struct inode *) 1)

and just have "dentry->d_inode = TENTATIVE_INODE" for the dentries that
were filled directly from "readdir()" data.

This not only avoids using a bit in the dentry flags, but it pretty much
guarantees that everybody is forced to use them correctly. It would be
very hard to have a buggy user: the dentry will clearly not be a negative
dentry (since d_inode is not NULL), but if anybody ever uses it as a
positive dentry, you'll get a nice and immediate oops.

So we'd see very quickly if these tentative dentries were to escape
outside of __d_lookup().

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <Pine.LNX.4.58.0402191349440.1439@ppc970.osdl.org>
Date: Thu, 19 Feb 2004 21:54:31 GMT
Message-ID: <fa.gttpuun.9ikrpv@ifi.uio.no>

On Thu, 19 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> On Thu, Feb 19, 2004 at 01:45:32PM -0800, Linus Torvalds wrote:
> > So we'd see very quickly if these tentative dentries were to escape
> > outside of __d_lookup().
>
> Ahem...  You'll see them (at least) in dcache pruning codepaths.  And
> those will dereference inodes...

Yea, you be right. Many of those paths would not need to care about
TENTATIVE at all, so using the d_inode thing would make them uglier, I
agree. Maybe the flag is better after all (and it really should be pretty
well contained by just checking all __d_lookup callers, so it should be
hard to get it wrong, but maybe I've forgotten some path).

We could do it both ways - do the TENTATIVE_INODE thing as a debugging
thing at first to make sure none of these dentries escape, and then remove
it (and the unnecessary tests in the pruning paths) once everybody is
convinced that it is working correctly.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <Pine.LNX.4.58.0402191326490.1439@ppc970.osdl.org>
Date: Thu, 19 Feb 2004 21:33:21 GMT
Message-ID: <fa.gtu3uek.9ieq9g@ifi.uio.no>

On Thu, 19 Feb 2004, Jamie Lokier wrote:
>
> Yes: The slow part of my brain thinks dnotify with a new flag
> DN_IGNORE_SELF, meaning don't notify for things done by the process
> which is watching, would provide equivalent functionality.

Basically, yes. However, I can tell you that directory name caching is
damn hard, and the kernel does it better than anybody else.

The hardest part of caching is not filling the cache - it's knowing when
to release it. In other words, forget the filling part, and think about
the replacement policy (balacing between the page cache, the directory
cache, and regular pages). The kernel already has that.

Besides, I really think that we can do this with basically just a few
lines of code in the kernel (apart from the actual case comparison, which
I'm not even going to worry about - that's totally independent of the
cache handling itself, and I don't care about how to write a
"windows_equivalent_strncasecmp()".

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <Pine.LNX.4.58.0402191607490.2244@ppc970.osdl.org>
Date: Fri, 20 Feb 2004 00:18:36 GMT
Message-ID: <fa.gue5u6g.828r1s@ifi.uio.no>

On Fri, 20 Feb 2004, Jamie Lokier wrote:
>
> Will your proposal eliminate Samba's positive cache as well?

Samba has to work on different kernels, so they'll have to have their own
code anyway. Whether they want to turn it off or not if better
alternatives are found is up to them. Right now it appears that what
Tridge wants is a WNT dcache, and since he's not going to get it, I guess
the whole discussion is moot.

> What I like about my idea is that no windows_equivalent_strncasecmp()
> needs to go into the kernel.  I.e. no need for a Samba-specific module.
>
> The other thing I like is that DN_IGNORE_SELF would be useful for
> other applications too.

I agree. It might even be acceptable not as a new flag, but as a
modification to existing behaviour. I can't imagine that a file manager is
all that interested in seeing the changes it itself does be reported back
to it. And I don't really know of any other uses of dnotify.

(That said, clearly it's better to just have a new flag, since that way
there is no possibility of anything breaking).

On the other hand, even with a nice dnotify infrastructure, you simply
_cannot_ get absolute atomicity guarantees. Because by the time you
actually execute the "mv" operation, another process may create a new file
with the "same" name (ie different name, but comparing the same ignoring
case) on another CPU. By the time you get the dnotify, it's too late, and
the move will have happened, and undoing the operation (and hiding it from
the client) may well be impossible - possibly because another process
creating a file with the old name.

NOTE! Even an in-kernel implementation fundamentally cannot fix this race
on something like NFS. So the in-kernel version would only help for local
filesystems that the kernel has exclusive write access to.

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <Pine.LNX.4.58.0402191621250.2244@ppc970.osdl.org>
Date: Fri, 20 Feb 2004 00:26:43 GMT
Message-ID: <fa.gtdtuma.b2grhq@ifi.uio.no>

On Thu, 19 Feb 2004, Linus Torvalds wrote:
>
> I agree. It might even be acceptable not as a new flag, but as a
> modification to existing behaviour. I can't imagine that a file manager is
> all that interested in seeing the changes it itself does be reported back
> to it. And I don't really know of any other uses of dnotify.

I take that back. Even a file manager may very well be interested in moves
that it does itself - most of them have some soft of multi-window view
capability, and if they use dnotify, they might well be using it to keep
the different views coherent.

So yes, a new flag would likely be required.

That said, who actually _uses_ dnotify? The only time dnotify seems to
come up in discussions is when people complain how badly designed it is,
and I don't think I've ever heard anybody say that they use it and
that they liked it ;)

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <Pine.LNX.4.58.0402191625060.2244@ppc970.osdl.org>
Date: Fri, 20 Feb 2004 00:48:17 GMT
Message-ID: <fa.gsdvume.a2mrhu@ifi.uio.no>

On Fri, 20 Feb 2004 tridge@samba.org wrote:
>
> yes, I've acknowledged that. I know you aren't going to give me the
> ideal solution, I'm just exploring how far this is from the ideal and
> trying to get a feel for how much it actually gains us compared to
> what we do now.

I suspect the only way to know that is to code something up.

The kernel side (with the full "readdir()" loop and a TENTATIVE flag etc)
is not likely to be that many lines of code, but it's definitely something
where the person who writes those lines needs to really understand the
kernel code to get anywhere at all. And it's in an "interesting" area of
the kernel, so you have to be really careful. And you'd need somebody who
is used to samba too, in order to do the path component walk side in user
space work right with the new interface. So..

I an try to see if I can write something - I'd not do the actual
comparison function, but I have the rough framework in my mind. I won't
get to it for another day or two, at _least_, though.

With that set up, getting numbers and doing a kernel profile to see where
the time goes is probably not hard - again, if you have a samba setup with
benchmarks already set up. I just don't know anybody who knows both pieces
of the puzzle..

(This, btw, was the big problem with pthreads too. The 2.6.x threading
improvements were things that had been discussed for years, but it took
until Ingo, Uli and Roland actually sat down and looked at both the user
side and the kernel side before anything really happened).

		Linus


Newsgroups: fa.linux.kernel
From: "Theodore Ts'o" <tytso@mit.edu>
Subject: Re: Eureka! (was Re: UTF-8 and case-insensitivity)
Original-Message-ID: <20040220023057.GB22545@thunk.org>
Date: Fri, 20 Feb 2004 02:33:07 GMT
Message-ID: <fa.d6o9dua.1o5as3c@ifi.uio.no>

On Thu, Feb 19, 2004 at 11:48:50AM -0800, Linus Torvalds wrote:
> Let's leave the "check_or_create_name()" thing for now, and see how we can
> use this in user space (and realize that we only do this on cache failure,
> so this is the "slow case"):
>
> 	set_bit_one(dir);
> 	lseek(dir, 0, SEEK_SET);
> 	while (readdir(dir, de)) {
> 		stat(de->d_name);
> 		.. might also compare the name here with whatever it is
> 		   working on right now..
> 	}
> 	set_bit_two_if_one_is_set(dirfd);
>
> Notice what the above does? After the above loop, bit two will be set IFF
> the dentry cache now contains every single name in the directory.
> Otherwise it will be clear. Bit two will basically be a "dcache complete"
> bit.

Why do this in user space?  The set_bit_one() and
set_bit_two_if_one_is_set() can't really be used for anything else,
really, so why not let check_or_create_name() do the above loop if
necessary to populate all of the dcache entries in the dentry cache?

That way we only expose one system call (check_or_create_name()), and
we let the internal dcache flags be an internal implementation detail.
It will also make it much easier to avoid races.

						- Ted


Newsgroups: fa.linux.kernel
From: "Theodore Ts'o" <tytso@mit.edu>
Subject: Re: UTF-8 and case-insensitivity
Original-Message-ID: <20040219140847.GA5718@thunk.org>
Date: Thu, 19 Feb 2004 14:10:56 GMT
Message-ID: <fa.e9c9cpa.lmi0i2@ifi.uio.no>

On Thu, Feb 19, 2004 at 02:20:44PM +1100, tridge@samba.org wrote:
> Currently dnotify doesn't give you the filename that is being
> added/deleted/renamed. It just tells you that something has happened,
> but not enough to actually maintain a name cache in user space.
>
> That could be changed, so that on a dnotify event you do a fcntl() to
> ask for the name of the file. Or perhaps we could cram it into the
> structure the signal handler gets passed? I doubt that would make
> sense, but maybe some signal guru can tell me otherwise. Maybe we
> could even invent a new dnotify system where you do a read on a file
> descriptor to get details on what event happened, and give some
> "everything has changed" error when you run out of buffers.

Yes, that's what I was suggesting.  One advantage of such a scheme is
that it's not just for Windows compatibility.  A more rich directory
change notification scheme would also be useful for graphical file
managers, automatic indexing tools, and many, many other applications.

No, it's not everything you were requesting, but it may very well
represent three-quarters of a loaf, instead of nothing.

> If that happened then we could build our own dcache in user space, but
> it will be a very second rate dcache, with a racy and slow update
> mechanism that will in itself chew cpu. Maybe thats the best we can
> do, or maybe I should be asking distro vendors if they would consider
> a case-insensitive patch, especially the vendors aiming for
> "enterprise" scalability which might include serving windows clients.

I don't know that the update mechanism has to seriously chew that much
CPU.  It can certainly can be designed to minimize the amount of CPU
that is consumed, especially if it is read via a file descriptor so
that multiple updates can be sent via a single read() system call,
instead of sending a signal every single time a directory entry is
created, renamed, or deleted.

The problem with a case-insentive patch is that for most modern
filesystems (i.e., any filesystem that does better than O(1) directory
searches), it will have to involve a format change, since the case
insensitivity has to be built into the hash function or the tree
comparison fucture, or both.  At this point, the filesystem author has
to make the choice of whether to try to solve the Windows-specific
problem, in which case the fundamental filesystem format would have to
be tailored to the Windows case mapping table, or try to solve the
more general I18N case mapping problem.  (Lots of luck!  It's
constantly changing over time as new character sets are added or
modified...)  Yes, a few such filesystems might have this support
already, but I doubt distributions would be willing to accept patches
that make filesystem format-incompatible changes just for the sake of
accelerating Samba operations.

I don't know if the distributions would be willing to accept a
case-insensitive patch, but my suspicions is that it would be
difficult, and I would argue that it might be more efficient to get a
richer directory change notification system, for the reasons I argued
above.

						- Ted


Newsgroups: fa.linux.kernel
From: Ingo Molnar <mingo@elte.hu>
Subject: explicit dcache <-> user-space cache coherency, 
	sys_mark_dir_clean(), O_CLEAN
Original-Message-ID: <20040220120417.GA4010@elte.hu>
Date: Fri, 20 Feb 2004 12:05:32 GMT
Message-ID: <fa.eahrsc3.g52a23@ifi.uio.no>

* Linus Torvalds <torvalds@osdl.org> wrote:

> Basic approach: add two bits to the VFS dentry flags. That's all that
> is needed. Then you have two new system calls:
>
>  - set_bit_one(dirfd)
>  - set_bit_two_if_one_is_set(dirfd);
>  - check_or_create_name(dirfd, name, case_table_pointer, newfd);

i believe Samba's problems can be solved in an even simpler way, by
using only a single bit associated with the directory dentry, and by not
putting any case-insensitivity code into the kernel. (not even as a
separate module.)

One 'user-space cache is valid/clean' bit should be enough - where all
non-Samba accesses clear the 'valid bit', and Samba sets the bit
manually.

What Samba needs is a way to tell between two points in time whether the
directory contents have changed in any way - nothing more. Only one new
syscall is used to maintain the Samba dcache:

	long sys_mark_dir_clean(dirfd);

the syscall returns whether the directory was valid/clean already.

this is how Samba name lookup would work:

repeat:
	if (sys_mark_dir_clean(dirfd)) {
		... pure user-space fast path, use Samba dcache ...
		return;
	}
	... fill Samba dcache ...
	readdir() loop

	goto repeat;

i.e. there will be two calls to sys_mark_dir_clean() in the slowpath
(the first one to set it, the second one to make sure it's still set).
Races are handled automatically by the loop.

this is how Samba could create a file atomically:

	sys_create(name, mode | O_CLEAN);

ie. the create only succeeds if the directory has not been touched since
the Samba dcache has processed it last time. O_CLEAN would be a very
simple check in the open_namei() code, it returns -ENOTCLEAN if the
parent directory has not been marked clean.

i dont think there's any need to have a case-insensitive lookup module
in the kernel - Samba has all the information through the readdir() loop
already - all it needs to know is whether that info is valid or not via
the mark_dir_clean() syscall!

the impact of sys_mark_dir_clean() and O_CLEAN is quite minimal on the
generic VFS i believe. Also, it can be used as a caching method for just
about everything that wants to have a coherent user-space cache of the
VFS namespace. Note that there's nothing about case sensitivity or
insensitivity in this approach, it still gets rid of all of the
excessive readdir()s done in the Samba fastpath.

[ To get rid of all Samba overhead in this area we might need other
  syscall variants too, like rename_if_clean() and unlink_if_clean().
  Under this scheme Samba would never have to do a stat() call of the
  target file, because it always has a coherent copy of the kernel
  dcache, for directories it choses to cache. ]

this approach differs from dnotify in a couple of key areas:

 - it's a synchronous solution that avoids signals, and is thus
   usable/robust in libraries too.

 - dnotify _forces_ action. mark_dir_clean() you can use if there's use
   and there's no overhead if the Samba workload is completely silent
   and there are only POSIX users. I.e. it should scale better than
   dnotify.

 - cache teardown can be done in userspace purely: the 'clean bit' has
   no state associated with it (unlike dnotify), so no kernel call is
   necessary to tear down state. User-space just forgets that it cached
   anything about that directory and it's done. No leaking state, and
   good scalability again.

 - but most importantly, it's fundamentally atomic for local filesystems
   and thus meets the needs of Samba in mixed POSIX/Samba workloads.

just in case anyone has followed me down to this point :-), there's yet
another, more advanced way to do the Samba-dcache fastpath 100% in
user-space:

We can export the 'directory clean bit' to userspace, via the same page
pinning and mapping techniques used by futexes. User-space could
register a 'clean bit' address via a new syscall, which the dcache then
uses from that point on. Thus there would be only a single syscall when
Samba sets up a directory cache in user-space [which needs those
readdir() calls so performance is down the drain anyway], which syscall
lets userspace register a machine-word address to serve as the
'directory is clean' flag. Userspace and kernelspace will set this flag
possibly in parallel which is not a problem as long as userspace uses
atomic ops. This approach introduces some page pinning allocation
overhead but that's easy to solve.  User-space would of course condense
the pinned range. Kernel-space would see very minimal overhead from
having the bit in an indirect pointer - at least on 64-bit systems where
all kernel RAM is mapped.

	Ingo


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: [patch] explicit dcache <-> user-space cache coherency, 
	sys_mark_dir_clean(), O_CLEAN
Original-Message-ID: <20040220180043.GL31035@parcelfarce.linux.theplanet.co.uk>
Date: Fri, 20 Feb 2004 18:02:54 GMT
Message-ID: <fa.n6t5aab.1m1qpr1@ifi.uio.no>

On Fri, Feb 20, 2004 at 02:23:52PM +0100, Ingo Molnar wrote:

> i've also attached dir-cache.c, a simple testcode for the new
> functionality. It marks the current directory clean and tries to open
> the "./1" file via O_CLEAN with 1 second delay. Start this in one shell
> and do VFS-namespace modifying ops in another window (eg. "rm -f 2;
> touch 2") and see the dir-cache code react to it - the 'clean' bit is
> lost, and the file open-create does not succeed if the directory is not
> clean.
>
> there's a new dentry flag that is maintained under the directory's i_sem
> semaphore. (It would be simpler to have the flag on the inode level,
> that way the invalidation could be done as a simple filter to the
> dnotify function.)

IMO putting that in dentry (let alone inode) is fundamentally broken.
Basically, your flag says "somebody in userland knows the contents
of directory".  So your create-if-clean is inherently racy - if we get

task A                       task B                 task C
had learnt the contents
marked clean
                             changed the contents
                                                        had learnt the contents
                                                        marked clean
did create-if-clean, assuming
its knowledge to be accurate

then A will succeed just fine.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: explicit dcache <-> user-space cache coherency, 
	sys_mark_dir_clean(),
Original-Message-ID: <Pine.LNX.4.58.0402200733350.1107@ppc970.osdl.org>
Date: Fri, 20 Feb 2004 15:50:05 GMT
Message-ID: <fa.gsttu66.aikq1i@ifi.uio.no>

On Fri, 20 Feb 2004, Ingo Molnar wrote:
>
> One 'user-space cache is valid/clean' bit should be enough - where all
> non-Samba accesses clear the 'valid bit', and Samba sets the bit
> manually.

Yes, that, together with O_CLEAN would work.

The problem is that you'd still need other system calls: it's not like
open(O_CREAT) is the only way to create a file. So you'd have to add
versions of "link()" etc, which means that O_CLEAN is really pretty
pointless, and you might as well just do it in a new system call.

Your version is also not multi-threaded: you can never allow more than one
thread doing the "sys_mark_dir_clean()". That was the reason for having
two bits: so that anybody can do a lookup in parallel, and only the
"filldir" part needs to be serialized.

So I do believe you'd want two bits anyway.

		Linus


Newsgroups: fa.linux.kernel
From: Ingo Molnar <mingo@elte.hu>
Subject: Re: explicit dcache <-> user-space cache coherency, 
	sys_mark_dir_clean(), O_CLEAN
Original-Message-ID: <20040220170438.GA19722@elte.hu>
Date: Fri, 20 Feb 2004 17:06:27 GMT
Message-ID: <fa.enp860k.1r5qhq4@ifi.uio.no>

* Linus Torvalds <torvalds@osdl.org> wrote:

> Your version is also not multi-threaded: you can never allow more than
> one thread doing the "sys_mark_dir_clean()". That was the reason for
> having two bits: so that anybody can do a lookup in parallel, and
> only the "filldir" part needs to be serialized.
>
> So I do believe you'd want two bits anyway.

hm, right. So for the lookup to be lockless, it would have to be managed
via a syscall variant similar in mechanism to the two-bit approach you
suggested:

	ret = sys_manage_dir_cache(fd, op);

where the following cache states are defined:

	(invalid, refill_in_progress, valid)

the following type of cache ops are defined:

	(lookup, cache_filled)

the semantics of the sys_manage_dir_cache() syscall are the following:

- op is 'lookup': the syscall returns 'valid' if state is valid. If the
  state is 'refill_in_progress' then lookup returns refill_in_progress.
  If the state is 'invalid', then the state goes to 'refill_in_progress'
  and 'invalid' is returned.

- op is 'cache_filled': the syscall moves the state to 'valid' if state
  is still 'refill_in_progress'. It goes to 'refill_in_progress' if the
  state was 'invalid'.

the kernel does the valid->invalid and refill_in_progress->invalid
transitions automatically, when relevant VFS events occur. All dentries
start out in state invalid.

there's another class of problems: is it an issue that directory renames
that move this directory (higher up in the directory hierarchy of this
directory) do not invalidate the cache? In that case there's no dnotify
event either.

	Ingo


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: explicit dcache <-> user-space cache coherency, 
	sys_mark_dir_clean(),
Original-Message-ID: <Pine.LNX.4.58.0402200911260.2533@ppc970.osdl.org>
Date: Fri, 20 Feb 2004 17:16:46 GMT
Message-ID: <fa.gsubue0.bi6q9k@ifi.uio.no>

On Fri, 20 Feb 2004, Ingo Molnar wrote:
>
> there's another class of problems: is it an issue that directory renames
> that move this directory (higher up in the directory hierarchy of this
> directory) do not invalidate the cache? In that case there's no dnotify
> event either.

This is one of the reasons why I worry about user-space caching. It's just
damn hard to get right.

It's hard in kernel space too, of course, but we've had smart people
working on the dcache for years. So if we can sanely avoid duplication,
that would be a good thing.

		Linus


Newsgroups: fa.linux.kernel
From: Ingo Molnar <mingo@elte.hu>
Subject: Re: explicit dcache <-> user-space cache coherency, 
	sys_mark_dir_clean(), O_CLEAN
Original-Message-ID: <20040220184822.GA23460@elte.hu>
Date: Fri, 20 Feb 2004 18:49:11 GMT
Message-ID: <fa.ep8s5gl.1pliiab@ifi.uio.no>

* Linus Torvalds <torvalds@osdl.org> wrote:

> > there's another class of problems: is it an issue that directory renames
> > that move this directory (higher up in the directory hierarchy of this
> > directory) do not invalidate the cache? In that case there's no dnotify
> > event either.
>
> This is one of the reasons why I worry about user-space caching. It's
> just damn hard to get right.

this particular problem could be solved by walking down to the root
dentry for every sys_manage_dir_cache() lookup and check that each
dentry is still cache-valid. This involves some overhead, but it's still
faster than doing the same from userspace. (ie. validating each previous
path component at lookup time.) Since this doesnt change the dcache it
ought to be doable via the rcu-read path and would thus still have
pretty good SMP properties. [except when traversing mountpoints :-( ].

but this scheme also has other problems: who decides who is the 'cache
manager'? What if there are two instances of fileservers both using the
same fileset and also trying to do caching this way?

perhaps using a simple 64-bit generation counter would be better. Samba
would get a new syscall to get the sum of each generation counter down
to the root dentry - a total validation of the pathname. If the counter
matches with that in the userspace cache entry then no need to re-create
the cache. Such generation counters would be usable for multiple file
servers as well. Hm?

> It's hard in kernel space too, of course, but we've had smart people
> working on the dcache for years. So if we can sanely avoid
> duplication, that would be a good thing.

i believe Samba already has what is in essence a duplication of the
dcache. We could enable it to be fairly coherent, for Samba to be able
to have an authorative 'does this file exist' answer without any
excessive readdir()s.

	Ingo


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: explicit dcache <-> user-space cache coherency, 
	sys_mark_dir_clean(),
Original-Message-ID: <Pine.LNX.4.58.0402201017370.2533@ppc970.osdl.org>
Date: Fri, 20 Feb 2004 18:21:33 GMT
Message-ID: <fa.gttrue6.bimq9i@ifi.uio.no>

On Fri, 20 Feb 2004, Jamie Lokier wrote:
>
> How about this: we clean up dnotify, so it can be used for
> user<->kernel dcache coherency

No can do.

There is no _way_ dnotify can do a race-free update, exactly because any
user-level state is fundamentally irrelevant because it isn't tested under
the directory semaphore.

See? You can have a user-level cache, but the flag and the notification
absolutely has to be under the inode semaphore (and thus in kernel space)
if you want to avoid all races with unrelated processes.

Now, for samba this isn't necessarily a huge problem, because you can
basically say "don't do that, then", and just document that you shouldn't
mess with a samba export using anything but SMB accesses. So in a sense,
the samba unix-side coherency is nothing more than politeness, and then
dnotify or similar works fine (by virtue of not being an absolute
coherency guarantee, just a "best effort").

But then it should be documented as such. It's not coherent, it's only
"almost coherent".

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: explicit dcache <-> user-space cache coherency, 
	sys_mark_dir_clean(),
Original-Message-ID: <Pine.LNX.4.58.0402201708100.3301@ppc970.osdl.org>
Date: Sat, 21 Feb 2004 01:07:44 GMT
Message-ID: <fa.gtdnte5.a2qqpv@ifi.uio.no>

On Sat, 21 Feb 2004, Jamie Lokier wrote:
>
> Eh?  The flag and notification operations are set and tested
> under the inode semaphore, when fcntl() is called.

Doesn't matter. Because you will drop the inode semaphore before you
actually create a new file. So you'll alway shave a window open for a
race.

That's what Ingo's O_CLEAN thing did. And if you do Ingo's O_CLEAN, then
there's no point to notifiers in the first place - Ingo's algorithm works
regardless of them (it had other problems, but that's another issue and
just requires a bit of extending on the basic concept).

So why do you care about dnotify? It doesn't help at all once you have
O_CLEAN (or equivalent).

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: explicit dcache <-> user-space cache coherency, 
	sys_mark_dir_clean(),
Original-Message-ID: <Pine.LNX.4.58.0402211012190.3301@ppc970.osdl.org>
Date: Sat, 21 Feb 2004 18:13:27 GMT
Message-ID: <fa.gtdrtm0.a26qhk@ifi.uio.no>

On Sat, 21 Feb 2004 viro@parcelfarce.linux.theplanet.co.uk wrote:
>
> If we are demanding specific filesystems, we could simply say "use
> JFS in case-insensitive mode" and be done with that.  Which deals
> with all problems, since fs code will guarantee uniqueness, etc.

Don't be silly. You can't use JFS in case-insensitive mode and do anything
sane.

That will terminally confuse a lot of UNIX applications, including NFS
serving.  Which makes the whole thing completely useless _except_ as a
pure Windows-compatible partition.

If you are going to limit a partition to _only_ doing Samba serving, then
you have no problems _anyway_, since then samba can do all locking and all
name translation totally on its own.

In short, a case-insensitive filesystem is fundamentally uninteresting. It
buys _nothing_ that samba can't do already, since it only means that you
can't really do anything else on it.

		Linus


Index Home About Blog