Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: UTF-8 in file systems? xfs/extfs/etc.
Original-Message-ID: <20040210231756.GI21151@parcelfarce.linux.theplanet.co.uk>
Date: Tue, 10 Feb 2004 23:24:49 GMT
Message-ID: <fa.n7cnaaj.1mhcpb7@ifi.uio.no>

On Tue, Feb 10, 2004 at 03:04:52PM -0800, jw schultz wrote:
> I expect UTF-8 to have no multi-byte sequences containing NUL
> but it might be awkward if a multi-byte sequence contained
> 0x2F (/).  I would hope that the committees chose to avoid
> using symbol and punctuation byte-codes for alphanumeric
> sequences.

UTF-8 single-byte sequences are in range 0--127 with obvious mapping to
ASCII.  All bytes in UTF-8 multi-byte sequences are in range 128--255.
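
A quick illustration in C (mine, not part of the post): because every byte
of a multi-byte sequence has the high bit set, 0x00 and 0x2f ('/') can
never occur inside one, which is what makes UTF-8 safe for filenames.

	#include <stdio.h>

	/* Classify a byte as it can appear in a UTF-8 stream: below 0x80 it
	 * is a single-byte (ASCII) character; 0x80-0xBF occurs only as a
	 * continuation byte; 0xC0-0xFF occurs only as the lead byte of a
	 * multi-byte sequence. */
	static const char *utf8_byte_class(unsigned char c)
	{
		if (c < 0x80)
			return "single-byte (ASCII)";
		if (c < 0xC0)
			return "continuation byte";
		return "lead byte of a multi-byte sequence";
	}

	int main(void)
	{
		printf("0x2F: %s\n", utf8_byte_class(0x2F));
		printf("0x00: %s\n", utf8_byte_class(0x00));
		printf("0xC3: %s\n", utf8_byte_class(0xC3));
		return 0;
	}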


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: JFS default behavior
Original-Message-ID: <20040214154055.GH8858@parcelfarce.linux.theplanet.co.uk>
Date: Sat, 14 Feb 2004 15:48:52 GMT
Message-ID: <fa.nirog4m.1i700am@ifi.uio.no>

On Sat, Feb 14, 2004 at 03:27:50PM +0100, Nicolas Mailhot wrote:
> There is no more justification to keep encoding undefined than there is to
> keep time zone undefined. Last I've seen we're all pretty happy system
> time actually means something on unix (unlike other systems where it can
> be anything depending on the location where the initial installation was
> performed).

"System time" is amount of time elapsed since the epoch.  Period.  What does
it have to any timezone?

The only place where timezone enters the picture is conversion of time to
year:month:day:hours:minutes:seconds and that's
	a) process-dependent and
	b) done outside of kernel

The same goes for file names.  A filename is a sequence of bytes, no more and
no less.  Anything beyond that belongs to applications.


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: JFS default behavior
Original-Message-ID: <20040214232935.GK8858@parcelfarce.linux.theplanet.co.uk>
Date: Sat, 14 Feb 2004 23:31:05 GMT
Message-ID: <fa.njrkfkv.1h7c1qv@ifi.uio.no>

On Sun, Feb 15, 2004 at 12:06:23AM +0100, Robin Rosenberg wrote:
> On Saturday 14 February 2004 16.40, you wrote:
> > The same goes for file names.  A filename is a sequence of bytes, no more and
> > no less.  Anything beyond that belongs to applications.
>
> Should be a sequence of characters since humans are supposed to use them and
> it should be the same characters whenever possible regardless of the user's locale.

> The  "sequence of bytes" idea is a legacy from prehistoric times when byte == character
> was true.

Bullshit.  It has _nothing_ to do with characters, wide or not.  To the system,
filenames are opaque.  The only things that have special meanings are:
	octet 0x2f ('/') splits the pathname into components
	"." as a component has a special meaning
	".." as a component has a special meaning.
That's it.  The rest is never interpreted by the kernel.
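
As a sketch (my illustration, not kernel source), the complete set of rules
fits in a few lines of C:

	#include <stdio.h>

	/* Split a pathname by the three rules above: the byte 0x2f separates
	 * components, "." and ".." are special, and any other component is
	 * an opaque byte string that is never interpreted. */
	static void walk_path(const char *path)
	{
		const char *p = path;

		while (*p) {
			const char *start = p;
			size_t len;

			while (*p && *p != '/')
				p++;
			len = (size_t)(p - start);

			if (len == 1 && start[0] == '.')
				printf("\".\"  - stay in this directory\n");
			else if (len == 2 && start[0] == '.' && start[1] == '.')
				printf("\"..\" - go to parent\n");
			else if (len > 0)
				printf("opaque component, %zu bytes\n", len);

			while (*p == '/')	/* separator runs collapse */
				p++;
		}
	}

	int main(void)
	{
		walk_path("a/./b\xC0\xAE/../c");	/* weird bytes stay opaque */
		return 0;
	}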

> Having an iocharset options for all file systems make it backward compatible
> and creates a migration path to UTF-8 as system default locale.

Try to realize that different users CAN HAVE DIFFERENT LOCALES.  On the same
system.  And have files on the same fs.  Moreover, homedirs that used to be
on different filesystems can end up on the same fs.  What iocharset would
you use, then?  Sigh...

Again, there is no such thing as the iocharset of a filesystem - it varies between
users, and users can and do share filesystems.  Think of /home; think of /tmp.

It isn't feasible.  At all.  Just as timezone doesn't belong in kernel, locales
have no place there.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: JFS default behavior
Original-Message-ID: <Pine.LNX.4.58.0402141827200.14025@home.osdl.org>
Date: Sun, 15 Feb 2004 02:42:45 GMT
Message-ID: <fa.ie07fjk.1gg68ia@ifi.uio.no>

On Sun, 15 Feb 2004, Robin Rosenberg wrote:
> >
> > Bullshit.  It has _nothing_ to do with characters, wide or not.  To the system,
> > filenames are opaque.  The only things that have special meanings are:
> > 	octet 0x2f ('/') splits the pathname into components
> > 	"." as a component has a special meaning
> > 	".." as a component has a special meaning.
> > That's it.  The rest is never interpreted by the kernel.
>
> I know how it is (to some degree), and it's wrong. The user looks inside the filename
> and sees a string of characters, not a byte sequence.

Yes, the user sees a string of characters, but the octet 0x2f ('/') and
the terminating NUL character '\0' are still perfectly normal characters
and there is no confusion.

The reason: UTF-8. It's the only sane encoding (apart from a pure extended
ASCII setup, which is also sane, but is obviously unacceptable for a large
portion of the world).

If some misguided person has told you about UCS-2 and horrors like UTF-9,
just ignore them. They are crazy and deluded, and - perhaps more
importantly - stupid.

In short: the kernel talks bytestreams, and that implies that if you want
to talk to the kernel, you HAVE TO USE UTF-8.

At which point there are no locale issues any more. The only locale issue
you can have is user space mistaking a stream of bytes as extended ASCII,
which will cause all your pretty UTF-8 characters to be shown as strange
latin1 (or other) squiggles.

> It seems you simply don't want to understand the problem, which is that users
> CAN HAVE DIFFERENT LOCALES on the same system and on different systems.
> Sigh...

People understand the problem. And UTF-8 is the solution.

It's getting there. I think even Microsoft has seen the light, and is
phasing out their crapola (UCS-2LE? Whatever).

> I'm less concerned with which solution is chosen than that a solution should be found. So it
> seems no file system has a solution today. Still an iocharset option would relieve
> the problem for removable media and multi-boot systems.

No. Things like "iocharset" are not the solution. They are literally the
_problem_. The solution is to use something that not only acts as ASCII,
but also has a wide enough range to cover the whole required space (UCS-2
fails _both_ of these fundamental tests). At which point "iocharset" makes
no sense any more, and only exists as a way to translate legacy crap into
the one true format.

And that one true format is UTF-8. End of story. If you try to talk to the
kernel in UCS-2 or anything else, you _will_ fail.

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: stty utf8
Original-Message-ID: <Pine.LNX.4.58.0402161413550.30742@home.osdl.org>
Date: Mon, 16 Feb 2004 22:19:05 GMT
Message-ID: <fa.if01h3k.1hgs9ia@ifi.uio.no>

On Mon, 16 Feb 2004, Jamie Lokier wrote:
>
> A little thought and an experiment later, and I discovered:
>
> When you edit a line with the kernel's terminal line editor, when you
> press the Delete key, it writes backspace-space-backspace and removes
> one byte from the input.  That fails to do the right thing on UTF-8
> terminals.

Yes. I looked at that a year ago, and it should be pretty easy to make the
backspace code look more like the "delete word" code - except the "word"
is just a UTF-8 character.

(Btw, that's one of the things I like about UTF-8, and shows how _well_
designed it is - it's trivial to find the beginning of a UTF-8 character,
even when just doing a stupid scan backwards).
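
That backward scan is a two-line loop; a minimal sketch in C (mine, not the
actual console driver code):

	#include <stdio.h>

	/* Find the start of the UTF-8 character containing buf[pos] by
	 * stepping backwards past continuation bytes (10xxxxxx).  No
	 * decoder state is needed - the property being praised above. */
	static size_t utf8_char_start(const unsigned char *buf, size_t pos)
	{
		while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
			pos--;
		return pos;
	}

	int main(void)
	{
		const unsigned char line[] = "a\xC3\xA5z";	/* 'a', U+00E5, 'z' */

		/* Byte 2 is a continuation byte; its character starts at
		 * byte 1, so a backspace would erase bytes 1 and 2. */
		printf("start: %zu\n", utf8_char_start(line, 2));
		return 0;
	}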

I didn't care enough to really bother fixing it - the fact is, that people
who care about UTF-8 tend to have to be in graphics mode anyway, and there
is something to be said for keeping the text console simple even if it
means it lacks functionality.

But if somebody cares more than I do (hint, hint ;), I do think it should
be fixed.

> There is no fancy environment setting which corrects this problem.
> The kernel needs to know it's dealing with a UTF-8 terminal for basic
> line editing to work.

Yes. And I'd happily take patches for it.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:
Original-Message-ID: <Pine.LNX.4.58.0402161040310.30742@home.osdl.org>
Date: Mon, 16 Feb 2004 18:52:31 GMT
Message-ID: <fa.idvhhrh.1igs8q9@ifi.uio.no>

On Mon, 16 Feb 2004, Marc Lehmann wrote:
>
> > In short: the kernel talks bytestreams, and that implies that if you want
> > to talk to the kernel, you HAVE TO USE UTF-8.
>
> This is not the problem at all. It's perfectly easy to write
> applications that talk UTF-8 and just UTF-8 with the kernel.
>
> The problem is that the kernel does not use UTF-8, i.e. applications in
> the current linux model have to deal with the fact that the kernel
> happily breaks the assumed protocol of using UTF-8 by delivering illegal
> byte sequences to applications.

You didn't read what I said.

READ MY POSTING. You even quoted it, but you didn't understand it.

I'm saying that "the kernel talks bytestreams".

I have never claimed that the kernel really talks UTF-8, and indeed, I
would say that such a kernel would be terminally and horribly broken.

The kernel is _agnostic_ in what it does. As it should be. It doesn't
really care AT ALL what you feed it, as long as it is a byte-stream.

Now, that implies that if you want to have extended characters, then YOU
HAVE TO USE UTF-8.

That's what I'm saying. I am _not_ saying that the kernel uses UTF-8. The
kernel doesn't care one way or the other. As far as the kernel is
concerned, you could uuencode all the stuff, and the kernel wouldn't
think you're crazy. The kernel _only_ cares about byte streams. And that
is as it should be.

> There is no way for applications to handle UTF-8 and illegal-utf8 in
> a sane way, so most apps will either eat the illegal bytes, skip the
> filename, or crash (the latter case is clearly a bug in the app, the
> former cases aren't).

What you're complaining about are bad user applications. It has _zero_ to
do with the kernel.

> Fixing the VFS to actually enforce what linus claims ("filenames are
> utf-8") is a very good idea, imho.

No. Read my claim again. You obviously do not understand it AT ALL.

What you suggest would be a horribly idiotic and bad idea. The kernel
doesn't set policy. The kernel says "this is what I can do, you set
policy".

And UTF-8 just happens to be the only sane policy for encoding complex
characters into a byte stream. But it is not the only policy.

Another sane policy is to say "byte streams are latin1". It's not an
acceptable policy for encoding _complex_ characters, but it is a policy.
And it's a perfectly sane one.

In short: filenames are byte streams. Nothing more. They don't even have a
"character set". They literally are just a series of bytes.

And when I say that you have to talk to the kernel using UTF-8, I'm only
claiming that it is the only sane way to encode extended characters in a
byte stream. Nothing more.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API
Original-Message-ID: <Pine.LNX.4.58.0402161141140.30742@home.osdl.org>
Date: Mon, 16 Feb 2004 19:54:20 GMT
Message-ID: <fa.icvphri.1jgk8q8@ifi.uio.no>

On Mon, 16 Feb 2004, John Bradford wrote:
>
> The real problem is with mis-configured userspaces, where buggy UTF-8
> decoders are trying to make sense of data in legacy encodings
> containing essentially random bytes > 127, which are not part of valid
> UTF-8 sequences.
>
> None of this is a real problem, if everything is set up correctly and
> bug free.  Unfortunately the Just Works thing falls apart in the,
> (frequent), instances that it's not :-(.

The way to handle that is to aim to never _ever_ decode utf-8 unless you
really have to. Always leave the string in utf-8 "raw bytestring" mode as
long as possible, and convert to character sets only when actually
printing.

If you do that, then at worst you'll show the user a strange name (extra
points for marking it as being erroneous), but everything still works. You
can still lookup/delete/whatever the file (internally the program still
works on the raw byte sequence and isn't confused). Basically accept the
fact that UTF-8 strings can contain "garbage", and don't try to fix it up.
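
A sketch of that discipline in C (purely illustrative): the name stays a
raw byte string from readdir() to the final write, with no decode step
anywhere in between:

	#include <dirent.h>
	#include <stdio.h>
	#include <string.h>

	/* List a directory treating every name as an opaque byte string:
	 * no decoding, no normalization.  Escaping for display would be a
	 * separate, final step. */
	int main(int argc, char **argv)
	{
		DIR *dir = opendir(argc > 1 ? argv[1] : ".");
		struct dirent *de;

		if (!dir) {
			perror("opendir");
			return 1;
		}
		while ((de = readdir(dir)) != NULL) {
			/* pass the bytes through exactly as the kernel
			 * returned them */
			fwrite(de->d_name, 1, strlen(de->d_name), stdout);
			fputc('\n', stdout);
		}
		closedir(dir);
		return 0;
	}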

And no, I'm not claiming that it's wonderfully clean and that we should
all love it. But it's _practical_, and the ugliness is certainly a lot
less than in the alternatives.

And it largely works today.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API
Original-Message-ID: <Pine.LNX.4.58.0402161223420.30742@home.osdl.org>
Date: Mon, 16 Feb 2004 20:30:20 GMT
Message-ID: <fa.iefnhbk.1h0u9aa@ifi.uio.no>

On Mon, 16 Feb 2004, Marc Lehmann wrote:
>
> On Mon, Feb 16, 2004 at 11:48:35AM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > works on the raw byte sequence and isn't confused). Basically accept the
> > fact that UTF-8 strings can contain "garbage", and don't try to fix it up.
>
> But you are wrong, UTF-8 strings never contain garbage. UTF-8 is
> well-defined and is always proper UTF-8. It's a tautology.
>
> The very idea of "UTF-8 with garbage in it" doesn't make sense.

Sure it does.

You live in a theoretical world where
 (a) there is only one standard
 (b) people read it
 (c) people actually follow it and never have bugs

I've got news for you: none of the above is true.

Which means that IN PRACTICE you will find strings that you think are
UTF-8-encoded, but that don't end up being proper UTF-8.

That's the difference between real world and theory.

And you can either write your programs to be "theoretically correct", or
you can write them to "work".

It's your choice. I know which program I'd prefer to use.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:
Original-Message-ID: <Pine.LNX.4.58.0402161205120.30742@home.osdl.org>
Date: Mon, 16 Feb 2004 20:28:34 GMT
Message-ID: <fa.icvngrm.1jgu9qc@ifi.uio.no>

On Mon, 16 Feb 2004, Marc Lehmann wrote:
>
> > I'm saying that "the kernel talks bytestreams".
>
> And I am saying that this is not good, which is my sole point.

Fair enough.

However, that's where the unix philosophy comes in. The unix philosophy
has always been to not try to understand the data that the user passes
around - and that "everything is a bytestream" is very much encoded in the
basic principles of how unix should work.

That agnosticism has a lot of advantages. It literally means that the
basic operating system doesn't set arbitrary limitations, which means that
you can do things that you couldn't necessarily otherwise easily do.

It does mean that you can do "strange" things too, and it does mean that
user space basically has a lot of choice in how to interpret those byte
streams.

And yes, it can cause confusion. You don't like the confusion, so you
argue that it shouldn't be allowed. It's a valid argument, but it's an
argument that assumes that choice is bad.

If you want to _force_ everybody to use UTF-8, then yes, the kernel could
enforce that readdir() would never pass through a broken UTF-8 string, and
all the path lookup functions also would never accept a broken string. It's
not technically impossible to do, although it would add a certain amount
of pain and overhead.

But the thing is, not everyone uses UTF-8. The big distributions have only
recently started moving to UTF-8, and it will take _years_ before UTF-8 is
ubiquitous. And even then it might be the wrong thing to disallow clever
people from doing clever things. Encoding other information in filenames
might be proper for a number of applications.

> And I'd say such a kernel would be highly useful, as it would standardize
> the encoding of filenames, just as unix standardizes on "mostly ascii"
> (i.e. the SuS).

It would also be very painful, since it would mean that when you mount an
old disk, you may be totally unable to read the files, because they have
filenames that such a kernel would never accept.

> > The kernel is _agnostic_ in what it does.
>
> No, it's not. If at all, the kernel specifies a specially-interpreted
> (ascii sans / and \0) byte-stream, as you say yourself.
>
> However, just as with URLs (which are byte-streams, too), byte-streams are
> useless to store text. You need bytestreams + known encoding.

You don't "need" a known encoding. The kernel clearly doesn't need one.
It's a container, and the encoding comes from the outside.

And that's what I mean by agnostic - you can make your own encoding.

Most of the time (but not always) these days UTF-8 is the only sane
encoding to use. But let people do what they want to do.

Choice is _inherently_ good. Trying to force a world-view is bad. You
should be able to tell people what they should do to avoid confusion ("use
UTF-8"), but you should not _force_ them to that if they have good reasons
not to (and "backwards compatibility" is a better reason than just about
anything else).

> But you are saying that you have to feed UTF-8 into the kernel, which is
> not the case either.

No. I'm saying that
 (a) "if you want to use complex character sets"
then
 (b) "you really have to use UTF-8"
to talk to the kernel.

Note the two parts. You're hung up on (b), while I have tried to make it
clear that (a) is a prerequisite for (b).

Not everybody cares about (a). There are still people who use extended
ASCII, simply because they DO NOT CARE about complex character sets. And
if they don't care, and (a) isn't true, then (b) has no meaning any more.

(In all fairness, some people will disagree with (b) even when (a) is true
and like things like UCS-2. Those people are crazy, but I guess I'd just
mention that possibility anyway).

And this is why I say that the kernel only cares about byte streams, and
having it filter to only accept proper UTF-8 sequences would be a horribly
bad idea. Because it _assumes_ (a). That's what "making policy" is all
about. The kernel should not assume that everybody cares about complex
character sets.

This may change, btw. I'm nothing if not pragmatic. In another twenty
years, maybe everybody _literally_ uses complex character sets, and this
whole discussion is totally silly, and the kernel may enforce UTF-8 or
Klingon or whatever. At some point assumptions become _so_ ingrained that
they are no longer policy any more, they are just "fact".

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:
Original-Message-ID: <Pine.LNX.4.58.0402161431260.30742@home.osdl.org>
Date: Mon, 16 Feb 2004 22:42:45 GMT
Message-ID: <fa.idg3hji.1i0q928@ifi.uio.no>

On Mon, 16 Feb 2004, Jamie Lokier wrote:
>
> Alas, once userspace has migrated to doing everything in UTF-8, you
> won't be able to read those files because userspace will barf on them..

Nope. Read my other email. Done right, user space will _not_ barf on them,
because it won't try to "normalize" any UTF-8 strings. If the string has
garbage in it, user space should just pass the garbage through.

We've had this _exact_ issue before. Long before people worried about
UTF-8, people worried about the fact that programs like "ls" shouldn't
print out the extended ASCII characters as-is, because that would cause
bad problems on a terminal as they'd be seen as terminal control
characters.

Does that mean that unix tools like "rm" cannot remove those files? Hell
no! It just means that when you do "rm -i *", the filename that is printed
may not have special characters in it that you don't see.

Same goes for UTF-8. A "broken" UTF-8 string (ie something that isn't
really UTF-8 at all, but just extended ASCII) won't _print_ right, but
that doesn't mean that the tools won't work. You'll still be able to edit
the file.

Try it with a regular C locale. Do a simple

	echo > åäö

(that's latin1), and do a "rm -i åäö", and see what it says.

Right: it does the _right_ thing, and it prints out:

	torvalds@home:~> rm -i åäö
	rm: remove regular file `\345\344\366'?

In other words, you have a program that doesn't understand a couple of the
characters (because they don't make sense in its "locale"), but it still
_works_. It just can't print them.

Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8
program should do when it sees broken UTF-8. It can still access the file,
it can still do everything else with it, but it can't print out the
filename, and it should use some kind of escape sequence to show that
fact.

The two cases are 100% equivalent. We've gone through this before. There
is a bit of pain involved, but it's not something new, or something
fundamentally impossible. It's very straightforward indeed.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:
Original-Message-ID: <Pine.LNX.4.58.0402161447450.30742@home.osdl.org>
Date: Mon, 16 Feb 2004 22:58:29 GMT
Message-ID: <fa.ieg1hro.1h0s8qe@ifi.uio.no>

On Mon, 16 Feb 2004, Linus Torvalds wrote:
>
> Which, if you think about it, is 100% EXACTLY equivalent to what a UTF-8
> program should do when it sees broken UTF-8. It can still access the file,
> it can still do everything else with it, but it can't print out the
> filename, and it should use some kind of escape sequence to show that
> fact.

Side note: a UTF-8 program needs to do escape handling _anyway_, because
even if the filename is 100% UTF-8 compliant, you still can't print out
all the characters as such. In particular, characters like '\n' etc are
obviously perfectly fine UTF-8, yet they need to be escaped when printing
out filenames in a file selector.

So I claim (and yes, people are free to disagree with me) that a
well-written UTF-8 program won't even have any real extra code to handle
the "broken UTF-8" code. It's just another set of bytes that needs
escaping, and they need escaping for _exactly_ the same reason some
regular utf-8 characters need escaping: because they can't be printed.

So it's all the same thing - it's just the reasons for "unprintability"
that are slightly different.
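
A sketch of that single escaping path (my code, deliberately as
conservative as the C-locale behaviour shown earlier: everything outside
printable ASCII gets escaped):

	#include <stdio.h>

	/* Print a filename with one uniform rule: printable ASCII passes
	 * through, everything else - control characters, valid multi-byte
	 * UTF-8 and invalid bytes alike - becomes a \xNN escape.  Broken
	 * UTF-8 needs no extra code; it is just more unprintable bytes. */
	static void print_escaped(const unsigned char *name)
	{
		for (; *name; name++) {
			if (*name >= 0x20 && *name < 0x7F && *name != '\\')
				putchar(*name);
			else
				printf("\\x%02X", *name);
		}
		putchar('\n');
	}

	int main(void)
	{
		print_escaped((const unsigned char *)"new\nline"); /* valid, unprintable */
		print_escaped((const unsigned char *)"\xC0\xAE");  /* invalid UTF-8 */
		return 0;
	}

A UTF-8-aware version would decode first and pass valid, printable
sequences through to the display layer, escaping only what fails to decode
or to print - but the failure branch is the same branch.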

Now, I'll agree that getting the escaping right (whether for things like
'\n' or for byte sequences that are invalid UTF-8) can be painful. I just
don't think that the pain is in any way specific for "invalid UTF-8". It's
just _hard_ to think of all the special cases, and most programs have bugs
because somebody forgot something.

		Linus


Newsgroups: fa.linux.kernel
From: hpa@zytor.com (H. Peter Anvin)
Subject:  Re: UTF-8 practically vs. theoretically in the VFS API
Original-Message-ID:  <c0ukd2$3uk$1@terminus.zytor.com>
Date: Wed, 18 Feb 2004 03:01:47 GMT
Message-ID: <fa.ih5io6k.1q5mibu@ifi.uio.no>

Followup to:  <20040216202142.GA5834@outpost.ds9a.nl>
By author:    bert hubert <ahu@ds9a.nl>
In newsgroup: linux.dev.kernel
>
> Additional good news is that following octets in a utf-8 character sequence
> always have the highest order bit set, precluding / or \x0 from appearing,
> confusing the kernel.
>

Indeed.  The original name for the encoding was, in fact, "FSS-UTF",
for "filesystem safe Unicode transformation format."

> The remaining zit is that all these represent '..':
> 2E 2E
> C0 AE C0 AE
> E0 80 AE E0 80 AE
> F0 80 80 AE F0 80 80 AE
> F8 80 80 80 AE F8 80 80 80 AE
> FC 80 80 80 80 AE FC 80 80 80 80 AE

No, they don't.

The first represents ".."; the rest are illegal encodings and
do not decode to anything.

Those of us who have been involved with the issue have fought
*extremely* hard against DWIM decoders which try to decode the latter
sequences into ".." -- it's incorrect, and a security hazard.  The
only acceptable decoding is to throw an error, or use an out-of-band
encoding mechanism to denote "bad bytecode."

> This in itself is not a problem, the kernel will only recognize 2E 2E as the
> real .., but it does show that 'document.doc' might be encoded in a myriad
> ways.

No, it doesn't.

> So some guidance about using only the simplest possible encoding might be
> sensible, if we don't want the kernel to know about utf-8.

UTF-8 requires the use of the shortest possible encoding.  An
application which doesn't obey that and tries to be "smart" is a
security hazard.
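
For two-byte sequences the shortest-form rule reduces to a one-line range
check; a sketch (mine):

	#include <stdio.h>

	/* Lead bytes 0xC0 and 0xC1 can only encode code points below 0x80,
	 * which must be written as a single byte, so a strict decoder
	 * rejects them outright - killing "C0 AE" as an alias for '.'.
	 * (3- and 4-byte sequences get analogous minimum-value checks.) */
	static int valid_two_byte_lead(unsigned char c)
	{
		return c >= 0xC2 && c <= 0xDF;
	}

	int main(void)
	{
		printf("0xC0: %s\n", valid_two_byte_lead(0xC0) ? "ok" : "reject");
		printf("0xC3: %s\n", valid_two_byte_lead(0xC3) ? "ok" : "reject");
		return 0;
	}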

It is a bit unfortunate that the encoding doesn't exclude these by
design, as opposed to by error checking; it makes it a little too easy for
clueless programmers to skip the check :(

	-hpa


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API
Original-Message-ID: <Pine.LNX.4.58.0402171910550.2686@home.osdl.org>
Date: Wed, 18 Feb 2004 03:15:40 GMT
Message-ID: <fa.j5csqtb.1e2gnot@ifi.uio.no>

On Wed, 18 Feb 2004, H. Peter Anvin wrote:
>
> Those of us who have been involved with the issue have fought
> *extremely* hard against DWIM decoders which try to decode the latter
> sequences into ".." -- it's incorrect, and a security hazard.  The
> only acceptable decoding is to throw an error, or use an out-of-band
> encoding mechanism to denote "bad bytecode."

Somebody correctly pointed out that you do not need any out-of-band
encoding mechanism - the very fact that it's an invalid sequence is in
itself a perfectly fine flag. No out-of-band signalling required.

The only thing you should make sure of is to not try to normalize it (that
would hide the error). Just keep carrying the bad sequence along, and
everybody is happy. Including the filesystem functions that get the "bad"
name and match it exactly to what it should be matched against.

		Linus


Newsgroups: fa.linux.kernel
Original-Message-ID: <4032DA76.8070505@zytor.com>
From: "H. Peter Anvin" <hpa@zytor.com>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API
Date: Wed, 18 Feb 2004 03:24:40 GMT
Message-ID: <fa.ebarqom.nkm52u@ifi.uio.no>

Linus Torvalds wrote:
>
> On Wed, 18 Feb 2004, H. Peter Anvin wrote:
>
>>Those of us who have been involved with the issue have fought
>>*extremely* hard against DWIM decoders which try to decode the latter
>>sequences into ".." -- it's incorrect, and a security hazard.  The
>>only acceptable decoding is to throw an error, or use an out-of-band
>>encoding mechanism to denote "bad bytecode."
>
> Somebody correctly pointed out that you do not need any out-of-band
> encoding mechanism - the very fact that it's an invalid sequence is in
> itself a perfectly fine flag. No out-of-band signalling required.
>
> The only thing you should make sure of is to not try to normalize it (that
> would hide the error). Just keep carrying the bad sequence along, and
> everybody is happy. Including the filesystem functions that get the "bad"
> name and match it exactly to what it should be matched against.
>

Well, the reason you'd want an out-of-band mechanism is to be able to
display it as some kind of escapes.  Consider a UTF-8 decoder which uses
values in the 0x800000xx range to encode "bogus bytes"; that way it
wouldn't alias to anything else, but the bogus sequence "C0 AE" could be
represented as 0x800000C0 0x800000AE and displayed to the user as
\xC0\xAE\xC0\xAE ... which is different from \u00C0\u00AE ("À®", C3 80
C2 AE).  This would make it possible to figure out in, for example, an
ls listing, what those broken filenames are actually composed of.
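
A sketch of such a decoder (the 0x800000xx convention is the suggestion
above, not any standard, and the code is illustrative only):

	#include <stdint.h>
	#include <stdio.h>

	/* Decode one item from a byte stream.  ASCII and valid two-byte
	 * sequences decode normally; anything else is returned out-of-band
	 * as 0x800000xx, a value no legitimate character can alias.
	 * (Longer sequences are omitted to keep the sketch short.) */
	static uint32_t decode_or_flag(const unsigned char **p)
	{
		unsigned char c = *(*p)++;

		if (c < 0x80)
			return c;
		if (c >= 0xC2 && c <= 0xDF && (**p & 0xC0) == 0x80)
			return ((uint32_t)(c & 0x1F) << 6) | (*(*p)++ & 0x3F);
		return 0x80000000u | c;		/* bogus byte, flagged */
	}

	int main(void)
	{
		const unsigned char bad[] = { 0xC0, 0xAE, 0 };
		const unsigned char *p = bad;

		while (*p)	/* prints 0x800000C0, then 0x800000AE */
			printf("0x%08X\n", decode_or_flag(&p));
		return 0;
	}

A display layer can then render 0x800000C0 as \xC0, giving exactly the
\xC0\xAE\xC0\xAE listing described above.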

There are some advantages to being able to represent all possible byte
sequences and present them to the user, even if they're bogus.

	-hpa



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API
Original-Message-ID: <Pine.LNX.4.58.0402171927520.2686@home.osdl.org>
Date: Wed, 18 Feb 2004 03:32:42 GMT
Message-ID: <fa.j5cmr5i.1e2un0q@ifi.uio.no>

On Tue, 17 Feb 2004, H. Peter Anvin wrote:
>
> Well, the reason you'd want an out-of-band mechanism is to be able to
> display it as some kind of escapes.

I'd suggest just doing that when you convert the utf-8 format to printable
format _anyway_.  At that point you just make the "printable"
representation be the binary escape sequence (which you have to have for
other non-printable utf-8 characters anyway).

And if you do things right (ie you allow user input in that same escaped
output format), you can allow users to re-create the exact "broken utf-8".
Which is actually important just so that the user can fix it up (ie
imagine the user noticing that the filename is broken, and now needs to do
a "mv broken-name fixed-name" - the user needs some way to re-create the
brokenness).

		Linus


Newsgroups: fa.linux.kernel
Original-Message-ID: <4032F861.3080304@zytor.com>
From: "H. Peter Anvin" <hpa@zytor.com>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API
Date: Wed, 18 Feb 2004 05:31:56 GMT
Message-ID: <fa.ec9rqgh.mju5an@ifi.uio.no>

Linus Torvalds wrote:
>
> On Tue, 17 Feb 2004, H. Peter Anvin wrote:
>
>>Well, the reason you'd want an out-of-band mechanism is to be able to
>>display it as some kind of escapes.
>
>
> I'd suggest just doing that when you convert the utf-8 format to printable
> format _anyway_.  At that point you just make the "printable"
> representation be the binary escape sequence (which you have to have for
> other non-printable utf-8 characters anyway).
>

What does "printable" mean in this context?  Typically you have to
convert it to UCS-4 first, so you can index into your font tables, then
you have to create the right composition, apply the bidirectional text
algorithm, and so forth.

Rendering general Unicode text is complex enough that you really want it
layered.  What I described was the first step of that -- mostly trying
to show that "throwing an error" doesn't necessarily mean "produce no
output."  What you shouldn't do, though, is alias it with legitimate input.

> And if you do things right (ie you allow user input in that same escaped
> output format), you can allow users to re-create the exact "broken utf-8".
> Which is actually important just so that the user can fix it up (ie
> imagine the user noticing that the filename is broken, and now needs to do
> a "mv broken-name fixed-name" - the user needs some way to re-create the
> brokenness).

Indeed.  The C language has gone with \x77 for bytes and \u7777 or
\U77777777 for Unicode characters (4 vs 8 hex digits respectively); I
think this is a good UI for shells to follow.  The \x representation
then doesn't stand for characters but for bytes.  It may be desirable to
disallow encoding of *valid* UTF-8 characters this way, though.

	-hpa


Newsgroups: fa.linux.kernel
From: Jamie Lokier <jamie@shareable.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS 
	default behavior)
Original-Message-ID: <20040217131559.GA21842@mail.shareable.org>
Date: Tue, 17 Feb 2004 13:18:14 GMT
Message-ID: <fa.it0qfl9.1a02837@ifi.uio.no>

Linus Torvalds wrote:
> So I claim (and yes, people are free to disagree with me) that a
> well-written UTF-8 program won't even have any real extra code to handle
> the "broken UTF-8" code. It's just another set of bytes that needs
> escaping, and they need escaping for _exactly_ the same reason some
> regular utf-8 characters need escaping: because they can't be printed.

Even XML suffers from these sorts of problems: some Unicode characters
aren't allowed in XML, even as numeric references, so in theory XML
applications have to reject or escape some strings.

> So it's all the same thing - it's just the reasons for "unprintability"
> that are slightly different.

My difficulty with directories containing non-UTF-8 filenames shows up
with web pages in Perl, and not the printability part.  Please excuse
the Perl-oriented examples; Perl has good support for UTF-8 while also
working with arbitrary byte strings, so it's a fine language to
illustrate current problems.

What do you put in a URL composed from filenames in a directory
listing page?  The obvious thing is to %-escape each byte of the
names, in fact that's what everybody does.
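
In C that byte-wise escaping looks something like this (a sketch, mine;
the thread's examples are Perl, but the point is the same - no step ever
asks what encoding the name is in):

	#include <stdio.h>
	#include <string.h>

	/* %-escape each byte of a filename for a URL.  Bytes in, bytes out:
	 * valid UTF-8, latin1 and garbage all survive the round trip. */
	static void url_escape(const unsigned char *name)
	{
		static const char safe[] =
			"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
			"abcdefghijklmnopqrstuvwxyz"
			"0123456789-._~";

		for (; *name; name++) {
			if (strchr(safe, *name))
				putchar(*name);
			else
				printf("%%%02X", *name);
		}
		putchar('\n');
	}

	int main(void)
	{
		/* prints r%C3%A9sum%C3%A9 */
		url_escape((const unsigned char *)"r\xC3\xA9sum\xC3\xA9");
		return 0;
	}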

In a language like Perl, where strings are labelled according to their
encoding, that means when you unescape the URL you get a string
labelled as "byte string".  You shouldn't tell Perl it's a "UTF-8
string" because some of them won't be (they are strings from
directories).

That's fine if you don't do anything except use those strings
unchanged, but as soon as you want to do something else like prepend a
character with code >= 256 or apply a regex where the pattern has
Unicode characters, Perl transcodes "byte string" to "UTF-8 string"
assuming it was latin1.  That, of course, mangles the string when it's
come from a source which is "nominally UTF-8 but might not be".

Your recommendation to simply pass around bytes all the time doesn't
work well, because to maintain basic properties of strings such as
length(a) + length(b) = length(a+b), that implies you either (1)
always do indexing, lengths, splitting etc. on strings as bytes not
characters, or (2) every operation that operates on a string must be
able to accept non-UTF-8 bytes and treat them the same way.  (2) is
particularly nasty because then your program's logic can't depend on
the nice properties of UTF-8 strings.

That's why this line of Perl fails:

    for (glob "*") { rename $_, "ņi-".$_ or die "rename: $!\n"; }

(The source file, by the way, is assumed to be UTF-8-encoded text).

Perl reads each file name, and declares it to be of type "byte
string".  Then "ņi-" is prepended, which contains a character code >=
256, so the result must be UTF-8 encoded according to Perl.  The
original file name is transcoded from what was assumed to be
iso-8859-1 to UTF-8, "ņi-" is prepended, and that becomes the target
file name for rename().

This mangles the names; both UTF-8 and non-UTF-8 filenames are mangled
equally badly.

Your suggestion means that Perl should do bytewise concatenation of
the the "ņi-" (in UTF-8) and the filename (no encoding assumed).

It's a good one; it's exactly the right thing to do, and it works.

To do that in Perl, when you take a random byte string (such as from
readdir()) you must tell Perl it's a UTF-8 string, so shouldn't be
transcoded when it's combined with another UTF-8 string.  You can do
it, breaking documented rules of course (which say only do this when
you know it's valid UTF-8), with Encode::_utf8_on().

Guess what?  That actually works.  It does the filename operations
properly given any arbitrary filenames.

But remember I said "every operation that operates on a string must be
able to accept non-UTF-8 bytes and treat them the same way" earlier,
and how this is bad because it's nice to depend on UTF-8 properties?

You've just told Perl to treat an arbitrary byte sequence as UTF-8,
when sometimes it isn't.  Among other things, simple operators like
length() and substr() don't work as expected on those weird strings.

When I say don't work as expected, I mean if you had a file named
"müeller" in latin1, Perl will think it's length() is 2.  If you have
a file named "müller", Perl will not only report a length() of 1,
it'll spew a horrible error message when it calculates it.

These aren't Perl problems.  These are problems that any program will
have if it follows your suggestion of "keep everything in bytes" but
wants to combine filenames with other text or do pattern matching on
filenames.

It's not a problem if you can pass around a flag with each byte
sequence, carefully keeping readdir() results separate from text until
the point where you're prepared to have a policy saying what to do with
non-UTF-8 readdir() results.

But it is a problem when you want to stuff readdir() results in a
general purpose "string" which is also used for text.

That's technically the wrong thing to do, in all programs.  In
practice, that's what programs do anyway because it's a lot easier
than having different string types for different data sources.

Most times it works out ok, but for the corners:

> It's just _hard_ to think of all the special cases, and most
> programs have bugs because somebody forgot something.

Exactly.

-- Jamie


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:
Original-Message-ID: <Pine.LNX.4.58.0402170739580.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 15:59:18 GMT
Message-ID: <fa.j4skqli.1eiok0m@ifi.uio.no>

On Tue, 17 Feb 2004, Marc wrote:
> On Mon, Feb 16, 2004 at 02:40:25PM -0800, Linus Torvalds <torvalds@osdl.org> wrote:
> > Try it with a regular C locale. Do a simple
> >
> > 	echo > åäö
>
> Just for your info, though. You can't even input these characters in a C
> locale, since your libc (and/or xlib) is unable to handle them (lots of ISO
> C functions will barf on this one). C is 7 bit only.

Ehh.. It's pointless to tell me that I can't do it. I just did.

The C locale is _not_ 7-bit only. The C locale is the traditional "byte
locale" for UNIX. It will happily collate 8-bit-characters in their
(numerical) order. Anything else would be seriously broken.

> > Which, if you think about is, is 100% EXACTLY equivalent to what a UTF-8
> > program should do when it sees broken UTF-8.
>
> The problem is that the very common C language makes it a pain to use
> this in i18n programs. multibyte functions or iconv will not accept
> these, so programs wanting to do what you are expecting to do need to
> re-implement most if not all of the character handling of your typical
> libc.

These are all teething problems. The thing is, true multi-locale programs
haven't been around long enough that people take the problems for granted.
A lot of them work today, but "work" is different from "always does the
right thing". These things take a _long_ time for people to sort out the
full implications of.

(Analogy time: how many people _still_ use "find ... | xargs xxx", even
though that can lead to problems and is thus wrong?  You should really use
"find ... -print0 | xargs -0 xxx" to get it _right_, but most people
ignore that, because the common form works for most cases.)

The process is complicated by the fact that most of the people who really
care about UTF-8 and locales are very strict about it: they have been
hitting their heads against latin1 users for a long time, and they are
frustrated and _tired_ of it, and so they often hate single-byte usage
with a passion, and consider it not only wrong but EVIL. Which is
obviously silly, but hey, I understand why they can feel a bit put off by
the problem.

So the multi-byte people often stare at the standards, and then _refuse_
to touch anything that isn't standards-compliant. When they see something
incorrect, they'd rather dump core (or just truncate it) than try to
handle it gracefully, because they want the whole world to see how
incorrect it is.

Which flies in the face of "Be strict in what you generate, be liberal in
what you accept". A lot of the functions are _not_ willing to be liberal
in what they accept. Which sometimes just makes the problem worse, for no
good reason.

The fact is, you shouldn't use "iconv()" unless you controlled the input.
It's a bit like "gets()" - unsafe to use unless you generated the damn
thing yourself and you _know_ it fits in the buffer. But we just don't
have the functions (yet) to do it _right_, and to escape the input some
way (yeah, yeah, I know you can do it with iconv() and a lot of cruft
around it - the point is that nobody does it, because it's too painful)..

		Linus


Newsgroups: fa.linux.kernel
From: Jamie Lokier <jamie@shareable.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS 
	default behavior)
Original-Message-ID: <20040217192917.GA24311@mail.shareable.org>
Date: Tue, 17 Feb 2004 19:33:28 GMT
Message-ID: <fa.iph0gl8.1agc9j0@ifi.uio.no>

viro@parcelfarce.linux.theplanet.co.uk wrote:
> On Tue, Feb 17, 2004 at 04:36:13PM +0000, Jamie Lokier wrote:
> > But the reason they cite is security: when applications allow
> > malformed UTF-8 through, there's plenty of scope for security holes
> > due to multiple encodings of "/" and "." and "\0".
> >
> > This is a real problem: plenty of those Windows worms that attack web
> > servers get in by using multiple-escaped funny characters and
> > malformed UTF-8 to get past security checks for ".." and such.
>
Pardon?  For that, the kernel would have to <drumrolls> interpret the bytestream
> as UTF-8.  We do not.  So your malformed UTF-8 for .. won't be treated as
> .. by the kernel.

Well, the security checks on ".." which worms get past aren't in the
kernel either.  This time _you_ made the strawman :)

What happens is that one program or library checks an incoming path
for ".." components - that code knows nothing about UTF-8 of course.

Then it passes the string to another program which assumes the path
has been subject to appropriate security checks, munges it in UTF-8,
and eventually does a file operation with it.  The munging generates
".." components from non-minimal UTF-8 forms - if it's not obeying the
Unicode rejection requirement (which wasn't in earlier versions), that is.

A realistic example is where the second program reads files whose
paths are mentioned in a text file which is parsed as UTF-8, after the
first program has done a security check by grepping for ".."
components.

Unicode says the second program shouldn't accept malformed UTF-8,
precisely because in real scenarios (like this one) there's a mix of
programs and libraries, some aware of UTF-8, some not, and the latter
are involved in security decisions.

Here on linux-kernel we're saying that if the second program accepts
any old byte sequence in a filename, it should preserve the byte
sequence exactly.  But any program whose parser-tokeniser is scanning
UTF-8 is very unlikely to do that - it's just too complicated to say
some bits of a text stream must be remembered as literal bytes, and
others must be scanned as multibyte characters.

We can't blame the second program for allowing those dodgy paths,
because it's the _first_ program which is setting policy.  We can't
blame the first program, because it doesn't care about UTF-8.  The
second program is just obeying orders, and the first program is just
applying POSIX rules.

These type of security holes are quite real, among software which
handles UTF-8 and also deals with paths.  At the current time, that
especially means XML, HTML, URIs, web servers and things behind them.

The holes only arise because software which is interpreting UTF-8 is
mixed with software which isn't.  That's one of the most useful
features of UTF-8, after all - that's why we use it for filenames.

Understand, this isn't a kernel problem; it is simply a good reason
to reject malformed UTF-8 by programs which parse UTF-8.

-- Jamie


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:
Original-Message-ID: <Pine.LNX.4.58.0402171134180.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 19:49:01 GMT
Message-ID: <fa.j3c8qld.1c2kk0r@ifi.uio.no>

On Tue, 17 Feb 2004, Jamie Lokier wrote:
>
> Well, the security checks on ".." which worms get past aren't in the
> kernel either.  This time _you_ made the strawman :)

Note that this is something that the kernel _can_ fix easily.

In particular, we already have flags like LOOKUP_FOLLOW and
LOOKUP_DIRECTORY that we use internally in the kernel to specify how to do
certain operations. We export _part_ of that to user space with the
O_DIRECTORY flag that says "allow open of directories".

And yes, we have security-related ones too (LOOKUP_NOALT disables the
alternate mount-point lookup).

And it would be _trivial_ to add a LOOKUP_NODOTDOT and allow user space to
use it through a O_NODOTDOT thing. But the people who need it really need
to do it and test it, and they need to be committed enough that they say
"yes, we'd use this, even though it's not portable". Because I don't want
to add features to the kernel that people don't use, and a lot of the
users don't want to use Linux-only things..

Same goes for O_NOFOLLOW or O_NOMOUNT, to tell the kernel that it
shouldn't follow symbolic links or cross mount-points - another thing that
some software might want to use in order to check that you can't "escape"
your subtree.

So these things would be literally trivial to add, and the only issue is
whether people would really use them.

> What happens is that one program or library checks an incoming path
> for ".." components - that code knows nothing about UTF-8 of course.
>
> Then it passes the string to another program which assumes the path
> has been subject to appropriate security checks, munges it in UTF-8,
> and eventually does a file operation with it.  The munging generates
> ".." components from non-minimal UTF-8 forms - if it's not obeying the
> Unicode rejection requirement (which wasn't in earlier versions), that is.

But note how my point was that YOU SHOULD NEVER EVER MUNGE A PATHNAME!

It is fundamentally _wrong_ to convert pathnames. You _cannot_ do it
correctly.

The rule should be:
 - convert user-input to UTF-8 early (do _nothing_ to it before the
   conversion). Allow escape sequences here.
 - never ever convert readdir/getcwd/etc system-specified paths AT ALL.
   They are already in "extended UTF-8" format (where the "extended" part
   is the 'broken UTF-8' thing). I can be like MS and call my breakage
   "extended" too ;)
 - always _always_ work on the "extended UTF-8" format, and never EVER
   convert that to anything else (except when you need to actually print
   it, but then you encode it properly with escape sequences, the way you
   have to _anyway_).

If you follow the above simple rules, you can't get it wrong. And in those
rules, ".." is the BYTE SEQUENCE in the "extended UTF-8". Nothing more.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:
Original-Message-ID: <Pine.LNX.4.58.0402171236180.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 20:53:42 GMT
Message-ID: <fa.j3caqlf.1c2ik0p@ifi.uio.no>

On Tue, 17 Feb 2004, Jamie Lokier wrote:
>
> Nope.  That wouldn't help for a bundle of libraries that goes:
>
>     1. Eliminate "." and ".." components, leaving only leading ".."s.

Who does this anyway? It's wrong. It gives the wrong answer if there was a
symlink somewhere.

I remember this exact bug in gcc (I think), at some point - trying to
"optimize" the path "aa/bb/../cc" into "aa/cc" is WRONG WRONG WRONG. They
are not the same thing at all.

Any library that does the above is broken.

>     2. Reject path if it has a leading "..".
>     3. Shove it in a string with some other text and pass to other library.
>
> Next program does:
>
>     4. Extract path from string.
>     5. open ("/var/public/files/$PATH", ...)
>
> O_NODOTDOT won't protect against that.

Ok, so explain why? O_NODOTDOT will certainly guarantee that it stays
inside "/var/public/files", since it has no way to escape (modulo
symlinks/mounts, of course).

The point being that with O_NODOTDOT | O_NOMOUNT | O_NOFOLLOW, you can
just do a simple "prepend my beginning pathname" operation, and do the
open that way without having to be careful.

Then, if the thing fails, you now need to be really careful, and perhaps
do a user-space "walk one component at a time" thing to see where it
failed. But what the O_NODOTDOT | O_NOMOUNT | O_NOFOLLOW thing gave you is
that you get a fast-path for the common case (ie you don't _always_ have
to do the "walk one component at a time" crud - only if you hit a case you
might be worried about).
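
O_NODOTDOT and O_NOMOUNT are hypothetical flags here, but the slow-path
walk can be sketched today with flags that do exist - openat(), O_NOFOLLOW
and O_DIRECTORY.  The code and its simplifications are mine:

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* Walk a relative path one component at a time, refusing ".." and
	 * symlinks at every step, so the walk can never escape the starting
	 * directory.  (Mount-point crossing - the hypothetical O_NOMOUNT -
	 * is not checked.)  Takes ownership of dirfd; modifies path. */
	static int safe_openat(int dirfd, char *path)
	{
		char *save = NULL;
		char *comp = strtok_r(path, "/", &save);

		while (comp && dirfd >= 0) {
			char *next = strtok_r(NULL, "/", &save);
			int flags = O_RDONLY | O_NOFOLLOW |
				    (next ? O_DIRECTORY : 0);
			int fd = strcmp(comp, "..") ? openat(dirfd, comp, flags)
						    : -1;	/* reject escapes */

			close(dirfd);
			dirfd = fd;
			comp = next;
		}
		return dirfd;
	}

	int main(void)
	{
		char path[] = "public/files/report.txt";	/* example only */
		int fd = safe_openat(open(".", O_RDONLY | O_DIRECTORY), path);

		printf("fd = %d\n", fd);
		if (fd >= 0)
			close(fd);
		return 0;
	}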

(Now, O_NOMOUNT isn't actually useful if you use an absolute path like the
above example - it kind of assumes that you start from pwd which would be
your "safe point", and that you expect all "safe" files to be under that
one filesystem. With an absolute path, you'll clearly often end up having
to cross mount-points unless your whole thing is on the root filesystem,
which kind of makes O_NOMOUNT useless in the first place).

Btw, right now O_NOMOUNT isn't a big issue, since only root can mount
things anyway. But if we start allowing user mounts (likely with
restrictions like you can only mount if the mount-point is owned by you
and writable), O_NOMOUNT may actually become a good idea.

		Linus


Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re: JFS 
	default behavior)
Original-Message-ID: <20040217195348.GQ8858@parcelfarce.linux.theplanet.co.uk>
Date: Tue, 17 Feb 2004 19:56:40 GMT
Message-ID: <fa.nms6gcs.1u7202m@ifi.uio.no>

On Tue, Feb 17, 2004 at 07:29:18PM +0000, Jamie Lokier wrote:
> What happens is that one program or library checks an incoming path
> for ".." components - that code knows nothing about UTF-8 of course.
>
> Then it passes the string to another program which assumes the path
> has been subject to appropriate security checks, munges it in UTF-8,
> and eventually does a file operation with it.  The munging generates
> ".." components from non-minimal UTF-8 forms - if it's not obeying the
> Unicode rejection requirement (which wasn't in earlier versions), that is.

Why the hell would it _ever_ do such normalization?

> A realistic example is where the second program reads files whose
> paths are mentioned in a text file which is parsed as UTF-8, after the
> first program has done a security check by grepping for ".."
> components.
>
> Unicode says the second program shouldn't accept malformed UTF-8,
> precisely because in real scenarios (like this one) there's a mix of
> programs and libraries, some aware of UTF-8, some not, and the latter
> are involved in security decisions.
>
> Here on linux-kernel we're saying that if the second program accepts
> any old byte sequence in a filename, it should preserve the byte
> sequence exactly.  But any program whose parser-tokeniser is scanning
> UTF-8 is very unlikely to do that - it's just too complicated to say
> some bits of a text stream must be remembered as literal bytes, and
> others must be scanned as multibyte characters.

So what you are saying is that conversion of invalid multibyte sequences
into non-error wide chars followed by conversion back into UTF-8 can
lead to trouble?  *DUH*

> The holes only arise because software which is interpreting UTF-8 is
> mixed with software which isn't.  That's one of the most useful
> features of UTF-8, after all - that's why we use it for filenames.

The holes only arise because software which is interpreting UTF-8 doesn't
care to do it properly.  Software that doesn't interpret it (including the
kernel) doesn't enter the picture at all.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:
Original-Message-ID: <Pine.LNX.4.58.0402171259440.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 21:07:22 GMT
Message-ID: <fa.j4s2r5i.1eialgm@ifi.uio.no>

On Tue, 17 Feb 2004, John Bradford wrote:

> > Ok, but... why?  What does 32-bit words get you that UTF-8 does not?
> > I can't think of a single advantage, just lots of disadvantages.
>
> The advantage is that you can use them to store UCS-4.

Wrong. UTF-8 can store UCS-4 characters just fine.

Admittedly you might need up to six octets for the worst case, but hey,
since you only need one for the most common case (by _far_), who cares?

And with the same UTF-8 encoding, you could some day encode UCS-8 too if
the idiotic standards bodies some day decide that 4 billion characters
isn't enough because of all the in-fighting.
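
That six-octet worst case is the original FSS-UTF/RFC 2279 form of the
encoding (modern UTF-8 stops at four octets and U+10FFFF); a sketch of the
encoder for a full 31-bit value, mine for illustration:

	#include <stdint.h>
	#include <stdio.h>

	/* Encode a 31-bit value in the original up-to-six-octet UTF-8
	 * scheme: ASCII stays one byte, and even 0x7FFFFFFF fits in six. */
	static int encode_utf8_31bit(uint32_t cp, unsigned char *out)
	{
		int len, i;

		if (cp < 0x80) {
			out[0] = (unsigned char)cp;
			return 1;
		}
		if (cp < 0x800)			len = 2;
		else if (cp < 0x10000)		len = 3;
		else if (cp < 0x200000)		len = 4;
		else if (cp < 0x4000000)	len = 5;
		else				len = 6;

		for (i = len - 1; i > 0; i--) {	/* low six bits at a time */
			out[i] = 0x80 | (cp & 0x3F);
			cp >>= 6;
		}
		out[0] = ((unsigned char)(0xFF << (8 - len))) | cp;
		return len;
	}

	int main(void)
	{
		unsigned char buf[6];
		int i, n = encode_utf8_31bit(0x7FFFFFFFu, buf);

		for (i = 0; i < n; i++)		/* FD BF BF BF BF BF */
			printf("%02X ", buf[i]);
		printf("(%d octets)\n", n);
		return 0;
	}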

> Now, for file _contents_ this would be a compatibility disaster, which
> is why UTF-8 is so convenient, but for file_names_ UCS-4 lets you
> unambiguously represent any string of Unicode characters.

Why do you think UTF-8 can't do this? Did you read some middle-aged text
written by monks in a monastery that said that UTF-8 encodes a 16-bit
character set?

> Basically - no more multiple representations of the same thing.  No more
> funny corner cases where several different strings of bytes eventually
> resolve to the same name being presented to the user.

Welcome to normalized UTF-8. And realize that the "non-normalized" broken
stuff is what allows us backwards compatibility.

Of course, since you like UCS-4, you don't care about backwards
compatibility.

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:
Original-Message-ID: <Pine.LNX.4.58.0402171318550.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 21:30:10 GMT
Message-ID: <fa.j5c6q5h.1e2akgn@ifi.uio.no>

On Tue, 17 Feb 2004, John Bradford wrote:
> >
> > Wrong. UTF-8 can store UCS-4 characters just fine.
>
> Does just fine include unambiguously?

If you don't care about backwards compatibility, then yes. You just have
to use "strict" UTF-8.

>				  Sure, standards-conforming
> UTF-8 is unambiguous, but you've already said time and again that that
> doesn't happen in the real world.  I just don't agree on the UTF-8 can
> store UCS-4 characters just fine thing _at all_.

You get to choose between "throw the baby out with the bathwater" or "be
compatible".

Sane people choose compatibility. But it's your choice. You can always
normalize things if you want to - but don't complain to me if it breaks
things. It will still break _fewer_ things than UCS-4 would, so even if
you always normalize you'd still be _better_ off with UTF-8 than you would
be with UCS-4.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:
Original-Message-ID: <Pine.LNX.4.58.0402171251130.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 21:02:03 GMT
Message-ID: <fa.j3c0r5a.1c24lgu@ifi.uio.no>

On Tue, 17 Feb 2004, John Bradford wrote:
>
> Why not:

I'll start with the first one. That already kills the rest.

> * State that filenames are strings of 32-bit words.  UCS-4 should be
>   the prefered format for storing text in them, but storing legacy
>   encodings in the low 8 bits is acceptable, (but a Bad Thing for new
>   installations).

UCS-4 is as braindamaged as UCS-2 was, and for all the same reasons.

It's bloated, non-expandable, and not backwards compatible.

In contrast, UTF-8 doesn't measurably expand any normal text that didn't
need it, is backwards compatible in the major ways that matter, and can be
extended arbitrarily.

UCS-4 has _zero_ advantages over UTF-8.

Please. Give it up. Anybody who thinks that _any_ other encoding format
than UTF-8 is valid is just _wrong_.

(Now, I'll give that a lot of people don't like Unicode, so I'll allow
that maybe you'd want to use the UTF-8 _encoding_scheme_ for some other
mapping, but I don't see that that is worth the pain any more. Unicode may
be a horrible enumeration, but in the end all font encodings are arbitrary
anyway, so the unicode haters might as well start giving up).

In short: even if you hate Unicode with a passion, and refuse to touch it
and think standards are worthless, you should still use the same
transformation that UTF-8 does to your idiotic character set of the day.
Because the _transform_ makes sense regardless of character set encoding.

		Linus


Newsgroups: fa.linux.kernel
From: hpa@zytor.com (H. Peter Anvin)
Subject:  Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:
Original-Message-ID:  <c0ul54$45h$1@terminus.zytor.com>
Date: Wed, 18 Feb 2004 03:13:39 GMT
Message-ID: <fa.gplgo6m.1ilsibu@ifi.uio.no>

Followup to:  <Pine.LNX.4.58.0402171251130.2154@home.osdl.org>
By author:    Linus Torvalds <torvalds@osdl.org>
In newsgroup: linux.dev.kernel
>
> UCS-4 is as braindamaged as UCS-2 was, and for all the same reasons.
>
> It's bloated, non-expandable, and not backwards compatible.
>

UCS-4 is actually a very nice format for *internal processing*.  For
data interchange, it sucks eggs.

UCS-2 is historic.  Its successor, UTF-16, is one of the worst
horrors ever inflicted on mankind by Microsoft.

	-hpa


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:
Original-Message-ID: <Pine.LNX.4.58.0402171347420.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 22:00:04 GMT
Message-ID: <fa.j4s0qtg.1ei4loo@ifi.uio.no>

On Tue, 17 Feb 2004, Jamie Lokier wrote:
>
> >   I can point at the example of this "solution" that happened years ago
> > when UCS-2 was all the rage, and it got hardcoded and enforced by NTFS
> > and everything that handles it. Who is laughing about that decision now?
>
> We are all laughing ;)

Crying. Sadly, when MS makes a whopper of a mistake (and they do it all
too often), we're left having to work with the resulting breakage.

I suspect most samba developers are already technically insane (*).

		Linus

(*) Of course, since many of them are Australians, you can't tell.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API (was: Re:
Original-Message-ID: <Pine.LNX.4.58.0402170820070.2154@home.osdl.org>
Date: Tue, 17 Feb 2004 16:33:27 GMT
Message-ID: <fa.j2ckqd9.1c2ok8v@ifi.uio.no>

On Tue, 17 Feb 2004, Marc Lehmann wrote:
>
> Because there is a fundamental difference between file contents and
> filenames. Filenames are supposed to be text.

I think this is actually the fundamental point where we disagree.

You think of filenames as something the user types in, and that is
"readable text". And I don't.

I think the filenames are just ways for a _program_ to look up stuff, and
the human readability is a secondary thing (it's "polite", but not a
fundamental part of their meaning).

So the same way I think text is good in config files and I dislike binary
blobs (hey, look at /proc), I think readable filenames are good. But that
doesn't mean that they have to be readable. I can well imagine encoding
meta-data in the filename for some database that uses the filesystem as
its backing store and generates files for large blobs. And then there
would be little if any "goodness" to keeping the filenames readable.

That's also a situation where case-insensitivity can _really_ screw you
(just one of the many).

It may be rare, but unlike you, I don't think there is anything "wrong"
with considering path components to be just "data".

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API
Original-Message-ID: <Pine.LNX.4.58.0402180716180.2686@home.osdl.org>
Date: Wed, 18 Feb 2004 15:38:29 GMT
Message-ID: <fa.j2suqti.1cimnok@ifi.uio.no>

On Tue, 17 Feb 2004, H. Peter Anvin wrote:
>
> What does "printable" mean in this context?  Typically you have to
> convert it to UCS-4 first, so you can index into your font tables, then
> you have to create the right composition, apply the bidirectional text
> algorithm, and so forth.

Not all characters _have_ font entries. And even when they have font
entries, they may need escaping for other reasons (ie you may want to
marshall UTF-8 as plain ASCII just because you want to use a portable
format for transfer).

Think about the simple (hex) string x0A x00. That's a well-defined UTF-8
string, yet if you want to print it as a filename on the console, you
should obviously print it as "\n" or some similar escaped sequence
(actually, that's a bad example, since it's a special case, and it would
probably be better to use the example string x7F x00, which would be shown
as \u007F or something).

The same is true for a _lot_ of perfectly fine UTF-8 sequences, no?

That implies that you have to use an escaped sequence _anyway_. So as you
go along, turning the string into something printable, you might as well
escape the invalid UTF-8 sequences.

In other words: you walk the utf-8 string one character at a time,
converting it to whatever format (eg UCS-4) you have for font lookup, but
you also escape characters that you don't have font entries for or that
aren't in proper UTF-8 format.

When converting to UCS-4, you have to check for the proper format
_anyway_, so none of this is in any way "extra work". Instead of just
aborting on an invalid UTF-8 character, you quote it, exactly the same way
you'd have to quote a _valid_ one that you can't just show as a string.

> Rendering general Unicode text is complex enough that you really want it
layered.  What I described was the first step of that -- mostly trying
> to show that "throwing an error" doesn't necessarily mean "produce no
> output."  What you shouldn't do, though, is alias it with legitimate input.

Exactly. And since you need an escape sequence anyway, what's the problem?

> > And if you do things right (ie you allow user input in that same escaped
> > output format), you can allow users to re-create the exact "broken utf-8".
> > Which is actually important just so that the user can fix it up (ie
> > imagine the user noticing that the filename is broken, and now needs to do
> > a "mv broken-name fixed-name" - the user needs some way to re-create the
> > brokenness).
>
> Indeed.  The C language has gone with \x77 for bytes and \u7777 or
> \U77777777 for Unicode characters (4 vs 8 hex digits respectively); I
> think this is a good UI for shells to follow.  The \x representation
> then doesn't stand for characters but for bytes.  It may be desirable to
> disallow encoding of *valid* UTF-8 characters this way, though.

You need to encode even valid UTF-8, since you may not find a font entry
for the character, or the character just isn't appropriate in that context
(ie you can't show a newline).

But it makes perfect sense to use a policy of:
 - escape valid UTF-8 characters as '\u7777'
 - escape _invalid_ UTF-8 characters as their hex byte sequence (ie
   '\xC0\x80\x80', whatever)
 - (and, obviously, escape the valid UTF-8 character '\' as '\\').

Don't you agree? It clearly allows all the cases, and you can re-generate
the _exact_ original stream of bytes from the above (ie it is nicely
reversible, which in my opinion is a requirement).
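
For what it's worth, a rough C sketch of that policy (my illustration,
not code from the thread; the decoder deliberately skips the overlong
and surrogate checks a real validator would also need):

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Try to decode one UTF-8 sequence at s (n bytes left).  On success
     * store the code point in *cp and return the sequence length; return
     * 0 if the bytes do not form a valid sequence. */
    static int utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
    {
        int len, i;

        if (s[0] < 0x80) { *cp = s[0]; return 1; }
        else if ((s[0] & 0xE0) == 0xC0) { len = 2; *cp = s[0] & 0x1F; }
        else if ((s[0] & 0xF0) == 0xE0) { len = 3; *cp = s[0] & 0x0F; }
        else if ((s[0] & 0xF8) == 0xF0) { len = 4; *cp = s[0] & 0x07; }
        else return 0;                     /* bare continuation byte etc */

        if ((size_t)len > n)
            return 0;                      /* truncated sequence */
        for (i = 1; i < len; i++) {
            if ((s[i] & 0xC0) != 0x80)
                return 0;                  /* bad continuation byte */
            *cp = (*cp << 6) | (s[i] & 0x3F);
        }
        return len;
    }

    /* Print a name using the reversible escaping policy above. */
    static void print_escaped(const unsigned char *s, size_t n)
    {
        while (n > 0) {
            uint32_t cp;
            int len = utf8_decode(s, n, &cp);

            if (len == 0) {                /* invalid byte: \xXX */
                printf("\\x%02X", (unsigned)s[0]);
                len = 1;
            } else if (cp == '\\') {       /* escape the escape */
                printf("\\\\");
            } else if (cp >= 0x20 && cp < 0x7F) {
                putchar(cp);               /* plain printable ASCII */
            } else if (cp <= 0xFFFF) {     /* valid but not showable */
                printf("\\u%04X", cp);
            } else {
                printf("\\U%08X", cp);
            }
            s += len;
            n -= len;
        }
        putchar('\n');
    }

Since every invalid byte comes out as \xXX and everything else comes out
as itself (or as \uXXXX / \UXXXXXXXX), the exact original byte stream can
always be reconstructed from the output.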

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: UTF-8 practically vs. theoretically in the VFS API
Original-Message-ID: <Pine.LNX.4.58.0402181154290.2686@home.osdl.org>
Date: Wed, 18 Feb 2004 20:02:39 GMT
Message-ID: <fa.j3skrtg.1diomom@ifi.uio.no>

On Wed, 18 Feb 2004, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > Somebody correctly pointed out that you do not need any out-of-band
> > encoding mechanism - the very fact that it's an invalid sequence is in
> > itself a perfectly fine flag. No out-of-band signalling required.
>
> Technically this is almost(*) correct,
>
> (*) - It's fine until you concatenate two malformed strings.  Then the
>       out-of-band signal is lost if the combination is valid UTF-8.

But that's what you _want_. Having a real out-of-band signal that says
"this stuff is wrong, because it was wrong at some point in the past", and
not allowing concatenation of blocks of utf-8 bytes would be _bad_.

The thing is, concatenating two malformed UTF-8 strings is normal behaviour
in a variety of circumstances, all basically having to do with lower
levels not knowing about higher-level concepts.

For example, look at a web-page. Look at how the data comes in: it comes
as a stream of bytes, with blocking rules that have _nothing_ to do with
the content (timing, mtu's, extended TCP headers etc etc). That doesn't
mean that you shouldn't be able to
 - work on the partial results and show them to the user as UTF-8
 - be able to concatenate new stuff as it comes in.

Having an out-of-band signal for "bad" would literally be a bad idea. If
you get a valid UTF-8 stream as a result of concatenation, you should
consider that to be the correct behaviour, or you should CHECK BEFOREHAND
if you think it is illegal.
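
As a concrete illustration (my example, not from the mail): split one
multi-byte character across two reads and each fragment is malformed on
its own, yet the concatenation is valid UTF-8 again -- which is exactly
what an incremental consumer of a byte stream depends on:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* U+00E9 is the UTF-8 byte pair C3 A9.  Split across two network
         * reads, neither fragment is valid UTF-8 by itself: */
        const char first[]  = "caf\xC3";  /* ends mid-sequence: malformed */
        const char second[] = "\xA9";     /* bare continuation: malformed */

        char name[8];
        strcpy(name, first);
        strcat(name, second);             /* whole again: valid UTF-8 */
        printf("%s\n", name);             /* prints "café" */
        return 0;
    }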

		Linus


From: Theodore Ts'o <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: A Great Idea (tm) about reimplementing NLS.
Date: Thu, 16 Jun 2005 02:38:57 UTC
Message-ID: <fa.e6s7dhc.i600a4@ifi.uio.no>
Original-Message-ID: <20050616023630.GC9773@thunk.org>

On Wed, Jun 15, 2005 at 09:49:05PM -0400, Patrick McFarland wrote:
> On Monday 13 June 2005 03:20 pm, Alan Cox wrote:
> > An ext3fs is always utf-8. People might have chosen to put other
> > encodings on it but that's "not our fault" ;)
>
> What happens if you 'field upgrade' ext2 to ext3 by adding a journal? That
> doesn't magically convert !utf-8 to utf-8.

Ext2/3's encoding has always been utf-8.  Period.

There have been some people who have chosen to do something else
locally, but that was about as valid as the people who violated SMTP
standards by Just Sending 8-bits instead of using MIME.

							- Ted


From: Theodore Ts'o <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: A Great Idea (tm) about reimplementing NLS.
Date: Thu, 16 Jun 2005 14:39:20 UTC
Message-ID: <fa.d78rd6s.1qlgsb0@ifi.uio.no>
Original-Message-ID: <20050616143727.GC10969@thunk.org>

On Thu, Jun 16, 2005 at 12:33:16AM -0400, Jeremy Maitin-Shepard wrote:
> > Ext2/3's encoding has always been utf-8.  Period.
>
> In what way does Ext2/3 know or care about file name encoding?  Doesn't
> it just store an arbitrary 8-bit string?  Couldn't someone claim that
> from the start it was designed to use iso8859-1 just as easily as you
> can claim it was designed to use utf-8?

Because we've had this discussion^H^H^H^H^H^H^H^H^H^H^H flame war
years ago, and despite people from Russia whining that it took 3
bytes to encode each Cyrillic character in UTF-8, it's where we came out.

The bottom line, though, is that if someone files a bug report with ext3
because one user on the system is creating filenames in Japanese, and
another user on the same time-sharing system is creating filenames in
German, each in their own local encoding, and the two fail to
interoperate, we would laugh at them --- just as people writing mail
programs would laugh at people who complained that they were running
into problems Just Sending 8-bits instead of using MIME, and could you
please fix this business-critical bug?

Or as more and more desktop programs start interpreting the filenames
as UTF-8, and people with local variations get screwed, that is their
problem, and Not Ours.

So no, we can't prevent anyone from shooting themselves in the foot.
However, if they *do* take the gun, aim it straight downwards, and
pull the trigger, we aren't obligated to help.

						- Ted


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Smackv10: Smack rules grammar + their stateful parser
Date: Wed, 07 Nov 2007 04:11:37 UTC
Message-ID: <fa.kcoMKm9IQI7yOZXbNrB11pOmFEg@ifi.uio.no>

On Wed, 7 Nov 2007, Adrian Bunk wrote:
>
> Users are used to working on characters, not on bytes.

Adrian, stop this idiocy. I'm not interested in listening to your
soliloquy about irrelevant stuff.

The kernel works on bytes. Deal with it. Stop whining.

You've been told several times that all the examples you showed were
irrelevant, and tomoyo worked on bytes too.

You have been told several times that the VFS layer works on bytes, and
has done so since day 1.
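
As an aside (an illustrative user-space sketch, not part of the original
message): the VFS special-cases only 0x2F ('/') and NUL in a name, so
even a byte sequence that can never occur in valid UTF-8 is a perfectly
legal filename:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* 0xC0 0x80 can never occur in valid UTF-8 (it is the overlong
         * form of NUL), but it is a perfectly legal filename: the VFS
         * only forbids the bytes 0x2F ('/') and 0x00 in a component. */
        const char name[] = "\xC0\x80-not-utf8";

        int fd = open(name, O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        close(fd);
        puts("created a file whose name is not valid UTF-8");
        return 0;
    }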

You have *also* been told that there is no real other option ("you can
work with bytes, or you can go mad"). The normal kernel interfaces have to
be locale-independent (partly because the kernel doesn't even KNOW the locale,
partly because locale is just totally irrelevant).

And your statement above is a TOTAL AND UTTER LIE.

More people are used to working with bytes (the C language calls them "char")
than with what _you_ call "characters". The fact is, people are very very
very used to working with 8-bit bytes, and there are a lot more people who
understand them than people who understand UTF-8 (never mind any of the
other million possible stupid and insane locales).

So can you stop your inane whining now? You can either:

 - accept that the kernel works on bytes (*) and that when we talk about
   parsing strings, we're talking the very _traditional_ C meaning, which
   is locale-independent, because locales DO NOT WORK in the kernel!

 - or you can continue your irrelevant ranting that has nothing to do with
   anything, but please don't cc me any more. People already pointed out
   to you that your assumption that "character" means something other than
   "byte" was wrong.

Please stop this. The absolute *last* thing you want is a kernel that
cares about locales. You *also* don't want a kernel that enforces some
idiotic UTF-8 rules, since not everybody is using UTF-8. That way lies
madness, not to mention totally unnecessary complexity.

		Linus

(*) With some *very* rare special cases, notably in the console driver,
and for filesystems that are forced by idiot designers to be compatible
with crap like OS X and Windows that think that filesystems should be
case-insensitive -- which is a fundamental problem exactly because of its
dependence on locales.

