The everything-is-a-file principle (Linus Torvalds)

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] Futex Asynchronous Interface
Original-Message-ID: <Pine.LNX.4.44.0206060930240.5920-100000@home.transmeta.com>
Date: Thu, 6 Jun 2002 16:36:58 GMT
Message-ID: <fa.m7v2dav.160k90u@ifi.uio.no>

On Thu, 6 Jun 2002, Rusty Russell wrote:
>
> The method is: open /dev/futex

STOP!

What madness is this?

You have a damn mutex system call, don't introduce node crap in /dev.

Do we create pipes by opening /dev/pipe? No. Do we have major and minor
numbers for sockets and populate /dev with them? No. And as a result,
there have _never_ been any sysadmin problems with either.

You already have to have a system call to bind the particular fd to the
futex _anyway_, so do the only sane thing, and allocate the fd _there_,
and get rid of that stupid and horrible /dev/futex which only buys you
pain, system administration, extra code, and a black star for being
stupid.

		Linus
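
An illustrative C sketch of the point above, added here and not part of the original message: pipe() returns two working descriptors without any device node or filename being involved. The helper name pipe_roundtrip is hypothetical and exists only for this example.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: creates an anonymous pipe, pushes msg through it,
 * and returns the number of bytes read back into out (-1 on error).
 * No path in /dev or anywhere else is opened: pipe() allocates the
 * descriptors itself. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t outlen)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    size_t len = strlen(msg);
    ssize_t n = -1;
    if (write(fds[1], msg, len) == (ssize_t)len)
        n = read(fds[0], out, outlen);

    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Calling pipe_roundtrip("futex", buf, sizeof buf) works even on a system with an empty /dev, which is the property the message argues a futex fd should share.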

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] Futex Asynchronous Interface
Original-Message-ID: <Pine.LNX.4.44.0206081523410.11630-100000@home.transmeta.com>
Date: Sat, 8 Jun 2002 22:29:26 GMT
Message-ID: <fa.l2cpo2v.1mh639m@ifi.uio.no>

On Fri, 7 Jun 2002, Peter Wächtler wrote:
>
> What about /proc/futex then?

Why?

Tell me _one_ advantage from having the thing exposed as a filename?

The whole point with "everything is a file" is not that you have some
random filename (indeed, sockets and pipes show that "file" and "filename"
have nothing to do with each other), but the fact that you can use common
tools to operate on different things.

But there's absolutely no point in opening /dev/futex from a shell script
or similar, because you don't get anything from it. You still have to bind
the fd to its real object.

In short, the name "/dev/futex" (or "/proc/futex") is _meaningless_.
There's no point to it. It has no life outside the FUTEX system call, and
the only thing that you can do by exposing it as a name is to cause
problems for people who don't want to mount /proc, or who do not happen to
have that device node in their /dev.

> Give it an entry in the namespace, why not with sockets (unix and ip) also?

Perhaps because you cannot enumerate sockets and pipes? They don't _have_
names before they are created. Same as futexes, btw.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] Futex Asynchronous Interface
Original-Message-ID: <Pine.LNX.4.44.0206091029001.13459-100000@home.transmeta.com>
Date: Sun, 9 Jun 2002 17:51:15 GMT
Message-ID: <fa.l4shnqv.1g1a3hr@ifi.uio.no>

On Sun, 9 Jun 2002, Peter Wächtler wrote:
>
> Still you can open a file in the namespace and write some commands to it.
> Then it turns out to be a socket on port 25:
>
> fd=open("/dev/socket",O_RDWR);
> write(fd,"connect stream 25\n",sizeof(..));
> write(fd,"helo mail.my.com\n",..);

Yes, obviously you can avoid system calls entirely, and replace all of
them with read/write of commands.

This is not even a very uncommon idea: the above is basically message
passing, and is largely how many microkernels work. Except they don't call
it read/write, they tend to call it send/recv, and they aren't "file
descriptors", they are "ports".

It has advantages: because you only have one set of primitives, it's more
easily abstracted at that level, meaning that you can (and people do) make
it distributed etc without having to worry about local semantics.

It has disadvantages too: performance tends to be bad (you have to copy
around and parse the commands that are no longer implicit in the system
call number), and while there is a high level of abstraction on one level
("everything is a 'port' that can receive or send messages"), at some point
the proverbial shit hits the fan and you've moved the details behind the
abstraction down (and now the data stream is no longer just bytes, but has
a meaning in itself).

But yes, the sequences

	open("/dev/socket")		->	socket()
	write(fd,"connect stream 25")	->	connect()

are obviously "equivalent". It's not my personal favourite equivalence,
though. I'd much rather add the information at _open_ time, and make it a
name-space issue, so that you'd do something like

	open("//sockfs/dst=123.45.67.89:25", O_RDWR);

instead. Which is _also_ entirely equivalent, of course (the "namespace"
approach does require that you be able to do "fd-relative" lookups, so
that you could also do

	sk = open("//sockfs", O_RDWR);
	sk2 = fd_open(sk, "dst=123.45.67.89:25", O_RDWR);

which is actually useful even in regular files too, just as a way of doing
directory-relative file opens without having to do a "chdir()").

HOWEVER, the fact is that exactly because they are equivalent, there is no
real difference between them. So you might as well just use the old UNIX
behaviour, and if you want to open sockets from a script, you use any of
the already _existing_ socket script helpers. For port 25, you have one
called "sendmail". For port 80, you have things like "lynx -source".

And you have tons of things like "netpipes", for doing generic scripting
of sockets.

The fact is, trying to come up with new ways to do the same old thing is
_not_ a good idea. It may look cool to expose sockets in the namespace,
but what's the actual added advantage over existing standard practices?
Unless that can be shown, there's just no point.

Do a google search for "netpipes", I'm sure you'll find it can do what you
wanted.

Sorry to rain on the "cool feature" parade, but I want to see some
_advantage_ from exposing new names in the namespace.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] Futex Asynchronous Interface
Original-Message-ID: <Pine.LNX.4.44.0206091056550.13459-100000@home.transmeta.com>
Date: Sun, 9 Jun 2002 18:10:36 GMT
Message-ID: <fa.l7croav.1ih021k@ifi.uio.no>

On 9 Jun 2002, Kai Henningsen wrote:
>
> However, I don't think that's all that important. What I'd rather see is
> making the network devices into namespace nodes. The situation of eth0 and
> friends, from a Unix perspective, is utterly unnatural.

But what would you _do_ with them? What would be the advantage as compared
to the current situation?

Now, to configure a device, you get a fd to the device the same way you
get a fd _anyway_ - with "socket()".

And anybody who says that "socket()" is utterly unnatural to the UNIX way
is quite far out to lunch. It may be unnatural to the Plan-9 way of
"everything is a namespace", but that was never the UNIX way. The UNIX way
is "everything is a file descriptor or a process", but that was never
about namespaces.

Yes, some old-timers could argue that original UNIX didn't have sockets,
and that the BSD interface is ugly and an abomination and that it _should_
have been a namespace thing, but that argument falls flat on its face when
you realize that the "pipe()" system call _was_ in original UNIX, and has
all the same issues.

Don't get hung up about names.

			Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: of ethernet names (was [PATCH] Futex Asynchronous
Original-Message-ID: <Pine.LNX.4.44.0206091130490.13751-100000@home.transmeta.com>
Date: Sun, 9 Jun 2002 18:35:26 GMT
Message-ID: <fa.l2t5oiv.1m1q39i@ifi.uio.no>

On Sun, 9 Jun 2002, Dr. David Alan Gilbert wrote:
>
> Personally I would do away with ifconfig and replace it with
> cat in and out of device nodes; ifconfig seems to suffer about having to
> know about every protocol on every device type and the kernel has to
> provide interfaces for it that only it uses.

Well, the kernel would have to provide the same interfaces for "cat" if
you did it that way, and it would probably take up more space and cause
more kernel bloat.

And we'd still have to have the old interfaces for backwards compatibility
for ifconfig.

Is the "magic ioctl" approach ugly? Sure. But it's fairly well contained
to just one program (ifconfig), and everybody else just uses that. I think
it's less horrible than the alternatives right now.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2/5] signalfd v2 - signalfd core ...
Date: Thu, 08 Mar 2007 16:24:03 UTC
Message-ID: <fa.yLetPnUEmfUYainqduUPKkNd06w@ifi.uio.no>

On Thu, 8 Mar 2007, Davide Libenzi wrote:
>
> With this patch, if you get a POLLIN, you have a signal to read for sure (well,
> unless another thread/task reads it before you - but that's just
> something you have to take care of). There is no explicit check for
> O_NONBLOCK now, but a zero timeout would do exactly the same thing.

You missed David's worry, I think.

Not only is POLLIN potentially an edge event (depending on the interface
you use to fetch it), but even as a level-triggered one you generally want
to read as much as possible per POLLIN event, and go back to the event
loop only when you get EAGAIN.

So that's in addition to the read/signal race with other
threads/processes.

You solved it by having a separate system call, but since it's a regular
file descriptor, why have a new system call at all, and not just make it
be a "read()"? In which case you definitely want O_NONBLOCK support.

The UNIX philosophy is often quoted as "everything is a file", but that
really means "everything is a stream of bytes".

In Windows, you have 15 different versions of "read()" with sockets and
files and pipes all having strange special cases and special system calls.
That's not the UNIX way. It should be just a "read()", and then people can
use general libraries and treat all sources the same.

For example, the main select/poll/epoll loop may be the one doing all the
reading, and then pass off "full buffers only" to the individual per-fd
"action routines". And that kind of model really very fundamentally wants
an fd to be an fd to be an fd - not "some file descriptors need
'read_from_sigfd()', and some file descriptors need 'read()', and some
file descriptors need 'recvmsg()'" etc.

So I think you should get rid of signalfd_dequeue(), and just replace it
with a "read()" function.

		Linus
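
An illustrative C sketch of the event-loop model described above, added here and not part of the original message: one poll() loop drains a pipe and a UNIX socket with the same read() call, going back to the loop only on EAGAIN. The function names (set_nonblocking, drain_fd, poll_mixed_fds) are hypothetical, and the test data is arbitrary.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode. Returns 0 on success. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Read as much as possible with plain read() until EAGAIN or EOF.
 * Returns the total number of bytes read, or -1 on a real error. */
static ssize_t drain_fd(int fd)
{
    char buf[256];
    ssize_t total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) { total += n; continue; }
        if (n == 0) return total;                      /* end of file */
        if (errno == EAGAIN || errno == EWOULDBLOCK) return total;
        if (errno == EINTR) continue;
        return -1;
    }
}

/* Writes 3 bytes into a pipe and 5 bytes into a socket pair, then polls
 * both read ends and drains them with the same routine.
 * Returns the total number of bytes read, or -1 on error. */
ssize_t poll_mixed_fds(void)
{
    int p[2], s[2];
    ssize_t total = -1;

    if (pipe(p) == -1)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, s) == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    if (set_nonblocking(p[0]) == 0 && set_nonblocking(s[0]) == 0 &&
        write(p[1], "abc", 3) == 3 && write(s[1], "hello", 5) == 5) {
        struct pollfd pfds[2] = {
            { .fd = p[0], .events = POLLIN },
            { .fd = s[0], .events = POLLIN },
        };
        if (poll(pfds, 2, 1000) > 0) {
            total = 0;
            for (int i = 0; i < 2; i++) {
                if (!(pfds[i].revents & POLLIN))
                    continue;
                ssize_t n = drain_fd(pfds[i].fd);   /* no per-type special case */
                if (n < 0) { total = -1; break; }
                total += n;
            }
        }
    }

    close(p[0]); close(p[1]);
    close(s[0]); close(s[1]);
    return total;
}
```

The loop never asks what kind of object sits behind each descriptor, which is the property that a separate signalfd_dequeue() call would break.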

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2/5] signalfd v2 - signalfd core ...
Date: Thu, 08 Mar 2007 17:29:00 UTC
Message-ID: <fa.36v6m+BTJbc0UZr7qfaU4I7I8F0@ifi.uio.no>

On Thu, 8 Mar 2007, Michael K. Edwards wrote:
>
> Make it a netlink socket and fetch your structures using recvmsg().
> siginfo_t belongs in ancillary data.

Gaah. That interface is horrible.

> The UNIX philosophy is "everything's a file".  The Berkeley philosophy
> is "everything's a socket, except for files, which are feeble
> mini-sockets".  I'd go with the Berkeley crowd here.

No, the berkeley crowd is totally out to lunch.

I might agree with you *if* you could actually do "recvmsg()" on arbitrary
file descriptors, but you cannot.

We could fix that in Linux, of course, but the fact is, "recvmsg()" is
*not* a superset of "read()". In general, it's a *subset*, exactly because
very few file descriptors support it.

So the normal way to read from a file descriptor (and the *only* way in
any generic select loop) is to use "read()". That's the only thing that
works for everything. And we shouldn't break that.

The sad part is that there really is no reason why the BSD crowd couldn't
have done recvmsg() as an "extended read with per-system call flags",
which would have made things like O_NONBLOCK etc unnecessary, because you
could do it just with MSG_DONTWAIT..

So anybody who would "go with the Berkeley crowd" really shows a lot of
bad taste, I'm afraid. The Berkeley crowd really messed up here, and it's
so long ago that I don't think there is any point in us trying to fix it
any more.

(But if somebody makes recvmsg a general VFS interface and makes it just
work for everything, I'd probably take the patch as a cleanup. I really
think it should have been a "struct file_operations" thing rather than
being a socket-only thing.. But since you couldn't portably use it
anyway, the thing is pretty moot)

				Linus
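
An illustrative C sketch, added here and not part of the original message, of why recvmsg() cannot serve as the generic read primitive: on a pipe it fails with ENOTSOCK, while read() on the same descriptor succeeds. The helper name recvmsg_errno_on_pipe is hypothetical.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: tries recvmsg() on the read end of a pipe that has
 * one byte queued, then falls back to read().
 * Returns the errno from recvmsg() (0 if it unexpectedly succeeded),
 * or -1 if the setup or the fallback read() failed. */
int recvmsg_errno_on_pipe(void)
{
    int p[2];
    char buf[8];

    if (pipe(p) == -1)
        return -1;
    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
    struct msghdr msg;
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    /* MSG_DONTWAIT is the per-call non-blocking flag mentioned above. */
    int err = 0;
    if (recvmsg(p[0], &msg, MSG_DONTWAIT) == -1)
        err = errno;

    /* The plain read() path works on the very same descriptor. */
    ssize_t n = read(p[0], buf, sizeof buf);

    close(p[0]);
    close(p[1]);
    return n == 1 ? err : -1;
}
```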

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2/5] signalfd v2 - signalfd core ...
Date: Thu, 08 Mar 2007 17:16:49 UTC
Message-ID: <fa.cYAvg0XfOZcR/yrH852IQWdtT9E@ifi.uio.no>

On Thu, 8 Mar 2007, Davide Libenzi wrote:
>
> The reason for the special function, was not to provide a non-blocking
> behaviour with zero timeout (that's just a side effect), but to read the
> siginfo. I was all about using read(2) (and v1 used it), but when you have
> to transfer complex structures over it, it becomes hell. How do you
> cleanly compat over a f_op->read callback for example?

I agree that it gets a bit "interesting", and one option might be that the
"read()" interface just gets the signal number and the minimal siginfo
information, which is, after all, what 99% of all apps actually care
about.

But "siginfo_t" is really a *horrible* structure. Nobody sane should ever
use siginfo_t, and the designer of that thing was so high on LSD that it's
not even funny. Re-using fields in a union? Values that depend on other
bits in the thing in random manners?

In other words, I bet that we could just make it a *lot* better by making
the read structure be:

 - 16 4-byte fields (fixed 64-byte packet), where each field is an
   uint32_t (we could even do it in network byte order if we care and if
   you want to just pipe the information from one machine to another, but
   that sounds a bit excessive ;)

 - Just put the fields people actually use at fixed offsets: si_signo,
   si_errno, si_pid, si_uid, si_band, si_fd.

 - that still leaves room for the other cases if anybody ever wants them
   (but I doubt it - things like si_addr are really only useful for
   synchronous signals that are actually done as *signals*, since you
   cannot defer a SIGBUS/SIGSEGV/SIGILL *anyway*).

So I bet 99% of users actually just want si_signo, while some small subset
might want the SIGCHLD info and some of the special cases (eg we might
want to add si_addr as a 64-bit thing just because the USB stack sends a
SI_ASYNCIO thing for completed URB's, so a libusb might want it, but
that's probably the only such user).

And it would be *cleaner* than the mess that is siginfo_t..

(I realize that siginfo_t is ugly because it built up over time, using the
same buffer for many different things. I'm just saying that we can
probably do better by *not* using it, and just laying things out in a
cleaner manner to begin with, which also solves any compatibility issues)

		Linus
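
An illustrative C sketch of the fixed 64-byte packet proposed above, added here and not part of the original message. The struct name, the field order, and the reserved tail are assumptions made for this example. It is not the layout of any real kernel structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 16 fields of 4 bytes each, no unions, every
 * commonly used value at a fixed offset. */
struct sig_packet {
    uint32_t signo;         /* si_signo */
    uint32_t err;           /* si_errno ("errno" itself is a macro) */
    uint32_t pid;           /* si_pid */
    uint32_t uid;           /* si_uid */
    uint32_t band;          /* si_band */
    uint32_t fd;            /* si_fd */
    uint32_t reserved[10];  /* room for the rarer cases, e.g. a 64-bit si_addr */
};

/* The size is the same for 32-bit and 64-bit callers, so read() needs no
 * compat handling. */
static_assert(sizeof(struct sig_packet) == 64, "packet must be 64 bytes");
```

Because every member is a uint32_t, the compiler inserts no padding and the layout does not depend on the word size of the reader.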

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2/5] signalfd v2 - signalfd core ...
Date: Thu, 08 Mar 2007 19:27:57 UTC
Message-ID: <fa.G19YT6NjsTW5pT0AnjFeFoS9Wpo@ifi.uio.no>

On Thu, 8 Mar 2007, Davide Libenzi wrote:
>
> So, to cut it short, I can do the pseudo-siginfo read(2), but I don't
> like it too much (little, actually). The siginfo, as bad as it is, is a
> standard used in many POSIX APIs (hence even in kernel), and IMO if we
> want to send that back, a struct siginfo it should be.
> No?

I think it's perfectly fine if you make it "struct siginfo" (even though I
think it's a singularly ugly struct). It's just that then you'd have to
make your read() know whether it's a compat-read or not, which you really
can't.

Which is why you introduced a new system call, but that leads to all the
problems with the file descriptor no longer being *usable*.

Think scripts. It's easy to do reads in perl scripts, and parse the
output. In contrast, making perl use a new system call is quite
challenging.

And *that* is why "everything is a stream of bytes" is so important. You
don't know where the file descriptor has been, or who uses it. Special
system calls for special file descriptors are just *wrong*.

After all, that's why we'd have a signalfd() in the first place: exactly
so that you do *not* have to use special system calls, but can just pass
it on to any event waiting mechanism like select, poll, epoll. The same is
just *even*more*true* when it comes to reading the data!

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...
Date: Sat, 10 Mar 2007 21:44:53 UTC
Message-ID: <fa.jXxFH+f6+h145nuS3tS+QtqwPi0@ifi.uio.no>

On Sat, 10 Mar 2007, Nicholas Miell wrote:
>
> That's what the sigevent structure is for -- to describe how events
> should be signaled to userspace, whether by signal delivery, thread
> creation, or queuing to event completion ports. If you think
> extending it would be bad, I can show you the line in POSIX where it
> encourages the contrary.

I'm sorry, but by pointing to the POSIX timer stuff, you're just making
your argument weaker.

POSIX timers are a horrible crock and over-designed to be a union of
everything that has ever been done. Nasty. We had tons of bugs in the
original setup because they were so damn nasty.

I'd rather look at just about *anything* else for good design than at
some of the abortions that are posix-timers.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...
Date: Sat, 10 Mar 2007 22:42:49 UTC
Message-ID: <fa.VEzwAybcgwouCeDwgcbOkbLjnWw@ifi.uio.no>

On Sat, 10 Mar 2007, Nicholas Miell wrote:
>
> Care to elaborate on why they're a horrible crock?

It's a *classic* case of an interface that tries to do everything under
the sun.

Here's a clue: look at any system call that takes a union as part of its
arguments. Count them. I think we have two:
 - struct siginfo
 - struct sigevent
and they are both broken horrible interfaces where the data structures
depend on various flags.

It's just not the UNIX system call way. And none of it really makes sense
if you already have a file descriptor, since at that point you know what
the notification mechanism is.

I'd actually much rather do POSIX timers the other way around: associate a
generic notification mechanism with the file descriptor, and then
implement posix_timer_create() on top of timerfd. Now THAT sounds like a
clean unix-like interface ("everything is a file") and would imply that
you'd be able to do the same kind of notification for any file descriptor,
not just timers.

But posix timers as they are done now are just an abomination. They are
not unix-like at all.

> And are the bugs fixed? If so, why replace them? They work now.

.. but the reason for the bugs was largely a very baroque interface, which
didn't get fixed (because it's specified by the standard).

I'd rather have straightforward interfaces. The timerfd() one looked a lot
more straightforward than posix timers.

(That said, using "struct itimerspec" might be a good idea. That would
also obviate the need for TFD_TIMER_SEQ, since an itimerspec automatically
has both "base" and "incremental" parts).

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 6/9] signalfd/timerfd v1 - timerfd core ...
Date: Sun, 11 Mar 2007 00:35:57 UTC
Message-ID: <fa.ZHzG7Ky4QN/om/8pdENWfOkm+WE@ifi.uio.no>

On Sat, 10 Mar 2007, Nicholas Miell wrote:
> >
> > I'd actually much rather do POSIX timers the other way around: associate a
> > generic notification mechanism with the file descriptor, and then
> > implement posix_timer_create() on top of timerfd. Now THAT sounds like a
> > clean unix-like interface ("everything is a file") and would imply that
> > you'd be able to do the same kind of notification for any file descriptor,
> > not just timers.
> >
>
> But timers aren't files or even remotely file-like

What do you think "a file" is?

In UNIX, a file descriptor is pretty much anything. You could say that
sockets aren't remotely file-like, and you'd be right. What's your point?
If you can read on it, it's a file.

And the real point of the whole signalfd() is that there really *are* a
lot of UNIX interfaces that basically only work with file descriptors. Not
just read, but select/poll/epoll.

They currently have just one timeout, but the thing is, if UNIX had just
had "timer file descriptors", they'd not need even that one. And even with
the timeout, Davide's patch actually makes for a *better* timeout than the
ones provided by select/poll/epoll, exactly because you can do things like
repeating timers and absolute time etc.

Much more naturally than the timer interface we currently have for those
system calls.

The same goes for signals. The whole "pselect()" thing shows that signals
really *should* have been file descriptors, and suddenly you don't need
"pselect()" at all.

So the "not remotely file-like" is not actually a real argument. One of
the big *points* of UNIX was that it unified a lot under the general
umbrella of a "file descriptor". Davide just unifies even more.

		Linus
