From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: sendfile(2) fails for devices?
Date: 	11 Nov 2000 16:57:29 -0800
Newsgroups: fa.linux.kernel

In article <3A0DE0C8.C700F33D@mandrakesoft.com>,
Jeff Garzik  <jgarzik@mandrakesoft.com> wrote:
>sendfile(2) fails with -EINVAL every time I try to read from a device
>file.
>
>This sounds like a bug... is it?  (the man page doesn't mention such a
>restriction)

sendfile() on purpose only works on things that use the page cache. 
EINVAL is basically sendfile's way of saying "I would fall back on doing
a read+write, so you might as well do it yourself in user space because
it might actually be more efficient that way". 
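
A minimal sketch of that fallback in user space (not from the original
mail; copy_out() is a hypothetical helper, and error handling is kept
crude):

	#include <sys/sendfile.h>
	#include <unistd.h>
	#include <errno.h>

	/* Try sendfile() first; on EINVAL do the read()+write() loop
	 * that the kernel would otherwise have done internally. */
	static ssize_t copy_out(int out_fd, int in_fd, size_t count)
	{
		char buf[65536];
		size_t done = 0;
		ssize_t n = sendfile(out_fd, in_fd, NULL, count);

		if (n >= 0 || errno != EINVAL)
			return n;

		while (done < count) {
			size_t chunk = count - done;
			ssize_t r, w;

			if (chunk > sizeof(buf))
				chunk = sizeof(buf);
			r = read(in_fd, buf, chunk);
			if (r <= 0)
				break;			/* EOF or error */
			for (w = 0; w < r; w += n) {
				n = write(out_fd, buf + w, r - w);
				if (n < 0)
					return -1;
			}
			done += r;
		}
		return done;
	}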

>I am using kernel 2.4.0-test11-pre2.  All other tests with sendfile(2)
>succeed:  file->file, file->STDOUT, STDIN->file...

Yes, as long as STDIN is a file ;)

sendfile() wants the source to be in the page cache, because the whole
point of sendfile() was to avoid a copy. 

The current device model does _not_ use the page cache. Now, arguably
that's a bug - it also means that you cannot mmap() a block device - but
as it could be easily documented (maybe it is, somewhere), I'll call it
a bad feature for now.

Now, if you want to add the code to do address spaces for block devices,
I wouldn't be all that unhappy.  I've wanted to see it for a while.  I'm
not likely to apply it for 2.4.x any more, but I'd love to have it early
for 2.5.x. 

		Linus

From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: Is sendfile all that sexy?
Date: 	14 Jan 2001 12:22:31 -0800
Newsgroups: fa.linux.kernel

In article <Pine.GSO.4.30.0101141237020.12354-100000@shell.cyberus.ca>,
jamal  <hadi@cyberus.ca> wrote:
>
>Before getting excited i had the courage to give plain 2.4.0-pre3 a whirl
>and some things bothered me.

Note that "sendfile(fd, file, len)" is never going to be faster than
"write(fd, userdata, len)". 

That's not the point of sendfile(). The point of sendfile() is to be
faster than the _combination_ of:

	addr = mmap(file, ...len...);
	write(fd, addr, len);

or

	read(file, userdata, len);
	write(fd, userdata, len);

and in your case you're not comparing sendfile() against this
combination.  You're just comparing sendfile() against a simple
"write()".

And no, I don't actually think that sendfile() is all that hot. It was
_very_ easy to implement, and can be considered a 5-minute hack to give
a feature that fit very well in the MM architecture, and that the Apache
folks had already been using on other architectures.

The only obvious use for it is file serving, and as high-performance
file serving tends to end up as a kernel module in the end anyway (the
only hold-out is samba, and that's been discussed too), "sendfile()"
really is more a proof of concept than anything else.

Does anybody but apache actually use it?

			Linus

PS.  I still _like_ sendfile(), even if the above sounds negative.  It's
basically a "cool feature" that has zero negative impact on the design
of the system.  It uses the same "do_generic_file_read()" that is used
for normal "read()", and is also used by the loop device and by
in-kernel fileserving.  But it's not really "important". 


Date: 	Sun, 14 Jan 2001 13:44:00 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Is sendfile all that sexy?
Newsgroups: fa.linux.kernel

On Sun, 14 Jan 2001, Ingo Molnar wrote:
 
> There is a Samba patch as well that makes it sendfile() based. Various
> other projects use it too (phttpd for example), some FTP servers i
> believe, and khttpd and TUX.

At least khttpd uses "do_generic_file_read()", not sendfile per se. I
assume TUX does too. Sendfile itself is mainly only useful from user
space..

		Linus


From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: Is sendfile all that sexy?
Date: 	15 Jan 2001 10:46:41 -0800
Newsgroups: fa.linux.kernel

In article <14947.17050.127502.936533@leda.cam.zeus.com>,
Jonathan Thackray  <jthackray@zeus.com> wrote:

>> how would sendpath() construct the Content-Length in the HTTP header?
>
>You'd still stat() the file to decide whether to use sendpath() to
>send it or not, if it was Last-Modified: etc. Of course, you'd cache
>stat() calls too for a few seconds. The main thing is that you save
>a valuable fd and open() is expensive, even more so than stat().

"open" expensive?

Maybe on HP-UX and other platforms. But give me numbers: I seriously
doubt that

	int fd = open(..)
	fstat(fd..);
	sendfile(fd..);
	close(fd);

is any slower than

	.. cache stat() in user space based on name ..
	sendpath(name, ..);

on any real load. 
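
Spelled out as a hypothetical helper (a sketch; the header writing and
most error handling are elided):

	#include <sys/sendfile.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <unistd.h>

	static int serve_file(int sock, const char *name)
	{
		struct stat st;
		off_t off = 0;
		int fd = open(name, O_RDONLY);

		if (fd < 0)
			return -1;
		fstat(fd, &st);	/* cheap: no second name lookup */

		/* ... write the HTTP headers, using st.st_size ... */

		while (off < st.st_size)
			if (sendfile(sock, fd, &off, st.st_size - off) < 0)
				break;
		close(fd);
		return off == st.st_size ? 0 : -1;
	}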

>> TCP_CORK is useful for FAR more than just sendfile() headers and
>> footers.  it's arguably the most correct way to write server code.
>
>Agreed -- the hard-coded Nagle algorithm makes no sense these days.

The fact I dislike about the HP-UX implementation is that it is so
_obviously_ stupid. 

And I have to say that I absolutely despise the BSD people.  They did
sendfile() after both Linux and HP-UX had done it, and they must have
known about both implementations.  And they chose the HP-UX braindamage,
and even brag about the fact that they were stupid and didn't understand
TCP_CORK (they don't say so in those exact words, of course - they just
show that they were stupid and clueless by the things they brag about). 

Oh, well. Not everybody can be as good-looking as me. It's a curse.

		Linus

From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: Is sendfile all that sexy?
Date: 	15 Jan 2001 13:00:41 -0800
Newsgroups: fa.linux.kernel

In article <200101152033.f0FKXpv250839@saturn.cs.uml.edu>,
Albert D. Cahalan <acahalan@cs.uml.edu> wrote:
>Ingo Molnar writes:
>> On Mon, 15 Jan 2001, Jonathan Thackray wrote:
>
>>> It's a very useful system call and makes file serving much more
>>> scalable, and I'm glad that most Un*xes now have support for it
>>> (Linux, FreeBSD, HP-UX, AIX, Tru64). The next cool feature to add to
>>> Linux is sendpath(), which does the open() before the sendfile() all
>>> combined into one system call.
>
>Ingo Molnar's data in a nice table:
>
>open/close  7.5756 microseconds
>stat        5.4864 microseconds
>write       0.9614 microseconds
>read        1.1420 microseconds
>syscall     0.6349 microseconds
>
>Rather than combining open() with sendfile(), it could be combined
>with stat().

Note that "fstat()" is fairly low-overhead (unlike "stat()" it obviously
doesn't have to parse the name again), so "open+fstat" is quite fine
as-is. 

		Linus

Date: 	Mon, 15 Jan 2001 20:59:02 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [patch] sendpath() support, 2.4.0-test3/-ac9
Newsgroups: fa.linux.kernel

On Mon, 15 Jan 2001, dean gaudet wrote:

> On Mon, 15 Jan 2001, Ingo Molnar wrote:
> 
> > just for kicks i've implemented sendpath() support.
> >
> > _syscall4 (int, sendpath, int, out_fd, char *, path, off_t *, off, size_t, size)
> 
> hey so how do you implement transmit timeouts with sendpath() ?  (i.e.
> drop the client after 30 seconds of no progress.)

The whole "sendpath()" idea is just stupid.

You want to do a non-blocking send, so that you don't block on the socket,
and do some simple multiplexing in your server. 

And "sendpath()" cannot do that without having to look up the name again,
and again, and again. Which makes the performance "optimization" a
horrible pessimisation.
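
The fd-based alternative keeps an offset per client and resumes after
EAGAIN; a sketch (struct client and the surrounding poll() loop are
assumed, and a per-client progress timestamp next to the offset is all
a 30-second transmit timeout needs):

	#include <sys/types.h>
	#include <sys/sendfile.h>
	#include <errno.h>

	struct client {
		int	sock;	/* non-blocking socket */
		int	file;	/* open file being served */
		off_t	off;	/* progress so far */
		off_t	size;	/* total bytes to send */
	};

	/* Called when poll() reports the socket writable again. No name
	 * lookup ever happens here - the fd and offset are enough. */
	static int send_more(struct client *c)
	{
		while (c->off < c->size) {
			if (sendfile(c->sock, c->file, &c->off,
				     c->size - c->off) < 0)
				return errno == EAGAIN ? 0 : -1;
		}
		return 1;	/* file fully sent */
	}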

Basically, sendpath() seems to be only useful for blocking and
uninterruptible file sending.

Bad design. I'm not touching it with a ten-foot pole.

		Linus


Date: 	Wed, 17 Jan 2001 11:27:52 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Newsgroups: fa.linux.kernel

Rick Jones <raj@cup.hp.com> wrote:
>
> : >Agreed -- the hard-coded Nagle algorithm makes no sense these days.
> :
> : The fact I dislike about the HP-UX implementation is that it is so
> : _obviously_ stupid.
> :
> : And I have to say that I absolutely despise the BSD people.  They did
> : sendfile() after both Linux and HP-UX had done it, and they must have
> : known about both implementations.  And they chose the HP-UX braindamage,
> : and even brag about the fact that they were stupid and didn't understand
> : TCP_CORK (they don't say so in those exact words, of course - they just
> : show that they were stupid and clueless by the things they brag about).
> :
> : Oh, well. Not everybody can be as good-looking as me. It's a curse.
> 
> nor it would seem, as humble :)

Yeah.. Humble is my middle name.

> Hello Linus, my name is Rick Jones. I am the person at Hewlett-Packard
> who drafted the "so _obviously_ stupid" sendfile() interface of HP-UX.
> Some of your critique (quoted above) found its way to my inbox and I
> thought I would introduce myself to you to give you an opportunity to
> expand a bit on your criticism. In return, if you like, I would be more
> than happy to describe a bit of the history of sendfile() on HP-UX.
> Perhaps (though I cannot say with any certainty) it will help explain
> why HP-UX sendfile() is spec'd the way it is.

I do realize why sendfile() is specced like it is: if you don't want to
change the networking layer, it's the obvious way to do it. You can
just generate an iovec internally in the kernel, and pass that on to an
unmodified networking layer.

Hey, that's the way I'd do it too if I didn't have the ear of the
networking people and couldn't tell them that "Psst! THIS is the right way of
doing this".

The fact that I understand _why_ it is done that way doesn't mean that I
don't think it's a hack. It doesn't allow you to sendfile multiple files
etc without having nagle boundaries, and the header/trailer stuff really
isn't a generic solution.

Sendfile() as done in HP-UX is a performance optimization. Fine. But it's
not exactly pretty. It shouldn't be called "sendfile()" - it's more of a
"send_a_file_and_these_headers_and_those_trailers()" system call.

Also note how I said that it is the BSD people I _despise_. Not the HP-UX
implementation. The HP-UX one is not pretty, but it works. But I hold open
source people to higher standards. They are supposed to be the people who
do programming because it's an art-form, not because it's their job. 

		Linus


From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: Is sendfile all that sexy?
Date: 	17 Jan 2001 11:32:35 -0800
Newsgroups: fa.linux.kernel

In article <Pine.LNX.4.30.0101171454340.29536-100000@baphomet.bogo.bogus>,
Ben Mansell  <linux-kernel@slimyhorror.com> wrote:
>On 14 Jan 2001, Linus Torvalds wrote:
>
>> And no, I don't actually think that sendfile() is all that hot. It was
>> _very_ easy to implement, and can be considered a 5-minute hack to give
>> a feature that fit very well in the MM architecture, and that the Apache
>> folks had already been using on other architectures.
>
>The current sendfile() has the limitation that it can't read data from
>a socket. Would it be another 5-minute hack to remove this limitation, so
>you could sendfile between sockets? Now _that_ would be sexy :)

I don't think that would be all that sexy at all.

You have to realize that sendfile() is meant as an optimization, by
being able to re-use the same buffers that act as the in-kernel page
cache as buffers for sending data. So you avoid one copy.

However, for socket->socket, we would not have such an advantage.  A
socket->socket sendfile() would not avoid any copies the way the
networking is done today.  That _may_ change, of course.  But it might
not.  And I'd rather tell people using sendfile() that you get EINVAL if
it isn't able to optimize the transfer.. 

		Linus

Date: 	Wed, 17 Jan 2001 13:22:10 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Newsgroups: fa.linux.kernel

On Wed, 17 Jan 2001, Rick Jones wrote:

> > The fact that I understand _why_ it is done that way doesn't mean that I
> > don't think it's a hack. It doesn't allow you to sendfile multiple files
> > etc without having nagle boundaries, and the header/trailer stuff really
> > isn't a generic solution.
> 
> Hmm, I would think that nagle would only come into play if those files
> were each less than MSS and there were no intervening application level
> reply/request messages for each.

It's not the file itself - it's the headers and trailers.

The reason you want to have headers and trailers in your sendfile() is
two-fold:

 - if you have high system call latency, it can make a difference.

   This one simply isn't an issue with Linux. System calls are cheap, and
   I'd rather optimize them further than make them uglier. 

 - the packet boundary between the header and the file you're sending.

    Normally, if you do a separate data "send()" for the header before
    actually using sendfile(), the header would be sent out as one packet,
    while the actual file contents would then get coalesced into MSS-sized
    packets.

    This is why people originally did writev() and sendmsg() - to allow
    people to do scatter-gather without having multiple packets on the
    wire, and letting the OS choose the best packet boundaries, of course.
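
For comparison, that scatter-gather style looks something like this
sketch (send_with_header() is a hypothetical helper):

	#include <sys/types.h>
	#include <sys/uio.h>

	/* Header and body are handed to the kernel in one call, so the
	 * OS can place packet boundaries across the seam between them. */
	static ssize_t send_with_header(int sock, void *hdr, size_t hlen,
					void *body, size_t blen)
	{
		struct iovec iov[2];

		iov[0].iov_base = hdr;
		iov[0].iov_len = hlen;
		iov[1].iov_base = body;
		iov[1].iov_len = blen;
		return writev(sock, iov, 2);
	}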

So the Linux approach (and, obviously, in my opinion the only right
approach) is basically to 

 (a) make sure that system call latency is low enough that there really
     aren't any major reasons to avoid system calls. They're just function
     calls - they may be a bit heavier than most functions, of course, but
     people shouldn't need to avoid them like the plague like on some
     systems.

and

 (b) TCP_CORK. 

Now, TCP_CORK is basically me telling David Miller that I refuse to play
games to have good packet size distribution, and that I wanted a way for
the application to just tell the OS: I want big packets, please wait until
you get enough data from me that you can make big packets.

Basically, TCP_CORK is a kind of "anti-nagle" flag. It's the reverse of
"no-nagle". So you'd "cork" the TCP connection when you know you are going
to do bulk transfers, and when you're done with the bulk transfer you just
"uncork" it. At which point the normal rules take effect (ie normally
"send out any partial packets if you have no packets in flight").

This is a _much_ better interface than having to play games with
scatter-gather lists etc. You could basically just do

	int optval = 1;

	setsockopt(sk, SOL_TCP, TCP_CORK, &optval, sizeof(int));
	write(sk, ..);
	write(sk, ..);
	write(sk, ..);
	sendfile(sk, ..);
	write(..)
	printf(...);
	...any kind of output..

	optval = 0;
	setsockopt(sk, SOL_TCP, TCP_CORK, &optval, sizeof(int));

and notice how you don't need to worry about _how_ you output the data any
more. It will automatically generate the best packet sizes - waiting for
disk if necessary etc.

With TCP_CORK, you can obviously and trivially emulate the HP-UX behaviour
if you want to. But you can just do _soo_ much more.

Imagine, for example, keep-alive http connections. Where you might be
doing multiple sendfile()'s of small files over the same connection, one
after the other. With Linux and TCP_CORK, what you can basically do is to
just cork the connection at the beginning, and then let it stay corked for
as long as you don't have any outstanding requests - ie you uncork only
when you don't have anything pending any more.

(The reason you want to uncork at all is obviously to let the partial
packets out when you don't know if you'll write anything more in the near
future. Uncorking is important too.)

Basically, TCP_CORK is useful whenever the server knows the patterns of
its bulk transfers. Which is just about 100% of the time with any kind of
file serving.
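
A sketch of that pattern (request_pending() and send_reply() are
hypothetical stand-ins for the server's own logic):

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	int request_pending(int sk);	/* more pipelined input queued? */
	void send_reply(int sk);	/* any mix of write()/sendfile() */

	static void set_cork(int sk, int on)
	{
		/* IPPROTO_TCP == SOL_TCP on Linux */
		setsockopt(sk, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
	}

	void serve_keepalive(int sk)
	{
		set_cork(sk, 1);		/* cork once, up front */
		while (request_pending(sk))
			send_reply(sk);
		set_cork(sk, 0);	/* idle: let partial packets out */
	}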

			Linus


Date: 	Wed, 17 Jan 2001 14:53:14 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Newsgroups: fa.linux.kernel

On Wed, 17 Jan 2001, Rick Jones wrote:
> > 
> >  (a) make sure that system call latency is low enough that there really
> >      aren't any major reasons to avoid system calls. They're just function
> >      calls - they may be a bit heavier than most functions, of course, but
> >      people shouldn't need to avoid them like the plague like on some
> >      systems.
> 
> i'm not quite sure how it plays here, but someone once told me that the
> most efficient procedure call was the one that was never made :)

Absolutely.

But I'm also a firm believer in "simplicity makes performance". 

My personal problem (and maybe it really is just me) with sendmsg() and
writev() kind of scatter-gather interfaces is that I think they are hard
and non-intuitive to use. They work beautifully if you design with them in
mind, and your data really is fundamentally already laid out in memory.

But they tend to be a bit too complicated if you have to do things like
"sprintf()" to generate part of the data first, and if you don't know
where you'll get your data before it is generated etc. For example, the
whole writev()/sendfile() kind of approach just _totally_ breaks down when
you have things like CGI involved.

Basically, I think the scatter-gather interfaces are too inflexible: they
are designed for one thing, and one thing only, and it's hard to use them
for anything else. And being hard to use means that people will do
non-obvious things, or just ignore them. Both of which will be bad for
performance in the long run. If you try to be clever, the program gets
harder to maintain, and because of that you can't do the good kinds of
re-organizations that might improve it.

The true power of TCP_CORK is when you really start thinking about what it
means that you can do _any_ output. Suddenly, you can have perl CGI stuff,
that uses stdio or something even more primitive that doesn't do buffering
at all - and it will automatically look ok on the wire.

> How "bulk" is a bulk transfer in your thinking? By the time the transfer
> gets above something like 100*MSS I would think that the first small
> packet would become epsilon. 

Actually, I don't really mean "bulk" as in "big", but more as in
"noninteractive". The biggest advantage of things like TCP_CORK is exactly
for small files or smallish CGI output, where it makes a difference
whether you sent out 4 big packets or 5 half-sized packets.

> How does CORKing interact with ACK generation? In particular how it
> might interact with (or rather possibly induce) standalone ACKs?

If anything, it should reduce ACK's too, simply because it reduces the
number of packets. But with most people doing delayed ACKs for every 2 MSS
of data (or whatever the RFC's specify), this is probably not really much
of an issue.

> so after i present each reply, i'm checking to see if there is another
> request and if there is not i have to uncork to get the residual data to
> flow.

Another way of thinking about it - you just know when the connection is
idle, and you uncork.

But note that you don't _have_ to be clever, if you don't want to. You can
just uncork after each transfer, and you'll still do no worse than if you
never corked at all. And you'll have all the advantages of being able to
not worry about how your CGI scripts etc work together.

> But does the server know the arrival pattern (well time distribution) of
> requests? It seems that one depends on a client being helpful about
> getting requests to the server in groups otherwise one is corking and
> uncorking the connection.

Oh, best performance definitely depends on the client interleaving the
requests. What else is new?

TCP_CORK is not going to suddenly make your application never have to
think about performance ever again. That's obvious. It is nothing
but a tool in your tool-chest. It's a tool with a very simple interface,
and it's rather generic. Which is why it's so powerful. But it's not a
panacea.

I'm claiming that with TCP_CORK, it's fairly obvious how to write a server
that _can_ take advantage of a pipelined client. 

In contrast, with a writev/sg-sendfile kind of interface it would be much
more painful. You'd have to explicitly buffer up your replies all the
time, which creates much more interesting (read: bug-prone) memory
management issues, AND makes it a real bitch to handle things like
external CGI stuff etc.

But no, let's not claim that TCP_CORK solves the problem of world hunger..

(I also had one person point out that the BSDs have the notion of TCP_NOPUSH,
which does almost what TCP_CORK does under Linux, except it doesn't seem
to have the notion of uncorking - you can turn NOPUSH off, but apparently
it doesn't affect queued packets. This makes it even less clear why they
have the ugly sendfile.)

			Linus


Date: 	Thu, 18 Jan 2001 08:24:12 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Is sendfile all that sexy?
Newsgroups: fa.linux.kernel

On Thu, 18 Jan 2001, Andreas Dilger wrote:

> Actually, this is a great example, because at one point I was working
> on a device interface which would offload all of the disk-disk copying
> overhead to the disks themselves, and not involve the CPU/RAM at all.

It's a horrible example.

device-to-device copies sound like the ultimate thing. 

They suck. They add a lot of complexity and do not work in general. And,
if your "normal" usage pattern really is to just move the data without
even looking at it, then you have to ask yourself whether you're doing
something worthwhile in the first place.

Not going to happen.

		Linus


Date: 	Thu, 18 Jan 2001 08:49:38 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Newsgroups: fa.linux.kernel

On Thu, 18 Jan 2001, Ingo Molnar wrote:
>
> [ BSD's TCP_NOPUSH ] 
>
> this is what MSG_MORE does. Basically i added MSG_MORE for the purpose of
> getting perfect TUX packet boundaries (and was ignorant enough to not know
> about BSD's NOPUSH), without an additional system-call overhead, and
> without the persistency of TCP_CORK. Alexey and David agreed, and actually
> implemented it correctly :-)

MSG_MORE is very different from TCP_NOPUSH, which is very different from
TCP_CORK.

First off, the interfaces are very different. MSG_MORE is a "this write
will be followed by more writes", and only works on programs that know
that they are writing to a socket.

That has its advantages: it's a very local thing, and doesn't need any
state. However, the fact is that you _need_ the persistency of a socket
option if you want to take advantage of external programs etc getting good
behaviour without having to know that they are talking to a socket. 

Remember the UNIX philosophy: everything is a file. MSG_MORE completely
breaks that, because the only way to use it is with send[msg](). It's
absolutely unusable with something like a traditional UNIX "anonymous"
application that doesn't know or care that it's writing to the network.

So while MSG_MORE has uses, it's absolutely and utterly wrong to say that
it is equivalent to either TCP_NOPUSH or TCP_CORK.

Now, I'll agree that TCP_NOPUSH actually has the same _logic_ as MSG_MORE:
you can basically say that the two are more or less equivalent by a source
transformation (ie send(MSG_MORE) => "set TCP_NOPUSH + send() + clear
TCP_NOPUSH". Both of them are really fairly "local", but TCP_NOPUSH has a
_notion_ of persistency that is entirely lacking in MSG_MORE.

In contrast, TCP_CORK has an interface much like TCP_NOPUSH, along with
the notion of persistency. The difference between those two is that
TCP_CORK really took the notion of persistency to the end, and made
uncorking actually say "Ok, no more packets". You can't do that with
TCP_NOPUSH: with TCP_NOPUSH you basically have to know what your last
write is, and clear the bit _before_ that write if you want to avoid bad
latencies (alternatively, you can just close the socket, which works
equally well, and was probably the designed interface for the thing. That
has the disadvantage of, well, closing the socket - so it doesn't work if
you don't _know_ whether you'd write more or not).

So the three are absolutely not equivalent. I personally think that
TCP_NOPUSH is always the wrong thing - it has the persistency without the
ability to shut it off gracefully after the fact. In contrast, both
MSG_MORE and TCP_CORK have well-defined behaviour but they have very
different uses.
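
In code, the difference in scope looks roughly like this (a sketch; sk,
hdr and body are assumed to exist):

	/* MSG_MORE: a per-call hint, usable only by code that knows the
	 * fd is a socket. */
	send(sk, hdr, hlen, MSG_MORE);	/* more coming, don't push yet */
	send(sk, body, blen, 0);	/* last piece, normal push rules */

	/* TCP_CORK: persistent per-socket state. Everything written to
	 * the fd in between - write(), sendfile(), a CGI child's stdout -
	 * gets coalesced, with no need to know the fd is a socket. */
	int on = 1;
	setsockopt(sk, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
	/* ... any mix of write()/sendfile()/stdio from anybody ... */
	on = 0;
	setsockopt(sk, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));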

		Linus



Date: 	Thu, 18 Jan 2001 08:51:19 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Newsgroups: fa.linux.kernel

On Thu, 18 Jan 2001, Ingo Molnar wrote:
> 
> Basically MSG_MORE is a simplified probability distribution of the next
> SEND, and it already covers all the other (iovec, nagle, TCP_CORK)
> mechanisms available, in a surprisingly easy way IMO. I believe MSG_MORE is
> very robust from a theoretical point of view.

Yeah, and how are you going to teach a perl CGI script that writes to
stdout to use it?

Face it, it's limited. It has, in fact, many of the same limitations
TCP_NOPUSH has.

		Linus


Date: 	Thu, 18 Jan 2001 10:50:02 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Newsgroups: fa.linux.kernel

On Thu, 18 Jan 2001, Rick Jones wrote:

> Linus Torvalds wrote:
> > Remember the UNIX philosophy: everything is a file.
> 
> ...and a file is simply a stream of bytes (iirc?)

Indeed.

And normal applications really shouldn't need to worry about things like
packetization etc.

Of course, many applications still do. stdio does "fstat" on the file
descriptor to get the st_blksize thing - which despite its name is really
only meant to say "this is an efficient blocksize to write to this fd".
That only really works for regular files, and is just a heuristic even
there.

But TCP_CORK can be used to kind of "wrap" such applications, if you know
that they don't have interactive behaviour. 

99% of the time you probably don't care enough. Not very many people use
TCP_CORK, I suspect. It's too Linux-specific, and you really have to watch
the packets on the network to see the effect of it (unless you use it
wrong, and forget to uncork, in which case you can certainly see the
effect of it the wrong way ;)

Oh, well. The same is obviously largely true of "sendfile()" in general.
The people who use sendfile() under Linux are probably largely the same
people who know about and use TCP_CORK.

		Linus


Date: 	Thu, 18 Jan 2001 11:42:03 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Is sendfile all that sexy?
Newsgroups: fa.linux.kernel

On Thu, 18 Jan 2001, Roman Zippel wrote:
> > 
> > Not going to happen.
> 
> device-to-device is not the same as disk-to-disk. A better example would
> be a streaming file server.

No, it wouldn't be.

[ Crystal ball mode: ON ]

It's too damn device-dependent, and it's not worth it. There's no way to
make it general with any current hardware, and there probably isn't going
to be for at least another decade or so. And because it's expensive and
slow to do even on a hardware level, it probably won't be done even then.

Which means that it will continue to be a pure localized hack for the
foreseeable future.

Quite frankly, show me a setup where the network bandwidth is even _close_
to big enough that it would make sense to try to stream directly from the
disk? The only one I can think of is basically DoD-type installations with
big satellite pictures on a huge server, and gigabit ethernet everywhere.
Quite frankly, if that huge server is so borderline that it cannot handle
the double copy, the system administrators have big problems.

Streaming local video to disk? Sure, I can see that people might want
that. But if you can't see that people might want to _see_ it while they
are streaming, then you're missing a big part of the picture called
"psychology". So you'd still want to have a in-memory buffer for things
like that.

Come back to this in ten years, when devices and buses are smarter. MAYBE
they'll support it (but see later about why I don't think they will).
Today, you're living in a pipe dream. You can't practically do it with any
real devices of today - even when both parts support busmastering, they do
NOT tend to support "busmaster to the disk", or "busmaster from the disk".
I don't know of any disk interfaces that offer that kind of interface
(they'd basically need to have some way to busmaster directly to the
controller caches, and do cache management in software. Can be done, but
probably exposes more of the hardware than most people want to see).

Right now the only special case might be some very specific embedded
devices, things like routers, video recorders etc. And for them it would
be very much a special case, with special drivers and everything. This is
NOT a generic kernel issue, and we have not reached the point where it's
even worth it trying to design the interfaces for it yet.

An important point in interface design is to know when you don't know
enough. We do not have the internal interfaces for doing anything like
this, and I seriously doubt they'll be around soon.

And you have to realize that it's not at all a given that device protocols
will even move towards this kind of environment. It's equally likely that
device protocols in the future will be more memory-intensive, where the
basic protocol will all be "read from memory" and "write to memory", and
nobody will even have a notion of mapping memory into device space like
PCI kind of does now.

I haven't looked at what infiniband/NGIO etc spec out, but I'd actually be
surprised if they allow you to effectively short-circuit the IO networks
together. It is not an operation that lends itself well to a network
topology - it happens to work on PCI due to the traditional "shared bus"
kind of logic that PCI inherited. And even on PCI, there are just a LOT of
PCI bridges that apparently do not like seeing PCI-PCI transfers.

(Short and sweet: most high-performance people want point-to-point serial
line IO with no hops, because it's a known art to make that go fast.  No
general-case routing in hardware - if you want to go as fast as the
devices and the link can go, you just don't have time to route. Trying to
support device->device transfers easily slows down the _common_ case,
which is why I personally doubt it will even be supported 10-15 years from
now. Better hardware does NOT mean "more features").

		Linus


Date: 	Thu, 18 Jan 2001 11:45:45 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Fwd: [Fwd: Is sendfile all that sexy? (fwd)]]
Newsgroups: fa.linux.kernel

On Thu, 18 Jan 2001, Andrea Arcangeli wrote:
> 
> I'm all for TCP_CORK but it has the disadvantage of two syscalls for doing the
> flush of the outgoing queue to the network. And one of those two syscalls is
> spurious. Certainly it makes perfect sense that the uncork flushes the outgoing
> queue, but I think we should have a way to flush it without exiting the cork-mode.
> I believe most software only needs to SIOCPUSH after sending the data and just
> before waiting the reply.

Sure, I agree. Something like SIOCPUSH would fit very well into the
TCP_CORK mentality.
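
Given the uncork semantics, the flush can at least be emulated by
toggling the option - exactly the two syscalls being complained about,
where a SIOCPUSH would be one (a sketch):

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	/* Push out whatever is queued, then go straight back to
	 * corked mode. */
	static void tcp_push(int sk)
	{
		int off = 0, on = 1;

		setsockopt(sk, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
		setsockopt(sk, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
	}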

			Linus


Date: 	Thu, 18 Jan 2001 17:14:01 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Is sendfile all that sexy?
Newsgroups: fa.linux.kernel

On Fri, 19 Jan 2001, Roman Zippel wrote:
> 
> On Thu, 18 Jan 2001, Linus Torvalds wrote:
> 
> > It's too damn device-dependent, and it's not worth it. There's no way to
> > make it general with any current hardware, and there probably isn't going
> > to be for at least another decade or so. And because it's expensive and
> > slow to do even on a hardware level, it probably won't be done even then.
> > 
> > [...]
> > 
> > An important point in interface design is to know when you don't know
> > enough. We do not have the internal interfaces for doing anything like
> > this, and I seriously doubt they'll be around soon.
> 
> I agree, it's device dependent, but such hardware exists.

Show me any practical case where the hardware actually exists.

I do not know of _any_ disk controllers that let you map the controller
buffers over PCI. Which means that with current hardware, you have to
assume that the disk is the initiator of the PCI-PCI DMA requests. Agreed?

Which in turn implies that the non-disk target hardware has to be able to
have a PCI-mapped memory buffer for the source or the destination, AND
they have to be able to cope with the fact that the data you get off the
disk will have to be the raw data at 512-byte granularity.

There are really quite few devices that do this. The most common example
by far would be a frame buffer, where you could think of streaming a few
frames at a time directly from disk into graphics memory. But nobody
actually saves pictures that way in reality - they all need processing to
show up. Even when the graphics card does things like mpeg2 decoding in
hardware, the decoding logic is not set up the way the data comes off the
disk in any case I know of. 

As to soundcards, all the ones I know about that are worthwhile certainly
have on-board memory, but that memory tends to be used for things
like waveforms etc, and most of them refill their audio data by doing DMA.
Again, they are the initiator of the IO, not a passive receiver. 

I'm sure there are sound cards that just expose their buffers directly.
Fine. Make a special user-space driver for it. Don't try to make it into a
design.

>							 It needs of
> course its own memory, but then you can see it as a NUMA architecture and
> we already have the support for this. Create a new memory zone for the
> device memory and keep the pages reserved. Now you can use it almost like
> other memory, e.g. reading from/writing to it using address_space_ops.

You need to have a damn special sound card to do the above.

And you wouldn't need a new memory zone - the kernel wouldn't ever touch
the memory anyway, you'd just ioremap() it if you needed to access it
programmatically in addition to the streaming of data off disk.

> An application, where I'd like to use it, is audio recording/playback
> (24bit, 96kHz on 144 channels). Although it's possible to copy that amount
> of data around, but then you can't do much beside this. All the data is
> most of the time only needed on the soundcard, so why should I copy it
> first to the main memory?

Because with 99% of the hardware, there is no other way to get at it?

Also, even when you happen to have the 1% card combination where it would
work in the first place, you'd better make sure that they are on the same
PCI bus. That's usually true on most PC's today, but that's probably going
to be an issue eventually. 

> Anyway, now with the zerocopy network patches, there are basically already
> all the needed interfaces and you don't have to wait for 10 years, so I
> think you need to polish your crystal ball. :-)

The zero-copy network patches have _none_ of the interfaces you think you
need. They do not fix the fact that hardware usually doesn't even _allow_
for what you are hoping for. And what you want is probably going to be
less likely in the future than more likely.

		Linus


From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: Is sendfile all that sexy?
Date: 	18 Jan 2001 17:53:40 -0800
Newsgroups: fa.linux.kernel

In article <3A66CDB1.B61CD27B@imake.com>,
Russell Leighton  <leighton@imake.com> wrote:
>
>"copy this fd to that one, and optimize that if you can"
>
>... isn't this Larry M's "splice" (http://www.bitmover.com/lm/papers/splice.ps)?

We talked extensively about "splice()" with Larry. It was one of the
motivations for doing sendfile(). The problem with "splice()" is that it
did not have very good semantics on who does the push and who does the
pull, and how to actually implement this efficiently yet in a generic
manner.

In many ways, that lack of good generic interfaces is what turned me off
splice().  I showed Larry the simple solution that gets 95% of what
people wanted splice for, and he didn't object. He didn't have any
really good solutions to the implementation problems either.

Now, the reason it is called "sendfile()" is obviously partially because
others _did_ have sendfiles (NT and HP-UX), but it's also because I
wanted to make it clear that this was NOT a generic splice(). It could
really only work in one direction: from the page cache out. The page
cache would always do a push, and nobody would do a pull.

Now, the page cache has improved, and these days we could _almost_ do a
"receivefile()", with the page cache doing a pull, in addition to the
push it can already do.  And yes, I'd probably use the same system call,
and possibly rename it to be "splice()", even though it still wouldn't
be the generic case. 

Now, the reason I say "almost" on the page cache "pull()" thing is that
while the page cache can now do basically "prepare_write()" + "pull()" +
"commit_write()", the problem is that it still needs to know the _size_
of the pull() in order to be able to prepare for the write.

Basically, the pull<->push model turns into a four-way handshake:

 (a) prepare for the pull		(source)
 (b) prepare for the push		(destination)
 (c) do the pull			(source)
 (d) commit the push			(destination)

and with this kind of model I suspect that we could actually do a fairly
real splice(), where sendfile() would just be a special case.

Right now, the only part we lack above is (a) - everything else we have.
(b) is "prepare_write()", (c) is "read()", (d) is "commit_write()".

So we lack a "prepare_read()" as things stand now. The interface would
probably be something on the order of

	int (*prepare_read)(struct file *, int);

where we'd pass in the "struct file" and the amount of data we'd _like_
to see, and we'd get back the amount of data we can actually have so
that we can successfully prepare for the push (ie "prepare_write()").

		Linus
