From: (Linus Torvalds)
Subject: Re: zero-copy TCP
Date: 	2 Sep 2000 23:33:27 -0700
Newsgroups: fa.linux.kernel

In article <>,
Jamie Lokier  <> wrote:
>I just thought I'd mention that you can do zero copy TCP in and out
>*without* any page marking schemes.  All you need is a network card with
>quite a lot of RAM and some intelligence.  An Alteon could do it, with
>extra RAM or an impressively underloaded network.
>(for example)

The thing is, that at least historically it has always been a bad bet to
bet on special-purpose hardware over general-purpose stuff.

What I'm saying is that basically you should not design your TCP layer
around the 0.1% of cards that have tons of intelligence, when you have a
general-purpose CPU that tends to be faster in the end.

The smart cards can actually have higher latency than just doing it
the "stupid" way with the CPU. Yes, they'll offload some of the
computation, and may make system throughput better, but at what cost? 

[ Same old example: just calculate how quickly you can get your packet
  on the wire with a smart card that does checksumming in hardware, and
  do the same calculations with a CPU that does the checksums. Take into
  account that the checksum is at the _head_ of the packet. The CPU will
  win.

  Proof: the data to be sent out is in RAM.  In fact, often it is cached
  in the CPU these days. In order to start sending out the packet, the
  smart card has to move all of the data from RAM/cache over the bus to
  the card.  It can only start actually sending after that.  Cost: bus
  speed to copy it over.

  In contrast, if you do it on the CPU, you can basically start feeding
  the packet out on the net after doing a CPU checksum that is limited
  by RAM/cache speeds. Bus speed isn't the limiting factor any more on
  packet latency, as you can send out the start of the packet on the
  network before the whole packet has even been copied over the internal
  bus! ]

So.  Smart cards are not necessarily better for latency.  They are
certainly not cheaper.  They _are_ better for throughput, no question
about that.  But so is adding another CPU.  Or beefing up your memory
subsystem. Or any number of other things that are more generic than some
smart network card - and often cheaper because they are "standard
components", useful regardless of _what_ you do.

End result: smart cards only make sense in systems that are really
pushing the performance envelope.  Which, after all, is not that common,
as it's usually easier to just beef up the machine in other ways until
the network is not the worst bottle-neck.  Very few places outside
benchmark labs have networks _that_ studly. 

Right now gigabit is heavy-duty enough that it is worth smart cards. 
The same used to be true about the first generation of 100Mbit cards. 
The same will be true of 10Gbps cards in another few years.  But
basically, they'll probably always end up being the exception rather
than the rule, unless they become so cheap that it doesn't matter.  But
"cheap" and "pushing the performance envelope" do not tend to go hand in
hand.


Date: 	Sun, 3 Sep 2000 14:03:03 -0700 (PDT)
From: Linus Torvalds <>
Subject: Re: zero-copy TCP
Newsgroups: fa.linux.kernel

On Sun, 3 Sep 2000, Jamie Lokier wrote:
> Nice point!  Only valid for TCP & UDP though.

Yeah. But "we need oxygen" is only a valid point for carbon-based
life-forms. You might as well argue that oxygen is not a valid criterion for
being livable, because it's only valid for the particular kind of
creatures we are.

Basically, only TCP and UDP really matter. Decnet, IPX, etc don't really
make a big selling point any more.

> When people want _real_ low latency, they don't use TCP or UDP, and they
> certainly don't put data checksums at the start.  They still aim for
> zero copies.  That pass, even over cached data, is still significant.

I disagree.

Look at history.

	Exercise 1: name a protocol that did something like that
		(yes, I know, there are multiple).

	Exercise 2: name one of them that is still relevant today.

See? Performance, in the end, is very much secondary. It doesn't matter
one whit if you perform better than everybody else, if you cannot _talk_
to everybody else. 

I think the RISC vendors found that out. And I think most network vendors
find that out.

(Yes, I know, you're probably talking about things like the networking
protocols for clusters etc. I'm just saying that historically such
special-purpose stuff always tends to end up being not as good as the
"real thing".)

> Fair enough.  Please read my description of a zero-copy scheme that
> doesn't require much intelligence on the card though.  I think it's a
> neat kernel trick that might just pay off.  Sometimes, maybe.

We could certainly try to do better. But some of the schemes I've seen have
implied a lot of complexity for gains that aren't actually real in the end
(eg playing expensive games with memory mapping in order to avoid a copy
that ends up happening anyway because the particular card you're using
doesn't do scatter-gather: you'd perform a lot better if you just did the
copy outright and forgot about the expensive games - which is what Linux
does).

From: (Linus Torvalds)
Subject: Re: zero-copy TCP
Date: 	3 Sep 2000 19:53:28 -0700
Newsgroups: fa.linux.kernel

In article <>,
Jamie Lokier  <> wrote:
>Alan Cox wrote:
>> > read/recv block while the NIC DMAs into user space main memory.
>> Thats actually not always a win either. A DMA to user space flushes
>> those pages out of cache which isnt so ideal if the CPU wants
>> them. Some of the results are suprisingly counter-intuitive like this
>Does it flush the CPU cache?  I thought the CPU just snooped the bus and
>updated its cache with new data.

In theory you could do "snoop and update".

In practice I do not know of a single chip that actually does that.
Pretty much _everybody_ does "invalidate on write".

(The "invalidate on write" is the sane way of doing SMP cache coherency,
which is probably why. Trying to have shared dirty cache-lines is just
not a viable option in the end).

>Ugh.  User space DMA gets complicated quickly.  The performance question
>is, perhaps, can you do this without a TLB flush (but with locking the
>struct page of course).  Note that it doesn't matter if another thread,
>and this includes truncate/write in another thread, clobbers the page
>data.  That's just the normal effect of two concurrent writers to the
>same memory.

Simple calculation: to actually even _find_ the physical page, you
usually need to do at least three levels of page table walking. Sure,
some CPU's have "translate" instructions to do it for you in hardware
and use the TLB to help you, but the most common architecture out there
does not do that. So with the 4GB+ option, in order to just _find_ the
physical page (so that you can do DMA to it), you need to do that
complex page table walk.

Let's say that that page table walk is 50 instructions at best case. 
And that pretty much assumes that you did the thing in assembly code and
were very aggressive.

Also, you can pretty much assume that even if the code is in the cache,
the page tables themselves probably aren't. So assume a minimum of three
cache misses right there (plus the code - 50 instructions).

Furthermore, to actually pin a page down, even if you do _nothing_ else,
you'd at least need a SMP-safe increment (and eventual decrement). 
That's another 24 cycles just in those two instructions on x86. 

End result: you've done the work equivalent to about 4 cache misses.
Just to look up the physical page, and not actually _doing_ anything
with it. Never mind locking it in memory or anything like that. And
that's assuming you got no icache misses on the actual _code_ to do all
of this.

Basically any copy <= 4 cache lines is "free" compared to trying to be
clever.  That's 128 bytes on most machines right now.  And cache-lines
are growing: 64 and 128 byte cache-lines are not that unlikely these
days (I think Athlon has a 64-byte cache-line, for example, just in the
PC space, and alpha and sparc64 do also). 
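
A minimal sketch of the arithmetic above, assuming (as the estimate does)
that a copy costs roughly one cache miss per cache line moved. The function
name is hypothetical and the model ignores everything except cache-line
count, so treat it as a back-of-the-envelope check, not a measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the original post.
 * Returns how many bytes can be copied for the price of `misses` cache
 * misses, assuming one cache line is moved per miss. */
size_t copy_break_even_bytes(size_t misses, size_t cache_line_bytes)
{
	return misses * cache_line_bytes;
}
```

With the four misses counted above and 32-byte lines this gives 128 bytes;
with 64-byte or 128-byte lines the same four misses pay for 256 or 512
bytes of copying.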

So basically the cost of a simple memcpy() isn't necessarily that big.
The above calculations were rather kind towards the "lock the page
down" case, and it's not all that unlikely that the cost of locking down
a page is on the same order of magnitude as just doing a "memcpy()" on
the whole page.

The above gets _much_ worse in real life.  If you truly want to do
zero-copy from user space and get real UNIX semantics for write(),
you'd better protect the page somehow, so that if the data hasn't made
it out to the network (or needed to be re-sent) by the time the system
call returns, the user can't change the user-mode buffer before the data
is out.

That's when you get into TLB invalidates etc.  By which time you're
talking another few cache invalidates, and possibly some nasty cross-CPU
calls for SMP TLB coherency. 

People who claim "zero-copy" is a great thing often ignore the costs of
_not_ copying altogether. 

(This is the same mistake that people who do complexity analysis often
stumble on.  Sure, "constant time" is perfect.  Except it's not
necessarily unusual that "constant time" is 50 times larger than O(n) in
practice.  Same goes for zero-copy - it's "perfect", but can easily be
slower than just plain old "good"). 


Date: 	Sun, 3 Sep 2000 20:59:39 -0700 (PDT)
From: Linus Torvalds <>
Subject: Re: zero-copy TCP
Newsgroups: fa.linux.kernel

On Sun, 3 Sep 2000, Lincoln Dale wrote:
> many people (myself included) have been experimenting with zerocopy 
> infrastructures.
> in my case, i've been working on it as time permits for quite a few months 
> now, and am about on my fourth rewrite.


> i've found exactly what you state about the bad things that occur when you 
> associate zerocopy infrastructure with user-space code.  some of the the MM 
> tricks required for handling individual pages effectively kills any 
> performance gain.
> however, approaching it from the other angle of "buffers pinned in kernel 
> memory" can give you a huge win.

I agree.

The "send data from already pinned buffers" case is different. That is
basically why "sendfile()" exists, and why TUX gets good numbers. Once you
get away from the "zero copy from user space" mentality, and start just
passing kernel buffers around, things look a lot better.
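
As a rough illustration of the "pass kernel buffers around" case, here is
a minimal user-space sketch of a sendfile() loop on Linux. The helper names
(send_file_range, demo_sendfile) and the temporary-file/socketpair setup are
assumptions made for the example; error handling is deliberately thin.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: push `len` bytes of in_fd (from offset 0) to out_fd.
 * The data moves from the page cache to the socket inside the kernel; it is
 * never copied into a user-space buffer. Returns 0 on success, -1 on error
 * or premature end of file. */
int send_file_range(int out_fd, int in_fd, size_t len)
{
	off_t offset = 0;

	while (len > 0) {
		ssize_t n = sendfile(out_fd, in_fd, &offset, len);
		if (n <= 0)
			return -1;
		len -= (size_t)n;
	}
	return 0;
}

/* Hypothetical self-check: send a small temp file over a socketpair and
 * verify what arrives on the other end. Returns 0 if the bytes match. */
int demo_sendfile(void)
{
	char path[] = "/tmp/sendfile_demo_XXXXXX";
	const char msg[] = "hello from the page cache\n";
	const size_t len = sizeof msg - 1;
	char got[64];
	size_t have = 0;
	int sv[2];
	int rc = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (write(fd, msg, len) != (ssize_t)len)
		goto out_file;
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		goto out_file;
	if (send_file_range(sv[0], fd, len) < 0)
		goto out_sock;
	while (have < len) {
		ssize_t n = read(sv[1], got + have, len - have);
		if (n <= 0)
			goto out_sock;
		have += (size_t)n;
	}
	rc = memcmp(got, msg, len) == 0 ? 0 : -1;
out_sock:
	close(sv[0]);
	close(sv[1]);
out_file:
	close(fd);
	return rc;
}
```

Note that the sender never touches the file contents; that is the situation
in which this interface is meant to help.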

> for the application which prompted me to begin looking at this problem, 
> where packets typically go network -> RAM -> network, providing a zerocopy 
> infrastructure for (a) viewing incoming packet streams pinned in kernel 
> memory from user-space [a sort-of SIGIO with pointers to the buffers], and 
> (b) hooks for user-space directing the kernel to do things with these 
> buffers [eg. "queue buffer A for output on fd Y"] has provided an immediate 
> 60% performance gain.

You really should look into using the page cache if you can: that way you
have a very natural way of looking at it and possibly changing the stream
in user mode with no extra copies for that side either.

I'm not saying that you should necessarily actually go to a "real file",
but the best way of allowing user-space access to things like this is
through mmap(), and if you make it look like the page cache you'll get a
lot of code for free...


Date: 	Mon, 4 Sep 2000 09:41:25 -0700 (PDT)
From: Linus Torvalds <>
Subject: Re: zero-copy TCP
Newsgroups: fa.linux.kernel

On Mon, 4 Sep 2000, Jamie Lokier wrote:

> Linus Torvalds wrote:
> > Basically any copy <= 4 cache lines is "free" compared to trying to be
> > clever.
> We're obviously interested in larger packets than 128 bytes.


Take a look at some common traffic. Yes, even in servers. 

Small packets are not unlikely.

And even if you add extra code to _not_ do it for small packets, what
you're basically doing is slowing down the path - and increasing your
icache footprint etc quite noticeably, because suddenly you have two
mostly independent paths.

And the bugs get interesting too.

Also, you ignored the fact that the 128 can be 256 or 512. AND you ignored
the fact that I was being very generous indeed to the page table walking
case. So the 128 can end up being quite close to the page size. Or more.

> That's why the data is DMA'd to the card immediately, and that _card_
> retains the data at least for the short term.  Long term if it's still
> retained and the card runs out of memory, the card DMAs old buffers back
> to a kernel skbuff.  This is one way to avoid TLBs.


And such a card doesn't actually exist. At least not in copious numbers.

You need megabytes of memory to hold stuff like that for a busy server. A
few hundred connections with a few kB of queued-up, unacked data.

Face it, it's not realistic. Sure, such cards exist. How many people have
ever _seen_ one?

> Some people who claim zero-copy is great have done actual measurements
> and it does look good for reasonable size packets.  Even though the raw
> performance doesn't look that much better, the CPU utilisation does so
> you can actually _calculate_ a bit more with your data.
> However, I've not seen any evidence that it's a good idea with the
> standard unix APIs.  I suspect everyone will agree on that :-)

Blame the API's.

Face it, complexity almost _never_ pays off.


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: Integration of SCST in the mainstream Linux kernel
Date: Mon, 04 Feb 2008 18:30:51 UTC
Message-ID: <fa.6ErEVvLCalE/>

On Mon, 4 Feb 2008, James Bottomley wrote:
> The way a user space solution should work is to schedule mmapped I/O
> from the backing store and then send this mmapped region off for target
> I/O.

mmap'ing may avoid the copy, but the overhead of a mmap operation is
quite often much *bigger* than the overhead of a copy operation.

Please do not advocate the use of mmap() as a way to avoid memory copies.
It's not realistic. Even if you can do it with a single "mmap()" system
call (which is not at all a given, considering that block devices can
easily be much larger than the available virtual memory space), the fact
is that page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner.

Yes, memory is "slow", but dammit, so is mmap().

> You also have to pull tricks with the mmap region in the case of writes
> to prevent useless data being read in from the backing store.  However,
> none of this involves data copies.

"data copies" is irrelevant. The only thing that matters is performance.
And if avoiding data copies is more costly (or even of a similar cost)
than the copies themselves would have been, there is absolutely no upside,
and only downsides due to extra complexity.

If you want good performance for a service like this, you really generally
*do* need to be in kernel space. You can play games in user space, but you're
fooling yourself if you think you can do as well as doing it in the
kernel. And you're *definitely* fooling yourself if you think mmap()
solves performance issues. "Zero-copy" does not equate to "fast". Memory
speeds may be slower than core CPU speeds, but not infinitely so!

(That said: there *are* alternatives to mmap, like "splice()", that really
do potentially solve some issues without the page table and TLB overheads.
But while splice() avoids the costs of paging, I strongly suspect it would
still have easily measurable latency issues. Switching between user and
kernel space multiple times is definitely not going to be free, although
it's probably not a huge issue if you have big enough requests).


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: [patch v3] splice: fix race with page invalidation
Date: Thu, 31 Jul 2008 00:55:38 UTC
Message-ID: <>

On Thu, 31 Jul 2008, Jamie Lokier wrote:
> Jamie Lokier wrote:
> > not being able to tell when a sendfile() has finished with the pages
> > its sending.
> (Except by the socket fully closing or a handshake from the other end,
> obviously.)

Well, people should realize that this is pretty fundamental to zero-copy
schemes. It's why zero-copy is often much less useful than doing a copy in
the first place. How do you know how far in a splice buffer some random
'struct page' has gotten? Especially with splicing to splicing to tee to

You'd have to have some kind of barrier model (which would be really
complex), or perhaps a "wait for this page to no longer be shared" (which
has issues all its own).

IOW, splice() is very closely related to a magic kind of "mmap()+write()"
in another thread. That's literally what it does internally (except the
"mmap" is just a small magic kernel buffer rather than virtual address
space), and exactly as with mmap, if you modify the file, the other thread
will see it, even though it did it long ago.

Personally, I think the right approach is to just realize that splice() is
_not_ a write() system call, and never will be. If you need synchronous
writing, you simply shouldn't use splice().


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: [patch v3] splice: fix race with page invalidation
Date: Thu, 31 Jul 2008 00:58:31 UTC
Message-ID: <>

On Wed, 30 Jul 2008, Linus Torvalds wrote:
> Personally, I think the right approach is to just realize that splice() is
> _not_ a write() system call, and never will be. If you need synchronous
> writing, you simply shouldn't use splice().

Side note: in-kernel users could probably do something about this. IOW, if
there's some in-kernel usage (and yes, knfsd would be a prime example),
that one may actually be able to do things that a _user_level user of
splice() could never do.

That includes things like getting the inode semaphore over a write (so
that you can guarantee that pages that are in flight are not modified,
except again possibly by other mmap users), and/or a per-page callback for
when splice() is done with a page (so that you could keep the page locked
while it's getting spliced, for example).

And no, we don't actually have that either, of course.


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: [patch v3] splice: fix race with page invalidation
Date: Thu, 31 Jul 2008 17:01:11 UTC
Message-ID: <>

On Thu, 31 Jul 2008, Evgeniy Polyakov wrote:
> It depends... COW can DoS the system: consider attacker who sends a
> page, writes there, sends again and so on in lots of threads. Depending
> on link capacity eventually COW will eat the whole RAM.

Yes, COW is complex, and the complexity would be part of the cost. But the
much bigger cost is the fact that COW is simply more costly than copying
the data in the first place.

A _single_ page fault is usually much much more expensive than copying a
page, especially if you can do the copy well wrt caches. For example, a
very common case is that the data you're writing is already in the CPU

In fact, even if you can avoid the fault, the cost of doing all the
locking and looking up the pages for COW is likely already bigger than the
memcpy. The memcpy() is a nice linear access which both the CPU and the
memory controller can optimize and can get almost perfect CPU throughput
for. In contrast, doing a COW implies a lot of random walking over
multiple data structures. And _if_ it's all in cache, it's probably ok,
but you're totally screwed if you need to send an IPI to another CPU to
actually flush the TLB (to make the COW actually take effect!).

So yes, you can win by COW'ing, but it's rare, and it mainly happens in

For example, I had a trial patch long long ago (I think ten years by now)
to do pipe transfers as copying pages around with COW. It was absolutely
_beautiful_ in benchmarks. I could transfer gigabytes per second, and this
was on something like a Pentium/MMX which had what, 7-10MB/s memcpy

In other words, I don't dispute at all that COW schemes can perform really
really well.

HOWEVER - the COW scheme actually performed _worse_ in any real world
benchmark, including just compiling the kernel (we used to use -pipe to
gcc to transfer data between cc/as).

The reason? The benchmark worked _really_ well, because what it did was
basically to do a trivial microbenchmark that did

	for (;;) {
		write(fd, buffer, bufsize);
	}

and do you see something unrealistic there? Right: it never actually
touched the buffer itself, so it would not ever actually trigger the COW
case, and as a result, the memory got marked read-only on the first time,
and it never ever took a fault, and in fact the TLB never ever needed to
be flushed after the first one because the page was already marked

That's simply not _realistic_. It's hardly how any real load works.
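
For contrast, a minimal sketch of a loop that behaves more like a real
producer: it dirties the buffer before each write(), which is exactly what
would trigger the COW path in such a scheme. The names write_loop and
demo_write_loop, and the small sizes chosen so the demo fits in a pipe
without a reader, are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write the same buffer `iterations` times.
 * With touch != 0 one byte of the buffer is modified before every write,
 * the way a real program refills its buffer between writes.
 * Returns the total number of bytes written, or -1 on error. */
long write_loop(int fd, char *buffer, size_t bufsize, int iterations,
		int touch)
{
	long total = 0;

	for (int i = 0; i < iterations; i++) {
		if (touch && bufsize > 0)
			buffer[(size_t)i % bufsize] = (char)i;
		ssize_t n = write(fd, buffer, bufsize);
		if (n < 0)
			return -1;
		total += n;
	}
	return total;
}

/* Hypothetical self-check: 8 writes of 256 bytes into a pipe (2048 bytes,
 * small enough not to block without a reader). Returns 0 on success. */
int demo_write_loop(void)
{
	static char buffer[256];
	int p[2];
	long total;

	if (pipe(p) < 0)
		return -1;
	total = write_loop(p[1], buffer, sizeof buffer, 8, 1);
	close(p[0]);
	close(p[1]);
	return total == 8 * (long)sizeof buffer ? 0 : -1;
}
```

Calling write_loop() with touch set to 0 reproduces the unrealistic
microbenchmark; with touch set to 1 every iteration modifies memory that a
COW scheme would have just marked read-only.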

> > > There was a linux aio_sendfile() too. Google still knows about its
> > > numbers, graphs and so on... :)
> >
> > I vaguely remember it's performance didn't seem that good.
> <q>
> Benchmark of the 100 1MB files transfer (files are in VFS already) using
> sync sendfile() against aio_sendfile_path() shows about 10MB/sec
> performance win (78 MB/s vs 66-72 MB/s over 1 Gb network, sendfile
> sending server is one-way AMD Athlong 64 3500+) for aio_sendfile_path().
> </q>
> So, it was really better that sync sendfile :)

I suspect it wasn't any better with small files and small transfers.

Yes, some people do big files. Physicists have special things where they
get a few terabytes per second from some high-energy experiment. The
various people spying on you have special setups where they move gigabytes
of satellite map data around to visualize it.

So there are undeniably cases like that, but they are also usually so
special that they really don't even care about COW, because they sure as
hell don't care about somebody else modifying the file they're sending at
the same time.

In fact the whole point is that they don't touch the data at the CPU
_at_all_, and the reason they want zero-copy sending is that they
literally want to DMA from disk buffers to memory, and then from memory to
a network interface, and they don't want the CPU _ever_ seeing it with all
the cache invalidations etc.

And _that_ is the case where you should use sendfile(). If your CPU has
actually touched the data, you should probably just use read()/write().

Of course, one of the really nice things about splice() (which was not
true about sendfile()) is that you can actually mix-and-match. You can
splice data from kernel buffers, but you can also splice data from user
VM, or you can do regular "write()" calls to fill (or read) the data from
the splice pipe.

This is useful in ways that sendfile() never was. You can write() headers
into the pipe buffer, and then splice() the file data into it, and the
user only sees a pipe (and can either read from it or splice it or tee it
or whatever). IOW, splice very much has the UNIX model of "everything is
a pipe", taken to one (admittedly odd) extreme.
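
A minimal sketch of the mix-and-match pattern just described, for Linux:
write() a header into a pipe, then splice() file data in behind it, and let
the consumer see one byte stream. The helper names, the "HDR: " header and
the temporary file are all assumptions made for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: queue a header with write(), then `file_len` bytes
 * of file_fd (from offset 0) with splice(), into the same pipe.
 * Returns 0 on success, -1 on error or premature end of file. */
int queue_header_and_file(int pipe_wr, const char *hdr, size_t hdr_len,
			  int file_fd, size_t file_len)
{
	loff_t off = 0;

	if (write(pipe_wr, hdr, hdr_len) != (ssize_t)hdr_len)
		return -1;
	while (file_len > 0) {
		/* The pipe gets references to the file's pages, not copies. */
		ssize_t n = splice(file_fd, &off, pipe_wr, NULL, file_len, 0);
		if (n <= 0)
			return -1;
		file_len -= (size_t)n;
	}
	return 0;
}

/* Hypothetical self-check: the reader of the pipe should see the header
 * followed by the file contents. Returns 0 if it does. */
int demo_header_then_file(void)
{
	char path[] = "/tmp/splice_demo_XXXXXX";
	const char header[] = "HDR: ";
	const char body[] = "file body\n";
	const char expect[] = "HDR: file body\n";
	const size_t want = sizeof expect - 1;
	char got[64];
	size_t have = 0;
	int p[2];
	int rc = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (write(fd, body, sizeof body - 1) != (ssize_t)(sizeof body - 1))
		goto out_file;
	if (pipe(p) < 0)
		goto out_file;
	if (queue_header_and_file(p[1], header, sizeof header - 1,
				  fd, sizeof body - 1) < 0)
		goto out_pipe;
	while (have < want) {
		ssize_t n = read(p[0], got + have, want - have);
		if (n <= 0)
			goto out_pipe;
		have += (size_t)n;
	}
	rc = memcmp(got, expect, want) == 0 ? 0 : -1;
out_pipe:
	close(p[0]);
	close(p[1]);
out_file:
	close(fd);
	return rc;
}
```

The demo reads the pipe right away and nothing else writes to the file, so
the data is "safe" in the sense discussed here. If the file were modified
between the splice() and the read(), the reader could see the new contents.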

Anyway, the correct way to use splice() is to either just know the data is
"safe" (either because you are _ok_ with people modifying it after the
splice op took place, or because you know nobody will). The alternative is
to expect an acknowledgement from the other side, because then you know
the buffer is done.


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: [patch v3] splice: fix race with page invalidation
Date: Thu, 31 Jul 2008 16:38:47 UTC
Message-ID: <>

On Thu, 31 Jul 2008, Jamie Lokier wrote:
> Having implemented an equivalent zero-copy thing in userspace, I can
> confidently say it's not fundamental at all.

Oh yes it is.

Doing it in user space is _trivial_, because you control everything, and
there are no barriers.

> What is fundamental is that you either (a) treat sendfile as an async
> operation, and get a notification when it's finished with the data,
> just like any other async operation

Umm. And that's exactly what I *described*.

But it's trivial to do inside one program (either all in user space, or
all in kernel space).

It's very difficult indeed to do across two totally different domains.

Have you _looked_ at the complexities of async IO in UNIX? They are
horrible. The overhead to even just _track_ the notifiers basically undoes
all relevant optimizations for doing zero-copy.

IOW, AIO is useful not because of zero-copy, but because it allows
_overlapping_ IO. Anybody who confuses the two is seriously misguided.

>			, or (b) while sendfile claims those
> pages, they are marked COW.

.. and this one shows that you have no clue about performance of a memcpy.

Once you do that COW, you're actually MUCH BETTER OFF just copying.


Copying a page is much cheaper than doing COW on it. Doing a "write()"
really isn't that expensive. People think that memory is slow, but memory
isn't all that slow, and caches work really well. Yes, memory is slow
compared to a few reference count increments, but memory is absolutely
*not* slow when compared to the overhead of TLB invalidates across CPUs

So don't do it. If you think you need it, you should not be using
zero-copy in the first place.

In other words, let me repeat:

 - use splice() when you *understand* that it's just taking a refcount and
   you don't care.

 - use read()/write() when you can't be bothered.

There's nothing wrong with read/write. The _normal_ situation should be
that 99.9% of all IO is done using the regular interfaces. Splice() (and
sendpage() before it) is a special case. You should be using splice if you
have a DVR and you can do all the DMA from the tuner card into buffers
that you can then split up and send off to show real-time at the same time
as you copy them to disk.

THAT is when zero-copy is useful. If you think you need to play games with
async notifiers, you're already off the deep end.


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: [patch v3] splice: fix race with page invalidation
Date: Thu, 31 Jul 2008 18:59:16 UTC
Message-ID: <>

On Thu, 31 Jul 2008, Jamie Lokier wrote:
> But did you miss the bit where you DON'T COPY ANYTHING EVER*?  COW is
> able provide _correctness_ for the rare corner cases which you're not
> optimising for.  You don't actually copy more than 0.0% (*approx).

The thing is, just even _marking_ things COW is the expensive part. If we
have to walk page tables - we're screwed.

> The cost of COW is TLB flushes*.  But for splice, there ARE NO TLB
> FLUSHES because such files are not mapped writable!

For splice, there are also no flags to set, no extra tracking costs, etc

But yes, we could make splice (from a file) do something like

 - just fall back to copy if the page is already mapped (page->mapcount
   gives us that)

 - set a bit ("splicemapped") when we splice it in, and increment
   page->mapcount for each splice copy.

 - if a "splicemapped" page is ever mmap'ed or written to (either through
   write or truncate), we COW it then (and actually move the page cache
   page - it would be a "woc": a reverse cow, not a normal one).

 - do all of this with page lock held, to make sure that there are no
   writers or new mappers happening.

So it's probably doable.

(We could have a separate "splicecount", and actually allow non-writable
mappings, but I suspect we cannot afford the space in the "struct page"
for a whole new count).

> You're missing the real point of network splice().
> It's not just for speed.
> It's for sharing data.  Your TCP buffers can share data, when the same
> big lump is in flight to lots of clients.  Think static file / web /
> FTP server, the kind with 80% of hits to 0.01% of the files roughly
> the same of your RAM.

Maybe. Does it really show up as a big thing?

