Date: 	Tue, 6 Feb 2001 12:59:02 -0800 (PST)
From: Linus Torvalds <>
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Newsgroups: fa.linux.kernel

On Tue, 6 Feb 2001, Christoph Hellwig wrote:
> The second is that bh's are two things:
>  - a cacheing object
>  - an io buffer

Actually, they really aren't.

They kind of _used_ to be, but more and more they've moved away from that
historical use. Check in particular the page cache, and as a really
extreme case the swap cache version of the page cache.

It certainly _used_ to be true that "bh"s were actually first-class memory
management citizens, and actually had a data buffer and a cache associated
with them. And because of that historical baggage, that's how many people
still think of them.

These days, it's really not true any more. A "bh" doesn't really have an
IO buffer intrinsically associated with it any more - all memory management
is done on a _page_ level, and it really works the other way around, ie a
page can have one or more bh's associated with it as the IO entity.

This _does_ show up in the bh itself: you find that bh's end up having the
bh->b_page pointer in it, which is really a layering violation these days,
but you'll notice that it's actually not used very much, and it could
probably be largely removed.

The most fundamental use of it (from an IO standpoint) is actually to
handle high memory issues, because high-memory handling is very
fundamentally based on "struct page", and in order to be able to have
high-memory IO buffers you absolutely have to have the "struct page" the
way things are done now.

(all the other uses tend to not be IO-related at all: they are stuff like
the callbacks that want to find the page that should be free'd up)

The other part of "struct bh" is that it _does_ have support for fast
lookups, and the bh hashing. Again, from a pure IO standpoint you can
easily choose to just ignore this. It's often not used at all (in fact,
_most_ bh's aren't hashed, because the only way to find them is through
the page cache).

> This is not really an clean appropeach, and I would really like to
> get away from it.

Trust me, you really _can_ get away from it. It's not designed into the
bh's at all. You can already just allocate a single (or multiple) "struct
buffer_head" and just use them as IO objects, and give them your _own_
pointers to the IO buffer etc.

In fact, if you look at how the page cache is organized, this is what the
page cache already does. The page cache has its own IO buffer (the page
itself), and it just uses "struct buffer_head" to allocate temporary IO
entities. It _also_ uses the "struct buffer_head" to cache the meta-data
in the sense of having the buffer head also contain the physical address
on disk so that the page cache doesn't have to ask the low-level
filesystem all the time, so in that sense it actually has a double use for
it.

But you can (and _should_) think of that as a "we got the meta-data
address caching for free, and it fit with our historical use, so why not
use it?".

So you can easily do the equivalent of

 - maintain your own buffers (possibly by looking up pages directly from
   user space, if you want to do zero-copy kind of things)

 - allocate a private buffer head ("get_unused_buffer_head()")

 - make that buffer head point into your buffer

 - submit the IO by just calling "submit_bh()", using the b_end_io()
   callback as your way to maintain _your_ IO buffer ownership.

In particular, think of the things that you do NOT have to do:

 - you do NOT have to allocate a bh-private buffer. Just point the bh at
   your own buffer.
 - you do NOT have to "give" your buffer to the bh. You do, of course,
   want to know when the bh is done with _your_ buffer, but that's what
   the b_end_io callback is all about.

 - you do NOT have to hash the bh you allocated and thus expose it to
   anybody else. It is YOUR private bh, and it does not show up on ANY
   other lists. There are various helper functions to insert the bh on
   various global lists ("mark_bh_dirty()" to put it on the dirty list,
   "buffer_insert_inode_queue()" to put it on the inode lists etc), but
   there is nothing in the thing that _forces_ you to expose your bh.

So don't think of "bh->b_data" as being something that the bh owns. It's
just a pointer. Think of "bh->b_data" and "bh->b_size" as _nothing_ more
than a data range in memory. 
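
The steps above can be sketched in a small userspace model. This is only an
illustration under stated assumptions: the "struct buffer_head" here is a
cut-down stand-in with just the fields discussed, "model_submit_write()"
plays the role of "submit_bh()" and completes synchronously (a real device
completes later, from interrupt context), and "struct my_io", "fake_disk",
"write_own_buffer()" and the use made of "b_private" are all hypothetical
names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_BLOCK_SIZE 16
#define MODEL_NR_BLOCKS 4

/* Cut-down stand-in for the kernel structure: an IO entity, nothing more. */
struct buffer_head {
	char *b_data;			/* start of the data range, NOT owned by the bh */
	size_t b_size;			/* length of the data range */
	unsigned long b_blocknr;	/* where the range goes on "disk" */
	void (*b_end_io)(struct buffer_head *bh, int uptodate);
	void *b_private;		/* caller's cookie (assumed use) */
};

static char fake_disk[MODEL_NR_BLOCKS][MODEL_BLOCK_SIZE];

/* Stand-in for submit_bh(WRITE, bh): copy the range and complete at once. */
static void model_submit_write(struct buffer_head *bh)
{
	assert(bh->b_blocknr < MODEL_NR_BLOCKS && bh->b_size <= MODEL_BLOCK_SIZE);
	memcpy(fake_disk[bh->b_blocknr], bh->b_data, bh->b_size);
	bh->b_end_io(bh, 1);
}

/* The caller's own buffer plus its own ownership bookkeeping. */
struct my_io {
	char buf[MODEL_BLOCK_SIZE];
	int in_flight;			/* 1 while the bh may still touch buf */
};

/* Completion callback: this is how the caller learns the buffer is free. */
static void my_end_io(struct buffer_head *bh, int uptodate)
{
	struct my_io *io = bh->b_private;

	(void)uptodate;
	io->in_flight = 0;
}

/* Point a private, unhashed bh at our buffer and submit it. */
static int write_own_buffer(struct my_io *io, unsigned long blocknr)
{
	struct buffer_head bh;		/* private: never put on any global list */

	bh.b_data = io->buf;
	bh.b_size = sizeof(io->buf);
	bh.b_blocknr = blocknr;
	bh.b_end_io = my_end_io;
	bh.b_private = io;

	io->in_flight = 1;
	model_submit_write(&bh);
	return io->in_flight;		/* 0 once my_end_io() has run */
}
```

Calling "write_own_buffer(&io, 2)" on a "struct my_io" whose "buf" holds the
data leaves the bytes in "fake_disk[2]", and the buffer never changes
owner: the bh only borrowed the range for the duration of the IO.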

In short, you can, and often should, think of "struct buffer_head" as
nothing but an IO entity. It has some support for being more than that,
but that's secondary. That can validly be seen as another layer, that is
just so common that there is little point in splitting it up (and a lot of
purely historical reasons for not splitting it).


Date: 	Tue, 6 Feb 2001 13:42:00 -0800 (PST)
From: Linus Torvalds <>
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Newsgroups: fa.linux.kernel

On Tue, 6 Feb 2001, Manfred Spraul wrote:
> Jens Axboe wrote:
> > 
> > > Several kernel functions need a "dontblock" parameter (or a callback, or
> > > a waitqueue address, or a tq_struct pointer).
> > 
> > We don't even need that, non-blocking is implicitly applied with READA.
> >
> READA just returns - I doubt that the aio functions should poll until
> there are free entries in the request queue.

The aio functions should NOT use READA/WRITEA. They should just use the
normal operations, waiting for requests. The thing that makes them
asynchronous is not waiting for the requests to _complete_. Which you can
already do, trivially enough.

The case for using READA/WRITEA is not that you want to do asynchronous
IO (all Linux IO is asynchronous unless you do extra work), but because
you have a case where you _might_ want to start IO, but if you don't have
a free request slot (ie there's already tons of pending IO happening), you
want the option of doing something else. This is not about aio - with aio
you _need_ to start the IO, you're just not willing to wait for it. 

An example of READA/WRITEA is if you want to do opportunistic dirty page
cleaning - you might not _have_ to clean it up, but you say

 "Hmm.. if you can do this simply without having to wait for other
  requests, start doing the writeout in the background. If not, I'll come
  back to you later after I've done more real work.."

And the Linux block device layer supports both of these kinds of "delayed
IO" already. It's all there. Today.


Date: 	Tue, 6 Feb 2001 14:26:38 -0800 (PST)
From: Linus Torvalds <>
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Newsgroups: fa.linux.kernel

On Tue, 6 Feb 2001, Jens Axboe wrote:

> On Tue, Feb 06 2001, Marcelo Tosatti wrote:
> > 
> > Reading write(2): 
> > 
> >        EAGAIN Non-blocking  I/O has been selected using O_NONBLOCK and there was
> >               no room in the pipe or socket connected to fd to  write  the data
> >               immediately.
> > 
> > I see no reason why "aio function have to block waiting for requests". 
> That was my reasoning too with READA etc, but Linus seems to want that we
> can block while submitting the I/O (as throttling, Linus?) just not
> until completion.

Note the "in the pipe or socket" part.
                 ^^^^    ^^^^^^

EAGAIN is _not_ a valid return value for block devices or for regular
files. And in fact it _cannot_ be, because select() is defined to always
return 1 on them - so if a write() were to return EAGAIN, user space would
have nothing to wait on. Busy waiting is evil.
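
The select() behaviour is easy to check from user space. The sketch below
uses only standard POSIX calls; the helper names "writable_now()" and
"regular_file_is_always_ready()" are made up for the example, and
tmpfile() is used purely as a convenient way to get a descriptor for a
regular file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>

/* Ask select() whether fd is writable, with a zero timeout so the call
 * itself never sleeps. Returns the number of ready descriptors. */
static int writable_now(int fd)
{
	fd_set wfds;
	struct timeval tv = { 0, 0 };

	assert(fd >= 0 && fd < FD_SETSIZE);
	FD_ZERO(&wfds);
	FD_SET(fd, &wfds);
	return select(fd + 1, NULL, &wfds, NULL, &tv);
}

/* Run the check against a scratch regular file.
 * Returns select()'s result, or -1 if no scratch file could be created. */
static int regular_file_is_always_ready(void)
{
	FILE *f = tmpfile();
	int ready;

	if (!f)
		return -1;
	ready = writable_now(fileno(f));
	fclose(f);
	return ready;
}
```

Since select() reports a regular file as ready no matter how busy the disk
is, a write() that failed with EAGAIN would leave the caller with nothing
to sleep on except a retry loop.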

So READA/WRITEA are only useful inside the kernel, and when the caller has
some data structures of its own that it can use to gracefully handle the
case of a failure - it will try to do the IO later for some reasons, maybe
deciding to do it with blocking because it has nothing better to do at the
later date, or because it decides that it can have only so many
outstanding requests.

Remember: in the end you HAVE to wait somewhere. You're always going to be
able to generate data faster than the disk can take it. SOMETHING has to
throttle - if you don't allow generic_make_request() to throttle, you have
to do it on your own at some point. It is stupid and counter-productive to
argue against throttling. The only argument can be _where_ that throttling
is done, and READA/WRITEA leaves the possibility open of doing it
somewhere else (or just delaying it and letting a future call with
READ/WRITE do the throttling).


Date: 	Thu, 8 Feb 2001 10:09:05 -0800 (PST)
From: Linus Torvalds <>
Subject: Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Newsgroups: fa.linux.kernel

On Thu, 8 Feb 2001, Marcelo Tosatti wrote:
> On Thu, 8 Feb 2001, Stephen C. Tweedie wrote:
> <snip>
> > > How do you write high-performance ftp server without threads if select
> > > on regular file always returns "ready"?
> > 
> > Select can work if the access is sequential, but async IO is a more
> > general solution.
> Even async IO (ie aio_read/aio_write) should block on the request queue if
> its full in Linus mind.

Not necessarily. I said that "READA/WRITEA" are only worth exporting
inside the kernel - because the latencies and complexities are low-level
enough that it should not be exported to user space as such.

But I could imagine a kernel aio package that does the equivalent of

	bh->b_end_io = completion_handler;
	generic_make_request(WRITE, bh);	/* this may block */
	bh= bh->b_next;

	/* Now, fill it up as much as we can.. */
	current->state = TASK_INTERRUPTIBLE;
	while (more data to be written) {
		if (generic_make_request(WRITEA, bh) < 0)
			break;
		bh = bh->b_next;
	}
	schedule();


and then you make the _completion handler_ thing continue to feed more
requests. Yes, you may block at some points (because you need to always
have at least _one_ request in-flight in order to have the state machine
active), but you can basically try to avoid blocking more than necessary.

But do you see why the above can't be done from user space? It requires
that the completion handler (which runs in an interrupt context) be able
to continue to feed requests and keep the queue filled. If you don't do
that, you'll never have good throughput, because it takes too long to send
signals, re-schedule or whatever to user mode.

And do you see how it has to block _sometimes_? If people do hundreds of
AIO requests, we can't let memory just fill up with pending writes..
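
The state machine can be simulated in user space to see the shape of it.
This is a sketch under loud assumptions: requests are just counters, one
call to "completion_handler()" stands for one disk interrupt, and the
model does not distinguish the first blocking WRITE from the later WRITEA
attempts. The names "struct aio_model", "try_submit()" and "run_aio()" are
hypothetical.

```c
#include <assert.h>

#define NR_SLOTS 2

struct aio_model {
	int pending;		/* requests not yet handed to the queue */
	int in_flight;		/* requests occupying a queue slot */
	int done;		/* requests completed */
	int max_in_flight;	/* high-water mark, to check the bound */
};

/* WRITEA-style submit: queue one pending request if a slot is free. */
static int try_submit(struct aio_model *a)
{
	if (a->pending == 0 || a->in_flight == NR_SLOTS)
		return -1;
	a->pending--;
	a->in_flight++;
	if (a->in_flight > a->max_in_flight)
		a->max_in_flight = a->in_flight;
	return 0;
}

/* Plays the role of b_end_io: retires one request and refills the queue
 * right away, without going back to the original submitter. */
static void completion_handler(struct aio_model *a)
{
	assert(a->in_flight > 0);
	a->in_flight--;
	a->done++;
	while (try_submit(a) == 0)
		;
}

/* Submitter: prime the queue, then let completions drive the rest.
 * Returns the number of completed requests. */
static int run_aio(struct aio_model *a)
{
	while (try_submit(a) == 0)	/* fill it up as much as we can */
		;
	while (a->in_flight > 0)	/* one iteration per modelled interrupt */
		completion_handler(a);
	return a->done;
}
```

Starting with seven pending requests, "run_aio()" completes all seven while
never holding more than NR_SLOTS in flight, which is the bound that keeps
pending writes from filling up memory.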


From: Theodore Tso <>
Newsgroups: fa.linux.kernel
Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Date: Fri, 29 Dec 2006 23:32:58 UTC
Message-ID: <fa.cnkpyyNL/>

On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
> I think ext3 is terminally crap by now. It still uses buffer heads in
> places where it really really shouldn't, and as a result, things like
> directory accesses are simply slower than they should be. Sadly, I don't
> think ext4 is going to fix any of this, either.

Not just ext3; ocfs2 is using the jbd layer as well.  I think we're
going to have to put this (a rework of jbd2 to use the page cache) on
the ext4 todo list, and work with the ocfs2 folks to try to come up
with something that suits their needs as well.  Fortunately we have
this filesystem/storage summit thing coming up in the next few months,
and we can try to get some discussion going on the linux-ext4 mailing
list in the meantime.  Unfortunately, I don't think this is going to
be trivial.

If we do get this fixed for ext4, one interesting question is whether
people would accept a patch to backport the fixes to ext3, given the
grief this is causing the page I/O and VM routines.  OTOH, reiser3
probably has the same problems, and I suspect the changes to ext3 to
cause it to avoid buffer heads, especially in order to support
filesystem blocksizes < pagesize, are going to be sufficiently risky
in terms of introducing regressions to ext3 that they would probably
be rejected on those grounds.  So unfortunately, we probably are going
to have to support flushes via buffer heads for the foreseeable
future.

						- Ted

From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Date: Sat, 30 Dec 2006 00:00:46 UTC
Message-ID: <fa.bQ/>

On Fri, 29 Dec 2006, Theodore Tso wrote:
> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> the grief this is causing the page I/O and VM routines.

I don't think backporting is the smartest option (unless it's done _way_
later), but the real problem with it isn't actually the VM behaviour, but
simply the fact that cached performance absolutely _sucks_ with the buffer
cache.

With the physically indexed buffer cache thing, you end up always having
to do these complicated translations into block numbers for every single
access, and at some point when I benchmarked it, it was a huge overhead
for doing simple things like readdir.

It's also a major pain for read-ahead, exactly partly due to the high cost
of translation - because you can't cheaply check whether the next block is
there, the cost of even asking the question "should I try to read ahead?"
is much much higher. As a result, read-ahead is seriously limited, because
it's so expensive for the cached case (which is still hopefully the
_common_ case).

So because read-ahead is limited, the non-cached case then _really_ sucks.

It was somewhat fixed in a really god-awful fashion by having
ext3_readdir() actually do _readahead_ through the page cache, even though
it does everything else through the buffer cache. And that just happens to
work because we hopefully have physically contiguous blocks, but when that
isn't true, the readahead doesn't do squat.

It's really quite fundamentally broken. But none of that causes any
problems for the VM, since directories cannot be mmap'ed anyway. But it's
really pitiful, and it really doesn't work very well. Of course, other
filesystems _also_ suck at this, and other operating systems have even
MORE problems, so people don't always seem to realize how horribly
horribly broken this all is.

I really wish somebody would write a filesystem that did large cold-cache
directories well. Open some horrible file manager on /usr/bin with cold
caches, and weep. The biggest problem is the inode indirection, but at
some point when I looked at why it sucked, it was doing basically
synchronous single-buffer reads on the directory too, because readahead
didn't work properly.

I was hoping that something like SpadFS would actually take off, because
it seemed to make a lot of good design choices (having inodes in-line in the
directory for when there are no hardlinks is probably a requirement for a
good filesystem these days. The separate inode table had its uses, but
indirection in a filesystem really does suck, and stat information is too
important to be indirect unless it absolutely has to).

But I suspect it needs more than somebody who just wants to get his thesis
written ;)


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Date: Sat, 30 Dec 2006 00:52:15 UTC
Message-ID: <>

On Fri, 29 Dec 2006, Andrew Morton wrote:
> Adam Richter spent considerable time a few years ago trying to make the
> mpage code go direct-to-BIO in all cases and we eventually gave up.  The
> conceptual layering of page<->blocks<->bio is pretty clean, and it is hard
> and ugly to fully optimise away the "block" bit in the middle.

Using the buffer cache as a translation layer to the physical address is
fine. That's what _any_ block device will do.

I'm not at all saying that "buffer heads must go away". They work fine.

What I'm saying is that

 - if you index by buffer heads, you're screwed.
 - if you do IO by starting at buffer heads, you're screwed.

Both indexing and writeback decisions should be done at the page cache
layer. Then, when you actually need to do IO, you look at the buffers. But
you start from the "page". YOU SHOULD NEVER LOOK UP a buffer on its own
merits, and YOU SHOULD NEVER DO IO on a buffer head on its own cognizance.

So by all means keep the buffer heads as a way to keep the
"virtual->physical" translation. It's what they were designed for. But
they were _originally_ also designed for "lookup" and "driving the start
of IO", and that is wrong, and has been wrong for a long time now, because

 - lookup based on physical address is fundamentally slow and inefficient.
   You have to look up the virtual->physical translation somewhere else,
   so it's by design an unnecessary indirection _and_ that "somewhere
   else" is also by definition filesystem-specific, so you can't do any
   of these things at the VFS layer.

   Ergo: anything that needs to look up the physical address in order to
   find the buffer head is BROKEN in this day and age. We look up the
   _virtual_ page cache page, and then we can trivially find the buffer
   heads within that page thanks to page->buffers.

   Example: ext2 vs ext3 readdir. One of them sucks, the other doesn't.

 - starting IO based on the physical entity is insane. It's insane exactly
   _because_ the VM doesn't actually think in physical addresses, or in
   buffer-sized blocks. The VM only really knows about whole pages, and
   all the VM decisions fundamentally have to be page-based. We don't ever
   "free a buffer". We free a whole page, and as such, doing writeback
   based on buffers is pointless, because it doesn't actually say anything
   about the "page state" which is what the VM tracks.

But neither of these means that "buffer_head" itself has to go away. They
both really boil down to the same thing: you should never KEY things by
the buffer head. All actions should be based on virtual indexes as far as
at all humanly possible.

Once you do lookup and locking and writeback _starting_ from the page,
it's then easy to look up the actual buffer head within the page, and use
that as a way to do the actual _IO_ on the physical address. So the buffer
heads still exist in ext2, for example, but they don't drive the show
quite as much.
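
A toy model of "start from the page, then find the buffer" may make the
ordering concrete. Everything here is an assumption made for illustration:
the page cache is a flat array indexed by page index, "fs_get_block()" is
a made-up block map standing in for the filesystem's translation, and the
structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BH_PER_PAGE 4	/* e.g. 1k blocks in a 4k page */
#define NR_PAGES 8

struct buffer_head_model {
	long b_blocknr;	/* cached virtual->physical translation, -1 if unmapped */
};

struct page_model {
	int present;
	struct buffer_head_model buffers[BH_PER_PAGE];	/* like page->buffers */
};

/* Page cache of one file, keyed by the page's index within the file. */
struct mapping_model {
	struct page_model pages[NR_PAGES];
	int fs_lookups;	/* how often the filesystem had to translate */
};

/* Hypothetical filesystem block map: file block -> disk block. */
static long fs_get_block(struct mapping_model *m, long file_block)
{
	m->fs_lookups++;
	return 1000 + 3 * file_block;	/* deliberately not contiguous */
}

/* Look up by virtual index first, and only then consult the buffer that
 * hangs off the page to learn the physical block for the actual IO. */
static long block_for_io(struct mapping_model *m, long file_block)
{
	struct page_model *page;
	struct buffer_head_model *bh;
	int i;

	assert(file_block >= 0 && file_block < NR_PAGES * BH_PER_PAGE);
	page = &m->pages[file_block / BH_PER_PAGE];
	if (!page->present) {
		for (i = 0; i < BH_PER_PAGE; i++)
			page->buffers[i].b_blocknr = -1;
		page->present = 1;
	}
	bh = &page->buffers[file_block % BH_PER_PAGE];
	if (bh->b_blocknr < 0)
		bh->b_blocknr = fs_get_block(m, file_block);
	return bh->b_blocknr;
}
```

The cache is never searched by disk block number. A repeated access to the
same file block finds the page by its virtual index and reuses the
translation cached in the buffer, so the filesystem is asked only once.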

(They still do in some areas: the allocation bitmaps, the xattr code etc.
But as long as none of those have big VM footprints, and as long as no
_common_ operations really care deeply, and as long as those data
structures never need to be touched by the VM or VFS layer, nobody will
ever really care).

The directory case comes up just because "readdir()" actually is very
common, and sometimes very slow. And it can have a big VM working set
footprint ("find"), so trying to be page-based actually really helps,
because it all drives things like writeback on the _right_ issues, and we
can do things like LRU's and writeback decisions on the level that really
matters.

I actually suspect that the inode tables could benefit from being in the
page cache too (although I think that the inode buffer address is actually
"physical", so there's no indirection for inode tables, which means that
the virtual vs physical addressing doesn't matter). For directories, there
definitely is a big cost to continually doing the virtual->physical
translation all the time.


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Date: Sat, 30 Dec 2006 00:12:34 UTC
Message-ID: <>

On Fri, 29 Dec 2006, Andrew Morton wrote:
> They're extra.  As in "can be optimised away".

Sure. Don't use buffer heads.

> The buffer_head is not an IO container.  It is the kernel's core
> representation of a disk block.

Please come back from the 90's.

The buffer heads are nothing but a mapping of where the hardware block is.
If you use it for anything else, you're basically screwed.

> JBD implements physical block-based journalling, so it is 100% appropriate
> that JBD deal with these disk blocks using their buffer_head
> representation.

And as long as it does that, you just have to face the fact that it's
going to perform like crap, including what you call "extra" writes, and
what I call "deal with it".

Btw, you can make pages be physically indexed too, but they obviously
 (a) won't be coherent with any virtual mapping laid on top of it
 (b) will be _physical_, so any readahead etc will be based on physical
     addresses too.

> I thought I fixed the performance problem?

No, you papered over it, for the reasonably common case where things were
physically contiguous - exactly by using a physical page cache, so now it
can do read-ahead based on that. Then, because the pages contain buffer
heads, the directory accesses can look up buffers, and if it was all
physically contiguous, it all works fine.

But if you actually want virtually indexed caching (and all _users_ want
it), it really doesn't work.

> Somewhat nastily, but as ext3 directories are metadata it is appropriate
> that modifications to them be done in terms of buffer_heads (ie: blocks).

No. There is nothing "appropriate" about using buffer_heads for metadata.

It's quite proper - and a hell of a lot more efficient - to use virtual
page-caching for metadata too.

Look at the ext2 readdir() implementation, and compare it to the crapola
horror that is ext3. Guess what? ext2 uses virtually indexed metadata, and
as a result it is both simpler, smaller and a LOT faster than ext3 in
accessing that metadata.

Face it, Andrew, you're wrong on this one. Really. Just take a look at
ext2.

[ I'm not saying that ext2_readdir() is _beautiful_. If it had been
  written with the page cache in mind, it would probably have been done
  very differently. And it doesn't do any readahead, probably because
  nobody cared enough, but it should be trivial to add, and it would
  automatically "do the right thing" just because it's much easier at the
  page cache level.

  But I _am_ saying that compared to ext3, the ext2 readdir is a work of
  art. ]

"metadata" has _zero_ to do with "physically indexed". There is no
correlation what-so-ever. If you think there is a correlation, it's all in
your mind.


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
Date: Sat, 30 Dec 2006 01:00:21 UTC
Message-ID: <>

On Fri, 29 Dec 2006, Andrew Morton wrote:
> > > Somewhat nastily, but as ext3 directories are metadata it is appropriate
> > > that modifications to them be done in terms of buffer_heads (ie: blocks).
> >
> > No. There is nothing "appropriate" about using buffer_heads for metadata.
> I said "modification".

You said "metadata".

Why do you think directories are any different from files? Yes, they are
metadata. So what? What does that have to do with anything?

They should still use virtual indexes, the way files do. That doesn't
preclude them from using buffer-heads to mark their (partial-page)
modifications and for keeping the virtual->physical translations cached.

I mean, really. Look at ext2. It does exactly that. It keeps the
directories in the page cache - virtually indexed. And it even modifies
them there. Exactly the same way it modifies regular file data.

It all works exactly the same way it works for regular files. It uses

	page->mapping->a_ops->prepare_write(NULL, page, from, to);
	... do modification ...
	ext2_commit_chunk(page, from, to);

exactly the way regular file data works.
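
The prepare/modify/commit pattern can be modelled outside the kernel to
show that nothing in it depends on what the page holds. This is only a
sketch: "model_prepare_write()" and "model_commit_chunk()" are simplified
stand-ins for the calls above, the page is a small byte array, and the
dirty-range bookkeeping is an assumption made for the example.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE_MODEL 64

struct page_model {
	char data[PAGE_SIZE_MODEL];
	unsigned dirty_from, dirty_to;	/* dirty byte range, empty if equal */
	int prepared;
};

/* Validate the range and mark the page as ready to be modified. */
static int model_prepare_write(struct page_model *page,
			       unsigned from, unsigned to)
{
	if (from > to || to > PAGE_SIZE_MODEL)
		return -1;
	page->prepared = 1;
	return 0;
}

/* Record the modified range as dirty so writeback can find it later. */
static void model_commit_chunk(struct page_model *page,
			       unsigned from, unsigned to)
{
	assert(page->prepared);
	if (page->dirty_from == page->dirty_to) {
		page->dirty_from = from;
		page->dirty_to = to;
	} else {
		if (from < page->dirty_from)
			page->dirty_from = from;
		if (to > page->dirty_to)
			page->dirty_to = to;
	}
	page->prepared = 0;
}

/* The same three steps, whether the page holds file data or a
 * directory block. Returns 0 on success, -1 on a bad range. */
static int modify_in_page_cache(struct page_model *page, unsigned from,
				const char *bytes, unsigned len)
{
	unsigned to = from + len;

	if (model_prepare_write(page, from, to) < 0)
		return -1;
	memcpy(page->data + from, bytes, len);	/* ... do modification ... */
	model_commit_chunk(page, from, to);
	return 0;
}
```

Writing a four-byte name at offset 8 of a "directory" page marks bytes 8
to 12 dirty, exactly as it would for a page of regular file data.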

That's why I'm saying there is absolutely _zero_ thing about "metadata"
here, or even about "modifications". It all works better in a virtual
cache, because you get all the support that we give to page caches.

So I really don't understand why you make excuses for ext3 and talk about
"modifications" and "metadata". It was a fine design ten years ago. It's
not really very good any longer.

I suspect we're stuck with the design, but that doesn't make it any
better.

