Block layer (Linus Torvalds)


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: cdrecord hangs my computer
Original-Message-ID: <Pine.LNX.4.58.0312110807250.2267@home.osdl.org>
Date: Thu, 11 Dec 2003 16:16:14 GMT
Message-ID: <fa.j3siq5c.1diqk0q@ifi.uio.no>

On Thu, 11 Dec 2003, Jens Axboe wrote:
>
> What makes you say that Linux has a block-centric IO architecture? 2.6
> block io layer is quite happy to do byte-granularity SCSI commands for
> you.

Indeed.

I don't think some people really _realize_ how much cleaner and more
generic the generic block layer is compared to SCSI.

Yes, we call it "block layer" for historical reasons, but the fact is,
it's a "packet command" layer with knowledge of blocking (ie the merging
and sorting code has the ability to merge packets that are marked as
mergeable and fit certain criteria).

And the reason it is so much superior to SCSI is that it's designed to be
generic enough that it doesn't _care_ what the device is. The generic
block layer can work with MD, with floppy disks, with traditional SCSI
devices, and it just _works_.

The block layer doesn't have any silly assumptions about what it is
talking to, although it has some helper functions that are directly aimed
at a block device that implements a SCSI-like packet command set. But they
literally are helper functions - the block layer does not force your
floppy device to pretend that it is some kind of strange SCSI disk when it
isn't.

			Linus

Newsgroups: fa.linux.kernel
From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: patch: aio + bio for raw io
Original-Message-ID: <a41kvr$836$1@penguin.transmeta.com>
Date: Fri, 8 Feb 2002 22:58:53 GMT
Message-ID: <fa.j9urjqv.1cioiqa@ifi.uio.no>

In article <20020208171327.B12788@redhat.com>,
Benjamin LaHaise  <bcrl@redhat.com> wrote:
>On Fri, Feb 08, 2002 at 01:07:58PM -0800, Badari Pulavarty wrote:
>> I am looking at the 2.5 patch you sent out. I have few questions/comments:
>>
>> 1) brw_kvec_async() does not seem to split IO at BIO_MAX_SIZE. I thought
>>    each bio can handle only BIO_MAX_SIZE (ll_rw_kio() is creating one bio
>>    for each BIO_MAX_SIZE IO).
>
>Sounds like a needless restriction in bio, especially as one of the design
>requirements for the 2.5 block work is that we're able to support large ios
>(think 4MB page support).

bio can handle arbitrarily large IO's, BUT it can never split them.

Basically, IO splitting is NOT the job of the IO layer.  So you can make
any size request you want, but you had better know that the hardware you
send it to can take it.  The bio layer basically guarantees only that
you can send a single contiguous request of PAGE_SIZE, nothing more (in
fact, we might at some point get away from even that, and only guarantee
sectors - with things like loopback remapping etc you might have trouble
even for "contiguous" requests).

Now, before you say "that's stupid, I don't know what the driver limits
are", ask yourself:
 - what is it that you want to go fast?
 - what is it that you CAN make fast?

The answer to the "want" question is: the common case. And like it or
not, the common case is never going to be 4MB pages.

The answer to the "can" question is: merging can be fast, splitting
fundamentally cannot.

Splitting a request _fundamentally_ involves memory management (at the
very least you have to allocate a new request), while growing a request
can (and does) mean just adding an entry to the end of a list (until you
cannot grow it any more, of course, but that's the point where you have
to end anyway, so..)
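
Illustration (not part of the original message): a minimal user-space
sketch of that asymmetry, assuming a request is a fixed array of
segments. The names `struct req`, `req_grow`, `req_split` and the
`MAX_SEGS` limit are hypothetical and are not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_SEGS 8	/* hypothetical per-request segment limit */

struct seg {
	void *addr;
	size_t len;
};

struct req {
	struct seg segs[MAX_SEGS];
	size_t nsegs;
};

/*
 * Growing: no allocation, just fill in the next free slot.
 * Returns 0 on success, -1 when the request is full, which is the
 * point where the caller has to submit it anyway.
 */
static int req_grow(struct req *r, void *addr, size_t len)
{
	if (r->nsegs == MAX_SEGS)
		return -1;
	r->segs[r->nsegs].addr = addr;
	r->segs[r->nsegs].len = len;
	r->nsegs++;
	return 0;
}

/*
 * Splitting: has to allocate a second request and move the tail
 * into it. Returns NULL if the allocation fails, a failure mode
 * that growing never has.
 */
static struct req *req_split(struct req *r, size_t keep)
{
	struct req *tail;
	size_t i;

	assert(keep <= r->nsegs);
	tail = calloc(1, sizeof(*tail));
	if (!tail)
		return NULL;
	for (i = keep; i < r->nsegs; i++)
		tail->segs[tail->nsegs++] = r->segs[i];
	r->nsegs = keep;
	return tail;
}
```

The grow path is a bounds check and a store; the split path needs an
allocator and an error path, which is the cost being objected to here.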

Now, think about that for five minutes, and if you don't come back with
the right answer, you get an F.

In short:

 - the right answer to handling 4MB pages is not to push complexity into
   the low-level drivers and make them try to handle requests that are
   bigger than the hardware can do.

   In fact, we don't even want to handle it in the mid layers, because
   (a) the mid layers have historically been even more flaky than some
   device drivers and (b) it's a performance loss to even test for the
   common case where the splitting is neither needed nor wanted.

 - the _right_ answer to handling big areas is to build up big bio's
   from smaller ones. And no, you don't have to call the elevator in
   between requests that you know are consecutive on the disk.

   Another way of saying it: if you have 4MB worth of IO, it's YOUR
   responsibility to do the work to make it fit the controller. It is off
   the default case, and _you_ do a bit of extra work instead of asking
   everybody else to do your heavy lifting for you.

Does bio have the interfaces to do this yet? No.  But if you think that
bio's should natively handle any kind of request at all, you're really
barking up the wrong tree.

If you are in the small small _small_ minority that cares about 4MB requests,
you should build the infrastructure not to make drivers split them, but
to build up a list of bio's and then submit them all consecutively in
one go.

Remember: checking the limits as you build stuff up is easy, and fast.

So you should make sure that you never EVER cause anybody to want to
split a bio.
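
Illustration (not part of the original message): a rough sketch of
"build up and submit consecutively", assuming the only limit is a page
count per chunk. `struct chunk`, `chunk_add_page`, `submit_large_io`
and `PAGE_SZ` are hypothetical names, and "submitting" is reduced to
counting. Real device limits are richer than a single number.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096u	/* assumed page size for the sketch */

/* Hypothetical stand-in for a bio under construction. */
struct chunk {
	size_t npages;
	size_t max_pages;	/* assumed limit of the target device */
};

/* Limit check at build time: one comparison per page added. */
static int chunk_add_page(struct chunk *c)
{
	if (c->npages == c->max_pages)
		return 0;
	c->npages++;
	return 1;
}

/*
 * Walk a large IO page by page. When the current chunk refuses to
 * grow, "submit" it and start the next one. Returns the number of
 * chunks submitted. Nothing is ever split after the fact.
 */
static size_t submit_large_io(size_t total_bytes, size_t max_pages)
{
	size_t pages = (total_bytes + PAGE_SZ - 1) / PAGE_SZ;
	size_t submitted = 0;
	struct chunk c = { 0, max_pages };
	size_t i;

	assert(max_pages > 0);
	for (i = 0; i < pages; i++) {
		if (!chunk_add_page(&c)) {
			submitted++;		/* submit the full chunk */
			c.npages = 0;		/* start the next one */
			chunk_add_page(&c);
		}
	}
	if (c.npages)
		submitted++;			/* submit the final chunk */
	return submitted;
}
```

With these assumptions a 4MB IO against a 32-page limit goes out as 32
back-to-back chunks, and the common single-page case pays for exactly
one comparison.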

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: patch: aio + bio for raw io
Original-Message-ID: <Pine.LNX.4.33.0202081611490.11791-100000@penguin.transmeta.com>
Date: Sat, 9 Feb 2002 00:32:36 GMT
Message-ID: <fa.n8lj6mv.dn4goa@ifi.uio.no>

On Fri, 8 Feb 2002, Benjamin LaHaise wrote:
>
> Yup.  What we need is an interface for getting the max size of an io --

No. There is no such thing.

There is no "max size". There are various different limits, and "size" is
usually the last one on the list. The limitations are things like "I can
have at most N consecutive segments" or "crossing a 64kB border is fine,
but implies a new segment" or "a single segment is limited to X bytes, and
the _sum_ of all segments is limited to Y bytes" or..

And it can depend on the _address_ of the thing you're writing. If the
address is above a bounce limit, the bouncing code ends up having to copy
it to accessible memory, so you can have a device that can do a 4MB
request in one go if it's directly accessible, but if it is not in the low
XXX bits, then it gets split into chunks and copied down, at which point
you may only be able to do N chunks at a time.

And no, I didn't make any of these examples up.

A "max size" does not work. It needs to be a lot more complex than that.
For block devices, you need the whole "struct request_queue" to describe
the default cases, and even then there are function pointers to let
individual drivers impose limits of their own _outside_ those cases.

So it basically needs to be a "grow_bio()" function that does the choice,
not a size limitation.

(And then most devices will just use one of a few standard "grow()"
functions, of course - you need the flexibility, but at the same time
there are a lot of common cases).
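
Illustration (not part of the original message): one way such a grow
function could look, as a user-space sketch. Everything here is
hypothetical (`struct queue_sketch`, `struct io_sketch`,
`grow_standard`, `io_grow`); it is not the kernel's
`struct request_queue`. It assumes four of the limits listed above: a
segment count, a per-segment size, a total size, and a boundary that
forces a new segment. For brevity it only tracks the most recent
segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct queue_sketch;
struct io_sketch;

/* The queue decides whether an IO may grow; callers never see a size. */
typedef int (*grow_fn)(const struct queue_sketch *q, struct io_sketch *io,
		       uint64_t addr, size_t len);

struct queue_sketch {
	size_t max_segs;	/* at most N segments */
	size_t max_seg_bytes;	/* one segment is limited to X bytes */
	size_t max_total_bytes;	/* sum of all segments limited to Y bytes */
	uint64_t boundary;	/* a segment may not cross a multiple of this */
	grow_fn grow;		/* a driver may install its own rule here */
};

struct io_sketch {
	size_t nsegs;
	size_t total;
	uint64_t seg_start;	/* start of the most recent segment */
	size_t seg_len;		/* length of the most recent segment */
};

/* Nonzero if [start, start + len) spans a multiple of boundary. */
static int crosses(uint64_t start, size_t len, uint64_t boundary)
{
	return len && (start / boundary) != ((start + len - 1) / boundary);
}

/* One "standard" rule. Returns 1 if the piece was added, 0 if not. */
static int grow_standard(const struct queue_sketch *q, struct io_sketch *io,
			 uint64_t addr, size_t len)
{
	if (len == 0 || len > q->max_seg_bytes ||
	    crosses(addr, len, q->boundary))
		return 0;
	if (io->total + len > q->max_total_bytes)
		return 0;

	if (io->nsegs && addr == io->seg_start + io->seg_len &&
	    io->seg_len + len <= q->max_seg_bytes &&
	    !crosses(io->seg_start, io->seg_len + len, q->boundary)) {
		/* Contiguous and within limits: extend the last segment. */
		io->seg_len += len;
	} else {
		/* Otherwise a new segment is needed, if one is left. */
		if (io->nsegs == q->max_segs)
			return 0;
		io->nsegs++;
		io->seg_start = addr;
		io->seg_len = len;
	}
	io->total += len;
	return 1;
}

/* What a submitter would call: it asks, and the queue's rule answers. */
static int io_grow(const struct queue_sketch *q, struct io_sketch *io,
		   uint64_t addr, size_t len)
{
	return q->grow(q, io, addr, len);
}
```

The caller only ever learns "added" or "not added". A driver with an
odd restriction would supply a different `grow` pointer rather than
trying to express the restriction as a single maximum size.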

		Linus
