Page sizes (Linus Torvalds)

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Implementing NVMHCI...
Date: Sat, 11 Apr 2009 19:58:00 UTC
Message-ID: <fa.q7ceKn/MUQU1K8pEeO6WLkc2ZxA@ifi.uio.no>

On Sat, 11 Apr 2009, Alan Cox wrote:
>
> > 	  The spec describes the sector size as
> > 	  "512, 1k, 2k, 4k, 8k, etc."   It will be interesting to reach
> > 	  "etc" territory.
>
> Over 4K will be fun.

And by "fun", you mean "irrelevant".

If anybody does that, they'll simply not work. And it's not worth it even
trying to handle it.

That said, I'm pretty certain Windows has the same 4k issue, so we can
hope nobody will ever do that kind of idiotically broken hardware. Of
course, hardware people often do incredibly stupid things, so no
guarantees.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Implementing NVMHCI...
Date: Sat, 11 Apr 2009 22:39:31 UTC
Message-ID: <fa.9wu7XI0fuffP1tl3YEU2Is5eC1A@ifi.uio.no>

On Sat, 11 Apr 2009, Grant Grundler wrote:
>
> Why does it matter what the sector size is?
> I'm failing to see what the fuss is about.
>
> We've abstracted the DMA mapping/SG list handling enough that the
> block size should make no more difference than it does for the
> MTU size of a network.

The VM is not ready or willing to do more than 4kB pages for any normal
caching scheme.

> And the linux VM does handle bigger than 4k pages (several architectures
> have implemented it) - even if x86 only supports 4k as base page size.

4k is not just the "supported" base page size, it's the only sane one.
Bigger pages waste memory like mad on any normal load due to
fragmentation. Only basically single-purpose servers are worth doing
bigger pages for.

> Block size just defines the granularity of the device's address space in
> the same way the VM base page size defines the Virtual address space.

.. and the point is, if you have granularity that is bigger than 4kB, you
lose binary compatibility on x86, for example. The 4kB thing is encoded in
mmap() semantics.
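
One way to see the page granularity showing through the mmap() interface is to try mapping a file at an aligned and at an unaligned offset. This is only a sketch using Python's standard mmap module: the helper name can_map_at and the scratch file are made up for the illustration, and the rejection of the unaligned offset is what Linux does, where file offsets passed to mmap() must be multiples of the page size.

```python
import mmap
import tempfile


def can_map_at(offset, length=mmap.ALLOCATIONGRANULARITY):
    """Try to map `length` bytes of a scratch file starting at `offset`.

    Returns True if the mapping succeeds, False if the OS rejects it.
    """
    with tempfile.TemporaryFile() as f:
        # Make the file large enough that only alignment can be the problem.
        f.truncate(offset + length + mmap.ALLOCATIONGRANULARITY)
        try:
            m = mmap.mmap(f.fileno(), length,
                          access=mmap.ACCESS_READ, offset=offset)
        except (OSError, ValueError):
            return False
        m.close()
        return True


if __name__ == "__main__":
    gran = mmap.ALLOCATIONGRANULARITY
    print("granularity:", gran)
    print("aligned offset maps:", can_map_at(gran))
    print("half-granule offset maps:", can_map_at(gran // 2))
```

Software built around 4kB-aligned offsets simply assumes the first case always works, which is the assumption a larger granularity would break.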

In other words, if you have sector size >4kB, your hardware is CRAP. It's
unusable sh*t. No ifs, buts or maybe's about it.

Sure, we can work around it. We can work around it by doing things like
read-modify-write cycles with bounce buffers (and where DMA remapping can
be used to avoid the copy). Or we can work around it by saying that if you
mmap files on such a filesystem, your mmap's will have to have 8kB
alignment semantics, and the hardware is only useful for servers.

Or we can just tell people what a total piece of shit the hardware is.

So if you're involved with any such hardware or know people who are, you
might give people strong hints that sector sizes >4kB will not be taken
seriously by a huge number of people. Maybe it's not too late to head the
crap off at the pass.

Btw, this is not a new issue. Sandisk and some other totally clueless SSD
manufacturers tried to convince people that 64kB access sizes were the
RightThing(tm) to do. The reason? Their SSD's were crap, and couldn't do
anything better, so they tried to blame software.

Then Intel came out with their controller, and now the same people who
tried to sell their sh*t-for-brain SSD's are finally admitting that
it was crap hardware.

Do you really want to go through that one more time?

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Implementing NVMHCI...
Date: Sun, 12 Apr 2009 00:55:42 UTC
Message-ID: <fa.85VT1mebwtm7J2DRypPigvQq8QM@ifi.uio.no>

On Sat, 11 Apr 2009, Jeff Garzik wrote:
>
> Or just ignore the extra length, thereby excising the 'read-modify' step...
> Total storage is halved or worse, but you don't take as much of a performance
> hit.

Well, the people who want > 4kB sectors usually want _much_ bigger (ie
32kB sectors), and if you end up doing the "just use the first part"
thing, you're wasting 7/8ths of the space.

Yes, it's doable, and yes, it obviously makes for a simple driver thing,
but no, I don't think people will consider it acceptable to lose that much
of their effective size of the disk.

I suspect people would scream even with a 8kB sector.
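
The arithmetic behind the "just use the first part" numbers can be checked in a few lines. This is a sketch under the assumption that only the first 4kB of each hardware sector is ever used; the function name wasted_fraction is made up for the example.

```python
from fractions import Fraction


def wasted_fraction(sector_bytes, used_bytes=4096):
    """Fraction of the disk lost when only `used_bytes` of each sector is used."""
    if not 0 < used_bytes <= sector_bytes:
        raise ValueError("used_bytes must be in (0, sector_bytes]")
    return 1 - Fraction(used_bytes, sector_bytes)


if __name__ == "__main__":
    for kb in (8, 32):
        print(f"{kb}kB sector: {wasted_fraction(kb * 1024)} of the space unused")
```

An 8kB sector already loses half the disk, and a 32kB sector loses 7/8ths, matching the figures above.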

Treating all writes as read-modify-write cycles on a driver level (and
then opportunistically avoiding the read part when you are lucky and see
bigger contiguous writes) is likely more acceptable. But it _will_ suck
dick from a performance angle, because no regular filesystem will care
enough, so even with nicely behaved big writes, the two end-points will
have a fairly high chance of requiring a rmw cycle.
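
A toy model of that driver-level scheme, to make the end-point problem concrete. Everything here is an assumption for illustration (the BigSectorDevice class, the 32kB sector size, the in-memory backing store); it is not how any real driver is written, it only counts how many reads a write ends up costing.

```python
class BigSectorDevice:
    """In-memory stand-in for a disk that can only do whole-sector IO."""

    def __init__(self, num_sectors, sector_size=32 * 1024):
        self.sector_size = sector_size
        self.data = bytearray(num_sectors * sector_size)
        self.reads = 0
        self.writes = 0

    def read_sector(self, index):
        self.reads += 1
        start = index * self.sector_size
        return bytearray(self.data[start:start + self.sector_size])

    def write_sector(self, index, buf):
        if len(buf) != self.sector_size:
            raise ValueError("device only accepts whole sectors")
        self.writes += 1
        start = index * self.sector_size
        self.data[start:start + self.sector_size] = buf


def write_bytes(dev, offset, payload):
    """Write `payload` at byte `offset`, doing read-modify-write where needed."""
    ss = dev.sector_size
    end = offset + len(payload)
    pos = offset
    while pos < end:
        index = pos // ss
        sector_start = index * ss
        chunk_end = min(end, sector_start + ss)
        piece = payload[pos - offset:chunk_end - offset]
        if pos == sector_start and chunk_end == sector_start + ss:
            # Lucky case: the write covers the whole sector, no read needed.
            buf = bytearray(piece)
        else:
            # Partial sector: read it, patch it, write it back.
            buf = dev.read_sector(index)
            buf[pos - sector_start:chunk_end - sector_start] = piece
        dev.write_sector(index, buf)
        pos = chunk_end


if __name__ == "__main__":
    dev = BigSectorDevice(4)
    write_bytes(dev, 4096, b"y" * (64 * 1024))  # big, but not sector-aligned
    print("reads:", dev.reads, "writes:", dev.writes)
```

Even the 64kB write in the usage example pays for two reads, one at each end, because it does not start or end on a sector boundary.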

Even the journaling ones that might have nice logging write behavior tend
to have a non-logging part that then will behave badly. Rather few
filesystems are _purely_ log-based, and the ones that are tend to have
various limitations. Most commonly read performance just sucks.

We just merged nilfs2, and I _think_ that one is a pure logging filesystem
with just linear writes (within a segment). But I think random read
performance (think: loading executables off the disk) is bad.

And people tend to really dislike hardware that forces a particular
filesystem on them. Guess how big the user base is going to be if you
cannot format the device as NTFS, for example? Hint: if a piece of
hardware only works well with special filesystems, that piece of hardware
won't be a big seller.

Modern technology needs big volume to become cheap and relevant.

And maybe I'm wrong, and NTFS works fine as-is with sectors >4kB. But let
me doubt that.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Large stack usage in fs code (especially for PPC64)
Date: Mon, 17 Nov 2008 21:11:19 UTC
Message-ID: <fa.fvpMtuoO3aPSSLKLS64LrG2iw/I@ifi.uio.no>

On Mon, 17 Nov 2008, Steven Rostedt wrote:
>
>  45)     4992    1280   .block_read_full_page+0x23c/0x430
>  46)     3712    1280   .do_mpage_readpage+0x43c/0x740

Ouch.

> Notice at line 45 and 46 the stack usage of block_read_full_page and
> do_mpage_readpage. They each use 1280 bytes of stack! Looking at the start
> of these two:
>
> int block_read_full_page(struct page *page, get_block_t *get_block)
> {
> 	struct inode *inode = page->mapping->host;
> 	sector_t iblock, lblock;
> 	struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];

Yeah, that's unacceptable.

Well, it's not unacceptable on good CPU's with 4kB blocks (just an 8-entry
array), but as you say:

> On PPC64 I'm told that the page size is 64K, which makes the above equal
> to: 64K / 512 = 128  multiply that by 8 byte words, we have 1024 bytes.

Yeah. Not good. I think 64kB pages are insane. In fact, I think 32kB
pages are insane, and 16kB pages are borderline. I've told people so.

The ppc people run databases, and they don't care about sane people
telling them the big pages suck. It's made worse by the fact that they
also have horribly bad TLB fills on their broken CPU's, and years and
years of telling people that the MMU on ppc's are sh*t has only been
reacted to with "talk to the hand, we know better".

Quite frankly, 64kB pages are INSANE. But yes, in this case they actually
cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE] thing uses
64 bytes, not 1kB.
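
The two stack figures follow directly from the page size. A quick check, assuming (as in the quoted calculation) 512-byte minimum blocks and 8-byte pointers; the function names are made up for this sketch.

```python
MIN_BLOCK_BYTES = 512   # smallest block size, per the quoted "64K / 512"
POINTER_BYTES = 8       # one buffer_head pointer on a 64-bit machine


def max_buf_per_page(page_bytes):
    """Number of minimum-size blocks that fit in one page."""
    return page_bytes // MIN_BLOCK_BYTES


def arr_stack_bytes(page_bytes):
    """Stack bytes taken by an array of one pointer per possible block."""
    return max_buf_per_page(page_bytes) * POINTER_BYTES


if __name__ == "__main__":
    for page_kb in (4, 64):
        page = page_kb * 1024
        print(f"{page_kb}kB page: {max_buf_per_page(page)} entries, "
              f"{arr_stack_bytes(page)} bytes of stack")
```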

I suspect the PPC people need to figure out some way to handle this in
their broken setups (since I don't really expect them to finally admit
that they were full of sh*t with their big pages), but since I think it's
a ppc bug, I'm not at all interested in a fix that penalizes the _good_
case.

So either make it some kind of (clean) conditional dynamic non-stack
allocation, or make it do some outer loop over the whole page that turns
into a compile-time no-op when the page is sufficiently small to be done
in one go.
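
The "outer loop" option could look roughly like the following: walk the blocks of a page in fixed-size batches, so the scratch array never grows with the page size. This is a sketch only; the batch size of 8 and the name block_batches are assumptions, and in the kernel the single-batch case would be resolved at compile time rather than at run time.

```python
BATCH_ENTRIES = 8  # assumed scratch-array size: enough for a 4kB page


def block_batches(blocks_per_page, batch=BATCH_ENTRIES):
    """Yield (start, end) block ranges covering one page, `batch` at a time."""
    for start in range(0, blocks_per_page, batch):
        yield start, min(start + batch, blocks_per_page)


if __name__ == "__main__":
    print("4kB page:", list(block_batches(4096 // 512)))
    print("64kB page:", len(list(block_batches(64 * 1024 // 512))), "batches")
```

With a 4kB page the loop body runs exactly once, which is the "no-op" case; only the 64kB configuration pays for the extra iterations.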

Or perhaps say "if you have 64kB pages, you're a moron, and to counteract
that moronic page size, you cannot do 512-byte granularity IO any more".

Of course, that would likely mean that FAT etc wouldn't work on ppc64, so
I don't think that's a valid model either. But if the 64kB page size is
just a "database server crazy-people config option", then maybe it's
acceptable.

Database people usually don't want to connect their cameras or mp3-players
with their FAT12 filesystems.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Large stack usage in fs code (especially for PPC64)
Date: Mon, 17 Nov 2008 23:29:59 UTC
Message-ID: <fa.olQGjbIHWaWI7IzJmazkKTDMhqE@ifi.uio.no>

On Tue, 18 Nov 2008, Benjamin Herrenschmidt wrote:
>
> Guess who is pushing for larger page sizes nowadays ? Embedded
> people :-) In fact, we have patches submitted on the list to offer the
> option for ... 256K pages on some 44x embedded CPUs :-)
>
> It makes some sort of sense I suppose on very static embedded workloads
> with no swap nor demand paging.

It makes perfect sense for anything that doesn't use any MMU.

The hugepage support seems to cover many of the relevant cases, ie
databases and things like big static mappings (frame buffers etc).

> > It's made worse by the fact that they
> > also have horribly bad TLB fills on their broken CPU's, and years and
> > years of telling people that the MMU on ppc's are sh*t has only been
> > reacted to with "talk to the hand, we know better".
>
> Who are you talking about here precisely ? I don't think either Paul or
> I ever said something nearly around those lines ... Oh well.

Every single time I've complained about it, somebody from IBM has said "..
but but AIX".

This time it was Paul. Sometimes it has been software people who agree,
but point to hardware designers who "know better". If it's not some insane
database person, it's a Fortran program that runs for days.

> But there is also pressure to get larger page sizes from small embedded
> field, where CPUs have even poorer TLB refill (software loaded
> basically) :-)

Yeah, I agree that you _can_ have even worse MMU's. I'm not saying that
PPC64 is absolutely pessimal and cannot be made worse. Software fill is
indeed even worse from a performance angle, despite the fact that it's
really "nice" from a conceptual angle.

Of course, of the sw fill users that remain, many do seem to be ppc.. It's
like the architecture brings out the worst in hardware designers.

> > Quite frankly, 64kB pages are INSANE. But yes, in this case they actually
> > cause bugs. With a sane page-size, that *arr[MAX_BUF_PER_PAGE] thing uses
> > 64 bytes, not 1kB.
>
> Come on, the code is crap to allocate that on the stack anyway :-)

Why? We do actually expect to be able to use stack-space for small
structures. We do it for a lot of things, including stuff like select()
optimistically using arrays allocated on the stack for the common small
case, just because it's, oh, about infinitely faster to do than to use
kmalloc().

Many of the page cache functions also have the added twist that they get
called from low-memory setups (eg write_whole_page()), and so try to
minimize allocations for that reason too.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Large stack usage in fs code (especially for PPC64)
Date: Tue, 18 Nov 2008 02:08:55 UTC
Message-ID: <fa.D6f/o8r4Zh0qoKfjphuxUVCfUd0@ifi.uio.no>

On Tue, 18 Nov 2008, Paul Mackerras wrote:
>
> Also, you didn't respond to my comments about the purely software
> benefits of a larger page size.

I realize that there are benefits. It's just that the downsides tend to
swamp the upsides.

The fact is, Intel (and to a lesser degree, AMD) has shown how hardware
can do good TLB's with essentially gang lookups, giving almost effective
page sizes of 32kB with hardly any of the downsides. Couple that with
low-latency fault handling (not for when you miss in the TLB, but for when
something really isn't in the page tables), and it seems to be seldom the
biggest issue.

(Don't get me wrong - TLB's are not unimportant on x86 either. But on x86,
things are generally much better).

Yes, we could prefill the page tables and do other things, and ultimately
if you don't need to - by virtue of big pages, some loads will always
benefit from just making the page size larger.

But the people who advocate large pages seem to never really face the
downsides. They talk about their single loads, and optimize for that and
nothing else. They don't seem to even acknowledge the fact that a 64kB
page size is simply NOT EVEN REMOTELY ACCEPTABLE for other loads!

That's what gets to me. These absolute -idiots- talk about how they win 5%
on some (important, for them) benchmark by doing large pages, but then
ignore the fact that on other real-world loads they lose by several
HUNDRED percent because of the memory fragmentation costs.

(And btw, if they win more than 5%, it's because the hardware sucks really
badly).

THAT is what irritates me.

What also irritates me is the ".. but AIX" argument. The fact is, the AIX
memory management is very tightly tied to one particular broken MMU model.
Linux supports something like thirty architectures, and while PPC may be
one of the top ones, it is NOT EVEN CLOSE to be really relevant.

So ".. but AIX" simply doesn't matter. The Linux VM has other priorities.

And I _guarantee_ that in general, in the high-volume market (which is
what drives things, like it or not), page sizes will not be growing. In
that market, terabytes of RAM is not the primary case, and small files
that want mmap are one _very_ common case.

To make things worse, the biggest performance market has another vendor
that hasn't been saying ".. but AIX" for the last decade, and that
actually listens to input. And, perhaps not incidentally, outperforms the
highest-performance ppc64 chips mostly by a huge margin - while selling
their chips for a fraction of the price.

I realize that this may be hard to accept for some people. But somebody
who says "... but AIX" should be taking a damn hard look in the mirror,
and ask themselves some really tough questions. Because quite frankly, the
"..but AIX" market isn't the most interesting one.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Large stack usage in fs code (especially for PPC64)
Date: Tue, 18 Nov 2008 16:02:54 UTC
Message-ID: <fa.r3ED/NgSxssWjSgDTikLdwr+C+g@ifi.uio.no>

On Tue, 18 Nov 2008, Nick Piggin wrote:
> >
> > The fact is, Intel (and to a lesser degree, AMD) has shown how hardware
> > can do good TLB's with essentially gang lookups, giving almost effective
> > page sizes of 32kB with hardly any of the downsides. Couple that with
>
> It's much harder to do this with powerpc I think because they would need
> to calculate 8 hashes and touch 8 cachelines to prefill 8 translations,
> wouldn't they?

Oh, absolutely. It's why I despise hashed page tables. It's a broken
concept.

> The per-page processing costs are interesting too, but IMO there is more
> work that should be done to speed up order-0 pages. The patches I had to
> remove the sync instruction for smp_mb() in unlock_page sped up pagecache
> throughput (populate, write(2), reclaim) on my G5 by something really
> crazy like 50% (most of that's in, but I'm still sitting on that fancy
> unlock_page speedup to remove the final smp_mb).
>
> I suspect some of the costs are also in powerpc specific code to insert
> linux ptes into their hash table. I think some of the synchronisation for
> those could possibly be shared with generic code so you don't need the
> extra layer of locks there.

Yeah, the hashed page tables get extra costs from the fact that it can't
share the software page tables with the hardware ones, and the associated
coherency logic. It's even worse at unmap time, I think.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Implementing NVMHCI...
Date: Sun, 12 Apr 2009 15:46:42 UTC
Message-ID: <fa.DgRsaklDwZqJ4+m/gqVRK6dHf04@ifi.uio.no>

On Sun, 12 Apr 2009, Szabolcs Szakacsits wrote:
>
> I did not hear about NTFS using >4kB sectors yet but technically
> it should work.
>
> The atomic building units (sector size, block size, etc) of NTFS are
> entirely parametric. The maximum values could be bigger than the
> currently "configured" maximum limits.

It's probably trivial to make ext3 support 16kB blocksizes (if it doesn't
already).

That's not the problem. The "filesystem layout" part is just a parameter.

The problem is then trying to actually access such a filesystem, in
particular trying to write to it, or trying to mmap() small chunks of it.
The FS layout is the trivial part.

> At present the limits are set in the BIOS Parameter Block in the NTFS
> Boot Sector. This is 2 bytes for the "Bytes Per Sector" and 1 byte for
> "Sectors Per Block". So >4kB sector size should work since 1993.
>
> 64kB+ sector size could be possible by bootstrapping NTFS drivers
> in a different way.

Try it. And I don't mean "try to create that kind of filesystem". Try to
_use_ it. Does Windows actually support using it, or is it just a matter
of "the filesystem layout is _specified_ for up to 64kB block sizes"?

And I really don't know. Maybe Windows does support it. I'm just very
suspicious. I think there's a damn good reason why NTFS supports larger
block sizes in theory, BUT EVERYBODY USES A 4kB BLOCKSIZE DESPITE THAT!

Because it really is a hard problem. It's really pretty nasty to have your
cache blocking be smaller than the actual filesystem blocksize (the other
way is much easier, although it's certainly not pleasant either - Linux
supports it because we _have_ to, but if the sector-size of hardware had
traditionally been 4kB, I'd certainly also argue against adding complexity
just to make it smaller, the same way I argue against making it much
larger).

And don't get me wrong - we could (fairly) trivially make the
PAGE_CACHE_SIZE be bigger - even eventually go so far as to make it a
per-mapping thing, so that you could have some filesystems with that
bigger sector size and some with smaller ones. I think Andrea had patches
that did a fair chunk of it, and that _almost_ worked.

But it ABSOLUTELY SUCKS. If we did a 16kB page-cache-size, it would
absolutely blow chunks. It would be disgustingly horrible. Putting the
kernel source tree on such a filesystem would waste about 75% of all
memory (the median size of a source file is just about 4kB), so your page
cache would be effectively cut in a quarter for a lot of real loads.
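
The 75% figure is just rounding each file up to the cache granule. A small check, assuming every file is exactly the 4kB median (a simplification; a real source tree has a spread of sizes) and using made-up helper names:

```python
from fractions import Fraction


def cached_bytes(file_size, granule):
    """Memory held for one file when the cache works in `granule`-byte units."""
    units = -(-file_size // granule)  # round up to whole granules
    return units * granule


def cache_waste(file_sizes, granule):
    """Fraction of page-cache memory that holds no file data."""
    used = sum(file_sizes)
    held = sum(cached_bytes(size, granule) for size in file_sizes)
    return Fraction(held - used, held)


if __name__ == "__main__":
    files = [4096] * 1000  # illustrative: every file at the 4kB median
    for granule_kb in (4, 16):
        waste = cache_waste(files, granule_kb * 1024)
        print(f"{granule_kb}kB granule: {float(waste):.0%} wasted")
```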

And to fix up _that_, you'd need to now do things like sub-page
allocations, and now your page-cache size isn't even fixed per filesystem,
it would be per-file, and the filesystem (and the drivers!) would have to
handle the cases of getting those 4kB partial pages (and do r-m-w IO after
all if your hardware sector size is >4kB).

IOW, there are simple things we can do - but they would SUCK. And there
are really complicated things we could do - and they would _still_ SUCK,
plus now I pretty much guarantee that your system would also be a lot less
stable.

It really isn't worth it. It's much better for everybody to just be aware
of the incredible level of pure suckage of a general-purpose disk that has
hardware sectors >4kB. Just educate people that it's not good. Avoid the
whole insane suckage early, rather than be disappointed in hardware that
is total and utter CRAP and just causes untold problems.

Now, for specialty uses, things are different. CD-ROM's have had 2kB
sector sizes for a long time, and the reason it was never as big of a
problem isn't that they are still smaller than 4kB - it's that they are
read-only, and use special filesystems. And people _know_ they are
special. Yes, even when you write to them, it's a very special op. You'd
never try to put NTFS on a CD-ROM, and everybody knows it's not a disk
replacement.

In _those_ kinds of situations, a 64kB block isn't much of a problem. We
can do read-only media (where "read-only" doesn't have to be absolute: the
important part is that writing is special), and never have problems.
That's easy. Almost all the problems with block-size go away if you think
reading is 99.9% of the load.

But if you want to see it as a _disk_ (ie replacing SSD's or rotational
media), 4kB blocksize is the maximum sane one for Linux/x86 (or, indeed,
any "Linux/not-just-database-server" - it really isn't so much about x86,
as it is about large cache granularity causing huge memory fragmentation
issues).

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Implementing NVMHCI...
Date: Sun, 12 Apr 2009 17:26:47 UTC
Message-ID: <fa.HnsK5EiEbe52OvoPijd2Pbu2gIU@ifi.uio.no>

On Sun, 12 Apr 2009, Robert Hancock wrote:
>
> What about FAT? It supports cluster sizes up to 32K at least (possibly up to
> 256K as well, although somewhat nonstandard), and that works.. We support that
> in Linux, don't we?

Sure.

The thing is, "cluster size" in an FS is totally different from sector
size.

People are missing the point here. You can trivially implement bigger
cluster sizes by just writing multiple sectors. In fact, even just a 4kB
cluster size is actually writing 8 512-byte hardware sectors on all normal
disks.

So you can support big clusters without having big sectors. A 32kB cluster
size in FAT is absolutely trivial to do: it's really purely an allocation
size. So a fat filesystem allocates disk-space in 32kB chunks, but then
when you actually do IO to it, you can still write things 4kB at a time
(or smaller), because once the allocation has been made, you still treat
the disk as a series of smaller blocks.

IOW, when you allocate a new 32kB cluster, you will have to allocate 8
pages to do IO on it (since you'll have to initialize the diskspace), but
you can still literally treat those pages as _individual_ pages, and you
can write them out in any order, and you can free them (and then look them
up) one at a time.

Notice? The cluster size really only ends up being a disk-space allocation
issue, not an issue for actually caching the end result or for the actual
size of the IO.

The hardware sector size is very different. If you have a 32kB hardware
sector size, that implies that _all_ IO has to be done with that
granularity. Now you can no longer treat the eight pages as individual
pages - you _have_ to write them out and read them in as one entity. If
you dirty one page, you effectively dirty them all. You can not drop and
re-allocate pages one at a time any more.
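
The difference shows up as soon as you count how much has to be written back for a given set of dirty pages. A sketch, assuming 4kB pages and treating the IO granule as the only constraint; writeback_bytes is a hypothetical helper, not a kernel function.

```python
PAGE_BYTES = 4096


def writeback_bytes(dirty_pages, io_granule, page_bytes=PAGE_BYTES):
    """Bytes that must be written to flush the given dirty page indices.

    `io_granule` is the smallest unit the disk accepts and must be a
    multiple of `page_bytes`.
    """
    if io_granule % page_bytes:
        raise ValueError("io_granule must be a multiple of the page size")
    pages_per_unit = io_granule // page_bytes
    # Every IO unit containing at least one dirty page is written in full.
    units = {page // pages_per_unit for page in dirty_pages}
    return len(units) * io_granule


if __name__ == "__main__":
    dirty = {3}  # one dirty page inside a 32kB cluster
    print("4kB sectors, 32kB cluster:", writeback_bytes(dirty, 4096))
    print("32kB hardware sectors:    ", writeback_bytes(dirty, 32 * 1024))
```

With a 32kB cluster on small sectors, one dirty page costs one page of IO; with a 32kB hardware sector the same dirty page drags the other seven along.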

				Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Implementing NVMHCI...
Date: Mon, 13 Apr 2009 15:17:24 UTC
Message-ID: <fa.a10VsDV08lMAtCEqy1LzORM0jAk@ifi.uio.no>

On Mon, 13 Apr 2009, Avi Kivity wrote:
> >
> >  - create a big file,
>
> Just creating a 5GB file in a 64KB filesystem was interesting - Windows
> was throwing out 256KB I/Os even though I was generating 1MB writes (and
> cached too).  Looks like a paranoid IDE driver (qemu exposes a PIIX4).

Heh, ok. So the "big file" really only needed to be big enough to not be
cached, and 5GB was probably overkill. In fact, if there's some way to
blow the cache, you could have made it much smaller. But 5G certainly
works ;)

And yeah, I'm not surprised it limits the size of the IO. Linux will
generally do the same. I forget what our default maximum bio size is, but
I suspect it is in that same kind of range.

There are often problems with bigger IO's (latency being one, actual
controller bugs being another), and even if the hardware has no bugs and
its limits are higher, you usually don't want to have excessively large
DMA mapping tables _and_ the advantage of bigger IO is usually not that
big once you pass the "reasonably sized" limit (which is 64kB+). Plus they
happen seldom enough in practice anyway that it's often not worth
optimizing for.

> > then rewrite just a few bytes in it, and look at the IO pattern of the
> > result. Does it actually do the rewrite IO as one 16kB IO, or does it
> > do sub-blocking?
>
> It generates 4KB writes (I was generating aligned 512 byte overwrites).
> What's more interesting, it was also issuing 32KB reads to fill the
> cache, not 64KB.  Since the number of reads and writes per second is
> almost equal, it's not splitting a 64KB read into two.

Ok, that sounds pretty much _exactly_ like the Linux IO patterns would
likely be.

The 32kB read has likely nothing to do with any filesystem layout issues
(especially as you used a 64kB cluster size), but is simply because

 (a) Windows caches things with a 4kB granularity, so the 512-byte write
     turned into a read-modify-write
 (b) the read was really for just 4kB, but once you start reading you want
     to do read-ahead anyway since it hardly gets any more expensive to
     read a few pages than to read just one.

So once it had to do the read anyway, windows just read 8 pages instead of
one - very reasonable.
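
The reasoning in (a) and (b) can be written down as two small calculations: which cache pages a write only partially covers (and so must be read first), and how large the resulting read becomes with read-ahead. This is a sketch assuming a 4kB cache page and the 8-page read-ahead window inferred from the observed 32kB reads; both function names are made up.

```python
PAGE_BYTES = 4096


def pages_needing_read(offset, length, page_bytes=PAGE_BYTES):
    """Indices of cache pages a write touches without fully overwriting."""
    if length <= 0:
        return []
    end = offset + length
    partial = []
    for index in range(offset // page_bytes, (end - 1) // page_bytes + 1):
        start = index * page_bytes
        covered = min(end, start + page_bytes) - max(offset, start)
        if covered < page_bytes:
            partial.append(index)
    return partial


def readahead_request(page_index, window_pages=8, page_bytes=PAGE_BYTES):
    """(byte offset, byte length) of a read that also pulls in following pages."""
    return page_index * page_bytes, window_pages * page_bytes


if __name__ == "__main__":
    needed = pages_needing_read(offset=512, length=512)  # aligned 512-byte write
    print("pages to read first:", needed)
    print("read issued:", readahead_request(needed[0]))
```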

> >    If the latter, then the 16kB thing is just a filesystem layout
> > issue, not an internal block-size issue, and WNT would likely have
> > exactly the same issues as Linux.
>
> A 1 byte write on an ordinary file generates a RMW, same as a 4KB write on a
> 16KB block.  So long as the filesystem is just a layer behind the pagecache
> (which I think is the case on Windows), I don't see what issues it can have.

Right. It's all very straightforward from a filesystem layout issue. The
problem is all about managing memory.

You absolutely do _not_ want to manage memory in 16kB chunks (or 64kB for
your example!). It's a total disaster. Imagine what would happen to user
application performance if kmalloc() always returned 16kB-aligned chunks
of memory, all sized as integer multiples of 16kB? It would absolutely
_suck_. Sure, it would be fine for your large allocations, but any time
you handle strings, you'd allocate 16kB of memory for any small 5-byte
string. You'd have horrible cache behavior, and you'd run out of memory
much too quickly.

The same is true in the kernel. The single biggest memory user under
almost all normal loads is the disk cache. That _is_ the normal allocator
for any OS kernel. Everything else is almost details (ok, so Linux in
particular does cache metadata very aggressively, so the dcache and inode
cache are seldom "just details", but the page cache is still generally the
most important part).

So having a 16kB or 64kB granularity is a _disaster_. Which is why no sane
system does that. It's only useful if you absolutely _only_ work with
large files - ie you're a database server. For just about any other
workload, that kind of granularity is totally unacceptable.

So doing a read-modify-write on a 1-byte (or 512-byte) write, when the
block size is 4kB is easy - we just have to do it anyway.

Doing a read-modify-write on a 4kB write and a 16kB (or 64kB) blocksize is
also _doable_, and from the IO pattern standpoint it is no different. But
from a memory allocation pattern standpoint it's a disaster - because now
you're always working with chunks that are just 'too big' to be good
building blocks of a reasonable allocator.

If you always allocate 64kB for file caches, and you work with lots of
small files (like a source tree), you will literally waste all your
memory.
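
To put a number on it, count how many small files fit in a fixed amount of cache at each granularity. The 1GiB cache size and the 4kB file size below are illustrative assumptions, not measurements, and files_cacheable is a made-up name.

```python
def files_cacheable(cache_bytes, file_size, granule):
    """How many files of `file_size` fit when each is rounded up to `granule`."""
    per_file = -(-file_size // granule) * granule  # round up to whole granules
    return cache_bytes // per_file


if __name__ == "__main__":
    cache = 1 << 30  # assumed: 1GiB available for the page cache
    for granule_kb in (4, 16, 64):
        count = files_cacheable(cache, 4096, granule_kb * 1024)
        print(f"{granule_kb:>2}kB granule: {count} files of 4kB")
```

Going from 4kB to 64kB granules cuts the number of cached 4kB files by a factor of sixteen for the same RAM.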

And if you have some "dynamic" scheme, you'll have tons and tons of really
nasty cases when you have to grow a 4kB allocation to a 64kB one when the
file grows. Imagine doing "realloc()", but doing it in a _threaded_
environment, where any number of threads may be using the old allocation
at the same time. And that's a kernel - it has to be _the_ most
threaded program on the whole machine, because otherwise the kernel
would be the scaling bottleneck.

And THAT is why 64kB blocks is such a disaster.

> >  - can you tell how many small files it will cache in RAM without doing
> > IO? If it always uses 16kB blocks for caching, it will be able to cache a
> > _lot_ fewer files in the same amount of RAM than with a smaller block
> > size.
>
> I'll do this later, but given the 32KB reads for the test above, I'm guessing
> it will cache pages, not blocks.

Yeah, you don't need to.

I can already guarantee that Windows does caching on a page granularity.

I can also pretty much guarantee that that is also why Windows stops
compressing files once the blocksize is bigger than 4kB: because at that
point, the block compressions would need to handle _multiple_ cache
entities, and that's really painful for all the same reasons that bigger
sectors would be really painful - you'd always need to make sure that you
always have all of those cache entries in memory together, and you could
never treat your cache entries as individual entities.

> > Of course, the _really_ conclusive thing (in a virtualized environment) is
> > to just make the virtual disk only able to do 16kB IO accesses (and with
> > 16kB alignment). IOW, actually emulate a disk with a 16kB hard sector size,
> > and reporting a 16kB sector size to the READ CAPACITY command. If it works
> > then, then clearly WNT has no issues with bigger sectors.
>
> I don't think IDE supports this?  And Windows 2008 doesn't like the LSI
> emulated device we expose.

Yeah, you'd have to have the OS use the SCSI commands for disk discovery,
so at least a SATA interface. With IDE disks, the sector size always has
to be 512 bytes, I think.

		Linus
