The page cache (Linus Torvalds)

Date: 	Mon, 23 Apr 2001 10:17:07 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [patch] swap-speedup-2.4.3-A2
Newsgroups: fa.linux.kernel

On Mon, 23 Apr 2001, Ingo Molnar wrote:
> 
> you are right - i thought about that issue too and assumed it works like
> the pagecache (which first reads the page without hashing it, then tries
> to add the result to the pagecache and throws away the page if anyone else
> finished it already), but that was incorrect.

The above is NOT how the page cache works. Or if some part of the page
cache works that way, then it is a BUG. You must NEVER allow multiple
outstanding reads from the same location - that implies that you're doing
something wrong, and the system is doing too much IO.

The way _all_ parts of the page cache should work is:

Create new page:
 - look up page. If found, return it
 - allocate new page.
 - look up page again, in case somebody else added it while we allocated
   it.
 - add the page atomically with the lookup if the lookup failed, otherwise
   just free the page without doing anything.
 - return the looked-up / allocated page.
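As a rough user-space illustration of the "create new page" steps above (not kernel code), the sketch below uses a dictionary guarded by a lock. The names Page, PageCache and find_or_create are hypothetical and exist only for this example. The point it tries to show is that the second lookup and the insertion happen under one lock, so two racing callers always end up with the same page.

```python
import threading


class Page:
    """Hypothetical stand-in for a page cache page."""

    def __init__(self, index):
        self.index = index
        self.uptodate = False
        self.data = None


class PageCache:
    """Hypothetical cache: a dict of index -> Page guarded by one lock."""

    def __init__(self):
        self._pages = {}
        self._cache_lock = threading.Lock()

    def lookup(self, index):
        with self._cache_lock:
            return self._pages.get(index)

    def find_or_create(self, index):
        # Look up the page. If found, return it.
        page = self.lookup(index)
        if page is not None:
            return page

        # Allocate a new page without holding the cache lock.
        new_page = Page(index)

        # Look up again and add atomically with that lookup.
        with self._cache_lock:
            page = self._pages.get(index)
            if page is None:
                self._pages[index] = new_page
                page = new_page
            # Otherwise somebody else won the race and new_page is
            # simply dropped without ever being used.

        # Return the looked-up / allocated page.
        return page
```

Allocating outside the lock keeps the critical section short; the price is an occasional wasted allocation when two callers race.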

return up-to-date page:
 - call the above to get a page cache page.
 - if uptodate, return
 - lock_page()
 - if now uptodate (ie somebody else filled it and held the lock), unlock
   and return.
 - start the IO
 - wait on IO by waiting on the page (modulo other work that you could do
   in the background).
 - if the page is still not up-to-date after we tried to read it, we got
   an IO error. Return error.
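A minimal sketch of the "return up-to-date page" steps above, again in user space and under simplifying assumptions: the IO is synchronous and done while holding the page lock (in the kernel the read is started and the caller then waits on the page), and get_page, read_page and read_from_disk are hypothetical names. A failed read leaves the page in the cache but not up to date, so a later caller simply retries.

```python
import threading


class Page:
    """Hypothetical page; lock stands in for lock_page()/unlock_page()."""

    def __init__(self):
        self.uptodate = False
        self.data = None
        self.lock = threading.Lock()


_pages = {}
_pages_lock = threading.Lock()


def get_page(index):
    # Simplified find-or-create: lookup and insert under one lock.
    with _pages_lock:
        return _pages.setdefault(index, Page())


def read_page(index, read_from_disk):
    page = get_page(index)
    if page.uptodate:
        return page

    with page.lock:
        # Somebody else may have filled the page while holding the lock.
        if page.uptodate:
            return page
        try:
            # Start the IO and wait for it (synchronous in this sketch).
            page.data = read_from_disk(index)
            page.uptodate = True
        except OSError:
            pass  # leave the page not up to date

    # Still not up to date after we tried to read it: IO error.
    if not page.uptodate:
        raise OSError(f"I/O error reading page {index}")
    return page
```

Because the up-to-date check is repeated under the page lock, concurrent readers of the same index issue only one read between them.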

The above is how it is always meant to work. The above works for both new
allocations and for old. It works even if an earlier read had failed (due
to wrong permissions for example - think about NFS page caches where some
people may be unable to actually fill a page, so that you need to re-try
on failure). The above is how the regular read/write paths work (modulo
bugs). And it's also how the swap cache should work.

		Linus

Date: 	Mon, 14 May 2001 21:43:18 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Getting FS access events
Newsgroups: fa.linux.kernel

On Mon, 14 May 2001, Linus Torvalds wrote:
> 
> Or rather, there is a fundamental reason why we must NEVER EVER look at
> the buffer cache: it is not coherent with the page cache. 
> 
> And keeping it coherent would be _extremely_ expensive. How do we
> know? Because we used to do that. Remember the small mindcraft
> benchmark? Yup. Double copies all over the place, double lookups, double
> everything.

I think I should explain a bit more.

The current page cache is completely non-coherent (with _anything_: it's
not coherent with other files using a page cache because they have a
different index, and it's not coherent with the buffer cache because that
one isn't even in the same name space).

Now, being non-coherent is always the best option if you can get away with
it. It means that there is no way you can ever have _any_ performance
overhead from maintaining the coherency, and it's 100% reproducible -
there's no question where the page cache gets its data from (the raw disk
device, no ifs, buts or whys).

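The page cache is in that sense a virtual cache: it is indexed by file and
offset, not by location on disk.

The disadvantage of virtual caches is that they can have aliases. That's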
fine, but you have to be aware of it, and you have to live with the
consequences. That's what we do now. There are no aliases that are worth
worrying about, so virtual caches work perfectly. This is not always true
(virtual CPU data caches tend to be a really bad idea, while virtual CPU
instruction caches tend to work fairly well, although potentially with a
lower utilization ratio than a physical one due to aliasing).

The other alternative is to have a physical cache. That's fine too: you
avoid aliases, but you have to look up the physical address when looking
up the cache. THIS is the real cost of the buffer cache - not the hashing
and the locking, but the fact that you have to know the physical
location. 

A mixed-mode cache is not a good idea. It gets the worst from both worlds,
without getting _any_ of the good qualities. You have the horrible
coherency issue, together with the overhead of having to find out the
physical address. 

You could choose to do "partial coherency", ie be coherent only one way,
for example. That would make the coherency overhead much less, but would
also make the caches basically act very unpredictably - you might have
somebody write through the page cache yet on a read actually not _see_
what he wrote, because it got written out to disk and was shadowed by
cached data in the buffer cache that didn't get updated.

So "partial coherency" might avoid some of the performance issues, but
it's unacceptable to me simply because it's pretty non-repeatable and has
some strange behaviour that can be considered "obviously wrong" (see above
for one example).
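The stale read described above can be reproduced with a toy model. This is only a sketch: both caches are plain dictionaries keyed by block number for brevity (an assumption; the real page cache is not keyed that way), and page_write and buffer_read are hypothetical helpers. The write goes through the page cache and reaches the disk, but nothing updates the copy already sitting in the buffer cache.

```python
disk = {100: b"old"}
page_cache = {}
buffer_cache = {}


def buffer_read(block):
    """Read through the buffer cache, filling it from disk on a miss."""
    if block not in buffer_cache:
        buffer_cache[block] = disk[block]
    return buffer_cache[block]


def page_write(block, data):
    """Write through the page cache and write the result out to disk."""
    page_cache[block] = data
    disk[block] = data
    # One-way coherency: nothing here updates or invalidates
    # buffer_cache, so an existing copy there goes stale.


buffer_read(100)              # buffer cache now holds the old contents
page_write(100, b"new")       # the write reaches the page cache and disk
stale = buffer_read(100)      # served from the buffer cache
```

After this sequence the disk holds the new data while buffer_read still returns the old contents, and whether a reader sees it depends on what happened to be cached earlier, which is the non-repeatable part.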

Which leaves us with the fact that the page cache is best done the way it
is, and anybody who has coherency concerns might really think about those
concerns another way.

I'm really serious about doing "resume from disk". If you want a fast
boot, I will bet you a dollar that you cannot do it faster than by loading
a contiguous image of several megabytes sequentially into memory. There is
NO overhead, you're pretty much guaranteed platter speeds, and there are
no issues about trying to order accesses etc. There are also no issues
about messing up any run-time data structures.

Give it some thought.

		Linus
