Large pages (Linus Torvalds)

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203171720330.14135-100000@home.transmeta.com>
Date: Mon, 18 Mar 2002 01:36:22 GMT
Message-ID: <fa.l3t1n1v.1k162pt@ifi.uio.no>

On Sun, 17 Mar 2002, Davide Libenzi wrote:
>
> What's the reason that would make it more convenient for us, upon receiving a
> request to map a NNN MB file, to map it using 4KB pages instead of 4MB ones?

Ehh.. Let me count the ways:
 - reliable allocation of 4MB of contiguous data
 - graceful fallback when you need to start paging
 - sane coherency with somebody who mapped the same file/segment in a much
   smaller chunk

Guys, 4MB pages are always going to be a special case. There's no sane
way to make them automatic, for the simple reason that they are USELESS
for "normal" work, and they have tons of problems that are quite
fundamental and just aren't going away and cannot be worked around.

The only sane way to use 4MB segments is:

 - the application does a special system call (or special flag to mmap)
   saying that it wants a big page and doesn't care about coherency with
   anybody else that didn't set the flag (and realize that that probably
   includes things like read/write)

 - the machine has enough memory that the user can be allowed
   to _lock_ the area down, so that you don't have to worry about
   swapping out that thing in 4M pieces. (This of course implies that
   per-user memory counters have to work too, or we have to limit it by
   default with a rlimit or something to zero).

In short, very much a special case.

(There are two reasons you don't want to handle paging on 4M chunks: (a)
they may be wonderful for IO throughput, but they are horrible for latency
for other people and (b) you now have basically just a few bits of usage
information for 4M worth of memory, as opposed to a finer granularity view
of which parts are actually _used_).

Once you can count on having memory sizes in the hundreds of Gigs, and
disk throughput speeds in the hundreds of megs a second, and there are
enough of these machines to _matter_ (and reliably 64-bit address spaces
so that virtual fragmentation doesn't matter), we might make 4MB the
regular mapping entity.

That's probably at least a decade away.

		Linus
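
[ Illustration, not from the thread: a back-of-the-envelope sketch of the
  two paging arguments above. The function names and the 40 MB/s disk
  figure are assumptions picked for the example, not numbers from the
  message. It shows how many 4KB pages (and so how many separate pieces
  of usage information) one 4MB page replaces, and how long a single
  page-in would occupy the disk at the assumed throughput. ]

```python
# Rough model only. The throughput value used below is an assumed example.
SMALL_PAGE = 4 * 1024          # 4 KB base page
LARGE_PAGE = 4 * 1024 * 1024   # 4 MB large page


def small_pages_per_large(small=SMALL_PAGE, large=LARGE_PAGE):
    """Number of base pages (each with its own usage bits) that one
    large page replaces."""
    return large // small


def transfer_ms(size_bytes, throughput_mb_per_s):
    """Milliseconds needed to move size_bytes at a sustained throughput,
    where 1 MB is taken as 2**20 bytes."""
    return size_bytes / (throughput_mb_per_s * 1024 * 1024) * 1000.0


if __name__ == "__main__":
    print("base pages per large page:", small_pages_per_large())
    print("ms to page in 4 MB at 40 MB/s:", transfer_ms(LARGE_PAGE, 40))
    print("ms to page in 4 KB at 40 MB/s:", transfer_ms(SMALL_PAGE, 40))
```

The transfer time ignores seek time, so it understates the cost of the
small transfer, but it still shows how long one large transfer keeps the
disk busy while other requests wait.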

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: large page patch (fwd) (fwd)
Original-Message-ID: <Pine.LNX.4.33.0208021219160.2229-100000@penguin.transmeta.com>
Date: Fri, 2 Aug 2002 19:35:08 GMT
Message-ID: <fa.o9q3cnv.g4k5bd@ifi.uio.no>

[ linux-kernel cc'd, simply because I don't want to write the same thing
  over and over again ]

[ Executive summary: the current large-page-patch is nothing but a magic
  interface to the TLB. Don't view it as anything else, or you'll just be
  confused by all the smoke and mirrors. ]

On Fri, 2 Aug 2002, Gerrit Huizenga wrote:
> > Because _none_ of the large-page codepaths are shared with _any_ of the
> > normal cases.
>
> Isn't that currently an implementation detail?

Yes and no.

We may well expand the FS layer to bigger pages, but "bigger" is almost
certainly not going to include things like 256MB pages - if for no other
reason than the fact that memory fragmentation really means that the limit
on page sizes in practice is somewhere around 128kB for any reasonable
usage patterns even with gigabytes of RAM.

And _maybe_ we might get to the single-digit megabytes. I doubt it, simply
because even with a good buddy allocator and a memory manager that
actively frees pages to get large contiguous chunks of RAM, it's basically
impossible to have something that can reliably give you chunks that big
without making normal performance go totally down the toilet.

(Yeah, once you have terabytes of memory, that worry probably ends up
largely going away. I don't think that is going to be a common enough
platform for Linux to care about in the next ten years, though).

So there are implementation issues, yes. In particular, there _is_ a push
for larger pages in the FS and generic MM layers too, but the issues there
are very different and have basically nothing in common with the TLB and
page table mapping issues of the current push.

What this VM/VFS push means is that we may actually have a _different_
"large page" support on that level, where the most likely implementation
is that the "struct address_space" will at some point have a new member
that specifies the "page allocation order" for that address space. This
will allow us to do per-file allocations, so that some files (or some
filesystems) might want to do all IO in 64kB chunks, and they'd just make
the address_space specify a page allocation order that matches that.

This is in fact one of the reasons I explicitly _want_ to keep the
interfaces separate - because there are two totally different issues at
play, and I suspect that we'll end up implementing _both_ of them, but
that they will _still_ have no commonalities.

The current largepage patch is really nothing but an interface to the TLB.
Please view it as that - a direct TLB interface that has zero impact on
the VFS or VM layers, and that is meant _purely_ as a way to expose hw
capabilities to the few applications that really really want them.

The important thing to take away from this is that _even_ if we could
change the FS and VM layers to know about a per-address_space variable-
sized PAGE_CACHE_SIZE (which I think is the long-term goal), that doesn't
impact the fact that we _also_ want to have the TLB interface.

Maybe the largepage patch could be improved upon by just renaming it, and
making clear that it's a "TLB_hugepage" thing. That's what a CPU designer
thinks of when you say "largepage" to him. Some of the confusion is
probably because a VM/FS person in an OS group does _not_ necessarily
think the same way, but thinks about doing big-granularity IO.

			Linus
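
[ Illustration, not from the thread: the "page allocation order" mentioned
  above is the usual power-of-two exponent, so a chunk of 2**order base
  pages. The helper name allocation_order() and the 4KB base page are
  assumptions for this sketch. It computes the smallest order whose block
  covers a given IO chunk size. ]

```python
def allocation_order(chunk_bytes, base_page=4096):
    """Smallest order such that base_page * 2**order >= chunk_bytes.

    Hypothetical helper: models the idea of a per-address-space
    allocation order, not any real kernel interface.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    order = 0
    while (base_page << order) < chunk_bytes:
        order += 1
    return order


if __name__ == "__main__":
    for label, size in [("64kB", 64 * 1024),
                        ("128kB", 128 * 1024),
                        ("256MB", 256 * 1024 * 1024)]:
        print(label, "-> order", allocation_order(size))
```

With a 4KB base page, the 64kB chunks mentioned above correspond to
order 4, while a 256MB page would need order 16, which is far beyond the
roughly 128kB (order 5) practical limit the message describes.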

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: large page patch (fwd) (fwd)
Original-Message-ID: <Pine.LNX.4.33.0208022030170.2083-100000@penguin.transmeta.com>
Date: Sat, 3 Aug 2002 03:32:48 GMT
Message-ID: <fa.oa9temv.hkm7be@ifi.uio.no>

On Fri, 2 Aug 2002, David Mosberger wrote:
>
> The Rice people avoided some of the fragmentation problems by
> pro-actively allocating a max-order physical page, even when only a
> (small) virtual page was being mapped.

This probably works ok if
 - the superpages are only slightly larger than the small page
 - superpages are a nice optimization.

>				  And since superpages quickly become
> counter-productive in tight-memory situations anyhow, this seems like
> a very reasonable approach.

Ehh.. The only people who are _really_ asking for the superpages want
almost nothing _but_ superpages. They are willing to use 80% of all memory
for just superpages.

Yes, it's Oracle etc, and the whole point for these users is to avoid
having any OS memory allocation for these areas.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: large page patch (fwd) (fwd)
Original-Message-ID: <Pine.LNX.4.44.0208022125040.2694-100000@home.transmeta.com>
Date: Sat, 3 Aug 2002 04:26:35 GMT
Message-ID: <fa.m6egeqv.15g6bgr@ifi.uio.no>

On Fri, 2 Aug 2002, David Mosberger wrote:
>
> My terminology is perhaps a bit too subtle: I use "superpage"
> exclusively for the case where multiple pages get coalesced into a
> larger page.  The "large page" ("huge page") case that you were
> talking about is different, since pages never get demoted or promoted.

Ahh, ok.

> I wasn't disagreeing with your case for separate large page syscalls.
> Those syscalls certainly simplify implementation and, as you point
> out, it well may be the case that a transparent superpage scheme never
> will be able to replace the former.

Somebody already had patches for the transparent superpage thing for
alpha, which supports it. I remember seeing numbers implying that helped
noticeably.

But yes, that definitely doesn't work for humongous pages (or whatever we
should call the multi-megabyte-special-case-thing ;).

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: large page patch (fwd) (fwd)
Original-Message-ID: <Pine.LNX.4.44.0208031027330.3981-100000@home.transmeta.com>
Date: Sat, 3 Aug 2002 17:48:27 GMT
Message-ID: <fa.m7uieiv.150kbot@ifi.uio.no>

On Fri, 2 Aug 2002, David S. Miller wrote:
>
> Now here's the thing.  To me, we should be adding these superpage
> syscalls to things like the implementation of malloc() :-) If you
> allocate enough anonymous pages together, you should get a superpage
> in the TLB if that is easy to do.

For architectures that have these "small" superpages, we can just do it
transparently. That's what the alpha patches did.

The problem space is roughly the same as just page coloring.

> At that point it's like "why the system call".  If it would rather be
> more of a large-page reservation system than an "optimization hint"
> then these syscalls would sit better with me.  Currently I think they
> are superfluous.  To me the hint to use large-pages is a given :-)

Yup.

David, you did page coloring once.

I bet your patches worked reasonably well to color into 4 or 8 colors.

How well do you think something like your old patches would work if

 - you _require_ 1024 colors in order to get the TLB speedup on some
   hypothetical machine (the same hypothetical machine that might
   hypothetically run on 95% of all hardware ;)

 - the machine is under heavy load, and heavy load is exactly when you
   want this optimization to trigger.

Can you explain this difficulty to people?

> Stated another way, if these syscalls said "gimme large pages for this
> area and lock them into memory", this would be fine.  If the syscalls
> say "use large pages if you can", that's crap.  And in fact we could
> use mmap() attribute flags if we really thought that stating this was
> necessary.

I agree 100%.

I think we can at some point do the small cases completely transparently,
with no need for a new system call, and not even any new hint flags. We'll
just silently do 4/8-page superpages and be done with it. Programs don't
need to know about it to take advantage of better TLB usage.

			Linus
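
[ Illustration, not from the thread: a deliberately crude model of why 4
  or 8 "colors" are easy and 1024 are not. It assumes every base page is
  independently in use with the same probability, which real allocators
  do not obey, so treat the numbers as intuition only. The function name
  is made up for this sketch. ]

```python
def chance_block_is_free(order, occupancy):
    """Probability that an aligned run of 2**order pages is entirely free,
    ASSUMING each page is in use independently with probability
    `occupancy`. This is a toy model, not a description of the buddy
    allocator."""
    if not 0.0 <= occupancy <= 1.0:
        raise ValueError("occupancy must be between 0 and 1")
    pages = 1 << order
    return (1.0 - occupancy) ** pages


if __name__ == "__main__":
    # 10% of memory in use: a lightly loaded machine.
    for order in (2, 3, 10):
        print(1 << order, "pages:", chance_block_is_free(order, 0.10))
```

Even at 10% occupancy the chance of finding a free 1024-page run by luck
is vanishingly small in this model, and it only gets worse under the
heavy load where the optimization matters most.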

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: large page patch (fwd) (fwd)
Original-Message-ID: <Pine.LNX.4.44.0208031240270.9758-100000@home.transmeta.com>
Date: Sat, 3 Aug 2002 19:58:12 GMT
Message-ID: <fa.maeqeav.10g490j@ifi.uio.no>

On Sat, 3 Aug 2002, David Mosberger wrote:
>
> Your point about wanting databases to have access to giant pages even
> under memory pressure is a good one.  I had not considered that
> before.  However, what we really are talking about then is a security
> or resource policy as to who gets to allocate from a reserved and
> pinned pool of giant physical pages.

Absolutely. We can't allow just anybody to allocate giant pages, since
they are a scarce resource (set up at boot time in both Ingo's and Intel's
patches - with the potential to move things around later with additional
interfaces).

>			  You don't need separate system
> calls for that: with a transparent superpage framework and a
> privileged & reserved giant-page pool, it's trivial to set up things
> such that your favorite data base will always be able to get the giant
> pages (and hence the giant TLB mappings) it wants.  The only thing you
> lose in the transparent case is control over _which_ pages need to use
> the pinned giant pages.  I can certainly imagine cases where this
> would be an issue, but I kind of doubt it would be an issue for
> databases.

That's _probably_ true. There aren't that many allocations that ask for
megabytes of consecutive memory that wouldn't want to do it. However,
there might certainly be non-critical maintenance programs (with the same
privileges as the database program proper) that _do_ do large allocations,
and that we don't want to give large pages to.

Guessing is always bad, especially since the application certainly does
know what it wants.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: large page patch (fwd) (fwd)
Original-Message-ID: <Pine.LNX.4.44.0208041131380.10314-100000@home.transmeta.com>
Date: Sun, 4 Aug 2002 18:52:58 GMT
Message-ID: <fa.l3t1nhv.1n1229q@ifi.uio.no>

On Sun, 4 Aug 2002, Hubertus Franke wrote:
>
> As for the page coloring!
> Can we tweak the buddy allocator to give us this additional functionality?

I would really prefer to avoid this, and get "95% coloring" by just doing
read-ahead with higher-order allocations instead of the current "loop
allocation of one block".

I bet that you will get _practically_ perfect coloring with just two small
changes:

 - do_anonymous_page() looks to see if the page tables are empty around
   the faulting address (and check vma ranges too, of course), and
   optimistically does a non-blocking order-X allocation.

   If the order-X allocation fails, we're likely low on memory (this is
   _especially_ true since the very fact that we do lots of order-X
   allocations will probably actually help keep fragmentation down
   normally), and we just allocate one page (with a regular GFP_USER this
   time).

   Map in all pages.

 - do the same for page_cache_readahead() (this, btw, is where radix trees
   will kick some serious ass - we'd have had a hard time doing the "is
   this range of order-X pages populated" efficiently with the old hashes).

I bet just those fairly small changes will give you effective coloring,
_and_ they are also what you want for doing small superpages.

And no, I do not want separate coloring support in the allocator. I think
coloring without superpage support is stupid and worthless (and
complicates the code for no good reason).

		Linus
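
[ Illustration, not from the thread: a user-space toy model of the
  do_anonymous_page() idea above. It is not kernel code. All names here
  (pages_to_map, try_alloc_block, the set of populated pages) are
  hypothetical stand-ins: the callable plays the role of a non-blocking
  order-X allocation that is allowed to fail. ]

```python
def pages_to_map(fault_page, vma_start, vma_end, populated, order,
                 try_alloc_block):
    """Return the virtual page numbers to map for one anonymous fault.

    fault_page      -- virtual page number that faulted
    vma_start/end   -- half-open page range [vma_start, vma_end)
    populated       -- set of virtual page numbers that already have a PTE
    order           -- log2 of the block size to attempt ("order-X")
    try_alloc_block -- callable(order) -> bool; stand-in for a
                       non-blocking higher-order allocation
    """
    size = 1 << order
    start = fault_page & ~(size - 1)        # align down to block boundary
    block = range(start, start + size)

    fits_vma = vma_start <= start and start + size <= vma_end
    all_empty = all(page not in populated for page in block)

    if order > 0 and fits_vma and all_empty and try_alloc_block(order):
        return list(block)                  # map the whole block at once
    return [fault_page]                     # fall back to a single page


if __name__ == "__main__":
    always = lambda order: True
    never = lambda order: False
    print(pages_to_map(13, 0, 64, set(), 3, always))
    print(pages_to_map(13, 0, 64, {9}, 3, always))
    print(pages_to_map(13, 0, 64, set(), 3, never))
```

In this model any reason for skipping the block (a neighbouring page
already mapped, the block crossing the mapping's edge, or the allocation
failing) degrades to the ordinary one-page path.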

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: large page patch (fwd) (fwd)
Original-Message-ID: <Pine.LNX.4.44.0208041225180.10314-100000@home.transmeta.com>
Date: Sun, 4 Aug 2002 19:43:34 GMT
Message-ID: <fa.l2t3n9v.1m1421u@ifi.uio.no>

On Sun, 4 Aug 2002, Andrew Morton wrote:
>
> Could we establish the eight pte's but still arrange for pages 1-7
> to trap, so the kernel can zero them out at the latest possible time?

You could do that by marking the pages as being there, but PROT_NONE.

On the other hand, cutting down the number of initial pagefaults (by _not_
doing what you suggest) might be a bigger speedup for process startup than
the slowdown from occasionally doing unnecessary work.

I suspect that there is some non-zero order-X (probably 2 or 3), where you
just win more than you lose. Even for small programs.

			Linus
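
[ Illustration, not from the thread: a small counting model of the
  trade-off above, assuming every order-X block allocation succeeds and
  every page in a block is zeroed and mapped up front. startup_cost() is
  a made-up name. It reports how many page faults a program takes and how
  many pages were zeroed without ever being touched. ]

```python
def startup_cost(touched_pages, order):
    """Return (faults, wasted_pages) for a program that touches the given
    page numbers, when each fault maps an aligned block of 2**order pages.

    Assumes every block allocation succeeds (a simplification).
    """
    touched = set(touched_pages)
    size = 1 << order
    blocks = {page // size for page in touched}   # one fault per block
    faults = len(blocks)
    wasted = faults * size - len(touched)         # zeroed but never used
    return faults, wasted


if __name__ == "__main__":
    dense = range(10)          # touches pages 0..9
    sparse = [0, 100, 200]     # touches three far-apart pages
    for order in (0, 2, 3):
        print("order", order,
              "dense:", startup_cost(dense, order),
              "sparse:", startup_cost(sparse, order))
```

Dense access patterns trade a few wasted pages for far fewer faults,
while sparse patterns get no fault savings and pay only the waste, which
is why the break-even order is likely to be small.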

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: large page patch (fwd) (fwd)
Original-Message-ID: <Pine.LNX.4.44.0208031237060.9758-100000@home.transmeta.com>
Date: Sat, 3 Aug 2002 19:52:35 GMT
Message-ID: <fa.m9eoe2v.11g68ok@ifi.uio.no>

On Sat, 3 Aug 2002, Hubertus Franke wrote:
>
> But I'd like to point out that superpages are there to reduce the number of
> TLB misses by providing larger coverage. Simply providing page coloring
> will not get you there.

Superpages can from a memory allocation angle be seen as a very strict
form of page coloring - the problems are fairly closely related, I think
(superpages are just a lot stricter, in that it's not enough to get "any
page of color X", you have to get just the _right_ page).

Doing superpages will automatically do coloring (while the reverse is
obviously not true). And the way David did coloring a long time ago (if
I remember his implementation correctly) was the same way you'd do
superpages: just do higher order allocations.

			Linus
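
[ Illustration, not from the thread: a toy check of the claim that
  superpages imply coloring but not the reverse. Page "color" is modelled
  here as page number modulo the number of colors, which is an assumption
  of this sketch, and both helper names are made up. ]

```python
def is_colored(virt_start, phys_pages, num_colors):
    """True if each physical page has the same color as the virtual page
    it backs, with color modelled as page_number % num_colors."""
    return all((virt_start + i) % num_colors == phys % num_colors
               for i, phys in enumerate(phys_pages))


def is_superpage(phys_pages, order):
    """True if phys_pages is one aligned, contiguous run of 2**order
    pages, i.e. what a single higher-order allocation would return."""
    size = 1 << order
    if len(phys_pages) != size:
        return False
    first = phys_pages[0]
    return first % size == 0 and all(
        phys == first + i for i, phys in enumerate(phys_pages))


if __name__ == "__main__":
    contiguous = list(range(40, 48))            # one order-3 block
    scattered = [8, 17, 2, 27, 4, 13, 6, 7]     # right colors, wrong pages
    for name, pages in [("contiguous", contiguous), ("scattered", scattered)]:
        print(name, "colored:", is_colored(16, pages, 8),
              "superpage:", is_superpage(pages, 3))
```

The contiguous block mapped at an aligned virtual address passes both
checks. The scattered pages have matching colors but are not the one
specific run a superpage needs.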
