From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Thu, 03 Nov 2005 15:52:18 UTC
Message-ID: <fa.hak77sr.gnq1o7@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511030747450.27915@g5.osdl.org>

On Thu, 3 Nov 2005, Arjan van de Ven wrote:

> On Thu, 2005-11-03 at 07:36 -0800, Martin J. Bligh wrote:
> > >> Can we quit coming up with specialist hacks for hotplug, and try to solve
> > >> the generic problem please? hotplug is NOT the only issue here. Fragmentation
> > >> in general is.
> > >>
> > >
> > > Not really it isn't. There have been a few cases (e1000 being the main
> > > one, and is fixed upstream) where fragmentation in general is a problem.
> > > But mostly it is not.
> >
> > Sigh. OK, tell me how you're going to fix kernel stacks > 4K please.
>
> with CONFIG_4KSTACKS :)

2-page allocations are _not_ a problem.

Especially not for fork()/clone(). If you don't even have 2-page
contiguous areas, you are doing something _wrong_, or you're so low on
memory that there's no point in forking any more.

Don't confuse "fragmentation" with "perfectly spread out page
allocations".

Fragmentation means that it gets _exponentially_ more unlikely that you
can allocate big contiguous areas. But contiguous areas of order 1 are
very very likely indeed. It's only the _big_ areas that aren't going to
happen.
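
To get a feel for that curve, here is a toy sketch (not anything from
the kernel, and it uses the pessimistic "randomly scattered free pages"
model that later mails argue the buddy allocator improves on): scatter
some fraction of free pages at random and count how many naturally
aligned order-k blocks come out entirely free.

/* order-odds.c: toy model of how fast higher-order contiguous areas
 * disappear.  Scatters a fraction of free pages at random and counts
 * how many naturally aligned order-k blocks are entirely free.
 * Build: cc -O2 -std=c99 -o order-odds order-odds.c
 * Usage: ./order-odds 1048576 5      (pages, percent free)
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        long npages = argc > 1 ? atol(argv[1]) : 1L << 20; /* 4GB of 4kB pages */
        long pct    = argc > 2 ? atol(argv[2]) : 5;        /* % of memory free */
        long nfree  = npages * pct / 100;
        unsigned char *free_page = calloc(npages, 1);

        if (!free_page)
                return 1;

        /* scatter the free pages at random */
        srand(1);
        for (long placed = 0; placed < nfree; ) {
                long p = rand() % npages;
                if (!free_page[p]) {
                        free_page[p] = 1;
                        placed++;
                }
        }

        /* count fully free, naturally aligned blocks of each order */
        for (int order = 0; order <= 10; order++) {
                long size = 1L << order, blocks = 0;
                for (long b = 0; b + size <= npages; b += size) {
                        long i;
                        for (i = 0; i < size && free_page[b + i]; i++)
                                ;
                        if (i == size)
                                blocks++;
                }
                printf("order %2d (%5ldkB contiguous): %ld free blocks\n",
                       order, size * 4, blocks);
        }
        free(free_page);
        return 0;
}

Even in this deliberately pessimistic model, order-1 blocks number in
the thousands with only a few percent of memory free, while order 3 and
up are already essentially gone; buddy coalescing makes the low orders
far better than this, but nothing rescues order 10.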

This is why fragmentation avoidance has always been totally useless. It is
 - only useful for big areas
 - very hard for big areas

(Corollary: when it's easy and possible, it's not useful).

Don't do it. We've never done it, and we've been fine. Claiming that
fork() is a reason to do fragmentation avoidance is invalid.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Thu, 03 Nov 2005 16:53:59 UTC
Message-ID: <fa.h8k97sm.in41o2@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511030842050.27915@g5.osdl.org>

On Thu, 3 Nov 2005, Mel Gorman wrote:

> On Thu, 3 Nov 2005, Linus Torvalds wrote:
>
> > This is why fragmentation avoidance has always been totally useless. It is
> >  - only useful for big areas
> >  - very hard for big areas
> >
> > (Corollary: when it's easy and possible, it's not useful).
> >
>
> Unless you are a user that wants a large area when it suddenly is useful.

No. It's _not_ suddenly useful.

It might be something you _want_, but that's a totally different issue.

My point is that regardless of what you _want_, defragmentation is
_useless_. It's useless simply because for big areas it is so expensive as
to be impractical.

Put another way: you may _want_ the moon to be made of cheese, but a moon
made out of cheese is _useless_ because it is impractical.

The only way to support big areas is to have special zones for them.

(Then, we may be able to use the special zones for small things too, but
under special rules, like "only used for anonymous mappings" where we
can just always remove them by paging them out. But it would still be a
special area meant for big pages, just temporarily "on loan").

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Thu, 03 Nov 2005 17:23:49 UTC
Message-ID: <fa.h94374s.i7e0g8@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511030918110.27915@g5.osdl.org>

On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> The problem is how these zones get resized. Can we hotplug memory between
> them, with some sparsemem like indirection layer?

I think you should be able to add them. You can remove them. But you can't
resize them.

And I suspect that by default, there should be zero of them. Ie you'd have
to set them up the same way you now set up a hugetlb area.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Thu, 03 Nov 2005 18:09:33 UTC
Message-ID: <fa.hbjp6sq.gnk0o6@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511031006550.27915@g5.osdl.org>

On Thu, 3 Nov 2005, Arjan van de Ven wrote:

> On Thu, 2005-11-03 at 09:51 -0800, Martin J. Bligh wrote:
>
> > For amusement, let me put in some tritely oversimplified math. For the
> > sake of arguement, assume the free watermarks are 8MB or so. Let's assume
> > a clean 64-bit system with no zone issues, etc (ie all one zone). 4K pages.
> > I'm going to assume random distribution of free pages, which is
> > oversimplified, but I'm trying to demonstrate a general premise, not get
> > accurate numbers.
>
> that is VERY over simplified though, given the anti-fragmentation
> property of buddy algorithm

Indeed. I wrote a program at one time that did random allocation and
de-allocation and looked at what the output was, and buddy is very good
at avoiding fragmentation.

These days we have things like per-cpu lists in front of the buddy
allocator that will make fragmentation somewhat higher, but it's still
absolutely true that the page allocation layout is _not_ random.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Thu, 03 Nov 2005 18:45:13 UTC
Message-ID: <fa.h9417ct.i7c089@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511031029090.27915@g5.osdl.org>

On Thu, 3 Nov 2005, Martin J. Bligh wrote:
> >
> > These days we have things like per-cpu lists in front of the buddy
> > allocator that will make fragmentation somewhat higher, but it's still
> > absolutely true that the page allocation layout is _not_ random.
>
> OK, well I'll quit torturing you with incorrect math if you'll concede
> that the situation gets much much worse as memory sizes get larger ;-)

I don't remember the specifics (I did the stats several years ago), but if
I recall correctly, the low-order allocations actually got _better_ with
more memory, assuming you kept a fixed percentage of memory free. So you
actually needed _less_ memory free (in percentages) to get low-order
allocations reliably.

But the higher orders didn't much matter. Basically, it gets exponentially
more difficult to keep higher-order allocations, and it doesn't help one
whit if there's a linear improvement from having more memory available or
something like that.
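
A crude back-of-the-envelope version of both points, treating free
pages as independently scattered (which understates how well buddy
does): with N pages total and a fraction f of them kept free, the
expected number of fully free, naturally aligned order-k blocks is
roughly

	(N / 2^k) * f^(2^k)

which grows linearly in N for any fixed order - so with a constant free
percentage, low orders only get more plentiful as memory grows - while
the f^(2^k) factor collapses double-exponentially in k, so no realistic
amount of extra memory helps an order-10 allocation.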

So it doesn't get _harder_ with lots of memory, but

 - you need to keep the "minimum free" watermarks growing at the same rate
   the memory sizes grow (and on x86, I don't think we do: at least at
   some point, the HIGHMEM zone had a much lower low-water-mark because it
   made the balancing behaviour much nicer. But I didn't check that).

 - with lots of memory, you tend to want to get higher-order pages, and
   that gets harder much much faster than your memory size grows. So
   _effectively_, the kinds of allocations you care about are much harder
   to get.

If you look at get_free_pages(), you will note that we actually
_guarantee_ memory allocations up to order-3:

	...
        if (!(gfp_mask & __GFP_NORETRY)) {
                if ((order <= 3) || (gfp_mask & __GFP_REPEAT))
                        do_retry = 1;
	...

and nobody has ever even noticed. In other words, low-order allocations
really _are_ dependable. It's just that the kinds of orders you want for
memory hotplug or hugetlb (ie not orders <=3, but >=10) are not, and never
will be.

(Btw, my statistics did depend on the fact that the _usage_ was an even
higher exponential, ie you had many many more order-0 allocations than you
had order-1. You can always run out of order-n (n != 0) pages if you just
allocate enough of them. The buddy thing works well statistically, but it
obviously can't do wonders.)

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Thu, 03 Nov 2005 19:36:41 UTC
Message-ID: <fa.h93p7kn.i7k003@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511031133040.27915@g5.osdl.org>

On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> Possibly, I can redo the calculations easily enough (have to go for now,
> but I just sent the other ones). But we don't keep a fixed percentage of
> memory free - we cap it ... perhaps we should though?

I suspect the capping may well be from some old HIGHMEM interaction on x86
(ie "don't keep half a gig free in the normal zone just because we have
16GB in the high-zone"). We used to have serious balancing issues, and I
wouldn't be surprised at all if there are remnants from that - stuff that
simply hasn't been visible, because not a lot of people had many many GB
of memory even on machines that didn't need HIGHMEM.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Thu, 03 Nov 2005 22:57:47 UTC
Message-ID: <fa.h94184o.i7s1g4@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511031454050.27915@g5.osdl.org>

On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> But pages_min is based on the zone size, not the system size. And we
> still cap it. Maybe that's just a mistake?

The per-zone watermarking is actually the "modern" and "working" approach.

We didn't always do it that way. I would not be at all surprised if the
capping was from the global watermarking days.

Of course, I would _also_ not be at all surprised if it wasn't just out of
habit. Most of the things where we try to scale things up by memory size,
we cap for various reasons. Ie we tend to try to scale things like hash
sizes for core data structures by memory size, but then we tend to cap
them to "sane" versions.

So quite frankly, it's entirely possible that the capping is there not
because it _ever_ was a good idea, but just because it's what we almost
always do ;)

Mental inertia is definitely alive and well.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Fri, 04 Nov 2005 01:08:22 UTC
Message-ID: <fa.hbkf6so.gn20o4@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511031704590.27915@g5.osdl.org>

On Fri, 4 Nov 2005, Nick Piggin wrote:
>
> Looks like ppc64 is getting 64K page support, at which point higher
> order allocations (eg. for stacks) basically disappear don't they?

Yes and no, HOWEVER, nobody sane will ever use 64kB pages on a
general-purpose machine.

64kB pages are _only_ usable for databases, nothing else.

Why? Do the math. Try to cache the whole kernel source tree in 4kB pages
vs 64kB pages. See how the memory usage goes up by a factor of _four_.
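
The arithmetic is easy to check against any tree with a little sketch
like the following (nothing official, just rounding every regular file
up to a page multiple and summing; point it at whatever tree you like):

/* pagecache-footprint.c: round every regular file in a tree up to a
 * page multiple and sum the results, for 4kB and for 64kB pages.
 * Build: cc -O2 -o pagecache-footprint pagecache-footprint.c
 * Usage: ./pagecache-footprint <some-source-tree>
 */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/stat.h>

static uint64_t total_4k, total_64k, nfiles;

static uint64_t round_up(uint64_t size, uint64_t page)
{
        return ((size + page - 1) / page) * page;
}

static int visit(const char *path, const struct stat *st, int type,
                 struct FTW *ftw)
{
        (void)path; (void)ftw;
        if (type == FTW_F) {
                total_4k  += round_up((uint64_t)st->st_size, 4096);
                total_64k += round_up((uint64_t)st->st_size, 65536);
                nfiles++;
        }
        return 0;                       /* keep walking */
}

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <dir>\n", argv[0]);
                return 1;
        }
        if (nftw(argv[1], visit, 64, FTW_PHYS) != 0) {
                perror("nftw");
                return 1;
        }
        printf("%llu files\n", (unsigned long long)nfiles);
        printf("cached with 4kB pages:  %llu MB\n",
               (unsigned long long)(total_4k >> 20));
        printf("cached with 64kB pages: %llu MB\n",
               (unsigned long long)(total_64k >> 20));
        if (total_4k)
                printf("ratio: %.2f\n", (double)total_64k / (double)total_4k);
        return 0;
}

Most files in a source tree are far smaller than 64kB, so the 64kB
total is dominated by the per-file rounding, which is where the
multiple comes from.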

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Thu, 03 Nov 2005 18:04:51 UTC
Message-ID: <fa.h94384p.i7e1g5@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511030955110.27915@g5.osdl.org>

On Thu, 3 Nov 2005, Martin J. Bligh wrote:
> > And I suspect that by default, there should be zero of them. Ie you'd have
> > to set them up the same way you now set up a hugetlb area.
>
> So ... if there are 0 by default, and I run for a while and dirty up
> memory, how do I free any pages up to put into them? Not sure how that
> works.

You don't.

Just face it - people who want memory hotplug had better know that
beforehand (and let's be honest - in practice it's only going to work in
virtualized environments or in environments where you can insert the new
bank of memory and copy it over and remove the old one with hw support).

Same as hugetlb.

Nobody sane _cares_. Nobody sane is asking for these things. Only people
with special needs are asking for it, and they know their needs.

You have to realize that the first rule of engineering is to work out the
balances. The undeniable fact is, that 99.99% of all users will never care
one whit, and memory management is complex and fragile. End result: the
0.01% of users will have to do some manual configuration to keep things
simpler for the cases that really matter.

Because the case that really matters is the sane case. The one where we
 - don't change memory (normal)
 - only add memory (easy)
 - only switch out memory with hardware support (ie the _hardware_
   supports parallel memory, and you can switch out a DIMM without
   software ever really even noticing)
 - have system maintainers that do strange things, but _know_ that.

We simply DO NOT CARE about some theoretical "general case", because the
general case is (a) insane and (b) impossible to cater to without
excessive complexity.

Guys, a kernel developer needs to know when to say NO.

And we say NO, HELL NO!! to generic software-only memory hotplug.

If you are running a DB that needs to benchmark well, you damn well KNOW
IT IN ADVANCE, AND YOU TUNE FOR IT.

Nobody takes a random machine and says "ok, we'll now put our most
performance-critical database on this machine, and oh, btw, you can't
reboot it and tune for it beforehand". And if you have such a person, you
need to learn to IGNORE THE CRAZY PEOPLE.

When you hear voices in your head that tell you to shoot the pope, do you
do what they say? Same thing goes for customers and managers. They are the
crazy voices in your head, and you need to set them right, not just
blindly do what they ask for.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Thu, 03 Nov 2005 19:09:01 UTC
Message-ID: <fa.hbk36sm.gne0o2@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511031102590.27915@g5.osdl.org>

On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> Ha. Just because I don't think I made you puke hard enough already with
> foul approximations ... for order 2, I think it's

Your basic fault is in believing that the free watermark would stay
constant.

That's insane.

Would you keep 8MB free on a 64MB system?

Would you keep 8MB free on an 8GB system?

The point being, that if you start with insane assumptions, you'll get
insane answers.

The _correct_ assumption is that you aim to keep some fixed percentage of
memory free. With that assumption and your math, finding higher-order
pages is equally hard regardless of amount of memory.

Now, your math then doesn't allow for the fact that buddy automatically
coalesces for you, so in fact things get _easier_ with more memory, but
hey, that needs more math than I can come up with (I never did it as math,
only as simulations with allocation patterns - "smart people use math,
plodding people just try to simulate an estimate" ;)

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Thu, 03 Nov 2005 23:17:56 UTC
Message-ID: <fa.h9jp84t.ink1g9@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511031459110.27915@g5.osdl.org>

On Thu, 3 Nov 2005, Martin J. Bligh wrote:
>
> Ummm. I was basing it on what we actually do now in the code, unless I
> misread it, which is perfectly possible. Do you want this patch?
>
> diff -purN -X /home/mbligh/.diff.exclude linux-2.6.14/mm/page_alloc.c 2.6.14-no_water_cap/mm/page_alloc.c
> --- linux-2.6.14/mm/page_alloc.c	2005-10-27 18:52:20.000000000 -0700
> +++ 2.6.14-no_water_cap/mm/page_alloc.c	2005-11-03 14:36:06.000000000 -0800
> @@ -2387,8 +2387,6 @@ static void setup_per_zone_pages_min(voi
>  			min_pages = zone->present_pages / 1024;
>  			if (min_pages < SWAP_CLUSTER_MAX)
>  				min_pages = SWAP_CLUSTER_MAX;
> -			if (min_pages > 128)
> -				min_pages = 128;
>  			zone->pages_min = min_pages;
>  		} else {
>  			/* if it's a lowmem zone, reserve a number of pages

Ahh, you're right, there's a totally separate watermark for highmem.

I think I even remember this. I may even be responsible. I know some of
our less successful highmem balancing efforts in the 2.4.x timeframe had
serious trouble when they ran out of highmem, and started pruning lowmem
very very aggressively. Limiting the highmem water marks meant that it
wouldn't do that very often.

I think your patch may in fact be fine, but quite frankly, it needs
testing under real load with highmem.

In general, I don't _think_ we should do anything different for highmem at
all, and we should just in general try to keep a percentage of pages
available. Now, the percentage probably does depend on the zone: we should
be more aggressive about more "limited" zones, ie the old 16MB DMA zone
should probably try to keep a higher percentage of free pages around than
the normal zone, and that in turn should probably keep a higher percentage
of pages around than the highmem zones.

And that's not because of fragmentation so much, but simply because the
lower zones tend to have more "desperate" users. Running out of the normal
zone is thus a "worse" situation than running out of highmem. And we
effectively never want to allocate from the 16MB DMA zone at all, unless
it is our only choice.

We actually do try to do that with that "lowmem_reserve[]" logic, which
reserves more pages in the lower zones the bigger the upper zones are (ie
if we _only_ have memory in the low 16MB, then we don't reserve any of it,
but if we have _tons_ of memory in the high zones, then we reserve more
memory for the low zones and thus make the watermarks higher for them).

So the watermarking interacts with that lowmem_reserve logic, and I think
that on HIGHMEM, you'd be screwed _twice_: first because the "pages_min"
is limited, and second because HIGHMEM has no lowmem_reserve.
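
In sketch form, the check being described works out to something like
this (simplified - the real zone_watermark_ok() in mm/page_alloc.c also
discounts free memory sitting in blocks smaller than the requested
order - and all the numbers below are made up):

/* watermark-sketch.c: simplified model of the pages_min/lowmem_reserve
 * interaction described above.  Not the kernel's zone_watermark_ok();
 * the per-zone numbers are invented for illustration.
 * Build: cc -O2 -o watermark-sketch watermark-sketch.c
 */
#include <stdio.h>

#define MAX_NR_ZONES 3                  /* DMA, NORMAL, HIGHMEM */

struct zone_model {
        const char *name;
        unsigned long pages_min;
        unsigned long lowmem_reserve[MAX_NR_ZONES];
};

/* classzone_idx is the highest zone the allocation could use (HIGHMEM
 * for an ordinary user page, NORMAL for most kernel allocations, DMA
 * for GFP_DMA); z is the zone we are considering taking pages from.
 * The bigger the zones above z are, the bigger z's lowmem_reserve[]
 * entry for that class is, so "rich" allocations have a much higher
 * bar to clear in the small, precious zones.
 */
static int zone_has_enough_free(const struct zone_model *z, int classzone_idx,
                                unsigned long free_pages)
{
        return free_pages > z->pages_min + z->lowmem_reserve[classzone_idx];
}

int main(void)
{
        struct zone_model zones[MAX_NR_ZONES] = {
                { "DMA",      32, { 0, 1000, 4000 } },
                { "NORMAL",  128, { 0,    0, 8000 } },
                { "HIGHMEM", 128, { 0,    0,    0 } },
        };
        unsigned long free_pages[MAX_NR_ZONES] = { 512, 4096, 256 };
        int c, z;

        /* an allocation of class c may only fall back to zones z <= c */
        for (c = MAX_NR_ZONES - 1; c >= 0; c--)
                for (z = c; z >= 0; z--)
                        printf("%7s-class allocation from %7s: %s\n",
                               zones[c].name, zones[z].name,
                               zone_has_enough_free(&zones[z], c, free_pages[z])
                               ? "ok" : "refused");
        return 0;
}

With these made-up numbers a highmem-class allocation is pushed back
onto HIGHMEM, a normal-class allocation stays out of the small DMA
zone, and only a GFP_DMA allocation may drain it: the "more desperate
users" ordering described above.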

Does that make sense?

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Fri, 04 Nov 2005 05:15:47 UTC
Message-ID: <fa.ha3j6sp.j7u0o5@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511032105110.27915@g5.osdl.org>

On Thu, 3 Nov 2005, Andy Nelson wrote:
>
> I have done high performance computing in astrophysics for nearly two
> decades now. It gives me a perspective that kernel developers usually
> don't have, but sometimes need. For my part, I promise that I specifically
> do *not* have the perspective of a kernel developer. I don't even speak C.

Hey, cool. You're a physicist, and you'd like to get closer to 100%
efficiency out of your computer.

And that's really nice, because maybe we can strike a deal.

Because I also have a problem with my computer, and a physicist might just
help _me_ get closer to 100% efficiency out of _my_ computer.

Let me explain.

I've got a laptop that takes about 45W, maybe 60W under load.

And it has a battery that weighs about 350 grams.

Now, I know that if I were to get 100% energy efficiency out of that
battery, a trivial physics calculation tells me that e=mc^2, and that my
battery _should_ have a hell of a lot of energy in it. In fact, according
to my simplistic calculations, it turns out that my laptop _should_ have a
battery life measured in millions of years.

It turns out that isn't really the case in practice, but I'm hoping you
can help me out. I obviously don't need it to be really 100% efficient,
but on the other hand, I'd also like the battery to be slightly lighter,
so if you could just make sure that it's at least _slightly_ closer to the
theoretical values I should be getting out of it, maybe I wouldn't need to
find one of those nasty electrical outlets every few hours.

Do we have a deal? After all, you only need to improve my battery
efficiency by a really _tiny_ amount, and I'll never need to recharge it
again. And I'll improve your problem.

Or are you maybe willing to make a few compromises in the name of being
realistic, and living with something less than the theoretical peak
performance of what you're doing?

I'm willing to compromise by using only the chemical energy of the
processes involved, and not even at a hundred percent efficiency at that.
Maybe you'd be willing to compromise by using a few kernel boot-time
command line options for your not-very-common load.

Ok?

			Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Fri, 04 Nov 2005 15:32:56 UTC
Message-ID: <fa.h8kf7cq.in2082@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511040725090.27915@g5.osdl.org>

On Fri, 4 Nov 2005, Ingo Molnar wrote:
>
> just to make sure i didnt get it wrong, wouldnt we get most of the
> benefits Andy is seeking by having a: boot-time option which sets aside
> a "hugetlb zone", with an additional sysctl to grow (or shrink) the pool
> - with the growing happening on a best-effort basis, without guarantees?

Boot-time option to set the hugetlb zone, yes.

Grow-or-shrink, probably not. Not in practice after bootup on any machine
that is less than idle.

The zones have to be pretty big to make any sense. You don't just grow
them or shrink them - they'd be on the order of tens of megabytes to
gigabytes. In other words, sized big enough that you will _not_ be able to
create them on demand, except perhaps right after boot.

Growing these things later simply isn't reasonable. I can pretty much
guarantee that any kernel I maintain will never have dynamic kernel
pointers: when some memory has been allocated with kmalloc() (or
equivalent routines - pretty much _any_ kernel allocation), it stays put.
Which means that if there is a _single_ kernel alloc in such a zone, it
won't ever be usable for hugetlb stuff.

And I don't want excessive complexity. We can have things like "turn off
kernel allocations from this zone", and then wait a day or two, and hope
that there aren't long-term allocs. It might even work occasionally. But
the fact is, a number of kernel allocations _are_ long-term (superblocks,
root dentries, "struct thread_struct" for long-running user daemons), and
it's simply not going to work well in practice unless you have set aside
the "no kernel alloc" zone pretty early on.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Fri, 04 Nov 2005 16:10:47 UTC
Message-ID: <fa.hak96sm.gn40o6@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511040801450.27915@g5.osdl.org>

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> AFAIK, mips chips have a software TLB refill that takes 1000
> cycles more or less. I could be wrong.

You're not far off.

Time it on a real machine some day. On a modern x86, you will fill a TLB
entry in anything from 1-8 cycles if it's in L1, and add a couple of dozen
cycles for L2.

In fact, the L1 TLB miss can often be hidden by the OoO engine.

Now, do the math. Your "3-4 times slowdown" with several hundred cycle TLB
miss just GOES AWAY with real hardware. Yes, you'll still see slowdowns,
but they won't be nearly as noticeable. And having a simpler and more
efficient kernel will actually make _up_ for them in many cases. For
example, you can do all your calculations on idle workstations that don't
mysteriously just crash because somebody was also doing something else on
them.

Face it. MIPS sucks. It was clean, but it didn't perform very well. SGI
doesn't sell those things very actively these days, do they?

So don't blame Linux. Don't make sweeping statements based on hardware
situations that just aren't relevant any more.

If you ever see a machine again that has a huge TLB slowdown, let the
machine vendor know, and then SWITCH VENDORS. Linux will work on sane
machines too.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Fri, 04 Nov 2005 17:23:47 UTC
Message-ID: <fa.h94d6sl.i700o7@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511040900160.27915@g5.osdl.org>

Andy,
 let's just take Ingo's numbers, measured on modern hardware.

On Fri, 4 Nov 2005, Ingo Molnar wrote:
>
>   32768 randomly accessed pages, 13 cycles avg, 73.751831% TLB misses.
>   32768 linearly accessed pages,  0 cycles avg,  0.259399% TLB misses.
>  131072 randomly accessed pages, 75 cycles avg, 94.162750% TLB misses.

NOTE! It's hard to decide what OoO does - Ingo's load doesn't allow for a
whole lot of overlapping stuff, so Ingo's numbers are fairly close to
worst case, but on the other hand, that serialization can probably be
honestly said to hide a couple of cycles, so let's say that _real_ worst
case is five more cycles than the ones quoted. It doesn't change the math,
and quite frankly, that way we're really anal about it.
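
The flavour of Ingo's numbers can be reproduced with a small userspace
sketch along these lines (not Ingo's actual program, and it reports
nanoseconds per access rather than cycles): touch one byte per 4kB
page, first in linear order, then in shuffled order.

/* tlb-walk.c: rough sketch of a TLB-miss microbenchmark.  Touches one
 * byte per 4kB page in linear and then in shuffled order and reports
 * the average cost per access.
 * Build: cc -O2 -std=c99 -o tlb-walk tlb-walk.c
 *        (older glibc may need -lrt for clock_gettime)
 * Usage: ./tlb-walk 131072   (pages; 131072 pages = 512MB of mappings)
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGE_SIZE 4096

static double now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e9 + ts.tv_nsec;
}

/* one read per page, in the order given; returns ns per access */
static double walk(volatile char *buf, const long *order,
                   long npages, long rounds)
{
        long sink = 0;
        double t0 = now_ns();

        for (long r = 0; r < rounds; r++)
                for (long i = 0; i < npages; i++)
                        sink += buf[order[i] * PAGE_SIZE];
        (void)sink;
        return (now_ns() - t0) / ((double)npages * rounds);
}

int main(int argc, char **argv)
{
        long npages = argc > 1 ? atol(argv[1]) : 32768;  /* 32768 pages = 128MB */
        long rounds = 16;
        char *buf   = malloc((size_t)npages * PAGE_SIZE);
        long *order = malloc((size_t)npages * sizeof(*order));

        if (!buf || !order)
                return 1;

        for (long i = 0; i < npages; i++) {
                buf[i * PAGE_SIZE] = 1;         /* fault every page in first */
                order[i] = i;
        }
        printf("%ld linearly accessed pages: %.1f ns/access\n",
               npages, walk(buf, order, npages, rounds));

        /* Fisher-Yates shuffle for the random walk */
        srand(1);
        for (long i = npages - 1; i > 0; i--) {
                long j = rand() % (i + 1);
                long tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        printf("%ld randomly accessed pages:  %.1f ns/access\n",
               npages, walk(buf, order, npages, rounds));

        free(buf);
        free(order);
        return 0;
}

On anything with a decent hardware TLB fill, the shuffled walk over a
big working set ends up dominated by the data cache miss itself, which
is the point being made here about modern x86.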

In real life, under real load (especially with FP operations going on at
the same time), OoO might make the cost a few cycles _less_, not more, but
hey, let's not count that.

So in the absolute worst case, with 95% TLB miss ratio, the TLB cost was
an average 75 cycles. Let's be _really_ nice to MIPS, and say that this is
only five times faster than the MIPS case you tested (in reality, it's
probably over ten).

That's the WORST CASE. Realize that MIPS doesn't get better: it will
_always_ have a latency of several hundred cycles when the TLB misses. It
has absolutely zero OoO activity to hide a TLB miss (a software miss
totally serializes the pipeline), and it has zero "code caching", so even
with a perfect I$ (which it certainly didn't have), the cost of actually
running the TLB miss handler doesn't go down.

In contrast, the x86 hw miss gets better when there is some more locality
and the page tables are cached. Much better. Ingo's worst-case example is
not realistic (no locality at all in half a gigabyte or totally random
examples), yet even for that worst case, modern CPU's beat the MIPS by
that big factor.

So let's say that the 75% miss ratio was more likely (that's still a high
TLB miss ratio). So in the _likely_ case, a P4 did the miss in an average
of 13 cycles. The MIPS miss cost won't have come down at all - in fact, it
possibly went _up_, since the miss handler now might be getting more I$
misses since it's not called all the time (I don't know if the MIPS miss
handler used non-caching loads or not - the positive D$ effects on the
page tables from slightly denser TLB behaviour might help some to offset
this factor).

That's a likely factor of fifty speedup. But let's be pessimistic again,
and say that the P4 number beat the MIPS TLB miss by "only" a factor of
twenty. That means that your worst case totally untuned argument (30 times
slowdown from TLB misses) on a P4 is only a 120% slowdown. Not a factor of
three.

But clearly you could tune your code too, and did. To the point that you
had a factor of 3.4 on MIPS. Now, let's say that the tuning didn't work as
well on P4 (remember, we're still being pessimistic), and you'd only get
half of that.

End result? If the slowdown was entirely due to TLB miss costs, your
likely slowdown is in the 20-40% range. Pessimistically.

Now, switching to x86 may have _other_ issues. Maybe other things might
get slower. [ Mmwwhahahahhahaaa. I crack myself up. x86 slower than MIPS?
I'm such a joker. ]

Anyway. The point stands. This is something where hardware really rules,
and software can't do a lot of sane stuff. 20-40% may sound like a big
number, and it is, but this is all stuff where Moore's Law says that
we shouldn't spend software effort.

We'll likely be better off with a smaller, simpler kernel in the future. I
hope. And the numbers above back me up. Software complexity for something
like this just kills.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Fri, 04 Nov 2005 16:02:10 UTC
Message-ID: <fa.hb457kt.g7o00f@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511040738540.27915@g5.osdl.org>

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> Big pages don't work now, and zones do not help because the
> load is too unpredictable. Sysadmins *always* turn them
> off, for very good reasons. They cripple the machine.

They do. Guess why? It's complicated.

SGI used to do things like that in Irix. They had the flakiest Unix kernel
out there. There's a reason people use Linux, and it's not all price. A
lot of it is development speed, and that in turn comes very much from not
making insane decisions that aren't maintainable in the long run.

Trust me. We can make things _better_, by having zones that you can't do
kernel allocations from. But you'll never get everything you want, without
turning the kernel into an unmaintainable mess.

> I think it was Martin Bligh who wrote that his customer gets
> 25% speedups with big pages. That is peanuts compared to my
> factor 3.4 (search comp.arch for John Mashey's and my name
> at the University of Edinburgh in Jan/Feb 2003 for a conversation
> that includes detailed data about this), but proves the point that
> it is far more than just me that wants big pages.

I didn't find your post on google, but I assume that a large portion of
your 3.4 factor was hardware.

The fact is, there are tons of architectures that suck at TLB handling.
They have small TLB's, and they fill slowly.

x86 is actually one of the best ones out there. It has a hw TLB fill, and
the page tables are cached, with real-life TLB fill times in the single
cycles (a P4 can almost be seen as effectively having 32kB pages because
it fills its TLB entries so fast when they are next to each other in the
page tables). Even when you have lots of other cache pressure, the page
tables are at least in the L2 (or L3) caches, and you effectively have a
really huge TLB.

In contrast, a lot of other machines will use non-temporal loads to load
the TLB entries, forcing them to _always_ go to memory, and use software
fills, causing the whole machine to stall. To make matters worse, many of
them use hashed page tables, so that even if they could (or do) cache
them, the caching just doesn't work very well.

(I used to be a big proponent of software fill - it's very flexible. It's
also very slow. I've changed my mind after doing timing on x86)

Basically, any machine that gets more than twice the slowdown is _broken_.
If the memory access is cached, then so should the page table entry be
(page tables are _much_ smaller than the pages themselves), so even if you
take a TLB fault on every single access, you shouldn't see a 3.4 factor.

So without finding your post, my guess is that you were on a broken
machine. MIPS or alpha do really well when things generally fit in the
TLB, but break down completely when they don't due to their sw fill (alpha
could have fixed it, it had _architecturally_ sane page tables that it
could have walked in hw, but never got the chance. May it rest in peace).

If I remember correctly, ia64 used to suck horribly because Linux had to
use a mode where the hw page table walker didn't work well (maybe it was
just an itanium 1 bug), but should be better now. But x86 probably kicks
its butt.

The reason x86 does pretty well is that it's got one of the few sane page
table setups out there (oh, page table trees are old-fashioned and simple,
but they are dense and cache well), and the microarchitecture is largely
optimized for TLB faults. Not having ASI's and having to work with an OS
that invalidated the TLB about every couple of thousand memory accesses
does that to you - it puts the pressure to do things right.

So I suspect Martin's 25% is a lot more accurate on modern hardware (which
means x86, possibly Power. Nothing else much matters).

> If your and other kernel developer's (<<0.01% of the universe) kernel
> builds slow down by 5% and my and other people's simulations (perhaps
> 0.01% of the universe) speed up by a factor up to 3 or 4, who wins?

First off, you won't speed up by a factor of three or four. Not even
_close_.

Second, it's not about performance. It's about maintainability. It's about
having a system that we can use and understand 10 years down the line. And
the VM is a big part of that.

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Fri, 04 Nov 2005 16:42:09 UTC
Message-ID: <fa.hb4574p.g780g3@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511040814530.27915@g5.osdl.org>

On Fri, 4 Nov 2005, Martin J. Bligh wrote:
>
> > So I suspect Martin's 25% is a lot more accurate on modern hardware (which
> > means x86, possibly Power. Nothing else much matters).
>
> It was PPC64, if that helps.

Ok. I bet x86 is even better, but Power (and possibly itanium) is the only
other architecture that comes close.

I don't like the horrible POWER hash-tables, but for static workloads they
should perform almost as well as a sane page table (I say "almost",
because I bet that the high-performance x86 vendors have spent a lot more
time on tlb latency than even IBM has). My dislike for them comes from the
fact that they are really only optimized for static behaviour.

(And HPC is almost always static wrt TLB stuff - big, long-running
processes).

> Well, I think it depends on the workload a lot. However fast your TLB is,
> if we move from "every cacheline read requires is a TLB miss" to "every
> cacheline read is a TLB hit" that can be a huge performance knee however
> fast your TLB is. Depends heavily on the locality of reference and size
> of data set of the application, I suspect.

I'm sure there are really pathological examples, but the thing is, they
won't be on reasonable code.

Some modern CPU's have TLB's that can span the whole cache. In other
words, if your data is in _any_ level of caches, the TLB will be big
enough to find it.

Yes, that's not universally true, and when it's true, the TLB is two-level
and you can have loads where it will usually miss in the first level, but
we're now talking about loads where the _data_ will then always miss in
the first level cache too. So the TLB miss cost will always be _lower_
than the data miss cost.

Right now, you should buy Opteron if you want that kind of large TLB. I
_think_ Intel still has "small" TLB's (the cpuid information only goes up
to 128 entries, I think), but at least Intel has a really good fill. And I
would bet (but have no first-hand information) that next generation
processors will only get bigger TLB's. These things don't tend to shrink.

(Itanium also has a two-level TLB, but it's absolutely pitiful in size).

NOTE! It is absolutely true that for a few years we had regular caches
growing much faster than TLB's. So there are unquestionably unbalanced
machines out there. But it seems that CPU designers started noticing, and
every indication is that TLB's are catching up.

In other words, adding lots of kernel complexity is the wrong thing in the
long run. This is not a long-term problem, and even in the short term you
can fix it by just selecting the right hardware.

In todays world, AMD leads with big TLB's (1024-entry L2 TLB), but Intel
has slightly faster fill and the AMD TLB filtering is sadly turned off on
SMP right now, so you might not always get the full effect of the large
TLB (but in HPC you probably won't have task switching blowing your TLB
away very often).

PPC64 has the huge hashed page tables that work well enough for HPC.

Itanium has a pitifully small TLB, and an in-order CPU, so it will take a
noticeably bigger hit on TLB's than x86 will. But even Itanium will be a
_lot_ better than MIPS was.

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Fri, 04 Nov 2005 17:50:35 UTC
Message-ID: <fa.h7477sp.g7a1o7@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511040943130.27921@g5.osdl.org>

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> Ok. In other posts you have skeptically accepted Power as a
> `modern' architecture.

Yes, sceptically.

I'd really like to hear what your numbers are on a modern x86. Any x86-64
is interesting, and I can't imagine that with a LANL address you can't
find any.

I do believe that Power is within one order of magnitude of a modern x86
when it comes to TLB fill performance. That's much better than many
others, but whether "almost as good" is within the error range, or whether
it's "only five times worse", I don't know.

The thing is, there's a reason x86 machines kick ass. They are cheap, and
they really _do_ outperform pretty much everything else out there.

Power 5 has a wonderful memory architecture, and those L3 caches kick ass.
They probably don't help you as much as they help databases, though, and
it's entirely possible that a small cheap Opteron with its integrated
memory controller will outperform them on your load if you really don't
have a lot of locality.

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Fri, 04 Nov 2005 21:23:52 UTC
Message-ID: <fa.h93t6sk.i700o6@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511041310130.28804@g5.osdl.org>

On Fri, 4 Nov 2005, Andy Nelson wrote:
>
> I am not enough of a kernel level person or sysadmin to know for certain,
> but I have still big worries about consecutive jobs that run on the
> same resources, but want extremely different page behavior. If what
> you are suggesting can cause all previous history on those resources
> to be forgotten and then reset to whatever it is that I want when I
> start my run, then yes.

That would largely be the behaviour.

When you use the hugetlb zone for big pages, nothing else would be there.

And when you don't use it, we'd be able to use those zones for at least
page cache and user private pages - both of which are fairly easy to evict
if required.

So the downside is that when the admin requests such a zone at boot-time,
that will mean that the kernel will never be able to use it for its
"normal" allocations. Not for inodes, not for directory name caching, not
for page tables and not for process and file descriptors. Only a very
certain class of allocations that we know how to evict easily could use
them.

Now, for many loads, that's fine. User virtual pages and page cache pages
are often a big part (in fact, often a huge majority) of the memory use.

Not always, though. Some loads really want lots of metadata caching, and
if you make too much of memory be in the largepage zones, performance
would suffer badly on such loads.

But the point is that this is easy(ish) to do, and would likely work
wonderfully well for almost all loads. It does put a small onus on the
maintainer of the machine to give a hint, but it's possible that normal
loads won't mind the limitation and that we could even have a few hugepage
zones by default (limit things to 25% of total memory or something). In
fact, we would almost have to do so initially just to get better test
coverage.

Now, if you want _most_ of memory to be available for hugepages, you
really will always require a special boot option, and a friendly machine
maintainer. Limiting things like inodes, process descriptors etc to a
smallish percentage of memory would not be acceptable in general.

Something like 25% "big page zones" probably is fine even in normal use,
and 50% might be an acceptable compromise even for machines that see a
mixture of pretty regular use and some specialized use. But a machine that
only cares about certain loads might boot up with 75% set aside in the
large-page zones, and that almost certainly would _not_ be a good setup
for random other usage.

IOW, we want a hint up-front about how important huge pages would be.
Because it's practically impossible to free pages later, because they
_will_ become fragmented with stuff that we definitely do not want to
teach the VM how to handle.

But the hint can be pretty friendly. Especially if it's an option to just
load a lot of memory into the boxes, and none of the loads are expected to
want to really be excessively close to memory limits (ie you could just
buy an extra 16GB to allow for "slop").

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Fri, 04 Nov 2005 21:39:50 UTC
Message-ID: <fa.hb437cn.g7a085@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511041333560.28804@g5.osdl.org>

On Fri, 4 Nov 2005, Linus Torvalds wrote:
>
> But the hint can be pretty friendly. Especially if it's an option to just
> load a lot of memory into the boxes, and none of the loads are expected to
> want to really be excessively close to memory limits (ie you could just
> buy an extra 16GB to allow for "slop").

One of the issues _will_ be how to allocate things on NUMA. Right now
"hugetlb" only allows us to say "this much memory for hugetlb", and it
probably needs to be per-zone.

Some uses might want to allocate all of the local memory on one node to
huge-page usage (and specialized programs would then also like to run
pinned to that node), others might want to spread it out. So the machine
maintainer would need to decide that.

The good news is that you can boot up with almost all zones being "big
page" zones, and you could turn them into "normal zones" dynamically. It's
only going the other way that is hard.

So from a maintenance standpoint if you manage lots of machines, you could
have them all uniformly boot up with lots of memory set aside for large
pages, and then use user-space tools to individually turn the zones into
regular allocation zones.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Sun, 06 Nov 2005 15:56:41 UTC
Message-ID: <fa.fuq33bd.ggi6ah@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511060746170.3316@g5.osdl.org>

On Sat, 5 Nov 2005, Paul Jackson wrote:
>
> It seems to me this is making it harder than it should be.  You're
> trying to create a zone that is 100% cleanable, whereas the HPC folks
> only desire 99.8% cleanable.

Well, 99.8% is pretty borderline.

> Unlike the hot(un)plug folks, the HPC folks don't mind a few pages of
> Linus's unmoveable kmalloc memory in their way.  They rather expect
> that some modest percentage of each node will have some 'kernel stuff'
> on it that refuses to move.

The thing is, if 99.8% of memory is cleanable, the 0.2% is still enough to
make pretty much _every_ hugepage in the system pinned down.
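
To put a rough number on that (a back-of-the-envelope estimate,
assuming the unmovable 0.2% is scattered at random): a 2MB hugepage is
512 4kB pages, so the chance of a given hugepage escaping untouched
would be about

	0.998^512 ~= 0.36

ie only about a third of the hugepages would survive even under that
generous assumption.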

Besides, right now, it's not 99.8% anyway. Not even close. It's more like
60%, and then horribly horribly ugly hacks that try to do something about
the remaining 40% and usually fail (the hacks might get it closer to 99%,
but they are fragile, expensive, and ugly as hell).

It used to be that HIGHMEM pages were always cleanable on x86, but even
that isn't true any more, since now at least pipe buffers can be there
too.

I agree that HPC people are usually a bit less up-tight about things than
database people tend to be, and many of them won't care at all, but if you
want hugetlb, you'll need big areas.

Side note: the exact size of hugetlb is obviously architecture-specific,
and the size matters a lot. On x86, for example, hugetlb pages are either
2MB or 4MB in size (and apparently 2GB may be coming). I assume that's
where you got the 99.8% from (4kB out of 2M).

Other platforms have more flexibility, but sometimes want bigger areas
still.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Sun, 06 Nov 2005 16:16:55 UTC
Message-ID: <fa.fu9n3jd.g0u62h@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511060756010.3316@g5.osdl.org>

On Sun, 6 Nov 2005, Kyle Moffett wrote:
>
> Hmm, this brings up something that I haven't seen discussed on this list
> (maybe a long time ago, but perhaps it should be brought up again?).  What are
> the pros/cons to having a non-physically-linear kernel virtual memory space?

Well, we _do_ actually have that, and we use it quite a bit. Both
vmalloc() and HIGHMEM work that way.

The biggest problem with vmalloc() is that the virtual space is often as
constrained as the physical one (ie on old x86-32, the virtual address
space is the bigger problem - you may have 36 bits of physical memory, but
the kernel has only 30 bits of virtual). But it's quite commonly used for
stuff that wants big linear areas.

The HIGHMEM approach works fine, but the overhead of essentially doing a
software TLB is quite high, and if we never ever have to do it again on
any architecture, I suspect everybody will be pretty happy.

> Would it be theoretically possible to allow some kind of dynamic kernel page
> swapping, such that the _same_ kernel-virtual pointer goes to a different
> physical memory page?  That would definitely satisfy the memory hotplug
> people, but I don't know what the tradeoffs would be for normal boxen.

Any virtualization will try to do that, but they _all_ prefer huge pages
if they care at all about performance.

If you thought the database people wanted big pages, the kernel is worse.
Unlike databases or HPC, the kernel actually wants to use the physical
page address quite often, notably for IO (but also for just mapping them
into some other virtual address - the users).

And no standard hardware allows you to do that in hw, so we'd end up doing
a software page table walk for it (or, more likely, we'd have to make
"struct page" bigger).

You could do it today, although at a pretty high cost. And you'd have to
forget about supporting any hardware that really wants contiguous memory
for DMA (sound cards etc). It just isn't worth it.

Real memory hotplug needs hardware support anyway (if only buffering the
memory at least electrically). At which point you're much better off
supporting some remapping in the buffering too, I'm convinced. There's no
_need_ to do these things in software.

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19
Date: Sun, 06 Nov 2005 17:02:38 UTC
Message-ID: <fa.fu9p3bf.g006av@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0511060848010.3316@g5.osdl.org>

On Sun, 6 Nov 2005, Linus Torvalds wrote:
>
> And no standard hardware allows you to do that in hw, so we'd end up doing
> a software page table walk for it (or, more likely, we'd have to make
> "struct page" bigger).
>
> You could do it today, although at a pretty high cost. And you'd have to
> forget about supporting any hardware that really wants contiguous memory
> for DMA (sound cards etc). It just isn't worth it.

Btw, in case it wasn't clear: the cost of these kinds of things in the
kernel is usually not so much the actual "lookup" (whether with hw assist
or with another field in the "struct page").

The biggest cost of almost everything in the kernel these days is the
extra code-footprint of yet another abstraction, and the locking cost.

For example, the real cost of the highmem mapping seems to be almost _all_
in the locking. It also makes some code-paths more complex, so it's yet
another I$ fill for the kernel.

So a remappable kernel tends to be different from a remappable user
application. A user application _only_ ever sees the actual cost of the
TLB walk (which hardware can do quite efficiently and is very amenable
indeed to a lot of optimization like OoO and speculative prefetching), but
on the kernel level, the remapping itself is the cheapest part.

(Yes, user apps can see some of the costs indirectly: they can see the
synchronization costs if they do lots of mmap/munmap's, especially if they
are threaded. But they really have to work at it to see it, and I doubt
the TLB synchronization issues tend to be even on the radar for any user
space performance analysis).

You could probably do a remappable kernel (modulo the problems with
specific devices that want bigger physically contiguous areas than one
page) reasonably cheaply on UP. It gets more complex on SMP and with full
device access.

In fact, I suspect you can ask any Xen developer what their performance
problems and worries are. I suspect they much prefer UP clients over SMP
ones, and _much_ prefer paravirtualization over running unmodified
kernels.

So remappable kernels are certainly doable, they just have more
fundamental problems than remappable user space _ever_ has. Both from a
performance and from a complexity angle.

			Linus
