From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist
Date: Wed, 24 Jun 2009 16:52:41 UTC
Message-ID: <fa.+xBLwy7EFSsCUXvJwAf94GMenb8@ifi.uio.no>

On Wed, 24 Jun 2009, Andrew Morton wrote:
>
> Well yes.  Using GFP_NOFAIL on a higher-order allocation is bad.

Yes, but your definition of "higher order" is incorrect.

At the very least, we should change the "order > 0" to "order > 1".

As I already mentioned, SLAB uses order-1 allocations to avoid excessive
fragmentation for small kmalloc/slab objects that would otherwise waste
tons of memory.

Think network packets at 1500 bytes each. You can allocate two per page,
or five per two pages. That's a 25% difference in memory usage!
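
Spelled out, with a 4kB page size assumed:

	4096 / 1500 = 2 packets per order-0 page   (2.0 per page)
	8192 / 1500 = 5 packets per order-1 block  (2.5 per page)

and 2.5 versus 2.0 packets per page is the 25%.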

And getting an order-1 allocation is simply not that much harder than an
order-0 one. It starts getting hard once you get to order-3 or more.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist
Date: Wed, 24 Jun 2009 19:46:43 UTC
Message-ID: <fa.QszDUniWlW1ba/42eTGeY745zo8@ifi.uio.no>

On Wed, 24 Jun 2009, Andrew Morton wrote:

> On Wed, 24 Jun 2009 12:16:20 -0700 (PDT)
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > Lookie here. This is 2.6.0:mm/page_alloc.c:
> >
> >         do_retry = 0;
> >         if (!(gfp_mask & __GFP_NORETRY)) {
> >                 if ((order <= 3) || (gfp_mask & __GFP_REPEAT))
> >                         do_retry = 1;
> >                 if (gfp_mask & __GFP_NOFAIL)
> >                         do_retry = 1;
> >         }
> >         if (do_retry) {
> >                 blk_congestion_wait(WRITE, HZ/50);
> >                 goto rebalance;
> >         }
>
> rebalance:
> 	if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
> 		/* go through the zonelist yet again, ignoring mins */
> 		for (i = 0; zones[i] != NULL; i++) {
> 			struct zone *z = zones[i];
>
> 			page = buffered_rmqueue(z, order, cold);
> 			if (page)
> 				goto got_pg;
> 		}
> 		goto nopage;
> 	}

Your point?

That's the recursive allocation or oom case. Not the normal case at all.

The _normal_ case is to do the whole "try_to_free_pages()" case and try
and try again. Forever.

IOW, we have traditionally never failed small kernel allocations. It makes
perfect sense that people _depend_ on that.

Now, we have since relaxed that (a lot). And in response, I bet people
have added more __GFP_NOFAIL flags. It's all very natural. Claiming that
this is some "new error" and that we should warn about NOFAIL
allocations with big orders is just silly, and simply not true.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist
Date: Wed, 24 Jun 2009 19:48:09 UTC
Message-ID: <fa.tZkIsKq3jXOrDZ+bU+JaFUy5qj4@ifi.uio.no>

On Wed, 24 Jun 2009, Linus Torvalds wrote:
>
> Your point?
>
> That's the recursive allocation or oom case. Not the normal case at all.

In fact, as the whole comment explains, that case is the "give people
memory _now_, even if we normally wouldn't" case. The thing is, we can't
recurse into trying to free memory, because we're already in that path.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist
Date: Wed, 24 Jun 2009 20:14:31 UTC
Message-ID: <fa.+emPztHjI/T4nM2F4VnK3KS2VJc@ifi.uio.no>

On Wed, 24 Jun 2009, Andrew Morton wrote:
>
> If the caller gets oom-killed, the allocation attempt fails.  Callers need
> to handle that.

I actually disagree. I think we should just admit that we can always free
up enough space to get a few pages, in order to then oom-kill things.

This is not a new concept. oom has never been "immediately kill".

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist
Date: Wed, 24 Jun 2009 20:41:20 UTC
Message-ID: <fa.y5thpdkvACb2yLyJ6uX8JgiELMg@ifi.uio.no>

On Wed, 24 Jun 2009, Linus Torvalds wrote:
> On Wed, 24 Jun 2009, Andrew Morton wrote:
> >
> > If the caller gets oom-killed, the allocation attempt fails.  Callers need
> > to handle that.
>
> I actually disagree. I think we should just admit that we can always free
> up enough space to get a few pages, in order to then oom-kill things.

Btw, if you want to change the WARN_ON() to warn when you're in the
"allocate in order to free memory" recursive case, then I'd have no issues
with that.

In fact, in that case it probably shouldn't even be conditional on the
order.

So a

	WARN_ON_ONCE((p->flags & PF_MEMALLOC) && (gfp_mask & __GFP_NOFAIL));

actually makes tons of sense.

There are other cases too where __GFP_NOFAIL doesn't make sense, and those
could be warned about. The __GFP_NORETRY conflict was already mentioned.
Similarly, !__GFP_WAIT doesn't work with __GFP_NOFAIL - nofail obviously
relies on being able to wait and retry when an attempt fails.

We might also want rules like "in order to have NOFAIL, you need to allow
IO and FS accesses".
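
Collected into a sketch, using the same names as the snippets above:

	/* flag combinations that make no sense with __GFP_NOFAIL */
	WARN_ON_ONCE((gfp_mask & __GFP_NOFAIL) && (gfp_mask & __GFP_NORETRY));
	WARN_ON_ONCE((gfp_mask & __GFP_NOFAIL) && !(gfp_mask & __GFP_WAIT));
	/* ..and NOFAIL in the recursive "allocating to free memory" case */
	WARN_ON_ONCE((gfp_mask & __GFP_NOFAIL) && (p->flags & PF_MEMALLOC));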

So I don't mind warnings with __GFP_NOFAIL. I just think they should be
relevant, and make sense. The "order > 0" one is neither.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist
Date: Thu, 25 Jun 2009 20:12:41 UTC
Message-ID: <fa.v1DdMipu8RTycLGqPt0A0ZGfOOM@ifi.uio.no>

On Thu, 25 Jun 2009, Theodore Tso wrote:
>
> Never mind, stupid question; I hit the send button before thinking
> about this enough.  Obviously we should try without GFP_ATOMIC so the
> allocator can try to release some memory.  So maybe the answer for
> filesystem code where the alternative to allocator failure is
> remounting the root filesystem read-only or panic(), should be:
>
> 1)  Try to do the allocation GFP_NOFS.

Well, even with NOFS, the kernel will still do data writeout that can be
done purely by swapping. NOFS is really about avoiding recursion from
filesystems.

So you might want to try GFP_NOIO, which will mean that the kernel will
try to free memory that needs no IO at all. This also protects from
recursion in the IO path (ie block layer request allocation etc).

That said, GFP_ATOMIC may be better than both in practice, if only because
it might be better at balancing memory usage (ie too much "NOIO" might
result in the kernel aggressively dropping clean page-cache pages, since
it cannot drop dirty ones).

Note the "might". It probably doesn't matter in practice, since the bulk
of all allocations should always hopefully be GFP_KERNEL or GFP_USER.
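
So the fallback ladder being discussed would look something like this
sketch (the call site is hypothetical):

	void *ptr;

	/* try with reclaim, but without recursing into the filesystem */
	ptr = kmalloc(size, GFP_NOFS);	/* or GFP_NOIO: no IO recursion at all */
	if (!ptr)
		ptr = kmalloc(size, GFP_ATOMIC);	/* last resort: dip into reserves */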

> 2)  Then try GFP_ATOMIC

The main difference between NOIO and ATOMIC is

 - ATOMIC never tries to free _any_ kind of memory, since it doesn't want
   to take the locks, and cannot enable interrupts.

 - ATOMIC has the magic "high priority" bit set that means that you get to
   dip into critical memory resources in order to satisfy the memory
   request.

Whether these are important to you or not, I dunno. I actually suspect
that we might want a combination of "high priority + allow memory
freeing", which would be

	#define GFP_CRITICAL (__GFP_HIGH | __GFP_WAIT)

and might be useful outside of interrupt context for things that _really_
want memory at all costs.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist
Date: Thu, 25 Jun 2009 20:23:31 UTC
Message-ID: <fa.jeqRfVHLgwHLK4AXhSUOEvQ1ggc@ifi.uio.no>

On Thu, 25 Jun 2009, Linus Torvalds wrote:
>
> Whether these are important to you or not, I dunno. I actually suspect
> that we might want a combination of "high priority + allow memory
> freeing", which would be
>
> 	#define GFP_CRITICAL (__GFP_HIGH | __GFP_WAIT)

Actually, that doesn't work quite the way I intended.

The current page allocator screws up, and doesn't allow us to do this
(well, you _can_ combine the flags, but they don't mean what they mean on
their own). If you have the WAIT flag set, the page allocator will not set
the ALLOC_HARDER bit, so it turns out that GFP_ATOMIC (__GFP_HIGH on its
own) sometimes actually allows more allocations than the above
GFP_CRITICAL would.

It might make more sense to make a __GFP_WAIT allocation set the
ALLOC_HARDER bit _if_ it repeats. The problem with looping over
allocations outside of the page allocator is that you then miss the
"try increasingly harder" subtlety that the page allocator does
internally (well, right now the "increasingly harder" part only exists
for the try-to-free path, but we could certainly have it on the
try-to-allocate side too).
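
To make that concrete, a sketch (the helper and the "tried" flag are
illustrative; the ALLOC_* bits are the ones in mm/page_alloc.c):

	/* compute allocation flags, going harder on a repeat pass */
	static int alloc_flags_for(gfp_t gfp_mask, int tried)
	{
		int alloc_flags = ALLOC_WMARK_MIN;

		/* today only !__GFP_WAIT gets ALLOC_HARDER; the idea is to
		 * also escalate when a __GFP_WAIT allocation repeats */
		if (!(gfp_mask & __GFP_WAIT) || tried)
			alloc_flags |= ALLOC_HARDER;
		if (gfp_mask & __GFP_HIGH)
			alloc_flags |= ALLOC_HIGH;
		return alloc_flags;
	}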

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist
Date: Thu, 25 Jun 2009 20:52:26 UTC
Message-ID: <fa.YEek/E1SaAS7VCYhis2TlR61Yw4@ifi.uio.no>

On Thu, 25 Jun 2009, David Rientjes wrote:
>
> On Thu, 25 Jun 2009, Linus Torvalds wrote:
>
> > It might make more sense to make a __GFP_WAIT allocation set the
> > ALLOC_HARDER bit _if_ it repeats.
>
> This would make sense, but only for !__GFP_FS, since otherwise the oom
> killer will free some memory on an allowed node when reclaim fails and we
> don't otherwise want to deplete memory reserves.

So the reason I tend to like the kind of "incrementally try harder"
approaches is two-fold:

 - it works well for balancing different choices against each other (like
   on the freeing path, trying to see which kind of memory is most easily
   freed by trying them all first in a "don't try very hard" mode)

 - it's great for forcing _everybody_ to do part of the work (ie when some
   new thread comes in and tries to allocate, the new thread starts off
   with the lower priority, and as such won't steal a page that an older
   allocator just freed)

And I think that second case is true even for the oom killer case, and
even for __GFP_FS.

So if people worry about oom, I would suggest that we should not think so
hard about the GFP_NOFAIL cases (which are relatively few and rare), or
about things like the above "try harder when repeating" model, but instead
think about what actually happens during oom: the most common allocations
will remain the page allocations for user faults and/or page cache. In
fact, they get *more* common as you near an OOM situation, because you get
into the whole swap/filemap thrashing situation where you have to re-read
the same pages over and over again.

So don't worry about NOFS. Instead, look at what GFP_USER and GFP_HIGHUSER
do. They set the __GFP_HARDWALL bit, and they _always_ check the end
result and fail gracefully and quickly when the allocation fails.
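
For reference, this is what those look like in include/linux/gfp.h of
this era:

	#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
	#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | \
				 __GFP_HARDWALL | __GFP_HIGHMEM)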

End result? Realistically, I suspect the _best_ thing we can do is to just
couple that bit with "we're out of memory", and just do something like

	/* user-fault/pagecache allocation: fail quickly instead of looping */
	if (!did_some_progress && (gfp_flags & __GFP_HARDWALL))
		goto nopage;

rather than anything else. And I suspect that if we do this, we can then
afford to retry very aggressively for the allocation cases that aren't
GFP_USER - and that may well be needed in order to make progress.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist
Date: Sun, 28 Jun 2009 18:02:26 UTC
Message-ID: <fa.75kgnGBFkycqjuOYgYfqS5HNRfA@ifi.uio.no>

On Sun, 28 Jun 2009, Pavel Machek wrote:
>
> Ok, so we should re-add that 4MB buffer to suspend, so that
> allocations work even during that, right?

Pavel, you really are a one-trick pony, aren't you?

Give it up. Return to your pet worry when there are any actual reports. As
you have been told several times.

The _other_ part of memory management that you and Andrew seem to be
ignoring is that it's very robust, and keeps extra memory around, and just
generally does the right thing. We don't generally pre-allocate anything,
because we don't need to.

Almost the _only_ way to run out of memory is to have tons and tons of
dirty pages around. Yes, it can happen. But if it happens, you're almost
guaranteed to be screwed anyway. The whole VM is designed around the
notion that most of memory is just clean caches, and it's designed around
that simply because if it's not true, the VM freedom is so small that
there's not a lot a VM can reasonably do.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist
Date: Sun, 28 Jun 2009 18:37:42 UTC
Message-ID: <fa.nPHlg+YeXR843P/luBJYxFYK6C8@ifi.uio.no>

On Sun, 28 Jun 2009, Arjan van de Ven wrote:
> >
> > Almost the _only_ way to run out of memory is to have tons and tons
> > of dirty pages around. Yes, it can happen. But if it happens, you're
> > almost guaranteed to be screwed anyway. The whole VM is designed
>
> my impression is that when the strict dirty accounting code went in,
> this problem largely went away.

Yes and no.

It's still pretty easy to have lots of dirty anonymous memory, no
swap-space, and just run out of memory. If you do that, you're screwed.
The oom killer may or may not help, but even if it works, it's probably
going to work only after things have gotten pretty painful.

Also, we'll have to be pretty careful when/if we actually use that
"gfp_allowed_mask" for suspend/resume/hibernate. The SLAB_GFP_BOOT_MASK
has the __GFP_WAIT bit cleared, for example, which means that the VM won't
try to free even trivially freeable memory.

So for hibernate, if/when we use this, we should make sure to still allow
__GFP_WAIT (except, perhaps, during the stage when interrupts are actually
disabled), but clear __GFP_IO and __GFP_FS.
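
In code, that would be something like this sketch (the mask manipulation
is illustrative):

	/* entering hibernate: keep __GFP_WAIT so trivially freeable
	 * memory can still be reclaimed, but forbid IO/FS recursion */
	gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);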

Or whatever. The devil is in the details.

		Linus
