Index Home About Blog
From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: remove zero_page (was Re: -mm merge plans for 2.6.24)
Date: Tue, 09 Oct 2007 14:53:29 UTC
Message-ID: <>

On Tue, 9 Oct 2007, Nick Piggin wrote:
> I have done some tests which indicate a couple of very basic common tools
> don't do much zero-page activity (ie. kbuild). And also combined with some
> logical arguments to say that a "sane" app wouldn't be using zero_page much.
> (basically -- if the app cares about memory or cache footprint and is using
> many pages of zeroes, then it should have a more compressed representation
> of zeroes anyway).

One of the things that zero-page has been used for is absolutely *huge*
(but sparse) arrays in Fortan programs.

At least in traditional fortran, it was very hard to do dynamic
allocations, so people would allocate the *maximum* array statically, and
then not necessarily use everything. I don't know if the pages ever even
got paged in, but this is the kind of usage which is *not* insane.


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: remove zero_page (was Re: -mm merge plans for 2.6.24)
Date: Wed, 10 Oct 2007 02:23:52 UTC
Message-ID: <>

On Tue, 9 Oct 2007, Nick Piggin wrote:
> Where do you suggest I go from here? Is there any way I can
> convince you to try it? Make it a config option? (just kidding)

No, I'll take the damn patch, but quite frankly, I think your arguments

I've told you so before, and asked for numbers, and all you do is
handwave. And this is like the *third*time*, and you don't even seem to
admit that you're handwaving.

So let's do it, but dammit:
 - make sure there aren't any invalid statements like this in the final
   commit message.
 - if somebody shows that you were wrong, and points to a real load,
   please never *ever* make excuses for this again, ok?

Is that a deal? I hope we'll never need to hear about this again, but I
really object to the way you've tried to "sell" this thing, by basically
starting out dishonest about what the problem was, and even now I've yet
to see a *single* performance number even though I've asked for them
(except for the problem case, which was introduced by *you*)


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: remove zero_page (was Re: -mm merge plans for 2.6.24)
Date: Wed, 10 Oct 2007 03:07:32 UTC
Message-ID: <>

On Tue, 9 Oct 2007, Nick Piggin wrote:
> I gave 2 other numbers. After that, it really doesn't matter if I give
> you 2 numbers or 200, because it wouldn't change the fact that there
> are 3 programs using the ZERO_PAGE that we'll never know about.

You gave me no timings what-so-ever. Yes, you said "1000 page faults", but
no, I have yet to see a *single* actual performance number.

Maybe I missed it? Or maybe you just never did them.

Was it really so non-obvious that I actually wanted *performance* numbers,
not just some random numbers about how many page faults you have? Or did
you post them somewhere else? I don't have any memory of having seen any
performance numbers what-so-ever, but I admittedly get too much email.

Here's three numbers of my own: 8, 17 and 975.

So I gave you "numbers", but what do they _mean_?

So let me try one more time:

 - I don't want any excuses about how bad PAGE_ZERO is. You made it bad,
   it wasn't bad before.
 - I want numbers. I want the commit message to tell us *why* this is
   done. The numbers I want is performance numbers, not handwave numbers.
   Both for the bad case that it's supposed to fix, *and* for "normal
 - I want you to just say that if it turns out that there are people who
   use ZERO_PAGE, you stop calling them crazy, and promise to look at the

How much clearer can I be? I have said several times that I think this
patch is kind of sad, and the reason I think it's sad is that you (and
Hugh) convinced me to take the patch that made it sad in the first place.
It didn't *use* to be bad. And I've use ZERO_PAGE myself for timing, I've
had nice test-programs that knew that it could ignore cache effects and
get pure TLB effects when it just allocated memory and didn't write to it.

That's why I don't like the lack of numbers. That's why I didn't like the
original commit message that tried to blame the wrong part. That's why I
didn't like this patch to begin with.

But I'm perfectly ready to take it, and see if anybody complains.
Hopefully nobody ever will.

But by now I absolutely *detest* this patch because of its history, and
how I *told* you guys what the reserved bit did, and how you totally
ignored it, and then tried to blame ZERO_PAGE for that. So yes, I want the
patch to be accompanied by an explanation, which includes the performance
side of why it is wanted/needed in the first place.

If this patch didn't have that kind of history, I wouldn't give a flying f
about it. As it is, this whole thing has a background.


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: remove zero_page (was Re: -mm merge plans for 2.6.24)
Date: Wed, 10 Oct 2007 05:22:10 UTC
Message-ID: <fa.ChpA9h9IA/NcsExJk4HrhyJ/>

On Wed, 10 Oct 2007, Hugh Dickins wrote:

> On Tue, 9 Oct 2007, Nick Piggin wrote:
> > by it ;) To prove my point: the *first* approach I posted to fix this
> > problem was exactly a patch to special-case the zero_page refcounting
> > which was removed with my PageReserved patch. Neither Hugh nor yourself
> > liked it one bit!
> True (speaking for me; I forget whether Linus ever got to see it).

The problem is, those first "remove ref-counting" patches were ugly
*regardless* of ZERO_PAGE.

We (yes, largely I) fixed up the mess since. The whole vm_normal_page()
and the magic PFN_REMAP thing got rid of a lot of the problems.

And I bet that we could do something very similar wrt the zero page too.

Basically, the ZERO page could act pretty much exactly like a PFN_REMAP
page: the VM would not touch it. No rmap, no page refcounting, no nothing.

This following patch is not meant to be even half-way correct (it's not
even _remotely_ tested), but is just meant to be a rough "grep for
ZERO_PAGE in the VM, and see what happens if you don't ref-count it".

Would something like the below work? I dunno. But I suspect it would. I
doubt anybody has the energy to actually try to actually follow through on
it, which is why I'm not pushing on it any more, and why I'll accept
Nick's patch to just remove ZERO_PAGE, but I really *am* very unhappy
about this.

The "page refcounting cleanups" in the VM back when were really painful.
And dammit, I felt like I was the one who had to clean them up after you
guys. Which makes me really testy on this subject.

And yes, I also admit that the vm_normal_page() and the PFN_REMAP thing
ended up really improving the VM, and we're pretty much certainly better
off now than we were before - but I also think that ZERO_PAGE etc could
easily be handled with the same model. After all, if we can make
"mmap(/dev/mem)" work with COW and everything, I'd argue that ZERO_PAGE
really is just a very very small special case of that!

Totally half-assed untested patch to follow, not meant for anything but a
"I think this kind of approach should have worked too" comment.

So I'm not pushing the patch below, I'm just fighting for people realizing

 - the kernel has *always* (since pretty much day 1) done that ZERO_PAGE
   thing. This means that I would not be at all surprised if some
   application basically depends on it. I've written test-programs that
   depends on it - maybe people have written other code that basically has
   been written for and tested with a kernel that has basically always
   made read-only zero pages extra cheap.

   So while it may be true that removing ZERO_PAGE won't affect anybody, I
   don't think it's a given, and I also don't think it's sane calling
   people "crazy" for depending on something that has always been true
   under Linux for the last 15+ years. There are few behaviors that have
   been around for that long.

 - make sure the commit message is accurate as to need for this (ie not
   claim that the ZERO_PAGE itself was the problem, and give some actual
   performance numbers on what is going on)

that's all.


 mm/memory.c  |   17 ++++++++---------
 mm/migrate.c |    2 +-
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f82b359..0a8cc88 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -386,6 +386,7 @@ static inline int is_cow_mapping(unsigned int flags)
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 	unsigned long pfn = pte_pfn(pte);
+	struct page *page;

 	if (unlikely(vma->vm_flags & VM_PFNMAP)) {
 		unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
@@ -413,7 +414,11 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_
 	 * The PAGE_ZERO() pages and various VDSO mappings can
 	 * cause them to exist.
-	return pfn_to_page(pfn);
+	page = pfn_to_page(pfn);
+	if (PageReserved(page))
+		page = NULL;
+	return page;

@@ -968,7 +973,7 @@ no_page_table:
 	if (flags & FOLL_ANON) {
 		page = ZERO_PAGE(address);
 		if (flags & FOLL_GET)
-			get_page(page);
+			page = alloc_page(GFP_KERNEL | GFP_ZERO);
 		BUG_ON(flags & FOLL_WRITE);
 	return page;
@@ -1131,9 +1136,6 @@ static int zeromap_pte_range(struct mm_struct *mm, pmd_t *pmd,
-		page_cache_get(page);
-		page_add_file_rmap(page);
-		inc_mm_counter(mm, file_rss);
 		set_pte_at(mm, addr, pte, zero_pte);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
@@ -1717,7 +1719,7 @@ gotten:

 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	if (old_page == ZERO_PAGE(address)) {
+	if (!old_page) {
 		new_page = alloc_zeroed_user_highpage_movable(vma, address);
 		if (!new_page)
 			goto oom;
@@ -2274,15 +2276,12 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	} else {
 		/* Map the ZERO_PAGE - vm_page_prot is readonly */
 		page = ZERO_PAGE(address);
-		page_cache_get(page);
 		entry = mk_pte(page, vma->vm_page_prot);

 		ptl = pte_lockptr(mm, pmd);
 		if (!pte_none(*page_table))
 			goto release;
-		inc_mm_counter(mm, file_rss);
-		page_add_file_rmap(page);

 	set_pte_at(mm, address, page_table, entry);
diff --git a/mm/migrate.c b/mm/migrate.c
index e2fdbce..8d2e110 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -827,7 +827,7 @@ static int do_move_pages(struct mm_struct *mm, struct page_to_node *pm,
 			goto set_status;

 		if (PageReserved(page))		/* Check for zero page */
-			goto put_and_set;
+			goto set_status;

 		pp->page = page;
 		err = page_to_nid(page);

Index Home About Blog