Date: 	Sun, 9 Dec 2001 23:28:59 -0800
From: Larry McVoy <lm@bitmover.com>
Subject: Re: "Colo[u]rs"
Newsgroups: fa.linux.kernel

On Mon, Dec 10, 2001 at 02:07:06AM -0500, Stevie O wrote:
> After a few failed web searches (combos like 'linux cache color' just gave 
> me a bunch of references to video), I am resorting to this list for this 
> question.
> 
> What exactly do y'all mean by these "colors"? Task colors, cache colors, 
> and probably a few other colors I've missed/forgotten about. What do these 
> colors represent? How are they used to group tasks/cache entries? What are 
> they actually for?

Coloring usually means laying out data such that it will not collide
in the cache, usually the second (or third) level cache.

Data references are virtual, but most caches are physically tagged.  That
means that where data lands in the cache is a function of the physical
page address and the offset within that page.  If the cache is what is
called direct mapped, which means each address has exactly one place it
can be in the cache, then page coloring becomes beneficial.  More on that
in a second.  Most caches these days are set associative, which means
there are multiple places the same cache line could be.  A 2-way set
associative cache means there are 2 places, 4-way means there are 4, and
so on.  The trend is that the last big cache before memory is at least
2-way and more typically 4-way set associative.  There is a cost to
making it N-way associative: you have to run all the tag comparators in
parallel or you get unacceptable performance.  With the shrinking
transistors and high yields we are currently enjoying, that cost is
somewhat reduced.
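
To make that concrete, here is a small C sketch of how an address picks
its spot in the cache; the 1 MB cache and 64-byte line sizes are made-up
example numbers, not any particular CPU:

/* Where does a physical address land in the cache?  Direct mapped is
 * the 1-way case: one possible slot per address.  An N-way cache has
 * fewer sets, but N slots per set. */
#include <stdio.h>

#define CACHE_SIZE (1 << 20)	/* assumed 1 MB last-level cache */
#define LINE_SIZE  64		/* assumed 64-byte cache lines   */

static unsigned long set_index(unsigned long paddr, int ways)
{
	unsigned long nsets = CACHE_SIZE / (LINE_SIZE * ways);
	return (paddr / LINE_SIZE) % nsets;
}

int main(void)
{
	unsigned long a = 0x12345000, b = 0x12445000;	/* 1 MB apart */

	/* Direct mapped: these two collide, one evicts the other. */
	printf("1-way: %lu vs %lu\n", set_index(a, 1), set_index(b, 1));
	/* 4-way: even with the same index there are 4 slots to share. */
	printf("4-way: %lu vs %lu\n", set_index(a, 4), set_index(b, 4));
	return 0;
}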

So what's page coloring?  Suppose we have a 10 page address space and
we touch each page.  As we touch them, the OS takes a page fault, grabs
a physical page, and makes it part of our address space.  The actual
physical addresses of those pages determine where the cache lines
will land in the cache.  If the pages are allocated at random, the worst
case is that all 10 pages map to the same location in the cache,
reducing the cache effectiveness by 10x on a direct mapped cache,
where there is only one place to go.   Page coloring is the careful
allocation of pages such that each virtual page maps to a physical page
which will be at a different location in the cache.  Linus doesn't like
it because it adds cost to the page free/page alloc paths: they now have
to put the page into, or get it from, the right bucket.  He also says
it's pointless because caches are becoming associative enough that there
is no need, and he's mostly right.  Life can really suck on small-cache
systems that are direct mapped, as some embedded systems are, but that's
life.  It's a tradeoff.
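
A hypothetical sketch of what that bucketing looks like (illustrative C,
not the Linux allocator; the 256 KB direct mapped cache and 4 KB pages
are assumed numbers):

/* One free list per "color".  A page's color says where it lands in a
 * direct mapped cache, so handing out colors that match the faulting
 * virtual page keeps consecutive virtual pages from colliding. */
#include <stdio.h>
#include <stddef.h>

#define PAGE_SHIFT 12				/* assumed 4 KB pages     */
#define CACHE_SIZE (256 * 1024)			/* assumed 256 KB cache   */
#define NCOLORS    (CACHE_SIZE >> PAGE_SHIFT)	/* pages per cache "lap"  */

struct page {
	struct page *next;
	unsigned long pfn;	/* physical frame number */
};

static struct page *free_list[NCOLORS];		/* one bucket per color */

static unsigned int page_color(unsigned long pfn)
{
	return pfn % NCOLORS;
}

/* Freeing means filing the page under its color: the extra bookkeeping
 * in the free path that the mail above is talking about. */
void color_free(struct page *p)
{
	unsigned int c = page_color(p->pfn);

	p->next = free_list[c];
	free_list[c] = p;
}

/* On a fault, prefer the bucket matching the faulting virtual page,
 * falling back to any non-empty bucket. */
struct page *color_alloc(unsigned long vaddr)
{
	unsigned int want = (vaddr >> PAGE_SHIFT) % NCOLORS;

	for (unsigned int i = 0; i < NCOLORS; i++) {
		unsigned int c = (want + i) % NCOLORS;

		if (free_list[c]) {
			struct page *p = free_list[c];

			free_list[c] = p->next;
			return p;
		}
	}
	return NULL;	/* out of memory */
}

int main(void)
{
	static struct page pages[8];

	/* Free a handful of physically contiguous pages... */
	for (unsigned long pfn = 0; pfn < 8; pfn++) {
		pages[pfn].pfn = pfn;
		color_free(&pages[pfn]);
	}
	/* ...then fault in consecutive virtual pages: each one gets a
	 * page of a different color, so they use different cache slots. */
	for (unsigned long va = 0x40000000; va < 0x40003000; va += 4096)
		printf("vaddr %#lx -> pfn %lu\n", va, color_alloc(va)->pfn);
	return 0;
}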
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 


Date: 	Sun, 9 Dec 2001 23:51:26 -0800
From: Larry McVoy <lm@bitmover.com>
Subject: Re: "Colo[u]rs"
Newsgroups: fa.linux.kernel

On Sun, Dec 09, 2001 at 11:21:23PM -0800, David Lang wrote:
> Larry, I thought that direct mapped caches had full mapping from any cache
> address to any physical address while the N-way mapped caches were more
> limited.  Modern caches are N-way instead of direct mapped because it's
> much more expensive (transistor count wise) for the direct mapped
> approach.

That's not correct unless they changed terminology without asking my 
permission :-)  A direct mapped cache is the same as a 1-way set
associative cache.  It means each cache line has one and only one 
place it can be in the cache.  It also means there is only one set
of tag comparators, which makes it cheaper and easier to build.

> If I'm mistaken about my terminology (very possible :-) what is the term
> for what I am referring to as direct mapped?

Fully associative cache, I believe.
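
For the terminology, a quick worked example in C (the 64 KB cache and
64-byte lines are made-up numbers):

/* 64 KB of cache with 64-byte lines is 1024 lines total; the
 * associativity just decides how those lines are grouped into sets. */
#include <stdio.h>

int main(void)
{
	const int lines = (64 * 1024) / 64;	/* 1024 lines */
	const int ways[] = { 1, 4, 1024 };	/* direct mapped, 4-way, fully assoc. */

	for (int i = 0; i < 3; i++) {
		/* 1-way ("direct mapped"): 1024 sets, one slot per line.
		 * 1024-way ("fully associative"): 1 set, any line can go
		 * anywhere, but all 1024 tags get compared in parallel. */
		printf("%4d-way: %4d sets, %4d possible slots per line\n",
		       ways[i], lines / ways[i], ways[i]);
	}
	return 0;
}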
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Page Colouring (was: 2.6.0 Huge pages not working as expected)
Original-Message-ID: <Pine.LNX.4.58.0312291320140.2113@home.osdl.org>
Date: Mon, 29 Dec 2003 21:38:39 GMT
Message-ID: <fa.j3s4pl9.1ci8lgh@ifi.uio.no>

On Mon, 29 Dec 2003, Eric W. Biederman wrote:
>
> Linus Torvalds <torvalds@osdl.org> writes:
> >
> > Why? Because the fact is, that as memory gets further and further away
> > from CPU's, caches have gotten further and further away from being direct
> > mapped.
>
> Except for L1 caches.  The hit of an associative lookup there is inherently
> costly.

Having worked for a hardware company, and talked to hardware engineers, I
can say that it generally isn't all that true.

The reason is that you can start the lookup before you even do the TLB
lookup, and in fact you _want_ highly associative L1 caches to do that.

For example, if you have a 16kB L1 cache, and a 4kB page size, and you
want your memory accesses to go fast, you definitely want to index the L1
by the virtual address, which means that you can only use the low 12 bits
for indexing.

So what you do is you make your L1 be 4-way set-associative, so that by
the time the TLB lookup is done, you've already looked up the index, and
you only have to compare the TAG with one of the four possible ways.

In short: you actually _want_ your L1 to be associative, because it's the
best way to avoid having nasty alias issues.
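
The arithmetic behind that 16kB / 4-way example, assuming 64-byte cache
lines (the line size is an assumption, it isn't stated above):

/* Why a 16kB, 4-way L1 can be indexed before the TLB answers. */
#include <stdio.h>

int main(void)
{
	const int l1_size   = 16 * 1024;	/* 16kB L1                */
	const int ways      = 4;		/* 4-way set associative  */
	const int line_size = 64;		/* assumed line size      */
	const int page_bits = 12;		/* 4kB pages              */

	int way_size = l1_size / ways;		/* 4096 bytes per way     */
	int sets     = way_size / line_size;	/* 64 sets                */

	int line_bits = 0, index_bits = 0;
	for (int s = line_size; s > 1; s >>= 1)
		line_bits++;			/* 6 offset-in-line bits  */
	for (int s = sets; s > 1; s >>= 1)
		index_bits++;			/* 6 set-index bits       */

	/* The set index lives in address bits 6..11, all inside the
	 * 12-bit page offset, which is identical in the virtual and the
	 * physical address.  So the index lookup runs in parallel with
	 * the TLB, and the physical address is only needed for the
	 * final tag compare against the 4 ways. */
	printf("index uses bits %d..%d, page offset is bits 0..%d: %s\n",
	       line_bits, line_bits + index_bits - 1, page_bits - 1,
	       way_size <= (1 << page_bits) ? "no aliasing problem" : "aliasing!");
	return 0;
}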

The only people who have a direct-mapped L1 are one of:
 - crazy and/or stupid
 - really cheap (mainly embedded space)
 - not high-performance anyway (ie their L1 is really small)
 - really sorry, and are fixing it.
 - really _really_ sorry, and have a virtually indexed cache. In which
   case page coloring doesn't matter anyway.

Notice how high performance is _not_ on the list. Because you simply can't
_get_ high performance with a direct-mapped L1. Those days are long gone.

There is another reason why L1's have long since moved away from being
direct-mapped: for the same size cache, the miss ratio of a direct-mapped
cache is quite a bit higher.  And things like OoO are pretty good at
hiding one cycle of latency (OoO is _not_ good at hiding memory latency,
but one or two cycles are usually ok), so even if having a larger L1 (and
thus inherently more complex - not only in associativity) means that you
end up with an extra cycle of access latency, it's likely a win.

This is, for example, what the Alpha did between the 21164 and the 21264:
when they went out-of-order, they did all the simulation to prove that it
was much more efficient to have a larger L1 with a higher hit ratio, even
if the latency was one cycle higher than on the 21164, which was strictly
in-order.

In short, I'll bet you a dollar that you won't see a single direct-mapped
L1 _anywhere_ where it matters. They are already pretty much gone. Can you
name one that doesn't fit the criteria above?

		Linus
