Cache thrashing (Andy Glew; Linus Torvalds; Terje Mathisen)

Index Home About Blog

From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: Re: L1 and L2 Line size in a 64 bit proc
Date: Tue, 06 Jan 2004 17:11:26 +0100
Message-ID: <btemne$itu$1@osl016lin.hda.hydro.com>

Nick Maclaren wrote:
> In addition to the difficulty that it isn't always practical,
> there is the problem of the physical mapping of caches.  And the
> unprivileged programmer can neither control nor even query his
> virtual to physical address mapping.  You then get applications
> that run well 99% of the time and like a drain 1% of the time,
> unpredictably.
>
> This becomes REALLY serious for mission critical real-time codes,
> such as chemical plant control.  You absolutely do NOT want the
> program slowing down by a large factor (ones of 3-10, overall,
> can occur) because the system has distributed data in a way that
> it rarely does or the application has changed its data sizes and
> hit this problem.

Have I told the story about the OS which had process control structures
which were nice powers-of-two in size to speed up access to them?

With a lot of blocked threads, the system ran 4 X (or some number like
that) slower than usual, because the regular scanning of those blocked
threads to check if they needed some OS assistance caused both cache and
TLB trashing.


A little padding later, and everything was back to normal. :-)

> And so it goes.

Right.

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: L1 and L2 Line size in a 64 bit proc
Message-ID: <bteuv9$str$1@build.pdx.osdl.net>
Date: Tue, 06 Jan 2004 18:40:01 GMT

Terje Mathisen wrote:
>
> Have I told the story about the OS which had process control structures
> which were nice powers-of-two in size to speed up access to them?

This still happens, even when you try to be careful. And yes, it probably
happens more in OS code than other code, exactly because the OS generally
has to track a lot of data structures that are powers-of-two in size
because of simple hardware issues (ie fundamental memory allocation issues
wrt pages and superpages).

So when you have data structures that quite naturally end up being
page-aligned due to allocator issues, and have linked lists of these
things, walking the list ends up being really really bad. Even with
four-way caches or better.

And with cache lines generally having become bigger (they tend to be
invariably 64 or 128 bytes these days), it's becoming harder to try to just
offset the data structures a bit. You've got to offset them rather much,
which can be very wasteful if your data structures are in the single-page
size order.

End result: avoid list walking like the plague for these data structures.
Even if it would be the simplest and obvious way, and the code generated
would be nice and understandable. Even if the average list size isn't that
big, it starts sucking rather quickly.

So even with four- or eight-way associativity there are problematic cases,
but at least at that point you usually can try to alleviate the few problem
spots by recoding. With lower associativity the "few problem spots" are
just too damn many and unrelated accesses end up disturbing each other.

                Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: L1 and L2 Line size in a 64 bit proc
Message-ID: <btjup2$c0q$1@build.pdx.osdl.net>
Date: Thu, 08 Jan 2004 16:00:01 GMT

Glew News SBC wrote:

> "Linus Torvalds" <torvalds@osdl.org> wrote in message
> news:bteuv9$str$1@build.pdx.osdl.net...
>> Terje Mathisen wrote:
>> >
>> > Have I told the story about the OS which had process control structures
>> > which were nice powers-of-two in size to speed up access to them?
>>
>> This still happens, even when you try to be careful. And yes, it probably
>> happens more in OS code than other code, exactly because the OS generally
>> has to track a lot of data structures that are powers-of-two in size
>> because of simple hardware issues (ie fundamental memory allocation
>> issues wrt pages and superpages).
> ...
>>
>> End result: avoid list walking like the plague for these data structures.
>
> Sigh. Non power of two moduli are easy enough to compute,
> at least for an L2 access.

You didn't read my explanation of _why_ there are so many power-of-two
allocations.

It's not because we have an array which has entries the size of a
power-of-two. It's because the fundamental memory allocation unit HAS TO BE
a power of two for some rather basic reasons. When was the last time you
saw hardware that had non-power-of-two page sizes? Right. They just don't
exist, and they won't be existing any time soon.

So as a result, a lot of your OS allocations _will_ be based on power-of-two
offsets - whether they are related to each other or not. Which means that
your access patterns will invariably be biased towards those powers-of-two
as well.

Yes, we do offsetting etc when we only use partial pages, but as I also
explained, with cache lines sizes regularly being 128 bytes, you simply
_cannot_ color your data accesses very much if the basic allocation unit is
a page and you mostly work in single pages.  And single pages is what you
work with in an OS. Big allocations are very much the exception.

And don't get me wrong: we can, and we do, try to do our best to avoid cache
conflicts. But in the end the software side can't do everything, and
hardware should try to avoid arbitrary limitations and just go for higher
associativity.

It's a question of give-and-take. Software tries its best to be
cache-friendly, but hardware should try its best to be sw-friendly too. If
either of those parts fail, your performance will suck. It's not "one or
the other", it's both.

I'm not asking for fully associative caches, btw. I accept that
associativity is complicated. I'm just stating that there is only so much
that software can do to reasonably avoid problems with low levels of
associativity, and that I personally consider anything less than 4-way to
be just too damn _fragile_ from a performance standpoint.

                        Linus

From: "Glew News SBC" <andy-glew-public@sbcglobal.net>
Newsgroups: comp.arch
Subject: Re: L1 and L2 Line size in a 64 bit proc
Message-ID: <mEiLb.8347$ZG.7769@newssvr25.news.prodigy.com>
Date: Thu, 08 Jan 2004 19:59:14 GMT

[Glew]
> > Sigh. Non power of two moduli are easy enough to compute,
> > at least for an L2 access.

[Linus]
> You didn't read my explanation of _why_ there are so many power-of-two
> allocations.
>
> It's not because we have an array which has entires the size of a
> power-of-two. It's because the fundamental memory allocation unit HAS TO BE
> a power of two for some rather basic reasons. When was the last time you
> saw hardware that had non-power-of-two page sizes? Right. They just don't
> exist, and they won't be existing any time soon.

You misunderstand my "sigh".

I am totally in agreement that it is unreasonable to ask software
to force data structures to be non-power-of-two.

What I regret is that it is fairly easy, in hardware, to avoid
the cache thrashing problems that power-of-two sized data
structures cause.

Consider a 4M L2 data cache with 64 byte cache lines, 16 way associative.
That's 4M/64/16 = exp2(22-6-4) = exp2(12) = 4K sets.
Essentially requiring modulo 4K in the cache index computation
- which is, of course, just a bitfield extract.  Possibly with a few XORs
for hashing in of upper bits.

The circuit to compute modulo 4K-1 or 4K+1 is really not that hard.
It might not be reasonable to use in a fast L0 cache,
but it is pretty reasonable to use in an L2 cache.

I personally like this idea of non-power-of-two, Mersenne,
or prime cache indexing.  There's lots of literature on it.
There have been a few other proposals to improve the hash
function to avoid resonance collisions.

---

So, no, Linus when I complain about there being insufficient
hardware / software trading off, I think that in this case *hardware*
should change, not software.

If, that is, power-of-two data structures and linked lists
are still a good idea.

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: L1 and L2 Line size in a 64 bit proc
Message-ID: <btlid2$7ei$1@build.pdx.osdl.net>
Date: Fri, 09 Jan 2004 06:51:46 GMT

Glew News SBC wrote:
>
> You misunderstand my "sigh".

I did indeed. I've gotten so used to the way caches work that I have become
myopic.

Your idea to use a hash to distribute the caches more evenly sounds very
feasible, and would likely solve some problems. On the other hand it would
make it harder to tune for caches, which some projects like doing.
Personally, I'd love to just try to keep the footprint down and not worry
about cache coloring, and a hashed cache index would force that that. It
would likely be a lot more effective than what we do in software.

On the other hand, on the codes that _are_ tuned, it would possibly hurt, so
you'd get a fair amount of push-back on it.

And I suspect that with something like 8-way associativity, you won't find
many cases where it really matters. I think Intel already does 8-way on the
L2, and the Opteron does 16-way? I can't imagine that conflict misses end
up being that common any more at that point.

                Linus

Index Home About Blog