Software prefetching from memory (Linus Torvalds)

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] epoll use a single inode ...
Date: Wed, 07 Mar 2007 22:58:57 UTC
Message-ID: <fa.xioENCdH8/npFPrkKx9GGXIztBs@ifi.uio.no>

On Wed, 7 Mar 2007, Anton Blanchard wrote:
>
> Funny you mention this. We found some noticeable ppc64 regressions when
> moving the dcache to standard list macros and had to do this to fix it
> up:
>
> static inline void prefetch(const void *x)
> {
>         if (unlikely(!x))
>                 return;
>
>         __asm__ __volatile__ ("dcbt 0,%0" : : "r" (x));
> }
>
> Urgh :)

Yeah, I'm not at all surprised. Any implementation of "prefetch" that
doesn't just turn into a no-op if the TLB entry doesn't exist (which makes
them weaker for *actual* prefetching) will generally have a hard time with
a NULL pointer. Exactly because it will try to do a totally unnecessary
TLB fill - and since most CPUs will not cache negative TLB entries, that
unnecessary TLB fill will be done over and over and over again..

In general, using software prefetching is just a stupid idea, unless

 - the prefetch really is very strict (i.e. for a linked list you do exactly
   the above kinds of things to make sure that you don't try to prefetch
   the non-existent end entry)
AND
 - the CPU is stupid (in-order in particular).
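
To make the first condition concrete, here is a minimal sketch (not taken
from the thread) of a list walk that never prefetches the non-existent end
entry. It assumes GCC or Clang, whose __builtin_prefetch is only a hint and
may compile to nothing on some targets. The names "struct node" and
"sum_list" are hypothetical, chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical singly linked list node, NULL-terminated. */
struct node {
    struct node *next;
    int value;
};

/* Walk the list and sum the values, prefetching one node ahead. */
static long sum_list(const struct node *head)
{
    long sum = 0;

    for (const struct node *n = head; n != NULL; n = n->next) {
        /*
         * Strict prefetch: only issue the hint when a next node really
         * exists, so the CPU is never asked to look up address 0 at the
         * tail of the list.
         */
        if (n->next != NULL)
            __builtin_prefetch(n->next);

        sum += n->value;
    }
    return sum;
}
```

Whether the extra branch pays for itself depends on the CPU, which is the
second condition: on an out-of-order core the hint may well buy nothing.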

I think Intel even suggests in their optimization manuals to *not* do
software prefetching, because hw can usually simply do better without it.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] epoll use a single inode ...
Date: Thu, 08 Mar 2007 03:21:35 UTC
Message-ID: <fa.eyUP2auGjXgDVOESAAJyfZv0zks@ifi.uio.no>

On Wed, 7 Mar 2007, Michael K. Edwards wrote:
>
> People's prejudices against prefetch instructions are sometimes
> traceable to the 3DNow! prefetch(w) botch, which some processors
> "support" as no-ops and others are too aggressive about (Opteron
> prefetches are reputed to be "strong", i. e., not dropped on DTLB
> miss).

No, I just checked, and Intel's own optimization manual makes it clear
that you should be careful. They talk about performance penalties due to
resource constraints - which makes tons of sense with a core that is good
at handling its own resources and could quite possibly use those resources
better to actually execute the loads and stores deeper down the
instruction pipeline.

So it's not just 3DNow! making AMD look bad, or Intel would obviously
suggest people use it out of the wazoo ;)

> XScale gets it right.

Blah. XScale isn't even an OoO CPU, *of*course* it needs prefetching.
Calling that "getting it right" is ludicrous. If anything, it gets things
so wrong that prefetching is *required* for good performance.

I'm talking about real CPUs with real memory pipelines that already do
prefetching in hardware. The better the core is, the less the prefetch
helps (and often the more it hurts in comparison to how much it helps).

But if you mean "doesn't try to fill the TLB on data prefetches", then
yes, that's generally the right thing to do.

> (Oddly, Prescott seems to have initiated a page table walk on DTLB miss
> during software prefetch -- just one of many weird Prescott flaws.)

Netburst in general is *very* happy to do speculative TLB fills, I think.

> I'm guessing Pentium M and its descendants (Core Solo and Duo) get it
> right but I'm having a hell of a time finding out for sure.  Can any of
> the x86 experts answer this?

I just suspect that the upside for Core 2 Duo is likely fairly low. The L2
cache is good, the memory re-ordering is working. I doubt "prefetch"
helps in generic code that much for things like linked list following; you
should probably limit it to code that has *known* access patterns and where
you know the data is not going to be in the cache.

(In other words, I bet prefetching can help a lot with MMX/media kind of
code, but I doubt it's a huge win for "for_each_entry()")
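
As an illustration of a *known* access pattern, here is a minimal sketch
(not from the thread) of a streaming pass over a large buffer that
prefetches a fixed distance ahead. It assumes GCC or Clang's
__builtin_prefetch. The 64-byte cache line, the 512-byte prefetch distance
and the name "sum_bytes" are all assumptions picked for the example, not
measured or recommended values.

```c
#include <stddef.h>

#define CACHE_LINE      64   /* assumed line size in bytes */
#define PREFETCH_AHEAD  512  /* assumed distance; needs tuning per CPU */

/* Sum every byte of a buffer that is read once, front to back. */
static unsigned long sum_bytes(const unsigned char *buf, size_t n)
{
    unsigned long total = 0;

    for (size_t i = 0; i < n; i++) {
        /*
         * Issue one hint per assumed cache line, and only while the
         * target is still inside the buffer. Arguments: 0 = prefetch
         * for read, 0 = no temporal locality (data is used once).
         */
        if (i % CACHE_LINE == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&buf[i + PREFETCH_AHEAD], 0, 0);

        total += buf[i];
    }
    return total;
}
```

A plain sequential scan like this is also the easiest case for a hardware
prefetcher, so the hint should be benchmarked against the same loop
without it before being kept.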

		Linus
