From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: Re: InfoWorld says Opteron cache is slow???
Date: Fri, 16 May 2003 15:21:23 +0200
Message-ID: <ba2okj$2e4$1@osl016lin.hda.hydro.com>

Greg Lindahl wrote:
> In article <ba28io$m2r$2@woodrow.ucdavis.edu>,  <bill@math.ucdavis.edu> wrote:
>
>
>>I had great hopes for amazing memory bandwidth as a result of a 5.3 GB/sec
>>local memory bus + access to 3 6.4 GB/sec HT (each with a local 5.3 GB/sec
>>memory bus).  But alas, with 1, 2 or 4 active memory busses and various
>>different size arrays I always seem to get 1.7 GB/sec or so on a
>>McCalpin stream loop:
>>       for (j=0; j<N; j++)
>>           c[j] = a[j]+b[j];
>
>
> AMD has made their Software Optimization Guide available to everyone.
> It gives useful assembly code for stream, much faster than 1.7 GB/s.
> I've always been surprised that compiler quality made such a huge
> difference for stream, which spends most of its time waiting for
> memory, but it does.

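You get rid of 25% of the memory traffic almost automatically if you can
avoid the read-for-ownership before the store operations: the loop then
moves three streams (read a, read b, write c) instead of four (the extra
one being the read of c's cache lines just before they get overwritten).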

That is the 'easy' part.
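
A minimal sketch of the read-for-ownership avoidance, assuming an x86
target with SSE2 and a 16-byte-aligned destination. The function name
stream_add_nt is hypothetical, and this is not the code from the AMD
guide; it only shows non-temporal stores applied to the quoted loop:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_pd, _mm_add_pd, _mm_stream_pd */
#include <stddef.h>
#include <stdint.h>

/* c[j] = a[j] + b[j], writing c with non-temporal (cache-bypassing)
 * stores so the CPU does not first read c's cache lines.
 * Assumption: c is 16-byte aligned; a and b may be unaligned. */
void stream_add_nt(double *c, const double *a, const double *b, size_t n)
{
    assert(((uintptr_t)c & 15) == 0);  /* _mm_stream_pd needs alignment */

    size_t j = 0;
    for (; j + 2 <= n; j += 2) {
        __m128d va = _mm_loadu_pd(a + j);
        __m128d vb = _mm_loadu_pd(b + j);
        _mm_stream_pd(c + j, _mm_add_pd(va, vb));
    }
    for (; j < n; j++)      /* odd tail element, ordinary store */
        c[j] = a[j] + b[j];

    _mm_sfence();           /* make the streamed stores globally visible */
}
```

Whether this helps depends on the array size: for arrays that fit in
cache, bypassing the cache on the store side can easily be slower.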

I've seen AMD asm code (wasn't it discussed here last year?) that did a
stream-type loop by L1 cache-blocking and going through everything three
times:

First with a set of explicit load operations that touched one byte/word
per cache-line, to force a maximum-speed memory load stream. Note that
using explicit prefetch opcodes instead would not work as well here!

Second, it did the actual fp operations on the in-cache data.

Lastly, it used integer (MMX) cache-bypassing stores in a tight loop to
flush the buffer to the target memory.

This is quite a long way from what any self-respecting compiler should
be forced to come up with on its own. :-)

Afair, the end result was a very useful (in the 2x to 4x range?) speedup.
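
The three passes described above can be sketched in C roughly as
follows. This is a reconstruction from the description, not the AMD
code: the name stream_add_blocked, the 64-byte line size and the block
size are all assumptions, and SSE2 _mm_stream_pd stands in for the MMX
cache-bypassing stores of the original:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_pd, _mm_stream_pd */
#include <stddef.h>
#include <stdint.h>

/* Assumed tuning values: 64-byte cache lines (8 doubles) and 8 KB of
 * each array per block, so that a, b and tmp blocks all fit in L1. */
enum { LINE_DOUBLES = 8, BLOCK = 1024 };

/* c[j] = a[j] + b[j], done block by block in three passes.
 * Assumption: c is 16-byte aligned. */
void stream_add_blocked(double *c, const double *a, const double *b,
                        size_t n)
{
    _Alignas(16) double tmp[BLOCK];     /* cache-resident result buffer */
    assert(((uintptr_t)c & 15) == 0);

    for (size_t base = 0; base < n; base += BLOCK) {
        size_t len = (n - base < BLOCK) ? n - base : BLOCK;
        const double *pa = a + base, *pb = b + base;
        double *pc = c + base;          /* stays 16-byte aligned */
        size_t j;

        /* Pass 1: one real load per cache line pulls the block in.
         * The volatile read keeps the compiler from dropping it. */
        for (j = 0; j < len; j += LINE_DOUBLES) {
            (void)*(const volatile double *)(pa + j);
            (void)*(const volatile double *)(pb + j);
        }

        /* Pass 2: the actual fp work, on data that is now in cache. */
        for (j = 0; j < len; j++)
            tmp[j] = pa[j] + pb[j];

        /* Pass 3: flush the buffer with cache-bypassing stores. */
        for (j = 0; j + 2 <= len; j += 2)
            _mm_stream_pd(pc + j, _mm_load_pd(tmp + j));
        for (; j < len; j++)            /* odd tail, ordinary store */
            pc[j] = tmp[j];
    }
    _mm_sfence();   /* make the streamed stores globally visible */
}
```

The block size is the knob to experiment with; no timing is claimed for
this sketch, and the 2x to 4x figure above refers to the hand-written
asm version.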

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

