From: Terje Mathisen <email@example.com>
Subject: Re: InfoWorld says Opteron cache is slow???
Date: Fri, 16 May 2003 15:21:23 +0200
Greg Lindahl wrote:
> In article <firstname.lastname@example.org>, <email@example.com> wrote:
>>I had great hopes for amazing memory bandwidth as a result of a 5.3 GB/sec
>>local memory bus + access to three 6.4 GB/sec HT links (each with a local 5.3 GB/sec
>>memory bus). But alas, with 1, 2 or 4 active memory busses and various
>>different size arrays I always seem to get 1.7 GB/sec or so on a
>>McCalpin stream loop:
>> for (j=0; j<N; j++)
>> c[j] = a[j]+b[j];
> AMD has made their Software Optimization Guide available to everyone.
> It gives useful assembly code for stream, much faster than 1.7 GB/s.
> I've always been surprised that compiler quality made such a huge
> difference for stream, which spends most of its time waiting for
> memory, but it does.
You get rid of 25% of the memory traffic almost automatically if you can
avoid the read-for-ownership before the store operations: the loop above
has two read streams and one write stream, but a normal store first pulls
the target cache line in from memory (the RFO), making four streams total.
Skipping the RFO drops one of the four, i.e. 25% of the traffic.
That is the 'easy' part.
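As a concrete illustration (my sketch, not the AMD code): with SSE the RFO can be avoided using non-temporal movntps stores, assuming 16-byte-aligned arrays and a length that is a multiple of 4.

```c
#include <emmintrin.h>  /* SSE/SSE2 intrinsics */
#include <stddef.h>

/* STREAM-style add, but with cache-bypassing (movntps) stores:
   the write to c[] no longer triggers a read-for-ownership, cutting
   total memory traffic from four streams to three.
   Assumes n is a multiple of 4 and all arrays are 16-byte aligned. */
void stream_add_nt(float *c, const float *a, const float *b, size_t n)
{
    for (size_t j = 0; j < n; j += 4) {
        __m128 va = _mm_load_ps(a + j);
        __m128 vb = _mm_load_ps(b + j);
        _mm_stream_ps(c + j, _mm_add_ps(va, vb));
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}
```

The sfence at the end matters: non-temporal stores are weakly ordered, so code that reads c[] afterwards needs the fence.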
I've seen AMD asm code (wasn't it discussed here last year?) that did a
stream-type loop by L1 cache-blocking, going through each block in three
passes:
First, a set of explicit load operations touched one byte/word per
cache line, to force a maximum-speed memory load stream. Note that
using explicit prefetch opcodes instead would not work as well here!
Second, it did the actual fp operations on the in-cache data.
Lastly, it used integer (MMX) cache-bypassing stores in a tight loop to
flush the buffer back to the target memory.
This is quite a long way from what any self-respecting compiler should
be forced to come up with on its own. :-)
AFAIR, the end result was a very useful (in the 2x to 4x range?) speedup.
"almost all programming can be viewed as an exercise in caching"