From: email@example.com (John R. Mashey)
Subject: Re: L2 on-chip or off-chip cache?
Date: 9 Aug 2000 03:45:25 GMT
In article <firstname.lastname@example.org>,
email@example.com (Jonathan Thornburg) writes:
|> In article <firstname.lastname@example.org>,
|> John R. Mashey <email@example.com> wrote:
|> [[as always, many clear and cogent comments]]
|> >With every process improvement, it gets more practical to put bigger
|> >SRAM caches on-chip, but it is premature to claim that off-chip SRAM
|> >is not practical. As I noted in an earlier posting, I understand HP's
|> >decision to use a 1-level on-chip cache, but most other designers
|> >continue to use off-chip caches. When we get to have 4MB-8MB of
|> >on-chip cache with reasonable yields, in high-end micros,
|> >off-chip SRAM might go away, but not before then.
|> I'd expect off-chip "N+1-th level" cache to persist even then
|> (at least for high-end systems where the cost is tolerable):
|> * Cache "working set" sizes certainly aren't getting any smaller,
|> and may be getting larger in "modern" bloat^Wsoftware
|> * The same Moore's-law improvements in circuit density which will give
|> us those 4-8 meg on-chip caches, will also permit (say) 64-256 meg
|> or more off-chip caches at modest costs (say comparable to today's
|> 2-8 meg off-chip caches)
|> * Shrinking cpu clock cycle time
|> ==> growing DRAM latency time when measured in cpu cycles
|> (a.k.a. the "memory wall")
|> ==> larger penalty-in-clock-cycles for missing in the on-chip cache(s)
|> ==> need lower miss-all-the-way-to-DRAM rate to keep from hitting
|> the memory wall (this is just Amdahl's law at work)
|> ==> want bigger cache(s)
As always, there are all kinds of tradeoffs, and it depends on the
kind of code that you have. At the risk of over-generalizing:
(1) There are relatively few codes that don't get some help from
I-caches, even ones of relatively small size. There are codes that
need really big I-caches (OS's, DBMS come to mind), and there are a few
miserable codes for which the only hit-rate comes from bringing in multiple
instructions in a cache miss, i.e., there is zero other re-use.
[For example, some simulation codes are generated by translators,
and consist of an immense straight-line set of instructions, wrapped in
a giant loop bigger than anyone has built an I-cache so far. I've seen
ECAD simulations like that.]
However, by-and-large, I-caches are good.
(2) The story is much more mixed for data-caches.
But I think there are basically 3 cases, of which the middle is the
most common:
(2a) Caches are essentially worse-than-useless.
(2b) Bigger caches help, but the bigger they get, the less they help,
and the issue is cost/performance tradeoffs.
(2c) A problem gets good performance once the cache size reaches X,
but for that problem, making the cache 2X doesn't help at all.
(2a) Covers applications like "hash this into a GB table, with accesses
random", i.e., isomorphic to some NSA problems. For such applications,
the only thing that counts is the latency to memory for a cache
miss to retrieve one word, and people who can usefully bypass caches, do.
Some vector codes certainly fit this as well, especially
ones that are naturally non-unit-stride.
(2b) Covers most codes that most people use, with relatively unpredictable
memory reference patterns, but with enough spatial and temporal locality
that caches help. After that, one can argue and simulate among the various
choices of multi-level hierarchies versus one-level, direct-versus-associative,
etc, etc, but in general, the performance, for a given machine, as the
problem gets larger, will look like the familiar stair-step falling off
as the problem overflows each level of cache. Many SPEC codes fit this
model, and of course, all vendor benchmarkers know that the finest thing
is to have a bigger cache than your competitor, and have a benchmark that
has a good hit rate in your cache, and misses miserably in theirs,
for that specific size. Of course, if you have a smaller (and perhaps faster)
cache, what you want is a benchmark that either fits in your cache, so the
competitor gets no advantage, or is so huge it fits in nobody's cache :-)
(2c) Shows up in vector/matrix codes, where the compiler is doing
cache-blocking. If the cache is big enough, and blocking works well
enough to have driven the memory overhead reasonably low, making the
computation GFLOPS-bound anyway, then making the cache bigger won't
necessarily help, and also, running bigger problem sizes doesn't actually
hurt, if the problems still block well.
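The cache-blocking in (2c) can be sketched with the standard blocked
matrix multiply (the matrix size N and block size B below are
illustrative; a real compiler or library would pick B so that three
BxB tiles fit in the cache level being targeted):

```c
#include <string.h>

#define N 64   /* illustrative matrix dimension */
#define B 16   /* illustrative block size; 3 BxB tiles should fit in cache */

/* Once the BxB tiles fit in cache, each loaded element is reused B
 * times, so the computation becomes flops-bound: a bigger cache adds
 * little, and bigger N doesn't hurt as long as the tiles still fit. */
void matmul_blocked(const double *a, const double *b, double *c)
{
    memset(c, 0, (size_t)N * N * sizeof *c);
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                /* multiply one BxB tile pair into the output tile */
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++)
                        for (int j = jj; j < jj + B; j++)
                            c[i*N + j] += a[i*N + k] * b[k*N + j];
}
```

The reuse factor depends on B, not on the total cache size, which is why
performance plateaus once the cache reaches the "X" above and doubling it
to 2X buys nothing.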
So far, there is evidence that some workloads really like 4M-8MB
external caches (and set-associative); since we don't yet have 64MB-256MB
caches, we don't really have data on their added value, especially if
they add latency for complete cache misses.
-John Mashey EMAIL: firstname.lastname@example.org DDD: 650-933-3090 FAX: 650-933-2663
USPS: SGI 1600 Amphitheatre Pkwy., ms. 562, Mountain View, CA 94043-1351
SGI employee 25% time, local consulting elsewise.
From: email@example.com (McCalpin)
Subject: Re: Status of EV7
Date: 12 Feb 2001 21:12:45 GMT
In article <firstname.lastname@example.org>,
Nick Maclaren <email@example.com> wrote:
>In article <firstname.lastname@example.org>,
>McCalpin <email@example.com> wrote:
>>In article <firstname.lastname@example.org>,
>>Nick Maclaren <email@example.com> wrote:
>>>The Hitachi SR2201 has 4 MB/sec per MFlop, all the way from main
>>>memory (actually it bypasses cache in pseudovectorising mode).
>>>Experience is that this was more than adequate (except for a few
>>>inner loops), but that dropping below 2 starts to be a major
>>Only if you don't have large caches (or don't know how to use them).
>Or your application isn't easily blockable (whether for algorithmic
>or structural reasons), as you point out later. And remember that
>this thread was about 'array-based' codes, which very often operate
>on quite large arrays. They can often be blocked, but it may make
>them MUCH harder to understand, debug and maintain.
Caches work better than most people expect even without cache blocking.
The incompressible CFD codes that wanted up to 1.4 MB/s per MFLOPS
were definitely not blocked and did not use cache-friendly
preconditioners for the pressure equation solvers.
The data in my Figure 4 includes many scientific and engineering
codes with no explicit blocking, including one global weather
model (CCM3.2, which wanted approximately 0.13 MB/s per MFLOPS)
and one local area weather model (MM5v2, which wanted approximately
0.11 MB/s per MFLOPS).
I do not believe that the Eigenvalue/Modal analysis codes or the
Petroleum Reservoir codes were cache-blocked, and these appeared
to demand only up to about 0.5 MB/s per peak MFLOPS.
The computational chemistry codes used as little as 0.02 MB/s per
MFLOPS. These are not "blocked" in the usual sense, though they
deliberately use algorithms that are computationally intensive
rather than memory intensive.
>But that isn't really the issue. If you are using a cache-based
>machine, all the difference that it makes is those figures apply
>to your CPU to cache bandwidth. And a horrific number of system
>designers seem to forget this fact.
Can you point to any designers in particular? I don't know
any who are not aware of the importance of cache performance.
John D. McCalpin, Ph.D. firstname.lastname@example.org
Senior Scientist IBM POWER Microprocessor Development
"I am willing to make mistakes as long as
someone else is willing to learn from them."