From: firstname.lastname@example.org (John R. Mashey)
Subject: Re: Sorting in Hardware [really: time for updating Knuth V3, etc]
Date: Tue, 22 Mar 1994 20:49:33 GMT
In article <CMzK5y.26L.email@example.com>, firstname.lastname@example.org (Donald Lindsay) writes:
|> I wouldn't bother. If you read DEC's tech report on the Alpha sort
|> benchmark, you'll notice that it is not CPU limited. (And faster
|> Alpha chips are coming Real Soon.)
|> The interesting thing in the tech report was that cache misses
|> represented a bigger bottleneck than the IO did. So, if you want
|> to research sorting, perhaps you should work on cache-friendly
This is an instance of a general problem: algorithm and data-structure
analysis (especially of searching and sorting)
has not tracked current hardware. I include below a note prompted
by dealing with people scaling up hash algorithms without worrying about
cache effects. People looking for article or M.S. thesis topics, take note:
we need a whole lot of these to update the literature.
Algorithm Analyses *Must Change* to Model Current Processors
"Time to update Knuth Volume 3" :-)
John R. Mashey
A great deal of traditional algorithm analysis needs serious
extension to cope with current and likely-future processors.
Many of the traditional and long-known results need to be modified,
especially for searching and sorting methods.
Traditional analyses of in-memory algorithms usually assume that:
- memory is "about" the same speed as the CPU, or close.
- items in memory have time- and location- invariant access speeds.
- items in memory have access speeds invariant with regard to
the total number of items and their reference patterns.
These assumptions were reasonable approximations on early-1980s
microprocessors, when CPU and DRAM cycle times were much closer,
but they are outdated now, and they need to be extended to
better match current computers. Classic analyses count table probes,
compares, etc ... but these days, counting *cache misses* is much
likelier to be useful.
Most current computer systems use substantial cache memory hierarchies.
This includes mainframes, superminis, the popular RISCs,
some PCs built with Intel 386 or Motorola 68030s, and all systems
built with Intel 486 (or later) and Motorola 68040 (or later).
In cached systems:
- CPUs are rapidly increasing in speed, much faster than main memory.
It is already possible, in faster multiprocessors, for a cache
miss to cost 100 or more CPU cycles. This effect is likely to get
worse, not better.
- Memory access times are affected by spatial locality, that is,
accessing two pieces of data close together may be much faster than
accessing two that are further apart. For example, simple hashing
with linear search may perform *better* than one would expect. Even
odder, for small tables, it may be better to do linear searches,
touching only a few cache lines, than to hash at all.
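To make the small-table case concrete, a minimal C sketch (the 16-entry
size and 64-byte line are illustrative assumptions, not measurements):

```c
/* A hypothetical small table: 16 ints are 64 bytes, i.e. roughly
 * one cache line on many current machines, so a plain linear scan
 * touches only a line or two -- no hash computation needed. */
#define SMALL_N 16

static int small_table[SMALL_N];

/* Linear search: returns the index of key, or -1 if absent. */
int small_find(int key)
{
    for (int i = 0; i < SMALL_N; i++)
        if (small_table[i] == key)
            return i;
    return -1;
}
```

Past a few dozen entries the scan cost dominates and hashing wins
again; where the crossover sits is machine-dependent.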
- Memory access times are affected by temporal locality, that is,
if an item has been referenced "recently", it is much faster to
re-reference it. For example, in searching a tree, frequent accesses
to items near the root of a tree are almost free.
The presence of caches affects sorting and searching somewhat
differently. Sort routines likely have more opportunity to shape
their own cache behavior than do search functions, which must cope
with potential cache flushing by other code outside their control.
You might prefer one kind of search routine if you think it is called
often enough for some of its data to stay cache-resident, and prefer
another if you expect enough other cache activity to have flushed its
data.
- Access times can be affected by the total amount of data accessed,
that is, large amounts of data can cause cache-thrashing that
adds substantial overhead beyond that of the algorithmic complexity.
For example, traditional analysis shows that it is better to hash
items into a larger hash table to reduce collisions, but it may well
be much better to keep a table smaller, so that it stays in the cache,
and do a few more probes.
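A sketch of that trade-off in C: a deliberately small open-addressing
table with linear probing. The size and probing scheme are illustrative
choices, not from any specific system; the point is that the extra
probes land in lines that are probably already cached.

```c
/* Small power-of-two table: 64 ints = 256 bytes, a handful of
 * cache lines that can plausibly stay resident between calls. */
#define TABLE_SIZE 64
#define EMPTY      (-1)

static int slots[TABLE_SIZE];

void table_init(void)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        slots[i] = EMPTY;
}

/* Linear probing: a collision costs another probe, but usually in
 * the same or the next cache line, not a fresh miss. */
int table_insert(int key)
{
    unsigned h = (unsigned)key % TABLE_SIZE;
    for (int probe = 0; probe < TABLE_SIZE; probe++) {
        unsigned i = (h + probe) % TABLE_SIZE;
        if (slots[i] == EMPTY || slots[i] == key) {
            slots[i] = key;
            return (int)i;
        }
    }
    return -1;                 /* table full */
}

int table_contains(int key)
{
    unsigned h = (unsigned)key % TABLE_SIZE;
    for (int probe = 0; probe < TABLE_SIZE; probe++) {
        unsigned i = (h + probe) % TABLE_SIZE;
        if (slots[i] == EMPTY)
            return 0;
        if (slots[i] == key)
            return 1;
    }
    return 0;
}
```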
People who deal with scientific codes on cached machines have noticed
the drastic effects on speed caused by ignoring the existence of caches,
and there is even compiler support for reworking code to use them
better. [KAP, etc] It seems time for cache hierarchies to start getting
modeled into other important algorithms.
This will not be easy, but needs to be done to improve traditional
results that no longer fit, and could easily generate hundreds of
good papers. For example, one could start with traditional analyses
of searching and sorting techniques, then start asking questions like:
- Given algorithm S1, and memory system design D1, and problem size N,
show timing analyses as N varies, including comparisons with the
traditional architecture D0 (simple, fast memory, no cache).
- Compare algorithms S1 and S2 as above.
- Compare algorithm S1 as above, but for architectures D1 and D2
with different cache structures, cache sizes, memory latencies, etc.
At a minimum, one would like to compare single-level, medium-size
caches against 2-level caches, as both designs are heavily used.
One might ignore instruction-vs-data clashes for machines where they
can happen, although this should be checked.
- How detailed does the memory system model have to be? Can one get
good predictive results with simple models, or must one run
simulations in great detail?
- Both cached and uncached machines are sensitive to inherent
algorithmic complexity effects of size. How strong are the
additional size effects on cached machines?
- Processors are designed using data derived from running code,
with the CPU then tuned to perform well across a range of codes.
Can algorithms likewise be designed to cater to a range of CPU designs?
- Is there any help from compilers for these kinds of algorithms?
Does compiler-controlled prefetching help at all?
Does hardware data-prefetching?
- Can one return to some of the oldest models, where main memory
was very small, and sorts and searches were quite aware of
disk and/or tape? Do any of those give insight into processor
designs with longer memory latencies?
As it stands, the programmer who follows the standard analyses
that have worked for years may well encounter serious surprises on
the heavily-cached processors that we now use.
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
DDD: 415-390-3090 FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311
From: email@example.com (John R. Mashey)
Subject: Re: Efficient execution in the year 2000
Date: 27 Oct 1997 03:51:50 GMT
In article <firstname.lastname@example.org>, email@example.com (David
|> (assuming the five ranges are equally likely). The advantage is that
|> your tree can be expected to be less deep by a factor of about
|> log(4)/log(5) = 0.86..., and thus that you can expect to access 14%
|> fewer nodes and encounter correspondingly fewer cache misses on
|> average. With a high (cache miss rate) * (cache miss penalty) value,
|> this could easily compensate for the somewhat higher per-node
|> resolution cost.
|> For longer cache lines, adjust the base of the search accordingly.
This is going in the right direction, for sure.
1) I've more than once complained that the typical undergraduate
courses in "Algorithms and Data Structures" have become obsolete.
Many such textbooks don't explain the effects of memory hierarchy,
and many don't even list "cache memory" in the index. For example,
KNUTH Volume 3: Searching & Sorting seems way out-of-date.
(Hennessy says Knuth is rewriting it ... but I've heard *that* before :-)).
2) There is room for a ton of MS theses to reanalyze the classic
algorithms in the light of current and expected memory hierarchies.
3) Of the various things that programs do, 2 are real killers:
a) Following a chain of pointers with cache misses.
b) Multi-way branching (like a switch table), or
loading a pointer to a function and calling it.
It is bad enough to have a) with cache misses, and b) is even worse,
in that even a speculative-execution chip has great difficulty guessing
where a jump is going and doing anything useful.
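Killer a) is easy to see in code. Both routines below compute the same
sum (a made-up example); the array version presents predictable
addresses the hardware can prefetch, while the list version cannot
issue the next load until the previous one returns, so any miss stalls
the whole chain:

```c
#include <stddef.h>

struct node { int value; struct node *next; };

/* Array walk: addresses are predictable; prefetch can hide misses. */
int sum_array(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer chase: each load depends on the previous one, so a cache
 * miss on any node serializes the traversal. */
int sum_list(const struct node *p)
{
    int s = 0;
    for (; p != NULL; p = p->next)
        s += p->value;
    return s;
}
```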
4) A few rules of thumb:
a) Count cache misses, not just ALU & load/store.
b) Wider structures are better than deeper, if the widths can be
tuned to cache linesize (David's comments above).
c) Current o-o-o CPUs can unroll loops for you, but can't outguess
indirect pointer fetches.
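Rule b) in miniature: a search-tree node sized so that one miss fetches
a whole batch of keys. The 64-byte line and 15-key layout are
assumptions for illustration:

```c
/* One node = 16 ints = 64 bytes: a single cache line (assuming
 * 64-byte lines and 4-byte ints).  One miss buys a 16-way branch
 * instead of a 2-way one, so the tree is much shallower. */
#define KEYS_PER_NODE 15

struct wide_node {
    int nkeys;                 /* how many of key[] are in use */
    int key[KEYS_PER_NODE];    /* sorted ascending */
};

/* Returns which child to descend into: the count of keys <= probe.
 * The whole decision reads exactly one cache line. */
int wide_branch(const struct wide_node *n, int probe)
{
    int i = 0;
    while (i < n->nkeys && n->key[i] <= probe)
        i++;
    return i;
}
```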
5) Temporal & spatial locality: some strange interactions,
even in uniprocessors (ignore MP for now):
a) For some algorithms, you have a chance to "own" the cache
for a while, i.e., the faster sorts right now get tuned to the
cache sizes, since you call the sort, and during the actual sort
phase, there doesn't have to be too much else going on.
b) At the other extreme, no amount of caching helps: if you
must hash into a billion-element table, you take plenty of
cache misses, and they don't go away.
c) In the middle, your results and algorithm choices depend on
the rate of cache interference from other code, so it is useful
to understand how often you are called and how many conflict
misses other code is likely to generate. For example, you might
use a hash table for a modest-sized table that is used often,
since there is a good chance it will end up cache-resident, but
you might prefer a B-tree or similar structure for tables much
larger than the cache, where you at least have some hope of
keeping the intermediate nodes cached. Some symbol-lookup
functions fit in c), as do parts of OS code.
6) Of course, object-oriented code and CPU design seem to conflict a bit :-)
-john mashey DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: firstname.lastname@example.org DDD: 650-933-3090 FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389