From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Sun breaks another SPEC benchmark
Date: Fri, 21 Mar 2003 22:32:02 +0000 (UTC)
Message-ID: <b5g3t2$ima$1@ausnews.austin.ibm.com>

In the ongoing conflict between static benchmarks and the need
for marketing collateral, another SPEC benchmark seems to have
just lost a lot of its traditional usefulness....

SPECfp_rate2000 results for the Sun V1280 show that the addition
of a new flag (-Atile=skewp) boosts performance on the "171.swim"
benchmark by almost a factor of five.  Although not quite as
dramatic as the now infamous "179.art" hack, this optimization
will make it even more difficult to interpret SPECfp_rate2000
results.

Obviously, I don't know exactly what Sun's compiler is doing with
this benchmark, but I think that I probably understand the
performance characteristics of this set of equations about as
well as anyone in the world, and their range of options is quite
limited.

I know of two families of optimization techniques that can
provide large boosts to performance of 171.swim.   One of
those families seems to be limited to small constant-factor
speedup, and I can't see how to get much past 2x.   The other
class of optimizations could completely eliminate the
memory bottleneck and turn 171.swim into a completely cpu-bound
program.  Although the technique goes under many names, it is
called "time-skewing" in a Rutgers University technical report
by David Wonnacott and John McCalpin.  A convenient citation is
	http://citeseer.nj.nec.com/mccalpin99time.html

Having been thinking about this class of optimizations since
the mid 1990's, I think that I am in a fair position to judge
the general implementation of such techniques to be very
difficult.   Implementing the technique for the equations
solved by 171.swim is hard -- implementing the technique for
the specific source code representation of 171.swim is heroic
(or masochistic?).   Many special case analyses are needed to
apply time-skewing to the particulars of 171.swim, and I am
not sure if I should be in awe of the dedication of the Sun
compiler writers or in awe of their desperation.

Although the time-skewing optimizations are valid, potential
customers should ask very hard questions about the robustness of
such optimizations for different source code.   If experience is
a guide, these optimizations will be very brittle, and it will be
virtually impossible to tell whether a particular piece of source
code will meet the criteria for the compiler to perform this
complex set of optimizations.

So now when reviewing Sun SPECfp_rate2000 results, you have
to factor in the 179.art hack (which gave a 7.1x boost on
the early 900 MHz Sun systems) and this optimization, which
gives a 4.6x boost in the 171.swim rate on the 900 MHz V1280
system.   The combined boost in the SPECfp_rate2000 score is
the fourteenth root of the product of these two speedups,
which puts the score about 28% above what it would be without
these two optimizations.   This is no longer a little glitch
that customers can afford to ignore.
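
(As a sanity check on the arithmetic: SPECfp_rate2000 is the geometric
mean of the rates on the 14 component benchmarks, so the combined
effect of the two speedups on the overall score is roughly
	(7.1 * 4.6)^(1/14) = 32.66^(1/14) ~= 1.28
i.e. about 28%.)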
--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Mon, 24 Mar 2003 00:45:32 +0000 (UTC)
Message-ID: <b5lkfc$ef8$1@ausnews.austin.ibm.com>

In article <qnPea.196812$qi4.89715@rwcrnsc54>,
Glen Herrmannsfeldt <gah@ugcs.caltech.edu> wrote:
>
>"McCalpin" <mccalpin@gmp246.austin.ibm.com> wrote in message
>news:b5g3t2$ima$1@ausnews.austin.ibm.com...
>> In the ongoing conflict between static benchmarks and the need
>> for marketing collateral, another SPEC benchmark seems to have
>> just lost a lot of its traditional usefulness....
>>
>> SPECfp_rate2000 results for the Sun V1280 show that the addition
>> of a new flag (-Atile=skewp) boosts performance on the "171.swim"
>> benchmark by almost a factor of five.
>(snip)
>
>[...] For us who
>don't know the 171.swim algorithm could you explain a little why
>you think it is hard to apply time-skewing optimization?

This would be quite difficult to explain concisely, but I will
give it a try....

Consider a collection of two-dimensional arrays, and an iterative
process that updates those arrays based on "spatial" averages and
"spatial" differences.  In each step, the "new" value at location "i,j"
(that would be [j][i] for you C folks out there), is calculated
based on the "old" values at points that might be offset as much
as one index value in either direction for each of "i" and "j".

Although the code does not use a separate index for "time"
(I will call it "k"), you can write this as
	Q(i,j,k+1) = Function(Q(%,@,k),Q(%,@,k-1))
where "%" means i,i+1, or i-1, and @ means j,j+1, or j-1.

In the shallow water equations, the "Function" above involves
performing only a few arithmetic operations for each data element
read from memory or written to memory.   The exact number of arithmetic
operations per point depends a bit on the common sub-expression
elimination of the compiler, but in subroutine "CALC1" of swim,
there are 24 floating-point operations associated with accesses to
a total of 7 arrays -- so we are looking at something like 3 floating-
point operations per load or store from main memory.   The main memory
accesses are all set up to be contiguous, but this is still about
an order of magnitude more bandwidth per floating-point operation
than most modern systems can provide, so the code is strongly
memory-bandwidth dominated on most systems.  (This class of codes
is very happy on high-bandwidth vector systems.)
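
To put a rough number on that (assuming 8-byte reals -- halve the
traffic for 4-byte reals -- and counting one memory access per array
reference, which is only approximately what happens in real hardware):
	24 flops / (7 arrays * 8 bytes) ~= 0.43 flops per byte
or, turned around, roughly 2.3 bytes of memory traffic per
floating-point operation.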

"Time-skewing" is basically a transformation of variables that
changes Q(i,j,k) to P(i,j+k,j-k).  The loops can then be transformed
to these new variables.   The advantage of this technique is that
it allows you to effectively calculate many time steps while reading
the arrays from memory (and writing them back to memory) only once.
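
The idea is easiest to see in one space dimension.  Here is a minimal
sketch in C of a generic 3-point stencil (made-up coefficients, fixed
boundary values) -- emphatically not the swim code, and not what
Sun's compiler generates -- showing how a cache-sized window can be
carried through all the time steps, so that data is read from main
memory roughly once for the whole calculation instead of once per
time step:

    #include <stdio.h>
    #include <stdlib.h>

    #define MIN(a,b) ((a) < (b) ? (a) : (b))
    #define MAX(a,b) ((a) > (b) ? (a) : (b))

    /* Straightforward ordering: one full pass over the array per time
     * step, so all of the data moves through memory "nsteps" times. */
    static void sweep_naive(double *a, double *b, int n, int nsteps)
    {
        for (int t = 0; t < nsteps; t++) {
            double *src = (t % 2 == 0) ? a : b;
            double *dst = (t % 2 == 0) ? b : a;
            for (int i = 1; i < n-1; i++)
                dst[i] = 0.25*(src[i-1] + 2.0*src[i] + src[i+1]);
            dst[0] = src[0];          /* fixed boundary values */
            dst[n-1] = src[n-1];
        }
    }

    /* Time-skewed ordering: a window of width "blk" is carried through
     * all "nsteps" time steps before moving on.  The window slides one
     * point to the left per step, so every value it needs was produced
     * either earlier in the same window or by the previous window. */
    static void sweep_skewed(double *a, double *b, int n, int nsteps, int blk)
    {
        b[0] = a[0];                  /* boundaries live in both buffers */
        b[n-1] = a[n-1];
        for (int ii = 0; ii < n-1 + nsteps; ii += blk) {
            for (int t = 0; t < nsteps; t++) {
                double *src = (t % 2 == 0) ? a : b;
                double *dst = (t % 2 == 0) ? b : a;
                int lo = MAX(1, ii - t);
                int hi = MIN(n-1, ii - t + blk);
                for (int i = lo; i < hi; i++)
                    dst[i] = 0.25*(src[i-1] + 2.0*src[i] + src[i+1]);
            }
        }
    }

    int main(void)
    {
        enum { N = 1000, T = 20, BLK = 64 };
        double *a1 = calloc(N, sizeof(double)), *b1 = calloc(N, sizeof(double));
        double *a2 = calloc(N, sizeof(double)), *b2 = calloc(N, sizeof(double));
        for (int i = 0; i < N; i++) a1[i] = a2[i] = (i == N/2) ? 1.0 : 0.0;

        sweep_naive(a1, b1, N, T);
        sweep_skewed(a2, b2, N, T, BLK);

        /* the two orderings should produce identical results */
        double *r1 = (T % 2) ? b1 : a1, *r2 = (T % 2) ? b2 : a2;
        double maxdiff = 0.0;
        for (int i = 0; i < N; i++) {
            double d = (r1[i] > r2[i]) ? r1[i] - r2[i] : r2[i] - r1[i];
            if (d > maxdiff) maxdiff = d;
        }
        printf("max difference between the two orderings: %g\n", maxdiff);
        free(a1); free(b1); free(a2); free(b2);
        return 0;
    }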

The actual implementation is horrendously more complex than I have
described because the update of the variables is split across three
subroutines, with complex updates of the elements at the edges at
the end of the first and second of those routines.   There is also
a global reduction (sums of absolute values of the elements of three
of the arrays) that takes place between the second and third
subroutines.

The general time-skewing algorithm can operate in multiple spatial
dimensions, but you probably only need to transform one spatial
dimension to get the required speedup here.   In multiple dimensions,
time-skewing could be enough to allow you to run the shallow water
equations from disk (for huge arrays) with negligible slowdown.
--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Mon, 24 Mar 2003 00:58:49 +0000 (UTC)
Message-ID: <b5ll89$bvo$1@ausnews.austin.ibm.com>

In article <3e7d0d11$1@news.meer.net>, Greg Lindahl <lindahl@pbm.com> wrote:
>
>You're making a big assumption here -- while the weather community
>doesn't use shallow water anything these days, it was my impression
>that people studying things like the Chesapeake Bay still use shallow
>water models. Hopefully John will speak up if I'm wrong, but the
>general point is that a code that's obsolete in one field might be
>current in another.

Some shallow water modelling is certainly still done for nearshore
environments and lakes.   Although the theoretical foundations of
time-skewing are based on the hyperbolic nature of the equations
(and are therefore applicable to any implementation), I think it
quite unlikely that any automatic system could "do the right thing"
for shallow water equations on irregular or unstructured meshes
(which would be more representative of current applications).

When I was working on my Ph.D. in the middle 1980's, my research
group drew a fair amount of criticism for continuing to use shallow
water models.   So this level of sophistication was considered
antique almost twenty years ago.  It is still applicable to some
specific theoretical problems (and I used a simple shallow water
model in a paper in 1995 with no objections), but I would certainly
not contend that these applications actually drive the purchase
of very many computers.

The 171.swim benchmark in SPEC is really a toy: in the complexity
of the geometry (square), the complexity of the grid (square),
and the complexity of the physics (none).   Enhancements to the
usefulness of the model in any of those directions would almost
certainly prevent automatic application of the time-skewing
transformations.

--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Mon, 24 Mar 2003 13:07:39 +0000 (UTC)
Message-ID: <b5mvur$bb4$1@ausnews.austin.ibm.com>

In article <Lvwfa.196539$L1.33670@sccrnsc02>,
Glen Herrmannsfeldt <gah@ugcs.caltech.edu> wrote:
>
>It sounds sort of like a partial differential equation solver, though seven
>arrays is more than I usually see.

The shallow water equations are a set of three coupled equations,
and the resulting system is hyperbolic (which basically means that
information propagates at finite speeds, or in this case, a finite
number of grid points per time step).


>I have worked on Poisson solvers before, and thought some about how
>the access patterns affect the timing.

Poisson's equation is elliptic (which means that information
travels at infinite speed through the solution).  Many classic
iterative elliptic equation solvers for Poisson's equation
look "hyperbolic" (in the sense that information propagates
through the solution at a finite number of grid points per
iteration), and the time-skewing algorithm is perfectly
applicable to this class of algorithm.  The TOMCATV benchmark
from SPECfp95 was an elliptic equation solver for a technique
called automatic grid generation.


>I can almost imagine a compiler time skewing a two dimensional
>Poisson solver, but seven arrays would make it significantly harder.

There are nine primary arrays for the three solution variables
at each of the three time levels required.  Then there are four
additional arrays used to carry temporary values between the
first and second computational subroutine.  A 14th array is used
only by the initialization routine, though there is no reason
why additional storage needed to be allocated for this purpose.
All of the arrays sit in contiguous global storage (Fortran "blank
common").



>If it had to do it across function calls, it would seem it would have to
>inline the function first.

Yes, but the really ugly stuff is figuring out how to account for the
manipulation of the edge elements at the end of subroutines CALC1 and
CALC2.   The code runs around the edges and sets the first value in
each row/column equal to the last value (or vice versa, depending on
which array it is).  It is also ugly figuring out how to handle the
global reduction that occurs between CALC2 and CALC3.  Finally, it
is ugly to do the transformations to even find this stuff, since
the whole iteration loop is based on GOTO (rather than an explicit
loop), with the IF test for exit happening between CALC2 and CALC3
and with an alternate version of CALC3 called the first time through.
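
Schematically, the driver structure described above looks something
like this (a paraphrase in C; the real benchmark is GOTO-based
Fortran, and all of the names below are illustrative):

    /* Hypothetical stand-ins for the real routines: */
    extern void calc1(void);          /* update + wrap edge rows/columns     */
    extern void calc2(void);          /* update + wrap edge rows/columns     */
    extern void calc3(void);
    extern void calc3_first(void);    /* special first-trip version of CALC3 */
    extern double global_sums(void);  /* |.| sums over three of the arrays   */
    extern int itmax;

    void driver(void)
    {
        int ncycle = 0;
    loop:
        ncycle = ncycle + 1;
        calc1();
        calc2();
        if (ncycle >= itmax)          /* exit test between CALC2 and CALC3   */
            return;
        (void) global_sums();         /* reduction also between them         */
        if (ncycle == 1)
            calc3_first();
        else
            calc3();
        goto loop;
    }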

I can easily *imagine* implementing all of this stuff, but I cannot
imagine implementing coverage of all the possible combinations of
the ways in which the shallow water equations could have been
implemented.  Sort of like a binary tree with several levels.   A
vendor may not be required to implement *all* paths through the
tree in order to have a "general-purpose" optimization, but how
many is enough to avoid the label "benchmark special"?
--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Tue, 25 Mar 2003 03:09:38 +0000 (UTC)
Message-ID: <b5oh9i$jo2$1@ausnews.austin.ibm.com>

In article <hxMfa.204992$L1.35404@sccrnsc02>,
Glen Herrmannsfeldt <gah@ugcs.caltech.edu> wrote:
>
>So the arrays are one after another in storage, and the corresponding
>element of each, and neighbors of that element, needs to be accessed in
>evaluating each point.  I don't know if it would help in this case, but
>there was a discussion some time ago, possibly C related, about the ability
>of a compiler to rearrange such storage so that related elements are stored
>together.  (In C, an array of structures instead of structure of arrays.)

That would not help here, though it can be a useful optimization
for some other coupled solvers.


>Well, the boundary is O(N), and the interior O(N**2).  If you can optimize
>the interior, you win (for sufficiently large N).   I believe one of the
>other posts said it is a rectangular problem.  That makes it much easier.

The trouble is that at each time step, the stuff that was "boundary"
at the previous time step now propagates its information further
into the interior.

In two dimensions, imagine blocking the 2D array into smaller chunks.
Then at each time step, the sub-block that you work on gets one
element smaller at each end of each dimension.  Eventually you run
out of elements to work
on, and you have updated all the elements in a sort of "pyramid" in
the three-dimensional space composed of the two spatial dimensions plus
time.   Then you have to figure out an ordering for completing the
updates between the pyramids.    Upside-down pyramids are easy to
calculate, but right-side-up pyramids plus upside-down pyramids do
not fully tile the space -- there are still bits and pieces in between
that need to be updated.
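
Schematically (with hypothetical index bounds i0..i1, j0..j1 for the
current sub-block and a hypothetical routine that advances just that
range by one time step):

    /* One "pyramid": carry a sub-block forward in time, losing one
     * cell at each end of each dimension per step, because the outer
     * ring would need neighbor values that have not been computed yet. */
    extern void update_block(int i0, int i1, int j0, int j1, int t);

    void pyramid(int i0, int i1, int j0, int j1, int tmax)
    {
        for (int t = 0; t < tmax && i1 - i0 > 0 && j1 - j0 > 0; t++) {
            update_block(i0, i1, j0, j1, t);   /* advance [i0,i1) x [j0,j1) */
            i0 += 1;  i1 -= 1;
            j0 += 1;  j1 -= 1;
        }
    }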

There are variants based on rectangular (rather than square)
sub-domains.   If these span the original grid, then the book-keeping
is simpler, but the scheme may require a lot more cache memory.
--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Mon, 24 Mar 2003 19:49:53 +0000 (UTC)
Message-ID: <b5nnh1$6vi$1@ausnews.austin.ibm.com>

In article <b5nloq$fmp$2@vkhdsu24.hda.hydro.com>,
Terje Mathisen  <terje.mathisen@hda.hydro.com> wrote:
>McCalpin wrote:
>> Although the technique goes under many names, it is
>> called "time-skewing" in a Rutgers University technical report
>> by David Wonnacott and John McCalpin.  A convenient citation is
>> 	http://citeseer.nj.nec.com/mccalpin99time.html
>
>It took me a little while, but I believe I understand it:
>
>The optimization consists of blocking a set of array operations across
>time, and not just array space, right?

Sort of.   There are a variety of ways that you can order the
execution of the operations in the transformed space.  In one
case you are doing blocking in both space and time, but your
domain gets one grid point smaller on each end at each step.
Then when you are finished with this "pyramid", you still
need to go and calculate all the pieces that you missed.

An alternate ordering is to actually do the calculations along
"diagonal" vectors in the (implied) space-time grid.

The two approaches have different advantages and disadvantages
depending on the dimensionality of the problem and whether or not
you are attempting to solve the problem in parallel.



>This is sort of what I call speed of light: Starting with some initial
>situation, what is the minimum number of operations theoretically
>possible to get to the desired destination?

There is not a lot of opportunity to reduce the amount of
arithmetic, but there is nothing in the data dependence that says
that you have to stream all 13 arrays to and from main memory at
each step.  If you can figure out how to do a larger number of
steps at a single go, you can read the arrays once at the
beginning and write them once at the end of the "many-step"
process.

The code was widely distributed before SPEC began using it, and
I probably have a copy from my professor days.   I will see if
there is still a copy on my Linux box at home that I can post....
--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Tue, 25 Mar 2003 13:03:46 +0000 (UTC)
Message-ID: <b5pk3i$9na$1@ausnews.austin.ibm.com>

In article <b5p1e5$ai0$1@vkhdsu24.hda.hydro.com>,
Terje Mathisen  <terje.mathisen@hda.hydro.com> wrote:
>McCalpin wrote:
>> In one
>> case you are doing blocking in both space and time, but your
>> domain gets one grid point smaller on each end at each step.
>> Then when you are finished with this "pyramid", you still
>> need to go and calculate all the pieces that you missed.
>
>Anyway, for the algorithms where I've tried it, the pyramid surface
>quickly becomes so large as to negate any speedups for the internal
>calculations.
>
>This is even if I just try to disregard the edge conditions, i.e. doing
>all internal nodes over a number of iterations.

In the original performance experiments that Dave Wonnacott and I
did, the extra bookkeeping did slow down the base code by a small
constant factor.   So the experiment that Dave did was to scale
up the TOMCATV benchmark to overflow the available physical memory.
With the last level of the memory hierarchy now being painfully
slow, the bandwidth reduction from the time-skewing was able to
give something like a 30x performance boost over the original
code running to/from the disk.


As I said before, there are a number of orderings that can be
applied to the transformed set of equations.  "Pyramids" are
the best (if you can ignore edge effects), but other orderings
can still give a substantial speedup.
--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
From: lindahl@pbm.com (Greg Lindahl)
Message-ID: <3e8110e7$1@news.meer.net>
Date: Wed, 26 Mar 2003 02:33:28 GMT

In article <b5ll89$bvo$1@ausnews.austin.ibm.com>,
McCalpin <mccalpin@gmp246.austin.ibm.com> wrote:

>When I was working on my Ph.D. in the middle 1980's, my research
>group drew a fair amount of criticism for continuing to use shallow
>water models.

OK, so I'll give an example from my area, numerical hydrodynamics,
specifically compressible, supersonic flows. At the risk of
overgeneralization, time-explicit schemes represent the data as
piecewise constant, piecewise linear, or piecewise parabolic
quantities. As you get a more complicated representation, the
computation time per cell rises, but the errors are smaller, so you
can use fewer cells. So in 3 dimensions, piecewise parabolic is always
the cheapest in the end. One such algorithm is PPM, which is one of
the ASCI showcase codes.

So for a huge community, piecewise linear is obsolete. But there's
still a sub-community that uses it. One good reason is that sometimes
you have additional physics that's less accurate than the hydro. In
that case, it's not worth it to have much more accurate hydro than
other stuff, so it's pointless to spend the extra computation needed
for piecewise parabolic representations.

This was the case a decade ago for MHD (magnetohydrodynamics). I
don't know if it's still the case. And I'm pleased as punch that I
managed to not mention "order" in this entire posting :-)

greg



From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.benchmarks,comp.sys.super
Subject: Re: Sun breaks another SPEC benchmark
Date: Wed, 26 Mar 2003 14:17:49 +0000 (UTC)
Message-ID: <b5scqd$i2o$1@ausnews.austin.ibm.com>

In article <rfaga.4998$Q57.312734840@newssvr21.news.prodigy.com>,
Andy Glew <andy_glew_public@yahoo.com> wrote:
>> Some shallow water modelling is certainly still done for nearshore
>> environments and lakes.   Although the theoretical foundations of
>> time-skewing are based on the hyperbolic nature of the equations
>> (and are therefore applicable to any implementation), I think it
>> quite unlikely that any automatic system could "do the right thing"
>> for shallow water equations on irregular or unstructured meshes
>> (which would be more representative of current applications).
>
>Hmm...  It seems to me that the time skewing optimization could be fairly
>easily applied to "move forward" the computation for any subset of
>the mesh.  The border keeps growing.

The trouble is that on unstructured meshes, the neighbors are
only known at run-time.  At compile time there is no way to know
whether elements that are "near" each other in the mesh are also
"near" each other in memory, and vice-versa.

For irregular meshes (typical of all current production ocean
models) the neighbors are known at compile time, but whether a
cell is "wet" or "dry" is only known at runtime.   If you are
willing to do the calculations at all of the points and then
throw away the "dry" cells after you have completed the update,
then this would work.  The bad news is that for typical global
cases, fully 1/2 of the data points would be thrown away after
each step.

A more serious problem in ocean models is that there is typically
separate time-stepping for the vertically averaged flow (either
with short time steps or with long time steps using an implicit
solver --- typically conjugate gradient) and the vertically
varying flow (with longer time steps).

Time-skewing could be applied to the vertically averaged flow if
using short (explicit) time steps, or it could be applied to
certain classes of iterative solvers (but definitely not
conjugate gradient, or any other non-stationary Krylov scheme).
Time-skewing could also be applied to the vertically varying
flow.   I suppose in theory I can see how to apply time skewing
to the combined problem (as long as you avoid non-stationary
solvers), but it makes my head hurt to think about the complexity
of the resulting code.


>Hmm...  clustering of the nodes of an irregular mesh in all dimensions
>might go a long way towards improving cache locality; the cluster
>boundaries would be related to the subset that you want to time skew.

That kind of reordering is already standard in software for sparse
problems.   The incentive was originally to reduce the bandwidth of
the sparse matrix (in order to reduce fill-in for direct sparse
matrix solvers), but it clearly improves cache utilization as well.

I can't see any way to do time-skewing in any useful sense for
unstructured meshes, but maybe I have not thought about it enough.
Shame on me....
--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."
