From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: gcc as benchmark? [really: good M.S. topics in benchmarking]
Date: 27 Nov 1995 19:10:47 GMT

In article <4928c9$4vk@sparcserver.lrz-muenchen.de>,
rwilhelm@Physik.TU-Muenchen.DE (Robert Wilhelm) writes:

|> It should not be very hard to build some sort of SPEC Clone,
|> because most (all?) programs used by SPEC is freely available.

But I wouldn't recommend this:
	a) It is much harder work to put together good benchmark suites
	than you realize, unless you've done it.
	b) Even trying very hard, benchmarks degrade over time as compiler
	writers put forth frenzied efforts to crack them.  Sometimes,
	benchmarks get clobbered by natural improvements in compilers,
	i.e., like the way matrix300 got zapped by KAP and similar optimizers.
	c) Cloning SPEC is unlikely to help advance the state of the art.

Good M.S. topics (many):
	a) Build a synthetic benchmark that is simpler than SPEC95, but whose
	micro-level properties (instruction mix, cache behavior, TLB behavior,
	memory traffic behavior) correlate well with [overall SPEC95 results, or
	individual benchmarks].
	b) Extend that to, for example, model big UNIX benchmarks, or big
	DBMS benchmarks, or big C++ graphics codes.
	c) Create a synthetic benchmark that avoids the size problem of
	SPEC and SPEC-like benchmarks. 
	(The size problem is that a benchmark has a particular size.
	If it "fits" in a cache of size N, it will not run much faster in a
	cache of size 2N ... even though the bigger cache, in actual practice,
	can be worth a lot to bigger codes.)  One would prefer approaches
	that run codes across a range of sizes, and give a figure of merit
	more related to the integral of performance under the overall curve.
	(I.e., some of this is like the Hint benchmark).
	Do this first with variations in data sizes, which is relatively
	easy to do (a rough sketch of such a size sweep appears after this list).
	d) Then figure out how to do it with variations in code size as
	well, since many real codes put pressure on I-cache misses.  Do this
	while still maintaining micro-level properties that look like
	real programs.
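
As a rough illustration of the data-size sweep in (c): something like the
sketch below, with a trivial stride-1 sum standing in for a real kernel,
runs the same work across a range of sizes and reports one figure of merit
for the whole curve (here a geometric mean) rather than a time at one
arbitrary size.  A serious version would use kernels whose instruction mix
and access patterns actually match real code.

/* Sketch only: sweep data sizes, report a single size-spanning figure of merit. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

static double time_kernel(int *a, long n, long reps)
{
	long i, r, sum = 0;
	clock_t t0 = clock();
	for (r = 0; r < reps; r++)
		for (i = 0; i < n; i++)
			sum += a[i];
	if (sum == 42) putchar(0);	/* keep the result live */
	return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
	long n;
	double log_sum = 0.0;
	int points = 0;

	for (n = 1L << 12; n <= 1L << 24; n <<= 1) {	/* 16KB .. 64MB of ints */
		int *a = malloc(n * sizeof *a);
		long i, reps = (1L << 26) / n;		/* roughly constant total work */
		double secs, rate;
		if (!a) break;
		for (i = 0; i < n; i++) a[i] = (int)i;
		secs = time_kernel(a, n, reps);
		rate = (double)n * reps / (secs + 1e-9);	/* elements per second */
		printf("n = %8ld ints: %.3g elements/sec\n", n, rate);
		log_sum += log(rate);
		points++;
		free(a);
	}
	printf("figure of merit (geometric mean over sizes): %.3g\n",
	       exp(log_sum / points));
	return 0;
}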

Note that there are plenty of difficulties: SPEC has used real codes to
guarantee that the behavior of the programs at least reflects *something*
real ... on the other hand, one would love to have smaller, synthetic codes
that were reasonable predictors.  The main problem is the difficulty of
mimicking *all* the necessary attributes of real codes.  That is, if you select
a bunch of metrics, and mimic them, it may turn out that your predictions fall
down because there is some additional metric you didn't think of.
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Windows 95
Date: 8 Jan 1996 04:20:08 GMT

In article <IkvTc8_00YUp8TK0VR@andrew.cmu.edu>, Kenton Shaver
<kenton+@CMU.EDU> writes:

|> If the K5 and the M1 were here, Cyrix and AMD ad copyists would be
|> beside themselves with glee.  Everyone winks at each other when they
|> see SPEC tests conducted by the manufacturer with a compiler that no
|> one uses except said manufacturer for those "tests" anyway.  6 of one.

|> What are the problems with Intel license this supposed top-flight
|> compiler to Watcom, Symmantec, or Borland if none of their C compilers
|> are good enough for the Intel SPEC tests?  The results will be too
|> verifiable?  ;) 

Sigh. SPEC people have long tried to create better benchmarks and get
numbers that would be meaningful, but it seems to be getting harder
every year to have rules that actually work, and whose results have
reasonable correlations with what you see in real life.
(Note: I used to write the MIPS Performance Brief in the late 1980s &
was later one of the founders of SPEC; if you read the former, and
look at the CPU methodology and disclosure approach of the latter,
you may find some resemblance.)
SPEC has gotten more complicated in an effort to maintain the
original characteristics in the face of industry changes, CPU changes,
compiler changes, ...and benchmark-gaming.  It isn't easy, and the folks
still doing SPEC keep trying ... but it's hard.

So, to rationally evaluate what's going on, one might like to know 2
kinds of things, and maybe people can help with *authoritative* answers:
	1) To what extent is using the Intel Reference Compilers for
	SPEC benchmarks realistic?
	2) What was the specific problem in this case? (with a discussion
	of classic cases in the past).


1) Questions about Intel Reference Compilers
    SPEC numbers for Intel-based systems are almost always done using:
	Intel Reference C Compiler	SPECint92 + several of SPECfp92 +
					SPECint95
	Intel Reference FORTRAN Compiler	SPECfp92 + SPECfp95
And like other compilers, there have been various releases.
I'm interested in knowing the extent to which *these*  compilers
	(a) Get licensed to tool vendors, who modify/incorporate the
	technology into their own compiler suites, and if so, how much is
	the same, and *when* is it available to developers.
	(b) Are sold directly to developers, and when.
In general: to what extent is this technology *actually* used by developers
of X86 code, and if so, *when* ..  I.e., in comparing numbers, how does
this actually compare with the way the workstation numbers work:

Workstation compilers:
Workstation vendors usually supply compilers, although there are
often ISV or other 3rd-party compilers as well. In most cases, the
vendor-supplied compilers are widely used for compiling kernels, applications,
libraries, etc, etc. They tend to get QA'd very heavily and people tend to
be moderately conservative when adding optimizations to make sure they
don't break things.  Also, as production compilers, they tend to make
choices that make sense across wide ranges of applications (and sometimes,
use common components with other languages) ... because
that's what people use.  There are plenty of X86 compilers like this.
More on the subtle issues later ...

On workstations like DEC, HP, IBM, SGI ... the SPEC benchmarks are typically run
either with a shipping compiler, or with a Beta version of it, due to ship
within N months.  I.e., the SPEC disclosures show a compiler availability
date, which means: that's when it's available to the customer base.

So, to compare the Reference Compilers:
please help me fill in the following table with *authoritative* answers,
and it may then be possible to rationally evaluate what's been going on,
and whether it means anything or not... for availability date, I mean availability
to regular users, not indirect licensing to OEMs.

Vendor		C/FORTRAN	OS	Avail	Reference Compiler basis
		Version			Date	If any, or No if not
Intel		C 2.2		UNIX	?	C 2.2
		FORTRAN	2.2	UNIX	?	FORTRAN 2.2
		C 2.2		NT, Win	?	C 2.2
		FORTRAN	 2.2	NT, Win	?

Microsoft	C		Win, NT
		FORTRAN		Win, NT

Borland		C		
		FORTRAN

Symantec	C
		FORTRAN

Watcom		C
		FORTRAN

Lahey		FORT: LF90 2.0	NT, Win	Y	FORTRAN 2.2 (?)
						says it has Pentium Pro tunes

====
etc, etc.  i.e., more vendors, and confirmation of the relationship with
the Intel Reference Compilers and whether or not they are available to
regular developers as a supported product.

2) What was the problem?

Hopefully the Motorola/IBM folks who found it may explain more (?)
Meanwhile, for people to think about, here is a categorization of
optimization/benchmark-gaming, with some examples.
These go in order from legitimate-sensible ... dubious ... dishonest.

Consider what happens with an industry-standard benchmark, like:
	Dhrystone, Whetstone, Linpack, SPEC.

1) Simple optimization (by non-compiler folks)
	Run it with -O, or some simple set of optimizations, and see what
	you get.  This corresponds to the way lots of people work.

2) Optimization with serious analysis (by non-compiler folks)
	Profile the code, use options a lot ... and if the code is one
	of those benchmarks you're allowed to rework, do that also.
	This is the other extreme ... and there are people out there who do
	that, especially in the computationally-intense areas. 
	The presence of such customers varies by vendor and market; SGI
	actually has a fair number of such ISVs and customers.

Now, we get to stages where you have control of the compilers and other s/w.

3) Use the benchmarks to help tune the compilers, changing basic code
generation as you find sub-optimal things.
	Everybody does this, and there's nothing wrong with this, in fact,
	it's *good*, especially if the performance bugs you unearth have
	general applicability.

4) Use benchmarks to tune ... by adding options.
	A new optimization is sometimes enabled via some flag, so that people
	feel safer with it ... and later on, subsumed in the standard -O.
	Some options are ones that may improve or hurt performance
	in any given case, and maybe you only added one for this benchmark,
	but it has general use.  The worst case is where it really only makes
	sense for this one case, and people start to complain if the option-set
	starts having myriads of benchmark-specific flags.

5) Add optimizations precisely-targeted to these benchmarks ... but which
	make very little sense or provide little performance elsewhere.
	[In my personal opinion, the eqntott optimization is of this class,
	i.e., attacking a few lines of code to raise overall SPECint92
	numbers by 20%.]
	====
	This should not be confused with what happened to Dhrystone 1.1
	or SPEC89's matrix300, i.e., where fundamentally-improved compiler
	technology just happened to blow away the benchmark, in the first case,
	via the dead-code elimination that appeared with global optimizers,
	and in the second, via automatic matrix cache-blocking that appeared
	with the Kuck & Associates Preprocessor (KAP) or equivalents. The dead code
	case had even worse cases: in the early 1980s, people sometimes had
	tiny benchmarks that computed nothing ... and a good optimizer would
	zero them out :-)
	=====

	Personally, I think a little of this is hard to stop ... but a lot of
	it is getting very unrealistic, i.e., if hordes of benchmarkers
	spend tons of time tuning and tweaking for a small set of code,
	one can extract performance improvements on the benchmarks that
	have little correlation with anything real, even assuming that
	the benchmarks themselves could be good predictors of
	anything.  Hence, benchmarks "degrade" as a function of the
	amount of effort being put on them.  Dhrystone once actually
	told you something ... and then rapidly degraded as people
	"gamed" it. A good example was the hack used in one compiler
	for the Intel i860 to generate spectacular Dhrystones, about
	30-40% higher than one would have expected.  This was done by
	observing that a lot of time was spent in the string copy
	(strcpy) function, copying a 30-character string.
	This particular compiler:
		- Padded the 30-character string to 32 bytes.
		- Aligned the string and the target buffer onto at least
		  8-byte multiples.
		- Converted the entire string copy into 4 instructions:
			2 16-byte floating-point loads +
			2 16-byte floating-point stores
	Of course (1) the original benchmark's use of a 30-byte string is
	not representative of the string size distributions in C.
	(2) In this particular case, all of this works.  In most cases,
	it doesn't apply, for example, if the string isn't visible directly
	as a constant, but appears as a pointer.
	So: it is not illegal to add an optimization that notices when all the
	conditions are present, and generates such code.  Nothing will be
	broken ... but the results are pretty irrelevant to anything.
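
	For the curious, here is roughly what that transformation amounts to,
	rendered as C.  The real work happened inside the compiler on its
	intermediate form; the string and the gcc-style alignment attribute
	below are purely for illustration.

#include <string.h>

/* 30 characters + NUL, padded to 32 bytes; gcc-style alignment
   attribute used here only to make the assumptions explicit */
static char dst[32] __attribute__((aligned(8)));
static const char str_30[32] __attribute__((aligned(8))) =
	"DHRYSTONE PROGRAM, SOME STRING";

void copy_the_string(void)
{
	/* what the source says:   strcpy(dst, str_30);                  */
	/* what the special-cased code amounted to: two 16-byte loads and
	   two 16-byte stores, approximated here as one 32-byte block copy.
	   Legal only because the length, padding, and alignment are all
	   known at compile time -- conditions almost no real strcpy meets. */
	memcpy(dst, str_30, 32);
}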

6) Over the edge
	Now we get to things that are beyond distasteful.  These
	include:  Optimizations not only overly-targeted, but where the
	recognition is actually benchmark-specific, i.e., like doing a
	string comparison on the input code, or recognizing the file
	name.  That is, the optimization is safe ... but the only
	program you ever expect to tickle it is the benchmark.  For
	example, there was a FORTRAN compiler that once looked at the
	input for "DONGARRA", and if seen, assumed it was LINPACK, and
	could then emit perfect code.
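
	The kind of check involved is depressingly simple; purely as an
	illustration (no claim that any particular product looked like this):

#include <string.h>

/* recognize the benchmark itself, not a general code pattern */
int looks_like_linpack(const char *source_text)
{
	return strstr(source_text, "DONGARRA") != NULL;
}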
	
7) Over the edge and wrong, but not usually noticed
	Finally, we get the case where one of these heavily-targeted
optimizations actually generates *incorrect* code ... that the targeted
benchmark happens to get away with.  As people have noted, compilers often
have bugs in general ... but features that most people really use do tend to
get thrashed over pretty well.  Sometimes they don't, due to such gaming.
For example, there was another compiler targeting Dhrystone that generated
code that counted on good alignment for speed (either automatically or
via a flag) ... but if you tweaked Dhrystone to misalign the string with a
perfectly legal string, the same code would be generated ... and would
trap with alignment errors.  This was back in MC68010 days.
Sometimes, the only way you can know what's going on is to look at the
generated code.  You can be assured that compiler writers involved in this
are scrutinizing object code with ferocious care.
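
The kind of perfectly legal tweak that exposes such code is trivial to
write; a sketch (not the actual Dhrystone source):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char pool[64];
	char dst[64];

	strcpy(pool, "xDHRYSTONE-STYLE STRING, ODD LENGTH");
	/* pool+1 is deliberately not word-aligned, and its length is odd;
	   correct code must still work, while code that silently assumes
	   alignment traps or copies garbage */
	strcpy(dst, pool + 1);

	printf("%s\n", dst);	/* print the result so it can't be optimized away */
	return 0;
}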

There are all kinds of weird cases that come up, and different vendors have
different approaches ... some vendors, knowing about certain hacks, deem them
unreasonable, and refuse to put them in.  Others go all out with every hack.

================================
To summarize, there are a couple interesting things one would want to know:

(a) To what extent are the numbers produced by Intel Reference Compilers
representative of what people really do?  If they are representative,
*when* are they representative (i.e., since time is relevant in performance
wars)?  If it turns out that hardly any real applications are compiled
with these compilers, then maybe people might want to urge the publication
of SPEC results with compilers that people actually use.

(b) Presumably, more information will come out about this specific case.
The results could range anywhere from:
	(a) Just one of those bugs that come up.
to
	(b) All-out tuning that happened to go over the edge.  If so, it's
	   probably not unreasonable for people to wonder how much more of
	   this is going on.  Note that in this case, a fairly targeted
	   optimization gained 20+% (on SPECint), and apparently did something
	   wrong, hence is being described as a 10% bug, meaning that there's
	   a slightly less-good sequence that actually works.

Final note: I've seen people argue over SPEC #s to 3 significant digits.
In the light of such things, this is pretty silly.  In fact, given the
inter-benchmark variance, and the moderate numbers of codes, people
really ought to think of these things as ranges ... 

SPEC rules have always attempted to minimize the gaming ... which after
all, has little benefit for actual customers ... but it may be a losing
battle.

-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Windows 95 [really eqntott & benchmarking]
Date: 11 Jan 1996 05:21:08 GMT

In article <CLIFFC.96Jan8101435@ami.sps.mot.com>, cliffc@ami.sps.mot.com
(Cliff Click) writes:

|> Eqntott, with the spec input data, has a place where structures with

Good history, thanx.

|> The problem for Intel is that the benchmark and dataset is "robust" -
|> if you screw up the vectorization, you can take an early exit from the
|> loop and _still_ get the same answer.  Intel's compiler missed, and
|> their Q&A didn't catch it.

|> > 5) Add optimizations precisely-targeted to these benchmarks ... but
|> > which make very little sense or provide little performance
|> > elsewhere.  [In my personal opinion, the eqnott optimzation is of
|> > this class, i.e., attacking a few lines of code to raise overall
|> > SPECint92 numbers by 20%.]
|> 
|> Here I disagree with you on the particulars - the eqntott optimization
|> is a special case of byte or short vectorization.  If you are doing
|> byte operations on a vast structure - say running a filter over a
|> large byte-per-pixel image, then vectorizing the loads & stores is
|> exactly the right thing to do.  

I don't think we actually disagree: I have nothing at all against
vectorization-kinds of algorithms, and if somebody put in a partial-word
vectorizer, and it nailed eqntott, and plenty of other code, then,
no problem.  [I.e., like what happened to matrix300.]  However, everything
I've seen leads me to believe that eqntott has gotten targeted by
very precise recognizers, not more general optimizations.  Somebody
mentioned:
http://www.nullstone.com/eqntott/eqntott.htm
which has a nice analysis of these effects.

Of course, there is a fuzzy line between this and the next case,
and I was trying to make the line more precise, but maybe didn't do it
well enough.  If standard benchmarks stir people up to do better
optimization that turns out to help broad ranges of programs, that is GOOD.
(And in fact, I conjecture that efforts like SPEC have actually encouraged
better compiler technology ... at least compared to the silliness
that was going on to game the *stone* benchmarks.)
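
For readers who haven't looked at eqntott: the hot spot is a loop of this
general shape (illustrative only, not the actual SPEC source), and the
question is whether a compiler speeds it up with a genuine partial-word
vectorizer or with a recognizer that only fires on this exact shape:

/* Compare two arrays of shorts element by element.  A real partial-word
   vectorizer can load a word's worth of shorts at a time and compare
   them in parallel, and the same transformation helps byte-per-pixel
   image filters and similar code; a benchmark-specific recognizer
   helps nothing else. */
short cmp_terms(const short *a, const short *b, int n)
{
	int i;
	for (i = 0; i < n; i++) {
		if (a[i] < b[i]) return -1;
		if (a[i] > b[i]) return  1;
	}
	return 0;
}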

 
|> > 6) Over the edge Now we get to things that are beyond distasteful.

|> Here is where the eqntott optimization gets distasteful - to be safe,
|> it's generally pattern matched somewhere at the IR level - and emitted
|> only for very close matches to the eqntott code.  The more general
|> byte/short vectorization is hard to do, and unlike Matrix 300 isn't at
|> the center of some important benchmark.


-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

Subject: Re: why does Spec95 cost _anything_?
From: mash@mash.engr.sgi.com (John R. Mashey) 
Date: Jan 16 1996
Newsgroups: comp.sys.intel,comp.benchmarks

In article <4dab4v$on9@usenet.srv.cis.pitt.edu>,
hahn@neurocog.lrdc.pitt.edu (Mark Hahn) writes:

|> $300 is still nontrivial, and therefore is as good a barier to
|> wide dissemination of the benchmark as $900 was.  does the budget
|> for the Spec organization really need the money?  or is it simply
|> that the vendors involved _want_ a barrier in place?

Not speaking for SPEC, since I'm not a member these days ...
1) SPEC members pay for much of the work through membership dues,
and of course, spend far more on people's time.
SPEC has never been into making big piles of money.

2) When a vendor representative goes to their management and says:
"We need to increase our budget that we are spending on SPEC, so that SPEC
can give out free copies of the benchmarks to anyone who wants one, because
they say $300 is too much." you can imagine the reaction.

3) As it stands, people license the benchmarks and agree to play by the
rules, which becomes fairly difficult if copies are floating around.
This topic was discussed a lot early on ... but it is un-real to believe
that SPEC members want some large barrier to anyone else getting copies and
using them.  That is, one can argue with the specific approach, but
considering the amount of work that goes into this, and that a set of
benchmarks costs about what a good word-processing or office-applications suite costs... 

-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090    FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

Subject: Re: why does Spec95 cost _anything_?
From: mash@mash.engr.sgi.com (John R. Mashey) 
Date: Jan 17 1996
Newsgroups: comp.sys.intel,comp.benchmarks

In article <phrDLAnC5.6JK@netcom.com>, phr@netcom.com (Paul Rubin) writes:

|> >2) It doesn't ever seem to happen: doing bad benchmarks is easy, doing
|> >good benchmarks is more difficult than anyone could believe, unless
|> >they've been involved in these sorts of efforts.

|> Care to describe what some of the obstacles are?  Thanks!

Sure:
1) SPEC's  CPU benchmark requirements have included:
        a) Must be portable across different CPU architectures.
        b) Must be portable across (at least) different UNIX OSs,
           and if possible runnable on others.
           This also means that codes that use too many special
           language extensions are out, as well.
        c) Must run long enough to create meaningful times, in the first
           case at least 60 seconds on a VAX 11/780, preferably longer ...
           but not so long as to make the logistics impossible.  This
           is a moving target of course, but generally, if a benchmark
           gets down to a few seconds, timing gets difficult.
        d) The benchmark must produce output that is machine-checkable.
        e) The benchmark must at least start as a real program that
           actually does something.
        f) The benchmarks must (at least individually) be viewed as
          "representative" of real-life programs.
        g) The code must be, not public-domain (i.e., gcc is not), but
           at least publicly distributable.
        h) The code should try to exercise memory systems and not fit in
           trivial caches.

a) Has all kinds of surprises, especially combined with d), especially for
floating-point codes.  [If it takes different systems different #s of
iterations, is that a problem with the CPUs, or a problem with the test?]
One otherwise excellent candidate for SPEC89 failed because:
        a) It compiled and ran easily on all systems.
        b) Across vendors, it produced 3 slightly different answers, all of
           which were feasible (it was a simulated annealing chip layout,
           and all 3 answers were pretty close), because it effectively
           used the last bit of a floating-point computation to make
           yes/no decisions, i.e., this wasn't taken care of by normal
           convergence properties and significance comparisons.
b) Had some surprises, especially with FORTRAN and 64-bit systems.
DEC did make them work on VMS.

c) Caused a real problem with SPICE, where we had some good benchmarks,
but they were too short, so we switched to a longer one ... that turned
out to be massively atypical for normal SPICE processing.  SPICE was the one
original benchmark we all agreed was wonderfully representative.
Almost any realistic input deck turned out to be proprietary to somebody.

d) Checkability has caused several surprises, either where somebody got
clever [like what happened to the spreadsheet benchmark, where not enough
was checked], or where the input failed to exercise the code enough [as
witness the recent eqntott problems].
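
Checkability for floating-point output generally means a fuzzy comparison
of some sort; a minimal sketch (the tolerances are per-benchmark choices,
not universal constants):

#include <math.h>

/* numbers count as "equal" if they agree within a relative tolerance,
   with an absolute floor for values near zero */
int fp_close(double want, double got, double rel_tol, double abs_tol)
{
	double diff = fabs(want - got);
	if (diff <= abs_tol)
		return 1;
	return diff <= rel_tol * fmax(fabs(want), fabs(got));
}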

f) Is always a problem, that is: does this code really represent a portion
of your real workload, and if so, how much?  This is why SPEC always insisted
on publishing *all* of the numbers, not just the aggregate.
In general, rules for aggregating numbers together are fraught with
strife, both in terms of mathematics and politics.
Obviously, this is much easier in real life, in the sense that, for instance,
someone running a supercomputer often knows what their workloads are,
and which programs get run, and how much.  Likewise, in a binary-compatible
environment (like PCs), one may just run the real applications and see,
but this doesn't work across architectures very often.

Synthetic programs (i.e., like Dhrystone or Whetstone) really have this
problem: it is all too easy to model some attributes of the problem,
and not others.

The holy grail: small programs that don't run very long, but whose
performance can be used to get good predictions of large programs.
So far, that holy grail is safe, i.e., nobody has managed it ... but if you can do it, you will be famous.

Anyway, those are a small sample of the difficulties.... there are more.


-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090    FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch,comp.benchmarks
Subject: Re: What is the real SGI performance?
Date: 30 Jan 1996 20:42:07 GMT
Keywords: SGI, benchmarks, MIPS, SPEC, FUD

In article <4e3ss2$dvj@murrow.corp.sgi.com>, mash@mash.engr.sgi.com
(John R. Mashey) writes:

|> All of this is one CPU: nothing is counting the effects of
|> automatic parallelization to boost SPECfp, for example.  [It is a separate
|> argument as to whether that makes sense or not.]  Note that the numbers

A couple people wanted to know the side arguments mentioned above:

The two positions are:
(1) If a compiler can automatically parallelize code, with zero source
changes, then let it.
[This is the official SPEC position, and hence one sometimes sees SPECfp
numbers that include auto-parallelization of a few of the SPECfp [92 or
95] codes.]  This position has some merit, and I believe that some
companies have customers for whom this is a reasonable position.  On
the other hand:

(2) When real software vendors parallelize code, they do anything from:
	(a) No changes.
	(b) Insertion of compiler directives [i.e., that look like comments].
	(c) Moderate rewrites.
	(d) Substantial rewrites [to end up with better algorithms overall].
and they use at least some of the tuning and analysis tools that system vendors
provide.  Note that SGI has substantial live experience in this, and
(relatively) large numbers of ISV applications that are actually shipped in
parallel versions ... so this is not theoretical.

So, the side argument is: consider the case (which actually arises in
one of the SPECfp92 codes, I don't recall which) where:
	(a) A compiler can parallelize a code by analyzing the source code.
OR
	(b) A compiler supports a bunch of parallel directives that are
	    widely-used in actual practice, and inserting a handful of them
	    produces a good parallel speedup.
Which way is more realistic?
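
To make (b) concrete, a hypothetical example: the #pragma spelling below is
made up, since every vendor of the era spelled its directives differently,
but the shape is right -- the loop is untouched and one directive line
asserts that the iterations are independent.

/* option (b): unchanged loop plus one vendor-style directive */
void scale(double *a, const double *b, double s, int n)
{
	int i;
#pragma parallel_for	/* hypothetical spelling: iterations are independent */
	for (i = 0; i < n; i++)
		a[i] = s * b[i];
}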


My personal opinion is that:
	- Anyone serious about parallelization supports the directives.
	- ISVs use them ... i.e., there are plenty of codes that should be
	   compiled exactly as is, to match real life ... but ones that have
	   interesting parallel speedups usually have more tuning done to them.
	- (a) Can lead to "unnatural" benchmark-targeting that doesn't really
	      help real-life very much.

	- And finally, my *BIGGEST* concern is that allowing parallelization to
	   sneak into the basic SPEC CPU numbers confuses the ability to provide
	   useful correlations.  Let me try a simple example, ignoring all
	   the normal caveats (workloads vary, systems vary in their
	   memory usage, benchmarks vary, etc, etc):

	For example, let's suppose that you've
	determined that SPECfp, on one CPU, is a useful predictor of
	your own application's speed on one CPU, and that what you care about
	is parallel throughput, and you're going to have a small "farm"
	of machines to do this. [For example: at Pixar and other places
	for graphics rendering, at CERN and other places for particle physics,
	on Wall Street for independent financial calculations.]

		A: Uniprocessor, SPECfp95 = 20.
		B: Dual-processor, SPECfp95 = 20
		C: Dual-processor, SPECfp95 = 15 on one CPU, no parallelization

	  Now, in comparing A and C, you would guess that C would deliver
	  more throughput for independent computer-bound tasks, i.e., while
	  it wouldn't quite act like a 30-SPEC processor, it would probably
	  be as good as a single 25-SPEC CPU.

	But: what does B's number mean?
	1) It could have 2 11-SPEC CPUs, and terrific parallelization ...
	2) It could have 2 20-SPEC CPUs, and not have tried to parallelize,
		or not have good parallelization.
	Now, assuming similar costs for all 3 systems:
		- C is probably a better deal than A.
		- But B might be (case 1) no better than A, and in fact, for
		  many people, rank below A, as many would prefer one fast
		  CPU to 2 55%-as-fast CPUs.
		- OR B might be (case 2) better than C, and thus the best.

	As another example, suppose you *know* your own application's usual
	degree of parallel speedup, and would like to do your first cut of
	evaluation based on SPECfp ... but this has become harder because
	the SPECfp numbers are more difficult to interpret: you not only
	have to look at all the detailed numbers ... but you've got to
	understand how much parallelization is going on under the covers.

	[By the way, none of the above is a theoretical exercise: I was
	recently trying to understand:
		Is the performance on <CPU application that customer cares
		about> estimatable from some combination of SPECint+SPECfp
		numbers ... and the answer was Not Easily, as some of the
		systems under consideration were multiprocessors with varying
		degrees of parallelization going on....]
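
	Back-of-the-envelope, the A/B/C farm comparison above looks like
	this (the 0.85 derating for shared-memory contention is an
	assumption for illustration, not a measured number):

#include <stdio.h>

/* throughput of independent jobs ~ CPUs x per-CPU SPECfp x contention derate */
static double farm_throughput(int ncpus, double per_cpu_spec, double derate)
{
	return ncpus == 1 ? per_cpu_spec : ncpus * per_cpu_spec * derate;
}

int main(void)
{
	printf("A (1 x 20): %.1f\n", farm_throughput(1, 20.0, 0.85));
	printf("C (2 x 15): %.1f\n", farm_throughput(2, 15.0, 0.85));
	/* B's published 20 could be 2 x 11 with parallelization, or 2 x 20
	   without -- the single SPECfp number cannot tell you which. */
	return 0;
}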
	 
	

-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

Subject: Re: SPECint95/SPECfp95 for UltraSPARC ?
From: mash@mash.engr.sgi.com (John R. Mashey) 
Date: Feb 21 1996
Newsgroups: comp.benchmarks,comp.sys.sun.hardware

In article <4gci98$qeh@News.Dal.Ca>, matheson@phys.ocean.dal.ca (Steven
Matheson) writes:

|> I _just_ got the unofficial SPECint/fp95, under non-disclosure, for
|> the Ultra 1 140, 170/170E, as well as SS20/151, SS20/71, and SS5/110.
|> 
|> I cannot post them here or even email them out, so please don't ask.
|> (I can't even tell how they compare to other systems.)  Instead, if
|> you really do need them, get them under a non-disclosure agreement
|> from your Sun rep.  They are out!

If this is not against SPEC rules ....  I hope it gets to be so, because
(in my opinion) it is certainly against the spirit of SPEC cooperativeness
and openness.

(a) There are estimated results, labeled estimates.
(b) There are measured numbers, published, available for scrutiny by
competitors and others.

One of the major points of SPEC was that measured numbers got published,
were subjectable to scrutiny by competitors and other interested parties,
with a carefully-crafted set of disclosures to inhibit game-playing.

Think about what this means:
        (a) Under normal SPEC behavior, if a customer cares about SPEC #s,
        any official numbers are visible to everybody, including all of the
        details, and reasonable analysis of their meaning can be done (i.e.,
        like in cases where people break one benchmark, thus boosting the
        mean, but might or might not be meaningful.)
        (b) I can believe providing bids on not-yet-announced machines, where
        both the machine and its SPEC #s are NDA.
        (c) If one vendor can give out numbers, and win deals based on
        NDAing SPEC #s, on already-announced machines, but not have to provide
        those numbers publicly ... there is *no* point to the SPEC disclosure
        rules.

What Steve basically suggested, perhaps without realizing it, is:
        (a) If you are a potential customer, you can get Sun's #s under NDA.
However:
        (b) But of course, if you are anybody else, you can't.
                (1) Well, actually, maybe if you are an analyst or press
                person, who promises not to publish the numbers, but might
                want to say "I've seen the numbers, and can't print them,
                but they are really good", you might.
                (2) But for sure, if you are a competitor, you're not going
                to get this ... which means that your numbers are out in the
                open to be shot at, and someone else's are hidden, but used
                against you.  I assume people realize the non-motivation this
                is to provide accessible numbers in a fair way?
                People have worked *so hard* to make benchmarketing better ...
                do people *want* the situation to get worse, not better?

(Note: on sabbatical, but had to look something up and ran across this ....
I don't get irritated easily, but I hope this is a transient aberration,
i.e., that Sun either doesn't give out the numbers, or publishes them via
the normal rules, quickly.)

-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090    FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch,comp.benchmarks
Subject: Re: Free SPEC benchmarks? (Re: gcc: quality of produced code?)
Date: 17 Dec 1996 20:45:57 GMT

Some people asked for more examples, hence I expand this list with a few,
none of which are likely to be a surprise to experienced people or long-term
readers of this newsgroup.

In article <594b8k$dd1@murrow.corp.sgi.com>, mash@mash.engr.sgi.com
(John R. Mashey) writes:

|> savvy/experienced benchmarkers.  For some reason, it is especially difficult
|> to find integer & systems benchmarks that:
|> 	- are representative at both macro- and micro- level
	1) Macro: realistic size, do real work
	2) Micro: if you spend a lot of time examining the low-level
		statistics of programs (instruction usage, memory access
		patterns, function call frequency & depth, etc), it sometimes
		happens that a plausible-looking benchmark shows behavior
		unlike almost any real code that you examine.  Sometimes
		these differences affect systems similarly, but sometimes
		they can have quite striking effects that are not seen in
		real life.
		EX: Dhrystone: started life in Ada, but was converted to C,
		producing certain oddities:
			(a) Function calls were more frequent than in most
			C programs, i.e., ~40 cycles/call, rather than ~80.
			(b) Function call nesting depth was shallow.
			Together with (a), (b) is goodness for architectures
			with register windows or similar things, i.e.,
			SPARC, AMD 29K.
			(c) I've never seen a program that easily spent
			40% of its time in strcpy; I've never seen a
			program where the distribution of strcpy sizes was
			100% 30-byte size; I've never seen a strcpy that could
			count on the size of its input being a constant
			and having both strings aligned on 8-byte boundaries.
			[Infamous i860 hack that turned strcpy into 2 16-byte
			quad-word loads followed by 2 quad-word stores ...
			about as unrepresentative as you can find :-)]

		EX: there are numerous anomalies around partial-word
			operations; whether you think 16-bit integers are
			important or not really depends on the code you
			look at, and benchmark choice can affect different
			architectures very differently, with interactions
			with compiler technology.  Specifically, if
			something is slow on 16 bit operations, but the
			compiler can recognize certain sequences and avoid
			them, all is well, but if real life isn't that way,
			the benchmark is not as good a predictor as you'd
			think.
	

|> 	- are distributable & not proprietary
	This is straightforward, but it does mean they are usually
	university or shareware or copyleft, which are not necessarily
	representative of commercial codes.

|> 	- stress caches/memory systems realistically
	When SPEC first came out, a 128KB off-chip cache was "huge", and
	the SPECint part actually cache-missed some on most machines.  Some
	kinds of programs are relatively easy to scale up in size, others
	aren't, but one would like benchmarks that cover a range of sizes
	better, since that's what real programs do.  There is a particular
	benchmark confusion that happens when you run a fixed-size program
	on 2 systems, where any of the program's working-set sizes are
	close to any of the relevant cache sizes.  For example, suppose
	real-life programs have working sets spread across the 128K-4MB range,
	system A has a 256K cache, system B has a 1MB cache, and are
	otherwise similar.
	IF you happen to pick a 128K benchmark, A ~ B, that is you would
	believe the larger cache has no value.
	IF you happen to pick a benchmark somewhat bigger than 256K (but under
	1MB), A will thrash, B will do well, and you will believe B is much
	superior.
	IF you run a 4MB benchmark, both A and B will thrash; B will perform
	somewhat better, but performance will be dominated by the
	memory latency and bandwidth.
	[Many people understand this issue, and in technical benchmarks, it
	is relatively easy to scale the array sizes up and down, and look at
	performance curves, or use several different sizes.  It is *much*
	harder to do this when you are considering instruction-cache
	effects.  On the other hand, many people do *not* understand this
	effect, and get really hung up on the relative performance of 2
	systems as measured at a particular sizing, even when that size has
	no magic credibility.  In a few cases, people are benchmarking
	exactly the code they will run, day-in, day-out, and drawing
	conclusions there makes sense.]

|> 	- are portable
	Portability of simple code has improved over the years; on the other
	hand, a lot of real code uses extensive/growing collections of
	library interfaces that are not necessarily portable. 

|> 	- have "checkable" outputs
	This is trickier than you'd think: see what SPEC does; among other
	things, you need to figure out fuzzy comparisons for floating-point
	differences. A maddening case in SPEC was that of timberwolf, a
	"simulated-annealing" chip-routing code.  It was big, it was
	cleanly "portable", it was useful, it was almost entirely integer
	 ... unfortunately, at the first SPEC
	benchathon, when compiled and run on all of the vendors' systems,
	the runs yielded 2 (or maybe 3) distinct answers across ~10 systems.
	All the answers were "right" (i.e., a feasible chip layout, all
	about the same size), but were different; amusingly, some vendors
	had two products that disagreed with each other, but each
	of those agreed with at least one other vendor's answers.  Why?
	Although mostly integer code, timberwolf did some floating-point
	calculations, and unlike the usual case, where the effect is on
	convergence rates, and maybe you need a few more iterations to
	get to an answer close enough, timberwolf made yes/no decisions that
	effectively got based on the last significant digit of FP numbers,
	and then propagated, resulting in slightly different routing choices.
	Although we tried hard, we never did manage a version that was
	cleanly checkable/consistent before we gave up.
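
	A tiny sketch of the underlying effect: the two "cost" computations
	below are mathematically identical, but different association orders
	(as different compilers may legitimately produce) can differ in the
	last bit, and a hard yes/no decision keyed to that bit then diverges.
	(With strict IEEE double arithmetic the two decisions below disagree;
	with extended-precision evaluation you may see yet another
	combination, which is exactly the portability problem.)

#include <stdio.h>

int main(void)
{
	double a = 0.1, b = 0.2, c = 0.3;
	double cost1 = (a + b) + c;	/* one association order      */
	double cost2 = a + (b + c);	/* another; "the same" answer */

	printf("cost1 = %.17g\ncost2 = %.17g\n", cost1, cost2);
	printf("decision 1: %s\n", cost1 <= 0.6 ? "accept" : "reject");
	printf("decision 2: %s\n", cost2 <= 0.6 ? "accept" : "reject");
	return 0;
}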

|> 	- are relatively safe from special-casing targeted optimizations
	This is always trickier than you'd expect.  These range from:
	clearly unethical, such as recognizing "DONGARRA" in the input
		file and saying "It's LINPACK" and then emitting perfect code.
	probably wrong, such as generating special-cased code that only
		works for the famous benchmark, and breaks in real life.
	to marginal: targeted optimization, helps this benchmark, works
		correctly, but real code that does this is seldom seen.
	Nullstone has a fine Web page on EQNTOTT from SPEC: 
	http://www.nullstone.com/eqntott/eqntott.htm

|> 	- don't get blown away by normal advances in compiler technology
	But this is legitimate, i.e. not all compiler work is benchmarketing:
	1) When global-optimizing compilers started appearing widely in
	the mid-1980s, some common smaller benchmarks ended up having
	big sections of dead code removed.  This is a legitimate thing,
	and sometimes even happens in real life due to debugging switches
	being turned off, but it demolishes small benchmarks sometimes.
	People did learn to make sure they printed out a result that
	depended on all of the computation to stop this.
	Dhrystone went through this as well, i.e., from version 1 -> 2.
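
	A tiny example of the trap, and the standard fix:

#include <stdio.h>

int main(void)
{
	double sum = 0.0;
	long i;

	for (i = 0; i < 100000000L; i++)
		sum += (double)i * 0.5;		/* dead unless sum is used */

	/* without the next line, a global optimizer may legally delete the
	   whole loop and the "benchmark" runs in zero time */
	printf("checksum = %g\n", sum);
	return 0;
}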
	2) When KAP, and similar dataflow analysis techniques, got more
	widespread in the late 1980s, one of the casualties was matrix300
	from SPEC89.  It was just too simple, and a legitimate, important
	technique (data-flow analysis plus cache-blocking) drastically sped up
	the code, much more than it sped up typical real applications.

|> 	- don't get blown away by advances in implementations ... such as
|> 	a good benchmark becoming useless because caches get bigger than it.
	This was described above, and has happened to Whetstone, Dhrystone,
	and SPECint89, at least.

|> 	- don't have simple hacks that cheat on the specific cases
	This has been covered; Dhrystone strcpy hacks are amongst the
	most infamous, including somebody who generated code for Moto 68K
	that would fail for any string of odd size or not word-aligned.	

|> 	- have "appropriate" run-times, neither too short nor too long
	If a benchmark takes hours to run on common systems, it's painful.
	If it runs only fractions of a second, it can be dominated by
	measurement error and/or overheads that you didn't intend to measure.
	SPEC started saying that a benchmark should run at least 60 seconds
	on a VAX 11/780. In 1989, most SPECmark ratings were in the 10-20
	range, which already turned 60 seconds into 3-6 seconds,
	getting low. An otherwise good set of SPICE runs (which actually
	were much more typical than the one we ended up with) dropped out
	because their runtimes were too short. 

|> It seems easier to find floating-point codes with at least some of the right
|> properties, although pitfalls are numerous there as well.
|> 
|> 
|> Suggestion: for anyone who wants to create benchmarks:
|> 	1) Read the first chapter or two in Hennessy & Patterson.
|> 	2) Read Raj Jain, "The Art of Computer Systems Performance Analysis",
|> 	1991, John Wiley, ISBN 0-471-50336-3.  Chapters 1,2,3, then section II.
|> 
|> We always *need* better benchmarks ... it's just that it isn't easy.

Benchmark data collection is like scientific data collection, in that
it is difficult to learn much without gathering and publishing good data.
On the other hand, the physical universe has whatever properties it has,
whether you can observe them or not; it does not change its properties
just because you are paying more attention, and even the Heisenberg Uncertainty
principle is not a good analog: particles don't purposefully *try* to
go faster because they know you're watching them, whereas benchmarks do
get more tuned if they get more attention :-)
At the SPEC92 announcement, somebody complained that they'd just gotten
used to SPEC89, and now we were changing it.  Some SPECer replied that it
was good to change benchmarks every few years to try to stay ahead of
the tuning.


-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Free SPEC benchmarks? (Re: gcc: quality of produced code?)
Date: 18 Dec 1996 22:32:56 GMT

In article <phrE2MB6x.340@netcom.com>, phr@netcom.com (Paul Rubin) writes:

|> >2) Unfortunately, lots of people *suggest* this, usually with the
|> >belief that this is "easy".
|> 
|> I think the proposal isn't to select a whole new set of programs
|> to benchmark, but to use pretty much the same ones that SPEC uses,
|> with appropriate substitutes found for the proprietary pieces.
|> E.g., for the gcc benchmark, if SPEC has gcc compiling some 
|> proprietary program (I don't know if it does) just have it compile
|> a free program instead (such as cccp.c which is part of gcc).

One more time: it *sounds* easy, which is why everybody suggests it.
(1) The specific code matters, sometimes.
(2) The specific input dataset matters a lot, sometimes;
	I alluded to the SPICE case, where:
	At the very first meeting of the SPEC group, we were throwing
	around ideas for things that we thought were good benchmarks.
	In the first minute, we *all* agreed that we used SPICE as a classic,
	realistic floating-point benchmark, and that it was a wonderful and
	obvious one to include; we all knew it was a solid double precision
	FP-intense code with long code paths.  Then we went on to other things.
	We had a set of test cases that had been collected/vetted
	as realistic ... but they turned out to be too short.
	One of the folks suggested they had a much longer run, which we greeted
	with enthusiasm.  Meanwhile, we were analyzing all of the candidate
	benchmarks, down to micro-level statistics to make sure that the
	specific cases were representative of what we saw when we analyzed
	the various proprietary codes that we *really* used.
	We didn't bother to check the SPICE deck, since we all "knew" what
	SPICE did, and we needed to get SPEC89 done.
	
	Of course, it turned out that the SPICE run in SPEC89 *isn't*
	a floating-point benchmark: that particular run has a very different
	profile from most of what people use SPICE for, and is almost
	entirely integer, and spends most of its time in a small chunk of code.
	Shame on us ... and this was a collection of some very experienced
	performance-analysis people, with access to all of the tools that
	computer vendors keep around to profile and analyze code.

SUMMARY:
(1) If benchmarks are the real code, then they may predict themselves.
(2) Otherwise, if a benchmark is believed to predict the performance of
different real programs, you need to:
	(a) Gather plenty of data points.
	(b) Investigate the correlation of the benchmark versus the real codes
		you care about.
This is why SPEC always suggested that you check the correlation of numbers,
and then focus on the pieces of the overall suite that seem predictive.
There has usually been an attempt in SPEC to pick benchmarks that seemed
"similar" to the kinds of real code that people were seeing, but whose
real-life counterparts were proprietary, unportable, etc.
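
The correlation check in (2b) need not be fancy; a sketch, with placeholder
numbers standing in for real measurements:

#include <stdio.h>
#include <math.h>

/* simple Pearson correlation between benchmark scores and application speed */
static double correlation(const double *x, const double *y, int n)
{
	double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
	int i;
	for (i = 0; i < n; i++) {
		sx += x[i]; sy += y[i];
		sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
	}
	return (n * sxy - sx * sy) /
	       sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
}

int main(void)
{
	/* placeholder data: benchmark score and your application's speed
	   (1/time) on four systems -- not real measurements */
	double bench[] = { 10.0, 15.0, 20.0, 30.0 };
	double app[]   = {  9.0, 16.0, 18.0, 33.0 };

	printf("r = %.3f\n", correlation(bench, app, 4));
	return 0;
}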

  
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics/Cray Research 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311



Newsgroups: comp.arch.arithmetic,comp.arch,sci.math,sci.crypt,
	comp.security.pgp.tech,sci.math.num-analysis
Subject: Re: New programs for SPEC benchmark
From: mash@mash.engr.sgi.com (John R. Mashey)
Date: 15 Apr 1997 06:19:19 GMT

In article <3352B542.5549@mipos2.intel.com>, Jeff Reilly
<jwreilly@mipos2.intel.com> writes:

|> One of the largest initial problems SPEC has is locating legally
|> distibutable, portable code that solves a real world problem. With the
|> SPEC search program, SPEC is hoping to find such programs and work with
|> the authors to identify the issues (such as algorithm choice and
|> "appropriateness").

As always (i.e., finding good benchmarks is hard).

In other thread:

From: Glen Clark <glen@clarkcom.com>
Subject: Re: Sun Ultra HPC Memory Performance
Date: Sat, 12 Apr 1997 18:37:20 -0400

Hugh LaMaster wrote:
 
> Applications sometimes show sharp knees as a
> function of data cache size.  When you average
> a bunch of them together, you get some kind of
> smooth curve.
 
I would have called it a step-function rather than a knee, 
but I won't quibble over terminology.

Does anyone know of a cache probe tool with which you can tell
where you are with reference to the next discontinuity? If you
...
At present, the only way we have to tell whether we're over or
under and by how much is to launch multiple, similar programs
of different sizes and to plot their execution times and to look
for the discontinuity. This gets the job done, but it rather
inelegant.
.....

This raises an issue that has bothered me for some time, which is
that typical benchmarks (of the form that SPEC and many
others use) are indeed very sensitive if their working sets are
around cache-size boundaries, and that one would prefer some metrics
based on variable-sizings, since after all, real programs vary in size.

EXAMPLE:
	Suppose you have 2 systems, A and B, that are identical, except
	for cache size, where A has cache size 1, and B has cache size 4.

Consider the performance measures gotten from 3 different benchmarks,
with "working sets" of size X:

X	Performance	Notes
 .5	A ~= B		A almost as fast as B, since it fits in A's cache
3	A <<< B		A much slower than B, since A thrashes
6	A << B		A slower than B, but by less than in the previous case
20	A < B		A slower, but both getting dragged down to memory
			speed

(I put "working set" in quotes; I'll come back to that later.)

The graph of relative performance looks like:

A/B

1.0----	| ******
	|	 *							*
	|		*					*	
	|			*			*
	|				*	*
	|
	|
	|	A			B
	|--------------------------------------------------------------------
Size	   .5	1	2.0	3.0	4.0	5.0	6.0	7.0	8.0

I.e., if you have system A, pick a benchmark that either fits in your
cache, or is way bigger than B's cache.

If you are B, pick the largest benchmark that fits in your cache.

The intent of the original SPEC89 was to find at least some benchmarks
whose working sets were bigger than the caches of the day.

Note: "working set" really means: gets good cache hit rate in this size,
and cache hit rate drops strongly if the benchmark gets much bigger.
Note, of course, that some benchmarks work that way, some don't:
some matrix codes that have been cache-blocked can get as large as you
like, and as long as the blocks fit in A's cache, A and B will run at the same speed.

The problem comes that any given benchmark has a given size, and it can tell
you either that A is only slightly slower than B, or substantially
slower.  [With the right access patterns, you can make A ~ 100% D-cache miss
rate, and B approach 0% miss rate, so that A is much slower than B.]
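
A crude "cache probe" of the sort asked about in the quoted post is easy to
sketch (the stride and the size range below are assumptions for
illustration, not recommendations): time a fixed number of strided accesses
over working sets of increasing size and watch for the knee where
time-per-access jumps.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE 64	/* assumed >= cache line size, in bytes */

int main(void)
{
	long size, i, touches;
	for (size = 4L * 1024; size <= 16L * 1024 * 1024; size *= 2) {
		volatile char *buf = malloc(size);
		clock_t t0;
		long sum = 0;
		if (!buf) break;
		for (i = 0; i < size; i++) buf[i] = (char)i;

		touches = 0;
		t0 = clock();
		for (i = 0; touches < 10 * 1024 * 1024;
		     i = (i + STRIDE) % size, touches++)
			sum += buf[i];	/* volatile read; stays a memory access */
		printf("working set %8ld bytes: %.1f ns/access (sum %ld)\n",
		       size,
		       1e9 * (double)(clock() - t0) / CLOCKS_PER_SEC / touches,
		       sum);
		free((void *)buf);
	}
	return 0;
}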

Ideally, the figure of merit for a given benchmark, if it is one with this
pattern, would measure performance across a range of sizes and
integrate all of the results, so that B got credit for better
performance ... but not too much credit, i.e., not as measured at
point 3.5.

This is workable for some technical codes, where array sizes can be changed
easily (and this has been done, of course).
It doesn't work so well for those cache-blockable codes where once
a cache is big-enough to get good hit rates, it keeps getting them.

Finally, I've never run across anyone doing this for code sizes,
as that requires synthesizing some realistic-looking code and replicating
it, or having some large chunk of code and being able to parameterize
the amount of code touched. 

As CPU cycle times continue to drop faster than memory access times,
the sensitivity to these effects grows ... and of course, "cache size" is
just one of the attributes following the general rule:

	if you have more of something, pick a benchmark that stresses that
	something optimally ... and whoever has less of it will look
	worse :-), even worse than they really are.

One can also find surprises when comparing "faster CPU, smaller cache"
versus "slower CPU, but bigger cache", or "smaller cache, but faster
memory system" versus "bigger cache, slower memory system."


Note: there are probably some good M.S. theses lying around here.






-- 
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: DEC/Intel deal and Alpha future...
Date: 12 Nov 1997 04:24:23 GMT

In article <34655CDF.AFA@EasyInternet.net>, Paul DeMone
<PaulDeMone@EasyInternet.net> writes:

|> > :      2) integer performance compared to x86 and other RISCs
|> >
|> > :        - FALSE: with a few noticably exceptions (0.35 um P6 intro)
|> > :          Alpha has held a consistent integer lead, typically as high as
|> > :          50%, occasionaly a lot more, over other RISC and x86 flagship
|> > :          CPUs
|> >
|> > Well today it's like 4.29 % (hmmm less then 5%).  (vs c240)
|>
|>   Well if you want to compare with the spanking new C240 why not try
|>   the latest Alpha results of 18.8 SPECint95 (AS 4100 IIRC);  this


Sigh. "Integer performance" != SPECint.
	"floating point performance" != SPECfp

SPECint is a set of benchmarks that tell you something about integer performance
for certain sizes of code, often with ultra-heroic compiler tuning that
sometimes doesn't happen on real applications, some of which includes
pattern recognition of the specific source code that can break real applications,
etc.

SPECfp is a set of benchmarks that tell you something about floating point
code of certain sizes, often with... (etc)

In general, statements of the form:
	X has better integer performance than Y ....
	For example, SPECint....
are wrong.

Reality would be much better served if people would say:
	"Integer performance, as measured by SPECint, of X is better than
	that of Y" to remind people of what is, or isn't measured.

For instance, in an Origin system, the following changes:
	- Improving the L2 cache speed by 1.5X.
	- Quadrupling the L2 cache.
	- Increasing the main memory bandwidth.
	- Decreasing the main memory latency.

don't affect SPECint95 very much, since the on-chip caches get fairly
good hit rates.  Nevertheless, in larger codes, in various combinations,
such changes can have noticeable effects [of course, all of these tests are
on unannounced machines, or ones we may never actually build].
In fact, a whole lot of hardware features found to be useful in various
real codes don't really help SPECint very much.

I've said this before, and I guess I'll have to keep saying it:
1) SPEC benchmarks are useful; they sure improved the state of the world over
	Whetstone & Dhrystone & vacuous mips-ratings; newer ones will
	continue to be useful, I hope.

2) People have *got* to use them for what they're good for, and not
propagate over-generalizations.

3) The best thing about SPEC is the provision of large amounts of consistent
data: search among the benchmarks for ones that are good predictors for
your own codes, and ignore most of the rest.  The problem is that many codes
are not very well predicted by them, despite being real codes of the
sort that McCalpin cited.

4) Finally, remember that SPECint & SPECfp  (done as 1-CPU benchmarks on
multiprocessors) tell you very little themselves about scaling, i.e.,
a processor A might have SPECfp 1.5X higher than processor B,
but 8 of them might not have SPECthruput 1.5X higher than 8 of the B's.



--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

