From: (Preston Briggs)
Newsgroups: comp.sys.super,comp.arch
Subject: Re: Tera books first order!
Date: 25 Nov 1996 20:54:04 GMT

>Running one spec program as one thread in the Tera would give a
>terrible result, something like a 333 MHz / 20 instructions minimum
>between two issues on the same thread ~ 16 MHz P6.

Maybe.  Things we have in our favor include wide instruction words,
plenty of registers, and a nice compiler.

>Now if the program could be parallelised by the compiler to use a
>minimum of 20 threads it would come nearer a 333 MHz P6.

If there's enough parallelism to saturate a processor (say, 20 to 40
threads), then we'll whop on any current commodity chip.

But these arguments miss the point.  Our processors can be effectively
composed to give larger systems without modifying the programming
model.  If code runs fast on 1 processor, it'll run about twice as
fast on 2 processors, etc.  Without changing the source or
recompiling.  Speed of a single processor should be unimportant to
users.  It matters only as an engineering detail (i.e., is it cheaper
to make fewer, more powerful processors or more, less powerful ones?).

Preston Briggs

From: (Preston Briggs)
Newsgroups: comp.sys.super,comp.arch
Subject: Re: Tera books first order!
Date: 26 Nov 1996 20:53:05 GMT

>>As I understand it, it's a throughput machine.  In other words, fast
>>context switching allows multithreading to hide latency, so that many
>>different jobs keep the machine busy.  I'm assuming that typically
>>(although not *necessarily*) the different threads will be part of
>>entirely different jobs.

>The user will be free to use the machine as s/he pleases, but 
>*typically* lots of the threads will belong to the same application.
>There are two kinds of "switching" going on:
>(1) Thread switching *within* a process:  This is free, and happens
>    after every cycle.

>(2) Context switching *between* processes:  This is not free -- I
>    think I saw an estimate of 30-40 cycles in one of the Tera
>    technical reports.

Where did you see this?  (So I can fix it)

There's one kind of switching.  Each processor can support something
like 15 jobs at a time (that is, up to 128 threads spread as
you please over up to 15 jobs).  Switching between threads is always
free, regardless of job.

And more processors => more simultaneous jobs.

Preston Briggs

Subject: Re: 64 bit registers
From: preston@Tera.COM (Preston Briggs)
Date: Mar 20 1996
Newsgroups: comp.arch

krste@ICSI.Berkeley.EDU (Krste Asanovic) writes:
>I'm curious how you [Tera] handle the case where you issue a burst of 8
>instructions from the same thread in consecutive pipeline stages?

We never issue instructions from the same thread in consecutive
pipeline stages (though one instruction can have 3 operations which
are issued simultaneously to different pipes (not different stages)).
At best, instructions from a single thread have at least 20 cycles
between them.  If many other threads are active, the delay between
instructions may be longer.

Perhaps you're confused about the instruction lookahead?  The
lookahead field points to the next instruction that depends on the
current memory reference.  If it's set to 7, it means that the next 7
instructions can be issued (absent any other lookahead dependences)
before the current memory reference must be completed.
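As a rough sketch of what the lookahead field buys, here's a toy model (mine, not Tera's actual issue logic), using figures from these posts: a roughly 20-cycle minimum gap between instructions of one thread, and memory latencies around 100 cycles.

```c
#include <assert.h>

/* Toy model of lookahead (an illustration, not Tera's issue logic).
 * A memory reference issued with lookahead k permits the next k
 * instructions of the same thread to issue before the reference must
 * complete, so the first dependent instruction sits k+1 issue slots
 * behind it.  With a minimum gap of issue_gap cycles between
 * instructions of a single thread, this estimates how long that
 * thread stalls waiting on the reference. */
int stall_cycles(int lookahead, int mem_latency, int issue_gap)
{
    int earliest_dependent = (lookahead + 1) * issue_gap;
    int stall = mem_latency - earliest_dependent;
    return stall > 0 ? stall : 0;
}
```

With these numbers, a lookahead of 7 hides a 100-cycle reference completely (8 slots * 20 cycles = 160 cycles of slack), while a lookahead of 0 would leave the thread stalled for 80 cycles.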

In older versions of the design, there was consideration given to
making the lookahead apply to all the operations.  In that case, it
would have been possible to issue several instructions quite close
together.  On the other hand, we wouldn't have been able to tolerate
as much memory latency.

Preston Briggs

Subject: Re: Does it exist a shared-bus NUMA multiprocessor?
From: preston@Tera.COM (Preston Briggs)
Date: Mar 05 1996
Newsgroups: comp.arch

>It's easy to provide uniform memory access time to large numbers of
>processors.   A uniformly *bad* memory access time, that is.  Heck,
>even the Tera MTA has NUMA as I understand it, but perhaps Preston
>has become so insensitive to memory latency that he doesn't notice
>any more. ;-)

It's true that I'm becoming insensitive to latency.  But why should I
pay attention to something that's unimportant?  Writing code for our
machine, I find a healthy amount of parallelism at a fairly coarse
grain, enough to keep the processors saturated, then write the rest of
the code with no concern for parallelism, latency, locality, etc.  The
only thing that matters is reducing the total work.

>Very generally, once one has enough memory in a system, some of it
>inevitably ends up being noticeably further away from a given CPU than 
>the rest, and it seems to me unreasonable to slow down the nearby accesses
>out of a sense of cosmic justice.

Our memory is spread around a network, so some accesses take more hops
to get from the processor to the memory and back.  But the difference
_isn't_ noticeable.  My thread runs along, does an access on one
instruction and, voila!, the result is there by the next instruction.
Of course, many other threads will have executed instructions in the
meantime, but that doesn't affect me.

When I write code for the machine, I often look at the inner loops to
get a feeling for the cost of each iteration (and to make sure the
compiler's doing the right thing).  We count costs like this:

	an add		=> 1
	a multiply	=> 1
	a branch	=> 1
	a load		=> 1
	a store		=> 1
	a fetch&add	=> 1
	spawn a thread	=> 1
	quit a thread	=> 1

Latency just doesn't enter into the calculation.
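As an illustration (my example, not one of the loops above), here's a daxpy-style inner loop counted with this model:

```c
/* Illustrative work count for a daxpy-style inner loop under the
 * one-unit-per-operation model above; latency never appears. */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        /* load x[i]        => 1
         * load y[i]        => 1
         * multiply-add     => 1
         * store y[i]       => 1
         * increment+branch => 1
         * total: about 5 units of work per iteration */
        y[i] = a * x[i] + y[i];
    }
}

/* Small self-check: daxpy(2, 2.0, {1,2}, {3,4}) leaves y = {5, 8}. */
double daxpy_probe(void)
{
    double x[2] = {1.0, 2.0}, y[2] = {3.0, 4.0};
    daxpy(2, 2.0, x, y);
    return y[0] * 10.0 + y[1];
}
```

Whether the multiply-add counts as one operation or two depends on the instruction word; either way it's the instruction count, not the memory latency, that sets the cost.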

Unfortunately, latency occasionally matters, even on the Tera.  When
many threads are competing for a critical region, the longer it's
locked, the less throughput you'll see.  Consider, for example, a
simple random number generator.

sync unsigned seed$ = 123456;

unsigned rand() {
  unsigned s = seed$;            /* read: waits for full, sets empty (lock) */
  s = A * s + C;                 /* linear congruential step */
  seed$ = s;                     /* write: waits for empty, sets full (unlock) */
  return s % M;
}

where A, C, and M are magic numbers derived after a careful reading of
Knuth.  I've marked the variable seed$ as "sync", which has special
meaning to our compiler.  Sync variables have a value and a state.
The value is (in this case) an unsigned integer. The state is either
full or empty.
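A read of a sync variable waits for full and leaves it empty; a write waits for empty and leaves it full.  On conventional hardware you can emulate that behavior (far less efficiently) with a mutex and condition variable.  This is only a sketch of the semantics, not of Tera's implementation; on the MTA the state bit lives in the memory word and the waiting is done by the hardware:

```c
#include <pthread.h>

/* Emulation of a sync variable: a value plus a full/empty state,
 * guarded by a mutex and condition variable. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    unsigned        value;
    int             full;
} sync_var;

void sync_init(sync_var *s, unsigned v)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->changed, NULL);
    s->value = v;
    s->full = 1;                 /* an initialized sync var starts full */
}

/* Read waits for full, takes the value, and leaves the state empty. */
unsigned sync_read(sync_var *s)
{
    pthread_mutex_lock(&s->lock);
    while (!s->full)
        pthread_cond_wait(&s->changed, &s->lock);
    s->full = 0;
    unsigned v = s->value;
    pthread_mutex_unlock(&s->lock);
    pthread_cond_broadcast(&s->changed);
    return v;
}

/* Write waits for empty, deposits the value, and leaves it full. */
void sync_write(sync_var *s, unsigned v)
{
    pthread_mutex_lock(&s->lock);
    while (s->full)
        pthread_cond_wait(&s->changed, &s->lock);
    s->value = v;
    s->full = 1;
    pthread_mutex_unlock(&s->lock);
    pthread_cond_broadcast(&s->changed);
}

/* Self-check: read drains the initial value, write refills it. */
unsigned sync_probe(void)
{
    sync_var s;
    sync_init(&s, 123456u);
    unsigned a = sync_read(&s);  /* takes 123456, leaves s empty */
    sync_write(&s, 7u);          /* refills it */
    return a + sync_read(&s);    /* 123456 + 7 */
}
```

With this, rand() above becomes sync_read on the seed, the multiply-add, and sync_write, which is exactly the lock/compute/unlock shape marked in the comments.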

In the code above, the state of seed$ is initially full.  When some
thread calls rand(), it reads seed$, setting it to empty (1
instruction of work, but perhaps 100 cycles of latency).  Then it
computes a new value (1 instruction, perhaps 20 cycles of latency).
Then it stores the new value, making seed$ full (1 instruction,
perhaps 100 cycles of latency).  Finally it does the mod and returns
the result (maybe 3 more instructions).

There are several things to look at here.  We want good "random"
numbers, so we read Knuth closely.  We want the total work to be
small: it's 8 instructions (had to load some constants).  And we worry
about the length of the critical section (when other threads are
prevented from accessing seed$).

In this case, it's basically 1 round trip to memory + a couple of
instructions.  That is

	time for packet to travel from memory to processor
	time for mul-add
	time to issue store
	time for packet to travel from processor back to memory

So maybe 150 cycles or so, depending on how busy the processor is (and
a little bit on the network, but it's got so much bandwidth that it's
difficult to overload).

150 cycles means that we can only generate 1 new number per 150 ticks,
no matter how many threads need random numbers.  Not very impressive.
And this problem is due entirely to latency.
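The ceiling is simple arithmetic; here's a back-of-envelope model (mine, using the ~150-cycle figure above):

```c
/* Back-of-envelope throughput model.  A critical section held for
 * hold_ticks serializes everyone: at most 1/hold_ticks results per
 * tick, no matter how many threads exist. */
double locked_rate(double hold_ticks)
{
    return 1.0 / hold_ticks;
}

/* With no critical section, each of p processors issues one
 * instruction per tick, so `work` instructions per result gives
 * p/work results per tick. */
double lockfree_rate(int processors, double work_per_result)
{
    return processors / work_per_result;
}
```

With hold = 150 and work = 6, even a single processor beats the locked version by a factor of 25, and the lock-free rate keeps scaling with processors while the locked rate stays flat.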

Of course, we get around the problem in this instance by using a
better random number generator -- one with no critical section.
Reading Knuth, Vol 2, 2nd edition, we discover algorithm A, which can
be implemented in such a way that we can produce a random number in 6
instructions of work, and with no critical sections.  Thus, we get
processors/6 new numbers per tick.
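Here's a sketch of that kind of no-critical-section generator: Knuth's additive generator (Vol. 2, Algorithm 3.2.2A), X[n] = (X[n-24] + X[n-55]) mod 2^32.  The lags are Knuth's; the seeding below is my own crude choice, and the exact variant Tera used isn't specified above.  Giving each thread its own state removes the shared seed$ entirely, so no thread ever waits on another.

```c
#include <stdint.h>

#define LAG_BIG   55
#define LAG_SMALL 24

typedef struct {
    uint32_t x[LAG_BIG];   /* last 55 values, circular; x[j] is oldest */
    int j;                 /* index of X[n-55] */
    int k;                 /* index of X[n-24], always (j + 31) mod 55 */
} knuth_rng;

void rng_init(knuth_rng *r, uint32_t seed)
{
    /* LCG warm-up to fill the lag table; any seeding that leaves at
     * least one odd element works for a mod-2^32 additive generator. */
    for (int i = 0; i < LAG_BIG; i++) {
        seed = seed * 1664525u + 1013904223u;
        r->x[i] = seed | (i == 0);          /* force x[0] odd */
    }
    r->j = 0;
    r->k = LAG_BIG - LAG_SMALL;             /* 31 slots ahead of j */
}

uint32_t rng_next(knuth_rng *r)
{
    uint32_t v = r->x[r->j] + r->x[r->k];   /* X[n-55] + X[n-24]; mod 2^32 free */
    r->x[r->j] = v;                         /* overwrite the oldest slot */
    if (++r->j == LAG_BIG) r->j = 0;
    if (++r->k == LAG_BIG) r->k = 0;
    return v;
}

/* Self-check: identical seeds must give identical streams. */
int rng_probe(void)
{
    knuth_rng a, b;
    rng_init(&a, 1u);
    rng_init(&b, 1u);
    for (int i = 0; i < 100; i++)
        if (rng_next(&a) != rng_next(&b))
            return 0;
    return 1;
}
```

Per-thread, rand() becomes rng_next(&my_rng) % M: a handful of instructions of work, with no sync variable anywhere.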

Preston Briggs

From: (John R. Mashey)
Newsgroups: comp.sys.super,comp.arch
Subject: Re: Tera Boots
Date: 15 Sep 1997 06:51:34 GMT

In article <5v10pg$vd8$>, (Del Cecchi) writes:

|> this IS a supercomputer with a novel architecture.
|> John Mashey, if this is not some kind of proprietary information, how
|> long did it take to do the R10000 from "hey, let's design a new chip"
|> to Unix prompt?

1) The R10000 is perhaps not the right comparison, I'll list several to
bound this.

2) R2000:
	4Q84:  start, really, but already having Fred Chow's compilers
		from Stanford, and the Stanford research, of course.
	4Q85:  tapeout
	12/85: first silicon, bootproms up
	~4/86: first UNIX boot, I think
	~6/86: first UNIX stable enough that we dared let anyone else look at it
	~9/86: system shipments stable enough to dare sell as early units

	[These were MIPS M/500s, i.e., deskside VME-based uniprocessor systems].

3) R10000:
	4Q91: some discussion was going on
	2H92: really got going (recall there was acquisition & move)
	mid96: production-class SMP systems (Challenge R10Ks)

	I.e., consider this 4+ years from getting serious to production

4) The first system of a family has the advantage that certain chip bugs
might be defined away as "features", which later chips cannot do,
since they have to be upward-compatible, or made to act that way.

First chips have the disadvantage that there are masses of software work to be done.

5) OS's in particular mature as a function of:
	{number of systems, number of users, elapsed time}

6) From the Tera press release:
	a) I must admit I was a little surprised, as I'd assumed they'd booted
	UNIX before, as they'd reported NAS sort benchmark numbers.
	[Their press release did not *say* there was an OS there, I'd just
	assumed it, but I remember how it is...]

	b) From the press release, it appears that this was the first boot on
	real hardware, not simulations.  Also, it appears that it was the
	uniprocessor version; MP appears still to need work, not surprisingly.

7) When people ask me of my opinions of Tera, I always say:
	"Interesting architecture, from which useful things will be learned.
	Completely unclear if there's a business there or not, given the
	amount of work and $$ it takes to ship reasonable systems in this
	class, and the relatively small niche available.  On the other
	hand, if the machines are enough faster than anyone else, doing
	a small set of applications that people care about, then they have
	a chance, although it seems unlikely they'll be able to get lots of
	3rd-party software, given what that's like these days."

8) If the first boot just happened, it does seem slightly
	unlikely that they will be able to ship stable systems to SDSC this
	calendar year, but stranger things have happened.   The real killer is
	that the bar raises every year for what people expect, and the nature
	of the Tera systems is not to be a high-volume product.

-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

From: (John Mashey)
Newsgroups: comp.arch
Subject: Re: Tera
Date: 25 Oct 2004 15:10:38 -0700
Message-ID: <>

Jim Cownie <> wrote in message news:<VG2fd.14595$>...
> del cecchi wrote:
> > Tera wasn't a spook front before, so far as I can tell.
> No, however given Burton's biography
>   From 1985 to 1988 he was Fellow at the Supercomputing Research
>   Center (now the Center for Computer Sciences) of the Institute for
>   Defense Analyses.
> You can be fairly sure that while he was working on what later became
> the Tera machine he had access to the right benchmarks to ensure
> that it would be of interest to the NSA.

(My quarterly look-in to comp.arch).  This thread has a lot of *wild*
fantasy, including the idea that the MTA was built for NSA-land...

Let's review some history:
1) lists products: X1, XT3,
XD1, SX-6.
   It also has a section called "initiatives":
   Red Storm, Cascade, Oak Ridge, MTA-2 (in that order).
   Note that MTA-2 is an initiative, not a product.

2) : MTA-2 :
   Cray MTA-2 Historical Technical Papers, from which one can quickly

3) Some history, followed by some assessments:
- A lot of this goes back to Denelcor HEP in early 1980s.

- Work on MTA apparently started around 1987.

- The Tera architecture was described in detail in 1990. Likewise,
plenty of compiler papers were published around then, by first-rate
people like Preston Briggs and Ken Kennedy.

- 9/95: Tera IPOs, on Nasdaq SmallCap; moves to NASDAQ 1/16/98

(I don't recall the exact date, but sometime in here I was on some
IEEE panel in San Jose with Burton Smith, noting that I thought
threads were OK up to the point where the immediate CPU->interconnect
was saturating, at which point more threads, cores, etc didn't really
help.  I also said that I'd wished they'd done CMOS ASICs instead of a
(totally-crazed) GaAs design, so that it would be possible to actually
get software experience at a reasonable cost, as I always worried
about architectures that only made sense at the very high end, as they
just never get enough software.  Burton still loved GaAs.)

- 3/26/97: MTA-1 sets new record for NAS Parallel Integer Sort

- 8/28/97: Tera today announced that it has successfully booted and
run its UNIX-based operating system on its Multithreaded Architecture
(MTA) prototype.

What?!?!?!? (I ran across this a bit later, after they were on NASDAQ,
and it caused me to immediately short the stock, which was $15 ...
I've run UNIX bringups on various sorts of new hardware.  If you want
to guarantee several years from first boot to production-solid
systems, design a brand-new type of CPU, with a new interconnect, new
compilers, new OS, a different style of multiprocessor, and THEN do it
in GaAs, so there are so few in existence that software people have to
tussle over who gets a precious hour or two.  I expected it would be
at least 2 years before this was solid, and that turned out to be
about right.)

- 12/5/97: for first time, Tera runs using >1 CPU

- 12/31/97: Tera installs a 1P MTA-1 @ San Diego Supercomputer Center;
only one outside Tera. SDSC effort funded by NSF, DARPA, i.e.,
"evaluation grants" to get MTA and experiment with it.

- 4/28/98: SDSC gets 2nd processor

- 12/22/98: SDSC gets 3rd and 4th CPUs.

- 12/31/98: Evaluation so far
"With the availability of Unix, more users began computing on the Tera
MTA at SDSC. This increased load appeared to lessen stability. During
much of the quarter the mean uptime between interrupts was about one
[day]. Following the hardware upgrade, numerous problems ensued that
prevented productive use of the MTA. By quarter's end, these problems
were still being diagnosed."
"Although the hardware was upgraded to four processors in December,
various hardware and software problems prevented its productive use
thereafter during the reporting quarter. [The most critical problems
were fixed in the following quarter, allowing performance measurements
to resume.] In addition, the instability of the Unix operating system
limited testing to relatively short runs throughout the rest of the
[quarter]."
- 6/29/99: SDSC gets CPUs 5-8.

- 11/10/99: SDSC orders another 8P, upgrade from 8GB -> 16GB
"We further expect that this will allow us to transition some of our
production workload to the MTA."
"Furthermore, we are excited about reaching this milestone as I am
confident that the MTA-16, with CMOS components, will mark Tera's
entrance into full scale commercialization and sales in year 2000."
(explains why CMOS, finally).

- 10/9/2000: A Scientific-Computer User's Assessment of the Cray MTA,
Keith Taylor,  This is a
fairly enthusiastic user's honest discussion, including notes like:
What the MTA Lacks:
   A FORTRAN 90 compiler
   An MPI library [since many parallel codes have been MPIized]
   A decent debugger
   A decent profiler
   Directory cross-mounting [currently, all tools run on a Sun;
    all files have to be 'ftp'ed to the MTA]
   Machine availability. "Since the start of the New Year, the
world's only MTA, at SDSC, has been very unreliable.  The problems
stem from an enhancement in which the infrastructure has been expanded
to accommodate an increased number of processors from 8 to 16,
although all the extra processors aren't in place yet."
   A rudimentary editor.

What the MTA, Cray, and SDSC Do Currently Provide
   Machine access (via Sun)
   F77 [good]
   Compiler analysis tool canal [brilliant]
   Help and advice
   Support of Multi-User Job management
Then gives enthusiastic plusses of MTA, tempered with:
"Despite launching many runs, I have so far been unable to obtain
results for 8 processors on the MTA."  Still, he would like to acquire one.

- 9/14/2001 - MTA Retired @ SDSC
"Further upgrade to 16 processors was planned, but difficulties in
manufacturing its GaAs chips prevented the upgrade from being
[completed]."
"Only a few kernels and large applications were found for which the
eight-processor MTA achieved higher absolute performance than other
computers available at SDSC. Such codes had the following [characteristics]:

- They did not vectorize well.
- They were difficult to parallelize on conventional machines.
- They contained substantial parallelism.

Examples were codes that involved integer sorting; dynamic, irregular
meshes; or dynamic, non-uniform workloads within a regular mesh.

Offsetting its attractive programming model, the MTA proved very
difficult to manufacture and maintain. The inability to add more
processors left the computer relatively underpowered, and the frequent
down times reduced user productivity. These factors ultimately led to
diminished demand and the decision to retire the MTA at SDSC."  (i.e.,
the grant probably ran out).

- 1/3/02 - Cray ships first MTA-2 (to unnamed customer overseas, I
think ENRI in Japan), with larger system to Logicon/ Naval Research
Lab that quarter

- 10/8/02 - MTA-2 accepted by NRL

- 11/11/03 - Early Experience with Scientific Programs on the Cray
MTA-2 sc/2003/2113/00/21130046abs.htm

"All of the programs required a fair amount of work to port to the
[MTA-2]."
"I/O performance is often a problem when porting applications to the
[MTA-2]."
(Amdahl's Law strikes again).


4) Some assessments

a) There have always been some interesting ideas, and some really fine
compiler work in this.  The GaAs choice was a disaster, and the general
quality of execution implied above speaks for itself.

b) The MTA work started around 1987, and *10 years later*, had booted
UNIX and delivered one (1) system to a customer, which got an
evaluation grant to test it.  It is worth knowing that there is often
grant money that explicitly cannot be spent on products that work.  I
once was trying to sell a particular government agency some SGI
Challenges, and was told "Oh, we buy those for production, because
they work, but this pot of money has to be for experimental stuff.
What do you have that almost works?" me: "How about a quarter-baked
bunch of FDDI-network hardware and flakey prototype software for
ganging a bunch of SMPs together to run big problems?" A: "Yes, now
you're talking!"

Note that the users of such grant money, quite properly, are supposed
to see past flakiness to possibilities.  In academe, evaluating new
machines is fun, and lets one write interesting papers.  Note however,
that the job "try this out" is not the same as "we need to actually
port and run our workload."
Note also, that this sort of thing is subject to "publication bias",
i.e., positive results tend to get published more widely than negative
ones, but in real life, both count.

Nevertheless, over-interpretation of the results above may yield total
fantasy, that is, there may be plenty of good architectural ideas, but
EXECUTION COUNTS, and so does volume, and this has a long history of
bad implementation choices, missed market windows, and continual
unreal-optimism about deliveries, and impossibly-small volumes for
software people to have a fighting chance of getting their work done.

It is a long way from building 1-2 systems, which even superb
programmers struggle with, but produce fine results on a few codes, to
actually having the sort of mature software environment that lets
people actually get their work done, and that runs enough codes well
enough to be competitive in a research environment.

From there, it's an even longer way to providing systems that will be
bought and used by production supercomputer users like the automobile
companies.  It was interesting to see the Tera people talking about
LS-Dyna performance on crash tests in 1998 ... but just plain absurd,
as the auto folks are rather demanding customers, and so are serious
third-party software vendors.  Talking about auto companies, before a
4P system stays up for an hour, is fantasy.

c) So, maybe there's a pile of MTA's at NSA, but I doubt it, and I
have at least been there, and similar places, a few times, discussing
features they wanted.  There are some things about MTA they'd like,
but.... read the history again.

5) The HPC business has long been plagued by the fact that for every
problem you can think of, you can invent a different architecture that
will be optimal, and this is enough to get everybody excited ... but
it's not enough to make a business.

6) High-end, special-purpose, unique-technology computers are usually
snares and delusions in any commercial sense, and they certainly
snared and deluded VCs for many years, although most have learned
better.  I recall the heyday of the mini-super era, where VCs funded
the craziest ideas.  It's bad enough to fund 30 disk companies or 30
PC companies, each expecting 20% of its market, but at least the
products are understandable to mortals.  There are a few VCs who
actually have serious HPC tech credentials ... and the chances of them
funding special-purpose high-end tech is pretty low.

7) Once again, I'd guess that if something truly interesting and
widely-usable happens in HPC hardware, it will likely be technology
coming from below, like doing some SoC that's widely usable in some
volume application, but can be ganged together for some HPC apps as
well.  Once upon a time, the largest GFLOPS in one place was a
Seattle warehouse that had half a million Nintendo N64s.  Admittedly,
the I/O (half a million kids) was weak, but there were lots of FLOPS.

From: (John Mashey)
Newsgroups: comp.arch
Subject: Re: Tera
Date: 26 Oct 2004 14:08:25 -0700
Message-ID: <>

(Nick Maclaren) wrote in message news:<cll22v$8ao$>...
> In article <ORefd.771304$>,
> Stephen Fuld <> wrote:

> >But the evidence is that Tera somehow managed the investment, but
> >essentially no one bit.
> No.  Tera did NOT do so - THAT is the point.  The GaAs prototype
> was just that, and attracted a lot of interest.  The promised CMOS
> production model was already behind schedule (surprise, surprise)
> when Tera was recreated as Cray.  And, at that point, the MTA was
> put on the back burner.

1) It wasn't a prototype in any normal sense of the word as used in
   the industry:
   People have built boards as prototypes for later chips [MIPS R2010].
   People have built ASICs before doing full-custom CMOS [first
   People use FPGAs to build prototypes of ASICs (or convert FPGAs to
   ASICs).

2) No one in their right mind prototypes CMOS ASICs by building giant
   multi-chip GaAs systems at great cost and with long design time.
   Burton is certainly smarter than that, i.e., the choice of GaAs was
   a mistake, but it wasn't the idiocy of making a GaAs *prototype*.

3) And in any case, I do not believe that Burton thought the GaAs
MTA-1 was a prototype, given that various people (I included)
discussed it with him publicly more than once during that period.
Before the ATIP meeting in Tokyo in 1999, Tera had certainly come
around to having to do CMOS, because they had realized the error, but
this was not so in 1995.  GaAs was supposed to be a production system.
It just wasn't one, but it certainly wasn't for lack of Tera people
trying hard.

4) Had this been done in CMOS, and appeared in the early 1990s, it
might have gotten a piece of the action along with the microprocessor
folks.  But, in this game, it's not enough to beat an incumbent at
something, you have to beat all of the plausible incumbents at a
big-enough something to be worth doing.  In the early 1990s, {IBM,
SGI} were able to successfully attack the dominant, but expensive
vector Crays, and various expensive parallel systems, using
higher-volume micros.  It took a while to get the sorts of mature
software environments that people wanted, and get ports of the
relevant third-party software.  But now, such environments are widely
available, even on commodity micros.

5) For new architectures in HPC & general use, the "bar" rises all
the time.
In the early 1980s, you could do a straightforward port of UNIX to a
new micro, and ship systems that people might want, but very quickly,
there got to be a long list of checkoff items that must be there, in
compilers, OS, and third-party software.  The same thing happens in
HPC.  In the "good old days", LLNL would happily accept a fast
computer and write their own system software.  People expect more now,
even if they expect to get source code and mess with things as well.

6) But, the real killer for expensive, special-designed HPC systems is
that the ecology of HPC is already well-populated by a range of
solutions, of which at least some have low-cost basic technology
backed by acceptable software, and a new entrant:
 *must* beat *all* the incumbents at something.
  This might be price [good], performance [OK], or price/performance
  [more common], but in any case, there are many more plausible
  incumbents now.
 *must* beat them by enough to be interesting.
 *must* have a roadmap that plausibly keeps them ahead for several
  years, where "plausibly" includes evidence of being able to execute
  on it.

At one point Tera put out a press release that said: "Tera Beats
Cray T90 by 10 Percent".  It ran the benchmark in 1.79 secs,
compared to the T90's 2.02 seconds.

That's fine ... but realistically, most people didn't buy T90s to do
integer sorting...
