From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.os.linux.development.system,comp.arch,alt.folklore.computers
Subject: Re: MRISC [really: multithreading again]
Date: 29 Jun 1995 20:06:13 GMT

This is a slightly edited version of some discussion from comp.arch,
June 1994.  Note that people usually call these things multithreaded
CPUs, of which there is a long history: the HEP and some of the
Honeywell machines years ago, and, coming soon, Tera's machines.
Preston Briggs had a good posting last June, with good references;
maybe he will repost that.

1) See Microprocessor Report, May 9, 1994, p18-, "Microprocessors Head
Towards MP on a Chip" for a good discussion of related issues, including
the range of choices from simple uniprocessor on a chip, through
threaded CPUs, through having several CPUs on a die that share an L2 cache,
through several complete CPUs.

2) There is endless talk about threads, and some good research, like
MIT's Alewife/Sparcle [IEEE Micro, June 1993].

On the other hand, many people keep talking about threaded CPUs as
though all that's needed is a few-cycle switch of user register set and PC ...
or happily interleaving streams of instructions ...
and that's simply *bogus* when you are talking about actual products.
If more people don't *usefully* address the implementation issues
[some nice theses might be around], there is going to be little progress
in figuring out whether threaded CPUs are good ideas versus the
other relevant designs, either now, or later.
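To put rough numbers on the few-cycle-switch story, here is a toy
steady-state model of a switch-on-miss threaded CPU.  Every parameter
in it is invented for illustration; none comes from a real design:

```python
# Toy model of the "few-cycle switch" story for a switch-on-miss
# multithreaded CPU.  All parameters are illustrative assumptions.

def utilization(n_threads, run_len, miss_latency, switch_cost):
    """Fraction of cycles spent on useful work in steady state."""
    # Once a thread misses, it cannot run again for miss_latency
    # cycles; the other threads (plus switch overhead) fill the gap.
    period = max(n_threads * (run_len + switch_cost),
                 run_len + miss_latency)
    return n_threads * run_len / period

# With a 2-cycle switch the idea looks great on paper...
print(utilization(4, run_len=20, miss_latency=60, switch_cost=2))   # ~0.91
# ...but if switching all that state really costs 15 cycles:
print(utilization(4, run_len=20, miss_latency=60, switch_cost=15))  # ~0.57
```

The attractive first number depends entirely on the switch really being
a few cycles; once all the state described in point 3 below has to
move, the cost climbs and the benefit shrinks.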

3) Real CPUs have a great deal of state:
	- integer registers *
	- floating point registers
	- control/status registers, of which there are often 20-30.
	- MMU**
	- cache**

*Even in SPARC, only a subset of the integer registers is duplicated,
none of the rest is.
** These use a lot of space, and you must think seriously about whether
or not to duplicate them;  if you don't, the threads will need to share
them, and it's hard to believe the locality gets *better* by sharing.


A production threaded processor is probably going to need
to duplicate most or all of the registers, and even this may be nontrivial.
The integer and FP registers:
	a) Are regular structures implemented in big blocks.
	b) Have full-size read/write ports connected to regular busses.
The control/status registers:
	a) Are sometimes irregular, or located on the chip wherever convenient.
	b) Often have restrictions in terms of read/write access latency,
	   with careful words about exactly when you see the effects of
	   changing a status bit in a register.  They often have "extra"
	   wires to get status bits to where they are needed, quickly,
	   because these may well be in the critical path on some designs.
	c) In general, are often *not* designed to be anything like as easy to
	   switch as switching a SPARC register window pointer;  some of this
	   difficulty may be accidental, but some of the difficulty may be real
	   issues of needing very fast signals.
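As a back-of-the-envelope tally of just the architecturally visible
state above (a sketch using generic 64-bit-RISC counts, not the parts
list of any particular chip):

```python
# Back-of-envelope per-thread architectural state for a generic
# 64-bit RISC; the counts are illustrative assumptions, not a
# parts list for any real chip.
BITS = 64
int_regs = 32 * BITS        # integer register file
fp_regs  = 32 * BITS        # floating-point register file
ctl_regs = 25 * BITS        # "often 20-30" control/status registers
pc       = 1 * BITS

per_thread_bits = int_regs + fp_regs + ctl_regs + pc
print(per_thread_bits)               # 5760 bits per thread context
print(4 * per_thread_bits // 8)      # 2880 bytes for 4 contexts
```

Note that the raw bit count is modest, which is exactly why the hard
part is not transistor count but the irregular placement, extra wires,
and fast paths just described.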

4) But, it's *much* worse than that.  To be interesting as a *product*,
a threaded processor should be reasonably competitive as a uniprocessor ...
otherwise, people will not build them as products in enough volume to be
interesting and get access to latest process technology.  Aggressive current
CPUs have lots of invisible internal state in terms of instruction queues,
results queues, PC chains, etc, and that state must be switched, or at least
kept pending, and those are nontrivial things to do.  In many designs that
people are looking at, getting lots of transistors (to duplicate registers)
is *not* the problem, but rather the RC-delay of running long wires.
Right now, aggressive chips include deeply-pipelined ones with high
clock rates, or less deeply-pipelined ones with more units in parallel,
and either way, there is plenty of invisible state right in the heart of
the CPU, where cycle time is easily affected.

Put another way: some of the discussion of threading is a classic instance
of "A simple-sounding solution to a problem ... but the problem being solved
is only a tiny fraction of the total problem, and has more ramifications
than you'd expect."  I.e., the equivalent in ISA-design = "add this one
instruction and Dhrystones will get a lot better."

5) For *research* purposes, it is appropriate to do as the Sparcle people
did: "The Sparcle chip clocks at no more than 40MHz, has no more than 200,000
transistors, and dissipates a paltry 2 watts.  It has no on-chip cache, no
fancy pads, and only 207 pins.  It does not even support multiple-instruction
issue.  Then why do we think this chip is interesting?"  (It goes on to cite
research in latency tolerance, fine-grain synchronization, and communications.)
This is *good* experimental work: take an existing architecture and tweak it to
explore interesting areas, without worrying about whether it would make
sense as a product.  Research chips can and should explore interesting
issues without having to solve *all* of the problems that production
chips must, and implementing real (if not commercial) things in universities
is something to encourage, because it helps expose real problems and tradeoffs.
IMPLEMENTATION = REALITY-CHECK, often necessary to avoid things that sound
great on paper, but do not permit reasonable implementations.

6) For *product* purposes, *100%* of the problems have to be solved,
and one has to build competitive products.  Try the following thought
experiment, and maybe somebody will do some studies (please post if you
know of any like this):
	*Start* with an aggressive, competitive CPU, i.e., which these days:
		- must have reasonable uniprocessor performance
		- must be useful in cache-coherent SMPs
		- (maybe) should be useful in Distributed Shared Memory systems
	a) Make it more latency-tolerant within a single-thread, i.e.,
	   go deeper than hit-under-miss into multiple outstanding cache
	   misses, register-renaming, out-of-order execution, etc, and
	   see what you get. [i.e., PPC 620, R10000, HP PA 8000].
	b) OR, thread it, as the main approach to latency-tolerance,
	   and see what you get.  If you only have 2 sets of registers,
	   you really want to compare with the next one.
	c) OR, put 2 CPUs on one chip, and see what you get.  [Some people
	   argue for this, as a way to keep wire-lengths down for wires that
	   have to be fast.]  Consider, for example, a DEC 21164, with 
	   both L1 and L2 caches on chip; suppose you shrank it some more,
	   put on a 2nd CPU + L1 cache, and shared the L2 cache.
	d) OR, take the die space for any of these and put it in something
	   else.  (The Microprocessor-Report mentioned above talks about
	   some of this.)
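One way to make the thought experiment concrete is a toy throughput
model.  Every parameter below is invented for illustration, and the
options are deliberately oversimplified:

```python
# Toy throughput model for options a)-c); "throughput" is useful
# cycles per clock, summed over cores, so option c) can exceed 1.0.
# Every parameter here is an invented, illustrative number.
R, L = 20, 60          # cycles of work per miss, and miss latency

def option_a(h=0.5):
    # Aggressive uniprocessor: OOO + multiple outstanding misses
    # hide fraction h of each miss's latency.
    return R / (R + (1 - h) * L)

def option_b(n=2, s=2):
    # n-way threaded, switch-on-miss, s-cycle switch: a missing
    # thread cannot rerun for L cycles; the others fill the gap.
    period = max(n * (R + s), R + L)
    return n * R / period

def option_c(L_shared=70):
    # Two full cores sharing an L2: assume contention stretches the
    # average miss, but the cores otherwise run independently.
    return 2 * R / (R + L_shared)

for name, f in [("a", option_a), ("b", option_b), ("c", option_c)]:
    print(name, round(f(), 3))    # a 0.4 / b 0.5 / c 0.444
```

Under these made-up numbers the three options land within shouting
distance of each other, which is the point: the winner depends on
exactly the wire-length, die-space, and timing details discussed next.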

Now, I'm not a chip designer, but I've certainly heard lots of serious
arguments among designers fighting over a few sq mm of die space, and
worrying about clock skew across a modest-sized chip, and doing some funny
routing to get a signal to the right place in time ... so, in these
thought experiments, it's no fair ignoring the effects of longer wires,
more multiplexors, etc.... these things are *not* free, and it's not just
space, but gate delays.

7) Summary: multi-threading is an interesting area, worthy of research.
There is plenty of room for thesis topics. Most people who design
CPUs for a living will probably wait until there is more evidence that
a threaded CPU is a better choice, when doing aggressive commercial
designs.  

-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-390-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

Subject: Re: Limitations of Superscalar RISC Architecture [really: 
	multithreading]
From: mash@mash.engr.sgi.com (John R. Mashey) 
Date: Jul 14 1995
Newsgroups: comp.arch

In article <3u5atf$r3s@gap.cco.caltech.edu>,
andrey@gluttony.ugcs.caltech.edu (Andre Yew) writes:

|> Organization: California Institute of Technology, Pasadena
|> 
|> mash@mash.engr.sgi.com (John R. Mashey) writes:
|> 
|> [stuff about geometrical and electrical performance for duplicating
|> register sets and internal control stuff]
|> 
|>      This would all be fine and good except that the Caltech Mosaic T
|> processor has already achieved much of what you called non-trivial in
|> 1.2 um, two-metal CMOS laid out by hand by a bunch of undergrads in
|> about 1 year's work.  Not only can the thing switch states at the drop
|> of a cycle (50 MHz), but there's 64 kB of DRAM on board as well.  The
|> Mosaic T was not designed to be a multiscalar machine, but it could
|> easily be converted into one.

I think this illustrates some appropriate differences between research
efforts and production chips....  and it is interesting to hear that things
believed to be non-trivial by long-experienced chip designers turn out to
be easily doable by a few undergrads ... that happens sometimes.

It would be useful to post some more info on Mosaic T, i.e., especially
        a) What problems was it trying to investigate?
        b) And which ones were explicitly omitted?

|> >4) But, it's *much* worse than that.  To be interesting as a *product*,
|> >a threaded processor should be reasonably competitive as a uniprocessor ...
|> 
|>      What was that story about the businessman who first heard about
|> the telephone?  "Why would I need that, when I could walk out to my
|> secretary and dictate directly to her?"  The most effective use of a
|> multiscalar machine will not come about because someone manages to
|> wrangle good SPEC numbers from it, but because there will be a
|> fundamental change in the way the OS is structured for it, and this in
|> turn will dictate the way programs are written for it.
|> Compiler-OS-system-architecture-processor is an integrated chain that
|> $300 games machines have shown us is as great an equalizing force as
|> buying a $2500 PC to run the same things.  A similar thinking for
|> multiscalar machines will, I think, be as effective in solving some
|> very popular problems.  In so many words, it's wasteful and stupid to
|> force a uniprocessor way-of-use onto a properly-designed multiscalar
|> system.

Let me try again, since there was at least one person to whom that part of
the posting wasn't clear.  Please reread what was posted, and what wasn't:

1) I claimed that a threaded processor should be competitive as a uniprocessor,
meaning that one (multithreaded) CPU + memory should be a competitive
design with other micros out there, and *not* require that the only
configurations for which the multithreaded CPU was competitive be
(likely large, but certainly) multiprocessors.  [One can of course argue
with this, and certainly the Tera folks would ... but then they are
aiming at the very high-end, not at $300 games machines.]

2) Nothing was said about requiring single-thread performance on a
multi-threaded CPU to be competitive with single-thread performance
on a non-threaded CPU [which is how Mr. Yew appears to have interpreted
the statement; if I ever edit & repost, I'll stick in some more comments.]
There of course is a worthy argument about how much single-thread
performance can be given up if necessary to get multi-threading, but
the original posting used the specific-but-uncharacteristically-vague
expression "reasonably competitive" to allow the multi-threaded processor
to be programmed however desired.  In that posting there was zero discussion
requiring multi-threaded processors to run existing binaries, existing
source.

-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-390-3090    FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311
