Merced/IA64(John R. Mashey)


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Scalability of VLIW
Date: 30 Dec 1997 01:52:04 GMT

In article <66sgkg$bvs$1@sue.cc.uregina.ca>, bayko@borealis.cs.uregina.ca
(John Bayko) writes:

|>     This is true, which is why traditional fixed length VLIW isn't
|> used in new designs. Rather, the trend is towards variable length
|> instruction groups (VLIG?) which specify only how many instructions

Whatever IA64 is, it seems a strain to call it classic VLIW, for sure,
without causing the VLIW term to expand into meaninglessness.

|>     The only real difference between dynamically grouped instructions
|> and statically grouped ones is that the compiler does all the work
|> beforehand (and probably does a more thorough job at it because it
|> doesn't have to rush), so you can completely eliminate the dependency
(1)   -------------------------------^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|> checking circuitry from the CPU - that circuitry has become awfully
|> complex, and is causing exactly the same type of complexity in RISCs
|> as the memory addressing methods in CISCs which RISC did away with.
(2)-------------------------^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(1) Is widely-believed and repeated, but is not, in general, correct.
(2) Is not true either.

(1) Compilers may indeed be able to do static grouping of instructions,
such that no two instructions within a group are dependent. There is of
course a long history of compilers doing code scheduling to move
dependent instructions "far enough" apart, of which static grouping is
a particular case.  Even on architectures that guarantee complete
hardware dependency checks, it has often been worthwhile doing code
scheduling.

However, static scheduling to avoid inter-group dependency
checks works well only when the latency of each operation is:
	(a) Fixed
	(b) Short, preferably 1 cycle, so that the results of one
	    group are available in the next with no concern.
and when the dependencies are:
	(c) Statically visible, i.e., like register-register checks.

But unfortunately, except for the "simple" integer register-register
operations in most RISCs, (a) and (b) are often untrue:
	(a) Many implementations use algorithms with timings that are
	data-dependent, even for register-register operations, and of course,
	loads and stores can have huge variations in cycle counts depending
	on whether or not they hit in cache ... or consider what
	happens in a large SMP when you write a line shared among all CPUs.

	(b) Most floating-point operations are at least a few cycles
	long, and implementations of CPUs in the same family often differ,
	since these things can be very area-intensive. Even if you get
	FP + and * to be 2-3-cycle operations, FP / and sqrt will be many
	cycles.  If integer * and / are provided, they're likely to use
	a few cycles as well.  Maybe somebody has managed to avoid
	dependency-checks on the result of dsqrt ... but I doubt it,
	at least in a general-purpose CPU family.

	(c) And of course, even worse is load/store address checking,
	where compilers, in general, have *no* hope of doing anything
	reasonable/fast/safe in all cases, and the checks are better
	done in hardware [which is still simpler in RISCs than in some CISCs,
	contrary to (2) above ... discussed before in this group].
	Consider the following code, intended as part of library function,
	where it is impossible to see the callers of this function:

void func (int *a, int *b, char *c)
{
        *a = *b + 1;
        *c = (char) *b;
}

Which generates  (MIPS) code like:
 #   3          *a = *b + 1;
        lw      $14, 0($5)		1
        addu    $15, $14, 1		2  (depends on 1)
        sw      $15, 0($4)		3  (depends on 1 and 2)
        .loc    2 4
 #   4          *c = (char) *b;
        lw      $24, 0($5)		4  (may or may not depend on 3)
        sb      $24, 0($6)		5  (depends on 4)

On an in-order machine that can sometimes do 2 loads/stores per clock, one
would expect the following static grouping:
	lw
	addu
	sw
	lw
	sb

(Just in case a and b are the same address.)

whereas an in-order machine with dynamic grouping plus
address-dependency checks would likely do:
	lw
	addu
	sw, lw			relying on hardware dependency check
	sb

and of course out-of-order machines would probably move the 2nd lw earlier.
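
The hazard behind these groupings can be reproduced directly in C. The
following is a minimal sketch, not part of the original post: func has the
same body as the routine above (declared void here, since it returns
nothing), and second_load_sees is a hypothetical helper added only to make
the result observable.

```c
#include <assert.h>
#include <stddef.h>

/* Same body as the library routine above.  The compiler cannot see the
   callers, so it must assume that a, b, and c may alias. */
void func(int *a, int *b, char *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    *a = *b + 1;       /* store through a */
    *c = (char) *b;    /* this load must observe that store whenever a == b */
}

/* Hypothetical helper: run func and return the byte it wrote through c. */
char second_load_sees(int *a, int *b)
{
    char c = 0;
    func(a, b, &c);
    return c;
}
```

With distinct a and b pointing at words holding 0 and 5, the helper returns
5; with a and b both pointing at a word holding 5, it returns 6, because the
second load must pick up the just-stored value.  A static grouping that
paired the sw with the second lw would get the second case wrong unless
something checks the addresses.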

Some machines can do two lds/stores/clock, but only if the accesses are
in even/odd words of the cache, etc...  and of course the compiler
*could* insert lots of address-checking code to detect aliasing,
but it would be sad to have to do this to avoid infrequent cases.
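
As a rough illustration of what such compiler-inserted checking might look
like at the source level, here is a sketch under stated assumptions:
func_checked is a hypothetical name, and the split into a fast path and a
slow path is one possible shape, not what any particular compiler emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software-checked version of the routine above. */
void func_checked(int *a, int *b, char *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    if (a != b) {
        /* Fast path: the store through a cannot change *b, so one load
           serves both statements and nothing waits on the store. */
        int t = *b;
        *a = t + 1;
        *c = (char) t;
    } else {
        /* Slow path: a and b alias, so keep strict program order and
           reload after the store. */
        *a = *b + 1;
        *c = (char) *b;
    }
}
```

The compare and branch run on every call and take up code space and issue
slots, even though the aliased case is presumably rare.  That is the cost
being objected to here.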

SUMMARY: code scheduling is pretty good at dealing with 1-cycle latencies
of simple integer register-register operations, and if that helps simplify
CPUs enough, that's useful ... but there are still *many* hardware dependency
checks needed for sensible machine families and programming.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Scalability of VLIW
Date: 31 Dec 1997 19:44:03 GMT

In article <68cb8k$bti$1@sue.cc.uregina.ca>, bayko@borealis.cs.uregina.ca
(John Bayko) writes:

|>     You're right, I was a bit over-general here. I was really only
|> referring to data dependancies due to CPU resources such as registers.

Yes, we agree ... if in fact you mean the more precise:
"data dependencies due to register-to-register interactions of
instructions whose latencies are typically small, and guaranteed
bounded in all implementations expected to run the code efficiently;
in practice, this normally means the simple integer operations
(not multiply/divide), and perhaps simple floating-point operations
(like move), and usually, only operations with 1 cycle of latency."

I think the phrase "CPU resources such as registers" == "registers",
as opposed to "CPU resources, such as, registers",
since functional units are also CPU resources, and the assertion
is not obviously true for functional units, especially in low-end
implementations.

|> And dependency circuitry isn't completely eliminated, just greatly
|> simplified in that it mostly checks compiler set dependency bits
|> rather than trying to figure out resource usage for each instruction.
|> And I suppose there's nothing stopping you from keeping the complex
|> circuitry and just using it for more subtle data based dependency
|> checks like the load/store checks you mention or, I don't know,

Hmmm.   I guess I didn't make the point strong enough:
a) If you want to improve parallelism by analyzing static, visible
data dependencies, this has long been done, and it's a good thing, even
for a fully interlocked architecture.  Compilers sometimes know things
about the code that the hardware cannot possibly ever guess.

b) If you want to improve parallelism in memory access, you can do
some things at compile time, with prefetches, hoisting, etc ...
but there are some things that just don't make much sense at compile
time, and are more sensibly done at run time by hardware.  It makes
no sense whatsoever to generate myriads of instructions to check
address aliases, that occupy code space, and worse, issue slots,
to guard against cases that hardly ever happen.  Hardware is good
at making checks in parallel with doing other work.  Software isn't,
even with well-crafted traps/branches.

So, it isn't an optional "nothing stopping you" kind of thing:
perhaps a simple sequential CPU design need not worry about such things,
but anything above that must.  Even R2000s, in 1986, did (the external
write-buffers did write-gathering, and address checks to safely allow
read-around past an outgoing queue of writes; whether this was a good idea or
not remains to be seen; it was certainly a pain for OS programmers,
who had to use a write-buffer-flush call to drain the buffer before
reading device registers affected by the pending writes; it may have helped
normal code enough to be worth it.)
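
The read-around idea can be sketched as a small software model.  This is an
illustration only, not a description of the actual R2000 write-buffer
hardware: the buffer depth, the word-indexed memory, the policy of draining
the whole buffer on an address match, and all names are assumptions, and
write-gathering is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WB_DEPTH  4     /* assumed buffer depth */
#define MEM_WORDS 16    /* tiny word-addressed memory for the model */

struct wb_entry { uint32_t addr; uint32_t data; };

struct machine {
    uint32_t mem[MEM_WORDS];
    struct wb_entry q[WB_DEPTH];   /* pending writes, oldest first */
    size_t count;
};

/* Drain every pending write to memory, oldest first. */
void wb_flush(struct machine *m)
{
    for (size_t i = 0; i < m->count; i++)
        m->mem[m->q[i].addr] = m->q[i].data;
    m->count = 0;
}

/* Queue a write; the CPU does not wait for memory. */
void store(struct machine *m, uint32_t addr, uint32_t data)
{
    assert(addr < MEM_WORDS);
    if (m->count == WB_DEPTH)
        wb_flush(m);               /* simplification: drain when full */
    m->q[m->count].addr = addr;
    m->q[m->count].data = data;
    m->count++;
}

/* Read around the queued writes unless one of them targets this address. */
uint32_t load(struct machine *m, uint32_t addr)
{
    assert(addr < MEM_WORDS);
    for (size_t i = 0; i < m->count; i++) {
        if (m->q[i].addr == addr) {    /* the address check */
            wb_flush(m);
            break;
        }
    }
    return m->mem[addr];
}
```

In this model a load from an unrelated address returns immediately with the
write still queued, while a load from a queued address forces the drain
first.  The address check cannot know that reading one device register
depends on a pending write to a different one, which is why the explicit
flush call is still needed around device accesses.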

Summary: no architectural approach fixes all of the problems, so one must
be very careful to understand which ones get fixed and which ones
remain, demanding great care in description to avoid confusion.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch,comp.ai.edu
Subject: Re: Info on Merced needed
Date: 13 Jan 1999 00:34:15 GMT

In article <77gmj4$pbe$1@msunews.cl.msu.edu>, Mark W Brehob
<brehob@cse.msu.edu> writes:

|> Organization: Michigan State University
|>
|> (Removed comp.ai.edu as it seems like an odd group for this)
|>
|> :>I'm writing a Computer Architecture textbook, and would like to include
|> :>some substantive information about the Intel Merced.  Unfortunately
|> :>Intel has not been very forthcoming about the Merced ISA.
|> :                                                ^^^^^^
|> :>Does anyone know where I can get info on the Merced ISA?

|> 	I believe Micro Processor Report covered it also.  A few times.

Microprocessor Forums in 1997 and 1998 have included presentations on
IA64, at least some of which is/was on Intel's Web site. It is well
known that:
	1) There are 128 each of integer and floating-point registers.
	2) Instructions come in bundles of 128 bits, with 3 instructions, plus
	a couple other bits that describe the combinations.
	3) Instructions have predication bits.
People derive things like potential instruction sizes.
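
The kind of derivation meant here is simple arithmetic on the public
numbers.  A minimal sketch follows; the function names are made up for
illustration, and the count of "other bits" is left as a free parameter
because the post does not give it.

```c
#include <assert.h>

/* Bits needed to name one of nregs registers (nregs a power of two). */
int reg_field_bits(int nregs)
{
    assert(nregs > 0 && (nregs & (nregs - 1)) == 0);
    int bits = 0;
    while ((1 << bits) < nregs)
        bits++;
    return bits;
}

/* Bits per instruction if `extra` bits of the bundle are set aside and the
   rest is split evenly among `slots` instructions; -1 if it does not
   divide evenly. */
int slot_bits(int bundle_bits, int slots, int extra)
{
    assert(slots > 0 && extra >= 0 && extra <= bundle_bits);
    int left = bundle_bits - extra;
    return (left % slots == 0) ? left / slots : -1;
}
```

For 128 registers, reg_field_bits gives 7, so three register operands cost
21 bits.  For a 128-bit bundle of 3 instructions, setting aside 2 bits
leaves 42 bits per instruction and setting aside 5 leaves 41, so equal-sized
instructions would have roughly 20 bits left for opcode, predicate, and
everything else.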

In general, the details needed to write a book about the IA64 ISA are not
public.  Either a person *knows* what's in there, in which case they
should have signed a serious Non-disclosure Agreement with Intel, or
they don't, in which case they are free to speculate :-)

If anybody is requesting someone who *knows* to send nonpublic information,
they are requesting them to do something illegal...  This is really not
A Good Idea.  Of the vast number of speculations about the architecture
that I've seen, many are wrong... [about all I could legally say.]

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-933-4392
USPS:   Silicon Graphics/Cray Research 40U-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch,comp.ai.edu
Subject: Re: Info on Merced needed
Date: 15 Jan 1999 20:04:06 GMT

In article <369cc3f2@csnews>, heuring@crusher.cs.colorado.edu (Vincent
Heuring) writes:

|> In article <77gpm7$j0q$1@murrow.corp.sgi.com>,
|> John R. Mashey <mash@mash.engr.sgi.com> wrote:
|>
|> (I wrote:)
|> >|> :>I'm writing a Computer Architecture textbook, and would like to include
|> >|> :>some substantive information about the Intel Merced.  Unfortunately
|> >|> :>Intel has not been very forthcoming about the Merced ISA.
|> >|> :                                                ^^^^^^
|> >|> :>Does anyone know where I can get info on the Merced ISA?
|> >
|> (Snip)
|> >In general, the details needed to write a book about the IA64 ISA are not
|> >public.  Either a person *knows* what's in there, in which case they
|> >should have signed a serious Non-disclosure Agreement with Intel, or
|> >they don't, in which case they are free to speculate :-)
|> >
|> >If anybody is requesting someone who *knows* to send nonpublic information,
|> >they are requesting them to do something illegal...  This is really not
|> >A Good Idea.  Of the vast number of speculations about the architecture
|> >that I've seen, many are wrong... [about all I could legally say.]
|> >
|> >--
|> >-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
|> >EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-933-4392
|> >USPS:   Silicon Graphics/Cray Research 40U-005,
|> >2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
|>
|> Hey, just a durn minute here, Mr. Mashey. RTFP! I *didn't* say I was
|> going to "write a book about the IA64 ISA.." I *said* I was writing an
|> architecture textbook and wanted to include some substantive
|> information on the Merced ISA. Big difference, and I resent the subtle
|> implication that I was trying to encourage anybody to violate an
|> confidential agreements.

Sorry ... but the problem is that you don't understand the problem,
which is that your request was seriously vague.
Had you asked the precise, correct question:
	(a) "Can somebody point me at the publicly-available information
	on the Intel IA-64?"  (i.e., this is a question of the form:
	"I don't know what's available.  Please tell me.")

Plenty of people would point at M-R, and nobody would have any issues.

Had you asked the question:
	(b) Is there any additional information, besides that public,
	that is available without breaking an NDA (or other legal
	agreement)?  I.e., logical analyses that derive what must be
	going on from public facts.  [For example, after the first
	Microprocessor Forum, lots of people figured out instruction
	sizes.]
No one would object to that, although of course, speculation doesn't usually
make for a good textbook.

Had you asked the question:
	(c) Is there any additional information that anybody has that doesn't
	break an NDA, but might break other agreements, that you figure
	you can get away with?  For example, P Dickerson posted some
	info from a dump of the text of a Microsoft IA64 Assembler ...
	I don't know about this specific case, but I observe that Microsoft
	binary licenses have long included the words:

	"1) COMPANY may not reverse engineer, decompile or disassemble the
	Licensed Software."

	But, if that software included that license phrase, what you have
	done is to encourage Dickerson to break his license with Microsoft,
	which is not Intel [but of course, in some cases, this kind of clause
	got there in the first place to avoid premature disclosure of
	Intel info.] If that license had that clause, Microsoft is within
	its rights to demand that the software be returned, and all copies
	destroyed...
One could debate the ethics of this, the license, etc.

Had you asked the question:
	(d) Has anybody, not under NDA, found an IA-64 manual lying around
	that they could send me?

But you wrote *explicitly* that you wanted substantive information, and in the
next sentence, *explicitly* that Intel wasn't providing what you wanted.
That is, the question appeared to be of the form "I know what's available,
but it's not enough."

Hence, if you weren't intending to have people break NDAs or other agreements,
then I apologize... but I think you just didn't realize the implications
of what you were saying, given the wording of what you said.

When you put out requests like this, you might want to be much more
precise, because, in fact, most of the information that most people would
expect to find in a Computer Architecture textbook, that even included
one *serious* chapter on IA-64, is still under NDA.

So, just what *were* you expecting to get?

Legal stuff is weird, and if one doesn't do this stuff, one may not realize
the implications of what one is saying.

By analogy, there are discussion domains where innocent-sounding
questions are problems for people familiar with the area. For example, if
you are a computer vendor person, and you are talking with a potential
customer, the first thing you (normally) ask them is:
	"Please explain your application.  What do you want to do?"

... except if you are in the NSA, or CIA (or certain other places), you
do NOT ask that.  [I've seen it asked by people unfamiliar with this turf;
everybody cringes, thinking "this person just doesn't understand...".
It's not that they were asking "tell me stuff I shouldn't know", they
just didn't know how to ask in a way that made it clear.]

(What you ask, very carefully is: "Is there anything you are able to
tell us about what you want to do?")

So, what sort of information were you actually soliciting?

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-933-4392
USPS:   Silicon Graphics/Cray Research 40U-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: SGIs future: What will happen after the R12/14k?
Date: 17 Jan 1999 00:30:55 GMT

1) I know too much to say very much either.  However, I do have the manuals,
and I've read them, and the architecture is good, and allows a number of
interesting implementations.  We couldn't figure out how to beat it
long-term with anything that was MIPS-upward-compatible, and we are
certainly familiar with the other options, and they didn't look better.
Going to Alpha certainly didn't make any sense, for business reasons, and
for long-term fab support. We've already done 64-bit, and R10Ks have
managed to compete OK with 21164s.

2) Many of the comparisons made here display a *serious* lack of reality.
For example, the i860 started life supposed to be a coprocessor,
tried to work its way into being a mainline CPU, and was designed
with ~zero input from software people, was extraordinarily difficult to
write compilers and OSs for, so what happened is not surprising,
despite its good FP performance.  If anybody thinks there is any
serious resemblance between i860 and IA64 project, they are out of it.

3) Among the publicly-known things about IA64 architecture are:
	- 128 integer & 128 FP registers (i.e., 7 bit register fields)
	- 128-bit, 3-instruction+other bit bundles

4) The current state of compiler technology, especially on the floating-pt
side, is able to make use of lots of registers, for loop-unrolling,
but also for allocation of things like transform matrices to registers.
Existing o-o-o RISCs with renaming can do some dynamically, but people still
run out of explicitly-addressable registers.  Put another way, from
the publicly-available information, one might guess that good compilers
will very quickly be able to use what's there.  If you know Multiflow,
Cydrome, HP Play-Doh, IA64, and various combinations of people involved,
this also might make sense.  Most compiler people at various companies think
it will take longer to tune up for integer performance, i.e., really getting
the optimizations for the predicated logic.  [Anyway, some compiler writers
who hated i860 like IA64].
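
To see why unrolling eats registers, consider a dot product unrolled by
four.  This is a generic sketch, not IA64-specific code; the function name
and the unroll factor are arbitrary choices for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product unrolled by 4 with independent accumulators, so the four
   multiply-adds in one iteration do not depend on each other. */
double dot_unrolled4(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)             /* leftover elements */
        s0 += x[i] * y[i];

    return (s0 + s1) + (s2 + s3);
}
```

Each extra accumulator, plus the operands loaded for it, wants its own
explicitly-named register, and deeper unrolling to cover longer FP latencies
multiplies that.  Note that splitting the sum this way changes the order of
the additions, so results can differ in the last bits from a
straightforward loop.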

5) Intel has stated publicly that the Merceds will yield, peak, 4 full-precision
flops/clock, and 8 single-precision flops/clock.  Now we all know that
peak != achieved, and the details to figure out how close are all NDA ...
but I would be astonished if people with FP-compute-intensive needs don't
*love* this architecture.   I'm slightly mystified that Toone Moene
seems sure such machines aren't for him (or is it the NT-versus-UNIX issue,
rather than IA64-versus-others? Note that Terry Shannon's quote about
SGI betting its business on 64-bit NT may or may not have any accuracy.)

6) To summarize:
	(a) I have the manuals, and I've read them.  I'm not speculating.
	(b) The architecture is generally good; it has features similar
	to ones I've fought to get into MIPS over the years, including some that
	we've never been able to retrofit; it has the kind of stuff one expects
	from high-performance 64-bit chips; it has some stuff similar to things
	I wanted for H2.  It doesn't have everything I'd like, but it's good.
	(c) Because there seems to be no sensible way to extend IA32 to
	64-bit [or they would have done that], unlike the 32-bit RISCs that
	mostly had very straightforward 64-bit supersets, they had to go
	to a new ISA ... which was both bad (new), and good (all bets are off,
	relatively clean sheet of paper, can go to 7-bit registers, and
	ability to encode attributes of the instruction stream dependencies
	statically, and a bunch of other things).
	(d) It may well be the *last* new successful general-purpose CPU ISA for
	many years ... [An opinion held not just by me, some other
	knowledgeable folks think so too.]  BTW, if you *like* to design
	ISAs, think hard of going after the embedded/consumer markets,
	as there is at least some action left there.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-933-4392
USPS:   Silicon Graphics/Cray Research 40U-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: SGIs future: What will happen after the R12/14k?
Date: 20 Jan 1999 20:48:47 GMT

In article <7848ov$icu$2@ec.arbat.com>, Erik Corry <erik@arbat.com> writes:

|> The reason for McKinley's superiority might be that it
|> doesn't contain IA32 support, or it has much slower IA32
|> support (Microcode? JIT? FX!32-like?).  This would fit
|> with the idea that HP designed it, and as far as I can
|> see, putting two ISAs on a chip is very very difficult
|> (PowerPC 615 anyone?).

While speculations are fun, and while one may or may not believe what
some people say, it is useful to at least know what people say/show.

1) At the last M-F, Steve Smith showed an IA-64 floorplan.
The IA-32-specific piece, not surprisingly, was a small corner, looked like
about 15% of the die.

2) It can be hard to retrofit a chip designed for one ISA with
another ISA ... although once upon a time, there was a very slick design for
a small MIPS chip that would also do IA-32 [never produced].  Even there,
it turned out that the IA-32-specific stuff wasn't that big, although
it certainly needed some new decode logic, and some extra registers.

3) If you are designing a *new* ISA, and if you have 100% rights to an old ISA
that you wish to include, it is not *that* hard to do this, and it has been
done various times before:
	S/360 (with emulation for various 707x, 709x systems ... very different)
	VAX (with PDP-11 mode)
	[both of these have microcode, of course]
What's different is doing it in a micro ... although of course the
gate count is getting high enough to be interesting.  It is a lot easier to
do 2 ISAs together if the new one is designed to not make the old one hard,
and it doesn't have to impair the new architecture.

4) Anyway, as was apparent from the die plan, if your target architecture
is like IA-64:
	- you have 128 each of integer & FP registers ... hence even if
	you were to provide an entire separate IA-32 register set, it would be
	tiny in comparison.

In almost any current implementation (many ISAs), big die-space eaters include
big caches, TLBs, 64-bit busses, FP units, multiple 64-bit integer ALUs ...
and it's not all that hard to share much of those between related ISAs.

5) These days, even including [caches, TLBs, FPs, etc], complete
IA-32 implementations seem to range from ~130ish mm^2 down to 58mm^2
(WinChip2), so the IA32-specific stuff is smaller.

Note: none of this says anything about design complexity, it just says that
die space is far less of an issue than it was, say, in 1986, when,
in 2-micron technology, we could barely fit an R2000 (64-entry TLB,
no caches, no FPU) on a chip.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-933-4392
USPS:   Silicon Graphics/Cray Research 40U-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
