From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 8 Aug 1998 06:38:07 GMT
Keywords: reduce

Having a zero register is a modest, but useful technique, which is
why so many recent architectures support it, because it generally
*simplifies* hardware, given the chain of assumptions usually already
made in ISAs before this decision gets made.

1) Contrary to some of the fantasies posted here, a zero register has
minimal implementation cost.  It is *not* implemented as bunches of
gates that do special case checking for every instruction.  It is
simply implemented to act as a "register" that delivers zeroes when read,
and ignores anything that is written.  This is, of course, trivial
hardware, adds no gate delays to anything, and requires no special-casing
of logic. Recommendation: if this doesn't make sense, it would be
covered in an appropriate mid-level undergraduate course on digital
design, like Stanford CS112/EE182:
	http://cs-class.stanford.edu/class/cs112/
or perhaps, just get hennessy & patterson's
	Computer Organization and Design

Note that special-casing move or clear, for example, to avoid a trip through
an ALU, is just not something designers care about in recent designs.
(I think the datapath section of H&P above discusses such stuff.)
It is more likely to cost hardware to special-case this.
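
As a concrete sketch of the point above (a toy Python model, not real RTL):
the zero behavior lives entirely in the register file's read and write
ports, so no instruction needs to check for it.

```python
# Toy model of a register file with a hardwired zero register.
# In hardware, the read side is just grounded wires and the write
# side a gated-off write enable; no per-instruction special cases.
class RegFile:
    def __init__(self, n=32):
        self.regs = [0] * n

    def read(self, r):
        return 0 if r == 0 else self.regs[r]   # r0 always delivers zeroes

    def write(self, r, value):
        if r != 0:                             # writes to r0 are ignored
            self.regs[r] = value & 0xFFFFFFFF

rf = RegFile()
rf.write(0, 123)            # silently dropped
rf.write(5, 42)
assert rf.read(0) == 0
assert rf.read(5) == 42
```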

2) Start with the following assumptions, which one may or may not agree with,
but which are common:
	a) 32-bit instructions (or, 1-size for fast decode)
	b) many 3-operand instructions (for the usual reasons)
	c) base-displacement addressing, at least.

Having a zero register:
	a) Eliminates special-casing in the address unit to cause basereg 0
	to supply a zero during addressing [i.e., like S/360, where the use
	of R0 differs between normal usage and address computation].
	This offers a simple way to get direct addressing for a modest range of
	addresses ... which has actually been very useful for the kernel
	and certain graphics libraries.  This affects the 20+
	load and store instructions.
	b) Does supply a few unary ALU operations for free, at the cost of
	allocating one register.  In MIPS, some of the ones that fall out
	are:
	3-op	unary
	ADD	mov (there are various ways to get mov)
	ADDI	load-immediate (sign-extend)
	DADD	64-bit mov
	DADDI	64-bit load-immediate (sign-extend)
	DSUB	64-bit negate
	DSUBU	64-bit negate (unsigned)
	NOR	bitwise-not
	ORI	load-immediate (some values that ADDI(U) can't, like 0X8???.)
	SUB	32-bit negate
	SUBU	32-bit negate (unsigned)

	most of these are not a big deal, although the various ways to get
	load-immediate are fairly useful.  There are multiple redundant ways
	to get mov and clear ... but it is actually less hardware to allow
	such things, than it is to special-case them away.

	c)  x = 0 ... is a fairly common construct, since it is a common
	flag value, and this means all of the stores can just use zero as a
	source with no other overhead. This gets used in bzero's, but there,
	the savings are minimal.

	d) Finally, an extremely important case (for MIPS, anyway) that
	falls out very cleanly is the following set of branch instructions:
	BEQ, BEQL, BNE, BNEL
	all of which have 2 register fields and use the immediate field for
	a branch displacement.  Comparisons to zero are *quite* common,
	and this straightforwardly avoids any need to special-case.
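
	The way these operations fall out can be sketched in a few lines
	(a toy model, not MIPS RTL; 32-bit two's-complement semantics assumed):

```python
# Toy model of how unary operations fall out of 3-operand instructions
# plus a hardwired zero register (32-bit two's-complement assumed).
MASK = 0xFFFFFFFF

def sign_extend16(imm):
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def addu(a, b):  return (a + b) & MASK
def subu(a, b):  return (a - b) & MASK
def nor(a, b):   return ~(a | b) & MASK
def ori(a, imm): return (a | (imm & 0xFFFF)) & MASK

ZERO = 0                                   # value read from register zero
x = 0x1234
assert addu(ZERO, x) == x                  # ADDU rd, zero, rs  -> mov
assert subu(ZERO, x) == (-x) & MASK        # SUBU rd, zero, rs  -> negate
assert nor(ZERO, x) == (~x) & MASK         # NOR  rd, zero, rs  -> bitwise-not
assert ori(ZERO, 0x8000) == 0x00008000     # ORI gets 0x8??? values...
assert addu(ZERO, sign_extend16(0x8000) & MASK) == 0xFFFF8000  # ...ADDIU can't

def beq_taken(a, b):                       # BEQ rs, rt: one comparator,
    return a == b                          # no special compare-to-zero opcode

assert beq_taken(x, ZERO) == (x == 0)      # BEQ rs, zero -> branch-if-zero
```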

3) This has been sufficiently useful that I've often wished we'd done
	something for floating point 0.0...

4) It is simple, trivial, and safe to have a zero register.
	Among other things, it would be irritating to have to waste a
	cycle on every kernel entry to clear the register to make sure some
	user process had not trashed it.

5) It is not a huge win to obtain the unary operations, but it does save
a few opcodes that are frequently used enough that one would have had to
add them ... and there is always pressure on opcode space. Although not
on all such ISAs, it is definitely nice not to have to special-case the
branches.

6) One might reach different conclusions for variable-size instruction
encodings, where one may well provide operations limited to 2-operands
for density.  S/360 does that, so does MIPS-16.  If there are only 8 or
16 integer registers, one would think harder about dedicating one to zero.
Nevertheless, designers chose to do this because it made sense, and
saved hardware, given the 32-bit instruction, 32-integer register,
3-operand formats already chosen.



-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 8 Aug 1998 19:52:21 GMT
Keywords: reduce

In article <6qi0u9$33o@senator-bedfellow.MIT.EDU>, jfc@mit.edu (John F
Carr) writes:

|> Organization: Massachvsetts Institvte of Technology
|>
|> In article <6qgrof$6gk$1@murrow.corp.sgi.com>,
|> John R. Mashey <mash@mash.engr.sgi.com> wrote:
|> >Having a zero register is a modest, but useful technique, which is
|> >why so many recent architectures support it, because it generally
|> >*simplifies* hardware, given the chain of assumptions usually already
|> >made in ISAs before this decision gets made.

|> The tradeoffs have changed since the 1980s.  Is a zero register a good
|> idea in a new ISA?

There's only one new ISA: IA64, so we'll see if they did it :-)

|> Extra instructions which would not be required with a zero register
|> don't impact the design as much.  The original MIPS architecture had
|> very simple instruction formats and it would have been hard to find the
|> opcode space for the extra unary instructions.  A decade later, MIPS IV
|> scatters opcode bits around in formerly reserved fields and the
|> simplicity is gone.

Yep, that's why one fights hard to avoid adding things that will be
there forever, but are more usefully avoided.

|> The control/bypass logic is a lot more complex on modern chips.  There are
|> many more places where a register number must be tested and the zero
|> register special-cased.  On the SuperSPARC, which has a short but complex
|> pipeline including same-cycle result forwarding, a load-double instruction
|> writing %g1 causes the %g0 (zero) register to become non-zero for the next
|> two cycles.  (One isn't supposed to write ldd %g1, so this isn't
|> necessarily a bug, but other SPARC implementations don't affect %g0 in
|> this situation.)

A bug in that SPARC, probably (I don't know) due to the oddity that ldd
is a rare instruction (in SPARC) in that it modifies 2 registers.

But back to the bypass stuff:
	a) R2000s had the typical bypass logic of simple bypassed systems.
	b) The topic is discussed in Hennessy&Patterson C O & D, p480...
	which works through pipeline forwarding issues in detail for MIPS,
	including the explicit zero-check ...
	c) The logic already has to do register-number comparisons to set
	up the MUXes before the ALU inputs, i.e., these are two 5-bit
	compares of the input registers of one instruction to the
	output of the previous one, and if equal, select from the
	forwarding network, if unequal, take the register.
	d) The zero register simply requires that, in parallel with the
	register-number decoders, there is a zero-detect on the result register,
	[which can't be any slower than the comparators, and is a small
	number of gates], and that its result is combined with the
	register comparison to control the input MUXes.

	e) Throughout all of this discussion, we need to remember that
	digital logic is *not* C code. If the same check appears in
	numerous places in a pseudo-code specification of the logic,
	that doesn't mean there must be numerous instantiations of the
	logic;  more likely, there is one instantiation and some wires.

All this is a handful of gates, and is well-described in an
undergraduate text-book.  Newer chips are more complex, but this particular
check is basically nothing compared to all of the other stuff that's going
on, and is a cheap way to avoid burning opcodes.
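
A toy model of the check described in c) and d) above (illustrative
pseudocode only, not the actual R2000 logic, and remembering point e):
digital logic is not C code):

```python
# Toy model of one ALU-input bypass decision: forward the previous
# instruction's result only if the register numbers match AND the
# producing register is not r0. The zero-detect runs in parallel
# with the 5-bit register-number comparator.
def bypass_select(src_reg, prev_dest_reg):
    """Return 'forward' to take the forwarding network, 'regfile' otherwise."""
    match = (src_reg == prev_dest_reg)       # 5-bit comparator
    dest_is_zero = (prev_dest_reg == 0)      # parallel zero-detect
    return 'forward' if (match and not dest_is_zero) else 'regfile'

assert bypass_select(5, 5) == 'forward'   # true dependence: forward result
assert bypass_select(5, 7) == 'regfile'   # no dependence
assert bypass_select(0, 0) == 'regfile'   # a "result" into r0 never forwards
```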

One more time: zero registers are *modestly* useful, cost very little logic to
implement, make sense when given the appropriate set of predecessor
decisions, and save opcodes that people would be very tempted to add
otehrwise, and therefore usually end up saving hardware.  They don't fit
some ISAs, but they fit others.



--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE [really: zero registers]
Date: 11 Aug 1998 22:53:20 GMT

In article <35CCDA92.33965996@gmx.de>, Bernd Paysan <bernd.paysan@gmx.de> writes:
(apparently not having yet received the posting:
From: mash@mash.engr.sgi.com (John R. Mashey)
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: Sat, 8 Aug 98 12:52:21 1998
...
All this is a handful of gates, and is well-described in an
undergraduate text-book.  Newer chips are more complex, but this particular
check is basically nothing compared to all of the other stuff that's going
on, and is a cheap way to avoid burning opcodes.
...

And I still think this...


|> The constraints of the original MIPS are quite outdated. A register file
|> in current CPUs is way more complicated than it was back then. All
|> high-performance implementations now have register renaming. All are
|> OOO, thus perform scheduling on the fly. All have fairly complex bypass
|> logic (e.g. the 21264 must bypass 4 results into the next cycle, and
|> since it isn't possible to read back a written value until 2 or 3 clocks
|> later, the bypass logic is much more complex). And R0=0 is a special
|> case for bypassing (and as John Carr reported, there is one example,
|> where the implementers forgot one case).

I have some passing familiarity with the R10000, which most people in this
newsgroup know is a 4-issue, superscalar, O-O-O, speculative-execution,
register-renaming CPU.  To reemphasize what was said before:
	a) It wasn't a problem in the R2000.
	b) It wasn't a problem in the R10000.
and
	c) In some sense, it is even less of a problem in modern CPUs,
	as logic is much cheaper, i.e., zero-register-check logic is
	a smaller fraction, especially given the large bunch of comparators
	needed for all of the dependency checks in wide-issue superscalars.
and
	d) In many years of talking to chip designers, and getting pushback
	on features, or complaints about the unnecessary implementation pain,
	*this* one is one I've never once heard anyone complain about,
	at Hot Chips or Microprocessor Forum dinners, or in the bars around
	the valley, or in email, or elsewhere.  That doesn't prove that
	this isn't an issue, but it's a datapoint. The fact that there is
	an example where people goofed doesn't bother me much: there are
	hordes of places where people have goofed.

To get some facts, rather than opinions, I appeal to *designers* of
commercial CPUs that use renaming:

	If you have a zero register:
	1) Was it a major hassle, die size cost, or increase to critical path?
	2) Were there special-case zero-register bugs in the first stepping?


|> I won't argue that implementing the common case isn't trivial. It is.
|> That's a typical mistake in design: the common case always looks fairly
|> easy. It's the exceptions that make things awful. You won't see the
|> common-case gates in the transistor count of the chip, but you will see
|> the exceptions. And as somebody who feels responsible for careful
|> testing, I vote to avoid exceptions. They are difficult as hell to test.

It is good to avoid exceptions. Personally, I doubt this one would make
anyone's top-100 list of testing worries...


|> I also want to comment on your list of opcodes provided with R0=0. Given
|> you have SUBR instead of SUB, you only have to add one instruction, and
|> that's load-immediate. Since load-immediate has a special property for
|> data flow analysis (it's truly independent), there is value to add it to
|> the instruction set. Note that then it can use a larger value, since it
|> can share the format of the long jump instruction (one reg, large
|> immediate, also truly independent).

It counts, and some of the others, although of minor use,
are of use, and the branches are of definite use, since you
get both compares to zero and compares of two registers.
The property mentioned for load-immediate sounds good, but is already
subsumed by the kind of implementations found in current CPUs, i.e.,
as soon as you've done the register rename, you know it has no
input dependencies (it is *not* truly independent, because it usually
modifies its output register). It *cannot* use the MIPS long-jump format,
which provides no register specifiers ...

I'll say it again: this kind of feature is a modestly-good thing for
some ISAs, and as far as I know, has never caused any great hassle for
MIPS implementors.  [Now: integer multiply&divide, load...left/right,
the original FP register arrangement, and sometimes, branch delays ... :-)]

Not every feature is a great breakthrough: the devil is in the details,
and sometimes the best you can do is get them modestly good.


--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 13 Aug 1998 19:45:13 GMT

In article <35D1893C.4C92@bellatlantic.net>, "Jeffrey S. Dutky"
<dutky@bellatlantic.net> writes:

|> which has a few problems: 1) we need four instructions instead
|> of only two, so this is twice as slow as it needs to be, and
|> 2) all four operations are dependent, meaning they can't be
|> easily overlapped. While the same operations on an architecture
|> with r0==0 would be written as
|>
|>   opA r0,r1,r2; discard result
|>   opB r3,r4,r0; operate on value zero

While I agree with you that having a zero register is useful,
having looked at large numbers of lines of generated code for machines
that have them:
	a) It is very rare to execute an ALU operation that actually does
	something and put the result in zero, because the only reason to
	do so is to investigate side-effects, like overflows, and it
	is perfectly plausible to use a scratch register as a target most
	of the time.  [There are some places, like in kernel interrupt
	handlers, where scratch registers are in short supply].
	b) Point a) is worded carefully, since, for example, NOP on MIPS is
	coded as a shift of register zero, with the result returned to zero
	... but it doesn't actually do anything; it is just convenient.
	c) Some ISAs have made use of zero as a target of load
	instructions, to turn them into prefetches, which is fairly
	elegant.  Others have separate prefetch instructions.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 14 Aug 1998 22:22:35 GMT

In article <slrn6t7klq.7j8.mike@ducky.net>, mike@ducky.net (Mike Haertel)
writes:

|> In article <6qvumm$an2$1@murrow.corp.sgi.com>, John R. Mashey wrote:
|> >	c) Some ISAs have made use of zero as a target of load
|> >	instructions, to turn them into prefetches, which is fairly
|> >	elegant.  Others have separate prefetch instructions.

Note: MIPS uses separate prefetch instructions, for various reasons.
I was willing to call this elegant in that:
	(a) Given that one has a zero register.
	(b) Opcode formats likely offer natural load and store instructions
	that use the zero register.  From inspection of much code,
		store[type] zero somewhere   is fairly frequent.
	(c) On the other hand, the instructions:
		load[type]  zero,somewhere
	usually have natural encodings, but of course do not modify zero.
	(d) Hence, overloading this to mean prefetch is elegant in the sense
	that:
		(a) It uses an encoding that is already likely to be there.
		    It saves burning an opcode.
		(b) As discussed earlier, bypass/rename logic makes sure
		    that loads into zero don't get bypassed, and that
		    zero never gets renamed anyway.
		(c) It even makes intuitive sense, and
		(d) It may make CPU evolution a little easier:
			CPU ISA #1 didn't give any thought to this.
				load	zero,address
			fetches the data, and discards it ... but of course,
			compilers generally don't generate it, and the only
			conceivable use is as an address probe, and in most
			cases, one could use the same address, and load any
			unused register, and all would be well.
			By CPU ISA #2, somebody says: "need prefetch".
			If you add a new opcode, unless it was reserved in
			ISA #1 as "unused, but ignored", the new opcode
			will get trapped by the older CPUs, which means
			you may end up having to generate 2 flavors of code,
		        which makes ISVs unhappy.
			If you do the load zero trick, and if you only generate
			code where the address is legal, then the code
			will run fine on both flavors of CPUs.  Of course,
			if people really had been using it as an address probe,
			it will not have the expected effect on CPU #2.
			Of course, if #2 has a bunch of new instructions you
			want, you may end up generating separate code anyway.

|> So I guess this is "elegant" the same way doing jumps via
|> "move to pc" (pdp-11?) is "elegant".  It may look elegant in
|> concept, but any realistic implementation needs special cases
|> to recognize it, and then if you're going to do special cases
|> you would have been better off choosing a different instruction
|> encoding in the first place.

Again, MIPS did it as a separate instruction, and that may be cleaner,
but I'll also bet this is another one of those special cases where, in
the implementations that have it, the number of gates is pretty small.
After all, load   xx,address   and prefetch address
both need to provide: address, plus:
	a) address-checking required, trap if bad versus: ignore
	b) Dependency on load result, or just a hint, can ignore.
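
The two items above can be sketched as a decode-time classification
(hypothetical logic, one possible choice, not any shipped MIPS design;
the trap-vs-ignore behavior is a real design decision, as discussed below):

```python
# Toy decode rule: a load whose destination is the zero register has no
# architectural result, so it can be treated as a non-binding prefetch
# hint: no trap required on a bad address, and nothing depends on it.
def classify_load(dest_reg):
    if dest_reg == 0:
        return {'traps_on_bad_address': False, 'creates_dependency': False}
    return {'traps_on_bad_address': True, 'creates_dependency': True}

assert classify_load(0) == {'traps_on_bad_address': False,
                            'creates_dependency': False}
assert classify_load(4) == {'traps_on_bad_address': True,
                            'creates_dependency': True}
```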

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 15 Aug 1998 04:49:19 GMT

In article <6r2lur$jni@gurney.reilly.home>, reilly@zeta.org.au (Andrew
Reilly) writes:

|> I think you've managed to save a total of two opcodes (load
|> immediate and prefetch) so far.  See Bernd Paysan's article that
|> covers the other cases that have been brought up so far.
|>
|> That's really worth losing a general-purpose register for?

Please read what I posted carefully, and compare it with what Bernd posted.
Bernd (and you apparently) focus on arguments on the marginality of
features that I stated were of modest use, and somehow keep missing the
ones that I stated were important.  I stated they were important
because I've looked at many thousands of lines of generated code for
systems that have a zero register, and because I've looked at
instruction count summaries of billions of cycles, and finally,
because I've been involved in some of the discussions of tradeoffs,
both internally, and with other people who do this.

One more time:
	1) In some ISAs, the feature can save a few opcodes, and opcodes,
	especially in 32-bit-fixed-size-instructions, can get precious.

	2) Put another way, there is a tradeoff between adding opcodes
	and reducing path-lengths, and for some features, it is almost
	impossible to say whether a feature is a good addition or not
	without understanding the rest of the ISA already committed.
	In ISA #1, feature A may yield 1% improvement by whatever metrics
	one uses, but in ISA #2, the same feature may yield .1%, because
	a combination of several other features already gets most of
	the cases.  [Informal discussions in bars in Cupertino, or at
	Hot Chips tables sometimes gets into such things.]

	3) In the MIPS ISA, as I said, one major value comes from
	path-length reduction & code-size reduction, because
	store zero doesn't need a register clear.  On IRIX,
	try  dis /unix|grep 's[bwhd](tab)zero' to see examples.
	A cursory grepping finds that in a static count of instructions
	in IRIX, 1+% of the instructions are stores of zero.

	4) The other one I said was important was the set of
	branch-equal branch-not-equal instructions, where the zero
	register gives nice non-special comparisons, so that one
	can get compare-and-branch of two registers, and compare-to-zero,
	both of which are frequent enough to be interesting.

|> That's really worth losing a general-purpose register for?

If you have 32: yes, which is why most people designing such have done it.

If you have 8, I'd say no, at least from experience working on 68K
code generators and looking at that code.
With 16: maybe, maybe not: none of the machines I've worked on that had
16 had zero registers.
--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 17 Aug 1998 01:51:16 GMT

In article <SCOTT.98Aug14174728@slave.doubleu.com>,
scott@nospam.doubleu.com (Scott Hess) writes:

|>    If you do the load zero trick, and if you only generate code where
|>    the address is legal, then the code will run fine on both flavors
|>    of CPUs.  Of course, if people really had been using it as an
|>    address probe, it will not have the expected effect on CPU #2.  Of
|>    course, if #2 has a bunch of new instructions you want, you may end
|>    up generating separate code anyway.
|>
|> Perhaps I'm not clear on the distinction between "prefetch, no trap"
|> and "probe for valid address" uses of loading from an address to R0.
|> Why can't it do both?

It could, and without looking at the manuals, I'd guess that
plenty of different combinations have been done; it kind of depends on the
accumulated ISA up to that point, plus expected implementations.

(a) An explicit prefetch instruction (not load zero) could be defined as:
	(a) Evaluate the address, for sure, and if it causes a TLBmiss
	or protection trap, the trap must be taken (or at least, if this
	in an o-o-o machine, trap is taken if you actually get there).

	(b) The prefetch is an optional hint, and the CPU is free to
	discard it at any stage, even before checking the address,
	and the prefetch never causes traps of any sort.

	It is clear that if what you really want is to check an address,
	for the side-effect of causing a trap, (a) works, but (b) does not.
	In both cases, actually loading a cache line is optional.

(b) load zero,address, if that is the way prefetch is implemented,
	would likely have to be defined as either (a) or (b) above,
	but it is hard to have the same bit pattern mean both.

Note: hardware designers *really* like (b), for instance, because if
load/store queue entries are a scarce resource at some point, prefetches
can just be thrown away.
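
A toy model of semantics (b), the droppable hint (queue size and names
are illustrative assumptions, not any real design):

```python
# Toy model: a prefetch is a non-binding hint that a full load/store
# queue may simply discard, and it never traps; a real load is modeled
# here as always queued, and must take address traps.
def issue(op, queue, queue_capacity=4):
    if op == 'prefetch':
        if len(queue) >= queue_capacity:
            return 'dropped'            # free to discard at any stage
        queue.append(op)
        return 'queued'                 # may load a line; never traps
    queue.append(op)                    # a real load must eventually issue
    return 'queued'

q = ['load'] * 4
assert issue('prefetch', q) == 'dropped'
assert issue('load', q) == 'queued'
```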


|> The sequence that comes to mind is where you test a tag and only
|> access the memory if the tag indicates it's valid - as someone
|> implied, you couldn't pull the prefetch above the tag test unless it's
|> non-trapping.  Otherwise you could get a fault which wouldn't have
|> happened if the prefetch weren't present.

Yes.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 19 Aug 1998 18:02:22 GMT

Thank goodness, a knowledgeable posting, a welcome change to this thread...

In article <1998081903553700.XAA02501@ladder01.news.aol.com>,
mitchalsup@aol.com (MitchAlsup) writes:

|> And when you have a 6-bit opcode, you find that you run

Note: Mr. Keane labeled as "ridiculous and no one would do it" the idea of
6-bit opcodes. The main opcode in, for example, both MIPS and SPARC
(and others) is in fact 6 bits, although of course some main opcodes
select sub-opcodes, as for register-register operations.


|> Yes--this is EXACTLY why r0=0 is so useful, simpler operations
|> are degenerate cases of the standard (complexity) operations.
|> Since we HAVE to design the data path to handle the standard
|> cases at full operating speed, these simple operations fall out
|> for free (or even better--see above).

This is a *really* good point, as it bears on one of the most common
fallacies that comes up again and again in comp.arch, i.e.,
acting like hardware is software, when it isn't.  This most often happens
when people offer opinions on difficulty of implementation without
understanding the typical implementation methods.

HARDWARE IS NOT SOFTWARE!
A lot of computations & tests that, in software, look like a lot of code
and might be worth optimizing for special-cases, are done by parallel
hardware that has to be there, and special-casing it only makes it worse,
or doesn't help.

Instantiations of this fallacy that have shown up in this thread include:

1) The idea that there is a lot of hardware to do special-case checks for
register zero somehow wired into a lot of instructions.  This is like thinking
from a C model of an instruction set that has a lot of in-line code.

2) The idea that there are big savings from avoiding a simple ALU op
in favor of special-case-detecting a MOVE operation ... when, in many
designs, trying to go around the ALU only adds wires and logic.
(Yes, there are a few cases where this might not be true, but in general,
given that any sensible design must make ALUs fast, trying to go around them
isn't something people care about very often.)

3) The idea that in general, there is a big savings in dependency-check hardware
by having an explicit load-immediate instruction rather than an
add-immediate, in an out-of-order, register-renaming design.
I.e., if you do:
	LI	reg1,immed
via
	ADDI	reg1,reg2,immed	where reg2 == zero

the dependency-check logic *must* exist for the general case, and it must be
fast, and as Mitch describes, with minimal logic it identifies the ADDI as
having no register input dependencies.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 27 Aug 1998 17:53:21 GMT

In article <487BC60C4F%rw@shadow.org.uk>, Rich Walker <rw@shadow.org.uk> writes:

|> In message <6s0cl1$8tb$1@ocean.cup.hp.com>
|>           morrell@cup.hp.com (Michael Morrell) wrote:
|>
|> > John R. Mashey (mash@mash.engr.sgi.com) wrote:
|> > > 3) This has been sufficiently useful that I've often wished we'd done
|> > > 	something for floating point 0.0...
|> >
|> > I don't know about other architectures, but PA-RISC defines f0 as 0.0.
|>
|> And ARM provides 8 fp constants (0.0, 1.0, 0.5, 2.0, 3.0, 4.0, 5.0, 10.0)
|> which is about right for the ARM architecture.
|>
|> Mind you, it's still not a great advantage, as the ARM FP macrocell isn't the
|> world's fastest FP core...

Say some more: I don't have the FP details of ARM handy, but I thought
there were only 8 FP registers, which would make it unlikely that they
were dedicated to 8 constants...

Note the distinction between:
1) A register is hardwired to a constant, and participates in the usual
set of operations just like any other register.

2) Immediate operands for some instruction formats
	a) Integers are straightforward, since small integer constants get
	heavy use, and are straightforwardly expanded to full-width
	integers, even with sign-extension, by replicating the sign bit.
	b) Floating-point is trickier, since few instruction formats have
	the space for general FP constants, and therefore a plausible
	approach (which sounds like what ARM has done), is to use a small
	number of bits to select among a few common FP constants ... which
	is more work than sign-extension, but may be useful.
	[In MIPS, I've several times wanted an FP Load-Immediate, for example].

3) Of course, choosing a small set of constants for special treatment
requires serious study of the static and dynamic program behavior,
also with reference to compiler technology, i.e., if a program spends
a lot of time in a loop that has:
	for (i = 0; i < N; i++) {
		...
		a[i] = b[i] + constant*c[i];
	}

In order of increasing speed, to get the constant
	(1) One could execute a series of integer/FP Operations to create the
	constant, depending on the CPU & the constant.  This uses
	instructions, but no data memory.
	(2) One could have built a literal pool somewhere, and just load
	the value into a register.  This is often fewer instructions, but uses
	data memory, and a memory reference, which may or may not be a
	cache miss.
	(3) One can have a floating-load-immediate, which is like (1), but
	speeds up common cases, at the cost of an opcode or two.
	(4) One could have immediate versions of FP operations, i.e.,
	like multiply-immediate, which is fast, but may burn opcodes,
	or may not.

The tricky part is that making such choices isn't obvious, because it
is also affected by such things as:
	number of FP registers
	optimization, global register allocation

In the sample code fragment above, a good compiler, with enough registers,
will have materialized the constant once in the setup for the loop,
and if you think that your FP code is dominated by such loops, then
any extra hardware for FP constants is probably a waste. If your code is
dominated by small functions with lots of branches, many constants,
and few low-level loops, then more of the time is going towards constant setup,
and it might be worth some hardware to help.
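
The ARM-style scheme in 2b above can be sketched as follows (the constant
table matches the list quoted earlier; the 3-bit field itself is an
illustrative assumption, not the actual ARM encoding):

```python
# Toy decode of a small immediate field selecting among a few common
# FP constants, instead of devoting instruction bits to a full value.
FP_CONSTANTS = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.5, 10.0]

def fp_immediate(field3):
    """Decode a 3-bit immediate field into an FP constant operand."""
    return FP_CONSTANTS[field3 & 0x7]

assert fp_immediate(0) == 0.0
assert fp_immediate(7) == 10.0
```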


--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: TOP 10 MISTAKES IN COMPUTER ARCHITECTURE
Date: 28 Aug 1998 01:56:02 GMT

In article <6s4dl4$sip$1@news.ox.ac.uk>, Thomas Womack
<mert0236@sable.ox.ac.uk> writes:

|> Organization: Oxford University, England
|>
|> John R. Mashey <mash@mash.engr.sgi.com> wrote:
|> : In article <487BC60C4F%rw@shadow.org.uk>, Rich Walker <rw@shadow.org.uk> writes:
|> : |> And ARM provides 8 fp constants (0.0, 1.0, 0.5, 2.0, 3.0, 4.0, 5.0, 10.0)
|> : |> which is about right for the ARM architecture.
|> : |>
|> : |> Mind you, it's still not a great advantage, as the ARM FP macrocell isn't the
|> : |> world's fastest FP core...
|>
|> : Say some more: I don't have the FP details of ARM handy, but I thought
|> : there were only 8 FP registers, which would make it unlikely that they
|> : were dedicated to 8 constants...
|>
|> No; effectively, there are 16 FP registers eight of which are devoted to
|> constants (if I recall some very old documents correctly). So it's the
|> second of your options.

I'm back home where my ARM book (van Someren & Atack) is:
it dates from 1994, and says that:
	1) It has 8 fp registers, f0-f7.
	2) "indicates that the argument may be either a valid floating-point
	register or an immediate operand ... small constant from the
	following list: 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.5, 10.0."

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
