From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.std.c
Subject: Re: Shift left for width of type != 0 ?!
Date: 23 Mar 1998 05:58:33 GMT

A thread (from comp.std.c) has started to fill up with misinformation
that might usefully be corrected; cross-posted to comp.arch, where some
useful commentary might occur.
[Summary of thread:
	(A) "The hardware designers" have been "silly"/"lazy" in designing
	shifts that ignore high-order bits of registers for
	variable shift counts, and architectures that do this are "broken",
	"stupid", "wrong".  I.e.,:
		j = 129;
		...
		k = i << j;
	on "some" architectures is equivalent to
		k = i << 1;

	(B) C should have been specified so that shift counts larger than
	register widths set the register to 0 (for left shifts), and presumably
	propagate sign bits otherwise.   I.e., presumably this means
	that  k = i << j
	should always mean
		k = ((unsigned) j > max-shift) ? 0 : i << j;

	(C) The next version of Standard C should be changed to do this.
]

1. C grew up in the 1970s, on systems whose architectures pre-dated
the existence, or wide usage of C, and hence were not likely to have
catered to C in their design.

2. It has long been a design goal of C to be able to generate good,
straightforward code that can access machine resources efficiently.

3. Ignoring high-order bits is neither silly nor lazy,
and in any case, shouldn't be blamed on "the hardware people",
as this same choice has been made numerous times by combined
hardware/software teams who were neither silly nor lazy nor ignorant.
[I admit there may exist hardware people who are not software experts.
There may be ones that are lazy, but I must admit I've met very few lazy ones,
whereas I know plenty who've worked to the point of burnout.
I know many who will try hard to implement what we software
people ask for, or at the very least, reply to any serious request by
explaining what it will cost.  Calling these people silly & lazy is at
best ill-informed.]

4. In fact, obtaining effect (B) in any *new* architecture would not cost
much, and if anybody really cared, they would do it. On the other hand,
there are some implementations of some Instruction Set Architectures where
doing (B) rather than (A) requires logic slightly different from anything
else in the ISA, and consumes die space (or logic & wires in early minicomputer
days), and in some cases such resources were *serious* issues.
(MIPS seriously considered omitting the variable shift completely,
but there was enough evidence to include it.)

5. Shift instructions in Instruction Set Architectures
	Among ISAs are found numerous combinations of shifts, rotates,
	extracts, merges, etc, that in fact tend to share hardware units,
	but focussing just on the ones likely to be used to do C's << and >>,
	we find:

	Shift direction
		Is usually encoded as part of an instruction,
		but in a few (important) cases, is taken from the
		sign bit of the shift amount field from a register.
	Shift size
		Can usually be gotten directly from instruction or
		from a register (usually  general-purpose, but sometimes
		special-purpose register like PA-RISC's 6-bit SAR.)
		Most of the widely-used ISAs do the same thing:
		use the N low-order bits to control the shift,
		where 2**N is the size of the register (or register pair)
		to be shifted, and ignore any high-order bits.
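
	In C terms, for a 32-bit register, that common behavior amounts to
	the following (purely an illustration; the function name is made up):

		/* what a typical ISA's variable shift computes for a 32-bit
		   register: only the low 5 bits of the count matter */
		unsigned hw_shl32(unsigned x, unsigned count)
		{
			return x << (count & 31);	/* count taken modulo 32 */
		}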

6. Historical sample of some ISAs
	These include a lot of the popular/widely-used ones,
	or ones especially relevant to C history.

The ones marked * encode shift direction in the instruction,
and use only the low-order bits.
The ones marked + use the entire register.

Group 1: designed before C, but especially important to C of early 1970s
S/360		*
GE635		?
PDP-11		Uses low order 6 bits of register as a signed number,
		negative numbers mean left shifts.

Group 2: designed before C, or without especially strong C influence,
	but that became relevant later.  I.e., this is late 1970s, early 1980s.
	(The 68K might arguably belong in Group 3 instead.)

U110x		?
XDS Sigma	?
VAX		+, but as in PDP-11, negative numbers mean left shifts.
Intel 8086	* (But uses the low-order 8 bits.  Can somebody say why?)
Intel 286-on	*
68K		*	(Note that some early 68K members used serial shifters,
			not barrel shifters; 68010 was this way, I think).

Group 3: designed with C (among other languages) in mind

ARM		= (but some special cases that I don't recall)
HP PA		*
MIPS		*
SPARC		*
Power,PPC	*
Alpha		*

* & ** some special cases.

7. As can be seen, most ISAs simply use the low-order bits from registers.
ARM does it the way some people would like,
and the VAX (but not PDP-11) *almost* does it that way.

In hardware, in most combinations of ISA and implementation constraints:
	(1) It would be unusual for it to be easier to do shifts based on
	full-register values rather than the low-order bits.
	(2) In most cases, it takes more logic to do the full-register
	case; worse, it can take more wires.
	(3) In some cases, such as early minicomputers and early
	microprocessors, where people fought for every bit of die space,
	or on multi-chip boards, every package, this actually *matters*.

8.  In the early 1970s:
	(a) C's definition of shifts worked fine on the existing architectures.
	(b) Defining it as (B) would have generated more code on most/all of
	the relevant architectures, in order to add this check.
	In particular, it easily could have turned 1 PDP-11 instruction
	into 4 or 5 ... and there is no way in the world that was
	going to happen at that time, as people expected C to give
	straightforward object code.
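
	To make the extra work concrete, here is roughly the check a compiler
	would have had to wrap around every variable shift under definition (B),
	sketched in C for a 32-bit unsigned rather than in PDP-11 code (a sketch
	only, not anyone's actual compiler output):

		unsigned shl_B(unsigned i, unsigned j)
		{
			unsigned low = i << (j & 31);	/* what the hardware gives */
			return (j > 31) ? 0 : low;	/* the extra test + select */
		}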

9. By the late 1970s, there was at least one relevant CPU that more-or-less
did the shifts the other way, i.e., the VAX, but of course, it did have
the characteristic that the register's sign bit controlled the shift direction,
which, I think, is not exactly what people are wishing for, since in fact,
supplying a negative shift amount reverses the direction of the shift,
so that  X << y can shift right...
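
A rough C model of that VAX-style behavior (not VAX code; it assumes the
magnitude of the count stays within the register width):

	int vax_style_ash(int x, int count)
	{
		/* the sign of the count picks the shift direction */
		return (count >= 0) ? (x << count) : (x >> -count);
	}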

10. Anyway, all of this is a moot point:
	(a) It would probably do little harm if new architectures included
	the feature (B), as it has exactly the same effects for the legal
	range of values, and if it makes people happier, the gate count is
	no big deal these days, and I don't think this would usually be
	in a critical path, although there might be some funny bussing,
	especially with 64-bit CPUs, where one needs to grab the high-order
	57-58 bits and do something with them, and then force a 64-bit zero
	elsewhere.  On the other hand, nobody will do this unless it fits.
	(b) Nobody is likely to change the definition of shifts in existing
	architectures, and since people are running out of opcodes,
	it is unlikely that anyone is going to burn several with additional
	shift operations, so don't expect this to get added to the
	already-extant architectures.
	(c) I'd suggest that the chances of getting the C definition changed
	this way are ~0; once again, my sympathies go to the Committee
	for their patience.

11. As has often been the case, Dennis' early specs have stood the test
of time pretty well: as it stands, the existing definition of C's shifts:
	(a) Generates good code across most CPUs, and imposes zero overhead
	on those (numerous) cases where the shift count was created by
	masking anyway, and guaranteed to be OK (but not checkable as such
	by the compiler).
	(b) Lets the programmer perform checks as needed (see the sketch after
	this list).
	(c) Does not impose requirements on CPU designs that are painful,
	for no particularly good reason.
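
As an illustration of (b), a programmer who wants the (B) behavior can write
it portably in a few lines (a sketch; the helper name is mine):

	#include <limits.h>	/* CHAR_BIT */

	/* left shift that yields 0 when the count is too large for the type */
	unsigned safe_shl(unsigned x, unsigned n)
	{
		if (n >= sizeof(x) * CHAR_BIT)
			return 0;
		return x << n;
	}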

12.  But finally, it is really ill-informed to use terms like "silly" and
"lazy" in this context.  The design of a CPU architecture takes a
myriad of decisions, which must hang together as a group, and that often
have interactions not obvious to people; sometimes trivial-looking
decisions have major implementation impacts.  It is very rare that
the software people involved in CPU architecture have thought it worth
fighting for (B) in place of (A).  In my personal opinion, (B) might be
OK, and it might even fit into some ISAs, but it had better be "free".



--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: [++] Re: Shift left for width of type != 0 ?!
Date: 26 Mar 1998 01:01:03 GMT

In article <35197DE9.7F6E@hda.hydro.com>, Terje Mathisen
<Terje.Mathisen@hda.hydro.com> writes:

|> John R. Mashey wrote:

|> > Intel 286-on    *
|>
|> This was probably partly to improve interrupt latency: Using a maximum
|> length (255) shift or rotate would lock out interrupts until the
|> operation finished.
|>
|> With one cycle/shift, a 4.77 MHz 8088 would use at least 53 microseconds
|> for SHR AX,CL when CL=255

Yes, I see the 286 manual says:
	"To reduce the maximum execution time, the iAPX 286 does not allow
shift counts greater than 31.  If a shift count greater than 31 is attempted,
only the bottom five bits of the shift count are used.  The
iAPX 86 uses all 8 bits of the shift count."

Clearly, as people got room for barrel shifters, this particular effect
disappeared :-)

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Shift left for width of type != 0 ?!
Date: 27 Mar 1998 20:29:10 GMT

In article <6fevqv$5jo@sun001.spd.dsccc.com>,
jmccarty@sun1307.spd.dsccc.com (Mike McCarty) writes:

|> Intel states that the reason they use the number of bits mod whatever is
|> to reduce interrupt latency. The shift instruction on the 80x86
|> architecture is not an interruptable instruction. If one tries to shift
|> a register left by 4 billion bits, and the processor actually carried it
|> out, then the time to execute it would be enormous.

|> OTOH, the same result could have been achieved simply by "knowing" to
|> clear the register when the shift exceeded the register size.

This is a good illustration of the sort of question that is educational about
real-life design tradeoffs, but very hard to answer without knowing
some more details: maybe some 8086 implementor can shed some light on
what they were actually thinking.
As discussed earlier in this thread, one can:
	(a) Simply use the low order bits, and ignore the rest, which is
	    what most people do, and where 286s and later ended up,
	    with the reason described, according to the 286 manual.
	(b) Use the low-order bits as a shift count, but test the high-order
	    bits, and either clear the register or fill sign bits as
	    a special case.
	(c) Use the entire register (where X86's CL is an "entire" register)
	    as an iteration count, i.e., in an architecture intended to be
	    implemented with microcode, there are likely to be functional units
	    and operations that are "natural", and others that are not.
	    I.e., there may be existing facilities to use that say:
			put a count into an internal-to-microcode register
			repeat an operation, decrementing the count, and
			possibly incrementing any address pointers.
		It is clear from timing charts that 8086s do this.
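
	In C terms, the sort of microcoded loop described in (c) behaves like
	this (an illustration of the idea only, not Intel's microcode):

		unsigned short shl_8086_style(unsigned short x, unsigned char cl)
		{
			while (cl--)		/* CL used as a plain loop count */
				x <<= 1;	/* one bit per iteration */
			return x;
		}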

Hence, if I had to guess, I would think that the 8086, a complex architecture
that had to be implemented with much earlier technology, and that used
microcode heavily as a result, had some hardware resources that made it
easy to use the CL as an iteration count, with no special cases,
and there was no particular other reason to need the specific hardware,
so they didn't do it.  This is an example of why it is hard to analyze
features in isolation: their incremental cost often depends on the feature
set, and implementation resources already present.

This has been said before, but we've had:
1) Generations of machines heavily constrained by expensive memory,
and that therefore optimized code size, especially during that period where
microcode memories were very cost effective.

2) Then we had a round of microcoded microprocessors, where it was difficult
to squeeze the resources on-chip.

3) Then we had a round of hardwired-micros, which went to RISC ISAs to
get ISAs simple enough to implement on 1-2 chips with a few levels of metal.

4) During 2), and 3), the constraints were usually very tight for
the first few implementations.  At least for die space, they aren't now,
although people will still argue about gate delays and layout issues,
and problems where a feature requires long wires in awkward places.

Anyway, can any X86 implementor shed any light on the reasoning?


--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.std.c,comp.arch
Subject: Re: Shift left for width of type != 0 ?!
Date: 2 Apr 1998 19:37:36 GMT

In article <2795e.smail.smayo@ziplink.net>, smayo@ziplink.net (Scott
Mayo) writes:

|> In comp.std.c, mash@mash.engr.sgi.com (John R. Mashey) wrote:
|> >Kai
|>
|> >|> This is extremely easy to get right. There can be no real excuse for
|> >|> getting it wrong.
|>
|> >OK, could you post, in detail the "right" answer?
|>
|> Now *there's* a loaded question. But I think C *could* have defined shift as
|> follows:

I'm not sure why this is a loaded question: when people post strong opinions
that include derogatory comments about (most) past work in the area,
as Mr. Henningsen did, I'm not sure what's loaded about a polite request
for "The right answer that everybody should have done".  Hopefully,
Mr. Henningsen will soon post the right answer, so that it can be evaluated.
A *convincing* post would be:
	1) At a minimum, specify the behavior of the simple variable shifts.
	2) Since ISA decisions are sometimes influenced by implementation
	technology, analyze this specification in the light of some
	choice of relevant CPU implementations.  A really good list would
	include:
	S/360 (/40, 50, 65, to get the heart of that market);
	PDP-11/45 or 11/70;
	VAX-11/780;
	8086/80286 or 68000
	80386 or 68020
	MIPS, SPARC, or maybe PA-RISC or ARM
	POWER, ALPHA, or IA-64
	3) A truly convincing argument would show that the additional
	implementation cost or performance implications of "the right way"
	were never truly significant, meaning that when you ask a hardware person
	to implement A rather than B, they can say:
		(a) Sure, no noticeable implementation difference.
		(b) Probably, some extra resources, but doesn't change
		    the layout or add cycle time, we can squeeze it in.
		(c) Maybe, doesn't slow the CPU, but costs some resources
		    that will have to be taken away elsewhere.
		    What do you want to give up among those resources where
		    we physically can do what you want?
		(d) You can have it, but it will add a cycle to this operation,
		    or slow the clock cycle of the machine by X%.
		(e) Haha, no way.
	To be truly convincing (that hardware people have no excuse),
	one would have to show that the common implementations that did it wrong
	were in categories (a) or (b) above, that is, that doing it right
	would not generate a serious discussion of tradeoffs.


|> If shift is a logical operation, a left shift should simply shove bits left,
|> filling in zeros. The shift should occur "as if" the right operand were
|> taken seriously; if all the bits get shifted out of the word as a result,
|> so be it. The sign bit is not special. The bit count is treated as
|> unsigned, because it really *is* a burden on some CPUs to treat the shift
|> count as signed. (A nice compiler raises a warning when it can't be sure the
|> shift count is nonnegative, but by now we're into sheer fantasy.)

Hopefully Mr. Henningsen will also post his right proposal and supporting
analysis.

I think I understand the proposal for a left-shift instruction,
but let me make sure of the general scheme, by asking a few questions,
assuming typical 2's-complement systems with power-of-2 sizes and 8-bit bytes,
and expressing the effect in current C:

For a 32-bit ISA, with ILP32 model, this is fairly clear:
	#define leftshift(a, b)	((unsigned long)(b) > 31 ? 0 : (a) << (b))

For a 64-bit CPU,  with LP64 model:
	#define leftshift(a, b)	((unsigned long)(b) > 63 ? 0 : (a) << (b))

(is that right? Of course, most machines don't do this, but that's OK,
so can I assume this is a proposal going forward, as opposed to a claim that
everybody should have done this already?)

A 64-bit architecture that has no 32-bit predecessor subset can get
away with 1 variable left-shift; those that are 64-bit supersets of
earlier 32-bit architectures tend to need 2 forms, one that does the
natural 64-bit shift, and the other that produces the appropriate
results identical to the existing 32-bit shift, but represented in the
64-bit register, since those are not the same.


|> It's moot, though. When I've needed a reasonably behaved, reasonably
|> portable shift, I've written a function to do it. Indeed, such a shift
|> function strikes me as a good candidate for the C runtime, because to do
|> it *really* portably could be a pain in the neck, given holed integers.

Is the definition of those equivalent to the examples above, or something
else?

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.std.c,comp.arch
Subject: Re: Shift left for width of type != 0 ?!
Date: 2 Apr 1998 23:56:33 GMT

In article <dennisEqt4JE.Hns@netcom.com>, dennis@netcom.com (Dennis
Yelle) writes:

|> >A 64-bit architecture that has no 32-bit predecessor subset can get
|> >away with 1 variable left-shift; these that are 64-bit supersets of
|> >earlier 32-bit architectures tend to need 2 forms, one that does the
|> >natural 64-bit shift, and the other that produces the approrpriate
|> >results identical to the existing 32-bit shift, but represented in the
|> >64-bit register, since those are not the same.
|>
|> I don't understand why you think 2 forms are needed.
|> What is wrong with using the 64 bit form all the time?

If you start with a 64-bit ISA, you can provide a 64-bit shift,
and generate appropriate masking for the smaller sizes, just as many 32-bit
ISAs offer only 32-bit shifts, and do masking for C chars and shorts.

Consider the MIPS code for:
	unsigned int i, j, k;	(assume sizeof(int) == 4)
	k = (i << j) >> j;

	sllv	ri,rj
	srlv	ri,rj		extracts low order 32-(rj) bits, rj modulo 32
				i.e., zeroes hi-order (rj) bits of 32

But
	dsllv	ri,rj		d = doubleword, i.e., shifts of up to 63 bits
	dsrlv	ri,rj

doesn't do the same thing: it zeroes the high-order 64-(rj) bits of ri,
but the resulting low-order 32 bits are not the same as what sllv/srlv produced...
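
In C, with illustrative values (the helper names are mine):

	#include <stdint.h>

	uint32_t pair32(uint32_t i, unsigned j)	/* like sllv ; srlv */
	{
		return (i << j) >> j;	/* clears the high j bits of i */
	}

	uint64_t pair64(uint64_t i, unsigned j)	/* like dsllv ; dsrlv */
	{
		return (i << j) >> j;	/* loses nothing when i fits in 32 bits and j < 32 */
	}

	/* e.g., pair32(0x80000001, 4) == 0x00000001,
	   but   pair64(0x80000001, 4) == 0x80000001 in the low 32 bits */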

Needless to say, when you are running out of opcodes, you don't add
instructions for fun :-)


--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.std.c,comp.arch
Subject: Re: Shift left for width of type != 0 ?!
Date: 4 Apr 1998 00:31:44 GMT

In article <zalmanEquuup.9AM@netcom.com>, zalman@netcom.com (Zalman
Stern) writes:

|> Andy Glew (glew@cs.wisc.edu) wrote:
|> : Codes that do things like
|>
|> :     unsigned mask;
|>
|> :     mask = -1;
|> :     while( mask >> 1 ) {
|> :         /* implicit assumption that loop will be executed only 31 times */
|> :     }
|>
|> : will break if
|> : (1) unsigned becomes 64 bits,
|> : or
|> : (2) unsigned stays 32 bits, but '>>' is implemented only as a 64 bit shift,
|> : rather than a 64 bit shift followed by a mask (or a 32 bit shift).
|> : Even if the user has been a little bit more explicit,

Well, presumably this could have been done elsewhere, but perhaps Andy meant:
	while (mask >>= 1) {

So that the next statement is true with no more assignments to mask:
|> So the loop will execute exactly one less than the number of bits in an

|> C says it has to work that way. Either go for 64-bit types, put in the
|> extra shift instructions, or pay the price of masking sometimes. (The
|> optimizer can eliminate the masking in some cases.)

After all, this is exactly what C compilers are required to do if the
mask had been declared unsigned char or unsigned short on a typical 32-bit CPU.

Let me try again:
1)  If you have a new 64-bit ISA, you can get away with one 64-bit shift
for each of the shifts, and you just have to sometimes generate extra
instructions in dealing with the smaller sizes.  Sometimes you don't:
	for instance:
	unsigned short i;
	...
	i <<= 1;
may well generate something like:
	lhu	r1,i
	sll	r1,1
	sh	r1,i
I.e., there is nothing extra if the variable was in memory, not in a register.
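
The case where something extra *is* needed is when the value stays in a
register; then the compiler itself has to mask to preserve unsigned short
semantics.  In C terms (a sketch, not actual compiler output):

	unsigned short shl1_in_register(unsigned short i)
	{
		unsigned r = i;			/* i lives in a full-width register */
		r = (r << 1) & 0xffffu;		/* the mask is the extra instruction */
		return (unsigned short) r;
	}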

If ISA designers get talked into having shorter shifts for performance
reasons, one can have shorter shifts as well.

2) But you really only need 32- and 64-bit shifts where you are worried about
running existing binaries that already have 32-bit shifts, and you want
those instructions to do the same thing as they used to, in both
32- and 64-bit modes, and also have full 64-bit shifts available.


--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.std.c,comp.arch
Subject: Re: Shift left for width of type != 0 ?!
Date: 16 Apr 1998 22:42:37 GMT

Some people have been bugging me to answer the question that I posed...

In article <6g0pe0$1hu$1@murrow.corp.sgi.com>, mash@mash.engr.sgi.com
(John R. Mashey) writes:

|> |> >Kai
|> |>
|> |> >|> This is extremely easy to get right. There can be no real excuse for
|> |> >|> getting it wrong.
|> |>
|> |> >OK, could you post, in detail the "right" answer?

Having waited a while in hopes of an answer, I have yet to see one; that
is, there have been various (reasonable) suggestions for how a shift
instruction "ought" to work, but it's been hard to find much trace of
history to show that people were being "silly" or "lazy".
Although it is impossible to know, one suspects that these opinions come
from lack of relevant & broad knowledge about real hardware
implementations.

I believe that the true state of affairs is:
1) In current high-end microprocessor technology, one could pick almost
any of the proposed shift methods.

2) Historically, for most of the commonly-used architectures that I
mentioned in my list, there were in fact *extremely* good reasons to implement
the simpler shifts that were actually implemented, as the "right" ones
either impose noticeable space penalties, cost penalties, or time penalties;
it may be possible to argue that it would have been better to have implemented
one of the other shifts, but I couldn't find an important obvious case where
this was done out of laziness or ignorance: many hardware designs have
serious constraints and require analysis of tradeoffs, like "in order to
get the "right" shifts, we have to eliminate several other candidate
instructions".

3) I will go through the list of interesting cases I'd proposed, and explain
why people did what they did, to the limits of what I know; in some cases,
the answers are fairly clear, to anybody who's had a digital design
course or two (i.e., by Jr-Sr year of EE or CSE at Stanford, like EE382,
for VLSI), and some historical background.  Textbooks often show layouts
of shifters in VLSI so this is not exactly an unknown art.

It is actually an interesting & understandable
example of: implementation effects on ISA; ISA effects on C; C effects on
ISA; and finally, common misunderstandings regarding hardware design,
by software people without the relevant background (misunderstandings that
I've certainly shared at various times :-) If I get time,
I'll post some references later for people who want more detail.

|> 	2) Since ISA decisions are sometimes influenced by implementation
|> 	technology, analyze this specification in the light of some
|> 	choice of relevant CPU implementations.  A really good list would
|> 	include:
|> 	S/360 (/40, 50, 65, to get the heart of that market);
|> 	PDP-11/45 or 11/70;
|> 	VAX-11/780;
|> 	8086/80286 or 68000
|> 	80386 or 68020
|> 	MIPS, SPARC, or maybe PA-RISC or ARM
|> 	POWER, ALPHA, or IA-64
|> 	3) A truly convincing argument would show that the additional
|> 	implementation cost or performance implications of "the right way"

4) A quick tutorial on common implementations of variable left shift:

a) The simplest shift is to take the low-order N bits from a register,
and then do the shift.  Ignoring more complex combinations of rotations,
extracts, etc, which are sometimes all implemented together, and just focusing
on what you do to do a left-shift:

b) If all you have is a 1-bit shifter, you use the N-bits as an iteration count,
and this fits well into some ISAs.

c) With a little more hardware, you might have a shifter that does either
1-bit or (for example) 4-bit shifts, and you iterate 0-3 times on the low-order
2 bits, with 1-bit shifts, and then do 4-bit shifts up to 2**(N-2) - 1 times.

d) With more hardware, you use a barrel shifter, which uses each bit of the
input shift count to shift different amounts, i.e., low-order bit either
shifts left 1, or does no shift, next bit shifts left 2, or no shift.
The natural design (on a VLSI chip) has a set of multiplexors for each shift
bit, and can use substantial area [it's actually visible on an R2000 chip].
On VLSI chips, especially earlier ones with few metals, the issue often was
the wiring space to get the high-order bit's shifts, which must go 16 bits
for a 32-bit CPU, and 32-bits for a 64-bit CPU, especially since a barrel
shifter normally does both left and right shifts, of course.
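
The same idea expressed in C (a model of the logic only, not a hardware
description): each of the five count bits selects either a fixed shift stage
or a pass-through.

	unsigned barrel_shl32(unsigned x, unsigned count)
	{
		if (count & 1)  x <<= 1;	/* stage 0: shift by 1 or pass */
		if (count & 2)  x <<= 2;	/* stage 1 */
		if (count & 4)  x <<= 4;	/* stage 2 */
		if (count & 8)  x <<= 8;	/* stage 3 */
		if (count & 16) x <<= 16;	/* stage 4: the long wires */
		return x;
	}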

e) For any choice of ISA and implementation, some minor differences of
ISA choice may have minor differences in implementation, but sometimes
the differences are major, simply because one choice may demand
functional units or datapaths that don't otherwise need to exist,
and hence, need really serious justification.  In many VLSI chips, the
wrong additional feature can blow the die sapce budget, either in total,
or in a specific part of the chip: it is quite typical for there to be
some leftover space that simply cannot be used because you can't get to it.

Example: if you have an ISA that already implies heavy microcoding,
and instructions with "small" iterations (like 0-255), then adding another
thing that works that way is relatively straightforward, but if not,
there will be fights to add just one thing that works this way.

f) For the simplest implementations, a 32 (64)-bit chip uses the
low-order 5 (6) bits to control the shifter. In many implementations, it
would probably not be too bad to add another bit, i.e., so that a shift by
32-63 in a 32-bit ISA would provide zeroes, but a shift by larger numbers
or negative ones would be done modulo 64.

g) In order to do anything with the rest of the bits, the hardware must
inspect the high-order (32-5 = 27) or (64-6 = 56) bits, i.e., it basically
needs to do a 27 (56)-bit zero-detect to select between the shifted result
and zeroes (sketched in C after the list below).  So the issue comes down to:
	1) If that is free in most implementations, then maybe the
	   designers should have done it, and thus there might be
	   a basis for the derogatory remarks.
	2) If it has serious costs, then perhaps the authors of
	   derogatory remarks don't understand the problem.
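
Sketched in C, that choice looks like this (illustration only; the names
are mine):

	unsigned shl32_with_zero_detect(unsigned x, unsigned count)
	{
		unsigned shifted = x << (count & 31);	/* the existing shifter */
		unsigned high    = count >> 5;		/* the other 27 bits */
		return (high != 0) ? 0 : shifted;	/* zero-detect + select */
	}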

5) Quick historical analysis

S/360-40: mid-1960s design
	Gates were *awfully* expensive: this machine had an *8-bit* ALU,
	so doing the more complex shift would have slowed down every shift
	somewhat by adding a bunch of steps to test the high-order *26* bits,
	which is at least 3 extra passes through the ALU for the high-order
	3 bytes, plus whatever they would have to do for the remaining 2 bits:
	from the book I looked at (Husson's), I couldn't tell at a glance
	how many clocks that would be, but I couldn't find any obvious datapaths
	that would make it quick.  In any case, when actually shifting, it does 1
	bit at a time, so shifts are pretty slow; one could argue that they would
	just have been a little slower, but every shift would pay the price,
	since this implementation was very serial, with 8- and 16-bit datapaths.

S/360-50: another mid-1960s design, for a ~$1M system.
	This at least had a 32-bit ALU + shifter; however, the shifter
	could only do 4-bit or 1-bit shifts, so if you look at the timing
	charts, there are a bunch of 4-bit iterations and 1-bit iterations.

S/360-65: I don't recall what this did; datapaths were certainly wider;
	there still isn't anything in the ISA that does a 26-bit zero-detect.

Both of the above implementations were from the era of:
	Very expensive gates, multiple packages/board, no caches,
	memory very expensive, microprogramming very appropriate.

PDP-11: all-out effort to pack a lot of functionality in small hardware,
	with minimum code-size, which is at least some of the
	reason why the PDP-11s used ASH instructions whose direction came from
	the sign-bit of the shift field, thus saving opcodes.  Examination of the
	high-order bits for a zero-detect also doesn't show up elsewhere in the ISA.
	Maybe somebody who worked on it remembers the exact reasoning;
	I asked Gordon Bell at dinner a week ago, but he didn't recall details.


VAX: actually uses all 32-bits of an input register.  I'm not sure what all
	of the issues were, but it is clear that the VAX instruction set was
	geared to great code-density, and expectation that most implementations
	would use microcode, and by this time, microcode memory could be
	much larger than in the days of the 360/40, so the incremental cost of
	doing this was probably not too bad.

	In some ways, the VAX is an historical oddity, in that the first
	implementation was not as constrained as first implementations usually
	are.

8086: Had a serial shifter controlled by an 8-bit iteration count in CL,
	which is a sort of structure used elsewhere I think, on a microcoded
	design.  At this point in microprocessors, people were back in the
	360/40 or PDP-11 space, but on one chip, with serious constraints on
	die size, hence the tendency to use a small number of hardware
	resources, plus whatever microcode will fit.

80286/80386: use the low-order bits of CL, not the whole register: this is
	an obvious move to get ready for barrel shifters, and space was still
	tight on chip.

68000/68010
	Heavy microcode, serial shifters, low-order bits from day one, as
	later ones would have barrel shifters.

MIPS
	No microcode, in 2-metal CMOS, i.e., *very* constrained implementation,
	but with barrel shifter from day one.  If you look at the original
	R2000, in the usual orientation, there is:
		- a ring around the edge
		- a left 40-45% of the chip, which has MMU, control, etc.
		- a right 40-45% of the chip, which has a vertical stack of
		integer registers, ALU, barrel shifter, multiplier/divider,
		address generation, PC-incrementer, etc ...
		AND IT JUST BARELY FITS.  You can see the shifter with
		naked eye, looking sort of like an X in the middle of the stack.
		In order to do a zero detect on the high 27-bits of the input
		register, you need another block about as high as the
		branch (branch-not-eq-zero almost does this function) found
		at the bottom right, plus another row of MUXes to force zeroes,
		or equivalent.  One can imagine re-using part of the
		zero-detect hardware of the branch ... although one is
		not allowed to add even one more MUX into that path, since it
		was on/near the critical path, and the chip would probably
		have to be rearranged to avoid wiring problems, since the shifter
		and zero-detect logic aren't close to each other.

		Now, would it be plausible, in such an implementation, to
		put in the zero-detect hardware to protect against this
		particular case (as opposed to having the compiler generate
		some extra instructions?)  To help answer this, I note:

		1) The divide instruction in this chip (and even having
		a divide instruction wasn't easy) did *NOT* have a zero detect
		for a divide-by-zero exception; compilers generated a
		beqz instruction to guard any divide-by-variable...

		2) The chip almost didn't have a variable-shift at all;
		only a lot of analysis of UNIX kernel code (bitmaps) and
		graphics bitlbts convinced them to add the extra hardware
		(you need muxes to select between the 5 bits taken from the
		instruction, and the 5 bits from the usual register bus).
		That isn't a lot of hardware, but everything counts.

		3) Since MIPS ISA doesn't have condition codes of any sort
		(N, V, O, etc), and thus unlike some ISAs, doesn't have to
		worry about checking every bit shifted out to reflect in
		some code bit, the actual shift implementation can be a very
		lean, minimalist barrel shifter.  I tried to get some
		bit-field insert/extracts in there ... didn't fit.
		So, the bottom line is:
		The R2000 might have used the low-order 6-bits rather than
		5-bits; I don't know if that would have fit onto the die.
		THE FULL 32-BIT (27-BIT ZERO-DETECT VERSION) WOULD *NOT* HAVE
		FIT.

		Sometimes people claim that a feature would be small, and so
		should be included ... but as a general claim, this is simply
		fantasy, first because there are always a myriad of such things
		fighting for space in constrained implementations,
		but, second:

		IF IT DOESN'T FIT, IT DOESN'T FIT, NO MATTER HOW SMALL IT IS.
		Which means, another feature has to jump out of the lifeboat.

SPARC:	The original SPARC was done on gate arrays, and every bit of
	space was also precious; it did essentially the same as MIPS for shifts.

PA-RISC: The very first implementations were not single-chip VLSI designs, but
they were certainly thinking about it, as PA-RISC ISA shows every evidence
of people carefully weighing ISA features versus implementation issues,
and avoiding things that would cause extra gate delays, and using interesting
encodings in various places to achieve this.

POWER, ALPHA, and IA-64: POWER was originally designed for multi-chip
implementations, and could have afforded more gates for the more complex
shifts, likewise Alpha, and in fact, so could MIPS R4000s onward,
UltraSPARCs, etc ... but of course, it was already too late for that.

SUMMARY:
1) For most of the major ISAs that have been done in the history of computing,
the *first* implementations had serious constraints, either in:
	- hardware resources available (S/360, PDP-11, 8086, 68K, early RISCs)
	- microcode resources (early S/360s, PDP-11, X86, 68K)
	- die space (any of the earlier microprocessors)

2) Although it is not obvious, the variable shift that does a zero-detect
on the high-order bits consumes more scarce resources than you'd expect.
This does *not* say that such a feature would have been bad, just that
it had costs (in hardware, die space, cycle count, in various ways) that
were big enough to be nontrivial, and that would cause people to
consider compiler-generated code sequences.

For instance, if C had been specified in different ways, MIPS would have:
	a) C as it exists: low 5 bits, minimal hardware, actually done.
	b) C, but requiring low 6-bits to work: would have tried to do that.
	c) C, requiring zero-detect of high 27-bits: would probably have
	   done same implementation as a), but had compiler generate
	   extra code, and probably had a compiler switch to suppress it for
	   people who knew the code was safe (since, after all, many
	   code sequences obtain the shift count by safe sequences).

So, the particular choice (of relatively minimal shifts) may or may not
be optimal for all uses, but it is neither silly nor lazy: it has been
a reasoned avoidance of hardware that nobody has been willing to fight
very hard for, in the face of being required to throw other features out.

If you want to look for silly features in hardware, look for :
	1) Designs done with zero software input, and features that end up
	   very difficult to compile for or get to.  There are some like that,
	   but left-shift isn't one of them.
	2) Features demanded by software people ... that turn out not to be
	   worth their cost, or that are overly specific to some particular
	   software method that goes away later.

In my opinion, there are probably more "silly" things in hardware
caused by demands of software people, than there are caused by "lazy"
hardware designers.  (Remember that I'm a software person, not a hardware
engineer, so I'm saying that "we" are at least as bad as "them" :-))


--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.std.c,comp.arch
Subject: Re: Shift left for width of type != 0 ?!
Date: 24 Apr 1998 06:29:06 GMT

In article <353B39E3.C15E6480@soton.sc.philips.com>, Stephen Baynes
<stephen.baynes@soton.sc.philips.com> writes:

|> > Remember that we are comparing these figures with a saving of perhaps
|> > 0.001-0.01% in chip real estate.

|> Does that include RAM chip real estate, or just CPU? For embedded systems,
|> where you are that woried about interrupts being missed then the total
|> chip real estate can be significant.
|> In this sort of system it is not unknown to implement special hardware
|> as a sort of memory mapped IO device to do things like CRC's because
|> it is more compact and much faster to do it in custom hardware than code.

Yes ... in many respects, this is an interesting time for embedded, as
people can make up their own instructions sets somewhat with the various
cores & synthesizable logic available.

Regarding % of die space, I observe:
	(1) As has been said before, it doesn't matter how small something
	is if it doesn't fit where it has to go, so %die space is only
	one of the attributes needed [shape, connectivity, gate delays,
	etc, are all needed.]

	(2) For a 1986 MIPS R2000, in 2micron CMOS, I took a quick look at
	an old die photo, finding that:
		The integer datapath, local control, registers, etc,
		i.e., the stuff that does the usual user-level integer
		instructions, is about 20-25% of the entire die.
		The shifter itself was roughly 2% of the entire die,
		including the pad ring, or about 10% of the integer datapath.
	It looks like the zero-detect circuit needed to do the full
	protection against large shifts would be about 1% of the die,
	or about 5% of the space for the integer datapath.

Using a really simplistic model that a shrink from 2 micron to .2 micron
gives you a 10 X 10 = 100X areal shrink [it is a lot more complex than that
of course], and allowing for bigger chips, more metals (but metals don't
shrink as fast as transistors), but also that many want a 64-bit datapath
[which doubles the width], one might say:
	1%: solid datapoint, relevant at the time many current ISAs started.
	.01%: what one might expect now, except of course, for embedded
	 apps, people don't usually employ a large die, but fight over every
	 bit of space.

Anyway, from this it seems like .01% to 1% is a better approximation to
the real range than is .001-.01% ... but as always, there are a myriad of
such features competing for die space.
--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.std.c,comp.arch
Subject: Re: Shift left for width of type != 0 ?!
Date: 27 Apr 1998 05:10:59 GMT

In article <3540B864.8555EAD6@cs.wisc.edu>, Andy Glew <glew@cs.wisc.edu>
writes:

|> Yes, good architects estimate the effect of their proposals on the
|> layout BEFORE settling on instruction set enhancements. Certainly for
|> the next chip, probably for the next two chips, maybe more.
|>
|> This may be an estimate, but in some cases it involves trial layouts.
|>
|> We don't just do architecture for the long term.
|> We also have to make money at every evolutionary chip in between.

Andy has this right, but even stronger:
	(1) The *first* implementation of an ISA has to be *possible*,
	and produce a usable design, i.e., there is a (proper) distinction
	between research chips and commercial chips, where the former
	may explore ideas, but leave out things that simply cannot be left out
	in production chips.

	(2) Many design efforts have proceeded via iterations wherein:
		(a) A general approach is drawn up.
		(b) The design gets fairly far along.
		(c) As estimates turn into detailed layouts, "optional"
		features may well drop out, or in a few cases get added.
		"optional" means:
			Almost anything in the first implementation of an ISA.
			Additional features in later implementations.

	(3) Sometimes detailed layouts already exist, and then something
	happens that causes potential layout perturbations that actually
	affect the ISA.  What might happen?
		(a) Some block of synthesized logic gets a bug fix that
		    causes its size to expand so it doesn't fit any more.
		    (Something like this happened to R4000).
		(b) New information is obtained late in the design cycle,
		    such as new requirements, or better information regarding
		    process technology that will/may not be available
		    when necessary.
		(c) A wish occurs that gets settled by observing that there
		    is no room left for feature Y, but that feature X helps
		    solve the same problem, and its natural implementation
		    uses almost no extra die space.  [LWL/LWR got into
		    R2000 *after* it was decided to support bi-endian byte
		    ordering, and it was observed that the necessary
		    transformations could be obtained by adding extra
		    wiring paths to the load-aligner in a way that didn't
		    consume more space.  In retrospect, these might well
		    have been better omitted, as they have an odd property
		    not shared by any other MIPS loads, which bothered
		    later implementors, but at the time, many people were
		    concerned about unaligned data, especially in FORTRAN,
		    given that many ISAs permitted unaligned data.]
 		(d) Proposed and highly desirable feature enhancements drop
		    out not long before tapeout, simply because the chip is
		    slightly bigger than the reticle (and that's not allowed),
		    leading to a frenzy of searching for features to
		    throw overboard.  For instance, this has often happened
		    with good/powerful performance counters ... whose
		    practical value is far greater than most minor changes to
		    the ISA.
I heartily recommend:

http://www-leland.stanford.edu/class/ee382/

which is EE382 (Processor Design) at Stanford, as taught by Don Alpert.


--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


