Index Home About Blog
Newsgroups: comp.arch
From: (John R. Mashey)
Subject: Re: Intel 86-series -> RISC [really: register renaming]
Date: Fri, 16 Sep 1994 03:35:34 GMT

In article <>, (John
Ahlstrom) writes:

|> What do the cross-over points (where the utility of
|> adding named registers is less than the utility of
|> adding registers with renaming) depend on?
|> These cross-over points could depend on 
|> 	instruction bits to name registers  
|> 		e.g. can't destroy backward compatibility
|> 		need bits for other things
|> 	complexity to do renaming
|> 	nature of the code

Well, some of it is certainly the nature of the code.  Put another
way, it is a misbelief that the following are equivalent:

1) ISA #1 with N visible registers (with or without renaming).
2) ISA #2 with M << N visible registers, in an implementation that has N
physical registers, and uses renaming.

Of course, the larger N gets, the less it matters.
It is the case, already, that ISAs with N=32 (each for both integer
and FP) *already* can hit enough register pressure to want renaming
when using aggressive superscalar implementations and aggressive
optimizing compilers, particularly on some kinds of FP code.

Following are a few examples to illustrate where renaming helps,
and where it doesn't.  let's try N = 4 and M = 3 (to keep it simple).
I'll assume a psuedo-MIPS-like ISA.

Example 1, using ISA #1, that fetches elements from an array, adds 1
to each, and stores them in another array.  remember: 4 real registers.


LOOP:	LW	R4,0(R1)		4
	ADD	R4,R4,1			5
	SW	R4,0(R3)		6
	ADD	R1,R1,4			7
	ADD	R3,R3,4			8
	BLT	R1,R2,LOOP		9

Now, consider a 4-issue superscalar, with renaming, and speculative,
out-of-order execution (i.e., sort of like the Metaflow Thunder).
Let me assume that the first few fetches from ARRAY are cache misses,
but the CPU is trying hard to "unroll the loop" in hardware to get
ahead, and that instructions hang around until their inputs are ready.

Cycle 1: fetch instructions 1..4, rename registers (starting empty,
to physical registers P1..P3.)

Cycle 2: fetch 5..8, rename like:
	ADD	P5,P4,1		P5 is new result
	SW	P5,0(P3)
	ADD	P6,P1,4		P6 is new result (R1)
	ADD	P7,P3,4		P7 ""  (R3)

Cycle 3: fetch 9..., and assume predict branch taken

Cycle 4: fetch 2nd iteration, and rename:

LOOP:	LW	P8,0(P6)	Different values of R4 and R1
	ADD	P9,R8,1		
	SW	P9,0(R7)
	ADD	P10,P6,4

etc ... so what's happening is that while the ISA only has 4 visible
(integer) registers, good use is made of the extra physical registers
to have instrcutions from several iterations of the loop in progress
at once ... which is very difficult to do without rename.

Example 2: same code, but with ISA#2, with only 3 visible registers.
In this case, there aren't enough, and the compiler had to have generated
more loads and stores in the first place.  For ease of typing,
I'm going to call the visible registers R1,R3,R4:



LOOP:	LW	R4,0(R1)		4
	ADD	R4,R4,1			5
	SW	R4,0(R3)		6
	ADD	R1,R1,4			7
	ADD	R3,R3,4			8
	BLT	R1,R4,LOOP		9

i.e., the compiler ran out of registers, so it couldn't just set up
END_OF_ARRAY1 in a register and leave it there.  Now, if you run this
example, using the same sort of renaming & execution scheme, it also
works perfectly well, BUT the LW instruction labeled 8a doesn't disappear.

Example 3, let M =2 (awful, but for the sake of the example, and don't bug me
if there's a cycle or two that can go away):

LOOP1:	LW	R2,0(R1)
	ADD	R1,R1,4
	SW	R2,0(R1)
	ADD	R1,R1,4

Here, again, renaming would let you get multiple iterations, but you still
have those extra loads and stores.  The renaming will help case 2 or 3 with the
re-use of registers inside the loop, as well ... but still, the loads
and stores do not go away.

Suppose one's design allows 1 load/store/cycle, branches are predicted
correctly, no cache misses, etc, then:

4 registers	2 cycles/iteration
3 registers	3 cycles/iteration
2 registers	7 cycles/iteration

	Renaming can relieve register pressure around loops and
	conditional code ... but it does not eliminate loads
	and stores where a compiler would have rather globally allocated
	values to registers in the first place, had it enough visible
	registers.  This is an over-simplified example ... on the other
	hand, more than one vendor has already seen cases where
	32 integer and 32 FP registers made them wish for more.

-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
DDD:    415-390-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

Index Home About Blog