From: (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Why are integer and FP registers separated?
Date: 31 Jul 1995 20:26:01 GMT

In article <hbaker-3107950940090001@>,
(Henry Baker) writes:

|> My understanding is that most uP architects think that most integer
|> arithmetic is used for indexing stuff, and that any _real_ arithmetic
|> is done in fp.  By keeping the integer (indexing) and fp stuff separate,
|> they can run them in parallel and speed up the inner loop of certain
|> trivial Fortran-like programs (but potentially slowing everything else
|> down).
|> IMHO this attitude is wrong, but I also think that it is
|> very widespread.

1) All generalizations are false ... but making statements about what
"most uP architects" think is especially chancy.

2) I've talked to many uP architects at one time or another.  If most think
that above, it's very hard for me to tell.  On the other hand:
	a) As Dale Morris wrote, there are serious port/bandwidth issues
	   that make it attractive to separate integer and FP registers,
	   and differences in latency requirements.
	   As various people have mentioned, register-specifier size is
	   also relevant.
	b) Recall that computing did not start with uPs, and there is
	   a long history of machines with separate integer and FP registers,
	   like S/360s for example.
	c) Of the system-uP architectures {X86, 68K, MIPS, SPARC, PA-RISC,
	   POWER, Alpha}, the early implementation of all but one (Alpha)
	   was done at a time when die space limits forced implementation
	   of FP as at least one separate die, which might even be optional.
	   Even if one thought that it was a good idea to combine the register
	   files, this implementation issue would make you think....
	   Note that even some later high-performance implementations of
	   MIPS & POWER have used multi-chip sets, with a separate FPU.
3) But in any case:
	a) The register-specifier limits make 2 sets attractive.
	b) The comments about "trivial FORTRAN" above seem inappropriate,
	   in practice:
		1) Having gone from 16 FP registers to 32 FP registers ...
		   and sometimes seeing programs that would really like 64,
		   programs that matter to customers, makes this seem nontrivial.
		2) Modern compilers are capable of generating serious 
		   FP register pressure, much more so than 10 years ago.
	c) Most uP architects I've ever talked to know that:
		1) General-purpose system architectures need to do integer
		   processing well ... since things like OS's and DBMS are
		   almost entirely integer.
		2) If FP-performance is a goal as well, then it is not only
		  good to have more registers accessible, but separate FP
		  registers permit a usefully wider variety of implementation
		  choices. Even in single-chip designs, register naming,
		  ports and bandwidth all argue for separate registers.

-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
DDD:    415-390-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Why are integer and FP registers separated?
Date: 8 Aug 1995 18:56:56 GMT

In article <hbaker-0508951047010001@>, (Henry Baker) writes:

|> In this day of multiple integer units and multiple issue instructions, why
|> do we still have specialized fp units in 'gp' machines?
Well, because that's what you do if you want to be competitive in the class
of CPUs for which high FP is important.

You can design CPUs:
1) With no hardware FP at all.
2) With a lot of resources shared between integer & FP.
3) With separate FPUs, but whose subunits (like * + /) share resources to
save die space.
4) With separate FPUs, whose subunits don't share resources (much), so they have
fewer scheduling conflicts.
5) With multiple FP units, although not all types might be replicated,
i.e., you might have 2 multipliers, 2 adders, but only one divide.
6) With multiple complete FP units.

Within each FPU:
a) You could have exactly one FP operation in progress at once, i.e.,
the repeat is equal to the latency.
b) >1 subunit can be in progress at once, but each has repeat ~= latency.
c) There is some degree of pipelining, so that more than one operation can be
in progress at once, ending with repeat=1, i.e., start a new op every cycle.

Finally, for each operation:
A) The latency can be long.
B) The latency can be short.

Within each group: 1..6, a..c, A..B, the choices are sorted into nondecreasing
order of die space and performance. 
It is well known that you simply don't do choice 1 if you care about FP.
In a chip with a 64-bit integer datapath, choice 2 is a possibility for chips
that need to be low-cost, although this combination has only appeared once
that I know of, i.e., in R4200 and derivatives.
Many earlier chips used choice 3 for die-space constraints. In the MIPS
family:
1: R3000 (by itself), or embedded versions like IDT R3041, etc.
2: R4200
3: R3010, R4000's on-chip FPU
4: R10000
6: R8000

If you are into linear algebra or vectorizable codes in general, you may
like 6 or c, and accept A, i.e., either you have multiple FPUs, and/or
heavy pipelining, but can get away with a little longer latency if that's
the price, since there is a very high level of micro-parallelism available.
If you have long chains of dependent FP operations, with little micro-level
parallelism, you mostly care about latency, i.e., B, even if you have to
sacrifice repeat rate and subunit parallelism. 
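The latency/repeat distinction can be put into a back-of-the-envelope cycle
model. This is only a sketch with illustrative numbers, not the timings of any
particular FPU:

```python
def cycles(n_ops, latency, repeat, dependent):
    """Cycles to complete n_ops FP operations on one unit.

    dependent=True  : each op consumes the previous result, so it
                      cannot start until the full latency elapses.
    dependent=False : ops are independent; only the repeat
                      (initiation) interval limits issue.
    """
    if n_ops == 0:
        return 0
    if dependent:
        return n_ops * latency              # serial chain
    return latency + (n_ops - 1) * repeat   # pipelined overlap

# Vectorizable code (independent ops) barely notices latency once
# the pipe is full: choices c + A work well.
print(cycles(100, latency=4, repeat=1, dependent=False))  # 103
# A dependent chain pays the full latency every op, so B wins even
# with no pipelining at all:
print(cycles(100, latency=4, repeat=4, dependent=True))   # 400
print(cycles(100, latency=2, repeat=2, dependent=True))   # 200
```

With 100 independent ops, a latency of 4 costs only 3 extra cycles total; with
100 dependent ops, halving the latency halves the run time.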

Cheap FP = 2aA
Fast & versatile, and expensive FP = 6cB

|> Even IEEE zealots agree that only end-to-end performance is the issue,
|> not intermediate results.  For example, there is no need during the
|> inner Newton iterations for division/sqrt to preserve IEEE compliance,
|> so long as the entire calculation produces compliance.  Ditto for the
|> transcendentals, etc.

So far, the evidence seems to be:
a) You can afford some hardware for divide, sqrt, reciprocals, etc.
b) Transcendentals seem usually to go off into microcode, and low-latency
FPUs have seemed reasonably competitive synthesizing them anyway.
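As a sketch of the end-to-end point: a Newton iteration for the reciprocal
tolerates a sloppy seed and non-IEEE-rounded intermediates, yet the converged
result is accurate. The seed construction below is illustrative, not how any
real hardware derives it:

```python
import math

def recip_newton(d, iters=5):
    """Approximate 1/d by the Newton iteration x <- x * (2 - d*x).

    The seed and intermediate x values are deliberately crude; only
    the converged final result needs to be accurate.  Convergence is
    quadratic: the relative error squares on every iteration."""
    assert d > 0
    m, e = math.frexp(d)     # d = m * 2**e, with 0.5 <= m < 1
    x = 2.0 ** (-e)          # crude seed: relative error below 0.5
    for _ in range(iters):
        x = x * (2.0 - d * x)
    return x

print(recip_newton(3.0))     # roughly 1/3, accurate to double precision
```

Starting from a seed with relative error 0.25 (for d=3), five squarings push
the error far below double-precision rounding, so the sloppy intermediates
never show in the answer.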

But back to the original question:

At the ISA level, you will tend to have actual instructions for operations
when they need to act on a separate set of datatypes, such that
synthesizing those operations from sequences of existing instructions costs many
more cycles than reasonable hardware implementations.  If a hardware or
microcode implementation would use as many cycles, then about the only
reason for doing it in hardware or microcode would be instruction density,
so you would only do it for instructions generated in-line.

	1) Everybody has FP instructions if they care about FP at all;
	   FP simulated via integer is uncompetitively slow.  As shown above,
	   you have a huge range of choices for how much hardware you
	   want in the actual implementation.
	2) RISCs usually don't have hardware for sin, cos, log, etc., because
	   you can do a pretty good job synthesizing them, and because they
	   normally are packaged as out-of-line libraries, so there is little
	   code-density impact.
	3) RISCs generally don't have CALL instructions of the VAX ilk,
	   although one might argue for load-multiple+store-multiple as
	   instruction-density improvements.
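As an illustration of 2): a transcendental can be synthesized from a handful of
FP multiplies and adds. This toy version uses a short Taylor polynomial on a
reduced range; a real libm would use a minimax polynomial and careful argument
reduction:

```python
import math

def sin_poly(x):
    """sin(x) for |x| <= pi/2 via Horner's rule on
    x - x^3/3! + x^5/5! - x^7/7! -- just a few multiply-adds,
    no special hardware required."""
    x2 = x * x
    return x * (1.0 + x2 * (-1.0/6 + x2 * (1.0/120 + x2 * (-1.0/5040))))

# Truncation error on this range is bounded by x^9/9!, i.e. a few
# parts in 10^4 at worst, far better near zero.
print(abs(sin_poly(0.5) - math.sin(0.5)) < 1e-6)  # True
```

Out-of-line, this is a perfectly competitive library routine, which is why
RISCs rarely spend opcodes or hardware on it.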

Anyway, there are separate FP units to be competitive.

-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
DDD:    415-390-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Why are integer and FP registers separated?
Date: 8 Aug 1995 19:05:20 GMT

In article <405mvi$>, (Sanjay M. Krishnamurthy) writes:

|> I feel Henry Baker is absolutely right in that if the compiler has done
|> a good job at strength reduction and loop invariant code motion
|> (esp. fp constants and fp loop invariants), you end up with little
|> or no addressing overhead in fp-intensive loops. In my defintely biased
|> view, the Mot compilers for PowerPC fall in this category. What with
|> update forms of instructions and branch-on-ctr instructions, there
|> is little overlap between the fp and int units in fp intensive code.

This may be a problem in terminology:
	a) On most CPUs, any time you use an address, you are accessing
	   integer registers.  In some machines, you may have separate
	   index or address registers ... which are still a kind of
	   integer registers, usually with a subset of integer operations.
	b) When you use update addressing, you are doing an integer register
	   read, using an integer incrementer, and putting the result back in
	   the integer register file.
	c) Most CPU architectures don't have special separate counter
	   registers, but even when you do, they are integer registers
	   with special integer decrementer/incrementers.
To restate the last sentence in the posting above:
"There is little use of explicit integer instructions in FP-intense code.
There is substantial and continued use of the integer resources."
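A resource tally for a hypothetical daxpy-style inner loop makes the
restatement concrete. The instruction mix here is assumed, not measured:
update-form loads/stores and a hardware count register, as in the PowerPC
example above:

```python
# Per-iteration resource use of  y[i] += a * x[i]  compiled with
# update-form loads/stores and a branch-on-count instruction.
# Even with zero explicit integer *instructions*, the integer
# incrementers and register file still fire every iteration.
per_iter = {
    "fp ops":           2,  # multiply + add (a fused fmadd counts as both)
    "explicit int ops": 0,  # no separate add/sub/compare instructions
    "implicit int ops": 3,  # x-pointer update, y-pointer update,
                            # count-register decrement
}

def integer_unit_events(iterations):
    """Total integer-resource events over the whole loop."""
    return iterations * (per_iter["explicit int ops"]
                         + per_iter["implicit int ops"])

print(integer_unit_events(1000))  # 3000
```

Zero integer instructions, yet three integer-register reads/writes per
iteration: the integer resources are busy even in "pure FP" code.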

-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
DDD:    415-390-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: FPU in 64-bit MACHINE
Date: 13 Jan 1998 05:50:16 GMT

In article <69et2p$dqp$>,
(Hasdi Rodzmann Hashim) writes:

|> Organization: University of Michigan ITD News Server
|> Assuming the FPU unit operates on 64-bit registers, in a 64-bit machine
|> where the integer register size is also 64-bit:
|>         Is there any reason why FPU unit still must use a separate
|> 	register set than other arithmetic unit's register set?
|> Case point: Merced. The main register set is still separate from FPU's
|> register. Is there any point to separate them now that their register
|> sizes are identical?

This was discussed extensively a year or two ago, and the answer is still
the same:
	a) With the same number of bits in an instruction format, you can
	straightforwardly double the number of architecturally visible registers.

	b) Integer programs contain no FP code (by definition).
	FP codes normally contain substantial integer instructions for
	indexing, addressing, and counting, and such actions often offer
	substantial potential overlap with the FP operations.

	c) A combined register file either requires substantially more
	read/write ports, which further adds to the pressure of
	wide-issue, multiple-unit designs on register files, or
	else extra stall cycles for register reads/writes when there are
	conflicts, thus complexifying compiler scheduling.

	d) It used to be helpful to allow a well-integrated off-chip FP
	coprocessor; this is no longer very important.

	e) This can ease layout issues, since FP units are area-intensive,
	and it is nice to be able to keep the FP registers close to them
	to keep wire lengths down, and it is nice to keep the number of
	units on a bus down for speed.
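Point a) is just encoding arithmetic; a sketch assuming a typical three-operand
format with 5-bit register specifiers:

```python
# One register specifier field in a 3-operand RISC instruction.
SPECIFIER_BITS = 5
regs_per_set = 2 ** SPECIFIER_BITS        # 32 names addressable

# Combined file: those 32 names are all the software ever sees.
combined_visible = regs_per_set
# Split files: the opcode selects the file, so the same 5-bit field
# names 32 integer registers in integer ops and 32 FP registers in
# FP ops -- 64 architecturally visible registers, with no extra
# instruction bits spent on specifiers.
split_visible = 2 * regs_per_set

print(combined_visible, split_visible)    # 32 64
```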

The price of course is:
	f) You need more registers, since they are split.
	This costs hardware, and perhaps context-save time, although
	some designs use dirty bits or trap-on-use bits to avoid
	wasted save/restores of FP registers in integer-dominated
	multi-tasking workloads.

	g) int<->fp conversion instructions sometimes suffer extra cycles.

	h) You consume some more opcodes, if only because you cannot use
	the integer load/store instructions to directly load/unload the
	FP registers.
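The dirty-bit/trap-on-use trick mentioned above can be sketched as scheduler
logic. The field names and mechanics here are hypothetical stand-ins for what
a real kernel does with an FPU-disable bit that traps on first FP use:

```python
class Task:
    """Hypothetical task state for lazy FP context switching."""
    def __init__(self, name):
        self.name = name
        self.fp_dirty = False   # has this task touched the FPU?
        self.fp_state = None    # saved FP register contents, if any

saves = 0   # count of (expensive) FP register-file saves performed

def context_switch(old, new):
    """Save FP state only if the outgoing task actually used FP.
    The FPU is left disabled for the incoming task; its first FP
    instruction would trap and restore fp_state on demand.
    (Integer registers are always saved -- not shown here.)"""
    global saves
    if old.fp_dirty:
        old.fp_state = "saved FP regs"   # stand-in for the real save
        saves += 1

editor_task = Task("editor")             # integer-only workload
solver_task = Task("solver")
solver_task.fp_dirty = True              # FP-heavy workload

context_switch(editor_task, solver_task) # no FP save needed
context_switch(solver_task, editor_task) # FP save happens
print(saves)  # 1
```

In an integer-dominated multi-tasking mix, most switches skip the FP save
entirely, which is exactly the saving a combined register file gives up.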

Well-known single-register-set machines include the VAX and Motorola 88100;
separate register sets are used by at least the following:
S/360; MC68K; Intel IA-32; PA-RISC; MIPS; SPARC; POWER+PPC; Alpha; IA64.

-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

From: (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: special registers, was TOP 10 MISTAKES
Date: 7 Aug 1998 05:14:05 GMT

In article <6qd5jq$>, (George
Herbert) writes:

|> The advantage is that you simplify some things: you only end
|> up with one register file, and you additionally get bonus
|> from being able to define in the compiler and on the fly
|> how many registers are doing FP things and how many are
|> doing INT things.  Assuming you keep the same total number
|> of registers, you get a net win on flexibility.

But, of course, you usually don't keep the same total number: quite commonly
you have N-bit register fields, and if you have separate int & FP sets,
you get twice as many registers.

|> The primary disadvantage is that your register memory has
|> to have more ports to use both int and fp simultaneously,
|> which slows down all accesses.

1) And you usually get half the registers.
2) And you get to have some longer busses, since people like to have
integer registers near the integer units, and FP registers near the
FP units.
3) And you mix together operations that are mostly natural to do in
one cycle, with those that are not, which can in some designs add to the
port issue.
4) And you cannot play the tricks that many have done to avoid
save/restoring FP registers on every context switch.

Anyway, one can make the following observation:
1) I would guess that most people who have actually designed new, real ISAs in
the last 20 years have had some-to-much familiarity with the VAX ISA,
which of course has one register set used for integer & FP.
Many microprocessor ISAs were designed on VAXen, compared to VAXen, etc.

2) Nevertheless, with the exception of the Motorola 88K, pretty much
everybody chose to use separate integer & FP registers.  This doesn't mean that
it's necessarily right, but it is a data point that people who actually had
to do it (fairly) consistently made the same choice.

-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
