From: (John R. Mashey)
Newsgroups: sci.math.num-analysis,comp.arch
Subject: Re: [++] Why 64-bit games? (Was: Re: Speed, and its need ...)
Date: 4 Jun 1996 16:53:17 GMT

In article <>, (Jan Vorbrueggen) writes:

|> In article <4p0cd1$> (John
|> R. Mashey) writes:
|>    If one took a MIPS-like design and added a carry bit, you'd get 3
|>    instructions: add-high-pieces, add-low-pieces,
|>    add-fromcarry-to-high-result
|> No, you get two instructions: add-lower-parts, add-higher-parts-with-carry -
|> i.e., you have the with and without variants of the integer add/subtract
|> instructions instead of a separate add-carry.

No, you get 3 instructions: that's why I said a MIPS-like design, which
uses 2-input adders; while there may be other useful reasons for doing 3-input
adders, most designers wouldn't do this just to save 1 clock for
multiprecision adds... 

Note, of course, that out-of-order-register-renamed designs may get
even more complexified by the existence of things like carry bits or
other status flags.  For instance, this is relatively straightforward on
MIPS, since each integer operation creates 1 64-bit result (with the exception
of integer mul/div, which of course caused a painful special case), which is
renamed onto a set of physical registers 2X larger than the set of logical
registers.  Now, one could widen the result register to contain carry flags,
condition codes and such; however, you have probably introduced yet
another set of dependency checks required to be done by the instruction
scheduler.   Remember: in aggressive modern implementations, something
like a carry bit is *not* just one simple bit of state in a register somewhere;
there may well be a copy of it with every result, and there may be some
extensive logic to check its setting and use, if you don't want it to
become a resource that ends up serializing most instructions...

-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: [++] Why 64-bit games? (Was: Re: Speed, and its need ...)
Date: 5 Jun 1996 04:36:38 GMT

In article <zalman-0406961727100001@>, (Zalman Stern) writes:

|> I'm not sure what John is disagreeing with me about...

Well, I wasn't disagreeing much :-), but I was disagreeing with the
thought that MIPS was bad for multiprecision arithmetic because it
lacked a carry bit ...  i.e., while one might do better with a carry
bit, the sequence is not very long to do it without, and it did offer
32x32->64 ... in practice, with many email conversations on this topic
over 10 years, nobody gave me convincing evidence that saving a couple
cycles here made very much difference in real codes; I actually got
more requests for 64/32->32...

-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: (John R. Mashey)
Newsgroups: comp.arch,comp.lang.misc,comp.lang.smalltalk
Subject: Re: The Architect's Trap (was Object orientation without GC is 
Date: 28 Jul 1997 07:44:41 GMT

In article <5rcfpp$>, (Nick
Maclaren) writes:

|> Yes, except for one thing. I have great difficulty in believing that
|> carry bit support is any harder than the support of the near-universal
|> constructions like 'a(n+m)'. In both cases, you need the result of one
|> operation as input to the next. If there IS a problem with carry bits,
|> then it obviously has to be with the implementation of the individual
|> operations and not the pipelining and multi-issue.

No, carry-bit support causes more problems.

|> In particular, it is immediately obvious that implementing a carry bit for a
|> N-bit addition or subtraction can be no harder than implementing a (N+1)-bit
|> addition or subtraction,

This sounds plausible, but turns out not to be the case.

This is an example of the most common misapprehension that
I encounter, i.e., analyzing implementations while THINKing of
really simple, non-pipelined designs (like the early 8-bit CPUs).
Many features that work fine in such designs are famous for causing
great pain in more aggressive designs; current CPUs do *not* work that
way, but are instead:
	- pipelined
	- pipelined superscalar
	- pipelined, speculative, out-of-order.

The fundamental problem, for the bulk of RISC architectures, is that
a particular property helps simplify implementations, and every time
that property is violated, additional complexity results.

The desired property: each operation produces at most one result,
all results are the same size, and there is a regular mechanism
for dealing appropriately with data hazards in the pipeline.

In a simple pipelined design, each result is written back to its target
register, and the result is normally made available via a bypass network,
in case it is immediately needed by the next instruction.

If you take a typical current RISC and include "add-with-carry", for instance
(and this is certainly doable), the irregularity definitely adds complexity,
and the carry bit can become a bottleneck in more aggressive implementations,
as operations that either set it or test it become serialized on it,
and also require extra hardware (comparators & muxes) in an awkward place.
(See H&P: look up bypassing & condition codes in the index.)

As the CPU becomes superscalar, the extra logic for checking for
dependencies gets worse: implementors especially hate irregularities.

As it becomes speculative & out-of-order, a typical design would require
not just a register-rename unit, but a "carry-bit-rename-unit", as
there is no such thing as "The carry bit", but rather an independent set
of carry bits, with appropriate rename logic to select the correct
carry bit set by the logically most recent instruction that sets the bit.

Hence, it actually can be *much* more complex to implement an N-bit add +
carry bit, compared to an (N+1)-bit add...

And finally, in some (most, actually) RISC architectures, it is fairly
easy to do multiprecision add, for example:

Assume you have data in registers to do double-wide a = b + c,
with inputs in bhigh,blow, chigh, clow, etc.

Suppose you have addc, a 3-input add (oops, that costs a little more also)
that also sets the carry bit (and I do the unsigned case):

	addc	zero,zero,zero		# clear carry bit
	addc	alow, blow,clow
	addc	ahigh,bhigh,chigh

Suppose you don't have addc.  MIPS code would look like:
	addu	alow, blow, clow
	sltu	tmp, alow, clow		# set tmp = 1 if alow < clow, else 0
	addu	ahigh, bhigh, chigh	# *
	addu	ahigh, ahigh, tmp

On an out-of-order chip, the 2 sequences might well take the same time,
because the 3 addcs are serialized through the carry bit, whereas the
add marked * can execute in parallel with one of the previous instructions,
assuming there are 2 ALUs.  But even if it's a few more cycles, it's not like
it's huge.
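The carry-free sequence translates directly into portable C. The following is my sketch, not MIPS output: the function name `add2w` is hypothetical, and the variable names just follow the bhigh/blow convention of the text.

```c
#include <stdint.h>

/* Double-wide unsigned a = b + c without a carry bit, using the same
 * addu/sltu trick as the MIPS sequence above.  add2w is a hypothetical
 * name for illustration, not part of any ABI or ISA. */
void add2w(uint64_t bhigh, uint64_t blow,
           uint64_t chigh, uint64_t clow,
           uint64_t *ahigh, uint64_t *alow)
{
    uint64_t low  = blow + clow;        /* addu  alow, blow, clow       */
    uint64_t tmp  = low < clow;         /* sltu  tmp, alow, clow        */
    uint64_t high = bhigh + chigh;      /* addu  ahigh, bhigh, chigh  * */
    *ahigh = high + tmp;                /* addu  ahigh, ahigh, tmp      */
    *alow  = low;
}
```

Note that the `high` add depends on neither `low` nor `tmp`, which is exactly why the instruction marked * can issue in parallel on a 2-ALU machine.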

Multiply is harder, and divide is harder yet, but especially
the latter is dominated by the time to do the division itself.
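For multiply, a narrow-by-narrow-to-wide primitive like the 32x32->64 mentioned earlier is enough to build the double-wide product from four partial products. A sketch under that assumption (the name `mul2w` and the 64x64->128-from-32-bit-halves framing are mine, chosen so plain C's 64-bit arithmetic can stand in for the primitive):

```c
#include <stdint.h>

/* 64x64 -> 128-bit unsigned multiply built only from 32x32 -> 64
 * multiplies.  mul2w is a hypothetical name for illustration. */
void mul2w(uint64_t b, uint64_t c, uint64_t *high, uint64_t *low)
{
    uint64_t bl = (uint32_t)b, bh = b >> 32;
    uint64_t cl = (uint32_t)c, ch = c >> 32;

    uint64_t ll = bl * cl;      /* four partial products, each 32x32->64 */
    uint64_t lh = bl * ch;
    uint64_t hl = bh * cl;
    uint64_t hh = bh * ch;

    /* Sum the middle column; cross cannot overflow 64 bits, since it is
     * at most 0xFFFFFFFE + 0xFFFFFFFF + 0xFFFFFFFF. */
    uint64_t cross = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);

    *low  = (cross << 32) | (ll & 0xFFFFFFFFu);
    *high = hh + (lh >> 32) + (hl >> 32) + (cross >> 32);
}
```

The four partial products are independent and the carry propagation is confined to two adds, which is part of why the multiply case, though longer, still pipelines reasonably well.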

Anyway, on aggressive RISC CPUs, carry bits make much less sense than they
used to.  Of course, as Nick suggests, it is good to have standard
code for this available, but I-caches work fine for code dominated
by multiprecision integer arithmetic.

-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  DDD: 415-933-3090 FAX: 415-932-3090 [415=>650 August!]
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
