From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: sci.math.numanalysis,comp.arch
Subject: Re: [++] Why 64-bit games? (Was: Re: Speed, and its need ...)
Date: 4 Jun 1996 16:53:17 GMT

In article <JAN.96Jun4091421@cora.neuroinformatik.ruhrunibochum.de>,
jan@neuroinformatik.ruhrunibochum.de (Jan Vorbrueggen) writes:
> In article <4p0cd1$es6@murrow.corp.sgi.com> mash@mash.engr.sgi.com (John
> R. Mashey) writes:
>
>    If one took a MIPS-like design and added a carry bit, you'd get 3
>    instructions: add-high-pieces, add-low-pieces,
>    add-from-carry-to-high-result
>
> No, you get two instructions: add-lower-parts, add-higher-parts-with-carry -
> i.e., you have the with and without variants of the integer add/subtract
> instructions instead of a separate add-carry.

No, you get 3 instructions: that's why I said a MIPS-like design, which
uses 2-input adders; while there may be other useful reasons for doing
3-input adders, most designers wouldn't do this just to save 1 clock for
multiprecision adds...

Note, of course, that out-of-order register-renamed designs may get even
more complexified by the existence of things like carry bits or other
status flags. For instance, this is relatively straightforward on MIPS,
since each integer operation creates one 64-bit result (with the exception
of integer mul/div, which of course caused a painful special case), which
is renamed onto a set of physical registers 2X larger than the set of
logical registers.

Now, one could widen the result register to contain carry flags, condition
codes and such; however, you have probably introduced yet another set of
dependency checks required to be done by the instruction scheduler.

Remember: in aggressive modern implementations, something like a carry bit
is *not* just one simple bit of state in a register somewhere; there may
well be a copy of it with every result, and there may be some extensive
logic to check its setting and use, if you don't want it to become a
resource that ends up serializing most instructions...

john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   mash@sgi.com
DDD:    4159333090    FAX: 4159678496
USPS:   Silicon Graphics 6L005, 2011 N. Shoreline Blvd, Mountain View, CA 940397311


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: [++] Why 64-bit games? (Was: Re: Speed, and its need ...)
Date: 5 Jun 1996 04:36:38 GMT

In article <zalman0406961727100001@198.95.245.190>,
zalman@macromedia.com (Zalman Stern) writes:
> I'm not sure what John is disagreeing with me about...

Well, I wasn't disagreeing much :-), but I was disagreeing with the thought
that MIPS was bad for multiprecision arithmetic because it lacked a carry
bit ... i.e., while one might do better with a carry bit, the sequence is
not very long to do it without, and it did offer 32x32->64 ... in practice,
with many email conversations on this topic over 10 years, nobody gave me
convincing evidence that saving a couple of cycles here made very much
difference in real codes; I actually got more requests for 64/32->32...


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch,comp.lang.misc,comp.lang.smalltalk
Subject: Re: The Architect's Trap (was Object orientation without GC is nonsense)
Date: 28 Jul 1997 07:44:41 GMT

In article <5rcfpp$ed9@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick
Maclaren) writes:
> Yes, except for one thing. I have great difficulty in believing that
> carry bit support is any harder than the support of the near-universal
> constructions like 'a(n+m)'. In both cases, you need the result of one
> operation as input to the next. If there IS a problem with carry bits,
> then it obviously has to be with the implementation of the individual
> operations and not the pipelining and multi-issue.

No, carry-bit support causes more problems.

> In particular, it is immediately obvious that implementing a carry bit for a
> N-bit addition or subtraction can be no harder than implementing a (N+1)-bit
> addition or subtraction,

This sounds plausible, but turns out not to be the case. This is an example
of the most common misapprehension that I encounter, i.e., analyzing
implementations while THINKing of really simple, non-pipelined designs
(like the early 8-bit CPUs). Many features that work fine in such designs
are famous for causing great pain in more aggressive designs; current CPUs
do *not* work that way. They are, in increasing order of aggressiveness:
    - pipelined
    - pipelined superscalar
    - pipelined, speculative out-of-order.

The fundamental problem, for the bulk of RISC architectures, is that a
particular property helps simplify implementations, and every time that
property is violated, additional complexity results. The desired property:
each operation produces at most one result, all results are the same size,
and there is a regular mechanism for dealing appropriately with data
hazards in the pipeline.

In a simple pipelined design, each result is written back to its target
register, and the result is normally made available via a bypass network,
in case it is immediately needed by the next instruction. If you take a
typical current RISC and include "add-with-carry", for instance (and this
is certainly doable), the irregularity definitely adds complexity, and the
carry bit can become a bottleneck in more aggressive implementations, as
operations that either set it or test it become serialized on it, and also
require extra hardware (comparators & muxes) in an awkward place.
(See H&P: look up bypassing & condition codes in the index.)

As the CPU becomes superscalar, the extra logic for checking for
dependencies gets worse: implementors especially hate irregularities.

As it becomes speculative & out-of-order, a typical design would require
not just a register-rename unit, but a "carry-bit-rename unit", as there is
no such thing as "The carry bit", but rather an independent set of carry
bits, with appropriate rename logic to select the correct carry bit set by
the logically most recent instruction that sets the bit.

Hence, it actually can be *much* more complex to implement N-bit add +
carry bit, compared to (N+1)-bit add...

And finally, in some (most, actually) RISC architectures, it is fairly easy
to do multiprecision add, for example:

Assume you have data in registers to do double-wide a = b + c, with inputs
in bhigh, blow, chigh, clow, etc.

Suppose you have addc: a 3-input adder (oops, that costs a little more
also) that also sets the carry bit (and I do the unsigned case):

        addc    zero, zero, zero        # clear carry bit
        addc    alow, blow, clow
        addc    ahigh, bhigh, chigh

Suppose you don't have addc. MIPS code would look like:

        addu    alow, blow, clow
        sltu    tmp, alow, clow         # set tmp = 1 if alow < clow, else 0
        addu    ahigh, bhigh, chigh     # *
        addu    ahigh, ahigh, tmp

On an out-of-order chip, the 2 sequences might well take the same time,
because the 3 addcs are serialized through the carry bit, whereas the add
marked * can execute in parallel with one of the previous instructions,
assuming there are 2 ALUs. But even if it's a few more cycles, it's not
like it's huge. Multiply is harder, and divide is harder yet, but
especially the latter is dominated by the time to do the division itself.

Anyway, for aggressive RISC CPUs, carry bits make much less sense than they
used to. Of course, as Nick suggests, it is good to have standard code for
this available, but I-caches work fine for code dominated by multiprecision
integer arithmetic.

john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com
DDD:    4159333090    FAX: 4159323090 [415=>650 August!]
USPS:   Silicon Graphics/Cray Research 6L005, 2011 N. Shoreline Blvd, Mountain View, CA 940431389
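
The four-instruction sequence without addc can be modelled in a few lines
of Python, to show why an unsigned compare is enough to recover the carry:
a wrapped sum is smaller than either operand exactly when the add
overflowed. This is only a sketch under stated assumptions. WORD_BITS = 64
is an assumed register width (any width behaves the same way), and addu,
sltu and add_double are hypothetical helper names that mimic the
instructions in the post; they are not part of any real library.

```python
WORD_BITS = 64                  # assumed register width, not taken from the post
MASK = (1 << WORD_BITS) - 1     # all-ones value of one register


def addu(x, y):
    """Wrapping unsigned add: keep only the low WORD_BITS bits of x + y."""
    return (x + y) & MASK


def sltu(x, y):
    """Unsigned set-on-less-than: 1 if x < y, else 0."""
    return 1 if x < y else 0


def add_double(bhigh, blow, chigh, clow):
    """Double-wide unsigned a = b + c with no carry flag.

    Mirrors the addu/sltu/addu/addu sequence; returns (ahigh, alow).
    A carry out of the high word is dropped, as in the original sequence.
    """
    alow = addu(blow, clow)
    tmp = sltu(alow, clow)        # low add wrapped if and only if alow < clow
    ahigh = addu(bhigh, chigh)    # the add marked *: does not depend on tmp
    ahigh = addu(ahigh, tmp)      # fold the recovered carry into the high word
    return ahigh, alow


def reference_add(bhigh, blow, chigh, clow):
    """Same result computed with Python's arbitrary-precision integers."""
    b = (bhigh << WORD_BITS) | blow
    c = (chigh << WORD_BITS) | clow
    total = (b + c) & ((1 << (2 * WORD_BITS)) - 1)
    return total >> WORD_BITS, total & MASK
```

For example, add_double(0, MASK, 0, 1) gives (1, 0): the low words wrap to
zero and the compare supplies the carry into the high word. Note that the
third line of add_double reads none of the values produced by the first
two, which is the independence the post relies on when it says the add
marked * can run in parallel on a second ALU.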
