From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Ridiculous
Date: 16 Sep 2000 20:01:25 GMT

In article <G0zrJG.JI8@ecf.utoronto.ca>, ben@eecg.toronto.edu (Benjamin Gamsa) writes:
|>
|> In article <8pv6mp$vg6$1@nnrp1.deja.com>, Nova  <nova@pacific.net.sg> wrote:
|> >Good point. So, 21364 EV7 should then support more than just simple two-
|> >CPU lockstep; maybe three-CPU lockstep with 'majority wins' rule?
|>
|> I believe it's more common to have a pair of two-cpu lockstep chips.
|> If one of the pair disagree, they are both taken out of service, and
|> the other pair takes over.  It is also probably easier this way since
|> each lock-step pair can be tightly coupled for efficient lock-stepping
|> and placed on separate boards from the other pair, allowing easier
|> field replacement of one of the pair.

TMR (with voting) is relatively rare in the computer business, although
Tandem, in Austin, used MIPS CPUs to build Integrity S3 that way.

Stratus (with various CPUs), and later MIPS-based Tandems, used the
"pair of checker-pairs" designs.  This stuff is hard enough to get right that
I wouldn't think of doing TMR among multiple 1-chip CPUs.   For most micro
designs, it is perfectly plausible to build checker-pairs with relatively
minimal extra hardware, unlike TMR, which has drastic hardware consequences
in requiring many more external bus pins.
Consider:
	Start with CPUs that can share a multidrop bus to an external
	agent, i.e., most of us build micros that can be used 2-4 per shared bus.
	Then, you can build
	(a) Master-Listener (also called Master-Checker) pairs, where the
	Master drives the bus, and the Listener watches and compares,
	but never drives the external bus, except for a Fault signal.

	(b) Cross-Coupled Checker Pairs: one drives the regular data bits,
	and the other drives the ECC/parity bits; each one watches and
	compares the bits it isn't driving, and either can raise a Fault
	signal.
These features have been around in some micros for a long time,
at least since 1993 systems (with MIPS R4400s), and maybe in others'.
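
As a rough illustration of (a) and (b), here is a toy Python model. It is
only a sketch under stated assumptions, not how any real chip does it:
the Cpu class, its flip_mask fault injection, the made-up compute function,
and the single parity bit standing in for real ECC are all hypothetical.

```python
from dataclasses import dataclass


def parity(word: int) -> int:
    """Single even-parity bit over a word (a stand-in for real ECC bits)."""
    return bin(word).count("1") & 1


@dataclass
class Cpu:
    """Toy CPU whose bus output for a cycle is a fixed function of its input.

    flip_mask is a hypothetical fault injector: nonzero bits corrupt the
    word this CPU computes.
    """
    flip_mask: int = 0

    def compute(self, x: int) -> int:
        return ((x * 3 + 1) & 0xFFFFFFFF) ^ self.flip_mask


def master_listener_cycle(master: Cpu, listener: Cpu, x: int):
    """(a) The master drives the bus; the listener only compares.

    The listener's sole output is the Fault flag.
    """
    bus_data = master.compute(x)
    fault = listener.compute(x) != bus_data
    return bus_data, fault


def cross_coupled_cycle(data_cpu: Cpu, check_cpu: Cpu, x: int):
    """(b) One CPU drives the data bits, the other drives the check bits.

    Each CPU compares the half of the bus it is not driving, and either
    one can raise Fault.
    """
    bus_data = data_cpu.compute(x)
    bus_check = parity(check_cpu.compute(x))
    fault_from_data_cpu = parity(data_cpu.compute(x)) != bus_check
    fault_from_check_cpu = check_cpu.compute(x) != bus_data
    return bus_data, bus_check, fault_from_data_cpu or fault_from_check_cpu
```

With two healthy CPUs both functions report no fault. Setting flip_mask on
either CPU of a pair raises Fault in the same cycle, which is the point of
the arrangement: the pair needs only the shared bus plus a Fault line.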

As an amusing historical note:
(1) Long ago, Gardner Hendrie and other Stratus folks were considering
early MIPS CPUs for potential fault-tolerant systems, and they pointed out
a couple of issues, similar to things they'd encountered on MC 68Ks, that
just drove them nuts.  Recall that they built external comparators to
keep checking the outputs of chips for being identical.  Sometimes chips
have "don't care" or "undefined" or "reserved" bits that don't mean anything
to the CPU ... but if you have an interrupt, for example, and save control
registers holding such stuff, you can end up with 2 CPUs storing a few bits
differently ... and this is a total nightmare for external comparators.

(2) Anyway, we fixed all of this, and although Stratus didn't end up picking
MIPS chips ... it turned out that the fixes were *very* useful for Tandem
later on.  I mentioned this to Gardner recently, and he at least was amused.
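
To make the problem in (1) concrete, here is a minimal Python sketch of a
comparator that ignores "don't care" bits. The mask value and the register
layout are invented for illustration; nothing here is taken from the
Stratus, 68K, or MIPS designs.

```python
FULL_MASK = 0xFFFFFFFF

# Hypothetical layout: only the low 16 bits of a saved status word are
# architecturally defined; the upper 16 are "reserved" and may differ.
STATUS_CARE_MASK = 0x0000FFFF


def miscompare(word_a: int, word_b: int, care_mask: int = FULL_MASK) -> bool:
    """Return True if the two words differ in any bit selected by care_mask."""
    return ((word_a ^ word_b) & care_mask) != 0


# Two lockstepped CPUs save a status register on an interrupt. They agree
# on every defined bit but differ in one reserved bit.
saved_by_cpu_a = 0x80001234
saved_by_cpu_b = 0x00001234

# A naive comparator flags a fault although nothing is actually wrong.
spurious = miscompare(saved_by_cpu_a, saved_by_cpu_b)

# A masked comparator ignores the reserved bits.
masked = miscompare(saved_by_cpu_a, saved_by_cpu_b, STATUS_CARE_MASK)
```

The catch is that an external comparator watching bus pins has to know
which stores carry such registers before it can apply the mask. Making the
chip drive those bits deterministically removes the need for the mask.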


--
-John Mashey EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-933-2663
USPS:   SGI 1600 Amphitheatre Pkwy., ms. 562, Mountain View, CA 94043-1351
SGI employee 25% time, non-conflicting,local, consulting elsewise.


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Ridiculous
Date: 18 Sep 2000 18:10:39 GMT

In article <pw-1809001228390001@166.84.250.180>, pw@panix.com (Paul Wallich) writes:

|> Even if the bits are "don't care" shouldn't they be deterministic, and hence
|> identical under lockstep conditions? (At least for the same mask rev of the
|> CPU) Or did the bit value float according to random electrical and process-
|> variation stuff?
	Yes, apparently.  Whatever it was, Stratus had needed to do some
	fancy things in their 68K comparator setups to avoid spurious
	mis-compares.

	Maybe someone who knows the actual details can post, since this
	is long past.

	Testing can certainly be inhibited by such behavior, so
	if one can do it without awful performance or cost penalties,
	making everything deterministic can avoid awful later surprises.

