Newsgroups: comp.risks
X-issue: 11.14
Date: Tue, 19 Feb 91 23:40:50 EST
From: henry@zoo.toronto.edu
Subject: Re: Predicting system reliability

>Argument 2: A system as complex as SDI can never be evaluated in a way which
>would give reasonable grounds for claiming that it would work correctly when
>deployed.

Of course, just what constitutes "reasonable grounds" is itself something
that should be part of the specifications, and it is something that may
have to be justified.  None of the complex systems designed for fighting
nuclear wars -- including the ones whose supposed efficacy has preserved
the peace of the planet for circa 40 years -- has *ever* been evaluated
in such a way (i.e. under nuclear attack!).  The design of test criteria
for such systems all too easily becomes a second-order way of cooking up
fallacious "proofs" that the system is anywhere from trivial to impossible.

Our lives frequently depend on systems that cannot possibly be tested to the
"reasonable confidence" point before they are used... if you interpret that to
imply reasonable confidence under the worst conceivable conditions.  No
airliner is ever tested in six-sigma turbulence.  No building is tested under
once-per-century wind loads.  Space-shuttle payload limits are based on landing
weight for the "Return To Launch Site" abort mode... a procedure that has never
been tried and which some astronauts doubt is survivable.  Operating-system
kernel software is rarely stress-tested under truly severe operational loads.

(As an example of that last, one of the major RISC-processor manufacturers does
massive simulation of new designs, to the point where their machine rooms go
from fairly quiet to furiously active at the time in late evening when the
night's batch of simulation jobs fires up.  This sudden huge surge in load has,
I'm told, had a serendipitous side effect: at least once it uncovered the
existence of very short "interrupt windows" in kernel code, where erroneous
assumptions about the atomicity of operations caused system failures only if an
interrupt struck in a window about a hundred nanoseconds long.  (Specifically,
the programmers had assumed that incrementing an integer in memory was an
atomic operation, which is sort of true on single-processor CISCs but is rarely
true on multiprocessors or RISC systems.)  The code containing this botch is
now theoretically obsolete, but it was in wide production use before the
problem was discovered and is probably still in use here and there.)
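
For readers who have not met this class of bug, here is a minimal sketch
in C (hypothetical code, not the vendor's), using two threads as a
stand-in for the interrupt case.  On most RISC machines "n++" compiles to
a separate load, add, and store, so concurrent increments can overwrite
one another; a C11 atomic read-modify-write closes the window.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int plain;           /* racy counter: ++ is load/add/store */
    static atomic_int safe;     /* atomic counter: one indivisible RMW */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            plain++;                    /* window: a thread (or interrupt)
                                           striking between the load and
                                           the store loses a count */
            atomic_fetch_add(&safe, 1); /* indivisible increment */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("plain = %d (expected 2000000)\n", plain);
        printf("safe  = %d\n", atomic_load(&safe));
        return 0;
    }

On a uniprocessor CISC whose increment instruction operates directly on
memory, the plain version usually gets away with it, which is exactly the
"sort of true" that let the bug hide for so long.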

In traditional engineering, it is routine to assess worst-case behavior based
on extrapolation from less severe testing.  A demand that the worst case be
tested is often a disguised call for a system's cancellation, since such
testing is seldom feasible for large systems.  The proper question is not
whether we can safely extrapolate from less severe tests; we must rely on such
extrapolation, and we already do.  The real questions are how best to do it,
and what form of testing must be done to permit confident extrapolation.
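
As a toy illustration of such extrapolation (with invented numbers): the
classical tool is extreme-value statistics.  The sketch below fits a
Gumbel distribution to ten years of annual-maximum wind loads by the
method of moments and reads off the once-per-century level that building
codes ask about.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Invented annual-maximum wind pressures (kN/m^2), ten years. */
        double x[] = { 0.62, 0.71, 0.58, 0.80, 0.66,
                       0.75, 0.69, 0.84, 0.61, 0.73 };
        int n = sizeof x / sizeof x[0];
        double mean = 0.0, var = 0.0;

        for (int i = 0; i < n; i++) mean += x[i];
        mean /= n;
        for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
        var /= n - 1;

        /* Method-of-moments Gumbel fit:
           scale = s*sqrt(6)/pi, location = mean - 0.5772*scale. */
        double pi = 3.14159265358979323846;
        double scale = sqrt(6.0 * var) / pi;
        double loc = mean - 0.57721566 * scale;

        /* T-year return level: exceeded with probability 1/T per year. */
        for (int T = 10; T <= 1000; T *= 10)
            printf("%4d-year load: %.2f kN/m^2\n", T,
                   loc - scale * log(-log(1.0 - 1.0 / T)));
        return 0;
    }

Whether ten years of data justify a hundred-year extrapolation is, of
course, precisely the question of what testing permits confident
extrapolation.
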
                         Henry Spencer at U of Toronto Zoology   utzoo!henry
