Newsgroups: comp.arch
From: mash@mash.wpd.sgi.com (John R. Mashey)
Subject: Re: Why isn't hardware designed to run software? [A: do it in s/w]
Date: Wed, 29 Dec 1993 22:42:21 GMT
In article <rfgCIqGAu.AGv@netcom.com>, rfg@netcom.com
(Ronald F. Guilmette) writes:
> (Note that I have NOT mentioned Cray. One poster here mentioned how much
> money would be involved in providing 12.5% more memory on a Cray. My only
> response is ``Frankly my dear, I don't give a damn.'' I am not likely to
> own one of *those* beasties in the near future anyway, so the whole subject
> seems about as relevant as the fact that I would suffocate if I were suddenly
> transported to the surface of the moon. That isn't bloody likely to happen
> anytime soon either.)
>
> I think the bottom line is that the kind of feature I'm speaking of is not
> really all that different from the watchpoint registers available on x86
> machines. These registers provide a fast, hardware-based solution for a
> problem that could (in theory) be done in software. But as anyone who
> has ever had need of these watchpoint registers can testify, the hardware
> solution definitely has its advantages... and can be had for little cost,
> either in terms of $$$'s or in terms of effects upon overall performance.
>
> In short, as I said originally, I think it's well past time for hardware
> designers to start taking the real-life problems of software developers
> a bit more seriously. You hardware guys may not get any more dhrystones
> or SPECmarks out of the deal, but you may win some additional converts
> among us software people all the same. (And please don't forget who it
> is that *really* makes your systems attractive to end-users, ok?)
1) This request arises again and again, but *really* one can do a lot
of it [uninitialized variable-checking & subscript-range-checking,
or the related ADA constraint-checks] in software. [I looked at this
20 years ago in detail and even then, one could do it]. WATFIV often
found bugs in "production" code, to people's great consternation. :-)
2) For some reason, people seem to assume that hardware people are unreceptive
to such things for no good reason:
a) We (MIPS folks, H/W + S/W) have looked into debugging/checking
tag features for almost every chip we've ever done, and we've *never*
been able to figure out features that were truly useful that didn't
have severe performance hits/implementation problems of one sort
or another, despite having among the design teams software people
that are passionate advocates of such things. It always came down
to this: the hardware folks were perfectly willing to
i) tell us (sw folks) the cost of an implementation, and
ii) build *what* we wanted, if we could tell them what that was
and show that we were sure it was worth the price;
iii) but we (sw folks) were never able to design something that we
thought was really good enough to be worth it.
iv) Of course, this might just have been lack of imagination
on our part, but it might also be because it's hard.
It is possible that highly-super-scalar-issue chips may
do better, and we'll look at this again for post-T5 chips.
v) Personally, I'm still convinced there is *some*
truly general mechanism that survives high-performance
implementations and helps error-checking+tagging,
but that we still haven't figured out the right thing,
despite conversations with truly world-class folks.
Having seen past attempts in hardware turn out not to be
quite right, we have to be *sure* we've got
something useful.
b) People have used bad-parity bits for such purposes, years ago ...
but not in any modern designs that I know of, and especially not
in byte-addressed or cached machines. In particular,
the load-from-cache path is *often* in the critical path,
and people are often loath to do anything else with it,
although the cost varies from design to design.
c) While the cost of CRAY memory may not be relevant to most people,
DRAM still costs money, but even worse, there is a chicken and
egg problem due to the cost advantages of standard SIMMs, i.e.,
no one would want to use an extra bit/byte, until there are SIMMs
that size, and no one will make them until there is a market :-)
d) Things like watch registers are *really* useful ... but sometimes,
software requirements make them less useful than you'd like,
i.e., if a debugger lets you set an arbitrary number of
watch points, it's hard to do that in hardware :-)
3) In any case, the *ideal* thing is a compiler option that generates
the checks ... but optimizes them well; such "constraint-check elimination"
is found in some ADA compilers, for example.
In modern optimizers, there is plenty of information around to make many
checks disappear; it would really be nice to see *someone* work hard on
this problem, using current compilers, and see what the actual
run-time cost would be. If I had to guess, it's maybe 5-10% for many
kinds of applications, at least for the checking, assuming that there is
not too much overhead for initializing data to "uninitialized" like
WATFIV did. A general scheme for uninitialized-reference detection could be:
a) For each reference to each variable, maintain a bit that
indicates "known-to-be-defined", computed by:
i) Any assignment to that variable sets "defined".
ii) Any code in which all paths to a usage are marked defined
means that the usage is defined.
iii) Any code in which one is not sure that it is defined
requires generating a check, and then considering it defined.
b) For the entire code, if a check is ever generated for a specific
variable, then generate prolog code to initialize it to the "uninitialized"
value, else one doesn't need this code. On most machines, floating
point hardware supports NaNs or equivalent, so no code need ever
be generated to do checks, just (possibly) to do the initializations.
c) Start with minimalist optimization, then improve it until it gets
"good enough".
Let's try a few examples:
C code:
a) Assume that any function can *assume* that any values passed to it
are defined. [Why is this a good idea? because quite often the
caller knows something is defined, and they might even be constants;
why should hardware take any kind of hit whatsoever to check things
that are known to be constants].
b) Any variable initialized in its definition need *never* be checked;
const variables (I think) need not be checked.
c) In a lot of cases, the same optimizations that do value-propagation
or code-motion can help you eliminate checks:
func1(x) {
        int y;
        if (x)
                y = x+1;
        else
                y = -1;
        if (y > 5)
                ....
Needs no undefined-variable checks whatsoever; a good optimizer likely
knows that y has been set for sure.
func2(xp) int *xp;
{
        if (*xp)
                *xp = *xp + 1;
        ...
That needs *one* check, no matter how many times *xp is
referenced, assuming xp doesn't change [and compilers already
look hard at pointers]. I.e., throughout the entire code body
of func2, the compiler *knows* that the item referenced by *xp
was checked on the first access, and either it faulted then, or
not, but it doesn't need to be checked again.
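A sketch of the transformed func2 (hypothetical: integers have no NaN,
so UNINIT_INT is a made-up designated "uninitialized" bit pattern, and
uninit_trap() is the same made-up routine as before):

#define UNINIT_INT ((int)0xDEADBEEF)    /* assumed bit pattern */

extern void uninit_trap(void);          /* hypothetical, as above */

void
func2(int *xp)
{
        /* one generated check; it dominates every later use
           of *xp, so none of them needs another */
        if (*xp == UNINIT_INT)
                uninit_trap();

        if (*xp)
                *xp = *xp + 1;
        /* ... further references to *xp, all check-free ... */
}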
func3(xp, yp, zp) int *xp, *yp, *zp;
{
        int i;
        for (i = 0; i < 100; i++)
                *zp++ = *xp + *yp;
        ...
This ought to generate 2 checks:
a) An optimizer would like to pull the expression *xp + *yp out of
the loop, but couldn't, due to possible aliasing of zp.
b) But, there is always at least one *reference* to each of
*xp and *yp, and the addresses don't change, so the optimizer could
move the *checks* in front of the loop...
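Sketched the same way (same made-up UNINIT_INT convention), the
optimized func3 might come out as:

#define UNINIT_INT ((int)0xDEADBEEF)    /* as in the previous sketch */

extern void uninit_trap(void);

void
func3(int *xp, int *yp, int *zp)
{
        int i;

        /* the addresses never change and each is referenced at
           least once, so both checks hoist ahead of the loop */
        if (*xp == UNINIT_INT || *yp == UNINIT_INT)
                uninit_trap();

        for (i = 0; i < 100; i++)
                *zp++ = *xp + *yp;      /* the add itself stays in
                                           the loop: zp may alias
                                           xp or yp */
}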
Compared to C, FORTRAN is probably easier: plusses and minuses of
each might be:
        C                                       FORTRAN
        - Pointers everywhere                   + No pointers
        + BSS initialized to zero,              - Memory not guaranteed
          by definition                           to be initialized
        - Pointers not guaranteed               + Array parameters not
          to non-alias                            aliased
        - Loop terminations can be complex      + Loops (in practice) simple
        + const
        - Moderate use of 8- & 16-bit data;     + Little use, i.e., if values
          these may be harder to check            are 32- and 64-bit, it's
                                                  easier to find "uninit"
                                                  values
                                                + Heavy use of floating point:
                                                  most chips have NaNs
        + Call-by-value                         - Call-by-reference
Anyway, given the kinds of analysis that current good optimizers do,
I don't think this sort of analysis is all that bad to do; it would be nice
to see some research on this [maybe it's around, I haven't been following
it], but I do believe that it's possible to generate quite reasonable code,
and there is *nothing* about this technique that is very hardware-dependent
(except having natural NaNs or not).
4) Subscript-out-of-range detection/optimization is more work;
maybe somebody who's done this in ADA would comment. I expect this is
easier in FORTRAN than in C, and in any case, there's no chance of doing
this very well in hardware for current machines and languages.
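Still, the shape of the problem is similar. A minimal sketch
(hypothetical: range_trap() is made up, and len stands for the array's
declared bound) of a naive subscript check, and the hoisted form an
optimizer would aim for:

extern void range_trap(void);

/* naive: one check per access */
int
sum_naive(int a[], int len, int n)
{
        int i, s = 0;
        for (i = 0; i < n; i++) {
                if (i >= len)   /* i >= 0 is known from the loop */
                        range_trap();
                s += a[i];
        }
        return s;
}

/* hoisted: i runs 0..n-1 monotonically, so one check of the final
   bound subsumes all the per-access checks (it merely fires earlier
   than the naive version would) */
int
sum_hoisted(int a[], int len, int n)
{
        int i, s = 0;
        if (n > len)
                range_trap();
        for (i = 0; i < n; i++)
                s += a[i];
        return s;
}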
SUMMARY:
1) There are subtle interactions between:
a) Instruction-set-architecture
b) Implementation
c) Language-design [some features cause more trouble than others]
d) Level of compiler technology.
and how much checking can be done easily.
2) There is *little chance* of getting general hardware uninitialized-variable
checking into currently-popular CPU families, and subscript-checks
are probably even worse. About the best that can be hoped for,
from hardware, are a few instructions (like TRAP on MIPS & others)
that do a check quickly, expecting to succeed, and trap otherwise,
and maybe (if someone ever can figure out the right thing), some
hardware pattern-match-and-trap feature general enough to cover
enough cases to make it worth having. There is *little chance* of
using extra bits in the memory system.
3) BUT: current optimizing compilers already keep track of much information
that could help optimize many of the checks away, and they could do it with
existing machines. Hence, the likeliest hope of achieving this (desirable)
goal is:
a) Encouraging more compiler research
b) Encouraging compiler writers to do it.
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@sgi.com
DDD: 415-390-3090 FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311