From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: interrupting for overflow and loop termination
Date: 19 Sep 2005 12:06:52 -0700
Message-ID: <1127156812.929231.226160@f14g2000cwb.googlegroups.com>

Terje Mathisen wrote:
> glen herrmannsfeldt wrote:

> This would be a _very_ special case, since it would require a loop with
> only a single int variable, or at least you must be able to prove that
> only this int can ever overflow. In that case you can of course skip the
> overflow testing.
>
> >> If your architecture can do this, then by all means use an inline
> >> branch-and-link instead of the INTO style check!
> >
> > But how much cost is there in not being able to retire the instruction
> > until the overflow status is known?  Assuming the usual out of order,
> > but in order retire model.
>
> Not much: Processing can easily continue for at least one or two cycles,
> forwarding the preliminary result to the next user, and as long as the
> vast majority of these ints won't actually overflow, the delayed
> retirement does not correspond to any actual branch misses.
>
> > Then again, maybe all you need is a sticky overflow bit.  You could
> > do some set of calculations and test once at the end for any overflow,
> > and clear the sticky overflow bit at that time.
>
> Nice idea. If overflows are really rare, then it would be a win to redo
> the entire loop to save testing on every operation.

People have sometimes used sticky overflow bits.  To me, the hierarchy
is:
  a) Precise exceptions: you know exactly where the exception occurred.
  b) Sticky bits, you don't know exactly where, but get some bound.
  c) No sticky-bits: explicit tests required everywhere, and hence, in
many cases, the tests are omitted for speed and code size.
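The difference between b) and c) has a direct software analogue. Here is a minimal C sketch, purely illustrative, using GCC/Clang's __builtin_add_overflow (not anything from the original post):

```c
#include <assert.h>
#include <limits.h>

/* c) explicit test on every operation: you know exactly where it
 * overflowed, but you pay a branch per add */
int sum_checked(const int *a, int n, int *out) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        if (__builtin_add_overflow(s, a[i], &s))
            return -1;          /* precise: overflow detected at element i */
    }
    *out = s;
    return 0;
}

/* b) sticky-bit style: OR the overflow flag into one accumulator and
 * test once at the end; you know *that* it overflowed, not where */
int sum_sticky(const int *a, int n, int *out) {
    int s = 0, sticky = 0;
    for (int i = 0; i < n; i++)
        sticky |= __builtin_add_overflow(s, a[i], &s);
    *out = s;
    return sticky ? -1 : 0;
}
```

The sticky version keeps the loop body branch-free, which is exactly the trade Terje describes: if overflow is rare, test once at the end and redo the loop carefully only when the flag is set.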

As a software guy, I am, of course, biased in favor of a).  As a
software/hardware guy, I also realize that sometimes one may have to
compromise, but one should never be *looking* for reasons to compromise
on a) - at most, one might reluctantly admit that introducing some
imprecision is the lesser evil, grit one's teeth and bear it.

At MIPS, I (and the rest of the OS group) were among the loudest voices
in demanding precise exceptions everywhere, most of us having had to
deal with weird cases and hardware bugs and related software bugs too
many times in past lives.  In the long run, this did turn out to help
chip verification as well.

To be more specific, we always wished for:

When a user-level instruction X caused an exception, and a trap to the
OS:

a) The Exception Program Counter (EPC) was set to point at the
instruction.
b) The CAUSE register was set to indicate the reason.
c) All instructions before X had been completed (or equivalent).
d) X itself had had no user-visible side-effects, i.e., there were no
partial stores, no register writebacks, no auto-increments, no shadow
registers to be disentangled, i.e., X had no effect except to cause
the trap.

We didn't quite get that, as:
a) and b): if X was in a Branch Delay slot, the EPC pointed at X-4, and
the BD-bit was set in the CAUSE register.  Although this was sometimes
a pain, and proved to be a source of designer complaint in some later
implementations, software people viewed it as tolerable, even though
it sometimes meant messy debugger code and trickier emulation code (as
in floating point emulation done on systems lacking FPUs).
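As a sketch of what that BD-bit convention costs the kernel and debugger, here is the address arithmetic a trap handler has to do (the bit position matches the MIPS CAUSE register's BD bit, but treat the names and layout as illustrative rather than a literal MIPS header):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative: BD bit in the CAUSE register says the exception was
 * taken in a branch delay slot, so EPC points at the branch (X-4),
 * not at the faulting instruction X itself. */
#define CAUSE_BD (1u << 31)

/* The instruction that actually faulted */
uint32_t faulting_pc(uint32_t epc, uint32_t cause) {
    return (cause & CAUSE_BD) ? epc + 4 : epc;
}

/* Where execution must resume: the branch itself, so the delay-slot
 * instruction is re-executed in its proper control-flow context */
uint32_t restart_pc(uint32_t epc, uint32_t cause) {
    (void)cause;
    return epc;
}
```

The mess for debuggers and FP emulators is that "which instruction faulted" and "where to restart" are no longer the same address.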

c) This wasn't true from the hardware viewpoint, but the CPU provided
this illusion to the programmer, so that was viewed as OK.
Specifically, consider the
sequence:

Y: (long-running FP operation, or integer multiply/divide)
....
X, causing trap.

At the time of the trap, Y might still be running in an independent
functional unit.  However, this was all carefully defined so that:

Either the instruction was defined so that it could never cause an
exception (in MIPS, the divide doesn't raise a divide-by-zero; instead,
if the compiler sees variable1/variable2, and can't be sure variable2
is non-zero, it generates code like:

DIV rs,rt
BEQZ rt,divide-by-zero
....

), or else the operation stalled until it was certain it could not trap.
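The source-level analogue of that compiler-inserted check looks like the following C sketch (the trap-routine name and counter are made up for illustration; on a real system the BEQZ branch would jump to the runtime's divide-by-zero routine):

```c
#include <assert.h>

/* Stand-in for the runtime's divide-by-zero trap routine */
static int divide_by_zero_count = 0;
static void break_divide_by_zero(void) { divide_by_zero_count++; }

/* What the emitted code does: test the divisor explicitly, because the
 * hardware DIV never traps */
int safe_div(int rs, int rt) {
    if (rt == 0) {            /* the BEQZ rt,divide-by-zero above */
        break_divide_by_zero();
        return 0;
    }
    return rs / rt;           /* DIV rs,rt */
}
```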

In the floating point cases, the insistence on precise exceptions ended
up encouraging the innovative "check the exponent fields quickly and
stall the CPU if there is any chance of an exception" patent of Tom
Riordan's that I've mentioned before.

In any of these cases, a reference to a register that was to be
written by one of these asynchronous units simply caused a stall.
Interrupt code would save away the regular integer registers first, by
which time any lingering MUL/DIV or FP ops would likely have completed.

Of course, a "fast-user-trap" feature would have to modify some of the
above, and I still wish we'd had time to think hard enough about that
early enough [2Q85].

Anyway, the message of all this is:

Start with the goal of keeping the exception model simple and precise.
Only back off reluctantly.  I've posted many a time that dealing with
exceptions and their weird interactions has long been a source of
problems, as even experienced humans miss things.  Although there are
some things about MIPS that I'd do differently if I could do it over,
THIS wasn't one of them.



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: interrupting for overflow and loop termination
Date: 19 Sep 2005 23:41:44 -0700
Message-ID: <1127198504.378354.101340@o13g2000cwo.googlegroups.com>

Stephen Fuld wrote:
> "John Mashey" <old_systems_guy@yahoo.com> wrote in message
> news:1127156812.929231.226160@f14g2000cwb.googlegroups.com...
>
> snip
>
> > People have sometimes used sticky overflow bits.  To me, the hierarchy
> > is:
> >  a) Precise exceptions: you know exactly where the exception occurred.
> >  b) Sticky bits, you don't know exactly where, but get some bound.
> >  c) No sticky-bits: explicit tests required everywhere, and hence, in
> > many cases, the tests are omitted for speed and code size.
>
> Is it totally ridiculous to have a design that kept the instruction address
> "with" the instruction as it is executed?  Then, no matter in what order the
> instructions were executed, or the state of other instructions or parts of
> the CPU, the CPU could report the exact address of the instruction causing
> the exception.  It would certainly require an additional internal register
> (to hold the instruction address) within each FU and additional space in the
> renaming table, etc., but they could be written in parallel with the other
> stuff that had to be written, so it may not cause any additional time.  The
> only time these registers would be read is upon exceptions.

Of course not (totally ridiculous, that is), as PCs have to be kept
around (in some form or other) anyway, but implementation always
matters.

1) As a reminder, not all CPU designs are complex speculative OOO CPUs.
 In fact, only a tiny fraction of distinct CPU designs are such. Some
extra stuff doesn't cost much in a big OOO CPU, but percentage-wise, it
may cost more in a simple pipeline or even a relatively simple 2-issue
in-order superscalar.

In some ways, an OOO design handles exceptions "easier" than the other
designs, in that most instructions are executed speculatively anyway,
and one needs all of the mechanisms for in-order graduation and
unwinding mispredicted code anyway.  But consider the (common) designs
with the following characteristics:
- in-order issue
- multiple functional units, with long latencies and potential
out-of-order completion

One may track the PC for each instruction, but suppose there are 4
independent FUs (like integer add, integer mul/div, FP add/mul, FP
divide/sqrt), of which the latter 3 naturally have multi-cycle
operations, some very long, and of course, there may be queueing
effects on FUs.

Suppose your code looks like:

1: FP DIV   f1 = f2/f3   some number of clocks, more than FP MUL
..... instructions not depending on f1
2: FP MUL   f2 = f3*f4   note this clobbers f2 [typically 2-8 clocks]
..... instructions not depending on f2
3: INT DIV  r1 = r2/r3   [probably ~64 clocks on 64-bit CPU]
..... instructions not depending on r1
4: INT ADD  r2 = r3+r4   and this clobbers r2

Suppose one has an ISA in which every one of these can cause an
exception [on MIPS, INT DIV can't, but the rest can.]

THE GOOD CASE
If the ISA semantics follow the rules I described earlier
a) FP DIV and FP MUL stall until they are sure they don't cause an
exception.  Then they run to completion.
b) The INT DIV stalls until it does a test-for-zero, and  then it runs
as long as necessary.  It's quite possible that 1, 2, and 3 are all
executing concurrently.

Then:
1) The behavior is identical on all implementations regardless of the
number and latency of functional units.
2) Likewise, if the OS has to redo the offending instruction and
continue execution, it can, relatively straightforwardly.  Why would
one want to do that?  Isn't an exception The End?
Nope: in particular, the FPU might not implement all of IEEE 754, it
might implement the common cases, and trap on data-dependent, but
infrequent cases [as has been done in at least some implementations of
both MIPS and SPARC].

In any case, the IEEE standard recommends that users be allowed to
specify a trap handler for any of the standard exceptions, such that it
can compute a substitute result.
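A pure-software sketch of that IEEE recommendation: a user-registered handler supplies a substitute result when an operation would otherwise deliver the exceptional default (all names here are invented for illustration, not any real trap-handler API):

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* User-installable handler that computes a substitute result */
typedef double (*fp_trap_handler)(double a, double b);
static fp_trap_handler overflow_handler = 0;

/* Multiply, but on overflow hand the operands to the registered
 * handler -- this is only possible because the operands are still
 * intact at the point of the "trap" (the precise-exception property) */
double checked_mul(double a, double b) {
    double r = a * b;
    if (isinf(r) && isfinite(a) && isfinite(b) && overflow_handler)
        return overflow_handler(a, b);   /* substitute result */
    return r;
}

/* Example handler: saturate to the largest finite value of the
 * correct sign instead of returning Inf */
static double saturate(double a, double b) {
    return ((a < 0) != (b < 0)) ? -DBL_MAX : DBL_MAX;
}
```

The point of the exercise in the hard case below is exactly that this scheme collapses if, by the time the trap is taken, an input register has already been clobbered by a later instruction.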

THE HARD CASE
Assume that instructions 1-4 above could each raise an exception upon
completion, rather than stalling until an exception is ruled out.  [As
far as I can tell from a quick read, HP PA-RISC is somewhat like this.]

Here are some exercises for the reader:

1) Enumerate the number and order of exceptions that might happen.


2) Write the IEEE trap handlers that handle traps from the FP
operations.
Note that if 1 traps, but late enough, one of its inputs is already
overwritten by the output of 2, which means the trap handler can't get
at the inputs.

2a) Design the logic to keep track of the dependencies above, i.e.,
requiring the tracking of any register used as an *input* by any
uncompleted operation, and stalling any operation that modifies such a
register (which of course, wrecks the very concurrency that one is
trying to get).

3) When a trap occurs, does the CPU complete any pending operations?
Suppose one of them causes an exception also? Design all of the state
provided to the kernel, and to the user trap handler.

4) In a family of CPUs, does the outcome of a program depend on the
relative latencies of various functional units?  Discuss the
circumstances under which this might be OK.

Note that the answer to 1 is: 0-4 exceptions, and in any order,
depending on the latencies, and also on the number of intervening
instructions.

MESSAGE: for some reason, many people want to propose complex
mechanisms.  I say again: especially in this turf, simplicity really
pays off, and complexity is only accepted reluctantly, at least by
people who actually have to implement the hardware and software.  One
can make complex setups work, but it's expensive.

