From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: System Instruction Caches in Superscalar processors?
Date: 4 Feb 1996 03:03:06 GMT
In article <4eqngr$lml@news.connectnet.com>, rschnapp@fido.metaflow.com
(Russ Schnapp) writes:
|> The idea of dedicating some ICache to system processes is not entirely
|> daft. It's probably overkill, though. A set-associative cache does the
|> trick, and is a more general solution. It solves the interrupt-handler
This has been done for embedded applications, although generally with
off-chip caches, where one knows a lot about the behavior of the
overall system, and also where one expects to spend a fair amount of time
in system state, and do a lot of low-overhead context-switching.
For example, somebody once used MIPS R3000 chips with a double I-cache,
where the extra I-cache was selected by a bit settable by the kernel
of a telephone switching application.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@sgi.com
DDD: 415-933-3090 FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Long latency instructions in optimizing compilers
Date: 1 Dec 1998 00:00:28 GMT
In article <abaum-3011981152330001@althea.pa.dec.com>, abaum@pa.dec.com
(Allen J. Baum) writes:
|> You can talk about a machine with extremely fast interrupt response.
|> I agree that interrupt response can be degraded by execution of a
|> non-interruptible multi-cycle instruction.
So, it sounds like this thread was really about interrupt latency,
and guarantees thereof? i.e., such as that flavor of real-time that
cares about interrupt latency.
Note of course, that people who like tight real-time bounds on interrupt
latency hate things like:
a) Long cycle-count instructions that have to be finished before
the interrupt is recognized. [this thread of discussion ... and only
the tip of the iceberg, since the following can be much worse]
b) Non-lockable caches
c) Code that can cause TLB misses
d) Complex implementations, with lots of internal state that has
to be unraveled, as in some out-of-order CPUs.
e) Complex multiprocessor issues
Suppose you want a real-time UNIX on a multiprocessor (SMP or ccNUMA);
such things exist, but usually with fairly generous latency guarantees.
Suppose you want a tight guarantee:
(a) Write tight interrupt handler, counting instructions.
(b) On systems where possible, lock down the interrupt code into
part of the cache.
(c) Use large-page mapping, or unmapped space for the handler,
to avoid TLB misses.
but still:
(d) The CPU may hold off interrupts for the length of a long
instruction (~100 cycles certainly possible). Worse, it could
end up being stalled for 100s of cycles on one instruction.
For example, suppose you can use uncached load instructions to
retrieve data from an I/O device, and you've already issued the
load to the device, and the load has side-effects (some do,
as in retrieving next item from a queue), and can wait some modest
number of microseconds before returning. (100s or 1000s of cycles).
(e) An aggressive o-o-o CPU might take 10s of cycles to unwind its
state.
(f) Even if the instruction stream looks OK, the CPU's bus may be
blocked off by a stream of memory requests that
were issued to the memory system before the interrupt arrived,
or by combinations of misses and writebacks of dirty data...
(100s, even 1000s of cycles, depending on how many outstanding
requests there are, and the nature of the coherency hardware, and
which events cause the CPU to stall).
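The contributions in (d)-(f) can be sketched as a worst-case latency budget. The cycle figures below are invented placeholders within the ranges mentioned above, not measurements of any real CPU:

```python
# Illustrative worst-case interrupt-latency budget, in CPU cycles.
# Every figure is an assumed placeholder in the ranges discussed
# above, not a measurement of any real processor.

budget = {
    "longest_uninterruptible_instruction": 100,   # (d) e.g. divide/sqrt
    "stalled_uncached_io_load":            1000,  # (d) device holds the load
    "ooo_state_unwind":                    40,    # (e) speculation rollback
    "outstanding_memory_requests":         800,   # (f) misses + writebacks
}

def worst_case_cycles(budget):
    """Sum the worst case of every contribution: a hard bound must
    assume they can all happen back-to-back on one interrupt."""
    return sum(budget.values())

print(worst_case_cycles(budget))  # 1940 with the assumed figures
```

The point of the sketch is only that a hard bound is the sum of every mechanism's worst case, even though on a typical interrupt nearly all of these terms are zero.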
Anyway, worst cases (not average cases) can get amazingly bad ... and
divisions or square roots are *nothing* compared to the other stuff
that can happen ... which is why hard real-time folks seem to prefer
simpler CPUs, with controllable caches, and understandable cycle counts :-)
--
-john mashey DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com DDD: 650-933-3090 FAX: 650-969-6289
USPS: Silicon Graphics/Cray Research 40U-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: alt.folklore.computers,comp.arch,alt.sys.pdp8,alt.sys.pdp10,
alt.sys.pdp11,vmsnet.pdp-11
Subject: Real-time versus modern CPUs [was: Re: Q: Why not (2^n)-bit?]
Date: 19 Aug 2000 05:25:14 GMT
There is a general rule that if you want to be able to compute timings,
and do real-time, then you'd prefer the simplest possible machine:
no caches or NUMA memory
no instruction prefetch
no overlap of functional units
really simple execution model:
fetch 1
decode 1
execute 1
fetch 2 ...
all of which flies in the face of trying to run faster:
multiple layers of caches, with queued accesses
[which is why real-time-targeted chips often have
cache-lockdown features]
write-back caches
asynchronous instruction fetches, with branch prediction,
and deep speculation through multiple branches
multiple functional units, possibly with shared resources,
partial pipelining, etc.
etc, etc.
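The strictly sequential fetch-1/decode-1/execute-1 model above is exactly what makes timings computable: each instruction's cost is a fixed sum, so a program's cycle count is just the sum over its instructions. A minimal sketch, with per-opcode execute costs invented for illustration:

```python
# Deterministic timing on the simple fetch-1/decode-1/execute-1
# machine: no caches, no prefetch, no overlap, so total time is a
# plain sum.  Execute costs per opcode are invented for illustration.

FETCH, DECODE = 1, 1
EXECUTE = {"add": 1, "load": 2, "store": 2, "div": 35}

def cycles(program):
    """Exact cycle count for a list of opcodes on the simple machine."""
    return sum(FETCH + DECODE + EXECUTE[op] for op in program)

print(cycles(["load", "add", "div", "store"]))  # 48
```

Caches, prefetch, and overlapping functional units break precisely this property: the cost of an instruction stops being a constant and starts depending on history.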
With a modern, speculative, out-of-order chip:
(1) It is obviously difficult to compute timings of instruction
sequences [I remember 360/50s being reasonable in 1967, and
PDP-11s were OK, and 68010s were getting messy, but 68020s and
VAXen were nearly hopeless. Cutler once sent me email pointing out
the amazing number of pages that a single worst-case
VAX instruction could manage to require (41 for 6-operand Add Packed;
Divide Packed is slightly worse). R2000s were relatively easy again].
(2) But, if you want to compute the maximum time from receipt of
an external interrupt, until you are through at least saving registers,
it is even worse, and the principle of
least astonishment can be violated easily, because times may be
affected by all sorts of seemingly un-related concurrent actions.
For example, suppose you're executing a long operation [like divide or
sqrt], and an interrupt waits to allow completion of already-started
instructions; you have to find the longest such instruction timing.
That's easy ... but just the tip of a huge iceberg.
But, maybe interrupts require rollback of speculative instructions,
which means you have to check out exactly how long it takes
a chip to roll them back.
But, of course, fetching the interrupt routine may cause I-cache misses,
so you have to add them in.
But wait, the desired location(s) in L2 cache (typically joint I&D),
needed to fetch the instructions, may be currently filled with
dirty data that requires writebacks before the I-cache line can be read.
But wait, the system bus may be occupied by a stream of incoming
cache transactions, which in fact might never need to have happened,
but were fired-up many cycles ago by speculatively-fetched
memory-access instructions [and a lot of chips can have 4-8 outstanding
memory transactions].
But wait, it's not that simple: in the middle of all of this,
in a multiprocessor [or in a uniprocessor with cache-coherent DMA],
transactions may arrive from outside the CPU that require
invalidation, writeback, etc of cache lines.
And of course, in some designs there may be TLB misses, which
also often generate cache misses...
Anyway, the bottom line is: the typical high-performance techniques
raise the performance overall, lower the typical real-time response time,
but make it incredibly difficult to compute a worst-case time,
and such bounds are usually much further from the typical case
than they were for the simpler processors.
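That bottom line can be illustrated numerically: each mechanism in the chain above costs almost nothing on a typical interrupt but adds a large term to the bound. All (typical, worst) cycle pairs below are invented examples in the ranges discussed, not measurements:

```python
# Typical vs worst-case interrupt response on a speculative CPU.
# Each mechanism's (typical, worst) cycle cost is an invented
# example in the ranges discussed above, not a measurement.

mechanisms = {
    "finish current instruction": (2, 100),
    "speculation rollback":       (5, 40),
    "I-cache miss on handler":    (0, 50),
    "L2 dirty-line writeback":    (0, 100),
    "bus busy with prior misses": (0, 1000),
    "TLB miss + its cache miss":  (0, 80),
}

# The typical case sums the common costs; the bound sums every worst case.
typical = sum(t for t, w in mechanisms.values())
worst   = sum(w for t, w in mechanisms.values())
print(typical, worst)  # 7 1370: the bound is ~200x the typical case
```

On the simple machine sketched earlier, typical and worst cases are the same number; here they diverge by orders of magnitude, which is the whole complaint.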
--
-John Mashey EMAIL: mash@sgi.com DDD: 650-933-3090 FAX: 650-933-2663
USPS: SGI 1600 Amphitheatre Pkwy., ms. 562, Mountain View, CA 94043-1351
SGI employee 25% time, non-conflicting,local, consulting elsewise.