Index Home About Blog
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: System Instruction Caches in Superscalar processors?
Date: 4 Feb 1996 03:03:06 GMT

In article <4eqngr$lml@news.connectnet.com>, rschnapp@fido.metaflow.com
(Russ Schnapp) writes:

|> The idea of dedicating some ICache to system processes is not entirely
|> daft.  It's probably overkill, though.  A set-associative cache does the
|> trick, and is a more general solution.  It solves the interrupt-handler

This has been done for embedded applications, although generally with
off-chip caches, where one knows a lot about the behavior of the
overall system, and also where one expects to spend a fair amount of time
in system state, and do a lot of low-overhead context-switching.
For example, somebody once used MIPS R3000 chips with a double I-cache,
where the extra I-cache was selected by a bit settable by the kernel
of a telephone switching application.

-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Long latency instructions in optimizing compilers
Date: 1 Dec 1998 00:00:28 GMT

In article <abaum-3011981152330001@althea.pa.dec.com>, abaum@pa.dec.com
(Allen J. Baum) writes:

|> You can talk about a machine with extremely fast interrupt response.
|> I agree that interrupt response can be degraded by execution of a
|> non-interruptible
|> multi-cycle instruction.

So, it sounds like this thread was really about interrupt latency,
and guarantees thereof?  i.e., such as that flavor of real-time that
cares about interrupt latency.

Note of course, that people who like tight real-time bounds on interrupt
latency hate things like:
	a) Long cycle-count instructions that have to be finished before
	the interrupt is recognized.  [this thread of discussion ... and only
	the tip of the iceberg, since the following can be much worse]
	b) Non-lockable caches
	c) Code that can cause TLB misses
	d) Complex implementations, with lots of internal state that has
	to be unraveled, as in some out-of-order CPUs.
	e) Complex multiprocessor issues

Suppose you want a real-time UNIX on a multiprocessor (SMP or ccNUMA);
such things exist, but usually with fairly generous latency guarantees.
Suppose you want a tight guarantee:
	(a) Write tight interrupt handler, counting instructions.
	(b) On systems where possible, lock down the interrupt code into
	part of the cache.
	(c) Use large-page mapping, or unmapped space for the handler,
	to avoid TLB misses.
but still:
	(d) The CPU may hold off interrupts for the length of a long
	instruction (~100 cycles certainly possible).  Worse, it could
	end up being stalled for 100s of cycles on one instruction.
	For example, suppose you can use uncached load instructions to
	retrieve data from an I/O device, and you've already issued the
	load to the device, and the load has side-effects (some do,
	as in retrieving next item from a queue), and can wait some modest
	number of microseconds before returning. (100s or 1000s of cycles).

	(e) An aggressive o-o-o CPU might take 10s of cycles to unwind it's
	state.

	(f) Even if the instruction stream looks OK, the CPU's bus may be
	blocked off by a stream of memory requests that
	were issued to the memory system before the interrupt arrived,
	or by combinations of misses and writebacks of dirty data...
	(100s, even 1000s of cycles, depending on how many outstanding
	requests there are, and the nature of the coherency hardware, and
	which events cause the CPU to stall.

Anyway, worst cases (not average cases) can get amazingly bad ... and
divisions or square roots are *nothing* compared to the other stuff
that can happen ... which is why hard real-time folks seem to prefer
simpler CPUs, with controllable caches, and understandable cycle counts :-)
--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-969-6289
USPS:   Silicon Graphics/Cray Research 40U-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: alt.folklore.computers,comp.arch,alt.sys.pdp8,alt.sys.pdp10,
	alt.sys.pdp11,vmsnet.pdp-11
Subject: Real-time versus modern CPUs [was: Re: Q: Why not (2^n)-bit?]
Date: 19 Aug 2000 05:25:14 GMT

There is a general rule that if you want to be able to compute timings,
and do real-time, then you'd prefer the simplest possible machine:
	no caches or NUMA memory
	no instruction prefetch
	no overlap of functional units
	really simple execution model:
		fetch 1
		decode 1
		execute 1
		fetch 2 ...

all of which flies in the face of trying to run faster:
	multiple layers of caches, with queued accesses
		[which is why real-time-targeted chips often ahve
		cache-lockdown features]
	write-back caches
	asynchronous instruction fetches, with branch prediction,
		and deep speculation through multiple branches
	multiple functional units, possibly with shared resources,
		partial pipelining, etc.
	etc, etc.

With a modern, speculative, out-order chip:
	(1) It is obviously difficult to compute timings of instruction
	sequences [I remember 360/50s being reasonable in 1967, and
	PDP-11s were OK, and 68010s were getting messy, but 68020s and
	VAXEN were nearly hopeless.  Cutler once sent me email pointing out
	the amazing number of pages that a single worst-case
	VAX instruction could manage to require [41 for 6-operand Add Packed,
	Divide Packed is slightly worse].  R2000s were relatively easy again...

	(2) But, if you want to compute the maximum time from receipt of
	an external interrupt,  until you a through at least saving registers,
	it is even worse, and the principle of
	least astonishment can be violated easily, because times may be
	affected by all sorts of seemingly un-related concurrent actions.

	For example, suppose you executing a long operation [like divide or
	sqrt], and an interrupt works to allow completion of already-started
	instructions; you have to find the longest such instruction timing.
	That's easy ... but just the tip of a huge iceberg.

	But, maybe interrupts require rollback of speculative instructions,
	which means you have to check out exactly how long it takes
	a chip to roll them back.

	But, of course, fetching the interrupt routine may cause I-cache misses,
	so you have to add them in.

	But wait, the desired location(s) in L2 cache (typically joint I&D),
	needed to fetch the instructions, may be currently filled with
	dirty data that requires writebacks before the I-cache line can be read.

	But wait, the system bus may be occupied by a stream of incoming
	cache transactions, which in fact might never need to have happened,
	but were fired-up many cycles ago by speculatively-fetched
	memory-access instructions [and a lot of chips can have 4-8 outstanding
	memory transactions].

	But wait, it's not that simple: in the middle of all of this,
	in a multiprocessor [or in a uniprocessor with cache-coherent DMA],
	transactions may arrive from outside the CPU that require
	invalidation, writeback, etc of cache lines.

	And of course, in some designs there may be TLB misses, which
	also often generate cache misses...

Anyway, the bottom line is: the typical high-performance techniques
raise the performance overall, lower the typical real-time response time,
but make it incredibly difficult to compute a worst-case time,
and such bounds are usually much more worse than the typical case,
than they were for the simpler processors.




--
-John Mashey EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-933-2663
USPS:   SGI 1600 Amphitheatre Pkwy., ms. 562, Mountain View, CA 94043-1351
SGI employee 25% time, non-conflicting,local, consulting elsewise.

Index Home About Blog