From: "John Mashey" <firstname.lastname@example.org>
Subject: Re: Exceptions at basic block boundaries
Date: 10 May 2005 00:00:45 -0700
> EricN> Ah, yes. I was not proposing software TLB reloads. I was
> EricN> proposing that hardware "accessed" and "dirty" bits are
> EricN> unnecessary, because they can be emulated by OS software.
> Okay, in the page fault handler, which is what runs when the table
> walker finds the PTE and it says accessed denied. Got it.
> So...how often does the page fault handler get called to tweak
> dirty or accessed bits if you emulate that way. And, do you need
> to replicate page table structures that might otherwise be shared?
> It would be awfully nice not to need a writeback path from the TLB
> to the in-memory page tables.
> EricN> Yes, running multiple soft-TLB handlers in parallel would
> EricN> pretty silly,
> Well, note that if the soft-TLB handler takes more than ~30 cycles,
> it's likely to overlap the restarted code. To get interesting
> overlap, I'd need some way to block the trapping memop until the
> TLB fill is done. But given that, I actually can imagine multiple
> soft-TLB handlers active simultaneously.
> But the blocking ISA is nasty. I'd rather just do it with wires.
> EricN> OK, I see your point -- might as well do all the page table
> EricN> walks before re-starting the BB.
> Actually, I'd rather keep the BB in the OoO core and run the page
> walker in parallel, waking up the blocking memop when I've filled
> the TLB, same as with a 2nd level cache miss.
> But now that I understand "page fault" better, yes, once that
> happens it's a pipe flush and no parallelism for sure.
Hi Iain ... catching up on quarterly lookin to comp.arch.
General comments on this thread:
1) There's sometimes been some terminology fuzziness in this, likely
because different people use terms differently. Personally, I'd split
a) Asynchronous interrupts [I/O, bus-errors, etc], i.e., that happen
independent of the instruction stream.
b1) Precise synchronous exceptions, in which the handler (kernel or
user (UNIX signal handlers, sometimes)) expects a clean pointer to the
faulting instruction, takes action, and may very well fix something up,
and return to re-execute, or else may do something, and then skip it to
resume execution immediately after. People who write such things are
really not so keen on simulating a whole sequence of instructions,
which I think was one of the suggestions. (?) This is somewhat
reminiscent of the exception-handling complexity of the i860.
The branch-delay slots of MIPS were already somewhat painful. Note
that one must not only deal with the usual compiled code, but with
things like debuggers wanting to insert BREAK instructions in various
places and pick up the pieces cleanly.
b2) Imprecise exceptions: as noted, some CPUs allow these, typically
for when out-of-order completion is allowed for floating-point, and
usually in such cases, nobody is expecting the software to do much
other than terminate the process with a nice message.
It may be plausible to expect software to deal with more complexity, if
the benefits are really worth it ... but:
- The frequent cases better not be too expensive.
- From past history, this kind of stuff is notoriously buggy, such that
many old OS folks will tend to say: "Give me precise exceptions, or
else..." (meaning: give me an exception program counter that points at
the faulting instruction (modulo something simple like MIPS'
branch-delay-bit), and all earlier instructions have completed, and
none of the following ones have had any effect (modulo cache
2) If a hardware table-walker is doing TLB refill, and one wants
hardware TLB shootdown:
a) For performance reasons it better go through the cache hierarchy, as
PTEs often have reasonable locality. This can be hard to get right,
and although I don't recall which ones did this, some early micro MMUs
were forced to expect PTEs to be uncached, OK for older systems with
low memory latency. [For example, if you expect the hardware
table-walker to set modify or reference bits, sometimes the only way to
implement this simply was to use read/modify/write bus-locks on some
PTE accesses. Ugh.
b) The TLB essentially becomes an L0 cache, and so:
- It must be prepared to flush a PTE entry when a TLB-shootdown
arrives, but also, when the cache-line holding a PTE is flushed, or a
dirty one written back, which may well happen because a dirty cache
line is ejected to make room for a cache line to be fetched, possibly
by a speculative fetch.
- If the TLB can modify state of its copy of a PTE, it must be prepared
to propagate that information back out through the cache hierarchy, and
in fact, any TLB refill may need to do that if it needs to eject a
dirty PTE back out to make room for the next incoming PTE. [This can
be done ... but this can get pretty ugly, which is why we (OS people)
insisted that the R2000's MMU only have data modified under program
- One is probably going to want the inclusion property for the cache
hierarchy, so that one doesn't have to interrogate TLB tags so often.
[Inclusion property = if an entry is in level x of a cache, it's
present in x+1, x+2...], which means that if there's no tag match at an
outer level, you don't have to consume a cycle at the inner level(s).
c) Just for fun, consider the table-walker's actions on a 128P
directory-based ccNUMA system [i.e., like a 10-year-old SGI O2000],
with an o-o-o CPU with fairly deep read/write queues.
The CPU needs to survive an occasional memory access that takes 1000s
or even 10s of 1000s of clocks. Suppose CPU0 wants to do a store to a
- Do the table-walk to figure out where the PTE is located. This of
course, may cause cache misses, which means trips to the directories,
which may require forcing write-backs of dirty cache lines from other
CPUs that own them. Of course, with paged page-tables, there may even
be page-faults in the middle of this. Also, in a ccNUMA, there may
well be multiple router hops to get to any memory, and all of the
relevant memory mauy be in the furthest-away memory.
- Now, we have the physical address of the PTE. Suppose the PTE is
currently cached (clean, not dirty) in CPUs 1-1023. In CPU0's Ln
cache, the line the PTE should occupy has another dirty cache line.
- Put the dirty cache line in the outgoing write-queue, having checked
L(n-1), etc caches [if they are all WB] for newer data that needs to be
written-back, check the TLB, invalidate all of these.
- Obtain ownership-for-write for the line that has the PTE. This means
that the memory controller needs to tell *every* CPU that has a copy to
invalidate it, and (probably) get a response from them all, to make
sure. [yes, there are optimizations.] There are lots of mesasges and
acknowledgements flying around.
- Get cache line into the L(n)... L1 caches, and do the write (into L1,
if it's WriteBack), or into L1...Ln if the inner ones are
BTW, just in case an invalidate comes in the middle of this, do The
This is one of a zillion cases; the description of all the cases for an
O2000 was about 40 (fairly dense) pages ... and of course, R10000s had
software-refilled TLBs, so table-walking/refills, were just ordinary
loads subsumed by the usual coherency.
3) To do anything, one has to look hard at the statistics. Note that
in a UNIX system, the rate of unmaps (i.e., marking PTEs invalid) is
more or less proportional to:
- page outs
- freeing of pages, due to program free's, or process exit.
and sometimes the games played around COW (Copy on Write).
The transition of a PTE from clean, but writable, to dirty is
proportional to the rate of allocation of data page. That bit has to
get to the PTE in mememory, but if there are stale copies in TLBs, that
may cause unnecessary traps, but it still works.
Reference-bit setting is optional, and since it is a hint, they can
perfectly well get out of synch without bothering anything much.
Ther are interesting time-stamp algorithms for lessening the number of
4) Bottom-line: one can implement TLB page-walkers and TLB shootdowns
in hardware, or not, and somebody or other has done most of the
combinations. Before implementing such, one would want to do a lot of
modeling to make sure it was worth it, knowing that, in practice,
especially in larger systems, a few clock here and there are irrelevant
in the face of memory latencies.
5) In particular, it is always good to avoid the creation of large
hardware state, or long lockups, just to cater to really, really rare
cases that still have to work. The state machines are complex,
especially if they't trying to overlap operations, and such things were
the cause of numerous weird bugs in the micros of the early 1980s.
6) As always, the combination of:
- paged virtual-memory MMUs
- aggressive o-o-o CPUs
is very troublesome, as there are numerous nonobvious interactions.
Even the first 3 together have caused *plenty* of problems over the
years, usually manifested as really rare and difficult-to-reproduce
bugs that lead drive OS people nuts.