From: torvalds@penguin.transmeta.com (Linus Torvalds)
Newsgroups: comp.arch
Subject: Re: Sophistication of TLB/MMU?
Message-ID: <b3p0po$2i2$1@penguin.transmeta.com>
Date: Sat, 1 Mar 2003 01:01:44 +0000 (UTC)

In article <slrnb5vib7.53i.skipper@web1.calweb.com>,
Skipper Smith <skipper@no-spam.calweb.com> wrote:

>Power and PowerPC fulfill just about all of that for active pages using an
>inverse page table.  Granted, for inactive pages, there is no hardware to
>do anything, but when it comes time to do page swaps, the overhead of
>managing those tables in software is considered by most to be
>inconsequential.

Ok, you hit a hot button of mine.

The Power/PowerPC inverse page tables are horrible!

And that's exactly because of the thing you (and apparently most hw
engineers at IBM) dismiss as "inconsequential".

Building up and tearing down those horrible tables is a major exercise,
and I think any engineer who dismisses the behaviour of unavoidable
(compulsory) cache misses is living in the wrong century.

Caches are getting so big (and that very much includes TLB's), that the
compulsory misses often _dominate_ for a lot of loads.  And anybody who
doesn't see that that trend is only getting worse must be blind.  Big
caches with high associativity aren't exactly going away last I looked.

The Power hash tables are basically an off-chip TLB extension, and by
being off-chip, you just moved the fill/flush overhead further out - AND
you made your compulsory misses more expensive.  Which is not a good
optimization to make in the long run.

In other words: the cost of managing those tables is not
inconsequential, because that cost is very much exactly that of the
compulsory miss. The hashes make the compulsory misses - building up the
cache - much worse.

In contrast, a nice regular per-process tree-based page table, backed by
a large _on_chip_ TLB (and probably make it multi-level to keep the
critical path clean - for all the same reasons you make your data caches
multi-level too) is a much more cache-friendly thing to initialize, and
is nicely amenable to well-behaving optimizations like prefetching etc,
since unlike the hashes used for Power page tables it has nice locality.
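
As a rough illustration, here is a minimal sketch of such a walk (the
two-level split, field layout and constants are assumptions picked for
the example, not any particular CPU's format):

    #include <stdint.h>

    /* Assumed layout for the sketch: 10-bit directory index, 10-bit
     * table index, 12-bit page offset, 4-byte PTEs. */
    #define PAGE_SHIFT  12
    #define PT_BITS     10
    #define PT_ENTRIES  (1u << PT_BITS)
    #define PTE_PRESENT 0x1u

    typedef uint32_t pte_t;

    struct pagetable {
        pte_t *dir[PT_ENTRIES];    /* top level points to leaf tables */
    };

    /* Two dependent loads, both into small, densely packed tables with
     * nice locality - exactly the kind of access pattern that caches
     * and prefetchers are good at. */
    static pte_t lookup(const struct pagetable *pt, uint32_t vaddr)
    {
        uint32_t di = vaddr >> (PAGE_SHIFT + PT_BITS);
        uint32_t ti = (vaddr >> PAGE_SHIFT) & (PT_ENTRIES - 1);
        pte_t *leaf = pt->dir[di];

        if (!leaf || !(leaf[ti] & PTE_PRESENT))
            return 0;    /* miss: trap / fault handling goes here */
        return leaf[ti];
    }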

Together with a self-mapping (ie map the page tables themselves linearly
into the virtual address space), you can also actually do the TLB fill
with a single memory access (at the cost of having to make your on-chip
TLB very slightly bigger to compensate) - _without_ getting the nasty
conflict misses or the horrible cache behaviour that the hash tables
get.  Add reasonable address space identifiers and make the TLB large
enough that conflict misses aren't that common.
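
A minimal sketch of that single-access fill, assuming 4 KB pages,
8-byte PTEs and an assumed virtual window PTE_BASE where the OS has
self-mapped the tables (all three are illustration constants, not a
real layout):

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PTE_SIZE   8
    #define PTE_BASE 0xFFFF800000000000ull  /* assumed self-map window */

    /* With the self-map in place, the PTE that translates 'vaddr' sits
     * at a computable virtual address, so the miss handler (hw or sw)
     * needs exactly one memory reference to fetch it. */
    static inline uint64_t pte_vaddr(uint64_t vaddr)
    {
        return PTE_BASE + (vaddr >> PAGE_SHIFT) * PTE_SIZE;
    }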

And it's been done.  Alpha did it right in software.  It might well be
argued that you want to do what alpha did in hardware, to take advantage
of concurrency and speculative fill events. I don't know of anybody who
has done that yet.

		Linus


From: torvalds@penguin.transmeta.com (Linus Torvalds)
Newsgroups: comp.arch
Subject: Re: Sophistication of TLB/MMU?
Message-ID: <b3u814$dfr$1@penguin.transmeta.com>
Date: Mon, 3 Mar 2003 00:35:48 +0000 (UTC)

In article <slrnb62u8g.17v7.skipper@web1.calweb.com>,
Skipper Smith <skipper@no-spam.calweb.com> wrote:

>Actually, the reason for me they are "inconsequential" is because I
>haven't worked with an OS that does page swapping in... hmmm.. about 15
>years.  For the embedded space, the inverse page table is WONDERFUL to
>use, but horrendous to learn/understand.

Why would the inverse page tables be any better there? I just don't see
the point. Even in an embedded market, they really only end up being a
"large inmemory TLB", and there are no actual _advantages_ to doing it
with a hash table as opposed to a tree-like structure.

The hashes are:
 - less dense
 - worse for the cache
 - slower to look up
as compared to a simple self-mapping tree structure.

(One of the traditional arguments _for_ hashing is that you can do most
lookups with a single memory access, but as I pointed out, that is
equally true of a self-mapped tree).
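
For contrast, this is roughly what the hashed probe looks like (group
size, hash function and field layout here are simplified assumptions,
not the actual PowerPC HTAB format):

    #include <stdint.h>
    #include <stddef.h>

    #define PTEG_SIZE 8
    #define NPTEGS    (1u << 14)       /* table size: an assumption */

    struct hpte {
        uint32_t tag;                  /* which virtual page this maps */
        uint32_t rpn_flags;            /* translation + protection     */
    };

    static struct hpte htab[NPTEGS][PTEG_SIZE];

    /* The group index is a hash of the virtual page number, so adjacent
     * virtual pages land in unrelated groups: nearly every probe hits a
     * cold cache line, and each line fetched yields at most one entry
     * you actually wanted. */
    static const struct hpte *htab_find(uint32_t vsid, uint32_t vpn)
    {
        uint32_t group = (vsid ^ vpn) & (NPTEGS - 1);
        uint32_t tag = (vsid << 16) ^ vpn;

        for (size_t i = 0; i < PTEG_SIZE; i++)
            if (htab[group][i].tag == tag)
                return &htab[group][i];
        return NULL;    /* secondary hash, then fault, omitted here */
    }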

So what's so wonderful about them, exactly?

Methinks it's "wonderful" because once you've gone through the pain to
understand them, you don't want to make that pain be for nothing.

>Nah, just working in a market that doesn't suffer from them
>much.  Courtesy of BATs, most of my customers can fit 100% of their page
>translations in the available TLB entries.

Hey, once you do that it doesn't _matter_ what the external page table
looks like, so it might as well be something sane, wouldn't you agree?

>>The Power hash tables are basically an off-chip TLB extension, and by
>>being off-chip, you just moved the fill/flush overhead further out - AND
>>you made your compulsory misses more expensive.  Which is not a good
>>optimization to make in the long run.
>
>Depends on what your I/O bandwidth is and if your OS is capable of doing
>other work while one task is blocked waiting for all the drek related to
>a page reload to be done.  Granted, that isn't a PC class OS, typically,
>or hardware, either.

It's nothing about I/O bandwidth.  99% of the time the pages are in
memory already in a real system, and the only cost of actually filling
the mapping in is the CPU cost of (a) taking the trap and (b) filling
the TLB.

As far as I can tell, the Power reverse mapping was designed for Irix
and Fortran: running large static workloads (that build up the mappings
_once_ and thus it doesn't matter) with an OS where the cost of filling
in the page tables was so huge that it didn't matter that hardware made
it slightly more expensive.

>Wanna do it in software?  Ignore what the PowerPC does in hardware and
>shut it off (ala Book E or any of the non-divisible by 10 PowerPC devices
>as an example).

Yeah, Linux has actually used the sw fill approach simply because it is
_faster_ than using the hardware TLB hashing. The fill itself is slower,
but it's made up for by the fact that you don't have to play with the
hashes. HOWEVER:

 - not all Power/Powerpc chips can do it.
 - a lot of the chips that _can_ do it have pitifully small on-chip
   TLB's, because the hashes are supposed to be good at filling them.
 - once you do soft-fill, you can't take advantage of speculative work
   by the hw, so software fills _are_ fundamentally slower than hw
   fills.

Which is why "just do it in software" isn't the answer on PPC.

		Linus


From: torvalds@penguin.transmeta.com (Linus Torvalds)
Newsgroups: comp.arch
Subject: Re: Sophistication of TLB/MMU?
Message-ID: <b42qql$v0$1@penguin.transmeta.com>
Date: Tue, 4 Mar 2003 18:21:09 +0000 (UTC)

In article <slrnb68gbl.10tc.skipper@web1.calweb.com>,
Skipper Smith <skipper@no-spam.calweb.com> wrote:
>Linus Torvalds <torvalds@penguin.transmeta.com> wrote:
>>
>>The hashes are:
>> - less dense
>
>If using many different task ids, but without swapping (as is the case
>for most real time/task protected OSes), it can be shown to
>work out more nicely than two or three level tables.

Yeah. "If we limit reality, the power hashes actually make sense".

It's still a bad design. It's fine if you're willing to limit the loads
you put on the CPU, but that's not what I'd call a good choice.

>> - worse for the cache
>
>Depends on your definition of worse.  Given that the next level of a table
>is guaranteed to be non-linear, if you are caching table searches in a
>traditional mechanism, you guarantee wiping out two cache sectors with
>little likelihood of reuse in a timely fashion.

Ehh.. You're now totally ignoring the fact that all the hashed look-ups
are pretty much guaranteed to miss in the cache the way the hashed page
tables work.

Your argument is that a tree-based approach sometimes misses in the
cache.  Big deal.  My argument is that a hash table _almost_always_
misses in the cache on lookup, unless you choose a hash function that
puts adjacent pages in the same hash, which you can generally only do
(without getting excessive hash collisions) if you know your workload
statically.

And it doesn't just miss in the cache on lookup, it does bad things to
the cache on initial TLB populate and teardown too. So you can't sanely
pre-populate the tables, because it's too expensive.

>				  Using hash tables means that in one
>burst (assuming a PPC 32 byte burst), you can load in four tlb entries to
>the cache which, using AIX as an example, have a 98% chance of having the
>entry you are looking for in them.

Yes.  You have a 98% chance of filling _one_ entry with a hash approach.
And that's a f*cking DISASTER from a cache standpoint.  You just read 32
bytes of data from main memory, and you can only fill ONE TLB entry with
it? And that's with a 98% likelihood? You consider that GOOD?

The fact is, you just don't know what GOOD actually is.

I'll tell you: a _good_ TLB fill should pre-populate the TLB with all
the entries it can fit in one cache-line.  Do you realize that a
bog-standard Intel CPU will fetch 8 TLB entries in one go? Together with
a self-mapping (or, as Andy points out, you can just cache the other
levels in dedicated caches), that means that with a single memory
reference you get _eight_times_ the coverage that the silly Power hash
tables get.
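
Back-of-the-envelope, with sizes that were typical at the time (32-byte
line, 4-byte tree PTE, 8-byte hashed PTE - assumptions, not
measurements):

    /* Coverage per cache line fetched during a fill. */
    enum {
        LINE_BYTES    = 32,
        TREE_PTE      = 4,
        HASH_PTE      = 8,
        TREE_PER_LINE = LINE_BYTES / TREE_PTE, /* 8 adjacent PTEs = 32KB of VA */
        HASH_PER_LINE = LINE_BYTES / HASH_PTE, /* 4 PTEs, ~1 of them useful    */
    };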

>			  In a typical embedded operating
>environment with no swapping, that can be improved to a mere 100% :-).


Yeah. And that other CPU just kicked your butt: even if the TLB locality
isn't perfect, it's usually good enough that the tree-based TLB fill is
able to use more than one TLB entry per cache line.

In your parlance, that means that it improved your paltry 100% to
several hundred percent.

>> - slower to look up
>
>By far, your answer is wrong, at least in my world, indicating that you
>probably aren't looking at things from my POV.  Not that you should, of
>course :-).

I _am_ trying to look at it from your POV. It's just that your arguments
don't make any sense. All the things you say hashes are good for, a
linear page table is _better_ at.

>For embedded systems (without swapping), it means maintaining one table
>with a size that scales nicely based on physically mapped memory no matter
>how many tasks are implemented, instead of one set of tables for each task
>that can expand nastily for sparse logical memory allocations.

For embedded systems, you want to avoid sparse memory allocations
anyway. In fact, you want to avoid them on a PowerPC too, since on a PPC
you need to control the hashing of the entries, in order to get good
hash usage without forcing excessive hash collisions.

In fact, since you were so proud of the 100% number, that obviously
means that you cannot afford _any_ hash collisions, which definitely
means that you are already controlling the virtual memory layout quite
heavily.

So to me it looks like you're comparing apples to oranges in a very
dishonest way: you bring up the "bad" cases for tree-based mappings,
without actually allowing for them in your own hashed mappings.

The fact is, with the generic big virtual mappings, the Power hashes do
_not_ scale linearly with physical memory.  For the generic case, you
need to have either a backing store (in _addition_ to the hardware
hashes) that scales with virtual memory, or you need to scale your
hashes up.  Traditionally you do both (ie you make your hashes bigger
than they need to be - making the cache behaviour even worse - _and_ you
have the external backing store that takes care of the overflow case).

Of course, as you point out, if you control the load of the device
(which you pretty much always do in the embedded space), you don't have
to worry about this at all. I'll certainly give you that once you can
control the mappings, and don't have to worry about generic loads that
you have no control over, the hashes are less painful - but I don't see
them being noticeably _better_ either.

>  BTW, one of the more interesting things I saw done
>with Linux on PPC was a paper done by a couple of New Mexico State
>University undergrads who managed to con someone at Motorola to give them
>a couple of MPC604 based computers so they could see if they could
>manipulate how the MMU was dealt with.  About 10 pages was spent
>describing better TLB entry allocation, "garbage collection" type
>algorithms for keeping the most recently used PTEs left justified in the
>table, etc... gaining them about a few percentage points
>improvement.  Then they used a DBAT to map the graphics card memory and
>gained about 10% improvement in MMU performance IIRC.

Sure.  That tells us two things: (a) large pages are good and (b) page
table performance was so important that avoiding the overhead gained 10%
real time.  I certainly agree with both of these points.

Btw, large pages can be embedded as short-circuits in a tree-based page
table.  So again the tree-based approach ends up showing itself being
the more flexible one: you don't _need_ BAT registers, since on a page
table tree it's largely equivalent to making the tree be of flexible
depth instead.
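
A minimal sketch of that short-circuit, using the same kind of assumed
two-level layout as before (the LARGE bit and field positions are made
up for illustration):

    #include <stdint.h>

    #define DIR_SHIFT   22              /* each top-level entry covers 4 MB */
    #define PDE_PRESENT 0x1u
    #define PDE_LARGE   0x2u            /* entry maps the whole region      */

    typedef uint32_t pde_t;

    /* If the LARGE bit is set the walk stops at the top level: the entry
     * itself is the translation, i.e. a BAT-like block mapping expressed
     * as a shallower branch of the same tree. */
    static int translate(const pde_t *dir, uint32_t va, uint32_t *pa)
    {
        pde_t e = dir[va >> DIR_SHIFT];
        uint32_t mask = (1u << DIR_SHIFT) - 1;

        if (!(e & PDE_PRESENT))
            return -1;                  /* fault */
        if (e & PDE_LARGE) {
            *pa = (e & ~mask) | (va & mask);
            return 0;
        }
        return -1;                      /* normal leaf walk omitted here */
    }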

(Which is not to say that I mind BAT registers, I think they are a
sensible common case regardless of what your TLB fill is.  The biggest
problem with the BAT registers on the PPC is usually that there aren't
_enough_ of them: again, for the general case, you'd like to be able to
have big processes that know their TLB behaviour program their own
regions, and there may well be more than just a few of them).

>>It's nothing about I/O bandwidth.  99% of the time the pages are in
>>memory already in a real system, and the only cost of actually filling
>>the mapping in is the CPU cost of (a) taking the trap and (b) filling
>>the TLB.
>
>Hmmm, again, not in my world.  When using hardware TLB refills, I don't
>get traps (in my sense of the word, anyway) and reading the table and
>reloading a TLB is never slower than using a two level table.

Sure it is: see above on why you can only get one entry per cacheline
with the general hashes.

>Irix?  Seems I am not the only one munching too many chiles lately
>:-).

yeah yeah, I'm a retard. AIX.

		Linus


From: torvalds@penguin.transmeta.com (Linus Torvalds)
Newsgroups: comp.arch,alt.folklore.computers
Subject: Re: Why only 24 bits on S/360?
Message-ID: <b42ro2$vg$1@penguin.transmeta.com>
Date: Tue, 4 Mar 2003 18:36:50 +0000 (UTC)

In article <slrnb69qel.ehb.nobody@animas.frontier.net>,
Allodoxaphobia  <NOSPAM_nobody@frontier.net> wrote:
>
>I assume that you are referring to the sort of "map the page tables
>linearly into the virtual address space" technique, that MIPS practiced,
>that Richard Uhlig et al published on.
>
>"TLB fill with a single [virtual] memory access."

Yes.

>(1) It's not clear to me if this has any advantage over an x86 style page
>table in the physical address space, with appropriate caching of interior
>nodes of the page table.

I agree, and I think what you describe is 100% equivalent to mapping the
interior nodes in the TLB itself - it's conceptually just splitting the
TLB into two.

Having a separate TLB for the interior nodes can be an advantage (ie
specialization in hardware sometimes allows for simplification and
targeting the hw better), but can be a disadvantage: sharing the TLB
caches of the different levels can obviously also be a good thing, if
the OS ends up using the virtual mappings itself.

I don't think it's a huge distinction either way.

One of the advantages of using a separate cache might be that the
explicit self-mapping with TLB entry sharing pretty much requires (for
sanity) that the different levels look pretty similar.  That limits your
flexibility, and it may not be what you want to do (but it has worked
pretty well on alpha, x86 and mips, so it doesn't seem to be a huge
problem either).

>(2) By the way, can I ask a question about the implementation details
>of the embedding of the page tables in the virtual address space?
>
>One of the ways that embedding can be done uses a simple address
>mapping function, something like
>        PageTableVirtualBaseAddress + VirtualAddress/PageSize*PTESize
>
>Now - let's see if I can remember this - I think there is a stationary point,
>or, actually a range of virtual addresses that are both the N and N-1
>page table entries, and so on, potentially all the way down.

Yes, if you use the "obvious" self-mapping, ie actually make one entry
in the highest level point to itself, you will effectively automatically
get mappings "all the way down", ie if you have a three-level tree, you
will get both levels 2 and 3 mapped virtually.
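
Concretely, with an assumed 32-bit, two-level 10+10+12 layout and
4-byte PTEs (just to keep the arithmetic short - a three-level tree
works the same way), making top-level slot R point back at the top
level gives you virtual addresses for both levels:

    #include <stdint.h>

    #define R         0x3FFu      /* top-level slot pointing to itself */
    #define PT_WINDOW (R << 22)   /* 4 MB window where all PTEs appear */

    /* Virtual address of the PTE mapping va: one virtual load. */
    static inline uint32_t pte_va(uint32_t va)
    {
        return PT_WINDOW | ((va >> 10) & 0x3FFFFCu);
    }

    /* Apply the trick twice and the next level up appears as well -
     * the "all the way down" mappings the self-referencing entry buys. */
    static inline uint32_t pde_va(uint32_t va)
    {
        return PT_WINDOW | (R << 12) | ((va >> 20) & 0xFFCu);
    }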

>This can lead to confusion if you support "concatenated" large pages
>- large pages that do not correspond to an entire interior node,
>but where, e.g., two adjacent page table entries are marked as
>belonging to the same 2X the normal size PTE. Similarly 4, 8, etc.

If you always fill out all entries even for concatenated entries (like
alpha did - the concatenation bits were architected to be only "hints"),
this causes no confusion. And you really have to do that anyway, to
avoid confusion at TLB fill time.
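
In other words, something along these lines (the bit names are made up;
the actual Alpha encoding differs):

    #include <stdint.h>
    #include <stddef.h>

    #define PTE_VALID  0x1ull
    #define PTE_HINT_8 0x2ull    /* "part of an aligned 8-page block" */

    /* Every entry in the concatenated range is also a correct
     * single-page mapping, so a fill that ignores the hint still
     * translates correctly - the hint is purely an optimization. */
    static void map_block8(uint64_t *pt, size_t idx, uint64_t pfn,
                           uint64_t prot)
    {
        for (size_t i = 0; i < 8; i++)
            pt[idx + i] = ((pfn + i) << 12) | prot | PTE_HINT_8 | PTE_VALID;
    }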

>Q: which did Alpha use?

Alpha page tables look like perfectly normal 3-level page tables (did
they ever architect the 4th level? Dunno) and superpage entries still
require each page table entry to be filled out to look like the correct
"single" entry - except with the size hint bits set.

			Linus


Date: Wed, 05 Mar 2003 13:57:00 -0600
From: Greg Pfister <pfister@us.ibm.com>
Newsgroups: comp.arch
Subject: Re: POWER hashes vs tree
Message-ID: <3e66569c_4@news1.prserv.net>

Maynard Handley wrote:
> In article <m3ptp5vnif.fsf@averell.firstfloor.org>,
>  Andi Kleen <freitag@alancoxonachip.com> wrote:
>
>
>>Maynard Handley <name99@redheron.com> writes:
>>
>>
>>>Linus' answer appears to be a combination of "they're idiots" and
>>>"they've spent so much effort learning them and they don't want to admit
>>>it was a waste of time". Now it is true that these frequently are
>>>factors in why people do things, but when I met Jim Kahle he struck me
>>>as neither an idiot nor someone obsessed with pride and an unwillingness
>>>to admit mistakes.
>>
>>Or perhaps they never evaluated trees for new CPUs? Just keeping the
>>hash is very likely the first choice for an ISA compatible next
>>generation chip.
>>
>>-Andi
>
>
> I find that unlikely. The only people who care about the PTE structure
> are the OS's, and IBM owns the OS's --- it's not like Intel dicking
> around with the x86 PTE setup. Certainly AIM have been perfectly willing
> to modify OS-visible state on PowerPCs.
> Meanwhile it's not like they don't continually get complained at about
> how their PTEs are different from everyone else's.

Before I reply to this,

WARNING: I am not an expert in this area, and have not been really
connected to it for quite a while. So what I say below MAY NOT HAVE
ANY relationship to reality.

That said:

It's my understanding that the hash table / inverse page table scheme
in POWER is (was?) heavily intertwined in both the memory management
and the file system of AIX. That connection is (was) viewed, at least
in earlier versions of AIX, as something significantly valuable for
reasons that I won't even pretend to understand, at least any more; I
may have known it long ago. It was also fiendishly complex, I heard.

If that's still the case (and I'm not sure it is), then changing that
aspect of the hardware would have radical, extensive, and expensive,
consequences for AIX.

Such interlocking isn't particularly common elsewhere, as far as I
know, and so probably isn't there in Linux. Nor is it part of the OSs
that run on PowerPC/AIM. So they don't have the same constraints, and,
for Linux in particular, they don't see why anybody would keep it.

Repeat: WARNING: I am not an expert in this area, and have not been
really connected to it for quite a while. So what I said above may not
have ANY relationship to reality.

I can comment with a few fewer caveats about how this came about.
Ancient history follows.

Previous comments in this thread about the hardware PT being in
essence an in-memory cache of part of the total virtual memory map are
precisely correct. The intent of the original inverse (not hashed)
page table design was to have a structure whose real memory use was
proportional to the amount of real memory, and not to the amount of
virtual memory; it was viewed as quite a clever technique to achieve that.
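
A sketch of the shape of a "true" inverted table (the fields and sizes
here are assumptions for illustration, not the actual POWER format):

    #include <stdint.h>
    #include <stddef.h>

    /* One entry per physical frame: the table's size tracks real
     * memory, no matter how much virtual space has been allocated. */
    struct ipt_entry {
        uint32_t pid;    /* owning address space       */
        uint32_t vpn;    /* virtual page number mapped */
    };

    #define NFRAMES (256u * 1024u)          /* e.g. 1 GB of 4 KB frames */

    static struct ipt_entry ipt[NFRAMES];

    /* Translation is a search keyed by (pid, vpn); the hashed variants
     * exist to make that search fast. */
    static long ipt_find(uint32_t pid, uint32_t vpn)
    {
        for (size_t f = 0; f < NFRAMES; f++)
            if (ipt[f].pid == pid && ipt[f].vpn == vpn)
                return (long)f;             /* physical frame number */
        return -1;
    }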

But why would that be important? Because the thing was actually
designed in the mid-70s. TLBs weren't anywhere near as big then, and,
more importantly, real memory was assumed to be a rather small
fraction of the virtual memory that was allocated. (Why it persisted
into the late 80s when the first "real" product was designed is an
interesting question that mostly involves sociology.)

Furthermore, there was also a notion at the time (obviously and, in my
personal opinion for good reason, later abandoned) that the right way
to do multiprocessing was to use a cluster of systems and shared
virtual memory. This would make the virtual memory mapping in use not
just the map of one machine, but rather the concatenation of the
mappings of many. That makes the VM map even bigger relative to each
machine's real memory size and makes the technique even more valuable
by limiting the required in-memory size of the mapping tables needed
to feed the TLBs.

This advantage of course became diluted when SMPs came along, since
they forced the table to not be a strict "inverse" page table -- one
entry per page of real memory -- in order to accommodate segments
shared among processors. I know that's true, since I designed the
architecture of those changes back in the early 90s, and the switch
from a true inverse table caused some major raised eyebrows. However,
for the life of me I don't exactly remember now why SMPs required
multiple entries for the same real page. Segments mapped to different
virtual memory areas in different processes, maybe? Whatever; I don't
recall. But I do strongly recall that it did.

All that, however, is truly ancient history. The part I WARNED about
is obviously more relevant to why it persists today.
--
Greg Pfister  http://pfister.userv.ibm.com/

