Newsgroups: fa.linux.kernel
From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile
Original-Message-ID: <a6te11$si7$1@penguin.transmeta.com>
Date: Fri, 15 Mar 2002 18:23:06 GMT
Message-ID: <fa.j2b04jv.126o3gf@ifi.uio.no>

In article <E16la2m-0000SX-00@starship>,
Daniel Phillips  <phillips@bonn-fries.net> wrote:
>On March 14, 2002 02:21 pm, Momchil Velikov wrote:
>>
>> Out of curiosity, why is there a need to update the Linux page tables?
>> Don't the pte/pmd/pgd family of functions provide enough abstraction
>> in order to maintain _only_ the hashed page table?
>
>No, it's hardwired to the x86 tree view of page translation.

No no no.

If you think that, then you don't see the big picture.

In fact, when I did the 3-level page tables for Linux, no x86 chips that
could _use_ three levels actually existed.

The Linux MM was actually _designed_ for portability when I did the port
to alpha (oh, that's a long time ago). I even wrote my master's thesis on
why it was done the way it was done (the only actual academic use I ever
got out of the whole Linux exercise ;)

Yes, a tree-based page table matches a lot of hardware architectures very
well.  And it's _not_ just x86: it also matches soft-fill TLB's better
than alternatives (newer sparcs and MIPS), and matches a number of other
architecture specifications (eg alpha, m68k).

So on about 50% of architectures (and 99.9% of machines), the Linux MM
data structures can be made to map 1:1 to the hardware constructs, so
that you avoid duplicate information.

But more importantly than that, the whole point really is that the page
table tree as far as Linux is concerned is nothing but an _abstraction_
of the VM mapping hardware. It so happens that a tree format is the only
sane format to keep full VM information that works well with real loads.

Whatever the hardware actually does, Linux considers that to be nothing
but an extended TLB.  When you can make the MM software tree map 1:1
with the extended TLB (as on x86), you win in memory usage and in
cheaper TLB invalidates, but you _could_ (if you wanted to) just keep
two separate trees.  In fact, with the rmap patches, that's exactly what
you see: the software tree is _not_ 1:1 with the hardware tree any more
(but it _is_ a proper superset, so that you can still get partial
sharing and still get the cheaper TLB updates).
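
To make the abstraction concrete: the pte/pmd/pgd helpers that started
this thread are the interface to exactly this software tree, and a lookup
through it goes roughly like the sketch below (illustrative only - the
wrapper function name is made up, and it assumes the usual <linux/mm.h>
and <asm/pgtable.h> definitions of that era):

	/*
	 * Sketch of a three-level walk through the software tree.  On x86
	 * the pmd level is folded away, so this same walk maps 1:1 onto
	 * the hardware page directory and page tables.
	 */
	static pte_t *lookup_pte(struct mm_struct *mm, unsigned long addr)
	{
		pgd_t *pgd = pgd_offset(mm, addr);	/* top level */
		pmd_t *pmd;

		if (pgd_none(*pgd) || pgd_bad(*pgd))
			return NULL;
		pmd = pmd_offset(pgd, addr);		/* middle level */
		if (pmd_none(*pmd) || pmd_bad(*pmd))
			return NULL;
		return pte_offset(pmd, addr);		/* the pte itself */
	}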

Are there machines where the sharing between the software abstraction
and the hardware isn't as total? Sure. But if you actually know how
hashed page tables work on ppc, you'd be aware of the fact that they
aren't actually able to do a full VM mapping - when a hash chain gets too
long, the hardware is no longer able to look it up ("too long" being 16
entries on a PPC, for example).

And that's a common situation with non-tree VM representations - they
aren't actually VM representations, they are just caches of what the
_real_ representation is.  And what do we call such caches? Right: they
are nothing but a TLB.

So the fact is, the Linux tree-based VM has _nothing_ to do with x86
tree-basedness, and everything to do with the fact that it's the only
sane way to keep VM information.

The fact that it maps 1:1 to the x86 trees with the "folding" of the mid
layer was a design consideration, for sure.  Being efficient and clever
is always good.  But the basic reason for tree-ness lies elsewhere.
(The basic reasons for tree-ness are why so many architectures _do_ use a
tree-based page table - you should think of PPC and ia64 as the sick
puppies who didn't understand.  Read the PPC documentation on virtual
memory, and you'll see just _how_ sick they are).

			Linus


Newsgroups: fa.linux.kernel
From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile
Original-Message-ID: <a6qsae$6er$1@penguin.transmeta.com>
Date: Thu, 14 Mar 2002 19:08:37 GMT
Message-ID: <fa.johu3qv.c7q2qs@ifi.uio.no>

In article <87wuwfxp25.fsf@fadata.bg>,
Momchil Velikov  <velco@fadata.bg> wrote:
>
>Out of curiosity, why is there a need to update the Linux page tables?
>Don't the pte/pmd/pgd family of functions provide enough abstraction
>in order to maintain _only_ the hashed page table?

No.  The IBM hashed page tables are not page tables at all, they are
really just a bigger 16-way set-associative in-memory TLB.

You can't actually sanely keep track of VM layout in them.

Those POWER4 machines are wonderful things, but they have a few quirks:

 - it's so expensive that anybody who is slightly price-conscious gets a
   farm of PC's instead. Oh, well.

 - the CPU module alone is something like .5 kilowatts (translation:
   don't expect it in a nice desktop form factor, even if you could afford
   it).

 - IBM nomenclature really is broken. They call disks DASD devices, and
   they call their hash table a page table, and they just confuse
   themselves and everybody else for no good reason.  They number bits
   the wrong way around, for example (and big-endian bit-ordering really
   _is_ clearly inferior to little-endian, unlike byte-ordering.  Watch
   the _same_ bits in the _same_ register change name in the 32 vs
   64-bit architecture manuals, and puke)

But with all their faults, they do have this really studly setup with 8
big, fast CPU's on a single module. A few of those modules and you get
some ass-kicking performance numbers. As you can see.

		Linus


Newsgroups: fa.linux.kernel
From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: 7.52 second kernel compile
Original-Message-ID: <a6uubq$uqr$1@penguin.transmeta.com>
Date: Sat, 16 Mar 2002 08:09:16 GMT
Message-ID: <fa.jv2m4sv.7mi3op@ifi.uio.no>

In article <20020316061535.GA16653@krispykreme>,
Anton Blanchard  <anton@samba.org> wrote:
>
>hardware: 32 way logical partition, 1.1GHz POWER4, 60G RAM

It's interesting to see that scalability doesn't seem to be the #1
problem by a long shot.

>7.52 seconds is not a bad result for something running under a hypervisor.
>The profile looks much better now. We still spend a lot of time flushing tlb
>entries but we can look into batching them.

I wonder if you wouldn't be better off just getting rid of the TLB range
flush altogether, and instead making it select a new VSID in the segment
register, and just forgetting about the old TLB contents entirely.

Then, when you do a TLB miss, you just re-use any hash table entries
that have a stale VSID.

It seems that you spend _way_ too much time actually trying to
physically invalidate the hashtables, which sounds like a total waste to
me. Especially as going through them to see whether they need to be
invalidated has to be a horrible thing for the dcache.

It would also be interesting to hear if you can just make the hash table
smaller (I forget the details of 64-bit ppc VM horrors, thank God!) or
just bypass it altogether (at least the 604e used to be able to just
disable the stupid hashing altogether and make the whole thing much
saner).

Note that the official IBM "minimum recommended page table sizes" stuff
looks like total and utter crap.  Those tables have nothing to do with
sanity, and everything to do with a crappy OS called AIX that takes
forever to fill the hashes.  You should probably make them the minimum
size (which, if I remember correctly, is still quite a large amount of
memory thrown away on a TLB) if you can't just disable them altogether.

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203160953420.31850-100000@penguin.transmeta.com>
Date: Sat, 16 Mar 2002 18:09:37 GMT
Message-ID: <fa.n8ld85v.cnei8a@ifi.uio.no>

On Sat, 16 Mar 2002, Paul Mackerras wrote:
>
> > But more importantly than that, the whole point really is that the page
> > table tree as far as Linux is concerned is nothing but an _abstraction_
> > of the VM mapping hardware. It so happens that a tree format is the only
> > sane format to keep full VM information that works well with real loads.
>
> Is that still true when we get to wanting to support a full 64-bit
> address space?

Yup.

We'll end up (probably five years from now) re-doing the thing to allow
four levels (so a tired old x86 would fold _two_ levels instead of just
one, but I bet they'll still be the majority), simply because with three
levels you reasonably reach only about 41 bits of VM space.

With four levels you get 50+ bits of VM space, and since the kernel etc
wants a few bits, we're just in the right range.
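
(Back-of-the-envelope, assuming 4kB pages - ie 12 offset bits - and
somewhere between 9 and 10 address bits per level depending on how big
the entries are:

	3 levels:  3*9 + 12 = 39 bits   (3*10 + 12 = 42 at best)
	4 levels:  4*9 + 12 = 48 bits   (4*10 + 12 = 52 at best)

which is roughly where the "about 41 bits" and "50+ bits" above come
from.)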

>		  Given that we can already tolerate losing PTEs for
> resident pages from the page tables quite happily (since they can be
> reconstructed from the information in the vm_area_structs and the page
> cache)

Wrong. Look again. The most common case of all (anonymous pages) can NOT
be reconstructed.

You're making the same mistake IBM did originally.

If it needs reconstructing, it's a TLB.

And if it is a TLB, then it shouldn't be so damn big in the first place,
because then you get horrible overhead for flushing.

An in-memory TLB is fine, but it should be understood that that is _all_
that it is. You can make the in-memory TLB be a tree if you want to, but
if it depends on reconstructing then the tree is pointless - you might as
well use something that isn't able to hold the whole address space in the
first place.

And a big TLB (whether tree-based or hashed or whatever) is bad if it is so
big that building it up and tearing it down takes a noticeable amount of
time. Which it obviously does on PPC64 - numbers talk.

What IBM should do is

 - face up to their hashes being so big that building them up is a real
   performance problem. It was ok for long-running fortran and database
   programs, but it _sucks_ for any other load.

 - make a nice big on-chip L2 TLB to make their legacy stuff happy (the
   same legacy stuff that is so slow at filling the TLB in software that
   they needed the humungous hashtables in the first place).

Repeat after me: there are TLB's (reconstructive caches) and there are
page tables (real VM information). Get them straight.

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203161037160.31913-100000@penguin.transmeta.com>
Date: Sat, 16 Mar 2002 18:48:48 GMT
Message-ID: <fa.n9537tv.f7kjga@ifi.uio.no>

On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:

> On Sat, Mar 16, 2002 at 10:06:06AM -0800, Linus Torvalds wrote:
> > We'll end up (probably five years from now) re-doing the thing to allow
> > four levels (so a tired old x86 would fold _two_ levels instead of just
> > one, but I bet they'll still be the majority), simply because with three
> > levels you reasonably reach only about 41 bits of VM space.
>
> Why so few bits per level? Don't you want bigger pages or page clusters?

Simply because I want to be able to share the software page tables with
the hardware page tables.

Not sharing the page tables implies that you have to have two copies and
keep them in sync, which is almost certainly going to make fork/exec just
suck badly.

Now, I agree with you that in the long run all the hardware will just use
a 64k pagesize or we'll have extents or whatever.  I'm not saying that
trees are the only good way to populate a VM, it's just the currently
dominant way, and as long as it's the dominant way I want to have the
common mapping be the 1:1 mapping.

			Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203161102070.31913-100000@penguin.transmeta.com>
Date: Sat, 16 Mar 2002 19:19:37 GMT
Message-ID: <fa.n8l775v.fnkj8f@ifi.uio.no>

On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
> >
> > Simply because I want to be able to share the software page tables with
> > the hardware page tables.
>
> Isn't this only an issue when the hardware wants to search the tables?
> So for a semi-sane architecture, the hardware idea of pte is only important
> in the tlb.

Show me a semi-sane architecture that _matters_ from a commercial angle.

The x86 is actually fairly good: a sane data structure that allows it to
fill multiple pages in one go (the page size may be just 4kB, but the x86
TLB fill rate is pretty impressive - I _think_ Intel actually fills a
whole cacheline worth of tlb entries - 8 pages - per miss).

But the x86 page table structure is fairly rigid, and is in practice
limited to 4kB entries for normal user pages, and 4kB page table entries.

> is there a 64 bit machine with hardware search of pagetables? Even ibm
> only has a hardware search of hash tables - which we agree are simply
> a means of making your hardware TLB larger and slower.

ia64 makes the same mistake, I think.

But alpha does a "pseudo-hardware" fill of page tables, ie as far as the
OS is concerned you might as well consider it hardware. And that is
actually limited to 8kB pages (with a "fast fill" feature in the form of
page size hints - a cheaper version of what Intel seems to do).

The upcoming hammer stuff from AMD is also 64-bit, and apparently a
four-level page table, each with 512 entries and 4kB pages. So there you
get 9+9+9+9+12=48 bits of VM space, which is plenty. Linux won't be able
to take advantage of more than 39 bits of it until we switch to four
levels, of course (39 bits is plenty good enough too, for the next few
years, and we'll have no pain in expanding to 48 when that day comes).

So yes, there are 64-bit chips with hardware (or architecture-specified)
page tables. And I personally like how Hammer looks more than the ia64 VM
horror.

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203161144340.31971-100000@penguin.transmeta.com>
Date: Sat, 16 Mar 2002 20:01:42 GMT
Message-ID: <fa.n95186v.f7ii8f@ifi.uio.no>

On Sat, 16 Mar 2002, David Mosberger wrote:
>
> ia64 has an optional hardware walker which can operate in "hashed"
> mode or in "virtually mapped linear page table mode".  If you think
> you can do a TLB lookup faster in software, you can turn the walker
> off.

I used to be a sw fill proponent, but I've grown personally convinced that
while sw fill is good, it needs a few things:

 - large on-chip TLB to avoid excessive thrashing (ie preferably thousands
   of entries)

   This implies that the TLB should be split into a L1 and a L2, for all
   the same reasons you split other caches that way (and with the L1
   probably being duplicated among all memory units)

 - ability to fill multiple entries in one go to offset the cost of taking
   the trap.

Without that kind of support, the flexibility advantages of a sw fill just
aren't enough to offset the advantage you can get from doing it in
hardware (mainly the ability to not have to break your pipeline).

An in-memory hash table can of course be that L2, but I have this strong
suspicion that a forward-looking chip engineer would just have put the L2
on the die and made it architecturally invisible (so that Moore's law can
trivially make it bigger in years to come).

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203161214380.31971-100000@penguin.transmeta.com>
Date: Sat, 16 Mar 2002 20:27:15 GMT
Message-ID: <fa.n95b7ev.f7cj0f@ifi.uio.no>

On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
>
> AMD claims L1, L2 and with hammer an I/D split as well.

Oh, people have done L1/L2 TLB splits for a long time. The two-level TLB
exists in Athlon (and I think nexgen did it in the x86 space almost 10
years ago, and that's probably what got AMD into that game). Others have
done it too.

And people have done split TLB's (I/D split is quite common, duplicated by
memory unit is getting so).

But multiple entries loaded at a time?

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203161238510.32013-100000@penguin.transmeta.com>
Date: Sat, 16 Mar 2002 20:50:26 GMT
Message-ID: <fa.nb4v5lv.d7oho5@ifi.uio.no>

On Sat, 16 Mar 2002, David Mosberger wrote:
>
> Yes, Itanium has a two-level DTLB, McKinley has both ITLB and DTLB
> split into two levels.  Not quite as big though: "only" on the order
> of hundreds of entries (partially offset by larger page sizes).  Of
> course, operating the hardware walker in hashed mode can give you an
> L3 TLB as large as you want it to be.

The problem with caches is that if they are not coherent (and TLB's
generally aren't) you need to invalidate them by hand. And if they are
in main memory, that invalidation can be expensive.

Which brings us back to the whole reason for the discussion: this is not a
theoretical argument. Look at the POWER4 numbers, and _shudder_ at the
expense of cache invalidation.

NOTE! The goodness of a cache is not in its size, but how quickly you can
fill it, and what the hitrate is. I'd be very surprised if you get
noticeably higher hitrates from "as large as you want it to be" than from
"a few thousand entries that trivially fit on the die".

And I will guarantee that the on-die ones are faster to fill, and much
faster to invalidate (on-die it is fairly easy to do content-
addressability if you limit the addressing to just a few ways - off-chip
memory is not).

>   Linus>  - ability to fill multiple entries in one go to offset the
>   Linus> cost of taking the trap.
>
> The software fill can definitely do that.  I think it's one area where
> some interesting experimentation could happen.

If you can do it, and you don't do it already, you're just throwing away
cycles. If that was your comparison with the "superior hardware fill", it
really wasn't very fair.

Note that by "multiple entry support" I don't mean just a loop that adds
noticeable overhead for each entry - I mean something which can fairly
efficiently load contiguous entries pretty much in "one go". A TLB fill
routine can't afford to spend time setting up tag registers etc.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203161158320.31971-100000@penguin.transmeta.com>
Date: Sat, 16 Mar 2002 20:06:01 GMT
Message-ID: <fa.n94t8ev.f7ui03@ifi.uio.no>

On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
> On Sat, Mar 16, 2002 at 11:16:16AM -0800, Linus Torvalds wrote:
> > Show me a semi-sane architecture that _matters_ from a commercial angle.
>
> I thought we were into this for the pure technical thrill-)

I don't know about you, but to me the difference between technological
thrill and masturbation is that real technology actually matters to real
people.

I'm not in it for some theoretical good. I want my code to make _sense_.

> > page tables. And I personally like how Hammer looks more than the ia64 VM
> > horror.
>
> No kidding. But  I want TLB load instructions.

TLB load instructions + hardware walking just do not make much sense if
you allow the loaded entries to be victimized.

Of course, you can have a separate "lock this TLB entry that I give you"
thing, which can be useful for real-time, and can also be useful for
having per-CPU data areas.

But then you might as well consider that a BAT register ("block address
translation", ppc has those too), and separate from the TLB.

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203161046400.31913-100000@penguin.transmeta.com>
Date: Sat, 16 Mar 2002 19:05:21 GMT
Message-ID: <fa.nakn85v.dnoi8b@ifi.uio.no>

On Sat, 16 Mar 2002, Daniel Phillips wrote:
>
> I meant that the functions are hardwired to the tree structure, which they
> certainly are

Oh yes.

Sure, you can abstract the VM stuff much more - and many people do, to the
point of actually having a per-architecture VM with very little shared
information.

The thing I like about the explicit tree is that while it _is_ an abstract
data structure, it's also a data structure whose mapping to the actual
hardware people are very much aware of, which means that the abstraction
doesn't come with a performance penalty.

(There are two kinds of performance penalties in abstractions: (a) just
the translation overhead for compilers etc, and (b) the _mental_ overhead
of doing the wrong thing because you don't think of what it actually means
in terms of hardware).

Now, the linux tree abstraction is obviously _so_ close to a common set of
hardware that many people don't realize at all that it's really meant to
be an abstraction (albeit one with a good mapping to reality).

> It could be a lot more abstract than that.  Chuck Cranor's UVM (which seems
> to bear some sort of filial relationship to the FreeBSD VM) buries all
> accesses to the page table behind a 'pmap' API, and implements the standard
> Unix VM semantics at the 'memory object' level.

Who knows, maybe we'll change the abstraction in Linux some day too..
However, I personally tend to prefer "thin" abstractions that don't hide
details.

The problem with the thick abstractions ("high level") is that they often
lead you down the wrong path. You start thinking that it's really cheap to
share partial address spaces etc ("hey, I just map this 'memory object'
into another process, and it's just a matter of one linked list operation
and incrementing a reference count").

Until you realize that the actual sharing still implies a TLB switch
between the two "threads", and that you need to instantiate the TLB in
both processes etc. And suddenly that instantiation is actually the _real_
cost - and your clever highlevel abstraction was actually a lot more
expensive than you realized.

[ Side note: I'm very biased by reality. In theory, a non-page-table based
  approach which used only a front-side TLB and a fast lookup into higher-
  level abstractions might be a really nice setup. However, in practice,
  the world is 99%+ based on hardware that natively looks up the TLB in a
  tree, and is really good at it too.  So I'm biased. I'd rather do good
  on the 99% than care about some theoretical 1% ]

			Linus



Newsgroups: fa.linux.kernel
From: Paul Mackerras <paulus@samba.org>
Original-Message-ID: <15507.9919.92453.811733@argo.ozlabs.ibm.com>
Subject: Re: 7.52 second kernel compile
Date: Sat, 16 Mar 2002 11:10:28 GMT
Message-ID: <fa.hdv5mnv.1rmsa9m@ifi.uio.no>

Linus Torvalds writes:

> I wonder if you wouldn't be better off just getting rid of the TLB range
> flush altogether, and instead making it select a new VSID in the segment
> register, and just forgetting about the old TLB contents entirely.
>
> Then, when you do a TLB miss, you just re-use any hash table entries
> that have a stale VSID.

We used to do something a bit like that on ppc32 - flush_tlb_mm would
just assign a new mmu context number to the task, which translates
into a new set of VSIDs.  We didn't do the second part, reusing hash
table entries with stale VSIDs, because we couldn't see a good fast
way to tell whether a given VSID was stale.  Instead, when the hash
bucket filled up, we just picked an entry to overwrite semi-randomly.

It turned out that the stale VSIDs were causing us various problems,
particularly on SMP, so I tried a solution that always cleared all the
hash table entries, using a bit in the linux pte to say whether there
was (or had ever been) a hash table entry corresponding to that pte as
an optimization to avoid doing unnecessary hash lookups.  To my
surprise, that turned out to be faster, so that's what we do now.
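
(Roughly, on the flush side that optimization looks something like the
sketch below - simplified, not the actual arch/ppc code, and the helper
names are only approximate:

	/* only search the hash table if this Linux pte was ever
	   turned into a hash table entry */
	if (pte_val(*ptep) & _PAGE_HASHPTE)
		flush_hash_page(ctx, addr);

so a pte that never made it into the hash skips the lookup entirely.)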

Your suggestion has the problem that when you get to needing to reuse
one of the VSIDs that you have thrown away, it becomes very difficult
and expensive to ensure that there aren't any stale hash table entries
left around for that VSID - particularly on a system with logical
partitioning where we don't control the size of the hash table.  And
there is a finite number of VSIDs so you have to reuse them sooner or
later.

[For those not familiar with the PPC MMU, think of the VSID as an MMU
context number, but separately settable for each 256MB of the virtual
address space.]

> It would also be interesting to hear if you can just make the hash table
> smaller (I forget the details of 64-bit ppc VM horrors, thank God!) or

On ppc32 we use a hash table 1/4 of the recommended size and it works
fine.

> just bypass it altogether (at least the 604e used to be able to just
> disable the stupid hashing altogether and make the whole thing much
> saner).

That was the 603, actually.  In fact the newest G4 processors also let
you do this.  When I get hold of a machine with one of these new G4
chips I'm going to try it again and see how much faster it goes
without the hash table.

One other thing - I would *love* it if we could get rid of
flush_tlb_all and replace it with a flush_tlb_kernel_range, so that
_all_ of the flush_tlb_* functions tell us what address(es) we need to
invalidate, and let the architecture code decide whether a complete
TLB flush is justified.

Paul.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: 7.52 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203161020230.31850-100000@penguin.transmeta.com>
Date: Sat, 16 Mar 2002 18:35:17 GMT
Message-ID: <fa.n84t7dv.f7ujg9@ifi.uio.no>

On Sat, 16 Mar 2002, Paul Mackerras wrote:
>
> Your suggestion has the problem that when you get to needing to reuse
> one of the VSIDs that you have thrown away, it becomes very difficult
> and expensive to ensure that there aren't any stale hash table entries
> left around for that VSID - particularly on a system with logical
> partitioning where we don't control the size of the hash table.

But the VSID is something like 20 bits, no? So the re-use is a fairly
uncommon thing, in the end.

Remember: think about the hashes as just TLB's, and the VSID's are just
address space identifiers (yeah, yeah, you can have several VSID's per
process at least in 32-bit mode, I don't remember the 64-bit thing). So
what you do is the same thing alpha does with its 6-bit ASN thing: when
you wrap around, you blast the whole TLB to kingdom come.

The alpha wraps around a lot more often with just 6 bits, but on the other
hand it's a lot cheaper to get rid of the TLB too, so it evens out.

Yeah, there are latency issues, but that can be handled by just switching
the hash table base: you have two hash tables, and whenever you increment
the VSID you clear a small part of the other table, designed so that when
the VSID wraps around the other table is 100% clear, and you just switch
the two.

You _can_ switch the hash table base around on ppc64, can't you?

So now the VM invalidate becomes

	++vsid;
	partial_clear_secondary_hash();
	if (vsid > MAXVSID) {
		vsid = 0;
		switch_hashes();
	}

> > just bypass it altogether (at least the 604e used to be able to just
> > disable the stupid hashing altogether and make the whole thing much
> > saner).
>
> That was the 603, actually.

Ahh, my mind is going.

>			  In fact the newest G4 processors also let
> you do this.  When I get hold of a machine with one of these new G4
> chips I'm going to try it again and see how much faster it goes
> without the hash table.

Maybe somebody is seeing the light.

> One other thing - I would *love* it if we could get rid of
> flush_tlb_all and replace it with a flush_tlb_kernel_range, so that
> _all_ of the flush_tlb_* functions tell us what address(es) we need to
> invalidate, and let the architecture code decide whether a complete
> TLB flush is justified.

Sure, sounds reasonable.

		Linus



Newsgroups: fa.linux.kernel
From: Paul Mackerras <paulus@samba.org>
Original-Message-ID: <15507.63649.587641.538009@argo.ozlabs.ibm.com>
Subject: Re: 7.52 second kernel compile
Date: Sun, 17 Mar 2002 02:05:20 GMT
Message-ID: <fa.kcikp6v.3hku3d@ifi.uio.no>

Linus Torvalds writes:

> Remember: think about the hashes as just TLB's, and the VSID's are just
> address space identifiers (yeah, yeah, you can have several VSID's per
> process at least in 32-bit mode, I don't remember the 64-bit thing). So
> what you do is the same thing alpha does with it's 6-bit ASN thing: when
> you wrap around, you blast the whole TLB to kingdom come.

I have performance measurements that show that having stale hash-table
entries cluttering up the hash table hurts performance more than
taking the time to get rid of them does.  This is on ppc32 using
kernel compiles and lmbench as the performance measures.

> You _can_ switch the hash table base around on ppc64, can't you?

Not when running under a hypervisor (i.e. on a logically-partitioned
system), unfortunately.  It _may_ be possible to choose the VSIDs so
that we only use half (or less) of the hash table at any time.

> Maybe somebody is seeing the light.

Maybe.  Whenever I have been asked what hardware features should be
added to PPC chips to make Linux run better, I usually put having an
option for software loading of the TLB pretty high on the list.

However, one good argument against software TLB loading that I have
heard (and which you mentioned in another message) is that loading a
TLB entry in software requires taking an exception, which requires
synchronizing the pipeline, which is expensive.  With hardware TLB
reload you can just freeze the pipeline while the hardware does a
couple of fetches from memory.  And PPC64 remains the only
architecture I know of that supports a full 64-bit virtual address
space _and_ can do hardware TLB reload.

I would be interested to see measurements of how many cache misses on
average each hardware TLB reload takes; for a hash table I expect it
would be about 1, for a 3-level tree I expect it would be very
dependent on access pattern but I wouldn't be surprised if it averaged
about 1 also on real workloads.

But this is all a bit academic, the real question is how do we deal
most efficiently with the real hardware that is out there.  And if you
want a 7.5 second kernel compile the only option currently available
is a machine whose MMU uses a hash table. :)

Paul.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: 7.52 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203161834250.1591-100000@penguin.transmeta.com>
Date: Sun, 17 Mar 2002 02:48:50 GMT
Message-ID: <fa.o9q9f7v.g4u7bc@ifi.uio.no>

On Sun, 17 Mar 2002, Paul Mackerras wrote:
>
> But this is all a bit academic, the real question is how do we deal
> most efficiently with the real hardware that is out there.  And if you
> want a 7.5 second kernel compile the only option currently available
> is a machine whose MMU uses a hash table. :)

Yeah, at a cost of $2M+, if I'm not mistaken. I think I'll settle for my 2
minute time that is actually available to mere mortals at a small fraction
of one percent of that ;)

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203161203050.31971-100000@penguin.transmeta.com>
Date: Sat, 16 Mar 2002 20:17:16 GMT
Message-ID: <fa.n7l576v.enmj88@ifi.uio.no>

On Sat, 16 Mar 2002 yodaiken@fsmlabs.com wrote:
>
> What about 2M pages?

Not useful for generic loads right now, and the latencies for clearing or
copying them etc (ie single page faults - nopage or COW) are still
big enough that it would likely be a performance problem at that level.
And while doing IO in 2MB chunks sounds like fun, since most files are
still just a few kB, your page cache memory overhead would be prohibitive
(ie even if you had 64GB of memory, you might want to cache more than a
few thousand files at the same time).

So then you'd need to do page caching at a finer granularity than you do
mmap, which implies absolutely horrible things from a coherency standpoint
(mmap/read/write are supposed to be coherent in a nice UNIX - even if
there are some non-nice unixes still around).

We may get there some day, but right now 2M pages are not usable for user
access.

64kB would be fine, though.

Oh, and in the specific case of hammer, one of the main advantages of the
thing is of course running old binaries unchanged. And old binaries
certainly do mmap's at smaller granularity than 2M (and have to, because a
3G user address space won't fit all that many 2M chunks).

Give up on large pages - it's just not happening. Even when a 64kB page
would make sense from a technology standpoint these days, backwards
compatibility makes people stay at 4kB.

Instead of large pages, you should be asking for larger and wider TLB's
(for example, nothing says that a TLB entry has to be a single page:
people already do the kind of "super-entries", where one TLB entry
actually contains data for 4 or 8 aligned pages, so you get the _effect_
of a 32kB page that really is 8 consecutive 4kB pages).

Such a "wide" TLB entry has all the advantages of small pages (no
memory fragmentation, backwards compatibility etc), while still being able
to load 64kB worth of translations in one go.

(One of the advantages of a page table tree over a hashed setup becomes
clear in this kind of situation: you cannot usefully load multiple entries
from the same cacheline into one TLB entry in a hashed table, while in a
tree it's truly trivial)

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [Lse-tech] Re: 10.31 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203161236160.32013-100000@penguin.transmeta.com>
Date: Sat, 16 Mar 2002 20:44:26 GMT
Message-ID: <fa.n9595lv.f7mhob@ifi.uio.no>

On Sat, 16 Mar 2002, Richard Gooch wrote:
>
> These are contiguous physical pages, or just logical (virtual) pages?

Contiguous virtual pages, but discontiguous physical pages.

The advantage being that you only need one set of virtual tags per "wide"
entry, and you just fill the whole wide entry directly from the cacheline
(ie the TLB entry is not really 32 bits any more, it's a full cacheline).

The _real_ advantage being that it should be totally invisible to
software. I think Intel does something like this, but the point is, I
don't even have to know, and it still works.

			Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: 7.52 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203181146070.4783-100000@home.transmeta.com>
Date: Mon, 18 Mar 2002 20:06:49 GMT
Message-ID: <fa.m6eefav.14gka0i@ifi.uio.no>

On Mon, 18 Mar 2002, Cort Dougan wrote:
>
> I have a counter-proposal.  How about a hardware tlb load (if we must have
> one) that does the right thing?

Well, I actually think that an x86 comes fairly close.

Hashes simply do not do the right thing. You cannot do a speculative load
from a hash, and the hash overhead gets _bigger_ for TLB loads that miss
(ie they optimize for the hit case, which is the wrong optimization if the
on-chip TLB is big enough - and since Moore's law says that the on-chip
TLB _will_ be big enough, that's just stupid).

Basic premise in caching: hardware gets better, and misses go down.

Which implies that misses due to cache contention are misses that go away
over time, while forced misses (due to program startup etc) matter more
and more over time.

Ergo, you want to make build-up fast, because that's where you can't avoid
the work by trivially just making your caches bigger. So you want to have
architecture support for aggressive TLB pre-loading.

> I still think there are some clever tricks one could do with the VSID's to
> get a much saner system than the current hash table.  It would take some
> serious work I think but the results could be worthwhile.  By carefully
> choosing the VSID scatter algorithm and the size of the hash table I think
> one could get a much better access method.

But the whole point of _scattering_ is so incredibly broken in itself!
Don't do it.

You can load many TLB entries in one go, if you just keep them close-by to
each other. Load them into a prefetch-buffer (so that you don't dirty your
real TLB with speculative TLB loads), and since there tends to be locality
to TLB's, you've just automatically speeded up your hardware walker.

In contrast, a hash algorithm automatically means that you cannot sanely
do this _trivial_ optimization.

Face it, hashes are BAD for things like this.

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: 7.52 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203181213130.12950-100000@home.transmeta.com>
Date: Mon, 18 Mar 2002 20:26:29 GMT
Message-ID: <fa.l0cjoqv.1nh001n@ifi.uio.no>

On Mon, 18 Mar 2002, Linus Torvalds wrote:
>
> Well, I actually think that an x86 comes fairly close.

Btw, here's a program that does a simple histogram of TLB miss cost, and
shows the interesting pattern on intel I was talking about: every 8th miss
is most costly, apparently because Intel pre-fetches 8 TLB entries at a
time.

So on a PII core, you'll see something like

	  87.50: 36
	  12.39: 40

ie 87.5% (exactly 7/8) of the TLB misses take 36 cycles, while 12.4% (ie
1/8) take 40 cycles (and I assume that the extra 4 cycles are due to
actually loading the thing from the data cache).

Yeah, my program might be buggy, so take the numbers with a pinch of salt.
But it's interesting to see how on an athlon the numbers are

	   3.17: 59
	  34.94: 62
	   4.71: 85
	  54.83: 88

ie roughly 60% take 85-90 cycles, and 40% take ~60 cycles. I don't know
where that pattern would come from..

What are the ppc numbers like (after modifying the rdtsc implementation,
of course)? I suspect you'll get a less clear distribution depending on
whether the hash lookup ends up hitting in the primary or secondary hash,
and where in the list it hits, but..

			Linus

-----
#include <stdio.h>
#include <stdlib.h>

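/* rdtsc: read the low 32 bits of the CPU cycle counter
   (the high half, edx, is just clobbered) */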
#define rdtsc(low) \
   __asm__ __volatile__("rdtsc" : "=a" (low) : : "edx")

#define MAXTIMES 1000
#define BUFSIZE (128*1024*1024)
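/* force a real (volatile) memory read that the compiler can't optimize away */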
#define access(x) (*(volatile unsigned int *)&(x))

int main()
{
	unsigned int i, j;
	static int times[MAXTIMES];
	char *buffer = malloc(BUFSIZE);

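	/* touch every page once up front, so that the timed loop below
	   measures TLB misses rather than first-touch page faults */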
	for (i = 0; i < BUFSIZE; i += 4096)
		access(buffer[i]);
	for (i = 0; i < MAXTIMES; i++)
		times[i] = 0;
	for (j = 0; j < 100; j++) {
		for (i = 0; i < BUFSIZE ; i+= 4096) {
			unsigned long start, end;

			rdtsc(start);
			access(buffer[i]);
			rdtsc(end);
			end -= start;
			if (end >= MAXTIMES)
				end = MAXTIMES-1;
			times[end]++;
		}
	}
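	/* each of the 100 passes adds BUFSIZE/4096 samples, so dividing a
	   bucket count by BUFSIZE/4096 gives that bucket as a percentage */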
	for (i = 0; i < MAXTIMES; i++) {
		int count = times[i];
		double percent = (double)count / (BUFSIZE/4096);
		if (percent < 1)
			continue;
		printf("%7.2f: %d\n", percent, i);
	}
	return 0;
}



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: 7.52 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203181357160.1457-100000@home.transmeta.com>
Date: Mon, 18 Mar 2002 22:02:18 GMT
Message-ID: <fa.m5eaeqv.16gk90n@ifi.uio.no>

On Mon, 18 Mar 2002, Cort Dougan wrote:
>
> } But the whole point of _scattering_ is so incredibly broken in itself!
> } Don't do it.
>
> Yes, that is indeed correct theoretically.  The problem is that we actually
> measured it and there was very little locality.  When I added some
> multiple-tlb loads it actually decreased wall-clock performance for nearly
> every user load I put on the machine.

This is what I meant by hardware support for multiple loads - you mustn't
let speculative TLB loads displace real TLB entries, for example.

> Linus, I knew that deep in my heart 8 years ago when I started in on all
> this.  I'm with you but I'm not good enough with a soldering iron to fix
> every powerpc out there that forces that crappy IBM spawned madness upon
> us.

Oh, I agree, we can't fix existing broken hardware, we'll have to just live
with it.

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: 7.52 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203181434440.10517-100000@penguin.transmeta.com>
Date: Mon, 18 Mar 2002 22:50:14 GMT
Message-ID: <fa.nbl56tv.enqgg7@ifi.uio.no>

On Mon, 18 Mar 2002, Dieter Nützel wrote:
>
> it seems to be that it depends on gcc and flags.

That instability doesn't seem to show up on a PII. Interesting. Looks like
the athlon may be reordering TLB accesses, while the PII apparently
doesn't.

Or maybe the program is just flawed, and the interesting 1/8 pattern comes
from something else altogether.

			Linus



Newsgroups: fa.linux.kernel
From: Paul Mackerras <paulus@samba.org>
Original-Message-ID: <15510.32200.595707.145452@argo.ozlabs.ibm.com>
Subject: Re: 7.52 second kernel compile
Date: Tue, 19 Mar 2002 00:48:11 GMT
Message-ID: <fa.k22kmnv.618sjd@ifi.uio.no>

Linus Torvalds writes:

> Oh, your cycle timer is too slow to be interesting, apparently ;(

The G4 has 4 performance monitor counters that you can set up to
measure things like ITLB misses, DTLB misses, cycles spent doing
tablewalks for ITLB misses and DTLB misses, etc.  I hacked up a
measurement of the misses and total cycles doing tablewalks during a
kernel compile and got an average of 36 cycles per DTLB miss and 40
cycles per ITLB miss on a 500MHz G4 machine.  What I need to do now is
to put some better infrastructure for using those counters in place
and try your program using those counters instead of the timebase.

Paul.


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: 7.52 second kernel compile
Original-Message-ID: <Pine.LNX.4.33.0203181805460.10711-100000@penguin.transmeta.com>
Date: Tue, 19 Mar 2002 02:12:02 GMT
Message-ID: <fa.n8lh6lv.dn6go6@ifi.uio.no>

On Mon, 18 Mar 2002, David S. Miller wrote:
>
>    Or maybe the program is just flawed, and the interesting 1/8 pattern comes
>    from something else altogether.
>
> I think the weird Athlon behavior has to do with the fact that
> you've made your little test program as much of a cache tester
> as a TLB tester :-)

Oh, I was assuming that malloc(BIG) would do a mmap() of MAP_ANONYMOUS,
which should make all the pages 100% shared, and thus basically zero cache
overhead on a physically indexed machine like an x86.

So it was designed to really only stress the TLB, not the regular caches.

Although I have to admit that I didn't actually _test_ that hypothesis.

		Linus


Newsgroups: fa.linux.kernel
From: Paul Mackerras <paulus@samba.org>
Original-Message-ID: <15510.42401.731136.2503@argo.ozlabs.ibm.com>
Subject: Re: 7.52 second kernel compile
Date: Tue, 19 Mar 2002 03:21:29 GMT
Message-ID: <fa.hfu7kfv.1nmq81a@ifi.uio.no>

Linus Torvalds writes:

> Btw, here's a program that does a simple histogram of TLB miss cost, and
> shows the interesting pattern on intel I was talking about: every 8th miss
> is most costly, apparently because Intel pre-fetches 8 TLB entries at a
> time.

Here are the results on my 500Mhz G4 laptop:

   1.85: 22
  17.86: 26
  14.41: 28
  16.88: 42
  34.03: 46
   9.61: 48
   2.07: 88
   1.04: 90

The numbers are fairly repeatable except that the last two tend to
wobble around a little.  These are numbers of cycles obtained using
one of the performance monitor counters set to count every cycle.
The average is 40.6 cycles.

This was with a 512kB MMU hash table, which translates to 8192 hash
buckets each holding 8 ptes.  The machine has 1MB of L2 cache.

Paul.

