From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: branch prediction, renaming, joins
Message-ID: <br2njp$dpp$1@build.pdx.osdl.net>
Date: Mon, 08 Dec 2003 20:50:02 GMT

Sander Vesik wrote:
>
> It might allow you to create longer basic blocks to do instruction
> scheduling and register allocation over. Yes, it needs to have peek-ahead
> and heuristics so that the other side isn't adversely affected.

Having spent more time than I really want to looking at compiler output, I
can say that
 - the biggest problems often tend to be register pressure
 - most compiler optimizations are worthless in real life

Almost all optimizations tend to add register pressure to the result, by
creating longer liveness regions etc. This is then hidden by tons of
tweaking by the compiler author, so that the heuristics come out right for
the particular benchmark used to test the optimization. But it HURTS normal
integer code.
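
A minimal sketch of that effect: CSE saves an ALU op, but the hoisted value
now stays live across the whole conditional, tying up a register the
allocator could otherwise reuse.

/* Illustrative sketch, not from the original post.
   Before CSE: a*b is computed in whichever branch runs, and is
   dead immediately after its single use. */
int before(int a, int b, int c, int flag)
{
    if (flag)
        return a * b + c;
    return a * b - c;
}

/* After CSE: a*b is computed once, but t is now live across the
   branch, occupying a register for the whole region. */
int after(int a, int b, int c, int flag)
{
    int t = a * b;      /* live range starts here...          */
    if (flag)
        return t + c;
    return t - c;       /* ...ends here, spanning the branch. */
}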

You do _not_ want to schedule over conditionals in software to make basic
blocks bigger - although you might want to do so to make liveness regions
_smaller_.

In any sane modern architecture, the CPU will _dynamically_ do the trivial
stuff (moving loads earlier, moving stores later) and often does a better
job of it because inside the CPU you can do things speculatively without
worrying about spurious faults changing the semantics of the result. And
once you do this at run-time, you can do better alias analysis anyway.
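
A minimal sketch of why the hardware has the advantage here: the compiler
must not hoist a load above its guarding test, because the hoisted load
could fault, while an OoO core can issue it speculatively and squash it if
the branch goes the other way.

/* Illustrative sketch, not from the original post: the load of *p
   cannot be moved above the NULL test by the compiler without
   risking a spurious fault; an OoO CPU can start the load early
   under branch prediction and discard it on a misprediction. */
int deref_or_zero(int *p)
{
    if (p)
        return *p;
    return 0;
}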

The same is true of loop unrolling etc - if your CPU needs it, it's just a
sign that there's something wrong with the CPU. If it can't look ahead
enough to do the loop unrolling, it's just not worth worrying about. The
CPU needs to do register renaming in hardware _anyway_, so all the
complexity in the compiler is just totally not worth it.

And with caches growing, the biggest performance impact these days on some
loads is the unavoidable cache misses - the cold-cache ones or the ones
brought on by serialization across CPUs. Sometimes L1s are still
direct-mapped (broken!), and so cache layout can matter, but with any
associativity the wins you can get at a compiler level are likely not very
interesting.

And code size is growing. There are a lot of real loads where the code size
is a lot bigger than the data size (windowing environments with megabytes
and megabytes of shared libraries all calling each other), which means that
the traditional compiler optimizations are all really really wrong.

CSE, register allocation, and the trivial code movement (ie turning simple
conditionals into conditional moves) - those still matter. The rest looks
like broken hardware to me.
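
A minimal sketch of the conditional-move case:

/* Illustrative sketch, not from the original post: a simple
   conditional assignment that a compiler can turn into cmp + cmov
   on x86, with no branch to predict at all. */
int max(int a, int b)
{
    return (a > b) ? a : b;
}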

And quite frankly, compilers _have_ to be so stable that I'd rather take one
that does a good job and is really stable over one that does a great job
but is potentially flaky.

(The same is true of core operating systems, for that matter - for every
feature you have to ask yourself whether it complicates the system enough
that it's not worth it. Simply because even perfectly written code tends to
become unstable if it's complex.)

                Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: branch prediction, renaming, joins
Message-ID: <br5ujk$739$1@build.pdx.osdl.net>
Date: Wed, 10 Dec 2003 02:00:02 GMT

Jan C. Vorbrüggen wrote:

>>         need_resched:
>>                 ti->preempt_count = PREEMPT_ACTIVE;
>>                 schedule();
>>                 ti->preempt_count = 0;
>
> What stops the compiler from killing/moving the first assignment to
> preempt_count - only the fact that there is a function call between
> the two assignments? What if the compiler can inline schedule() and
> notice that could now do that optimization?

Oh, we definitely keep tabs on what happens with the compiler, so we'd
notice if the compiler started inlining hundred-line functions. And yes,
even in the absence of that, there are other things like the spinlocks that
the scheduler takes that would keep the compiler from optimizing this
particular part (the spinlocks have compiler barriers as part of them).

But yes, occasionally compiler improvements mean that we have to change how
we do things. Especially with inline assembly the syntax and semantics have
occasionally changed, and you have to work with those changes.  That's what
you get for being intimate with the machine - you can't just pretend to
code to some abstract standard. You have to code to the hardware, and that
means that you have to take the compiler into account.

                Linus


From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: Re: Itanium strikes again
Date: Sat, 03 Jan 2004 18:10:34 +0100
Message-ID: <bt6t2b$3at$1@osl016lin.hda.hydro.com>

Aaron Spink wrote:
> IIRC, one of the high benefit optimizations of the Spike system is code
> movement.  Spike has the ability to reorganize the position of the code so
> that the most used paths fit into a smaller/contiguous space, significantly
> reducing I-cache misses.  Spike has shown significant speed-ups on large
> scale applications such as databases.  A lot of the speed up on something
> like a database is most likely because of the code movement.

This was one of the key points I made in my Code optimization talk:

icache/itlb footprint really matters on modern OoO cpus:

A cpu like the P4 (as well as all the PPro generation cpus) runs a _lot_
slower as soon as it starts to miss the L1 icache (since it contains
more or less predecoded opcodes), and when you're running out of L2
cache and/or ITLB entries, things go from bad to worse.

Reorganizing code automatically is one of the really interesting uses of
FDO, more important than getting branch hints correct.

It is only when worst-case behaviour (DoS resistance!) is important that
I'd turn this off.
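
A minimal sketch of steering layout by hand instead, assuming a GCC-style
compiler with hot/cold attributes (FDO derives the same information from
real profiles rather than from annotations):

#include <stdio.h>

/* Illustrative sketch: mark the error path cold so the compiler
   can move it out of line (e.g. into .text.unlikely), keeping the
   hot path dense in the icache. */
__attribute__((cold, noinline))
static void handle_error(const char *msg)
{
    fprintf(stderr, "error: %s\n", msg);
}

int process(int value)
{
    if (__builtin_expect(value < 0, 0)) {   /* hint: rarely taken */
        handle_error("negative value");
        return -1;
    }
    return value * 2;                       /* hot fall-through path */
}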

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"


From: "Glew News SBC" <andy-glew-public@sbcglobal.net>
Newsgroups: comp.arch
Subject: Re: Itanium strikes again
Message-ID: <WLMKb.6506$7i7.1605@newssvr27.news.prodigy.com>
Date: Wed, 07 Jan 2004 05:26:14 GMT

> A cpu like the P4 (as well as all the PPro generation cpus) runs a _lot_
> slower as soon as it starts to miss the L1 icache (since it contains
> more or less predecoded opcodes), and when you're running out of L2
> cache and/or ITLB entries, things go from bad to worse.

Actually, Intel P6s do not have any predecode bits
in the I$. Banias may have added them... does anyone know?

Intel P5 had predecode bits in the I$.

Intel P4, of course, has a more fully decoded trace cache.

AMD K7 and K8 have predecode bits in the I$,
but also in the L2.




From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: Re: Itanium strikes again
Date: Wed, 07 Jan 2004 09:11:31 +0100
Message-ID: <btgevk$etd$1@osl016lin.hda.hydro.com>

Glew News SBC wrote:

>>A cpu like the P4 (as well as all the PPro generation cpus) runs a _lot_
>>slower as soon as it starts to miss the L1 icache (since it contains
>>more or less predecoded opcodes), and when you're running out of L2
>>cache and/or ITLB entries, things go from bad to worse.
>
>
> Actually, Intel P6s do not have any predecode bits
> in the I$.

Not even instruction beginning/end markers?

I really seemed to remember measuring execution speed differences based
on this, but that must have been P5 only. Thanks for correcting me!
....
OK, I do remember now: What the P6 did was to decode multiple
instr/cycle, but only if the secondary opcodes were really simple ones,
each generating a single micro-op, right?

> Banias may have added them... does anyone know?
>
> Intel P5 had predecode bits in the I$.

Right: First iteration (code not yet in the L1 cache) would run at maximum 1
instr/cycle, while properly paired and scheduled (i.e. handwritten by a
very competent asm programmer) code could sustain 2.0 instr/cycle on all
the following iterations. :-)

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"


Newsgroups: comp.arch
Subject: Re: Itanium strikes again
From: lindahl@pbm.com (Greg Lindahl)
Message-ID: <3fdfb66b$1@news.meer.net>
Date: Wed, 17 Dec 2003 01:50:37 GMT

In article <christian.bau-3FA328.00232717122003@slb-newsm1.svr.pol.co.uk>,
Christian Bau  <christian.bau@cbau.freeserve.co.uk> wrote:

>There are a few things that you don't get on a P4.
>
>One is software pipelining - executing statements from two or three
>consecutive iterations of a loop simultaneously.

Software pipelining started life as a compiler technique. While IA64
does have hardware support for it, you can do it without hardware
support. In Pathscale's Opteron compiler, we measured and decided that
the Opteron is so Out-of-Order that software pipelining isn't
a win.  There are enough hidden registers to hold the proper state, so
the OoO hardware is going to naturally pipeline your loop for you, and
you wouldn't get any additional benefit with a lot more visible
registers & explicit software pipelining. You might think this is
making lemonade with lemons, but at least software pipelining is a
good way to understand what the OOO engine is going to do with a naive
loop.
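
A minimal sketch of the idea at the source level, for a loop computing
dst[i] = src[i] * 5: issue iteration i+1's load while iteration i's multiply
is still in flight. On a sufficiently OoO core the naive loop gets the same
overlap for free.

/* Illustrative sketch of manual software pipelining: the next
   load is started before the current result is computed, so load
   latency overlaps with the multiply. */
void scale(double *dst, const double *src, int n)
{
    if (n <= 0)
        return;
    double cur = src[0];               /* prologue: first load  */
    for (int i = 0; i < n - 1; i++) {
        double next = src[i + 1];      /* start next load early */
        dst[i] = cur * 5.0;
        cur = next;
    }
    dst[n - 1] = cur * 5.0;            /* epilogue              */
}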

>The worst is speculative loads. You prepare for loading some data, then
>later on you check whether you have indeed got the right data.

There's more than one way to skin a cat. Prefetching is another way
to approach this problem. And it doesn't have the correctness issue that
you mention.
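
A minimal sketch with the GCC builtin (the arguments are address, rw, and a
locality hint): a prefetch of a bad address is simply dropped by the
hardware, so there is no check-afterwards step the way there is with a
speculative load.

/* Illustrative sketch: prefetch the indirectly addressed source a
   few iterations ahead. __builtin_prefetch(addr, rw, locality) is
   only a hint and cannot fault. */
void scale_indirect(double *dst, const double *src, const int *indx, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&src[indx[i + 8]], 0, 1);
        dst[i] = src[indx[i]] * 5.0;
    }
}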

-- greg
employed by, but not speaking for, Pathscale


From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: Re: Itanium strikes again
Date: Thu, 18 Dec 2003 13:31:36 +0100
Message-ID: <brs6n8$vvg$1@osl016lin.hda.hydro.com>

Michael S wrote:
> According to my (limited) understanding of I2, prefetches and
> speculative loads serve different purposes. Prefetch is best for
> hiding the 150ns L2/L3 miss latency. Speculative load works at a different
> timescale - it is good for hiding the 3.3ns latency of an L2 hit (all FP
> loads on I2 go to L2, so prefetch is no help) and as an enabler for SW
> pipelining. For example:
>
> void foo_with_possible_pointer_alias(double *dst, double *src,
>                                      int *indx, int n) {
>   for (int i = 0; i < n; ++i)
>      dst[i] = src[indx[i]] * 5;
> }
> Advanced load can speed up this loop tenfold (literally).

Well, but so can the _exactly equivalent_ approach of an explicit check
for possible aliasing, followed by a branch to either a naive loop, or a
loop with significant unrolling.

The indirect loading makes it harder to do an upfront analysis, but you
can still do the same by maintaining a window of destination addresses,
making sure that no indirect load touches that area.

I.e. assuming we have enough fp registers to keep ~30 such operations
ongoing, each load operation will have to check its address against a
window of less than 256 bytes.

Extending the checked window to 512 bytes lets me handle it with a masked
compare against each end of the window (which can straddle a 512-byte
block). This isn't exact - false positives are possible - but good enough!

Let's see...

#include <stdint.h>

void foo_with_possible_pointer_alias(double *dst, double *src,
                     int *indx, int n)
{
   int i;
   for (i = 0; i <= n-30; i += 30) {
      // Base destination addresses - first and last element, since
      // the 240-byte window can straddle a 512-byte block:
      intptr_t possible_alias = 0;
      intptr_t dstbase = (intptr_t) (dst+i);
      intptr_t dstend = (intptr_t) (dst+i+29);

      // Pretend this is unrolled into 30 local (register) vars!
      double temp[30];

      for (int j = 0; j < 30; j++) {
         // Check current source address:
         intptr_t srcidx = (intptr_t) (src + indx[i+j]);
         possible_alias += ((srcidx ^ dstbase) & ~(intptr_t)511) == 0;
         possible_alias += ((srcidx ^ dstend) & ~(intptr_t)511) == 0;
         temp[j] = *(double *) srcidx * 5;
      }
/* Before saving all the results, we must check that there was
    no actual aliasing, which is quite simple since the number of
    such possible collisions has been counted:
*/
      if (possible_alias > 0) {
        for (int j = i; j < i+30; j++) {
          // This runs quickly since all source data is in cache!
          dst[j] = src[indx[j]] * 5;
        }
      }
      else {
        for (int j = 0; j < 30; j++) {
          dst[i+j] = temp[j];
        }
      }
   }
   // Handle any remaining entries here!
   for (int j = i; j < n; j++) {
     dst[j] = src[indx[j]] * 5;
   }
}

(Using Fortran makes it trivial, because the compiler can know that
dst[] and src[] must be non-overlapping. :-()

My point is that the code an IA64 compiler must generate is morally
equivalent to what I wrote above; using the checked loads just takes a
lot more hw resources to make the testing exact. I avoid the need for a
test/branch inside the 30-entry inner load loop, because I know that the
normal case is no alias, and I don't care about the difference between a
single aliased load, and a bunch of them within the same inner block.

OTOH, if the destination array used indirect addressing as well, or just
some huge stride, then my setup would have caused a lot more false
positives. Having _some_ form of total range checks on the src[] and
dst[] arrays would have been _much_ better. :-)

Worst case I would need to do an initial scan of the indx[] array to
determine min and max values, possibly doing the same for a destination
indirection array, and then comparing the ranges.
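
A minimal sketch of that pre-scan:

#include <stdint.h>

/* Illustrative sketch: find the extent of the indirectly
   addressed source region, then do a single interval-overlap test
   against the destination array. */
int ranges_overlap(const double *dst, const double *src,
                   const int *indx, int n)
{
    if (n <= 0)
        return 0;
    int lo = indx[0], hi = indx[0];
    for (int i = 1; i < n; i++) {
        if (indx[i] < lo) lo = indx[i];
        if (indx[i] > hi) hi = indx[i];
    }
    /* Overlap iff [src+lo, src+hi] intersects [dst, dst+n-1]. */
    return (uintptr_t)(src + lo) <= (uintptr_t)(dst + n - 1) &&
           (uintptr_t)dst <= (uintptr_t)(src + hi);
}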

At this point I humbly suggest that rewriting the code with local
variables to make it explicit that you don't care about aliasing would
be a huge win. :-(

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
