From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: Itanium strikes again
Message-ID: <brosgu$fi$1@build.pdx.osdl.net>
Date: Wed, 17 Dec 2003 06:20:00 GMT

Robert Myers wrote:
>
> Erf.  The world is supposed to forgo something as beautiful and
> natural as software pipelining because it makes debugging harder?

No. The world is supposed to forgo it because it is STUPID.

A good OoO implementation makes software pipelining redundant and wasteful.
You can simply do a much better job of it dynamically, and anybody who claims
otherwise is wearing some heavy blinders. Asking the compiler to do it for
you gets you suboptimal results, at the expense of complexity in an area
where you _really_ don't want it.

> Fleas on the back of an elephant.  If you want to make debugging
> easier, the first step would be to restrict everything to
> single-threaded code.  You don't want to do that of course: the whole
> idea is to get as much going as you can in parallel.

Your "of course" is not "of course" at all. Single-threading is great.  You
_absolutely_ want to have as little code as possible multi-threaded,
because multi-threaded code is _hard_. It's hard to get right, it's hard to
debug, and it's not worth it. Why do microthreads when OoO gets you most of
the way _without_ the pain?

Why are you ignoring reality? A good OoO architecture is much simpler to
program for, AND IT OUTPERFORMS that silly "let's do software pipelining
and microthreads" mentality that is so fundamentally fragile.

And yes, OoO is complex in itself. But it's clearly not "fundamentally
fragile", and it has been proven successful. The proof is in the pudding.

> Modern hardware and software are a mess.  If EPIC could make a
> noticeable contribution to the messiness of things, it would be a
> phenomenon worth studying.  Maybe it would help us to get a handle on
> all the other mess, the mess we have with or without EPIC.

Intriguing argument: "we're already in a mess, so let's make it worse and
really study it".

Here's a big clue-hint: people already _know_ that threading is hard. People
already _know_ that requiring extreme compiler cleverness is fundamentally
hard and leads to debugging nightmares.

And no, that does not make EPIC "an interesting phenomenon worth studying".
Quite the reverse.

                        Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: Itanium strikes again
Message-ID: <brq2j6$mbi$1@build.pdx.osdl.net>
Date: Wed, 17 Dec 2003 17:10:02 GMT

Jan C. Vorbrüggen wrote:
>>
>> Your "of course" is not "of course" at all. Single-threading is great.
>> You _absolutely_ want to have as little code as possible multi-threaded,
>> because multi-threaded code is _hard_. It's hard to get right, it's hard
>> to debug, and it's not worth it. Why do microthreads when OoO gets you
>> most of the way _without_ the pain?
>
> Bah humbug. "A rose by any other name..." The central point of Myers's
> argument is "the whole idea is to get as much going as you can in
> parallel." Whether you call it OoO, (micro-)threads, EPIC, SMT, or
> anything else doesn't matter: you introduce complexity. With
> (non-micro-)threads, at least, you put the responsibility squarely where
> it belongs: on the programmer.

Bzzt. Wrong.

That is just wrong for so many reasons that it is hard to even start.

For one thing, the responsibility for making hardware run fast is _not_ on
the programmer. It's on the hardware manufacturer that wants to sell their
hardware.  And this is not some specious argument: 99.9% of all hardware is
sold to run existing programs, not for compiler people.

And that leads to another reason why you are wrong: complexity is bad only
when it isn't clearly controlled. Controlled complexity is fine, and
hardware (or firmware, for that matter) ends up quite _fundamentally_
controlling the complexity for some quite basic reasons:

 - unlike compilers, hardware runs _everything_ out there. When you
   develop a compiler, you try to test it against a wide workload, but
   it's nowhere _near_ as wide a test as a piece of hardware gets. For
   example, very few compiler developers ever get to see the bugs they
   introduce when the compiler is used to compile something like an
   office suite. In contrast, those poor CPU developers do.

In other words, hardware gets much better testing. And testing is _the_ best
way to control the downsides of complexity. (Btw, "hardware" above doesn't
have to be strictly hardware. It can be a JIT or similar. The only point
being that it runs _existing_ released binaries rather than some artificial
subset).

 - software is easier to change. Which is an upside, but it means that
   there is also much less of an innate "brake" on complexity.  The end
   result is that software tends to be more complicated, more fragile,
   and just _buggier_ than hardware.

In other words, putting the complexity in hardware has a fundamental testing
advantage as compared to putting it in the compiler. In addition, you get
an environment where you just _have_ to be more careful anyway, which is
what you want when you have something really complex and fundamental.

Finally, even if you ignore the above arguments, the fact is that dynamic
run-time optimizations are fundamentally stronger than doing it statically
at compile time. Clearly you want to do both, but once you do dynamic
optimization at run-time, you should take that into account when you do the
static ones. And one thing that fairly simple dynamic optimizations do is
to make loop unrolling and software pipelining largely unnecessary.
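
To make that concrete, here is a minimal sketch (a toy example of mine, not
code from anything being discussed): a loop like the one below can be left
exactly as written, because a decent OoO core renames registers,
disambiguates the loads and stores, and overlaps iterations at run time.

        /* Toy example: the iterations are independent.  An OoO core
         * overlaps the loads, multiplies and stores across iterations
         * dynamically (including sorting out whether dst and src alias),
         * so hand-unrolling or software-pipelining this mostly just
         * bloats the icache footprint. */
        void scale(long *dst, const long *src, int n)
        {
                int i;

                for (i = 0; i < n; i++)
                        dst[i] = 3 * src[i];
        }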

                        Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: Itanium strikes again
Message-ID: <brsjsr$7h6$1@build.pdx.osdl.net>
Date: Thu, 18 Dec 2003 16:20:01 GMT

Alexis Cousein wrote:

>>   And this is not some specious argument: 99.9% of all hardware is
>> sold to run existing programs, not for compiler people.
>
> So? The existing programs are compiled with a compiler that just gets
> a "-O3" flag passed to it. No added complexity there for the users
> (provided the compiler people get it right).

Sorry, you lose.

Very very few binaries out there have been compiled with -O3 or _any_ kind
of aggressive optimizations.

Most of the use that the aggressive optimizations see is for benchmarks and
spec in particular.

Why? Because software vendors have long since realized that they stand more
to lose from a buggy compiler than they do from a 5% performance hit (if
even that) that the "advanced optimization" techniques give on most
projects.

This is what it really boils down to: compiler complexity is UNACCEPTABLE to
99% of the market. The cost of the complexity is too high, and hardware has
gotten faster much more quickly than the incremental gains those compiler
features have delivered.

Are you really arguing against this fact? If so, why? Or have you been
brainwashed by system vendors that tout benchmark numbers and tweak them
with unrealistic compiler options that pretty much NOBODY ELSE USES?

In contrast, the CPU improvements are real and tangible, and everybody uses
them.  That was kind of my point - they are stable exactly because they get
a hell of a lot more testing in real life than those special compiler
features.

Maybe I'm biased. I've gotten bit by compiler bugs too many times. I turn
features _off_, even though I'm a performance freak.

If I care about some codepath (which tends to be just a small small portion
of the whole project), I will spend time on that path as a programmer. But
I will compile the "whole project" with reasonable - but not aggressive -
optimizations.

And yet people tell me that the optimizations I use are considered
aggressive by most commercial projects. Where I use

        -O2 -fno-strict-aliasing

other projects use just -O, or use -g with _no_ optimizations and then just
strip the debugging info off before shipping.

See what I'm saying?

Maybe you've worked on just the trivial projects with one special FP loop,
where you know what the answer should be, and can verify it trivially. And
you care so much that you use -O3 and you select the one compiler version
that actually happens to work for you.

But that's not what the rest of the world does.

And yes, I'm generalizing. But I bet that most people who have done real
development will be nodding.

                Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: Itanium strikes again
Message-ID: <bs2ep2$3ri$1@build.pdx.osdl.net>
Date: Sat, 20 Dec 2003 21:30:02 GMT

Jonathan Stone wrote:

> ... Sure, Which is pretty much exactly what I went on to say.  The
> bottom line was, I think Linus' claims about optimization are bunk.
> For example, back when Mike Meissner was maintaining gcc2 for mips,
> he'd (politely) tell anyone who asked to use gcc with -O2, because
> that's what *he* mostly used, so therefore -O2 was the best tested
> and best debugged.

I actually largely agree when it comes to gcc. "-O2" is widely used, partly
because gcc without optimizations is just incredibly stupid, to the point
of being useless.

But gcc is, at least from what I've seen, one of the stabler compilers
around. The gcc people may be surprised to hear me say that, since I end up
complaining about some of the things that gcc does - but I do so because
it's the compiler of choice for me (for more than just stability reasons,
obviously).

The reports I hear about vendor compilers are _scary_. These compilers are
often developed expressly for Spec, and sometimes don't seem to get any
other testing AT ALL. And I'm convinced that the reason gcc is generally a
lot better is exactly the fact that gcc gets a lot of testing on a varied
base of programs.

(gcc has also traditionally not been a very aggressive compiler).

Which has been my point all along: you need to test stuff extensively if
you're doing complex things. Mistakes will happen, even when you "prove"
correctness.

And gcc is in the much better part of the spectrum of compilers wrt testing.
But hardware is even further out.

                Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: Itanium strikes again
Message-ID: <bs3kur$kja$1@build.pdx.osdl.net>
Date: Sun, 21 Dec 2003 08:20:03 GMT

Maynard Handley wrote:

> It seems to me that one always has to bear in mind when reading Linus'
> strong comments on these issues that he sees the world through x86 eyes;
> and through x86 eyes much of what he says is true.

Replace "x86" with "integer code", and I'll agree.

I used to be a huge alpha bigot, and quite frankly, the code generation
issues were largely the same as far as I was concerned. Good register
allocation was still the major thing - if you waste your registers, your
function prologue and epilogue just end up being bigger and more wasteful.

There just aren't that many interesting integer programs that have tight
loops. And a lot of loops end up having loop counts in the single or double
digits. Yeah, they get called over and over again, but in the end what
matters is good, tight code generation. The loops are there, but they are a
hell of a lot more complex - to the point where the outer loop is not going
to fit even in the L2 cache.

Yeah, I'm totally not interested in HPC. Mea culpa. Not my cup of tea.

In HPC, people care about data cache misses, and icache doesn't tend to
matter. In integer code, imiss costs are as big as OR BIGGER than dmisses.
Doing things that make the code-stream bigger is simply optimizing for the
wrong damn thing, when a sane architecture (whether x86 or a modern alpha)
will do the right thing dynamically.

If you come from a HPC background, you've likely never seen that side.

Dick Sites co-authored a nice paper in the early (mid?) 90's about how DB
loads on the early alphas (21064 and 21164) were pin-bandwidth limited on
the instruction side.

                        Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: Itanium strikes again
Message-ID: <bsa3cr$uos$1@build.pdx.osdl.net>
Date: Tue, 23 Dec 2003 19:10:01 GMT

Robert Myers wrote:

> Aggressive global feedback-directed compiler optimization.  OoO with a
> trace cache for the brave of heart.  Even the most interactive of
> codes execute only the smallest fraction of possible code paths
> frequently.

That's an argument that is brought up often ("most programs spend 90% of
their time in loops"), and it's true, but it's largely irrelevant from a
cache standpoint.

The cache footprint has nothing to do with "hot spots". The cache footprint
is the working set, and code that gets executed rarely is quite as
expensive (from a cache standpoint) as code that gets executed thousands of
times. Both cases need the cache line to be brought in.

So let's go back to common integer patterns: you have loops that often have
quite small loop counts, but they are called many times. It can be as
simple as "strlen()" being a hot-spot when doing profiling: the CPU may
spend a lot of time in "strlen()", but that is _not_ usually because you
have strings that are millions of characters long. No, it's because
strlen() is repeatedly called with small strings - the loop count itself is
small.
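
Just to show the shape of the leaf loop I mean, here is a naive sketch (not
any particular C library's implementation):

        #include <stddef.h>

        /* Naive strlen(): the loop runs once per character, so for path
         * components and other short strings the trip count is in the
         * single digits.  It shows up hot in profiles because of how
         * often it gets _called_, not because any one call loops for
         * long. */
        size_t my_strlen(const char *s)
        {
                const char *p = s;

                while (*p)
                        p++;
                return p - s;
        }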

In contrast, in HPC, a common pattern is literally a fairly tight loop that
executes millions of times, and indeed, icache doesn't matter, because the
"working set" for that loop is very small.

But in the integer case, the "working set" is _not_ the "strlen()" function.
It's the bigger loop that calls "strlen()". And that bigger loop probably
doesn't have a very high internal repeat rate at all either - in fact a lot
of integer code is "non-loopy", so it's more likely that it's a function
that calls another function that calls a function that calls strlen(). And
then that BIGGER function is called from an external loop that calls many
other big functions.

See? Often the leaf functions are loopy - like strlen(), and yes,
technically you spend most of your CPU cycles in those loops, so when
people say "90% of all time is spent in 10% of the code", they aren't
lying. But the working set is _not_ that 10% of the code. The working set
is bigger. And suddenly icache matters.

For example, one of _the_ hottest loops in a kernel under many real loads
(and I've done a lot of profiling over the years) is actually the loop that
checks a file name path component. The component is commonly just a few
characters long ("usr" or "lib" or "mysource.c"), so the inner loop count
is literally usually in the single digits.

And that loop is called from the thing that parses the whole pathname, and
does a lot of locking and checks hashes etc, and that "outer" loop has a
loop count that is the count of the components in the pathname. Again, this
"loop" quite often has a count of just one, but it can easily be four of
five. More than ten is _extremely_ rare.

And that loop in turn is called from user space when the user does an
"open()" or "stat()". Quite often the user space loop is the result of a
readdir(), so the loop count tends to be on the order of the size of the
directory (it can be big, but it's usually in the few tens, maybe
hundreds).

And _that_ thing is then a loop itself, working down the directory
structure. Again, the loop count is commonly in the teens or so.

Notice? The inner loop may get executed millions of times and may be the
single hottest part of the thing, but it's not because that loop count is
all that high there.  It's just that these things get multiplied...
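
Here is a rough sketch of how those counts multiply (hypothetical names and
a much-simplified structure - the real path walk obviously does locking,
hashing and system calls rather than strtok()):

        #include <string.h>

        /* Innermost loop: compare one path component, a few chars long. */
        static int component_matches(const char *a, const char *b)
        {
                while (*a && *a == *b) {        /* usually 3..10 rounds */
                        a++;
                        b++;
                }
                return *a == '\0' && *b == '\0';
        }

        /* Middle loop: walk the components of one pathname. */
        static int path_contains(char *path, const char *want)
        {
                char *c;

                /* typically 1..5 components, rarely more than ten */
                for (c = strtok(path, "/"); c; c = strtok(NULL, "/"))
                        if (component_matches(c, want))
                                return 1;
                return 0;
        }

        /* Outer loop: something like "make" checking every dependency -
         * and "make" itself gets run over and over while walking a tree. */
        int count_hits(char **paths, int npaths, const char *want)
        {
                int i, hits = 0;

                for (i = 0; i < npaths; i++)    /* tens or hundreds */
                        hits += path_contains(paths[i], want);
                return hits;
        }

No single level loops very many times, but the product of the counts is what
puts the innermost comparison at the top of the profile.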

The _working_ set for this thing is usually megabytes of code. Literally.
It's things like running a big "make" on a project - and quite often the
object files are fine, but "make" will have to do a "stat()" on every d*ng
file involved to _verify_ that. So maybe the compiler gets invoked only
once or twice for a file that has been changed..

So please understand that in this case the working set is _literally_ tens
of megabytes:

 - a big chunk of the kernel (the hot place is _not_ the only part called)

 - all of "make", your compiler, your linker, etc. They are often "loopy"
   without even knowing it, simply by virtue of having a much higher-level
   loop in which the whole _program_ is executed in a loop.

These are all things that the compiler can do NOTHING about. The best the
compiler can do is to create tight and efficient code - SMALL code. Because
the compiler doesn't even see the _big_ loops of "make" calling other
programs etc. And a "make" is actually a fairly _simple_ example of a
CPU-intensive integer problem.

I'm not just making this up, you know. Others will tell you that GUI code is
even WORSE - it adds several layers of code, all usually without huge
loops, and all "independent" of each other. And they have to be independent
so that the programmer can stay sane - even when people _know_ that the
indirection will result in slower code than coding it directly by hand.

And the above example wasn't even a _complicated_ path. For things like TCP,
just a single path through the stack will often blow your L1 I-cache, and
you don't need any extra layers at all to make it worse (but obviously you
often do - think about a web browser).

So to get good performance on loads like this, you should:
 - have as big a L1 cache as possible, or be _really_ good at filling it
   from the L2/L3. Because you will get a _lot_ of L1 misses.
 - try to aim for a dense instruction encoding (this basically makes your
   effective L1 cache size bigger)
 - not care about loop unrolling etc, because it just won't help. Make the
   compiler generate tight code instead.

Itanium compensates for some of this with its _huge_ L3 cache, but even so
I suspect that that cache is too small for non-spec loads in the absence of
everything else. I've never seen benchmarks of a desktop Itanium rendering
a page off a local network. Yet people obviously do care about those
things.

                Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: Linus burned by Transmeta JIT morphing optimizations?
Message-ID: <bso8ft$9g9$1@build.pdx.osdl.net>
Date: Mon, 29 Dec 2003 04:00:01 GMT

Glew News SBC wrote:

> Although I am one of the most vociferous proponents of out-of-order
> dynamic execution hardware and hardware optimizations (as should
> be demonstrated by my recent post on hardware dynamic instruction
> rewriting optimizations), Linus has gone past me.

Heh. I aim high.

Anyway, I'm active on the newsgroup partly to just avoid coding: with a
stable kernel release, I have to take out my aggressions _somewhere_, and
it can't be in new exciting kernel features, so I'll happily admit to
perhaps overstating my case just to make a point..

> On the other hand, I remember when Linus went to Transmeta:
> I thought that if anyone could make Transmeta style dynamic
> code morphing JIT optimization competitive with out-of-order,
> it would be Linus.
>
> I can't help but wonder if Linus turned off such optimizations
> by bad experience at Transmeta.

I'm actually mostly turned off by the things I see gcc do to perfectly
regular code, and seeing it often generate worse code with newer versions
just because it tries to be clever - when most of the time all the real
performance problems I see in profiles are either icache or dcache misses.

The dcache misses the compiler usually can't do that much about: they are
largely in the algorithms, and not really related to code generation. So
I'm left worrying about icache.

Part of it is probably colored by Transmeta - although in a different way
than you'd think. The "x86 JIT" does do things like load movement and has
something that an Itanium person would probably think looks like an ALAT,
but the thing is, it's all dynamic, so it actually reacts to whether it got
aliases at run-time. Without the programmer needing to ever think about it.

It's not _as_ dynamic as full hardware that ends up doing it instruction by
instruction - but it has re-plays on internal mispredicts etc, and it
decides to retranslate if it notices that something was badly predicted or
something is _so_ hot that it pays to do the really aggressive things (ie
loop unrolling etc - all done on demand).

So in a rather real sense, Transmeta actually does do a form of OoO. It's
different, and not as dynamic as "real" OoO, but a lot of it is the likely
same. The P4 trace cache is in many ways apparently similar to what
Transmeta does - the actual engine is just very different.

But at Transmeta I did get a feel for what kind of actual workload a real
CPU sees and what it actually executes. Obviously with a pretty PC-centric
viewpoint, and we ran almost exclusively Windows apps, partly because it
became so quickly obvious that things like "spec" just don't have any
real-life relevance. It's not what really drives technology.

And seeing what a CPU ends up getting really fed at a low level (while at
the same time being a programmer and thus having the view from "above" too)
definitely changed how I think about performance. And seeing how _good_ x86
icache behaviour can be because of reasonably tight instruction encoding
was a big revelation.

And I definitely saw what kind of code the "real world" is using. It sucks.
I'm used to looking at assembly code for the kernel, and I can tell you - it's
good code. It would be good code even with the compiler tuned down quite a
bit. Because commercial software vendors really care about things that have
very little to do with what compiler people seem to think about..

                Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: hardware design is not software design (was Sun researchers: 
	Computers  do bad math ;)
Message-ID: <bspvam$1ja$1@build.pdx.osdl.net>
Date: Mon, 29 Dec 2003 19:30:02 GMT

Tarjei T. Jensen wrote:
> Terje Mathisen wrote:
>> I.e. Linus' post about the TCPIP stack being too large for code cache,
>> causing compulsory misses for every packet. One solution: Batch packet
>> processing, which increases latency to improve speed/bandwidth.
>
> It is already done that way. I believe that most modern gigE cards do TCP
> batching. At least the ones on various Unix servers. Some of the early
> cards were not particularly good at this, but more recent cards are
> supposedly doing it much better.

Yes, TSO helps, but it only helps for throughput. And only for big writes.

And those aren't even very common or interesting. A _lot_ of network traffic
is all about fairly small segments. Think about things like web-pages,
emails, etc. A lot of them are made up of connections where the total data
over the whole lifetime of the connection is in the "few kB" range.

Which means that most of the time, you really have just a packet or two that
goes through the stack. And there is _nothing_ you can do about that. In
fact, most of the things that people get so excited about (zero-copy etc)
tend to make the stack just a _lot_ bigger and have a much bigger icache
footprint - which just makes things worse.

(Side note: it's not only compilers I rant against - having the compilers
generate tight and simple code is only a part of the fix: having the source
try to be tight and simple is obviously also required. Sadly, a lot of
optimizations end up being done with benchmarks that are totally
unrealistic, like having TCP throughput tuned for honking huge writes that
don't occur in most real-life situations. Things like zero-copy TCP tend to
be a loss in real life, and play hell with latency).

                Linus
