Date: 	Sun, 17 Sep 2000 17:25:22 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Linux-2.4.0-test9-pre2
Newsgroups: fa.linux.kernel

On Mon, 18 Sep 2000, Chris Wedgwood wrote:

> On Sun, Sep 17, 2000 at 10:37:51AM -0700, Linus Torvalds wrote:
> 
>         - "extern inline" -> "static inline".  It doesn't matter right now,
>           but it's proactive for future gcc versions.
> 
> can someone please explain the difference?

Let's assume that gcc decides that it won't inline a function, because
it's too "big", according to some gcc definition of "big".

With "extern inline", the function will not exist AT ALL, and you'll end
up getting a link-time error complaining about the lack of that function.

With "static inline", gcc will emit the function as a separate function
for that compilation block if it didn't get inlined.

Both are valid things. You use "extern inline" when you have a "backing
store" for the function (ie you do have the non-inlined version in a
library somewhere). You use "static inline" when you don't.
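
As a minimal sketch of the difference, assuming the GNU89 inline semantics
that gcc used at the time (C99 and later give "extern inline" a different
meaning). The helper name twice_plus_one is made up for illustration, and
the "extern inline" variant is shown only in a comment so that the example
links on its own:

```c
/*
 * Hypothetical header-style helper; the name is made up for illustration.
 *
 * "static inline": if gcc decides not to inline a call, it emits a private
 * out-of-line copy in this compilation unit, so the link always succeeds.
 */
static inline int twice_plus_one(int x)
{
	return 2 * x + 1;
}

/*
 * The GNU89 "extern inline" spelling of the same helper would be:
 *
 *	extern inline int twice_plus_one(int x) { return 2 * x + 1; }
 *
 * That body is used only for inlining. If a call is not inlined, the
 * compiler emits a call to an external twice_plus_one(), which has to
 * come from some other object file or library.
 */

/* Taking the address forces an out-of-line copy to exist. */
int (*twice_plus_one_ptr)(int) = twice_plus_one;
```

With the "static inline" version, both the direct call and the call through
the pointer work without any backing definition elsewhere.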

For the kernel, we very seldom have the non-inlined versions in any
library, so for the kernel "extern inline" is almost always the wrong
thing.

Note that with most versions of gcc this is all a complete non-issue, as
most versions of gcc will _always_ inline a function that the user has
asked to be inlined. So the issue seldom actually comes up.

		Linus


Date: 	Tue, 19 Sep 2000 07:50:05 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Linux-2.4.0-test9-pre2
Newsgroups: fa.linux.kernel

On Tue, 19 Sep 2000, Rogier Wolff wrote:
> 
> If gcc starts shouting:
> 
> somefile.c:1234: declared inline function 'serial_paranoia_check' is 
> somefile.c:1234: larger than 1k. Declining to honor the inline directive. 

That's not what gcc does.

Gcc silently just doesn't inline it. 

And the error message you get is 

	ld: undefined function 'serial_paranoia_check'

which is not exactly helpful.

That, together with the fact that gcc's notion of "large" is completely
undefined (for a while, it had absolutely nothing to do with size, but
with what kinds of things the function did, like having the address of a
label taken) means that it's basically not useful for what you suggest
anyway..

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] prepare kconfig inline optimization for all
Date: Sun, 27 Apr 2008 17:07:32 UTC
Message-ID: <fa.Y7RioKQ3mx7qS+xB52xG2U4vPcA@ifi.uio.no>

On Sun, 27 Apr 2008, Adrian Bunk wrote:
>
> My opinion on this is still:
> "OPTIMIZE" means "work around bugs in the kernel".

No.

It means that

 - gcc used to (long ago) always honor "inline", and we had kernel code
   that depended on that in various ways (ie required that there was no
   return etc).

   We've been mostly replacing the ones we know about with
   "__always_inline", but there may be some that remain. We'll find out, I
   guess.

 - gcc was a total and utter piece of horrible crap in the inlining
   department, doing insane things and changing their documentation to
   match the new behaviour (and some people then claimed that it was
   always documented that way).

   It would not inline big functions even when they statically collapsed
   to nothing, etc.

As a result, we really couldn't afford to let gcc make any inlining
decisions, because the compiler was simply *broken*.

			Linus



From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] prepare kconfig inline optimization for all
Date: Sun, 27 Apr 2008 17:33:53 UTC
Message-ID: <fa.ktNN6QIWRCGYLQe9VHLwJDoUQ/s@ifi.uio.no>

On Sun, 27 Apr 2008, Adrian Bunk wrote:
>
> I'm looking at it from a different angle, all code in the kernel should
> follow the following rules [1]:
> - no functions in .c files should be marked inline
> - all functions in headers should be static inline
> - all functions in headers should either be very small or collapse
>   to become very small after inlining
>
> I can simply not see any usecase for a non-forced inline in the kernel,
> and fixing the kernel should give a superset of the space savings of
> this "inline optimization".

Your whole argument is premised on the assumption that the compiler does
the right thing.

That's a *known*to*be*bogus* assumption.

Modern versions of gcc may do the right thing. Note the two very important
code-words: "modern" and "may".

I'm just telling you that

 - older versions of gcc (and by "older" I do not mean "really ancient" or
   "deprecated", but stuff that is still in use) are known to be total and
   utter crap when it comes to inlining

 - even absent that, there are historical reasons stemming from even more
   ancient versions of gcc that are no longer in use.

In other words, my arguments have nothing to do with "I wish". They are
plain facts. Why argue with them?

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] prepare kconfig inline optimization for all
Date: Sun, 27 Apr 2008 18:25:30 UTC
Message-ID: <fa.yGr23xtFTKgjy+Qe7gxwpx0OjF4@ifi.uio.no>

On Sun, 27 Apr 2008, Christoph Hellwig wrote:
>
> As Linus mentioned, the hint doesn't make any sense because gcc will
> get it wrong anyway.  In fact, when you look at kernel code, it tends
> to inline everything and the kitchen sink as long as there's just
> one caller, and this bloats the stack, but it doesn't inline where it
> needs to.  Better not to mess with that and do it explicitly.

The thing is, the "inline" vs "always_inline" thing _could_ make sense,
but sadly doesn't much.

Part of it is that gcc imnsho inlines too aggressively anyway in the
absence of "inline", so there's no way "inline" can mean "you might
inline this", because gcc will do that anyway even without it. As a
result, in _practice_ "inline" and "always_inline" end up being very close
to each other - perhaps more so than they should.

I do obviously think that we're right to move into the direction that
"inline" should be a hint. In fact, the biggest issue I have with the new
kconfig option is that I think it should probably be unconditional, but I
suspect that compiler issues and architecture issues make that not be a
good idea.

It will take time before we've sorted out all the fall-out, because I bet
there is still code out there that _should_ use __always_inline, but
doesn't.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] prepare kconfig inline optimization for all
Date: Sun, 27 Apr 2008 23:38:09 UTC
Message-ID: <fa.2HXlHLUmHPuRX5AgZHCm44xcNy8@ifi.uio.no>

On Sun, 27 Apr 2008, Arjan van de Ven wrote:
>
> (actually, other than some obscure commandline options, the only sane way to
> keep gcc from doing this too aggressively is using -Os)

Well, CC_OPTIMIZE_FOR_SIZE has been defaulting to 'y' for a *loong* time,
but it's hidden behind an EXPERIMENTAL (unless you were on some embedded
architectures), so many people won't see it.

Perhaps it is time to remove the EXPERIMENTAL? I think the gcc warnings
were mostly bogus - it's not as if there haven't been compiler bugs
without -Os too..

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] prepare kconfig inline optimization for all
Date: Sun, 27 Apr 2008 18:12:41 UTC
Message-ID: <fa.dVZdkd/PIapDzssss4w5lNFI6OY@ifi.uio.no>

On Sun, 27 Apr 2008, Adrian Bunk wrote:
>
> What I want instead:
> - we continue to force the compiler to always inline with "inline"
> - we remove the inline's in .c files and make too big functions in
>   headers out-of-line

Sure, I can agree with that as a mostly good goal, but you're still
ignoring the fact that nobody should really expect the compiler to always
do a good job at deciding high-level issues.

For example, what's wrong with having "inline" on functions in .c files if
the author thinks they are small enough? He's likely right. Considering
past behaviour, he's quite often more right than the compiler.

Just as an example of this: gcc will often inline even big functions, if
they are called from only one call-site. In fact, ask a compiler guy, and
he'll likely say that that is obviously a good thing.

But ask somebody who debugs the resulting oops reports, and he may well
disagree violently.

In other words, inlining is about much more than pure optimization.

Sometimes it's about forcing it (or not forcing it) for simple correctness
issues when the compiler doesn't understand that the code in question has
specific rules (for example, we sometimes want to *force* certain
functions to be in specific segments).

And sometimes it's about debugging. For the kernel, backtraces posted by
random users are one of the main debug facilities, and unlike many other
projects, it's not reasonable to ask people to recompile with "-O0 -g" to
get better backtraces. The bulk of all reports will come from people who
use precompiled images from a distribution.

And that means that inlining has a *huge* impact on debuggability.

I have very often cursed gcc inlining some biggish function - who the f*ck
cares if a thousand-instruction function can shave a couple of
instructions of call overhead, when it then causes the call trace to be
really hard to read?

So quite frankly, my preferred optimization would be:

 - Heavily discourage gcc from inlining functions that aren't marked
   "inline". I suspect it hurts kernel debugging more than many other
   projects (because other projects aren't as dependent on the traces)

 - I do agree 100% with you that header file functions should be small
   (unless they use __builtin_constant_p() or other tricks to guarantee a
   much smaller static footprint than dynamic one)

 - I also suspect we should have some way for developers to ask for *hints*
   from the compiler, ie instead of having gcc inline on its own by
   default, have the people who care about it ask the compiler to warn
   about cases where inlining would be a big win.

 - Make "inline" mean "you may want to inline this", and "forced_inline"
   mean "you *have* to inline this". Ie the "inline" is where the compiler
   can make a subtle choice (and we need that, because sometimes
   architecture or config options means that the programmer should not
   make the choice statically!)
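
The annotations in the list above can be sketched roughly as follows. This
is an illustration only: "forced_inline" is the hypothetical name used
above (the kernel's own spellings are __always_inline and noinline), the
function names are made up, and the attributes are the GCC always_inline
and noinline function attributes:

```c
/*
 * Illustrative spellings only; "forced_inline" and "never_inline" are
 * hypothetical macro names for this sketch.
 */
#define forced_inline	inline __attribute__((always_inline))
#define never_inline	__attribute__((noinline))

/* Plain "inline": a hint, the compiler may still emit a real call. */
static inline unsigned int low_byte(unsigned int x)
{
	return x & 0xff;
}

/* Out-of-line slow path: stays visible as its own frame in a backtrace. */
static never_inline unsigned int log2_slow(unsigned long n)
{
	unsigned int bits = 0;

	while (n >>= 1)
		bits++;
	return bits;
}

/*
 * Must be inlined: with a compile-time constant argument the whole body
 * can fold to a constant, otherwise it is just a call to the slow path.
 */
static forced_inline unsigned int log2_floor(unsigned long n)
{
	if (__builtin_constant_p(n))
		return n < 2 ? 0 : (unsigned int)(8 * sizeof(n) - 1 - __builtin_clzl(n));
	return log2_slow(n);
}
```

A call like log2_floor(4096) has a much smaller static footprint than the
source suggests, which is the __builtin_constant_p() case mentioned above.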

In short, in general I actually wish we'd inline much much less than we
do. And yes, part of that is that we have way too much code in our header
files.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] prepare kconfig inline optimization for all
Date: Sun, 27 Apr 2008 18:31:53 UTC
Message-ID: <fa.eAK784Oy9vKaBStsBo4dI/85idc@ifi.uio.no>

On Sun, 27 Apr 2008, Christoph Hellwig wrote:
>
> Actually looking at older code in the tree he's most likely wrong :)
> Probably as bad as the compiler.  But the nice part about the code is
> that we can fix it easily.

Good point.

It *would* be really interesting to have some way to check our assumptions
(both ways - warn about over-large inlines and small-and-hot non-inlines).

I considered making sparse give some size estimate for inlines and warn
about ones that generate a lot of code, but I was never able to do it
sanely.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] prepare kconfig inline optimization for all
Date: Sun, 27 Apr 2008 18:58:43 UTC
Message-ID: <fa.jBC/pD1JGYKgVs0JhBBOX4JXd3Y@ifi.uio.no>

On Sun, 27 Apr 2008, Adrian Bunk wrote:

> On Sun, Apr 27, 2008 at 11:11:27AM -0700, Linus Torvalds wrote:
> >
> > For example, what's wrong with having "inline" on functions in .c files if
> > the author thinks they are small enough? He's likely right. Considering
> > past behaviour, he's quite often more right than the compiler.
> >...
>
> Ingo's commit in your tree just broke this assumption.

Note that our problem is too much inlining, not too little.

I'm actually happier with gcc not deciding to inline (despite having an
"inline") than I am with gcc deciding to inline (in violation of _not_
having an "inline").

So I don't disagree with Ingo's commit per se.

The only problem with not inlining is a historical one: exactly because
gcc _used_ to always do what people asked for, Linux has historically
treated "inline" as a "force_inline". And I was very unhappy when gcc
changed that, just because it broke historically good code.

In many ways, it might have been better if we had a "__may_inline" thing
(to tell the compiler "you can inline this if you think it's worth it").
Both gcc (long ago) and Ingo (now) decided to just make plain "inline"
mean that, but with a pretty strong bias. It was wrong for gcc to do so,
imho, and it may have been wrong for this OPTIMIZE_INLINE thing too.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c -
Date: Tue, 26 Aug 2008 17:35:42 UTC
Message-ID: <fa.mlhALmaYp2DsiDFyN2Qc7d3UXDQ@ifi.uio.no>

On Tue, 26 Aug 2008, Rusty Russell wrote:
>
> Your workaround is very random, and that scares me.  I think a huge number of
> CPUs needs a real solution (an actual cpumask allocator, then do something
> clever if we come across an actual fastpath).

The thing is, the inlining thing is a separate issue.

Yes, the cpumasks were what made stack pressure so critical to begin with,
but no, a 400-byte stack frame in a deep callchain isn't acceptable
_regardless_ of any cpumask_t issues.

Gcc inlining is a total and utter pile of shit. And _that_ is the problem.
I seriously think we shouldn't allow gcc to inline anything at all unless
we tell it to. That's how it used to work, and quite frankly, that's how
it _should_ work.

The downsides of inlining are big enough from both a debugging and a real
code generation angle (eg stack usage like this), that the upsides
(_sometimes_ smaller kernel, possibly slightly faster code) simply aren't
relevant.

So the "noinline" was random, yes, but this is a real issue. Looking at
checkstack output for a saner config (NR_CPUS=16), the top entries for me
are things like

	ide_generic_init [vmlinux]:             1384
	idefloppy_ioctl [vmlinux]:              1208
	e1000_check_options [vmlinux]:  	1152
	...

which are "leaf" functions. They are broken as hell (the e1000 is
apparently because it builds structs on the stack that should all be
"static const", for example), but they are different from something like
the module init sequence in that they are not going to affect anything
else.
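
A minimal sketch of the "static const" point. This is not the e1000 code:
the struct, the option names and the numeric limits are all made up for
illustration:

```c
#include <stddef.h>

/* Hypothetical option descriptor, loosely modeled on a driver parameter check. */
struct opt_range {
	const char *name;
	int min, max, def;
};

/*
 * "static const": the table lives in read-only data, so it costs no stack
 * space. Without the "static const", the table would typically be built
 * on the stack every time the function runs.
 */
int check_option(size_t idx, int value)
{
	static const struct opt_range opts[] = {
		{ "tx_descriptors", 80, 4096, 256 },
		{ "rx_descriptors", 80, 4096, 256 },
		{ "irq_throttle",    0, 100000, 8000 },
	};

	if (idx >= sizeof(opts) / sizeof(opts[0]))
		return -1;		/* unknown option */
	if (value < opts[idx].min || value > opts[idx].max)
		return opts[idx].def;	/* out of range: fall back to default */
	return value;
}
```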

It would be interesting to see what "-fno-default-inline" does to the
kernel. It really would get rid of a _lot_ of gcc version issues too.
Inlining behavior of gcc has long been a problem for us.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c -
Date: Tue, 26 Aug 2008 18:42:59 UTC
Message-ID: <fa./sT9C2U+q7cX3FV9Goaa9LTb84w@ifi.uio.no>

On Tue, 26 Aug 2008, Adrian Bunk wrote:
>
> A debugging option (for better traces) to disallow gcc some inlining
> might make sense (and might even make sense for distributions to
> enable in their kernels), but when you go to use cases that require
> really small kernels the cost is too high.

You ignore the fact that it's really not just about debugging.

Inlining really isn't the great tool some people think it is. Especially
not since gcc stack allocation is so horrid that it won't re-use stack
slots etc (which I don't disagree with per se - it's _hard_ to re-use
stack slots while still allowing code scheduling).

NOTE! I also would never claim that _our_ choices of "inline" are all that
great, and we've often inlined too much or not inlined things that really
could be inlined. But at least when a developer says "inline" (or forgets
to say it), we have somebody to blame. When the compiler does insane
things that don't suit us, we're just screwed.

> But if you don't trust gcc's inlining you should revert
> commit 3f9b5cc018566ad9562df0648395649aebdbc5e0 that increases gcc's
> freedom regarding what to inline in 2.6.27

Actually, that just allows gcc to _not_ inline. Which is probably ok.

(Well, it would be ok if gcc did it well enough, it obviously has some
problems at times).

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c -
Date: Tue, 26 Aug 2008 20:43:44 UTC
Message-ID: <fa.C5vJOOJZB+d8zFzQajCEmlDlR30@ifi.uio.no>

On Tue, 26 Aug 2008, Adrian Bunk wrote:
>
> I had in mind that we anyway have to support it for tiny kernels.

I actually don't think that is true.

If we really were to decide to be stricter about it, and it makes a big
size difference, we can probably also add a tool to warn about functions
that really should be inline.

> > Inlining really isn't the great tool some people think it is. Especially
> > not since gcc stack allocation is so horrid that it won't re-use stack
> > slots etc (which I don't disagree with per se - it's _hard_ to re-use
> > stack slots while still allowing code scheduling).
>
> gcc's stack allocation has become better
> (that's why we disable unit-at-a-time only for gcc 3.4 on i386).


I agree that it has become better. But it still absolutely *sucks*.

For example, see the patch I just posted about e1000 stack usage. Even
though the variables were all in completely separate scopes, they all got
individual space on the stack over the whole lifetime of the function,
causing an explosion of stack-space. As such, gcc used 500 bytes too much
of stack, just because it didn't re-use the stackspace.

That was with gcc-4.3.0, and no, there were hardly any inlining issues
involved, although it is true that inlining actually did make it slightly
worse in that case too (but since it was essentially a leaf function, that
had little real life impact, since there were no deep callchains below it
to care).

So the fact is, "better" simply is not "good enough". We still need to do
a lot of optimizations _manually_, because gcc cannot see that it can
re-use the stack-slots.

And sometimes those "optimizations" are actually performance
pessimizations, because in order to make gcc not use all the stack at the
same time, you simply have to break things out and force-disable inlining.

> Most LOCs of the kernel are not written by people like you or Al Viro or
> David Miller, and the average kernel developer is unlikely to do it as
> well as gcc.

Sure. But we do have tools. We do have checkstack.pl, it's just that it
hasn't been an issue in a long time, so I suspect many people didn't even
_realize_ we have it, and I certainly can attest to the fact that even
people who remember it - like me - don't actually tend to run it all that
often.

> For the average driver the choice is realistically between
> "inline's randomly sprinkled across the driver" and
> "no inline's, leave it to gcc".

And neither is likely to be a big problem.

> BTW:
> I just ran checkstack on a (roughly) allyesconfig kernel, and we have a
> new driver that allocates "unsigned char recvbuf[1500];" on the stack...

Yeah, it's _way_ too easy to do bad things.

> With the "gcc inline's static functions" you complain about we have
> 4-5 years of experience.

Sure. And most of it isn't all that great.

But I do agree that letting gcc make more decisions is _dangerous_.
However, in this case, at least, the decisions it makes would at least
make for less inlining, and thus less stack space explosion.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c -
Date: Tue, 26 Aug 2008 18:48:26 UTC
Message-ID: <fa.58S73z2Ro8cetCCdaZlVn7v4cug@ifi.uio.no>

On Tue, 26 Aug 2008, Adrian Bunk wrote:
>
> I added "-fno-inline-functions-called-once -fno-early-inlining" to
> KBUILD_CFLAGS, and (with gcc 4.3) that increased the size of my kernel
> image by 2%.

Btw, did you check with just "-fno-inline-functions-called-once"?

The -fearly-inlining decisions _should_ be mostly right. If gcc sees early
that a function is so small (even without any constant propagation etc)
that it can be inlined, it's probably right.

The inline-functions-called-once thing is what causes even big functions
to be inlined, and that's where you find the big downsides too (eg the
stack usage).

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c -
Date: Tue, 26 Aug 2008 19:20:24 UTC
Message-ID: <fa.Wg+gZgTdCb2OztVz1a2Vi6ze+l8@ifi.uio.no>

On Tue, 26 Aug 2008, Jamie Lokier wrote:
>
> A function which is only called from one place should, if everything
> made sense, _never_ use more stack through being inlined.

But that's simply not true.

See the whole discussion.

The problem is that if you inline that function, the stack usage of the
newly inlined function is now added to ALL THE OTHER paths too!

So the case we had in module loading was that yes, we had a function with
a big stack footprint, but it was NOT in the deep path.

But by inlining it, it now moved the stack footprint "up" one level to
another function, and now the big stack footprint really _was_ in the deep
path, because the caller was involved in a much deeper chain.

So inlining moves the code up the callchain, and that is a problem for the
backtrace, but that's "just" a debugging issue. But it also moves the
stack footprint up the callchain, and that can actually be a correctness
issue.
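
The effect can be sketched like this. It is an illustration, not the
module loading code: all names and the 512-byte buffer size are made up,
and "never_inline" is a hypothetical macro wrapping the GCC noinline
attribute:

```c
#include <string.h>

#define never_inline __attribute__((noinline))

/* Hypothetical rarely-used helper with a large on-stack buffer. */
static never_inline int rare_setup(const char *name)
{
	char scratch[512];

	strncpy(scratch, name, sizeof(scratch) - 1);
	scratch[sizeof(scratch) - 1] = '\0';
	return (int)strlen(scratch);
}

/* Stand-in for a deep call chain below the caller. */
static never_inline int deep_chain(int depth)
{
	return depth > 0 ? deep_chain(depth - 1) + 1 : 0;
}

/*
 * If rare_setup() were inlined here, its 512-byte buffer would become part
 * of this function's frame and would typically stay allocated while
 * deep_chain() runs. Kept out of line, the buffer is gone before the deep
 * path starts.
 */
int load_thing(const char *name, int depth)
{
	int len = rare_setup(name);

	return len + deep_chain(depth);
}
```

The helper itself is not in the deep path. Only inlining it would put its
footprint there.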

Of course, a compiler doesn't _have_ to do that. A compiler _could_ have
multiple different stack footprints for a single function, and do liveness
analysis etc. But no sane compiler probably does that, because it's very
painful indeed, and it's not even an issue if you aren't stack-limited
(and being stack-limited is really just a kernel thing).

(Yeah, it can be an issue even if you have a big stack, in that you get
worse cache behaviour, so a dense stack footprint _would_ help. But the
complexity of stack liveness analysis is almost certainly not worth the
relatively small gains it would get on some odd cases).

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c -
Date: Tue, 26 Aug 2008 21:05:50 UTC
Message-ID: <fa.90r7JglDJFWmBAGsDX5D27LfubA@ifi.uio.no>

On Tue, 26 Aug 2008, Adrian Bunk wrote:
>
> If you think we have too many stacksize problems I'd suggest to consider
> removing the choice of 4k stacks on i386, sh and m68knommu instead of
> using -fno-inline-functions-called-once:

Don't be silly. That makes the problem _worse_.

We're much better off with a 1% code-size reduction than forcing big
stacks on people. The 4kB stack option is also a good way of saying "if it
works with this, then 8kB is certainly safe".

And embedded people (the ones that might care about 1% code size) are the
ones that would also want smaller stacks even more!

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c -
Date: Tue, 26 Aug 2008 23:48:40 UTC
Message-ID: <fa.7kLVLfzZeWBJsiUhXi6gNHhAMGo@ifi.uio.no>

On Tue, 26 Aug 2008, Parag Warudkar wrote:
>
> This is something I never understood - embedded devices are not going
> to run more than a few processes, and 4K*(Few Processes) IMHO is not
> worth saving nowadays, even in the embedded world, given falling
> memory prices. Or do I misunderstand?

Well, by that argument, 1% of kernel size doesn't matter either..

1% of a kernel for an embedded device is roughly 10-30kB or so depending
on how small you make the configuration.

If that matters, then so should the difference of 3-8 processes' kernel
stack usage when you have a 4k/8k stack choice.

And they _all_ will have at least 3-8 processes on them. Even the simplest
ones will tend to have many more.
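
The arithmetic behind that comparison, as a small sketch (the function name
is made up; the 4kB and 8kB figures are the stack sizes discussed above):

```c
/* Kernel stack memory saved by choosing 4kB over 8kB stacks, in bytes. */
unsigned long stack_savings(unsigned int nr_processes)
{
	const unsigned long small_stack = 4 * 1024;
	const unsigned long big_stack = 8 * 1024;

	return nr_processes * (big_stack - small_stack);
}
```

For 3 processes that is 12kB and for 8 processes it is 32kB, which is the
same range as the 10-30kB that 1% of an embedded kernel amounts to.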

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c -
Date: Wed, 27 Aug 2008 01:50:09 UTC
Message-ID: <fa.lMsGdiKpgzM6rP6SDHiwvBYsfKc@ifi.uio.no>

On Tue, 26 Aug 2008, Parag Warudkar wrote:
>
> And although you said in your later reply that Linux x86 with 4K
> stacks should be more than usable - my experiences running an untainted
> desktop/file server with 4K stacks have always been disastrous, XFS or
> not.  It _might_ work for some well defined workloads but you would
> not want to risk 4K stacks otherwise.

Umm. How long?

4kB used to be the _only_ choice. And no, there weren't even irq stacks.
So that 4kB was not just the whole kernel call-chain, it was also all the
irq nesting above it.

And yes, we've gotten much worse over time, and no, I can't really suggest
going back to that in general. The code bloat has certainly been
accompanied by a stack bloat too.

But part of it is definitely gcc. Some versions of gcc used to be
absolutely _horrid_ when it came to stack usage, especially with some
flags, and especially with the crazy inlining that module-at-a-time
caused.

But I'd be really happy if some embedded people tried to take some of that
bloat back, and aim for 4kB stacks. Because it's definitely not
unrealistic. At least it _shouldn't_ be. And a lot of the cases of us
having structures on the stack is actually not worth it, and tends to be
about being lazy rather than anything else.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c -
Date: Wed, 27 Aug 2008 02:54:02 UTC
Message-ID: <fa.YuK/O5nxS/zITx1L8cMZ7Lh09dE@ifi.uio.no>

On Tue, 26 Aug 2008, Parag Warudkar wrote:
>
> What about deep call chains? The problem with the uptake of 4K stacks
> seems to be that it is not reliably provable that it will work under
> all circumstances.

Umm. Neither is 8k stacks. Nobody "proved" anything.

But yes, some subsystems have insanely deep call chains. And yes, things
like the VFS recursion (for symlinks) makes that deeper yet for
filesystems, although only on the lookup path. And that is exactly the
kind of thing that can exacerbate the problem of the compiler artificially
making for a bigger stack footprint of a function (*).

For things like the VFS layer, right now we allow a nesting level of 8, I
think. If I remember correctly, it was 5 historically. Part of raising
that depth, though, was that we actually moved the recursive part into
fs/namei.c, and the nesting stack-depth was something pretty damn small
when the filesystem used "follow_link" properly and let the VFS do it for
it (ie the callchain to actually look up the link could be deep, but it
would not recurse back, and instead just return a pointer, so that the
actual _recursive_ part was just __do_follow_link() and is just a few
words on the stack).
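
A toy sketch of that shape, not the fs/namei.c code: every name here is
hypothetical, the link table is made up, and the limit of 8 is just the
number recalled above. The point is only that the per-filesystem step
returns a pointer instead of recursing, so the recursive frame stays tiny:

```c
#include <stddef.h>
#include <string.h>

#define MAX_NESTED_LINKS 8	/* nesting limit mentioned in the text */

struct toy_link {
	const char *name;
	const char *target;
};

static const struct toy_link toy_links[] = {
	{ "a", "b" }, { "b", "c" }, { "loop", "loop" },
};

/*
 * Stand-in for a filesystem's follow_link: it may do deep work internally,
 * but it returns a pointer instead of recursing back into the walker.
 */
static const char *toy_follow_link(const char *name)
{
	for (size_t i = 0; i < sizeof(toy_links) / sizeof(toy_links[0]); i++)
		if (strcmp(toy_links[i].name, name) == 0)
			return toy_links[i].target;
	return NULL;	/* not a symlink */
}

/* The only recursive frame: one pointer and one counter. */
static const char *toy_walk(const char *name, int depth)
{
	const char *target = toy_follow_link(name);

	if (!target)
		return name;
	if (depth >= MAX_NESTED_LINKS)
		return NULL;	/* too many nested links */
	return toy_walk(target, depth + 1);
}
```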

So yes, we do have some deep callchains, but they tend to be pretty well
managed for _good_ code. The problems tend to be the areas with lots of
indirection layers, and yeah, XFS, MD and ACPI all have those kinds of
things.

In an embedded world, many of those should be a non-issue, though.

			Linus

(*) ie the function that _is_ on the deep chain doesn't actually need much
of a stack footprint at all itself, but it may call a helper function that
is _not_ in the deep chain, and if it gets inlined it may give its
excessive stack footprint to the deep chain - and this is _exactly_ the
problem that happened with inlining "load_module()".


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c -
Date: Tue, 26 Aug 2008 23:52:36 UTC
Message-ID: <fa.XydhBaEDgcIZeTmvtm169q7PACI@ifi.uio.no>

On Wed, 27 Aug 2008, Adrian Bunk wrote:
> >
> > We're much better off with a 1% code-size reduction than forcing big
> > stacks on people. The 4kB stack option is also a good way of saying "if it
> > works with this, then 8kB is certainly safe".
>
> You implicitly assume both would solve the same problem.

I'm just saying that your logic doesn't hold water.

If we can save kernel stack usage, then a 1% increase in kernel size is
more than worth it.

> While 4kB stacks are something we anyway never got 100% working

What? Don't be silly.

Linux _historically_ always used 4kB stacks.

No, they are likely not usable on x86-64, but dammit, they should be more
than usable on x86-32 still.

> But I do not think the problem you'd solve with
> -fno-inline-functions-called-once is big enough to warrant the size
> increase it causes.

You continually try to see the inlining as a single solution to one
problem (debuggability, stack, whatever).

The biggest problem with gcc inlining has always been that it has been
_unpredictable_. It causes problems in many different ways. It has caused
stability issues due to gcc versions doing random things. It causes the
stack expansion. It makes stack traces harder for debugging, etc.

If it was any one thing, I wouldn't care. But it's exactly the fact that
it causes all these problems in different areas.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c -
Date: Wed, 27 Aug 2008 00:30:08 UTC
Message-ID: <fa.eKVQlroLjQDNpYhVVVGTpVic8Fw@ifi.uio.no>

On Wed, 27 Aug 2008, Adrian Bunk wrote:
>
> When did we get callpaths like nfs+xfs+md+scsi reliably
> working with 4kB stacks on x86-32?

XFS may never have been usable, but the rest, sure.

And you seem to be making this whole argument an excuse to SUCK, and an
excuse to let gcc crap even more on our stack space.

Why?

Why aren't you saying that we should be able to do better? Instead, you
seem to be asking us to do even worse than we do now?

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y
Date: Fri, 09 Jan 2009 16:29:45 UTC
Message-ID: <fa.SDq2D7lqr2P8ym8ftoiIrUhScao@ifi.uio.no>

On Fri, 9 Jan 2009, Ingo Molnar wrote:
>
> Core kernel developers tend to be quite inline-conscious and generally do
> not believe that making something inline will make it go faster.

Some of us core kernel developers tend to believe that:

 - inlining is supposed to work like macros, and should make the compiler
   do decisions BASED ON CALL-SITE.

   This is one of the most _common_ reasons for inlining. Making the
   compiler select static code rather than dynamic code, and using
   inlining as a nice macro. We can pass in a flag with a constant value,
   and only the case that matters will be compiled.

It's not about size - or necessarily even performance - at all. It's about
abstraction, and a way of writing code.
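
A minimal sketch of that pattern, with made-up names: one shared inline
body takes a flag, and each wrapper passes a constant, so after inlining
only the arm that matters should remain in each wrapper:

```c
#include <stddef.h>

/*
 * Shared body. "zero_pad" is a compile-time constant at every call site
 * below, so after inlining only one arm of the branch remains in each
 * wrapper.
 */
static inline __attribute__((always_inline))
size_t copy_common(char *dst, const char *src, size_t n, int zero_pad)
{
	size_t i = 0;

	while (i < n && src[i]) {
		dst[i] = src[i];
		i++;
	}
	if (zero_pad) {
		for (size_t j = i; j < n; j++)
			dst[j] = '\0';
	}
	return i;	/* number of bytes copied from src */
}

size_t copy_plain(char *dst, const char *src, size_t n)
{
	return copy_common(dst, src, n, 0);
}

size_t copy_padded(char *dst, const char *src, size_t n)
{
	return copy_common(dst, src, n, 1);
}
```

This is the "inline as a nice macro" use: the two wrappers share one source
body but compile to different code.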

And the thing is, as long as gcc does what we ask, we can notice when _we_
did something wrong. We can say "ok, we should just remove the inline"
etc. But when gcc then essentially flips a coin, and inlines things we
don't want it to, it dilutes the whole value of inlining - because now gcc
does things that actually do hurt us.

We get oopses that have a nice symbolic back-trace, and it reports an
error IN TOTALLY THE WRONG FUNCTION, because gcc "helpfully" inlined
things to the point that only an expert can realize "oh, the bug was
actually five hundred lines up, in that other function that was just
called once, so gcc inlined it even though it is huge".

See? THIS is the problem with gcc heuristics. It's not about quality of
code, it's about RELIABILITY of code.

The reason people use C for system programming is because the language is
a reasonably portable way to get the expected end results WITHOUT the
compiler making a lot of semantic changes behind your back.

Inlining is also the wrong thing to do _even_ if it makes code smaller and
faster if you inline the unlikely case, or inlining causes more live
variables that cause stack pressure. And we KNOW this happens. Again, I'd
be much happier if we had a compiler option to just do "do what I _say_,
dammit", and then we can fix up the mistakes. Because then they are _our_
mistakes, not some random compiler version that throws a dice!

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y
Date: Sat, 10 Jan 2009 00:06:40 UTC
Message-ID: <fa.04KsVYKnYARGpdZw/NVNVhqLy1Y@ifi.uio.no>

On Fri, 9 Jan 2009, Nicholas Miell wrote:
>
> So take your complaint about gcc's decision to inline functions called
> once.

Actually, the "called once" really is a red herring. The big complaint is
"too aggressively when not asked for". It just so happens that the called
once logic is right now the main culprit.

> Ignore for the moment the separate issue of stack growth and let's
> talk about what it does to debugging, which was the bulk of your
> complaint that I originally responded to.

Actually, stack growth is the one that ends up being a correctness issue.
But:

> In the general case it does nothing at all to debugging (beyond the
> usual weird control flow you get from any optimized code) -- the
> compiler generates line number information for the inlined functions,
> the debugger interprets that information, and your backtrace is
> accurate.

The thing is, we do not use line number information, and never will -
because it's too big. MUCH too big.

We do end up saving function start information (although even that is
actually disabled if you're doing embedded development), so that we can at
least tell which function something happened in.

> It is only in the specific case of the kernel's broken backtrace code
> that this becomes an issue. Its failure to function correctly is the
> direct result of a failure to keep up with modern compiler changes that
> everybody else in the toolchain has dealt with.

Umm. You can say that. But the fact is, most others care a whole lot
_less_ about those "modern compiler changes". In user space, when you
debug something, you generally just stop optimizing. In the kernel, we've
tried to balance the "optimize vs debug info" thing.

> I think that the answer to that is that the kernel should do its best to
> be as much like userspace apps as it can, because insisting on special
> treatment doesn't seem to be working.

The problem with that is that the kernel _isn't_ a normal app. And it
_definitely_ isn't a normal app when it comes to debugging.

You can hand-wave and talk about it all you want, but it's just not going
to happen. A kernel is special. We don't get dumps, and only crazy people
even ask for them.

The fact that you seem to think that we should get them just shows that
you either don't understand the problems, or you live in some sheltered
environment where crash-dumps _could_ work, but also by definition those
environments aren't where they buy kernel developers anything.

The thing is, a crash dump in an "enterprise environment" (and that is the
only kind where you can reasonably dump more than the minimal stuff we do
now) is totally useless - because such kernels are usually at least a year
old, often more. As such, debug information from enterprise users is
almost totally worthless - if we relied on it, we'd never get anything
done.

And outside of those kinds of very rare niches, big kernel dumps simply
are not an option. Writing to disk when things go hay-wire in the kernel
is the _last_ thing you must ever do. People can't have dedicated dump
partitions or network dumps.

That's the reality. I'm not making it up. We can give a simple trace, and
yes, we can try to do some off-line improvement on it (and kerneloops.org
to some degree does), but that's just about it.

But debugging isn't even the only issue. It's just that debuggability is
more important than a DUBIOUS improvement in code quality. See? Note the
DUBIOUS.

Let's take a very practical example on a number that has been floated
around here: letting gcc do inlining decisions apparently can help for up
to about 4% of code-size. Fair enough - I happen to believe that we could
cut that down a bit by just doing things manually with a checker, but
that's neither here nor there.

What's the cost/benefit of that 4%? Does it actually improve performance?
Especially if you then want to keep DWARF unwind information in memory in
order to fix up some of the problems it causes? At that point, you lost
all the memory you won, and then some.

Does it help I$ utilization (which can speed things up a lot more, and is
probably the main reason -Os actually tends to perform better)? Likely
not. Sure, shrinking code is good for I$, but on the other hand inlining
can actually be bad for I$ density because if you inline a function that
doesn't get called, you now fragmented your footprint a lot more.

So aggressively inlining has to be shown to be a real _win_.

You try to say "well, do better debug info", but that turns inlining into
a _loss_, so then the proper response is "don't inline".

So when is inlining a win?

It's a win when the thing you inline is clearly not bigger than the call
site. Then it's totally unambiguous.

It's also often a win if it's an unconditional call from a single site, and
you only inline one such, so that you avoid all of the downsides (you may
be able to _shrink_ stack usage, and you're hopefully making I$ accesses
_denser_ rather than fragmenting it).

And if you can seriously simplify the code by taking advantage of constant
arguments, it can be an absolutely _huge_ win. Except as we've seen in
this discussion, gcc currently doesn't apparently even consider this case
before it does the inlining decision.

But if we're just looking at code-size, then no, it's _not_ a win. Code
size can be a win (4% denser I$ is good), but a lot of the cases I've seen
(which is often the _bad_ cases, since I end up looking at them because we
are chasing bugs due to things like stack usage), it's actually just
fragmenting the function and making everybody lose.

Oh, and yes, it does depend on architectures. Some architectures suck at
function calls. That's why being able to trust the compiler _would_ be a
good thing, no question about that. But yes, we do need to be able to
trust it to make sense.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y
Date: Sat, 10 Jan 2009 00:42:39 UTC
Message-ID: <fa.bL2NeqhaOXYjGqbaslOzhjMU8eA@ifi.uio.no>

On Sat, 10 Jan 2009, Andi Kleen wrote:
>
> > What's the cost/benefit of that 4%? Does it actually improve performance?
> > Especially if you then want to keep DWARF unwind information in memory in
> > order to fix up some of the problems it causes? At that point, you lost
>
> dwarf unwind information has nothing to do with this, it doesn't tell
> you anything about inlining or not inlining.  It just gives you
> finished frames after all of that has been done.
>
> Full line number information would help, but I don't think anyone
> proposed to keep that in memory.

Yeah, true. Although one of the reasons inlining actually ends up causing
problems is because of the bigger stack frames. That leaves a lot of space
for old stale function pointers to peek through.

With denser stack frames, the stack dumps look better, even without an
unwinder.

> > Does it help I$ utilization (which can speed things up a lot more, and is
> > probably the main reason -Os actually tends to perform better)? Likely
> > not. Sure, shrinking code is good for I$, but on the other hand inlining
> > can actually be bad for I$ density because if you inline a function that
> > doesn't get called, you now fragmented your footprint a lot more.
>
> Not sure that is always true; the gcc basic block reordering
> based on its standard branch prediction heuristics (e.g. < 0 or
> == NULL unlikely or the unlikely macro) might well put it all out of line.

I thought -Os actually disabled the basic-block reordering, doesn't it?

And I thought it did that exactly because it generates bigger code and
much worse I$ patterns (ie you have a lot of "conditional branch to other
place and then unconditional branch back" instead of "conditional branch
over the non-taken code").

Also, I think we've had about as much good luck with guessing
"likely/unlikely" as we've had with "inline" ;)
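
[ For reference, a minimal sketch of what a "likely/unlikely" annotation
  looks like in plain C. The macros have the same shape as the kernel's and
  rely on the gcc/clang builtin __builtin_expect; sum_bytes() is a
  hypothetical helper made up for this example. ]

```c
#include <stddef.h>

/* Branch hints built on the gcc/clang builtin __builtin_expect. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Hypothetical helper: sum a buffer, treating NULL as the rare error case. */
long sum_bytes(const unsigned char *buf, size_t len)
{
	long total = 0;
	size_t i;

	/* Hinted as cold: the compiler may move this return out of line. */
	if (unlikely(buf == NULL))
		return -1;

	for (i = 0; i < len; i++)
		total += buf[i];
	return total;
}
```

[ The hint only affects code layout, never the result, and whether it helps
  depends on the hint matching the real workload. ]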

Sadly, apart from some of the "never happens" error cases, the kernel
doesn't tend to have lots of nice patterns. We have almost no loops (well,
there are loops all over, but most of them we hopefully just loop over
once or twice in any good situation), and few really predictable things.

Or rather, they can easily be very predictable under one particular load,
and then totally the other way around under another ..

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y
Date: Sat, 10 Jan 2009 04:06:43 UTC
Message-ID: <fa.GL+P+cOZVrnLEz4STPw6diYC4Y4@ifi.uio.no>

On Fri, 9 Jan 2009, Nicholas Miell wrote:
>
> It's only too big if you always keep it in memory, and I wasn't
> suggesting that.

Umm. We're talking kernel panics here. If it's not in memory, it doesn't
exist as far as the kernel is concerned.

If it doesn't exist, it cannot be reported.

> My point was that you can get completely accurate stack traces in the
> face of gcc's inlining, and that blaming gcc because you can't get good
> stack traces because the kernel's debugging infrastructure isn't up to
> snuff isn't exactly fair.

No. I'm blaming inlining for making debugging harder.

And that's ok - IF IT IS WORTH IT.

It's not. Gcc inlining decisions suck. gcc inlines stuff that doesn't
really help from being inlined, and doesn't inline stuff that _does_.

What's so hard to accept in that?

> And this is where we disagree. I believe that crash dumps should be the
> norm and all the reasons you have against crash dumps in general are in
> fact reasons against Linux's sub-par implementation of crash dumps in
> specific.

Good luck with that. Go ahead and try it.  You'll find it wasn't so easy
after all.

> So, here I am, a non-enterprise end user with a non-stale kernel who'd
> love to be able to give you a crash dump (or, more likely, a stack trace
> created from that crash dump), but I can't because Linux crash dumps are
> stuck in the enterprise ghetto.

No, you're stuck because you apparently have your mind stuck on a
crash-dump, and aren't willing to look at alternatives.

You could use a network console. Trust me - if you can't set up a network
console, you have no business mucking around with crash dumps.

And if the crash is hard enough that you can't get any output from that,
again, a crash dump wouldn't exactly help, would it?

> Hell, I'd be happy if I could get the normal panic text written to
> disk, but since the hard part is the actual writing to disk, there's no
> reason not to do the full crash dump if you can.

Umm. And why do you think the two have anything to do with each other?

Only insane people want the kernel to write to disk when it has problems.
Sane people try to write to something that doesn't potentially overwrite
their data. Like the network.

Which is there. Try it. Trust me, it's a _hell_ of a lot more likely to
work than a crash dump.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y
Date: Fri, 09 Jan 2009 16:45:43 UTC
Message-ID: <fa.GoHLHlU7kqBZW1gjYdTjXLkexW0@ifi.uio.no>

On Fri, 9 Jan 2009, H. Peter Anvin wrote:
> As far as naming is concerned, gcc effectively supports four levels,
> which *currently* map onto macros as follows:
>
> __always_inline		Inline unconditionally
> inline			Inlining hint
> <nothing>		Standard heuristics
> noinline		Uninline unconditionally
>
> A lot of noise is being made about the naming of the levels

The biggest problem is the <nothing>.

The standard heuristics for that are broken, in particular for the "single
call-site static function" case.

If gcc only inlined truly trivial functions for that case, I'd already be
much happier. Size be damned.
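
[ A small user-space sketch of the four levels quoted above, assuming gcc or
  clang. The my_always_inline and my_noinline macros are stand-ins invented
  for this example (the kernel's own macros are defined in its compiler
  headers), and the add_*() functions are hypothetical. ]

```c
/* Assumed stand-ins for the kernel macros, built on gcc/clang attributes. */
#define my_always_inline	inline __attribute__((always_inline))
#define my_noinline		__attribute__((noinline))

/* Level 1: inlined unconditionally. */
static my_always_inline int add_forced(int a, int b) { return a + b; }

/* Level 2: plain "inline" is only a hint to the compiler. */
static inline int add_hinted(int a, int b) { return a + b; }

/* Level 3: no marking at all; the compiler's heuristics decide. */
static int add_plain(int a, int b) { return a + b; }

/* Level 4: never inlined, so it keeps its own symbol and stack frame. */
static my_noinline int add_outlined(int a, int b) { return a + b; }

int add_all(int a, int b)
{
	return add_forced(a, b) + add_hinted(a, b) +
	       add_plain(a, b) + add_outlined(a, b);
}
```

[ All four return the same value; only the generated code differs, and for
  levels 2 and 3 the outcome can change between compiler versions. ]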

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y
Date: Fri, 09 Jan 2009 17:16:22 UTC
Message-ID: <fa.nmG/C0r1/KKfc6z69YV0o0AjbXY@ifi.uio.no>

On Fri, 9 Jan 2009, Steven Rostedt wrote:
>
> I vote for the, get rid of the current inline, rename __always_inline to
> inline, and then remove all non needed inlines from the kernel.

This is what we do all the time, and historically have always done.

But
 - CONFIG_OPTIMIZE_INLINING=y screws that up
and
 - gcc still inlines even big static functions that have no markings at
   all.

> We'll probably start adding a lot more noinlines.

That's going to be very painful. Especially since the cases we really want
to not inline are the random drivers etc - generally not "hot in the
cache", but they are the ones that cause the most oopses (not per line,
but globally - because there's just so many drivers).

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y
Date: Fri, 09 Jan 2009 17:13:03 UTC
Message-ID: <fa.0KW5wRM7/J1l+pSr2jlnyJgnj4g@ifi.uio.no>

On Fri, 9 Jan 2009, Andi Kleen wrote:
>
> There's also one alternative: gcc's inlining algorithms are extensively
> tunable with --param. We might be able to find a set of numbers that
> make it roughly work like we want it by default.

We tried that.

IIRC, the numbers mean different things for different versions of gcc, and
I think using the parameters was very strongly discouraged by gcc
developers. IOW, they were meant for gcc developers' internal tuning
efforts, not really for external people. Which means that using them would
put us _more_ at the mercy of random compiler versions rather than less.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y
Date: Fri, 09 Jan 2009 17:56:32 UTC
Message-ID: <fa.KA9gurkEIaCwsWdskomSSe32Jtc@ifi.uio.no>

On Fri, 9 Jan 2009, Matthew Wilcox wrote:
>
> That seems like valuable feedback to give to the GCC developers.

Well, one thing we should remember is that the kernel really _is_ special.

The kernel not only does things no other program tends to do (inline asms
are unusual in the first place - many of them are literally due to system
issues like atomic accesses and interrupts that simply aren't an issue in
user space, or that need so much abstraction that they aren't inlinable
anyway).

But the kernel also has totally different requirements in other ways. When
was the last time you did user space programming and needed to get a
backtrace from a user with register info because you simply don't have the
hardware that he has?

IOW, debugging in user space tends to be much more about trying to
reproduce the bug - in a way that we often cannot in the kernel. User
space in general is much more reproducible, since it's seldom as hardware-
or timing-dependent (threading does change the latter, but usually user
space threading is not _nearly_ as aggressive as the kernel has to be).

So the thing is, even if gcc was "perfect", it would likely be perfect for
a different audience than the kernel.

Do you think _any_ user space programmer worries about the stack space
being a few hundred bytes larger because the compiler inlined two
functions, and caused stack usage to be sum of them instead of just the
maximum of the two?
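
[ The "sum instead of maximum" arithmetic can be written down directly. This
  is a simplified model, not a measurement: the frame sizes are hypothetical
  inputs, and a real compiler may share stack slots between inlined bodies,
  so the sum is the worst case. ]

```c
#include <stddef.h>

/*
 * The caller calls A, then B, as separate functions: only one callee
 * frame is live at a time, so the peak is the larger of the two.
 */
size_t stack_if_called(size_t caller, size_t frame_a, size_t frame_b)
{
	return caller + (frame_a > frame_b ? frame_a : frame_b);
}

/*
 * Both callees are inlined and their locals do not share slots: the
 * caller's own frame has to hold everything at once.
 */
size_t stack_if_inlined(size_t caller, size_t frame_a, size_t frame_b)
{
	return caller + frame_a + frame_b;
}
```

[ With a 64-byte caller and callees of 200 and 300 bytes, the model gives 364
  bytes for separate calls and 564 bytes when both are inlined. ]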

So we do have special issues. And exactly _because_ we have special issues
we should also expect that some compiler defaults simply won't ever really
be appropriate for us.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y
Date: Fri, 09 Jan 2009 19:45:38 UTC
Message-ID: <fa.t6pNzsqwbYZNifIf+ERhbcufeJg@ifi.uio.no>

On Fri, 9 Jan 2009, Richard Guenther wrote:
>
> -fno-inline-functions-called-once disables the heuristic that always
> inlines (static!) functions that are called once.  Other heuristics
> still apply, like inlining the static function if it is small.
> Everything else would be totally stupid - which seems to be the "default
> mode" you think GCC developers are in.

Well, I don't know about you, but the "don't inline a single instruction"
sounds a bit stupid to me. And yes, that's exactly what triggered this
whole thing.

We have two examples of gcc doing that, one of which was even a modern
version of gcc, where we had done absolutely _everything_ on a source
level to make sure that gcc could not possibly screw up. Yet it did:

	static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
	{
	        return ((1UL << (nr % BITS_PER_LONG)) &
	                (((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
	}

	#define test_bit(nr, addr)                      \
	        (__builtin_constant_p((nr))             \
	         ? constant_test_bit((nr), (addr))      \
	         : variable_test_bit((nr), (addr)))

in this case, Ingo said that changing that _single_ inline to forcing
inlining made a difference.

That's CRAZY. The thing isn't even called unless "nr" is constant, so
absolutely _everything_ optimizes away, and that whole function was
designed to give us a single instruction:

	testl $constant,constant_offset(addr)

and nothing else.

Maybe there was something else going on, and maybe Ingo's tests were off,
but this is an example of gcc not inlining WHEN WE TOLD IT TO, and when
the function was a single instruction.

How can anybody possibly not consider that to be "stupid"?
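
[ A user-space sketch of the same pattern with forced inlining, assuming gcc
  or clang. BITS_PER_LONG is derived from <limits.h> here, and
  variable_test_bit() is a plain C stand-in for the kernel's asm version;
  bit3_is_set() is a hypothetical caller added for illustration. ]

```c
#include <limits.h>

#define BITS_PER_LONG	((int)(sizeof(unsigned long) * CHAR_BIT))

/* Forced inline: with a constant 'nr' the shift and index fold away. */
static inline __attribute__((always_inline))
int constant_test_bit(int nr, const volatile unsigned long *addr)
{
	return ((1UL << (nr % BITS_PER_LONG)) &
		addr[nr / BITS_PER_LONG]) != 0;
}

/* Plain C stand-in for the non-constant case. */
static inline int variable_test_bit(int nr, const volatile unsigned long *addr)
{
	return (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

/* Pick the constant version only when 'nr' is a compile-time constant. */
#define test_bit(nr, addr)                      \
	(__builtin_constant_p((nr))             \
	 ? constant_test_bit((nr), (addr))      \
	 : variable_test_bit((nr), (addr)))

/* Constant call site: ideally a single test instruction after inlining. */
int bit3_is_set(const unsigned long *map)
{
	return test_bit(3, map);
}
```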

The other case (with a single "cmpxchg" inline asm instruction) was at
least _slightly_ more understandable, in that (a) Ingo claims modern gcc's
did inline it and (b) the original function actually has a "switch()"
statement that depends on the argument that is constant, so a stupid
inliner might believe that it's a big function. But again, we _told_ the
compiler to inline the damn thing, because we knew better. But gcc didn't.

The other part that is crazy is when gcc inlines large functions that
aren't even called most of the time (the "ioctl()" switch statements tend
to be a great example of this - gcc inlines ten or twenty functions, and
we can guarantee that only one of them is ever called). Yes, maybe it
makes the code smaller, but it makes the code also undebuggable and often
BUGGY, because we now have the stack frame of all ten-to-twenty functions
to contend with.

And notice how "static" has absolutely _zero_ meaning for the above
example. Yes, the thing is called just from one place - that's how
something like that very much works. It's a special case. It's not _worth_
inlining, especially if it causes bugs. So "called once" or "static" is
actually totally irrelevant.

And no, they are not marked "inline" (although they are clearly also not
marked "uninline", until we figure out that gcc is causing system crashes,
and we add the thing).
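
[ A minimal sketch of the "add the thing" fix for an ioctl-style switch,
  assuming gcc or clang. The command numbers, handlers and my_noinline macro
  are all hypothetical; the volatile buffers only stand in for large on-stack
  state. ]

```c
/* Assumed stand-in for the kernel's noinline marking (gcc/clang attribute). */
#define my_noinline __attribute__((noinline))

/* Hypothetical command numbers. */
enum { CMD_GET = 1, CMD_SET = 2 };

/* Each handler keeps its own frame, live only while that handler runs. */
static my_noinline long handle_get(long arg)
{
	volatile char scratch[256];	/* stands in for a big local struct */

	scratch[0] = (char)arg;
	return scratch[0] + 1;
}

static my_noinline long handle_set(long arg)
{
	volatile char scratch[512];

	scratch[0] = (char)arg;
	return scratch[0] * 2;
}

/* Only one case runs per call, so the dispatcher's frame stays small. */
long demo_ioctl(unsigned int cmd, long arg)
{
	switch (cmd) {
	case CMD_GET:
		return handle_get(arg);
	case CMD_SET:
		return handle_set(arg);
	default:
		return -1;
	}
}
```

[ Without the noinline marking, a compiler is free to fold both handlers into
  demo_ioctl(), which is the situation described above. ]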

If these two small problems were fixed, gcc inlining would work much
better. But the first one, in particular, means that the "do I inline or
not" decision would have to happen after expanding and simplifying
constants. And then, if the end result is big, the inlining gets aborted.

				Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y
Date: Fri, 09 Jan 2009 20:27:28 UTC
Message-ID: <fa.8Ms5ia9LP1Tql1NMu3GVwSOLeaY@ifi.uio.no>

On Fri, 9 Jan 2009, Richard Guenther wrote:
>
> This is a case where the improved IPA-CP (interprocedural constant
> propagation) of GCC 4.4 may help.  In general GCC cannot say how a call
> argument may affect optimization if the function was inlined, so the
> size estimates are done with just looking at the function body, not the
> arguments (well, for GCC 4.4 this is not completely true, there is now
> some "heuristics").  With IPA-CP GCC will clone the function for the
> constant arguments, optimize it and eventually inline it if it is small
> enough.  At the moment this happens only if all callers call the
> function with the same constant though (at least I think so).

Ok, that's useless. The whole point is that everybody gives different -
but still constant - arguments.

> The above is definitely one case where using a macro or forced inlining is
> a better idea than to trust a compiler to figure out that it can optimize the
> function to a size suitable for inlining if called with a constant parameter.

.. and forced inlining is what we default to. But that's when "let's try
letting gcc optimize this" fails. And macros get really unreadable, really
quickly.

> > Maybe there was something else going on, and maybe Ingo's tests were off,
> > but this is an example of gcc not inlining WHEN WE TOLD IT TO, and when
> > the function was a single instruction.
> >
> > How can anybody possibly not consider that to be "stupid"?
>
> Because it's a hard problem, it's not stupid to fail here - you didn't tell the
> compiler the function optimizes!

Well, actually we did. It's that "inline" there. That's how things used to
work. It's like "no". It means "no". It doesn't mean "yes, I really want
to s*ck your d*ck, but I'm just screaming no at the top of my lungs
because I think I should do so".

See?

And you do have to realize that Linux has been using gcc for a _loong_
while. You can talk all you want about how "inline" is just a hint, but
the fact is, it didn't use to be. gcc people _made_ it so, and are having
a damn hard time admitting that it's causing problems.

> Experience tells us that people do not know better.  Maybe the kernel is
> an exception here

Oh, I can well believe it.

And I don't even think that kernel people get it right nearly enough, but
since for the kernel it can even be a _correctness_ issue, at least if we
get it wrong, everybody sees it.

When _some_ compiler versions get it wrong, it's a disaster.

> But would you still want small functions be inlined even if they are not
> marked inline?

If you can really tell that they are that small, yes.

> They do - just constant arguments are obviously not used for optimizing
> before inlining.  Otherwise you'd scream bloody murder at us for all the
> increase in compile-time ;)

A large portion of that has gone away now that everybody uses ccache. And
if you only did it for functions that we _mark_ inline, it wouldn't even
be true. Because those are the ones that presumably really should be
inlined.

So no, I don't believe you. You much too easily dismiss the fact that
we've explicitly marked these functions for inlining, and then you say
"but we were too stupid".

If you cannot afford to do the real job, then trust the user. Don't guess.

		Linus


From: Theodore Tso <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y
Date: Fri, 09 Jan 2009 21:25:01 UTC
Message-ID: <fa.pHG1bWsOlG5lAFBkIpBHeIBCKQQ@ifi.uio.no>

I'm beginning to think that for the kernel, we should just simply
remove CONFIG_OPTIMIZE_INLINING (so that inline means
"always_inline"), and add -fno-inline-functions
-fno-inline-functions-called-once (so that gcc never inlines functions
behind our back) --- and then we create tools that count how many times
functions get used, and how big functions are, so that we can flag if some
function really should be marked inline when it isn't or vice versa.

But given that this is a very hard thing for an automated program
to do, let's write some tools so we can easily put a human in the loop,
who can add or remove inline keywords where it makes sense, and let's
give up on gcc being able to "guess" correctly.

For some things, like register allocation, I can accept that the
compiler will usually get these things right.  But whether or not to
inline a function seems to be one of those things that humans (perhaps
with some tools assist) can still do a better job than compilers.

     	  		    	     - Ted


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y
Date: Fri, 09 Jan 2009 17:59:19 UTC
Message-ID: <fa.f+jsEeQpJ46YBYW58+UvifsbW/8@ifi.uio.no>

On Fri, 9 Jan 2009, Andi Kleen wrote:
>
> Universal noinline would also be a bad idea because of its
> costs (4.1% text size increase). Perhaps should make it
> a CONFIG option for debugging though.

That's _totally_ the wrong way.

If you can reproduce an issue on your machine, you generally don't care
about inline, because you can see the stack, do the whole "gdb vmlinux"
thing, and you generally have tools to help you decode things. Including
just recompiling the kernel with an added noinline.

But _users_ just get their oopses sent automatically. So it's not about
"debugging kernels", it's about _normal_ kernels. They are the ones that
need to be debuggable, and the ones that care most about things like the
symbolic EIP being as helpful as possible.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y
Date: Fri, 09 Jan 2009 19:49:39 UTC
Message-ID: <fa.tGne9QEbD5pyh244mGZJTVEGt2c@ifi.uio.no>

On Fri, 9 Jan 2009, Matthew Wilcox wrote:
>
> Now, I'm not going to argue the directIO code is a shining example of
> how we want things to look, but we don't really want ten arguments
> being marshalled into a function call; we want gcc to inline the
> direct_io_worker() and do its best to optimise the whole thing.

Well, except we quite probably would be happier with gcc not doing that,
than with gcc doing that too often.

There are exceptions. If the caller is really small (ie a pure wrapper
that perhaps just does some locking around the call), then sure, inlining
a large function that only gets called from one place does make sense.

But if both the caller and the callee are large, like in your example, then
no. DON'T INLINE IT. Unless we _tell_ you, of course, which we probably
shouldn't do.

Why? Because debugging is more important. And deciding to inline that, you
probably decided to inline something _else_ too. And now you've quite
possibly blown your stackspace.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Date: Fri, 09 Jan 2009 21:51:54 UTC
Message-ID: <fa.Uz+BoXjDdtDFTn4KWhLuFymXOqQ@ifi.uio.no>

On Fri, 9 Jan 2009, Harvey Harrison wrote:
>
> __needs_inline?  That would imply that it's for correctness reasons.

.. but the point is, we have _thousands_ of inlines, and do you know which
is which? We've historically forced them to be inlined, and every time
somebody does that "OPTIMIZE_INLINE=y", something simply _breaks_.

So instead of just continually hitting our head against this wall because
some people seem to be convinced that gcc can do a good job, just do it
the other way around. Make the new one be "inline_hint" (no underscores
needed, btw), and there is absolutely ZERO confusion about what it means.

At that point, everybody knows why it's there, and it's clearly not a
correctness issue or anything else.

Of course, at that point you might as well argue that the thing should not
exist at all, and that such a flag should just be removed entirely. Which
I certainly agree with - I think the only flag we need is "inline", and I
think it should mean what it damn well says.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Date: Sat, 10 Jan 2009 01:32:03 UTC
Message-ID: <fa.LbAa6jaMU4TZ3Iy4i6c+pGDUI+Y@ifi.uio.no>

On Fri, 9 Jan 2009, Harvey Harrison wrote:

> On Sat, 2009-01-10 at 02:01 +0100, Ingo Molnar wrote:
>
> >  - Headers could probably go back to 'extern inline' again. At not small
> >    expense - we just finished moving to 'static inline'. We'd need to
> >    guarantee a library instantiation for every header include file - this
> >    is an additional mechanism with additional introduction complexities
> >    and an ongoing maintenance cost.
>
> Puzzled?  What benefit is there to going back to extern inline in headers?

There's none. In fact, it's wrong, unless you _also_ have an extern
definition (according to the "new" gcc rules as of back in the days).

Of course, as long as "inline" really means _always_ inline, it won't
matter. So in that sense Ingo is right - we _could_. Which has no bearing
on whether we _should_, of course.

In fact, the whole mess with "extern inline" is a perfect example of why an
inlining hint should be called "may_inline" or "inline_hint" or something
like that.

Because then it actually makes sense to have "extern may_inline" with one
definition, and another definition for the non-inline version.  And it's
very clear what the deal is about, and why we literally have two versions
of the same function.
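
[ A sketch of "two versions of the same function" using the old GNU meaning
  of "extern inline", which gcc and clang still provide through the
  gnu_inline attribute. Both definitions are shown in one file only to keep
  the example self-contained; normally the first lives in a header and the
  second in a library .c file. twice() and quadruple() are hypothetical
  names. ]

```c
/*
 * "Header" version: with gnu_inline, this body is used only for
 * inlining and is never emitted as a standalone function.
 */
extern inline __attribute__((gnu_inline)) int twice(int x)
{
	return 2 * x;
}

/*
 * "Library" version: the real out-of-line definition, used whenever
 * the compiler decides not to inline a call.
 */
int twice(int x)
{
	return 2 * x;
}

int quadruple(int x)
{
	return twice(twice(x));
}
```

[ If the second definition is missing and the compiler declines to inline,
  the link fails with an undefined reference to twice(). ]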

But again, that's very much not a "let's use 'extern' instead of
'static'". It's a totally different issue.

		Linus

[ A third reason to use "extern inline" is actually a really evil one: we
  could do it for our unrelated issue with system call definitions on
  architectures that require the caller to sign-extend the arguments.

  Since we don't control the callers of system calls, we can't do that,
  and architectures like s390 actually have potential security holes due
  to callers that don't "follow the rules". So there are different needs
  for trusted - in-kernel - system call users that we know do the sign
  extension correctly, and untrusted - user-mode callers that just call
  through the system call function table.

  What we _could_ do is for the wrappers to use

	extern inline int sys_open(const char *pathname, int flags, mode_t mode)
	{
		return SYS_open(pathname, flags, mode);
	}

  which gives the C callers the right interface without any unnecessary
  wrapping, and then

	long WRAP_open(const char *pathname, long flags, long mode)
	{
		return SYS_open(pathname, flags, mode);
	}
	asm ("\t.globl sys_alias\n\t.set WRAP_open");

  which is the one that gets linked from any asm code. So now asm code
  and C code gets two different functions, even though they use the same
  system call name - one with inline expansion, one with linker games.

  Whee. The games we can play (and the odd reasons we must play them). ]


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Date: Sat, 10 Jan 2009 05:29:32 UTC
Message-ID: <fa.NGJfVzTPrXcx3LshiT09SZVnq54@ifi.uio.no>

On Fri, 9 Jan 2009, H. Peter Anvin wrote:
>
> I was thinking about experimenting with this, to see what level of
> upside it might add.  Ingo showed me numbers which indicate that a
> fairly significant fraction of the cases where removing inline helps is
> in .h files, which would require code movement to fix.  Hence to see if
> it can be automated.

We _definitely_ have too many inline functions in headers. They usually
start out small, and then they grow. And even after they've grown big,
it's usually not at all clear exactly where else they should go, so even
when you realize that "that shouldn't be inlined", moving them and making
them uninlined is not obvious.

And quite often, some of them go away - or at least shrink a lot - when
some config option or other isn't set. So sometimes it's an inline because
a certain class of people really want it inlined, simply because for
_them_ it makes sense, but when you enable debugging or something, it
absolutely explodes.

		Linus


From: "H. Peter Anvin" <hpa@zytor.com>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Date: Sat, 10 Jan 2009 06:01:56 UTC
Message-ID: <fa.9B8Q/eZ88R7Shrn4VqTPnEpPl5Y@ifi.uio.no>

Linus Torvalds wrote:
>
> And quite often, some of them go away - or at least shrink a lot - when
> some config option or other isn't set. So sometimes it's an inline because
> a certain class of people really want it inlined, simply because for
> _them_ it makes sense, but when you enable debugging or something, it
> absolutely explodes.
>

And this is really why getting static inline annotations right is really
hard if not impossible in the general case (especially when considering
the sheer number of architectures we compile on.)  So making it possible
for the compiler to do the right thing for at least this class of
functions really does seem like a good idea.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Date: Sun, 11 Jan 2009 19:26:38 UTC
Message-ID: <fa.+m20ZSg3RmlCFqettyuMx5WX7p8@ifi.uio.no>

On Sun, 11 Jan 2009, Andi Kleen wrote:
>
> The proposal was to use -fno-inline-functions-called-once (but
> the resulting numbers were not promising)

Well, the _optimal_ situation would be to not need it, because gcc does a
good job without it. That implies trying to find a better balance between
"worth it" and "causes problems".

Right now, it does sound like gcc simply doesn't try to balance AT ALL, or
balances only when we add some very version-specific random options (ie
the stack-usage one). And even those options don't actually make much
sense - yes, they "balance" things, but they don't do it in a sensible
manner.

For example: stack usage is undeniably a problem (we've hit it over and
over again), but it's not about "stack must not be larger than X bytes".

If the call is done unconditionally, then inlining _one_ function will
grow the static stack usage of the function we inline into, but it will
_not_ grow the dynamic stack usage one whit - so deciding to not inline
because of stack usage is pointless.

See? So "stop inlining when you hit a stack limit" IS THE WRONG THING TO
DO TOO! Because it just means that the compiler continues to do bad
inlining decisions until it hits some magical limit - but since the
problem isn't the static stack size of any _single_ function, but the
combined stack size of a dynamic chain of them, that's totally idiotic.
You still grew the dynamic stack, and you have no way of knowing by how
much - the limit on the static one simply has zero bearing what-so-ever on
the dynamic one.

So no, "limit static stack usage" is not a good option, because it stops
inlining when it doesn't matter (single unconditional call), and doesn't
stop inlining when it might (lots of sequential calls, in a deep chain).
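
The static-versus-dynamic distinction can be put into numbers with a toy
model. This is only a sketch: struct func, the helper names and the frame
sizes are invented for illustration, and "inlining" here means naive
inlining with no stack-slot sharing at all.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a function has its own frame plus callees that run one
 * after another. All sizes are made-up illustration values. */
struct func {
	size_t frame;			/* static frame size in bytes */
	const struct func *callee[4];	/* callees, NULL-terminated if fewer */
};

/* Worst-case dynamic stack depth: own frame plus the deepest callee. */
size_t dynamic_depth(const struct func *f)
{
	size_t deepest = 0;

	for (size_t i = 0; i < 4 && f->callee[i]; i++) {
		size_t d = dynamic_depth(f->callee[i]);
		if (d > deepest)
			deepest = d;
	}
	return f->frame + deepest;
}

/* Frame of 'f' after inlining every callee without sharing slots:
 * each callee's locals get their own space in the one big frame. */
size_t inlined_frame(const struct func *f)
{
	size_t total = f->frame;

	for (size_t i = 0; i < 4 && f->callee[i]; i++)
		total += inlined_frame(f->callee[i]);
	return total;
}

int stack_model_demo(void)
{
	static const struct func leaf   = { 200, { NULL } };
	static const struct func single = { 100, { &leaf } };
	static const struct func seq    = { 100, { &leaf, &leaf, &leaf } };

	/* One unconditional call: inlining moves bytes, adds none. */
	assert(dynamic_depth(&single) == 300 && inlined_frame(&single) == 300);
	/* Three sequential calls: 300 bytes deep as calls, 700 inlined. */
	assert(dynamic_depth(&seq) == 300 && inlined_frame(&seq) == 700);
	return 0;
}
```

A limit on the static frame would reject neither or both of these
depending on the threshold, which is exactly why it says nothing useful
about the dynamic depth.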

The other alternative is to let gcc do what it does, but

 (a) remove lots of unnecessary 'inline's. And we should likely do this
     regardless of any "-fno-inline-functions-called-once" issues.

 (b) add lots of 'noinline's to avoid all the cases where gcc screws up so
     badly that it's either a debugging disaster or an actual correctness
     issue.

The problem with (b) is that it's a lot of hard thinking, and debugging
disasters always happen in code that you didn't realize would be a problem
(because if you had, it simply wouldn't be the debugging issue it is).

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement
Date: Sun, 11 Jan 2009 23:07:40 UTC
Message-ID: <fa.F8Gy8J6PinYS9c2guwx7gk2NrA8@ifi.uio.no>

On Sun, 11 Jan 2009, Linus Torvalds wrote:
> On Sun, 11 Jan 2009, Andi Kleen wrote:
> >
> > Was -- i think that got fixed in gcc. But again only in newer versions.
>
> I doubt it. People have said that about a million times, it has never
> gotten fixed, and I've never seen any actual proof.

In fact, I just double-checked.

Try this:

	struct a {
		unsigned long array[200];
		int a;
	};

	struct b {
		int b;
		unsigned long array[200];
	};

	extern int fn3(int, void *);
	extern int fn4(int, void *);

	static inline __attribute__((always_inline)) int fn1(int flag)
	{
		struct a a;
		return fn3(flag, &a);
	}

	static inline __attribute__((always_inline)) int fn2(int flag)
	{
		struct b b;
		return fn4(flag, &b);
	}

	int fn(int flag)
	{
		if (flag & 1)
			return fn1(flag);
		return fn2(flag);
	}

(yeah, I made sure it would inline with "always_inline" just so that the
issue wouldn't be hidden by any "avoid stack frames" flags).

Gcc creates a big stack frame that contains _both_ 'a' and 'b', and does
not merge the allocations together even though they clearly have no
overlap in usage. Both 'a' and 'b' get 201 long-words (1608 bytes) of
stack, causing the inlined version to have 3kB+ of stack, even though the
non-inlined one would never use more than half of it.
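
The arithmetic can be checked with sizeof. This is a sketch that only
measures the types; it says nothing about what any particular gcc version
actually emits, and union shared and the helper names are made up for the
illustration:

```c
#include <assert.h>
#include <stddef.h>

struct a { unsigned long array[200]; int a; };
struct b { int b; unsigned long array[200]; };

/* What one frame needs if 'a' and 'b' each get private storage. */
size_t separate_bytes(void)
{
	return sizeof(struct a) + sizeof(struct b);
}

/* What it would need if the two non-overlapping lifetimes shared one
 * slot, modelled here with a union. */
union shared { struct a a; struct b b; };

size_t shared_bytes(void)
{
	return sizeof(union shared);
}

int stack_size_demo(void)
{
	/* 200 longs plus an int padded up to one more long. */
	assert(sizeof(struct a) == 201 * sizeof(unsigned long));
	assert(separate_bytes() == 2 * shared_bytes());
	return 0;
}
```

With 8-byte longs that is 3216 bytes for separate storage against 1608
for a shared slot.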

So please stop claiming this is fixed. It's not fixed, never has been, and
quite frankly, probably never will be because the lifetime analysis is
hard enough (ie once you inline and there is any complex usage, CSE etc
will quite possibly mix up the lifetimes - the above is clearly not any
_realistic_ example).

So even if the above trivial case could be fixed, I suspect a more complex
real-life case would still keep the allocations separate. Because merging
the allocations and re-using the same stack for both really is pretty
non-trivial, and the best solution is to simply not inline.

(And yeah, the above is such an extreme case that gcc seems to realize
that it makes no sense to inline because the stack frame is _so_ big. I
don't know what the default stack frame limit is, but it's apparently
smaller than 1.5kB ;)

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement
Date: Mon, 12 Jan 2009 00:22:38 UTC
Message-ID: <fa.feEsHj0nKUTofIJ5de4dl+CmKNo@ifi.uio.no>

On Mon, 12 Jan 2009, Andi Kleen wrote:
>
> so at least for this case it works. Your case also doesn't work
> for me. So it looks like gcc didn't like something you did in your test
> program.

I very intentionally used _different_ types.

If you use the same type, gcc will apparently happily say "hey, I can
combine two variables of the same type with different liveness into the
same variable".

But that's not the interesting case.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Date: Fri, 09 Jan 2009 21:48:44 UTC
Message-ID: <fa.BV8ptAKdMdvB5p5Fea5O4SgjuyY@ifi.uio.no>

On Fri, 9 Jan 2009, Ingo Molnar wrote:
>
> - Perhaps we could introduce a name for the first category: __must_inline?
>   __should_inline? Not because it wouldn't mean 'always', but because it is
>   'always inline' for another reason than the correctness __always_inline.

I think you're thinking about this the wrong way.

"inline" is a pretty damn strong hint already.

If you want a weaker one, make it _weaker_ instead of trying to use
superlatives like "super_inline" or "must_inline" or whatever.

So I'd suggest:

 - keep "inline" as being a strong hint. In fact, I'd suggest it not be a
   hint at all - when we say "inline", we mean it. No ambiguity
   _anywhere_, and no need for idiotic "I really really REALLY mean it"
   versions.

 - add a "maybe_inline" or "inline_hint" to mean that "ok, compiler, maybe
   this is worth inlining, but I'll leave the final choice to you".

That would get rid of the whole rationale for OPTIMIZE_INLINING=y, because
at that point, it's no longer potentially a correctness issue. At that
point, if we let gcc optimize things, it was a per-call-site conscious
decision.
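
As a rough sketch of the two spellings, assuming a gcc-compatible
compiler: strict_inline is a hypothetical name standing in for an
"inline" that always means it, maybe_inline is the hint suggested above,
and neither macro is an existing kernel definition.

```c
#include <assert.h>

/* Hypothetical: inlining is required, not a hint. */
#define strict_inline	inline __attribute__((always_inline))
/* Hypothetical: plain hint, the compiler makes the final call. */
#define maybe_inline	inline

static strict_inline int add_one(int x)
{
	return x + 1;
}

static maybe_inline int times_two(int x)
{
	return 2 * x;
}

/* Behaves the same whether or not times_two() ends up inlined. */
int inline_demo(int x)
{
	return times_two(add_one(x));
}

int inline_selftest(void)
{
	assert(inline_demo(3) == 8);
	return 0;
}
```

The point of the split is that the annotation itself records whether a
missed inline would be a correctness problem or merely a lost
optimization.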

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement
Date: Mon, 12 Jan 2009 19:03:35 UTC
Message-ID: <fa.xQXrLzkV/0a0UJDNwwd8v3DkUWM@ifi.uio.no>

On Mon, 12 Jan 2009, Bernd Schmidt wrote:
>
> Something at the back of my mind said "aliasing".
>
> $ gcc linus.c -O2 -S ; grep subl linus.s
>         subl    $1624, %esp
> $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s
>         subl    $824, %esp
>
> That's with 4.3.2.

Interesting.

Nonsensical, but interesting.

Since they have no overlap in lifetime, confusing this with aliasing is
really really broken (if the functions _hadn't_ been inlined, you'd have
gotten the same address for the two variables anyway! So anybody who
thinks that they need different addresses because they are different types
is really really fundamentally confused!).

But your numbers are unambiguous, and I can see the effect of that
compiler flag myself.

The good news is that the kernel obviously already uses
-fno-strict-aliasing for other reasons, so we should see this effect
already, _despite_ it making no sense. And the stack usage still causes
problems.

Oh, and I see why. This test-case shows it clearly.

Note how the max stack usage _should_ be "struct b" + "struct c". Note how
it isn't (it's "struct a" + "struct b/c").

So what seems to be going on is that gcc is able to do some per-slot
sharing, but if you have one function with a single large entity, and
another with a couple of different ones, gcc can't do any smart
allocation.

Put another way: gcc doesn't create a "union of the set of different stack
usages" (which would be optimal given a single frame, and generate the
stack layout of just the maximum possible size), it creates a "set of
unions of different stack usages" (which can be optimal in the trivial
cases, but not nearly optimal in practical cases).
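
The two layouts can be modelled with ordinary C types. This is a sketch
of the description above, not of gcc's actual allocator; the struct
definitions are the ones from the test-case at the end of this mail, and
the two layout type names are invented:

```c
#include <assert.h>
#include <stddef.h>

struct a { int a; unsigned long array[200]; };
struct b { int b; unsigned long array[100]; };
struct c { int c; unsigned long array[100]; };

/* "Union of the sets": fn1's locals overlay fn2's locals as a whole. */
union whole_frame_sharing {
	struct a fn1_locals;
	struct { struct b b; struct c c; } fn2_locals;
};

/* "Set of unions": sharing is decided slot by slot, so 'a' can overlay
 * only one of 'b' and 'c', and the other still needs its own space. */
struct per_slot_sharing {
	union { struct a a; struct b b; } slot0;
	struct c slot1;
};

int layout_demo(void)
{
	assert(sizeof(union whole_frame_sharing) ==
	       sizeof(struct b) + sizeof(struct c));
	assert(sizeof(struct per_slot_sharing) ==
	       sizeof(struct a) + sizeof(struct c));
	return 0;
}
```

With 8-byte longs the first layout is 1616 bytes and the second 2416,
which matches the "struct a" + "struct b/c" behavior described above.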

That explains the ioctl behavior - the structure use is usually pretty
complicated (ie it's almost never about just _one_ large stack slot, but
the ioctl cases tend to do random stuff with multiple slots).

So it doesn't add up to some horrible maximum of all sizes, but it also
doesn't end up coalescing stack usage very well.

		Linus
---
struct a {
	int a;
	unsigned long array[200];
};

struct b {
	int b;
	unsigned long array[100];
};

struct c {
	int c;
	unsigned long array[100];
};

extern int fn3(int, void *);
extern int fn4(int, void *);

static inline __attribute__ ((always_inline))
int fn1(int flag)
{
	struct a a;
	return fn3(flag, &a);
}

static inline __attribute__ ((always_inline))
int fn2(int flag)
{
	struct b b;
	struct c c;
	return fn4(flag, &b) + fn4(flag, &c);
}

int fn(int flag)
{
	fn1(flag);
	if (flag & 1)
		return 0;
	return fn2(flag);
}


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement
Date: Mon, 12 Jan 2009 19:46:29 UTC
Message-ID: <fa.d2BOK2R9sxTXWaAjPkPmmeiFZi8@ifi.uio.no>

On Mon, 12 Jan 2009, H. Peter Anvin wrote:
>
> This is about storage allocation, not aliases.  Storage allocation only
> depends on lifetime.

Well, the thing is, code motion does extend life-times, and if you think
you can move stores across each other (even when you can see that they
alias statically) due to type-based alias decisions, that does essentially
end up making what _used_ to be disjoint lifetimes now be potentially
overlapping.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement
Date: Mon, 12 Jan 2009 23:21:02 UTC
Message-ID: <fa.WjTY2Y0vOejWf/j9R1+xYCfjDEM@ifi.uio.no>

On Mon, 12 Jan 2009, Jamie Lokier wrote:
>
> Sometimes code motion makes code faster and/or smaller but use more
> stack space.  If you want to keep the stack use down, it blocks some
> other optimisations.

Uhh. Yes. Compiling is an exercise in trade-offs.

That doesn't mean that you should try to find the STUPID trade-offs,
though.

The thing is, there is no excuse for gcc's stupid alias analysis. Other
compilers actually take advantage of things like the C standard's type
alias ambiguity by (a) realizing that it's insane as a general thing and
(b) limiting it to the real special cases, like assuming that pointers to
floats and pointers to integers do not alias.

That, btw, is where the whole concept comes from. It should be passed off
as an "unsafe FP optimization", where it actually makes sense, exactly
like a lot of other unsafe FP optimizations.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement
Date: Mon, 12 Jan 2009 19:44:28 UTC
Message-ID: <fa.67YDirP2yaKdaEv9I8I7fP5wmj8@ifi.uio.no>

On Mon, 12 Jan 2009, Andi Kleen wrote:
>
> What I find nonsensical is that -fno-strict-aliasing generates
> better code here. Normally one would expect the compiler seeing
> more aliases with that option and then be more conservative
> regarding any sharing. But it seems to be the other way round
> here.

No, that's not the surprising part. And in fact, now that you mention it,
I can even tell you why gcc does what it does.

But you'll need some background to it:

Type-based aliasing is _stupid_. It's so incredibly stupid that it's not
even funny. It's broken. And gcc took the broken notion, and made it more
so by making it a "by-the-letter-of-the-law" thing that makes no sense.

What happens (well, maybe it's fixed, but this was _literally_ what gcc
used to do) is that the type-based aliasing overrode everything else, so
if two accesses were to different types (and not in a union, and none of
the types were "char"), then gcc "knew" that they clearly could not alias,
and could thus wildly re-order accesses.

That's INSANE. It's so incredibly insane that people who do that should
just be put out of their misery before they can reproduce. But real gcc
developers really thought that it makes sense, because the standard allows
it, and it gives the compiler the maximal freedom - because it can now do
things that are CLEARLY NONSENSICAL.

And to compiler people, being able to do things that are clearly
nonsensical seems to often be seen as a really good thing, because it
means that they no longer have to worry about whether the end result works
or not - they just got permission to do stupid things in the name of
optimization.

So gcc did. I know for a _fact_ that gcc would re-order write accesses
that were clearly to (statically) the same address. Gcc would suddenly
think that

	unsigned long a;

	a = 5;
	*(unsigned short *)&a = 4;

could be re-ordered to set it to 4 first (because clearly they don't alias
- by reading the standard), and then because now the assignment of 'a=5'
was later, the assignment of 4 could be elided entirely! And if somebody
complains that the compiler is insane, the compiler people would say
"nyaah, nyaah, the standards people said we can do this", with absolutely
no introspection to ask whether it made any SENSE.
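
For comparison, here is a sketch of the same partial store written with
memcpy(), which the standard treats as a character-type access, so a
conforming compiler has to keep the two stores in order with or without
strict aliasing. The function names are made up for the illustration:

```c
#include <assert.h>
#include <string.h>

/* Set a = 5, then overwrite its first sizeof(short) bytes with 4. */
unsigned long store_first_bytes(void)
{
	unsigned long a;
	unsigned short four = 4;

	a = 5;
	memcpy(&a, &four, sizeof(four));
	return a;
}

/* Read the first sizeof(short) bytes of 'a' back out. */
unsigned short load_first_bytes(unsigned long a)
{
	unsigned short s;

	memcpy(&s, &a, sizeof(s));
	return s;
}

int partial_store_demo(void)
{
	unsigned long a = store_first_bytes();

	assert(load_first_bytes(a) == 4);	/* the later store survived */
	assert(a != 5);				/* and was not elided */
	return 0;
}
```

On a little-endian machine the returned value is simply 4; on a
big-endian one the 4 lands in the top bytes instead.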

Anyway, once you start doing stupid things like that, and once you start
thinking that the standard makes more sense than a human being using his
brain for 5 seconds, suddenly you end up in a situation where you can move
stores around wildly, and it's all 'correct'.

Now, take my stupid example, and make "fn1()" do "a.a = 1" and make
"fn2()" do "b.b = 2", and think about what a compiler that thinks it can
re-order the two writes willy-nilly will do?

Right. It will say "ok, a.a and b.b can not alias EVEN IF THEY HAVE
STATICALLY THE SAME ADDRESS ON THE STACK", because they are in two
different structures. So we can then re-order the accesses, and move the
stores around.

Guess what happens if you have that kind of insane mentality, and you then
try to make sure that they really don't alias, so you allocate extra stack
space.

The fact is, Linux uses -fno-strict-aliasing for a damn good reason:
because the gcc notion of "strict aliasing" is one huge stinking pile of
sh*t. Linux doesn't use that flag because Linux is playing fast and loose,
it uses that flag because _not_ using that flag is insane.

Type-based aliasing is unacceptably stupid to begin with, and gcc took
that stupidity to totally new heights by making it actually more important
than even statically visible aliasing.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement
Date: Mon, 12 Jan 2009 20:10:36 UTC
Message-ID: <fa.+rVQxZtsXbi4ypqk1Tw3fFt1JsE@ifi.uio.no>

On Mon, 12 Jan 2009, Linus Torvalds wrote:
>
> Type-based aliasing is unacceptably stupid to begin with, and gcc took
> that stupidity to totally new heights by making it actually more important
> than even statically visible aliasing.

Btw, there are good forms of type-based aliasing.

The 'restrict' keyword actually makes sense as a way to say "this pointer
points to data that you cannot reach any other way". Of course, almost
nobody uses it, and quite frankly, inlining can destroy that one too (a
pointer that is restricted in the callEE is not necessarily restricted at
all in the callER, and an inliner that doesn't track that subtle
distinction will be very unhappy).
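
A small sketch of what that promise looks like in C99. The function
add_into and its demo are hypothetical examples, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* The caller promises that dst and src do not overlap, so the compiler
 * may keep src values in registers across the stores to dst. */
void add_into(int *restrict dst, const int *restrict src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		dst[i] += src[i];
}

int add_into_demo(void)
{
	int dst[3] = { 1, 2, 3 };
	const int src[3] = { 10, 20, 30 };

	add_into(dst, src, 3);		/* distinct arrays: promise kept */
	assert(dst[0] == 11 && dst[1] == 22 && dst[2] == 33);
	return 0;
}
```

Calling add_into(buf, buf, n) would break the promise, and once the
function is inlined into such a caller the compiler has to remember that
the guarantee only ever applied inside the callee.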

So compiler people usually don't much like 'restrict' - because it is very
limited (you might even say restricted) in its meaning, and doesn't allow
for nearly the same kind of wild optimizations as the insane standard C
type-aliasing allows.

The best option, of course, is for a compiler to handle just _static_
alias information that it can prove (whether by use of 'restrict' or by
actually doing some fancy real analysis of its own allocations), and
letting the hardware do run-time dynamic alias analysis.

I suspect gcc people were a bit stressed out by Itanium support - it's an
insane architecture that basically requires an insane compiler for
reasonable performance, and I think the Itanium people ended up
brain-washing a lot of people who might otherwise have been sane.

So maybe I should blame Intel. Or HP. Because they almost certainly were
at least a _part_ reason for bad compiler decisions.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement
Date: Tue, 13 Jan 2009 00:22:31 UTC
Message-ID: <fa.MoJrA8QokXYOPjcuNCJOjXM388c@ifi.uio.no>

On Mon, 12 Jan 2009, Bernd Schmidt wrote:
>
> Too lazy to construct one myself, I googled for examples, and here's a
> trivial one that shows how it affects the ability of the compiler to
> eliminate memory references:

Do you really think this is realistic or even relevant?

The fact is

 (a) most people use similar types, so your example of "short" vs "int" is
     actually not very common. Type-based alias analysis is wonderful for
     finding specific examples of something you can optimize, but it's not
     actually all that wonderful in general. It _particularly_ isn't
     wonderful once you start looking at the downsides.

     When you're adding arrays of integers, you're usually adding
     integers. Not "short"s. The shorts may be a great example of a
     special case, but it's a special case!

 (b) instructions with memory accesses aren't the problem - instructions
     that take cache misses are. Your example is an excellent example of
     that - eliding the simple load out of the loop makes just about
     absolutely _zero_ difference in any somewhat more realistic scenario,
     because that one isn't the one that is going to make any real
     difference anyway.

The thing is, the way to optimize for modern CPU's isn't to worry
over-much about instruction scheduling. Yes, it matters for the broken
ones, but it matters in the embedded world where you still find in-order
CPU's, and there the size of code etc matters even more.

> I'll grant you that if you're writing a kernel or maybe a malloc
> library, you have reason to be unhappy about it.  But that's what
> compiler switches are for: -fno-strict-aliasing allows you to write code
> in a superset of C.

Oh, I'd use that flag regardless yes. But what you didn't seem to react to
was that gcc - for no valid reason what-so-ever - actually trusts (or at
least trusted: I haven't looked at that code for years) provably true
static alias information _less_ than the idiotic weaker type-based one.

You make all this noise about how type-based alias analysis improves code,
but then you can't seem to just look at the example I gave you. Type-based
alias analysis didn't improve code. It just made things worse, for no
actual gain. Moving those accesses to the stack around just causes worse
behavior, and a bigger stack frame, which causes more cache misses.

[ Again, I do admit that kernel code is "different": we tend to have a
  cold stack, in ways that many other code sequences do not have. System
  code tends to get a lot more I$ and D$ misses. Deep call-chains _will_
  take cache misses on the stack, simply because the user will do things
  between system calls or page faults that almost guarantees that things
  are not in L1, and often not in L2 either.

  Also, sadly, microbenchmarks often hide this, since they are often
  exactly the unrealistic kinds of back-to-back system calls that almost
  no real program ever has, since real programs actually _do_ something
  with the data. ]

My point is, you're making all these arguments and avoiding looking at the
downsides of what you are arguing for.

So we use -Os - because it generally generates better (and simpler) code.
We use -fno-strict-aliasing for the same reason.

			Linus
