Inline assembly (H. Peter Anvin; Linus Torvalds)


From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: Why Plan 9 C compilers don't have asm("")
Date: 	Wed, 4 Jul 2001 17:22:44 +0000 (UTC)
Newsgroups: fa.linux.kernel

In article <20010704002436.C1294@ftsoj.fsmlabs.com>,
Cort Dougan  <cort@fsmlabs.com> wrote:
>
>There isn't such a crippling difference between straight-line and code with
>unconditional branches in it with modern processors.  In fact, there's very
>little measurable difference.

Oh, the small details get to you eventually.

And it's not just the "call" and "ret" instructions.  They _do_ hurt,
even on modern CPU's, btw.  They tend to break up the prefetching, and
often mean that you cannot do as good an instruction mix.

But there's an even more serious problem: a function call in C is a
VERY heavy operation as far as the compiler is concerned. It's a major
sequence point, and the compiler doesn't know what memory locations are
potentially dead etc.

Which means that the compiler has to save everything that might be
relevant to memory, and depending on the calling convention has to
assume that registers are trashed.  And when you come back, you have to
re-load everything again, on the assumption that the function might have
changed state. 
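
A minimal sketch of the reload problem described above, assuming GCC or Clang. The names `shared_counter`, `opaque_call` and `read_twice_around_call` are hypothetical, and the callee is defined in the same file only so the sketch is self-contained; the case in the text is a callee the compiler cannot see at all.

```c
/* Global state that an unknown callee might modify. */
int shared_counter = 1;

/* Stand-in for a function the compiler cannot see into.  It is marked
 * noinline and defined here only to keep the sketch self-contained; in
 * the situation described above it would live in another object file. */
__attribute__((noinline)) void opaque_call(void)
{
    shared_counter++;
}

int read_twice_around_call(void)
{
    int before = shared_counter;  /* first load from memory */
    opaque_call();                /* may have written shared_counter */
    int after = shared_counter;   /* has to be loaded again, the value
                                     held in a register is stale */
    return before + after;
}
```

Without the call in the middle, the compiler would be free to load `shared_counter` once and return twice that value.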

You also often have issues like reloading the gp pointer on many 64-bit
architectures, where functions can be in different "domains", and
returning from an unknown function means that you have to do other nasty
setup in order to get at your global data.

And trust me, it's noticeable. On alpha, a fast function call _should_
be a simple two-cycle thing - branch and return. But because of
practical linker issues, what the compiler ends up having to generate
for calls to targets that it doesn't know where they are is 

 - load a 64-bit address off the GP area that the linker will have fixed
   up.
 - do an indirect branch to that address
 - the callee re-loads the GP with _its_ copy of the GP if it needs any
   global data or needs to call anybody else.
 - we return to the caller
 - the caller reloads its GP.

Your theoretical two cycles that the CPU could follow in the front end
and speculate around turns into multiple loads, an indirect branch and
about 10 instructions.  And that's without any of the other effects even
being taken into account.  No matter _how_ good the CPU is, that's going
to be slower than not doing it. 

[ And yes, I know there are optimizing linkers for the alpha around that
  improve this and notice when they don't need to change GP and can do a
  straight branch etc.  I don't think GNU ld _still_ does that, but who
  knows. Even the "good" Digital compilers tended to nop out unnecessary
  instructions rather than remove them, causing more icache pressure on
  a CPU that was already famous for needing tons of icache ]

Now, you could get around a bit of this by allowing for special calling
conventions.  Gcc actually has this for some details - namely the
"register arguments" part, which actually makes for much more readable
code (that's my main personal use for it - never mind the fact that it
is probably faster _too_). 

But gcc doesn't have a good "you can re-order this call wrt other stuff"
setup, and gcc lacks the ability to change the calling convention
on-the-fly ("this function will not clobber any registers"). 

Try it and see. There are good reasons for "inline asm", not the least
of which is that it often makes the produced code much more readable.

And if you never look at the produced assembler code, then you'll never
have a fast system. Really. Compilers can do only so much. People who
understand what the end result is make for a difference.

Now, you could probably argue that instead of inline asms we should have
more flexibility in doing a per-callee calling convention. That would be
good too, no question about it.

			Linus

From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: Why Plan 9 C compilers don't have asm("")
Date: 	Fri, 6 Jul 2001 18:44:31 +0000 (UTC)
Newsgroups: fa.linux.kernel

In article <20010706023835.A5224@ftsoj.fsmlabs.com>,
Cort Dougan  <cort@fsmlabs.com> wrote:
>I'm talking about _modern_ processors, not processors that dominate the
>modern age.  This isn't x86.

NONE of my examples were about the x86.

I gave the alpha as a specific example.  The same issues are true on
ia64, sparc64, and mips64.  How much more "modern" can you get? Name _one_
reasonably important high-end CPU that is more modern than alpha and
ia64.. 

On ia64, you probably end up with function calls costing even more than
alpha, because not only does the function call end up being a
synchronization point for the compiler, it also means that the compiler
cannot expose any parallelism, so you get an added hit from there.  At
least with other CPU's that find the parallelism dynamically they can do
out-of-order stuff across function calls. 

>Unconditional branches are definitely predictable so icache pre-fetches are
>not more complicated than straight-line code.

Did you READ my mail at all?

Most of these "unconditional branches" are indirect, because rather few
64-bit architectures have a full 64-bit branch.  That means that in
order to predict them, you either have to do data-prediction (pretty
much nobody does this), or you have a branch target prediction cache,
which works very well indeed but has the problem that it only works for
stuff in the cache, and the cache tends to be fairly limited (because
you need to cache the whole address - it's more than a "which direction
do we go in"). 

There are lots of good arguments for function calls: they improve icache
when done right, but if you have some non-C-semantics assembler sequence
like "cli" or a spinlock that you use a function call for, that would
_decrease_ icache effectiveness simply because the call itself is bigger
than the instruction (and it breaks up the instruction sequence so you
get padding issues). 
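
As an illustration of a short "non-C-semantics" sequence kept inline, here is a sketch of a test-and-set lock built on a single x86 `xchgl`. This is an assumption-laden example, not the kernel's spinlock: `tiny_lock_t`, `xchg_u32`, `tiny_trylock` and `tiny_unlock` are hypothetical names, and the non-x86 branch falls back to a GCC/Clang atomic builtin purely so the sketch builds elsewhere.

```c
#include <stdint.h>

typedef struct { volatile uint32_t locked; } tiny_lock_t;  /* hypothetical */

/* Atomically store `val` into *p and return the previous value. */
static inline uint32_t xchg_u32(volatile uint32_t *p, uint32_t val)
{
#if defined(__i386__) || defined(__x86_64__)
    /* xchg with a memory operand is implicitly locked on x86, so the
     * whole primitive is one instruction at the point of use. */
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r"(val), "+m"(*p)
                         :
                         : "memory");
    return val;
#else
    /* Fallback for other architectures (GCC/Clang builtin). */
    return __atomic_exchange_n(p, val, __ATOMIC_SEQ_CST);
#endif
}

/* Returns 1 if the lock was free and is now held by the caller. */
static inline int tiny_trylock(tiny_lock_t *l)
{
    return xchg_u32(&l->locked, 1) == 0;
}

static inline void tiny_unlock(tiny_lock_t *l)
{
    /* A release store; on x86 this compiles to an ordinary store. */
    __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}
```

Because the helpers are `static inline`, a caller gets the instruction itself in its instruction stream rather than a call, argument setup and return around it.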

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] speed up on find_first_bit for i386 (let compiler do
Date: Thu, 28 Jul 2005 15:39:31 UTC
Message-ID: <fa.g0q33jc.nge7io@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.58.0507280823210.3227@g5.osdl.org>

On Thu, 28 Jul 2005, Steven Rostedt wrote:
>
> In the thread "[RFC][PATCH] Make MAX_RT_PRIO and MAX_USER_RT_PRIO
> configurable" I discovered that a C version of find_first_bit is faster
> than the asm version now when compiled against gcc 3.3.6 and gcc 4.0.1
> (both from versions of Debian unstable).  I wrote a benchmark (attached)
> that runs the code 1,000,000 times.

I suspect the old "rep scas" has always been slower than
compiler-generated code, at least under your test conditions. Many of the
old asm's are actually _very_ old, and some of them come from pre-0.01
days and are more about me learning the i386 (and gcc inline asm).
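
For reference, a plain C bitmap scan of the kind being compared against "rep scas" could look like the sketch below. It is not the patch under discussion and not the kernel's implementation; `first_set_bit` and `BITS_PER_ULONG` are hypothetical names.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Return the index of the first set bit among the first `nbits` bits of
 * `words`, or `nbits` if no bit in that range is set. */
size_t first_set_bit(const unsigned long *words, size_t nbits)
{
    size_t nwords = (nbits + BITS_PER_ULONG - 1) / BITS_PER_ULONG;

    for (size_t i = 0; i < nwords; i++) {
        unsigned long w = words[i];
        if (w == 0)
            continue;               /* skip empty words quickly */

        size_t bit = i * BITS_PER_ULONG;
        while ((w & 1UL) == 0) {    /* walk to the lowest set bit */
            w >>= 1;
            bit++;
        }
        /* A set bit past the requested range counts as "not found". */
        return bit < nbits ? bit : nbits;
    }
    return nbits;
}
```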

That said, I don't much like your benchmarking methodology. I suspect that
quite often, the code in question runs from L2 cache, not in a tight loop,
and so that "run a million times" approach is not necessarily the best
one.

I'll apply this one as obvious: I doubt the compiler generates bigger code
or has any real downsides, but I just wanted to say that in general I just
wish people didn't always time the hot-cache case ;)

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] speed up on find_first_bit for i386 (let compiler do
Date: Thu, 28 Jul 2005 17:43:35 UTC
Message-ID: <fa.g0pv3bh.mgi7aj@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.58.0507281018170.3227@g5.osdl.org>

On Thu, 28 Jul 2005, Steven Rostedt wrote:
>
> I can change the find_first_bit to use __builtin_ffs, but how would you
> implement the ffz?

The thing is, there are basically _zero_ upsides to using the __builtin_xx
functions on x86.

There may be more upsides on other architectures (*cough*ia64*cough*) that
have strange scheduling issues and other complexities, but on x86 in
particular, the __builtin_xxx() functions tend to be a lot more pain than
they are worth. Not only do they have strange limitations (on selection of
opcodes but also for compiler versions), but they aren't well documented,
and semantics aren't clear.

For example, if you use the "bsfl" inline assembly instruction, you know
what the semantics are and what the generated code is like: Intel
documents it, and you know what code you generated. So the special cases
like "what happens if the input is zero" are well-defined.

In contrast, the gcc builtins probably match some standard that is not
only harder to find, but also has some _other_ definition for what happens
for the zero case, so the builtins automatically end up having problems
due to semantic mis-match between the CPU and the standard.
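
A minimal sketch of the "bsfl" approach, assuming GCC-style inline asm on x86. The helper names are hypothetical, the zero input is handled by an explicit test in the wrapper rather than by relying on what the instruction leaves in the register, and the non-x86 branch exists only so the sketch compiles elsewhere.

```c
/* Zero-based index of the lowest set bit.  The caller must make sure
 * that `word` is nonzero. */
static inline unsigned int lowest_set_bit(unsigned int word)
{
    unsigned int index;
#if defined(__i386__) || defined(__x86_64__)
    /* bsfl scans the source operand for its lowest set bit and writes
     * the bit index to the destination; it also modifies the flags. */
    __asm__("bsfl %1, %0" : "=r"(index) : "rm"(word) : "cc");
#else
    /* Portable fallback with the same result for nonzero input. */
    index = 0;
    while ((word & 1u) == 0) {
        word >>= 1;
        index++;
    }
#endif
    return index;
}

/* Wrapper that tests for zero separately and lets the caller pick the
 * value reported for that case. */
static inline int lowest_set_bit_or(unsigned int word, int if_zero)
{
    return word ? (int)lowest_set_bit(word) : if_zero;
}
```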

Basic rule: inline assembly is _better_ than random compiler extensions.
It's better to have _one_ well-documented extension that is very generic
than it is to have a thousand specialized extensions.

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] speed up on find_first_bit for i386 (let compiler do
Date: Thu, 28 Jul 2005 19:01:04 UTC
Message-ID: <fa.g1a73jf.n0e6it@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.58.0507281146290.3307@g5.osdl.org>

On Thu, 28 Jul 2005, Steven Rostedt wrote:
>
> OK, I guess when I get some time, I'll start testing all the i386 bitop
> functions, comparing the asm with the gcc versions.  Now could someone
> explain to me what's wrong with testing hot cache code. Can one
> instruction retrieve from memory better than others?

There's a few issues:

 - trivially: code/data size. Being smaller automatically means faster if
   you're cold-cache. If you do cycle tweaking of something that is
   possibly commonly in the L2 cache or further away, you might as well
   consider one byte of code-space to be equivalent to one cycle (a L1 I$
   miss can easily take 50+ cycles - the L1 fill cost may be just a small
   part of that, but the pipeline problem it causes can be deadly).

 - branch prediction: cold-cache is _different_ from hot-cache. hot-cache
   predicts the stuff dynamically, cold-cache has different rules (and it
   is _usually_ "forward predicts not-taken, backwards predicts taken",
   although you can add static hints if you want to on most architectures).

   So hot-cache may look very different indeed - the "normal" case might
   be that you mispredict all the time because the static prediction is
   wrong, but then a hot-cache benchmark will predict perfectly.

 - access patterns. This only matters if you look at algorithmic changes.
   Hashes have atrocious locality, but on the other hand, if you know that
   the access pattern is cold, a hash will often have a minimum number of
   accesses.

but no, you don't have "some instructions are better at reading from
memory" for regular integer code (FP often has other issues, like reading
directly from L2 without polluting L1, and then there are obviously
prefetch hints).
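
One way to express the static hints mentioned in the branch-prediction item, at the C level, is GCC's `__builtin_expect`. This is a sketch under that assumption; the `likely`/`unlikely` macros and `checked_div` are names chosen for illustration. The compiler typically uses the hint to lay the expected path out as the fall-through case.

```c
/* Hint macros built on a GCC/Clang builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Divide num by den, storing the quotient in *out.
 * Returns 0 on success and -1 when den is zero. */
int checked_div(int num, int den, int *out)
{
    if (unlikely(den == 0))   /* error path marked as the rare one */
        return -1;
    *out = num / den;
    return 0;
}
```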

Now, in the case of your "rep scas" conversion, the reason I applied it
was that it was obviously a clear win (rep scas is known bad, and has
register allocation issues too), so I'm _not_ claiming that the above
issues were true in that case. I just wanted to say that in general it's
nice (but often quite hard) if you can give cold-cache numbers too (for
example, using the cycle counter and being clever can actually give that).
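
A rough sketch of what a cold-cache measurement with the cycle counter might look like. Everything here is an assumption for illustration: the helper names, the 32 MB eviction buffer, and the idea of evicting by writing through a large buffer, which mainly pushes data (and unified L2/L3 contents) out and does little for the L1 instruction cache or the branch predictor. On non-x86 targets the sketch falls back to a nanosecond clock instead of a cycle counter.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Read a monotonically increasing counter: the TSC on x86, otherwise
 * nanoseconds from CLOCK_MONOTONIC. */
static inline uint64_t read_cycles(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

#define EVICT_BYTES (32u * 1024u * 1024u)   /* assumed larger than the caches */

/* Write one byte per 64-byte step of a large buffer so that previously
 * cached data is likely to be pushed out. */
static void evict_caches(void)
{
    static volatile unsigned char *junk;

    if (!junk)
        junk = malloc(EVICT_BYTES);
    if (!junk)
        return;                 /* no memory: measure without eviction */
    for (size_t i = 0; i < EVICT_BYTES; i += 64)
        junk[i] = (unsigned char)i;
}

/* Time a single call of fn() right after an eviction pass. */
uint64_t time_once_cold(void (*fn)(void))
{
    evict_caches();
    uint64_t start = read_cycles();
    fn();
    return read_cycles() - start;
}

/* Tiny example workload. */
volatile unsigned int sink;
void sample_work(void) { sink += 1; }
```

Calling `time_once_cold(sample_work)` many times and looking at the distribution, instead of timing a tight loop of a million calls, gives numbers closer to the cold case; a single sample is noisy.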

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] speed up on find_first_bit for i386 (let compiler do
Date: Fri, 29 Jul 2005 16:33:37 UTC
Message-ID: <fa.g0ad33e.m0472u@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.58.0507290924140.3307@g5.osdl.org>

On Fri, 29 Jul 2005, Maciej W. Rozycki wrote:
>
>  Hmm, that's what's in the GCC info pages for the relevant functions
> (I've omitted the "l" and "ll" variants):
>
> "-- Built-in Function: int __builtin_ffs (unsigned int x)
>      Returns one plus the index of the least significant 1-bit of X, or
>      if X is zero, returns zero.

This, for example, clashes with the x86 semantics.

If X is zero, the bsfl instruction will set the ZF flag, and the result is
undefined (on many, but not all, CPU's it will either be zero _or_
unmodified).

We don't care, since we actually test the input for being zero separately
_anyway_, but my point is that if the builtin is badly done (and I
wouldn't be in the least surprised if it was), then it's going to do a
totally unnecessary conditional jump or cmov.

See? __builtin's can generate _worse_ code, exactly because they try to
have portable semantics that may not even matter.
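
To make the semantic gap concrete, the sketch below shows the builtin's documented one-based result, as quoted above, next to the zero-based index a bit scan produces. The wrapper names are hypothetical; `__builtin_ffs` and `__builtin_ctz` are GCC/Clang builtins.

```c
/* One-based builtin turned into a zero-based bit index.
 * Returns -1 when x is zero, because __builtin_ffs(0) is 0. */
int ffs_index_from_builtin(unsigned int x)
{
    return __builtin_ffs((int)x) - 1;
}

/* __builtin_ctz gives the zero-based index directly, but GCC documents
 * its result as undefined for a zero argument, so the caller has to
 * rule that case out first. */
int ctz_index_nonzero(unsigned int x)
{
    return __builtin_ctz(x);
}
```

The defined result for zero is what may cost an extra test or conditional move when the caller has already checked for zero itself.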

In contrast, just doing it by hand allows us to avoid all that crap.

Doing it by hand as inline assembly also allows us to do dynamic
optimizations like instruction rewriting, so inline assembly is a _lot_
more powerful than builtins can reasonably ever be.

> If that's not enough, then what would be?  I'm serious -- if you find it
> inadequate, then perhaps it could be improved.

It's inadequate because IT IS POINTLESS.

The builtin buys you absolutely _nothing_, and the inline asm is simpler,
potentially faster, and works with every single version of gcc.

USING THE BUILTIN IS A PESSIMISATION!

It has absolutely _zero_ upsides, and I've named three _major_ downsides.

It has another downside too: it's extra complexity and potential for bugs
in the compiler. And if you tell me gcc people never have bugs, I will
laugh in your general direction.

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] speed up on find_first_bit for i386 (let compiler do
Date: Fri, 29 Jul 2005 16:35:52 UTC
Message-ID: <fa.fvqn2ra.mgu7qq@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.58.0507290910090.3307@g5.osdl.org>

On Fri, 29 Jul 2005, David Woodhouse wrote:
>
> On Thu, 2005-07-28 at 10:25 -0700, Linus Torvalds wrote:
> > Basic rule: inline assembly is _better_ than random compiler extensions.
> > It's better to have _one_ well-documented extension that is very generic
> > than it is to have a thousand specialized extensions.
>
> Counterexample: FR-V and its __builtin_read8() et al.

There are arguably always counter-examples, but your arguments really are
pretty theoretical.

Very seldom do compiler extensions end up being (a) timely enough and
(b) semantically close enough to be really useful.

> Builtins can also allow the compiler more visibility into what's going
> on and more opportunity to optimise.

Absolutely. In theory. In practice, not so much. All the opportunity to
optimize often ends up being lost in semantic clashes, or just because
people can't use the extension because it hasn't been there since day one.

The fact is, inline asms are pretty rare even when we are talking about
every single possible assembly combination. They are even less common when
we're talking about just _one_ specific case of them (like something like
__builtin_ffs()).

What does this mean? It has two results: (a) instruction-level scheduling
and register allocation just isn't _that_ important, and the generic "asm"
register scheduling is really plenty good enough. The fact that in theory
you might get better results if the compiler knew exactly what was going
on is just not relevant: in practice it's simply not _true_. The other
result is: (b) the compiler people don't end up seeing something like the
esoteric builtins as a primary thing, so it's not like they'd be tweaking
and regression-testing everything _anyway_.

So I argue very strongly that __builtin_xxx() is _wrong_, unless you have
very very strong reasons for it:

 - truly generic and _very_ important stuff: __builtin_memcpy() is
   actually very much worth it, since it's all over, and it's so generic
   that the compiler has a lot of choice in how to do it.

 - stuff where the architecture (or the compiler) -really- sucks with
   inline asms, and has serious problems, and the thing is really
   important. Your FR-V example _might_ fall into this category (or it
   might not), and ia64 has the problem with instruction packing and
   scheduling and so __builtin's have a bigger advantage.
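
As a small illustration of the memcpy case in the first item: with a constant size, the compiler is free to expand `__builtin_memcpy` however it likes, typically as a single load here. The function name `load_u32` is hypothetical.

```c
#include <stdint.h>

/* Read a 32-bit value from a possibly unaligned byte pointer.  With a
 * constant size the compiler can choose the expansion itself instead
 * of emitting a library call. */
uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    __builtin_memcpy(&v, p, sizeof v);
    return v;
}
```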

Basically, on most normal architectures, there's seldom any reason at
_all_ to use builtins except for things like memcpy. On x86, I think the
counter-example might be if you want to schedule MMX code from C - which
is a special case because it doesn't follow my "rule (a)" above. But we
don't do that in the kernel, really, or we just schedule it out-of-line.

			Linus

From: "H. Peter Anvin" <hpa@zytor.com>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 1/1] X86: use explicit register name for get/put_user
Date: Mon, 07 Dec 2009 18:37:20 UTC
Message-ID: <fa.16mB62HOEizTG6swd5bUEYtX/WI@ifi.uio.no>

On 12/07/2009 04:37 AM, Jiri Slaby wrote:
> Is this documented somewhere? Or do we rely on an undocumented feature?
> I mean it doesn't refer only to the constraint but also to a concrete
> register allocation. As far as I understand it (from the gcc 4.4
> documentation), if one does
>  "insn %0" : "=r" (out) : "0" (in)
> the "0" constraint corresponds to the concrete register allocated for
> out, not to any register (which is the constraint "r").

Yes, but it only corresponds to the information that is conveyed in the
register selection.
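
A minimal sketch of the matching constraint being discussed, assuming GCC-style asm on x86: the "0" on the input ties `in` to whatever register was picked for output operand 0. Both operands have the same type here, which sidesteps the size-mismatch question raised in the thread. `add_one` is a hypothetical name, and the non-x86 branch uses an empty template only so the sketch builds elsewhere.

```c
/* Return in + 1, computed in the register shared by input and output. */
unsigned int add_one(unsigned int in)
{
    unsigned int out;
#if defined(__i386__) || defined(__x86_64__)
    /* %0 holds `in` on entry (matching constraint "0") and `out` on
     * exit; addl modifies the flags, hence the "cc" clobber. */
    __asm__("addl $1, %0" : "=r"(out) : "0"(in) : "cc");
#else
    /* Empty template: `out` simply receives the register that was
     * loaded with `in`; the addition is then done in C. */
    __asm__("" : "=r"(out) : "0"(in));
    out += 1;
#endif
    return out;
}
```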

> In the document they write only about the "same location" occupied by in
> and out, nothing is said about size (and hence I think we cannot
> mismatch size of operands). And I couldn't find any other
> restrictions/documentation about inline assembly, hence the patch,
> because nothing assured me this cannot change in the future.

There is almost no documentation at all; some of the little
documentation there is is in comments in the source code.  To a first
order of approximation, asm() is defined by behavior, not by a written
spec.  Trying to play language lawyer with the little bit that is
written down is pointless -- the gcc people have been more than happy to
break asm() between releases regardless of what is and is not written down.

> Now I tried different compilers (clang, llvm-gcc) and they choke on that:
> $ cat c.c
> void x(void)
> {
>         unsigned long in;
>         int out;
>         asm("insn %0" : "=r" (out) : "0" (in));
> }
> $ clang c.c -S -o -
> c.c:5:36: error: unsupported inline asm: input with type 'unsigned long'
>       matching output with type 'int'
>         asm("insn %0" : "=r" (out) : "0" (in));
>                               ~~~         ^~
> 1 diagnostic generated.
> $ llvm-gcc c.c -S -o -
> c.c: In function 'x':
> c.c:5: error: unsupported inline asm: input constraint with a matching
> output constraint of incompatible type!
>
> thanks for the review,

gcc is the standard for gcc-style asm()... if they don't comply, that's a
bug...

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
