Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Semaphore assembly-code bug
Original-Message-ID: <Pine.LNX.4.58.0410291103000.28839@ppc970.osdl.org>
Date: Fri, 29 Oct 2004 18:20:21 GMT
Message-ID: <fa.iun3n88.13ki83s@ifi.uio.no>

On Fri, 29 Oct 2004, linux-os wrote:
> > with the following:
> >
> > leal 4(%esp),%esp
>
> Probably so because I'm pretty certain that the 'pop' (a memory
> access) is not going to be faster than a simple register operation.

Bzzt, wrong answer.

It's not a "simple register operation". It's really about the fact that
modern CPU's are smarter - yet dumber - than you think. They do things
like speculate the value of %esp in order to avoid having to calculate it,
and it's entirely possible that "pop" is much faster, simply because I
guarantee you that a CPU will speculate %esp correctly across a "pop", but
the same is not necessarily true for "lea %esp".

Somebody should check what the Pentium M does. It might just notice that
"lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely
possible that lea will confuse its stack engine logic and cause
stack-related address generation stalls..

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Semaphore assembly-code bug
Original-Message-ID: <Pine.LNX.4.58.0410291212350.28839@ppc970.osdl.org>
Date: Fri, 29 Oct 2004 19:58:26 GMT
Message-ID: <fa.j07fng7.124u8bt@ifi.uio.no>

On Fri, 29 Oct 2004, Andreas Steinmetz wrote:

> Linus Torvalds wrote:
> > Somebody should check what the Pentium M does. It might just notice that
> > "lea 4(%esp),%esp" is the same as "add 4 to esp", but it's entirely
> > possible that lea will confuse its stack engine logic and cause
> > stack-related address generation stalls..
>
> Now especially Intel tells everybody in their Pentium Optimization
> manuals to *use* lea whereever possible as this operation doesn't depend
> on the ALU and is processed in other parts of the CPU.
>
> Sample quote from said manual (P/N 248966-05):
> "Use the lea instruction and the full range of addressing modes to do
> address calculation"

Does it say this about %esp?

The stack pointer is SPECIAL, guys. It's special exactly because there is
potentially extra hardware in CPU's that track its value _independently_
of the actual physical register.

Just for fun, google for 'x86 "stack engine"', and you'll hit for example
http://arstechnica.com/articles/paedia/cpu/pentium-m.ars/5 which talks
about this and perhaps explains it in ways that I apparently haven't been
able to.

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Semaphore assembly-code bug
Original-Message-ID: <Pine.LNX.4.58.0410291249440.28839@ppc970.osdl.org>
Date: Fri, 29 Oct 2004 20:48:06 GMT
Message-ID: <fa.j0ndo8e.11ks93m@ifi.uio.no>

On Fri, 29 Oct 2004, Andreas Steinmetz wrote:
>
> If you still believe in features I can't find any manufacturer
> documentation for, well, you're Linus so it's your decision.

It's not that I'm Linus. It's that I am apparently better informed than
you are, and the numbers you are looking at are irrelevant. For example,
have you even _looked_ at the Pentium M stack engine documentation, which
is what this whole argument is all about?

And the documentation you look at is not relevant. For example, when you
look at the latency of "pop", who _cares_? That's the latency to use the
data, and has no meaning, since in this case we don't actually ever use
it. So what matters is other things entirely, like how well the
instructions can run in parallel.

Try it.

	popl %eax
	popl %ecx

should take one cycle on a Pentium. I pretty much _guarantee_ that

	lea 4(%esp),%esp
	popl %ecx

takes longer, since they have a data dependency on %esp that is hard to
break (the P4 trace-cache _may_ be able to break it, but the only CPU that
I think is likely to break it is actually the Transmeta CPU's, which did
that kind of thing by default and _will_ parallelise the two, and even
combine the stack offsetting into one single micro-op).

So my argument is that "popl" is smaller, and I doubt you can find a
machine where it's actually slower (most will take two cycles). And I am
pretty confident that I can find machines where it is faster (ie regular
Pentium).

Not that any of this matters, since there's a patch that makes all of this
moot. If it works.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Semaphore assembly-code bug
Original-Message-ID: <Pine.LNX.4.58.0410291705210.28839@ppc970.osdl.org>
Date: Sat, 30 Oct 2004 00:29:44 GMT
Message-ID: <fa.ivnhn8a.12ks83q@ifi.uio.no>

On Fri, 29 Oct 2004, dean gaudet wrote:
>
> for p4 model 0 through 2 it was faster to avoid lea and shl and generate
> code like:
>
> 	add %ebx,%ebx
> 	add %ebx,%ebx
> 	add %ebx,%ebx
> 	add %ebx,%ebx

I think that is true only for the lea's that have a shifted input. The
weakness of the original P4 is its shifter, not lea itself. And for a
simple lea like 4(%esp), it's likely no worse than a regular "add", and
there lea has the advantage that you can put the result in another
register, which can be advantageous in other circumstances.

So lea actually _is_ useful for doing adds, in many cases. Of course, on
older CPU's you'll see the effect of the address generation adder being
one cycle "off" (earlier) the regular ALU execution unit, so lea often
causes AGI stalls.  I don't think this is an issue on the P6 or P4 because
of how they actually end up implementing the lea in the regular ALU path.

How the hell did we get to worrying about this in the first place?

		Linus


Newsgroups: fa.linux.kernel
Original-Message-ID: <4182BF36.5040709@pobox.com>
From: Jeff Garzik <jgarzik@pobox.com>
Subject: Re: Semaphore assembly-code bug
Date: Fri, 29 Oct 2004 22:29:23 GMT
Message-ID: <fa.eaatog5.kk43a7@ifi.uio.no>

Linus Torvalds wrote:
>
> 	popl %eax
> 	popl %ecx
>
> should take one cycle on a Pentium. I pretty much _guarantee_ that
>
> 	lea 4(%esp),%esp
> 	popl %ecx
>
> takes longer, since they have a data dependency on %esp that is hard to
> break (the P4 trace-cache _may_ be able to break it, but the only CPU that
> I think is likely to break it is actually the Transmeta CPU's, which did
> that kind of thing by default and _will_ parallelise the two, and even
> combine the stack offsetting into one single micro-op).


One of my favorite "optimizing for Pentium" docs is

	http://www.agner.org/assem/pentopt.pdf
		from
	http://www.agner.org/assem/

which is current through newer P4's AFAICS.

It notes on the P4 specifically that LEA is split into additions and
shifts.  Not sure what it does on the P3, but I bet it generates more
uops in addition to the data dependency.

	Jeff




Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Semaphore assembly-code bug
Original-Message-ID: <Pine.LNX.4.58.0410291209170.28839@ppc970.osdl.org>
Date: Sat, 30 Oct 2004 01:38:54 GMT
Message-ID: <fa.iv7jn8e.134q83m@ifi.uio.no>

On Fri, 29 Oct 2004, linux-os wrote:
>
> Linus, there is no way in hell that you are going to move
> a value from memory into a register (pop ecx) faster than
> you are going to do anything to the stack-pointer or
> any other register.

Sorry, but you're wrong.

Learn about modern CPU's some day, and realize that cached accesses are
fast, and pipeline stalls are relatively much more expensive.

Now, if it was uncached, you'd have a point.

Also think about why

	call xxx
	jmp yy

is often much faster than

	push $yy
	jmp xxx

and other small interesting facts about how CPU's actually work these
days.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Semaphore assembly-code bug
Original-Message-ID: <Pine.LNX.4.58.0410291133220.28839@ppc970.osdl.org>
Date: Sat, 30 Oct 2004 01:43:16 GMT
Message-ID: <fa.ivn7o08.12km8rs@ifi.uio.no>

On Fri, 29 Oct 2004, linux-os wrote:

> On Fri, 29 Oct 2004, Richard Henderson wrote:
> >
> > Also not necessarily correct.  Intel cpus special-case pop
> > instructions; two pops can be dual issued, whereas a different
> > kind of stack pointer manipulation will not.
> >
>
> Then I guess the Intel documentation is incorrect, too.

Where?

It definitely depends on the CPU. Some CPU's dual-issue pops, some don't.

I think the Pentium can dual-issue, while the PPro/P4 does not. And AMD
has some other rules, and I think older ones dual-issue stack accesses
only if esp doesn't change. Haven't looked at K8 rules.

And Pentium M is to some degree more interesting than P4 and Ppro, because
it's apparently the architecture Intel is going forward with for the
future of x86, and it is a "improved PPro" core that has a special stack
engine, iirc.

Anyway, it's quite likely that for several CPU's the fastest sequence ends
up actually being

	movl 4(%esp),%ecx
	movl 8(%esp),%edx
	movl 12(%esp),%eax
	addl $16,%esp

which is also one of the biggest alternatives.

Anyway, making "asmlinkage" imply "regparm(3)" would make the whole
discussion moot, so I'm wondering if anybody has the patches to try it
out? It requires pretty big changes to all the x86 asm code, but I do know
that people _had_ patches like that at least long ago (from when people
like Jan were playing with -mregparm=3 originally). Maybe some of them
still exist..

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Semaphore assembly-code bug
Original-Message-ID: <Pine.LNX.4.58.0410291150360.28839@ppc970.osdl.org>
Date: Sat, 30 Oct 2004 02:08:56 GMT
Message-ID: <fa.j07fog5.124u9bv@ifi.uio.no>

On Fri, 29 Oct 2004, Linus Torvalds wrote:
>
> Anyway, making "asmlinkage" imply "regparm(3)" would make the whole
> discussion moot, so I'm wondering if anybody has the patches to try it
> out? It requires pretty big changes to all the x86 asm code, but I do know
> that people _had_ patches like that at least long ago (from when people
> like Jan were playing with -mregaparm=3 originally). Maybe some of them
> still exist..

Looking at just doing this for the semaphore code, I hit on the fact that
we already do this for the rwsem's.. So changing just the regular
semaphore code to use "fastcall" should fix this particular bug, but I'm
still interested in hearing whether somebody has a patch for the system
calls and faults too?

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Semaphore assembly-code bug
Original-Message-ID: <Pine.LNX.4.58.0410291355460.28839@ppc970.osdl.org>
Date: Fri, 29 Oct 2004 21:12:38 GMT
Message-ID: <fa.j0njoga.11kq9bq@ifi.uio.no>

On Fri, 29 Oct 2004, Linus Torvalds wrote:
>
> Here's a totally untested patch to make the semaphores use "fastcall"
> instead of "asmlinkage"

Ok, I tested it, looked through the assembly code, and did a general size
comparison. Everything looks good, and it should fix the problem that
caused this discussion. Checked in.

The patch actually improves code generation by moving the failure case
argument generation _into_ the failure case: this makes the inline asm one
instruction longer, but it means that the fastpath is often one
instruction shorter. In fact, the fastpath is usually improved even _more_
than that, because gcc does sucketh at generating code that uses fixed
registers (ie the old code often caused gcc to first generate the value
into another register, and then _move_ it into %eax, rather than just
generating it into %eax in the first place).

My test-kernel shrunk by a whopping 2kB in size from this change.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Semaphore assembly-code bug
Original-Message-ID: <Pine.LNX.4.58.0410301040050.28839@ppc970.osdl.org>
Date: Sat, 30 Oct 2004 17:55:54 GMT
Message-ID: <fa.iunbofs.13kq9bm@ifi.uio.no>

On Sat, 30 Oct 2004, Denis Vlasenko wrote:
>
> On Saturday 30 October 2004 05:13, Andi Kleen wrote:
> > Linus Torvalds <torvalds@osdl.org> writes:
> >
> > > Anyway, it's quite likely that for several CPU's the fastest sequence ends
> > > up actually being
> > >
> > > 	movl 4(%esp),%ecx
> > > 	movl 8(%esp),%edx
> > > 	movl 12(%esp),%eax
> > > 	addl $16,%esp
> > >
> > > which is also one of the biggest alternatives.
> >
> > For K8 it should be the fastest way. K7 probably too.
>
> Pity. I always loved 1 byte insns :)

I personally am a _huge_ believer in small code.

The sequence

	popl %eax
	popl %ecx
	popl %edx
	popl %eax

is four bytes. In contrast, the three moves and an add is 15 bytes. That's
almost 4 times as big.

And size _does_ matter. The extra 11 bytes means that if you have six of
these sequences in your program, you are pretty much _guaranteed_ one more
icache miss from memory. That's a few hundred cycles these days.
Considering that you _maybe_ won a cycle or two each time it was executed,
it's not at all clear that it's a win, except in benchmarks that have huge
repeat-rates. Real life doesn't usually have that. In many real-life
scenarios, repeat rates are in the tens or hundreds for most code...

And that's ignoring things like disk load times etc.

Sadly, the situation is often one where when you actually do all the
performance testing, you artificially increase the repeat-rates hugely:
you run the same program a thousand times in order to get a good profile,
and you keep it in the cache all the time. So performance analysis often
doesn't actually _see_ the downsides.

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Semaphore assembly-code bug
Original-Message-ID: <Pine.LNX.4.58.0410301831300.28839@ppc970.osdl.org>
Date: Sun, 31 Oct 2004 01:44:27 GMT
Message-ID: <fa.j07ho7t.12408jn@ifi.uio.no>

On Sun, 31 Oct 2004, Andi Kleen wrote:
>
> Using the long stack setup code was found to be a significant
> win when enough registers were saved (several percent in real benchmarks)
> on K8 gcc.

For _what_?

Real applications, or SpecInt?

The fact is, SpecInt is not very interesting, because it has almost _zero_
icache footprint, and it has generally big repeat-rates, and to make
matters worse, you are allowed (and everybody does) to warm up the caches by
running before you actually do the benchmark run.

_None_ of these are realistic for real life workloads.

>  It speed up all function calls considerably because it
> eliminates several stalls for each function entry/exit.

... it shaves off a few cycles in the cached case, yes.

> The popls will all depend on each other because of their implied
> reference to esp.

Which is only true on moderately stupid CPU's. Two pop's don't _really_
depend on each other in any real sense, and there are CPU's that will
happily dual-issue them, or at least not stall in between (ie the pop's
will happily keep the memory unit 100% busy).

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Semaphore assembly-code bug
Original-Message-ID: <Pine.LNX.4.58.0411011531100.28839@ppc970.osdl.org>
Date: Mon, 1 Nov 2004 23:58:15 GMT
Message-ID: <fa.iv7dnfu.134o8bm@ifi.uio.no>

On Mon, 1 Nov 2004, linux-os wrote:
>
> No. You've just shown that you like to argue. I recall that you
> recently, like within the past 24 hours, supplied a patch that
> got rid of the time-consuming stack operations in your semaphore
> code. Remember, you changed it to pass parameters in registers.

.... because that fixed a _bug_.

> Why would you bother if stack operations are free?

I didn't say that instructions are free. I just tried (unsuccessfully) to
tell you that "lea" is not free either, and that "lea" has some serious
problems on several setups, ranging from old cpu's (AGI stalls) to new
CPU's (stack engine stalls). And that "pop" is often faster.

And you have been arguing against it despite the fact that I ended up
writing a small test-program to show that it's true. It's a _stupid_
test-program, but the fact is, you only need a single test-case to prove
some theory wrong.

Your theory that "lea" is somehow always cheaper than "pop" is wrong.

> It's not a total focus. It's just necessary emphasis. Any work
> done by your computer, ultimately comes from and goes to memory.

Not so.

A lot of work is done in cache. Any access that doesn't change the state
of the cache is a no-op as far as the memory bus is concerned. Ie a store
to a cacheline that is already dirty is just a cache access, as is a load
from a cacheline that is already loaded.

This is especially true on x86 CPU's, where the lack of registers means
that the core has been highly optimized for doing cached operations.
Remember: a CPU is not some kind of abstract entity - it's a very
practical piece of engineering that has been highly optimized for certain
usage patterns.

And the fact is, "lea" on %esp is not a common usage pattern. Which is
why, in practice, you will find CPU's that end up not optimizing for it.
While "pop"+"pop" is a _very_ common pattern, and why existing CPU's
do them efficiently.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Semaphore assembly-code bug
Original-Message-ID: <Pine.LNX.4.58.0411011327400.28839@ppc970.osdl.org>
Date: Tue, 2 Nov 2004 03:21:05 GMT
Message-ID: <fa.j0n9n84.11kk83g@ifi.uio.no>

On Mon, 1 Nov 2004, linux-os wrote:
>
> Wrong.
>
> (1)  The '486 didn't have the rdtsc instruction.
> (2)  There are no 'serializing' or other black-magic aspects of
> using the internal cycle-counter. That's exactly how you
> can benchmark the execution time of accessible code sequences.

Sorry, but you shouldn't argue with people who know more than you do. I
know Dean, and he analyzes things for work, and does know what he is
doing.

"rdtsc" _does_ partly serialize things, and it's not even architecturally
defined, so you'll find that it serializes things in different ways on
different CPU's. You can't just do

	rdtsc
	...
	rdtsc

and expect the stuff in between the rdtsc's to be timed exactly: some of
it will overlap with the rdtsc's, some of it won't.

On Intel, if I recall correctly, rdtsc is totally serializing, so you're
testing not just the instructions between the rdtsc's, but the length of
the pipeline, and the time it takes for stuff around it to calm down.
Which is why two rdtsc's in sequence will show quite a lot of overhead on
a P4 (something like 80 cycles).

So you really want to do more operations in between the rdtsc's.

Try the appended program. On a P4, the two sequences are the same for me
(92 cycles, 80 cycles overhead), while on a Pentium M, the sequence of two
popl's (57 cycles) is faster than the sequence of "lea+popl" (59 cycles)
and the overhead is 47 cycles.

So can you _please_ just admit that you were wrong? On a P4, the pop/pop
is the same cost as lea/pop, and on a Pentium M the pop/pop is faster,
according to this test. Your contention that "pop" has to be slower than
"lea" is WRONG.

		Linus

----
#define PUSHEBX "pushl %%ebx\n\t"
#define PUSHECX "pushl %%ecx\n\t"
#define POPECX "popl %%ecx\n\t"
#define POPEBX "popl %%ebx\n\t"

#ifdef TEST_LEA

#undef POPECX
#define POPECX "leal 4(%%esp),%%esp\n\t"

#endif

#ifdef TEST_OVERHEAD

#undef PUSHEBX
#undef PUSHECX
#undef POPEBX
#undef POPECX

#define PUSHEBX
#define PUSHECX
#define POPEBX
#define POPECX

#endif

#include <stdio.h>

int main(void)
{
	unsigned long start;
	unsigned long long end;

	asm volatile(
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		PUSHEBX
		PUSHECX
		"rdtsc\n\t"
		POPECX
		POPEBX
		POPECX
		POPEBX
		POPECX
		POPEBX
		POPECX
		POPEBX
		POPECX
		POPEBX
		POPECX
		POPEBX
		POPECX
		POPEBX
		POPECX
		POPEBX
		"movl %%eax,%%esi\n\t"
		"rdtsc"
		:"=A" (end), "=S" (start));
	printf("%ld cycles\n", (long) end-start);
	return 0;
}
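[The program is 32-bit x86 only; one plausible way to build the three variants, assuming a multilib gcc and that the file is saved as "test.c" (a hypothetical name):

```shell
# baseline: pop/pop
gcc -m32 -O2 -o test-pop test.c
# lea 4(%esp),%esp instead of the pops of %ecx
gcc -m32 -O2 -DTEST_LEA -o test-lea test.c
# measurement overhead of the two rdtsc's alone
gcc -m32 -O2 -DTEST_OVERHEAD -o test-overhead test.c
```
]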



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: Semaphore assembly-code bug
Original-Message-ID: <Pine.LNX.4.58.0411011342090.28839@ppc970.osdl.org>
Date: Tue, 2 Nov 2004 03:25:00 GMT
Message-ID: <fa.iunrnnv.13k69jl@ifi.uio.no>

On Mon, 1 Nov 2004, Linus Torvalds wrote:
>
> So can you _please_ just admit that you were wrong? On a P4, the pop/pop
> is the same cost as lea/pop, and on a Pentium M the pop/pop is faster,
> according to this test. Your contention that "pop" has to be slower than
> "lea" is WRONG.

Btw, I'd like to emphasize "this test". Modern OoO CPU's are complex
animals. They have pipeline quirks etc that just means that things depend
on alignment, on code around it, and on register usage patterns of the
instructions that you test _and_ the instructions around those
instructions. So take any proof with a pinch of salt, because there are
bound to be other circumstances where factors around the code just change
the assumptions.

In short, any time you're looking at single cycle timings, you should be
very aware of the fact that your measurements are suspect. The best way to
avoid most of the problem is to never try to measure single cycles.
Measure performance on a program, not on a single instruction.

			Linus
