Newsgroups: fa.linux.kernel
From: Linus Torvalds <>
Subject: Re: [BENCHMARK] Lmbench 2.5.54-mm2 (impressive improvements)
Original-Message-ID: <>
Date: Sun, 5 Jan 2003 03:58:20 GMT
Message-ID: <>

On Sat, 4 Jan 2003, Linus Torvalds wrote:
> It doesn't show up on lmbench (insufficient precision), but your AIM9
> numbers are quite interesting. Are they stable?

Btw, while checking whether the numbers are stable it is also interesting
to see stability across reboots etc, since the scheduling latency in
particular can easily depend on the location of the binaries in physical
memory etc, since that matters for cache accesses (I think the L1 D$ on a
PIII is 4-way associative, I'm not sure - that makes it _reasonably_ good
at avoiding cache conflicts, but they can still happen and easily account
for a 5% fluctuation. I don't remember what the L1 I$ situation is).

And with a fairly persistent page cache, whatever cache situation there is
tends to largely stay the same, so just re-running the benchmark may not
change much, at least for the I$ situation.

You can see this effect quite clearly in lmbench: while the 2p/0k context
switch numbers tend to be fairly stable (almost zero likelihood of any
cache conflicts), the others often fluctuate more even with the same
kernel (ie for me the 2p/16kB numbers fluctuate between 3 and 6 usecs).

D$ conflicts are generally easier to see (because they usually _will_ change
when you re-run the benchmark, so they show up as fluctuations), but the
I$ effects in particular can be quite persistent because (a) the kernel
code will always be at the same place and (b) the user code tends to be
sticky in the same place due to the page cache.

I'm convinced the I$ effects are one major issue why we sometimes see
largish fluctuations on some ubenchmarks between kernels when nothing has
really changed.


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] dcache: better name hash function
Date: Wed, 28 Oct 2009 00:59:41 UTC
Message-ID: <fa.kui7e0GfzEFgO5FeOdH/dv/>

On Tue, 27 Oct 2009, Stephen Hemminger wrote:
> Agreed. Here is the reduced version of the program.
> To run:
>   find /home -printf '%f\n' 2>/dev/null | ./htest -n 100

The timings are very sensitive to random I$ layout at least on Nehalem.
The reason seems to be that the inner loop is _so_ tight that just
depending on exactly where the loop ends up, you can get subtle
interactions with the loop cache.

Look here:

	[torvalds@nehalem ~]$ find /home -printf '%f\n' 2>/dev/null | ./htest -n 100
	Algorithm             Time       Ratio       Max   StdDev
	full_name_hash       1.141899     1.03      4868 263.37
	djb2                 0.980200     1.03      4835 266.05
	string10             0.909175     1.03      4850 262.67
	string10a            0.673915     1.03      4850 262.67
	string10b            0.909374     1.03      4850 262.67
	string_hash17        0.966050     1.03      4805 263.68
	string_hash31        1.008544     1.03      4807 259.37
	fnv32                0.774806     1.03      4817 259.17

what do you think the difference between 'string10', 'string10a' and
'string10b' are?

None. None what-so-ever. The source code is identical, and gcc generates
identical assembly language. Yet those timings are extremely stable for
me, and 'string10a' is 25% faster than the identical string10 and
string10b functions.

The only difference? 'string10a' starts aligned to just 16 bytes, but that
in turn happens to mean that the tight inner loop ends up aligned on a
128-byte boundary. And being cacheline aligned just there seems to matter
for some subtle micro-architectural reason.

The reason I noticed this is that I wondered what small modifications to
'string10' would do for performance, and noticed that even _without_ the
small modifications, performance fluctuated.

Lesson? Microbenchmarks like this can be dangerous and misleading. That's
_especially_ true if the loop ends up being just tight enough that it can
fit in some trace cache or similar. In real life, the name hash is
performance-critical, but at the same time almost certainly won't be run
in a tight enough loop that you'd ever notice things like that.


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on
Date: Mon, 17 Nov 2008 21:35:42 UTC
Message-ID: <>

On Mon, 17 Nov 2008, Ingo Molnar wrote:
> this function _really_ hurts from a 16-bit op:
> ffffffff8048943e:     6503 	66 c7 83 a8 00 00 00 	movw   $0x0,0xa8(%rbx)
> ffffffff80489445:        0 	00 00
> ffffffff80489447:   174101 	5b                   	pop    %rbx

I don't think that is it, actually. The 16-bit store just before it had a
zero count, even though anything that executes the second one will always
execute the first one too.

The fact is, x86 profiles are subtle at an instruction level, and you tend
to get profile hits _after_ the instruction that caused the cost because
an interrupt (even an NMI) is always delayed to the next instruction (the
one that didn't complete). And since the core will execute out-of-order,
you don't even know what that one is, since there could easily be
branches, but even in the absence of branches you have many instructions
executing together.

For example, in many situations the two 16-bit stores will happily execute
together, and what you see may simply be a cache miss on the line that was
stored to. The store buffer needs to resolve the read of the "pop" in
order to complete, so having a big count in between stores and a
subsequent load is not all that unlikely.

So doing per-instruction profiling is not useful unless you start looking
at what preceded the instruction, and because of the out-of-order nature,
you really almost have to look for cache misses or branch mispredicts.

One common reason for such a big count on an instruction that looks
perfectly simple is often that there is a branch to that instruction that
was mispredicted. Or that there was an instruction that was costly _long_
before, and that other instructions were in the shadow of that one
completing (ie they had actually completed first, but didn't retire until
the earlier instruction did).

So you really should never just look at the previous instruction or
anything as simplistic as that. The time of in-order execution is long
gone.


From: Linus Torvalds <>
Newsgroups: fa.linux.kernel
Subject: Re: system_call() - Re: [Bug #11308] tbench regression on each kernel
Date: Mon, 17 Nov 2008 22:11:26 UTC
Message-ID: <fa.k2XEDzmDPQyt/>

On Mon, 17 Nov 2008, Ingo Molnar wrote:
> syscall entry instruction costs - unavoidable security checks, etc. -
> hardware costs.

Yes. One thing to look out for on x86 is the system call _return_ path. It
doesn't show up in kernel profiles (it shows up as user costs), and we had
a bug where auditing essentially always caused us to use 'iret' instead of
'sysret' because it took us the long way around.

And profiling doesn't show it, but things like lmbench did, iret being
about five times slower than sysret.

But yes:

> -ENTRY(system_call_after_swapgs)
> +.globl system_call_after_swapgs
> +system_call_after_swapgs:

This definitely makes sense. We definitely do not want to align that
special case.

