From: Linus Torvalds <firstname.lastname@example.org>
Subject: Re: [patch 00/2] improve .text size on gcc 4.0 and newer compilers
Date: Wed, 04 Jan 2006 22:44:20 UTC
On Tue, 3 Jan 2006, Martin J. Bligh wrote:
> Well, it's not just one function, is it? It'd seem that if you unlined
> a whole bunch of stuff (according to this theory) then normal
> macro-benchmarks would go faster? Otherwise it's all just rather
> theoretical, is it not?
One of the problems with code size optimizations is that they
fundamentally show much less impact under pretty much any traditional
In user space, for example, the biggest impact of size optimization tends
to be in things like load time and low-memory situations. Yet, that's
never what is benchmarked (yeah, people will benchmark some low-memory
kernel situations by compating different swap-out algorithms etc against
each other - but they'll almost never compare the perceived speed of your
desktop when you start swapping).
Now, I'll happily also admit that code and data _layout_ is often a lot
more effective than just code size optimizations. That's especially true
with low-memory situations where the page size being larger than many of
the code sequences, you can make a bigger impact by changing utilization
than by changing absolute size.
But even then it's actually really hard to measure. Cache effects tend to
be hard _anyway_ to measure, it's really hard when the interesting case is
the cold-cache case and can't even do simple microbenchmarks that repeat a
million times to get stable results.
So the best we can usually do is "microbenchmarks don't show any
noticeable _worse_ behaviour, and code size went down by 10%".
Just as an example: there's an old paper on the impact of the OS design on
the memory subsystem, and one of the tables is about how many cycles per
instruction is spent on cache effects (this was Ultrix vs Mach, the load
was the average of a workload that was either specint or looked a lot
Ultrix user: 0.07 0.08
Mach user: 0.07 0.08
Ultrix system: 0.43 0.23
Mach system: 0.57 0.29
Now, that's an oldish paper, and we really don't care about Ultrix vs Mach
nor about the particular hw (Decstation), but the same kind of basic
numbers have been repeated over an over. Namely that system code tends to
have very different behaviour from user code wrt caches. Caches may have
gotten better, but code has gotten bigger, and cores have gotten faster.
And in particular, they tend to be _much_ worse. Something you'll seldom
see as clearly in micro-benchmarks, if only because under those benchmarks
things will generally be more cached - especially on the I$ side.
So I should probably try to find something slightly more modern, but what
it boils down to is that at least one well-cited paper that is fairly easy
to find claims that about _half_a_cycle_ for each instruction was spent on
I$ misses in system code on a perfectly regular mix of programs. And that
the cost of I$ was actually higher than the cost of D$ misses.
Now, most of the _time_ tends to be spent in user mode. That is probably
part of the reason for why system mode gets higher cache misses (another
is that obviously you'd hope that the user program can optimize for its
particular memory usage, while the kernel can't). But the result remains:
I$ miss time is a noticeable portion of system time.
Now, I claim that I$ miss overhead on a system would probably tend to have
a pretty linear relationship with the size of the static code. There
aren't that many big hot loops or small code that fits in the I$ to skew
that trivial rule.
So at a total guess, and taking the above numbers (that are questionable,
but hey, they should be ok as a starting point for WAGging), reducing code
size by 10% should give about 0.007 cycles per instruction in user space.
Whee. Not very noticeable. But on system code (the only thing the kernel
can help), it's actually a much more visible 0.05 CPI.
Yeah, the math is bogus, the numbers may be irrelevant, but it does show
why I$ should matter, even though it's much harder to measure.