From: torvalds@penguin.transmeta.com (Linus Torvalds)
Newsgroups: comp.os.linux.development.system
Subject: Re: A faster memcpy and bzero for x86
Date: 7 Dec 2000 12:03:53 -0800

In article <3gOX5.3201$TU6.93084@ptah.visi.com>,
Grant Edwards <grante@visi.com> wrote:
>In article <m3itow46ph.fsf@sysengr.res.ray.com>, Johan Kullstam wrote:
>
>[regarding use of FPU to do memory move/fill]
>
>>> Now I'm just wondering why Linux doesn't already have these
>>> optimizations.
>>
>>you can't just clobber the FPU regs, something else might be using
>>them.  the kernel cannot use FPU without saving and restoring them
>>since they could be used in a userspace program (save/restore FPU every
>>time you enter the kernel is too expensive, FPU is only save/restore'd
>>at userspace program context switches).
>
>Are we talking about memcpy and bzero in glibc?  In that case
>it's not the kernel...

Note that even in user space memcpy() using MMX registers is NOT
necessarily a good idea at all.

Why?

It looks damn good in benchmarks. Especially for large memory areas that
are not in the cache.

But it tends to have horrible side-effects.  Like the fact that when
multiple processes (or threads) are running, it means that the FP state
has to be switched all the time.  Normally we can avoid this overhead,
because most programs do not actually tend to use the FP unit very much,
so with some simple lazy FP switching we can make thread and process
switches much faster.
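To make "lazy" concrete, here is a toy model of the idea (the names and
the empty stubs are purely illustrative, standing in for the TS bit in
%cr0, fsave/frstor and the #NM trap - this is not the actual arch/i386
code):

	struct task {
		int used_math;                  /* has this task ever used the FPU? */
		unsigned char fpu_image[108];   /* fsave/frstor-sized save area */
	};

	static struct task *fpu_owner;          /* whose state is live in the FPU */

	static void arm_fpu_trap(void)    { /* stands in for "set TS in %cr0" */ }
	static void disarm_fpu_trap(void) { /* stands in for "clts" */ }
	static void fpu_save(unsigned char *img)          { (void)img; /* fsave */ }
	static void fpu_restore(const unsigned char *img) { (void)img; /* frstor */ }

	static void context_switch(struct task *prev, struct task *next)
	{
		(void)prev; (void)next;
		arm_fpu_trap();         /* next FP/MMX instruction will trap */
		/* note: no fsave/frstor here at all */
	}

	static void fpu_trap(struct task *tsk)  /* the #NM handler */
	{
		disarm_fpu_trap();
		if (fpu_owner && fpu_owner != tsk)
			fpu_save(fpu_owner->fpu_image);  /* pay only when forced to */
		if (tsk->used_math)
			fpu_restore(tsk->fpu_image);
		tsk->used_math = 1;
		fpu_owner = tsk;
	}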

Using the FPU or MMX for memcpy makes that go away completely.  Suddenly
you get slower task switching, and people will blame the kernel.  Even
though the _real_ bug is an optimization that looks very good on
benchmarks, but does not necessarily actually win all that much in real
life.

Basically, you should almost never use the i387 for memcpy(), unless you
know you can get it for free (ie you're already using the FPU). An i387
state save/restore is expensive. It's expensive even in user mode where
you don't do it explicitly, but the kernel does it for you.

The MMX stuff is similar. Only use it if you already know you're using
the MMX unit.  Because otherwise you _will_ slow the system down.

NOTE! If you absolutely want to do it anyway, make sure that the size
cutoff is large. It definitely is not worth a few FPU task switches to
do small memcpy's. But for really large memcpy's you might consider it
(ie if size is noticeably larger than a few kilobytes). Use regular
integer stuff for smaller areas.
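The structure of such a thing is just a size check in front of the copy.
A minimal user-space sketch (the 8kB cutoff and both function names are
made up for illustration, and the "MMX" routine is stubbed out with a
plain memcpy() so the sketch stays self-contained):

	#include <stddef.h>
	#include <string.h>

	#define FPU_COPY_CUTOFF	(8 * 1024)	/* "noticeably larger than a few kB" */

	/* Stand-in for a real MMX/SSE copy loop. */
	static void *big_fpu_copy(void *dst, const void *src, size_t n)
	{
		return memcpy(dst, src, n);
	}

	void *maybe_fpu_memcpy(void *dst, const void *src, size_t n)
	{
		if (n < FPU_COPY_CUTOFF)
			return memcpy(dst, src, n);	/* integer copy: never
							   touches FP state */
		return big_fpu_copy(dst, src, n);	/* big enough that an FP
							   state switch might pay off */
	}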

And it's insidious.  When benchmarking this thing, you usually (a) don't
have any other programs running and (b) even if you do, they haven't
been converted to using FPU memcpy yet, so you'd see only half of
the true cost anyway.

			Linus


From: torvalds@penguin.transmeta.com (Linus Torvalds)
Newsgroups: comp.os.linux.development.system
Subject: Re: A faster memcpy and bzero for x86
Date: 15 Dec 2000 11:17:00 -0800

In article <ouppuitbp33.fsf@pigdrop.muc.suse.de>,
Andi Kleen  <freitag@alancoxonachip.com> wrote:
>Kasper Dupont <kasperd@daimi.au.dk> writes:
>
>
>> > Of course if the CPU is busy all the time, if an application writes to a
>> > large number of new pages quickly, or if memory is outrageously limited
>> > in size then this wouldn't help, but perhaps in most cases it would
>> > help.
>>
>> I think Linux does use idle time to zero pages on some architectures.
>> But I have no clue why doing it on some and not on others.
>
>It is only useful when the architecture has instructions to zero pages
>without polluting the caches. PPC has this. Modern x86 with SSE has too,
>but as far as I know nobody tried it there yet.

It's not just about polluting the caches.

It's also about polluting a shared bus, where the other CPU's that
aren't idle may be doing real work that _needs_ that bus.

An idle thread should NOT use resources that may be useful for other
threads that are doing real work.

		Linus


From: torvalds@penguin.transmeta.com (Linus Torvalds)
Newsgroups: comp.os.linux.development.system
Subject: Re: A faster memcpy and bzero for x86
Date: 16 Dec 2000 12:06:35 -0800

In article <3A3BAD96.6087@ev1.net>, Robert Redelmeier  <redelm@ev1.net> wrote:
>
>4) The first write to a page is always long because of the
>copy-on-write VM philosophy.  The OS has to scrounge a freeable
>page.  But Linux takes considerably longer.  Why? Fragmentation?
>It should bzero rather than memcpy the zeropage. Most (>90%) CoW
>pagefaults are undoubtedly from the zeropage, and a memcpy
>would do nothing but eat cycles and pollute the caches!

Under Linux, the above is not a COW-fault at all: Linux will not
populate the page tables with zero pages, so Linux will get a "nopage"
fault and will build up the page tables at run-time. Linux will notice
that it's a write, allocate a new page, and clear it.
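In rough outline (simplified, made-up names rather than the real
mm/memory.c interfaces), that path looks like:

	#include <stdlib.h>
	#include <string.h>

	#define PAGE_SIZE 4096

	struct pte { void *page; };	/* stand-in for a real page table entry */

	/* Model of the anonymous write fault: nothing was mapped up front, so
	 * the page is allocated and cleared right here, at the moment the
	 * program first touches it - which also leaves it hot in the cache. */
	static void anonymous_write_fault(struct pte *pte)
	{
		void *page = malloc(PAGE_SIZE);	/* stand-in for alloc_page() */
		memset(page, 0, PAGE_SIZE);	/* the clear happens at use time */
		pte->page = page;		/* stand-in for set_pte() */
	}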

I'll see if I can see anything obvious from the profile.

I suspect that FBSD might be pre-allocating pages.

		Linus


From: torvalds@penguin.transmeta.com (Linus Torvalds)
Newsgroups: comp.os.linux.development.system
Subject: Re: A faster memcpy and bzero for x86
Date: 17 Dec 2000 12:20:33 -0800

In article <3A3C2CD9.4D27@ev1.net>, Robert Redelmeier  <redelm@ev1.net> wrote:
>Linus Torvalds wrote in part:
>
>> I'll see if I can see anything obvious from the profile.
>
>Well, I've looked at the 2.4.0t8 kernel (the best way for me
>to understand it is `objdump -d /usr/src/linux/vmlinux | more`)
>and memset looks healthy.
>
>It uses `rep movsb` instead of the slightly faster `rep movsl` but
>that only costs 16 extra CPU cycles, about 2% (863 vs 847), per 4KB zero.
>Hardly worth any special-size code.

Clearing a page doesn't use "memset()", it should be using
"clear_user_highpage()" - which on an x86 expands to a constant memset
which is different from the generic "memset()" library function.

Unless you have high-memory support enabled (ie the stuff for 1GB+ of
RAM), the code should just expand to

	movl $1024,%ecx		# 4096 bytes = 1024 dwords per page
	xorl %eax,%eax		# clear %eax: the fill value
	rep ; stosl		# store %eax 1024 times at (%edi)

for the page clearing code.

		Linus


From: torvalds@penguin.transmeta.com (Linus Torvalds)
Newsgroups: comp.os.linux.development.system
Subject: Re: A faster memcpy and bzero for x86
Date: 17 Dec 2000 13:52:54 -0800

In article <3A3BAD96.6087@ev1.net>, Robert Redelmeier  <redelm@ev1.net> wrote:
>Linus Torvalds wrote:
>>
>> It's not just about polluting the caches.
>>
>> It's also about polluting a shared bus, where the other CPU's that
>> aren't idle may be doing real work that _needs_ that bus.
>
>Well, there's something stinky happening anyways :)

Ok, I've analyzed this, and there is nothing stinky happening.

Or rather, the "stinky" is cache effects, not software.

>The asm program below tests various bzero routines and
>outputs four times in CPU clocks :  a userland 4kB bzero
>with `rep stosl`, the [g]libc 4kB bzero, the time for a
>second 32bit write and the time for the first 32bit write
>to an otherwise unused RAM page.

It turns out that what you are testing is not at all what you _think_
you're testing. Which is why you don't understand the results, and why
they look so different on Linux and FreeBSD.

>Typical times are: [Abit BP6, 2 * 5.5 * 94 MHz Celerons]
>
>Linux 2.4.0t8:   847    915  33  12000 (~15% at 25000)
>FreeBSD   4.1:  1030  11000  33   5000 (all variable 15%)
>        notes:   (1)    (2)  (3)   (4)

One thing to clarify (for others) is that the numbers are shown in the
reverse order that they are actually calculated. Ie (4) is done first,
and (1) is done last. That's confusing, and made the real reason for the
numbers less apparent.

>1)  This time could be improved to 811 with MMX moves, a
>4% improvement likely to cost much more for MMX context saves.
>AMD K7 may perform differently.  The worst 32bit code I could
>write in good conscience took 2100, the best 1600. Why FBSD
>has variable times for a single instruction is a mystery.

NO!

(1) is the best possible time for doing a memset() WITH HOT CACHES!

Which is not necessarily the same as "best memset()" in general at all.
The fastest way to clear memory can be very different for the hot-cache
and the cold-cache case.
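You can see the difference from user space just by timing the same
memset() twice on a freshly mapped page. A minimal sketch (the rdtsc use
is deliberately naive - no serializing instruction, no averaging):

	#include <stdio.h>
	#include <string.h>
	#include <stdint.h>
	#include <sys/mman.h>

	static inline uint64_t rdtsc(void)
	{
		uint32_t lo, hi;
		__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
		return ((uint64_t)hi << 32) | lo;
	}

	int main(void)
	{
		/* A fresh anonymous page: untouched, so not in any cache yet. */
		char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		uint64_t t0, t1, t2;

		t0 = rdtsc();
		memset(buf, 0, 4096);	/* cold: page fault + cache misses */
		t1 = rdtsc();
		memset(buf, 0, 4096);	/* hot: same routine, data now in L1 */
		t2 = rdtsc();

		printf("first:  %llu cycles\n", (unsigned long long)(t1 - t0));
		printf("second: %llu cycles\n", (unsigned long long)(t2 - t1));
		return 0;
	}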

>2)  The glibc bzero routine is d@mn good!  It ought not
>to be so deprecated.  The FreeBSD libc bzero routine s#x.

NO! Again.

(2) is, under Linux, again the same thing: it's a memset() with HOT
CACHES.

Under FreeBSD, it's a memset with mostly COLD CACHES. Those 11000 cycles
are what you need to clear memory that is not in the cache. Remember
this for later.

>3) 33 CPU cycles is exactly as expected:  32 clocks for rdtsc
>[measurement] overhead, and 1 clock for a single write to L1
>cache which was brought in by the first [long] write.

This, actually, is not even a write to the cache, but to the write
buffers. There was no "first long write" at all. The cache never comes
into play here, really.

>4) The first write to a page is always long because of the
>copy-on-write VM philosophy.  The OS has to scrounge a freeable
>page.  But Linux takes considerably longer.  Why? Fragmentation?
>It should bzero rather than memcpy the zeropage. Most (>90%) CoW
>pagefaults are undoubtedly from the zeropage, and a memcpy
>would do nothing but eat cycles and pollute the caches!

No.  What happens is that under Linux, most of those 12000 cycles are
clearing memory that is not in the cache.  In particular, see above on
the FreeBSD memset() numbers.  Of the Linux 12000 cycles, 90%+ is for
the memset. The actual page fault handling etc takes on the order of
1000 cycles - a few hundred of this is the actual hardware cost on x86
to take a page fault.

On FreeBSD, those 5000 cycles are just mapping in an already zeroed
page. It looks to me like FreeBSD basically keeps most of the free
page list zeroed at all times, which is why you can have even 1000 pages
take a constant (fairly low) time for the mapping cost.

Ok, now for the REAL analysis of the numbers:

 - the FreeBSD memset() routine is probably fine. Call memset() twice,
   and I bet the second number will be ok.

 - FreeBSD does pre-zeroing, which makes it look good on certain things.
   In particular, this makes it look good at allocation-time.

   But what the numbers also show is that THIS IS BAD IN REAL LIFE!

   Pre-zeroing makes the kernel profile look good, which is probably
   why the BSDs do it. What it doesn't show is that it generates all the
   wrong behaviour for the cache: it means that the cache has long since
   cooled down by the time the page is actually _used_. Look at the sum
   of all numbers (which is basically doing the same thing), and FreeBSD
   clearly loses to Linux.

This is, in fact, the exact same performance problem that SLAB caches
have.  The thinking behind SLAB caches is that the initialization of a
data structure after allocation is expensive, so if you can avoid that
initialization, you can avoid that expense.  So slab caches remember
what the old contents were, and only initialize the fields that need it.
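In toy form (made-up names, not the kernel's slab API), the pattern being
criticized looks like this: objects keep their constructed state on the
free list, so allocation only re-initializes the cheap parts.

	#include <stdlib.h>

	struct object {
		struct object *free_next;   /* free-list link */
		int ctor_done;              /* the "expensive" one-time setup */
		int in_use;                 /* cheap field, reset on every alloc */
	};

	static struct object *free_list;

	static struct object *obj_alloc(void)
	{
		struct object *o = free_list;
		if (o) {
			free_list = o->free_next;
			o->in_use = 1;          /* skip the expensive setup... */
			return o;               /* ...but o may be cache-cold by now */
		}
		o = calloc(1, sizeof(*o));      /* slow path: full construction */
		o->ctor_done = 1;
		o->in_use = 1;
		return o;
	}

	static void obj_free(struct object *o)
	{
		o->in_use = 0;
		o->free_next = free_list;       /* constructed state is preserved */
		free_list = o;
	}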

This makes allocation costs much better, and makes them show up less
obviously on performance profiles.

What the people who did this and wrote all the papers on it forgot about
is that the initialization also brings the data structure into the
cache, and that subsequent _uses_ of the data structure thus get sped up
by that initialization.  So most of the time, the cost of the cache misses
is just spread out, not actually elided.

This can be clearly seen in the memset() numbers above: the page miss is
improved, but because the page was cleared much earlier the cost of
bringing it into the cache is now transferred to user space instead of
kernel space.  Which looks good if you want to optimize the kernel, but
looks bad if you want to get best system _performance_.

You can obviously find cases where the pre-initialization actually does
help: if you don't touch the pages after allocating them,
pre-initialization is a clear win. It's easy to create these kinds of
benchmarks, I just don't think they are very realistic.

Anyway, I would suggest you improve your benchmark to state more clearly
what it actually tests. I would also point out that as it stands, your
benchmark actually shows that the FreeBSD approach sucks quite badly,
and that Linux comes out the clear winner here. You may want to create a
benchmark that shows the reverse ;)

		Linus
