CPU store buffers (Linus Torvalds)

Index Home About Blog

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 0/3] 64-bit futexes: Intro
Date: Wed, 04 Jun 2008 19:58:23 UTC
Message-ID: <fa.h5Of0ZBThkgLrmtZ7AUHq8LzEnc@ifi.uio.no>

On Tue, 3 Jun 2008, Nick Piggin wrote:
>
> I think optimised our unlock_page in a way that it can do a
> non-atomic unlock in the fastpath (no waiters) using 2 bits. In
> practice it was still atomic but only because other page flags
> operations could operate on ->flags at the same time.

I'd be *very* nervous about this.

> We don't require any load/store barrier in the unlock_page fastpath
> because the bits are in the same word, so cache coherency gives us a
> sequential ordering anyway.

Yes and no.

Yes, the bits are in the same word, so cache coherency guarantees a lot.

HOWEVER. If you do the sub-word write using a regular store, you are now
invoking the _one_ non-coherent part of the x86 memory pipeline: the store
buffer. Normal stores can (and will) be forwarded to subsequent loads from
the store buffer, and they are not strongly ordered wrt cache coherency
while they are buffered.

IOW, on x86, loads are ordered wrt loads, and stores are ordered wrt other
stores, but loads are *not* ordered wrt other stores in the absence of a
serializing instruction, and it's exactly because of the write buffer.

So:

> But actually if we're careful, we can put them in seperate parts of the
> word and use the sub-word operations on x86 to avoid the atomic
> requirement. I'm not aware of any architecture in which operations to
> the same word could be out of order.

See above. The above is unsafe, because if you do a regular store to a
partial word, with no serializing instructions between that and a
subsequent load of the whole word, the value of the store can be bypassed
from the store buffer, and the load from the other part of the word can be
carried out _before_ the store has actually gotten that cacheline
exclusively!

So when you do

	movb reg,(byteptr)
	movl (byteptr),reg

you may actually get old data in the upper 24 bits, along with new data in
the lower 8.

I think.

Anyway, be careful. The cacheline itself will always be coherent, but the
store buffer is not going to be part of the coherency rules, and without
serialization (or locked ops), you _are_ going to invoke the store buffer!

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 0/3] 64-bit futexes: Intro
Date: Wed, 04 Jun 2008 20:40:23 UTC
Message-ID: <fa.eI3YqkCp6fNAymFyv87g5gWQrvU@ifi.uio.no>

On Wed, 4 Jun 2008, Linus Torvalds wrote:
>
> So when you do
>
> 	movb reg,(byteptr)
> 	movl (byteptr),reg
>
> you may actually get old data in the upper 24 bits, along with new data in
> the lower 8.

Put another way: the CPU may internally effectively rewrite the two
instructions as

	movb reg,tmpreg		(tmp = writebuffer)
	movl (byteptr),reg	(do the 32-bit read of the old cached contents)
	movb tmpreg,reg		(writebuffer snoop by reads)
	movb tmpreg,(byteptr)	(writebuffer actually drains into cacheline)

and *if* your algorithm is robust wrt these kinds of rewrites, you're ok.
But notice how there are two accesses to the cacheline, and how they are
actually in the "wrong" order, and how the cacheline could have been
updated by another CPU in between.

Does this actually happen? Yeah, I do believe it does. Is it a deathknell
for anybody trying to be clever with overlapping reads/writes? No, you can
still have things like causality rules that guarantee that your algorithm
is perfectly stable in the face of these kinds of reordering. But it *is*
one of the few re-orderings that I think is theoretically architecturally
visible.

For example, let's start out with a 32-bit word that contains zero, and
three CPU's. One CPU does

	cmpxchgl 0->0x01010101,mem

another one does

	cmpxchlg 0x01010101->0x02020202,mem

and the third one does that

	movb $0x03,mem
	movl mem,reg

and after it all completed, you may have 0x02020203 in memory, but "reg"
on the third CPU contains 0x01010103.

Note how NO OTHER CPU could _possibly_ have seen that value! That value
never ever existed in any caches. If the final value was 0x02020203, then
both the cmpxchgl's must have worked, so the cache coherency contents
*must* have gone from 0 -> 0x01010101 -> 0x02020202 -> 0x02020203 (with
the movb actually getting the exclusive cache access last).

So the third CPU saw a value for that load that actually *never* existed
in any cache-line: 0x01010103.  Exactly because the x86 memory ordering
allows the store buffer contents to be forwarded within a CPU core.

And this is why atomic locked instructions are special. They bypass the
store buffer (or at least they _act_ as if they do - they likely actually
use the store buffer, but they flush it and the instruction pipeline
before the load and wait for it to drain after, and have a lock on the
cacheline that they take as part o the load, and release as part of the
store - all to make sure that the cacheline doesn't go away in between and
that nobody else can see the store buffer contents while this is going
on).

This is also why there is so much room for improvement in locked
instruction performance - you don't _have_ to flush things if you just are
very careful about tracking how and when you use which elements in the
store buffer, and track the ordering of cache accesses by all of this.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 0/3] 64-bit futexes: Intro
Date: Thu, 05 Jun 2008 03:09:54 UTC
Message-ID: <fa.4GmCFz3h0qr9IM9YbSwB3GffLcE@ifi.uio.no>

On Thu, 5 Jun 2008, Nick Piggin wrote:
>
> I'd have thought that for a case like this, you'd simply hit the store
> alias logic and store forwarding would stall because it doesn't have
> the full data.

That's _one_ possible implementation.

Quite frankly, I think it's the less likely one. It's much more likely
that the cache read access and the store buffer probe happen in parallel
(this is a really important hotpath for any CPU, but even more so x86
where there are more of loads and stores that are spills). And then the
store buffer logic would return the data and a bytemask mask (where the
mask would be all zeroes for a miss), and the returned value is just the
appropriate mix of the two.

> I'd like to know for sure.

You'd have to ask somebody very knowledgeable inside Intel and AMD, and it
is quite likely that different microarchitectures have different
approaches...

> The other thing that could be possible, and I'd imagine maybe more likely
> to be implemented in a real CPU because it should give more improvement
> (and which does break my algorithm) is just that the load to the cacheline
> may get to execute first, then if the cacheline gets evicted and
> modified by another CPU before our store completes, we effectively see
> store/load reordering again.

Oh, absolutely, the perfect algorithm would actually get the right answer
and notice that the cacheline got evicted, and retried the whole sequence
such that it is coherent.

But we do know that Intel expressly documents loads and stores to pass
each other and documents the fact that the store buffer is there. So I bet
that this is visible in some micro-architecture, even if it's not
necessarily visible in _all_ of them.

The recent Intel memory ordering whitepaper makes it very clear that loads
can pass earlier stores and in particular that the store buffer allows
intra-processor forwarding to subsequent loads (2.4 in their whitepaper).
It _could_ be just a "for future CPU's", but quite frankly, I'm 100% sure
it isn't. The store->load forwarding is such a critical performance issue
that I can pretty much guarantee that it doesn't always hit the cacheline.

Of course, the partial store forwarding case is not nearly as important,
and stalling is quite a reasonable implementation approach. I just
personally suspect that doing the unconditional byte-masking is actually
_simpler_ to implement than the stall, so..

			Linus

Index Home About Blog