From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: comp.arch
Subject: Re: branch prediction, renaming, joins
Message-ID: <br2lov$ctk$1@build.pdx.osdl.net>
Date: Mon, 08 Dec 2003 20:10:01 GMT

Terje Mathisen wrote:
>
>> I believe that there are situations, where declaring a variable as
>> volatile isn't enough.  I forget the exact details, but I do know that
>> Linus added a "memory barrier" macro ("barrier()") to the (Linux)
>> kernel for exactly this reason, back in around 1994.  I tried to
>
>  From other stuff Linus has written later, I'd bet this was added
> specifically to get around the relatively _very_ loose memory ordering
> guarantees on Alpha.

No. For the actual CPU memory ordering, we have different macros ("mb()" for
"memory barrier", "rmb()", "wmb()" etc). They _also_ inhibit compiler
reordering, of course (otherwise they'd not be very useful), but they also
emit the instructions to tell the CPU to be careful.

The "barrier()" macro is there entirely to defeat the compiler reordering
that is sometimes deadly.

For example, the compiler doesn't know _squat_ about locking, and that's
just as well, since compiler-generated locking tends to suck dead donkeys
through a straw (ie monitors etc that some languages have, and that are
tied to the data structures and tend to be overly careful since the
compiler doesn't actually understand what's going on).

But not knowing about locking means that if you have code like this:

        local_cpu_lock = 1;

        .. do something critical ..

        local_cpu_lock = 0;

and you use the "local_cpu_lock" to tell interrupts to not do certain
things, the compiler will be totally clueless.

In particular, the compiler doesn't know that the "local_cpu_lock" thing is
special, or that it has any relationship to the code inside, so the compiler
might decide to move the assignments around. It might even decide to remove
the first assignment entirely, since it has "no effect".

Marking the thing volatile doesn't much help - it is not really a well
defined language feature, and it usually just means that the compiler does
really stupid things with it.

So you add a "barrier()" to the locking functions, telling the compiler that
it can't move stuff around it. The barrier doesn't generate any code in
itself.

(No, the above is not a real piece of code, and the locks do _not_ look like
that in Linux, but you get the idea).

There are other cases where local ordering (as opposed to SMP ordering)
really matters. They tend to all be about asynchronous events (ie
interrupts that cause scheduling or similar), and they don't tend to be all
that common, but when they happen you really want to make them clear both
to the programmer _and_ the compiler. The "barrier()" macro is a good way
to do both.

And "volatile" really isn't a useful thing usually. It's not that the data
itself is necessarily something the compiler cannot optimize: it's
literally the flow of _code_ that is volatile, not the data. A particular
piece of memory may well be "stable" as far as most usage is concerned (ie
the variable itself shouldn't be marked volatile, since that would just
needlessly impact the 99% of uses that do not care), but in some cases we
do special things. For example, this is real code from the scheduler:

        need_resched:
                ti->preempt_count = PREEMPT_ACTIVE;
                schedule();
                ti->preempt_count = 0;

                /* we could miss a preemption opportunity ... */
                barrier();
                if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
                        goto need_resched;

where the point here is that we must wait until _after_ we clear the
preemption count before we can test the TIF_NEED_RESCHED flag. We want to
make sure that we really are running the highest-priority process, but at
the same time we just disabled asynchronous preemption while we did the
scheduling decision, so we may have had an asynchronous event in the
meantime that made our scheduling decision be the wrong one.

And here the ordering really is a property of the _code_, not of the data
involved. Sure, the thread flag could be marked volatile, but that would
just make the ordering "invisible". And the fact is, the ordering is
really between those _two_ data structures - it's not just one of them that
is somehow magically "volatile".

If I could make it explicit that there is an ordering constraint on those
data structures, I would. It's actually a bit like the "aliasing" issues
that you find with wild pointers - even though the data doesn't alias in
physical memory, there are ordering constraints on the code that accesses
them. See?

However, trying to describe these aliasing things to the compiler is
basically a huge problem, and it's just not worth it. It's much easier and
clearer to just use a big hammer and say "compiler - you just don't have a
clue, so don't do anything at ALL in this area".

Compiler people tend to think that people want smart compilers. I as a
compiler power-user can emphatically tell you that that isn't true. People
want _reliable_ compilers, and they want compilers that can be told to just
get the hell out of the way when needed. It's not often you need to, but
when you need to, you _really_ need to.

The notion of making the language more complex to be able to tell the
compiler details like the above so that it would just do the RightThing(tm)
is just crazy talk.  Compilers are good at some things (mindless
instruction scheduling, register allocation etc), but that shouldn't make
you think they should be smart.

                        Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Document Linux's memory barriers [try #5]
Date: Thu, 23 Mar 2006 19:29:07 UTC
Message-ID: <fa.qNuQkxhZeXjPz50lNpU9GB7HuLk@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603231121110.26286@g5.osdl.org>

On Thu, 23 Mar 2006, David Howells wrote:
>
> > Some architectures have more expansive definition of data dependency,
> > including then- and else-clauses being data-dependent on the if-condition,
> > but this is probably too much detail.
>
> Linus calls that a "control dependency" and doesn't seem to think that's a
> problem as it's sorted out by branch prediction.  What you said makes me
> wonder about conditional instructions (such as conditional move).

I'd put it the other way: a control dependency is not "sorted out" by
branch prediction, it is effectively _nullified_ by branch prediction.

Basically, control dependencies aren't dependencies at all. There is
absolutely _zero_ dependency between the following two loads:

	if (load a)
		load b;

because the "load b" can happen before the "load a" because of control
prediction.

So if you need a read barrier where there is a _control_ dependency in
between loading a and loading b, you need to make it a full "smp_rmb()".
It is _not_ sufficient to make this a "read_barrier_depends", because the
load of b really doesn't depend on the load of a at all.

So data dependencies that result in control dependencies aren't
dependencies at all. Not even if the address depends on the control
dependency.

So

	int *address_of_b;

	address_of_b = load(&a);
	smp_read_barrier_depends();
	b = load(address_of_b);

is correct, but

	int *address_of_b = default_address_of_b;

	if (load(&a))
		address_of_b = another_address_of_b;
	smp_read_barrier_depends();
	b = load(address_of_b);

is NOT correct, because there is no data dependency on the load of b, just
a control dependency that the CPU may short-circuit with prediction, and
that second case thus needs a real smp_rmb().

And yes, if we ever hit a CPU that does actual data prediction, not just
control prediction, that will force smp_read_barrier_depends() to be the
same as smp_rmb() on such an architecture.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 8/8] i386: bitops: smp_mb__{before, after}_clear_bit()
Date: Tue, 24 Jul 2007 17:13:35 UTC
Message-ID: <fa.51ckzDlQ1Em+mYvYcr8Ce+zCiUE@ifi.uio.no>

On Tue, 24 Jul 2007, Satyam Sharma wrote:
>
> Looks like when you said "CPU memory barrier extends to all memory
> references" you were probably referring to a _given_ CPU ... yes,
> that statement is correct in that case.

No. CPU memory barriers extend to all CPU's. End of discussion.

It's not about "that cacheline". The whole *point* of a CPU memory barrier
is that it's about independent memory accesses.

Yes, for a memory barrier to be effective, all CPU's involved in the
transaction have to have the barriers - the same way a lock needs to be
taken by everybody in order for it to make sense - but the point is, CPU
barriers are about *global* behaviour, not local ones.

So there's a *huge* difference between

	clear_bit(x,y);

and

	clear_bit(x,y);
	smp_mb__after_clear_bit();

and it has absolutely nothing to do with the particular cacheline that "y"
is in, it's about the *global* memory ordering.

Any write you do after that "smp_mb__after_clear_bit()" will be
guaranteed to be visible to _other_ CPU's *after* they have seen the bit
being cleared. Yes, those other CPU's need to have a read barrier between
reading the bit and reading some other thing, but the point is, this has
*nothing* to do with cache coherency, and the particular cache line that
"y" is in.

And no, "smp_mb__before/after_clear_bit()" must *not* be just an empty "do
{} while (0)". It needs to be a compiler barrier even when it has no
actual CPU meaning, unless clear_bit() itself is guaranteed to be a
compiler barrier (which it isn't, although the "volatile" on the asm in
practice makes it something *close* to that).
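On x86, for example, clear_bit() is a locked instruction and thus already
orders the CPU, so the sane definition is still nothing but a compiler
barrier - roughly (a sketch of the i386-era definitions):

	/* clear_bit() is a locked RMW on x86, so the CPU side is already
	 * serialized; only the compiler needs to be told not to reorder
	 * memory accesses across these points. */
	#define smp_mb__before_clear_bit()	barrier()
	#define smp_mb__after_clear_bit()	barrier()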

Why? Think of the sequence like this:

	clear_bit(x,y);
	smp_mb__after_clear_bit();
	other_variable = 10;

the whole *point* of this sequence is that if another CPU does

	x = other_variable;
	smp_rmb();
	bit = test_bit(x,y);

then if it sees "x" being 10, then the bit *has* to be clear.

And this is why the compiler barrier in "smp_mb__after_clear_bit()" needs
to be a compiler barrier:

 - it doesn't matter for the action of the "clear_bit()" itself: that one
   is locked, and on x86 it thus also happens to be a serializing
   instruction, and the cache coherency and lock obviously means that the
   bit clearing *itself* is safe!

 - but it *does* matter for the compiler scheduling. If the compiler were
   to decide that "y" and "other_variable" are totally independent, it
   might otherwise decide to move the "other_variable = 10" assignment to
   *before* the clear_bit(), which would make the whole code pointless!

See? We have two totally independent issues:

 - the CPU itself can re-order the visibility of accesses. x86 doesn't do
   this very much, and doesn't do it at all across a locked instruction,
   but it's still a real issue, even if it tends to be much easier to see
   on other architectures.

 - the compiler doesn't care about rules of "locked instruction" at all,
   because it has no clue. It has *different* rules about how it can
   re-order instructions and accesses, and maybe the "asm volatile" will
   guarantee that the compiler won't re-order things around the
   clear_bit(), and maybe it won't. But making it a compiler barrier (by
   using the "memory clobber" thing, *guarantees* that gcc cannot reorder
   memory writes or reads.

See? Two different - and _totally_ independent - levels of ordering, and
we need to make sure that both are valid.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] x86: improved memory barrier implementation
Date: Sat, 29 Sep 2007 16:08:12 UTC
Message-ID: <fa.3a6WR8eL0500GOd1Q3VOaYOlr6w@ifi.uio.no>

On Sat, 29 Sep 2007, Nick Piggin wrote:
> >
> > The non-temporal stores should be basically considered to be "IO", not any
> > normal memory operation.
>
> Maybe you're thinking of uncached / WC? Non-temporal stores to cacheable
> RAM apparently can go out of order too, and they are being used in the kernel
> for some things.

I'm really saying that, to a first approximation, people should think "NT is
an IO (DMA) thing". Whether cached or not. Exactly because they do not
honor the normal memory ordering.

It may be worth noting that "clflush" falls under that heading too - even
if all the actual *writes* were done with totally normal writes, if
anybody does a clflush instruction, that breaks the ordering, and that
turns it to "DMA ordering" again - ie we're not talking about the normal
SMP ordering rules at all.

So all the spinlocks and all the smp_*mb() barriers have never really done
*anything* for those things (in particular, "smp_wmb()" has *always*
ignored them on i386!)
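A minimal userspace sketch (SSE2 intrinsics, not kernel code) of why an NT
store needs an explicit fence before you publish the data it wrote:

	#include <emmintrin.h>

	/* Illustrative only: the non-temporal store is weakly ordered, so
	 * without the sfence the "ready" flag could become visible to
	 * another CPU before the data it is supposed to publish. */
	static void publish(int *data, volatile int *ready)
	{
		_mm_stream_si32(data, 42);	/* NT store: bypasses normal store ordering */
		_mm_sfence();			/* order the NT store before the flag */
		*ready = 1;			/* normal store, now safely after the data */
	}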

> Likewise for rep stos, apparently.

No. As far as I can tell, the fast string operations are unordered
*within* themselves, but not wrt the operations around them.

In other words, you cannot depend on the ordering of stores *in* the
memcpy() or memset() when it is implemented by "rep movs/stos" - but that
is 100% equivalent to the fact that you cannot depend on the ordering even
when it isn't - since the "memcpy()" library routine might be copying
memory backwards for all you know!

The Intel memory ordering paper doesn't talk about the fast string
instructions (except to say that the rules it *does* speak about do not
hold), but the regular IA manuals do say (for example):

   "Code dependent upon sequential store ordering should not use the
    string operations for the entire data structure to be stored. Data and
    semaphores should be separated. Order dependent code should use a
    discrete semaphore uniquely stored to after any string operations to
    allow correctly ordered data to be seen by all processors."

and note how it says you should just store to the semaphore. If you think
about it, that semaphore will be involving all the memory ordering
requirements that we *already* depend on, so if a semaphore is sufficient
to order the fast string instruction, then by definition using a spinlock
around them must be the same thing!

In other words, by Intel's architecture manual, fast string instructions
cannot escape a "semaphore" - but that means that they cannot escape a
spinlock either (since the two are exactly the same wrt memory ordering
rules!). In other words, whenever the Intel docs say "semaphore", think
"mutual exclusion lock", not necessarily the kernel kind of "sleeping
semaphore".

But it might be good to have that explicitly mentioned in the IA memory
ordering thing, so I'll ask the Intel people about that. However, I'd say
that given our *current* documentation, string instructions may be
*internally* out-of-order, but they would not escape a lock.
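As a minimal sketch (names made up) of the pattern being described - the
bulk stores done with a string operation, and a separate flag published
under a lock:

	#include <linux/spinlock.h>
	#include <linux/string.h>

	static DEFINE_SPINLOCK(buf_lock);
	static char buffer[256];
	static int buffer_ready;

	static void fill_buffer(void)
	{
		spin_lock(&buf_lock);
		memset(buffer, 0, sizeof(buffer));	/* may be "rep stos": unordered within itself */
		buffer_ready = 1;			/* the "discrete semaphore" stored to afterwards */
		spin_unlock(&buf_lock);			/* the release orders everything above it */
	}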

> But this means they are already at odds with spin_unlock, unless they
> are enclosed with mfences everywhere they are used (of which I think
> most are not). So this is an existing bug in the kernel.

See above. I do not believe that it's an existing bug, but the basic point
that the change to "smp_rmb()" doesn't change our existing rules is true.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [rfc][patch 3/3] x86: optimise barriers
Date: Fri, 12 Oct 2007 15:15:18 UTC
Message-ID: <fa.bv8WB+p6aghB/pNYTlM/SIfxvgU@ifi.uio.no>

On Fri, 12 Oct 2007, Jarek Poplawski wrote:
>
> First it looks like a really great thing that it's revealed at last.
> But then... there is probably some confusion: did we have to use
> ineffective code for so long?

I think the chip manufacturers really wanted to keep their options open.

Having the option to re-order loads in architecturally visible ways was
something that they probably felt they really wanted to have. On the other
hand:

 - I bet they had noticed that things break, and some applications depend
   on fairly strong ordering (not necessarily in Linux-land, but..)

   I suspect hw manufacturers go through life hoping that "software
   improves". They probably thought that getting rid of the old 16-bit
Windows would mean that fewer people depended on undefined behaviour.

   And I suspect that they started noticing that no, with threads and
   JVM's and things, *more* people started depending on fairly strong
   memory ordering.

 - I suspect Intel in particular noticed that they can do a lot of very
   aggressive re-ordering at a microarchitectural level, but can still
   guarantee that *architecturally* they never show it (dynamic detection
   of reordered loads being replayed on cache dirty events etc).

IOW, I suspect that both Intel and AMD noticed that while they had wanted
to keep their options open, those options weren't really realistic, and
not something that the market wanted (aggressive use of threading wants
*stricter* memory ordering, not looser), and they could work well enough
with a fairly strict memory model.

> So, maybe linux needs something like this, instead of waiting few
> years with each new model for vendors goodwill? IMHO, even for less
> popular processors, this could be checked under some debugging option
> at the system start (after disabling suspicious barrier for a while
> plus some WARN_ONs).

Quite frankly, even *within* Intel and AMD, there are damn few people who
understand exactly what the memory ordering requirements and guarantees
are and historically were for the different CPU's.

I would bet that had you asked a random (but still competent) Intel/AMD
engineer that wasn't really intimately involved with the actual design of
the cache protocols and memory pipelines, they would absolutely not have
been able to tell you how the CPU actually worked.

So no, there's no way a software person could have afforded to say "it
seems to work on my setup even without the barrier". On a dual-socket
setup with a shared bus, that says absolutely *nothing* about the
behaviour of the exact same CPU when used with a multi-bus chipset. Not to
mention other revisions of the same CPU - much less a whole other
microarchitecture.

So yes, I've personally been aware for about a year that the memory
ordering was going to likely be documented, but no way was I going to
depend on it until Intel and AMD were ready to state so *publicly*.
Because before that happens, they may have noticed errata etc that made it
not work out.

Also, please note that we didn't even just change the barriers immediately
when the docs came out. I want to do it soon - still *early* in the 2.6.24
development cycle - exactly because bugs happen, and if somebody notices
something strange, we'll have more time to perhaps decide that "oops,
there's something bad going on, let's undo this for the real 2.6.24
release until we can figure out the exact pattern".

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: + kthread-add-a-missing-memory-barrier-to-kthread_stop.patch
Date: Sat, 23 Feb 2008 19:00:19 UTC
Message-ID: <fa.0LIW3BoiU2bmm6p6MbpK43MeP6w@ifi.uio.no>

On Sat, 23 Feb 2008, Oleg Nesterov wrote:
>
> Yes, but still I suspect wmb() is not enough. Note that try_to_wake_up()
> first checks (reads) the task->state,
>
> 	if (!(old_state & state))
> 		goto out;
>
> without the full mb() it is (in theory) possible that try_to_wake_up()
> first reads TASK_RUNNING and only then sets CONDITION. IOW, STORE and
> LOAD could be re-ordered.

No. The spinlock can have preceding stores (and loads, for that matter)
percolate *into* the locked region, but a spinlock can *not* have loads
(and stores) escape *out* of the region without being totally broken.

So the spinlock that precedes that load of "old_state" absolutely *must*
be a memory barrier as far as that load is concerned.

Spinlocks are just "one-way" permeable. They stop both reads and writes,
but they stop them from escaping up to earlier code (and the spin-unlock
in turn stops reads and writes within the critical section from escaping
down past the unlock).

But they definitely must stop that direction, and no loads or stores that
are inside the critical section can possibly be allowed to be done outside
of it, or the spinlock would be pointless.
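As a rough sketch of that one-way permeability (illustrative, not real
kernel code):

	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(lock);
	static int a, b, c, d;

	static void permeability_sketch(void)
	{
		a = 1;			/* preceding store: may sink *into* the region */
		spin_lock(&lock);	/* acquire: nothing below may float up past this */
		b = c;			/* accesses inside can never leak out of the region */
		spin_unlock(&lock);	/* release: nothing above may sink down past this */
		d = 2;			/* following store: may float *into* the region */
	}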

[ Yeah, yeah, I guess that in theory you could serialize spinlocks only
  against each other, and their serialization would be in a totally
  different "domain" from normal reads and writes, and then the spinlock
  wouldn't act as a barrier at all except for stuff that is literally
  inside another spinlock, but I don't think anybody really does that ]

So I think a simple smp_wmb() should be ok.

That said, I do not *mind* the notion of "smp_mb_before/after_spinlock()",
and just have architectures do whatever they want (with x86 just being a
no-op). I don't think it's wrong in any way, and may be the really
generic solution for some theoretical case where locking is done not by
using the normal cache coherency, but over a separate lock network. But I
would suggest against adding new abstractions unless there are real cases
of it mattering.

(The biggest reason to do it in practice is probably architectures that
have a really slow smp_wmb() and that also have a spinlock that is already
serializing enough, but if this is only for one single use - in
try_to_wake_up() - even that doesn't really seem to be a very strong
argument).

> A bit offtopic, but let's take another example, __queue_work()->insert_work().
> With some trivial modification we can move set_wq_data() outside of cwq->lock.
> But according to linux's memory model we still need wmb(), which is a bit
> annoying. Perhaps we can also add smp_wmb_before_spinlock(). Not sure this
> is not too ugly though.

Is that really so performance-critical? We still need the spinlock for the
actual list manipulation, so why would we care?

And the smp_wmb() really should be free. It's literally free on x86, but
doing a write barrier is generally a cheap operation on any sane
architecture (ie generally cheaper than a read barrier or a full barrier:
it should mostly be a matter of inserting a magic entry in the write
buffers).

		Linus



From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch] mm: fix lockless pagecache reordering bug (was Re: BUG:
Date: Mon, 05 Jan 2009 19:40:06 UTC
Message-ID: <fa.guvlZaw/dx3aAkpoMUGL9NOgwfA@ifi.uio.no>

On Mon, 5 Jan 2009, Linus Torvalds wrote:
>
> Either the value can change, or it can not. It's that simple.
>
> If it cannot change, then we can load it just once, or we can load it
> multiple times, and it won't matter. Barriers won't do anything but screw
> up the code.
>
> If it can change from under us, you need to use rcu_dereference(), or
> open-code it with an ACCESS_ONCE() or put in barriers. But your placement
> of a barrier was NONSENSICAL. Your barrier didn't protect anything else -
> like the test for the RADIX_TREE_INDIRECT_PTR bit.
>
> And that was the fundamental problem.

Btw, this is the real issue with anything that does "locking vs
optimistic" accesses.

If you use locking, then by definition (if you did things right), the
values you are working with do not change. As a result, it doesn't matter
if the compiler re-orders accesses, splits them up, or coalesces them.
It's why normal code should never need barriers, because it doesn't matter
whether some access gets optimized away or gets done multiple times.

But whenever you use an optimistic algorithm, and the data may change
under you, you need to use barriers or other things to limit the things
the CPU and/or compiler does.
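(For reference, the ACCESS_ONCE() mentioned in the quote above is
essentially just a volatile cast - roughly:

	#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

which forces the compiler to emit exactly one load or store for that
particular access, without marking the underlying variable itself
volatile.)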

And yes, "rcu_dereference()" is one such thing - it's not a barrier in the
sense that it doesn't necessarily affect ordering of accesses to other
variables around it (although the read_barrier_depends() obviously _is_ a
very special kind of ordering wrt the pointer itself on alpha). But it
does make sure that the compiler at least does not coalesce - or split -
that _one_ particular access.
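A typical optimistic read then looks something like this sketch (the
pointer and field names are made up):

	struct foo *p;

	rcu_read_lock();
	p = rcu_dereference(global_foo);	/* exactly one load, plus the alpha dependency barrier */
	if (p)
		do_something(p->field);		/* p stays usable here even if global_foo changes */
	rcu_read_unlock();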

It's true that it has "rcu" in its name, and it's also true that that may
be a bit misleading in that it's very much useful not just for rcu, but
for _any_ algorithm that depends on rcu-like behavior - ie optimistic
accesses to data that may change underneath it. RCU is just the most
commonly used (and perhaps best codified) variant of that kind of code.

			Linus
