cmpxchg, LL/SC, and portability (Al Viro; Linus Torvalds)

Index Home About Blog

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch
Date: Wed, 06 Dec 2006 19:06:58 UTC
Message-ID: <fa.8eRcprbGjS0mD2SZTS5EqdC4tzI@ifi.uio.no>

On Wed, 6 Dec 2006, Christoph Lameter wrote:
>
> I'd really appreciate a cmpxchg that is generically available for
> all arches. It will allow lockless implementation for various performance
> criticial portions of the kernel.

I suspect ARM may have been the last one without one, no?

That said, cmpxchg won't necessarily be "high-performance" unless the hw
supports it naturally in hardware, so..

Russell, are you ok with the code DavidH posted (the "try 2" one)? I'd
like to get an ack from the ARM maintainer before applying it, but it
looked ok.

		Linus

From: Al Viro <viro@ftp.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch 
	doesn't support it
Date: Wed, 06 Dec 2006 19:09:03 UTC
Message-ID: <fa.1xd/DwdsUSIC1zUuvYUHlERUVQk@ifi.uio.no>

On Wed, Dec 06, 2006 at 11:05:22AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 6 Dec 2006, Christoph Lameter wrote:
> >
> > I'd really appreciate a cmpxchg that is generically available for
> > all arches. It will allow lockless implementation for various performance
> > criticial portions of the kernel.
>
> I suspect ARM may have been the last one without one, no?

No.  sparc32 doesn't have one, for instance.

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch
Date: Wed, 06 Dec 2006 19:26:57 UTC
Message-ID: <fa.qlkXNmn5fkATK9TK369zBB+PI+8@ifi.uio.no>

On Wed, 6 Dec 2006, Al Viro wrote:
>
> No.  sparc32 doesn't have one, for instance.

Ok. For SMP-safety, it's important that any architecture that can't do
this needs to _share_ the same spinlock (on SMP only, of course) that it
uses for the bitops.

It would be good (but perhaps not as strict a requirement) if the atomic
counters also use the same lock. But that is probably impossible on
sparc32 (since it has a per-counter "lock"-like thing, iirc). So doing a
cmpxchg() on an atomic_t would be a bug.

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch
Date: Wed, 06 Dec 2006 19:36:28 UTC
Message-ID: <fa.vC5HYZlGrT4RNyMAdtDp/AHX0LA@ifi.uio.no>

On Wed, 6 Dec 2006, Matthew Wilcox wrote:
>
> Given parisc's paucity of atomic operations (load-and-zero-32bit and
> load-and-zero-64bit), cmpxchg() is impossible to implement safely.
> There has to be something we can hook to exclude another processor
> modifying the variable.  I'm OK with using atomic_cmpxchg(); we have
> atomic_set locked against it.

How do you to the atomic bitops?

Also, I don't see quite why you think "cmpxchg()" and "atomic_cmpxchg()"
would be different. ANY cmpxchg() needs to be atomic - if it's not,
there's no point to the operation at all, since you'd just write it as

	if (*p == x)
		*p = y;

instead, and not use cmpxchg().

So yes, architectures without native support (where "native" includes
load-locked + store-conditional) always need to

 - on UP, just disable interrupts
 - on SMP, use a spinlock (with interrupts disabled), and share that
   spinlock with all the other atomic ops (bitops at a minimum - the
   "atomic_t" operations have traditionally been in another "locking
   space" because of sparc32 historic braindamage, but I'd suggest
   sharing the same spinlock between them all).

And yeah, it sucks. You _can_ (if you really want to) make the spinlock be
hashed based on the address of the atomic data structure. That at least
allows you to do _multiple_ spinlocks, but let's face it, your real
problem is _likely_ going to be cacheline bouncing, not contention, and
then using a hashed lock won't be likely to buy you all that much.

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch
Date: Thu, 07 Dec 2006 15:43:36 UTC
Message-ID: <fa.B0cnif1XXuQ9lguRtbw1eCSsbm8@ifi.uio.no>

On Thu, 7 Dec 2006, Nick Piggin wrote:
>
> It might be reasonable to implement this watered down version, but:
> don't some architectures have restrictions on what instructions can
> be issued between the ll and the sc?

Yes. You really probably do not want to expose ll/sc on a C level because
of this.

On alpha, the architecture manual says (I didn't go back and check, but
I'm pretty sure) that a ld.l and st.c cannot have a taken branch in
between then, for example. That basically means that you can't allow the
compiler to reorder the basic blocks (which it often will with a
while-loop).

Now, I actually suspect that this was not a microarchitectural flaw, and
that a branch would _work_ there, and that the architecture manual was
just being anal, but strictly speaking, it means that these things had
better always be in assembly, and you can sadly not expose them (on alpha,
at least) as higher-level primitives as such - you can only expose the
end result ("cmpxchg" or similar).

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch
Date: Fri, 08 Dec 2006 19:17:46 UTC
Message-ID: <fa.bh3f64Ce15ybt2blPOkc7Rop28M@ifi.uio.no>

On Fri, 8 Dec 2006, Christoph Lameter wrote:
>
> As also shown in this thread: There are restrictions on what you can do
> between ll/sc

This, btw, is almost certainly true on ARM too.

There are three major reasons for restrictions on ll/sc:

 - bus-cycle induced things (eg variations of "you cannot do a store in
   between the ll and the sc, because it will touch the cache and clear
   the bit", where "the store" might be a load too, and "the cache" might
   be just "the bus interface")

 - trap handling usually clears the internal lock bit too, which means
   that depending on the micro-architecture, even internal microtraps
   (like even just branch misprediction, but more commonly things like TLB
   misses etc) can cause a sc to always fail.

 - timing. Livelock in particular.

The last one is the one that hits everybody, regardless of
microarchitecture. The rule may be that the LL/SC need to be within a
certain number of cycles (which can be very small - like ten) in order to
guarantee that the cacheline can't be stolen.

All of which means that _nobody_ can really do this reliably in C. Even if
there are no other microarchitectural rules (and it sounds like that might
be true on ARM), the timing issue means that you can _still_ only use it
for very specific and simple sequences, and trying to expose it as a
higher-level thing is not going to work in general for anything even
remotely complicated.

(The timing may also mean that you end up having to do random back-off
etc, just to make sure _somebody_ makes progress. Ie it might not be a
matter of "within ten cycles", but "you need to randomize the timing").

In other words, it's simply not an option to expose LL/SC as an interface.
It would be VERY convenient to do, since cmpxchg can emulate ll/sc (the
"ll" part is a normal load, the "sc" part is a "compare that the old value
still matches, and store the new one if so"). But because you can't expose
LL/SC anyway in any reasonably portable way, that just doesn't work.

So, you really do end up with three possibilities:

 - do things with TRULY PORTABLE interfaces. And like it or not, cmpxchg
   is the closest thing you can get to that. It's trivial to do cmpxchg
   using ll/sc (modulo the "random backoff part" if you need it, which is
   still pretty simple, but no longer totally trivial), and architectures
   that have neither ll/sc _nor_ a native cmpxchg can just go screw
   themselves with spinlocks - they really aren't worth worrying about in
   SMP. At some point you have to tell hardware designers that their
   hardware just sucks.

 - have ugly conditional code in generic code. I personally think this is
   a _much_ worse option in most cases.

 - have a much higher-level interface and make it _all_ architecture-
   dependent (possibly with a "generic" version for sane architectures).
   This works, but the more high-level it is, the more you end up having
   the same thing written in many different ways, and nasty maintenance.

   So we generally set the bar pretty low. Things like semaphore locking
   primitives are high-level enough already that we prefer to try to make
   them use common lower-level interfaces (spinlocks, cmpxchg etc).
   Something like kernel/workqueue.c is _way_ too high a level to do
   arch-specific.

So right now, I think the "cmpxchg" or the "bitmask set" approach are the
alternatives. Russell - LL/SC simply isn't on the table as an interface,
whether you like it or not.

		Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch
Date: Fri, 08 Dec 2006 19:38:06 UTC
Message-ID: <fa.aiRPmBC1/vz9G7ZaR3RP2t6m8RY@ifi.uio.no>

On Fri, 8 Dec 2006, Russell King wrote:
>
> Yes you can.  Well, you can on ARM at least.  Between the load exclusive
> you can do anything you like until you hit the store exclusive.  If you
> haven't touched the location (with anything other than another load
> exclusive) and no other CPU has issued a load exclusive, your store
> exclusive succeeds.

Is that actually true?

Almost all LL/SC implementations have granularity rules, where "touch the
location" is not a byte-granular thing, but actually ends up being
something like "touch the same cachline".

They also often have _other_ rules like: "the cacheline has to stay in the
L1 in exclusive state" etc. Which means that in a direct-mapped L1 cache,
you can't even load anything that might be in the same way, because it
would cause a cache eviction that invalidates the SC.

It's possible that ARM has really strong LL/SC, but quite frankly, that
sounds unlikely. I've never heard of anybody ever _architecturally_ saying
that they support that strong requirements, even if certain micro-
architectures might actually support stronger semantics than the ones
guaranteed by the architectural rules.

			Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch
Date: Fri, 08 Dec 2006 19:39:17 UTC
Message-ID: <fa.na39PGkeB7aGTVnkKJGiF0MQ3L8@ifi.uio.no>

On Fri, 8 Dec 2006, Russell King wrote:
>
> I utterly disagree.  I could code atomic_add() as:

Sure. And Alpha could do that too. If you write the C code a specific way,
you can make it work. That does NOT mean that you can expose it widely as
a portable interface - it's still just a very _nonportable_ interface that
you use internally within one architecture to implement other interfaces.

			Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch
Date: Fri, 08 Dec 2006 20:03:36 UTC
Message-ID: <fa.ve49K7Cmhjknp/thRGK9ZTn1phM@ifi.uio.no>

On Fri, 8 Dec 2006, Russell King wrote:
>
> No such restriction on ARM.
>
> Also not true.  The architectural implementation is:

I checked the ARM manuals, and quite frankly, they don't back you up.

They do not claim that the physical address tag is byte-granular, and in
fact they make it pretty clear that the same tag is used for all the
sizes, which implies that the tag granularity is NOT byte granular, but
likely something else.

Both the manuals I checked also say: "Other events might cause the tag to
be cleared", without going into particular details other than saying that
a region that is marked non-shared might still clear the tag on access by
other CPU's - but they leave it open whether that's by design or not.

In other words, if there actually is an architectural guarantee that
ldrex/strex are really as strong as you imply, it's not in the standard
architecture manuals from ARM at least for the ARM11562 or the ARM1136.

So I suspect you're wrong, and that the ldrex/strex tags actually are not
all that different from other architectures which tend to have cacheline
granularities or more (I _think_ the original alpha granularity was the
whole address space, and any cache traffic would clear it. That's _really_
pathetically weak, but hey, I might remember wrong, and it was the very
first implementation. I doubt ARM is _that_ weak, but I doubt it's as
strong as you claim).

			Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an arch
Date: Fri, 08 Dec 2006 20:35:48 UTC
Message-ID: <fa.Lski7sADK7tC1ShAYOcMpfmcIQc@ifi.uio.no>

On Fri, 8 Dec 2006, Russell King wrote:

> The only instructions which affect the exclusive access state are the
> exclusive access instructions themselves.

Not according to the docs I found.

The ARM1136 manual explicitly states that any attempt to modify that
address clears the tag (for shared memory regions, by _any_ CPU, and for
nonshared regions by _that_ CPU).

And btw, that _has_ to be true, because otherwise the whole ldrex/strex
sequence would be totally unusable as a way to do atomic bit operations on
UP in the presense of interrupts (well, you could have a clrex instruction
in the interrupt handler, but as far as I know you don't, so that seems to
be a moot point - you only seem to do it in __switch_to).

In other words, I _really_ think you're wrong.

So the very code sequence you quote MUST NOT WORK the way you claim it
does. And not only that, since the granularity of the mark is not just for
the bytes in question, but potentially apparently up to 128 bytes, any
store even _close_ to the memory you had exclusive access to will break the
exclusive access.

Really, Russell. Your stance makes no sense. It doesn't make any sense
from a microarchitectural standpoint (it's not how you'd normally
implement these things), but it ALSO makes no sense from the way you
already use those instructions (as a way to protect against other
processors - including your own - touching that word).

                    Linus

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] WorkStruct: Implement generic UP cmpxchg() where an
Date: Mon, 11 Dec 2006 04:50:56 UTC
Message-ID: <fa.5HtvZW/9We8LPZpBQK8xWJuarxw@ifi.uio.no>

On Sun, 10 Dec 2006, linux@horizon.com wrote:
>
> While I agree that LL/SC can't be part of the kernel API for people to
> get arbitrarily clever with in the device driver du jour, they are *very*
> nice abstractions for shrinking the arch-specific code size.

I'm not sure.

The thing is, it's _really_ hard to tell the compiler to not reload values
from memory in between two inline asm statements.

So what you easily end up with is
 (a) yes, you can actually get the compiler to generate the "obvious" code
     sequence 99% of the time, and it will all work fine.
 (b) but it's really hard to actually guarantee it, and some subtle things
     can really mess you up.

An example of (b) is how we actually put some of these atomic data
structures on the stack ("struct completion" comes to mind), and it can
get really interesting if something works in all the tests, but then
subtly breaks on some microarchitectures when the data structures happen
to be on the stack just because the compiler happened to do a register
reload to the stack at the wrong point.

Now, if you don't inline any of these things, you can control things a lot
better, since then you end up having a much smaller set of circumstances,
and you never have code "around" the actual operation that changes things
like register reload. And yes, I do think that it might be possible to
have some kind of generic "ll/sc template" setup for that case. You can
often make gcc generate the code you want, especially if there is no real
register pressure and you can keep the code simple.

> The semantics are widely enough shared that it's quite possible in
> practice to write a good set of atomic primitives in terms of LL/SC
> and then let most architectures define LL/SC and simply #include the
> generic atomic op implementations.

Well, you do have to also realize that the architectures that dont' do
ll/sc do end up limiting the number of useful primitives, especially
considering that we know that some architectures simply cannot do a lot
between them (which _also_ limits it).

I think we've ended up implementing most of the common ones. We have a
fairly big set of ops like "atomic_add_return()"-like operations, and
those are the obvious ones that can be done for _any_ ll/sc architecture
too. So I don't think there's all that much more to be had there.

		Linus

Index Home About Blog