From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: Memory barriers and spin_unlock safety
Date: Sat, 04 Mar 2006 17:29:44 UTC
Message-ID: <fa.BNjIFuQxw4BEpOMCm3Fp65Xvidw@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603040914160.22647@g5.osdl.org>

On Sat, 4 Mar 2006, Paul Mackerras wrote:
>
> > If so, a simple write barrier should be sufficient. That's exactly what
> > the x86 write barriers do too, ie stores to magic IO space are _not_
> > ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a
> > spin_unlock()) at all.
>
> By magic IO space, do you mean just any old memory-mapped device
> register in a PCI device, or do you mean something else?

Any old memory-mapped device that has been marked as write-combining in
the MTRR's or page tables.

So the rules from the PC side (and like it or not, they end up being
what all the drivers are tested with) are:

 - regular stores are ordered by write barriers
 - PIO stores are always synchronous
 - MMIO stores are ordered by IO semantics
	- PCI ordering must be honored:
	  * write combining is only allowed on PCI memory resources
	    that are marked prefetchable. If your host bridge does write
	    combining in general, it's not a "host bridge", it's a "host
	    disaster".
	  * for others, writes can always be posted, but they cannot
	    be re-ordered wrt either reads or writes to that device
	    (ie a read will always be fully synchronizing)
	- io_wmb must be honored

In addition, it will help a hell of a lot if you follow the PC notion of
"per-region extra rules", ie you'd default to the non-prefetchable
behaviour even for areas that are prefetchable from a PCI standpoint, but
allow some way to relax the ordering rules in various ways.

PC's use MTRR's or page table hints for this, but it's actually perfectly
possible to do it by virtual address (ie decide at "ioremap()" time, by
looking at some bits that you've saved away, to remap it to a certain
virtual address range, and then use the virtual address as a hint in
readl/writel for whether you need to serialize or not).

On x86, we already use the "virtual address" trick to distinguish between
PIO and MMIO for the newer ioread/iowrite interface (the older
inb/outb/readb/writeb interfaces obviously don't need that, since the IO
space is statically encoded in the function call itself).
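
(As a concrete sketch of that virtual-address trick - loosely modeled on
the generic iomap code, with illustrative names and the details elided -
the PIO "addresses" are just small cookies, and the accessor picks PIO
vs MMIO purely by range:)

	#include <asm/io.h>			/* inl(), readl() */

	/* sketch only: ioport_map() hands back a small fake cookie for
	 * PIO, and the accessor tells PIO from MMIO by looking at the
	 * virtual address alone */
	#define PIO_OFFSET	0x10000UL
	#define PIO_RESERVED	0x40000UL

	static void __iomem *my_ioport_map(unsigned long port)
	{
		return (void __iomem *)(port + PIO_OFFSET);
	}

	static unsigned int my_ioread32(void __iomem *addr)
	{
		unsigned long p = (unsigned long)addr;

		if (p < PIO_RESERVED)
			return inl(p - PIO_OFFSET);	/* PIO space */
		return readl(addr);			/* real MMIO mapping */
	}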

The reason I mention the MTRR emulation is again just purely compatibility
with drivers that get 99.9% of all the testing on a PC platform.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: Memory barriers and spin_unlock safety
Date: Wed, 08 Mar 2006 03:56:11 UTC
Message-ID: <fa.fvWC5NqPWNdlq/uvLaaHJz9U+pg@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603071930530.32577@g5.osdl.org>

On Wed, 8 Mar 2006, Paul Mackerras wrote:
>
> Linus Torvalds writes:
>
> > So the rules from the PC side (and like it or not, they end up being
> > what all the drivers are tested with) are:
> >
> >  - regular stores are ordered by write barriers
>
> I thought regular stores were always ordered anyway?

For the hw, yes. For the compiler no.

So you actually do end up needing write barriers even on x86. It won't
compile to any actual _instruction_, but it will be a compiler barrier (ie
it just ends up being an empty inline asm that "modifies" memory).

So forgetting the wmb() is a bug even on x86, unless you happen to program
in assembly.

Of course, the x86 hw semantics _do_ mean that forgetting it is less
likely to cause problems, just because the compiler re-ordering is fairly
unlikely most of the time.
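
(For concreteness, the "empty inline asm" being described is essentially
the kernel's compiler barrier; a minimal sketch, with the two stores in
the usage example just made up for illustration:)

	/* no instruction is emitted, but the "memory" clobber keeps the
	 * compiler from moving memory accesses across it */
	#define barrier()	__asm__ __volatile__("" : : : "memory")

	buf->data  = value;	/* fill in the data...                   */
	barrier();		/* ...and keep the compiler from moving  */
	buf->ready = 1;		/* ...it past the "ready" flag           */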

> >  - PIO stores are always synchronous
>
> By synchronous, do you mean ordered with respect to all other accesses
> (regular memory, MMIO, prefetchable MMIO, PIO)?

Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus. I
don't think PCI bridges are supposed to post PIO writes, but an x86 CPU
basically won't stall for them forever. I _think_ they'll wait for it to
hit that external bus, though.

So it's totally serializing in the sense that all preceding reads have
completed and all preceding writes have hit the cache-coherency point, but
you don't necessarily know when the write itself will hit the device (the
write will return before that necessarily happens).

> In other words, if I store a value in regular memory, then do an
> outb() to a device, and the device does a DMA read to the location I
> just stored to, is the device guaranteed to see the value I just
> stored (assuming no other later store to the location)?

Yes, assuming that the DMA read is in response to (ie causally related to)
the write.
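
(A sketch of that pattern - the device structure, descriptor fields and
command value here are all made up for illustration:)

	/* set up a descriptor in regular memory, then kick the device
	 * with a PIO write; the outb() is synchronous enough that the
	 * DMA read the device does in response sees the descriptor */
	static void kick_device(struct my_dev *dev, u32 dma_addr, u32 len)
	{
		dev->desc->addr = dma_addr;	/* regular memory stores */
		dev->desc->len  = len;
		wmb();				/* compiler barrier on x86 */
		outb(MY_CMD_GO, dev->io_port);	/* PIO kick */
	}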

> >  - MMIO stores are ordered by IO semantics
> > 	- PCI ordering must be honored:
> > 	  * write combining is only allowed on PCI memory resources
> > 	    that are marked prefetchable. If your host bridge does write
> > 	    combining in general, it's not a "host bridge", it's a "host
> > 	    disaster".
>
> Presumably the host bridge doesn't know what sort of PCI resource is
> mapped at a given address, so that information (whether the resource
> is prefetchable) must come from the CPU, which would get it from the
> TLB entry or an MTRR entry - is that right?

Correct. Although it could of course be a map in the host bridge itself,
not on the CPU.

If the host bridge doesn't know, then the host bridge had better not
combine or the CPU had better tell it not to combine, using something like
a "sync" instruction that causes bus traffic. Either of those approaches
is likely a performance disaster, so you do want to have the CPU and/or
host bridge do this all automatically for you.

Which is what the PC world does.

> Or is there some gentleman's agreement between the host bridge and the
> BIOS that certain address ranges are only used for certain types of
> PCI memory resources?

Not that I know. I _think_ all of the PC world just depends on the CPU
doing the write combining, and the CPU knows thanks to MTRR's and page
tables. But I could well imagine that there is some situation where the
logic is further out.

> What ordering is there between stores to regular memory and stores to
> non-prefetchable MMIO?

Non-prefetchable MMIO will be in-order on x86 wrt regular memory (unless
you use one of the non-temporal stores).

To get out-of-order stores you have to use a special MTRR setting (mtrr
type "WC" for "write combining").

Or possibly non-temporal writes to an uncached area. I don't think we do.
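
(The non-temporal stores in question are the SSE movnti/movntq family; a
minimal userspace-flavoured sketch with the compiler intrinsics, just to
show why they are the exception:)

	#include <emmintrin.h>		/* _mm_stream_si32(), _mm_sfence() */

	static void nt_store(int *dst, int val)
	{
		_mm_stream_si32(dst, val);	/* movnti: weakly-ordered store */
		_mm_sfence();			/* fence it before later stores */
	}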

> If a store to regular memory can be performed before a store to MMIO,
> does a wmb() suffice to enforce an ordering, or do you have to use
> mmiowb()?

On x86, MMIO normally doesn't need memory barriers either for the normal
case (see above). We don't even need the compiler barrier, because we use
a "volatile" pointer for that, telling the compiler to keep its hands off.

> Do PCs ever use write-through caching on prefetchable MMIO resources?

Basically only for frame buffers, with MTRR rules (and while write-through
is an option, normally you'd use "write-combining", which doesn't cache at
all, but write combines in the write buffers and writes the combined
results out to the bus - there's usually something like four or eight
write buffers of up to a cacheline in size for combining).

Yeah, I realize this can be awkward. PC's actually get good performance
(ie they normally can easily fill the bus bandwidth) _and_ the sw doesn't
even need to do anything. That's what you get from several decades of hw
tweaking with a fixed - or almost-fixed - software base.

I _like_ PC's. Almost every other architecture decided to be lazy in hw,
and put the onus on the software to tell it what was right. The PC
platform hardware competition didn't allow for the "let's recompile the
software" approach, so the hardware does it all for you. Very well too.

It does make it somewhat hard for other platforms.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: Memory barriers and spin_unlock safety
Date: Wed, 08 Mar 2006 15:31:30 UTC
Message-ID: <fa.o0oXm0OcZL0GX90/gpio5bOme7Y@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603080724180.32577@g5.osdl.org>

On Wed, 8 Mar 2006, Alan Cox wrote:
>
> On Maw, 2006-03-07 at 19:54 -0800, Linus Torvalds wrote:
> > Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus. I
> > don't think PCI bridges are supposed to post PIO writes, but an x86 CPU
> > basically won't stall for them forever.
>
> The bridges I have will stall forever. You can observe this directly if
> an IDE device decides to hang the IORDY line on the IDE cable or you
> crash the GPU on an S3 card.

Ok. The only thing I have tested is the timing of "outb()" on its own,
which is definitely long enough that it clearly waits for _some_ bus
activity (ie the CPU doesn't just post the write internally), but I don't
know exactly what the rules are as far as the core itself is concerned: I
suspect the core just waits until it has hit the northbridge or something.

In contrast, a MMIO write to a WC region at least will not necessarily
pause the core at all: it just hits the write queue in the core, and the
core continues on (and may generate other writes that will be combined in
the write buffers before the first one even hits the bus).

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Document Linux's memory barriers [try #2]
Date: Wed, 08 Mar 2006 19:27:51 UTC
Message-ID: <fa.DPj5kmcAhypdBulf9gmQBKUQMxU@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603081115300.32577@g5.osdl.org>

On Wed, 8 Mar 2006, David Howells wrote:

> Alan Cox <alan@redhat.com> wrote:
>
> > 	spin_lock(&foo->lock);
> > 	writel(0, &foo->regnum);
>
> I presume there only needs to be an mmiowb() here if you've got the
> appropriate CPU's I/O memory window set up to be weakly ordered.

Actually, since the different NUMA things may have different paths to the
PCI thing, I don't think even the mmiowb() will really help. It has
nothing to serialize _with_.

It only orders mmio from within _one_ CPU and "path" to the destination.
The IO might be posted somewhere on a PCI bridge, and depending on the
posting rules, the mmiowb() just isn't relevant for IO coming through
another path.

Of course, to get into that deep doo-doo, your IO fabric must be separate
from the memory fabric, and the hardware must be pretty special, I think.

So for example, if you are using an Opteron with its NUMA memory setup
between CPU's over HT links, from an _IO_ standpoint it's not really
anything strange, since it uses the same fabric for memory coherency and
IO coherency, and from an IO ordering standpoint it's just normal SMP.

But if you have a separate IO fabric and basically two different CPU's can
get to one device through two different paths, no amount of write barriers
of any kind will ever help you.

So in the really general case, it's still basically true that the _only_
thing that serializes a MMIO write to a device is a _read_ from that
device, since then the _device_ ends up being the serialization point.

So in the extreme case, you literally have to do a read from the device
before you release the spinlock, if ordering to the device from two
different CPU's matters to you. The IO paths simply may not be
serializable with the normal memory paths, so spinlocks have absolutely
_zero_ ordering capability, and a write barrier on either the normal
memory side or the IO side doesn't affect anything.
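
(In code, the pattern that does work even in that extreme case looks
something like the following - the "status" register is made up for
illustration, and the point is only that the flushing read targets the
device itself:)

	spin_lock(&foo->lock);
	writel(0, &foo->regnum);	/* may be posted anywhere along the path */
	readl(&foo->status);		/* the device is the serialization point */
	spin_unlock(&foo->lock);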

Now, I'm by no means claiming that we necessarily get this right in
general, or even very commonly. The undeniable fact is that "big NUMA"
machines need to validate the drivers they use separately. The fact that
it works on a normal PC - and that it's been tested to death there - does
not guarantee much of anything.

The good news, of course, is that you don't use that kind of "big NUMA"
system the same way you'd use a regular desktop SMP. You don't plug in
random devices into it and just expect them to work. I'd hope ;)

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Document Linux's memory barriers [try #2]
Date: Thu, 09 Mar 2006 01:27:52 UTC
Message-ID: <fa.jC+DvBNUVTmeANFcOue1gYuCqW0@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603081716400.32577@g5.osdl.org>

On Thu, 9 Mar 2006, Paul Mackerras wrote:
>
> ... and x86 mmiowb is a no-op.  It's not x86 that I think is buggy.

x86 mmiowb would have to be a real op too if there were any multi-pathed
PCI buses out there for x86, methinks.

Basically, the issue boils down to one thing: no "normal" barrier will
_ever_ show up on the bus on x86 (or ia64, afaik). That, together with any
situation where there are multiple paths to one physical device means that
mmiowb() _has_ to be a special op, and no spinlocks etc will _ever_ do the
serialization you look for.

Put another way: the only way to avoid mmiowb() being special is either
one of:
 (a) have the bus fabric itself be synchronizing
 (b) pay a huge expense on the much more critical _regular_ barriers

Now, I claim that (b) is just broken. I'd rather take the hit when I need
to, than every time.

Now, (a) is trivial for small cases, but scales badly unless you do some
fancy footwork. I suspect you could do some scalable multi-pathable
version using approaches to resolving device conflicts similar to what the
cache coherency protocol does (or by having a token-passing thing), but it seems
seems SGI's solution was fairly well thought out.

That said, when I heard of the NUMA IO issues on the SGI platform, I was
initially pretty horrified. It seems to have worked out ok, and as long as
we're talking about machines where you can concentrate on validating just
a few drivers, it seems to be a good tradeoff.

Would I want the hard-to-think-about IO ordering on a regular desktop
platform? No.

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Document Linux's memory barriers [try #2]
Date: Thu, 09 Mar 2006 05:39:18 UTC
Message-ID: <fa.RZoPelGP4glpDSrFfLDXaR//zYc@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603082127110.32577@g5.osdl.org>

On Thu, 9 Mar 2006, Paul Mackerras wrote:
>
> A spin_lock does show up on the bus, doesn't it?

Nope.

If the lock entity is in an exclusive cache-line, a spinlock does not show
up on the bus at _all_. It's all purely in the core. In fact, I think AMD
does a spinlock in ~15 CPU cycles (that's the serialization overhead in
the core). I think a P-M core is ~25, while the NetBurst (P4) core is much
more because they have horrible serialization issues (I think it's on the
order of 100 cycles there).

Anyway, try doing a spinlock in 15 CPU cycles and going out on the bus for
it..

(Couple that with spin_unlock basically being free).

Now, if the spinlocks end up _bouncing_ between CPU's, they'll obviously
be a lot more expensive.
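
(To make the "purely in the core" part concrete, a toy lock - emphatically
not the kernel's implementation - shows the shape of it: in the
uncontended case both operations hit a cache line this CPU already owns
exclusively, so nothing goes out on the bus at all.)

	typedef struct { volatile int locked; } toy_spinlock_t;

	static inline void toy_spin_lock(toy_spinlock_t *l)
	{
		while (__sync_lock_test_and_set(&l->locked, 1))
			while (l->locked)
				;			/* spin on the cached value */
	}

	static inline void toy_spin_unlock(toy_spinlock_t *l)
	{
		__sync_lock_release(&l->locked);	/* essentially a plain store of 0 */
	}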

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Document Linux's memory barriers
Date: Thu, 09 Mar 2006 16:33:33 UTC
Message-ID: <fa.Rh8VPjoVudhjVHBdWXAKcO6kk7E@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603090814530.18022@g5.osdl.org>

On Thu, 9 Mar 2006, David Howells wrote:
>
> So, you're saying that the LOCK and UNLOCK primitives don't actually modify
> memory, but rather simply pin the cacheline into the CPU's cache and refuse to
> let anyone else touch it?
>
> No... it can't work like that. It *must* make a memory modification - after
> all, the CPU doesn't know that what it's doing is a spin_unlock(), say, rather
> than an atomic_set().

Basically, as long as nobody else is reading the lock, the lock will stay
in the caches.

Only old and stupid architectures go out to the bus for locking. For
example, I remember the original alpha "load-locked"/"store-conditional",
and it was totally _horrible_ for anything that wanted performance,
because it would do the "pending lock" bit on the bus, so it took hundreds
of cycles even on UP. Gods, how I hated that. It made it almost totally
useless for anything that just wanted to be irq-safe - it was cheaper to
just disable interrupts, iirc. STUPID.

All modern CPU's do atomic operations entirely within the cache coherency
logic. I think x86 still supports the notion of a "locked cycle" on the
bus, but I think that's entirely relegated to horrible people doing locked
operations across PCI, and quite frankly, I suspect that it doesn't
actually mean a thing (ie I'd expect no external hardware to actually
react to the lock signal). However, nobody really cares, since nobody
would be crazy enough to do locked cycles over PCI even if they were to
work.

So in practice, as far as I know, the way _all_ modern CPU's do locked
cycles is that they do it by getting exclusive ownership on the cacheline
on the read, and either having logic in place to refuse to release the
cacheline until the write is complete (ie "locked cycles to the cache"),
or to re-try the instruction if the cacheline has been released by the
time the write is ready (ie "load-locked" + "store-conditional" +
"potentially loop" to the cache).

NOBODY goes out to the bus for locking any more. That would be insane and
stupid.

Yes, many spinlocks see contention, and end up going out to the bus. But
similarly, many spinlocks do _not_ see any contention at all (or other
CPU's even looking at them), and may end up staying exclusive in a CPU
cache for a long time.

The "no contention" case is actually pretty important. Many real loads on
SMP end up being largely single-threaded, and together with some basic CPU
affinity, you really _really_ want to make that single-threaded case go as
fast as possible. And a pretty big part of that is locking: the difference
between a lock that goes to the bus and one that does not is _huge_.

And lots of trivial code is almost dominated by locking costs. In some
system calls on an SMP kernel, the locking cost can be (depending on how
good or bad the CPU is at them) quite noticeable. Just a simple small
read() will take several locks and/or do atomic ops, even if it was cached
and it looks "trivial".

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Document Linux's memory barriers
Date: Thu, 09 Mar 2006 17:55:08 UTC
Message-ID: <fa.6OV1SBtoOW6fTyb8Qr0w5aym9iE@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603090947290.18022@g5.osdl.org>

On Thu, 9 Mar 2006, David Howells wrote:
>
> I think for the purposes of talking about memory barriers, we consider the
> cache to be part of the memory since the cache coherency mechanisms will give
> the same effect.

Yes and no.

The yes comes from the normal "smp_xxx()" barriers. As far as they are
concerned, the cache coherency means that caches are invisible.

The "no" comes from the IO side. Basically, since IO bypasses caches and
sometimes write buffers, it's simply not ordered wrt normal accesses.

And that's where "bus cycles" actually matter wrt barriers. If you have a
barrier that creates a bus cycle, it suddenly can be ordered wrt IO.

So the fact that x86 SMP ops basically never guarantee any bus cycles
basically means that they are fundamentally no-ops when it comes to IO
serialization. That was really my only point.

> > I think x86 still supports the notion of a "locked cycle" on the
> > bus,
>
> I wonder if that's what XCHG and XADD do... There's no particular reason they
> should be that much slower than LOCK INCL/DECL. Of course, I've only measured
> this on my Dual-PPro test box, so other i386 arch CPUs may exhibit other
> behaviour.

I think it's an internal core implementation detail. I don't think they do
anything on the bus, but I suspect that they could easily generate less
optimized uops, simply because they didn't matter as much and didn't fit
the "normal" core uop sequence.

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Document Linux's memory barriers
Date: Thu, 09 Mar 2006 17:57:07 UTC
Message-ID: <fa.NcWkt8dIwWrRUzmdfrGXt1Darwc@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603090954420.18022@g5.osdl.org>

On Thu, 9 Mar 2006, Linus Torvalds wrote:
>
> So the fact that x86 SMP ops basically never guarantee any bus cycles
> basically means that they are fundamentally no-ops when it comes to IO
> serialization. That was really my only point.

Side note: of course, locked cycles _do_ "serialize" the core. So they'll
stop at least the core write merging, and speculative reads. So they do
have some impact on IO, but they have no way of impacting things like
write posting etc that is outside the CPU.

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Document Linux's memory barriers [try #4]
Date: Wed, 15 Mar 2006 00:21:36 UTC
Message-ID: <fa.CJhEzqByhJAmrTxslasBkjvq0dQ@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603141609520.3618@g5.osdl.org>

On Tue, 14 Mar 2006, David Howells wrote:
>
> But that doesn't make any sense!
>
> That would mean that we'd have read b into d before having read the new value
> of p into q, and thus before having calculated the address from which to read d
> (ie: &b) - so how could we know we were supposed to read d from b and not from
> a without first having read p?
>
> Unless, of course, the smp_wmb() isn't effective, and the write to b happens
> after the write to p; or the Alpha's cache isn't fully coherent.

The cache is fully coherent, but the coherency isn't _ordered_.

Remember: the smp_wmb() only orders on the _writer_ side. Not on the
reader side. The writer may send out the stuff in a particular order, but
the reader might see them in a different order because _it_ might queue
the bus events internally for its caches (in particular, it could end up
delaying updating a particular way in the cache because it's busy).

[ The issue of read_barrier_depends() can also come up if you do data
  speculation. Currently I don't think anybody does speculation for
  anything but control speculation, but it's at least possible that a read
  that "depends" on a previous read actually could take place before the
  read it depends on if the previous read had its result speculated.

  For example, you already have to handle the case of

	if (read a)
		read b;

  where we can read b _before_ we read a, because the CPU speculated the
  branch as being not taken, and then re-ordered the reads, even though
  they are "dependent" on each other. That's not that different from doing

	ptr = read a
	data = read [ptr]

  and speculating the result of the first read. Such a CPU would also need
  a non-empty read-barrier-depends ]

So memory ordering is interesting. Some "clearly impossible" orderings
actually suddenly become possible just because the CPU can do things
speculatively and thus things aren't necessarily causally ordered any
more.
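
(Spelling out both sides of the example being discussed, as a sketch with
the variables declared here purely for illustration:)

	int a, b;
	int *p = &a;

	void writer(void)
	{
		b = 42;
		smp_wmb();			/* order the store to b before the store to p */
		p = &b;
	}

	int reader(void)
	{
		int *q, d;

		q = p;
		smp_read_barrier_depends();	/* needed on Alpha: the cache may otherwise
						   show the new p but still the old b */
		d = *q;
		return d;
	}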

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] Document Linux's memory barriers [try #4]
Date: Wed, 15 Mar 2006 01:48:35 UTC
Message-ID: <fa./sJf47uO2SM01GjqbUgPqKikmOo@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603141734100.3618@g5.osdl.org>

On Wed, 15 Mar 2006, David Howells wrote:

> Linus Torvalds <torvalds@osdl.org> wrote:
>
> > That's not that different from doing
> >
> > 	ptr = read a
> > 	data = read [ptr]
> >
> >   and speculating the result of the first read.
>
> But that would lead to the situation I suggested (q == &b and d == a), not the
> one Paul suggested (q == &b and d == old b) because we'd speculate on the old
> value of the pointer, and so see it before it's updated, and thus still
> pointing to a.

No. If it _speculates_ the old value, and the value has actually changed
when it checks the speculation, it would generally result in a uarch trap,
and re-do of the instruction without speculation.

So for data speculation to make a difference in this case, it would
speculate the _new_ value (hey, doesn't matter _why_ - it could be that a
previous load at a previous time had gotten that value), and then load the
old value off the new pointer, and when the speculation ends up being
checked, it all pans out (the speculated value matched the value when "a"
was actually later read), and you get a "non-causal" result.

Now, nobody actually does this kind of data speculation as far as I know,
and there are perfectly valid arguments for why outside of control
speculation nobody likely will (at least partly due to the fact that it
would screw up existing expectations for memory ordering). It's also
pretty damn complicated to do. But data speculation has certainly been a
research subject, and there are papers on it.

> > Remember: the smp_wmb() only orders on the _writer_ side. Not on the
> > reader side. The writer may send out the stuff in a particular order, but
> > the reader might see them in a different order because _it_ might queue
> > the bus events internally for its caches (in particular, it could end up
> > delaying updating a particular way in the cache because it's busy).
>
> Ummm... So whilst smp_wmb() commits writes to the mercy of the cache coherency
> system in a particular order, the updates can be passed over from one cache to
> another and committed to the reader's cache in any order, and can even be
> delayed:

Right. You should _always_ have as a rule of thinking that a "smp_wmb()"
on one side absolutely _has_ to be paired with a "smp_rmb()" on the other
side. If they aren't paired, something is _wrong_.

Now, the data-dependent reads is actually a very specific optimization
where we say that on certain architectures you don't need it, so we relax
the rule to be "the reader has to have a smp_rmb() _or_ a
smp_read_barrier_depends(), where the latter is only valid if the address
of the dependent read depends directly on the first one".

But the read barrier always has to be there, even though it can be of the
"weaker" type.

And note that the address really has to have a _data_ dependency, not a
control dependency. If the address is dependent on the first read, but the
dependency is through a conditional rather than actually reading the
address itself, then it's a control dependency, and existing CPU's already
short-circuit those through branch prediction.
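
(A sketch of that distinction, reusing the illustrative variables from
the earlier example:)

	extern int b, *p;		/* as in the earlier sketch */

	int data_dependency(void)
	{
		int *q, d;

		q = p;				/* the address of the second read...      */
		smp_read_barrier_depends();	/* ...comes from the value of the first,  */
		d = *q;				/* so the weaker barrier is enough        */
		return d;
	}

	int control_dependency(void)
	{
		int d = 0;

		if (p == &b) {			/* only a conditional links the two reads, */
			smp_rmb();		/* and the branch can be predicted, so a   */
			d = b;			/* full smp_rmb() is what orders them      */
		}
		return d;
	}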

			Linus

