From: "(John Mashey)" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM & 
	VSE)
Date: 14 Dec 2004 15:48:17 -0800
Message-ID: <1103068097.926843.101230@c13g2000cwb.googlegroups.com>

LL/SC: Livermore S-1.
Maybe there was something earlier that had it, but this is where I
thought it came from.

http://www.cs.clemson.edu/~mark/s1_alumni.html lists some of the
alumni, of which a bunch worked at MIPS at some point.  I especially
recall Earl Killian being involved with this.

MIPS R2000 (MIPS-I) didn't have any synchronization ops on purpose,
because every mechanism we knew caused at least some customer to tell
us why it wasn't good enough :-)  LL/SC was added in MIPS-II to get
minimal operations able to synthesize a lot of people's favorite ones,
and by then we felt we knew much better what people needed.
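
For concreteness, here is roughly how one of those favorite operations
(test-and-set) gets synthesized from LL/SC; a minimal sketch in C with
GCC-style inline assembly for MIPS, not code from any actual MIPS
library:

    /* Sketch: test-and-set synthesized from MIPS-II LL/SC.
     * Assumes GCC inline asm and an assembler in .set reorder mode,
     * so branch delay slots get filled automatically. */
    static inline int test_and_set(volatile int *lock)
    {
        int old, tmp;
        __asm__ __volatile__(
            "1:  ll   %0, %2   \n"    /* load-linked current value   */
            "    li   %1, 1    \n"    /* value to store              */
            "    sc   %1, %2   \n"    /* store-conditional; %1 = ok  */
            "    beqz %1, 1b   \n"    /* lost the link? retry        */
            : "=&r"(old), "=&r"(tmp), "+m"(*lock)
            :
            : "memory");
        return old;                   /* 0 means we took the lock */
    }

The same retry loop, with a different operation between the LL and the
SC, yields fetch-and-add, exchange, and most of the other favorites.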



From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM & 
	VSE)
Date: Wed, 15 Dec 2004 22:15:16 +0100
Message-ID: <cpq9h4$dfa$1@osl016lin.hda.hydro.com>

Ian Shef wrote:
> No, it is still possible to provide locking mechanisms entirely in software
> (assuming that the instruction set and other hardware provides certain
> minimal capabilities - and MIPS R2000 can meet these conditions).  There
> are papers (I used to have one by Leslie Lamport, if I remember the name
> correctly) that describe how to do this.  This is how you were supposed to
> perform locking on the MIPS R2000.  (At least there was some piece of MIPS
> R2000 documentation that provided pointers to papers on software locking
> algorithms - that is how I found the papers by Lamport).

AFAIR, Lamport's algorithm depends on having unlimited range counters!

This does not work for unlimited uptimes, but might be good enough,
particularly on 64-bit machines.
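
If the algorithm meant is the bakery algorithm (named later in this
thread), the counter in question is its ticket number; a minimal
sketch in C, assuming sequentially consistent memory, which is exactly
the catch raised further down the thread:

    /* Sketch of Lamport's bakery algorithm for N threads.
     * ticket[] is the unbounded counter: under continuous contention
     * the maximum ticket only grows, so it eventually wraps on
     * fixed-width integers (after a very long time on 64-bit ones).
     * Assumes sequentially consistent memory; plain volatile is not
     * enough on weakly-ordered hardware. */
    #define N 4
    volatile int      choosing[N];
    volatile unsigned ticket[N];        /* 0 means not competing */

    void bakery_lock(int i)
    {
        choosing[i] = 1;
        unsigned max = 0;
        for (int j = 0; j < N; j++)
            if (ticket[j] > max) max = ticket[j];
        ticket[i] = max + 1;            /* the unbounded counter */
        choosing[i] = 0;
        for (int j = 0; j < N; j++) {
            while (choosing[j])
                ;                       /* wait until j has picked */
            while (ticket[j] != 0 &&
                   (ticket[j] < ticket[i] ||
                    (ticket[j] == ticket[i] && j < i)))
                ;                       /* lower ticket goes first */
        }
    }

    void bakery_unlock(int i)
    {
        ticket[i] = 0;
    }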

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"


From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM & 
	VSE)
Date: 15 Dec 2004 16:13:26 -0800
Message-ID: <1103156006.597207.44180@z14g2000cwz.googlegroups.com>

Actually [now that my posts aren't disappearing any more :-)], let me
review what *really* was going on in 1985 with R2000.

1) We didn't really expect  people to be using Lamport's algorithm,
except as a last resort.

2) UNIXes of the time had semaphores via semget(), etc, which were
syscalls.

3) For user-level code, I was unable to find any consistent expectation
across the industry of:
a) Specific hardware primitives for synchronization
b) User-level APIs to access these
and in fact, computer systems vendors, especially those doing bus-based
multiprocessors, often denigrated the usual CPU instructions in favor
of special extra hardware for synchronization, like the Sequent SLIC
bus.  In particular, SMP folks were horrified at instructions that did
complete bus-locks.

4) At that point, UNIX kernel code had not generally been reworked for
fine-grain parallelism on SMPs, and certainly, uniprocessors could quite
happily do efficient mutual exclusion via set-priority-level to turn
off interrupts.

5) There simply didn't exist a lot of user-level code (on UNIX, anyway)
that made heavy use of lightweight threads, or that did automatic
parallelization and expected really efficient user-level
synchronization operations.

6) Hence, I didn't feel at all bad about leaving the synch ops out,
since I couldn't tell the chip folks anything I could guarantee I'd be
happy with for many years [and once it's there, it's there].  However,
I'd had in the back of my mind that we should do a set of C library
functions covering the common mechanisms, implemented as fast kernel
calls, so that code would get written to a standard API (for us at
least), and that later we'd slide some new instructions underneath the
API .... a good plan, but in the terrible crush of other things to do,
I forgot about it.
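
Something like this hypothetical header sketches the idea; the names
are invented here for illustration, not an actual MIPS interface:

    /* Hypothetical sketch of such a standard API (invented names).
     * Each routine starts out as a fast path into the kernel; later,
     * new instructions can be slid in underneath without recompiling
     * the applications written against it. */
    typedef struct mips_lock mips_lock_t;

    mips_lock_t *mips_lock_create(void);
    void         mips_lock_acquire(mips_lock_t *);    /* block until held */
    int          mips_lock_tryacquire(mips_lock_t *); /* 0 if acquired    */
    void         mips_lock_release(mips_lock_t *);
    long         mips_atomic_add(volatile long *, long); /* returns old */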

Unlike some other areas, synchronization operations are one where I'm
still not comfortable that the industry has produced many mechanisms
that simultaneously have:
- consistent APIs, especially across vendors
- simple, cheap low-end implementations
- good scaling on large machines



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: ulocks.h (was: Re: CAS and LL/SC)
Date: 16 Dec 2004 15:12:52 -0800
Message-ID: <1103238772.551466.305570@f14g2000cwb.googlegroups.com>

re: ulocks.h

At MIPS, we should have defined a set no later than 2Q86, recommended
across MIPS customers.

A joint team of MIPS & SGI folks did a SYS V port, then each group took
the OS in its own direction, for various reasons.

The SGI ulocks.h set was done sometime later, I don't recall when.

=====
To reiterate the point: whenever possible, it is a really, really good
idea to define a standard API that hides the details of the lower-level
hardware, so that software gets written to it, so that you can improve
things over time without disrupting existing software, and so that new
features actually get used in practice sooner.

One of my favorite examples of an exceptionally well-thought-out API
stack would be OpenGL, which managed to survive huge improvements in
hardware, and made some sense across a huge range from minimal graphics
hardware to extremely powerful hardware.
Another good example would be the various implementations of AS/400.



From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM & 
	VSE)
Date: Thu, 16 Dec 2004 11:09:25 +0100
Message-ID: <cprmsl$4sr$1@osl016lin.hda.hydro.com>

Joe Seigh wrote:
> Terje Mathisen wrote:
>>AFAIR, Lamport's algorithm depends on having unlimited range counters!
>>
>>This does not work for unlimited uptimes, but might be good enough,
>>particularly on 64-bit machines.
>>
>
> Peterson's algorithm fixes that.  It's also a distributed algorithm like
> Lamport's.

What about the required memory barriers?

It seems like all these algorithms (Lamport, Dekker, Peterson,
Bakery...) start with the assumption that all memory updates are
globally consistent, i.e. that all memory operations will be globally
visible in program order, right?
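
In later (C11) terms, Peterson's two-thread algorithm needs exactly
that ordering; a sketch, with the sequentially consistent atomics
supplying the store-to-load barrier the plain-memory version assumes:

    /* Two-thread Peterson's algorithm sketched with C11 atomics.
     * The seq_cst operations provide the store->load ordering the
     * algorithm assumes; with plain loads and stores it breaks even
     * on x86, where the store buffer lets the flag[] load pass the
     * earlier flag[] store. */
    #include <stdatomic.h>

    static atomic_int flag[2];    /* flag[i]: thread i wants the lock */
    static atomic_int turn;

    void peterson_lock(int self)              /* self is 0 or 1 */
    {
        int other = 1 - self;
        atomic_store(&flag[self], 1);         /* I want in        */
        atomic_store(&turn, other);           /* but you go first */
        while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
            ;                                 /* spin */
    }

    void peterson_unlock(int self)
    {
        atomic_store(&flag[self], 0);
    }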

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"


From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM & 
	VSE)
Date: 20 Dec 2004 10:58:18 -0800
Message-ID: <1103569098.536503.28680@z14g2000cwz.googlegroups.com>

Dan Koren wrote:
> "(John Mashey)" <old_systems_guy@yahoo.com> wrote in message
> news:1103068097.926843.101230@c13g2000cwb.googlegroups.com...
> > LL/SC: Livermore S-1.
> > Maybe there was something earlier that had it, but
> > this is where I thought it came from.
>
>
> LL/SC instructions were introduced with the CDC-6600.

What are they called there? I hadn't remembered there being anything
like that, and I reviewed:
http://ed-thelen.org/comp-hist/CDC-6600-R-M.html

The PP's had add/subtract to memory, but that isn't LL/SC.

>
> > http://www.cs.clemson.edu/~mark/s1_alumni.html lists
> > some of the alumni, of which a bunch worked at MIPS
> > at some point. I especially recall Earl Killian being
> > involved with this.
> >
> > MIPS R2000 (MIPS-I) didn't have any synchronization ops
> > on purpose, because every mechanism we knew caused at
> > least some customer to tell us why it wasn't good enough :-)
>
>
> The story I remember hearing from one of the original MIPS
> architects (who shall remain unnamed ;-)) was that atomic
> synchronization instructions were used too infrequently
> to justify putting them in hardware ;-)

That is incorrect, and I *know* because I was there, and heavily
involved.
The MMU, cache-control, exception-handling, coprocessor0 & related
features were mostly designed by negotiation between the VLSI folks and
the OS group (i.e., myself and a few people who worked for me).

We (the OS group) had various thoughts about synchronization
operations, and I certainly did some asking around, and the kind of
input I got from various people ran like this, with X..Z standing for
typical CPU instructions:
a) X is simply not good enough, we like Y better.
b) Y is simply not good enough, we like Z better.
c) You need all of X, Y, and Z. [and we weren't going to get that].
d) X, Y, and Z are all useless, for a serious SMP, you need a special
interconnect mechanism that does scalable locking, and the last thing
in the world you want is something that does a bus-lock.

Recall the minimalist design style: none of us were keen about putting
half-baked features in that consumed resources, but worse, that we were
committed to forever [i.e., especially user-level stuff.]

When we asked for things, we had to negotiate/iterate with VLSI group,
but usually, the VLSI attitude was "If you can tell us exactly what you
want, we'll try to do it, or at least discuss alternative
implementations that might be easier."  Many of the VLSI team were
ecstatic to actually get useful input. We didn't get everything we
wanted, but we got a lot, and we certainly cared about OS and
multi-tasking performance, and we got tradeoffs to help those whenever
we could.

I believe, that had we definitively said "We MUST have test-and-set (or
one of the other simpler mechanisms)", we would likely have gotten it,
albeit not without some grumbling (since it would have been the only
combined load/store operation).

However:
a) SMP was an explicit non-goal of the R2000.  [Some of us were big SMP
fans, but there was already too much to do, so the rule was not to do
anything that would make SMP unnecessarily difficult, but also not to
add schedule length or die space to enable SMP.  Expect to do that later
(the minimal possible in R3000, and then substantially in R4000).]  IBM
followed a similar path with RS/6000.

b) For the first rounds of machines, with uniprocessors only, we (OS
group) believed we could do without synch ops inside the kernel, since
we could do mutual exclusion other ways.  Like I said before, the
mistake was in not providing a decent user-level library, with a
fast-path through the kernel, and that was entirely because it got lost
in the frenzy, and that wasn't the VLSI group's fault.

c) Anyway, the result was that we (OS group) were not in a position to
tell the VLSI group exactly what we wanted, and we were reluctant to
spec something that would be there forever, and still not do the job,
and we weren't willing to stall the schedule a couple of months to work
through all the issues properly.  I was in discussions of the form "Do
you want test-and-set, or something different?"  That was 2Q85.

Other people may remember this differently, but not all designers were
involved with all parts of the design; I was certainly in the middle of
this one.  As for believing that it was left out because it was
infrequently used:
a) In general, of course we tried to leave things out that were
infrequently used.
b) But, if something was infrequently used, but structurally necessary
[like halfword load/stores for dealing with device drivers], we got it
in some form or other.

So, one more time: the synchronization omission was on purpose, simply
because we could not specify a committed-forever feature early enough,
and we thought we could live without it for a while, until we knew
something we thought would really work well.



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM & 
	VSE)
Date: 22 Dec 2004 01:32:52 -0800
Message-ID: <1103707972.895564.11190@z14g2000cwz.googlegroups.com>

Eric Smith wrote:
> Anne & Lynn Wheeler <lynn@garlic.com> writes:
> > charlie invented compared-and-swap while working on fine-grain
> > locking for cp/67 360/67 smp at the science center
> > http://www.garlic.com/~lynn/subtopic.html#545tech
>
> Interesting.  Other than 360 and derivatives, some of the Motorola
> 68K processors (starting with MC68020), and later x86 processors, what
> architectures have included CAS?
>
> Who invented the Load Locked and Store Conditional instructions used
> in the MIPS, Power, and Alpha architectures, and where else have they
> been used?
>
> Eric
>
> [Originally posted to alt.folklore.computers and bit.listserv.ibm-main,
> but seems relevant here.  Edited slightly, as I realized after the
> original posting that recent x86 processors also have the equivalent of
> CAS.]


All of this reminded me to pull out some old design discussion notes
from 1996/1997.

A useful reference is Maged M. Michael, Michael L. Scott, "Concurrent
Update on Multiprogrammed Shared Memory Multiprocessors", University of
Rochester TR #614, April 1996.  For the proposal of LL/SC, it points at
E. H. Jensen, G. W. Hagensen, J. M. Broughton, "A New Approach to
Exclusive Data Access in Shared Memory Multiprocessors", TR UCRL-97663,
LLNL, Nov 1987.  It also references Herlihy's work that categorized
primitives into levels, such that those at higher levels (CAS, LL/SC)
could be used to synthesize lower ones (TestAndSet, FETCHADD,
FETCHSTORE), i.e., M. Herlihy, "Wait-Free Synchronization", ACM TOPLAS
13(1):124-149, Jan 1991.
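
As a concrete instance of that synthesis, FETCHADD built from CAS is
just a retry loop; a sketch using a GCC builtin for the CAS (an LL/SC
version is the same loop with the linked pair in place of the CAS):

    /* Sketch: fetch-and-add synthesized from compare-and-swap, per
     * Herlihy's hierarchy.  Retries until no other CPU changed the
     * word between the read and the CAS. */
    long fetch_and_add(volatile long *addr, long delta)
    {
        long old;
        do {
            old = *addr;                    /* read current value */
        } while (!__sync_bool_compare_and_swap(addr, old, old + delta));
        return old;                         /* value before the add */
    }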


CONTEXT
At this point, there had been 3 major generations of MIPS
multiprocessor systems architecture at SGI:
1988- Bus-based Power Series, using R3000s, up to 8P
=====
1993- Bus-based Challenge/Power Challenge, R4400, R8000, R10000, up to 36P.
1996- ccNUMA Origin 2000, up to 128P/256P.

The 2nd and 3rd used CPUs with LL/SC, but most systems had extra
synchronization hardware for performance ... all of which were a bit
different, and required different OS support, and (I think) were even
broken in early models, and often had awkward programming restrictions
[magic uncached registers in specific locations.]  Origin2000s had some
fetch-op hardware.

Also, by then, we had people from the Cray T3x group, and those
machines had their own highly scalable synchronization hardware.

We were working on the successor ccNUMA system architecture, NUMAFlex,
which appeared in MIPS-based Origin 3000, and Itanium-based Altix.

We were also working on an R1x000 successor [which never got built],
but among whose design goals were a better memory hierarchy and more
efficient, much more usable synchronization hardware to help out
software.

Here was the fundamental issue, which hasn't been discussed much here,
but is the elephant at the dining table:
==============
You can use some atomic synch instruction (or pair, like LL/SC), and it
looks like nice tight code ... but it may consume a large number of
cycles, and worse, lock shared resources more than you'd ever expect.
In addition, some nice-looking primitives (from programming viewpoint)
sometimes require very complex/expensive hardware designs to make them
fast.
==============
Scaling things at reasonable cost requires both CPU and system design,
and of course, good software at both OS and application level.

1) EARLY 1980s
In many CPUs, especially micros, primitives like Test-And-Set were
usually defined, originally as:
- Fetch memory, with bus-lock-modifier asserted
- Operate on data
- Write it back, and then drop the bus-lock

I.e. the assumption was that there was a bus or equivalent, such that a
CPU could lock it temporarily, and that this was OK.
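
The canonical use was a spinlock like this sketch, with a GCC builtin
standing in for the bus-locked hardware primitive:

    /* Sketch: early-1980s style spinlock over test-and-set.
     * __atomic_test_and_set stands in for the hardware primitive;
     * every failed attempt re-runs the locked read-modify-write,
     * tying up the bus each time. */
    void tas_lock(volatile char *lock)
    {
        while (__atomic_test_and_set((void *)lock, __ATOMIC_SEQ_CST))
            ;                              /* locked RMW per spin */
    }

    void tas_unlock(volatile char *lock)
    {
        __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
    }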

Many such features, especially in micros in the early 1980s, didn't
strongly consider interactions with caches, since, at least earlier,
the CPUs didn't have data caches, and cache implementation if any was
done in external hardware.  Typically, later CPUs in sequence had to be
careful to define how this worked, i.e., in 68020 CAS & TAS just did
bus-lock; in 68040, the effect of CAS & TAS on cache-resident data was
specified.

As noted in an earlier posting, it is difficult to make
synchronization-based-on-bus-lock scale up very far, and people doing
micro-based, bus-based SMPs tended to build their own extra
synchronization hardware to minimize bus-locking.

In early 1980s, a synchronization op might consume 10s of cycles,
depending on cache design, coherency mechanisms, and bus activity, and
of course, it would stop any other CPU from using the bus for a while.

2) By late 1980s/early 1990s, people were defining synchronization
operations whose original definitions were cache and bus-SMP-aware, of
which LL/SC was particularly popular.  This scaled further, and could
ride on top of normal cache-coherency mechanisms, and of course, did
not lock the bus, which was especially important to people who were
trying to scale bus-based SMPs as far as they could, and were already
using wide, split-transaction busses.

Straightforward bus-lock didn't make any sense for such systems.  Lock
words either had to be uncacheable, or else participate in the
coherency scheme, and in any case, with faster CPUs, one can imagine
cases where a simple-looking TAS turns into 50-100 clocks, depending on
bus arbitration, outstanding bus transactions, residency of cache
lines, etc.

But of course, as systems scaled up, even these ran into limits.  The
classic issue for LL/SC is for a heavily-used lock, where the cache
line migrates around the caches.

3) By mid-1990s, this was really painful, with ccNUMAs scaling much
further, and of course, with more complex interconnects than simple
busses, so there wasn't even a single hardware entity to lock, and
there was no simple bus to watch to do cache invalidation. In many
ccNUMA designs, all is well when:
a) Lots of CPUs share a read-only copy of a cache line
b) One CPU has exclusive use of a (dirty) cache line for a while.

But it is Not a Good Thing if 256 CPUs all want to do LL/SC on the same word.
Assume that each CPU in fact does LL/SC with no intervening invalidation.
CPU 1 does LL/SC, ending in having a dirty cache line.
=====
CPU 2 does LL, which asks for cache line. (1)
Memory controller gets request, sees that CPU 1 has exclusive use,
sends message to CPU 1 (2)
CPU 1 pushes dirty cache line back to memory controller (3) and invalidates
its copy.  It either sends the cache line directly to CPU 2, or lets the
memory controller do it (4).
CPU 2 does SC.

(There are numerous variations on this theme, but it is easy for the
LL/SC in CPU 2 to cost 100s of CPU cycles in a big ccNUMA, sometimes.)
=====
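
This is why software on such machines spins on an ordinary cached read
and only attempts the atomic operation when the lock looks free
(test-and-test-and-set); a sketch over a CAS builtin, with an LL/SC
version being analogous:

    /* Sketch: test-and-test-and-set spinning.  The inner loop reads
     * a local (shared, read-only) copy of the cache line, so the
     * line stops migrating; it moves again only when the holder
     * releases the lock. */
    void tatas_lock(volatile int *lock)
    {
        for (;;) {
            while (*lock != 0)
                ;                           /* spin in local cache */
            if (__sync_bool_compare_and_swap(lock, 0, 1))
                return;                     /* won the race */
        }
    }

    void tatas_unlock(volatile int *lock)
    {
        __sync_lock_release(lock);          /* store 0 with release */
    }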

We thought we had some noticeable improvements, based on:
a) Studying what we already had in the kernel, in terms of APIs for
operations (that were usually then simulated with LL/SC).  Looking at
the kernel source, we'd used LL/SC to synthesize almost anything anyone
could think of :-)
b) Some new ops for the CPU being designed.
c) CNA (Consistent, no-allocate) cache mode, probably.
d) Fetch-op hardware in memory controllers as well.

But of course, we never actually built that MIPS CPU.  Some of the
features are there, sort of, in Itanium, and some elements in PA-RISC
and PPC.

Briefly, what we were thinking of was:

A) LOAD32 and LOAD64 FETCH-AND-OPS (several each)
B) STORE32 and STORE64 FETCH-AND-OPS (several each)

To maximize usability, it was preferred that the semantics of these
appear the same whether a data item was cacheable or uncacheable, and
if cacheable, whether it was currently resident or not, with CNA access
preferred.

Doing this right seems to require that the operations be doable either
by a CPU or by a memory controller, and in any case, with the right
cache coherency.  That probably wasn't practical in moderate-cost 1980s
machines, but ccNUMA memory controllers are pretty hefty already.

Briefly, the idea was to have a set of load-and-ops and store-and-ops;
shown here are just the load-fetch-add and store-fetch-add (with ORs,
ANDs, and XCHGs useful as well).
LFADD  reg, addr
SFADD  reg, addr

a) If the relevant cache-line was already in the cache, and EXCLusive,
do the operation there, atomically, without any other CPU being bothered.
SFADD just adds the register to memory atomically,
LFADD returns the previous value.
b) Otherwise, send the register value, and the operation type to the
appropriate memory controller, which:
1) If a CPU has an EXCL line, gets it written back to memory,
and invalidates any SHAREd lines, i.e., it does what it normally does when
a CPU requests an EXCL line (i.e., with intent to write).
2) Reads the memory value, and if the op were an LFADD, returns that value,
i.e., as though it had been an uncached memory read. (That's where the CNA
access comes in: don't let this allocate cache lines.)
3) Does the atomic modify, memory <- memory+reg

The SFADD has lower latency than LFADD, for cases where you don't need
the old value.
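
In C terms, the intended programmer-visible semantics were roughly as
follows; these intrinsics are hypothetical (the CPU was never built),
emulated here with a GCC builtin:

    /* Hypothetical lfadd/sfadd intrinsics, semantics only.  In the
     * real proposal the add happens in the local cache line (if held
     * EXCL) or at the memory controller, not in generated C code. */
    long lfadd(long *addr, long delta)
    {
        /* returns the previous value, like an uncached read */
        return __atomic_fetch_add(addr, delta, __ATOMIC_SEQ_CST);
    }

    void sfadd(long *addr, long delta)
    {
        /* no old value wanted, hence lower latency in the proposal */
        (void)__atomic_fetch_add(addr, delta, __ATOMIC_SEQ_CST);
    }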

The idea was:
a) Keep busy global locks out of the caches, and keep the activity
right at the memory controllers.  Transfer 8-byte data items, not
128-byte cache lines.
[The fetchop hardware in Origin2000 did this sort of thing, i.e., there
was a specific bunch of special uncached memory addresses, not very
general.]

b) Allow use of normal memory addressing in both kernel and user, with
lock words embedded in data structures when that made sense, and
avoiding the horrific bugs that can happen when cached and uncached
accesses get accidentally mixed.  Piggy-back on the existing coherency
protocols.

One can argue about the case where the LFADD/SFADD encounters a cache
line in SHAREd state: should it try to get EXCL and keep the line, or
just drop the line, and send the data+op to the memory controller as
above?  Of course, there are a zillion details, given the complex
coherency protocols, and of course, none of this would automagically
make software scale :-)

Too bad we never built that CPU :-)

==============
SUMMARY

Efficient synchronization needs:
a) Synch operations in CPU, appropriate to the scalability needed
b) Appropriate hardware systems design, which usually has extra
features beyond those provided by commodity CPUs, at least when used in
bigger systems.
c) Careful software work.

Put another way, I think the choice of specific CPU synch operations is
less important than the way these pieces interact.  I also think that
nice-looking synch operations can fool people into thinking they must
be fast, when in fact, they may incur 10s or 100s of cycles of delay.
Now, finally off to the slopes.



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: CAS and LL/SC
Date: 22 Dec 2004 12:25:14 -0800
Message-ID: <1103747114.047502.313120@c13g2000cwb.googlegroups.com>

Andi Kleen wrote:
> "John Mashey" <old_systems_guy@yahoo.com> writes:
>
> > The idea was:
> > a) Keep busy global locks out of the caches, and keep the activity
> > right at the memory controllers.  Transfer 8-byte data items, not
> > 128-byte cache lines.
> > [The fetchop hardware in Origin2000 did this sort of thing, i.e., there
> > were a specific bunch of special uncached memory addresses, not very
> > general.]
>
> It seems like a bad idea to me because it optimizes the slow (congested)
> path of a lock in favour of the fast (uncongested) path.  On a modern
> CPU you really want the fast path to be in cache, because just talking
> to the external bus costs you hundreds of cycles. Trying to talk
> to an external memory controller even for an uncongested lock sounds
> extremely bad and definitely not something you should do for
> any important locks.
>
> If you have a lock that is congested enough that the slow path is a
> real problem the only real fix is to redesign your software to split
> the lock up so that it is not congested.  That fixes the problem
> at the root and you care more about the performance of the uncongested
> fast path, which your design probably penalized.
>
> Trying to optimize the lock slow path doesn't help you much longer term.
> If a lock is too congested it will always be slow and not scale well.
>
> -Andi

Maybe I didn't explain this quite well enough, but you might consider
the history and the background, i.e., we chose LL/SC fairly early, with
the idea that they'd work well for uncongested locks, and at least work
for congested ones.  We had a *lot* of experience with these in rather
scalable designs, up to 64P or 128P at the time we were thinking about
this.
IRIX folks had spent a lot of time doing the kinds of software
optimizations you suggest.  It's hard work.  When you discover that
something is congested, it usually isn't a one-line change to fix it.

As I noted, this thread has frequently ignored some kinds of
implementation issues, like what the *real* cycle count implications
are.  The problem is that when you do LL/SC, and you're relying on the
standard coherency system, in a big ccNUMA (like an Origin2000, for
example), if the lock gets even a bit congested the overhead gets worse
fast, with every single access to the lock causing *multiple*
transactions with the relevant memory controller, and frequent
back-and-forth transfers of entire 128-byte cache lines.

In that case, you're better off at least lessening the overhead.  It is
also the case that the common likely-to-be-cache-resident cases fall
out naturally, as in the case where locks are embedded with the data.
Also, given a "prefetch-with-intent-to-write" instruction [which we
had], it's trivial and efficient to have  a fetch-and-op function that
keeps the lock in the local cache, until it gets congested, in which
case it migrates back to the memory controller where you want it.

In the Origin2000, we had this fetchop hardware, and it wasn't there
because hardware people thought it would be fun; it was because OS
people wanted all the help they could get!  They already had LL/SC for
the easier cases.

One more time: the reason we are still arguing about this stuff is
because it's hard, and it's hard because it has asynchrony and
parallelism, neither of which are very intuitive for people, and
because it requires bottom-to-top designs that hang together, involving
CPU, system architecture, OS, libraries, language specification,
compilers and applications.

