From: "(John Mashey)" <email@example.com> Newsgroups: comp.arch Subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM & VSE) Date: 14 Dec 2004 15:48:17 -0800 Message-ID: <firstname.lastname@example.org> LL/SC: Livermore S-1. Maybe there was something earlier thart had it, but this is where I thought it came from. http://www.cs.clemson.edu/~mark/s1_alumni.html lists some of the alunmi, of which a bunch worked at MIPS at some point. I especially recall Earl Killian being involved with this. MIPS R2000 (MIPS-I) didn't have any synchronization ops on purpose, because every mechanism we knew cauased at least some customer to tell us why it wasn't good enough :-) LL/SC was added in MIPS-II to get minimal operations able to synthesize a lot of people's favorite ones, and by then we felt we knew much better what people needed.
From: Terje Mathisen <email@example.com> Newsgroups: comp.arch Subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM & VSE) Date: Wed, 15 Dec 2004 22:15:16 +0100 Message-ID: <firstname.lastname@example.org> Ian Shef wrote: > No, it is still possible to provide locking mechanisms entirely in software > (assuming that the instruction set and other hardware provides certain > minimal capabilities - and MIPS R2000 can meet these conditions). There > are papers (I used to have one by Leslie Lamport, if I remember the name > correctly) that describe how to do this. This is how you were supposed to > perform locking on the MIPS R2000. (At least there was some piece of MIPS > R2000 documentation that provided pointers to papers on software locking > algorithms - that is how I found the papers by Lamport). AFAIR, Lamport's algorithm depends on having unlimited range counters! This does not work for unlimited uptimes, but might be good enough, particularly on 64-bit machines. Terje -- - <Terje.Mathisen@hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
From: "John Mashey" <email@example.com> Newsgroups: comp.arch Subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM & VSE) Date: 15 Dec 2004 16:13:26 -0800 Message-ID: <firstname.lastname@example.org> Actually [now that my posts aren't disappearing any more :-)], let me review what *really* was going on in 1985 with R2000. 1) We didn't really expect people to be using Lamport's algorithm, except as a last resort. 2) UNIXes of the time had semaphores via semget(), etc, which were syscalls. 3) For user-level code,I was unable to find any consistent expectation across the industry of: a) Specific hardware primitives for synchronization b) User-level APIs to access these and in fact, computer systems vendors, especially those doing bus-based multiprocessors, often denigrated the usual CPU instructions in favor of special extra hardware for synchronizatiion, like the Sequent SLIC bus. In particular, SMP folks were horrified at instructions that did complete bus-locks. 4) At that point, UNIX kernel code had not generally been reworked for fine-grain parallelism on SMPs, and certainly, uniprocessrs could quite happily do efficient mutual exclusion via set-priority-level to turn off interrupts. 5) There simply didn't exist a lot of user-level code (on UNIX, anyway) that made large use of lightweight threads, or that did automatic parallelization and expected really efficient user-level synchronization operations. 6) Hence, I didn't feel at all bad about leaving the synch ops out, since I couldn't tell the chip folks anything I could guarantee I'd be happy with for many years [and once it's there, it's there]. However, I'd had in the back of my mind that we should do a set of functions of the C library with the common mechanisms, that turned into fast kernel calls, so that code would get written to a standard API (for us at least), and that later, we'd slide some new instructions underneath the API .... a good plan, but in the terrible crush of other things to do, I forgot about it. Unlike some other areas, I regard this (synchronization operations) as one of several areas where I'm still not comfortable that the industry has produced many mechanisms that simultaneously have: - consistent APIs, especially across vendors - simple, cheap low-end implementations - good scaling on large machines
From: "John Mashey" <email@example.com> Newsgroups: comp.arch Subject: Re: ulocks.h (was: Re: CAS and LL/SC) Date: 16 Dec 2004 15:12:52 -0800 Message-ID: <firstname.lastname@example.org> re: ulocks.h At MIPS, we should have defined a set no later than 2Q86, recommended across MIPS customers. A joint team of MIPS & SGI folks did a SYS V port, then each group took the OS in its own different directions, for various reasons. The SGI ulocks.h set was done sometime later, I don't recall when. ===== To reiterate the point: whenver possible, it is a really, really good idea to define a standard API for something, so that software gets written using it, that hides the details of lower level hardware, so that you can improve things over time without disrupting existing software, and so new features are actually used in practice quicker. One of my favorite examples of an exceptionally well-thought-out API stack would be OpenGL, which manged to survive huge improvements in hardware, and made some sense across a huge range from minimal graphics hardware to extremely powerful hardware. Another good example would be the various implementations of AS/400.
From: Terje Mathisen <email@example.com> Newsgroups: comp.arch Subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM & VSE) Date: Thu, 16 Dec 2004 11:09:25 +0100 Message-ID: <firstname.lastname@example.org> Joe Seigh wrote: > Terje Mathisen wrote: >>AFAIR, Lamport's algorithm depends on having unlimited range counters! >> >>This does not work for unlimited uptimes, but might be good enough, >>particularly on 64-bit machines. >> > > Peterson's algorithm fixes that. It's also a distributed algorithm like > Lamport's. What about the required memory barriers? It seems like all these algorithms (Lamport, Dekker, Peterson, Bakery...) starts with the assumption that all memory updates are globally consistent, i.e. that all memory operations will be globally visible in program order, right? Terje -- - <Terje.Mathisen@hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
From: "John Mashey" <email@example.com> Newsgroups: comp.arch Subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM & VSE) Date: 20 Dec 2004 10:58:18 -0800 Message-ID: <firstname.lastname@example.org> Dan Koren wrote: > "(John Mashey)" <email@example.com> wrote in message > news:firstname.lastname@example.org... > > LL/SC: Livermore S-1. > > Maybe there was something earlier thart had it, but > > this is where I thought it came from. > > > LL/SC instructions were introduced with the CDC-6600. What are they called there? I hadn't remebmered there being anything like that, and I reviewed: http://ed-thelen.org/comp-hist/CDC-6600-R-M.html The PP's had add/subtract to memory, but that isn't LL/SC. > > > http://www.cs.clemson.edu/~mark/s1_alumni.html lists > > some of the alunmi, of which a bunch worked at MIPS > > at some point. I especially recall Earl Killian being > > involved with this. > > > > MIPS R2000 (MIPS-I) didn't have any synchronization ops > > on purpose, because every mechanism we knew caused at > > least some customer to tell us why it wasn't good enough :-) > > > The story I remember hearing from one of the original MIPS > architects (who shall remain unnamed ;-)) was that atomic > synchronization instructions were used too infrequently > to justify putting them in hardware ;-) That is incorrect, and I *know* because I was there, and heavily involved. The MMU, cache-control, exception-handling, coprocessor0 & related features were mostly designed by negotiation between the VLSI folks and the OS group (i.e., myself and a few people who worked for me). We (the OS group) had various thoughts about synchronization operations, and I certainly did some asking around, and the kind of input I got from various people went like, with X..Z describing typical CPU instructions: a) X is simply not good enough, we like Y better. b) Y is simply not good enough, we like Z better. c) You need all of X, Y, and Z. [and we weren't going to get that]. d) X, Y, and Z are all useless, for a serious SMP, you need a special interconnect mechanism that does scalable locking, and the last thing in the world you want is something that does a bus-lock. Recall the minimalist design style: none of us were keen about putting half-baked features in that consumed resources, but worse, that we were committed to forever [i.e., especially user-level stuff.] When we asked for things, we had to negotiate/iterate with VLSI group, but usually, the VLSI attitude was "If you can tell us exactly what you want, we'll try to do it, or at least discuss alternative implementations that might be easier." Many of the VLSI team were ecstatic to actually get useful input. We didn't get everything we wanted, but we got a lot, and we certainly cared about OS and multi-tasking performance, and we got tradeoffs to help those whenever we could. I believe, that had we definitively said "We MUST have test-and-set (or one of the other simpler mechanisms)", we would likely have gotten it, albeit not without some grumbling (since it would have been the only combined load/store operation). However: a) SMP was an explicit non-goal of the R2000. [Some of us were big SMP fans, but there was already too much to do, so the rule was not to do anything that would make SMP unnecessarily difficult, but don't add to schedule length or die space to enable SMP. Expect to do that later (the minimal possible in R3000, and then substantially in R4000). IBM followed a similar path with RS/6000. b) For the first rounds of machines, with uniprocessors only, we (OS group) believed we could do without synch ops inside the kernel, since we could do mutual exclusion other ways. Like I said before, the mistake was in not providing a decent user-level library, with a fast-path through kernel, and that was entirely because it got lost in the frenzy, and that wasn't the VLSI group's fault. c) Anyway, the result was that we (OS group) were not in a position to tell the VLSI group exactly what we wanted, and we were reluctant to spec something that would be there I was in discussions of the form "Do you want test-and-set, or something different?" That was 2Q85. forever, and still not do the job, and we weren't willing to stall the schedule a couple months to work though all the issues properly. Other people may remember this differently, but not all designers were involved with all parts of the design; I was certainly in the middle of this one. As for believing that it was left out because it was infrequently used: a) In general, of course we tried to leave things out that were infrequently used. b) But, if something was infrequently used, but structurally necessary [like halfword load/stores for dealing with device drivers], we got it in some form or other. So, one more time: the synchronization omission was on purpose, simply because we could not specify a committed-forever feature early enough, and we thought we could live without it for a while, until we knew something we thought would really work well.
From: "John Mashey" <email@example.com> Newsgroups: comp.arch Subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM & VSE) Date: 22 Dec 2004 01:32:52 -0800 Message-ID: <firstname.lastname@example.org> Eric Smith wrote: > Anne & Lynn Wheeler <email@example.com> writes: > > charlie invented compared-and-swap while working on fine-grain > > locking for cp/67 360/67 smp at the science center > > http://www.garlic.com/~lynn/subtopic.html#545tech > > Interesting. Other than 360 and derivatives, some of the Motorola > 68K processors (starting with MC68020), and later x86 processors, what > architectures have included CAS? > > Who invented the Load Locked and Store Conditional instructions used > in the MIPS, Power, and Alpha architectures, and where else have they > been used? > > Eric > > [Originally posted to alt.folklore.computers and bit.listserv.ibm-main, > but seems relevant here. Edited slightly, as I realized after the > original posting that recent x86 processors also have the equivalent of > CAS.] All of this reminded me to pull out some old design discussion notes from 1996/1997. A useful reference is Maged M. Michael, Michael L. Scott, "Concurrent Update on Multiprogrammed Shared Memory Multiprocessors", University of Rochester TR #614, April 1996. For proposal of LL/SC, it points at E. H. Jensen, G. W. Hagensen, J. M. Broughton, "A New Approach to Exclusive Data Access in Shared Memory Multiprocessors", TR UCRL-97663, LLNL, Nov 1987. It also references Herlihy's work that categorized primitives into levels, such that those at higher levels (CAS, LL/SC) could be used to synthesize lower ones (TestAndSet, FETCHADD, FETCHSTORE), i.e., M. Herlihy, "Wait-Free Synchonization", ACM TOPLAS 13(1):124-149, Jan 1991. CONTEXT At this point, there had been 3 major generations of MIPS multiprocessor systems architecture at SGI: 1988- Bus-based Power Series, using R3000s, up to 8P ===== 1993- Bus-based Challenge/Power Challenge, R4400, R8000, R10000, up to 36P. 1996- ccNUMA Origin 2000, up to 128P/256P. The 2nd and 3rd used CPUs with LL/SC, but most systems had extra synchronization hardware for performance ... all of which were a bit different, and required different OS support, and (I think) were even broken in early models, and often had awkward programming restrictions [magic uncached registers in specific locations.] Origin2000s had some fetch-op hardware. Also, by then, we had people from the Cray T3x group, and those machines had their own highly scalable synchronization hardware. We were working on the successor ccNUMA system architecture, NUMAFlex, which appeared in MIPS-based Origin 3000, and Itanium-based Altix. We were also working on an R1x000 successor [which never got built], but among whose design goals was better memory hierarchy and more efficient, much-more usable synchronization hardware to help out software. Here was the fundamental issue, which hasn't been discussed much here, but is the elephant at the dining table: ============== You can use some atomic synch instruction (or pair, like LL/SC), and it looks like nice tight code ... but it may consume a large number of cycles, and worse, lock shared resources more than you'd ever expect. In addition, some nice-looking primitives (from programming viewpoint) sometimes require very complex/expensive hardware designs to make them fast. ============== Scaling things at reasonable cost requires both CPU and system design, and of course, good software at both OS and application level. 1) EARLY 1980s In many CPUs, especially micros, primitives like Test-And-Set were usually defined, originally as: - Fetch memory, with bus-lock-modifier asserted - Operate on data - Write it back, and then drop the bus-lock I.e. the assumption was that there was a bus or equivalent, such that a CPU could lock it temporarily, and that this was OK. Many such features, especially in micros in the early 1980s, didn't strongly consider interactions with caches, since, at least earlier, the CPUs didn't have data caches, and cache implementation if any was done in external hardware. Typically, later CPUs in sequence had to be careful to define how this worked, i.e., in 68020 CAS & TAS just did bus-lock; in 68040, the effect of CAS & TAS on cache-resident data was specified. As noted in earlier posting, it is difficult to make synchronization-based-on-bus-lock scale up very far, and people doing micro-based, bus-based SMPs tended to build their own extra synchronization hardware to minimize bus-locking. In early 1980s, a synchronization op might consume 10s of cycles, depending on cache design, coherency mechanisms, and bus activity, and of course, it would stop any other CPU from using the bus for a while. 2) By late 1980s/early 1990s, people were defining synchronization operations whose original definitions were cache and bus-SMP-aware, of which LL/SC was particularly popular. This scaled further, and could ride on top of normal cache-coherency mechanisms, and of course, did not lock the bus, which was especially important to people who were trying to scale bus-based SMPs as far as they could, and were already using wide, split-transaction busses. Straightforward bus-lock didn't make any sense for such systems. Lock words either had to be uncacheable, or else participate in the coherency scheme, and in any case, with faster CPUs, one can imagine cases where a simple-looking TAS turns into 50-100 clocks, depending on bus arbitration, outstanding bus transactions, residency of cache lines, etc. But of course, as systems scaled up, even these ran into limits. The classic issue for LL/SC is for a heavily-used lock, where the cache line migrates around the caches. 3) By mid-1990s, this was really painful, with ccNUMAs scaling much further, and of course, with more complex interconnects than simple busses, so there wasn't even a single hardware entity to lock, and there was no simple bus to watch to do cache invalidation. In many ccNUMA designs, all is well when: a) Lots of CPUs share a read-only copy of a cache line b) One CPU has exclusive use of a (dirty) cache line for a while. But it is Not a Good Thing if 256 CPUs all want to do LL/SC on the same word. Assume that each CPU in fact does LL/SC with no intervening invalidation. CPU 1 does LL/SC, ending in having a dirty cache line. ===== CPU 2 does LL, which asks for cache line. (1) Memory controller gets request, sees that CPU 1 has exclusive use, sends message to CPU 1 (2) CPU 1 pushes dirty cache line back to memory controller (3) and invalidates its copy. It either sends the cache line directly to CPU 2, or lets the memory controller do it (4). CPU 2 does SC. (There are numerous variations on this theme, but it is easy for the LL/SC in CPU 2 to cost 100s of CPU cycles in a big ccNUMA, sometimes.) ===== We thought we had some noticable improvements, based on: a) Studying what we already had in the kernel, in terms of APIs for operations (that were usually then simulated with LL/SC). Looking at the kernel source, we'd used LL/SC to synthesize almost anything anyone could think of :-) b) Some new ops for the CPU being designed. c) CNA (Consistent, no-allocate) cache mode, probably. d) Fetch-op hardware in memory controllers as well. But of course, we never actually built that MIPS CPU. Some of the features are there, sort of, in Itanium, and some elements in PA-RISC and PPC. Briefly, what we were thinking of was: A) LOAD32 and LOAD64 FETCH-AND-OPS (several each) B) STORE32 AND STORE64 FETCH-AND-OPS (several each) To maximize usability, it was preferred that the semantics of these appear the same whether a data item were cacheable or uncacheable, and if cacheable, if it were currently resident or not, with CNA access preferred. Doing this right seems to require that the operations must be doable either by a CPU, or by a memory controller, and in any case, with the right cache coherency. That probably wasn't practical in moderate-cost 1980s machines, but ccNUMA memory controllers are pretty hefty already. Briefly, the idea was to have a set of load-and-ops, and store-and-ops, showing just the load-fetch-add and store-fetch-add (with ORs and ANDs or XCHGs useful as well). LFADD reg, addr SFADD reg, addr a) If the relevant cache-line was already in the cache, and EXCLusive, do the operation there, atomically, without any other CPU being bothered. SFADD just adds the register to memory atomically, LFADD returns the previous value. b) Otherwise, send the register value, and the operation type to the appropriate memory controller, which: 1) If a CPU has an EXCL line, gets it written back to memory, and invalidates any SHAREd lines, i.e., it does what it normally does when a CPU requests an EXCL line (i.e., with intent to write). 2) Reads the memory value, and if the op were an LFADD, returns that value, i.e., as though it had been an uncached memory read. (That's where the CNA access comes in: don't let this allocate cache lines.) 3) Does the atomic modify, memory <- memory+reg The SFADD has lower latency than LFADD, for cases where you don't need the old value. The idea was: a) Keep busy global locks out of the caches, and keep the activity right at the memory controllers. Transfer 8-byte data items, not 128-byte cache lines. [The fetchop hardware in Origin2000 did this sort of thing, i.e., there were a specific bunch of special uncached memory addresses, not very general.] b) Allow use of normal memory addressing in both kernel and user, with lock words embedded in data structures when that made sense, and avoiding the horrific bugs that can happen when cached and uncached accesses get accidentally mixed. Piggy-back on the existing coherency protocols. One can argue about the case where the LFADD/SFADD encounters a cache line in SHAREd state: should it try to get EXCL and keep the line, or just drop the line, and send the data+op to the memory controller as above? Of course, there are a zillion details, given the complex coherency protocols, and of course, none of this would automagically make software scale :-) Too bad we never built that CPU :-) ============== SUMMARY Efficient synchronization needs: a) Synch operations in CPU, appropriate to the scalability needed b) Appropriate hardware systems design, which usually has extra features beyond those provided by commodity CPUs, at least when used in bigger systems. c) Careful software work. Put another way, I think the choice of specific CPU synch operations is less important than the way these pieces interact. I also think that nice-looking synch operations can fool people into thinking they must be fast, when in fact, they may incur 10s or 100s of cycles of delay. Now, finally off to the slopes.
From: "John Mashey" <firstname.lastname@example.org> Newsgroups: comp.arch Subject: Re: CAS and LL/SC Date: 22 Dec 2004 12:25:14 -0800 Message-ID: <email@example.com> Andi Kleen wrote: > "John Mashey" <firstname.lastname@example.org> writes: > > > The idea was: > > a) Keep busy global locks out of the caches, and keep the activity > > right at the memory controllers. Transfer 8-byte data items, not > > 128-byte cache lines. > > [The fetchop hardware in Origin2000 did this sort of thing, i.e., there > > were a specific bunch of special uncached memory addresses, not very > > general.] > > It seems like a bad idea to me because it optimizes the slow (congested) > path of a lock in favour of the fast (uncongested) path. On a modern > CPU you really want the fast path to be in cache, because just talking > to the external bus costs you hundreds of cycles. Trying to talk > to an external memory controller even for an uncongested lock sounds > extremly bad and definitely not do something you should do for > any important locks. > > If you have a lock that is congested enough that the slow path is a > real problem the only real fix is to redesign your software to split > the lock up so that it is not congested. That fixes the problem > at the root and you care more about the performance of the uncongested > fast path, which your design probably penalized. > > Trying to optimize the lock slow path doesn't help you much longer term. > If a lock is too congested it will always be slow and not scale well. > > -Andi Maybe I didn't explain this quite well enough, but you might consider the history and the background, i.e., we chose LL/SC fairly early, with the idea that they'd work well for uncongested locks, and at least work for congested ones. We had a *lot* of experience with these in rather scalable designs, up to 64P or 128P at the time we were thinking about this. IRIX folks had spent a lot of time doing the kinds of software optimizations you suggest. It's hard work. When you discover that something is congested, it usually isn't a one-line change to fix it. As I noted, this thread has frequently ignored some kinds of implementation issues, like what the *real* cycle count implications are. The problem is that when you do LL/SC, and you're relying on the standard coherency system, in a big ccNUMA (like an Origin2000, for example), if the lock gets even a bit congested the overhead gets worse fast, with every single access to the lock causing *multiple* transactions with the relevant memory controller, and frequent back-and-forth transfers of entire 128-byte cache lines. In that case, you're better off at least lessening the overhead. It is also the case that the common likely-to-be-cache-resident cases fall out naturally, as in the case where locks are embedded with the data. Also, given a "prefetch-with-intent-to-write" instruction [which we had], it's trivial and efficient to have a fetch-and-op function that keeps the lock in the local cache, until it gets congested, in which case it migrates back to the memory controller where you want it. In the Origin2000, we had this fetchop hardware, and it wasn't there because hardware people thought it would be fun, it's because OS people wanted all the help they could get! They already had LL/SC for the easier cases. One more time: the reason we are still arguing about this stuff is because it's hard, and it's hard because it has asynchrony and parallelism, neither of which are very intuitive for people, and because it requires bottom-to-top designs that hang together, involving CPU, system architecture, OS, libraries, language specification, compilers and applications.