From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: Some confusion about virtual cache
Date: Mon, 25 Aug 2008 11:23:08 -0700 (PDT)
Message-ID: <73422a43-849c-42b4-82e3-515ab7185ca7@79g2000hsk.googlegroups.com>

On Aug 25, 10:02 am, n...@cus.cam.ac.uk (Nick Maclaren) wrote:
<snip>
> On the matter that started this thread, I believe that cache-coherent
> shared memory between processes is a stupid idea, and am being proved
> right.

It is my belief that the problem is not shared cache-coherent memory
(bear with me for a second), but the 'universality' of shared cache-
coherent memory. Everything that is cacheable is by mandate coherent,
and this, in turn, squashes out the parallelism at several levels of
the hierarchy. There should be some mechanism that allows data to
reside in a cache where that data is not coherent, and likewise, some
mechanism that explicitly states that this data IS shared and must be
coherent. The lack of mechanisms to subscribe to and subscribe against
coherence results in massive cache-coherence traffic in the system.

>          What is needed is shared memory with explicit synchronisation,
> which is VASTLY more scalable and easier to get right; indeed, I would
> go for a three-category segment table state:
>
>     a) The segment is being shared read-only.
>     b) The segment is owned by a single process.
>     c) The segment is used for communication, and is accessed
> atomically and with sequential consistency, but ONLY through the
> instructions provided for the purpose (others would fault SIGILL).

The problem, as I see it, is that when it comes to ATOMIC
interactions, architects layer atomicity on top of cache coherence,
simply making the problem worse.

What is needed in atomic transactions is a way to explicitly specify
those memory locations taking part in the atomic multi-memory-
modifying transaction, and to provide an atomic memory-ordering model
for those explicitly annotated memory references, while keeping the
normal memory ordering for other, non-participating memory operations/
instructions. Thus, those participating memory locations are seen
either in the completely-before sense or the completely-after sense,
with no visibility to intermediate transitions that may have to
take place to go from the before state to the after state.
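
In rough C, the idea looks something like the sketch below--watch(),
ACQUIRE(), and RELEASE() are stand-in names for the announce, enter,
and exit primitives, not an existing API:

extern void watch( void *line );    // announce a participating line
extern int  ACQUIRE( int nlines );  // 0 = success, else a failure code
extern void RELEASE( void );        // publish all participants at once

// Move 'amount' between two counters such that no observer ever sees
// the intermediate state: both locations flip from the complete-before
// state to the complete-after state at a single instant.
void transfer( long *from, long *to, long amount )
{
    for( ;; )
    {
        watch( from );
        watch( to );
        if( ACQUIRE(2) == 0 )        // both lines now appear atomic
        {
            *from -= amount;         // ordinary loads and stores
            *to   += amount;         //  elsewhere keep their normal
            RELEASE();               //  memory ordering
            return;
        }                            // nonzero: interference--retry
    }
}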

Rather than relate all the nuances here, I refer the readers to the
USPTO and search patent applications with "IN/Alsup AND AN/Advanced".
I am rather sorry that the C code in the patent applications was run
through a word processor and did not retain the C indentation
structure of the original. A C beautifier can straighten this out to
near-normal. This lack of pretty-printability happened after I left AMD.

The sub-architecture is sufficiently powerful to allow one to move an
element from a doubly linked list and insert it into another doubly
linked list in a single atomic transaction, even when the pointers
contain addresses, sequence numbers, and state that may get copied or
modified as the atomic transaction proceeds. It is also possible to
insert and remove a data element from a structure where 3 sets of
doubly linked pointers must always remain consistent.

This sub-architecture allows the construction of atomic primitives
anywhere in the spectrum between fully locked and fully non-blocking.

Used carefully, this can enable creating atomic primitives that have
bigO(log(n)) worst-case latency {instead of bigO(n**2)}, but of similar
importance, many of the activities that hit the bigO(n**2) problems
with current atomic instructions run in bigO(3) in this system--things
such as a timer going off and all CPUs trying to extract a task from
the front of a run queue.

I call this a sub-architecture because it is more than just new
instructions: it carries an atomic memory model with it, adds
negative acknowledgement to the responses to cacheable accesses, and
it explicitly enters and explicitly exits atomic activity.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: Some confusion about virtual cache
Date: Tue, 26 Aug 2008 09:42:19 -0700 (PDT)
Message-ID: <e679d004-1e4e-4ae9-968a-a8b905d75b83@r35g2000prm.googlegroups.com>

On Aug 25, 2:39 pm, Terje Mathisen <terje.mathi...@hda.hydro.com>
wrote:
> Very interesting!
>
> It sounds like a hw engineer's solution to the "Transactional Memory"
> question?

A few years ago, I was charged with answering the question "Why don't
we (AMD) give MS the CompareTwoSwapTwo instruction they say they want?"

I looked into the matter and found that there was no mechanism in the
CPU, hostBridge, or transport fabric that could be used to cause two
cache lines to become and remain co-resident in the cache for the
handful of CPU cycles that would be needed to move the data around.

So, I invented a mechanism to do this, and it turns out that the same
mechanism would naturally extend to at least 7 outstanding memory
transactions. I put this mechanism in the cache miss buffers rather
than in the cache itself (as TM did), and this turns out to be the key
to making forward progress and avoiding all the cache coherency crap
that gets in the way of atomicity.

Secondarily, the mechanism works even when there are no caches in the
system whatsoever, and each memory update is delivered directly to
DRAM. Thus, it is architecturally clean (like the original caches were
architecturally clean -- invisible). The mechanism requires the
ability to NAK a coherent request for data, and dealing with the
fabric issues this creates.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: Some confusion about virtual cache
Date: Fri, 29 Aug 2008 13:25:24 -0700 (PDT)
Message-ID: <b1c6a672-3d9d-46d4-aa70-4199ce3c6c43@t54g2000hsg.googlegroups.com>

On Aug 28, 11:35 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "MitchAlsup" <MitchAl...@aol.com> wrote in message
> I don't have time to read the patent right now, but I was wondering how it
> compares with SUNS hardware implementation of limited HTM?

You can use my mechanism to (help) implement SUN's sTM stuff.

TM is for those applications that just want synchronization to work.
ASF is for those applications that want synchronization to work FAST.

>Also, how does
> your system attempt to prevent live-lock? Think along the lines of something
> like:
>
> http://groups.google.com/group/comp.programming.threads/msg/3a356c832..

To the issue raised at the given URL: ASCII-art extracted from same::

SP(0)-->NodeA(0)-->NodeB(0)-->NodeC(0)-->NodeD(0)-->NULL

With ASF one can 'monitor' the front of the queue SP(0) while the
application makes its way down the list, by making it look like
address[SP(0)] is participating in the atomic event at NodeD(0). Then
when NodeD(0) is found and the application attempts to do a NodeD(0)-
>NodeD(1) transition, anyone that has attempted to modify SP(0) will
prevent the NodeD(0) transtition. Thus the actual modification at
NodeD(0) will look like (in low level ASF form)::

while( TRUE )
{
    sp0 = &SP(0);
    watch( sp0 );                // Anyone who modifies SP(0) will
                                 // cause the subsequent ACQUIRE to fail
    for( i = p = sp0->next; p; p = p->next )
        if( IS_WHAT_IM_LOOKING_FOR( p->kind, kind ) )
        {
            watch( p );
            if( ACQUIRE(2) == SUCCESS )  // ATOMICITY
            {                            //  Note below
                sp0->next = i;           // Cause everyone else to fail--by
                                         //  writing the value it currently
                                         //  has back into it
                p->data = 1;
                RELEASE();
                goto done;               // leave both loops on success
            }
        }
}
done: ;

Note below:: we know at this point that this application has been
granted atomic access to both NodeD and SP, and we know that no one
has modified SP since we (started watch-ing it and) used it as
the starting point in our search for NodeD (even when the search takes
a lot of cycles). Neither SP nor any of the intermediate nodes need
to remain in the cache (nor does there need to be a cache in the
system); however, if anyone else does access SP (or NodeD), this CPU
needs to see a snoop to that/those addresses (which is no burden to
modern cache-coherent CPUs).

> A CAS can theoretically be subjected to live-lock, and the more compare
> locations you add, well, the worse it can get. How do you attempt to
> mitigate this issue?

I should just say:: read the patent applications, however I will go on
to say::

In a CAS based synchronization system, the application can see that
interference is happening (CAS failed), but the application has no
means to determine how much interference is happening, and has to
resort to things like exponential back-off to make forward progress.
Thus, software is measuring interference by trying again and again
and again.

In an ASF based synchronization, when your effective-CAS event fails,
you get a number telling you where you are in the line of applications
trying to synchronize on the same cache lines (0 means success). Thus,
you can use this to seed the exponential back-off stuff, OR you can
use this to access a different member of the data structure that
others will not be accessing. In order to determine which other member
of the data structure you might want, you use the failure code (which
is unique wrt the failure codes other applications received) to
participate in deciding which other element to access (for example:
the N-th item down the list). Since you got a unique failure code, the
probability of your atomic access at NodeN colliding with another
atomic access at NodeN is remote--and this is what drives ASF up from
bigO(3) to only bigO(log(n)) in total latency.
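
As a sketch (my illustration, not the patent code), the failure code
can direct the retry k items down the list, with the same stand-in
watch()/ACQUIRE()/RELEASE() primitives used elsewhere in this thread:

typedef struct node { struct node *next; long data; } node;

extern void watch( void *line );
extern int  ACQUIRE( int nlines );
extern void RELEASE( void );

node *take_with_failure_code( node **head )
{
    int k = 0;                        // 0 = try the front item
    for( ;; )
    {
        watch( head );                // any change to the head fails us
        node *prev = NULL, *p = *head;
        for( int i = 0; p && i < k; i++ )
        {
            prev = p;
            p = p->next;
        }
        if( !p ) { k = 0; continue; } // list shorter than k: start over
        if( prev ) watch( prev );     // the lines we will modify
        watch( p );
        int fail = ACQUIRE( prev ? 3 : 2 );
        if( fail == 0 )
        {
            if( prev ) prev->next = p->next;   // unlink the k-th item
            else       *head      = p->next;
            RELEASE();
            return p;
        }
        k = fail;                     // unique code: go k items deeper
    }
}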

Note: I am well aware that some data structures do not allow the
access of elements other than at either front or rear. These kinds of
data structures are doomed to have bigO(n**2) latencies in any
synchronization "kind". But all data structures DO NOT have to have
this property. Code wisely.

Thus, for the typical worst case synchronization event of "timer goes
off, and every CPU attempts to access the front of the run queue": in
the ASF system, everybody attempts the access the first time, all but
one fail, and all the failures are seeded with a unique failure code
(monotonically increasing at the point where interference is
detected). Then, using this failure code, they access the [n-th] item
on the run queue and find that nobody else is accessing it, remove it
from the queue, and presto:: N CPUs have removed N units of work from
a single queue in 3 trips (each) through the atomic section of code.
And N can be significantly large (16+).

If software does not use the failure code to spread interfering
accesses apart, the total latency will remain bigO(n**2).

Secondarily, one can do significant things that CAS and DCAS cannot do.
For example, one can remove an element from a doubly linked list and
insert it between items in another doubly linked list in a single
atomic event. Thus the concurrent data structure is never seen with
intermediate states, AND the total number of synchronizations goes
DOWN for the same amount of forward progress. In addition, it fully
supports the notions of time stamps and cursors as part of the pointer
being manipulated--and NOT just when the address, timestamp, and state
can be squished into 64 bits (quad) or 128 bits (double quad) of
storage. ASF allows software to build whatever data structure it
needs/wants/requires and makes only two stipulations: the "pointer"
has to fit in a cache line {64-ish bytes -- 4 double quads}, and the
total number of such pointers must be greater than 0 and less than 8.
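
A sketch of that move, with the same stand-in primitives (and
assuming, for brevity, that each element fits in a single cache line):

typedef struct delem { struct delem *next, *prev; } delem;

extern void watch( void *line );
extern int  ACQUIRE( int nlines );
extern void RELEASE( void );

// Unlink 'e' from its list and insert it after 'dst' in another list,
// as one atomic event over 5 participating elements. Returns 0 on
// interference; the caller re-traverses and retries.
int move_element( delem *e, delem *dst )
{
    watch( e );
    watch( dst );
    delem *p = e->prev, *n = e->next, *d = dst->next;
    watch( p );
    watch( n );
    watch( d );
    if( ACQUIRE(5) != 0 )
        return 0;                  // someone interfered: no damage done
    p->next   = n;   n->prev = p;  // unlink from the source list
    e->prev   = dst; e->next = d;  // link into the destination list
    dst->next = e;   d->prev = e;
    RELEASE();                     // both lists flip at one instant
    return 1;
}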

Mtich


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: Some confusion about virtual cache
Date: Sat, 30 Aug 2008 12:14:13 -0700 (PDT)
Message-ID: <40c70b11-c9f9-455c-a030-e7a801a71f5a@26g2000hsk.googlegroups.com>

On Aug 30, 2:35 am, Terje Mathisen <terje.mathi...@hda.hydro.com>
wrote:
> MitchAlsup wrote:
> > On Aug 28, 11:35 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> >> "MitchAlsup" <MitchAl...@aol.com> wrote in message
> >> I don't have time to read the patent right now, but I was wondering how it
> >> compares with SUNS hardware implementation of limited HTM?
>
> > You can use my mechanism to (help) implement SUNs sTM stuff.
>
> > TM is for those applications that just want synchronization to work.
> > ASF is for those applications that want synchronization to work FAST.
>
> Nice! :-)
>
Mark Moir from SUN came up with this one while we were presenting ASF
to them.
{snip}
> This is very nice indeed.
>
> My best pure sw idea has been to use XADD to allow any number of
> producers/consumers to access the same queue, and as long as the hw can
> guarantee forward progress, each client will receive a unique
> number/pointer/index as the result of the update.
>
> If the work items pointed to are all located in separate cache lines,
> there should be no subsequent collisions, right?

The general situation is that you want the element.next and
element.prev 'pointers' to be in different cache lines. Thus one can
be manipulating the element immediately before an element and the
element immediately after it simultaneously without collision.
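
In C11 terms, such a layout looks something like this (sizes
illustrative--a 64-byte line is assumed):

#include <stdalign.h>

typedef struct qelem
{
    alignas(64) struct qelem *next;  // line 1: touched by whoever
                                     //  edits the element before us
    alignas(64) struct qelem *prev;  // line 2: touched by whoever
                                     //  edits the element after us
    alignas(64) long data;           // line 3: the payload itself
} qelem;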

> The problem is in the memory hw which is needed to make the XADD work
> when a _lot_ of threads are trying to access the variable at the same time.
>
> I really like your idea of letting the single update attempt return a
> unique failure code, but I can't (easily) figure out how the hw
> requirements differ from an XADD-based approach?

There has to be somebody who does the interference counting, and this
is something that the CPU cannot do for itself; for the CPU is not in
a position to see all of the interference which may be in progress
before that CPU generates an address that will ultimately result in
interference.

> > Mtich
>
> Ouch. You know that aguy is in _real_ hurry when he misspells his own
> name. :-(

You got me there.....

On Aug 30, 3:04 am, n...@cus.cam.ac.uk (Nick Maclaren) wrote:
> A slight niggle: that's not true in general, even if it is under the
> circumstances being considered.  For example, all LIFOs and FIFOs are
> like that, and they can be implemented a lot more efficiently (even
> with parallel access).

OK, let's look at the FIFO situation.

Consider a FIFO that has at least n items in it, and n CPUs attempting
to take the front item. The current situation is that n-1 beat each
other up while one CPU (at random) gets the front element (along with
some beating), and the cycle continues with n-1 players. In effect,
even though the FIFO is strictly accessed only at its "front", there
is a random relationship between the n CPUs and the n FIFO items.
(That is, each CPU ends up with a random item from the FIFO.)

So, the question becomes: why cannot the n CPUs access deeper into the
FIFO, as long as the CPUs can only access/remove the front-most
n items?

As seen by the CPUs, they still get a random item from the "front" of
the FIFO--so the view from the CPUs remains the same.
As seen by the FIFO, the same number of items from the "front" can be
accessed simultaneously.
So, there are many data structure access methods that give the
appearance of having to be (a lot) more ordered than they really have
to be.

Yet, all of the accesses do not actually take place at the "front" of
the FIFO. It is this "liberation" from order that enables the
reduction from bigO(n**2) to bigO(log(n)) latency.
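
As a sketch of the front-most-n idea on an array-backed FIFO (my
illustration only--head advance, wrap-around, and the empty case are
omitted; watch()/ACQUIRE()/RELEASE() are the stand-ins used earlier):

#include <stdalign.h>

extern void watch( void *line );
extern int  ACQUIRE( int nlines );
extern void RELEASE( void );

typedef struct { alignas(64) long v; } slot;   // one item per line
typedef struct { long head; slot item[1024]; } fifo;

long fifo_take( fifo *f )
{
    int k = 0;                         // 0 = the true front of the FIFO
    for( ;; )
    {
        long h = f->head;              // where the front currently is
        watch( &f->item[h + k] );      // claim the (k+1)-th item
        int fail = ACQUIRE( 1 );
        if( fail == 0 )
        {
            long v = f->item[h + k].v;
            f->item[h + k].v = 0;      // mark the slot taken
            RELEASE();
            return v;                  // a "random" front-most item
        }
        k = fail;                      // unique code: reach deeper
    }
}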

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: Some confusion about virtual cache
Date: Sun, 31 Aug 2008 18:39:50 -0700 (PDT)
Message-ID: <fceb5195-b763-4e3e-a3ae-3d3340027aa4@d1g2000hsg.googlegroups.com>

On Aug 31, 11:39 am, EricP <ThatWouldBeTell...@theisland.com> wrote:

> I assume you are referring to the "Method for proactive
> synchronization within a computer system" and friends patents.

Yep.

> I looked at these patents but it was not immediately obvious
> to me why it is guaranteed to make forward progress.

The guarantee of forward progress is massively stronger than in other
synchronization systems, but not absolute.

> I probably misunderstood the mechanism, but it sounded like the cache
> queue contains monitor logic (e.g. a CAM plus some counters) that
> track the number of requests for a collection of addresses.

It's the miss buffers that keep track of the lines participating in
the atomic event.

> The program issues multiple LOCK prefix instructions to modify a
> set of addresses, and the addresses and new values are stored in
> the monitor logic.

The LOCK prefix merely indicates that this address is potentially
participating in an atomic event.

> The program issues an ACQUIRE instruction and
> if no one else is also updating *any* of the addresses in the set
> then the whole batch of new values is committed and ACQUIRE returns
> success, otherwise failure and a counter of the contenders.

The way you have phrased your question is not a correct ASF event
sequence. ACQUIRE only marks the start of CDS modification, while
RELEASE marks the end. Bad things happening between these two markers
cause the atomic event to fail, and control flow flows out of the
ACQUIRE instruction with a failure code. ACQUIRE returns success
BEFORE you have damaged anything. It is your key to go and do the
damage. RELEASE is the indication that you are done and want the rest
of the system to see the new state of the participating lines.

LOCKed memrefs announce participation.
Once all the participants are announced, you execute an ACQUIRE
instruction followed immediately by a jnz instruction.
If ACQUIRE succeeds, then you start modifying the participating data;
otherwise control leaves the atomic event.
When done modifying the participating data, you execute a RELEASE
instruction.
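
In C shape, the sequence is roughly as follows (a sketch; lock_load()
stands in for a LOCKed memref, and the if() plays the role of the
jnz):

extern long lock_load( long *addr );  // LOCKed memref: announces a
                                      //  participating line
extern int  ACQUIRE( int nlines );    // 0 = success, else failure code
extern void RELEASE( void );

int asf_event( long *a, long *b )
{
    long va = lock_load( a );         // announce the participants
    long vb = lock_load( b );
    if( ACQUIRE(2) != 0 )             // the jnz: control leaves the
        return 0;                     //  atomic event on failure
    *a = va + 1;                      // modify the participating data
    *b = vb - 1;
    RELEASE();                        // the rest of the system now
    return 1;                         //  sees the new state
}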

If an exception occurs after ACQUIRE and before RELEASE, the
participating cache lines are rolled back making it look like the
event never took place.

If ACQUIRE returns success, certain pieces of HW operate differently,
preventing the HW livelock situations.

Malformed ASF software may create its own SW livelock--and there is
nothing HW can do about it (nor should it).

But if you use the concurrent data structure (CDS) philosophy of: A)
start at the front of the CDS, B) traverse until you find the item of
merit, C) attempt to manipulate the CDS at the point of merit
(ACQUIRE), D) on failure go back to A, E) on success proceed through
the CDS modification code, F) announce you are done (RELEASE).

I have reworded your question in a correct manner:

> The program issues an ACQUIRE instruction and is given the success
> indication and enters the CDS modification block. If no one else is also
> updating *any* of the addresses in the set then the whole batch of new
> values is made visible to the system at the conclusion of the RELEASE
> instruction.

> Why wouldn't that just livelock?

Can you digest what I have written and come back with a refined
question?

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: Some confusion about virtual cache
Date: Mon, 1 Sep 2008 12:19:27 -0700 (PDT)
Message-ID: <76f15a9b-6a04-4526-8511-843c0a52a4f9@w7g2000hsa.googlegroups.com>

First of all, I would like to thank Mike for stepping in with a new
paper and data that I had never seen before.

On Sep 1, 10:20 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> MitchAlsup wrote:
> > On Aug 31, 11:39 am, EricP <ThatWouldBeTell...@theisland.com> wrote:
> > LOCKed memrefs announce participation
> > Once all the participants are announced you execute an ACQUIRE
> > instruction followed immediately by a jnz instruction
> > If ACQUIRE succeeds, then you start modifying the participating data;
> > otherwise control leaves the atomic event
> > When done modifying the participating data, you execute a RELEASE
> > instruction.
>
> > If an exception occurs after ACQUIRE and before RELEASE, the
> > participating cache lines are rolled back making it look like the
> > event never took place.
>
> Yes, and for interrupts too. It looks like it takes a replay trap
> back to the ACQUIRE and then re-executes it but with a Fail status.

Interrupts are deferred from the first LOCKed memref for a number of
cycles under OS control. This gives the ASF event time to make forward
progress, but not so long as to seriously affect interrupt performance.
When I left, the watch-dog counter was going to default to one
absolutely worst case DRAM access across the fabric with a full memory
controller queue at that node (about 5-ish µs if I remember
correctly). The counter can be set to 0, and then interrupts are not
deferred.

> That would seem to set limits on the amount of code between
> ACQUIRE and RELEASE. It would also seem to preclude non OoO
> cpus from implementations.

Nothing prevents an in-order machine from doing ASF events. Nothing in
ASF even requires that a data cache be present. All that IS required
is at least 9 miss buffers {address and line of data}, so that TLB
misses and instruction fetches can continue while participating lines
are "watched". And there has been absolutely no relief from the
general principle that an atomic event should contain as LITTLE work
as possible. ASF just makes the amount of work possible bigger than
DCAS. But ASF does not anticipate the application ever needing more
than a few handfuls of instructions or a couple of branches to perform
the atomic event. ASF is NOT TM. If you need more than a few handfuls
of instructions, use TM, not ASF.

> Would having Release also return a Pass/Fail condition code,
> in addition to Acquire, allow non OoO implementations?

We had one proponent of this direction. Neither I nor the other lead
liked that direction. I don't know whether the version Mike brings
to the table did this or not.

> A key phrase in getting this to work is the clause
>
> "[0055]... After a passed cache line arrives, processor core 18 may hold
> onto that cache line and prevent other processor cores from stealing the
> line by responding to coherent invalidation probes with
> Fail-to-REQUESTOR responses"

The NAK.

> which in turn causes others Acquire attempt to fail.
> (And then I see this was worthy of a whole patent itself:
> "Method for denying probes during proactive synchronization within
> a computer system".)
>
> I also notice that there is a single per-system Synchronization Arbiter
> to resolve disputes over sets of ACQUIRE addresses.

Nope, there is a single SO (Synchronization Observer) per node in the
system (each chip has an SO). OS software can decide to route all ASF
messages to a single SO, or to route different applications' ASF
messages to different SOs. Virtual machines that do not share pages,
or applications that do not share pages on which atomic events are
being performed, can be routed to different SOs. Thus, if you are not
performing atomic activities on pages in the disk cache, you can share
these pages and still direct the ASF messages to different SOs. There
is a lot of SW thinking left to be done here. And none of it has yet
been done (or made public).

It was anticipated that each SO would contain 64 entries, so for the
kind of stuff DCAS is good for, a single SO could allow 32 independent
atomic events to be simultaneously in progress. The biggest ASF event
we thought was "pretty cool" {that of moving a CDS element from one
place to another in a single atomic event} takes 5 participating
lines, so the single SO can manage as many as 12 of these at one
instant.

> So that seems to answer my double linked list livelock question.
> After one cpu has Acquired its address set, the Sync Arbiter prevents
> the ping-ponging, and these NAKs allow a cpu to hold multiple cache
> lines until it is finished the critical section.

Yep, that is why the guarantee of forward progress is greatly
strengthened.

But to be absolutely fair, the bigO(n**2) ping-ponging still takes
place, but it takes place with one 64-byte and one 64-bit incoherent
message rather than with dozens of coherent messages and a handful of
data movements in the fabric with all sorts of ordering constraints.
This reduces the amount of fabric traffic in a 4-node system by around
8X (it's all addresses rather than cache lines).

Of similar importance is that none of these messages arrive at the
memory controllers' queues or interact with the states of the DRAM
controllers. One way of thinking about this is that this
synchronization method works on addresses rather than on the bit-
pattern of a unit of data associated with an address. Thus the
synchronizing agents are not impeding other benign processes while
attempting the atomic event.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: Some confusion about virtual cache
Date: Thu, 4 Sep 2008 10:34:44 -0700 (PDT)
Message-ID: <81768b1d-bc68-4051-baa4-50e647dd686a@r15g2000prd.googlegroups.com>

On Sep 3, 2:46 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> My interest is primarily in the mechanism. The only comment I'd have
> is that anything that allows 7 or 8 addresses to be updated atomically
> is definitely an improvement and I'm sure it would find many uses.
>
> WRT deterministic vs optimistic, in reading between the lines of the
> descriptions, I got the impression the distinction was whether ACQUIRE
> waited for a response or not. This lead me to wonder about the latency
> of the Acquire operation and if this was a synchronous vs async issue.

Basically, in optimistic mode, the CPU does the checks for
interference at COMMIT--if any interference has happened, it undoes
any damage and exits at ACQUIRE with a failure code.
While in deterministic mode, the CPU gets permission at ACQUIRE, and
will run to completion if exceptions are not generated.

But there is a subtlety that has escaped everyone: in optimistic mode,
you only get as many instructions in the atomic section as you have
room in the execution window. So there are spurious failures in
optimistic mode; for example, if you use more than 24 instructions, or
overflow LS2 (in Opteron) with stores, then the atomic section will
fail back to ACQUIRE. Since different machines may have different
sized windows and store buffering, it would be unwise to expose these
to SW. This led us back to having a deterministic mode that can
handle an essentially-unbounded number of instructions and memrefs in
the atomic section.

> I'm also thinking that optimistic code needs to be more capable of
> handling interference and performing a graceful back off and retry,
> whereas deterministic code would be more straight forward.

Optimistic mode is almost as if there were no atomic checks in the
code at all. ACQUIRE takes 1 clock; COMMIT takes only a few (to ask
the BIU about interference).

Deterministic mode bundles up the participating cache line addresses
and ships them to a synchronization observer (no closer than 12
nanoseconds away, or 24 nanoseconds round trip--with a 4-node system
averaging 30-ish nanoseconds round trip as seen at a random CPU). So
this mode always incurs a fabric-crossing delay, even if all the data
is sitting in its caches.

Thus, by having a predictor, the typical lightly loaded
synchronizations can run as fast as or faster than the simple
LOCKs/CASs that already exist. This gives ASF the ability to compete
in performance even at the simplest levels of synchronization--and
allows microcode a new avenue to make the already existing
instructions faster. But when interference IS detected, everyone is
switched to the SO so that the bigO(n**2) effects are ameliorated. But
you don't want the software model to change between optimistic and
deterministic.

So, the expected general use pattern is to make a try in optimistic
mode, and then, when this fails, make a second try in deterministic
mode. The failure codes at ACQUIRE are designed to support this
notion, and the code neither "looks nor smells" different at the
instruction level, while "looking and smelling" considerably different
in the HW gyrations needed to pull it all off.

> The patent refers to deterministic vs optimistic being selected by the
> BIOS or OS or virtual machine. I don't think that would work out well.
> Whether optimistic is tolerable or not would be dependent on the way
> particular routine was written. If an OS was forced to choose, it
> would always choose the most restrictive which would be deterministic.
> So as a system wide option, this need not exist.

First, the BIOS/OS/VMM can select between always optimistic, always
deterministic, and a predictor (or turn the facility off), separately
for privileged and for unprivileged code.

The OS may want to choose something for the application(s) and
something different for itself!

> If it is a synchronous vs async issue, then it would be better to
> have two Acquire instructions so that those routines that are written
> to perform a more complex back off and retry without ping ponging can
> select the async Acquire and gain some performance by being optimistic.

No, you want the software model (what coders code to) to be the same
in either mode.

But the thing here is that, for the typical manipulation of a data
element within CDS kinds of synchronizations, ASF does not require the
software writer to code back-offs (complex or not). ASF wants the SW
code to take a new look at the CDS (pointers) and use the failure code
to select a different element that has a much lower probability of
suffering from interference, and give that one a try.

> However people like me would just want things to work obviously
> would choose the deterministic version.

In optimistic mode, with the data sitting in the cache, one can
perform a CAS-like atomic event in a dozen cycles (4 nanoseconds).
In deterministic mode, with the data sitting in the cache, one can
perform a CAS-like atomic event in 30 nanoseconds (best case--with
fabric overhead additional).

Mitch

P.S. Also note that the SO messages contain physical addresses. So if
you launch one program 18 times, the individual programs will not
interfere with each other unless they actually share memory and
attempt atomic accesses on that shared memory that happen to collide.
Nor are these physical addresses ever exposed to the running programs.


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: Some confusion about virtual cache
Date: Fri, 5 Sep 2008 10:02:37 -0700 (PDT)
Message-ID: <dc75f9f6-531f-4b08-a49d-1463f4b7b949@k7g2000hsd.googlegroups.com>

On Sep 5, 11:25 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> MitchAlsup wrote:
> > Basically, in optimistic mode, the CPU does the checks for
> > interference at COMMIT--if any interference has happened undo any
> > damage and exit at ACQUIRE with a failure code.
> > While in deterministic mode, the CPU gets permission at ACQUIRE. ANd
> > will run to completion if exceptions are not generated.
>
> It sounds like both Deterministic and Optimistic modes would be
> interfered with if an Acquired cache line is accessed by another cpu
> or DMA, because, for example, some data object not logically guarded
> by that Acquire shared the cache line and it was either read or written.

Define CPU A as the one doing the atomic event.
Define CPU B as the one making the read.
Define IO C as an innocent bystander making a read.

With CPU A optimistically performing an atomic event on line L and
between ACQUIRE and COMMIT, CPU B makes an access to line L. In this
case, CPU B gets the old value of line L, and CPU A aborts the atomic
event and rolls back any protected cache lines and takes an error exit
through the ACQUIRE instruction, and enters deterministic mode.

With CPU A deterministically performing an atomic event on line L and
between ACQUIRE and COMMIT, CPU B makes an access to line L. In this
case, CPU B gets a NAK, while CPU A continues to plug along on the
atomic event.

IO is a special case, and the receiving CPU can see that the request
came from IO. IO is special because it is not coherent, so special
rules apply that don't apply to CPUs (and their caches). In this case
CPU A aborts the atomic event and exits through ACQUIRE with a failure
code; the IO request sees the now-current bit pattern in cache/memory,
its request is fulfilled, and it goes away happy.

> (I include read access in the above because the mechanism is trying to
> ensure globally coherent access to all cache lines in an update set.
> So if any uncoordinated read occurs, the system must present a before
> image of the entire update set, which presumably triggers an undo
> and retry).
>
> So one must ensure that guarded objects don't share cache
> lines with any unrelated data objects.

I might ask why one has IO scheduled to a line that is subject to
atomic events....

(snip)

> >> If it is a synchronous vs async issue, then it would be better to
> >> have two Acquire instructions so that those routines that are written
> >> to perform a more complex back off and retry without ping ponging can
> >> select the async Acquire and gain some performance by being optimistic..
>
> > No, you want the software model (what coders code to) to be the same
> > in either mode.
>
> Ok, if the mode switch-over is automatic that's even better.
> I just didn't see how optimistic mode could guarantee forward progress.

It can't, and that is why CAS, DCAS, and Locked Load/Store Conditional
based stuff cannot get the job done. (i.e., it doesn't scale)

> I can see certain usage patterns on certain data structures
> would cause optimistic mode to behave pathologically,
> which means it can't be set at a system wide level (which it isn't).
>
> > But the thing, here, is that for the typical mode data element within
> > CDS kinds of synchronizations, ASF does not require the software
> > writer to code back-offs (complex or not). ASF wants the SW code to
> > take a new look at the CDS (pointers) and use the failure code to
> > select a different element that has much lower probability of
> > suffering from interference, and give that one a try.
>
> This will sound ignorant, but what's a CDS?

Concurrent Data Structure

> Anyway, in the kind of usage I might have for ASF would be making
> lock free list and tree management a realistic consideration
> by lowering its complexity.
> However the operations are dictated by the app requirements that
> it insert/remove/update a particular data object and it can't
> really go off and work on something else if there is a collision.
> It just has to hang around a retry and thus my concern about
> ping pongs.

So you get hit with interference, and by the time you can look at the
structure, it has changed under your nose. So you go back to a safe
point (like the top) and then traverse into the structure to re-find
where you want to do the insert/extract/update, and try again.

Or you can do what CAS-based stuff has been doing all along--keep
pounding on the CDS until you get your increment of change performed.

Think of the failure code as something you can ignore (except for
atomic activities) or use to your benefit.

> So in that context I'm not exactly sure how the failure code could
> be used especially if fair, mostly-fifo access is desired.

You can use the failure code to seed a back-off timer. So rather than
accumulating the needed back-off time (a la CAS-based with exponential
back-off), one can use the failure code directly
(failure*BACK_OFF_LOOP_COUNT) or indirectly via a table
(BackOffLoopTime[failure]) to determine the time/cycles to wait.
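
In code form (constants illustrative; spin() merely burns cycles):

#define BACK_OFF_LOOP_COUNT 64
static const int BackOffLoopTime[8] =
    { 0, 64, 128, 256, 512, 1024, 2048, 4096 };

static void spin( int cycles )
{
    for( volatile int i = 0; i < cycles; i++ )
        ;                            // burn time without memory traffic
}

void back_off( int failure )
{
    spin( failure * BACK_OFF_LOOP_COUNT );    // direct seeding
 // spin( BackOffLoopTime[failure & 7] );     // or table-driven seeding
}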

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD working on scaleable hardware-based atomic transactions...
Date: Sat, 6 Sep 2008 09:46:33 -0700 (PDT)
Message-ID: <0a4356fa-1f4d-43af-a79f-78f3402b8a10@m36g2000hse.googlegroups.com>

On Sep 4, 3:48 am, n...@cus.cam.ac.uk (Nick Maclaren) wrote:
> Let's consider language architectures and how they interact with this
> feature.
>
> In several language standards, I have argued vigorously that features
> like volatile (C, C++ and Fortran) should be an attribute of an
> object's actual definition (i.e. might control where it is placed)
> and not merely its declaration (i.e. how it is used).  The reasons
> for arguing that should be obvious in the context of this design.
> I have lost, almost uniformly.
>
> In all of those languages, you can add the volatile attribute to
> objects not declared as volatile, with only a few restrictions on
> parallel access as volatile and non-volatile.  They are enough
> restrictions to allow an implementation to forbid such access (sic)
> in C++ and Fortran, with some loss of optimisation, but it's still a
> bit iffy.
>
> Now, my understanding is that any non-protected access to any protected
> memory location between an ACQUIRE and COMMIT (whether read or write)
> leads to undefined behaviour.  Is that so?  If not, what happens?

In order for ASF to give the illusion of atomicity, there can be no
visibility into the protected cache lines between ACQUIRE and COMMIT.
Visibility remains to the unprotected cache lines. In ASF you specify
which cache lines are protected, and these lines are treated in a
special way; everything else remains 'normal' cache-coherent memory
and retains its visibility. In effect, the HW is attaching volatile-
like semantics and then removing volatile-like semantics on an
instruction-by-instruction basis.

So, when you take the error exit at ACQUIRE, you cannot be sure
whether you have executed any of the instructions between ACQUIRE and
COMMIT or not. At the compiler level, each instruction between
ACQUIRE and COMMIT has an implicit back edge to ACQUIRE (as if each
instruction had a conditional branch as part of that instruction).
Other interested parties will not see any intermediate state in
protected lines, but any other memory modified while in the code
between ACQUIRE and COMMIT may be seen (or not).

And, in fact, at least while I was there, that is how you debug an ASF
event. You define a buffer in local memory (say, your stack) and,
while making progress through the atomic event, you lob various pieces
of data into this buffer; if you end up taking the error exit from
ACQUIRE, the data that did arrive in this buffer tells you how far
through the atomic event you got before trouble happened. So at least
you have a chance of understanding what went on.
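
The idiom looks something like this (a sketch with the stand-in
primitives from earlier posts; the trace buffer is deliberately
unprotected):

extern void watch( void *line );
extern int  ACQUIRE( int nlines );
extern void RELEASE( void );

void traced_event( long *a, long *b )
{
    long trace[8];                    // local, unprotected, per-thread
    int  t;
    for( ;; )
    {
        t = 0;
        watch( a );
        watch( b );
        trace[t++] = 1;               // got here: participants announced
        if( ACQUIRE(2) == 0 )
        {
            trace[t++] = 2;           // got here: inside the event
            *a += 1;
            trace[t++] = 3;           // got here: first store done
            *b -= 1;
            RELEASE();
            return;
        }
        // On the failure exit, whatever made it into trace[] shows how
        // far the attempt got (remembering that unprotected stores may
        // be undone in optimistic mode).
    }
}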

This leaves the door open for software to define this buffer in shared
memory (a bad idea), or to pass a pointer to it through shared memory
(even worse), which allows another application visibility into this
buffer while there is an atomic event lobbing data into it. As a HW
person, I revert to the "no programming language can prevent bad
programming practices" line of reasoning. SW can prevent this by
allocating these buffers away from shared memory; thereby, it's not a
HW problem. Atomicity is maintained on the protected cache lines--
nothing else is guaranteed.

Notice that you simply cannot single-step through an atomic event and
retain any illusion of atomicity*. This is why debug exceptions are
suppressed during the atomic event. Basically, you can be single-
stepping up to the first LOCKed load, and then on the next step you
will find yourself either at the first instruction following COMMIT or
at the JNZ following the ACQUIRE with a failure code in the specified
register. Thus, this gives the illusion of atomicity even to the
single-stepping program.

Thus writes to unprotected memory may be undone (as will likely happen
in optimistic mode) or may persist (as needs to happen in
deterministic mode). So, what happens is not 'undefined', but it is
defined in a manner that current programming languages do not have
semantics to express.

> I hope this gets established, because this is an area where the
> language standards need to stop using the criterion "We can't see
> why it won't work, so let's allow it" and go back to the old one of
> "We can't see why it will work, so let's forbid it".
>
> Regards,
> Nick Maclaren.

Mitch

(*) We discussed a debugger that, when it encounters an ASF event,
would find all other applications that share memory and put them in a
stopped state; then you can single-step through an atomic event and
retain the illusion of atomicity (at least from CPUs). And when the
ASF event was complete, the other programs would be put back in a
running state.


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Wed, 29 Apr 2009 13:05:52 -0700 (PDT)
Message-ID: <9a68f85b-06ed-4363-8338-7110fc71bfaf@x1g2000prh.googlegroups.com>

Well, hooray, my invention has finally seen the light of day.

AMD have diddled with the instruction ordering (and the spelling of
the instruction names), but most of the original intent remains.

As far as I can tell, they have lost the ability to beat the
bigO(n**2) problem in DCAS-based nonBlocking algorithms. This is shown
in table 6.2.1. Any downgrade of cache line status aborts the CPU that
has potentially made the most progress. When I looked into this, this
kind of scenario leads to livelock under certain timing arrangements
between CPUs/cores/Nodes.

So: Chris Dodd is correct in his observation.

Eric writes:
"The key is that the cache coherency bus protocol is enhanced
to allow cache line read requests to be NAK'ed. "

Yes, indeed, that WAS the key; however, table 6.2.1 indicates this NAK
is not present in this version of ASF. And one can ONLY hope that it
is restored as time marches on.

"It might be nicer if a writer could set a guard flag on the
cache line so that readers would could see an update was
in progress and spin wait without disrupting the transaction. "

That remains possible. For example, by setting a 'LOCK' in advance of
the data structures being manipulated (like in the queue header).
However, at this point it's time for real software writers to delve in
and see what works, what is idiosyncratic, and what needs work.

DCAS, TriCAS, and QuadCAS are certainly possible, with 'references' as
big as 64 bytes each. That is, a reference can contain a pointer, a
timestamp, a cpuID, and ancillary information that prevents the A-B-A
synchronization problem.
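
Such a 'reference' might be laid out like so (fields illustrative; the
whole thing fits in one 64-byte line):

#include <stdint.h>
#include <stdalign.h>

typedef struct reference
{
    alignas(64) void *ptr;   // the pointer proper
    uint64_t timestamp;      // defeats the A-B-A hazard on reuse
    uint32_t cpuID;          // who last manipulated it
    uint32_t state;          // ancillary state bits
                             // (padded by alignment to 64 bytes)
} reference;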

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Fri, 1 May 2009 10:30:45 -0700 (PDT)
Message-ID: <88cb8192-58a7-4d5e-94e2-44348271519a@f19g2000yqo.googlegroups.com>

On May 1, 10:04 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
>
> > Well, horay, my invention has finally seen the light of day.
>
> I wanted to ask some question on AMD forum since they are asking for
> feedback. But since Mitch is here I will ask here.
>
> Table 6.2.1. postulates that if CPU B holds cache-line in Protected
> Owned state, and CPU A make plain (non-transactional) read, then CPU B
> aborts.
> Why it's necessary to abort CPU B?

You have lost the illusion of atomicity on that cache line. Thus, it's
time to abort and retry (potentially later).

ASF is not guaranteeing atomicity; it is guaranteeing the ILLUSION of
atomicity.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Fri, 1 May 2009 12:54:03 -0700 (PDT)
Message-ID: <158108bb-ee47-4159-8690-226b63559009@37g2000yqp.googlegroups.com>

On May 1, 12:51 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On May 1, 10:30 am, MitchAlsup <MitchAl...@aol.com> wrote:
> > You have lost the illusion of atomicity on that cache line. Thus, its
> > time to abort and retry (potentially later).
>
> For whom exactly I lost illusion of atomicity?

Interested third parties (like a DMA device, graphics, or other
threads) may have seen the cache line. So, you must revert back to the
pre-atomic-event state. At this point, it's simply easier, and less
perilous, to abandon the current critical section, take a fresh look
at the concurrent data structure, and then figure out the best course
of events from here onwards.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Fri, 1 May 2009 12:57:00 -0700 (PDT)
Message-ID: <47873841-02f8-455d-afbd-b66b120b68a6@d14g2000yql.googlegroups.com>

On May 1, 12:51 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On May 1, 10:30 am, MitchAlsup <MitchAl...@aol.com> wrote:
> > ASF is not guarenteeing atomicity, it is guarenteeing the ILLUSION of
> > atomicity.
>
> Isn't illusion of atomicity IS-A atomicity? :)

Sort of depends on whether you are thinking at the instruction set
level or at the clock-by-clock level.

At the clock-by-clock level, once you attempt to provide an atomic
event that is wider (in number of stores) than one can perform stores
on a per-clock basis, the illusion of atomicity is all that can be
provided.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Fri, 1 May 2009 09:30:50 -0700 (PDT)
Message-ID: <b3297150-6701-4875-ad88-490440bf17af@z8g2000prd.googlegroups.com>

On May 1, 10:13 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> Why it's necessary to use LOCK MOV, why all reads and writes inside
> transactional region may not be treated as transactional?

A good question.

But consider that one must end up dereferencing memory references to
set up the cache lines for the critical section. And notice that there
is a strict limit to the number of protected cache lines. Thus, if
there is any additional level of indirection, you run out of
protectable resources.

Secondly, in order to debug these critical regions, one must build a
thread-safe memory buffer and dump critical-region state into same so
that you can print it out (or otherwise examine it) after success or
failure. You really (REALLY) do not want these to be protected and
rolled back on failure.

Thus, the limited number of protectable lines, and the absolute need
to be able to figure out what went on inside the critical region,
prevent treating everything as protected.

In my opinion, this is a good thing, a very good thing.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Fri, 1 May 2009 10:38:59 -0700 (PDT)
Message-ID: <6bf0cac1-8a93-4c70-adff-a0a1897b2257@k2g2000yql.googlegroups.com>

On May 1, 10:13 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On Apr 29, 1:05 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> With AMD's approach (LOCK MOV) I will have to rewrite all code in
> asm... not very cool...

AMD has worked with SUN, and we jointly came to the conclusion that
ASF would be a powerful tool to make transactional memory work
better/faster.

I think it was Mark Moir who said "Transactional memory is for people
who just want synchronization to work", at which point I said "ASF is
for people who want synchronization to work fast", and we all had a
good chuckle. There is a completely different philosophy here. ASF is
for coders who understand atomicity, synchronization, nonBlocking,
WaitFree, and the intricate techniques involved therein, and who are
willing to work for their performance. TM maybe/is for everybody else.

If you like TM, write in TM; only if TM is not giving you the
performance you are looking for, then selectively recode those time-
critical sections in ASF to beat down the costs.

I also think that over 5-ish years SW people will come up with better
abstractions than we did, so as to present SW coders with powerful,
easier-to-use primitives. This is why I spent so much effort building
primitives rather than solutions (DCAS, TCAS, QCAS).

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Fri, 1 May 2009 12:51:22 -0700 (PDT)
Message-ID: <8e8742e5-6b44-4307-bf77-e4c0084805ea@e21g2000yqb.googlegroups.com>

On May 1, 1:11 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On May 1, 10:38 am, MitchAlsup <MitchAl...@aol.com> wrote:
> > I also think that over 5-ish years SW people will come up with better
> > abstractions than did we so as to present SW coders with powerful more
> > easy to use primitives. This is why I spend so much effort building
> > primitives than solutions (DCAS,TCAS,QCAS).
>
> If some design decision is a trade-off decision which affects
> performance, then I vote with 3 hands for solution which provides more
> performance/scalability.

What would you do if the intellectual burden were such that, after you
developed/exposed those primitives, only 2 dozen people in the world
understood the subtleties to the point needed to code reliable
synchronizations? We stumbled over more than a few such possibilities
uncovered along the way.

>                                   Well. but why not just provide UNLOCKED MOV
> instead, this does not sacrifice performance (at my first glance).

We talked about this in the previous thread. Placing synchronization/
atomicity on top of the cache coherence protocol makes complexity go
through the roof. Placing it beside the CC protocols only makes it
really hard to get right. You are playing HW games with the cache
coherence protocol. The cache IS actually incoherent for a short
period of time, on exactly those protected lines, and in addition
there is the potential that the memory request protocol is different/
enhanced/more-dangerous/more-complicated during part of that period of
time. This leaves vast windows-in-time where you can screw up royally
and not have the ability to recover (the architecture is fundamentally
broken). What we are talking about here, however, comes after all
these subtleties (at the 99.9%-ish level) have been discovered, talked
about, dealt with, applied in (demonstration-like) practice, and
written down.

> I have to thank you for such a great job. Especially I am happy that
> it was in the "x86 company" :)

Thanks.

> RELEASE feature is super-cool! Something which even todays STM lack
> of. I was talking with developers of Intel STM about "partial commits"
> or "partly overlapping transactions", not sure whether they are going
> to support them...

I have to defer credit to Dave Christie, here.

> Btw, here is one question I have to ask. AMD was working with SUN, but
> did AMD have some discussions with Intel on this (about support in
> Intel processors)?  IMVHO such initiatives must be a join effort of
> both x86 giants. Not sure whether Intel had discussions with AMD about
> AVX...

The necessities of non-contamination of Intellectual Property mean
that AMD and Intel cannot work together until something is at the
level of the just-issued specification. However, I am 99.44% sure that
a few people within Intel were aware of the general nature of this
effort, as we were aware of several TM-like investigations inside
Intel. This is how NonDisclosure agreements work in practice. IP stuff
is not directly disclosed, but the questions from NDA signers are
phrased in such a way that the NDA-holding party does end up
understanding that somebody else is looking at <blah>. And generally
one can google up enough intermediate data to infer what the other
party may be thinking about.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Thu, 30 Apr 2009 10:58:59 -0700 (PDT)
Message-ID: <cb03316d-c3b5-4519-a016-5d34a913c469@v1g2000prd.googlegroups.com>

On Apr 30, 8:28 am, Mayan Moudgill <ma...@bestweb.net> wrote:
> They appear to have mated 4 ideas:

Actually, the original intent was "How do we give Microsoft the DCAS
they want?". Which degrades into: how do we guarantee in the processor
to have multiple cache lines co-resident for several-to-many cycles,
allow SW to express a number of manipulations on those cache lines,
and make it appear to be ATOMIC? Having it smell a little like LL and
SC was completely coincidental. We did not recognize until
significantly later that LL and SC could be easily synthesized from
the primitives we created--we were targeting more DCAS-like stuff than
LL-SC stuff.

> 1. Do (strong) load-link/store-conditional with multiple link addresses

The stores are not conditional all by themselves; SW has to create the
conditionality around the stores using compares and conditional
branches.

> 2. Do all the store-conditionals atomically

All locked stores are made to appear to have been performed at one
instant in time. The instant before: none visible. The instant after:
all visible.

> 3. If the block store-conditional fails, partially restore modified
> state to a checkpointed position.

There are all sorts of other failure modes, including spurious ones.
HW guarantees to restore 100% of the protected state; HW does not
guarantee to restore, or leave alone, the unprotected state. HW does
guarantee that unprotected state remains visible to third-party
observers in causal memory order.

> 4. Fail to checkpointed position as soon as it is known that a
> reservation has been violated.

Falling back to the checkpoint may not happen "as soon as known",
especially if you are thinking about the architecture as presented by
the debugger to the SW person. Failure will be detected before COMMIT.

> Now, based on their initial motivation (atomic updates to various
> regions), why don't they:
> - lock: lock an address:
>         - force line to E (writing back any modified lines)
>         - may always succeed or
>         - may fail, returning a success/failure indication.
> - while a cache-line is locked, any requests to that cache line get deferred

This leads to unsolvable deadlock/livelock issues--depending on the
protocol on the interconnect(s). And it created massive headaches for
interrupts, and more than a few issues for exceptions.

> - invalidate: flush the locked cache lines.
>         - force cache line from M to I.
> - unlock: remove locks on all locked lines
>         - can augment the ALL variant with per address variants.
>         - an unlock can fail: in this case unlock = invalidate
>         - unlock returns success/failure

<snip>

> A straight forward implementation would add an extra state to the cache;
> converting MESI to SMILE. The extra transitions are:
>         * lock: force to E then to L (will cause writeback if M)
>         * unlock:  L->M
>         * invalidate: L->I
> Alternatively, add another state U (=modified,locked). In that case
>         * on store: L->U, U->U
>         * unlock: L->E, U-> M
>         * invalidate: L->E U->I
> This has its pros and cons. The biggest drawback is that the cache state
> logic has to be able to update multiple entries on unlock and invalidate
> at once.

This requires a search (directed or not) through the cache to modify
the states after a successful conclusion. See below.

>               Another negative is that to guarantee 4 lockable cache-lines,
> one must have at least 5-way associative caches, or a regular cache
> paired with a smaller but higher associativity cache (e.g. victim cache).

ASF does this whole thing in the memory access (miss) buffers** and
can do the whole visibility thing in one clock. Nor does ASF have to
do anything other than make it all visible (no flushes or cache-state
changes,...). This makes it a lot easier on the HW. And it is not
associated with the associativity of the cache in any way, shape, or
form. In fact, ASF works even if there is no cache in the whole
system.

(**) this is where the limitation on the number of protected lines
comes from. Even a processor without a cache will have miss buffers.

----------------------------------------------

Back when I was working on this, we were intending to support 8
protected lines (with 7 as a back up). There are useful atomic
primitives that use 5 protected lines (moving a concurrent data
structure (CDS) element from one place in a CDS to another in a single
atomic event.)

And here is where I think this current ASF architecture is screwed
up(*). This atomic move basically requires finding the element to be
moved (a many-multi-cycle process), then finding a place to put it
(another many-multi-cycle process), and then doing the move atomically.
The original ASF allowed one to lock the protected cache line(s) as
they got found, before entering the atomic region. At entry to the
atomic region, if the protected cache lines have been downgraded from
the cache, the entry fails; otherwise a few SW checks are performed,
and then the atomic operations are performed.

This current ASF requires either that the atomic region be very much
longer (wrapped around both searches) or that each search move a bunch
of state to a thread-safe memory buffer to be compared later
against the current state(s) inside the atomic region, before
modifying the CDS state atomically. By deferring the
entry into the atomic phase until after SW has determined which lines
are participating, but leaving HW watching for interference in the
meantime, the duration of the atomic region is made significantly
smaller.

In addition, in the original proposal, SW got an indicator of how much
contention was going on (and whether the failure was a spurious
issue), and SW could use it to modulate what to do on failures: try
again immediately, exponential backoff, context switch, become a
helper for another atomic event, ...
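
A minimal sketch of that modulation, assuming a hypothetical
asf_acquire() that returns 0 on success, a negative code for spurious
failures, and your position in the interference order otherwise:

    for (;;) {
        int r = asf_acquire();    /* hypothetical: protect + enter */
        if (r == 0)
            break;                /* we are in the atomic region */
        else if (r < 0)
            continue;             /* spurious (buffer overflow, ...):
                                     just try again immediately */
        else
            backoff_or_help(r);   /* r others got there first: back off
                                     exponentially, context switch, or
                                     go help another atomic event */
    }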

While this ASF may be a step forward in atomic stuff, it is
significantly less powerful than the previous version described in the
previous thread.

So, it's nice to have it see the light of day. It is shameful to have
lost so much power and expressivity along the way. It is truly sad
that ASF lost the ability to attack the BigO(n**2) problem along the
way.

Mitch

(*) But I recognize that my opinion on this is completely irrelevant
at the same time.


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Fri, 1 May 2009 09:47:01 -0700 (PDT)
Message-ID: <b963139c-6ff6-4baf-87eb-7504adaf2009@y9g2000yqg.googlegroups.com>

On Apr 30, 4:38 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> The current ASF would also seem to preclude the hardware
> accelerators as it never forms an update "set" until the commit.
> I rather liked that idea, given the wealth of transistors available,
> and I think it would have given a distinct competitive advantage
> in the multi-core arena.

I saw nothing in the specification that would have precluded an ASF
implementation from modifying the cache at the normal pipeline
(LOCKed) store timing, with the caveat that the premodified data be
moved to the miss data buffers where it can be recovered upon failure.
Thus success can be made fast, penalizing failures, instead of the
other way around.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Fri, 1 May 2009 09:58:49 -0700 (PDT)
Message-ID: <aa777612-88e1-49ff-9e57-95fbaf42d5df@h11g2000yqb.googlegroups.com>

On Apr 30, 4:38 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> I assume you mean the N**2 cost of collision-retry loops.
> Yeah... and I don't like their unbounded queue delays.

N**2 is the amount of memory traffic needed to deal with N processor
threads accessing (for/with write permission) a single cache line and
all N of these processors agreeing on who "got" that line. It's simple
cache coherence at play/work.

Software programmatics and SW synchronization architectures can make
it worse than this.
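
For a concrete picture, the classic offender is N threads spinning
with a CAS on one line (a sketch; the exact counts depend on the
protocol):

    /* Every failed CAS still drags the line across the fabric with
       write permission, so for all N threads to get through once costs
       on the order of N + (N-1) + ... + 1, i.e. about N**2/2 line
       transfers. */
    static volatile long lock;

    void take_lock(void)
    {
        while (!__sync_bool_compare_and_swap(&lock, 0, 1))
            ;   /* each iteration is another coherence transaction */
    }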

Mitch


Date: Fri, 01 May 2009 09:13:52 +0200
From: Terje Mathisen <"terje.mathisen at tmsw.no">
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Message-ID: <-fGdncs-xfmtPGfUnZ2dnUVZ8uCdnZ2d@giganews.com>

Mayan Moudgill wrote:
> MitchAlsup wrote:
>> Back when I was working on this, we were intending to support 8
>> protected lines (with 7 as a backup). There are useful atomic
>> primitives that use 5 protected lines (moving a concurrent data
>> structure (CDS) element from one place in a CDS to another in a single
>> atomic event.)
>
> Did you talk to any processor designers when you did this?

Are you serious?

From LinkedIn:


Mitch Alsup's Experience

     *
       Architect
       AMD

       (Public Company; Mechanical or Industrial Engineering industry)

       1999 - 2007 (8 years)
     *
       Chief Architect
       Ross Technology

       (Public Company; Mechanical or Industrial Engineering industry)

       1991 - 1998 (7 years)
     *
       Architect
       Motorola SPS

       (Mechanical or Industrial Engineering industry)

       1983 - 1992 (9 years)
     *
       Processor Architect
       Motorola

       (Public Company; Mechanical or Industrial Engineering industry)

       1983 - 1991 (8 years)

I sort of have the feeling Mitch has spent way more time than most, even
here at c.arch, talking to processor designers. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Fri, 1 May 2009 09:43:59 -0700 (PDT)
Message-ID: <057a87c9-6e90-4216-9e27-064762bc6ed6@p11g2000yqe.googlegroups.com>

On May 1, 6:20 am, Mayan Moudgill <ma...@bestweb.net> wrote:
> Again, I apologize for doubting Mitch Alsup's credentials. However, I
> still think that he is suffering from a little bit of tunnel-vision.

For the record, I thank Terje for his defense, and would like to
indicate that I take no offense at Mayan.

This whole synchronization philosophy was developed from the hardware
level (make the HW work first); then we figured out how to program it
(present a reasonable abstraction to the SW levels). It is 'different'
because we ultimately felt that the current directions {DCAS, xxTM}
were lacking at various points in their (sub)architectures.

I would (happily) submit that the field of synchronization has so many
mind numbing details that all paths to any potential solutions have a
certain amount of tunnel vision.

However, the more original ASF proposal was the first (that I know of)
that showed a way to attack the BigO(n**2) problem. This problem has
to do with the cache coherence protocols and how memory systems are
architected with performance towards the 'normal' access patterns and
a near-complete blind eye to synchronization access patterns (repeated
access to hot lines). DCAS (and variants) and Transactional Memory
have these exact same problems. The problem is not DCAS, nor xxTM, but
inside the cache hierarchy and memory systems. And we went into
significant detail in the aforementioned thread earlier.

ASF evolved over 2-2.5 years (while I was there) with insight from the
best architects within AMD and from selected AMD partners who gave
useful and illuminating input into what they might have liked and what
they did not. I'm pretty sure that this evolution did not stop after I
left.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Fri, 1 May 2009 10:19:55 -0700 (PDT)
Message-ID: <a4471cf7-4098-4ff1-9ee5-860aa093f2d2@h2g2000yqg.googlegroups.com>

On Apr 30, 9:38 pm, Mayan Moudgill <ma...@bestweb.net> wrote:
> MitchAlsup wrote:
> > On Apr 30, 8:28 am, Mayan Moudgill <ma...@bestweb.net> wrote:
>
> >>- while a cache-line is locked, any requests to that cache line get deferred
>
> > This leads to unsolvable deadlock/livelock issues--depending on the
> > protocol on the interconnect(s). It also creates massive headaches for
> > interrupts, and more than a few issues for exceptions.
>
> 1) The abort-on-probe in ASF can lead to live-lock.

Agreed. This is why earlyASF {the earlier version that I know about}
had a HW mechanism to prevent live-lock.

> 2) If the interconnect does NOT allow deferred access to cache-lines,
> then the ASF proposal does not work either; the ASF has to defer access
> whilst committing the locked regions to the cache.

This is why I was lobbying for a NAK to be added to the HyperTransport
interconnect fabric, and only for ASF events.

> 3) The mechanism described specifically allows interrupts and exceptions
> to be taken (and reservations to fail)

My point was that unconstrained NAKing makes interrupts really tricky,
whereas tightly controlled NAKing avoids these pitfalls without
losing the value of the NAKing {i.e. preventing livelock}, and does
not make interrupts any more (usefully) difficult than they already
are.

The earlyASF that I was associated with set a timeout at the point when
the critical section "became" critical. During this short window of
time, interrupts would be deferred. This time was supposed to be about
2x the worst-case DRAM access. Thus, the critical region program should
have most of its protected data in its cache, and if it does, the
timeout would not become visible. If the critical region had lost
several of its lines, there is a good chance that interference would
cause a failure in the critical region anyway.

Thus, one NEEDS an instruction to MARK the transition from collecting
the lines to be processed to the manglement of those lines in the
ATOMIC event. This is what was lost. In addition, the instruction that
used to do this delivered an integer (positive, zero, or negative),
and embedded in this integer was the success indicator (0) or a
failure indicator. Negative failure numbers were assigned to spurious
errors (buffer overflows, table overflows, ...) to allow HW
simplification and allow SW to ignore these and simply try again.
Positive numbers were assigned by the order of interference on a given
set of protected lines. Thus one could attempt to ACQUIRE the lines
needed to perform a critical section, and receive back an indication
that 6 other threads have touched (at least part of) those same lines
to be protected. SW was supposed to use this indicator to do something
other than just try again. Just trying again leads to BigO(N**2)
contention. Knowing that you are 6th in line allows that routine to
march down the linked list and then attempt an access that has a low
probability of others accessing it. This is what allows SW to attack
the BigO(N**2) problem.
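
A sketch of how SW might use that interference number (asf_acquire()
as in the earlier sketch; struct node and queue_head are assumed):

    int k = asf_acquire();        /* 0 = ours; k > 0 = k ahead of us */
    if (k > 0) {
        node *p = queue_head;
        while (k-- > 0 && p->next != NULL)
            p = p->next;          /* march past the k hot elements */
        /* retry the operation at p, where contention is now unlikely */
    }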

Consider a timer going off and 15 threads jumping at the head of the
run queue. With normal synchronization, this (let's just say) takes a
long time. However, in earlyASF everybody wakes up, jumps at the head
of the queue, is given an interference number, and uses this number to
access independent elements on the subsequent attempt, and BigO(N**2)
becomes BigO(3). Since timing cannot be guaranteed, BigO(3) degrades
to "about" BigO(log(N)).

However, it appears that this has been lost.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Fri, 1 May 2009 15:12:26 -0700 (PDT)
Message-ID: <8a4ef2d2-5102-4f3a-b081-f729a7cc4fd4@c9g2000yqm.googlegroups.com>

On May 1, 4:30 pm, Mayan Moudgill <ma...@bestweb.net> wrote:
> MitchAlsup wrote:
> > This is why I was lobbying for a NAK to be added to the HyperTransport
> > interconnect fabric, and only for ASF events.
>
> Interesting; I didn't realize that Hypertransport does not have a Retry
> read response. That means that once a message is put on the bus, it will
> either complete or come back with an error.
>
> BTW: What typically happens if the response is Target Abort? Is it
> retried, or just an error? How about Data Error?

HT only has completion and failure responses. Failures are machine
check-like events--don't want to go there. Failures have generally
indicated that the HT routing tables are screwed, or that the TLB
mapping tables are screwed. Neither is good for your current OS's
assumptions. What is needed is a message that says "yes I have it,
but I cannot give it to you right now"--which is neither a success
nor a failure response.

What we wanted was for the NAK to arrive back at the requestor, and if
it was attempting to perform an ASF event, then the ASF event would
fail (and backtrack -- the CDS is undergoing modification as we speak),
but if it was not attempting an ASF event, the request was simply
reissued (innocent or malevolent bystander). Thus, innocent
interaction with data undergoing synchronization gets deferred while
the synchronization event makes forward progress. Programming
languages cannot prevent innocent or malevolent threads from accessing
the CDS any time they want--only SW discipline and coding practices
can--however earlyASF gave priority to the good guys over the
innocents and malevolents. It appears that this ASF does not.
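
A sketch of the requestor-side behavior we wanted (the response names
and helper functions are invented for illustration, not HT mnemonics):

    switch (response) {
    case RSP_COMPLETE:            /* the normal case: data arrives */
        deliver_data();
        break;
    case RSP_NAK:                 /* "I have it, can't give it now" */
        if (in_asf_event())
            fail_asf_event();     /* our view of the CDS is stale:
                                     fail, backtrack, retry from the top */
        else
            reissue_request();    /* innocent bystander: just ask again */
        break;
    case RSP_ERROR:               /* routing/TLB tables screwed */
        machine_check();          /* don't want to go there */
        break;
    }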

Since the "loop" of request->MC->PROBES->miss-buffer->requestor is
many hundreds of CPU clocks long, simply retrying is sufficient and
does NOT flood the interconnect with messages. Since the various
messages are short (4-8 HT clocks) and ordered at the memory
controllers (1 memory access in the active state per cache line--the
rest queued), and traffic is throttled in the HT fabric, hundreds of
CPUs can be contending with little overhead (<10%-ish) in the HT
fabric due to these innocent and malevolent retries.

Thus, the good guys make forward progress, the innocent get delayed
and see the latest state of the concurrent data structure when their
access completes.

The rather similar looking Intel interconnect fabric does not have
this 'lack-of-flooding' property, and if Intel were to implement
something like ASF they would make different choices about some of
these details. I am available to help them if they so desire.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Fri, 1 May 2009 10:26:55 -0700 (PDT)
Message-ID: <1a425455-dd30-43d5-9f1c-271df6144d2b@g1g2000yqh.googlegroups.com>

On Apr 30, 9:38 pm, Mayan Moudgill <ma...@bestweb.net> wrote:
> MitchAlsup wrote:
> > This requires a search (directed or not) through the cache to modify
> > the states after successful conclusion. See below.
>
> Not really; what it requires is easy to do _IF_ you've implemented your
> cache state using flops instead of SRAMs.

Look, NOBODY is going to build their cache(s) in register technology;
it's just not dense enough. However, everybody is going to have a
number of miss buffers that ARE built in register technology. Thus, if
you can accept a reasonable limit to the number of protected lines,
then the miss buffers are actually easy to use for this purpose.

Now, you might be able to afford to build all or part of your TAG
array in register technology, and some of this is (and has been) done,
especially for invalidations. But you will never be able to afford the
data portion of the cache to be implemented in register technology.

In any event, the miss buffers already have all the desired
characteristics. They already deal with inbound and outbound data, the
demand address and the victim address, and they are snooped
continuously. So they are the perfect place to add a "little"
functionality. In addition, for all intents and purposes, they ARE a
cache; a cache of memory things in progress that need constant
tending.
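
A sketch of that "little" added functionality on a miss buffer entry
(the field names are invented for illustration; the real structures
differ):

    #include <stdint.h>

    struct miss_buffer_entry {
        uint64_t demand_pa;        /* address being fetched */
        uint64_t victim_pa;        /* line being evicted, if any */
        uint8_t  line[64];         /* inbound/outbound line data */
        unsigned valid      : 1;
        unsigned snooped    : 1;   /* a probe matched this entry */
        unsigned asf_locked : 1;   /* new: participating in an ASF event */
        unsigned asf_old    : 1;   /* new: line[] holds the pre-modified
                                      image, kept for failure recovery */
    };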

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Tue, 5 May 2009 16:45:39 -0700 (PDT)
Message-ID: <6f360462-6082-4084-92e4-1784b60d5c79@o20g2000vbh.googlegroups.com>

On May 3, 3:28 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On May 2, 9:54 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> My primitive view is that we may (1) transmit some bus messages on
> commit, or (2) redo some work + transmit the same number of messages
> during re-execution.

Now the COMMIT instruction takes an essentially unbounded amount of
time. And I don't think the HT has the infrastructure to smell atomic
under these assumptions.

> Though now I am not sure that such details must be included into the
> specification at all.
>
> > That having been said, I can see the utility in having an explicit
> > 'unsafe read' which says 'I know what I'm doing so get out of my way'.
>
> Once again, my intent is to reduce number of aborts w/o affecting
> observable behavior.
> Mitch said that I lose visible atomicity; if so, please show me where
> and how (this is not my intent).

It may very well be that it remains possible to retain atomicity under
these circumstances. However with the HW limits we had imposed (and
self imposed*) upon us, I could not.

Mitch

(*) In order to qualify for inclusion in a particular rev of the given
machine, this had to be a "small" change to a bounded number of
existing machine circuits.


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Tue, 5 May 2009 16:40:20 -0700 (PDT)
Message-ID: <cb88e959-58d8-4853-801c-9cc3bc63ce00@s16g2000vbp.googlegroups.com>

On May 2, 11:54 am, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> Also, looking at the MOESI coherency protocol, it is possible they
> may be trying to retain the lines in either the Exclusive (not updated,
> no shared copies) or Modified (updated, no shared copies) states.
> That might allow the Commit to do its thing without
> transmitting any bus messages.

To retain the illusion of atomicity, all lines must remain in the
prior state or become visible in the after state in one instant.
Passing bus messages around makes this very significantly harder.

> Allowing a read would transition Modified lines to the Owned state,
> necessitating possibly multiple Invalidate messages on Commit.

Which makes the set of lines not transition from the before state to
the after state in one instant.

> All such lines would have to be held in a limbo state until the
> replies were received, which implies resources must be tied down,
> expands timing windows for responses to requests for those lines, etc.
> That in turn might limit the ability to expand the number
> of lockable lines in the future.

And, in practice, does not work.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Tue, 5 May 2009 16:36:29 -0700 (PDT)
Message-ID: <bd9742d7-011c-4dc8-adfc-d5b27bd5db06@r3g2000vbp.googlegroups.com>

On May 2, 6:23 am, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> I am not sure why you are talking about complexity and disposition of
> the atomicity relative to cache coherence, what I was talking about is
> no more than renaming of names. It must not affect complexity,
> disposition of the atomicity relative to cache coherence, etc.

The HW in question had 8 memory (miss) buffers and 22 store buffers.
Some of the atomic activities we wanted to be able to perform would
have overflowed the 22-store limit in the store buffer if one were
programming with cursors {pointer, timestamp, other cursor data == 3
stores per 'reference'; eight such references would already be 24
stores}. If the atomic event takes more room than is available in the
execution window, then it cannot be recovered (the way a branch
misprediction is recovered). Since we also wanted coders to be able to
debug these regions, other stores would be needed. Thus, complexity
would have been increased, or the "ASF model" would have had to shrink
to fit the execution window of a particular machine. Neither was
desirable.

So, ASF was designed so that its SW model remains intact with any
associativity at the cache hierarchy (including zero) and with any
number of execution window resources. Its only mandate was enough MABs
{the memory access (miss) buffers mentioned above} to allow forward
progress once a number of them were locked down for ASF activities.
These were the smallest and cheapest resources that already existed in
the correct places for the activities we wanted to perform. The MABs
also had the property that they get snooped, were (or could be)
associated with line buffers (for recovery purposes), and the
decoration needed was only a couple of flip-flops and some state
logic.

Thus, I disagree with your assumption that it would not have affected
complexity (relative to almost anything in that machine). This
functionality could be installed in a machine by removing the #UD
traps on memref instructions that happen to have a LOCK prefix, adding
a couple of new instructions, and adding a couple of new bits to an
already existing, functioning piece of the machine.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Thu, 14 May 2009 11:08:06 -0700 (PDT)
Message-ID: <80f45c69-403d-4210-aac8-bd3303025645@g20g2000vba.googlegroups.com>

On May 13, 2:57 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
> On May 13, 11:40 pm, Dmitriy Vyukov <dvyu...@gmail.com> wrote:
>
> > Mitch, can you provide some comments on whether it's possible to make
> > it work?
> > AFAIR, abort address is accessible by software. If transaction does
> > not actually execute any speculative code then nothing will be rolled-
> > back on abort. Registers will be intact. Everything executed inside of
> > transaction will be treated as normal code.
> > So the question is: is it possible to restart transaction and just
> > jump to the abort address in the abort handler?
>
> Well, it seems that ASF does NOT support "resume". On abort RIP is
> saved in the ASF_EXCEPTION_IP MSR, however RSP and EFLAGS are LOST, so
> we can't resume :(
> It would be nice if ALL state will be saved on abort...

No, it is neither nice to save state on abort, nor possible to reenter
an atomic region after an abort; unless you so mangle the definition
of atomic as to make it completely useless.

A) Where would you put this state where absolutely nobody could see it
(i.e. remaining atomic-smelling), including other threads, OS and VM
services?
B) What guarantees can you provide that the data structure(s) being
manipulated will be the same once you get back from the abort? (A:
low)
C) What if the OS or VM needs to access the same data structure while
you are between the abort and reentering? (A: Screwed)

When we looked at this, the conclusion we came to is that it is simply
better to throw away everything you understand about the state of the
CDS and start over. This prevents having to invent a new place to dump
state (making the OS people happy), and enables the OS or VM to be
essentially unconcerned with what you (the application) may be doing
at any given moment in time--even if the OS or VM needs to perform
similar kinds of accesses to potentially shared CDSs.

Atomic means that nobody can see any intermediate state. ASF is
designed under the train of thought that you want to perform many
millions of alterations to a concurrent data structure per second.
Basically this means that if you blink your eyes, the CDS has changed
under your nose. Having as little state as possible to support said
event means that you basically have to punt and back up if anything
prevents your successful completion of the event at hand.

D) What is the probability that this saved and restored state is
completely useless after returning? (A: high)

The original ASF provided an indicator to inform SW WHY the abort path
was taken, and this list included {interrupt, exception, spurious,
interference} so SW could make an educated and rational decision as to
what to do next. Apparently the current ASF does not trust SW writers
to be sufficiently educated or rational.

The original ASF was not designed to be sufficient for writing
transactions. It was designed to be sufficient for writing complicated
ATOMIC primitives such as BOOLEAN cursor::DCAS(a,b,c,d,e,f) { } or
BOOLEAN timestamped::CDS_move(from, to, element) { }. And these
procedures retain the property that the amount of code inside the
atomic critical section is "as small as possible". Transactions are a
higher level abstraction that can be supported by ASF and ASF-like
procedures, but do not have the property that what is inside the
critical section is "as small as possible".
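
For instance, such a DCAS might be sketched as follows (using the
hypothetical asf_* primitives from the earlier sketch, plus an
asf_abort() that releases the lines without publishing anything):

    #include <stdint.h>

    extern void asf_protect(void *line);
    extern int  asf_enter(void);
    extern void asf_commit(void);
    extern void asf_abort(void);

    int dcas(uint64_t *a, uint64_t cmp_a, uint64_t new_a,
             uint64_t *b, uint64_t cmp_b, uint64_t new_b)
    {
        asf_protect(a);
        asf_protect(b);
        if (asf_enter() != 0)
            return -1;             /* interference: caller modulates retry */
        if (*a != cmp_a || *b != cmp_b) {
            asf_abort();           /* compares failed: publish nothing */
            return 0;
        }
        *a = new_a;                /* the critical section proper is */
        *b = new_b;                /* "as small as possible"         */
        asf_commit();              /* both lines flip in one instant */
        return 1;
    }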

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.programming.threads,comp.arch
Subject: Re: AMD Advanced Synchronization Facility: HTM
Date: Wed, 13 May 2009 11:52:49 -0700 (PDT)
Message-ID: <aac3914a-f450-4965-8dd2-0338bb3b954b@t10g2000vbg.googlegroups.com>

On May 13, 12:58 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Chris M. Thomasson" <n...@spam.invalid> wrote in message
> news:IRDOl.52002$bi7.51156@newsfe07.iad...
>
> > "Dmitriy Vyukov" <dvyu...@gmail.com> wrote in message [...]
> >> [...]
> >> Mitch said that this will complicate COMMIT and let it make cache-line
> >> transfers. They want CPU to hold all locked cache lines in the cache.
>
> > Damn! There has to be a way to allow for a normal plain load from CPU A to
> > abort CPU B's speculative region. Grrr...
>
> Ummm... I meant to say:
>
> Damn! There has to be a way to allow for a normal plain load from CPU A to
> NOT abort CPU B's speculative region. Grrr..

What you want, here, is the ability to perform a READ_ONCE on [A].

READ_ONCE reads the value, but leaves write permission where it
already is, so if an ASF event already has write permission, it
retains write permission, while giving the reader the old copy of the
data.
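
Usage might look like this (read_once() standing in for whatever the
instruction would actually be called; cds is an assumed shared
structure):

    /* CPU A samples a hot counter without disturbing CPU B's ASF
       event on the same line: B keeps write permission, A gets the
       old copy, and nobody aborts. */
    uint64_t n = read_once(&cds->op_count);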

In theory, this works just fine (assuming you have access to
READ_ONCE); however, in practice, the line that gets transferred to
the READ_ONCE requestor is the original incarnation of the line, so
the CPU holding that line may have to reconstruct the old image before
transmitting, and then reassemble the new image to proceed with the
ASF event in progress. These are not difficult, but are cumbersome.
They are easier in an HT environment, where only 1 memory ref is
active to a given cache line at a time, than in the QP environment,
where several can be in progress simultaneously.

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: Fujitsu SPARC VIII-fx HPC-ACE adds instruction prefixes: variable
Date: Mon, 28 Dec 2009 16:54:30 -0800 (PST)
Message-ID: <87ddaab5-6760-4432-9834-2981b1a158a1@f5g2000yqh.googlegroups.com>

On Dec 26, 2:39 pm, EricP <ThatWouldBeTell...@thevillage.com> wrote:
> EricP wrote:
>
> > Having the ability to perform a LoadLocked/StoreConditional on
> > up to 4 separate memory locations would eliminate much of the
> > need to escalate to the heavyweight OS synchronization ops.
>
> This appears to require a method of establishing a global
> first-come-first-served ordering for the cpus that is
> independent of the physical memory locations involved.

Correct.

> In the unlikely event of cache line ownership contention then
> the first cpu to begin its multiple update sequence wins
> and the other rolls back.

But more importantly, the entity that determines who wins also
establishes order over the current participants, avoiding contention
on the subsequent access. Thus, one can achieve something on the order
of BigO(log(n)) instead of BigO(n**2) memory references worst case.
You cannot get to this point unless the synchronization 'event'
returns an integer number instead of a simple win-retry.

>
> The trick is for it be a low cost mechanism (ideally the cost of
> a single cache miss to establish the order) that works within
> the existing cpu hardware, bus and coherency protocol.

In practice it requires a two way transfer through the fabric, but
does not require a DRAM access delay. So the latency is better than a
DRAM access. The entity looks and smells remarkably like a TLB and can
process a stream of requests as fast as the fabric can deliver a
stream of requests (i.e. no back pressure--at least none required).
And the TLB does not have to be "that big" either.

> For that I'm thinking that maybe a global device located
> at some special physical memory location would establish
> the global order at the start of a multiple update sequence.

Yep, programmed up by a standard header, making it look like a device
residing anywhere in fabric-addressable space.

> Then using Invalidate bus ops to a set of other special
> physical memory locations could communicate that ordering
> to other cpus and they can associate that with the bus id.

Nope, dead wrong here. You return the order as an integer response to
a message that contains all of the participating addresses. This part
of the process does not use any side-band signaling. After a CPU has
been granted exclusive access to those cache lines, it is then enabled
to NAK requests from other CPUs (or devices) so that it, the blessed
CPU, makes forward progress while the unblessed are delayed.

> So in this example the overhead cost would be 3 bus ops to
> Read the global device, an Invalidate indicate my order to
> the peers, and an Invalidate at the end of the sequence.

In my model, there is a message carrying up to eight 64-bit physical
addresses to the Observer entity; if there are no current grants to
any of the requested cache lines, the transaction as a whole is
granted and a 64-bit response is returned to the sending CPU. Most
would call this one fabric transaction, just like a read-cache-line is
one fabric operation.
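
In message terms, the request might be sketched as (layout invented
for illustration):

    struct order_request {         /* CPU -> Observer entity */
        uint8_t  count;            /* 1..8 participating lines */
        uint64_t pa[8];            /* 64-bit physical line addresses */
    };
    /* response: one 64-bit integer; 0 = granted now, k > 0 = k grants
       are ahead of you on (at least one of) these lines */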

The CPUs contend for the actual cache lines in the standard manner
(with the exception of the NAK above).

Mitch
