Subject: Re: Message Passing vs. Shared Memory
From: mash@mash.engr.sgi.com (John R. Mashey) 
Date: Dec 22 1995
Newsgroups: comp.sys.super

In article <4bcb71$20q@larry.rice.edu>, preston@cs.rice.edu (Preston
Briggs) writes:

|> Phil@irony_bvld <phil@ctrhdc4.ct.ornl.gov> wrote:
|> >    The bottom line is, why is shared memory to be an "easier"
|> >model, when in actual fact it has all the hazards of ANY parallel
|> >problem? Namely, synchronisation, read/write hazards, temporal
|> >ordering, etc.... The performance issue may simply be seen as a local
|> >memory latency, as opposed to a remote memory access, and as such
|> >should be recognised for what it is.
|> 
|> To me, the difference is in the size of the code required for good
|> performance.  One example is the code for the NAS parallel benchmark

|> I agree that latency is terribly important.  Tera can effectively

I'd agree with Preston, but beyond that:

1) Is shared memory "easier"?  YES
        There may be several reasonable interpretations of "easier",
        depending on the level of time, expertise, and money involved.
        Let me try one very pragmatic definition of "easier":

        Consider commercial third-party software applications that have
        been fine-grain parallelized onto various kinds of systems.
        ISVs need to worry about the actual cost of doing something,
        do not normally have legions of graduate students, and need to worry
        about maintaining code cost-effectively.
        

        a) How much ISV software has been parallelized onto SMP systems
                (vector or cache-based)?
        b) How much ISV software has been parallelized onto message-passing
                hardware? (or distributed-shared-memory hardware with
                remote latency >> local latency ... which of course tends to
                make people program more like message-passing anyway)
        I think the answer is pretty clear [i.e., more on SMP].
        It appears that some SMP systems that have appeared in the last year or
        two already have more parallelized software than most venerable
        message-passing systems that have been around for years.

        Hence, either this observation needs to be refuted, or theoretical
        observations need to be adjusted to fit the practice, or perhaps
        software/humans need to get smarter :-)

2) So, why would this be?
        a) The performance of both vector and cache-based SMPs is affected
        by memory layout, of course in different ways, and with different
        degrees of sensitivity.  It seems like people:
                1) Start with a static data layout, perhaps derived directly
                   from the uniprocessor version.
                2) Parallelize it and tune by looking at the dynamic
                   characteristics of memory references.
                3) Tune it by tweaking the memory layout a little, i.e.,
                   by cache line padding, or getting array sizes that
                   produce good strides.
        b) As best I can tell from those who've done it for years, it seems
        that the process is often different when using message-passing:
                1) Figure out the static data layout and how it will interact
                   with the dynamic references ... and if you're wrong,
                   incur serious performance penalties.  This may involve
                   considerable effort if the individual memories in a
                   message-passing system are slightly too small for the
                   "natural" size.

        c) A good example was Paul Woodward's debriefing on the CFD code
                that was used on the prototype Challenge Array a few years ago.
                (I.e., this was a bunch of SMP nodes ganged together with
                FDDI, using SMP inside each node, and explicit message-passing
                between the nodes.)  He observed that it was *much* easier to
                coordinate a moderate (10s) number of CPUs with a large shared
                memory for fine-grain sharing, and coordinate the resulting
                large chunks of CPU+memory via message-passing, than it was to
                coordinate a large number of individual CPUs with small
                memories.  He mentioned that trying to put computations whose
                natural size was 40MB onto 32MB nodes was distinctly unpleasant.
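The layout tweak in 2(a)(3) can be sketched concretely.  This is an illustrative toy, not code from any of the systems discussed; the cache geometry and the function name are assumptions:

```python
# Toy sketch of "getting array sizes that produce good strides":
# pad an array's leading dimension so successive rows don't all map
# onto the same cache sets.  CACHE_LINE and N_SETS are assumed values.

CACHE_LINE = 64          # bytes per cache line (assumed)
N_SETS = 512             # sets in a direct-mapped cache (assumed)

def padded_dim(n, elem_size=8):
    """Bump the leading dimension until the row stride in bytes is not
    a multiple of the cache's wrap-around size (N_SETS * CACHE_LINE),
    so successive rows land in different cache sets."""
    dim = n
    while (dim * elem_size) % (N_SETS * CACHE_LINE) == 0:
        dim += CACHE_LINE // elem_size   # pad by one cache line
    return dim
```

A power-of-two dimension like 4096 doubles is the classic bad case: every row maps onto the same sets, and one cache line of padding fixes it.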
        

-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090    FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: distributed servers vs big iron
Date: 3 Jun 1996 03:32:44 GMT

In article <mjrDrrCvD.MB@netcom.com>, mjr@netcom.com (Mark Rosenbaum) writes:

|> There are four independent axes in computer systems:
|> 	Processor (Gigaops, Gigaflops ...)
|> 	Memory (Gigabytes, Gigabytes/second of bandwidth and Latency)
|> 	I/O (Terabytes of storage, Gigabytes/second of bandwidth)
|> 	Interprocessor Communications 
|> 		(Gigabytes/second of bisectional bandwidth and Latency)
|> 
|> Every computer design makes tradeoffs in these four axes plus a fifth
|> which can be called usability. Many people say that shared memory is
|> easier to use than distributed memory so even though shared memory
|> systems do not scale as large as distributed memory systems the ease
|> of programming makes up for it. 

I'd agree, although one might make the distinction between logical and
physical sharing, i.e., shared-memory might be in one place, like
most SMPs, or physically-distributed like in DASH or Exemplars.


|> If a company sold an Intel based system using Scalable Coherent Interface
|> (SCI) communications would that be big iron or distributed system or both?

Maybe somebody can answer some questions on that:
	(1) What's the bisection bandwidth of a single SCI ring?
	    (for simplicity, assume sustained, i.e., peak minus obvious
	   overhead, but perhaps ignoring invalidates & such).
	   This should be 2X (sustained BW on 1 link).
	(2) What's the remote memory latency? (from cache-miss-detect to
	   resume-instruction, on a load-miss)? 
	   For simplicity, ignoring contention, but of course with a
	   per-node factor, i.e., should look something like
	   C + n*X, where n = total # of nodes.
Obviously, different boards may have different numbers.
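As a sketch of the model in question (2) (the constants here are invented for illustration; the whole point of the question is that the real SCI numbers are unknown):

```python
# Remote-latency model for a single ring: C + n*X, where C is a fixed
# cost and X a per-node cost.  c and x below are made-up illustrative
# values, not measured SCI numbers.

def remote_latency_ns(n_nodes, c=500.0, x=50.0):
    """C + n*X: latency grows linearly with the number of nodes."""
    return c + n_nodes * x
```

E.g., growing the ring from 8 to 16 nodes adds 8*X of latency, regardless of C.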
	   
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: distributed servers vs big iron
Date: 4 Jun 1996 03:07:00 GMT

In article <31B3029B.446B@austin.ibm.com>, Greg Pfister
<pfister@austin.ibm.com> writes:

|> But does the distributed memory maintain the ease of use of uniform
|> memory access (UMA)?

Well, that's always a good question ... it's pretty clear that:
(a) If the bandwidth is high, and the latency is low, it might.
(b) If the bandwidth is low, or the latency high, then it won't.
    For instance, many people say the KSRs had this issue. 

...
|> The comment about programming of course does not apply to the blokes who
|> actually write the subsystems.

I.e., like the OS folks ... (I've been through managing one of these,
and it wasn't great fun...)


|> > Maybe somebody can answer some questions on that:
|> >         (1) What's the bisection bandwidth of a single SCI ring?
|> >             (for simplicity, assume sustained, i.e., peak minus obvious
|> >            overhead, but perhaps ignoring invalidates & such.
|> >            This should be 2X (sustained BW on 1 link).
|> >         (2) What's the remote memory latency? (from cache-miss-detect to
|> >            resume-instruction, on a load-miss)?
|> >            For simplicity, ignoring contention, but of course with a
|> >            per-node factor, i.e., should look something like
|> >            C + n*X, where n = total # of nodes.
|> > Obviously, different boards may have different numbers.
|> 
|> Of course, SCI also allows interconnect via networks of switches, not
|> just rings.  So the minimum latency could look like C + X*log(n), or C +
|> X*some_root_of(n), or whatever, where n = number of nodes.  Or even just
|> C, for systems small enough to have a crossbar or bus--say, 8 or so,
|> which is quite large enough to do some major computing.

Oops, of course; I should have been more specific, as I was thinking of
the several SCI+Intel CC-NUMAs that already have Web pages, i.e., 
Sequent & DG.  [Are there any Web pages for Intel + SCI-non-ring topologies
that people can point to?  I hadn't noticed any, but hadn't looked too hard.]

-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: distributed servers vs big iron
Date: 4 Jun 1996 16:37:29 GMT

In article <mjrDsGF1o.91y@netcom.com>, mjr@netcom.com (Mark Rosenbaum) writes:

|> That is an interesting contrast. It seems that with the large caches
|> and the high hit rates that most SMP systems have that even SMPs 
|> might be considered logically shared physically-distributed.

That's an interesting topic; I've never felt that I had the
"killer explanation" for the difference ... but I think there's
a fundamental difference in practice.  So, let me try, and hopefully
somebody can offer a more succinct version.
[Note: references would be Lenoski & Weber's Scalable Shared-Memory
Multiprocessing & Greg Pfister's In Search of Clusters.]

(1) "SMP-classic":
	(a) CPUs with coherent caches
	(b) UMA shared memory, i.e., with the same access time from every CPU,
	    and access time >> cache access

(2) "CC-NUMA-classic":
	(a) CPUs with coherent caches.
	(b) NUMA shared memory, with a more complex access time function:
		Local memory >> cache access
		Remote memory > local memory
	and possibly:
		Remote memory access may or may not vary much according to
		where it is.
	At some point, if the remote memory access >> local memory access,
	people will use the system as (3) message-passing system, even if the
	memory is logically shared, simply because the penalty for acting
	like everything has the same access time gets too high.
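A toy model of that access-time hierarchy (all the constants are assumptions, chosen only to preserve the ordering cache << local <= remote):

```python
# Average access time under the hierarchy above: cache << local memory,
# remote memory >= local memory.  The constants are illustrative.

CACHE_NS = 10     # assumed cache access time
LOCAL_NS = 200    # assumed local memory access time

def effective_ns(hit_rate, remote_frac, remote_ns):
    """Mean access time given a cache hit rate and the fraction of
    misses that go to remote memory."""
    miss = 1.0 - hit_rate
    return (hit_rate * CACHE_NS
            + miss * ((1.0 - remote_frac) * LOCAL_NS
                      + remote_frac * remote_ns))
```

When remote ~= local, data placement barely matters (the SMP case); when remote >> local, the remote fraction dominates the average, which is exactly the point at which people start programming the machine like a message-passing system.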


Now, what's different:
layout of data for caches tends to be a short-term, small-size, dynamic issue;
layout of data for multiple memory systems tends to be a long-term,
larger-granularity issue ... and apparently much more difficult in the
general case, as evidenced by the relative numbers of 3rd-party parallelized
applications (i.e., many for SMP, not many for the extreme case of
message-passing systems).  There's plenty of data for the extreme points,
not so much for CC-NUMAs, although NCSA's software application
pages may give some indication.

People certainly spend time tuning for
knowledge of caches, but for some reason this seems to be easier.  I'm
not sure why, but maybe:
	(a) If you start with a vanilla sequential program, it at least
	works when put on a cached system, then usually can be tuned up
	somewhat, often with relatively little fundamental rearrangement,
	but by things like padding arrays.  Sometimes it can be parallelized
	fairly easily, at least for 4-8-way.  This sort of seems like Darwinian
	evolution: if someone makes some changes, and gets positive results,
	they are encouraged :-)  If they have to work a long time before the
	program even runs, they are not encouraged.  ISVs don't do that.
	(b) If you goof slightly with caches, there will be some penalty,
	but maybe not too much.
	(c) If you goof slightly with memory allocation in a system that
	has a long remote latency, performance may drop a long ways.
	(d) Sometimes the program doing the allocation cannot have much of
	a clue about where to put the data at the time it needs to put it
	there.  For example, consider a CC-NUMA with a long latency.  You
	may be able to partition the internal kernel data structures pretty
	well, but for example: where do you put the UNIX buffer cache?
	Where do you put shared-memory regions that are indeed shared by
	lots of processes?  Put another way, where do you put those big chunks
	of data, that unfortunately, may have good cache-hit rates (at the
	disk-buffer-cache level), but very bad hit rates at the CPU-cache
	level?  Where do you put, for example, the code for the UNIX shell?
	Obviously, smart page-replication & migration may be useful,
	plus plenty of hardware counters to give the OS a fighting chance...

Anyway, it seems that with an SMP:
	(a) User algorithm, and sometimes compiler help attack specific
	cache optimizations.
	(b) The OS does cache-affinity scheduling (i.e., reschedule a task
	on a CPU where it has recently spent a lot of time, hoping to find
	a warm cache instead of a cold one)
but, for example, when reading in a code page, or any disk block, the
OS need *not* agonize much over the choice of page frame ... but in a
NUMA, it had better think about it more.
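A minimal sketch of that page-frame choice (the policy names and the split between "private" and "shared" pools are assumptions for illustration, not SGI's actual implementation):

```python
# Hypothetical NUMA page-frame placement: private pages go to the node
# that first touches them; big shared pools (e.g. a disk buffer cache,
# or shared text like the shell) have no single good home, so stripe
# them round-robin across nodes.

from itertools import count

class FrameAllocator:
    def __init__(self, n_nodes):
        self.n_nodes = n_nodes
        self._rr = count()            # round-robin counter for striping

    def place(self, kind, touching_node):
        if kind == "private":         # e.g. a process's heap page
            return touching_node      # first-touch: keep it local
        return next(self._rr) % self.n_nodes   # stripe shared pools
```

Page replication and migration would then correct the inevitable bad guesses, given hardware counters to detect them.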


-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: distributed servers vs big iron
Date: 11 Jun 1996 21:29:40 GMT

In article <mjrDsKz49.2Iv@netcom.com>, mjr@netcom.com (Mark Rosenbaum) writes:

|> Excellent point, any guess as to what the ratio might be. 10X 100X. Also,
|> doesn't the communications also play a part here. Even if the remote memory
|> is very close (say 2X away) but congest easily wouldn't that force you
|> to treat the machine as distributed memory?

Seems likely.

|> While it is true that the number of MPP (distributed memory) apps is small
|> I'm not sure the explanation is that MPP is too hard to program. My 
|> experience in the industry was that the SMP systems were used for multi-user
|> timesharing systems and the MPP systems were designed for a small number
|> of large applications.

This may be a matter of viewpoint, i.e.:
(a) As Mark notes, many SMPs were straightforwardly used for throughput of
multiple single-thread programs that were probably derived from uniprocessors
anyway, and as noted, with varying degrees of success for commercial
DBMS.
(b) On the other hand, in the technical world, people were certainly
routinely doing small-N-way parallelization of codes in the late 1980s,
early 1990s.  While I'm not sure how much of this was going on on other
vendors' systems, I do know it was happening on SGI PowerSeries systems
while I was still at MIPS, and over the years since I've talked with lots of
scientists and engineers who had done this, certainly not to do algorithms
research, but just to solve their problems faster.  [I especially recall
one who had an 8-CPU system: he was running a research job that took weeks,
and that every once in a while looked at a file to see how many CPUs it should
be using; when he went home, he'd let it have all 8; when he came back,
he'd tell it to keep 4, so he could have 4 for his interactive work.]
It is also the case that Power Challenges acquired a reasonable number
of parallelized ISV applications fairly quickly during 2H94.

-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Is CC-NUMA no longer a good idea?
Date: 12 Jun 1997 17:42:00 GMT

In article <5noqu6$dmj$1@murrow.corp.sgi.com>, mccalpin@frakir.asd.sgi.com (John McCalpin) writes:
<bunch of good stuff>

|> >Do the trends in other commercial machines indicate the dropping of CC-NUMA
|> >as a popular design style?  Is a Galaxy-like system really the future?
|>
|> There is no doubt that it is more work to get the availability/redundancy
|> features in a single O/S image than in a cluster (which is why SGI supports
|> both approaches).

0) Bandwidth+latency: "you can run, but you can't hide", i.e., there's
no substitute for good numbers in the general case, although you can
always find special cases where you can relax the constraints.

1) People are familiar with clustering, and often like to do it to
make "hard" boundaries between machines for fault-containment reasons,
and hence clustering has a role [which is why we support it].

2) Where clustering is totally broken, is when the customer is forced
to partition their environment "unnaturally" among cluster nodes to
fit the size of the nodes that we vendors happen to be offering today.

The computer business has such a long history of the following approach
that people take it for granted:
	The shape of the customer's solution is dictated by adapting the
	shape of the problem to the shape of the available hardware.
	[I.e., if you have mainframes, centralize; if you have small
	systems, decentralize, whether this fits what they want or not.]
whereas one would prefer:
	The shape of the hardware is dictated by the shape of the
	customer's *problem*.

For instance, if one has a bunch of independent tasks, with high ratios
of compute-to-communicate, a loosely-coupled cluster is a fine match.
On the other hand, if one has a large RDBMS, with many tables and random
queries/joins, etc., then it is *idiotic* to split this data up among
nodes, where the internodal bandwidth is 10X or more lower than the
intra-node bandwidth.  Likewise, if the application is data-centric,
and the parts of the DB grow at different rates, getting good partitioning
that persists (i.e., in real life, not just benchmarks/demo days) is very
difficult.

3) Many customers worry hard about unpredictability, which comes from
the current set of trends, i.e.:
	- Big data [disks & DRAM on a 4X / 3 years growth rate]
	  4X/3 in disk areal density means 2X/3 in bandwidth, plus
	  any improvement from higher RPM.
	- The Net [everybody *thinks* they should be able to get at data
	anywhere, and they *think* they should have Web pages fronting
	legacy DBMS that cause multiple transactions to multiple DBMS to
	provide an answer].  People didn't use to expect this.
	- New services [which may or may not work]
Hence, capacity planning is a serious worry for many people,
since doing it badly can mean you get fired.

4) Putting all of this together, a lot of people favor the following:
	- Set up a cluster arrangement dictated by the "shape" of the
	data access & interchange & fault-resiliency arrangements,
	and organize the system administration based on that ...
	and try not to change sysadmin arrangements.

	- Start the individual nodes small, and then grow each one
	incrementally [this is where the CC-NUMA comes in], leaving IP
	addresses, sysadmin alone, just adding CPUs, memory, disks,
	or sometimes subtracting: some people have conversion periods
	when they need systems to be bigger, then they split them back
	down into smaller ones, or if they offer a new service, but the
	service doesn't take off, they don't incur the cost.


	- Suppose, for example, that somebody wants to start a project
	with 2 4-CPU systems, 8 CPUs total, and commensurate disk/memory,
	and assume this is a data-centric application.  Suppose their
	need grows 50%, and they are using systems limited to 4 CPUs
	each.  Are they supposed to take their 2-way partitioned data, and
	split it into 3 ways, adding a 3rd system?  What happens when they
	grow another 50% (i.e., 18 CPUs), do they redo the 3-way split
	into 5?  Arghhhh.
	OK, start with 2 bus- or central-crossbar SMPs that can go to
	16 CPUs each ... oops, at 4 CPUs each you have a fairly cost-inefficient
	part of the configuration space, and you are spending a bunch of
	money for capacity that you *might* not end up using for a while,
	and you may be accepting unpleasant limits on things like
	memory & I/O bandwidth, and in any case, many people like to
	maintain 30% headroom [I talk to CIOs a lot, and 30%
	headroom is the most common number.  Some require 50%.]  This means
	they'd start getting nervous when they've got 11 CPUs in each of
	2 systems.  If CPUs are packaged in 4s, that means they get
	3 boards with 4 CPUs [2x12 = 24 total], and then they're nervous,
	so that another 50% growth causes trouble.
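The headroom arithmetic above, as a sketch (the function name is mine; the 30% and 50% figures are from the CIO conversations mentioned):

```python
import math

def comfortable_cpus(max_cpus, headroom=0.30):
    """Largest CPU count usable while keeping `headroom` spare capacity."""
    return math.floor(max_cpus * (1.0 - headroom))
```

For a 16-CPU box with 30% headroom that is 11 CPUs, so 3 boards of 4 CPUs (12 per box) is already past the comfort point.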

5) Summary: it is silly to insist that a site have exactly one CC-NUMA
and grow it to fit all needs; it is equally silly to insist that one
serve the needs by partitioning workloads in unnatural ways onto
clusters.  Natural partitioning is useful and good ... it's the unnatural
cases that are painful.




--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 415-933-3090 FAX: 415-932-3090 [416=>650 August]
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Is CC-NUMA no longer a good idea?
Date: 13 Jun 1997 21:59:51 GMT

In article <mjrEBpt0F.1n9@netcom.com>, mjr@netcom.com (Mark Rosenbaum) writes:

|> Interesting point. This may be an example of relaxed constraints but
|> if it is it is a good size market and growing quite rapidly.

Yes ... it's actually one of SGI's fastest growing markets, often the
application that gets us in the door.

|> Data Warehousing seems to partition well. One way to layout a DW is
|> a star schema with fact tables and dimension tables. Typically the
|> fact tables are very large (100s of millions of rows to billions of
|> rows, and even 10s of billions of rows) and the dimension tables are
|> much smaller (typically less than a million rows). There is one notable
|> exception to this, customer tables in a business that track individual
|> customers (financial, telecom, and now a days retail) can have 10s of
|> millions of customers.

Of course, at least among the customers I've talked to in this space,
tracking customers is exactly what they do. [this may be because we
sell to telecom, retail, and financial markets :-)].

|> An other interesting note about large DW apps is that the data does
|> not come close to fitting into memory so the internode speed does
|> not have to match intranode memory speeds but rather intranode
|> disk speeds which should only be a few hundred MB/sec currently.

Serious DBMS seldom fit into memory, but this is hardly new :-)

One must be very careful to match internode speeds and disk speeds.
Let's see: disk areal density ~ 4X / 3 years, so linear density gets
~ 2X / 3 years, and bandwidth tends ~ linear density * rotation speedups.
With next years' disks @ 7200-10000 RPM and 1-1.5 Gb/sq in, we're looking
at 10-15 MB/sec per 9GB 1"x3.5" drive.  While an FC-AL loop can handle
100+ drives (attached), you only get ~70 MB/sec sustained, or 5-10
drives in bandwidth-intense applications.  So, even though you already get
~ 1 TB in a 19" rack, you might well want multiple channels for bandwidth.

There are, of course, technologies bubbling around that threaten to
increase the density 10X (and thus the bandwidth 3X) in 1998,
and if this happens, one would expect to see 30MB/sec individual disks,
of which not very many are needed to choke (for instance) a 40 MB/s
system area network, or even individual 100 MB/s links.
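That arithmetic, sketched with the post's rough numbers (all the constants below are the approximations given in the text, not measurements):

```python
# Disk bandwidth scales with linear density, i.e. with the square root
# of areal density; channel counts follow from drive vs. channel rates.

import math

DRIVE_MBS = 12.5      # ~10-15 MB/s per drive (midpoint, assumed)
CHANNEL_MBS = 70.0    # ~70 MB/s sustained on an FC-AL loop

def drives_to_saturate(channel_mbs=CHANNEL_MBS, drive_mbs=DRIVE_MBS):
    """How few bandwidth-intense drives fill one channel."""
    return math.ceil(channel_mbs / drive_mbs)

def bw_after_density_jump(drive_mbs, areal_factor):
    """A 10X areal-density jump gives ~sqrt(10) ~= 3X more bandwidth."""
    return drive_mbs * math.sqrt(areal_factor)
```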

THOUGHT QUESTIONS:
	Suppose disk density increases 10X next year, without too much
	increase in cost.
	(1) Will you then have "too much" disk space?
		(Probably not: disks are binary devices:
		they are either new, or full).
	(2) What will you do for backup?
	(3) My little O2 has 2 disks ... suppose they were each
	90 GB?  (well, at least they could still be 1-2 file systems;
	imagine if there were some awful limit like 2GB for a file system?
	90 filesystems on 2 disks.  great.)
	(4) Will networks keep up?


|> An interesting scenario but you left out some alternatives. With SGI's
|> older Challenge series and other vendors current systems the processors
|> could be migrated to larger boxes with out much difficulty (ie a Challenge
|> L to a Challenge XL). Also if the growth occurs over time and is not
|> instantaneous than upgrades in processors might also be an alternative
|> to more processors.

This is where we came *from*.  Our customers said they wanted something
better.   Each CPU uprev in an existing infrastructure tends to be
less effective than the previous one, i.e., faster CPUs don't make the
busses, memory, I/O faster.
Also, while the CPU *boards* can migrate, I/O cabling  is far less pleasant.


|> My experience with older generations of CC-NUMA (BBN's Butterfly) that
|> you had better partition the problem well or you will spend most of your
|> time waiting for distant memory fetches. I have not use the new SGI
|> O series machine but there was some traffic that seemed to indicate
|> that partitioning is still an issue. If you need to worry about
|> partitioning anyway I not sure I see what the advantage of CC-NUMA is
|> then.

You might want to try current machines, since:
1) I suppose it is possible to call the Butterfly CC-NUMA: there were
certainly no incoherent caches, as there were no caches.
2) Remote memory latency was on the order of 4 microseconds, about
5X longer than local latency.
3) The 8MHz MC68000s were, of course, in-order CPUs with little
overlap.

A current CC-NUMA like an Origin has a furthest remote memory latency, in a
128-CPU system, of ~1.2 microseconds, has 4MB 2-way set-associative caches,
out-of-order execution, and up to 4 cache misses outstanding ... i.e.,
all hardware aimed at masking the (relatively low) remote latency.
Software does more [page replication & migration], with the bottom line
that the typical latency seems similar to (for example) the Sun UE10000,
but without spending the money upfront for the crossbar [which is the
main point for a CC-NUMA: get the effect of a centralized crossbar without
paying for it...]

Most people *don't* worry about partitioning, they just run the
SMP binaries off Challenges.  We have some customers who will commit
vile acts for another 10%, or are trying to get some algorithm to give
maximum performance for large CPU counts, and so they get knobs and
dials to get more control over memory placement.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 415-933-3090 FAX: 415-932-3090 [415=>650 August!]
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Is CC-NUMA no longer a good idea?
Date: 16 Jun 1997 20:31:21 GMT

In article <33A28965.2387@hda.hydro.com>, Terje Mathisen
<Terje.Mathisen@hda.hydro.com> writes:

|> Greg Pfister wrote:
|> [snip]
|> > If the latency & bandwidth ratios are less than 1.4:1, then absolutely
|> > nobody writing virtually any software cares.  There's a real data point
|> > on this:  IBM's 3094, shipped in about 1983, was in it's 8-processor
|> > configuration actually a two-node CC-NUMA system with a latency ratio of
|> > 1.4:1.  They didn't say anything, and nobody cared.  Actually, it was
|> > part of the design requirements that not a line of software have to
|> > change.  It didn't.
|> >
|> > If the latency is more than 3:1, I don't know of a single software
|> > person I've talked to who doesn't want to start modifying their code to
|> > match the hardware lumps to keep the performance OK.
|> >
|> > Between 1.4:1 and 3:1 it switches from one to the other, but at exactly
|> > what number I wouldn't want to have to say.
|>
|> I agree totally.
|>
|> My rule-of-thumb is that you need 2:1 before it really means something.
|>
|> Unless you have a factor 2 performance difference, most users won't
|> really notice it.
|>
|> This is particularly obvious for PC upgrades, be it cpu, memory or disk:
|> All of them need to be about twice as fast/big as the previous to make
|> the user sit up and say: "Hey, this is really great!"


Lest there be confusion engendered from this post:
1) This has nothing to do with upgrades of systems; in particular,
people can and do notice performance upgrades/downgrades <2X.
In particular, if someone has a job that runs overnight, that they
expect to finish about the time they come back into work, it is sometimes
totally unacceptable if it surprises them by running 24 hours instead
of 12...

2) The discussion is specifically of bandwidth/latency differences,
which are *not* the same as performance differences, and which can
be very sensitive to the nature of the code being run, as well as
to the CPU/memory system's abilities to overlap and hide latency.
The numbers we're all throwing around are really simple rules of thumb;
there is plenty of literature with serious studies.


3) Greg's datapoint of 1.4 is a useful addition, although I am curious:
	a) Was that remote latency?  (yes, I'd assume)
	b) Was there actually less (peak) remote bandwidth, i.e., was
	the interconnect slower than the local memory?
	c) Or was that 1.4X less sustained bandwidth?

(The idea that 1.4X worse latency, at least, is survivable, seems
consistent with other experience.)
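Those rules of thumb, written down as a sketch (the breakpoints are Pfister's and Terje's numbers from above; as noted, they are rough guides, not hard thresholds):

```python
def numa_effort(latency_ratio):
    """Rough guide: does the remote/local latency ratio force
    programmers to restructure code for the hardware lumps?"""
    if latency_ratio <= 1.4:
        return "nobody cares"           # e.g. the 1983 two-node case
    if latency_ratio >= 3.0:
        return "everyone tunes for it"
    return "it depends"                 # the gray zone in between
```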



-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 415-933-3090 FAX: 415-932-3090 [415=>650 August!]
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Is CC-NUMA no longer a good idea?  (Q for Sun?)
Date: 16 Jun 1997 21:18:22 GMT

In article <mjrEBrq9v.Esz@netcom.com>, mjr@netcom.com (Mark Rosenbaum) writes:

|> As disk become cheaper, I am begining to see disk to disk backup.
|>
|> >	(3) My little O2 has 2 disks ... suppose they were each
|> >	90 GB?  (well, at least they could still be 1-2 file systems;
|> >	imagine if there were some awful limit like 2GB for a file system?
|> >	90 filesystems on 2 disks.  great.)
|>
|> Even the 32 bit OSes (AIX 4.2 & HPUX 10.20) have > 2 GB file & file systems.

Yes, no, maybe: there are files, filesystems, and then whether or not
something is actually practical on a system given actual software, i.e.,
if there's a 32-bit OS, have they implemented Large File Summit features
yet, or not ... and whether one buys an extra file system feature, or not;
there are many combinations.  There are still plenty
of systems out there that would end up splitting a 90GB disk into more
than 1 file system whether that was what they wanted to do or not.
Obviously, it is easier to get bigger file systems than files, since the
data structures are less visible to as many programs.
=========
As a side note, in rummaging around to see what was in Solaris 2.6
(which is when the Large File stuff gets added), I found the following
interesting URL, as of 12/96:
http://www.sun.com/solaris/whitepapers/solaris64.html

which claims, among other things:
"Sun Microsystems, Inc. is the only company that has provided a
roadmap to 64-bit computing that preserves the investments in
32-bit systems, while providing a non-intrusive migration path to 64-bits."

Can anybody explain what that *means*? I'm trying to figure it out... :-)


|> 4 microseconds on an 8 Mhz processor vs 1 microsecond on a 250 Mhz processor.
|> Seems like you went in the wrong direction for memory that is not local.

I don't think so.  Raw DRAM access went from ~ 120ns to 60ns during that
period, i.e., 2:1 faster. If remote access went from 4 to 1, that's
4X faster.   It is a fact of life that CPUs are getting faster, faster
than memory.
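Both ways of looking at those numbers, restated as a sketch (figures taken
from the posts above):

```python
# Remote access went 4us -> 1us while DRAM went ~120ns -> 60ns, yet
# the *cycle* cost of a remote access still grew (8 MHz then, 250 MHz now).

dram_then, dram_now = 120e-9, 60e-9        # raw DRAM access, seconds
remote_then, remote_now = 4e-6, 1e-6       # remote access, seconds

dram_speedup = dram_then / dram_now        # 2.0: DRAM got 2:1 faster
remote_speedup = remote_then / remote_now  # 4.0: remote got 4X faster

cycles_then = remote_then * 8e6            # 32 cycles wasted at 8 MHz
cycles_now = remote_now * 250e6            # 250 cycles wasted at 250 MHz
print(dram_speedup, remote_speedup, round(cycles_then), round(cycles_now))
```

Both claims are true at once: remote latency improved faster than DRAM did,
and the cycle cost of a miss still grew, because CPUs get faster, faster
than memory.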


|> >Software does more [page replication & migration], with the bottom line
|> >that the typical latency seems similar to (for example) the Sun UE10000,
|> >but without spending the money upfront for the crossbar [which is the
|> >main point for a CC-NUMA: get the effect of a centralized crossbar without
|> >paying for it...]
|>
|> Getting something for nothing why does this seem too good to be true 8-}.
|> Is the Sun a true cross bar or a staged switch. If it is the former than
|> a hypercube does not have the same characteristics. If it is the latter
|> then it would be more accurate to say pay for only what you need when
|> you need it.

The UE10000 has a centerplane with 34 ASICs,  of which some form
a crossbar for the data transfers.
I said the Origin2000 got the "effect" of a centralized crossbar, not that
it was identical.  What's clear is that:
	(a) As you add nodes, the bisection bandwidths of the two systems
	are very similar.
	(b) Latency is complicated to figure out, but the typical latency
	of the O2000 is less than the worst case latency, and it appears there
	are plenty of cases where the effective average latency of the two
	systems is difficult to distinguish.

The Routers in Origins are much more complicated devices to engineer;
nobody got anything for free.


|> So all of the RDBMSes are running with just a recompile and scalling
|> to 128 processors. If so then that is truely very impressive.

Well, they run without a recompile. Whether or not one can actually
use that many CPUs depends heavily on what they are doing, and many DBMSes
have internal scalability limits of their own.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 415-933-3090 FAX: 415-932-3090 [415=>650 August!]
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Is CC-NUMA no longer a good idea?  (Q for Sun?)
Date: 17 Jun 1997 19:38:27 GMT

In article <33A6AA8E.38DA@Sun.COM>, Ashok Singhal <ashok.singhal@Sun.COM>
writes:

|> John R. Mashey wrote:
|> > As a side note, in rummaging around to see what was in Solaris 2.6
|> > (which is when the Large File stuff gets added), I found the following
|> > interesting URL, as of 12/96:
|> > http://www.sun.com/solaris/whitepapers/solaris64.html
|> >
|> > which claims, among other things:
|> > "Sun Microsystems, Inc. is the only company that has provided a
|> > roadmap to 64-bit computing that preserves the investments in
|> > 32-bit systems, while providing a non-intrusive migration path to 64-bits."
|> >
|> > Can anybody explain what that *means*? I'm trying to figure it out... :-)
|> >
|>
|> Sure, just take your explanation for this one and apply the same
|> philosophy to the Sun whitepaper quote.
|> http://www.sgi.com/Products/hardware/servers/technology/overview.html
|>
|> "The DSM structure of Origin2000 also creates the notion of local
|> memory.
|> This memory is close to the processor and has reduced latency compared
|> to bus-based systems, where all memory must be accessed through a shared
|> bus."
|>
|> lmbench latency to *shared* memory over a bus on the UE 6000: 304 ns
|> lmbench latency to *local* memory on SGI Origin 2000: 480 ns

0) Wrong: my post asked a question about what the Sun statement could mean;
Ashok's reply supplies a specific (but incorrect) interpretation of the SGI
URL and then claims that the statement is wrong (whereas it might merely be
not specific enough).

1) Thanks for reminding people of that URL, which has a pretty readable
description of how an Origin works, with only occasional marketing.
(I didn't write this, either.)

2) The statement in the URL does NOT say:
	"No vendor sells a bus-based MP whose shared memory latency,
	as measured by lmbench, that is as good as the local memory latency
	of the Origin, as measured by lmbench."
This is obviously untrue, even internally, since, after all,
a single Origin node is effectively a 2-way shared-bus SMP :-)

Likewise, it is untrue to claim that the URL can only be interpreted as 2);
that is, Ashok has chosen to place a specific interpretation on it.
One might legitimately claim that the statement could have been made more
specific: I will explain what people were thinking of as this was written.


3) If the URL were a published paper for an architecture conference,
the single sentence above would be expanded to several pages, of which
the introduction would be as given below,
noting that the logical context of the
sentence is the "Scalability and Modularity" section earlier in the URL.
Also, while lmbench is useful, it's only one of the metrics, and one
can get trapped by thinking it's the only metric.

"Various system designs make different tradeoffs between bandwidth,
unloaded latency, loaded latency, scalability of system design,
uniformity of latency under various loads, and reduction of effective
latency via overlapped operations.

It is possible to build small bus-based SMPs, with a few CPUs (2-4 typical),
whose unloaded, non-overlapped latency is very low, because parts-count,
bus-lengths, etc. can be kept to a minimum.  This tends to optimize
uniprocessor, or several-processor, performance at the expense of
scalability, since the bus bandwidth is fixed, and if targeted at
low cost, is fixed at a relatively low level.
A good example would be an Intel SHV.

It is possible to build larger bus-based SMPs [like  Sun UEX000],
with good latency, especially unloaded, but with the classical limits
to bus-based design, as mentioned, for example, in:
http://www.sun.com/servers/datacenter/products/starfire/tech.html
"This unique crossbar, unlike traditional buses, ensures fast, uniform,
conflict-free memory access throughout the entire system, bringing the
promise of MPP and NUMA systems' high performance, low cost, and
scalability to the wide range of SMP applications and exceptional data
center manageability."

As any system is scaled up, any fixed resource that is shared is progressively
shared among more CPUs and I/O devices; specifically, any fixed interconnect
must be paid for at the beginning, and its bandwidth and number of
transactions/second are divided by the number of (CPUs+I/O devices).
Either such a resource is over-engineered for smaller configurations,
or under-engineered for the larger ones, leading to the classic
"imaginary configuration" problem, wherein, in mid-range and up SMPs,
nobody buys the smallest configurations or largest configurations,
because the former do not amortize the infrastructure investment,
and the latter often run out of bandwidth, or connectivity, or something.
[For example, while SGI Challenge XL's theoretically went from 1 to 36
CPUs, the end cases were seldom purchased :-)]
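A toy model of the fixed-interconnect squeeze (my illustrative numbers, not
any particular product's):

```python
# Per-device share of a fixed shared interconnect, divided among
# CPUs + I/O devices.  The 2.5 GB/s figure is made up for illustration.

def per_device_bw(total_gbs, n_devices):
    return total_gbs / n_devices

FIXED_BW = 2.5  # GB/s, paid for up front regardless of configuration

for n in (4, 16, 36):
    print(n, round(per_device_bw(FIXED_BW, n), 3))
# 4 devices get 0.625 GB/s each; 36 (a full Challenge XL, say) get
# ~0.069 each: small configs overpay for the bus, big ones starve it.
```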

Since large SMP servers are more expensive than desktops, they
tend to be kept busy, and hence are usually optimized for heavy workloads,
i.e., many processors working at once.

Large SMP servers' designs are often affected by the nature of the CPUs.
For example, a CPU with in-order execution and relatively low overlap of
multiple cache misses really demands a memory system with low unloaded latency;
it is not particularly useful for the memory system to be able to
handle more overlapped requests from one CPU than that CPU can generate.

Origins are built to allow interconnect bandwidth to scale up incrementally,
without paying for all of the bandwidth in the minimal configuration,
that is, 1-2 CPU versions are actually useful, using the same
components as 64-128 CPU versions.
The price, of course, is non-uniformity of latency, although every
effort was taken to minimize that [as discussed earlier in this thread].

R10000s are out-of-order chips with aggressive speculation, that
can have up to 4 cache misses to memory outstanding, and hence can
generate bursts of cache misses; the memory system is tuned to overlap
these, especially under conditions of heavy load (there is a lot of queueing
in the Hub chips to allow this).

So, suppose you think:
	a) You need to deal with systems in the 32-64 CPU range, at least.
	b) You're going to tell people they don't need to buy all of their
	bandwidth up front, and that in fact, they don't need to avoid
	buying the last few CPUs (i.e., as sometimes happens in fixed-
	interconnect systems, where the bandwidth starts to run out, and the
	effective latency goes up), i.e., that one should worry most about
	latency under load.
	c) You're using CPUs that overlap cache misses, and this actually
	happens sometimes, so you worry about the behavior for streams
	of cache misses, not just individual ones.
Then, under those assumptions, you would say:

|> "The DSM structure of Origin2000 also creates the notion of local
|> memory.
|> This memory is close to the processor and has reduced latency compared
|> to bus-based systems, where all memory must be accessed through a shared
|> bus."

Now, as it happens, lmbench, which is certainly one of the useful
benchmarks (as it does tell you about certain worst-case behaviors, and
it has good correlation with some codes), has two relevant properties:
	(1) Normally measures the unloaded case, i.e., there is no credit
	    for queuing, and no accounting for heavily-loaded busses.
	(2) On purpose, is cleverly written to defeat all overlapping,
	i.e., by following pointers so that even a heavily speculative
	CPU *never* gets any use from those features (despite the fact
	that those features are useful in real life).
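The pointer-following trick in (2) can be sketched like this (a model of
the technique, not lmbench's actual code):

```python
# Build a chain of dependent loads: each element holds the index of the
# next, in scattered order, so load i+1 cannot begin until load i finishes.
import random

def build_chain(n, seed=0):
    """a[i] is the index of the next element, forming one cycle that
    visits every slot in random order."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    a = [0] * n
    for i in range(n):
        a[order[i]] = order[(i + 1) % n]
    return a

def chase(a, start, steps):
    """Each load's address is the previous load's *result*, so even a
    speculative out-of-order CPU cannot start the next miss early."""
    p = start
    for _ in range(steps):
        p = a[p]
    return p

a = build_chain(1024)
# One full trip around the cycle returns to the start: every element
# was visited, and every access was serialized behind the last one.
print(chase(a, 0, len(a)) == 0)   # True
```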
However, if one thought the lmbench latency were the *only* relevant latency
metric, nobody would design anything like a Sun UEX000 or SGI Origin2000:
what you'd want would be:
	- The simplest in-order chip you could find, with highest clock rate.
	[Out-of-order doesn't help you on lmbench].
	- No more than 1 level of cache, but preferably 0, to minimize
	time to cache-miss detection.
	- Transfer one word of data per cache miss.
	- Memory controller with no overlap/queuing etc, and one size
	of DRAM, and maybe just built into the CPU chip.
	- But neither Sun nor SGI build SMPs that way, do we?
	[Well, actually SGI/Cray sort of does, in that a T90, with SRAM
	main-memory and 1-element vectors, comes close...]

NOTE: it is sad, but true, that even good, useful benchmarks can get
over-interpreted...

So: one might reasonably complain that the (simplistic) interpretation
of the URL's statement was misleading, and it would have been better to
have added the relevant caveats/explanations ... because it is *quite
clear* from the rest of the design that engineers were focussed on
latency reduction, for streams of transactions, under heavy load.

What's there is the equivalent of saying:

Airplanes are faster than cars.

without adding all of the caveats, like:
- except going backwards
- except when using one to go the grocery store.
- Except that certain cars can go faster than 747s, i.e., from standing
start to the end of the runway, and given that the car is not required
to fly away at the end.


Anyway, I'm still (politely) wondering what the Sun statement *means*,
that is, is there a plausible explanation for this (fairly strong)
claim?

|> > "Sun Microsystems, Inc. is the only company that has provided a
|> > roadmap to 64-bit computing that preserves the investments in
|> > 32-bit systems, while providing a non-intrusive migration path to 64-bits."

All I want to know is what reasoning there is: people may consider the
reasoning bogus (or not), but it would be nice to know if there were some.
This is in the same vein as (politely) asking some Intergraph
people the reasoning behind the following, a while ago:

"Intergraph Corporation is the world's largest company dedicated to
supplying interactive computer graphics systems."

And answers included things like: "Intergraph has more employees than SGI,
and anyway, Sun/SGI etc do things that aren't interactive graphics,
and even Intergraph's servers are really to support interactive graphics,
but some of the other companys' aren't, so they're not dedicated."

One may or may not agree, but at least people provided an explanation.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 415-933-3090 FAX: 415-932-3090 [415=>650 August!]
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Is CC-NUMA no longer a good idea?
Date: 18 Jun 1997 03:21:51 GMT

In article <33A70583.30FE@Sun.COM>, Ashok Singhal <ashok.singhal@Sun.COM>
writes:

|> John R. Mashey wrote:
|> > but without spending the money upfront for the crossbar [which is the
|> > main point for a CC-NUMA: get the effect of a centralized crossbar without
|> > paying for it...]
|> >
|>
|> 1. You don't pay for the full crossbar up front with the Sun UE10000.
|> You pay for a 16-way crossbar on the centerplane  The first stage switch
|> is on each board and you pay for that as you go.

The UE10000 centerplane, before any system boards, has 34 ASICs on it;
of course there is a "Local Data Router" on each system board, but the
global data router is in place upfront.
I observe that Sun's pricing imputes the price of a 4-CPU
system to be ~$500K.  There's nothing wrong with that, but it certainly
is $$ upfront.


|> 2. What you save on the initial rack on the O2000,
|> you lose on the number of racks as you grow.
|> The 64 CPUs, 64GB memory and IO slots fit in a *single* rack that is
|> roughly the same size as an Origin rack.  For a similar size system,
|> I believe you need 4 Origin racks(?).  Quite apart from the
|> cost of the racks themselves, the floor space occupied is often
|> at a premium in many environments.

With the Web, it is easier than it used to be to check facts:
System	Height	Width	Depth	Floorspace
	Inches	Inches	Inches	Sq inches
UE10000	70	50	39	1950  (3.03X larger than the O2000)
O2000	73	23	28	644

One might argue that 3X larger is "roughly the same size", I would
argue that it's 3X larger :-)

Regarding floor space:
		Max
	# racks	# CPUs	memory	# IOs	Floorspace	Disks (?)
O2000	1	16	32 GB	24	 644		5+5+6 = 16
UE10000	1	16	16 GB	16	1950		?? (looks like 7x3=21)

O2000	2	32	64 GB	48	1288		32
UE10000	1	32	32 GB	32	1950		21?

O2000	3	48	96 GB	72	1932		48
UE10000	1	48	48 GB	48	1950		21?

O2000	4	64	128 GB	96	2576		64
UE10000	1	64	 64 GB	64	1950		21?
UE10000	2	64	 64 GB	64	3900		many


Each of the 2 modules in an O2000 rack has room for 5 disks in the module,
plus room in between the modules for another 6 [but there may be configuration
rules.... :-)]  I *think* that the base UE10000 has 3 trays of 7 disks
each, but that could be wrong.

I/Os: each UE10000 system board has 2 SBUS64 controllers (100 MB/s sustained)
and up to 4 SBUS64 cards per system board.
Each O2000 rack has 4 XBOWs, each with 6 (2X600 MB/s sustained) XIO cards,
although total I/O would be limited by XBOW-Hub connections, call it
10 GB/s in round numbers for the I/O limit, with current node bandwidth
limiting it to about 5 GB/s.

Put another way, IF you do the UE10000-O2000 comparison at exactly the
right configuration [49-64 CPUs, few enough disks that they fit into the
first UE10000 cabinet], then the UE10000 is more space-efficient,
else usually not.  Note that in a 4-rack O2000, I/O cables can be split
among the 8 modules; in the UE10000 *all* cables go into the 2 sides of
the bottom half of the first cabinet.

|> 3. Memory replication/migration also has it's costs, both monetary and
|> in terms of performance: you need more memory in order
|> to replicate (which adds to the cost; I believe this alone could
|> more than make up for the initial extra cost of the UE10000 backplane),
|> and you use bandwidth (both interconnect and memory) to
|> replicate/migrate.
|> I also believe that replication with write-sharing can have serious
|> negative performance impact in many cases.

Of course ... but there is no data presented ... and I will point out that
SMPs, even bus-based ones, can develop memory hotspots, that is,
one may have an infinite-GB/s crossbar, but if everybody is touching the
same physical memory bank, that is the limit....


|> 4. If one didn't want a big system to start with, one could get a
|> Sun UEx000 and migrate to a UE10000, taking the expensive components,
|> CPUs, memory and SBus boards, along (they're the same across the
|> two platforms).

Let me make sure I understand this.  It sounds like:
- Suppose you have a 16-CPU UE6000, which is starting to run out of gas,
  and it has 16 SBUS cards (I think that's possible).
- You wheel in a UE10000, with 4 system boards plus 6 empty system boards.
- You take the UE6000 down.
- You extract the CPUs and memory from the UE6000, and plug them into
  the empty system boards.
- You extract the SBUS cards from the UE6000's I/O boards, and plug them into
  the UE10000 system boards.
- You hook all of the cables back up.
- You move any internal disks & other devices (?)
- You bring the UE10000 up.
- You throw away the UE6000 enclosure, backplane, power supplies (?),
  CPU/memory boards & I/O boards, or return to Sun.

Can you point me at a Web page that describes this?  (I looked, and found
lots of other upgrade path descriptions, but perhaps I missed this one.
If this is something that is *actually* recommended to customers,
I would be interested in reading the wording.)

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 415-933-3090 FAX: 415-932-3090 [415=>650 August!]
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Is CC-NUMA no longer a good idea?  (Q for Sun?)
Date: 18 Jun 1997 03:42:10 GMT

In article <mjrEBxzFo.88x@netcom.com>, mjr@netcom.com (Mark Rosenbaum) writes:

|> Seems that a better metric for this might be number of cycles a processor
|> misses while waiting. For 8 MHz processor it would be 4 X 8 or 32. For the
|> 250 MHz processor it would be 1 X 250 or 250. As you pointed out the
|> processors are getting faster at a more rapid rate then memory or
|> interconnections. This would seem to indicate that eventually some sort
|> of partitioning scheme would be needed.

|> If the software has to be rearchitected anyway does it make sense to
|> start partitioning the data since that is where the every increasing
|> difference in CPU and memory performance is driving things?

Of course, and people with their own code, if it is amenable to
such approaches, do this ... but lots of people just want
existing SMP binaries to run well, and take their time tuning,
or not tune at all.  If anyone knows how to take an <arbitrary>
program, and get 128X performance increases on a 128-CPU system,
automatically ... they aren't telling :-)
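The standard reason, of course, is Amdahl's law; a short sketch:

```python
# Amdahl's law: any serial fraction caps the speedup, which is why
# nobody promises 128X for an <arbitrary> program on 128 CPUs.

def speedup(n_cpus, parallel_fraction):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cpus)

print(speedup(128, 1.00))            # 128.0: only a perfectly parallel code
print(round(speedup(128, 0.99), 1))  # 56.4: 1% serial costs over half
print(round(speedup(128, 0.90), 1))  # 9.3
```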

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 415-933-3090 FAX: 415-932-3090 [415=>650 August!]
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Is CC-NUMA no longer a good idea?
Date: 19 Jun 1997 03:48:21 GMT

In article <5o9h5p$3ol$1@engnews2.Eng.Sun.COM>, kbn@Eng.Sun.COM (Kevin
Normoyle) writes:

|> Organization: Sun Microsystems Inc.
|>
|> In article 1@murrow.corp.sgi.com, mccalpin@frakir.asd.sgi.com (John McCalpin) writes
|>
|> > John Mashey has long said that users
|> > will not tolerate more than a 3:1 latency in ratio for remote to local
|> > memory accesses, and I would add that users will not tolerate much more
|> > than a 2:1 bandwidth ratio for remote to local memory accesses.
|>
|>
|> In another context, a wag commented that users can't tell the real difference
|> between two systems unless the performance ratio is at least 3:1.
|> (2:1 if they paid for it)
|>
|> With tongue-in-cheek, I might say that you're pointing out that users
|> don't want any apparent difference in latency or bandwidth anywhere :)

Well, sort-of, but not quite.

Ideally:
1) A system would start at (small #, like 1, 2, or 4) CPUs,
with cost in the typical small server range.
2) You could scale up CPUs, memory, I/O, by incremental *addition*,
and in a wide range of proportions, TO
2b) any number that anyone's budget would pay for.
3) The memory bandwidth/CPU would remain constant & high, and the CPU
should get the same bandwidth to any memories, if there are more than one.
4) The bisection bandwidth of any interconnect should scale linearly as
well.
5) Memory latencies (both unloaded and loaded) should be low, with
no difference between memories.
6) It's all shared-memory.

Of course, the laws of physics catch up with you, so it is unlikely
that anyone will build this (since, among other things, 2) includes
a system with at least 3K CPUs...)

Bus-based SMPs, crossbar-based SMPs (UE10000, HP Convex S, Cray T90, mainframes),
and CCNUMAs all make different tradeoffs.
I mean, if you really want high bandwidth, and low latency under load,
I can get you a deal on a Cray T932 (multiple vector pipes to a crossbar
switch, with heavily-interleaved SRAM memory, help).

The 3:1 comment from me, more precisely, was: if the remote/local latency
ratio gets worse than about 3 or 4:1, programmers notice and will switch to
message-passing.  Nobody thinks more latency is good ... it's
simply that we're willing to pay some price (in 5) to get 1), 2), 3),
most of 4), and 6), but of course we're not willing to pay too much, which
is why there was that whole list of techniques used to reduce the effective
latency as seen by real programs.  At least some of those came out of
the DASH work, and also out of studies of other CCNUMAs and COMAs,
at least some of which said that queuing effects could really bother
you and increase the effective latency to remote memories
if you weren't very careful.
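A toy effective-latency model (my own sketch; the 300ns local figure and
10% remote fraction are assumptions for illustration only) shows why both
the ratio and the fraction of remote references matter:

```python
# Weighted average of local and remote latency under a given remote
# reference fraction.  All numbers are illustrative assumptions.

def effective_latency(local_ns, remote_ratio, remote_fraction):
    remote_ns = local_ns * remote_ratio
    return (1 - remote_fraction) * local_ns + remote_fraction * remote_ns

# At 3:1 with only 10% remote references, the penalty is mild:
print(round(effective_latency(300, 3, 0.10)))   # 360 ns, 1.2x local
# At 10:1, the same traffic nearly doubles effective latency:
print(round(effective_latency(300, 10, 0.10)))  # 570 ns, 1.9x local
```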

Anyway, the point was: it's not that regular SMPs (bus- or crossbar) are
Bad Things, it's that not all CC-NUMAs are the same, and that if you are
very careful, you can keep the latency under control and hide some of it,
to keep the effective latency down far enough that some people don't seem
to notice.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 415-933-3090 FAX: 415-932-3090 [415=>650 August!]
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: SGI Origin Sync inherently unfair? (was: Re: Interesting Proposal 
	for NUMA machines (BBN Butterfly and IBM RP3)?)
Date: 19 Sep 1997 01:47:08 GMT

In article <5vre8e$4po@flonk.uk.sun.com>, Andrew Harrison SUNUK
Consultancy <andrew.harrison@uk.sun.com#> writes:

|> > Yes, and that's fine.  You're attributing some virtue to SGI, however,
|> > that doesn't appear to exist in the other NUMA vendors. I know that
|> > Sequent in particular goes out of its way to claim that their NUMA-Q
|> > systems "maintain the SMP programming model" so you can just run all
|> > your old SMP code, no change.  None of that ugly rewrite like you have
|> > to have for MPP, no sir, just plug it in and go.
|> >
|>
|> This is interesting because Sequent seems to have a much worse ratio of local
|> to remote memory latency, one Sequent white paper seemed
|> quite pleased with a 10x difference. Since the general concensus seems
|> to be 3x and you are in trouble Sequent NUMA would seem to be a
|> candidate for a NUMA programming model.
|>
|> > I don't know whether SGI's marketeers have used this one or not.

Yes.

Note: there are several statements that one can make, and they are
easy to confuse, and they mean different things:
1) All of the vendor's old SMP binaries execute on the new ccNUMA. [yes]
2) Some SMP binaries perform well on the ccNUMA. [yes]
3) Some SMP binaries perform well, and can be made to perform better
	with a little tuning. [usually]
4) All SMP binaries perform well, automatically. [not usually]
5) All SMP binaries perform at least OK, but some need tuning to
	perform well. [possible to get close]

and of course, the dreaded:
6) Some SMP binaries perform well, some OK, and some horribly,
and some can only be improved with monster rewrites. [happens, especially
if bandwidth is low or latency long, or both]

and the really dreaded:

7) An SMP binary that performs well with 1, 2, N CPUs degrades miserably
when N+1 is added to handle a bigger workload. [happens, especially with
ccNUMAs where going from N to N+1 crosses a major interconnect boundary]

and even worse:

8) An SMP binary performs well under certain workloads and configurations,
and is awful under others, and it is difficult to characterize the difference.
[happens, usually an OS or DBMS randomness]

and finally, the worst of all:

9) An SMP binary, running the same workload on the same configuration,
sometimes performs well, and sometimes horribly, for no obvious reason.
[On a smaller scale, long ago, for instance, people running overnight
ECAD simulations on the original MIPS machines found that most of the
time, they'd run overnight, but 10% of the time, they'd run twice as long
(due to random allocation of cache pages and page-coloring of small
caches), which is basically catastrophic.]
[can happen, especially with scientific/engineering codes with
matrix operations near cache-size inflection points.]

In general, if there is:
	- mostly read-only access
	- relatively little interprocess communication
	- low-bandwidth I/O
almost any system design will work well.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Review of Forest Baskett's talk at UW, 11/13/97
Date: 17 Nov 1997 06:16:13 GMT

In article <87oh3kd6cu.fsf@serpentine.com>, Bryan O'Sullivan
<bos@serpentine.com> writes:

|> m> A Sun Ultra without any modules costs $.25M and the expensive
|> m> connections limit the processor speeds to < 100 MHz, whereas the
|> m> current RISC processors are ~500 MHz.
|>
|> I don't know where this number got invented, but it is arrant
|> nonsense.  Interconnect speed has absolutely no effect whatsoever on
|> processor speed, and there has never been an UltraSPARC processor
|> shipped that clocked at less than 143MHz (current generations are in
|> the 300MHz+ range).

I don't know where this number got invented either: I suppose it's
possible that Forest said such a thing, but it seems extraordinarily unlikely:
I'm right now looking at the slide set he used, although I can't of course
tell if he used all of the slides in the set.  Slide 10 says:

"THE PHYSICS OF PROCESSOR INTERCONNECT
Busses are electrically difficult:
- Multiple loads create noise
- Noise must settle
- Clock rate must allow settling
==> Clock rates are limited
e.g. 66MHz for Intel Pentiums
     83 Mhz for SUN UE10000

Point-to-point wires can track circuit technology
e.g. 400 MHz external interfaces."
=======
Now, I *know* Forest knows the difference between Processor MHz and
Processor Interconnect MHz ... perhaps Michael Ess did not ...

Of course, processor speed may be constrained to be one of several
potential multipliers of the interconnect speed, and of course, interconnect
speed has effects on processor performance.


--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Review of Forest Baskett's talk at UW, 11/13/97
Date: 17 Nov 1997 19:57:55 GMT

In article <64ps64$dsc@mtinsc03.worldnet.att.net>, "Michael Ess"
<mikeess@worldnet.att.net> writes:

|> Organization: AT&T WorldNet Services

|> I believe you and John Mcalpin have seen the slides for longer than I
|> did so I defer to you guys on what they actually said, I could be wrong.
|> If they could be posted somewhere that would be great. But some additional
|> points.

|>      never mentioned that these were peak numbers. In such a young,
|>      impressionable audience, I thought it should at least have been
|>      mentioned.

I'll mention it to him: it is certainly possible that he wouldn't have
bothered to belabor the point.  I may be a little more sensitized to this
as I more often give talks to general university audiences than
Forest does these days.

|> 2. I must have missed the point on why SMP couldn't use high MHz
|> processors,
|>     it seems they could be used but that the bus could not run at simlarly
|>     increased speeds. And so the balance of earlier machines had to change.

Again, there are:
	- CPU MHz (which can be tricky in its own right, of course,
		with some of the gimmicks that people do)
	- CPU *interconnect* MHz

and the two are different.  In the PC world, on-chip MHz is emphasized
by vendors even more than elsewhere, and the other attributes tend to disappear,
sometimes leading to surprises of relative performance.

If the *interconnect* is a bus, it is getting harder and harder to run it
much faster.  One can always run the internal clock speed faster, and
then step it down to the external limits.  This is done *quite
frequently*.  In the "old days", one used to have internal clock ==
external bus clock: 25MHz R3000s had 25MHz external busses.  I'm not sure
if the R4000 was the first to have ratio-d clocks, but it was one of the
earlier micros to do this, i.e., typical early R4000s ran 100MHz internal
and 50MHz external.  In any case, I can't think of a current
high-performance micro that *doesn't* do this.  The problem, as has
been seen in SGI Challenges, DEC 8400s, etc., is that if you keep
increasing the internal speed and leave the external speed alone,
less and less of the internal speed increase materializes as actual
performance ... at least for anything that misses caches :-)
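A toy model of that effect (all numbers invented for illustration):

```python
# Runtime = compute time (scales with CPU clock) + miss stalls (fixed
# by the external bus).  Doubling only the internal clock helps less
# and less as the stall term dominates.

def runtime_s(instructions, cpi, cpu_mhz, miss_rate, miss_penalty_ns):
    compute = instructions * cpi / (cpu_mhz * 1e6)
    stalls = instructions * miss_rate * miss_penalty_ns * 1e-9
    return compute + stalls

base = runtime_s(1e9, 1.0, 100, 0.02, 500)  # 10s compute + 10s stalls
fast = runtime_s(1e9, 1.0, 200, 0.02, 500)  # 5s compute + 10s stalls
print(round(base / fast, 2))                # 1.33: 2x clock, 1.33x speed
```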

Anyway, I can't find anything in the slides that says you can't use
high MHz CPUs in SMPs, but I find things that say *busses* can be limiters
as interconnects.
--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Review of Forest Baskett's talk at UW, 11/13/97
Date: 18 Nov 1997 06:12:21 GMT

	[[significant typos corrected as per later posts]]

In article <34709E00.F2F9AD@cs.wisc.edu>, Andy Glew <glew@cs.wisc.edu> writes:

|> The problem with Forest's view is that both transaction processing and
|> web serving do not seem to require single large programs,
|> but can instead run well on clusters of nearly independent machines,
|> as described in Greg Pfister's book.  Better intra-cluster network bandwidth
|> is highly desirable, as are cluster lock managers, etc., but shared memory
|> does not seem to be a requirement.

The problem with Andy's view is that he assumes this *is* Forest's view,
rather than Michael's posting of his interpretation of what he heard
Forest say.  Although I wasn't there, I have good reason to believe
that Michael's posting is incomplete/incorrect, which is to be expected when
quickly trying to summarize a talk with 40 slides.


|> > For Forest, the new emphasis for these big machines is transaction
|> > processing and web servers, not necessarily scientific computing. The
|> > latencies and bandwidth are the key metrics for determining which apps
|> > are appropriate.  The benchmarks nowadays are:
|> >
|> >         read a file from disk
|> >         read/write a file
|> >         read in a file, sort it, write it back out
|> >         backup a disk

Forest's slide says:
Origin2000 I/O Milestones
+ Fibre channel results
	- 4.3 GB/s read from a single file
	- 3.9 GB/s write to a single file
	- 7.6 GB sorted to/from disk in 1 minute
+ SCSI results
	1 TB backup to tape in 1 hour
	1 TB sorted to/from disk in 152 minutes
	3.8 GB/s read from a single file
	2.0 GB/s write to a single file
	7.2 GB sorted to/from disk in 1 minute

|> > All of these point to a transaction oriented customer base, not a

No, they don't.

The term "transaction" covers a wide range of things, but if one says:
	(a) There are IOPS-dominated workloads, like "Classic OLTP",
	airline reservations systems.  Small blocks, lots of accesses to
	nearly independent things.  Some Web apps are like this,
	and some aren't, if only because some are frontends for other DBMS.
	Of course, backing the DBMS up may be different.
	For this space, backups are relevant, as are sorts.
	Reading/writing huge files sequentially is not so relevant.

	(b) It may astonish some people, but some technical apps do massive
	amounts of disk I/O :-)  All of the above numbers are relevant,
	and such people often use massive striped disk arrays, and complain
	about speed reading individual huge files.

	(c) Data warehousing and data mining are more like (b) than (a),
	and most of the benchmarks are relevant.

	(d) Some uses of RDBMS seem like mixtures of (a) and (c), i.e.,
	run some transactions during the day, load new data at night
	(good to sort first), run big extracts and reports, and backup.

Now, back to Andy's statement, which (I think) mis-states what Forest said:
(a) Forest's concluding slide (which I assume got used) says:

CONCLUSIONS FOR FUTURE SYSTEMS
- Point-to-point wiring with (single chip) switch interconnect
- Bypass interface architectures for network interfaces and I/O
- ccNUMA for serious scalable systems
- Clusters of simple systems and ccNUMA systems

(b) Now, it is an article of faith at Intel and Microsoft that *all*
work can be done by clusters of machines of <some size>, and we certainly
agree that, IF you have classical transaction workloads of (mostly)
unrelated accesses, then clusters of machines, connected by networks of
relatively low bandwidth, work fine, as shown by early Tandem systems.
This leads to the idea that all users should have only those environments.

(c) Where we don't agree is thinking that this represents all workloads
in the real world.  Perhaps Andy talks to CIOs and MIS Directors a lot.
Maybe the ones he talks to aren't the ones I talk to, some of whom *violently*
reject what he posted, including such classic MIS folks as CIOs of
investment banks, who've tried exactly that and never, ever want to do
it again.  (I mean, the CIO of a serious place walked in one time and said:
"We want to know what you have, but if you tell us you want to do
a cluster of small nodes, you can stop right there; we've tried that already.")

	In general, if data is split into "natural" groupings, so that
	all of the big data that needs to be together is together, and
	near the CPUs that work on it, all is well, and this definitely
	covers some important applications.

	If not, all is not well.

	For instance, if you have N small nodes, and you've split the
	database up into N approximately equal pieces, and the transactions
	are simple retrievals of independent records, all is well.
	Suppose you double the amount of data.  Does that mean you
	double the number of nodes?  Does that mean you make each node
	2X bigger? Does that mean you oversize each node by a factor of 2?
	In such systems, there is usually some mechanism to start with an
	input key and route it to the appropriate node.  If such a mechanism
	produces too much imbalance, the disks may need to be rebalanced,
	and nobody likes that much.
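
	The key-routing mechanism described above can be sketched as follows
	(a hypothetical Python illustration, not any vendor's actual code;
	`node_for_key` and `imbalance` are names invented here):

```python
# Hypothetical sketch: route each record to a node by hashing its key,
# then measure the imbalance that makes rebalancing painful when the
# data grows or the key distribution skews.
import hashlib

def node_for_key(key: str, n_nodes: int) -> int:
    """Map a record key to one of n_nodes partitions."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_nodes

def imbalance(keys, n_nodes):
    """Ratio of the fullest node's load to the ideal even share."""
    counts = [0] * n_nodes
    for k in keys:
        counts[node_for_key(k, n_nodes)] += 1
    return max(counts) / (len(keys) / n_nodes)

keys = [f"cust-{i:06d}" for i in range(100_000)]
print(imbalance(keys, 16))   # near 1.0 only while keys spread evenly
```

	Doubling the data, or a skewed key distribution, pushes the ratio up,
	and fixing it means physically moving records between nodes' disks.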

	If you are doing simple retrievals from an RDBMS, you may be able
	to spread the load around well.  If you start doing joins on big
	tables that happen to be on separate nodes, with low bandwidth
	in between, what is simple and straightforward on an SMP / ccNUMA
	is not fun on a cluster.

	As a simple example, consider the sort numbers mentioned above:
	7+ GB/min sort of 1 file, 1 TB in ~2.5 hours.
	Each starts with one file on (striped) disks, reads it in,
	sorts it, using Ordinal Technology's commercial nsort program,
	and writes it back to disk, to one file,
	i.e., this is like sort file -o file
	(i.e., straightforward results, like what people do in real life,
	distribution of keys is fairly irrelevant, no odd setup, etc)

	Berkeley's NOW-Sort achieved 8.41 GB in 1 minute, on a network
	of 95 nodes; UCB is doing interesting research, but this result
	is often cited as proof that this approach is a good idea, and
	maybe it is, and maybe it isn't.
	If this result is gotten in a way similar to the previous 64-way
	result:
		(1) The input data is approximately equally partitioned
		among N nodes.
		(2) Each node reads its data, and sends each record to
		the node on which the record should end up.  I don't know
		what they are doing with 95; for 64 nodes, they used a
		distribution of keys that was uniform on the first 6 bits,
		so those bits could be used as an index to the nodes.
		(3) Each node receives its data, sorts it, and writes it to
		its local disks, thus ending with N files.
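
		The three steps above can be sketched as a toy, single-process
	model (an assumed reconstruction, not Berkeley's actual NOW-Sort
	code; 8-bit integer keys and 4 "nodes" stand in for the real
	100-byte records and 64-95 nodes):

```python
# Toy sketch of a partitioned sort: each "node" receives the records
# whose key's top bits select it, sorts locally, and the result is
# N sorted pieces rather than one sorted file.
N_NODES = 4          # real runs used 64 or 95 nodes
BITS = 2             # top bits of the key index the node (6 bits for 64)

def partition(records, n_nodes, bits):
    """Steps (1)-(2): route each record by its key's top bits."""
    buckets = [[] for _ in range(n_nodes)]
    for rec in records:
        key = rec[0]
        buckets[key >> (8 - bits)].append(rec)   # 8-bit keys assumed here
    return buckets

def local_sort(buckets):
    """Step (3): each node sorts its own bucket independently."""
    return [sorted(b) for b in buckets]

records = [(k, f"payload-{k}") for k in (200, 13, 77, 255, 5, 130)]
parts = local_sort(partition(records, N_NODES, BITS))
# Concatenating the per-node outputs is globally sorted only because
# the top-bits partitioning preserves key order across nodes.
merged = [r for b in parts for r in b]
assert merged == sorted(records)
```

	Note that this only works well when the keys are uniform on those
	top bits, which is exactly the assumption flagged in step (2).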

	The reader should evaluate the extent to which this kind
	of sort would meet their sorting needs.

	Anyway, it's well worth visiting the UCB web pages, and it may be that
	certain environments can use such a sort ... but remember, in the
	real world, data doesn't appear from nowhere, get run through a
	benchmark, and then disappear.

Anyway, some workloads *are* suitable for distributed clusters,
and it's hard to imagine any knowledgeable person disagreeing with that.
On the other hand, there *are* reasons why some people who already have
distributed clusters are trying to consolidate them back together into
larger systems.




--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mccalpin@frakir.engr.sgi.com (John McCalpin)
Newsgroups: comp.arch
Subject: Re: Review of Forest Baskett's talk at UW, 11/13/97
Date: 18 Nov 1997 14:18:49 GMT

In article <64rbk5$kps$1@murrow.corp.sgi.com>,
John R. Mashey <mash@mash.engr.sgi.com> wrote:
>Forest's slide says:
>Origin2000 I/O Milestones
>+ Fibre channel results
>	- 4.3 GB/s read from a single file
>	- 3.9 GB/s write to a single file
>	- 7.6 GB/s sorted to/from disk in 1 minute
              ^^^^
>+ SCSI results
>	1 TB backup to tape in 1 hour
>	1 TB sorted to/from disk in 152 minutes
>	3.8 GB/s read from a single file
>	2.0 GB/s write to a single file
>	7.2 GB/s sorted to/from disk in 1 minute
            ^^^^

I should clarify that both of these sort numbers should
have units of "GB", not "GB/s".

The number means that a single file (7.6 GB or 7.2 GB for
the two tests) of 100 byte records was read from disk, sorted,
and returned to disk in one minute.  The TB sort has the correct
units, and was actually done "in-place", since we did not have
an extra TB of disk handy to hold the output.
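
As a rough cross-check of the corrected units, the 1 TB / 152 minute
result and the 7.2 GB / 1 minute result should imply sustained rates
of the same order:

```python
# Cross-check: both SCSI sort results, expressed as GB per minute,
# should come out roughly comparable.
tb_rate = 1000 / 152      # GB/min for the 1 TB, 152-minute sort
minute_rate = 7.2         # GB/min for the one-minute sort
print(round(tb_rate, 2), minute_rate)
```

which they do, to within the overheads one would expect at terabyte scale.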

The "nsort" program that did these sorts has been extended to
allow other items like regular expression searches.  We have
seen regular expression searches on a single file at speeds
ranging from 0.4-1.0 GB/s (depending on the pattern searched for
and whether the record length is specified).
--
John D. McCalpin, Ph.D.     Supercomputing Performance Analyst
Technical Computing Group   http://reality.sgi.com/mccalpin/
Silicon Graphics, Inc.      mccalpin@sgi.com  650-933-7407



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Review of Forest Baskett's talk at UW, 11/13/97
Date: 18 Nov 1997 20:13:56 GMT

In article <64s849$55j$1@murrow.corp.sgi.com>,
mccalpin@frakir.engr.sgi.com (John McCalpin) writes:

|> Organization: Silicon Graphics, Inc., Mountain View, CA
|>
|> In article <64rbk5$kps$1@murrow.corp.sgi.com>,
|> John R. Mashey <mash@mash.engr.sgi.com> wrote:
|> >Forest's slide says:
|> >Origin2000 I/O Milestones
|> >+ Fibre channel results
|> >	- 4.3 GB/s read from a single file
|> >	- 3.9 GB/s write to a single file
|> >	- 7.6 GB/s sorted to/from disk in 1 minute
|>               ^^^^
|> >+ SCSI results
|> >	1 TB backup to tape in 1 hour
|> >	1 TB sorted to/from disk in 152 minutes
|> >	3.8 GB/s read from a single file
|> >	2.0 GB/s write to a single file
|> >	7.2 GB/s sorted to/from disk in 1 minute
|>             ^^^^
|>
|> I should clarify that both of these sort numbers should
|> have units of "GB", not "GB/s".

Yes, sorry, typo copying from the slides.


|> The "nsort" program that did these sorts has been extended to
|> allow other items like regular expression searches.  We have
|> seen regular expression searches on a single file at speeds
|> ranging from 0.4-1.0 GB/s (depending on the pattern searched for
|> and whether the record length is specified).

Although it's a little out of date, see: http://www.ordinal.com,
which includes some white papers and manual pages on nsort.
--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


