From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Scalability of VLIW
Date: 6 Jan 1998 02:42:40 GMT

In article <sjcEMByGx.GsC@netcom.com>, sjc@netcom.com (Steven Correll) writes:

|> Pascal programs love to copy objects larger than 8 bytes, and our
|> compiler was capable of doing so without a function call.

1) At least on some S/360 models, one would find that lm/stm was faster
than the equivalent MVC.

2) UNIX systems care a fair amount about the equivalent of MVC (or MVCL,
anyway), i.e., bcopy/memcpy (known size), not strcpy (zero-terminated),
and bcopy is high on anybody's hand-coding list.  For various reasons,
block-copy hardware has come and gone, in various forms, in various
RISC-based machines: it's hard to get right, and it's hard to get a
persistent model for something that works, doesn't cost too much to
implement, and especially, that can be used straightforwardly at user
level and that makes sense in the presence of caches, especially
multiple coherent caches.

3) Hence, the fundamental tension comes down to the following:
	a) If it makes sense within the architectural context, you have
	an MVC instruction, and the hardware is free to optimize that
	to fit its datapath widths, overlap, cache design, etc.
	If there is microcode, it tends to show a philosophical resemblance
	to a hand-coded RISC bcopy, with all these tests for sizes,
	alignment, misalignment of source and destination, etc, as
	showed up in the timing charts for an S/360 or VAX.
	You accept the idea that even a highly parallel machine will probably
	get fairly serialized when doing an MVC, at least with respect to
	any other memory references:
	ST	R0,0(R3)
	MVC	A(200),B
	L	R1, 0(R4)

	An out-of-order implementation would need some care here, and
	either needs a bunch of arithmetic comparisons in a load/store unit
	(i.e., given this MVC, one has two 200-byte address ranges, and
	you'd better be careful starting it if the STore instruction's
	target overlaps) or else the MVC needs to decompose itself into
	a series of invisible loads and stores whose addresses can be
	queued and associatively compared for conflicts.
	Likewise, in an o-o-o machine, it would be nice to issue the L before
	the MVC has finished, but this is nontrivial.  Finally, if one had:
	MVC	A(4), B	(assume all aligned properly)
	MVC	C(4), D

It may take more work than it's worth to figure out that these are all
nonoverlapping addresses, and it may well happen that:
	L	R1,B		cache hit
	ST	R1,A		cache miss, causing exclusive read of line
	L	R2,D		cache miss
	ST	R2,C
will execute noticeably faster, because it is easier to figure out that
D is independent from A, and therefore can slip ahead of ST R1,A,
and thus get a cache miss started much earlier.
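
To make the contrast concrete, here is a minimal sketch in C of the two kinds
of conflict check discussed above: an arithmetic range comparison of the sort
an MVC implies, versus the equality test that suffices for aligned word loads
and stores. The function names are hypothetical, a 4-byte word is assumed, and
address wraparound is ignored; this illustrates the logic only and does not
describe any real load/store unit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Arithmetic range test: do [a, a+alen) and [b, b+blen) share a byte?
 * Assumes a+alen and b+blen do not wrap around the address space. */
static bool ranges_overlap(uintptr_t a, size_t alen, uintptr_t b, size_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Associative-style test for aligned 4-byte accesses: two accesses can
 * only conflict if they name the same word, so an equality compare on
 * the word address is enough. */
static bool same_word(uintptr_t a, uintptr_t b)
{
    return (a & ~(uintptr_t)3) == (b & ~(uintptr_t)3);
}
```

For the MVC A(200),B case, a pending store would have to be checked with
ranges_overlap against both 200-byte operands; for the four-instruction L/ST
sequence, each pair of addresses needs only same_word.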

So, the good thing about an MVC is that you know everything the programmer
wants, right there, and the bad thing is the difficulty of building hardware
that does "the right thing", i.e., preserves the semantics, but runs
fast without wasting cycles checking for things that hardly ever happen.
(The original MVC, for example, allowed overlapping address ranges,
so that a common paradigm to zero an area was:
	MVI	A,0
	MVC	A+1(length),A		)
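
As a rough illustration of why that idiom works, the following C sketch
models the overlapping move as a strict byte-at-a-time, left-to-right copy,
which is the behavior the idiom relies on. The names mvc_bytewise and
clear_area are hypothetical, and this is a model of the semantics only, not
of any actual S/360 implementation.

```c
#include <stddef.h>

/* Byte-at-a-time, left-to-right copy.  When dst == src + 1, each byte
 * stored is the one read on the next iteration, so the first byte is
 * propagated through the whole field. */
static void mvc_bytewise(unsigned char *dst, const unsigned char *src,
                         size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}

/* Hypothetical helper mirroring the two-instruction idiom above. */
static void clear_area(unsigned char *a, size_t n)
{
    if (n == 0)
        return;
    a[0] = 0;                        /* MVI A,0            */
    mvc_bytewise(a + 1, a, n - 1);   /* MVC A+1(length),A  */
}
```

A wide-datapath implementation that reads several source bytes before
storing any would break this idiom unless it detects the overlap, which is
exactly the kind of rarely-needed check the text complains about.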

Also, of course, unlike an aligned L or ST, which uses the MMU once,
an MVC can use it multiple times for each of the two arguments.

	b) OR you can use a classic RISC load/store architecture, where
	all the hardware sees is a sequence of loads and stores.  This is easier
	on MMUs, and much easier on load/store address queue checks, which are
	simple associative lookups versus individual storage-unit addresses
	(not arithmetic range comparisons).  What's bad is:
		1) It takes a lot of code in bcopy to get it right,
		especially with all of the alignment cases.
		2) Global intent is not visible to the hardware; this has
		frequently shown up in irritation with writeback caches:
		the code is bcopy or bzero; it knows that it is about to
		overwrite an entire cache line, but has no way to *say*
		that safely, and the result is having to wastefully fetch
		a line that will be totally overwritten.  Various mechanisms
		have been included in various CPUs to go after this one.
		3) It may well be that the source argument is not cache
		resident, but copying it through the cache simply flushes a
		lot of otherwise good entries to bring in a source that
		won't be used again soon. (Typically, the target will be
		used again soon).

So, it's not that people don't think it's important to move data around
efficiently ... but so far, designers of most microprocessor instruction
sets have preferred to deal with the issues in b) rather than the complexity
of a), even though many machines have been built with MVC-equivalents, and
it's not from lack of knowledge of MVC.

--
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD: 650-933-3090 FAX: 650-932-3090
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Scalability of VLIW
Date: 12 Jan 1998 05:44:39 GMT

In article <697kba$hes$1@lyra.csx.cam.ac.uk>, nmm1@cus.cam.ac.uk (Nick
Maclaren) writes:

|> Actually, no.  Almost all of the arguments against were (and are) based
|> on the fact that the MVC specification required certain horrible cases
|> to work.  There isn't really any difficulty in implementing such an
|> operation, provided that you are prepared to accept the same semantics
|> that you would get from a sequence of byte copies.

|> Again, look at history.  There have been a fair number of architectures
|> that have had some support for arbitrary length arithmetic objects, but
|> very few languages have provided decent support for them.  Why shouldn't
COBOL? PL/I?

Readers who may be getting confused by this discussion might be helped by
the following observation:

(1) One point of view says you have to be very careful with MVC and
related operations, as there are numerous subtle problems in creating
instructions that have sensible implementations over the lifetimes of
serious CPU architectures, that even some savvy and experienced folks
who have implemented these have been sorry later, and that about the
closest thing people feel good about is the equivalent of the S/360 LM/STM.
In particular, it is believed that it is difficult to get
reasonable semantics, sensible exception behavior, and efficiency
of implementation (i.e., relative to datapath widths & aggressive
implementation wishes) all at the same time.

(2) Another point of view says that (1) is silly, wrong, etc,
and that it should be easy.

So far, people who like (1) seem to include several who actually have
years of experience being involved in the design of multiple production
microprocessors, who chair Hot Chips (Allen does it this year), etc,
and hence who one might expect to know something about this.
This does not mean there is no room for disagreement, but perhaps those
people who claim that all this is easy might describe their
experience with production CPU family design, especially since the
problems arise far more often in the design of families than of
individual processors.




From: abaum@pa.dec.com (Allen J. Baum)
Newsgroups: comp.arch
Subject: Re: Scalability of VLIW
Date: Fri, 09 Jan 1998 12:31:08 -0800

stanlass@netins.net (Stan Lass) wrote:
 why Load/Store byte string would be a good idea

 mash@mash.engr.sgi.com says...
> >Sigh.
> >"Those who don't remember the past are condemned to repeat it",

and then goes on to give quite a nice synopsis of why MVC and MVCL have
been done, and why it's not a great idea.

 stanlass@netins.net  replies:
>As I recall, the MVC is a memory to memory move byte string
>instruction. As such, the MVC is not useful for moving operands
>to/from the arithmetic unit.

So, I must conclude that Stan is really talking about
Load String and Store String operations, not Move String.
Exactly why unaligned Load/Store is desirable/necessary is
unclear, and whether fixed or variable length is unclear,
and how to specify which registers get loaded/stored is
unclear (sequential?, arbitrary, but in order? arbitrary?)

Having said all that: The ARM processor, which is arguably
a RISC, has a LDM/STM instruction which will ld/st arbitrary
registers, but in-order, and fixed length at word aligned
boundaries. It is interruptible.
It may be one of the least RISCy features of the ARM architecture,
and it is one of the most useful (in its intended market); it is
one of the features responsible for the ARM's good code density.
Although it is primarily used for procedure entry/exit code,
the compiler will opportunistically combine 2 or more loads
into a LDM if it can find them. In some implementations,
it can serve as a hint to the memory system that a burst
is coming.
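
As a rough model of "arbitrary registers, but in-order", the C sketch below
treats the register list as a 16-bit mask and fills the selected registers in
ascending register number from consecutive words. The name ldm_sketch is
hypothetical, and base-register writeback, the addressing-mode variants, and
fault behavior are all left out.

```c
#include <stdint.h>

/* Bit r of reglist selects register r.  Selected registers are loaded in
 * ascending register order from consecutive aligned words starting at
 * mem.  Returns the number of words transferred. */
static unsigned ldm_sketch(uint32_t regs[16], uint16_t reglist,
                           const uint32_t *mem)
{
    unsigned n = 0;
    for (unsigned r = 0; r < 16; r++) {
        if (reglist & (1u << r))
            regs[r] = mem[n++];
    }
    return n;
}
```

Because the whole transfer is named by one instruction, the memory system
sees up front how many consecutive words are wanted, which is the "burst
hint" mentioned above.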

Note that PowerPC has a load multiple as well; fixed length,
fixed alignment, sequential registers.

--
***********************************************
* Allen J. Baum                               *
* Digital Semiconductor                       *
* 181 Lytton Ave.                             *
* Palo Alto, CA 94306                         *
***********************************************


From: abaum@pa.dec.com (Allen J. Baum)
Newsgroups: comp.arch
Subject: Re: Scalability of VLIW
Date: Mon, 12 Jan 1998 11:00:22 -0800

I said:
>> Having said all that: The ARM processor, which is arguably
>> a RISC, has a LDM/STM instruction which will ld/st arbitrary
>> registers, but in-order, and fixed length at word aligned
>> boundaries. It is interruptible.
>
rjb@dcs.gla.ac.uk (Dr. Richard Black) points out:
>It is non-interruptible; it is, however, fairly easy to restart in
>the event that it faults since the PC is always transferred last.

I misspoke - I couldn't find the word I wanted (which meant 'it can fault
in the middle') and used interrupt, which implies external (I/O) interrupt
and not an internal fault. He is correct, LDM/STM will not be aborted in
the middle by an external interrupt (which has implications on interrupt
response), but can be by an internal fault (e.g. protection/translation
fault).

Note that while LDM/STM is appropriate for the ARM architecture, that
doesn't imply it is workable for other architectures with different
application domains and marketing focuses; i.e. there is no universal
'good idea'.




From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Scalability of VLIW
Date: 12 Jan 1998 20:58:58 GMT

In article <abaum-1201981100220001@terrapin.pa.dec.com>, abaum@pa.dec.com
(Allen J. Baum) writes:

|> Note that while LDM/STM is appropriate for the ARM architecture, that
|> doesn't imply it is workable for other architectures with different
|> application domains and marketing focuses; i.e. there is no universal
|> 'good idea'.

For sure!

Carefully crafted LDM/STMs:
	aligned
	straightforwardly repeatable, i.e., no registers overwritten
	timing/interrupt character appropriate to the design goals
seem plausible choices, with the argument depending on the number of registers
available, compiler technology (goodness of register allocation), etc.

As an amusing sidelight on a closely-related topic, at one point,
MIPS-II (R6000, R4000...) had defined a pair of aligned integer operations:
	load 64-bit into 2 32-bit integer regs
	store 2 32-bit integer regs into a 64-bit memory location

The idea was to take advantage of 64-bit data paths and shorten function
save/restore sequences; this is about the most restricted form
of LM/STM one might do.  It got deleted before any machine actually
got built, partially because the load has the irritation of needing an
extra write port compared to the rest of the loads, but even more because
of the confusion engendered in doing the 64-bit implementation:
such operations were unusually awkward in transferring over, i.e.:

32-bit model:
8-byte aligned MVC: 0(r5) <- 0(r4)
	ld	r2,0(r4)		sets r2,r3 = 0(r4), 4(r4)
	sd	r2,0(r5)		stores r2, r3

Save 2 registers:
	sd	r2,0(r5)

64-bit model:
8-byte aligned MVC: 0(r5) <- 0(r4)
	ld	r2,0(r4)		sets r2,r3 = 0(r4), 4(r4)
	sd	r2,0(r5)		stores r2, r3

I.e., this assumes that:
	ld	r2,0(r4)
is the same as:
	lw	r2,0(r4)
	lw	r3,4(r4)
and that instructions do the same work in both 32- and 64-bit modes.

Unfortunately, the second use then falls apart:
	sd	r2,0(r5)
does not save 2 registers; it saves the low-order 32 bits of each of two
registers, yielding bad effects when all 64 bits of each register must be
restored :-)
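
The failure can be sketched in C by modeling the paired operations on 64-bit
registers. The names sd_pair and ld_pair are hypothetical, and the choice to
sign-extend each loaded half (as lw does) is an assumption, since the
instructions were deleted before any 64-bit behavior was fixed.

```c
#include <stdint.h>

/* Paired store as described above: the low 32 bits of two registers go
 * into one 8-byte memory slot. */
static void sd_pair(uint32_t slot[2], uint64_t ra, uint64_t rb)
{
    slot[0] = (uint32_t)ra;
    slot[1] = (uint32_t)rb;
}

/* Paired load, assumed equivalent to two lw instructions, so each
 * 32-bit half is sign-extended into its 64-bit register. */
static void ld_pair(const uint32_t slot[2], uint64_t *ra, uint64_t *rb)
{
    *ra = (uint64_t)(int64_t)(int32_t)slot[0];
    *rb = (uint64_t)(int64_t)(int32_t)slot[1];
}
```

Under these assumptions a register holding a sign-extended 32-bit value
survives a save/restore, but one holding a true 64-bit value such as
0x1122334455667788 comes back as 0x0000000055667788.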




From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Scalability of VLIW
Date: 13 Jan 1998 20:55:54 GMT

In article <zalmanEMpKvv.1vE@netcom.com>, zalman@netcom.com (Zalman
Stern) writes:

|> one of the main motivations for this feature. John Mashey just posted how
|> MIPS decided that would never work. Actual implementations of PowerPC seem

I didn't quite say that: I said it would be unusually awkward.
Maybe I should have said more:

(a) In going from MIPS-I & MIPS-II to MIPS-III:
	(a) The integer registers were widened from 32 to 64 bits.
	(b) Existing integer operations continued to produce the exact same
	results in the low-order 32 bits of 64-bit registers,
	and the definitions were extended to get the high-order 32 bits "right".
	For instance:
		LW	sign-extends the 32 bits fetched from memory
		ADDI	sign-extends the 16-bit immediate, does a 64-bit
			add, and raises an overflow if carries out of bits
			30 and 31 differ
		ANDI	zero-extends the 16-bit immediate, then ANDs
	etc.
	(c) Additional operations were defined to manipulate the entire 64-bit
	registers, like:
		LD	loads 64 bits
		DADDI	sign-extends the 16-bit immediate, does a 64-bit
			add, and raises overflow if the	carry-outs of
			bits 62 and 63 differ
		LWU	fetches 32 bits from memory and zero-extends

(b) Anyway, it proved possible to:
	(a) Avoid having a lot of hardware for 32-vs-64-bit mode-specific cases
	(b) Have existing 32-bit binaries keep running.
	(c) Extend the meaning of the existing instructions, while keeping
	them as "natural" in the 64-bit environment, i.e., load-byte still
	does what you'd expect.
	(d) Avoid changing opcodes, in the sense of making them do
	noticeably different things in the 32- and 64-bit environments.

(c) It wasn't that lm/stm couldn't ever work, but it wasn't obvious how to
make sense of them with these constraints, and it was clear there was plenty of
chance to disobey the principle of least astonishment.
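
For readers who want the extension rules from the first list spelled out, here
is a small C sketch of the register results for LW, LWU, ANDI, and DADDI on a
64-bit register model. The function names are hypothetical, the overflow trap
is omitted, and the narrowing casts assume the usual two's-complement
behavior.

```c
#include <stdint.h>

/* LW: the 32 bits fetched from memory are sign-extended to 64 bits. */
static uint64_t lw(uint32_t word)
{
    return (uint64_t)(int64_t)(int32_t)word;
}

/* LWU: the 32 bits fetched from memory are zero-extended. */
static uint64_t lwu(uint32_t word)
{
    return (uint64_t)word;
}

/* ANDI: the 16-bit immediate is zero-extended, then ANDed. */
static uint64_t andi(uint64_t rs, uint16_t imm)
{
    return rs & (uint64_t)imm;
}

/* DADDI without the overflow trap: the 16-bit immediate is
 * sign-extended and added as a full 64-bit quantity. */
static uint64_t daddi_no_trap(uint64_t rs, uint16_t imm)
{
    return rs + (uint64_t)(int64_t)(int16_t)imm;
}
```

An old 32-bit binary only ever sees the low 32 bits of these results, which
match what the 32-bit machine produced; a paired load/store has no equally
natural reading under the same rules.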
