Newsgroups: comp.arch
From: mash@mash.engr.sgi.com (John R. Mashey)
Subject: Re: Exception handling in PowerPC [no: RISC-vs-CISC, one more time]
Message-ID: <D4qAvo.EIz@odin.corp.sgi.com>
Date: Tue, 28 Feb 1995 21:11:47 GMT
In article <3ibmb6$so1@newsbf02.news.aol.com>, danhicks@aol.com
(DanHicks) writes:
|> >>>
|> Oh? Consider the problem of designing exception handling for the P6.
|> It's pipelining, out-of-order and speculative execution that make
|> exception handling tough, and both the CISC and RISC people are doing
|> as much of this as they can. It has little to do with the instruction
|> set.
|> <<<
|>
|> True, but, as you point out, the distinction between RISC and CISC is
|> disappearing, to the point where the main distinction is psychological.
|> And that's the biggest problem: The RISC designers can't be convinced of
|> the importance of designing an architecture suited for something other
|> than benchmarks, whereas the CISC designers realize that accommodating
|> operating system requirements is a critical part of their job.
1) Attached is the Nth repost of the discussion of RISC-vs-CISC
non-convergence, i.e., architecture != implementation, one more time.
If you've followed this newsgroup for a while, you've seen this before.
I have included some followon discussions, and done minuscule editing.
2) The comments about RISC designers above are simply wrong;
"All generalizations are false", but especially this one.
3) re: Exception-handling: as I've noted for years in public talks (the
"Car" talk especially), tricky exception-handling is the price for
machines with increased parallelism and overlap. Errata sheets are
customarily filled with bugs around exception cases. Oddly enough,
speculative-execution, o-o-o machines may actually fare better than
you'd think, given that they already need mechanisms for undoing
(more-or-less) completed instructions. From history, recall that the
360/91, circa 1967, had some imprecise exceptions, as did the 360/67
(early 360 with virtual memory), so exciting exception handling has
been with us for a while.
====REPOST====
Article 22850 of comp.arch:
Path: mips!mash
Subject: Nth re-posting of CISC vs RISC (or what is RISC, really)
Message-ID: <2419@spim.mips.COM>
Most of you have seen most of this several times before; there is a
little editing, nothing substantial. Some followup comments have been
added.
PART I - ARCHITECTURE, IMPLEMENTATION, DIFFERENCES
PART II - ADDRESSING MODES
PART III - MORE ON TERMINOLOGY; WOULD YOU CALL THE CDC 6600 A RISC?
PART IV - RISC, VLIW, STACKS
PART I - ARCHITECTURE, IMPLEMENTATION, DIFFERENCES
WARNING: you may want to print this one to read it...
(from preceding discussion):
>Anyway, it is not a fair comparison. Not by a long stretch. Let's see
>how the Nth generation SPARC, MIPS, and 88K's do (assuming they last)
>compared to some new design from scratch.
Well, there is baggage and there is BAGGAGE.
One must be careful to distinguish between ARCHITECTURE and IMPLEMENTATION:
a) Architectures persist longer than implementations, especially
user-level Instruction-Set Architecture.
b) The first member of an architecture family is usually designed
with the current implementation constraints in mind, and if you're
lucky, software people had some input.
c) If you're really lucky, you anticipate 5-10 years of technology
trends, and that modifies your idea of the ISA you commit to.
d) It's pretty hard to delete anything from an ISA, except where:
1) You can find that NO ONE uses a feature
(the 68020->68030 deletions mentioned by someone
else).
2) You believe that you can trap and emulate the feature
"fast enough".
i.e., microVAX support for decimal ops,
68040 support for transcendentals.
Now, one might claim that the i486 and 68040 are RISC implementations
of CISC architectures .... and I think there is some truth to this,
but I also think that it can confuse things badly:
Anyone who has studied the history of computer design knows that
high-performance designs have used many of the same techniques for years,
for all of the natural reasons, that is:
a) They use as much pipelining as they can; in some cases, if this
means a high gate-count, then so be it.
b) They use caches (separate I & D if convenient).
c) They use hardware, not micro-code for the simpler operations.
(For instance, look at the evolution of the S/360 products.
Recall that the 360/85 used caches, back around 1969, and within a few
years, so did any mainframe or supermini.)
So, what difference is there among machines if similar implementation
ideas are used?
A: there is a very specific set of characteristics shared by most
machines labeled RISCs, most of which are not shared by most CISCs.
The RISC characteristics:
a) Are aimed at more performance from current compiler technology
(i.e., enough registers).
OR
b) Are aimed at fast pipelining
in a virtual-memory environment
with the ability to still survive exceptions
without inextricably increasing the number of gate delays
(notice that I say gate delays, NOT just how many gates).
Even though various RISCs have made various decisions, most of them have
been very careful to omit those things that CPU designers have found
difficult and/or expensive to implement, and especially, things that
are painful, for relatively little gain.
I would claim, that even as RISCs evolve, they may have certain baggage
that they'd wish weren't there .... but not very much.
In particular, there are a bunch of objective characteristics shared by
RISC ARCHITECTURES that clearly distinguish them from CISC architectures.
I'll give a few examples, followed by the detailed analysis:
MOST RISCs:
3a) Have 1 size of instruction in an instruction stream
3b) And that size is 4 bytes
3c) Have a handful (1-4) of addressing modes (* it is VERY
hard to count these things; will discuss later).
3d) Have NO indirect addressing in any form (i.e., where you need
one memory access to get the address of another operand in memory)
4a) Have NO operations that combine load/store with arithmetic,
i.e., like add from memory, or add to memory.
(note: this means especially avoiding operations that use the
value of a load as input to an ALU operation, especially when
that operation can cause an exception. Loads/stores with
address modification can often be OK as they don't have some of
the bad effects)
4b) Have no more than 1 memory-addressed operand per instruction
5a) Do NOT support arbitrary alignment of data for loads/stores
5b) Use an MMU for a data address no more than once per instruction
6a) Have >=5 bits per integer register specifier
6b) Have >= 4 bits per FP register specifier
These rules provide a rather distinct dividing line among architectures,
and I think there are rather strong technical reasons for this, such
that there is one more interesting attribute: almost every architecture
whose first instance appeared on the market from 1986 onward obeys the
rules above .....
Note that I didn't say anything about counting the number of
instructions....
So, here's a table:
C: number of years since first implementation sold in this family
(or first thing with which this is binary compatible).
Note: this table was first done in 1991, so year = 1991-(age in table).
3a: # instruction sizes
3b: maximum instruction size in bytes
3c: number of distinct addressing modes for accessing data (not jumps).
I didn't count register or
literal, but only ones that referenced memory, and I counted different
formats with different offset sizes separately. This was hard work...
Also, even when a machine had different modes for register-relative and
PC_relative addressing, I counted them only once.
3d: indirect addressing: 0: no, 1: yes
4a: load/store combined with arithmetic: 0: no, 1:yes
4b: maximum number of memory operands
5a: unaligned addressing of memory references allowed in load/store,
without specific instructions
0: no never (MIPS, SPARC, etc)
1: sometimes (as in RS/6000)
2: just about any time
5b: maximum number of MMU uses for data operands in an instruction
6a: number of bits for integer register specifier
6b: number of bits for 64-bit or more FP register specifier,
distinct from integer registers
Note that all of these are ARCHITECTURE issues, and it is usually quite
difficult to either delete a feature (3a-5b) or increase the number
of real registers (6a-6b) given an initial instruction set design.
(yes, register renaming can help, but...)
Now: items 3a, 3b, and 3c are an indication of the decode complexity
3d-5b hint at the ease or difficulty of pipelining, especially
in the presence of virtual-memory requirements, and need to go
fast while still taking exceptions sanely
items 6a and 6b are more related to ability to take good advantage
of current compilers.
There are some other attributes that can be useful, but I couldn't
imagine how to create metrics for them without being very subjective;
for example "degree of sequential decode", "number of writebacks
that you might want to do in the middle of an instruction, but can't,
because you have to wait to make sure you see all of the instruction
before committing any state, because the last part might cause a
page fault," or "irregularity/asymmetry of register use",
or "irregularity/complexity of instruction formats". I'd love to
use those, but just don't know how to measure them.
Also, I'd be happy to hear corrections for some of these.
So, here's a table of 12 implementations of various architectures,
one per architecture, with the attributes above. Just for fun, I'm
going to leave the architectures coded at first, although I'll identify
them later. I'm going to draw a line between H1 and L4 (obviously,
the RISC-CISC Line), and also, at the head of each column, I'm going
to put a rule, which, in that column, most of the RISCs obey.
Any RISC that does not obey it is marked with a +; any CISC that DOES
obey it is marked with a *. So...
      1991
CPU   Age  3a   3b   3c   3d   4a   4b   5a   5b   6a   6b   # ODD
RULE  <6   =1   =4   <5   =0   =0   =1   <2   =1   >4   >3
-------------------------------------------------------------------------
A1    4    1    4    1    0    0    1    0    1    8    3+   1
B1    5    1    4    1    0    0    1    0    1    5    4    -
C1    2    1    4    2    0    0    1    0    1    5    4    -
D1    2    1    4    3    0    0    1    0    1    5    0+   1
E1    5    1    4    10+  0    0    1    0    1    5    4    1
F1    5    2+   4    1    0    0    1    0    1    4+   3+   3
G1    1    1    4    4    0    0    1    1    1    5    5    -
H1    2    1    4    4    0    0    1    0    1    5    4    -       RISC
-------------------------------------------------------------------------
L4    26   4    8    2*   0*   1    2    2    4    4    2    2       CISC
M2    12   12   12   15   0*   1    2    2    4    3    3    1
N1    10   21   21   23   1    1    2    2    4    3    3    -
O3    11   11   22   44   1    1    2    2    8    4    3    -
P3    13   56   56   22   1    1    6    2    24   4    0    -
An interesting exercise is to analyze the ODD cases.
First, observe that of 12 architectures, in only 2 cases does an
architecture have an attribute that puts it on the wrong side of the line.
Of the RISCs:
-A1 is slightly unusual in having more integer registers, and less FP
than usual. [Actually, slightly out of date, 29050 is different,
using integer register bank instead, I hear.]
-D1 is unusual in sharing integer and FP registers (that's what the
D1:6b == 0).
-E1 seems odd in having a large number of address modes. I think most of this
is an artifact of the way that I counted, as this architecture really only
has a fundamentally small number of ways to create addresses, but has several
different-sized offsets and combinations, but all within 1 4-byte instruction;
I believe that its addressing mechanisms are fundamentally MUCH simpler
than, for example, M2, or especially N1, O3, or P3, but the specific number
doesn't capture it very well.
-F1 .... is not sold any more.
-H1 one might argue that this processor has 2 sizes of instructions,
but I'd observe that at any point in the instruction stream, the instructions
are either 4-bytes long, or 8-bytes long, with the setting done by a mode bit,
i.e., not dynamically encoded in every instruction.
Of the processors called CISCs:
-L4 happens to be one in which you can tell the length of the instruction
from the first few bits, has a fairly regular instruction decode,
has relatively few addressing modes, no indirect addressing.
In fact, a big subset of its instructions are actually fairly RISC-like,
although another subset is very CISCy.
-M2 has a myriad of instruction formats, but fortunately avoided
indirect addressing, and actually, MOST instructions only have 1
address, except for a small set of string operations with 2.
I.e., in this case, the decode complexity may be high, but most instructions
cannot turn into multiple-memory-address-with-side-effects things.
-N1,O3, and P3 are actually fairly clean, orthogonal architectures, in
which most operations can consistently have operands in either memory or
registers, and there are relatively few weirdnesses of special-cased uses
of registers. Unfortunately, they also have indirect addressing,
instruction formats whose very orthogonality almost guarantees sequential
decoding, where it's hard to even know how long an instruction is until
you parse each piece, and that may have side-effects where you'd like to
do a register write-back early, but either:
must wait until you see all of the instruction until you commit state
or
must have "undo" shadow-registers
or
must use instruction-continuation with fairly tricky exception
handling to restore the state of the machine
It is also interesting to note that the original member of the family to
which O3 belongs was rather simpler in some of the critical areas,
with only 5 instruction sizes, of maximum size 10 bytes, and no indirect
addressing, and requiring alignment (i.e., it was a much more RISC-like
design, and it would be a fascinating speculation to know if that
extra complexity was useful in practice).
Now, here's the table again, with the labels:
      1991
CPU   Age  3a   3b   3c   3d   4a   4b   5a   5b   6a   6b   # ODD
RULE  <6   =1   =4   <5   =0   =0   =1   <2   =1   >4   >3
-------------------------------------------------------------------------
A1    4    1    4    1    0    0    1    0    1    8    3+   1       AMD 29K
B1    5    1    4    1    0    0    1    0    1    5    4    -       R2000
C1    2    1    4    2    0    0    1    0    1    5    4    -       SPARC
D1    2    1    4    3    0    0    1    0    1    5    0+   1       MC88000
E1    5    1    4    10+  0    0    1    0    1    5    4    1       HP PA
F1    5    2+   4    1    0    0    1    0    1    4+   3+   3       IBM RT/PC
G1    1    1    4    4    0    0    1    1    1    5    5    -       IBM RS/6000
H1    2    1    4    4    0    0    1    0    1    5    4    -       Intel i860
-------------------------------------------------------------------------
L4    26   4    8    2*   0*   1    2    2    4    4    2    2       IBM 3090
M2    12   12   12   15   0*   1    2    2    4    3    3    1       Intel i486
N1    10   21   21   23   1    1    2    2    4    3    3    -       NSC 32016
O3    11   11   22   44   1    1    2    2    8    4    3    -       MC 68040
P3    13   56   56   22   1    1    6    2    24   4    0    -       VAX
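The RULE row can be read as a set of predicates over the ten ISA
attributes (3a-6b). As a minimal sketch, and not anything from the
original posting, the following Python checks a few rows copied from the
table against those predicates. The names RULES, ROWS and violations are
made up for this illustration, and Age is left out because it is not an
ISA attribute.

```python
# Illustrative only: RULES, ROWS and violations() are hypothetical names.
# Each predicate mirrors one entry of the RULE row; the data rows are
# copied from the table (columns 3a..6b, with the +/* markers dropped).
RULES = {
    "3a": lambda v: v == 1,   # number of instruction sizes
    "3b": lambda v: v == 4,   # maximum instruction size in bytes
    "3c": lambda v: v < 5,    # addressing modes
    "3d": lambda v: v == 0,   # indirect addressing
    "4a": lambda v: v == 0,   # load/store combined with arithmetic
    "4b": lambda v: v == 1,   # maximum memory operands
    "5a": lambda v: v < 2,    # unaligned access allowed
    "5b": lambda v: v == 1,   # maximum MMU uses for data operands
    "6a": lambda v: v > 4,    # bits per integer register specifier
    "6b": lambda v: v > 3,    # bits per FP register specifier
}
COLUMNS = list(RULES)

ROWS = {
    "R2000":     [1, 4, 1, 0, 0, 1, 0, 1, 5, 4],
    "IBM RT/PC": [2, 4, 1, 0, 0, 1, 0, 1, 4, 3],
    "IBM 3090":  [4, 8, 2, 0, 1, 2, 2, 4, 4, 2],
    "VAX":       [56, 56, 22, 1, 1, 6, 2, 24, 4, 0],
}

def violations(values):
    """Return the column names whose rule this row breaks."""
    return [c for c, v in zip(COLUMNS, values) if not RULES[c](v)]

for name, values in ROWS.items():
    broken = violations(values)
    print(f"{name:10s} breaks {len(broken):2d} of {len(COLUMNS)} rules: {broken}")
```

For the RISC rows the count of broken rules matches the "# ODD" column.
For the CISC rows the ODD column counts the rules obeyed instead (the *
entries), so the 3090 shows 8 broken here and 2 in the table.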
General comment: this may sound weird, but in the long term, it might
be easier to deal with a really complicated bunch of instruction
formats, than with a complex set of addressing modes, because at least
the former is more amenable to pre-decoding into a cache of
decoded instructions that can be pipelined reasonably, whereas the pipeline
on the latter can get very tricky (examples to follow). This can lead to
the funny effect that a relatively "clean", orthogonal architecture may actually
be harder to make run fast than one that is less clean. Obviously, every
weirdness has its penalties.... But consider the fundamental difficulty
of pipelining something like (on a VAX):
ADDL @(R1)+,@(R1)+,@(R2)+
I.e., something that might theoretically arise from:
register **r1, **r2;
**r2++ = **r1++ + **r1++;
Now, consider what the VAX has to do:
1) Decode the opcode (ADD)
2) Fetch first operand specifier from I-stream and work on it.
a) Compute the memory address from (r1)
If aligned
run through MMU
if MMU miss, fixup
access cache
if cache miss, do write-back/refill
Elseif unaligned
run through MMU for first part of data
if MMU miss, fixup
access cache for that part of data
if cache miss, do write-back/refill
run through MMU for second part of data
if MMU miss, fixup
access cache for second part of data
if cache miss, do write-back/refill
Now, in either case, we now have a longword that has the
address of the actual data.
b) Increment r1 [well, this is where you'd LIKE to do it, or
in parallel with step 2a).] However, see later why not...
c) Now, fetch the actual data from memory, using the address just
obtained, doing everything in step 2a) again, yielding the
actual data, which we need to stick in a temporary buffer, since it
doesn't actually go in a register.
3) Now, decode the second operand specifier, which goes thru everything
that we did in step 2, only again, and leaves the results in a second
temporary buffer. Note that we'd like to be starting this before we get
done with all of 2 (and I THINK the VAX9000 probably does that??) but
you have to be careful to bypass/interlock on potential side-effects to
registers .... actually, you may well have to keep shadow copies of
every register that might get written in the instruction, since every
operand can use auto-increment/decrement. You'd probably want badly to
try to compute the address of the second argument and do the MMU
access interleaved with the memory access of the first, although the
ability of any operand to need 2-4 MMU accesses probably makes this
tricky. [Recall that any MMU access may well cause a page fault....]
4) Now, do the add. [could cause exception]
5) Now, do the third specifier .... only, it might be a little different,
depending on the nature of the cache, that is, you cannot modify cache or
memory, unless you know it will complete. (Why? well, suppose that
the location you are storing into overlaps with one of the indirect-addressing
words pointed to by r1 or 4(r1), and suppose that the store was unaligned,
and suppose that the last byte of the store crossed a page boundary and
caused a page fault, and that you'd already written the first 3 bytes.
If you did this straightforwardly, and then tried to restart the
instruction, it wouldn't do the same thing the second time.)
6) When you're sure all is well, and the store is on its way, then you
can safely update the two registers, but you'd better wait until the end,
or else, keep copies of any modified registers until you're sure it's safe.
(I think both have been done ??)
7) You may say that this code is unlikely, but it is legal, so the CPU must
do it. This style has the following effects:
a) You have to worry about unlikely cases.
b) You'd like to do the work, with predictable uses of functional
units, but instead, they can make unpredictable demands.
c) You'd like to minimize the amount of buffering and state,
but it costs you in both to go fast.
d) Simple pipelining is very, very tough: for example, it is
pretty hard to do much about the next instruction following the
ADDL, (except some early decode, perhaps), without a lot of gates
for special-casing.
(I've always been amazed that CVAX chips are as fast as they are,
and VAX 9000s are REALLY impressive...)
e) EVERY memory operand can potentially cause 4 MMU uses,
and hence 4 MMU faults that might actually be page faults...
f) AND there are even worse cases, like the addp6 instruction, that
can require *40* pages to be resident to complete...
8) Consider how "lazy" RISC designers can be:
a) Every load/store uses exactly 1 MMU access.
b) The compilers are often free to re-arrange the order, even across
what would have been the next instruction on a CISC.
This gets rid of some stalls that the CISC may be stuck with
(especially memory accesses).
c) The alignment requirement avoids especially the problem with
sending the first part of a store on the way before you're SURE
that the second part of it is safe to do.
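To put rough numbers on steps 2-5 and on 8a), here is a toy Python model
that counts MMU passes for the ADDL example. It is a simplification for
illustration, not a description of any real VAX implementation. It
assumes, as in step 2a), one pass for an aligned longword and two for an
unaligned one; memory is a plain dict, and all addresses and function
names are invented.

```python
# Toy model: count MMU passes for ADDL @(R1)+,@(R1)+,@(R2)+.
# Assumption (from step 2a): aligned longword = 1 pass, unaligned = 2.
# All names and addresses here are hypothetical.
LONGWORD = 4

def mmu_passes(address):
    """MMU passes needed to access one longword at this address."""
    return 1 if address % LONGWORD == 0 else 2

def autoinc_deferred(regs, reg, memory):
    """Model one @(Rn)+ operand; return (data_address, mmu_passes)."""
    pointer_address = regs[reg]
    data_address = memory[pointer_address]   # first access: fetch the pointer
    # The model updates Rn immediately; step 6 explains why real hardware
    # has to defer this or keep a copy of the old value.
    regs[reg] += LONGWORD
    passes = mmu_passes(pointer_address) + mmu_passes(data_address)
    return data_address, passes

def addl_mmu_passes(r1, r2, memory):
    """Total MMU passes for the three operand specifiers."""
    regs = {"r1": r1, "r2": r2}
    total = 0
    for reg in ("r1", "r1", "r2"):
        _, passes = autoinc_deferred(regs, reg, memory)
        total += passes
    return total

# Pointer words and the data addresses they hold.
aligned = {0x1000: 0x2000, 0x1004: 0x3000, 0x4000: 0x5000}
unaligned = {0x1001: 0x2003, 0x1005: 0x3001, 0x4002: 0x5003}

print(addl_mmu_passes(0x1000, 0x4000, aligned))    # 6
print(addl_mmu_passes(0x1001, 0x4002, unaligned))  # 12
```

Under these assumptions the worst case is 4 passes per operand, or 12 for
this one instruction, each of which could be a page fault. The aligned
load or store of 8a) always costs exactly 1.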
Finally, to be fair, let me add the two cases that I knew of that were more
on the borderline: i960 and Clipper:
CPU   Age  3a   3b   3c   3d   4a   4b   5a   5b   6a   6b   # ODD
RULE  <6   =1   =4   <5   =0   =0   =1   <2   =1   >4   >3
-------------------------------------------------------------------------
J1    5    4+   8+   9+   0    0    1    0    2    4+   3+   5       Clipper
K1    3    2+   8+   9+   0    0    1    2+   -    5    3+   5       Intel 960KB
(I think an ARM would be in this area as well; I think somebody once
sent me an ARM-entry, but I can't find it again; sorry.)
Note: slight modification (I'll integrate this sometime):
From jfc@MIT.EDU Mon Nov 29 12:59:55 1993
Subject: Re: Why are Motorola's slower than Intel's ? [really what's a RISC]
Newsgroups: comp.arch
Organization: Massachusetts Institute of Technology
Since you made your table IBM has released a couple chips that support
unaligned accesses in hardware even across cache line boundaries and
may store part of an unaligned object before taking a page fault on
the second half, if the object crosses a page boundary.
These are the RSC (single chip POWER) and PPC 601 (based on RSC core).
John Carr (jfc@mit.edu)
(Back to me; jfc's comments are right; if I had time, I'd add another
line to do PPC ... which, in some sense replays the S/360 -> S/370
history of relaxing alignment restrictions somewhat. I conjecture that
at least some of this was done to help Apple s/w migration.)
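The hazard jfc describes (part of an unaligned store reaching memory
before the second half faults) can be shown with a toy model. This is a
sketch under invented assumptions: a 4096-byte page size, a byte-array
"memory", a set of resident pages, and a hypothetical PageFault
exception. It is not how any of these chips actually works.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch

class PageFault(Exception):
    """Hypothetical fault raised when a touched page is not resident."""

class ToyMemory:
    def __init__(self, pages, resident):
        self.data = bytearray(pages * PAGE_SIZE)
        self.resident = set(resident)

    def store_naive(self, address, payload):
        """Write byte by byte; a fault can leave a partial store behind."""
        for offset, byte in enumerate(payload):
            if (address + offset) // PAGE_SIZE not in self.resident:
                raise PageFault(address + offset)
            self.data[address + offset] = byte

    def store_checked(self, address, payload):
        """Probe every page first, then write: all or nothing."""
        for offset in range(len(payload)):
            if (address + offset) // PAGE_SIZE not in self.resident:
                raise PageFault(address + offset)
        self.data[address:address + len(payload)] = payload

mem = ToyMemory(pages=2, resident={0})
address = PAGE_SIZE - 3          # 3 bytes in page 0, then 1 byte in page 1
try:
    mem.store_naive(address, b"\x11\x22\x33\x44")
except PageFault:
    pass
print(mem.data[address:address + 4].hex())   # 11223300
```

The naive store has already changed three bytes when the fault arrives.
That is harmless if the instruction can simply be rerun, but not if those
bytes overlap something the instruction reads on the way in, which is the
case step 5 of the VAX walk-through worries about.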
SUMMARY:
1) RISCs share certain architectural characteristics, although there
are differences, and some of those differences matter a lot.
2) However, the RISCs, as a group, are much more alike than the
CISCs as a group.
3) At least some of these architectural characteristics have fairly
serious consequences on the pipelinability of the ISA, especially
in a virtual-memory, cached environment.
4) Counting instructions turns out to be fairly irrelevant:
a) It's HARD to actually count instructions in a meaningful
way... (if you disagree, I'll claim that the VAX is RISCier
than any RISC, at least for part of its instruction set :-)
Why: VAX has a MOV opcode, whereas RISCs usually have
a whole set of opcodes for {LOAD/STORE} {BYTE, HALF, WORD}
b) More instructions aren't what REALLY hurts you, anywhere
near as much as features that are hard to pipeline:
c) RISCs can perfectly well have string-support, or decimal
arithmetic support, or graphics transforms ... or lots of
strange register-register transforms, and it won't cause
problems ..... but compare that with the consequence of
adding a single instruction that has 2-3 memory operands,
each of which can go indirect, with auto-increments,
and unaligned data...
PART II - ADDRESSING MODES
Article: 30346 of comp.arch
Path: odin!mash.wpd.sgi.com!mash
Subject: Updated addressing mode table
Message-ID: <C52tAM.K4B@odin.corp.sgi.com>
Nntp-Posting-Host: mash.wpd.sgi.com
I promised to repost this with fixes, and people have been asking for it,
so here it is again: if you saw it before, all that's really different
is some fixes in the table, and a few clarified explanations:
THE GIANT ADDRESSING MODE TABLE (Corrections happily accepted)
This table goes with the higher-level table of general architecture
characteristics.
Address mode summary
r register
r+ autoincrement (post) [by size of data object]
-r autodecrement (pre) [by size,...and this was the one I meant]
>r modify base register [generally, effective address -> base]
NOTE: sometimes this subsumes r+, -r, etc,
and is more general, so I categorize it
as a separate case.
d displacement d1 & d2 if 2 different displacements
x index register
s scaled index
a absolute [as a separate mode, as opposed to displacement+(0)]
I Indirect
Shown below are 22 distinct addressing modes [you can argue whether
these are right categories]. In the table are the *number* of different
encodings/variations [and this is a little fuzzy; you can especially
argue about the 4 in the HP PA column, I'm not even sure that's
right]. For example, I counted as different variants on a mode the
case where the structure was the same, but there were different-sized
displacements that had to be decoded. Note that meaningfully counting
addressing modes is *at least as bad* as meaningfully counting opcodes;
I did the best I could, and I spent a lot of hours looking at manuals
for the chips I hadn't programmed much, and in some cases, even after
hours, it was hard for me to figure out meaningful numbers... *Most* of
these architectures are used in general-purpose systems and *most* have
at least one version that uses caches: those are important because many
of the issues in thinking about addressing modes come from their
interactions with MMUs and caches...
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
r r
r r r +d1 +d1
r r r | | r r | r r+ +d +d1 I +s
r r r +d +x +s| s+ s+|s+ +d +d|r+ +d I I I +s I
r +d +x +s >r >r >r|r+ -r a a r+|-r +x +s|I I +s +s +d2 +d2 +d2
-- -- -- -- -- -- --|-- -- -- -- --|-- -- --|-- -- -- -- --- --- ---
AMD 29K 1 | | |
Rxxx 1 | | |
SPARC 1 1 | | |
88K 1 1 1 | | |
HP PA 2 1 1 4 1 1| | |
ROMP 1 2 | | |
POWER 1 1 1 1 | | |
i860 1 1 1 1 | | |
Swrdfish 1 1 1 | 1 | |
ARM 2 2 2 1 1| 1 1
Clipper 1 3 1 | 1 1 2 | |
i960KB 1 1 1 1 | 2 2 | 1 |
S/360 1 | 1 |
i486 1 3 1 1 | 1 1 2 | 2 3|
NSC32K 3 | 1 1 3 3 | 3| 9
MC68000 1 1 | 1 1 2 | 2 |
MC68020 1 1 | 1 1 2 | 2 4| 16 16
VAX 1 3 1 | 1 1 1 1 1| 1 3| 1 3 1 3
COLUMN NOTES:
1) Columns 1-7 are addressing modes used by many machines, but very few,
if any clearly-RISC architectures use anything else. They are all
characterized by what they don't have:
2 adds needed before generating the address
indirect addressing
variable-sized decoding
2) Columns 13-15 include fairly simple-looking addressing modes, which however,
*may* require 2 back-to-back adds before the address is available. [*may*
because some of them use index-register=0 or something to avoid
indexing, and usually in such machines, you'll see variable timing figures,
depending on use of indexing.]
3) Columns 16-22 use indirect addressing.
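The three groups in the column notes differ in how much work precedes
the memory access. As a rough sketch (the function names and the cost
accounting are mine, and a scaled index is counted as a shift rather
than an add), one representative mode from each group might be modeled
like this:

```python
from collections import namedtuple

# Hypothetical cost record: adds and memory reads needed BEFORE the
# effective address of the operand is known.
EA = namedtuple("EA", "address adds reads")

def base_plus_disp(r, d):
    """Columns 1-7 style (e.g. r+d): at most one add, no memory read."""
    return EA(r + d, adds=1, reads=0)

def base_plus_disp_plus_scaled_index(r, d, x, scale):
    """Columns 13-15 style: may need 2 back-to-back adds."""
    return EA(r + d + x * scale, adds=2, reads=0)

def indirect(memory, r, d1, d2):
    """Columns 16-22 style: read a pointer at r+d1, then add d2 to it."""
    pointer = memory[r + d1]          # extra memory (and MMU) access
    return EA(pointer + d2, adds=2, reads=1)

memory = {0x1008: 0x2000}             # one pointer word, invented values
print(base_plus_disp(0x1000, 8))
print(base_plus_disp_plus_scaled_index(0x1000, 8, 3, 4))
print(indirect(memory, 0x1000, 8, 16))
```

Only the last form has to wait on memory before it even knows where the
operand lives, which is why it interacts so badly with MMUs and caches.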
ROW NOTES
1) Clipper & i960, of current chips, are more on the RISC-CISC border,
or are sort of "modern CISCs". ARM is also characterized (by ARM people,
Hot Chips IV) as follows: "ARM is not a 'pure RISC'."
2) ROMP has a number of characteristics different from the rest of the RISCs,
you might call it "early RISC", and it is of course no longer made.
3) You might consider HP PA a little odd, as it appears to have more
addressing modes, in the same way that CISCs do, but I don't think this
is the case: it's an issue of whether you call something several modes
or one mode with a modifier, just as there is trouble counting opcodes
(with & without modifiers). From my view, neither PA nor POWER has
truly "CISCy" addressing modes.
4) Notice difference between 68000 and 68020 (and later 68Ks): a bunch of
incredibly-general & complex modes got added...
5) Note that the addressing on the S/360 is actually pretty simple,
mostly base+displacement, although RX-addressing does take 2 regs+offset.
6) A dimension *not* shown on this particular chart, but also highly
relevant, is that this chart shows the different *types* of modes, *not*
how many addresses can be found in each instruction. That may be worth
noting also:
AMD - i960        1   one address per instruction
S/360 - MC68020   2   up to 2 addresses
VAX               6   up to 6
By looking at alignment, indirect addressing, and looking only at those
chips that have MMUs,
consider the number of times an MMU *might* be used per instruction for
data address translations:
AMD - Clipper     2   [Swordfish & i960KB: no TLB]
S/360 - NSC32K    4
MC68Ks (all)      8
VAX               24
When RS/6000 does unaligned, it must be in the same cache line
(and thus also in same MMU page), and traps to software otherwise, thus
avoiding numerous ugly cases.
Note: in some sense, S/360s & VAXen can use an arbitrary number of translations
per instruction, with MOVE CHARACTER LONG, or similar operations & I don't
count them as more, because they're defined to be interruptable/restartable,
saving state in general-purpose registers, rather than hidden internal state.
SUMMARY:
1) Computer design styles mostly changed from machines with:
2-6 addresses per instruction, with variable sized encoding
address specifiers were usually "orthogonal", so that any could go
anywhere in an instruction
sometimes indirect addressing
sometimes need 2 adds *before* effective address is available
sometimes with many potential MMU accesses (and possible exceptions)
per instruction, often buried in the middle of the instruction,
and often *after* you'd normally want to commit state because
of auto-increment or other side effects.
to machines with:
1 address per instruction
address specifiers encoded in small # of bits in 32-bit instruction
no indirect addressing
never need 2 adds before address available
use MMU once per data access
and we usually call the latter group RISCs. I say "changed" because
if you put this table together with the earlier one, which has the
age in years, the older ones were one way, and the newer ones are different.
2) Now, ignoring any other features, but looking at this single attribute
(architectural addressing features and implementation effects thereof),
it ought to be clear that the machines in
the first part of the table are doing something *technically* different
from those in the second part of the table. Thus, people may sometimes
call something RISC that isn't, for marketing reasons, but the people
calling the first batch RISC really did have some serious technical issues at
heart.
3) One more time: this is *not* to say that RISC is better than CISC,
or that the few in the middle are bad, or anything like that ... but
that there are clear technical characteristics...
PART III - MORE ON TERMINOLOGY; WOULD YOU CALL THE CDC 6600 A RISC?
Article: 39495 of comp.arch
Newsgroups: comp.arch
From: mash@mash.engr.sgi.com (John R. Mashey)
Subject: Re: Why CISC is bad (was P6 and Beyond)
Organization: Silicon Graphics, Inc.
Date: Wed, 6 Apr 94 18:35:01 PDT
In article <2nii0d$kkn@crl2.crl.com>, dbennett@crl.com (Andrea Chen) writes:
|> You may be correct on the creation of the term, but RISC does
|> refer to a school of computer design that dates back to the
|> early seventies.
This is all getting fairly fuzzy and subjective, but it seems very confusing
to label RISC as a school of thought that dates back to the early 1970s.
1) One can say that RISC is a school of thought that got popular in the
early-to-mid 80's, and got widespread commercial use then.
2) One can say that there were a few people (like John Cocke & co at IBM)
who were doing RISC-style research projects in the mid-70s.
3) But if you want to go back, as has been discussed in this newsgroup often,
a lot of people go back to the CDC 6600, whose design started in 1960,
and was delivered in 4Q 1964. Now, while this wouldn't exactly fit the
exact parameters of current RISCs, a great deal of the RISC-style approach
was there in the central processor ISA:
a) Load/store architecture.
b) 3-address register-register instructions
c) Simply-decoded instruction set
d) Early use of instructions scheduled by compiler, expectation
that you'd usually program in high-level language and not often
resort to assembler, as you'd expect compiler to do well.
e) More registers than common at the time
f) ISA designed to make decode/issue easy
Note that the 360/91 (1967) offered a good example of building a
CISC-architecture into a high-performance machine, and was an interesting
comparison to the 6600.
4) Maybe there is some way to claim that RISC goes back to the 1950s,
but in general, most machines of the 1950s and 1960s don't feel very
RISCy (to me). Consider Burroughs B5000s; IBM 709x, 707x, 1401s; Univac 110x;
GE 6xx, etc, and of course, S/360s. Simple load/store architectures
were hard to find; there were often exciting instruction decodings required;
indirect addressing was popular; machines often had very few accumulators.
5) If you want to try sticking this in the matrix I've published before,
as best as I recall, the 6600 ISA generally looked like:
CPU        3a   3b   3c   3d   4a   4b   5a   5b   6a   6b   # ODD
RULE       =1   =4   <5   =0   =0   =1   <2   =1   >4   >3
-------------------------------------------------------------------------
CDC 6600   2    *    1    0    0    1    0    1    3    3    4 (but ~1 if fair)
That is:
2: it has 2 instruction sizes (not 1), 15 & 30 bits (however, were packed
into 60-bit words, so if you had 15, 30, 30, the second 30-bitter would not
cross word boundaries, but would start in the second word.)
*: 15-and-30 bit instructions, not 32-bit.
1: 1 addressing mode [Note: Tim McCaffrey emailed me that one might consider
there to be more, i.e., you could set address register to combinations
of the others to give autoincrement/decrement/Index+offset, etc.]
In any case, you compute an address as a simple combination of 1-2
registers, and then use the address, without further side-effects.
0: no indirect addressing
1: have one memory operand per instruction
0: do NOT support arbitrary alignment of operands in memory
(well, it was a word-addressed machine :-)
1: use an MMU for data translation no more than once per instruction
(MMU used loosely here)
3,3: had 3-bit fields for addressing registers, both index and FP
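To make the word-packing rule in the first item concrete, here is a minimal sketch. It assumes only what is stated above: instructions are 15 or 30 bits, words are 60 bits, and an instruction that will not fit in what is left of a word starts at the beginning of the next one. The names `WORD_BITS` and `pack_parcels` are invented for illustration, and the sketch simply leaves the unused bits as a gap without saying how a real assembler filled them.

```python
WORD_BITS = 60  # CDC 6600 word size, as stated in the text


def pack_parcels(sizes):
    """Return a (word_index, bit_offset) slot for each instruction.

    Assumption for this sketch: an instruction that would cross a word
    boundary is instead placed at offset 0 of the next word.
    """
    placements = []
    word, offset = 0, 0
    for size in sizes:
        if size not in (15, 30):
            raise ValueError("instructions are 15 or 30 bits")
        if offset + size > WORD_BITS:
            # Would cross the word boundary: start a fresh word.
            word += 1
            offset = 0
        placements.append((word, offset))
        offset += size
    return placements


# The 15, 30, 30 case from the text: the second 30-bit
# instruction lands at the start of the second word.
print(pack_parcels([15, 30, 30]))
```

Four 15-bit instructions fill a word exactly, and two 30-bit ones do as well, which is where the "2-4 operations per word" figure comes from.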
Now, of the 10 ISA attributes I'd proposed for identifying typical RISCs,
the CDC 6600 obeys 6. It varies in having 2 instruction formats, and in
having only 3 bits for register fields, but it had simple packing of the
instructions into fixed-size words, and register/accumulators were
pretty expensive in those days (some popular machines only had one
accumulator and a few index registers, so 8 of each was a lot). Put
another way: it had about as many registers as you'd conveniently build
in a high-speed machine, and while they packed 2-4 operations into a
60-bit word, the decode was pretty straightforward. Anyway, given the
caveats, I'd claim that the 6600 would fit much better in the RISC part
of the original table...
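For anyone who wants to check the "# ODD" arithmetic, the following is a rough sketch of how the count falls out of the RULE row and the CDC 6600 row of the matrix. The rule values and the 6600 values are copied from the table; the names `RULES`, `CDC_6600`, and `odd_columns` are made up here, and treating the "*" entry as `None` (counted as an exception) is an assumption of this sketch.

```python
import operator

# (column, comparison, threshold), copied from the RULE row of the table.
RULES = [
    ("3a", operator.eq, 1),
    ("3b", operator.eq, 4),
    ("3c", operator.lt, 5),
    ("3d", operator.eq, 0),
    ("4a", operator.eq, 0),
    ("4b", operator.eq, 1),
    ("5a", operator.lt, 2),
    ("5b", operator.eq, 1),
    ("6a", operator.gt, 4),
    ("6b", operator.gt, 3),
]

# CDC 6600 row; None stands in for the "*" entry (15/30-bit instructions).
CDC_6600 = {
    "3a": 2, "3b": None, "3c": 1, "3d": 0, "4a": 0,
    "4b": 1, "5a": 0, "5b": 1, "6a": 3, "6b": 3,
}


def odd_columns(row):
    """List the columns where a CPU's value does not satisfy the rule."""
    odd = []
    for column, compare, threshold in RULES:
        value = row[column]
        if value is None or not compare(value, threshold):
            odd.append(column)
    return odd


exceptions = odd_columns(CDC_6600)
print(exceptions, len(RULES) - len(exceptions))
```

This gives four exceptions (3a, 3b, 6a, 6b) and six rules obeyed, matching the table. It is only bookkeeping; the "~1 if fair" judgment about instruction packing and register cost is not something the count captures.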
PART IV - RISC, VLIW, STACKS
Article: 43173 of comp.arch
Newsgroups: comp.sys.amiga.advocacy,comp.arch
From: mash@mash.engr.sgi.com (John R. Mashey)
Subject: Re: PG: RISC vs. CISC was: Re: MARC N. BARR
Date: Thu, 15 Sep 94 18:33:14 PDT
In article <35a1a3$mlb@doc.armltd.co.uk>, Clive.Jones@armltd.co.uk writes:
|> Really? The Venerable John Mashey's table appears to contain as many
|> exceptions to the rule about number of GP registers as most others.
|> I'm sure if one were to look at the various less conventional
|> processors, there would be some clearly RISC processors that didn't
|> have a load-store architecture - stack and VLIW processors spring to
|> mind.
I'm not sure I understand the point. One can believe any of several
things:
a) One can believe RISC is some marketing term without technical
meaning whatsoever. OR
b) One can believe that RISC is some collection of implementation
ideas. This is the most common confusion.
c) One can believe that RISC has some ISA meaning (such as RISC ==
small number of opcodes) ... but have a different idea of RISC
than do most chip architects who build them. If you want to pay
words extra money every Friday to mean something different than
what they mean to practitioners ... then you are free to do so,
but you will have difficulty communicating with practitioners
if you do so.
EX: I'm not sure how stack architectures are "clearly RISC" (?)
Maybe CRISP, sort of. Burroughs B5000 or Tandem's original
ISA: if those are defined as RISC, the term has been rendered
meaningless.
EX: VLIWs: I don't know any reason why I'd call VLIWs, in
general, either clearly RISC or clearly not. VLIW is a technique
for issuing instructions to more functional units than you
have the die space/cycle time to decode more dynamically.
There gets to be a fuzzy line between:
a) A VLIW, especially if it compresses instructions in
memory, then expands them out when brought into the cache.
b) A superscalar RISC, which does some predecoding on the
way from memory->cache, adding "hint" bits or rearranging
what it keeps there, speeding up cache->decode->issue.
At least some VLIWs are load/store architectures, and the operations
they do usually look like typical RISC operations.
OR, you can believe that:
d) RISC is a term used to characterize a class of relatively-similar
ISAs mostly developed in the 1980s. Thus, if a knowledgeable
person looks at ISAs, they will tend to cluster various ISAs
as:
1) Obvious RISC, fits the typical rules with few exceptions.
2) Obviously not-RISC, fits the inverse of the RISC
rules with relatively few exceptions. Sometimes
people call this CISC ... but whereas RISCs, as a group,
have relatively similar ISAs, the CISC label is sometimes
applied to a widely varying set of ISAs.
3) Hybrid / in-the-middle cases, that either look like
CISCy RISCs, or RISCy CISCs. There are a few of these.
Cases 1-3 may appropriately be applied to reasonably contemporaneous
processors, and make some sense. And then:
4) CPUs for which RISC/CISC is probably not a very relevant
classification. I.e., one can apply the set of rules
I've suggested, and get an exception-count, but it
may not mean much in practice, especially when
applied to older CPUs created with vastly different
constraints than current ones, or embedded
processors, or specialized ones. Sometimes an older
CPU might have been designed with some similar
philosophies (i.e., like CDC 6600 & RISC, sort of)
whether or not it happened to fit the rules.
Sometimes, die-space constraints may have led to
"simple" chips, without making them fit the suggested
criteria either. Personally, I find that torturous arguments
about whether a 6502, or a PDP-8, or a 360/44 or an
XDS Sigma 7, etc, are RISC or CISC ... do not
usually lead to great insight. After a while such
arguments are counting angels dancing on pinheads
("Ahh, only 10 angels, must be RISC" :-).
In this belief space, one tends to follow Hennessy & Patterson's
comment in E.9 that "In the history of computing, there has never
been such widespread agreement on computer architecture."
None of this is pejorative of earlier architectures, just the observation
that the ISAs newly-developed in the 1980s were far more similar
than the earlier groups of ISAs. [I recall a 2-year period in
which I used IBM 1401, IBM 7074, IBM 7090, Univac 1108, and
S/360, of which only the 7090 and 1108 bore even the remotest
resemblance to each other, i.e., at least they both had 36-bit words.]
Summary: RISC is a label most commonly used for a set of ISA characteristics
chosen to ease the use of aggressive implementation techniques found in
high-performance processors (regardless of RISC, CISC, or irrelevant).
This is a convenient shorthand, but that's all, although it probably makes
sense to use the term the way it's usually meant by people who do chips for
a living.
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@sgi.com
DDD: 415-390-3090 FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Who started RISC? [really:RISC & CISC are different kinds of
words]
Date: 23 Jun 1995 20:38:19 GMT
Organization: Silicon Graphics, Inc.
In article <3sabb4$m14@cronkite.cisco.com>, jahlstrom@cisco.com (John
Ahlstrom) writes:
|> : ifetch/decode/execute stages, and then Cray & Thornton perfected in the
|> : CDC 6400 & 6600 what we now call superpipelined RISC machines. This
|> They were certainly RISCy (who coined THAT wonderful term?) were they
|> RISC? Let's ask Mashey.
1) This has come up before & I talked about it in PART III of the usual
big "What is RISC" thing that I repost now and then. My answer was that
while it didn't exactly fit the architectural parameters of classic RISC,
it would generally have fit more in the RISC part of the table.
2) However, some of this whole thread has gotten a little crazy:
a) The term RISC, used properly, turns out to be a useful description
for a set of ISAs that:
- Were designed at roughly the same time
- Shared at least some assumptions in software and hardware
technology.
- Have enough similarities that they form a relatively tight
"cluster" in the space of all ISAs. (The big posting
described some of these attributes).
b) Put another way, it has been the case that if somebody said they
had designed a RISC CPU, and knew what they were talking about,
and you knew nothing else about it, you'd guess that it would have
32-bit instructions, load/store architecture, simple addressing
modes, 32-bit integer registers (or the obvious 64-bit extensions),
separate integer and FP registers, etc ...
and you'd probably mostly guess right.
At some point, trying to argue whether or not a 30-year-old
CPU was RISC or not is counting angels dancing on pinheads,
that is, it is not useful. It might be useful to consider the
various specific attributes of ISAs and see what was there.
Just because a CPU has a low gate-count doesn't make it RISC ...
but it is also not clear that calling it CISC tells you much, since:
c) CISC is *not* a term like RISC, i.e., I think it was invented
to mean "not-RISC", and people often use it that way, thus
including the entire architectural space *except* that little
cluster properly labeled RISC. [Note: this is not a pejorative
comment on the term CISC, just a note that people often use
a RISC-vs-CISC style argument as though the two terms had the
same nature ... and they don't]. For example, suppose someone
tells you they've worked on a CISC. What would you know about
the nature of that ISA?
Would it have a range of instruction sizes?
maybe ... but there have been plenty of ISAs with
a single instruction size that would be very hard
to call RISCs.
Would it have indirect addressing, and if so, 1-level,
or multi-level?
Maybe, maybe not [S360's don't].
Would it have separate integer and FP registers?
Maybe, maybe not. [VAX's don't]
Would it be a General Register Machine, a stack machine,
or some kind of memory-memory machine?
Completely unclear.
Thus, there are a bunch of relatively similar ISAs called RISCs;
there are a much larger bunch called CISCs, that often have little
in common. There are a few that lie on the border, in some
multi-dimensional space of ISA attributes.
d) As a result, it is fairly difficult (for me) to make much sense of
arguments about which machines are RISCier or CISCier - I don't
know of any natural linear scale of such things [recall that the
charts I've posted explicitly avoided computing a single RISC/CISC
number].
Anyway, it certainly seemed to me that Seymour Cray had some of the same
approach as later RISC designers, but that the current RISC wave especially
started with John Cocke & co at T. J. Watson.
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@sgi.com
DDD: 415-390-3090 FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311