From: (Mitch Alsup)
Newsgroups: comp.arch
Subject: Re: implementation issues (apropos WIZ)
Date: 23 Jun 2004 08:27:46 -0700
Message-ID: <> (Jeffrey Dutky) wrote in message

> I've been fiddling around with logic gates (adders, MUXs, etc) for the
> past few days, in order to gauge what kind of gate delays could be
> expected for different CPU operations, and I've got a few questions
> for the group:
> 1a) ISTR reading in this group that the expected fanout in current
> semiconductor processes is fairly low: 2 or 3 loads per output. Is
> this correct? If so, that would imply that just building a simple
> adder or multiplexor would require a tree of drivers on the input
> lines, increasing the total number of gate delays by several times
> (e.g. although an adder or multiplexor has 3 essential gate delays, if
> my fanout is only 2, I'll need to add a 3 tiered driver tree to the
> inputs, yielding a total of 6 gate delays for the entire device).

In effect, yes: most of the logic that needs to be fast is done with
low fan-in (3 inputs max) and low fan-out (3 loads max). However, an
inverter is generally recognized as being 4-6 times as fast as a
3-input NAND (6-8 ps for an inverter in the technology I am working in).
The rise time of the input signals and the load on the output of a
gate interact with the effective speed of the gate. Occasionally a
4-input or 5-input gate results in a faster circuit because it
eliminates a layer of logic, and having fewer gates in the delay path
more than makes up for the inherent slowness of the wider gate.

Consider an 8-gate design point and a multiplexor controlling 64-bits.
Much of the cycle prior to the multiplexor can be consumed by routing
the direction signals through a tree of fan-out and then latching
these redundant copies so that the individual units of the multiplexor
are driven by control signals in a fan-out-of-4 arrangement. So you end
up with at least 16 redundant copies of the actual control signal.
In a 16 gate design point the same amount of time is consumed in routing
but there is not generally a clock boundary between control signals
and the multiplexor. Athlon and Opteron are 16-gate design points.

> 1b) The same is, obviously, true for bus drivers: either we need to use
> bigger transistors for our bus drivers, or we need buffer trees to
> drive the busses. In either case, the result is more equivalent gate
> delays.

A typical standard cell library has inverters of 8 different sizes
and most gates come in at least 6 different sizes to allow the
logic designer to balance speed, area, and power. There is a reason
a typical STD cell library has 250+ gates. Gates are like fastening
hardware. You don't use a 1/8" bolt to hold a cylinder head on a
big block Chevy, and you don't use a 1" 4340 forged bolt with rolled
threads to make a swing. Just as each bolt has an intended application,
each gate has an intended application.

> 1c) I only got a rudimentary physics and EE education, but it is my
> impression that when we say 'load' in reference to CMOS devices, we
> are talking about a capacitance. Hence, a long bus represents a large
> load, even though there might only be two gates on the bus (a source
> gate and a sink gate).

Modern high-speed microprocessors are seeing not only R (resistance) in
the wires; many are also seeing L (inductance), and a few of the faster
ones are beginning to see some skin effects and Ampere's Law effects.
Skin and Ampere's Law effects are visible in simulation when edge rates
approach 0.1 V/ps.

> 4) Concerning adders: I assume that wide-word adders are built up from
> smaller predicted carry adders, with ripple carries between the
> sub-adders, so we have another set of gate delays to worry about (for
> a 32-bit adder built from 4-bit predicted carry adders, we would have
> 7 or 8 extra gate delays so that the carries could propagate). Is this
> how things are really done, or do we just let the Verilog/VHDL
> synthesis engine bang out a full 32-bit wide predicted carry adder for
> us?

The first 4 to 8 bits of a fast adder are done in carry lookahead form
and the rest are done in carry select form; sized such that the select
can drive a fanout of 16 final multiplexors and 1 next select driver
and the wire distance between the select driver and the next select
driver. There are a bunch of papers on this from the DEC Alpha days.
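
A behavioral sketch of that organization in Python may help. It models only the structure (a first block that produces a real carry, then blocks that compute both possible sums and select one), not the gates or the timing. The block widths are assumptions chosen for illustration: 8 bits for the first block and 16 bits for the select blocks, loosely matching the fan-out of 16 mentioned above. It is not a description of any particular shipped adder.

```python
def add_block(a: int, b: int, carry_in: int, width: int):
    """Add two width-bit values plus carry_in; return (sum, carry_out)."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width

def carry_select_add(a: int, b: int, width: int = 64,
                     first: int = 8, block: int = 16):
    """Carry-select addition of two unsigned width-bit values.

    The low `first` bits stand in for the carry-lookahead block; every
    later block computes its sum for carry-in 0 and carry-in 1, and the
    incoming carry merely selects between the two results.
    """
    low_mask = (1 << first) - 1
    result, carry = add_block(a & low_mask, b & low_mask, 0, first)
    pos = first
    while pos < width:
        w = min(block, width - pos)
        m = (1 << w) - 1
        a_blk, b_blk = (a >> pos) & m, (b >> pos) & m
        sum0, c0 = add_block(a_blk, b_blk, 0, w)  # speculative: carry-in 0
        sum1, c1 = add_block(a_blk, b_blk, 1, w)  # speculative: carry-in 1
        s, carry = (sum1, c1) if carry else (sum0, c0)  # the select mux
        result |= s << pos
        pos += w
    return result, carry

print(carry_select_add(0xFFFFFFFFFFFFFFFF, 1))
```

In hardware the two speculative sums of every block are computed in parallel, so the only thing on the critical path after the first block is the chain of select multiplexors.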

The 64-bit adder in Opteron is built from 11 layers of 3-input logic,
and just happens to have a delay equal to 8 standard 3-input gates
with fan-out of 3. The freedom to choose different strengths of
individual gates and a large library allows this construction.

> Unfortunately, the 32-bit wide predicted carry adder will still have
> extra gate delays, due to the buffer trees needed to drive the AND/OR
> matrix: 8 4-bit adders with ripple carry between will be about 13 gate
> delays (3 for the input buffer trees, 3 for the AND/OR matrix and 7
> for carry propagation), while a full 32-bit wide adder would be 8 gate
> delays (5 for the input buffer trees and 3 for the AND/OR matrix),
> which is less than a factor of 2 improvement.

This is where tools like SPICE come in. You cannot reason about the
technology without seeing how the various arguments play out in
simulation. Each time you increase the size of a gate, you affect both
its drive strength and the input load it presents to the gate that drives it.
Finding an optimal set of sized gates is a non-trivial undertaking
for something on the scale of a 64-bit adder.

Given a 64-bit adder built from a standard cell library, one can, without
changing the schematic, change the speed of the adder by some 20% by
sizing individual transistors; going all the way to carefully
crafted dynamic logic (still with the same basic schematic) one can
almost double the speed of the adder (about 60% of the original delay,
i.e. roughly 1.67x as fast).

> 5) It looks like the major limitations to speed are going to be gate
> delays (mostly in the buffer trees) and bus capacitance (on the long
> busses), not speed-of-light limits. Am I missing something?

CMOS has never been speed-of-light limited. Indeed, the speed of
propagation in copper on-die is on the order of 10% of the speed of light,
not the 50% of the speed of light one would find in the 'stripline' routing
found on motherboards. It's mainly limited by RC, and a good deal of
the C comes from the dielectrics in current use. Here 3D modeling of
critical wire paths is becoming mandatory.
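
To put rough numbers on the RC point, here is a small Python sketch that compares time of flight with the delay of an unbuffered distributed RC wire. The 0.38*R*C figure is the standard first-order approximation for the 50% delay of a distributed RC line. The per-millimeter resistance and capacitance are hypothetical round numbers chosen for illustration, not values from any real process.

```python
C_LIGHT_MM_PER_PS = 0.2998  # speed of light in vacuum, in mm per picosecond

def time_of_flight_ps(length_mm: float, fraction_of_c: float) -> float:
    """Delay of a signal travelling at a given fraction of the speed of light."""
    return length_mm / (fraction_of_c * C_LIGHT_MM_PER_PS)

def distributed_rc_delay_ps(length_mm: float, r_ohm_per_mm: float,
                            c_ff_per_mm: float) -> float:
    """Approximate 50% delay (0.38 * R * C) of an unbuffered distributed RC wire."""
    r_total = r_ohm_per_mm * length_mm          # ohms
    c_total = c_ff_per_mm * length_mm * 1e-15   # farads
    return 0.38 * r_total * c_total * 1e12      # seconds -> picoseconds

# Hypothetical wire parameters (assumptions, not measured data).
R_PER_MM = 100.0   # ohms per mm
C_PER_MM = 200.0   # femtofarads per mm

for length in (1.0, 5.0, 10.0):
    print(length,
          round(time_of_flight_ps(length, 0.5), 1),
          round(time_of_flight_ps(length, 0.1), 1),
          round(distributed_rc_delay_ps(length, R_PER_MM, C_PER_MM), 1))
```

The thing to notice is the scaling: time of flight grows linearly with length, while the unbuffered RC delay grows with the square of the length, which is why long wires get broken up with repeaters.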

> 6) From the software engineering perspective: when you write a
> simulator for a new architecture, how do you account for fanout limits
> and gate delays? Do you just make estimates based on experience or is
> there some well established methodology? Do you just ignore these
> issues and go for a behavioral model without realistic timing?

Before you start, you have a collection of building blocks that have
known characteristics (adders, reg files, SRAM macros, ...). You take
the experience of the previous machines and add new ideas and build
from there.

You determine a characteristic length of wire that corresponds to
a cycle of the machine. Every time a wire travels this far another
clock has occurred (whether you want it to or not). If you overestimate
the wire delay, you will run into it later as the design gets faster.
If you underestimate the wire delay, the part will never get faster.
Choose wisely!
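
As a sketch of how such a budget might be set up in an early C-style or scripted model, the following Python computes a cycle time from a gates-per-cycle design point and then charges a clock for every characteristic length a wire spans. All the numbers are assumptions for illustration: a 30 ps gate delay (in the range implied by the inverter figures earlier in this post) and a hypothetical 60 ps/mm for a repeated wire.

```python
import math

def cycle_time_ps(gates_per_cycle: int, gate_delay_ps: float) -> float:
    """Clock period implied by a gates-per-cycle design point."""
    return gates_per_cycle * gate_delay_ps

def characteristic_length_mm(cycle_ps: float, wire_delay_ps_per_mm: float) -> float:
    """Distance a signal on a repeated wire covers in one clock period."""
    return cycle_ps / wire_delay_ps_per_mm

def cycles_needed(wire_mm: float, char_len_mm: float) -> int:
    """Clock periods consumed by a wire of the given length."""
    return math.ceil(wire_mm / char_len_mm)

GATE_DELAY_PS = 30.0          # assumed delay of one standard gate
WIRE_DELAY_PS_PER_MM = 60.0   # assumed delay of a repeated wire

for gates in (8, 16):
    period = cycle_time_ps(gates, GATE_DELAY_PS)
    reach = characteristic_length_mm(period, WIRE_DELAY_PS_PER_MM)
    print(gates, period, reach, cycles_needed(10.0, reach))
```

Halving the gates per cycle halves the characteristic length, so a route that fit in one clock at the 16-gate design point costs extra pipeline stages at the 8-gate design point.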

> I've gotten the impression from previous posts to this group that,
> while Verilog/VHDL can be used to build the simulator, the preferred
> method is to write simulators in C, because the C code is MUCH faster
> than the Verilog or VHDL. There was one post, a year or so back, that
> talked about machine conversion of an HDL to C, making it sound slow
> and unreliable (some posters claimed that the resulting C-code took
> forever to compile).
> Also, if you write the simulator in C, how do you usually do it? Do
> you just write a bunch of free-form C code that roughly mimics the
> simulated system's inner workings, or do you actually implement
> something to do RTL-style stuff? I've done both: free-form C is easier
> for simple simulators (this is what I'm doing with the pseudo-WIZ
> simulator) but the RTL-style stuff is nicer for a complex design (like
> DLX).
> - Jeff Dutky

There is a whole hierarchy of simulators: quick and dirty ones to
ferret out pipeline/architecture issues (usually in C), detailed
simulators to ferret out control algorithms (also in C), and precise
simulators (almost always in Verilog/VHDL). The precise ones are usually
written in both a high-level form ('+' is a 64-bit adder) and lower-level
forms (where the adder is expanded into the actual gates, ...).


A better question is WHAT to simulate. It would be nice to simulate
a real workload for a real measurable unit of time. However the
more accurate the simulation the slower it runs. In any event, don't
forget to simulate operating system effects, task switching, TLBs,
DRAMs, Pins, clock skew; as you look intently at one favored benchmark
or another. In order to verify a modern large fast microprocessor
you may need on the order of 1 trillion (carefully chosen) test vectors.
