From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: Embedded DRAM
Date: Sat, 26 Sep 2009 14:08:04 -0700 (PDT)
Message-ID: <e2fd90d4-c26e-45ae-9b91-2a9b76c99c92@f33g2000vbm.googlegroups.com>

A couple of issues to point out:

Way back in '00 (2000) I was pushing for a DRAM secondary cache (I
also wanted DRAM caches in the late 1980s at Moto). The trick to
making such a device perform well was to heavily bank the cache so
that successive accesses do not go to the same sets of DRAMs as the
previous access. Here (target frequency 4 GHz), a new access would
start every 250 ps to 1 of 64 banks (actually 1 read, 1 write, and 1
tag update would start every 250 ps, but I digress) and 1 refresh would
start on an unused bank. Each access (including refresh) would take
5-ish cycles before the bank was ready for another access. We modeled
up to 12 cycles and found essentially no performance difference
{assuming the read delay did not change}.
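
A minimal toy model of that banking arithmetic, assuming the 64 banks
and ~5-cycle bank busy time above and a made-up random access stream
(this is a sketch, not the simulator that was actually used):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy model: one new access may start every cycle (250 ps at 4 GHz),
 * and the bank it lands on is then busy for ~5 cycles (access plus
 * cell recharge).  Bank count and busy time are the figures above;
 * the random line-address stream is made up for illustration. */
#define BANKS 64
#define BUSY   5
#define N      1000000

int main(void)
{
    uint64_t busy_until[BANKS] = {0};
    uint64_t cycle = 0, stalls = 0;

    for (int i = 0; i < N; i++) {
        unsigned bank = (unsigned)(rand() % BANKS); /* low line-address bits pick the bank */
        if (busy_until[bank] > cycle) {             /* bank still recovering: stall */
            stalls += busy_until[bank] - cycle;
            cycle = busy_until[bank];
        }
        busy_until[bank] = cycle + BUSY;            /* access + recharge in the background */
        cycle++;                                    /* next access can start next cycle */
    }
    printf("average stall cycles per access: %.4f\n", (double)stalls / N);
    return 0;
}

With 64 banks and a 5-cycle busy time, only a handful of banks are tied
up at any instant, so a random stream rarely collides, which is
consistent with the insensitivity to bank busy time noted above.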

At this point you can do several things that the DRAM designers almost
never get around to:: first, throw some power at the access-time
problem and you can make the DRAM less power hungry than an
equivalently sized SRAM (and smaller in physical area), and you can
also do the cell recharge while the cache is shipping data back to the
core. Secondly, you can refresh the crap out of the cache, so you
don't have to have capacitors that hold charge for 16 milliseconds;
the caps only need to hold charge for about 10 microseconds (or so).
Thirdly, you can do intelligent refresh by keeping track of which cell
rows have been accessed and skipping their refresh on this go-round.
So, for example, strip-mine memory accesses do not even see refresh
cycles.
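
A minimal sketch of that "intelligent refresh" bookkeeping, assuming
one touched-flag per row (the row count is taken from the example
below; the exact mechanism here is a guess):

#include <stdbool.h>

/* Sketch of refresh skipping: a normal access marks its row as touched,
 * since reading a DRAM row restores (refreshes) it as a side effect.
 * The refresh walker then skips any row already touched this go-round.
 * ROWS_PER_BANK is the 128-word figure from the example below. */
#define ROWS_PER_BANK 128

static bool touched[ROWS_PER_BANK];   /* one flag per row (per bank) */

void on_access(unsigned row)          /* called by the normal read/write path */
{
    touched[row] = true;
}

bool refresh_needed(unsigned row)     /* called by the refresh walker */
{
    bool skip = touched[row];
    touched[row] = false;             /* arm the row for the next go-round */
    return !skip;
}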

So, for example, take 64 banked DRAM arrays of 128 'words' each, each
word being 512 bits, and we have a 1/2 MB cache. One can build bigger
cache stores by using this 1/2 MB instance as the building block. A
full refresh sweep takes only 8192 refresh cycles (or, at 4 GHz, about
2 microseconds). So if the cells are safe for 10 microseconds, they
will be robustly refreshed under this scheme, and you can do the clock
adjustments (div-2, div-3, ...) everyone loves and still adequately
refresh the DRAMs.
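
The arithmetic, spelled out (figures straight from the paragraph
above):

#include <stdio.h>

/* Just the sizing and refresh-timing arithmetic from above. */
int main(void)
{
    double banks = 64, words = 128, bits_per_word = 512;
    double cycle_ns = 0.25;   /* 4 GHz; one refresh start per cycle */

    printf("capacity: %.0f KB\n", banks * words * bits_per_word / 8 / 1024);    /* 512 KB  */
    printf("rows to refresh: %.0f\n", banks * words);                           /* 8192    */
    printf("full refresh sweep: %.3f us\n", banks * words * cycle_ns / 1000.0); /* ~2.05 us */
    return 0;
}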

Now, with the need for long storage times and big caps eliminated
(or at least ameliorated), one can build DRAMs without resorting to
the vertical structures mentioned above {if you have vertical
structures in your process, go ahead and use them and save refresh
power}. The only thing needed is a depletion implant for the storage
nodes and sufficient distance from the transfer gate nodes to keep the
implant from making the transfer gate leaky. I saw the size of such a
cell and it was about 1/12 the size of a 6T SRAM cell in the same
process (about 130 nm), but when the rest of the array was wrapped
around the cells, that array was just a little better than 5X smaller.

The issue that always comes up is reliability--as in ECC to protect
the cells, rows, and tag state. But in the time since then, most L2s
and L3s have ECC attached anyway, at least on the data. The only real
difference is that the tag arrays also need ECC, and the overall ECC
should be more robust than an 8-bit ECC protecting 64 bits: something
like 16 bits protecting 128 bits of store, with the property that all
2-bit errors are corrected, all strings of 3, 4, 5, 6, and 7 bit
errors are also corrected, and anything else you can work into the
correcting code.
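
For what it's worth, the check-bit overhead is the same either way;
the wider code word is what buys the extra correction strength:

#include <stdio.h>

/* Overhead comparison implied above: same fraction of check bits,
 * but the 128-bit code word leaves room for a stronger code. */
int main(void)
{
    printf("8 check bits over 64 data bits  : %.1f%%\n", 100.0 *  8 /  64);
    printf("16 check bits over 128 data bits: %.1f%%\n", 100.0 * 16 / 128);
    return 0;
}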

But the crux of the matter is that no one in upper-layer management is
willing to bet their company on a DRAM L2/L3 working. So, the designers
have to build a plug-replaceable SRAM cache of whatever size it ends up
being. This is the one that gets taped out first, and if it works well
enough, nobody ever swaps it out for the 4X-6X DRAM version.

Mitch


From: Terje Mathisen <terje.wiig.mathisen@gmail.com>
Newsgroups: comp.arch
Subject: Re: Embedded DRAM
Date: Sun, 27 Sep 2009 04:37:35 -0700 (PDT)
Message-ID: <ffaa0406-daf8-40ec-9364-43ceb0b8ea45@h30g2000vbr.googlegroups.com>

On Sep 26, 2:08 pm, MitchAlsup <MitchAl...@aol.com> wrote:
> A couple of issues to point out:
>
> Way back in '00 (2000) I was pushing for a DRAM secondary cache (I
> also wanted DRAM caches in the late 1980s at Moto). The trick to
> making such a device perform well was to heavily bank the cache so
> that successive accesses do not go to the same sets of DRAMs as the
> previous access. Here, (target frequency 4 GHz) a new access would
> start every 250 ps to 1 of 64 banks (actually 1 read, 1 write, and 1
> tag update would start every 250 ps but I digress) and 1 refresh would
> start on an unused bank. Each access (including refresh) would take 5-
> ish cycles  before the bank was ready for another access. We modeled
> up to 12 cycles and found essentially no performance difference
> {assuming the read delay did not change}.

Mitch, I really like this, but there is one possible problem which
seems so obvious that you must have considered it at the time:

What about strided access to some largish data structure?

With 64 banks, anything that happens to use a 256-byte (assuming 32-
bit blocks) stride would hit the same bank on every access, right?

Any other medium-size power-of-two stride would cause similar trouble,
to a somewhat smaller degree.

I remember discussions years back about a mainframe vendor who used 17
banks: having a prime number of banks tends to minimize the possible
number of conflicts, while primes of the form 2^n +/- 1 allow quite
fast and power-efficient modulo calculations.
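
A sketch of why 2^n +/- 1 is cheap: with 17 = 2^4 + 1 we have
16 == -1 (mod 17), so an alternating sum of the address's 4-bit digits
gives the residue using a few narrow adders and no divider. (Toy C
version; the 17-bank mapping is the example above, not any particular
machine's hash.)

#include <stdint.h>
#include <stdio.h>

/* Reduce an address mod 17 (= 2^4 + 1) without a divider: since
 * 16 == -1 (mod 17), the alternating sum of the address's 4-bit
 * digits is congruent to the address itself.  The final fix-up is a
 * small correction, not a divide. */
static unsigned bank_mod17(uint64_t addr)
{
    int acc = 0, sign = 1;
    for (int i = 0; i < 16; i++) {               /* 16 nibbles in a 64-bit address */
        acc += sign * (int)((addr >> (4 * i)) & 0xF);
        sign = -sign;
    }
    while (acc < 0)   acc += 17;                 /* |acc| <= 120, so only a few   */
    while (acc >= 17) acc -= 17;                 /* narrow corrections are needed */
    return (unsigned)acc;
}

int main(void)
{
    for (uint64_t a = 0; a < 1000000; a++)       /* sanity check against the real mod */
        if (bank_mod17(a) != a % 17) {
            printf("mismatch at %llu\n", (unsigned long long)a);
            return 1;
        }
    printf("bank_mod17(a) == a %% 17 for all tested a\n");
    return 0;
}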

> But the crux of the matter is that no one in upper-layer management is
> willing to bet their company on a DRAM L2/L3 working. So, the designers
> have to build a plug-replaceable SRAM cache of whatever size it ends up
> being. This is the one that gets taped out first, and if it works well
> enough, nobody ever swaps it out for the 4X-6X DRAM version.

If the regular SRAM version works well, you can make a very expensive
"server-only/extreme" version with 1.5X/2X/3X the amount of (SRAM)
cache and sell it for 3X-10X as much. :-)

Having a 5X DRAM cache would be much cheaper and might perform as well
as the current server chips, right?

Terje


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: Embedded DRAM
Date: Mon, 28 Sep 2009 08:50:08 -0700 (PDT)
Message-ID: <d73d549e-8114-4587-9c3c-da6fe093faa7@d23g2000vbm.googlegroups.com>

On Sep 27, 6:37 am, Terje Mathisen <terje.wiig.mathi...@gmail.com>
wrote:
> Mitch, I really like this, but there is one possible problem which
> seems so obvious that you must have considered it at the time:
>
> What about strided access to some largish data structure?
>
> With 64 banks, anything that happens to use a 256-byte (assuming 32-
> bit blocks) stride would hit the same bank on every access, right?

Other than that the repetitive hit is at 4096-byte strides, you are
correct. But I suppose you could run into it at 256 if you made the L2
16-way set associative.
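
The arithmetic behind the 4096: if the low line-address bits pick the
bank, then 64 banks times a 64-byte line (the line size here is an
assumption, not stated above) repeats the same bank every 4096 bytes:

#include <stdio.h>

/* Bank picked by the low line-address bits.  64 banks is from the
 * original post; the 64-byte line size is an assumption. */
#define LINE_BYTES 64
#define BANKS      64

int main(void)
{
    for (unsigned long addr = 0; addr <= 8UL * 4096; addr += 4096)
        printf("addr %6lu -> bank %lu\n", addr, (addr / LINE_BYTES) % BANKS);
    /* every address in a 4096-byte stride maps to bank 0 */
    return 0;
}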

Way back when we were contemplating this, we dialed those parameters
into the cache in the simulator and did not take a visible hit on any
of the many hundred apps we looked at regularly: no hit versus the
same kinds of parameters except for using a smaller amount of SRAM as
cache and no back-to-back penalty. One thing to remember is that the
L1 has already cleaned up a lot of the trash accesses that caches are
good at cleaning out of the main memory access path.

> Any other medium-size power-of-two would cause similar trouble, to
> somewhat smaller degree.
>
> I remember discussions years back about a mainframe vendor who used 17
> banks: Having a prime number of banks tends to minimize the possible
> number of conflicts, while primes of the form 2^n +/- 1 allow quite
> fast and power-efficient modulo calculations.

Burroughs Scientific Processor (BSP). It used a multiply array in the
address path to memory. Skewed-associative caches are another way to
minimize the hit.
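
A rough sketch of the skewed-associative idea: each way indexes the
array with a different hash of the address, so a stride that aliases
in one way usually spreads out in the others. (The mixing function
below is a toy stand-in, not Seznec's published skewing functions, and
the set and way counts are made up.)

#include <stdint.h>
#include <stdio.h>

/* Toy skewed-associative indexing: each way hashes the line address
 * differently, so addresses that share low index bits (and would all
 * collide in a conventional cache) land in different sets per way. */
#define SETS 256             /* 8 index bits per way (made up)  */
#define WAYS 4               /* associativity (made up)         */

static unsigned skew_index(uint64_t line_addr, unsigned way)
{
    uint64_t lo  = line_addr & (SETS - 1);                         /* conventional index bits */
    uint64_t hi  = (line_addr >> 8) & (SETS - 1);                  /* next 8 address bits     */
    uint64_t rot = ((hi << way) | (hi >> (8 - way))) & (SETS - 1); /* per-way rotate          */
    return (unsigned)(lo ^ rot);
}

int main(void)
{
    /* Two line addresses with identical low index bits: a conventional
     * cache would put them in the same set; here each way separates them. */
    uint64_t a = 0x1234, b = a + SETS;
    for (unsigned w = 0; w < WAYS; w++)
        printf("way %u: addr a -> set %3u, addr b -> set %3u\n",
               w, skew_index(a, w), skew_index(b, w));
    return 0;
}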

> > But the crux of the matter is that no one in upper-layer management is
> > willing to bet their company on a DRAM L2/L3 working. So, the designers
> > have to build a plug-replaceable SRAM cache of whatever size it ends up
> > being. This is the one that gets taped out first, and if it works well
> > enough, nobody ever swaps it out for the 4X-6X DRAM version.
>
> If the regular SRAM version works well, you can make a very expensive
> "server-only/extreme" version with 1.5X/2X/3X the amount of (SRAM)
> cache and sell it for 3X-10X as much. :-)

But the chip with the DRAM cache 5 times as big is SMALLER...

> Having a 5X DRAM cache would be much cheaper and might perform as well
> as the current server chips, right?

Which is why I have always been a fan of pushing this density edge to
your advantage.

Mitch
