From: torvalds@penguin.transmeta.com (Linus Torvalds)
Newsgroups: comp.arch
Subject: Re: If you can't do VLIW, you can't do much else, either
Message-ID: <b3tplu$dej$1@penguin.transmeta.com>
Date: Sun, 2 Mar 2003 20:30:54 +0000 (UTC)

In article <17746vo6hu5ug7jjhvu20n6tjcsbtmng18@4ax.com>,
Robert Myers  <rmyers1400@attbi.com> wrote:
>
>Intel has now put out two cranky processors in a row.  No matter what
>anyone says, the P4 is cranky.  When it's good, it can be very good,
>but it can also be downright mediocre or worse.

I don't think the P4's crankiness is necessarily a design issue so much
as an implementation detail.

Some of the crankiness will apparently be fixed by Prescott, i.e. it seems
to address at least part of the problems for "normal code": shifter and
multiplier contention and the minuscule L1 caches.  That's likely to
help a lot on regular code.

The OS-visible crankiness (which I agree with you on) and the huge
latencies for things like memory barriers and other pipeline-hazardous
operations (privilege level switches are too expensive) are probably a
bit more fundamental, and are likely to stay with us.

On a P4, a serializing instruction costs almost 200 CPU cycles, and I
_assume_ that it's because the thing waits until the whole deep pipeline
has fully drained before it is restarted.  Others have been able to make
do with less draconian measures; maybe Intel can fix that one too.

		Linus


From: torvalds@penguin.transmeta.com (Linus Torvalds)
Newsgroups: comp.arch
Subject: Re: If you can't do VLIW, you can't do much else, either
Message-ID: <b4058c$1gc$1@penguin.transmeta.com>
Date: Mon, 3 Mar 2003 18:00:44 +0000 (UTC)

In article <m3n0kckcpm.fsf@averell.firstfloor.org>,
Andi Kleen  <freitag@alancoxonachip.com> wrote:
>torvalds@penguin.transmeta.com (Linus Torvalds) writes:
>> On a P4, a serializing instruction costs almost 200 CPU cycles, and I
>> _assume_ that it's because the thing waits until the whole deep pipeline
>> has fully drained before it is restarted.  Others have been able to make
>> do with less draconian measures; maybe Intel can fix that one too.
>
>Just another reason to do it with fewer locks @)

Well, it's not just locks.  Even using lockless algorithms can be quite
painful, since they often depend on cmpxchg (serializing on the P4, i.e.
100-200 cycles) or at least on memory barriers (also largely
serializing; even the "low-overhead" lfence is something like ~50
cycles).
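
To make that concrete, here is a minimal sketch (using modern GCC's
__atomic builtins, which compile down to a LOCK CMPXCHG on x86; the
function name is just for illustration) of the kind of lockless update in
question.  There is no lock anywhere, but every attempt still goes through
cmpxchg, so on a P4 you still pay the serialization cost on every update:

#include <stdint.h>

/* Lock-free increment of a shared counter via a CAS retry loop. */
static void lockless_add(uint64_t *counter, uint64_t delta)
{
        uint64_t old = __atomic_load_n(counter, __ATOMIC_RELAXED);

        /*
         * Each iteration issues a cmpxchg; that single instruction is
         * the 100-200 cycle operation being talked about here.  On
         * failure the builtin reloads 'old' with the current value,
         * so we just retry.
         */
        while (!__atomic_compare_exchange_n(counter, &old, old + delta,
                                            0 /* strong */,
                                            __ATOMIC_SEQ_CST,
                                            __ATOMIC_RELAXED))
                ;
}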

>Wasn't Prescott supposed to have an even deeper pipeline?

Yes.  I can't imagine that it would matter though, since clearly the
cycles wasted on the synchronization are much higher than _just_ the
pipeline depth.  Dunno what it is.

>Interesting on the P4: when you time instructions you have to be very
>careful not to time RDTSC itself too.  That's because for some reason
>RDTSC is very slow and partly serializing on the P4, much slower than on
>other modern x86 chips.

Yeah, just the difference between consecutive RDTSCs is something like
80 cycles.  Which is still "only" about half of a memory barrier.
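
A quick way to see that floor for yourself, assuming x86 and GCC-style
inline asm (taking the minimum over many trials to filter out interrupts
and other noise):

#include <stdio.h>
#include <stdint.h>

/* Read the time stamp counter. */
static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        uint64_t best = ~0ULL;
        int i;

        for (i = 0; i < 100000; i++) {
                uint64_t t0 = rdtsc();
                uint64_t t1 = rdtsc();
                if (t1 - t0 < best)
                        best = t1 - t0;
        }
        printf("back-to-back rdtsc: ~%llu cycles\n",
               (unsigned long long)best);
        return 0;
}

Whatever that prints is the overhead any rdtsc-based measurement carries,
which is why it has to be subtracted out (or at least kept in mind) when
timing individual instructions on a P4.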

			Linus


From: Terje Mathisen <terje.mathisen@hda.hydro.com>
Newsgroups: comp.arch
Subject: Re: P4/Netburst architecture is dead
Date: Mon, 10 May 2004 08:25:53 +0200
Message-ID: <c7n79i$hno$1@osl016lin.hda.hydro.com>

Stefan Monnier wrote:

>>Did you miss when Intel demonstrated 10GHz ALUs last year?  The P4 strategy
>>was always based on marketing GHz over performance, but their recent
>>setbacks in increasing clock speed have caused them to kill the entire
>>product.
>
>
> There's no question that Intel's marketing has played the GHz song pretty
> heavily.  But this newsgroup is not about marketing, so the real question
> is whether the marketing drove the microarchitecture or not.
>
> I personally don't believe it did.  The P4 is a pretty good performer if
> you ask me, so there seem to have been valid technical reasons to go
> this route.

May I suggest you all take the 70-90 minutes required to watch Bob
Colwell (+ Andy 'Crazy' Glew at one point) explain all this stuff in a
Stanford lecture?

http://stanford-online.stanford.edu/courses/ee380/040218-ee380-100.asx

_Very_ short version: Yes, the P4 was intentionally a GHz speed demon,
but that was tempered by the need to deliver some actual performance that
the engineers could be comfortable with.

Bob also makes the same argument that I made in a conference
presentation last year: The P4 is quite brittle, i.e. it is too easy to
get stuck with very non-optimal performance for too long.

Terje
PS. Thanks to RM for sending me the link!
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
