Newsgroups: fa.linux.kernel
From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <b3b6oa$bsj$1@penguin.transmeta.com>
Date: Sun, 23 Feb 2003 19:23:46 GMT
Message-ID: <fa.k71001p.1m862d@ifi.uio.no>

In article <20030223082036.GI10411@holomorphy.com>,
William Lee Irwin III  <wli@holomorphy.com> wrote:
>On Sun, Feb 23, 2003 at 12:07:50AM -0800, David Lang wrote:
>> Garrit, you missed the prior poster's point. IA64 has the same fundamental
>> problem as the Alpha, PPC, and Sparc processors: it doesn't run x86
>> binaries.
>
>If I didn't know this mattered I wouldn't bother with the barfbags.
>I just wouldn't deal with it.

Why?

The x86 is a hell of a lot nicer than the ppc32, for example.  On the
x86, you get good performance and you can ignore the design mistakes (ie
segmentation) by just basically turning them off.
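
To make "turning them off" concrete: you just load flat segments with
base 0 and limit 4GB, and segmentation never gets in your way again. A
sketch of the descriptor values, in C purely for illustration (these are
the flat encodings the i386 kernel GDT uses):

	static const unsigned long long flat_gdt[] = {
		0x00cf9a000000ffffULL,	/* code: base 0, limit 4GB, ring 0 */
		0x00cf92000000ffffULL,	/* data: base 0, limit 4GB, ring 0 */
	};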

On the ppc32, the MMU braindamage is not something you can ignore; you
have to write your OS around it, and if you turn it off (ie enable
soft-fill on the ones that support it) you now have to have separate
paths in the OS for it.

And the baroque instruction encoding on the x86 is actually a _good_
thing: it's a rather dense encoding, which means that you win on icache.
It's a bit hard to decode, but who cares? Existing chips do well at
decoding, and thanks to the icache win they tend to perform better - and
they load faster too (which is important - you can make your CPU have
big caches, but _nothing_ saves you from the cold-cache costs).
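
To put rough numbers on the icache win (the ~3 bytes is a ballpark
average for x86 code, not a measured number):

	32kB icache / 4 bytes per fixed-size instruction  = ~8k instructions
	32kB icache / ~3 bytes per x86 instruction        = ~11k instructions

Same cache, roughly a third more code in it.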

The low register count isn't an issue when you code in any high-level
language, and it has actually forced x86 implementors to do a hell of a
lot better job than the competition when it comes to memory loads and
stores - which helps in general.  While the RISC people were off trying
to optimize their compilers to generate loops that used all 32 registers
efficiently, the x86 implementors instead made the chip run fast on
varied loads and used tons of register renaming hardware (and looking at
_memory_ renaming too).
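
A hypothetical scrap of C to make the point (function and names made
up): at its peak this has more values live than the eight x86 integer
registers, so the compiler spills some to the stack - and the renaming
and store-to-load forwarding hardware make those spills nearly free:

	/* purely illustrative: more live values than 8 GPRs */
	int dot8(const int *a, const int *b)
	{
		int s0 = a[0]*b[0], s1 = a[1]*b[1];
		int s2 = a[2]*b[2], s3 = a[3]*b[3];
		int s4 = a[4]*b[4], s5 = a[5]*b[5];
		int s6 = a[6]*b[6], s7 = a[7]*b[7];

		return ((s0+s1) + (s2+s3)) + ((s4+s5) + (s6+s7));
	}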

IA64 made all the mistakes anybody else did, and threw out all the good
parts of the x86 because people thought those parts were ugly.  They
aren't ugly, they're the "charming oddity" that makes it do well.  Look
at them the right way and you realize that a lot of the grottiness is
exactly _why_ the x86 works so well (yeah, and the fact that they are
everywhere ;).

The only real major failure of the x86 is the PAE crud.  Let's hope
we'll get to forget it, the same way the DOS people eventually forgot
about their memory extenders.
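
(The arithmetic behind the crud, for the record:

	32-bit virtual address space:         4GB per process, PAE or not
	36-bit physical addressing with PAE:  up to 64GB of RAM

Sixteen times more physical memory than you can map at one time, plus
page table entries that grow from 4 to 8 bytes. Hence the kludges.)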

(Yeah, and maybe IBM will make their ppc64 chips cheap enough that they
will matter, and people can overlook the grottiness there. Right now
Intel doesn't even seem to be interested in "64-bit for the masses", and
maybe IBM will be. AMD certainly seems to be serious about the "masses"
part, which in the end is the only part that really matters).

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <Pine.LNX.4.44.0302231326370.1534-100000@home.transmeta.com>
Date: Sun, 23 Feb 2003 21:39:07 GMT
Message-ID: <fa.m6ucdqo.140m9go@ifi.uio.no>

On Sun, 23 Feb 2003, David Mosberger wrote:
>
> But does x86 really work so well?  Itanium 2 on 0.13um performs a lot
> better than P4 on 0.13um.

On WHAT benchmark?

Itanium 2 doesn't hold a candle to a P4 on any real-world benchmarks.

As far as I know, the _only_ things Itanium 2 does better on are (a) FP
kernels, partly due to a huge cache, and (b) big databases, entirely
because the P4 is crippled when it comes to lots of memory, since Intel
refuses to do a 64-bit version (because they know it would totally kill
ia-64).

Last I saw, the P4 was kicking ia-64's butt on specint and friends.

That's also ignoring the fact that ia-64 simply CANNOT DO the things a P4
does every single day. You can't put an ia-64 in a reasonable desktop
machine, partly because of pricing, but partly because it would just suck
so horribly at things people expect not to suck (games spring to mind).

And I further bet that using a native distribution (ie totally ignoring
the power and price and bad x86 performance issues), ia-64 will work a lot
worse for people simply because the binaries are bigger. That was quite
painful on alpha, and ia-64 is even worse - to offset the bigger binaries,
you need a faster disk subsystem etc just to not feel slower than a
bog-standard PC.

Code size matters. Price matters. Real world matters. And ia-64 at least
so far falls flat on its face on ALL of these.

>                         As far as I can guess, the only reason P4
> comes out on 0.13um (and 0.09um) before anything else is due to the
> latter part you mention: it's where the volume is today.

It's where all the money is ("ia-64: 5 billion dollars in the red and
still sinking") so of _course_ it's where the efforts get put.

			Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <Pine.LNX.4.44.0302231634150.1690-100000@home.transmeta.com>
Date: Mon, 24 Feb 2003 00:45:46 GMT
Message-ID: <fa.m5ugfii.150ub8u@ifi.uio.no>

On Sun, 23 Feb 2003, David Mosberger wrote:
>
>    2 GHz Xeon:		701 SPECint
>    1 GHz Itanium 2:	810 SPECint
>
> That is, Itanium 2 is 15% faster.

Ehh, and this is with how much cache?

Last I saw, the Itanium 2 machines came with 3MB of integrated L3 cache,
and I suspect that whatever 0.13um Itanium 2 numbers you're looking at
are with the new 6MB caches.

So your "apples to apples" comparison isn't exactly that.

The only thing that is meaningful is "performance at the same time of
general availability". At which point the P4 beats the Itanium 2 senseless
with a 25% higher SpecInt. And last I heard, by the time Itanium 2 is up
at 2GHz, the P4 is apparently going to be at 5GHz, comfortably keeping
that 25% lead.

			Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <Pine.LNX.4.44.0302231840220.1690-100000@home.transmeta.com>
Date: Mon, 24 Feb 2003 02:59:50 GMT
Message-ID: <fa.m6eefqe.14gcagq@ifi.uio.no>

On Sun, 23 Feb 2003, David Mosberger wrote:
>   >> 2 GHz Xeon:	701 SPECint
>   >> 1 GHz Itanium 2:	810 SPECint
>
>   >> That is, Itanium 2 is 15% faster.
>
> Unfortunately, HP doesn't sell 1.5MB/1GHz Itanium 2 workstations, but
> we can do some educated guessing:
>
>   1GHz Itanium 2, 3MB cache:		810 SPECint
>   900MHz Itanium 2, 1.5MB cache:	674 SPECint
>
> Assuming pure frequency scaling, a 1GHz/1.5MB Itanium 2 would get
> around 750 SPECint.  In reality, it would get slightly less, but most
> likely substantially more than 701.

And as Dean pointed out:

  2GHz Xeon MP with 2MB L3 cache:	842 SPECint

In other words, the P4 eats the Itanium for breakfast even if you limit it
to 2GHz due to some "process" rule.

And if you don't make up any silly rules, but simply look at "what's
available today", you get

  2.8GHz Xeon MP with 2MB L3 cache: 	907 SPECint

or even better (much cheaper CPUs):

  3.06 GHz P4 with 512kB L2 cache:	1074 SPECint
  AMD Athlon XP 2800+:			 933 SPECint

These are systems that you can buy today, with _less_ cache and clearly
much higher performance: comparing the best-performing published ia-64
(810 SPECint) against the best P4 (1074), the P4 is 32% faster. Even with
the "you can only run the P4 at 2GHz because that is all it ever ran at
in 0.18um" rule, the ia-64 falls behind.

>   Linus> The only thing that is meaningful is "performance at the same
>   Linus> time of general availability".
>
> You claimed that x86 is inherently superior.  I provided data that
> shows that much of this apparent superiority is simply an effect of
> the larger volume that x86 achieves today.

And I showed that your data is flawed. Clearly the P4 outperforms ia-64
on an architectural level _even_ when taking process into account.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <Pine.LNX.4.44.0302231343050.1534-100000@home.transmeta.com>
Date: Sun, 23 Feb 2003 21:49:50 GMT
Message-ID: <fa.m5e8eal.15gi80t@ifi.uio.no>

On Sun, 23 Feb 2003, John Bradford wrote:
>
> I could be wrong, but I always thought that Sparc and a lot of other
> architectures could mark arbitrary areas of memory (such as the
> stack) as non-executable, whereas x86 only lets you have one
> non-executable segment.

The x86 has that stupid "executability is tied to a segment" thing, which
means that you cannot make things executable on a per-page level.
It's a mistake, but it's one that _could_ be fixed in the architecture if
it really mattered, the same way the WP bit got fixed in the i486.
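
Here's what the fix would mean from user space, as a sketch (the
mprotect() interface is real; the per-page enforcement is the part the
segment-based architecture can't deliver, and the function name is made
up):

	/* ask for a page-granular non-executable mapping: the call
	 * succeeds, but a classic (pre-NX) x86 won't actually fault
	 * on execution, since executability lives in the segment
	 * rather than in the page table entry */
	#include <sys/mman.h>

	int make_no_exec(void *page_aligned, size_t len)
	{
		return mprotect(page_aligned, len, PROT_READ | PROT_WRITE);
	}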

I'm definitely not saying that the x86 is perfect. It clearly isn't. But a
lot of people complain about the wrong things, and a lot of people who
tried to "fix" things just made them worse by throwing out the good parts
too.

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <Pine.LNX.4.44.0302231805240.1690-100000@home.transmeta.com>
Date: Mon, 24 Feb 2003 02:43:43 GMT
Message-ID: <fa.m6eieqj.14g0bgv@ifi.uio.no>

On 24 Feb 2003 linux@horizon.com wrote:
>
> Now wait a minute.  I thought you worked at Transmeta.
>
> There were no development and debugging costs associated with getting
> all those different kinds of gates working, and all the segmentation
> checking right?

So? The only thing that matters is the end result.

> Wouldn't it have been easier to build the system, and shift the effort
> where it would really do some good, if you didn't have to support
> all that crap?

Probably not appreciably. You forget - it's been tried. Over and over
again. The whole RISC philosophy was all about "wouldn't it perform better
if you didn't have to support that crap".

The fact is, the "crap" doesn't matter that much. As proven by the fact
that the "crap" processor family ends up being the one that eats pretty
much everybody else for lunch on performance issues.

Yes, the "crap" does end up making it a harder market to enter. There's a
lot of IP involved in knowing what all the rules are, and having literally
_millions_ of tests that check for conformance to the architecture (and
much of the "architecture" is a de-facto thing, not really written down in
architecture manuals).

But clearly even that is not insurmountable, as shown by the fact that not
only does the x86 perform well, it's also one of the few CPUs that are
actively worked on by multiple different companies (including Transmeta,
as you point out - although clearly the "crap" is one reason why the sw
approach works at all).

> Transmeta's software-decoding is an extreme example of what all modern
> x86 processors are doing in their L1 caches, namely predecoding the
> instructions and storing them in expanded form.  This varies from
> just adding boundary tags (Pentium) and instruction type (K7) through
> converting them to uops and caching those (P4).

But you seem to imply that that is somehow a counter-argument to _my_
argument. And I don't agree.

I think what Transmeta (and AMD, and VIA etc) show is that the ugliness
doesn't really matter - there are different ways of handling it, and you
can either throw hardware at it or software at it, but it's still worth
doing, because in the end what matters is not the bad parts of it, but the
good parts.

Btw, the P4 tracecache does pretty much exactly the same thing that
Transmeta does, except in hardware. It's based on a very simple reality:
decoding _is_ going to be the bottleneck for _any_ instruction set, once
you've pushed the rest hard enough. If you're not doing predecoding, that
only means that you haven't pushed hard enough yet - _regardless_ of your
architecture.
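
For the flavor of it, here's a hypothetical sketch of the boundary-tag
variant of predecoding (made-up names and layout - the real chips are
hairier):

	/* one bit per icache byte, marking where an instruction
	 * starts; filled in once on the fill path from L2 so the
	 * fetch path never has to re-discover instruction lengths */
	struct icache_line {
		unsigned char byte[64];
		unsigned long long start;	/* bit i: byte[i] starts an insn */
	};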

> This exactly undoes any L1 cache size benefits.  The win, of course, is
> that you don't have as much shifting and aligning on your i-fetch path,
> which all the fixed-instruction-size architectures already started with.

No. You don't understand what the "cold-cache" case really means. It's more
than just bringing the thing in from memory to the cache. It's also all
about loading the dang thing from disk.

> So your comments only apply to the L2 cache.

And the disk.

> And for the expense of all the instruction predecoding logic between
> L2 and L1, don't you think someone could build an instruction compressor
> to fit more into the die-size-limited L2 cache?

It's been done. See the PPC stuff. I've read the papers (it's been a long
time, admittedly - it's not something new), and the fact is, apparently
it isn't being used much. Because it's quite painful, unlike the
x86 approach.

> > stores - which helps in general.  While the RISC people were off trying
> > to optimize their compilers to generate loops that used all 32 registers
> > efficiently, the x86 implementors instead made the chip run fast on
> > varied loads and used tons of register renaming hardware (and looking at
> > _memory_ renaming too).
>
> I don't disagree that chip designers have managed to do very well with
> the x86, and there's nothing wrong with making a virtue out of a necessity,
> but that doesn't make the necessity good.

Actually, you miss my point.

The necessity is good because it _forced_ people to look at what really
matters, instead of wasting 15 years and countless PhDs on things that
are, in the end, just engineering-masturbation (number of registers etc).

> The low register count *does* affect you when using a high-level language,
> because if you have too many live variables floating around, you start
> suffering.  Handling these spills is why you need memory renaming.

Bzzt. Wrong answer.

The right answer is that you need memory renaming and memory alias
hardware _anyway_, because doing dynamic scheduling of loads vs stores is
something that is _required_ to get the kind of performance that people
expect today. And all the RISC stuff that tried to avoid it was just a BIG
WASTE OF TIME. Because the _only_ thing the RISC approach ended up showing
was that eventually you have to do the hard stuff anyway, so you might as
well design for doing it in the first place.
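
A trivial example of why that dynamic scheduling has to happen in
hardware - the compiler cannot overlap these iterations by hoisting the
next loads above the store to a[i] unless it can prove the two arrays
never overlap, which in general it can't (illustrative code, names made
up):

	/* the core has to disambiguate the load/store addresses at
	 * run time to keep the pipeline full here */
	void add_arrays(int *a, int *b, int n)
	{
		int i;

		for (i = 0; i < n; i++)
			a[i] += b[i];
	}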

Which is what ia-64 did wrong - and what I mean by doing the same mistakes
that everybody else did 15 years ago. Look at all the crap that ia64 does
in order to do compiler-driven modulo scheduling of loops. That's part of
the whole design, with predication and those horrible register windows.
Can you say "risc mistakes all over again"?

My strong suspicion (and that makes it a "fact" ;) is that in another 5
years they'll get to where the x86 has been for the last 10 years, and
they'll realize that they will need to do out-of-order accesses etc, which
makes all of that modulo optimization pretty much useless, since the
hardware pretty much has to do it _anyway_.

> It's true that x86 processors have had fancy architectural features
> sooner than similar-performance RISCs, but I think there's a fair case
> that that's because they've *needed* them.

Which is exactly my point. And by the time you implement them, you notice
that the half-way measures don't mean anything, and in fact make for more
problems.

For example, that small register state is a pain in the ass, no? But since
you basically need register renaming _anyway_, the small register state
actually has some advantages in that it makes it easier to have tons of
read ports and still keep the register file fast. And once you do renaming
(including memory state renaming), IT DOESN'T MUCH MATTER.

>				  Why do the P4 and K7/K8 have
> such enormous reorder buffers, able to keep around 100 instructions
> in flight at a time?  Because they need it to extract parallelism out
> of an instruction stream serialized by a miserly register file.

You think this is bad?

Look at it another way: once you have hundreds of instructions in flight,
you have hardware that automatically

 - executes legacy applications reasonably well, since compilers aren't
   the most important thing.

   End result: users are happy.

 - makes compiler tricks like loop unrolling unnecessary, thus keeping
   your icache pressure down, since the deep pipelines do the loop
   unrolling in hardware (see the sketch below).
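
The sketch (hypothetical code, but it's the usual shape of the thing) -
same work, four times the code footprint:

	/* what a compiler emits when it unrolls by four: 4x the
	 * icache for the same loop, plus a cleanup tail */
	int sum_unrolled(const int *a, int n)
	{
		int i, s = 0;

		for (i = 0; i + 4 <= n; i += 4)
			s += a[i] + a[i+1] + a[i+2] + a[i+3];
		for (; i < n; i++)
			s += a[i];
		return s;
	}

An out-of-order core with a hundred instructions in flight just keeps
the rolled-up loop and overlaps the iterations by itself.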

Even the RISC people are doing hundreds of instructions in flight (ie
Power5), but they started doing it years after the x86 did, because they
claimed that they could force their users to recompile their binaries
every few years. And look where it actually got them..

> They've developed some great technology to compensate for the weaknesses,
> but it's sure nice to dream of an architecture with all that great
> technology but with fewer initial warts.  (Alpha seemed like the
> best hope, but *sigh*.  Still, however you apportion blame for its
> demise, performance was clearly not one of its problems.)

So my premise is that you always end up doing the hard things anyway, and
the "crap" _really_ doesn't matter.

Alpha was nice, no question about it. But it took them way too long to get
to the whole OoO thing, because they tried to take a short-cut that in the
end wasn't the answer. It _looked_ like the answer (the original alpha
design was done explicitly to not _need_ things like complex out-of-order
execution), but it was all just wrong.

The thing about the x86 is that hard cold reality (ie millions of
customers that have existing applications) really _forces_ you to look at
what matters, and so far it clearly appears that the things you are
complaining about (registers and segmentation) simply do _not_ matter.

> I think the same claim applies much more powerfully to the ppc32's MMU.
> It may be stupid, but it is only visible from inside the kernel, and
> a fairly small piece of the kernel at that.
>
> It could be scrapped and replaced with something better without any
> effect on existing user-level code at all.
>
> Do you think you can replace the x86's register problems as easily?

They _have_ been solved. The x86 performs about twice as well as any ppc32
on the market. End of discussion.

> > The only real major failure of the x86 is the PAE crud.
>
> So you think AMD extended the register file just for fun?

I think the AMD register file extension was unnecessary, yes. They did it
because they could, and it wasn't a big deal. That's not the part that
makes the architecture interesting. As you should well know.

> Hell, the "PAE crud" is the *same* problem as the tiny register
> file.  Insufficient virtual address space leading to physical > virtual
> kludges.

Nope. The small register file is a non-issue. Trust me. I do work for
Transmeta, and we do the register renaming in software, and it doesn't
matter in the end.

			Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Minutes from Feb 21 LSE Call
Original-Message-ID: <Pine.LNX.4.44.0302232041130.4453-100000@home.transmeta.com>
Date: Mon, 24 Feb 2003 05:02:35 GMT
Message-ID: <fa.m7tseqi.160q9go@ifi.uio.no>

On Sun, 23 Feb 2003, Martin J. Bligh wrote:
>
> > The fact is, the "crap" doesn't matter that much. As proven by the fact
> > that the "crap" processor family ends up being the one that eats pretty
> > much everybody else for lunch on performance issues.
>
> But is that because it's a better design? Or because it has more money
> thrown at it? I suspect it's merely its mass-market dominance generating
> huge amounts of cash to improve it ... and it got there through history,
> not technical prowess.

Sure. It's to a large degree "more money and resources", no question about
that.

But what is "better design"? Would it have been possible to put as much
effort as Intel (and others) put into the x86 architecture into something
else, and make it even better?

MY standpoint is that the above question is _meaningless_ and stupid.
People did try. Very hard. Claiming anything else is clearly misguided.
But compatibility and price matter just as much as raw performance - and
often more. Which means that even _if_ another architecture performed
better (and it certainly happened, in the heyday of the alpha), it
wouldn't much matter. People still stayed away from it in droves.

And in the end, that's why I don't like IA-64. I'll take back every single
bad thing I've ever said about IA-64 if Intel were to just sell those
things to the mass market instead of P4's. But clearly the IA-64 can't
make it in that market, and thus it is made irrelevant. The same way alpha
was made irrelevant, _despite_ having had much better performance - an
advantage that ia-64 clearly doesn't have.

(Admittedly, alpha didn't have hugely better performance for very long.
Intel came out with the PPro, and took a _lot_ of people by surprise).

AMD's x86-64 approach is a lot more interesting not so much because of any
technical issues, but because AMD _can_ try to avoid the "irrelevant"
part. By having a part that _can_ potentially compete in the market
against a P4, AMD has something that is worth hoping for. Something that
can make a difference.

IBM with Power5 and Apple could be the same thing (yeah yeah, I personally
suspect it goes enough against IBM's normal approach that it will cause
some friction). A CPU that actually competes in a market that is relevant.

Because server CPU's simply aren't very interesting from a technical
standpoint. I don't know of a _single_ CPU that ever grew down. But we've
seen a _lot_ of CPUs grow _up_. In other words: the small machines tend
to eat into the large ones, not the other way around.

And if you start from the large ones, you aren't going to make it in the
long run.

Put yet another way: if I were on Intel's IA-32 team, I'd be a lot more
worried about those XScale people finally getting their act together than
I would be about IA-64.

			Linus

