Index Home About Blog
From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Code density and performance?
Date: 22 Jul 2005 00:30:56 -0700
Message-ID: <1122017456.365060.113060@g49g2000cwa.googlegroups.com>

Tom Linden wrote:
> On Wed, 20 Jul 2005 12:07:33 -0700, glen herrmannsfeldt
> <gah@ugcs.caltech.edu> wrote:
>
> > That was 1986, though I don't believe VAX was a viable architecture
> > by 1996, even though it wasn't out of address bits.
>
> It certainly could have been viable, had DEC continued its development
> instead of squandering its resources on the Alpha adventure.

Sigh ... that opinion is strongly at variance with the facts in the
real world; the best engineers in the world (and DEC had plenty)
couldn't have implemented viable (in the sense of being truly
competitive) VAXen in 1996...  I think the last VAXen were shipped in
2000, as installed base always has inertia...

I'd suggest reading a fine paper by a couple of the best computer
architecture performance people around, both of whom were senior DEC
engineers:

Dileep Bhandarkar, Douglas W. Clark, "Performance from Architecture:
Comparing a RISC and a CISC with Similar Hardware Organization," ACM
SIGARCH CAN, 1991 [and a couple other places].  A copy can be found:
http://www.cs.mipt.ru/docs/comp/eng/hardware/common/comparing_risc_and_cisc_proc/m

A long, serious, competent analysis by world-class DEC people includes:
                  VAX 4000/300   MIPS M/2000
System ship date  1990           1989 [really 4Q88]
Cycle time        28ns           40ns
List price        $100K          $80K
SPEC89 integer    7.7            19.7
SPEC89 float      8.1            16.3
====

"So, while VAX may 'catch up' to *current* single-instruction-issue
RISC performance, RISC designs will push on with earlier adoption of
advanced implementation techniques, achieving still higher performance.
 The VAX architectural disadvantage might thus be viewed as a time lag
of some number of years."

The summary:
"RISC as exemplified by MIPS offers a significant processor performance
advantage over a VAX of comparable hardware organization."

And this proved to be true, even with superb CMOS VAX implementations
done in the early 1990s.
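For concreteness, the gap in the Bhandarkar/Clark numbers above can be checked with a few lines of arithmetic; the figures are from the table, but the derived ratios below are my back-of-the-envelope calculation, not the paper's:

```python
# Figures from the VAX 4000/300 vs. MIPS M/2000 table above.
vax = {"price_k": 100, "cycle_ns": 28, "specint": 7.7, "specfp": 8.1}
mips = {"price_k": 80, "cycle_ns": 40, "specint": 19.7, "specfp": 16.3}

# Raw throughput ratios (M/2000 over VAX 4000/300).
int_ratio = mips["specint"] / vax["specint"]   # ~2.6x integer
fp_ratio = mips["specfp"] / vax["specfp"]      # ~2.0x floating point

# The M/2000 wins despite a *slower* clock (40ns vs 28ns cycle),
# so its per-cycle advantage is larger still.
per_cycle_int = int_ratio * (mips["cycle_ns"] / vax["cycle_ns"])  # ~3.7x

# Price/performance: SPEC89 integer per $K of list price.
price_perf_ratio = (mips["specint"] / mips["price_k"]) / \
                   (vax["specint"] / vax["price_k"])  # ~3.2x
```

In other words, on these numbers the cheaper machine delivered roughly two and a half times the performance, which is the "significant processor performance advantage" the paper's summary refers to.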

When computer companies design successive implementations of some ISA,
they tend to accumulate more statistics to help tune designs, they keep
tricks that work, they avoid ones that don't, i.e., it is a huge
implementation advantage to have done multiple designs over many years.

The paper especially discusses VAX designs that shipped in 1986 (VAX
8700) and 1990 (VAX 4000/300), 8 and 12 years (and many
implementations) after 1978's VAX 11/780.  These were done by
experienced design teams backed by a large, profitable revenue stream.

By comparison, the MIPS R2000 CPU first shipped in 1986, the R2010 FPU
in 1987, and the R3000/R3010 in 4Q88.  The R3000 used an R2000 core
with some improvements to the cache interface, and the R3010 was a
shrunken R2000, i.e., there wasn't a lot of architectural tuning, and
of course, that was done by a small startup that did *not* have a big
revenue stream :-)

BOTTOM LINE:

DEC had every motivation in the world to keep extending the VAX as long
as possible, as it was a huge cash cow.  DEC had plenty of money,
numerous excellent designers, long experience in implementing VAXen.

BUT IT STOPPED BEING POSSIBLE TO DESIGN COMPETITIVE VAXen...



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Code density and performance?
Date: 22 Jul 2005 14:38:18 -0700
Message-ID: <1122068298.764906.206420@g44g2000cwa.googlegroups.com>

Anton Ertl wrote:
> "John Mashey" <old_systems_guy@yahoo.com> writes:

> >Sigh ... that opinion is strongly at variance with the facts in the
> >real world; the best engineers in the world (and DEC had plenty)
> >couldn't have implemented viable (in the sense of being truly
> >competitive) VAXen in 1996...
>
> How do you know?  AFAIK in the real world in which I live in the
> engineers at DEC did not try to implement a new VAX in 1996; instead,
> VAX development stopped pretty soon after the release of the Alpha.

I'm surprised any long-time participant in comp.arch would ask, since
this topic has been discussed here numerous times, including the topic
of architectural issues that made the (elegant) VAX and (fairly
elegant) 68020 more difficult to do cost-effective fast
implementations for than the (simpler) 68010 and the (inelegant) X86.

Once again, I will remind people that RISC is a label for a specific
style of ISA design, and the bunch of ISAs that more-or-less fit that
are relatively similar.  CISC was coined to describe "the rest", which
covered an immense range of different ISAs.  IBM S/360, X86, and VAX
are *different*, and an analysis comparing X86 to (a) RISC does nothing
to invalidate earlier comparison of VAX to (a) RISC.

So, how do I know?
Well, in the *real* world of CPU architecture, there were a fairly
modest number of people who actually did this for a living, and many of
them knew each other, moved among companies, went to the same
conferences, recruited each other for panels, worked on committees
(like SPEC), traded benchmarks, and talked informally in bars.  I've
said before here that I'd more than once been kidded by VAX
implementors about how easy RISC guys had it, and then they'd cite
various horrible weird special cases that had to be handled and that
got in the way of performance.  I wouldn't attribute that to anyone of
course.

But, the bottom line is that the engineers who *knew* the VAX best, and
who implemented it many times, and many of whom were *really, really
good* engineers, came to believe that they simply could not keep
implementing competitive CPUs that were VAX ISA.  Some of them were
already starting to think that in the mid-1980s, but lots more thought
so a few years later, and so did certain DEC sales managers, who knew
that if the customer wanted VMS, they won, but if the customer wanted
some UNIX, they just couldn't compete.  The VAX 9000 fiasco didn't
help.  There were some fine CMOS VAX chips done, but it was just too
hard, even with mature, good compilers.

There were various internal DEC RISC efforts, and at a certain point
[when DEC chose to use MIPS R3000s for some products], the most
irritating thing to some DEC engineers was that they kept getting
grabbed back off RISC investigations to help do more VAXen.

In the *real world* of (very competent) designers who made their living
doing VAXen, they just couldn't figure out how to keep doing it
competitively.

I will point out that many of the Alpha folks had done VAX
implementations and software and performance analysis, i.e., guys like
Witek, Sites, Dobberpuhl, Uhler, Bhandarkar, Supnik.  Anyone who has
the opinion that it was reasonable to be designing new VAXen in 1996,
expecting them to be competitive, has to believe these guys are
clueless idiots.

NOTE: that doesn't mean that I am claiming "so, they had to do Alpha,
and do it the way they did it", as there were other options.  I'm just
saying that continuing on a VAX-only path wasn't believable to their
experienced engineers, whose technical skills I hold in high regard,
often from repeated first-hand contact.




> >I'd suggest reading a fine paper by a couple of the best computer
> >architecture performance people around, both of whom were senior DEC
> >engineers:
> >
> >Dileep Bhandarkar, Douglas W. Clark, "Performance from Architecture:
> >Comparing a RISC and a CISC with Similar Hardware Organization," ACM
> >SIGARCH CAN, 1991 [and a couple other places].  A copy can be found:
> >http://www.cs.mipt.ru/docs/comp/eng/hardware/common/comparing_risc_and_cisc_proc/m
>
> Well, a few years later Dileep Bhandarkar, then employed at Intel,
> wrote a paper where he claimed (IIRC) that the performance advantage
> of RISCs had gone (which I did not take very seriously at the time);
> unfortunately I don't know which of his papers that is; I just looked
> at "RISC versus CISC: a tale of two chips" and it looks more balanced
> than what I remember.

That was another fine paper from Dileep, but the conclusion:
X86 can be made competitive with RISC
is not the same as:
VAX can be made competitive with RISC

> >BOTTOM LINE:
> >
> >DEC had every motivation in the world to keep extending the VAX as long
> >as possible, as it was a huge cash cow.  DEC had plenty of money,
> >numerous excellent designers, long experience in implementing VAXen.
> >
> >BUT IT STOPPED BEING POSSIBLE TO DESIGN COMPETITIVE VAXen...
>
> Looking at what Intel and AMD did with the 386 architecture, I am
> convinced that it is technically possible to design competitive VAXen;
> I don't see any additional challenges that the VAX poses over the 386
> that cannot be addressed with known techniques; out-of-order execution
> of micro-instructions with in-order commit seems to solve most of the
> problems that the VAX poses, and the decoding could be addressed
> either with pre-decode bits (as used in various 386 implementations),
> or with a trace cache as in the Pentium 4.

You're entitled to your opinion, which was shared by the VAX9000
implementors.

Many important senior VAX implementors disagreed.
I've posted some of the reasons why VAX was harder than X86, years ago.
Of course you can do these things, but different ISAs get different
mileage from the same techniques.

> Of course, on the political level it stopped being possible to design
> competitive VAXen, because DEC had decided to switch to Alpha, and
> thus would not finance such an effort, and of course nobody else
> would, either.

Ken Olsen loved the VAX and would have kept it forever.  Key
salespeople told him it was getting uncompetitive, and engineers told
him they couldn't fix that problem, and they'd better start doing
something else.


From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch,comp.lang.fortran
Subject: Re: Code density and performance?
Date: 28 Jul 2005 23:56:29 -0700
Message-ID: <1122620189.001082.120890@g44g2000cwa.googlegroups.com>

Tom Linden wrote:
> But again, back to the familiar theme, had VAX
> received the billions that alpha received it too would be spinning today
> at 4GHz.

Nonsense.  Not in the Real World where economics and ROI actually
matter.

1) When the VAX was designed (1975-), PL/I may have been the third most
important language, after FORTRAN and COBOL, especially if one wanted
to attack the IBM mainframe market.   Of course, the VAX itself was
created by the agonizing decision that the PDP-11 could not be extended
further upward, but the VAX certainly catered to PL/I and COBOL.

2) In 1988, the tradeoffs were different, and the fraction of new code
being written in PL/I had decreased, and DEC's priorities were
different.  I understand that the VAX->Alpha transition might be less
than optimal, especially for a PL/I vendor (like Kednos).  That's Life.

3) Read the article by Bob Supnik:
http://research.compaq.com/wrl/DECarchives/DTJ/DTJ800/axp-foreword.txt

Speaking of 1988:
"Nonetheless, senior managers and engineers saw trouble ahead.
Workstations had displaced VAX VMS from its original technical market.
Networks of personal computers were replacing timesharing. Application
investment was moving to standard, high-volume computers. Microprocessors
had surpassed the performance of traditional mid-range computers and were
closing in on mainframes. And advances in RISC technology threatened to
aggravate all of these trends. Accordingly, the Executive Committee asked
Engineering to develop a long-term strategy for keeping Digital's systems
competitive.  Engineering convened a task force to study the problem.

The task force looked at a wide range of potential solutions, from the
application of advanced pipelining techniques in VAX systems to the
deployment of a new architecture. A basic constraint was that the
proposed solution had to provide strong compatibility with current
products.  After several months of study, the team concluded that only a
new RISC architecture could meet the stated objective of long-term
competitiveness, and that only the existing VMS and UNIX environments
could meet the stated constraint of strong compatibility. Thus, the
challenge posed by the task force was to design the most competitive RISC
systems that would run the current software environments."
....
"The original study team was called the "RISCy VAX Task Force." The
advanced development work was labeled "EVAX." When the program was
approved, the Executive Committee demanded a neutral code name, hence
"Alpha."

Put another way, the EVAX was *supposed* to be an aggressive VAX
implementation, whose goals were to extend the architecture to 64-bit
and close the performance gap with RISCs, but after serious work, a
fine engineering team (my opinion) concluded the problems weren't
solvable.
Just as in 1975, they decided they *had* to make an architecture
change.

4) DEC designers certainly understood the application of OOO techniques
by 1991/1992, and knew the basics of what EV6 was going to look like,
and could have applied that to the VAX for 1996.  BUT,  I don't think
anybody serious believed it was possible for a VAX design to match an
Alpha design in performance, given similar design costs and
time-to-market.  Either the VAX would need a huge design team [way
beyond the 100 or so microprocessor designers they had in the 1980s],
or it would take a lot longer to do.

[I have a bunch of email from senior ex-DEC engineers involved in VAX
and Alpha implementations, and I wouldn't quote them without their
permission, but words like "uninformed opinion" and "revisionist
history" were prominent in descriptions of this thread.]

5) I don't have time or interest in being serious about sketching an
OOO VAX design, and of course, no one in their right mind would do that
without access to the traces from many VAX codes, and lots of CPU
cycles for simulation, since of course, serious designs can only be
done with statistics.  Handwaving has zero credibility.

On the other hand, I have had some email discussions with engineers
who've implemented VAX microprocessors, and vetted some of the ideas
about performance bottlenecks.  I know this might seem strange, but I
actually ascribe high credibility on this topic to competent people who
have implemented multiple VAX micros...

If I get time in the next week, I'll try to consolidate that, at least
to sketch the sorts of VAX architectural issues deemed to be hard....

and the main issues *aren't* the decoding of complex instructions, as
much as the later-stage execution issues that make it difficult to get
as much parallelism as you might expect in any reasonable
implementation.  The best I can do is a sketch, because I don't have
the statistics ... but the people that did made the decision.

ONE LAST TIME: it wasn't politics that ended the VAX, it was
engineering judgement by excellent (IMHO) DEC engineers, and the
economics of doing relatively low-volume chips [low-volume compared to
X86].  Nobody could afford to keep the VAX ISA competitive at DEC
volumes....



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch,comp.lang.fortran
Subject: Re: Code density and performance?
Date: 29 Jul 2005 14:38:34 -0700
Message-ID: <1122673114.705144.313630@o13g2000cwo.googlegroups.com>

glen herrmannsfeldt wrote:
> John Mashey wrote:
> (snip regarding VAX and Alpha)
>
> > Put another way, the EVAX was *supposed* to be an aggressive VAX
> > implementation, whose goals were to be extended to 64-bit, and close
> > the performance gap with RISCs, but after serious work, a fine
> > engineering team (my opinion) concluded the problems weren't solvable.
> > Just as in 1975, they decided they *had* to make an architecture
> > change.
>
> Not to disagree, (because I can tell you know this much better
> than I do), but I always thought DEC wanted more from Alpha.
>
> I always felt that they wanted to be known for something, for breaking
> open the 64 bit world when everyone else was stuck in the 32 bit world.
> To show that they could do something other companies couldn't.
>
> Maybe sort of the way Cray felt about the supercomputer market.
>
> They could have had a darn fast processor with 32 bit addresses,
> maybe even 64 bit registers, and easily beat the VAX.  It was many
> years before others decided to go for 64 bits, and even now I see little
> need for 64 bits for 98% of the people.  An x86 processor with a 36 bit
> MMU would have gone a long way to satisfying users' addressing needs.

Well, the truth is not that way...

1) Recall that DEC was famously burnt by running out of address bits
too quickly on the PDP-11.  Google comp.arch: mashey gordon bell pdp-11
gets you my BYTE article on 64-bit that references Gordon's comment.
DEC engineers could plot a straight line on a log-scale chart as well
as MIPS could, i.e., for practical DRAM sizes.  Personally, I think good
computer vendors are supposed to think ahead, and have hardware ready,
so that software can get ready, so that when customers were ready to
buy bigger memory and use it, they could ... and not, once again,
recreate the awkward workarounds that have occurred many times when
running out of address bits.
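That straight line on a log-scale chart can be sketched in a few lines.  The doubling period and starting point below are illustrative assumptions of mine, not figures from this post: if affordable memory doubles roughly every two years, each doubling consumes one more address bit.

```python
# Sketch only: extrapolating address-bit consumption, assuming (my
# assumption, for illustration) affordable memory doubles every ~2 years,
# i.e., one more address bit is needed every ~2 years.
def year_bits_exhausted(start_year, start_bits, target_bits, years_per_bit=2):
    """Year a target address width is outgrown, under the doubling assumption."""
    return start_year + (target_bits - start_bits) * years_per_bit

# E.g., a high-end 1978 machine with 8MB (23 bits) of physical memory
# outgrows 32-bit addressing around the mid-1990s under these assumptions,
# which is roughly when >4GB systems actually shipped (see item 5 below).
year = year_bits_exhausted(1978, 23, 32)  # 1996
```

The exact parameters don't matter much; the point is that anyone who could read the chart knew roughly *when* 32 bits would run out, years in advance.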

2) A lot of readers of this newsgroup don't understand how
interconnected the working computer architecture community could be.
The following just notes some of the Stanford / DEC / SGI / MIPS
relationships.

Among other combinations:
- "MIPS: A VLSI Processor Architecture", John Hennessy, Norman Jouppi,
Forest Baskett, and John Gill, Stanford Tech Report 223, June 1983.
- Forest left to run DECWRL, right across El Camino Real from Stanford,
until he became CTO @ SGI later in 1986.  Norm was over at DECWRL as
well.
- Hennessy, of course, took sabbatical in 1984-1985 to co-found MIPS.
- DECWRL had many RISC fans; the first MIPS presentation there (that I
was involved in, anyway) was April 1986, shortly after R2000 was
announced.
- DEC, of course, had various on-again, off-again RISC projects in the
1980s, with none getting critical mass.  I wouldn't attempt to describe
the politics, even what I know, except the phrase "VAX uber alles" was
heard now and then :-)

3) Most people know that DEC did a deal with MIPS in 1988.  As it
happens, I initiated the process that led to that deal, via a chance
meeting with an old Bell Labs colleague [Bob Rodriguez], then at DEC in
NH, at a Uniforum in early 1988.  I'd given Bob a MIPS Performance
Brief.  That evening, at the conference beer bust,  Bob said "these
look fast!" and evinced a wish to "port Ultrix to this and wheel it in
and show Ken Olsen that it doesn't take 3 years and hundreds of
people."

I talked our Boston sales office into loaning Bob two MIPS systems,
noting that the likelihood of DEC ever buying an outside CPU was low,
but if there was somebody who could do a quick Ultrix port, it was Bob.
After a month or so of paperwork, Bob and a couple of friends got
Ultrix working pretty well in less than 3 weeks [late April/early May].
That incited the DEC Palo Alto workstation people, who were having a
rough time competing in the workstation business with VAX chips, except
when VMS was required.  [BTW, I knew these folks also, if for no other
reason than MIPS participation in the Hamilton Group (of non-Sun UNIX
folks), so named because the first meeting was held at DEC on Hamilton
Avenue in Palo Alto.]

A frenzied sequence of meetings then ensued, as DEC Palo Alto, and
various Ultrix folks clamored to use MIPS chips so they could build
competitive workstations *soon*.

4) In any case, there was a meeting in the Summer in which a team of
DEC's most senior engineers [in VLSI, systems, compilers, and OS] was
gathered together by Ken Olsen with a few days' notice and sent to
Sunnyvale to do a solid day of technical due diligence.  I'm pretty
sure we knew by then that the R4000 should be 64-bit, and discussed
that with DEC (and SGI, and a few other close customers, but DEC and
SGI were the ones who were especially desirous of 64-bit).  If it
wasn't at that meeting [it was a lonnnngg day], it was shortly
thereafter.

The DEC engineers were *quite* competent, and asked lots of tough
questions [although the OS part was trivial, given that Ultrix was
already running and benchmarking well. :-)] The CPU designers &
compiler people, of course, were from one of the premier places for
doing this in the world, had been building one of the most successful
computer lines in history, and DEC was at the height of its
profitability.  Some had been advocating RISC for years, but were
always getting grabbed away, so the projects were on-again, off-again,
which meant, among other things, that there was little progress on
software, and of course, DEC was investing big-time in ECL, much to the
Hudson, MA CMOS guys' consternation.

Given all that, they got a mission to come see if DEC should use MIPS
chips ... and it would have been really easy to have done a political
NIH hatchet job, but they didn't.  They went back the next day, wrote
their report, and said OK, although I'm sure it was really, really
painful for some of them.  I RESPECT THAT A LOT, which is why I thought
some of the postings in this thread were simply nonsense...

At that point, MIPS was a few hundred people in two little buildings in
Sunnyvale.  At the end of the day, we gave them a tour of our computer
room.
Some of the DEC Hudson CMOS engineers were badly shocked.  On the way
out the door, I happened to overhear one say to another, in a stunned
voice: "This little startup has more compute power for chip simulation
than we do at Hudson!!  We're DEC, how can that be?!"  Answer (in voice
cold enough to freeze air) "Yes, that's why we've got a bad problem and
we'd better fix it."

This was Summer of 1988 ... now look again at:
http://research.compaq.com/wrl/DECarchives/DTJ/DTJ800/axp-foreword.txt

Of course, by the time of the 64-bit working group meetings in
mid-1992, both SGI/MIPS and DEC had chosen the same flavor of C model.
It was well understood at that point (in that group) that HP, Sun, and
IBM were working on 64-bit designs.

5) SGI shipped the R4000 in the Crimson in 1Q92, albeit running a
32-bit OS.  DEC shipped production Alpha systems ~4Q92, running 64-bit
OSs.  Of course they marketed 64-bit hard...  I would too.

Around 4Q94, both DEC and SGI first shipped SMPs with *both* of the
following:
- 64-bit OSs
- plausibly-purchasable memories above 4GB, i.e., where >4GB user
address space was starting to be needed.

I knew customers who bought Power Challenges, got 4GB+ of memory, and
immediately recompiled code to use it all, in one (parallelized)
program.  [I.e., that's a one-line change in some big Computational
Fluid Dynamics code :-), changing some array size.] During 1989-1992,
there was plenty of discussion among MIPS, SGI, and DEC people about
64-bit programming models, for example.

6) SUMMARY

DEC couldn't figure out how to make the VAX competitively fast and
64-bit, and they could look at DRAM and Moore's Law charts like the
rest of us.  Privately, they (and we) were a little surprised that
other RISC vendors were waiting a generation.  DEC certainly knew
exactly what MIPS was doing, and certainly knew SGI intended to ship
64-bit OSs, if for no other reason than the amount of back-door
communication amongst like-minded software people, and they certainly
had a good idea of what everybody else was up to.



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Code density and performance?
Date: 29 Jul 2005 20:01:20 -0700
Message-ID: <1122692480.749175.129540@g47g2000cwa.googlegroups.com>

Eric P. wrote:

> The other problems might be:
>
> - Strong memory ordering prevents any access reordering
> - Any number of idiosyncrasies wrt the order that data values
>   are read or written vs the order that auto increment/decrement
>   deferred operation are performed will inject pipeline stalls due
>   to potential memory aliases that probably never actually happen.
>   This combines with strong ordering to basically serialize everything.
> - Having program counter in a general registers that can be
>   manipulated by auto increment addressing modes probably
>   causes many pipeline problems later to feed value forward
> - 16 integer registers with many having predefined functions
>   is too small and causes lots of register spills.
> - Having a single integer and float register set means extra
>   time moving float values over to float registers and back.
> - Small combined register set means spilling float values a lot
> - Small page size requires lots of TLB entries, which should be
>   fully assoc. for performance, which means big TLB chip real estate.
> - The worst case of requiring 46 TLB entries to be resident
>   to ensure the ADDP6 instruction can complete. Not really a
>   performance limit so much as a pain in the ass to design around.

I think you have the general idea; more later, when I get time.



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Code density and performance? [really Part 1 of 3: Micro economics 
	101]
Date: 30 Jul 2005 16:40:53 -0700
Message-ID: <1122766853.815954.195560@g49g2000cwa.googlegroups.com>

John Mashey wrote:

> I think you have the general idea; more later, when I get time.

PART 1 - Microprocessor economics 101 (this post)
PART 2 - Applying all this to DEC, NVAX, Alpha, competition (this post)
PART 3 - Why it seems difficult to make an OOO VAX competitive (later)

PART 1 - Microprocessor economics 101 (simplified)

This thread is filled with *fantasies* about cheap/fast/timely VAXen,
because the issue isn't (just) what's technically feasible, it's what
you can do that's cost-competitive and time-to-market competitive.  The
following is over-simplified, of course, but hopefully it will bring
some reality to this discussion.

DESIGN COST
Suppose it costs $50M to design a chip and get it into production.  The
amortized cost/chip for engineering versus total unit volume is:
Cost/chip  Volume
$1,000,000         50
  $100,000        500
   $10,000      5,000
    $1,000     50,000
      $100    500,000
       $10  5,000,000
        $1 50,000,000

Alternatively, if your volumes happen to be 5,000,000, you could spend
$500M on development, and still only have an engineering cost/chip of
$100.
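The table's arithmetic, as a minimal sketch, is just the engineering (NRE) cost divided by total unit volume:

```python
# Amortized engineering (NRE) cost carried by each chip sold.
def nre_per_chip(design_cost, volume):
    return design_cost / volume

# The table's endpoints, for a $50M design:
low_volume = nre_per_chip(50e6, 50)           # $1,000,000 per chip
high_volume = nre_per_chip(50e6, 50_000_000)  # $1 per chip

# The alternative case in the text: 10x the design budget
# at x86-like volume still costs only $100 per chip.
big_budget = nre_per_chip(500e6, 5_000_000)
```

Trivial arithmetic, but it is the whole story: at minicomputer volumes the design cost per chip is crushing, and at x86 volumes it nearly vanishes.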

INTEL AND AMD CAN MAKE FAST X86S BECAUSE THEY HAVE VOLUME.

Anne&Lynn Wheelers' post a while ago pointed at VAX unit volumes, which
as of 1987 had MicroVAX II (a very successful product) having shipped
65,000 units over 1985-1987.

If the unit volumes are inherently lower, you have to get profit
based on *system* value that yields unusually high margins, so that the
system profit essentially subsidizes the use of such chips. This works
for a systems vendor when the market and customer switching costs allow
high margins, i.e., IBM mainframes to this day, and VAXen in the 1980s.
 [The first MIPS were designed using 2 VAX 11/780s, plus (later) an
8600, plus some Apollos ... and the VAXen  seemed expensive, but they
were what we needed, so we paid.]

SYSTEMS COMPANIES
Mainframes and minicomputer systems companies thrived when the design
style was to integrate a large number of lower-density components, with
serious value-add in the design of CPUs from such components.  [Look at
a VAX 11/780 set of boards.]

As microprocessors came in, and became usefully competitive, most such
companies struggled very hard with the change in economics, and the
internal struggles to maintain upward compatibility.  Most systems
companies had real problems with this.  There were pervasive internal
wars among different strategies, especially in multi-division
companies:

A "We can do this ourselves, especially with new ECL (or even GaAs)."
   [Supercomputer and mainframe, and minicomputers chasing mainframes]

B "We can build CMOS micros that are almost as fast as ECL, and much cheaper,
   and enough better than commodity in feature, function, and performance."
   [IBM, DEC, HP ... and later Sun]

C "We should put our money in system design and software, and buy micros."
   [Apollo, Convergent, Sun, SGI, Sequent, and various divisions of older
   companies].

Of course, later there came:
D "We should buy micros, add value in systems design, and use Microsoft"
   Hence, IBM PC division, Compaq, etc.
and yet later:
E  "We should minimize engineering costs and focus on distribution"
   I.e., Dell.

Many companies did one internal design too many.
Most of the minicomputer vendors went out of business.

IBM, DEC, and HP were among the few that actually had the expertise to
do CMOS micro technology, albeit not without internal wars.  [I was
involved in dozens of these wars.  One of the most amusing was when IBM
friends asked me to come and participate in a mostly-IBM-internal
conference at T. J. Watson, ~1992.  What they *really* wanted, it
turned out, was for somebody outside IBM politics (i.e., "safe") to
wave a red flag in front of the ECL mainframe folks, by showing a
working 100MHz 64-bit R4000.]

SEMICONDUCTOR COMPANIES, SYSTEMS COMPANIES, AND FABS
In the 1980s, if you were a semiconductor company, you owned one or
more fabs, which were expensive at the time, but nothing like they are
now.  You designed chips that either had huge volumes, or high ASPs, or
best, both!

Volume experience improves yield of chips/wafer, and of course, volume
amortizes not only the fab cost, but the design cost.  If you are a
systems vendor, and only one fab can make your chips, you have the
awkward problem that if the fab isn't full, you have a lot of capital
tied up, and if it is full, you are shipment-constrained.

When you built a fab, first you ran high-value wafers, but even after
the fab had aged, and was no longer state-of-the-art, quite often you
could run parts that didn't need the most advanced technology.

In the 1980s, if you wanted to have access to leading-edge VLSI
technology for your own designs, EITHER:

- You were a semiconductor company, i.e., the era of "Only real men own
fabs" (sorry, sexist, but that was the quote of the day).  OR

- You were a systems company big enough to afford the fab treadmill.
 IBM [which always had great process technology].
 DEC usually had 1-2 CMOS fabs [more on that later].
 HP had at least 1, but at least sometimes there was conflict between the
 priorities for high-speed CPU design and high-volume lower-cost parts.
 Fujitsu, NEC, etc were both chip and systems companies.
 In any case, you must carefully amortize the FAB costs.

- You were a fabless systems company who could convince a chip company
to partner with you, where your designs were built in their fabs, and
either you had enough volume alone, or (better) they could sell the
chips to a large audience of others.  [Example: Sun & TI]   OR

- You were a small systems/chip company [MIPS] that was convincing
various other systems companies and embedded designers to use the
chips, and thus able to convince a few chip partners to do long-term
deals to make the chips, and sell most of them to other companies, and
as desired, have licenses to make variations of their own from the
designs.  Motivations might be that a chip salesperson could get in the
door with CPUs, and be able to sell many other parts, like SRAMs
[motivation for IDT/MIPS and Cypress/SPARC], or be able to do ASIC
versions [motivation for LSI Logic].

In MIPS' case, for the first few years, the accessible partners were
small/medium chip vendors, and it was only in 1988 that we were able to
(almost) do a deal with Motorola and were able to do ones with NEC and
Toshiba, i.e., high-volume vendors with multiple fabs.

Now, you might say, why wouldn't a company like Sun just go to a
foundry with its designs, or in MIPS' case, why wouldn't it just be a
normal fabless semiconductor vendor of which there are many?
A: Accessible foundries, geared to producing outside designs, with
state-of-the-art fabs ... didn't really exist.  TSMC was founded in
1987, and it took a long time for it to grow.

ON HAVING A FAB, OR NOT
If you own the process, you can diddle it somewhat to fit what you're
building.

If your engineers are close with the fab process people, and you have
wizard circuit designers, you can do all sorts of things to get higher
clock rates.  If you aren't, you use the factory design rules ... or
maybe you can do a little negotiation with them.

In any case, there is a tradeoff between owning a fab ($$$) and getting
higher clock rate, and not owning the fab, and being less able to tune
designs.

SYSTEMS COMPANIES THAT DESIGN CHIPS, SOLD TO OTHERS OR NOT

There is a distinct difference in approach between the extremes of:
- We're designing these chips for our systems, running our OSes and
compilers.
  We might sell chips to a few close partners for strategic reasons.
VERSUS
- We expect to use these chips in our systems, but we expect that large
numbers will be used in other systems, with other software.  We will
include features that will never be used in our own systems.  We will
invest in design documentation, appropriate technical support, sampling
support, debugging support, software licensing, etc, etc to enable
others to be successful.

IBM still has fabs, but of course IBM Microelectronics does foundry
work for others (to amortize the fab cost).  POWER was really geared to
RS/6000; PPC made various changes to allow wider use outside, and IBM
really sought this (volume).

Sun never had a fab, did a lot of industry campaigning to spread SPARC,
but in practice, outside of a few close partners, most of the $ sales
of SPARC chips went to Sun.

HP has sold PA-RISC chips to a few close partners, but in general,
never was set up in the business of selling microprocessors.

MIPS started to do chips, but had enough work to do on systems that it
needed to build systems (and, if you understand the volume issues
above, needed systems revenue, since in the early days, it couldn't
possibly get enough chip volume to make money).  I.e., the systems
business can work at low volumes, whereas the chip business doesn't.

DEC, of course, after its original business (modules) was much more set
up as a systems vendor, and never really had a chip-supplier mindset,
although Alpha, of course, was forced to try to do that (volume,
again).  Somebody suggested they should have been selling VAX chips, and
that may be so, but it is really hard to make that sort of thing
happen, as it requires serious investment to enable other customers to
be successful, and it requires the right mindset, and it's really hard
to make that work in a big systems company.
(I'm DEC and I sell you VAX Chips.  What OS do you run? Oh, VMS; OK
we'll license you that.  What sort of systems do you want to build?
You want to build lower-cost minicomputers to sell to the VAX/VMS
installed-base? Oh.... actually, it looks like we're out of VAX chips
this quarter, sorry.)
(Or one might recall that Sun talked Solbourne into using SPARC, and
Solbourne designed their own CPUs and built SMPs.  If a Sun account
wanted an SMP, and somebody like SGI was knocking at the door, Sun
would point at Solbourne (to keep SPARC), but if Solbourne was
infringing on a Sun sale, it was not so friendly - I once got a copy of
a Sun memo to the salesforce about how to clobber Solbourne.)

Anyway, a *big* systems vendor, to be motivated to the bother of
successfully selling its otherwise proprietary CPU chips, has to find
other, essentially non-competitive users of them, who can be
successful.

The most successful example of that is the IBM PPC -> Apple case.

Probably the most interesting Alpha case was its use in the Cray T3
systems, fine supercomputers, but not exactly high-volume.

ON DESIGNS AND ECONOMICS

People probably know the old project management adage: "features, cost,
schedule: you pick two, I'll tell you the other."

In CPU design, you could, these days, use:

- FPGA
- Structured ASIC
- ASIC, full synthesized logic
- Custom, with some synthesized and some custom logic/layout design,
  and maybe with some special circuit design.

Better tools help ... but they're expensive, especially because people
pushing the state of the art tend to need to build some of their own.

This is in increasing order of design cost.
- An FPGA will be big, use more power, and run at lower clock rate.
- The more custom a chip is, the faster it can go, but it either
takes more people, or longer to design, or (usually) both.

Companies like Intel have often produced an original design with a lot
of synthesized logic, with one design team, and then another team right
behind them, to use the same logic, but tune the critical paths for
higher clock rate, shrink the die with more custom design, work on
yield improvements, etc.

Put another way, if you have enough volume, and good ASPs, you can
afford to spend a lot of engineering effort to tune designs, even to
overcome ISA problems.

PART 2 - Applying all this to DEC, NVAX, Alpha, Competition

DEC (at least some people) understood the importance of VLSI CMOS. DEC
had excellent CPU and systems designers, software people, and invested
in fabs (for better or worse - some of us could never quite figure out
how they could afford the fabs in the 1990s).  They had some
super-wizard circuit designers, who even impressed some of the best
circuit designers I've known.

However, in the 1980s, they never had more than about 100 VLSI CPU
designers, which in practice meant that at any one time, they could
realistically be doing one brand-new design, and one {shrink,
variation}.  They of course were doing the ECL VAX9000, but that was a
whole different organization.

The problem that DEC faced was that their VAX cash cow was under
attack, and they simply couldn't figure out how to keep the VAX
competitive, first in the technical markets [versus PA RISC, SPARC, and
MIPS], and then in commercial [PA RISC].  I think Supnik's article
described this reasonably well.

http://research.compaq.com/wrl/DECarchives/DTJ/DTJ800/axp-foreword.txt


As a good head-to-head comparison, NVAX and the Alpha 21064 were built:
- in same process
- about the same time
- with the same design tools
- with similar-sized teams ... although the NVAX team had the advantage
of having implemented pipelined CMOS VAXen before, with a long history of
diagnostics, test programs, statistics on behavior, etc, whereas the
Alpha team didn't already have as much of that.

The ISA difference between VAX and Alpha was such that the NVAX team
had to spend a lot more effort on micro-architecture, whereas the Alpha
team could spend that effort on aggressive implementation, such that
the MHz difference was something like 80-90MHz for NVAX/NVAX+, and up
to 200MHz for 21064.  Around 1992, modulo maybe a year difference in
software, that gave numbers like:

SGI        DEC           DEC
Crimson    VAX7000/610   DEC7000/610
MIPS       VAX           Alpha
R4000      NVAX          21064
1.3M       1.3M          1.68M          # transistors
184 mm^2   237 mm^2      234 mm^2       # size
1.0 micron .75 micron    .75 micron     # process
2-metal    3-metal       3-metal        # metals
1MB L2     4MB L2        4MB L2         # L2
100MHz     90MHz         182MHz         # clock rate

61         34             95            # SPECint89
78         58            244            # SPECfp89

Now, we all know SPECint/SPECfp aren't everything, and the exact
numbers don't matter much, but that's still a big difference.  I threw
in the MIPS chip to illustrate that even a well-designed NVAX was
outperformed by a single-issue chip that was 3/4 the size, in a
substantially less dense technology [1.0 micron versus .75, and 2-metal
versus 3], required to meet generic design rules across multiple fab
vendors, was 64-bit, and still had a higher clock rate.

None of this was due to incompetence on the NVAX team; that was a
*fine*, successful design to be proud of.  But once again, go back to
the economics.

It's a classical move to try to take market share and build volume via
all-out performance, selling first to those with the most portable code
and willing to pay for performance.  It's a lot harder to do that with
an NVAX that was 60-80% of the performance (on these, anyway) of
something like an R4000, that, if not a commodity, was a lot closer to
that.

A bit later, I'll post Part 3, my analysis of why I think it would have
been hard to build a *competitive* OOO VAX.

In the real world, it wasn't enough to build an OOO VAX, it had to be
competitive on time-to-market, performance, and cost. This post has
covered the economic issues, the next will discuss some of the ISA
issues.

But, as a teaser, I note that there are some ISA attributes of the VAX:
a) Not found in RISCs
b) Not found in X86
c) Some of which are found in S/360 family, but less often

Some of them are the same ones that make other aggressive
implementations hard, but some *really* cause trouble for OOO
implementations, and in particular, make it very hard to get as much
mileage from the X86 convert->micro-op style of designs.



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Part 1 of 3: Micro economics 101 (was: Code density ...)
Date: 31 Jul 2005 17:48:37 -0700
Message-ID: <1122857317.014506.193890@g14g2000cwa.googlegroups.com>

Anton Ertl wrote:
> "John Mashey" <old_systems_guy@yahoo.com> writes:
> >This thread is filled with *fantasies* about cheap/fast/timely VAXen,
> >because the issue isn't (just) what's technically feasible, it's what
> >you can do that's cost-competitive and time-to-market competitive.
>
> Well, my impression was that you made a claim about technical
> feasibility.

Once again, I don't know why a long-time participant in comp.arch would
think that.  I've posted on this topic off and on for years, including
the old April 18, 1991 "RISC vs CISC (very long)" post that's been
referenced numerous times. (Google: mashey risc cisc very long)  It
said, among other things:

"General comment: this may sound weird, but in the long term, it might
be easier to deal with a really complicated bunch of instruction
formats, than with a complex set of addressing modes, because at least
the former is more amenable to pre-decoding into a cache of
decoded instructions that can be pipelined reasonably, whereas the pipeline
on the latter can get very tricky (examples to follow).  This can lead to
the funny effect that a relatively "clean", orthogonal architecture may
actually be harder to make run fast than one that is less clean."

In context, this was ~ "it might be easier to deal with X86 than VAX"
Decoded Instruction Cache ~ Intel "trace" cache

And in March 8, 1993 (Google: mashey vax complex addressing modes), I
said: "Urk, maybe I didn't say this right:
        a) Decoding complexity.
        b) Execution complexity, especially in more aggressive (more parallel)
                designs.
I'm generally much more worried about the latter than the former, since there
are reasonable things to do about the former (i.e., decoded instruction caches,
which at least help some)."

If I've ever posted anything that seemed to imply it was impossible to
build an OOO VAX, I apologize for the ambiguity, but I think I've
consistently expressed this as "difficult" or "complex", or "needs a
lot of gates", or "likely to incur extra gate delays" NOT as
"impossible".  I've various times discussed the 360/91 or the VAX 9000
as things that went fast for their clock rate, but at the cost of high
cost and complexity. After all, the key issues around OOO were
published in Jan 1967 (Anderson, Sparacio, Tomasulo on 360/91).  An
aphorism of the mid-1980s amongst CPU designers was "Sometime we'll get
enough gates to catch up with the 360/91."

In the real world, engineers have to juggle design/verification cost,
product cost, and time-to-market; there are plenty of things that are
"technically feasible" but have no ROI...



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Code density and performance? [really Part 2b of 3: Micro 
	economics 101]
Date: 3 Aug 2005 23:16:04 -0700
Message-ID: <1123136164.540256.264630@g47g2000cwa.googlegroups.com>

From side questions, here's an update to Part 2, and you will
definitely want to use Fixed Font...

Here's a better one, with a few more CPUs to give context, and you may
want to print.


TABLE 1 - MIPS, VAX, Alpha, Intel

CODE   A        B         C        D        E
SHIP   1Q92     3Q92      4Q92     3Q93     3Q95
CO     SGI      DEC       DEC      SGI      DEC
PROD   Crimson  7000/610  7000/610 Chal XL  600 5/266
ARCH   MIPS     VAX       Alpha    MIPS     Alpha
CPU    R4000    NVAX+     21064    R4400    21164
XSTRS  1.3M     1.3M      1.68M    2.3M     9.7M
mm^2   184      237       234      184      209
Micron 1.0      0.75      0.75     .8       0.35
Metals 2        3         3        2        4
L1     8KI+8KD  2KI+8KD   8K+8K    16K+16K  8KI+8KD
L2     1MB      4MB       4MB      1MB      96K
L3
MHz    100      90        182      150      266
Type   1P       1P        2SS      1P       4SS
Bus    64       128       128      64       128

SPEC89
Issue  Jun92    Sep92     Mar93    -        -
Si89    61       34        95      -        -
Sfp89   78*      58*      244*     -        -

*All of these have the matrix300-raised numbers

SPEC92
Issue  June92    June92*  Mar93    Jun93    Sep95
Si92    58       34E**     95       88      289
Sfp92   62       46E**    182       97      405

**My estimate, noting that MIPS & Alpha derated by .75-.8
going from SPECfp89 to SPECfp92.  Take with many grains of salt.
I couldn't easily find any SPEC92 numbers for VAX.
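
As a rough sketch of that estimate (the 0.8 factor here is just one
point within the stated .75-.8 range, not an exact figure):

```python
# Sketch of the SPECfp89 -> SPECfp92 estimate above: scale by the
# .75-.8 derate observed for MIPS and Alpha.  The 0.8 used here is one
# guess in that range; take with many grains of salt.
def estimate_specfp92(specfp89, derate=0.8):
    return round(specfp89 * derate)

nvax_fp89 = 58                       # NVAX+ SPECfp89 from Table 1
print(estimate_specfp92(nvax_fp89))  # 46, the "46E" entry in Table 1
```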

CODE   F       G                   H         I         J
SHIP   1991    3Q92                3Q93      2Q94      2Q96
CO     Intel   CPQ                 Intel     Intel     Intel
PROD   Xpress  Deskpro             Xpress    Xpress    Alder
ARCH   IA-32   IA-32               IA-32     IA-32     IA32
CPU    486DX   486DX2              Pentium   P54C      PentiumPro
XSTRS  1.2M    1.2M+               3.1M      3.2M      5.5M
mm^2   81      ?                   295?      147       196
Micron 0.8     ?                   0.8       0.6       0.35
Metals 2?      2?                  3         4         4
L1     8K      8K                  8KI+8KD   8KI+8KD   8KI+8KD
L2     256K    256K                256K      512K      256K
MHz    50      66                  66        100       200
Type   1P      1P                  2SS       2SS       3-OOO
Bus    32      32                  64        64        64


SPEC92
Issue  Mar92    June92             Jun93     Jun94     Dec95
Si92    30      32                 65        100       366
Sfp92   14      16                 60         75       283

Type: 1P: 1-issue, pipelined
      2SS: 2-issue, superscalar
      4SS: 4-issue superscalar
      3-OOO: 3-issue, out-of-order

=================================
What I've done is:
Show the SPEC89 numbers for VAXen, because I can't find SPEC92 numbers.
Then I've done a gross estimate of the equivalent SPEC92, so that I can
get all of the machines on the same scale, noting of course that
benchmarks degrade over time due to compiler-cracking.  I used the
highest NVAX numbers I have handy, from my old paper SPEC Newsletters.

I'm ignoring costs, below, and the dates in the table must be taken
with lots of salt, for numerous reasons, and as always SPECint and
SPECfp aren't everything [spoken as an old OS person].

NVAX shipped in 4Q91, NVAX+ in 1992.  The first R4000s shipped in
systems in 1Q92, so these are contemporaneous, as they are with 486DX
and 486DX2.

The NVAX+ is about 75-80% of a MIPS R4000 on integer and FP here,
despite using a better process [.75 micron, 3-metal, versus 1.0 micron,
2-metal], a larger die [237 versus 184], and being 32-bit rather than
64-bit.
[It is somewhat an accident of history and business arrangements that
the R4000 was done in 2-metal, but that forced it to be superpipelined,
1-issue, rather than the original plan of 2-issue superscalar.  As a
result, the R4000/R4400 often had lower SPECfp numbers than the
contemporaneous HP and IBM RISCs, although  for compensation it
sometimes had better integer performance, and sometimes could afford
bigger L2 caches, because the R4000/R4400 themselves were relatively
small.]

In any case, on SPEC FP performance, in late 1992, the fastest NVAX+
was outperformed by IBM, HP, MIPS, Sun (maybe), and Alpha (by 3-4X).
The NVAX+ was 3X faster than a 66MHz 486DX2.

In SPECint, in late 1992, the NVAX was outperformed, generally by the
RISCs ... but worse, there wasn't much daylight between it and a 66MHZ
486DX2, or even a 50MHz 486DX.

The real problem of course (not just for the VAX, but for everybody),
was the bottom right corner of the Table.  Intel had the resources and
volume to "pipeline" major design teams [Santa Clara & Portland] plus
variants and shrinks, and there was an incredible proliferation in
these years.

It's worth comparing [B] NVAX+ with [H] Pentium.

Suppose one were a VAX customer in 1992:
If you were using VAX/VMS:
- commercial: committed to VMS for a long time.
- technical (FP-part): RISC competitors keep coming by with their
numbers
If you were using Ultrix:
- FP: serious pressure from RISC competitors
- Integer: serious pressure already from RISC competitors, and
  horrors! Intel getting close to parity on performance

I'm not going to comment on DEC's handling of Alpha, fabs,
announcements, alternate strategy variations.  But this part should
make clear that there was real pressure on the VAX architecture, from
above (in terms of performance) and from below (Intel coming up).

One might imagine, that had there been no Alpha, and everybody at
Hudson had kept working on VAXen, that they could have gotten:
[X] a 2SS superscalar [like Pentium], in 1994, perhaps
OR
[Y] some OOO CPU [like Pentium Pro], in 1996, perhaps

as well as doing the required shrinks and variants.

From the resources I've heard described, I find it difficult to believe
they could have done both [X] and [Y]  (and note, world-class design
teams don't grow on trees).  I could be convinced otherwise, but (as
one of the NVAX designers says), only by "members of the NVAX and Alpha
design teams, plus Joel Emer" :-), i.e., well-informed people.

In Part 3, I'll sketch some of the tough issues of implementing the
VAX, as best I can, and in particular, note the ISA features that might
make things harder for VAX than for X86, even for 2SS, 4SS, or OOO
designs.  In particular, what this means is that you can implement a
type of microarchitecture, but it gains you more or less performance
dependent on the ISA and the rest of the microarchitecture.  For
instance, the NVAX design at one point was going to decode 2
operands/cycle, and it was found to add much complexity and gain
only 2%.



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: PART 3. Why it seems difficult to make an OOO VAX competitive (really 
	long)
Date: 7 Aug 2005 18:48:10 -0700
Message-ID: <1123465690.038575.6800@z14g2000cwz.googlegroups.com>

(You will want Fixed Font).
The earlier parts were:
- (posted Jul 30)
PART 1 - Microprocessor economics 101
PART 2 - Applying all this to DEC, NVAX, Alpha, competition

- (posted Aug 3)
Really part  2b  (updated Table 1 and added more discussion)

FUNDAMENTAL PROBLEM
Certain VAX ISA features complexify high-performance parallel
implementations, compared to high-performance RISCs, but also to IA-32.

The key issue is highlighted by Hennessy & Patterson [1, E-21]:
"The VAX is so tied to microcode we predict it will be impossible to
build the full VAX instruction set without microcode."

Unsaid, presumably because it was taken for granted, is:

For any higher-performance, more parallel micro-architecture, designers
try to reduce the need for microcode (ideally to zero!).  Some kinds of
microcoded instructions make it very difficult to decouple:
A)	Instruction fetch, decode, and branching
B)	Memory accesses
C)	Integer, FP, and other operations that act on registers

Instead, they tend to make A&B, or A&C, or A,B&C have to run more in
lockstep.

It is hard to achieve much Instruction Level Parallelism (ILP) in a
simple microcoded implementation, so in fact, implementations have
evolved to do more prefetch, sometimes predecode, branch prediction,
in-order superscalar issue with multiple function units, decoupled
memory accesses, etc, etc. ISAs often had simple microcoded
implementations [360/30, VAX-11/780, Intel 8086] and then evolved to
allow more pipelining.  Current OOO CPUs go all-out to decouple A), B),
and C), to improve ILP actually achieved, at the expense of complex
designs, die space, and power usage.

Some ISAs are more suitable for aggressive implementations, and some
make it harder.  The canonical early comparison was the CDC 6600 versus
the IBM 360/91; the even stronger later one would be Alpha versus VAX.
A widespread current belief is that the complexity, die cost, and
propensity for long wires of high-end OOOs may have reached diminishing
returns, compared to multi-core designs with simpler cores, where the
high-speed signals can be kept in compact blocks on-chip.

IA-32 has baroque, inelegant instruction encoding, but once decoded,
most frequently-used instructions can be converted to a small number
(typically 1-4) micro-ops that are RISC-like in their semantic
complexity, and certainly don't need typical microcode.  As noted
earlier in this sequence, the IA-32 volumes can pay for heroic design
efforts.

The VAX ISA is orthogonal, general, elegant, and easier to understand,
but that generality also makes decoding difficult when trying to
do several operands in parallel.  Worse, numerous cases are possible
that tend to lockstep together 2 or 3 of A), B), or C), lowering ILP,
or requiring hardware designs that tend to slow clock rate or create
difficult chip layouts.  Even worse, a few of the cases are even common
in some or many workloads, not just potential.

As one VAX implementor wrote me: "it doesn't take much of a
percentage of micro-coded instructions to kill the benefits of the
micro-ops."

That is a *crucial* observation, but of course, the people who really
know the numbers tend to be the implementers...

It is interesting to note that the same things that made VAX pipelining
hard, and inhibited the use of a 2-issue superscalar, also make OOO
hard.  Some problems are easier to solve, but others just move around
and manifest themselves in different ways.
-	decode complexity
-	indirect addressing
-	multiple side-effects
-	some very complex instructions
-	subroutine call mechanism

Following is a more detailed analysis, showing REFERENCES first (easier
to read on Web), briefly describing OOO, and then going through a
sample of troublesome VAX features, and comparing them to IA-32, and
sometimes S/360.  A CONCLUSION wraps all this together with DEC's
CMOS roadmap in the early 1990s to show the difficulty of keeping the
VAX competitive.

==========================
CAVEAT: I've never helped design VAXen, although I used them
off-and-on between 1979 and 1986.  I have participated (modestly) in
several OOO designs, the MIPS R10000 and successors, plus one that
never shipped.  I've had lots of informal discussions over the years
with VAX implementors, and I have reviewed some of the ideas below with
at least one of them.  I don't have the statistics that a
professional needs to really do an OOO VAX design, so at best I can
sketch some of the problems.  With enough gates, you can do almost
anything ... but complexity incurs design cost, design time, and often
chip layout problems and gate delays.  Unlike software on
general-purpose systems, where adding a bit of code rarely bothers
much, the blocks of a chip have to fit on a 2-dimensional layout, and
their physical relationships *matter*.  Sometimes minor-seeming
differences cause real problems.
==========================

REFERENCES (placed here for convenience):

Assumed reading:
[0] Hennessy & Patterson, Computer Architecture, a Quantitative
Approach, 3rd Edition, 2003.  Chapters 2, 5, and especially 3, plus
Appendix A.

Brief explanation, and detailed reference of the VAX:
[1] Hennessy & Patterson, "Another alternative to RISC: the VAX
Architecture", www.mkp.com/CA3, Appendix E.
[2] Digital Equipment, VAX Architecture Handbook, 1981.

Superb VAX performance analyses of the early 1980s by DEC people;
ironically, invaluable to RISC designers [at MIPS, used to settle
arguments]:
[3] Clark and Levy, "Measurement and analysis of instruction use in the
VAX-11/780", 1982. ACM SIGARCH CAN, 10, no 3 (April 1982), 9-17.
[4] Wiecek, "A case study of VAX-11 instruction set usage for compiler
execution", ASPLOS 1982, 177-184.
[5] Emer and Clark, "A characterization of processor performance in the
VAX-11/780", Proc. ISCA, 1984, 301-310.
[6] Clark and Emer, "Performance of the VAX-11/780 Translation Buffer:
Simulation and Measurement", ACM TOCS 3, No. 1, (Feb 1985), 31-62.

Important analysis of ILP, discussed at length in [0].
[7] Wall, Limits of Instruction-Level Parallelism, DECWRL REPORT 93/6.

Another superb performance analysis by two of the best:
[8] Bhandarkar and Clark, "Performance from Architecture: Comparing a
RISC and a CISC with similar hardware organization", 1991.

The NVAX:
[9] Uhler, Bernstein, Biro, Brown, Edmondson, Pickholtz, and Stamm,
"The NVAX and NVAX+ High-performance VAX Microprocessors".
http://research.compaq.com/wrl/DECarchives/DTJ/DTJ701/DTJ701SC.TXT

A good intro to the IA-32.
[10] Hennessy and Patterson, "An alternative to RISC: The Intel
80x86", www.mkp.com/CA3, Appendix D.

4-issue superscalar in-order Alpha versus OOO PentiumPro
[11] Bhandarkar, "RISC versus CISC: A Tale of Two Chips", ACM
SIGARCH CAN 25, Issue 1 (March 1997), 1-12.

IBM ES/9000 (1992) was superscalar OOO in Bipolar, but in CMOS, they
went back to simpler designs.
[12] Slegel, Pfeffer, Magee, "The IBM eServer z990
microprocessor", IBM J. RES. DEV. Vol 48, No. 3/4,  May/July 2004,
294-309.
[13] Heller and Farrell, "Millicode in an IBM zSeries processor",
IBM Journal of Research and Development 48, No. 3/4, May/July 2004,
425-434.

[Some of these can be found on WWW, some are in ACM Digital Library
(Subscription), many are discussed in detail in [0] anyway].


INTRODUCTION - OOO (Out-of-Order) (see [0, Chapter 3]):

OOO CPUs try to maximize ILP as follows:

A) Fetch instructions in-order, with extensive branch prediction,
- decode (and maybe even cache the decoded instructions)
- apply register renaming to convert logical registers to physical
- put the resulting operations (decoded instruction using renamed
registers) into internal queue(s) (reorder buffer, active-list, etc),
such that an operation can be performed (often OOO) as soon as its
inputs and necessary functional units are available.

B) A load/store unit tries to discover cache misses and start refills
quickly.  Loads can (depending on ordering model) be done out of order,
and stores can at least (sometimes) profitably fetch the targets of
cache lines, although the final store operation must wait.  Decoupling
this unit as much as possible is absolutely crucial to getting good
ILP, given the increasing relative latency to memory.

C) Other instructions are executed by appropriate function units,
commonly 1-2 integer ALUs, and a collection of independent FP units.

A) Again, since it's more related to A): Completed instructions are
retired in-order.  If it turns out that the fetch unit has mispredicted
a branch, when that is discovered, the register state, condition codes,
etc are rolled back to those just before the branch, and the branch is
followed in the other direction.   If an instruction generates an
exception, the exception normally doesn't take effect until the
instruction is retired, in which case the following instructions are
cancelled.  Something similar occurs with asynchronous interrupts.
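
The rename step in A) can be shown with a toy model (illustrative only;
no real CPU's rename hardware looks like this):

```python
# Toy sketch of register renaming: logical register numbers are mapped
# onto a larger pool of physical registers, so that independent writes
# to the same logical register no longer conflict (false dependencies
# are removed).  Free-list exhaustion and retirement are ignored here.
from collections import deque

class Renamer:
    def __init__(self, n_logical=16, n_physical=64):
        self.map = {r: r for r in range(n_logical)}      # logical -> physical
        self.free = deque(range(n_logical, n_physical))  # free physical regs

    def rename(self, dst, srcs):
        """Return (physical dst, physical srcs) for one decoded operation."""
        psrcs = [self.map[s] for s in srcs]  # read current mappings first
        pdst = self.free.popleft()           # allocate a fresh physical reg
        self.map[dst] = pdst                 # later readers see the new copy
        return pdst, psrcs

r = Renamer()
# Two back-to-back writes to logical r1 get distinct physical registers,
# so the second can execute without waiting on the first:
d1, _ = r.rename(1, srcs=[2, 3])
d2, _ = r.rename(1, srcs=[4, 5])
assert d1 != d2
```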

OOO CPUs run most of the time speculating, i.e., working on multiple
instructions that might or might not actually be reached, which is why
people worry so much about good branch prediction, because the penalty
for bad prediction gets worse as the design gets {longer pipelines,
more parallelism}.  Also, they hope for code patterns that help confirm
branch directions early.

Once upon a time, it was easy to know how many cycles an instruction
would use, but that was long ago, with real-memory, uncached designs
:-)  It is very difficult to know what an OOO CPU is up to.  There are
also serious hardware tradeoff problems that arise, even though
invisible to most programmers.  There is never as much die space as
potential uses, and the payoffs of different choices must be carefully
analyzed across large workloads, especially because there can be
serious discontinuities in gate-count, or worse, gate-delays caused by
"minor" changes in things like queue sizes.  For instance, unlike
limits on logical (programmer-visible) registers, there is no a priori
limit on the number of physical registers, but in practice, these
register numbers are used in large numbers of comparators, and one
would think hard before going from 64 to 65, or 128 to 129.  Likewise,
load/store queues have big associative CAMs so that the next address
can be quickly checked against all the outstanding memory operations.
Quite likely, load queues are filled with outstanding memory
references, with multiple cache misses outstanding.  If a queue is
filled, but an instruction needs a piece of data *right now*, just to
be decoded into micro-ops, it either has to have special-case hardware,
or it will have to wait until a queue entry is available.  [The VAX has
this problem, unlike IA-32 or RISCs.]
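
As a toy illustration of that associative check (a software model only;
the hardware is a CAM, one comparator per entry, which is exactly why
growing the queue is expensive):

```python
# Toy model of a store queue: a load's address must be checked against
# every outstanding (not-yet-retired) store at once.  In hardware each
# entry has its own comparator; here we just loop.
class StoreQueue:
    def __init__(self, size=8):
        self.size = size
        self.entries = []          # (address, data) of outstanding stores

    def insert(self, addr, data):
        if len(self.entries) >= self.size:
            return False           # queue full: the operation must wait
        self.entries.append((addr, data))
        return True

    def forward(self, load_addr):
        """Does any older store match this load's address?"""
        for addr, data in reversed(self.entries):  # youngest match wins
            if addr == load_addr:
                return data        # store-to-load forwarding
        return None                # no conflict: load may go to cache

sq = StoreQueue()
sq.insert(0x1000, 42)
assert sq.forward(0x1000) == 42    # forwarded from the pending store
assert sq.forward(0x2000) is None  # no match: safe to access the cache
```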

Some instructions (like ones changing major state/control registers, or
memory mapping, etc) are inherently *serializers*, that is, their
actions cannot take effect until all logically older instructions have
been retired.  Also, partially-executed instructions *following* the
serializer may need to be redone. The decoder might recognize such
serializers and stop fetching, if it is deemed a waste of time to go
beyond them.

Unlike loads, where all the work can be done speculatively, stores
cannot be completed until they are retired, because they can't be
sanely undone.

Cleverness can preserve sequential (strict) ordering while pulling
loads ahead of earlier stores, by redoing them if it turns out they
conflict. [0, p.619 on R10000; discussed in US Patent 6,216,200].  The
VAX's strict ordering might *not* have been a real problem.

WHY DO ALL THIS OOO COMPLEXITY?

a. Speculate into memory accesses as early as possible and get cache
misses going, to deal with the increasing latency to memory.  Overlap
address calculations, get actual load data early, and fetch cache-line
targets of stores early.  Also, try to smooth the flow of cache
accesses to lower latency and increase effective bandwidth.  It was
often said:
 "The main reason to do OOO is to find cache misses as early as
possible."

b. Extract more ILP by overlapping non-dependent ALU/FP operations in a
bigger window [40+ typical] than is available for in-order
superscalars, which typically examine no more than 4
instructions/clock.  This is especially valuable for long-cycle
operations like integer multiply/divide, or many FP ops.

c. Alleviate decoding delays of messier variable-length instructions;
this obviously applies less to RISCs, although some have done modest
pre-decode when storing instructions in the I-cache.

d. Reduce pressure on small register sets by using register renaming to
create more physical registers than logical ones.  This also eliminates
false dependencies, even in RISCs with large register sets, but it does
help VAX (and IA-32) somewhat more, as both are short of registers.
That moves the problem around, as it puts more pressure on load/store
queues, and efficient handling of load-after-store-to-same-address.
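A toy model of the renaming in d. (the names and structure here are mine, not any particular chip's): each write to a logical register allocates a fresh physical register, so false WAW/WAR dependencies between successive writes to the same logical register disappear, while true dependencies through sources survive.

```python
def rename(instrs, n_logical=16):
    """instrs: list of (op, dst_logical, [src_logicals]).
    Returns the same stream with registers renamed onto a larger
    physical set; only true (read-after-write) dependencies remain."""
    rename_map = {r: r for r in range(n_logical)}  # logical -> physical
    next_phys = n_logical
    out = []
    for op, dst, srcs in instrs:
        phys_srcs = [rename_map[s] for s in srcs]  # read current mapping
        rename_map[dst] = next_phys                # fresh physical dest
        out.append((op, next_phys, phys_srcs))
        next_phys += 1
    return out
```

Two back-to-back writes to R1 now target different physical registers, so they no longer serialize each other.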


The 360/91 used OOO for a. (it had no cache), b. (long FP-cycle ops),
and d. (only 4 FP registers). I think c. was less important, as S/360
instruction length decode is easy.

ILP, NORMAL INSTRUCTIONS, and IA-32 VERSUS VAX

Consider the normal unprivileged instructions that need to be executed
quickly, meaning with high degrees of ILP, and with minimal stalls from
the memory system.

RISC instructions make 0-1 memory reference per operation.  Despite the
messy encodings, *most* IA-32 instructions (dynamic count) can be
directly decoded into a fixed, small number of RISC-like micro-ops,
with register references renamed onto the (larger) set of physical
registers.  Both IA-32 and VAX allow unaligned operations, so I'll
ignore that extra source of complexity in the load/store unit.

In an OOO design, the front-end provides memory references to a
complex, highly-asynchronous load/store/cache control unit, and then
goes on.  In one case, [string instructions with REP prefix], IA-32
needs the equivalent of a microcode loop to issue a stream of micro-ops
whose number is dependent on an input register, or dynamically, on
repeated tests of operands.  Such operations tend to lessen the
parallelism available, because the effect is of a microcode loop that
needs to tie together front-end, rename registers, and load/store unit
into something like a lock-step.  Although this doesn't require that
all earlier instructions be retired before the first string micro-ops
are issued, it is likely a partial serializer, because it's difficult
to do much useful work beyond an instruction that can generate arbitrary
numbers of memory references (especially stores!) during its execution.
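The REP case can be illustrated with a toy expander (hypothetical and simplified to forward byte moves only): the micro-op count is a run-time function of the count register, which is exactly what the front-end cannot know at decode time.

```python
def expand_rep_movsb(ecx, esi, edi):
    """Toy expansion of REP MOVSB with the given register values:
    each iteration becomes one load and one store micro-op.  The
    length of this list is unknowable from the instruction bits."""
    uops = []
    for i in range(ecx):
        uops.append(('load',  esi + i))   # fetch source byte
        uops.append(('store', edi + i))   # store destination byte
    return uops
```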

However, the VAX has more cases, and some frequent ones, where the
instruction bits alone (or even with register values) are insufficient
to know even the number of memory references that will be made, and
this is disruptive of normal OOO flow, and is likely to force difficult
[read: complex, high-gate-count or long-wire] connections among
functional blocks on a chip.  Hence, while the VAX decoding complexity
can be partially ameliorated by a speculative OOO design with decoded
cache [I alluded to this in the RISC CISC 1991 posting], it doesn't
fix the other problems, which either create microcode lock-steps
between decode, load/store, and other execution units, or require other
difficult solutions.  In some VAX instructions, it can take a dependent
chain of 2 memory references to find a length!

VAX EXAMPLES [1], [2], especially compared to IA-32 [10] and sometimes
S/360.

Specific areas are:
-	Decimal string ops
-	Character string ops
-	Indirect addressing interactions with above
-	VAX Condition Codes (maybe)
-	Function calls, especially CALL*/RET, PUSHR/POPR.

DECIMAL STRING OPERATIONS: MOVP, CMPP, ADDP, SUBP, MULP, DIVP, CVT*,
ASHP, and especially EDITPC: these are really, really difficult without
looping microcode.  [S/360 has the same problem, which is why
(efficient) non-microcoded implementations generally omitted them.]
The VAX versions, especially the 3-address forms, are even more complex
than the 2-address ones on S/360, and there are weird cases: DIVP may
allocate 16 bytes on the stack, and then restore the SP later.

These instructions somewhat resemble the (interruptible) S/370 MVCL,
but are more complex, including the infamous ADDP6.  They all set 4-6
registers upon completion or interrupt.

EDITPC is like the S/360 EDMK operation, but even more baroque.  "The
destination length is specified exactly by the pattern operators in the
pattern string." [2, p. 336] I.e., you know the beginning address of
the destination, but you can't tell the ending address of a written
field without completely executing the instruction.

The IA-32 doesn't have these memory-memory decimal operations.

One might argue that C, FORTRAN, BLISS, PASCAL, etc. couldn't care less
about these, but COBOL and PL/I do care, so if they are a customer's
priority, they may not be happy with the performance they get on an OOO
VAX, i.e., C speeds up, FORTRAN speeds up, but decimal operations are
unlikely to speed up as much, as these certainly look like microcode
that tends to serialize resources.

CHARACTER STRING AND CRC OPERATIONS: MOVC, MOVTC, MOVTUC*, CMPC, SCANC,
SPANC, LOCC, SKPC, MATCHC, CRC: also tough without looping microcode,
and they are generally more complex than the S/360 equivalents.  MOVTUC
is a fine example: it has 3 memory addresses, and copies/translates
bytes until it finds an escape character.  Hence, at decode time, it is
impossible to know how many memory addresses will be fetched from, and
worse, stored into...
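A simplified model of MOVTUC's semantics (paraphrasing the description above; the real instruction also sets registers and condition codes, which this ignores) makes the problem concrete: the number of bytes stored is a function of the *data*.

```python
def movtuc(src_bytes, table, escape):
    """Toy MOVTUC: translate each source byte through `table`, stopping
    before the first byte whose translation equals `escape`.  Returns
    the bytes that would be stored -- a count no decoder could predict
    from the instruction bits, or even the register values."""
    stored = []
    for b in src_bytes:
        t = table[b]
        if t == escape:
            break
        stored.append(t)
    return stored
```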

The IA-32 REPEAT/string operations have some of the same issues, but
are simpler, with the length and 2 string addresses supplied in
registers.

VAX INDIRECT ADDRESSING AND CHARACTER OR DECIMAL OPS

For any of the above, note that most operands, INCLUDING the lengths,
can be given by indirect (DEC deferred) addresses:
@D(Rn)  Displacement deferred [2.7%, according to [5, Table 4]]
@(Rn)+  Auto-increment deferred [2.1%, according to [5, Table 4]]

The first adds the displacement to register Rn, to address a memory
word, the second uses the address in Rn (followed by an auto-increment)
to address the memory word.  That word contains the address of the
actual operand value.  This makes it impossible for the front-end to
know the length early.  Rather than being able to hand off load/store
operations, unidirectionally to the load/store unit, the front-end has
to wait for the load/store unit to supply the operand value, just to
know the character string length.  I have no idea how frequent this is,
but VAXen pass arguments on the stack, and a call-by-reference that
passes a length argument will do this: @D(SP).
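A toy model of displacement deferred makes the dependence chain explicit: one load just to learn the operand's address, a second load to get the operand itself (e.g. a string length), before the front-end can even size the operation.

```python
def displacement_deferred(mem, rn, disp):
    """Toy @D(Rn): the word at rn+disp holds the *address* of the
    operand, so two dependent memory reads are needed to get the value."""
    pointer = mem[rn + disp]   # first load: fetch the operand's address
    return mem[pointer]        # second load: fetch the operand itself
```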

Consider how much easier are the regular VAX MOV* instructions, each of
whose length is fixed.  Each of those is easily translated into:
Load (1 value: 1, 2, or 4 bytes) into (renamed register); store that
value
or
Load (2 or 4 longwords); store (2 or 4 longwords)

(Of course, one might like the MOV to just act in load/store unit, but
that's not quite possible, due to the MOV and Condition Codes issue
described later.)

IA-32 doesn't have this problem, as the length for a REP-string op is
just taken from a register.  Of course, that value must be available,
but that falls out as part of the normal OOO process.  The closest the
S/360 gets is the use of an Execute instruction to supply length(s) to
a Move Character (MVC) or other SS instruction.  That's a somewhat
irksome control transfer: think of it as replacing the EX with the MVC
after ORing the length in, but at least you know the length at that
point, without having to ask the load/store unit to do (possibly)
multiple memory references in the middle of the instruction, which
requires some special cases in the front-end <-> load/store unit
interaction.

VAX CONDITION CODES AND MOVES [CONJECTURE ON MY PART]

OOO processors typically use rename maps to map logical registers to
physicals.

Condition Codes (CC) require an additional rename map of their own, in
ISAs that have them.  Each micro-op has an extra dependency on the CC
of some predecessor, and produces a CC, just as it produces a result.
Register renaming uses massive bunches of comparators to keep track of
dependencies, and CCs just add more maps and more wires and
comparators.
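A sketch of that extra bookkeeping (the structure is mine, for illustration): the CC is just one more renamed resource, so every flag-setting micro-op allocates a new physical CC, and every flag-reading one names its producer, exactly like a register result.

```python
def rename_with_cc(instrs):
    """instrs: list of (op, sets_cc, reads_cc).  Returns
    (op, cc_read_tag, cc_write_tag) with the CC treated as a renamed
    resource: readers name the latest producer, writers allocate anew."""
    cc_tag = 0   # current physical CC "register"
    out = []
    for op, sets_cc, reads_cc in instrs:
        src_cc = cc_tag if reads_cc else None
        if sets_cc:
            cc_tag += 1                    # fresh physical CC
        out.append((op, src_cc, cc_tag if sets_cc else None))
    return out
```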

IA-32 and S/360 would need this also, but the VAX is slightly
different, in that its data movement instructions affect the CC.

S/360: CVB,CVD, DR, D, IC, LA, LR, L, LH, LM, MVC, MVI, MVN, MVO, MVZ,
MR, M, MH, PACK, SLL, SRDL, SRL, ST, STC, STCM, STH, TR, UNPK do *not*
set the CC, i.e., most data movement instructions do NOT affect CC.

IA-32: MOV does not affect any flags.

VAX: almost everything affects flags, including all the MOVes (except
MOVA and PUSHA, which do address arithmetic), so that the simple
equivalents of LOAD and STORE on other ISAs now have to set the CC.
It's hard to say whether this matters or not without a lot of
statistics.  It does complexify some advanced optimizations.  For
instance, there are some kinds of store/load sequences where one wants
everything to be done in a L/S unit (which generally knows nothing
about CCs): one may recognize that a pending store has
the same address as a later load, and simply hand the store
data directly to the load without incurring a cache access.  [I think
Pentium 4 does something like this.]  This easily happens when a bunch
of arguments are quickly pushed onto the stack, and the stores are
queued in the L/S unit (because they arrive faster than the cache can
service them), but later loads quickly appear to fetch the arguments.
This seems to imply extra complexity, because the L/S unit must compute
the CC and get it back to the rest of the CPU.
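Store-to-load forwarding itself is simple to sketch (a toy model): a load first checks the queued, not-yet-written stores, youngest first, and only falls through to the cache when none match.

```python
def load_with_forwarding(addr, store_queue, cache):
    """store_queue: list of (addr, val) in program order, not yet
    written to the cache.  A matching load takes its data straight
    from the youngest matching store, never touching the cache."""
    for st_addr, st_val in reversed(store_queue):
        if st_addr == addr:
            return st_val          # forwarded from the store queue
    return cache[addr]             # otherwise, a normal cache access
```

The VAX wrinkle described above is that a forwarded MOV would still owe the rest of the CPU a condition code.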

NOTE: upon exception or interrupt, the CC must be set appropriately,
which means that it has to be tracked.  However, it also means that
most conditional branches depend on the immediately-preceding
instruction, and that may (or may not) make it harder to extract ILP.

AND SAVING THE BEST FOR LAST: "SHOTGUN" INSTRUCTIONS LIKE VAX
CALLS, CALLG

In the NVAX, these shut down the pipeline because the scoreboard
couldn't keep track of them, so that sequences of simpler
instructions were faster.

The VAX ISA makes it harder than usual for a decoder to turn an
instruction into a small, known set of micro-ops.

CALLS and CALLG generate long sequences of actions, most of which can
be turned into micro-ops straightforwardly.  However, one thing is
painful:

CALLS numarg.rl,dst.ab  and CALLG arglist.ab, dst.ab

The decoder cannot tell from the instruction how many registers will
get saved on the stack, because the dst.ab argument (which could be
indirect) yields the address, not of the first instruction of the
called routine, but of a *mask*, 12 of whose bits correspond to
registers R11 through R0, showing which ones need to be saved onto the
stack along with everything else, and all the register side-effects.
This means, that in the middle of decoding the instruction, the decoder
has to hand dst.ab to the address calculator, get the result back [OK
so far], but then it has to fetch the target mask, scan the bits, and
generate one micro-op store per register save.
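The CALLS problem in miniature (a hypothetical helper, not DEC microcode): the save mask lives in *memory* at the call target, so only after a fetch in the middle of decode can the store micro-ops be generated.

```python
def calls_save_uops(dst_addr, mem):
    """Toy model of the CALLS register-save step: mem[dst_addr] holds
    the entry mask; bits 0..11 select R0..R11.  The number of store
    micro-ops depends on a memory fetch made mid-decode."""
    mask = mem[dst_addr]           # the extra fetch, mid-decode
    uops = []
    for r in range(12):
        if mask & (1 << r):
            uops.append(('store', f'R{r}'))
    return uops
```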

Presumably, in an OOO with trace-cache design, and with fully-resolved
subroutine addresses, one could do this OK, but it's a pain, because
of the potential variability.  Of course, in C, with pointers to
functions, an indirect call through a pointer is awkward ... but of
course, it's awkward for everybody.

RET inverts this, but the trace cache approach doesn't help, in that
it pops a word from the stack that has the register mask, scans the
mask, and restores the marked registers from the stack.  This is
another thing that wants to generate a variable number of memory
operations based on another memory operand, so the micro-ops are not
easily generatable from the instruction bits alone, or even
instruction+register values.

PUSHR and POPR push and pop multiple registers, using a similar
register mask, but at least, in common practice, they would be
immediate operands ... although of course, it is possible the mask was
some indirect-addressed value, sigh.

Of course, some use the simpler:
BSB/JSB to subroutine

Subroutine:
  PUSHR, plus other prolog code
  Body
  POPR and other epilog code
  RSB

However, at least as seen in [3], [4], [5], CALLS/RET certainly got
used:
[4, Table 3] has 4.86% of instructions being CALLS+RET, for
instructions executed by compilers for BASIC, BLISS, PASCAL, PL/I,
COBOL, FORTRAN.
[3] has instruction distributions [by frequency and time], with [3,
Table 7] showing the distributions across all modes.  This lowers the %
of CALLS/RET, since the VMS kernel and supervisor don't use them.
Still, one would guess that ~1% of executions each for CALL and RET,
with about 12% of total cycles, would fit the VAX 11/780.
[5, Table 1] gives 3.2% for CALL/RET

Of course, the semantics of these things tend to incur bursts of stores
or loads, which means the load/store queues better be well-sized to
accept them.

IA-32:
While the CALL looks amazingly baroque, it's not as bad as it looks;
that is, there are a bunch of combinations to decode, and they do
different things, but once you decode the instruction bits, you know
what each will do.

RET doesn't restore registers, it just jumps back, although again,
there is a complex set of alternatives, but each is relatively simple,
especially the heavily-used ones (I think).

PUSHA/PUSHAD and POPA/POPAD simply push/pop all the general registers...
 a fixed set of micro-ops.

S/360:
This has LM/STM (Load/Store Multiple), but the register numbers are
encoded in the instructions, with the only indirect case being an
Execute of an LM/STM, something I have rarely seen.

NOT IMPOSSIBLE, BUT HARD
If my job had been to keep the VAX competitive, I'd
probably have been thinking about software tricks to lessen the number
of CALL/RETs executed, but it's just one of many issues.  Maybe there
are other implementation tricks, but in general, this stuff is *hard*,
and solutions that are straightforward in RISCs, and somewhat so in
IA-32, are different/tricky for VAXen.

To see how complex it can get to make an older architecture go fast,
see [12] on the z990.  IBM did OOO CPUs in the 360/91 (1967), and
ES/9000 (around 1992), but has reverted to in-order superscalars since.
The recent z990 (2-core) is a 2-decode, 3-issue, in-order superscalar.
The chip is in 130nm, with 8 metals, and has 121M transistors, of which
84M are in arrays, and the rest (37M!!) are combinatorial logic.  That
is two cores, so figure each side is 60M, with 17M in combinatorial
logic.  That's still big.  It has 3 integer pipelines (X, Y, and Z),
of which one does complex instructions (fixed point multiplies and
decimal), and sometimes (as for Move Character and other SS
instructions), X and Y are combined into a virtual Z, with a 16-byte
datapath.  "Instructions that use the Z pipeline always execute
alone."  The millicode approach [13] might help a VAX, but again,
this is not simple.

AND MORE
I picked out a couple of the obvious issues.  In my experience, the
people who *really* know the warts and weird problems of implementing
an ISA are those who've actually done it a couple times, and I
haven't ever done a VAX.  If one of those implementers says they knew
how to fix all the issues, I'd at least listen to their solutions
with interest, but I do know that a lot of the issues are statistical
things, not just simple feature descriptions.

CONCLUSION
Earlier in this thread, I noted [11], which says: "The VAX
architectural disadvantage might thus be viewed as a time lag of some
number of years."  The data in Table 1 in my previous post agrees, as
does the clear evidence of the late 1980s.  DEC understood the VAX
quite well; there were superb architectural analysis articles [3-6]
from the 1980s. Serious CPU designers gather massive data, and simulate
alternatives, and DEC folks were very good at this process.

Nth REMINDER: there is *architecture* (ISA) and there is
*implementation*, and they interact, but they are different.  If this
isn't familiar, go back and read old postings.  One might be able to
say that one ISA is simpler than another, because the minimum gate
count for a reasonable implementation is lower than the other's.  One
might say that the design complexity of similar implementations differs
between the two ISAs.

VAX VS RISC [TABLE 1, FIRST GROUP]
By the late 1980s, some system-RISCs were selling in volumes similar to
VAXen, i.e., in workstation/server markets [SPARC, HP PA, MIPS], and
hence, none had the vast volumes of the PC market to allow
extraordinary-expensive designs, but all could design faster CPUs than
contemporaneous VAXen, at lower design cost. It was certainly clear by
1988 that RISCs were causing trouble for the VAX franchise, at least,
in the Ultrix side of it. Reference [11] was discussed earlier in this
thread, and its conclusions recognized that.  Of course, IBM re-entered
the RISC fray in 1991 with (aggressive) POWER CPUs.

It is not at all unreasonable that RISC ISAs, first shipped in 1986 [HP
PA, MIPS], 1987 [SPARC], 1991 [IBM POWER], and 1992 [DEC Alpha], should
be more cost-effectively implementable than the VAX, first shipped in
1978.  Even tiny MIPS was able to do that, over most of that period.

Hence, one of the jaws closing on the VAX was higher-performance RISCs,
delivered at lower cost, in similar volumes. The other jaw, as
discussed in the previous post, was the performance rise of high-volume
IA-32 CPUs, which allowed the use of larger design resources to deal
with the complexities of IA-32. The second group in Table 1 showed a
few examples of that.

The VAX ISA is far cleaner, more orthogonal, more elegant, and easier
to comprehend, than the IA-32 ISA, as it was in 1993.  The Intel
Pentium offered 2-issue superscalar (1993), and PentiumPro (1996) went
OOO, and Pentium 4 (2000) went even further, with decoded instruction
cache (trace cache).  It took substantial resources, which DEC didn't
have, to do that, including "pipelining" design teams at Intel (Santa
Clara, Portland).

In 1992, at Microprocessor Forum, Michael Uhler showed a chart that
included:
CMOS-4  CMOS-5  CMOS-6
1991    1993    1996   manufacturing year
.75     .5      .35    Min feature (microns)
3       4       4      Metals
7.2     16      32     Relative logic density
2.2     2.9     3.7    Relative gate speed
        1.3X    1.7X   Gate speed relative to CMOS-4

It should be pretty clear from Table 1 of the previous post, that
straight shrinks from CMOS-4 to CMOS-5 and then CMOS-6 wouldn't have
put the VAX back competitive, because you wouldn't even get the
clock-rate gain, given increasing relative memory latency.  At the
least, you'd have to redo the layout and increase the cache sizes.
If you got the gate speed improvement, you'd have:

1993: 1.3*90MHz = 117MHz, 44E SPECint92, 60E SPECfp92, compared to 65
Si92 and 60 Sfp92 for Pentium.

1996: 1.7*90MHz = 153MHz, 58E Si92, 99E Sfp92, compared to 366 Si92,
283 Sfp92 for PentiumPro.

For various reasons, I doubt that DEC would ever have built a
Pentium-like 2-issue superscalar.  In particular, the NVAX team found
that it didn't help much (2%) to do multiple-operand decode (and was
complex hardware), because the bottlenecks were later in the pipeline.
I conjecture that it is hard to get much ILP just looking at 1-2 VAX
instructions, as lots of them have immediate dependencies.

Hence, (if there had been no Alpha, just VAX), it would have been more
plausible to target an OOO design for 1996, but I'd guess it would
have also had to make the big change to 64-bit at that point.  It's
hard to believe they could have gotten to a trace cache design then
[neither Intel nor AMD had], and the tougher VAX decode might well
incur more branch-delay penalty than the IA-32.

Given DEC's design resources, one can sort of imagine doing:
a) Clock spins and minor reworks on NVAXen, to keep installed base from
bolting, while holding out hope that all would get well in 1996, but
that's 3 years with not much performance scaling; very tough market.
b) Simultaneously doing a 64-bit OOO VAX, because 1999 would have been
late.

However, as has been pointed out in detail, there are just lots of
extra complexities in the VAX ISA, and all this stuff just adds up.
Professionals don't design CPUs using vague handwaving, because it
doesn't work.

Anyway, DEC's gamble with Alpha didn't work [for various reasons],
but at least it was a gutsy call to recreate the "one architecture"
rule at DEC.  Of course, personally, I would rather they had done
something else :-)

But the bottom line is: the VAX ISA was very difficult to keep
competitive.  The obvious decoding complexity is always there, in one
form or another, but the more serious problem is execution complexity
that lessens effective ILP and is thus a continual drag on performance
with reasonable implementations.

VAX: one of the great computer families, built around a clean ISA
appropriate to the time, but increasingly difficult to implement
competitively.
R.I.P.



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: PART 3. Why it seems difficult to make an OOO VAX competitive 
	(really long)
Date: 8 Aug 2005 07:39:55 -0700
Message-ID: <1123511995.056652.53920@g43g2000cwa.googlegroups.com>

Among the problems with comp.arch is that it fills up with opinions
that don't survive even minimal perusal of the literature...

1) One can argue about the PC, but if one reads the VAX-study
references I quoted, one finds [Emer & Clark] that Immediates (PC)+
were 2.4% of the specifiers, and Absolute @(PC)+ were 0.6%, or 3% of
the total.  Personally, I didn't think that was worth the other
problems, and neither did many of the other RISC designers (ARM being
a notable exception), nor X86 nor 68K, but it does help with code size.

2) Peter thinks it's a bad idea to use a GPR as the SP.  Most designers
of real chips in recent years have thought otherwise, because
SP-relative addressing is common, and it is ugly to special-case it.

3) The VAX equivalents of IA-32 LEA are MOVA and PUSHA.

Peter "Firefly" Lund wrote:
> > b) The PC and SP are general registers for historical reasons - upwards
> > compatibility with the PDP-11.
>
> Compatibility is a good reason -- a /very/ good one.
>
> But the VAX didn't have binary compatibility, it just had a mapping from
> PDP-11 registers, addressing modes, and instructions onto the VAX ones.
>
> That made it easy to transliterate assembly source code.  Emulating (or
> even JITting) is also made easier.
>
> But would it really have hurt so much if the VAX had provided one or two
> more general purpose registers and hid away the SP and PC?  A couple of
> extra registers for the emulator to play with internally could have been
> nice (but there were already eight more in the VAX than in the PDP-11 so I
> guess it wouldn't have mattered much).
>
> Instructions that accessed the SP and PC would have had to be
> special-cased in the transliterator and the emulator -- but I'm not sure
> it would have been difficult or expensive (you would need special handling
> of the PC register anyway, since PDP-11 code addresses wouldn't match VAX
> code addresses, and of the SP register since 16-bit values on the stack
> for calls/returns won't match the native 32-bit values).
>
> What do you need to do with SP?  Push, pop, call/ret, the occasional
> add/sub, SP-relative addressing for loading/storing parameters/return
> values/local variables.  If you can move the SP to/from a GPR then what
> else would you need?
>
> What do you need to do with PC?  Conditional/unconditional branches,
> calls, returns, and PC-relative loads and stores.
>
> Maybe we would like an equivalent of the IA-32 LEA instruction, too, for
> creating absolute pointers to values with SP/PC-relative addresses.



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: PART 3. Why it seems difficult to make an OOO VAX competitive 
	(really long)
Date: 14 Aug 2005 21:13:20 -0700
Message-ID: <1124079200.399125.14810@g47g2000cwa.googlegroups.com>

Eric P. wrote:
> John Mashey" <old_systems_guy@yahoo.com> writes:
> >
> > But the bottom line is: the VAX ISA was very difficult to keep
> > competitive.  The obvious decoding complexity is always there, in one
> > form or another, but the more serious problem is execution complexity
> > that lessens effective ILP and is thus a continual drag on performance
> > with reasonable implementations.
>
> In case anyone is still interested in this topic,
> there are a bunch of papers by Bob Supnik at
> http://simh.trailing-edge.com/papers.html
> covering a variety of DEC design issues.

Great material; thanks for posting; Bob is doing a dandy job preserving
old stuff.  In particular, if somebody actually wants to build things,
it is really useful to get insight about design processes and
tradeoffs.

The HPS postings were useful too.

> The one labeled "VLSI VAX Micro-Architecture" is from 1988
> (marked "For Internal Use Only, Semiconductor Engineering Group")
> mentions at the end the ways a VAX might get lower CPI. It says
>
> "However the VAX architecture is highly resistant to macro-level
> parallelism:
> - Variable length specifiers make parallel decoding of specifiers
>   difficult and expensive
> - Interlocks within and between instructions make overlap of
>   specifiers with instruction execution difficult and expensive
>
> Most (but not all) VAX architects feel that the costs of macro-level
> parallelism outweighs the benefits; hence this approach is
> not being actively pursued."
>
> So it would seem that the designers felt at that time that decode
> was a major impediment.

I actually hadn't read this before I posted, but obviously, I'd talked
to VAX implementors in the late 1980s, and what they complained about
sank in.

Anyway, thanks for posting.


Index Home About Blog