From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 11 Jun 2004 16:34:39 -0700
Message-ID: <ce9d692b.0406111534.1bc3e50d@posting.google.com>

Terje Mathisen <terje.mathisen@hda.hydro.com> wrote in message news:<cabkp9$ij8$1@osl016lin.hda.hydro.com>...

> 'restrict' might help, but I'm not too optimistic, I'm afraid my
> physical vicinity (just the shallow North Sea between us) to Nick has
> turned me into more of a cynic. :-(

I'm even less optimistic.  Why is anyone assuming *any* use of C on
WIZ, much less wonderful optimization?

IMHO, the WIZ ISA, such as it is, is about the *least appropriate*
target for C that I've seen in 20 years, rivalled in inappropriateness
mainly by some 1950s and 1960s ISAs designed well before C existed.

The issue was well-discussed in the early 1990s in comp.arch.

- Doing the code generator would be no fun, and the best you could do would
  be seriously uncompetitive.  This ISA is exceptionally hostile to C.

- Moving common C code to it would be *excruciating*
  [As people found in the mid-1970s on Xerox Sigmas, UNIVAC 1100s,
  and some other minicomputers.]
  Cray folks did OK, but then C was far less important than FORTRAN there.
  Stanford MIPS would have had this problem, but MIPSco MIPS didn't.

- Interfacing with typical external (or other SoC) devices would be *agonizing*
  [As AMD29000 and Alpha folks discovered, and made architectural changes
  later; either of those started in much better shape on this issue than WIZ.]


Pascal might be made to work, or perhaps BCPL.   Some FORTRAN might
barely work, although a lot wouldn't.  C? C++? people must be kidding
:-)

Why?

"WIZ" [see below] IS A 32-BIT WORD-ADDRESSED CPU.
8-BIT AND 16-BIT DATA ITEMS HAVE NO ADDRESSES.

This is even worse than where AMD29000 & Alphas started, i.e., with
byte addressing, and with pretty decent 8-bit/16-bit manipulations
inside registers, but with 8-bit and 16-bit loads/stores omitted on
purpose for the usual reasons ... and added back later.  There has
been plenty of discussion on the net; try: Google search in comp.arch:
mashey byte addressing

"WIZ" doesn't even have byte-pointer support (say like PDP-10s or
Stanford MIPS), or even decent byte/halfword manipulation in
registers, especially because it takes several instructions just to do
a 16-bit shift or mask.

Note: re "WIZ": I'm not very interested in discussions in which a car
is described as having terrific speed, great gas mileage, and low
power, and besides, it could easily be turned into a submarine or
airplane as needed, or in fact anything that anybody might think of,
and would be better.  So, it seemed like the best I could do to
evaluate WIZ was to take a look at Steve Bush's best-developed
example, and some code for it.  So I went to:

http://www.steve.bush.org/WizdomR&D/pg003300.html and then:

http://www.steve.bush.org/WizdomR&D/pg003322.html - unpacking "bytes":
which shows 3 instructions to get low-16-bits, and 4 for the high
16-bits.

http://www.steve.bush.org/WizdomR&D/pg003309.html gives the *actual* register
definitions used in the code ... and there are a lot, including some
floating-point units.

http://www.steve.bush.org/WizdomR&D/pg003308.html is the
register-save/restore code

http://www.steve.bush.org/WizdomR&D/pg003317.html is the top-level
example program.

======
Now, at Bell Labs, I had friends who did compilers for Xerox and
Univac machines, and the memos they wrote about it were not pretty.
Strcpy on the Univac was especially amazing, and it even had some
hardware help.  I heard from DG friends that doing C on some of the DG
minis wasn't great fun either.

In general, hardly anyone put C on a word-addressed machine unless it
was an external constraint of installed base.

Now, there's nothing sacred about being able to run C, but perhaps the
WIZ description should say up front: "Note, this ISA is not designed
with C in mind."

With a bunch of redesign, one could probably fix some of this,
although at the expense of a bunch more complexity.

But, this is just the tip of the iceberg - WIZ is just *filled* with
retrograde mis-features, and I'll try to summarize them in a later
posting.


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 14 Jun 2004 14:29:31 -0700
Message-ID: <ce9d692b.0406141329.69add720@posting.google.com>

"Nicolas Capens" <nicolas_capens@hotmail.com> wrote in message news:<caesbu$c0h$1@gaudi2.UGent.be>...
> Hi John,
>
> > I'm even less optimistic.  Why is anyone assuming *any* use of C on
> > WIZ, much less wonderful optimization?
> >
> > IMHO, the WIZ ISA, such as it is, is about the *least appropriate*
> > target for C that I've seen in 20 years, rivalled in inappropriateness
> > mainly by some 1950s and 1960s ISAs designed well before C existed.
> >
> > The issue was well-discussed in the early 1990s in comp.arch.
> >
> > - Doing the code generator would be no fun, and the best you could do
> > would
> >   be seriously uncompetitive.  This ISA is exceptionally hostile to C.
>
> Starting point: to execute a conventional 3-operand instruction (MIPS-like),
> WIZ has to execute three instructions. Since it can do these in parallel,
> it's no worse than an 8086. In fact it's better since the 8086 doesn't have
> 3-operand instructions. This obviously limits it to one 'operation' per
> clock cycle, at most. But let's keep assuming it executes three instructions
> (register transfers) in parallel, VLIW.

Let's not.  The WIZ ISA, such as it is, is clearly *not* VLIW, or even LIW.
The implementation is in-order single-issue, out-of-order completion,
just like MIPS (and many other) chips of the 1980s.  I suppose it
might be possible to imagine a superscalar WIZ, although the ISA, such
as it is, isn't designed to make that easy.

>
> Now we have a lot of freedom to move these instructions around. We can
> eliminate many of them, and we can schedule them to hide latencies. No ISA
> has this much freedom, not MIPS nor x86. So it can't possibly be worse.
> Since MIPS and x86 are reasonably well suited for C compilation, how can WIZ
> be "exceptionally hostile"?

Breathtakingly wrong.  Most RISCs were designed to allow code
rearrangement and code-scheduling; MIPS certainly did, and we were
hardly the first.
The WIZ ISA, such as it is, makes this harder, not easier.


> > "WIZ" [see below] IS A 32-BIT WORD-ADDRESSED CPU.
> > 8-BIT AND 16-BIT DATA ITEMS HAVE NO ADDRESSES.
>
> Is it?
Yes.

> > Note: re "WIZ": I'm not very interested in discussions in which a car
> > is described as having terrific speed, great gas mileage, and low
> > power, and besides, it could easily be turned into a submarine or
> > airplane as needed, or in fact anything that anybody might think of,
> > and would be better.  So, it seemed like the best I could do to

> So, since Steve Bush defined "WIZ" this way, it means it is not extendable?
> These problems are easy to solve. Just define partial registers. The ISA
> sais very little about exactly how the available registers should be used.
> Steve just presented a minimal implementation, only to show that it works.
> We're past that stage now, fortunately...

OK, if these problems are easy to solve, why don't *you* do it?
They are certainly *possible* to solve, and they should be far easier
than to solve all the other problems.  This issue has been discussed
plenty in this newsgroup, and the answers can be found.

Design a proper feature-set, and a clear specification, post it on the net.
I'll even offer to take the time to critique it, for workability &
performance.

> Please, before you say "WIZ can't do this", think about how it can be
> extended to do it. You're crawling over the floor and see a wall, so you
> think you can't get over it. But in fact if you stood on your feet, you'd
> see it's just a hurdle.

Do you realize how condescending that reads?

> > With a bunch of redesign, one could probably fix some of this,
> > although at the expense of a bunch more complexity.
>
> How much complexity it requires remains to be seen... I'm optimistic about
> it and you're pessimistic. Fair enough.

I'm sad to see people waste their own time, or get others confused.

You've done some reasonable-looking graphics & software work.
Is there some reason you *don't* want to do a thesis in something
related to an area where you actually have expertise and might
actually contribute to forward motion (which real research is supposed
to do)?

Gent's course syllabus looks OK, but, from your comments, it's not
clear whether you've yet completed the computer architecture course
(that uses H&P).
Prof. de Bosschere certainly has wide-ranging relevant interests and
has supervised plenty of masters students.  Have you bounced WIZ off
him yet?

There are so many *interesting* things to be done, rather than trying
to start with something so retrograde.

> > But, this is just the tip of the iceberg - WIZ is just *filled* with
> > retrograde mis-features, and I'll try to summarize them in a later
> > posting.

Which will take a while, since there are *so* many problems.  Maybe
it's worth doing, mostly as a "why we don't do these bad things any
more" lesson.
=======================

But just for starters: the whole WIZ approach to registers and
functional units is *painfully* wrong for a CPU [as opposed to a bunch
of I/O devices.]  The pain is from experience, not from sitting around
theorizing.

Bear with me for the first part of this, whose connection with WIZ may
not be apparent to everyone.  Basically, a small part of the MIPS ISA
that was probably a mistake is used *everywhere*, only worse, in WIZ.

I will pick a real ISA as an example: MIPS-I (but it applies to later
ones as well), and a real implementation, the R2000/R2010, because:

a) The ISA, in one part of its integer definition, uses a feature from
which much can be learned about the problems with the WIZ approach,
and was probably a mistake, although it was still much cleaner than
WIZ.

b) Whose basic structure (in-order single-issue, out-of-order
completion, which WIZ labels as "asynchronous") is what WIZ is trying
to do.

c) That was shipped in 1986, with software that did most of the things
that you and Steve Bush seem to postulate as WIZ-enabled advances,
i.e., in this case:
- global-optimizing compilers for C, FORTRAN, Pascal
- code-generation with knowledge of latencies
- assembler that scheduled and reordered code appropriately.
[and many of the techniques were hardly new then]

d) For which I've looked at a lot (tens of thousands of lines) of
source & generated code of real applications and operating systems
(not miniscule toy code).  Even better, I got to see a lot of
micro-architecture simulation results over the years.

e) Which is well-documented, implemented by numerous teams, and widely
studied, as per H&P.

f) And of course, whose ISA & implementations I helped design :-)

WHAT WE DID WITH INTEGER MULTIPLY / DIVIDE

The MIPS-I ISA provided a separate integer multiply/divide that is
somewhat similar to the overall WIZ approach:

DIV   rs, rt   signed divide rs by rt;   set LO = quotient, HI = remainder
DIVU  rs, rt   unsigned divide rs by rt; set LO = quotient, HI = remainder
MULT  rs, rt   signed multiply rs by rt; LO = low 32-bits, HI = high 32-bits
MULTU rs, rt   unsigned multiply rs by rt; LO = low 32-bits, HI = high 32-bits

MFHI  rd       copy HI to rd
MFLO  rd       copy LO to rd

MTHI  rs       copy rs to HI
MTLO  rs       copy rs to LO
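
For reference, the architectural result of MULT can be sketched in C
(my illustration, not MIPS reference code):

```c
#include <stdint.h>

/* Sketch of MULT's result: the full 64-bit signed product, split
   across the two special registers (LO = low half, HI = high half). */
void mult64(int32_t rs, int32_t rt, uint32_t *lo, uint32_t *hi) {
    int64_t p = (int64_t)rs * (int64_t)rt;
    *lo = (uint32_t)(uint64_t)p;           /* what MFLO returns */
    *hi = (uint32_t)((uint64_t)p >> 32);   /* what MFHI returns */
}
```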

Usage is as follows:
1) A DIV* or MULT* is issued, using two GP registers as input;
the CPU pipeline continues to run.

2) The instruction will execute for many cycles, implementation-dependent,
but in R2000, 10-32 clocks.

3) By definition, none of these instructions can cause exceptions,
which is *REALLY* important if you want to keep precise exceptions on
an out-of-order-completion design.  For instance, the muldiv unit
doesn't check for divide-by-zero, since one can do:
   DIV  R1,R2
   BEQ  R2,R0,divide-by-zero
and you don't need the check when dividing by a constant, or when the
compiler can know that a variable is non-zero.

4) When the instruction completes, the LO and HI registers are set.

5) One or two MF* instructions are issued to fetch LO or HI, or both.
If the muldiv unit hasn't finished computing its results, the MF
instruction stalls until it is finished.

6) However, if you issue another DIV* or MULT* instruction before
using MF* to get the result(s), it restarts the muldiv unit.
Actually, the MIPS-I ISA specs a 2-instruction hazard, i.e., if either
of the 2 instructions preceding DIV* or MULT* is MF*, the result of
the MF is undefined.  In fact, there are various other hazards
regarding MF* and MT*, which the assembler of course knew how to deal
with.
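
The software zero-check can be sketched in C (hypothetical function of
mine; real MIPS assemblers emitted the equivalent branch sequence
automatically):

```c
/* Since DIV never traps, the divide-by-zero check is an ordinary
   branch that the compiler can omit when the divisor is provably
   non-zero (a constant, or a variable known to be non-zero). */
int div_or_default(int a, int b, int dflt) {
    if (b == 0)       /* the explicit BEQ against the divisor */
        return dflt;  /* stand-in for a divide-by-zero handler */
    return a / b;     /* the DIV itself, with no hardware check */
}
```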

WHAT THE COMPILER DID IN PRACTICE
1) As long-latency instructions, try to schedule DIV* or MULT* early.
2) Then fill in with other operations as possible.
3) Then, use MF* to retrieve the result.

HOW WELL DID IT WORK
1) The integer muldiv performance was better than on some RISCs, but
not because of the overlap, but because we had actual hardware, and
some didn't.
There was plenty of discussion of the plusses and minuses of various
approaches 10-15 years ago in comp.arch; as usual, it depends on the
choice of benchmarks.

2) On an in-order-issue machine, it proved fairly difficult to schedule
much other useful work in the shadow of the DIV*/MULT* latencies.  I
don't remember the particular numbers, but in practice, if you issued
a DIV*/MULT* opcode, fairly soon you were going to be stalled in an
MF*.
[And this was from a compiler system whose optimization was extremely
well-respected in the industry.]

WHAT WE DID WITH THE REST OF THE INTEGER ALU
1) The straightforward thing: 1-cycle
add/subtract/shift/logicals, with the fanatical attention paid by most
CPU designers to making back-to-back ALU operations go fast, e.g., via
bypass networks.

WHAT WE DID WITH MIPS FLOATING POINT

This is another set of FUs with ISA-spec'd semantics that allow
correct in-order issue, out-of-order completion.

1) The various FP units [ADD, MUL, DIV] were (fairly) independent as
well, and had longer latencies than the simple integer ALU operations.  It
was well-known that there were likely to be implementations with
various latencies and repeat rates, although the compilers always knew
these, and scheduled accordingly.  However, correct execution never
depended on this, since:

2) Unlike the integer muldiv unit, the FPU was scoreboarded and
interlocked.
I.e., there might well be multiple operations in progress, but if the
CPU attempted to issue an operations that the FPU wasn't ready to
accept, the main pipeline stalled until the FPU was ready.  Likewise,
any attempt to read an FP register that was the target of an executing
FP operation would stall.

3) In addition, the FP unit used a clever trick to retain precise
exceptions in an in-order-issue, out-of-order-completion CPU:
GSCA [Google Search Comp.Arch]: hansen floating point patent

Basically, this is a requirement of *any* long-latency independent FU
in this type of design: either it can't cause exceptions at all, or if
it can:
a) It stalls the issue logic until it is sure there will be no
exception.
b) Hopefully, there is some quick check it can make [as in Hansen's
scheme]; otherwise, the stall is long.
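
The scoreboard-plus-interlock behavior can be modeled with a toy cycle
counter (my sketch; the register count and latencies are arbitrary,
not any real FPU's):

```c
#include <stdint.h>

/* Toy scoreboard: each register records the cycle its pending result
   completes; issuing an op that touches a busy register stalls the
   clock forward to that cycle, exactly as the interlock would. */
typedef struct { uint64_t clock; uint64_t ready[32]; } Scoreboard;

static void wait_for(Scoreboard *sb, int reg) {
    if (sb->ready[reg] > sb->clock)
        sb->clock = sb->ready[reg];        /* pipeline stall */
}

/* Issue src1 op src2 -> dst with the given result latency. */
static void issue(Scoreboard *sb, int src1, int src2, int dst,
                  uint64_t latency) {
    wait_for(sb, src1);                    /* read-after-write interlocks */
    wait_for(sb, src2);
    wait_for(sb, dst);                     /* write-after-write interlock */
    sb->clock += 1;                        /* issue takes one cycle */
    sb->ready[dst] = sb->clock + latency;  /* completes out of order */
}
```

Independent operations issue back-to-back while earlier ones are still
executing; only a consumer of an unfinished result pays the latency.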

HOW WELL DID THAT WORK?
Fine.

ASSESSMENT OF MIPS ISA FOR MULDIV

1) Design assumptions
   1a) ISA: We wanted hardware integer muldiv, as we had enough programs to
       believe they were useful, and we didn't want to assume an FPU
       (many chips do integer MULTs in the FPU).
   1b) ISA: programs could make use of all 64-bits produced by DIV*/MULT*.
       I.e., full 64-bit product (2 registers) for MULT*,
       both quotient & remainder for DIV*.  You get remainder "for free"
       IMPLICATION: unlike all other MIPS ALU instructions, these are
       the only ones that generate 2 results, not just 1 [and that proved
       a painful special case in the MIPS R10000].
   1c) Implementation: we didn't have scoreboarding on integer register file.

Given those, what we did was OK.

2) However, in retrospect, I wish we'd done something more like
Alpha's integer multiply instructions, which are normal 3-operand
instructions, and normally deliver the low-order register [which is
what most compiled code uses], and use a separate instruction [UMULH]
to deliver the high-order register.

Alpha doesn't have integer divide hardware, but if one wanted it, the
same approach could be taken, i.e., just produce the quotient, and if
desired, separate opcode for remainder.
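
The Alpha-style split can be sketched in C (my illustration; MULL and
UMULH are real Alpha mnemonics, the function names are mine, and I
show 32-bit operands for brevity where Alpha's are 64-bit):

```c
#include <stdint.h>

/* Two independent 3-operand operations instead of one instruction
   writing a dedicated HI/LO pair. */
uint32_t mull(uint32_t a, uint32_t b) {    /* low 32 bits of product */
    return a * b;
}
uint32_t umulh(uint32_t a, uint32_t b) {   /* high 32 bits of product */
    return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}
```

Most compiled code only ever needs the first; the second is there when
you want it, and each result lands in an ordinary allocatable register.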

Of our assumptions:
 1a) was an OK choice.
 1b) We knew some people who loved 32x32->64 bit multiplies, and it was
     occasionally useful to get both quotient and remainder ... but in practice,
     most compiled code took little advantage of this.
 1c) To have had an integer scoreboard would have been culturally hard,
     and maybe would have cost precious schedule ... but could have been done.

WHY DO I WISH WE HAD DONE IT DIFFERENTLY?

1) The regular MIPS integer register file was very clean: 32
registers, of which only the zero register was different in hardware.
While there are occasionally requirements for registers that have
hardware differences, from a compiler writers' point of view,
especially when doing serious optimization, the nice thing about RISC
was that we got rid of a lot of the weird special cases and dedicated
resources that plagued many older architectures.
For example, having done compiler work on 68Ks, I can attest to the
irritation of the split between A & D registers ... and the 68K is
relatively clean.
Some of the earlier architectures were far worse [and IA32 certainly
retains plenty of such issues, but at least it's a general-register
machine.]

2) But, the MIPS muldiv unit introduced two registers, LO and HI, that
weren't quite like the rest.  In general, coupling function units and
registers this way is awkward, as seen in plenty of code.  The problem
is that the LO & HI registers are unique named resources that must be
allocated to a single muldiv operation at a time.  While it would be
silly to have 2 separate muldiv units for general applications, and
not worth the cost, it would be really awkward with this structure,
because you essentially would need to add another HI/LO pair, plus
instructions to deal with them, and compiler pain.

On the other hand, many MIPS implementations have multiple integer
and/or FP ALUs, with binary compatibility, and they work just fine.

Terminology notes for next section:
A register value is DEAD if you can write over it without disrupting the
computation, i.e., it has no further use.  Otherwise it is LIVE.

Normal calling conventions for register-oriented machines:
- dedicate some registers, i.e., as stack pointer, frame pointer (if needed)
- spec some registers as CALLER-SAVE, i.e., not safe across a function call,
  so if caller has LIVE data in any of them, it better save them before call.
  In some ISAs, it is natural to use some of these for the first N arguments,
  i.e., the caller fills them with LIVE data, but considers them DEAD on
  return.
- spec other registers as CALLEE-SAVE, i.e., safe across function calls.
  it's up to the callee to save and restore any such registers.

3) The awkwardnesses of things like HI and LO

- They're different from the regular GP registers.
  In C, on MIPS:
     func(a+b)   does ADD 1st-arg-register,a,b, then calls func.
                 Really nice: even if a+b took multiple cycles
                 (suppose it were FP), there would be no stall until
                 func actually accesses 1st-arg-register, which in fact
                 gives plenty of cycles.
  whereas
     func(a*b)   does a MUL a,b, followed by MFLO 1st-argument-register
                 to get the data to the right place.
                 I.e., there is *less* parallelism

- They're high-speed registers, and as such precious ... but they're
  really only useful for muldiv operations, yet they have to be saved
  and restored "just to make sure".
  Every bit of my experience says that there better be a really good reason
  for a dedicated resource, as:
  - Their usage tends to cause serialization bottlenecks.
  - They end up having to be saved/restored, and otherwise managed,
    often with extra data movement, and often when actually unnecessary,
    because the code doing it can't know whether the data is actually LIVE.

- In practice, you would never start a DIV*/MULT* before a call, without
  using MF* to retrieve the data.  I.e., while these theoretically could be
  CALLER-SAVE, in practice that makes no sense, because any code that does
  a MULT*/DIV* has to use MFHI,MFLO to save the registers, and then MTHI, MTLO
  later to restore them.  If the first MFHI stalls, then you stall anyway.
  However, in most cases, saving/restoring them at call is a waste,
  because they are usually DEAD, because the data that anybody cares about has
  been copied already.

- Interrupt routines, except for very lightweight ones, end up having
to save and restore HI & LO, even though, most of the time, it turns
out they are DEAD anyway, but there's no way for a kernel to know
that.  [yes, we could have introduced "valid" bits or some such thing,
but the complexity wouldn't have been worth it.]

-  In the R10000, they caused a bunch of special cases that caused moaning
   from designers.

BOTTOM LINE
If I had to do it again, I'd do this feature more like Alpha's.
Most of the user-level MIPS ISA is relatively clean and simple,
but the muldiv unit is an oddity in the ISA that remains to this day.
[Some day, I'll write the integrated "MIPS-I: what worked well, what I
wish we'd done different, and why" document.  While muldiv isn't #1 on
my list, it's fairly high...]

But, I hope it's clear that *lots* of people perfectly well understand
the idea of an ISA whose semantics allow for in-order issue,
out-of-order-completion.
It isn't new in WIZ, it wasn't new in MIPS, and that was nearly 20
years ago.
Also, lots of people understand code scheduling and optimization, and
they aren't new either.
Finally, lots of people understand that just throwing lots of
functional units at something doesn't automagically create great
improvements.  In some cases,
not even an *infinite* number of FUs, with perfect lookahead, helps
much.
(Wall's 1991 ASPLOS paper "Limits of Instruction-Level Parallelism"
being one of the classics).

NOW - WIZ

1) MIPS has one integer unit with awkward properties [-]
   An operation is issued:
   + (1) Two operands are fetched [OK]
   + (2) Multi-cycle execution doesn't stall the instruction issue
   - (3) Two results are produced, not one
   - (4) The results sit in special registers
   - (5) The special registers need to be managed, they cannot be allocated
     the same way as the regular registers by compilers.
     They often end up being saved/restored unnecessarily
   - (6) Unlike the FPU, one cannot issue another operation to the unit,
     and just stall if it happens to be busy for any reason, because
     a new issue to it overrides any current execution.

2) Essentially every WIZ unit has similar properties, but worse, since:
   - (1) Operands must explicitly be moved to the unit, which in fact...

   - (7) Creates even more special registers

and then, the absolute disaster
 ----(8) Does this, not just for operations (like mul or div) that
         are inherently multi-cycle, but for the simplest ALU operations,
         that most CPU designers work very hard to assure efficient
         back-to-back dependent operations.  I.e., adds, shifts, logicals.

and hence, after decades of moving away from computers where most
registers had dedicated functions [i.e., the typical Accumulator / MQ
/ index registers of the 1950s designs like IBM 70x, 709x]...

  ---(9) The WIZ style makes most of its registers special-purpose, and
      some of them (COUNTERs) have side-effects that must be coped with.

This has nothing to do with clocked-vs-asynchronous designs [there is
plenty of confusion already.]  It doesn't have much to do with
implementation issues [there are plenty there], or with all the other
missing things that real-world processors need.  Fundamentally, the
connection of FUs with visible register state, in the way WIZ does it,
makes it *harder* to improve ILP, not easier.
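
The back-to-back cost can be caricatured in C (purely illustrative;
the `adder_*` names are mine, not WIZ's, and each assignment stands
for an architecturally visible register transfer):

```c
/* Conventional ISA: two dependent adds, two instructions, with
   results in ordinary registers the compiler allocates freely. */
int chain(int a, int b, int c) {
    return (a + b) + c;
}

/* WIZ-style caricature: every ALU op means moving operands to the
   unit's dedicated registers and moving the result back out. */
int chain_wiz(int a, int b, int c) {
    int adder_x, adder_y, adder_out;
    adder_x = a;                    /* move operand 1 to the adder */
    adder_y = b;                    /* move operand 2 to the adder */
    adder_out = adder_x + adder_y;  /* result latches in the unit */
    adder_x = adder_out;            /* move result back as an operand */
    adder_y = c;
    adder_out = adder_x + adder_y;
    return adder_out;
}
```

Same answer, several times the visible data movement, and every
intermediate lives in a unique named resource that serializes the unit.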

SUMMARY
1) Compiler people will not be keen on WIZ.

2) Neither will operating systems people.

3) And there are many more mis-features to make me believe 1) and 2),
but I'm done for now.


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 15 Jun 2004 09:54:40 -0700
Message-ID: <ce9d692b.0406150854.5202e165@posting.google.com>

"Wilco Dijkstra" <wilco-dot-dijkstra@ntlworld.com> wrote in message news:<wzzzc.20$0W3.7@newsfe2-gui.server.ntli.net>...
> "Nicolas Capens" <nicolas_capens@hotmail.com> wrote in message
> news:calldd$tc0$1@gaudi2.UGent.be...
> > > BOTTOM LINE
> > > If I had to do it again, I'd do this feature more like Alpha's.
> > > Most of the user-level MIPS ISA is relatively clean and simple,
> > > but the muldiv unit is an oddity in the ISA that remains to this day.
> > > [Some day, I'll write the integrated "MIPS-I: what worked well, what I
> > > wish we'd done different, and why" document.  While muldiv isn't #1 on
> > > my list, it's fairly high...
> >
> > I'm interested to know what #1 is.
>
> Not sure what John's list is, but I'd say:
>
> #1 no interlocks (MIPS = Microprocessor without Interlocking Pipe Stages)
> #2 branch-delay slot
>
> #1 was bad enough it was dropped immediately (it makes binary compatibility
> impossible, is bad for codesize because of all the NOPs needed and stops you
> adding features with unpredictable timing, such as caches and superscalar
> execution).
>
> #2 still exists today, it helps simple implementations as much as it hurts
> advanced ones, so it is a win today given MIPS is now an embedded CPU
> with mostly single-issue in-order implementations.

I'd put it a different way, although it needs a serious discussion in
more detail than I have time for now.

Note that the MIPS acronym was a misnomer anyway, because there were
interlocks in the FPU, and on the MFHI and MFLO instructions.

#2 = no scoreboard on integer registers
  => cause of doing the muldiv unit the way we did, rather than like Alpha
  => cause of load-delay slot

  Of these two effects, we were able to upward-compatibly fix the
  load-delay slot in MIPS-II.
  The first-implementation artifact of the muldiv unit did remain,
  because you couldn't get rid of it.  [Note: relevant for WIZ.]

Offhand, #3 = load/store word left & right [I've alluded to that
before].

As you note, branch-delay slots are a mixed blessing; on balance, if
I were doing it today, with today's technology, I wouldn't have it, but
it was actually pretty useful for the crucial early years of its
life.

But what's #1?  IMHO, I'll give a hint:

There was one feature [2 instructions] that we omitted from the R2000's ISA.
The feature would have caused some implementation pain, which might
have been enough to hurt the schedule, or maybe not.
Not having the feature lost very little performance at the time, but
caused an excruciating transition later on from which it took years to
recover, in practice.  The feature did appear in the R4000, but the
practical interaction of hardware and software held back its
widespread usage longer than anyone would have expected.


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 15 Jun 2004 22:27:47 -0700
Message-ID: <ce9d692b.0406152127.52aa7bde@posting.google.com>

Seongbae Park <Seongbae.Park@Sun.COM> wrote in message news:<cannb6$4vv$1@news1nwk.SFbay.Sun.COM>...
> John Mashey <old_systems_guy@yahoo.com> wrote:
> ...
> > But what's #1?  IMHO, I'll give a hint:
> >
> > There was one feature [2 instructions] that we omitted from the
> > R2000's ISA.
> > The feature would have caused some implementation pain, which might
> > have been enough to hurt the schedule, or maybe not.
> > Not having the feature lost very little performance at the time, but
> > caused an excruciating transition later on from which it took years to
> > recover, in practice.  The feature did appear in the R4000, but the
> > practical interaction of hardware and software held back its
> > widespread usage longer than anyone would have expected.
>
> Are they LL and SC ?
>
> Seongbae

Nope.  [The mistake there wasn't in not having synchronization
instructions, it was in not having a standard API for the common
operations, especially at user level, i.e., one that would have turned
into a low-overhead syscall.
My fault: I'd made a mental note that we had to do that ... and then
things got busy and I forgot until it was too late.]

At the time, we simply were unable to decide on synchronization ops
that we'd be happy to have forever.  That was because we asked
a lot of people, and for every synch op we knew, there were credible
people telling us it wasn't good enough, and that we should do
something different.  The R2000 wasn't geared for SMP, and the R3000
had the barest minimum of support ... and the customers who were going
to use it for SMPs had their own ideas of what they were going to do
for synch outside the CPU.

The #1 issue is one where we might have known better.  Any more
guesses?
[And it's in this thread because it's also relevant to WIZ :-)]


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 15 Jun 2004 23:05:38 -0700
Message-ID: <ce9d692b.0406152205.26484735@posting.google.com>

iain-3@truecircuits.com (Iain McClatchie) wrote in message news:<45022fc8.0406151757.5090477c@posting.google.com>...
> Mash> But what's #1?  IMHO, I'll give a hint:
>
> My guess: the reverse endian control bit.  What was it, something like
> 20 gates to implement and 2 man-years to verify?

Nope.

1) The Bi-Endian feature got added late [Summer 1985], primarily in
response to the possibility of getting Daisy Systems [who demanded
Little Endian] as a customer.  Up to that point, we'd seen no point in
doing both BE and LE.

2) It was observed [by Tom Riordan, I think] that it was relatively few
gates, and given that we already had byte/half support, that the extra
paths to do LE (the way we did it) wouldn't even add significantly to
the die space.

3) I don't ever recall any particular verification issues.  Software
folks had to worry some more about it.

4) We didn't get DAISY.

5) But of course, in 1988, it helped us get an absolutely crucial
design win at DEC.  We might have gotten there with BE [there were
some in DEC who wanted the DECstation to be different, to make sure it
was clearly different ... but DEC had never shipped anything but LE,
ever ... so it certainly removed a sales inhibition.]


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 15 Jun 2004 14:56:56 -0700
Message-ID: <ce9d692b.0406151356.58750e4f@posting.google.com>

"Wilco Dijkstra" <wilco-dot-dijkstra@ntlworld.com> wrote in message news:<wzzzc.20$0W3.7@newsfe2-gui.server.ntli.net>...
> #1 no interlocks (MIPS = Microprocessor without Interlocking Pipe Stages)
> #2 branch-delay slot
>
> #1 was bad enough it was dropped immediately (it makes binary compatibility
> impossible, is bad for codesize because of all the NOPs needed and stops you
> adding features with unpredictable timing, such as caches and superscalar
> execution).

Oops, I forgot to add, before running out:
While I certainly would have preferred to do without the NOPS, the
compilers were pretty good about scheduling, and caches worked just
fine. The argument was that we could have 1 load-delay slot in the
ISA, because we'd never have less than that, and if later chips [like
R4000] needed more, then they could just stall at that point perfectly
well and run existing code, which they did.
Anyway, we would have done better to have had a scoreboard and been
done with it ... but recall that this was an outgrowth of Stanford
MIPS, and that it took about 12 months to do the whole ISA definition,
implementation, and tapeout ... i.e., this was an insane frenzy.


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 16 Jun 2004 16:29:57 -0700
Message-ID: <ce9d692b.0406161529.3776b6f2@posting.google.com>

Tim Olson <ogailx502@NOSPAMsneakemail.com> wrote in message news:<ogailx502-D66D95.07415816062004@news-central.dca.giganews.com>...
> In article <ce9d692b.0406152127.52aa7bde@posting.google.com>,
>  old_systems_guy@yahoo.com (John Mashey) wrote:
>
>
> |  The #1 issue is one where we might have known better.  Any more
> |  guesses?
> |  [And it's in this thread because it's also relevant to WIZ :-)]
>
> Load/Store double to coprocessor registers (in particular, CP1)?

Tim gets the prize.  IMHO, this was the most painful mistake.
I'll post a long article later, as there is a lot to be learned from this one.
Needless to say, WIZ has exactly the same problem, only worse.
The WIZ examples only use 32-bit FP, and 64-bit doesn't fit well.


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 16 Jun 2004 21:46:17 -0700
Message-ID: <ce9d692b.0406162046.3a54bfc6@posting.google.com>

Tim Olson <ogailx502@NOSPAMsneakemail.com> wrote in message news:<ogailx502-D66D95.07415816062004@news-central.dca.giganews.com>...
> In article <ce9d692b.0406152127.52aa7bde@posting.google.com>,
>  old_systems_guy@yahoo.com (John Mashey) wrote:
>
>
> |  The #1 issue is one where we might have known better.  Any more
> |  guesses?
> |  [And it's in this thread because it's also relevant to WIZ :-)]
>
> Load/Store double to coprocessor registers (in particular, CP1)?

Tim gets the prize.  IMHO, this was the most painful mistake, even
though doing it right in the R2000 would likely have lengthened the
schedule.  It certainly would have required a slightly special case of
instructions that made 2 memory references, not one, with some
possible funny cases where the first word hit in cache, and second
missed.

IMHO, the later pain was so great that a little earlier pain would
have been worth it ... although of course, I wasn't designing the logic,
so that's easy for me to say. :-)  And certainly, I didn't think of it
in time [I'd guess anything later than June 85 was probably too late.]
 [I wasn't thinking much about FP, except for the coprocessor-usable
bit to lessen context-switch overhead, and demanding precise
exceptions.]

CONTEXT
R2010/R3010 FPUs, with relatively large, uniform FP register set, and
of course, with very good optimizing compilers, were *very*
competitive in FP performance for a relatively long time (in this
business).  I.e., from 1986 to 1990, they were pretty tough.  By 1991,
they were definitely getting long in the tooth, especially against
things like HP "Snakes" or IBM POWER.
This was definitely one of the better FP implementations of its time,
and it avoided a lot of mistakes .... BUT:

REVIEW OF MIPS-I IN R2000/R2010 (1986/1987) and R3000/R3010 (1988)

In MIPS-I: 64-bit (internal) FPU:
16 64-bit registers, each of which could hold
- a 64-bit DP value
- a 32-bit SP value
- theoretically 2 separate SP values [not very useful in practice]

The 16 registers were numbered f0, f2...f30, but to load a DP value:

 LWC1  f0, X
 LWC1  f1, X+4

instead of LDC1 f0,X  (from MIPS-II)


You couldn't really do much with the odd-numbered registers, except
load, store, and move to/from CPU.  Operations, even SP, only worked
on even registers.

That does increase the code size, but with separate I- and D-caches,
and a 32-bit bus, it had minimal performance impact, especially as
loopy FP code works well with an I-cache.  The LDC1 couldn't have run
the bus any faster, and would have been more complicated.

BUT (and this is where in retrospect, we might have known better)
PHYSICAL CPU BUS WIDTHS TEND TO GROW OVER TIME.
SOONER OR LATER, YOU WILL HAVE 64-bit BUSSES, OR HIGH-END FAMILY
MEMBERS THAT HAVE 64- or EVEN 128-BIT BUSSES.
 That had even been true in the IBM/360, 20 years earlier.
 WE SHOULD HAVE TAKEN THE PAIN TO MAKE LDC1 (and SDC1, easier) work.

PROBLEMS IN TRANSITION TO MIPS-II (still 32-bit, for R6000 & R4000,
of which I'll consider the R4000 from now on)
What we really wanted, by the time we were working on MIPS-II, i.e.,
~1988:

32 64-bit FP registers
  [compiler writers were quite able to make use of >16]

 LWC1 f0 ... would load 32-bit SP value into f0  [OK, still need]

and add:
 LDC1 f0 ... would load 64-bit DP value into F0  [wanted badly]
             especially since, unlike some of the older chips where
             even FP adds were plenty of cycles, when the adds are
             a few cycles, 2 extra cycles for load/store are painful.
 (and of course SDC1)

We certainly wanted to use LDC1/SDC1 operations, since we had 64-bit
external bus and D-cache path.  We knew it was going to be very hard
to stay competitive in FP without that.

but the problem was:
 LWC1 f1 ...  would load 32-bit SP value into f1 ...
              which no longer was part of the "real" f0/f1 pair.

Hence, the fundamental semantics of LWC1/SWC1 would change between
MIPS-I and MIPS-II, and there was no way to mix them, so we ended up
with R4000's having a status bit (the FR bit) that chose the old
behavior or the new behavior.
[Status bits are generally evil :-)]

In practice, it was impossible to mix the two models in one program,
which meant that there had to be two complete sets of our libraries
.... and worse, we had to convince ISVs to create two versions of their
programs, of which:
The old form [o32] would run on R2000/R3000/R4000.
The new form [n32] would run only on R4000.

Guess what, they mostly said "NO, we'll wait until the installed base
shifts."
Fortunately for SGI, there was a fair fraction of the customers who
used their own code and SGI libraries, and would happily recompile at
the drop of a hat.
But with ISVs ... it was really, really tough, and I don't blame them.

[Of course, there could have been yet another form, which used MIPS-II
integer instructions, but kept the FR bit with the R2000 behavior.
Too many flavors.]

We even looked hard at binary-translation schemes [which we often did
for other reasons like profiling] that would convert o32 ==> n32 ...
so they would run better on R4000s, but it was just too hard.  In
particular, we had long done code rearrangement, which in fact often
separated a pair of LWC1's, so it was really hard to find them and be
sure they really meant a single LDC1.

I think we even considered reverse-translating n32 to o32 so that ISVs
could move easier to n32.  [I.e., imagine an IRIX on an R3000: sees an
n32 binary, reverse translates it to o32 ... difficult, especially
when going to a CPU with half the real registers ... the converted n32
would run substantially slower than the old o32 ... Not Good.
Customers don't like that.]

For what it's worth, since I was probably the biggest worrier at MIPS
about practical binary-compatibility-in-the-real-world-with-ISVs
issues, I was originally against doing a 32-bit MIPS-II.  I would
rather have done all the changes in a 64-bit version (i.e., MIPS-III),
where the binaries were definitely incompatible anyway. BUT, I simply
couldn't argue with Earl Killian's performance modeling that showed
how much we'd lose from not having LDC1/SDC1.

In this case, the critical issue was that the absence of LDC1/SDC1
caused us to get the *visible state* wrong in such a way that was not
upward-compatibly recoverable.

WHAT I WISH WE'D DONE, IN RETROSPECT [but not early enough, like I
said]

1)  Implemented LDC1, even though it would have required 2 bus
transactions, and some extra complexity with the CPU to stall the
pipeline, and it definitely would have been a weird special case.

The R2000/R3000 had a separate tag for each 32-bit word in cache, so a
LDC1 could have caused 0, 1, or 2 cache misses.  Since an R3000
fetched multiple words on a cache miss, it would have had 0 or 1 misses.
We'd have had to think about the meaning of an uncached LDC1/SDC1, or
disallowed it.

2) Implemented SDC1, which is easier, since there could be no cache
misses with a write-thru cache.

3) Implemented f0-f31 [as in MIPS-II].  If there hadn't been die space
for that, we certainly could have had 16 registers f0-f15.  Later, we
could have added f16-f31 as scratch registers, and code compiled to
use them would have happily interlinked with code that only used
f0-f15, whereas there was just no way, from the MIPS-I starting point,
to do that. Not as good as f0-f31, but still an easier transition.

Anyway, that's why that's #1 on my list.  It was in the "agonizing"
category, unlike the others, which I'd call "irritating" at worst.
Again, note that R2000/R3000 performed quite well on floating-point.

For 1990, with more pins available [R2000 had ~144, R3000 had ~172],
we could have used another ~32 pins to do an "R3500", one of whose
possibilities would have been a 64-bit bus, which would actually have
helped code with LDC1/SDC1.

SOME LESSONS
1) It is *really* hard to do good ISA design for long-lived ISAs,
especially:

- If the ISA is widely used by multiple vendors, as opposed to one
where the ISA design, hardware vendor, and systems software vendor are
the same.  [In the latter case, it is sometimes possible to make
amazing hardware changes work.]

The AS/400 comes to mind, but other long-lived ISAs, like the S/360
[40 years and still going!] and the VAX [where some versions didn't
have all the instructions] fit. The IA32 is fairly amazing,
considering...

- If binary compatibility matters, i.e., in general-purpose
environments. Certainly, some of IA32's success is owed to the
overpowering effect of the inertia of installed binary software.

- Everybody knows that forward compatibility matters, but in practice,
so does backwards, especially if ISV software is involved.  Nobody is
keen to recompile everything, retest, and keep two separate versions
of software, just to run a bit faster on your newest machine.  Among
other things, they've sometimes signed contracts to support a machine
for N years, so they are stuck with the old version no matter what.

- It is especially easy in a schedule-frenzied startup to create
"first implementation artifacts".

2) You can usually diddle with privileged operations without causing
too much trouble.  There tend to be differences anyway, and kernel
people are well used to dealing with them.  In particular, there are
often important operations that only appear a few times in an entire
kernel, so they are relatively easy to change.

3) You can add new, long-running operations, if they are naturally
hidden behind clear APIs, and still get forwards & backward
compatibility. For instance, MIPS-II added SQRT, which is normally
found in a separately-compiled function.  At the very worst, if you
had dynamic linked libraries, a MIPS-I main running on an R4000 could
have called a sqrt function that used the new opcode.  In practice, it
turned out that there was a clever trick (from Bill Earl) that was a
quick test for R3000-vs-R4000, and the existing sqrt function just got
an extra test and only used the sqrt opcode if it was an R4000.  That
function could even be statically linked with MIPS-I code and work.

4) You can *remove* old, long-running operations, as in 3), especially
if their dynamic frequency is low, and there's a reasonable way to
trap and emulate.  For instance, as I recall, the microVAX didn't have
the decimal ops [both long and rare].  Neither did the 360/44 or
360/91.

One might allow a RISC to do unaligned load/store, as long as within a
cache line, but then trap to do the case when the access crosses cache
lines. [I think POWER or PPC does this.]  This is one where the basic
operation is fast, but the odd-case frequency should be low. [I wish
we'd done something like that in MIPS, rather than LWL, etc.]
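As a software sketch of that trap-and-fix idea (purely illustrative; little-endian byte order assumed): an unaligned 32-bit load can be synthesized from the two enclosing aligned words, which is roughly the slow path a trap handler would take when the access crosses a line.

```c
#include <stdint.h>

/* Illustrative: synthesize an unaligned 32-bit load from the two
   aligned words that contain it (little-endian assumed).  This is
   the kind of fix-up a trap handler for a line-crossing access does. */
static uint32_t load32_unaligned(const uint8_t *p) {
    uintptr_t a = (uintptr_t)p;
    const uint32_t *lo = (const uint32_t *)(a & ~(uintptr_t)3);
    unsigned shift = (unsigned)(a & 3) * 8;
    if (shift == 0)
        return lo[0];                    /* aligned: the fast path */
    /* unaligned: combine pieces of the two aligned words */
    return (lo[0] >> shift) | (lo[1] << (32 - shift));
}
```

The fast path is one load; the trap path is two loads and some shifting, which is fine as long as line-crossing accesses are rare.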

The very first MIPS systems trapped all FP operations to the kernel,
which worked fine when they were rarely used :-), and was wonderful
when FP boards came, because the software was in pretty good shape.

5) If you only care about forward compatibility, you can add new
instructions that are short, and widely-distributed ... like LDC1 ...
But, if (like LDC1) instructions need to be fast, and are
widely-generated in code, you're probably headed for
backwards-compatibility problems.

6) It's easier to change operations than it is to fix *state*. The
issue with not having LDC1/SDC1 was especially tied to the MIPS-I FP
register definitions.

================================================
THE WIZ RELEVANCE

If you look carefully at the WIZ stuff, you'll find that the bus is
32-bits, and so are the instructions (or CONST doesn't work).  There
aren't any definitions for the (sketchy) FP code, but it's pretty
clear that the assumption is 32-bits everywhere.  With a lot of work,
one might fix this ... but it's got the same awkwardness that the
"simple, natural" WIZ style falls apart if it has to deal with 64-bit
quantities, i.e., at the very least, two separate MOVEs are needed to
fetch a 64-bit double.  Also, there's a lot of work to do to get
IEEE754, if that's desired. Doubling the bus to 64-bits is not
casually done, but that's for implementation reasons, not ISA reasons.
 [I'm still working on the obvious ISA problems before descending into
the grim implementation problems.]


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 17 Jun 2004 15:37:16 -0700
Message-ID: <ce9d692b.0406171437.8f02fe2@posting.google.com>

Seongbae Park <Seongbae.Park@Sun.COM> wrote in message news:<carf7m$5na$1@news1nwk.SFbay.Sun.COM>...
> John Mashey <old_systems_guy@yahoo.com> wrote:
> > In this case, the critical issue was that the absence of LDC1/SDC1
> > caused us to get the *visible state* wrong in such a way that was not
> > upward-compatibly recoverable.

> I thought you had the alternative approach (which SPARC used)
> that would have avoided this software compatibility problem.
> SPARC v8 has 32 32bit fp registers which can be used as
> 16 64bit fp registers, somewhat like MIPS-I
> but allowed SP operations on all 32 registers.
> Then in v9 (64bit extension), they just added 16 more DP registers
> and made it such that DP operation on odd number registers
> would access those added DP registers than existing 16.
> Since DP operation on odd number register was not allowed in v8,
> this didn't cause any backward compatibility problem.
>
> Although I could be totally wrong - maybe something didn't allow
> this approach in MIPS (?) -

Yes, it's back to the same problem, I think:
SPARC got it closer to right by having load/store double floating
point in the first ISA, and generated code used them.  As far as I
know, there was probably little or no SPARC code that ever used two
32-bit floating loads to load both halves of a 64-bit value, whereas
all MIPS-I code worked that way.

In essence, one could assume that a load/store single in SPARC meant a
load/store of an SP value, whereas this was not true of MIPS.

We (both MIPS and SPARC) would have done better to have had:
16 (or preferably 32) 64-bit registers
load/store single/double [as SPARC did]

AND, if somebody really thought there were good data-parallel 32-bit apps,
the ability to "interpret each FP register as a pair of SP values, and
add/sub/mul two of them in parallel."
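A C sketch of that paired-SP idea (illustrative only; the packing convention here is just host memory order, and `ps_add` is my name for it):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative: treat one 64-bit FP register as two packed SP values
   and add element-wise -- the "paired single" interpretation. */
static uint64_t ps_add(uint64_t x, uint64_t y) {
    float xf[2], yf[2];
    memcpy(xf, &x, 8);
    memcpy(yf, &y, 8);
    xf[0] += yf[0];       /* the two adds could run in parallel */
    xf[1] += yf[1];
    uint64_t r;
    memcpy(&r, xf, 8);
    return r;
}
```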

From a compiler viewpoint, I've *never* been fond of even/odd register
pair instructions, even though I've used them for 30+ years [S/360-onward].
This style is one more thing that makes register allocation a bit
harder.


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 16 Jun 2004 11:20:57 -0700
Message-ID: <ce9d692b.0406161020.77d4c300@posting.google.com>

rpw3@rpw3.org (Rob Warnock) wrote in message news:<TZCdncYer7O8m03dRVn-jA@speakeasy.net>...
> John Mashey <old_systems_guy@yahoo.com> wrote:
> +---------------
> | 5) But of course, in 1988, [LE mode] helped us get an absolutely crucial
> | design win at DEC.  We might have gotten there with BE [there were
> | some in DEC who wanted the DECstation to be different, to make sure it
> | was clearly different ... but DEC had never shipped anything but LE,
> | ever ...
> +---------------
>
> Well, *that's* not quite true: The PDP-6/-10/-20 (36-bit) line (started
> shipping in 1964, kept going for more than 20 years) was *definitely*
> big-endian only. And in fact, DEC was still updating the manuals in 1989!
> The DEC guys you were talking to were probably VAX bigots...

Oops, right, although word-addressed machines, of course.
The ones I was talking to weren't VAX bigots ... but there was a
general tenor of "it would save us work to use the BE stuff ... but we
*have* to do LE."


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 17 Jun 2004 01:07:58 -0700
Message-ID: <ce9d692b.0406170007.2b66852c@posting.google.com>

"Nicolas Capens" <nicolas_capens@hotmail.com> wrote in message news:<calldd$tc0$1@gaudi2.UGent.be>...
> Hi John,
>
> > Let's not.  The WIZ ISA, such as it is, is clearly *not* VLIW, or even
> > LIW.
>
> Why would you want to ignore VLIW? It's very clear that WIZ will benefit
> enormously from it. You're right that Steve Bush doesn't mention it
> explicitely on his site, but should we therefore limit ourselves to this
> minimalistic implementation?

one more time: find a copy of H&P, 3rd Edition, and read p 215-224,
plus other pages mentioned in index under VLIW.

It's simply not worth discussing until you understand what a VLIW is.

I will say that not only is WIZ not a VLIW, IMHO getting a WIZ-like
VLIW ISA that makes any sense is way harder than doing so for a
typical RISC, but that gets into implementation issues that I've
deferred.

Of course, there are still serious ISA problems even with the
single-issue WIZ.

So far, I've identified in previous posts:

1) WORD-ADDRESSED => forget a reasonable C, C++, JAVA, etc.

2) IRREGULAR AND AWKWARD "REGISTERS"
   [That was the long discussion about why the main thing in MIPS that
   acts a little like WIZ (muldiv) turned out to be awkward.]

to this we add:

3) APPARENTLY BROKEN: the use of SKIPIF and the way WIZ does branches.
Code examples use:
   register => skipif-lt    [skip next instruction if lt]
   jump <somewhere>

but jump <somewhere> is really 2 instruction words:
   CONST => PC
   <somewhere>

which means that skipif must really have some funky special case where
it skips 2 words, rather than 1, if the *next* instruction happens to
be CONST => PC,
i.e., this doesn't just increment the PC, something has to check the
next word. Sorry if I've misread this, it was confusing.

Despite all the discussion of variable-sized instructions, it is clear
that CONST is 32-bits wide, which makes instructions 32-bits wide.
The examples all work that way, and the main example uses register #s
requiring 11 bits, with no help whatsoever for efficient decoding.
However, 8-bits would do, which sounds like two 16-bit instructions
could be packed together ... but it would take a great deal of effort
to get anything that allows that.  The current definitions don't, given:
- the fact that the second 16-bit instruction lacks an address
- the way CONST gets set
- the way SKIPIF works
- the way exceptions work

4) PERFORMANCE PROBLEMS WITH SMALL CONSTANTS

Most 32-bit architectures have immediate operands that efficiently
provide small constants, and they do that because there is a massive
amount of data that shows the usefulness of these things.  The WIZ
idiom:
  <value> => register

  requires 2 32-bit words, so that operations on typical machines like:

  SLL R1, R2, 6

turn into


  <6> => SHIFT-count  (2 words)
  R2 => SHIFT-input
  SHIFT-output => R1

i.e., 4 words.

Or, suppose we ignore the word-addressing problem, and have fetched a
memory word, and want to extract the 2nd least significant byte.  I
think PA-RISC could do this in one instruction; in MIPS it takes two:
  SLL  R1, 16
  SRA  R1, 24

In Wiz, the straightforward code is:
  <16> => SHIFT-count (2 words)  [assuming 16 is left shift]
  R1 => SHIFT-input
  SHIFT-output => SHIFT-input
  <-24> => SHIFT-count  [and you can't do that before the previous,
                        or you'd change SHIFT-output] (2 words)
  SHIFT-output => R1

So, that's 7 words = 28 bytes of instructions.
OF COURSE, a world-class optimizer will figure out whether or not it's
worth putting various constants in registers.  Maybe a complete
software system would decide to globally allocate a bunch of GP
registers to hold common constants ... but compiler people would
*hate* needing this thing just to decide that you need to have
registers with 1, 4, 8, 255, etc. lying around.
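For reference, the two-shift extraction above is the standard C idiom for a sign-extended byte field (a sketch; it assumes arithmetic right shift on signed types, which essentially every compiler provides):

```c
#include <stdint.h>

/* Sign-extend the 2nd least significant byte of w: shift left 16 to
   put it in the top byte, then arithmetic shift right 24 -- the
   SLL/SRA pair above. */
static int32_t extract_byte1_signed(uint32_t w) {
    return (int32_t)(w << 16) >> 24;
}
```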

See H&P, 240-283: it is hard work extracting ILP from realizable
processors, especially on general-purpose integer code.  There are
just too many data dependencies ... and visibility, multiplicity, and
irregularity of the WIZ registers are negatives, not positives.

Legions of designers, for years, have worked very, very hard on
optimizing the performance of back-to-back dependent simple ALU ops,
and for good reason.

5) PERFORMANCE PROBLEM WITH ADDRESSING

THIS IS A SPECIAL CASE OF 4), BUT WORTH CALLING OUT:

A few machines [AMD29K, IA64] only have base-addressing, i.e., 0(Reg),
but of course have ADD Immediates of at least modest-sized constants.
Widely-used structures in C, C++, or any other language that has them,
and access to local variables on the stack, want base+offset addressing,
i.e., offset(base), which most machines have.  [That's *all* MIPS has;
many others have more.]

Consider a simple data structure in C:
   struct item { int a;  struct item *nextitem; };

assume I have struct item *p, already in a register, and I want to do:

   p = p->nextitem;

On most CPUs, this is something like:

   LW  P, 4(P)

In WIZ, the straightforward code is:

   P => ADDR1-A
   CONST => ADDR1-B
   <1>              [aha, you remember this is a word-addressed machine :-)]
   ADDR1-SUM => MEM-read-address
   MEM-read-data => P

and this is what you'd get from a typical compiler making typical
structure references, i.e., 5 instruction words.
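For context, the C that generates that reference is about as ordinary as code gets (a sketch; `sum_items` is mine, the struct is from above):

```c
#include <stddef.h>

struct item { int a; struct item *nextitem; };

/* Each p = p->nextitem step is one LW p, 4(p) on most machines,
   but ~5 instruction words of explicit address arithmetic on WIZ. */
static int sum_items(const struct item *p) {
    int sum = 0;
    for (; p != NULL; p = p->nextitem)
        sum += p->a;
    return sum;
}
```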

Some kinds of code, including many OSs, have very frequent references
to non-zero-offset references to structure items.

A good optimizer would desperately try to keep useful addresses
around in MEM registers, assuming there are enough of them.  [This has
some echo of 68K A(ddress) regs, or CDC 6600's A & X regs, except way
more awkward.]

Likewise, simple references to simple local variables allocated on a
typical stack look this way.

WIZ has all these MEM registers, and they may help some kinds of array
manipulation, but they don't help access to structure elements or even
simple variables on the stack.

All history says that compiler writers hate having bunches of special
dedicated registers ... and it's not a defense to say: "well we can
add as many as we want", because resources cost.  It's also not a
defense to assume that compilers will automagically adapt to the
specific sets of resources, because in some cases, that's likely to
make big changes in code-generation strategy.

6) PERFORMANCE ISSUE WITH OVERFLOWS, OTHER EXCEPTIONS
Some software cares about integer overflows.

Integer adds don't cause overflow exceptions [they can't be allowed to
do so, as I described in previous post about MIPS Muldiv.]  So if you
want to do integer adds, and check for overflow:

The ADDR-STATUS register "has 2 bit flags in the lower two bits, ready
and overflow".  "When the logic finishes ... the ready bit is set to
one and the overflow bit is set if overflow has occurred."  That means
that ADDR-STATUS will be set either to 11B or 01B, assuming the left
bit is the overflow.

   T1 => ADDER1-A
   T2 => ADDER1-B
   ADDER1-SUM => T3
   ADDER1-STATUS => SHIFT-rightOne
   Shift-rightOne => SKIPIF-Z
   goto <overflow>  (assuming goto as above, i.e., 2 words)

That's 7 32-bit instruction words to do an overflow-checked add.
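Compare what the check costs in portable C (a sketch using the usual sign-bit idiom; `add32_checked` is my name, not WIZ's, and the cast back to signed relies on the universal two's-complement behavior):

```c
#include <stdint.h>
#include <stdbool.h>

/* Signed 32-bit add with overflow check: compute in unsigned
   (well-defined wraparound), then detect overflow -- it occurred
   iff both operands have the same sign and the sum's sign differs. */
static bool add32_checked(int32_t a, int32_t b, int32_t *out) {
    uint32_t s = (uint32_t)a + (uint32_t)b;
    *out = (int32_t)s;
    return ((~(a ^ b)) & (a ^ (int32_t)s)) < 0;  /* true => overflow */
}
```

On a typical RISC this compiles to the add plus two or three ALU ops and a branch, i.e., still noticeably shorter than the 7-word WIZ sequence.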

7) FP IS UNDEFINED, AND 64-BIT FP DOESN'T FIT VERY WELL

Since there's no definition, just a few examples, I'll spend little time
on this.  I mentioned the 64-bit issue in an earlier post about
MIPS-I's lack of LDC1/SDC1.

8) CACHE AND MMU INTERFACES NOT OBVIOUS

And what's there...
How many OS programmers want to go back to some version of the 1960s
base-limit registers?

9) EXCEPTION-HANDLING, MINIMAL KERNEL/USER DESCRIPTION, ETC

Not worth going in there, as not well-enough defined ... but
historically, exception-handling has been utterly notorious for bugs.

10) INTERACTIONS

If you design a clean architecture with well-defined exceptions, MMU,
cache interfaces, etc, it's not too hard to subset it ... but history
says that if you don't think hard about these early, you will get
surprised later by weird interactions that will drive OS people crazy.

==================================
ISA SUMMARY (AND THIS IS MORE THAN ENOUGH):
1) WORD-ADDRESSED => forget a reasonable C, C++, JAVA, etc.
2) IRREGULAR AND AWKWARD "REGISTERS"
3) APPARENTLY BROKEN: the use of SKIPIF and the way WIZ does branches.
4) PERFORMANCE PROBLEMS WITH SMALL CONSTANTS
5) PERFORMANCE PROBLEM WITH ADDRESSING
6) PERFORMANCE ISSUE WITH OVERFLOWS, OTHER EXCEPTIONS
7) FP IS UNDEFINED, AND 64-BIT FP DOESN'T FIT VERY WELL
8) CACHE AND MMU INTERFACES NOT OBVIOUS
9) EXCEPTION-HANDLING, MINIMAL KERNEL/USER DESCRIPTION, ETC
10) INTERACTIONS

Like I said, neither compiler people nor OS people (and I've been
both) would like WIZ much...  even assuming there were
high-performance implementations.

IMPLEMENTATION
Time to move to implementation issues, maybe via Socratic method,
since I'm tired.  Questions for EEs:

1) If WIZ actually uses asynchronous logic [in the AMULET or Ivan
Sutherland sense], how do you feel about implementing:
2 8-bit busses, and a 32-bit bus, with all "registers" attached?
How does that really work as a CPU bus, rather than an I/O bus?

What sort of *real* handshaking protocols would be necessary,
especially for the 3-party transaction that is each instruction?
[Sutherland's Turing Lecture in CACM, June 1989, may be useful.]

2) If it turns out that WIZ is really clocked, with all of the
asynchronous discussion just meaning that functional units have
non-unit execution times,
possibly unpredictable, but at least not requiring low-level
handshakes on every interaction:

How do you feel about making that bus structure go fast, with every
functional unit sitting directly on those busses?  How well does that
scale?  Is it much, much faster than other processors?  Is an instruction
4 gate delays?

[I think it's fair to assume some reasonable register numbering such
that the high bits encode the FU, and the low bits encode the register
within the FU.]
The large WIZ example has ~130 "registers".

[I put "registers" in quotes since they are a lot more like I/O
registers than typical CPU register files.]


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 18 Jun 2004 11:42:14 -0700
Message-ID: <ce9d692b.0406181042.184d2b31@posting.google.com>

rpw3@rpw3.org (Rob Warnock) wrote in message news:<N8ednY_f7Iyqb0_dRVn-sA@speakeasy.net>...
> Terje Mathisen  <terje.mathisen@hda.hydro.com> wrote:
> +---------------
> | Rob Warnock wrote:
> | > ... PDP-10 ... simple "strlen()" equivalent...

> Maybe not, but the above was posted just as a quick demo of the byte
> instructions and how the incrementing feature of ILDB/IDPB/IBP forced
> a BigEndian model [speaking to Mashey's post about DEC & LittleEndian].

1) The PDP-10 is famous for having (fanatically) loyal supporters,
even 15+ years after its original vendor quit shipping 36-bit
machines!  And this is once again illustrated :-)

2) I always thought the PDP-10 had a nice set of string operations,
far cleaner than most word-addressed machines.  If you thought Rob's
long example was long, you should have seen strcpy for Univac 110x,
which was pages and pages of code, as there were explicit instructions
of "load 1st (2nd, 3rd, 4th) 9-bit byte".

3) Note of course, that the "VAX ueber alles" group in DEC had long
ago won before the 1988 decision to start using MIPS, and there was
zero concern for MIPS chips being able to compatibly share binary data
with PDP-10s :-)


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 18 Jun 2004 11:31:37 -0700
Message-ID: <ce9d692b.0406181031.575c84dc@posting.google.com>

Jan Vorbrüggen <jvorbrueggen-not@mediasec.de> wrote in message news:<2jfmhdF115lcnU1@uni-berlin.de>...
> With regard to the load/store double MIPS ISA gotcha:

> Conceptually, I think the pain you describe has to do with what I'd call
> "premature optimization": the semantic instruction "load double fp" has been
> broken down into a series of interdependent "load single fp" instructions,

This kind of thing is usually called a "first implementation
artifact" of the quite common class described as "letting the first
physical bus width sneak into the ISA in a place where later, wider
buses will strongly prefer different instructions".

A related software error (which MIPS didn't make, but some 68K systems
did), on machines with storage alignment, would be allocation of DP
values on any word boundary, and storing that data anywhere. That works
fine in hardware in MIPS-I ... but would cause problems later when DP
values need to be double-aligned.

Lesson: always think ahead to wider bus widths; take the pain early.
An example of doing it right: S/360, whose models had bus widths of
1,2,4, and 8.


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 18 Jun 2004 16:39:33 -0700
Message-ID: <ce9d692b.0406181539.228f88a9@posting.google.com>

dutky@bellatlantic.net (Jeffrey Dutky) wrote in message news:<f6013729.0406171201.19eb7f17@posting.google.com>...
> John Mashey) wrote:
> > IMPLEMENTATION
> > Time to move to implementation issues, maybe via Socratic method,
> > sicne I'm tired.  Questions for EEs:
> >
> > 1) If WIZ actually uses asynchronous logic ... how do you feel
> > about implementing: 2 8-bit busses, and a 32-bit bus, with all
> > "registers" attached? How does that really work as a CPU bus,
> > rather than an I/O bus?
.....
> > 2) If it turns out that WIZ is really clocked, ...
> >
> > How do you feel about making that bus structure go fast, with
> > every functional unit sitting directly on those busses? How well
> > does that scale? is it much, much faster than other processor?
> > Is an instruction 4 gate delays?
>
> I have to admit, the ONLY reason I'm still reading this thread is for
> John's input (I'm learning more with each of his postings than I did
> in any given week of Advanced Computer Architecture as an undergrad
> at UMd).

Thanks for the kind words.

> This message is about as good an indictment of WIZ (if not of move
> I've started writing my own simulator for a WIZ-like architecture
> and the effort is proving VERY enlightening:

Actually, it's *not* a truly good indictment of WIZ: it says that WIZ
has an awful ISA that ignores most of what we've learned about ISA
design over the last 50 years ... and you are clearly learning even more
things quickly about the ISA by doing a simulator.  But, even if
somebody manages to fix the ISA issue [akin to hoping that a WW I
biplane can be made competitive with an F-16 by adding swept wings;
from a later post of yours, I think *you* understand that], it doesn't
matter.  Maybe this was posted earlier in this long sequence, but
nobody recently has taken up my hints on bus structure, i.e., the
unmentioned:

"ELEPHANT AT THE DINING TABLE"

WIZ also manages to ignore most of what we've ever learned about
high-performance and/or low-power CPU design.

THE WIZ BUS STRUCTURE IS ABSURD FOR A CPU DESIGN
Whether this is truly an asynch bus [a la Sutherland], or a clocked
bus with multi-cycle FUs (like almost every CPU on the planet), it's a
BAD, BAD, BAD thing to do for the internals of a CPU.  This kind of
structure is fine for I/O (if the details are done right), and it's
even OK for coupling small-N SMPs and I/Os  [PMC-Sierra has a new
design like that for SoCs, up to 12 units], but it's totally wrong for
the internal connections of a fast and/or low-power CPU.  As usual,
H&P discusses tradeoffs between synch and asynch buses ...and if it's
the latter ... it's probably even worse - I went through the likely
handshaking required for WIZ, and it's painful.

The LAST thing in the world I'd want is a design that, to do A = B +
C (take two GP registers, add them, and put the result back):

1) Does 3 transactions on a shared-bus structure to which every FU is
connected, i.e., read: SLOW [pesky laws of physics :-)]

2) And each transaction needs 3-way interactions among the IR, the
source, and the destination, across that universal, shared structure.

Not only that, but REAL CPU designers sweat blood, first to design
ISAs that can be decoded quickly, and convert them into control
signals and the narrowest, shortest possible buses, with the fewest
loads on them they can.
Good ISAs are painstakingly designed to do this ... and if the ISA is
troublesome (like the inelegant, but possible IA32, or the elegant,
but difficult VAX), you pre-decode into some useful form of horizontal
microcode, i.e., RISC-like or VLIW-like.

The LAST thing they'd do is broadcast *all* instruction bits along a
bus, to every register (230ish in WIZ example), or even every FU
(20-40), then  wait for each of them to use two sets of comparators to
decide if they are source or target.  That's OK for an I/O bus [see
H&P], but for the innards of a CPU: absurd.

Other things being equal:
- shorter buses are faster than longer ones
- buses with fewer loads are faster than ones with more
- unidirectional ones are faster than bidirectional

Again, CPU designers sweat blood on such issues.

There is a lesson, in general:

Good ISA & CPU design *requires* interdisciplinary skills rarely
present in any one person.  Letting hardware designers loose without
good software people around is a disaster, but letting software people
spec hardware without good logic and circuit design input is at least
as bad.  If software people want to play with ISA design, there's some
minimal amount of digital design knowledge required about things likely
to be fast or slow.

This illustrates the incompleteness problem of basic software
simulators.  It is *very* useful and instructive to do a simulator, as
you've been finding.
A simulator can prove something is bad, and sometimes it can prove
that one design is better than another, IF the underlying timing
issues are really understood.  The problem is that sometimes features
that look simple in a software simulator simply don't mirror the
parallelism or lack thereof in real hardware.  Then, one generates a
nice-looking ISA that simply doesn't work.
[I've just spec'd a dynamite CPU: it has a "compile my program"
opcode. :-)]

For instance, to use a silly example, suppose my metric for goodness
is total dynamic instruction count, easy enough to get from a
simulator.  That doesn't tell me that, in *real* machines, multiplies
are slower than adds, and divides are worse.

For a simulator, with <256 registers, I might be tempted to write a
256-way switch statement, and the specific bit encodings would be
irrelevant ... and yet, in real machines, NOBODY just does arbitrary
bit encodings.  Maybe this would be more obvious in software if the
switch *had* to be written as if-then-else cascades, i.e., making it
clear there was some *cost* in having more entries.

ANYWAY: PEOPLE DON'T DO FAST CPUS THE WIZ WAY, AND FOR GOOD REASON.
Hence, I'm not even sure it's useful pedagogically, even if the
plethora of ISA problems were cleared up, except as an example of how
not to do it.

Enough, I give up.


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 19 Jun 2004 09:19:45 -0700
Message-ID: <ce9d692b.0406190819.63d1efc4@posting.google.com>

"Stephen Fuld" <s.fuld@PleaseRemove.att.net> wrote in message news:<lOLAc.93269$Gx4.20787@bgtnsc04-news.ops.worldnet.att.net>...
> "John Mashey" <old_systems_guy@yahoo.com> wrote in message
> news:ce9d692b.0406181042.184d2b31@posting.google.com...
>
> snip
>
> >  If you thought Rob's
> > long example was long, you should have seen strcpy for Univac 110x,
> > which was pages and pages of code, as there were explicit instructions
> > of "load 1st (2nd, 3rd, 4th) 9-bit byte.
>
> That depends upon which generation of 11xx systems you are talking about.
> Certainly true for the 1108 and 1106, but the 1110 and descendants added a
> "byte mode" that made things easier, and the latter machines (1100/60 and

Yes, the 110x designation was explicit.
What's odd is that the software work was done after some of the newer
machines had been introduced, but I'd guess there were installed-base
issues to deal with.  In any case, by a miracle, I found the old memo,
and my (20+ years) memory had exaggerated: the strcpy code was really
only 3 pages long: perhaps the horror of it stuck with me.

> But even on the 1108, people used "well known" tricks.  A common one for
> dealing with bytes was a table of instructions that were executed (via the
> execute instruction).  Since the address of the instruction to be executed
> could be indexed, you looped over (in your example) four instructions (one
> "hard coded" for each quarter word).  Each of those instructions loaded (or
> in another set stored) a byte in the string, the address of which was also
> indexed. The last of the four instructions incremented the index register so
> it pointed to the next word.  You would need some initialization code to
> figure out which load to start with, and testing for zero after the store
> would be trivial.  It seems like only a couple of dozen instructions, but it
> has been a *long* time since I had to worry about such things.  :-)

The one I saw was slightly different, essentially using all-out
unrolled code for speed.  Let me explain the (very unusual) context.
There was a large project building a large Operations Support System,
which started in the late 1960s, and was still in progress in the late
1970s (not something I worked on, thank goodness :-).  Early in the
project, Univac 110x was chosen, and that stuck.  In the 1970s, they
decided to do a lot of development in C, and the nature of the
application meant that the C str* functions were actually pretty
important.

This was probably an unusual circumstance; I wouldn't guess that there
were all that many big apps written for 110x in C at that time :-)

Since it may have tutorial value, here's what went on:

1) Unlike Rob's beloved PDP-10 :-), the 110x didn't have byte pointer
ops.

2) The base hardware used 18-bit word addresses.
The BTL software did the obvious thing for byte pointers: shift the
18-bit word address 2 bits left, and use the low-order 2 bits to
select 1 of the 4 9-bit bytes.  The hardware couldn't use the byte
pointers directly.

3) hence the code looked like:

a) char * strcpy( s1, s2 ) char *s1, *s2;
   Do the setup:
   Given byte pointers s1 and s2, extract the word pointers.
   Extract the 2 2-bit byte-offsets to get 4-bit index:
    index = (s1 & 3) << 2 | (s2 & 3)
Then use a jump table of 4 groups of several instructions each, so
that you enter at the right place in the sequence, then loop through
the 4 elements in the group until you get a zero byte.

jmptab:
..   qiqj means qi of s1 and qj of s2
 +q1q1
 +q1q2
 +q1q3
 +q1q4
 +q2q1
  ....
 +q4q4

.. q1q1, q2q2, q3q3, q4q4 do the first of 4 alignments
q1q1:
 l,q1  a0,0,a2     a0 = 1st byte loaded from 0(a2)
 s,q1  a0,0,a1     store into 1st byte of 0(a1)
 jz    a0,cret     exit loop if done

q2q2:
 l,q2  a0,0,a2     a0 = 2nd byte from 0(a2)
 s,q2  a0,0,a1     store 2nd byte of 0(a1)
 jz    a0,cret     exit

q3q3:
 l,q3  a0,0,a2    a0 = 3rd byte
 s,q3  a0,0,a1
 jz    a0,cret

q4q4:
 l,q4  a0,0,*a2   a0 = 4th byte, increment a2
 s,q4  a0,0,*a1   store 4th byte, increment a1
 jnz   a0,q1q1    loop back for next word
 j     cret

Then, there were 3 similar sets of code, i.e., the next group was:
q2q1, q3q2, q4q3, q1q4, etc.

The assembly code was 5-10X faster than the compiled C code.

The total instruction size was 82 words, compared to 27 for the
compiled C.
The total size of {strcat, strchr, strcmp, strcpy, strlen, strncat,
strncmp, strncpy, strrchr} was 835 instructions, compared to 360 for
compiled C.

I think they went to this trouble because they had applications where
str* usage was much higher than in typical UNIX C code, and (probably)
where the average size of strings was higher, justifying more complex
setup.

SOME LESSONS

1) The 1107 was introduced in 1962.  Unsurprisingly, it did not
anticipate C and its (unusual) model of null-terminated character
strings of arbitrary size.

2) Since the hardware didn't have byte pointers, the C software had to
simulate them.  This probably worked OK for this project, since it was
mostly newly-written software.  Others who ported UNIX to machines
with different-flavored pointers (XDS Sigma, later DG machines)
sometimes ran into unportabilities of the form:

int *x;

i = func(x);

int func(a) char *a; ...

I.e., where people were used to byte-addressed machines, where there
was really only one kind of pointer, and got sloppy about carefully
using casts.

3) Overall, 110x machines were actually pretty decent for their time.
At least, I was using both an IBM 7094 and UNIVAC 1108 around
1967/1968, and the 1108, with its bigger register set, was pretty
good.  I remember its FORTRAN compiler as pretty good as well.   I do
remember occasionally being surprised by the ones-complement effects
(having both +0 and -0), but overall, I remember it as decent machine.

4) Of all the things that cause trouble for an architecture
 a) Running out of address bits is the major downfall, as Gordon Bell
noted, i.e., the PDP-11 ran out very quickly, needing the VAX.

 b) A new datatype may appear and get heavily used.  If it doesn't fit
the existing hardware it can yield painfully slow/big code.  If the
nature of the datatype is such that relatively simple hardware can
provide access to substantial parallelism, this is one of the best
justifications for adding instructions to deal with the datatype.

    EXAMPLES:
-You can emulate floating point with integer code, but it's painfully
slow, so any CPU that expects to run FP code has much faster parallel
hardware for it.  [Some people seemed to think it was weird that RISCs
had floating point!]

-As multimedia datatypes became common, many ISAs added instructions
to support them, starting, I think, with HP PA-RISC.  Again, you could
get good speedups for these data-parallel ops, with minimal extra
hardware.

 c) On the other hand, adding a new instruction just to do a sequence
of simple dependent operations may end up no faster than the sequence
itself, although it may save code.  It may end up slower, sometimes:
the "sings-and-dances" VAX subroutine call instruction is the classic
example.


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 19 Jun 2004 12:13:03 -0700
Message-ID: <ce9d692b.0406191113.91865e@posting.google.com>

anamax@earthlink.net (Andy Freeman) wrote in message news:<8bbd9ac3.0406190547.7e9b908@posting.google.com>...
> old_systems_guy@yahoo.com (John Mashey) wrote in message news:<ce9d692b.0406181539.228f88a9@posting.google.com>...

> > The LAST thing they'd do is broadcast *all* instruction bits along a
> > bus, to every register (230ish in WIZ example), or even every FU
> > (20-40), then  wait for each of them to use two sets of comparators to
> > decide if they are source or target.
>
> Not so fast.  The bits on the source half of a WIZ instruction uniquely
> identify both the FU containing the data register and the register.  Assuming
> a reasonable encoding, broadcast is not required.  The implementation
> "merely" has to use a subset of the bits to route the read request to the
> appropriate functional unit.  Once the data arrives, a similar mechanism
> can be used to send it.

Warning! Will Robinson! Warning!
You may be catching an infection of the virus seen often in this
discussion:
"X is great"
  "no X isn't, and here's 10 reasons why not"
"well, then X could really be something else, so X is still great."

As I've noted all along, I've analyzed the WIZ as more-or-less defined,
not as it might be if it were mostly different :-)

I read the description of WIZ, starting with:

http://www.steve.bush.org/WizdomR&D/pg000100.html

which claims simplicity and superiority, among other things by having
zero decoding at the Instruction Register, offering a "7GHz WIZ,
without trying."

The block diagram clearly shows how it is supposed to work:
"When any register decodes its own register number on the upper
(source) bits of the register address bus, it puts its data onto the
inter-register bus (shown on the left). When any register decodes its
own register address on the lower (destination) bits of the register
address bus, it receives data from the inter-register bus."  That's
pretty clear.

Great flexibility is claimed in the WIZ pages about having whatever
registers you want, with whatever numbers you want ... i.e., like the
way you'd do #defines in C for tags whose values need only be distinct
....but the exact opposite of tightly-designed, easy-to-decode
instructions that drive specific, dedicated buses and control lines
[what real CPU designers do.]

This can be seen in the worked-out example:

http://www.steve.bush.org/WizdomR&D/pg003309.html

There is zero hint of arranging bits for reasonable decoding, because
the IR doesn't decode anything.
We find the following defines, annotated by me to show the binary
values:
                                    binary
define SKIPIF-LE            = 120   0111 1000  <= Oh, great!

define ADDER1-A             = 121   0111 1001
define ADDER1-B             = 122   0111 1010
define ADDER1-minusB        = 123   0111 1011
define ADDER1-sum           = 124   0111 1100

define ADDER2-A             = 125   0111 1101
define ADDER2-B             = 126   0111 1110
define ADDER2-minusB        = 127   0111 1111
define ADDER2-sum           = 128   1000 0000   <= Oh, great!

[What a dandy encoding for fast decode :-)  One might almost have
hoped that one ADDER was 011110xx and the other was 011111xx ...]

In an earlier post, I'd even tried to allow something a little more
sensible [for decoding at the FU, but retaining the key WIZ approach
of zero decoding at the IR] by writing:

"[I think it's fair to assume some reasonable register numbering such
that the high bits encode the FU, and the low bits encode the register
within the FU.] The large WIZ example has ~130 "registers"."

[I actually rechecked, and the number was closer to 160, I think I'd
skipped a page.  Somewhere else I'd typoed the 130 as 230.  Sorry.]

>
> And, when that routing becomes a pain (and it will), one can set up
> a chain of "route stations", each responsible for a subset of the FUs.
> (Each station passes through requests that it can't handle and data
> that previous stations read from their FU/registers.)  Push an instruction
> into the tail station and the corresponding source data will pop out of
> the head station N beats later, plus any time spent waiting for FU
> delay.

Some of that sounds reminiscent of the i860...

[Plausible discussions faced in implementing something somewhat
different from WIZ yet another different way.]

=====
My bottom line: real CPU designers worry about careful encodings,
because, in fact, they must have ideas about control lines and
short-and-narrow-as-possible buses.  They must think about good
encodings. They cannot just think that they can do anything they like,
because they have a simple, uniform, shared-bus structure like WIZ.
They just don't do that.  Especially, when people are going for cycles
that are only 10-15 gate delays, every gate delay counts.


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 20 Jun 2004 12:54:58 -0700
Message-ID: <ce9d692b.0406201154.12c8817c@posting.google.com>

anamax@earthlink.net (Andy Freeman) wrote in message news:<8bbd9ac3.0406200342.9d70683@posting.google.com>...
> old_systems_guy@yahoo.com (John Mashey) wrote in message news:<ce9d692b.0406191113.91865e@posting.google.com>...
> > The block diagram clearly shows how it is supposed to work:

> The problem with WIZ-space is not that there are dumb implementations,
> it's that there can't be good ones.

I agree: at least, I couldn't think of any [good ones that is], just
really bad [like the original] and less bad :-)


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 22 Jun 2004 10:21:55 -0700
Message-ID: <ce9d692b.0406220921.69ae7baa@posting.google.com>

"Stephen Fuld" <s.fuld@PleaseRemove.att.net> wrote in message news:<W8KBc.111259$Gx4.55125@bgtnsc04-news.ops.worldnet.att.net>...
> "David W. Schroth" <David.Schroth@unisys.com> wrote in message
> news:cb75q5$2488$1@si05.rsvl.unisys.com...
> > I'm going to try to comment on the current state of the descendants of
> > the 110x line...
>
> Great!  Good to get you back explaining things.  :-)

Correct me if I'm wrong ...

Unless I'm forgetting something, the ISA that started with the 1107's
introduction in 1962 is the *oldest* commercial ISA for which hardware
is still built [in Clearpath Plus2200].  Hence, its ISA changes, and
the aforementioned memory management issues are interesting, i.e., to
see how people cope with 40+ years of evolution, an eternity in the
computer business.
[The 1107 was limited to 64K 36-bit words... a very large machine.]
A lot of ISAs have come and gone in that time ... sometimes in just
one generation.

Bull DPS 9000s are descendants of the old GE 635, which I think was
a year or two later, as I recall.

Also, the Unisys Clearpath MCP servers are, I think, descendants of
the Burroughs B6500, introduced in 1966 [although shipped late in
1969].  [The B5000 was earlier, but that's a different ISA].

The UNIVAC and Burroughs mainframes of the late 1960s were well-known
as early multiprocessors.  [IBM wasn't yet into that much.]

Of course, IBM S/360 ISA is still with us; we recently had the 40th
anniversary party at the Computer History Museum.


From: old_systems_guy@yahoo.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: The WIZ Processor
Date: 24 Jun 2004 00:31:20 -0700
Message-ID: <ce9d692b.0406232331.3a42c3ed@posting.google.com>

old_systems_guy@yahoo.com (John Mashey) wrote in message news:<ce9d692b.0406181539.228f88a9@posting.google.com>...

> Good ISA & CPU design *requires* interdisciplinary skills rarely
> present in any one person.  Letting hardware designers loose without
> good software people around is a disaster, but letting software people
> spec hardware without good logic and circuit design input is at least
> as bad.  If software people want to play with ISA design, there's some
> minimal amount of digital design knowledge required about things likely
> to be fast or slow.

I got some side emails wanting to know more detail about
"interdisciplinary" and "minimal digital design knowledge" for
software people.  I've written some thoughts on the latter; now I'll
go at the former.

WHERE THIS IS GOING: BUILDING GOOD ISAS AND IMPLEMENTATION(S):

 WHAT DOES IT TAKE TO DO A GOOD JOB CREATING A FROM-SCRATCH ISA?
 WHO IS NEEDED TO DO THIS?
 HOW DO THINGS GO WRONG? WHAT ARE COMMON ERROR SYNDROMES?

  Just my opinions, of course, for what they're worth.

First,  back to 1986:

1. COMPLEXITY AS TRASH

From "RISC, MIPS, and the Motion of Complexity", an article/talk I did
for UNIFORUM Feb 1986.  This includes a spectacular performance growth
chart ...that ends around 8 VAX-mips :-)

"Complexity is like trash:
(1) If you're very careful, you may make less of it, but you cannot
make it disappear.

(2) If you sweep it under the rug and ignore it, you will be sorry.

(3) You may be able to live with your share of it, but not the whole
neighborhood's.

(4) The best you can do is push it around, hoping to spread it where
it causes the least trouble, or even acts like fertilizer."

For simplicity, I split the disciplines into:
- 1. Chip
- 2. Hardware Systems
- 3. Operating Systems
- 4. Compilers

and I had little cartoons of people trying to kick the garbage can
around.
[Sometimes it all landed on the chip person labeled "CISC" :-)]

A chart gave examples of features that were sometimes not implemented
where you might expect, at least for chips current when MIPS was
designed, in 1985.
In the following, - is where it might have been done; > or < is where
it was done.
1 2 3 4
C H O C
< -      Cache control [usually, at the time, off-chip]
- >      Read-modify-write [as explained earlier in this thread, we weren't
         ready to commit to synchronization instructions, and there were
         acceptable solutions on VME bus, etc.]
< - -    Substantial, but somewhat odd-for-the-time MMU on-chip
         It did fast translations of parts of the address space,
         and caused an exception if it didn't have a valid translation.
         It did no memory accesses, on purpose.
- - >    MMU (low-level), Cache control (high-level)
         Much was done in software in OS
<   -    Precise exceptions [I was running the OS group; I didn't want
         to hear about imprecise/unclean exceptions; Not Again.]
-     >  Pipeline scheduling - some (nowhere near all) went to compilers,
         and even in 1986, they certainly knew plenty about dealing with
         hazards and latencies and doing good scheduling.
         This was especially useful for low-frequency things like
         system coprocessor operations, where simple compiler scheduling
         simplified hardware.
    - >  Register saving [the compilers did a lot of this, but to be fair,
         this should have probably gotten some complexity from CPU as well.
         In general, though, we could avoid some OS work / CPU hardware
         because we trusted the compilers to do good register allocation.]

Oddly enough, sometimes when we moved a feature, or even got rid of
it, *everything* got simpler, and even better, some things worked
first time that had, in our experience, been endless sources of
trouble in the past.
For example, it surprised people that the MMU hardware never changed
the status of a page behind the OS's back, i.e., a write to a writable,
but not yet dirty page trapped ... to the OS ... which did what it
really wanted to do, and then explicitly diddled the MMU entry.  [OS
people explained to the chip people that on machines that let the
hardware mark a Page Table Entry Dirty, we had to fake it and trap
anyway, in order to get Copy-on-Write semantics for pages.]

A lot of this stuff was discussed in 3 papers in IEEE COMPCON 1986,
San Francisco [a chip paper, an OS paper, and a compiler/tools paper].

From this, and from trading war stories with people from other
architecture groups of that era, I'd summarize this in a few
observations.

This is a sequence of MAXIMs, i.e., things to do, intermingled with
specific ERRORS related to them.

2. MAXIMS AND COMMON ERRORS

MAXIM 1: DOING *GOOD*  *NEW* ISA DESIGN IS REALLY, REALLY HARD, SO BE
READY
 - Especially if it is for a general-purpose architecture that needs
   (some form of) binary compatibility, must support wide product range
   over multiple generations, to be used by many people writing many
   programs in a variety of languages, some of which don't exist yet.
 - Embedded [high-performance, low-power, low-cost, or combinations]
   have really stringent, albeit different requirements as well.
   Binary compatibility is less of an issue.

   And those lead to:
MAXIM 2: YOU NEED TO START WITH A DECENT SET OF REQUIREMENTS
 - What problems are implementations supposed to solve?
 - And for whom?
 - And in what price range?
 - And what will be its value proposition?
   If it's new, in what areas will it excel?
   [By the way, being "less bad than some of the bad ones on the market"
   is a non-starter: in practice, unless you have vast resources,
   you need to be noticeably superior to any *combination* of incumbents.
   I.e., if A & B are out there, and your new thing is better than A on
   task 1, and better than B on task 2 ... this does you no good whatsoever
   if B is better than you on 1 and A better on 2 ... because sensible
   people will have picked systems by what they do best, not worst.
   You have to do *something* better than all the incumbents.]

   More specifically, at the very, very least, I'd want to know a bunch of
   (external) requirements.
2A: in general, what is this for?
   general-purpose? embedded, and if so, which of the myriad domains?
2B: what sorts of applications will we excel in?
2C: what other software must we run?  [must work, need not excel]
2D: what languages must we support, and in what priority?
2E: what sorts of operating systems?
2F: are there any upward-compatibility/legacy issues?

And then there are all sorts of internal implementation issues, like:

2G: what's tradeoff between first-implementation and looking ahead several
generations of technology?

MAXIM 3: PUT TOGETHER A GREAT, EXPERIENCED TEAM THAT COVERS 100% OF
THE BASES

(I'll address the different disciplines some other time - the 4 broad
categories I listed earlier are incomplete approximations.)

You generally like to have plenty of people whose expertise crosses
domains to make sure things stay together.  That's the only way the
complexity-motion tricks and tradeoffs ever worked.

Suppose you cover 90% of the bases with great designers, and
everything they do is great.  Would you want to guess where you'll
have trouble?
I've mentioned before that having hardware and software people without
the others normally causes trouble, but it goes much deeper than that,
i.e., top OS people and optimizing compiler people often aren't the
same.

Every time you assume that some problem outside *your* expertise will
be handled somewhere else, you'd better have the attention of a real
expert there, or you are living a delusion in fantasy-land.

Here are some classic errors. Some are "Hope Springs Eternal" errors.

ERROR 1: MAGIC COMPILERS WILL SAVE ME #1.

I can write good assembler code for small examples, therefore I know
compilers can really optimize this well, and that's just an exercise.
I have some really cool features, I know the compilers can use them.

ERROR 2: MAGIC COMPILERS WILL SAVE ME #2.
My hardware has terrific parallelism, and I'm sure the
compilers-to-come will dig it out.  I give this its own item, because
it's been so wrong so many times.

Parallelism is always worth seeking, but:
High ILP is really tough in many apps.
Effective multiprocessor parallelism is non-trivial.
[The compiler gang at MIPS/SGI was awesome ... and there was a lot of
experience on multiprocessor systems ... and it was still hard.]

Occasionally these *are* true ... but almost always it happens because
the compiler and its optimization technology mostly existed *before*
the ISA and was used to design it, with input from outstanding
compiler people.
I'm hoping to discover more of what the Stretch folks are really
doing.

ERROR 3: APPLICATION PROGRAMMERS LOVE FLAGS.
The application programmers will fall over themselves to find the
right permutations from 50 compiler flags to get the best performance.
[There are important sets of users for whom this is true: a fraction of
those in the HPC space, and some in the high-performance or low-power
embedded spaces.]
For most apps on most general-purpose computers, it is false.

ERROR 4: MAGIC OS ALGORITHMS AND FIXES WILL MAKE IT WORK.
This happens, but sometimes there are hardware features that require
an OS to be telepathic and precognitive to do the best thing.

ERROR 5: TOO MANY FIRST IMPLEMENTATION ARTIFACTS

Sometimes this happens because you are just too resource-constrained
to do what you'd like to, and you have to ship product or go under.
So be it.

Sometimes it is just not thinking far ahead enough soon enough.

I did mention some MIPS items of this ilk, like the lack of 64-bit FP
load/store, and the awkwardness of muldiv.

The "running out of address bits" problem is probably the most
infamous and recurrent.

- The S/360's 24-bit addressing limit was an artifact to help the
low-end 360/30, which had to struggle with 8-bit wide data paths.
[fixed later]
- The PDP-11 ran out of address bits way too quickly. [not fixed -->
VAX]
- The MC68000 repeated the S/360 problem exactly [fixed by 68020].

All of those were committed by good architects who are friends of
mine, and whom I respect highly ... even though I suffered through
every one of these :-)

Most of us did get it right going from 32 -> 64 bits [i.e., check the
hi-order bits rather than ignoring them, so clever programmers don't
use them for something else, like we did on S/360s and early Macs.]

ERROR 6: MAKE THE SAME ERROR AGAIN, SHAME ON ME
I used to say every mistake happened 3 times (mainframes, minis, and micros),
but now every mistake happens four times: we've added SoCs.

ERROR 7: SYSTEMS COMPANY WORLDVIEW / CHIP COMPANY WORLDVIEW

This is a painful one: you have terrific teams who are really good at
designing good hardware/software systems for sale ... but then it
turns out that somebody would really like to sell the CPU chips
outside (for strategic, or volume economics), and sometimes your
assumptions get jerked around, and you have to make changes.

For instance, consider RS/6000 -> PPC.

I conjecture that some of the Alpha byte/halfword addition was the
unwillingness of some outside buyers to do the complicated I/O mapping
just to deal with partial-word I/O registers.  [Comment? Dick Sites?]

I say this is painful, because often people were in fact well-focussed
on their requirements ... but then the nature of the systems
businesses changed.

Of course, chip companies have often suffered from NON-systems
thinking, if they don't stay close enough to what their customers
really want to do.  This is perhaps less of a problem today, as there
is so much of a system integrated onto one die, that a lot of the
system design choices are already done.

MAXIM 4: EVEN GREAT TEAMS MAKE MISTAKES

I've mentioned some I consider MIPS mistakes.

Consider that the IBM S/360 [a great team] gave us JCL ["The worst
language ever designed" according to Fred Brooks, under whom it was
done], and the 24-bit addressing mentioned above.

Rich Witek and Dick Sites [the lead Alpha architects] are outstanding,
and even  had the advantage of watching earlier RISCS ... and Alpha
was fine, but they still changed that pesky byte/halfword thing pretty
quickly.
Google Search comp.arch: Dick Sites alpha mashey  gets you old
arguments.

When people sit around the bar talking, where Marketing won't hear
them, almost everybody will admit to wishing they had done something
different.

Of course, weak/incomplete teams make more mistakes, often leading to
architecture death within 2 generations, even with good resources.
Consider the Intel i860, an interesting chip in many ways.
In 1989, I organized a Hot Chips session called "Compiler Issues with
HOT Chips", with 4 fine compiler guys who could discuss their views
of any chips they chose.  It was a lively session.  Most of them chose
the i860.
The session should have been called "Compiler Writers' Revenge" or
maybe "Attack of the Killer Compiler Gurus."

MAXIM 5: KNOW WHERE THE WORST BUGS USUALLY LURK

Most designers get individual elements right - if an adder doesn't
work, something is wrong, but you find it fast.
The excruciating bugs most often come from under-specification of
complex combinations, i.e., not thinking of all the weird things that
could happen, and making sure something sensible does happen.

Historically, it has often been up to OS people and sometimes compiler
people to work around these temporarily or even permanently.

ERROR 8 - ASYNCHRONOUS ACCIDENTS
If you don't make *sure* asynchronous events cause no trouble,
they will cause trouble.  This usually comes from under-specification,
i.e., not thinking through the various cases.

In a simple CPU, basic interrupt handling is no problem, but it is
very easy to make mistakes with asynchronous or parallel units, or any
of the architectural tricks used to go faster.

Of course, more aggressive memory systems require serious thought
about the semantics of memory ordering.

ERROR 9 - INVIDIOUS IMPRECISION
If you have imprecise exceptions, and don't worry about it too much,
there will be trouble sooner or later.  There are places you can get
away with this, but you have to be very, very careful. Do not assume
that an exception terminates a process, or that such termination is the
only thing that can happen ... because in practice, there have often
been important programs that grew up on simpler processors and assumed
they could, for instance, catch a signal, fix something, and then
restart ... and you have to be really careful that there is some way to
do that sensibly.  It cannot be a TBD.

A classic case was the old Bourne shell's speedup trick of catching the
fault when a pointer ran off the edge of allocated memory, then
allocating more memory and returning.  This didn't work on some
Motorola chips, which caused no end of hassles for UNIX ports in the
early 1980s.

An even stronger example was the great talk Mike O'Dell did for me at
a USENIX about the problems in doing UNIX on a GaAs processor trying
to go so fast that the OS was often in the position of trying to "hold
off" memory stores that were "almost" to memory while desperately
trying to rearrange something underneath.  As Mike pointed out, one
must be very careful of the implicit contracts between software and
hardware.

Of course, one must read IEEE 754 carefully [i.e., there are strong
temptations to make some FP exceptions imprecise in some pipeline
designs.  Be very careful.]

ERROR 10 - MALINGERING MMUs
There is almost always some weird special case with MMUs, especially
around page-crossings, or (with VAX as extreme) requiring huge numbers
of entries to make forward progress, or discovering that there is some
weird case where you have an N-way associative design, and somehow,
one case requires you to have N+1 valid entries, and in the rare case
where all N+1 end up in the same row, they just thrash around, but
maybe only if you get a page fault in the wrong place.

I don't remember the exact details, but one of the R4x00 chips had
some bug where if exactly the wrong 2-instruction sequence (probably
involving a delayed branch) crossed a page boundary, and the second
page missed, then Something Bad happened ... and it got through tons
of testing and systems were ready to ship before it got discovered ...
so the OS had to look at the end of each page for this and modify it
.... and this was on a "simple" RISC.  I remember legions of these
things in the CISC micros of the early 1980s ... one of the reasons we
insisted on a simple-acting MMU *we* could control.

ERROR 11 - CANTANKEROUS CACHES; I/O
there are often problems with caches, if they haven't been well
thought out, and the problems may actually be with the ISA, not with
the cache itself.   It is almost always a mistake to think it's TBD,
i.e., that you can always straightforwardly add caches onto a design
where you haven't thought of it.  It's always better to work through
the details ... and then leave the caches out if you don't want them.

Many systems support uncached access or cache bypass, but:
- How is it actually implemented?
- Does it have any implicit synchronization effects on other
  memory operations that might be in progress?
- How does it interact with hardware like byte-gathering write buffers
  or prefetchers?
- Does it flush items that happen to be in the cache?
- How is it represented to the programmer?
  - Page Table Entry Bits?
  - Asymmetric address space?
  - A global mode bit?
  - Opcode or operand specifiers?
- How do languages invoke this?
- How do optimizers deal with it?
- How do OS and device drivers deal with it?

There are numerous assumptions and approaches, and some work a lot
better than others ... and there was a lot of painful learning in the
1980s and 1990s.
Depending on the nature of I/O, there are interactions there.

Of course, there is always the issue of correctly speccing any
user-level ISA operations for dealing with caches.  If you generate
code on the fly, how do you deal with it?  Are I-cache and D-cache
synched (usually not)?  How do you do cache invalidation?

ERROR 12 - MULTIPROCESSORS HAVE MULTI-PROBLEMS.
Multiprocessors have problems that uniprocessors don't.  They may not
be hardware problems, but they often cause software surprises.  We're
a lot better about this now ... but hyperthreading is *not* the same
as having 2 CPUs with their own distinct cache systems, so some
software needs rethinking.

I guarantee that SoCs will replicate many of the weird issues faced by
multiple-micro-system designers of the 1980s and 1990s ... although
fortunately, it is some of the exact same people doing it, and they've
learned.

ERROR 13 - THE WHOLE IS NOT THE SUM OF THE PIECES
the pieces all work fine [interrupts, exceptions, MMU(s), cache(s),
multiprocessor] individually, but there's some rare, strange
interaction nobody thought through.

It may well be that you discover the bug in time, but the fix suddenly
blows up a piece of synthesized code and breaks the
carefully-constructed layout of your chip.  [This happened late in the
game on MIPS R4000 with the "Hunk O' Logic" in the middle of the chip.]

[I just had to make this #13].

ERROR 14 - A BAD DECISION MAY BE "FOREVER"

The essence of good architecture is to be able to make decisions that
you can live with a long time, always facing limited hardware and
software resources.
Once a feature is in, you usually can't wish it away later, unless
you've figured out some really lean emulation/conversion mechanism, or
are in an embedded space where you can get away with it.

ERROR 15 - CONTEMPLATE NO CHANGES
not thinking about workable extension mechanisms up front, i.e.,
reserved opcodes, registers, coprocessor interfaces, etc.  By the way,
it's often easy to add hardware later ... and then discover it takes a
long time to get real-life improvements from it.

I like to point at Tensilica as an example of interesting thinking
here, because their tools create hardware & software extensions
together ... of course this was done by some of the very best combined
hardware/software people I know :-)
  I've given some examples from MIPS-land where we didn't think quite
hard enough about this.

ERROR 16 - LET ME THINK ABOUT IT, SOME MORE, SOME MORE
*not* making decisions.
Needless to say, if no decisions are made, there is nothing worth
looking at.

That leads to the last:

MAXIM 6: THE DEVIL IN THE DETAILS IS THE LEAST OF THE PROBLEMS
you must work out enough of the details, and if you haven't done this
much, you will get surprised by the depth to which you must go to get
ideas even worthy of being considered.  Some universities insist on
implementation projects to make sure students learn the gritty
realities.

People say "The devil is in the details" ... but that doesn't go far enough:
there are Orcs and imps and gremlins and all sorts of ugly things down
there.
It's a bit of an art to know how deep you have to go to know whether
it's worth pursuing.  I've left off all sorts of implementation and
ISA issues like reliability and fault-tolerance, but they're probably
overkill for this discussion.

SIMPLE SUMMATION
So, I'll end with a proposed, barest-minimum set of things I'd ask of
even a simple processor.  This starts with the MAXIM 2 requirements.
[Note, BTW, that these days, I see a fair number of new architectures
a year, because some of the time I do due-diligence consulting to help
VCs decide whether to invest in {chips, systems, software}, and then
sometimes work with companies post-investment, or sit on tech advisory
boards.]

A. *What is this for?  In what apps will it excel?  Why?

B. *What languages need it to support?  (and explicitly, not support)
   What sort of compiler technology?  Does it already exist? Where?

   What sort of operating systems?

C. *Define the user-level ISA: operations and state.
   [I have zero interest in definitions that keep changing around,
   depending on the question.  Real architects make some decision,
   then evaluate it deeply enough to understand it.]

D. *Define the calling conventions, stack manipulation, etc.
   Exactly how are arguments passed?
   [I would look here for answers that show the proposer understands
   the issues and knows about the ticking bombs.]

E. Describe protection model, additional machine modes, if any.
   If extra modes (like kernel) have extra state or operations,
   describe them, plus software interactions with OS.

F. Describe exceptions, interrupts in believable detail.

G. Cache, MMUs, uncached modes, if any.

H. I/O

I. OS interfaces with the above.

=========
For user-level stuff, A-D is a pretty minimal set.
BTW, a lot of great-sounding schemes fall apart via ERROR #1, wishful
thinking by non-compiler experts.  Also, item D (calling conventions)
can have a lot of pitfalls on some kinds of architectures.

=========

Of course, I think the relevance of these to evaluating WIZ should be
fairly clear.  Good night.
