From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.sys.unisys,comp.arch
Subject: Re: UNISYS A series question / stack architecture question
Date: 25 Mar 1997 08:04:23 GMT

In article <333017C7.69A9@informatik.tu-muenchen.de>, Bernd Paysan <paysan@informatik.tu-muenchen.de> writes:
|> Michael Gschwind wrote:
|> > 
|> > Many people (including Hennessy and Patterson) claim that stack
|> > architectures cannot be implemented efficiently.  However, AFAIK,
|> > the Unisys/Burroughs A series machines have led a pretty healthy life
|> > and managed to work around problems found in typical stack
|> > architectures.
|> 
|> IMHO Chuck Moore's F21 and uP21 are a much better example of efficiently
|> implemented (both metrics: die space and cycle time) stack machines. It
|> also shows that H&P's arguments are totally bogus. You can make slow
|> stack architectures, and you can make fast ones; whether it's fast or
|> slow certainly depends on the ISA and on the implementation. If you do
|> one thing wrong, the whole thing slows down and becomes expensive.
Time for a repost of a 10/10/1994 posting:
Article: 43730 of comp.arch
Newsgroups: comp.arch
From: mash@mash.engr.sgi.com (John R. Mashey)
Subject: Re: References to papers on stack machines?
Date: Mon, 10 Oct 94 11:23:02 PDT

In article <CxCAHo.GuE@world.std.com>, ulrich@world.std.com (Thatcher Ulrich) writes:

|> Isn't the AMD 29000 sort of a stack machine?  Like a Sparc with
|> variable sized register windows?  Come to think of it, Sparc could be
|> considered stack-machine-ish...

No, please.

While classifications are not always perfect, when we use terms like
"stack machine", "accumulator machine", "general-register machine", RISC,
etc, people who are familiar with the area usually have some cluster of
characteristics in mind.  In particular, there is a common confusion
that "uses stacks" and "Stack architectures" are the same thing;
they are not.

Machines most commonly called stack machines [Burroughs
B5000, HP3000, etc] typically have ALU operations with no explicit operands,
i.e., the operands are implicitly accessed at (or near) the top of the
stack.  

General-register machines (which include S/360, PDP-11, VAX, 68K, X86,
and RISCs such as the 29K and SPARC) all have explicit specification of
more than one general-purpose register.  Some of them have addressing modes and
instructions specifically for using stacks, but it would really be
hard to call any of them stack machines.

 
As usual, a good reference would be Hennessy & Patterson, COMPUTER
ARCHITECTURE: A QUANTITATIVE APPROACH, pp. 90-92, section 3.2 "Classifying
Instruction Set Architectures".

As they note: "A stack cannot be randomly accessed.  This limitation makes
it difficult to generate efficient code.  It's also difficult to implement
efficiently, since the stack becomes a bottleneck."
An example of the difference, derived from their example:
Consider C = A + B
Stack		Register (but load/store as well, no memory->register ALU ops)
PUSH A		LOAD	R1,A
PUSH B		LOAD	R2,B
ADD		ADD	R3,R1,R2
POP  C		STORE	R3,C
*small*		*this is likely to be larger code*
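
To make the 0-address model concrete, here is a toy C sketch of an
interpreter for exactly that instruction sequence (the opcode names and
memory layout are mine, invented for illustration, not any real machine's):

#include <stdio.h>

/* Hypothetical 0-address machine: PUSH/POP move values between memory
   and an implicit operand stack; ADD takes both operands from the stack. */
enum op { PUSH, POP, ADD, HALT };
struct insn { enum op op; int addr; };   /* addr used by PUSH/POP only */

int mem[16] = { 3, 4, 0 };               /* mem[0]=A, mem[1]=B, mem[2]=C */

int main(void) {
    struct insn prog[] = {               /* C = A + B */
        { PUSH, 0 }, { PUSH, 1 }, { ADD, 0 }, { POP, 2 }, { HALT, 0 }
    };
    int stack[8], sp = 0;                /* sp = number of entries on stack */

    for (struct insn *p = prog; p->op != HALT; p++) {
        switch (p->op) {
        case PUSH: stack[sp++] = mem[p->addr]; break;
        case POP:  mem[p->addr] = stack[--sp]; break;
        case ADD:  sp--; stack[sp-1] += stack[sp]; break;
        default:   break;
        }
    }
    printf("C = %d\n", mem[2]);          /* prints C = 7 */
    return 0;
}

Note how ADD names no operands at all: that implicitness is where the
code-density win comes from.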

Needless to say, it has been straightforward for years to generate reasonable
code for a stack machine; beginning compiler classes have done that for years.
Register machines, especially with larger numbers of registers, really want
to have good global optimizers.  For example, suppose you have:
	C = A + B; B++; 
then you would add, to each of the above:
PUSH B		ADD	R2,R2,1
ADD  1		(STORE R2,B  ..if needed in memory now, but not until then,
POP B		and maybe never, if B were a local automatic variable)
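
To back up the "beginning compiler classes" claim, here is a toy C sketch
(mine, not from any production compiler) of the classic postorder tree walk
that emits correct stack code with no register allocation whatsoever:

#include <stdio.h>

/* Toy expression tree: leaves are variable names, internal nodes are '+'. */
struct node {
    char var;                 /* leaf: variable name; 0 for internal node */
    struct node *left, *right;
};

/* Postorder walk: emit code for both operands, then the operator.
   This is the entire "register allocator" for a pure stack machine. */
void gen(struct node *n) {
    if (n->var) {
        printf("PUSH %c\n", n->var);
    } else {
        gen(n->left);
        gen(n->right);
        printf("ADD\n");
    }
}

int main(void) {
    struct node a = { 'A', 0, 0 }, b = { 'B', 0, 0 };
    struct node plus = { 0, &a, &b };
    gen(&plus);               /* emits: PUSH A / PUSH B / ADD */
    printf("POP C\n");        /* the assignment target */
    return 0;
}

The register-machine equivalent has to decide which values live in which
registers and for how long, which is exactly where global optimizers earn
their keep.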

Note that the stack model especially doesn't fit the needs of current
high-performance floating-point code, where there is plenty of register
pressure even on 32-fp-register machines, i.e., compilers can make good
use of large numbers of registers. 

I always admired the elegance of the B5000, as a machine designed
with compilers and OS's in mind; it fit the technology of the day well,
but the tradeoffs have changed - code density is not so important,
the stack is a bottleneck for very high-performance implementations,
and global optimizing technology is widely available.

Note that stack machines achieve good code density, hence might well still be
suitable for certain kinds of embedded applications.  

With the exception of the B5000's
descendants, and with the phase-out of the HP3000 and Tandem stack
architectures, stack machines are now seldom used in high-performance /
general-purpose systems.  Kudos to the Unisys (B5000...B6xxx..A-series) line
for longevity: not only was the B5000 the first major stack machine, its
descendants look like the last remaining general-purpose stack machines. 

I'm not sure why, but there always seems to be a recurring wish/hope/nostalgia
for stack machines to come back.  Unless something drastic happens, this
seems unlikely to me, at least for the general-purpose market.

=====================
New stuff: I looked in
http://www.unisys.com/marketplace/aseries/hardware/lrgscale/a28spec.htm
to ascertain the current state of the art: it's hard to get direct
comparisons; perhaps someone from A-Series-land can comment.  I did find
a few quotes on the ClearPath A 2800 (high-end), announced for 6/96
availability.

"The ClearPath A 2800 Series design incorporates "leading edge" 0.5 micron CMOS
technology throughout the system. The most noticeable applications of this new
technology are in the A 2800 Instruction Processor and Input/Output Subsystem.

The new SCIP (Single chip CMOS Instruction Processor), mounted on a
single Processor Module (PM) with two companion 0.5 micron CMOS Second
Level Cache chips, is the fastest commercial CMOS processor
implementation currently available.
... up to six processors in a single air-cooled, rack-mount cabinet.

The ClearPath A 2800 High Performance Enterprise Server uses the
    state-of-the-art 16 Mbit memory technology to increase memory
    density, lower price, and increase reliability. A single-domain
    ClearPath A 2800 Series system."

Note that the low-end A2100 is a software emulation built on top of a
166MHz Pentium, and the A2400 uses a 32MHz "SCAMP" chip.
http://www.unisys.com/marketplace/clearpath/brochure/a2ng.htm
and from other web pages, one finds there is a 200X performance difference
between the low end (Pentium-166 emulating MCP) and what I *think*
is a 12-CPU A2800 system.  I think this says that whatever clock rate the SCIP
runs at, it looks like it is 200/12 ≈ 17, i.e., ~15-20X faster than a Pentium
emulating it.




-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.sys.unisys,comp.arch
Subject: Re: UNISYS A series question / stack architecture question
Date: 26 Mar 1997 00:33:31 GMT

In article <333f1858.7679448@dnntp.srv.cs.cmu.edu>, koopman@cmu.edu (Phil Koopman) writes:
|> 
|> John -- Usually your posts are well-reasoned and on-target. <a>  In this
|> case I think you're off-target because you've assimilated some
|> persistent but incorrect concepts. <b>

Thanx  <a>, but I don't think so <b>.

|> mash@mash.engr.sgi.com (John R. Mashey) wrote:
|> >Needless to say, it has been straightforward for years to generate
|> >reasonable code for a stack machine; beginning compiler classes have
|> >done that for years. Register machines, especially with larger numbers
|> >of registers, really want to have good global optimizers.
|> 
|> The misconception that stack machines of necessity have poorly optimized
|> code persists, but is unfounded.

I not only do not believe this, but I certainly didn't post anything
that implied this: I don't understand how the statement:
"People can generate reasonable code for stack machines" can be read to
mean: "stack machines of necessity have poorly-optimized code." (???)


|>				I find it ironic that RISC designers
|> and computer architecture textbook authors who know well that only
|> complex "real-world" benchmarks are meaningful, tend to dismiss stack
|> machines with the classical example of a load-load-op-store code snippet

There are 2 possibilities:
	(1) This simple example convinces me that something is true.
OR	(2) I think something is true, due to long history,
	substantial accumulated experience, and few counterexamples,
	and, without spending a huge time: here is a simple example.
	Maybe the example is too simple.

My net posting fits in (2); I couldn't say authoritatively so
regarding "textbook authors" in general, but certainly, some such
authors that I know also fit well in (2).

|> There is suggestive evidence that it may be possible to do reasonable
|> stack scheduling that is almost as good at capturing locality as
|> register scheduling, although this is not thoroughly researched.  IMHO
|> the reason we don't have efficient stack compilers is because the effort
|> hasn't been expended, not (as many people think) because it is
|> fundamentally impossible.  Now, Java aside there may not be sufficient
|> commercial incentive to *produce* such compilers.  But that is not at
|> all the same as saying it can't be done, which is the typical argument.

(I thought from the start of this that there *were* efficient stack
compilers (?))  However, as an Old Guy, I claim that this general
assertion ("people haven't tried") is incorrect, and I base this
claim on the following:

1. (Bell, Mudge, McNamara, Computer Engineering-A DEC View of Hardware
Systems Design, DEC, 1978, p250-251) has good comments on the arguments,
ending with the note:
"The basic design decision that sets the PDP-11 apart was based on the
observation that by using truly general registers, and by suitable
addressing mechanisms, it was possible to consider the machine as a
0-address (stack), 1-address (general register), or 2-address
(memory-to-memory) computer..."  and of course, the VAX did this also,
and so do X86's and 68Ks, in varying degrees; the X86 uses a stack
model for floating-point.

2. Over the years, fairly serious efforts by large numbers of
people have gone into the optimization of code for the PDP-11, VAX, 68K,
and X86, and if a compiler writer chose to generate code that used
push/pops at least some of the time, they would be rewarded with better
code density.  

3. A major point of the stack computers is to reduce the code
size by implicit addressing of the top (or top few) elements of the
stack.  If one builds a hybrid (like some of the aforementioned ones),
it now depends on the specific choices as to the ways that people can
use them, and here it gets a little fuzzy (but this may help answer
Richard O'Keefe's posting): I'd label an A-Series a stack machine,
because many of its operations indeed use 0 addresses, and implicitly
address the top-of-stack, or nearby. Of course there is stack-relative
addressing, i.e., offset(stack) or offset(frame) is a good substitute
for a larger memory address.  To the extent that a stack machine
grows registers with direct identifiers, and the extent to which
general operations can be done on them, to that same extent it starts
acting more like a general-register machine, and the code density may
get worse ... and interestingly, that is exactly what has tended to
happen: as compiler register-allocation technology has improved,
people have tended to use that fact, rather than (short) implicit
addresses, on machines that provided a meaningful choice.

4. Despite the fact that *many* design teams of current processors
had *serious* experience designing hardware and compilers for machines
with various kinds of help for stacks ... they chose not to do much with
them in most of the 1980s architectures.  This doesn't say the method may
not return, and it doesn't say that there are not reasonable applications,
especially where small code size matters, as in embedded applications,
especially integer ones.


5. HOWEVER: stack architectures were *NOT* avoided in the 1980s
RISC architectures because the designers didn't know about them,
or have any experience with them, or had never written compilers
for machines that had stack operations.  They were avoided for
exactly the disadvantages cited by H+P:


|> >As they [H&P textbook] note "A stack cannot be randomly accessed. This
|> >limitation makes it difficult to generate efficient code. It's also
|> >difficult to implement efficiently, since the stack becomes a
|> >bottleneck."
|> 
|> With appropriate compiler technology the stack doesn't seem to be all
|> that much of a bottleneck.  Although typically there is little reason to
|> do so, there is no reason it can't be randomly accessed if you like
|> (typically with a multi-clock operation in keeping with its infrequent
|> use).  Stack CPUs have been built just that way.

Let me explain what they meant:
(1) Any reasonable stack architecture has some form of memory addresses
based off the stack, or off a frame pointer, or more generally
a Display (i.e., the D registers in B5000 descendants).
I'm sure H+P are aware of that :-)
(2) What H+P mean is that stacks don't have the same random access as a
register file.  (Somebody wondered if fast caches weren't as fast as
registers, making the issue moot.)  Wrong: even L1 caches sometimes
have an extra cycle of latency, and L2 have more.  Even worse,
consider a "simple" 2-issue superscalar chip: likely, each register
set (int and FP) has 4 read ports and 2 write ports (to handle
2 3-operand register-register operations), or more in some cases.
L1 caches have 1-2 (read+write) ports.
In a register file one can place a bunch of values with fairly
long lifetimes, and they can be combined in random order.
That is *especially* important with many kinds of floating-point codes,
but at least some integer codes as well maintain fairly long
register lifetimes.  Some don't, or don't need to.
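
A concrete (made-up, but typical) illustration: in a C kernel like the
following 4-way unrolled dot product, a good compiler keeps all four
partial sums, both pointers, and the counter in registers for the entire
loop, combining them in whatever order the schedule wants:

/* Four partial sums with loop-long lifetimes: a register machine keeps
   s0..s3, x, y, and i in registers across every iteration. */
double dot4(const double *x, const double *y, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i]   * y[i];
        s1 += x[i+1] * y[i+1];
        s2 += x[i+2] * y[i+2];
        s3 += x[i+3] * y[i+3];
    }
    for (; i < n; i++)        /* cleanup when n isn't a multiple of 4 */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}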


|> Now superscalar gets a bit more interesting and controversial, but the
|> arguments we are discussing here go *way* back.

<Example deleted: one can argue endlessly about simple examples;
maybe that one was too simple; my problem is that I've seen good
optimizers allocate all sorts of stuff to registers, for relatively
long lifetimes.>

Finally, let me just allude to the complexities of superscalar and/or
out-of-order designs: keeping things straight, in the heart of the CPU,
often requires:
	(a) Comparisons of (5-6-bit, usually) register tags, for
	doing register dependencies, renaming, etc.
	(b) Comparison, often with associative lookups, of much larger
	addresses, typically of the smallest movable unit, i.e., 64 bits
	in many current CPUs.  For instance, it's fairly important to
	know, when handling a load instruction, if any of the following is
	true:
		- that a logically earlier store is not yet complete,
		  so need to wait.
		- that an earlier load caused a cache miss, and is already
		  in progress, but don't schedule a wasted cache miss.
		- etc, etc, especially in cache-coherency situations,
		  and chips that can do multiple load/stores / cycle.
	Some modern CPUs have big queues of these things; a sketch of
	the address check appears after this list.
	(c) Designers don't mind too much doing an associative lookup of
	wide addresses, if it's done in parallel with something else that
	takes a while.    Having to do address manipulations just to
	figure out where the registers *are* is not something that
	thrills CPU designers :-)
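
For concreteness, here is a minimal C sketch of the check in (b); the data
structure and field names are mine, not any particular chip's.  A load must
search all logically earlier, still-pending stores for an address match
before it can safely go to the cache:

#include <stdbool.h>
#include <stdint.h>

/* One pending (not yet retired) store; ages are assigned in program order. */
struct store_entry {
    bool     valid;
    bool     addr_known;    /* has the store's address been computed yet? */
    uint64_t addr;          /* byte address of an 8-byte store */
    unsigned age;
};

#define SQ_SIZE 16
struct store_entry sq[SQ_SIZE];

/* May a load with this address and age issue?  Every entry gets compared
   (the associative lookup): an older store to the same 8-byte unit, or an
   older store whose address is still unknown, forces the load to wait. */
bool load_may_issue(uint64_t load_addr, unsigned load_age) {
    for (int i = 0; i < SQ_SIZE; i++) {
        if (!sq[i].valid || sq[i].age >= load_age)
            continue;                          /* empty slot, or younger */
        if (!sq[i].addr_known)
            return false;                      /* can't prove no conflict */
        if ((sq[i].addr >> 3) == (load_addr >> 3))
            return false;                      /* conflict: wait or forward */
    }
    return true;
}

int main(void) {
    sq[0] = (struct store_entry){ true, true, 0x1000, 5 };
    /* a younger load (age 9) to the same 8-byte unit must wait: */
    return load_may_issue(0x1000, 9);          /* returns 0: must wait */
}

Note the contrast with register tags: this loop compares wide addresses
against *every* queue entry, where a register dependency check compares
5-6-bit fields known at decode time.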

|> Well, considering that the embedded market dwarfs the so-called
|> general-purpose market, stack machines may yet have their day.
|> Especially when, as you say, code density really matters.  IMHO the
|> widening latency gap between CPUs and DRAMs constitutes something
|> drastic happening.  We'll see where it leads.

But the performance problem is more on the data side.
Code density is worth addressing (hence ARM Thumb, MIPS16, etc),
but much more because everything has to *fit* in memory
than to reduce I-cache misses.

|> Does it really *matter* whether stack machines actually "have their
|> day"?  Ultimately it really matters only whether companies can provide
|> cost-effective technology to the marketplace.  And there is abundant
|> evidence that the market doesn't always go for the "best" technical
|> solution.  But it would be unfortunate for engineers or researchers to
|> dismiss _possibly_ better alternatives _for some applications_ based on
|> faulty understanding, as has been too often the case with stack-based
|> CPUs.

One more time: many designers who didn't prefer pure stack machines
didn't get that way from *ignorance*...

Conjecture:
	(1) Pure stack machines have already missed the high-performance
	embedded markets, and classic pure stack machines really don't
	seem to fit codes with much FP at all [if somebody wants to
	point at explicit counterexamples, please post].
	(2) If stack machines want to play in the large embedded markets,
	their potential advantage is in code density, but they have to
	beat:
		(a) CISCs like X86 and 68K
		(b) RISC/CISC borderlines like i960.
		(c) Compressed-RISCs, like Thumb & MIPS16.
	all of which are already there, with plenty of software
	support, manufacturing economies, and numerous versions.
	(3) Maybe there is room for something more along the lines of
	PDP-11 or AT&T's CRISP [which are hybrids, certainly not pure
	stack machines] ...  but then, they get awfully close to
	(a), (b), and (c) ... and it's not at all obvious that there
	is enough edge left any more to allow this.

But, for the last time: many computer architects have chosen not to
build pure stack machines, and they may be right or wrong in that,
or the tradeoffs may change,
but if they were wrong, it was usually not from ignorance / misconception.

It may be that many people believe "A", but do not know very much, and
"A" may or may not be correct, but that those people are ill-informed
is *not* an argument that "A" is wrong, just that some people are ill-informed.

On the other hand, if people whose depth of knowledge and experience
impresses you over many years, and your own lesser experience, agree that
"A" is correct, and it seems difficult to find counterexamples,
and nothing leaps up at you as a major reversal of assumptions, then
you are strongly tempted to believe "A" ... and sooner or later,
get tired of arguing about it [which is why H+P mentioned it, and then
explained why they'd only analyze register machines from then on.]

-- 
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.sys.unisys,comp.arch
Subject: Re: UNISYS A series question / stack architecture question
Date: 27 Mar 1997 03:25:22 GMT

In article <abaum-2603971517190001@terrapin.pa.dec.com>, abaum@pa.dec.com
(Allen J. Baum) writes:

|> In article <5h9qsr$46b@murrow.corp.sgi.com>, mash@mash.engr.sgi.com (John
|> R. Mashey) wrote:
|> 
|> 
|> > (2) What H+P mean is that stacks don't have the same random access as a
|> > register file.  (Somebody wondered if fast caches weren't as fast as
|> > registers, making the issue moot.)  Wrong: even L1 caches sometimes
|> > have an extra cycle of latency, and L2 have more.  Even worse,
|> > consider a "simple" 2-issue superscalar chip: likely, each register
|> > set (int and FP) has 4 read ports and 2 write ports (to handle
|> > 2 3-operand register-register operations), or more in some cases.
|> > L1 caches have 1-2 (read+write) ports.
|> 
|> I think I'll jump in here and disagree a little.
|> Highly parallel superscalar register files with renaming already
|> pretty much treat the register file as a (highly) associative cache-
|> they have to map the virtual register number into a physical
|> renamed register number. This doesn't seem to me to be all that
|> different than a cache lookup. Call it a level 0 cache if it makes
|> you happier- it gets done during the D stage(s), and it has multiple
|> ports.
|> 
|> A top-of-stack cache might not have to be terribly different, and
|> possibly not even terribly deeper. I suspect the hit rate would
|> be extremely high.

Since Allen generally knows what he's talking about, this bears some
more discussion to get clarification.  A lot of renaming implementations
deal with 5-6-bit quantities, but many of the (fast) operations use
indexing, not associative lookup; rename tables often have large
numbers of ports (R10K's is ~12 read ports).

Are there any published references people can point at for
heavily-superscalar, out-of-order, renamed stack machines, i.e.,
that discuss the internals? or web pages, or public information?
Can people summarize these, if so?

I don't have the time to try to go deep into examples: maybe somebody
more knowledgeable may.  I just observe that:

1) In a register machine, a load's target, and a store's source, are
usually 5-6-bit designators.  It seems relatively easy to build fast
structures that are indexed by 5-6 bits, and to have 5-6-bit comparators
for things like renaming and dependency checks.

2) Consider a typical 32-bit RISC instruction.
	- Its register fields are visible directly in the instructions:
	  no instruction can "surprise" you, i.e., people don't allow
	  instructions like:
		move	reg1,reg2
			copy the value of reg2 into the register
		        whose number is found in reg1.
	  This sort of thing is very awkward.
	- Thus, you can determine register dependencies with comparisons
	  of small integers, and do it early in the pipeline, and
	  you sometimes get to use fast indexing.
	- Memory addresses are harder, and require wider comparisons,
	  and thus associative lookups, to keep queue-ordering OK.
	  (Everybody has this problem on anything that wants to have
	  multiple memory operations outstanding).
	- If you do renaming, and speculation, you must have plausible
	  ways to roll back the register state to the last place you
	  mis-guessed, and go on from there.  On R10Ks, that means
	  cancelling stores you didn't really get to, and restoring
	  the (small) map of logical->physical registers (see the
	  sketch below).
	  [I'm not sure what a similar stack machine would do, since
	  there would appear to be an arbitrary number of instructions,
	  and thus stack locations, modified between branches.
	  I.e., you need to restore the stack pointer, and the state of
	  the stack.]
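
Here is what that (small) map looks like, as a toy C sketch; the sizes and
names are mine, invented for illustration.  The point is that because the
map is tiny and indexed by a 5-bit logical register number, checkpointing
it at a branch and restoring it on a mispredict are cheap:

/* Hedged sketch of rename-map checkpointing, not any real chip's design. */
#define NLOGICAL  32
#define MAXBRANCH 4

unsigned char map[NLOGICAL];                 /* logical -> physical register */
unsigned char ckpt[MAXBRANCH][NLOGICAL];     /* one saved copy per branch */
int nckpt = 0;

void on_branch(void) {                       /* speculate past a branch */
    for (int i = 0; i < NLOGICAL; i++)
        ckpt[nckpt][i] = map[i];
    nckpt++;
}

void on_mispredict(int level) {              /* roll back to branch 'level' */
    for (int i = 0; i < NLOGICAL; i++)
        map[i] = ckpt[level][i];
    nckpt = level;
}

int main(void) {
    for (int i = 0; i < NLOGICAL; i++) map[i] = (unsigned char)i;
    on_branch();                             /* checkpoint at the branch */
    map[3] = 41;                             /* speculative rename of r3 */
    on_mispredict(0);                        /* wrong guess: r3 -> 3 again */
    return map[3];                           /* back to 3 */
}

A stack machine's equivalent state is the stack pointer *plus* an unbounded
set of modified stack locations, which is exactly the difficulty noted above.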

3) In most stack machines that I've seen, stacks are sequences of
memory locations (or more complex things like the A-series "cactus stacks"),
and some number of locations are implemented as fast registers,
with transfers to/from memory done behind the programmer's back, so that the
actual number is somewhat invisible.
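
A toy C sketch of that spill/fill behavior (sizes invented): the top NCACHE
entries live in fast "registers" arranged as a ring, and push/pop move
values to and from the memory stack behind the program's back:

#include <stdint.h>

#define NCACHE 4
int32_t ring[NCACHE];            /* register copies of the top of stack */
int32_t memstack[1024];          /* backing store in memory */
int depth = 0;                   /* total (architectural) stack depth */

void push(int32_t v) {
    if (depth >= NCACHE)         /* spill the oldest cached entry */
        memstack[depth - NCACHE] = ring[depth % NCACHE];
    ring[depth % NCACHE] = v;
    depth++;
}

int32_t pop(void) {
    depth--;
    int32_t v = ring[depth % NCACHE];
    if (depth >= NCACHE)         /* refill the slot from memory */
        ring[depth % NCACHE] = memstack[depth - NCACHE];
    return v;
}

int main(void) {
    for (int i = 1; i <= 6; i++) push(i);
    return (pop() == 6 && pop() == 5 && pop() == 4) ? 0 : 1;
}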

4) Since they are memory locations, you can take their addresses, and
there are usually operations like:
	"store next-to-top-of-stack into address pointed to by top-of-stack",
	which, in fact, could point into the stack, but where it is
	difficult for the hardware to detect the aliasing until fairly
	late ...
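
A small C rendering of that hazard (layout invented): once stack slots are
ordinary memory, a store through an address held *on* the stack can clobber
another live stack slot, and nothing in the instruction says so until the
address is actually computed:

#include <stdio.h>
#include <stdint.h>

intptr_t stack[8];
int sp = 0;                                /* number of entries on the stack */

int main(void) {
    stack[sp++] = 42;                      /* a live operand, deep in the stack */
    stack[sp++] = 7;                       /* next-to-top: the value to store */
    stack[sp++] = (intptr_t)&stack[0];     /* top: target address (aliases the stack!) */

    intptr_t addr = stack[--sp];           /* pop the address */
    intptr_t val  = stack[--sp];           /* pop the value */
    *(intptr_t *)addr = val;               /* silently clobbers stack[0] */

    printf("%ld\n", (long)stack[0]);       /* prints 7, not 42 */
    return 0;
}

A register machine never faces this for its registers: they simply have
no addresses.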

5) To summarize (and it's time to go to dinner):

register machines have a small set of locations that are *not*
memory locations; this makes the compiler's job harder ... but it makes
the hardware's job easier if it wants to do aggressive implementations.


Anyway, maybe somebody can post design info on superscalar, out-of-order
stack machines so we'll get some insight about how one does this stuff.

-- 
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.sys.unisys,comp.arch
Subject: Re: UNISYS A series question / stack architecture question
Date: 28 Mar 1997 20:43:14 GMT

In article <abaum-2803971116200001@terrapin.pa.dec.com>, abaum@pa.dec.com
(Allen J. Baum) writes:

|> often contain nuggets you hadn't thought of, and you will have managed to 
|> get someone else to do your research for you.
|> Thanks John!

Thanks, I guess, but this was hardly doing research :-)

|> In any case, it would be an interesting exercise to design a
|> stackcache/register file for a high performance machine, but it's not
|> simple, and not something I get paid for-- but possibly something worth
|> someone's thesis?

Yes, this would be a worthwhile exercise.


|> Henry Baker said:
|>    "I've been around long enough to see fashions change a number of times,
|>     so I'm not going to rule out the possibility of additional fashion
|>     changes in the future."
|> 
|> Sounds like a good idea to me. Fashion sense isn't driven by economics,
|> unfortunately. (interesting actually: the fashion industry is driven by
|> making people change what they wear; ours is driven by making sure they
|> don't change...)

And actually, while fashion occasionally drives CPU architecture
in the short term, over the longer term, it is driven by changes in the
nature of design tradeoffs, with factors like:
	- cost/speed tradeoff between read-only and read-write memory
	- bandwidth possibilities
	- cost/implementability of logic versus memory
	- cost of memory
	- latency ratios

	- datatypes deemed important
	- operating systems and languages
	- kinds of applications
	- kind of systems architectures people want to use
	- and all sorts of other economic issues

For example, there was a time when memory was horrendously expensive,
and ALGOL/COBOL looked like the ways forward ... and the B5000 was a
brilliantly innovative design that combined hardware and software
thinking together in ways that created true devotion.
Just to correct a possible misapprehension, some people have been
talking of this family as though it only had stacks for activation
records and local storage, but really, the expression stack was
included as well, i.e., code looked like PUSH, PUSH, ADD, POP
(although the specific opcodes were different).

Finally, while I mentioned the issues that being able to take addresses
creates for aggressive machines, many similar issues exist even without
that ability.  The generic problem, as beaten into me again and again by
real chip designers, is that:
=========================
	If you view CPUs as in the "good old days", i.e., as relatively
	simple sequential devices with little pipelining and internal
	parallelism, then you can reach totally bogus results when
	considering CPUs that have substantial internal parallelism,
	and it is very easy to spec features that look fine for the simple
	machine, and create awful special cases for more aggressive
	implementations.  In some cases, the only way to evaluate
	some ISA feature is to go deeply into the implementation, not just
	of that feature, but of the interactions of it and other features.
==========================
	Example: the addition of cache+TLB to architectures that weren't
	thinking about these sometimes caused serious implementation
	problems, bugs, etc.  For example, consider the venerable S/360:
	- Part of the ISA is fairly RISCish (i.e., RR, RX, SI instructions),
	but some instructions that are straightforward on real memory are
	less so on cached/virtual machines.
	Consider:	load-multiple	reg1,reg2,offset(reg3)

	On a real-memory machine, this is fairly simple: iterate from
	reg1 to reg2, loading a (word, 2 words, whatever) per clock.
	This takes microcode with 3 items of state, equivalent to:
		current address from which to fetch
		current register in which to place the result
		count of words remaining
	Now, suppose you have a TLB: you would think you could do,
	straightforwardly:
		address = (reg3) + offset
		while registers left to do
			register = fetch(translate(address))
			address += 4
	but of course, you can't, since this may cross a page boundary,
		and the last word may cause a page fault, so now, you
		either need to dump this internal state somewhere, so
		you can continue the instruction, (and that means a bunch
		of extra datapaths, plus ways to restore the state),
		OR
		you need to be able to restart the instruction from
		scratch, which would be simpler ... but of course,
		the simple version doesn't work, because:
		LM	R1,R6,0(R1)
		has overwritten R1 before the exception...
	So, what you may have to do is:
		address2 = (reg3) + offset + 4*(reg2-reg1)
		probe address2 and trap now if not valid
		Then, go as above (but of course, lengthening the
		startup time to be safe against a case that doesn't usually
		occur...)
	Which is why later instructions like MVCL (move long) use
	a set of registers in a way that permits interrupts anywhere,
	and why some later chips have had load-multiple instructions
	that forbid modifying the register used as the source of the
	address....
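
A toy C rendering of the probe-first fix (the machine state and page size
are invented): translate the last word's address before modifying any
register, so a fault leaves the machine in its pre-instruction state and
the trap handler can simply re-run the instruction:

#include <stdbool.h>
#include <stdint.h>

#define NREG 16
uint32_t reg[NREG];
uint32_t mem[1024];                        /* word-addressed toy memory */

/* Toy MMU: word addresses at or above 768 are "unmapped". */
static bool translate(uint32_t widx, uint32_t *out) {
    if (widx >= 768) return false;         /* page fault */
    *out = widx;
    return true;
}

/* LM r1,r2,offset(r3), probe-first: since an LM spans at most two pages,
   checking the first and last word up front guarantees the loop below
   cannot fault after registers have been modified. */
bool load_multiple(int r1, int r2, uint32_t offset, int r3) {
    uint32_t base = reg[r3] + offset;      /* word address of first load */
    uint32_t t;
    if (!translate(base, &t) || !translate(base + (uint32_t)(r2 - r1), &t))
        return false;                      /* trap now; no state changed */
    for (int r = r1; r <= r2; r++) {
        (void)translate(base + (uint32_t)(r - r1), &t);  /* cannot fail: probed */
        reg[r] = mem[t];
    }
    return true;
}

int main(void) {
    mem[100] = 7; mem[101] = 8;
    reg[3] = 100;
    bool ok = load_multiple(1, 2, 0, 3);       /* loads 7 and 8 */
    reg[4] = 900;                              /* points into unmapped space */
    bool fault = !load_multiple(1, 2, 0, 4);   /* traps; r1, r2 untouched */
    return (ok && fault && reg[1] == 7 && reg[2] == 8) ? 0 : 1;
}

Note that even LM R1,R6,0(R1) is safe here, because the base address is
captured from reg[r3] before any register is written.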

Anyway: bottom line: aggressive implementations are usually doing
more than is obvious, to do what Allen says: make it look like
they're the same, not different.


-- 
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389



From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: UNISYS A series question / stack architecture question
Date: 3 Apr 1997 04:43:36 GMT

In article <hbaker-0204970856120001@10.0.2.1>, hbaker@netcom.com (Henry
Baker) writes:

|> My 'screw pipelining' comment was in reference to the fact that
|> computer architects seem to pull this one out of the hat before giving
|> the slightest thought to understanding the problem any deeper.
|> Introducing pipelining would seem to be the _last_ thing to introduce
|> -- when _all_ else fails -- rather than the first thing.

This statement can be read as:
	(a) Computer architects, in general, do this.
OR	(b) A few computer architects do this.
OR	(c) People who are interested in computer architecture, but have
	never been involved in the design of real computers, do this.

If you think it is (a), please post who you are thinking of.
Personal opinion: I've known plenty of *real* computer architects,
and I think they're smarter than this aspersion suggests.


|> Look at the Stretch and the 360/91.  They added horrendous complexity for
|> relatively little performance gain.  A relatively simple cache was much more
|> cost-effective.

This is a totally misleading application of this particular comparison,
which is usually stated as 360/91 versus 360/85, to compare two related
machines.
The 360/91 (and especially the earlier Stretch) was designed before useful
caches were feasible.  The later 360/85 was the first mainline IBM system
with a cache (Classic paper: J. Liptay, "The Cache", 
IBM Systems J, 7, 1 (1968), 15-21.)

360/85: 1.04-microsecond main memory ... up to 4MB (!),
	16KB cache, 80ns
	80ns CPU cycle

360/91: 60ns cycle, 750ns main memory, no cache.
The complexity came from trying to cope with a single level of storage
that was >10 clocks away, with no cache ...
It wasn't complicated because the architects were ignorant fools,
but because they didn't have SRAM :-)
And surprise ... as CPU clock rate:memory ratios are getting worse,
some of the same techniques are returning.

-- 
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389

From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: UNISYS A series question / stack architecture question
Date: 7 Apr 1997 17:34:37 GMT

In article <334558d1.6466265@news.dk-online.dk>, egmose@dk-online.dk (Søren Egmose) writes:

|> >And surprise ... as CPU clock rate:memory ratios are getting worse,
|> >some of the same techniques are returning.
|>
|> Such as? I'm very interested in these techniques.

See Hennessy & Patterson on 360/91.  From 1st edition:
"The IBM 360/91 introduced many new concepts, including tagging of
data, register renaming, dynamic detection of memory hazards,
and generalized forwarding."  i.e., techniques that
weren't common 10 years ago in micros, but certainly appear in many by
now.  In addition, since there was no I-cache, the 360/91 guessed branches
*both* ways and speculated in each direction [which current processors
generally do not do].  Higher-performance micros do spend die space
on branch prediction tables, branch-target tables, various kinds of hints, etc,
to lessen the penalties of branching.

Using round numbers (i.e., trying to simplify a lot of info, please
don't get into nit-picking unless there's a serious point):

in 1985/1986, we started with micros running at ~8MHz (125ns clock),
with external SRAMs that matched (i.e., one result each clock tick,
with a one-cycle penalty), and DRAMs that matched (in some sense):
120ns raw access time (obviously more in real life).  By 1987,
micros were typically at 16.6MHz (60ns).

These days:
	- Desktop/server micro cycle times are centered around 4-6 ns.
	- Raw DRAM access is typically 60-80ns.
	- People often do cache hierarchies where the L1 cache has two
	  penalty cycles, not one.
	- Instead of peak single-issue/clock (or less), we typically have
	2-4 issues/clock.

So, picking a few data points:
1985/1986:  ~ 1 instruction / raw DRAM access
1997:	60/5 = 12 CPU clocks/ raw DRAM access (not too different from 360/91)
	12 * 3-4 issues/clock => ~40 instructions/ raw DRAM access

Suggestion: if we had 2ns SRAM and 4ns DRAM as cost-effective things,
pipelines would have tended to stay simple; but we don't, so they haven't,
and they're acting more like CDC 6600s and/or 360/91s.
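
For the curious: the "raw DRAM access" numbers above are the kind of thing
you can measure with a pointer-chasing loop like this C sketch (sizes are
mine; pick a buffer much larger than your caches), because each load
depends on the previous one and so exposes the full memory latency:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1 << 22)          /* 4M pointers = 32MB: bigger than the caches */
#define ITERS (1 << 24)

int main(void) {
    void **buf   = malloc(N * sizeof *buf);
    size_t *perm = malloc(N * sizeof *perm);

    /* Build one random cycle through all N slots so neither the caches
       nor any prefetcher can help. */
    for (size_t i = 0; i < N; i++) perm[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        buf[perm[i]] = &buf[perm[(i + 1) % N]];

    void **p = &buf[perm[0]];
    clock_t t0 = clock();
    for (long i = 0; i < ITERS; i++)
        p = *p;                  /* serialized chain of dependent loads */
    clock_t t1 = clock();

    printf("%.1f ns per dependent load (%p)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC / ITERS * 1e9, (void *)p);
    free(perm);
    free(buf);
    return 0;
}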



-- 
-john mashey    DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL:  mash@sgi.com  DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389


