Index Home About Blog
From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: interrupting for overflow and loop termination
Date: 12 Sep 2005 22:42:31 -0700
Message-ID: <1126590151.256139.239050@g14g2000cwa.googlegroups.com>

David Hopwood wrote:
> andrewspencers@yahoo.com wrote:
> > Terje Mathisen wrote:

> A slightly different situation is where you have code that in practice
> always handles integers that fit in a single word, but that can't be
> statically guaranteed to do so, and the language specification says that
> bignum arithmetic must be supported -- the obvious example being Smalltalk.
> There were some attempts to support this in hardware (e.g. "Smalltalk on
> a RISC"; also something on SPARC that I can't remember the details of),
> but it turned out to be easier and faster for implementations of Smalltalk
> and similar languages to use other tricks that don't require hardware support.

Yes.
1) There was Berkeley SOAR as noted, and SPARC included ADD/SUB Tagged,
which used the high-30 bits as integers, and the low 2 bits as tags; if
either low 2-bit field were non-zero, it trapped.

2) ~1988, while working on MIPS-II, I/we spent a lot of time talking
with Smalltalk & LISP friends, potential customers, etc, asking:
"Are there any modest extensions that would help you a lot, and would
be reasonable to implement?

Short answer: NO.

Longer answer:
a) They said either give them a complete, tailored solution [which they
didn't expect], or just make the CPU run fast, but don't bother with
minor enhancements.  Some said they knew about the SPARC feature, but
didn't use it.

b) Some said: they were all doing fairly portable versions, had learned
a lot of good tricks, and minor improvements that required major
structural changes just weren't worth it.

c) I spent some time with David Kay (of XEROX PARC/Smalltalk fame) on
this, i.e., were there features that would be substantially helpful?

The best general idea we could come up with was:
- A general low-overhead user-level trap mechanism (something I've
wished for many times for other reasons).
- Some kind of general mask/check mechanism that could generate such
traps on particular bit combinations, either:
 - on completed address generation (maybe)
 - on input value to ALU operation (maybe)
 - on output value from ALU (bad)
 - on output fetched by load instruction (bad)

But we were not able to generate a specific-enough proposal for
something that was sure to be really useful, and could be reasonable to
implement.

In that round we did add the TRAP instructions for MIPS-II, but they
were primarily for ADA, although could be used elsewhere.

I mention this, because as often happens, people throw around ideas for
features without having much concern for *serious* implementation
issues.

Implementors would not be happy designing a mechanism:
- that seems only useful for a few specific cases
- that is easily handled by normal user code
- that introduces an extra new trap type, that requires especially
efficient handling, because it's expected to be used essentially
in-line, and not as an error indicator.
- and that may well introduce gate delays in critical paths, even if
it's minimal hardware.

As I've posted many a time, traps are *notorious* for causing
implementation bugs in hardware or software, so people do their best
not to introduce new flavors of them unless strong evidence is provided
that they are needed or are really worth it for performance.

There is a great deal of pushback in introducing features that might
add gate delays in awkward places, of which two are:
a) Something only computable on the *output* of an ALU operation
b) The result of a load operation

In many implementations, such paths may be among the critical paths.
Sometimes, the need to get a trap indication from an ALU, FP ALU,
Load/store unit to the instruction fetch unit may create a long wire
that causes serious angst, or yelling in design meetings.

Integer Overflow is one of a), but it's simple enough not to be likely
to add gate delays.  Nevertheless, it is something determinable only
late in the cycle, so many ISA designers have chose not to have it be
trappable. HP PA, MIPS, an Alpha designers all did choose the
minimalist approach, in which:
- There is no OVFL flag in the Condition Code ... because there is no
CC.
- There is a reasonably complete set of ADD / SUB operations, each with
2 flavors: arithmetic/signed and logical/unsigned.  The former always
cause traps on overflow, the latter never do.  Compilers generate the
latter for C unsigned, and for synthesis of complex addressing
arithmetic.  This assumed that you wanted to make the normal case fast,
at the expense of needing multi-instruction sequences to get explicit
tests for overflow without trapping.

Anyway, this does show that it is possible, across a wide range of
designs, to detect integer Overflow in a timely fashion.  Likewise,
it's even less time-constrained (as SPARC does) to trap on bit-tests in
the input values.  Finally, a lot of floating-point trap tests can be
done on the input, or use the MIPS trick of examining the inputs and
stalling if it cannot be sure the operation will complete without trap
[discussed here earlier].

On the other hand, the kind of features that I described above in the
Kay are much tougher.  To be really useful, you'd want to have
something like:
- a mask register that specified which bits of a value should be
checked
- a compare register
- a flag to say whether to trap if equal or not equal, i.e.,:
  if (flag) then
   {if ((value & mask) == (compare & mask)) then trap();}
  else
   { if ((value & mask) != (compare & mask)) then trap();}

You could do this with (value XOR compare) & mask, but in any case, you
still need a comparator tree somewhere.

And then, you'd actually probably want several sets of
mask/compare/flag regs, and you'd need variants of operations that
would enable the checking.  Note, of course, that

It *might* be plausible to use this for input operands, although no one
would appreciate the extra read ports/bus loads, but at least the
checks could go on in parallel with the ALU operation.  Nobody would be
very happy about doing this on the output of the ALU or load
instructions.

This is a LOT of mechanism, and so needs serious justification ... and
as for doing counter comparisons, no way.

The closest real designs get to this sort of thing, or to
lower-overhead loop control are:

a) Special counter registers that help speed up branches, found in some
general ISAs.

b) Zero Overhead Loop (Buffers) found in some DSPs.

Anyway, it's pretty clear that relevant mechanisms were being discussed
~20 years ago, but nobody seems to have figured out features that
actually make implementation sense.  I'd be delighted to see a
well-informed proposal that had  sensible hardware/software
implementations and really helped LISP/Smalltalk/ADA and hopefully
other languages...



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: interrupting for overflow and loop termination
Date: 13 Sep 2005 08:33:17 -0700
Message-ID: <1126625597.860065.205410@g49g2000cwa.googlegroups.com>

Jan Vorbrüggen wrote:
> > The best general idea we could come up with was:
> > - A general low-overhead user-level trap mechanism (something I've
> > wished for many times for other reasons).
>
> Later, you mention the MIPS TRAP instruction(s)...is that along this line?
> What about the VAX's CHMx with x=U (which passes an additional constant
> parameter to the trap routine)? Couldn't one have a fast user-mode trap,
> and then make sure that an address-alignment trap was handled that way?
> (That would also go some way towards handling unaligned operands quicker.)

No, both are still kernel traps.  What I've long wished for, but didn't
have a design early enough in MIPS-land was something reminiscent of
Alpha PALcode, or the IBM PPC mechanism for handling unaligned data,
but workable in user-state.

NOTE: WARNING: this post is different (from my usual), because it
describes something for which I do *not* have a design, just some
thoughts from over the years.  I have warned many times that one has to
work out *all* the details, because the devil lies in their
interactions.  Note that we never implemented anything like this,
because none of us could come up with a mechanism that we were sure was
widely useful and that didn't have implementation issues, and it was
clear, later on, that we couldn't graft this on top of existing MIPS
very well.  It might be, that had we had more time in the original
design, that we could have included an appropriate mechanism ... or it
might be that there was no sensible design.

I wished for something general enough to:
a) Fix alignment errors, i.e., one would like to be able to run a
binary with/without alignment checking.  [Recall that MIPS could handle
alginment errors, but needed a recompile to use LWL/LWR, etc].
b) Be able to trap unimplemented instructions, i.e., like
floating-point operations on original MIPS R2000, before the FPU was
available, or for machines that didn't have one, rather than doing
coprocessor-unusable traps.  Also, one might do not-yet-implemented
instructions, like sqrt (which was not there in MIPS-I, but added
later).  One might consider doing integer mul/div this way, where some
designs had them, and some didn't.
c) Likewise, support for parts of IEEE FP that one didn't want to do in
hardware.
d) Tagged-trap support for LISP, Smalltalk, etc.
e) Other user-desired or more likely, language-system desired features

One can summarize these into several groups:
A) Managing binary compatibility across a family whose implemented
features vary.  Note that a good mechanism would let you run binaries
with new instructions on old systems, given the right emulation code.
B) Handling cases where most of the time, simple hardware can do the
right time, but in a small fraction of the cases, driven by data, one
needs to do something else, but it's not a fault in the usual sense,
i.e., an error from which immediate recovery is unlikely.

A) includes b) and c).  B) includes d) and e). a) has characteristics
of both A) and B).


All of, these are for needs where:
a) You want to execute code straightforwardly in the protection/address
context of the user program.
b) You want to keep overhead "low enough".
c) You want user-level programming flexibility

This would probably require:
a) A bunch of extra registers to record the location and nature of the
trapped instruction.  The location pieces is probably not too bad.  The
"nature" piece wants to crack the instruction, it's inputs, and outputs
into a useful form.  [I.e., akin to the way MIPS utlbmiss TRAP sets up
registers with values in useful places to lessen the code path for TLB
refill].

This can be a lot: read about the PPC's "DSISR" register to helpo
alignment-code fixups.

One would prefer not to have to refetch and interpret the instruction
completely in common cases.

b) A few regular registers reserved for the use of the trap code.
These could be regular registers [the way that MIPS reserves two
registers that the kernel can trash whenever it wants], or they could
be extra new ones.

c) Probably, for reasonable code, one needs mechanisms to fetch input
operands and get outputs back to the right place(s).  In typical RISCs,
input operands might be able to be presented in a pair of special
registers.  Getting the output back to the right register may take some
work, including something like an indirect register specifier or an
S/360 EXEecute instruction.

d) There is a lot of software-convention work to be done.

e) One has to decide what to do about asynchronous interrupts and
further exceptions, and the extent to which a user-level trap routine
has access to features beyond normal user code.  Such routines
certainly cannot be allowed to block external interrupts arbitrarily.

In general, there are a lot of details and their interactions to get
right, and there is a lot of tension betweeen "gneral-enough" and "more
extra hardware and complexity than is worth it."



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: interrupting for overflow and loop termination
Date: 14 Sep 2005 03:22:32 -0700
Message-ID: <1126693352.897261.87710@g14g2000cwa.googlegroups.com>

Seongbae Park wrote:

> I think LISP/Smalltalk/ADA market is just too small to justify
> adding any significant change in the general purpose ISA,
> unless this yet-to-be-invented mechanism is easy and cheap
> to implement or it is for some other purpose which happened to
> help them (like a fast user-level trap).

Well, that's why we never did it.  We certainly couldn't justify
expensive features for that market, but we hoped to find modest useful
ones that might be general enough to have other uses as well.  Maybe if
we could have afforded another 6 months to do the original MIPS-I ISA,
we might have thought of something reasonable, but after that, it was
probably too late.  Nothing very complex would have fit in the R2000 in
any case, although I would have given up a few TLB entries had we
gotten a good solution here.



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: interrupting for overflow and loop termination
Date: 13 Sep 2005 17:43:02 -0700
Message-ID: <1126658582.506368.173210@g43g2000cwa.googlegroups.com>

Iain McClatchie wrote:
> Mash> There is a great deal of pushback in introducing features that
> Mash> might add gate delays in awkward places, of which two are:
> Mash> a) Something only computable on the *output* of an ALU
> Mash>    operation
> Mash> b) The result of a load operation
>
> Mash> In many implementations, such paths may be among the critical
> Mash> paths.  Sometimes, the need to get a trap indication from an
> Mash> ALU, FP ALU, Load/store unit to the instruction fetch unit
> Mash> may create a long wire that causes serious angst, or yelling
> Mash> in design meetings.
>
> Hmm... a feature that hangs some logic on the output of the ALU or
> load pipe, and causes a pipe flush and IF retarget if the logic
> detects some condition.
>
> I don't think this is a problem, Mash.  We're already doing this
> for integer overflow and various floating-point exceptions.  Suppose
> for a moment that the additional complexity of the feature added a
> pipe stage to this recurrence... in an OoO core, who cares?  GPR
> writeback is unaffected, you just have more logic writing to the tag
> bits in the reorder buffer.

Of course (i.e., it might not matter in an OoO), but you may have
missed the careful weasel-words "In many implementations".  After all,
of the horde of distinct pipeline implementations that have ever
existed, only a tiny fraction are OoO...

For what it's worth, there was some argument about this (overflow in
R2000) in 1985, because it was literally the *only* integer exception
that needed to be detected after the ALU stage, and in time to inhibit
register writeback, and somebody was worried about a possible extra
delay for a while.

> Now what would be very unpopular with the CPU guys would be
> instructions that monkey around with the dataflow inside the ALU.
> I skimmed the description of the Sparc tagged adds, but they
> sounded like just the kind of thing I'd want to kick out of the
> hardware, because getting data through the ALU really is the
> common case.

Again, I don't think the SPARC tagged ops are so bad, because they just
look at two bits each of the two inputs, so one can detect the trap
early.

>
> Heck, I'd like to get rid of sign extension on loads.  In an earlier
> proposal, I wanted to bolt an ALU (including shifter) onto the end of
> the load pipe, so that the op after the load could be scheduled with
> the load in one go.  The trouble is that raw pointer chasing is just
> too popular, and you don't want the load pipe latency dinking back
> and forth between two values.

You hardware guys are all alike [in hating sign-extension on loads]
:-).
We seriously looked at various schemes found elsewhere, i.e., where one
loads zero-extended partial-word data, and then uses an explicit EXT to
sign-extend.  We had enough data to prefer having both zero-extend and
sign-extend as operations, and if push had really come to shove, I
would have lived with an explicit EXT, although having done 68K
compiler work, and dealt with some of the funny optimization hassles
(i.e., can one get correct results without the EXT, sometimes?) I
certainly preferred to have the signed-load opcodes as first choice.
My second choice would have been 2-cycle load-signeds.  Third choice
was the explicit EXT.



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: interrupting for overflow and loop termination
Date: 14 Sep 2005 03:09:04 -0700
Message-ID: <1126692544.309408.123480@g49g2000cwa.googlegroups.com>

Seongbae Park wrote:
> John Mashey <old_systems_guy@yahoo.com> wrote:
> ...
> > You hardware guys are all alike [in hating sign-extension on loads]
> >:-).

> Well, if the sign-extend version takes more cycles than zero-extend
> - I suppose your second choice meant such a case -
> it creates the same funny optimization hassle
> and such an optimization accompanies occasional bug reports that cry wolf
> over the zero-extend load that correctly replaced sign-extend load
> ("It's a signed char in my code.
> Why is the compiler using a zero-extend load ?
> The compiler must be buggy!").

Yes, but the complaints are much worse when people disassemble code and
see a bunch of EXTs that are clearly unnecessary, i.e., visible
instructions almost always get more attention/flak/whinging than slow
instructions, unfortunately.  I spent some time tuning a 68K compiler
years ago at Convergent, and this kind of thing came up, and it wasn't
trivial to fix at the time, and get it right, at least in pcc.



Message-ID: <4328A026.5070704@pacbell.net>
From: Eliot Miranda <eliotm@pacbell.net>
Newsgroups: comp.arch
Subject: Re: interrupting for overflow and loop termination
Date: Wed, 14 Sep 2005 21:59:09 GMT

John Mashey wrote:
> David Hopwood wrote:
>
>>andrewspencers@yahoo.com wrote:
>>
>>>Terje Mathisen wrote:
>
>
>>A slightly different situation is where you have code that in practice
>>always handles integers that fit in a single word, but that can't be
>>statically guaranteed to do so, and the language specification says that
>>bignum arithmetic must be supported -- the obvious example being Smalltalk.
>>There were some attempts to support this in hardware (e.g. "Smalltalk on
>>a RISC"; also something on SPARC that I can't remember the details of),
>>but it turned out to be easier and faster for implementations of Smalltalk
>>and similar languages to use other tricks that don't require hardware support.
>
>
> Yes.
> 1) There was Berkeley SOAR as noted, and SPARC included ADD/SUB Tagged,
> which used the high-30 bits as integers, and the low 2 bits as tags; if
> either low 2-bit field were non-zero, it trapped.
>
> 2) ~1988, while working on MIPS-II, I/we spent a lot of time talking
> with Smalltalk & LISP friends, potential customers, etc, asking:
> "Are there any modest extensions that would help you a lot, and would
> be reasonable to implement?
>
> Short answer: NO.
>
> Longer answer:
> a) They said either give them a complete, tailored solution [which they
> didn't expect], or just make the CPU run fast, but don't bother with
> minor enhancements.  Some said they knew about the SPARC feature, but
> didn't use it.

This would include Peter Deutsch and the design of HPS his 2nd dynamic
translation (JIT) VM.  The tag pattern for immediate integers was
already chosen to be 11, and changing it just for SPARC when the
performance boost would be below 10% in all but micro-bencmarks just
isn't worth it.  However, were the SPARC designers to have allowed the
trap mask to be a variable part of per-thread state, or even better, to
be specified in the instruction itself (eliminating problems combining
different language implementations in one program) then we would have
made use of it (certainly code exists to use it).

The most convenient design would be not a trap but a branch or skip.
Something like "add and skip on overflow or if either operand's tag
pattern doesn't match X".  Now with 64-bit implementations one would
also want to specify the width of the tag field (one bit would suit HPS;
its 32-bit and 64-bit implementations use a single bit to tag immediate
integers.

> b) Some said: they were all doing fairly portable versions, had learned
> a lot of good tricks, and minor improvements that required major
> structural changes just weren't worth it.

hence the need for any instructions to provide flexibility and not
dictate particular bit patterns...


[snip]

> Anyway, it's pretty clear that relevant mechanisms were being discussed
> ~20 years ago, but nobody seems to have figured out features that
> actually make implementation sense.  I'd be delighted to see a
> well-informed proposal that had  sensible hardware/software
> implementations and really helped LISP/Smalltalk/ADA and hopefully
> other languages...

We could use a tagged add/sub and skip on overflow or tag mismatch, and
a tagged compare and skip on tag mismatch, where the tag field can be
flexibly specified to suit both 32-bit implementations (typical tags
least significant two bits) and 64-bit implementations (typical tags
least significant three or four bits).

If 6 bits were dedicated to the tag specification, two would be the size
of the tag field
	00 -> least significant bit
	01 -> least significant two bits
	10 -> least significant three bits
	11 -> least significant four bits
The remaining four bits would specify the required tag pattern, bits
excess to the tag size being ignored.

The two operands would be interpreted as 2's complement signed integers
in the remaining non-tag bits.  The add/sub instructions would skip or
annul the following instruction if either operand's tag pattern didn't
match the tag specification or if the result overflowed.  The compare
instructions would skip or annul the following instruction if either
operand's tag pattern didn't match the tag specification.

The value of the result register of the tagged add/sub would have the
same tag pattern as the operands.  Result value is undefined if overflow
or tag mismatch (i.e. I don't think one would typically be interested in
the result).

Code sequences for polymorphic add/sub or compare would then look like

	fetch operand one
	fetch operand two
	tagged add/sub
	branch Ldone
	code for non-tagged case (method lookup)
	...
Ldone:

Code for compare sequences would depend on whether one needed to take a
conditional branch or produce a result.  So one could use an instruction
that would skip the next two instructions on tag mismatch.

If tagged compare skips the next instruction then tagged compare for a
conditional branch might look like
	fetch operand one
	fetch operand two
	tagged compare
	branch Lcond
	code for non-tagged case (method lookup)
	...
	compare result of non-tagged compare against TRUE value
	branch if equal Ltrue
	compare result of non-tagged compare against FALSE value
	branch if equal Lfalse
	call notBooleanError
Lcond:	branch on equal Ltrue
Lfalse:

If it skips the following two instructions then
	fetch operand one
	fetch operand two
	tagged compare
	branch if equal Ltrue
	branch Lfalse
	code for non-tagged case (method lookup)
	...
	compare result of non-tagged compare against TRUE value
	branch if equal Ltrue
	compare result of non-tagged compare against FALSE value
	branch if equal Lfalse
	call notBooleanError

which isn't much of a saving...

One could also make use of a tagged add/sub immediate as there's a high
dynamic frequency of var + 1 in most (Smalltalk) programs.  The
immediate value would omit the tag pattern and be shifted by the tag
size to increase useful range.
--
_______________,,,^..^,,,____________________________
Eliot Miranda              Smalltalk - Scene not herd



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: interrupting for overflow and loop termination
Date: 14 Sep 2005 22:00:55 -0700
Message-ID: <1126760455.067853.13000@o13g2000cwo.googlegroups.com>

Eliot Miranda wrote:
> John Mashey wrote:

> > Longer answer:
> > a) They said either give them a complete, tailored solution [which they
> > didn't expect], or just make the CPU run fast, but don't bother with
> > minor enhancements.  Some said they knew about the SPARC feature, but
> > didn't use it.
>
> This would include Peter Deutsch and the design of HPS his 2nd dynamic
> translation (JIT) VM.

Lots of good details deleted...

1) The suggestions would probably fit HP PA better than MIPs, as it has
extensive "annul-next-instruction" features.

2) My comment above was indeed a paraphrase of Peter's comments,
although somewhat similar thoughts came from others as well.



From: "John Mashey" <old_systems_guy@yahoo.com>
Newsgroups: comp.arch
Subject: Re: interrupting for overflow and loop termination
Date: 14 Sep 2005 22:46:54 -0700
Message-ID: <1126763214.916084.152970@g43g2000cwa.googlegroups.com>

John Mashey wrote:

> c) I spent some time with David Kay (of XEROX PARC/Smalltalk fame) on
> this, i.e., were there features that would be substantially helpful?

Eliot Miranda correctly points out that I had to mean Alan Kay, not
David; sorry Alan.  (David Kay's activities are in a rather different
domain).



From: Eliot Miranda <eliotm@pacbell.net>
Newsgroups: comp.arch
Subject: Re: interrupting for overflow and loop termination
Message-ID: <RfCWe.1099$7x4.158@newssvr13.news.prodigy.com>
Date: Fri, 16 Sep 2005 16:18:57 GMT

John Mashey wrote:
> Eliot Miranda wrote:
>
>>John Mashey wrote:
>
>
>>>Longer answer:
>>>a) They said either give them a complete, tailored solution [which they
>>>didn't expect], or just make the CPU run fast, but don't bother with
>>>minor enhancements.  Some said they knew about the SPARC feature, but
>>>didn't use it.
>>
>>This would include Peter Deutsch and the design of HPS his 2nd dynamic
>>translation (JIT) VM.
>
>
> Lots of good details deleted...
>
> 1) The suggestions would probably fit HP PA better than MIPs, as it has
> extensive "annul-next-instruction" features.

And I was being dense.  Skipping is unnecessary.  More natural would be
to set condition flags for overflow if the tags were wrong.  In most
dynamic languages there will be a test for overflow following the
addition in cases where its not known if the tags are correct, so one
can fold the tag and overflow test together.

So the instructions would be tagged add/sub with overflow and compare
tagged with overflow.  Overflow would be set if the tags didn't match
the tag spec or if the arithmetic overflowed.

Code sequences for polymorphic add/sub or compare would then look like

     fetch operand one
     fetch operand two
     tagged add/sub
     branch no overflow Ldone
     code for non-tagged & overflow cases (method lookup)
     ...
Ldone:

--
_______________,,,^..^,,,____________________________
Eliot Miranda              Smalltalk - Scene not herd


Index Home About Blog