Index Home About Blog
From: mash@mips.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: Optimal # of registers?

In article <utnhcINNhof@grapevine.EBay.Sun.COM> adrianc@cobblers.uk.sun.com writes:
>In article 7214@minster.york.ac.uk, mjl-b@minster.york.ac.uk () writes:
>>>
>>>The calling procedure knows what registers have values that need to be
>>>preserved across the call. The called procedure know which registers
>>>it will use. The set of registers that need to be saved and then
>>>restored is the intersection of these two sets. Minimizing this
>>>intersection reduces the memory traffic associated with a call/return.
>>
>>But this isn't true when you've got separate compilation -- the callers of a
>>library-level subprogram are unknown until link time.
>
>SuperSPARC will substantially reduce the miss cost - the spill/fill
>time - which strengthens the case for register windows.

Actually, I don't think it makes a lot of difference....

Register windows are not exactly caches, but lets ignore that, in that it's
pretty easy to take an instruction & reference stream and model that,
whereas use of register windows vs other things changes the streams...

In any case:

1) There has been, over time, a reasonable amount of data published
(here or elsewhere, years ago) that shows with:
	- 32 integer registers (i.e., IPS, HP PA, IBM, Alpha, etc)
	- Good global optimizer, but WITHOUT turning on inlining or
	  inter-procedural optimization
	- Hence, applicable to dynlinked libraries or object-oriented stuff..
That the dynamic statistics say that you can get the number of registers
saved/restored per function call across benchmark suites like SPEC
(or the internal ones that MIPS has used, i.e., fairly large C and
FORTRAN programs)  and actually including FP registers, to get down
around 2.  I've looked at dozens of profiles of programs to say this.
The reason for this is that leaf functions (or things where the compiler
can make them act like leaf functions for the most frequent calls),
seldom save/restore *any* registers.

2) Implementations (such as HP or MIPS, for example) that started with
split I & D caches normally have 1-cycle loads and 1-2 cycle stores,
depending on the implementation.  Hence, a lower bound on the
cost is something like:
	2 registers * (1-2 (per save) + 1 per restore)
	= 4-6 cycles.
One expects that the restore should seldom cache-miss.
In a write-back cache, the first store may cause a writeback and/or
a memory read.
Of course, the instructions may I-cache miss.

3) The last time I looked, I observed that realistic C program typically
averaged 80-100 instructions function call; FORTRAN programs varied wildly
but were usually higher; I recall 200 instructions or so was common.
There are of course more cycles due to everything else, and a 1.2 CPI
would be typical; hence, what you're talking about is 100-120 cycles ..
of which 4-6 are spent saving/restoring registers.

So, as a gross rule of thumb, what's happening is that a MIPS_like
CPU is probably spending about 5% of the cycles saving/restoring
registers across function calls, in user-level code typical of
the early 1980s.  It's probably less for FORTRAN code (I guess
3%) and maybe for kernel code.

4) Now, most earlier SPARCs use joint caches, and hence have 2-cycle
loads and 2-3-cycle stores, and so the penalty *there* for
explicit save/restores, assuming same compiler technology:
is 2 * (2 + 2-3) or 8-10 cycles, hence there are more cycles to
go after in that kind of implementation (although %-wise it's probably
the same, as the joint-cache SPARCs have higher CPIs than 1.2).
Put another way: if you have fast loads and stores, and don't have
to save/restore many registers anyway, you're less interested in
register windows.

5) Anyway, it all depends on the program behavior and the effects of
compilers, certainly as Patterson said quite often.
*IF* you have portable-c-compiler, joint caches, and your model of
programs are user-level C programs, THEN register windows or
stack caches or whatever are probably good ideas.
*IF* the code has gotten *more layered* than it was around 1980,
that is, that it goes up and down the levels faster, and hence hits overflows/
underflows faster [i.e., one must count a *complete* exception-handling
sequence, not just the register load/store time itself], then
the penalties may end up equaling the gains. Now, as it happens,
I suspect that GUIs and similar layered code have more levels,
but I could be wrong.

6) I've watched UNIX OS's with debuggers.  One time, just following
a regular system call, I saw it quickly go down 12 levels, and then back
up the same 12 very quickly ... besides the several levels deep it already
was in the user code.  Note that this was *not* getpid, which is dandy
for a register window machine because it is so shallow you need not
save/restore any windows.  Anyway, it's amusing to watch something
like *stat* or *read* go at it.

For this reason, a peculiarity of register windows not seen in other
programs, is that reworking code to make it more layered, rather than having
a slight performance impact, may surprise you, if adding 1-2 levels
causes the windows to thrash.

7) Just so I can defend Sun occasionaly, I point out, once again,
that in UNIX, context-switching of register windows is pretty much
a non-issue, in workstations, and in fact, in most systems.

It certainly *does* cost cycles ... but these are heavily dominated by other
effects in UNIX.   When you add everything up, I'd exepct a SuperSPARC,
even at 33Mhz, to require no more than 5-6 microseconds to
save/restore integer registers + FP registers (which of course are not
part of the windows), assuming you actually save/restore them.
(A 50MHz R4000 does this in 1-2 microseconds with similar cache-miss
assumptions, as it would save and restore ~48 registers.)

Of course, if you're running a large transaction system with light transactions
you might care.

The knees in curves that people have often seen in SPARC systems
often have often come from the SRAM-based MMU designs used in earlier
systems, where there is some fixed number of MMU contexts,
and when you switch between N+1 tasks often, there is a major overhead
for unloading and loading the MMU ... but this has zero to do with
register windows.

Now, all of the above is appropriate for most UNIXes.  Whether or not it's
appropriate for other OSs depends on what they do.  For real-time
OS's, or things like telephone-switch OSs (zillions of tasks,
high context switch rate), people who do these things do *not* generally
like register windows much, to put it mildly.  Note that the
common reply of "dedicate register windows to processes" doesn't
fly with most of these folks, because there are often too many processes
to do that ... plus you need the compilers to act different.
[You'd better *not* act like you've got register windows if a task only
has 1-2 of its own... :-)]

8) Another non-argument is that there is a huge difference in
register access time just because there are more of them.
This might be an effect in some implementations in the future
(i.e., with register renaming, some implementations might get to be
awkward, perhaps; perhaps not), but I do not believe has been a major
issue so far.  [I doubt this has very much to do with SuperSPARC's
getting 40MHz so far...  I'd be happy to be proved wrong, of course.]

Anyway: bottom line is:

1) Register windows go after about 5% of cycles that integer C user
programs spend saving/restoring registers.
	a) It is fairly easy to model how much might be saved,
	and hence the motivation for doing it in the first place.

2) They lose some of it back via:
	a) User-level code causes overflow/underflow traps
	b) Heavy use of OS code (which seems to go up and down call
	stacks rapidly) causes more overflow/underflow traps
and, way down the list in UNIX
	c) A little extra overhead on context switch.

It is reasonable to model a), and trivial to compute c), and harder
to figure out what's going on with b).  It does say you'd
expect register windows to be at their best for single-thread user
programs.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:	MIPS Computer Systems M/S 5-03, 950 DeGuigne, Sunnyvale, CA 94086-3650

Index Home About Blog