Index
Home
About
Blog
Newsgroups: comp.arch
From: mash@mash.engr.sgi.com (John R. Mashey)
Subject: Re: Intel 86-series -> RISC [really: register renaming]
Date: Fri, 16 Sep 1994 03:35:34 GMT
In article <1994Sep15.161725.3779@boole.com>, jka@boole.com (John
Ahlstrom) writes:
|> What do the cross-over points (where the utility of
|> adding named registers is less than the utility of
|> adding registers with renaming) depend on?
|>
|> These cross-over points could depend on
|> instruction bits to name registers
|> e.g. can't destroy backward compatibility
|> need bits for other things
|> complexity to do renaming
|> nature of the code
Well, some of it is certainly the nature of the code. Put another
way, it is a misbelief that the following are equivalent:
1) ISA #1 with N visible registers (with or without renaming).
2) ISA #2 with M << N visible registers, in an implementation that has N
physical registers, and uses renaming.
Of course, the larger N gets, the less it matters.
It is the case, already, that ISAs with N=32 (each for both integer
and FP) *already* can hit enough register pressure to want renaming
when using aggressive superscalar implementations and aggressive
optimizing compilers, particularly on some kinds of FP code.
Following are a few examples to illustrate where renaming helps,
and where it doesn't. let's try N = 4 and M = 3 (to keep it simple).
I'll assume a psuedo-MIPS-like ISA.
Example 1, using ISA #1, that fetches elements from an array, adds 1
to each, and stores them in another array. remember: 4 real registers.
LW R1,ARRAY1_PTR 1
LA R2,END_OF_ARRAY1_PTR 2
LW R3,ARRAY2_PTR 3
LOOP: LW R4,0(R1) 4
ADD R4,R4,1 5
SW R4,0(R3) 6
ADD R1,R1,4 7
ADD R3,R3,4 8
BLT R1,R2,LOOP 9
Now, consider a 4-issue superscalar, with renaming, and speculative,
out-of-order execution (i.e., sort of like the Metaflow Thunder).
Let me assume that the first few fetches from ARRAY are cache misses,
but the CPU is trying hard to "unroll the loop" in hardware to get
ahead, and that instructions hang around until their inputs are ready.
Cycle 1: fetch instructions 1..4, rename registers (starting empty,
to physical registers P1..P3.)
Cycle 2: fetch 5..8, rename like:
ADD P5,P4,1 P5 is new result
SW P5,0(P3)
ADD P6,P1,4 P6 is new result (R1)
ADD P7,P3,4 P7 "" (R3)
Cycle 3: fetch 9..., and assume predict branch taken
BLT P6,P7,LOOP
Cycle 4: fetch 2nd iteration, and rename:
LOOP: LW P8,0(P6) Different values of R4 and R1
ADD P9,R8,1
SW P9,0(R7)
ADD P10,P6,4
etc ... so what's happening is that while the ISA only has 4 visible
(integer) registers, good use is made of the extra physical registers
to have instrcutions from several iterations of the loop in progress
at once ... which is very difficult to do without rename.
Example 2: same code, but with ISA#2, with only 3 visible registers.
In this case, there aren't enough, and the compiler had to have generated
more loads and stores in the first place. For ease of typing,
I'm going to call the visible registers R1,R3,R4:
LW R1,ARRAY1_PTR 1
LW R3,ARRAY2_PTR 3
LOOP: LW R4,0(R1) 4
ADD R4,R4,1 5
SW R4,0(R3) 6
ADD R1,R1,4 7
ADD R3,R3,4 8
LW R4,END_OF_ARRAY1_PTR 8a
BLT R1,R4,LOOP 9
i.e., the compiler ran out of registers, so it couldn't just set up
END_OF_ARRAY1 in a register and leave it there. Now, if you run this
example, using the same sort of renaming & execution scheme, it also
works perfectly well, BUT the LW instruction labeled 8a doesn't disappear.
Example 3, let M =2 (awful, but for the sake of the example, and don't bug me
if there's a cycle or two that can go away):
LW R1,ARRAY1_PTR
LOOP1: LW R2,0(R1)
ADD R1,R1,4
SW R1,ARRY1_PTR
LW R1,ARRAY2_PTR
SW R2,0(R1)
ADD R1,R1,4
SW R1,ARRAY2_PTR
LW R1,ARRY1_PTR
LW R2,END_OF_ARRAY1_PTR
BLT R1,R2,LOOP1
Here, again, renaming would let you get multiple iterations, but you still
have those extra loads and stores. The renaming will help case 2 or 3 with the
re-use of registers inside the loop, as well ... but still, the loads
and stores do not go away.
Suppose one's design allows 1 load/store/cycle, branches are predicted
correctly, no cache misses, etc, then:
4 registers 2 cycles/iteration
3 registers 3 cycles/iteration
2 registers 7 cycles/iteration
Summary:
Renaming can relieve register pressure around loops and
conditional code ... but it does not eliminate loads
and stores where a compiler would have rather globally allocated
values to registers in the first place, had it enough visible
registers. This is an over-simplified example ... on the other
hand, more than one vendor has already seen cases where
32 integer and 32 FP registers made them wish for more.
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@sgi.com
DDD: 415-390-3090 FAX: 415-967-8496
USPS: Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311
Index
Home
About
Blog