From: preston@cs.rice.edu (Preston Briggs)
Newsgroups: comp.sys.super,comp.arch
Subject: Re: Tera books first order!
Date: 25 Nov 1996 20:54:04 GMT
>Running one spec program as one thread in the Tera would give a
>terrible result, something like a 333 MHz / 20 instructions minimum
>between two issues on the same thread ~ 16 MHz P6.
Maybe. Things we have in our favor include wide instruction words,
plenty of registers, and a nice compiler.
>Now if the program could be parallelised by the compiler to use a
>minimum of 20 threads it would come nearer a 333 MHz P6.
If there's enough parallelism to saturate a processor (say, 20 to 40
threads), then we'll whop on any current commodity chip.
But these arguments miss the point. Our processors can be effectively
composed to give larger systems without modifying the programming
model. If code runs fast on 1 processor, it'll run about twice as
fast on 2 processors, etc. Without changing the source or
recompiling. Speed of a single processor should be unimportant to
users. It matters only as an engineering detail (i.e., is it cheaper
to make fewer, more powerful processors or more, less powerful
processors?).
Preston Briggs
From: preston@cs.rice.edu (Preston Briggs)
Newsgroups: comp.sys.super,comp.arch
Subject: Re: Tera books first order!
Date: 26 Nov 1996 20:53:05 GMT
>>As I understand it, it's a throughput machine. In other words, fast
>>context switching allows multithreading to hide latency, so that many
>>different jobs keep the machine busy. I'm assuming that typically
>>(although not *necessarily*) the different threads will be part of
>>entirely different jobs.
>The user will be free to use the machine as s/he pleases, but
>*typically* lots of the threads will belong to the same application.
>There are two kinds of "switching" going on:
>(1) Thread switching *within* a process: This is free, and happens
> after every cycle.
>(2) Context switching *between* processes: This is not free -- I
> think I saw an estimate of 30-40 cycles in one of the Tera
> technical reports.
Where did you see this? (So I can fix it)
There's one kind of switching. Each processor can support something
like 15 jobs at a time (that is, up to 128 threads spread as
you please over up to 15 jobs). Switching between threads is always
free, regardless of job.
And more processors => more simultaneous jobs.
Preston Briggs
Subject: Re: 64 bit registers
From: preston@Tera.COM (Preston Briggs)
Date: Mar 20 1996
Newsgroups: comp.arch
krste@ICSI.Berkeley.EDU (Krste Asanovic) writes:
>I'm curious how you [Tera] handle the case where you issue a burst of 8
>instructions from the same thread in consecutive pipeline stages?
We never issue instructions from the same thread in consecutive
pipeline stages (though one instruction can have 3 operations which
are issued simultaneously to different pipes (not different stages)).
At best, instructions from a single thread have at least 20 cycles
between them. If many other threads are active, the delay between
instructions may be longer.
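
The issue rule above can be put into a small model. This is only a sketch under stated assumptions: the minimum gap of 20 cycles is the figure from the post, while the idea that the processor fills one issue slot per cycle by taking ready threads in turn is a simplification introduced here, and the function names are hypothetical. It also ignores memory latency, which is presumably why 20 to 40 threads, rather than exactly 20, are needed to saturate a real processor.

```c
#include <assert.h>

/* Assumption for illustration: a thread may issue at most once every
   MIN_ISSUE_GAP cycles (the post says "at least 20"), and the processor
   fills one issue slot per cycle by taking ready threads in turn. */
enum { MIN_ISSUE_GAP = 20 };

/* Cycles between consecutive instructions of one thread when
   `threads` threads are all ready to run. */
unsigned issue_interval(unsigned threads)
{
    assert(threads > 0);
    return threads > MIN_ISSUE_GAP ? threads : MIN_ISSUE_GAP;
}

/* Percentage of issue slots the processor manages to fill. */
unsigned utilization_percent(unsigned threads)
{
    assert(threads > 0);
    return 100u * threads / issue_interval(threads);
}
```

Under this model a single thread fills 5% of the issue slots, 10 threads fill 50%, and anything from 20 threads upward fills all of them, at which point each thread simply waits longer between its own instructions.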
Perhaps you're confused about the instruction lookahead? The
lookahead field points to the next instruction that depends on the
current memory reference. If it's set to 7, it means that the next 7
instructions can be issued (absent any other lookahead dependences)
before the current memory reference must be completed.
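
A minimal sketch of the lookahead rule just described, written as a check a simulator might perform. The struct layout and the name may_issue are hypothetical, not Tera's actual encoding; instructions without a memory reference are modelled by setting mem_done to true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-instruction record for one thread's instruction stream. */
struct insn {
    unsigned lookahead; /* how many following instructions may issue
                           before this one's memory reference completes */
    bool     mem_done;  /* true once the memory reference has completed
                           (or if the instruction makes no reference) */
};

/* May instruction number `next` issue, given the state of every
   earlier instruction in prog[0 .. next-1]? */
bool may_issue(const struct insn *prog, size_t next)
{
    assert(prog != NULL || next == 0);
    for (size_t j = 0; j < next; j++) {
        size_t distance = next - j;
        /* Blocked if an outstanding reference sits further back
           than its lookahead allows. */
        if (!prog[j].mem_done && distance > prog[j].lookahead)
            return false;
    }
    return true;
}
```

With a lookahead of 7 on instruction 0 and its memory reference still outstanding, instructions 1 through 7 pass the check and instruction 8 does not until the reference completes.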
In older versions of the design, there was consideration given to
making the lookahead apply to all the operations. In that case, it
would have been possible to issue several instructions quite close
together. On the other hand, we wouldn't have been able to tolerate
as much memory latency.
Preston Briggs
Subject: Re: Does it exist a shared-bus NUMA multiprocessor?
From: preston@Tera.COM (Preston Briggs)
Date: Mar 05 1996
Newsgroups: comp.arch
>It's easy to provide uniform memory access time to large numbers of
>processors. A uniformly *bad* memory access time, that is. Heck,
>even the Tera MTA has NUMA as I understand it, but perhaps Preston
>has become so insensitive to memory latency that he doesn't notice
>any more. ;-)
It's true that I'm becoming insensitive to latency. But why should I
pay attention to something that's unimportant? Writing code for our
machine, I find a healthy amount of parallelism at a fairly coarse
grain, enough to keep the processors saturated, then write the rest of
the code with no concern for parallelism, latency, locality, etc. The
only thing that matters is reducing the total work.
>Very generally, once one has enough memory in a system, some of it
>inevitably ends up being noticeably further away from a given CPU than
>the rest, and it seems to me unreasonable to slow down the nearby accesses
>out of a sense of cosmic justice.
Our memory is spread around a network, so some accesses take more hops
to get from the processor to the memory and back. But the difference
_isn't_ noticeable. My thread runs along, does an access on one
instruction and, voila!, the result is there by the next instruction.
Of course, many other threads will have executed instructions in the
meantime, but that doesn't affect me.
When I write code for the machine, I often look at the inner loops to
get a feeling for the cost of each iteration (and to make sure the
compiler's doing the right thing). We count costs like this:
an add => 1
a multiply => 1
a branch => 1
a load => 1
a store => 1
a fetch&add => 1
spawn a thread => 1
quit a thread => 1
Latency just doesn't enter into the calculation.
Unfortunately, latency occasionally matters, even on the Tera. For
example, when many threads are competing for a critical region, the
longer it's locked, the less throughput you'll see. For example,
consider a simple random number generator.
sync unsigned seed$ = 123456;

unsigned rand() {
    unsigned s = seed$;   /* lock */
    s = A * s + C;
    seed$ = s;            /* unlock */
    return s % M;
}
where A, C, and M are magic numbers derived after a careful reading of
Knuth. I've marked the variable seed$ as "sync", which has special
meaning to our compiler. Sync variables have a value and a state.
The value is (in this case) an unsigned integer. The state is either
full or empty.
In the code above, the state of seed$ is initially full. When some
thread calls rand(), it reads seed$, setting it to empty (1
instruction of work, but perhaps 100 cycles of latency). Then
computes a new value (1 instruction, perhaps 20 cycles of latency).
Then stores the new value, making seed$ full (1 instruction, perhaps
100 cycles of latency). Then does the mod and returns the result
(maybe 3 more instructions).
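
For readers without a Tera compiler, the full/empty protocol walked through above can be approximated in standard C11. This is only an analogue, not how the Tera hardware or compiler implements sync variables: the names sync_uint, sync_take, sync_put and lcg_rand are hypothetical, the spin loop stands in for hardware that simply parks the waiting thread, and the values of A, C and M are placeholders, since the post does not give them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Portable stand-in for a "sync" variable: a value plus a full/empty bit. */
struct sync_uint {
    atomic_bool full;
    unsigned    value;
};

/* Wait until full, then mark empty and return the value (the "lock"). */
unsigned sync_take(struct sync_uint *s)
{
    assert(s != NULL);
    bool expected = true;
    while (!atomic_compare_exchange_weak(&s->full, &expected, false))
        expected = true;  /* a failed exchange overwrote it; reset and retry */
    return s->value;
}

/* Store a value and mark the variable full again (the "unlock").
   Only the thread that emptied the variable should call this. */
void sync_put(struct sync_uint *s, unsigned v)
{
    assert(s != NULL);
    s->value = v;
    atomic_store(&s->full, true);
}

/* Placeholder constants: the post does not say which values were used. */
enum { A = 1664525, C = 1013904223, M = 1000 };

/* Initially full, holding the seed from the post. */
struct sync_uint seed = { true, 123456u };

unsigned lcg_rand(void)
{
    unsigned s = sync_take(&seed);  /* lock: seed is now empty */
    s = A * s + C;                  /* unsigned arithmetic wraps */
    sync_put(&seed, s);             /* unlock: seed is full again */
    return s % M;
}
```

Every caller of lcg_rand serializes on seed between sync_take and sync_put, which is the critical section whose length is being discussed.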
There are several things to look at here. We want good "random"
numbers, so we read Knuth closely. We want the total work to be
small: it's 8 instructions (had to load some constants). And we worry
about the length of the critical section (when other threads are
prevented from accessing seed$).
In this case, it's basically 1 round trip to memory + a couple of
instructions. That is
time for packet to travel from memory to processor
time for mul-add
time to issue store
time for packet to travel from processor back to memory
So maybe 150 cycles or so, depending on how busy the processor is (and
a little bit on the network, but it's got so much bandwidth that it's
difficult to overload).
150 cycles means that we can only generate 1 new number per 150 ticks,
no matter how many threads need random numbers. Not very impressive.
And this problem is due entirely to latency.
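
The ceiling can be written down as a small calculation. A sketch only: the function name and the call_cycles parameter (the total time one thread spends in a single call) are assumptions introduced for illustration, and the rate is expressed per million cycles to keep the arithmetic in integers.

```c
#include <assert.h>

/* Random numbers produced per million cycles when every call holds the
   lock for critical_cycles and takes call_cycles in total.
   Adding threads helps only until the lock becomes the bottleneck. */
unsigned long locked_rate_per_mcycle(unsigned threads,
                                     unsigned call_cycles,
                                     unsigned critical_cycles)
{
    assert(threads > 0 && critical_cycles > 0);
    assert(call_cycles >= critical_cycles);

    /* What the threads could deliver if they never waited on each other. */
    unsigned long unconstrained = 1000000UL * threads / call_cycles;
    /* What the lock allows: one pass through the critical section at a time. */
    unsigned long lock_bound = 1000000UL / critical_cycles;

    return unconstrained < lock_bound ? unconstrained : lock_bound;
}
```

With a 150-cycle critical section and a hypothetical 300 cycles per call, one thread yields about 3,333 numbers per million cycles, two threads reach the lock's limit of about 6,666, and a hundred threads still yield about 6,666.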
Of course, we get around the problem in this instance by using a
better random number generator -- one with no critical section.
Reading Knuth, Vol 2, 2nd edition, we discover algorithm A, which can
be implemented in such a way that we can produce a random number in 6
instructions of work, and with no critical sections. Thus, we get
processors/6 new numbers per tick.
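
The post does not show the implementation, so the following is only one plausible reading. Knuth's Algorithm A (section 3.2.2) is the additive generator X[n] = (X[n-24] + X[n-55]) mod m. A simple way to avoid any critical section, assumed here and not confirmed by the post, is to give each thread its own generator state. The names addgen, addgen_seed and addgen_next are hypothetical, and the seeding recurrence is an arbitrary choice made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LAG_LONG = 55, LAG_SHORT = 24 };

/* One generator per thread: no shared state, hence nothing to lock. */
struct addgen {
    uint32_t y[LAG_LONG];  /* circular buffer of the last 55 values */
    int j, k;              /* 0-based positions of X[n-24] and X[n-55] */
};

/* Fill the buffer from a seed. The linear congruential filler is an
   assumption of this sketch, not something the post specifies. */
void addgen_seed(struct addgen *g, uint32_t seed)
{
    assert(g != NULL);
    for (int i = 0; i < LAG_LONG; i++) {
        seed = seed * 1664525u + 1013904223u;
        g->y[i] = seed;
    }
    g->y[0] |= 1u;          /* the initial values must not all be even */
    g->j = LAG_SHORT - 1;
    g->k = LAG_LONG - 1;
}

/* One step of the additive generator, taken mod 2^32. */
uint32_t addgen_next(struct addgen *g)
{
    assert(g != NULL);
    uint32_t r = g->y[g->k] += g->y[g->j];  /* unsigned add wraps mod 2^32 */
    g->j = (g->j == 0) ? LAG_LONG - 1 : g->j - 1;
    g->k = (g->k == 0) ? LAG_LONG - 1 : g->k - 1;
    return r;
}
```

Each thread would call addgen_seed once with a distinct seed and then call addgen_next as often as it likes; a step is a couple of loads, an add, a store and two index updates, which is in the same range as the 6 instructions quoted above.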
Preston Briggs
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.sys.super,comp.arch
Subject: Re: Tera Boots
Date: 15 Sep 1997 06:51:34 GMT
In article <5v10pg$vd8$2@news.rchland.ibm.com>, cecchi@signa.rchland.ibm.com (Del Cecchi) writes:
|> this IS a supercomputer with a novel architecture.
|>
|> John Mashey, if this is not some kind of proprietary information, how
|> long did it take to do the R10,000 from "hey, lets design a new chip"
|> to Unix prompt?
1) The R10000 is perhaps not the right comparison, I'll list several to
bound this.
2) R2000:
4Q84: start, really, but already having Fred Chow's compilers
from Stanford, and the Stanford research, of course.
4Q85: tapeout
12/85: first silicon, bootproms up
~4/86: first UNIX boot, I think
~6/86: first UNIX stable enough that we dare let anyone else look at
it
~9/86: system shipments stable enough to dare sell as early units
[These were MIPS M/500s, i.e., deskside VME-based uniprocessor systems].
3) R10000:
4Q91: some discussion was going on
2H92: really got going (recall there was acquisition & move)
mid96: production-class SMP systems (Challenge R10Ks)
I.e., consider this 4+ years from getting serious to production
ships.
4) The first system of a family has the advantage that certain chip bugs
might be defined away as "features", which later chips cannot do,
since they have to be upward-compatible, or made to act that way.
First chips have the disadvantage that there are masses of software work to be
done.
5) OS's in particular mature as a function of:
{number of systems, number of users, elapsed time}
6) From the Tera press release:
a) I must admit I was a little surprised, as I'd assumed they'd booted
UNIX before, as they'd reported NAS sort benchmark numbers.
[Their press release did not *say* there was an OS there, I'd just
assumed it, but I remember how it is...]
b) From the press release, it appears that this was the first boot on
real hardware, not simulations. Also, it appears that it was the
uniprocessor version, MP appears still to need work, not surprisingly.
7) When people ask me of my opinions of Tera, I always say:
"Interesting architecture, from which useful things will be learned.
Completely unclear if there's a business there or not, given the
amount of work and $$ it takes to ship reasonable systems in this
class, and the relatively small niche available. On the other
hand, if the machines are enough faster than anyone else, doing
a small set of applications that people care about, then they have
a chance, although it seems unlikely they'll be able to get lots of
3rd-party software, given what that's like these days."
8) If the first boot just happened, it does seem slightly
unlikely that they will be able to ship stable systems to SDSC this
calendar year, but stranger things have happened. The real killer is
that the bar raises every year for what people expect, and the nature
of the Tera systems is not to be a high-volume product.
--
-john mashey DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com DDD: 650-933-3090 FAX: 650-932-3090
USPS: Silicon Graphics/Cray Research 6L-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389