From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch
Subject: Re: Has Sun broken SPECfp2k?
Date: Fri, 30 Nov 2001 02:37:44 +0000 (UTC)

In article <3C061B44.ED1E0F0B@gmx.de>,
Bernd Paysan  <bernd.paysan@gmx.de> wrote:
>Greg Lindahl schrieb:
>> This optimization actually exists in some real compilers for vector
>> machines. If you've got codes from a Cray vector machine, which really
>> isn't bothered by large strides (e.g. accessing an array the wrong
>> way), and want to run them fast on just about anything else, this is a
>> good optimization to do. But it's a fairly rare situation in real codes.

At least, it has been fixed in most codes, since most vector codes
were ported to cached machines long ago....


>This is one of the cases where source changes to SPEC benchmarks should
>be allowed. A more or less trivial change that speeds up the benchmark
>by a huge factor is not something a compiler should do. Auto-transposing
>is something a compiler had better not do (because, after all, C
>matrix addresses are well-defined operations). Generally, benchmark
>sources should have been carefully tuned before submission. This avoids
>surprises, and compilers optimized for stupid code.

Here is a real example that demonstrates some of the subtleties
in judging the legality of certain optimizations applied to the
SPEC CPU suite....

This is the *real* code from the SPEC swim benchmark, showing the
change that happened and why it is such a mess....

In CPU95, there was a benchmark called 102.swim.  It was based on
a perfectly reasonable algorithm published by Robert Sadourny in
1975, and this version was coded by Paul Swartztrauber at NCAR in
1984.
(It has not been representative of what people in meteorology
purchase computers to run for over ten years now, but that is
another story...)

The main program has a section of code (in FORTRAN) that looks like:

      IF(MOD(NCYCLE,MPRINT) .NE. 0) GO TO 370
         PCHECK = 0.0D0
         UCHECK = 0.0D0
         VCHECK = 0.0D0
         DO 3500 ICHECK = 1, MNMIN
           DO 4500 JCHECK = 1, MNMIN
             PCHECK = PCHECK + ABS(PNEW(ICHECK,JCHECK))
             UCHECK = UCHECK + ABS(UNEW(ICHECK,JCHECK))
             VCHECK = VCHECK + ABS(VNEW(ICHECK,JCHECK))
 4500      CONTINUE
 3500   CONTINUE
        WRITE(6,366) PCHECK, UCHECK, VCHECK
370   CONTINUE

For those who cannot read FORTRAN, the code above adds up the
absolute values of the elements of three arrays and prints the
results.  It does this every "MPRINT" time steps, and in CPU95
this value was set to a very large number -- 800, I think.

The array accesses are in the wrong order for FORTRAN on machines
with caches, so this loop takes a lot longer than it should unless
the compiler swaps the loop indices.  Many compilers handle this
case with no trouble, but since it is only called once every 800
steps, it makes little difference to the overall performance
whether the loop nest is "fixed" or not.
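
For reference, the interchange simply swaps the two DO statements
so that the leftmost subscript varies fastest (FORTRAN stores
arrays in column-major order).  This is a sketch, not the actual
SPEC source:

      DO 3500 JCHECK = 1, MNMIN
        DO 4500 ICHECK = 1, MNMIN
          PCHECK = PCHECK + ABS(PNEW(ICHECK,JCHECK))
          UCHECK = UCHECK + ABS(UNEW(ICHECK,JCHECK))
          VCHECK = VCHECK + ABS(VNEW(ICHECK,JCHECK))
 4500   CONTINUE
 3500 CONTINUE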

Having served as a professor of ocean modelling, I can say with
some authority that this piece of code is actually irrelevant to
the algorithm -- it is only there to print out an integrated metric
every once in a while so that the user can see if the solution
has "blown up".  The results of this loop nest are not otherwise
used.



OK, that was background.

Fast forward to CPU2000, where the benchmark was made larger and
renamed to 171.swim.  However, two more changes were made.
First, an extra line was added to the code (shown below) between
termination of the inner loop and the termination of the
outer loop.  This extra line modifies one of the arrays in
an obscure way and causes many optimizers to no longer interchange
the loop indices.

      IF(MOD(NCYCLE,MPRINT) .NE. 0) GO TO 370
         PCHECK = 0.0D0
         UCHECK = 0.0D0
         VCHECK = 0.0D0
         DO 3500 ICHECK = 1, MNMIN
           DO 4500 JCHECK = 1, MNMIN
             PCHECK = PCHECK + ABS(PNEW(ICHECK,JCHECK))
             UCHECK = UCHECK + ABS(UNEW(ICHECK,JCHECK))
             VCHECK = VCHECK + ABS(VNEW(ICHECK,JCHECK))
 4500      CONTINUE
        UNEW(ICHECK,ICHECK) = UNEW(ICHECK,ICHECK)
     1    * ( MOD (ICHECK, 100) /100.)
 3500   CONTINUE
        WRITE(6,366) PCHECK, UCHECK, VCHECK
370   CONTINUE

Careful examination shows that loop interchange is still legal, but
the compiler must prove it for this particular case before
proceeding.  Now we descend into the morass of special cases --
it is straightforward for the compiler to prove that the main
diagonal elements of UNEW are modified only after they are used,
so the solution is unchanged by loop interchange.
How many different versions of the extra line must the compiler
recognize and analyze correctly in order to consider this a
"general-purpose" optimization?  Is it "general-purpose" if
the compiler only recognizes changes on the main diagonal as
legitimate?
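
To make that concrete, here is a sketch (my reconstruction, not
actual compiler output) of the transformed code once the compiler
has proven the dependence harmless: interchange the loops as
before, and distribute the diagonal updates into a separate pass
(the 3600 label is invented here).  Each UNEW(I,I) is read exactly
once, at ICHECK = JCHECK = I, before the original code modifies it,
and no later iteration reads it again, so deferring all of the
updates changes neither the sums nor the final array contents:

      DO 3500 JCHECK = 1, MNMIN
        DO 4500 ICHECK = 1, MNMIN
          PCHECK = PCHECK + ABS(PNEW(ICHECK,JCHECK))
          UCHECK = UCHECK + ABS(UNEW(ICHECK,JCHECK))
          VCHECK = VCHECK + ABS(VNEW(ICHECK,JCHECK))
 4500   CONTINUE
 3500 CONTINUE
      DO 3600 ICHECK = 1, MNMIN
        UNEW(ICHECK,ICHECK) = UNEW(ICHECK,ICHECK)
     1    * ( MOD (ICHECK, 100) /100.)
 3600 CONTINUE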

The second change to the 171.swim benchmark makes this issue
much more important.  Where the CPU95 version of the code ran this
loop only once every 800 steps, the CPU2000 version of the code
runs the loop every step.  On cached machines where the
extra line of code inhibits loop interchange, performance drops
by a factor of at least 2x!  So this SPEC benchmark loses a
factor of two because I am not allowed to go in and fix the source
code and/or change the frequency back to the more reasonable value
of once every few hundred steps.

If this had been a customer code, I would have fixed it, documented
it, and returned it immediately.  Since it is a holy benchmark with
sacred source code, I had to spend a *lot* of time getting the compiler
team to add the appropriate dependency analyses to recognize when
loops of this form are safe candidates for loop interchange.
Overall, working around a problem that existed only because of the
frozen source code cost IBM immensely more money than fixing it
the right way would have.

Many of the SPEC benchmarks have similar "features"....


--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch
Subject: Re: Has Sun broken SPECfp2k?
Date: Tue, 4 Dec 2001 18:42:33 +0000 (UTC)

In article <9ui5v1$1492$1@news.net.uni-c.dk>,
Erik Corry  <erik@arbat.com> wrote:
>Paul Hsieh <qed@pobox.com> wrote:
>
>What you have to realise about SPECfp is that a lot of the people
>who are interested in SPECfp are running software written by
>physicists, engineers and mathematicians in Fortran.  Sure, a lot
>of it can be made faster by source code rearrangements, but these
>people are domain experts, not coding experts.

I think that there is a misunderstanding about the classes of
source code problems that I have been talking about here.

In the SWIM benchmark, there is a piece of code that just plain
should not be executed very often.  If the folks at SPEC had
bothered to ask an ocean or atmospheric modeller how often
that extra loop should be executed, the answer would have been
"once every several hundred steps".  But they didn't ask, they
just changed the input deck so that it would run every step.
The rest of the SWIM code had been made cache-friendly years ago,
but this loop had not been interchanged because it almost never
ran, so it made no difference.

Similarly in GALGEL, there is a 4-deep loop nest in the "hot"
subroutine, where switching the inner two loops with the outer
two loops results in a dramatic speedup (I got 38% on the whole
program with this simple swap).
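
Schematically, the change is of this form (this is not the actual
GALGEL source -- the array names and bounds are invented here).
Before the swap, the two inner loops walk the rightmost subscripts
of A, so every iteration of the innermost loop strides through
memory:

      DO 10 J = 1, N
      DO 10 I = 1, N
      DO 10 L = 1, M
      DO 10 K = 1, M
         T(I,J) = T(I,J) + A(I,J,K,L)*S(K,L)
   10 CONTINUE

After moving the inner pair outside, the innermost index I is the
leftmost subscript of both T and A, giving stride-1 access, and
S(K,L) is invariant in the two inner loops.  Since each T(I,J)
still receives its terms in the same order, even the rounded
results are unchanged:

      DO 20 L = 1, M
      DO 20 K = 1, M
      DO 20 J = 1, N
      DO 20 I = 1, N
         T(I,J) = T(I,J) + A(I,J,K,L)*S(K,L)
   20 CONTINUE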

These are not changes that degrade the readability and
maintainability of the code -- they are trivial rearrangements to
eliminate stupidities.

Now there certainly are really nasty and complex optimizations that
could be applied to SWIM that would indeed make it very hard to
read and maintain, and which would make it very fast on cached
machines, but those sorts of optimizations are not the topic
here....  Here is an example of such an optimization scheme:
http://www.cs.haverford.edu/people/davew/cache-opt/cache-opt.html

--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch
Subject: Re: MP benchmarking?
Date: Tue, 11 Jun 2002 13:12:34 +0000 (UTC)
Message-ID: <ae4t02$j3i$1@ausnews.austin.ibm.com>

In article <3D05A7A9.11E77B6E@mediasec.de>,
Jan C. Vorbrüggen <jvorbrueggen@mediasec.de> wrote:
>> >The communication latency and bandwidth advantages of an MP
>> >system would seem not to be adequately tested with such a
>> >benchmark.
>> This is a known deficiency, and is one of the reasons that
>> the SPEComp benchmark suite was developed.  The current
>> version(s) of the SPEComp suite are not particularly good
>> (which is why I pulled IBM out of participation in the
>> development of SPEComp two years ago), but the intent is
>> fine, and it is possible that the set of codes used in the
>> suite may someday be more sensible....
>
>So could you tell us in a little more detail, John (McC.), what you
>don't like about SPEComp?

I reviewed the codes in SPECfp2000, and asked a number of
colleagues for their opinions about the relevance of the codes
being used.

Of the 14 codes in SPECfp2000, I concluded that
	six were definitely not representative of what
		people currently purchase computers to run
	two might be representative of economically
		important application areas, but this needs
		more detailed evaluation
	two might be representative of economically
		unimportant application areas, but I don't
		really care whether they are representative or not
	four codes were in areas that I could not evaluate
		or find experts to evaluate

Because of this, I really did not want SPEComp to be based on the
CFP2000 codes.  Unfortunately another vendor made effective use
of its bureaucratic talents to take control of the effort and
drive it forward with zero concern for whether the suite actually
represented what people purchase computers to run.  The trouble
with bureaucracies is that after a while they exist in order to
provide jobs for bureaucrats, and not in order to serve the
original function of the organization....

At that point (June 2000) I looked at the timeline and decided
that I did not want to spend the next several years of my life
trying to bludgeon SPEC into doing what I could do myself in a
few months' work.   Of course, the workloads that I have put
together for this purpose are proprietary and confidential....

--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


Newsgroups: comp.arch
Subject: Re: MP benchmarking?
From: lindahl@pbm.com (Greg Lindahl)
Message-ID: <3d06c916$1@news.meer.net>
Date: 11 Jun 2002 21:07:50 -0700

In article <3D066878.578D437A@sgi.com>, Wesley Jones  <wesley@sgi.com> wrote:

> We are, however, working on HPC2002 which will include a few
> applications and they will run using MPI and OpenMP.

Cool. It'll be fun to see if shared memory machines run the MPI
versions faster than the OpenMP versions.

>We are working towards including/replacing individual benchmarks with more
>relevant application level benchmarks and market areas.

Good. Why don't you start by listening to what John McCalpin has to
say? I'm not very motivated to help in a modest way if John's
excellent input doesn't go far.

greg



From: Wesley Jones <wesley@sgi.com>
Newsgroups: comp.arch
Subject: Re: MP benchmarking?
Date: Wed, 12 Jun 2002 10:38:17 -0700
Message-ID: <3D078709.38C29A79@sgi.com>

Greg Lindahl wrote:
>
> In article <3D066878.578D437A@sgi.com>, Wesley Jones  <wesley@sgi.com> wrote:
>
> > We are, however, working on HPC2002 which will include a few
> > applications and they will run using MPI and OpenMP.
>
> Cool. It'll be fun to see if shared memory machines run the MPI
> versions faster than the OpenMP versions.
>
> >We are working towards including/replacing individual benchmarks with more
> >relevant application level benchmarks and market areas.
>
> Good. Why don't you start by listening to what John McCalpin has to
> say? I'm not very motivated to help in a modest way if John's
> excellent input doesn't go far.

We have listened to John, both specifically and generally.

Here are some examples:
   We removed galgel from the OMPL2001 suite.
   We are trying to replace ammp with a chemistry application or multiple chemistry
      applications of greater relevance.
   We are trying to get the WRF weather model in HPC2002 and then into OMP benchmarks.
      This in itself is not easy.
   We are trying to get an unstructured fluids application.

I am sorry that IBM feels that it is better to not participate.  We would certainly
welcome the help of someone who could help get applications, evaluate applications,
integrate them into the tool set and report results.  However, at this time, I don't
think that IBM really wants to report benchmarks on their full system with HPC
applications.

Here's one example that I am familiar with.

Here are MM5 weather model benchmark numbers, taken from the
MM5/MPP performance data page.  Performance is measured in MFlop/s.

#CPUs   IBM p690 POWER4 1.3 GHz   SGI Origin 3800 600 MHz
  16              8248                      5015
  32             13701                     10302
  64                                       19668

Given the price of the p690, and that the system has 4 1/3 times
the peak flops per processor, much better L1-to-L2 peak bandwidth
per processor, and much better peak memory bandwidth per processor
averaged over the system, one would expect it to beat the Origin
3800 on an application-level benchmark by far more than the ~30%
seen here (at 32 CPUs, 13701/10302 is about 1.33).  But it does
not, at least in this case.

Of course MM5 is just one weather application and this is just one data set.

>
> greg


From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch
Subject: Re: MP benchmarking?
Date: Wed, 12 Jun 2002 19:45:23 +0000 (UTC)
Message-ID: <ae88ci$hu8$1@ausnews.austin.ibm.com>

In article <3D078709.38C29A79@sgi.com>, Wesley Jones  <wesley@sgi.com> wrote:
>
>I am sorry that IBM feels that it is better to not participate.

The issue is primarily one of resource allocation.  Getting anything
done via the SPEC organization has proven to be incredibly slow and
painful, and at the time I pulled us out, we did not have anyone
with the right temperament and technical skills that we were willing
to commit to this project.

I am delighted to see that someone of your technical competence
is involved --- this is a good incentive for me to try to have
someone allocated to the SPEComp project.


>[...]  However, at this time, I don't
>think that IBM really wants to report benchmarks on their full system with HPC
>applications.

Like the IDC benchmark, SPEComp poses ethical difficulties.
Our SPEComp benchmark results are outstanding (I don't think
that I can go into any more detail), but that does not make the
codes any more (or less) relevant to customer applications.
Clearly there is motivation to promote these results aggressively
if what you are interested in is short-term marketing advantage.
On the other hand, if what you are interested in is "better
benchmarks", it would be wise to avoid getting committed to yet
another mediocre benchmark that will just confuse the picture in
the long term.

When I pulled IBM out of SPEComp 24 months ago, the attitude of
many members of the SPEC CPU subcommittee was:
"We will work two years on the first version, and three
 years on the second version, and three years on the third
 version, and by the time we ship that third revision, it
 will be pretty good."

I could not justify committing 8 person-years of effort to a
project that would end up being 90% bureaucracy and 10% technical
benchmarking.
--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch
Subject: Re: MP benchmarking?
Date: Thu, 13 Jun 2002 14:21:33 +0000 (UTC)
Message-ID: <aea9pd$l56$1@ausnews.austin.ibm.com>

In article <3D08542D.CF080408@mediasec.de>,
Jan C. Vorbrüggen <jvorbrueggen@mediasec.de> wrote:
>Just staying on the side-lines and
>afterwards saying, "I told you so" gives the impression of being
>unprofessional

That is why I was very careful to say "I told you so" *before*
SPEComp2000 was developed, rather than *after*.

I may not have told you personally, but my statements have been
clear and consistent for over two years now  (which, by the way,
long pre-dates any opportunity I would have had to determine
whether or not these codes would run well on IBM POWER4
platforms).
--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch
Subject: Re: MP benchmarking?
Date: Thu, 13 Jun 2002 14:29:08 +0000 (UTC)
Message-ID: <aeaa7k$fdk$1@ausnews.austin.ibm.com>

In article <3D08680F.5060404@brussels.sgi.com>,
Alexis Cousein  <al@brussels.sgi.com> wrote:
>McCalpin wrote:
>> Like the IDC benchmark, SPEComp poses ethical difficulties.
>> Our SPEComp benchmark results are outstanding (I don't think
>> that I can go into any more detail), but that does not make the
>> codes any more (or less) relevant to customer applications.
>
>Mhh -- by that rationale, of course, given that you can't buy
>1p p690s, and that most people buying larger Regattas may
>want to use more than one processor, IBM should have kept the
>SPEC2000 scores to themselves until SPECrate numbers were available

Unfortunately, that was not my call.  We had SPECrate numbers
available at launch, and the product management/marketing
folks chose not to allow them to be released.


>(and/or, arguably, SPEComp -- it's not that the SPEComp codes
>are any *worse* than the "normal" SPECint/fp codes wrt
>your criticisms, and the fact they use OpenMP and scale
>moderately well makes them rather more than less interesting).

Again, it is an issue of resource allocation.   I think that
the SPEComp2000 codes have some useful features and provide
some insights into performance.   However, what they provide
is by no means worth the huge cost of being involved
in their development, maintenance, tuning, support, etc.,
which is measured in many millions of $$$ over the 5-8
years required to turn the suite into something that is actually
moderately representative of what people purchase computers
to run.
--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."
