From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.sys.mac.scitech,comp.lang.fortran
Subject: Re: Pathetic SPEC2000 numbers for Apple G4
Date: Tue, 4 Jun 2002 20:55:31 +0000 (UTC)
Message-ID: <adj9g3$mg4$1@ausnews.austin.ibm.com>

In article <796f488f.0206041006.79f1c1c6@posting.google.com>,
Paul Hsieh <qed@pobox.com> wrote:
>Bruce Hoult <bruce@hoult.org> wrote:
>
>> What it hasn't had the last few years is better memory bandwidth.
>> *That* is mostly what SPECFP is measuring.
>
>Well apparently some in depth study by McCalpin does not fully agree.
>I mean, I've looked at the code (for SpecFP 95) and seen nothing but
>nested loops running through memory arrays in various orders.  So far,
>I have not been able to reconcile this with McCalpin's findings.

On the cache modelling issue:
It turns out that staring at the loops in a code is not a very
effective way to model cache behavior....  The vector folks made
the same mistake --- lots of people thought that automotive crash
codes and weather codes both required vectors because the source
code contained lots of vectorizable loops.  But when you put a
real cache on it, you find out that there is a lot of work for
every cache miss, and you really don't have to have the mega-
expensive memory subsystem that the vector machines provide.
From my study, I found that both the weather codes and the auto
crash codes typically used only about 1/10 of the maximum
sustainable bandwidth on the SGI Origin2000.  It is pretty clear
that improving the bandwidth is a very weak lever for increasing
performance in this case.
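
To put rough numbers on why the lever is weak, here is a
back-of-the-envelope sketch (a simplified serialized-transfer
model with illustrative values, not figures from the study):

    /* Simplified model: the memory transfers are charged as if fully
       serialized at the sustainable bandwidth, which is an upper bound
       on their cost.  Values are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        double used_fraction = 0.10; /* observed avg BW / sustainable BW */
        double bw_boost      = 2.0;  /* hypothetical bandwidth increase  */

        double mem_time = used_fraction;      /* runtime normalized to 1 */
        double cpu_time = 1.0 - mem_time;
        double new_time = cpu_time + mem_time / bw_boost;

        printf("best-case speedup from %.0fx bandwidth: %.3fx\n",
               bw_boost, 1.0 / new_time);     /* prints ~1.053x */
        return 0;
    }

Even with no credit for overlap between compute and memory traffic,
doubling the bandwidth recovers only about 5% in this sketch.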

On the SPEC95 issue:
SPECfp95 has a combination of codes with fairly small working
sets and loops with multiple accesses to the same array(s) with
different offsets.  Caches work pretty well for this.

If I recall correctly, about 4 of the 10 SPECfp95 codes wanted
a fair amount of bandwidth -- something like 1-2 bytes per
instruction.  The other 6 codes needed a lot less bandwidth,
down to very close to zero for one of the codes.   This suggests
that memory bandwidth can be a relatively strong lever on only
a subset of the SPECfp95 codes, while the majority are limited
primarily by the performance of the CPU core and caches.
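
For scale, the bytes-per-instruction figures translate into a raw
bandwidth demand as follows (the sustained instruction rate below
is a hypothetical round number chosen only to show the unit
conversion, not a measurement):

    /* Convert bytes-per-instruction into a bandwidth demand.  The
       instruction rate is a hypothetical round number. */
    #include <stdio.h>

    int main(void)
    {
        double insn_per_sec     = 400e6;  /* hypothetical sustained rate */
        double bytes_per_insn[] = { 0.1, 1.0, 2.0 };

        for (int i = 0; i < 3; i++)
            printf("%.1f bytes/insn -> %.2f GB/s of memory traffic\n",
                   bytes_per_insn[i],
                   insn_per_sec * bytes_per_insn[i] / 1e9);
        return 0;
    }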
--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch,comp.sys.mac.scitech,comp.lang.fortran
Subject: Re: Pathetic SPEC2000 numbers for Apple G4
Date: Fri, 7 Jun 2002 23:23:24 +0000 (UTC)
Message-ID: <adrf9b$lre$1@ausnews.austin.ibm.com>

In article <796f488f.0206061934.327156a3@posting.google.com>,
Paul Hsieh <qed@pobox.com> wrote:
>mccalpin@gmp246.austin.ibm.com (McCalpin) wrote in message news:<adj9g3$mg4$1@ausnews.austin.ibm.com>...
>> Paul Hsieh <qed@pobox.com> wrote:
>> >Bruce Hoult <bruce@hoult.org> wrote:
>> >> What it hasn't had the last few years is better memory bandwidth.
>> >> *That* is mostly what SPECFP is measuring.
>> >
>> >Well apparently some in depth study by McCalpin does not fully agree.
>> >I mean, I've looked at the code (for SpecFP 95) and seen nothing but
>> >nested loops running through memory arrays in various orders.  So far,
>> >I have not been able to reconcile this with McCalpin's findings.

>> [...] On the SPEC95 issue:
>> SPECfp95 has a combination of codes with fairly small working
>> sets and loops with multiple accesses to the same array(s) with
>> different offsets.  Caches work pretty well for this. [...]
>
>Our experiences obviously differ.  So I am not left with any more
>insight into this problem than I had before.  The results on spec.org
>clearly confirm my conception of what is going on (on the other hand
>I considered that to be part of my input data), so I just don't
>understand your analysis.

You have been looking at machines with very small caches,
while my study was on a machine with a 4 MB L2 cache.

There is no doubt that a machine with a 256 kB or 512 kB cache
will need to move a lot more data than a machine with a 4 MB
cache.  These systems will therefore be more sensitive to
both the bandwidth and the latency of the main memory
subsystem....
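
A toy example of why the cache size matters so much for the
traffic (the array size and sweep count are illustrative, not
taken from any SPEC code):

    /* Repeated sweeps over one 2 MB array.  A 4 MB cache misses only
       on the first sweep; a 256 kB cache misses on every sweep. */
    #include <stdio.h>

    int main(void)
    {
        double array_bytes = 2.0e6;   /* working set of the array */
        int    sweeps      = 100;     /* passes the program makes */

        double traffic_big   = array_bytes;          /* fits in 4 MB    */
        double traffic_small = array_bytes * sweeps; /* thrashes 256 kB */

        printf("4 MB cache:   ~%.0f MB of memory traffic\n",
               traffic_big / 1e6);
        printf("256 kB cache: ~%.0f MB of memory traffic\n",
               traffic_small / 1e6);
        return 0;
    }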



>I mean how do we reconcile this?  Is it possible that that SGI machine
>just had a really really cruddy core, or that the compiler really
>really sucked?

I think it is mostly just a big cache vs small cache issue.

I don't have data on bandwidth utilization of the SPECfp95 codes
on machines with small caches.  I did some measurements on 1 MB
caches and did see an increased sensitivity to bandwidth, but
I no longer have access to those studies.


There are, of course, other important differences between the
Origin2000 and the IA32 systems being discussed here.  The
Origin2000 uses 128 byte cache lines, which reduces sensitivity to
memory latency.   The Origin2000 also sustained more bandwidth per
SPECfp95 than high-end PentiumIII systems.  (The 800 MHz PIIIs had
about 25% more STREAM bandwidth but 60% more SPECfp95.)  I don't
think that this was bad compilers on SGI's part -- I think that
this was the advantage of an 800 MHz core instead of a 195 MHz
core.
--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."


From: mccalpin@gmp246.austin.ibm.com (McCalpin)
Newsgroups: comp.arch
Subject: Re: MP benchmarking?
Date: Sat, 8 Jun 2002 17:57:11 +0000 (UTC)
Message-ID: <adtghn$lsc$1@ausnews.austin.ibm.com>

In article <20020608095700.10184.00000703@mb-bh.aol.com>,
At150bogomips <at150bogomips@aol.com> wrote:
>Is the SPECrate-style benchmarking of MP systems representative
>to the main intended uses for such systems (i.e., identical
>resource-type-usage, extremely independent multiprogram
>workloads)?  (Such uses seem much more suited to clusters with
>inexpensive interconnects.)

SPECrate has some good qualities and some poor qualities.
Like any benchmark, it is a tradeoff between accuracy and
comprehensibility.

Mostly what SPECrate shows is the performance impact of contention
for main memory bandwidth (and for any shared caches).
The test methodology maximizes bandwidth contention, and because
each benchmark is timed individually, the detailed results
contain a lot of information.

My favorite is 171.swim, which is the heaviest user of bandwidth
(on systems with caches >2 MB).   If I remember the numbers
correctly, on a machine with a write-allocate cache strategy,
each instance of 171.swim moves about 478 GB of data (neglecting
cache conflicts, which vary depending on page coloring and
compiler array padding), so it is possible to estimate
sustainable memory bandwidth from the 171.swim rate results.
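
The estimate itself is just arithmetic; the copy count and elapsed
time below are placeholders that would come from a published rate
disclosure:

    /* Sketch of the bandwidth estimate from a 171.swim rate result.
       The copy count and elapsed time are placeholders; real values
       come from a published SPECfp_rate2000 disclosure. */
    #include <stdio.h>

    int main(void)
    {
        double gb_per_copy = 478.0;  /* data moved by one 171.swim run */
        int    copies      = 8;      /* hypothetical: one copy per CPU */
        double elapsed_s   = 2000.0; /* hypothetical elapsed seconds   */

        printf("estimated sustained memory bandwidth: ~%.2f GB/s\n",
               copies * gb_per_copy / elapsed_s);  /* ~1.91 GB/s here */
        return 0;
    }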


>The communication latency and bandwidth advantages of an MP
>system would seem not to be adequately tested with such a
>benchmark.

This is a known deficiency, and is one of the reasons that
the SPEComp benchmark suite was developed.  The current
version(s) of the SPEComp suite are not particularly good
(which is why I pulled IBM out of participation in the
development of SPEComp two years ago), but the intent is
fine, and it is possible that the set of codes used in the
suite may someday be more sensible....



>[...]  For some problem mixes, a 32P Power4 system might offer
>nearly equivalent (or better??) throughput relative to a 16P HP
>system (half the cores disabled) and a 16P 'standard' system--at
>somewhat lower cost.

I think you meant a "16P HPC" system.   "HP" usually refers
to that other computer manufacturer.

I am confused by what you wrote above....  Of course the 32p
system offers better throughput than the 16p HPC system -- the
32p system has all of the cache and memory resources of the 16p
system, but has 16 more cores to make the compute part run faster.
Of course, the 32p system is very seldom twice as fast as the 16p
HPC config -- the performance ratio ranges all the way from 1x
(my 171.swim example) to very close to 2x (LINPACK HPC), with
examples to be found at nearly every intermediate point.
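
A toy model shows where a given code lands in that range (the
bandwidth-bound fractions below are illustrative, not
measurements):

    /* Toy model: the compute part of the runtime is halved by the
       extra 16 cores, the bandwidth-bound part is not. */
    #include <stdio.h>

    int main(void)
    {
        double mem_fraction[] = { 1.0, 0.5, 0.0 }; /* share of time BW-bound */

        for (int i = 0; i < 3; i++) {
            double t32 = mem_fraction[i] + (1.0 - mem_fraction[i]) / 2.0;
            printf("%3.0f%% bandwidth-bound: 32p/16p throughput = %.2fx\n",
                   mem_fraction[i] * 100.0, 1.0 / t32);
        }
        return 0;   /* prints 1.00x, 1.33x, 2.00x */
    }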
--
John D. McCalpin, Ph.D.           mccalpin@austin.ibm.com
Senior Technical Staff Member     IBM POWER Microprocessor Development
    "I am willing to make mistakes as long as
     someone else is willing to learn from them."
