Index
Home
About
From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.std.c
Subject: Re: Fastest ints (was: 64 bit advantage?)
Date: 31 Mar 1999 20:45:09 GMT
Unfortunately, the whole idea of "fastest int" may well be out-of-date.
It has long been the case, but is especially true now, that the
natural "fastest" integer size may be (using current 64-bit CPUs as examples):
1) The "natural" integer register width [i.e., 64 bits]
2) Equally, any of the sizes [64, 32, 16, 8]
3) Equally, any of the smaller sizes [32, 16, 8]
4) unequally, one of the smaller sizes [likely 32]
And which is true depends on:
a) The specific Instruction Set Architecture.
b) The specific implementation of the ISA, including datapath widths,
and gory details of cache hierarchy (like write-thru versus write-back caches,
and whether or not the nearest level of write-back cache has ECC or not,
and if so, is it ECC on 32, 64, or larger number of bits.
c) Whether the integer is signed or unsigned, interacting with a).
d) What kinds of operations are being done.
e) Cache effects relating from size of the data object.
f) The extent of global register allocation and tracking done by the compiler.
f) How the code is written...
For example, of 64-bit ISAs (but partially applicable to some 32-bit ISAs):
A1) Some ISAs (like MIPS) have a 64-bit integer LOAD instructions,
plus two each of 32-bit, 16-bit, and 8-bit loads, where you can choose
zero-extension or sign-extension. Sometimes all loads are the same speed,
sometimes the full-width load is fastest, and the rest are slower,
sometimes the full-width load and the zero-extends are fast, and the others
cost and extra cycle.
A2) Some ISAs have 64- and 32-bit LOADs, and do 8s and 16s with extraction ops.
A3) Some ISAs have either LOAD-zero-extend, or LOAD-sign-extend for
8, 16, 32, but not both (so it matters whether or not it is signed or
unsigned), and the other one needs explicit sign-extend.
A4) Some ISAs may offer LOAD-8 or LOAD-16 that only effect the low-order
bits of the register, so they act more like inserts.
Hence, it can either cost the same to get data into a register, or have
widely-varying costs.
B1) When storing data from a register into memory, most ISAs have
STORE 8, 16, 32, 64).
B2) Some ISAs may offer STORE-64 and STORE-32 (earliest Alphas), but
omit STORE-8 and STORE-16, using LOAD-64, insert, STORE-64 sequences.
B3) Some CPUs use L1 caches that are write-back with byte-parity,
and all stores take the same time. If L1 is write-back with ECC,
ECC will be computed on some size, and any store less than that will
incur extra cycles for read-modify-write. CPUs with store-through
L1 and ECC'd L2 can either incur extra cycles, or not many if they used
buffered writebacks in the hope of aggregating writes.
B4) Within the same ISA, different cache designs will appear over time,
and cache line-sizes, bus-widths, etc, sometimes change.
C1) An ISA may offer operations on any of the sizes, and then maybe
sign/zero extend the result to the full register width, or they might not
do this extension.
C2) An ISA might offer direct support for 2 sizes only: in some 64-bit CPUs
(as in MIPS, for example), there are 32-bit adds (with and without overflow
detection), which, on the 64-bit implementations, produce the same low-order
32-bits as did the 32-bit implementations, and then sign-extend the result
to 64-bits. The 64-bit implementations also provide full 64-bit adds.
C3) An ISA might really have only one size of add, and everything else has to be
extended first.
C4) The time to do a 32-bit add or 64-bit add are usually the same on
current CPUs, but not necesarily so, and this especially didn't use to be
true (where some CPU family members had narrower datapaths, narrower ALUs,
and had to do even 32-bit adds via multiple passes of 16-bit or even 8-bit
ALUs ... whereas other family members had full-width ALUs).
C5) I used ADD as an example, it being common. Usually, ADD and SUBTRACT
have the same characteristics, but different operations may or may not be
provided in an orthogonal fashion, i.e., 64-bit MIPS have both 64- and
32-bit ADDS, but only have 64-bit ORs and ANDs.
D1) For most implementations of most ISAs, integer multiplication is slower
than integer addition, and integer division, if provided at all, is slower yet.
64-bit integer multiplies are usually slower than 32-bit ones (but might not
be), and 64-bit integer divides are really slower than 32-bit ones,
even on CPUs that do have the divide hardware.
[This was a contributing reason to many people's wish not to do ILP64,
i.e., performance hit from inherent extra cycles from doubling the wdith.]
Of course, if you are doing multi-precision arithmetic, you are happy to
have hardware to help you and access to it.
E1) Compiler optimization matters. If there is extra overhead in fetching
smaller data, then that would argue for fastest being the full-register
width, but if a good optimizer fetches it to a register, expands it once,
and then re-uses it there, there's little difference, in whic hcase cache
effects may dominate.
E2) Cache effects matter ... a lot. We already have CPUs that can peak at 4
instructions per clock, and were headed higher. Within 2 years, we'll starting
seeing GHz chips (1ns cycle, 4-6 IPC peak), with for example, ~150ns
real access times to memory, meaning that one could execute ~1000 instructions
in the time to do one cache miss. Hence, on such machines, if you have big
arrays of integers, and using smaller sizes lessens the cache miss rate,
at the costs of a few more instruction cycles, it may not be a bad idea
to go smaller. Of course, if the total data is small, and likely to be
cache-resident anyway, then the tradeoff is different.
E3) Finally, complex out-of-order CPUs (like MIPS R10K, PA-RISC, XEONS, etc)
make it very hard to make strong conclusions by looking at micro-level
cycle counts anyway.
SUMMARY: in the old UNIX/C days [i.e., when K&R was first written],
it was pretty easy to know that 16-bit ints were fast on PDP-11s,
and 32-bit ints fast on VAXen. On 68000/68010, even though the registers
were 32-bits wide, 16-bits was often faster, and some people stuck with
16-bit ints as a result. With most current 64-bit CPUs, sometimes
64-bit is faster, but *usually* 32-bit gets enough (extra) support
compared to 16 that it is likely to be as fast as 64-bit, and better
than 16-bit ... but the bottom line of all of this, is that
"fastest" is a very slippery metric, and no one should ever expect
that any one size is uniformly faster, because it hardly ever is,
because it depends on all these factors listed above.
--
-john mashey DISCLAIMER: <generic disclaimer: I speak for me only...>
EMAIL: mash@sgi.com DDD: 650-933-3090 FAX: 650-933-4392
USPS: Silicon Graphics/Cray Research 40U-005,
2011 N. Shoreline Blvd, Mountain View, CA 94043-1389
Index
Home
About