Decimal FP (Glen Herrmannsfeldt; Mitch Alsup; Terje Mathisen; Wilco Dijkstra; hack@watson.ibm.com)

Index Home About Blog

From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: IEEE 754r, decimal floating point, BID and DPD
Date: Thu, 19 Feb 2009 12:34:48 -0800 (PST)
Message-ID: <02450a18-8dfc-4838-b9e1-6796787ac834@j8g2000yql.googlegroups.com>

On Feb 18, 11:04 pm, Jeremy Linton <reply-to-l...@nospam.org> wrote:
>         Holy sh*t, it looks like there are _TWO_ standard "incompatible"
> formats in IEEE 754r. Who thought that would be a good idea? Apparently
> IBM pushed a hardware centric Densely Packed Decimal (DPD), and Intel
> pushed a software centric Binary Integer Decimal (BID).

DPD is designed for business applications
BID is designed to be integrated into binary FPUs

The biggest difference is the DPD is an unnormalized format, designed
to allow decimal calculations without the need for pre-alignment nor
post-normalization. Thus, it is expected that the dat being processed
remains right-aligned (although there is no requirement for it to
remain so). So, in a reasonable implementation of DPD, one can survey
the upper most section of the fraction and if it contains zeros, there
is an excellent chance that you are processing DPD.

To a certain extent this is my fault. The architectural review board
at AMD was ask to support DPD (from IBM) and BID (from Intel). We
decided that we did not understant enough of the nuanced issues to
vote one way or the other. And due to this, the Decimal FP comittee
basically broke down without a winner. This is a disasterous mistake
for the architecural community, a diasterous mistake for the business
community, and a diasterous mistake for joe-random human. After
studying the nuances for a couple of months, I became firmly
supportive of DPD is the "right" way to do Decimal FP calculations for
business applications, and for innocent people playing around with
eXcel-like spreadsheets where decimal results are expected by the non-
numerical analyst. By the time I realized my error it was too late to
change the situation. And for this I am quite sorry.

mitch

From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: IEEE 754r, decimal floating point, BID and DPD
Date: Thu, 19 Feb 2009 16:37:55 -0800 (PST)
Message-ID: <8fd40e82-28fb-49d3-b006-6cf12d836174@o11g2000yql.googlegroups.com>

On Feb 19, 3:05 pm, Glen Herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> MitchAlsup wrote:
> > On Feb 18, 11:04 pm, Jeremy Linton <reply-to-l...@nospam.org> wrote:
> >>        Holy sh*t, it looks like there are _TWO_ standard "incompatible"
> >>formats in IEEE 754r. Who thought that would be a good idea? Apparently
> >>IBM pushed a hardware centric Densely Packed Decimal (DPD), and Intel
> >>pushed a software centric Binary Integer Decimal (BID).
> > DPD is designed for business applications
> > BID is designed to be integrated into binary FPUs
> > The biggest difference is the DPD is an unnormalized format, designed
> > to allow decimal calculations without the need for pre-alignment nor
> > post-normalization. Thus, it is expected that the dat being processed
> > remains right-aligned (although there is no requirement for it to
> > remain so). So, in a reasonable implementation of DPD, one can survey
> > the upper most section of the fraction and if it contains zeros, there
> > is an excellent chance that you are processing DPD.
>
> I hadn't heard this one before.  There were binary floating point
> processors years ago that would keep integer results right aligned,
> and with a biased exponent of zero.  That allows the same format
> to be used for integer and floating point.  (I believe some Burroughs
> and CDC machines did that.)

Buroughs and to a certain extent CDC

> As I understand it, it is very easy to convert DPD to BCD, maybe
> even when stored in processor registers.  That makes it fairly
> easy to normalize values, either left or right doesn't make much
> difference.

The trick is that DPD can get 3 decimal digits into 10-binary bits
(that is less than 2.4% waste) and expand/compress back into storage
form in 2 gates of delay. It would not surprise me to find an IBM
patent on this scheme.)

> I only heard about BID a few days ago.  I suppose it is easier
> to process in a binary ALU, but normalization (pre and post) will
> be much harder and slower.  (Multiply and divide by powers of 10.)

The real problems with BID is I/O (in ASCII) and rounding. In DPD
rounding only requires decimal shifting, wherease in BID it requires
multiplication (and maybe division (hazy here)).

> I believe IBM is producing DPD processors.  I only heard about BID
> recently, and have no information that Intel is producing processors
> supporting it.   To me it looks like DPD makes efficient use of the
> bits, and should be able to be implemented such that it runs about
> as fast as binary, and maybe faster.  (Fewer levels of logic for
> a barrel shifter for pre/post normalization.)

More importantly, one can set up the DFU pipeline to assume that no
pre-alignment and no post-normalization are required and handle these
as micro-faults.

I also suspect that IBM is doing 128-bit DPD calculations, as this
basically gets rid of overflow and undeflow for any reasonable
business application--including computing the world GDP in the least
valued currency in the world by adding up every single unit of value
transacted that year. I would council anyone contemplating doing a DPD
to just build the 128-bit DFU and have it done with. With 128-bit DFU
and respectable performance, one could garner a considerable amount of
physics calculations that are problematic in 64-bit IEEE 754.

Mitch

From: Glen Herrmannsfeldt <gah@ugcs.caltech.edu>
Newsgroups: comp.arch
Subject: Re: IEEE 754r, decimal floating point, BID and DPD
Date: Thu, 19 Feb 2009 20:43:51 -0700
Message-ID: <gnl8te$6v4$1@aioe.org>

MitchAlsup wrote:
(snip)

> The trick is that DPD can get 3 decimal digits into 10-binary bits
> (that is less than 2.4% waste) and expand/compress back into storage
> form in 2 gates of delay. It would not surprise me to find an IBM
> patent on this scheme.)

IBM patents many things to be sure that they can build them
without being sued by others.  If they want this to spread,
then they should license if free or minimal cost.

>>I only heard about BID a few days ago.  I suppose it is easier
>>to process in a binary ALU, but normalization (pre and post) will
>>be much harder and slower.  (Multiply and divide by powers of 10.)

> The real problems with BID is I/O (in ASCII) and rounding. In DPD
> rounding only requires decimal shifting, wherease in BID it requires
> multiplication (and maybe division (hazy here)).

I suppose for business use that is I/O bound.  For CPU bound problems,
normalizing will get to you fast.  For addition and subtraction,
and to some extent multiplication you might get away with keeping
them unnormalized as long as possible, but with divide you will
(most of the time) have to normalize.

Looking at the Intel web site, it looks like BID is meant for
software implementation, and Intel supplied gcc with a library
of routines to do it.  Also, conversions to/from DPD.

(snip)

> More importantly, one can set up the DFU pipeline to assume that no
> pre-alignment and no post-normalization are required and handle these
> as micro-faults.

> I also suspect that IBM is doing 128-bit DPD calculations, as this
> basically gets rid of overflow and undeflow for any reasonable
> business application--including computing the world GDP in the least
> valued currency in the world by adding up every single unit of value
> transacted that year. I would council anyone contemplating doing a DPD
> to just build the 128-bit DFU and have it done with. With 128-bit DFU
> and respectable performance, one could garner a considerable amount of
> physics calculations that are problematic in 64-bit IEEE 754.

IBM has supported 128 bit floating point since the 360/85 in
about 1968.  Except for divide it was standard on S/370 and later.
Somewhere in ESA/390 DXR (extended precision divide) was added.

Other than that, others have not provided much in the way
of hardware implementations.  Some VAX had it, but most did it
as software emulation on the illegal instruction trap.

If 128 bit DPD began to be supported in hardware, and began
to be used, that would greatly speed up its popularity.

-- glen

From: "Wilco Dijkstra" <Wilco.removethisDijkstra@ntlworld.com>
Newsgroups: comp.arch
Subject: Re: IEEE 754r, decimal floating point, BID and DPD
Message-ID: <jdGnl.17017$dm3.7222@newsfe27.ams2>
Date: Fri, 20 Feb 2009 22:33:52 -0000

<nmm1@cam.ac.uk> wrote in message news:gnmotu$p5b$1@soup.linux.pwf.cam.ac.uk...
> In article <63c6d94e-5e26-413c-a8b7-07fbc8ad37ff@j38g2000yqa.googlegroups.com>,
> MitchAlsup  <MitchAlsup@aol.com> wrote:
>>
>>Consider DPD versus BID: and generate a very small 64-bit DFP number
>>with lots of digits:: 1.234567890123456E-256 (*)
>>
>>Now multiply this number by 10 five hundred and twelve times.
>>
>>Do you get exactly 1.234567890123456E+256 ?
>>
>>In DPD the answer is actually yes. And this occurs because the
>>calculation actually takes place in decimal. And it does not mater
>>which digits make up the fraction.

Same for BID - unlike the current binary format.

>>(*) Any suitable fraction that uses every single bit of its
>>representation.
>
> I am pretty certain that the answer is yes in BID, too, because the
> calculation takes place in scaled integers.  If the answer ISN'T yes,
> then the IEEE 754R people made a spectacular mess of things that I
> am pretty certain they know a lot about.  Also, when I looked at it,
> I thought they had got that aspect right.
>
> While I can be very rude about their design, with very good reason,
> I don't think that there is anything to criticise about the consistency
> of the arithmetic between DPD and BID.  I could be wrong, of course.

Indeed, the key is that both use a decimal exponent. Whether the mantissa
is decimal or binary doesn't actually matter, DPD and BID are just different
encodings of the same value. I'm sure it is possible to convert between the
formats without any loss.

Wilco

Date: Fri, 20 Feb 2009 19:43:25 +0100
From: Terje Mathisen <"terje.mathisen at tmsw.no">
Newsgroups: comp.arch
Subject: Re: IEEE 754r, decimal floating point, BID and DPD
Message-ID: <0pmdnfyBd_NTZAPUnZ2dnUVZ_szinZ2d@giganews.com>

MitchAlsup wrote:
> Consider DPD versus BID: and generate a very small 64-bit DFP number
> with lots of digits:: 1.234567890123456E-256 (*)
>
> Now multiply this number by 10 five hundred and twelve times.
>
> Do you get exactly 1.234567890123456E+256 ?

You must, which means that a binary encoded representation _must_ keep
the mantissa as an integer and store the decimal exponent separately, right?

> In DPD the answer is actually yes. And this occurs because the
> calculation actually takes place in decimal. And it does not mater
> which digits make up the fraction.
>
> In DPD one can start with the large number and divide it by 10 five
> hundred and twelve times and end up with exactly the smaller number.

This should be trivial in both representations, since only the exponent
field is modified. (I really have to check how BID is stored!)
...
OK, I looked at one of the Intel papers, showing that their library is
about an order of magnitude faster than decNumber, the only well-known
sw implementation, and the one GCC can use.

decNumber is based on the BCD/DPD packing, so that factor of 10
represents the sw overhead vs simply working in scaled binary all the time.

> In DPD one can start with the large number and multiply it by 0.1 five
> hundred and twelve times and end up with exactly the smaller number.

Since 0.1 in binary decimal is simply 1 + e-1, that multiplication
becomes an exponent decrement.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Date: Fri, 20 Feb 2009 19:19:47 +0100
From: Terje Mathisen <"terje.mathisen at tmsw.no">
Newsgroups: comp.arch
Subject: Re: IEEE 754r, decimal floating point, BID and DPD
Message-ID: <ALidnTo2RoHZaQPUnZ2dnUVZ_gWWnZ2d@giganews.com>

Bernd Paysan wrote:
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>> There might be a few more ways to do it, but I believe they will turn
>> out to be mostly trivial modifications to one of the two above.
>
> For software, the easiest way is to use a lookup table. There are an awful
> lot of possible encodings that fit 1000 states into 10 bits (1024!/24!=

The best software method is the one that needs the least storage space,
as long as the processing time is more or less identical, right?

> ~8.73e2615), all equally simple to encode/decode with lookup tables, and
> the one where you simply store the integers 0..999 in 10 bits even can be
> converted algorithmically (and easily can be converted to binary FP). Now
> if you want to find the easiest way to do hardware encoding, you probably
> really have to throw all those possible lookup tables into a term
> minimizer, and compare the results (you probably can find a heuristic like
> alpha-beta min-max to prune the tree ;-). You might even pose restrictions
> like "find the encoding with least amount of gates where the collating
> sequence is preserved" (there are still ~2.17e48 possible encodings with
> preserved collating sequence - maybe there are some which have a
> sufficiently simple en- and decoding logic).

Interesting...

> Even if you stay with the Chen-Ho/DPD idea of compressing by keeping 0-7 as
> 3 bits, and 8-9 as 1 bit, leaving space to tell the position, you still
> have an awful lot of possible permutations, and you still can change the
> constant values.

I just looked for encodings that kept as much of the logic pass-through,
and then minimized the size of the required lookup tables.

It is trivial to halve the table size by making the parity bit
passthrough: This needs a 9-bit index into 512 12-bit BCD results, while
the opposite operation would need 2048 9-bit DPD values.

I don't remember now exactly how far down I managed to get those
numbers, but some more should be possible afair.

> IMHO, the IBM method is a rather trivial remix of the Chen-Ho encoding, but
> the claimed advantage is IMHO unreal - Chen-Ho is left aligned, DPD right
> aligned, that's the main difference (so if you want a 7 bit Chen-Ho subset,
> you drop the right digit, whereas in DPD, you drop the left).

Keeping the flag bit on the right end might make a tiny bit of
difference for a compact sw algorithm, if that bit can be used for
predication around a table lookup for the special cases.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Date: Fri, 20 Feb 2009 10:20:30 +0100
From: Terje Mathisen <"terje.mathisen at tmsw.no">
Newsgroups: comp.arch
Subject: Re: IEEE 754r, decimal floating point, BID and DPD
Message-ID: <CKSdnWF8SPFC6APUnZ2dnUVZ_rLinZ2d@giganews.com>

MitchAlsup wrote:
> On Feb 19, 3:05 pm, Glen Herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
>> I only heard about BID a few days ago.  I suppose it is easier
>> to process in a binary ALU, but normalization (pre and post) will
>> be much harder and slower.  (Multiply and divide by powers of 10.)
>
> The real problems with BID is I/O (in ASCII) and rounding. In DPD
> rounding only requires decimal shifting, wherease in BID it requires
> multiplication (and maybe division (hazy here)).

(I hit send by accident on my previous msg. :-()

Maybe 10 years ago I discovered a very fast way to convert unsigned
binary numbers into decimal, AMD used to show it in their optimization
manuals (unfortunately without attribution :-().

You never need to use division for this conversion operation,
multiplication by suitable scaled reciprocal powers of 10 is sufficient.

For a 32-bit value my method needs 3 or 4 MULs, all the remaining
operations are fast shift/mask/add/sub, so the total time for 32 bits to
10 ascii digits is about 25-50 cycles depending upon the speed of an
integer MUL.

In my conversion code the input value is first split into two base 1e5
numbers, both of them scaled by 2^28/1e4 (rounded up).

This means that the top decimal digit ends up in the top 4 bits, and can
be extracted with a shift right operation.

Afterwards I mask away the top 4 bits, multiply the result by 5 (i.e.
LEA on x86, shift + add on other architectures) and get the next decimal
digit in the top 5 bits.

When converting very wide binary decimal mantissas I would apply the
above process recursively, splitting the number into 2, 4 or 8 parts
before the final stages can happen in parallel on all the parts.

With 128-bit DFP (with ~112 bits of mantissa?) the speed limiter would
be the initial double-wide reciprocal multiplication.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Date: Sat, 21 Feb 2009 22:52:30 +0100
From: Terje Mathisen <"terje.mathisen at tmsw.no">
Newsgroups: comp.arch
Subject: Re: IEEE 754r, decimal floating point, BID and DPD
Message-ID: <LrKdncPVJdI96j3UnZ2dnUVZ_vednZ2d@giganews.com>

Bernd Paysan wrote:
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>>> For software, the easiest way is to use a lookup table. There are an
>>> awful lot of possible encodings that fit 1000 states into 10 bits
>>> (1024!/24!=
>> The best software method is the one that needs the least storage space,
>> as long as the processing time is more or less identical, right?
>
> Yes. My go on code space efficiency for this would be a purely algorithmic
> one:
>
>>> and
>>> the one where you simply store the integers 0..999 in 10 bits even can be
>>> converted algorithmically
>
> You first multiply by a constant (16 bits are sufficient, the constant is 41
> or 0x29, which means two leas) to get the first BCD digit in the top 4

I like 41, I use it instead of div 100 to get the century in my date code:

41/4096 is close enough to 1/100 (rounded up) to give exactly the right
result for all years in a 400-year cycle.

> bits, and then you do two masks+leas to multiply by 5 to extract the
> remaining two digits; this should be a lot simpler than DPD from a software
> point of view (see your own posting on fast conversion of integer to
> decimal ;-). It has the advantage that you also can use binary string
> compares for sorting.

What's important here is that this style of code can work in a wide SIMD
register, while lookup tables really don't scale.

Storing each 10-bit field inside a 16-bit sub-register would allow these
sorts of tricks while working on 8 groups (24 decimal digits) in parallel.

Graphics-optimized vector unit, as used in GPUs and in Larrabee have
special hw to unpack common graphics texture formats, possibly including
2:10:10:10 which could then be (ab)used for DPD?

> Converting BCD back to the 0..999 format is even simpler, because
> multiplying by 10 is very cheap (lea+shift/add, and since you have to add
> anyway, just use another lea for that). You can always store integers from
> 0..99 in a 7 bit block, so you get all the advantages of DPD ;-).
>
> I would call this format BCM, binary coded millelimal.

This is my preferred internal representation for a sw DPD library.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From: hack@watson.ibm.com
Newsgroups: comp.arch
Subject: Re: IEEE 754r, decimal floating point, BID and DPD
Date: Sat, 21 Feb 2009 19:08:11 -0800 (PST)
Message-ID: <b7e9beca-0773-470c-989d-8d0efecbd7bb@f3g2000yqf.googlegroups.com>

On Feb 19, 4:20 am, Jan Vorbrüggen <Jan.Vorbrueg...@not-thomson.net>
wrote:
[on BID vs DPD decimal FP encodings]

> as compatible as possible - are they? (e.g., regarding range and
> precision)

They are.  For each precision, the two encodings represent exactly the
same finite set of representations (which is a stronger statement than
the same set of values).  Arithmetic and exceptions are also the same,
so the only way to tell them apart is by looking at the raw bits
(e.g. via type overlay).  The standard also requires the availability
of conversion functions between the native encoding and the other one.

There is an interesting difference in non-canonical encodings.  There
are more encodings than representations, though each representation
has exactly one canonical encoding (and operations return canonical
encodings for canonical operands; in fact, arithmetic operations
always return canonical results).  Each bit pattern (canonical or not)
is valid, i.e. maps to a valid representation.  For BID, the value of
non-canonical finite numbers is zero, but for DPD it is not (a
non-canonical declet -- group of 10 bits of the significand --
corresponds to a triple of decimal digits made up of 8's and 9's).

As for why we ended up with two encodings: IBM already had hardware
that implemented DPD (announced and shipped near the end of 754r
development).  Intel had published some interesting papers on BID.  (I
am not speaking for either company, of course.  I was a member of the
P754 committee for the last three years of its long existence.)

Michel.

Date: Sun, 22 Feb 2009 12:52:28 +0100
From: Terje Mathisen <"terje.mathisen at tmsw.no">
Newsgroups: comp.arch
Subject: Re: IEEE 754r, decimal floating point, BID and DPD
Message-ID: <6N6dnW9ysvPgoTzUnZ2dnUVZ_uidnZ2d@giganews.com>

hack@watson.ibm.com wrote:
> On Feb 19, 4:20 am, Jan Vorbrüggen <Jan.Vorbrueg...@not-thomson.net>
> wrote: [on BID vs DPD decimal FP encodings]
>> as compatible as possible - are they? (e.g., regarding range and
>> precision)
>
> They are.  For each precision, the two encodings represent exactly
> the same finite set of representations (which is a stronger statement
> than the same set of values).

Thanks, this is the way I expected it had to be done.

> Arithmetic and exceptions are also the same, so the only way to tell
> them apart is by looking at the raw bits (e.g. via type overlay).
> The standard also requires the availability of conversion functions
> between the native encoding and the other one.

<BG> Nick, did you see that?

> As for why we ended up with two encodings:  IBM already had hardware
> that implemented DPD (announced and shipped near the end of 754r
> development). Intel had published some interesting papers on BID.  (I

I have advocated using pure binary mantissas for decimal math here on
c.arch for many years now. :-)

Each time one of the Decimal evangelists try to come up with a benchmark
to "prove" that some form of BCD-style encoding is both needed and the
fastest way to solve a particular problem, I've been able to show that a
BID-style approach can be at least as fast, and much faster when you
don't have hardware 754r/DPD support.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Index Home About Blog