From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: Register-less CPU
Date: Mon, 19 May 2008 10:26:13 -0700 (PDT)
Message-ID: <33c368da-6b27-4abe-be23-606af51b4c55@y38g2000hsy.googlegroups.com>

On May 18, 1:40 pm, Jeffrey Dutky <jeff.du...@gmail.com> wrote:
> While it may seem like all decoder networks are the same depth (a
> layer of inverters and a layer of and gates) the actual logic gates
> built in silicon have a property called "fan out" which limits how
> many gate inputs a gate output can drive. As the decoder network gets
> wider you need to add trees of buffers to increase the fan out, which
> slows down the decoding process (or you can build bigger, higher power
> gates, but this tend to be slower as well).

In my experience, structures (i.e. register files) with 8 entries can
be accessed in 1/4 cycle, 16-32 (and occasionally 64) entries can be
accessed in 2/4 cycle at modern CPU speeds, 64-128 can be accessed in
3/4 of a cycle, and those that are larger at 4/4 cycles. Architects
blend these access times into pipelines to hide most of the grubby
details.
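The size-to-latency rule of thumb above could be tabulated as a tiny lookup function. This is purely illustrative, using only the figures quoted in the paragraph; real numbers depend on process, porting, and circuit style:

```python
def access_quarter_cycles(entries: int) -> int:
    """Quarter-cycles to read a register-file-like structure,
    per the rule-of-thumb figures quoted above (illustrative only)."""
    if entries <= 8:
        return 1          # 1/4 cycle
    if entries <= 32:
        return 2          # 2/4 cycle (occasionally achievable at 64, too)
    if entries <= 128:
        return 3          # 3/4 cycle
    return 4              # a full cycle for anything larger

for n in (8, 32, 128, 256):
    print(f"{n:3d} entries -> {access_quarter_cycles(n)}/4 cycle")
```

The point is that doubling the structure does not double the access time; the latency grows in coarse steps that the pipeline designer can absorb.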

Secondarily, there are three different fan-in/fan-out problems, and
all have to be carefully crafted for lowest latency.

The first is driving the index bits into the decoder: 2 through 8
entries incur no additional delay (when the drivers into the decoder
array are properly sized), 16-32 is about one gate of logic delay and
approximately 1 gate of wire delay, 64-256 is 2-3 gates of delay and
another 2 gates (minimum) of wire delay.

The second is taking the output of the decoder and making it big
enough to drive the select line that accesses the array. 1-8 bits are
free, 16-32 bits are 1 gate of delay, 64-128 bits are 2 gates of delay
and 1 gate of wire delay.

The third is uniquely sensing the selected register bit competing for
the same wire to which (2**n)-1 unselected register bits are also
connected. We used to use sense amps, but in deep submicron these have
become problematic, so we sense with an inverter for every 8-16 cells,
and then use strong drive through an OR-tree to generate the final
readout result.
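Tallying the three components as a toy model makes the scaling visible. The gate counts below are the ones stated above; counting wire delay in gate equivalents, assuming a 2-input OR-tree, and grouping 8 cells per sense inverter are my simplifications, not figures from the post:

```python
import math

def decoder_input_delay(entries: int) -> int:
    """Component 1: driving the index bits into the decoder array."""
    if entries <= 8:
        return 0             # free with properly sized drivers
    if entries <= 32:
        return 1 + 1         # ~1 gate of logic + ~1 gate of wire
    return 3 + 2             # 2-3 gates of logic + 2 gates (min) of wire

def select_line_delay(entries: int) -> int:
    """Component 2: amplifying a decoder output to drive its select line."""
    if entries <= 8:
        return 0
    if entries <= 32:
        return 1
    return 2 + 1             # 2 gates + 1 gate of wire

def sense_tree_depth(entries: int, group: int = 8) -> int:
    """Component 3: a local sense inverter per `group` cells, then an
    OR-tree (2-input ORs assumed) merging the partial results."""
    groups = math.ceil(entries / group)
    return 1 + max(0, math.ceil(math.log2(groups)))

for n in (8, 32, 128):
    total = decoder_input_delay(n) + select_line_delay(n) + sense_tree_depth(n)
    print(f"{n:4d} entries: ~{total} gate delays")
```

Even this crude sum shows why a 128-entry file lands several pipeline quarter-cycles behind an 8-entry one.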

One does not add these trees of fan-out buffers to slow things down;
one adds them in order to speed things up. If we did not add these
buffers, the R*C components of the wires would make the array even
slower. The buffers simply end up taking less time than not using them
would.
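The reason repeaters win is that an unbuffered RC wire's (Elmore) delay grows quadratically with length, while cutting it into repeated segments makes the wire term linear at the cost of a fixed delay per buffer. A sketch with made-up unit constants (nothing below comes from the post except the principle):

```python
r = 1.0      # resistance per unit length (arbitrary units)
c = 1.0      # capacitance per unit length (arbitrary units)
t_buf = 2.0  # assumed delay of one repeater

def wire_delay(length: float) -> float:
    """Elmore delay of an unbuffered distributed RC line: 0.5 * R * C."""
    return 0.5 * (r * length) * (c * length)

def repeated_delay(length: float, n_seg: int) -> float:
    """The same wire cut into n_seg segments, one repeater per segment."""
    return n_seg * (wire_delay(length / n_seg) + t_buf)

L = 20.0
print(wire_delay(L))          # 200.0 -- quadratic in length
print(repeated_delay(L, 10))  # 40.0  -- wire term now linear, plus buffers
```

So the buffers "cost" 20 units here yet the wire gets 180 units faster, which is exactly the trade the post describes.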

Mitch


From: MitchAlsup <MitchAlsup@aol.com>
Newsgroups: comp.arch
Subject: Re: Register-less CPU
Date: Tue, 20 May 2008 09:05:40 -0700 (PDT)
Message-ID: <f6c7dc7f-7261-4d50-ba3f-edd40a6974df@a1g2000hsb.googlegroups.com>

On May 19, 12:53 pm, James Harris <james.harri...@googlemail.com>
wrote:
> On 19 May, 18:26, MitchAlsup <MitchAl...@aol.com> wrote:
>
> > In my experience, structures (i.e. register files) with 8 entries can
> > be accessed in 1/4 cycle, 16-32 (and occasionally 64) entries can be
> > accessed in 2/4 cycle at modern CPU speeds, 64-128 can be accessed in
> > 3/4 of a cycle, and those that are larger at 4/4 cycles. Architects
> > blend these access times into pipelines to hide most of the grubby
> > details.
>
> Given the 1/4 cycle access to one of 8 could you divide the registers
> into eight banks of 8 possibly selected on their lower three bits (so
> the first bank would be reg0, reg8, reg16, reg24 etc) then select the
> bank also in 1/4 cycle thus providing 2/4 cycle or better access to 64
> regs?

But then you have to pay a wire delay and a fanout delay to drive the
index bits into the 8 decoders, and then merge the individual outputs
into the final output, ending up as I stated before.

>
> > Secondarily there are three different fan-in-fan-out problems and all
> > have to be carefully crafted for lowest latency.
>
> > The first is driving the index bits into the decoder 2 through 8 is no
> > additional delay (when the drivers into the decoder array are properly
> > sized), 16-32 is about gate of logic delay and approximately 1 gate of
> > wire-delay, 64-256 is 2-3 gates of delay and another 2 gates (min) of
> > wire delay.
>
> > The second is taking the output of the decoder and making it big
> > enough to drive the select line that accesses the array. 1-8 bits are
> > free, 16-32 bits are 1 gate of delay, 64-128 bits are 2 gates of delay
> > and 1 gate of wire delay.
>
> > The third is uniquely sensing the selected register bit competing for
> > the same wire that (2**n)-1 unselected register bits are also
> > connected. We used to use sense amps, but in deep submicron, these have
> > become problematic, so we sense with an inverter in 8-16 cells, and
> > then use strong drive through an or-tree to generate the final readout
> > result.
>
> > One does not add these trees of fan-out buffers to slow things down,
> > one adds these trees of fan-out buffers in order to speed things up.
> > If we did not add these buffers, the R*C components of the wires would
> > make the array even slower. These buffers just end up taking less time
> > than if we did not use them.
>
> Thanks for all the details. I'm surprised there are wire delays given
> the shortness of the connections between components. Are they perhaps
> times for the outputs to drive their values against the input
> capacitances of dependent gates (rather than delays of the 'wires'
> themselves)?

Some of it (maybe 20%) is due to the driver, but most of it is R*C,
with a good deal of the C part on the gates attached to the wire.

> Are the above considerations applicable to more than one technology?

They have held rather true from about 0.25µ through 0.12µ, were not
quite so bad between 0.8µ and 0.35µ, and were largely ignored at
larger scales. This is a direct consequence of wires getting worse and
transistors getting leakier as technology scales. {Note: leaky, not
slow}.

Mitch
