From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: The performance and behaviour of the anti-fragmentation related
Date: Fri, 02 Mar 2007 05:13:00 UTC
Message-ID: <fa.HgoxLvmJHPfUlX5cG6gcXuXNQ0c@ifi.uio.no>

On Thu, 1 Mar 2007, Andrew Morton wrote:
>
> On Thu, 1 Mar 2007 19:44:27 -0800 (PST) Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > In other words, I really don't see a huge upside. I see *lots* of
> > downsides, but upsides? Not so much. Almost everybody who wants unplug
> > wants virtualization, and right now none of the "big virtualization"
> > people would want to have kernel-level anti-fragmentation anyway since
> > they'd need to do it on their own.
>
> Agree with all that, but you're missing the other application: power
> saving.  FBDIMMs take eight watts a pop.

This is a hardware problem. Let's see how long it takes for Intel to
realize that FBDIMM's were a hugely bad idea from a power perspective.

Yes, the same issues exist for other DRAM forms too, but to a *much*
smaller degree.

Also, IN PRACTICE you're never ever going to see this anyway. Almost
everybody wants bank interleaving, because it's a huge performance win on
many loads. That, in turn, means that your memory will be spread out over
multiple DIMM's even for a single page, much less any bigger area.

In other words - forget about DRAM power savings. It's not realistic. And
if you want low-power, don't use FBDIMM's. It really *is* that simple.

(And yes, maybe FBDIMM controllers in a few years won't use 8 W per
buffer. I kind of doubt that, since FBDIMM fairly fundamentally is highish
voltage swings at high frequencies.)

Also, on a *truly* idle system, we'll see the power savings whatever we
do, because the working set will fit in D$, and to get those DRAM power
savings in reality you need to have the DRAM controller shut down on its
own anyway (ie sw would only help a bit).

The whole DRAM power story is a bedtime story for gullible children. Don't
fall for it. It's not realistic. The hardware support for it DOES NOT
EXIST today, and probably won't for several years. And the real fix is
elsewhere anyway (ie people will have to do a FBDIMM-2 interface, which
is against the whole point of FBDIMM in the first place, but that's what
you get when you ignore power in the first version!).

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: The performance and behaviour of the anti-fragmentation related
Date: Fri, 02 Mar 2007 17:18:05 UTC
Message-ID: <fa.aRyS5XsmqyY0kQs+E95OUanWSxg@ifi.uio.no>

On Fri, 2 Mar 2007, Mark Gross wrote:
> >
> > Yes, the same issues exist for other DRAM forms too, but to a *much*
> > smaller degree.
>
> DDR3-1333 may be better than FBDIMM's but don't count on it being much
> better.

Hey, fair enough. But it's not a problem (and it doesn't have a solution)
today. I'm not sure it's going to have a solution tomorrow either.

> > Also, IN PRACTICE you're never ever going to see this anyway. Almost
> > everybody wants bank interleaving, because it's a huge performance win on
> > many loads. That, in turn, means that your memory will be spread out over
> > multiple DIMM's even for a single page, much less any bigger area.
>
> 4-way interleave across banks on systems may not be as common as you may
> think for future chip sets.  2-way interleave across DIMMs within a bank
> will stay.

.. and think about a realistic future.

EVERYBODY will do on-die memory controllers. Yes, Intel doesn't do it
today, but in the one- to two-year timeframe even Intel will.

What does that mean? It means that in bigger systems, you will no longer
even *have* 8 or 16 banks where turning off a few banks makes sense.
You'll quite often have just a few DIMM's per die, because that's what you
want for latency. Then you'll have CSI or HT or another interconnect.

And with a few DIMM's per die, you're back where even just 2-way
interleaving basically means that in order to turn off your DIMM, you
probably need to remove HALF the memory for that CPU.

In other words: TURNING OFF DIMM's IS A BEDTIME STORY FOR DIMWITTED
CHILDREN.

There are maybe a couple machines IN EXISTENCE TODAY that can do it. But
nobody actually does it in practice, and nobody even knows if it's going
to be viable (yes, DRAM takes energy, but trying to keep memory free will
likely waste power *too*, and I doubt anybody has any real idea of how
much any of this would actually help in practice).

And I don't think that will change. See above. The future is *not* moving
towards more and more DIMM's. Quite the reverse. On workstations, we are
currently in the "one or two DIMM's per die" range. Do you really think that
will change? Hell no. And in big servers, pretty much everybody agrees
that we will move towards that, rather than away from it.

So:
 - forget about turning DIMM's off. There is *no* actual data supporting
   the notion that it's a good idea today, and I seriously doubt you can
   really argue that it will be a good idea in five or ten years. It's a
   hardware hack for a hardware problem, and the problems are way too
   complex for us to solve in time for the solution to be relevant.

 - aim for NUMA memory allocation and turning off whole *nodes*. That's
   much more likely to be productive in the longer timeframe. And yes, we
   may well want to do memory compaction for that too, but I suspect that
   the issues are going to be different (ie the way to do it is to simply
   prefer certain nodes for certain allocations, and then try to keep the
   jobs that you know can be idle on other nodes)
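
To make that node-preference idea concrete, here is a minimal userspace
sketch using libnuma (link with -lnuma). The node numbers are purely
illustrative; a real policy would pick them from the actual topology and
from whichever packages you intend to let go idle.

/*
 * Keep the busy allocations on one node so that another node (and the
 * package behind it) has a chance of staying idle.
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	size_t sz = 64UL << 20;		/* 64 MB of "active" data */
	void *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support in this kernel\n");
		return 1;
	}

	/* Allocate and fault the working set in on node 0 ... */
	buf = numa_alloc_onnode(sz, 0);
	if (!buf)
		return 1;
	memset(buf, 0, sz);

	/*
	 * ... so that node 1 sees no new allocations and can, in
	 * principle, be powered down as a whole once it goes quiet.
	 */
	numa_free(buf, sz);
	return 0;
}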

Do you actually have real data supporting the notion that turning DIMM's
off will be reasonable and worthwhile?

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: The performance and behaviour of the anti-fragmentation related
Date: Fri, 02 Mar 2007 19:05:19 UTC
Message-ID: <fa.hb5OUBqd/dM49w1UNhpL+Kvhx/g@ifi.uio.no>

On Fri, 2 Mar 2007, Mark Gross wrote:
>
> I think there will be more than just 2 DIMMs per CPU socket on systems
> that care about this type of capability.

I agree. I think you'll have a nice mix of 2 and 4, although not likely a
lot more. You want to have independent channels, and then within a channel
you want to have as close to point-to-point as possible.

But the reason that I think you're better off looking at a "node level" is
that

 (a) describing the DIMM setup is a total disaster. The interleaving is
     part of it, but even in the absence of interleaving, we have so far
     seen that describing DIMM mapping simply isn't a realistic thing to
     be widely deployed, judging by the fact that we cannot even get a
     first-order approximate mapping for the ECC error events.

     Going node-level means that we just piggy-back on the existing node
     mapping, which is a lot more likely to actually be correct and
     available (ie you may not know which bank is bank0 and how the
     interleaving works, but you usually *do* know which bank is connected
     to which CPU package - see the sketch after this list)

     (Btw, I shouldn't have used the word "die", since it's really about
     package - Intel obviously has a penchant for putting two dies per
     package)

 (b) especially if you can actually shut down the memory, going node-wide
     may mean that you can shut down the CPU's too (ie per-package sleep).
     I bet the people who care enough to care about DIMM's would want to
     have that *anyway*, so tying them together simplifies the problem.
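
The node-level information that (a) leans on is already exported; a small
sketch of reading it through libnuma (link with -lnuma) follows. Note that
nothing here knows, or needs to know, how individual DIMM's are interleaved
behind a node.

/*
 * Walk the NUMA nodes and print each node's memory size and the CPUs
 * that sit next to that memory.
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
	int node, cpu;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support in this kernel\n");
		return 1;
	}

	for (node = 0; node <= numa_max_node(); node++) {
		long long free_mem;
		long long size = numa_node_size64(node, &free_mem);
		struct bitmask *cpus = numa_allocate_cpumask();

		printf("node %d: %lld MB (%lld MB free), cpus:",
		       node, size >> 20, free_mem >> 20);

		if (numa_node_to_cpus(node, cpus) == 0) {
			for (cpu = 0; cpu < (int)cpus->size; cpu++)
				if (numa_bitmask_isbitset(cpus, cpu))
					printf(" %d", cpu);
		}
		printf("\n");
		numa_free_cpumask(cpus);
	}
	return 0;
}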

> BTW I hope we aren't talking past each other, there are low power states
> where the RAM contents are preserved.

Yes. They are almost as hard to handle, but the advantage is that if we
get things wrong, it can still work most of the time (ie we don't have to
migrate everything off, we just need to try to migrate the stuff that gets
*used* off a DIMM, and hardware will hopefully end up quiescing the right
memory controller channel totally automatically, without us having to know
the exact mapping or even having to 100% always get it 100% right).
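
As a rough illustration of that "migrate what's used" step, here is a
userspace sketch using libnuma's migrate_pages() wrapper (link with
-lnuma). The pid comes from the command line, and the node numbers are
placeholders for whichever node you are trying to let go quiet.

/*
 * Move whatever pages a task actually has on node 1 over to node 0, so
 * that the memory behind node 1 has a chance of going idle.
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	struct bitmask *from, *to;
	long ret;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	if (numa_available() < 0)
		return 1;

	from = numa_parse_nodestring("1");	/* node we want idle */
	to   = numa_parse_nodestring("0");	/* node that stays hot */

	ret = numa_migrate_pages(atoi(argv[1]), from, to);
	if (ret < 0)
		perror("numa_migrate_pages");

	numa_bitmask_free(from);
	numa_bitmask_free(to);
	return ret < 0 ? 1 : 0;
}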

With FBDIMM in particular, I guess the biggest power cost isn't actually
the DRAM content, but just the controllers.

Of course, I wonder how much actual point there is to FBDIMM's once you
have on-die memory controllers and thus the reason for deep queueing is
basically gone (since you'd spread out the memory rather than having it
behind a few central controllers).

		Linus
