vm_dirty_ratio (Linus Torvalds)

Index Home About Blog

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Change in default vm_dirty_ratio
Date: Tue, 19 Jun 2007 00:07:02 UTC
Message-ID: <fa.IS+pnF1q5Yd788JptAXlpkdpJmU@ifi.uio.no>

On Mon, 18 Jun 2007, Andrew Morton wrote:

> On Mon, 18 Jun 2007 14:14:30 -0700
> Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> > Andrew,
> >
> > The default vm_dirty_ratio changed from 40 to 10
> > for the 2.6.22-rc kernels in this patch:

Yup.

> > IOZone write drops by about 60% when test file size is 50 percent of
> > memory.  Rand-write drops by 90%.
>
> heh.
>
> (Or is that an inappropriate reaction?)

I think it's probably appropriate.

I don't know what else to say.

For pure write testing, where writeback caching is good, you should
probably run all benchmarks with vm_dirty_ratio set as high as possible.
That's fairly obvious.

What's equally obvious is that for actual real-life use, such tuning is
not a good idea, and setting the vm_dirty_ratio down causes a more
pleasant user experience, thanks to smoother IO load behaviour.

Is it good to keep tons of dirty stuff around? Sure. It allows overwriting
(and thus avoiding doing the write in the first place), but it also allows
for a more aggressive IO scheduling, in that you have more writes that you
can schedule.

It does sound like IOZone just isn't a good benchmark. It doesn't actually
measure disk throughput, it really measures how good the OS is at *not*
doing the IO. And yes, in that case, set vm_dirty_ratio high to get better
numbers.

I'd rather have the defaults at something that is "pleasant", and then
make it easy for benchmarkers to put it at something "unpleasant, but
gives better numbers". And it's not like it's all that hard to just do

	echo 50 > /proc/sys/vm/dirty_ratio

in your /etc/rc.local or something, if you know you want this.

Maybe somebody can make a small graphical config app, and the distros
could even skip it? Dunno. I *suspect* very few people actually end up
caring.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Change in default vm_dirty_ratio
Date: Tue, 19 Jun 2007 19:06:26 UTC
Message-ID: <fa.Xzx6f8Otgbuf2jBeUNc3OXkCvgw@ifi.uio.no>

On Tue, 19 Jun 2007, John Stoffel wrote:
>
> Shouldn't the vm_dirty_ratio be based on the speed of the device, and
> not the size of memory?

Yes. It should depend on:
 - speed of the device(s) in question
 - seekiness of the workload
 - wishes of the user as per the latency of other operations.

However, nobody has ever found the required algorithm.

So "at most 10% of memory dirty" is a simple (and fairly _good_)
heuristic. Nobody has actually ever ended up complaining about the change
from 40% -> 10%, and as far as I know this was the first report (and it's
not so much because the change was bad, but because it showed up on a
benchmark - and I don't think that actually says anything about anything
else then the behaviour of the benchmark itself)

So are there better algorithms in theory? Probably lots of them.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Change in default vm_dirty_ratio
Date: Tue, 19 Jun 2007 19:07:44 UTC
Message-ID: <fa.vHOzMkinzH6sLmc5FX0cs8XIAW4@ifi.uio.no>

On Tue, 19 Jun 2007, Linus Torvalds wrote:
>
> Yes. It should depend on:
>  - speed of the device(s) in question

Btw, this one can be quite a big deal. Try connecting an iPod and syncing
8GB of data to it. Oops.

So yes, it would be nice to have some per-device logic too. Tested patches
would be very welcome ;)

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Change in default vm_dirty_ratio
Date: Thu, 21 Jun 2007 23:09:01 UTC
Message-ID: <fa.Ao8Ap5a7u+Whd3/C5l0f1v3HVFA@ifi.uio.no>

On Thu, 21 Jun 2007, Matt Mackall wrote:
>
> Perhaps we want to throw some sliding window algorithms at it. We can
> bound requests and total I/O and if requests get retired too slowly we
> can shrink the windows. Alternately, we can grow the window if we're
> retiring things within our desired timeframe.

I suspect that would tend to be a good way to go. But it almost certainly
has to be per-device, which implies that somebody would have to do some
major coding/testing on this..

The vm_dirty_ratio thing is a global value, and I think we need that
regardless (for the independent issue of memory deadlocks etc), but if we
*additionally* had a per-device throttle that was based on some kind of
adaptive thing, we could probably raise the global (hard) vm_dirty_ratio a
lot.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Change in default vm_dirty_ratio
Date: Wed, 20 Jun 2007 17:18:44 UTC
Message-ID: <fa.VqJjRtFLP6i3xirbliV91pJTh0g@ifi.uio.no>

On Wed, 20 Jun 2007, Peter Zijlstra wrote:
>
> Building on the per BDI patches, how about integrating feedback from the
> full-ness of device queues. That is, when we are happily doing IO and we
> cannot possibly saturate the active devices (as measured by their queue
> never reaching 75%?) then we can safely increase the total dirty limit.

The really annoying things are the one-off things. You've been happily
working for a while (never even being _close_ to saturating any IO
queues), and then you untar a large tree.

If the kernel now let's you dirty lots of memory, you'll have a very
unpleasant experience.

And with hot-pluggable devices (which is where most of the throughput
problems tend to be!), the "one-off" thing is not a "just after reboot"
kind of situation.

So you'd have to be pretty smart about it.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Change in default vm_dirty_ratio
Date: Wed, 20 Jun 2007 18:28:39 UTC
Message-ID: <fa.z+uZy9Fff0Y9v6Lx7g6oe5y+GjY@ifi.uio.no>

On Wed, 20 Jun 2007, Arjan van de Ven wrote:
>
> maybe that needs to be fixed? If you stopped dirtying after the initial
> bump.. is there a reason for the kernel to dump all that data to the
> disk in such a way that it disturbs interactive users?

No. I would argue that the kernel should try to trickle things out, so
that it doesn't disturb anything, and a "big dump" becomes a "steady
trickle".

And that's what "vm_dirty_ratio" is all about.

> so the question maybe is.. is the vm tunable the cause or the symptom of
> the bad experience?

No, the vm tunable is exactly what it's all about.

Do a big "untar", and what you *want* to see is not "instant dump,
followed by long pause".

A much *smoother* behaviour is generally preferable, and most of the time
that's true even if it may be lower throughput in the end!

Of course, "synchronous writes" are *really* smooth (you never allow any
dumps at *all* to build up), so this is about a balance - not about
"perfect smoothness" vs "best throughput", but about a heuristic that
finds a reasonable middle ground.

There is no "perfect". There is only "stupid heuristics". Maybe the
"vm_dirty_ratio" is a bit *too* stupid, but it definitely is needed in
some form.

It can actually be more than just a "performance" vs "smoothness" issue:
the 40% thing was actually a *correctness* issue too, back when we counted
it as a percentage of total memory. A highmem machine would allow 40% of
all memory free and it was all in low memory, and that literally caused
lockups.

So the dirty_ratio is not *only* about smoothness, it's also simply about
the fact that the kernel must not allow too much memory to be dirtied,
because that leads to out-of-memory deadlocks and other nasty issues. So
it's not *purely* a tunable.

		Linus

Index Home About Blog