From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS
Date: Fri, 27 Apr 2007 15:19:40 UTC
Message-ID: <fa.+jJfsPdWk8uCcJT/AFsrJBvQMxQ@ifi.uio.no>

On Fri, 27 Apr 2007, Mike Galbraith wrote:
>
> As subject states, my GUI is going away for extended periods of time
> when my very full and likely highly fragmented (how to find out)
> filesystem is under heavy write load.  While write is under way, if
> amarok (mp3 player) is running, no song change will occur until write is
> finished, and the GUI can go _entirely_ comatose for very long periods.
> Usually, it will come back to life after write is finished, but
> occasionally, a complete GUI restart is necessary.

One thing to try out (and dammit, I should make it the default now in
2.6.21) is to just make the dirty limits much lower. We've been talking
about this for ages, I think this might be the right time to do it.

Especially with lots of memory, allowing 40% of that memory to be dirty is
just insane (even if we limit it to "just" 40% of the normal memory zone).
That can be gigabytes. And no amount of IO scheduling will make it
pleasant to try to handle the situation where that much memory is dirty.
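
To put rough numbers on that, here is a small userspace sketch (nothing
kernel-related; the 4GB of dirtyable memory and the ~40MB/s of sustained
writeback speed are just illustrative assumptions):

/* Rough illustration only: how much dirty data the various ratios allow
 * to pile up, and how long a disk might take to flush it. The memory
 * size and disk throughput are made-up example numbers. */
#include <stdio.h>

int main(void)
{
	const double mem_bytes = 4.0 * 1024 * 1024 * 1024;  /* assume 4GB dirtyable */
	const double disk_mb_s = 40.0;                       /* assume ~40MB/s writeback */
	const int ratios[] = { 40, 10, 5 };  /* old default, proposed dirty_ratio, background */

	for (unsigned int i = 0; i < sizeof(ratios) / sizeof(ratios[0]); i++) {
		double dirty_mb = mem_bytes * ratios[i] / 100.0 / (1024 * 1024);
		printf("%2d%% dirty = %7.0f MB, ~%5.1f s to flush at %.0f MB/s\n",
		       ratios[i], dirty_mb, dirty_mb / disk_mb_s, disk_mb_s);
	}
	return 0;
}

With the old 40% default that's over 1.6GB of dirty data and on the order
of 40 seconds of saturated writeback even in this modest example; the
5%/10% values below bring it down to a few seconds.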

So I do believe that we could probably do something about the IO
scheduling _too_:

 - break up large write requests (yeah, it will make for worse IO
   throughput, but if we make it configurable, and especially with
   controllers that don't have insane overheads per command, the
   difference between 128kB requests and 16MB requests is probably not
   really even noticeable - SCSI things with large per-command overheads
   are just stupid). There's a sketch of capping request sizes after
   this list.

   Generating huge requests will automatically mean that they are
   "unbreakable" from an IO scheduler perspective, so it's bad for latency
   for other requests once they've started.

 - maybe be more aggressive about prioritizing reads over writes.
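
As a sketch of capping request sizes (the first point above): the block
layer already exposes a per-device request size limit through sysfs, and
shrinking it is one way to experiment with smaller requests. The device
name ("sda") and the 128kB value are only illustrative, and whether it
actually helps latency on a given setup would need measuring:

/* Minimal sketch: shrink the per-device request size cap via sysfs.
 * The path and value are illustrative; adjust "sda" to the real device.
 * max_sectors_kb limits how large a single request the block layer will
 * build for that device. */
#include <stdio.h>

int main(void)
{
	const char *knob = "/sys/block/sda/queue/max_sectors_kb";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror(knob);
		return 1;
	}
	fprintf(f, "128\n");	/* cap requests at 128kB instead of the default */
	fclose(f);
	return 0;
}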

but in the meantime, what happens if you apply this patch?

Actually, you don't need to apply the patch - just do

	echo 5 > /proc/sys/vm/dirty_background_ratio
	echo 10 > /proc/sys/vm/dirty_ratio

and say if it seems to improve things. I think those are much saner
defaults especially for a desktop system (and probably for most servers
too, for that matter).

Even 10% of memory dirty can be a whole lot of RAM, but it should
hopefully be _better_ than the insane default we have now.

Historical note: allowing about half of memory to contain dirty pages made
more sense back in the days when people had 16-64MB of memory, and a
single untar of even fairly small projects would otherwise hit the disk.
But memory sizes have grown *much* more quickly than disk speeds (and
latency requirements have gone down, not up), so a default that may
actually have been perfectly fine at some point seems crazy these days..

		Linus

---
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f469e3c..a794945 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -67,12 +67,12 @@ static inline long sync_writeback_pages(void)
 /*
  * Start background writeback (via pdflush) at this percentage
  */
-int dirty_background_ratio = 10;
+int dirty_background_ratio = 5;

 /*
  * The generator of dirty data starts writeback at this percentage
  */
-int vm_dirty_ratio = 40;
+int vm_dirty_ratio = 10;

 /*
  * The interval between `kupdate'-style writebacks, in jiffies


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS
Date: Fri, 27 Apr 2007 15:56:03 UTC
Message-ID: <fa.kLUy950X1KrW104pgR9wbfRXM8I@ifi.uio.no>

On Fri, 27 Apr 2007, John Anthony Kazos Jr. wrote:
>
> Could[/should] this stuff be changed from ratios to amounts? Or a quick
> boot-time test to use a ratio if the memory is small and an amount (like
> tax brackets, I would expect) if it's great?

Yes, the "percentage" thing was likely wrong. That said, there *is* some
correlation between "lots of memory" and "high-end machine", and that in
turn tends to correlate with "fast disk", so I don't think the percentage
approach is really *horribly* wrong.

The main issue with the percentages is that we do export them as such
through the /proc/ interface, and they are easy to change and understand.
So changing them to amounts is non-trivial if you also want to support the
old interfaces - and the advantage isn't obvious enough that it's a
clear-cut case.
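
Just to make the "ratio for small machines, absolute amount for big ones"
idea concrete, a toy sketch (this is not what the kernel does, and the
200MB ceiling is an arbitrary example value):

/* Sketch of the proposed policy: use the percentage on small machines,
 * but clamp to a fixed byte ceiling once the machine has lots of RAM.
 * The 200MB ceiling is an arbitrary illustrative number. */
#include <stdio.h>

static unsigned long long dirty_limit(unsigned long long mem_bytes, int ratio)
{
	const unsigned long long cap = 200ULL << 20;	/* example cap: 200MB */
	unsigned long long by_ratio = mem_bytes * ratio / 100;

	return by_ratio < cap ? by_ratio : cap;
}

int main(void)
{
	unsigned long long sizes[] = { 256ULL << 20, 1ULL << 30, 8ULL << 30 };

	for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%5llu MB RAM -> dirty limit %4llu MB\n",
		       sizes[i] >> 20, dirty_limit(sizes[i], 10) >> 20);
	return 0;
}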

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS
Date: Fri, 27 Apr 2007 21:23:27 UTC
Message-ID: <fa.7wclyHT2TVvrbktexwmxzG794qs@ifi.uio.no>

On Fri, 27 Apr 2007, Jan Engelhardt wrote:
>
> Interesting. For my laptop, I have configured like 90 for
> dirty_background_ratio and 95 for dirty_ratio. Makes for a nice
> delayed write, but I do not do workloads bigger than extracting kernel
> tarballs (~250 MB) and coding away on that machine (488 MB RAM) anyway.
> Setting it to something like 95, I could probably rm -Rf the kernel
> tree again and the disk never gets active because it is all cached.
> But if dirty_ratio is lowered, the disk will get active soon.

Yes. For laptops, you may want to
 - raise the dirty limits
 - increase the dirty scan times

but you do realize that if you then need memory for something else,
latency just becomes *horrible*. So even on laptops, it's not obviously
the right thing to do (these days, throwing money at the problem instead,
and getting one of the nice new 1.8" flash disks, will solve all issues:
you'd have no reason to try to delay spinning up the disk anyway).
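
For reference, the timer knobs behind "increase the dirty scan times"
live next to the ratio knobs in /proc/sys/vm; the sketch below just
stretches them, with values that are purely illustrative (and the
latency caveat above still applies):

/* Sketch: stretch the writeback timers for a laptop-style setup.
 * The values are only illustrative. */
#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	fclose(f);
	return 0;
}

int main(void)
{
	/* run pdflush less often, and let dirty data sit longer before it
	 * is considered old enough to be written out */
	write_knob("/proc/sys/vm/dirty_writeback_centisecs", "1500");
	write_knob("/proc/sys/vm/dirty_expire_centisecs", "6000");
	return 0;
}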

		Linus



From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS
Date: Sat, 28 Apr 2007 16:06:33 UTC
Message-ID: <fa.NcQuLy50/h4j4ec9NT9NPyOSXIw@ifi.uio.no>

On Sat, 28 Apr 2007, Mikulas Patocka wrote:
> >
> > Especially with lots of memory, allowing 40% of that memory to be dirty is
> > just insane (even if we limit it to "just" 40% of the normal memory zone).
> > That can be gigabytes. And no amount of IO scheduling will make it
> > pleasant to try to handle the situation where that much memory is dirty.
>
> What about using different dirtypage limits for different processes?

Not good. We inadvertently actually had a very strange case of that, in the
sense that we had different dirtypage limits depending on the type of the
allocation: if somebody used GFP_HIGHUSER, he'd be looking at the
percentage as a percentage of _all_ memory, but if somebody used
GFP_KERNEL he'd look at it as a percentage of just the normal low memory.
So effectively they had different limits (the percentage may have been the
same, but the _meaning_ of the percentage changed ;)

And it's really problematic, because it means that the process that has a
high tolerance for dirty memory will happily dirty a lot of RAM, and then
when the process that has a _low_ tolerance comes along, it might write
just a single byte, and go "oh, damn, I'm way over my dirty limits, I will
now have to start doing writeouts like mad".

Your form is much better:

> --- i.e. every process has dirtypage activity counter, that is increased when
> it dirties a page and decreased over time.

..but is really hard to do, and in particular, it's really hard to make
any kind of guarantee that when you have a hundred processes, they won't
go over the total dirty limit together!
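
A toy model of the proposed per-process counter, just to make the idea
concrete (this is not kernel code, and the 1024-page threshold is an
arbitrary example):

/* Toy model of the proposal: each process keeps a dirty-activity counter
 * that grows as it dirties pages and decays over time, and a process is
 * only throttled once its own counter is over a threshold. */
#include <stdio.h>

struct task_dirty {
	unsigned long counter;	/* recent dirtying activity, in pages */
};

static void task_dirtied_pages(struct task_dirty *t, unsigned long pages)
{
	t->counter += pages;
}

static void dirty_decay_tick(struct task_dirty *t)
{
	t->counter /= 2;	/* exponential decay on each periodic tick */
}

static int over_per_task_limit(const struct task_dirty *t)
{
	return t->counter > 1024;	/* example per-process threshold */
}

int main(void)
{
	struct task_dirty tar = { 0 }, bash = { 0 };

	task_dirtied_pages(&tar, 5000);	/* tar extracting an archive */
	task_dirtied_pages(&bash, 1);	/* bash appending one page */

	printf("tar throttled:  %d\n", over_per_task_limit(&tar));
	printf("bash throttled: %d\n", over_per_task_limit(&bash));

	dirty_decay_tick(&tar);
	printf("tar counter after one decay tick: %lu\n", tar.counter);
	return 0;
}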

And one of the reasons for the dirty limit is that the VM really wants to
know that it always has enough clean memory it can throw away, so that
even if it needs to do allocations while under IO, it's not totally
screwed.  An example of this is using dirty mmap with a networked
filesystem: with 2.6.20 and later, this should actually _work_ fairly
reliably, exactly because we now also count the dirty mapped pages in the
dirty limits, so we never get into the situation that we used to be able
to get into, where some process had mapped all of RAM, and dirtied it
without the kernel even realizing, and then when the kernel needed more
memory (in order to write some of it back), it was totally screwed.

So we do need the "global limit", as just a VM safety issue. We could do
some per-process counters in addition to that, but generally, the global
limit actually ends up doing the right thing: heavy writers are more
likely to _hit_ the limit, so statistically the people who write most are
also the people who end up having to clean up - so it's all fair.

> The main problem is that if the user extracts tar archive, tar eventually
> blocks on writeback I/O --- O.K. But if bash attempts to write one page to
> .bash_history file at the same time, it blocks too --- bad, the user is
> annoyed.

Right, but it's actually very unlikely. Think about it: the person who
extracts the tar-archive is perhaps dirtying a thousand pages, while the
bash_history writeback is doing a single one. Which process do you think
is going to hit the "oops, we went over the limit" case 99.9% of the time?

The _really_ annoying problem is when you just have absolutely tons of
memory dirty, and you start doing the writeback: if you saturate the IO
queues totally, it simply doesn't matter _who_ starts the writeback,
because anybody who needs to do any IO at all (not necessarily writing) is
going to be blocked.

This is why having gigabytes of dirty data (or even "just" hundreds of
megs) can be so annoying.

Even with a good software IO scheduler, when you have disks that do tagged
queueing, if you fill up the disk queue with a few dozen huge write
requests (the queue limit depends on the disk), it doesn't really
matter if the _software_ queuing then gives a big advantage to reads
coming in. They'll _still_ be waiting for a long time, especially since
you don't know what the disk firmware is going to do.

It's possible that we could do things like refusing to use all tag entries
on the disk for writing. That would probably help latency a _lot_. Right
now, if we do writeback, and fill up all the slots on the disk, we cannot
even feed the disk the read request immediately - we'll have to wait for
some of the writes to finish before we can even queue the read to the
disk.

(Of course, if disks don't support tagged queueing, you'll never have this
problem at all, but most disks do these days, and I strongly suspect it
really can aggravate latency numbers a lot).
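
To make the "don't let writes use every tag" idea concrete, a purely
conceptual sketch (this is not the block layer or any driver interface;
the depth of 32 and the 4 reserved slots are arbitrary example numbers):

/* Conceptual sketch only: keep a few hardware queue slots free for reads
 * by refusing to dispatch a write once writes alone would fill the
 * device's tag queue. */
#include <stdbool.h>
#include <stdio.h>

struct tag_queue {
	int depth;		/* tags the device supports, e.g. 32 */
	int reserved_for_reads;	/* slots writes may never consume */
	int writes_in_flight;
};

static bool may_dispatch_write(const struct tag_queue *q)
{
	return q->writes_in_flight < q->depth - q->reserved_for_reads;
}

static bool may_dispatch_read(const struct tag_queue *q, int in_flight_total)
{
	return in_flight_total < q->depth;	/* reads may use any free tag */
}

int main(void)
{
	struct tag_queue q = { .depth = 32, .reserved_for_reads = 4 };

	/* writeback tries to stuff the queue with writes... */
	while (may_dispatch_write(&q))
		q.writes_in_flight++;

	printf("writes in flight: %d of %d tags\n", q.writes_in_flight, q.depth);
	printf("a read can still be queued immediately: %s\n",
	       may_dispatch_read(&q, q.writes_in_flight) ? "yes" : "no");
	return 0;
}

With a few tags held back, an incoming read never has to wait for one of
the queued writes to complete before it can even reach the disk, which is
the latency win described above.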

Jens? Comments? Or do you do that already?

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS
Date: Sat, 28 Apr 2007 16:31:12 UTC
Message-ID: <fa.uA3beCvRJirPf8chDaYzF/FAC18@ifi.uio.no>

On Sat, 28 Apr 2007, Matthias Andree wrote:
>
> Another thing that is rather unpleasant (haven't yet tried fiddling with
> the dirty limits) is UDF to DVD-RAM - try rsyncing /home to a DVD-RAM,
> that's going to leave you with tons of dirty buffers that clear slowly
> -- "watch -n 1 grep -i dirty /proc/meminfo" is boring, but elucidating...

Now *this* is actually really really nasty.

There are worse examples. Try connecting some flash disk over USB-1, and
untar to it. Ugh.

I'd love to have some per-device dirty limit, but it's harder than it
should be.

		Linus
