From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [rfc][patch] remove racy sync_page?
Date: Tue, 30 May 2006 17:58:56 UTC
Message-ID: <fa.dCPbLFWs4so0L0H0P08eJKt1j68@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605301041200.5623@g5.osdl.org>

On Tue, 30 May 2006, Nick Piggin wrote:
>
> For workloads where plugging helps (ie. lots of smaller, contiguous
> requests going into the IO layer), the request pattern should be
> pretty good without plugging these days, due to multiple page
> readahead and writeback.

No.

That's fundamentally wrong.

The fact is, plugging is not about read-ahead and writeback. It's very
fundamentally about the _boundaries_ between multiple requests, and in
particular the time when the queue starts out empty so that we can build
up things for devices that want big requests, but even more so for devices
where _seeking_ is very expensive.

Those boundaries haven't gone anywhere. The fact that we do read-ahead and
write-back in chunks doesn't change anything: yes, we often have the "big
requests" thing handled, but (a) not always and (b) upper layers
fundamentally don't fix the seek issues.

I want to know that the block layer could - if we wanted to - do things
like read-ahead for many distinct files, and for metadata. We don't
currently do much of that yet, but the point is, plugging _allows_ us to.
Exactly because it doesn't depend on upper layers feeding everything in
one go.

Look at "sys_readahead()", and realize that it can be used to start IO for
read ahead _across_many_small_files_. Last I tried it, it was hugely
faster at populating the page cache than reading individual files (I used
to do it with BK to bring everything into cache so that the regular ops
would be faster - now git doesn't much need it).
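
	[[ As a concrete illustration of the above (not part of the original
	   thread): a minimal userspace sketch of such a pre-read pass over a
	   batch of small files. readahead() is the real glibc wrapper for
	   sys_readahead(); the preread() helper and its file-list argument
	   are made up for the example.

		#define _GNU_SOURCE
		#include <fcntl.h>
		#include <sys/stat.h>
		#include <unistd.h>

		/* Start read-ahead IO across many small files up front, so
		 * the block layer sees the whole batch before anything
		 * blocks waiting for data. */
		static void preread(const char *names[], int n)
		{
			for (int i = 0; i < n; i++) {
				int fd = open(names[i], O_RDONLY);
				struct stat st;

				if (fd < 0)
					continue;
				if (fstat(fd, &st) == 0)
					readahead(fd, 0, st.st_size);
				close(fd);
			}
			/* The real reads that follow should now mostly be
			 * satisfied from the page cache. */
		}
	]]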

And maybe it was just my imagination, but the disk seemed quieter too. It
should be able to do better seek patterns at the beginning due to plugging
(ie we won't start IO after the first file, but after the request queue
fills up or something else needs to wait and we do an unplug event).

THAT is what plugging is good for. Our read-ahead does well for large
requests, and that's important for some disk controllers in particular.
But plugging is about avoiding starting the IO too early.

Think about the TCP plugging (which is actually newer, but perhaps easier
to explain): it's useful not for the big file case (just use large reads
and writes), but for the "different sources" case - for handling the gap
between a header and the actual file contents. Exactly because it plugs in
_between_ events.
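
	[[ The TCP analogy in concrete form (illustrative only: error handling
	   is omitted, but TCP_CORK and sendfile() are the real interfaces):

		#include <netinet/in.h>
		#include <netinet/tcp.h>
		#include <sys/sendfile.h>
		#include <sys/socket.h>
		#include <unistd.h>

		void send_header_and_body(int sock, const char *hdr, size_t hdrlen,
					  int filefd, size_t filelen)
		{
			int on = 1, off = 0;

			/* "plug" the socket: no partial frames go out yet */
			setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
			write(sock, hdr, hdrlen);		/* small header, held back */
			sendfile(sock, filefd, NULL, filelen);	/* file body queued behind it */
			/* "unplug": everything goes out in well-filled segments */
			setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
		}
	]]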

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [rfc][patch] remove racy sync_page?
Date: Wed, 31 May 2006 00:57:30 UTC
Message-ID: <fa.IuIjpI5kdZR3VUiXt5lFAyPCgQ4@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605301739030.24646@g5.osdl.org>

On Wed, 31 May 2006, Nick Piggin wrote:
>
> The requests can only get merged if contiguous requests from the upper
> layers come down, right?

It has nothing to do with merging. It has to do with IO patterns.

Seeking.

Seeking is damn expensive - much more so than command issue. People forget
that sometimes.

If you can sort the requests so that you don't have to seek back and
forth, that's often a HUGE win.

Yes, the requests will still be small, and yes, the IO might happen in 4kB
chunks, but it happens a lot faster if you do it in a good elevator
ordering and if you hit the track cache than if you seek back and forth.

And part of that is that you have to submit multiple requests when you
start, and allow the elevator to work on it.

Now, of course, if you have tons of requests already in flight, you don't
care (you already have lots of work for the elevator), but at least in
desktop loads the "starting from idle" case is pretty common. Getting just
a few requests to start up with is good.
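
	[[ A back-of-the-envelope example of the point above: say four reads
	   arrive for blocks 10, 500, 20 and 510 of an idle disk. Issued one
	   at a time in arrival order, the head travels 10 -> 500 -> 20 ->
	   510: three long seeks and roughly 1460 blocks of movement. Held
	   back by the plug and sorted by the elevator into 10 -> 20 -> 500
	   -> 510, it's one long seek and about 500 blocks of movement - for
	   exactly the same four requests. ]]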

(Yes, tagged queueing makes it less of an issue, of course. I know, I
know. But I _think_ a lot of disks will start seeking for an incoming
command the moment they see it, just to get the best latency, rather than
wait a millisecond or two to see if they get another request. So even
with tagged queuing, the elevator can help, _especially_ for the initial
request).

> Why would plugging help if the requests can't get merged, though?

Why do you think we _have_ an elevator in the first place?

And just how well do you think it works if you submit one entry at a time
(regardless of how _big_ it is) and start IO on it immediately, vs trying
to get several IOs out there, so that we can say "do this one first"?

Sometimes I think hard disks have gotten too quiet - people no longer hear
it when access patterns are horrible. But the big issue with plugging was
only partially about request coalescing, and was always about trying to
get the _order_ right when you start to actually submit the requests to
the hardware.

And yes, I realize that modern disks do remapping, and that we will never
do a "perfect" job. But it's still true that the block number has _some_
(fairly big, in fact) relationship to the actual disk layout, and that
avoiding seeking is a big deal.

Rotational latency is often an even bigger issue, of course, but we can't
do much about that. We really can't estimate where the head is, like
people used to try to do three decades ago. _That_ time is long past, but
we can try to avoid long seeks, and it's still true that you can get
blocks that are _close_ faster (if only because they may end up being on
the same cylinder and not need a seek).

Even better than "same cylinder" is sometimes "same cache block" - disks
often do track caching, and they aren't necessarily all that smart about
it, so even if you don't read one huge contiguous block, it's much better
to read an area _close_ to another than seek back and forth, because
you're more likely to hit the disk's own track cache.

And I know, disks aren't as sensitive to long seeks as they used to be (a
short seek is almost as expensive as a long one, and a lot of it is the
head settling time), but as another example - I think for CD-ROMs you can
still have things like the motor spinning faster or slower depending on
where the read head is, meaning that short seeks are cheaper than long
ones.

(Maybe constant angular velocity is what people use, though. I dunno).

		Linus


From: Jens Axboe <axboe@suse.de>
Newsgroups: fa.linux.kernel
Subject: Re: [rfc][patch] remove racy sync_page?
Date: Wed, 31 May 2006 06:12:44 UTC
Message-ID: <fa.vJrvOJ56Y4ZLjRSYJsS9ftdTnH8@ifi.uio.no>
Original-Message-ID: <20060531061110.GB29535@suse.de>

On Tue, May 30 2006, Mark Lord wrote:
> Linus wrote:
> >(Yes, tagged queueing makes it less of an issue, of course. I know,
>
> My observations with (S)ATA tagged/native queuing are that it doesn't make
> nearly the difference under Linux that it does under other OSs.
> Probably because our block layer is so good at ordering requests,
> either from plugging or simply from clever disk scheduling.

Hmm well, I have seen a 30% performance increase for a random read
workload with NCQ; I'd say that is pretty nice. And of course there's the
whole write cache issue - with NCQ you _could_ get away with playing it
safer and disabling write-back caching.

NCQ helps us with something we can never fix in software - the
rotational latency. Ordering is only a small part of the picture.

Plus I think that more recent drives have a better NCQ implementation -
the first models I tried were pure and utter crap. Let's just say it
didn't instill a lot of confidence in firmware engineers at various
unnamed drive companies.

--
Jens Axboe



From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [rfc][patch] remove racy sync_page?
Date: Wed, 31 May 2006 14:47:45 UTC
Message-ID: <fa.PpPogv9StUkL/2M/GWbaZfrrHkY@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605310740530.24646@g5.osdl.org>

On Wed, 31 May 2006, Nick Piggin wrote:
>
> Now having a mechanism for a task to batch up requests might be a
> good idea. Eg.
>
> plug();
> submit reads
> unplug();
> wait for page

What do you think we're _talking_ about?

What do you think my example of sys_readahead() was all about?

WE DO HAVE EXACTLY THAT MECHANISM. IT'S CALLED PLUGGING!

> I'd think this would give us the benefits of coarse-grained (per-queue)
> plugging and more (e.g. it works when the request queue isn't empty).
> And it would be simpler because the unplug point is explicit and doesn't
> need to be kicked by lock_page or wait_on_page.

What do you think plugging IS?

It's _exactly_ what you're talking about. And yes, we used to have
explicit unplugging (a long long long time ago), and IT SUCKED. People
would forget, but even more importantly, people would do it even when not
needed because they didn't have a good place to do it because the waiter
was in a totally different path.

The reason it's kicked by wait_on_page() is that that's when it's needed.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [rfc][patch] remove racy sync_page?
Date: Wed, 31 May 2006 15:17:25 UTC
Message-ID: <fa.tmQJWOCgzAxdPcbf63Xg8rdISHA@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605310809250.24646@g5.osdl.org>

On Thu, 1 Jun 2006, Nick Piggin wrote:
>
> > And yes, we used to have explicit unplugging (a long long long time ago),
> > and IT SUCKED. People would forget, but even more importantly, people would
> > do it even when not
>
> I don't see what the problem is. Locks also suck if you forget to unlock
> them.

Locks are simple, and in fact are _made_ simple on purpose. We try very
hard to unlock in the same function that we lock, for example. Because if
we don't, bugs happen.

That's simply not _practical_ for IO. Think about it. Quite often, the
waiting is done somewhere else than the actual submission.

> > needed because they didn't have a good place to do it because the waiter was
> > in a totally different path.
>
> Example?

Pretty much all of them.

Where do you wait for IO?

Would you perhaps say "wait_on_page()"?

In other words, we really _do_ exactly what you think we should do.

> I don't know why you think this way of doing plugging is fundamentally
> right and anything else must be wrong... it is always heuristic, isn't
> it?

A _particular_ way of doing plugging is not "fundamentally right". I'm
perfectly happy with changing the place we unplug, if people want that.
We've done it several times.

But plugging as a _concept_ is definitely fundamentally right, exactly
because it allows us to have the notion of "plug + n*<random submit by
different paths> + unplug".

And you were not suggesting moving unplugging around. You were suggesting
removing the feature. Which is when I said "no f*cking way!".

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [rfc][patch] remove racy sync_page?
Date: Wed, 31 May 2006 16:01:03 UTC
Message-ID: <fa.nhT4VJ/kuRrdmq6yZptQKGilsfQ@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605310840000.24646@g5.osdl.org>

On Thu, 1 Jun 2006, Nick Piggin wrote:
>
> You're really keen on unplugging at the point of waiting. I don't get
> it.

Actually, no. I'm _not_ really keen on unplugging at the point of waiting.

I'm keen on unplugging at a point that makes sense, and is safe.

The problem is, you're not even exploring alternatives. Where the hell
_would_ you unplug it?

You can't unplug it in the place where we submit IO. That's _insane_,
because it basically means never plugging at all.

And most of the callers don't even _know_ whether we submit IO or not. For
example, let's pick a real and relevant example (they don't get much more
relevant than this): "do_generic_mapping_read()".

Tell me WHERE you can unplug in that sequence. I will tell you where you
can NOT unplug:

 - you can NOT unplug _anywhere_ inside of the read-ahead logic, because
   we obviously don't want it there (it would break the whole concept of
   read-ahead, not to mention real code like sys_readahead()).

 - you can NOT unplug in "->readpage()", for similar reasons (readahead
   again, and again, the whole point of plugging is that we want to do
   several readpages and then unplug later)

 - you can NOT just unplug in the path _after_ "readpage()", because the
   IO may have been started by SOMEBODY ELSE that just did read-ahead, and
   didn't unplug (on _purpose_ - the whole point of doing read-ahead is to
   allow concurrent IO execution, so a read-aheader that unplugs is broken
   by definition)

Those three are not just my "personal ideas". They are fundamental to how
unplugging works, and more importantly, to the whole _point_ of plugging.

Agreed?

Now, look at where we _currently_ unplug. I claim that there are exactly
_two_ places where we have to unplug ((1) we find a page that is not
up-to-date and (2) we've started a read on a page ourselves), and I claim
that those two places are _exactly_ the two places where we currently do
"lock_page()".

Again, this is not a "what if" scenario, or something where my "opinions"
are at stake. This is hard, cold, fact. We could do the unplugging outside
of the lock-page, but we'd do it in exactly that situation.

So what is your alternative? Put the explicit unplug at every possible
occurrence of lock_page() (and keep in mind that some of them don't want
it: we only want it when the lock-page will block, which is not always
true. Some people lock the page not because it's under IO and they're
waiting for it to be unlocked, but because it's dirty and they're going to
start IO on it - the lock_page() generally won't block on that, and if
it doesn't, we don't want to kick previous IO).

In other words, we actually want to unplug at lock_page(), BUT ONLY IF IT
NEEDS TO WAIT, which is by no means all of them. So it's more than just
"add an unplug at lock_page()" it's literally "add an unplug at any
lock-page that blocks".

Still wondering why it ends up being _inside_ lock_page()?
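
	[[ For reference, this is roughly what lock_page() looked like in
	   2.6-era kernels (paraphrased from memory; details vary by
	   version). The fast path never touches the request queue at all;
	   only the path that actually has to block ends up running the
	   mapping's sync_page() hook, which is what unplugs the queue:

		static inline void lock_page(struct page *page)
		{
			might_sleep();
			if (TestSetPageLocked(page))	/* already locked: we will block */
				__lock_page(page);	/* slow path: waits, and kicks sync_page() */
		}

	   So the "unplug only at a lock_page() that blocks" rule falls out
	   of the code structure, rather than every caller having to think
	   about it. ]]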

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [rfc][patch] remove racy sync_page?
Date: Wed, 31 May 2006 16:12:59 UTC
Message-ID: <fa.+AsoAXvd+/sstrg1xNT60lLGHL8@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605310903310.24646@g5.osdl.org>

On Wed, 31 May 2006, Linus Torvalds wrote:
>
> In other words, we actually want to unplug at lock_page(), BUT ONLY IF IT
> NEEDS TO WAIT, which is by no means all of them. So it's more than just
> "add an unplug at lock_page()" it's literally "add an unplug at any
> lock-page that blocks".

Btw, don't get me wrong. In the read case, every time we do a lock_page(),
we might as well unplug unconditionally, because we effectively know we're
going to wait (we just checked that it's not up-to-date). Sure, there's a
race window, but we don't care - the window is very small, and in the end,
unplugging isn't a correctness issue as long as you do it at least as
often as required (ie unplugging too much is ok and at worst just makes
for bad performance - so a very unlikely race that causes _extra_
unplugging is fine as long as it's unlikely; forgetting to unplug is bad).

In the write case, lock_page() may or may not need an unplug. In those
cases, it needs unplugging if it was locked before, but obviously not if
it wasn't.

In the "random case" where we use the page lock not because we want to
start IO on it, but because we need to make sure that page->mapping
doesn't change, we don't really care about the IO, but we do need to
unplug just to make sure that the IO will complete.

And I suspect your objection to unplugging is not really about unplugging
itself. It's literally about the fact that we use the same page lock for
IO and for the ->mapping thing, isn't it?

IOW, you don't actually dislike plugging itself, you dislike it due to the
effects of a totally unrelated locking issue, namely that we use the same
lock for two totally independent things. If the ->mapping thing were to
use a PG_map_lock that didn't affect plugging one way or the other, you
wouldn't have any issues with unplugging, would you?

And I think _that_ is what really gets us to the problem.

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [rfc][patch] remove racy sync_page?
Date: Wed, 31 May 2006 15:13:15 UTC
Message-ID: <fa.N5SV4Gor5aSil+xXUUwm34iNtU4@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605310755210.24646@g5.osdl.org>

On Wed, 31 May 2006, Linus Torvalds wrote:
>
> The reason it's kicked by wait_on_page() is that that's when it's needed.

Btw, that's not how it has always been done.

For the longest time, it was actually triggered by scheduler activity, in
particular, plugging used to be a workqueue event that was triggered by
the scheduler (or any explicit points when you wanted it to be triggered
earlier).

So whenever you scheduled, _all_ plugs would be unplugged.

It was specialized to wait_on_page() in order to avoid unnecessary
overhead in scheduling (making it more directed), and to allow you to
leave the request around for further merging/sorting even if a process had
to wait for something unrelated.

But in particular, the old non-directed unplug didn't work well in SMP
environments (because _one_ CPU re-scheduling obviously doesn't mean
anything for the _other_ CPU that is actually working on setting up the
request queue).

The point being that we could certainly do it somewhere else. Doing it in
wait_on_page() (and at least historically, in waiting for bh's) is really
nothing more than trying to have as few points as possible where it's
done, and at the same time not missing any.

And yes, I'd _love_ to have better interfaces to let people take advantage
of this than sys_readahead(). sys_readahead() was a 5-minute hack that
actually does generate wonderful IO patterns, but it is also not all that
useful (too specialized, and non-portable).
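
	[[ The portable cousin of sys_readahead() is posix_fadvise() with
	   POSIX_FADV_WILLNEED, which likewise just hints that the pages
	   will be wanted soon and lets the kernel start read-ahead on them:

		posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);	/* len 0 = to end of file */
	]]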

I tried at one point to make us do directory inode read-ahead (ie put the
inodes on a read-ahead queue when doing a directory listing), but that
failed miserably. All the low-level filesystems are very much designed to
have inode reading be synchronous, and it would have implied major surgery
to do (and, sadly, my preliminary numbers also seemed to say that it might
be a huge time waster, with enough users just wanting the filenames and
not the inodes).

The thing is, right now we have very bad IO patterns for things that
traverse whole directory trees (like doing a "tar" or a "diff" of a tree
that is cold in the cache) because we have way too many serialization
points. We do a good job of prefetching within a file, but if you have
source trees etc, the median size for files is often smaller than a single
page, so the prefetching ends up being a non-issue most of the time, and
we do _zero_ prefetching between files ;/

Now, the reason we don't do it is that it seems to be damn hard to do. No
question about that. Especially since it's only worth doing (obviously) on
the cold-cache case, and that's also when we likely have very little
information about what the access patterns might be.. Oh, well.

Even with sys_readahead(), my simple "pre-read a whole tree" often ended
up waiting for inode IO (although at least the fact that several inodes
fit in one block gets _some_ of that).

			Linus


From: Jens Axboe <axboe@suse.de>
Newsgroups: fa.linux.kernel
Subject: Re: [rfc][patch] remove racy sync_page?
Date: Wed, 31 May 2006 18:11:42 UTC
Message-ID: <fa.uOFunOoH45tf1Bn/1DjYHQSFvJQ@ifi.uio.no>
Original-Message-ID: <20060531181312.GA29535@suse.de>

On Wed, May 31 2006, Linus Torvalds wrote:
>
>
> On Wed, 31 May 2006, Linus Torvalds wrote:
> >
> > The reason it's kicked by wait_on_page() is that that's when it's needed.
>
> Btw, that's not how it has always been done.
>
> For the longest time, it was actually triggered by scheduler activity, in
> particular, plugging used to be a workqueue event that was triggered by
> the scheduler (or any explicit points when you wanted it to be triggered
> earlier).

Now it's time for me to give Linus a history lesson on plugging,
apparently.

Plugging used to be done by the issuer, with immediate unplugging
when you were done issuing the blocks. Both of these actions happened in
ll_rw_block() if the caller was submitting more than one buffer_head, and
it happened without the caller knowing about plugging. The caller never
had to unplug.

1.2 then expanded that to be able to plug more than one device at a
time. It didn't really do much except allow the array of buffers passed
in being on separate devices. The plugging was still hidden from the
caller.

1.3 and on introduced a more generalised infrastructure for this, moving
the plugging to a task queue (tq_disk). This meant that we could finally
separate the plugging and unplugging from the direct IO issue. So
whenever someone wanted to do a wait_on_buffer()/lock_page() equivalent
for something that might need to be issued, it would have to do a
run_task_queue(&tq_disk) first, which would then unplug all the queues
that were plugged.

tq_disk was then removed and moved to a block list during the 2.5
massive io/bio changes. The functionality remained the same, though -
you had to kick all queues to force the unplug of the page you wanted.
This infrastructure lasted all up to the point where silly people with
lots of CPU's started complaining about lock contention for 32-way
systems with thousands of disks. This is the point where I reevaluated
the benefits of plugging, found it good, and decided to fix it up.
Plugging then became a simple state bit in the queue, and you would have
to pass in eg the page you wanted when asking to unplug. This would kick
just the specific queue you needed. It also got a timer tied to it, so
that we could unplug after foo msecs if we wanted. Additionally, it will
also self-unplug once a certain plug depth has been reached (like 4
requests).

Anyway, the point I wanted to make is that this was never driven by
scheduler activity. So there!

--
Jens Axboe



From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [rfc][patch] remove racy sync_page?
Date: Wed, 31 May 2006 18:27:20 UTC
Message-ID: <fa.793RRPSHB4LaFZylczbCMazfBfw@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605311121390.24646@g5.osdl.org>

On Wed, 31 May 2006, Jens Axboe wrote:
>
> Anyway, the point I wanted to make is that this was never driven by
> scheduler activity. So there!

Heh. I confused tq_disk and tq_scheduler, methinks.

And yes, "run_task_queue(&tq_disk)" was in lock_page(), not the scheduler.

			Linus




	[[ For an update on this subject, see Jens Axboe's 2011 LWN article:

		http://lwn.net/Articles/438256/

	-- Norman ]]
