From: Linus Torvalds <torvalds@transmeta.com>
Date: 	Mon, 27 Aug 2001 14:44:01 -0700
Subject: Re: [resent PATCH] Re: very slow parallel read performance
Newsgroups: fa.linux.kernel

In article <20010827203125Z16070-32383+1731@humbolt.nl.linux.org> you write:
>On August 27, 2001 09:43 pm, Oliver Neukum wrote:
>> 
>> If we are optimising for streaming (which readahead is made for) dropping 
>> only one page will buy you almost nothing in seek time. You might just as 
>> well drop them all and correct your error in one larger read if necessary.
>> Dropping the oldest page is possibly the worst you can do, as you will need 
>> it soonest.
>
>Yes, good point.  OK, I'll re-examine the dropping logic.  Bear in mind, 
>dropping readahead pages is not supposed to happen frequently under 
>steady-state operation, so it's not that critical what we do here, it's going 
>to be hard to create a load that shows the impact.  The really big benefit 
>comes from not overdoing the readahead in the first place, and not underdoing 
>it either.

Note that the big reason why I did _not_ end up just increasing the
read-ahead value from 31 to 511 (it was there for a short while) is that
large read-ahead does not necessarily improve performance AT ALL,
regardless of memory pressure. 

Why? Because if the IO request queue fills up, the read-ahead actually
ends up waiting for requests, and ends up being synchronous. Which
totally destroys the whole point of doing read-ahead in the first place.
And a large read-ahead only makes this more likely.

Also note that doing tons of parallel reads _also_ makes this more
likely, and actually ends up also mixing the read-ahead streams which is
exactly what you do not want to do.

The solution to both problems is to make the read-ahead not wait
synchronously on requests - that way the request allocation itself ends
up being a partial throttle on memory usage too, so that you actually
probably end up fixing the problem of memory pressure _too_.

This requires that the read-ahead code start submitting the blocks
using READA, which in turn requires that the readpage() function get a
"READ vs READA" argument.  And the ll_rw_block code would obviously have
to honour the rw_ahead hint and submit_bh() would have to return an
error code - which it currently doesn't do, but which should be trivial
to implement. 
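
As a toy user-space illustration of that throttling behaviour (this is
not kernel code - the error return it models is exactly the part that
submit_bh() doesn't have yet): a bounded request queue where a normal
READ waits for a free slot, but a READA request simply gives up, so
the read-ahead can never turn synchronous.

/*
 * Toy model of the READ vs READA idea: a bounded request queue where
 * a normal READ waits for a free slot, but a READA (read-ahead)
 * request is simply dropped when the queue is full, so read-ahead
 * never turns synchronous.
 */
#include <stdio.h>
#include <stdbool.h>

#define QUEUE_DEPTH 4

static int in_flight;			/* requests currently queued */

static bool submit_request(long block, bool readahead)
{
	if (in_flight >= QUEUE_DEPTH) {
		if (readahead)
			return false;	/* READA: give up, stay asynchronous */
		/* READ: the real kernel would sleep here until a request
		 * frees up; the toy just pretends the wait happened. */
		in_flight = 0;
	}
	in_flight++;
	printf("%s block %ld queued (%d in flight)\n",
	       readahead ? "READA" : "READ ", block, in_flight);
	return true;
}

int main(void)
{
	/* One demanded block followed by an aggressive read-ahead run:
	 * the read-ahead throttles itself once the queue fills up. */
	submit_request(0, false);
	for (long block = 1; block <= 8; block++)
		if (!submit_request(block, true)) {
			printf("read-ahead stopped at block %ld\n", block);
			break;
		}
	return 0;
}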

I really think that doing anything else is (a) stupid and (b) wrong.
Trying to come up with a complex algorithm on how to change read-ahead
based on memory pressure is just bound to be extremely fragile and have
strange performance effects. While letting the IO layer throttle the
read-ahead on its own is the natural and high-performance approach.

		Linus


Date: 	Sun, 9 Sep 2001 11:17:09 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: linux-2.4.10-pre5
Newsgroups: fa.linux.kernel

On Sun, 9 Sep 2001, Rik van Riel wrote:
> On Sat, 8 Sep 2001, Linus Torvalds wrote:
>
> > It's only filesystems that have modified buffers without marking them
> > dirty (by virtue of having pointers to buffers and delaying the dirtying
> > until later) that are broken by the "try to make sure all buffers are
> > up-to-date by reading them in" approach.
>
> Think of the inode and dentry caches.  I guess we need
> some way to invalidate those.

Note that we've never done it before. The inode and dentry caches have
never been coherent with the buffer cache, and we have in fact
historically not even tried to shrink them (to try to minimize the impact
of the non-coherency).

The inode data in memory doesn't even show _up_ in the buffer cache, for
example. Much less dentries, which are so virtualized inside the kernel
that they have absolutely no information on whether they exist on a disk
at all, much less any way to map them to a disk location.

I agree that coherency wrt fsck is something that theoretically would be a
GoodThing(tm). And this is, in fact, why I believe that filesystem
management _must_ have a good interface to the low-level filesystem.
Because you cannot do it any other way.

This is not a fsck-only issue. I am a total non-believer in the "dump"
program, for example. I always found it to be a totally ridiculous and
idiotic way to make backups. It would be much better to have good
(filesystem-independent) interfaces to do what "dump" wants to do (ie have
ways of explicitly bypassing the accessed bits and get the full inode
information etc).

Nobody does a "read()" on directories any more to parse the directory
structure. Similarly, nobody should have done a "dump" on the raw device
any more for the last 20 years or so. But some backwards places still do.

Backup of a live filesystem should be much easier than fsck of a live
filesystem, but I believe that there are actually lots of common issues
there, and such an interface should eventually be designed with both in
mind. Both want to get raw inode lists, for example. Both potentially want
to be able to read (and specify) block positions on the disk. Etc etc.

And notice how de-fragmentation falls neatly into this hole too: I bet
that if you had good management interfaces for fsck and backup, you'd
pretty much automatically be able to do defrag through the same
interfaces.
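
Purely as a sketch of the kind of interface being argued for - the
ioctl name, structure layout and field names below are all invented
for illustration, nothing like them exists - a management tool would
ask the filesystem for raw inodes and their block positions through
the kernel instead of scraping the raw device:

/*
 * Hypothetical only: FS_IOC_GET_RAW_INODE and struct fs_raw_inode_req
 * are made up to illustrate a kernel-mediated interface for
 * fsck/backup/defrag tools.
 */
#include <stdio.h>
#include <stdint.h>

/* "Give me the raw inode and where its data lives, without touching
 * the accessed time." */
struct fs_raw_inode_req {
	uint64_t ino;		/* inode number to fetch */
	uint64_t flags;		/* e.g. a hypothetical no-atime/backup flag */
	uint64_t blocks[16];	/* filled in: physical block positions */
	uint8_t  raw[256];	/* filled in: on-disk inode image */
};

int main(void)
{
	struct fs_raw_inode_req req = { .ino = 12, .flags = 1 };

	/*
	 * A real tool would do something like
	 *	ioctl(fs_fd, FS_IOC_GET_RAW_INODE, &req);
	 * and then walk req.blocks for backup or defragmentation.
	 * FS_IOC_GET_RAW_INODE does not exist; it only stands in for
	 * the management interface discussed above.
	 */
	printf("would request raw inode %llu (%zu-byte request)\n",
	       (unsigned long long)req.ino, sizeof req);
	return 0;
}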

			Linus



From: "H. Peter Anvin" <hpa@zytor.com>
Subject: Re: linux-2.4.10-pre5
Date: 	9 Sep 2001 13:18:57 -0700
Newsgroups: fa.linux.kernel

Followup to:  <Pine.LNX.4.33.0109091105380.14479-100000@penguin.transmeta.com>
By author:    Linus Torvalds <torvalds@transmeta.com>
In newsgroup: linux.dev.kernel
> 
> I agree that coherency wrt fsck is something that theoretically would be a
> GoodThing(tm). And this is, in fact, why I believe that filesystem
> management _must_ have a good interface to the low-level filesystem.
> Because you cannot do it any other way.
> 
> This is not a fsck-only issue. I am a total non-believer in the "dump"
> program, for example. I always found it to be a totally ridiculous and
> idiotic way to make backups. It would be much better to have good
> (filesystem-independent) interfaces to do what "dump" wants to do (ie have
> ways of explicitly bypassing the accessed bits and get the full inode
> information etc).
> 
> Nobody does a "read()" on directories any more to parse the directory
> structure. Similarly, nobody should have done a "dump" on the raw device
> any more for the last 20 years or so. But some backwards places still do.
> 

The main reason people still give to justify dump/restore is --
believe it or not -- the inability to set the atime.  One would think this
would be a trivial extension to the VFS, even if protected by a
capability (CAP_BACKUP?).
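
For what it's worth, the user-space workaround (what GNU tar's
--atime-preserve does) is to remember the atime and put it back after
reading the file, which needs ownership or root and is inherently
racy; later kernels grew an O_NOATIME open flag for the same problem.
A minimal sketch of the restore-it-afterwards approach, with most
error handling omitted:

/*
 * Restore-the-atime approach: remember the access time before reading
 * and put it back afterwards.  Needs ownership of the file (or root),
 * and another process can still touch the file in between - hence the
 * wish for a proper kernel-side interface.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

static int backup_read(const char *path)
{
	struct stat st;
	char buf[65536];
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}

	while (read(fd, buf, sizeof buf) > 0)
		;	/* stream the data to the backup medium here */
	close(fd);

	/* Put the pre-backup atime back (mtime is left as it was). */
	struct timeval times[2] = {
		{ st.st_atime, 0 },	/* atime */
		{ st.st_mtime, 0 },	/* mtime */
	};
	return utimes(path, times);
}

int main(int argc, char **argv)
{
	return argc > 1 ? backup_read(argv[1]) : 1;
}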

The ideal way to run backups I have found is on filesystems which
support atomic snapshots -- that way, your backup set becomes not only
safe (since it goes through the kernel etc. etc.) but totally
coherent, since it is guaranteed to be unchanging.  This is a major
win for filesystems which can do atomic snapshots, and I'd highly
encourage filesystem developers to consider this feature.

	-hpa



-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt	<amsp@zytor.com>


Date: 	Sun, 9 Sep 2001 17:23:41 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: linux-2.4.10-pre5
Newsgroups: fa.linux.kernel

On Mon, 10 Sep 2001, Alan Cox wrote:
>
> How do you plan to handle the situation where we have multiple instances
> of the same 4K disk block each of which contains 1K of data in the
> start of the page copy and 3K of zeroes.

Note that the file data is indexed by a completely different index, and
has been since the page cache was introduced. I am _not_ suggesting that
we get rid of _that_ alias. We've always had data aliasing with the page
cache, and it doesn't matter.

Think of it as an address space issue - where each address space has its
own indexing. One of the advantages of the page cache is the fact that it
has the notion of completely independent address spaces, after all.

So we have one index which is the "physical index", which is the one that
the getblk/bread interfaces use, and which is also the one that the
raw-device-in-pagecache code uses. There are no aliasing issues (as long
as we make sure that we have a "meta-inode" to create a single address
space for this index, and do not try to use different inodes for different
representations of the same block device major/minor number)

The other index is the one we already have, namely the file-virtual index.

And switching the first index over from the purely physically indexed
buffer cache to a new page-cache address space doesn't impact this
already-existing index _at_all_.

So in the physical index, you see one 4kB block.

In the virtual index, you have 4 4kB blocks, where the start contents just
happen to come from different parts of the physical 4kB block..
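
As a sketch of nothing more than the shape of those two keys (this is
not kernel code), using Alan's example of four 1kB file blocks packed
into a single 4kB disk block:

/*
 * The two independent indexes, illustrated with plain structs - this
 * just shows what each cache is keyed by.
 */
#include <stdio.h>

/* Physical index: what getblk()/bread() and the raw device use. */
struct physical_key {
	int		dev;		/* block device (major/minor) */
	unsigned long	blocknr;	/* 4kB block number on that device */
};

/* Virtual index: what file access through the page cache uses. */
struct virtual_key {
	unsigned long	inode;		/* which file */
	unsigned long	index;		/* page offset within that file */
};

int main(void)
{
	struct physical_key disk_block = { 3, 1000 };	/* one 4kB block */
	struct virtual_key page;

	/* Four different files each get their own 4kB page, whose first
	 * 1kB comes from a different quarter of that one physical block
	 * (the rest of each page is zeroes). */
	for (int i = 0; i < 4; i++) {
		page.inode = 100 + i;
		page.index = 0;
		printf("file %lu page %lu <- dev %d block %lu offset %d\n",
		       page.inode, page.index,
		       disk_block.dev, disk_block.blocknr, i * 1024);
	}
	return 0;
}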

And notice how this is NOT something new at all - this is _exactly_ what
we do now, except our physical index is currently not "wrapped" in any
page cache address space.

> > anyway, I very much doubt it has any good properties to make software more
> > complex by having that kind of readahead in sw.
>
> Even the complex stuff like the i2o raid controllers seems to benefit
> primarily from file level not physical readahead, that gives it enough to
> do intelligent scheduling and to keep the drive firmware busy making good
> decisions (hopefully)

Right. The advantages of read-ahead are two-fold:
 - give the disk (and low-level IO layer) more information about future
   patterns (so that they can generate better schedules for fetching the
   data)
 - overlap IO (especially seek and rotational delays) and computation.

HOWEVER, neither of those advantages actually exist for physical
read-ahead, simply because current disks will always end up doing
buffering anyway, which means that limited physical read-ahead is pretty
much guaranteed by the disk - and doing it in software is only going to
slow things down by generating more traffic over the control/data
channels.

Sure, you could do _extensive_ physical read-ahead (ie more than what
disks tend to do on their own), but that implies a fair number of sectors
(modern disks tend to do something akin to "track buffering" in the 64+kB
range, so you'd have to do noticeably more than that in software to have
any chance of making a difference), and then you'd probably often end up
reading too much and loading the line with data you may not actually need.

		Linus



Date: 	Mon, 10 Sep 2001 15:15:24 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: linux-2.4.10-pre5
Newsgroups: fa.linux.kernel

On Mon, 10 Sep 2001, Daniel Phillips wrote:
>
> Here's some anecdotal evidence to the contrary.
>
> This machine requires about 1.5 seconds to diff two kernel trees if both
> trees are in cache.  If neither tree is in cache it takes 90 seconds.  It's a
> total of about 300M of source - reading that into memory should take about 10
> seconds at 30M/sec, taking one pass across the disk and assuming no extensive
> fragmentation.
>
> We lost 78.5 seconds somewhere.  From the sound of the disk drives, I'd say
> we lost it to seeking, which physical readahead with a large cache would be
> able to largely eliminate in this case.

Yes, we could have a huge physical read-ahead, and hope that the logical
layout is such that consecutive files in the directory are also
consecutive on disk (which is quite often true).

And yes, doing a cold "diff" is about the worst case - we can't take
advantage of logical read-ahead within the files themselves (they tend to
be too small for read-ahead to matter on that level), and the IO is
bouncing back and forth between two different trees - and thus most likely
two very different places on the disk.

And even when the drive does physical read-ahead, a drive IO window of
64kB-256kB (and let's assume about 50% of that is actually _ahead_ of the
read) is not going to avoid the constant back-and-forth seeking when the
combined size of the two kernel trees is in the 50MB region.

[ There are also drives that just aren't very good at handling their
  internal caches. You'll see drives that have a 2MB on-board buffer, but
  the way the buffer is managed it might be used in fixed chunks. Some of
  the really worst ones only have a single buffer - so seeking back and
  forth just trashes the drive buffer completely. That's rather unusual,
  though. ]

However, physical read-ahead really isn't the answer here. I bet you could
cut your time down with it, agreed. But you'd hurt a lot of other loads,
and it really depends on nice layout on disk. Plus you wouldn't even know
where to put the data you read-ahead: you only have the physical address,
not the logical address, and the page-cache is purely logically indexed..

The answer to this kind of thing is to try to make the "diff" itself be
nicer on the cache. I bet you could speed up diff a lot by having it read
in multiple files in one go if you really wanted to. It probably isn't
worth most people's time.
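
A sketch of what that could look like from user space, assuming the
(later) posix_fadvise() interface: feed it the file list and it asks
the kernel to start pulling the data in before the real diff runs,
much like the find | xargs cat trick below but without copying
everything through user space.

/*
 * Prefetch helper: read file names on stdin and ask the kernel to
 * start reading each file, e.g.
 *
 *	find tree1 tree2 -type f | ./prefetch && diff -r tree1 tree2
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char path[4096];

	while (fgets(path, sizeof path, stdin)) {
		path[strcspn(path, "\n")] = '\0';
		int fd = open(path, O_RDONLY);
		if (fd < 0)
			continue;
		/* len == 0 means "to the end of the file" */
		posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
		close(fd);
	}
	return 0;
}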

(Ugly secret: because I tend to have tons of memory, I sometimes do

	find tree1 tree2 -type f | xargs cat > /dev/null

just after I have rebooted, just so that I always have my kernel trees in
the cache - after that they tend to stay there.. Having a gig of ram
makes you do stupid things).

			Linus



Date: 	Mon, 10 Sep 2001 16:16:18 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: linux-2.4.10-pre5
Newsgroups: fa.linux.kernel

On Tue, 11 Sep 2001, Daniel Phillips wrote:

> On September 11, 2001 12:39 am, Rik van Riel wrote:
> > On Mon, 10 Sep 2001, Linus Torvalds wrote:
> >
> > > (Ugly secret: because I tend to have tons of memory, I sometimes do
> > >
> > > 	find tree1 tree2 -type f | xargs cat > /dev/null
> >
> > This suggests we may want to do aggressive readahead on the
> > inode blocks.
>
> While that is most probably true, that wasn't his point.  He preloaded the
> page cache.

Well, yes, but more importantly, I pre-loaded my page cache _with_io_that_
_is_more_likely_to_be_consecutive_.

This means that the preload + subsequent diff is already likely to be
faster than doing just one "diff" would have been.

So it's not just about preloading. It's also about knowing about access
patterns beforehand - something that the kernel really cannot do.

Pre-loading your cache always depends on some limited portion of
prescience. If you preload too aggressively compared to your knowledge of
future usage patterns, you're _always_ going to lose. I think that the
kernel doing physical read-ahead is "too aggressive", while doing it
inside "diff" (that _knows_ the future patterns) is not.

And doing it by hand depends on the user knowing what he will do in the
future. In some cases that's fairly easy for the user to guess ;)

		Linus



Date: 	Mon, 10 Sep 2001 17:20:48 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: linux-2.4.10-pre5
Newsgroups: fa.linux.kernel

On Tue, 11 Sep 2001, Daniel Phillips wrote:
> On September 11, 2001 12:15 am, Linus Torvalds wrote:
> > However, physical read-ahead really isn't the answer here. I bet you could
> > cut your time down with it, agreed. But you'd hurt a lot of other loads,
> > and it really depends on nice layout on disk. Plus you wouldn't even know
> > where to put the data you read-ahead: you only have the physical address,
> > not the logical address, and the page-cache is purely logically indexed..
>
> You leave it in the buffer cache and the page cache checks for it there
> before deciding it has to hit the disk.

Umm..

Ehh.. So now you have two cases:
 - you hit in the cache, in which case you've done an extra allocation,
   and will have to do an extra memcpy.
 - you don't hit in the cache, in which case you did IO that was
   completely useless and just made the system slower.

Considering that the extra allocation and memcpy is likely to seriously
degrade performance on any high-end hardware if it happens any noticeable
percentage of the time, I don't see how your suggestion can _ever_ be a win.
The only time you avoid the memcpy is when you wasted the IO completely.
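
To put rough numbers on that trade-off (both figures below are made
up; only the shape of the argument matters): whatever the hit rate,
every speculative block pays either the memcpy or the wasted transfer.

/*
 * Assumed numbers: a 4kB allocation-plus-memcpy on a hit, a wasted
 * 4kB transfer over the bus and disk channel on a miss.
 */
#include <stdio.h>

int main(void)
{
	const double copy_us = 4.0;	/* assumed cost of the memcpy on a hit */
	const double io_us = 400.0;	/* assumed cost of a wasted transfer */

	for (double hit = 0.0; hit <= 1.0; hit += 0.25)
		printf("hit rate %.2f -> %.1f us overhead per block\n",
		       hit, hit * copy_us + (1.0 - hit) * io_us);
	return 0;
}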

So please explain to me how you think this is all a good idea? Or explain
why you think the memcpy is not going to be noticeable in disk throughput
for normal loads?

		Linus



Date: 	Mon, 10 Sep 2001 19:27:37 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: linux-2.4.10-pre5
Newsgroups: fa.linux.kernel

On Tue, 11 Sep 2001, Daniel Phillips wrote:
> >
> > Ehh.. So now you have two cases:
> >  - you hit in the cache, in which case you've done an extra allocation,
> >    and will have to do an extra memcpy.
> >  - you don't hit in the cache, in which case you did IO that was
> >    completely useless and just made the system slower.
>
> If the read comes from block_read then the data goes into the page cache.  If
> it comes from getblk (because of physical readahead or "dd xxx >null") the
> data goes into the buffer cache and may later be moved to the page cache,
> once we know what the logical mapping is.

Note that practically all accesses will be through the logical mapping,
and the virtual index.

While the physical mapping and the physical index are the only ones you can
do physical read-ahead into.

And the two DO NOT OVERLAP. They never will. You can't just swizzle
pointers around - you will have to memcpy.

In short, you'll end up doing a memcpy() pretty much every single time you
hit.

> > Considering that the extra allocation and memcpy is likely to seriously
> > degrade performance on any high-end hardware if it happens any noticeable
> > percentage of the time, I don't see how your suggestion can _ever_ be a win.
> > The only time you avoid the memcpy is when you wasted the IO completely.
>
> We don't have any extra allocation in the cases we're already handling now,
> it works exactly the same.  The only overhead is an extra hash probe on cache
> miss.

No it doesn't. We're already handling _nothing_: the only thing we're
handling right now is:

 - swap cache read-ahead (which is _not_ a physical read-ahead, but a
   logical one, and as such is not at all the case you're handling)

 - we're invalidating buffer heads that we find to be aliasing the virtual
   page we create, which is _also_ not at all the same thing, it's simply
   an issue of making sure that we haven't had the (unlikely) case of a
   meta-data block being free'd, but not yet written back.

So in the first case we don't have an aliasing issue at all (it's really a
virtual index, although in the case of a swap partition the virtual
address ends up being a 1:1 mapping of the physical address), and in the
second case we do not intend to use the physical mapping we find, we just
intend to get rid of it.

> > So please explain to me how you think this is all a good idea? Or explain
> > why you think the memcpy is not going to be noticeable in disk throughput
> > for normal loads?
>
> When we're forced to do a memcpy it's for a case where we're saving a read or
> a seek.

No.

The above is assuming that the disk doesn't already have the data in its
buffers. In which case the only thing we're doing is making the IO command
and the DMA that filled the page happen earlier.

Which can be good for latency, but considering that the read-ahead is at
least as likely to be _bad_ for latency, I don't believe in that argument
very much. Especially not when you've dirtied the cache and introduced an
extra memcpy.

>	  Even then, the memcpy can be optimized away in the common case that
> the blocksize matches page_size.

Well, you actually have to be very very careful: if you do that there is
just a _ton_ of races there (you'd better be _really_ sure that nobody
else has already found either of the pages, the "move page" operation is
not exactly completely painless).

>				  Sometimes, even when the blocksize doesn't
> match this optimization would be possible.  But the memcpy optimization isn't
> important, the goal is to save reads and seeks by combining reads and reading
> blocks in physical order as opposed to file order.

Try it. I'll be convinced by numbers, and I'll bet you won't have the
numbers for the common cases to prove yourself right.

You're making the (in my opinion untenable) argument that the logical
read-ahead is flawed enough of the time that you win by doing an
opportunistic physical read-ahead, even when that implies more memory
pressure both from an allocation standpoint (if you fail 10% of the time,
you'll have 10% more page cache pages you need to get rid of gracefully,
that never had _any_ useful information in them) and from a memory bus
standpoint.

> An observation: logical readahead can *never* read a block before it knows
> what the physical mapping is, whereas physical readahead can.

Sure. But the meta-data is usually on the order of 1% or less of the data,
which means that you tend to need to read a meta-data block only 1% of the
time you need to read a real data block.

Which makes _that_ optimization not all that useful.

So I'm claiming that in order to get any useful performance improvements,
your physical read-ahead has to be _clearly_ better than the logical one.
I doubt it is.

Basically, the things where physical read-ahead might win:
 - it can be done without metadata (< 1%)
 - it can be done "between files" (very uncommon, especially as you have
   to assume that different files are physically close)

Can you see any others? I doubt you can get physical read-ahead that gets
more than a few percentage points better hits than the logical one. AND I
further claim that you'll get a _lot_ more misses on physical read-ahead,
which means that your physical read-ahead window should probably be larger
than our current logical read-ahead.

I have seen no arguments from you that might imply anything else, really.

Which in turn also implies that the _overhead_ of physical read-ahead is
larger. And that is without even the issue of memcpy and/or switching
pages around, _and_ completely ignoring all the complexity-issues.

But hey, at the end of the day, numbers rule.

		Linus



From: Daniel Phillips <phillips@bonn-fries.net>
Subject: Re: linux-2.4.10-pre5
Date: 	Tue, 11 Sep 2001 12:02:51 +0200
Newsgroups: fa.linux.kernel

On September 11, 2001 12:39 am, Rik van Riel wrote:
> On Mon, 10 Sep 2001, Linus Torvalds wrote:
> 
> > (Ugly secret: because I tend to have tons of memory, I sometimes do
> >
> > 	find tree1 tree2 -type f | xargs cat > /dev/null
> 
> This suggests we may want to do aggressive readahead on the
> inode blocks.
> 
> They are small enough to - mostly - cache and should reduce
> the amount of disk seeks quite a bit. In an 8 MB block group
> with one 128 byte inode every 8 kB, we have a total of 128 kB
> of inodes...

I tested this idea by first doing a ls -R on the tree, then Linus's find 
command:

    time ls -R linux >/dev/null
    time find linux -type f | xargs cat > /dev/null

the plan being that the ls command would read all the inode blocks and hardly 
any of the files would be big enough to have an index block, so we would 
effectively have all metadata in cache.

According to your theory the total time for the two commands should be less 
than the second command alone.  But it wasn't, the two commands together took 
almost exactly the same time as the second command by itself.

There goes that theory.

--
Daniel


Date: 	Tue, 11 Sep 2001 08:12:59 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: linux-2.4.10-pre5
Newsgroups: fa.linux.kernel

On Mon, 10 Sep 2001, Rik van Riel wrote:
> >
> > Pre-loading your cache always depends on some limited portion of
> > prescience.
>
> OTOH, aggressively pre-loading metadata should be ok in a lot
> of cases, because metadata is very small, but wastes about
> half of our disk time because of the seeks ...

I actually agree to some degree here. The main reason I agree is that
meta-data often is (a) known to be physically contiguous (so pre-fetching
is easy and cheap on most hardware) and (b) meta-data _is_ small, so you
don't have to prefetch very much (so pre-fetching is not all that
expensive).

Trying to read two or more pages of inode data whenever we fetch an inode
might not be a bad idea, for example. Either we fetch a _lot_ of inodes
(in which case the prefetching is very likely to get a hit anyway), or we
don't (in which case the prefetching is unlikely to hurt all that much
either). You don't easily get into a situation where you prefetch a lot
without gaining anything.

We might do other kinds of opportunistic pre-fetching, like have "readdir"
start prefetching for the inode data it finds. That might be a win for many
loads (and it might be a loss too - there _are_ loads that really only
care about the filename, although I suspect they are fairly rare).
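
A user-space approximation of that readdir-driven prefetch, just to
show the access pattern - a real implementation would kick off the
inode reads asynchronously from inside readdir itself:

/*
 * Stat every directory entry once so the inode blocks land in the
 * cache before the real work touches them.
 */
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : ".";
	char path[4096];
	struct dirent *de;
	struct stat st;
	unsigned long n = 0;
	DIR *d = opendir(dir);

	if (!d)
		return 1;
	while ((de = readdir(d)) != NULL) {
		snprintf(path, sizeof path, "%s/%s", dir, de->d_name);
		if (stat(path, &st) == 0)	/* pulls in the inode block */
			n++;
	}
	closedir(d);
	printf("prefetched %lu inodes under %s\n", n, dir);
	return 0;
}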

		Linus



Date: 	Tue, 11 Sep 2001 08:39:44 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: linux-2.4.10-pre5
Newsgroups: fa.linux.kernel

On Tue, 11 Sep 2001, Daniel Phillips wrote:
> >
> > In short, you'll end up doing a memcpy() pretty much every single time you
> > hit.
>
> And *only* when we hit.  Even if we don't optimize away the memcpy, it's a
> win, so long as we get enough hits to make up for the cost of any wasted
> readaheads.  Any time physical readahead correctly hits a block of metadata
> then chances are good we've eliminated a synchronous stall.

Ehh..

Your argument seems to be "it won't cost us anything if we don't hit".

But hey, if we don't hit, then it _will_ cost us - that implies that the
read-ahead was _completely_ wasted. That's the worst case, not the "good"
case. You've just wasted CPU cycles in setting up the bogus IO, and disk
and bus cycles in executing it.

> >  - we're invalidating buffer heads that we find to be aliasing the virtual
> >    page we create, which is _also_ not at all the same thing, it's simply
> >    an issue of making sure that we haven't had the (unlikely) case of a
> >    meta-data block being free'd, but not yet written back.
>
> Aha.  So we are going to do the buffer cache hash probe anyway, the thing I
> call the reverse lookup.  Now I'm just suggesting we drop the other shoe
> and chain all the page cache blocks to the buffer hash.  The only extra cost
> will be the buffer hash insert and delete, and in return we get complete
> coherency.

"complete"? No. The coherency is very much one-way: we'd better not have
anything that actually dirties the buffer cache, because that dirtying
will _not_ be seen by a virtual cache.

Note that this is not anything new, though.

> > The above is assuming that the disk doesn't already have the data in its
> > buffers. In which case the only thing we're doing is making the IO command
> > and the DMA that filled the page happen earlier.
> >
> > Which can be good for latency, but considering that the read-ahead is at
> > least as likely to be _bad_ for latency, I don't believe in that argument
> > very much. Especially not when you've dirtied the cache and introduced an
> > extra memcpy.
>
> Wait, does dma dirty the cache?  I'd hope not.

It can do so (many architectures like the ARM actually do DMA in
software), but no, I wasn't talking about the DMA, but about the memcpy.

> > Well, you actually have to be very very careful: if you do that there is
> > just a _ton_ of races there (you'd better be _really_ sure that nobody
> > else has already found either of the pages, the "move page" operation is
> > not exactly completely painless).
>
> It doesn't look that bad.  The buffer hash link doesn't change so we don't
> need the hash_table_lock.  We basically do an add_to_page_cache less the
> lru_cache_add and flags initialization.

Wrong.

You need the hash_table_lock for _another_ reason: you need to make really
sure that nobody is traversing the hash table right then - because you can
NOT afford to have somebody else find one of the pages or buffers while
you're doing the operation (you have to test for all the counts being
zero, and you have to do that test when you can guarantee that nobody else
suddenly finds it).

> > > An observation: logical readahead can *never* read a block before it knows
> > > what the physical mapping is, whereas physical readahead can.
> >
> > Sure. But the meta-data is usually on the order of 1% or less of the data,
> > which means that you tend to need to read a meta-data block only 1% of the
> > time you need to read a real data block.
> >
> > Which makes _that_ optimization not all that useful.
>
> Oh no, the metadata blocks have a far greater impact than that: they are
> serializers in the sense that you have to read the metadata before reading
> the data blocks.

Ehh.. You _have_ to read them anyway before you can use your data.

Your argument is fundamentally flawed: remember that you cannot actually
_use_ the data you read ahead before you actually have the linear mapping.
And you have to get the meta-data information before you can _create_ the
linear mapping.

Doing physical read-ahead does not mean you can avoid reading meta-data,
and if you claim that as a "win", you're just lying to yourself.

Doing physical read-ahead only means that you can read-ahead _before_
reading meta-data: it does not remove the need for meta-data, it only
potentially removes an ordering dependency.

But it potentially removes that ordering dependency through speculation
(only if the speculation ends up being correct, of course), and
speculation has a cost. This is nothing new - people have been doing
things like this at other levels for a long time (doing things like data
speculation inside CPU's, for example).

Basically, you do not decrease IO - you only try to potentially make it
more parallel. Which can be a win.

But it can be a big loss too. Speculation always ends up depending on the
fact that you have more "throughput" than you actually take advantage of.
By implication, you can also clearly see that speculation is bound to be
provably _bad_ in any case where you can already saturate your resources.

> > So I'm claiming that in order to get any useful performance improvements,
> > your physical read-ahead has to be _clearly_ better than the logical one.
> > I doubt it is.
>
> Even marginally better will satisfy me, because what I'm really after is the
> coherency across buffer and page cache.  Also, I'm not presenting it as an
> either/or.  I see physical readahead as complementary to logical readahead.

Marginally better is not good enough if other loads are marginally worse.

And I will bet that you _will_ see marginally worse numbers. Which is why
I think you want "clearly better" just to offset the marginally worse
numbers.

> > Basically, the things where physical read-ahead might win:
> >  - it can be done without metadata (< 1%)
>
> See above.  It's not the actual size of the metadata that matters, it's
> the way it serializes access to the data blocks.

See above. It always will. There is NO WAY you can actually return the
file data to the user before having read the metadata. QED. You ONLY
remove ordering constraints, nothing more.

> > Can you see any others?
>
> Yes:
>
>   - Lots of ide drives don't have any cache at all.  It's not in question
>     that *some* physical readahead is good, right?

Every single IDE drive I know of has cache. Some of them only have a single
buffer, but quite frankly, I've not seen a new IDE drive in the last few
years with less than 1MB of cache. And that's not a "single 1MB buffer".

And this trend hasn't been going backwards. Disks have _always_ been
getting more intelligent, rather than less.

In short, you're betting against history here. And against technology
getting better. It's not a bet that I would ever do..

>   - The disk firmware ought to be smart enough to evict already-read
>     blocks from its cache early, in which case our readahead
>     effectively frees up space in its cache.

With 2MB of disk cache (which is what all the IDE disks _I_ have access to
have), the disk will have more of a read-ahead buffer than you'd
reasonably do in software. How much did you imagine you'd read ahead
physically? Megabytes? I don't think so..

		Linus



Date: 	Tue, 11 Sep 2001 08:48:24 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: linux-2.4.10-pre5
Newsgroups: fa.linux.kernel

On Tue, 11 Sep 2001, Daniel Phillips wrote:
>
> But see my post in this thread where I created a simple test to show that,
> even when we pre-read *all* the inodes in a directory, there is no great
> performance win.

Note that I suspect that because the inode tree _is_ fairly dense, you
don't actually need to do much read-ahead in most cases. Simply because
you automatically do read-ahead _always_: when somebody reads a 128-byte
inode, you (whether you like it or not) always "read-ahead" the 31 inodes
around it on a 4kB filesystem.

So we _already_ do read-ahead by a "factor of 31". Whether we can improve
that or not by increasing it to 63 inodes, who knows?

I actually think that the "start read-ahead for inode blocks when you do
readdir" might be a bigger win, because that would be a _new_ kind of
read-ahead that we haven't done before, and might improve performance for
things like "ls -l" in the cold-cache situation..

(Although again, because the inode is relatively small compared to the IO cache
size, it's probably fairly _hard_ to get a fully cold-cache inode case. So
I'm not sure even that kind of read-ahead would actually make any
difference at all).

		Linus


