Date: 	Thu, 26 Apr 2001 13:08:25 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.
Newsgroups: fa.linux.kernel

On Thu, 26 Apr 2001, Alexander Viro wrote:
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> >
> > how can the read in progress see a branch that we didn't splice yet? We
>
> fd = open("/dev/hda1", O_RDONLY);
> read(fd, buf, sizeof(buf));
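
For concreteness, here is the quoted fragment as a complete program (the
device path is illustrative - any mounted block device shows the same
access pattern):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];

		/* Open the raw block device even though it is mounted -
		   nothing stops us from doing this. */
		int fd = open("/dev/hda1", O_RDONLY);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* This read goes through the buffer cache, behind the
		   back of the mounted filesystem. */
		if (read(fd, buf, sizeof(buf)) < 0)
			perror("read");

		close(fd);
		return 0;
	}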

Note that I think all these arguments are fairly bogus.  Doing things like
"dump" on a live filesystem is stupid and dangerous (in my opinion it is
stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
discussion in itself), and there really are no valid uses for opening a
block device that is already mounted. More importantly, I don't think
anybody actually does.

The fact that you _can_ do so makes the patch valid, and I do agree with
Al on the "least surprise" issue. I've already applied the patch, in fact.
But the fact is that nobody should ever do the thing that could cause
problems.

		Linus



Date: 	Fri, 27 Apr 2001 09:52:19 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.
Newsgroups: fa.linux.kernel

On Fri, 27 Apr 2001, Vojtech Pavlik wrote:
> 
> Actually this is done quite often, even on mounted fs's:
> 
> hdparm -t /dev/hda

Note that this one happens to be ok.

The buffer cache is "virtual" in the sense that /dev/hda is a completely
separate name-space from /dev/hda1, even if there is some physical
overlap.
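
To make the separate-namespace point concrete, here is a simplified
sketch of the buffer-cache key (the real indexing, described further
down this thread, is by <dev,blocknr,blocksize>; the device numbers and
offsets below are illustrative):

	#include <stdio.h>

	/* Simplified stand-in for the buffer-cache hash key. */
	struct buffer_key { int dev; long blocknr; int blocksize; };

	int main(void)
	{
		enum { DEV_HDA = 0x0300, DEV_HDA1 = 0x0301 };
		long part_start = 63;	/* first block of hda1 within hda */
		long n = 10;		/* some block inside the partition */

		/* The same physical sector gets two unrelated keys: */
		struct buffer_key via_disk = { DEV_HDA, part_start + n, 1024 };
		struct buffer_key via_part = { DEV_HDA1, n, 1024 };

		printf("via disk: <%#x,%ld>  via partition: <%#x,%ld>\n",
		       via_disk.dev, via_disk.blocknr,
		       via_part.dev, via_part.blocknr);
		return 0;
	}

A lookup under one key can never return a buffer cached under the other,
which is why "hdparm -t /dev/hda" only ever touches the whole-disk
namespace.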

		Linus



Date: 	Fri, 27 Apr 2001 09:59:46 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.
Newsgroups: fa.linux.kernel

[ linux-kernel added back as a cc ]

On Fri, 27 Apr 2001, Neil Conway wrote:
> 
> I'm surprised that dump is deprecated (by you at least ;-)).  What to
> use instead for backups on machines that can't umount disks regularly? 

Note that dump simply won't work reliably at all even in 2.4.x: the buffer
cache and the page cache (where all the actual data is) are not
coherent. This is only going to get even worse in 2.5.x, when the
directories are moved into the page cache as well.

So anybody who depends on "dump" getting backups right is already playing
Russian roulette with their backups.  It's not at all guaranteed to get the
right results - you may end up having stale data in the buffer cache that
ends up being "backed up".

Dump was a stupid program in the first place. Leave it behind.

> I've always thought "tar" was a bit undesirable (updates atimes or
> ctimes for example).

Right now, the cpio/tar/xxx solutions are definitely the best ones, and
will work on multiple filesystems (another limitation of "dump"). Whatever
problems they have, they are still better than the _guaranteed_(*)  data
corruptions of "dump".
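
(For what it's worth, a typical invocation would be something like
"tar --one-file-system -cf /backup/root.tar /", with --one-file-system
keeping tar from silently crossing mount points - the paths and archive
name here are only illustrative.)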

However, it may be that in the long run it would be advantageous to have a
"filesystem maintenance interface" for doing things like backups and
defragmentation..

		Linus

(*) Dump may work fine for you a thousand times. But it _will_ fail under
the right circumstances. And there is nothing you can do about it.



Date: 	Thu, 3 May 2001 10:21:04 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.
Newsgroups: fa.linux.kernel

On Thu, 3 May 2001, Alan Cox wrote:
>
> > > discussion in itself), and there really are no valid uses for opening a
> > > block device that is already mounted. More importantly, I don't think
> > > anybody actually does.
> > 
> > Actually I did. I might do it again :) The point was to get the kernel to
> > cache certain blocks in RAM.
> 
> Ditto for some CD based stuff. You burn the important binaries to the front
> of the CD, then at boot dd 64Mb to /dev/null to prime the libraries and
> avoid a lot of seeking during boot up from the CD-ROM.
> 
> However I could do that from an initrd before mounting

Ehh. Doing that would be extremely stupid, and would slow down your boot
and nothing more.

The page cache is _not_ coherent with the buffer cache. For any filesystem
that uses the page cache for data caching (which pretty much all of them
do, because it's the only way to get sane mmap semantics, and it's a lot
faster than the old buffer cache ever was), the above will do _nothing_
but spend time doing IO that the page cache will just end up doing again.

Currently it can help to pre-load the meta-data, but quite frankly, even
that is suspect, and won't work in 2.5.x when Al's metadata page-cache
stuff is merged (at least directories, and likely inodes too).

In short, don't do it. It doesn't work reliably (and hasn't since 2.0.x),
and it will only get more and more unreliable.

		Linus



Date: 	Fri, 4 May 2001 10:28:10 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.
Newsgroups: fa.linux.kernel

On Fri, 4 May 2001, Rogier Wolff wrote:
>
> Linus Torvalds wrote:
> > 
> > Ehh. Doing that would be extremely stupid, and would slow down your boot
> > and nothing more.
> 
> Ehhh, Linus, linearly reading my hard disk goes at 26MB per second.

You obviously didn't read my explanation of _why_ it is stupid.

> By analyzing my boot process I determine that 50M of my disk is used
> during boot. I can then reshuffle my disk to have that 50M of data at
> the beginning and reading all that into 50M of cache, I can save
> thousands of 10ms seeks.

No. Have you _tried_ this?

What the above would do is to move 50M of the disk into the buffer cache.

Then, a second later, when the boot proceeds, Linux would start filling
the page cache.

BY READING THE CONTENTS FROM DISK AGAIN!

In short, by doing a "dd" from the disk, you would _not_ help anything at
all. You would only make things slower, by reading things twice.

The Linux buffer cache and page cache are two separate entities. They are
not synchronized, and they are indexed through totally different
means. The page cache is virtually indexed by <inode,pagenr>, while the
buffer cache is indexed by <dev,blocknr,blocksize>. 
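
A sketch of the mismatch (the types are simplified - in the kernel these
are struct buffer_head and struct page - but the shape of the keys is
the point):

	/* Physical indexing: what "dd if=/dev/hda1" populates. */
	struct buffer_key { int dev; long blocknr; int blocksize; };

	/* Virtual indexing: what a later read() of a file looks up. */
	struct page_key { const void *inode; long pagenr; };

	/* There is no mapping from one key space onto the other, so
	   the page-cache lookup misses and the data gets read from
	   the disk a second time. */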

> Is this simply: Don't try this then? 

Try it. You will see. 

You _can_ actually try to optimize certain things with 2.4.x: all
meta-data is still in the buffer cache in 2.4.x, so what you could do is
to lay out the image so that the metadata is at the front of the disk,
and do the "dd" to cache just the metadata. Even then you need to be
careful, and make sure that the "dd" uses the same block size as the
filesystem will use.
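
(Concretely: for a filesystem using 1k blocks, something like
"dd if=/dev/hda1 of=/dev/null bs=1k count=65536" - the sizes are
illustrative - so that the buffers dd creates carry the same
<dev,blocknr,blocksize> keys the filesystem will later look up. With a
mismatched bs, the data lands in the cache under the wrong blocksize
and is useless.)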

And even that will largely stop working very early in 2.5.x when the
directory contents and possibly inode and bitmap metadata moves into the
page cache.

Now, you may ask "why use the page cache at all then"? The answer is that
the page cache is a _lot_ faster to look up, exactly because of the
virtual indexing, and also because the data structure is much better
designed: fixed-size entities with none of the complexities of the buffer
cache. The buffer cache needs to be able to do IO, while the page cache is
_only_ a cache and does that one thing really well; doing IO is a
completely separate issue with the page cache.

Now, if you want to speed up accesses, there are things you can do. You
can lay out the filesystem in the access order - trace the IO accesses at
bootup ("which file, which offset, which metadata block?") and lay out the
blocks of the files in exactly the right order. Then you will get linear
reads _without_ doing any "dd" at all.

Now, laying out the filesystem that way is _hard_. No question about it.
It's kind of equivalent to doing a filesystem "defragment" operation,
except you use a different sorting function (instead of sorting blocks
linearly within each file, you sort according to access order).
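
The two sort orders might look like this (a sketch - the per-block
record and the boot trace it sorts are hypothetical):

	#include <stdlib.h>

	struct blk {
		long inode, offset;	/* logical position within a file */
		long boot_seq;		/* when bootup first touched it */
	};

	/* Normal defragmentation: sort by position within each file. */
	static int by_file(const void *a, const void *b)
	{
		const struct blk *x = a, *y = b;
		if (x->inode != y->inode)
			return x->inode < y->inode ? -1 : 1;
		return (x->offset > y->offset) - (x->offset < y->offset);
	}

	/* Access-order layout: sort by first use during boot. */
	static int by_boot(const void *a, const void *b)
	{
		const struct blk *x = a, *y = b;
		return (x->boot_seq > y->boot_seq) - (x->boot_seq < y->boot_seq);
	}

	/* Same machinery, different comparator:
	   qsort(blocks, nblocks, sizeof(*blocks), by_boot); */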

		Linus



Date: 	Fri, 4 May 2001 10:40:27 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.
Newsgroups: fa.linux.kernel

On Fri, 4 May 2001, Andrea Arcangeli wrote:

> On Fri, May 04, 2001 at 01:56:14PM +0200, Jens Axboe wrote:
> > Or you can rewrite block_read/write to use the page cache, in which case
> > you'd have more luck doing the above.
> 
> once block_dev is in pagecache there will obviously be no-way to share
> cache between the block device and the filesystem, because all the
> caches will be in completely different address spaces.

They already pretty much are.

I do want to re-write block_read/write to use the page cache, but not
because it would impact anything in this discussion. I want to do it early
in 2.5.x, because:

 - it will speed up accesses
 - it will re-use existing code better and conceptualize things more
   cleanly (ie it would turn a disk into a _really_ simple filesystem with
   just one big file ;).
 - it will make MM handling much better for things like fsck - the memory
   pressure is designed to work on page cache things.
 - it will be one less thing that uses the buffer cache as a "cache" (I
   want people to think of, and use, the buffer cache as an _IO_ entity,
   not a cache).

It will not make the "cache at bootup" thing change at all (because even
in the page cache, there is no commonality between a virtual mapping of a
_file_ (or metadata) and a virtual mapping of a _disk_). 

It would have hidden the problem with "dd" or "dump" touching buffer cache
blocks that the filesystem was using, so the original metadata corruption
that started this thread would not happen. But that's not a design issue
or a design goal, that would just have been a random result.

		Linus



Date: 	Fri, 4 May 2001 10:55:58 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.
Newsgroups: fa.linux.kernel

On Fri, 4 May 2001, Alexander Viro wrote:
> 
> Ehh... There _is_ a way to deal with that, but it's deeply Albertesque:
> 	* add pagecache access for block device
> 	* put your "real" root on /dev/loop0 (setup from initrd)
> 	* dd

You're one sick puppy.

Now, the above is basically equivalent to using and populating a
dynamically sized ramdisk.

If you really want to go this way, I'd much rather see you using a real
ram-disk (that you populate at startup with something like a compressed
tar-file). THAT is definitely going to speed up booting - thanks to
compression you'll not only get linear reads, but you will get fewer reads
than the amount of data you need would imply.
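
Such a startup might look roughly like this from an initrd (the device
and image names are illustrative):

	mke2fs -q /dev/ram0
	mount /dev/ram0 /mnt
	tar -xzf /boot-image.tar.gz -C /mnt

The decompression is where the "fewer reads" come from: the disk only
has to deliver the compressed bytes, and it delivers them sequentially.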

Couple that with tmpfs, or possibly something like coda (to dynamically
move things between the ramdisk and the "backing store" filesystem), and
you can get a ramdisk approach that actually shrinks (and, in the case of
coda or whatever, truly grows) dynamically.

Think of it as an exercise in multi-level filesystems and filesystem
management. Others have done it before (usually between disk and tape, or
disk and network), and in these days of ever-growing memory it might just
make sense to do it on that level too.

(No, I don't seriously think it makes sense today. But if RAM keeps
growing and becoming ever cheaper, it might some day. At the point where
everybody has multi-gigabyte memories, and doesn't really need it for
anything but caching, you could think of it as just moving the caching to
a higher level - you don't cache blocks, you cache parts of the
filesystem).

> 	Al, feeling sadistic today...

Sadistic you are.

		Linus


