From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.29
Date: Mon, 30 Mar 2009 16:25:43 UTC
Message-ID: <fa.mBA/v2++EN/2U6g/Aucnot0KOzI@ifi.uio.no>

On Mon, 30 Mar 2009, Ric Wheeler wrote:
>
> A modern S-ATA drive has up to 32MB of write cache. If you lose power or
> suffer a sudden reboot (that can reset the bus at least), I am pretty sure
> that your above assumption is simply not true.

At least traditionally, it's worth noting that 32MB of on-disk cache is
not the same as 32MB of kernel write cache.

The drive caches tend to be more like track caches - you tend to have a
few large cache entries (segments), not something like a sector cache. And
I seriously doubt the disk will let you fill them up with writes: it
likely has things like the sector remapping tables in those caches too.

It's hard to find information about the cache organization of modern
drives, but at least a few years ago, some of them literally had just a
single segment, or just a few segments (ie an "8MB cache" might be eight
segments of one megabyte each).

The reason that matters is that those disks are very good at linear
throughput.

The latency for writing out eight big segments is likely not really
noticeably different from the latency of writing out eight single sectors
spread out across the disk - they both do eight operations, and the
difference between an op that writes a big chunk of a track and writing a
single sector isn't necessarily all that noticeable.

So if you have an 8MB drive cache, it's very likely that the drive can
flush its cache in just a few seeks, and we're still talking milliseconds.
In contrast, even just 8MB of OS caches could have _hundreds_ of seeks and
take several seconds to write out.
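
To put rough numbers on that comparison, here is a minimal Python sketch.
The per-operation seek cost, the linear write rate, and the count of 500
scattered writes are all assumptions picked for illustration, not figures
from any particular drive.

```python
# Back-of-envelope model. Every constant here is an assumed, illustrative
# value, not a measurement of any real drive.
SEEK_MS = 10.0            # assumed cost of one seek plus rotational delay
THROUGHPUT_MB_S = 80.0    # assumed sustained linear write rate

def flush_time_ms(total_mb, operations):
    """Estimate the time to write total_mb spread over `operations` seeks."""
    transfer_ms = total_mb * 1000.0 / THROUGHPUT_MB_S
    return operations * SEEK_MS + transfer_ms

# An 8MB drive cache held as eight 1MB segments: eight operations.
drive_cache_ms = flush_time_ms(8, 8)
# The same 8MB of OS cache scattered as 500 small writes (hypothetical count).
os_cache_ms = flush_time_ms(8, 500)

print(f"drive cache: {drive_cache_ms:.0f} ms")
print(f"OS cache:    {os_cache_ms:.0f} ms")
```

Under these assumptions the drive cache drains in about 180 ms while the
OS cache takes about 5.1 seconds: the seek count, not the byte count,
dominates.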

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.29
Date: Mon, 30 Mar 2009 17:08:30 UTC
Message-ID: <fa.otXpEypVgvyajiIOU18YtIVqWOo@ifi.uio.no>

On Mon, 30 Mar 2009, Mark Lord wrote:
>
> I spent an entire day recently, trying to see if I could significantly fill
> up the 32MB cache on a 750GB Hitachi SATA drive here.
>
> With deliberate/random write patterns, big and small, near and far,
> I could not fill the drive with anything approaching a full second
> of latent write-cache flush time.
>
> Not even close.  Which is a pity, because I really wanted to do some testing
> related to a deep write cache.  But it just wouldn't happen.
>
> I tried this again on a 16MB cache of a Seagate drive, no difference.
>
> Bummer.  :)

Try it with laptop drives. You might get to a second, or at least hundreds
of ms (not counting the spinup delay if it went to sleep, obviously). You
probably tested desktop drives (that 750GB Hitachi one is not a low end
one, and I assume the Seagate one isn't either).

You'll have a much easier time getting long latencies when seeks take tens
of ms, and the platter rotates at some pitiful 3600rpm (ok, I guess those
drives are hard to find these days - I guess 4200rpm is the norm even for
1.8" laptop hard drives).
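
The rotational part of that latency follows directly from the spindle
speed: on average the head waits half a revolution for the target sector.
A small Python sketch of the arithmetic (the function name is just for
illustration, and seek time comes on top of this):

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for a sector to reach the head: half of one revolution."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0

for rpm in (3600, 4200, 5400, 7200):
    print(f"{rpm} rpm: {avg_rotational_latency_ms(rpm):.2f} ms")
```

That works out to roughly 8.3 ms at 3600rpm and 7.1 ms at 4200rpm,
against about 4.2 ms at 7200rpm, so every scattered write on a slow
laptop drive costs close to twice the rotational delay before the seek
is even counted.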

And also - this is probably obvious to you, but it might not be
immediately obvious to everybody - make sure that you do have TCQ going,
and at full depth. If the drive supports TCQ (and they all do, these days)
it is quite possible that the drive firmware basically limits the write
caching to one segment per TCQ entry (or at least to something smallish).

Why? Because that really simplifies some of the problem space for the
firmware a _lot_ - if you have at least as many segments in your cache as
your max TCQ depth, it means that you always have one segment free to be
re-used without any physical IO when a new command comes in.

And if I were a disk firmware engineer, I'd try my damnedest to keep my
problem space simple, so I would do exactly that kind of "limit the number
of dirty cache segments by the queue size" thing.

But I dunno. You may not want to touch those slow laptop drives with a
ten-foot pole. It's certainly not my favorite pastime.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.29
Date: Mon, 30 Mar 2009 15:44:17 UTC
Message-ID: <fa.VIEM1MVYV9DFAxchUa+Nxg7uoZ0@ifi.uio.no>

On Mon, 30 Mar 2009, Ric Wheeler wrote:
>
> People keep forgetting that storage (even on your commodity s-ata class of
> drives) has very large & volatile cache. The disk firmware can hold writes in
> that cache as long as it wants, reorder its writes into anything that makes
> sense and has no explicit ordering promises.

Well, when it comes to disk caches, it really does make sense to start
looking at what breaks.

For example, it is obviously true that any half-way modern disk has
megabytes of caches, and write caching is quite often enabled by default.

BUT!

The write-caches on disk are rather different in many very fundamental
ways from the kernel write caches.

One of the differences is that no disk I've ever heard of does write-caching
for long periods, unless it has battery back-up. Yes, yes, you can
probably find firmware that has some odd starvation issue, and if the disk
is constantly busy and the access patterns are _just_ right the writes can
take a long time, but realistically we're talking delaying and re-ordering
things by milliseconds. We're not talking seconds or tens of seconds.

And that's really quite a _big_ difference in itself. It may not be
qualitatively all that different (re-ordering is re-ordering, delays are
delays), but IN PRACTICE there's an absolutely huge difference between
delaying and re-ordering writes over milliseconds and doing so over 30s.

The other (huge) difference is that the on-disk write caching generally
fails only if the drive power fails. Yes, there's a software component to
it (buggy firmware), but you can really approximate the whole "disk write
caches didn't get flushed" with "powerfail".

Kernel data caches? Let's be honest. The kernel can fail for a thousand
different reasons, including very much _any_ component failing, rather
than just the power supply. But also obviously including bugs.

So when people bring up on-disk caching, it really is a totally different
thing from the kernel delaying writes.

So it's entirely reasonable to say "leave the disk doing write caching,
and don't force flushing", while still saying "the kernel should order the
writes it does".

Thinking that this is somehow a black-and-white issue where "ordered
writes" always has to imply "cache flush commands" is simply wrong. It is
_not_ that black-and-white, and it should probably not even be a
filesystem decision to make (it's a "system" decision).

This, btw, is doubly true simply because if the disk really fails, it's
entirely possible that it fails in a really nasty way. As in "not only did
it not write the sector, but the whole track is now totally unreadable
because power failed while the write head was active".

Because that notion of "power" is not a digital thing - you have
capacitors, brown-outs, and generally nasty "oops, for a few milliseconds
the drive still had power, but it was way out of spec, and odd things
happened".

So quite frankly, if you start worrying about disk power failures, you
should also then worry about the disk failing in _way_ more spectacular
ways than just the simple "wrote or wrote not - that is the question".

And when was the last time you saw a "safe" logging filesystem that was
safe in the face of the log returning IO errors after power comes back on?

Sure, RAID is one answer. Except not so much in 99% of all desktops or
especially laptops.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.29
Date: Mon, 30 Mar 2009 16:43:44 UTC
Message-ID: <fa.XWAk12W+YlGYeYupaKRDwBHn8JQ@ifi.uio.no>

On Mon, 30 Mar 2009, Ric Wheeler wrote:
>
> I still disagree strongly with the don't force flush idea - we have an
> absolute and critical need to have ordered writes that will survive a power
> failure for any file system that is built on transactions (or database).

Read that sentence of yours again.

In particular, read the "we" part, and ponder.

YOU have that absolute and critical need.

Others? Likely not so much. The reason people run "data=ordered" on their
laptops is not just because it's the default - rather, it's the default
_because_ it's the one that avoids most obvious problems. And for 99% of
all people, that's what they want.

And as mentioned, if you have to have absolute requirements, you
absolutely MUST be using real RAID with real protection (not just RAID0).

Not "should". MUST. If you don't do redundancy, your disk _will_
eventually eat your data. Not because the OS wrote in the wrong order, or
the disk cached writes, but simply because bad things do happen.

But turn that around, and say: if you don't have redundant disks, then
pretty much by definition those drive flushes won't be guaranteeing your
data _anyway_, so why pay the price?

> The big issues are that for s-ata drives, our flush mechanism is really,
> really primitive and brutal. We could/should try to validate a better and less
> onerous mechanism (with ordering tags? experimental flush ranges? etc).

That's one of the issues. The cost of those flushes can be really quite
high, and as mentioned, in the absence of redundancy you don't actually
get the guarantees that you seem to think that you get.

> I spent a very long time looking at huge numbers of installed systems
> (millions of file systems deployed in the field), including taking part in
> weekly analysis of why things failed, whether the rates of failure went up or
> down with a given configuration, etc. so I can fully appreciate all of the
> ways drives (or SSDs!) can magically eat your data.

Well, I can go mainly by my own anecdotal evidence, and so far I've
actually had more catastrophic data failure from failed drives than
anything else. OS crashes in the middle of a "yum update"? Yup, been
there, done that, it was really painful. But it was painful in a "damn, I
need to force a re-install of a couple of rpms".

Actual failed drives that got read errors? I seem to average almost one a
year. It's been overheating laptops, and it's been power outages that
apparently happened at really bad times. I have a UPS now.

> What you have to keep in mind is the order of magnitude of various buckets of
> failures - software crashes/code bugs tend to dominate, followed by drive
> failures, followed by power supplies, etc.

Sure. And those "write flushes" really only cover a rather small
percentage. For many setups, the other corruption issues (drive failure)
are not just more common, but generally more disastrous anyway. So why
would a person like that worry about the (rare) power failure?

> I have personally seen a huge reduction in the "software" rate of failures
> when you get the write barriers (forced write cache flushing) working properly
> with a very large installed base, tested over many years :-)

The software rate of failures should only care about the software write
barriers (ie the ones that order the OS elevator - NOT the ones that
actually tell the disk to flush itself).

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.29
Date: Mon, 30 Mar 2009 18:01:33 UTC
Message-ID: <fa.3cUIrl5yo7ID2JgjBfZLo56GKWA@ifi.uio.no>

On Mon, 30 Mar 2009, Ric Wheeler wrote:
> >
> > But turn that around, and say: if you don't have redundant disks, then
> > pretty much by definition those drive flushes won't be guaranteeing your
> > data _anyway_, so why pay the price?
>
> They do in fact provide that promise for the extremely common case of power
> outage and as such, can be used to build reliable storage if you need to.

No, they really effectively don't. Not if the end result is "oops, the
whole track is now unreadable" (regardless of whether it happened due to a
write during power-out or during some entirely unrelated disk error). Your
"flush" didn't result in a stable filesystem at all, it just resulted in a
dead one.

That's my point. Disks simply aren't that reliable. Anything you do with
flushing and ordering won't make them magically not have errors any more.

> Heat is a major killer of spinning drives (as is severe cold). A lot of times,
> drives that have read errors only (not failed writes) might be fully
> recoverable if you can re-write that injured sector.

It's not worked for me, and yes, I've tried. Maybe I've been unlucky, but
every single case I can remember of having read failures, that drive has
been dead. Trying to re-write just the sectors with the error (and around
it) didn't do squat, and rewriting the whole disk didn't work either.

I'm sure it works for some "ok, the write just failed to take, and the CRC
was bad" case, but that's apparently not what I've had. I suspect either
the track markers got overwritten (and maybe a disk-specific low-level
reformat would have helped, but at that point I was not going to trust the
drive anyway, so I didn't care), or there was actual major physical damage
due to heat and/or head crash and remapping was just not able to cope.

> > Sure. And those "write flushes" really only cover a rather small percentage.
> > For many setups, the other corruption issues (drive failure) are not just
> > more common, but generally more disastrous anyway. So why would a person
> > like that worry about the (rare) power failure?
>
> This is simply not a true statement from what I have seen personally.

You yourself said that software errors were your biggest issue. The write
flush wouldn't matter for those (but the elevator barrier would).

> The elevator does not issue write barriers on its own - those write barriers
> are sent down by the file systems for transaction commits.

Right. But "elevator write barrier" vs "sending a drive flush command" are
two totally independent issues. You can do one without the other (although
doing a drive flush command without the write barrier is admittedly kind
of pointless ;^)

And my point is, IT MAKES SENSE to just do the elevator barrier, _without_
the drive command. If you worry much more about software (or non-disk
component) failure than about power failures, you're better off just doing
the software-level synchronization, and leaving the hardware alone.
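
As a toy illustration of what "just the elevator barrier" means, here is
a minimal Python sketch. It is not the kernel's block-layer code: the
BARRIER marker and dispatch_order() are hypothetical names, and requests
are reduced to bare sector numbers. The elevator is free to sort writes
by sector within a batch, but never moves a write across a barrier, and
no cache flush command is ever sent to the drive.

```python
BARRIER = object()  # hypothetical marker a filesystem queues at a commit

def dispatch_order(requests):
    """Sort writes by sector inside each batch, never across a BARRIER."""
    ordered, batch = [], []
    for req in requests:
        if req is BARRIER:
            # Everything queued before the barrier goes out first.
            ordered.extend(sorted(batch))
            batch = []
        else:
            batch.append(req)
    ordered.extend(sorted(batch))
    return ordered

queue = [900, 20, 500, BARRIER, 10, 700]
print(dispatch_order(queue))  # [20, 500, 900, 10, 700]
```

Sector 10 would sort first without the barrier; with it, the three
earlier writes are always handed to the drive before it. What the drive
then does inside its own cache is a separate question, and that is the
part a flush command would address.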

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.29
Date: Mon, 30 Mar 2009 20:16:33 UTC
Message-ID: <fa.plY8L5uelhpV+1TsWkuv8ffnAWI@ifi.uio.no>

On Mon, 30 Mar 2009, Rik van Riel wrote:
>
> Maybe a stupid question, but aren't tracks so small compared to
> the disk head that a physical head crash would take out multiple
> tracks at once?  (the last one I experienced here took out a major
> part of the disk)

Probably. My experiences (not _that_ many drives, but more than one) have
certainly been that I've never seen just a _single_ read error on its own.

> Another case I have seen years ago was me writing data to a disk
> while it was still cold (I brought it home, plugged it in and
> started using it).  Once the drive came up to temperature, it
> could no longer read the tracks it just wrote - maybe the disk
> expanded by more than it is willing to seek around for tracks
> due to thermal correction?   Low level formatting the drive
> made it work perfectly and I kept using it until it was just
> too small to be useful :)

I've had one drive that just stopped spinning. On power-on, it would make
these pitiful noises trying to get the platters to move, but not actually
ever work. If I recall correctly, I got the data off it by letting it just
cool down, then powering up (successfully) and transferring all the data
I cared about off the disk. And then replacing the disk.

> > And my point is, IT MAKES SENSE to just do the elevator barrier, _without_
> > the drive command.
>
> No argument there.  I have seen NCQ starvation on SATA disks,
> with some requests sitting in the drive for seconds, while
> the drive was busy handling hundreds of requests/second
> elsewhere...

I _thought_ we stopped feeding new requests while the flush was active, so
if you actually do a flush, that should never actually happen. But I
didn't check.

			Linus
