Index Home About Blog
From: Theodore Tso <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] fs: Correct SuS compliance for open of large file without 
	options
Date: Thu, 27 Sep 2007 18:38:47 UTC
Message-ID: <fa.n5KutkVniAwx+S2HFHfsFTzUyyY@ifi.uio.no>

On Thu, Sep 27, 2007 at 10:59:17AM -0700, Greg KH wrote:
> Come on now, I'm _very_ tired of this kind of discussion.  Please go
> read the documentation on how to _use_ sysfs from userspace in such a
> way that you can properly access these data structures so that no
> breakage occurs.

I've read it; the question is whether every single application
programmer or system shell script programmer who writes code my system
depends upon has read it this document buried in the kernel sources,
or whether things will break spectacularly --- one of those things
that leaves me in suspense each time I update the kernel.

I'm reminded of Rusty's 2003 OLS Keynote, where he points out that
what's important is not making an interface easy to use, but _hard_
_to_ _misuse_.  That fact that sysfs is all laid out in a directory,
but for which some directories/symlinks are OK to use, and some are
NOT OK to use --- is why I call the sysfs interface "an open pit".
Sure, if you have the map to the minefield, a minefield is perfectly
safe when you know what to avoid.  But is that the best way to
construct a path/interface for an application programmer to get from
point A to point B?  Maybe, maybe not.

					- Ted


From: Theodore Tso <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] fs: Correct SuS compliance for open of large file without 
	options
Date: Thu, 27 Sep 2007 23:21:00 UTC
Message-ID: <fa.mTIrfHji4gjsTJqiYOVRhFTNjLo@ifi.uio.no>

On Thu, Sep 27, 2007 at 02:34:45PM -0700, Greg KH wrote:
> Ok, how then should I advertise this better?  What can we do better to
> help userspace programmers out in this regard?

Would you accept a patch which causes the deprecated sysfs
files/directories to disappear, even if CONFIG_SYS_DEPRECATED is
defined, via a boot-time parameter?  Many people and distros are
likely to keep CONFIG_SYS_DEPRECATED defined just our of paranoia that
things might break.  Doing a quick google, I note that Fedora has been
going back and forth of turning it off, watching things break, and
then turning it back on.  The latest time, the changelog said:

* Fri Jan 26 23:00:00 2007 Bill Nottingham <notting{%}redhat{*}com>

    - turn on CONFIG_SYSFS_DEPRECATED so that things actually work. *sigh*

(and I've checked, Fedora's CVS still has CONFIG_SYSFS_DEPRECATED
defined; it's not just Debian at fault here.)

So having a boot-time parameter would make it much easier for
application programmers (who run distro kernels and who are unlikely
to want to compile their own custom kernel) to test to see what breaks
without CONFIG_SYS_DEPRECATED.

	   	     	     	      	- Ted


From: Theodore Tso <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] fs: Correct SuS compliance for open of large file without 
	options
Date: Fri, 28 Sep 2007 02:23:24 UTC
Message-ID: <fa.ui1S5HKDUFK64/+IbYZviBPZfCs@ifi.uio.no>

On Thu, Sep 27, 2007 at 05:28:57PM -0600, Matthew Wilcox wrote:
> On Thu, Sep 27, 2007 at 07:19:27PM -0400, Theodore Tso wrote:
> > Would you accept a patch which causes the deprecated sysfs
> > files/directories to disappear, even if CONFIG_SYS_DEPRECATED is
> > defined, via a boot-time parameter?
>
> How about a mount option?  That way people can test without a reboot:
>
> mount -o remount,deprecated={yes,no} /sys

It would be nice if that would be easy to make work, but the problem
is that remounting /sysfs doesn't change the entries in the sysfs tree
that have already been made in the tree.  We could do something such
as creating an sysfs_create_link_deprecated() call which created a
kobject with a new flag indicating it's deprecated, so it could be
filtered out dynamically when /sys is remounted, or when some file
such as /sys/kernel/deprecated_sysfs_files has "0" or "1" written to
it.

The question is whether it's worth it, since we'd have to bloat the
kobject structure by 4 bytes (it currently doesn't have a flags field
from which we could borrow a bit), or whether it's OK just to make the
user reboot.  (I do agree it would be nicer if the user didn't have to
reboot, but most of the time they will need to test the initrd and
init scripts anyway.

							- Ted


From: Theodore Tso <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: Major regression on hackbench with SLUB (more numbers)
Date: Sat, 22 Dec 2007 19:26:37 UTC
Message-ID: <fa.IM/OSr6M7YLpddKT22pbvMabbqE@ifi.uio.no>

On Sat, Dec 22, 2007 at 10:01:37AM -0800, Linus Torvalds wrote:
> Well, I do have to admit that I'm not a huge fan of /proc/slabinfo, and
> the fact that there hasn't been a lot of complaints about it going away
> does seem to imply that it wasn't a very important ABI.
>
> I'm the first to stand up for backwards compatibility, but I also try to
> be realistic, and so far nobody has really complained about the fact that
> /proc/slabinfo went away on any grounds but on the general principle of
> "it was a user-visible ABI".

That's probably because the people who are most likely to be using
/proc/slabinfo tend to people people using kernels in production
environments, and they probably haven't started playing with SLUB yet.
So the fact we haven't heard the yelling doesn't necessarily mean that
people won't notice.

I know of a number of debugging scripts that periodically poll
/proc/slabinfo to find out what is using memory and to track trends
and predict potential problems with memory usage and balance.  Indeed,
I've left several such scripts behind at various customer
installations after debugging memory usage problems, and some system
administrators very much like being able to collect that information
easily.  So I can say for sure there at there are scripts out there
that depend on /proc/slabinfo, but they generally tend to be at sites
where people won't be running bleeding edge kernels.

> That said, I'm not a *huge* fan of /sys/slab/ either. I can get the info I
> as a developer tend to want from there, but it's really rather less
> convenient than I'd like. It is really quite hard, for example, to get any
> kind of "who actually uses the most *memory*" information out of there.

I have a general problem with things in /sys/slab, and that's just
because they are *ugly*.  So yes, you can write ugly shell scripts
like this to get out information:

> You have to use something like this:
>
> 	for i in *
> 	do
> 		order=$(cat "$i/order")
> 		slabs=$(cat "$i/slabs")
> 		object_size=$(cat "$i/object_size")
> 		objects=$(cat "$i/objects")
> 		truesize=$(( $slabs<<($order+12) ))
> 		size=$(( $object_size*$objects ))
> 		percent=$(( $truesize/100 ))
> 		if [ $percent -gt 0 ]; then
> 			percent=$(( $size / $percent ))
> 		fi
> 		mb=$(( $size >> 10 ))
> 		printf "%10d MB %3d %s\n" $mb $percent $i
> 	done | sort -n | tail

But sometimes when trying to eyeball what is going on, it's a lot
nicer just to use "cat /proc/slabinfo".

Another problem with using /sys/slab is that it is downright *ugly*.
Why is it for example, that /sys/slab/dentry is a symlink to
../slab/:a-0000160?  Because of this, Linus's script reports dentry twice:

     10942 MB  86 buffer_head
     10942 MB  86 journal_head
     10943 MB  86 :a-0000056
     21923 MB  86 dentry
     21924 MB  86 :a-0000160
     78437 MB  94 ext3_inode_cache

Once as "dentry" and once as ":a-0000160".  A similar situation
happens with journal_head and ":a-0000056".  I'm sure this could be
filtered out with the right shell or perl script, but now we're
getting even farther from /proc/slabinfo.

Further, if you look at Documentation/sysfs-rules.txt (does anyone
ever read it or bother to update it?), /sys/slab isn't documented so
it falls in the category of "everything else is a kernel-level
implementation detail which is not guaranteed to be stable".  Nice....

Also, looking at sysfs-rules.txt, it talks an awful lot about reading
symlinks, and implying that actually trying to *dereference* the
symlinks might not actually work.  So I'm pretty sure linus's "cd
/sys/slab; for i in *" violates the Documentation/sysfs-rules.txt
guidelines even if /sys/slab were documented.

In general, /sys has some real issues that need some careful scrutiny.
Documentation/sysfs-rules.txt is *not* enough.  For example, various
official websites are telling people that the way to enable or disable
power-saving is:

	echo 5 > /sys/bus/pci/drivers/iwl4965/*/power_level

(Reference: http://www.lesswatts.org/tips/wireless.php)

Aside from the fact that "iwconfig wlan0 power on" is easier to type
and simpler to undrestand, I'm pretty sure that the above violates
sysfs-rules.txt.  From looking at sysfs-rules.txt, the fact that you
have to read symlinks as the only valid and guaranteed way for things
to work means that you have to write a perl script to safely
manipulate things in /sys.  People are ignoring sysfs-rules.txt, and
giving advice in websites contrary to sysfs-rules.txt because it's too
hard to follow.

> I worry about the lack of atomicity in getting the statistics.

And that's another problem with trying to use /sys when trying to get
statistics in bulk...  Very often the tables in /proc/* were not only
more convenient, but allowed for atomic access.

						- Ted


From: Al Viro <viro@ftp.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: Major regression on hackbench with SLUB (more numbers)
Date: Sat, 22 Dec 2007 21:51:24 UTC
Message-ID: <fa.mmnW/hA3UA4/z5aJXp+kq7Y3QxI@ifi.uio.no>

On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote:

> > Another problem with using /sys/slab is that it is downright *ugly*.
> > Why is it for example, that /sys/slab/dentry is a symlink to
> > ../slab/:a-0000160?
>
> That's the only really ugly thing there. Otherwise, it's pretty nice, but
> having a million files makes for problems when trying to send somebody
> else the full info.

*raised brows*

I would say that there's that really ugly thing with embedding kobject
into a struct with the lifetime rules of its own and then having that
struct kfree()d while kobject might still have references, but hey,
maybe it's just me and my lack of appreciation of the glory that is
sysfs.

	Al, too tired of ranting about sysfs being a major architecture
mistake and a recurring source of turds like that one...


From: Al Viro <viro@ftp.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: Major regression on hackbench with SLUB (more numbers)
Date: Sat, 22 Dec 2007 23:30:06 UTC
Message-ID: <fa.om8/XntaL/B6De6Dehvoe5QNXPQ@ifi.uio.no>

On Sat, Dec 22, 2007 at 09:50:09PM +0000, Al Viro wrote:
> On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote:
>
> > > Another problem with using /sys/slab is that it is downright *ugly*.
> > > Why is it for example, that /sys/slab/dentry is a symlink to
> > > ../slab/:a-0000160?
> >
> > That's the only really ugly thing there. Otherwise, it's pretty nice, but
> > having a million files makes for problems when trying to send somebody
> > else the full info.
>
> *raised brows*
>
> I would say that there's that really ugly thing with embedding kobject
> into a struct with the lifetime rules of its own and then having that
> struct kfree()d while kobject might still have references, but hey,
> maybe it's just me and my lack of appreciation of the glory that is
> sysfs.
>
> 	Al, too tired of ranting about sysfs being a major architecture
> mistake and a recurring source of turds like that one...

BTW, the fact that presence of that kobject is conditional makes life even
uglier - we have to either kfree() or drop a reference to kobject leaving
actual kfree() to its ->release().   While we are at it, when do we remove
the symlinks?  That got to add even more fun for the lifetime rules...

Sigh...  How does one set a script that would mail a warning upon git pull
that introduces any instances of keyword from given set?  I hadn't noticed
that slub was using sysfs when it went into the tree back in May ;-/


From: Al Viro <viro@ZenIV.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: Major regression on hackbench with SLUB (more numbers)
Date: Wed, 26 Dec 2007 22:17:56 UTC
Message-ID: <fa.7/Uy7n54uIyoz9je4tFrZUavypM@ifi.uio.no>

On Wed, Dec 26, 2007 at 01:31:35PM -0800, Christoph Lameter wrote:
> On Mon, 24 Dec 2007, Theodore Tso wrote:
>
> > So two questions: why isn't -f the default?  And is /sys/slab
>
> Because it gives misleading output. It displays the name of the first
> of multiple slabs that share the same storage structures.

Erm...  Let me spell it out: current lifetime rules are completely broken.
As it is, create/destroy/create cache sequence will do kobject_put() on
kfree'd object.  Even without people playing with holding sysfs files
open or doing IO on those.

a) you have kobject embedded into struct with the lifetime rules of its
own.  When its refcount hits zero you kfree() the sucker, even if you
still have references to embedded kobject.

b) your symlinks stick around.  Even when cache is long gone you still
have a sysfs symlink with its embedded kobject as a target.  They are
eventually removed when cache with the same name gets created.  _Then_
you get the target kobject dropped - when the memory it used to be in
had been freed for hell knows how long and reused by something that would
not appreciate slub.c code suddenly deciding to decrement some word in
that memory.

c) you leak references to these kobject; kobject_del() only removes it
from the tree undoing the effect of kobject_add() and you still need
kobject_put() to deal with the last reference.


From: Al Viro <viro@ZenIV.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: SLUB sysfs support
Date: Thu, 27 Dec 2007 23:00:47 UTC
Message-ID: <fa.THxQErSuLfAR3NP7eRMLsXLVL74@ifi.uio.no>

On Thu, Dec 27, 2007 at 12:28:14PM -0800, Christoph Lameter wrote:
> Hmmm.. If I separately allocate the kobject then I can no longer get to
> the kmem_cache structure from the kobject.
>
> I need to add a second kobject_del to sysfs_slab_remove() to make sysfs
> completely forget about the object?
>
> Probably should track down any remaining symlinks at that point and nuke
> them too. Isnt there some way to convince sysfs to remove the symlinks
> if the target vanishes?

Don't bother with separate allocation.

a) remove symlink when slab goes away
b) instead of kfree() in slab removal do kobject_put() if you have sysfs stuff
c) have ->release() of these kobjects do kfree()


From: Al Viro <viro@ZenIV.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: SLUB sysfs support
Date: Thu, 27 Dec 2007 23:55:44 UTC
Message-ID: <fa.43+mqVeGzXvXTMaJbkVrPTI2JdI@ifi.uio.no>

On Thu, Dec 27, 2007 at 03:22:28PM -0800, Christoph Lameter wrote:
> > a) remove symlink when slab goes away
>
> Ok. Need to think about how to code that.

Huh?  Just call it from kmem_cache_destroy(); what business does that symlink
have being around after that point?

> > b) instead of kfree() in slab removal do kobject_put() if you have sysfs stuff
>
> Hmmmm.... Okay. Patch follows but its strange to do a kobject_put after a
> kobject_del().

kobject_del() undoes the effect of kobject_add().  And then you are left
with the refcount you had before kobject_add(), i.e. from kobject_init().

Think of it that way: kobject refcount controls the lifetime of structure
the sucker's embedded into.  Depending on kobject methods, you might
be unable to do cleanup of non-kobject parts of that animal until after
kobject_del().  IOW, you _can't_ have kobject_del() dropping the (hopefully)
final reference to kobject - only the caller knows what might have to be
done in between.

From: Al Viro <viro@ZenIV.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: SLUB sysfs support
Date: Thu, 27 Dec 2007 23:58:37 UTC
Message-ID: <fa.bl6Y7uZ29owEWwTOsKd8RRa1Bc8@ifi.uio.no>

On Thu, Dec 27, 2007 at 03:53:43PM -0800, Christoph Lameter wrote:
> On Thu, 27 Dec 2007, Christoph Lameter wrote:
>
> > > a) remove symlink when slab goes away
> >
> > Ok. Need to think about how to code that.
>
> How do I iterate over all symlinks in /sys/kernel/slab/*?
>
> I remember trying to do it before and not being able to find a sysfs
> method for that.

What the hell do you need that for?  You have an obvious moment when
/sys/kernel/slab/cache_name -> <whatever> should be gone - it's when
you remove the slab called cache_name.  And at that point you don't
need to iterate over anything - you have the exact name, for fsck sake...

Why would you want these symlinks to stick around for longer than that?


From: Al Viro <viro@ZenIV.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: SLUB sysfs support
Date: Fri, 28 Dec 2007 00:45:44 UTC
Message-ID: <fa.HFZls18wAjoWFZmF6Ayc5S3ZQrk@ifi.uio.no>

On Thu, Dec 27, 2007 at 04:02:56PM -0800, Christoph Lameter wrote:
> > Why would you want these symlinks to stick around for longer than that?
>
> /sys/kernel/slab/cache_name is a real directory but then there are the
> aliases in /sys/kernel/slab/alias_name pointing to that directory that
> also have to be removed.
>
> So I need to scan for symlinks in /sys/kernel/slab/* pointing to
> /sys/kernel/slab/cache_name.

Oh, lovely.  So we can have module A do kmem_cache_create(), calling
cache "foo".  Then module B does (without any knowledge about A)
completely unrelated kmem_cache_create(), calling the sucker "bar".
mm/slub decides that they are mergable.  Then we get rmmod A... and
no way to find out if that's foo or bar going away - kmem_cache_destroy()
doesn't have enough information to figure that out.  So we have to keep
both slab/foo and slab/bar around until both are gone or until somebody
kind enough creates a cache called foo.  Better yet, on some systems you
get things like slab/nfsd4_delegations sticking around long after nfsd is
gone just because slub.c decides that it's sharable, but in the next
kernel version the thing it's shared with gets different object size and
behaviour suddenly changes - now it's gone as soon as nfsd is gone.
Brilliant interface, that...


From: Al Viro <viro@ZenIV.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: SLUB sysfs support
Date: Fri, 28 Dec 2007 03:27:20 UTC
Message-ID: <fa.1N0m0v4q29qTBlj8LfAEIJ0qKFI@ifi.uio.no>

On Thu, Dec 27, 2007 at 06:19:46PM -0800, Christoph Lameter wrote:
> nfsd4_delegations? What is this about?

The random lifetimes of user-visible files you create in sysfs.

> How do I scan for the symlinks in sysfs?

At which point are you going to do that?  AFAICS, the fundamental problem
is that you
	* have aliases indistinguishable, so kmem_cache_destroy() can't tell
which one is going away, no matter what
	* have per-alias objects in sysfs
As the result, you have a user-visible mess in that directory in sysfs.
And I don't see how you would deal with that - on the "the contents of
directory changes in so-and-so way when such-and-such operation is
done", not the implementation details one.

BTW, I'm rather sceptical about free use of slabs; keep in mind that their
names have to be unique with your sysfs layout, so...


Index Home About Blog