kinit (Al Viro; H. Peter Anvin; Linus Torvalds; Theodore Tso)

From: Theodore Tso <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: [klibc] klibc and what's the next step?
Date: Sat, 01 Jul 2006 19:47:27 UTC
Message-ID: <fa.Lbyv11TKG6jDsbXsM+Lwylrnilw@ifi.uio.no>
Original-Message-ID: <20060701150506.GA10517@thunk.org>

On Sat, Jul 01, 2006 at 06:56:57AM -0400, Jeff Bailey wrote:
>
> The Ubuntu initramfs doesn't use kinit, and it would be nice if we
> weren't forced to.  We do a number of things in our initramfs (like a
> userspace bootsplash) which we need done before most of the things kinit
> wants to do take place.
>

This is going to be a problem given that people are hell-bent on
chucking functionality out of the kernel into userspace.  If various
distributions insist on having their own initramfs/initrd, we're going
to have a maintenance headache where future kernel versions won't work
on distro systems, which is going to be painful for kernel developers
that want to stay on the bleeding edge.  We are already seeing the
beginnings of this: modern kernels expect the distro initramfs to wait
for the SCSI probe to finish after loading modules and before trying
to mount the root filesystem, and that expectation has made RHEL4
systems incompatible with modern kernels.

Fortunately there is a workaround: build the MPT Fusion device driver
into the kernel rather than as a module.  But if Pavel succeeds in
ejecting software suspend into userspace, and preventing suspend2 from
getting merged, *and* distros insist on doing their own thing with
initramfs, we are going to be headed for a major trainwreck.

Personally, I would be happier with keeping things like suspend2 in
the kernel, since I don't think the hellish compatibility problems
with non-reviewed kernel functionality that has been ejected into
userspace are really worth it --- but if we *are* going to go down the
route of pushing everything into userspace, it is going to be critical
that distros buy into using a kernel initialization system which is
shipped with the kernel, and can be updated without being tied to a
particular distro's non-standard "value add".  Maybe that means we
need to have hooks so that the distros can add their non-standard
"value add" without breaking the ability for users to upgrade to newer
kernels.  But either way, we're going to have to decide which way
we're going to go, and if we're going to go down the blind
in-userspace-good-in-kernel-bad approach, the distros are going to
have to cooperate or it's going to be a mess.

						- Ted

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [klibc] klibc and what's the next step?
Date: Sat, 01 Jul 2006 20:09:05 UTC
Message-ID: <fa.orvt0wTbokXnNYl6+RThUxtkWrg@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0607011306060.12404@g5.osdl.org>

On Sat, 1 Jul 2006, Theodore Tso wrote:
>
> This is going to be a problem given that people are hell-bent on
> chucking functionality out of the kernel into userspace.

Btw, I'm not necessarily one of those people.

There _are_ some things that can be better done in user space, but on the
other hand, other things really are better off in the kernel.

The argument that user space is more debuggable has been shown to be
largely a red herring. User space is only more debuggable if it does
something independent, and we've seen that user space is _harder_ to debug
than kernel space if we have events going back and forth.

For example, the old pcmcia layer in user space was crap, crap, crap, and
at least I found it was MUCH harder to debug than the all-in-kernel code.

			Linus

From: Al Viro <viro@ftp.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: [klibc] klibc and what's the next step?
Date: Sat, 01 Jul 2006 21:58:54 UTC
Message-ID: <fa.R1QMjj57EJpXC24/5NQaY/uhSyw@ifi.uio.no>
Original-Message-ID: <20060701215823.GC29920@ftp.linux.org.uk>

On Sat, Jul 01, 2006 at 01:08:17PM -0700, Linus Torvalds wrote:
>
>
> On Sat, 1 Jul 2006, Theodore Tso wrote:
> >
> > This is going to be a problem given that people are hell-bent on
> > chucking functionality out of the kernel into userspace.
>
> Btw, I'm not necessarily one of those people.
>
> There _are_ some things that can be better done in user space, but on the
> other hand, other things really are better off in the kernel.
>
> The argument that user space is more debuggable has been shown to be
> largely a red herring. User space is only more debuggable if it does
> something independent, and we've seen that user space is _harder_ to debug
> than kernel space if we have events going back and forth.

True.  However, that depends on the interfaces being used.  Sure, when
a twit "moves things to userland" by marshalling stuff through sysfs,
the only thing it achieves is more sysfs crap *and* more internal kernel
APIs being cast in stone.

However, when the code really is a series of normal syscalls (on the level
of "all we call is sys_something()"), it makes a lot of sense to take the
damn thing to userland and leave it there...

From: "H. Peter Anvin" <hpa@zytor.com>
Newsgroups: fa.linux.kernel
Subject: Re: [klibc] klibc and what's the next step?
Date: Sat, 01 Jul 2006 22:32:16 UTC
Message-ID: <fa.E+NWfmlAwO2BMCkVVMy6kEqNCz0@ifi.uio.no>
Original-Message-ID: <44A6F7C4.1090201@zytor.com>

Linus Torvalds wrote:
>
> On Sat, 1 Jul 2006, Theodore Tso wrote:
>> This is going to be a problem given that people are hell-bent on
>> chucking functionality out of the kernel into userspace.
>
> Btw, I'm not necessarily one of those people.
>
> There _are_ some things that can be better done in user space, but on the
> other hand, other things really are better off in the kernel.
>
> The argument that user space is more debuggable has been shown to be
> largely a red herring. User space is only more debuggable if it does
> something independent, and we've seen that user space is _harder_ to debug
> than kernel space if we have events going back and forth.
>

Indeed.  The things that have been moved to userspace in the first cut
of klibc are things which can largely be tested independently, usually
just from the normal command line.  This is incredibly powerful, but it
is not -- and shouldn't be -- universally applied just because it can
be; it should be applied if and when it makes sense.

> For example, the old pcmcia layer in user space was crap, crap, crap, and
> at least I found it was MUCH harder to debug than the all-in-kernel code.

I have done my own share of debugging shared kernel space/user space
applications, mainly autofs, and if done properly, it can be quite sane.
If done *improperly*, it's a nightmare.  Personally, I find that if
one can:

- run something as a separate component in userspace, and/or
- leverage strace(1) to get more insight

then one can usually get a lot of debugging help.  Otherwise, probably not.

	-hpa

From: Theodore Tso <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: [klibc] klibc and what's the next step?
Date: Sun, 02 Jul 2006 00:06:08 UTC
Message-ID: <fa.IZwZmH2OFw521lFtqelTYkokA6M@ifi.uio.no>
Original-Message-ID: <20060702000528.GA15375@thunk.org>

On Sat, Jul 01, 2006 at 01:08:17PM -0700, Linus Torvalds wrote:
> The argument that user space is more debuggable has been shown to be
> largely a red herring. User space is only more debuggable if it does
> something independent, and we've seen that user space is _harder_ to debug
> than kernel space if we have events going back and forth.

Agreed, 100%.

In addition, userspace is debuggable only if the initramfs has
enough debugging code in there (like a real live working shell,
strace, basic shell commands etc.)  Otherwise, it becomes even more
hellish to debug.  I wasted a huge amount of time trying to figure out
why the RHEL4 initramfs was incompatible with modern kernels using the
MPT Fusion SCSI driver, because I couldn't make it stop and break out
to a working shell; it had some busybox-like nash thing that wasn't
designed for user interaction, so all I could do was try to make tiny
changes to the initramfs, wait for the !@#@# very long boot cycle, and
watch the initial ramdisk fail to mount the root and crash, and
repeat, for hours on end.  RHEL4's userspace root mount system was
***not*** more debuggable, not in the least.  Adding printk's into a
kernel and recompiling would have been easier, and far less
frustrating.

Hopefully kinit will be better, but it's definitely not the case that
userspace is easier to debug.

						- Ted

From: "H. Peter Anvin" <hpa@zytor.com>
Newsgroups: fa.linux.kernel
Subject: Re: [klibc] klibc and what's the next step?
Date: Sun, 02 Jul 2006 00:18:12 UTC
Message-ID: <fa.5Rbm7fQKt+OHhIRs15y1YqxMH3s@ifi.uio.no>
Original-Message-ID: <44A7108D.6090204@zytor.com>

Theodore Tso wrote:
> On Sat, Jul 01, 2006 at 01:08:17PM -0700, Linus Torvalds wrote:
>> The argument that user space is more debuggable has been shown to be
>> largely a red herring. User space is only more debuggable if it does
>> something independent, and we've seen that user space is _harder_ to debug
>> than kernel space if we have events going back and forth.
>
> Agreed, 100%.
>
> In addition, userspace is debuggable only if the initramfs has
> enough debugging code in there (like a real live working shell,
> strace, basic shell commands etc.)  Otherwise, it becomes even more
> hellish to debug.  I wasted a huge amount of time trying to figure out
> why the RHEL4 initramfs was incompatible with modern kernels using the
> MPT Fusion SCSI driver, because I couldn't make it stop and break out
> to a working shell; it had some busybox-like nash thing that wasn't
> designed for user interaction, so all I could do was try to make tiny
> changes to the initramfs, wait for the !@#@# very long boot cycle, and
> watch the initial ramdisk fail to mount the root and crash, and
> repeat, for hours on end.  RHEL4's userspace root mount system was
> ***not*** more debuggable, not in the least.  Adding printk's into a
> kernel and recompiling would have been easier, and far less
> frustrating.
>
> Hopefully kinit will be better, but it's definitely not the case that
> userspace is easier to debug.
>

It isn't automatically easier, but it *can* be.

In your case above, with kinit, I would have added dash and strace (the
latter would probably have to be statically linked against glibc; I
haven't actually tried building strace under klibc myself) -- or even
gdb -- to initramfs, and have /init drop into a shell.  From there one
can run strace -f on kinit.

One of the criticisms I've gotten for klibc has been why I have included
dash and a whole bunch of shell utilities when they're not used by the
default kernel build.  It certainly hasn't been just to prove a point;
as a matter of fact, getting dash to build correctly under Kbuild proved
to be surprisingly difficult.  However, by making sure that one can
*easily* pull together an interactive environment -- even if one didn't
have one from the start -- one has readier access to a debuggable
environment.

Similarly, I try very hard to have small, individually testable modules.
I haven't taken that even as far as I'd like to have, in the end, but
that's on my list of things to do.

	-hpa

From: Theodore Tso <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: [klibc] klibc and what's the next step?
Date: Sun, 02 Jul 2006 00:46:01 UTC
Message-ID: <fa.7AD8NQQVxv1VynBCZEX2thsrut0@ifi.uio.no>
Original-Message-ID: <20060702003815.GB15375@thunk.org>

On Sat, Jul 01, 2006 at 05:17:17PM -0700, H. Peter Anvin wrote:
> One of the criticisms I've gotten for klibc has been why I have included
> dash and a whole bunch of shell utilities when they're not used by the
> default kernel build.  It certainly hasn't been just to prove a point;
> as a matter of fact, getting dash to build correctly under Kbuild proved
> to be surprisingly difficult.  However, by making sure that one can
> *easily* pull together an interactive environment -- even if one didn't
> have one from the start -- one has readier access to a debuggable
> environment.

Well, I wouldn't be one of those folks.  In fact, how big is dash and
some select set of shell utilities?  If they aren't that big, it might
make sense to include them all the time so that a simple command-line
option on boot is all that's necessary in order to break into a
pre-kinit interactive shell.  That would make the resulting system
more debuggable by definition.  Then all we would have to do is
make sure the distros use the kernel-supplied kinit solution, instead
of rolling their own non-standard version.
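
To make the "simple command-line option" idea concrete, here is a
minimal sketch of the check an /init could perform before handing off
to kinit.  It is written in Python purely for illustration, and the
option name `kinit_shell` is hypothetical; nothing in this thread
defines such an option.  /proc/cmdline is where a running kernel
exposes its boot command line.

```python
import os
from typing import Callable, Sequence


def wants_debug_shell(cmdline: str, option: str = "kinit_shell") -> bool:
    """Return True if `option` is on the command line and not disabled.

    `kinit_shell` is a made-up option name used only for this sketch.
    """
    for word in cmdline.split():
        key, _, value = word.partition("=")
        if key == option:
            # A bare "kinit_shell" or "kinit_shell=1" enables the shell.
            return value not in ("0", "no", "off")
    return False


def maybe_break_to_shell(
    cmdline_path: str = "/proc/cmdline",
    shell: str = "/bin/sh",
    exec_fn: Callable[[str, Sequence[str]], None] = os.execv,
) -> None:
    """Replace this process with an interactive shell if requested."""
    with open(cmdline_path) as f:
        if wants_debug_shell(f.read()):
            exec_fn(shell, [shell])
```

Keeping the parsing separate from the exec makes the decision testable
from an ordinary command line, without booting anything.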

						- Ted

From: "H. Peter Anvin" <hpa@zytor.com>
Newsgroups: fa.linux.kernel
Subject: Re: [klibc] klibc and what's the next step?
Date: Sun, 02 Jul 2006 00:51:11 UTC
Message-ID: <fa./Ri/tRCyM2VTWVcvMDcoLSFnhKw@ifi.uio.no>
Original-Message-ID: <44A71840.8040904@zytor.com>

Theodore Tso wrote:
>
> Well, I wouldn't be one of those folks.  In fact, how big is dash and
> some select set of shell utilities?  If they aren't that big, it might
> make sense to include them all the time so that a simple command-line
> option on boot is all that's necessary in order to break into a
> pre-kinit interactive shell.  That would make the resulting system
> more debuggable by definition.  Then all we would have to do is
> make sure the distros use the kernel-supplied kinit solution, instead
> of rolling their own non-standard version.
>

Shared binaries, x86-64 (i386 is about 20-25% smaller):

-rwxrwxr-x 1 hpa hpa 58544 Jul  1 11:41 usr/dash/sh.shared*
-rwxrwxr-x 1 hpa hpa  2760 Jul  1 11:41 cat*
-rwxrwxr-x 1 hpa hpa   888 Jul  1 11:41 chroot*
-rwxrwxr-x 1 hpa hpa  4000 Jul  1 11:41 dd*
-rwxrwxr-x 1 hpa hpa   680 Jul  1 11:41 false*
-rwxrwxr-x 3 hpa hpa  1072 Jul  1 11:41 halt*
-rwxrwxr-x 1 hpa hpa  1664 Jul  1 11:41 insmod*
-rwxrwxr-x 1 hpa hpa  1336 Jul  1 11:41 ln*
-rwxrwxr-x 1 hpa hpa  5000 Jul  1 11:41 minips*
-rwxrwxr-x 1 hpa hpa  1984 Jul  1 11:41 mkdir*
-rwxrwxr-x 1 hpa hpa  1704 Jul  1 11:41 mkfifo*
-rwxrwxr-x 1 hpa hpa  1712 Jul  1 11:41 mknod*
-rwxrwxr-x 1 hpa hpa  2184 Jul  1 11:41 mount
-rwxrwxr-x 1 hpa hpa  1320 Jul  1 11:41 nuke*
-rwxrwxr-x 1 hpa hpa   856 Jul  1 11:41 pivot_root*
-rwxrwxr-x 3 hpa hpa  1072 Jul  1 11:41 poweroff*
-rwxrwxr-x 3 hpa hpa  1072 Jul  1 11:41 reboot*
-rwxrwxr-x 1 hpa hpa   864 Jul  1 11:41 sleep*
-rwxrwxr-x 1 hpa hpa   672 Jul  1 11:41 true*
-rwxrwxr-x 1 hpa hpa  1056 Jul  1 11:41 umount*
-rwxrwxr-x 1 hpa hpa  1952 Jul  1 11:41 uname*
-rw-rw-r-- 2 hpa hpa 71016 Jul  1 11:40 usr/klibc/klibc-Yy9wepARlc-x17pdZDwU1YCOiMQ.so
-rwxrwxr-x 1 hpa hpa 35664 Jul  1 17:48 usr/kinit/kinit.shared*

... totalling about 193K if you include everything.  For comparison,
static kinit by itself is 67K (which includes ipconfig, nfsroot, etc.)
For an "include everything" variant, it would probably make more sense
to port busybox, and/or add more tools as dash builtins.

All of these are uncompressed sizes.
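
For anyone who wants to check the arithmetic, the following sketch
sums the sizes from the listing.  It assumes that halt, poweroff and
reboot (link count 3, identical size) are hard links to a single file;
that is an inference from the listing, not something stated here.
Either way the total lands within a couple of KiB of the figure quoted
above.

```python
# (name, size in bytes) copied from the listing above.
FILES = [
    ("usr/dash/sh.shared", 58544),
    ("cat", 2760),
    ("chroot", 888),
    ("dd", 4000),
    ("false", 680),
    ("halt", 1072),
    ("insmod", 1664),
    ("ln", 1336),
    ("minips", 5000),
    ("mkdir", 1984),
    ("mkfifo", 1704),
    ("mknod", 1712),
    ("mount", 2184),
    ("nuke", 1320),
    ("pivot_root", 856),
    ("poweroff", 1072),
    ("reboot", 1072),
    ("sleep", 864),
    ("true", 672),
    ("umount", 1056),
    ("uname", 1952),
    ("usr/klibc/klibc-Yy9wepARlc-x17pdZDwU1YCOiMQ.so", 71016),
    ("usr/kinit/kinit.shared", 35664),
]

# Assumption: these two names are hard links to "halt", so they take up
# no additional space once "halt" has been counted.
HARD_LINK_ALIASES = {"poweroff", "reboot"}


def total_bytes(dedupe_hard_links: bool) -> int:
    """Sum the listed sizes, optionally counting hard-linked files once."""
    return sum(
        size
        for name, size in FILES
        if not (dedupe_hard_links and name in HARD_LINK_ALIASES)
    )


for dedupe in (False, True):
    total = total_bytes(dedupe)
    print(f"dedupe_hard_links={dedupe}: {total} bytes = {total / 1024:.1f} KiB")
```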

	-hpa

From: "H. Peter Anvin" <hpa@zytor.com>
Newsgroups: fa.linux.kernel
Subject: Re: [klibc 30/31] Remove in-kernel resume-from-disk invocation code
Date: Thu, 06 Jul 2006 03:11:31 UTC
Message-ID: <fa.D0Ud+c+BLFJhEosdO3r/+ZxQcx4@ifi.uio.no>
Original-Message-ID: <44AC7F46.3050204@zytor.com>

Nigel Cunningham wrote:
> Hi again.
>
> (Excuse me replying to myself, but this might help someone else).
>
> On Thursday 06 July 2006 11:45, Nigel Cunningham wrote:
>> Is there a klibc howto somewhere? I tried googling for 'klibc howto',
>> reading the files in Documentation/ and browsing your klibc mailing list
>> archive before asking!
>
>> What I'm wondering specifically is: Say a user needs to run some commands
>> to set up access to encrypted storage before they can resume. At the
>> moment, we'd tell them to put these commands and the echo > do_resume in
>> their linuxrc (or init) script prior to mounting their root filesystem.
>> Forgive me if I'm asking a stupid question but it's not immediately obvious
>> to me how they would now do that. I'd much rather follow a simple howto
>> than spend a good amount of time tracing function calls etc. I still see
>> init/initramfs.c, and it mentions both CONFIG_BLK_DEV_INITRD and
>> CONFIG_BLK_DEV_RAM. Would I be right in surmising that you can still have
>> an initrd or ramfs to do such things as the above, after klibc has done its
>> work? If not, is there some other way I'm ignorant of?
>
> For the record, I've since discovered that what you really want is an
> initramfs howto. I think I stuck with those old-fangled initrds for too long.
> Better update my desktop from Mandrake 10 too :)... is there a pattern here?
>

Okay, let's try to start from the beginning...

initramfs is, indeed, a replacement for initrd, but it's not a 1:1 map.
Instead, the initramfs contents -- which can come from multiple
sources! -- are simply extracted right into rootfs.

kinit is a replacement for the in-kernel root-handling code, as well as
other related in-kernel code like resume from disk.  It is compiled as a
monolithic binary for size reasons.

klibc is a very small C library which *can* be used to produce initramfs
binaries; in particular, it's used to produce kinit, and is small enough
that it can be realistically included with the kernel distribution.

If you provide your own /init in an initramfs, it will override the
default, which is /init -> /kinit.  You can then choose to invoke kinit
if you want to; for example, you could try to resume from suspend2, and
invoke kinit if that fails.
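
As a rough illustration of that override, here is a minimal sketch of
the control flow of such a custom /init.  It is in Python only to show
the logic; a real initramfs would more likely use a shell script or a
small klibc binary.  The `try_resume` callable is hypothetical and
stands in for whatever mechanism attempts the resume; the /kinit path
is the default described above.

```python
import os
from typing import Callable, Sequence

KINIT_PATH = "/kinit"  # the default /init points here, per the text above


def custom_init(
    try_resume: Callable[[], None],
    argv: Sequence[str],
    exec_fn: Callable[[str, Sequence[str]], None] = os.execv,
) -> None:
    """Attempt a resume first; fall back to kinit if it does not succeed.

    `try_resume` is a placeholder for the actual resume mechanism.
    """
    try:
        # A successful resume replaces the running system and never
        # returns, so reaching the next statement means it did not happen.
        try_resume()
    except OSError as err:
        print(f"resume failed: {err}")
    # Hand control to kinit, passing the original arguments through.
    exec_fn(KINIT_PATH, [KINIT_PATH, *argv[1:]])
```

Passing `exec_fn` as a parameter is just a convenience so the fallback
path can be exercised without actually replacing the process.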

	-hpa

From: Theodore Tso <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: klibc and what's the next step?
Date: Tue, 11 Jul 2006 13:46:30 UTC
Message-ID: <fa.S5q4R9dIC998SJyfNkh2xH7c7mo@ifi.uio.no>
Original-Message-ID: <20060711134554.GC24029@thunk.org>

On Tue, Jul 11, 2006 at 06:48:34AM +0200, Olaf Hering wrote:
> One is the kind that builds static kernels and uses no initrd of any kind.
> For those people, the code and interfaces behind prepare_namespace() have
> not changed in a long time.
> They will install that kinit binary once and it will continue to work with
> kernels from 2.6.6 and later, when "/init" support was merged. Or rather
> from 2.6.1x when CONFIG_INITRAMFS_SOURCE was introduced.
>
> The other group is the one that uses some sort of initrd (loop mount or cpio),
> created with tools from their distribution.
> Again, they should install that kinit binary as well because kinit emulates
> the loop mount handling of /initrd.image. This is for older distributions
> that still create a loop mounted initrd.

Kinit SHOULD be merged into the kernel, and the responsibility of
creating the initrd/initramfs image should be moved from the
distribution into the kernel build process.  There can and should be a
way for distros to add their own "value add specials" into the
initrd/initramfs image, but we have to take over creating the base
initial userspace environment.  It's not just uswsusp (still not
convinced it's a good idea, but if we're going to do it we have to
wrest control of the initramfs away from the distros), but finding
and mounting iSCSI disks, LVM setup, etc.

> In earlier mails you stated that having kinit/klibc in the kernel sources
> would make it easier to keep up with interface changes.
> What interface changes did you have in mind, and can you name any relevant
> interface changes that were made after 2.6.0 which would break an external
> kinit?

When you load a SCSI driver (the one that bit me was the MPT Fusion
driver), it no longer waits for SCSI bus probe to finish before
returning.  So the RHEL4 initrd fails to find the root filesystem, and
bombs out.  This change was definitely made after 2.6.0, and is an
example of the sort of change which wouldn't have happened if kinit
was under the kernel sources and not supplied by the distro.
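
To illustrate the kind of adjustment early userspace needs once the
driver load returns before probing is done, here is a minimal sketch
that polls for the root device with a timeout instead of assuming it
exists.  This is not what the RHEL4 initrd or kinit actually does; the
function name, the default timeout and the polling interval are all
assumptions made for the example.

```python
import os
import time
from typing import Callable


def wait_for_device(
    path: str,
    timeout: float = 30.0,
    interval: float = 0.5,
    exists: Callable[[str], bool] = os.path.exists,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until `path` exists or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if exists(path):
            return True
        if clock() >= deadline:
            # Give up; the caller can report the failure or drop to a shell.
            return False
        sleep(interval)
```

A caller would try something like wait_for_device("/dev/sda1") after
loading the driver module and only attempt the root mount once it
returns True.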

> As others have stated in this thread, the code behind prepare_namespace() is
> very simple. It doesn't know anything about lvm etc., nor about mount by
> filesystem UUID/LABEL, nor does it know how to deal properly with new
> technologies like iSCSI, evms, persistent storage device names, usb-storage,
> sbp2 or async device probing.
> Should all that knowledge end up in the kernel source one day?

Some of this will probably need to be farmed out into files provided
by external packages, but I **hope** that they are true upstream
external packages, and not distro-specific specials, which is one of
the reasons why the current initrd/initramfs situation is
so.... unsatisfactory.  Clearly some kernel-mandated interface for
other packages to insert scripts and binaries during the early-boot
process would be a good thing; but the core initramfs functionality
should IMHO belong to the kernel.

						- Ted
