Oops decoding (Al Viro; Linus Torvalds)


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Top 10 kernel oopses for the week ending January 5th, 2008
Date: Tue, 08 Jan 2008 03:26:53 UTC
Message-ID: <fa.u5z/gIt3Ek9JDVs7GK067vURAxw@ifi.uio.no>

On Mon, 7 Jan 2008, Kevin Winchester wrote:

> J. Bruce Fields wrote:
> >
> > Is there any good basic documentation on this to point people at?
>
> I would second this question.  I see people "decode" oops on lkml often
> enough, but I've never been entirely sure how it's done.  Is it somewhere
> in Documentation?

It's actually not necessarily at all that trivial, unless you have a deep
understanding of the code generated for the architecture in question (and
even then, some oopses take more time to figure out than others, thanks
to inlining and tailcalls etc).

If the oops happened with a kernel you generated yourself, it's usually
rather easy. Especially if you said "y" to the "generate debugging info"
question at configuration time. Because, in that case, you really just do
a simple

	gdb vmlinux

and then you can do (for example) something like setting a breakpoint at
the EIP that was reported for the oops, and it will tell you what line it
came from.

However, if you don't have the exact binary - which is the common case for
random oopses reported on lkml - you will generally have to disassemble
the hex sequence given in the oops (the "Code:" line), and try to match it
up against the source code to try to figure out what is going on.

Even just the disassembly is not entirely trivial, since the oops will
give you the eip that it happened at, but you often want to also
disassemble *backwards* in order to get more of a context (the "Code:"
line will mark the particular EIP that starts the oopsing instruction by
enclosing it in <xx>), but with non-constant instruction lengths, you need
to use a bit of trial-and-error to figure it out.

I usually just compile a small program like

	const char array[]="\xnn\xnn\xnn...";

	int main(int argc, char **argv)
	{
		printf("%p\n", array);
		*(int *)0=0;
	}

and run it under gdb, and then when it gets the SIGSEGV (due to the
obvious NULL pointer dereference), I can just ask gdb to disassemble
around the array that contains the code[] stuff. Try a few offsets, to see
when the disassembly makes sense (and gives the reported EIP as the
beginning of one of the disassembled instructions).

(You can do it other and smarter ways too, I'm not claiming that's a
particularly good way to do it, and the old "ksymoops" program used to do
a pretty good job of this, but I'm used to that particular idiotic way
myself, since it's how I've basically always done it)
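
A sketch of the byte-extraction step, for anyone who would rather not
hand-edit the "Code:" line into \x escapes. The helper below is purely
illustrative (parse_code_line is a made-up name, not part of ksymoops or
any kernel tool) and assumes the usual format of two-hex-digit bytes
separated by spaces, with the faulting byte wrapped in <..>:

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the hex bytes of an oops "Code:" line into out[].
 * Returns the number of bytes stored.  *marked receives the index of the
 * byte that was wrapped in <..> (start of the faulting instruction), or
 * -1 if no byte was marked.  Hypothetical helper, for illustration only.
 */
static size_t parse_code_line(const char *s, unsigned char *out, size_t cap,
			      long *marked)
{
	const char *colon = strchr(s, ':');
	size_t n = 0;

	if (colon)			/* skip an optional "Code:" prefix */
		s = colon + 1;
	*marked = -1;
	while (*s && n < cap) {
		if (*s == '<') {	/* the next byte is the marked one */
			*marked = (long)n;
			s++;
		} else if (isxdigit((unsigned char)s[0]) &&
			   isxdigit((unsigned char)s[1])) {
			char hex[3] = { s[0], s[1], '\0' };

			out[n++] = (unsigned char)strtoul(hex, NULL, 16);
			s += 2;
		} else {
			s++;		/* spaces, '>', newlines */
		}
	}
	return n;
}
```

The index stored in *marked is the offset to add to the array address
when asking gdb to disassemble around the faulting instruction.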

After that, you still need to try to match up the assembly code with the
source code and figure out what variables the register contents actually
are all about. You can often try to do a

	make the/affected/file.s

to generate the asm file in your own tree - the register allocation can be
totally different due to different compilers and different options (and
things like the fact that maybe the source tree you do this on doesn't
match the oops report exactly), but it's usually a good starting point to
compare the disassembly from gdb with the *.s file output from the
compiler.

Quite often, it's all very obvious (you see some constant or other simple
pattern). But if you're not used to the assembly format, you'll spend a
lot of brainpower just trying to figure that part out even for the obvious
stuff, which is why it's a good thing if you are very comfortable indeed
with the assembly language of that particular platform.

It's not really all that hard. But the first few times you see those
oopses, it all looks mostly like just line noise. So it definitely takes
some practice to do it well.

Anyway, let's take an example, from

	http://lkml.org/lkml/2008/1/1/189

where the most obviously relevant parts are:

	BUG: unable to handle kernel paging request at virtual address 00100100
	EIP:    0060:[<f8819668>]
	EIP is at evdev_disconnect+0x65/0x9e

	eax: 00000000   ebx: 000ffcf0   ecx: c1926760   edx: 00000033
	esi: f7415600   edi: f741564c   ebp: f7415654   esp: c1967e68
	Call Trace:
		[<c03454b2>] input_unregister_device+0x6f/0xff
		[<c03c6eb6>] klist_release+0x27/0x30
		[<c029178a>] kref_put+0x5f/0x6c
	..
	Code: 5e 4c 81 eb 10 04 00 00 eb 21 8d 83 08 04 00 00 b9 06 00 02
	      00 ba 1d 00 00 00 e8 6a 93 95 c7 8b 9b 10 04 00 00 81 eb 10
	      04 00 00 <8b> 83 10 04 00 00 0f 18 00 90 8d 83 10 04 00 00
	      39 f8 75 cb 8d

so here let's do the above silly C program:

	const char array[]="\x5e\x4c\x81\xeb\x10\x04\x00\x00\xeb\x21..

and running it under gdb gives:

	0x8048500

	Program received signal SIGSEGV, Segmentation fault.
	0x080483f7 in main () at test.c:14
	14              *(int*)0=0;

and now I can just try

	x/20i 0x8048500

and it turns out that already gives a reasonable disassembly. The first
few instructions are bogus: they're really part of the previous
instruction, but it looks pretty sane around the actual problem spot,
which is "array+43" (there are 43 bytes of code before the EIP one, and 20
bytes after):

	0x8048500 <array>:      pop    %esi
	0x8048501 <array+1>:    dec    %esp
	0x8048502 <array+2>:    sub    $0x410,%ebx
	0x8048508 <array+8>:    jmp    0x804852b <array+43>
	0x804850a <array+10>:   lea    0x408(%ebx),%eax
	0x8048510 <array+16>:   mov    $0x20006,%ecx
	0x8048515 <array+21>:   mov    $0x1d,%edx
	0x804851a <array+26>:   call   0xcf9a1889
	0x804851f <array+31>:   mov    0x410(%ebx),%ebx
	0x8048525 <array+37>:   sub    $0x410,%ebx
	0x804852b <array+43>:   mov    0x410(%ebx),%eax
	0x8048531 <array+49>:   prefetchnta (%eax)
	0x8048534 <array+52>:   nop
	0x8048535 <array+53>:   lea    0x410(%ebx),%eax
	0x804853b <array+59>:   cmp    %edi,%eax
	0x804853d <array+61>:   jne    0x804850a <array+10>
	0x804853f <array+63>:   lea    (%eax),%eax
	..

so now we know that the faulting instruction was that

	mov    0x410(%ebx),%eax

and we can also see that this also matches the address that caused the
oops (ebx=000ffcf0, so 0x410(%ebx) is 00100100, which matches the "unable
to handle kernel paging request" message).

(Now, people used to kernel oopses will also recognize 00100100 as the
LIST_POISON1, so this is all about dereferencing the ->next pointer of a
list entry that has been removed from the list, but that's a whole
separate level of kernel knowledge).
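
The address check above is simple enough to write down. A minimal sketch
in C, assuming the 32-bit poison values from include/linux/poison.h of
that era; the function names are made up for illustration:

```c
#include <stdint.h>

/* 32-bit list poison values (include/linux/poison.h at the time). */
#define LIST_POISON1 0x00100100u	/* stored in ->next by list_del() */
#define LIST_POISON2 0x00200200u	/* stored in ->prev by list_del() */

/* Effective address of a "disp(%reg)" memory operand on i386. */
static uint32_t effective_address(uint32_t base, int32_t disp)
{
	return base + (uint32_t)disp;
}

/* 1 if addr is LIST_POISON1, 2 if it is LIST_POISON2, 0 otherwise. */
static int list_poison_kind(uint32_t addr)
{
	if (addr == LIST_POISON1)
		return 1;
	if (addr == LIST_POISON2)
		return 2;
	return 0;
}
```

With ebx=000ffcf0 and a displacement of 0x410 this gives 00100100, i.e.
the ->next poison.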

Anyway, you can now do

	make drivers/input/evdev.s

and see if you can find that kind of code sequence in there. You can use
the "EIP: evdev_disconnect+0x65/0x9e" thing as a hint: if your compiler
setup isn't too different, it's likely to be roughly two thirds into that
evdev_disconnect function (but inlining really can mean that it's
somewhere else entirely in the source tree!)
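
The "roughly two thirds" estimate is just offset divided by size. A small
sketch that pulls both numbers out of the symbol+offset/size notation
(struct and function names here are hypothetical):

```c
#include <stdio.h>

struct sym_offset {
	char name[64];
	unsigned long offset;	/* bytes from the start of the function */
	unsigned long size;	/* total size of the function in bytes */
};

/* Parse "symbol+0xOFF/0xSIZE"; returns 0 on success, -1 on bad input. */
static int parse_sym_offset(const char *s, struct sym_offset *out)
{
	if (sscanf(s, "%63[^+]+%lx/%lx",
		   out->name, &out->offset, &out->size) != 3)
		return -1;
	return 0;
}

/* How far into the function the address is, as a fraction of its size. */
static double relative_position(const struct sym_offset *so)
{
	return so->size ? (double)so->offset / (double)so->size : 0.0;
}
```

For evdev_disconnect+0x65/0x9e that is 101 out of 158 bytes, a bit under
two thirds. It is only a hint for where to look in your own *.s file.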

The rest left as an exercise for the reader.

		Linus

From: Al Viro <viro@ZenIV.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: Top 10 kernel oopses for the week ending January 5th, 2008
Date: Tue, 08 Jan 2008 05:59:56 UTC
Message-ID: <fa.8NdQN5qEfuHgwZOVxCNgaRtDto4@ifi.uio.no>

On Mon, Jan 07, 2008 at 07:26:12PM -0800, Linus Torvalds wrote:

> I usually just compile a small program like
>
> 	const char array[]="\xnn\xnn\xnn...";
>
> 	int main(int argc, char **argv)
> 	{
> 		printf("%p\n", array);
> 		*(int *)0=0;
> 	}
Heh.  I prefer
char main[] = {.....};
for the same thing, with gdb a.out and no running at all.

FWIW, I'm going to go through Arjan's collection and post blow-by-blow
analysis of some of those suckers.  Tonight, probably...

Let's take e.g. http://www.kerneloops.org/raw.php?rawid=2618

RIP: 0010:[<ffffffff803b49a1>]  [<ffffffff803b49a1>] kref_put+0x31/0x80
RSP: 0000:ffff81007ffe5df0  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000034333545 RCX: ffffffff80606270
RDX: 0000000000000040 RSI: ffffffff803b38b0 RDI: 0000000034333545
RBP: ffff81007ffe5e00 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffff8094c430 R11: 0000000000000000 R12: ffffffff803b38b0
R13: ffff81011ede44d8 R14: ffffffff804d7d50 R15: ffff81011ff210f0
FS:  0000000002024870(0000) GS:ffff81011ff0dd00(0000)
...
Call Trace:
 [<ffffffff803b37e9>] kobject_put+0x19/0x20
 [<ffffffff803b389b>] kobject_del+0x2b/0x40
 [<ffffffff804d7d50>] delayed_delete+0x0/0xb0
 [<ffffffff804d7db9>] delayed_delete+0x69/0xb0
 [<ffffffff80249775>] run_workqueue+0x175/0x210
 [<ffffffff8024a411>] worker_thread+0x71/0xb0
 [<ffffffff8024d9e0>] autoremove_wake_function+0x0/0x40
 [<ffffffff8024a3a0>] worker_thread+0x0/0xb0
 [<ffffffff8024d5fd>] kthread+0x4d/0x80
 [<ffffffff8020c4b8>] child_rip+0xa/0x12
 [<ffffffff8020bbcf>] restore_args+0x0/0x30
 [<ffffffff8024d5b0>] kthread+0x0/0x80
 [<ffffffff8020c4ae>] child_rip+0x0/0x12

Code: f0 ff 0b 0f 94 c0 31 d2 84 c0 74 0b 48 89 df 41 ff d4 ba 01

What do we have here?  It barfs in kref_put() called from kobject_put().
It's -rc6-mm1 and I don't have -mm at hand.  Let's see if we can make
any sense out of it from the mainline - it might be a good idea for the
first pass, unless there are some clear indications to the contrary.

kref_put() is fairly low-level (and deep in call chain, here).  And it's
pretty small:
int kref_put(struct kref *kref, void (*release)(struct kref *kref))
{
        WARN_ON(release == NULL);
        WARN_ON(release == (void (*)(struct kref *))kfree);

        if (atomic_dec_and_test(&kref->refcount)) {
                release(kref);
                return 1;
        }
        return 0;
}

Poking around on the site (I'm not familiar enough with it, so bear with
me) gives a link to posting and an important detail missed in that page:

<1>Unable to handle kernel paging request at 0000000034333545 RIP:
[<ffffffff803b49a1>] kref_put+0x31/0x80

Now, that's interesting - we barf on dereferencing a pointer that (a) has
upper 32 bits zero and (b) doesn't have a bit 7 set in any byte.  Smells
like ASCII data misinterpreted as a pointer.  Conversion to ASCII gives
"E534", which sure as hell does look like a fragment of some string.

OK, so where does that happen?  In this case we have only one candidate,
really - atomic_dec_and_test(&kref->refcount).  Before it we only compare
argument with constants, after it we pass argument to another function.

Now, looking at the registers (see above) we notice that this address had
come from rbx.  Let's try objdump -d lib/kref.o and see what we've got there
for kref_put():

  4a:   55                      push   %rbp
  4b:   48 85 f6                test   %rsi,%rsi
  4e:   48 c7 c1 00 00 00 00    mov    $0x0,%rcx
  55:   ba 36 00 00 00          mov    $0x36,%edx
  5a:   48 89 e5                mov    %rsp,%rbp
  5d:   41 54                   push   %r12
  5f:   49 89 f4                mov    %rsi,%r12
  62:   53                      push   %rbx
  63:   48 89 fb                mov    %rdi,%rbx
  66:   74 15                   je     7d <kref_put+0x33>
  68:   48 81 fe 00 00 00 00    cmp    $0x0,%rsi
  6f:   75 26                   jne    97 <kref_put+0x4d>
  71:   48 c7 c1 00 00 00 00    mov    $0x0,%rcx
  78:   ba 37 00 00 00          mov    $0x37,%edx
  7d:   48 c7 c6 00 00 00 00    mov    $0x0,%rsi
  84:   48 c7 c7 00 00 00 00    mov    $0x0,%rdi
  8b:   31 c0                   xor    %eax,%eax
  8d:   e8 00 00 00 00          callq  92 <kref_put+0x48>
  92:   e8 00 00 00 00          callq  97 <kref_put+0x4d>
  97:   f0 ff 0b                lock decl (%rbx)
  9a:   0f 94 c0                sete   %al
  9d:   31 d2                   xor    %edx,%edx
  9f:   84 c0                   test   %al,%al
  a1:   74 0b                   je     ae <kref_put+0x64>
  a3:   48 89 df                mov    %rbx,%rdi
  a6:   41 ff d4                callq  *%r12
  a9:   ba 01 00 00 00          mov    $0x1,%edx
  ae:   5b                      pop    %rbx
  af:   41 5c                   pop    %r12
  b1:   c9                      leaveq
  b2:   89 d0                   mov    %edx,%eax
  b4:   c3                      retq

It's not necessarily identical to what that kernel had; still, not bad for
a starting point.  We are even lucky enough to find 'f0 ff' immediately
in there:
  97:   f0 ff 0b                lock decl (%rbx)
Next instructions also match - actually, better than I expected.  So.
We even have register allocation matching the reported kernel and it
all makes sense - this is the instruction where it had puked, the address
had, indeed, come from rbx and it's locked decrement of *rbx followed
by some testing and conditional jump.  Just what one would expect
from atomic_dec_and_test().
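
Finding the "Code:" bytes in a local build can also be done mechanically.
The sketch below is a plain byte search (find_bytes is a made-up helper);
the two arrays are transcribed from the oops report and from the tail of
the objdump listing above, starting at offset 0x8d:

```c
#include <stddef.h>
#include <string.h>

/* Offset of the first occurrence of needle in hay, or -1 if absent. */
static long find_bytes(const unsigned char *hay, size_t hay_len,
		       const unsigned char *needle, size_t needle_len)
{
	size_t i;

	if (needle_len == 0 || needle_len > hay_len)
		return -1;
	for (i = 0; i + needle_len <= hay_len; i++)
		if (memcmp(hay + i, needle, needle_len) == 0)
			return (long)i;
	return -1;
}

/* "Code:" bytes from the oops report. */
static const unsigned char oops_code[] = {
	0xf0, 0xff, 0x0b, 0x0f, 0x94, 0xc0, 0x31, 0xd2, 0x84, 0xc0,
	0x74, 0x0b, 0x48, 0x89, 0xdf, 0x41, 0xff, 0xd4, 0xba, 0x01,
};

/* Locally built kref_put(), objdump offsets 0x8d through 0xb4. */
#define LOCAL_TAIL_START 0x8d
static const unsigned char local_tail[] = {
	0xe8, 0x00, 0x00, 0x00, 0x00,	/* 8d: callq */
	0xe8, 0x00, 0x00, 0x00, 0x00,	/* 92: callq */
	0xf0, 0xff, 0x0b,		/* 97: lock decl (%rbx) */
	0x0f, 0x94, 0xc0,		/* 9a: sete %al */
	0x31, 0xd2,			/* 9d: xor %edx,%edx */
	0x84, 0xc0,			/* 9f: test %al,%al */
	0x74, 0x0b,			/* a1: je */
	0x48, 0x89, 0xdf,		/* a3: mov %rbx,%rdi */
	0x41, 0xff, 0xd4,		/* a6: callq *%r12 */
	0xba, 0x01, 0x00, 0x00, 0x00,	/* a9: mov $0x1,%edx */
	0x5b,				/* ae: pop %rbx */
	0x41, 0x5c,			/* af: pop %r12 */
	0xc9,				/* b1: leaveq */
	0x89, 0xd0,			/* b2: mov %edx,%eax */
	0xc3,				/* b4: retq */
};
```

An exact match like this is the lucky case; with a different compiler or
config the bytes usually differ and you are back to comparing by eye.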

IOW, we have &kref->refcount equal to 0x0000000034333545.  What's the
value of kref itself?  We could look into definition of struct kref,
or just notice that release(kref) should be right after that, so
we could see what gets passed to it.
  a3:   48 89 df                mov    %rbx,%rdi
  a6:   41 ff d4                callq  *%r12
is clearly it.  So what we are passing is rbx itself, so that offset got to
be zero.

All right, so we have kref_put() getting 0x0000000034333545 instead of a
pointer in the first argument.  It's called from kobject_put() and unless
-mm has changes in there, there's no need to guess where in kobject_put()
that had been:
void kobject_put(struct kobject * kobj)
{
        if (kobj)
                kref_put(&kobj->kref, kobject_release);
}

Aha.  Now, we need to work back from &kobj->kref to kobj.  Again, assuming
that -mm doesn't change struct kobject in a way that would affect the
offset, we have
struct kobject {
        const char              * k_name;
        struct kref             kref;
        struct list_head        entry;
...
Looks like it's not too likely, unless somebody had been deliberately
rearranging fields (to fit cachelines better, etc.).  We'll need to
verify that, of course, but for now it's a good starting assumption.
Very well, we have one pointer in front of it.  It's an amd64, so
it's an 8 byte field and no alignment padding is to be expected.
IOW, kobj is 0x0000000034333545 - 8, i.e. 0x000000003433353D.  What's _that_
in ASCII?  "=534".  OK, that makes even more sense for a part of some
string...
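
Both steps (working back from the member address to the containing
object, and reading a suspicious pointer as text) can be sketched in C.
This is a model, not kernel code: the struct below only mirrors the field
order quoted above, pointers are modelled as 64-bit integers to match
amd64, and the check for printable characters is a slightly stricter
stand-in for "no bit 7 set in any byte". All names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the start of struct kobject on amd64 (field order as quoted). */
struct kref_model {
	int32_t refcount;		/* atomic_t in the kernel */
};

struct kobject_model {
	uint64_t k_name;		/* const char * is 8 bytes on amd64 */
	struct kref_model kref;
};

/* Given the address of ->kref, return the address of the kobject. */
static uint64_t kobj_from_kref(uint64_t kref_addr)
{
	return kref_addr - offsetof(struct kobject_model, kref);
}

/*
 * If v has its upper 32 bits clear and its low four bytes are all
 * printable ASCII, store them in memory order (little-endian, lowest
 * byte first) in out and return 1.  Otherwise return 0.
 */
static int ascii_from_pointer(uint64_t v, char out[5])
{
	int i;

	if (v >> 32)
		return 0;
	for (i = 0; i < 4; i++) {
		unsigned char c = (unsigned char)(v >> (8 * i));

		if (c < 0x20 || c > 0x7e)
			return 0;
		out[i] = (char)c;
	}
	out[4] = '\0';
	return 1;
}
```

A real kernel pointer such as ffffffff803b49a1 fails the test, while both
0x34333545 and the value 8 bytes below it decode to short text fragments.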

Let's check; time to take a look at that patch, after all
 struct kobject {
-       const char              * k_name;
+       const char              *name;
        struct kref             kref;

OK, not a problem.  While we are at it, kobject_put() is unchanged by
the patch.  Now, back to trace.  kobject_put() is called from kobject_del().
Now, that one does differ from mainline:

void kobject_del(struct kobject * kobj)
{
	if (!kobj)
		return;

	sysfs_remove_dir(kobj);
	kobj->state_in_sysfs = 0;
	kobj_kset_leave(kobj);
	kobject_put(kobj->parent);
	kobj->parent = NULL;
}

Humm...  So we have kobj->parent containing crap.  What about the caller?
It's from drivers/md/md.c:
static void delayed_delete(struct work_struct *ws)
{
        mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
        kobject_del(&rdev->kobj);
}
and it's only used in
static void unbind_rdev_from_array(mdk_rdev_t * rdev)
{
        char b[BDEVNAME_SIZE];
        if (!rdev->mddev) {
                MD_BUG();
                return;
        }
        bd_release_from_disk(rdev->bdev, rdev->mddev->gendisk);
        list_del_init(&rdev->same_set);
        printk(KERN_INFO "md: unbind<%s>\n", bdevname(rdev->bdev,b));
        rdev->mddev = NULL;
        sysfs_remove_link(&rdev->kobj, "block");

        /* We need to delay this, otherwise we can deadlock when
         * writing to 'remove' to "dev/state"
         */
        INIT_WORK(&rdev->del_work, delayed_delete);
        schedule_work(&rdev->del_work);
}

Well, that takes care of the rest of trace - we have used INIT_WORK to set
rdev->del_work up, scheduled it for execution and eventually the callback
had been called (asynchronously to us).

So what do we have so far?
	unbind_rdev_from_array(rdev);
had been called and rdev->kobj.parent turned to contain a crap value
(0x000000003433353D) instead of a pointer.  That's about all we can
get out of trace.

Now, let's see what could have triggered it.  Curious... Looking through
diff shows an interesting bit:
-       rdev->kobj.parent = &mddev->kobj;
-       if ((err = kobject_add(&rdev->kobj)))
+       if ((err = kobject_add(&rdev->kobj, &mddev->kobj, "dev-%s", b)))
                goto fail;
in bind_rdev_to_array().  At the first sight the changes in kobject_add()
seem to match that.  And nothing else in md.c seems to be setting it to
anything non-NULL.  Very well, so it's one of the following:

	* unbind_rdev_from_array() called on rdev that didn't pass through
bind_rdev_to_array().
	* unbind_rdev_from_array() called on rdev that bailed out from
bind_rdev_to_array() before that point.
	* mddev value in the above is crap.  That's bloody unlikely, BTW -
kobject_add() would increment the refcount of rdev->kobj.parent (or we would
be in far more trouble, since it would not match kobject_del() and _that_
would hurt a lot more than just md.c).  So &mddev->kobj would better be
something saner when it went through that point.
	* *rdev got corrupted in between.

Actually, looking at the callers of unbind_rdev_from_array()...  We always
follow it with export_rdev().  Which does (presumably) final kobject_put()
on &rdev->kobj, freeing rdev itself.

What guarantees that it doesn't happen before we get to callback?  AFAICS,
nothing whatsoever...

And if it does happen, we'll get rdev happily freed (by rdev_free(), as
->release() of &rdev->kobj) by the time we get to delayed_delete().  Which
explains what's going on just fine.

BTW, -mm changes in kobject.c explain WTF it doesn't trigger in mainline -
there we managed to get away with that since kobject_add() bumped refcount
of kobject by one and kobject_del() decremented it.  That masked the bug.

From: Al Viro <viro@ZenIV.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: [oops report analysis] hfs_bnode_split() one (Arjan's #2753)
Date: Mon, 14 Jan 2008 19:42:01 UTC
Message-ID: <fa.xFJmGnjKIxaejJ3edBpFWYRNuqM@ifi.uio.no>

[summary for folks who want to skip blow-by-blow: it's missing check for
hfs_bnode_find() returning ERR_PTR(), there are 2 more places like that
in fs/hfs/* (all in brec.c) and graceful recovery may be non-obvious]

Text below is mostly for the benefit of newbies - it's more along the
lines of 'how to get from bug report to the source of bug', with more
details than normal.  FWIW, this might be worth doing on more or less
regular basis, especially if more people join the fun; everyone gets
their own set of tricks in that area and making it easier to gather
might help a lot of people.  It's not just about oops-tracing per se,
of course - Arjan's site gives a nice collection of those, so that
makes an obvious starting point.

Anyway, here's the report: http://www.kerneloops.org/raw.php?rawid=2753&msgid=

We immediately see that it's i386 (by register names in dump, for one
thing).  The kernel in question is called 2.6.24-rc7-gcdf71a10; everyone
and his dog got a naming scheme of their own, but that one is common
enough - <version>-g<beginning of git tag>.  Which might be from any
number of git repositories, of course, but it doesn't hurt to check the
mainline:
; git log cdf71a|head -5
commit cdf71a10c7b6432d9b48e292cca2c62a0b9fa6cf
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Tue Jan 8 19:47:38 2008 +0100

    futex: Prevent stale futex owner when interrupted/timeout
;

We are in luck - it _is_ pure mainline.  In any case, 2.6.24-rc7 would be
a good starting point, but here we have exact tree.  Very well.
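
If you do this often, pulling the abbreviated id out of the version
string is easy to script. A sketch in C, assuming the <version>-g<hex>
convention described above (git_id_from_version is a hypothetical name,
and a suffix like "-generic" would fool it):

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the run of hex digits following the last "-g" in a kernel version
 * string into out.  Returns 0 on success, -1 if there is no such suffix.
 */
static int git_id_from_version(const char *ver, char *out, size_t cap)
{
	const char *p = NULL, *q = ver;
	size_t n = 0;

	if (cap == 0)
		return -1;
	while ((q = strstr(q, "-g")) != NULL) {
		q += 2;
		p = q;			/* remember the last match */
	}
	if (!p || !isxdigit((unsigned char)*p))
		return -1;
	while (isxdigit((unsigned char)p[n]) && n + 1 < cap) {
		out[n] = p[n];
		n++;
	}
	out[n] = '\0';
	return 0;
}
```

The result is what gets handed to git log, as in the command above.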

] BUG: unable to handle kernel paging request at virtual address fffffff8
Memory access close to 0.
] EIP is at hfs_bnode_split+0x2aa/0x340
A search in the tree shows that it's either fs/hfs or fs/hfsplus.  Perhaps the
stuff deeper in call chain will tell which one it is.  It's not without
danger, BTW - call trace is to be taken with grain of salt, since it might
be not exact.  In this case we see
] [<c02558ac>] hfs_create+0x3c/0x80
] [<c0179bb8>] vfs_create+0x88/0xd0
in there, which looks like transition from core kernel to fs code, and
hfs_create() is present only in fs/hfs.  So we start looking into fs/hfs.
_However_, hfsplus is worth looking into anyway - there's a good chance
that these two hfs_bnode_split() have shared ancestor, so bugs in one
might very well be present in both.

Now, Arjan's site has the oopsing code already disassembled:
   0:   89 50 0c                mov    %edx,0xc(%eax)
   3:   8b 4d dc                mov    0xffffffdc(%ebp),%ecx
   6:   83 e9 0e                sub    $0xe,%ecx
   9:   89 4d e0                mov    %ecx,0xffffffe0(%ebp)
   c:   29 48 14                sub    %ecx,0x14(%eax)
   f:   29 48 1c                sub    %ecx,0x1c(%eax)
  12:   e9 58 fe ff ff          jmp    0xfffffe6f
  17:   8b 45 cc                mov    0xffffffcc(%ebp),%eax
  1a:   e8 62 f4 ff ff          call   0xfffff481
  1f:   8b 55 d4                mov    0xffffffd4(%ebp),%edx
  22:   31 c9                   xor    %ecx,%ecx
  24:   89 c3                   mov    %eax,%ebx
  26:   8b 42 08                mov    0x8(%edx),%eax
  29:   89 f2                   mov    %esi,%edx
  2b:*  89 43 04                mov    %eax,0x4(%ebx)     <-- trapping instruction
  2e:   89 d8                   mov    %ebx,%eax
  30:   c7 04 24 0e 00 00 00    movl   $0xe,(%esp)
  37:   e8 85 f0 ff ff          call   0xfffff0c1
  3c:   8b 43 04                mov    0x4(%ebx),%eax
  3f:   31                      .byte 0x31

Nevermind that .byte 0x31 - it's just a truncated instruction (we have one
starting with 31 just several lines above).  The rest looks reasonable
enough - there might be some frame errors in the beginning, but everything
looks sane around the actual location of oops.

And sure enough, it's a memory access - *(u32 *)(ebx + 4) = eax.  Since
we'd stepped into 0xfffffff8, ebx got to be 0xfffffff4.  Which is confirmed
by the register dump in the beginning of report, so everything fits.

That's actually a nice hint - we have something just below 0 handled as
a pointer.  That's what ERR_PTR() is - in this case we have ebx equal to
ERR_PTR(-12), whatever that might be.  It may be a red herring, of course,
since there's any number of possible ways to get that value, but it's
worth looking into.  Now, _if_ that's some error value, which one would it
be?  Different architectures have slightly diverging error values, but
beginning is common for all of them and sits in asm-generic/errno-base.h
And there we have
#define ENOMEM          12      /* Out of memory */

Well.  So ebx is ERR_PTR(-ENOMEM), for whatever it might be worth.
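
The same decoding can be written as a couple of helpers. This is a
userland sketch operating on raw 32-bit register values, not the kernel's
own macros; MAX_ERRNO is the bound the kernel's IS_ERR_VALUE() uses, and
the function names are made up:

```c
#include <stdint.h>

#define MAX_ERRNO 4095

/* Is a 32-bit register value in the ERR_PTR() range [-MAX_ERRNO, -1]? */
static int is_err_value32(uint32_t reg)
{
	return reg >= (uint32_t)-MAX_ERRNO;
}

/* Positive errno encoded in reg, or 0 if reg is not an error pointer. */
static int errno_from_reg32(uint32_t reg)
{
	return is_err_value32(reg) ? (int)(0u - reg) : 0;
}
```

For the fault at fffffff8 through 0x4(%ebx), ebx is fffffff4, which
decodes to 12.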

Now, we need to find where in function that might be.  It's a big one -
according to the trace we are at hfs_bnode_split+0x2aa/0x340, so it's
closer to the end of nearly kilobyte worth of code.  Of course, that
might include something inlined and/or be in a function between
hfs_bnode_split() and the next one we know about (that's not too likely,
though - hfs_bnode_split() itself is static, so we had enough debug
info surviving to see those).

Anyway, it's big - about 100 lines of C in fs/hfs/brec.c.  Not fun -
finding a match will require more information about what that code is
doing.  OK, we have a function call soon after the offending instruction;
let's take a look at it:
  1f:   8b 55 d4                mov    0xffffffd4(%ebp),%edx
  22:   31 c9                   xor    %ecx,%ecx
  24:   89 c3                   mov    %eax,%ebx
  26:   8b 42 08                mov    0x8(%edx),%eax
  29:   89 f2                   mov    %esi,%edx
  2b:*  89 43 04                mov    %eax,0x4(%ebx)
  2e:   89 d8                   mov    %ebx,%eax
  30:   c7 04 24 0e 00 00 00    movl   $0xe,(%esp)
  37:   e8 85 f0 ff ff          call   0xfffff0c1

So we are stuffing some values in a bunch of registers *and* top of the
stack, then calling something.  That's a normal calling sequence for i386
kernel - unlike userland code it's built with -mregparm=3, which makes
the first 3 arguments passed in eax, edx and ecx resp.; the rest goes on
stack as usual.

IOW, that's some_function(eax, edx, ecx, 0xe).  With assignments before
the call itself setting these up.  What do we have there?
	edx = *(u32 *)(ebp - 0x2c);
	ecx = 0;
	ebx = eax;
	eax = *(u32 *)(edx + 8);
	edx = esi;
	*(u32 *)(ebx + 4) = eax;
	eax = ebx;
	some_function(eax, edx, ecx, 0xe);

Ouch.  Let's untangle that a bit.  We have something assigned to edx,
used a bit later and then edx is reassigned.  Ditto for eax.  Let's
massage that into something simpler:
	p = *(u32 *)(ebp - 0x2c);
	ecx = 0;
	ebx = eax;
	q = *(u32 *)(p + 8);
	edx = esi;
	*(u32 *)(ebx + 4) = q;
	eax = ebx;
	some_function(eax, edx, ecx, 0xe);
Aha.  So ebx gets the original value of eax, we use eax a bit and then
restore it to what it used to be.
	p = *(u32 *)(ebp - 0x2c);
	*(u32 *)(eax + 4) = *(u32 *)(p + 8);
	ebx = eax;
	ecx = 0;
	edx = esi;
	some_function(eax, edx, ecx, 0xe);
I.e.
	ebx = eax;
	p = *(u32 *)(ebp - 0x2c);
	*(u32 *)(eax + 4) = *(u32 *)(p + 8);
	some_function(eax, esi, 0, 0xe);
and we get into that with eax == ERR_PTR(-ENOMEM).  Unfortunately, we have
a bunch of 4-argument function calls in there:
        hfs_bnode_write(new_node, &node_desc, 0, sizeof(node_desc));
        hfs_bnode_read(node, &node_desc, 0, sizeof(node_desc));
        hfs_bnode_write(node, &node_desc, 0, sizeof(node_desc));
                hfs_bnode_read(next_node, &node_desc, 0, sizeof(node_desc));
                hfs_bnode_write(next_node, &node_desc, 0, sizeof(node_desc));

Well, well...  So sizeof(node_desc) is 0xe (14), esi contains &node_desc
and eax is new_node, node or next_node.  And function is either
hfs_bnode_read() or hfs_bnode_write().  So <something>node turned out
to be ERR_PTR(-ENOMEM).
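
If you don't trust your own untangling of the register shuffle above, it
can be checked on a toy machine. The sketch below is an illustration
only: a handful of registers plus a small word array standing in for
memory, with the transcribed sequence and the simplified one run on
copies of the same starting state.

```c
#include <stdint.h>
#include <string.h>

#define MEM_WORDS 64

struct machine {
	uint32_t eax, ebx, ecx, edx, esi, ebp;
	uint32_t mem[MEM_WORDS];	/* toy memory, 4-byte aligned access */
};

static uint32_t load(const struct machine *m, uint32_t addr)
{
	return m->mem[addr / 4];
}

static void store(struct machine *m, uint32_t addr, uint32_t val)
{
	m->mem[addr / 4] = val;
}

/* The assignments exactly as transcribed from the disassembly. */
static void run_original(struct machine *m)
{
	m->edx = load(m, m->ebp - 0x2c);
	m->ecx = 0;
	m->ebx = m->eax;
	m->eax = load(m, m->edx + 8);
	m->edx = m->esi;
	store(m, m->ebx + 4, m->eax);
	m->eax = m->ebx;
}

/* The untangled form: one store, then the argument registers. */
static void run_simplified(struct machine *m)
{
	uint32_t p;

	m->ebx = m->eax;
	p = load(m, m->ebp - 0x2c);
	store(m, m->eax + 4, load(m, p + 8));
	m->edx = m->esi;
	m->ecx = 0;
}

/* 1 if both versions leave identical registers and memory behind. */
static int sequences_agree(const struct machine *start)
{
	struct machine a = *start, b = *start;

	run_original(&a);
	run_simplified(&b);
	return memcmp(&a, &b, sizeof(a)) == 0;
}
```

One starting state proves nothing in general, of course, but it catches
the usual slip of losing track of which eax or edx a line refers to.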

Where had that eax come from?  Just before all these assignments we
had
  1a:   e8 62 f4 ff ff          call   0xfffff481
so eax holds the result of some function call.

That's very useful, since we know we'd just set whatever variable that
had been *and* the value being assigned had come from a function call.

Where do we assign anything to those?  A look at the function shows that
we never take address of any of those, so it got to be direct assignment
right in our function.  And each is assigned to only once, actually:
        node = fd->bnode;
        new_node = hfs_bmap_alloc(tree);
	struct hfs_bnode *next_node = hfs_bnode_find(tree, new_node->next);

Now, the first assignment is out - we assign fd->bnode, it's not a function
call *and* our function has no idea how fd->bnode itself had been initialized.
So it's not node.  Assignment to new_node is more promising, but (a) it's
too early in the function and (b) it's immediately followed by
        if (IS_ERR(new_node))
                return new_node;
The latter is a killer - if we had ERR_PTR() in new_node, we'd have left
immediately.

That leaves only two possibilities: next_node or something weird (stack
corruption, miscompiled code, etc.).  Let's see:
	struct hfs_bnode *next_node = hfs_bnode_find(tree, new_node->next);
	next_node->prev = new_node->this;
	hfs_bnode_read(next_node, &node_desc, 0, sizeof(node_desc));
	node_desc.prev = cpu_to_be32(next_node->prev);
	hfs_bnode_write(next_node, &node_desc, 0, sizeof(node_desc));
	hfs_bnode_put(next_node);
and that's it.  Now, if next_node ends up initialized to ERR_PTR(...), the
first assignment will instantly barf, just as it does in report.

All right, that sounds plausible and we narrowed the location down enough
to verify that code matches.  Certainly worth considering before we go
into possible stack corruption or nastier zebras.

; make fs/hfs/brec.s
;

and we can look into whatever the hell hfs_bnode_split() had compiled into.
There's only one call of hfs_bnode_find() in the entire function, so simply
searching for beginning of hfs_bnode_split and then for mentioning of
hfs_bnode_find gets us to
        call    hfs_bnode_find  #
        xorl    %ecx, %ecx      #
        leal    -26(%ebp), %edx #,
        movl    %eax, %ebx      #, next_node
        movl    8(%esi), %eax   # <variable>.this, <variable>.this
        movl    %eax, 4(%ebx)   # <variable>.this, <variable>.prev
        movl    %ebx, %eax      # next_node, next_node
        movl    $14, (%esp)     #,
        call    hfs_bnode_read  #
        movl    4(%ebx), %eax   # <variable>.prev, <variable>.prev
        xorl    %ecx, %ecx      #
Similar, but not quite...  What are the differences?  edx ends up being
ebp - 26 instead of esi in report.  We still set *(u32 *)((void *)next_node + 4)
(actually, next_node->prev) to *(u32 *)(something + 8) (new_node->this), but
in locally built code this 'something' is esi and in report it's
*(u32 *)(ebp - 0x2c).  Other than that, they are equivalent.

Let's see what it means: in local build we have ebp - 26 for &node_desc
and esi for new_node; in report we have esi for &node_desc and ebp - 0x2c
for &new_node.  Chalk it up to compiler differences - the local build used
esi to keep the value of new_node and report has esi used to store the
value of &node_desc, with new_node living on the stack.  If local variable
does not end up in a register, its address will be ebp - constant, so
everything's plausible.
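
The offsets 4 and 8 seen in both listings can be tied back to field names
with offsetof. The struct below is a model for i386 and not the real
definition: the first member is a placeholder for whatever 4-byte field
precedes ->prev, and ->this is spelled this_ here.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the start of struct hfs_bnode as laid out on i386. */
struct hfs_bnode_model {
	uint32_t first;		/* placeholder: some 4-byte field */
	uint32_t prev;		/* 4(%ebx) in the listings */
	uint32_t this_;		/* ->this in the kernel, 8(%esi) */
};

/* Name of the modelled field at a given byte offset, or "?" if unknown. */
static const char *bnode_field_at(size_t off)
{
	if (off == offsetof(struct hfs_bnode_model, prev))
		return "prev";
	if (off == offsetof(struct hfs_bnode_model, this_))
		return "this";
	return "?";
}
```

In a tree built with debug info you would ask gdb for the real layout
instead of modelling it by hand.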

OK, we've done about as much as we could with assembler.  It's either
hfs_bnode_find() returning ERR_PTR(-ENOMEM) in
	struct hfs_bnode *next_node = hfs_bnode_find(tree, new_node->next);
	next_node->prev = new_node->this;
with resulting oops (consistent with what we'd observed) or something far
weirder (miscompile/stack corruption/cosmic rays/whatnot).

The next question, of course, is whether hfs_bnode_find() could have
returned ERR_PTR(-ENOMEM).  And sure enough, it could - we have explicit
                return ERR_PTR(-ENOMEM);
and
        return ERR_PTR(-EIO);
in failure exits there.  Moreover, searching for other callers of the
same function shows that almost all of them are followed by check for
IS_ERR(what_we_just_got_from_it).   There are 3 exceptions:
fs/hfs/brec.c:147:              fd->bnode = hfs_bnode_find(tree, new_node->parent);
fs/hfs/brec.c:319:              struct hfs_bnode *next_node = hfs_bnode_find(tree, new_node->next);
fs/hfs/brec.c:411:              fd->bnode = hfs_bnode_find(tree, new_node->parent);

The second one is our bug, other two also appear to be broken.  We are not
quite done yet, though - just adding a check is not enough, we need to decide
what to do if it triggers.  Moreover, it might be that problem is elsewhere
(i.e. that in this place we can't legitimately get ERR_PTR() and adding
a check would just paper over the real bug).  It doesn't look that way,
though - node->next is a number, not a pointer and I don't see anything
that would guarantee anything about it.  hfs_bnode_find() does a lookup
by the number it got in the second argument in some hash and if it's not
found there, we do allocation.  Which can always fail, since we might
be allocating a lot.

Recovery from failure is nasty - we appear to have a double-linked list
of some kind, we'd already split an element in it and written it (presumably
to disk) and now we want to update the pointer going back from the next
element.  Leaving the damn thing as-is means corrupted on-disk list.
Basically, by that point it's too late to do anything.  So the obvious
solution would appear to get the next node earlier, before we'd done
anything.

Other two are nastier - especially the last one, by the look of it.  IOW,
time to pawn it off to maintainer or to do full review of that code with
failure modes on out-of-memory in mind...  Guess which one I'd prefer
for now.

One more thing: remember hfsplus cousin of that function?  As expected,
a look at it shows that the same place has the same problem.  Also needs
treatment.

At that point it's definitely time to dump the entire thing on maintainer:
	* we have a very likely explanation of bug report
	* we have found several more places that have the same breakage
	* we even have a tentative fix for the one that had started the
entire mess
	* further work in that direction requires a full review of quite
a pile of code and comments from maintainer will be needed anyway.

Maintainer for HFS is easily found - it's Roman.  HFS+ is not mentioned
in MAINTAINERS, AFAICS, but git log fs/hfsplus shows commits that went
through Roman's hands, so he's stuck with that part of fun too.  Or at
least ought to know where to bounce it.

Preliminary patch for hfs_bnode_split() follows, other instances left
alone for now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/hfs/brec.c b/fs/hfs/brec.c
index 8626ee3..d7971d5 100644
--- a/fs/hfs/brec.c
+++ b/fs/hfs/brec.c
@@ -229,7 +229,7 @@ skip:
 static struct hfs_bnode *hfs_bnode_split(struct hfs_find_data *fd)
 {
 	struct hfs_btree *tree;
-	struct hfs_bnode *node, *new_node;
+	struct hfs_bnode *node, *new_node, *next_node;
 	struct hfs_bnode_desc node_desc;
 	int num_recs, new_rec_off, new_off, old_rec_off;
 	int data_start, data_end, size;
@@ -248,6 +248,17 @@ static struct hfs_bnode *hfs_bnode_split(struct hfs_find_data *fd)
 	new_node->type = node->type;
 	new_node->height = node->height;

+	if (node->next)
+		next_node = hfs_bnode_find(tree, node->next);
+	else
+		next_node = NULL;
+
+	if (IS_ERR(next_node)) {
+		hfs_bnode_put(node);
+		hfs_bnode_put(new_node);
+		return next_node;
+	}
+
 	size = tree->node_size / 2 - node->num_recs * 2 - 14;
 	old_rec_off = tree->node_size - 4;
 	num_recs = 1;
@@ -261,6 +272,8 @@ static struct hfs_bnode *hfs_bnode_split(struct hfs_find_data *fd)
 		/* panic? */
 		hfs_bnode_put(node);
 		hfs_bnode_put(new_node);
+		if (next_node)
+			hfs_bnode_put(next_node);
 		return ERR_PTR(-ENOSPC);
 	}

@@ -315,8 +328,7 @@ static struct hfs_bnode *hfs_bnode_split(struct hfs_find_data *fd)
 	hfs_bnode_write(node, &node_desc, 0, sizeof(node_desc));

 	/* update next bnode header */
-	if (new_node->next) {
-		struct hfs_bnode *next_node = hfs_bnode_find(tree, new_node->next);
+	if (next_node) {
 		next_node->prev = new_node->this;
 		hfs_bnode_read(next_node, &node_desc, 0, sizeof(node_desc));
 		node_desc.prev = cpu_to_be32(next_node->prev);
