Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: 2.6 /proc/interrupts fails on systems with many CPUs
Original-Message-ID: <Pine.LNX.4.44.0311111019210.30657-100000@home.osdl.org>
Date: Tue, 11 Nov 2003 18:27:21 GMT
Message-ID: <fa.kmhfiob.183qk0n@ifi.uio.no>

On Tue, 11 Nov 2003, Martin J. Bligh wrote:
>
> I think it'd make more sense to only use vmalloc when it's explicitly
> too big for kmalloc - or simply switch on num_online_cpus > 100 or
> whatever a sensible cutoff is (ie nobody but you would ever see this ;-))

No, please please please don't do these things.

vmalloc() is NOT SOMETHING YOU SHOULD EVER USE! It's only valid when
you _need_ a big array, and you don't have any choice. It's slow, and it's
a very restricted resource: it's a global resource that is literally
restricted to a few tens of megabytes. It should be _very_ carefully used.

There are basically no valid new uses of it. There's a few valid legacy
users (I think the file descriptor array), and there are some drivers that
use it (which is crap, but drivers are drivers), and it's _really_ valid
only for modules. Nothing else.

Basically: if you think you need more memory than a kmalloc() can give,
you need to re-organize your data structures so that you either don't
need one big area at all, or can allocate it in chunks.
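
As an illustration of what "allocate it in chunks" means, here is a
hand-wavy, untested sketch (all the names are made up): a two-level
table of kmalloc'ed chunks instead of one flat vmalloc'ed array.

#include <linux/kernel.h>
#include <linux/slab.h>

#define CHUNK_ENTRIES	256

struct stat_entry {
	unsigned long count;
};

struct stat_table {
	unsigned int nr_chunks;
	struct stat_entry *chunk[];	/* each chunk kmalloc'ed separately */
};

static struct stat_table *stat_table_alloc(unsigned int nr_entries)
{
	unsigned int nr_chunks = DIV_ROUND_UP(nr_entries, CHUNK_ENTRIES);
	struct stat_table *t;
	unsigned int i;

	t = kzalloc(sizeof(*t) + nr_chunks * sizeof(t->chunk[0]), GFP_KERNEL);
	if (!t)
		return NULL;
	t->nr_chunks = nr_chunks;
	for (i = 0; i < nr_chunks; i++) {
		t->chunk[i] = kcalloc(CHUNK_ENTRIES,
				      sizeof(struct stat_entry), GFP_KERNEL);
		if (!t->chunk[i])
			goto fail;
	}
	return t;

fail:
	while (i--)
		kfree(t->chunk[i]);
	kfree(t);
	return NULL;
}

/* Lookup costs one extra pointer dereference, and no single
 * allocation is ever larger than one chunk. */
static struct stat_entry *stat_entry_at(struct stat_table *t, unsigned int i)
{
	return &t->chunk[i / CHUNK_ENTRIES][i % CHUNK_ENTRIES];
}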

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: 2.6 /proc/interrupts fails on systems with many CPUs
Original-Message-ID: <Pine.LNX.4.44.0311111033340.30657-100000@home.osdl.org>
Date: Tue, 11 Nov 2003 18:38:24 GMT
Message-ID: <fa.kn1lj85.18jgkgt@ifi.uio.no>

On Tue, 11 Nov 2003, Martin J. Bligh wrote:
>
> OK, I was actually trying to avoid the use of vmalloc, instead of the
> unconditional conversion to vmalloc, which is what the original patch did ;-)

Yes, I realize that, but it's the old case of

  "I'm totally faithful to my husband - I never sleep with other men when
   he is around"

joke.

Basically, if it's wrong to use, it's wrong to use even occasionally. In
fact, having two different code-paths just makes the code worse.

Yes, I realize that sometimes you have to do it that way, and it might be
the simplest way to fix something. In this case, though, the cost and
fragility of a generic interface are not worth it, since the problem isn't
actually in the generic code at all.

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: 2.6 /proc/interrupts fails on systems with many CPUs
Original-Message-ID: <Pine.LNX.4.44.0311111007350.30657-100000@home.osdl.org>
Date: Tue, 11 Nov 2003 18:19:06 GMT
Message-ID: <fa.kn1nig9.18jik8p@ifi.uio.no>

On Tue, 11 Nov 2003, Erik Jacobson wrote:
>
> I'm looking for suggestions on how to fix this.  I came up with one fix
> that seems to work OK for ia64.  I have attached it to this message.
> I'm looking for advice on what should be proposed for the real fix.

This is not the real fix.

Allowing people to use up vmalloc() space by opening the /proc files would
be a major DoS attack. Not worth it.

Instead, just make /proc/interrupts use the proper _sequence_ things, so
that instead of trying to print out everything in one go, you have the
"s_next()" thing to print them out one at a time. The seqfile interfaces
will then do the right thing with blocking/caching, and you only need a
single page.

Al - do we have some good documentation of how to use the seq-file
interface?

In the meantime, without documentation, the best place to look is just at
other examples. One such example would be the kernel/kallsyms.c case: see
how it does s_start/s_show/s_next/s_stop (or /proc/slabinfo, or vmstat, or
any number of them).
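
The pattern all of those follow looks roughly like this (an untested
sketch; the actual per-irq output is elided):

#include <linux/fs.h>
#include <linux/interrupt.h>
#include <linux/seq_file.h>

/* One irq per ->show() call, so the seq_file layer never has to
 * buffer more than a single record - one page is plenty. */
static void *int_start(struct seq_file *m, loff_t *pos)
{
	return (*pos < NR_IRQS) ? pos : NULL;
}

static void *int_next(struct seq_file *m, void *v, loff_t *pos)
{
	(*pos)++;
	return (*pos < NR_IRQS) ? pos : NULL;
}

static void int_stop(struct seq_file *m, void *v)
{
}

static int int_show(struct seq_file *m, void *v)
{
	int irq = *(loff_t *)v;

	seq_printf(m, "%3d:\n", irq);	/* real code prints per-cpu counts */
	return 0;
}

static struct seq_operations int_seq_ops = {
	.start	= int_start,
	.next	= int_next,
	.stop	= int_stop,
	.show	= int_show,
};

static int int_open(struct inode *inode, struct file *file)
{
	return seq_open(file, &int_seq_ops);
}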

		Linus



From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [git patches] xfs and block fixes for virtually indexed arches
Date: Thu, 17 Dec 2009 16:47:36 UTC
Message-ID: <fa./QHQfxNy0ll6ju1xuIgCjLjh0uc@ifi.uio.no>

On Thu, 17 Dec 2009, tytso@mit.edu wrote:
>
> That's because apparently the iSCSI and DMA blocks assume that they
> have Real Pages (tm) passed to block I/O requests, and apparently XFS
> ran into problems when sending vmalloc'ed pages.  I don't know if this
> is a problem if we pass the bio layer addresses coming from the SLAB
> allocator, but oral tradition seems to indicate this is problematic,
> although no one has given me the full chapter and verse explanation
> about why this is so.

kmalloc() memory should be ok. It's backed by "real pages". Doing the DMA
translations for such pages is trivial and fundamental.

In contrast, vmalloc is pure and utter unadulterated CRAP. The pages
may be contiguous virtually, but that makes no difference to the block
layer, which has to be able to do IO by DMA anyway, and so has to look up
the page translations in the page tables etc. Crazy sh*t.
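
To make that concrete, compare the two cases (a totally untested
fragment; "dev" is whatever device is doing the IO):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/dma-mapping.h>

static void dma_example(struct device *dev)
{
	void *kbuf, *vbuf;
	dma_addr_t handle;

	/* kmalloc memory lives in the linear mapping: finding the
	 * physical address is plain arithmetic, so setting up DMA
	 * for it is trivial. */
	kbuf = kmalloc(512, GFP_KERNEL);
	if (kbuf) {
		handle = dma_map_single(dev, kbuf, 512, DMA_TO_DEVICE);
		dma_unmap_single(dev, handle, 512, DMA_TO_DEVICE);
		kfree(kbuf);
	}

	/* vmalloc memory is only virtually contiguous: every single
	 * page has to be chased through the kernel page tables before
	 * anything physical (like DMA) can be done with it. */
	vbuf = vmalloc(2 * PAGE_SIZE);
	if (vbuf) {
		struct page *page = vmalloc_to_page(vbuf); /* pt walk */

		if (page)
			pr_debug("page 0 of vbuf: %p\n", page);
		vfree(vbuf);
	}
}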

So passing vmalloc'ed page addresses around to something that will
eventually do a non-CPU-virtual thing on them is fundamentally insane. The
vmalloc space is about CPU virtual addresses. Such concepts simply do not
-exist- for some random block device.

> Now that I see Linus's complaint, I'm wondering if the issue is really
> about kernel virtual addresses (i.e., coming from vmalloc), and not a
> requirement for Real Pages (i.e., coming from the SLAB allocator as
> opposed to get_free_page).  And can this be documented someplace?  I
> tried looking at the bio documentation, and couldn't find anything
> definitive on the subject.

The whole "vmalloc is special" thing has always been true. If you want to
treat vmalloc as normal memory, you need to look up the pages yourself. We
have helpers for that (including helpers that populate vmalloc space from
a page array to begin with - so you can _start_ from some array of pages
and then lay them out virtually if you want convenient CPU access to the
array).
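
Concretely, I'm talking about vmalloc_to_page() and vmap(). An untested
sketch of both directions (the wrapper functions are just illustration):

#include <linux/mm.h>
#include <linux/vmalloc.h>

/* From a vmalloc'ed address to the real page behind it: this is
 * the lookup you have to do yourself, one page at a time. */
static struct page *page_behind(const void *vaddr)
{
	return vmalloc_to_page(vaddr);
}

/* Or start from an array of real pages and lay them out virtually.
 * The pages are what you hand to the block layer; the vmap'ed
 * address is only for convenient CPU access. */
static void *lay_out_virtually(struct page **pages, unsigned int nr)
{
	return vmap(pages, nr, VM_MAP, PAGE_KERNEL);
}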

And this whole "vmalloc is about CPU virtual addresses" is so obviously
and fundamentally true that I don't understand how anybody can ever be
confused about it. The "v" in vmalloc is for "virtual" as in virtual
memory.

Think of it like virtual user addresses. Does anybody really expect to be
able to pass a random user address to the BIO layer?

And if you do, I would suggest that you get out of kernel programming
pronto. You're a danger to society, and have a lukewarm IQ. I don't want
you touching kernel code.

And no, I do _not_ want the BIO layer having to walk page tables. Not for
vmalloc space, not for user virtual addresses.

(And don't tell me it already does. Maybe somebody sneaked it in past me,
without me ever noticing. That wouldn't be an excuse, that would be just
sad. Jesus wept)

			Linus


From: tytso@mit.edu
Newsgroups: fa.linux.kernel
Subject: Re: [git patches] xfs and block fixes for virtually indexed arches
Date: Thu, 17 Dec 2009 17:40:32 UTC
Message-ID: <fa.owq5ds3zojXPqhywEyV61B6y5cQ@ifi.uio.no>

On Thu, Dec 17, 2009 at 08:46:33AM -0800, Linus Torvalds wrote:
> kmalloc() memory should be ok. It's backed by "real pages". Doing the DMA
> translations for such pages is trivial and fundamental.

Sure, but there's some rumors/oral traditions going around that some
block devices want bio addresses which are page-aligned, because they
want to play some kind of refcounting game, and if you pass them
kmalloc() memory, they will explode in some interesting and
entertaining way.  And it's Weird Shit(tm) (aka iSCSI, AoE) type
drivers that most of us don't have access to, so just because it
works Just Fine on SATA doesn't mean anything.

And none of this is documented anywhere, which is frustrating as hell.
Just rumors that "if you do this, AoE/iSCSI will corrupt your file
systems".

    	    	    	       		       - Ted


From: Jens Axboe <jens.axboe@oracle.com>
Newsgroups: fa.linux.kernel
Subject: Re: [git patches] xfs and block fixes for virtually indexed arches
Date: Thu, 17 Dec 2009 19:37:01 UTC
Message-ID: <fa.0vm33/UFkFPbVmTwhzUSgNW9YFk@ifi.uio.no>

On Thu, Dec 17 2009, Linus Torvalds wrote:
>
>
> On Thu, 17 Dec 2009, tytso@mit.edu wrote:
> >
> > Sure, but there's some rumors/oral traditions going around that some
> > block devices want bio addresses which are page-aligned, because they
> > want to play some kind of refcounting game,
>
> Yeah, you might be right at that.
>
> > And it's Weird Shit(tm) (aka iSCSI, AoE) type drivers that most of us
> > don't have access to, so just because it works Just Fine on SATA doesn't
> > mean anything.
> >
> > And none of this is documented anywhere, which is frustrating as hell.
> > Just rumors that "if you do this, AoE/iSCSI will corrupt your file
> > systems".
>
> ACK. Jens?

I've heard those rumours too, and I don't even know if they are true.
Who has a pointer to such a bug report and/or issue? The block layer
itself doesn't have any such requirements, and the only place where
we play page games is for bio's that were explicitly mapped with pages
by the block layer itself (like mapping user data).

We fix driver crap like that, we don't work around it. It's a BUG.

--
Jens Axboe



From: tytso@mit.edu
Newsgroups: fa.linux.kernel
Subject: Re: [git patches] xfs and block fixes for virtually indexed arches
Date: Fri, 18 Dec 2009 14:18:15 UTC
Message-ID: <fa.c4MnMw5pLH6qKdl2D0a8lrr9FrQ@ifi.uio.no>

On Fri, Dec 18, 2009 at 09:21:30AM +0900, FUJITA Tomonori wrote:
>
> iSCSI initiator driver should work with kmalloc'ed memory.
>
> The reason why iSCSI didn't work with kmalloc'ed memory is that it
> uses sendpage (which needs refcountable pages). We added a workaround
> to not use sendpage with kmalloc'ed memory (it would be great if we
> could remove the workaround, though).

Well, given that we have general agreement that it is a block device
driver BUG not to accept kmalloc'ed/slab-allocated memory, I plan to
push a patch where ext4 will use kmalloc'ed memory on occasion: when it
needs to make a shadow copy of buffers for journalling purposes AND
when the fs block size is smaller than the page size (i.e., no more
allocating a 16k page when the fs block size is 4k).  So this won't
happen all the time; even in the case of a 16k-page Itanium system
with 4k blocks, the bulk of the data won't be sent via kmalloc'ed
memory --- just some critical metadata blocks and some data blocks
that need to be escaped when being written into the journal.

I do think we need to document that block device drivers are
_expected_ to be able to handle kmalloc'ed memory, and if they can't,
#1 they should do a BUG_ON instead of corrupting users' data, and #2,
if they do corrupt data, we should send the angry users with corrupted
file systems to bang at the doors of the block device authors.
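
Something as dumb as this (untested; the helper names are made up, but
the test is the one the iSCSI sendpage workaround implies) at the top
of the driver's submission path would do:

#include <linux/bug.h>
#include <linux/mm.h>

/* A page is only safe for refcounting games (sendpage and friends)
 * if it is not a slab page and actually has a reference to take. */
static int page_is_refcountable(struct page *page)
{
	return !PageSlab(page) && page_count(page) >= 1;
}

/* Fail loudly up front instead of corrupting data silently. */
static void driver_check_page(struct page *page)
{
	BUG_ON(!page_is_refcountable(page));
}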

					- Ted




