The slab allocator (Linus Torvalds)

Index Home About Blog

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC PATCH] greatly reduce SLOB external fragmentation
Date: Thu, 10 Jan 2008 16:15:16 UTC
Message-ID: <fa.tA+5wrogBYVvqeI8EmEWw8xNz+w@ifi.uio.no>

On Thu, 10 Jan 2008, Pekka J Enberg wrote:
>
> We probably don't have the same version of GCC which perhaps affects
> memory layout (struct sizes) and thus allocation patterns?

No, struct sizes will not change with compiler versions - that would
create binary incompatibilities for libraries etc.

So apart from the kernel itself working around some old gcc bugs by making
spinlocks have different size depending on the compiler version, sizes of
structures should be the same (as long as the configuration is the same,
of course).

However, a greedy first-fit allocator will be *very* sensitive to
allocation pattern differences, so timing will probably make a big
difference. In contrast, SLUB and SLAB both use fixed sizes per allocation
queue, which makes them almost totally impervious to any allocation
pattern from different allocation sizes (they still end up caring about
the pattern *within* one size, but those tend to be much stabler).

There really is a reason why traditional heaps with first-fit are almost
never used for any real loads.

(I'm not a fan of slabs per se - I think all the constructor/destructor
crap is just that: total crap - but the size/type binning is a big deal,
and I think SLOB was naïve to think a pure first-fit makes any sense. Now
you guys are size-binning by just two or three bins, and it seems to make
a difference for some loads, but compared to SLUB/SLAB it's a total hack).

I would suggest that if you guys are really serious about memory use, try
to do a size-based heap thing, and do best-fit in that heap. Or just some
really simple size-based binning, eg

	if (size > 2*PAGE_SIZE)
		goto page_allocator;
	bin = lookup_bin[(size+31) >> 5];

or whatever. Because first-fit is *known* to be bad.

At try to change it to address-ordered first-fit or something (which is
much more complex than just plain LIFO, but hey, that's life).

I haven't checked much, but I *think* SLOB is just basic first-fit
(perhaps the "next-fit" variation?) Next-fit is known to be EVEN WORSE
than the simple first-fit when it comes to fragmentation (so no, Knuth was
not always right - let's face it, much of Knuth is simply outdated).

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC PATCH] greatly reduce SLOB external fragmentation
Date: Thu, 10 Jan 2008 18:29:58 UTC
Message-ID: <fa.5/sU7HmTcgpFnTWELxePHBXGXGQ@ifi.uio.no>

On Thu, 10 Jan 2008, Matt Mackall wrote:
> >
> > (I'm not a fan of slabs per se - I think all the constructor/destructor
> > crap is just that: total crap - but the size/type binning is a big deal,
> > and I think SLOB was naÃ¯ve to think a pure first-fit makes any sense. Now
> > you guys are size-binning by just two or three bins, and it seems to make
> > a difference for some loads, but compared to SLUB/SLAB it's a total hack).
>
> Here I'm going to differ with you. The premises of the SLAB concept
> (from the original paper) are:

I really don't think we differ.

The advantage of slab was largely the binning by type. Everything else was
just a big crock. SLUB does the binning better, by really just making the
type binning be about what really matters - the *size* of the type.

So my argument was that the type/size binning makes sense (size more so
than type), but the rest of the original Sun arguments for why slab was
such a great idea were basically just the crap.

Hard type binning was a mistake (but needed by slab due to the idiotic
notion that constructors/destructors are "good for caches" - bleargh). I
suspect that hard size binning is a mistake too (ie there are probably
cases where you do want to split unused bigger size areas), but the fact
that all of our allocators are two-level (with the page allocator acting
as a size-agnostic free space) may help it somewhat.

And yes, I do agree that any current allocator has problems with the big
sizes that don't fit well into a page or two (like task_struct). That
said, most of those don't have lots of allocations under many normal
circumstances (even if there are uses that will really blow them up).

The *big* slab users at least for me tend to be ext3_inode_cache and
dentry. Everything else is orders of magnitude less. And of the two bad
ones, ext3_inode_cache is the bad one at 700+ bytes or whatever (resulting
in ~10% fragmentation just due to the page thing, regardless of whether
you use an order-0 or order-1 page allocation).

Of course, dentries fit better in a page (due to being smaller), but then
the bigger number of dentries per page make it harder to actually free
pages, so then you get fragmentation from that. Oh well. You can't win.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC PATCH] greatly reduce SLOB external fragmentation
Date: Thu, 10 Jan 2008 19:42:57 UTC
Message-ID: <fa.MXWGM2tJmOAstPzKf4b/KmTbUmo@ifi.uio.no>

On Thu, 10 Jan 2008, Matt Mackall wrote:
>
> One idea I've been kicking around is pushing the boundary for the buddy
> allocator back a bit (to 64k, say) and using SL*B under that. The page
> allocators would call into buddy for larger than 64k (rare!) and SL*B
> otherwise. This would let us greatly improve our handling of things like
> task structs and skbs and possibly also things like 8k stacks and jumbo
> frames.

Yes, something like that may well be reasonable. It could possibly solve
some of the issues for bigger page cache sizes too, but one issue is that
many things actually end up having those power-of-two alignment
constraints too - so an 8kB allocation would often still have to be
naturally aligned, which then removes some of the freedom.

> Crazy?

It sounds like it might be worth trying out - there's just no way to know
how well it would work. Buddy allocators sure as hell have problems too,
no question about that. It's not like the page allocator is perfect.

It's not even clear that a buddy allocator even for the high-order pages
is at all the right choice. Almost nobody actually wants >64kB blocks, and
the ones that *do* want bigger allocations tend to want *much* bigger
ones, so it's quite possible that it could be worth it to have something
like a three-level allocator:

 - huge pages (superpages for those crazy db people)

   Just a simple linked list of these things is fine, we'd never care
   about coalescing large pages together anyway.

 - "large pages" (on the order of ~64kB) - with *perhaps* a buddy bitmap
   setup to try to coalesce back into huge-pages, but more likely just
   admitting that you'd need something like migration to ever get back a
   hugepage that got split into large-pages.

   So maybe a simple bitmap allocator per huge-page for large pages. Say
   you have a 4MB huge-page, and just a 64-bit free-bitmap per huge-page
   when you split it into large pages.

 - slab/slub/slob for anything else, and "get_free_page()" ends up being
   just a shorthand for saying "naturally aligned kmalloc of size
   "PAGE_SIZE<<order"

and maybe it would all work out ok.

			Linus

Index Home About Blog