From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native
Date: Wed, 03 Jun 2009 16:11:28 UTC
Message-ID: <fa.7p0jTPDcR5cMHaKlOZ6QFfVDNuA@ifi.uio.no>

On Wed, 3 Jun 2009, Rusty Russell wrote:
>
> I took my standard config, and turned on AUDIT, CGROUP, all the sched options,
> all the namespace options, profiling, markers, kprobes, relocatable kernel,
> 1000Hz, preempt, support for every x86 variant (ie. PAE, NUMA, HIGHMEM64,
> DISCONTIGMEM).  I turned off kernel debugging and paravirt.  Booted with
> maxcpus=1.

Turn off HIGHMEM64G, please (and HIGHMEM4G too, for that matter - you
can't compare it to a no-highmem case).

It's one of those options that we do to support crazy hardware, and it is
EXTREMELY expensive (but mainly only if you actually have the hardware, ie
you actually have more than 1GB of RAM for HIGHMEM4G - HIGHMEM64G is
always expensive for forks, but nobody sane ever enables it).

IOW, it's not at all comparable to the other options. It's not a software
option, it's a real hardware option that hits you depending not on whether
you want some software capability, but on whether you want to use your memory.

Because depending on the CPU, some loads will have 25% of time spent in
just kmap/kunmap due to TLB flushes. Yes, really. There's a reason 32-bit
kernels are shit for 1GB+ memory.
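
A rough sketch of the pattern that gets so expensive (illustrative only,
using the long-standing kmap()/kunmap() interface; the function name is
made up): whenever the kernel itself wants to touch a page that might live
in highmem, it first has to set up a temporary kernel mapping, and with
CONFIG_HIGHMEM recycling the small window of virtual addresses used for
those mappings is what forces the TLB flushes.

#include <linux/highmem.h>
#include <linux/mm.h>

static void touch_page_contents(struct page *page)
{
	/* Map the (possibly highmem) page into kernel address space. */
	char *kaddr = kmap(page);

	/* ... read or write the page contents through kaddr ... */
	kaddr[0] = 0;

	/* Tear the temporary mapping down again. */
	kunmap(page);
}

On a no-highmem (or 64-bit) kernel the same calls boil down to a plain
page_address() lookup and cost essentially nothing.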

After you've turned off HIGHMEM (or run on a sane architecture like x86-64
that doesn't need it), re-run the benchmark, because it's interesting. But
with HIGHMEM being different, your benchmark is totally invalid and
pointless.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native
Date: Tue, 09 Jun 2009 15:20:24 UTC
Message-ID: <fa.lrWFX0u+/Cwvnsr0i3k6hpVD7n4@ifi.uio.no>

On Tue, 9 Jun 2009, Ingo Molnar wrote:
>
> In practice the pte format hurts the VM more than just highmem. (the
> two are inseparably connected of course)

I think PAE is a separate issue (ie I think HIGHMEM4G and HIGHMEM64G are
about different issues).

I do think we could probably drop PAE some day - very few 32-bit x86's
have more than 4GB of memory, and the ones that did buy lots of memory
back when it was a big deal for them have hopefully upgraded long since.

Of course, PAE also adds the NX flag etc, so there are probably other
reasons to have it. And quite frankly, PAE is just a small x86-specific
detail that doesn't hurt anybody else.

So I have no reason to really dislike PAE per se - the real dislike is for
HIGHMEM itself, and that gets enabled already for HIGHMEM4G without any
PAE.

Of course, I'd also not ever enable it on any machine I have. PAE does add
overhead, and the NX bit isn't _that_ important to me.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native
Date: Tue, 09 Jun 2009 15:08:35 UTC
Message-ID: <fa.Tomhcq6Eyzs6dZofLAQL2lEKXX8@ifi.uio.no>

On Tue, 9 Jun 2009, Nick Piggin wrote:

> On Tue, Jun 09, 2009 at 01:17:19PM +0200, Ingo Molnar wrote:
> >
> >  - The buddy allocator allocates top down, with highmem pages first.
> >    So a lot of critical apps (the first ones started) will have
> >    highmem footprint, and that shows up every time they use it for
> >    file IO or other ops. kmap() overhead and more.
>
> Yeah this really sucks about it. OTOH, we have basically the same
> thing today with NUMA allocations and task placement.

It's not the buddy allocator. Each zone has its own buddy list.

It's that we do the zones in order, and always start with the HIGHMEM
zone.

Which is quite reasonable for most loads (if the page is only used as a
user mapping, we won't kmap it all that often), but it's bad for things
where we will actually want to touch it over and over again. Notably
filesystem caches that aren't just for user mappings.
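
As a minimal sketch of the allocation-flag side of that ordering
(illustrative only; the GFP flags are real, the function is just an
example): user/page-cache allocations pass __GFP_HIGHMEM and so walk the
zones highmem-first, while anything the kernel will keep dereferencing
directly is allocated GFP_KERNEL and never lands in highmem.

#include <linux/gfp.h>
#include <linux/mm.h>

static void zone_ordering_example(void)
{
	/*
	 * A page destined for a user mapping or the page cache: the
	 * zone fallback list is walked HIGHMEM -> NORMAL -> DMA, the
	 * highmem-first ordering discussed above.
	 */
	struct page *user_page = alloc_page(GFP_HIGHUSER);

	/*
	 * A page the kernel itself will keep touching: GFP_KERNEL never
	 * returns a highmem page, so no kmap()/kunmap() is ever needed.
	 */
	struct page *meta_page = alloc_page(GFP_KERNEL);

	if (user_page)
		__free_page(user_page);
	if (meta_page)
		__free_page(meta_page);
}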

> > Highmem simply enables a sucky piece of hardware so the code itself
> > has an intrinsic level of suckage, so to speak. There's not much to
> > be done about it but it's not a _big_ problem either: this type of
> > hw is moving fast out of the distro attention span.
>
> Yes but Linus really hated the code. I wonder whether it is
> generic code or x86 specific. OTOH with x86 you'd probably
> still have to support different page table formats, at least,
> so you couldn't rip it all out.

The arch-specific code really isn't that nasty. We have some silly
workarounds for doing 8-byte-at-a-time operations on x86-32 with cmpxchg8b
etc, but those are just odd small details.
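
A simplified illustration of that kind of workaround (this mirrors the
idea, not the kernel's actual pgtable-3level code): with PAE a pte is 64
bits wide, but a 32-bit store can only update half of it at a time, so the
two halves have to be written in a careful order (or updated with
cmpxchg8b) so that a half-written entry is never seen as present.

#include <stdint.h>

/* Hypothetical 2 x 32-bit view of a 64-bit PAE pte. */
typedef struct {
	uint32_t pte_low;	/* the present bit lives in the low word */
	uint32_t pte_high;
} pae_pte_t;

static void set_pae_pte(volatile pae_pte_t *ptep, uint64_t val)
{
	/* Write the high half first... */
	ptep->pte_high = (uint32_t)(val >> 32);
	/* ...keep the two stores in this order (compiler barrier)... */
	__asm__ __volatile__("" ::: "memory");
	/* ...then the low half, which carries the present bit. */
	ptep->pte_low = (uint32_t)val;
}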

If highmem was just a matter of arch details, I wouldn't mind it at all.

It's the generic code pollution I find annoying. It really does pollute a
lot of crap. Not just fs/ and mm/, but even drivers.

		Linus


From: "H. Peter Anvin" <hpa@zytor.com>
Newsgroups: fa.linux.kernel
Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native
Date: Tue, 09 Jun 2009 18:00:03 UTC
Message-ID: <fa.zOipQ3s7JVqdt4hNOogfKyQu1Ao@ifi.uio.no>

Ingo Molnar wrote:
>
> OTOH, highmem is clearly a useful hardware enablement feature with a
> slowly receding upside and a constant downside. The outcome is
> clear: when a critical threshold is reached distros will stop
> enabling it. (or more likely, there will be pure 64-bit x86 distros)
>

A major problem is that distros don't seem to be willing to push 64-bit
kernels for 32-bit distros.  There are a number of good (and
not-so-good) reasons why users may want to run a 32-bit userspace, but
not running a 64-bit kernel on capable hardware is just problematic.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native
Date: Tue, 09 Jun 2009 18:07:30 UTC
Message-ID: <fa.QORLYld5zEhjSed8Sn+v1j8Yxhc@ifi.uio.no>

On Tue, 9 Jun 2009, H. Peter Anvin wrote:
>
> A major problem is that distros don't seem to be willing to push 64-bit
> kernels for 32-bit distros.  There are a number of good (and
> not-so-good) reasons why users may want to run a 32-bit userspace, but
> not running a 64-bit kernel on capable hardware is just problematic.

Yeah, that's just stupid. A 64-bit kernel should work well with 32-bit
tools, and while we've occasionally had compat issues (the intel gfx
people used to claim that they needed to work with a 32-bit kernel because
they cared about 32-bit tools), they aren't unfixable or even all _that_
common.
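
The compat issues being referred to are often of this flavor (a made-up
example; only compat_ulong_t, compat_uptr_t and compat_ptr() are real
kernel helpers): an ioctl argument struct containing a long or a pointer
has a different layout when a 32-bit tool fills it in than when the 64-bit
kernel reads it, so the driver needs an explicit conversion path.

#include <linux/compat.h>
#include <linux/types.h>

struct example_req {		/* layout as the 64-bit kernel sees it */
	unsigned long len;	/* 8 bytes on x86-64 */
	void __user *buf;	/* 8 bytes on x86-64 */
};

struct example_req32 {		/* layout a 32-bit tool actually passes */
	compat_ulong_t len;	/* 4 bytes */
	compat_uptr_t buf;	/* 4 bytes */
};

static void example_req_from_compat(struct example_req *out,
				    const struct example_req32 *in)
{
	out->len = in->len;
	out->buf = compat_ptr(in->buf);	/* widen the 32-bit user pointer */
}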

And they'd be even less common if the whole "64-bit kernel even if you do
a 32-bit distro" was more common.

The nice thing about a 64-bit kernel is that you should be able to build
one even if you don't in general have all the 64-bit libraries. So you
don't need a full 64-bit development environment, you just need a compiler
that can generate code for both (and that should be the default on x86
these days).

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native
Date: Tue, 09 Jun 2009 18:09:03 UTC
Message-ID: <fa.1Ii9vr8mYOs8L+/XYIHjDuPkT4k@ifi.uio.no>

On Tue, 9 Jun 2009, Linus Torvalds wrote:
>
> And they'd be even less common if the whole "64-bit kernel even if you do
> a 32-bit distro" was more common.

Side note: intel is to blame too. I think several Atom versions were
shipped with 64-bit mode disabled. So even "modern" CPU's are sometimes
artificially crippled to just 32-bit mode.

			Linus


From: "H. Peter Anvin" <hpa@zytor.com>
Newsgroups: fa.linux.kernel
Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native
Date: Tue, 09 Jun 2009 23:02:26 UTC
Message-ID: <fa.2EvTGB9l9n1kIytDO5kyxfoVOKc@ifi.uio.no>

Matthew Garrett wrote:
> On Tue, Jun 09, 2009 at 11:07:41AM -0700, Linus Torvalds wrote:
>
>> Side note: intel is to blame too. I think several Atom versions were
>> shipped with 64-bit mode disabled. So even "modern" CPU's are sometimes
>> artificially crippled to just 32-bit mode.
>
> And some people still want to run dosemu so they can drive their
> godforsaken 80s era PIO driven data analyzer. It'd be nice to think that
> nobody used vm86, but they always seem to pop out of the woodwork
> whenever someone suggests 64-bit kernels by default.
>

There are both KVM and Qemu as alternatives, though.  The godforsaken
80s-era PIO driven data analyzer will run fine in Qemu even on
non-HVM-capable hardware if it's 64-bit capable.  Most of the time it'll
spend sitting in PIO no matter what you do.

	-hpa



From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native
Date: Thu, 04 Jun 2009 15:04:24 UTC
Message-ID: <fa.1Yp64Ri4ORLbTI0xbvn3aR4tCJE@ifi.uio.no>

On Thu, 4 Jun 2009, Rusty Russell wrote:
> >
> > Turn off HIGHMEM64G, please (and HIGHMEM4G too, for that matter - you
> > can't compare it to a no-highmem case).
>
> Thanks, your point is demonstrated below.  I don't think HIGHMEM4G is
> unreasonable for a distro tho, so I turned that on instead.

Well, I agree that HIGHMEM4G is a _reasonable_ thing to turn on.

The thing I disagree with is that it's at all valid to then compare to
some all-software feature thing. HIGHMEM doesn't expand any esoteric
capability that some people might use - it's about regular RAM for regular
users.

And don't get me wrong - I don't like HIGHMEM. I detest the damn thing. I
hated having to merge it, and I still hate it. It's a stupid, ugly, and
very invasive config option. It's just that it's there to support a
stupid, ugly and very annoying fundamental hardware problem.

So I think your minimum and maximum configs should at least _match_ in
HIGHMEM. Limiting memory so that there isn't actually any highmem (with
"mem=880M") will avoid the TLB flushing impact of HIGHMEM, which is clearly
going to be the _bulk_ of the overhead, but HIGHMEM is still going to be
noticeable on at least some microbenchmarks.

In other words, it's a lot like CONFIG_SMP, but at least CONFIG_SMP has a
damn better reason for existing today than CONFIG_HIGHMEM.

That said, I suspect that now your context-switch test is likely no longer
dominated by that thing, so looking at your numbers:

> minimal config: ~0.001280
> maximal config: ~0.002500	(with actual high mem)
> maximum config: ~0.001925     (with mem=880M)

and I think that change from 0.001280 - 0.001925 (rough averages by
eye-balling it, I didn't actually calculate anything) is still quite
interesting, but I do wonder how much of it ends up being due to just code
generation issues for CONFIG_HIGHMEM and CONFIG_SMP.

> So we're paying a 48% overhead; microbenchmarks always suffer as code is added,
> and we've added a lot of code with these options.

I do agree that microbenchmarks are interesting, and tend to show these
kinds of things clearly. It's just that when you look at the scheduler,
for example, something like SMP support is a _big_ issue, and even if we
get rid of the worst synchronization overhead with "maxcpus=1" (at least
removing the "lock" prefixes), I'm not sure how relevant it is to say that
the scheduler is slower with SMP support.

(The same way I don't think it's relevant or interesting to see that it's
slower with HIGHMEM).

They are simply such fundamental features that the two aren't comparable.
Why would anybody compare a UP scheduler with an SMP scheduler? It's simply
not the same problem. What does it mean to say that one is 48% slower?
That's like saying that a squirrel is 48% juicier than an orange - maybe
it's true, but anybody who puts the two in a blender to compare them is
kind of sick. The comparison is ugly and pointless.

Now, other feature comparisons are way more interesting. For example, if
statistics gathering is a noticeable portion of the 48%, then that really
is a very relevant comparison, since scheduler statistics is something
that is in no way "fundamental" to the hardware base, and most people
won't care.

So comparing a "scheduler statistics" overhead vs "minimal config"
overhead is very clearly a sane thing to do. Now we're talking about a
feature that most people - even if it was somehow hardware related -
wouldn't use or care about.

IOW, even if it were to use hardware features (say, something like
oprofile, which is at least partly very much about exposing actual
physical features of the hardware), if it's not fundamental to the whole
usage for a huge percentage of people, then it's an "optional feature", and
seeing a slowdown is a big deal.

Something like CONFIG_HIGHMEM* or CONFIG_SMP is not really what I'd ever
call an "optional feature", although I hope to Dog that CONFIG_HIGHMEM can
some day be considered that.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Ubuntu 32-bit, 32-bit PAE, 64-bit Kernel Benchmarks
Date: Thu, 31 Dec 2009 18:40:14 UTC
Message-ID: <fa.JnoD3znpzxuyOzR4rFAaL50JnmU@ifi.uio.no>

On Wed, 30 Dec 2009, Yuhong Bao wrote:
>
> Given that Linus was once talking about the performance penalties of PAE
> and HIGHMEM64G, perhaps you'd find these benchmarks done by Phoronix of
> interest:
>   http://www.phoronix.com/scan.php?page=article&item=ubuntu_32_pae

PAE has no negative impact on user-land loads (aside from a potentially
really _tiny_ effect from just bigger page tables), and obviously means
that you actually have more RAM available, so it can be a big win.

The "25% cost" is purely kernel-side work when the kernel needs to
kmap/kunmap - which it only needs to do when it touches highmem pages
itself directly. Which is pretty rare - but when it happens a lot, it's
extremely expensive.

The worst load I've ever seen (which was the 25%+ case) needed btrfs
and heavy meta-data workloads (ie things like file creates/deletes, or
uncached lookups), because btrfs puts all its radix trees in highmem pages
and thus needs to kmap/kunmap them all. So that's one way to see heavy
kmap/kunmap loads.

(In the meantime, I complained to the btrfs people about the CPU hogging
behavior, and afaik btrfs has improved since I did my kernel profiles of
the benchmarks, but I haven't re-done them)

There's a potential secondary issue: my test-bed for that btrfs setup was
a netbook using Intel Atom. The performance profile of an Atom chip is
pretty different from any of the better out-of-order CPU's.

Extra instructions cost a lot more. For example, out-of-order is
particularly good at handling "nonsense" instructions that aren't on a
critical path and aren't important for actual semantics - things like the
stack frame modifications etc are often almost "free" on out-of-order
CPU's because they only tend to have trivial dependencies that can be
worked around with things like the "stack engine" etc. So I seem to
remember that the "omit stack frame" option was a much bigger deal on Atom
than on a Core 2 Duo CPU, for example.

So it's entirely possible that the TLB flushing (and eventual misses, of
course) involved with kmap()/kunmap() is much more expensive on Atom than
it is on a Core2 system. So it's possible that my 25% cost thing was for
pretty much a pessimal situation, due to a combination of heavy kernel
loads (I used "git status" as one of the btrfs/atom benchmarks - pretty
much _all_ it does is pathname lookups and readdir) with btrfs and atom.

		Linus
