From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC PATCH] x86 alternatives : fix LOCK_PREFIX race with
Date: Thu, 14 Aug 2008 16:17:13 UTC
Message-ID: <fa.Mp29CBsYdW5mz46V5ff9zEnm0Pk@ifi.uio.no>

On Thu, 14 Aug 2008, Mathieu Desnoyers wrote:
>
> I can't argue about the benefit of using VM CPU pinning to manage
> resources because I don't use it myself, but I ran some tests out of
> curiosity to find if uncontended locks were that cheap, and it turns out
> they aren't.

Absolutely.

Locked ops show up not just in microbenchmarks looping over the
instruction, they show up in "real" benchmarks too. We added a single
locked instruction (maybe it was two) to the page fault handling code some
time ago, and the reason I noticed it was that it actually made the page
fault cost visibly more expensive in lmbench. That was a _single_
instruction in the hot path (or maybe two).

And the page fault path is some of the most timing critical in the whole
kernel - if you have everything cached, the cost of doing the page faults
to populate new processes for some fork/exec-heavy workload (and compiling
the kernel is just one of those - any traditional unix behaviour will show
this) is critical.

This is one of the things AMD does a _lot_ better than Intel. Intel tends
to have a 30-50 cycle cost (with later P4s being *much* worse), while AMD
tends to have a cost of around 10-15 cycles.

It's one of the things Intel promises to have improved in the next-gen
uarch (Nehalem), and while I am not supposed to give out any benchmarks, I
can confirm that Intel is getting much better at it. But it's going to be
visible still, and it's really a _big_ issue on P4.

(Of course, on P4, the page fault exception cost itself is so high that
the cost of atomics may be _relatively_ less noticeable in that particular
path)

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: AIM7 40% regression with 2.6.26-rc1
Date: Wed, 07 May 2008 17:48:04 UTC
Message-ID: <fa.RG+OsX4P9H72v4kmVKx3x5iefpA@ifi.uio.no>

On Wed, 7 May 2008, Linus Torvalds wrote:
>
> All the "normal" mutex code use fine-grained locking, so even if you slow
> down the fast path, that won't cause the same kind of fastpath->slowpath
> increase.

Put another way: let's say that the "good fastpath" is basically a single
locked instruction - ~12 cycles on AMD, ~35 on Core 2. That's the
no-bouncing, no-contention case.

Doing it with debugging (call overhead, spinlocks, local irq saving etc)
will probably easily triple it or more, but we're not changing anything
else. There's no "downstream" effect: the behaviour itself doesn't change.
It doesn't get more bouncing, it doesn't start sleeping.

But what happens if the lock has the *potential* for conflicts is
different.

There, a "longish pause + fast lock + short average code sequence + fast
unlock" is quite likely to stay uncontended for a fair amount of time, and
while it will be much slower than the no-contention-at-all case (because
you do get a pretty likely cacheline event at the "fast lock" part), with
a fairly low number of CPU's and a long enough pause, you *also* easily
get into a pattern where the thing that got the lock will likely also get
to unlock without dropping the cacheline.

So far so good.

But that basically depends on the fact that "lock + work + unlock" is
_much_ shorter than the "longish pause" in between, so that even if you
have <n> CPU's all doing the same thing, their pauses between the locked
section are still bigger than <n> times that short time.
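The condition in the paragraph above can be written down as a one-line check. This is a toy model, not kernel code: the function name and the idea of expressing both times in cycles are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the pause between locked sections is still bigger than
 * n_cpus times the "lock + work + unlock" time, which is the regime
 * where the lock is likely to stay uncontended. */
static bool pauses_cover_locked_sections(uint64_t n_cpus,
                                         uint64_t locked_cycles,
                                         uint64_t pause_cycles)
{
    return pause_cycles > n_cpus * locked_cycles;
}
```

With made-up numbers: a 200-cycle locked section and a 1000-cycle pause satisfy the condition on 4 CPUs (800 < 1000) but not on 8 (1600 > 1000).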

Once that is no longer true, you now start to bounce both at the lock
*and* the unlock, and now that whole sequence got likely ten times slower.
*AND* because it now actually has real contention, it actually got even
worse: if the lock is a sleeping one, you get *another* order of magnitude
just because you now started doing scheduling overhead too!

So the thing is, it just breaks down very badly. A spinlock that gets
contention probably gets ten times slower due to bouncing the cacheline. A
semaphore that gets contention probably gets a *hundred* times slower, or
more.

And so my bet is that both the old and the new semaphores had the same bad
break-down situation, but the new semaphores just make it a lot easier to
trigger because they are at least three times costlier than the old
ones, so you just hit the bad behaviour with much lower loads (or fewer
CPU's).
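To put hypothetical numbers on that: if the lock stays uncontended only while the pause is bigger than the CPU count times the locked section, then tripling the locked section cuts the number of CPU's you can have before breaking down to roughly a third. The helper below is a sketch under that assumption (and the further assumption that lock overhead dominates the locked section); its name and the cycle counts are invented for illustration.

```c
#include <stdint.h>

/* Largest CPU count n for which pause_cycles > n * locked_cycles
 * still holds. Returns UINT64_MAX if the locked section costs nothing. */
static uint64_t max_uncontended_cpus(uint64_t locked_cycles,
                                     uint64_t pause_cycles)
{
    if (locked_cycles == 0)
        return UINT64_MAX;
    if (pause_cycles == 0)
        return 0;
    /* n * locked < pause  <=>  n <= (pause - 1) / locked */
    return (pause_cycles - 1) / locked_cycles;
}
```

With a 1000-cycle pause, a 100-cycle locked section holds up to 9 CPU's, while a 300-cycle one holds only up to 3.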

But spinlocks really do behave much better when contended, because at
least they don't get the even bigger hit of also hitting the scheduler. So
the old semaphores would have behaved badly too *eventually*, they just
needed a more extreme case to show that bad behavior.

		Linus
