Index Home About Blog
Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: A general question on SMP-safe driver code.
Original-Message-ID: <Pine.LNX.4.58.0412241522110.2353@ppc970.osdl.org>
Date: Fri, 24 Dec 2004 23:29:57 GMT
Message-ID: <fa.gtdlv65.a2sr1j@ifi.uio.no>

On Fri, 24 Dec 2004, Jim Nelson wrote:
>
> work if all other areas of the driver that send commands to the board also try for
> the semaphore?

The most common reason _not_ to use a semaphore, but a single simple
spinlock is:
 - spinlocks are generally faster.
 - you can't use semaphores to protect against interrupts, as interrupts
   cannot take semaphores.

> Is there an easier way of doing this?

The simplest approach tends to be to just have a single spinlock per
driver (or, if the driver can drive multiple independent ports, one per
port).

The only advantage of semaphores is that you can do user accesses and you
can sleep during them, but if you're looking at converting a driver that
used to just depend on the global interrupt lock, that shouldn't be an
issue anyway. Generally, the semaphores are more useful at a higher level
(ie there is almost never any reason to protect actual _IO_ accesses with
a semaphore).

The biggest problem with converting old-style irq locks into spinlocks
tends to be that the irq locking allowed nesting (though the use of
save_flags/restore_flags), and normal spinlocks don't.

You can make your own nesting spinlocks, of course, but the reason there
aren't any standard nesting locks in the kernel is that in pretty much all
cases you can trivially avoid the nesting by just moving the lock
sufficiently far out, or just re-organizing the source a bit.

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: i386 spinlock fairness: bizarre test results
Date: Tue, 11 Oct 2005 14:54:29 UTC
Message-ID: <fa.h93v74q.i7u2ga@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0510110740050.14597@g5.osdl.org>

On Tue, 11 Oct 2005, Alan Cox wrote:

> On Maw, 2005-10-11 at 00:04 -0400, Chuck Ebbert wrote:
> >   That test machine was a dual 350MHz Pentium II Xeon; on a dual 333MHz Pentium II
> > Overdrive (with very slow Socket 8 bus) I could not reproduce those results.
> > However, on that machine the 'xchg' instruction made the test run almost 20%
> > _faster_ than using 'mov'.
> >
> >   So I think the i386 spinlock code should be changed to always use 'xchg' to do
> > spin_unlock.
>
>
> Using xchg on the spin unlock path is expensive. Really expensive on P4
> compared to movb. It also doesn't guarantee anything either way around
> especially as you go to four cores or change CPU (or in some cases quite
> likely even chipset).

Indeed.

I suspect that the behaviour Chuck saw is (a) only present under
contention and (b) very much dependent on other timing issues.

(a) is the wrong thing to optimize for, and (b) means that Chuck's numbers
aren't reliable anyway (as shown by the fact that things like instruction
alignment matters, and by Eric's numbers on other machines).

We want the spinlocks to behave well when they are _not_ under heavy
contention. If a spinlock gets so much contention that it starts having
these kinds of issues, then there's something wrong at higher levels, and
the fix is to use a different algorithm, or use a different kind of lock.

Spinlocks by definition are the _simplest_ locks there are. Not the
smartest or most fair. Trying to make them anything else is kind of
missing the whole point of them.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [BUG] long freezes on thinkpad t60
Date: Wed, 20 Jun 2007 17:36:31 UTC
Message-ID: <fa.V0GfHiGDFJmjiegObrxa2dYBIoA@ifi.uio.no>

On Wed, 20 Jun 2007, Jarek Poplawski wrote:
>
> I don't agree with this (+ I know it doesn't matter).
>
> The real bug is what Chuck Ebbert wrote: "Spinlocks aren't fair".
> And here they are simply lawlessly not fair.

Well, that's certainly a valid standpoint. I wouldn't claim you're
_wrong_.

At least in theory.

In *practice*, fair spinlocks are simply not possible. Not as in "it's
hard to do", but as in "you simply cannot do it".

The thing is, most spinlocks get held for a rather short time (if that
wasn't the case, you'd use something like a mutex), and it's acually a
short time not on a "human scale", but on a "CPU core scale".

IOW, the cost of doing the cacheline bouncing is often equal to, or bigger
than, the actual cost of the operation that happens inside the spinlock!

What does this result in? It automatically means that software simply
*cannot* do fair spinlocks, because all the timing costs are in parts that
software doesn't even have any visibility into, or control over!

Yeah, you could add artificial delays, by doing things like cycle counting
around the spinlock operation to see if you got the lock when it was
_contended_, or whether you got a lock that wasn't, and then adding some
statistics etc, but at that point, you don't have a spinlock any more, you
have something else. I don't know what to call it.

And you could add flags like "this spinlock is getting a lot of
contention", try some other algorithm etc. But it's all complicated and
against the whole *point* of a spinlock, which is to get in and out as
fast as humanly possible.

So in practice, spinlock fairness is inherently tied to the hardware
behaviour of the cache coherency algorithm.

Which gets us to the next level: we can consider hardware that isn't
"fair" in its cache coherency to be buggy hardware.

That's also a perfectly fine standpoint, and it's actually one that I have
a lot of sympathy for. I think it's much easier to some degree to do
fairness in the cache coherency than at any other level, because it's
something where the hardware really *could* do things like counting
bouncing etc.

However, while it's a perfectly fine standpoint, it's also totally
unrealistic. First off, the hardware doesn't even know whether the
spinlock "failed" or not on any architecture platform I'm aware of. On
x86, the spinlock operation under Linux is actually just an atomic
decrement, and it so happens that the rules for failure was that it didn't
go negative. But that's just a Linux internal rule - the hardware doesn't
know.

So you'd have to actually have some specific sequence with specific
fairness rules (maybe you could make the rule be that a failed atomic
"cmpxchg" counts as a real failure, and if you give higher priority to
cores with lots of failures etc).

IOW, it's certainly possible *in*theory* to try to make hardware that has
fairness guarantees. However, any hardware designer will tell you that
 (a) they have enough problems as it is
 (b) it's simply not worth their time, since there are other (simpler)
     things that are much much more important.

So the end result:
 - hardware won't do it, and they'd be crazy to really even try (apart
   from maybe some really simplistic stuff that doesn't _guarantee_
   anything, but maybe helps fairness a bit)
 - software cannot do it, without turning spinlocks into something really
   slow and complicated, at which point they've lost all meaning, and you
   should just use a higher-level construct like a mutex instead.

In other words, spinlocks are optimized for *lack* of contention. If a
spinlock has contention, you don't try to make the spinlock "fair". No,
you try to fix the contention instead!

That way, you fix many things. You probably speed things up, _and_ you
make it fairer. It might sometimes take some brains and effort, but it's
worth it.

The patch I sent out was an example of that. You *can* fix contention
problems. Does it take clever approaches? Yes. It's why we have hashed
spinlocks, RCU, and code sequences that are entirely lockless and use
optimistic approaches. And suddenly you get fairness *and* performance!

It's a win-win situation. It does require a bit of effort, but hey, we're
good at effort.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [BUG] long freezes on thinkpad t60
Date: Thu, 21 Jun 2007 16:03:26 UTC
Message-ID: <fa.8uEz4Pc3VHkU6G9tx74skRZt2Ok@ifi.uio.no>

On Thu, 21 Jun 2007, Jarek Poplawski wrote:
>
> BTW, I've looked a bit at these NMI watchdog traces, and now I'm not
> even sure it's necessarily the spinlock's problem (but I don't exclude
> this possibility yet). It seems both processors use task_rq_lock(), so
> there could be also a problem with that loop.

I agree that that is also this kind of "loop getting a lock" loop, but if
you can see it actually happening there, we have some other (and major)
bug.

That loop is indeed a loop, but realistically speaking it should never be
run through more than once. The only reason it's a loop is that there's a
small small race where we get the information at the wrong time, get the
wrong lock, and try again.

So the loop certainly *can* trigger (or it would be pointless), but I'd
normally expect it not to, and even if it does end up looping around it
should happen maybe *one* more time, absolutely not "get stuck" in the
loop for any appreciable number of iterations.

So I don't see how you could possibly having two different CPU's getting
into some lock-step in that loop: changing "task_rq()" is a really quite
heavy operation (it's about migrating between CPU's), and generally
happens at a fairly low frequency (ie "a couple of times a second" kind of
thing, not "tight CPU loop").

But bugs happen..

> Another possible problem could be a result of some wrong optimization
> or wrong propagation of change of this task_rq(p) value.

I agree, but that kind of bug would likely not cause temporary hangs, but
actual "the machine is dead" operations. If you get the totally *wrong*
value due to some systematic bug, you'd be waiting forever for it to
match, not get into a loop for a half second and then it clearing up..

But I don't think we can throw the "it's another bug" theory _entirely_
out the window.

That said, both Ingo and me have done "fairness" testing of hardware, and
yes, if you have a reasonably tight loop with spinlocks, existing hardware
really *is* unfair. Not by a "small amount" either - I had this program
that did statistics (and Ingo improved on it and had some other tests
too), and basically if you have two CPU's that try to get the same
spinlock, they really *can* get into situations where one of them gets it
millions of times in a row, and the other _never_ gets it.

But it only happens with badly coded software: the rule simply is that you
MUST NOT release and immediately re-acquire the same spinlock on the same
core, because as far as other cores are concerned, that's basically the
same as never releasing it in the first place.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [BUG] long freezes on thinkpad t60
Date: Thu, 21 Jun 2007 15:52:54 UTC
Message-ID: <fa.uSL/hNJ3o0E5xw/vrsNHmP3ZFUA@ifi.uio.no>

On Thu, 21 Jun 2007, Ingo Molnar wrote:
>
> what worries me a bit though is that my patch that made spinlocks
> equally agressive to that loop didnt solve the hangs!

Your parch kept doing "spin_trylock()", didn't it?

That's a read-modify-write thing, and keeps bouncing the cacheline back
and forth, and together with the fact that even *after* you get the
spinlock the "wait_for_inactive()" would actually end up looping back,
releasing it, and re-getting it.

So the problem was that "wait_for_inactive()" kept the lock (because it
actually *got* it), and looped over getting it, and because it was an
exclusive cacheline ownership, that implies that somebody else is not
getting it, and is kept from ever getting it.

So trying to use "trylock" doesn't help. It still has all the same bad
sides - it still gets the lock (getting the lock wasn't the problem:
_holding_ the lock was the problem), and it still kept the cache line for
the lock on one core.

The only way to avoid lock contention is to avoid any exclusive use at
all.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [BUG] long freezes on thinkpad t60
Date: Thu, 21 Jun 2007 16:35:03 UTC
Message-ID: <fa.d7XcD8xE1Mg0D0wLVXvYxCgBcMw@ifi.uio.no>

On Thu, 21 Jun 2007, Ingo Molnar wrote:
>
> so the problem was not the trylock based spin_lock() itself (no matter
> how it's structured in the assembly), the problem was actually modifying
> the lock and re-modifying it again and again in a very tight
> high-frequency loop, and hence not giving it to the other core?

That's what I think, yes.

And it matches the behaviour we've seen before (remember the whole thread
about whether Opteron or Core 2 hardware is "more fair" between cores?)

It all simply boils down to the fact that releasing and almost immediately
re-taking a lock is "invisible" to outside cores, because it will happen
entirely within the L1 cache of one core if the cacheline is already in
"owned" state.

Another core that spins wildly trying to get it ends up using a much
slower "beat" (the cache coherency clock), and with just a bit of bad luck
the L1 cache would simply never get probed, and the line never stolen, at
the point where the lock happens to be released.

The fact that writes (as in the store that releases the lock)
automatically get delayed by any x86 core by the store buffer, and the
fact that atomic read-modify-write cycles do *not* get delayed, just means
that if the "spinlock release" was "fairly close" to the "reacquire" in a
software sense, the hardware will actually make them *even*closer*.

So you could make the spinlock release be a read-modify-write cycle, and
it would make the spinlock much slower, but it would also make it much
more likely that the other core will *see* the release.

For the same reasons, if you add a "smp_mb()" in between the release and
the re-acquire, the other core is much more likely to see it: it means
that the release won't be delayed, and thus it just won't be as "close" to
the re-acquire.

So you can *hide* this problem in many ways, but it's still just hiding
it.

The proper fix is to not do that kind of "release and re-acquire"
behaviour in a loop.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [BUG] long freezes on thinkpad t60
Date: Thu, 21 Jun 2007 17:33:55 UTC
Message-ID: <fa.TjHSBXy5cmP3mC4KaDAGH2CbJmQ@ifi.uio.no>

On Thu, 21 Jun 2007, Chuck Ebbert wrote:
>
> A while ago I showed that spinlocks were a lot more fair when doing
> unlock with the xchg instruction on x86. Probably the arbitration is all
> screwed up because we use a mov instruction, which while atomic is not
> locked.

No, the cache line arbitration doesn't know anything about "locked" vs
"unlocked" instructions (it could, but there really is no point).

The real issue is that locked instructions on x86 are serializing, which
makes them extremely slow (compared to just a write), and then by being
slow they effectively just make the window for another core bigger.

IOW, it doesn't "fix" anything, it just hides the bug with timing.

You can hide the problem other ways by just increasing the delay between
the unlock and the lock (and adding one or more serializing instruction in
between is generally the best way of doing that, simply because otherwise
micro- architecture may just be re-ordering things on you, so that your
"delay" isn't actually in between any more!).

But adding delays doesn't really fix anything, of course. It makes things
"fairer" by making *both* sides suck more, but especially if both sides
are actually the same exact thing, I could well imagine that they'd both
just suck equally, and get into some pattern where they are now both
slower, but still see exactly the same problem!

Of course, as long as interrupts are on, or things like DMA happen etc,
it's really *really* hard to be totally unlucky, and after a while you're
likely to break out of the steplock on your own, just because the CPU's
get interrupted by something else.

It's in fact entirely possible that the long freezes have always been
there, but the NOHZ option meant that we had much longer stretches of time
without things like timer interrupts to jumble up the timing! So maybe the
freezes existed before, but with timer interrupts happening hundreds of
times a second, they weren't noticeable to humans.

(Btw, that's just _one_ theory. Don't take it _too_ seriously, but it
could be one of the reasons why this showed up as a "new" problem, even
though I don't think the "wait_for_inactive()" thing has changed lately.)

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [BUG] long freezes on thinkpad t60
Date: Thu, 21 Jun 2007 18:47:30 UTC
Message-ID: <fa.y8L2v+IaJneMUquz0nAf279Bo5E@ifi.uio.no>

On Thu, 21 Jun 2007, Eric Dumazet wrote:
>
> This reminds me Nick's proposal of 'queued spinlocks' 3 months ago
>
> Maybe this should be re-considered ? (unlock is still a non atomic op,
> so we dont pay the serializing cost twice)

No. The point is simple:

	IF YOU NEED THIS, YOU ARE DOING SOMETHING WRONG!

I don't understand why this is even controversial. Especially since we
have a patch for the problem that proves my point: the _proper_ way to fix
things is to just not do the bad thing, instead of trying to allow the bad
behaviour and try to handle it.

Things like queued spinlocks just make excuses for bad code.

We don't do nesting locking either, for exactly the same reason. Are
nesting locks "easier"? Absolutely. They are also almost always a sign of
a *bug*. So making spinlocks and/or mutexes nest by default is just a way
to encourage bad programming!

> extract :
>
> Implement queued spinlocks for i386. This shouldn't increase the size of
> the spinlock structure, while still able to handle 2^16 CPUs.

Umm. i386 spinlocks could and should be *one*byte*.

In fact, I don't even know why they are wasting four bytes right now: the
fact that somebody made them an "int" just wastes memory. All the actual
code uses "decb", so it's not even a question of safety. I wonder why we
have that 32-bit thing and the ugly casts.

Ingo, any memory of that?

(And no, on 32-bit x86, we don't allow more than 128 CPU's. I don't think
such an insane machine has ever existed).

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [BUG] long freezes on thinkpad t60
Date: Thu, 21 Jun 2007 19:39:24 UTC
Message-ID: <fa.C+YdNsHCB4Bb3YmBZ1JuL6EVLIE@ifi.uio.no>

On Thu, 21 Jun 2007, Linus Torvalds wrote:
>
> We don't do nesting locking either, for exactly the same reason. Are
> nesting locks "easier"? Absolutely. They are also almost always a sign of
> a *bug*. So making spinlocks and/or mutexes nest by default is just a way
> to encourage bad programming!

Side note, and as a "truth in advertising" section: I'll have to admit
that I argued against fair semaphores on the same grounds. I was wrong
then (and eventually admitted it, and we obviously try to make our mutexes
and semaphores fair these days!), and maybe I'm wrong now.

If somebody can actually come up with a sequence where we have spinlock
starvation, and it's not about an example of bad locking, and nobody
really can come up with any other way to fix it, we may eventually have to
add the notion of "fair spinlocks".

So my arguments are purely pragmatic. It's not that I hate fairness per
se. I dislike it only when it's used to "solve" (aka hide) other problems.

In the end, some situations do need fairness, and the fact that aiming for
fairness is often harder, slower, and more complicated than not doing so
at that point turns into a non-argument. If you need it, you need it.

I just don't think we need it, and we're better off solving problems other
ways.

(For example, we might also solve such problems by creating a separate
"fair_spin_lock" abstraction, and only making the particular users that
need it actually use it. It would depend a bit on whether the cost of
implementing the fairness is noticeable enough for it to be worth having
a separate construct for it).

		Linus


From: Ingo Molnar <mingo@elte.hu>
Newsgroups: fa.linux.kernel
Subject: Re: [BUG] long freezes on thinkpad t60
Date: Thu, 21 Jun 2007 19:58:22 UTC
Message-ID: <fa.FDj7Sw6XbxHY9Q2EKMbvGqko0Fs@ifi.uio.no>

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Umm. i386 spinlocks could and should be *one*byte*.
>
> In fact, I don't even know why they are wasting four bytes right now:
> the fact that somebody made them an "int" just wastes memory. All the
> actual code uses "decb", so it's not even a question of safety. I
> wonder why we have that 32-bit thing and the ugly casts.
>
> Ingo, any memory of that?

no real reason that i can recall - i guess nobody dared to touch it
because it used to have that 'volatile', indicating black voodoo ;-) Now
that the bad stigma has been removed, we could try the patch below. It
boots fine here, and we save 1K of kernel text size:

     text    data     bss     dec     hex filename
  6236003  611992  401408 7249403  6e9dfb vmlinux.before
  6235075  611992  401408 7248475  6e9a5b vmlinux.after

I can understand why no data is saved by this change: gcc is aligning
the next field to a natural boundary anyway and we dont really have
arrays of spinlocks (fortunately). [and we save no data even if using
the ((packed)) attribute.] Perhaps some data structure that is never in
the kernel image itself still got smaller? Any good way to determine
that?

But why is the text size different? Ah: i think it's spin_lock_init()
getting shorter :-)

but this is certainly not something for 2.6.22, it's an early 2.6.23
matter i suspect.

	Ingo

------------------->
From: Ingo Molnar <mingo@elte.hu>
Subject: [patch] spinlocks i386: change them to byte fields

all spinlock ops are on byte operands, so change the spinlock field to
be unsigned char. This saves a bit of kernel text size:

   text    data     bss     dec     hex filename
6236003  611992  401408 7249403  6e9dfb vmlinux.before
6235075  611992  401408 7248475  6e9a5b vmlinux.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/asm-i386/spinlock_types.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/include/asm-i386/spinlock_types.h
===================================================================
--- linux.orig/include/asm-i386/spinlock_types.h
+++ linux/include/asm-i386/spinlock_types.h
@@ -6,7 +6,7 @@
 #endif

 typedef struct {
-	unsigned int slock;
+	unsigned char slock;
 } raw_spinlock_t;

 #define __RAW_SPIN_LOCK_UNLOCKED	{ 1 }


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [BUG] long freezes on thinkpad t60
Date: Thu, 21 Jun 2007 20:12:25 UTC
Message-ID: <fa.Xa9S1spFnpOh53KvUpmB/houe0M@ifi.uio.no>

On Thu, 21 Jun 2007, Ingo Molnar wrote:
>
> I can understand why no data is saved by this change: gcc is aligning
> the next field to a natural boundary anyway and we dont really have
> arrays of spinlocks (fortunately).

Actually, some data structures could well shrink.

Look at "struct task_struct", for example. Right now it has two spinlocks
right next to each other (alloc_lock and pi_lock).

Other data structures may have things like bitfields etc.

But yeah, I'd not expect that to be very common, and in some cases you
might have to re-order data structures to take advantage of better
packing, and even then it's probably not all that noticeable.

> but this is certainly not something for 2.6.22, it's an early 2.6.23
> matter i suspect.

Oh, absolutely.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [BUG] long freezes on thinkpad t60
Date: Thu, 21 Jun 2007 20:38:39 UTC
Message-ID: <fa.utjluWB2oe3BPW4s3cWyfg4bHic@ifi.uio.no>

On Thu, 21 Jun 2007, Ingo Molnar wrote:
>
> yeah. I think Linux is i think the only OS on the planet that is using
> the movb trick for unlock, it even triggered a hardware erratum ;)

I'm pretty sure others do it too.

Maybe not on an OS level (but I actually doubt that - I'd be surprised if
Windows doesn't do the exact same thing), but I know for a fact that a lot
of people in threaded libraries end up depending very much on the "simple
store closes a locked section".

Just googling for "xchg" "mov" "spinlock" "-linux" shows discussion boards
for Windows developers with open-coded spinlocks like


	int ResourceFlag = 0; // 0=Free, 1=Inuse
	...
	// Wait until we get the resource
	while(InterlockedExchange(&ResourceFlag, 1) != 0) {
	   Sleep(0); } // Wait a tad
	// Have the resource
	... // do your thing
	ResourceFlag = 0; // Release the resource


and that's definitely Windows code, not some Linux person doing it.

And this is from an OS2 forum

	unsigned owned=0;

	void request() {
	  while(LockedExchanged(&owned,1)!=0)
	    ;
	}

	void release() {
	  owned = 0;
	}

so it's not even something unusual.

So while arguably these people don't know (and don't care) about subtle
issues like memory ordering, I can *guarantee* that a lot of programs
depend on them, even if that dependency may often come from a lack of
knowledge, rather than actively understanding what we do like in the Linux
kernel community.

(And yes, they rely on compilers not reordering either. Tough.)

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [BUG] long freezes on thinkpad t60
Date: Wed, 27 Jun 2007 19:50:18 UTC
Message-ID: <fa.EmR1C1VCFVFgqH2OjW/DQr8Vryg@ifi.uio.no>

Nick,
 call me a worry-wart, but I slept on this, and started worrying..

On Tue, 26 Jun 2007, Linus Torvalds wrote:
>
> So try it with just a byte counter, and test some stupid micro-benchmark
> on both a P4 and a Core 2 Duo, and if it's in the noise, maybe we can make
> it the normal spinlock sequence just because it isn't noticeably slower.

So I thought about this a bit more, and I like your sequence counter
approach, but it still worried me.

In the current spinlock code, we have a very simple setup for a
successful grab of the spinlock:

	CPU#0					CPU#1

	A (= code before the spinlock)
						lock release

	lock decb mem	(serializing instruction)

	B (= code after the spinlock)

and there is no question that memory operations in B cannot leak into A.

With the sequence counters, the situation is more complex:

	CPU #0					CPU #1

	A (= code before the spinlock)

	lock xadd mem	(serializing instruction)

	B (= code after xadd, but not inside lock)

						lock release

	cmp head, tail

	C (= code inside the lock)

Now, B is basically the empty set, but that's not the issue I worry about.
The thing is, I can guarantee by the Intel memory ordering rules that
neither B nor C will ever have memops that leak past the "xadd", but I'm
not at all as sure that we cannot have memops in C that leak into B!

And B really isn't protected by the lock - it may run while another CPU
still holds the lock, and we know the other CPU released it only as part
of the compare. But that compare isn't a serializing instruction!

IOW, I could imagine a load inside C being speculated, and being moved
*ahead* of the load that compares the spinlock head with the tail! IOW,
the load that is _inside_ the spinlock has effectively moved to outside
the protected region, and the spinlock isn't really a reliable mutual
exclusion barrier any more!

(Yes, there is a data-dependency on the compare, but it is only used for a
conditional branch, and conditional branches are control dependencies and
can be speculated, so CPU speculation can easily break that apparent
dependency chain and do later loads *before* the spinlock load completes!)

Now, I have good reason to believe that all Intel and AMD CPU's have a
stricter-than-documented memory ordering, and that your spinlock may
actually work perfectly well. But it still worries me. As far as I can
tell, there's a theoretical problem with your spinlock implementation.

So I'd like you to ask around some CPU people, and get people from both
Intel and AMD to sign off on your spinlocks as safe. I suspect you already
have the required contacts, but if you don't, I can send things off to the
appropriate people at least inside Intel.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [BUG] long freezes on thinkpad t60
Date: Wed, 27 Jun 2007 22:14:25 UTC
Message-ID: <fa.qWwie6lhP4XkvtFRvoDar9z+Xxk@ifi.uio.no>

On Wed, 27 Jun 2007, Davide Libenzi wrote:
> >
> > Now, I have good reason to believe that all Intel and AMD CPU's have a
> > stricter-than-documented memory ordering, and that your spinlock may
> > actually work perfectly well. But it still worries me. As far as I can
> > tell, there's a theoretical problem with your spinlock implementation.
>
> Nice catch ;) But wasn't Intel suggesting in not relying on the old
> "strict" ordering rules?

Actually, both Intel and AMD engineers have been talking about making the
documentation _stricter_, rather than looser. They apparently already are
pretty damn strict, because not being stricter than the docs imply just
ends up exposing too many potential problems in software that didn't
really follow the rules.

For example, it's quite possible to do loads out of order, but guarantee
that the result is 100% equivalent with a totally in-order machine. One
way you do that is to keep track of the cacheline for any speculative
loads, and if it gets invalidated before the speculative instruction has
completed, you just throw the speculation away.

End result: you can do any amount of speculation you damn well please at a
micro-architectural level, but if the speculation would ever have been
architecturally _visible_, it never happens!

(Yeah, that is just me in my non-professional capacity of hw engineer
wanna-be: I'm not saying that that is necessarily what Intel or AMD
actually ever do, and they may have other approaches entirely).

> IOW shouldn't an mfence always be there? Not only loads could leak up
> into the wait phase, but stores too, if they have no dependency with the
> "head" and "tail" loads.

Stores never "leak up". They only ever leak down (ie past subsequent loads
or stores), so you don't need to worry about them. That's actually already
documented (although not in those terms), and if it wasn't true, then we
couldn't do the spin unlock with just a regular store anyway.

(There's basically never any reason to "speculate" stores before other mem
ops. It's hard, and pointless. Stores you want to just buffer and move as
_late_ as possible, loads you want to speculate and move as _early_ as
possible. Anything else doesn't make sense).

So I'm fairly sure that the only thing you really need to worry about in
this thing is the load-load ordering (the load for the spinlock compare vs
any loads "inside" the spinlock), and I'm reasonably certain that no
existing x86 (and likely no future x86) will make load-load reordering
effects architecturally visible, even if the implementation may do so
*internally* when it's not possible to see it in the end result.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [BUG] long freezes on thinkpad t60
Date: Thu, 28 Jun 2007 00:48:27 UTC
Message-ID: <fa.5QHZibY6AxygMXHO/AVRyV2URZo@ifi.uio.no>

On Wed, 27 Jun 2007, Davide Libenzi wrote:

> On Wed, 27 Jun 2007, Linus Torvalds wrote:
> >
> > Stores never "leak up". They only ever leak down (ie past subsequent loads
> > or stores), so you don't need to worry about them. That's actually already
> > documented (although not in those terms), and if it wasn't true, then we
> > couldn't do the spin unlock with just a regular store anyway.
>
> Yes, Intel has never done that. They'll probably never do it since it'll
> break a lot of system software (unless they use a new mode-bit that
> allows system software to enable lose-ordering). Although I clearly
> remember to have read in one of their P4 optimization manuals to not
> assume this in the future.

That optimization manual was confused.

The Intel memory ordering documentation *clearly* states that only reads
pass writes, not the other way around.

Some very confused people have thought that "pass" is a two-way thing.
It's not. "Passing" in the Intel memory ordering means "go _ahead_ of",
exactly the same way it means in traffic. You don't "pass" people by
falling behind them.

It's also obvious from reading the manual, because any other reading would
be very strange: it says

 1. Reads can be carried out speculatively and in any order

 2. Reads can pass buffered writes, but the processor is self-consistent

 3. Writes to memory are always carried out in program order [.. and then
    lists exceptions that are not interesting - it's clflush and the
    non-temporal stores, not any normal writes ]

 4. Writes can be buffered

 5. Writes are not performed speculatively; they are only performed for
    instructions that have actually been retired.

 6. Data from buffered writes can be forwarded to waiting reads within the
    processor.

 7. Reads or writes cannot pass (be carried out ahead of) I/O
    instructions, locked instructions or serializing instructions.

 8. Reads cannot pass LFENCE and MFENCE instructions.

 9. Writes cannot pass SFENCE or MFENCE instructions.

The thing to note is:

 a) in (1), Intel says that reads can occur in any order, but (2) makes it
    clear that that is only relevant wrt other _reads_

 b) in (2), they say "pass", but then they actually explain that "pass"
    means "be carried out ahead of" in (7).

    HOWEVER, it should be obvious in (2) even _without_ the explicit
    clarification in (7) that "pass" is a one-way thing, because otherwise
    (2) is totally _meaningless_. It would be meaningless for two reasons:

     - (1) already said that reads can be done in any order, so if that
       was a "any order wrt writes", then (2) would be pointless. So (2)
       must mean something *else* than "any order", and the only sane
       reading of it that isn't "any order" is that "pass" is a one-way
       thing: you pass somebody when you go ahead of them, you do *not*
       pass somebody when you fall behind them!

     - if (2) really meant that reads and writes can just be re-ordered,
       then the choice of words makes no sense. It would be much more
       sensible to say that "reads can be carried out in any order wrt
       writes", instead of talking explicitly about "passing buffered
       writes"

Anyway, I'm pretty damn sure my reading is correct. And no, it's not a "it
happens to work". It's _architecturally_required_ to work, and nobody has
ever complained about the use of a simple store to unlock a spinlock
(which would only work if the "reads can pass" only means "*later* reads
can pass *earlier* writes").

And it turns out that I think #1 is going away. Yes, the uarch will
internally re-order reads, wrt each other, but if it isn't architecturally
visible, then from an architectural standpoint #1 simply doesn't happen.

I can't guarantee that will happen, of course, but from talking to both
AMD and Intel people, I think that they'll just document the stricter
rules as the de-facto rules.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: queued spinlock code and results
Date: Sun, 08 Jul 2007 16:50:46 UTC
Message-ID: <fa.P66KCAdrkloZIfL0DSQ67SbL95k@ifi.uio.no>

On Sun, 8 Jul 2007, Andi Kleen wrote:
>
> Nick Piggin <npiggin@suse.de> writes:
>
> > I made some tests of the queued spinlock code using userspace test code on
> > 64-bit processors. I believe the xadd based code no longer has any theoretical
> > memory ordering problems.
>
> Linus, the background of this is that on 8 socket Opteron systems
> the current spinlocks can become very unfair to the point of severe
> starvation. These boxes are becomming more common.

Yeah, considering the numbers, I don't have any real objections here.

I would ask that the code be given to both Intel and AMD engineers to look
over, just to verify that the lfence is sufficient (or whether it's even
needed), but I think the use of "xaddw" to both increment _and_ load the
old value for the non-contention case is an obviously good (and clever)
way to handle that one, and even if we'd have to add something heavier
than the lfence to the contended case, it looks fine to me.

So the only remaining issue is that unfairness is probably really good for
some loads (not just for the spinlock itself - it will likely cause much
better cache behaviour for stuff _inside_ the lock to stay on the same
core), but I don't think we want to optimize for the contended case
anyway, so that's more of a "it will be interesting to see" kind of
comment.

In short: if we can have AMD/Intel engineers look this over for any subtle
issues, and they are happy, then I'm happy.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: queued spinlock code and results
Date: Mon, 09 Jul 2007 19:26:53 UTC
Message-ID: <fa.8LpJsL2WoEGN5HOaFna1qAmf4ts@ifi.uio.no>

On Mon, 9 Jul 2007, Davide Libenzi wrote:
>
> So in this box, and in this test, the double-short Z-lock seems faster
> than a double-byte. I've no idea why, since it uses two ops more and an
> extra register.

At this kind of level, the exact instruction scheduling can make a big
difference.

The extra register usage won't matter if there is no register pressure,
and any extra instructions can actually happen to *help*, if they end up
just aligning something just the right way.

There can also be various random effects of prefixes: decoding x86
instructions is basically a very uarch-specific issue, and for all we know
it might be that the AMD setup may well end up behaving differently from
most Intel chips (and within the Intel family, the netburst situation is
likely different from the other P6-derived cores).

For example, does a single prefix decode faster? It could be that the
combination of "lock" _and_ "opsize" prefixes is problematic (as in a
16-bit locked "lock xaddw"), and causes a decode hickup, but that "lock"
and "opsize" on their own don't cause any decoder issues (ie doing the
"lock" on the 32-bit xadd, and just the "opsize" prefix on the 16-bit decw
both are fast).

But on another uarch it might work out the other way: if "lock" is always
a complex op, then having a opsize prefix on that one might be "free", and
then you're better combining them for the locked 16-bit xadd, and having
the releasing "decb" not have any prefix at all.

And regardless of that, just a random "it happened to get aligned that
way" (where "alignment" might be about hitting the cache-line just right,
but might also be about just having the right instruction mix to get the
intel decoders to run at their full 4-1-1-1 capacity), causing the timing
differences.

So before taking these numbers as any kind of "real" values, I'd suggest:

 - trying it out on at least a few different uarchs (Opteron, P4 and Core
   2 all have quite different restrictions on decoding)

 - possibly trying it out with things in different order and different
   compiler options (-O2 vs -Os), trying to cause different kinds of
   alignment issues.

Also, just a small nit: in the kernel, the locking would _not_ be inlined
(but the unlocking would), so marking the lock functions "inline" is
probably a bad idea. Without the inline, it's likely more realistic, and
the effects of register pressure will be hidden. Because of the uninlining
nature of locks, I think you can generally ignore the "one or two
registers" issue - you'll have three caller-clobbered registers to play
with regardless.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: queued spinlock code and results
Date: Mon, 09 Jul 2007 19:55:40 UTC
Message-ID: <fa.hSvIRzZ9VnPiOkAJBzwIJMoxMTg@ifi.uio.no>

On Mon, 9 Jul 2007, Davide Libenzi wrote:
>
> The always-lfence instruction in vadd-lock really is painfull though.
> If numbers are close, and given that spinlock size considering structure
> alignments should not matter much, wouldn't it be better to use a double
> short and remove the 256 CPUs cap?

On x86? No.

There are no issues with the 255-CPU cap on 32-bit x86. It's just not
relevant to anybody. So the _only_ thing that matters is speed and to a
secondary degree size.

On x86-64, things are slightly different, and we would want to have at
least the _capability_ to do 16 bits. So there might be a (somewhat weak)
argument in favor of trying to share code.

But even then, size and performance are really the only things that
matter, and if the 8/16-bit version is no slower, then I'd pick that by
default, and suggest the 16/32-bit one to be enabled by CONFIG_MAX_CPU's
being >=256 (at which point you can share the code with x86 anyway, since
that just becomes the <256 cpu case).

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: queued spinlock code and results
Date: Mon, 09 Jul 2007 20:09:04 UTC
Message-ID: <fa.BCyzXsI9iICIq+B2y2leF26pB8Y@ifi.uio.no>

On Mon, 9 Jul 2007, Linus Torvalds wrote:
>
> There are no issues with the 255-CPU cap on 32-bit x86. It's just not
> relevant to anybody. So the _only_ thing that matters is speed and to a
> secondary degree size.

..of course, from a pure speed standpoint, the "lock dec" one seems to
be the fastest, with the difference bwteen the 16-bit/32-bit "lock xadd"
being comparatively totally in the noise.

Which is what I'd expect.

The difference between a 16-bit and 32-bit xadd should basically not be
likely to be really measurable (ie we're likely talking about a single CPU
cycle - if that - for the decode of the operand size override, and since
both variants need it for _one_ of the operations, it likely ends up being
about instruction scheduling noise), while the difference between a "dec"
and "xadd" could be the difference between a native uop and microcoded.

[ Not that "xadd" couldn't be as fast as a "dec" in theory, but it's much
  less likely to be that. It obviously has to actually write to two
  targets: the register -and- memory, and that tends to require at least
  an extra uop.

  And together with being a r-op-w memory instruction to begin with (which
  is generally the "most complex" normal instruction), and not a very
  often used instruction, the end result is that it would often tend to be
  handled specially somehow - either in a special decode unit, or as
  actual microcode. ]

So from a pure performance standpoint, xadd will likely continue to lose
against dec. So the reason to choose xadd in the first place isn't "best
performance", but "best performance given fairness".

And any performance difference between xadd and dec is going to be much
bigger than any difference between 16/32-bit versions of xadd.

So I wouldn't get too hung up on a potential single cycle, and it's
arguably more important to make the (inlined) "unlock" thing be as simple
and small as possible.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 08 of 11] anon-vma-rwsem
Date: Wed, 14 May 2008 15:20:01 UTC
Message-ID: <fa.Z+xH+KLrJ9CamUSD04awG5OhAx8@ifi.uio.no>

On Wed, 14 May 2008, Robin Holt wrote:
>
> Are you suggesting the sending side would not need to sleep or the
> receiving side?

One thing to realize is that most of the time (read: pretty much *always*)
when we have the problem of wanting to sleep inside a spinlock, the
solution is actually to just move the sleeping to outside the lock, and
then have something else that serializes things.

That way, the core code (protected by the spinlock, and in all the hot
paths) doesn't sleep, but the special case code (that wants to sleep) can
have some other model of serialization that allows sleeping, and that
includes as a small part the spinlocked region.

I do not know how XPMEM actually works, or how you use it, but it
seriously sounds like that is how things *should* work. And yes, that
probably means that the mmu-notifiers as they are now are simply not
workable: they'd need to be moved up so that they are inside the mmap
semaphore but not the spinlocks.

Can it be done? I don't know. But I do know that I'm unlikely to accept a
noticeable slowdown in some very core code for a case that affects about
0.00001% of the population. In other words, I think you *have* to do it.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 08 of 11] anon-vma-rwsem
Date: Wed, 14 May 2008 18:34:35 UTC
Message-ID: <fa.p6Kp/ylUnXYxC9ObvkD5Dq0lIoc@ifi.uio.no>

On Wed, 14 May 2008, Christoph Lameter wrote:
>
> The problem is that the code in rmap.c try_to_umap() and friends loops
> over reverse maps after taking a spinlock. The mm_struct is only known
> after the rmap has been acccessed. This means *inside* the spinlock.

So you queue them. That's what we do with things like the dirty bit. We
need to hold various spinlocks to look up pages, but then we can't
actually call the filesystem with the spinlock held.

Converting a spinlock to a waiting lock for things like that is simply not
acceptable. You have to work with the system.

Yeah, there's only a single bit worth of information on whether a page is
dirty or not, so "queueing" that information is trivial (it's just the
return value from "page_mkclean_file()". Some things are harder than
others, and I suspect you need some kind of "gather" structure to queue up
all the vma's that can be affected.

But it sounds like for the case of rmap, the approach of:

 - the page lock is the higher-level "sleeping lock" (which makes sense,
   since this is very close to an IO event, and that is what the page lock
   is generally used for)

   But hey, it could be anything else - maybe you have some other even
   bigger lock to allow you to handle lots of pages in one go.

 - with that lock held, you do the whole rmap dance (which requires
   spinlocks) and gather up the vma's and the struct mm's involved.

 - outside the spinlocks you then do whatever it is you need to do.

This doesn't sound all that different from TLB shoot-down in SMP, and the
"mmu_gather" structure. Now, admittedly we can do the TLB shoot-down while
holding the spinlocks, but if we couldn't that's how we'd still do it:
it would get more involved (because we'd need to guarantee that the gather
can hold *all* the pages - right now we can just flush in the middle if we
need to), but it wouldn't be all that fundamentally different.

And no, I really haven't even wanted to look at what XPMEM really needs to
do, so maybe the above thing doesn't work for you, and you have other
issues. I'm just pointing you in a general direction, not trying to say
"this is exactly how to get there".

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] netfilter: use per-cpu spinlock rather than RCU
Date: Mon, 13 Apr 2009 19:28:16 UTC
Message-ID: <fa.izx9e0QheKePI2PvBc3f55riEIU@ifi.uio.no>

On Mon, 13 Apr 2009, Martin Josefsson wrote:
>
> Doesn't spin_lock() result in a pipeline flush on x86?

It's about 20-40 cycles when cached and uncontended, with some outliers
(50+ cycles on P4, 12 cycles on some AMD opterons).

So it's not a big deal if you actually hit that case.

> iirc there was a benchmark in an RCU paper that tested using per cpu
> spin_locks and the result was that it didn't scale well at all.

Spinlocks scale wonderfully well if you only touch them on one CPU.

Of course, if you truly only touch them on one CPU they are pointless, but
a "all normal code only touches the local CPU spinlock, the really odd
cases take all locks" approach works fine. It makes the uncommon case
really quite slow, but if it truly is uncommon, that's fine.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
Date: Tue, 05 Jan 2010 17:58:05 UTC
Message-ID: <fa.av63hqWWhosqsJnL7uRenN6QTII@ifi.uio.no>

On Tue, 5 Jan 2010, Andi Kleen wrote:
>
> > Oh well. Somebody who is bored might look at trying to make the wrapper
> > code in arch/x86/lib/semaphore_32.S work on x86-64 too. It should make the
> > successful rwsem cases much faster.
>
> Maybe, maybe not.

If there is actual contention on the lock, but mainly just readers (which
is what the profile indicates: since there is no scheduler footprint, the
actual writer-vs-reader case is probably very rare), then the xadd is
likely to be _much_ faster than the spinlock.

Sure, the cacheline is going to bounce regardless (since it's a shared
per-mm data structure), but the spinlock is going to bounce wildly
back-and-forth between everybody who _tries_ to get it, while the regular
xadd is going to bounce just once per actual successful xadd.

So a spinlock is as cheap as an atomic when there is no contention (which
is the common single-thread case - the real cost of both lock and atomic
is simply the fact that CPU serialization is expensive), but when there is
actual lock contention, I bet the atomic xadd is going to be shown to be
superior.

Remember: we commonly claim that 'spin_unlock' is basically free on x86 -
and that's true, but it is _only_ true for the uncontended state.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
Date: Tue, 05 Jan 2010 18:28:29 UTC
Message-ID: <fa.CRSbqIp1IQ102BFynZmJolD45Xo@ifi.uio.no>

On Tue, 5 Jan 2010, Christoph Lameter wrote:
>
> If the critical section protected by the spinlock is small then the
> delay will keep the cacheline exclusive until we hit the unlock. This
> is the case here as far as I can tell.

I hope somebody can time it. Because I think the idle reads on all the
(unsuccessful) spinlocks will kill it.

Think of it this way: under heavy contention, you'll see a lot of people
waiting for the spinlocks and one of them succeeds at writing it, reading
the line. So you get an O(n^2) bus traffic access pattern. In contrast,
with an xadd, you get O(n) behavior - everybody does _one_ acquire-for-
write bus access.

Remember: the critical section is small, but since you're contending on
the spinlock, that doesn't much _help_. The readers are all hitting the
lock (and you can try to solve the O(n*2) issue with back-off, but quite
frankly, anybody who does that has basically already lost - I'm personally
convinced you should never do lock backoff, and instead look at what you
did wrong at a higher level instead).

				Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
Date: Tue, 05 Jan 2010 18:57:37 UTC
Message-ID: <fa.00FK13RWsChDc8YhY4l0JvHKUxI@ifi.uio.no>

On Tue, 5 Jan 2010, Christoph Lameter wrote:
>
> There will not be much spinning. The cacheline will be held exclusively by
> one processor. A request by other processors for shared access to the
> cacheline will effectively stop the execution on those processors until
> the cacheline is available.

AT WHICH POINT THEY ALL RACE TO GET IT TO SHARED MODE, only to then have
one of them actually see that it got a ticket!

Don't you see the problem? The spinlock (with ticket locks) essentially
does the "xadd" atomic increment anyway, but then it _waits_ for it. All
totally pointless, and all just making for problems, and wasting CPU time
and causing more cross-node traffic.

In contrast, a native xadd-based rwsem will basically do the atomic
increment (that ticket spinlocks also do) and then just go on with its
life. Never waiting for all the other CPU's to also do their ticket
spinlocks.

The "short critical region" you talk about is totally meaningless, since
the contention will mean that everybody is waiting and hitting that
cacheline for a long time regardless - they'll be waiting for O(n)
_other_ CPU's to get the lock first!

		Linus



From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
Date: Tue, 05 Jan 2010 19:29:54 UTC
Message-ID: <fa.WR1qvcXQYgnFOql569TZMeEaMo4@ifi.uio.no>

On Tue, 5 Jan 2010, Christoph Lameter wrote:
>
> The wait state is the processor being stopped due to not being able to
> access the cacheline. Not the processor spinning in the xadd loop. That
> only occurs if the critical section is longer than the timeout.

You don't know what you're talking about, do you?

Just go and read the source code.

The process is spinning currently in the spin_lock loop. Here, I'll quote
it to you:

                LOCK_PREFIX "xaddw %w0, %1\n"
                "1:\t"
                "cmpb %h0, %b0\n\t"
                "je 2f\n\t"
                "rep ; nop\n\t"
                "movb %1, %b0\n\t"
                /* don't need lfence here, because loads are in-order */
                "jmp 1b\n"

note the loop that spins - reading the thing over and over - waiting for
_that_ CPU to be the owner of the xadd ticket?

That's the one you have now, only because x86-64 uses the STUPID FALLBACK
CODE for the rwsemaphores!

In contrast, look at what the non-stupid rwsemaphore code does (which
triggers on x86-32):

                     LOCK_PREFIX "  incl      (%%eax)\n\t"
                     /* adds 0x00000001, returns the old value */
                     "  jns        1f\n"
                     "  call call_rwsem_down_read_failed\n"

(that's a "down_read()", which happens to be the op we care most about.
See? That's a single locked "inc" (it avoids the xadd on the read side
because of how we've biased things). In particular, notice how this means
that we do NOT have fifty million CPU's all trying to read the same
location while one writes to it successfully.

Spot the difference?

Here's putting it another way. Which of these scenarios do you think
should result in less cross-node traffic:

 - multiple CPU's that - one by one - get the cacheline for exclusive
   access.

 - multiple CPU's that - one by one - get the cacheline for exclusive
   access, while other CPU's are all trying to read the same cacheline at
   the same time, over and over again, in a loop.

See the shared part? See the difference? If you look at just a single lock
acquire, it boils down to these two scenarios

 - one CPU gets the cacheline exclusively

 - one CPU gets the cacheline exclusively while <n> other CPU's are all
   trying to read the old and the new value.

It really is that simple.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
Date: Tue, 05 Jan 2010 19:09:28 UTC
Message-ID: <fa.7pKkSXt09Ot98JejMJD75r6DQGg@ifi.uio.no>

On Tue, 5 Jan 2010, Paul E. McKenney wrote:
>
> But on many systems, it does take some time for the idle reads to make
> their way to the CPU that just acquired the lock.

Yes. But the point is that there is lots of them.

So think of it this way: every time _one_ CPU acquires a lock (and
then releases it), _all_ CPU's will read the new value. Imagine the
cross-socket traffic.

In contrast, doing just a single xadd (which replaces the whole
"spin_lock+non-atomics+spin_unlock"), every times _once_ CPU cquires a
lock, that's it. The other CPU's aren't all waiting in line for the lock
to be released, and reading the cacheline to see if it's their turn.

Sure, after they got the lock they'll all eventually end up reading from
that cacheline that contains 'struct mm_struct', but that's something we
could even think about trying to minimize by putting the mmap_sem as far
away from the other fields as possible.

Now, it's very possible that if you have a broadcast model of cache
coherency, none of this much matters and you end up with almost all the
same bus traffic anyway. But I do think it could matter a lot.

		Linus


From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC][PATCH 6/8] mm: handle_speculative_fault()
Date: Tue, 05 Jan 2010 19:24:16 UTC
Message-ID: <fa.yLRnsYlDY0ZLyOS1Cuh6VoCOGkM@ifi.uio.no>

On Tue, Jan 05, 2010 at 11:08:46AM -0800, Linus Torvalds wrote:
>
>
> On Tue, 5 Jan 2010, Paul E. McKenney wrote:
> >
> > But on many systems, it does take some time for the idle reads to make
> > their way to the CPU that just acquired the lock.
>
> Yes. But the point is that there is lots of them.
>
> So think of it this way: every time _one_ CPU acquires a lock (and
> then releases it), _all_ CPU's will read the new value. Imagine the
> cross-socket traffic.

Yep, been there, and with five-microsecond cross-"socket" latencies.
Of course, the CPU clock frequencies were a bit slower back then.

> In contrast, doing just a single xadd (which replaces the whole
> "spin_lock+non-atomics+spin_unlock"), every times _once_ CPU cquires a
> lock, that's it. The other CPU's arent' all waiting in line for the lock
> to be released, and reading the cacheline to see if it's their turn.
>
> Sure, after they got the lock they'll all eventually end up reading from
> that cacheline that contains 'struct mm_struct', but that's something we
> could even think about trying to minimize by putting the mmap_sem as far
> away from the other fields as possible.
>
> Now, it's very possible that if you have a broadcast model of cache
> coherency, none of this much matters and you end up with almost all the
> same bus traffic anyway. But I do think it could matter a lot.

I have seen systems that work both ways.  If the CPU has enough time
between getting the cache line in unlocked state to lock it, do the
modification, and release the lock before the first in the flurry of
reads arrives, then performance will be just fine.  Each reader will
see the cache line with an unlocked lock and life will be good.

On the other hand, as you say, if the first in the flurry of reads arrives
before the CPU has managed to make its way through the critical section,
then your n^2 nightmare comes true.

If the critical section operates only on the cache line containing the
lock, and if the critical section has only a handful of instructions,
I bet you win on almost all platforms.  But, like you, I would still
prefer the xadd (or xchg or whatever) to the lock, where feasible.
But you cannot expect all systems to see a major performance boost from
switching from locks to xadds.  Often, all you get is sounder sleep.
Which is valuable in its own right.  ;-)

							Thanx, Paul

Index Home About Blog