Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling
Original-Message-ID: <Pine.LNX.4.58.0501271658460.2362@ppc970.osdl.org>
Date: Fri, 28 Jan 2005 01:08:57 GMT
Message-ID: <fa.gue006e.82iq1q@ifi.uio.no>

On Thu, 27 Jan 2005, Paul Jackson wrote:
>
> A long time ago, Linus wrote:
> > An atomic op is pretty much as expensive as a spinlock/unlock pair on x86.
> > Not _quite_, but it's pretty close.
>
> Are both read and modify atomic ops relatively expensive on some CPUs,
> or is it just modify atomic ops?

Linux pretty much assumes that any naturally aligned read or write of the
basic word size (actually "int"/"long"/pointer) is always atomic.
That's true on all architectures that do SMP today. In fact, it's hard to
see it ever being untrue, although specialized architectures with
out-of-band locking bits might have different assumptions (and return
"invalid" for a memory location that is "in flux").

So if you just do a regular read or write, that's basically always cheap.

HOWEVER. In the absence of locking, you usually can't just do a regular
read or write. You'd need memory barriers etc, which quite easily bring
the cost up to something similar to locking (modulo cacheline bounces,
which would happen on the actual access, not the barrier).
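
For example, even the simplest "write some data, then set a flag" pattern
already needs a barrier pair when done without a lock. A minimal sketch
(the data/flag names are made up; smp_wmb()/smp_rmb() are the real
primitives):

	/* writer */
	data = value;
	smp_wmb();		/* publish the data before the flag */
	flag = 1;

	/* reader */
	if (flag) {
		smp_rmb();	/* don't read the data before seeing the flag */
		use(data);
	}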

Also, "bigger" accesses are not atomic, ie a 64-bit read on a 32-bit
platform usually requires a lock (or equivalent data structures, like
sequence numbers with memory barriers).
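
The sequence-number variant looks roughly like this with the kernel's
seqcount helpers (a sketch - the 64-bit value and the names around it are
made up, and writers are assumed to already be serialized against each
other):

	static seqcount_t seq;		/* <linux/seqlock.h> */
	static u64 value;		/* can't be read atomically on 32-bit */

	/* writer (already serialized against other writers) */
	write_seqcount_begin(&seq);
	value = new_value;
	write_seqcount_end(&seq);

	/* reader: retry if a writer was active in the middle */
	unsigned start;
	u64 copy;
	do {
		start = read_seqcount_begin(&seq);
		copy = value;
	} while (read_seqcount_retry(&seq, start));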

The _real_ cost of not locking also often ends up being the inevitable
bugs. Doing clever things with memory barriers is almost always a bug
waiting to happen. It's just _really_ hard to wrap your head around all
the things that can happen on ten different architectures with different
memory ordering, and a single missing barrier.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async resume patch (was: Re: [GIT PULL] PM updates for 2.6.33)
Date: Tue, 08 Dec 2009 15:36:06 UTC
Message-ID: <fa.ZtjGaLvQg8vDj0SFlR56KNbJLDk@ifi.uio.no>

On Tue, 8 Dec 2009, Rafael J. Wysocki wrote:
>
> The wait queue plus the op_complete flag combo plays the role of the locking
> in the Linus' picture

Please just use the lock. Don't make up your own locking crap. Really.

Your patch is horrible. Exactly because your locking is horribly
mis-designed. You can't say things are complete from an interrupt, for
example, since you made it some random bitfield, which has unknown
characteristics (ie non-atomic read-modify-write etc).
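
Just to spell out the bitfield problem (a sketch - the struct and the names
are made up to match the description above):

	struct pm_op_state {
		unsigned int	op_complete:1;	/* set from the resume path */
		unsigned int	other_flag:1;	/* set from somewhere else */
	};

	static void mark_complete(struct pm_op_state *state)
	{
		/* This compiles into a read-modify-write of the whole word
		   that contains *both* bits, so a concurrent update of
		   'other_flag' on another CPU can silently be lost. */
		state->op_complete = 1;
	}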

The fact is, any time anybody makes up a new locking mechanism, THEY
ALWAYS GET IT WRONG. Don't do it.

I suggested using the rwsem locking for a good reason. It made sense. It
was simpler. Just do it that way, stop making up crap.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async resume patch (was: Re: [GIT PULL] PM updates for 2.6.33)
Date: Tue, 08 Dec 2009 16:43:14 UTC
Message-ID: <fa.g3TORoEYWdQw9V81QGwODZUuack@ifi.uio.no>

On Tue, 8 Dec 2009, Alan Stern wrote:
>
> The semantics needed for this kind of lock aren't really the same as
> for an rwsem (although obviously an rwsem will do the job).  Basically
> it needs to have the capability for multiple users to lock it (no
> blocking when acquiring a lock) and the capability for a user to wait
> until it is totally unlocked.  It could be implemented trivially using
> an atomic_t counter and a waitqueue head.
>
> Is this a standard sort of lock?

Yes it is.

It's called a rwlock. The counter is for readers, the exclusion is for
writers.

Really.

And the thing is, you actually do want the rwlock semantics, because on
the resume side you want the parent to lock it for writing first (so that
the children can wait for the parent to have completed its resume).

So we actually _want_ the full rwlock semantics.

See the code I posted earlier. Here condensed into one email:

 - resume:

        usb_node_resume(node)
        {
		// Wait for parent to finish resume
                down_read(node->parent->lock);
		// .. before resuming ourselves
                node->resume(node);

		// Now we're all done
                up_read(node->parent->lock);
                up_write(node->lock);
        }

	/* caller: */
	..
	// This won't block, because we resume parents before children,
	// and the children will take the read lock.
        down_write(leaf->lock);
	// Do the blocking part asynchronously
        async_schedule(usb_node_resume, leaf);
	..

 - suspend:

        usb_node_suspend(node)
        {
                // Start our suspend. This will block if we have any
                // children that are still busy suspending (they will
                // have done a down_read() in their suspend).
                down_write(node->lock);
                node->suspend(node);
                up_write(node->lock);

                // This lets our parent continue
                up_read(node->parent->lock);
        }

	/* caller: */

	// This won't block, because we suspend nodes before parents
        down_read(node->parent->lock);
	// Do the part that may block asynchronously
	async_schedule(usb_node_suspend, node);

It really should be that simple. Nothing more, nothing less. And with the
above, finishing the suspend (or resume) from interrupts is fine, and you
don't have any new lock that has undefined memory ordering issues etc.
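
(For completeness: the above assumes each node carries a plain rwsem,
something like this - the struct layout is made up, only the rw_semaphore
part matters:)

	#include <linux/rwsem.h>

	struct usb_node {
		struct usb_node		*parent;
		struct rw_semaphore	lock;	/* the code above would pass &node->lock */
		void			(*suspend)(struct usb_node *node);
		void			(*resume)(struct usb_node *node);
	};

	static void usb_node_init_lock(struct usb_node *node)
	{
		init_rwsem(&node->lock);	/* once, at device creation time */
	}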

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async resume patch (was: Re: [GIT PULL] PM updates for 2.6.33)
Date: Tue, 08 Dec 2009 18:42:18 UTC
Message-ID: <fa.AJslCfQ2gKCTEGJaO7osCCDg1qU@ifi.uio.no>

On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > So we actually _want_ the full rwlock semantics.
>
> I'm not convinced.  Condense the description a little farther:
>
> 	Suspend:  Children lock the parent first.  When they are
> 		finished they unlock the parent, allowing it to
> 		proceed.
>
> 	Resume:  Parent locks itself first.  When it is finished
> 		it unlocks itself, allowing the children to proceed.

Yes. You can implement it with a simple lock with a count. Nobody debates
that.

But a simple counting lock _is_ a rwlock. Really. They are 100%
semantically equivalent. There is no difference.

> The whole readers vs. writers thing is a non-sequitur.

No it's not.

It's a 100% equivalent problem. It's purely a change of wording. The end
result is the same.

> The simplest generalization is to allow both multiple lockers and
> multiple waiters.  Call it a waitlock, for want of a better name:

But we have that. It _has_ a better name: rwlocks.

And the reason the name is better is because now the name describes all
the semantics to anybody who has ever taken a course in operating systems
or in parallelism.

It's also a better implementation, because it actually _works_.

> 	wait_lock(wl)
> 	{
> 		atomic_inc(&wl->count);
> 	}
>
> 	wait_unlock(wl)
> 	{
> 		if (atomic_dec_and_test(&wl->count)) {
> 			smp_mb__after_atomic_dec();
> 			wake_up_all(wl->wqh);
> 		}
> 	}
>
> 	wait_for_lock(wl)
> 	{
> 		wait_event(wl->wqh, atomic_read(&wl->count) == 0);
> 		smp_rmb();
> 	}
>
> Note that both wait_lock() and wait_unlock() can be called
> in_interrupt.

And note how even though you sprinkled random memory barriers around, you
still got it wrong.

So you just implemented a buggy lock, and for what gain? Tell me exactly
why your buggy lock (assuming you'd know enough about memory ordering to
actually fix it) is better than just using the existing one?

It's certainly not smaller. It's not faster. It doesn't have support for
lockdep. And it's BUGGY.

Really. Tell me why you want to re-implement an existing lock - badly.

[ Hint: you need a smp_mb() *before* the atomic_dec() in wait-unlock, so
  that anybody else who sees the new value will be guaranteed to have seen
  anything else the unlocker did.

  You also need a smp_mb() in the wait_for_lock(), not a smp_rmb(). Can't
  allow writes to migrate up either.  'atomic_read()' does not imply any
  barriers.

  But most architectures can optimize these things for their particular
  memory ordering model, and do so in their rwsem implementation. ]
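
(IOW, a minimally "fixed" version of the waitlock quoted above would have
to look something like this - still a made-up primitive, and still no
better than just using an rwsem:)

	wait_unlock(wl)
	{
		smp_mb();	/* everything the unlocker did must be
				   visible before anybody sees the new count */
		if (atomic_dec_and_test(&wl->count))
			wake_up_all(wl->wqh);
	}

	wait_for_lock(wl)
	{
		wait_event(wl->wqh, atomic_read(&wl->count) == 0);
		smp_mb();	/* full barrier: neither reads nor writes may
				   move up past the check of the count */
	}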

> This becomes:
>
> 	usb_node_resume(node)
> 	{
> 		// Wait for parent to finish resume
> 		wait_for_lock(node->parent->lock);
> 		// .. before resuming ourselves
>                 node->resume(node)
>
> 		// Now we're all done
> 		wait_unlock(node->lock);
> 	}
>
> 	/* caller: */
> 	..
> 	// This can't block, because wait_lock() is non-blocking.
> 	wait_lock(node->lock);
> 	// Do the blocking part asynchronously
> 	async_schedule(usb_node_resume, leaf);
> 	..

Umm? Same thing, different words?

That "wait_for_lock()" is equivalent to a 'read_lock()+read_unlock()'. We
_could_ expose such a mechanism for rwsem's too, but why do it? It's
actually nicer to use a real read-lock - and do it _around_ the operation,
because now the locking also automatically gets things like overlapping
suspends and resumes right.

(Which you'd obviously hope never happens, but it's nice from a conceptual
standpoint to know that the locking is robust).

> Aren't waitlocks simpler than rwsems?  Not as generally useful,
> perhaps.  But just as correct in this situation.

NO!

Dammit. I started this whole rant with this comment to Rafael:

 "The fact is, any time anybody makes up a new locking mechanism, THEY
  ALWAYS GET IT WRONG. Don't do it."

Take heed. You got it wrong. Admit it. Locking is _hard_. SMP memory
ordering is HARD.

So leave locking to the pro's. They _also_ got it wrong, but they got it
wrong several years ago, and fixed up their sh*t.

This is why you use generic locking. ALWAYS.

		Linus




From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async resume patch (was: Re: [GIT PULL] PM updates for 2.6.33)
Date: Tue, 08 Dec 2009 18:52:47 UTC
Message-ID: <fa.6kLTHeaoTWYynLCcAPKKnb2iNpc@ifi.uio.no>

On Tue, 8 Dec 2009, Linus Torvalds wrote:
>
> [ Hint: you need a smp_mb() *before* the atomic_dec() in wait-unlock, so
>   that anybody else who sees the new value will be guaranteed to have seen
>   anything else the unlocker did.
>
>   You also need a smp_mb() in the wait_for_lock(), not a smp_rmb(). Can't
>   allow writes to migrate up either.  'atomic_read()' does not imply any
>   barriers.
>
>   But most architectures can optimize these things for their particular
>   memory ordering model, and do so in their rwsem implementation. ]

Side note: if this were a real lock, you'd also need an smp_wmb() in the
'wait_lock()' path after the atomic_inc(), to make sure that the
incremented count is visible to other people before the suspend itself
starts.
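
That is, for a general-purpose lock the locking path would have to become
something like this (a sketch of the same made-up primitive):

	wait_lock(wl)
	{
		atomic_inc(&wl->count);
		smp_wmb();	/* the new count must be visible before any
				   stores done under the "lock" */
	}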

In your usage scenario, I don't think it would ever be noticeable, since
the other users are always going to start running from the same thread
that did the wait_lock(), so even if they run on other CPU's, we'll have
scheduled _to_ those other CPU's and done enough memory ordering to
guarantee that they will see the thing.

So it would be ok in this situation, simply because it acts as an
initializer and never sees any real SMP issues.

But it's an example of how you now don't just depend on the locking
primitives themselves doing the right thing, you end up depending very
subtly on exactly how the lock is used.  The standard locks do have the
same kind of issue for initializers, but we avoid it elsewhere because
it's so risky.

				Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async resume patch (was: Re: [GIT PULL] PM updates for 2.6.33)
Date: Tue, 08 Dec 2009 20:49:03 UTC
Message-ID: <fa.Nk+gUSJz8cYSGiqFVe8TcJxsw8A@ifi.uio.no>

On Tue, 8 Dec 2009, Alan Stern wrote:
>
> >   You also need a smp_mb() in the wait_for_lock(), not a smp_rmb(). Can't
> >   allow writes to migrate up either.  'atomic_read()' does not imply any
> >   barriers.
>
> No, that's not needed.  Unlike reads, writes can't move in front of
> data or control dependencies.  Or so I've been led to believe...

Sure they can. Control dependencies are trivial - it's called "branch
prediction", and everybody does it, and data dependencies don't exist on
many CPU architectures (even to the point of reading through a pointer
that you loaded).

But yes, on x86, stores only move down. But that's an x86-specific thing.

[ Not that it isn't also very common - write buffering is easy and matters
  for performance, so any in-order implementation will generally do it. In
  contrast, writes moving up doesn't really help performance and is harder
  to do, but it can happen with a weakly ordered memory subsystem,
  especially if you have multi-way caches where some ways are busy and end
  up being congested.

  So the _common_ case is definitely about delaying writes and doing reads
  early if possible. But it's not necessarily at all guaranteed in
  general. ]

> > That "wait_for_lock()" is equivalent to a 'read_lock()+read_unlock()'.
>
> Not really.  It also corresponds to a 'write_lock()+write_unlock()' (in
> the suspend routine).  Are you claiming these two compound operations
> are equivalent?

They have separate semantics, and you just want to pick the one that suits
you. Your counting lock doesn't have the "read_lock+read_unlock" version,
it only has the write_lock/unlock one (ie it requires the thing to be
totally unlocked).

The point being, rwsem's can do everything your counting lock does. And
they already exist. And they already know about all the subtleties of
architecture-specific memory ordering etc.

				Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async resume patch (was: Re: [GIT PULL] PM updates for 2.6.33)
Date: Tue, 08 Dec 2009 22:17:05 UTC
Message-ID: <fa.tIBw0b/onW+ZqpBSvpvg5eRmBXk@ifi.uio.no>

On Tue, 8 Dec 2009, Alan Stern wrote:
> >
> > Sure they can. Control dependencies are trivial - it's called "branch
> > prediction", and everybody does it, and data dependencies don't exist on
> > many CPU architectures (even to the point of reading through a pointer
> > that you loaded).
>
> Wait a second.  Are you saying that with code like this:
>
> 	if (x == 1)
> 		y = 5;
>
> the CPU may write to y before it has finished reading the value of x?

Well, in a way.  The branch may have been predicted, and the CPU can
_internally_ have done the 'y=5' thing into a write buffer before it even
did the read.

Some time later it will have to _verify_ the prediction and then perhaps
kill the write before it makes it to a data structure that is visible to
others, but internally from the CPU standpoint, yes, the write could have
happened before the read.

Now, whether that write is "before" or "after" the read is debatable. But
one way of looking at it is certainly that the write took place earlier,
and the read might have just caused it to be undone.

And there are real effects of this - looking at the bus, you might have a
bus transaction to get the cacheline that contains 'y' for exclusive
access happen _before_ the bus transaction that reads in the value of 'x'
(but you'd never see the writeout of that '5' before).

> And this write is visible to other CPUs, so that if x was initially 0
> and a second CPU sets x to 1, the second CPU may see y == 5 before it
> executes the write to x (whatever that may mean)?

Well, yes and no. CPU1 above won't release the '5' until it has confirmed
the '1' (even if it does so by reading it late). But assuming the other
CPU also does speculation, then yes, the situation you describe could
happen. If the other CPU does

		z = y;
		x = 1;

then it's certainly possible that 'z' contains 5 at the end (even if both
x and y started out zero). Because now the read of 'y' on that other CPU
might be delayed, and the write of 'x' goes ahead, CPU1 sees the 1, and
commits its write of 5, so when CPU2 gets the cacheline, z will now
contain 5.

Is it likely? No. CPU microarchitectures aim to do reads early, and writes
late. Reads are on the critical path, writes can be buffered. But you can
basically get into "impossible" situations where a write that was _later_
in the instruction stream than a read (on CPU2, the 'store 1 to x' would
be after the load of 'y' from memory) could show up in the other order on
another CPU.


			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Async resume patch (was: Re: [GIT PULL] PM updates for 2.6.33)
Date: Wed, 09 Dec 2009 21:54:28 UTC
Message-ID: <fa.63HFXDxas2yMkPDxORS5VYHh7JI@ifi.uio.no>

On Wed, 9 Dec 2009, Alan Stern wrote:
>
> That could be attributed to reordering on CPU2, so let's take CPU2's
> peculiarities out of the picture (initially everything is set to 0):
>
> 	CPU1			CPU2
> 	----			----
> 	if (x == 1)		z = y;
> 		y = 5;		mb();
> 				x = 1;
>
> This gets at the heart of the question: Can a write move up past a
> control dependency?
>  [ .. ]
> Can z end up equal to 5 in any of these examples?

In any _practical_ microarchitecture I know of, the above will never
result in 'z' being 5, even though CPU1 doesn't really have a memory
barrier. But if I read the alpha memory ordering guarantees right, then at
least in theory you really can end up with z=5.

Let me write that as five events (with the things in brackets being what
the alpha memory ordering manual calls them):

 - A is "read of x returns 1" on CPU1   [ P1:R(x,1) ]
 - B is "write of value 5 to y" on CPU1 [ P1:W(y,5) ]
 - C is "read of y returns 5" on CPU2   [ P2:R(y,5) ]
 - D is "write of value 1 to x" on CPU2 [ P2:W(x,1) ]
 - 'MB' is the mb() on CPU2             [ P2:MB ]

(The write of 'z' is irrelevant, we can think of it as a register, the end
result is the same).

And yes, if I read the alpha memory ordering rules correctly, you really
can end up with z=5, although I don't think you will ever find an alpha
_implementation_ that does it.

Why?

The alpha memory ordering literally defines ordering in two ways:

 - "location access order". But that is _only_ defined per actual
   location, so while 'x' can have a location access order specified by
   seeing certain values, there is no "location access order" for two
   different memory locations (x and y).

   The alpha architecture manual uses "A << B" to say "event A" is before
   "event B" when there is a defined ordering.

   So in the example above, there is a location access ordering between

	P2:W(x,1) << P1:R(x, 1)

   and

	P1:W(y,5) << P2:R(y,5)

   ie you have D << A and B << C.

   Good so far, but that doesn't define anything else: there's only
   ordering between the pairs (D,A) and (B,C), nothing between them.

 - "Processor issue order" for two instructions is _only_ defined by either
   (a) memory barriers or (b) accesses to the _same_ locations. The alpha
   architecture manual uses "A < B" to say that "event A" is before "event
   B" in processor issue order.

   So there is a "Processor issue order" on CPU2 due to the memory
   barrier: P2:R(y,5) < P2:MB < P2:W(x,1), or put another way C < MB < D:
   C < D.

Now, the question is, can we actually get the behaviour of reading 5 on
CPU2 (ie P2:R(y,5)), and that is only possible if we can find an ordering
that satisfies all the constraints. We have

	D << A
	B << C
	C < D

and it seems that this is indeed a possible situation: the order "B C D A"
really does satisfy all the constraints afaik.

So yes, according to the actual alpha architecture memory ordering rules,
you can see '5' from that first read of 'y'. DESPITE having a mb() on
CPU2.

In order to not see 5, you need to also specify "A < B", and the _only_
way to do that processor issue order specification is with a memory
barrier (or if the locations are the same, which they aren't).
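
In the notation of the example, something like this on CPU1 would do it
(same sketch as before, with the one added barrier):

	CPU1			CPU2
	----			----
	if (x == 1)		z = y;
		smp_mb();	mb();
		y = 5;		x = 1;

which gives P1:R(x,1) < P1:MB < P1:W(y,5), ie exactly the missing A < B.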

"Causality" simply is nowhere in the officially defined alpha memory
ordering. The fact that we test 'x == 1' and conditionally do the write
simply doesn't enter the picture. I suspect you'd have a really hard time
not having causality in practice, but there _are_ things that can break
causality (value prediction etc), so it's not like you'd have to actually
violate physics of reality to do it.

IOW, you could at least in theory implement a CPU that does every
instruction speculatively in parallel, and then validates the end result
afterwards according to the architecture rules. And that CPU would require
the memory barrier on alpha.

(On x86, 'causality' is defined to be part of the memory ordering rules,
so on x86, you _do_ have a 'A < B' relationship. But not on alpha).

				Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 8/9] Documentation: Fix invalid rcu assumptions
Date: Fri, 11 Dec 2009 16:10:51 UTC
Message-ID: <fa.GovF3YHQRNROPQZnrWw/LPdw9pQ@ifi.uio.no>

On Fri, 11 Dec 2009, David Howells wrote:

> Thomas Gleixner <tglx@linutronix.de> wrote:
>
> > -197          * we use RCU protection here
> > +196          * caller must be holding the RCU readlocke
>
> You mean "readlock" I suspect.

Or maybe he's talking about ye olde readlocke, used widely for OS research
throughout the middle ages. You still find that spelling in some really
old CS literature.

			Linus
