From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC PATCH] Fair low-latency rwlock v5
Date: Mon, 18 Aug 2008 19:03:35 UTC
Message-ID: <fa.i0A7jNLzdvpFkX1LWvtgizJSawY@ifi.uio.no>

On Sun, 17 Aug 2008, Mathieu Desnoyers wrote:
>
> Ah, you are right. This new version implements a "single cmpxchg"
> uncontended code path, changes the _fast semantic for "trylock
> uncontended" and also adds the trylock primitives.

Can you also finally

 - separate out the fastpaths more clearly. The slowpaths should not even
   be visible in the I$ footprint. The fastpaths are probably better off
   as a separate "fastpath.S" file to make sure that the compiler doesn't
   inline the slowpath or do other insane things.

   If the fastpath isn't just a few instructions, there's something wrong.
   It should be written in assembly, so that it's very clear what the
   fastpath is. Because most of the time, the fastpath is all that really
   matters (at least as long as there is some reasonable slow-path and
   contention behaviour - the slowpath matters in that it shouldn't ever
   be a _problem_).

 - don't call these "fair". They may be fairER than the regular rwlocks,
   but you'd really need to explain that, and true fairness you (a) can't
   really get without explicit queueing and (b) probably don't even want,
   because the only starvation people tend to worry about is reader vs
   writer, and writer-writer fairness tends to be unimportant. So it's not
   that people want the (expensive) "fairness", it's really that they want
   something _reasonably_ fair considering the normal worries.

   (ie if you have so much write activity that you get into write-write
   fairness worries, you shouldn't be using a rwlock to begin with, so
   that level of fairness is simply not very interesting).

   People react emotionally and too strongly to words like "fairness"
   or "freedom". Different people have different ideas on what they mean,
   and take them to pointless extremes. So just don't use the words, they
   just detract from the real issue.

Also, the performance numbers on their own are pretty irrelevant, since
there's nothing to compare them to. It would be much more relevant if you
had a "this is what the standard rwlock does" as a comparison for each
number.

				Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC PATCH] Fair low-latency rwlock v5
Date: Tue, 19 Aug 2008 16:50:56 UTC
Message-ID: <fa.uz/Y4BZRbT36KT7WTwv45C7AgAs@ifi.uio.no>

On Tue, 19 Aug 2008, Mathieu Desnoyers wrote:
>
> It strikes me that Intel has a nice (probably slow?) cmpxchg16b
> instruction on x86_64. Therefore, we could atomically update 128 bits,
> which gives the following table :

Stop this crapola.

Take a look at the rwsem thing I pointed you to. After you understand
that, come back.

The WHOLE POINT of that thing was to use only 32-bit atomics on the hot
path. Don't even start thinking about cmpxchg16b. If you cannot do your
atomics in 32-bit, they're broken.

Please. I realize that the rwsem implementation I did is subtle. But
really. Spend the time to understand it.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC PATCH] Writer-biased low-latency rwlock v8
Date: Thu, 21 Aug 2008 21:02:28 UTC
Message-ID: <fa.5RkIXic+z6CNvQi6RlTJ8ecJBlI@ifi.uio.no>

On Thu, 21 Aug 2008, Mathieu Desnoyers wrote:
>
> Given that I need to limit the number of readers to a smaller amount of
> bits

I stopped reading right there.

Why?

Because that is already crap.

Go look at my code once more. Go look at how it has 128 bits of data,
exactly so that it DOES NOT HAVE TO LIMIT THE NUMBER OF READERS.

And then go look at it again.

Look at it five times, and until you can understand that it still uses
just a 32-bit word for the fast-path and no unnecessary crap in it, but
it actually has 128 bits of data for all the slow paths, don't bother
emailing me any new versions.

Please. You -still- apparently haven't looked at it, at least not enough
to understand the _point_ of it. You still go on about trying to fit in
three or four different numbers in that one word. Even though the whole
point of my rwlock is that you need exactly _one_ count (active readers),
and _one_ bit (active writer) and _one_ extra bit ("contention, go to slow
path, look at the other bits ONLY IN THE SLOW PATH!")

That leaves 30 bits for readers. If you still think you need to "limit the
number of readers", then you aren't getting it.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC PATCH] Writer-biased low-latency rwlock v8
Date: Thu, 21 Aug 2008 21:17:13 UTC
Message-ID: <fa.iK6jcvs7jHEXqL3uQo14RxmWL94@ifi.uio.no>

On Thu, 21 Aug 2008, Linus Torvalds wrote:
>
> That leaves 30 bits for readers. If you still think you need to "limit the
> number of readers", then you aren't getting it.

Side note: the actual main rwlock thing is designed for a 64-bit word and
the waiters separately as two 32-bit words, so it doesn't really do what I
describe, but that's actually because the whole sleeping thing is _harder_
than a spinning thing, and has races with wakeups etc.

A spinning thing, in contrast, is pretty trivial.

So here's what I think your code should be like:

 rdlock:
	movl $4,%eax
	lock ; xaddl %eax,(%rdi)
	testl $3,%eax
	jne __rdlock_slowpath
	ret

 rwlock:
	xorl %eax,%eax
	movl $1,%edx
	lock ; cmpxchgl %edx,(%rdi)
	jne __rwlock_slowpath
	ret

 rdunlock:
	lock ; subl $4,(%rdi)
	ret

 rwunlock:
	lock ; andl $~1,(%rdi)
	ret

and I'm pretty damn sure that that should be totally sufficient for a
spinning rwlock. The non-spinning one is more complex just because the
unlock paths need to guarantee that something gets woken up, that just
isn't an issue when you do spinlocks.

Now, in the slow-path:
 - on the rwlock slowpath side, set bit#1 to make sure that readers get
   caught in the slowpath
 - then do a *separate* count of how many pending readers and writers
   (ie the ones that got caught into the slowpath) you have (one word
   each is probably fine), and then the slowpaths can just do the right
   thing depending on whether there are pending readers/writers.
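The two steps above can be sketched for the writer side in C. This is a hypothetical sketch with illustrative names (nothing here is from an actual kernel API), assuming the fastpath encoding of the assembly: bit 0 = writer, bit 1 = contention, reader count above that:

```c
/* Hypothetical writer slow path for the spinning rwlock described above.
 * All names are illustrative. */
#include <assert.h>
#include <stdatomic.h>

struct rwlock {
	atomic_uint fastpath;
	atomic_uint pending_readers;	/* separate counts, outside the hot word */
	atomic_uint pending_writers;
};

void __wrlock_slowpath(struct rwlock *lock)
{
	atomic_fetch_add(&lock->pending_writers, 1);
	atomic_fetch_or(&lock->fastpath, 2);	/* trap new readers in the slowpath */

	for (;;) {
		unsigned old = atomic_load(&lock->fastpath);
		/* wait until no active readers and no active writer remain */
		if ((old & ~2u) == 0 &&
		    atomic_compare_exchange_strong(&lock->fastpath, &old, old | 1))
			break;
		/* cpu_relax() would go here in real code */
	}

	/* the last queued writer drops the contention bit; the now-set
	 * writer bit keeps readers out until wrunlock clears it */
	if (atomic_fetch_sub(&lock->pending_writers, 1) == 1)
		atomic_fetch_and(&lock->fastpath, ~2u);
}
```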

See? The main lock need not worry about the number of writers AT ALL, because
it's totally irrelevant. So don't worry about running out of bits. You
won't. Just put those counts somewhere else! The only thing that matters
for the main lock word is whether there are active readers (30 bits), and
whether there is an active writer (there can only ever be one: 1 bit), and
whether new readers should be trapped (1 bit).

If you worry about overflows, you're doing something wrong.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC PATCH] Writer-biased low-latency rwlock v8
Date: Thu, 21 Aug 2008 22:25:02 UTC
Message-ID: <fa.58RdeJPlzHjTClTT+26PX77khZ4@ifi.uio.no>

On Thu, 21 Aug 2008, Linus Torvalds wrote:
>
>  rdlock:
> 	movl $4,%eax
> 	lock ; xaddl %eax,(%rdi)
> 	testl $3,%eax
> 	jne __rdlock_slowpath
> 	ret


Ok, so change that "testl" to "andl" to shave two bytes, and then make the
whole lock structure look something like this:

	struct rwlock {
		u32 fastpath;
		u32 pending_readers;
		u32 pending_writers;
	};

and then the slowpath would look something like this:

	void __rdlock_slowpath(struct rwlock *lock)
	{
		/* Move the read count to the pending path and undo our xadd */
		atomic_add(1, &lock->pending_readers);
		atomic_sub(4, &lock->fastpath);

		for (;;) {
			unsigned fastpath;
			while (lock->pending_writers)
				cpu_relax();
			while ((fastpath = lock->fastpath) & 1)
				cpu_relax();
			/* This will also undo the contention bit */
			if (atomic_cmpxchg(&lock->fastpath, fastpath, (fastpath & ~2)+4) == fastpath)
				break;
		}
		atomic_sub(1, &lock->pending_readers);
	}

and yes, there are more atomic accesses in there than would be necessary,
in that if you can cram the whole thing into a single 64-bit word you can
do both the initial pending fixup and the final cmpxchg/pending_readers
thing as a single locked 64-bit op, but hey, the above is fairly simple.

You could also just use a spinlock to protect all the other data, of
course.

There are fairness issues here too - do you really want to wait for
pending writes every time, or just the first time through the loop? It
depends on just how much you want to prefer writers.

The slow-path for writers is similar, and has similar issues for the
fairness issue. For example, instead of a simple "pending_writers" count,
maybe you want to use a ticket lock there, to make writers be fair among
themselves? Of course, the hope would be that there would never be that
kind of contention by pure writers, so that sounds like overdesign, but
the thing I'm trying to point out here is that this is all a separate
decision from the actual fast-path, and having fair writers doesn't
necessarily mean that the fastpath has to even know/care.

The placement of the counts is also something that can be optimized. For
example, on the assumption that readers are much more common than writers,
I put the "pending_readers" next to the "fastpath" lock, exactly so that
you -can- do the update of both with a single 64-bit locked access, even
if the fastpath just used a 32-bit one. But then you should at least add a
"__attribute__((aligned(8)))" to the structure definition.
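A hypothetical sketch of that placement (the union is purely illustrative, and `u32`/`u64` are spelled as stdint types here):

```c
/* Hypothetical layout: pending_readers shares one 8-byte-aligned 64-bit
 * word with the 32-bit fastpath, so a slow path could update both with a
 * single 64-bit locked operation while the fast path stays 32-bit. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rwlock {
	union {
		struct {
			uint32_t fastpath;
			uint32_t pending_readers;
		};
		uint64_t fast_and_pending;	/* single 64-bit view of both */
	};
	uint32_t pending_writers;
} __attribute__((aligned(8)));
```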

And maybe you want to share the slowpath between all the irq-off etc
versions, in which case you need to make the "irq enable while waiting"
thing be conditional.

And if the write path wants to wait for pending readers, and you want to
make _that_ fair too, then you want to have that punch through if the
writer is in an interrupt, to avoid deadlocks. And again, you might want
to make the reader thing be some kind of ticket-based system, so that
writers can wait for the readers that came in _before_ that writer, but
not to later ones.

But again - none of that has anything to do with the fast-path. The
fast-path remains exactly the same for all these issues.

And finally: NONE of the code is at all tested. It's all just written in
my MUA. Caveat emptor.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC PATCH] Writer-biased low-latency rwlock v8
Date: Sat, 23 Aug 2008 18:03:50 UTC
Message-ID: <fa.vHes0ngetSTkIV8eUnyXjqtrJPo@ifi.uio.no>

On Sat, 23 Aug 2008, Mathieu Desnoyers wrote:
>
> Now, let me explain why I need at least not one, but _four different_
> contention bits. Simply because there are four types of contention, one
> for each execution context which may take the read lock. (irq, softirq,
> non-preemptable, preemptable)

No. You need _one_ contention bit in the fast-path.

Then, as you get into the slow-path, you can decide on four different
behaviours.

Quite frankly, I don't think this discussion is going anywhere. I don't
think I'd take anything from you, since you seem to have a really hard
time separating out the issue of fast-path and slow-path. So I'm simply
not going to bother, and I'm not going to expect to merge your work.

Sorry, but it simply isn't worth my time or effort.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC PATCH] Writer-biased low-latency rwlock v8
Date: Sat, 23 Aug 2008 21:42:06 UTC
Message-ID: <fa.F8Duxxoii3rBqdse9sr2p9gqMak@ifi.uio.no>

On Sat, 23 Aug 2008, Mathieu Desnoyers wrote:
>
> Actually, the point I just fixed in my head is that this bit will be a
> "MAY_CONTEND" bit, which could let higher priority readers access the
> lock in the slow path.

EXACTLY.

It's not even necessarily a "contention" bit per se - it's literally a
"readers have to take the slow-path" bit (writers will obviously _always_
take the slowpath if there is any non-zero value at all, so they don't
need it).

Then, the slow-path might actually decide that "hey, there is no _actual_
writer there yet - just some _waiting_ writer, but since this read lock is
in an interrupt context, we have to let it go through _despite_ the fact
that the lock is contended in order to avoid deadlock".

So it allows a fast-path for the trivial cases that is literally just a
couple of instructions long, and that is nice not just because of
performance issues, but because it then means that you can entirely ignore
all those things in the slow path. It also means that everybody can look
at the fast-path and decide that "ok, the fast-path really is optimal".

That fast-path is what a lot of people care more about than just about
anything else.

The slow-path, in comparison, can be in C, and can do all those checks
like "are we in an (sw-)interrupt handler?" and basically prioritize
certain classes of people.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFC PATCH] Writer-biased low-latency rwlock v8
Date: Thu, 21 Aug 2008 21:43:22 UTC
Message-ID: <fa.FJ3J5C6lg2s5i/TjMEMotRtaAJI@ifi.uio.no>

On Thu, 21 Aug 2008, H. Peter Anvin wrote:
>
> First of all, let me say I don't pretend to understand formally how you deal
> with overflow-after-the-fact, as unlikely as it is.

Just make sure it can't overflow. With spinlocks, you are guaranteed that
you won't have more than NR_CPU's thing, so 20 bits is pretty safe. 30
bits is ridiculously safe.
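That bound can even be checked at compile time. A sketch of the arithmetic, where the `NR_CPUS` value is illustrative (not a config constant from this discussion): readers bump the word by 4, leaving 30 bits of count, and spinning readers are bounded by the CPU count:

```c
/* Compile-time sketch of the "can't overflow" argument.
 * NR_CPUS here is an illustrative value only. */
#include <assert.h>

#define NR_CPUS 4096			/* illustrative */
#define READER_COUNT_BITS 30		/* 32 bits minus the two flag bits */

_Static_assert(NR_CPUS < (1u << READER_COUNT_BITS),
	       "spinning readers are bounded by CPUs; the count cannot overflow");
```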

> However, there seems to me to be an easy way to avoid it.  Simply by changing the
> read-test mask to $0x80000003, you will kick the code down the slow path once
> the read counter reaches $0x80000004 (2^29+1 readers), where you can do any
> necessary fixup -- or BUG() -- at leisure.

Sure, you could do things like that, but that sounds like a separate
"debugging" version, not the main one.

> This fastpath ends up being identical in size and performance to the one you
> posted, although yours could be reduced by changing the test to a testb
> instruction -- at the almost certainly unacceptable expense of taking a
> partial-register stall on the CPUs that have those.

Well, you could just change the "testl $3,%eax" into an "andl $3,%eax",
and it will be two bytes shorter with no partial register stall.

I forgot that "testl" doesn't have the byte immediate version.

		Linus

