From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Unified tracing buffer
Date: Tue, 23 Sep 2008 00:42:05 UTC
Message-ID: <fa.2FJdiPLag3amNDH4PTvS9wNKBFk@ifi.uio.no>

On Mon, 22 Sep 2008, Masami Hiramatsu wrote:
>
> Sure, atomic counter might be more expensive but accurate for ordering.

Don't be silly.

An atomic counter is no more accurate for ordering than anything else.

Why?

Because all it tells you is the ordering of the atomic increment, not of
the caller. The atomic increment is not related to all the other ops that
the code that you trace actually does in any shape or form, and so the
ordering of the trace doesn't actually imply anything for the ordering of
the operations you are tracing!

Except for a single CPU, of course, but for that case you don't need a
sequence number either, since the ordering is entirely determined by the
ring buffer itself.

So the counter will be more expensive (cross-cpu cache bouncing for EVERY
SINGLE EVENT), less useful (no real meaning for people who DO want to have
a timestamp), and it's really no more "ordered" than anything that bases
itself on a TSC.

The fact is, you cannot order operations based on log messages unless you
have a lock around the whole caller - absolutely _no_ amount of locking or
atomic accesses in the log itself will guarantee ordering of the upper
layers.

And sure, if you have locking at a higher layer, then a sequence number is
sufficient, but on the other hand, so is a well-synchronized TSC.

So personally, I think that the optimal solution is:

 - let each ring buffer be associated with a "gettimestamp()" function, so
   that everybody _can_ set it to something of their own. But default to
   something sane, namely a raw TSC thing.

 - Add synchronization events to the ring buffer often enough that you can
   make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply
   noticing when the upper 32 bits change, although you could possibly do
   it with a heartbeat too.

 - Similarly, add a synchronization event when the TSC frequency changes.

 - Make the synchronization packet contain the full 64-bit TSC base, in
   addition to TSC frequency info _and_ the timebase.

 - From those synchronization events, you should be able to get a very
   accurate timestamp *after* the fact from the raw TSC numbers (ie do all
   the scaling not when you gather the info, but when you present it),
   even if you only spent 32 bits of TSC info on 99% of all events (and
   just had an overflow log occasionally to get the rest of the info).
   A rough sketch of that after-the-fact conversion follows this list.

 - Most people will be _way_ happier with a timestamp that has enough
   precision to also show ordering (assuming that the caller holds a
   lock over the operation _including_ the tracing) than they would ever
   be with a sequence number.

 - people who really want to can consider the incrementing counter a TSC,
   but it will suck in so many ways that I bet it will not be very popular
   at all. But having the option to set a special timestamp function will
   give people the choice (on a per-buffer level) to make the "TSC" be a
   simple incrementing 32-bit counter using xaddl and the upper bits
   incrementing from a timer, but keep that as an "ok, the TSC is really
   broken, or this architecture doesn't support any fast cycle counters at
   all, or I really don't care about time, just sequence, and I guarantee
   I have a single lock in all callers that makes things unambiguous"
   case.
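
As a rough sketch of that after-the-fact conversion (the names here, like
trace_sync_event and trace_stamp_to_ns, are made up purely for
illustration and are not an existing kernel API; the details are just one
way to do it):

	#include <stdint.h>

	/* per-buffer timestamp hook, defaulting to a raw TSC read */
	struct ring_buffer {
		uint64_t (*gettimestamp)(void);
		/* ... buffer state ... */
	};

	/* full synchronization event, logged on 32-bit overflow, on a
	 * heartbeat, or when the TSC frequency changes */
	struct trace_sync_event {
		uint64_t tsc_base;	/* full 64-bit TSC at the sync point */
		uint64_t ns_base;	/* timebase (ns) matching tsc_base */
		uint32_t tsc_khz;	/* TSC frequency in effect */
	};

	/*
	 * Turn a raw 32-bit per-event stamp into nanoseconds *after* the
	 * fact, using the most recent preceding sync event. All scaling
	 * happens at presentation time, never in the tracing fast path.
	 */
	static uint64_t trace_stamp_to_ns(const struct trace_sync_event *sync,
					  uint32_t raw_stamp)
	{
		/* low 32 bits of the TSC at the sync point */
		uint32_t base_lo = (uint32_t)sync->tsc_base;

		/*
		 * Wrap-safe 32-bit delta: correct as long as sync events
		 * get logged at least once per 2^32 cycles.
		 */
		uint32_t delta = raw_stamp - base_lo;

		/* cycles -> ns: delta * 10^6 / kHz, done in 64 bits */
		return sync->ns_base +
			((uint64_t)delta * 1000000ULL) / sync->tsc_khz;
	}

The point being that the hot path only ever stores 32 raw bits, and all
the expensive parts (the 64-bit reconstruction, the multiply and divide)
are pushed out to whoever reads the trace.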

Note the "single lock" part. It's not enough that you make any trace thing
under a lock. They must be under the _same_ lock for all relevant events
for you to be able to say anything about ordering. And that's actually
pretty rare for any complex behavior.

The timestamping, btw, is likely the most important part of the whole
logging thing. So we need to get it right. But by "right" I mean really
really low-latency so that it's acceptable to everybody, real-time enough
that you can tell how far apart events were, and precise enough that you
really _can_ see ordering.

The "raw TSC value with correction information" should be able to give you
all of that. At least on x86. On some platforms, the TSC may not give you
enough resolution to get reasonable guesses on event ordering.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Unified tracing buffer
Date: Tue, 23 Sep 2008 03:29:23 UTC
Message-ID: <fa.fP6lHMdczhU4eK482duvruqzbhU@ifi.uio.no>

On Mon, 22 Sep 2008, Roland Dreier wrote:
>
> Just like Einstein said, it really seems to me that the order of things
> depends on your frame of reference.

Heh. Yes. In general, there is no single ordering unless you actually use
a serializing lock on all CPU's involved.

And exactly as in the theory of relativity, two people on different CPU's
can actually validly _disagree_ about the ordering of the same event.
There are things that act as "light-cones" and are borders for what
everybody can agree on, but basically, in the absence of explicit locks,
it is very possible that no such thing as "ordering" may even exist.

Now, an atomic increment on a single counter obviously does imply *one*
certain ordering, but it really only defines the ordering of that counter
itself. It does not at all necessarily imply any ordering on the events
that go on around the counter in other unrelated cachelines.

Which is exactly why even a global counter in no way orders "events" in
general, unless those events have something else that does so.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Unified tracing buffer
Date: Tue, 23 Sep 2008 04:07:00 UTC
Message-ID: <fa.DCyHU29ul/pLT0461c7xAxt4aTI@ifi.uio.no>

On Mon, 22 Sep 2008, Mathieu Desnoyers wrote:
>
> Unless I am missing something, in the case we use an atomic operation
> which implies memory barriers (cmpxchg and atomic_add_return does), one
> can be sure that all memory operations done before the barrier are
> completed at the barrier and that all memory ops following the barrier
> will happen after.

Sure (if you have a barrier - not all architectures will imply that for an
increment).

But that still doesn't mean a thing.

You have two events (a) and (b), and you put trace-points on each. In your
trace, you see (a) before (b) by comparing the numbers. But what does that
mean?

The actual event that you traced is not the trace-point - the trace-point
is more like a fancy "printk". And the fact that one showed up before
another in the trace buffer, doesn't mean that the events _around_ the
trace happened in the same order.

You can use the barriers to make a partial ordering, and if you have a
separate tracepoint for entry into a region and exit, you can perhaps show
that they were totally disjoint. Or maybe they were partially overlapping,
and you'll never know exactly how they overlapped.

Example:

	trace(..);
	do_X();

being executed on two different CPU's. In the trace, CPU#1 was before
CPU#2. Does that mean that "do_X()" happened first on CPU#1?

No.

The only way to show that would be to put a lock around the whole trace
_and_ operation X, ie

	spin_lock(..);
	trace(..);
	do_X();
	spin_unlock(..);

and now, if CPU#1 shows up in the trace first, then you know that do_X()
really did happen first on CPU#1. Otherwise you basically know *nothing*,
and the ordering of the trace events was totally and utterly meaningless.

See? Trace events themselves may be ordered, but the point of the trace
event is never to know the ordering of the trace itself - it's to know the
ordering of the code we're interested in tracing. The ordering of the
trace events themselves is irrelevant and not useful.

And I'd rather see people _understand_ that, than if they think the
ordering is somehow something they can trust.

Btw, if you _do_ have locking, then you can also know that the "do_X()"
operations will be essentially as far apart in some theoretical notion of
"time" (let's imagine that we do have global time, even if we don't) as
the cost of the trace operation and do_X() itself.

So if we _do_ have locking (and thus a valid ordering that actually can
matter), then the TSC doesn't even have to be synchronized on a cycle
basis across CPU's - it just needs to be close enough that you can tell
which one happened first (and with ordering, that's a valid thing to do).

So you don't even need "perfect" synchronization, you just need something
reasonably close, and you'll be able to see ordering from TSC counts
without having that horrible bouncing cross-CPU thing that will impact
performance a lot.

Quite frankly, I suspect that anybody who wants to have a global counter
might as well almost just have a global ring-buffer. The trace events
aren't going to be CPU-local anyway if you need to always update a shared
cacheline - and you might as well make the shared cacheline be the ring
buffer head with a spinlock in it.

That may not be _quite_ true, but it's probably close enough.
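
Purely as an illustrative sketch of that layout (made-up names, wrap
handling and all the real details omitted), the kind of thing being
described is:

	#include <linux/spinlock.h>
	#include <linux/cache.h>
	#include <linux/string.h>

	#define GLOBAL_TRACE_SIZE (1 << 20)

	/* the lock and the write position share the hot cacheline */
	struct global_trace_buffer {
		spinlock_t lock;		/* serializes all writers */
		unsigned long head;		/* next write offset */
		char data[GLOBAL_TRACE_SIZE];
	} ____cacheline_aligned;

	static struct global_trace_buffer gbuf = {
		.lock = __SPIN_LOCK_UNLOCKED(gbuf.lock),
	};

	static void global_trace_write(const void *ev, unsigned long len)
	{
		unsigned long flags;

		spin_lock_irqsave(&gbuf.lock, flags);
		/* the same cross-CPU cacheline bounce a global counter
		 * would cost, but here it buys a fully ordered buffer
		 * (wrap-around handling omitted for brevity) */
		memcpy(gbuf.data + (gbuf.head % GLOBAL_TRACE_SIZE), ev, len);
		gbuf.head += len;
		spin_unlock_irqrestore(&gbuf.lock, flags);
	}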

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Unified tracing buffer
Date: Tue, 23 Sep 2008 04:20:29 UTC
Message-ID: <fa.F0JsaA0sZ+YQGWYYibi7PfxheJw@ifi.uio.no>

On Mon, 22 Sep 2008, Steven Rostedt wrote:
>
> But, with that, with a global atomic counter, and the following trace:
>
> cpu 0: trace_point_a
> cpu 1: trace_point_c
> cpu 0: trace_point_b
> cpu 1: trace_point_d
>
> Could the event a really come after event d, even though we already hit
> event b?

Each tracepoint will basically give a partial ordering (if you make it so,
of course - and on x86 it's hard to avoid it).

And with many trace-points, you can narrow down ordering if you're lucky.

But say that you have code like

	CPU#1		CPU#2

	trace_a		trace_c
	..		..
	trace_b		trace_d

and since each CPU itself is obviously strictly ordered, you a priori know
that a < b, and c < d. But your trace buffer can look many different ways:

 - a -> b -> c -> d
   c -> d -> a -> b

   Now you do know that what happened between c and d must all have
   happened entirely after/before the things that happened between
   a and b, and there is no overlap.

   This is only assuming the x86 full memory barrier from a "lock xadd" of
   course, but those are the semantics you'd get on x86. On others, the
   ordering might not be that strong.

 - a -> c -> b -> d
   a -> c -> d -> b

   With these trace point orderings, you really don't know anything at all
   about the order of any access that happened in between. CPU#1 might
   have gone first. Or not. Or partially. You simply do not know.

> But I guess you are stating the fact that what the computer does
> internally, no one really knows. Without the help of real memory barriers,
> ordering of memory accesses is mostly determined by tarot cards.

Well, x86 defines a memory order. But what I'm trying to explain is that
memory order still doesn't actually specify what happens to the code that
actually does tracing! The trace is only going to show the order of the
tracepoints, not the _other_ memory accesses. So you'll have *some*
information, but it's very partial.

And the thing is, all those other memory accesses are the ones that do all
the real work. You'll know they happened _somewhere_ between two
tracepoints, but not much more than that.

This is why timestamps aren't really any worse than sequence numbers in
all practical matters. They'll get you close enough that you can consider
them equivalent to a cache-coherent counter, just one that you don't have
to take a cache miss for, and that increments on its own!

Quite a lot of CPU's have nice, dependable, TSC's that run at constant
frequency.

And quite a lot of traces care a _lot_ about real time. When you do IO
tracing, the problem is almost never about lock ordering or anything like
that. You want to see how long a request took. You don't care AT ALL how
many tracepoints were in between the beginning and end, you care about how
many microseconds there were!

			Linus
