From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] splice support #2
Date: Thu, 30 Mar 2006 17:03:27 UTC
Message-ID: <fa.HxxRqCnbLbC5E9JmxpLtzlp2IyU@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603300853190.27203@g5.osdl.org>

On Thu, 30 Mar 2006, Jens Axboe wrote:
> On Thu, Mar 30 2006, Ingo Molnar wrote:
> >
> > neat stuff. One question: why do we require fdin or fdout to be a pipe?
> > Is there any fundamental problem with implementing what Larry's original
> > paper described too: straight pagecache -> socket transfers? Without a
> > pipe intermediary forced in between. It only adds unnecessary overhead.
>
> No, not a fundamental problem. I think I even hid that in some comment
> in there, at least if it's decipherable by someone other than myself...

Actually, there _is_ a fundamental problem. Two of them, in fact.

The reason it goes through a pipe is two-fold:

 - the pipe _is_ the buffer. The reason sendfile() sucks is that sendfile
   cannot work with <n> different buffer representations. sendfile() only
   works with _one_ buffer representation, namely the "page cache of the
   file".

   By using the page cache directly, sendfile() doesn't need any extra
   buffering, but that's also why sendfile() fundamentally _cannot_ work
   with anything else. You cannot do "sendfile" between two sockets to
   forward data from one place to another, for example. You cannot do
   sendfile from a streaming device.

   The pipe is just the standard in-kernel buffer between two arbitrary
   points. Think of it as a scatter-gather list with a wait-queue. That's
   what a pipe _is_. Trying to get rid of the pipe totally misses the
   whole point of splice().

   Now, we could have a splice call that has an _implicit_ pipe, ie if
   neither side is a pipe, we could create a temporary pipe and thus
   allow what looks like a direct splice. But the pipe should still be
   there.

 - The pipe is the buffer #2: it's what allows you to do _other_ things
   with splice that are simply impossible to do with sendfile. Notably,
   splice allows very naturally the "readv/writev" scatter-gather
   behaviour of _mixing_ streams. If you're a web-server, with splice you
   can do

	write(pipefd, header, header_len);
	splice(file, pipefd, file_len);
	splice(pipefd, socket, total_len);

   (this is all conceptual pseudo-code, of course - a sketch with the real
   syscall signature follows below), and this very naturally has none of
   the issues that sendfile() has with plugging etc.
   There's never any "send header separately and do extra work to make
   sure it is in the same packet as the start of the data".

   So having a separate buffer even when you _do_ have a buffer like the
   page cache is still something you want to do.
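
   As a concrete sketch of that web-server case, using the six-argument
   signature splice() ended up with (the fd names and lengths here are
   illustrative):

	write(pipefd[1], header, header_len);
	/* file -> pipe: the pipe is the buffer holding both streams */
	splice(file_fd, NULL, pipefd[1], NULL, file_len, 0);
	/* pipe -> socket: header and file data go out as one stream */
	splice(pipefd[0], NULL, socket_fd, NULL, header_len + file_len, 0);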

So there.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] splice support #2
Date: Thu, 30 Mar 2006 17:18:02 UTC
Message-ID: <fa.8oWWmFOZG4/xqIxG+WlFkGFFTlU@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603300905270.27203@g5.osdl.org>

On Thu, 30 Mar 2006, Linus Torvalds wrote:
>
> Actually, there _is_ a fundamental problem. Two of them, in fact.

Actually, four.

The third reason the pipe buffer is so useful is that it's literally a
stream with no position.

That may sound bad, but it's actually a huge deal. It's why standard unix
pipelines are so powerful. You don't pass around "this file, this offset,
this length" - you pass around a simple fd, and you can feed that fd data
without ever having to worry about what the reader of the data does. The
reader cannot seek around to places that you didn't want him to see, and
the reader cannot get confused about where the end is.

The fourth reason is "tee". Again, you _could_ perhaps do "tee" without the
pipe, but it would be a total nightmare. Now, tee isn't that common, but
it does happen, and in particular it happens a lot with certain streaming
content.

Doing a "tee" with regular pipes is not that common: you usually just use
it for debugging or logging (ie you have a pipeline you want to debug, and
inserting "tee" in the middle is a good way to keep the same pipeline
while still being able to look at the intermediate data when something
goes wrong).

However, one reason "tee" _isn't_ that common with regular pipe usage is
that normal programs never need to do that anyway: all the pipe data
always goes through user space, so you can trivially do a "tee" inside of
the application itself without any external support. You just log the data
as you receive it.

But with splice(), the whole _point_ of the system call is that at least a
portion of the data never hits a user space buffer at all. Which means
that suddenly "tee" becomes much more important, because it's the _only_
way to insert a point where you can do logging/debugging of the data.

Now, I didn't do the "tee()" system call in my initial example thing, and
Jens didn't add it either, but I described it back in Jan-2005 with the
original description. It really is very fundamental if you ever want to
have a "plugin" kind of model, where you plug in different users to the
same data stream.

The canonical example is getting video input from an mpeg encoder, and
_both_ saving it to a file and sending it on in real-time to the app that
shows it in a window. Again, having the pipe is what allows this.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] splice support #2
Date: Thu, 30 Mar 2006 21:18:14 UTC
Message-ID: <fa.cBk7LN6v/MAhN+Fi6talksv37O8@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603301259220.27203@g5.osdl.org>

On Thu, 30 Mar 2006, Jeff Garzik wrote:
>
> with splice this becomes
>
> 	if (special case fd combination #1)
> 		sendfile()
> 	else if (special case fd combination #2)
> 		splice()
> 	else
> 		hand code fd->fd data move

No, it really should be splice for all combinations (possibly with a
manual read/write fallback for stuff where it just hasn't been done).

The fact that we don't have splice for certain fd combinations is really
just a result of it not being done yet. One of the reasons I wanted to
merge asap was that the original example patch was done over a year ago,
and not a lot happened, so I'm hoping that the fact that the core code is
now in the base tree is going to make people do the necessary splice input
functions for different file types.

For filesystems, splice support tends to be really easy (both read and
write). For other things, it depends a bit. But unlike sendfile(), it
really is quite possible to splice _from_ a socket too, not just _to_ a
socket. But no, that case hasn't been written yet.

(In contrast, with "sendfile()", you just fundamentally couldn't do it).

> Creating a syscall for each fd->fd data move case seems wasteful.  I would
> rather that the kernel Does The Right Thing so the app doesn't have to support
> all these special cases.  Handling the implicit buffer case in the kernel,
> when needed, means that the app is future-proofed: when another fd->fd
> optimization is implemented, the app automatically takes advantage of it.

splice() really can handle any fd->fd move.

The reason you want to have a pipe in the middle is that you have to
handle partial moves _some_ way. And the pipe being the buffer really does
allow that, and also handles the case of "what happens when we received
more data than we could write".

So the way to do copies is

	int pipefd[2];
	unsigned long copied = 0;

	if (pipe(pipefd) < 0)
		/* error */;

	for (;;) {
		/* move data from the input fd into the pipe buffer */
		ssize_t nr = splice(in, NULL, pipefd[1], NULL, INT_MAX, 0);
		if (nr <= 0)
			break;
		do {
			/* drain the pipe buffer into the output fd */
			ssize_t ret = splice(pipefd[0], NULL, out, NULL, nr, 0);
			if (ret <= 0) {
				/* error: couldn't write 'nr' bytes */
				break;
			}

			copied += ret;
			nr -= ret;
		} while (nr);
	}
	close(pipefd[0]);
	close(pipefd[1]);

which may _seem_ very complicated and the extra pipe seems "useless", but
it's (a) actually pretty efficient and (b) allows error recovery.

That (b) is the really important part. I can pretty much guarantee that
without the "useless" pipe, you simply couldn't do it.

In particular, what happens when you try to connect two streaming devices,
but the destination stops accepting data? You cannot put the received data
"back" into the streaming source any way - so if you actually want to be
able to handle error recovery, you _have_ to get access to the source
buffers.

Also, for signal handling, you need to have some way to keep the pipe
around for several iterations on the sender side, while still returning to
user space to do the signal handler.

And guess what? That's exactly what you get with that "useless" pipe. For
error handling, you can decide to throw the extra data that the recipient
didn't want away (just close the pipe), and maybe that's going to be what
most people do. But you could also decide to just do a "read()" on the
pipefd, to just read the data into user space for debugging and/or error
recovery..

Similarly, for signals, the pipe _is_ that buffer that you need that is
consistent across multiple invocations.

So that pipe is not at all unnecessary, and in fact, it's critical. It may
look more complex than sendfile(), but it's more complex exactly because
it can handle cases that sendfile never could, and just punted on (for
regular files, you never have the issue of half-way buffers, since you
just re-read them. Which is why sendfile() could get by with its simple
interface, but is also why sendfile() was never good for anything else).

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: RE: [PATCH] splice support #2
Date: Fri, 31 Mar 2006 20:50:19 UTC
Message-ID: <fa.v6Xj3qHTEkHqECL2kiyyWHSqufU@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0603311244300.27203@g5.osdl.org>

On Fri, 31 Mar 2006, Hua Zhong wrote:
>
> If I understand correctly:
>
> splice is one fd in, one fd out

Yes, and one of the fd's has to be a pipe.

> tee is one fd in, two fd out (and I'd assume the "one fd in" would always be
> a pipe)

Actually, all three of them would have to be pipes. The tee() thing has to
push to both destinations in a "synchronized" manner, so it can't just take
arbitrary file descriptors that it doesn't know the exact buffering rules
for.

(Otherwise you wouldn't need "tee()" at all - you could have a "splice()"
that just takes several output fd's).

> How about one fd in, N fd out? Do you then stack the tee calls using
> temporary pipes?

I didn't write the tee() logic, but making it 1:N instead of 1:2 is not
conceptually a problem at all. The exact system call interface might limit
it some way, of course.
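
Purely as an illustration, the 1:N fan-out could look like this (a sketch
only: tee() duplicates pipe contents without consuming them, so the last
destination is fed by a consuming splice()):

	int i;

	for (i = 0; i < n - 1; i++)
		tee(in_pipe, out_pipe[i], len, 0);	/* duplicate, don't consume */
	splice(in_pipe, NULL, out_pipe[n - 1], NULL, len, 0);	/* consume */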

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.17-rc2
Date: Wed, 19 Apr 2006 18:45:07 UTC
Message-ID: <fa.vMA3GGqiJgbz0+94gQXIQBSf5p8@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604191111170.3701@g5.osdl.org>

On Wed, 19 Apr 2006, Diego Calleja wrote:
>
> Could someone give a long high-level description of what splice() and tee()
> are?

The _really_ high-level concept is that there is now a notion of a "random
kernel buffer" that is exposed to user space.

In other words, splice() and tee() work on a kernel buffer that the user
has control over, where "splice()" moves data to/from the buffer from/to
an arbitrary file descriptor, while "tee()" copies the data in one buffer
to another.

So in a very real (but abstract) sense, "splice()" is nothing but
read()/write() to a kernel buffer, and "tee()" is a memcpy() from one
kernel buffer to another.
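
(For reference, the prototypes these calls eventually settled on, as
documented in the man pages:

	ssize_t splice(int fd_in, loff_t *off_in, int fd_out,
	               loff_t *off_out, size_t len, unsigned int flags);
	ssize_t tee(int fd_in, int fd_out, size_t len, unsigned int flags);

where at least one of the fds given to splice() has to be a pipe, and both
fds given to tee() have to be pipes.)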

Now, to get slightly less abstract, there are two important practical
details:

 - the "buffer" implementation is nothing but a regular old-fashioned UNIX
   pipe.

   This actually makes sense on so many levels, but mostly simply because
   that is _exactly_ what a UNIX pipe has always been: it's a buffer in
   kernel space. That's what a pipe has always been. So the splice usage
   isn't conceptually anything new for pipes - it's just exposing that
   old buffer in a new way.

   Using a pipe for the in-kernel buffer means that we already have all
   the infrastructure in place to create these things (the "pipe()" system
   call), and refer to them (user space uses a regular file descriptor as
   a "pointer" to the kernel buffer).

   It also means that we already know how to fill (or read) the kernel
   buffer from user space: the bog-standard pre-existing "read()" and
   "write()" system calls to the pipe work the obvious ways: they read the
   data from the kernel buffer into user space, and write user space data
   into the kernel buffer.

 - the second part of the deal is that the buffer is actually implemented
   as a set of reference-counted pointers, which means that you can copy
   them around without actually physically copy memory. So while "tee()"
   from a _conceptual_ standpoint is exactly the same as a "memcpy()" on
   the kernel buffer, from an implementation standpoint it really just
   copies the pointers and increments the refcounts.

There are some other buffer management system calls that I haven't done
yet (and when I say "I haven't done yet", I obviously mean "that I hope
some other sucker will do for me, since I'm lazy"), but that are obvious
future extensions:

 - an ioctl/fcntl to set the maximum size of the buffer. Right now it's
   hardcoded to 16 "buffer entries" (which in turn are normally limited to
   one page each, although there's nothing that _requires_ that a buffer
   entry always be a page).

 - vmsplice() system call to basically do a "write to the buffer", but
   using the reference counting and VM traversal to actually fill the
   buffer. This means that the user needs to be careful not to re-use the
   user-space buffer it spliced into the kernel-space one (contrast this
   to "write()", which copies the actual data, and you can thus re-use the
   buffer immediately after a successful write), but that is often easy to
   do.
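
From user space, the two extensions above could look roughly like this (a
sketch only: it assumes an F_SETPIPE_SZ-style fcntl knob and a vmsplice()
that takes an iovec, so treat the exact names as guesses):

	/* grow the kernel buffer beyond the 16-entry default */
	fcntl(pipefd[1], F_SETPIPE_SZ, 1024 * 1024);

	/* "write" user pages into the pipe by reference - no copying,
	   but the pages must not be reused until they have drained */
	struct iovec iov = { .iov_base = buf, .iov_len = buf_len };
	vmsplice(pipefd[1], &iov, 1, 0);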

Anyway, when would you actually _use_ a kernel buffer? Normally you'd use
it if you want to copy things from one source into another, and you don't
actually want to see the data you are copying, so using a kernel buffer
allows you to possibly do it more efficiently, and you can avoid
allocating user VM space for it (with all the overhead that implies: not
just the memcpy() to/from user space, but also simply the book-keeping).

It should be noted that splice() is very much _not_ the same as
sendfile(). The buffer is really the big difference, both conceptually,
and in how you actually end up using it.

A "sendfile()" call (which a lot of other OS's also implement) doesn't
actually _need_ a buffer at all, because it uses the file cache directly
as the buffer it works on. So sendfile() is really easy to use, and really
efficient, but fundamentally limited in what it can do.

In contrast, the whole point of splice() very much is that buffer. It
means that in order to copy a file, you literally do it like you would
have done it traditionally in user space:

	char buffer[BUFSIZE];

	for (;;) {
		int ret = read(input, buffer, BUFSIZE);
		char *p;
		if (!ret)
			break;
		if (ret < 0) {
			if (errno == EINTR)
				continue;
			/* .. exit with an input error .. */
		}

		p = buffer;
		do {
			int written = write(output, p, ret);
			if (!written)
				/* .. exit with filesystem full .. */;
			if (written < 0) {
				if (errno == EINTR)
					continue;
				/* .. exit with an output error .. */
			}
			p += written;
			ret -= written;
		} while (ret);
	}

except you'd not have a buffer in user space, and the "read()" and
"write()" system calls would instead be "splice()" system calls to/from a
pipe you set up as your _kernel_ buffer. But the _construct_ would all be
identical - the only thing that changes is really where that "buffer"
exists.

Now, the advantage of splice()/tee() is that you can do zero-copy movement
of data, and unlike sendfile() you can do it on _arbitrary_ data (and, as
shown by "tee()", it's more than just sending the data to somebody else:
you can duplicate the data and choose to forward it to two or more
different users - for things like logging etc).

So while sendfile() can send files (surprise surprise), splice() really is
a general "read/write in user space" and then some, so you can forward
data from one socket to another, without ever copying it into user space.

Or, rather than just a boring socket->socket forwarding, you could, for
example, forward data that comes from a MPEG-4 hardware encoder, and tee()
it to duplicate the stream, and write one of the streams to disk, and the
other one to a socket for a real-time broadcast. Again, all without
actually physically copying it around in memory.
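
A sketch of that encoder scenario (the fd names are illustrative; both
tee() arguments have to be pipes, and the later splice() calls are what
actually consume the data):

	/* duplicate the encoder stream into a second pipe */
	ssize_t n = tee(encoder_pipe, copy_pipe[1], INT_MAX, 0);

	/* one copy goes to disk, the duplicate to the network -
	   neither ever passes through user space */
	splice(encoder_pipe, NULL, disk_fd, NULL, n, 0);
	splice(copy_pipe[0], NULL, sock_fd, NULL, n, 0);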

So splice() is strictly more powerful than sendfile(), even if it's a bit
more complex to use (the explicit buffer management in the middle). That
said, I think we're actually going to _remove_ sendfile() from the kernel
entirely, and just leave a compatibility system call that uses splice()
internally to keep legacy users happy.

Splice really is that much more powerful a concept, that having sendfile()
just doesn't make any sense except as some legacy compatibility layer
around the more powerful splice().

			Linus



From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.17-rc2
Date: Wed, 19 Apr 2006 21:50:24 UTC
Message-ID: <fa.yXhJGOL7Ral/h2wYM/jN3cSjRkQ@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604191433390.3701@g5.osdl.org>

On Wed, 19 Apr 2006, Trond Myklebust wrote:
>
> Any chance this could be adapted to work with all those DMA (and RDMA)
> engines that litter our motherboards? I'm thinking in particular of
> stuff like the drm drivers, and userspace rdma.

Absolutely. Especially with "vmsplice()" (the not-yet-implemented "move
these user pages into a kernel buffer") it should be entirely possible to
set up an efficient zero-copy setup that does NOT have any of the problems
with aio and TLB shootdown etc.

Note that a driver would have to support the splice_in() and splice_out()
interfaces (which are basically just given the pipe buffers to do with as
they wish), and perhaps more importantly: note that you need specialized
apps that actually use splice() to do this.
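
In kernel terms that means something like the following (a sketch; the
splice hooks as they ended up being named in struct file_operations, with
hypothetical mydev_* handlers):

	static const struct file_operations mydev_fops = {
		.owner		= THIS_MODULE,
		.splice_read	= mydev_splice_read,	/* device -> pipe buffers */
		.splice_write	= mydev_splice_write,	/* pipe buffers -> device */
	};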

That's the biggest downside by far, and is why I'm not 100% convinced
splice() usage will be all that wide-spread. If you look at sendfile(),
it's been available for a long time, and is actually even almost portable
across different OS's _and_ it is easy to use. But almost nobody actually
does. I suspect the only users are some apache mods, perhaps an ftp daemon
or two, and probably samba. And that's probably largely it.

There's a _huge_ downside to specialized interfaces. Admittedly, splice()
is a lot less specialized (ie it works in a much wider variety of loads),
but it's still very much a "corner-case" thing. You can always do the same
thing splice() does with a read/write pair instead, and be portable.

Also, the genericity of splice() does come at the cost of complexity. For
example, to do a zero-copy from a user space buffer to some RDMA network
interface, you'd have to basically keep track of _two_ buffers:

 - keep track of how much of the user space buffer you have moved into
   kernel space with "vmsplice()" (or, for that matter, with any other
   source of data for the buffer - it might be a file, it might be another
   socket, whatever. I say "vmsplice()", but that's just an example for
   when you have the data in user space).

   The kernel space buffer is - for obvious reasons - size limited in the
   way a user-space buffer is not. People are used to doing megabytes of
   buffers in user space. The splice buffer, in comparison, is maybe a few
   hundred kB at most. For some apps, that's "infinity". For others, it's
   just a few tens of pages of data.

 - keep track of how much of the kernel space buffer you have moved to the
   RDMA network interface with "splice()".

   The splice buffer _is_ another buffer, and you have to feed the data
   from that buffer to the RDMA device manually.

In many usage scenarios, this means that you end up having the normal
kind of poll/select loop. Now, that's nothing new: people are used to
them, but people still hate them, and it just means that very few
environments are going to spend the effort on another buffering setup.

So the upside of splice() is that it really can do some things very
efficiently, by "copying" data with just a simple reference counted
pointer. But the downside is that it makes for another level of buffering,
and behind an interface that is in kernel space (for obvious reasons),
which means that it's somewhat harder to wrap your hands and head around
than just a regular user-space buffer.

So I'd expect this to be most useful for perhaps things like some HPC
apps, where you can have specialized libraries for data communication. And
servers, of course (but they might just continue to use the old
"sendfile()" interface, without even knowing that it's not sendfile() any
more, but just a wrapper around splice()).

			Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.17-rc2
Date: Thu, 20 Apr 2006 15:32:34 UTC
Message-ID: <fa.5P7b9V+hi0BWV0/GfTIP78atTog@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604200818490.3701@g5.osdl.org>

On Thu, 20 Apr 2006, Jens Axboe wrote:
>
> >  - an ioctl/fcntl to set the maximum size of the buffer. Right now it's
> >    hardcoded to 16 "buffer entries" (which in turn are normally limited to
> >    one page each, although there's nothing that _requires_ that a buffer
> >    entry always be a page).
>
> This is on a TODO, but not very high up since I've yet to see a case
> where the current 16 page limitation is an issue. I'm sure something
> will come up eventually, but until then I'd rather not bother.

The real reason for limiting the number of buffer entries is not to make
the number _larger_ (although that can be a performance optimization), but
to make it _smaller_ or at least knowing/limiting how big it is.

It doesn't matter with the current interfaces which are mostly agnostic as
to how big the buffer is, but it _does_ matter with vmsplice().

Why?

Simple: for a vmsplice() user, it's very important to know when they can
start re-using the buffer(s) that they used vmsplice() on previously. And
while the user could just ask the kernel how many bytes are left in the
pipe buffer, that's pretty inefficient for many normal streaming cases.

The _efficient_ way is to make the user-space buffer that you use for
splicing information to another entity a circular buffer that is at least
as large as any of the splice pipes involved in the transfer. (Depending
on use, in many cases you will probably want to make the user-space buffer
_twice_ as big as the kernel buffer, which makes the tracking even easier:
while half of the buffer is busy, you can write to the half that is
guaranteed not to be in the kernel buffer, so you effectively do "double
buffering".)

So if you do that, then you can continue to write to the buffer without
ever worrying about re-use, because you know that by the time you wrap
around, the kernel buffer will have been flushed out, or the vmsplice()
would have blocked, waiting for the receiver. So now you no longer need to
worry about "how much has flushed" - you only need to worry about doing
the vmsplice() call at least twice per buffer traversal (assuming the
"user buffer is double the size of the kernel buffer" approach).

So you could do a very efficient "stdio-like" implementation for logging,
for example, since this allows you to re-use the same pages over and over
for splicing, without ever having any copying overhead, and without ever
having to play VM-related games (ie you don't need to do unmaps or
mprotects or anything expensive like that in order to get a new page or
something).

But in order to do that, you really do need to know (and preferably set)
the size of the splice buffer. Otherwise, if the in-kernel splice buffer
is larger than the circular buffer you use in user space, the kernel will
add the same page _twice_ to the buffer, and you'll overwrite the data
that you already spliced.

(Now, you still need to be very careful with vmsplice() in general, since
it leaves the data page writable in the source VM and thus allows for all
kinds of confusion, but the theory here is "give them rope". Rope enough
to do clever things always ends up being rope enough to hang yourself too.
Tough.).

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.17-rc2
Date: Thu, 20 Apr 2006 22:21:12 UTC
Message-ID: <fa.UeNtWtRj/cjsmHrAZFPzrwabzkU@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604201512070.3701@g5.osdl.org>

On Thu, 20 Apr 2006, Piet Delaney wrote:
>
> What about marking the pages Read-Only while it's being used by the
> kernel

NO!

That's a huge mistake, and anybody that does it that way (FreeBSD) is
totally incompetent.

Once you play games with page tables, you are generally better off copying
the data. The cost of doing page table updates and the associated TLB
invalidates is simply not worth it, both from a performance standpoing and
a complexity standpoint.

Basically, if you want the highest possible performance, you do not want
to do TLB invalidates. And if you _don't_ want the highest possible
performance, you should just use regular write(), which is actually good
enough for most uses, and is portable and easy.

The thing is, the cost of marking things COW is not just the cost of the
initial page table invalidate: it's also the cost of the fault eventually
when you _do_ write to the page, even if at that point you decide that the
page is no longer shared, and the fault can just mark the page writable
again.

That cost is _bigger_ than the cost of just copying the page in the first
place.

The COW approach does generate some really nice benchmark numbers, because
the way you benchmark this thing is that you never actually write to the
user page in the first place, so you end up having a nice benchmark loop
that has to do the TLB invalidate just the _first_ time, and never has to
do any work ever again later on.

But you do have to realize that that is _purely_ a benchmark load. It has
absolutely _zero_ relevance to any real life. Zero. Nada. None. In real
life, COW-faulting overhead is expensive. In real life, TLB invalidates
(with a threaded program, and all users of this had better be threaded, or
they are leaving more performance on the floor) are expensive.

I claim that Mach people (and apparently FreeBSD) are incompetent idiots.
Playing games with VM is bad. memory copies are _also_ bad, but quite
frankly, memory copies often have _less_ downside than VM games, and
bigger caches will only continue to drive that point home.

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.17-rc2
Date: Fri, 21 Apr 2006 00:09:57 UTC
Message-ID: <fa.8eydz18zIN3pj0juOTrilsfxDc0@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604201649100.3701@g5.osdl.org>

On Thu, 20 Apr 2006, Piet Delaney wrote:
>
> I once wrote some code to find the PTE entries for user buffers;
> and as I recall the code was only about 20 lines of code. I thought
> only a small part of the TLB had to be invalidated. I never tested
> or profiled it and didn't consider the multi-threading issues.

Looking up the page table entry is fairly quick, and is definitely worth
it. It's usually just a few memory loads, and it may even be cached. So
that part of the "VM tricks" is fine.

The cost comes when you modify it. Part of it is the initial TLB
invalidate cost, but that actually tends to be the smaller part (although
it can be pretty steep already, if you have to do a cross-CPU invalidate:
that alone may already have taken more time than it would take to just do
a straightforward copy).

The bigger part tends to be that any COW approach will obviously have to
be undone later, usually when the user writes to the page. Even if (by the
time the fault is taken) the page is no longer shared, and undoing the COW
is just a matter of touching the page tables again, just the cost of
taking the fault is easily thousands of cycles.

At which point the optimization is very debatable indeed. If the COW
actually causes a real copy and a new page to be allocated, you've lost
everything, and you're solidly in "that sucks" territory.

> Instead of COW, I just returned information in recvmsg control
> structure indicating that the buffer wasn't being use by the kernel
> any longer.

That is very close to what I propose with vmsplice(), and yes, once you
avoid the COW, it's a clear win to just look up the page in the page
tables and increment a usage count.

So basically:

 - just looking up the page is cheap, and that's what vmsplice() does
   (if people want to actually play with it, Jens now has a vmsplice()
   implementation in his "splice" branch in his git tree on
   brick.kernel.dk).

   It does mean that it's up to the _user_ to not write to the page again
   until the page is no longer shared, and there are different approaches
   to handling that. Sometimes the answer may even be that synchronization
   is done at a much higher level (ie there's some much higher-level
   protocol where the other end acknowledges the data).

   The fact that it's up to the user obviously means that the user has to
   be more careful, but the upside is that you really _do_ get very high
   performance. If there are no good synchronization mechanisms, the
   answer may well be "don't use vmsplice()", but the point is that if you
   _can_ synchronize some other way, vmsplice() runs like a bat out of
   hell.

 - playing VM games where you actually modify the VM is almost always a
   loss. It does have the advantage that the user doesn't have to be aware
   of the VM games, but if it means that performance isn't really all that
   much better than just a regular "write()" call, what's the point?

I'm of the opinion that we already have robust and user-friendly
interfaces (the regular read()/write()/recvmsg/sendmsg() interfaces that
are "synchronous" wrt data copies, and that are obviously portable). We've
even optimized them as much as we can, so they actually perform pretty
well.

So there's no point in a half-assed "safe VM" trick with COW, which isn't
all that much faster. Playing tricks with zero-copy only makes sense if
they are a _lot_ faster, and that implies that you cannot do COW. You
really expose the fact that user-space gave a real reference to its own
pages away, and that if user space writes to it, it writes to a buffer
that is already in flight.

(Some users may even be able to take _advantage_ of the fact that the
buffer is "in flight" _and_ mapped into user space after it has been
submitted. You could imagine code that actually goes on modifying the
buffer even while it's being queued for sending. Under some strange
circumstances that may actually be useful, although with things like
checksums that get invalidated by you changing the data while it's queued
up, it may not be acceptable for everything, of course).

		Linus


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.17-rc2
Date: Fri, 21 Apr 2006 17:59:48 UTC
Message-ID: <fa.P4X8m57rfBQQ1v8JcmfKfEMOjd8@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0604211047540.3701@g5.osdl.org>

I got slashdotted! Yay!

On Thu, 20 Apr 2006, Linus Torvalds wrote:
>
> I claim that Mach people (and apparently FreeBSD) are incompetent idiots.

I also claim that Slashdot people usually are smelly and eat their
boogers, and have an IQ slightly lower than my daughter's pet hamster
(that's "hamster" without a "p", btw, for any slashdot posters out
there. Try to follow me, ok?).

Furthermore, I claim that anybody that hasn't noticed by now that I'm an
opinionated bastard, and that "impolite" is my middle name, is lacking a
few clues.

Finally, it's clear that I'm not only the smartest person around, I'm also
incredibly good-looking, and that my infallible charm is also second only
to my becoming modesty.

So there. Just to clarify.

		Linus "bow down before me, you scum" Torvalds


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch v3] splice: fix race with page invalidation
Date: Thu, 31 Jul 2008 17:04:26 UTC
Message-ID: <fa.ggoorbGTE/RQ8QYkhxLfVUyPXqE@ifi.uio.no>

On Thu, 31 Jul 2008, Nick Piggin wrote:
>
> It seems like the right way to fix this would be to allow the splicing
> process to be notified of a short read, in which case it could try to
> refill the pipe with the unread bytes...

Hmm. That should certainly work with the splice model. The users of the
data wouldn't eat (or ignore) the invalid data, they'd just say "invalid
data", and stop. And it would be up to the other side to handle it (it
can see the state of the pipe, we can make it both wake up POLL_ERR _and_
return an error if somebody tries to write to a "blocked" pipe).

So yes, that's very possible, but it obviously requires splice() users to
be able to handle more cases. I'm not sure it's realistic to expect users
to be that advanced.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 0/3] make splice more generic
Date: Thu, 07 May 2009 15:58:53 UTC
Message-ID: <fa.GLhJvnN74DFQQElXj2ck1CIHTig@ifi.uio.no>

On Thu, 7 May 2009, Miklos Szeredi wrote:
>
> One more generalization would be to allow splice to work on two
> non-pipes, using an internal intermediate pipe, a-la do_splice_direct().

You can't do that without some painful issues.

Or rather, you can only do it trivially for the one case where we
_already_ do that, namely "sendfile()". That's exactly what sendfile() is
now.

What is so painful about it in general?

Reading from a source may _destroy_ that data, and you may not be able to
push it back to the source. And what happens if the destination cannot
take it?

Now, we could do a totally blocking version that simply refuses to return
until the destination has taken all the splice data, and maybe it would be
worth it as a simplified interface. But it does sound like a really ripe
place for deadlocks etc (set up some trivial circular thing of everybody
trying to pipe to each other, and all of them getting blocked on the
receiver, and now they are unkillable).

Now, the reason it works for sendfile() is that when the source is known
to be in the page cache, then if the receiver doesn't take the data, we
know we can just drop it. But what if the source is some character device
driver? We can't just drop the data on a signal.

So the reason splice does that pipe buffer is that the pipe itself then
acts as a long-term buffer _across_ the kernel returning to user space,
and thus allows the whole process to be interruptible.

That said, maybe we could allow it in a few more cases. Or maybe people
think the simplification in user interfaces is worth making the IO be
non-interruptible (but still killable, for example, at which point the
buffered data really is dropped - but that's not different from having
the buffers in user space, so at that point it's ok).

So I'm certainly willing to be convinced that it's a good idea after all,
it just worries me, and I wanted to point out the painful issues that
caused me to not allow it in general.

			Linus
