From: mash@mash.engr.sgi.com (John R. Mashey)
Newsgroups: comp.arch
Subject: Re: Ultrasparc 1 has no 64 bit pointers?
Date: 30 Apr 1996 00:51:57 GMT

In article <4m2n1n$36i@ansible.bbt.com>, bnm@indica.bbt.com (Brian N.
Miller) writes:

|> Bottom line:  Memory mapped file access is almost always more efficient
|> for system throughput than explicitly managed file buffers.

In general, I think Memory-Mapped files are A Good Thing to have ... but the
generalization above needs more actual data to support it...
"almost always" seems a little strong, given that:

	(a) RDBMS do a whole lot of explicit management of files, even when
	they have file-mapping of huge files available.  Maybe this is
	historical, or maybe there are good reasons for doing this.
	Oracle certainly manages its own SGA itself, and has no desire for
	the OS to page things in and out of there...
	Is the assertion above that they are wrong to be doing this?
	(I.e., are RDBMS considered covered in the domain of discourse,
	or not?)

	(b) While I have no direct experience, I'm told by people who know
	that various kinds of image processing or visualization programs do
	explicit management of their file data (which doesn't fit in
	memory).  They do this to get smooth scanning of detailed images,
	or smooth flyovers of terrain (such as image/geometry
	combinations).  The only way they are able to do this is to
	"read-ahead" in the database in the direction they are going at the
	moment (see the sketch after this list).  This is bad enough for
	2D pan-and-zoom, but of course, is worse for 3D flyovers.  It is,
	of course, rather difficult for an OS to outguess some
	application-oriented file format that has a mixture of 3D geometry
	and satellite images, and that in fact, is doing
	direction-prediction in response to a joystick.

	(c)  Various systems that support mapped files also support
	asynch I/O to let user programs manage their own buffering.
	Maybe they need to do this, maybe they don't ... but customers demand
	it.
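
To make (b) concrete: a minimal, hypothetical sketch of direction-predicted
read-ahead for a tiled image file.  tile_offset(), prefetch_ahead(), and the
tile geometry are invented for illustration; posix_fadvise(POSIX_FADV_WILLNEED)
is just one modern way an application can ask the kernel to start fetching a
region in the background.

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <unistd.h>

    #define TILE_BYTES (256 * 1024)   /* invented tile size */

    /* Byte offset of tile (tx,ty) in an assumed row-major tiled format. */
    static off_t tile_offset(int tx, int ty, int tiles_per_row)
    {
        return ((off_t)ty * tiles_per_row + tx) * TILE_BYTES;
    }

    /* Called from the pan loop: prefetch the tile one step ahead in the
     * current pan direction (dx,dy).  POSIX_FADV_WILLNEED starts the
     * read in the background, so the data is (hopefully) resident by
     * the time the pan arrives there. */
    static void prefetch_ahead(int fd, int tx, int ty, int dx, int dy,
                               int tiles_per_row)
    {
        posix_fadvise(fd, tile_offset(tx + dx, ty + dy, tiles_per_row),
                      TILE_BYTES, POSIX_FADV_WILLNEED);
    }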

Anyway, I'm not sure that categorizations like "almost always" help gain
much insight, especially as "almost always" must mean "almost all of the
members of some set of occurrences", and it is not at all clear what that
set is.  It might be more useful to explore the attributes that favor or
disfavor file-mapping compared to explicit management, also learning from
some past history (like explicit overlays of code versus paging).

+ Favor file mapping
P1	OS has global idea of memory availability, i.e., a specific
	application that manages memory explicitly will tend to be less
	dynamic in its adaptability.
P2	Maybe the OS has a better set of algorithms.
P3	Some code may be simplified, especially that which randomly accesses
	a bunch of pages, in ways that have no particular predictability
	in location or size.  [One can imagine, for example, that you had a
	set of files representing a Boeing 777, and you wanted a browser that
	could look at the whole thing, or look at a doorknob, or anything
	in between, and ran on anything from a desktop to an 8GB Onyx.]
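
A minimal sketch of the P3 style, assuming a read-only viewer over one big
file: map the whole thing and let demand paging decide what is resident.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0) return 1;

        /* The whole file becomes addressable; only the pages actually
           touched consume memory, and the OS may drop them under pressure. */
        const char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* Random access anywhere: a page fault pulls in just that page. */
        printf("middle byte: %d\n", base[st.st_size / 2]);

        munmap((void *)base, st.st_size);
        return close(fd);
    }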

- Disfavor file mapping
N1	Consistency/performance issues: DBMS do *not* want an OS's guesses
	to control the timing of updating files.  They do *not* want an
	OS to write a disk block out just because the DBMS wrote 4 bytes
	there.  They do *not* want an OS to arbitrarily defer an update.
	(A sketch of this explicit style follows the list.)
N2	The application knows something about the reference patterns that
	would be extremely difficult to tell an OS or for the OS to infer.
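
The explicit style N1 describes looks roughly like this; write_block() is an
invented name, and a real DBMS wraps its own buffer cache and write-ahead
logging around it:

    #define _XOPEN_SOURCE 500
    #include <unistd.h>

    /* The application decides exactly when a block reaches stable
     * storage: no OS guess can write it out early or defer it late. */
    static int write_block(int fd, const void *buf, size_t len, off_t off)
    {
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return -1;                /* short write or error */
        return fdatasync(fd);         /* 0 once the block is durable */
    }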

Now, for paging of code, people have generally decided that paging-in code
works pretty well.  For much internal data, paging in/out works OK,
although there are often funny interactions of paging and garbage
collection.

My bottom line: Mapped Files are Good, but claiming they are almost always
better seems yet to be proved; maybe others will post more data?
	


-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@sgi.com 
DDD:    415-933-3090	FAX: 415-967-8496
USPS:   Silicon Graphics 6L-005, 2011 N. Shoreline Blvd, Mountain View, CA 94039-7311


From: torvalds@transmeta.com (Linus Torvalds)
Newsgroups: comp.os.linux.development.system
Subject: Re: SIGBUS instead of SIGSEGV
Date: 6 Apr 1998 05:30:06 GMT

In article <6g0v74$2k0$1@jaka.ece.uiuc.edu>,
Steve Peltz <peltz@jaka.ece.uiuc.edu> wrote:
>Is there any particular reason why accessing an mmap()ed memory segment
>gives a SIGBUS instead of a SIGSEGV if the underlying file isn't long
>enough (in particular, the file WAS long enough, but I truncated it).

No. The only reason is purely historical.

Accessing an mmap beyond the end of the file actually is something that
different UNIXes do differently, and also differently depending on
whether the mmap was shared or not. Confusing, and the only thing you
should assume is that _probably_ you get one of:
 - SIGBUS. This is the common one.
 - SIGSEGV. This is especially understandable when you consider that
   SIGBUS doesn't even _exist_ in POSIX.
 - no error, just a zero-filled page. This is what you get for a "hole"
   in a file; you might as well consider the end of the file to be an
   infinitely large hole.
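
A minimal sketch (not part of the original exchange) that demonstrates the
SIGBUS case on Linux: map two pages of a file, shrink the file to one page,
then touch the page that no longer has backing store.

    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void on_sigbus(int sig)
    {
        (void)sig;
        /* write() is async-signal-safe; printf() is not. */
        ssize_t r = write(STDERR_FILENO, "caught SIGBUS\n", 14);
        (void)r;
        _exit(0);
    }

    int main(void)
    {
        long pg = sysconf(_SC_PAGESIZE);
        int fd = open("demo.tmp", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0 || ftruncate(fd, 2 * pg) < 0) return 1;

        char *p = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        signal(SIGBUS, on_sigbus);
        if (ftruncate(fd, pg) < 0)  /* second page loses its backing */
            return 1;
        p[pg] = 'x';                /* SIGBUS on Linux; others may differ */

        puts("no signal here");     /* the zero-page behaviour, if any */
        return 0;
    }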

>It actually makes a bit of sense - after all, I mapped the memory, I
>just can't access it. I'm just wondering what the design decision was
>(it gives a SIGSEGV in Digital Unix, for example).

Under Linux, depending on the kernel version, it can give any of the
above three faults. SIGBUS seems to be the most common among unixes,
which is why that's the current Linux behaviour..

		Linus


From: torvalds@transmeta.com (Linus Torvalds)
Newsgroups: comp.os.linux.development.system
Subject: Re: mmap() and swapping question
Date: 20 May 1998 05:39:05 GMT

In article <Et8rDt.xB@flashline.chipnet>,
Ingo Rohloff  <rohloff@informatik.tu-muenchen.de.PLOPP> wrote:
>
>Well now I know what happened:
>
>I only set mmap(... PROT_WRITE ...) no (PROT_READ).
>
>For some reasons it is possible to write AND READ from
>such a mmapped area if it isn't swapped out.

The "reason" for this is that the x86 doesn't actually _have_ any page
table mode that would allow writes but not reads. So what happens is
that if the page exists and is writable it will automatically be
readable too, there just isn't anything Linux can do about it.
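
A tiny sketch of that quirk, using an anonymous mapping for simplicity: on
x86 the final read succeeds even though PROT_READ was never requested, while
an architecture with a separate read permission bit could fault.

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Ask for write-only.  x86 page tables cannot express that. */
        char *p = mmap(NULL, 4096, PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 42;   /* first touch is a write: the fault handler allows it */
        printf("read back: %d\n", p[0]);  /* succeeds on x86 anyway */
        return 0;
    }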

>If it is swapped out there will be a segmentation error if you do a read
>(funny isn't it ?)

It's not a case of whether it is swapped out or not - it's a case of
whether it exists or not.  So you'll get the same SIGSEGV if your first
access happens to be a read from the page, because then the read will
take a page fault because the page isn't there yet, and the page fault
handler will decide that you don't have read capabilities.

What probably happened was that your program's usage pattern always
started out with a write as the first access to the page, so that write
paged it in and the kernel decided that you had write access.  Any
subsequent reads just succeeded because of the x86 page table
limitation.  UNTIL the page was swapped out again, and at this point
your program was mainly reading from the pages, so this time the first
access to the paged-out page was a read, and boom! you got the SIGSEGV.

Surprising, yes.  Maybe I should just automatically extend a write-only
page to be readable so that you don't get these kinds of surprises, but
at the same time I do feel that the current Linux behaviour is correct,
even if it has these small surprises..

BTW: one of life's little ironies.  The alpha does actually have separate
readability and writability in the page tables, but the alpha does not
have byte- or word-sized memory operations in the first chip versions.
As a result, you have to do a read and mask in any bytes you want to
write, so on the alpha Linux explicitly _will_ give read permissions
even to pages mapped write-only - otherwise you couldn't write to them
in byte-sized chunks..

Oh, well.

		Linus

From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: 2.4.0-test10-pre3:Oops in mm/filemap.c:filemap_write_pa
Date: 	23 Oct 2000 10:33:15 -0700
Newsgroups: fa.linux.kernel

In article <14832.39946.657679.861952@charged.uio.no>,
Trond Myklebust  <trond.myklebust@fys.uio.no> wrote:
>
>As for simply settling for a self-consistent mmap() rather than
>tackling the problem of rereading; the main crime is that you're
>rendering file locking unusable.

This is not a crime.

Anybody who uses file locking over NFS is buggy and nobody sane should
expect it to work reliably. There are good reasons why mail agents etc
depend on other kinds of locking over NFS.

Furthermore, anybody who expects file locking to work _anywhere_ (ie
including local filesystems) in the presence of file mapping is just so
out to lunch that it's ridiculous.

Neither of these issues has anything to do with Linux. They are just
statements of fact.

And yes, I know that file locking can try to be nice in both of these
cases.  I know that there are systems that try to revert shared mappings
etc upon file locking (or not allow file locking at all if mappings are
present).  Linux is not one of them, and never has been. Neither are
most UNIXes out there.

		Linus


Date: 	Mon, 25 Dec 2000 21:37:05 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: innd mmap bug in 2.4.0-test12
Newsgroups: fa.linux.kernel

On Tue, 26 Dec 2000, Chris Wedgwood wrote:

> On Mon, Dec 25, 2000 at 01:42:33AM -0800, Linus Torvalds wrote:
> 
>     We just don't write them out. Because right now the only thing
>     that writes out dirty pages is memory pressure. "sync()",
>     "fsync()" and "fdatasync()" will happily ignore dirty pages
>     completely. The thing that made me overlook that simple thing in
>     testing was that I was testing the new VM stuff under heavy VM
>     load - to shake out any bugs.
> 
> Does this mean anyone using test13-pre4 should also expect to see
> data not being flushed on shutdown? 

No.

This all only matters to things that do shared writable mmap's.

Almost nothing does that. innd is (sadly) the only regular thing that uses
this, which is why it's always innd that breaks, even if everything else
works.

And even innd is often compiled to use "write()" instead of shared
mappings (it's a config option), so not even all innd's will break.
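
The pattern in question is roughly the following sketch (the file name is
invented, and innd's real code is more involved); an explicit msync() is how
an application would normally push its own dirty mapped pages to the file
rather than depending on sync():

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("history.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, 4096) < 0) return 1;

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        memcpy(p, "entry", 5);        /* dirty the shared page */
        msync(p, 4096, MS_SYNC);      /* force it to the file now */

        munmap(p, 4096);
        return close(fd);
    }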

		Linus


From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: mmap/mlock performance versus read
Date: 	5 Apr 2000 13:18:19 -0700
Newsgroups: fa.linux.kernel

In article <200004042249.SAA06325@op.net>,
Paul Barton-Davis  <pbd@Op.Net> wrote:
>
>I was very disheartened to find that on my system the mmap/mlock
>approach took *3 TIMES* as long as the read solution. It seemed to me
>that mmap/mlock should be at least as fast as read. Comments are
>invited. 

People love mmap() and other ways to play with the page tables to
optimize away a copy operation, and sometimes it is worth it.

HOWEVER, playing games with the virtual memory mapping is very expensive
in itself.  It has a number of quite real disadvantages that people tend
to ignore because memory copying is seen as something very slow, and
sometimes optimizing that copy away is seen as an obvious improvement.

Downsides to mmap:
 - quite noticeable setup and teardown costs. And I mean _noticeable_.
   It's things like following the page tables to unmap everything
   cleanly. It's the book-keeping for maintaining a list of all the
   mappings. It's the TLB flush needed after unmapping stuff.
 - page faulting is expensive. That's how the mapping gets populated,
   and it's quite slow. 

Upsides of mmap:
 - if the data gets re-used over and over again (within a single map
   operation), or if you can avoid a lot of other logic by just mapping
   something in, mmap() is just the greatest thing since sliced bread. 

   This may be a file that you go over many times (the binary image of
   an executable is the obvious case here - the code jumps all around
   the place), or a setup where it's just so convenient to map the whole
   thing in without regard of the actual usage patterns that mmap() just
   wins.  You may have random access patterns, and use mmap() as a way
   of keeping track of what data you actually needed. 

 - if the data is large, mmap() is a great way to let the system know
   what it can do with the data-set.  The kernel can forget pages as
   memory pressure forces the system to page stuff out, and then just
   automatically re-fetch them again.

   And the automatic sharing is obviously a case of this..

But your test-suite (just copying the data once) is probably pessimal
for mmap().
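
The one-pass copy that favors plain read() looks roughly like this sketch
(the buffer size is arbitrary); every byte is touched exactly once, so there
is nothing for mmap()'s setup, faulting, and teardown costs to amortize
against.

    #include <unistd.h>

    /* One pass over the data with a small reused buffer: no page-table
     * setup or teardown, and no page fault per touched page. */
    static int copy_once(int fd_in, int fd_out)
    {
        char buf[64 * 1024];
        ssize_t n;
        while ((n = read(fd_in, buf, sizeof buf)) > 0)
            if (write(fd_out, buf, n) != n)
                return -1;            /* short write or error */
        return n == 0 ? 0 : -1;       /* 0 on EOF, -1 on read error */
    }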

		Linus



Date: 	Wed, 5 Apr 2000 19:34:50 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: mmap/mlock performance versus read
Newsgroups: fa.linux.kernel

On Wed, 5 Apr 2000, Albert D. Cahalan wrote:
> Linus Torvalds writes:
> 
> >  - page faulting is expensive. That's how the mapping gets populated,
> >    and it's quite slow. 
> 
> Could mmap get a flag that asks for async read and map?
> So mmap returns, then pages start to appear as the IO progresses.

It's not the IO on the pages themselves, it's actually the act of
populating the page tables that is quite costly. And doing that in the
background is basically impossible.

You can do it synchronously, and that is basically what mlock() will do
with "make_pages_present()". However, that path is not all that optimized
(not worth it), and even if it was hugely optimized it would _still_ be
quite slow. The page tables are just fairly complex data structures.

And on top of that you still have the actual CPU TLB miss costs etc. Which
can often be avoided if you just re-read into the same area instead of
being excessively clever with memory management just to avoid a copy.
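
In user space that synchronous population looks roughly like this sketch;
note that later kernels also grew a MAP_POPULATE flag for mmap() that does
the prefaulting at map time:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Prefault a mapping by locking it: mlock() walks the whole range
     * and makes every page present, the make_pages_present() path. */
    static void *map_and_prefault(int fd, size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return NULL;
        if (mlock(p, len) < 0) {      /* may fail against RLIMIT_MEMLOCK */
            munmap(p, len);
            return NULL;
        }
        return p;
    }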

memcpy() (ie "read()" in this case) is simply going to be faster in many
cases, just because it avoids all the extra complexity.  While mmap() is
going to be faster in other cases.

		Linus

