From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: x86 ptep_get_and_clear question
Date: 	15 Feb 2001 12:31:15 -0800
Newsgroups: fa.linux.kernel

In article <20010215201945.A2505@pcep-jamie.cern.ch>,
Jamie Lokier  <lk@tantalophile.demon.co.uk> wrote:
>> > << lock;
>> > read pte
>> > if (!present(pte))
>> > 	do_page_fault();
>> > pte |= dirty
>> > write pte.
>> > >> end lock;
>> 
>> No, it is a little more complicated. You also have to include the
>> tlb state in this algorithm, since that is what we are talking about.
>> Specifically, what does the processor do when it has a tlb entry allowing
>> RW, the processor has only done reads using the translation, and the 
>> in-memory pte is clear?
>
>Yes (no to the no): Manfred's pseudo-code is exactly the question you're
>asking.  Because when the TLB entry is non-dirty and you do a write, we
>_know_ the processor will do a locked memory cycle to update the dirty
>bit.  A locked memory cycle implies read-modify-write, not "write TLB
>entry + dirty" (which would be a plain write) or anything like that.
>
>Given you know it's a locked cycle, the only sensible design from Intel
>is going to be one of Manfred's scenarios.

Not necessarily, and this is NOT guaranteed by the docs I've seen.

It _could_ be that the TLB data actually also contains the pointer to
the place where it was fetched, and a "mark dirty" becomes

	read *ptr locked
	val |= D
	write *ptr unlock

Now, I will agree that I suspect most x86 _implementations_ will not do
this. TLB's are too timing-critical, and nobody tends to want to make
them bigger than necessary - so saving off the source address is
unlikely. Also, setting the D bit is not a very common operation, so
it's easy enough to say that an internal D-bit-fault will just cause a
TLB re-load, where the TLB re-load just sets the A and D bits as it
fetches the entry (and then page fault handling is an automatic result
of the reload).

However, the _implementation_ detail is not, as far as I can tell,
explicitly defined by the architecture.  And in another post I quote a
book by the designers of the original 80386 that implies strongly that
the "re-walk the page tables on D miss" assumption is not what they
_meant_ for the architecture design, even if they probably happened to
implement it that way. 

>An interesting thought experiment though is this:
>
><< lock;
>read pte
>pte |= dirty
>write pte
>>> end lock;
>if (!present(pte))
>	do_page_fault();
>
>It would have a mighty odd effect wouldn't it?

Why do you insist on the !present() check at all? It's not implied by
the architecture - a correctly functioning OS is not supposed to ever
be able to cause it according to the specs.

I think Kanoj is right to be worried. I _do_ believe that the current
Linux code works on "all current hardware". But I think Kanoj has a
valid point in that it's not guaranteed to work in the future.

That said, I think Intel tends to be fairly pragmatic in their design
(that's the nice way of saying that Intel CPU's tend to dismiss the
notion of "beautiful architecture" completely over the notion of "let's
make it work").  And I would be extremely surprised indeed if especially
MS Windows didn't do some really bad things with the TLB.  In fact, I
think I can say from personal experience that I pretty much _know_
windows has big bugs in TLB invalidation.

And because of that, it may be that nobody can ever create an
x86-compatible CPU that does anything but "re-walk the page tables on
_anything_ fishy going on with the TLB".

(Basically, it seems to be pretty much a fact of life that the x86
architecture will NOT raise a page protection fault directly from the
TLB content - it will re-walk the page tables before it actually raises
the fault, and only the act of walking the page tables and finding that
it really _should_ fault will raise an x86-level fault.  It all boils
down to "never trust the TLB more than you absolutely have to"). 

		Linus


Date: 	Thu, 15 Feb 2001 16:55:02 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: x86 ptep_get_and_clear question
Newsgroups: fa.linux.kernel

On Fri, 16 Feb 2001, Jamie Lokier wrote:
> 
> If you want to take it really far, it _could_ be that the TLB data
> contains both the pointer and the original pte contents.  Then "mark
> dirty" becomes
> 
>        val |= D
>        write *ptr

No. This is forbidden by the intel documentation. First off, the
documentation clearly states that it's a locked r-m-w cycle.

Secondly, the documentation also makes it clear that the CPU page table
accesses work correctly in SMP environments, which the above simply would
not do. It doesn't allow for people marking the entry invalid, which is
documented to work (see the very part I quoted).

So while the above could be a valid TLB writeback strategy in general for
some hypothetical architecture, it would _not_ be an x86 CPU any more if
it acted that way. 

So a plain "just write out our cached value" is definitely not legal.
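
To make the difference concrete, here is a minimal user-space sketch
using C11 atomics. It is not kernel code and it does not claim to show
what any real CPU does. The names (pte_t, mark_dirty_plain_write,
mark_dirty_locked_rmw, get_and_clear) and the bit values are made up for
illustration. It only models the two "mark dirty" variants discussed
above, plus another CPU clearing the entry in between.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative bit values, assumed for this sketch only. */
#define PTE_PRESENT 0x001u
#define PTE_RW      0x002u
#define PTE_DIRTY   0x040u

typedef _Atomic uint32_t pte_t;

/*
 * The variant ruled out above: write back the cached copy with D set,
 * without looking at what is in memory now.
 */
static void mark_dirty_plain_write(pte_t *ptr, uint32_t cached)
{
    atomic_store(ptr, cached | PTE_DIRTY);
}

/*
 * The locked read-modify-write variant: only the D bit is OR'ed into
 * whatever is currently in memory. Returns the value seen before the OR.
 */
static uint32_t mark_dirty_locked_rmw(pte_t *ptr)
{
    return atomic_fetch_or(ptr, PTE_DIRTY);
}

/* Stand-in for another CPU atomically fetching and clearing the entry. */
static uint32_t get_and_clear(pte_t *ptr)
{
    return atomic_exchange(ptr, 0);
}
```

If another CPU runs get_and_clear() between the TLB fill and the
dirty-marking, the plain write puts the old present entry back into
memory, so the invalidation is silently undone. The locked r-m-w leaves
the present bit clear and only adds D to the now-empty entry.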

> > Now, I will agree that I suspect most x86 _implementations_ will not do
> > this. TLB's are too timing-critical, and nobody tends to want to make
> > them bigger than necessary - so saving off the source address is
> > unlikely.
> 
> Then again, these hypothetical addresses etc. aren't part of the
> associative lookup, so could be located in something like an ordinary
> cache ram, with just an index in the TLB itself.

True. I'd still consider it unlikely for the other reasons (ie this is not
a timing-critical part of the normal CPU behaviour), but you're right - it
could be done without making the actual TLB any bigger or different, by
just having the TLB fill routine keep a separate "source cache" that the
dirty-marking can use.

			Linus



Date: 	Thu, 15 Feb 2001 17:21:28 -0800 (PST)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: x86 ptep_get_and_clear question
Newsgroups: fa.linux.kernel

On Thu, 15 Feb 2001, Manfred Spraul wrote:
> 
> > Now, I will agree that I suspect most x86 _implementations_ will not do
> > this. TLB's are too timing-critical, and nobody tends to want to make
> > them bigger than necessary - so saving off the source address is
> > unlikely. Also, setting the D bit is not a very common operation, so
> > it's easy enough to say that an internal D-bit-fault will just cause a
> > TLB re-load, where the TLB re-load just sets the A and D bits as it
> > fetches the entry (and then page fault handling is an automatic result
> > of the reload).
> 
> But then the cpu would support setting the D bit in the page directory,
> but it doesn't.

Not necessarily. The TLB walker is a nasty piece of business, and
simplifying it as much as possible is important for hardware. Not setting
the D bit in the page directory is likely to be because it is unnecessary,
and not because it couldn't be done.

> But if we change the interface, could we think about the poor s390
> developers?
> 
> s390 only has a "clear the present bit in the pte and flush the tlb"
> instruction.

Now, that ends up being fairly close to what it seems mm/vmscan.c needs to
do, so yes, it would not necessarily be a bad idea to join the
"ptep_get_and_clear()" and "flush_tlb_page()" operations into one.

However, the mm/memory.c use (ie region unmapping with zap_page_range())
really needs to do something different, because it inherently works with a
range of entries, and abstracting it to be a per-entry thing would be
really bad for performance anywhere else (S/390 might be ok with it,
assuming that their special instruction is really fast - I don't know. But
I do know that everybody else wants to do it with one single flush for the
whole region, especially for SMP).

> Perhaps try to schedule away, just to improve the probability that
> mm->cpu_vm_mask is clear.
> 
> I just benchmarked a single flush_tlb_page().
> 
> Pentium II 350: ~ 2000 cpu ticks.
> Pentium III 850: ~ 3000 cpu ticks.

Note that there is some room for concurrency here - we can fire off the
IPI, and continue to do "local" work until we actually need the "results"
in the form of stable D bits etc. So we _might_ want to take this into
account in the interfaces: allow for a "prepare_to_gather()" which just
sends the IPI but doesn't wait for it to necessarily get accepted, and
then only by the time we actually start checking the dirty bits (ie the
second phase, after we've invalidated the page tables) do we need to wait
and make sure that nobody else is using the TLB any more.

Done right, this _might_ be of the type

 - prepare_to_gather(): sends IPI to all CPU's indicated in
   mm->cpu_vm_mask
 - go on, invalidating all VM entries
 - busy-wait until "mm->cpu_vm_mask" only contains the local CPU (where
   the busy-wait is hopefully not a wait at all - the other CPU's would
   have exited the mm while we were cleaning up the page tables)
 - go back, gather up any potential dirty bits and free the pages
 - release the mm
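
The steps above can be sketched as a small single-threaded model. This
is only an illustration under stated assumptions: struct mm_sketch, the
helper names other than prepare_to_gather(), and the bit values are
hypothetical, no real IPI is sent, and a remote CPU is modelled by a
plain function call that drops its bit from cpu_vm_mask.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPTES       4
#define PTE_PRESENT 0x001ul   /* illustrative bit values */
#define PTE_DIRTY   0x040ul

/* Hypothetical stand-in for the parts of the mm that matter here. */
struct mm_sketch {
    unsigned long cpu_vm_mask;   /* one bit per CPU using this mm */
    unsigned long pte[NPTES];
};

static unsigned long other_cpus(const struct mm_sketch *mm, int this_cpu)
{
    return mm->cpu_vm_mask & ~(1ul << this_cpu);
}

/* Step 1: "send" the IPI. Returns the targets and does not wait. */
static unsigned long prepare_to_gather(const struct mm_sketch *mm, int this_cpu)
{
    return other_cpus(mm, this_cpu);
}

/* Step 2: invalidate every entry, leaving the D bits in place. */
static void invalidate_entries(struct mm_sketch *mm)
{
    for (size_t i = 0; i < NPTES; i++)
        mm->pte[i] &= ~PTE_PRESENT;
}

/* Stand-in for a remote CPU reacting to the IPI and leaving the mm. */
static void remote_cpu_exits_mm(struct mm_sketch *mm, int cpu)
{
    mm->cpu_vm_mask &= ~(1ul << cpu);
}

/* Step 3: the condition the busy-wait would spin on. */
static bool safe_to_gather(const struct mm_sketch *mm, int this_cpu)
{
    return other_cpus(mm, this_cpu) == 0;
}

/* Step 4: D bits are stable now; count dirty pages and free entries. */
static size_t gather_dirty_and_free(struct mm_sketch *mm)
{
    size_t dirty = 0;

    for (size_t i = 0; i < NPTES; i++) {
        if (mm->pte[i] & PTE_DIRTY)
            dirty++;
        mm->pte[i] = 0;
    }
    return dirty;
}
```

The reason for splitting invalidation from gathering is visible in the
model: a remote CPU that still has a TLB entry may set a D bit after
step 2, so the dirty bits can only be trusted once safe_to_gather()
holds.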

Note that there are tons of optimizations for the common case: for
example, if we're talking about private read-only mappings, we can
possibly skip some or all of this, because we know that we simply won't
care about whether the pages were dirty or not as they're going to be
thrown away in any case.

So we can have several layers of optimizations: for UP or the SMP case
where we have "(mm->cpu_vm_mask & ~(1 << current_cpu)) == 0" we don't need
the IPI or the careful multi-CPU case at all. And for private stuff, we
need the careful invalidation, but we don't need to go back and gather the
dirty bits. So the only case that ends up being fairly heavy may be a case
that is very uncommon in practice (only for unmapping shared mappings in
threaded programs or the lazy TLB case).
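
As a rough sketch of that layering (the enum, the function name and the
private_mapping flag are hypothetical and chosen only for illustration),
the choice between the cheap and the heavy path could look like this:

```c
#include <stdbool.h>

/* Hypothetical names for the three layers described above. */
enum zap_mode {
    ZAP_SIMPLE,              /* UP, or no other CPU is using the mm */
    ZAP_CAREFUL_NO_GATHER,   /* careful invalidation, dirty bits ignored */
    ZAP_CAREFUL_AND_GATHER   /* careful invalidation plus dirty gathering */
};

static enum zap_mode choose_zap_mode(unsigned long cpu_vm_mask,
                                     int current_cpu,
                                     bool private_mapping)
{
    /* Parenthesized: in C, == binds tighter than &. */
    if ((cpu_vm_mask & ~(1ul << current_cpu)) == 0)
        return ZAP_SIMPLE;

    /* Private pages are thrown away anyway, so D bits do not matter. */
    if (private_mapping)
        return ZAP_CAREFUL_NO_GATHER;

    return ZAP_CAREFUL_AND_GATHER;
}
```

Only the last case pays for both the IPI wait and the second pass over
the entries.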

I suspect getting a good interface for this, so that zap_page_range()
doesn't end up being the function from hell, is the most important thing.

			Linus


