From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev
Date: Wed, 18 Jun 2008 16:23:39 UTC
Message-ID: <fa.NI3YkQp2+EAQ1YTxer7CUtxoOL4@ifi.uio.no>

On Wed, 18 Jun 2008, Bron Gondwana wrote:
> On Tue, Jun 17, 2008 at 09:03:17PM -0700, Linus Torvalds wrote:
> >
> > Is there any reason it doesn't use mmap(MAP_SHARED) and make the
> > modifications that way too?
>
> Portability[tm].

Hmm.. I'm pretty sure that using MAP_SHARED for writing is _more_ portable
than mixing mmap() and "write()" - or at least more _consistent_.

That said, it's probably six one way, and half a dozen the other. The
shared writable mmap() doesn't work well on unix-lookalikes (ie "not real
unix"). That does include really really old Linux versions (ie 1.x
series), but more relevantly probably includes things like QNX etc.

On the other hand, the mmap()+write(), as mentioned, doesn't work well on
various hardware platforms where there can be cache aliases, and that
includes HP-UX (as you apparently have noticed), but I'm pretty certain
there are other cases too.

The cache alias issue can actually be really thorny, because it's going to
be very hard to see and essentially random: if your working set is big
enough (or the cache is small enough) that the cache basically gets
flushed between the write and the access through the mmap (and vice
versa), you'll never see any problems.

But then, _occasionally_, you'll have really hard-to-replicate corruption
due to cache aliases (ie you read something from the mmap() after the
write, but you don't actually see the newly written data, because it's
cached at a different virtual address).

Linux tries really hard to be coherent between mmap and read/write even on
those kinds of platforms, but I would definitely not call it "portable".
It really is a fundamentally nasty thing, and depends deeply on the CPU
architecture, not just the OS.

> It actually does use MAP_SHARED already, but only for reading.
> Writing is all done with seeks and writes, followed by a map
> "refresh", which is really just an unmmap/mmap if the file has
> extended past the "SLOP" (between 8 and 16 k after the end of
> the file length at last mapping).

Yeah, I can certainly see that working. That said, I can also see it
failing, partly because of the CPU virtual indexing cache issues, but
partly because it's such an unusual thing to do (partly because it simply
is known not to work on some systems, ie HP-UX). And that will mean that
it is probably not a well-tested path.. As you found out.

(Side note: I mention HP-UX just because it is known to historically have
totally and utterly brain-damaged and useless mmap support. It _may_ be
that they've fixed it in more modern versions. It literally used to be a
mix of horrible hardware problems - the virtual cache issue - _and_ a VM
system that was based on some really old BSD code).

So the more traditional way would be to do an all-mmap thing, and extend
the file with ftruncate(), not write. That's something that programs like
nntpd have been doing for decades, so it's a very "traditional" model and
thus much more likely to be safe. It also avoids all the aliasing issues,
if all accesses are done the same way.
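
A minimal sketch of that all-mmap model, assuming a POSIX system: the file
is grown with ftruncate() and the new bytes are stored through a MAP_SHARED
mapping, so no write() call ever touches the data. The function name
append_via_mmap is hypothetical; it is not taken from nntpd or Cyrus.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Append `len` bytes to the file at `path`, doing every data access through
 * a shared mapping. The file is grown with ftruncate(), never with write().
 * Returns the new file size, or -1 on error. (Hypothetical helper.) */
static off_t append_via_mmap(const char *path, const void *data, size_t len)
{
    assert(data != NULL && len > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    off_t old_size = lseek(fd, 0, SEEK_END);
    if (old_size < 0) {
        close(fd);
        return -1;
    }
    off_t new_size = old_size + (off_t)len;

    /* Grow the file first, so the whole mapping is backed by file data. */
    if (ftruncate(fd, new_size) != 0) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, (size_t)new_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* The only store to the file goes through the mapping. */
    memcpy(map + old_size, data, len);

    /* Ask the OS to push the dirty pages to the file before unmapping. */
    int rc = msync(map, (size_t)new_size, MS_SYNC);
    munmap(map, (size_t)new_size);
    close(fd);
    return rc == 0 ? new_size : -1;
}
```

A real program would keep the mapping around between appends instead of
mapping the whole file each time; the sketch remaps only to stay short.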

That said, you _would_ need to have alternate strategies to access things,
but apparently Cyrus already has such strategies at least for HP-UX.

> Ahh - I found the explanation in doc/internal/hacking in
> the Cyrus source tree.  While 'ack' is a nice tool, it
> doesn't check files with no extension by default.  Ho hum:
>
> - map_refresh and map_free
>
>   - In many cases, it is far more effective to read a file via the operating
>     system's mmap facility than it is via the traditional read() and
>     lseek system calls.  To this end, Cyrus provides an operating system
>     independent wrapper around the mmap() services (or lack thereof) of the
>     operating system.

One of the issues here is that in order to give coherency for mmap +
read/write access, the OS may need to map the area uncached or at least
flush caches when writing. So from a pure performance standpoint, it can
also cause problems.

Of course, even an uncached mmap() _can_ certainly be faster than using
just read()/write(), depending on the access patterns. So maybe Cyrus is
doing the right thing, it just sounds rather fragile and prone to
unexpected and hard-to-debug problems.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev
Date: Thu, 19 Jun 2008 00:21:24 UTC
Message-ID: <fa.dSdEOJ6nOXEiiQYD/w9DI/O53XU@ifi.uio.no>

On Thu, 19 Jun 2008, Robert Mueller wrote:
>
> As noted above, one thing cyrus does which does seem to be plain "wrong"
> is that it mmaps a region greater than the file size (rounds to an 8k
> boundary, but 8k-16k past the current end of the file) and then assumes
> that when it writes to the end of the file (but less than the end of the
> mmap region) that there's no need to remmap and that data is immediately
> available within the previous mmaped region.

Pretty much any OS that tries to make mmap() coherent with regular
read/write accesses will automatically also have to be coherent wrt file
size updates.

IOW, I don't think that cyrus is really any more "wrong" in this than in
assuming that you can mix read/write and mmap() accesses. In fact, I
suspect that Cyrus is probably _more_ conservative than most, in that it
would not be totally unheard of to just do a much bigger mmap(), and not
even bother to re-do it until the file grows past that size (ie no 8k/16k
granularity, but make it arbitrarily non-granular).
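
The 8k/16k "slop" policy described in the quoted text can be sketched as
plain arithmetic. This is one reading of that description (round the file
size up to an 8k boundary, then add another 8k), not Cyrus's actual code;
the names SLOP, slop_map_len and needs_remap are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOP ((size_t)8192) /* granularity taken from the description above */

/* Length to map for a file of `file_size` bytes: round up to an 8k boundary
 * and add one more 8k, so the mapping ends 8k-16k past the end of the file. */
static size_t slop_map_len(size_t file_size)
{
    assert(file_size <= SIZE_MAX - 2 * SLOP); /* guard against overflow */
    size_t rounded = (file_size + SLOP - 1) / SLOP * SLOP;
    return rounded + SLOP;
}

/* Nonzero once the file has grown past the region that is already mapped,
 * which is when the unmap/mmap "refresh" has to happen. */
static int needs_remap(size_t new_file_size, size_t mapped_len)
{
    return new_file_size > mapped_len;
}
```

Making SLOP much larger gives the "arbitrarily non-granular" variant: fewer
remaps, at the cost of reserving more address space up front.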

> Apparently that works on most OS's (but is what this bug actually
> exposed), but according to the mmap docs:
>
> ---
> If the size of the mapped file changes after the call to mmap() as a
> result of some other operation on the mapped file, the effect of
> references to portions of the mapped region that correspond to added or
> removed portions of the file is unspecified.

Note that if you really want to be portable, you simply must not mix
mmap() with *any* other operations without sprinkling in a healthy amount
of "msync()" or unmapping/remapping entirely.
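
A minimal sketch of the unmap/remap option, assuming a POSIX system: data
is appended with write(), and the read-only view is thrown away and mapped
again before anything reads through it. The struct and function names are
hypothetical and are not the Cyrus map_refresh implementation.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one file plus a read-only view of it. */
struct ro_map {
    int fd;
    char *base; /* NULL while nothing is mapped */
    size_t len; /* length of the current mapping */
};

static int ro_map_open(struct ro_map *m, const char *path)
{
    m->base = NULL;
    m->len = 0;
    m->fd = open(path, O_RDWR | O_CREAT | O_APPEND, 0600);
    return m->fd < 0 ? -1 : 0;
}

/* Drop the old view and map the file again at its current size. */
static int ro_map_refresh(struct ro_map *m)
{
    struct stat st;
    if (fstat(m->fd, &st) != 0)
        return -1;
    if (m->base != NULL) {
        munmap(m->base, m->len);
        m->base = NULL;
        m->len = 0;
    }
    if (st.st_size == 0)
        return 0; /* mmap() rejects a zero length; leave nothing mapped */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, m->fd, 0);
    if (p == MAP_FAILED)
        return -1;
    m->base = p;
    m->len = (size_t)st.st_size;
    return 0;
}

/* Append with write(), then remap before anyone reads through the view. */
static int append_and_refresh(struct ro_map *m, const void *data, size_t len)
{
    assert(data != NULL);
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(m->fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return ro_map_refresh(m);
}
```

Remapping after every write is the conservative end of the scale; it trades
extra system calls for not depending on the mapping tracking file growth.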

So _in_practice_ - because everybody tries to do a good job - you can
actually expect to have mmap() be coherent, even though there are no real
guarantees.

> Amazingly (apart from HP/UX) no OS actually seems to have a problem with
> this since there would be massive cyrus bug reports otherwise.

Yeah. Over the years, the pain from having a non-coherent mmap() generally
has pushed everybody into just making mmap() easy to use. Which means that
mixing things generally works fine, even if it is not at all _guaranteed_.

So I'd expect mmap+write to work and be coherent almost always. But it's
still a fairly unusual combination, and I would personally think that
using MAP_SHARED and writing through the mmap() would be the less
surprising model.

		Linus
