User-space filesystems (Linus Torvalds)

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace
Original-Message-ID: <Pine.LNX.4.58.0411180959450.2222@ppc970.osdl.org>
Date: Thu, 18 Nov 2004 18:17:20 GMT
Message-ID: <fa.gue1uuf.828rpl@ifi.uio.no>

On Thu, 18 Nov 2004, Miklos Szeredi wrote:
>
> It's possible, but I don't see why that's a problem.  If it can get
> more memory it's OK.  If allocation fails, then the write() will fail
> with ENOMEM; if the OOM killer gets to work and kills the FUSE process,
> then write will return with ENOTCONN or something like that.

Why do you think it would kill the FUSE process? And why do you think
killing _any_ process would make the system come back to life? After all,
memory wasn't filled by process usage, it was filled by dirty FS pages.

I really do believe that user-space filesystems have problems. There's a
reason we tend to do them in kernel space.

But limiting the outstanding writes some way may at least hide the thing.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace
Original-Message-ID: <Pine.LNX.4.58.0411181027070.2222@ppc970.osdl.org>
Date: Thu, 18 Nov 2004 18:46:25 GMT
Message-ID: <fa.gstju6d.aiuq1r@ifi.uio.no>

On Thu, 18 Nov 2004, Miklos Szeredi wrote:
>
> Well, killing the fuse process _will_ make the system come back to
> life, since then all the dirty pages belonging to the filesystem will
> be discarded.

They will? Why? They're still mapped into other processes, still dirty.
How do they go away?

> > I really do believe that user-space filesystems have problems. There's a
> > reason we tend to do them in kernel space.
>
> Well, NFS with a network failure has the same problem.  It's not the
> userspace that's the problem, it's the non-reliability.

No, it _is_ the userspace.

Yes, NFS is unreliable too, but it doesn't have the behaviour that when
the client locks up, the server locks up too. The two aren't "linked", and
they are protected from each other using up too much memory.

In contrast, a fuse process that needs to do IO is _not_ protected from
the clients having eaten up all the memory it needs to do the IO.

Btw, this is not a new issue. This is the _exact_ same issue that "run the
NFS server on the same machine as the client" has. And yes, it did have
problems. People still did it, because it allowed for user-space
filesystem demos.

> Currently shared writable mappings aren't allowed for non-root by
> default in FUSE.

Yes, that's a valid approach.

			Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace
Original-Message-ID: <Pine.LNX.4.58.0411181108140.2222@ppc970.osdl.org>
Date: Thu, 18 Nov 2004 19:25:44 GMT
Message-ID: <fa.gtdftme.a2qqhk@ifi.uio.no>

On Thu, 18 Nov 2004, Miklos Szeredi wrote:
>
> Will the clients be allowed to fill up the _whole_ memory with dirty
> pages?

Sure. It's not a situation that is easy to get into, but it's a nasty
case.

> Page writeback will start sooner than that, and then the
> client will not be able to dirty more pages until some are freed.

Ehh - the _CPU_ handles dirtying pages all on its own. The OS never even
knows that a page got dirtied, so "starting writeout early" is not much of
an option.

We actually had (for a short while) code that tracked the dirty bit in
software (ie make it unwritable by default, and take the write fault), but
people showed that that was actually a real performance problem on some
loads.
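
The software dirty-bit scheme described above (keep the page unwritable,
take the write fault) can be imitated from user space. This is only a
sketch, not the kernel code that was tried: it assumes Linux, and the
names track_first_write() and on_write_fault() are made up for
illustration. It shows why the scheme costs something: the first store
to every clean page has to go through a fault.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *tracked_page;               /* the single page being tracked */
static size_t tracked_size;              /* its size in bytes */
static volatile sig_atomic_t dirty_seen; /* the software "dirty bit" */

/* Write-fault handler: note that the page got dirtied, then make it
 * writable so that the faulting store is retried and succeeds. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx)
{
    char *addr = (char *)info->si_addr;

    (void)sig;
    (void)ctx;
    if (addr < tracked_page || addr >= tracked_page + tracked_size)
        _exit(1); /* a fault we did not arrange: bail out */
    dirty_seen = 1;
    /* mprotect() is not on the POSIX async-signal-safe list; calling it
     * here is an assumption that holds on Linux. */
    if (mprotect(tracked_page, tracked_size, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

/* Returns 1 if the first store was caught by the fault handler,
 * 0 if it was not, and -1 if setup failed. */
int track_first_write(void)
{
    struct sigaction sa, old_segv, old_bus;
    long ps = sysconf(_SC_PAGESIZE);
    volatile char *p;
    int caught;

    if (ps <= 0)
        return -1;
    tracked_size = (size_t)ps;
    /* Map the page read-only: "unwritable by default". */
    tracked_page = mmap(NULL, tracked_size, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tracked_page == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old_segv) != 0) {
        munmap(tracked_page, tracked_size);
        return -1;
    }
    if (sigaction(SIGBUS, &sa, &old_bus) != 0) {
        sigaction(SIGSEGV, &old_segv, NULL);
        munmap(tracked_page, tracked_size);
        return -1;
    }

    dirty_seen = 0;
    p = tracked_page;
    assert(p[0] == 0);  /* reads are allowed and do not fault */
    p[0] = 'x';         /* first store: takes the write fault */
    caught = (dirty_seen == 1);
    p[1] = 'y';         /* later stores: page is writable, no fault */

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    munmap(tracked_page, tracked_size);
    return caught;
}
```

A caller would simply check that track_first_write() returns 1. Without
the read-only trick nothing in software runs when the store happens,
which is the point made above about the CPU dirtying pages on its own.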

> BTW, I've never myself seen a deadlock, and I've not had any report of
> it.

Almost nobody uses shared writable mappings. Certainly not on "odd"
things. They are historically used by things like innd for the active
file, by some odd applications that want to do their own memory
management, and by databases. That's pretty much it.

So it's entirely possible that you have never even _seen_ a shared
writable mapping even if you stressed the filesystem very hard. They
really are that rare.
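
For readers who have never seen one, a shared writable mapping looks
roughly like the sketch below. It assumes a POSIX system with a writable
/tmp; the function name dirty_via_shared_mapping() and the scratch file
name are invented for the example. The interesting line is the plain
store: it dirties a file page without any system call.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if a byte stored through the mapping is visible through the
 * file descriptor afterwards, 0 if not, and -1 if setup failed. */
int dirty_via_shared_mapping(void)
{
    char path[] = "/tmp/shared_map_XXXXXX"; /* hypothetical scratch file */
    long ps = sysconf(_SC_PAGESIZE);
    char on_disk = 0;
    char *map;
    int fd, ok;

    if (ps <= 0)
        return -1;
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file lives on until fd and mapping go away */
    if (ftruncate(fd, (off_t)ps) != 0) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, (size_t)ps, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    assert(map[0] == 0); /* a freshly extended file reads as zeroes */
    map[0] = 'D';        /* plain store, no syscall: the page is now dirty */

    /* Ask for writeback explicitly, then read the byte back via the fd. */
    ok = msync(map, (size_t)ps, MS_SYNC) == 0 &&
         pread(fd, &on_disk, 1, 0) == 1 &&
         on_disk == 'D';

    munmap(map, (size_t)ps);
    close(fd);
    return ok ? 1 : 0;
}
```

A process can do that store to as many pages as it has mapped, which is
why nothing resembling a per-write() limit applies to this case.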

There's a few VM testers out there that do nasty things with writable
shared mappings. You could try them just for fun, but personally, if we
are seriously talking about merging FUSE, I'd actually prefer for writable
mappings to not be supported at all.

It wouldn't be the only filesystem that doesn't support the thing. I think
even NFS didn't support them until I did the pagecache rewrite. Nobody
really complained (well, _very_ few did).

IOW, from a merging standpoint, simple really _is_ better. Even if you
really really want to use exotic features like "direct IO" and writable
mappings some day, let's just put it this way: it's a lot easier to merge
something that has no questions about strange cases, and then _later_ add
in the strange cases, than it is to merge it all on day #1.

I'm a sucker. Ask anybody. I'll accept the exact same patch that I
rejected earlier if you just do it the right way. I'm convinced that some
people actually do it on purpose just for the amusement value ("Look, he
did it _again_. What a doofus!")

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace
Original-Message-ID: <Pine.LNX.4.58.0411181140110.2222@ppc970.osdl.org>
Date: Thu, 18 Nov 2004 19:52:27 GMT
Message-ID: <fa.gtd9um6.a2grhs@ifi.uio.no>

On Thu, 18 Nov 2004, Miklos Szeredi wrote:
>
> OK, sorry.  I'd rephrase it then to say: will the system allow _all_
> its pages to be used for file data?

Yup, pretty much.

It's actually even _normal_ behaviour for many of the core users of shared
files. People who really do databases get quite upset if you don't let
them mmap as much memory as they want, because for them, they really tune
their cache sizes for the size of memory, and they think the OS (and
anything else, for that matter) just gets in their way. They want 99% of
memory to be used for the shared mapping, and the remaining 1% for their
code.

(That's a bit extreme, but you get the idea).

Historically, we've often tried to "partition" memory in various ways (ie
"the buffer cache can only grow up to 40% of real memory" etc). It ends up
being good for some things (watermarks etc), but almost every time it ends
up being bad as a hard _limit_. So yes, the kernel tends to let people
do what they think they want to do.

"Give them rope",

			Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace
Original-Message-ID: <Pine.LNX.4.58.0411181035140.2222@ppc970.osdl.org>
Date: Thu, 18 Nov 2004 18:53:43 GMT
Message-ID: <fa.gtddueb.a2oq9p@ifi.uio.no>

On Thu, 18 Nov 2004, Jamie Lokier wrote:
>
> Linus Torvalds wrote:
> > Why do you think it would kill the FUSE process? And why do you think
> > killing _any_ process would make the system come back to life? After all,
> > memory wasn't filled by process usage, it was filled by dirty FS pages.
> >
> > I really do believe that user-space filesystems have problems. There's a
> > reason we tend to do them in kernel space.
>
> Are kernel space filesystems immune from this problem?  What happens
> when they need to kmalloc() in order to write some data?

That's why we have GFP_NOFS and other flags (PF_MEMALLOC etc). So yes,
they are "immune" in the sense that they have been inoculated, but not in
the sense that they can't have the bug conceptually.

So the kernel not only keeps a set of reserved pages for atomic
allocations, but also the VM knows not to recurse into a filesystem
operation when the reason for the memory allocation was a low-memory
circumstance. When a filesystem asks for memory in the page-out path, the
VM may still throw out cached pages for that FS, but it won't try to write
them back.

Guys, there is a _reason_ why microkernels suck. This is an example of how
things are _not_ "independent". The filesystems depend on the VM, and the
VM depends on the filesystem. You can't just split them up as if they were
two separate things (or rather: you _can_ split them up, but they still
very much need to know about each other in very intimate ways).

So what do you do? You limit shared dirty pages (inefficient memory use),
or you disallow certain behaviours, or you add tons of new interfaces to
expose essentially the same "every thing that can allocate and is on the
write-out path takes a GFP flag".

User-space filesystems are hard to get right. I'd claim that they are
almost impossible, unless you limit them somehow (shared writable mappings
are the nastiest part - if you don't have those, you can reasonably limit
your problems by limiting the number of dirty pages you accept through
normal "write()" calls).
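
Limiting the dirty pages accepted through write() could look something
like the toy model below. It is a sketch under assumptions, not FUSE or
kernel code: struct dirty_budget and both functions are hypothetical
names, and where a real implementation would put the writer to sleep
until writeback completes, this one just refuses the request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-filesystem budget of dirty pages. */
struct dirty_budget {
    size_t limit;       /* most dirty pages we are willing to hold */
    size_t outstanding; /* dirty pages accepted but not yet written back */
};

/* Called on the write() path before dirtying `pages` pages.
 * Returns 0 if the pages fit in the budget, -1 if the caller has to
 * wait for writeback first. */
int budget_try_dirty(struct dirty_budget *b, size_t pages)
{
    assert(b->outstanding <= b->limit);
    if (pages > b->limit - b->outstanding)
        return -1;
    b->outstanding += pages;
    return 0;
}

/* Called when `pages` pages have been written back and are clean again. */
void budget_writeback_done(struct dirty_budget *b, size_t pages)
{
    assert(pages <= b->outstanding);
    b->outstanding -= pages;
}
```

This only works because every write() passes through code that can do
the accounting. A store through a shared writable mapping never calls
anything like budget_try_dirty(), which is why that case is the nasty
one.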

			Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: [PATCH] [Request for inclusion] Filesystem in Userspace
Original-Message-ID: <Pine.LNX.4.58.0411181047590.2222@ppc970.osdl.org>
Date: Thu, 18 Nov 2004 18:59:42 GMT
Message-ID: <fa.gvdnumd.822rhr@ifi.uio.no>

On Thu, 18 Nov 2004, Alan Cox wrote:
>
> > I really do believe that user-space filesystems have problems. There's a
> > reason we tend to do them in kernel space.
> >
> > But limiting the outstanding writes some way may at least hide the thing.
>
> Possibly dumb question. Is there a reason we can't have a prctl() that
> flips the PF_* flags for a user space daemon in the same way as we do
> for kernel threads that do I/O processing ?

It's more than just PF_MEMALLOC.

And PF_MEMALLOC really is to avoid _recursion_, which is the smallest
problem. It does so by allowing the process to dip into the critical
resources, but that only works if you know that the process is actually
freeing pages right then and there. If you set it willy-nilly, you'll just
run out of pages soon, and you'll be dead.

The GFP_IO and GFP_FS flags are the _real_ protectors. They don't dip into
the (very limited) set of reserved pages, they say "we can still free 90%
of memory, we just have to ignore that dangerous 10%".

And yes, you could somehow expose those as process flags too, and make
people who do GFP_USER or GFP_KERNEL actually look at some process flag
and do the proper masking.
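
As a rough user-space model of that masking idea: the flag names and
values below are invented for the sketch (they are not the kernel's
definitions), and struct task_model stands in for whatever per-process
state would carry the restriction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical allocation-context bits, loosely modelled on the idea of
 * "may sleep", "may start IO" and "may call into the filesystem". */
#define ALLOC_WAIT 0x1u
#define ALLOC_IO   0x2u
#define ALLOC_FS   0x4u

#define ALLOC_KERNEL (ALLOC_WAIT | ALLOC_IO | ALLOC_FS) /* unrestricted */
#define ALLOC_NOFS   (ALLOC_WAIT | ALLOC_IO)            /* no FS recursion */

/* Hypothetical per-process state: bits this process must never pass on. */
struct task_model {
    unsigned forbid;
};

/* What an allocator would do with a caller's requested flags: strip the
 * bits that the current process has been told to avoid. */
unsigned effective_flags(const struct task_model *t, unsigned requested)
{
    assert(t != NULL);
    return requested & ~t->forbid;
}

/* Reclaim may write dirty FS pages back only if the FS bit survived. */
int reclaim_may_writeback_fs(unsigned flags)
{
    return (flags & ALLOC_FS) != 0;
}
```

A filesystem daemon marked with forbid = ALLOC_FS would then have every
unrestricted request quietly turned into the no-FS variant, so reclaim
on its behalf can still drop clean pages but never waits on the daemon's
own writeback.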

So clearly you _can_ do it. But it requires very intimate knowledge of VM
behaviour or the VM knowing about you.

		Linus
