Log structured filesystems (Theodore Tso)


Date:   Tue, 26 Oct 1999 18:06:13 -0400
From: "Theodore Y. Ts'o" <tytso@mit.edu>
Subject: Re: Question on FFS support
Newsgroups: fa.linux.kernel

   Date:   Tue, 26 Oct 1999 15:24:25 -0500
   From: Brian Grayson <bgrayson@ibmoto.com>

     I don't know, but the original FreeBSD responder may have meant
   LFS, the log-structured filesystem, which the BSDs have
   recently revamped.  It is a true journaling file system distinct
   from FFS/UFS, and not merely soft-updates on top of a
   traditional block-structured file system.  With LFS, you always
   append on writes, rather than overwrite.  Thus, writes are
   fast, and if the machine crashes, since the previous data has
   not been overwritten, you can recover quite quickly.  Of
   course, with finite disks you need to do some
   garbage-collecting, but LFS does all of that for you, and also
   has some optimizations so that reads are still fast.

The latter has been the traditional failing of LFS filesystems over
update-in-place filesystems (e.g., FFS and ext2) --- because you never
write over an existing disk block, files and directories tend to get
scattered all over the disk.  There are ways things can be improved
(intelligent log cleaners, and huge amounts of memory to cache data so
you don't have to go to disk for reads in the first place), but it's very
difficult to make a log structured filesystem work as well as an
update-in-place filesystem.  
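
To make the scattering concrete, here is a toy Python sketch.  It
assumes a disk modelled as integer block addresses and a file
described by a map from logical block number to physical address.  The
function names and the numbers (an 8-block file, a log head at address
100) are made up for illustration and do not correspond to any real
filesystem's code.

```python
def append_to_log(layout, updated_blocks, log_head):
    """Log-structured update: each rewritten block goes to the head of the log."""
    new_layout = dict(layout)
    for logical in updated_blocks:
        new_layout[logical] = log_head   # old copy is left behind as garbage
        log_head += 1
    return new_layout, log_head


def discontinuities(layout):
    """Count the breaks in physical contiguity seen by a sequential read."""
    addrs = [layout[logical] for logical in sorted(layout)]
    return sum(1 for a, b in zip(addrs, addrs[1:]) if b != a + 1)


# Hypothetical file: 8 logical blocks laid out contiguously at addresses 0..7.
contiguous = {logical: logical for logical in range(8)}

# Update-in-place: rewriting blocks 1, 4 and 6 leaves every address unchanged.
in_place = dict(contiguous)

# Log-structured: the same three rewrites land at the log head (address 100).
scattered, _ = append_to_log(contiguous, [1, 4, 6], log_head=100)

print(discontinuities(in_place))   # 0
print(discontinuities(scattered))  # 6
```

Three small rewrites turn a file that could be read in one sweep into
one with six breaks in contiguity, which is the work a log cleaner or
a large cache has to hide.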

There is a fairly widely held belief that LFS systems don't work well on
smaller systems --- especially ones without a lot of memory --- because
of the huge amounts of cache needed to make things reasonably fast.  And
there will be certain application access patterns which will be quite
pessimal for LFS.  There are systems where LFS works quite well, though.
NetApp boxes use a log structured filesystem, for example.  They've
spent a lot of effort, using both hardware and software techniques, to
make their boxes go *fast*.

						- Ted

From: Theodore Tso <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: Solaris ZFS on Linux [Was: Re: the " 'official' point of view" 
	expressed by kernelnewbies.org regarding reiser4 inclusion]
Date: Tue, 01 Aug 2006 03:01:10 UTC
Message-ID: <fa./5aWgQuXSwpY2BEtVeOQx2mVZFA@ifi.uio.no>
Original-Message-ID: <20060801030005.GA1987@thunk.org>

On Mon, Jul 31, 2006 at 08:31:32PM -0500, David Masover wrote:
> So you use a repacker.  Nice thing about a repacker is, everyone has
> downtime.  Better to plan to be a little sluggish when you'll have
> 1/10th or 1/50th of the users than be MUCH slower all the time.

Actually, that's a problem with log-structured filesystems in general.
There are quite a few real-life workloads where you *don't* have
downtime.  The thing is, in a global economy, you move from the
London/European stock exchanges, to the New York/US exchanges, to the
Asian exchanges, with little to no downtime available.  In addition,
people have been getting more sophisticated with workload
consolidation tricks so that you use your "downtime" for other
applications (either to service other parts of the world, or to do
daily summaries, 3-d frame rendering at animation companies, etc.)  So
the assumption that there will always be time to run the repacker is a
dangerous one.

The problem is that many benchmarks (such as tarring and untarring the
kernel sources in reiser4 sort order) are overly simplistic, in that
they don't really reflect how people use the filesystem in real life.
(How many times can you guarantee that files will be written in the
precise hash/tree order so that the filesystem gets the best possible
time?)  A more subtle version of this problem happens for filesystems
whose performance degrades dramatically over time without a
repacker.  If the benchmark doesn't take into account the need for a
repacker, or if the repacker is disabled or fails to run during the
benchmark, the filesystem is in effect "cheating" on the benchmark
because there is critical work which is necessary for the long-term
health of the filesystem which is getting deferred until after the
benchmark has finished measuring the performance of the system under
test.
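
A back-of-the-envelope sketch of that deferral, in Python.  The
function name and every number here are hypothetical, chosen only to
show the arithmetic; they are not measurements of any filesystem.

```python
def throughput(ops, foreground_seconds, deferred_repack_seconds=0.0):
    """Operations per second, optionally charging deferred repacker time."""
    return ops / (foreground_seconds + deferred_repack_seconds)


# Hypothetical benchmark run: 100,000 operations in 50 seconds with the
# repacker disabled, leaving 30 seconds of repacking work still owed.
reported = throughput(100_000, 50.0)
sustained = throughput(100_000, 50.0, deferred_repack_seconds=30.0)

print(reported)   # 2000.0
print(sustained)  # 1250.0
```

The benchmark reports the first figure; a system that has to keep
running after the benchmark ends can only sustain something closer to
the second.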

This sort of marketing benchmark ("lies, d*mn lies, and benchmarks")
may be useful for trying to scam mainline acceptance of the filesystem
code, or to make pretty graphs that make log-structured filesystems
look good in Usenix papers, but despite the fact that huge numbers of
papers were written about the lfs filesystem two decades ago, it was
never used in real life by any of the commercial Unix systems.  This
wasn't an accident, and it wasn't due to a secret conspiracy of BSD
fast filesystem hackers keeping people from using lfs.  No, the BSD
lfs died on its own merits....

						- Ted

From: Theodore Tso <tytso@mit.edu>
Newsgroups: fa.linux.kernel
Subject: Re: Solaris ZFS on Linux [Was: Re: the " 'official' point of view"
	expressed by kernelnewbies.org regarding reiser4 inclusion]
Date: Tue, 01 Aug 2006 06:49:31 UTC
Message-ID: <fa.J8t6QOqxCHM0XoT+s0Rd7dzXtlk@ifi.uio.no>
Original-Message-ID: <20060801064837.GB1987@thunk.org>

On Mon, Jul 31, 2006 at 09:41:02PM -0700, David Lang wrote:
> just because you have redundancy doesn't mean that your data is idle enough
> for you to run a repacker with your spare cycles. to run a repacker you
> need a time when the chunk of the filesystem that you are repacking is not
> being accessed or written to. it doesn't matter if that data lives on one
> disk or 9 disks all mirroring the same data, you can't just break off 1 of
> the copies and repack that because by the time you finish it won't match
> the live drives anymore.
>
> database servers have a repacker (vacuum), and they are under tremendous
> pressure from their users to avoid having to use it because of the
> performance hit that it generates. (the theory in the past is exactly what
> was presented in this thread, make things run faster most of the time and
> accept the performance hit when you repack). the trend seems to be for a
> repacker thread that runs continuously, causing a small impact all the time
> (that can be calculated into the capacity planning) instead of a large
> impact once in a while.

Ah, but as soon as the repacker thread runs continuously, then you
lose all or most of the claimed advantage of "wandering logs".
Specifically, the claim of the "wandering log" is that you don't have
to write your data twice --- once to the log, and once to the final
location on disk (whereas with ext3 you end up having to do double
writes).  But if the repacker is running continuously, you end up
doing double writes anyway, as the repacker moves things from a
location that is convenient for the log, to a location which is
efficient for reading.  Worse yet, if the repacker is moving disk
blocks or objects which are no longer in cache, it may end up having
to read objects in before writing them to a final location on disk.
So instead of a write-write overhead, you end up with a
write-read-write overhead.
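
The accounting can be sketched in a few lines of Python.  This is a
deliberately crude model: the scheme names and the `uncached` parameter
are invented for illustration, every block is assumed to be relocated
exactly once by the repacker, and the "journal" case follows the
double-write description above rather than any particular ext3 mount
mode.

```python
def io_cost(blocks, scheme, uncached=0):
    """Return (reads, writes) needed to get `blocks` blocks into final layout.

    `uncached` is the number of blocks the repacker must read back from
    disk because they have already fallen out of the cache.
    """
    if scheme == "journal":
        # Once to the log, once to the final location.
        return 0, 2 * blocks
    if scheme == "wandering":
        # The log copy is the final copy; nothing is ever moved.
        return 0, blocks
    if scheme == "wandering+repacker":
        # Written to the log, then rewritten somewhere read-efficient.
        return uncached, 2 * blocks
    raise ValueError(f"unknown scheme: {scheme}")


print(io_cost(1000, "journal"))                           # (0, 2000)
print(io_cost(1000, "wandering"))                         # (0, 1000)
print(io_cost(1000, "wandering+repacker", uncached=750))  # (750, 2000)
```

Under these assumptions the continuously repacked wandering log does as
many writes as the journal, plus whatever reads the repacker needs for
blocks that are no longer in memory.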

But of course, people tend to disable the repacker when doing
benchmarks because they're trying to play the "my filesystem/database
has bigger performance numbers than yours" game....

					- Ted
