Newsgroups: fa.linux.kernel
From: "Theodore Ts'o" <>
Subject: Re: jbd bug(s) (?)
Original-Message-ID: <>
Date: Thu, 26 Sep 2002 13:46:08 GMT
Message-ID: <>

On Thu, Sep 26, 2002 at 02:56:47PM +0200, Jakob Oestergaard wrote:
> I know.  What I imagined was, that there were disks out there which
> *internally* worked with smaller sector sizes, and merely presented a
> 512 byte sector to the outside world.

Actually, it's the other way around.  Most disks actually use an
internal sector size of 32k or now even 64k.  So ideally, I'd
like to have ext2 be able to support such larger block sizes, since it
would be a win from a performance perspective.  There are only two
problems with that.  First of all, we need to support tail-merging, so
small files don't cause fragmentation problems.  But this isn't an
absolute requirement, since without it a 32k block filesystem would
still be useful for a filesystem dedicated to large multimedia files
(and heck, a FAT16 filesystem would often use a 32k or larger block
size).

The real problem is that the current VM has an intrinsic assumption
that the blocksize is less than or equal to the page size.  It could
be possible to fake things by using an internal blocksize of 32k or
64k while emulating a 4k blocksize to the VM layer, but that provides
only a very limited benefit.  Since the VM doesn't know that the block size
is really 32k, we don't get the automatic I/O clustering.  Also, we
end up needing to use multiple buffer heads per real ext2 block (since
the VM still thinks the block size is PAGE_SIZE, not the larger ext2
block size).  So we could add larger block sizes, but it would mean
adding a huge amount of complexity for minimal gain (and if you really
want that, you can always use XFS, which pays that complexity cost).

It'd be nice to get real VM support for this, but that will almost
certainly have to wait for 2.6.

> Let's hope that none of the partitioning formats or LVM projects out
> there will misalign the filesystem so that your index actually *does*
> cross a 512 byte boundary   ;)

None of them would do anything as insane as that, not just because of
the safety issue, but because it would be a performance loss.  Just as
misaligned data accesses in memory are slower (or prohibited in some
architectures), misaligned data on disks are bad for the same reason.
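
As a rough illustration of the alignment point (a sketch only, not taken from any partitioning tool or from the kernel), the check below asks whether a structure of a given length at a given byte offset would straddle a 512-byte sector boundary, and whether a partition start keeps filesystem blocks aligned. The function names and the example offsets are hypothetical.

```python
SECTOR_SIZE = 512  # bytes; the sector size discussed in the thread


def crosses_sector_boundary(offset, length, sector_size=SECTOR_SIZE):
    """Return True if the byte range [offset, offset + length) touches
    more than one sector."""
    if length <= 0:
        raise ValueError("length must be positive")
    first_sector = offset // sector_size
    last_sector = (offset + length - 1) // sector_size
    return first_sector != last_sector


def partition_is_aligned(start_offset, block_size):
    """Return True if a partition starting at start_offset (in bytes)
    keeps filesystem blocks aligned to multiples of block_size."""
    return start_offset % block_size == 0
```

For example, a partition starting at sector 63 (byte offset 63 * 512) is not aligned for 4k blocks, while one starting at sector 2048 is.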

> > Making parts of the disk suddenly unreadable on power-fail is
> > generally considered a bad thing, though, so modern disks go to great
> > lengths to ensure the write finishes.
> Lucky us  :)

Actually, disks try so hard to ensure the write finishes that
sometimes they ensure it past the point where the memory has already
started going insane because of the low voltage during a power
failure.  This is why some people have reported extensive data loss
after just switching off the power switch.  The system was in the
midst of writing out part of the inode table, and as the voltage of
the +5 voltage rail started dipping, the memory started returning bad
results, but the DMA engine and the disk drive still had enough juice
to complete the write.  Oops.

This is one place where the physical journalling layer in ext3
actually helps us out tremendously, because before we write out a
metadata block (such as part of the inode table), it gets written to
the journal first.  So each metadata block gets written twice to disk;
once to the journal, synchronously, and then later, when we have free
time, to the disk.  This wastes some of our disk bandwidth --- which
won't be noticed if the system isn't busy, but if the workload
saturates the disk write bandwidth, then it will slow the system down.
However, this redundancy is worth it, because in the case of this
particular cause of corruption, although part of the on-disk inode
table might get corrupted on an unexpected power failure, it is also
on the journal, so the problem gets silently and automatically fixed
when the journal is run.
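
The recovery behaviour described above can be sketched with a toy in-memory model. This is not how jbd is implemented; the class and method names are invented for illustration. It only shows why keeping a full copy of each metadata block in the journal lets replay overwrite garbage that landed in the final location.

```python
class ToyJournaledDisk:
    """In-memory toy model of physical (full-block) journaling.

    'disk' holds the final block locations; 'journal' holds complete
    copies of metadata blocks from committed transactions.
    """

    def __init__(self):
        self.disk = {}      # block number -> bytes at the final location
        self.journal = []   # list of (block number, full block copy)

    def commit_metadata(self, updates):
        """First write: store complete block images in the journal."""
        for block_no, data in updates.items():
            self.journal.append((block_no, bytes(data)))

    def checkpoint(self):
        """Second write, done later: copy journaled blocks to their final
        locations, then forget them once they are all in place."""
        for block_no, data in self.journal:
            self.disk[block_no] = data
        self.journal.clear()

    def replay(self):
        """Recovery after an unclean shutdown: rewrite every journaled
        block, overwriting whatever ended up in the final location."""
        for block_no, data in self.journal:
            self.disk[block_no] = data
```

If garbage is written to a final location while the block's copy is still in the journal, `replay()` restores the good contents. A logical journal, which records only a description of the change, has no full copy to fall back on.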

Filesystems such as JFS and XFS do not use physical journaling, but
use logical journalling instead.  So they don't write the entire inode
table block to the journal, but just an abstract representation of
what changed.  This is more efficient from the standpoint of
conserving your write bandwidth, but it leaves them helpless if the
disk subsystem writes out garbage due to an unclean power failure.

Is this tradeoff worth it?  Well, arguably hardware Shouldn't Do That,
and it's not fair to require filesystems to deal with lousy hardware.
However, as I've said for a long time, the reason why PC hardware is
cheap is because PC hardware is crap, and since we live in the real
world, it's something we need to deal with.

Also, the double write overhead which ext3 imposes is only for the
metadata, and for real-life workloads, the overhead presented is
relatively minimal.  There are some lousy benchmarks like dbench which
exaggerate the metadata costs, but they aren't really representative
of real life.  So personally, I think the tradeoffs are worth it; of
course, though, I'm biased.  :-)

							- Ted

Newsgroups: fa.linux.kernel
From: "Theodore Ts'o" <>
Subject: Re: jbd bug(s) (?)
Original-Message-ID: <>
Date: Thu, 26 Sep 2002 14:26:27 GMT
Message-ID: <>

On Thu, Sep 26, 2002 at 03:05:57PM +0100, Christoph Hellwig wrote:
> On Thu, Sep 26, 2002 at 09:44:35AM -0400, Theodore Ts'o wrote:
> > block size).  So we could add larger block sizes, but it would mean
> > adding a huge amount of complexity for minimal gain (and if you really
> > want that, you can always use XFS, which pays that complexity cost).
> XFS doesn't support blocksize > PAGE_CACHE_SIZE under linux. In fact the
> latest public XFS/Linux release doesn't even support any blocksize other
> than PAGE_CACHE_SIZE.  This has changed in the development tree now and
> the version merged in 2.5 and the next public 2.4 release will have that
> support.  Doing blocksize > PAGE_CACHE_SIZE will be difficult if not
> impossible due to VM locking issues with the 2.4 and 2.5 VM code.

My mistake.  At one point I was talking to Mark Lord and I had gotten
the impression they had some Irix-VM-to-Linux-VM mapping layer which
would make blocksize > PAGE_SIZE possible.

						- Ted

From: Theodore Tso <>
Newsgroups: fa.linux.kernel
Subject: Re: The ext3 way of journalling
Date: Tue, 08 Jan 2008 21:57:47 UTC
Message-ID: <>

On Tue, Jan 08, 2008 at 09:51:53PM +0100, Andi Kleen wrote:
> Theodore Tso <> writes:
> >
> > Now, there are good reasons for doing periodic checks every N mounts
> > and after M months.  And it has to do with PC class hardware.  (Ted's
> > aphorism: "PC class hardware is cr*p").
> If these reasons are good ones (some skepticism here) then the correct
> way to really handle this would be to do regular background scrubbing
> during runtime; ideally with metadata checksums so that you can actually
> detect all corruption.

That's why we're adding various checksums to ext4...

And yes, I agree that background scrubbing is a good idea.  Larry
McVoy a while back told me the results of using a fast CRC to get
checksums on all of his archived data files, and then periodically
recalculating the CRC's and checking them against the stored checksum
values.  The surprising thing was that once every so often (and the
fact that it happens at all is disturbing), he would find that a file
had a broken checksum even though it had apparently never been
intentionally modified (it was in an archived file set, the modtime of
the file hadn't changed, etc.)
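
A minimal sketch of that kind of scrubbing, assuming CRC32 from Python's zlib as the "fast CRC" (the original tool and checksum are not named in the thread, and the function names here are hypothetical): record a checksum and modification time per file, then later flag files whose contents changed while the modtime did not.

```python
import os
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """Compute the CRC32 of a file, reading it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def record_checksums(paths):
    """Return {path: (crc32, mtime in ns)} for the archived files."""
    return {p: (file_crc32(p), os.stat(p).st_mtime_ns) for p in paths}


def scrub(stored):
    """Return paths whose contents changed although the mtime did not."""
    suspicious = []
    for path, (old_crc, old_mtime) in stored.items():
        if os.stat(path).st_mtime_ns != old_mtime:
            continue  # the file was touched since recording; not our case
        if file_crc32(path) != old_crc:
            suspicious.append(path)
    return suspicious
```

A periodic job would call `record_checksums()` once when the archive is created and `scrub()` on each later pass.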

And the fact that disk manufacturers on their high end enterprise
disks design their block guard system to detect cases where a block
gets written to a different part of the disk than where the OS
requested it to be written, and that I've been told of at least one
commercial large-scale enterprise database which puts a logical block
number in the on-disk format of their tablespace files to detect this
problem --- should give you some pause about how much faith at least
some people who are paid a lot of money to worry about absolute data
integrity have in modern-day hard drives....
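
To make the logical-block-number idea concrete, here is a small sketch of a self-identifying block. The layout (an 8-byte block number and a CRC32 in front of the payload), the 4096-byte block size, and the function names are all assumptions for illustration; they do not describe any particular database or drive format.

```python
import struct
import zlib

BLOCK_SIZE = 4096
HEADER = struct.Struct("<QI")   # logical block number (u64), CRC32 (u32)
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size


def _crc(block_no, payload):
    """CRC32 over the block number and the payload together."""
    return zlib.crc32(struct.pack("<Q", block_no) + payload) & 0xFFFFFFFF


def seal_block(block_no, payload):
    """Build an on-disk block that records which block it is meant to be."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload too large")
    payload = payload.ljust(PAYLOAD_SIZE, b"\0")
    return HEADER.pack(block_no, _crc(block_no, payload)) + payload


def check_block(expected_block_no, raw):
    """Classify a block read from location expected_block_no."""
    if len(raw) != BLOCK_SIZE:
        return "short read"
    block_no, crc = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    if _crc(block_no, payload) != crc:
        return "corrupt"
    if block_no != expected_block_no:
        return "misdirected"   # intact block, but written to the wrong place
    return "ok"
```

A plain checksum catches the "corrupt" case; only the embedded block number catches an intact block that landed at the wrong address.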

> But since fsck is so slow and disks are so big this whole thing
> is a ticking time bomb now. e.g. it is not uncommon to require tens
> of minutes or even hours of fsck time and some server that reboots
> only every few months will eat that when it happens to reboot.
> This means you get a quite long downtime.

What I actually recommend (and what I do myself) is to use
devicemapper to create a snapshot, and then run "e2fsck -p" on the
snapshot.  If the snapshot checks out without *any* errors (i.e., exit
code of 0), then the script can run "tune2fs -C 0 -T now /dev/XXX",
discard the snapshot, and exit.  If e2fsck returns any non-zero exit
code, indicating that it found problems, the output of e2fsck should be
e-mailed to the system administrator so they can schedule downtime and
fix the filesystem corruption.

This avoids the long downtime at reboot time.  You can do the above in
a cron script that runs at some convenient time during low usage
(i.e., 3am localtime on a Saturday morning, or whatever).
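
The procedure above could be scripted roughly as follows. This is a sketch under assumptions, not the script referred to in the post: it assumes the filesystem lives on an LVM logical volume (so the snapshot is made with lvcreate), and the volume path, snapshot name, and snapshot size are placeholders. The command runner is passed in as a parameter so the control flow can be exercised without touching real devices; mailing the report is left to the caller.

```python
import subprocess


def run_command(argv):
    """Default runner: execute argv, return (exit code, combined output)."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return proc.returncode, proc.stdout


def check_via_snapshot(origin, snap_name="fsck-snap", snap_size="1G",
                       run=run_command):
    """Check a snapshot of `origin` with e2fsck.

    Returns (clean, report). `origin` is assumed to look like
    /dev/<vg>/<lv>; the snapshot then appears as /dev/<vg>/<snap_name>.
    """
    snap_dev = origin.rsplit("/", 1)[0] + "/" + snap_name
    code, out = run(["lvcreate", "-s", "-L", snap_size,
                     "-n", snap_name, origin])
    if code != 0:
        return False, "could not create snapshot:\n" + out
    try:
        code, out = run(["e2fsck", "-p", snap_dev])
        if code == 0:
            # No errors at all: reset the mount count and last-check time
            # on the real filesystem so the boot-time check is skipped.
            run(["tune2fs", "-C", "0", "-T", "now", origin])
            return True, out
        return False, out   # caller should mail this to the administrator
    finally:
        run(["lvremove", "-f", snap_dev])   # always discard the snapshot
```

A cron job would call `check_via_snapshot("/dev/vg0/root")` (a made-up path) and send the report by mail whenever the first element of the result is False.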

					- Ted

From: Theodore Tso <tytso@MIT.EDU>
Newsgroups: fa.linux.kernel
Subject: Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental
Date: Thu, 17 Jan 2008 22:55:04 UTC
Message-ID: <>

On Wed, Jan 16, 2008 at 09:02:50PM -0500, Daniel Phillips wrote:
> Have you observed that in the wild?  A former engineer of a disk drive
> company suggests to me that the capacitors on the board provide enough
> power to complete the last sector, even to park the head.

The problem isn't with the disk drive; it's with the DRAM, which tends
to be much more voltage sensitive than the hard drives --- so it's
quite likely that you could end up DMA'ing garbage from the memory.
In fact, the fact that the disk drive lasts longer due to capacitors
on the board, rotational inertia of the platters, etc., is part of the
problem.

It was observed in the wild by SGI, many years ago on their hardware.
They later added extra capacitors on the motherboard and a powerfail
interrupt which caused Irix to run around frantically shutting
down DMA's for a controlled shutdown.  Of course, PC-class hardware
has none of this.  My source for this was Jim Mostek, one of the
original Linux XFS porters.  He had given me source code to a test
program that would show this; basically it zeroed out a region of disk,
then started writing a series of patterns on that part of the disk, and
then you kicked out the power cord and checked whether there was any
garbage on the disk.  If you saw something that wasn't one of the patterns
being written to the disk, then you knew you had a problem.  I can't
find the program any more, but it wouldn't be hard to write.
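
The checking half of such a test might look like the sketch below. It is not the lost program; the 512-byte block size, the patterns themselves, and the function names are assumptions. It operates on a bytes object, so reading the region back from a real device after the power cut is left out.

```python
BLOCK = 512  # assumed granularity of the test writes


def make_patterns(count=4, block=BLOCK):
    """Hypothetical patterns: each block is one byte value repeated
    (0x11, 0x22, ...)."""
    if not 1 <= count <= 15:
        raise ValueError("count must be between 1 and 15")
    return [bytes([0x11 * (i + 1)]) * block for i in range(count)]


def find_garbage(region, patterns, block=BLOCK):
    """Return the indexes of blocks that are neither still zeroed nor
    equal to one of the known patterns."""
    allowed = set(patterns) | {bytes(block)}
    bad = []
    for i in range(0, len(region), block):
        if bytes(region[i:i + block]) not in allowed:
            bad.append(i // block)
    return bad
```

After zeroing the region, writing the patterns in a loop, and cutting power, any index returned by `find_garbage()` marks a block holding data that was never intentionally written.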

I do know that I have seen reports from many ext2 users in the field
that could only be explained by the hard drive scribbling garbage onto
the inode table.  Ext3 solves this problem because of its physical
block journaling.

						- Ted

From: Theodore Tso <tytso@MIT.EDU>
Newsgroups: fa.linux.kernel
Subject: Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental
Date: Fri, 18 Jan 2008 14:24:41 UTC
Message-ID: <>

On Thu, Jan 17, 2008 at 04:31:48PM -0800, Bryan Henderson wrote:
> But I heard some years ago from a disk drive engineer that that is a myth
> just like the rotational energy thing.  I added that to the discussion,
> but admitted that I haven't actually seen a disk drive write a partial
> sector.

Well, it would be impossible or at least very hard to see that in
practice, right?  My understanding is that drives do sector-level
checksums, so if there was a partially written sector, the checksum
would be bogus and the drive would return an error when you tried to
read from it.

> Ted brought up the separate issue of the host sending garbage to the disk
> device because its own power is failing at the same time, which makes the
> integrity at the disk level moot (or even undesirable, as you'd rather
> write a bad sector than a good one with the wrong data).

Yep, exactly.  It would be interesting to see if this happens on
modern hardware; all of the evidence I've had for this is years old at
this point.

							- Ted

From: Theodore Tso <>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH 3/5] libata: Implement disk shock protection support
Date: Tue, 05 Aug 2008 13:43:50 UTC
Message-ID: <fa.xHEPMYwHEaiFFrkK9c2JdrI/>

On Mon, Aug 04, 2008 at 10:05:03PM -0600, Robert Hancock wrote:
> Yes, from what I've seen on these laptops, it doesn't take much to
> trigger the shock protection in Windows - lifting the front of the
> laptop off the table an inch and dropping it will do it,

A few years ago, I had a Thinkpad T21 laptop accidentally slip
through my butterfingers and drop about an inch before it landed on
the table.  Unfortunately, (a) the Thinkpad T21 laptop was rather
heavy (compared to modern laptops), (b) it didn't have the rubber
"bubble" on the bottom of the laptop to cushion the landing as the T22
and T23's had (and I'm sure I know why it was added), and (c) the hard
drive was active at the time.  It was enough to cause a head crash and
Linux immediately started reporting an exponentially increasing number
of write errors; the hard drive was totally unusable within an hour or
so.

So there's a reason why the anti-shock protection is set at a rather
sensitive level...

The real right answer though is to buy one of the laptop drives (such
as the Seagate Momentus 7200.2 or 7200.3) which has the anti-shock
detection built directly into the hard drive.  That way you don't have
to have a daemon that sits in the OS waking up the CPU some 20 to 30
times a second and burning up your battery even when the laptop is

							- Ted
