Bad blocks (Theodore Y. Ts'o)

Date: 	Tue, 11 Apr 2000 18:14:04 -0400
From: "Theodore Y. Ts'o" <tytso@MIT.EDU>
Subject: Re: EXT2 problems with 2.3.99pre3 and 2.3.48
Newsgroups: fa.linux.kernel

   Date: 	Wed, 12 Apr 2000 00:39:08 +0300 (EEST)
   From: <sampsa@staff.netsonic.fi>

   hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
   hda: read_intr: error=0x01 { AddrMarkNotFound }, LBAsect=816947,
   sector=816884

This means that you have low-level media errors on the disk underneath
your filesystem.

When you see this kind of thing, if there's anything really
critical/important on the filesystem which you haven't backed up, I
usually recommend an *immediate* backup of the disk.  If you're really
paranoid, find another empty hard disk which is at least as big as the
disk with problems, and then do a disk to disk backup:

	dd if=/dev/hda of=/dev/hdb bs=1k conv=sync,noerror

The reason for this is that in the worst case, you've suffered a head
crash, where the disk heads (for whatever reason) have crashed into the
surface of disk platters which are spinning at 5,000+ RPM.  This can
gouge bits of iron oxide off the platter, which then start spinning
around about the disk platters, and when they smash into disk heads,
they can cause yet another head crash.  Rinse.  Lather.  Repeat.

In the worst case, you can have an exponential increase in the number of
bad blocks, as more and more head crashes happen, and more and more disk
blocks get gouged off the disk platters.  Of course, this is a
worst-case scenario.  Many times you'll have a few errors, and that'll
be it.  So I don't want to scare you that badly ---- but OF COURSE, you
DID follow a discipline of making regular backups, so you don't have
anything to worry about.  RIGHT?  :-)
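
As an illustration of what the conv=sync,noerror options in the dd
command above buy you, here is a minimal Python sketch of the same idea:
keep copying past read errors, and pad whatever could not be read with
NULs so that every later block stays at its original offset.  The names
rescue_copy, read_block and flaky_read are made up for this example, and
the "disk" is simulated in memory; it is not how dd itself is
implemented.

```python
import io


def rescue_copy(read_block, total_size, dst, block_size=1024):
    """Copy total_size bytes to dst, one block at a time.

    read_block(offset, size) is a stand-in for reading the failing disk;
    it is expected to raise OSError on a media error.  Returns the list
    of block numbers that could not be read.
    """
    bad_blocks = []
    for block, offset in enumerate(range(0, total_size, block_size)):
        want = min(block_size, total_size - offset)
        try:
            data = read_block(offset, want)
        except OSError:
            # Like conv=noerror: note the failure and keep going.
            data = b""
            bad_blocks.append(block)
        # Like conv=sync: pad short or failed reads with NULs so every
        # later block lands at the same offset as on the source.
        dst.write(data.ljust(block_size, b"\0"))
    return bad_blocks


# Simulated 3-block disk whose middle block is unreadable.
disk = bytes(range(256)) * 12


def flaky_read(offset, size):
    if offset == 1024:
        raise OSError("simulated media error")
    return disk[offset:offset + size]


image = io.BytesIO()
print("unreadable blocks:", rescue_copy(flaky_read, len(disk), image))
```

The copy ends up the same size as the source, with a block of zeros
where the unreadable data used to be, which is what lets you run a
filesystem check against the copy afterwards.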

In any case, if it's only a few isolated bad blocks, which
don't seem to be increasing, you can cause e2fsck to find the bad blocks
and map around them by using the -c option to e2fsck.  (Check the man
page for details, but it's basically /sbin/e2fsck -c /dev/hda1).

However, if more badblocks are found after you do this, then it's
probably the disk's way of telling you that it's time to go buy a new
disk drive, before it fails on you completely.
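
With -c, e2fsck runs the badblocks program to do a read-only scan of the
device.  The idea of such a scan, and of checking whether a later scan
turns up more bad blocks than an earlier one, can be sketched in Python
as follows.  This is only a sketch against a simulated device:
scan_for_bad_blocks, newly_bad and make_device are hypothetical names,
not part of e2fsprogs.

```python
def scan_for_bad_blocks(read_block, num_blocks, block_size=1024):
    """Try to read every block once; return the numbers of those that fail."""
    bad = []
    for block in range(num_blocks):
        try:
            read_block(block * block_size, block_size)
        except OSError:
            bad.append(block)
    return bad


def newly_bad(previous_scan, current_scan):
    """Blocks that are bad now but were not bad in the earlier scan."""
    return sorted(set(current_scan) - set(previous_scan))


def make_device(bad_blocks, block_size=1024):
    """Build a fake read function that fails on the given block numbers."""
    def read_block(offset, size):
        if offset // block_size in bad_blocks:
            raise OSError("simulated media error")
        return b"\0" * size
    return read_block


first = scan_for_bad_blocks(make_device({3, 17}), num_blocks=32)
second = scan_for_bad_blocks(make_device({3, 17, 18, 25}), num_blocks=32)
print("first scan:", first)
print("new since then:", newly_bad(first, second))
```

A non-empty second list is the "time to go buy a new disk drive" signal
described above.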

Good luck!

							- Ted

Date: 	Tue, 11 Apr 2000 18:02:43 -0400
From: "Theodore Y. Ts'o" <tytso@MIT.EDU>
Subject: Re: EXT2 and BadBlock updating.....
Newsgroups: fa.linux.kernel

   Date: Tue, 11 Apr 2000 16:37:49 -0500
   From: Ed Carp <erc@pobox.com>

   > When the kernel detects a bad block, it's not so simple to just throw it
   > into the badblocks list --- the block may very well be in use as
   > filesystem metadata, or as a file data block.  It
   > might be possible to have the kernel handle more of these cases
   > automatically without requiring an fsck, but past a certain point,
   > you're introducing *way* too much hair into the kernel.

   The problem with this approach is, if you're working with systems
   that are up 24x7, to *not* have the ability to automatically detect a
   bad block, copy the data to another block, then mark that block as
   bad is a real pain at best and completely unacceptable at worst.  One
   of my clients is using Linux in a network communications controller
   (SONET/ATM backplane) and this sort of thing is going to raise the
   pain level around here as soon as someone realizes that badblocks
   aren't taken care of.

It's one thing if the bad block is in a file data block; there, you can
relocate the data to another block, assuming you can still read the
block by the time you find you have a disk error.  It's quite another if
the disk failure happens in a critical piece of the filesystem metadata.

The real right solution to this problem, if you have this kind of
reliability requirement, is to either use disks that do badblock sparing
at a low level, or (better yet) to use RAID.  If reliability matters that
much, that's what you should really be doing.
Or, if you're using Linux in a somewhat embedded system (such as a network
communications controller), then perhaps you should be booting off of
flash ROM, and then keeping temporary files on a RAM disk.

But don't expect a filesystem to be able to magically recover from
arbitrary media failures.  There are things we could do to make things
better, but it comes at an increased kernel complexity, and it still
won't solve the problem 100%.  The right tool for the sort of problem
you've outlined really is RAID.

						- Ted

Date: 	Tue, 11 Apr 2000 17:08:07 -0400
From: "Theodore Y. Ts'o" <tytso@MIT.EDU>
Subject: Re: EXT2 and BadBlock updating.....
Newsgroups: fa.linux.kernel

   Date: 	Tue, 11 Apr 2000 11:58:55 -0700 (PDT)
   From: Andre Hedrick <andre@linux-ide.org>

   You mentioned that there is no way to actively update the badblocks list
   in EXT2.  Can you explain or tell me how I can kludge around this?
   Also will EXT3 have this as a native feature?

You can update it using e2fsck's -c, -l, or -L options.

When the kernel detects a bad block, it's not so simple to just throw it
into the badblocks list --- the block may very well be in use as
filesystem metadata, or as a file data block.  It
might be possible to have the kernel handle more of these cases
automatically without requiring an fsck, but past a certain point,
you're introducing *way* too much hair into the kernel.

I know, I know, Multics did have the ability to do on-line filesystem
scavenging, which basically meant that all of what we would call e2fsck
(including a garbage collector, because directories were a miniature LISP
world) was in the kernel.  I, however, am not sure we want to go there.  :-)

						- Ted

Date: 	Wed, 12 Apr 2000 01:13:39 -0400
From: "Theodore Y. Ts'o" <tytso@MIT.EDU>
Subject: Re: EXT2 and BadBlock updating.....
Newsgroups: fa.linux.kernel

   Date: 	Tue, 11 Apr 2000 21:32:03 -0500
   From: Ed Carp <erc@pobox.com>

   > According to bert hubert:
   > > If you have a bad block on a modern disk it is time for an instant
   > > backup-and-replace.  The concept of a 'bad bit' on a disk is pretty
   > > much dead.
   > 
   > What he said.  I wouldn't bother putting a lot of intelligence into
   > handling of bad blocks; trying to keep a failing disk alive is false
   > economy of the worst kind.

   It must be nice to live in such a perfect world where one can replace
   disks instantly at the first sign of a problem.  Why have such
   badblock code in the kernel in the first place -- just insist that
   all your users have error-free drives.

   But out here in the real world we don't have such a luxury.  It could
   be days before a drive can get replaced.  In the meantime, we have to
   make do. 

But wait a moment!  I thought you said you had a high-reliability
requirement.  If so, why the heck are you using a single unprotected
IDE disk?!?  I've seen disks go from the first error to almost complete
failure in *hours*, and disks very much have a finite, but variable,
lifetime, especially if they are put in heavy use or put into service in
harsh environments.

In the real world, if you really have that kind of requirement for
long-term, it-must-always-be-working reliability, you really should
either be using RAID, or some kind of silicon disk, depending on the
application.  

If the customer isn't willing to pay that kind of money for that kind of
reliability, then he probably doesn't want that kind of reliability
badly enough.  And if he really does, but refuses to invest the right
amount of resources to make it happen, and then is going to blame you
for not being able to violate the laws of physics, you probably don't
want that person as your customer....

							- Ted

Date: 	Wed, 12 Apr 2000 01:46:33 -0400
From: "Theodore Y. Ts'o" <tytso@MIT.EDU>
Subject: Re: EXT2 and BadBlock updating.....
Newsgroups: fa.linux.kernel

   Date: 	Tue, 11 Apr 2000 22:20:43 -0700 (PDT)
   From: Andre Hedrick <andre@linux-ide.org>

   On Thu, 6 Apr 2000, Alan Cox wrote:

   > > > Multiwrite IDE breaks on a disk error
   > >
   > > Explain.........Please........
   >
   > If you have one bad sector you should write the other 7..

   Now if the ata/ide driver does not address this recovery then I see big
   problems.  Alan's case (my reading) states that regardless if we blow the
   write to a sector (based on 8 multi-write command) we should write all
   that we can........

That's definitely the case.  If you're writing 8 sectors, and sector #4
has an error, the driver shouldn't give up and not write sectors 5-8!
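
To make the point concrete, here is a small Python sketch of a
multi-sector write that records a failing sector and carries on with the
rest instead of aborting.  It only models the behaviour being argued
for; write_sectors and flaky_write are invented names, and the real IDE
driver is of course kernel C working with the hardware's multiwrite
command.

```python
def write_sectors(write_sector, start, payloads):
    """Write each payload to consecutive sectors beginning at start.

    write_sector(sector, data) stands in for the hardware and raises
    OSError on a bad sector.  Returns the sectors that failed.
    """
    failed = []
    for i, data in enumerate(payloads):
        sector = start + i
        try:
            write_sector(sector, data)
        except OSError:
            # Remember the failure for later reporting, but do not give
            # up on the sectors that follow.
            failed.append(sector)
    return failed


platter = {}


def flaky_write(sector, data):
    if sector == 4:
        raise OSError("simulated write error on sector 4")
    platter[sector] = data


failed = write_sectors(flaky_write, 1, [bytes([n]) * 512 for n in range(1, 9)])
print("failed:", failed, "written:", sorted(platter))
```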

   Now why the "fork recovery"?  We need to finish/complete the write
   request, but because we discovered a NEW BAD BLOCK/SECTOR we should walk
   the rest of the write because there may be a section of the disk/track
   that is failing.  Also this fork would provide the means to log the
   location of the newly failed sector and go back and MARK BAD and issue a
   request to the FS to update the BADBLOCKS table.  Thus we get:

	       0|1|2|3| 4 |5|6|7|8    0|1|2|3| 4 |5|6|7|8
   Theodore -> w|w|w|w|FSR|w|w|w|w -> T|h|e|o|FSR|d|o|r|e -> Theodore

   FSR == FaultSeekRecover THREAD............

Some kind of method where the block device layer can notify the
filesystem of specifically which blocks went bad would be useful,
probably as a callback.   Actually, it's already the case that the
filesystem can find out if there are problems assuming that the write is
being done synchronously.  It's just that most of the time disk writes
are done as a "fire and forget", and by the time the disk notices
something has gone wrong, the context in which the write request was
queued is long gone.
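
A rough Python sketch of the callback idea: the block layer accepts
writes "fire and forget", and when one later fails it tells whoever
registered an interest exactly which block went bad.  The class and
method names here (BlockDevice, register_error_callback, queue_write,
flush) are hypothetical and exist only for this illustration; nothing
like this interface is being described as existing in the kernel.

```python
class BlockDevice:
    """Toy block layer with asynchronous writes and an error callback."""

    def __init__(self, bad_blocks=()):
        self._bad = set(bad_blocks)   # blocks that will fail when written
        self._queue = []              # writes accepted but not yet done
        self._callbacks = []
        self.stored = {}

    def register_error_callback(self, fn):
        """fn(block_number) is called for every write that fails."""
        self._callbacks.append(fn)

    def queue_write(self, block, data):
        """Fire and forget: the caller gets no result back."""
        self._queue.append((block, data))

    def flush(self):
        """Carry out the queued writes, reporting failures by callback."""
        pending, self._queue = self._queue, []
        for block, data in pending:
            if block in self._bad:
                for fn in self._callbacks:
                    fn(block)
            else:
                self.stored[block] = data


reported = []
dev = BlockDevice(bad_blocks={7})
dev.register_error_callback(reported.append)
for blk in (6, 7, 8):
    dev.queue_write(blk, b"data")
dev.flush()   # the context that queued the writes is long gone by now
print("blocks reported bad:", reported)
```

Note that the callback only receives a block number.  Working out which
file that block belongs to is a separate problem.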

This is a Linux 2.5 issue, though --- it's not something we're going to
do for 2.4, and before we start this, we probably want to rototill the
entire block device layer anyway, since it's a bit of a mess and kludge
right now.

As far as what the filesystem can do, in *some* cases it may be able to
put the block onto the bad block list, and then retry the write, but in other
cases, where the filesystem was just reading from a pre-existing file,
there really isn't much the filesystem can do that's sane.  It could
unlink the block from the inode, and leave a hole there, and then relink
the block to the bad-block inode, but underlying applications still
probably won't deal well with that happening.  And if the bad block is
discovered in the inode table, or some other part of the filesystem
metadata, there *really* isn't much that can be done from the kernel
level to recover from that.
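
The two file-data cases above can be sketched with a toy block map in
Python.  This is an illustration under heavy simplifying assumptions:
ToyFS and its methods are invented for the example, an "inode" is just a
list of physical block numbers, and the only thing borrowed from ext2 is
that inode 1 is the one reserved for bad blocks.

```python
BAD_INO = 1  # ext2 reserves inode 1 for bad blocks; here it is only a dict key


class ToyFS:
    def __init__(self, free_blocks):
        self.free = list(free_blocks)
        # inode number -> list of physical blocks (None marks a hole)
        self.inodes = {BAD_INO: []}

    def _retire(self, physical):
        """Attach a physical block to the bad-block inode."""
        self.inodes[BAD_INO].append(physical)

    def remap_for_write(self, ino, logical):
        """A write failed: retire the block and hand back a fresh one,
        so the caller can retry the write there."""
        self._retire(self.inodes[ino][logical])
        new_block = self.free.pop(0)
        self.inodes[ino][logical] = new_block
        return new_block

    def punch_hole(self, ino, logical):
        """A read failed: the data is gone, so leave a hole in the file."""
        self._retire(self.inodes[ino][logical])
        self.inodes[ino][logical] = None


fs = ToyFS(free_blocks=[100, 101])
fs.inodes[12] = [10, 11, 12]
fs.remap_for_write(12, 1)   # logical block 1 failed during a write
fs.punch_hole(12, 2)        # logical block 2 failed during a read
print(fs.inodes)
```

The hole is the part applications are unlikely to cope with, and none of
this helps when the failed block holds the inode table itself.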

What would be *really* useful would be if S.M.A.R.T., or some other
facility could inform the filesystem that a block was *about* to fail.
In that case, the block could get relocated before data got lost, and
that would certainly be worth doing.  There are still some cases where
if the bad block were to occur inside critical filesystem metadata, the
recovery would be so complex that you really wouldn't want to do it
inside the kernel, but things probably still could be made better.  All
of this is not something that's going to happen before 2.4 ships,
however!

And it still doesn't change my contention that someone who wants
ultrareliability, and what you call "Enterprise class" computing,
without doing RAID, is fundamentally insane.  There are things we can do
to try to recover in the face of broken hardware --- but fundamentally,
cheap sh*t hardware is still cheap sh*t hardware.  You don't make
Enterprise class computers out of cheap sh*t.  It just doesn't happen.

							- Ted

Date: 	Wed, 12 Apr 2000 03:50:11 -0400
From: "Theodore Y. Ts'o" <tytso@MIT.EDU>
Subject: Re: EXT2 and BadBlock updating.....
Newsgroups: fa.linux.kernel

   Date: Wed, 12 Apr 2000 00:31:51 -0700 (PDT)
   From: Andre Hedrick <andre@linux-ide.org>

   > And it still doesn't change my contention that someone who wants
   > ultrareliability, and what you call "Enterprise class" computing,
   > without doing RAID, is fundamentally insane.  There are things we can do
   > to try to recover in the face of broken hardware --- but fundamentally,
   > cheap sh*t hardware is still cheap sh*t hardware.  You don't make
   > Enterprise class computers out of cheap sh*t.  It just doesn't happen.

   Ted, I need to drag you to some of the cool labs that are in the ATA
   industry.  7200 RPM ATA drives are 7200 RPM SCSI drives!  Only the
   electronic interface on the bottom is the difference.

Oh, sure, I'm very well aware of this.  (At the same time, I still think
UltraDMA66 is a kludge, and I'm wondering how ATA is going to
compete with Ultra-160 LVD SCSI --- come up with an UltraDMA132 that only
allows a 9-inch-long cable?  :-)

What I was referring to was folks who think they're going to run an
"Enterprise class", absolutely-can-not-fail, mission critical
application using a bare disk (SCSI *or* IDE) without RAID.  

We can put in all the clever callbacks from the disk driver to notify
the filesystem that there's a problem, but if the disk has a
catastrophic head crash (iron oxide flying everywhere, precipitating even
more head crashes), there's simply nothing the filesystem can do when it
gets the formal engraved invitation to the funeral.  With RAID, at least
you don't lose your data, and your server doesn't go down.  That's
generally considered desirable by those silly folks who want enterprise
computing.  :-)
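
For anyone wondering why RAID survives what a bare disk cannot, here is
a minimal Python sketch of the parity idea used by the parity-based RAID
levels (RAID 4 and 5).  No particular RAID level is named above, so
treat the choice of parity rather than mirroring as an assumption of
this example; xor_blocks and rebuild are made-up names.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_blocks, parity):
    """Recover the one missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three data disks and one parity disk, one block each.
data = [b"Theo", b"dore", b"Ts'o"]
parity = xor_blocks(data)

# The second disk dies completely; its block is rebuilt from the others.
recovered = rebuild([data[0], data[2]], parity)
print(recovered == data[1])
```

The filesystem sitting on top never sees the failure, which is the whole
point: the redundancy lives below it, where a lost block can still be
recomputed.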

						- Ted

Date: 	Wed, 12 Apr 2000 11:57:23 -0400
From: "Theodore Y. Ts'o" <tytso@MIT.EDU>
Subject: Re: EXT2 and BadBlock updating.....
Newsgroups: fa.linux.kernel

   Date: 	Wed, 12 Apr 2000 09:38:12 -0500
   From: Ed Carp <erc@pobox.com>

   My experience has been exactly the opposite.  I've got systems that
   have been running on drives with bad sectors for, literally, years.
   One drive that has periodic bad sectors on it has been running 24x7
   for over 3 years.

There are some cases where bad blocks appear but don't seem to augur a
chain reaction, sure.  That's why we have the badblocks list, and why we have
"e2fsck -c".  On the other hand, usually once bad blocks start
appearing, it really is the beginning of the end.   Stable bad blocks
when you mke2fs the filesystem --- sure.  Stable bad blocks that appear
while the filesystem is in service --- very, very rare.  It's like
cancer, except well over 80% of the time it's malignant.

The other thing to ask is what is the economic value of the data on the
disk?  And how much would it cost your client to have you recover the
disk if it were to catastrophically fail with no warning?  If either of
the answers is more than $300 or so, then I'd probably replace any drive
that *has* to be running which was more than 2 years old.  Disk drives
are cheap; the data on them, and unscheduled downtime, usually isn't.
If you're running production systems, you really don't want to screw
around.

(More than once while I was working for MIT, I saw this phenomenon
happen.  We would install a large number of disks in our fileservers,
all from the same production run, and they would
be put under the same load, all at the same time.  Usually some amount
of time later --- 18 to 24 months, usually --- all of the disks from
that batch would start failing within weeks of one another.  It was
eerie to watch: every day or two, boom, another disk would die --- or
when we were using more modern drives, start registering soft errors,
which are a hint that a disk is about to go.  Usually after the first
few, people would take the hint and start scheduling mass replacements
of all of the drives, before they died and took their data with them.
The lesson here is that disks *do* wear out and fail, and they *do* have
a finite lifetime.  When you have a large number of identical disks in
service, it's much easier to see this effect.)

						- Ted

Date: 	Wed, 12 Apr 2000 12:51:27 -0400
From: "Theodore Y. Ts'o" <tytso@MIT.EDU>
Subject: Re: EXT2 and BadBlock updating.....
Newsgroups: fa.linux.kernel

   Date: Wed, 12 Apr 2000 10:34:39 +0100
   From: "Stephen C. Tweedie" <sct@redhat.com>

   Right now it's really not practical to do any of that except for
   synchronous writes, because the filesystem has no way of knowing 
   what file a given block belongs to in general.  The only way 
   e2fsck can work that out is to scan the whole filesystem looking 
   for the ownership of the bad block.

Oh, if we were going to do this we'd have to add tags to the buffer
structure to indicate the inode number and logical block number, so that
if a buffer gets returned with an error, the filesystem can figure out
what to do with it.  (Which in the case of filesystem metadata still
could be nothing but the current choice of ignore the error, panic, or
mark the filesystem as read-only.)
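
In Python terms, the tagging idea looks roughly like the sketch below.
The Buffer class and on_io_complete function are hypothetical and are
not the kernel's buffer structure; they only show how carrying the inode
number and logical block number along with the buffer lets the error
path know what it is dealing with.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Buffer:
    physical_block: int
    inode: Optional[int] = None          # owning inode; None means metadata
    logical_block: Optional[int] = None  # position within that file
    io_error: bool = False


def on_io_complete(buf):
    """Decide what the filesystem can do once a buffer comes back."""
    if not buf.io_error:
        return "ok"
    if buf.inode is None:
        # Filesystem metadata: still only the existing choices.
        return "metadata: ignore, panic, or mark read-only"
    # File data: we know exactly which file and which block to fix up.
    return f"file data: inode {buf.inode}, logical block {buf.logical_block}"


print(on_io_complete(Buffer(5120, inode=42, logical_block=3, io_error=True)))
print(on_io_complete(Buffer(258, io_error=True)))
```

Without the two tag fields, the only way to answer "whose block was
that?" is the full scan e2fsck does.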

						- Ted

Date: 	Wed, 12 Apr 2000 13:52:53 -0400
From: "Theodore Y. Ts'o" <tytso@MIT.EDU>
Subject: Re: EXT2 and BadBlock updating.....
Newsgroups: fa.linux.kernel

   Date: 	Wed, 12 Apr 2000 11:46:53 -0500
   From: Ed Carp <erc@pobox.com>

   Feel free to correct me, but it is my understanding that:

   - Bad sectors are automatically remapped to another track
   - There is a limit to how many bad sectors that can be remapped
   - There can be a number of sectors already remapped when you buy the drive new
   - There are very few drives out there that are totally error free - if there
     were, you'd be paying a lot more for drives
   - Consequently, most drives aren't error-free when they come out of the box

   Now, the question is, what happens when that remap track gets full?
   The drive reports an error back to the OS, and that's where the
   problem starts.  So this idea that "toss the drive if you have one
   bad block" is dumb, because they *all* have bad blocks, and getting
   an error on a block is no guarantee that your drive is failing -
   drives generally have an advertised error rate - and while it's close
   to zero, it's not zero, and even if it's 10E33, you're going to hit
   that sooner or later.

Disks come with some number of bad blocks "from the factory".  This is
the same as when you buy a laptop: the usual standard is that up to 5
pixels can be bad, and the manufacturer will still call it a "good"
screen.

However, it's not the case that disks should in normal operation
randomly start to lose blocks, which they then automatically remap.
It's good that they can do this, but remember that you are potentially
*losing* *data* when this happens.  It's one thing when you have defects
from the manufacturing process; it's quite another thing to have blocks
go bad after the disk has been placed into service.  While it does
happen, it's very rare, and if it happened as often as you seem to think
it does, then files would be getting corrupted all the time --- and
that's not the case.

In most production houses that I've seen, if a disk starts reporting
soft errors (which is where the block was ultimately readable, but the
disk had to retry several times), that's generally the cue to replace
the disk.  That's because it's generally the case that the data on the
disk is far more valuable than the cost of replacing the disk (heck, the
cost in people time of having to restore from backup tapes is probably more
than the cost of the disk), and after 2-3 years of hard service in a
fileserver, the disk probably doesn't have much more life in it anyway.
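
The distinction between a soft error and a hard one can be sketched in
Python like this.  The function name, the retry limit of 3 and the
simulated disk are all assumptions made for the example; real drives and
drivers do their retrying at a much lower level.

```python
def read_with_retries(read_once, block, max_retries=3):
    """Read a block, retrying on error.

    Returns (data, retries_used).  A result with retries_used > 0 is a
    soft error: the data came back, but only after retrying.  If every
    attempt fails, the OSError is re-raised as a hard error.
    """
    for attempt in range(max_retries + 1):
        try:
            return read_once(block), attempt
        except OSError:
            if attempt == max_retries:
                raise


attempts = {"count": 0}


def marginal_read(block):
    """Simulated disk that fails twice before the read succeeds."""
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise OSError("simulated read error")
    return b"payload"


data, retries = read_with_retries(marginal_read, block=816884)
if retries > 0:
    print("soft error: read succeeded after", retries, "retries")
```

Logging every read that needed a retry is what gives you the cue to
schedule a replacement before the drive stops answering at all.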

There are exceptions to this rule, of course ---- if you're in a country
like Russia where disks are extremely expensive, then maybe the
cost/benefit ratio changes.  Or if you're a poor student, or if the
server is in a location which is very hard to get to.  However, I don't
buy the "7x24" argument.  If a service is so critical that it has to be
up 7 days a week, 24 hours a day, it's also probably so critical that
unscheduled downtime is far more disastrous than a planned downtime.
Also, if there is a requirement that it be up 7x24, why aren't there
redundant servers (never mind redundant disks in a RAID array)?  

Anyway, we've started straying off topic.  In answer to your question
--- it may be worth doing, but it's too close to 2.4, and I have other
higher priority projects --- and when/if it gets implemented, if you
depend on it too much, IMO you're probably trying to do things on the
cheap, and you WILL regret it someday.  :-)

							- Ted
