From: Al Viro <viro@ftp.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: minixfs bitmaps and associated lossage
Date: Sat, 06 May 2006 22:05:21 UTC
Message-ID: <fa.c69F4uBszTznjhfA8NmBTguY3Hw@ifi.uio.no>
Original-Message-ID: <20060506220451.GQ27946@ftp.linux.org.uk>

	Warning: text below is a mild example of software coproarchaeology,
so if you are easily squicked by tangled mess of bugs and dumb lossage,
well... you've been warned.

	This particular clusterfsck had begun when AST decided to
store the metadata in host-endian order.  All of it; inode numbers
in directories, block numbers in inodes and indirect blocks, etc.
Ugly as it was, it would be more or less straightforward, if not
for one trap - the bitmaps.

	The rest of metadata had obvious element sizes, so it was hard
to get wrong.  However, for bitmaps it was arbitrary.  And it does
matter - mapping an array of bits to array of big-endian 16bit and
to array of big-endian 32bit gives different results.  We get either
	8-15, 0-7, 24-31, 16-23, ...
or
	24-31, 16-23, 8-15, 0-7, ...
resp.  For little-endian we get the same thing, though.  AST had chosen
to make it an array of 16-bit host-endian.
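
To see where the two orderings above come from, here is a minimal C sketch (not from the original message; byte_of_bit() and layout() are hypothetical helper names).  It assumes that bit nr of the bitmap is stored as 1 << (nr % word_bits) in word number nr / word_bits, and works out which on-disk byte ends up holding each group of eight bits.

```c
/* Byte offset within the bitmap that holds bit number nr, when the
 * bitmap is an array of words of word_bytes bytes each and bit nr is
 * stored as 1 << (nr % word_bits) in word nr / word_bits. */
static unsigned byte_of_bit(unsigned nr, unsigned word_bytes, int big_endian)
{
	unsigned word_bits = word_bytes * 8;
	unsigned word = nr / word_bits;
	unsigned byte_in_word = (nr % word_bits) / 8;	/* 0 = least significant */

	/* Big-endian storage puts the least significant byte last. */
	if (big_endian)
		byte_in_word = word_bytes - 1 - byte_in_word;
	return word * word_bytes + byte_in_word;
}

/* For each of the first nbytes on-disk bytes, record the lowest bit
 * number stored in that byte.  nbytes must be a multiple of word_bytes. */
static void layout(unsigned word_bytes, int big_endian,
		   unsigned nbytes, unsigned *first_bit)
{
	unsigned nr;

	for (nr = 0; nr < nbytes * 8; nr += 8)
		first_bit[byte_of_bit(nr, word_bytes, big_endian)] = nr;
}
```

Under these assumptions, layout(2, 1, 4, out) fills out with 8, 0, 24, 16 and layout(4, 1, 4, out) with 24, 16, 8, 0, while both word sizes give 0, 8, 16, 24 when big_endian is 0.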

	Linux had minixfs support from the very beginning, but it started
on little-endian hosts, so that issue had been happily ignored - le16
or le32, you get the same result.  The second architecture to be merged
also had been little-endian (alpha), so it didn't cause any new problems.
fs/minix/inode.c used clear_bit(), etc. for bitmap access, which assumes
array of unsigned long in host-endian.

	Then it had hit the fan, but nobody cared - sparc merge was 1.1.77,
but I'm not sure if minix even existed on sparc at that point.  And it
sure as hell was not a concern with respect to sharing fs.  Same for mips
merge in 1.1.82 and ppc one in 1.3.45.

	The next one was m68k in 1.3.94.  And there it became serious - m68k
boxen with both minix and Linux on them did exist.  So behaviour of mainline
minixfs was a real problem - it would eat filesystems if it would ever build
and run.  m68k tree had that fixed, though, by providing minix_test_bit()
et al. that did the right thing.  As always with m68k, "fixed" and "cared
to put the fix into mainline" had been rather... loosely coupled events.

	When the fix did go into the mainline (2.1.17), it had created an
interesting situation:
	* i386 and alpha: minix_test_bit() and friends added as wrappers
	* m68k: added, do the right thing.
	* sparc, mips and ppc: helpers absent, won't build with CONFIG_MINIX_FS

The real trouble was that the only non-trivial implementation had not been
documented - not even to say what it does and why it's needed.  So when the
folks on other platforms started to fix the breakage, results had been ugly:
	- sparc: blindly defined as on i386 - i.e. host-endian 32bit (== be32).
Compiles, still broken.
	- mips: defined as on i386, with a bloody misleading comment:
		* FIXME: These assume that Minix uses the native byte/bitorder.
It _does_ use the native byte order.  It's the chunk size that doesn't match
the native word size.  Overall: be32.
	- ppc: perhaps due to the second-hand confusion induced by mips
comment, perhaps independently, ppc went with _little-endian_ 32bit.
	- sparc64: blindly copied as on i386.  That meant yet another variant:
host-endian 64bit (== be64).

After that it got uglier and uglier.  Little-endian architectures were
still all right, but the big-endian ones did everything except the correct
thing.  Some did like ppc and used little-endian bitmaps.  Some did
be32.  Some be64.  The _ONLY_ big-endian that does the right thing is
m68k.  Everything else is using layouts that never would be recognized
by minix - on any platform.  Again, we are paying for the lack of
description of original minix_..._bit() family - and for the original
mess in minix fs layout.

	Minix recognizes two layouts:
	16bit values	32bit values	bitmaps
(1)	01		0123		01234567...
(2)	10		3210		10325476...
	Little-endian architectures on Linux follow the first variant.
m68k follows the second one.  ppc, parisc and big-endian arm and frv do
(3)	10		3210		01234567...
The rest of 32bit big-endian goes with
(4)	10		3210		32107654...
and 64bit big-endian do
(5)	10		3210		76543210...

	In effect, we've got three new layouts, thanks to aforementioned
lossage.  But it gets even funnier: filesystem has to be created, after
all.  And _that_ is not just broken, it's broken differently.  We have:
	little-endian:	layout 1 (correct)
	m68k:		layout 2 (correct)
	everything else:layout 3
Of course, native minix mkfs always creates (1) or (2) and native minix
fsck gets quite unhappy with anything else.  Amusingly enough, debian
util-linux has mkfs.minix and fsck.minix excluded on sparc, so the
problem _was_ noticed and duly papered over.

	Recently all that crap got "regularized" kernel-side.  About the
only effect was the loss of some warnings along the lines of "something's
fishy here".

	So...  What the hell can we do?  Layouts (4) and (5) are clearly
broken and _never_ worked - there's nothing that would manage to create
such filesystem.  So these are obvious candidates for switching - either
to (2) (correct) or to (3) (broken, but at least match util-linux fsck.minix
and mkfs.minix on such platforms).  The question being, what do we do with
(3) (big-endian metadata, little-endian bitmaps) and what do we do with
Linux fsck.minix?  Aside from repeating the mantra, that is ("All Software
Sucks, All Hardware Sucks")...


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: minixfs bitmaps and associated lossage
Date: Sat, 06 May 2006 22:26:58 UTC
Message-ID: <fa.HUEIe1sEka+5KxH8AdIOV9umolI@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605061524420.16343@g5.osdl.org>

On Sat, 6 May 2006, Al Viro wrote:
>
> 	Warning: text below is a mild example of software coproarchaeology,
> so if you are easily squicked by tangled mess of bugs and dumb lossage,
> well... you've been warned.

LOL.

Maybe the right thing to do is to just disable minixfs for anything
big-endian except for m68k.

It's not like it likely matters, and while we could save your description
of the problem as an amusing "how to really f*ck up" episode, I doubt
anybody really _cares_ in this case.

			Linus


From: Al Viro <viro@ftp.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: minixfs bitmaps and associated lossage
Date: Sat, 06 May 2006 23:11:38 UTC
Message-ID: <fa.m9dnYqDHwlvfKniXSJqwI+qMKPM@ifi.uio.no>
Original-Message-ID: <20060506231054.GR27946@ftp.linux.org.uk>

On Sat, May 06, 2006 at 03:26:21PM -0700, Linus Torvalds wrote:
>
>
> On Sat, 6 May 2006, Al Viro wrote:
> >
> > 	Warning: text below is a mild example of software coproarchaeology,
> > so if you are easily squicked by tangled mess of bugs and dumb lossage,
> > well... you've been warned.
>
> LOL.
>
> Maybe the right thing to do is to just disable minixfs for anything
> big-endian except for m68k.
>
> It's not like it likely matters, and while we could save your description
> of the problem as an amusing "how to really f*ck up" episode, I doubt
> anybody really _cares_ in this case.

Well...  There's a minixfs v3 patch floating around, so somebody apparently
cares ;-)

FWIW, the only way to really deal with such structure would be to treat
on-disk values as "fs-endian" and make the conversion to and from
host-endian check the superblock.  That would _really_ consolidate
minix_..._bit() (turning them into __test_bit(nr ^ sbi->mangle, p), etc.)
and would give support of big- and little-endian images for free.
That's what we do e.g. in fs/sysv and it's neither harder nor seriously
bigger than existing code.
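
A rough user-space sketch of that idea, for illustration only: struct sb_info, mangle_for_be_words(), fs_test_bit() and fs_set_bit() are hypothetical names, and the bitmap is addressed byte by byte in little-endian bit order so that the sketch does not depend on the host.  (The kernel's __test_bit() works on host-endian unsigned long, so the actual mangle values there would also depend on the host.)

```c
/* Hypothetical per-superblock state; the message calls it sbi->mangle. */
struct sb_info {
	unsigned mangle;	/* XORed into every bit number */
};

/* Assumed mangle value for a bitmap stored as big-endian words of
 * word_bytes bytes, relative to byte-addressed little-endian numbering:
 * the byte index within each word is reversed, which is an XOR of the
 * bit number with (word_bytes - 1) * 8.  A little-endian image uses 0. */
static unsigned mangle_for_be_words(unsigned word_bytes)
{
	return (word_bytes - 1) * 8;
}

static int fs_test_bit(const struct sb_info *sbi, unsigned nr,
		       const unsigned char *map)
{
	nr ^= sbi->mangle;
	return (map[nr / 8] >> (nr % 8)) & 1;
}

static void fs_set_bit(const struct sb_info *sbi, unsigned nr,
		       unsigned char *map)
{
	nr ^= sbi->mangle;
	map[nr / 8] |= (unsigned char)(1u << (nr % 8));
}
```

With mangle set once at mount time from whatever the superblock says, the same two helpers would serve little-endian images (mangle 0) and 16-bit big-endian ones (mangle 8) without any per-architecture variants.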

Whether we care to do that is a separate question, of course, and I certainly
agree that not a lot of people care about the damn thing these days, no
matter which architecture it is.

If somebody wants to play with that code, they could just merge fs/minix
into fs/sysv - that might very well turn out to be the right thing and
a fun exercise.  Codebases are very close - minixfs is a derivative of
v7 filesystem, after all, and our fs/minix and fs/sysv had been kept
mostly in sync.  Might merge minix v3 into that while we are at it...
If there are any takers for that kind of work, go ahead and if you run
into problems - feel free to ask on fsdevel or l-k.  I promise to review
and comment, but I'm not signing up for doing the entire thing myself.

If nobody picks that up, marking it broken on affected platforms is
probably the best solution.  The only problem here is that we don't
have a uniform way to say "it's little-endian" in Kconfig, but that's
something we ought to do anyway - too many places have things like
(BROKEN || !(SPARC || PPC || PARISC || M68K || FRV))
in Kconfig dependencies.


From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: minixfs bitmaps and associated lossage
Date: Sat, 06 May 2006 23:43:51 UTC
Message-ID: <fa.QMBpxmhDlhyy5FsEPLF4QJi/fsA@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.64.0605061633020.16343@g5.osdl.org>

On Sun, 7 May 2006, Al Viro wrote:
>
> FWIW, the only way to really deal with such structure would be to treat
> on-disk values as "fs-endian" and make the conversion to and from
> host-endian check the superblock.  That would _really_ consolidate
> minix_..._bit() (turning them into __test_bit(nr ^ sbi->mangle, p), etc.)

Yeah, especially for bitmaps, it really _should_ be pretty simple, since
it's literally a bitwise xor of the bit number. It's actually worse for
things that truly have byte order dependencies where the values span bytes
and need re-ordering. For bits, that obviously will never be the case.
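
For contrast, a sketch of what the multi-byte case needs (hypothetical names throughout; enum fs_bytesex stands in for a flag read from the superblock, and this is not the kernel's actual helper API).  Here the bytes themselves have to be reassembled in a different order, so no single xor on an index will do:

```c
#include <stdint.h>

/* Hypothetical byte order flag, decided once from the superblock. */
enum fs_bytesex { FS_LE, FS_BE };

/* Read a 16-bit on-disk value stored in the filesystem's byte order. */
static uint16_t read_fs16(enum fs_bytesex sex, const unsigned char *p)
{
	if (sex == FS_BE)
		return (uint16_t)((p[0] << 8) | p[1]);
	return (uint16_t)((p[1] << 8) | p[0]);
}

/* Read a 32-bit on-disk value stored in the filesystem's byte order. */
static uint32_t read_fs32(enum fs_bytesex sex, const unsigned char *p)
{
	if (sex == FS_BE)
		return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
		       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
	return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[1] << 8) | (uint32_t)p[0];
}
```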

> If somebody wants to play with that code, they could just merge fs/minix
> into fs/sysv - that might very well turn out to be the right thing and
> a fun exercise.  Codebases are very close - minixfs is a derivative of
> v7 filesystem, after all, and our fs/minix and fs/sysv had been kept
> mostly in sync.

Heh. Yes. The physical filesystem layout of minix is close to the old sysv
one, and the implementation ends up being pretty closely related too,
although the genealogy there is the other way around.

However, I thought the direct sysv descendants used linked lists of
free-block lists, not bitmaps? So while a lot of the _other_ part of the
filesystem layout is similar, the actual free-block handling is very
different. No?

So there are things that are very similar (directory layout, inode
format), and could probably be shared, while other things (free block and
inode handling) are fundamentally different, no?

			Linus


From: Al Viro <viro@ftp.linux.org.uk>
Newsgroups: fa.linux.kernel
Subject: Re: minixfs bitmaps and associated lossage
Date: Sun, 07 May 2006 07:37:40 UTC
Message-ID: <fa.xnYYAs+rhfmzicRlWhfWY/BGlSs@ifi.uio.no>
Original-Message-ID: <20060507073708.GW27946@ftp.linux.org.uk>

On Sat, May 06, 2006 at 04:42:27PM -0700, Linus Torvalds wrote:
> > If somebody wants to play with that code, they could just merge fs/minix
> > into fs/sysv - that might very well turn out to be the right thing and
> > a fun exercise.  Codebases are very close - minixfs is a derivative of
> > v7 filesystem, after all, and our fs/minix and fs/sysv had been kept
> > mostly in sync.
>
> Heh. Yes. The physical filesystem layout of minix is close to the old sysv
> one, and the implementation ends up being pretty closely related too,
> although the genealogy there is the other way around.

Actually, some things (e.g. indirect block tree handling and directory
handling via pagecache) went the other way - from fs/sysv to fs/minix.

> However, I thought the direct sysv descendants used linked lists of
> free-block lists, not bitmaps? So while a lot of the _other_ part of the
> filesystem layout is similar, the actual free-block handling is very
> different. No?

Yes and no - keep in mind that details of those lists are different for
various sysvfs flavours, so sysv_new_block() et al. check sbi->s_type
anyway.  And the entry points into [ib]alloc are parallel, so it's not
hard to merge transparently for the rest of code.

Superblock layouts are very different, obviously, but they are just as
different among sysv flavours.  Again, no big deal...

BTW, there's a sysv flavour that uses bitmaps (EAFS); we only do it
read-only, so that's not an issue with the current fs/sysv code.

Again, what I'm saying is that figuring out the details of doing it in a
clean way would make a good exercise, not that we can't live without that.
