Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: raid0 slower than devices it is assembled of?
Original-Message-ID: <Pine.LNX.4.58.0312160825570.1599@home.osdl.org>
Date: Tue, 16 Dec 2003 16:44:41 GMT
Message-ID: <fa.j4ssrdh.1fign8h@ifi.uio.no>

On Tue, 16 Dec 2003, Helge Hafting wrote:
>
> Raid-0 is ideally N times faster than a single disk, when
> you have N disks.

Well, that's a _really_ "ideal" world. Ideal to the point of being
unrealistic.

In most real-world situations, latency is at least as important as
throughput, and often dominates the story. At which point RAID-0 doesn't
improve performance one iota (it might make the seeks shorter, but since
seek latency tends to be dominated by things like rotational delay and
settle times, that's unlikely to be a really noticeable issue).

Latency is noticeable even on what appears to be "pure throughput" tests,
because not only do you seldom get perfect overlap (RAID-0 also increases
your required IO window size by a factor of N to get the N-time
improvement), but even "pure throughput" benchmarks often have small
serialized sections, and Amdahl's law bites you in the ass _really_
quickly.

In fact, Amdahl's law should be revered a hell of a lot more than Moore's
law. One is a conjecture, the other one is simple math.
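
To put rough numbers on it (the 10% serialized fraction in the sketch below
is purely an assumed illustration, not a measurement): Amdahl's law says the
best-case speedup from N-way parallelism is 1 / (s + (1 - s)/N), where s is
the fraction of the work that stays serialized.

  #include <stdio.h>

  /* Amdahl's law: best-case speedup with N-way parallelism when a
   * fraction 's' of the work stays serialized. */
  static double amdahl(double s, int n)
  {
          return 1.0 / (s + (1.0 - s) / n);
  }

  int main(void)
  {
          /* 4-disk RAID0 with an _assumed_ 10% of the work serialized
           * (metadata, CPU, bus): the ceiling is ~3.08x, not 4x. */
          printf("max speedup: %.2fx\n", amdahl(0.10, 4));
          return 0;
  }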

Anyway, the serialized sections can be CPU or bus (quite common at the
point where a single disk can stream 50MB/s when accessed linearly), or it
can be things like fetching meta-data (ie indirect blocks).

> Whether the current drivers manage that is of course another story.

No. Please don't confuse limitations of RAID0 with limitations of "the
current drivers".

Yes, the drivers are a part of the picture, but they are a _small_ part of
a very fundamental issue.

The fact is, modern disks are GOOD at streaming data. They're _really_
good at it compared to just about anything else they ever do. The win you
get from even medium-sized stripes on RAID0 is likely not to be all that
noticeable, and you can definitely lose _big_ just because it tends to
hack your IO patterns to pieces.

My personal guess is that modern RAID0 stripes should be on the order of
several MEGABYTES in size rather than the few hundred kB that most people
use (not to mention the people who have 32kB stripes or smaller - they
just kill their IO access patterns with that, and put the CPU under
ridiculous strain).

Big stripes help because:

 - disks already do big transfers well, so you shouldn't split them up.
   Quite frankly, the kinds of access patterns that let you stream
   multiple streams of 50MB/s and get N-way throughput increases just
   don't exist in the real world outside of some very special niches (DoD
   satellite data backup, or whatever).

 - it makes it more likely that the disks in the array really have
   _independent_ IO patterns, ie if you access multiple files the disks
   may not seek around together, but instead one disk accesses one file.
   At this point RAID0 starts to potentially help _latency_, simply
   because by now it may help avoid physical seeking rather than just try
   to make throughput go up.
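
A quick sketch of the arithmetic behind that second point (plain striping
math, not the md driver's actual code): count how many distinct disks a
single few-hundred-kB file touches for a given chunk size.

  #include <stdio.h>

  /* How many distinct disks does one contiguous range touch, for a
   * given chunk size on an N-disk stripe set? */
  static int disks_touched(unsigned long long start,
                           unsigned long long len,
                           unsigned long long chunk, int ndisks)
  {
          unsigned long long first = start / chunk;
          unsigned long long last = (start + len - 1) / chunk;
          unsigned long long c;
          int used[16] = { 0 }, count = 0;

          for (c = first; c <= last; c++) {
                  int d = (int)(c % ndisks);      /* chunk -> disk */
                  if (!used[d]) {
                          used[d] = 1;
                          count++;
                  }
          }
          return count;
  }

  int main(void)
  {
          unsigned long long kb = 1024;

          /* A 256kB file on a 4-disk stripe set. */
          printf("32kB chunk: %d disks\n",
                 disks_touched(0, 256 * kb, 32 * kb, 4));
          printf("4MB chunk:  %d disk\n",
                 disks_touched(0, 256 * kb, 4096 * kb, 4));
          return 0;
  }

With the 4MB chunk the whole file sits on one spindle, so concurrent readers
can keep different disks busy; with the 32kB chunk every access drags all
four heads along.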

I may be wrong, of course. But I doubt it.

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: raid0 slower than devices it is assembled of?
Original-Message-ID: <Pine.LNX.4.58.0312161304390.1599@home.osdl.org>
Date: Tue, 16 Dec 2003 21:15:55 GMT
Message-ID: <fa.j4cmqtg.1c2qnog@ifi.uio.no>

On Tue, 16 Dec 2003, Mike Fedyk wrote:
>
> On Tue, Dec 16, 2003 at 08:42:52AM -0800, Linus Torvalds wrote:
> > My personal guess is that modern RAID0 stripes should be on the order of
> > several MEGABYTES in size rather than the few hundred kB that most people
> > use (not to mention the people who have 32kB stripes or smaller - they
> > just kill their IO access patterns with that, and put the CPU at
> > ridiculous strain).
>
> Larger stripes may help in general, but I'd suggest that for raid5 (ie, not
> raid0), the stripe size should not be enlarged as much.  On many
> filesystems, a bitmap change, or inode table update shouldn't require
> reading a large stripe from several drives to complete the parity
> calculations.

Oh, absolutely. I only made the argument as it works for RAID0, ie just
striping.  There the only downside of a large stripe is the potential for
a lack of parallelism, but as mentioned, I don't think that downside much
exists with modern disks - the platter density and throughput (once you've
seeked to the right place) are so high that there is no point in trying to
parallelise it at the data transfer point.

The thing you should try to do in parallel is the seeking, not the media
throughput. And then small stripes hurt you, because they will end up
seeking in sync.
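
Some assumed round numbers make the point (illustration only, not
measurements): say ~50MB/s streaming, an ~8ms seek and ~4ms of rotational
delay.

  #include <stdio.h>

  int main(void)
  {
          /* Assumed round numbers, for illustration only: ~50MB/s
           * streaming, ~8ms seek, ~4ms rotational delay. */
          double seek_ms = 8.0, rot_ms = 4.0, mb_per_s = 50.0;
          double xfer_ms = 256.0 / 1024.0 / mb_per_s * 1000.0;

          /* One disk does the whole 256kB itself... */
          printf("one disk:   %.1f ms\n", seek_ms + rot_ms + xfer_ms);
          /* ...vs four disks each doing 64kB, but each paying the
           * full positioning cost. */
          printf("four disks: %.1f ms\n",
                 seek_ms + rot_ms + xfer_ms / 4.0);
          return 0;
  }

Splitting the transfer saves a few milliseconds of media time at best, while
tying up four spindles that could have been seeking to other requests.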

For RAID5, you have different issues since the error correction makes
updates be read-modify-write. At that point there are latency reasons to
make the blocking be small.

			Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: raid0 slower than devices it is assembled of?
Original-Message-ID: <Pine.LNX.4.58.0312170758220.8541@home.osdl.org>
Date: Wed, 17 Dec 2003 16:02:13 GMT
Message-ID: <fa.j6sgqtd.18i4lol@ifi.uio.no>

On Wed, 17 Dec 2003, Peter Zaitsev wrote:
>
> I'm pretty curious about this argument,
>
> In practice, since RAID5 uses XOR for the checksum computation, you do not
> have to read the whole stripe to recompute the checksum.

Ahh, good point. Ignore my argument - large stripes should work well. Mea
culpa, I forgot how simple the parity thing is, and that it is "local".
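
The locality is just the XOR identity; a minimal sketch of the generic RAID5
small-write update (not the md driver's actual code):

  #include <stddef.h>

  /* Generic RAID5 read-modify-write parity update (not the md
   * driver's actual code): P_new = P_old ^ D_old ^ D_new, so only
   * the data block being rewritten and the parity block need to be
   * read - the rest of the stripe is never touched. */
  static void rmw_parity(unsigned char *parity,
                         const unsigned char *old_data,
                         const unsigned char *new_data, size_t len)
  {
          size_t i;

          for (i = 0; i < len; i++)
                  parity[i] ^= old_data[i] ^ new_data[i];
  }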

However, since seeking will be limited by the checksum drive anyway (for
writing), the advantages of large stripes in trying to keep the disks
independent aren't as one-sided.

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: raid0 slower than devices it is assembled of?
Original-Message-ID: <Pine.LNX.4.58.0312171129040.8541@home.osdl.org>
Date: Wed, 17 Dec 2003 19:41:44 GMT
Message-ID: <fa.j6c8q5e.1924k0k@ifi.uio.no>

On Wed, 17 Dec 2003, Jamie Lokier wrote:
>
> If a large fs-level I/O transaction is split into lots of 32k
> transactions by the RAID layer, many of those 32k transactions will be
> contiguous on the disks.

Yes.

> That doesn't mean they're contiguous from the fs point of view, but
> given that all modern hardware does scatter-gather, shouldn't the
> contiguous transactions be merged before being sent to the disk?

Yes, as long as the RAID layer (or lowlevel disk) doesn't try to avoid the
elevator.

BUT (and this is a big but) - apart from wasting a lot of CPU time by
splitting and re-merging, the problem is more fundamental than that.

Let's say that you are striping four disks, with 32kB blocking. Not
an unreasonable setup.

Now, let's say that the contiguous block IO from high up is 256kB in size.
Again, this is not unreasonable, although it is larger than a lot of IO
actually is (it is smaller than _some_ IO patterns, but on the whole I'm
willing to bet that it's in the "high 1%" of the IO done).

Now, we can split that up in 32kB blocks (8 of them), and then merge it
back into 4 64kB blocks sent to disk. We can even avoid a lot of the CPU
overhead by not merging in the first place (and I think we largely do,
actually), and just generate 4 64kB requests in the first place.

But did you notice something?

In one scenario, the disk got a 256kB request; in the other, each disk got a
64kB request.

And guess what? The bigger request is likely to be more efficient.
Normal disks these days have 8MB+ of cache on the disk, and do partial
track buffering etc, and the bigger the requests are, the better.

> It may strain the CPU (splitting and merging lots of requests in a
> different order), but I don't see why it should kill I/O access patterns,
> as they can be as large as if you had large stripes in the first place.

But you _did_ kill the IO access patterns. You started out with a 256kB
IO, and you ended up splitting it in four. You lose.

The thing is, in real life you do NOT have "infinite IO blocks" to start
with. If that were true, splitting it up across the disks wouldn't cost
you anything: infinite divided by four is still infinite. But in real life
you have something that is already of a finite length and a few hundred kB
is "good" in most normal loads - and splitting it in four is a BAD IDEA!

In contrast, imagine that you had a 1MB stripe. Most of the time the 256kB
request wouldn't be split at all, and even in the worst case it would get
split into just 2 requests.
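
The same numbers as a sketch (generic striping arithmetic, not what the
block layer literally does): the pieces that land on one member disk are
contiguous on that disk, so each disk ends up with one merged request of
that size.

  #include <stdio.h>

  /* How much of one contiguous request lands on each member disk? */
  static void split_request(unsigned long long start,
                            unsigned long long len,
                            unsigned long long chunk, int ndisks)
  {
          unsigned long long per_disk[16] = { 0 };
          unsigned long long off = start, end = start + len;
          int d;

          while (off < end) {
                  unsigned long long left = chunk - (off % chunk);
                  unsigned long long n = (end - off < left) ?
                                          end - off : left;

                  per_disk[(off / chunk) % ndisks] += n;
                  off += n;
          }

          printf("%4lukB chunk:\n", (unsigned long)(chunk >> 10));
          for (d = 0; d < ndisks; d++)
                  if (per_disk[d])
                          printf("  disk %d gets a %llukB request\n",
                                 d, per_disk[d] >> 10);
  }

  int main(void)
  {
          /* The 256kB request from above, on a 4-disk RAID0. */
          split_request(0, 256 << 10, 32 << 10, 4);   /* 4 x 64kB  */
          split_request(0, 256 << 10, 1 << 20, 4);    /* 1 x 256kB */
          return 0;
  }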

Yes, there are some loads where you can get largely "infinite" request
sizes. But I'd claim that they are quite rare.

			Linus
