From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] libata-sff: Don't call bmdma_stop on non DMA capable
Date: Thu, 25 Jan 2007 17:57:17 UTC
Message-ID: <fa.ZDD+yRo2UpZF+nfOBWEbu9GLpuQ@ifi.uio.no>

On Thu, 25 Jan 2007, Alan wrote:
>
> If you want to put your bmdma address at zero then libata-sff won't help
> you at the moment, and assumes zero is a safe "not in use" value because
> the PCI layer also takes a similar view in places.

Indeed. Zero means "not in use". It really is that simple.

And I'm sorry, David, but 99% of the world _is_ a PC, and that is where
PCI and ATA came from, so anybody who thinks that zero is a valid PCI
address is just sadly mistaken and has a bug. In fact, even on a hardware
level, a lot of devices will literally have "zero means disabled".

Broken architectures that put PCI things at some "PCI physical address
zero" need to map their PCI addresses to something else. It's part of why
we have the whole infrastructure for doing things like

	pcibios_bus_to_resource()
	pcibios_resource_to_bus()

etc - not just because a resource having zero in "start" means that it is
disabled, but because normally such broken setups have *other* problems
too (ie they don't have a 1:1 mapping between PCI addresses and physical
addresses _anyway_, so they need address translations).

This has come up before. For example: for an IRQ, 0 means "does not
exist", it does _not_ mean "physical irq 0", and we test for whether a
device has a valid irq by doing "if (dev->irq)" rather than having some
insane architecture-specific "IRQ_NONE". And if you really do have an
irq at the hardware level that is zero, then that just means that the irq
numbers you tell the kernel should be translated some way.

(On a PC, hardware irq 0 is a real irq too, but it's a _special_ irq, and
it is set up by architecture-specific code. So as far as the generic
kernel and all devices are concerned, "!dev->irq" means that the irq
doesn't exist or hasn't been mapped for that device yet).
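(A rough sketch of both halves of that convention, in plain C. The names here are toy stand-ins, not the real kernel structures: the architecture side translates hardware irq numbers so that a real hardware irq 0 never reaches a driver as the value 0, and the driver side just tests the field as a boolean.)

```c
#include <assert.h>

struct toy_dev {
	unsigned int irq;	/* 0 means "no irq assigned" */
};

/*
 * Arch-side sketch: hardware irq numbers are offset so that a real
 * hardware irq 0 shows up to drivers as a nonzero kernel irq number.
 * The offset of 16 is arbitrary; real mappings vary per architecture.
 */
static unsigned int toy_hwirq_to_kernel(unsigned int hwirq)
{
	return hwirq + 16;
}

/* Driver-side sketch: no special NO_IRQ constant, just a boolean test. */
static int toy_probe(struct toy_dev *dev)
{
	if (!dev->irq)
		return -1;	/* a real driver would return -ENODEV */
	return 0;
}
```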

So there are three issues:

 - we always need a way of saying "not mapped"/"nonexistent", and using
   the value zero is the one that GIVES THE CLEANEST SOURCE CODE! It's why
   NULL pointers are zero too. Sure, virtual address zero is a real
   virtual address, but that doesn't change the fact that C made 0 be
   special on a language level. If you want to access that virtual
   address, and use NULL as a "doesn't exist" at the same time, you need
   to swizzle your pointer somehow.

 - the x86[-64] architecture is the one that gets tested the most. So
   everybody else should try to look like it, rather than say "we are
   different/better/strange". Don't have any special IO instructions?
   Tough. You'd better do "inb/outb" anyway, and map them onto your
   memory-mapped IO somehow.

 - per-architecture magic values are a bad idea. If you need a magic value
   (and things like this _do_ need one, unless we always want to carry a
   separate flag around saying "valid" or "not valid"), it's a lot better
   to just say "everybody uses this value" and then have the _small_
   architecture-specific code work around it, than have everybody have to
   work around a lot of architectures doing things differently.
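(The "do inb/outb anyway" point can be sketched like this. This is a toy model, not any real architecture's code: port space is just a window of memory, and the port accessors index into it. A real implementation would point the base at an ioremap()ed bus window rather than a local array.)

```c
#include <stdint.h>

/*
 * Toy model of an architecture without IO instructions: "port space"
 * is a 64k memory window, and inb/outb are just loads and stores at
 * an offset into it.
 */
static volatile uint8_t toy_io_space[65536];
static volatile uint8_t *toy_io_base = toy_io_space;

static inline uint8_t toy_inb(unsigned long port)
{
	return toy_io_base[port];
}

static inline void toy_outb(uint8_t value, unsigned long port)
{
	toy_io_base[port] = value;
}
```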

All these issues boil down to the same thing: whenever at all physically
possible, we make sure that everything looks as much like a PC as possible.
Only when that literally isn't an option do we add other abstractions.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] libata-sff: Don't call bmdma_stop on non DMA capable
Date: Fri, 26 Jan 2007 01:29:12 UTC
Message-ID: <fa.5q6A/rH5fTOl+v0wn4vitWDiGsc@ifi.uio.no>

On Fri, 26 Jan 2007, David Woodhouse wrote:
>
> You're thinking of MMIO, while the case we were discussing was PIO. My
> laptop is perfectly happy to assign PIO resources from zero.

I was indeed thinking MMIO, but I really think it should extend to PIO
also. It certainly is (again) true on PC's, where the low IO space is
special and reserved for motherboard/system devices.

If you want to be different, that's YOUR problem. Some architectures have
tried to look different, and then drivers break on them, but they have
only themselves to blame.

> I believe PCMCIA just uses the generic resource code, which also seems
> to lack any knowledge of this hackish special case for zero.

The resource code actually knows what "enabled" means. A lot of other code
does not.

> > kernel and all devices are concerned, "!dev->irq" means that the irq
> > doesn't exist or hasn't been mapped for that device yet).
>
> So again you end up in a situation where zero is a strange special case.

No. We end up in a situation where *drivers* never have any strange or
special cases.

You need to have a "no irq" thing. It might as well be zero, since that is
not just the de-facto standard, it's also the one and only value that
leads to easily readable source-code (ie test it as a boolean in C).

The exact same thing has been true of MMIO. I would be not at all
surprised if several drivers do the same for PIO.

It's something you can trust on a PC. See above on your problems if you
decide that you want to be "generic" and use a value that is illegal on
99% of all hardware.

> It doesn't need to be per-architecture; it can just be -1.

Bollocks. People tried that. People tried to force this idiotic notion of
"NO_IRQ" down my throat for several years. I even accepted it.

And then, after several years, when it was clear that it still didn't
work, and drivers just weren't getting updated, it was time to just face
reality: if the choice is between 0 and -1, 0 is simply much easier for
the bulk of the code.

Live with it, or don't. I really don't care what you do on your hardware.
But if you can't face that

	if (!dev->irq)
		..

is simpler for people to write, and that it's what we've done for a long
time, then that really is YOUR problem.

The exact same issues have been true in MMIO. Some code will keep track of
separate "enabled" bits: the resource management code is such code. Guess
what? Not a lot of drivers tend to do that. You can try to fight
windmills, or you can just accept that the very language we use (namely,
C) has made 0 be special, and tends to be used to say "nobody home" simply
because it has that special meaning for a C compiler.

And I bet there are PIO devices out there that consider address zero to be
disabled. For EXACTLY the same reason.

(And yes, hardware actually tends to do the same thing. For PCI irq
routing registers, an irq value of 0 pretty much universally means
"disabled". In fact, even your lovely Cardbus example actually is an
example of exactly this: the very IO limit registers are DEFINED IN
HARDWARE to special-case address zero - so that making the base/limit
registers be zero actually disables the IO window, rather than making it
mean "four IO bytes at address zero").
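(A sketch of that hardware decode rule, simplified: a base/limit pair of zero disables the window outright instead of meaning "four IO bytes at address zero". Real bridge registers add alignment and reserved-bit rules on top of this.)

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified IO window decode: base == limit == 0 means the window is
 * disabled in hardware, not "a window at address zero".
 */
static bool toy_io_window_claims(uint32_t base, uint32_t limit,
				 uint32_t addr)
{
	if (base == 0 && limit == 0)
		return false;	/* window disabled */
	return addr >= base && addr <= limit;
}
```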

But hey, if it works for you, go wild. Just don't expect drivers to always
work.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] libata-sff: Don't call bmdma_stop on non DMA capable
Date: Fri, 26 Jan 2007 02:59:34 UTC
Message-ID: <fa.gS/t8vRbaxD9LOLZMliMIQdyY+Q@ifi.uio.no>

On Fri, 26 Jan 2007, David Woodhouse wrote:
>
> The quality of our drivers is low; I'm fully aware that trying to
> improve driver quality is a quixotic task.

This is an important point. I actually hold things like the kernel VM
layer to _much_ higher standards than I would ever hold a driver writer.
That said, we tend to have "0 is special" even there, although we tend to
try to make it always be about a pointer for those cases.

But drivers really are not just the bulk of the code, but they also can't
be tested nearly as well. So for that reason (and really _only_ for that
reason), we should always just accept that a "driver" should never have to
worry, and *all* of the inconvenience of some special case or whatever
must be solidly elsewhere.

> But where do we draw the line? Should we abandon the dma-mapping stuff
> too? Declare that page zero is a special case and you can't DMA to it?
> Should we try to make every PCI write also do a read in order to flush
> posted writes, because people can't cope with the real world?

I think we should try to aim for a few things:

 - things that "just work" on a PC should generally "just work" everywhere
   else. Just for the simple reason that most drivers never get tested
   anywhere else.

 - if it can't "just work", we should have as many static checks as
   possible, and not let it compile.

 - and if it does compile, the stuff that "looks simple" should just work.

For example, the reason "0" is special really _is_ about the compiler. For
any integer (or pointer, or FP) type in C, there really is just one
special value: 0 (or NULL, for pointers).

The reason why resources work fine with zeroes is that for a *resource*,
zero isn't actually a special value at all. Why? A resource isn't an
integer value. It's a struct.

So compiler type safety (or lack thereof) really ends up forcing some of
these issues. If something is represented as an integral type, the only
*obvious* real special case is always going to be 0, just because it's the
only one you can "test for" implicitly. Similarly, if you return a
pointer, you only have NULL that can sanely be used as a special value.

(Yeah, mmap() has taught some people about MAP_FAILED, but that's pretty
unusual too. And in the VFS layer, we use the magic "error pointers" that
actually encode error values in the bits too - but then, VFS people tend
to have to know a lot more about the kernel than a driver writer should
have to - VFS people are just held to higher standards).
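(Those VFS "error pointers" work roughly like this. This is a userspace re-sketch of the idea behind the kernel's ERR_PTR/IS_ERR helpers, not the kernel code itself: the top 4095 values of the address space can never be valid pointers, so negative errno values can be smuggled through a pointer return.)

```c
#include <stdbool.h>

#define TOY_MAX_ERRNO	4095

/* Encode a negative errno value as a pointer. */
static inline void *toy_err_ptr(long error)
{
	return (void *)error;
}

/* Recover the errno value from an error pointer. */
static inline long toy_ptr_err(const void *ptr)
{
	return (long)ptr;
}

/* Addresses in the top TOY_MAX_ERRNO bytes are never real pointers. */
static inline bool toy_is_err(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-TOY_MAX_ERRNO;
}
```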

With a pointer, NULL is usually ok anyway, and always tends to mean
something special - and for the other stuff you can then hide things
"behind" the pointer. So with a pointer, you could often have "ptr->valid"
or something else. With an integer, you can't really do that, because 0
always remains special, just because EVEN JUST BY MISTAKE the test of the
integer actually just ends up making zero be special.

So if you want a separate "enable" value, it almost has to be a structure
or some opaque type, because that's the only type where there isn't an
implicit special value.

We've done it occasionally. The VM has done it, for example. And we do it
in drivers for more complex cases (and resources is one such case: it's
already not a single value, since it has a start and a length, and other
structure).

We could have done it for interrupts too. A "struct irqnum" that has a bit
that specifies "valid". That would work. But it tends to be painful, so it
really has to give you something more than "zero is disabled".
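(What that hypothetical "struct irqnum" would look like, as a toy sketch: the explicit valid bit means a hardware irq 0 could be carried around safely, but every single user now pays for the extra field and loses the implicit boolean test.)

```c
#include <stdbool.h>

/* Hypothetical irq handle with an explicit valid bit. */
struct toy_irqnum {
	unsigned int num;
	bool valid;
};

static int toy_probe_with_struct(struct toy_irqnum irq)
{
	if (!irq.valid)		/* no "if (!irq)" shortcut anymore */
		return -1;	/* a real driver would return -ENODEV */
	return 0;
}
```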

It's just not worth it.

And it's why I decreed, that the ONLY SANE THING is to just let people do
the obvious thing:

	if (!dev->irq)
		return -ENODEV;

you don't have to know ANYTHING, and that code just works, and just looks
obvious. And you know what? If it causes a bit of pain for some platform
maintainer, I don't care one whit. Because it's obviously much better than
the alternatives.

We may not "need" that rule for IO ports. If IO port 0 "happens to work",
then hey, fine. But on the other hand, _all_ the same arguments really
end up being still 100% true. If some driver happens to write

	if (!dev->ioport)
		return -ENODEV;

then I say: go for it. The _driver_ is correct. It's the obvious thing to
do. It works on PC's, and it's simple and looks fine. Why not? If it
causes some minor heartburn for an architecture maintainer, so what?
Really? The tradeoff is obvious, and the architecture maintainer needs to
just go and fix his IO mappings so that no device ever sees a "valid" port
at port 0.

But in the meantime: if nobody complains, and it happens to work on
hardware even though some devices _can_ see a port of zero, I also don't
care. So I'm certainly not going to claim that your laptop "must be
fixed". If it works, it works. Hey fine.

But the first driver that doesn't work because it thought it didn't have
an IO port (because it was zero), and the first time somebody complains, I
know *exactly* whose fault it is. It's not the driver.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] libata-sff: Don't call bmdma_stop on non DMA capable
Date: Fri, 26 Jan 2007 04:01:05 UTC
Message-ID: <fa.nLhVqdOQtMF/s02f+MMsM90cfyc@ifi.uio.no>

On Fri, 26 Jan 2007, David Woodhouse wrote:
>
> It's not just "my laptop", I believe. It's the generic resource code,
> which is happy to assign address zero since it's never been taught that
> zero is now a special case. If we're not going to ask for the bug I
> observed to be fixed -- if we're going to declare that driver authors
> don't have to sober up and clean up their code -- then the resource code
> should be modified accordingly.

The resource code really is totally agnostic, and you're barking up the
wrong tree there. In many ways, the resource code isn't even about "IO
resources", it could do other things too.

[ In practice, of course, IO resources is all it does and what it was
  designed for, since there really aren't a lot of hierarchical things
  that need to be able to nest and handle byte-range-like things. ]

It's really up to the architecture-specific PCI initialization what the
PCI resources look like. The resource code just takes whatever resource
layout it is given. Yes, there's a "root" ioport_resource, but that's just
the container for the whole PCI resource tree, and generally you'd show
the different PCI domains exposed with their buses in that tree.

Of course, for all the historical reasons (a single domain, and it was
written for a PC), on PC's, the root PCI bus just points directly to the
root io port resource. But the way things work is that your cardbus card
doesn't just allocate space from that "ioport_resource" itself. No, it
allocates space from the cardbus controller resources, which in turn have
allocated space from the PCI bridge controller resources, etc etc all the
way up to whatever is the PCI root resource.
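(The nesting described above can be sketched with a toy structure; this is not the kernel's struct resource, just the shape of the hierarchy: a device's range has to fit inside its controller's resource, which fits inside the bridge's, all the way up to the root.)

```c
#include <stddef.h>

/* Toy shape of the resource hierarchy (not the real kernel struct). */
struct toy_resource {
	unsigned long start, end;
	struct toy_resource *parent;
};

/* Root port space, a PCI bridge window inside it, cardbus inside that. */
static struct toy_resource toy_ioport_root = { 0x0000, 0xffff, NULL };
static struct toy_resource toy_bridge  = { 0x4000, 0x7fff, &toy_ioport_root };
static struct toy_resource toy_cardbus = { 0x4000, 0x40ff, &toy_bridge };

/* A range is valid only if it fits every ancestor up to the root. */
static int toy_range_valid(struct toy_resource *parent,
			   unsigned long start, unsigned long end)
{
	for (; parent; parent = parent->parent)
		if (start < parent->start || end > parent->end)
			return 0;
	return 1;
}
```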

There *are* drivers that use the "ioport_resource" directly, but they are
system devices (where "ISA" counts as a system device - augh: it's not
enumerable or discoverable) which know where they go. But a normal driver
never does in any modern world.

So the way to make sure that PCI devices get allocated in the proper area
is not to change the resource manager, but to make sure that the
architecture initializes the root bridges for all the domains properly.

(A lot of them do the "PC thing", of course: they just make the ioport
resource the direct parent of the root bridge, and that's ok if the root
really _is_ supposed to cover everything from zero. On a PC, that's
actually the right thing to do, because the system devices will insert
themselves into the low area, and then PCIBIOS_MIN_IO - 0x1000 on a PC -
is used as the minimum for any *dynamic* allocation.)

PCI PIO/IOMEM resource allocation is actually fairly complicated, and most
people really *really* never need to care. It should be considered a sign
of how well the resource code works that it all usually works without most
people ever really needing to understand it.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] libata-sff: Don't call bmdma_stop on non DMA capable
Date: Fri, 26 Jan 2007 04:48:45 UTC
Message-ID: <fa.iojOGGlC1u5q92O5t5UM9B5cguU@ifi.uio.no>

On Fri, 26 Jan 2007, David Woodhouse wrote:
>
> Of course it could, but then again why shouldn't we special-case zero in
> _all_ of those use cases, just to make life easier for those poor driver
> authors who presumably can't manage to write userspace code using stdin
> or open() either.

You're not thinking, or, more likely YOU DON'T WANT TO.

What is so hard to understand? IRQ0 is a _real_irq_ as far as the
*platform* goes even on PC's. Same goes for IO-port 0 and MMIO area zero.

Same goes for virtual address zero in a lot of user programs. It's a real
virtual address. The fact that the NULL pointer points there doesn't make
it less real.

Do you really have such a hard time understanding this fundamental issue?

Irq0 may _exist_. IO Port 0 may _exist_. Virtual address 0 may _exist_.

Got it?

But they ARE NOT VALID THINGS FOR DRIVERS TO WORRY ABOUT.

When a *DRIVER* sees a NULL pointer, it's always a sign of "not here".

It's not a "valid pointer that just happens to be at virtual address
zero". But in other contexts it actually _can_ be exactly that (ie when
you call "mmap(NULL .. MAP_FIXED)" you actually will get a NULL pointer
that is a *REAL POINTER*).

Similarly, when a *DRIVER* sees a "port 0" or MMIO 0, it is perfectly
valid for that driver to say "ok, my device apparently doesn't have an IO
port".

That does not mean that "port 0" doesn't exist. It just means that AS FAR
AS A DRIVER IS CONCERNED, port 0 is not a valid port for a driver!

Port 0 actually *does* exist on a PC. It's a system port (it also happens
to be a total legacy port that nobody would ever use, but that's another
issue entirely).

The same thing is true of irq 0. It exists. It's a valid IRQ for
architecture code for a PC. It's just NOT A VALID IRQ FOR A DRIVER! So
when a driver sees a device with !irq, it knows that the irq hasn't been
allocated.

I don't understand why this is so hard for you to accept. Even when you
yourself accept that irq0 actually *exists*, and even when you give as an
example why "setup_irq()" must be able to take that irq, you give that as
some kind of ass-hat example of why you are "right". Now you do exactly
the same thing for the IO port space.

You're totally confused. You say the words, and you seem to understand
that device drivers are special. But then you don't seem to follow that
understanding through, and you want to then say that everything else is
special too.

Don't you get it? If everybody is special, then nobody is special.

System code is special. System code can do things that drivers shouldn't
do. System code can know that irq0 is actually the timer, and can set it
up. System code can know that IO port 0 is actually decoded by the old and
insane AT architecture DMA controller.

This is not even kernel-specific. "normal programs" think that NULL is
special, and is never a valid pointer. But "system programs" may actually
know better. If you're programming in DOS-EMU, and use vm86 mode, you know
that you actually need to map the virtual address that is at 0, and NULL
is suddenly not actually an invalid pointer: it's literally a pointer to
the first byte in the vm86 model.

But when "malloc()" returns NULL, you know that it doesn't return such a
"system pointer", so when malloc returns NULL, you realize that it's a
special value.

The *EXACT* same thing is true within the kernel. When x86 architecture
code explicitly sets up IRQ0, it does so because it knows that it's the
timer interrupt. But that doesn't make it a valid irq for *anybody* else.

Ok, enough shouting.

Comprende? Do you _really_ think that the NULL pointer "doesn't exist"? Or
can you realize that it's generally just a convention, and it's a
convention that has been selected to be maximally easy to test for (both
on a code generation level and on a C syntax level)? It doesn't mean that
virtual address 0 "doesn't exist", and could not be mapped.

The exact same thing is true of "IO port 0". It's the maximally simple
_convention_ for something that may actually exist, but it's something that
NO NORMAL USER SHOULD EVER SEE AS A REAL IO PORT. There are special users
that may use it, exactly the same way special users who know deep hardware
details may decide that "on this architecture, the NULL pointer actually
_literally_ means virtual address zero, and when I do *xyz* I actually can
access things through it".

Does the fact that some things can use NULL as meaning something else than
"no pointer" invalidate NULL as a concept? No. It just means that those
things are very architecture-specific. They're not "common code" aka
"drivers" any more.

Same exact deal.

		Linus



From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] libata-sff: Don't call bmdma_stop on non DMA capable
Date: Fri, 26 Jan 2007 06:01:57 UTC
Message-ID: <fa.+2kF1wS+eYu6Xg42Tv2N3Fd/FLs@ifi.uio.no>

On Fri, 26 Jan 2007, David Woodhouse wrote:
>
> My question was about _how_ you think this should be achieved in this
> particular case.

I already told you.

Have you noticed that the resource-allocation code _already_ never returns
zero on a PC?

Have you noticed that the resource-allocation code _already_ never returns
zero on sparc64?

Your special-case hack would actually be actively *wrong*, because the
resource-allocation code already does this correctly.

But by "correctly" I very much mean "the architecture has to tell it what
the allocation rules are".

And different architectures will have different rules.

On x86[-64], the reason the allocation code never returns a zero is
because of several factors:

 - PCIBIOS_MIN_IO is 0x1000
 - PCIBIOS_MIN_CARDBUS_IO is 0x1000
 - the x86 arch startup code has already reserved all the special ranges,
   including IO port 0.

so on a PC, no dynamic resource will _ever_ be allocated at zero simply
because there are several rules in place (that the resource allocation
code honors) that just makes sure it doesn't.
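(The minimum-IO rule amounts to a simple clamp on dynamic allocations; a toy version, with the x86 value:)

```c
/*
 * Toy version of the rule above: dynamic IO allocations are clamped to
 * an arch-provided minimum (0x1000 on x86), so a dynamic resource can
 * never land at port 0 even if the hint asks for it.
 */
#define TOY_PCIBIOS_MIN_IO	0x1000

static unsigned long toy_alloc_io_start(unsigned long hint)
{
	return hint < TOY_PCIBIOS_MIN_IO ? TOY_PCIBIOS_MIN_IO : hint;
}
```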

Now, the reason I say that it needs the architecture-specific code to tell
it those rules is that on other architectures, the reasons for not
returning zero may simply be *different*.

For example, on other platforms, as I already tried to explain, the root
PCI bus itself may not even start at 0 at all. You may have 16 bits worth
of "PIO space", but for all you know, you may have that for 8 different
PCI domains, and for all the kernel cares about, the domains may end up
mapping their PCI IO ports in any random way. For example, what might
"physically" be port 0 on domain0 might be mapped into the IO port
resource tree at 0xff000000, and "physical port 0" in domain1 might be at
0xff010000 - so that the eight different PCI domains really end up having
8 separate ranges of 64k ports each, covering 0xff000000-0xff07ffff in the
IO resource map.
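(Worked out, that hypothetical 8-domain layout is just an offset calculation; a toy version:)

```c
#include <stdint.h>

/*
 * The hypothetical layout above: each PCI domain gets its own 64k
 * slice of the IO resource map, starting at 0xff000000.
 */
static uint64_t toy_domain_port_to_resource(unsigned int domain,
					    uint16_t port)
{
	return 0xff000000ULL + (uint64_t)domain * 0x10000 + port;
}
```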

See? I tried to explain this to you already..

If you are a cardbus device, and you get plugged in, your IO port range
DOES NOT GET allocated from "ioport_resource". No, no, no. It gets
allocated from the IO resource of the cardbus controller. And _that_ IO
resource was allocated within the resource that was the PCI bridge for the
PCI bus that the cardbus controller was on. And so on, all the way up to
the root bridge on that PCI domain.

And on x86, the root bridge will use the ioport_resource directly, since
it will always cover the whole area from 0 - IO_SPACE_LIMIT.

But that's an *architecture-specific* choice. It makes sense on x86,
because that's how the IO ports physically work. They're literally tied to
the CPU, and the CPU ends up being in that sense the "root" of the IO port
resources.

But on a lot of non-x86 architectures, it actually could make most sense
to never use "ioport_resource" AT ALL (in which case "cat /proc/ioports"
will always be empty), and instead make the root PCI controller have its
IORESOURCE_IO resource be a resource that is then mapped into
"iomem_resource". Because that's _physically_ how most non-x86
architectures are literally wired up.

See? Nobody happens to do that (probably because a number of them want to
actually emulate ISA accesses too, and then you actually want to make the
PIO accesses look like a different address space, and probably because
others just copied that, and never did anything different), but the point
is, the resource allocation code really doesn't care. And it _shouldn't_
care. Because the resource allocation code tries to be generic and cater
to all these *different* ways people hook up hardware.

Now, I should finish this by saying that there's a number of legacy
issues, like the fact that "ioport_resource" and "iomem_resource" actually
even *exist* for all platforms. As mentioned, in some cases, it would
probably actually make more sense to not even have "ioport_resource" at
all, except perhaps as a sub-resource of "iomem_resource" (*).

So a lot of this ends up then being slightly set up in a certain way - or at
least only tested in certain configurations - due to various historical
reasons.

For example, we've never needed to abstract out the IO port address
translations as much as we did for MMIO. MMIO has always had
"remap_io_range()" and other translator functions, because even x86 needed
them. The PIO resources have generally needed less indirection, not only
because x86 can't even use them *anyway* (for the "iomap" interfaces we
actually do the mapping totally in software on x86, because the hardware
cannot do it), but also because quite often, PIO simply isn't used as
much, and thus we've often ended up having ugly hacks instead of any real
"IO port abstraction".

For an example of this, look at the IDE driver. It has a lot of crap to
just allow other architectures to just use their own MMIO accessors
instead of even trying to abstract PIO some way. So the PIO layer actually
lacks some of the abstraction, simply because it's not used very much.

		Linus

(*) It should actually be possible to really just let an architecture
insert the "ioport_resource" as a sub-resource of the "iomem_resource". It
would be a bit strange, with "cat /proc/iomem" literally as part of it
showing all the same things that "cat /proc/ioports" shows, but it would
really be technically *correct* in many ways. BUT! I won't guarantee that
it works, simply because I've never tried it. There might be some reason
why something like that would break.


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [PATCH] libata-sff: Don't call bmdma_stop on non DMA capable
Date: Fri, 26 Jan 2007 06:19:06 UTC
Message-ID: <fa.6pUDBNzzrO8psqwGoI/yP/mBuxg@ifi.uio.no>

On Thu, 25 Jan 2007, Linus Torvalds wrote:
>
> Have you noticed that the resource-allocation code _already_ never returns
> zero on sparc64?

Btw, that was a rhetorical question, and I'm not actually sure what the
heck sparc64 will _really_ do ;)

I picked sparc64 as an example, because I _think_ that sparc64 actually is
an example of an architecture that sets up a separate root resource for
each PCI domain, and they are actually set up so that the ioport regions
are literally offset to match the hardware bases (and there are several
different kinds of PCI domain controllers that sparc supports, so those
bases will depend on that too).

So on sparc64, "ioport_resource" really is just a container for the actual
per-domain resource buckets that the hardware (within that domain) will
then do the resource allocation from. Afaik.

But you should actually verify that with somebody like Davem if you
_really_ care.  I cc'd him in case he wants to pipe up and perhaps prove
me wrong.


			Linus
