Device numbers (H. Peter Anvin; Linus Torvalds; Theodore Ts'o; Al Viro)

Date: 	Mon, 14 May 2001 12:19:34 -0700
From: "H. Peter Anvin" <hpa@transmeta.com>
Subject: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

First of all, I apologize for not having sent this notice out sooner. 
This kind of writing is very painful to deal with.

Linus Torvalds has requested a moratorium on new device number
assignments. His hope is that a new and better method for device space
handling will emerge as a result.

Alan Cox has requested that I maintain a forked registry for his -ac
kernel patch tree.  I have agreed to do so once I have forked off the
"final" version of the registry for Linus' tree.  At that time I will
process the backlog for the benefit of the -ac registry only.  Please
have patience until I can get that to happen.

Please note that this is not my decision (in fact, I have serious
concerns with it.)  In particular, /dev namespace coordination still
applies.

Sincerely,

	H. Peter Anvin
	The Linux Assigned Names and Numbers Authority

Date: 	Mon, 14 May 2001 13:29:51 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Mon, 14 May 2001, Jeff Garzik wrote:
>
> Note also that persistence of permissions and hardcoded in-kernel naming
> is a problem throughout proc...  It's not unique to in-driver
> filesystems.

Also note how a 32-bit (or 64-bit) dev_t does NOT make it any easier to
manage permissions or anything like that anyway. Look at the current mess
/dev is. Imagine it an order of magnitude worse.

Big device numbers are _not_ a solution. I will accept a 32-bit one, but
no more, and I will _not_ accept a "manage by hand" approach any more. The
time has long since come to say "No". Which I've done. If you can't make
it manage the thing automatically with a script, you won't get a hardcoded
major device number just because you're lazy.

End of discussion.

		Linus

Date: 	Mon, 14 May 2001 21:30:33 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: Getting out of hand?
Newsgroups: fa.linux.kernel

On Mon, 14 May 2001, Alan Cox wrote:
> 
> Except that Linus won't hand out major numbers, which means I can't even boot
> simply off such a device. I bet the vendors in question don't think the sun
> shines out of Linus' backside any more.

Actually, it does. It's just that some people have gotten so blinded by my
a** that they can no longer see it any more ;)

The problem I have is that there are lots of _good_ solutions, but they
all imply a bit more work than the bad ones. 

What does that result in? Everybody continues to use the simple old setup,
which required no thought at all, but that is a pain to maintain.

For example, the only thing you need in order to boot is to have a nice
clean "disk" major number. That's it. Nothing fancy, nothing more. 

Look at what we have now:

 - ramdisk: major 1. Fair enough - ramdisk is special, in that it doesn't
   have any "real hardware". No problem.
 - SCSI disks:
	major 8, 65-71,
 - Compaq smart2:
	major 72-79
 - Compaq CISS:
	major 104-111
 - DASD:
	major 94
 - IDE:
	major 3, 22, 33-34, 56-57, 88-91

and then the small random ones.

NONE of these major numbers have _any_ redeeming qualities except for the
ramdisk. They should all be _one_ major number, namely "disk". There are
absolutely NO advantages to having separate devices for some strange
compaq controllers and IDE disks. There is _no_ point in having some SCSI
disks show up at major 8, while others (who just happen to be attached to
a scsi bus that is not driven by the generic SCSI layer) show up at major
104 or whatever.

And it will never ever get fixed, unless somebody says "No more!". Which
I'm trying my best to say, except some people are so comfortable rolling
around in the shit that they have re-defined shit to be the new standard.

When Microsoft defines darkness to be standard, we laugh at them. When we
do it, Alan Cox stands up for it and claims that it's the best thing since
sliced bread. Double standards, anybody?

What I'm saying is: "No more SHIT!". I'm more than happy to give out a new
standard number for _disks_. I'm NOT AT ALL willing to say "Ok, Peter, go
ahead and give the next braindamaged Compaq/RedHat/Xxxx engineer another
random number so that we can dig ourselves deeper and deeper into this
shithole that Alan and others like so much".

How hard is it to generate a new "disk driver framework", and let people
register themselves, kind of like the "misc" drivers do. Except we'd only
allow DISKS. You could add something like

	register_disk_driver("compaq-ciss", nr_disks, &my_queue);

and then the disk driver framework will select a range of minor numbers
for the disks, and forward all requests that come to those minor numbers
to "my_queue". No major numbers. No fixed minors. And the user sees _one_
disk major, and doesn't care _what_ the hell is behind it.

But no. When I tell people "enough is enough", people want to continue
with the unbearably stupid and ugly thing we've always had, without
realizing that the _real_ problem is not that we have too few major
numbers, but the real problem is that people have mis-used the ones we
_do_ have, and the fact that we have too few _minor_ numbers (which is
easily fixable, and where 20 bits is plenty).

		Linus
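
The registration scheme proposed in the message above can be sketched in
user-space C. This is only an illustration under stated assumptions: the
register_disk_driver() name and argument order come from the message, but
the table sizes, the return value (first minor of the allocated range),
the void * stand-in for the request queue, and the disk_queue_for_minor()
helper are all hypothetical and are not kernel code.

```c
#include <stddef.h>

#define MAX_DISK_MINORS 256   /* assumed size of the single "disk" minor space */
#define MAX_DRIVERS     16    /* assumed upper bound on registered drivers */

struct disk_range {
    const char *name;     /* driver name, e.g. "compaq-ciss" */
    int first_minor;      /* first minor handed to this driver */
    int nr_disks;         /* number of consecutive minors it owns */
    void *queue;          /* stand-in for the driver's request queue */
};

static struct disk_range disk_drivers[MAX_DRIVERS];
static int nr_drivers;
static int next_minor;

/* Hand out the next free run of minors. Returns the first minor, or -1. */
int register_disk_driver(const char *name, int nr_disks, void *queue)
{
    struct disk_range *r;

    if (nr_disks <= 0 || nr_drivers >= MAX_DRIVERS ||
        next_minor + nr_disks > MAX_DISK_MINORS)
        return -1;

    r = &disk_drivers[nr_drivers++];
    r->name = name;
    r->first_minor = next_minor;
    r->nr_disks = nr_disks;
    r->queue = queue;
    next_minor += nr_disks;
    return r->first_minor;
}

/* Forward a minor number to the queue of whichever driver owns it. */
void *disk_queue_for_minor(int minor)
{
    int i;

    for (i = 0; i < nr_drivers; i++) {
        const struct disk_range *r = &disk_drivers[i];
        if (minor >= r->first_minor && minor < r->first_minor + r->nr_disks)
            return r->queue;
    }
    return NULL;   /* no driver registered for this minor */
}
```

In this sketch the minors are assigned in registration order, so a driver
never needs a major number or a fixed minor of its own; only controllers
that actually register consume any of the number space.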

Date: 	Mon, 14 May 2001 22:17:35 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: Getting out of hand?
Newsgroups: fa.linux.kernel

On Mon, 14 May 2001, Linus Torvalds wrote:
> 
> How hard is it to generate a new "disk driver framework", and let people
> register themselves, kind of like the "misc" drivers do. Except we'd only
> allow DISKS. You could add something like
> 
> 	register_disk_driver("compaq-ciss", nr_disks, &my_queue);

Note: one _important_ part of this is that absolutely _nobody_ registers a
disk driver except for a controller that is physically found on the
machine. 

None of this stupid "we have numbers pre-allocated for hardware that does
not even exist on this machine" crap that the current setup is full of. 

This way, you can pretty much depend on the fact that in any "normal"
configuration, you'll find disks at "disk0", "disk1", ... completely
regardless of whether the machine has an IDE controller, an "old-fashioned
SCSI" controller, or a Compaq smart-raid controller. And THAT is useful.  
You can migrate filesystem setups from one machine to another, without
worrying about the fact that one machine has IDE disks and another has
SCSI disks - the filesystem will just work, and the kernels will just
boot.

THAT is how it is supposed to work.

For people who care about where the disks are (0.01% of all people, and
half of those are misguided anyway), you can have a /proc interface or an
ioctl or something. 

But don't make excuses for the current setup. And understand why we must
NOT continue to just give out major numbers indiscriminately.

[ Oh, and _please_ don't Cc: me on this discussion. I'm not that
  interested. I know what I want, and I've let the current mess go on for
  too long. If it takes some pain to fix it, then so be it. It needs to be
  fixed, even if people suddenly start thinking that the light of my a**
  dimmed a bit. That's ok. I just don't want to really fill my inbox - I
  read the kernel mailing list with a newsreader and the "D" key. ]

		Linus

Date: 	Mon, 14 May 2001 23:41:15 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Tue, 15 May 2001, Neil Brown wrote:
> 
> I want to create a new block device - it is a different interface to
> the software-raid code that allows the arrays to be partitioned using
> normal partition tables.

See the other posts about creating a "disk" layer. Think of it as just a
simple "lvm" thing, except on a higher level (ie not on the request level,
but on the level _before_ we get to queuing the thing at all).

Plug the thing in at "__blk_get_queue()", and you're done.

> So I need a major number - to give to devfs_register_blkdev at least.
> You don't want me to have a hardcoded one (which is fine) so I need a
> dynamically allocated one - yes?

If you are willing to use devfs, you can just use a major nr of zero, and
devfs will allocate a device for you. 

Not everybody likes devfs, and there are bootstrap issues with this
approach, but it is the simple "get things working quickly" approach that
needs _zero_ changes or infrastructure.

> This means that we need some analogue to {get,put}_unnamed_dev that
> manages a range of dynamically allocated majors.

We already do have that. And have had it for a long time. It's pretty much
been part of "register_blkdev()" since day one (not quite true, but I bet
that code has been there since the days of Linux-1.0.x). 

You just pass in a major number of zero to "register_blkdev()", and it
will make one up for you.

devfs inherited this behaviour from the first version, I think.

> Am I missing something obvious here?

The fact that it already exists, and has existed for 5+ years, but that
nobody really uses it?

Nobody really uses it because it would require you to add a line or two to
your init scripts to pick up the major number from /proc/devices, and
that's obviously too hard. Much better to just hardcode random numbers,
right?

		Linus
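
The "line or two" that picks a dynamically assigned major out of
/proc/devices can also be done from C. The following is a minimal sketch,
assuming the usual /proc/devices layout of a "Character devices:" section
and a "Block devices:" section, each followed by "<major> <name>" lines.
The function name find_major() and the driver names used with it are
hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan a /proc/devices-style stream for `name` inside `section`
 * ("Character devices:" or "Block devices:").
 * Returns the major number, or -1 if the name is not listed there.
 */
int find_major(FILE *f, const char *section, const char *name)
{
    char line[128], dev[64];
    int major, in_section = 0;

    while (fgets(line, sizeof line, f)) {
        size_t n = strcspn(line, "\n");

        /* Section headers are the lines that end in ':' */
        if (n > 0 && line[n - 1] == ':') {
            in_section = strncmp(line, section, strlen(section)) == 0;
            continue;
        }
        if (in_section &&
            sscanf(line, "%d %63s", &major, dev) == 2 &&
            strcmp(dev, name) == 0)
            return major;
    }
    return -1;
}
```

A caller would open /proc/devices with fopen(), pass the stream in, and
then create the device node with whatever major comes back.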

Date: 	Tue, 15 May 2001 02:08:45 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Tue, 15 May 2001, Alan Cox wrote:
> > 
> > Nobody really uses it because it would require you to add a line or two to
> > your init scripts to pick up the major number from /proc/devices, and
> > that's obviously too hard. Much better to just hardcode random numbers,
> > right?
> 
> modprobe ?

I was being ironic.

Yes, it's used. Not very widely at all, and historically what has actually
happened is that people have used the dynamic numbers for a while, but in
order to become "real members of society" they've applied for a real
static major number even if the dynamic one worked fine.

Silly, yes. 

Note that my whole argument is that we do NOT need more of the static
numbers, and we should NOT expand the major number space
unnecessarily. We _can_ make do with devfs (trivially - no need to do
anything at all, as devfs already handles the case of dynamic major
numbers quite well). 

But the fact remains that some users want to (a) avoid devfs and (b) have
static maintenance. And I'm ok with that too, but only if the static major
number is in the form of a _generic_ number that has absolutely nothing to
do with any specific drivers (which is why I'd be perfectly ok with still
adding a "disk" major number, but which is why I do NOT want to have Peter
give out "the random number of today" to various stupid device drivers).

So we seem to be in violent agreement here.

		Linus

Date: 	Tue, 15 May 2001 08:10:29 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Tue, 15 May 2001, Alan Cox wrote:
> 
> Given a file handle 'X' how do I find out what ioctl groups I should apply to
> it. So we can go from
> 
> 	if(MAJOR(st.st_rdev) == ST_MAJOR)
> 		issue_scsi_ioctls
> 	else if(MAJOR(st.st_rdev) == FTAPE_MAJOR)
> 		issue_ftape_ioctls
> 	else ..
> 	else
> 		error

Ugh. You do this?

And you don't realize that the whole system is too broken for words?

What is the horrible app that does something like this? 

The fix, I think, is to make the ioctl commands much more regular. That is
probably true in general, and fixing that would hopefully fix the need for
horrible code like the above.

That said:

> 	/* Use scsi if possible [scsi, ide-scsi, usb-scsi, ...] */
> 	if(HAS_FEATURE_SET(fd, "scsi-tape"))
> 		...
> 	else if(HAS_FEATURE_SET(fd, "floppy-tape"))
> 		..

doesn't look horrible, and I don't see why we couldn't expose the "driver
name" for any file descriptor. We already do for some: "fstatfs()" is
largely the same thing on another level.

		Linus

Date: 	Tue, 15 May 2001 10:43:18 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Tue, 15 May 2001, James Simmons wrote:
> > 
> > Static devices like /dev/fbN are no different. They were just plugged in
> > before the OS booted.
> 
> Actually there are hotplug video cards. High end servers have hot-swappable
> graphics cards. Would you want to take down a very important server
> because the graphics card went dead? You pull it out and you plug a new
> one in. Also there are PCMCIA video cards. I have seen them for the hand
> held ipaqs. It is only a matter of time before all devices are hot
> swappable. 

True, but not really necessarily important.

The thing is, even if the device happens to be soldered down, inside a
computer that is locked in a safe, the question boils down to a fairly
simple one: "how do we approach devices?".

Do we approach devices as something static, or do we approach them as more
dynamic entities? Do we consider soldered-down devices to be fundamentally
different from the ones that can be hot-plugged?

And my opinion is that the "hot-plugged" approach works for devices even
if they are soldered down - the "plugging" event just always happens
before the OS is booted, and people just don't unplug it. So we might as
well consider devices to always be hot-pluggable, whether that is actually
physically true or not. Because that will always work, and that way we
don't create any artificial distinctions (and they often really _are_
artificial: historically soldered-down devices tend to eventually move in a
more hot-pluggable direction, as you point out).

Now, if we just fundamentally try to think about any device as being
hot-pluggable, you realize that things like "which PCI slot is this device
in" are completely _worthless_ as device identification, because they
fundamentally take the wrong approach, and they don't fit the generic
approach at all.

But this is also why I don't think static device numbers make any
sense. It's silly to have the same disk show up as different devices just
because it is connected to a different kind of controller. And it is
_really_ silly to statically pre-allocate device numbers based on the
"location" of a device. 

We should strive for a setup where device plugin causes that device to
show up in /dev, and everywhere else it is needed. And the logical
extension of such a setup is to consider built-in devices to be plugged in
at bootup.

This is true to the point that I would not actually think that it is a bad
idea to call /sbin/hotplug when we enumerate the motherboard devices. In
fact, if you look at the current network drivers, this is exactly what
will happen: when we auto-detect the motherboard devices, we _will_
actually call /sbin/hotplug to tell that we've "inserted" a network
device.

It's just that we haven't really mounted the root filesystem yet, so
user-land never actually "sees" this fact. But I think it's the right
approach to take, and realizing that even static devices are just a
sub-case of the problem of dynamic allocation means that you tend to
automatically also see that static device number allocation is just
broken.

[ The biggest silliness is this "let's try to make the disks appear in the
  same order that the BIOS probes them". Now THAT is really stupid, and it
  goes on a lot more than I'd ever like to see. ]

		Linus

Date: 	Tue, 15 May 2001 11:04:27 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Tue, 15 May 2001, James Simmons wrote:
> > 
> > And if write() has too much overhead - we'd better fix _that_, because
> > it's a much more likely hotspot than ioctl ever will be.
> 
> I would use write except we use write to draw into the framebuffer. If I
> write to the framebuffer with that data the only thing that will happen is
> I will get pretty colors on my screen. 

Note that this was the same argument that the USB people had, and it was
wrong then. It's wrong now.

The USB people decided on using ioctl's, because the way USB works you
send a packet down a "USB pipe", which is identified by the direction, the
device number and the type (and other details). So what the USB system
does to expose this to user land is very similar to what you propose for
ioctl's: a structured ioctl that has a "data" field.

What Al is saying, and what makes perfect sense is that you generate a
separate fd for each "pipe". It's even more obvious in the case of USB,
because, by golly, the things are actually _called_ "pipes" in the USB
documentation, which should have made people make the immediate
association. Instead of doing

	fd = open("unstructured-name" ...);
	ioctl(fd, MAGICIOCTL, { structured data });

you do

	fd = open("/structured/name", ...);
	write(fd, data, size);

or possibly you take a more socket-like approach and do

	fd = socket(part-of-the-structure);
	bind(fd, more-of-the-structure)
	connect(fd, last-part-of-the-structure);

and use write() there (or use "sendto()" etc., which allow more dynamic
structure constructs - you don't have to statically bind the fd early at
bind/connect time).

See? 

Don't get boxed in by thinking that you only have one fd. Even if you have
only one _device_node_, you can have multiple fd's. In fact, you can, with
the Linux VFS layer, fairly easily do things like

	mknod /dev/fd0 c X Y

and then use

	fd = open("/dev/fd0/colourspace", O_RDWR);

and your device just implements some trivial "lookup()" functions (you
don't _have_ to be a directory to allow name lookups - although right now
I suspect that you can confuse the VFS layer if you aren't. That's a VFS
layer deficiency, if so. Nobody has tested it, but it should be really
easy to fix if somebody is really interested).

Note that with these kinds of things, you don't need ugly ioctl's. The
code, I bet, would be a LOT more readable. There's nothing fundamentally
impossible with having

	> /dev/fd0/eject

cause an eject event on /dev/fd0. It would be fairly easy, I bet, to
expand the current "struct file_operations def_blk_fops" to also include
_dentry_ operations, and then all of this could be done by fs/block_dev.c,
with the actual device drivers not having to know about it.

Same thing with character devices. We should be fairly easily able to make
something like

	fd = open("/dev/fd0/colourspace=1", ...)

be fully parsed by the fs/block_dev.c layer: we could add a nice string to
"bd_op->open()", to be passed in to the device driver to do with as it
wishes. That would require _no_ changes from device driver writers except
the addition of a new argument ("const char * arg") and the choice to
possibly use that argument for extra structure.

This, btw, is Al Viro's wet dream. But I have to agree: using name spaces
etc is MUCH preferable to ioctl's, makes code more readable and logical,
and often makes it possible to do things you couldn't sanely do before
(control these things from scripts etc).

And using ASCII names ("eject") instead of numbers (see the "FDEJECT" and
"CDROMEJECT" etc #defines) sure as hell makes for easier maintenance and
avoids the whole issue of maintaining static numbers (all the same things
that make me hate device number maintenance make me also hate the fact
that we need to maintain this list of ioctl numbers etc). By using
descriptive names, the "maintenance" simply does not exist.

		Linus
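
The name-based control idea above can be sketched as a small lookup table
that maps the path component after the device node ("eject",
"colourspace=1") to a handler. This is a user-space illustration only:
device_lookup(), the table, and the two handlers are hypothetical names,
and nothing here is actual VFS or block-layer code.

```c
#include <stdlib.h>
#include <string.h>

static int ejected;        /* set once "eject" has been looked up */
static long colourspace;   /* last value given as "colourspace=N" */

static int do_eject(const char *arg)
{
    (void)arg;             /* "eject" takes no argument */
    ejected = 1;
    return 0;
}

static int do_colourspace(const char *arg)
{
    char *end;
    long v;

    if (!arg || !*arg)
        return -1;         /* a value is required */
    v = strtol(arg, &end, 10);
    if (*end)
        return -1;         /* trailing junk after the number */
    colourspace = v;
    return 0;
}

/* ASCII names instead of ioctl numbers: adding a control is one table row. */
static const struct {
    const char *name;
    int (*handler)(const char *arg);
} ctl_table[] = {
    { "eject",       do_eject },
    { "colourspace", do_colourspace },
};

/* Resolve one path component, optionally of the form "name=value". */
int device_lookup(const char *component)
{
    const char *eq = strchr(component, '=');
    size_t len = eq ? (size_t)(eq - component) : strlen(component);
    size_t i;

    for (i = 0; i < sizeof ctl_table / sizeof ctl_table[0]; i++) {
        if (strlen(ctl_table[i].name) == len &&
            strncmp(ctl_table[i].name, component, len) == 0)
            return ctl_table[i].handler(eq ? eq + 1 : NULL);
    }
    return -1;             /* unknown name; a real lookup would fail with ENOENT */
}
```

With a table like this there is no shared list of numbers to maintain;
two drivers can both expose "eject" without coordinating.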

Date: 	Tue, 15 May 2001 11:15:41 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Tue, 15 May 2001, Jeff Garzik wrote:
> 
> > Now, if we just fundamentally try to think about any device as being
> > hot-pluggable, you realize that things like "which PCI slot is this device
> > in" are completely _worthless_ as device identification, because they
> > fundamentally take the wrong approach, and they don't fit the generic
> > approach at all.
> 
> Should I interpret this as you disagreeing with
> exporting-bus-info-to-userspace type additions?  ie. some random
> get-info ioctl spits out pci_dev->slot_name to userspace.

Yes and no.

I'm absolutely _not_ against exporting information. That kind of
information can be very useful to help the user diagnose things, by
visualizing device layout etc (ie think here of a "device
manager" application that does all the pretty graphics that people so
enjoy).

Giving that kind of information to the user can be very useful indeed. And
I have no arguments against it.

The part I absolutely detest is when the information becomes more than
just "information", and is used to enforce a world-view. Anybody who uses
physical location for naming devices (ie you have to know where the hell
the thing is in order to look it up), is so far out to lunch that it's not
even funny. And the sad fact is that this is pretty much how ALL unixes
have historically done things ("Oh, you want to see the disk? Sure. It's
on scsi bus 1, channel 2, ID 3, lun 0, so you just open /dev/s1c3l0 and
you're done! Easy as pie!").

Keep it informational. And NEVER EVER make it part of the design.

That way, people who grew up with big unix machines can have their scripts
that create the stupid names dynamically on the fly, and still play at
being bound to a static naming scheme that was silly 20 years ago and is
just incredibly stupid today. There's a script for doing exactly this for
SCSI. I forget what it's called, because I obviously think the thing is
stupid, but giving people the power to do even silly things is what Linux
is all about.

		Linus

Date: 	Tue, 15 May 2001 12:17:11 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Tue, 15 May 2001, Johannes Erdfelt wrote:
> 
> Even bulk has issues because USB pipes aren't necessarily streams, they
> can be packetized in the pseudo-weird way that USB does things.

This is ok. "pipe" does not mean that the write data doesn't have
boundaries.

Think about UDP. It's done with file descriptors, yet it is very much
packetized. 

Even a regular "pipe" actually has packet behaviour: a single write of <=
PIPE_BUF bytes is guaranteed by UNIX to complete atomically, which is exactly so
that people can use pipes in a "packet" environment.

A file descriptor does NOT imply that the data you read or write must be
one mushy stream of bytes. It's ok to honour write() packet boundaries
etc.

You should absolutely NOT think that "we cannot send a packet down the
control pipe because multiple writers might confuse each other". You can
still require that separate packets be cleanly delimited.

It's a huge mistake to think that you _have_ to use ioctl's to get
"packet" behaviour, or to get structured reads/writes. 

The advantage of read/write is that it doesn't _force_ a packet on you,
but the kernel really doesn't care if you have some structure to your read
and write requests.

> > or possibly you take a more socket-like approach and do
> > 
> > 	fd = socket(part-of-the-structure);
> > 	bind(fd, more-of-the-structure)
> > 	connect(fd, last-part-of-the-structure);
> 
> I don't like sockets since we do have a well bound set of endpoints. We
> don't have 4 billion IP's with 64k ports to choose from. We have x
> endpoints that the device tells us about ahead of time.

Note that "sockets" != "IPv4". Sockets just have names, they can be IPv4
(4+2 byte things), they can be pathnames (UNIX domain) and they can be
large IPv6 (16+2 or whatever). Or they could be small USB names. There's
nothing fundamentally wrong with "binding" a one-byte address and a
one-byte "interface" name. You'd just create an AF_USB layer ;)

But no, I don't actually like sockets all that much myself. They are hard
to use from scripts, and many more people are familiar with open/close and
read/write.

		Linus
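
The point that plain read()/write() on a file descriptor can honour packet
boundaries is easy to demonstrate. The sketch below uses an AF_UNIX
datagram socketpair as the example descriptor (an assumption chosen for
illustration, not something USB-specific), writes two records of different
sizes, and reads them back. The function name packet_boundary_demo() is
hypothetical.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write two records with write(), read them back with read(), and store
 * the size returned by each read in got[]. Returns 0 on success, -1 on error.
 */
int packet_boundary_demo(ssize_t got[2])
{
    int sv[2];
    char buf[64];
    int rc = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    /* Two separate writes: a 5-byte record and a 14-byte record. */
    if (write(sv[0], "hello", 5) == 5 &&
        write(sv[0], "control-packet", 14) == 14) {
        /* Each read returns exactly one record, never a merged stream. */
        got[0] = read(sv[1], buf, sizeof buf);
        got[1] = read(sv[1], buf, sizeof buf);
        rc = 0;
    }

    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

Even though the read buffer has room for both records, the two reads
return 5 and 14 bytes: the descriptor preserves the write() boundaries
without any ioctl being involved.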

Date: 	Tue, 15 May 2001 13:18:09 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Tue, 15 May 2001, Jonathan Lundell wrote:
> >
> >Keep it informational. And NEVER EVER make it part of the design.
> 
> What about:
> 
> 1 (network domain). I have two network interfaces that I connect to 
> two different network segments, eth0 & eth1;

So?

Informational. You can always ask what "eth0" and "eth1" are.

There's another side to this: repeatability. A setup should be
_repeatable_.

This is what we have now. Network devices are called "eth0..N", and nobody
is complaining about the fact that the numbering is basically random. It
is _repeatable_ as long as you don't change your hardware setup, and the
numbering has effectively _nothing_ to do with "location".

You don't say "oh, I have my network card in PCI bus #2, slot #3,
subfunction #1, so I should do 'ifconfig netp2s3f1'". Right?

The location of the device is _meaningless_. 

Linux gets this right. We don't give 100Mbps cards different names from
10Mbps cards - and pcmcia cards show up in the same namespace as cardbus,
which is the same namespace as ISA. And it doesn't matter what _driver_ we
use.

The "eth0..N" naming is done RIGHT!

> 2 (disk domain). I have multiple spindles on multiple SCSI adapters. 

So? Same deal. You don't have eth0..N, you have disk0..N. 

What's the problem? It's _repeatable_, in that as long as you don't change
your disks, they'll show up the same way. But the 0..N doesn't imply that
the disks are anywhere special.

Linux gets this _somewhat_ right. The /dev/sdxxx naming is correct (or, if
you look at only IDE devices, /dev/hdxxx). The problem is that we don't
have a unified namespace, so unlike eth0..N we do _not_ have a unified
namespace for disks.

Your argument that names change if you add disks etc is complete crap. OF
COURSE they change. You cannot avoid it. Whatever scheme you use will
cause name-changes. The location-based one causes exactly the same kinds
of problems, except they are even worse - now you have to care which ID
your disk has etc. 

The argument that "if you use numbering based on where in the SCSI chain
the disk is, disks don't pop in and out" is absolute crap. It's not true
even for SCSI any more (there are devices that will acquire their location
dynamically), and it has never been true anywhere else. Give it up.

		Linus

Date: 	Tue, 15 May 2001 13:37:53 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Tue, 15 May 2001, Alexander Viro wrote:
> 
> The thing being, why the hell create these device/directory hybrids?

Backwards compatibility, and the ability to automatically take advantage
of existing filesystems without having any administrative worries.

No real technical reason, in other words.

But trust me: avoiding administrative worry is a _big_ plus.

And making people think of device nodes as more of a "window" into the
driver is a good thing anyway.

> Driver can export a tree and we mount it on fb0. After that you have
> the whole set - yes, /dev/fb0/colourspace, etc. - no problem. And no
> need to do mknod, BTW. Yes, we'll need to use /dev/fb0/frame for
> frame itself. BFD...

Actually, we can just continue to use "/dev/fb0", which would continue to
work the way it has always worked.

It's a mistake to think that a directory has to be a directory. Or to
think that a device node has to be a device node. It's perfectly ok to
just think of it as namespaces. So opening /dev/fb0 continues to open the
"master fd", whatever that means (in this case, the actual frame
buffer). The namespaces _under_ /dev/fb0 would be the control channels, or
in fact _anything_ that the frame buffer driver wants to expose.

They might also be exactly the same channel, except with certain magic
bits set. The example Peter gave was fine: tty devices could very usefully
be opened with something like

	fd = open("/dev/tty00/nonblock,9600,n8", O_RDWR);

where we actually open up exactly the same channel as if we opened up
/dev/cua00, we just set the speed etc at the same time. Which makes things
a hell of a lot more readable, AND they are again easily done from
scripts. The above is exactly the kind of thing that UNIX has not done
well, and some others have done better (let's face it, even _DOS_ did it
better, for chrissake! Those callout devices and those ioctl's are a pain
in the ass, for no really good reason).

Using ASCII names for these kinds of channel controls is fine.

> You see, as soon as you want slightly more structured stuff (deeper than
> one level) you need the dentry tree, yodda, yodda. IOW, you need a
> filesystem anyway and it's easy to implement.

I want to ease people into this notion. I'm personally perfectly happy to
make it a real filesystem, if you are willing to write the code. But I've
become convinced that the transition has to be really simple, with no
administrative work.

It should be a case of "Just plug in a new kernel, and suddenly your
existing filesystem just allows you to do more! 20% more for the same
price! AND we'll throw in this useful Ginsu knife for just 4.95 for
shipping and handling. Absolutely free!"

		Linus
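
Parsing an option string such as the "nonblock,9600,n8" component in the
tty example above takes very little code. The following is a sketch under
assumptions: the struct layout, the parse_tty_opts() name, the defaults,
and the accepted parity letters (n, e, o) are all hypothetical choices
made for illustration, not an existing kernel interface.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct tty_opts {
    int  nonblock;   /* 1 if "nonblock" was given */
    long baud;       /* line speed, 0 if not specified */
    char parity;     /* 'n', 'e' or 'o' */
    int  bits;       /* data bits */
};

/* Parse a comma-separated option string. Returns 0 on success, -1 on error. */
int parse_tty_opts(const char *s, struct tty_opts *o)
{
    char copy[64];
    char *tok;

    if (strlen(s) >= sizeof copy)
        return -1;
    strcpy(copy, s);              /* strtok() modifies its input */

    o->nonblock = 0;
    o->baud = 0;
    o->parity = 'n';
    o->bits = 8;

    for (tok = strtok(copy, ","); tok; tok = strtok(NULL, ",")) {
        if (strcmp(tok, "nonblock") == 0) {
            o->nonblock = 1;
        } else if (isdigit((unsigned char)tok[0])) {
            char *end;
            o->baud = strtol(tok, &end, 10);
            if (*end)
                return -1;        /* e.g. "9600x" */
        } else if (strlen(tok) == 2 && strchr("neo", tok[0]) &&
                   isdigit((unsigned char)tok[1])) {
            o->parity = tok[0];   /* e.g. "n8": no parity, 8 data bits */
            o->bits = tok[1] - '0';
        } else {
            return -1;            /* unknown option */
        }
    }
    return 0;
}
```

Because the options are plain text, the same string works unchanged from
a shell script and from C.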

Date: 	Tue, 15 May 2001 13:51:43 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Tue, 15 May 2001, Alexander Viro wrote:
>
> If you want them all to inherit it - inherit from mountpoint.

..which is exactly what the device node ends up being. The implicit
mount-point.

At which point, btw, it is completely indistinguishable to user space
whether the thing is implemented as a full filesystem, or whether it's
just that the device node exports a simple "lookup()" that it passes down
to the device driver. So this is also the point where it becomes nothing
but an implementation issue, and as such it's much less contentious.

Done right, they'll be automatic mount-points, which gives us:
 - perfect backwards compatibility (opening just the node will do what it
   has always done)
 - _zero_ extra system administration.

And I really think the zero system administration thing is the important
one. For some reason, sysadmin is where all the fights break out (see
devfs, but historically we had all the same problems with the original
device naming etc).

Sysadmin and editors. The holy wars of UNIX.

		Linus

Date: 	Tue, 15 May 2001 14:36:07 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Tue, 15 May 2001, Alex Bligh - linux-kernel wrote:
> 
> Q: Let us assume you have dynamic numbering disk0..N as you suggest,
>    and you have some s/w RAID of SCSI disks. A disk fails, and is (hot)
>    removed. Life continues. You reboot the machine. Disks are now numbered
>    disk0..(N-1). If the RAID config specifies using disk0..N thusly,

If you have a raid config like that, then you're screwed _whatever_ you
do.

Look into using UUID's, which fix this properly.

And note, btw, how I think the md autorun stuff does all of this the RIGHT
way. Where RIGHT very much includes not using positional information etc.

		Linus
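
The UUID approach recommended above amounts to looking members up by
identity rather than by position. A minimal sketch follows; the struct,
the find_disk_by_uuid() name, and the short made-up identifier strings
used with it are hypothetical and do not reflect the md superblock format.

```c
#include <stddef.h>
#include <string.h>

struct disk {
    const char *name;   /* current dynamic name, e.g. "disk1" */
    const char *uuid;   /* identifier stored on the disk itself */
};

/* Return the current name of the disk carrying `uuid`, or NULL if absent. */
const char *find_disk_by_uuid(const struct disk *disks, size_t n,
                              const char *uuid)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(disks[i].uuid, uuid) == 0)
            return disks[i].name;
    return NULL;
}
```

If a disk is removed and the remaining ones are renumbered on reboot, a
configuration keyed on the identifier still finds each surviving member
under its new name, and correctly reports the failed one as missing.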

Date: 	Wed, 16 May 2001 16:52:01 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Thu, 17 May 2001, Alan Cox wrote:
>
> > Are FireWire (and USB) disks always detected in the same order? Or does it
> > behave like ADB, where you never know which mouse/keyboard is which
> > mouse/keyboard?
> 
> USB disks are required (haha etc) to have serial numbers. Firewire similarly
> has unique disk identifiers.  

Well, as that doesn't actually work out in practice, the good news is that
USB at least _tries_ to always walk the tree the same way when detecting
devices, so if you don't change where your devices are in the topology
they should show up in similar places.

Of course, "not changing topology" also means things like not powering off
devices with external power supplies etc..

The serial numbers are probably not that reliable.

		Linus

Date: 	Wed, 16 May 2001 16:53:55 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: LANANA: To Pending Device Number Registrants
Newsgroups: fa.linux.kernel

On Wed, 16 May 2001, H. Peter Anvin wrote:
> Alan Cox wrote:
> > 
> > > Are FireWire (and USB) disks always detected in the same order? Or does it
> > > behave like ADB, where you never know which mouse/keyboard is which
> > > mouse/keyboard?
> > 
> > USB disks are required (haha etc) to have serial numbers. Firewire similarly
> > has unique disk identifiers.
> 
> How about for other device classes?

Note that this whole decision hinges on a fact that simply isn't _true_.

You simply _cannot_ get the physical location of many devices. Sometimes
the topology of the bus is basically anonymous - there _is_ no location.

People had better just accept this. Don't get hung up about where
something is.

		Linus

Date: 	Sat, 19 May 2001 11:13:48 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]
Newsgroups: fa.linux.kernel

On Sat, 19 May 2001, Alexander Viro wrote:
>
> 	Folks, before you get all excited about cramming side effects into
> open(2), consider the following case:

Your argument is stupid, imnsho.

Side-effects are perfectly fine if they are _local_ to the file
descriptor. Your example is contrived and idiotic.

Filename extensions would not replace ioctl's. But they are wonderful ways
to avoid unnecessary binary name-spaces, like the ones we have with
"callout" TTY names, and the one that the fb people had.

For example, do a "ls -l /dev/fd0*", and ponder. Also, realize that we
have these hard-coded names in _addition_ to the magic ioctl to set even
more parameters. These are all stupid and bad, and it would have been a
_lot_ cleaner to be able to do

	open("/dev/fd0/H1440", O_RDWR)..

or

	open("/dev/fd0/HD,18,85", O_RDWR)

to open special non-standard high-density modes.

We already did this, in a very limited and stupid way, by encoding the
minor number and generating a standard naming scheme. We can do the same
thing in a _much_ more generic way by just realizing that we wanted the
open to be name-based in the first place.

These are _not_ side effects. They are very much naming conventions. If I
want to open the floppy in one of the special extended modes, it makes a
LOT more sense to just open it with the naming, than to open a "generic"
floppy device only to then use a magic and very unreadable ioctl to set
the mode of the device.

In short, I don't buy your arguments for one single second.

		Linus

Date: 	Sat, 19 May 2001 11:34:48 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [RFD w/info-PATCH] device arguments from lookup, partion code
Newsgroups: fa.linux.kernel

On Sat, 19 May 2001, Alan Cox wrote:
>
> > Now that I'm awake and refreshed, yeah, that's awful.  But
> > echo "hot-add,slot=5,device=/dev/sda" >/dev/md0/control *is* sane.  Heck,
> > the system can even send back result codes that way.
> 
> Only to an English speaker. I suspect Quebec City Canadians would prefer a
> different command set.

I was waiting for the "anglo-saxon" argument.

I don't think it's a valid argument. You already have "/dev". You already
have English names for the numbers in ioctl's (and let's not be mentally
dishonest and say "numbers are cross-cultural", because NOBODY MUST EVER
USE THE RAW NUMBERS - you have to use the anglo-saxon #define'd names
because the numbers aren't even cross-platform on Linux, much less
portable to other systems).

So the "English is bad" argument is a complete non-argument.

		Linus

Date: 	Sat, 19 May 2001 12:00:41 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: no ioctls for serial ports? [was Re: LANANA: To Pending Device
Newsgroups: fa.linux.kernel

[ Attribution is gone, so I just deleted it.. ]

> > > > 	fd = open("/dev/tty00/nonblock,9600,n8", O_RDWR);
> > >
> > > Hmm, there might be a problem with this. How do you change speed without
> > > reopening the device? [Remember: your mouse knows when you close the device]

The naming scheme is not a replacement for these kinds of ioctl's - it's
just a way to make them less critical, and get people thinking in other
directions so that we don't get _more_ ioctl's.

Remember, the serial lines we already have legacy support for, that's not
going away. The termios-based stuff isn't Linux-only, and we'll
obviously maintain it for the foreseeable future.

But if we can use naming to avoid ioctl's in the future, then THAT is
good. I'm in particular thinking about frame-buffer and similar things,
where we might be able to avoid making the situation worse.

And remember where this discussion started: not ioctl's, but device
numbers. The _biggest_ advantage of naming may be to get rid of the need
for extra major and minor numbers, and cleaning up /dev in the process.

		Linus
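
[ Editorial illustration, not part of the original post. The thread never
specifies how a driver would interpret a name component such as
"nonblock,9600,n8", so the following is only a minimal sketch of one way
to split it up. The struct, the function name and the rule "an
all-numeric token is the speed" are assumptions made for this example. ]

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical result of parsing the trailing name component. */
struct tty_opts {
    int  nonblock;   /* 1 if the token "nonblock" was present */
    long speed;      /* last all-numeric token, 0 if none */
    char mode[16];   /* last remaining token, e.g. "n8"; empty if none */
};

/* Split a comma-separated option string such as "nonblock,9600,n8".
 * Empty tokens and tokens of 16 characters or more are ignored. */
void parse_tty_opts(const char *name, struct tty_opts *o)
{
    const char *p = name;

    memset(o, 0, sizeof *o);
    while (*p) {
        size_t len = strcspn(p, ",");
        char tok[16];

        if (len > 0 && len < sizeof tok) {
            char *end;
            long n;

            memcpy(tok, p, len);
            tok[len] = '\0';
            n = strtol(tok, &end, 10);
            if (strcmp(tok, "nonblock") == 0)
                o->nonblock = 1;
            else if (end != tok && *end == '\0')
                o->speed = n;          /* whole token was a number */
            else
                strcpy(o->mode, tok);  /* fits: tok and mode are both 16 bytes */
        }
        p += len;
        if (*p == ',')
            p++;
    }
}
```

[ A real implementation would sit behind the device node's lookup and
would still have to coexist with termios, as the post points out. ]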

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: [PATCH] register_blkdev
Original-Message-ID: <Pine.LNX.4.44.0303071708260.1796-100000@home.transmeta.com>
Date: Sat, 8 Mar 2003 01:16:51 GMT
Message-ID: <fa.m6eoeb0.14gqb0g@ifi.uio.no>

On Fri, 7 Mar 2003, Andrew Morton wrote:
>
> Some time back Linus expressed a preference for a 2^20 major / 2^12 minor split.

Other way around. 12 bits for major, 20 bits for minor.

Minor numbers tend to get used up more quickly, as shown by the current
state of affairs, and also as trivially shown by things like pty-like
virtual devices that pretty much scale arbitrarily with memory and users.

I don't much care personally. I think the devil is in the details, and
making sure we don't have legacy code that just knows about the fact that
it can index a 256-entry array with the minor number.

Also, I have to say that over time I've become convinced that it's just a
painful mistake to mix up minor and major numbers, so it might well be
sensible for people who actually care about them to always keep them
separate. That would actually imply that 32+32 is the right thing to do
internally after all, and any other limits (whether they be 8+8, 12+20 or
16+16) would be limited by things like over-the-wire or on-the-disk
representations.

			Linus
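
[ Editorial illustration, not part of the original post. A minimal sketch
of the "12 bits for major, 20 bits for minor" split described above,
packed into one 32-bit value. The macro and function names are
hypothetical and are not claimed to match any kernel header. ]

```c
#include <stdint.h>

/* Hypothetical names for a 12+20 split of a 32-bit device number. */
#define MINOR_BITS 20u
#define MINOR_MASK ((UINT32_C(1) << MINOR_BITS) - 1u)   /* low 20 bits */
#define MAJOR_MASK ((UINT32_C(1) << 12) - 1u)           /* 12 bits */

/* Pack major into the top 12 bits and minor into the low 20 bits.
 * Out-of-range inputs are silently truncated by the masks. */
uint32_t pack_dev(uint32_t major, uint32_t minor)
{
    return ((major & MAJOR_MASK) << MINOR_BITS) | (minor & MINOR_MASK);
}

/* Recover the two halves from a packed value. */
uint32_t dev_major(uint32_t dev) { return dev >> MINOR_BITS; }
uint32_t dev_minor(uint32_t dev) { return dev & MINOR_MASK; }
```

[ Code that only ever goes through accessors like these keeps major and
minor separate in the sense of the last paragraph: the packed width then
matters only for on-the-wire or on-disk representations. ]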

Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: udev and devfs - The final word
Original-Message-ID: <20031231225536.GP4176@parcelfarce.linux.theplanet.co.uk>
Date: Wed, 31 Dec 2003 22:56:44 GMT
Message-ID: <fa.nlrcf4p.1r7g2qj@ifi.uio.no>

On Wed, Dec 31, 2003 at 05:20:18PM -0500, Rob Love wrote:
> On Wed, 2003-12-31 at 17:01, Nathan Conrad wrote:
>
> > One thing that I'm confused about with respect to device files is how
> > kernel arguments are supposed to work. Now, we _seem_ to have a
> > mish-mash of different ways to tell the kernel which device to open as
> > a console, which device to use as a suspend device, etc.... Now, all
> > of the device names are being migrated to userland. How is the kernel
> > supposed to determine which device to use when it is told use
> > /dev/hda3 or /dev/ide/host0/something/part3 as the suspend partition?
> > The kernel no longer knows to which device this string is
> > connected.
>
> Uh, Unix systems (Linux included) do not use the filename of the device
> node at all.  Those are just names for you, the user.
>
> The kernel uses the device number to understand what device user-space
> is trying to access.  The kernel associates the device with a device
> number.  Normally that number is static, and known a priori, so we just
> create a huge /dev directory with all possible devices and their
> assigned numbers (you can see these numbers with ls -la).
>
> But if the kernel _tells_ user-space what the device number is, for each
> device as it is created, we do not need a static /dev directory.  We can
> assemble the directory on the fly and device numbers really no longer
> matter.  This is what udev does.

I think you've missed a point here.  There are several places where kernel
deals with device identification.
	a) when normal pathname lookup results in a device node on filesystem.
That's the regular way.
	b) when we create a new device node; device number is passed to
->mknod() and new device node is created.  Also a normal codepath.
	c) when late-boot code mounts the final root.  It used to be black
magic, but these days it's done by regular syscalls.  Namely, we parse the
"device name" (most of the work is done by lookups in sysfs), do mknod(2)
and mount(2).  It's still done from the kernel mode, but it could be moved
to userland.  Should be, actually.
	d) when kernel deals with resume/suspend stuff.  Currently - black
magic.  Should be moved to early userland (same parser as for final root
name + mknod on rootfs + open() to get the device in question).
	e) in several pathological syscalls we pass device number to
identify a device.  ustat(2) and its ilk - bad API that can't die.
	f) /dev/raw passes device number to bind raw device to block device.
Bad API; we probably ought to replace it with saner one at some point.
	g) RAID setup - mix of both pathologies; should be done in userland
and interfaces are in bad need of cleanup.
	h) nfsd uses device number as a substitute for export ID if said
ID is not given explicitly.  That, BTW, is a big problem for crackpipe
dreams about random device numbers - export ID _must_ be stable across
reboots.
	i) mtdblk parses "device name" on boot; should be taken to early
userland, same as RAID et al.

	Eventually name_to_dev_t() should be gone from kernel mode
completely - all callers should be shifted to early userland.  But
that will take a lot of work - currently we have a big mess in that
area.
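
[ Editorial illustration, not part of the original post. Case (a) above,
and the quoted remark that the numbers are visible with ls -la, can be
checked from userland: the device number lives in st_rdev of the node,
whatever the node is called. A minimal sketch, assuming Linux with glibc
(for <sys/sysmacros.h>); the function name is hypothetical. ]

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>   /* major(), minor(); glibc on Linux */

/* Read the device number stored in a device node.
 * Returns 0 on success, -1 if the path cannot be stat'ed or is not a
 * character or block device node. */
int node_devnum(const char *path, unsigned *maj, unsigned *min)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return -1;
    *maj = major(st.st_rdev);
    *min = minor(st.st_rdev);
    return 0;
}
```

[ Renaming or copying the node (with mknod) does not change what this
returns, which is the sense in which the name is only for the user. ]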

Newsgroups: fa.linux.kernel
From: "Theodore Ts'o" <tytso@mit.edu>
Subject: Re: duplicated inode numbers for different files?
Original-Message-ID: <20040202130437.GA8196@thunk.org>
Date: Mon, 2 Feb 2004 13:22:40 GMT
Message-ID: <fa.e7sdb17.h661ad@ifi.uio.no>

On Mon, Feb 02, 2004 at 12:41:37PM +0300, "Andrey Borzenkov"  wrote:
>
> Are inode numbers supposed to be unique inside a filesystem? There
> is some code in nfsd (at least in 2.4) that suggests that it is not
> always the case.

Yes, at least in theory.  If you stat a file, the combination of
st_dev and st_ino are supposed to be unique.  For example, if two
separately named pathnames when stat'ed return the same values in
their stat structure for st_dev and st_ino, a userspace program (such
as GNU tar) is allowed to presume they are hard links of each
other.  Since each filesystem is supposed to have a unique st_dev, it
follows that inode numbers are supposed to be unique inside a
filesystem.
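
[ Editorial illustration, not part of the original post. The test that a
userspace program is allowed to rely on can be sketched as below; the
function name is hypothetical, and this is not claimed to be how GNU tar
implements it. ]

```c
#include <sys/stat.h>

/* Returns 1 if both paths refer to the same file (same st_dev and
 * st_ino), 0 if they differ, -1 if either path cannot be stat'ed. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

[ Comparing st_ino alone would not be enough, since two different
filesystems are free to reuse the same inode numbers. ]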

That being said, there are filesystems, generally remote, distributed
filesystems such as AFS, that have "cheated", mainly because they can
address more files than st_dev/st_ino combined.  When they cheat, such
that two files have the same st_dev and st_ino, programs can get
confused.  However, to the extent that filesystems can manage to keep
related files (that are likely to be tar'ed together) from having
duplicated inode numbers, they can mostly get away with it.

(So for example, although AFS looks like a single mounted filesystem
to AFS clients, it is made up of individual volumes which can be
located on one or more different servers.  Files inside each volume are
guaranteed to have unique inodes, and users rarely run tar across
multiple AFS volumes, so AFS gets away with this, mostly.  By the way,
I'm not picking on AFS here.  Anytime you have a massively distributed
filesystem, this is going to be a potential problem.  So for example,
the Lustre filesystem has to deal with this as well.)

> Supermount is currently using 1-to-1 correspondence
> between super- and subfs inodes. This is OK for all except root
> inode - it still has to have some inode number for root all the time.
> So it assigns arbitrary number and changes it after subfs has been
> mounted to reflect subfs root.

Yup, that can certainly cause problems, mainly because supermount is
trying to create the illusion that the filesystem was always mounted,
except that it only mounts it when you traverse into the filesystem.
This is actually different from the duplicate inode problem, although
certainly using a 1:1 correspondence between super and subfs inodes
also runs that risk.

> idea is to use fixed root number; it can be done but may result
> in duplicated number. Using ino == 0 may lessen chances (is it valid
> BTW?)

Don't use ino of 0 --- that's just asking for trouble.

I'd probably use a very high number ((unsigned)-3), for example.  That
puts it out of the way, and less likely to collide with "real" inodes
--- at least, for filesystems that aren't playing games with synthetic
inode numbers....

						- Ted
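
[ Editorial illustration, not part of the original post. For readers
unsure what ((unsigned)-3) evaluates to: converting -3 to unsigned wraps
modulo UINT_MAX + 1, giving UINT_MAX - 2 (4294967293 when unsigned is 32
bits wide). The helper name is hypothetical. ]

```c
#include <limits.h>

/* A fixed, very high inode number for a synthetic root inode.
 * (unsigned)-3 is UINT_MAX - 2 by the rules of unsigned conversion. */
unsigned synthetic_root_ino(void)
{
    return (unsigned)-3;
}
```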
