Deadlock (Greg KH; Linus Torvalds; Al Viro)

Index Home About Blog

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: PATCH: (as177)  Add class_device_unregister_wait() and
Original-Message-ID: <Pine.LNX.4.58.0401230939170.2151@home.osdl.org>
Date: Fri, 23 Jan 2004 17:43:50 GMT
Message-ID: <fa.j2skqtb.1ciskon@ifi.uio.no>

On Fri, 23 Jan 2004, Alan Stern wrote:
>
> Since I haven't seen any progress towards implementing the
> class_device_unregister_wait() and platform_device_unregister_wait()
> functions, here is my attempt.

So why would this not deadlock?

The reason we don't wait on things like this is that it's basically
impossible not to deadlock.

There are damn good reasons why the kernel uses reference counting
everywhere. Any other approach is broken.

		Linus

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: PATCH: (as177)  Add class_device_unregister_wait() and
Original-Message-ID: <Pine.LNX.4.58.0401231008220.2151@home.osdl.org>
Date: Fri, 23 Jan 2004 18:18:47 GMT
Message-ID: <fa.j3roq5a.1di4k0m@ifi.uio.no>

On Fri, 23 Jan 2004, Alan Stern wrote:
> >
> > So why would this not deadlock?
>
> Kind of a general question, so I'll give a general answer.  This wouldn't
> deadlock for the same reason as anything else: People use it properly!

No.

> > The reason we don't wait on things like this is that it's basically
> > impossible not to deadlock.
>
> That's an exaggeration.

Not by much.

>		  There are places where it's _necessary_ to
> wait.  For example, consider this extract from a recent patch written by
> Greg KH:
>
> +	/* FIXME change this when the driver core gets the
> +	 * class_device_unregister_wait() call */
> +	init_completion(&bus->released);
>  	class_device_unregister(&bus->class_dev);
> +	wait_for_completion(&bus->released);
>
> For the full patch, see
> http://marc.theaimsgroup.com/?l=linux-usb-devel&m=107109069106188&w=2
>
> The general context is that a module is trying to unload, but it can't
> until the release() callback for its device has finished.

And in just about any circumstance, with the _possible_ exception of
module unload, things like this can be fooled into deadlocking by a
non-root user.

To the point where root can no longer fix things up.

We've had these bugs before. It's a mistake to make interfaces that
positively _encourage_ bugs like this.

The canonical example of a bug like this is when a regular user can
trigger an event that causes the wait, and then makes sure that it holds a
reference count on the event - by opening a file descriptor.

> And sometimes a part of the kernel has to wait until the reference count
> drops to 0.

Not likely. The rest of the kernel never does it.

What is it with USB that makes people think so? Remember all the USB bugs
early on that were due to _exactly_ that thinking.

It's wrong. YOU SHOULD NEVER WAIT FOR THE REFERENCE COUNT TO DROP TO ZERO!

You just ignore it. With proper memory management it doesn't matter.

For module unload, it's likely better to return an error than to wait.
Tell the super-user that you're busy.

> By the way, adding class_device_unregister_wait() has an excellent
> precedent.  The driver model core _already_ includes
> device_unregister_wait(), merged by Pat Mochel.  Are you saying that
> function shouldn't exist either?

Quite probably.

		Linus

Newsgroups: fa.linux.kernel
From: Greg KH <greg@kroah.com>
Subject: Re: PATCH: (as177)  Add class_device_unregister_wait() and 
	platform_device_unregister_wait() to the driver model core
Original-Message-ID: <20040123183109.GG23169@kroah.com>
Date: Fri, 23 Jan 2004 18:32:45 GMT
Message-ID: <fa.dd7ra64.1kkgors@ifi.uio.no>

On Fri, Jan 23, 2004 at 10:15:36AM -0800, Linus Torvalds wrote:
>
> What is it with USB that makes people think so? Remember all the USB bugs
> early on that were due to _exactly_ that thinking.

Oh yeah, I remember :(

> It's wrong. YOU SHOULD NEVER WAIT FOR THE REFERNCE COUNT TO DROP TO ZERO!
>
> You just ignore it. With proper memory management it doesn't matter.
>
> For module unload, it's likely better to return an error than to wait.
> Tell the super-user that you're busy.

That patch that Alan pointed to is wrong anyway, it dies a horrible
death, and was only a hack to try something.

I think it might be a better idea to just prevent the module unload of
the USB host controller until all drivers bound to devices owned by it
are also unbound in order to fix these kinds of problems.

thanks,

greg k-h

Newsgroups: fa.linux.kernel
From: Greg KH <greg@kroah.com>
Subject: Re: PATCH: (as177)  Add class_device_unregister_wait() and 
	platform_device_unregister_wait() to the driver model core
Original-Message-ID: <20040123181106.GD23169@kroah.com>
Date: Fri, 23 Jan 2004 18:14:41 GMT
Message-ID: <fa.dbnl9m4.1l4eobs@ifi.uio.no>

On Fri, Jan 23, 2004 at 09:42:09AM -0800, Linus Torvalds wrote:
>
>
> On Fri, 23 Jan 2004, Alan Stern wrote:
> >
> > Since I haven't seen any progress towards implementing the
> > class_device_unregister_wait() and platform_device_unregister_wait()
> > functions, here is my attempt.
>
> So why would this not deadlock?

It will deadlock if the user does something braindead like:
	rmmod foo < /sys/class/foo_class/foo1/file

Now I know the network code can handle something like that, but they
have their own thread to handle issues like this...  It's not sane to
make every driver subsystem do that...

So in short, it's used to make sure that all references are dropped,
before allowing the module to be unloaded.

And Alan, I think Pat already has this in his tree, if only he would
send that to Linus one of these days...

thanks,

greg k-h

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: PATCH: (as177)  Add class_device_unregister_wait() and
Original-Message-ID: <Pine.LNX.4.58.0401231017030.2151@home.osdl.org>
Date: Fri, 23 Jan 2004 18:22:04 GMT
Message-ID: <fa.j2rqqd9.1ci6k8p@ifi.uio.no>

On Fri, 23 Jan 2004, Greg KH wrote:
> >
> > So why would this not deadlock?
>
> It will deadlock if the user does something braindead like:
> 	rmmod foo < /sys/class/foo_class/foo1/file

I don't much worry about things like that, since only root can rmmod
anyway.

HOWEVER - I do worry when people start exporting interfaces that are
basically _designed_ to deadlock. It's a bad interface. Don't export it.
There is possibly just _one_ place that can do it, and it's the module
unload part. Everything else would be a bug.

So do it in the one place. Don't make a function that does it and that
others will start using because it's "simple".

		Linus

Newsgroups: fa.linux.kernel
From: Greg KH <greg@kroah.com>
Subject: Re: PATCH: (as177)  Add class_device_unregister_wait() and 
	platform_device_unregister_wait() to the driver model core
Original-Message-ID: <20040123182714.GF23169@kroah.com>
Date: Fri, 23 Jan 2004 18:28:32 GMT
Message-ID: <fa.dd7h9ua.1kkaojq@ifi.uio.no>

On Fri, Jan 23, 2004 at 10:19:15AM -0800, Linus Torvalds wrote:
>
>
> On Fri, 23 Jan 2004, Greg KH wrote:
> > >
> > > So why would this not deadlock?
> >
> > It will deadlock if the user does something braindead like:
> > 	rmmod foo < /sys/class/foo_class/foo1/file
>
> I don't much worry about things like that, since only root can rmmod
> anyway.

Yeah, that's why I didn't care very much either.

> HOWEVER - I do worry when people start exporting interfaces that are
> basically _designed_ to deadlock. It's a bad interface. Don't export it.
> There is possibly just _one_ place that can do it, and it's the module
> unload part. Everything else would be a bug.
>
> So do it in the one place. Don't make a function that does it and that
> others will start using because it's "simple".

Ok, fair enough.  I'll make up a patch to remove that other
"unregister_wait" function in the driver core.

thanks,

greg k-h

Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@osdl.org>
Subject: Re: PATCH: (as177)  Add class_device_unregister_wait() and
Original-Message-ID: <Pine.LNX.4.58.0401251054340.18932@home.osdl.org>
Date: Sun, 25 Jan 2004 19:04:02 GMT
Message-ID: <fa.id05irj.1jg0bq9@ifi.uio.no>

On Sun, 25 Jan 2004, Alan Stern wrote:
>
> Is there some reason why modules don't work like this?

There's a few:

 - pain. pain. pain.

 - doing proper refcounting of modules is _really_ really hard. The reason
   is that proper refcounting is a "local" issue: you reference count a
   single data structure. It's basically impossible to make a "global"
   reference count without jumping through hoops.

 - lack of testing. Unloading a module happens once in a blue moon, if
   even then.

The proper thing to do (and what we _have_ done) is to say "unloading of
modules is not supported". It's a debugging feature, and you literally
shouldn't do it unless you are actively developing that module.

Sadly, some modules are broken. Old 16-bit PCMCIA in particular _depends_
on unloading modules, since the old PCMCIA layer doesn't do hotplug: it
literally thinks of module load/unload as the "plug/unplug" event.

But it basically boils down to: don't think of module unload as a "normal
event". It isn't. Getting it truly right is (a) too painful and (b) would
be too slow, so we're not even going to try.

(As an example of "too painful, too slow", think of something like a
packet filter module. You'd literally have to increment the count in every
part that gets a packet, and decrement the count at every point where it
lets the packet go.  And since it would have to be SMP-safe, it would have
to be a locked cycle, or we'd have to have per-CPU counters - at which
point you now also have to worry about things like preemption and
sleeping, which just means that it would be a _lot_ of very fragile code).

			Linus

Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: PATCH: (as177)  Add class_device_unregister_wait() and 
	platform_device_unregister_wait() to the driver model core
Original-Message-ID: <20040125202136.GR21151@parcelfarce.linux.theplanet.co.uk>
Date: Sun, 25 Jan 2004 20:23:39 GMT
Message-ID: <fa.nasfaqi.1o1cpb4@ifi.uio.no>

On Sun, Jan 25, 2004 at 11:02:58AM -0800, Linus Torvalds wrote:
> The proper thing to do (and what we _have_ done) is to say "unloading of
> modules is not supported". It's a debugging feature, and you literally
> shouldn't do it unless you are actively developing that module.
>
> Sadly, some modules are broken. Old 16-bit PCMCIA in particular _depends_
> on unloading modules, since the old PCMCIA layer doesn't do hotplug: it
> literally thinks of module load/unload as the "plug/unplug" event.
>
> But it basically boils down to: don't think of module unload as a "normal
> event". It isn't. Getting it truly right is (a) too painful and (b) would
> be too slow, so we're not even going to try.
>
> (As an example of "too painful, too slow", think of something like a
> packet filter module. You'd literally have to increment the count in every
> part that gets a packet, and decrement the count at every point where it
> lets the packet go.  And since it would have to be SMP-safe, it would have
> to be a locked cycle, or we'd have to have per-CPU counters - at which
> point you now also have to worry about things like preemption and
> sleeping, which just means that it would be a _lot_ of very fragile code).

Packet filter is hardly a normal module.  For absolute majority of modules
it's nowhere near that bad.

HOWEVER, module unload is not the real problem.  We have objects with
limited lifetimes.  Always had, always will.  Whether we remove pieces
of .text from the in-core kernel or not, we must be able to deal with
that.  Even if methods themselves are present, they won't do you any
good when data structures belonging to object are destroyed.

If that is handled right, rmmod is trivial for 99% of modules.  The
rest (including Rusty's stuff) can simply say "we can't be unloaded
at all" and be done with that.

Basically, "protect the module" is wrong - it should be "protect specific
object" and we need that anyway.  We already have that for the largest
class of modules - 300-odd netdev ones.  We have that for filesystems.
We have that for block devices.  We have infrastructure for doing that
to character devices, ttys and ldiscs.  Which leaves us with truly weird
stuff that doesn't work in such terms and yes, there your arguments
apply.

Frankly, I'm much more concerned about the stuff that _can't_ be disabled.
You can disable module unloading; hell, you can disable modules completely.
You can't disable ifdown(8).  Which currently means very bad things for
stuff in /proc/sys/net/ipv4/{conf,neigh}/*.  And all arguments re rmmod
deadlocks apply to that sort of situations...

Newsgroups: fa.linux.kernel
From: Greg KH <greg@kroah.com>
Subject: Re: PATCH: (as177)  Add class_device_unregister_wait() and 
	platform_device_unregister_wait() to the driver model core
Original-Message-ID: <20040127202944.GE27240@kroah.com>
Date: Tue, 27 Jan 2004 20:48:49 GMT
Message-ID: <fa.ddmfauh.1n48pjj@ifi.uio.no>

On Mon, Jan 26, 2004 at 05:22:41PM +0100, Roman Zippel wrote:
>
> All this is done without a module count, this means that
> pci_unregister_driver() cannot return before the last reference is gone.
> For network devices this is not that much of a problem, as they can be
> rather easily deconfigured automatically, but that's not that easy for
> mounted block devices, so one has to be damned careful when to call the
> exit function.

Um, not anymore.  I can yank out a mounted block device and watch the
scsi core recover just fine.  The need to make everything hotpluggable
has fixed up a lot of issues like this (as well as made more
problems...)

thanks,

greg k-h

Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: PATCH: (as177)  Add class_device_unregister_wait() and 
	platform_device_unregister_wait() to the driver model core
Original-Message-ID: <20040128021719.GV21151@parcelfarce.linux.theplanet.co.uk>
Date: Wed, 28 Jan 2004 02:18:32 GMT
Message-ID: <fa.naspair.1q1mpjf@ifi.uio.no>

On Wed, Jan 28, 2004 at 03:03:31AM +0100, Roman Zippel wrote:
> Hi,
>
> On Tue, 27 Jan 2004, Greg KH wrote:
>
> > > All this is done without a module count, this means that
> > > pci_unregister_driver() cannot return before the last reference is gone.
> > > For network devices this is not that much of a problem, as they can be
> > > rather easily deconfigured automatically, but that's not that easy for
> > > mounted block devices, so one has to be damned careful when to call the
> > > exit function.
> >
> > Um, not anymore.  I can yank out a mounted block device and watch the
> > scsi core recover just fine.  The need to make everything hotpluggable
> > has fixed up a lot of issues like this (as well as made more
> > problems...)
>
> Recovery of the scsi core is IMO one the smallest problems, but how do you
> recover at the block layer? The point is that you have here theoretically
> more one recovery strategy, but simply pulling out the module leaves you
> not much room for a controlled recovery.

Block layer is not too big issue.  We have almost everything in the tree
already - the main problem is to get check_disk_change() use regularized.
Now, sound and character devices in general...

Newsgroups: fa.linux.kernel
From: viro@parcelfarce.linux.theplanet.co.uk
Subject: Re: PATCH: (as177)  Add class_device_unregister_wait() and 
	platform_device_unregister_wait() to the driver model core
Original-Message-ID: <20040123194546.GK21151@parcelfarce.linux.theplanet.co.uk>
Date: Fri, 23 Jan 2004 19:47:52 GMT
Message-ID: <fa.n7d1bak.1mhuor6@ifi.uio.no>

On Fri, Jan 23, 2004 at 10:11:06AM -0800, Greg KH wrote:
> On Fri, Jan 23, 2004 at 09:42:09AM -0800, Linus Torvalds wrote:
> >
> >
> > On Fri, 23 Jan 2004, Alan Stern wrote:
> > >
> > > Since I haven't seen any progress towards implementing the
> > > class_device_unregister_wait() and platform_device_unregister_wait()
> > > functions, here is my attempt.
> >
> > So why would this not deadlock?
>
> It will deadlock if the user does something braindead like:
> 	rmmod foo < /sys/class/foo_class/foo1/file
>
> Now I know the network code can handle something like that, but they
> have their own thread to handle issues like this...  It's not sane to
> make every driver subsystem do that...

Network code does *NOT* wait for references to kobject to disappear.
->release() for those buggers is not in a module and freeing can
happen way after the rmmod.  No threads involved.

What it does wait for is different - in effect, net_device has a special-cased
rwsem in it, heavily optimised for down_read().  dev_hold() is an equivalent
of down_read().  dev_put() - up_read().  And unregister_netdev() does an
equivalent of down_write() after removing from all search structures.

That's what it is waiting for - it wants all readers (semaphore is unfair)
to go away.  It's not a refcounting issue at all - it's an exclusion around
the net_device shutdown, freeing resources, etc.

There was an optimistic attempt to turn that animal into sysfs refcount.
It didn't work - we had to add real refcounting there and make sysfs code
do essentially an equivalent of down_try_read() before accessing the guts
of net_device().  And we have ->release() for those buggers in core kernel.

Index Home About Blog