From: (Linus Torvalds)
Newsgroups: comp.os.linux.development.system
Subject: Re: select() enigma
Date: 21 Apr 1998 17:55:48 GMT

In article <cfV_.259$>,
Kaz Kylheku <> wrote:
>In article <HTP_.70$>,
>Kaz Kylheku <> wrote:
>>In article <6h9bag$ukt$>,
>>Linus Torvalds <> wrote:
>>>Read the source again.
>>This is the specific directive I have been waiting for.

Seriously, don't you feel better about having found out on your own with
only a small hint from me?

>Ah, I get it now. The select table is basically just populated by wait queue
>entries that get inserted into various queues throughout the VFS. It exists as
>a container object for keeping a collection of wait queue nodes so they
>can be cleaned up later.

Correct. In order to wait on an event, we need to add ourselves to a
wait-queue, and set up everything that involves. Usually this is a
trivial matter: normal actions only wait for a single condition, so
there really isn't any "wait queue management" necessary - the wait
queue entry is usually kept on the kernel stack, in fact. So the
code usually just looks something like this:

	struct wait_queue wait = { current, 0 };

	add_wait_queue(&eventqueue, &wait);
	.... do the sleep test and potential schedule ...
	remove_wait_queue(&eventqueue, &wait);

But for "select()" the above doesn't work, because we need to keep track
of an arbitrary number of wait queues, and the wait queue table does
this for us (together with a few helper functions that hide all the
details from the drivers that do the select wakeup).

Think of the select table as just a list of wait-queues.

>In the main select() loop, the descriptors are polled in turn.  On the first
>iteration, the select table is populated with new nodes and these are inserted
>into queues. If everything comes out negative, the task does a schedule().

Yes.  And as you pointed out earlier, if _any_ of the descriptors
happens to return a positive on the first round, we stop populating the
wait queues early, because we know we aren't going to sleep (at least
one bit will be set), so there is no point in doing any extra work.

Also note that the select queue is populated only on the first round,
and then never touched any more (until the very end of select, when we
de-populate it before returning to user mode).  So there are two ways
the select table pointer becomes NULL: partway through the first scan if
any of the descriptors returned a positive, or after the first round
once we have filled in the wait-queues for all descriptors.

>But its state has been earlier set to TASK_INTERRUPTIBLE, which causes
>schedule() to put it into an interruptible sleep state; it will wake up when a
>wake_up_interruptible() is done on any of the queues, at which  point it will
>do another round through the descriptors.

Correct again.  Note that the state is expressly set to INTERRUPTIBLE at
a very early stage - before doing _any_ of the device asking. The
reason for this is race conditions: if the device notices that it is
readable after all just after we have polled it but before we have
actually done the schedule(), then this guarantees correct behaviour:
the wake_up() from the device will mark us runnable even before we have
had time to go to sleep, so then we won't sleep but instead do the next
round immediately (which is what we want to do, because somebody told us
that they are ready).

>				   Of course, by the time the process
>wakes up, the condition which brought about the wakeup might have disappeared,
>so it's possible that it will have to sleep again.

Possible, but unlikely.  Yes.  Also, select() in itself doesn't
guarantee that the file descriptor _stays_ readable/writable, so we may
have returned a positive value to user space, and then something caused
the descriptor to become unreadable again.  This is normal select()
behaviour, and is one reason why anything that uses select() should also
use non-blocking IO (so that select() is the _only_ thing that sleeps).

