Date: 	Mon, 23 Oct 2000 22:39:36 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Linux's implementation of poll() not scalable?
Newsgroups: fa.linux.kernel

On Mon, 23 Oct 2000, Linus Torvalds wrote:
> 
> > What is your favourite interface then ? 
> 
> I suspect a good interface that can easily be done efficiently would
> basically be something where the user _does_ do the equivalent of a
> read-only mmap() of poll entries - and explicit and controlled
> "add_entry()" and "remove_entry()" controls, so that the kernel can
> maintain the cache without playing tricks.

Actually, forget the mmap, it's not needed.

Here's a suggested "good" interface that would certainly be easy to
implement, and very easy to use, with none of the scalability issues that
many interfaces have.

First, let's see what is so nice about "select()" and "poll()". They do
have one _huge_ advantage, which is why you want to fall back on poll()
once the RT signal interface stops working. What is that?

Basically, RT signals or any kind of event queue has a major fundamental
queuing theory problem: if you have events happening really quickly, the
events pile up, and queuing theory tells you that as you start having
queueing problems, your latency increases, which in turn tends to mean
that later events are even more likely to queue up, and you end up in a
nasty meltdown scenario where your queues get longer and longer. 

This is why RT signals suck so badly as a generic interface - clearly we
cannot keep sending RT signals forever, because we'd run out of memory
just keeping the signal queue information around.

Neither poll() nor select() have this problem: they don't get more
expensive as you have more and more events - their expense is the number
of file descriptors, not the number of events per se. In fact, both poll()
and select() tend to perform _better_ when you have pending events, as
they are both amenable to optimizations when there is no need for waiting,
and scanning the arrays can use early-out semantics.

So sticky arrays of events are good, while queues are bad. Let's take that
as one of the fundamentals.

So why do people still like RT signals? They do have one advantage, which
is that you do NOT have that silly array traversal when there is nothing
to do. Basically, the RT signals kind of approach is really good for the
cases where select() and poll() suck: no need to traverse mostly empty and
non-changing arrays all the time.

It boils down to one very simple rule: dense arrays of sticky status
information are good. So let's design a good interface for a dense array.

Basically, the perfect interface for events would be

	struct event {
		unsigned long id;	/* file descriptor ID the event is on */
		unsigned long event;	/* bitmask of active events */
	};

	int get_events(struct event * event_array, int maxnr, struct timeval *tmout);

where you say "I want an array of pending events, and I have an array you
can fill with up to 'maxnr' events - and if you have no events for me,
please sleep until you get one, or until 'tmout'".

The above looks like a _really_ simple interface to me. Much simpler than
either select() or poll(). Agreed?

Now, we still need to inform the kernel of what kind of events we want, ie
the "binding" of events. The most straightforward way to do that is to
just do a simple "bind_event()" system call:

	int bind_event(int fd, struct event *event);

which basically says: I'm interested in the events in "event" on the file
descriptor "fd". The way to stop being interested in events is to just set
the event bitmask to zero.
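
As a rough usage sketch (CONN_ID and fd are just made-up names here):

	struct event ev;

	ev.id = CONN_ID;	/* hypothetical application-chosen ID */
	ev.event = POLLIN;
	bind_event(fd, &ev);	/* start caring about input on fd */

	...

	ev.event = 0;
	bind_event(fd, &ev);	/* zero mask: stop caring */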

Now, the perfect interface would be the above. Nothing more. Nothing
fancy, nothing complicated. Only really simple stuff. Remember the old
rule: "keep it simple, stupid".

The really nice part of the above is that it's trivial to implement. It's
about 50 lines of code, plus some simple logic in various drivers etc. to
actually inform about the events. The way to do this simply is to limit it
in very clear ways, the most notable one being simply that there is only
one event queue per process (or rather, per "struct files_struct" - so
threads would automatically share the event queue). This keeps the
implementation simple, but it's also what keeps the interfaces simple: no
queue ID's to pass around etc.

Implementation is straightforward: the event queue basically consists of

 - a queue head in "struct files_struct", initially empty.

 - doing a "bind_event()" basically adds a fasync entry to the file
   structure, but rather than cause a signal, it just looks at whether the
   fasync entry is already linked into the event queue, and if it isn't
   (and the event is one of the ones in the event bitmask), it adds itself
   to the event queue.

 - get_events() just traverses the event queue and fills in the array,
   removing them from the event queue. End of story. If the event queue is
   empty, it trivially sees that in a single line of code (+ timeout
   handling)

Advantage: everything is O(1), except for "get_events()" which is O(n)
where 'n' is the number of active events that it fills in. 
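
To make this concrete, here is a toy user-space model of that bookkeeping -
emphatically not kernel code, with invented names and no locking or
sleeping - showing just the queue head, the "link it in only if it isn't
already linked" step, and the drain in get_events():

	/*
	 * Toy model of the event queue described above.  Purely
	 * illustrative: invented names, no locking, no sleeping.
	 */
	struct event {
		unsigned long id;		/* application-chosen ID */
		unsigned long event;		/* bitmask of active events */
	};

	struct binding {			/* roughly the fasync entry */
		struct event ev;		/* what bind_event() asked for */
		unsigned long pending;		/* events that have triggered */
		struct binding *next;		/* link in the per-process queue */
		int queued;			/* already on the queue? */
	};

	static struct binding *queue_head;	/* the head in "files_struct" */

	/* A "driver" reporting that something happened on a bound fd. */
	static void post_event(struct binding *b, unsigned long what)
	{
		b->pending |= what & b->ev.event;
		if (b->pending && !b->queued) {	/* link it in only once: O(1) */
			b->queued = 1;
			b->next = queue_head;
			queue_head = b;
		}
	}

	/* get_events(): drain the queue into the caller's array. */
	static int get_events(struct event *out, int maxnr)
	{
		int n = 0;

		while (queue_head && n < maxnr) {
			struct binding *b = queue_head;

			queue_head = b->next;	/* remove from the queue */
			b->queued = 0;
			out[n].id = b->ev.id;
			out[n].event = b->pending;
			b->pending = 0;		/* re-armed: the next edge re-queues it */
			n++;
		}
		return n;			/* 0 would mean "go to sleep" */
	}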

Basically, I can't see that it can be done much simpler, and I don't see
that you can have much better performance. 

Example "server":	

#define MAXEV	(32)

	struct event ev;

	.. create socket.. 
	listen(sock, 5);

	/*
	 * start out with only the listening socket 
	 * as an "event"
	 */
	ev.id = LISTEN_ID;
	ev.event = POLLIN;
	bind_event(sock, &ev);

	/* Loop forever, handling the events */
	for (;;) {
		static struct event ev_list[MAXEV];
		int events = get_events(ev_list, MAXEV, NULL);
		struct event *ev = ev_list;

		do {
			unsigned long id = ev->id;
			unsigned long mask = ev->event;
			ev++;
			/* Call the event handler */
			handle[id](id, mask);
		} while (--events);
	}

The interesting part would be what the "handle[]" function pointer array
would do, and for the listen socket it would simply do something like
accepting every pending socket and then adding an event binding to it:

	listen_handle(unsigned long id, unsigned long mask)
	{
		int fd;

		while ((fd = accept(listen_sock..)) > 0) {
			struct event ev;
			ev.id = RDEVENT;
			ev.event = ...;
			bind_event(fd, &ev);
		}
	}

You get the idea. Very simple, and looks like it should perform quite
admirably. With none of the complexities of signal handling, and none of
the downsides of select() or poll(). Both of the new system calls would be
on the order of 20-30 lines of code (along with the infrastructure to
extend the fasync stuff to also be able to handle events)

Yes, I'm probably missing something. But this is the kind of thing I think
we should look at (and notice that you can _emulate_ this with poll(), so
you can use this kind of interface even on systems that wouldn't support
these kinds of events natively).
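
As a very rough sketch of such an emulation - user space only, invented
names and a fixed-size table, and necessarily level-triggered because that
is all poll() can report - it might look something like this:

	#include <poll.h>
	#include <sys/time.h>

	#define MAX_BINDINGS 1024

	struct event {
		unsigned long id;		/* application-chosen ID */
		unsigned long event;		/* bitmask of active events */
	};

	static struct pollfd pfd[MAX_BINDINGS];
	static struct event bound[MAX_BINDINGS];
	static int nbound;

	int bind_event(int fd, struct event *ev)
	{
		int i;

		for (i = 0; i < nbound; i++)
			if (pfd[i].fd == fd)
				break;
		if (i == nbound) {
			if (nbound == MAX_BINDINGS)
				return -1;
			nbound++;
		}
		pfd[i].fd = fd;
		pfd[i].events = (short) ev->event;	/* zero mask: effectively unbound */
		bound[i] = *ev;
		return 0;
	}

	int get_events(struct event *out, int maxnr, struct timeval *tmout)
	{
		int ms = tmout ? tmout->tv_sec * 1000 + tmout->tv_usec / 1000 : -1;
		int i, found = 0;

		if (poll(pfd, nbound, ms) <= 0)
			return 0;
		for (i = 0; i < nbound && found < maxnr; i++) {
			if (!pfd[i].revents)
				continue;
			out[found].id = bound[i].id;
			out[found].event = pfd[i].revents;
			found++;
		}
		return found;
	}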

		Linus


Date: 	Mon, 23 Oct 2000 23:35:54 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Linux's implementation of poll() not scalable?
Newsgroups: fa.linux.kernel

On Mon, 23 Oct 2000, Dan Kegel wrote:
> 
> http://www.FreeBSD.org/cgi/man.cgi?query=kqueue&apropos=0&sektion=0&ma
> describes the FreeBSD kqueue interface for events:

I've actually read the BSD kevent stuff, and I think it's classic
over-design. It's not easy to see what it's all about, and the whole <kq,
ident, filter> tuple crap is just silly. Looks much too complicated.

I don't believe in the "library" argument at all, and I think multiple
event queues completely detract from the whole point of being simple to
use and implement.

Now, I agree that my bind_event()/get_event() has limitations, and the
biggest one is probably the event "id". It needs to be better, and it
needs to have more structure. The "id" really should be something that not
only contains the "fd", but also contains the actor function to be called,
along with some opaque data for that function.

In fact, if you take my example server, and move the "handle[id]()" call
_into_ get_events() (and make the "handle[id]()" function pointer a part
of the ID of the event), then the library argument goes away too: it
doesn't matter _who_ calls the get_event() function, because the end
result is always going to be the same regardless of whether it is called
from within a library or from a main loop: it's going to call the handler
function associated with the ID that triggered.

Basically, the main loop would boil down to

	for (;;) {
		static struct event ev_list[MAXEV];
		get_event(ev_list, MAXEV, &tmout);
		.. timeout handling here ..
	}

because get_event() would end up doing all the user-mode calls too (so
"get_event()" is no longer a system call: it's a system call + a for-loop
to call all the ID handler functions that were associated with the events
that triggered).

So the "struct event" would just be:

	struct event {
		int fd;
		unsigned long mask;
		void *opaque;
		void (*event_fn)(int fd, unsigned long mask, void *opaque);
	};

and there's no need for separate event queues, because the separate event
queues have been completely subsumed by the fact that every single event
has a separate event function.

So now you'd start everything off (assuming the same kind of "listen to
everything and react to it" server as in my previous example) by just
setting

	bind_event(sock, POLLIN, NULL, accept_fn);

which basically creates the event inside the kernel, and will pass it to
the "__get_event()" system call through the event array, so the
get_event() library function basically looks like

	int get_event(struct event *array, int maxevents, struct timeval *tv)
	{
		int nr = __get_event(array, maxevents, tv);
		int i;
		for (i = 0; i < nr; i++) {
			array->event_fn(array->fd, array->mask, array->opaque);
			array++;
		}
		return nr;
	}

and tell me why you'd want to have multiple event queues any more?

(In fact, you might as well move the event array completely inside
"get_event()", because nobody would be supposed to look at the raw array
any more. So the "get_event()" interface would be even simpler: just a
timeout, nothing more. Talk about simple programming.)

(This is also the ideal event programming interface - signals get racy and
hard to handle, while in the above example you can trivially just be
single-threaded. Which doesn't mean that you CANNOT be multi-threaded if
you want to: you multi-thread things by just having multiple threads that
all call "get_event()" on their own).

		Linus


Date: 	Mon, 23 Oct 2000 23:41:52 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Linux's implementation of poll() not scalable?
Newsgroups: fa.linux.kernel

On Mon, 23 Oct 2000, Dan Kegel wrote:
 
> kqueue lets you associate an arbitrary integer with each event
> specification; the integer is returned along with the event.
> This is very handy for, say, passing the 'this' pointer of the
> object that should handle the event.  Yes, you can simulate it
> with an array indexed by fd, but I like having it included.

I agree. I think the ID part is the most important part of the interfaces,
because you definitely want to re-use the same functions over and over
again - and you want to have some way other than just the raw fd to
identify where the actual _buffers_ for that IO are, and what stage of the
state machine we're in for that fd etc etc.

Also, it's potentially important to have different "id"s for even the same
fd - in some cases you want to have the same event handle both the read
and the write part on an fd, but in other cases it might make more
conceptual sense to separate out the read handling from the write
handling, and instead of using "mask = POLLIN | POLLOUT", you'd just have
two separate events, one with POLLIN and one with POLLOUT.
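
With the expanded bind_event() from my previous mail that is nothing more
than two bindings on the same fd (the handler and opaque names here are
invented):

	bind_event(fd, POLLIN,  conn, read_handler);	/* read side */
	bind_event(fd, POLLOUT, conn, write_handler);	/* write side */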

This was what my "unsigned long id" was, but that is much too hard to use.
See my expanded suggestion of it just a moment ago.

(And yes, I'm sure you can do all this with kevents. I just abhor the
syntax of those things).

		Linus


Date: 	Tue, 24 Oct 2000 08:33:33 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Linux's implementation of poll() not scalable?
Newsgroups: fa.linux.kernel

[ Moving on to practical matters ]

On Tue, 24 Oct 2000, Dan Kegel wrote:
> 
> Might be good to pick more unique names than 'struct event' and 'get_event'.
> People probably use those already.

I agree. I would never implement them under those names, but it's easier
to talk about "event" than about something very specific.

> Hiding ev_list is probably ok.  However,
> http://www.citi.umich.edu/techreports/reports/citi-tr-00-7.pdf
> suggests that knowing how many events are pending is a useful measure 
> of server load, and that if more than a certain number of events
> are pending, web servers should reject new connections.  Thus it might
> be handy to make the return value of get_event be the number of events
> gotten.

Note that my "get_event()" function actually did return exactly that: it
returned the number of events, even though the actual event array handling
was hidden inside the library function.

> > So now you'd start everything off (assuming the same kind of "listen to
> > everything and react to it" server as in my previous example) by just
> > setting
> > 
> >         bind_event(sock, POLLIN, NULL, accept_fn);
> 
> A couple questions:
> 
> * how do you unbind?  (By calling bind_event with NULL as the accept_fn?)

If done that way, I'd prefer something where an event _mask_ of 0 means
that the event obviously no longer exists, as it can no longer trigger.
The accept_fn would be part of the identifier, and as such zeroing it out
would lose the identification.

But I'd personally probably prefer to be more explicit, and have a
separate function for it. 

Note that whether it's a separate function or just making "bind_event()"
really be "modify_or_bind_or_remove_event()" is pretty much not anything I
care about - those are such low-level details that they are probably best
settled by trying things out, and I don't have any religious issues with
them.

I care about high-level interfaces, because _those_ are the ones that
screw me years down the line. For example, select() (and later bind()) was
always an interface that did _not_ fit into the UNIX way of doing things
AT ALL. It really stands out among all the IO interfaces as being very
different. And it's one of the nastier ones to maintain.

> * how do you change a mask?  (By calling bind_event with a new mask?)

Same as above. Either bind with a new mask (zero being remove), or
separate call.

> * Is it ok to do these things while in an event_fn?  (Yes?)

Oh, absolutely. The kernel doesn't even _know_ when something is in an
event function - the kernel only sees "change the event" and "get the list
of events".

> * Do you get an event whenever an fd is ready for something, or
>   only when its readiness changes?  (Presumably whenever an fd is ready for something?)

Only when its readiness changes - probably with the addition (which would
simplify things) that a newly bound event always, unconditionally, gets
added to the event queue.

Note that there are no races in this implementation - there is no way for
events to get lost, even trivially seen in SMP/threaded environments.
That's important. 

The BSD kevent paper goes on about "level and edge triggered" and it
becomes a big thing for them, and they selected level-triggered events as 
if it made any difference. And sure - it _does_ make a difference, but the
only difference is in how hard it is to implement, and level-triggered is
noticeably harder.

The bind_event()/get_events() interface is edge-triggered, and works
fundamentally the same way as SIGCHLD. If you don't keep up, you get
SIGCHLD only once, even if ten children died on you in rapid succession.
And it's edge-triggered: that doesn't mean that you lose events (like the
BSD paper implies), it only means that you have to reap your children with
a loop:

	while ((err = waitpid(..)) >= 0) {
		.. get all children ..
	}
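
A read-side event function in the same spirit would just drain the
descriptor until it runs dry - a rough sketch, with handle_data() being an
invented helper:

	void read_handler(int fd, unsigned long mask, void *opaque)
	{
		char buf[4096];
		int n;

		while ((n = read(fd, buf, sizeof(buf))) > 0)
			handle_data(opaque, buf, n);	/* invented helper */
		/* n < 0 with EAGAIN: drained, wait for the next edge */
	}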

The reason "edge-triggered" (ie only when an event changes) is preferable
is that it's MUCH simpler, and means that the "get_events()" system call
doesn't need to actually understand what the events mean at all. It will
just take the events off the list. If anything triggers after that, it will
be put back on the list again, but you can guarantee that you don't get
any lost events simply by the fact that

 - by the time get_events() returns to user mode (ie before the action
   functions have been called), any new event that happens, even before we
   have had a chance to react to the old events, will cause the event to be
   queued up again.

What does this mean? It basically means that if you continue to get events
on a file descriptor during your event handler function, the event will
have moved back to the event list, and next time you _will_ get an event
again.

		Linus


Date: 	Tue, 24 Oct 2000 08:42:13 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Linux's implementation of poll() not scalable?
Newsgroups: fa.linux.kernel

On Tue, 24 Oct 2000, Dan Kegel wrote:
> Linus Torvalds wrote:
> > Basically, the main loop would boil down to
> >         for (;;) {
> >                 static struct event ev_list[MAXEV];
> >                 get_event(ev_list, MAXEV, &tmout);
> >                 .. timeout handling here ..
> >         }
> > 
> > because get_event() would end up doing all the user-mode calls too (so
> > "get_event()" is no longer a system call: it's a system call + a for-loop
> > to call all the ID handler functions that were associated with the events
> > that triggered).
> 
> Occurs to me that this stuff will be used from other languages
> than C, which might prefer to do their own event dispatching,
> and would want the system to treat event_fn as another opaque quantity.

Yes and no.

The KERNEL would treat it as just another opaque quantity: after all, it
would never ever touch the thing other than get it as part of
"bind_event()" and return it as part of "get_events()". So as far as the
kernel is concerned, both "event_fn" and "opaque" are just random
information.

However, it is very important to have every common user agree about the
meaning of the ID's in user space: the whole point of encapsulating
"event_fn" as a function pointer is that you can have different parts of
the same program use the "bind_event()" interface, and they won't step on
each other's toes.

So if you program in Fortran, for example, and you expect to link against
the C library, and the C library uses events for something, then you'd
better make sure that the Fortran interfaces end up using the event_fn
thing in the same way as the C one..

> So I don't object to a function that dispatches events, e.g.
>          int get_event(&tmout);
> as long as there's also 
>          int get_events(ev_list, MAXEV, &tmout)
> that treats event_fn as an opaque pointer and *does not* dispatch the events.

That's a user-mode issue, and obviously, if you don't care about
compatibility with anything else you can do this. The kernel won't know,
or care. But I think it is unlikely that you'd ever use the raw events in
any other way - most environments under UNIX tend to have a C library
component somewhere, because the low-level stuff is written in C and uses
the common code that way even when the programmer isn't aware of it.

		Linus


Date: 	Tue, 24 Oct 2000 08:58:17 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Linux's implementation of poll() not scalable?
Newsgroups: fa.linux.kernel

On Tue, 24 Oct 2000, Mitchell Blank Jr wrote:
> I think everyone should take a timeout and look at Solaris 8's /dev/poll
> interface.  This discussion is reinventing the wheel, the lever, and the
> inclined plane.
> 
> 	http://docs.sun.com/ab2/coll.40.6/REFMAN7/@Ab2PageView/55123
> 
> I think this is a lot cleaner than your approach:

You must be kidding.

/dev/poll is the interface from _hell_. It's absolutely disgusting.

>   * it doesn't add extra syscalls

Sure it does. 

What do you think ioctl's are? Every single ioctl number is an extra
system call. It's just indirected inside the device driver instead of at
the normal system call entry point. The advantage? It's noticeably slower
than a real system call.

And the final reason why /dev/poll should just go away: not only is the
interface ugly, but it's hard to use too. It's really hard to mix
different users of /dev/poll - you'll have to create a whole new event
structure around it.

Sure, you can have multiple "poll lists" by opening /dev/poll multiple
times, and you seem to see this as a great feature. I'm telling you that
multiple event queues are a horrible disaster, and it's a perfect example
of what an engineer with the "more is better" mindset comes up with.

Multiple event queues are bad, because it completely breaks the notion of
event-driven programming. How do you want to listen to them all? You can't.
You can only listen to one event queue at a time - unless you create some
kind of "hierarchy" of event queues, where you have one event queue that
you wait on that contains the _other_ event queues, so that you can tell
which one has anything pending..

Trust me. Multiple event queues are idiotic. Anybody who _designs_ for
multiple event queues has some serious problems.

Basically, tastes differ. Everything you seem to see as an advantage of
/dev/poll, I see as a horrible kludge.

		Linus


Date: 	Tue, 24 Oct 2000 10:03:04 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Linux's implementation of poll() not scalable?
Newsgroups: fa.linux.kernel

On Tue, 24 Oct 2000, Simon Kirby wrote:
> 
> However, isn't there already something like this, albeit maybe without
> the ability to return multiple events at a time?  When discussing
> select/poll on IRC a while ago with sct, sct said:
> 
>  <sct> Simon: You just put your sockets into O_NONBLOCK|FASYNC mode for
>        SIGIO as usual.

[ siginfo and RT signals, but using "sigwaitinfo()" instead of actual
  signal handlers ]

The problem with RT signals (whether you use sigwaitinfo() or real signal
handlers) is that they don't scale to lots of events. The memory
pressure for keeping track of millions of siginfo structures is more than
any system can handle - so all RT signals implementations will have to
eventually fall back on something else.

Which makes for really hard-to-debug bugs, and complex programming (you
essentially end up having to duplicate all of the event handling, and the
"fallback" case has the added difficulty that you won't even see it except
under heavy load). 

> Also, what is different in your above interface that prevents it from
> being able to queue up too many events?

Basically, with get_events(), there is a maximum of one event per "bind".
And the memory for that is statically allocated at bind_event() time. 

>					  I guess the structure is only
> sizeof(int) * 2 bytes per fd, so it would only take, say, 80kB for 20,000
> FDs on x86, but I don't see how the other method would be significantly
> different.  The kernel would have to store the queued events still,
> surely...

Oh, the bind information would be more like 32 bytes per bind, so assuming
you have 20000 fd's you're waiting for, you'd certainly be using a lot of
memory. Still less than the quite complex SIGIO structures, though.
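
(Concretely, 20,000 binds at about 32 bytes each comes to roughly 640kB.)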

But you'd be doing so in a controlled manner: the memory use wouldn't go
up just because there is a sudden influx of 50000 packets. So it scales
with load by virtue of simply not _caring_ about the load - it only cares
about the number of fd's you're waiting on.

(Obviously _other_ parts of the system will have to keep up with tons of
new packets, but that's why protocols like TCP have flow control, and
that is something that is done independently of the event notification,
so regardless of _what_ kind of event model you have, you have to handle
the simple load of TCP ;).

		Linus


Date: 	Tue, 24 Oct 2000 11:33:57 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Linux's implementation of poll() not scalable?
Newsgroups: fa.linux.kernel

On Tue, 24 Oct 2000, Dan Kegel wrote:
> 
> But user code currently written for poll() has the luxury of dropping
> events because poll() will happily report on the current readiness of
> the socket every time.  /dev/poll is level-triggered because it's trying
> to make conversion of poll()-based code easy.  With your scheme, 
> whatever user code is receiving the edges better darn well do something
> about them, because it's only going to get them once.

Oh, I agree. I'm not saying that my approach magically fixes bugs in user
space ;)

> > The BSD kevent paper goes on about "level and edge triggered" and it
> > becomes a big thing for them, and they selected level-triggered events as
> > if it made any difference. And sure - it _does_ make a difference, but the
> > only difference is in how hard it is to implement, and level-triggered is
> > a noticeably harder.
> 
> I don't see why edge triggered is that much harder.  All it adds is
                  ^^^^ level
> a layer which receives the edges and moves fds back and forth between
> a 'ready' list and a 'not ready' list.  Easy as pie.

Not true.

For example, if you're truly level-triggered, and you have a socket that
gets data, the event moves onto the event queue. So far so fine: both edge
and level agree about this one.

The point they disagree is when the event gets removed from the event
queue. For edge triggered, this one is trivial: when a get_events() thing
happens and moves it into user land. This is basically a one-liner, and it
is local to get_events() and needs absolutely no help from anybody else.
So obviously event removal is _very_ simple for edge-triggered events -
the INTACK basically removes the event (and also re-arms the trigger
logic: which is different from most interrupt controllers, so the analogy 
falls down here).

For level, the thing is not as easy at ALL. Suddenly removal becomes a big
issue, and needs help from the actual driver. You can do it two ways:
calling down to the driver when you remove (to see if the event should be
dismissed or not once it has been read) or have the driver pro-actively
remove the event whenever a read() happens (or whatever it is that undoes
the event).

Both are actually fairly hard. Much harder than they sound. For different
reasons.

 - the callback approach at get_events() time sounds trivial, but actually
   has two problems: cache footprint for "get_events()" grows a _lot_
   (because the events are likely to be spread out a lot if there are a
   lot of them pending, so you don't get a nice tight inner loop at all),
   and you get "double events" - by the time the event first happens, it
   will still be active, so we cannot actually remove it at that time
   (there is still data to be read - and the event doesn't go away until
   we read it) so we'll get the event _again_, and on the next
   get_events() it will notice that the event was bogus, and remove it
   (and we can optimize it away from reporting it to user land at that
   point, so only the kernel needs to look at it twice and do two
   callbacks)

 - the "proactively remove events when the thing that triggerred them goes
   away" approach means that each anti-event (like a read that empties the
   buffer) needs to undo it's events, but it also needs to be careful that
   it doesn't undo combined events, and it needs to be very careful about
   races (new packet coming in), so the proactive remove actually ends up
   being less than trivial - and in a performance-critical section.

Now, compare that to a one-liner that just does a "list_del(&event->list)"
as it copies over the event to user mode. Wouldn't you say that the
edge-triggered version is simpler?

> > The reason "edge-triggered" (ie only when an event changes) is preferable
> > is that it's MUCH simpler, and means that the "get_events()" system call
> > doesn't need to actually understand what the events mean at all. 
> 
> Not much understanding is required on the part of the edge-to-level filter.

Implement it, and see. I bet you'll find that it gets really interesting
when you have concurrent accesses to the same file descriptor etc.

			Linus


Date: 	Tue, 24 Oct 2000 11:34:47 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Linux's implementation of poll() not scalable?
Newsgroups: fa.linux.kernel

On Tue, 24 Oct 2000, Abramo Bagnara wrote:

> Linus Torvalds wrote:
> > 
> > 
> >         struct event {
> >                 int fd;
> >                 unsigned long mask;
> >                 void *opaque;
> >                 void (*event_fn)(int fd, unsigned long mask, void *opaque);
> 
> My experience say that:
> 
> 	        unsigned long rmask;
> 		void (*event_fn)(struct event *event);
> 
> is a far better solution (more type safe, more descriptive).

You're probably right. It certainly makes it easier to add fields later on
if we'd want to extend the ID part without having to change old users..

		Linus


Date: 	Tue, 24 Oct 2000 16:02:55 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Linux's implementation of poll() not scalable? 
Newsgroups: fa.linux.kernel

On Tue, 24 Oct 2000, Evan Jeffrey wrote:
> 
> > Multiple event queues are bad, because it completely breaks the notion of
> > event-driven programming. How do you want to listen to them all? You can't.
> > You can only listen to one event queue at a time - unless you create some
> 
> You can listen to one event queue per thread.

Oh, I agree.

And I think something like CLONE_EVENTS would be fine - and decide
yourself what kind of threads you want (do you want indistinguishable
"anonymous" threads like apache, or do you want a threads that handle
separate event queues). Or you might have a mixture of the two - for web
serving the "accept()" event list might be a separate thing with a few
threads doing that, while "worker threads" handle the actual IO..

But if you want to have some event queue ID, I just wonder how you'd set
it up sanely without tons of complications..

		Linus


Date: 	Thu, 26 Oct 2000 09:44:21 -0700 (PDT)
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: Linux's implementation of poll() not scalable?
Newsgroups: fa.linux.kernel

On Thu, 26 Oct 2000, Dan Kegel wrote:

> With level-triggered interfaces like poll(), your chances
> of writing a correctly functioning program are higher because you
> can throw events away (on purpose or accidentally) with no consequences; 
> the next time around the loop, poll() will happily tell you the current
> state of all the fd's.

Agreed.

However, we also need to remember what got us to this discussion in the
first place. One of the reasons why poll() is such a piggy interface is
exactly because it tries to be "nice" to the programmer.

I'd much rather have an event interface that is documented to be edge-
triggered and is really _lightweight_, than have another interface that
starts out with some piggy features.

I do understand that level to some degree is "nicer", but everybody pretty
much agrees that apart from requiring more care, edge-triggered certainly
does everything the level-triggered interfaces do. 

For example, if you want to only partially handle an event (one of the
more valid arguments I've seen, although I do not agree with it actually
being all that common or important a thing to do), the edge-triggered
interface _does_ allow for that. It's fairly easy, in fact: you just
re-bind the thing.

(My suggested "bind()" interface would be to just always add a newly bound
fd to the event queue, but I would not object to a "one-time test for
active" kind of "bind()" interface either - which would basically allow
for a poll() interface - and the existing Linux internal "poll()"
implementation actually already allows for exactly this so it wouldn't
even add any complexity).

> With edge-triggered interfaces, the programmer must be much more careful
> to avoid ever dropping a single event; if he does, he's screwed.
> This gives rise to complicated code in apps to remember the current
> state of fd's in those cases where the programmer has to drop events.

No, the "re-bind()" approach works very simply, and means that the
overhead of testing whether the event is still active is not a generic
thing that _always_ has to be done, but something where the application
can basically give the kernel the information that "this time we're
leaving the event possibly half-done, please re-test now".
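
In code, the half-done case is nothing more than a re-bind, using the names
from the earlier sketches:

	/* we consumed only part of the input: ask the kernel to re-test now */
	bind_event(fd, POLLIN, conn, read_handler);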

Advantage: performance again. The common case doesn't take the performance
hit.

			Linus
