Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: manipulating sigmask from filesystems and drivers
Original-Message-ID: <Pine.LNX.4.33.0208011203010.3000-100000@penguin.transmeta.com>
Date: Thu, 1 Aug 2002 19:10:05 GMT
Message-ID: <fa.o9plbuv.g4u5jd@ifi.uio.no>

On Wed, 31 Jul 2002, David Howells wrote:
>
> Can you comment on whether a driver is allowed to block signals like this, and
> whether they should be waiting in TASK_UNINTERRUPTIBLE?

They should be waiting in TASK_UNINTERRUPTIBLE, and we should add a flag
to distinguish between "increases load average" and "doesn't". So you
could have

	TASK_WAKESIGNAL - wake on all signals
	TASK_WAKEKILL	- wake on signals that are deadly
	TASK_NOSIGNAL	- don't wake on signals
	TASK_LOADAVG	- counts toward loadaverage

	#define TASK_UNINTERRUPTIBLE	(TASK_NOSIGNAL | TASK_LOADAVG)
	#define TASK_INTERRUPTIBLE	TASK_WAKESIGNAL

and then people who wanted to could use other combinations. The
TASK_WAKEKILL thing is useful - there are many loops that cannot exit
until they have a result, simply because the calling conventions require
that. The most common example is disk wait.

HOWEVER, if they are killed by a signal, the calling convention doesn't
matter, and the read() or whatever could just return 0 (knowing that the
process will never see it), and leave a locked page locked. Things like
generic_file_read() could easily use this, and make processes killable
even when they are waiting for a hung NFS mount - regardless of any soft
mount issues, and without NFS having to have special code.

In the end, I'm too lazy, and I don't care. So I can only tell you how it
_should_ be done, and maybe you can tell somebody else until the sucker to
actually do it is found.

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: manipulating sigmask from filesystems and drivers
Original-Message-ID: <Pine.LNX.4.33.0208011315220.12103-100000@penguin.transmeta.com>
Date: Thu, 1 Aug 2002 20:23:18 GMT
Message-ID: <fa.n8ld55v.fnah8e@ifi.uio.no>

On Thu, 1 Aug 2002, David Woodhouse wrote:
>
> torvalds@transmeta.com said:
> >  They should be waiting in TASK_UNINTERRUPTIBLE, and we should add a
> > flag  to distinguish between "increases load average" and "doesn't".
>
> The disadvantage of this approach is that it encourages people to be lazy
> and sleep with signals disabled, instead of implementing proper cleanup
> code.
>
> I'm more in favour of removing TASK_UNINTERRUPTIBLE entirely, or at least
> making people apply for a special licence to be permitted to use it :)

Can't do that.

Easy reason: there are tons of code sequences that _cannot_ take signals.
The only way to make a signal go away is to actually deliver it, and there
are documented interfaces that are guaranteed to complete without
delivering a signal. The trivial case is a disk read: real applications
break if you return partial results in order to handle signals in the
middle.

In short, this is not something that can be discussed. It's a cold fact, a
law of UNIX if you will.

There are enough reasons to discourage people from using uninterruptible
sleep ("this f*cking application won't die when the network goes down")
that I don't think this is an issue. We need to handle both cases, and
while we can expand on the two cases we have now, we can't remove them.

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: manipulating sigmask from filesystems and drivers
Original-Message-ID: <Pine.LNX.4.33.0208011348430.12015-100000@penguin.transmeta.com>
Date: Thu, 1 Aug 2002 20:52:13 GMT
Message-ID: <fa.nalf5lv.fn8g82@ifi.uio.no>

On Thu, 1 Aug 2002, Roman Zippel wrote:
> >
> > In short, this is not something that can be discussed. It's a cold fact, a
> > law of UNIX if you will.
>
> Any program setting up signal handlers should expect interrupted i/o,
> otherwise it's buggy.

Roman, THAT IS JUST NOT TRUE!

Go read the standards. Some IO is not interruptible. This is not something
I'm making up, and this is not something that is up for discussion. The
speed of light in vacuum is 'c', regardless of your own relative speed.
And file reads are not interruptible.

			Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: manipulating sigmask from filesystems and drivers
Original-Message-ID: <Pine.LNX.4.33.0208011430450.1647-100000@penguin.transmeta.com>
Date: Thu, 1 Aug 2002 21:43:30 GMT
Message-ID: <fa.oaqddmv.j4m4b9@ifi.uio.no>

On Thu, 1 Aug 2002, Roman Zippel wrote:

> > Go read the standards. Some IO is not interruptible.
>
> Which standard? Which "some IO"?

Any regular file IO is supposed to give you the full result.

If you write() to a file, and get a partial return value back due to a
signal, there are programs that will assume that the disk is full (there
are also programs that will just lose your data).

This is not "sloppy programming". See the read() system call manual, which
says

     Upon successful completion, read(), readv(), and pread() return the
     number of bytes actually read and placed in the buffer.  The system
     guarantees to read the number of bytes requested if the descriptor
     references a normal file that has that many bytes left before the
     end-of-file, but in no other case.

Note the "The system guarantees to read the number of bytes requested .."
part.
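
[ Editorial sketch, not part of the original message. The C below shows
  the kind of user code that leans on that guarantee: it issues one
  read() per fixed-size record and treats any short count as
  end-of-file. The names read_record() and count_records() are
  hypothetical, and the 256-byte record limit is an arbitrary
  assumption made for the example. ]

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one fixed-size record from a regular file
 * with a single read() call.  A short count is treated as end-of-file,
 * which is the assumption the manual page text permits for normal files.
 * Returns 1 for a full record, 0 for EOF (possibly after a partial
 * record), and -1 on error.
 */
static int read_record(int fd, void *rec, size_t size)
{
	ssize_t n = read(fd, rec, size);

	if (n < 0)
		return -1;
	return (size_t)n == size ? 1 : 0;
}

/* Count the complete records in a file; returns -1 on error. */
static long count_records(const char *path, size_t size)
{
	char rec[256];	/* assumed upper bound on record size */
	long count = 0;
	int fd, r;

	if (size == 0 || size > sizeof(rec))
		return -1;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	while ((r = read_record(fd, rec, size)) == 1)
		count++;
	close(fd);
	return r < 0 ? -1 : count;
}
```

[ If a signal could cut a regular-file read() short, count_records()
  would stop early and silently ignore the rest of the file. ]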

Stop arguing about this. It's a FACT.

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: manipulating sigmask from filesystems and drivers
Original-Message-ID: <Pine.LNX.4.33.0208011538220.1277-100000@penguin.transmeta.com>
Date: Thu, 1 Aug 2002 22:41:59 GMT
Message-ID: <fa.o9q1efv.g4i4j1@ifi.uio.no>

On Thu, 1 Aug 2002, David Woodhouse wrote:
>
> torvalds@transmeta.com said:
> >  Any regular file IO is supposed to give you the full result.
>
> read(2) is permitted to return -EINTR.

It is _not_ allowed to do that for regular UNIX filesystems.

It is allowed to return it for things like pipes, sockets, etc, and for
filesystems that do not have UNIX behaviour.

> Regular file I/O through the page cache is inherently restartable, anyway,
> as long as you're careful about fpos.

It's not the kernel side that is not restartable. It's the _user_ side.
There is 30 _years_ of history on this, and there are programs that have
been programmed to follow the existing documentation.

And the existing documentation says that if you return a partial read from
a normal file, that means EOF for that file.

You may not like it, but that doesn't make it less so. Linux has UNIX
semantics for read(). Linux is not a research project where we change
fundamental semantics just because we don't like it. That's final.

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: manipulating sigmask from filesystems and drivers
Original-Message-ID: <Pine.LNX.4.33.0208011613440.1315-100000@penguin.transmeta.com>
Date: Thu, 1 Aug 2002 23:31:38 GMT
Message-ID: <fa.oaq9cev.j4q5j8@ifi.uio.no>

[ I'm having a real hard time to not shout at the top of my lungs "SHUT
  THE FUCK UP ABOUT THIS ALREADY!" and then running around the offices
  with a chainsaw laughing maniacally. But I'll try. This once. ]

On Fri, 2 Aug 2002, Roman Zippel wrote:
>
> Relying on that the fd will always point to a normal file is only asking
> for trouble.

People _do_ rely on regular files working this way. Wake up and smell the
coffee, you cannot change reality by just arguing about it.

There are cases where you absolutely _have_ to rely on this documented
UNIX behaviour. One example is using a log-file (yes, a _file_, not a
socket or a pipe) that you explicitly opened with O_APPEND, just so that
you can guarantee _atomic_ writes that do not get lost or partially
re-ordered in your log.

And yes, these logging programs are mission-critical, and they do have
signals going on, and they rely on well-defined and documented interfaces
that say that doing a write() to a filesystem is _not_ going to return in
the middle just because a signal came in.

These programs know what they are doing. They are explicitly _not_ using
"stdio" to write the log-file, exactly because they cannot afford to have
the write broken up into many parts (and they do not want to have it
buffered either, since getting the logging out in a timely fashion can be
important).

The only, and the _portable_ way to do this under UNIX is to create one
single buffer, and write it out with one single write() call. Anything
else is likely to cause the file to be interspersed by random
log-fragments, instead of being a nice consecutive list of full log
entries.
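
[ Editorial sketch, not part of the original message. A minimal C
  version of the pattern described above, assuming a 512-byte maximum
  entry: format the whole entry into one buffer, then hand it to the
  kernel with one write() on a descriptor opened with O_APPEND. The
  helper names log_open() and log_entry() are hypothetical. ]

```c
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Open the log so that every write() lands at the current end of file. */
static int log_open(const char *path)
{
	return open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
}

/*
 * Format one complete entry into a single buffer and write it with
 * exactly one write() call.  There is no stdio buffering and no retry
 * loop: a short return is reported as a failure rather than patched up
 * with a second write(), which could interleave with other writers.
 * Returns 0 on success, -1 on error or if the entry did not fit.
 */
static int log_entry(int fd, const char *fmt, ...)
{
	char buf[512];	/* assumed maximum entry size */
	va_list ap;
	int len;

	va_start(ap, fmt);
	len = vsnprintf(buf, sizeof(buf) - 1, fmt, ap);
	va_end(ap);
	if (len < 0 || (size_t)len >= sizeof(buf) - 1)
		return -1;	/* formatting error or truncated entry */
	buf[len++] = '\n';

	return write(fd, buf, (size_t)len) == (ssize_t)len ? 0 : -1;
}
```

[ Usage would look like log_entry(fd, "pid=%d event=%s", pid, "start").
  The sketch is only as good as the guarantee under discussion: it
  assumes write() to a regular file does not return partway through
  because a signal arrived. ]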

Feel free to change Linux to have your stupid preferred semantics where
everybody is supposed to be able to handle signals at any time, but please
re-name it to "Crapix" when you do. Because that is what it would be.
Crap. Utter braindamage.

If people cannot find this in SuS, then I simply don't _care_. I care
about not having a crap OS, and I also care about not having to repeat
myself and give a million examples of why the current behaviour is
_required_, and why we're not getting rid of it.

[ Did profanity help explain the situation? Do people finally understand
  why this _really_ isn't up for discussion? Please don't bother sending
  me any more email about this.  My co-workers are already eyeing me and
  my chainsaw nervously. Thank you for sparing them. ]

			Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: manipulating sigmask from filesystems and drivers
Original-Message-ID: <Pine.LNX.4.44.0208020833110.18265-100000@home.transmeta.com>
Date: Fri, 2 Aug 2002 15:38:46 GMT
Message-ID: <fa.l2thn9v.1m1i21p@ifi.uio.no>

On Fri, 2 Aug 2002, Roman Zippel wrote:
>
> If these programs are so mission-critical, they better do some error
> checking. Your atomic write will fail on a full disk and if the
> information is that important, the program has to handle that.

The logging thing is logging. It's critical for performance tuning, it's
critical for later finding of errors, but it's still secondary.

It's also performance-critical in a very real way.

> I looked around a bit and it doesn't look that portable. UNIX systems seem
> to have this behaviour, but not all POSIX systems.

POSIX is a hobbled standard, and does not matter.

We're not making a "POSIX-compliant OS". People have done that before:
see all the RT-OS's out there, and see even the NT POSIX subsystem.

They are uninteresting.

Linux is a _real_ OS, not some "we filled in the paperwork and it is now
standards compliant".

And being a real OS means taking the real world into account.

And the real world says that it's not acceptable to make up your own
semantics, unless you have some _damn_ good reason for doing so.

Let's turn the tables here. I'm not interested in your "but.." arguments
at all. I've refuted every single one, and I've refuted them with cold
hard facts.

The fact is, there is _zero_ reason to change existing functionality.

We already do this right, and there is no reason to _break_ the fact that
we do it right. Can you come up with a _single_ reason for why we should
break existing standardized binary interfaces?

Binary compatibility is important. As is the larger issue of generic UNIX
compatibility. You had better have some really strong arguments for why
you would think I'd be willing to break compatibility. So far you have had
_no_ arguments for the question "Why?".

			Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: manipulating sigmask from filesystems and drivers
Original-Message-ID: <Pine.LNX.4.44.0208020920120.18265-100000@home.transmeta.com>
Date: Fri, 2 Aug 2002 16:27:02 GMT
Message-ID: <fa.l2tln1v.1m1m29q@ifi.uio.no>

On Fri, 2 Aug 2002, Benjamin LaHaise wrote:
>
> Personally, I think that uninterruptible file io is good, but there needs
> to be an upper limit to the maximum size of the io.  As it stands today,
> someone can do a single multigigabyte read or write that is completely
> uninterruptible (even to kill -9), but could take a minute or more to
> complete.

Actually, if you read my original email on this thread, I suggested
splitting up the "INTERRUPTIBLE" vs "UNINTERRUPTIBLE" into three
different cases and one extra bit.

Sending somebody a SIGKILL (or any signal that kills the process) is
different (in my opinion) from a signal that interrupts a system call in
order to run a signal handler.

So my original suggestion on this thread was to make

        TASK_WAKESIGNAL - wake on all signals
        TASK_WAKEKILL   - wake on signals that are deadly
        TASK_NOSIGNAL   - don't wake on signals
        TASK_LOADAVG    - counts toward loadaverage

        #define TASK_UNINTERRUPTIBLE    (TASK_NOSIGNAL | TASK_LOADAVG)
        #define TASK_INTERRUPTIBLE      TASK_WAKESIGNAL

and then let code like generic_file_write() etc use other combinations
than the two existing ones, ie

	(TASK_WAKEKILL | TASK_LOADAVG)

results in a process that is woken up by signals that kill it (but not
other signals), and is counted towards the loadaverage. Which is what we
want in generic_file_read() (and _probably_ generic_file_write() as well,
but that's slightly more debatable).

(We'd also have to add a new way to test whether you've been killed, so
that such users could use "process_killed()" instead of the
"signal_pending()" that an INTERRUPTIBLE sleeper uses to test whether it
should exit).
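
[ Editorial sketch, not part of the original message. This is a small
  user-space model of the proposal, not kernel code. The bit values,
  struct task_model, should_wake(), counts_toward_loadavg() and the
  TASK_KILLABLE_IO name are all assumptions made for illustration; the
  message only names the four flags, the two existing combinations and
  process_killed(). ]

```c
#include <stdbool.h>

/* Hypothetical bit values; the proposal only names the flags. */
#define TASK_WAKESIGNAL	0x01	/* wake on all signals */
#define TASK_WAKEKILL	0x02	/* wake on signals that are deadly */
#define TASK_NOSIGNAL	0x04	/* don't wake on signals */
#define TASK_LOADAVG	0x08	/* counts toward loadaverage */

#define TASK_UNINTERRUPTIBLE	(TASK_NOSIGNAL | TASK_LOADAVG)
#define TASK_INTERRUPTIBLE	TASK_WAKESIGNAL
/* Hypothetical name for the combination suggested for file reads. */
#define TASK_KILLABLE_IO	(TASK_WAKEKILL | TASK_LOADAVG)

/* Toy stand-in for a sleeping task. */
struct task_model {
	unsigned int state;
	bool pending;	/* some signal is pending */
	bool fatal;	/* a pending signal will kill the process */
};

/* What an INTERRUPTIBLE sleeper tests. */
static bool signal_pending(const struct task_model *t)
{
	return t->pending;
}

/* What a "wake on kill only" sleeper would test instead. */
static bool process_killed(const struct task_model *t)
{
	return t->pending && t->fatal;
}

/* Should the sleeper be woken by its pending signals? */
static bool should_wake(const struct task_model *t)
{
	if (t->state & TASK_WAKESIGNAL)
		return signal_pending(t);
	if (t->state & TASK_WAKEKILL)
		return process_killed(t);
	return false;	/* TASK_NOSIGNAL: signals never wake it */
}

static bool counts_toward_loadavg(const struct task_model *t)
{
	return (t->state & TASK_LOADAVG) != 0;
}
```

[ In this model a task sleeping in (TASK_WAKEKILL | TASK_LOADAVG)
  ignores a signal that only has a handler to run, wakes for a deadly
  one, and still shows up in the load average. ]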

This is the trivial way to get the best of both worlds - you can still
kill a process that is in D wait (if that particular kernel path allows
it), but you don't get process-visible semantic changes.

AND it also allows waiting uninterruptibly without adding to the
loadaverage, for those people who want to do long uninterruptible waits
(which was one of the reasons for starting this whole thread in the first
place - it just got slightly de-railed in the meantime).

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: manipulating sigmask from filesystems and drivers
Original-Message-ID: <Pine.LNX.4.44.0208021023040.914-100000@home.transmeta.com>
Date: Fri, 2 Aug 2002 17:30:00 GMT
Message-ID: <fa.mk3kjkv.1dii19r@ifi.uio.no>

On Fri, 2 Aug 2002, Jamie Lokier wrote:
>
> Linus Torvalds wrote:
> > Sending somebody a SIGKILL (or any signal that kills the process) is
> > different (in my opinion) from a signal that interrupts a system call in
> > order to run a signal handler.
>
> So it's ok to have truncated log entries (or more realistically,
> truncated simple database entries) if the logging program is killed?

This is why I said

 "Which is what we want in generic_file_read() (and _probably_
  generic_file_write() as well, but that's slightly more debatable)"

The "slightly more debatable" comes exactly from the thing you mention.

The thing is, "read()" on a file doesn't have any side effects outside the
process that does it, so if you kill the process, doing a partial read is
always ok (yeah, you can come up with thread examples etc where you can
see the state, but I think those are so contrived as to not really merit
much worry and certainly cause no issues for existing programs).

With write(), you have to make a judgement call. Unlike read, a truncated
write _is_ visible outside the killed process. But exactly like read()
there _are_ system management reasons why you may really need to kill
writers. So the debatable point comes from whether you want to consider a
killing signal to be "exceptional enough" to warrant the partial write.

I can see both sides. I personally think I'd prefer the "if I kill a
process, I want it dead _now_" approach, but this definitely _is_ up for
discussion (unlike the signal handler case).

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: manipulating sigmask from filesystems and drivers
Original-Message-ID: <Pine.LNX.4.44.0208021109210.1108-100000@home.transmeta.com>
Date: Fri, 2 Aug 2002 18:10:32 GMT
Message-ID: <fa.m6e0c2v.14g298r@ifi.uio.no>

On 2 Aug 2002, Trond Myklebust wrote:
>
> Would you therefore be planning on making down() interruptible by
> SIGKILL?

Can't do - existing users know that down() cannot fail.

But we already have a "down_interruptible()", so if we introduce the
notion of "non-interruptible but killable", we can also introduce a
"down_killable()".

		Linus

