Newsgroups: fa.linux.kernel
From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: context switch vs. signal delivery [was: Re: Accelerating user mode linux]
Original-Message-ID: <ail2qh$bf0$1@penguin.transmeta.com>
Date: Mon, 5 Aug 2002 05:36:20 GMT
Message-ID: <fa.k1162hv.5mq5i9@ifi.uio.no>

In article <m3u1mb5df3.fsf@averell.firstfloor.org>,
Andi Kleen  <ak@muc.de> wrote:
>Ingo Molnar <mingo@elte.hu> writes:
>
>
>> actually the opposite is true, on a 2.2 GHz P4:
>>
>>   $ ./lat_sig catch
>>   Signal handler overhead: 3.091 microseconds
>>
>>   $ ./lat_ctx -s 0 2
>>   2 0.90
>>
>> ie. *process to process* context switches are 3.4 times faster than signal
>> delivery. Ie. we can switch to a helper thread and back, and still be
>> faster than a *single* signal.
>
>This is because the signal save/restore does a lot of unnecessary stuff.
>One optimization I implemented at one time was adding a SA_NOFP signal
>bit that told the kernel that the signal handler did not intend to
>modify floating point state (few signal handlers need FP). It would
>not save the FPU state then, which yielded quite a noticeable speedup
>in signal latency.
>
>Linux got a lot slower at signal delivery when the SSE2 support was
>added. This hack got that speed back.

This will break _horribly_ when (if) glibc starts using SSE2 for things
like memcpy() etc.

I agree that it is really sad that we have to save/restore FP on
signals, but I think it's unavoidable. Your hack may work for you, but
it just gets really dangerous in general. Having signals randomly and
subtly corrupt some SSE2 state just because the signal handler uses
something like memcpy() (without even realizing that that could lead to
trouble) is bad, bad, bad.

In other words, "not intending to" does not imply "will not".  It's just
potentially too easy to change SSE2 state by mistake.
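
To make the failure mode concrete, here's a minimal userspace sketch
(assuming Linux/x86-64 and gcc inline asm - the handler's movsd stands in
for an SSE memcpy). It prints the right answer only because the kernel
saves and restores the SSE state around the handler, which is exactly the
work SA_NOFP would skip:

#include <signal.h>
#include <stdio.h>

static void handler(int sig)
{
        /* Clobber %xmm0, the way an SSE-optimized memcpy() might. */
        static const double garbage = -1.0;

        (void)sig;
        __asm__ volatile("movsd %0, %%xmm0" : : "m"(garbage) : "xmm0");
}

int main(void)
{
        double in = 3.14159, out = 0.0;

        signal(SIGUSR1, handler);

        /* Put a known value in %xmm0, take a signal, read it back. */
        __asm__ volatile("movsd %0, %%xmm0" : : "m"(in) : "xmm0");
        raise(SIGUSR1);
        __asm__ volatile("movsd %%xmm0, %0" : "=m"(out));
        printf("%f\n", out);            /* prints 3.141590 */
        return 0;
}

(Caveat: %xmm0 is caller-saved in the x86-64 ABI, so the test also relies
on raise() itself not touching it - true in practice, and exactly the kind
of "true in practice" that SA_NOFP would turn into silent corruption.)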

And yes, this signal handler thing is clearly visible on benchmarks.
MUCH too clearly visible.  I just didn't see any safe alternatives
(and I still don't ;( )

		Linus


Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: context switch vs. signal delivery [was: Re: Accelerating user mode linux]
Original-Message-ID: <Pine.LNX.4.44.0208050922570.1753-100000@home.transmeta.com>
Date: Mon, 5 Aug 2002 16:39:34 GMT
Message-ID: <fa.m7f8dqv.17gi8gs@ifi.uio.no>

On Mon, 5 Aug 2002, Jamie Lokier wrote:

> Linus Torvalds wrote:
> > I agree that it is really sad that we have to save/restore FP on
> > signals, but I think it's unavoidable.
>
> Couldn't you mark the FPU as unused for the duration of the
> handler, and let the lazy FPU mechanism save the state when it is used
> by the signal handler?

Nope. Believe me, I gave some thought to clever things to do.

The kernel won't even _see_ a longjmp() out of a signal handler, so the
kernel has a really hard time trying to do any clever lazy stuff.

Also, people who play games with FP actually change the FP data on the
stack frame, and depend on signal return to reload it. Admittedly I've
only ever seen this on SIGFPE, but anyway - this is all done with integer
instructions that just touch bit patterns on the stack. The kernel can't
catch it sanely.
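
To make that concrete, a sketch of the trick (assuming Linux/x86-64 and
the glibc <sys/ucontext.h> layout): the handler rewrites the saved %xmm0
image with a plain memcpy of bit patterns, and sigreturn reloads it.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <ucontext.h>

static void handler(int sig, siginfo_t *si, void *ctx)
{
        ucontext_t *uc = ctx;
        double repl = 2.71828;

        (void)sig; (void)si;
        /* Rewrite the saved image of %xmm0 with integer stores only -
           no FP instruction executes in the handler itself. */
        memcpy(&uc->uc_mcontext.fpregs->_xmm[0], &repl, sizeof(repl));
}

int main(void)
{
        struct sigaction sa;
        double val = 3.14159;

        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGUSR1, &sa, NULL);

        __asm__ volatile("movsd %0, %%xmm0" : : "m"(val) : "xmm0");
        raise(SIGUSR1);
        __asm__ volatile("movsd %%xmm0, %0" : "=m"(val));
        printf("%f\n", val);            /* prints 2.718280 */
        return 0;
}

The kernel never sees anything but ordinary memory writes here, which is
why it can't lazily skip the reload.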

> For sophisticated user space uses, like the above, I'd like to see
> a trap handling mechanism that saves only the _minimum_ state.

I would not mind an extra per-signal flag that says "don't bother with FP
saves" (the same way we already have "don't restart" etc), but I would be
very nervous if glibc used it by default (even if glibc doesn't use SSE2
in memcpy, gcc itself can do it, and obviously _users_ may just do it
themselves).

So it would have to be explicitly enabled with a SA_NOFPSIGHANDLER flag or
something.
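
Usage would look something like this - with the caveat that the flag is
entirely hypothetical, bit value and all:

#include <signal.h>
#include <string.h>

/* Hypothetical: this flag was never added, and the bit is made up. */
#define SA_NOFPSIGHANDLER       0x01000000

/* The contract: must not touch FP/SSE state, not even indirectly via
   library calls or compiler-generated SSE code. */
static void tick(int sig)
{
        (void)sig;
}

int main(void)
{
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = tick;
        sa.sa_flags = SA_RESTART | SA_NOFPSIGHANDLER;
        sigaction(SIGALRM, &sa, NULL);
        return 0;
}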

(And yes, it's the FP stuff that takes most of the time. I think the
lmbench numbers for signal delivery tripled when that went in).

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: context switch vs. signal delivery [was: Re: Accelerating usermode linux]
Original-Message-ID: <Pine.LNX.4.44.0208051317480.11693-100000@home.transmeta.com>
Date: Mon, 5 Aug 2002 20:24:54 GMT
Message-ID: <fa.l3t7nqv.1n143hl@ifi.uio.no>

On Mon, 5 Aug 2002, Oliver Neukum wrote:
>
> > Also, people who play games with FP actually change the FP data on the
> > stack frame, and depend on signal return to reload it. Admittedly I've
> > only ever seen this on SIGFPE, but anyway - this is all done with integer
> > instructions that just touch bitpatterns on the stack.. The kernel can't
> > catch it sanely.
>
> Could the FP state be put on its own page, and the dirty bit
> evaluated in the decision whether to restore FPU state?

I'm sure anything is _possible_, but there are a few problems with that
approach. In particular, playing VM games tends to be quite expensive on
SMP, since you need to make sure that the TLB entry for that page is
invalidated on all the other CPUs before you insert the FPU page.

Also, you'd need to play games with dirty bit handling, since the page
_is_ dirty (it contains FP data), so the VM must know to write it out if
it pages things out. That's ok - we have separate per-page and
per-TLB-entry dirty bits anyway, but right now the VM layer knows it can
move the TLB entry dirty bit into the per-page dirty bit and drop it -
which wouldn't be the case if we also had an FPU dirty bit.

That's fixable - we could just make a "software TLB dirty bit" that is
updated whenever the hardware TLB dirty bit is cleared and moved into the
per-page dirty bit.

But the end result sounds rather complicated, especially since all the
page table walking necessary to set this all up is likely to be about
as expensive as the thing we're trying to avoid.
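
For concreteness, the sigreturn-side decision would have to look roughly
like this (pure pseudo-kernel code - every helper here is invented):

/* The expensive part is fpu_page_was_dirtied(): it needs the page
   table walk, and on SMP the pte dirty bit is only trustworthy once
   the other CPUs' TLB entries for the page have been shot down. */
static void maybe_restore_fpu(struct task_struct *tsk, void __user *fpu_page)
{
        if (fpu_page_was_dirtied(tsk->mm, fpu_page))
                restore_fpu_from_sigframe(tsk, fpu_page);
        /* else: the handler never wrote the FP area, skip the reload */
}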

Rule of thumb: it almost never pays to be "clever".

		Linus



Newsgroups: fa.linux.kernel
From: Linus Torvalds <torvalds@transmeta.com>
Subject: Re: context switch vs. signal delivery [was: Re: Accelerating user mode linux]
Original-Message-ID: <Pine.LNX.4.44.0208050910420.1753-100000@home.transmeta.com>
Date: Mon, 5 Aug 2002 16:22:27 GMT
Message-ID: <fa.m6uudiv.170o88u@ifi.uio.no>

On 5 Aug 2002, Andi Kleen wrote:
>
> I think the possibility at least for memcpy is rather remote. Any sane
> SSE memcpy would only kick in for really big arguments (for small
> memcpys it doesn't make any sense at all because of the context save /
> possible reformatting penalty overhead). So only people doing really
> big memcpys could possibly be hurt, and that is rather unlikely.

And this is why the kernel _has_ to save the FP state.

It's the "only happens in a blue moon" bugs that are the absolute _worst_
bugs. I want to optimize the kernel until I'm blue in the face, but the
kernel must NEVER EVER have a "non-stable" interface.

Signal handlers that don't restore state are hard as _hell_ to debug. Most
of the time it doesn't really matter (unless the lack of restore is
something really major like one of the most common integer registers), but
then depending on what libraries you use, and just _exactly_ when the
signal comes in, you get subtle data corruption that may not show up until
much later.

At which point your programmer wonders if he mistakenly wandered into
MS-Windows land.

No thank you. I'll take slow signal handlers over ones that _sometimes_
don't work.

> After all Linux should give you enough rope to shoot yourself in the foot ;)

On purpose, yes. It's ok to take careful aim, and say "I'm now shooting
myself in the foot".

And yes, it's also ok to say "I don't know what I'm doing, so I may be
shooting myself in the foot" (this is obviously the most common
foot-shooter).

And if you come to me and complain about how drunk you were, and how you
shot yourself in the foot by mistake due to that, I'll just ignore you.

BUT - and this is a big BUT - if you are doing everything right, and you
actually know what you're doing, and you end up shooting yourself in the
foot because the kernel was taking a shortcut, then I think the kernel is
_wrong_.

And I'd rather have a slow kernel that does things right, than a fast
kernel which screws with people.

> In theory you could do a superhack: put the FP context into an unmapped
> page on the stack and only save it on lazy-FPU activity or on access to
> the unmapped page.

That would be extremely interesting, especially with signal handlers that
do a longjmp() thing.

The real fix for a lot of programs on x86 would be for them to never ever
use FP in the first place, in which case the kernel would be able to just
not save and restore it at all.

However, glibc fiddles with the FPU at startup, even for non-FP programs.
Dunno what to do about that.

		Linus



From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
Date: Tue, 26 Jul 2005 21:53:46 UTC
Message-ID: <fa.hekf6kt.hne20f@ifi.uio.no>
Original-Message-ID: <Pine.LNX.4.58.0507261438540.19309@g5.osdl.org>

On Tue, 26 Jul 2005, Chuck Ebbert wrote:
>
>  Since fxsave leaves the FPU state intact, there ought to be a better way
> to do this, but it gets tricky.  Maybe use the TSC to put a timestamp in
> every thread's save area?

We used to have totally lazy FP saving, and not touch the FP state at
_all_ in the scheduler except to just set the TS bit.

It worked wonderfully well on UP, but getting it working on SMP is a major
pain, since the lazy state you want to switch back into might be cached in
some other CPU's registers, so we never did it on SMP. Eventually it got
too painful to maintain two totally different logical code paths between
UP and SMP, and some bug or other ended up resulting in the current "lazy
on a time slice level" thing, which works well on SMP too.
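
From memory, the UP version of that fully lazy scheme was basically this
(a sketch in the style of the old i386 code - helper names approximate):

/* Context switch: leave the FP registers alone, just set CR0.TS so
   the next FP instruction traps. */
#define unlazy_fpu_switch()     stts()

static struct task_struct *fpu_owner;   /* whose state is in the regs */

/* #NM (device-not-available) trap: the first FP instruction the new
   task executes lands here, and only then do we pay anything. */
asmlinkage void math_state_restore(void)
{
        struct task_struct *tsk = current;

        clts();                         /* make FP usable again */
        if (fpu_owner == tsk)
                return;                 /* state never left this CPU */
        if (fpu_owner)
                save_fpu(fpu_owner);    /* fnsave into its task struct */
        restore_fpu(tsk);               /* frstor/fxrstor from ours */
        fpu_owner = tsk;
}

On SMP, fpu_owner is per-CPU, and the state tsk last used may still be
live in some _other_ CPU's registers - which is the pain described above.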

Also, a lot of the cost is really the save, and before SSE2 the fnsave
would clear the FPU state, so you couldn't just do a save and try to elide
just the restore in the lazy case. With SSE2 (and fxsave) we _could_ try
to do that, but the thing is, I doubt it really helps.

First off, 99% of all programs don't hit the nasty case at all, and for
something broken like volanomark that _does_ hit it, I bet that there is
more than one thread using the FP, so you can't just cache the FP state
in the CPU _anyway_.

So we could enhance the current state by having a "nonlazy" mode like in
the example patch, except we'd have to make it a dynamic flag. That could
either be done by explicitly marking binaries we want to be non-lazy, or
by just dynamically noticing that the rate of FP restores is very high.
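
Roughly like this (a sketch - the field and threshold are invented, the
stts/clts/restore_fpu names as in the old lazy path):

#define FPU_EAGER_THRESHOLD     5       /* made-up number */

/* At context switch: */
if (next->fpu_restore_rate > FPU_EAGER_THRESHOLD) {
        clts();
        restore_fpu(next);      /* hot FP user: skip the #NM trap */
} else {
        stts();                 /* normal case: stay lazy */
}

/* ..with math_state_restore() bumping next->fpu_restore_rate on every
   trap, and the scheduler decaying it once per timeslice. */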

Does anybody really care about volanomark? Quite frankly, I think you'd
see a _lot_ more performance improvement if you could instead teach the
Java stuff not to use FP all the time, so it feels a bit like papering
over the _real_ bug if we'd try to optimize this abnormal and silly case
in the kernel.

		Linus
