From: Linus Torvalds <firstname.lastname@example.org>
Subject: Re: [patch, 2.6.10-rc2] sched: fix ->nr_uninterruptible handling
Date: Wed, 17 Nov 2004 15:54:52 GMT

On Wed, 17 Nov 2004, Ingo Molnar wrote:
> In practice this means it's 10 cycles more expensive on P3-style CPUs.
> (I think on P4 style CPUs it should be much closer to 0, but i havent
> been able to reliably time it there - cycle measurements show a 76
> cycles cost which is way out of line.)

No, it's not way out of line. It _is_ that expensive. It seems to totally
synchronize the trace cache or something.

It shows up very clearly in profiles. Spinlocks and atomic operations are
very expensive even on UP, where there are zero cache bounce issues. I'm
not kidding you - compile an SMP kernel, run it on a UP P4, and check it
out with oprofile.

I did that just a few weeks ago out of idle curiosity about some lmbench

And it's not just P4's either. It shows up on at least P-M's too, even
though that's a PPro-based core like a PIII. It's at least 22 cycles
according to my tests (certainly not 10), but it seems to have a bigger
impact than that, likely because it also ends up serializing the pipeline
around it (ie it makes the instructions _around_ it slower too, something
you don't end up seeing if you just do rdtsc which _also_ serializes).

Try lmbench some time. Just pick a UP machine, compile a UP kernel on it,
run lmbench, then compile a SMP kernel, and run it again. There's
literally a 10-20% performance hit on a lot of things. Just from a
_couple_ of locks on the paths.

Sure, some of it is actual different paths, like the VM teardown, which
on SMP is just fundamentally more expensive because we have to be more
careful. So mmap latency (delayed page frees) and file delete (delayed RCU
delete of dentry) end up having differences of 30-40%. But the 10-20%
differences are things where the _only_ difference really is just the
totally uncontended locking.

That's on a P-M. On a P4 it is _worse_, I think.
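
For a rough feel of the uncontended cost discussed above, here is a minimal
user-space sketch. It is not the oprofile or lmbench measurement described in
this mail, and every name in it (time_plain, time_locked, report, the two
counters) is hypothetical. It assumes a C11 compiler and POSIX
clock_gettime(). It times a whole loop of plain increments against a whole
loop of atomic increments in a single thread, so there is no contention and
the timer overhead is amortized over many operations instead of bracketing
one instruction.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical counters, only ever touched by one thread (no contention). */
static volatile unsigned long plain_counter;
static atomic_ulong locked_counter;

static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* n plain increments; volatile forces a real load/add/store each time. */
uint64_t time_plain(unsigned long n)
{
    uint64_t start;

    plain_counter = 0;
    start = now_ns();
    for (unsigned long i = 0; i < n; i++)
        plain_counter++;
    return now_ns() - start;
}

/* n atomic increments; on x86 this typically becomes a lock-prefixed op. */
uint64_t time_locked(unsigned long n)
{
    uint64_t start;

    atomic_store(&locked_counter, 0);
    start = now_ns();
    for (unsigned long i = 0; i < n; i++)
        atomic_fetch_add(&locked_counter, 1);
    return now_ns() - start;
}

/* Print the average cost per operation for both variants (n must be > 0). */
void report(unsigned long n)
{
    uint64_t plain_ns = time_plain(n);
    uint64_t locked_ns = time_locked(n);

    /* Both loops must have done exactly n increments. */
    assert(plain_counter == n);
    assert(atomic_load(&locked_counter) == n);

    printf("plain:  %.2f ns/op\n", (double)plain_ns / (double)n);
    printf("atomic: %.2f ns/op\n", (double)locked_ns / (double)n);
}
```

Calling report(10000000UL) from a main() prints the two averages; multiply by
the clock frequency in GHz to get approximate cycles. A tight loop like this
only shows the cost of back-to-back atomic operations. It will not show the
slowdown of unrelated instructions around a lock, which is the effect that
only a profile of a real workload exposes.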