From: Linus Torvalds <email@example.com>
Subject: Re: [patch] x86, mm: pass in 'total' to
Date: Mon, 02 Mar 2009 21:17:58 UTC
On Mon, 2 Mar 2009, Nick Piggin wrote:
> I would expect any high performance CPU these days to combine entries
> in the store queue, even for normal store instructions (especially for
> linear memcpy patterns). Isn't this likely to be the case?
None of this really matters.
The big issue is that before you can do any write to any cacheline, if the
memory is cacheable, it needs the cache coherency protocol to synchronize
with any other CPUs that may have that line in the cache.
The _only_ time a write is "free" is when you already have that cacheline
in your own cache, and in an "exclusive" state. If that is the case, then
you know that you don't need to do anything else.
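To make the "free write" condition concrete, here is a minimal sketch in C of a simplified MESI-style state check. The enum and the function name are hypothetical and exist only for illustration. It also assumes that a line already in the modified state counts as exclusively owned, which the text above does not spell out.

```c
#include <stdbool.h>

/* Simplified MESI-style cacheline states (illustrative names, not from any real API). */
enum line_state {
	LINE_INVALID,	/* not present in this CPU's cache */
	LINE_SHARED,	/* present, but other CPUs may hold a copy too */
	LINE_EXCLUSIVE,	/* present, clean, and no other CPU holds it */
	LINE_MODIFIED	/* present, dirty, and no other CPU holds it (assumption: treated as owned) */
};

/*
 * Returns true if a store to a line in state 's' has to involve the
 * cache coherency protocol before it can complete.
 */
static bool store_needs_coherency_traffic(enum line_state s)
{
	switch (s) {
	case LINE_EXCLUSIVE:
	case LINE_MODIFIED:
		/* We already own the line: the write is "free". */
		return false;
	case LINE_SHARED:
	case LINE_INVALID:
	default:
		/* Other CPUs may have the line: invalidate or acquire it first. */
		return true;
	}
}
```

Real protocols have more states (MOESI, MESIF) and more cases, so treat this only as a picture of the rule, not as a model of any particular CPU.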
In _any_ other case, before you do the write, you need to make sure that
no other CPU in the system has that line in its cache. Whether you do that
with a "write and invalidate" model (which would be how a store buffer
would do it or a write-through cache would work), or whether you do it
with an "acquire exclusive cacheline" (which is how the cache coherency
protocol would do it), it's going to end up using cache coherency
bandwidth.
Of course, what will be the limiting factor is unclear. On a single-socket
thing, you don't have any cache coherency issues, and the only bandwidth
you'd end up using is the actual memory write at the memory controller
(which may be on-die, and entirely separate from the cache coherency
protocol). It may be idle and the write queue may be deep enough that you
reach memory speeds and the write buffer is the optimal approach.
On many sockets, the limiting factor will almost certainly be the cache
coherency overhead (since the cache coherency traffic needs to go to _all_
sockets, rather than just one stream to memory), at least unless you have
a good cache coherency filter that can filter out part of the traffic
based on whether it could be cached or not on some socket(s).
IOW, it's almost impossible to tell what is the best approach. It will
depend on number of sockets, it will depend on size of cache, and it will
depend on the capabilities and performance of the memory controllers vs
the cache coherency protocol.
On a "single shared bus" model, the "write with invalidate" is fine, and
it basically ends up working a lot like a single socket even if you
actually have multiple sockets - it just won't scale much beyond two
sockets. With HT or QPI, things are different, and the presence or absence
of a snoop filter could make a big difference for 4+ socket situations.
There simply is no single answer.
And we really should keep that in mind. There is no right answer, and the
right answer will depend on hardware. Playing cache games in software is
almost always futile. It can be a huge improvement, but it can be a huge
deprovement too, and it really tends to be only worth it if you (a) know
your hardware really quite well and (b) know your _load_ pretty well too.
We can play games in the kernel. We do know how many sockets there are. We
do know the cache size. We _could_ try to make an educated guess at
whether the next user of the data will be DMA or not. So there are
unquestionably heuristics we could apply, but I also do suspect that
they'd inevitably be pretty arbitrary.
I suspect that we could make some boot-time (or maybe CPU hotplug time)
decision that simply just sets a threshold value for when it is worth
using non-temporal stores. With smaller caches, and with a single socket
(or a single bus), it likely makes sense to use non-temporal stores
But even with some rough heuristic, it will be wrong part of the time. So
I think "simple and predictable" in the end tends to be better than
"complex and still known to be broken".
Btw, the "simple and predictable" could literally look at _where_ in the
file the IO is. Because I know there are papers on the likelihood of
re-use of data depending on where in the file it is written. Data written
to low offsets is more likely to be accessed again (think temp-files),
while data written to big offsets is much more likely to be random or to
be written out (think databases or simply just large streaming files).
So I suspect a "simple and predictable" algorithm could literally be
- use nontemporal stores only if you are writing a whole page, and the
byte offset of the page is larger than 'x', where 'x' may optionally
even depend on size of cache.
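
As a rough sketch of that rule in C (not actual kernel code): the names, the 4096-byte page size, and the choice of "x = cache size" are all assumptions made up for illustration, and the threshold is exactly the kind of arbitrary knob described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration only */

/*
 * Hypothetical policy for 'x': here simply the cache size in bytes.
 * Any other function of cache size (or a constant) would fit the
 * description equally well.
 */
static uint64_t nt_offset_threshold(uint64_t cache_bytes)
{
	return cache_bytes;
}

/*
 * Use nontemporal stores only for a whole, page-aligned page whose
 * byte offset in the file is larger than the threshold 'x'.
 */
static bool use_nontemporal(uint64_t file_offset, size_t len, uint64_t x)
{
	bool whole_page = (len == SKETCH_PAGE_SIZE) &&
			  (file_offset % SKETCH_PAGE_SIZE == 0);

	return whole_page && file_offset > x;
}
```

With a 2MB cache, a full-page write at offset 8MB would take the nontemporal path, while the same write at offset 4096, or a partial-page write at any offset, would not.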
But removing it entirely may be fine too.
What I _don't_ think is fine is to think that you've "solved" it, or that
you should even try!