Index Home About Blog
From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Wed, 25 Apr 2007 19:40:56 UTC
Message-ID: <fa.VhcI/UySZ4DfxFF65q30JAspB2I@ifi.uio.no>

On Wed, 25 Apr 2007, Adrian Bunk wrote:
> >
> > .. but if the alternative is a feature that just isn't worth it, and
> > likely to not only have its own bugs, but cause bugs elsewhere? (And yes,
> > I believe STD is both of those. There's a reason it's called "STD". Go
> > to google and type "STD" and press "I'm feeling lucky". Google is God).
>
> Is there really no use case for STD?

People seem to have reading comprehension problems.

The STD code is buggy, and has introduced bugs in STR too, largely thanks
to bad design. Some of them have happily gotten fixed. Others did not, and
now we have three totally different versions (two of which share some
infrastructure), all of which are broken (ie the "suspend2" people will
swear up-and-down that swsusp doesn't work for them, but anybody who
thinks that "suspend2" will work for everybody is just being a total
idiot, and I have a bridge to sell to them).

I'd actually be happier *removing* STD support in the sense it is now:
it's way too closely integrated with STR, even though it has absolutely
nothing in common with it. When you STD, you'e actually much closer to a
*shutdown* than to STR, yet the STD code continually seems to want to be
in the "suspend" path, as shown even by its name.

So my objections to STD have nothing to do with saving state and shutting
down. They have everything to do with the fact that it is not - and will
never be - a "suspend", and it shouldn't affect suspend.

And that's a *fundamental* problem. If the STD people cannot even realize
that they have less to do with "suspend" than to "reboot", how do you ever
expect them to get anything to work, and not affect other things
negatively?

Yeah, I'm down on it. I'm down on it because every person involved with
the whole STD thing seems to have basically zero taste, and a total
inability to work with anybody else.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Wed, 25 Apr 2007 20:10:00 UTC
Message-ID: <fa.WaMu/NBrZAoJcpiUjgajIuB5b2E@ifi.uio.no>

On Wed, 25 Apr 2007, Kenneth Crudup wrote:
>
> Any working suspend-to-disk method takes care of that for me.  (I'm
> really not sure why Linus hates S2D so much, though. Back in the day
> there was a lot more BIOS support, but that's been years now.)

The really sad part is that APM actually did this better..

It's not often you can say that, and APM didn't do diddly-squat for
run-time power management, but when it comes to suspend-to-disk, APM
actually did ok.

I think you could do STD right too, but:

 - if you think it's about suspending devices, you are immediately
   disqualified. If you call the device driver "suspend" or "resume"
   functions, you are doing something wrong.

 - "suspend" is: snapshot memory, and anything you do *after* the snapshot
   is totally irrelevant. You MUST NOT suspend devices before, since
   devices are what that snapshot should be written out to, and you MUST
   NOT suspend devices afterwards either, because that shows that you are
   a moron who didn't understand the "machine will be turned off" part.

 - "resume" is basically: get image into memory, turn *off* every device,
   put image into its proper location, and call the "startup" function. If
   you call a device "resume()" function, you again show that you are a
   moron, because you're not resuming anything at all, you're resetting
   the device from scratch. You _reinitialize_ the device. You don't
   resume it, and somebody may have (and indeed, *WILL HAVE* used the
   device in between). There should be absolutely zero shared code, and
   the *last* thing you should do is to call the device with the same
   function, and give it a flag to tell it to do one thing or the other.

The important thing to take away from this is that it has nothing to do
with "suspend" or "resume" at any level what-so-ever. Not at a device
level, not at a system level, and not even when it comes to hardware. But
for completely idiotic and wrong reasons, it is currently intimately
involved in suspend/resume, and calls the same device management functions
as a suspend/resume thing does, and shares a lot of the code.

And THAT is why I hate the kernel STD. It is fundamentally confused. In
ways that APM was not, I'd like to point out.

I'd love to get it fixed. But the first fix is to not call it "suspend",
because language *does* matter, and using that term is what I'm convinced
has confused so many people.

If it had been called "snapshot + restore", I suspect a lot of people
wouldn't have been so confused about what it does and how it needs to do
it, and wouldn't have tried to shoehorn it into the same corner of the
kernel as "suspend-to-ram" (where you really *can* do things like
"suspend" devices, and while they might certainly lose power in between,
they also really might not, and they certainly won't be *doing* things in
between).

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Wed, 25 Apr 2007 20:26:44 UTC
Message-ID: <fa.A98UNmboUJfxAyWsiifF3W+3y4Q@ifi.uio.no>

On Wed, 25 Apr 2007, Rafael J. Wysocki wrote:
>
> Please ask anyone who's worked with me if he's had any problem with that.
> If anyone say I'm unable to work with anybody else, I'd say you're right.  Till
> then, I feel offended.

I'll apologise (and virtually kiss your hairy feet) if you could actually
show me a single implementation that people can agree on.

But until then, I claim that the suspend-to-disk people cannot work with
each other.

And no, "three different implementations" doesn't cut it. Even _two_ is
too much. We need to get *rid* of something, not add more.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Wed, 25 Apr 2007 20:45:46 UTC
Message-ID: <fa.2tcuvNRyCOFeitpMRgV5G+BVS2A@ifi.uio.no>

On Wed, 25 Apr 2007, Pavel Machek wrote:
>
> Can I get you on IRC somewhere? No, I do not think I'm a moron, and
> yes, I need to suspend^Wsnapshot the devices before, so I have that in
> the snapshot. Of course, I'll need to resume^Wrestore the devices
> before writing snapshot. That's okay, it does not take long.

You do NOT need to "suspend" the devices, and that's the whole point.

You may want to save the device info somewhere, BUT THAT IS SOMETHING
TOTALLY DIFFERENT!

This is *exactly* the confusion I'm talking about. The STD and STR
codepaths try to use the same function for two TOTALLY DIFFERENT things.

STR actually wants to "suspend".

STD actually wants to "atomic snapshot", and it must not allow allocations
or anything like that, because the whole snapshot image should be done
atomically as one event. But it should *not* suspend, because that device
may actually be needed afterwards.

So not the same thing at all.

So here's what "suspend()" wants:
 - suspend() - preparatory work, can error out, can delay, can park the
   disk, etc etc.
 - suspend_late() - called late, with interrupts disabled, should actually
   suspend if the early suspend didn't do it already

And here is what "snapshot()" wants:
 - prepare_to_snapshot() (for memory allocation)
 - snapshot() - called late, with interrupts disabled, save state.

and there is absolutely _zero_ overlap between them. There just isn't
anything in common. Yes, both are two-phase (for the simple reason that
both want an "atomic" part), but there's really no real overlap.

Just trying to *make* them be the same operations is just going to
introduce flags that then cause them to be totally different *and*
confusing and generate bugs. It also means that people do one of them, and
"it works" for that case, and the other case is totally broken, but it's
not obvious, because doing one means that the system _thinks_ that you did
both!

In the very unlikely case that some driver actually *wants* to use the
same function for snapshots and suspending, that driver could just go
ahead and _use_ the same function pointer. But now, as things are set up,
we force a total confusion on drivers by calling them through the same
interface for two totally different things.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Wed, 25 Apr 2007 22:20:05 UTC
Message-ID: <fa.adZrEyU33HI8/eYBWtP/xtFPhmw@ifi.uio.no>

On Wed, 25 Apr 2007, Pavel Machek wrote:
>
> Not the same... but they are still related. "freeze" (for atomic
> snapshot) is actually subset of "suspend"... freeze needs DMAs off and
> saved state, and you need DMAs off and saved state for "suspend".

THEY HAVE ABSOLUTELY NOTHING IN COMMON!

Nobody in their right mind thinks that "disable DMA" and "suspend" are
similar operations.

> So it is actually correct to do "suspend" when you want "freeze"; it
> is just slow. That's why they only differ in parameter these days.

It is *not* correct to "suspend" when you want "freeze".

I don't understand how you can even *claim* something like that.

Here's a trivial example:
 - SCSI disk

Tell me, what does "suspend" do, and what does "freeze" (snapshot) do?

And name *one* thing that have in common.

I'll tell you: Nada. Zero. Zilch. Nothing.

"Freeze" for a disk is a total no-op. There is no DMA, there is no
nothing. In contrast, "suspend" for a disk is a totally valid operation.

Anybody who claims that these two operations are "related" is a moron.

I'm sorry Pavel, but that's exactly how it is.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Wed, 25 Apr 2007 22:57:13 UTC
Message-ID: <fa.QUUrlFzfGb3Vp9l+UK4SqDx2qvo@ifi.uio.no>

On Thu, 26 Apr 2007, Nigel Cunningham wrote:
> >
> > And name *one* thing that have in common.
>
> Set/reset the scsi transaction id thingy? Hibernation didn't work with
> SCSI for a long time precisely because that support was missing.

And by "hibernation", you mean what? You mean "snapshot + shutdown",
right?

Think about it for five seconds, and then ask yourself: at which point in
the "snapshot + shutdown" sequence would you actually tell a disk to shut
down?

If you said "snapshot", then you'd be *wrong*.

That's my _point_. The snapshot() function should not (and MUST NOT) tell
disks to shut down, because unlike suspend(), we're still going to _use_
those disks afterwards (why? To write out the snapshot image!).

In other words, the act of creating a snapshot has *nothing* to do with
suspend.

Now, after you've created (and written out) the snapshot, what do you
actually end up doing?

That's right - you end up _shutting down_ the machine, and yes, as part
of the _shutdown_ sequence you may actually end up doing a lot of the
things that a suspend would do. But that's long *after* you've actually
done the "snapshot" part, and has absolutely nothing to do with it.

That's where I started: whole "suspend to disk" thing actually has _more_
to do with "shutdown" than with "suspend".

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Wed, 25 Apr 2007 22:59:32 UTC
Message-ID: <fa.ktpsp0humzc45z+CLyutLXhQvNk@ifi.uio.no>

On Thu, 26 Apr 2007, Pavel Machek wrote:
>
> Suspend syncs caches/spins down. Freeze does not do anything.
>
> That's okay, I keep claiming "freeze" is subset of "suspend". Can you
> name device where that is not true?

Sure. Like just about any PCI device that doesn't do things on its own.

A "freeze" does nothing at all, or perhaps shuts down the reader side
(for something like a network controller).

A "suspend" does "write D3 to the suspend register". Absolutely zero in
common.

> Remember we do
>
> suspend(PMSG_FREEZE)
> atomic snapshot
> resume()
> write snapshot.

AND THAT IS STUPID. It mixes up "suspend()" and creating a snapshot in
ways that are totally idiotic. There is nothing in common!

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Wed, 25 Apr 2007 23:02:03 UTC
Message-ID: <fa.w9y67C39Dj5xTrsoM/NGNO42rAU@ifi.uio.no>

On Wed, 25 Apr 2007, Chuck Ebbert wrote:
>
> Freeze is a subset of suspend, isn't it? (It might be an empty subset
> in some cases.)

NO IT IS NOT!

Yes, you are parroting Pavel, but he can say it a million times, and it's
*still* not true.

That's like saying "read() is a subset of write(), isn't it?" On many
devices, they share some of the setup, like writing the same sector
registers with the same values.

Does that make them subsets of each other?

Or does it mean that they *may* use some of the same common helper
functions for some devices?

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Wed, 25 Apr 2007 23:11:53 UTC
Message-ID: <fa.nPEjPYbo0ETRKtKgUpazcfuWKSQ@ifi.uio.no>

On Thu, 26 Apr 2007, Pavel Machek wrote:
> >
> > I don't understand how you can even *claim* something like that.
>
> BTW most problems are in thaw/resume functions.

And do you realize that the thaw/resume functions are totally different
too?

Or rather, they *would* be, if you allowed them to.

For example, for "snapshot + thaw", the _sane_ thing is to actually make
the snapshot just throw away all the DMA tables etc, and let the thawing
just do a full initialization (as it did on boot). It basically needs to
do that anyway, and it simplifies the whole thing (ie you don't even
*want* to save things like the DMA command queues etc - the ones that will
quite often be stepped on by the final "write snapshot to disk" stuff
anyway).

For suspend to ram, in contrast, since you *know* that nobody will be
touching the hardware, and since the timings are very different anyway
(you'd hope that you can resume in a second or two), you'd generally want
to keep the DMA engine tables right where they are, and just literally
suspend the PCI chip itself.

See? Again, *nothing* in common.

You think they have things in common just because your whole (incorrect)
mindset has _forced_ them to have things in common, becasue your setup
stupidly thinks that "resume" is the same as "thaw", the same way you
think "freeze" is the same as "suspend".

NEITHER is true. You've _made_ them true in your mind, but there's
absolutely zero reason that they *should* be true.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Wed, 25 Apr 2007 23:21:32 UTC
Message-ID: <fa.S6UbQbdgdXrMZ/722gaBaAkrVLk@ifi.uio.no>

On Wed, 25 Apr 2007, Alan Cox wrote:
>
> Both of them have to ensure you can make a consistent snapshot.

Bzzt. Wrong again. Very much so.

STR does not need to "ensure that you have a consistent snapshot".

Why? Because there is no _room_ for inconsistency. There's nothing to be
"inconsistent with", since any changes to memory (by things like DMA or
other setup that happens while the suspend process is going on) is by
_definition_ consistent with the resume image (becasue there is no
separate image).

> Doing that means you've got to be able to define a single "point" at
> which the snapshot is made and is internally self-consistent. That in
> both cases tends to mean you've got to ensure nothing occurs which pees
> on the image while you are making that snapshot (such as outstanding
> O_DIRECT I/O to user pages).

Get off the drugs, Alan. There *is* no snapshot with suspend-to-ram.

Which is the whole point I'm trying to make! A _lot_ of people are
confused about this.

With suspend-to-ram, you don't need to do a damn thing to the chip, except
suspend it and resume it. There are _zero_ consistency issues. There is no
need to freeze anything at any point. You can suspend each device totally
independently of all other devices (taking into account things like bus
topology, of course), and there is no "atomic" snapshot that needs to ever
exist.

That's TOTALLY DIFFERENT from "suspend to disk". In suspend to disk, you
need a completely different kind of mindset, namely you need a single
consistent image, where the image is consistent not only with memory, but
with all the devices.

For example, the whole myth that "freeze" needs to shut off DMA is a total
and utter *myth*. It needs nothing of the sort at all. Rather than shut
off DMA and try to make the hardware be wevy wevy quiet while it's hunting
wabbits, it's a lot easier to just do nothing at all on "freeze", and just
make sure that "thaw" will re-initialize the DMA tables entirely! All
drivers have code to do that anyway, since that's what you need to do at
boot.

Notice?  Totally different. Absolutely NOTHING in common. Not on a
practical plane, and not even conceptually.  The current (broken!)
implementation has forced a totally idiotic model on things, where instead
of snapshotting doing the sane and simple thing, it ends up doing extra
work that is totally unnecessary, but *becomes* necessary just because it
*also* shares the "resume" path (which should _not_ be the same either!)

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Wed, 25 Apr 2007 23:31:08 UTC
Message-ID: <fa.t57eh0CMKkWoGAaEbatrCB2s5RM@ifi.uio.no>

On Thu, 26 Apr 2007, Pavel Machek wrote:
>
> Current design is:

Broken. Yes. I've tried to tell you.

> Twice. Once during snapshot (then we spin it up when the snapshot is
> done), and once during shutdown.

And nobody can possibly say that is "sane". But it's a direct result of
the incorrect thinking that "suspend()" and "snapshot()" have anything
what-so-ever to do with each other.

> Yep, we optimize away spindown, because it takes too long, so SCSI
> disks are actually very bad example.

No. SCSI disks are a *good* example. It's an example of how you
(incorrectly) call the same function for two totally different things, and
then that function is smart enough that it *understands* that they are
totally different.

But the *confusion* remains. It remains in your head, and you've poisoned
people like Alan too, that usually are not confused. And THAT is the main
problem (although there are also indirect problems like "fixing one may
break the other", but I actually think that the fundamental problem is the
confusion it creates, which in turn causes bugs to happen because people
are confused and think that they should do the same thing for suspend and
for snapshot).

> No, I'd like you to understand that we actually CAN tell the disks to
> spin down, because we'll call resume and spin them back again before
> writing the image. We used to do it. We still can do it, but it is
> slow.
>
> Yes, it is quite confusing.

It's worse than just confusing, it's *idiotic*.

It _can_ work in practice, but
 - we have pretty damn solid evidence that it doesn't work all that often
   in practice
 - the fact that something *can* be done the stupid way is in no way an
   argument that it *should* be done the stupid way.

I claim that the current STD is *stupid*. Yes, it can work. But that
doesn't make it less stupid.

What's your argument? Your argument seems to be that it's not stupid,
because it can work. Can't you see that that simply isn't an argument at
all? "stupid and wrong" doesn't mean "cannot work in theory". But it
*does* mean that people get confused, and it *does* mean that there are
likely more bugs, because confused people do not tend to write very good
code.

I'm not claiming that the current code cannot work. It clearly *does*
work for a lot of people. But I'm claiming that it's STUPID.

So don't argue that "it works". Windows works, kind of. That doesn't make
it less stupid and badly designed!

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Wed, 25 Apr 2007 23:58:37 UTC
Message-ID: <fa.tK3gbCfAIBtgrxhLu0aPM8D7Tmc@ifi.uio.no>

On Thu, 26 Apr 2007, Pavel Machek wrote:
>
> > For suspend to ram, in contrast, since you *know* that nobody will be
> > touching the hardware, and since the timings are very different anyway
> > (you'd hope that you can resume in a second or two), you'd generally want
> > to keep the DMA engine tables right where they are, and just literally
> > suspend the PCI chip itself.
>
> I'd actually prefer resume to be similar to module insert, too... Do
> you think that resume is _that_ time critical?

I think it probably depends on the device, and it should depend on the
driver writer how he wants to do it.

My _point_ is that there is absolutely zero reason to think that the two
events are the same. We *know* that for snapshot+shutdown, we need to
actually keep the DMA tables intact *over* the snapshot (because writing
out the snapshot may _need_ them). But exactly because we keep them
intact, a driver writer may sanely say "I didn't even bother shutting them
down, so on thaw, I cannot trust them, so I'll just re-initialize them
entirely".

In contrast, over suspend-to-ram, it's entirely reasonable to just leave
them in memory, and just keep them. There's no *reason* not to.

And that's my whole point in this argument: the two paths are
fundamentally totally different. You *claim* that "snapshot()" needs to
stop DMA etc, but that's simply not true.

So I claim:
 - for a lot of devices, it's actually a *lot* easier to just have
   snapshot not do anything at all, and re-initialize on thaw
 - for those same devices, for s2ram, since the tables are *safe* and
   don't get touched by anything else, it's probably easier to just let
   them be.

See? The "it's easier to do X" is a _different_ X for the two cases.

So the whole "suspend is a superset of freeze" is simply not true.

> [I'd like you to drop me a line saying you understand current design
> and that it works -- even if it is very inelegant]

I _do_ understand the current design. I just think that it's totally
and seriously broken. I've ranted against it before. I think it's stupid
to play like you're "suspending" something just to save some state,
especially since most users probably don't even *want* to suspend the
state, and would quite happily re-initialize the chip instead.

And I think it's horrible to have a dynamic flag to tell the difference
between two or more state changes that the devices should statically be
able to determine. _If_ some driver really does have the same routine,
just use the same routine. There are no downsides to splitting them up.

> Now, we can separate suspend/freeze and resume/thaw (with some common
> helpers). It will speed the code up by avoiding unneccessary
> operations. It also needs attetion from driver writers (ouch).
>
> Do we want to do that?

I'd personally certainly want to do that. But I want to split up the
callers too. Right now we mix those a lot as well. I suspect that would
automatically be fixed by just forcing them to separate out (since they
now call different functions of the devices), but I'm not 100% sure. There
might be other issues.

Just as an example: one of the most painful things there is in the suspend
sequence is that we shut off the console (because the console device will
be suspended in hw, and it's thus not safe to use it over a suspend/resume
sequence). That should just go away entirely for "snapshot()", because
there is *never* any excuse for actually turning off the console during a
snapshot: even a network console should be entirely functional.  Things
like that - things that sound like small issues, but that really aren't.

(Right now you can enable the "don't disable the console" config option,
but since network drivers will actually shut down etc, it just means that
you'll have oopses etc if you do, and you have netconsole enabled)

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Thu, 26 Apr 2007 00:35:31 UTC
Message-ID: <fa.XJc2qHrv9AYt1yoCOiR6xG8qLGc@ifi.uio.no>

On Wed, 25 Apr 2007, Linus Torvalds wrote:
>
> The *thaw* needs to happen with devices quiescent.

Btw, I sure as hell hope you didn't use "suspend()" for that. You're
(again) much better off having a totally separate function that just
freezes stuff.

So in the "snapshot+shutdown" path, you should have:

 - prepare_to_snapshot() - allocate memory, and possibly return errors

   We can skip this, if we just make the rule be that any devices that
   want to support snapshotting must always have the memory required for
   snapshotting pre-allocated. Most devices really do allocate memory for
   their state anyway, and the only real reason for the "prepare" stage
   here is because the final snapshot has to happen with interrupts off,
   obviously. So *if* we don't need to allocate any memory, and if we
   don't expect to want to accept some early error case, this is likely
   useless.

 - snapshot() - actually save device state that is consistent with the
   memory image at the time. Called with interrupts off, but the device
   has to be usable both before and afterwards!

And I would seriously suggest that "snapshot()" be documented to not rely
on any DMA memory, exactly because the device has to be accessible both
before and after (before - because we're running and allocating memory,
and after - because we'll be writing things out). But see later:

For the "resume snapshot" path, I would suggest having

 - freeze(): quiesce the device. This literally just does the absolute
   minimum to make sure that the device doesn't do anything surprising (no
   interrupts, no DMA, no nothing). For many devices, it's a no-op, even
   if they can do DMA (eg most disk controllers will do DMA, but only as
   an actual result of a request, and upper layers will be quiescent
   anyway, so they do *not* need to disable DMA)

   NOTE! The "freeze()" gets called from the *old* kernel just _before_ a
   snapshot unpacking!!

 - restart_snapshot() - actually restart the snapshot (and usually this
   would involve re-setting the device, not so much trying to restore all
   the saved state. IOW, it's easier to just re-initialize the DMA command
   queues than to try to make them "atomic" in the snapshot).

   NOTE! This gets called by the *new* kernel _after_ the snapshot resume!

And if you *want* to, I can see that you might want to actually do a
"unfreeze()" thing too, and make the actual shapshotting be:

	/* We may not even need this.. */
	for_each_device() {
		err = prepare_to_snapshot();
		if (err)
			return err;
	}

	/* This is the real work for snapshotting */
	cli();
	for_each_device()
		freeze(dev);
	for_each_device()
		snapshot(dev);
	.. snapshot current memory image ..
	for_each_device_depth_first()
		unfreeze(dev);
	sti();

and maybe it's worth it, but I would almost suggest that you just make the
rule be that any DMA etc just *has* to be re-initialized by
"restart_snapshot()", in which case it's not even necessary to
freeze/unfreeze over the device, and "snapshot()" itself only needs to
make sure any non-DMA data is safe.

But adding the freeze/unfreeze (which is a no-op for most hardware anyway)
might make things easier to think about, so I would certainly not *object*
to it, even if I suspect it's not necessary.

Anyway, the restore_snapshot() sequence should be:

	/* Old kernel.. Normal boot, load snapshot image */
	cli()
	for_each_device()
		freeze(dev);
	restore_snapshot_image();
	restore_regs_and_jump_to_image();
	/* noreturn */


	/* New kernel, gets called at the snapshot restore address
	 * with interrupts off and devices frozen, and memory image
	 * constsntent with what it was at "snapshot()" time
	 */
	for_each_dev_depth_first()
		restore_snapshot(dev);
	/* And if you want to, just to be "symmetric"

		for_each_dev_depth_first()
			unfreeze(dev)

	   although I think you could just make "restore_snapshot()"
	   implicitly unfreeze it too..
	 */
	sti();
	/* We're up */

and notice how *different* this is from what happens for s2ram. There
really isn't anything in common here. Exactly because s2ram simply doesn't
_have_ any of the issues with atomic memory images.

So s2ram is just

	for_each_dev()
		suspend(dev);
	cli();
	for_each_dev()
		late_suspend(dev);
	.. go to sleep ..
	for_each_dev_depth_first()
		early_resume(dev);
	sti();
	for_each_dev_depth_first()
		resume(dev);

and has none of the "freeze" issues at all.

Doesn't that seem a lot more straightforward? Yes, it's more functions,
but each function is a lot more obvious. This follows the unix rule of "do
one thing, and do that thing well", instead of trying to make one function
do many very different things depending on what you actually want done..

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Thu, 26 Apr 2007 00:39:05 UTC
Message-ID: <fa.jc+oMLxWYUhdSgpSiT7YYPb1cfU@ifi.uio.no>

On Thu, 26 Apr 2007, Pavel Machek wrote:
>
> Ok, I guess I'll have nightmares of DMA controllers doing DMAs from
> chips that are no longer there tonight.

Umm. Welcome to the 21st century: we don't do that "separate DMA
controller" thing any more. All devices do their own DMA.

> Only the fact that we are currently using same device call during
> snapshot() and during restore(). We obviously could do _5_ device
> calls
>
> (suspend/resume/freeze/quiesce_disable_dma/thaw)
>
> ...but that looks like too many calls to me.

I'd much rather have five or even more functions that each do *one*
obvious thing.

Think like a device driver writer: would you prefer to just implement five
functions that do something very specific that you know trivially how to
do ("I know how to disable interrupts and DMA") or would you want to do
some high-level operation that you don't even know why the caller asks you
to suspend? What does "suspend()" even mean when the caller is just going
to wake up up immediately again? Is it performance-critical? Should I tear
down all my DMA's? I dunno!

In other words, splitting things up actually makes things simpler. That's
*doubly* true if you can then give each specific function some really
clear goals.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Thu, 26 Apr 2007 01:12:26 UTC
Message-ID: <fa.qIqlJgDf4G/TLAbdlch6JUhJPHI@ifi.uio.no>

On Thu, 26 Apr 2007, Alan Cox wrote:
>
> You bet there is. We need to know if data arrived or not, because there
> is no guarantee that the data retrieved if we inadvertently re-execute a
> command will be the same. The hardware state itself isn't the problem,
> its the combination of hardware state and internal state which need to
> match in some cases.

... which is why "suspend()" suspends the hardware.

Is that so hard to understand?

Once the hardware is suspended, it's not doing anything.

But STR doesn't have any need for atomicity guarantees _between_devices_.

That's a really *fundamental* difference.

The reason s2ram is *so* different from snapshot-to-disk is exactly the
fact that s2ram can (and does) work on one device at a time.

In contrast, snapshot-to-disk needs to snapshot all the devices
*together*, since it has a separate disk image.

See? Two *totally* different cases. They have *nothing* in common. Not the
call sequence, not the logic, not *anything*.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Thu, 26 Apr 2007 02:06:49 UTC
Message-ID: <fa.VmdV6z2SPxeZh+nhMpMRCEQfhZI@ifi.uio.no>

On Thu, 26 Apr 2007, Nigel Cunningham wrote:
>
> That's where I think you're overstretching the argument. Like suspend
>(to ram), we're concerned at the snapshot point with getting the hardware
>in the same state at a later stage.

Really, no.

"suspend to ram" doesn't _have_ a "snapshot point".

I've tried to explain this multiple times, I don't know why it's not
apparently sinking in. This is much more fundamental than the fact that
you don't want to stop disks for snapshotting, although it really boils
down to all the same issues: the operations are simply not at all the
same!

I agree 100% that "snapshot to disk" is a "snapshot event". You have to
create a single point in time when everything is stable. And I'd much
rather call it "snapshot to disk" than "suspend to disk" to make it clear
that it's something _totally_ different from "suspend".

Because the thing is, "suspend to ram" is *not* a snapshot event. At no
point do you actually need to "snapshot" the system at all. You can just
gradually shut more and more things down, and equally gradually bring them
back up. There simply is *never* any "snapshot" time from a device
standpoint, because you can just shut down devices in the right order AND
YOU ARE DONE.

Really.

[ Obviously s2ram does have one "magic moment", namely the time when the
  CPU does the magic read from the northbridge that actually turns off
  power for the CPU. But that's really a total non-event from a device
  standpoint, so while it's undoubtedly a very interesting moment in the
  suspend sequence, it's not really relevant in any way for device
  drivers in general. Not at all like the "snapshot moment" that requires
  the whole system to be totally quiescent in a "snapshot to disk"! ]

And the reason s2ram doesn't have a that "snapshot" moment is exactly that
the RAM contents are just always there, so there's no need to have a
"synchronization event" when ram and devices match. The RAM will *always*
match whatever any particular device has done to it, and the proper way to
handle things is to just do a simple per-device "save-and-suspend" event.

And yes, the _individual_ "save-and-suspend" events obviously needs to be
"atomic", but it's purely about that particular individual device, so
there's never any cross-device issues about that.

For example, if you're a USB hub controller, which is just about the most
complex issue you can have, you obviously want to "save the state" with
the controller in a STOPPED state, but that should just go without saying:
if the controller isn't stopped, you simply *cannot* save the state, since
the state is changing under you.

The difference is, that the USB driver needs to just "stop, save, and
suspend" as one simple operation for s2ram. In contrast, when doing
snapshot to disk, it cannot do that, because while it does want to do the
"stop" part, it needs to do so _separately_ from the "save" part because
you need to stop everything else *too* before you "save" anythng at all.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Thu, 26 Apr 2007 02:34:24 UTC
Message-ID: <fa.jozrX0FyWp9XuPkr5PbfIi+s6tU@ifi.uio.no>

On Wed, 25 Apr 2007, H. Peter Anvin wrote:
>
> That was the 1990s.  On a brand new server system:
>
> 00:08.0 System peripheral: Intel Corporation 5000 Series Chipset DMA
> Engine (rev b1)
>
> For better or worse, slave DMA seems to be making a comeback of sorts.
> Not to mention all kinds of embedded crap^Whardware with optimized DMA
> engines which look nothing like PCI at all.

Well, the solution to that tends to be to just leave them be, and hold
them on until the very end - and just ignore them (and just make-believe
that it's actually the device itself that does the DMA transfer).

The PCI spec for controlling DMA is really pretty nasty. You can disable
it in the PCI config word, of course, but that usually just messes up the
device entirely.

So in practice, the way to shut up DMA (regardless of whether it's an
internal DMA engine or an external one) is that you just tell the device
not to listen any more (for example, for a network controller, the way to
make sure it doesn't do DMA is just to make sure that you're not sending
any frames, but also that it's not listening to any either)!

So whether it's internal to the device, or some "system DMA controller",
the sequence for shutting down DMA always ends up being the same:

 - make sure the host itself doesn't generate any new traffic (eg shut
   down the send-queue). This is generally a higher-level thing anyway, ie
   not really a driver decision.
 - the driver needs to tell the hardware to stop listening (ie "stop
   scanning the command mailboxes" or "stop walking USB command
   structures" or "stop receiving data")
 - the driver then needs to wait for the controller to say "ok, I'm idle".

because regardless of whether it's the system DMA controller or some
on-chip DMA controller, you generally can *not* just say "stop
transferring DMA data", because that will generally just lock the chip up
or cause other major unhappiness.

So I don't think an external DMA controller (like the i8237, ugh!) really
_changes_ anything. Except for just the horrible pain of serializing
access to them for programming etc horrible resource handling issues, of
course (but that's not specific to suspend/resume).

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Thu, 26 Apr 2007 03:05:09 UTC
Message-ID: <fa.nGBKVfmKQ6k5lCF/FbgsokYXKoU@ifi.uio.no>

On Thu, 26 Apr 2007, Nigel Cunningham wrote:
>
> Sorry. I wasn't clear. I wasn't saying that suspend to ram has a
> snapshot point. I was trying to say it has a point where you're seeking
> to save information (PCI state / SCSI transaction number or whatever)
> that you'll need to get the hardware into the same state at a later
> stage. That (saving information) is the point of similarity.

Yes, they do both save information, but I'm not actually convinced they
would necessarily even save the *same* information.

Let's just take an example of USB, and to make things more interesting,
say that the disk you want to suspend to is itself over USB (not
necessarily something you _want_ to do, but I think we can all agree that
it's something that should potentially work, no?)

Now, USB devices actually have per-connection state (at a minimum, the
"toggle" bit or whatever), and that's obviously something that will
inevitably *change* as a result of the device being used after
snapshotting (and even if not used, by the rediscovery by the first kernel
to boot), and we fundamentally cannot put the final toggle state in the
snapshot.

So in the snapshot-to-disk scenario, there are some pieces of data that
simply fundamentally *cannot* be snapshotted, because they are not
controller state, they are "connection" state.

So in that case, you basically know that you *have* to rebuild the
connection when you do the "snapshot_resume()" thing. So there's no point
in even keeping these kinds of connection states (the same is true of
keyboards, mice, anything else - it's how USB works).

In contrast, in suspend-to-RAM, USB connections might just be things you
actually want to keep open and active, and you *can* do so, in ways you
simply cannot do with "snapshot to disk". In fact, if you are something
like an OLPC and actually go to s2ram very aggressively, you might well
want to keep the connection established, because it's conceivable that you
might otherwise lose keypresses etc issues)

See? There are real *technical* reasons to believe that the two "save
state" operations are really fundamentally different. There are reasons to
believe that a s2ram can actually happen while keeping some connections
open that cannot be kept open over a disk snapshot.

Do they *have* to be different? Of course not. For many devices the "save"
and "freeze" operations will likely all be no-ops, and there would be
absolutely no difference between suspending and snapshotting, because the
driver state already natively contains all the information needed to get
the device going again.

Equally, I don't doubt that in many drivers you'll have very similar "save
state" logic, but in fact I believe that in many cases that "save state"
logic will often just be a simple

	pci_save_state(dev);

call, so it's literally the case that they will not be just shared between
the "suspend" and "snapshot" case, they'll be shared across all simple PCI
devices too!

But that doesn't mean that the functions to do so should be the same. You
might have

	static int mypcidevice_suspend(struct pci_dev *dev)
	{
		pci_save_state(dev);
		pci_set_power_state(dev, PCI_D3);
		return 0;
	}

	static int mupcidevice_snapshot(struct pci_dev *dev)
	{
		pci_save_state(dev);
		return 0;
	}

and who cares if they both have that same call to a shared "save state"
function? They're still totally different operations, and the fact that
*some* devices may save the same things doesn't make them any more
similar! See above why some devices might save totally *different* things
for a "snapshot" vs a "suspend" event.

> I suppose that's another point of similarity - for snapshotting, the
> same ordering is probably needed?

I agree that you're likely to walk the device list in the same order. The
whole "shut down leaf devices first", "start up root devices first" is
pretty fundamental.

But that's true of reboot and device discovery too. Should that ordering
mean that we should use the "discovery()" function and pass it a flag and
say "you shouldn't discover, you should snapshot or suspend now"? No.
Everybody agrees that device discovery is something different from device
suspend. The fact that it's done in a topological order and thus they bear
some kind of inverse relationship to each other doesn't make them "the
same".

> > And yes, the _individual_ "save-and-suspend" events obviously needs to be
> > "atomic", but it's purely about that particular individual device, so
> > there's never any cross-device issues about that.
>
> No interdependencies? I'm not sure.

Well, we pretty much count on it, since we will *suspend* the devices at
the same time. So if they had interdependencies that aren't described by
the ordering we enforce, they are pretty much screwed anyway ;)

So yes, the device list needs to be topologically sorted (and you need to
walk it in the right direction), but apart from that we'd *better* not
have any interdependencies, or we simply cannot suspend at all.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Thu, 26 Apr 2007 16:04:06 UTC
Message-ID: <fa.03VHua2Kj+89QziR35uGArG+LHs@ifi.uio.no>

On Thu, 26 Apr 2007, Alan Cox wrote:
>
> > The PCI spec for controlling DMA is really pretty nasty. You can disable
> > it in the PCI config word, of course, but that usually just messes up the
> > device entirely.
>
> And some devices ignore it. Some of the older Cyrix stuff I have appears
> not to care how the master bit is set.

I'm not surprised. If the choice is between locking up the PCI bus by
hanging the device in endless retries, or just ignoring the bit, I suspect
"just ignore it" is actually the better choice.

Of course, in a perfect world you'd happily honor it, raise a PCI error,
and all is good, but in practice the internal state machine of most
non-trivial hardware is simply so complicated that the "abort gracefully"
simply isn't an option.

The hw people have enough problems in getting things to work when
everything is peachy and well, and a lot of companies end up releasing
stuff with known errata for even the _normal_ cases, just because they
expect software to work around them ("Doctor, doctor, it hurts when I do
the documented access!" "You didn't read errata #317, did you? Don't do
that, then!")

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Thu, 26 Apr 2007 16:11:53 UTC
Message-ID: <fa.J8Eqi2UcWvZcuUK9jMgWViNj0y4@ifi.uio.no>

On Thu, 26 Apr 2007, Mark Lord wrote:
> Linus Torvalds wrote:
> >
> > See? Two *totally* different cases. They have *nothing* in common. Not the
> > call sequence, not the logic, not *anything*.
>
> Except that both methods cannot rely upon hot-pluggable devices
> still being present on resume/restore.  It is exceptionally common
> to unplug all USB/firewire cables, mouse, keyboard, docking cables etc..
> after a machine is in S2R state.

Right, and that has nothing to do with suspend/resume. You'd better be
able to handle unexpected hotplugs _regardless_.

For example, it's quite common that people just "remove" the
pcmcia/cardbus card while the driver is active. And in fact, when that
happens, it's also quite common that the hardware raises the irq for that
(active) driver (in fact, it's more than common: since the "card removal"
interrupt for the Cardbus controller is generally always the same as the
"card interrupt" interrupt for the low-level card driver, you can pretty
much *guarantee* that you get that interrupt).

So the end result is that the interrupt handler and all normal IO routines
for a hotpluggable piece of hardware basically _have_ to be able to
gracefully handle the "oops, the hw simply isn't there any more" case!

The resume code isn't any different at all. It should run perfectly
normally, but for hotpluggable devices, it has to follow all the same
rules: handle the "oops, the hw is gone" case gracefully.  No different,
and it's totally unrelated to suspend/resume: it's a *generic* issue.

In fact, suspend/resume is better off than a lot of the other code is,
simply because it's easier to test that case and know you hit that
particular sequence! It's much harder to verify that the "send packet"
case is safe, because how are you going to know to remove the card at the
right point to trigger it?

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Thu, 26 Apr 2007 16:57:45 UTC
Message-ID: <fa.Ht1tMkUUV2hivVikgAfotx3VGY0@ifi.uio.no>

On Thu, 26 Apr 2007, Nigel Cunningham wrote:
>
> * Doing things in the right order? (Prepare the image, then do the
> atomic copy, then save).

I'd actually like to discuss this a bit..

I'm obviously not a huge fan of the whole user/kernel level split and
interfaces, but I actually do think that there is *one* split that makes
sense:

 - generate the (whole) snapshot image entirely inside the kernel

 - do nothing else (ie no IO at all), and just export it as a single image
   to user space (literally just mapping the pages into user space).
   *one* interface. None of the "pretty UI update" crap. Just a single
   system call:

	void *snapshot_system(u32 *size);

   which will map in the snapshot, return the mapped address and the size
   (and if you want to support snapshots > 4GB, be my guest, but I suspect
   you're actually *better* off just admitting that if you cannot shrink
   the snapshot to less than 32 bits, it's not worth doing)

User space gets a fully running system, with that one process having that
one image mapped into its address space. It can then compress/write/do
whatever to that snapshot.

You need one other system call, of course, which is

	int resume_snapshot(void *snapshot, u32 size);

and for testing, you should be able to basically do

	u32 size;
	void *buffer = snapshot_system(&size);
	if (buffer != MAP_FAILED)
		resume_snapshot(buffer, size);

and it should obviously work.

And btw, the device model changes are a big part of this. Because I don't
think it's even remotely debuggable with the full suspend/resume of the
devices being part of generating the image! That freeze/snapshot/unfreeze
sequence is likely a lot more debuggable, if only because freeze/unfreeze
is actually a no-op for most devices, and snapshotting is trivial too.

Once you have that snapshot image in user space you can do anything you
want. And again: you'd have a fully working system: not any degradation
*at*all*. If you're in X, then X will continue running etc even after the
snapshotting, although obviously the snapshotting will have tried to page
a lot of stuff out in order to make the snapshot smaller, so you'll likely
be crawling.

> * Mulithreaded I/O (might as well use multiple cores to compress the
> image, now that we're hotplugging later).
> * Support for > 1 swap device.
> * Support for ordinary files.
> * Full image option.
> * Modular design?

I'd really suggest _just_ the "full image". Nothing else is probably ever
worth supporting. Your "snapshot to disk" wouldn't be _quite_ as simple as
"echo disk > /sys/power/state", but it should not necessarily be much
worse than

	snapshot_kernel | gzip -9 > /dev/snapshot

either (and resuming from the snapshot would just be the reverse)!

And if you want to send the snapshot over a TCP connection to another
host, be my guest. With pretty images while it's transferring. Whatever.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Thu, 26 Apr 2007 17:09:40 UTC
Message-ID: <fa.+yQh1U2MrUR1/5ambKxRyx1AV4E@ifi.uio.no>

On Thu, 26 Apr 2007, Linus Torvalds wrote:
>
> Once you have that snapshot image in user space you can do anything you
> want.

Side note: the exception, of course, is page out more. The swap device has
to be read-only.

We actually have support for that mode (it's how "swapoff" works: it marks
swap devices as not accepting _new_ entries, even though old entries are
still valid). So you can have a fully running system, with 99% of memory
swapped out, and still guarantee that you won't swap out anything *more*
(which would destroy the swap image, which you don't want, since it's
where a lot of the memory may end up being, in order to make the snapshot
itself as small as possible)!

Anybody who cares can look at the code that messes with the the
SWP_WRITEOK flag. You'd basically swap out enough to make the snapshot
image fit comfortably in memory, and then you'd clear SWP_WRITEOK on all
swap devices and return to user space. Or something very close to that.

But the point here is that we should actually really be able to have a
fully working system, even _after_ we created the snapshot. I don't even
think you should need any "initrd only" kind of situation.

If somebody can do that, with just those two system calls, I'll remove
every other suspend-to-disk wannabe from the kernel in a heartbeat. I may
have missed something subtle, of course, but I really *think* it should be
doable.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Thu, 26 Apr 2007 17:35:01 UTC
Message-ID: <fa.qg1r8rf9SBDvAmtrMZZ//SHEXPw@ifi.uio.no>

On Thu, 26 Apr 2007, Xavier Bestel wrote:
>
> Won't there be problems if e.g. X tries to write something to its
> logfile after snapshot ?

Sure. But that's a user-level issue.

You do have to allow writing after snapshotting, since at a minimum, you'd
want the snapshot itself to be written. So the kernel has to be fully
running, and support full user space. No "degraded mode" like now.

So when I said "fully running user mode", I really meant it from the
perspective of the kernel - not necessarily from the perspective of the
"user". You do want to limit _what_ user mode does, but you must not limit
it by making the kernel less capable.

Remounting mounted filesystems read-only sounds like a good idea, for
example. We can do that. We have the technology. But we shouldn't limit
user space from doing other things (for example, it might want to actually
*mount* a new filesystem for writing the snapshot image).

For example, right now we try to "fix" that with the whole process freezer
thing. And that code has *caused* more problems than it fixed, since it
tries to freeze all the kernel threads etc, and you simply don't have a
truly *working*system*.

I think it's fine to freeze processes if that is what you want to do (for
example, send them SIGSTOP), but we freeze them *too*much* right now, and
the suspend stuff has taken over policy way too much. We don't actually
leave the system in a runnable state. I can almost guarantee that you'd be
*better* off having the snapshot taking thing do a

	kill(-1, SIGSTOP);

in user space than our current broken process freezer. At least that
wouldn't have screwed up those kernel threads as badly as swsusp did.

And no, I'm not saying that my suggestion is the only way to do it. Go
wild. But the *current* situation is just broken. Three different things,
none of which people can agree on. I'd *much* rather see a conceptually
simpler approach that then required, but even more important is that right
now people aren't even discussing alternatives, they're just pushing one
of the three existing things, and that's simply not viable. Because I'm
not merging another one.

In fact, I personally feel that I shouldn't even have merged
userspace-swsusp, but if Andrew thinks it needs to be merged, my personal
feelings simply don't matter that much. I have to trust people. But yes,
as far as *I* am personally concerned, I think it was a mistake to merge
it.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: suspend2 merge (was Re: [Suspend2-devel] Re: CFS and suspend2:
Date: Thu, 26 Apr 2007 17:44:34 UTC
Message-ID: <fa.YUfyQkniEPJYP9EwUs8ydTCTNrE@ifi.uio.no>

On Thu, 26 Apr 2007, Josh Triplett wrote:
>
> http://www.gettysfamily.org/wordpress/?p=32
> http://www.gettysfamily.org/wordpress/?p=33
>
> You cannot resume "too fast"; suspend to RAM should become so fast that you
> use it without even thinking about it.

Yes. I think the OLPC is a prime example of why suspend-to-ram should
basically try to not do any setup at all: leave all the DMA command
structures etc in memory, leave even things like USB URB's pending, just
"stop the actual controllers and suspend it".

If you want to just save power by shutting down the CPU, it's a great idea
to leave the NIC powered up etc, so that you don't have to renegotiate the
link speed etc, and that you can go into s2ram and come back in less than
a second.

Then, if you really want to *sleep*, you can have a s2ram script that
actually turns off devices for real and requires more tear-down and setup,
but having something that tries to get in and out of the processor sleep
modes as fast as possible is definitely interesting.

Obviously, it's *more* interesting on the OLPC than on many other machines
(since that one can keep its display refreshed while it's sleeping), so it
may well be that the OLPC is one extreme example that isn't very realistic
for some other uses, but I think it's one of the things we want to
support.

(And for me personally, I'd love to have all my machines "sleep" by
default, but wake up by ethernet and keyboard - I'd love for my screen
saver to literally put the machine to sleep, but not have to worry about
touching a keyboard - just ssh'ing into them should still wake up. It's
*technically* doable, but it's just a pain to do right now)

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Thu, 26 Apr 2007 20:46:34 UTC
Message-ID: <fa.B99O/xZx3eo0/EWSSRF8lgFFc/E@ifi.uio.no>

On Fri, 27 Apr 2007, Nigel Cunningham wrote:
>
> Perhaps you should try to make an alternative yourself instead of
> pushing us into making something we don't believe will work (my case) or
> have already done but in a way you don't like (Rafael). Don't talk about
> Pavel cutting code. He's just acking/nacking what Rafael sends him.

I've done that in the past (USB, PCMCIA - screw the maintainers, redo
it basically from scratch). But the thing is, I'm totally uninterested
personally in the whole disk-snapshotting, so I'm not likely to do it
there.

But yes, I'm actually hoping that some new person will come in with a new
idea. The current people seem to be too set in "their" corners, and I
don't expect that to really change.

Quite honestly, I don't foresee any of the current tree approaches really
doing something new and obviously better, unless somebody new steps in.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Thu, 26 Apr 2007 23:17:09 UTC
Message-ID: <fa.GBjaWq+/S1NCuJc+zv+FccYx3xc@ifi.uio.no>

On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
>
> Well, I think that much of what Linus is saying indicates that he hasn't tried
> to write any such thing himself. ;-)

That's definitely true. The only interaction I ever had with "hibernation"
(and yes, we should just call it that) is when I was working on s2ram and
cleaning up the PCI device suspend/resume in particular, and trying
(_mostly_ successfully - I think I broke it once or twice mainly due to
interactions with the console, but on the whole I think it mostly worked)
to not break hibernation in the process without actually running it.

> Now that I know much more than before, I can say I agree with Linus on his
> opinion about the separation of s2ram form the snapshot/restore functionality
> (I'll call it 'hibernation' for simplicity from now on).

So my strong opinion on it literally comes from the other end (ie _not_
knowing about hibernation, but trying to work with s2ram, and cursing the
mixups).

> It should be done, because it would make things simpler and cleaner.
> Still, it will be difficult to do without screwing users en masse and
> that's my main concern here.

I do agree. It will inevitably affect a lot of devices. That's always
painful.

> I don't agree that we don't need the tasks freezer for suspending and
> hibernation.  We need it, because we need to be sure that the (other) tasks
> will not get us in the way, and that also applies to kernel threads (and I
> don't think the tasks freezer is 'screwing' them, BTW).

I actually feel much less strongly about that, because just separating out
s2ram and hibernate entirely from each other would already really get the
thing _I_ care about taken care of - being able to work on one of the
other without fear of breaking the other one.

And besides, I actually came into the whole discussion because I'm not a
huge fan of thinking that user-land is "better". If the thing can sanely
be done in kernel, I'm actually all for that. What drives me wild is
having three different things, and nobody driving.

It needs somebody who (a) cares (b) has good taste and (c) has enough time
and personal karma to burn that he can actually take the (obviously)
inevitable heat from just doing things right, and convincing people to
select *one* implementation.

That kind of person is really really hard to find. And if you're it,
you're in for some pain ;)

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [linux-pm] driver power operations (was Re: suspend2 merge)
Date: Fri, 27 Apr 2007 15:44:36 UTC
Message-ID: <fa.Rj+WQ52KWfYaHc7oahqIIPEWU8Q@ifi.uio.no>

On Fri, 27 Apr 2007, Johannes Berg wrote:
>
> Eh, the point I actually wanted to make is that many drivers don't care
> for the irqs disabled case and would have to add code to exclude it.

You really *really* want to do a two-phase thing, at least for the case I
care about. Whether that snapshotting thing does or not, I could care
less.

There's a damn good reason why the kernel uses

	/* phase 1 */
	for_each_dev()
		dev->suspend(dev);

	cli();
	/* phase 2 */
	for_each_dev()
		dev->suspend_late(dev);

(and the reverse case on resume).

The reason is simply that there are two totally different cases: things
like disks etc want to spin down and do slow and high-level operations,
while things like USB controllers and console devices do *not* want to be
suspended early, because if you do, you lose debuggability.

So some things really *really* want to be done when they know that there
isn't anything else going on any more, and they want to delay the shutdown
to the very end. While other things really *require* that they can send
requests that can take time, and cannot run with interrupts disabled.

I actually think that "snapshot" is totally different, exactly because for
snapshotting, the slow operations like spinning down disks etc probably
don't really even exist, and would always be no-ops. But who knows..

Anyway, I do have a final comment:

     DO NOT PASS "STATE FLAGS" TO DRIVERS

(or, even worse, assume that drivers would test "implicit" state by
calling the same function under two different states, and then have the
drivers test for "are interrupts disabled? Then I need to do something
else").

If drivers are possibly going to do two different things, make it two
different entry-points. There's absolutely no downsides. It's _clearer_ to
the device writer when he gets called two different ways that it's not the
same case, and in case a particular device can do the same thing for both
cases, he can just set the function pointer to the same entry for both.

Never EVER pass dynamic flags that modify behaviour. It's simply bad
programming. A function should do *one* thing, and do it well.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [linux-pm] driver power operations (was Re: suspend2 merge)
Date: Fri, 27 Apr 2007 15:54:07 UTC
Message-ID: <fa.P91/vvs0D+zEily7H7mGCvXTCbc@ifi.uio.no>

On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
>
> I think we can use 'stages' and pass them as arguments to the functions.

No, no NOOOO!

If you use stages, just describe them in the function name instead.

> quiesce(PREPARE) -- that may be needed for drivers that allocate much memory
> before quiescing devices (if any)
> ...
> quiesce(PRE_SNAPSHOT)
> ...
> quiesce(PRE_SNAPSHOT_IRQ_OFF)

There is *no* advantage to this (and _lots_ of disadvantages) compared to
saying

	dev->snapshot_prepare(dev);
	dev->snapshot_freeze(dev);
	dev->snapshot(dev)

The latter is
 - more readable
 - MUCH easier for programmers to write readable code for (if-statements
   and case-statements are *by*definition* more complicated to parse both
   for humans and for CPU's - static information is good)
 - allows for the different stages to have different arguments, and
   somewhat related to that, to have better static C type checking.

Look here, which one is more readable:

	int some_mixed_function(int arg)
	{
		do_one_thing();
		if (arg == SLEEP)
			do_another_thing();
		else
			do_yet_another_thing();
	}

or

	int do_sleep(void)
	{
		do_one_thing();
		do_another_thing();
	}

	int prepare_to_sleep(void)
	{
		do_one_thing();
		do_yet_another_thing();
	}

and quite frankly, while the second case may take more lines of code,
anybody who says that it's not clearer what it does (because it can
"self-document" with function names etc) is either lying, or just a really
bad programmer. The second case is also likely faster and probably not
larger code-size-wise either, since it does static decisions _statically_
(since all callers are realistically going to use a constant argument
anyway, and the argument really is static).

Finally, the second case is *much* easier to fix, exactly because it
doesn't mix up the cases. You can change the arguments, you can have
totally different locking, you don't need things like

	int gfp = (arg == SLEEP) ? GFP_ATOMIC : GFP_KERNEL;

etc, and it's just more logical.

So don't overload a function. That's the *bug* with the current
"dev->suspend()" interface already. Don't re-create it. The current one
overloads two *totally*different* operations onto one function.

Just don't do it. Not in the suspend part, not *ever*.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Fri, 27 Apr 2007 21:45:52 UTC
Message-ID: <fa.u2rxPXSk/IKYmUpePR0iOHKxX2c@ifi.uio.no>

On Fri, 27 Apr 2007, Rafael J. Wysocki wrote:
>
> Why do you think that keeping the user space frozen after 'snapshot' is a bad
> idea?  I think that solves many of the problems you're discussing.

It makes it harder to debug (wouldn't it be *nice* to just ssh in, and do

	gdb -p <snapshotter>

when something goes wrong?) but we also *depend* on user space for various
things (the same way we depend on kernel threads, and why it has been such
a total disaster to try to freeze the kernel threads too!). For example,
if you want to do graphical stuff, just using X would be quite nice,
wouldn't it?

But I do agree that doing everything in the kernel is likely to just be a
hell of a lot simpler for everybody.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Fri, 27 Apr 2007 22:09:52 UTC
Message-ID: <fa.jj5zT/KT6z/pRuTljjUqPLTqWgA@ifi.uio.no>

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>
> We're freezing many of them just fine. ;-)

And can you name a _single_ advantage of doing so?

It so happens, that most people wouldn't notice or care that kmirrord got
frozen (kernel thread picked at random - it might be one of the threads
that has gotten special-cased to not do that), but I have yet to hear a
single coherent explanation for why it's actually a good idea in the first
place.

And it has added totally idiotic code to every single kernel thread main
loop. For _no_ reason, except that the concept was broken, and needed more
breakage to just make it work.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Fri, 27 Apr 2007 23:18:33 UTC
Message-ID: <fa.dQTqzXLVB+P6pf+BHd9O00Xo+xk@ifi.uio.no>

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>
> > And can you name a _single_ advantage of doing so?
>
> Yes.  We have a lot less interdependencies to worry about during the whole
> operation.

That's not an advantage. That's why it has *sucked*.

Trying to freeze kernel threads has _caused_ problems. It has _added_
these interdependencies. It hasn't removed a single dependency at any
time, it has just added new problems!

> 1) if the kernel threads are frozen, we know that they don't hold any locks
> that could interfere with the freezing of device drivers,
> 2) if they are frozen, we know, for example, that they won't call user mode
> helpers or do similar things,
> 3) if they are frozen, we know that they won't submit I/O to disks and
> potentially damage filesystems (suspend2 has much more problems with that
> than swsusp, but still.  And yes, there have been bug reports related to it,
> so it's not just my fantasy).

NONE of these are valid explanations at all. You're listing totally
theoretical problems, and ignoring all the _real_ problems that trying to
freeze kernel threads has _caused_.

If you want to control user-mode helpers, you do that - you do not freeze
kernel threads!

And no, kernel threads do not submit IO to disks on their own. You just
made that up. Yes, they can be involved in that whole disk submission
thing, but in a good way - they can be required in order to make disk
writing work!

The problem that suspend has had is that it's done everything totally the
wrong way around. Do kernel threads do disk IO? Sure, if asked to do so.
For example, kernel threads can be involved in md etc, but that's a *good*
thing. The way to shut them up is not to freeze the threads, but to freeze
the *disk*.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Sat, 28 Apr 2007 00:01:17 UTC
Message-ID: <fa.dHn9GZMHp8EP/85clh3thb6aZ58@ifi.uio.no>

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>
> Actually, the less things happen while we're creating and saving the image,
> the less sources of potential problems there are and by freezing the kernel
> threads (not all of them), we cause less things to happen at that time.

That makes no sense.

You have to create the snapshot image with interrupts disabled *anyway*.

I really don't see how you can say that stopping threads etc can make any
difference what-so-ever. If you don't create the snapshot with interrupts
disabled (and just with a single CPU running) you have so many other
problems that it's not even remotely funny.

So there's *by*definition* nothing at all that can happen while you
snapshot the system. Claiming otherwise is just silly.

> To make you happy, we could stop doing that, but what actual _advantage_
> that would bring?

Like getting rid of all the magic "I don't want you to freeze me" crud?

Or getting rid of this horribly idiotic "three times widdershins" kind of
black magic mentality! It looks like the main reason for the process
freezing has nothing to do with technology, but some irrational fear of
other things happening at the same time, even though they CANNOT happen if
you do things even half-way sanely.

The "let's stop all kernel threads" is superstition. It's the same kind of
superstition that made people write "sync" three times before turning off
the power in the olden times. It's the kind of superstition that comes
from "we don't do things right, so let's be vewy vewy quiet and _pray_
that it works when we are being quiet".

That's bad.

It's doubly bad, because that idiocy has also infected s2ram. Again,
another thing that really makes no sense at all - and we do it not just
for snapshotting, but for s2ram too. Can you tell me *why*?

> > Trying to freeze kernel threads has _caused_ problems. It has _added_
> > these interdependencies. It hasn't removed a single dependency at any
> > time, it has just added new problems!
>
> What problems are you talking about?

Like you wouldn't know. Look at commit b43376927a that you yourself are
credited with, just a month ago.

Then, do something as simple as

	git grep create_freezeable_workthread

and ponder the end results of that grep. If you don't see something wrong,
you're blind.

> > NONE of these are valid explanations at all. You're listing totally
> > theoretical problems, and ignoring all the _real_ problems that trying to
> > freeze kernel threads has _caused_.
>
> Example, please?

Who do you think you are kidding? See above.

And if you think that's an isolated example, look again. And start
grepping for PF_NOFREEZE, and other examples.

The fact is, there is not a *single* reason to freeze kernel threads. But
some rocket scientist decided to, and then screwed everybody else over.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Sat, 28 Apr 2007 00:21:07 UTC
Message-ID: <fa.lFuM6XOwDo/gDDqCnry9ycuvNKo@ifi.uio.no>

On Fri, 27 Apr 2007, Linus Torvalds wrote:
>
> The "let's stop all kernel threads" is superstition. It's the same kind of
> superstition that made people write "sync" three times before turning off
> the power in the olden times. It's the kind of superstition that comes
> from "we don't do things right, so let's be vewy vewy quiet and _pray_
> that it works when we are being quiet".

Side note: while I think things should probably *work* even with user
processes going full bore while a snapshot it taken, I'll freely admit
that I'll follow that superstition far enough that I think it's probably a
good idea to try to quiesce the system to _some_ degree, and that stopping
user programs is a good idea. Partly because the whole memory shrinking
thing, and partly just because we should do the snapshot with hw IO queues
empty.

But I don't think it would necessarily be wrong (and in many ways it would
probably be *right*) to do that IO queue stopping at the queue level
rather than at a process level. Why stop processes just becasue you want
to clean out IO queues? They are two totally different things!

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Sat, 28 Apr 2007 00:41:58 UTC
Message-ID: <fa.H01jX84odTLNPEbRYRf0jAvXMQY@ifi.uio.no>

On Fri, 27 Apr 2007, David Lang wrote:
>
> all that's needed for the snapshot is to prevent userspace from scheduling,

Strictly speaking, all you *really* want to make sure is not so much that
user-space isn't scheduling, as the fact that all device IO buffers must
be empty.

We can trivially snapshot an active user-space, and in fact it would
probably be hard to do a snapshot in a way that it could even *know* or
care about whether there are user-space processes running at the time of
the snapshot.

So that's not the real problem.

What we obviously *cannot* snapshot is if some particular device is in the
middle of being written to or read from, and has outstanding commands on
the device itself (as opposed to just queued to the driver). So what we do
want to make sure happens is that there are no IO queues that are active.

And the best way to make sure that there are no IO queues active is to
make sure that there are no new read or write-requests. And *that* you can
do two ways:

 - actually intercepting the read/write requests. Probably not too hard,
   we could literally do it in the IO scheduler (and probably much more
   easily than doing it in the process scheduler), but the easy cases will
   only cover the block device layer, and character devices don't have the
   same kind of scheduler you can trap IO in.

 - we also don't want to generate new data that needs to be snapshotted,
   so we want to trap people who write even just to the page cache and
   turn pages dirty. Again, we could probably do it at *that* point (ie
   trapping them when they try to dirty a page), and it would be more
   logical, but again, there are other cases of people who generate more
   data (just any memory allocation obviously is a special case of
   generating more data to be snapshotted),

so I do agree that we want to stop producing new data to be snapshotted,
and we want to stop producing new read-requests. But kernel threads really
do neither: in an idle system, kernel threads are idle too. A kernel
thread is not like a user program that actually generates data - they only
tend to act on behalf of other processes' needs.

So I think that what snapshotting really *wants* to stop is not scheduling
per se, but IO. And stopping user processes (as opposed to kernel threads)
is probably a good way to get there.

In fact, I'd argue that you want to stop user space and then encourage
some kernel threads to *start* running, notably things like bdflush should
probably be kicked to clean up some dirty stuff as part of the "shrink
data to be snapshotted" part. Trying to free memory will do that on its
own, of course.

			Linus



From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Sat, 28 Apr 2007 01:12:43 UTC
Message-ID: <fa.u81Uz7laLnKQNgj2HjpI3sh8A2Y@ifi.uio.no>

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
>
> > It's doubly bad, because that idiocy has also infected s2ram. Again,
> > another thing that really makes no sense at all - and we do it not just
> > for snapshotting, but for s2ram too. Can you tell me *why*?
>
> Why we freeze tasks at all or why we freeze kernel threads?

In many ways, "at all".

I _do_ realize the IO request queue issues, and that we cannot actually do
s2ram with some devices in the middle of a DMA. So we want to be able to
avoid *that*, there's no question about that. And I suspect that stopping
user threads and then waiting for a sync is practically one of the easier
ways to do so.

So in practice, the "at all" may become a "why freeze kernel threads?" and
freezing user threads I don't find really objectionable.

But as Paul pointed out, Linux on the old powerpc Mac hardware was
actually rather famous for having working (and reliable) suspend long
before it worked even remotely reliably on PC's. And they didn't do even
that.

(They didn't have ACPI, and they had a much more limited set of devices,
but the whole process freezer is really about neither of those issues. The
wild and wacky PC hardware has its problems, but that's _one_ thing we
can't blame PC hardware for ;)

> > 	git grep create_freezeable_workthread
>
> s/workthread/workqueue/

Yes.

> > and ponder the end results of that grep. If you don't see something wrong,
> > you're blind.
>
> This was a mistake, quite unrelated to the point you're making.

Did you actually _do_ the "grep" (with the fixed argument)?

I had two totally independent points. #1 was that you yourself have been
fixing bugs in this area. #2 was the result of that grep. It's absolutely
_empty_ except for the define to add that interface.

NOBODY USES IT!

Now, grep for the same interface that creates _non_freezeable workqueues.

Put another way:

	[torvalds@woody linux]$ git grep create_workqueue | wc -l
	35

	[torvalds@woody linux]$ git grep create_freezeable_workqueue | wc -l
	1

and that _one_ hit you get for the "freezeable" case is not actually a
user, it's the definition!

Ie my point is, nobody wants freezeable kernel threads. Absolutely nobody.

Yet we have all this support for freezing them (or rather, we freeze them
by default, and then we have all this support for _not_ doing that wrong
default thing!)

So yes, I think it would be interesting to just stop freezing kernel
threads. Totally.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Sat, 28 Apr 2007 16:30:47 UTC
Message-ID: <fa.YRE06ZHh+SUgOY22HDChbRqB81g@ifi.uio.no>

On Sat, 28 Apr 2007, Pavel Machek wrote:
>
> We do not want kernel threads running:
>
> a) they may hold some locks and deadlock suspend
>
> b) they may do some writes to disk, leading to corruption

You're really just making both of those up.

If a kernel thread holds a lock and deadlocks suspend, that would deadlock
anything else _too_. Suspend isn't *that* special. Everything it does are
things other people do too.

And no, kernel threads do not write to disk on their own. Name one. They
help *others* write to disk, but those disk writes need to happen.

The freezer has *caused* those deadlocks (eg by stopping threads that were
needed for the suspend writeouts to succeed!), not solved them.

So stop making these totally bogus arguments up.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Sat, 28 Apr 2007 21:26:31 UTC
Message-ID: <fa.loAiOqyOOTYwB7OH27vUK2+apLU@ifi.uio.no>

On Sat, 28 Apr 2007, Rafael J. Wysocki wrote:
> >
> > The freezer has *caused* those deadlocks (eg by stopping threads that were
> > needed for the suspend writeouts to succeed!), not solved them.
>
> I can't remember anything like this, but I believe you have a specific test
> case in mind.

Ehh.. Why do you thik we _have_ that PF_NOFREEZE thing in the first place?

Rafael, you really don't know what you're talking about, do you?

Just _look_ at them. It's the IO threads etc that shouldn't be frozen,
exactly *because* they do IO. You claim that kernel threads shouldn't do
IO, but that's the point: if you cannot do IO when snapshotting to disk,
here's a damn big clue for you: how do you think that snapshot is going to
get written?

I *guarantee* you that we've had a lot more problems with threads that
should *not* have been frozen than with those hypothetical threads that
you think should have been frozen.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Back to the future.
Date: Sat, 28 Apr 2007 23:45:56 UTC
Message-ID: <fa.B5aTn2jjYeZEP2cbGCdlIA6JpBU@ifi.uio.no>

On Sun, 29 Apr 2007, Rafael J. Wysocki wrote:
>
> OK, more precisely: fs-related threads should not try to process their queues,
> etc., after the snapshot is done, because that may cause some fs data to be
> written at that time and then the fs in question may be corrupted after the
> restore.  Not all of the I/O in general, fs data.

But that's not true _either_. That's only true because right now I think
we cannot even suspend to a swapfile (I might be wrong).

If you have a swapfile on a filesystem, you'd need those fs queues
running!

> Well, I'm not sure whether or not that still would have been the case if we had
> stopped to freeze kernel threads for the hibernation/suspend.

Did you miss the email where Paul pointed out that Mac/PowerPC didn't use
to do any of this? And apparently never had any issues with it? And
probably worked more reliably several years ago than suspend/hibernation
does _today_?

Ie we do have history of _not_ freezing things.  The freezing came later,
and came with the subsystem that had more problems..

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFD] Freezing of kernel threads
Date: Sat, 12 May 2007 19:58:46 UTC
Message-ID: <fa.+M53iNKBaV4LEik6pxPznEFfBgI@ifi.uio.no>

On Sun, 13 May 2007, Gautham R Shenoy wrote:
>
> With the "freeze-limited-kernel-threads" scheme, we would still need
> to perform this audit, since we now have to freeze only those kernel
> threads which *may* end up calling foo().

What the *heck* is wrong with just adding proper locking or other
synchronization?

Why the hell do people think that they don't need locking for things?

In other words, you're just adding crap to make up for other crap. That's
*not* how we do kernel develpment. I would seriously suggest people like
that approach MS for a position on _their_ kernel team.

The fact is, if you want to avoid problems, you do so by having clear and
clean approaches. THAT is how you avoid bugs. Not by adding crap.

I've several times suggested that if you actually want to stop the
machine, you can do so at the scheduler level.

So stop talking about freezing when you're talking about CPU hotplug. The
two have _nothing_ to do with each other, and if there is something this
whole discussion has absolutely convinced me about, it's that they will
never have anything to do with each other.

If you worry about people doing

	for_each_online_cpu()

in the presense of hotplug, here's a couple of simple solutions:

 - Just teach for_each_online_cpu() about locking. It's not a new concept.
   We have it all over the kernel.

   Maybe the macro itself can take whatever locks it needs, and maybe it's
   as simple as a rule saying that: "you must hold spinlock XYZ when you
   do that". But it has _nothing_ to do with freezing all threads.

 - make the rule be that you cannot sleep in the above macro, and make the
   rule be that CPU hotplug uses the same mechanisms that module unload
   already does!

In other words, we already have *better* mechanisms for cpu hotplug than
the freezer. We have mechanisms that there is absolutely zero controversy
about.

IOW, have you actually looked at "stop_machine_run()", which people use
much more than the freezer, and which has never really caused any issues,
and didn't result in us adding *crap* to every single kernel thread?

Really. The arguments for the freezer are CRAP. Why cannot people just
admit it? We have *better* things available, that don't *need* any of this
crap. The "stop_machine()" stuff literally ends up just running a
high-priority thread on each CPU, and doesn't need any special support
anywhere.

AND IT JUST WORKS.

(Btw, I'm pretty sure it could be used for hibernation too: just stop the
IO at a higher level first, sync, then stop_machine_run() etc. But I don't
care. What I care about is that the freezer() insanity doesn't _spread_).

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: [RFD] Freezing of kernel threads
Date: Sat, 12 May 2007 20:12:55 UTC
Message-ID: <fa.fADUh1105S9grrutFgTKe7MYevY@ifi.uio.no>

On Sat, 12 May 2007, Linus Torvalds wrote:
>
>  - make the rule be that you cannot sleep in the above macro, and make the
>    rule be that CPU hotplug uses the same mechanisms that module unload
>    already does!

Side note: we obviously already do the stop_machine_run thing for CPU
down, so doing it for CPU bringup too would seem to just be a good thing.
And it means that the locking for CPU's disappearing is the same as the
locking rules for CPU's appearing: you just want to make sure to disable
preemption (which you already have to do, to make the "CPU went away" case
work out).

So how about just making sure "__cpu_up()" gets called with the same
stop_machine_run() logic? Maybe it's not *quite* a one-liner change, but
it should come fairly close...

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: pcmcia resume 60 second hang. Re: [patch 00/69] -stable review
Date: Thu, 24 May 2007 22:26:22 UTC
Message-ID: <fa.n5j0BV+ar6SxHzonb494rFHcyK0@ifi.uio.no>

On Thu, 24 May 2007, Linus Torvalds wrote:
>
> Then, what you do is:
>  - stop user space
>  - suspend
>  - resume
>  - start user space

Btw, this is where things like "udevd" can be really problematic. That
whole "uevent over netlink" stuff is really nasty for things like this.

It's quite possible that even for user-level threads, we simply MUST NOT
freeze them the way we do. Exactly because of deadlocks.

I'm personally really really convinced that the whole freezer thing is a
total disaster. I don't know how anybody can even imagine anything else.
It's simply deadlock city.

We should freeze IO, not processes.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: pcmcia resume 60 second hang. Re: [patch 00/69] -stable review
Date: Fri, 25 May 2007 03:33:59 UTC
Message-ID: <fa.KhNd8pNlz9BedKy0iVzbGVns6aE@ifi.uio.no>

On Fri, 25 May 2007, Nigel Cunningham wrote:
> >
> > That said, I think freezing is crap even for snapshotting/suspend-to-disk,
> > but the point of the above rant is to show how insane it is to think that
> > problems and complexity in one area should translate into problems and
> > complexity in another area.
>
> Does that imply that you'd prefer to see filesystem checkpointing code,
> that you think freezing can be done better, or do you have some other
> solution that hasn't occurred to me?

I actually don't think that processes should be frozen really at all.

I agree that filesystems have to be frozen (and I think that checkpointing
of the filesystem or block device is "too clever"), but I just don't think
that has anything to do with freezing processes.

So I'd actually much prefer to freeze at the VFS (and socket layers, etc),
and make sure that anybody who tries to write or do something else that we
cannot do until resuming, will just be blocked (or perhaps just buffered)!

See? I actually think that this process-based thing is barking up the
wrong tree. After all, it's really not the case that we need to stop
processes, and stopping processes really does have some problems. It's
simpler in some ways, but I think a more directed solution would actually
be better.

>bviously we _do_ want to actually try to quiesce normal user processes.
>But by "normal user", I'd be willing to limit it to non-uid-zero things,
>for example. Exactly because it does turn out that the kernel kind of
>depends on user-land things for stuff like network filesystem connection
>setup etc (ie we tend to do things like the mount encryption stuff in
>userland!).

But I really don't care that deeply per se, exactly because I don't use it
myself. I think people are going down the wrong rabbit-hole, but it
wouldn't _irritate_ me that much except for the fact that it now also
impacts suspend-to-RAM.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: pcmcia resume 60 second hang. Re: [patch 00/69] -stable review
Date: Fri, 25 May 2007 04:52:18 UTC
Message-ID: <fa.svBofzBtsCXnQRqW5l/TLc1UynQ@ifi.uio.no>

On Fri, 25 May 2007, Nigel Cunningham wrote:
>
> Does that mean you never ever power off your laptop (assuming you have
> one), and the battery never runs out? Surely you must power it off
> completely sometimes?

So? The bootup isn't that much worse than a disk suspend/resume, and it's
reliable.

And actually, I don't use laptops much. I use mostly desktops, and STR
works fine on at least some of them. In contrast, doing some
suspend-to-disk thing would just be insane and idiotic. If I have to wait
for half a minute and have a slow system even after that because my git
trees aren't in the cache, I really might as well just shut them off.

In contrast, STR means they are quiet and don't waste energy when I don't
use them, but they're instantly available when I care. HUGE difference.

I really think suspend-to-disk is just a total waste of my time.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: pcmcia resume 60 second hang. Re: [patch 00/69] -stable review
Date: Mon, 28 May 2007 16:56:29 UTC
Message-ID: <fa.2XLoaEqUrUGSFFaH2TfYSBInrkw@ifi.uio.no>

On Mon, 28 May 2007, Pavel Machek wrote:
>
> I guess we should warn the driver authors, then; and decide what driver
> authors should do.

Drivers really shouldn't do anything at all.

> If I'm video4linux driver for grabbing screen, have been suspended, and
> someone asks me to read a frame, should I
>
> a) return -ESORRYIMSUSPENDED
>
> b) just block the caller

The "subsystem" thing should have stopped the queues, and the device
should never even _see_ this.

Before we suspend a device, we call the subsystem that that device has
been registered with. Ie, we have code like this:

	if (dev->class && dev->class->suspend)
		error = dev->class->suspend(dev, state);

which was very much designed so that individual devices wouldn't have to
always know - if the upper layer devices for that class can handle these
things, they should.

Do people actually _do_ this, right now? No. But we do actually have the
infrastructure, and I think we have one or two classes that actually do
use it (at least the "rfkill_class" has a suspend member, dunno how well
this model actually works).

I think Greg had some patches to make network drivers use this, for
example. Network drivers right now all end up doing stuff that really
doesn't belong in the driver at all when they suspend, and the
infrastructure _should_ just do it for them (ie do all the _network_
related stuff, as opposed to the actual hardware-related stuff).

(Examples of things that we probably _should_ do for network devices on a
class level:

	suspend:
		netif_poll_disable()
		if (netif_running(dev))
			dev->stop(dev);

	resume:
		if (netif_running(dev))
			dev->start(dev);
		netif_poll_enable(dev);

or something similar).

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: pcmcia resume 60 second hang. Re: [patch 00/69] -stable review
Date: Tue, 29 May 2007 21:35:31 UTC
Message-ID: <fa.BlgPrqBjZN/00DvcuKUoLjsgpl0@ifi.uio.no>

On Wed, 30 May 2007, Nigel Cunningham wrote:
>
> On Tue, 2007-05-29 at 10:19 -0400, Mark Lord wrote:
> >
> > How about blocking brk() and mmap(MAP_ANONYMOUS) in addition to
> > the filesystem VFS callers?   Or is that starting to get messy again?
>
> Yeah. Getting messy again :)

Indeed. And also misses the point - the point being that we don't actually
need to freeze anything at all most of the time. There's nothing wrong
with making memory allocations etc.

And yes, suspend is different from hibernate. I can see how hibernate
people are worried about people writing to things after doing the
snapshot, but those concerns don't exist with suspend. With suspend, the
biggest concern is accessing a device after it has been suspended, but on
the other hand, also the fact that we end up having driver writers used
to the system being "runnable", so they do things that really do require a
full-fledged system (and sometimes that means just some delayed action
using a kernel thread, other times it seems to rely on more complex
behaviour like firmware loading :^p )

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: pcmcia resume 60 second hang. Re: [patch 00/69] -stable review
Date: Wed, 30 May 2007 16:01:34 UTC
Message-ID: <fa.nrf1kRlkRj4Nh2oxdft2Woo0lF0@ifi.uio.no>

On Wed, 30 May 2007, Rafael J. Wysocki wrote:
>
> Very true.  And I think the right order should be to make the midlayers do
> this and then remove the freezer from the STR code path, not the other way
> around. :-)

Yes. After all, STR simply shouldn't _care_.

The rule should be that in a well-written setup, STR "just works" whether
user processes are suspended or not. In other words, the whole freezing
part isn't about STR. It should be totally immaterial.

(Of course, that assumes that the freezing is _sane_, of course: ie the
core kernel threads shouldn't all be frozen. I think Rafael's patch to
turn the defaults around are a big step in the right direction).

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: 2.6.25-rc2 System no longer powers off after suspend-to-disk.
Date: Wed, 20 Feb 2008 20:31:12 UTC
Message-ID: <fa.eryn80rS/GazxIquhuvGaAflAyk@ifi.uio.no>

On Wed, 20 Feb 2008, Rafael J. Wysocki wrote:
>
> I think we should export the target sleep state somehow.

Yeah. By *not* using "->suspend()" for freezing or hibernate.

Please, Rafael - just make the f*cking suspend-to-disk use other routines
already. 99% of all hardware needs to do exactly *nothing* on
suspend-to-disk, and the ones that really do need things tend to need to
not do a whole lot.

For example, the "freeze" action for USB (which is one of the hardest
things to suspend) should literally be something like just setting the
controller STOP bit, and waiting for it to have stopped. The "unfreeze"
should be to just clear the stop bit, while the "restart" should be just a
controller reset to use the current memory image.

NONE OF THIS HAS ABSOLUTELY ANYTHING TO DO WITH SUSPEND.

It never did. I've told people so for years. Maybe actually seeing the
problems will make people realize.

So please, we shouldn't call "->suspend[_late]" or "->resume[_early]" at
all. Not with PMSG_FREEZE, not with PMSG_*anything*.

Can we please get this fixed some day?

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: 2.6.25-rc2 System no longer powers off after suspend-to-disk.
Date: Wed, 20 Feb 2008 21:14:41 UTC
Message-ID: <fa.FK1kbYZ+/b/XZZSb2ZGLIEGTpoM@ifi.uio.no>

On Wed, 20 Feb 2008, Jesse Barnes wrote:
>
> The current callback system looks like this (according to Rafael and the last
> time I looked):
>   ->suspend(PMSG_FREEZE)
>   ->resume()
>   ->suspend(PMSG_SUSPEND)
>   *enter S3 or power off*
>   ->resume()

Yes, it's very messy.

It's messy for a few different reasons:

 - the one you hit: a driver actually has a really hard time telling what
   PMSG_SUSPEND really means.

 - more importantly, we generally don't want to "suspend/resume" the
   hardware at all around a power-off, because we're going to resume with
   the state at the time of the PMSG_FREEZE, which means that the hardware
   has actually *changed* and been used in between!

that second case is very fundamental for things like USB devices, which in
theory you can hold alive over a real suspend event (ie a STR event), but
which absolutely MUST NOT be resumed over a suspend-to-disk event, because
all the low-level request state is bogus!

So the "->resume" really isn't a resume at all. It's much closer to a
"->reset".

Of course, the "solution" to this all right now is that we have to reset
everything even if it *is* a suspend event, so it basically means that STR
ends up using the much weaker model that snapshot-to-disk uses.

The fundamental problem being that the two really have nothing
what-so-ever to do with each other. They aren't even similar. Never were.

> And in the long term we could have:
>   ->suspend()
>   *enter S3*
>   ->resume()

Yes, apart from all the complexities (suspend_late/resume_early). So in
reality it's more than that, but the suspend/resume things are clearly
nesting, and they have the potential to actually keep state around
(because we *know* this machine is not going to mess with the devices in
between).

IOW, here we actually can have as an option "assume the device is there
when you return".

> or:
>   ->hibernate()
>   *kexec to another kernel to save image*
>   *power off*
>   ->return_from_hibernate() (or somesuch)

Enough people don't trust kexec that I suspect the right thing simply is

	->freeze()		// stop dma, synchronize device state
	*snapshot*
	->unfreeze();		// resume dma
	*save image*
	[ optionally ->poweroff() ]	// do we really care? I'd say no
	*power off*
	->restore()		// reset device to the frozen one

which may have four entry-points that can be illogically mapped to the
suspend/resume ones like we do now, but they really have nothing to do
with suspending/resuming.

And notice how while "freeze/restore" kind of pairs like a
"suspend/resume", it really shouldn't be expected to realistically restore
the same state at all. The "restore" part is generally much better seen as
a "reset hardware" than a "resume" thing. Because we literally cannot
trust *anything* about the state since we froze it - we might have booted
a different OS in between etc. Very different from suspend/resume.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: 2.6.25-rc2 System no longer powers off after suspend-to-disk.
Date: Wed, 20 Feb 2008 22:23:38 UTC
Message-ID: <fa.MSJK3tLqJscj/4pZ48wCXcj64RE@ifi.uio.no>

On Wed, 20 Feb 2008, Jesse Barnes wrote:
>
> Really, in the simple s3 case we still need early/late stuff?

Absolutely.

Two big reasons:

 - debuggability

   I know we don't do this correctly right now, but I want to be able to
   at least feel like we can some day actually do printk's etc through 99%
   of the suspend/resume cycle. It's a *huge* thing for debugging problems
   that happen in the wild, and one of the biggest issues is that we
   currently usually just get a "the machine died" message when suspend or
   resume doesn't work.

   Yes, doing printk's to the Intel management flash stuff can help a lot
   here, and I want that too, but I'd really like to shut down consoles
   individually rather than having the "big hammer" approach that shuts
   them up entirely over the whole suspend/resume sequence (or not at all,
   if you use "no_console_suspend").

   And I'd *really* like to do things like VGA-console shutdown in the
   late phase (and resume early).

 - it's actually likely *much* simpler for some devices.

   Simple devices (and that includes things like PCI bridges etc, but
   also potentially USB host controllers etc) are things that can often be
   trivially suspended - all the complexity is really not in the
   controller itself, but beyond, in the bus that it actually drives.

   And the late-suspend/early-resume means that you don't have to worry
   about things like interrupts happening while you're suspended. Yes,
   putting the device into D3 will disable interrupts from that device too
   (unless there are bugs), *BUT* you may be sharing an interrupt line,
   and interrupts may be posted and delayed, so an earlier interrupt may
   well be pending etc.

   suspending late and resuming early just avoids those issues entirely.

Sometimes these things interact. For example, firewire is certainly not
trivial to suspend as a "subsystem" thing (ie all the devices behind the
firewire bridge need to do magic things, like spinning down etc that
obviously can not happen in the final "late" phase), but the firewire
controller itself is likely trivial to suspend/resume and can easily be
handled in the late/early routines. And guess what? It's also exactly what
you want to happen in case you end up using the firewire RDMA as a debug
aid.

IOW, you want that firewire controller (and the PCI bridges) working
really early, so that if a problem does happen when you resume some more
complex device (say, one of the graphics chips that need X to really come
alive), you can use the firewire rdma to read out the kernel log buffer
from memory.

> Well, it seems like we'll have to fix drivers in either case, and isn't a
> kexec approach fundamentally more sound and simple, design-wise?  Rafael
> pointed out some problems with properly setting wakeup states, but I think
> that could be overcome...

I don't personally mind kexec at all, but on the other hand, I don't care
about suspend-to-disk in the first place. I do know that some people
really don't want it, and I suspect that they have valid reasons. Ranging
from memory use to simply just performance.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: 2.6.25-rc2 System no longer powers off after suspend-to-disk.
Date: Wed, 20 Feb 2008 23:13:56 UTC
Message-ID: <fa.0asqgfZwA+h1y+JPZrvXiOtQPxE@ifi.uio.no>

On Wed, 20 Feb 2008, Rafael J. Wysocki wrote:
> > >
> > > which may have four entry-points that can be illogically mapped to the
> > > suspend/resume ones like we do now, but they really have nothing to do
> > > with suspending/resuming.
>
> Apart from putting devices into the right low power states, that is.

And by "right low power states" you mean "wrong low-power states", right?

The thing is, they really *are* the wrong states for 99% of all hardware.

If you really have a piece of hardware that you want to have the
"->poweroff()" thing do the same as "->suspend()", then hey, just use the
same function (or better yet, use two different functions with a call to a
shared part).

Because IT IS NOT TRUE that ->suspend() puts the devices in the "right
power state". The power states are likely to be totally different for S3
and for poweroff, and they are going to differ in different ways depending
on the device type.

One example would be the one that started this version of the whole
discussion (shock horror! We're on subject!) ie when you do a system
shutdown, you generally do not even *want* to put individual devices into
low-power states at all, because the actual "power off the system" thing
will take care of it for you much better.

So to take just something as simple as VGA as an example: you really do
not want to suspend that device, because you want to see the poweroff
messages until the very end.

So that final device ->poweroff function really has absolutely *nothing*
in common with the device ->suspend[_late] functions, simply because
almost any sane driver would decide to do different things.

Of course, we can continue to do the insane thing and just continue to use
inappropriate and misleading function callback names, and then encoding
what the *real* action should be in the argument and/or in magic
system-wide state parameters.

So in that sense, it's certainly totally the same thing whether we call it
->shutdown or ->poweroff or ->eat_a_banana, since you could always just
look at the argument and other clues, and decide that *this* time, for
*this* kind of device, the "eat a banana" callback actually means that we
should power it off, but wouldn't it be a lot more logical to just make it
clear in the first place that they aren't called for the same reason at
all?

I'd claim that it's much easier for everybody (and _especially_ for device
driver writers) to have

	static int my_shutdown(struct pci_device *dev, int state)
	{
		.. do something ..
	}

	static int my_suspend(struct pci_device *dev, int state)
	{
		.. do something ..
	}

	...
	.shutdown = my_shutdown,
	.suspend = my_suspend,
	...

than to have

	static int my_suspend(struct pci_device *dev, state)
	{
		.. common code ..
		if (state == XYZZY)
			..special code..
		else
			..other case code..
	}

	...
	.suspend = my_suspend
	...

even if the latter might be fewer lines. It doesn't really matter if it's
fewer, does it, if the alternate version is more obvious about what it
does?

The other issue is that I've long wanted to make sure that when people fix
suspend-to-ram, they don't screw up suspend-to-disk by mistake and vice
versa. When a driver writer makes changes, he shouldn't have the kind of
illogical "oops, unintended consequences" issues in general. It should be
pretty damn obvious when he changes suspend code vs when he changes
snapshot/restore code.

We've somewhat untangled that on the "core kernel" layer, but we've left
the driver confusion alone.

		Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: 2.6.25-rc2 System no longer powers off after suspend-to-disk.
Date: Thu, 21 Feb 2008 00:01:43 UTC
Message-ID: <fa.E7dQgJdEmUmqUokJC1yRDqUfb7I@ifi.uio.no>

On Thu, 21 Feb 2008, Rafael J. Wysocki wrote:
>
> In fact we have acpi_pci_choose_state() that tells the driver which power
> state to put the device into in ->suspend().  If that is used, the device ends
> up in the state expected by to BIOS for S4.

First off, nobody should *ever* use that directly anyway.

Secondly, the one that people should use ("pci_choose_state()") doesn't
actually do what you claim it does. It does all kinds of wrong things, and
doesn't even take the target state into account at all. So look again.

> No.  Again, if there are devices that wake us up from S4, but not from S5,
> they need to be handled differently in the *enter S4* case (hibernation) and
> in the *enter S5* case (powering off the system).

And again, what does this have to do with (the example I used) the
graphics hardware? Answer: nothing. The example I gave you we simply DO
THE WRONG THING FOR.

Same thing for things like USB devices - where pci_choose_state() doesn't
work to begin with. Why do we call "suspend()" on such a thing when we
don't want to suspend it? We shouldn't. We should call "freeze/unfreeze"
(which are no-ops) and then finally perhaps "poweroff", and that final
stage might want to spin things down or similar.

But *none* of it has anything to do with suspend, and none of it has
anything to do with pci_choose_state() (much less acpi_pci_choose_state)

The fact is, we should let the driver decide, and we should make it clear
to the driver writer what he is deciding about - rather than basically lie
and say "suspend the device and put it into D3" even when that's the last
thing it should ever do.

			Linus


From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: 2.6.25-rc2 System no longer powers off after suspend-to-disk.
Date: Thu, 21 Feb 2008 00:26:54 UTC
Message-ID: <fa.MmEk7FJPhe4Mobn7lGJeAivKU1M@ifi.uio.no>

On Thu, 21 Feb 2008, Rafael J. Wysocki wrote:
>
> > Secondly, the one that people should use ("pci_choose_state()") doesn't
> > actually do what you claim it does. It does all kinds of wrong things, and
> > doesn't even take the target state into account at all. So look again.
>
> Well, if platform_pci_choose_state() is defined, pci_choose_state() returns
> its result and on ACPI systems that points to acpi_pci_choose_state(), so in
> fact it does what I said (apart from the error path).

Did you check closer?

I repeat: acpi_pci_choose_state() (when called from pci_choose_state())
doesn't even look at the target 'state'. It just blindly assumes that you
want the deepest sleep-state you can have.

Which happens to be correct for normal suspend, but means that if you want
to test other states (through '/sys/devices/.../power'), that sounds
broken.

I didn't check any closer, but go check it yourself. The short and sweet:
acpi_pci_choose_state() totally ignores its 'state' argument. Do you
really think that's correct? But yes, "pci_choose_state()' effectively
does that too, apart from PM_EVENT_ON, which is never used.

(But the whole and only point of pci_choose_state() was to do the
PM_EVENT_FREEZE thing differently, which it doesn't do, so I think the
real issue here is that the interface is really rather mis-designed)

I suspect most people who ever really looked and worked on this code had a
specific device in mind, and I'm sure that all of the code individually
always ends up making sense from the standpoint of some specific device
driver. It's just that it never seems to make sense from a bigger issues
standpoint, and often seems senseless from the standpoint of other devices
of other types.

			Linus


Index Home About Blog