Hardware glitches (Linus Torvalds)

From: Linus Torvalds <torvalds@osdl.org>
Newsgroups: fa.linux.kernel
Subject: Re: Hardware bug or kernel bug?
Date: Thu, 12 Oct 2006 19:13:55 UTC
Message-ID: <fa.2k3I/yiqabSKDBfWisKVcCFUaUo@ifi.uio.no>

On Thu, 12 Oct 2006, David Johnson wrote:
>
> I'm having a major problem on a system that I've been unable to track down.
> When using scp to transfer a large file (a few gig) over the network
> (@100Mbit/s) the system will reboot after about 5-10 minutes of transfer. No
> errors, just a reboot. I have another identical system which exhibits the
> same behaviour.

A reboot usually indicates a serious hardware problem - it could be an
overheating sensor tripping, but it could be some serious corruption
causing a triple-fault or something like that too.

But the _most_ likely problem is just the power supply. If your power
supply is border-line, having something that stresses CPU, disk,
southbridge and networking at the same time may be just the way to cause a
power-fail signal, which usually causes an instant reboot.

> The system is a Supermicro P4SCT+ with a hyperthreading P4. I've posted the
> dmesg here:
> http://www.david-web.co.uk/download/dmesg
>
> I initially tried a different NIC in case that was at fault, but the results
> were the same.
>
> Changing the interrupt timer frequency in the kernel makes a difference:
> 100Hz - system reboots instantly when transfer is started
> 250Hz - reboots after a few seconds
> 1000Hz - reboots after 5-10 minutes

I think it just changes timings, and there is something timing-related
going on - like just instant power draw. The timer frequency should not
have any serious impact on heat, so I doubt it's about overheating, but
it's certainly worth opening the case and using one of those
compressed-air things to cool down the CPU and/or southbridge chips.

> As the problem appears to be interrupt-related, I disabled the I/O APIC in the
> BIOS (after first having to disable hyperthreading) which resulted in the
> system lasting a bit longer before it reboots. I then tried disabling the
> Local APIC as well but this made no difference.

Interrupts generally aren't problematic, I'd be more likely to suspect CPU
overclocking or similar (does the cpuinfo match the frequency claimed by
the BIOS?) or just some strange motherboard problem (which could be
firmware: bad programming of memory timings etc). So a BIOS upgrade is
worth looking into.
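
One way to do the cpuinfo check is to read back what the kernel reports
and compare it by hand with the BIOS setup screen. A minimal sketch in
Python, assuming an x86 Linux box where /proc/cpuinfo carries "model name"
and "cpu MHz" lines (the helper name parse_cpuinfo is made up for
illustration):

```python
def parse_cpuinfo(text):
    """Return a list of (model name, MHz) tuples, one per logical CPU."""
    cpus = []
    model = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPUs
        key = key.strip()
        value = value.strip()
        if key == "model name":
            model = value
        elif key == "cpu MHz":
            cpus.append((model, float(value)))
    return cpus


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            for model, mhz in parse_cpuinfo(f.read()):
                print(f"{model}: reported at {mhz:.0f} MHz")
    except OSError:
        print("no readable /proc/cpuinfo on this system")
```

With frequency scaling active, "cpu MHz" may show the current clock
rather than the rated one, so a value below the rated speed proves little;
a value clearly above the speed in the model name is the suspicious case.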

Sometimes issues like this can be worked around - for example, maybe the
problem is the chipset having issues with concurrent DMA or something, so
turning off DMA on the disk drives could possibly at least _hide_ the
problem.

> Does anyone have any idea whether this is likely to be a hardware problem or a
> kernel problem?

Anything is possible, and it certainly _could_ be a kernel bug. There are
situations that cause triple-faults and insta-reboots. If the stack
pointer gets whacked in kernel space, you can get some bad bad stuff
happening.

But check the power supply first. And check to see if there is a BIOS
upgrade available. You can double-check the cooling: check that all
heat-sinks are properly seated and have appropriate amounts of thermal
grease. And blowing air from a compressed-air can on top of the things
until you see them frost over is certainly a good spot-check.

In other words, I'd almost bet on bad hardware.

			Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.30-rc8 [also: VIA Support]
Date: Thu, 04 Jun 2009 17:07:56 UTC
Message-ID: <fa.NXSHPrEl3tE/9zdAgQ0lxu9m3ME@ifi.uio.no>

Side note: is it more stable if you disable the VIA speedstep thing
(whatever it's called; ok, google tells me it's called "TwinTurbo" and
"Advanced PowerSaver")?

Features like that easily put a huge stress on power regulators etc, if
they result in sudden changes in current draw.  Underspecced capacitors
etc can cause CPU "brown-outs", which in turn can easily cause total
failure.

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: Linux 2.6.30-rc8 [also: VIA Support]
Date: Thu, 04 Jun 2009 17:46:48 UTC
Message-ID: <fa.YkSdNxa2iVeJLPBFHO3S3va9jGA@ifi.uio.no>

On Thu, 4 Jun 2009, Michael S. Zick wrote:
>
> Yes, I build test cases with and without - -
> It was a fixed-speed kernel build that first hit the 4 hour up-time mark.
> I just reposted that build today (the -09143lk).
>
> > Features like that easily put a huge stress on power regulators etc, if
> > they result in sudden changes in current draw.  Underspecced capacitors
> > etc can cause CPU "brown-outs", which in turn can easily cause total
> > failure.
>
> There is also a possible thermal issue with these machines - -
> I doubt that VIA runs their qualification testing in bake ovens;
> which is what NetBook cases amount to.  ;)

If the fixed-speed case runs for longer, it's not likely to be a thermal
issue. The fixed speed case should be the higher-power one.

So it can easily be a weak power setup (insufficient grounding, bad
capacitors etc). But it could also be external bus issues, in case VIA
power management also impacts the external bus (eg "stopclock" like
behavior on the CPU<->chipset bus).

One thing you could try is to avoid using the "halt" instruction. It will
obviously increase power use (and thus temperatures), but again,
current fluctuations are much more likely to cause problems than higher,
but fairly constant, power draw.

Think about all the light-bulbs you've seen that burn out just when you
turn them on.

Use "idle=poll" on the kernel command line to avoid the idle loop using
the "halt" or "mwait" instructions to save power.

(That polling idle loop can also end up hiding cache coherency issues with
DMA, so if that works better, it doesn't necessarily prove it's
power-related. Shutting down the CPU core can have interesting
implications for external events, and you can have various races - maybe
you shut down the core just as a chipset event happened, and the chipset
_thinks_ the core is now awake, but the core went to sleep. End result:
hung machine).
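
After rebooting, it is worth confirming that the option actually reached
the kernel, which on Linux can be read back from /proc/cmdline. A small
sketch in Python (the helper name idle_option is hypothetical; it simply
reports the last idle= token it finds on the command line):

```python
def idle_option(cmdline):
    """Return the value of the last idle= token, or None if there is none."""
    value = None
    for token in cmdline.split():
        if token.startswith("idle="):
            value = token[len("idle="):]
    return value


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print("idle option:", idle_option(f.read()))
    except OSError:
        print("no readable /proc/cmdline on this system")
```

If this prints "idle option: poll", the test run really was done with the
polling idle loop.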

		Linus

From: Linus Torvalds <torvalds@linux-foundation.org>
Newsgroups: fa.linux.kernel
Subject: Re: RFC: starting a kernel-testers group for newbies
Date: Fri, 02 May 2008 16:29:20 UTC
Message-ID: <fa.piLT77lyIJK0z17GQZGMYak+jOk@ifi.uio.no>

On Fri, 2 May 2008, Carlos R. Mafra wrote:
>
> So I would like to ask you what an user should do when facing what is
> probably a timing-related bug, as it appears I have the bad luck
> of hitting one.

Quite frankly, it will depend on the bug.

If it's *reliably* timing-related (which sounds crazy, but is not at all
unheard of), it can be reliably bisected down to some totally unrelated
commit that doesn't actually introduce the problem at all, but that
reliably turns it on or off.

That can be very misleading, and can cause us to basically revert a good
commit, only to not actually fix the bug (and possibly re-introduce the
bug that the reverted commit tried to fix).

But sometimes it gives us a clue where the timing problem is. But quite
frankly, that seems to be the exception rather than the rule.

There have been issues that literally seemed to depend on things like
cacheline placement etc, where changing config options for code that was
never actually even *run* would change timing just enough to show a bug
pseudo-reliably or not at all.

The good news is that those timing issues are really quite rare.

The bad news is that when they happen, they are almost totally
undebuggable.

> This same problem is still present with yesterday's git, and sometimes
> it hangs without hpet=disable and sometimes it doesn't. (And never
> with hpet=disable in the boot command line)

Hey, it may well be a HPET+NOHZ issue. But it could also be that HPET is
the thing that just allows you to see the hang.
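
With a hang that only shows up some of the time, "never hangs with option
X" depends on how many boots were tried. A rough sketch in Python of that
arithmetic, assuming (as a simplification) that each boot hangs
independently with a fixed probability; the function name and the numbers
are illustrative, not measurements from this report:

```python
import math


def clean_boots_needed(hang_probability, confidence=0.95):
    """Smallest number of consecutive clean boots n such that, if the bug
    were still present, seeing n clean boots in a row would happen with
    probability below 1 - confidence.

    Uses (1 - hang_probability) ** n < 1 - confidence.
    """
    if not 0.0 < hang_probability < 1.0:
        raise ValueError("hang_probability must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    ratio = math.log(1.0 - confidence) / math.log(1.0 - hang_probability)
    return math.ceil(ratio)


if __name__ == "__main__":
    for p in (0.5, 0.1):
        print(f"hang chance {p:.0%} per boot: "
              f"{clean_boots_needed(p)} clean boots needed")
```

The rarer the hang, the more clean boots it takes before a boot option
(or a bisect step) can be called "good" with any confidence.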

> And using vga=6 or vga=0x0364 makes a difference in the probability
> of hanging.

.. and yeah, these kinds of really odd and obviously totally unrelated
issues are a sign of a bug that is either simply hardware instability or
very subtly timing-related.

The reason I mention hardware instability is that there really are bugs
that happen due to (for example) power supply instabilities. Brownouts
under heavy load have been causes of problems, but perhaps surprisingly,
so has _idle_ time thanks to sleep-states!

The latter is probably due to bad power conditioning on the CPU power
lines, where the huge current swings (going from high CPU power to low,
and back again) not only have made some motherboards "sing" (or "hum",
depending on frequency) but also cause voltage instability, and then
the CPU crashes.

Am I saying that's the reason you see problems? Probably not. Most
instabilities really are due to kernel bugs. But hardware instabilities do
happen, and they can have these kinds of odd effects.

> I am just waiting -rc1 to be released to send an email with my
> problem again, as I am unable to debug this myself.
> I think this is ok from my part, right?

Yes. You've been a good bug reporter, and kept at it. It's not your fault
that the bug is hard to pin down.

Quite frankly, it does sound like the hang happens somewhere around the

	hpet_init
	hpet_acpi_add
	hpet_resources
	hpet_resources: 0xfed00000 is busy

printk's you added (correct?) and we've had tons of issues with NO_HZ, so
at a guess it is timer-related.

(And I assume it's stable if/once it gets past that boot hang issue? That
tends to mean that it's not some hardware instability, it's literally our
init code).

			Linus
