Ticket #7817 (closed defect: fixed)
time has stopped and CPU is %nan
Reported by: janni | Owned by:
I am experiencing strange partial lock-ups in the guest. The system runs fine at first; after some time, dmesg reports "Clocksource tsc unstable (delta = 4398042164711 ns)". ntpd corrects the time and everything is fine again. Then, after a random amount of time (about 5 days on the last occasion), the clock stops advancing monotonically.
This is what I mean:
jan@janniweb:~$ date
Fr 10. Dez 07:13:26 CET 2010
jan@janniweb:~$ date
Fr 10. Dez 07:13:27 CET 2010
jan@janniweb:~$ date
Fr 10. Dez 07:13:28 CET 2010
jan@janniweb:~$ date
Fr 10. Dez 07:13:29 CET 2010
jan@janniweb:~$ date
Fr 10. Dez 07:13:25 CET 2010
jan@janniweb:~$ date
Fr 10. Dez 07:13:28 CET 2010
And the time never gets past 07:13:29.
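The backward jump could also be logged automatically instead of by running date by hand. The following is only a minimal sketch: the function names and the one-second polling interval are illustrative assumptions, not part of VirtualBox or of this report, and once the guest clock has actually stalled the sleep call itself may never return.

```python
import time


def find_backward_steps(samples):
    """Return (index, previous, current) for every sample smaller than its predecessor."""
    steps = []
    for i in range(1, len(samples)):
        if samples[i] < samples[i - 1]:
            steps.append((i, samples[i - 1], samples[i]))
    return steps


def sample_wall_clock(count, interval=1.0):
    """Read the wall clock `count` times, sleeping `interval` seconds in between."""
    samples = []
    for _ in range(count):
        samples.append(time.time())
        time.sleep(interval)
    return samples


if __name__ == "__main__":
    # Seconds field of the six `date` outputs above: 26, 27, 28, 29, 25, 28.
    print(find_backward_steps([26, 27, 28, 29, 25, 28]))  # prints [(4, 29, 25)]
```

Feeding it the output of sample_wall_clock() instead of the hard-coded list would report each point where the wall clock went backwards in a live guest.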
top tells me that all CPUs are 100% idle (Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st), which is definitely not the case. htop shows every CPU at "nan%".
This behaviour causes several services to stop working, including ntpd and dovecot (smtpd). I can't even restart these services. Trying /etc/init.d/dovecot restart hangs and dovecot does not respond to SIGTERM anymore. Unmounting a partition makes mount lock up, so I can't even reboot. The only way to recover is to terminate VBoxHeadless via Ctrl+C and restart it. The host serves 3 other virtual machines via VirtualBox, all running Debian 5 with a Linux 2.6.26-2-amd64 SMP kernel. They never had these problems.
Because I suspect that some kind of time synchronization is the cause, I have already tried the following (none of it solved the problem):
- removing the Guest Additions (GA), thus disabling host-guest time sync
- adding several different kernel options (no_lapic, no_apic, clocksource=acpi_pm, divider, ...)
- using ntpd to keep the time in sync
- reinstalling Debian several times (an act of desperation)
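When experimenting with clocksource= options, it can help to confirm which clocksource the guest kernel actually ended up using. The sketch below reads the standard Linux sysfs files for that; the function name and the fallback behaviour are illustrative assumptions, and the files may be absent on kernels built without this sysfs interface.

```python
from pathlib import Path

# Standard sysfs location of the kernel's clocksource selection.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources, or (None, []) if the files are missing."""
    try:
        current = (Path(base) / "current_clocksource").read_text().strip()
        available = (Path(base) / "available_clocksource").read_text().split()
    except OSError:
        return None, []
    return current, available


if __name__ == "__main__":
    current, available = read_clocksources()
    print("current:", current)
    print("available:", " ".join(available))
```

If the reported current clocksource is still tsc after booting with clocksource=acpi_pm, the option was not picked up.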
The problem has persisted since some 3.0.x version (I no longer know exactly which one). Since the 3.2.12 Changelog did not mention anything related to my problem, I haven't updated yet: updating means telling all customers to stop their machines at a specific point in time, which is troublesome.
Let me know if you need more log files than those I have already attached.
PS. I'm not sure about the "Component" of this ticket...