#3135 closed defect (fixed)
time inside guest fluctuates
Reported by: | Costin Grigoras | Owned by: | |
---|---|---|---|
Component: | other | Version: | VirtualBox 3.0.6 |
Keywords: | system time | Cc: | |
Guest type: | Linux | Host type: | Linux |
Description (last modified by )
Host:
OS: Ubuntu 8.10, 64bit, VirtualBox 2.1.2 (was also 2.1.0 and had the same problems) HW: HP DL360 G5, dual Xeon X5450, 16GB RAM
Guest: 64bit
OS1: Scientific Linux CERN 4.7 (Fedora Core 3), Kernel 2.6.9-78.0.8.EL.cern OS2: Scientific Linux CERN 5.2 (Fedora Core 6), Kernel 2.6.18-92.1.18.el5
Time is running too fast inside the guest, and vboxadd-timesync keeps bringing it back. Below is a sequence of commands ran at ~1sec interval:
# date Fri Jan 23 23:55:33 CET 2009 # date Fri Jan 23 23:55:35 CET 2009 # date Fri Jan 23 23:55:37 CET 2009 # date Fri Jan 23 23:55:34 CET 2009 # date Fri Jan 23 23:55:36 CET 2009
Guest: 32bit
Same OSs, but 32bit installations
Here the time flows correctly only if the guest OS is busy. If the guest is idle, time flows very slowly and the time sync daemon brings it forward a lot.
Other remarks:
Guest additions are installed and have the same version as the host virtualbox installation.
CPU frequency scaling is disabled, all cores run at full speed.
Attachments (8)
Change History (105)
comment:1 by , 16 years ago
comment:2 by , 16 years ago
Here you can see how heavy the problem is
[
Feb 1 22:04:00 endian fcron[4224]: Job /usr/bin/ntpfix started for user root (pid 4225) Feb 1 22:04:00 endian ntpd[3272]: ntpd exiting on signal 15 1 Feb 22:04:11 ntpdate[4233]: step time server 213.133.108.253 offset 10.731488 sec Feb 1 22:04:12 endian fcron[4224]: Job /usr/bin/ntpfix completed Feb 1 22:05:00 endian fcron[4237]: Job /usr/bin/ntpfix started for user root (pid 4238) 1 Feb 22:05:26 ntpdate[4251]: step time server 213.133.108.253 offset 25.675820 sec Feb 1 22:05:28 endian fcron[4237]: Job /usr/bin/ntpfix completed Feb 1 22:06:00 endian fcron[4408]: Job /usr/bin/ntpfix started for user root (pid 4409) 1 Feb 22:05:54 ntpdate[4417]: step time server 213.133.108.253 offset -6.396049 sec Feb 1 22:05:56 endian fcron[4408]: Job /usr/bin/ntpfix completed Feb 1 22:07:00 endian fcron[4419]: Job /usr/bin/ntpfix started for user root (pid 4420) 1 Feb 22:07:23 ntpdate[4428]: step time server 213.133.108.253 offset 22.603067 sec Feb 1 22:07:24 endian fcron[4419]: Job /usr/bin/ntpfix completed Feb 1 22:08:00 endian fcron[4430]: Job /usr/bin/ntpfix started for user root (pid 4431) 1 Feb 22:07:56 ntpdate[4439]: step time server 213.133.108.253 offset -4.023262 sec Feb 1 22:07:58 endian fcron[4430]: Job /usr/bin/ntpfix completed Feb 1 22:09:00 endian fcron[4444]: Job /usr/bin/ntpfix started for user root (pid 4445) 1 Feb 22:09:13 ntpdate[4453]: step time server 213.133.108.253 offset 12.435577 sec Feb 1 22:09:15 endian fcron[4444]: Job /usr/bin/ntpfix completed Feb 1 22:10:00 endian fcron[4457]: Job /usr/bin/ntpfix started for user root (pid 4458) 1 Feb 22:09:54 ntpdate[4475]: step time server 213.133.108.253 offset -6.266985 sec Feb 1 22:09:55 endian fcron[4457]: Job /usr/bin/ntpfix completed
No Problem on host
comment:3 by , 16 years ago
This problem is not present on a laptop (Intel P8600), Ubuntu 8.04 64bit as host and 64bit OpenSuSE 11.1 as guest.
32bit SLC4 as guest is still slower and catches up from time to time, but the 64bit guest runs smoothly even if vboxadd-timesync is not running.
Looking in the logs, the big difference is that this message shows up many times in the logs of 32 and 64bit guests that have problems (the original bug report), but don't show up at all on the laptop, for neither of them:
32:28:28.230 TM: Giving up catch-up attempt at a 60005481924 ns lag; new total: 129068108517031 ns 32:29:47.698 TM: Giving up catch-up attempt at a 60000297065 ns lag; new total: 129128108814096 ns 32:31:03.507 TM: Giving up catch-up attempt at a 60003226290 ns lag; new total: 129188112040386 ns 32:32:58.975 TM: Giving up catch-up attempt at a 60003303860 ns lag; new total: 129248115344246 ns
follow-up: 9 comment:4 by , 16 years ago
The problem seems related guests running older kernel versions. 2.6.9 and 2.6.18 show this problem, 2.6.27 runs fine.
After adding "noapic nolapic nosmp" options to the old guest kernels, it stopped racing and the time flows nicely.
comment:6 by , 15 years ago
This is still a bug in VBox 3.0.x. RHEL 5.4 x86_64 guest on a Fedora 10 x86_64 host. RHEL 5.x is stuck on a 2.6.18 kernel so I cannot simply upgrade the kernel for a fix. Has any progress been made? Are there any workarounds? I've tried "clock=pit" and no change. 1 minute goes by in real time while the guest has gone by 5 minutes or so.
follow-up: 8 comment:7 by , 15 years ago
Version: | VirtualBox 2.1.2 → VirtualBox 3.0.6 |
---|
mooninite: you've tried 3.0.6? Not that I think it will fix your problem, but that version does contain a timing fix (rdtsc).
comment:8 by , 15 years ago
Replying to sandervl73:
mooninite: you've tried 3.0.6? Not that I think it will fix your problem, but that version does contain a timing fix (rdtsc).
I was using 3.0.2, but now I am using 3.0.6. The time is "doubled" in 3.0.6. The 3.0.2 version seemed to be "tripled" in time. It's closer to being fixed, but the timer is still very bad. Just over the course of writing this comment to you the time in the host has gone by 1 minute, and the guest has gone by 4 minutes. The 3.0.6 guest addons are installed.
host: 2.6.27.30-170.2.82.fc10.x86_64 guest: 2.6.18-164.el5
comment:9 by , 15 years ago
Replying to costing:
The problem seems related guests running older kernel versions. 2.6.9 and 2.6.18 show this problem, 2.6.27 runs fine.
After adding "noapic nolapic nosmp" options to the old guest kernels, it stopped racing and the time flows nicely.
I had a very similar effect on a linux host, not only inside the guest system. Problem was also a kind of alternating time. The host was a dual core system and after a couple of days running the system the time starts jumping fore and backwards. The background was that the interrupt for the RTC was switching the CPU from time to time. This highly depends on system load and other ressources (eth, scsi, usb). Time stopped jumping when assigning the RTC and timer intterrupt to one singe CPU core. You might do this by executing: "echo 02 >/proc/irq/0/smp_affinity" -> this binds IRQ0 to core #1 Verify if the irq is the correct one for you (--> cat /proc/interrupts) This will potentially fix the problem with the "jumping time" but not avoid it from "running twice faster".
follow-up: 12 comment:10 by , 15 years ago
I am not having issues with the host timer, but I did notice something very interesting. The guest timer is being hit abnormally high. This is the output of /proc/interrupts on the guest after just a few minutes of being up:
# cat /proc/interrupts CPU0 0: 658872 IO-APIC-edge timer 1: 12 IO-APIC-edge i8042 6: 4 IO-APIC-edge floppy 8: 0 IO-APIC-edge rtc 9: 0 IO-APIC-level acpi 12: 3188 IO-APIC-edge i8042 15: 5649 IO-APIC-edge ide1 177: 24647 IO-APIC-level ahci 193: 2550 IO-APIC-level eth0 201: 1211 IO-APIC-level vboxadd NMI: 0 LOC: 329619 ERR: 0 MIS: 0
follow-up: 13 comment:11 by , 15 years ago
Why do you think that the guest timer is being hit too often? If the guest timer has a frequency of 1000HZ then your VM is probably about 11 minutes up.
comment:12 by , 15 years ago
Replying to mooninite:
I am not having issues with the host timer, but I did notice something very interesting.
Excuse the spam, but I wanted to add that a SuSE 9 guest (on an ancient kernel 2.6.9) on the same host is not having any timer off-sync problems, but it, too, is hitting the timer hard inside that guest. I shut down the SuSE guest and started the RHEL guest, but that did not solve anything.
comment:13 by , 15 years ago
Replying to frank:
Why do you think that the guest timer is being hit too often? If the guest timer has a frequency of 1000HZ then your VM is probably about 11 minutes up.
Ah, silly me. I'm used to CONFIG_NO_HZ now. That would make sense then, but my SuSE example still stands. Should I provide a vbox.log of both RHEL and SuSE? Would that do any good?
comment:14 by , 15 years ago
I've tried the following kernel boot parameters:
notsc divider=10
divider=10 clocksource=acpi_pm
divider=10 clocksource=tsc
None of those made a difference. I can tell divider worked as its only hitting the timer interrupt at 100Hz (200-300Hz a real second though). According to VMWare, RHEL 5.4 is supposed to not need any boot options as the timer issues with 2.6.18 were "fixed." It seems this is a VBox timer exposing bug then.
RHEL 5.4 (notsc divider=10): time.c: Using 3.579545 MHz WALL PM GTOD PM timer.
RHEL 5.4 (divider=10 clocksource=acpi_pm): time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
RHEL 5.4 (divider=10 clocksource=tsc): time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
SuSE 9 (defaults): Using tsc for high-res timesource
I suppose there aren't that many RHEL customers using VBox otherwise I would imagine this issue would have been fixed or a known workaround would have been posted somewhere.
comment:15 by , 15 years ago
Success!
divider=10 nolapic_timer
These boot parameters allow accurate time on the RHEL guest. I hope this helps narrow down the cause of this issue. It seems to be ACPI/APIC related. No changes to the host were ever made.
comment:16 by , 15 years ago
oh my! That worked for me! [on a RHEL4.4 guest running kernel 2.6.9-78.EL on an Ubuntu 9.04 host on a Core Duo 32-bit laptop] You are my saviour! Thank you!
follow-up: 18 comment:17 by , 15 years ago
divider=10 nolapic_timer did NOT work for me.
I am running openSUSE (x86) host with 2 CentOS 5.3 guests - both with GA installed.
The kernel in the guests is: 2.6.18-164.el5
One of the guests is in sync but the other is slow by over 20 minutes.
follow-up: 19 comment:18 by , 15 years ago
follow-up: 21 comment:19 by , 15 years ago
Replying to mooninite:
Replying to blender:
Are you using SMP on the guests?
In Virtualbox Management interface I am only provided 1 Processor for each VM.
If you are asking whether the guest kernel is SMP capable, it is:
[root@master ~]# uname -v #1 SMP Thu Sep 3 03:33:56 EDT 2009
Are you suggesting I pass nosmp kernel boot parameter?
Please advise.
Thanks!
Jason
follow-up: 22 comment:20 by , 15 years ago
Also, please post the output of
cat /sys/devices/system/clocksource/clocksource0/*
comment:21 by , 15 years ago
Replying to blender:
Are you suggesting I pass nosmp kernel boot parameter?
No. Adding that option would do nothing. In my findings, SMP (having VirtualBox show more than 1 CPU) forces the APIC timer and causes fast time even with the boot options. I have not experienced slower time, just faster time.
Frank, this is with the divider=10 nolapic_timer boot options set and one guest CPU.
# cat /sys/devices/system/clocksource/clocksource0/* jiffies jiffies
follow-up: 23 comment:22 by , 15 years ago
Replying to frank:
Also, please post the output of
cat /sys/devices/system/clocksource/clocksource0/*
[root@master ~]# cat /sys/devices/system/clocksource/clocksource0/* acpi_pm jiffies tsc pit tsc
Thanks!
follow-up: 24 comment:23 by , 15 years ago
Replying to blender:
[root@master ~]# cat /sys/devices/system/clocksource/clocksource0/* acpi_pm jiffies tsc pit tscThanks!
Is this when you used "nolapic_timer"? If so, then you must have typed something wrong on boot.
Try just this boot option:
clocksource=jiffies
comment:24 by , 15 years ago
Replying to mooninite:
Replying to blender:
[root@master ~]# cat /sys/devices/system/clocksource/clocksource0/* acpi_pm jiffies tsc pit tscThanks!
Is this when you used "nolapic_timer"? If so, then you must have typed something wrong on boot.
I edit menu.1st/grub.conf:
[root@master ~]# grep apic /boot/grub/menu.lst
kernel /vmlinuz-2.6.18-164.el5 ro root=/dev/VolGroup00/LogVol00 divider=10 nolapic_timer
Hmm, I just had a meeting and the timesync is now only a few minutes slow... So maybe I spoke too soon. Perhaps given more time it will achieve the timesync.
Try just this boot option:
clocksource=jiffies
If this is the best practice I will change it - can you please confirm that it is indeed the preferred configuration?
Many thanks!
Jason
follow-up: 27 comment:25 by , 15 years ago
It seems that with acpi_pm the guest clock is very stable. Is this timer selected by default by your guest if you don't override the clock source? If not, the output of dmesg in the guest could be interesting (attach please, don't copy'n'paste).
follow-up: 29 comment:26 by , 15 years ago
blender, the output you posted, was this with or without overriding the clock source?
comment:27 by , 15 years ago
Replying to frank:
It seems that with acpi_pm the guest clock is very stable. Is this timer selected by default by your guest if you don't override the clock source? If not, the output of dmesg in the guest could be interesting (attach please, don't copy'n'paste).
This is without overriding:
[root@master ~]# grep kernel /boot/grub/menu.lst # all kernel and initrd paths are relative to /boot/, eg. # kernel /vmlinuz-version ro root=/dev/VolGroup00/LogVol00 kernel /vmlinuz-2.6.18-164.el5 ro root=/dev/VolGroup00/LogVol00 kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00 [root@master ~]# cat /sys/devices/system/clocksource/clocksource0/current_clocksource tsc
Cheers
follow-up: 30 comment:28 by , 15 years ago
I don't see an "Attach file" button anywhere to attach dmesg.out
Please advise.
Cheers
comment:29 by , 15 years ago
Replying to frank:
blender, the output you posted, was this with or without overriding the clock source?
With
comment:30 by , 15 years ago
Replying to blender:
I don't see an "Attach file" button anywhere to attach dmesg.out
The attach button is just below the ticket description.
by , 15 years ago
Attachment: | blender_dmesg_no_override.out added |
---|
follow-up: 32 comment:31 by , 15 years ago
I'm running both VMs with clocksource=jiffies now. I basically see the same behavior. i.e. the one VM that has always been in sync is still in sync and the other VM that has always been slow is still slow, around 7 minutes currently.
How long should I give the VM to achieve the timesync?
The VMWare approach is detailed in this document: http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1006427&sliceId=1&docTypeID=DT_KB_1_1&dialogID=2591349&stateId=0%200%202593559
On that platform the suggestion for my circumstances would be to disable VMWare Tools timesync (their notion of Guest Additions), set kernel parameters to divider=10 clocksource=acpi_pm and run NTP.
Let me know next steps.
Thanks!
comment:32 by , 15 years ago
Replying to blender:
I'm running both VMs with clocksource=jiffies now. I basically see the same behavior. i.e. the one VM that has always been in sync is still in sync and the other VM that has always been slow is still slow, around 7 minutes currently.
How long should I give the VM to achieve the timesync?
My suggestion was only because my VM's timer was too fast. Your timer are too slow so it may be a different issue. You may want to open a new bug.
comment:33 by , 15 years ago
I took the VMWare approach. i.e.:
I disabled Virtualbox GA:
chkconfig vboxadd off chkconfig vboxadd-service off
Set kernel parameters:
divider=10 clocksource=acpi_pm
Configured NTP per VMWare article.
The VMs are keeping good time now - within a few seconds anyway which is better than I could achieve with GA and suggestions to date.
Since my guests are headless anyway AFAICT the only interest I had in GA was timesync, the other features have little/no benefit.
Anyway, I am happy to continue to debug if asked but at least now I can proceed with testing.
Cheers
comment:34 by , 15 years ago
blender and others: I saw the problem with too fast clock on a 64-bit SLES 10 VM. I tracked the problem down to doubled timer interrupt delivery. It's pretty easy to check if this is your problem: run 'cat /proc/interrupts'; use a clock or watch to measure a 5- or 10-second interval, high accuracy is not required; run 'cat /proc/interrupts' again. Calculate how many timer interrupts occurred in the time period. Check whether that corresponds with the kernel HZ setting or not. For a 250 Hz kernel, you should see ~250 timer interrupts per second. I'm seeing ~500, which is why the time in my guest runs way too fast.
This problem is specific to VMs which use the IOAPIC, which I think 64-bit Linux always(?) does. I don't know if the NOHZ option might have any effect because I need to use the stock SLES 10 kernel.
comment:35 by , 15 years ago
Just my two cents:
I'm using Ubuntu Hardy for both host and a couple of guests on a Pentium 4 without VT-x.
I've had "clocksource=acpi_pm" set on my VMs since the VBox 2.0 or 2.1 days. Once I had that set, I didn't have any issues. Time is stable and stays in sync with the host.
Since switching to 3.0, I've been having problems where sometimes when the first of my VMs starts up, it will print the console message "PM-Timer running at invalid rate: 116% of normal - aborting.". There will be no acpi_pm timer and the kernel will pick one of the less reliable ones (tsc or pit). At that point, the time in my guests will drift off from the host despite the fact that I'm running the guest additions daemon. This happens 25% of the time in the first of the two VMs that I run, but I've never seen it in the second.
follow-up: 37 comment:36 by , 15 years ago
The problem originally described in this ticket should be fixed in the next VirtualBox maintenance release. The symptom was time inside the guest running at 2x the correct speed. The issue was routing of 8259A interrupts with local APIC enabled. In practice that meant the problem was only likely to be observed on 64-bit Linux guests, even though guest bitness had nothing to do with it (because 64-bit OSes always use the I/O APIC and 32-bit usually don't).
comment:37 by , 15 years ago
Replying to michaln:
The problem originally described in this ticket should be fixed in the next VirtualBox maintenance release.
Sweet. I'll be sure to test & report when I get my hands on some binaries. Thanks!
comment:38 by , 15 years ago
Unfortunately the fix has some unexpected side effects, so we will not include it in 3.0.10. A bit more patience is required.
comment:39 by , 15 years ago
I'd been able to stabilize a Centos VM (on a Centos host) with the following settings (off plain vanilla distributions) ... some are probably uneccesary, but it works, so I'm not willing to fiddle with it ...
Centos 5 host -- straight install, yum update to current, no extra repos. NTP enabled for clock sync to NIST.
Centos 4 client -- straight install, yum update to current on Base and rpmforge. NTP enabled for clock sync. config on client : Kernel smp, 2 cpu cores, all acceleration options selected, PAE not selected. kernel boot options on client : elevator=deadline noapic nolapic divider=10 nolapic_timer clocksource=acpi_pm
With those options, the client VM boots, stays in sync. In addition, it's cpu usage is 1/4th of what it used to be.
Host system is an IBM x3650 (dual quad core 3ghz xeon)
comment:42 by , 15 years ago
Replying to frank:
That bug should be fixed in VBox 3.0.12. Please verify.
With IO APIC checked on the VM settings, the VM kernel dumps immediately upon boot. I can't see the full dump message as its larger than the console.
With IO APIC unchecked, the kernel acts like it is booting, but gets hung somewhere and never boots (nash never starts). There is a message on the screen about the APIC timer not being accurate.
In order to have a working, booting VM I have to add back my "divider=10 nolapic_timer" kernel arguments. IO APIC is unchecked.
comment:43 by , 15 years ago
Upgraded to virtualbox 3.0.12 (virtualbox-3.0_3.0.12-54655_Ubuntu_jaunty_amd64.deb)
Host OS: kubuntu 9.04
Guest OS: RHEL 5.3
Guest works with vbox 3.0.10 (although guest time is not accurate).
Guest (RHEL) fails to boot with 3.0.12. A variety of errors on the console with multiple boot attempts.
Here one example:
Memory for crash kernel (0x0 to 0x0) notwithin permissible range NMI Watchdog detected LOCKUP on CPU 0 CPU 0 Modules linked in: Pid: 1, comm: swapper Not tainted 2.6.18-128.el5 #1 RIP: 0010:[<ffffffff801a33e7>] [<ffffffff801a33e7>] vt_console_print+0x19d/0x228
I suspect something related to interrupts is at fault.
3.0.12 is not usable for me, I have to revert to 3.0.10 (virtualbox-3.0_3.0.10-54097_Ubuntu_jaunty_amd64.deb) to get RHEL booting.
Hope there's enough data there to track down the bug.
Thanks.
follow-ups: 45 47 comment:44 by , 15 years ago
Please use VBox 3.0.12. Specify the kernel parameter nmi_watchdog=0 for the guest kernel.
comment:45 by , 15 years ago
Replying to frank:
Please use VBox 3.0.12. Specify the kernel parameter nmi_watchdog=0 for the guest kernel.
That worked, I can now use RHEL5.3 guest with vbox 3.0.12. Thanks for your help.
comment:46 by , 15 years ago
Just upgraded to 3.0.12 and am still experiencing the problem.
Host: OSX 10.6.2
Guest: CentOS 4.7 (Final) Kernel 2.6.9-78.0.22.EL
dmesg is showing the "PM-Timer running at invalid rate: 120% of normal - aborting" message seen before.
Tried booting with nmi_watchdog=0 kernel parameter but clock still sped away. Am now using the 'elevator=deadline noapic nolapic divider=10 nolapic_timer clocksource=acpi_pm' kernel parameters suggested by litz and things seem fine.
comment:47 by , 15 years ago
Replying to frank:
Please use VBox 3.0.12. Specify the kernel parameter nmi_watchdog=0 for the guest kernel.
Using only that kernel parameter works. I had to have "IO APIC" checked as well. Time is being ticked normally. Thanks.
Is this kernel parameter going to be required from now on or is there an additional fix coming to not require it?
comment:48 by , 15 years ago
There is an additional fix coming to not require the nmi_watchdog=0 kernel parameter. However, you may want to keep using the parameter because the NMI watchdog is pretty useless (which is why later Linux kernel versions don't turn it on by default).
follow-up: 50 comment:49 by , 15 years ago
Seeing this problem in the form of the clock in the guest running slowly compared to the host clock.
Running Virtual Box 3.0.12 (r54655), on Ubuntu 9.04, with a Windows Vista guest. Guest additions for 3.0.12 installed.
Guest clock consistently drifts behind host clock: after manually setting the time, within a few minutes the clock is already 4 minutes slow.
Can't set nmi_watchdog=0 as the guest is not Linux.
Log file contains many instances of 00:04:06.193 TM: Giving up catch-up attempt at a 60 000 013 044 ns lag; new total: 60 000 013 044 ns
Followed by: 20:10:53.331 TM: Aborting catch-up attempt on reset with a 15 206 232 414 ns lag on reset; new total: 49 912 965 959 774 ns.
Any ideas?
Thanks.
follow-up: 51 comment:50 by , 15 years ago
comment:51 by , 15 years ago
Replying to mooninite:
Yes. Create a new bug. This bug is not the same as yours.
It's not entirely clear why you say this isn't the same bug: the time in the guest is incorrect, with the same log error messages. Admittedly it's not the same guest OS as the original bug report, but otherwise the symptoms are similar...
Obviously I'm happy to create another bug, but it would be good to understand why you believe that my issue is not the same one as covered by this ticket.
comment:52 by , 15 years ago
I've just tested VirtualBox 3.1.0 and 3.0.12, both still present the problem. The older kernel don't boot unless "nmi_watchdog=0", but this doesn't solve the time issue. But 3.0.10 was surprisingly running the machines without any time issues ... That I only realized _after_ upgrading, and now I regret doing it. So, could you please fix the fix ? :)
comment:53 by , 15 years ago
Until then this magic worked also for me, big thanks to the parents for suggesting it!
nmi_watchdog=0 elevator=deadline noapic nolapic divider=10 nolapic_timer clocksource=acpi_pm
comment:54 by , 15 years ago
On 2009/5/19 a change was made to TM.cpp that (AFAIK) broke timer catchup (now fixed internally I've been told: http://vbox.innotek.de/pipermail/vbox-dev/2009-December/002124.html), which causes the 'TM: Giving up catch-up attempt' messages.
Tracing Guest Additions timer-sync showed the (VirtualBox adjusted-for-lag) host clock running @ 2/3 speed with a 60 second catchup jump everytime the 'TM: Giving up catch-up attempt' messages occurs, which causes GA timesync to surge faster then slower as it tried to match the jumping input clock.
Disabling GA timesync causes the guest time to run @ 2/3 speed and lose time.
I'm not 100% clear on how timers work in the VMM, but they all (many) seem to rely on the CPU TSC counter, and in my case the timers may be malfunctioning because my host CPU does not have the 'Invariant TSC' CPU feature (CPUID(80000007)[EDX]=0), and has EIST (Enhanced Intel SpeedStep Technology) which changes the CPU clock based on CPU load.
Perhaps people having this problem could disable EIST and/or C1 and/or C1E and/or other CPU clock power saving measures in their host's BIOS and see if that affects VM timekeeping at all?
(Note: Some googling says that all recent Intel CPUs have Invariant TSC, and blogs/posts suggest checking the CPU family and model to determine if the feature is enabled (as does VirtualBox in TM.cpp), BUT I'm not sure this suggested check/code is correct. The 'Intel® 64 and IA-32 Architectures Software Developer’s Manual' I have documents Invariant TSC in exactly the same was as for AMD, namely CPUID(80000007)[EDX][bit8] (Table 3-20, page 3-189, volume 2A), and my CPU reports CPUID(80000007)[EDX]=0 (in VBox.log) and acts as if the TSC varies with CPU clock speed.)
comment:55 by , 15 years ago
I disabled EIST and C1E on my host BIOS. CPU-Z confirmed that the CPU was now locked at the highest rate.
BUT the VM still ran @ 2/3 = 66.7% of normal time when the GA timesync is disabled, which would seem to rule out my Invariant TSC theory. I thought I saw the VM time slow down, then catchup then settle down to 2/3 speed, all with the VBoxService not running and therefore no time sync, which was puzzling.
comment:56 by , 15 years ago
I have the same problem as the 'Guest: 32bit' of the OP: guest time flows slowly (2/3 speed in my case) unless the guest is busy.
Guest: Window 2000 Server with uniprocessor HAL.
I've done some testing and code examination, which I hope may be useful:
For the testing I've done, I note the following (all with GA disabled in the VM - no timesync):
('catchup works' below means: No 'Catchup given up' messages in VBox.log, and clock tracks in line with host @~100%.)
- If I have two VMs running at the same time (the original Windows 2000 and a clone of the same), then catchup works: The clock is accurate.
- If the (single) VM is busy, catchup works (it will even double/triple time the timers and run the VM faster than normal to catchup the current 60 second lag, and correct the clock - except for the given-up lag of course).
- If I set the Windows 2000 VM to use the ACPI Multiprocessor PC HAL, catchup works (but I get very high host CPU usage).
(The catchup only fails when using either of the two uniprocessor HALs.)
- If I have a single (or multiple) Windows XP BartPE (UBCD4WIN) VM running, catchup works.
(IO APIC doesn't matter; I'm not sure what HAL this uses.)
- If I have the Windows 2000 VM running and also a separate XP VM, catchup works in both VMs.
- In summary, the VM only goes slow (2/3 speed) and has 'Catchup given up' messages when a single Windows 2000 VM is running with a uniprocessor HAL when it is idle.
For the code examination I've done, I note the following:
(Please excuse me if I misunderstand any of the code: that is certainly possible.)
(These comments based on the 3.0.10 stable codeline.)
- There are only two functions that calculate and catchup amount, AND store that catchup back into the timers to actually fix the catchup.
(The calculation is done in a few more places, but only for internal use and is not actually "caught up" by updating a new value for tm.s.offVirtualSync.)
These are:
1) TM.cpp>tmR3TimerQueueRunVirtualSync (line 1908)
2) TMAllVirtual.cpp > tmVirtualSyncGetHandleCatchUpLocked (line 403)
Because of the fUpdateStuff bug, TM.cpp>tmR3TimerQueueRunVirtualSync did not actually fix and store the catchup.
Combining the tests and the code examination, seems to indicate this:
- In many situations, TMAllVirtual.cpp > tmVirtualSyncGetHandleCatchUpLocked is periodically run, which performs timer catchup and clocks keep in sync.
- When the host is running only a single VM that has only a single CPU and the guest is idle, TMAllVirtual.cpp > tmVirtualSyncGetHandleCatchUpLocked IS NOT periodically called, and no timer catchup occurs.
Is it intended that TMAllVirtual.cpp > tmVirtualSyncGetHandleCatchUpLocked should NOT run when there is only
1 CPU and that CPU is idle (halted)?
If it should be running, then that is the (a) problem.
If it not supposed to periodically run with a single idle (halted) CPU, and instead TM.cpp>tmR3TimerQueueRunVirtualSync runs instead (or TM.cpp>tmR3TimerQueueRunVirtualSync is supposed to run all the time anyway), then the fUpdateStuff bug is the (a) problem.
- Questions about the 2/3 rate:
+ Is the design actually that the timer in a guest should run at 2/3 (66.7%) the speed of the host, and that the catchup speeds it back up to 100%?
+ I suppose that design would / might handle host CPUs that had a varying TSC rate?
+ What is the explanation for the 2/3 rate?
comment:57 by , 15 years ago
The problems seem to have gotten worse in 3.1.2. Here even a new kernel (2.6.32.2) doesn't keep the correct time, but with varying amounts: when idle a couple of seconds every minute, when under load it becomes 15-20 sec/minute.
3.0.12 running the same machine with the same configuration doesn't present this problem. Also SMP seems to really work in 3.0.12 while 3.1.2 uses a bit more than 1 core when two tasks run at full blast.
comment:58 by , 15 years ago
Thanks for your help, I successfully fix my problem.
Here how I did it:
Host: Ubuntu Server 9.10 64bit 2.6.31-16-server Virtualbox: 3.1.2 Guest: Centos 5.4 64bit 2.6.18-164.10.1.el5.
Host: install and configure NTP.
Guest:
- sudo yum install binutils gcc make patch libgomp glibc-headers glibc-devel kernel-headers kernel-devel
- Reboot
- Install the GA
- Set kernel parameter: divider=10 nolapic_timer
- Verify that "cat /sys/devices/system/clocksource/clocksource0/*" return: jiffies jiffies
- Reboot
My time is in sync for more then 2 days now.
comment:59 by , 15 years ago
In 3.0.12 I was getting 'Giving up catch-up attempt' messages in VBox.log.
In 3.1.0 this was fixed (IIRC), and no such messages appeared.
In 3.1.2 the messages have come back again, and time sync is a bit wobbly (oscillates slow then fast).
comment:60 by , 15 years ago
Keep in mind that the 3.1.2 guest additions for non-Windows guests have a bug. If the guest clock is a bit ahead of the host clock, the guest time will be incorrectly adjusted and the time difference gets bigger and bigger (fixed in the next maintenance release). I'm not sure if this bug is related to your problem though.
comment:61 by , 15 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
The 3.1.4 additions should nicely synchronize the guest with the host (3.1.2 bug fixed. Please reopen if still relevant).
comment:62 by , 15 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Sorry, it doesn't fix anything. I ran a simple test on both the guest and host:
while true; do date; done
In 20 seconds the guest was 15 seconds behind when running 2.6.18 and 3.1.4 GA (same without GA). However when the guest runs a custom compiled 2.6.32 and GA or no GA, time is running fine. GA doesn't change the situation at all, the broken one is still broken, the one that works still works fine.
I state my problem again, if you do many-seconds catch-ups every now and then, this brakes a compilation for example. Especially that when using the CPU intensively the time differences are widening. But this problem was fixed in 3.0.10, how come you have broken it again and can't find the magic combination again? Uf, this seems to go nowhere...
comment:64 by , 15 years ago
Logs attached, please let me know if I can help with any other information.
comment:65 by , 15 years ago
I can't see it from the log but are you 100% sure that you installed the 3.1.4 guest additions in the VM with the 2.6.18 kernel? And is the timesync daemon running (check for VBoxService) and are you sure that there are no other timesync daemons running (neither from older VBox releases nor ntp or somethink similar)?
If so then you could start the timesync daemon manually like this:
su killall VBoxService /usr/sbin/VBoxService -f -vvv | tee ~/log
leave it running for a while, make your date experiments, then stop the service and attach the compressed log file to this ticket?
comment:66 by , 15 years ago
Yes to all questions:
# dmesg | grep -i virtualbox ... vboxguest: Successfully loaded version 3.1.4 (interface 0x0010004) vboxfs: Successfully loaded version 3.1.4 (interface 0x00010004) # pgrep -f ntp # # killall VBoxService #
I'm attaching the log now.
comment:67 by , 15 years ago
After the test with the older kernel I've rebooted the guest and tried the same thing with the newer kernel. The time difference between the guest and the host didn't increase any more.
follow-up: 70 comment:68 by , 15 years ago
Thanks for the logs! It appears that the guest time walks away that quick that adjtime() is just to slow to adapt the time correctly. Could you also attach the output of dmesg from the guest?
comment:69 by , 15 years ago
dmesg output attached. Just to be on the safe side I've repeated the tests with a single CPU available to the guest (all the above was with 2). There is no difference in behavior, same drift and good/bad situation function of kernel version.
comment:70 by , 15 years ago
Replying to frank:
Thanks for the logs! It appears that the guest time walks away that quick that adjtime() is just to slow to adapt the time correctly. Could you also attach the output of dmesg from the guest?
It appears more likely that adjtime() is failing to adjust the time at all.
comment:71 by , 15 years ago
No, adjtime adjusts the time in very small steps (< 5ms per step) to ensure that the applications are not confused.
follow-up: 74 comment:72 by , 15 years ago
I've upgraded to 3.1.6 and the time problem seems gone. 2.6.18 and 2.6.32, with 1 or 2 cores, idle or under load, no problems at all.
Guys, thanks a lot for fixing this!
comment:74 by , 15 years ago
Replying to costing:
I've upgraded to 3.1.6 and the time problem seems gone. 2.6.18 and 2.6.32, with 1 or 2 cores, idle or under load, no problems at all.
Guys, thanks a lot for fixing this!
Are you still using kernel arguments? I still have to otherwise the kernel never finishes booting. With kernel arguments (divider=10 nolapic_timer) timing has always been fine with any 3.1.x VBox release.
comment:75 by , 15 years ago
At least in one case I can confirm that the extra parameters are no longer needed, it doesn't make any difference if they are passed or not with either of the guest kernel versions. But even before it didn't make any difference if I specified them or not, the time would anyway be wrong. For now I cannot put the new VB version in production where there is a different hardware and more cases, I'll get back to you with a report when I do that.
follow-up: 77 comment:76 by , 15 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
I have to report a complete failure in this ...
3.1.6 has completely broken Centos 2.6.18 kernels (Centos 5).
I can't get any time sync to work at all - no errors are reported, it just slips out of sync, about 1 second delay per second of time. Once past 60 seconds off, it stops attempting and in a matter of 1-2 minutes, you're 15 minutes off in time.
e.g. : 05:59:02.531 TM: Giving up catch-up attempt at a 60 003 096 004 ns lag; new total: 6 139 186 536 679 ns
I've tried pretty much every kernel option on grub (noapic, nolapic, nolapictimer, and every clocksource= possiblity), and it didn't help.
This is a pretty huge issue for me - I test in VM a couple of applications that require sync'd time to operate, and this is causing them to fail (and therefore making the VM's useless for me).
Any ideas?
- litz
comment:77 by , 15 years ago
Replying to litz:
I've tried pretty much every kernel option on grub (noapic, nolapic, nolapictimer, and every clocksource= possiblity), and it didn't help.
Instead of "nolapictimer", the correct argument is "nolapic_timer". You can add "divider=10" to that to reduce wakeups.
comment:78 by , 15 years ago
My mistake, I typo'd the comment above.
The last set of options I tried (which didn't work) are :
divider=10 nolapic_timer noapic nolapic
I've tried everything listed above, to no avail, too.
comment:79 by , 15 years ago
I just checked my other VM's.
They are ALL failing. Centos 4 and 5, all fail. All systems are up-to-date (latest kernels available).
All systems have divider=10 nolapic_timer.
All fail with log messages similar to :
TM: Giving up catch-up attempt at a 60 003 096 004 ns lag; new total: 6 139 186 536 679 ns
I can see clock skews off as much as 15 minutes within 2-3 minutes of startup.
comment:80 by , 15 years ago
I've attached one of my VM's log files :
VBox.3.log (118.3 kB) - added by litz on 2010-04-09 23:17:39.
litz timesync failure
comment:81 by , 15 years ago
litz: your log shows the guest is still using a 1000 hz timer. Are you sure this divider=10 statement is correct in your grub config? Did you install the guest additions?
comment:82 by , 15 years ago
divider=10 is on the grub config line (and shows up in dmesg at system start), and guest additions is installed.
from dmesg :
Kernel command line: ro root=LABEL=/ divider=10 nolapic_timer noapic nolapic
So ... I dunno what's going on.
Anyone got any ideas? (timer=1000 shouldn't cause the time sync problems anyways, should it?)
- litz
comment:83 by , 15 years ago
Actually I misread the log file. (it's not 1024 hz, but 100) Didn't you mention in the forum that you also added this option for your host? (you shouldn't)
comment:85 by , 15 years ago
BTW - Centos 5.4 is vastly more stable (it's still not acceptably stable, but it's MUCH better) than Centos 4.8 ...
5.5 is out real soon; let's see how that behaves. It may fix the issue.
- litz
comment:86 by , 15 years ago
Ok ... I've been able to ... somewhat ... stabilize things on Centos 5.4
4.8 is, I think, a lost cause.
Have to use the following command line : divider=10 noapic nolapic nolapic_timer clocksource=tsc
and have ntpd enabled
If I do that, it seems to hold OK time ... exactly 7 seconds behind realtime.
Nothing I have tried has gotten the clock to hold realtime; it's always slow and lagging behind.
I can run two VM's and they stay perfectly in sync -- with each other ... exactly the same amount behind realtime, each. So whatever's happening, it's reproducable -- there is no actual sync between the two systems other than their timesync to the host (and ntp).
New log file attached.
comment:87 by , 15 years ago
I might also add ... that while the clock seems to be "keeping up with realtime" ... it is by NO MEANS steady.
It jumps back and forth, hangs for a 1/2 second, speeds up to catch up ...
Think of a car really in need of tuneup that spits, hesitates, then roars to life to catch up to traffic.
That's what the clock is running like.
The above solution is *workable*. It is NOT -- by any means -- an acceptable alternative to a properly stable clock.
Something's still broken. Any ideas what to test/try?
comment:88 by , 15 years ago
litz: is there any reason why you have IOAPIC enabled in the VM config? The VirtualBox default is having IOAPIC off, as emulating it costs quite some CPU cycles, and you don't seem have selected a 64bit guest OS.
comment:89 by , 15 years ago
hmm ... I do not know why IOAPIC is enabled. I'll turn it off ... can't hurt to save some cycles.
I've tried it both ways, before, though, and not seen a difference in performance either way ...
- litz
comment:90 by , 14 years ago
Something is funny with the machines, now one of them again runs slow:
# ntpdate x 19 Apr 13:03:05 ntpdate[28930]: adjust time server 137.138.16.69 offset 0.054395 sec # ntpdate x 19 Apr 13:03:06 ntpdate[28931]: adjust time server 137.138.16.69 offset 0.185675 sec # ntpdate x 19 Apr 13:03:07 ntpdate[28932]: adjust time server 137.138.16.69 offset 0.300528 sec # ntpdate x 19 Apr 13:03:08 ntpdate[28933]: adjust time server 137.138.16.69 offset 0.437668 sec # ntpdate x 19 Apr 13:03:10 ntpdate[28934]: step time server 137.138.16.69 offset 0.572383 sec
After the initial boot the time was almost ok for ~1 hour, then I compiled and started GA and then it started running very bad. But it continued to misbehave even after stopping GA. So it seems that something is activated by GA in the kernel that shouldn't ?
comment:91 by , 14 years ago
I've tried the settings above too (divider=10 noapic nolapic nolapic_timer clocksource=tsc) and the clock is a bit more stable (meaning "only" 5-12 seconds each minute, the actual interval seems to depend on the load, when idle it looses more), but ntp can't keep up so I run ntpdate from cron every minute. Is it a question of CPU type? 3.1.6 runs fine on a Core2 but all these problems show up on X5560.
comment:92 by , 14 years ago
Changing power states hasn't helped my VM timekeeping at all, but some people running Mac hosts have reported that disabling host power saving options helped.
Perhaps: https://bugzilla.redhat.com/show_bug.cgi?id=513138
When the tsc is not stable on the host, including cpufreq changes and deep C state, or even migration into faster tsc host. To stop deep C states in which the TSC can stop, add "processor.max_cstate=1" as a host kernel boot option. To disable cpufreq (only necessary on hosts which lack constant_tsc flag in "flags" field of /proc/cpuinfo), edit /etc/sysconfig/cpuspeed MIN_SPEED and MAX_SPEED variables to the highest frequency available.
... or disable Intel SpeedStep (IST/EIST/C1/C1E) in the host BIOS.
Is VBox.log/"TM: cTSCTicksPerSecond=" fairly constant? (Boot the VM while another VM is busy, compared to boot the VM with the host/VMs idle).
comment:93 by , 14 years ago
I've tried setting the cores' frequency to max, min, everything in between, but there is no change in the guest, it's still bad. I don't know about disabling freq scaling altogether, this is not something I can play freely with since the servers are in production but when I have the chance I'll report about the effect of this.
comment:94 by , 12 years ago
still problem with VirtualBox 4.2.10 host system is Windows 7 64bit on intel i5 (4 cores), guest system is also windows 7 64bit with 3 cores enabled
in guest system i have Microsoft SQL server 2008r2 and during debugging one problem i noticed that very often happens that for 2 inserted items in sequence 'newer one' has smaller timestamp then 'older one'
in test loop which did nothing just inserting items to temporary table i found that such thing happened 19-190 times for sample 1000000 rows. it seems to be depending on what is happening with computer, if i do something else in host system its worse, if i do nothing its better but for 1 million rows it never happened that there would be no such row at all
difference seems to be from 10ms to 110ms
comment:95 by , 12 years ago
Description: | modified (diff) |
---|
Sorry, could you also test VBox 4.2.12? Just this release contains a few fixes regarding timekeeping. It would be interesting to see if these fixes have any effect to your setup.
comment:96 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
No response, closing. Further fixes were made in VBox 4.3.0. Before you consider to reopen this ticket, please try VBox 4.3.2 first.
comment:97 by , 11 years ago
I'm still seeing this in Virtualbox 4.3.6. Windows 8 64-bit host and Ubuntu 12.04 guest.
Same Problem here Host: archlinux 32bit Guest: Endian Firewall (2.6.22.19-72)
Time gets wrong very fast, also ntpd is running