VirtualBox

Ticket #3135 (closed defect: fixed)

Opened 5 years ago

Last modified 3 months ago

time inside guest fluctuates

Reported by: costing Owned by:
Priority: major Component: other
Version: VirtualBox 3.0.6 Keywords: system time
Cc: Guest type: Linux
Host type: Linux

Description (last modified by frank) (diff)

Host:

OS: Ubuntu 8.10, 64bit, VirtualBox 2.1.2 (was also 2.1.0 and had the same problems) HW: HP DL360 G5, dual Xeon X5450, 16GB RAM

Guest: 64bit

OS1: Scientific Linux CERN 4.7 (Fedora Core 3), Kernel 2.6.9-78.0.8.EL.cern OS2: Scientific Linux CERN 5.2 (Fedora Core 6), Kernel 2.6.18-92.1.18.el5

Time is running too fast inside the guest, and vboxadd-timesync keeps bringing it back. Below is a sequence of commands ran at ~1sec interval:

# date
Fri Jan 23 23:55:33 CET 2009
# date
Fri Jan 23 23:55:35 CET 2009
# date
Fri Jan 23 23:55:37 CET 2009
# date
Fri Jan 23 23:55:34 CET 2009
# date
Fri Jan 23 23:55:36 CET 2009

Guest: 32bit

Same OSs, but 32bit installations

Here the time flows correctly only if the guest OS is busy. If the guest is idle, time flows very slowly and the time sync daemon brings it forward a lot.

Other remarks:

Guest additions are installed and have the same version as the host virtualbox installation.

CPU frequency scaling is disabled, all cores run at full speed.

Attachments

blender_dmesg_no_override.out Download (9.9 KB) - added by blender 5 years ago.
VBox.log Download (55.4 KB) - added by costing 4 years ago.
2.6.18 - time flowing very slow
VBox.2.log Download (56.5 KB) - added by costing 4 years ago.
2.6.32 - normal time flow
log.bz2 Download (1.1 KB) - added by costing 4 years ago.
VBoxService -f -vvv
log.2.bz2 Download (1.2 KB) - added by costing 4 years ago.
VBoxService -f -vvv (2.6.32)
dmesg.log.bz2 Download (5.4 KB) - added by costing 4 years ago.
dmesg for 2.6.18
VBox.3.log Download (118.3 KB) - added by litz 4 years ago.
litz timesync failure
VBox.4.log Download (38.4 KB) - added by litz 4 years ago.
New VBOX log file (4-12-10)

Change History

comment:1 Changed 5 years ago by bb84

Same Problem here Host: archlinux 32bit Guest: Endian Firewall (2.6.22.19-72)

Time gets wrong very fast, also ntpd is running

comment:2 Changed 5 years ago by bb84

Here you can see how heavy the problem is

[

Feb  1 22:04:00 endian fcron[4224]: Job /usr/bin/ntpfix started for user root (pid 4225)
Feb  1 22:04:00 endian ntpd[3272]: ntpd exiting on signal 15
 1 Feb 22:04:11 ntpdate[4233]: step time server 213.133.108.253 offset 10.731488 sec
Feb  1 22:04:12 endian fcron[4224]: Job /usr/bin/ntpfix completed
Feb  1 22:05:00 endian fcron[4237]: Job /usr/bin/ntpfix started for user root (pid 4238)
 1 Feb 22:05:26 ntpdate[4251]: step time server 213.133.108.253 offset 25.675820 sec
Feb  1 22:05:28 endian fcron[4237]: Job /usr/bin/ntpfix completed
Feb  1 22:06:00 endian fcron[4408]: Job /usr/bin/ntpfix started for user root (pid 4409)
 1 Feb 22:05:54 ntpdate[4417]: step time server 213.133.108.253 offset -6.396049 sec
Feb  1 22:05:56 endian fcron[4408]: Job /usr/bin/ntpfix completed
Feb  1 22:07:00 endian fcron[4419]: Job /usr/bin/ntpfix started for user root (pid 4420)
 1 Feb 22:07:23 ntpdate[4428]: step time server 213.133.108.253 offset 22.603067 sec
Feb  1 22:07:24 endian fcron[4419]: Job /usr/bin/ntpfix completed
Feb  1 22:08:00 endian fcron[4430]: Job /usr/bin/ntpfix started for user root (pid 4431)
 1 Feb 22:07:56 ntpdate[4439]: step time server 213.133.108.253 offset -4.023262 sec
Feb  1 22:07:58 endian fcron[4430]: Job /usr/bin/ntpfix completed
Feb  1 22:09:00 endian fcron[4444]: Job /usr/bin/ntpfix started for user root (pid 4445)
 1 Feb 22:09:13 ntpdate[4453]: step time server 213.133.108.253 offset 12.435577 sec
Feb  1 22:09:15 endian fcron[4444]: Job /usr/bin/ntpfix completed
Feb  1 22:10:00 endian fcron[4457]: Job /usr/bin/ntpfix started for user root (pid 4458)
 1 Feb 22:09:54 ntpdate[4475]: step time server 213.133.108.253 offset -6.266985 sec
Feb  1 22:09:55 endian fcron[4457]: Job /usr/bin/ntpfix completed

No Problem on host

comment:3 Changed 5 years ago by costing

This problem is not present on a laptop (Intel P8600), Ubuntu 8.04 64bit as host and 64bit OpenSuSE 11.1 as guest.

32bit SLC4 as guest is still slower and catches up from time to time, but the 64bit guest runs smoothly even if vboxadd-timesync is not running.

Looking in the logs, the big difference is that this message shows up many times in the logs of 32 and 64bit guests that have problems (the original bug report), but don't show up at all on the laptop, for neither of them:

32:28:28.230 TM: Giving up catch-up attempt at a 60005481924 ns lag; new total: 129068108517031 ns
32:29:47.698 TM: Giving up catch-up attempt at a 60000297065 ns lag; new total: 129128108814096 ns
32:31:03.507 TM: Giving up catch-up attempt at a 60003226290 ns lag; new total: 129188112040386 ns
32:32:58.975 TM: Giving up catch-up attempt at a 60003303860 ns lag; new total: 129248115344246 ns

comment:4 follow-up: ↓ 9 Changed 5 years ago by costing

The problem seems related guests running older kernel versions. 2.6.9 and 2.6.18 show this problem, 2.6.27 runs fine.

After adding "noapic nolapic nosmp" options to the old guest kernels, it stopped racing and the time flows nicely.

comment:5 Changed 5 years ago by sandervl73

These kernels probably use a different time source. We'll investigate.

comment:6 Changed 5 years ago by mooninite

This is still a bug in VBox 3.0.x. RHEL 5.4 x86_64 guest on a Fedora 10 x86_64 host. RHEL 5.x is stuck on a 2.6.18 kernel so I cannot simply upgrade the kernel for a fix. Has any progress been made? Are there any workarounds? I've tried "clock=pit" and no change. 1 minute goes by in real time while the guest has gone by 5 minutes or so.

comment:7 follow-up: ↓ 8 Changed 5 years ago by sandervl73

  • Version changed from VirtualBox 2.1.2 to VirtualBox 3.0.6

mooninite: you've tried 3.0.6? Not that I think it will fix your problem, but that version does contain a timing fix (rdtsc).

comment:8 in reply to: ↑ 7 Changed 5 years ago by mooninite

Replying to sandervl73:

mooninite: you've tried 3.0.6? Not that I think it will fix your problem, but that version does contain a timing fix (rdtsc).

I was using 3.0.2, but now I am using 3.0.6. The time is "doubled" in 3.0.6. The 3.0.2 version seemed to be "tripled" in time. It's closer to being fixed, but the timer is still very bad. Just over the course of writing this comment to you the time in the host has gone by 1 minute, and the guest has gone by 4 minutes. The 3.0.6 guest addons are installed.

host: 2.6.27.30-170.2.82.fc10.x86_64 guest: 2.6.18-164.el5

comment:9 in reply to: ↑ 4 Changed 5 years ago by bernhard

Replying to costing:

The problem seems related guests running older kernel versions. 2.6.9 and 2.6.18 show this problem, 2.6.27 runs fine.

After adding "noapic nolapic nosmp" options to the old guest kernels, it stopped racing and the time flows nicely.

I had a very similar effect on a linux host, not only inside the guest system. Problem was also a kind of alternating time. The host was a dual core system and after a couple of days running the system the time starts jumping fore and backwards. The background was that the interrupt for the RTC was switching the CPU from time to time. This highly depends on system load and other ressources (eth, scsi, usb). Time stopped jumping when assigning the RTC and timer intterrupt to one singe CPU core. You might do this by executing: "echo 02 >/proc/irq/0/smp_affinity" -> this binds IRQ0 to core #1 Verify if the irq is the correct one for you (--> cat /proc/interrupts) This will potentially fix the problem with the "jumping time" but not avoid it from "running twice faster".

comment:10 follow-up: ↓ 12 Changed 5 years ago by mooninite

I am not having issues with the host timer, but I did notice something very interesting. The guest timer is being hit abnormally high. This is the output of /proc/interrupts on the guest after just a few minutes of being up:

# cat /proc/interrupts 
           CPU0       
  0:     658872    IO-APIC-edge  timer
  1:         12    IO-APIC-edge  i8042
  6:          4    IO-APIC-edge  floppy
  8:          0    IO-APIC-edge  rtc
  9:          0   IO-APIC-level  acpi
 12:       3188    IO-APIC-edge  i8042
 15:       5649    IO-APIC-edge  ide1
177:      24647   IO-APIC-level  ahci
193:       2550   IO-APIC-level  eth0
201:       1211   IO-APIC-level  vboxadd
NMI:          0 
LOC:     329619 
ERR:          0
MIS:          0

comment:11 follow-up: ↓ 13 Changed 5 years ago by frank

Why do you think that the guest timer is being hit too often? If the guest timer has a frequency of 1000HZ then your VM is probably about 11 minutes up.

comment:12 in reply to: ↑ 10 Changed 5 years ago by mooninite

Replying to mooninite:

I am not having issues with the host timer, but I did notice something very interesting.

Excuse the spam, but I wanted to add that a SuSE 9 guest (on an ancient kernel 2.6.9) on the same host is not having any timer off-sync problems, but it, too, is hitting the timer hard inside that guest. I shut down the SuSE guest and started the RHEL guest, but that did not solve anything.

comment:13 in reply to: ↑ 11 Changed 5 years ago by mooninite

Replying to frank:

Why do you think that the guest timer is being hit too often? If the guest timer has a frequency of 1000HZ then your VM is probably about 11 minutes up.

Ah, silly me. I'm used to CONFIG_NO_HZ now. That would make sense then, but my SuSE example still stands. Should I provide a vbox.log of both RHEL and SuSE? Would that do any good?

comment:14 Changed 5 years ago by mooninite

I've tried the following kernel boot parameters:

notsc divider=10

divider=10 clocksource=acpi_pm

divider=10 clocksource=tsc

None of those made a difference. I can tell divider worked as its only hitting the timer interrupt at 100Hz (200-300Hz a real second though). According to VMWare, RHEL 5.4 is supposed to not need any boot options as the timer issues with 2.6.18 were "fixed." It seems this is a VBox timer exposing bug then.

RHEL 5.4 (notsc divider=10): time.c: Using 3.579545 MHz WALL PM GTOD PM timer.

RHEL 5.4 (divider=10 clocksource=acpi_pm): time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.

RHEL 5.4 (divider=10 clocksource=tsc): time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.

SuSE 9 (defaults): Using tsc for high-res timesource

I suppose there aren't that many RHEL customers using VBox otherwise I would imagine this issue would have been fixed or a known workaround would have been posted somewhere.

comment:15 Changed 5 years ago by mooninite

Success!

divider=10 nolapic_timer

These boot parameters allow accurate time on the RHEL guest. I hope this helps narrow down the cause of this issue. It seems to be ACPI/APIC related. No changes to the host were ever made.

comment:16 Changed 5 years ago by j2sentient

oh my! That worked for me! [on a RHEL4.4 guest running kernel 2.6.9-78.EL on an Ubuntu 9.04 host on a Core Duo 32-bit laptop] You are my saviour! Thank you!

comment:17 follow-up: ↓ 18 Changed 5 years ago by blender

divider=10 nolapic_timer did NOT work for me.

I am running openSUSE (x86) host with 2 CentOS 5.3 guests - both with GA installed.

The kernel in the guests is: 2.6.18-164.el5

One of the guests is in sync but the other is slow by over 20 minutes.

comment:18 in reply to: ↑ 17 ; follow-up: ↓ 19 Changed 5 years ago by mooninite

Replying to blender:

divider=10 nolapic_timer did NOT work for me.

Are you using SMP on the guests?

comment:19 in reply to: ↑ 18 ; follow-up: ↓ 21 Changed 5 years ago by blender

Replying to mooninite:

Replying to blender:

Are you using SMP on the guests?

In Virtualbox Management interface I am only provided 1 Processor for each VM.

If you are asking whether the guest kernel is SMP capable, it is:

[root@master ~]# uname -v #1 SMP Thu Sep 3 03:33:56 EDT 2009

Are you suggesting I pass nosmp kernel boot parameter?

Please advise.

Thanks!

Jason

comment:20 follow-up: ↓ 22 Changed 5 years ago by frank

Also, please post the output of

cat /sys/devices/system/clocksource/clocksource0/*

comment:21 in reply to: ↑ 19 Changed 5 years ago by mooninite

Replying to blender:

Are you suggesting I pass nosmp kernel boot parameter?

No. Adding that option would do nothing. In my findings, SMP (having VirtualBox show more than 1 CPU) forces the APIC timer and causes fast time even with the boot options. I have not experienced slower time, just faster time.

Frank, this is with the divider=10 nolapic_timer boot options set and one guest CPU.

# cat /sys/devices/system/clocksource/clocksource0/*
jiffies 
jiffies

comment:22 in reply to: ↑ 20 ; follow-up: ↓ 23 Changed 5 years ago by blender

Replying to frank:

Also, please post the output of

cat /sys/devices/system/clocksource/clocksource0/*
[root@master ~]# cat /sys/devices/system/clocksource/clocksource0/*
acpi_pm jiffies tsc pit 
tsc 

Thanks!

comment:23 in reply to: ↑ 22 ; follow-up: ↓ 24 Changed 5 years ago by mooninite

Replying to blender:

[root@master ~]# cat /sys/devices/system/clocksource/clocksource0/*
acpi_pm jiffies tsc pit 
tsc 

Thanks!

Is this when you used "nolapic_timer"? If so, then you must have typed something wrong on boot.

Try just this boot option:

clocksource=jiffies

comment:24 in reply to: ↑ 23 Changed 5 years ago by blender

Replying to mooninite:

Replying to blender:

[root@master ~]# cat /sys/devices/system/clocksource/clocksource0/*
acpi_pm jiffies tsc pit 
tsc 

Thanks!

Is this when you used "nolapic_timer"? If so, then you must have typed something wrong on boot.

I edit menu.1st/grub.conf:

[root@master ~]# grep apic /boot/grub/menu.lst

kernel /vmlinuz-2.6.18-164.el5 ro root=/dev/VolGroup00/LogVol00 divider=10 nolapic_timer

Hmm, I just had a meeting and the timesync is now only a few minutes slow... So maybe I spoke too soon. Perhaps given more time it will achieve the timesync.

Try just this boot option:

clocksource=jiffies

If this is the best practice I will change it - can you please confirm that it is indeed the preferred configuration?

Many thanks!

Jason

comment:25 follow-up: ↓ 27 Changed 5 years ago by frank

It seems that with acpi_pm the guest clock is very stable. Is this timer selected by default by your guest if you don't override the clock source? If not, the output of dmesg in the guest could be interesting (attach please, don't copy'n'paste).

comment:26 follow-up: ↓ 29 Changed 5 years ago by frank

blender, the output you posted, was this with or without overriding the clock source?

comment:27 in reply to: ↑ 25 Changed 5 years ago by blender

Replying to frank:

It seems that with acpi_pm the guest clock is very stable. Is this timer selected by default by your guest if you don't override the clock source? If not, the output of dmesg in the guest could be interesting (attach please, don't copy'n'paste).

This is without overriding:

[root@master ~]# grep kernel /boot/grub/menu.lst 
#          all kernel and initrd paths are relative to /boot/, eg.
#          kernel /vmlinuz-version ro root=/dev/VolGroup00/LogVol00
	kernel /vmlinuz-2.6.18-164.el5 ro root=/dev/VolGroup00/LogVol00
	kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/VolGroup00/LogVol00
[root@master ~]# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

Cheers

comment:28 follow-up: ↓ 30 Changed 5 years ago by blender

I don't see an "Attach file" button anywhere to attach dmesg.out

Please advise.

Cheers

comment:29 in reply to: ↑ 26 Changed 5 years ago by blender

Replying to frank:

blender, the output you posted, was this with or without overriding the clock source?

With

comment:30 in reply to: ↑ 28 Changed 5 years ago by frank

Replying to blender:

I don't see an "Attach file" button anywhere to attach dmesg.out

The attach button is just below the ticket description.

Changed 5 years ago by blender

comment:31 follow-up: ↓ 32 Changed 5 years ago by blender

I'm running both VMs with clocksource=jiffies now. I basically see the same behavior. i.e. the one VM that has always been in sync is still in sync and the other VM that has always been slow is still slow, around 7 minutes currently.

How long should I give the VM to achieve the timesync?

The VMWare approach is detailed in this document:  http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1006427&sliceId=1&docTypeID=DT_KB_1_1&dialogID=2591349&stateId=0%200%202593559

On that platform the suggestion for my circumstances would be to disable VMWare Tools timesync (their notion of Guest Additions), set kernel parameters to divider=10 clocksource=acpi_pm and run NTP.

Let me know next steps.

Thanks!

comment:32 in reply to: ↑ 31 Changed 5 years ago by mooninite

Replying to blender:

I'm running both VMs with clocksource=jiffies now. I basically see the same behavior. i.e. the one VM that has always been in sync is still in sync and the other VM that has always been slow is still slow, around 7 minutes currently.

How long should I give the VM to achieve the timesync?

My suggestion was only because my VM's timer was too fast. Your timer are too slow so it may be a different issue. You may want to open a new bug.

comment:33 Changed 5 years ago by blender

I took the VMWare approach. i.e.:

I disabled Virtualbox GA:

chkconfig vboxadd off chkconfig vboxadd-service off

Set kernel parameters:

divider=10 clocksource=acpi_pm

Configured NTP per VMWare article.

The VMs are keeping good time now - within a few seconds anyway which is better than I could achieve with GA and suggestions to date.

Since my guests are headless anyway AFAICT the only interest I had in GA was timesync, the other features have little/no benefit.

Anyway, I am happy to continue to debug if asked but at least now I can proceed with testing.

Cheers

comment:34 Changed 5 years ago by michaln

blender and others: I saw the problem with too fast clock on a 64-bit SLES 10 VM. I tracked the problem down to doubled timer interrupt delivery. It's pretty easy to check if this is your problem: run 'cat /proc/interrupts'; use a clock or watch to measure a 5- or 10-second interval, high accuracy is not required; run 'cat /proc/interrupts' again. Calculate how many timer interrupts occurred in the time period. Check whether that corresponds with the kernel HZ setting or not. For a 250 Hz kernel, you should see ~250 timer interrupts per second. I'm seeing ~500, which is why the time in my guest runs way too fast.

This problem is specific to VMs which use the IOAPIC, which I think 64-bit Linux always(?) does. I don't know if the NOHZ option might have any effect because I need to use the stock SLES 10 kernel.

comment:35 Changed 4 years ago by kpreslan

Just my two cents:

I'm using Ubuntu Hardy for both host and a couple of guests on a Pentium 4 without VT-x.

I've had "clocksource=acpi_pm" set on my VMs since the VBox 2.0 or 2.1 days. Once I had that set, I didn't have any issues. Time is stable and stays in sync with the host.

Since switching to 3.0, I've been having problems where sometimes when the first of my VMs starts up, it will print the console message "PM-Timer running at invalid rate: 116% of normal - aborting.". There will be no acpi_pm timer and the kernel will pick one of the less reliable ones (tsc or pit). At that point, the time in my guests will drift off from the host despite the fact that I'm running the guest additions daemon. This happens 25% of the time in the first of the two VMs that I run, but I've never seen it in the second.

comment:36 follow-up: ↓ 37 Changed 4 years ago by michaln

The problem originally described in this ticket should be fixed in the next VirtualBox maintenance release. The symptom was time inside the guest running at 2x the correct speed. The issue was routing of 8259A interrupts with local APIC enabled. In practice that meant the problem was only likely to be observed on 64-bit Linux guests, even though guest bitness had nothing to do with it (because 64-bit OSes always use the I/O APIC and 32-bit usually don't).

comment:37 in reply to: ↑ 36 Changed 4 years ago by mooninite

Replying to michaln:

The problem originally described in this ticket should be fixed in the next VirtualBox maintenance release.

Sweet. I'll be sure to test & report when I get my hands on some binaries. Thanks!

comment:38 Changed 4 years ago by sandervl73

Unfortunately the fix has some unexpected side effects, so we will not include it in 3.0.10. A bit more patience is required.

comment:39 Changed 4 years ago by litz

I'd been able to stabilize a Centos VM (on a Centos host) with the following settings (off plain vanilla distributions) ... some are probably uneccesary, but it works, so I'm not willing to fiddle with it ...

Centos 5 host -- straight install, yum update to current, no extra repos. NTP enabled for clock sync to NIST.

Centos 4 client -- straight install, yum update to current on Base and rpmforge. NTP enabled for clock sync. config on client : Kernel smp, 2 cpu cores, all acceleration options selected, PAE not selected. kernel boot options on client : elevator=deadline noapic nolapic divider=10 nolapic_timer clocksource=acpi_pm

With those options, the client VM boots, stays in sync. In addition, it's cpu usage is 1/4th of what it used to be.

Host system is an IBM x3650 (dual quad core 3ghz xeon)

comment:40 Changed 4 years ago by litz

Oh, forgot to note -- host system is 64bit, client is 32bit.

comment:41 follow-up: ↓ 42 Changed 4 years ago by frank

That bug should be fixed in VBox 3.0.12. Please verify.

comment:42 in reply to: ↑ 41 Changed 4 years ago by mooninite

Replying to frank:

That bug should be fixed in VBox 3.0.12. Please verify.

With IO APIC checked on the VM settings, the VM kernel dumps immediately upon boot. I can't see the full dump message as its larger than the console.

With IO APIC unchecked, the kernel acts like it is booting, but gets hung somewhere and never boots (nash never starts). There is a message on the screen about the APIC timer not being accurate.

In order to have a working, booting VM I have to add back my "divider=10 nolapic_timer" kernel arguments. IO APIC is unchecked.

comment:43 Changed 4 years ago by pfee

Upgraded to virtualbox 3.0.12 (virtualbox-3.0_3.0.12-54655_Ubuntu_jaunty_amd64.deb)

Host OS: kubuntu 9.04

Guest OS: RHEL 5.3

Guest works with vbox 3.0.10 (although guest time is not accurate).

Guest (RHEL) fails to boot with 3.0.12. A variety of errors on the console with multiple boot attempts.

Here one example:

 Memory for crash kernel (0x0 to 0x0) notwithin permissible range
 NMI Watchdog detected LOCKUP on CPU 0
 CPU 0
 Modules linked in:
 Pid: 1, comm: swapper Not tainted 2.6.18-128.el5 #1
 RIP: 0010:[<ffffffff801a33e7>]  [<ffffffff801a33e7>] vt_console_print+0x19d/0x228

I suspect something related to interrupts is at fault.

3.0.12 is not usable for me, I have to revert to 3.0.10 (virtualbox-3.0_3.0.10-54097_Ubuntu_jaunty_amd64.deb) to get RHEL booting.

Hope there's enough data there to track down the bug.

Thanks.

comment:44 follow-ups: ↓ 45 ↓ 47 Changed 4 years ago by frank

Please use VBox 3.0.12. Specify the kernel parameter nmi_watchdog=0 for the guest kernel.

comment:45 in reply to: ↑ 44 Changed 4 years ago by pfee

Replying to frank:

Please use VBox 3.0.12. Specify the kernel parameter nmi_watchdog=0 for the guest kernel.

That worked, I can now use RHEL5.3 guest with vbox 3.0.12. Thanks for your help.

comment:46 Changed 4 years ago by mattkyn

Just upgraded to 3.0.12 and am still experiencing the problem.

Host: OSX 10.6.2
Guest: CentOS 4.7 (Final) Kernel 2.6.9-78.0.22.EL

dmesg is showing the "PM-Timer running at invalid rate: 120% of normal - aborting" message seen before.

Tried booting with nmi_watchdog=0 kernel parameter but clock still sped away. Am now using the 'elevator=deadline noapic nolapic divider=10 nolapic_timer clocksource=acpi_pm' kernel parameters suggested by litz and things seem fine.

comment:47 in reply to: ↑ 44 Changed 4 years ago by mooninite

Replying to frank:

Please use VBox 3.0.12. Specify the kernel parameter nmi_watchdog=0 for the guest kernel.

Using only that kernel parameter works. I had to have "IO APIC" checked as well. Time is being ticked normally. Thanks.

Is this kernel parameter going to be required from now on or is there an additional fix coming to not require it?

comment:48 Changed 4 years ago by michaln

There is an additional fix coming to not require the nmi_watchdog=0 kernel parameter. However, you may want to keep using the parameter because the NMI watchdog is pretty useless (which is why later Linux kernel versions don't turn it on by default).

comment:49 follow-up: ↓ 50 Changed 4 years ago by dcottingham

Seeing this problem in the form of the clock in the guest running slowly compared to the host clock.

Running Virtual Box 3.0.12 (r54655), on Ubuntu 9.04, with a Windows Vista guest. Guest additions for 3.0.12 installed.

Guest clock consistently drifts behind host clock: after manually setting the time, within a few minutes the clock is already 4 minutes slow.

Can't set nmi_watchdog=0 as the guest is not Linux.

Log file contains many instances of 00:04:06.193 TM: Giving up catch-up attempt at a 60 000 013 044 ns lag; new total: 60 000 013 044 ns

Followed by: 20:10:53.331 TM: Aborting catch-up attempt on reset with a 15 206 232 414 ns lag on reset; new total: 49 912 965 959 774 ns.

Any ideas?

Thanks.

comment:50 in reply to: ↑ 49 ; follow-up: ↓ 51 Changed 4 years ago by mooninite

Replying to dcottingham:

Any ideas?

Yes. Create a new bug. This bug is not the same as yours.

comment:51 in reply to: ↑ 50 Changed 4 years ago by dcottingham

Replying to mooninite:

Yes. Create a new bug. This bug is not the same as yours.

It's not entirely clear why you say this isn't the same bug: the time in the guest is incorrect, with the same log error messages. Admittedly it's not the same guest OS as the original bug report, but otherwise the symptoms are similar...

Obviously I'm happy to create another bug, but it would be good to understand why you believe that my issue is not the same one as covered by this ticket.

comment:52 Changed 4 years ago by costing

I've just tested VirtualBox 3.1.0 and 3.0.12, both still present the problem. The older kernel don't boot unless "nmi_watchdog=0", but this doesn't solve the time issue. But 3.0.10 was surprisingly running the machines without any time issues ... That I only realized _after_ upgrading, and now I regret doing it. So, could you please fix the fix ? :)

comment:53 Changed 4 years ago by costing

Until then this magic worked also for me, big thanks to the parents for suggesting it!

nmi_watchdog=0 elevator=deadline noapic nolapic divider=10 nolapic_timer clocksource=acpi_pm

comment:54 Changed 4 years ago by MarkCranness

On 2009/5/19 a change was made to TM.cpp that (AFAIK) broke timer catchup (now fixed internally I've been told:  http://vbox.innotek.de/pipermail/vbox-dev/2009-December/002124.html), which causes the 'TM: Giving up catch-up attempt' messages.

Tracing Guest Additions timer-sync showed the (VirtualBox adjusted-for-lag) host clock running @ 2/3 speed with a 60 second catchup jump everytime the 'TM: Giving up catch-up attempt' messages occurs, which causes GA timesync to surge faster then slower as it tried to match the jumping input clock.

Disabling GA timesync causes the guest time to run @ 2/3 speed and lose time.

I'm not 100% clear on how timers work in the VMM, but they all (many) seem to rely on the CPU TSC counter, and in my case the timers may be malfunctioning because my host CPU does not have the 'Invariant TSC' CPU feature (CPUID(80000007)[EDX]=0), and has EIST (Enhanced Intel SpeedStep Technology) which changes the CPU clock based on CPU load.

Perhaps people having this problem could disable EIST and/or C1 and/or C1E and/or other CPU clock power saving measures in their host's BIOS and see if that affects VM timekeeping at all?

(Note: Some googling says that all recent Intel CPUs have Invariant TSC, and blogs/posts suggest checking the CPU family and model to determine if the feature is enabled (as does VirtualBox in TM.cpp), BUT I'm not sure this suggested check/code is correct. The 'Intel® 64 and IA-32 Architectures Software Developer’s Manual' I have documents Invariant TSC in exactly the same was as for AMD, namely CPUID(80000007)[EDX][bit8] (Table 3-20, page 3-189, volume 2A), and my CPU reports CPUID(80000007)[EDX]=0 (in VBox.log) and acts as if the TSC varies with CPU clock speed.)

comment:55 Changed 4 years ago by MarkCranness

I disabled EIST and C1E on my host BIOS. CPU-Z confirmed that the CPU was now locked at the highest rate.

BUT the VM still ran @ 2/3 = 66.7% of normal time when the GA timesync is disabled, which would seem to rule out my Invariant TSC theory. I thought I saw the VM time slow down, then catchup then settle down to 2/3 speed, all with the VBoxService not running and therefore no time sync, which was puzzling.

comment:56 Changed 4 years ago by MarkCranness

I have the same problem as the 'Guest: 32bit' of the OP: guest time flows slowly (2/3 speed in my case) unless the guest is busy.

Guest: Window 2000 Server with uniprocessor HAL.

I've done some testing and code examination, which I hope may be useful:

For the testing I've done, I note the following (all with GA disabled in the VM - no timesync):
('catchup works' below means: No 'Catchup given up' messages in VBox.log, and clock tracks in line with host @~100%.)

  • If I have two VMs running at the same time (the original Windows 2000 and a clone of the same), then catchup works: The clock is accurate.
  • If the (single) VM is busy, catchup works (it will even double/triple time the timers and run the VM faster than normal to catchup the current 60 second lag, and correct the clock - except for the given-up lag of course).
  • If I set the Windows 2000 VM to use the ACPI Multiprocessor PC HAL, catchup works (but I get very high host CPU usage).

(The catchup only fails when using either of the two uniprocessor HALs.)

  • If I have a single (or multiple) Windows XP BartPE (UBCD4WIN) VM running, catchup works.

(IO APIC doesn't matter; I'm not sure what HAL this uses.)

  • If I have the Windows 2000 VM running and also a separate XP VM, catchup works in both VMs.
  • In summary, the VM only goes slow (2/3 speed) and has 'Catchup given up' messages when a single Windows 2000 VM is running with a uniprocessor HAL when it is idle.

For the code examination I've done, I note the following:
(Please excuse me if I misunderstand any of the code: that is certainly possible.)
(These comments based on the 3.0.10 stable codeline.)

  • There are only two functions that calculate and catchup amount, AND store that catchup back into the timers to actually fix the catchup.

(The calculation is done in a few more places, but only for internal use and is not actually "caught up" by updating a new value for tm.s.offVirtualSync.)

These are:
1) TM.cpp>tmR3TimerQueueRunVirtualSync (line 1908)
2) TMAllVirtual.cpp > tmVirtualSyncGetHandleCatchUpLocked (line 403)

Because of the fUpdateStuff bug, TM.cpp>tmR3TimerQueueRunVirtualSync did not actually fix and store the catchup.

Combining the tests and the code examination, seems to indicate this:

  • In many situations, TMAllVirtual.cpp > tmVirtualSyncGetHandleCatchUpLocked is periodically run, which performs timer catchup and clocks keep in sync.
  • When the host is running only a single VM that has only a single CPU and the guest is idle, TMAllVirtual.cpp > tmVirtualSyncGetHandleCatchUpLocked IS NOT periodically called, and no timer catchup occurs.

Is it intended that TMAllVirtual.cpp > tmVirtualSyncGetHandleCatchUpLocked should NOT run when there is only 1 CPU and that CPU is idle (halted)?
If it should be running, then that is the (a) problem.

If it not supposed to periodically run with a single idle (halted) CPU, and instead TM.cpp>tmR3TimerQueueRunVirtualSync runs instead (or TM.cpp>tmR3TimerQueueRunVirtualSync is supposed to run all the time anyway), then the fUpdateStuff bug is the (a) problem.

  • Questions about the 2/3 rate:

+ Is the design actually that the timer in a guest should run at 2/3 (66.7%) the speed of the host, and that the catchup speeds it back up to 100%?
+ I suppose that design would / might handle host CPUs that had a varying TSC rate?
+ What is the explanation for the 2/3 rate?

comment:57 Changed 4 years ago by costing

The problems seem to have gotten worse in 3.1.2. Here even a new kernel (2.6.32.2) doesn't keep the correct time, but with varying amounts: when idle a couple of seconds every minute, when under load it becomes 15-20 sec/minute.

3.0.12 running the same machine with the same configuration doesn't present this problem. Also SMP seems to really work in 3.0.12 while 3.1.2 uses a bit more than 1 core when two tasks run at full blast.

comment:58 Changed 4 years ago by sgrenier

Thanks for your help, I successfully fix my problem.

Here how I did it:

Host: Ubuntu Server 9.10 64bit 2.6.31-16-server Virtualbox: 3.1.2 Guest: Centos 5.4 64bit 2.6.18-164.10.1.el5.

Host: install and configure NTP.

Guest:

  1. sudo yum install binutils gcc make patch libgomp glibc-headers glibc-devel kernel-headers kernel-devel
  2. Reboot
  3. Install the GA
  4. Set kernel parameter: divider=10 nolapic_timer
  5. Verify that "cat /sys/devices/system/clocksource/clocksource0/*" return: jiffies jiffies
  6. Reboot

My time is in sync for more then 2 days now.

comment:59 Changed 4 years ago by MarkCranness

In 3.0.12 I was getting 'Giving up catch-up attempt' messages in VBox.log.

In 3.1.0 this was fixed (IIRC), and no such messages appeared.

In 3.1.2 the messages have come back again, and time sync is a bit wobbly (oscillates slow then fast).

comment:60 Changed 4 years ago by frank

Keep in mind that the 3.1.2 guest additions for non-Windows guests have a bug. If the guest clock is a bit ahead of the host clock, the guest time will be incorrectly adjusted and the time difference gets bigger and bigger (fixed in the next maintenance release). I'm not sure if this bug is related to your problem though.

comment:61 Changed 4 years ago by frank

  • Status changed from new to closed
  • Resolution set to fixed

The 3.1.4 additions should nicely synchronize the guest with the host (3.1.2 bug fixed. Please reopen if still relevant).

comment:62 Changed 4 years ago by costing

  • Status changed from closed to reopened
  • Resolution fixed deleted

Sorry, it doesn't fix anything. I ran a simple test on both the guest and host:

while true; do date; done

In 20 seconds the guest was 15 seconds behind when running 2.6.18 and 3.1.4 GA (same without GA). However when the guest runs a custom compiled 2.6.32 and GA or no GA, time is running fine. GA doesn't change the situation at all, the broken one is still broken, the one that works still works fine.

I state my problem again, if you do many-seconds catch-ups every now and then, this brakes a compilation for example. Especially that when using the CPU intensively the time differences are widening. But this problem was fixed in 3.0.10, how come you have broken it again and can't find the magic combination again? Uf, this seems to go nowhere...

comment:63 Changed 4 years ago by frank

In that case please attach a VBox.log file of such a VM session.

Changed 4 years ago by costing

2.6.18 - time flowing very slow

Changed 4 years ago by costing

2.6.32 - normal time flow

comment:64 Changed 4 years ago by costing

Logs attached, please let me know if I can help with any other information.

comment:65 Changed 4 years ago by frank

I can't see it from the log but are you 100% sure that you installed the 3.1.4 guest additions in the VM with the 2.6.18 kernel? And is the timesync daemon running (check for VBoxService) and are you sure that there are no other timesync daemons running (neither from older VBox releases nor ntp or somethink similar)?

If so then you could start the timesync daemon manually like this:

su
killall VBoxService
/usr/sbin/VBoxService -f -vvv | tee ~/log

leave it running for a while, make your date experiments, then stop the service and attach the compressed log file to this ticket?

comment:66 Changed 4 years ago by costing

Yes to all questions:

# dmesg | grep -i virtualbox
...
vboxguest: Successfully loaded version 3.1.4 (interface 0x0010004)
vboxfs: Successfully loaded version 3.1.4 (interface 0x00010004)

# pgrep -f ntp
#

# killall VBoxService
#

I'm attaching the log now.

Changed 4 years ago by costing

VBoxService -f -vvv

Changed 4 years ago by costing

VBoxService -f -vvv (2.6.32)

comment:67 Changed 4 years ago by costing

After the test with the older kernel I've rebooted the guest and tried the same thing with the newer kernel. The time difference between the guest and the host didn't increase any more.

comment:68 follow-up: ↓ 70 Changed 4 years ago by frank

Thanks for the logs! It appears that the guest time walks away that quick that adjtime() is just to slow to adapt the time correctly. Could you also attach the output of dmesg from the guest?

Changed 4 years ago by costing

dmesg for 2.6.18

comment:69 Changed 4 years ago by costing

dmesg output attached. Just to be on the safe side I've repeated the tests with a single CPU available to the guest (all the above was with 2). There is no difference in behavior, same drift and good/bad situation function of kernel version.

comment:70 in reply to: ↑ 68 Changed 4 years ago by MarkCranness

Replying to frank:

Thanks for the logs! It appears that the guest time walks away that quick that adjtime() is just to slow to adapt the time correctly. Could you also attach the output of dmesg from the guest?

It appears more likely that adjtime() is failing to adjust the time at all.

comment:71 Changed 4 years ago by frank

No, adjtime adjusts the time in very small steps (< 5ms per step) to ensure that the applications are not confused.

comment:72 follow-up: ↓ 74 Changed 4 years ago by costing

I've upgraded to 3.1.6 and the time problem seems gone. 2.6.18 and 2.6.32, with 1 or 2 cores, idle or under load, no problems at all.

Guys, thanks a lot for fixing this!

comment:73 Changed 4 years ago by frank

  • Status changed from reopened to closed
  • Resolution set to fixed

Thanks for the feedback!

comment:74 in reply to: ↑ 72 Changed 4 years ago by mooninite

Replying to costing:

I've upgraded to 3.1.6 and the time problem seems gone. 2.6.18 and 2.6.32, with 1 or 2 cores, idle or under load, no problems at all.

Guys, thanks a lot for fixing this!

Are you still using kernel arguments? I still have to otherwise the kernel never finishes booting. With kernel arguments (divider=10 nolapic_timer) timing has always been fine with any 3.1.x VBox release.

comment:75 Changed 4 years ago by costing

At least in one case I can confirm that the extra parameters are no longer needed, it doesn't make any difference if they are passed or not with either of the guest kernel versions. But even before it didn't make any difference if I specified them or not, the time would anyway be wrong. For now I cannot put the new VB version in production where there is a different hardware and more cases, I'll get back to you with a report when I do that.

comment:76 follow-up: ↓ 77 Changed 4 years ago by litz

  • Status changed from closed to reopened
  • Resolution fixed deleted

I have to report a complete failure in this ...

3.1.6 has completely broken Centos 2.6.18 kernels (Centos 5).

I can't get any time sync to work at all - no errors are reported, it just slips out of sync, about 1 second delay per second of time. Once past 60 seconds off, it stops attempting and in a matter of 1-2 minutes, you're 15 minutes off in time.

e.g. : 05:59:02.531 TM: Giving up catch-up attempt at a 60 003 096 004 ns lag; new total: 6 139 186 536 679 ns

I've tried pretty much every kernel option on grub (noapic, nolapic, nolapictimer, and every clocksource= possiblity), and it didn't help.

This is a pretty huge issue for me - I test in VM a couple of applications that require sync'd time to operate, and this is causing them to fail (and therefore making the VM's useless for me).

Any ideas?

  • litz

comment:77 in reply to: ↑ 76 Changed 4 years ago by mooninite

Replying to litz:

I've tried pretty much every kernel option on grub (noapic, nolapic, nolapictimer, and every clocksource= possiblity), and it didn't help.

Instead of "nolapictimer", the correct argument is "nolapic_timer". You can add "divider=10" to that to reduce wakeups.

comment:78 Changed 4 years ago by litz

My mistake, I typo'd the comment above.

The last set of options I tried (which didn't work) are :

divider=10 nolapic_timer noapic nolapic

I've tried everything listed above, to no avail, too.

comment:79 Changed 4 years ago by litz

I just checked my other VM's.

They are ALL failing. Centos 4 and 5, all fail. All systems are up-to-date (latest kernels available).

All systems have divider=10 nolapic_timer.

All fail with log messages similar to :

TM: Giving up catch-up attempt at a 60 003 096 004 ns lag; new total: 6 139 186 536 679 ns

I can see clock skews off as much as 15 minutes within 2-3 minutes of startup.

Changed 4 years ago by litz

litz timesync failure

comment:80 Changed 4 years ago by litz

I've attached one of my VM's log files :

VBox.3.log (118.3 kB) - added by litz on 2010-04-09 23:17:39.

litz timesync failure

comment:81 Changed 4 years ago by sandervl73

litz: your log shows the guest is still using a 1000 hz timer. Are you sure this divider=10 statement is correct in your grub config? Did you install the guest additions?

comment:82 Changed 4 years ago by litz

divider=10 is on the grub config line (and shows up in dmesg at system start), and guest additions is installed.

from dmesg :

Kernel command line: ro root=LABEL=/ divider=10 nolapic_timer noapic nolapic

So ... I dunno what's going on.

Anyone got any ideas? (timer=1000 shouldn't cause the time sync problems anyways, should it?)

  • litz

comment:83 Changed 4 years ago by sandervl73

Actually I misread the log file. (it's not 1024 hz, but 100) Didn't you mention in the forum that you also added this option for your host? (you shouldn't)

comment:84 Changed 4 years ago by litz

I tried adding that for the host ... it didn't make a difference.

  • litz

comment:85 Changed 4 years ago by litz

BTW - Centos 5.4 is vastly more stable (it's still not acceptably stable, but it's MUCH better) than Centos 4.8 ...

5.5 is out real soon; let's see how that behaves. It may fix the issue.

  • litz

comment:86 Changed 4 years ago by litz

Ok ... I've been able to ... somewhat ... stabilize things on Centos 5.4

4.8 is, I think, a lost cause.

Have to use the following command line : divider=10 noapic nolapic nolapic_timer clocksource=tsc

and have ntpd enabled

If I do that, it seems to hold OK time ... exactly 7 seconds behind realtime.

Nothing I have tried has gotten the clock to hold realtime; it's always slow and lagging behind.

I can run two VM's and they stay perfectly in sync -- with each other ... exactly the same amount behind realtime, each. So whatever's happening, it's reproducable -- there is no actual sync between the two systems other than their timesync to the host (and ntp).

New log file attached.

Changed 4 years ago by litz

New VBOX log file (4-12-10)

comment:87 Changed 4 years ago by litz

I might also add ... that while the clock seems to be "keeping up with realtime" ... it is by NO MEANS steady.

It jumps back and forth, hangs for a 1/2 second, speeds up to catch up ...

Think of a car really in need of tuneup that spits, hesitates, then roars to life to catch up to traffic.

That's what the clock is running like.

The above solution is *workable*. It is NOT -- by any means -- an acceptable alternative to a properly stable clock.

Something's still broken. Any ideas what to test/try?

comment:88 Changed 4 years ago by klaus

litz: is there any reason why you have IOAPIC enabled in the VM config? The VirtualBox default is having IOAPIC off, as emulating it costs quite some CPU cycles, and you don't seem have selected a 64bit guest OS.

comment:89 Changed 4 years ago by litz

hmm ... I do not know why IOAPIC is enabled. I'll turn it off ... can't hurt to save some cycles.

I've tried it both ways, before, though, and not seen a difference in performance either way ...

  • litz

comment:90 Changed 4 years ago by costing

Something is funny with the machines, now one of them again runs slow:

# ntpdate x
19 Apr 13:03:05 ntpdate[28930]: adjust time server 137.138.16.69 offset 0.054395 sec
# ntpdate x
19 Apr 13:03:06 ntpdate[28931]: adjust time server 137.138.16.69 offset 0.185675 sec
# ntpdate x
19 Apr 13:03:07 ntpdate[28932]: adjust time server 137.138.16.69 offset 0.300528 sec
# ntpdate x
19 Apr 13:03:08 ntpdate[28933]: adjust time server 137.138.16.69 offset 0.437668 sec
# ntpdate x
19 Apr 13:03:10 ntpdate[28934]: step time server 137.138.16.69 offset 0.572383 sec

After the initial boot the time was almost ok for ~1 hour, then I compiled and started GA and then it started running very bad. But it continued to misbehave even after stopping GA. So it seems that something is activated by GA in the kernel that shouldn't ?

comment:91 Changed 4 years ago by costing

I've tried the settings above too (divider=10 noapic nolapic nolapic_timer clocksource=tsc) and the clock is a bit more stable (meaning "only" 5-12 seconds each minute, the actual interval seems to depend on the load, when idle it looses more), but ntp can't keep up so I run ntpdate from cron every minute. Is it a question of CPU type? 3.1.6 runs fine on a Core2 but all these problems show up on X5560.

comment:92 Changed 4 years ago by MarkCranness

Changing power states hasn't helped my VM timekeeping at all, but some people running Mac hosts have reported that disabling host power saving options helped.

Perhaps:  https://bugzilla.redhat.com/show_bug.cgi?id=513138

When the tsc is not stable on the host, including cpufreq changes and deep C state, or even migration into faster tsc host. To stop deep C states in which the TSC can stop, add "processor.max_cstate=1" as a host kernel boot option. To disable cpufreq (only necessary on hosts which lack constant_tsc flag in "flags" field of /proc/cpuinfo), edit /etc/sysconfig/cpuspeed MIN_SPEED and MAX_SPEED variables to the highest frequency available.

... or disable Intel SpeedStep (IST/EIST/C1/C1E) in the host BIOS.

Is VBox.log/"TM: cTSCTicksPerSecond=" fairly constant? (Boot the VM while another VM is busy, compared to boot the VM with the host/VMs idle).

comment:93 Changed 4 years ago by costing

I've tried setting the cores' frequency to max, min, everything in between, but there is no change in the guest, it's still bad. I don't know about disabling freq scaling altogether, this is not something I can play freely with since the servers are in production but when I have the chance I'll report about the effect of this.

comment:94 Changed 12 months ago by MS

still problem with VirtualBox 4.2.10 host system is Windows 7 64bit on intel i5 (4 cores), guest system is also windows 7 64bit with 3 cores enabled

in guest system i have Microsoft SQL server 2008r2 and during debugging one problem i noticed that very often happens that for 2 inserted items in sequence 'newer one' has smaller timestamp then 'older one'

in test loop which did nothing just inserting items to temporary table i found that such thing happened 19-190 times for sample 1000000 rows. it seems to be depending on what is happening with computer, if i do something else in host system its worse, if i do nothing its better but for 1 million rows it never happened that there would be no such row at all

difference seems to be from 10ms to 110ms

comment:95 Changed 12 months ago by frank

  • Description modified (diff)

Sorry, could you also test VBox 4.2.12? Just this release contains a few fixes regarding timekeeping. It would be interesting to see if these fixes have any effect to your setup.

comment:96 Changed 5 months ago by frank

  • Status changed from reopened to closed
  • Resolution set to fixed

No response, closing. Further fixes were made in VBox 4.3.0. Before you consider to reopen this ticket, please try VBox 4.3.2 first.

comment:97 Changed 3 months ago by mizzao

I'm still seeing this in Virtualbox 4.3.6. Windows 8 64-bit host and Ubuntu 12.04 guest.

Note: See TracTickets for help on using tickets.

www.oracle.com
ContactPrivacy policyTerms of Use