VirtualBox

Ticket #12076 (closed defect: fixed)

Opened 4 years ago

Last modified 3 years ago

QueryPerformanceCounter() Anomalies => Fixed in SVN

Reported by: asdel Owned by:
Priority: critical Component: other
Version: VirtualBox 4.1.18 Keywords: QueryPerformanceCounter counter timer
Cc: Guest type: Windows
Host type: Linux

Description

Host: Ubuntu 10.10
Guest: Windows XP Pro, v5.1 SP3

Anomalies are observed with 64-bit counter value provided via QueryPerformanceCounter(). These anomalies being to appear after the VirtualBox is running continuously for ~4 days. The counter frequency is 3,579,545 Hz; host/guest time synchronization is disabled (GetHostTimeDisabled = '1')

The guest calls QueryPerformanceCounter() every 50mS (0x2BB21 counts), logging the previous and current counter values when:

(a) the previous count > current count, or
(b) the current count is much larger than previous count (increase of 1 second or more, i.e., > 0x370000 counts)

The following counter query results are logged:

 1. Last >  Cur Time: 0x0000011B:7FFFC526 >  0x0000011A:80001858
 2. Last >  Cur Time: 0x00000135:7FFFE61B >  0x00000134:80002C01
 3. Last >  Cur Time: 0x00000136:7FFFDB9F >  0x00000135:800021C0
 4. Last >  Cur Time: 0x0000013E:7FFFEBC2 >  0x0000013D:80003983
 5. Last >  Cur Time: 0x00000146:7FFFBCFE >  0x00000145:800003E8
 6. Last >  Cur Time: 0x00000147:7FFFF8EA >  0x00000146:80003EDA
 7. Last << Cur Time: 0x00000183:FFFFC9DC << 0x00000185:00005E92
 8. Last >  Cur Time: 0x00000185:7FFFD751 >  0x00000184:8000402C
 9. Last >  Cur Time: 0x00000189:7FFFD519 >  0x00000188:80006D8F
10. Last << Cur Time: 0x00000188:FFFFC076 << 0x0000018A:0000071D

At 3.579 MHz, the event of d31..d0 going from 0x7FFF:FFFF to 0x8000:0000 occurs every 1200 seconds (20 minutes).

Note the third and fifth cases, where the bits d35..d32 went from 0x6 (b0110) to 0x5 (b0101) -- this is a subtraction, not just a clearing of bit d32.

Log entries 7 and 10 appear to show bit d32 double-incrementing on the roll-over of d31..d0 from 0xFFFFFFFF to 0x1:00000000

The effect is, to the guest, time appears to jump forward or backward by 20 minutes.

-- end of description

Attachments

vbox.log Download (57.0 KB) - added by asdel 4 years ago.
MAC_VBox.log Download (121.1 KB) - added by asdel 4 years ago.
Log file under MAC OSX
QPC_Check.cpp Download (2.9 KB) - added by asdel 4 years ago.
Source for Windows app demonstrating QueryPerformanceCounter anomaly
VBoxBroken.log Download (90.8 KB) - added by Petr Vones 4 years ago.
Session with broken counter
virtualMachine1.log Download (64.5 KB) - added by florian 4 years ago.
virtualMachine2.log Download (63.6 KB) - added by florian 4 years ago.
checkTime.cpp Download (1.8 KB) - added by florian 4 years ago.

Change History

comment:1 follow-up: ↓ 2 Changed 4 years ago by ramshankar

As far as I can see, the latest TSC fixes in 4.2.x wasn't backported to 4.1.x release.

Could you, if possible, try upgrading to 4.2.18?

comment:2 in reply to: ↑ 1 ; follow-up: ↓ 3 Changed 4 years ago by asdel

Replying to ramshankar:

As far as I can see, the latest TSC fixes in 4.2.x wasn't backported to 4.1.x release.

Could you, if possible, try upgrading to 4.2.18?

Yes, we will upgrade and rerun the test. Since the anomaly required several days to manifest, we will need to let the upgraded version run for a while before reporting back.

comment:3 in reply to: ↑ 2 Changed 4 years ago by asdel

Now running v4.2.18. After approx 24 hrs, anomaly 'Type 1' has again appeared:

Last > Cur Time: 0x0000051:7FFFD7E2 > 0x00000050:80001F38

comment:4 Changed 4 years ago by ramshankar

Thanks. Could you please upload the VBox.log for the VM you are seeing this issue.

Changed 4 years ago by asdel

comment:5 Changed 4 years ago by frank

I think this problem was fixed in 4.2.x. Please update to 4.2.18.

comment:6 Changed 4 years ago by asdel

To reiterate: updated to 4.2.18 and the both forward and backward-jump anomalies are still present.
We are seeing the anomaly on some hardware platforms but not others (same VirtualBox host/guest arrangement on all platforms).

Note, the original problem description says QueryPerformanceCounter() is being called every 50mS, but it should say every 5mS (the count values reflect the 5mS period).

comment:7 Changed 4 years ago by ramshankar

I can't find any official documentation that guarantees that QueryPerformanceCounter() is monotonically increasing nor do I know which time source it is looking at in the VM. In fact, from my understanding it determines the time source(s) at runtime. It's unlikely that it's reading the TSC in this case though.

I'm not ruling out a bug in our device emulation/interrupt delivery but we'll need a better understanding of what happens here (wraparound/sign extension/interrupt latency issues etc.)

From the frequency, it's most likely using the ACPI timer in which case the 32-bit wraparound is expected but how correctly Microsoft deals with this is a bit fuzzy. See also https://www.virtualbox.org/ticket/11951#comment:2

Last edited 4 years ago by ramshankar (previous) (diff)

comment:8 Changed 4 years ago by asdel

Thank you for the feedback. To close one thought: since my last posing, the anomaly has been observed on both our hardware platforms, so a hardware-unique issue appears unlikely.

Changed 4 years ago by asdel

Log file under MAC OSX

Changed 4 years ago by asdel

Source for Windows app demonstrating QueryPerformanceCounter anomaly

comment:9 Changed 4 years ago by asdel

Update to report that the anomaly is also present with Mac OSX host (the companion .log file has been attached). I have also attached source code for a simple program that detects and reports any anomalies found.

An interesting consistency between the Linux and Mac platforms is that the anomaly only appears after 24+ hours of operation, and when it does appear, it is at first sporadic (i.e. occurring once every several hours). Eventually, it becomes more regular, and ultimately establishes a reliable 10-minute period.

Are there specific suggestions on how else we might assist in achieving a better understanding of the fundamental issue?

Last edited 4 years ago by asdel (previous) (diff)

comment:10 Changed 4 years ago by klaus

It would help to get the result from such a test program on real hardware which uses the same type of time source, i.e. the ACPI timer (sometimes called PMTIMER). The hardware implements only a 32 bit counter, and to me it looks like there is some quirk with extending it to 64 bits (honestly I have a hard time imagining how to do this safely at all with this brain-damaged hardware)...

comment:11 Changed 4 years ago by Petr Vones

I can confirm the issue with 4.3.6. The counter goes backward very often

Host: Windows 7 x64
Guest: Windows Server 2008 R2 x64
Core i5 2540M HT enabled, VT-x enabled, Nested paging enabled, 2 CPU

Here is the sample C# code:

using System;
using System.Diagnostics;
using System.Threading;

namespace PerformanceCounterTest
{
  class Program
  {
    static void Main(string[] args)
    {
      Stopwatch sw = new Stopwatch();
      sw.Start();
      long lastValue = sw.ElapsedTicks;
      for (int i = 0; i < 1000; i++)
      {
        Thread.Sleep(1);
        long elapsed = sw.ElapsedTicks;
        long diff = elapsed - lastValue;
        if (diff <= 0)
          Console.WriteLine("Loop: {0}, Error diff: {1}", i, diff);
        lastValue = elapsed;
      }
      sw.Stop();
      Console.WriteLine("Stop");
    }
  }
}

Sample output (Release configuration, no debugging):

Loop: 2, Error diff: -263820
Loop: 13, Error diff: -239370
Loop: 16, Error diff: -261904
Loop: 20, Error diff: -261749
Loop: 51, Error diff: -261078
Loop: 57, Error diff: -261983
Loop: 61, Error diff: -259806
Loop: 78, Error diff: -261652
Loop: 85, Error diff: -261453
Loop: 100, Error diff: -262647
Loop: 113, Error diff: -262865
Loop: 116, Error diff: -261389
Loop: 145, Error diff: -261829
Loop: 152, Error diff: -261208
Loop: 169, Error diff: -261680
Loop: 186, Error diff: -262212
Loop: 218, Error diff: -261561
Loop: 240, Error diff: -261688
Loop: 246, Error diff: -261788
Loop: 250, Error diff: -259956

The same sample code has zero error output on the same machine running in host system. It has also zero error output on Virtual PC or VMWare virtualized machines.

Last edited 4 years ago by Petr Vones (previous) (diff)

comment:12 Changed 4 years ago by Petr Vones

Additional information, after restart of the virtual machine the issue has disappeared. Log of the broken session added.

After saving the machine state and restoring the session the issue has appeared again with less occurence:

Loop: 1, Error diff: -17926
Loop: 205, Error diff: -18108
Loop: 208, Error diff: -17730
Loop: 528, Error diff: -12459
Loop: 848, Error diff: -17473
Loop: 935, Error diff: -18373
Stop
Last edited 4 years ago by Petr Vones (previous) (diff)

Changed 4 years ago by Petr Vones

Session with broken counter

comment:13 Changed 4 years ago by klaus

Too bad that no one provided the information I tried to get in my previous reply - are any of the systems (physical or virtual) where the first test program runs without glitches which use PMTIMER (i.e. show a base frequency of 3.579 MHz)? I'm asking because this timer is only 32 bit, and it could be an OS bug which is hidden by the fact that Windows usually doesn't pick this time source.

Also, the 2nd test program clearly shows a very different pattern and thus there might be a different issue behind it - but as there is no information what timer is used for the C# this is even harder to make any progress than with the original one. The only bit of useful information I can take from the log is that the TSC rate is very unusual... it is measured as 2.491681920 GHz, whereas the CPU declares itself to run at 2.60GHz. That's over 4% off. Also, there are suspicious RTC timer rate changes (first few very short, just ~15 msecs, but the 2nd last one 5 seconds and the last one over 83 seconds) - but they don't match your test program, which takes samples every millisecond, until it reaches 1000 loops, so overall it'll run for a bit more than a second.

comment:14 Changed 4 years ago by florian

I am not that familiar with all these timers so what I am writing might be completely wrong... If this is the case please correct me...

Wouldn't it be sufficient to specify /USEPMTIMER in windows xp boot.ini file. As far as I understood it this will force windows to use the PMTIMER. I have done this on one pysical test machine and the frequency returned from QueryPerformanceFrequency has changed from something close to cpu frequency which indicates that TSC is used to 0x369e99 which is 3.579 MHz which indicates that PMTIMER is used. I have started our test program yesterday evening. As we only see the time jumping after more than 24 hours uptime I will report about the results tomorrow or maybe even better the day after tomorrow. I will post the code of our test program together with the results but it is almost the same compared to the test program of asdel.

If I am wrong it would be nice if someone can post a comment so that I can stop this and spare some time...

comment:15 Changed 4 years ago by florian

Here are the logs both "Virtual Machine 1" and "Virtual Machine 2" are affected by the problem the "Physical Machine" works like a charm.

I will let the machines run so that we can see that the frequency of the time jump increases with increasing uptime of the virtual machines.

Virtual Machine 1

Frequency: 369e99
2014-02-17 14:16:20.978
Overflow from lower part are 1199 seconds
ActTime=1cf51e
ActPerfTime=195b7eab3
2014-02-20 00:14:15.868: PerfError  -1198.862613
ActPerfTime=ae8013a59c
2014-02-20 01:14:16.075: PerfError  -1198.863378
ActPerfTime=b08035319e

snapshot: 201402200739

Virtual Machine 2

Frequency: 369e99
2014-02-17 14:15:30.512
Overflow from lower part are 1199 seconds
ActTime=1c284b
ActPerfTime=18a892e82
2014-02-20 05:34:16.174: PerfError  -1198.862579
ActPerfTime=be802f9ccf
2014-02-20 07:14:14.829: PerfError  -1198.862586
ActPerfTime=c2800b4370

snapshot: 201402200739

Physical Machine

Frequency: 369e99
2014-02-17 16:08:45.828
Overflow from lower part are 1199 seconds
ActTime=4c2c
ActPerfTime=45f11bd

snapshot: 201402200740

Changed 4 years ago by florian

Changed 4 years ago by florian

Changed 4 years ago by florian

comment:16 Changed 3 years ago by frank

  • Summary changed from QueryPerformanceCounter() Anomalies to QueryPerformanceCounter() Anomalies => Fixed in SVN

We think we finally found and fixed the problem. The next 4.3.x maintenance release will contain the fix.

comment:17 Changed 3 years ago by frank

  • Status changed from new to closed
  • Resolution set to fixed

Fix is part of VBox 4.3.18.

Note: See TracTickets for help on using tickets.

www.oracle.com
ContactPrivacy policyTerms of Use