VirtualBox

Opened 10 years ago

Last modified 9 years ago

#13024 new enhancement

Support for physical multiCPU

Reported by: Hrvoje
Owned by:
Component: VMM/HWACCM
Version: VirtualBox 4.3.10
Keywords: cpu physical core sibling
Cc:
Guest type: all
Host type: all

Description

Hi.

The summary is somewhat misleading, sorry about that, but I was not sure how to compress it into something meaningful.

As of version 4.3.10, the VirtualBox engine only supports multi-CPU configurations where the guest sees multiple CPUs as a single physical CPU with several cores and siblings.

For the majority of use cases this is sufficient to emulate a multi-CPU environment, but it seems to carry more than one performance penalty.

I have an HP DL580 G5 machine that was running ESXi with a number of guest machines (mostly Linux). All of them, when displaying cpuinfo, showed the CPUs as separate physical packages. For example, this is from a guest running on another similar host with VMware:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 46
model name      : Intel(R) Xeon(R) CPU           X7542  @ 2.67GHz
stepping        : 6
cpu MHz         : 2663.778
cache size      : 18432 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf pni ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm ida epb dts
bogomips        : 5327.55
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 46
model name      : Intel(R) Xeon(R) CPU           X7542  @ 2.67GHz
stepping        : 6
cpu MHz         : 2663.778
cache size      : 18432 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf pni ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm ida epb dts
bogomips        : 5327.55
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

After migrating those to VirtualBox, the CPUs are now shown as "cores":

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Xeon(R) CPU           X7350  @ 2.93GHz
stepping	: 11
cpu MHz		: 2904.312
cache size	: 6144 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good pni ssse3 lahf_lm
bogomips	: 5783.55
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Xeon(R) CPU           X7350  @ 2.93GHz
stepping	: 11
cpu MHz		: 2904.312
cache size	: 6144 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 5
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good pni ssse3 lahf_lm
bogomips	: 5783.55
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

(note how physical id does not change).

This seems to carry a performance penalty, as guests appear to run slower than before. The host runs CentOS 6.5, as do the guests.

Also, the HP DL580 G5 has 4 CPUs with 2x2 cores. Increasing the vCPU count (guest CPU count) to anything above 2 causes huge performance degradation and instability (frequent oopses). On VMware this worked flawlessly - see the VMware knowledge base, article 2030577.

I did not manage to find any reference on how to make VirtualBox present CPUs as separate physical ones, so I assume this is something that still needs to be implemented.

I also assume this could be connected to other reports about suboptimal performance with more than one CPU. I say connected because it seems to affect only older host CPUs - running VMs on an HP DL380 G7 (i7), a larger number of guest CPUs does not influence performance in a major way.

If needed, I can provide logs from the host and the guests.

Regards,

H.

Change History (13)

comment:1 by Frank Mehnert, 10 years ago

I don't think that you found a new bug. VirtualBox performance with multi-CPU guests needs improvement; this is a known problem. But there has been no recent change in how VirtualBox emulates multiple CPUs. A guest CPU is just a thread running on the host computer. The more virtual CPUs you configure, the more threads are started on the host. It is the job of the host operating system to distribute the execution threads of all currently running virtual machines over the physical hardware.
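
(For illustration, on a Linux host these per-vCPU threads can be observed directly. This is only a sketch; it assumes the VM was started via VBoxHeadless and that the vCPU threads carry "EMT" in their thread name, which may differ between VirtualBox versions.)

# Sketch: show the VM's threads and the host CPU (psr) each one last ran on.
# Assumptions: Linux host, VM started with VBoxHeadless, vCPU threads named EMT*.
VM_PID=$(pgrep -f VBoxHeadless | head -n1)
ps -T -p "$VM_PID" -o tid,comm,psr | grep -i emt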

in reply to:  1 ; comment:2 by Hrvoje, 10 years ago

Replying to frank:

I don't think that you found a new bug. VirtualBox performance with multi-CPU guests needs improvement; this is a known problem. But there has been no recent change in how VirtualBox emulates multiple CPUs. A guest CPU is just a thread running on the host computer. The more virtual CPUs you configure, the more threads are started on the host. It is the job of the host operating system to distribute the execution threads of all currently running virtual machines over the physical hardware.

Hi.

Thank you for the response. For the most part, I agree with you.

Also, the problem is not with the host - it just sees threads, and yes, the host OS distributes those threads.

But please note that how this distribution is done depends on the NUMA topology. If the host supports NUMA, the host OS will group threads on one node in an attempt to avoid expensive copies across NUMA nodes. Effectively, this makes the guest's CPU threads behave like cores of a single physical processor - because they are actually executed on the same NUMA node, and thus on a single physical CPU.

This changes drastically if the host does _not_ support NUMA grouping. In this case, memory is shared equally between all threads on the host, but you can assume that more than 50% of the time the threads of one guest will run on different physical CPUs. This is costly because of L1/L2/L3 cache refills and context switching. If in this scenario you present the CPUs as cores of a single CPU, the guest may apply the wrong optimizations and actually make things slower than they should be.
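
(A quick way to check which of these two situations a given host is in - a minimal sketch, assuming the numactl package and a reasonably recent util-linux are installed:)

# Show the NUMA nodes and which CPUs/memory belong to each (requires numactl).
numactl --hardware

# lscpu also reports the NUMA node count and the socket/core layout.
lscpu | grep -E 'NUMA|Socket|Core'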

I would like to have a way to present the guest's vCPUs as separate physical CPUs instead of just cores of a single physical CPU. I believe this would instruct the guest to use different optimizations, take into account that thread migration is expensive, and thus increase performance.

One typical example is irqbalance, which does not work as it should on VBox because it sees all vCPUs as part of one physical CPU and just quits. There are a lot of internal Linux optimizations, but it would take too much time to describe them all.

Regards,

H.

comment:3 by VBoxHarry, 10 years ago

Hello VBox users, VBox team,

My problem with VBox since 4.3.x (incl. 4.3.12) is nearly the same. It occurs on XP SP2 64-bit (NOT with XP 32-bit) and also with Windows 7 64-bit: only 1 core is seen. My CPUs are AMD "Phenom II X4 940" and "Phenom II X4 850". This problem does not occur with VBox 4.2.24 and only affects 64-bit guest OSes.

Regards, VBoxHarry

comment:4 by Frank Mehnert, 10 years ago

VBoxHarry, please could you attach a VBox.log file of a VM session? The file will contain information about the configuration of your system.

in reply to:  2 ; comment:5 by Ramshankar Venkataraman, 10 years ago

But please note that how this distribution is done depends on the NUMA topology. If the host supports NUMA, the host OS will group threads on one node in an attempt to avoid expensive copies across NUMA nodes. Effectively, this makes the guest's CPU threads behave like cores of a single physical processor - because they are actually executed on the same NUMA node, and thus on a single physical CPU.

This changes drastically if the host does _not_ support NUMA grouping. In this case, memory is shared equally between all threads on the host, but you can assume that more than 50% of the time the threads of one guest will run on different physical CPUs. This is costly because of L1/L2/L3 cache refills and context switching. If in this scenario you present the CPUs as cores of a single CPU, the guest may apply the wrong optimizations and actually make things slower than they should be.


Yes, you're right, and we are aware of NUMA systems and the overheads involved with suboptimal scheduling and memory allocation. However, VirtualBox SMP performance on NUMA systems is, at the moment, the least of our worries. We have a few other areas where we can improve our SMP performance before we get to NUMA.

in reply to:  5 comment:6 by Hrvoje, 10 years ago

Hi.

Replying to ramshankar:

But please note that how this distribution is done depends on the NUMA topology. If the host supports NUMA, the host OS will group threads on one node in an attempt to avoid expensive copies across NUMA nodes. Effectively, this makes the guest's CPU threads behave like cores of a single physical processor - because they are actually executed on the same NUMA node, and thus on a single physical CPU.

This changes drastically if the host does _not_ support NUMA grouping. In this case, memory is shared equally between all threads on the host, but you can assume that more than 50% of the time the threads of one guest will run on different physical CPUs. This is costly because of L1/L2/L3 cache refills and context switching. If in this scenario you present the CPUs as cores of a single CPU, the guest may apply the wrong optimizations and actually make things slower than they should be.


Yes, you're right, and we are aware of NUMA systems and the overheads involved with suboptimal scheduling and memory allocation. However, VirtualBox SMP performance on NUMA systems is, at the moment, the least of our worries. We have a few other areas where we can improve our SMP performance before we get to NUMA.

Actually, it's the other way around - I get excellent performance on NUMA. It works like a charm, especially if the CPU supports unrestricted guest execution.

The problem is on hosts which do _not_ support NUMA. There, it seems, presenting vCPUs as cores of a single physical CPU has a performance cost.

As a test, I tried to tie the guest vCPU threads on the host to specific cores of the same physical CPU. This is done with the taskset Unix command (my host is Unix). Occasionally I can get up to a 5x boost in speed. I'm still trying to determine why the gain varies, but there is definitely something strange there, and it is related to how the host OS scheduler moves the guest vCPU threads around (a sketch of the pinning is below).
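
A sketch of this kind of pinning, for illustration only: it assumes the VM runs under VBoxHeadless, that the vCPU threads are the ones with "EMT" in their name (naming may differ between VirtualBox versions), and that host CPUs 0 and 2 sit on the same physical package.

# Sketch: pin each vCPU (EMT) thread of one VM to its own core on one package.
# Assumptions: Linux host, VBoxHeadless, EMT thread naming, CPUs 0 and 2 share a package.
VM_PID=$(pgrep -f VBoxHeadless | head -n1)
TIDS=($(ps -T -p "$VM_PID" -o tid= -o comm= | awk '/EMT/ {print $1}'))
CPUS=(0 2)   # target host CPUs - adjust to your topology and vCPU count
for i in "${!TIDS[@]}"; do
    [ -n "${CPUS[$i]:-}" ] || break    # more EMT threads than target CPUs listed
    taskset -pc "${CPUS[$i]}" "${TIDS[$i]}"
done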

There is also another "anomally" (on host without NUMA). If guest is linux (64bit), adding "clearcpuid=0x1c" to boot command also somewhat improves performance. This specific command just masks hyperthread flag of CPU.
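
(For completeness, this is just appended to the guest's kernel command line; the sketch assumes a CentOS 6 guest with legacy GRUB, and the kernel version and root device are placeholders:)

# In the guest's /boot/grub/grub.conf, append clearcpuid=0x1c to the kernel line.
# 0x1c (decimal 28) is the X86_FEATURE_HT bit; kernel and root here are placeholders.
kernel /vmlinuz-2.6.32-504.el6.x86_64 ro root=/dev/mapper/vg_root-lv_root clearcpuid=0x1c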

If I find more time, I'll try to do some more tests, but it would be awesome if there were a way to mask the hyper-threading flag for the guest using a VBoxManage command.
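
It may already be possible to get part of the way there from the command line: VBoxManage modifyvm has a --cpuidset option that overrides a CPUID leaf for the guest. Whether 4.3 actually lets this clear the HTT bit in leaf 1, or re-applies its own EBX/EDX fixups afterwards, is something I have not verified - treat the following as a sketch only (the VM name is hypothetical, and the register values are placeholders taken from a guest leaf-1 dump in VBox.log with EDX bit 28 cleared):

# Sketch: override CPUID leaf 0x1 for a VM named "testvm" (hypothetical name).
# Values are EAX EBX ECX EDX; start from the "Gst: 00000001 ..." line in VBox.log
# and clear bit 28 (HTT) of EDX, e.g. 178bfbff -> 078bfbff.
VBoxManage modifyvm "testvm" --cpuidset 00000001 0001067a 00020800 00000201 078bfbff

# To drop the override again:
VBoxManage modifyvm "testvm" --cpuidremove 00000001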

Even better would be to have some means to control how the vCPUs are presented to the guest OS.

Regards,

H.

Last edited 10 years ago by Hrvoje

in reply to:  3 comment:7 by Hrvoje, 10 years ago

Hi.

Replying to VBoxHarry:

Hello VBox users, VBox team,

My problem with VBox since 4.3.x (incl. 4.3.12) is nearly the same. It occurs on XP SP2 64-bit (NOT with XP 32-bit) and also with Windows 7 64-bit: only 1 core is seen. My CPUs are AMD "Phenom II X4 940" and "Phenom II X4 850". This problem does not occur with VBox 4.2.24 and only affects 64-bit guest OSes.

Regards, VBoxHarry

It seems that your problem is related to something else, and I would politely ask you to open another ticket for it. Here I'm trying to get some way of controlling how vCPUs are presented to the guest.

Regards,

H.

Last edited 10 years ago by Hrvoje

comment:8 by Hrvoje, 10 years ago

Hi.

I still think it would be useful to have the possibility to control how vCPUs are presented to the guest.

Would it be possible to add such a feature, and when do you think this could happen?

Regards,

H.

comment:9 by Frank Mehnert, 10 years ago

To be honest, it is very unlikely that we will implement such a feature in the near future. There are far too many other things to do.

comment:10 by Hrvoje, 9 years ago

Hi.

So, I finally found enough free time to tackle this. :-)

Anyhow, I looked at src/VBox/VMM/VMMR3/CPUM.cpp. There I found this code:

1391     /*
1392      * Hide HTT, multicode, SMP, whatever.
1393      * (APIC-ID := 0 and #LogCpus := 0)
1394      */
1395     pStdFeatureLeaf->uEbx &= 0x0000ffff;
1396 #ifdef VBOX_WITH_MULTI_CORE
1397     if (pVM->cCpus > 1)
1398     {
1399         /* If CPUID Fn0000_0001_EDX[HTT] = 1 then LogicalProcessorCount is the number of threads per CPU core times the number of CPU cores per processor */
1400         pStdFeatureLeaf->uEbx |= (pVM->cCpus << 16);
1401         pStdFeatureLeaf->uEdx |= X86_CPUID_FEATURE_EDX_HTT;  /* necessary for hyper-threading *or* multi-core CPUs */
1402     }
1403 #endif

If I read the code correctly, we are doing this on the guest CPUID - basically masking some flags and setting them as we like. Can somebody explain why it is necessary to have HTT set?

I commented out line 1401 and rebuilt from source. VMs running under the rebuilt code work correctly and have the proper flags. Before:

[root@test ~]# cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           X5460  @ 3.16GHz
stepping        : 10
microcode       : 1547
cpu MHz         : 3165.558
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good pni ssse3 lahf_lm
bogomips        : 6012.92
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           X5460  @ 3.16GHz
stepping        : 10
microcode       : 1547
cpu MHz         : 3165.558
cache size      : 6144 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc rep_good pni ssse3 lahf_lm
bogomips        : 6012.92
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

and after:

[root@test ~]# cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           X5460  @ 3.16GHz
stepping        : 10
microcode       : 1547
cpu MHz         : 3166.564
cache size      : 6144 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm constant_tsc rep_good pni ssse3 lahf_lm
bogomips        : 6225.92
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           X5460  @ 3.16GHz
stepping        : 10
microcode       : 1547
cpu MHz         : 3166.564
cache size      : 6144 KB
physical id     : 1
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm constant_tsc rep_good pni ssse3 lahf_lm
bogomips        : 62128.12
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

NOTE that there is no more "ht" flag, and also that each CPU is now a separate physical socket.

From what I can tell, it seems that the VMs run more stably now. Please note that the host DOES NOT have NUMA support, but it has 4 physical CPUs!

Despite the fact that it runs more stably now, the guest still generates a lot of LOC interrupts on the HOST. Maybe this is normal for this hardware (HP DL580 G5)?

And now the question - will this change regarding line 1401 break something else? Do you know whether it influences anything else?

I'm asking because I found one odd thing. Comparing the logs from before and after, this comes up:

@@ -679,8 +679,8 @@
      Function  eax      ebx      ecx      edx
 Gst: 00000000  00000005 756e6547 6c65746e 49656e69
 Hst:           0000000d 756e6547 6c65746e 49656e69
-Gst: 00000001  0001067a 00020800 00000201 178bfbff
-Hst:           0001067a 07040800 0c0ce3bd bfebfbff
+Gst: 00000001  0001067a 00020800 00000201 078bfbff
+Hst:           0001067a 02040800 0c0ce3bd bfebfbff
 Gst: 00000002  05b0b101 005657f0 00000000 2cb4304e
 Hst:           05b0b101 005657f0 00000000 2cb4304e
 Gst: 00000003  00000000 00000000 00000000 00000000
@@ -736,7 +736,7 @@
 SSE - SSE Support                      = 1 (1)
 SSE2 - SSE2 Support                    = 1 (1)
 SS - Self Snoop                        = 0 (1)
-HTT - Hyper-Threading Technology       = 1 (1)
+HTT - Hyper-Threading Technology       = 0 (1)
 TM - Thermal Monitor                   = 0 (1)
 30 - Reserved                          = 0 (0)
 PBE - Pending Break Enable             = 0 (1)

Why is the host CPUID also changed?

Regards,

H.

comment:11 by Frank Mehnert, 9 years ago

Regarding the change in the host ID: bits 24...31 of EBX in CPUID leaf 1 contain the "Initial APIC ID" of the host CPU where the code is running. As the CPUID information of only one host CPU is displayed, that information changes when the code happens to run on another CPU as a result of regular host CPU scheduling. So nothing to worry about.

Regarding the LOC interrupts: LOC is the "Local APIC timer interrupt". Every time the local APIC timer reaches zero, it generates an interrupt. This is usually used as a timer source or for the watchdog, so nothing to worry about.

Regarding your experiment: Could you describe in more detail what improved with your guest when running without the HTT bit set?

comment:12 by Hrvoje, 9 years ago

Hi.

Thank you for the explanation; I assumed something similar (CPUID).

Regarding LOC, I do understand what it is. The problem is that I cannot understand why it is so high on the host OS. For example, booting CentOS 6.6 with a single vCPU, the host LOC rate peaks at 10k/s. Booting the same guest with 2 vCPUs, LOC goes up to 50k/s. With 4 vCPUs, it goes up to 200k/s!! Yes, 200k! This seems way too much. I'm assuming that VBox is maybe doing some passthrough instead of optimizing timers, but I'm not quite sure what is going on here. What is clear is that when LOC goes over ~50k, the guest starts to slow down - it seems that a lot of CPU time is lost on some kind of synchronization. I'm still investigating this (the prime suspect is the tickless kernel implementation).
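
(For reference, one way to watch this rate is simply to sample the LOC line of /proc/interrupts on the host once per second - a minimal sketch; the first value printed is the absolute total rather than a rate:)

# Sketch: print the host's total LOC (local APIC timer) interrupts per second.
prev=0
while sleep 1; do
    cur=$(awk '/^LOC:/ {for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i; print s}' /proc/interrupts)
    echo "LOC/s: $((cur - prev))"
    prev=$cur
done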

Now, HTT is quite interesting. My host has X5460 CPUs, which according to Intel do _not_ support HTT. Yet, looking at /proc/cpuinfo, there is an "ht" flag. I'm not quite sure what to make of this - in the BIOS there is no option to enable/disable HT. Long story short, the host does not support HT.

Testing VirtualBox 4.3.20 (vanilla) with a 4 vCPU guest in most cases produced kernel panics or CPU hang messages. I would say 1 boot in 5 resulted in a working OS. After disabling HT for the guest, the result was the opposite - in 5 boots, it hung in only 1 case. Working with 2 vCPU guests produced even better results.

The LOC numbers are almost the same, and so is the load on the host, so I assume it is something inside the Linux kernel that watches for the ht flag and does something differently.

But I did not stop there. Digging through the kernel code, I found out that it also watches for the "hypervisor" CPUID flag. I enabled it by changing the code to:

1391     /*
1392      * Hide HTT, multicode, SMP, whatever.
1393      * (APIC-ID := 0 and #LogCpus := 0)
1394      */
1395     pStdFeatureLeaf->uEbx &= 0x0000ffff;
1396 #ifdef VBOX_WITH_MULTI_CORE
1397     if (pVM->cCpus > 1)
1398     {
1399         /* If CPUID Fn0000_0001_EDX[HTT] = 1 then LogicalProcessorCount is the number of threads per CPU core times the number of CPU cores per processor */
1400 //        pStdFeatureLeaf->uEbx |= (pVM->cCpus << 16);
1401 //        pStdFeatureLeaf->uEdx |= X86_CPUID_FEATURE_EDX_HTT;  /* necessary for hyper-threading *or* multi-core CPUs */
1402         pStdFeatureLeaf->uEbx |= 0x10000;
1403         pStdFeatureLeaf->uEcx |= 0x80000000;    /* hypervisor bit */
1404     }
1405 #endif

This seems to further stabilize the kernel. It also added additional messages in dmesg:

...
alternatives: switching to unfair spinlock
...

and added additional flag(s) in /proc/cpuinfo:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           X5460  @ 3.16GHz
stepping        : 10
microcode       : 1547
cpu MHz         : 3135.228
cache size      : 6144 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm constant_tsc rep_good unfair_spinlock pni ssse3 hypervisor lahf_lm
bogomips        : 6270.45
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Xeon(R) CPU           X5460  @ 3.16GHz
stepping        : 10
microcode       : 1547
cpu MHz         : 3135.228
cache size      : 6144 KB
physical id     : 1
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm constant_tsc rep_good unfair_spinlock pni ssse3 hypervisor lahf_lm
bogomips        : 6181.75
clflush size    : 64
cache_alignment : 64
address sizes   : 38 bits physical, 48 bits virtual
power management:

Note the new flags: unfair_spinlock and hypervisor. Also note that "bogomips" is now correctly calculated!

With this, I'm almost unable to break a guest with 2 vCPUs. With 4 vCPUs it can still "hang", but it is quite rare - 1 in 10 boots?

Still, all this did not solve the performance problem or the LOC racing. Booting the guest with 1 vCPU takes 12 seconds, with 2 vCPUs 18 seconds, and with 4 vCPUs 27 seconds.

I have tested other hypervisors, and the difference should not be so big.

Now, it seems that pinning vCPUs to CPUs helps _a lot_. The general rule for pinning is that all vCPUs must be pinned to the same physical CPU package. With a 4 vCPU guest, boot time drops down to 15 seconds! This explains why I saw before that boot time varies a lot. Still, it is quite impractical to do this every time a guest is started. Also, other hypervisors do not require such an exercise, so I will assume all this just helps to determine what the problem is.

I assume this also explains why VBox performs so well on a NUMA-enabled host - the OS takes care that all processes in the same group end up on the same NUMA node.

I guess I will need to spend more time to find out what the problem is here. :-)

Regards,

H.

comment:13 by Hrvoje, 9 years ago

Hi.

I redid all the tests, and it seems that the vCPU topology change does influence how the guest uses CPUs and shares resources. Unfortunately, the stability and performance gains from this are small. Still, it would be nice to have an option to control how vCPUs are presented to the guest.

Please note that all tests were done on a non-NUMA-capable system, without CPU support for unrestricted guest execution - an HP DL380 G5 with X5460 CPUs (x2).

What did significantly improve stability and performance is vCPU process pinning - that is, locking down the host thread which represents a vCPU in the guest. The best results are achieved by pinning each vCPU to a single host CPU and making sure all vCPU threads end up on the same physical CPU (that is, the same CPU package). For example, with an 8-CPU host (2 physical CPUs with 4 cores each, no HT), assume the 1st package is CPUs 0,2,4,6 and the 2nd package is CPUs 1,3,5,7. Using a guest with 2 vCPUs, you should pin vCPU 0 to 1 and vCPU 1 to 3. Or you can use vCPU0->2, vCPU1->6.
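
To see which host CPUs share a package (so that the pinning really lands on one socket), the topology is exposed in sysfs - a minimal sketch:

# Sketch: list each host CPU with the physical package (socket) it belongs to.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo "$(basename "$cpu") -> package $(cat "$cpu"/topology/physical_package_id)"
done

# "lscpu -e" (newer util-linux) shows the same CPU/SOCKET/CORE mapping as a table.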

It seems that this pinning also somewhat reduces the LOC interrupt rate.

There is definitely something wrong in VirtualBox regarding SMP, and I would put my bet on how the TSC is handled - at least on this platform. I suspect that the host is having issues with the TSC, and this somehow influences the guest and causes a lot of problems. The main reason for this suspicion is that pinning helps, and when you pin a process, you can actually guarantee that the TSC stays consistent. Unfortunately, I did not manage to isolate what is actually wrong. I guess this should be a separate ticket?

As a side note, using VirtualBox on a NUMA-enabled system, the number of vCPUs in the guest does not influence performance or stability - for example, I get the same performance on a 1 vCPU guest as on a 4 vCPU guest. Tested on an HP DL380 G7 with X5670 CPUs. Here NUMA takes care of the grouping, so this could explain it.

I'll probably have less free time to play with this, but if someone manages to fix the problem, I'm happy to test it.

Regards,

H.

