VirtualBox

Ticket #6013 (closed defect: fixed)

Opened 4 years ago

Last modified 3 years ago

SLES 10 Linux guest hangs -> retry with 3.1.4

Reported by: Pedja
Owned by:
Priority: major
Component: guest smp
Version: VirtualBox 3.1.4
Keywords: SLES guest hangs
Cc:
Guest type: Linux
Host type: Linux

Description

I have a SLES 10 Linux host and two SLES 10 Linux guests. I installed an Oracle 11g DB (DB size 300 GB+) and an application on both guests (testing environment). During high I/O (e.g. some batch jobs), one of the guests hangs and becomes totally unresponsive. RAM size is 12 GB, and the guests use dynamic disks. I also tried allocating all RAM (except 2 GB for the host) to one guest. On physical machines everything worked fine with 4 GB of RAM. I attached logs for both VMs. Is there any solution?

Attachments

test5 logs.rar (37.0 KB) - added by Pedja 4 years ago.
VB logs.rar (65.3 KB) - added by Pedja 4 years ago.
VBox.log (40.6 KB) - added by Pedja 4 years ago.
VBox.log.1 (41.3 KB) - added by Pedja 4 years ago.
VBox.log.2 (41.3 KB) - added by Pedja 4 years ago.
VBox.log.3 (41.0 KB) - added by Pedja 4 years ago.

Change History

comment:1 follow-up: ↓ 2 Changed 4 years ago by Pedja

This is becoming very urgent...

comment:2 in reply to: ↑ 1 Changed 4 years ago by Pedja

Converted dynamic disks to fixed but problem still occurs

comment:3 follow-ups: ↓ 5 ↓ 7 Changed 4 years ago by frank

It would be helpful if you could tell us which of the two VMs hangs (which log file). Furthermore, you could check whether the same hang occurs if you decrease the number of guest CPUs to 1.

And what exactly did you do in the guest to provoke the hang? I/O from/to the virtual disk, network I/O, and/or I/O over shared folders?

comment:4 follow-up: ↓ 6 Changed 4 years ago by frank

And: does the whole VM process hang? If so, forcing a core dump of that VM as described here (Forcing VirtualBox to terminate with a core dump) and sending the core dump to us could help find the problem. Let me know if you have such a core dump and I can tell you a server for uploading the file.

comment:5 in reply to: ↑ 3 Changed 4 years ago by Pedja

Replying to frank:

It would be helpful if you could tell us which of the two VMs hangs (which log file). Furthermore, you could check whether the same hang occurs if you decrease the number of guest CPUs to 1.

And what exactly did you do in the guest to provoke the hang? I/O from/to the virtual disk, network I/O, and/or I/O over shared folders?

Both VMs have hung, and two kinds of high I/O provoke it.
I uploaded logs from the VM that hung this morning, around 9:30. It happened during a backup operation (network I/O). A hang also happened during some batch jobs (I/O from/to the virtual disk). There are no shared folders.
I decreased the number of CPUs from 4 to 1 and will let you know the result.

comment:6 in reply to: ↑ 4 Changed 4 years ago by Pedja

Replying to frank:

And: does the whole VM process hang? If so, forcing a core dump of that VM as described here (Forcing VirtualBox to terminate with a core dump) and sending the core dump to us could help find the problem. Let me know if you have such a core dump and I can tell you a server for uploading the file.

When a hang occurs, it happens on one VM; the second keeps working.
I forced a core dump. It is a 2 GB+ file, around 450 MB compressed. Should I upload it, and where?

comment:7 in reply to: ↑ 3 Changed 4 years ago by Pedja

Replying to frank:

It would be helpful if you could tell us which of the two VMs hangs (which log file). Furthermore, you could check whether the same hang occurs if you decrease the number of guest CPUs to 1.

And what exactly did you do in the guest to provoke the hang? I/O from/to the virtual disk, network I/O, and/or I/O over shared folders?

It seems that decreasing the number of guest CPUs to 1 works very well (so multiprocessing is what does not work). VT-x is still enabled. We are still testing, but the hang has not occurred in situations where it happened with 4 processors. The processor is an Intel Xeon CPU E5405 @ 2.00GHz. Am I right that this means the VM can use only 25% of the CPU?
Maybe I made a mistake with one of the checkboxes in the CPU settings?

comment:8 follow-up: ↓ 9 Changed 4 years ago by frank

The core dump you sent me is useless because you had set one guest CPU for that VM. Reading your comments above, I assume that the hang only occurs on high I/O with more than one guest CPU enabled.

Such a core dump only makes sense if you take it from a hanging VM session! So if you really want to help debug this problem, set up 4 guest CPUs, make the VM hang with your I/O operations, and then send me the core dump the same way as you already did.

And regarding your last question: On a 4 core host I would never activate 4 guest cores as the virtualization needs some overhead and there are other applications on the host requiring CPU time as well. A better choice would be 3 or 2 cores but nevertheless the guest VM shouldn't hang.
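The sizing advice above could be scripted roughly as follows. This is only a sketch: the "leave one host core free" rule is one reading of the advice, the VM name `test6` is hypothetical, and the helper `guest_cpu_command` is made up for illustration. It only builds the `VBoxManage modifyvm ... --cpus` command line and does not run it.

```python
import os
import shlex


def guest_cpu_command(vm_name, requested_cpus, host_cores=None):
    """Build a VBoxManage call that keeps guest CPUs below the host core count."""
    if host_cores is None:
        host_cores = os.cpu_count() or 1
    # Assumption: keep at least one host core free for the VMM and host applications.
    allowed = max(1, min(requested_cpus, host_cores - 1))
    # Note: modifyvm changes settings of a VM that is powered off.
    return ["VBoxManage", "modifyvm", vm_name, "--cpus", str(allowed)]


# "test6" is a hypothetical VM name; the command is printed, not executed.
print(shlex.join(guest_cpu_command("test6", requested_cpus=4, host_cores=4)))
```

On the 4-core host discussed here, a request for 4 guest CPUs would be reduced to 3.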

comment:9 in reply to: ↑ 8 ; follow-up: ↓ 10 Changed 4 years ago by Pedja

Replying to frank:

I hope I uploaded a useful core dump file this time, for the 4-CPU VM (still uploading at the time of writing, 1.5 GB). I may have made a mistake with the VM log files: after forcing the dump of the hung VM I rebooted it and could not find the right VM log file for the hanging session, so I uploaded all 3 logs. I sent the file names by email.
It seems that another VM, which is set to use 2 CPUs, doesn't work well either, but we are still testing and will also try to provoke a hang.

comment:10 in reply to: ↑ 9 Changed 4 years ago by Pedja

Tested and confirmed that VirtualBox hangs every time during higher I/O load on a VM with more than 1 CPU activated. I also noticed that the system time is inaccurate. On a VM with 1 CPU there is no such problem. Do you have any solution?

comment:11 follow-up: ↓ 12 Changed 4 years ago by frank

Your last core dump was better, but currently there is no solution. The wrong guest time will most probably be fixed in the next VBox maintenance release. For now I suggest you use only one guest CPU for that VM. VirtualBox will still benefit from multiple host cores, as the VMM itself and the virtual devices are multithreaded.

comment:12 in reply to: ↑ 11 ; follow-up: ↓ 13 Changed 4 years ago by Pedja

Is the issue related to guest OS (SLES 10.3 64bit)? SM on guest VM is very important to us.

comment:13 in reply to: ↑ 12 Changed 4 years ago by Pedja

SM - I meant on SMP

comment:14 Changed 4 years ago by sandervl73

  • Component changed from other to guest smp
  • Summary changed from SLES 10 Linux guest hangs to SLES 10 Linux guest hangs -> retry with 3.1.4

Retry with 3.1.4. That version will include an important stability fix for SMP guests.

comment:15 follow-up: ↓ 16 Changed 4 years ago by sandervl73

Please check if 3.1.4 beta 1 solves the problem:  http://forums.virtualbox.org/viewtopic.php?f=15&t=27300

comment:16 in reply to: ↑ 15 Changed 4 years ago by Pedja

Replying to sandervl73:

Please check if 3.1.4 beta 1 solves the problem:  http://forums.virtualbox.org/viewtopic.php?f=15&t=27300

VirtualBox 3.1.4 beta doesn't solve the problem. The VM hangs in the same way with SMP enabled (2 CPUs), and the system time is more inaccurate than with version 3.1.3.

comment:17 follow-up: ↓ 18 Changed 4 years ago by frank

In that case, did you really test an unofficial 3.1.3 test build and if yes, which exact build was it?

comment:18 in reply to: ↑ 17 ; follow-up: ↓ 19 Changed 4 years ago by Pedja

Replying to frank:

In that case, did you really test an unofficial 3.1.3 test build and if yes, which exact build was it?

I tested VirtualBox-3.1-3.1.2_56127_sles10.1-1.x86_64. After post from sandervl73 on 2010-01-29 I downloaded and installed
VirtualBox 3.1-3.1.4_BETA1_57050_sles10.1-1.x86_64

comment:19 in reply to: ↑ 18 Changed 4 years ago by Pedja

SMP doesn't work with version 3.1.4 r57640 either.
The guest system time is also inaccurate (it runs about 1 second per minute fast before sync).
I will upload a core dump file to ftp://ftp.innotek.de/incoming in a few minutes. The file name is core.1246.tar.gz.

comment:20 follow-up: ↓ 21 Changed 4 years ago by frank

  • Version changed from VirtualBox 3.1.2 to VirtualBox 3.1.4

Analyzing the core dump I saw that the E1000 ethernet card waits for the guest to free more network descriptors. One of the guest CPUs is currently executing code, the other is in halt state. This could be a problem with the E1000 network card emulation. Could you test if your guest works better if you change the network card to PCNet (VM network settings / advanced)?
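For readers who prefer the command line over the GUI, the adapter change suggested above might look like the sketch below. It assumes the first network adapter is the affected one and uses a hypothetical VM name `test6`; `Am79C973` is the `VBoxManage` identifier for the PCnet-FAST III card. The helper `nic_type_command` is illustrative only and prints the command rather than running it.

```python
import shlex


def nic_type_command(vm_name, adapter=1, nic_type="Am79C973"):
    """Build the VBoxManage call that switches one virtual NIC to another model."""
    # Am79C973 selects the emulated PCnet-FAST III card; 82540EM is an E1000 variant.
    return ["VBoxManage", "modifyvm", vm_name, f"--nictype{adapter}", nic_type]


# "test6" is a hypothetical VM name; the VM should be powered off before applying this.
print(shlex.join(nic_type_command("test6")))
```

The guest may need a driver for the new card model, so it is worth checking network connectivity after the first boot.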

comment:21 in reply to: ↑ 20 ; follow-up: ↓ 22 Changed 4 years ago by Pedja

Replying to frank:

It seems that setting the network card to PCnet-FAST III solves the hanging problem. With 2 processors, the machine ran under load for 2 days without problems and with much better performance than with 1 CPU.
There's just one problem left: the system time.

Mar 5 09:19:02 test6 ntpdate[31432]: step time server 10.0.0.x offset -0.615278 sec Mar 5 09:20:01 test6 ntpdate[31484]: step time server 10.0.0.x offset -0.856136 sec Mar 5 09:21:01 test6 ntpdate[31541]: step time server 10.0.0.x offset -1.297398 sec Mar 5 09:22:00 test6 ntpdate[31598]: step time server 10.0.0.x offset -2.188469 sec Mar 5 09:23:02 test6 ntpdate[31654]: step time server 10.0.0.x offset -1.373936 sec Mar 5 09:24:00 test6 ntpdate[31711]: step time server 10.0.0.x offset -1.646268 sec Mar 5 09:25:00 test6 ntpdate[31758]: step time server 10.0.0.x offset -1.749338 sec Mar 5 09:26:01 test6 ntpdate[31810]: step time server 10.0.0.x offset -0.745577 sec Mar 5 09:27:02 test6 ntpdate[31917]: step time server 10.0.0.x offset -0.839320 sec Mar 5 09:28:01 test6 ntpdate[31972]: step time server 10.0.0.x offset -0.545628 sec

Sync with the time server is set to run every minute.

comment:22 in reply to: ↑ 21 Changed 4 years ago by Pedja

Sorry for bad formatting, but it looks OK in Mozilla Firefox

Mar 5 09:19:02 test6 ntpdate[31432]: step time server 10.0.0.x offset -0.615278 sec

Mar 5 09:20:01 test6 ntpdate[31484]: step time server 10.0.0.x offset -0.856136 sec

Mar 5 09:21:01 test6 ntpdate[31541]: step time server 10.0.0.x offset -1.297398 sec

Mar 5 09:22:00 test6 ntpdate[31598]: step time server 10.0.0.x offset -2.188469 sec

Mar 5 09:23:02 test6 ntpdate[31654]: step time server 10.0.0.x offset -1.373936 sec

Mar 5 09:24:00 test6 ntpdate[31711]: step time server 10.0.0.x offset -1.646268 sec

Mar 5 09:25:00 test6 ntpdate[31758]: step time server 10.0.0.x offset -1.749338 sec

Mar 5 09:26:01 test6 ntpdate[31810]: step time server 10.0.0.x offset -0.745577 sec

Mar 5 09:27:02 test6 ntpdate[31917]: step time server 10.0.0.x offset -0.839320 sec

Mar 5 09:28:01 test6 ntpdate[31972]: step time server 10.0.0.x offset -0.545628 sec
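
To put a number on the drift in the log above, the offsets can be pulled out with a short script. This is a minimal sketch that parses the ten lines exactly as posted; the regular expression and the helper name `parse_offsets` are illustrative, and it assumes every line follows the `offset <seconds> sec` pattern shown.

```python
import re
from statistics import mean

# The ten ntpdate lines from the comment above, one per minute.
LOG = """\
Mar 5 09:19:02 test6 ntpdate[31432]: step time server 10.0.0.x offset -0.615278 sec
Mar 5 09:20:01 test6 ntpdate[31484]: step time server 10.0.0.x offset -0.856136 sec
Mar 5 09:21:01 test6 ntpdate[31541]: step time server 10.0.0.x offset -1.297398 sec
Mar 5 09:22:00 test6 ntpdate[31598]: step time server 10.0.0.x offset -2.188469 sec
Mar 5 09:23:02 test6 ntpdate[31654]: step time server 10.0.0.x offset -1.373936 sec
Mar 5 09:24:00 test6 ntpdate[31711]: step time server 10.0.0.x offset -1.646268 sec
Mar 5 09:25:00 test6 ntpdate[31758]: step time server 10.0.0.x offset -1.749338 sec
Mar 5 09:26:01 test6 ntpdate[31810]: step time server 10.0.0.x offset -0.745577 sec
Mar 5 09:27:02 test6 ntpdate[31917]: step time server 10.0.0.x offset -0.839320 sec
Mar 5 09:28:01 test6 ntpdate[31972]: step time server 10.0.0.x offset -0.545628 sec
"""

OFFSET_RE = re.compile(r"offset (-?\d+\.\d+) sec")


def parse_offsets(log_text):
    """Return the per-sync clock offsets (in seconds) found in ntpdate output."""
    return [float(m.group(1)) for m in OFFSET_RE.finditer(log_text)]


offsets = parse_offsets(LOG)
print(f"samples:      {len(offsets)}")
print(f"mean offset:  {mean(offsets):.3f} s per sync interval")
print(f"worst offset: {min(offsets):.3f} s")
```

Because the sync runs once a minute, each offset approximates the drift accumulated over roughly one minute. The negative sign is consistent with the guest clock running ahead of the time server.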

comment:23 Changed 4 years ago by sandervl73

Retry with 3.2.10. It contains an SMP performance fix that might apply to your case as well.

comment:24 Changed 3 years ago by frank

  • Status changed from new to closed
  • Resolution set to fixed

No response, closing.
