VirtualBox

Ticket #3857 (closed defect: fixed)

Opened 5 years ago

Last modified 5 years ago

VBoxManage calls hang when VirtualBox VMs have been running for a longer time

Reported by: sengel
Owned by:
Priority: major
Component: VM control
Version: VirtualBox 2.1.4
Keywords: VBoxManage, cloning
Cc:
Guest type: other
Host type: other

Description

Host-System:

  • Ubuntu 8.10 x64 (up to date with patches)
  • Dell PowerEdge 2950 III
  • VirtualBox 2.1.4

We repeatedly encounter the following behaviour with VirtualBox 2.1.4 once some virtual machines have been running for a certain time.

  • we start several VMs on a machine and let them perform their duty
  • one VM (i.e. its hard disk image) is a designated 'clone master', which only runs while we are doing maintenance on this template VM (e.g. updating the OS), so this VM is normally not running, and in particular not when we want to clone a new VM
  • all VMs are started using the command 'VBoxManage -nologo startvm <UUID_OF_VM> -type vrdp' (we are using a modified version of the vboxtool script, http://vboxtool.sourceforge.net/)
  • each VM is started with RDP enabled having its own RDP port

Once in a while a new clone is created from the 'clone master' using the following set of commands:

VM_MASTER_VDI=vm-master.vdi
VM_NAME=vm-srvXX
VM_MEM=1024MB
VM_VDI=vm-srvXX.vdi
VM_HOST_IF=eth0
VM_VRDPPORT=3170

VBoxManage -nologo createvm -name "${VM_NAME}" -register
VBoxManage -nologo modifyvm "${VM_NAME}" -ostype "ubuntu" -memory ${VM_MEM} -boot1 disk -acpi on -hwvirtex on
VBoxManage -nologo clonevdi "${VM_MASTER_VDI}" "${VM_VDI}"
VBoxManage -nologo registerimage disk "${VM_VDI}"
VBoxManage -nologo modifyvm "${VM_NAME}" -sata on -sataport1 ${VM_VDI}
VBoxManage -nologo modifyvm "${VM_NAME}" -floppy disabled -audio none -uart1 off -uart2 off -usb off
VBoxManage -nologo modifyvm "${VM_NAME}" -vrdp on
VBoxManage -nologo modifyvm "${VM_NAME}" -vrdpport "${VM_VRDPPORT}"
VBoxManage -nologo modifyvm "${VM_NAME}" -nic1 hostif -nictype1 82543GC -cableconnected1 on -hostifdev1 ${VM_HOST_IF}
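
One way to keep a script like the one above from blocking forever is to give every step a time limit. The following is a minimal sketch and not part of the original script: it assumes that GNU coreutils 'timeout' is installed, and 'run_step' and 'STEP_TIMEOUT' are hypothetical names chosen for illustration.

```shell
# Sketch only: run_step and STEP_TIMEOUT are hypothetical names, not part of VirtualBox.
STEP_TIMEOUT="${STEP_TIMEOUT:-300}"   # seconds allowed per step (assumed value)

# Run one command under a time limit and report a hang instead of blocking forever.
run_step() {
    timeout "${STEP_TIMEOUT}" "$@"
    rc=$?
    if [ "${rc}" -eq 124 ]; then
        # GNU timeout exits with 124 when the time limit was reached.
        echo "step timed out after ${STEP_TIMEOUT}s: $*" >&2
    fi
    return "${rc}"
}

# Intended use inside the clone script, stopping at the first failed or hung step:
#   run_step VBoxManage -nologo createvm -name "${VM_NAME}" -register || exit 1
#   run_step VBoxManage -nologo clonevdi "${VM_MASTER_VDI}" "${VM_VDI}" || exit 1
```

A limit that suits modifyvm may be far too short for clonevdi on a large image, so the value would have to be chosen per step. If a call is stuck so firmly that it ignores the termination signal, the -k option of 'timeout' can send a KILL signal after a grace period.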

Now the fun part: if the VMs have been up and running for a longer time (no, I can't define 'longer'; we create new VMs as we see fit, which may be after a few days or after some weeks) and we try to clone a new VM, cloning is no longer possible. The first command of our cloning script (see above) always executes, but one of the following calls to VBoxManage then never returns. This may happen on the second command, on the fifth, or on any other; it does not happen reproducibly at the same step.

From that time on (i.e. after the non-returning call to VBoxManage), the whole VirtualBox stack is in a kind of disorder.

  • it isn't possible to create a new VM, as any new VBoxManage call does not return
  • strangely, some VMs are reported as 'powered off' or 'aborted' after this incident although they are up and running and reachable via the network

The only way to get VirtualBox to work again is to log in to every VM on the host where the cloning attempt was made and shut each of these VMs down. After that, each VM can be started again and clones can be created, even while the VMs are up and running. But cloning only works until, some day after the start of the VMs, a VBoxManage call fails to return again. Then the whole cycle starts over: shut down everything, restart, clone.

What I have seen after such a clone incident is that the process VBoxXPCOMIPCD is running twice. Normally the process list looks something like this:

vbox@apollo:~/bin$ ps -ef |grep vbox
vbox     24262 23957  0 10:01 pts/1    00:00:00 -bash
vbox     26427     1  0 11:04 pts/1    00:00:00 /usr/lib/virtualbox/VBoxXPCOMIPCD
vbox     26434     1  0 11:04 ?        00:00:01 /usr/lib/virtualbox/VBoxSVC --automate
vbox     26594 26434 12 11:04 ?        00:00:28 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv1 -startvm 212c7bb0-8391-4363-9483-69fa0db809d4
vbox     26711 26434 12 11:04 ?        00:00:29 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv2 -startvm ce19c7ad-2e02-44c4-b5e4-13b3ea8f0093

But after trying to create a clone, the process list looks like this

vbox@apollo:~/bin$ ps -ef | grep vbox
vbox      1042  9887  4 Mar31 ?        10:00:53 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv4 -startvm ee8ccb62-a82c-4431-a895-9a23f17be276
vbox      5728     1  0 Mar12 ?        00:01:34 /usr/lib/virtualbox/VBoxXPCOMIPCD
vbox      6130     1  2 Mar12 ?        15:54:50 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv5 -startvm 60beb39a-5a26-459e-904d-54e55ef921bb
vbox      6250     1  0 Mar12 ?        05:19:10 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv6 -startvm d404e738-e2a7-4e0e-a368-90e5a544f302
vbox      6490     1  1 Mar12 ?        12:40:58 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv8 -startvm d1364021-895b-4676-9a21-53bdd6f68e7c
vbox      6609     1  0 Mar12 ?        05:17:32 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv9 -startvm 30eebc3b-6010-4a92-a24c-c487ab70ff42
vbox      9880     1  0 Mar25 ?        00:00:05 /usr/lib/virtualbox/VBoxXPCOMIPCD
vbox      9887     1  0 Mar25 ?        00:01:38 /usr/lib/virtualbox/VBoxSVC --automate
vbox     10527  9887  4 Mar25 ?        15:57:57 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv3 -startvm 9ae6dd12-b880-4977-a4ef-08f4b23aef36
vbox     13986  9887  0 Mar26 ?        03:04:04 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv2 -startvm ce19c7ad-2e02-44c4-b5e4-13b3ea8f0093

On 12 March the VMs were started initially. On 25, 26 and 31 March some VMs were restarted. As you can see, a second VBoxXPCOMIPCD process has been running since the restart on 25 March. As far as I understand, there should only be one process of this kind. Another thing is that the only VBoxSVC process dates from 25 March, whereas there should be exactly one such process, started on 12 March. Why is that original process no longer running?

And as VBoxSVC only knows the VMs started after VBoxSVC itself was started, all VMs started before that date are now reported as 'powered off', and one VM is in state 'aborted'.

vbox@apollo:~/bin$ VBoxManage list vms | egrep "^UUID|Name|State"

Name:            vm-master
UUID:            6204cea0-9f8c-4b22-92ef-5e3fd427a3af
State:           powered off (since 2009-04-03T15:45:53.993000000)
Name:            vm-srv1
UUID:            212c7bb0-8391-4363-9483-69fa0db809d4
State:           powered off (since 2009-03-12T14:54:54.000000000)
Name:            vm-srv2
UUID:            ce19c7ad-2e02-44c4-b5e4-13b3ea8f0093
State:           running (since 2009-03-26T00:14:43.442000000)
Name:            vm-srv5
UUID:            60beb39a-5a26-459e-904d-54e55ef921bb
State:           powered off (since 2009-03-12T14:54:58.000000000)
Name:            vm-srv6
UUID:            d404e738-e2a7-4e0e-a368-90e5a544f302
State:           powered off (since 2009-03-12T14:54:52.000000000)
Name:            vm-srv8
UUID:            d1364021-895b-4676-9a21-53bdd6f68e7c
State:           powered off (since 2009-03-12T14:54:52.000000000)
Name:            vm-srv9
UUID:            30eebc3b-6010-4a92-a24c-c487ab70ff42
State:           powered off (since 2009-03-12T14:54:52.000000000)
Name:            vm-srv4
UUID:            ee8ccb62-a82c-4431-a895-9a23f17be276
State:           running (since 2009-03-31T19:23:04.847000000)
Name:            vm-srv3
UUID:            9ae6dd12-b880-4977-a4ef-08f4b23aef36
State:           aborted (since 2009-04-09T07:23:18.036000000)

As you can see, all VMs started before 25 March are reported as 'powered off' (vm-master was powered off before cloning, so its reported state is correct). vm-srv2 was started on 26 March, i.e. after the 'new' VBoxSVC process had been started. vm-srv3 is reported as 'aborted' but was still up and running. 9 April is the date on which we tried to clone a new machine.

Each time cloning failed with a hanging VBoxManage call, one VM was reported as 'aborted' although that VM was still working fine.
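
The mismatch described above, VMs reported as 'powered off' or 'aborted' while their VBoxHeadless process is still alive, can be detected by comparing the two listings. The following is a rough sketch under assumptions: 'misreported_vms' is a hypothetical helper, and it relies on the 'ps -ef' and 'VBoxManage list vms' output formats shown in this ticket, which may differ in other versions.

```shell
# Sketch only: misreported_vms is a hypothetical helper, not a VirtualBox tool.
# $1: file holding `ps -ef` output
# $2: file holding `VBoxManage list vms` output (Name:/UUID:/State: lines)
# Prints "name: state" for every VM that has a live VBoxHeadless process
# although its reported state is something other than "running".
misreported_vms() {
    awk '
        # First file: remember the UUID that follows -startvm on VBoxHeadless lines.
        NR == FNR {
            if ($0 ~ /VBoxHeadless/)
                for (i = 1; i < NF; i++)
                    if ($i == "-startvm") alive[$(i + 1)] = 1
            next
        }
        # Second file: track the current VM, then evaluate its State line.
        $1 == "Name:" { name = $0; sub(/^Name:[ \t]*/, "", name) }
        $1 == "UUID:" { uuid = $2 }
        $1 == "State:" {
            state = $0
            sub(/^State:[ \t]*/, "", state)        # drop the label
            sub(/[ \t]*\(since.*$/, "", state)     # drop the "(since ...)" suffix
            if ((uuid in alive) && state != "running")
                print name ": " state
        }
    ' "$1" "$2"
}

# Intended use on the host:
#   ps -ef > /tmp/ps.txt
#   VBoxManage list vms > /tmp/vms.txt
#   misreported_vms /tmp/ps.txt /tmp/vms.txt
```

Any line printed by this check names a VM whose process is alive but whose state is no longer tracked correctly.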

It seems to me that there is a bug in the interprocess communication between VBoxSVC, VBoxManage and VBoxXPCOMIPCD, which is only triggered after the VMs have been running for some time.

  • Why are there two VBoxXPCOMIPCD processes, if only one should exist?
  • Is VBoxSVC failing (for whatever reason) and being restarted when we try to clone a new VM? (I did not check this the last time we tried to clone a VM.)
  • If so, VirtualBox is missing some kind of watchdog for keeping this process alive, because when VBoxSVC is restarted the new process knows nothing about the previously started (and still running) VMs.
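
The duplicate-daemon condition raised in these questions can be checked before each cloning attempt. The following is a small sketch: 'count_daemon' is a hypothetical helper, and it assumes the 'ps -ef' column layout shown in the listings above, where the command line starts in the eighth field.

```shell
# Sketch only: count_daemon is a hypothetical helper, not a VirtualBox tool.
# Reads `ps -ef` output on stdin and prints how many processes run the binary $1.
count_daemon() {
    # Field 8 is the executable path; compare its last path component to the name.
    awk -v name="$1" '
        { n = split($8, parts, "/"); if (parts[n] == name) c++ }
        END { print c + 0 }
    '
}

# Intended use on the host (a count above 1 matches the broken state in this ticket):
#   ps -ef | count_daemon VBoxXPCOMIPCD
#   ps -ef | count_daemon VBoxSVC
```

Comparing the executable name exactly, rather than grepping for 'vbox', avoids counting the VBoxHeadless processes or the grep command itself.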

As you can understand, shutting down every VM just to clone one VM, and to be sure that VirtualBox does what we asked of it, is not an option for us.

If this is a bug, please investigate. If you need further information, please ask.

If this is not a bug but a usage error, please advise how this behaviour can be avoided.

Change History

comment:1 Changed 5 years ago by frank

You are right, there shouldn't be more than one VBoxSVC process and one VBoxXPCOMIPCD process. There were some synchronization fixes in VirtualBox 2.2.0, you might give VirtualBox 2.2.2 a try.

comment:2 Changed 5 years ago by sengel

Some further information, which supports the view that there are synchronization problems:

  • I tried to clone a VM the other day on a host where several other VMs had been running for about two weeks. VBoxManage never returned from
    VBoxManage -nologo modifyvm "${VM_NAME}" -ostype "ubuntu" -memory ${VM_MEM} -boot1 disk -acpi on -hwvirtex on
    
  • A quick 'strace' showed that VBoxManage was stuck in a syscall. As I was in a hurry, I did not note which syscall it was.
  • Because of the failed VBoxManage call, one VM is now displayed in state 'aborted', one in 'powered off', and the rest in state 'running', but all are running fine.
  • This time there were no additional processes, just the VBoxXPCOMIPCD and VBoxSVC processes that were started when the VMs were started initially.
  • Because I didn't want to take down all VMs, I performed the cloning on another host, on which only one VM is currently running (for about a month or so). This time cloning went fine.

So, as you wrote, there seem to be some sync problems when using VBoxManage on a host where several VMs have been running for a longer time. With only one VM running, there has been no trouble so far.

I will give VirtualBox 2.2.2 a try on a separate host and, if everything is OK there, upgrade to version 2.2.2 in our next maintenance window. I am also going to change our clone script to run every call to VBoxManage under strace to gain further information.
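
The planned change, running every VBoxManage call under strace, could look roughly like the following. This is only a sketch of the idea, not the script actually used: 'traced', 'TRACE_DIR', 'TRACER' and 'TRACE_SEQ' are hypothetical names, and the log location is an assumption.

```shell
# Sketch only: traced, TRACE_DIR, TRACER and TRACE_SEQ are hypothetical names.
TRACE_DIR="${TRACE_DIR:-/tmp/vbox-trace}"   # assumed log location
TRACER="${TRACER:-strace}"                  # overridable, e.g. for a dry run
TRACE_SEQ=0                                 # call counter, keeps log names unique

# Run one command under the tracer, writing a separate log file per call.
traced() {
    mkdir -p "${TRACE_DIR}" || return 1
    TRACE_SEQ=$((TRACE_SEQ + 1))
    log="${TRACE_DIR}/$(date +%Y%m%dT%H%M%S)-$$-${TRACE_SEQ}-$(basename "$1").trace"
    # -f follows child processes, -tt adds timestamps with microseconds,
    # -o writes the trace to the log file instead of stderr.
    "${TRACER}" -f -tt -o "${log}" "$@"
}

# Intended use inside the clone script:
#   traced VBoxManage -nologo createvm -name "${VM_NAME}" -register
#   traced VBoxManage -nologo clonevdi "${VM_MASTER_VDI}" "${VM_VDI}"
```

When a call hangs, the last lines of the newest file in the log directory should show the syscall in which VBoxManage is stuck.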

comment:3 Changed 5 years ago by frank

Does this problem persist? You might want to try VBox 2.2.4.

comment:4 Changed 5 years ago by sengel

The problem still exists in 2.2.4. We are now considering moving to 3.0.2.

The problem seems to affect only host systems on which more than one VM is running when we try to clone a new VM. One of our systems has only one VM running, and there we can clone a new system whenever we like without any problems.

comment:5 Changed 5 years ago by sengel

We haven't had any trouble so far since upgrading to VBox 3.0.4. So I think this issue can be closed.

comment:6 Changed 5 years ago by sandervl73

  • Status changed from new to closed
  • Resolution set to fixed

Thanks for the feedback.
