Opened 16 years ago
Closed 15 years ago
#3857 closed defect (fixed)
VBoxManage calls hang when VirtualBox VMs are running for longer time
Reported by: | sengel | Owned by: | |
---|---|---|---|
Component: | VM control | Version: | VirtualBox 2.1.4 |
Keywords: | VBoxMange, cloning | Cc: | |
Guest type: | other | Host type: | other |
Description
Host-System:
- Ubuntu 8.10 x64 (up to date with patches)
- Dell PowerEdge 2950 III
- VirtualBox 2.1.4
We repeatedly encounter the following behaviour with VirtualBox 2.1.4 when some virtual machines have been running for a certain time.
- we start several VMs on a machine and let them perform their duty
- one VM (i.e. its hard disk image) is a designated 'clone master', which only runs while we are doing maintenance on this template VM (e.g. updating the OS), so this VM is normally not running, especially not when we want to clone a new VM
- all VMs are started using the command 'VBoxManage -nologo startvm <UUID_OF_VM> -type vrdp' (we are using a modified version of the vboxtool script (http://vboxtool.sourceforge.net/))
- each VM is started with RDP enabled having its own RDP port
Once in a while a new clone is created from the 'clone master' using the following set of commands:
VM_MASTER_VDI=vm-master.vdi
VM_NAME=vm-srvXX
VM_MEM=1024MB
VM_VDI=vm-srvXX.vdi
VM_HOST_IF=eth0
VM_VRDPPORT=3170
VBoxManage -nologo createvm -name "${VM_NAME}" -register
VBoxManage -nologo modifyvm "${VM_NAME}" -ostype "ubuntu" -memory ${VM_MEM} -boot1 disk -acpi on -hwvirtex on
VBoxManage -nologo clonevdi "${VM_MASTER_VDI}" "${VM_VDI}"
VBoxManage -nologo registerimage disk "${VM_VDI}"
VBoxManage -nologo modifyvm "${VM_NAME}" -sata on -sataport1 ${VM_VDI}
VBoxManage -nologo modifyvm "${VM_NAME}" -floppy disabled -audio none -uart1 off -uart2 off -usb off
VBoxManage -nologo modifyvm "${VM_NAME}" -vrdp on
VBoxManage -nologo modifyvm "${VM_NAME}" -vrdpport "${VM_VRDPPORT}"
VBoxManage -nologo modifyvm "${VM_NAME}" -nic1 hostif -nictype1 82543GC -cableconnected1 on -hostifdev1 ${VM_HOST_IF}
Now the fun part: if the VMs have been up and running for a longer time (no, I can't define 'longer', as we create new VMs as we see fit, which may be after a few days or after several weeks) and we try to clone a new VM, then cloning isn't possible any more. The first command of our cloning script (see above) always executes, but after that one of the following commands fails to return, i.e. the call to VBoxManage never returns. This may happen on the second command, the fifth or any other; it isn't reproducible.
From that time on (i.e. from the non-returning call to VBoxManage), the whole VirtualBox stack is in a kind of disorder:
- it isn't possible to create a new VM, as no new VBoxManage call returns
- strangely, some VMs are reported as 'powered off' or 'aborted' after this incident although they are up and running and reachable via the network
The only way to get VirtualBox working again is to log in to every VM on the machine where the cloning attempt was made and shut each of these VMs down. After that each VM can be started again and clones can be created, even while the VMs are up and running. But cloning only works until, some time after the VMs were started, a VBoxManage call again fails to return. Then the whole process starts over: shut everything down, restart, clone.
What I have seen (after a cloning incident) is that the VBoxXPCOMIPCD process is running twice. Normally the process list looks something like this:
vbox@apollo:~/bin$ ps -ef |grep vbox
vbox 24262 23957 0 10:01 pts/1 00:00:00 -bash
vbox 26427 1 0 11:04 pts/1 00:00:00 /usr/lib/virtualbox/VBoxXPCOMIPCD
vbox 26434 1 0 11:04 ? 00:00:01 /usr/lib/virtualbox/VBoxSVC --automate
vbox 26594 26434 12 11:04 ? 00:00:28 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv1 -startvm 212c7bb0-8391-4363-9483-69fa0db809d4
vbox 26711 26434 12 11:04 ? 00:00:29 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv2 -startvm ce19c7ad-2e02-44c4-b5e4-13b3ea8f0093
But after trying to create a clone, the process list looks like this
vbox@apollo:~/bin$ ps -ef | grep vbox
vbox 1042 9887 4 Mar31 ? 10:00:53 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv4 -startvm ee8ccb62-a82c-4431-a895-9a23f17be276
vbox 5728 1 0 Mar12 ? 00:01:34 /usr/lib/virtualbox/VBoxXPCOMIPCD
vbox 6130 1 2 Mar12 ? 15:54:50 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv5 -startvm 60beb39a-5a26-459e-904d-54e55ef921bb
vbox 6250 1 0 Mar12 ? 05:19:10 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv6 -startvm d404e738-e2a7-4e0e-a368-90e5a544f302
vbox 6490 1 1 Mar12 ? 12:40:58 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv8 -startvm d1364021-895b-4676-9a21-53bdd6f68e7c
vbox 6609 1 0 Mar12 ? 05:17:32 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv9 -startvm 30eebc3b-6010-4a92-a24c-c487ab70ff42
vbox 9880 1 0 Mar25 ? 00:00:05 /usr/lib/virtualbox/VBoxXPCOMIPCD
vbox 9887 1 0 Mar25 ? 00:01:38 /usr/lib/virtualbox/VBoxSVC --automate
vbox 10527 9887 4 Mar25 ? 15:57:57 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv3 -startvm 9ae6dd12-b880-4977-a4ef-08f4b23aef36
vbox 13986 9887 0 Mar26 ? 03:04:04 /usr/lib/virtualbox/VBoxHeadless -comment vm-srv2 -startvm ce19c7ad-2e02-44c4-b5e4-13b3ea8f0093
On 12th of March the VMs were initially started. On 25th, 26th and 31st of March some VMs were restarted. As you can see, a second VBoxXPCOMIPCD process has been running since the restart on 25th of March. As far as I understand, there should only be one process of this kind. Another thing is that VBoxSVC has only been running since 25th of March, but there should be exactly one such process, started on 12th of March. Why isn't that process running any more?
And as VBoxSVC only knows about the VMs started after VBoxSVC itself was started, all VMs started before that date are now reported as 'powered off', and one VM is in state 'aborted'.
vbox@apollo:~/bin$ VBoxManage list vms | egrep "^UUID|Name|State"
Name: vm-master
UUID: 6204cea0-9f8c-4b22-92ef-5e3fd427a3af
State: powered off (since 2009-04-03T15:45:53.993000000)
Name: vm-srv1
UUID: 212c7bb0-8391-4363-9483-69fa0db809d4
State: powered off (since 2009-03-12T14:54:54.000000000)
Name: vm-srv2
UUID: ce19c7ad-2e02-44c4-b5e4-13b3ea8f0093
State: running (since 2009-03-26T00:14:43.442000000)
Name: vm-srv5
UUID: 60beb39a-5a26-459e-904d-54e55ef921bb
State: powered off (since 2009-03-12T14:54:58.000000000)
Name: vm-srv6
UUID: d404e738-e2a7-4e0e-a368-90e5a544f302
State: powered off (since 2009-03-12T14:54:52.000000000)
Name: vm-srv8
UUID: d1364021-895b-4676-9a21-53bdd6f68e7c
State: powered off (since 2009-03-12T14:54:52.000000000)
Name: vm-srv9
UUID: 30eebc3b-6010-4a92-a24c-c487ab70ff42
State: powered off (since 2009-03-12T14:54:52.000000000)
Name: vm-srv4
UUID: ee8ccb62-a82c-4431-a895-9a23f17be276
State: running (since 2009-03-31T19:23:04.847000000)
Name: vm-srv3
UUID: 9ae6dd12-b880-4977-a4ef-08f4b23aef36
State: aborted (since 2009-04-09T07:23:18.036000000)
As you can see, all VMs started on 12th of March are reported as 'powered off' (vm-master was powered off before cloning, so its reported state is correct). vm-srv2 and vm-srv4 were started on 26th and 31st of March respectively, i.e. after the 'new' VBoxSVC process had been started, and are reported as running. vm-srv3 is reported as 'aborted' but was still up and running. 9th of April was the date when we tried to clone a new machine.
Each time cloning fails with a hanging VBoxManage, one VM is reported as 'aborted' although it is still working fine.
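One rough way to detect this mismatch from the host is to compare the UUIDs passed to the running VBoxHeadless processes ('-startvm <UUID>' in the ps listings above) with the states printed by 'VBoxManage list vms'. The following is a minimal sketch, assuming the Name:/UUID:/State: output format of 2.1.4 shown above with one field per line; other versions may print 'list vms' differently, and the function names are hypothetical.

```bash
#!/bin/bash
# Sketch: flag VMs that 'VBoxManage list vms' reports as not running although
# a VBoxHeadless process for their UUID still exists.
# Assumption: the 2.1.4 "Name:/UUID:/State:" output format, one field per line.
# running_uuids and find_state_mismatches are hypothetical helper names.

# Print the UUID following "-startvm" for every running VBoxHeadless process.
running_uuids() {
    ps -eo args= | awk '$1 ~ /VBoxHeadless$/ {
        for (i = 1; i < NF; i++) if ($i == "-startvm") print $(i + 1)
    }'
}

# Read Name:/UUID:/State: records on stdin; $1 is a file with one running
# UUID per line. Print "name uuid state" for each VM whose state is not
# "running" but whose UUID is in that file.
find_state_mismatches() {
    awk -v list="$1" '
        BEGIN { while ((getline u < list) > 0) alive[u] = 1 }
        $1 == "Name:" { name = $2 }
        $1 == "UUID:" { uuid = $2 }
        $1 == "State:" {
            state = $0
            sub(/^[ \t]*State:[ \t]*/, "", state)   # drop the "State:" label
            sub(/ \(since.*$/, "", state)           # drop the "(since ...)" suffix
            if (state != "running" && (uuid in alive)) print name, uuid, state
        }'
}
```

Used as 'running_uuids > /tmp/running-uuids' followed by 'VBoxManage list vms | find_state_mismatches /tmp/running-uuids', it should flag the VMs reported as 'powered off' or 'aborted' while their VBoxHeadless process is still alive.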
It seems to me that there is a bug in the interprocess communication between VBoxSVC, VBoxManage and VBoxXPCOMIPCD, which is only triggered after the VMs have been running for some time.
- Why are there two VBoxXPCOMIPCD processes, if only one should exist?
- Is VBoxSVC failing (for whatever reason) and being restarted when we try to clone a new VM? (I didn't check this the last time we tried to clone a VM.)
- If so, VirtualBox is missing some kind of watchdog to keep this process alive, because a restarted VBoxSVC process doesn't know anything about the previously started (and still running) VMs.
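The 'only one process' question can be checked quickly, for example before running the clone script. The following minimal sketch counts the VBoxSVC and VBoxXPCOMIPCD processes with procps 'ps -C', which matches on the process name. It assumes all VMs run under a single user (vbox in the listings above), since each user running VirtualBox has their own daemons; the function names are hypothetical. A count of one does not prove the IPC is healthy, it only catches the duplicate-daemon situation from the second ps listing.

```bash
#!/bin/bash
# Sketch: check that exactly one VBoxSVC and one VBoxXPCOMIPCD process exist.
# Assumption: all VirtualBox processes on the host belong to one user.
# check_single, count_procs and check_vbox_daemons are hypothetical names.

# Report on daemon $1 given its process count $2.
# Prints OK on stdout or a warning on stderr; returns 1 unless the count is 1.
check_single() {
    local name=$1 count=$2
    if [ "$count" -eq 1 ]; then
        echo "OK: one $name process"
        return 0
    fi
    echo "WARNING: $count $name processes running (expected exactly 1)" >&2
    return 1
}

# Count processes whose command name is exactly $1 (procps 'ps -C').
count_procs() {
    ps -C "$1" -o pid= | wc -l
}

# Check both VirtualBox daemons; returns non-zero if either count is off.
check_vbox_daemons() {
    local rc=0 d
    for d in VBoxSVC VBoxXPCOMIPCD; do
        check_single "$d" "$(count_procs "$d")" || rc=1
    done
    return "$rc"
}
```

In the clone script, a line like 'check_vbox_daemons || exit 1' before the first VBoxManage call would stop the script instead of starting a clone in that state.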
As you can understand, shutting down every VM just to clone a VM, and to be sure that VirtualBox does what we want it to do, is not an option for us.
If this is a bug, please investigate. If you need further information, please ask.
If this is not a bug but a usage error, please advise how this behaviour can be avoided.
Change History (6)
comment:1 by , 16 years ago
comment:2 by , 16 years ago
Some further information, which supports the idea that there are synchronization problems:
- The other day I tried to clone a VM on a host where several other VMs had been running for about two weeks. VBoxManage never returned from
VBoxManage -nologo modifyvm "${VM_NAME}" -ostype "ubuntu" -memory ${VM_MEM} -boot1 disk -acpi on -hwvirtex on
- A quick 'strace' showed that VBoxManage was stuck in a syscall. As I was in a hurry, I didn't note which syscall it was.
- Because of the failed VBoxManage call, one VM is now displayed in state 'aborted', one in 'powered off' and the rest in state 'running', but all of them are running fine.
- This time there were no additional processes, just the VBoxXPCOMIPCD and VBoxSVC processes that had been started when the VMs were initially started.
- Because I didn't want to take down all VMs, I performed the cloning on another host, on which only one VM is currently running (and has been for about a month). This time cloning went fine.
So, as you wrote, there seem to be some synchronization problems when using VBoxManage on a host where several VMs have been running for a longer time. With only one VM running, we have had no trouble so far.
I will give VirtualBox 2.2.2 a try on a separate host and, if everything is OK, upgrade to version 2.2.2 in our next maintenance window. I am also going to change our clone script to run every VBoxManage call under strace to gather further information.
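A wrapper along those lines could look like the following minimal sketch: each VBoxManage call runs under strace and is aborted by coreutils 'timeout' if it does not return, so a hang leaves a trace file behind instead of blocking the script forever. The function name vbm, the variables VBM_CMD, VBM_TIMEOUT, VBM_STRACE and LOG_DIR, and the 300-second default are assumptions of this sketch, not VirtualBox settings.

```bash
#!/bin/bash
# Sketch: run each VBoxManage call under strace with a time limit, so a hung
# call leaves a trace file behind. All names below are hypothetical settings
# of this sketch, not VirtualBox options.
VBM_CMD=${VBM_CMD:-VBoxManage}
VBM_TIMEOUT=${VBM_TIMEOUT:-300}   # seconds before a call is treated as hung
VBM_STRACE=${VBM_STRACE:-1}       # 1 = trace with strace, 0 = run untraced
LOG_DIR=${LOG_DIR:-/tmp/vbm-trace}

# Usage: vbm <VBoxManage arguments...>  (-nologo is added automatically)
vbm() {
    local trace="" rc
    if [ "$VBM_STRACE" = 1 ]; then
        mkdir -p "$LOG_DIR" || return 1
        trace=$(mktemp "$LOG_DIR/vbm.XXXXXX") || return 1
        # -f follows forked children, -tt timestamps each syscall,
        # -o writes the trace to a file instead of stderr.
        timeout "$VBM_TIMEOUT" strace -f -tt -o "$trace" "$VBM_CMD" -nologo "$@"
    else
        timeout "$VBM_TIMEOUT" "$VBM_CMD" -nologo "$@"
    fi
    rc=$?
    # GNU timeout exits with status 124 when the time limit was reached.
    if [ "$rc" -eq 124 ]; then
        echo "vbm: '$VBM_CMD $*' did not return within ${VBM_TIMEOUT}s" >&2
        if [ -n "$trace" ] && [ -s "$trace" ]; then
            echo "vbm: last syscalls (full trace in $trace):" >&2
            tail -n 5 "$trace" >&2
        fi
    fi
    return "$rc"
}
```

In the clone script, 'VBoxManage -nologo modifyvm "${VM_NAME}" -vrdp on' would then become 'vbm modifyvm "${VM_NAME}" -vrdp on'. The timeout only stops the waiting client; whether VBoxSVC is still in a usable state afterwards is not something this sketch can guarantee.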
comment:4 by , 15 years ago
The problem still exists in 2.2.4. We are now considering moving to 3.0.2.
The problem seems to affect only host systems on which more than one VM is running when we try to clone a new VM. One of our systems has only one VM running, and there we can clone a new system whenever we like without any problems.
comment:5 by , 15 years ago
We haven't had any trouble so far since upgrading to VBox 3.0.4. So I think this issue can be closed.
You are right, there shouldn't be more than one VBoxSVC process and one VBoxXPCOMIPCD process. There were some synchronization fixes in VirtualBox 2.2.0, you might give VirtualBox 2.2.2 a try.