Opened 12 years ago
Closed 12 years ago
#10991 closed defect (fixed)
console.powerDown() hangs if console.powerButton() failed first
Reported by: | Erik Heeren | Owned by: | |
---|---|---|---|
Component: | other | Version: | VirtualBox 4.0.14 |
Keywords: | api | Cc: | |
Guest type: | all | Host type: | Linux |
Description
Assume a virtual machine which, for whatever reason, doesn't react well to ACPI shutdown requests (you can simulate this by changing /etc/acpi/powerbtn.sh to something like the following):
```sh
#!/bin/sh
echo "Ignoring shutdown request" | wall
exit 0
```
On your host, execute the following code (replace UbuntuServer with your vm name).
```python
vmname = 'UbuntuServer'
from vboxapi import VirtualBoxManager

vbox = VirtualBoxManager(None, None)
virtualbox = vbox.getVirtualBox()
machine = virtualbox.findMachine(vmname)
mgr = vbox.mgr
session = mgr.getSessionObject(vbox.getVirtualBox())
machine.lockMachine(session, 1)
console = session.console
console.powerButton()  # this won't raise anything, but the VM won't shut down
console.powerDown()
```
The last call will leave your VM hanging in a shutting down state - kill -9 is needed to get out of it. Expected behaviour: console.powerDown() should be able to turn off the virtual machine regardless of its state.
Attachments (7)
Change History (31)
comment:1 by , 12 years ago
by , 12 years ago
comment:2 by , 12 years ago
Further info:
- Apparently I was mistaken on the VBox version: we're still on 4.0.14
- Closing ipython released my virtual machine and cleaned it up nicely on my laptop
- The server (from which the attached logfile was taken) still has the problem. The code there calls waitForCompletion on the progress object returned by console.powerDown(); in case of problems (try/finally), it calls session.unlockMachine(). A rough sketch of that pattern is shown below.
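For reference, a minimal sketch of that pattern, assuming the same vboxapi setup as in the ticket description (the VM name, lock type and wait timeout are placeholders, not the actual server code):

```python
# Sketch only -- placeholders, not the reporter's server code.
from vboxapi import VirtualBoxManager

vbox = VirtualBoxManager(None, None)
virtualbox = vbox.getVirtualBox()
machine = virtualbox.findMachine('UbuntuServer')   # placeholder VM name
session = vbox.mgr.getSessionObject(virtualbox)
machine.lockMachine(session, 1)                    # shared lock, as in the repro above
try:
    progress = session.console.powerDown()         # returns a progress object
    progress.waitForCompletion(-1)                 # -1 = wait indefinitely
finally:
    session.unlockMachine()                        # always release the session lock
```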
comment:3 by , 12 years ago
Version: | VirtualBox 4.2.0 → VirtualBox 4.0.14 |
---|
I cannot reproduce this problem either with VBox 4.0.14. I use vboxshell.py to first press the virtual power button and then to power down the VM.
comment:4 by , 12 years ago
It doesn't seem to happen to all of our VMs, but we just had another one with this problem. Is there anything else I can provide that would help with pinpointing the problem? More logs / lshw output / lspci output / ... ?
comment:5 by , 12 years ago
We've started seeing the issue with only a hard stop as well now (I'll attach 38660.VBox.log from a machine which went into this state without a prior soft stop).
It looks like, when this happens, it always happens to a machine created from the same image, although it doesn't happen to all machines created from this image.
ps aux output shows that there are two VBoxXPCOMIPCD processes - could this be related?
```
# ps aux|grep -i vbox
root 356 4.1 5.3 7240832 7069228 ? SNl Oct07 161:28 /opt/VirtualBox/VBoxHeadless --comment 38547 --startvm 11a78e00-b042-439e-8128-6ed10a553386 --vrde config
root 1274 9.7 5.5 7563152 7317884 ? SNl Oct07 377:31 /opt/VirtualBox/VBoxHeadless --comment 38525 --startvm da17de49-60d4-4353-b3a0-e0b147178205 --vrde config
root 5972 1.7 0.0 40476 3756 ? S Oct06 110:43 /opt/VirtualBox/VBoxXPCOMIPCD
root 7715 3.6 2.7 3882752 3631600 ? SNl Oct08 130:07 /opt/VirtualBox/VBoxHeadless --comment 36569 --startvm 4ae09e91-20cc-4c0a-9bdb-732626dad908 --vrde config
root 8447 3.6 2.7 3831880 3580608 ? SNl Oct08 127:30 /opt/VirtualBox/VBoxHeadless --comment 36498 --startvm daf4203a-f5f2-4edf-bfe6-2fa1b7d1816a --vrde config
root 9182 3.7 2.7 3780932 3598616 ? SNl Oct08 133:16 /opt/VirtualBox/VBoxHeadless --comment 31622 --startvm 79cf32ff-02d8-44be-b07e-972f4f30de61 --vrde config
root 9926 3.5 4.0 5575512 5330952 ? SNl Oct08 123:37 /opt/VirtualBox/VBoxHeadless --comment 31159 --startvm 492e62d7-52b1-4874-b120-0545ee3030d3 --vrde config
root 10296 3.0 5.3 7320764 7072764 ? SNl Oct08 94:14 /opt/VirtualBox/VBoxHeadless --comment 31839 --startvm da715602-eae6-4eef-a041-499fca13e11f --vrde config
root 10662 3.6 1.3 2015496 1833020 ? SNl Oct08 128:40 /opt/VirtualBox/VBoxHeadless --comment 36539 --startvm 60b2530a-1f07-47ed-b37b-df954cda2a3b --vrde config
root 13516 1.4 5.3 7305776 7126332 ? Sl Oct06 92:33 /opt/VirtualBox/VBoxHeadless --comment 36244 --startvm 27f40d12-bda3-4f86-b2b7-c806b9b40f5f --vrde config
root 14374 1.3 2.7 3806056 3619132 ? Sl Oct06 85:51 /opt/VirtualBox/VBoxHeadless --comment 36468 --startvm 4f883c43-7d50-40ad-8ad9-b09a4e716549 --vrde config
root 19351 7.0 0.0 187332 11164 ? SNl Oct07 275:41 /opt/VirtualBox/VBoxSVC --auto-shutdown
root 19355 4.4 0.0 40476 3756 ? SN Oct07 174:08 /opt/VirtualBox/VBoxXPCOMIPCD
root 20298 5.6 1.4 2067064 1878416 ? SNl Oct07 221:51 /opt/VirtualBox/VBoxHeadless --comment 37931 --startvm 855bf7a0-9714-48c0-aa05-cbd1da7ca8d8 --vrde config
<snip>
```
I'll keep looking for a surefire way to reproduce this, but if you have any suggestions for things I might try, or need any additional information that would help you to investigate this, please let me know.
What's more: we don't dare to use kill -9 on the VBox VM process because we've had cases where all other VBox VMs on the node died together with the one being killed. This leaves us without a workaround, so any suggestions on that front would be appreciated as well.
by , 12 years ago
Attachment: | 38660.VBox.log added |
---|
comment:6 by , 12 years ago
In trying to reproduce, I managed to kill all virtual machines by using kill -9 on one of them.
The situation:
One physical machine with 10 virtual machines, all of which are running. After kill -9 on one VM, all VMs went down.
I have a script which selects a random VM on every run and tries the following (a rough sketch of the loop is shown after this list):
- soft stop (powerbutton)
- If this fails: hard stop
- Start the machine again
- Kill -9 the process
- Check whether all machines except the killed one are still running (this was not the case)
- Start the machine again
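For context, a rough sketch of such a stress loop (not the actual script; it drives VBoxManage via subprocess, and the VM names, sleep intervals and pgrep pattern are assumptions):

```python
# Sketch only -- VM names, sleep intervals and the pgrep pattern are assumptions.
import random
import subprocess
import time

VMS = ['vm-01', 'vm-02', 'vm-03']   # hypothetical VM names on the host

def running_vms():
    # Names of currently running VMs, parsed from 'VBoxManage list runningvms'.
    out = subprocess.check_output(['VBoxManage', 'list', 'runningvms']).decode()
    return [line.split('"')[1] for line in out.splitlines() if '"' in line]

def cycle(vm):
    subprocess.call(['VBoxManage', 'controlvm', vm, 'acpipowerbutton'])   # soft stop
    time.sleep(30)
    if vm in running_vms():                                               # soft stop ignored?
        subprocess.call(['VBoxManage', 'controlvm', vm, 'poweroff'])      # hard stop
        time.sleep(10)
    subprocess.call(['VBoxManage', 'startvm', vm, '--type', 'headless'])  # start again
    time.sleep(30)
    pid = subprocess.check_output(['pgrep', '-f', 'VBoxHeadless.*' + vm]).split()[0]
    subprocess.call(['kill', '-9', pid.decode()])                         # kill -9 the VM process
    time.sleep(10)
    survivors = running_vms()
    for other in (v for v in VMS if v != vm):
        assert other in survivors, 'other VMs died with the killed one'   # the observed failure
    subprocess.call(['VBoxManage', 'startvm', vm, '--type', 'headless'])  # start it once more

cycle(random.choice(VMS))
```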
Attached you'll find a tarball with:
- Logs of all the vmachines at this point (some are aborted, some powered off)
- VboxManage showvminfo for all vmachines
Running vbox-related processes (full output - nothing cut):
```
ps aux |grep -i box
root 8371 0.0 0.0 40476 3704 ? S Oct11 0:23 /opt/VirtualBox/VBoxXPCOMIPCD
```
Additionally: should I open a new ticket for this? Would it help if I somehow got the machine's disk to you?
by , 12 years ago
Attachment: | all_machines_dead.tgz added |
---|
comment:7 by , 12 years ago
Erik,
I can see two issues here:
1) a VM does not poweroff after calling the API
2) when killing a VM process with kill -9, VBoxSVC (the communication daemon) and other VMs crash
Both should be looked at. However, in your scenario you should never have to use kill -9; the power-off command is designed to always work, so this needs to be investigated.
I think you will have to provide your complete code that deals with the VM power-off so it can be inspected for correct use of the APIs. You should also take process dumps of the VM process that refuses to terminate as well as of VBoxSVC. This will tell us what is being executed and what the internal state is. Note that if you are not on a standard Linux distribution, you will have to provide your whole Linux environment in order to analyze the dumps.
comment:8 by , 12 years ago
ps aux output shows that there are two VBoxXPCOMIPCD processes - could this be related?
That is not right; there should be only one instance per user (so one in total). This process is started automatically, so I wonder how you end up with two of them.
I notice you run everything as root. While that is possible, it is not particularly nice.
follow-up: 11 comment:9 by , 12 years ago
To me it seems that the VBoxSVC daemon crashes and that will make all other VMs terminate (otherwise they would not be controllable anymore). So we should try to find out why VBoxSVC crashes before investigating why the VM does not poweroff (or this could even be the same problem).
So please activate coredumps, see here and reproduce this problem. Then I can either provide you debugging files for generating a better backtrace or I need access to the box where VirtualBox is installed.
comment:10 by , 12 years ago
And the hang can be debugged in a similar way by triggering a core dump (see my previous comment) of the hanging process.
comment:11 by , 12 years ago
Replying to frank:
To me it seems that the VBoxSVC daemon crashes and that will make all other VMs terminate (otherwise they would not be controllable anymore).
This can be seen in the VBox.log file of all your VMs: detected unresponsive VBoxSVC (rc=NS_ERROR_CALL_FAILED) -- this caused a clean shutdown of these VMs.
comment:12 by , 12 years ago
I have, so far, only managed to reproduce the "all machines die after kill -9" issue. I'm attaching a core dump of the VBoxXPCOMIPCD process after this happens (the only remaining virtualbox-related process).
I'm also attaching our code which calls the vbox apis.
You can find the Ubuntu we're using (with some kernel modifications) here: http://files.incubaid.com/iaas/qpackages/71/cd/71cd27ef769f4ca68f7c03c309a445cc.gz . It's a tgz archive within a gzip archive.
Extra debs that we install can be found here: http://files.incubaid.com/iaas/qpackages/54/2b/542b59709bc726375bd06964ba2324d8.tgz
I'm still trying to reproduce the hanging VM issue - this time without kill -9 cycles in between.
by , 12 years ago
Attachment: | VBoxXPCOMIPCD.core.tgz added |
---|
Coredump of VBoxXPCOMIPCD after all machines died
comment:13 by , 12 years ago
Erik,
I think it would be better to debug the power-off hang as that is the first problem you run into. However, if it's easier, let's focus on the crash-on-kill issue -- if that one is solved, then you have a workable solution.
You need to set up your system to automatically generate core dumps when a process crashes. Depending on how you start VirtualBox, you have to put the right environment settings (ulimit etc., see the docs as pointed out by Frank) in the right place. Once you manage to reproduce the issue, you need to provide core dumps of VBoxSVC and a couple of VBoxHeadless processes.
You can test your core dump setup by running

```sh
kill -SIGSEGV <pid>
```

from the command line. This simulates a segmentation fault and will give you a core dump if your system is configured correctly.
comment:14 by , 12 years ago
Achim,
Agreed - it would be best to solve the poweroff hang, although the kill issue would be a good workaround.
I've added
```sh
ulimit -c unlimited
echo -n 1 > /proc/sys/kernel/core_uses_pid
echo -n 1 > /proc/sys/fs/suid_dumpable
```
to /etc/profile and the vboxdrv startup script and have a clean stop / hard stop script running. Hoping to see something show up before Monday.
comment:15 by , 12 years ago
I had a quick look at your Python code. A few remarks (a rough sketch combining both follows the list):
- between calling IConsole::PowerButton() and checking whether the guest has reacted to it using IConsole::PowerButtonHandled(), there is no delay. I think you should give the guest some time to react to it, maybe poll for it for 3 seconds?
- you use IConsole::PowerDown() to terminate the VM. This API is deprecated (and removed in VirtualBox 4.2). You should not use it but go for IConsole::PowerDownAsync() (which is renamed to PowerDown() in 4.2). This API will let you track the progress.
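A minimal sketch combining both remarks, assuming `console` comes from a locked session as in the ticket description; the 3-second poll window and the exact method names exposed by the Python bindings are assumptions:

```python
# Sketch only: give the guest ~3 seconds to react to the ACPI power button,
# then fall back to a hard power-off via the asynchronous API.
import time

console.powerButton()                        # ask the guest to shut down via ACPI

handled = False
deadline = time.time() + 3                   # poll for up to ~3 seconds
while time.time() < deadline:
    if console.getPowerButtonHandled():      # has the guest acknowledged the event?
        handled = True
        break
    time.sleep(0.2)

if not handled:
    # Hard power-off: powerDownAsync() on 4.0/4.1, renamed to powerDown() in 4.2.
    progress = console.powerDownAsync()
    progress.waitForCompletion(-1)           # track/await the power-off progress
# (Handling a guest that acknowledged the event but never shuts down is omitted here.)
```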
comment:16 by , 12 years ago
I'll make sure to incorporate your remarks into the code, but I'll hold off on doing so until I've reliably reproduced the issue.
I've had a script run over the weekend to try to reproduce the issue, but it's still happily running. A colleague managed to reproduce it by accident though, and we've coredumped the vbox process at that time. You can find the dump here: http://files.incubaid.com/iaas/virtualbox/core.hung.bz2
comment:17 by , 12 years ago
Thanks for the core dump! First analysis shows that the power down thread is waiting for the EM thread. The EM thread is executing PDMR3PowerOff() but is waiting for some device to get powered off. This smells like a bug we fixed in a later release, let me check.
comment:18 by , 12 years ago
Actually, do you still have the VBox.log file corresponding to the core dump you submitted? I just saw that this case must be slightly different from the one in the original VBox.log file attached to this ticket.
The other problem (VBoxSVC crashes if client is killed with -9) was fixed in VirtualBox 4.0.16.
comment:19 by , 12 years ago
Frank,
I'm not quite sure - the VM was destroyed, but I did find a VBox.log file that might have belonged to it. I don't know if you can get the VM name from the core dump, but if it was 21470, then this is the log file you need.
by , 12 years ago
Attachment: | 21470.VBox.log added |
---|
comment:20 by , 12 years ago
Thanks for the log file. I think this one does indeed match the core dump. So far it looks like some AHCI thread does not properly terminate. We are currently investigating the core dump.
comment:21 by , 12 years ago
Is it possible to get the raw disk vmdk descriptor files of a hung VM? I'm currently investigating the core dump but it would make things a bit easier if I could have a look at the descriptor files.
comment:22 by , 12 years ago
The machine had three disks (boot, data, swap). I'm afraid the actual disks are lost, but the vmdk files were still around on the system.
21470_0.vmdk should be the boot disk.
by , 12 years ago
Attachment: | vmdk_files.tgz added |
---|
comment:23 by , 12 years ago
For the sake of completeness:
The power-off hang was due to a misbehavior of one of your storage volumes (iSCSI?). It reports that it does not support asynchronous flush operations, so VirtualBox performs synchronous flushes. Such a flush was issued in response to the shutdown but never completes (it is the first and only flush command for that volume). The volume saw little access during VM runtime, so it could be the swap partition. When trying to do the hard power-off, VirtualBox waits for the flush to complete, but due to some problem with the underlying storage system it never does. Previously, VirtualBox did not handle a hanging flush command in the synchronous case; this is now implemented.
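Purely as an illustration of the general idea (this is not the actual VirtualBox fix), a hanging synchronous flush can be made harmless by bounding the wait with a timeout, for example:

```python
# Illustrative only -- not VirtualBox code. Shows the idea of not blocking
# power-off forever on a synchronous flush that never completes.
import threading

def flush_with_timeout(flush_fn, timeout_s=30.0):
    # Run the flush in a worker thread and give up waiting after timeout_s seconds.
    done = threading.Event()

    def worker():
        flush_fn()        # may never return if the storage backend misbehaves
        done.set()

    threading.Thread(target=worker, daemon=True).start()
    if not done.wait(timeout_s):
        return False      # flush is stuck; let the power-off proceed anyway
    return True
```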
The follow-up problem (the crash when doing a kill -9 on the VM process) was already fixed in the current version 4.0.16.
comment:24 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
I cannot reproduce your problem. Tried this with VBox 4.2.0, a Linux host and a Linux guest. Also: If the guest is ignoring the power button event then this should not influence the VM process in any way.
Please attach a VBox.log file from such a VM session where you had to kill the VM process with 'kill -9'.