VirtualBox

Ticket #10991 (closed defect: fixed)

Opened 19 months ago

Last modified 13 months ago

console.powerDown() hangs if console.powerButton() failed first

Reported by: ErikH Owned by:
Priority: major Component: other
Version: VirtualBox 4.0.14 Keywords: api
Cc: Guest type: all
Host type: Linux

Description

Assume a virtual machine which, for whatever reason, doesn't react well to ACPI shutdown requests (you can simulate this by changing /etc/acpi/powerbtn.sh to something like

#!/bin/sh

echo "Ignoring shutdown request" | wall
exit 0

On your host, execute the following code (replace UbuntuServer with your vm name).

vmname = 'UbuntuServer'

from vboxapi import VirtualBoxManager
vbox = VirtualBoxManager(None, None)
virtualbox = vbox.getVirtualBox()
machine = virtualbox.findMachine(vmname)
mgr = vbox.mgr
session = mgr.getSessionObject(vbox.getVirtualBox())
machine.lockMachine(session, 1)
console = session.console
console.powerButton()  #this won't raise anything, but the VM won't shut down
console.powerDown() 

The last call will leave your VM hanging in a shutting down state - kill -9 is needed to get out of it. Expected behaviour: console.powerDown() should be able to turn off the virtual machine regardless of its state.

Attachments

VBox.log Download (73.0 KB) - added by ErikH 19 months ago.
38660.VBox.log Download (52.0 KB) - added by ErikH 19 months ago.
all_machines_dead.tgz Download (314.5 KB) - added by ErikH 18 months ago.
VBoxXPCOMIPCD.core.tgz Download (486.7 KB) - added by ErikH 18 months ago.
Coredump of VBoxXPCOMIPCD after all machines died
vbox_calls.tgz Download (15.8 KB) - added by ErikH 18 months ago.
VBox API calls
21470.VBox.log Download (47.2 KB) - added by ErikH 18 months ago.
vmdk_files.tgz Download (606 bytes) - added by ErikH 18 months ago.

Change History

comment:1 Changed 19 months ago by frank

I cannot reproduce your problem. Tried this with VBox 4.2.0, a Linux host and a Linux guest. Also: If the guest is ignoring the power button event then this should not influence the VM process in any way.

Please attach a VBox.log file from such a VM session where you had to kill the VM process with 'kill -9'.

Changed 19 months ago by ErikH

comment:2 Changed 19 months ago by ErikH

Further info:

  • Apparently I was mistaken on the VBox version: we're still on 4.0.14
  • Closing ipython released my virtual machine and cleaned it up nicely on my laptop
  • The server (from where the attached logfile was taken), still has the problem. The code there calls waitForCompletion on the progress object returned by console.powerDown(). In case of problems (try/finally), it calls session.unlockMachine()

comment:3 Changed 19 months ago by frank

  • Version changed from VirtualBox 4.2.0 to VirtualBox 4.0.14

I cannot reproduce this problem either with VBox 4.0.14. I use vboxshell.py to first press the virtual power button and then to power-down the VM.

comment:4 Changed 19 months ago by ErikH

It doesn't seem to happen to all of our VMs, but we just had another one with this problem. Is there anything else I can provide that would help with pinpointing the problem? More logs / lshw output / lspci output / ... ?

comment:5 Changed 19 months ago by ErikH

We've started seeing the issue with only a hard stop as well now (I'll attach 38660.VBox.log from a machine which went into this state without a prior soft stop).

It looks like, when this happens, it always happens to a machine created from the same image, although it doesn't happen to all machines created from this image.

ps aux output shows that there are two VBoxXPCOMIPCD processes - could this be related?

# ps aux|grep -i vbox
root       356  4.1  5.3 7240832 7069228 ?     SNl  Oct07 161:28 /opt/VirtualBox/VBoxHeadless --comment 38547 --startvm 11a78e00-b042-439e-8128-6ed10a553386 --vrde config
root      1274  9.7  5.5 7563152 7317884 ?     SNl  Oct07 377:31 /opt/VirtualBox/VBoxHeadless --comment 38525 --startvm da17de49-60d4-4353-b3a0-e0b147178205 --vrde config
root      5972  1.7  0.0  40476  3756 ?        S    Oct06 110:43 /opt/VirtualBox/VBoxXPCOMIPCD
root      7715  3.6  2.7 3882752 3631600 ?     SNl  Oct08 130:07 /opt/VirtualBox/VBoxHeadless --comment 36569 --startvm 4ae09e91-20cc-4c0a-9bdb-732626dad908 --vrde config
root      8447  3.6  2.7 3831880 3580608 ?     SNl  Oct08 127:30 /opt/VirtualBox/VBoxHeadless --comment 36498 --startvm daf4203a-f5f2-4edf-bfe6-2fa1b7d1816a --vrde config
root      9182  3.7  2.7 3780932 3598616 ?     SNl  Oct08 133:16 /opt/VirtualBox/VBoxHeadless --comment 31622 --startvm 79cf32ff-02d8-44be-b07e-972f4f30de61 --vrde config
root      9926  3.5  4.0 5575512 5330952 ?     SNl  Oct08 123:37 /opt/VirtualBox/VBoxHeadless --comment 31159 --startvm 492e62d7-52b1-4874-b120-0545ee3030d3 --vrde config
root     10296  3.0  5.3 7320764 7072764 ?     SNl  Oct08  94:14 /opt/VirtualBox/VBoxHeadless --comment 31839 --startvm da715602-eae6-4eef-a041-499fca13e11f --vrde config
root     10662  3.6  1.3 2015496 1833020 ?     SNl  Oct08 128:40 /opt/VirtualBox/VBoxHeadless --comment 36539 --startvm 60b2530a-1f07-47ed-b37b-df954cda2a3b --vrde config
root     13516  1.4  5.3 7305776 7126332 ?     Sl   Oct06  92:33 /opt/VirtualBox/VBoxHeadless --comment 36244 --startvm 27f40d12-bda3-4f86-b2b7-c806b9b40f5f --vrde config
root     14374  1.3  2.7 3806056 3619132 ?     Sl   Oct06  85:51 /opt/VirtualBox/VBoxHeadless --comment 36468 --startvm 4f883c43-7d50-40ad-8ad9-b09a4e716549 --vrde config
root     19351  7.0  0.0 187332 11164 ?        SNl  Oct07 275:41 /opt/VirtualBox/VBoxSVC --auto-shutdown
root     19355  4.4  0.0  40476  3756 ?        SN   Oct07 174:08 /opt/VirtualBox/VBoxXPCOMIPCD
root     20298  5.6  1.4 2067064 1878416 ?     SNl  Oct07 221:51 /opt/VirtualBox/VBoxHeadless --comment 37931 --startvm 855bf7a0-9714-48c0-aa05-cbd1da7ca8d8 --vrde config
<snip>

I'll keep looking for a surefire way to reproduce this, but if you have any suggestions for things I might try, or need any additional information that would help you to investigate this, please let me know.

What's more: we don't dare to use kill -9 on the VBox machine process because we've had cases where all other VBox machines on the node would die together with the one being killed - this leaves us without a workaround, so any suggestions on that front would be appreciated as well.

Changed 19 months ago by ErikH

comment:6 Changed 18 months ago by ErikH

In trying to reproduce, I managed to kill all virtual machines by using kill -9 on one of them.

The situation:

One physical machine with 10 virtual machines, all of which are running. kill -9 on one machine all vmachines go down

I have a script which selects a random VMachine on every run and tries:

  • soft stop (powerbutton)
  • If this fails: hard stop
  • Start the machine again
  • Kill -9 the process
  • Check whether all machines except the killed one are still running (this was not the case)
  • Start the machine again

Attached you'll find a tarball with:

  • Logs of all the vmachines at this point (some are aborted, some powered off)
  • VboxManage showvminfo for all vmachines

Running vbox-related processes (full output - nothing cut):

ps aux |grep -i box
root      8371  0.0  0.0  40476  3704 ?        S    Oct11   0:23 /opt/VirtualBox/VBoxXPCOMIPCD

Additionally: should I open a new ticket for this? Would it help if I somehow got the machine's disk to you?

Changed 18 months ago by ErikH

comment:7 Changed 18 months ago by achimha

Erik,

I can see two issues here:

1) a VM does not poweroff after calling the API

2) when killing a VM process with kill -9, VBoxSVC (the communication daemon) and other VMs crash

Both should be looked at. However, in your scenario, you should never have to use kill -9, the poweroff command is designed to always work. So this should be looked into.

I think you will have to provide your complete code that deals with the VM power off so it can be inspected for correct use of the APIs. Then, you should also take process dumps of the VM process refusing the terminate as well as VBoxSVC. This will tell what is being executed and what the internal state is. Note that in case you are not on a standard Linux distribution, you will have to provide your whole Linux environment in order to analyze dumps.

comment:8 Changed 18 months ago by achimha

ps aux output shows that there are two VBoxXPCOMIPCD processes - could this be related?

That is not right, there should be only one instance per user (so total one). This process is started automatically so I wonder how you can end up with two of them.

I notice you run everything as root. While that is possible, it is not particularly nice.

comment:9 follow-up: ↓ 11 Changed 18 months ago by frank

To me it seems that the VBoxSVC daemon crashes and that will make all other VMs terminate (otherwise they would not be controllable anymore). So we should try to find out why VBoxSVC crashes before investigating why the VM does not poweroff (or this could even be the same problem).

So please activate coredumps, see here and reproduce this problem. Then I can either provide you debugging files for generating a better backtrace or I need access to the box where VirtualBox is installed.

comment:10 Changed 18 months ago by frank

And the hang can be debugged in a similar way by triggering a core dump (see my previous comment) of the hanging process.

comment:11 in reply to: ↑ 9 Changed 18 months ago by frank

Replying to frank:

To me it seems that the VBoxSVC daemon crashes and that will make all other VMs terminate (otherwise they would not be controllable anymore).

This can be seen in the VBox.log file of all your VMs: detected unresponsive VBoxSVC (rc=NS_ERROR_CALL_FAILED) -- this caused a clean shutdown of these VMs.

comment:12 Changed 18 months ago by ErikH

I have, so far, only managed to reproduce the "all machines die after kill -9" issue. I'm attaching a core dump of the VBoxXPCOMIPCD process after this happens (the only remaining virtualbox-related process).

I'm also attaching our code which calls the vbox apis.

You can find the Ubuntu we're using (with some kernel modifications) here:  http://files.incubaid.com/iaas/qpackages/71/cd/71cd27ef769f4ca68f7c03c309a445cc.gz . It's a tgz archive within a gzip archive.

Extra debs that we install can be found here:  http://files.incubaid.com/iaas/qpackages/54/2b/542b59709bc726375bd06964ba2324d8.tgz

I'm still trying to reproduce the hanging VM issue - this time without kill -9 cycles in between.

Changed 18 months ago by ErikH

Coredump of VBoxXPCOMIPCD after all machines died

Changed 18 months ago by ErikH

VBox API calls

comment:13 Changed 18 months ago by achimha

Erik,

I think it would be better to debug the power off hang as that is the first problem you run into. However, if it's easier let's focus on the crash on kill issue -- if that is one, then you have a workable solution.

You need to setup your system to automatically generate core dumps when a process crashes. Depending on how you start VirtualBox, you have to put the right environment settings (ulimit etc., see the docs as pointed out by Frank) at the right place. Once you manage to reproduce the issue, you need to provide core dumps of VBoxSVC and a couple of VBoxHeadless processes.

You can test your core dump setup by doing

kill -SIGSEGV <pid>

From the command line. This simulates a segmentation fault crash and will give you a core dump if your system is configured correctly.

Version 0, edited 18 months ago by achimha (next)

comment:14 Changed 18 months ago by ErikH

Achim,

Agreed - it would be best to solve the poweroff hang, although the kill issue would be a good workaround.

I've added

ulimit -c unlimited
echo -n 1 > /proc/sys/kernel/core_uses_pid
echo -n 1 > /proc/sys/fs/suid_dumpable

to /etc/profile and the vboxdrv startup script and have a clean stop / hard stop script running. Hoping to see something show up before Monday.

comment:15 Changed 18 months ago by achimha

Had a quick look at your Python code. A few remarks:

  • between calling IConsole::PowerButton() and checking whether the guest has reacted to it using IConsole::PowerButtonHandled(), there is no delay. I think you should give the guest some time to react to it, maybe poll for it for 3 seconds?
  • you use IConsole::PowerDown() to terminate the VM. This API is deprecated (and removed in VirtualBox 4.2). You should not use it but go for IConsole::PowerDownAsync() (which is renamed to PowerDown() in 4.2). This API will let you track the progress.

comment:16 Changed 18 months ago by ErikH

I'll make sure to incorporate your remarks into the code, but I'll hold off on doing so until I've reliably reproduced the issue.

I've had a script run over the weekend to try to reproduce the issue, but it's still happily running. A colleague managed to reproduce it by accident though, and we've coredumped the vbox process at that time. You can find the dump here:  http://files.incubaid.com/iaas/virtualbox/core.hung.bz2

comment:17 Changed 18 months ago by frank

Thanks for the core dump! First analysis shows that the power down thread is waiting for the EM thread. The EM thread is executing PDMR3PowerOff() but is waiting for some device to get powered off. This smells like a bug we fixed in a later release, let me check.

comment:18 Changed 18 months ago by frank

Actually, do you still have the corresponding VBox.log file to the core dump you submitted available? I just saw that this case must be slightly different than the original VBox.log file attached to this ticket.

The other problem (VBoxSVC crashes if client is killed with -9) was fixed in VirtualBox 4.0.16.

comment:19 Changed 18 months ago by ErikH

Frank,

I'm not quite sure - the vmachine was destroyed, but I did find a VBox.log file that might have belonged to it. I don't know if you can get the VM name from the core dump, but if it was 21470, then this is the log file you need.

Changed 18 months ago by ErikH

comment:20 Changed 18 months ago by frank

Thanks for the log file. I think this one does indeed match the core dump. So far it looks like some AHCI thread does not properly terminate. We are currently investigating the core dump.

comment:21 Changed 18 months ago by aeichner

Is it possible to get the raw disk vmdk descriptor files of a hung VM? I'm currently investigating the core dump but it would make things a bit easier if I could have a look at the descriptor files.

comment:22 Changed 18 months ago by ErikH

The machine had three disks (boot, data, swap). I'm afraid the actual disks are lost, but the vmdk files were still around on the system.

21470_0.vmdk should be the boot disk.

Changed 18 months ago by ErikH

comment:23 Changed 18 months ago by achimha

For the sake of completeness:

The poweroff hang was due to a misbehavior of one of your storage volumes (iSCSI?). It reports that it does not support asynchronous flush operations so VirtualBox performs synchronous flushes. Such a flush was issued as a response to the shutdown but it never completes (it is the first and only flush command for that volume). The volume has seen little access during VM runtime so it could be the swap partition. When trying to do the hard power off, VirtualBox waits for the flush to complete but due to some problem with the underlying storage system, it never does. In the synchronous flush case, VirtualBox did not handle a hanging flush command. This is now implemented.

The following problem when doing a kill -9 on the VM process was already fixed in the current version 4.0.16.

comment:24 Changed 13 months ago by frank

  • Status changed from new to closed
  • Resolution set to fixed
Note: See TracTickets for help on using tickets.

www.oracle.com
ContactPrivacy policyTerms of Use