VirtualBox

Ticket #3156 (closed defect: fixed)

Opened 5 years ago

Last modified 5 years ago

Linux (Debian) host and conflicting UUIDs (VBoxSVC sync issue)

Reported by: corvus Owned by:
Priority: critical Component: other
Version: VirtualBox 2.1.2 Keywords: semaphore Linux VBoxSVC sync syncronization poweroff
Cc: Guest type: other
Host type: other

Description (last modified by frank)

There seems to be a bug in the synchronization objects on Linux hosts. We have seen it in VBox 1.6, 2.0, 2.1.0 and 2.1.2. It is reproducible on Debian Lenny with kernels 2.6.25-amd64, 2.6.26-amd64 and 2.6.25-i686.

I didn't report the bug for a few months because I had found a workaround by patching UUIDs (see below) and it was hard to pin down where the bug was. Now I have more information, so I am posting the full details.

To reproduce:

  1. Create at least 10 machines by cloning the same VDI file and giving each machine the same settings.
  2. Start the machines in ascending order (in order of creation).
  3. Shut the machines down one by one in descending order. After each shutdown, check whether the state of any other (still running) machine has changed.

For instance, powering off machine number 8 may change the state of machine number 5 to "Aborted". The pairs of conflicting machines stay the same every time you start and stop the machines. Once you have found a conflicting pair, you can reproduce the bug by starting and shutting down just these two machines (keeping the same order).
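The check in step 3 can be sketched as a small shell helper. This is only an illustration: the function names, the VBOXMANAGE and SETTLE_SECONDS variables and the VM names are assumptions, and the --machinereadable spelling follows current VBoxManage and may differ in older releases.

```shell
#!/bin/sh
# Sketch of the check in step 3 (names below are assumptions, not from the ticket).
VBOXMANAGE=${VBOXMANAGE:-VBoxManage}

# vm_state NAME -> prints the VMState value (e.g. running, poweroff, aborted)
vm_state() (
    "$VBOXMANAGE" showvminfo "$1" --machinereadable |
        sed -n 's/^VMState="\(.*\)"$/\1/p'
)

# poweroff_and_check VICTIM OTHER...
# Powers off VICTIM, waits briefly, then reports every OTHER machine that is
# no longer "running". Exit status: 0 = all fine, 1 = a state changed,
# 2 = the poweroff command itself failed.
poweroff_and_check() (
    victim=$1; shift
    "$VBOXMANAGE" controlvm "$victim" poweroff || exit 2
    sleep "${SETTLE_SECONDS:-2}"
    status=0
    for vm in "$@"; do
        state=$(vm_state "$vm")
        if [ "$state" != "running" ]; then
            echo "after poweroff of $victim: $vm is '$state'"
            status=1
        fi
    done
    exit "$status"
)
```

With the machines from the example, `poweroff_and_check test8 test5` would print a line and return 1 if test5 left the running state when test8 was powered off.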

Experimenting with the bug showed that the problem is connected to the machine UUIDs and the semaphores used to synchronize the VirtualBox process with VBoxSVC (see details below). Changing the UUID of one of the conflicting machines makes the problem seem to disappear, but the new UUID may then conflict with another machine in the set.

It looks like there is a semaphore whose ID is generated from the machine's UUID. The hashing function that creates the semaphore ID seems to be the key problem. I believe it is inside the VBoxSVC module, but I haven't found it yet.

Example:

I started machine N5, then machine N8. I powered off machine N8, and machine N5 went into the 'Aborted' state at the same moment. The VirtualBox window for machine N8 disappeared, but the process kept running in the background.

I attached gdb to the VirtualBox process for machine N8 and checked the stack backtrace; you can see it in the attachment. It references the source file src/VBox/Main/SessionImpl.cpp (line 860). It seems machine N8 got stuck at this point:

progress->WaitForCompletion (-1);

I hope this helps!

Attachments

Screenshot.png (48.2 KB) - added by corvus 5 years ago.
GDB stack backtrace of the hanging VirtualBox process

Change History

Changed 5 years ago by corvus

GDB stack backtrace of the hanging VirtualBox process

comment:1 Changed 5 years ago by frank

  • Description modified

comment:2 follow-up: ↓ 3 Changed 5 years ago by dmik

I tried what you suggested on a 2.6.27-11-amd64 system (Ubuntu). I started the VMs with

bash -c "for ((a=1;a<=10;++a)) do ./VBoxManage startvm test\$a; done"

and then stopped them with

bash -c "for ((a=10;a>=1;--a)) do ./VBoxManage controlvm test\$a poweroff; done"

and didn't observe the behavior you describe.

A clash of the semaphore names on the VirtualBox side is impossible because the VM's full XML file path is used as the SYSV IPC semaphore name to guarantee its uniqueness. My guess is that the problem is specific to your installation.

I recommend building a debug version of VirtualBox and collecting the relevant logs to better understand what's going on. This can be done by running the clients in the following environment:

export VBOX_LOG=main.e.l.f+gui.e.l.f
export VBOX_LOG_FLAGS="time tid thread"
export VBOX_LOG_DEST=dir=/path/to/all/logs

Once you've got the logs of your crash, you can zip them all and attach them here.
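The logging setup above can be wrapped in a small helper so that each run gets its own log directory and archive. This is a sketch only: the function name run_with_vbox_logs and the choice of tar are assumptions, and the three VBOX_LOG* values are simply the ones quoted above (they are meant for a debug build).

```shell
#!/bin/sh
# run_with_vbox_logs LOGDIR COMMAND [ARGS...]
# Runs COMMAND with the logging environment from this comment, then packs
# LOGDIR into LOGDIR.tar.gz. The helper name is hypothetical.
run_with_vbox_logs() (
    logdir=$1; shift
    mkdir -p "$logdir" || exit 1

    # Values taken verbatim from the comment above; only the path is variable.
    export VBOX_LOG="main.e.l.f+gui.e.l.f"
    export VBOX_LOG_FLAGS="time tid thread"
    export VBOX_LOG_DEST="dir=$logdir"

    "$@"
    status=$?

    # Archive the whole log directory next to it, ready to attach.
    tar -czf "$logdir.tar.gz" -C "$(dirname "$logdir")" "$(basename "$logdir")"
    exit "$status"
)
```

For example, `run_with_vbox_logs /tmp/vbox-logs ./VBoxManage startvm test8` would run one client with logging enabled and leave /tmp/vbox-logs.tar.gz behind. Because the function body runs in a subshell, the exported variables do not leak into the calling shell.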

comment:3 in reply to: ↑ 2 Changed 5 years ago by corvus

Replying to dmik:

> I tried what you suggested on a 2.6.27-11-amd64 system (Ubuntu):

I have run additional tests on Ubuntu Gutsy 2.6.22-14-amd64 and Gentoo 2.6.27-8-i686. The bug doesn't show up there. We have also updated the kernel (to 2.6.26-1-amd64) on the Debian machine that reproduces this bug, and the bug is still there.

> A clash of the semaphore names on the VirtualBox side is impossible because the VM's full XML file path is used as the SYSV IPC semaphore name to guarantee its uniqueness. My guess is that the problem is specific to your installation.

I know, I learned it from the source code. :) But it somehow happens on Debian. During my analysis I found a small misuse of the libc function ftok. The second parameter, proj_id, must be nonzero according to the man page, but the sources pass 0: see src/VBox/Main/SessionImpl.cpp, line 961.

This misuse isn't critical (I checked the libc implementation of ftok) and should not affect this bug. Setting a nonzero value didn't help, but I would recommend changing the code for future compatibility with libc.
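For reference, here is a minimal sketch of how an ftok()-style key is composed, assuming the layout used by glibc: the low 16 bits of the inode number, the low 8 bits of the device number and the low 8 bits of proj_id. The helper names ftok_compose and ftok_key and the sample inode and device numbers are hypothetical, and GNU stat is assumed. It only illustrates that such a key does not encode the full path; it makes no claim about the cause of this ticket.

```shell
#!/bin/sh
# Assumed glibc layout:
#   key = (st_ino & 0xffff) | ((st_dev & 0xff) << 16) | ((proj_id & 0xff) << 24)

# ftok_compose INODE DEVICE PROJ_ID -> prints the key as 0xXXXXXXXX
ftok_compose() (
    ino=$1 dev=$2 proj=$3
    key=$(( ($ino & 0xffff) | (($dev & 0xff) << 16) | (($proj & 0xff) << 24) ))
    printf '0x%08x\n' "$key"
)

# ftok_key PATH [PROJ_ID] -> key for an existing file (requires GNU stat)
ftok_key() (
    path=$1 proj=${2:-0}
    info=$(stat -c '%i %d' -- "$path") || exit 1
    set -- $info
    ftok_compose "$1" "$2" "$proj"
)

# Two made-up inode numbers that differ only above bit 15, on the same device:
ftok_compose 0x12345 0x801 0   # prints 0x00012345
ftok_compose 0x22345 0x801 0   # prints 0x00012345 (same key)
# A nonzero proj_id only changes the top byte:
ftok_compose 0x12345 0x801 1   # prints 0x01012345
```

Running ftok_key on the settings files of two suspect machines and comparing the results, or comparing them with the key column of `ipcs -s`, would show whether their keys coincide on a given filesystem.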

> I recommend building a debug version of VirtualBox and collecting the relevant logs to better understand what's going on.

I have already tried, but the debug build of 2.1.2 simply doesn't restore the machine from a saved state. VirtualBox crashes right after restoring the machine state.

> This can be done by running the clients in the following environment:
>
> export VBOX_LOG=main.e.l.f+gui.e.l.f
> export VBOX_LOG_FLAGS="time tid thread"
> export VBOX_LOG_DEST=dir=/path/to/all/logs

Thank you for the hints! I was wondering where I can find the full list of components/groups and available flags for debug logging. Is there a common list?

> Once you've got the logs of your crash, you can zip them all and attach them here.

I will do so as soon as I resolve the crash of the debug version when restoring a machine from a saved state.

comment:4 Changed 5 years ago by dmik

Thank you for noticing the ftok() misuse.

The full list of all predefined logging groups is defined in include/iprt/log.h and in include/VBox/log.h.

comment:5 Changed 5 years ago by corvus

FYI: we have finally moved to new hardware and the bug no longer reproduces. I assume we had filesystem problems that somehow affected VBox stability.

I guess you can close the bug. Sorry for disturbing you.

comment:6 Changed 5 years ago by sandervl73

  • Status changed from new to closed
  • Resolution set to fixed