VirtualBox

Opened 4 years ago

Last modified 4 years ago

#19074 new defect

VM Stuck in "offline snapshotting" state

Reported by: Brian of the Pines Owned by:
Component: webservices Version: VirtualBox 6.0.10
Keywords: Cc:
Guest type: Linux Host type: Linux

Description

The active problem is that a VM is stuck in "offline snapshotting" state and I can't do anything with it. I need a way to get it out of that state, hopefully without corrupting the VM along the way.

There is no relevant VM-specific log information. Since it's an offline snapshot, the VM wasn't running at the time, so the most recent log ends with "Machine state changed to 'PoweredOff'".

There appears to be no relevant info in the ~/.config/VirtualBox log files either. Just a bunch of messages about saving config files and the reaper running.

Here's how I got to where I am: I'm working on a Perl+Curses TUI front-end for a headless VirtualBox server. It uses the vboxwebsrv webservice via the SOAP::Lite package to interact with VirtualBox. I am following the API documented in the latest SDK from Oracle (6.0.10).

When my scripts initiate a snapshot of a powered-off machine via IMachine::takeSnapshot, the VM then becomes stuck in "offline snapshotting" state ("stuck" as in it's been in that state for over 2 months as of the time of this writing). This is what the script does:

  1. Get an ISession object via IWebSessionManager::getSessionObject
  2. Lock the session in Shared mode via IMachine::lockMachine
  3. Get a mutable IMachine object with ISession::getMachine
  4. Initiate the snapshot from the mutable IMachine object, getting an IProgress object, with IMachine::takeSnapshot
  5. Start a modal window which periodically queries the IProgress interface for its status until IProgress::complete returns true

As I'm reviewing my code to write this, I note that it's possible that ISession::unlockMachine is being called before the IProgress object reports success. Although the documentation implies that a session remains locked as long as it is busy even after a call to ISession::unlockMachine, I wonder if this may be causing VirtualBox to do something bad resulting in the stuck "offline snapshotting" state.

Environment info:

  • Host OS is Debian Buster 64-bit.
  • Guest OS is Windows 7 Ultimate 32-bit.
  • VBoxHeadless --version gives 6.0.10r132072.

VBox.log does not get anything written to it. I'm happy to do whatever data collection is necessary to troubleshoot this; just let me know what I need to do.

Attachments (2)

19074mcre.tgz (19.1 KB ) - added by Brian of the Pines 4 years ago.
Scripts to test getting a VM stuck in offline snapshotting state
get_stuck.pl (3.9 KB ) - added by Brian of the Pines 4 years ago.
Updated get_stuck.pl with --ignore_completed option

Download all attachments as: .zip

Change History (10)

in reply to:  description comment:1 by Socratis, 4 years ago

Replying to Brian of the Pines:

The active problem is that a VM is stuck in "offline snapshotting" state and I can't do anything with it.

I asked you in the related discussion in the forums: https://forums.virtualbox.org/viewtopic.php?f=34&t=94594, did you try killing the "VBoxSVC" service? And pretty much any "vbox*" and any "virtualb*" process?

And if you have a code snippet (even pseudo-code) that would go a long way for the developers to replicate your situation. The bulleted list in your original post are a really good starting point methinks (who BTW is not a developer/programmer ;) ).

comment:2 by Brian of the Pines, 4 years ago

I've tried killing those processes, but it was a while ago that I was working on any of this (it languished for two months as I had no time to devote to it). It didn't fix the problem, presumably, but I don't remember if anything additional bad happened.

I'm working on the reproducible today.

Also: Guest OS doesn't seem to matter (at least not across Windows versions); I've got another one stuck in the same state by the same process, where the guest OS is Windows 10.

Last edited 4 years ago by Brian of the Pines (previous) (diff)

comment:3 by Klaus Espenlaub, 4 years ago

The cause must certainly be in VBoxSVC, and of course it could be caused by premature session unlocking - because it's something we don't test (our API clients do it after the progress objects signals completion). It would be interesting to get a stack trace of everything in VBoxSVC when things are stuck. Do you use the package from Debian or the one from virtualbox.org? The former one should come with matching debug info (the latter doesn't, but I could provide it to you)...

I would really hope that the guest OS doesn't matter - because in the end creating an offline snapshot is mainly creating one or more empty diff images, and juggling a bit around with the settings.

Also (but this isn't related): in super paranoia mode I'd recommend getting an exclusive lock (not a shared one), because that's what's logically required. However, since I assume that nothing is done through the API with this VM concurrently (especially not before and overlapping with the start of the snapshot) this doesn't make any difference, since the request for a shared lock does pretty much the same thing as an exclusive lock if there is no one holding any session locks for the VM already.

by Brian of the Pines, 4 years ago

Attachment: 19074mcre.tgz added

Scripts to test getting a VM stuck in offline snapshotting state

comment:4 by Brian of the Pines, 4 years ago

Hi Klaus, thank you for engaging so rapidly. I have some news. I've been able to spend most of the day characterizing this issue more carefully. Here's what I've discovered so far. I'll also upload a tarball of scripts which form a MCRE. I've run two tests:

Test #1:
Setup: Freshly-exported base Win10 installation. Powered off state.
What happened:

  1. I ran my test script first without the --early_unlock option (which waits until IProgress::Completed is true before unlocking the session).
  2. IProgress::Completed indicated true immediately, so my script unlocked the session and exited. Machine state reported by vboxmanage is still "powered off".
  3. A snapshot appeared in the snapshots directory.
  4. I immediately tried my script again, this time specifying --early_unlock, causing my script to unlock the session BEFORE checking on the IProgress object.
  5. IProgress::Completed again indicated true immediately, so my script exited.
  6. vboxmanage indicated machine state as "powered off" HOWEVER...
  7. Any attempt to do anything with the VM after that point resulted in an error indicating that it was locked. The VM could not be started or unregistered.
  8. Using systemctl, I stopped both the vboxdrv and vboxweb-service services and waited for VBoxSVC and vboxwebsrv processes to exit (which they did; I did not have to kill them).
  9. While those processes were stopped, vboxmanage unregistervm <uuid> --delete succeeded.

Test #2:
Setup: Restarted VBoxSVC and vboxwebsrv. Imported VM created from export in Test #1.
What happened:

  1. Inside the VM, I did some bad science; I installed guest additions, joined the VM to our domain, and copied a large volume of data onto it (so that a snapshot would have decent size)
  2. Shut down the VM normally within the guest.
  3. Ran my test without --early_unlock.
  4. My test script unlocked and exited immediately because IProgress::Completed returned true immediately.
  5. vboxmanage now reports machine state as "offline snapshotting"
  6. The snapshot path doesn't actually exist
  7. Now the machine appears to be stuck in "offline snapshotting," although it hasn't been very long as I'm writing this.

I have to leave the office now for an appointment, but when I get in tomorrow I'll see if it's still stuck, then use the same approach from Test 1 to see if I can at least unregister the vm.

comment:5 by Brian of the Pines, 4 years ago

Update on Test #2: (aside: I realize I was thinking incorrectly about snapshots when I said "so that a snapshot would have a decent size")

  1. I used systemctl stop vboxdrv and systemctl stop vboxweb-service.
  2. I observed that once VBoxSVC had shut down (which took a minute or two), the stuck VM moved into "powered off" state.
  3. It took a few tries[1], but eventually I got vboxdrv and vboxweb-service restarted.
  4. The VM was then able to be started normally.
  5. Still, no snapshot file was ever created.

So it looks like maybe cleanly stopping and restarting VBoxSVC can un-stick the VM from that state.

I carried on:

  1. I shut down the VM from within the guest
  2. I created the missing snapshots directory
  3. I ran my script again, without --early_unlock. Again, IProgress::Completed returned true immediately.
  4. A snapshot file was created, but the VM got stuck in "offline snapshotting" state again.
  5. I stopped VBoxSVC using systemctl as before
  6. Now vboxmanage showvminfo gives this error:
    $ vboxmanage showvminfo $WIN10
    Name:            <inaccessible!>
    UUID:            beea55d0-6ffb-4fe8-beef-f53293108408
    Config file:     /var/VirtualBox/VirtualBox VMs/Windows 10/Windows 10.vbox
    Access error details:
    VBoxManage: error: Hard disk '/var/VirtualBox/VirtualBox VMs/Windows 10/Windows_10-disk001.vmdk' with UUID {4687eab9-1a50-4612-afe1-bcfa39e35fc8} cannot be directly attached to the virtual machine 'Windows 10' ('/var/VirtualBox/VirtualBox VMs/Windows 10/Windows 10.vbox') because it has 1 differencing child hard disks
    VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component MachineWrap, interface IMachine
    
  7. The error persists after VBoxSVC is restarted
  8. The VM is now completely inaccessible

I am continuing to run tests to characterize the behavior further. It seems odd to me that IProgress::Completed always returns true immediately; I am modifying my script to optionally check the Percent field instead to determine whether it's finished. I'll report back with my findings.

[1] - It seems that some aspects of getting the VirtualBox services up and running take some time, and the vboxweb-service.sh script just bails instead of waiting for whatever it is to take place. If I wait long enough after starting vboxdrv, vboxweb-service will start properly (vboxwebsrv appears in the process list); however, if I don't wait long enough, systemd thinks it started, but vboxwebsrv doesn't appear in the process list.

comment:6 by Brian of the Pines, 4 years ago

After further testing, I have zeroed in on the circumstances that cause the VM to get stuck in snapshotting state, and it's definitely due to the session being unlocked before the process is completed.

The cause, as far as it affects my code, appears to be that IProgress::Completed always returns true immediately, regardless of whether the process is actually done or not. Using my modified MCRE script which monitors IProgress::Percent instead of IProgress::Completed resulted in a successful snapshot. Using both that modification plus the --early_unlock option (which releases the session before starting to monitor progress) showed the "Creating differencing hard disks for..." process getting stuck at 33%. I suppose that particular percentage may be random, depending on exactly when the session was unlocked.

So, I believe there are two things at play here; two bad behaviors of the webservice:

Bad Behavior 1: IProgress::Completed returns true before the process is complete
Bad Behavior 2: ISession::unlockMachine has immediate, potentially disastrous effects, if a process using that session is in progress. API documentation[1] suggests that the session should in fact remain locked until the process completes, regardless of the call to Unlock (that is, the Unlock call should take effect only once any pending operations are completed).

Clearly I have a workaround for my purposes (monitoring IProgress::Percent instead of IProgress::Completed). I would still be interested in gathering whatever useful information I can in order to troubleshoot the underlying causes. Just let me know what I can to be helpful.

[1] see ISession::unlockMachine...note that the interpretation of the expected behavior might be different depending on whether the VM is in the running state.

Last edited 4 years ago by Brian of the Pines (previous) (diff)

by Brian of the Pines, 4 years ago

Attachment: get_stuck.pl added

Updated get_stuck.pl with --ignore_completed option

comment:7 by Klaus Espenlaub, 4 years ago

Just had a little time to dig further... but effectively you already confirmed the high-level symptoms of a deadlock. Which I could reproduce very easily - it's a self-deadlock during destruction of some objects related to the SessionMachine object which is invalidated by your early unlocking. A self-deadlock is a lock issue within a single thread, so it's guaranteed to happen.

So for now this early unlocking is a "don't do this" situation (and I'm not sure if any API code is really meant to handle such early unlocking, because it invalidates the "SessionMachine" object, which is still in use). To me it sounds like a misinterpretation of the API docs which states that unlocking may not happen instantly (which is true, this covers both the case when someone else also has a lock and the little asynchronous clean-up work which needs to complete before the session is truly gone). This doesn't mean that the API operations keep their own lock during some operation. I suspect the object invalidation is the cause of another assertion, which I need to investigate.

Currently the most likely result of this bug report will be that the early unlock will be detected, and the TakeSnapshot operation (the async part, i.e. you'll need to check the Progress object) will end with an error, not taking the snapshot. So you better keep your locks long enough...

I don't really follow why you're surprised that taking an offline snapshot is really fast. It's not doing much: creates an empty diff image (usually a few MBytes, something which will instantly go to the filesystem cache), and doing some juggling around with settings which ultimately results in writing out a new config file, which usually is smaller than 100KBytes. Can easily finish in the asynchronous worker thread by the time the client received the progress object reference and got around to checking for completion.

During my testing with a C++ API client I saw that the completion was signaled when the percentage was at about 30% - but the operation actually failed, because as indicated by the mentioned assertion, a method of the invalidated object would return the wrong path, making the "take snapshot" operation attempt to create a diff image in the root directory. Not going to happen on any sensible OS.

Online snapshots are significantly more work, because they involve writing out the VM state which is today usually some GBytes.

P.S.: I'm surprised that you're using Perl for webservice programming. It's pretty tedious and I've removed the perl bindings from the VirtualBox SDK ages ago because it's such a waste of CPU cycles to even get SOAP::Lite to process the .wsdl files. For Python2 we have very comfortable bindings (work transparently with webservice and local API). Unfortunately we didn't find time yet to work on webservice support for Python3, because the ecosystem changed completely and for a long time there was no sensible SOAP support at all.

comment:8 by Brian of the Pines, 4 years ago

Just thought to come back and check on this. Thank you for your detailed response!

I can live with the "don't do that" solution; I've already modified my code as necessary to avoid creating the situation.

As far as how the bug report is resolved, as long as the VM doesn't get into an unrecoverable deadlock state, it seems reasonable to me to return some sort of error and require the client to hold the lock long enough, rather than developing some extra layer of locking within the service.

I had a poor understanding of the snapshot process earlier; I think you've helped me get a better grasp of what's involved and what to expect regarding the time it takes.

I'm using Perl because I'm comfortable with it, didn't need the runtime efficiency of a C++ project, and haven't learned Python yet.

Note: See TracTickets for help on using tickets.

© 2023 Oracle
ContactPrivacy policyTerms of Use