VirtualBox

Opened 8 years ago

Last modified 8 years ago

#16540 new defect

Deleting older snapshot reproducibly, deterministically corrupts newer snapshot

Reported by: Barry K. Nathan Owned by:
Component: virtual disk Version: VirtualBox 5.1.14
Keywords: Cc:
Guest type: all Host type: Mac OS X

Description

On a virtual machine with a series of snapshots, deleting one of the older snapshots will, under some circumstances, cause changes to the contents of the virtual disk on newer snapshots. (That may be an overly simplified description, but I believe the reduced test case will clear up the details.)

The real-world manifestation of this bug is that, on a Win10 1607 VM of mine, deleting one of the older snapshots causes the Start menu to malfunction (lose the Settings button and all tiles) on newer snapshots.

I created a reduced test case by deleting as many snapshots as I could without the corruption happening, then creating and using a quick-and-dirty program to replace each non-empty (i.e. not all zeroes) 1MB block in each VDI file with a repeated byte value. (By the way, if I do this on all blocks rather than all non-zero blocks, the bug is no longer reproducible.)

Instructions to reproduce:

  1. Extract the attached xz-compressed tar archive (it will expand to something like 33GB) and add the resulting VM to VirtualBox.
  2. There are 3 snapshots: Snapshot 1, Snapshot 5, and Snapshot 11. Get the md5sum of Snapshot 11 by doing one of the following:
    1. VBoxManage clonemedium 7e718487-9538-4552-b223-69ce21b421d0 good.raw --format RAW then something like openssl md5 good.raw
    2. Boot into a simple Linux CD/DVD (I used the Slackware 14.2 installer DVD), run md5sum /dev/sda, then power off the VM.
  3. Delete Snapshot 5. (By the way, if you delete Snapshot 1 before deleting Snapshot 5, the corruption does not occur.)
  4. Check the md5sum of Snapshot 11 again by doing one of the following:
    1. VBoxManage clonemedium 750ff504-0292-427f-b108-f4c9b78f88df bad.raw --format RAW then something like openssl md5 bad.raw
    2. Repeat step 2(b).

Expected results: md5sum should be 94157a72c97d2d61e288748ec4029b6a both times.

Actual results: md5sum is originally 94157a72c97d2d61e288748ec4029b6a but becomes 75dc8ef4596478dee7abf47780471c98. An examination of the difference between good.raw and bad.raw shows that some sectors that are filled with zeroes in good.raw are filled with 0x2A ('*') in bad.raw (the pattern that I used to fill the non-zero blocks in the base VDI).

I think what's happening is, data that was originally non-zero in the base VDI and that was overwritten with zeroes later on (probably in Snapshot 5) is being incorrectly reverted back to the older non-zero data in Snapshot 11 after Snapshot 5 is deleted. (This is the same pattern of corruption that I was seeing with the original Windows 10 VM. I did not dig into why this particular pattern of corruption causes the Start menu to malfunction.)

Emotionally, I'm tempted to set the priority of this bug to "blocker" because I'm not happy about data corruption. However, I'm able to reproduce this bug going back at least to VirtualBox 4.3.4 (the oldest that I've tested so far), so this bug has been around for a while and I'm guessing that either it doesn't happen often in general use or it usually has insignificant consequences.

So far I have only tested this on a Mac OS X host -- I have not yet tested a Windows or Linux host.

Attachments (4)

Win10-Pro-x64-VM1.tar.xz.xz (72.9 KB ) - added by Barry K. Nathan 8 years ago.
Tar archive of reduced test case VM, compressed *twice* with xz
test2.tar.xz.xz (198.7 KB ) - added by Barry K. Nathan 8 years ago.
Modified test case (still 30+GB but easier/quicker to see problem)
test3.tar.xz (223.0 KB ) - added by Barry K. Nathan 8 years ago.
like test2, but now only needs 400MB of disk space
test4.flp.xz (266.3 KB ) - added by Barry K. Nathan 8 years ago.
floppy image with FreeDOS, shell scripts, etc. for new test case

Download all attachments as: .zip

Change History (20)

comment:1 by Barry K. Nathan, 8 years ago

Hmm, the attachment size limit is 512KB but my attachment was going to be closer to 5 or 6 MB. I'll either post it up somewhere or do something to see if I can reduce the size further.

Version 0, edited 8 years ago by Barry K. Nathan (next)

by Barry K. Nathan, 8 years ago

Attachment: Win10-Pro-x64-VM1.tar.xz.xz added

Tar archive of reduced test case VM, compressed *twice* with xz

comment:2 by Barry K. Nathan, 8 years ago

A few miscellaneous notes:

In the past I've needed to use a second round of gzip compression on highly repetitive data. That's how I got the idea to try a second round of xz.

I know that MD5 is broken, but the Slackware installer DVD's Busybox only has md5sum, not sha1sum or sha256sum, and MD5 is sufficient in this case.

By the way, if you need log files, I think those are also present in the tar archive. If there's any problem with those log files and you need me to provide more, just let me know.

comment:3 by Socratis, 8 years ago

I'm not sure that you're following the correct "mental" process for the snapshots. Snapshots are NOT full HD images, they contain the modified sectors of the virtual HD in between the snapshots. Here's what you got so far:

Snapshot 1
  |
  +- Snapshot 5
       |
       +- Snapshot 11
            |
            +- Current state

This is a simplified version of how snapshots work in your case:

      HD state 1                   Snapshot 1
A---AAA--A-A-AAAA-------    A---AAA--A-A-AAAA-------
AAA----AAA--------------    AAA----AAA--------------
------------------------    ------------------------
------------------------    ------------------------
------------------------    ------------------------

      HD state 5                   Snapshot 5
A---AAA--A-A-AAAA-------    ------------------------
ABA-B--AABBBBB----------    -B--B----BBBBB----------
--------------B----B----    --------------B----B----
------------------------    ------------------------
------------------------    ------------------------

      HD state 11                   Snapshot 11
A-C-AAA--A-A-AAAA--CCC--    --C----------------CCC--
ABA-B--AABCCBB----------    -----------CC-----------
--CCC---------B----B----    --CCC-------------------
------------------------    ------------------------
------------------------    ------------------------

   HD state *NOW*                  Current state
A-C-AAA--A-A-AAAA--CCC--    ------------------------
ABA-B--AABCCBB----------    ------------------------
--CCC---------B----B----    ------------------------
------------------------    ------------------------
------------------------    ------------------------
  • "Current state" incorporates the differences in the HD sectors between 11 and now. There are none, that's why the "Current state" does not end with a "(changed)".
  • "Snapshot 11" incorporates the differences in the HD sectors made after Snapshot 5.
  • "Snapshot 5" incorporates the differences in the HD sectors after Snapshot 1, and finally
  • "Snapshot 1" incorporates the differences between an empty drive and when you took Snapshot 1.

So, if you're at current state, you need all of 3 snapshots to correctly depict the state of the HD. If you revert to state 5 you need only the first two ones, etc.

If you delete 5, 11 is definitely going to change, because now 11 will hold the differences between 1 and 11, not between 1 and 5. Even the simple action of booting and shutting down the VM will definitely alter the structure of the virtual drive, even if a single temp file is created/altered.

If on the other hand you delete 1, the changes are merged into 5. Now 5 is going to be different. But, 11 is NOT going to change, since it contains the differences between 5 and 11. Since the state of 5 did NOT change, then the differences between 5 and 11 are identical and simply deleting snapshot 1 is not going to "trigger the bug".

Here's an ASCII art depiction of what happens in your case when you delete Snapshot 5. Your Snapshot 11 would become:

      HD state 11                   Snapshot 11
A-C-AAA--A-A-AAAA--CCC--    --C----------------CCC--
ABA-B--AABCCBB----------    -B--B----BCCBB----------
--CCC---------B----B----    --CCC---------B----B----
------------------------    ------------------------
------------------------    ------------------------

which is the difference between Snapshot 1 and Snapshot 11, and yes, of course the checksum would be different.


Bottom line, I don't believe this is a bug, but expected behavior and possibly a misunderstanding of how the snapshot feature works.

I hope I understood correctly your description and I'm not completely off the target.

PS. Did you really have to have all 33 GB? Couldn't you do a FreeDOS client? Much smaller ;) Kudos on the other hand for the double-compression. It's a wonder how a 75 KB file can expand to 33 GB! Thankfully I had that much available space on another HD. Usually I don't... ;)

comment:4 by Barry K. Nathan, 8 years ago

Maybe I did a bad job of explaining.

Deleting Snapshot 5 is going to merge it into Snapshot 11, of course. The bug here is that some of the changes contained in Snapshot 5 are getting lost in the process, causing HD state 11 to become corrupted.

I'm not talking about the checksum of Snapshot 11 on its own -- I'm talking about the checksum of HD state 11 as a whole.

I'll try to modify your ASCII diagrams to show what's going on:

HD state 11 before deleting Snapshot 5:
A-C-AAA--A-A-AAAA--CCC--    
ABA-B--AABCCBB----------    
--CCC---------B----B----    
------------------------    
------------------------    

HD state 11 after deleting Snapshot 5
(if the 1st and 3rd 'B' blocks on line 2 were all zeroes):
A-C-AAA--A-A-AAAA--CCC--    
AAA-B--AAACCBB----------    
--CCC---------B----B----    
------------------------    
------------------------

B blocks from Snapshot 5 which are not all zero are coming through OK. B blocks from Snapshot 5 which are all zero are being mistakenly dropped (so the non-zero A blocks come through to HD state 11) when Snapshot 5 is deleted. (I don't know if this is happening to all of the all-zero B blocks or only some of them. I guess I should look into that.)

The idea of booting into the Slackware installer DVD is to make sure that no changes whatsoever are made to the drive's contents for testing purposes (it starts off really simple and doesn't try to create a swap file or mount the hard drive's filesystem or anything as long as you don't initiate the installer itself).

I'll see if I can come up with a smaller, clearer test case.

by Barry K. Nathan, 8 years ago

Attachment: test2.tar.xz.xz added

Modified test case (still 30+GB but easier/quicker to see problem)

comment:5 by Barry K. Nathan, 8 years ago

My initial attempts to create a smaller test case have failed so far, but I did create a test case that shows the problem more clearly. (See test2.tar.xz.xz attachment.) This one has a new snapshot, Snapshot 12. Compared to Snapshot 11, it partitions the disk (in a manner that makes sure the bug hits) and adds FreeDOS, md5sum.exe, a file filled with zeroes, and an autoexec.bat that helps show what's going on. (I know this isn't what you had in mind when you mentioned FreeDOS, but it's still an improvement over my last test case.)

Steps to reproduce with this one: (EDIT: I've now noticed that the corruption happens whether the VM is powered on or off when the snapshot is being deleted, so I've updated the instructions accordingly):

  1. Extract the archive and add the VM to VirtualBox.
  2. Boot it up and look at the output on the screen.
  3. (Optionally shut the machine down, then) Delete Snapshot 5.
  4. Boot it again and look at the output again, or just run "autoexec" if you didn't shut it down in step 3.

Deleting Snapshot 5 changes the contents of a file that was added after I took Snapshot 11 but before I took Snapshot 12. (Specifically, a 1MB chunk of zeroes gets replaced by '*'.) That behavior seems incorrect to me.

I'll continue trying to make a smaller test case.

Last edited 8 years ago by Barry K. Nathan (previous) (diff)

by Barry K. Nathan, 8 years ago

Attachment: test3.tar.xz added

like test2, but now only needs 400MB of disk space

comment:6 by Barry K. Nathan, 8 years ago

Later on I'll see if I can shrink it further, but 400MB is already a huge improvement over 30GB :)

comment:7 by Socratis, 8 years ago

OK, I think I know what's going on. The smaller size example definitely helped.

What you're looking at is not corrupt/different contents of the snapshot, it's the different headers of the snapshot. Let me elaborate with an example of the VM that you sent me, the "test3.tar.xz" one.

<HardDisk uuid="{aae8f202...           <- Snapshot 1
  <HardDisk uuid="{750ff504...         <- Snapshot 5
    <HardDisk uuid="{7e718487...       <- Snapshot 11
      <HardDisk uuid="{9b2c92b8...     <- Snapshot 12
        <HardDisk uuid="{b3507460...   <- Current state

Now, assume you delete snap5, per the instructions. That should merge the changes to snap11, leaving snap12 untouched (the one you claim gets corrupted). So, now we have:

<HardDisk uuid="{aae8f202...           <- Snapshot 1, identical via "cmp".
  <HardDisk uuid="{750ff504...         <- Snapshot 11, different, of course.
    <HardDisk uuid="{9b2c92b8...       <- Snapshot 12, more on this later.
      <HardDisk uuid="{18a4869b...     <- Current state

Now, here's the deal. Notice that the snap11 has a different UUID, the one that used to be the one for snap5. Reason being (I imagine) that the child UUID of snap1 has to remain the same. Therefore the only viable option is to change the UUID of snap11 to that of the former snap5. And to of course propagate the changes down the chain.

Logically, the parent UUID of snap12 has to change to reflect the changed state of affairs. Which can be also verified, by trying the "VBoxManage internalcommands dumphdinfo \{9b2c92b8-f174-474e-b85f-cb2445f06fc0\}.vdi":

Dumping VD image "old/{9b2c92b8-f174-474e-b85f-cb2445f06fc0}.vdi" (Backend=VDI)
Dumping VDI image "old/{9b2c92b8-f174-474e-b85f-cb2445f06fc0}.vdi" mode=r/o uOpenFlags=9 File=0x007fe1b2602f80
Header: Version=00010001 Type=4 Flags=0 Size=42949672960
Header: cbBlock=1048576 cbBlockExtra=0 cBlocks=40960 cBlocksAllocated=3
Header: offBlocks=1048576 offData=2097152
Header: Geometry: C/H/S=1024/255/63 cbSector=512
Header: uuidCreation={9b2c92b8-f174-474e-b85f-cb2445f06fc0}
Header: uuidModification={7dc217a9-f56b-4015-b8c6-4a569f2968e8}
Header: uuidParent={7e718487-9538-4552-b223-69ce21b421d0}
Header: uuidParentModification={355356a2-a30d-4587-b1ba-616d67c13cbb}
Image:  fFlags=00020000 offStartBlocks=1048576 offStartData=2097152
Image:  uBlockMask=000FFFFF cbTotalBlockData=1048576 uShiftOffset2Index=20 offStartBlockData=0

I'll spare you the details and post the single different line from the same command on the new snap12:

Header: uuidParent={750ff504-0292-427f-b108-f4c9b78f88df}

So, where's that information written in the {9b2c92b8..}.vdi? In its header of course. Taking a look at the article in the forums All about VDIs you'll see that the header is 512 bytes of data. The differences are not even that "widespread". Trying a "cmp -i 440 old.vdi new.vdi" gives no difference whatsoever in the snapshot contents. The "-i 440" basically tells the compare to skip the first 440 bytes (yes, that's where the differences end).

So, if you do a checksum of the two, they will be different. But if you do a comparison, skipping the header they seem to be identical. As they should.

comment:8 by Barry K. Nathan, 8 years ago

[EDIT: Perhaps the next paragraph could be phrased better, but my point is that, since my later test cases are actually bootable and show the problem happening within the VM, which is my real problem, I'm more concerned about getting to the bottom of that.]

Please don't bother with my original steps to reproduce (which were too confusing, I admit) and please try my new steps-to-reproduce from comment 5 (which involve actually booting the VM and letting autoexec.bat run an md5sum of a file in the VM).

I installed FreeDOS, set up an autoexec.bat, and created a file "test1" inside the VM with all-zeroes, after taking Snapshot 11 and before taking Snapshot 12. The md5sum of that file is b2d1236c286a3c0704224fe4105eca49 before deleting Snapshot 5, but after deleting that snapshot, a 1MB chunk of that 2MB test file inside the VM gets wrecked and the md5sum changes to 300c72b24aa4f33f71673b066c3acf7c. (I typed those MD5s manually into this comment, so it's possible I made typos, but hopefully my point gets across.)

Are you seeing this behavior from my test case?

Last edited 8 years ago by Barry K. Nathan (previous) (diff)

comment:9 by Frank Mehnert, 8 years ago

From the last comment this looks indeed like some guest file corruption. Is this reproducible?

comment:10 by Socratis, 8 years ago

Didn't have the time to try, will do shortly.

comment:11 by Socratis, 8 years ago

I've been trying to find a way to independently verify your findings. It has to be independent, just to make sure that your logic/tests is not flawed. And I've been trying hard to think with what's wrong with the test that I did in comment #7. And I can't find anything wrong with it.

Not that I can find anything wrong with the checksum in your approach, I simply want 1) an independent verification, not simply to repeat exactly what you're doing and 2) to find the source of the discrepancy, be it the header or something else.

Anyway, I tried a different approach. I tried to use CloneVDI to make a clone of the current state and compare it to a clone that I would create after I've deleted the snapshot. But if fails. On several fronts. I put all the snapshots and the base VDI in the same folder. The snapshots again and the messages from the main CloneVDI interface (version 2.5.1):

<HardDisk uuid="{aae8f202...           <- Source file corrupt, -   , -
  <HardDisk uuid="{750ff504...         <- Source file corrupt, -   , -
    <HardDisk uuid="{7e718487...       <- Source file corrupt, -   , -
      <HardDisk uuid="{9b2c92b8...     <- OK,                  40GB, FAT12
        <HardDisk uuid="{b3507460...   <- OK,                  40GB, -

Trying to clone either of the "OK" snapshots results in a

Source is a difference image. Sorry, this tool cannot clone these (for now)

error message, which is not very promising. I'm going to try and see what else I can try, including seeking help...

comment:12 by mpack, 8 years ago

Sorry, the CloneVDI message is both accurate and misleading - the wording is a hangover from before it supported snapshots. A better wording would be "snapshot chain is incomplete". You can verify this by examining the parent UUID links in the VDI headers and making sure you can complete a chain all the way back to the base VDI.

I don't see how this can be related to the original report of corruption.

comment:13 by Barry K. Nathan, 8 years ago

OK, I've re-read comment 7, and I see the misunderstanding there. I'll try my best to phrase it with the proper terminology.

When deleting Snapshot 5, the differencing VDI for Snapshot 11 (which now is supposed to have Snapshot 5's contents merged into it) is left corrupted. This corruption carries through into both HD state 11 and HD state 12. Your diagram, after deleting of Snapshot 5, says "Snapshot 11, different, of course." The bug I'm seeing is that Snapshot 11's VDI is different in the wrong way.

(Specifically, some of the blocks that were overwritten with zeroes in Snapshot 5 are getting dropped when the snapshot is being merged with Snapshot 11, with the result that some of the underlying blocks from the base VDI are coming through to HD state 11 and 12 when those blocks should actually remain zero.)

comment:14 by Barry K. Nathan, 8 years ago

I think I should be able to come up with a better test case for this bug, but I might not get around to it until the weekend or early next week.

comment:15 by Socratis, 8 years ago

If you do, please share the recipe with full details. I was/am in the middle of the analysis, but there are a lot of distractions. I'm trying to see what's going on with the

VBoxManage internalcommands dumphdinfo ...

command for every snapshot and the whole chain. I really didn't like the fact that CloneVDI choked on it. And then, based on Don's comment that the "snapshot chain is incomplete", I'm starting to think that there's something rotten in the state of Denmark, I simply have to follow the "rotting smell"...

comment:16 by Barry K. Nathan, 8 years ago

Hopefully these instructions are clear enough. If anything is not clear, please let me know. I've also made a video demonstration, in case that might help, at: https://youtu.be/RRdSaa9GL-0

Download and decompress the attached floppy disk image (test4.flp, which I will attach in a few minutes). Also download Slackware 14.2 install CD 1 if you do not already have it. For example: http://mirrors.kernel.org/slackware/slackware-iso/slackware-14.2-iso/slackware-14.2-install-d1.iso Non-Slackware Linux ISOs that I tried tended to boot much more slowly or have other disadvantages for this test scenario.

Create a new VM. Choose Ubuntu (32-bit). Optionally reduce the RAM; 256MB was sufficient for me.

Create a virtual hard disk now. VDI, dynamically allocated. Anything 6MB (that’s not a typo) or larger will work. FreeDOS will give some partition-related warnings if the drive is smaller than about 4GB, but as far as I can tell, those warnings are cosmetic.

Go to Settings

Go to System tab -> Motherboard

Change boot order so Floppy is below Hard Disk (so boot order becomes Optical, Hard Disk, Floppy)

Go to Storage tab

Add Floppy controller & drive

Choose test4.flp for the floppy disk image

Choose Slackware 14.2 install DVD or CD for optical drive

Start the VM. At the bootloader prompt, hit Enter once to initiate boot, then once it finishes booting, hit Enter a couple more times until you get a command prompt. (If you get a kernel panic, you probably reduced the RAM too much.)

Enter the following commands:

mount -o ro /dev/fd0 /mnt
/mnt/s1part.sh
/mnt/s1fill.sh

Power off the VM, then take Snapshot 1.

Boot the VM and get to the command prompt again, as before. Then enter the following commands:

mount -o ro /dev/fd0 /mnt
/mnt/s2fill.sh

Power off the VM, then take Snapshot 2.

Boot the VM and get to the command prompt again. Then enter the following commands:

mount -o ro /dev/fd0 /mnt
/mnt/s3fill.sh

Power off the VM, then take Snapshot 3.

Go to Settings and remove the Slackware CD from the virtual optical drive.

Start the VM again, press F12, then press F to boot from the floppy disk. Then run s4setup which will format the partition that was created by s1part.sh, install FreeDOS on it, extract the testing files, and restart the VM.

Observe how autoexec.bat runs, checking the md5sum of the test file (which is 2MB of null bytes). Optionally, run type z and observe that no output appears on the screen.

Optionally power off the VM, and optionally back up the VDI files or the entire VM for your convenience.

Now delete Snapshot 2. Then, start the VM again if you powered it off, or just run autoexec if you did not power it off.

Expected results: The test file’s contents should not change.

Actual results: Now the md5sum check fails, and type z fills the screen with 1 (obviously not null bytes, so the file got corrupted).

by Barry K. Nathan, 8 years ago

Attachment: test4.flp.xz added

floppy image with FreeDOS, shell scripts, etc. for new test case

Note: See TracTickets for help on using tickets.

© 2024 Oracle Support Privacy / Do Not Sell My Info Terms of Use Trademark Policy Automated Access Etiquette