VirtualBox

Ticket #2247 (closed defect: fixed)

Opened 6 years ago

Last modified 4 years ago

VM Lock-Up after 300 seconds host uptime => Fixed in 2.0.4

Reported by: kpreslan
Owned by:
Priority: major
Component: other
Version: VirtualBox 2.0.2
Keywords:
Cc:
Guest type: Linux
Host type: Linux

Description (last modified by frank)

I'm having a weird problem with VirtualBox 2. I'm trying to use it to break my one server into two parts (the native server and a virtual machine) and split services across the two. I'm using bridged HIF networking, so the setup should look like two different machines.

The problem I'm seeing is that the virtual machine works fine for a while (2-3 minutes), and then it becomes unresponsive. The VM's CPU usage goes way down and the VM doesn't reliably talk to the network or through VRDP. In the end I have to do a "VBoxManage controlvm testvbox1 poweroff" to get rid of the VM.

The weird thing is the problem only occurs the first time I run the VM after I reboot the physical machine. After I see the problem and stop the VM, I can restart it and it works fine from then on -- until I reboot the host again.

Any suggestions?

My setup:

  • VirtualBox 2.0
  • A generic Pentium 4 box
  • Ubuntu Hardy Server as both the host and guest
  • No X servers on either the host or guest
  • Bridged HIF networking (both the host and guest are using static IPs)
  • I'm launching the virtual machine with VBoxManage startvm testvbox1 -type vrdp.

More details:

  • I've tried "nohz=off" on both the host and guest and it doesn't help.
  • When I see the problem, the idle VM's CPU usage drops from its usual 0.3% to 2% load to not showing up in "top" at all on most refreshes. So, the VM is definitely doing less.
  • The problem happens whenever I run the VM the first time after boot. Whether I run it as part of the boot scripts or wait a few minutes and run it by hand makes no difference. The second run always works right.
  • The problem happens whether I run the headless frontend or the standard GUI frontend.
  • This is weird: During the 2-3 minutes of "good" time before the VM goes all wonky, I ssh (with X forwarding) from my laptop into the VM and run an xterm so I can type commands to the VM and see what's going on. When the VM starts misbehaving, I can still type commands to the VM, but the output stalls. For example, if I type "ps ax" and hit enter, it prints a few lines of the result and then stops. If I then move the mouse into and out of the xterm, it prints a few more lines. It seems like the focus events that the laptop's X server sends to the VM cause it to wake up and do work. It's like the VM is dropping interrupts or something.
  • A VRDP connection works fine for the "good" 2-3 minutes, but completely locks up when the VM goes wonky.
  • Network connections to the host work fine through all of this.
  • This happens with VirtualBox 2.0.0 and 2.0.2
  • Thanks for reading all this. I'd be grateful for any help you guys could offer.

Forum Thread:  http://forums.virtualbox.org/viewtopic.php?t=9590

Attachments

VBox.log-broken (29.1 KB) - added by kpreslan 6 years ago.
VBox.log of an unresponsive VM
VBox.log-broken-after-poweroff (38.7 KB) - added by kpreslan 6 years ago.
VBox.log of the same unresponsive VM after I issued a "VBoxManage controlvm testvbox1 poweroff"
VBox.log-working (29.1 KB) - added by kpreslan 6 years ago.
VBox.log of a happily running VM
VBox.log-working-after-halt (38.5 KB) - added by kpreslan 6 years ago.
VBox.log of the working VM after I shut it down after a few minutes by running "halt" in the guest OS
VBox.log-1.6.6-working-after-halt (37.5 KB) - added by kpreslan 6 years ago.
VBox.log from a working VM run with VB 1.6.6.

Change History

comment:1 Changed 6 years ago by kpreslan

More things I've tried:

  • I've never enabled USB on the guest, so maybe this is irrelevant, but the bug appears whether or not I do the mountdevsubfs.sh thing mentioned in the FAQ.
  • The problem doesn't seem to be related to HIF networking. I see the same problem if I temporarily switch the guest to use NAT. I didn't change any of the networking stuff on the host, though. So, I haven't ruled out the process of bringing up the vbox0 interface as being a factor yet.
  • Recently, I've been playing with two VMs. I start both VMs at the same time from my boot scripts, they work fine for a few minutes, and then they both stop working at the same time. But, if I start one VM by itself and it screws up, I can start up the other and it will work fine. In other words, it's the first VM of any kind that has the problem, not the first run of each individual VM.
  • I've been playing with a near-identical guest image on VB2 running on a Mac client. I have not seen any problems there at all.

comment:2 Changed 6 years ago by plenque

I think I'm facing the same problem, but I don't have an exact way to reproduce it. Also, there's no related information in any log (guest or host).

I'm using a Hardy desktop host, with a Windows XP guest. Not using VRDP.

Regards, Mauro.

Changed 6 years ago by kpreslan

VBox.log of an unresponsive VM

Changed 6 years ago by kpreslan

VBox.log of the same unresponsive VM after I issued a "VBoxManage controlvm testvbox1 poweroff"

Changed 6 years ago by kpreslan

VBox.log of a happily running VM

Changed 6 years ago by kpreslan

VBox.log of the working VM after I shut it down after a few minutes by running "halt" in the guest OS

comment:3 Changed 6 years ago by kpreslan

Just a description of the first four attachments:

I rebooted the host box. After the host box was up, I ran "VBoxManage startvm testvbox1 -type vrdp" and waited for the VM to lock up. I then copied out the VBox.log to get "VBox.log-broken". I then did a "VBoxManage controlvm testvbox1 poweroff" and copied the log to get "VBox.log-broken-after-poweroff".

Then, I restarted the same VM (with the same command as above) and copied the log to get "VBox.log-working". I waited a few minutes to prove the VM was working right, and then issued a "halt" in the guest OS and copied the log to get "VBox.log-working-after-halt".

comment:4 Changed 6 years ago by kpreslan

For whatever it's worth:

I turned off the host's bridge, got rid of /etc/vbox/interfaces, and turned my VMs back to the NAT setting. I then rebooted. I still see the lock up. So, it definitely has nothing to do with HIF networking at all.

comment:5 Changed 6 years ago by kpreslan

So, this appears to be a 2.X-only bug. I downgraded my VB .deb and guest additions to 1.6.6 and I can't reproduce the problem.

I hope that narrows the search for the bug some. :-)

comment:6 Changed 6 years ago by frank

  • Description modified

Thanks for your findings. Are you 100% sure that this is a 2.0.0 regression?

To help debug this problem, you could do the following: start the VM with

gdb -args /usr/lib/virtualbox/VBoxHeadless -startvm testvbox1

When the guest stops responding, force the process to terminate with a core dump. I've updated the instructions at  http://www.virtualbox.org/wiki/Core_dump. Remember to allow SUID root processes to dump core, and kill the process with -4 (as described there).

Send the core dump to frank _dot_ mehnert _at_ sun _dot_ com. If the compressed file is bigger than 4MB (very likely), make it available for me to download (preferred) or use a file sharing service (megaupload.com, yousendit.com or similar).

Changed 6 years ago by kpreslan

VBox.log from a working VM run with VB 1.6.6.

comment:7 Changed 6 years ago by kpreslan

All I know is that I saw the problem when I was running 2.0.0 and 2.0.2. I changed everything to version 1.6.6 and I wasn't able to reproduce the problem. (I did try for a while, too.) I just got through changing back to 2.0.2 and I see the problem again. I don't think I changed anything else of substance.

So, am I 100% sure it's a regression? No. It could be a bug in both versions that is masked by something else in 1.6.6 (a performance issue, perhaps). Or maybe I've made a mistake somewhere, but I don't think so.

I'll get your GDB core dumps today.

Thanks for your help.

comment:8 Changed 6 years ago by kpreslan

I was trying to better quantify how long a VM had to be running before it hung and I made an interesting (and really weird) discovery. It seems that any VMs running on the host lock up exactly as the host's uptime (as shown by /proc/uptime) crosses the 300 second mark. It doesn't matter when in those first 300 seconds the VM starts up.

Test setup:

I have my boot scripts set up to sleep a bit and then start the VM. I wrote a script on the host with a loop that cats /proc/uptime and sleeps a second. I wrote a script in the guest that prints stuff to the screen every couple of seconds.

So, I reboot the host, log in, and start the script that watches /proc/uptime. I wait for the VM to boot. I then log in to that and start the other script. So, I can tell exactly when the VM locks up by watching the output of the script on the VM. It always stops when the host's uptime reaches 300. I can vary the amount of sleep in the boot script to show that it's not how long the guest has been running that causes the lock up -- It's how long the host has been up.
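A minimal sketch of the host-side watcher described above, written in Python rather than as the original shell loop (which is not attached to this ticket). The function names and the 300-second mark parameter are illustrative assumptions; only /proc/uptime and the one-second sleep come from the description.

```python
import time


def parse_uptime(text):
    """Return host uptime in seconds from the contents of /proc/uptime."""
    # /proc/uptime holds two numbers: uptime and idle time, both in seconds.
    return float(text.split()[0])


def crossed(previous, current, mark=300.0):
    """True when uptime passed `mark` between two consecutive samples."""
    return previous < mark <= current


def watch(path="/proc/uptime", interval=1.0, mark=300.0):
    """Print the uptime once per interval and flag the sample that crosses `mark`."""
    previous = None
    while True:
        with open(path) as f:
            current = parse_uptime(f.read())
        hit = previous is not None and crossed(previous, current, mark)
        print(f"{current:.2f}" + ("  <-- crossed mark" if hit else ""), flush=True)
        previous = current
        time.sleep(interval)
```

Calling watch() on the host right after boot and comparing the flagged line with the moment the guest-side script stops printing gives the same correlation as the test above.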

comment:9 Changed 6 years ago by frank

Investigating your core dump. Could you please check whether your VM works correctly if you don't start the guest additions? That is, please make sure that /etc/init.d/vboxadd is not executed within the guest during boot (edit the script and prevent it from executing). If you have X running, you will lose mouse pointer integration but X should start anyway. Please check this with version 2.0.2.

comment:10 Changed 6 years ago by kpreslan

I made sure no VB processes or modules were ever loaded in the guest. I still get the lock up. Still at exactly 300 seconds after the host boots.

Back in my kernel hacking days, I had problems with doing arithmetic on the "jiffies" variable (and the like), which are counters that run from system boot. Things that would work right when the system had been up for a while would screw up on a newly booted machine because my arithmetic would cause an underflow and my code wasn't expecting negative numbers. Could this be something similar?
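To make that class of bug concrete, here is a small sketch that models a 32-bit tick counter in Python. It is not the VirtualBox code and makes no claim about where the actual bug is; the function names are hypothetical. The wrap-safe comparison follows the same idea as the Linux kernel's time_after_eq() macro: look at the signed difference instead of the raw values.

```python
MASK32 = 0xFFFFFFFF  # model of a 32-bit unsigned tick counter


def naive_expired(now, deadline):
    """Buggy: compares raw counter values, which breaks across a wrap."""
    return now >= deadline


def wrap_safe_expired(now, deadline):
    """Treat (now - deadline) as a signed 32-bit value and test it for >= 0."""
    diff = (now - deadline) & MASK32
    return diff < 0x80000000


def elapsed_ticks(start, now):
    """Ticks between two samples, correct even if the counter wrapped once."""
    return (now - start) & MASK32


# Ten ticks before the wrap, arm a timer 100 ticks into the future.
now = 0xFFFFFFF6
deadline = (now + 100) & MASK32  # wraps around to a small number

print(naive_expired(now, deadline))      # reports "expired" immediately
print(wrap_safe_expired(now, deadline))  # correctly reports "not yet"
```

The naive version only misbehaves in the window around the wrap, which fits a fault that shows up at one specific uptime and never again afterwards.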

comment:11 Changed 6 years ago by frank

A very interesting finding. Of course we cannot rule out such a bug. Are there any messages in the kernel log when this hang appears (dmesg)? Are there any services on your host which are executed after 300 seconds? Some cron jobs? Could you have a look at /var/log/daemon.log?

comment:12 Changed 6 years ago by sunlover

Does this Lock-Up happen with other guests as well? Could you please try some LiveCD ISO in VBox 2.0.2? Thanks.

comment:13 Changed 6 years ago by kpreslan

No messages in dmesg or any file in /var/log/. I turned off all the services I could in the host. It's only running udevd, syslogd, klogd, sshd, and a couple of gettys. So, no cron or any other possibility of services starting.

I haven't tried any guest other than Ubuntu yet, but I guess I can try one. On the other hand, the hang happens even if I stop the boot process and the computer sits in the Grub screen waiting for me to select which kernel I want to boot. I've tried with both the standard GUI front-end and the Headless front-end (w/ VRDP). At 300 seconds host uptime, the cursor keys stop moving the selection between the different kernels.

Even further, I created a dummy VM that doesn't have an OS installed on it. I "boot" it and connect with rdesktop-vrdp and I see a "FATAL: No bootable medium found! System Halted." message. For the first 300 seconds, I can connect and disconnect repeatedly and I always see the message. After 300 seconds, a new VRDP connection will just have a black screen with no message.

comment:14 Changed 6 years ago by kpreslan

I just tried the KNOPPIX_V5.1.1CD-2007-01-04-EN.iso live CD and it still hangs.

comment:15 Changed 6 years ago by kpreslan

I was poking at all the places jiffies were referenced in the OSE code. I found usage of a macro I'd never seen before called INITIAL_JIFFIES. I looked up INITIAL_JIFFIES in the kernel source and found this:

From include/linux/jiffies.h:

/*
 * Have the 32 bit jiffies value wrap 5 minutes after boot
 * so jiffies wrap bugs show up earlier.
 */
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))
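A quick check of what that macro implies, modelling the 32-bit value in Python. The HZ values below are common kernel settings chosen as assumptions (this ticket does not state the host's HZ), and the helper names are hypothetical. Whatever HZ is, the low 32 bits of the counter start 300 seconds' worth of ticks short of the wrap point, so they wrap at 300 seconds of uptime.

```python
def initial_jiffies(hz, bits=32):
    """Value of (unsigned int)(-300 * HZ) for a counter of the given width."""
    return (-300 * hz) % (1 << bits)


def seconds_until_wrap(hz, bits=32):
    """Seconds after boot at which the counter wraps from its maximum to 0."""
    ticks_left = (1 << bits) - initial_jiffies(hz, bits)
    return ticks_left / hz


for hz in (100, 250, 1000):  # assumed HZ values, not taken from this ticket
    print(hz, hex(initial_jiffies(hz)), seconds_until_wrap(hz))
```

This only illustrates the kernel's starting value for jiffies; it does not reproduce the VirtualBox bug itself.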

comment:16 Changed 6 years ago by frank

  • Summary changed from VM Lock-Up with VB 2.0.2 and Ubuntu Hardy to VM Lock-Up after 300 seconds host uptime => Fixed in 2.1

Ken, many thanks to you for these findings. Indeed it was a jiffies overflow although not related to INITIAL_JIFFIES. The fix is in r12591.

comment:17 Changed 6 years ago by alex2000

Seeing how this is fixed in the new version (2.1), when is this coming out? I mean this is a huge bug (I'm constantly annoyed by the fact that my VMs halt randomly when first started).

comment:18 Changed 5 years ago by frank

  • Status changed from new to closed
  • Resolution set to fixed
  • Summary changed from VM Lock-Up after 300 seconds host uptime => Fixed in 2.1 to VM Lock-Up after 300 seconds host uptime => Fixed in 2.0.4