VirtualBox

Ticket #5864 (new defect)

Opened 4 years ago

Last modified 3 years ago

Again: hard hang of physical server

Reported by: bauer40
Owned by:
Priority: major
Component: other
Version: VirtualBox 3.1.2
Keywords: hard hang
Cc:
Guest type: other
Host type: Solaris

Description

Ooops, it did it again....

In five days of using VBox 3.1.2, I had two hard hangs of the physical server; each time I had to power it off and on again using the power switch.

I run Debian, Solaris 10 and Windows XP guests on the physical server, which runs Solaris 10 u7 with KJP 141415-04.

It happened with and without Intel VT enabled.

Specialty: I make use of tagged vlan interfaces (e1000g4000, ...)

First, I discovered that the VM running my web proxy was no longer able to ping one (and only one) physical workstation in my network. Only the connection between the VM and the workstation was affected. The workstation could be pinged by other hosts on the net, and the proxy was able to ping other hosts.

During the diagnosis of that behaviour, the physical server running VBox froze. I was not able to fall into the kernel debugger, and even the NumLock-LED did not change when pressing the appropriate key on the keyboard. It was a complete lock up.

It is acceptable for me to live with a few more hangs to diagnose the problem, if somebody is able to tell me how I can collect information which helps Sun to track down that problem.

Attachments

logs.tar.gz (298.7 KB) - added by bauer40 4 years ago.
Logs of all VMs on the host
VBox.log (34.6 KB) - added by bauer40 4 years ago.
VBox.log where BindIP was used to bind NAT to a host's IP address

Change History

comment:1 Changed 4 years ago by Hachiman

Could you please attach the logs?

comment:2 follow-up: ↓ 3 Changed 4 years ago by bauer40

Which logs? Of every VBox that was running when the hang occurred?

Changed 4 years ago by bauer40

Logs of all VMs on the host

comment:3 in reply to: ↑ 2 Changed 4 years ago by Hachiman

Replying to bauer40:

Which logs? Of every VBox that was running when the hang occurred?

Right. Thanks.

comment:4 Changed 4 years ago by bauer40

I uploaded all the logs of the vms which were running on that host. However, I'm not sure that at the time of the hang really EVERY VM was up...

I first discovered that the VM named "proxy" could not contact the workstation 192.168.94.4 on Dec 28, 2009, at 05:05, when the workstation tried to download files from the internet.

The first hard hang occurred on Dec 28, 2009, at 08:32. After that hang I enabled hardware acceleration (Intel VT) on all VMs except the Solaris 10 "jumpstart" VM (which no longer booted with it enabled).

The second hang occurred on Dec 29, 2009, at 02:53.

Can I activate any debugging or activity logs for the VBox kernel modules?

comment:5 follow-up: ↓ 6 Changed 4 years ago by ramshankar

How long does it take to reproduce the hang? Any vbox messages on /var/adm/messages? If possible does using NAT for all your VMs prevent the hang? What kind of load were the guests doing, what kind of load was the host doing? Anything specific like high network load on the host?

comment:6 in reply to: ↑ 5 Changed 4 years ago by bauer40

Replying to ramshankar:

How long does it take to reproduce the hang?

I have no idea, but I assume it will take no longer than three days until the problem arises again.

Any vbox messages on /var/adm/messages?

Unfortunately not, no.

If possible does using NAT for all your VMs prevent the hang?

There is absolutely no way to test this, because all VMs are servers which need incoming TCP connections. This is very hard to implement with NAT.

What kind of load were the guests doing, what kind of load was the host doing?

Well, the last hang occurred in the middle of the night. At that time, only two systems are expected to be "busy": the system named exchange (running mail server software for 5 users on Debian), and proxy, which is the web proxy and the router to the internet.

However, the overall load can be assumed to be "very low".

Anything specific like high network load on the host?

Sorry, no.

I moved the VM with the most network traffic to another server running VBox 3.1.2 on the same host OS version, just to see if this (reliably) moves the problem from the main physical server to that one. This move involved reconfiguring the VM to use iSCSI as the disk interface.

comment:7 follow-up: ↓ 8 Changed 4 years ago by ramshankar

We need to narrow down the problem as it's currently too wide for me to reproduce without more information.

The following information would be useful to help me in replicating this problem:

  1. Whether the problem goes away with NAT. You can set up port forwarding for a VM with NAT to accept incoming connections. I suggest you set up 3 VMs (NAT + port forwarding) and run them in parallel on one host until the hang occurs, while running the same VMs with bridged networking on another host until the hang occurs. (Preferably "exchange" and "proxy".)
  2. How many VMs were running in parallel when you hit this problem.

Thanks for the report.

comment:8 in reply to: ↑ 7 Changed 4 years ago by bauer40

Replying to ramshankar:

We need to narrow down the problem as it's currently too wide for me to reproduce without more information.

The following information would be useful to help me in replicating this problem:

  1. Whether the problem goes away with NAT. You can set up port forwarding for a VM with NAT to accept incoming connections. I suggest you set up 3 VMs (NAT + port forwarding) and run them in parallel on one host until the hang occurs, while running the same VMs with bridged networking on another host until the hang occurs. (Preferably "exchange" and "proxy".)

OK, I think I can set up NAT for most of these VMs, but probably not for proxy, since it is also an IP router. I will take a closer look at it; maybe I can track down which ports are involved.

However, making that change will take a few days. I will post a message when I'm done and explain the current setup. I assume I will end up with two hosts: one running NATed VMs only, and another one running those VMs which are not NATable.

  2. How many VMs were running in parallel when you hit this problem.

At least six, maybe seven. Probably not more than eight.

Thanks for the report.

For the report? Thanks for working on that issue!

comment:9 follow-up: ↓ 10 Changed 4 years ago by bauer40

There is a problem when I try to set up NAT:

Since those VMs are running on four different networks, and NAT can't be bound to a physical NIC of the server, I can't define which VM NIC is connected to which physical network. Furthermore, some VMs use multiple virtual NICs bridged to multiple physical nets (e.g. proxy uses four bridged networks: the subnets 192.168.0.0, 192.168.94.1, 85.216.217.104 and one for PPPoE).

Or is there a way that I currently don't know of? Can I create a NAT which uses e1000g0 on the physical server, define another one on e1000g3000, and choose which NAT-net :-) the virtual NIC of the VM uses?

comment:10 in reply to: ↑ 9 Changed 4 years ago by Hachiman

Replying to bauer40:

There is a problem when I try to set up NAT:

Since those VMs are running on four different networks, and NAT can't be bound to a physical NIC of the server,

You can find details about binding NAT here.

comment:11 Changed 4 years ago by bauer40

Thanks for advising how to set up NAT in that case. Unfortunately, this documentation does not explain the setextradata "path" very well.

Supposing I have to do the following mapping for a VM with two virtual NICs:

vNIC1 bound to physical interface with the address 1.1.1.1
vNIC2 bound to physical interface with the address 2.1.1.1

Do I get it right that the correct commands to do that mapping are

VBoxManage setextradata "Linux Guest"   "VBoxInternal/Devices/pcnet/0/LUN#0/Config/BindIP" "1.1.1.1"
VBoxManage setextradata "Linux Guest"   "VBoxInternal/Devices/pcnet/1/LUN#0/Config/BindIP" "2.1.1.1"

(the instance number in .../pcnet/x corresponds to the vNIC number, off by one)

Is that right?

I will try to configure this with a test system, as well as the port forwarding, and see if I can make that work.

comment:12 Changed 4 years ago by bauer40

OK, the documentation seems to be invalid, or I didn't get it right.

That's what I did:

VBoxManage setextradata testdebian "VBoxInternal/Devices/pcnet/0/LUN#0/Config/BindIP" "192.168.4.3"
VBoxManage setextradata testdebian "VBoxInternal/Devices/pcnet/1/LUN#0/Config/BindIP" "192.168.94.66"

That's what I have now:

infra1# VBoxManage getextradata testdebian enumerate

VirtualBox Command Line Management Interface Version 3.1.2
(C) 2005-2009 Sun Microsystems, Inc.
All rights reserved.

Key: GUI/AutoresizeGuest, Value: on
Key: GUI/Fullscreen, Value: off
Key: GUI/LastCloseAction, Value: powerOff
Key: GUI/LastWindowPostion, Value: 6,24,640,531
Key: GUI/MiniToolBarAlignment, Value: bottom
Key: GUI/MiniToolBarAutoHide, Value: on
Key: GUI/SaveMountedAtRuntime, Value: yes
Key: GUI/Seamless, Value: off
Key: GUI/ShowMiniToolBar, Value: yes
Key: VBoxInternal/Devices/pcnet/0/LUN#0/Config/BindIP, Value: 192.168.4.3
Key: VBoxInternal/Devices/pcnet/1/LUN#0/Config/BindIP, Value: 192.168.94.66

and that's what happens if I try to start up the VM:

infra1# tail -6 VBox.log
00:00:00.533 PDM: Failed to construct 'pcnet'/0! VERR_PDM_DRVINS_UNKNOWN_CFG_VALUES (-2805) - A driver encountered an unknown configuration value. This means that the driver is potentially misconfigured and the driver construction failed because of this.
00:00:00.539 iSCSI: logout to target iqn.1986-03.com.sun:02:acc41bc7-0d77-66d2-863a-f0aba89dc2d2.testdebian
00:00:00.540 VMSetError: /export/home/vbox/tinderbox/3.1-sol-rel/src/VBox/VMM/VM.cpp(323) int VMR3Create(uint32_t, void (*)(VM*, void*, int, const char*, unsigned int, const char*, const char*, __va_list_tag*), void*, int (*)(VM*, void*), void*, VM**)
00:00:00.540 VMSetError: Unknown error creating VM
00:00:00.540 ERROR [COM]: aRC=NS_ERROR_FAILURE (0x80004005) aIID={6375231a-c17c-464b-92cb-ae9e128d71c3} aComponent={Console} aText={Unknown error creating VM (VERR_PDM_DRVINS_UNKNOWN_CFG_VALUES)} aWarning=false, preserve=false
00:00:00.552 Power up failed (vrc=VERR_PDM_DRVINS_UNKNOWN_CFG_VALUES, rc=NS_ERROR_FAILURE (0X80004005))

So, pcnet/0 seems to be invalid...

Changed 4 years ago by bauer40

VBox.log where BindIP was used to bind NAT to a hosts IP address

comment:13 Changed 4 years ago by ramshankar

Remove the old config data with:

VBoxManage setextradata testdebian "VBoxInternal/Devices/pcnet/0/LUN#0/Config/BindIP"
VBoxManage setextradata testdebian "VBoxInternal/Devices/pcnet/1/LUN#0/Config/BindIP"

And try this:

VBoxManage setextradata testdebian "VBoxInternal/Devices/pcnet/0/LUN#0/AttachedDriver/Config/BindIP" "192.168.4.3"
VBoxManage setextradata testdebian "VBoxInternal/Devices/pcnet/1/LUN#0/AttachedDriver/Config/BindIP" "192.168.94.66"
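
For a VM with several NAT adapters, the two steps above (clear the old key, then set the key under AttachedDriver) can be generated per adapter. The following is only a dry-run sketch: `bindip_cmds` is a hypothetical helper that prints the commands rather than running them, it assumes a POSIX shell, and it assumes that all listed adapters use the same emulated device (`pcnet` here) with instance numbers counted from 0 as in the commands above.

```shell
# Dry-run helper (hypothetical): prints VBoxManage commands, does not run them.
# Usage: bindip_cmds VM DEVICE IP...
#   VM      name of the VM (e.g. testdebian)
#   DEVICE  emulated NIC device in the key path (e.g. pcnet)
#   IP...   one host IP per adapter instance, starting at instance 0
bindip_cmds() {
    vm=$1; dev=$2; shift 2
    i=0
    for ip in "$@"; do
        old="VBoxInternal/Devices/$dev/$i/LUN#0/Config/BindIP"
        new="VBoxInternal/Devices/$dev/$i/LUN#0/AttachedDriver/Config/BindIP"
        # A setextradata call without a value removes the old key.
        printf 'VBoxManage setextradata "%s" "%s"\n' "$vm" "$old"
        # Set the key at the path that worked in this thread.
        printf 'VBoxManage setextradata "%s" "%s" "%s"\n' "$vm" "$new" "$ip"
        i=$((i + 1))
    done
}

bindip_cmds testdebian pcnet 192.168.4.3 192.168.94.66
```

The printed commands can be reviewed and then pasted into a shell; `VBoxManage getextradata testdebian enumerate` shows the resulting keys.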

comment:14 Changed 4 years ago by bauer40

Yes, that worked. I will now go ahead and try to configure everything.

Again: this will take some time, probably some days. I will come back to this thread as soon as I have news.

comment:15 follow-up: ↓ 16 Changed 4 years ago by bauer40

Update: The migration from bridged networking to NAT with port forwarding is making progress, but it will take the entire coming week to complete. The point is that those servers are in production for 15 users...

Currently everything is stable. My setup: everything except proxy (4 NICs, currently bridged to an e1000 chip) is running on one physical system, and proxy has its own physical host. Most communication is between the VBoxes "proxy" and "www-itserv-de", and second most between "proxy" and "exchange".

By the way two questions:

Q: Does bridged networking care about the interface type? Does VBox distinguish between physical e1000 and bge chipsets, or do you rely on Solaris for that? The point is that I currently no longer run PPPoE over my bge interface (which I did when the host hangs occurred; I did the same when 3.0.0 to 3.0.4 were current, and since I had host hangs then, I fell back to 2.2.4), and currently everything is stable. So my wild guess is that there is a problem when running PPPoE over bridged-to-bge interfaces.

Q: Is there a difference in the code executed when two VBoxes with bridged networking communicate with each other on the SAME host or on DIFFERENT hosts? Say, could it be a VBox-on-the-same-host-only problem?

Thanks again, and: happy new year!!!

comment:16 in reply to: ↑ 15 Changed 4 years ago by ramshankar

By the way two questions:

Q: Does bridged networking care about the interface type? Does VBox distinguish between physical e1000 and bge chipsets, or do you rely on Solaris for that? The point is that I currently no longer run PPPoE over my bge interface (which I did when the host hangs occurred; I did the same when 3.0.0 to 3.0.4 were current, and since I had host hangs then, I fell back to 2.2.4), and currently everything is stable. So my wild guess is that there is a problem when running PPPoE over bridged-to-bge interfaces.

No, in fact bridged doesn't even care if it's a VNIC or a physical interface that you pass to it. As far as the bridged driver goes, it's a network interface.

Q: Is there a difference in the code executed when two VBoxes with bridged networking communicate with each other on the SAME host or on DIFFERENT hosts? Say, could it be a VBox-on-the-same-host-only problem?

Well guest->host packets in the current bridged network driver need to be handled in a special way. So possibly, but this seems a little sketchy to speculate more at this point. Thanks for testing.

Thanks again, and: happy new year!!!

comment:17 follow-up: ↓ 18 Changed 4 years ago by bauer40

Update:

I have not been able to move all NICs to NAT. The VM www-itserv-de and proxy still use bridged adapters. However, I did not have any further hard hangs.

I will move those two VMs to a second server, so I have one machine with two VMs and bridged networking, and another one with NATted VBoxes only.

It will look like this:
infra1:

  • proxy (debian, bridged)
  • www-itserv-de (debian, bridged)

infra2:

  • kerstin-wilfer.de (Debian, NAT)
  • service (WinXP, NAT)
  • training (Debian, NAT)
  • exchange (Debian, NAT)

In addition to that, I will start up a VM which has been down since the beginning of the tests. It's named jumpstart and is a Solaris 10 VBox with bridged networking. I will see if this negatively influences the stability...

comment:18 in reply to: ↑ 17 Changed 4 years ago by bauer40

Replying to bauer40:

In addition to that, I will start up a VM which has been down since the beginning of the tests. It's named jumpstart and is a Solaris 10 VBox with bridged networking. I will see if this negatively influences the stability...

It did. This morning, the server hung again. To quickly summarize: when the hang occurred, the server was running only NATed VBoxes, except for two: www-itserv-de (Debian) and jumpstart.

I will now migrate jumpstart (which is, luckily, unimportant) to another physical server, simply to isolate it and see if the problem moves with the VM.

I am starting to think that this single VBox is the culprit.

Can you please say a few words on that theory: jumpstart is the only VBox which uses the "Intel PRO/1000 MT Desktop" NIC in bridged mode. All others use "Intel PRO/1000 T Server" or "PCnet-FAST III" NICs. Is there a chance that the hang relates to that virtual adapter type?

comment:19 Changed 4 years ago by ramshankar

The idea here was to not mix NAT and bridged VMs on the same box, in order to isolate the problem to one of the two. Which adapter you choose for either is not the cause of the hangs. As long as you run even one bridged VM on the NAT host, it's not possible to isolate the problem the way we first intended.

comment:20 Changed 4 years ago by bauer40

OK, last night at 3:00 it happened again. Hard hang, I had to power off the host. That clearly shows that my guess that the problem was triggered by the Solaris guest was wrong.

I will make the effort to entirely split the VMs running NAT from those using bridged networking. Unfortunately, I did not understand that you wanted me to do that prior to your message from 2010-01-18. However, all we lost is time.

comment:21 Changed 4 years ago by bauer40

OK, it's done. I have one system running bridged-networked-VMs only, and the other hosts NATted VBoxes only.

Time will tell what will happen....

comment:22 Changed 4 years ago by bauer40

OK, I have an uptime >33 days now. I think we can assume that mixing bridged and NATed interfaces is one reason for the freezes.

What next? Should I update to the 3.1.4 and see if the problem is gone?

comment:23 follow-up: ↓ 24 Changed 4 years ago by ramshankar

Yes please upgrade to 3.1.4 and we'll see if the problem occurs more rapidly/predictably.

comment:24 in reply to: ↑ 23 Changed 4 years ago by bauer40

Replying to ramshankar:

Yes please upgrade to 3.1.4 and we'll see if the problem occurs more rapidly/predictably.

Done. After one week without problems, I start to migrate VMs and replace NATted interfaces with bridged.

comment:25 Changed 4 years ago by bauer40

OK, I upgraded the VM to 3.1.4 and everything is stable. I also changed back from NAT to Bridged Networking for every VM (no mixing of NAT and Bridged), and it's still stable.

But now it becomes interesting: after teleporting three named VBoxes from infra1 to infra2, infra2 got its hard hang again.

Today, I'll upgrade to 3.1.6, and after some time I will start moving the same combination of VMs that hung my server to that machine. So this ticket will stay open for some more time. I'm quite sure that I can narrow the hard hang issue down to some combination of configurations.

ramshankar, are you still with this issue, or did it get a timeout :-)

comment:26 Changed 4 years ago by ramshankar

Sure, we don't have much info. to reproduce a hang and from the looks of it seems a rather difficult corner case. So yes let the ticket be open.

comment:27 Changed 4 years ago by ramshankar

  • Priority changed from blocker to major

Any news on this? I'll bump this down from "Blocker" for the time being.

comment:28 Changed 4 years ago by bauer40

Yesterday, I wanted to answer: no. But last night, at 3:15, the server hung again. Currently, I have VBox 3.2.6 running, Kernel 142901-13. The server was stable from July 4th to August 9th.

I was thinking that the combination of three VMs (proxy, www-itserv-de, jumpstart) triggers the problem, but now only proxy and www-itserv-de were on that physical host. Both are Debian Linuxes.

I remember that (10 years ago) Sun released an Infodoc on how to break into OBP on a hung server ("sleeping dragon"???), but I can't find that document. It was something to add to /etc/system. I'll scan through some Technical Instructions on SunSolve (1012991.1, 1012913.1, maybe more) to see if I can find a way to drop into kadb instead of a hang.

comment:29 Changed 4 years ago by ramshankar

I think you mean the deadman timer. Add the following to /etc/system:

set snooping=1
set snoop_interval=50000000

This will cause the system to panic if the clock hasn't been updated for 50 seconds. Of course, this is not guaranteed to produce a core dump on every system hang, simply because if the system is wedged way beyond recovery, a deadman timer will be of no use. You can additionally add the following to /etc/system:

set pcplusmp:apic_panic_on_nmi=1

Then once the system is hung you can send the system an NMI from another box (using ipmitool -I lanplus -H yourhunghost -U root chassis power diag) or via the iLOM (if your host has one).

After making the required changes to /etc/system, reboot the host for the changes to take effect. None of these methods I described above are guaranteed solutions to produce a system dump, but it's definitely something that should be tried in the process of analyzing an unresponsive Solaris system.
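
As a small aid for the /etc/system changes described above, the sketch below prints the two deadman lines for a timeout given in seconds and reports which of them are missing from a file. It relies on the value used above (50000000 for 50 seconds), i.e. an interval in microseconds. The helper names `deadman_lines` and `deadman_missing` are made up for illustration, and the code assumes a POSIX shell and a grep that supports `-q`, `-x` and `-F` (on Solaris 10 that may mean the tools under /usr/xpg4/bin).

```shell
# Print the /etc/system lines for a deadman timeout given in seconds.
# snoop_interval is written in microseconds (50000000 = 50 s, as above).
deadman_lines() {
    secs=$1
    printf 'set snooping=1\n'
    printf 'set snoop_interval=%s\n' "$((secs * 1000000))"
}

# Report which expected lines are missing from a system file.
# $1 = timeout in seconds, $2 = file to check (default: /etc/system)
deadman_missing() {
    file=${2:-/etc/system}
    deadman_lines "$1" | while IFS= read -r line; do
        # -x: match the whole line, -F: fixed string, -q: no output
        grep -qxF "$line" "$file" || printf 'missing: %s\n' "$line"
    done
}

deadman_lines 50
```

Running `deadman_missing 50` on the host before rebooting gives a quick check that the edit was saved; it only reads the file and changes nothing.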

comment:30 Changed 4 years ago by bauer40

It happened again. The server hung.

Even though snooping was on and the kernel debugger was loaded, the system did not drop into it. The machine simply hung and had to be power cycled. I could not get into kmdb with F1-A.

I am starting to believe that there might be a hardware problem with the machine. Is there anybody else who has the same problem?

Next, I will switch all VBoxes from server infra2 to infra1 and vice versa. However, there is one VBox that can't be moved, because it uses a serial interface on the physical server.

Furthermore, I'll upgrade to 3.2.8.

comment:31 Changed 4 years ago by bauer40

I am starting to wonder whether it could be a hardware error. I use HP ProLiant ML110 G5 servers with ECC memory, but I don't know if a memory error would be reported to the OS.

comment:32 Changed 4 years ago by ramshankar

I don't suppose you have an iLOM console to the machine or any other out-of-band management console for the server? Usually you could check those logs for hardware errors.

comment:33 Changed 4 years ago by bauer40

OK, here is the news: after switching the VBoxes from Infra1 to Infra2 and vice versa (except training, which is hardware-bound) and upgrading to 3.2.8, the problem SWITCHED hosts. Now Infra1 froze.

Again: no output on the console, no F1-A to reach kmdb, nor did the activated snooping drop the machine into mdb.

I think we can now clearly state that this is NOT a hardware issue.

Any ideas? If not, I will next delete all guests and start from scratch (except for the disk images, of course). This way, I can ensure that the settings for all guests are the same, rather than some using VT and some not, some using PXE and some not...

comment:34 Changed 4 years ago by ramshankar

Yes having a consistent config that causes this issue would be good. I didn't know each guest was configured differently.

I think we can assume that mixing bridged and NATed interfaces is one reason for the freezes.

Does this still hold true? Do you need a mix of both NAT+bridged VMs for the problem to occur? I hope we don't lose track of the configurations while we test various different ones.

comment:35 Changed 4 years ago by bauer40

I had another three freezes: one on Sept 5, one on Sept 6th, one on Sept 9th.

Things escalate - I will adjourn my vacation and change the configs today. I'll update this record after I've done it.

I can clearly state that every VM uses bridged networking right now. I don't use NAT anymore.

comment:36 Changed 4 years ago by bauer40

OK, I had another idea which might get to the bottom of my problem. The point is that the entire server freezes, which means it must be something _below_ the running Solaris kernel. And what is there? The CPU and the BIOS.

So I switched off hardware virtualisation support for all the VMs (they were all using Intel VT, but only two of them supported Nested Paging).

With this step I bypass the hardware support and keep running the "software only virtualisation". I expect to be kept "above the kernel" at any time.

Let's look what happens, and what not....

comment:37 Changed 4 years ago by ramshankar

No, if the kernel is wedged or frozen, the system/server is screwed up anyway. It's extremely unlikely that this is a BIOS bug of any kind.

What is interesting is that it occurs with bridged networking.

comment:38 Changed 4 years ago by AZweimiller

I am having a problem very similar to this on OpenSolaris b142. I use VirtualBox 3.2.8 to run pfSense (FreeBSD distro) in a VM as my router. I use bridged networking exclusively. Randomly the server completely hardlocks. It does not respond to ping or at the physical console, and the VM (pfSense) is also unresponsive. I have to power it off and back on. Sometimes the interval is as long as 8 hours, and I have had it happen three times in a row right after rebooting. I finally just removed VBox and have not had a single freeze or stability issue with OpenSolaris since then. I looked in the logs but could find no information that would indicate errors or a reason for the hard lock. I will reinstall and attempt to reproduce it if you think my problems are related to this bug.

comment:39 follow-up: ↓ 40 Changed 4 years ago by bauer40

@ramshankar: I was thinking that kernel snooping would exactly "escape" from wedged kernels. Is that wrong? Is it possible for the kernel to get wedged in a way that even snooping won't help?

@AZweimiller: what kind of hardware do you use? System, CPU, and does your VM use Intel VT and/or nested paging?

comment:40 in reply to: ↑ 39 Changed 4 years ago by ramshankar

Replying to bauer40:

@ramshankar: I was thinking that kernel snooping would exactly "escape" from wedged kernels. Is that wrong? Is it possible for the kernel to get wedged in a way that even snooping won't help?

Of course. The deadman timer (snooping) fires a level 15 interrupt every second on the CPU and checks if the system lbolt variable has not been updated for the given period and then triggers kmdb/panic. If the kernel is so wedged, this too can fail. In fact, I've seen the deadman timer kick in only once in the numerous system hangs I've come across, and VirtualBox is known for doing "special" things even from a kernel point of view.
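
The check described above can be illustrated with a toy model. This is not kernel code: `deadman_sim` is a hypothetical shell function that takes a timeout and a list of lbolt samples (one per simulated one-second check) and reports when the counter has been unchanged for the timeout period. It also shows the limitation mentioned above: the logic only helps if the periodic check itself still gets to run.

```shell
# Toy model of the deadman check (illustration only, not kernel code).
# Usage: deadman_sim TIMEOUT LBOLT_SAMPLE...
# Each sample stands for the lbolt value seen by one periodic check.
deadman_sim() {
    timeout=$1; shift
    prev=""       # lbolt value seen by the previous check
    stalled=0     # consecutive checks without an lbolt update
    tick=0
    for lbolt in "$@"; do
        tick=$((tick + 1))
        if [ "$lbolt" = "$prev" ]; then
            stalled=$((stalled + 1))
        else
            stalled=0
        fi
        if [ "$stalled" -ge "$timeout" ]; then
            echo "panic at tick $tick"
            return 0
        fi
        prev=$lbolt
    done
    echo ok
}

# lbolt stops advancing at 300; with a timeout of 3 checks this trips.
deadman_sim 3 100 200 300 300 300 300 400
```

If the samples never stop arriving, the stall is detected; if the checks themselves stop (the loop never runs), nothing is reported, which corresponds to the case where the deadman timer fails.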

comment:41 Changed 4 years ago by bauer40

Good news on the bad news: the hang occurred once again, with Intel VT support disabled. So we can clearly state that THIS is not the reason.

I will re-create my VMs next weekend or maybe sooner. I'll give an update.

comment:42 Changed 4 years ago by frank

Could this problem be related to #7342? The reporter of that ticket is currently trying a new test build with a potential fix for his problem. Maybe on Solaris you just see a host crash as well (the other ticket is for Linux hosts)?

comment:43 Changed 4 years ago by ramshankar

@bauer40: VT-x disabled for all VMs or disabled in the BIOS?

comment:44 Changed 4 years ago by bauer40

@ramshankar: VT-x had been disabled for the running (not all) VMs, but not in the BIOS.

After another hang with VT-x disabled (in the VMs), I re-enabled VT-x for all running VMs. Furthermore, I made sure they are all configured the same way as a new VM would be (no PAE, but with VT-x and nested paging; the network adapter is now Intel PRO/1000 MT Desktop for all running VMs; USB and EHCI 2.0 are enabled even though they are not supported by Solaris, because I wanted it to be as close as possible to the default).

If I have another hang, I will remove all VMs and start "from scratch", re-using the disk images.

comment:45 Changed 4 years ago by bauer40

I don't think my case is similar to #7342. That reporter has problems keeping a guest up and running; in my case the physical host freezes.

comment:46 Changed 4 years ago by bauer40

Two more hangs.

I will discard all my VMs and start from scratch this weekend (recycling the disk images, of course). This will create new XML files. But I don't think this will help... why should it?

comment:47 Changed 4 years ago by bauer40

OK, I have recreated all my VM XML files from scratch, keeping only the disk images. But I found something interesting:

I had one VM (and only one) using a virtual LSILogic SCSI controller. As far as I can tell, in EVERY case where my server hung, this VM was affected, so it MIGHT be the source of the problem, even if there is no indication that it is.

I hope my system is now stable. Time will tell...

comment:48 Changed 4 years ago by frank

If the LSILogic controller is the source of the problem then this would be only a coincidence.

comment:49 Changed 3 years ago by AZweimiller

@ramshankar: I have an Intel E8400 processor on an Abit AB9 Pro motherboard with OpenSolaris b142. I recently reinstalled VirtualBox now that a new version is out (3.2.10), created a brand new VM and performed a fresh install of pfSense. It took less than 24 hours for the first hard lock, a trend that has continued since. I have looked everywhere I know and cannot find anything that looks relevant in any system or VirtualBox log files. I have also tried reconfiguring the VM to try to nail down the issue. So far, I disabled Intel VT/nested paging and still had it freeze. I then changed the type of network controller from Intel to PCNet with no success. I have 3 Realtek network cards in this server, all 3 of which are bridged to the pfSense VM. I would be happy to attach any files you request.

comment:50 Changed 3 years ago by bauer40

@AZweimiller: please confirm that you mean the same as me when you refer to "hard lock":

your physical server does not respond to anything, including typing on a console. Even pressing the NumLock-key on the keyboard does not switch on/off the NumLock indicator LED.

Is it the same for you?

comment:51 Changed 3 years ago by AZweimiller

Exactly. I cannot ping the pfSense VM or the openSolaris host. The physical console is unresponsive and "hard locked". The only remedy is to power cycle the server.

comment:52 Changed 3 years ago by bauer40

@AZweimiller: Thank you for your confirmation.

@ramshankar: I think we can clearly state that this is not caused by any of the following:

  • Hardware error
  • Intel virtual NIC
  • VDI Disk Image Error (file on disk/OS instance)
  • XML-File or configuration error.

Thus, all that's left for that issue is Oracle VirtualBox with all its kernel modules, and the Solaris kernel itself.

I'm still willing to track down this ugly bug, but I need more input and precise advice from you. Is there any possibility to enable some debugging on the VBox kernel modules? Are there possibilities left to pinpoint the problem, or do we have to live with irregular hard hangs?

comment:53 Changed 3 years ago by bauer40

I received more information about how to get more useful information on this issue. I will paste the information I received here, for further reference.

@AZweimiller: by your name, I assume that you are able to read German. So you might also be interested in this (originally in German):


A host hang is always very hard to debug. My suggestion would be that you install a debug build:

 http://www.virtualbox.org/download/testcase/VirtualBox-3.2.11-67125-SunOS-g.tar.gz

In this debug build, assertions are enabled and optimization is turned off, so it will run much more slowly on your system. The hope, however, is that an assertion triggers which could lead to a host reboot. One could also enable various kinds of logging, but that is probably not advisable at the moment.

After the installation you would have to start the VM under gdb, e.g.

gdb --args VBoxHeadless --startvm VM_NAME

Assertions may occur that are unimportant (or false positives). At every assertion gdb will drop to its prompt; with 'c' you can let the process continue. If you tell me which assertions trigger, I can tell you whether a given assertion is important or not.


comment:54 Changed 3 years ago by AZweimiller

I don't think German has been my family's native language for more than 4 generations :)

I added a second VM last night. PBX-in-a-Flash customized CentOS 5.5. This drastically increased the frequency of system freezing. Here are the reboot times from just last night alone:

[ Nov 1 19:03:42 ] [ Nov 1 20:38:48 ] [ Nov 1 22:42:44 ] [ Nov 1 23:06:15 ]

Each time, the server was unresponsive to ping and Gnome was completely unresponsive when I went to the physical console. I have no additional software or customizations of this Open Solaris installation. I installed OSOL b142 and created a zfs partition as a NAS. All settings are "out of the box". I have flawless stability from OSOL until I try to use VBox. Please advise on debugging steps and I will submit any required information.

comment:55 Changed 3 years ago by bauer40

Two more hangs - I am starting to evaluate alternatives, because I have stopped hoping that VirtualBox will _ever_ become stable enough.

I simply don't see any helpful hints from the developers on how to pinpoint the problem, so I have to leave.

comment:56 Changed 3 years ago by frank

Interesting. I provided you with a debug build and a possible strategy to start with, but apart from complaints I didn't see any further feedback.

comment:57 Changed 3 years ago by junkfer

Hi all....

I may be hitting the same error; could we work through it together? HW: Sun x2270 (2 x 2 GHz quad-core Xeon CPUs; 6 GB RAM). Host: Solaris 10 u9, EIS patched (version 2010.12); VirtualBox 3.2.12. Symptoms:

  • After a few days (about 3) the host system freezes: no login is possible anywhere, it just hangs. A restart/reset from the system controller resolves the hang.
  • Guest system: Windows 2003 Server (2 processors, 2048 MB RAM, VT-x and Nested Paging ON; bridged network, Intel PRO/1000 T Server (guest) on igb1 (host))
  • On the host side I can see free memory decreasing massively. (Booted today at 1:00 PM with 2.5 GB of free memory according to vmstat; after 10 hours of uptime I have only 500 MB left.)
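
The free-memory figures in the last bullet can be turned into a rough rate. The following Python sketch assumes a linear decrease between the first and last sample and takes "2,5GB" as 2500 MB; `free_memory_trend` is a hypothetical helper, not a VirtualBox or Solaris tool, and the samples would have to be read off vmstat by hand.

```python
def free_memory_trend(samples):
    """Estimate how fast free memory shrinks.

    samples: list of (uptime_hours, free_mb) pairs, oldest first.
    Returns (rate_mb_per_hour, hours_left). hours_left is None when
    free memory is not decreasing. Assumes a linear trend between the
    first and last sample, which is only a rough approximation.
    """
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time range")
    rate = (m1 - m0) / (t1 - t0)
    if rate >= 0:
        return rate, None
    return rate, m1 / -rate


# Figures from the report: about 2500 MB free at boot, 500 MB after 10 hours.
rate, hours_left = free_memory_trend([(0, 2500), (10, 500)])
print(f"{rate:.0f} MB/hour, about {hours_left:.1f} hours until exhausted")
```

With the reported numbers this gives a loss of about 200 MB per hour. Logging more samples over a full day would show whether the decrease really is linear or levels off.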

comment:58 follow-up: ↓ 59 Changed 3 years ago by jwythe@…

Hi All

I also have a similar problem. I have CentOS 5.7 running on four systems: three desktops and one laptop. All three desktops have single-core CPUs; the laptop is a dual-core Intel. I just upgraded to version 4.1.2 of VirtualBox and still have the same problem. Two of the desktops are working fine: one runs RHEL3 as the guest OS, the other runs W2K (started at boot via the command line, headless). The third desktop and the laptop run XP as the guest. All OSes have the latest updates. The third desktop and the laptop freeze randomly; after a few seconds the Num Lock light goes out and the Caps Lock and Scroll Lock lights flash (in unison). The only solution is to power off and back on. Sometimes my laptop goes into the same state when I shut down the Windows XP guest.

The laptop seems to run very well and very seldom freezes while I am using it. Occasionally the kernel will panic if I close the lid on the laptop and re-open it later, basically going in and out of suspend mode. It seems okay if I hibernate, but I haven't tested that recently, as a recent OS update removed hibernate as a shutdown option and I haven't figured out how to get it back. Usually, if I shut down the XP guest on the laptop, I can hibernate, but not suspend, without a kernel panic when coming out of suspend mode. The kernel panic message shows VirtualBox as one of the modules on the stack.

The desktop freezes quite often while I am working on it, usually in the guest OS, but it also freezes at no specific interval after it has been idle. There have been times when I shut down the guest OS, leaving just Linux running, and the system would still freeze. Only if I shut down VirtualBox completely does the system stay running. All systems run the network in bridged mode. All systems have the extpack loaded. Both XP systems have the Guest Additions loaded; I am not sure about W2K (I think it does), but I can't get it started since I upgraded to 4.1.2.

comment:59 in reply to: ↑ 58 Changed 3 years ago by ramshankar

Replying to jwythe@epicor.com:

The laptop seems to run very well and very seldom freezes while I am using it. Occasionally the kernel will panic if I close the lid on the laptop and re-open it later, basically going in and out of suspend mode. It seems okay if I hibernate, but I haven't tested that recently, as a recent OS update removed hibernate as a shutdown option and I haven't figured out how to get it back. Usually, if I shut down the XP guest on the laptop, I can hibernate, but not suspend, without a kernel panic when coming out of suspend mode. The kernel panic message shows VirtualBox as one of the modules on the stack. The desktop freezes quite often while I am working on it, usually in the guest OS, but it also freezes at no specific interval after it has been idle. There have been times when I shut down the guest OS, leaving just Linux running, and the system would still freeze. Only if I shut down VirtualBox completely does the system stay running. All systems run the network in bridged mode. All systems have the extpack loaded. Both XP systems have the Guest Additions loaded; I am not sure about W2K (I think it does), but I can't get it started since I upgraded to 4.1.2.

We fixed a bug in this specific area in 4.1.2, mentioned in the changelog as well:

# Linux hosts: fixed random kernel panics on host suspend / shutdown (4.1.0 regression; bug #9305)
