Ticket #8294 (closed defect: fixed)
VBox 4.0.2 - VM's unresponsive (freeze) after 1-3 days on Solaris 11 Express host => Fixed in SVN
Reported by: | tomwaters | Owned by: | |
---|---|---|---|
Component: | other | Version: | VirtualBox 4.0.2 |
Keywords: | solaris 11 express freeze | Cc: | |
Guest type: | other | Host type: | Solaris |
Description
Hi, I recently upgraded from OpenSolaris 2009.06 to Solaris 11 Express. I have three VMs: one CentOS 5.5 and two Server 2008 R2.
The VMs run in headless mode without VRDE (i.e. VBoxHeadless -startvm Server6 --vrde off), as I connect to them using VNC.
Under opensolaris 2009.06 (111b) the VM's were stable. Under Solaris 11 express, the VM's freeze/become unresponsive. Not all will become unresponsive all the time - the most recent one was a Windows server 2008R2 machine, however the CentOS 5.5 machine has also frozen and another time all three machines froze at the same time.
I can not connect to the VM or ping it (or ssh etc)....it's frozen, however it still shows as being in the running state - both in the GUI and from the command line.
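The mismatch described above (VirtualBox reports "running" while the guest no longer answers on the network) can be checked from the host. This is only a minimal sketch: the helper names `vm_state` and `vm_is_wedged` and the guest address are made up for illustration, and the `ping <host> <timeout>` form assumes the Solaris ping syntax.

```shell
#!/bin/sh
# Print the VMState value that VirtualBox reports for a VM, e.g. "running".
vm_state() {
    VBoxManage showvminfo "$1" --machinereadable |
        sed -n 's/^VMState="\(.*\)"$/\1/p'
}

# Succeed (exit 0) when VirtualBox says the VM is running but the guest
# does not answer a ping. $1 = VM name, $2 = guest IP (hypothetical value).
vm_is_wedged() {
    [ "$(vm_state "$1")" = "running" ] || return 1
    # Assumed Solaris syntax: ping <host> <timeout-in-seconds>
    if ping "$2" 5 >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Example: vm_is_wedged Server5 192.0.2.10 && echo "Server5 looks frozen"
```

A guest that simply blocks ICMP would also be flagged, so a check like this is a hint rather than proof of a freeze.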
I had the preview window enabled (now disabled after reading a post on here - I'll see if that makes any difference).
Nested Paging is enabled.
Host system is a Xeon 3370 on an Intel S3210SHLC motherboard with 8 GB of ECC RAM.
Let me know if you need additional information or need me to run some tests. I am keen to get this fixed as I'd like to stay on Solaris 11 Express (151), but may need to return to OpenSolaris (111b) to have a stable VBox.
Thanks.
Attachments
Change History
comment:1 Changed 12 years ago by tomwaters
Anyone home at Oracle?
Meanwhile, another VM has crashed. I checked top, and note that it goes to 100% utilisation of the CPU (one cpu is allocated to this VM) when it fails...see process "1418".
load averages: 1.74, 1.73, 1.73; up 1+20:08:40
123 processes: 118 sleeping, 2 zombie, 3 on cpu
CPU states: 57.5% idle, 6.0% user, 36.6% kernel, 0.0% iowait, 0.0% swap
Kernel: 15866 ctxsw, 4649 trap, 8983 intr, 205571 syscall, 14 fork, 3355 flt
Memory: 8188M phys mem, 313M free mem, 4094M total swap, 4094M free swap

 PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
1418 nas       155  50    0 1158M 1127M cpu/3  502:30 24.68% VBoxHeadless
1392 nas       131  53    0 1351M 1320M cpu/2   21.6H 12.49% VBoxHeadless
1430 nas         3  59    0  266M  208M sleep   19:54  0.50% Xorg
1390 nas       152  59    0  775M  744M sleep   92:22  0.41% VBoxHeadless
1329 nas        13  59    0   26M   14M sleep    5:32  0.25% VBoxSVC
Anyone want to suggest something, anything?
Changed 12 years ago by tomwaters
attachment vbox_poweroff_error.txt added
VMSetError: /export/home/vbox/tinderbox/sol-rel/src/VBox/VMM/VMMR3/VM.cpp(3268) int vmR3TrySetState(VM*, const char*, unsigned int, ...); rc=VERR_VM_INVALID_VM_STATE
comment:3 Changed 12 years ago by tomwaters
Thanks for that frank. I left it as "Other" as the instructions said to unless I know it applied to just a specific host. Will post future issues against Solaris host.
fyi. The VM hung again after just a few hours this time...and when I tried to power it off it has "hung" the session on my Solaris host...
nas@nas:~$ VBoxManage controlvm Server5 poweroff
0%...10%...
The VBox gui says "Stopping"....and I see some odd errors in the log (attached as "vbox_poweroff_error.txt") saying "VMR3Suspend failed because the current VM state, SUSPENDING, was not found in the state transition table"...
The process is still running but not using any cpu (.01%). I need to kill the process to stop the machine.
fyi - I am running the latest guest additions and preview and Nested paging are disabled...not that it appears to make any difference.
comment:4 Changed 12 years ago by tomwaters
And it's frozen again...
Guys, what do you need from me to help debug this? A core dump? if so tell me how and I'll provide it.
I really need to resolve this - please?
comment:5 Changed 12 years ago by frank
Please have a look at the user manual section 9.13. A core dump taken when the guest is frozen would indeed help. If that method does not work (perhaps because the guest is frozen), try the other method, see here.
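For readers without the manual at hand, one generic way to capture a core file of a running process on a Solaris host is `gcore`. The sketch below is not necessarily either of the methods referred to above; the helper name `dump_vm_process` is made up for illustration, the output prefix is an example, and it will probably need to be run as root.

```shell
#!/bin/sh
# Write a core file for the VBoxHeadless process that runs the given VM.
# $1 = VM name, $2 = output prefix; gcore appends ".<pid>" to the prefix.
dump_vm_process() {
    # pgrep -f matches against the full command line of each process.
    pid=$(pgrep -f "VBoxHeadless.*$1" | head -1)
    [ -n "$pid" ] || { echo "no VBoxHeadless process found for $1" >&2; return 1; }
    gcore -o "$2" "$pid"
}

# Example: dump_vm_process Server5 /cloud/coredump/core.VBoxHeadless
```

The dump is only useful if it is taken while the guest is actually frozen, so the process should be left running until then.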
comment:6 Changed 12 years ago by tomwaters
Excellent...thanks for that frank...I just uninstalled 4.0.2 and was trying 3.12...but will reinstall 4.0.2, wait for it to hang and do a core dump as per the manual. Thanks again for helping me out with this.
I'll email it to you as soon as I get a dump.
comment:7 Changed 12 years ago by tomwaters
Hmmm...may be related to power management... "Since 4.0, there are some more ACPI options available and when you have the GA installed, they become enabled in the VM OS. Open the Guest OS power management and disable the actions for hibernate and suspend."
I disabled all power management features and screensaver and have not had a crash in the last 2 days...fingers crossed.
comment:8 Changed 12 years ago by frank
- Summary changed from VBox 4.0.2 - VM's unresponsive (freeze) after 1-3 days on Solaris 11 Express host to VBox 4.0.2 - VM's unresponsive (freeze) after 1-3 days on Solaris 11 Express host => Fixed in SVN
Thanks for the feedback! In that case, your bug should be fixed in the upcoming maintenance release.
comment:9 Changed 12 years ago by tomwaters
Frank, Great to hear...still running solid...so yep, happy to attribute this as root cause.
Look forward to the next point release.
Thank you to you and the team for the support.
comment:10 Changed 12 years ago by tomwaters
Frank, can you pls. re-open this ticket?
I spoke too soon... it crashed overnight.
VBoxHeadless: error: Code NS_ERROR_CALL_FAILED (0x800706BE) - Call to remote object failed (extended info not available)
Context: "COMGETTER(EventSource)(es.asOutParam())" at line 1244 of file VBoxHeadless.cpp
VBoxSVC became unavailable, exiting.
VBoxHeadless: error: Code NS_ERROR_CALL_FAILED (0x800706BE) - Call to remote object failed (extended info not available)
Context: "COMGETTER(EventSource)(es.asOutParam())" at line 1244 of file VBoxHeadless.cpp
[1]  Illegal Instruction (core dumped) VBoxHeadless -startvm Server5 --vrde off
[2]- Done                    VBoxHeadless -startvm Server6 --vrde off
I have the core dump and will send it through to you frank.
as@nas:/cloud/coredump# ls -l
total 282650
-rw------- 1 root media 256401063 2011-02-16 07:49 core.VBoxHeadless.9776
-rw------- 1 root media  32757671 2011-02-16 07:49 core.VBoxSVC.9539
comment:11 Changed 12 years ago by ramshankar
Could you upload the VBox.log from this session you took the core file in if you still have it? I know you attached one at the beginning of this ticket, but it'd be better if we had the appropriate log.
comment:12 Changed 12 years ago by ramshankar
I suspect this might be an issue with asynchronous IO. Could you enable "Host IO Cache" in your VM storage settings for the controller and re-try?
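The same setting can be changed from the command line with `VBoxManage storagectl`. A minimal sketch, assuming the VM is powered off first; the helper name `enable_host_io_cache` is made up for illustration and the controller name in the example is a guess, so check the real name in the VM's storage settings.

```shell
#!/bin/sh
# Turn the host I/O cache on for one storage controller of a VM.
# $1 = VM name, $2 = controller name as listed by "VBoxManage showvminfo".
enable_host_io_cache() {
    VBoxManage storagectl "$1" --name "$2" --hostiocache on
}

# Example (controller name is a guess):
# enable_host_io_cache Server5 "SATA Controller"
```

Passing `off` instead of `on` reverts the controller to the previous behaviour once the test is over.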
comment:13 Changed 12 years ago by tomwaters
Thanks for getting back to me. I do not have the log as it seems to only keep 3 versions, and I recently deleted and recreated the zpool, so I cannot get the old snapshots back...sorry.
I have ticked Host IO Cache for all the guests and restarted them. Will let you know how they go.
Note: I have updated to the latest VBox release, 4.0.4, and have also updated all the guests with the latest guest additions.
Can you outline the likely impact of having this enabled? I read the host I/O caching section in chapter 5 and see comments that this may slow down the host immensely, waste memory, etc.
Ie. Is this a temporary test setting for me to identify the issue or a long term setting?
comment:14 follow-up: ↓ 15 Changed 12 years ago by ramshankar
The impact will not be terrible on ZFS. The setting is for narrowing down the issue to find the root cause.
comment:15 in reply to: ↑ 14 Changed 12 years ago by tomwaters
Sorry about the delays in updating - I needed to power-cycle the server a few times as I was moving HBA cards etc...
Have now had it up for 7 days and all the VMs are running perfectly. Previously I would have seen the issue within 1-3 days...so looking good.
Also, I cannot really see any performance issues when running with Host IO cache ticked.
So, looks like you nailed it.
Is there anything else you need from me to debug this?
comment:16 Changed 12 years ago by ramshankar
This has been fixed internally and should be available in the next release. Thank you for the report.
comment:17 in reply to: ↑ description Changed 12 years ago by tomwaters
Brilliant. Thank you to you and the team for resolving this so promptly. Outstanding work.
Looking forward to the next release.