#8294 closed defect (fixed)

VBox 4.0.2 - VM's unresponsive (freeze) after 1-3 days on Solaris 11 Express host => Fixed in SVN

Reported by:	tomwaters	Owned by:
Component:	other	Version:	VirtualBox 4.0.2
Keywords:	solaris 11 express freeze	Cc:
Guest type:	other	Host type:	Solaris

Description

Hi, I recently moved from OpenSolaris 2009.06 to Solaris 11 Express. I have three VMs: one CentOS 5.5 and two Windows Server 2008 R2.

The VMs run in headless mode without VRDE (i.e. VBoxHeadless -startvm Server6 --vrde off), as I connect to them using VNC.

Under OpenSolaris 2009.06 (111b) the VMs were stable. Under Solaris 11 Express, the VMs freeze or become unresponsive. They do not all freeze every time: the most recent one was a Windows Server 2008 R2 machine, but the CentOS 5.5 machine has also frozen, and on another occasion all three machines froze at the same time.

I cannot connect to the frozen VM or ping it (or ssh, etc.). However, it still shows as being in the running state, both in the GUI and from the command line.
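For anyone trying to reproduce this, the "running" state can be read from the command line roughly as follows. This is only a minimal sketch: it assumes the `VMState="..."` line printed by `VBoxManage showvminfo --machinereadable`, and the helper names `parse_vm_state` and `vm_state` are made up for illustration.

```shell
# Extract the VM state from "showvminfo --machinereadable" output on stdin.
# Only the exact VMState="..." line matches; VMStateChangeTime is ignored.
parse_vm_state() {
    sed -n 's/^VMState="\(.*\)"$/\1/p'
}

# Hypothetical convenience wrapper: print the state of the named VM.
vm_state() {
    VBoxManage showvminfo "$1" --machinereadable | parse_vm_state
}
```

Calling `vm_state Server6` prints the state VirtualBox reports. As described above, a frozen guest can still report "running", so this check only makes sense alongside a network probe such as ping.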

I had the preview window enabled (now disabled after reading a post on here - I'll see if that makes any difference).

Nested Paging is enabled.

The host system is a Xeon 3370 on an Intel S3210SHLC motherboard with 8 GB of ECC RAM.

Let me know if you need additional information or need me to run some tests. I am keen to get this fixed as I'd like to stay on Solaris 11 Express (151), but I may need to return to OpenSolaris (111b) to have a stable VBox.

Thanks.

Attachments (2)

vbox_ticket.txt (61.8 KB ) - added by tomwaters 14 years ago.
vbox_poweroff_error.txt (43.9 KB ) - added by tomwaters 14 years ago.: VMSetError: /export/home/vbox/tinderbox/sol-rel/src/VBox/VMM/VMMR3/VM.cpp(3268) int vmR3TrySetState(VM*, const char*, unsigned int, ...); rc=VERR_VM_INVALID_VM_STATE


Change History (20)

by tomwaters, 14 years ago

Attachment:	vbox_ticket.txt added

comment:1 by tomwaters, 14 years ago

Anyone home at Oracle?

Meanwhile, another VM has crashed. I checked top and noticed that the VM's process goes to 100% utilisation of its CPU (one CPU is allocated to this VM) when it fails; see process 1418 below.

load averages:  1.74,  1.73,  1.73;               up 1+20:08:40                      
123 processes: 118 sleeping, 2 zombie, 3 on cpu
CPU states: 57.5% idle,  6.0% user, 36.6% kernel,  0.0% iowait,  0.0% swap
Kernel: 15866 ctxsw, 4649 trap, 8983 intr, 205571 syscall, 14 fork, 3355 flt
Memory: 8188M phys mem, 313M free mem, 4094M total swap, 4094M free swap

   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
  1418 nas       155  50    0 1158M 1127M cpu/3  502:30 24.68% VBoxHeadless
  1392 nas       131  53    0 1351M 1320M cpu/2   21.6H 12.49% VBoxHeadless
  1430 nas         3  59    0  266M  208M sleep   19:54  0.50% Xorg
  1390 nas       152  59    0  775M  744M sleep   92:22  0.41% VBoxHeadless
  1329 nas        13  59    0   26M   14M sleep    5:32  0.25% VBoxSVC

Anyone want to suggest something, anything?
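A small filter can pick out this pattern from a top listing. This is a sketch only: the function name `busy_vbox_procs` and the default threshold of 20 are assumptions, and the column layout is taken from the listing above. The host appears to have four CPUs (top shows cpu/2 and cpu/3), so roughly 25% of total CPU would correspond to one fully busy core.

```shell
# Print "PID CPU%" for VBoxHeadless processes at or above a CPU threshold.
# Reads top-style process lines on stdin, with the columns shown above:
# PID USERNAME NLWP PRI NICE SIZE RES STATE TIME CPU COMMAND
busy_vbox_procs() {
    threshold="${1:-20}"    # percent of total host CPU (assumed default)
    awk -v t="$threshold" '
        $NF == "VBoxHeadless" {
            cpu = $(NF - 1)       # CPU column, e.g. "24.68%"
            sub(/%$/, "", cpu)    # strip the trailing percent sign
            if (cpu + 0 >= t + 0) print $1, cpu
        }'
}
```

Fed the five process lines above with the default threshold, this reports only PID 1418.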

comment:2 by Frank Mehnert, 14 years ago

Host type:	other → Solaris

by tomwaters, 14 years ago

Attachment:	vbox_poweroff_error.txt added

VMSetError: /export/home/vbox/tinderbox/sol-rel/src/VBox/VMM/VMMR3/VM.cpp(3268) int vmR3TrySetState(VM*, const char*, unsigned int, ...); rc=VERR_VM_INVALID_VM_STATE

comment:3 by tomwaters, 14 years ago

Thanks for that, Frank. I left it as "Other" because the instructions said to do so unless I knew the issue applied to just one specific host. I will post future issues against the Solaris host.

FYI, the VM hung again, after just a few hours this time, and when I tried to power it off it "hung" the session on my Solaris host:

nas@nas:~$ VBoxManage controlvm Server5 poweroff
0%...10%...

The VBox GUI says "Stopping", and I see some odd errors in the log (attached as "vbox_poweroff_error.txt") saying "VMR3Suspend failed because the current VM state, SUSPENDING, was not found in the state transition table".

The process is still running but not using any CPU (.01%). I need to kill the process to stop the machine.

FYI, I am running the latest Guest Additions, and both the preview and nested paging are disabled, not that it appears to make any difference.

comment:4 by tomwaters, 14 years ago

And it's frozen again...

Guys, what do you need from me to help debug this? A core dump? If so, tell me how and I'll provide it.

I really need to resolve this - please?

comment:5 by Frank Mehnert, 14 years ago

Please have a look at the user manual section 9.13. A core dump taken when the guest is frozen would indeed help. If that method does not work (perhaps because the guest is frozen), try the other method, see here.

comment:6 by tomwaters, 14 years ago

Excellent, thanks for that, Frank. I had just uninstalled 4.0.2 and was trying 3.12, but I will reinstall 4.0.2, wait for it to hang, and take a core dump as per the manual. Thanks again for helping me out with this.

I'll email it to you as soon as I get a dump.

comment:7 by tomwaters, 14 years ago

Hmmm, this may be related to power management: "Since 4.0, there are some more ACPI options available and when you have the GA installed, they become enabled in the VM OS. Open the Guest OS power management and disable the actions for hibernate and suspend."

I disabled all power management features and the screensaver and have not had a crash in the last 2 days. Fingers crossed.

comment:8 by Frank Mehnert, 14 years ago

Summary:	VBox 4.0.2 - VM's unresponsive (freeze) after 1-3 days on Solaris 11 Express host → VBox 4.0.2 - VM's unresponsive (freeze) after 1-3 days on Solaris 11 Express host => Fixed in SVN

Thanks for the feedback! In that case, your bug should be fixed in the upcoming maintenance release.

comment:9 by tomwaters, 14 years ago

Frank, great to hear. It is still running solid, so yes, I am happy to attribute this as the root cause.

Looking forward to the next point release.

Thank you to you and the team for the support.

comment:10 by tomwaters, 14 years ago

Frank, can you please re-open this ticket?

I spoke too soon... it crashed overnight.

VBoxHeadless: error: Code NS_ERROR_CALL_FAILED (0x800706BE) - Call to remote object failed (extended info not available)
Context: "COMGETTER(EventSource)(es.asOutParam())" at line 1244 of file VBoxHeadless.cpp
VBoxSVC became unavailable, exiting.
VBoxHeadless: error: Code NS_ERROR_CALL_FAILED (0x800706BE) - Call to remote object failed (extended info not available)
Context: "COMGETTER(EventSource)(es.asOutParam())" at line 1244 of file VBoxHeadless.cpp

[1]   Illegal Instruction     (core dumped) VBoxHeadless -startvm Server5 --vrde off
[2]-  Done                    VBoxHeadless -startvm Server6 --vrde off

I have the core dump and will send it through to you, Frank.

as@nas:/cloud/coredump# ls -l
total 282650
-rw------- 1 root media 256401063 2011-02-16 07:49 core.VBoxHeadless.9776
-rw------- 1 root media  32757671 2011-02-16 07:49 core.VBoxSVC.9539

comment:11 by Ramshankar Venkataraman, 14 years ago

Could you upload the VBox.log from this session you took the core file in if you still have it? I know you attached one at the beginning of this ticket, but it'd be better if we had the appropriate log.

comment:12 by Ramshankar Venkataraman, 14 years ago

I suspect this might be an issue with asynchronous IO. Could you enable "Host IO Cache" in your VM storage settings for the controller and re-try?
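The same setting can likely be applied from the command line with `VBoxManage storagectl`. This is a rough sketch under stated assumptions: the controller name "SATA Controller" is a placeholder (the real names are listed by `VBoxManage showvminfo`), the VM probably needs to be powered off first, and the `VBOXMANAGE` override variable is a hypothetical addition for dry runs.

```shell
# Enable the host I/O cache for one storage controller of one VM.
# $1 = VM name, $2 = storage controller name as shown by showvminfo.
# VBOXMANAGE is a hypothetical override: set it to "echo" for a dry run.
enable_host_io_cache() {
    vm="$1"
    controller="$2"
    ${VBOXMANAGE:-VBoxManage} storagectl "$vm" --name "$controller" --hostiocache on
}
```

For example, `VBOXMANAGE=echo enable_host_io_cache Server5 "SATA Controller"` prints the command that would be run, without touching the VM.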

comment:13 by tomwaters, 14 years ago

Thanks for getting back to me. I do not have the log, as VBox seems to keep only 3 versions, and I recently deleted and recreated the zpool, so I cannot get the old snapshots back. Sorry.

I have ticked Host IO Cache for all the guests and restarted them. I will let you know how they go.

Note: I have updated to the latest VBox release, 4.0.4. I have also updated all the guests with the latest Guest Additions.

Can you outline the likely impact of having this enabled? I read the host I/O caching section in chapter 5 and see comments that it may slow down the host immensely, waste memory, etc.

I.e., is this a temporary test setting for me to identify the issue, or a long-term setting?

follow-up: 15 comment:14 by Ramshankar Venkataraman, 14 years ago

The impact will not be terrible on ZFS. The setting is for narrowing down the issue to find the root cause.

in reply to: 14 comment:15 by tomwaters, 14 years ago

Sorry about the delay in updating. I needed to power-cycle the server a few times as I was moving HBA cards etc.

I have now had it up for 7 days and all the VMs are running perfectly. Previously I would have seen the issue within 1-3 days, so it is looking good.

Also, I cannot really see any performance issues when running with Host IO Cache ticked.

So, looks like you nailed it.

Is there anything else you need from me to debug this?

comment:16 by Ramshankar Venkataraman, 14 years ago

This has been fixed internally and should be available in the next release. Thank you for the report.

in reply to: description comment:17 by tomwaters, 14 years ago

Brilliant. Thank you to you and the team for resolving this so promptly. Outstanding work.

Looking forward to the next release.

comment:18 by Frank Mehnert, 14 years ago

Resolution:	→ fixed
Status:	new → closed
