Ticket #5000 (closed defect: fixed)

Opened 9 years ago

Last modified 9 years ago

Host crashes with "BAD TRAP" when a guest runs with more than one CPU

Reported by: joe42
Owned by:
Priority: major
Component: guest smp
Version: VirtualBox 3.0.6
Keywords:
Cc:
Guest type: other
Host type: Solaris


Host: Solaris 10 (kernel Generic_141415-07), two dual-core AMD 2210 processors, 4 GiB memory (Sun Fire x2100 box)

After upgrading from 3.0.4 to 3.0.6, the host has started to crash whenever I run a multi-CPU guest. This is reproducible with both a 32-bit Windows 2000 guest with the guest utilities installed and a 32-bit Debian 5 guest without them.

It will "reliably" crash within a few hours if a guest system runs with more than 1 CPU. Previously, with VirtualBox 3.0.4 the same guests ran reliably (but very slowly) in this mode.

If the number of CPUs is set to 1 in the VM settings, the host system remains stable.

The error in dmesg on the host system is:

savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=fffffe80007af670 addr=a90 occurred in module "<unknown>" due to a NULL pointer dereference
savecore: [ID 748169 auth.error] saving system crash dump in /var/crash//*.3
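
For anyone triaging similar reports, the fields of the panic line above can be pulled apart mechanically. What follows is only a minimal Python sketch: the pattern is modelled on the single message quoted in this ticket, so other Solaris releases may word it differently, and the name `parse_bad_trap` is hypothetical (it is not part of Solaris or VirtualBox).

```python
import re

# Assumption: the message keeps the shape of the one line quoted in this
# ticket. This is not an official description of the Solaris panic format.
BAD_TRAP_RE = re.compile(
    r'BAD TRAP: type=(?P<type>[0-9a-f]+) \((?P<desc>[^)]*)\) '
    r'rp=(?P<rp>[0-9a-f]+) addr=(?P<addr>[0-9a-f]+) '
    r'occurred in module "(?P<module>[^"]*)"'
)


def parse_bad_trap(line):
    """Return the trap fields as a dict, or None if the line does not match."""
    match = BAD_TRAP_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # type and addr are printed in hexadecimal; convert them to integers.
    fields["type"] = int(fields["type"], 16)
    fields["addr"] = int(fields["addr"], 16)
    return fields
```

Applied to the first savecore line above, this gives trap type 14, module `<unknown>`, and a fault address of 0xa90, a small offset that is consistent with the "NULL pointer dereference" wording in the message.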

Is there anything I can do to help debug this problem?


VBox.log.3 (36.0 KB) - added by joe42 9 years ago.
Guest log from NumCPUs=2 32-bit Windows 2000 guest that crashed the host

Change History

comment:1 Changed 9 years ago by sandervl73

Could you attach the VBox.log of such a session? (no need to wait for the host crash)

Changed 9 years ago by joe42

Guest log from NumCPUs=2 32-bit Windows 2000 guest that crashed the host

comment:2 Changed 9 years ago by joe42

After installing the most recent recommended patch cluster, I now have kernel Generic_141415-10.

The situation has changed slightly:

First of all: I no longer get a BAD TRAP. Instead, I get a SUNOS-8000-FU event in the fault manager, and after that the host is almost frozen. "Almost", as in, the console still accepts keyboard input, but I cannot log in (after typing 'root' and hitting enter, nothing happens) and the machine is off the network too.

Second: The above happens even when I run only single-CPU guests.

As before, there is nothing unusual in the guest log leading up to the host crash. For the two hours before the crash, the only message in the guest log is "NAT: ARP request sent".
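
A check like this can be repeated on later logs with a small filter. The Python sketch below is only an illustration: the helper name `unusual_lines` is hypothetical, and treating "NAT: ARP request sent" as the only benign message is an assumption taken from this comment, not a general rule for VBox.log.

```python
def unusual_lines(lines, benign=("NAT: ARP request sent",)):
    """Yield non-empty log lines that contain none of the benign messages."""
    for line in lines:
        text = line.rstrip("\n")
        # Skip blank lines and anything matching a known-benign message.
        if text and not any(message in text for message in benign):
            yield text
```

It takes any iterable of lines, so an open log file can be passed in directly, for example `with open("VBox.log") as log: print(list(unusual_lines(log)))`.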

comment:3 follow-up: ↓ 4 Changed 9 years ago by ramshankar

Anything under /var/adm/messages? Any core files in /var/crash/<host>/ ?

comment:4 in reply to: ↑ 3 Changed 9 years ago by joe42

Replying to ramshankar:

Anything under /var/adm/messages? Any core files in /var/crash/<host>/ ?

I'm not sure how to answer this more specifically. The original report clearly states the kernel log message which is quite explicit about the core files.

comment:5 Changed 9 years ago by joe42

A status update on the current state of affairs:

I downgraded to 3.0.4 only to find that the problem persists. So I guess the good news is that it is not a new bug in 3.0.6 :)

Running only two VMs (one SLES 10 SP2 64-bit single processor, one XP 32-bit single processor), the host still crashes. The crashes are now frequently silent hangs though; the host system simply freezes up without any errors on the console or in the logs.

The server is a Sun Fire X2200M2 (I mis-stated this in the original report I'm afraid, but the boxes are somewhat similar). The event log on the BMC is quiet, so there are no obvious power/fan/temperature/... failures on the system. If no one has any better ideas, I am going to start the system up without the VirtualBox modules and run some big compile jobs or whatever - anything to exercise the CPUs and memory - then see if the box can stay up for a day under load. At least this will let me determine whether the problem is actually VirtualBox-related or whether I am chasing ghosts.

comment:6 Changed 9 years ago by ramshankar

Could you please try this test build:

(Note: this build will only be available for the next 14 days.)

comment:7 Changed 9 years ago by joe42

I will.

The system has run for a little more than three hours with load 10+ doing parallel compilation. This load is a lot heavier than what it had with the two single-processor VirtualBox guests. So far everything is running perfectly fine. I'm concluding, for now, that the hardware + Solaris combination alone is not completely broken. The system is stable under load.

I'm installing the 3.0.7 test build now and will report back when I know more.

comment:8 Changed 9 years ago by frank

  • Component changed from other to guest smp

comment:9 Changed 9 years ago by joe42

The host now has an uptime record; 6 days and counting :)

I have run it with the two single-CPU Linux guests without interruption for the past four days, and I have even run an extra 32-bit Windows guest for the past three days. The guests and the host all stay up and running.

So, while this of course doesn't prove anything for certain, I would say it is a pretty good indication that r52830 fixed something. The host would crash after at most two days of uptime before.

I will continue running this build and report anything else of interest.

comment:10 Changed 9 years ago by joe42

After a record 12 days of uptime, I am now upgrading to the 3.0.8 release. I see in the changelog that several Solaris host crashes have been resolved; I assume all the fixes from the test build are in the 3.0.8 release.

comment:11 Changed 9 years ago by sandervl73

  • Status changed from new to closed
  • Resolution set to fixed

3.0.8 is basically your 3.0.7 plus some extra fixes, so yes. I'll close this one assuming the problem is fixed. If not, please reopen.
