VirtualBox

Ticket #9718 (reopened defect)

Opened 3 years ago

Last modified 4 months ago

Random app crashes in the guest

Reported by: rojer Owned by:
Priority: major Component: other
Version: VirtualBox 4.1.4 Keywords: random crashes, memory corruption
Cc: Guest type: other
Host type: other

Description (last modified by frank) (diff)

Host: Ubuntu 10.04 (Lucid) x86_64, 4.1.4 Guest: Ubuntu 11.04 (Natty) x86

Apps in guest crash randomly. The most susceptible seems to be Adobe Flash plugin, within Chrome or Firefox. Chrome itself crashes sometimes too, but Flash is really unstable: it rarely survives 1 minute of playing a YouTube video.

What I've tried and eliminated so far:

1) Host is rock solid, no random crashes, plays YouTube video fine.
2) Memory is fine - I've run memtest86 on both host and guest.
3) Tried downgrading to 4.0.12 - same thing.
4) Presence of guest utils makes no difference - crashes happen with and without them.
5) Tried 1 and 2 CPUs for the guest, different emulated hardware (PIIX/ICH), different amounts of memory (4, 3, 2G of RAM for guest, host has 8G).
6) nothing else is happening on the host. Log in, start chrome, open a youtube video -> boom, flash plugin crashes.
7) Guest kernel seems stable, no panics.
8) On two occasions Thunderbird reported strange errors: incorrect MAC on SSL packets. This suggests that memory corruption is happening somewhere.

Attachments

nbx-2011-10-07-18-07-42.log Download (78.2 KB) - added by rojer 3 years ago.
nbx-2011-10-07-17-59-03.log Download (108.9 KB) - added by rojer 3 years ago.
ssh_corruption.log Download (1.1 KB) - added by rojer 3 years ago.
dmesg.log Download (31.2 KB) - added by rojer 3 years ago.
sysctl_route.log Download (45.1 KB) - added by rojer 2 years ago.
output of "sysctl -a" and route
lspci.txt Download (6.7 KB) - added by rojer 2 years ago.
output of lspci and lspci -v

Change History

Changed 3 years ago by rojer

Changed 3 years ago by rojer

comment:1 Changed 3 years ago by Hachiman

  • Description modified (diff)

comment:2 Changed 3 years ago by Hachiman

Looks like it's a networking issue. Does it happen for you with bridged networking, or with NAT too?

comment:3 follow-up: ↓ 4 Changed 3 years ago by rojer

It was set to bridging all this time and I couldn't even watch a single video, now having set it to NAT it's a bit more palatable but still, having left the guest running for the night I came back to find that Chrome had crashed and Thunderbird reported a bad SSL MAC. In other words, setting networking to NAT does seem to help a bit but corruption is still happening.

comment:4 in reply to: ↑ 3 Changed 3 years ago by Hachiman

Replying to rojer:

It was set to bridging all this time and I couldn't even watch a single video, now having set it to NAT it's a bit more palatable but still, having left the guest running for the night I came back to find that Chrome had crashed and Thunderbird reported a bad SSL MAC. In other words, setting networking to NAT does seem to help a bit but corruption is still happening.

Could you please give a hint as to which SSL resources (URLs) provoke the guest's browser crashes?

comment:5 follow-up: ↓ 6 Changed 3 years ago by rojer

I don't think it's specific to SSL. I think crashes are caused by memory being stomped somehow, but SSL is one of the few things that can actually detect this by way of checksumming (but if you really need to know, errors were reported on an IMAPS connection to Gmail - i have a Gmail IMAP account configured in Thunderbird and Gmail requires SSL). In other cases it just causes crashes and (quite possibly) silent corruption elsewhere. In fact, I had suspicious corruption of dpkg database in this VM that may be explained by this. For whatever reason, NAT seems to cause less memory corruption than bridging though it doesn't eliminate it completely. I also occasionally see visual artifacts on the guest's screen.

comment:6 in reply to: ↑ 5 Changed 3 years ago by Hachiman

Replying to rojer: Does it depend on which network device emulation you are using, pcnet or e1000?

Changed 3 years ago by rojer

comment:7 follow-up: ↓ 8 Changed 3 years ago by rojer

Haven't tried it, and it probably doesn't matter, because the corruption doesn't seem to be related to network adapter, after all. I found an easy way to reproduce it: ssh to localhost, transfer some data, get data corruption within seconds (ok, sometimes a minute or two -- see attached ssh_corruption.log).

This same command pipeline runs on the host without any problems. since it's localhost, it shouldn't involve networking drivers. but something somewhere corrupts the data, and it only happens in the guest. i have to add that guest was running fine while it was still on the physical machine, so it must be somehow related to virtualbox. but likely not to network drivers.
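The ssh test notices the damage only because SSH verifies a MAC on every packet. A minimal sketch of a complementary check that needs no ssh at all: push a stream of zeros through a pipe and count every byte that is not NUL. The helper name `count_nonzero` is made up for this illustration, and the check only exercises pipe buffers inside the guest, so a clean run does not rule the bug out.

```shell
#!/bin/sh
# count_nonzero: read stdin and print how many bytes are not NUL.
# Hypothetical helper; for a stream of zeros the expected answer is 0.
count_nonzero() {
    # Delete every NUL byte, then count whatever survived.
    tr -d '\000' | wc -c | tr -d ' '
}
```

In the guest, something like `head -c 1000000000 /dev/zero | count_nonzero` should print 0. Any other number would mean bytes changed somewhere between the producer and the consumer.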

comment:8 in reply to: ↑ 7 Changed 3 years ago by Hachiman

Replying to rojer:

Haven't tried it, and it probably doesn't matter, because the corruption doesn't seem to be related to network adapter, after all.

VirtualBox networking has two main blocks: the network attachment (NAT, bridged and so on) and the network device emulation (AMD pcnet and Intel e1000). We've checked that you see the same issue, detected by the guest TCP/IP stack, with both attachments (there is only a small chance that both code paths have exactly the same bug), so we need to check whether the choice of device emulation changes the behavior.

comment:9 follow-up: ↓ 10 Changed 3 years ago by rojer

i understand, but look at the attached log: i'm using 127.0.0.1 and still getting corrupt packets, which means that it's not related to virtualbox network drivers.

comment:10 in reply to: ↑ 9 Changed 3 years ago by Hachiman

Replying to rojer:

i understand, but look at the attached log: i'm using 127.0.0.1 and still getting corrupt packets, which means that it's not related to virtualbox network drivers.

Do you mean that this test was done on the guest side?

comment:11 Changed 3 years ago by rojer

yes, that's done on guest.

and i have something else to add. corruption does seem to correlate with real network activity. when doing previous test i had other apps running in the background - chrome, thunderbird, pidgin. now i've done tests without background network traffic and if there's nothing else running, corruption on ssh to localhost doesn't happen. however, if i induce some real network traffic, corruption does happen. here's what i did: i reboot the virtual machine and start two terminal windows. in the first, i start the localhost ssh loop (time ssh 127.0.0.1 cat /dev/zero > /dev/null). if i just leave it alone, it will run just fine, consuming cpu as expected. however, if i now start the same loop but talking to a remote server instead (time ssh my_server.example.org cat /dev/zero > /dev/null), the local loop will crash very soon. the second will eventually too, but it's the induced crash of the first that proves that there's memory stomping going on somewhere in the virtualbox network plumbing.

i have now tested AMD adapter type too (previously i was using intel pro 1000) and the same thing is happening. so it must be something else.
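To compare configurations (adapter type, attachment, background traffic) it helps to measure how long the local loop survives. The sketch below is one way to do that, assuming bounded transfers instead of the endless `cat /dev/zero` used above; the function name `run_until_fail` and the iteration limit are illustrative, not part of the original test.

```shell
#!/bin/sh
# run_until_fail MAX CMD...: run CMD repeatedly, at most MAX times.
# Prints the 1-based iteration at which CMD first failed and returns 1,
# or prints "ok" and returns 0 if CMD never failed.
run_until_fail() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        # Discard the command's output; only its exit status matters here.
        if ! "$@" >/dev/null 2>&1; then
            echo "$i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "ok"
}
```

In terminal 1 this could be run as `run_until_fail 100 ssh 127.0.0.1 'head -c 100000000 /dev/zero'` while the remote transfer runs in terminal 2. A corrupted packet makes ssh exit with a non-zero status, which ends the loop and reports the iteration count.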

comment:12 Changed 3 years ago by rojer

one more datapoint: induced network traffic will crash glxgears (by disrupting X connection, perhaps?). run glxgears in terminal 1:

rojer@nbx:~$ glxgears
OpenGL Warning: Failed to connect to host. Make sure 3D acceleration is enabled for this VM.
2005 frames in 5.0 seconds = 400.926 FPS
1798 frames in 5.0 seconds = 359.461 FPS
1900 frames in 5.0 seconds = 379.984 FPS
... (runs fine on its own)

now start inducing network traffic in terminal 2 (i've changed from using SSH to flood pinging my DSL router):

rojer@nbx:~$ sudo ping -f 192.168.1.1

and you start seeing visual artifacts in the glxgears window and pretty soon you get this in terminal 1:

1676 frames in 5.0 seconds = 335.083 FPS
Mesa 7.10.2 implementation error: bad datatype in interpolate_int_colors
Please report at bugs.freedesktop.org
Mesa 7.10.2 implementation error: Invalid datatype in _mesa_convert_colors
Please report at bugs.freedesktop.org

comment:13 Changed 3 years ago by Hachiman

Could you please attach dmesg from your guest after doing your "ssh based" test?

comment:14 Changed 3 years ago by frank

Wild guess: Could you also check what happens if you downgrade the guest memory to 1GB or 512MB?

comment:15 Changed 3 years ago by frank

Oh, and an important test would be to disable nested paging. I would appreciate it if you could test this as well, as we are currently trying to reproduce your problem but have not been successful so far.

comment:16 Changed 3 years ago by rojer

  1. after several ssh corruptions, nothing notable appeared in dmesg. i'm attaching it anyway, just fyi.
  2. even with memory down to 512M and with vt extensions and nested paging turned off, network-induced corruption is still happening.
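For anyone repeating this test, the reduced configuration can be applied from the command line. This is a sketch only: the VM name passed in is whatever your test VM is called, the `run` and `minimal_config` helpers are invented here, and dropping to one CPU is an assumption on the grounds that multi-CPU guests need VT-x/AMD-V. The `--memory`, `--cpus`, `--hwvirtex` and `--nestedpaging` options belong to `VBoxManage modifyvm`.

```shell
#!/bin/sh
# run: print the command when DRY_RUN is set, otherwise execute it.
# Hypothetical helper for this sketch.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# minimal_config VM: 512 MB RAM, one CPU, no VT-x/AMD-V, no nested paging.
# The VM must be powered off before modifyvm is used.
minimal_config() {
    vm=$1
    # Assumption: a single CPU, since SMP guests need hardware virtualization.
    run VBoxManage modifyvm "$vm" --memory 512 --cpus 1
    run VBoxManage modifyvm "$vm" --hwvirtex off --nestedpaging off
}
```

Calling `DRY_RUN=1 minimal_config my-test-vm` prints the two commands without touching the VM, which is a convenient way to review them first.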

Changed 3 years ago by rojer

comment:17 Changed 3 years ago by rojer

after some more testing i have some additional info.

  1. this is related to usage of wifi by the host: corruption only happens when host is using wifi, not wired connection.
  2. this does not affect 64-bit guests, only 32-bit (i had two ubuntu guests running at the same time, 64-bit and 32-bit natty live cds, and corruption would only ever happen in the 32-bit guest).
  3. real network traffic does not have to originate in the vm. corruption happens even if ping is running on the host.
  4. as reported before and despite the fact that it depends on host's network connection type, corruption does not affect the host itself in any way.

based on this, it seems that this is related to 64<->32 bit host<->guest compatibility and is somehow triggered by host's usage of wifi (iwlagn driver).

comment:18 Changed 2 years ago by frank

To be 100% sure, could you also try if the guest memory corruption happens if you start the guest without networking at all?

comment:19 Changed 2 years ago by frank

That is, please disable all network adapters for the test VM and try to reproduce the guest memory corruptions by doing network traffic on your host wireless network interface.
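A sketch of how the adapters could be switched off without the GUI, assuming the VM has at most four adapters configured. `--nic<N> none` is a `VBoxManage modifyvm` option; the helper `nic_off_args` is made up here and only builds the argument list.

```shell
#!/bin/sh
# nic_off_args COUNT: print "--nic1 none ... --nicCOUNT none",
# ready to be passed to "VBoxManage modifyvm".
nic_off_args() {
    n=1
    args=""
    while [ "$n" -le "$1" ]; do
        args="$args --nic$n none"
        n=$((n + 1))
    done
    # Strip the leading space left over from the first append.
    echo "${args# }"
}
```

With the VM powered off, `VBoxManage modifyvm my-test-vm $(nic_off_args 4)` would disable adapters 1 to 4, and `VBoxManage showvminfo my-test-vm` can be used to confirm the result.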

comment:20 Changed 2 years ago by Hachiman

I've tried to reproduce the issue using the ssh localhost test on the guest and pinging with -I (wireless adapter) on the host locally, with attachment to wireless networking, but no luck yet. Could you please attach the sysctl -a output to the defect? Some host settings might give hints on how to reproduce the issue.

comment:21 follow-up: ↓ 22 Changed 2 years ago by rojer

frank, yes, it's reproducible with guest networking completely disabled. i've recorded a video to demonstrate (http://www.youtube.com/watch?v=whmSe7EcgtU): in it i'm showing ssh running with ping -f running on the host, first over ethernet connection (i let it run for about a minute, but i had it running for as long as 10 minutes), and then i switch to wireless and guest ssh bombs after ~15 seconds.

comment:22 in reply to: ↑ 21 Changed 2 years ago by Hachiman

Replying to rojer: Could you please attach to the defect the output of the sysctl -a and route commands from your host?

Changed 2 years ago by rojer

output of "sysctl -a" and route

comment:23 Changed 2 years ago by rojer

Hachiman, netstat -nr is ~= route but here you go, and sysctl (sorry, forgot to attach it earlier). vmnet interfaces belong to vmware workstation that i installed today (they weren't there yesterday). this is a temporary workaround, hopefully we can get virtualbox going, but for now i've converted my main guest and am using the free vmware player. i don't get snapshots but at least it's not crashing and i can get some work done. i still have test vbox guests.

comment:24 Changed 2 years ago by Hachiman

Hm, still can't reproduce the issue even with a rather close configuration and installation. Could you please collect a pcap file from your host interface (e.g. with wireshark) and from the guest NIC (for more details please look at Network tips)? You can send them directly to me (vasily _dot_ levchenko _at_ oracle _dot_ com).

comment:25 Changed 2 years ago by Hachiman

Rojer, have you been able to collect pcap files from the host's interface and the guest's one?

comment:26 follow-up: ↓ 27 Changed 2 years ago by rojer

Hachiman, sorry, been busy lately. i can do that, but note that this is reproducible with guest networking completely disabled, as i've shown in the video.

i also have to mention that even though host's base system is 10.04/lucid, the kernel is actually from 11.04/natty since 10.04 doesn't support my hardware. it's an official kernel build but from natty:

[ 0.000000] Linux version 2.6.38-11-generic (buildd@allspice) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #50~lucid1-Ubuntu SMP Tue Sep 13 21:53:24 UTC 2011 (Ubuntu 2.6.38-11.50~lucid1-generic 2.6.38.8)

i have the right headers installed too. so you may want to try to reproduce this with natty as the host as opposed to lucid.

comment:27 in reply to: ↑ 26 Changed 2 years ago by Hachiman

Replying to rojer:

Hachiman, sorry, been busy lately. i can do that, but note that this is reproducible with guest networking completely disabled, as i've shown in the video.

Yes, I've seen that the guest's networking is disabled, but the fact that the guest ssh client is affected might mean that some packets from the host stack appear on the guest stack, and I'd like to investigate what such a packet looks like and whether we can emulate this situation.

Here I've tried several Ubuntu releases without managing to reproduce the issue, but it probably depends on your hardware or your network environment.

comment:28 follow-up: ↓ 29 Changed 2 years ago by rojer

fwiw, my machine is Lenovo X220 laptop. it's somewhat less urgent for me now that i found a workaround using vmware player, but i promise to do more tests when i have some time. thanks for your patience!

comment:29 in reply to: ↑ 28 Changed 2 years ago by Hachiman

Replying to rojer:

fwiw, my machine is Lenovo X220 laptop. it's somewhat less urgent for me now that i found a workaround using vmware player, but i promise to do more tests when i have some time.

Could you please paste the output of lspci to the defect so we can verify that the hardware we're testing is close to your configuration?

Changed 2 years ago by rojer

output of lspci and lspci -v

comment:30 Changed 2 years ago by frank

rojer, we assume that this problem is somehow related to your hardware. Could you check if there is a BIOS update available and if so, if an update would solve your problem?

comment:31 Changed 2 years ago by frank

Does anything change if you use the PIIX3 chipset (not the ICH9 one)?

comment:32 Changed 2 years ago by rojer

tried that - no, it doesn't matter.

comment:33 Changed 2 years ago by frank

  • Description modified (diff)

Still relevant with VBox 4.1.10?

comment:34 Changed 2 years ago by rojer

yep, corruption is still happening on 4.1.10. and having done some more testing today, i have an important bit of information for you: wireless. it's all about wireless networking on the host. this is something i didn't try before: i had tried using wired network but i didn't disable wireless completely and was still getting corruption - caused, i assume, by background traffic. today i tried disabling all networking and then selectively enabling wired *or* wireless. with networking disabled or wired only networking + traffic, ssh in the guest can run for a long time - i've tested up to 15 minutes, i assume it can run for longer. if, however, you have wireless enabled, let alone send traffic over it (ping -s 1000 -f), all hell breaks loose - corruption will occur in less than a minute, sometimes just seconds after starting.

again: no interaction with guest is necessary, just having wireless enabled and connected on the host and some traffic will cause memory corruption in the guest, which is most evident in ssh failures but will sometimes manifest as graphics artifacts (presumably when it hits X buffers).
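When repeating the wired-versus-wireless comparison, it is easy to leave a wireless interface up by accident. The following sketch lists the host interfaces that look wireless by checking sysfs. It assumes a Linux host where wireless devices expose a `wireless` or `phy80211` entry under `/sys/class/net/<iface>`; the function name is invented for the example.

```shell
#!/bin/sh
# wireless_ifaces [SYSFS_NET_DIR]: print the names of interfaces that
# look wireless. The directory argument defaults to /sys/class/net and
# exists mainly so the function can be tried against a fake tree.
wireless_ifaces() {
    root=${1:-/sys/class/net}
    for dev in "$root"/*; do
        if [ -e "$dev/wireless" ] || [ -e "$dev/phy80211" ]; then
            basename "$dev"
        fi
    done
}
```

Each name it prints can then be taken down with `ip link set <iface> down` (as root) before starting the wired-only run, and brought back up for the wireless run.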

to give you an idea of what kind of device and driver this is, here are the relevant parts of dmesg:

[   10.252486] iwlagn: Intel(R) Wireless WiFi Link AGN driver for Linux, in-tree:
[   10.252490] iwlagn: Copyright(c) 2003-2010 Intel Corporation
[   10.254190] iwlagn 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
[   10.254202] iwlagn 0000:03:00.0: setting latency timer to 64
[   10.254251] iwlagn 0000:03:00.0: Detected Intel(R) Centrino(R) Advanced-N 6205 AGN, REV=0xB0
[   10.265856] iwlagn 0000:03:00.0: device EEPROM VER=0x715, CALIB=0x6
[   10.265860] iwlagn 0000:03:00.0: Device SKU: 0Xb
[   10.265862] iwlagn 0000:03:00.0: Valid Tx ant: 0X3, Valid Rx ant: 0X3
[   10.267193] iwlagn 0000:03:00.0: Tunable channels: 13 802.11bg, 24 802.11a channels
[   10.267290] iwlagn 0000:03:00.0: irq 45 for MSI/MSI-X
[   10.279547] iwlagn 0000:03:00.0: loaded firmware version 17.168.5.2 build 35905

as to updating the BIOS, there is a newer version, but Lenovo provides no good way of updating it from Linux.

also, it's worth mentioning that neither the host itself nor VMWare Workstation, which i've been running for the past several months, experiences any such problems. so i doubt this is just a memory stomping bug in hardware/driver. it really does affect just VirtualBox.

Last edited 2 years ago by frank (previous) (diff)

comment:35 Changed 2 years ago by Hachiman

Rojer, thanks for testing and feedback. We are still not sure what the reason for this behaviour is.

comment:36 Changed 22 months ago by Hachiman

Is it still reproducible with 4.1.18?

comment:37 Changed 16 months ago by rojer

so, i've just upgraded my host to Precise (12.04, kernel 3.2.0). whatever the cause was, it's no longer reproducible, even with the stock virtualbox-ose package (4.1.12). so i guess the root cause was in the kernel and this can be closed now.

comment:38 Changed 16 months ago by Hachiman

  • Status changed from new to closed
  • Resolution set to invalid

Thanks for update, Rojer. Will close it.

comment:39 Changed 5 months ago by insecure

  • Status changed from closed to reopened
  • Resolution invalid deleted

Host: Fedora 19

I'm still having this problem with VirtualBox 4.3.0. In fact, I have observed this behaviour since at least VirtualBox 4.0.10. It happens on various guest platforms, e.g. Gentoo 32 Bit, Windows 7 32 Bit and, though only rarely, Windows XP 32 Bit.

Windows guests crash with blue screens. Linux guests show strange random problems like segfaults, glibc: memory corruption, "random crashes in gentoo ebuild scripts", corrupt packets on ssh connections. As already mentioned in previous posts, these problems only occur if there is traffic on a WIFI device, it never happens if there is only a busy ethernet device. The more traffic, the more often the crashes occur.

I tried running VirtualBox with taskset "pinning" it to certain CPU sets, crashes still occurred.

I also tried unloading all but the "vboxdrv" kernel module and disabled all network adapters for the VM, crashes still occur.

If more information is needed, e.g. hardware information, logfiles (although the VBox.log files do not show anything after such an "app crash" happened), I'll be happy to help.

comment:40 Changed 5 months ago by frank

insecure, just to be sure: You don't observe any instability on the host if VirtualBox is not running, is that correct? Also, can you test this build? We fixed a potential problem with fxrstor/fxsave on 64-bit hosts but this is only a shot in the dark. Any feedback from testing this package is welcome. Thank you!

comment:41 Changed 5 months ago by insecure

Hey Frank, system's rock solid! No other problems, compiling or other CPU intensive stuff works fine, harddisk is going nicely, network runs like a charm...

Still no luck with 4.3.3. Here are some examples of the symptoms: 3 successive runs of the same command, all failed.

...
>>> Unpacking Python-2.7.5.tar.xz to /var/tmp/portage/dev-lang/python-2.7.5-r4/work
xz: /var/tmp/portage/dev-lang/python-2.7.5-r4/distdir/Python-2.7.5.tar.xz: Compressed data is corrupt
tar: Skipping to next header
...

The harddisk is OK, also the files on it.

localhost ~ # emerge -av1 python
Segmentation fault
...
checking for uintptr_t... yes
checking size of uintptr_t... configure: error: in `/var/tmp/portage/dev-lang/python-2.7.5-r4/work/i686-pc-linux-gnu':
configure: error: cannot compute sizeof (uintptr_t)
See `config.log' for more details
...

After stopping network traffic on the host compiling works fine again.
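The xz error on a file that is intact on disk suggests the data is damaged after it has been read. One way to make that visible, sketched here on the assumption that `sha256sum` is available in the guest, is to hash the same file several times and compare the results. The function name `hash_stable` is hypothetical.

```shell
#!/bin/sh
# hash_stable FILE ROUNDS: hash FILE up to ROUNDS times. Prints "stable"
# if every round gives the same SHA-256, otherwise prints "unstable"
# and returns 1.
hash_stable() {
    file=$1
    rounds=$2
    first=$(sha256sum "$file" | cut -d' ' -f1)
    i=1
    while [ "$i" -lt "$rounds" ]; do
        cur=$(sha256sum "$file" | cut -d' ' -f1)
        if [ "$cur" != "$first" ]; then
            echo "unstable"
            return 1
        fi
        i=$((i + 1))
    done
    echo "stable"
}
```

Running, for example, `hash_stable /var/tmp/portage/dev-lang/python-2.7.5-r4/distdir/Python-2.7.5.tar.xz 50` in the guest while the host is sending wireless traffic would be expected to print "unstable" if reads or the hash computation are being corrupted, and "stable" once the traffic stops.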

Besides, I also tried disabling "Nested Paging", no luck. I configured only 1 instead of 3 CPUs, no luck. Configuring 1 CPU and turning off VT extensions and nested paging seems to work but is, of course, incredibly slow.

comment:42 Changed 5 months ago by frank

Thanks for testing! So it must indeed be the VT-x code which is somehow interfering with your WLAN firmware. Did you also check if there is a new BIOS update available? This could also be a BIOS bug.

comment:43 follow-up: ↓ 44 Changed 5 months ago by insecure

Yes, there is indeed a new BIOS update. Until now I never managed to update the BIOS due to bugs in that strange HP USB update tool.

I just installed the update successfully but have no suitable WIFI available here at work, I will check at home this afternoon and report back.

comment:44 in reply to: ↑ 43 Changed 5 months ago by insecure

OK, the BIOS update does not bring any improvements. I still get the errors.

Any ideas for narrowing down the bug further? I would be glad to try them out.

comment:45 Changed 4 months ago by frank

Still no ideas, but as almost every new release has fixes for VT-x/AMD-V which could be relevant, I would appreciate it if you could test the latest VBox release (right now 4.3.6). Any success?
