VirtualBox

Opened 13 years ago

Last modified 4 years ago

#9718 closed defect

Random app crashes in the guest — at Version 33

Reported by: rojer Owned by:
Component: other Version: VirtualBox 4.1.4
Keywords: random crashes, memory corruption Cc:
Guest type: other Host type: other

Description (last modified by Frank Mehnert)

Host: Ubuntu 10.04 (Lucid) x86_64, 4.1.4 Guest: Ubuntu 11.04 (Natty) x86

Apps in guest crash randomly. The most susceptible seems to be Adobe Flash plugin, within Chrome or Firefox. Chrome itself crashes sometimes too, but Flash is really unstable: it rarely survives 1 minute of playing a YouTube video.

What I've tried and eliminated so far:

1) Host is rock solid, no random crashes, plays YouTube video fine.
2) Memory is fine - I've run memtest86 on both host and guest.
3) Tried downgrading to 4.0.12 - same thing.
4) Presence of guest utils makes no difference - crashes happen with and without them.
5) Tried 1 and 2 CPUs for the guest, different emulated hardware (PIIX/ICH), different amounts of memory (4, 3, 2G of RAM for guest, host has 8G).
6) nothing else is happening on the host. Log in, start chrome, open a youtube video -> boom, flash plugin crashes.
7) Guest kernel seems stable, no panics.
8) On two occasions Thunderbird reported strange errors: incorrect MAC on SSL packets. This suggests that memory corruption is happening somewhere.

Change History (39)

by rojer, 13 years ago

Attachment: nbx-2011-10-07-18-07-42.log added

by rojer, 13 years ago

Attachment: nbx-2011-10-07-17-59-03.log added

comment:1 by vasily Levchenko, 13 years ago

Description: modified (diff)

comment:2 by vasily Levchenko, 13 years ago

Looks like it's a networking issue. Does it happen for you only with bridged networking, or with NAT too?

comment:3 by rojer, 13 years ago

It was set to bridging all this time and I couldn't even watch a single video, now having set it to NAT it's a bit more palatable but still, having left the guest running for the night I came back to find that Chrome had crashed and Thunderbird reported a bad SSL MAC. In other words, setting networking to NAT does seem to help a bit but corruption is still happening.

in reply to:  3 comment:4 by vasily Levchenko, 13 years ago

Replying to rojer:

It was set to bridging all this time and I couldn't even watch a single video, now having set it to NAT it's a bit more palatable but still, having left the guest running for the night I came back to find that Chrome had crashed and Thunderbird reported a bad SSL MAC. In other words, setting networking to NAT does seem to help a bit but corruption is still happening.

Could you please give a hint as to which SSL resources (URLs) provoke the crashes of the guest's browser?

comment:5 by rojer, 13 years ago

I don't think it's specific to SSL. I think crashes are caused by memory being stomped somehow, but SSL is one of the few things that can actually detect this by way of checksumming (but if you really need to know, errors were reported on an IMAPS connection to Gmail - i have a Gmail IMAP account configured in Thunderbird and Gmail requires SSL). In other cases it just causes crashes and (quite possibly) silent corruption elsewhere. In fact, I had suspicious corruption of dpkg database in this VM that may be explained by this. For whatever reason, NAT seems to cause less memory corruption than bridging though it doesn't eliminate it completely. I also occasionally see visual artifacts on the guest's screen.
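The point about SSL acting as a corruption detector can be illustrated with a minimal sketch. It assumes HMAC-SHA256 as the integrity tag (the cipher suite actually negotiated with Gmail at the time is not stated in this ticket), and the helper names `make_record`, `verify_record` and `flip_bit` are hypothetical. A single inverted bit anywhere in the record is enough to make verification fail, which is what a "bad MAC" error reports.

```python
import hashlib
import hmac
import os

TAG_LEN = 32  # length of an HMAC-SHA256 tag in bytes


def make_record(key: bytes, payload: bytes) -> bytes:
    """Append an HMAC-SHA256 tag to the payload (assumed tag scheme, for illustration)."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return payload + tag


def verify_record(key: bytes, record: bytes) -> bool:
    """Recompute the tag over the payload and compare it in constant time."""
    payload, tag = record[:-TAG_LEN], record[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted (simulated memory corruption)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    key = os.urandom(32)
    record = make_record(key, b"example payload\r\n" * 100)
    print("intact record verifies:", verify_record(key, record))
    print("record with one flipped bit verifies:", verify_record(key, flip_bit(record, 1234)))
```

Applications without such a check (a Flash plugin, a package database) have no way to notice the same bit flip, which fits the pattern of crashes and silent corruption described above.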

in reply to:  5 comment:6 by vasily Levchenko, 13 years ago

Replying to rojer: Does it depend on which network device emulation you are using, pcnet or e1000?

by rojer, 13 years ago

Attachment: ssh_corruption.log added

comment:7 by rojer, 13 years ago

Haven't tried it, and it probably doesn't matter, because the corruption doesn't seem to be related to network adapter, after all. I found an easy way to reproduce it: ssh to localhost, transfer some data, get data corruption within seconds (ok, sometimes a minute or two -- see attached ssh_corruption.log).

This same command pipeline runs on the host without any problems. since it's localhost, it shouldn't involve networking drivers. but something somewhere corrupts the data, and it only happens in the guest. i have to add that guest was running fine while it was still on the physical machine, so it must be somehow related to virtualbox. but likely not to network drivers.
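The same idea, moving data over the loopback interface and checking that it arrives intact, can be sketched without ssh. This is not the reporter's pipeline: it is a hypothetical stand-in that sends random bytes over a TCP connection to 127.0.0.1 and compares SHA-256 digests computed on each side. The function names and default sizes are assumptions chosen for illustration.

```python
import hashlib
import os
import socket
import threading


def loopback_round(size: int = 1 << 20, chunk: int = 65536) -> bool:
    """Send `size` random bytes over 127.0.0.1 and report whether both digests match."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = hashlib.sha256()

    def receive() -> None:
        # Hash everything that arrives until the sender closes the connection.
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(chunk)
                if not data:
                    break
                received.update(data)

    receiver = threading.Thread(target=receive)
    receiver.start()

    sent = hashlib.sha256()
    with socket.create_connection(("127.0.0.1", port)) as client:
        remaining = size
        while remaining > 0:
            block = os.urandom(min(chunk, remaining))
            sent.update(block)  # hash before sending, so later corruption is visible
            client.sendall(block)
            remaining -= len(block)

    receiver.join()
    server.close()
    return sent.digest() == received.digest()


def count_corrupt_rounds(rounds: int = 10, size: int = 1 << 20) -> int:
    """Run several rounds and count how many ended with mismatching digests."""
    return sum(0 if loopback_round(size) else 1 for _ in range(rounds))


if __name__ == "__main__":
    print("corrupt rounds:", count_corrupt_rounds())
```

On a healthy system every round should match. A non-zero count inside the guest, with the same script passing on the host, would point the same way as the ssh test.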

in reply to:  7 comment:8 by vasily Levchenko, 13 years ago

Replying to rojer:

Haven't tried it, and it probably doesn't matter, because the corruption doesn't seem to be related to network adapter, after all.

VirtualBox networking has two main blocks: the network attachment (NAT, bridged and so on) and the network device emulation (AMD pcnet and Intel e1000). We've established that you see the same issue, detected by the guest TCP/IP stack, with both attachments (there is only a small chance that both code paths have exactly the same bug), so we need to check whether the choice of device emulation changes the behavior.

comment:9 by rojer, 13 years ago

i understand, but look at the attached log: i'm using 127.0.0.1 and still getting corrupt packets, which means that it's not related to virtualbox network drivers.

in reply to:  9 comment:10 by vasily Levchenko, 13 years ago

Replying to rojer:

i understand, but look at the attached log: i'm using 127.0.0.1 and still getting corrupt packets, which means that it's not related to virtualbox network drivers.

Do you mean that this test was done on the guest side?

comment:11 by rojer, 13 years ago

yes, that's done on guest.

and i have something else to add. corruption does seem to correlate with real network activity. when doing the previous test i had other apps running in the background - chrome, thunderbird, pidgin. now i've done tests without background network traffic and if there's nothing else running, corruption on ssh to localhost doesn't happen. however, if i induce some real network traffic, corruption does happen. here's what i did: i reboot the virtual machine and start two terminal windows. in the first, i start the localhost ssh loop (time ssh 127.0.0.1 cat /dev/zero > /dev/null). if i just leave it alone, it will run just fine, consuming cpu as expected. however, if i now start the same loop but talking to a remote server instead (time ssh my_server.example.org cat /dev/zero > /dev/null), the local loop will crash very soon. the second will eventually crash too, but it's the induced crash of the first that proves that there's memory stomping going on somewhere in the virtualbox network plumbing.

i have now tested the AMD adapter type too (previously i was using intel pro 1000) and the same thing is happening. so it must be something else.
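If memory really is being stomped, a check that involves no networking at all inside the guest should eventually notice it too. The following is a minimal sketch of such a check, not something that was run for this ticket: it fills a buffer with a known byte pattern and re-verifies it periodically while traffic is induced elsewhere. The pattern, block size and function names are arbitrary assumptions.

```python
import time
from typing import List

# One full cycle of the pattern: the byte at offset i is (i * 131 + 7) % 256.
CYCLE = bytes((i * 131 + 7) % 256 for i in range(256))
BLOCK = 4096  # a multiple of len(CYCLE), so every block starts at the same phase


def make_buffer(size: int) -> bytearray:
    """Allocate `size` bytes filled with the repeating pattern."""
    return bytearray((CYCLE * (size // len(CYCLE) + 1))[:size])


def corrupted_blocks(buffer: bytearray) -> List[int]:
    """Return the starting offsets of all blocks that no longer match the pattern."""
    expected = CYCLE * (BLOCK // len(CYCLE))
    bad = []
    for offset in range(0, len(buffer), BLOCK):
        chunk = buffer[offset:offset + BLOCK]
        if chunk != expected[:len(chunk)]:
            bad.append(offset)
    return bad


def watch(size: int = 64 << 20, passes: int = 60, delay: float = 1.0) -> List[int]:
    """Re-check the buffer `passes` times; stop early and report on the first mismatch."""
    buffer = make_buffer(size)
    for _ in range(passes):
        time.sleep(delay)
        bad = corrupted_blocks(buffer)
        if bad:
            return bad
    return []


if __name__ == "__main__":
    print("corrupted block offsets:", watch())
```

Running `watch()` in the guest while the remote ssh loop runs in a second terminal would show whether plain process memory is affected. An empty result does not rule corruption out, since the buffer covers only part of guest RAM.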

comment:12 by rojer, 13 years ago

one more datapoint: induced network traffic will crash glxgears (by disrupting X connection, perhaps?) run glxgears in terminal 1:

rojer@nbx:~$ glxgears
OpenGL Warning: Failed to connect to host. Make sure 3D acceleration is enabled for this VM.
2005 frames in 5.0 seconds = 400.926 FPS
1798 frames in 5.0 seconds = 359.461 FPS
1900 frames in 5.0 seconds = 379.984 FPS
... (runs fine on its own)

now start inducing network traffic in terminal 2 (i've changed from using SSH to flood pinging my DSL router):

rojer@nbx:~$ sudo ping -f 192.168.1.1

and you start seeing visual artifacts in the glxgears window and pretty soon you get this in terminal 1:

1676 frames in 5.0 seconds = 335.083 FPS
Mesa 7.10.2 implementation error: bad datatype in interpolate_int_colors
Please report at bugs.freedesktop.org
Mesa 7.10.2 implementation error: Invalid datatype in _mesa_convert_colors
Please report at bugs.freedesktop.org

comment:13 by vasily Levchenko, 13 years ago

Could you please attach dmesg from your guest after doing your "ssh based" test?

comment:14 by Frank Mehnert, 13 years ago

Wild guess: Could you also check what happens if you downgrade the guest memory to 1GB or 512MB?

comment:15 by Frank Mehnert, 13 years ago

Oh, and an important test would be to disable nested paging. I would appreciate it if you could test this as well, as we are currently trying to reproduce your problem but have not been successful so far.

comment:16 by rojer, 13 years ago

  1. after several ssh corruptions, nothing notable appeared in dmesg. i'm attaching it anyway, just fyi.
  2. even with memory down to 512M and with vt extensions and nested paging turned off, network-induced corruption is still happening.

by rojer, 13 years ago

Attachment: dmesg.log added

comment:17 by rojer, 13 years ago

after some more testing i have some additional info.

  1. this is related to usage of wifi by the host: corruption only happens when host is using wifi, not wired connection.
  2. this does not affect 64-bit guests, only 32-bit (i had two ubuntu guests running at the same time, 64-bit and 32-bit natty live cds, and corruption would only ever happen in the 32-bit guest).
  3. real network traffic does not have to originate in the vm. corruption happens even if ping is running on the host.
  4. as reported before and despite the fact that it depends on host's network connection type, corruption does not affect the host itself in any way.

based on this, it seems that this is related to 64<->32 bit host<->guest compatibility and is somehow triggered by host's usage of wifi (iwlagn driver).

comment:18 by Frank Mehnert, 13 years ago

To be 100% sure, could you also try if the guest memory corruption happens if you start the guest without networking at all?

comment:19 by Frank Mehnert, 13 years ago

That is, please disable all network adapters for the test VM and try to reproduce the guest memory corruptions by doing network traffic on your host wireless network interface.

comment:20 by vasily Levchenko, 13 years ago

I've tried to reproduce the issue using the ssh localhost test on the guest while running ping -I (wireless adapter) on the host, with the guest attached to wireless networking, but no luck yet. Could you please attach the output of sysctl -a to the defect? Some host settings might give hints on how to reproduce the issue.

comment:21 by rojer, 13 years ago

frank, yes, it's reproducible with guest networking completely disabled. i've recorded a video to demonstrate: http://www.youtube.com/watch?v=whmSe7EcgtU in it i'm showing ssh running with ping -f running on the host first over ethernet connection (i let it run for about a minute, but i had it running for as long as 10 minutes) and then switch to wireless and have guest ssh bomb after ~15 seconds.

in reply to:  21 comment:22 by vasily Levchenko, 13 years ago

Replying to rojer: Could you please attach the output of the sysctl -a and route commands from your host to the defect?

by rojer, 13 years ago

Attachment: sysctl_route.log added

output of "sysctl -a" and route

comment:23 by rojer, 13 years ago

Hachiman, netstat -nr is ~= route but here you go, and sysctl (sorry, forgot to attach it earlier). vmnet interfaces belong to vmware workstation that i installed today (they weren't there yesterday). this is a temporary workaround, hopefully we can get virtualbox going, but for now i've converted my main guest and am using the free vmware player. i don't get snapshots but at least it's not crashing and i can get some work done. i still have test vbox guests.

comment:24 by vasily Levchenko, 13 years ago

Hm, I still can't reproduce the issue, even with a rather similar configuration and installation. Could you please capture pcap files from your host interface (e.g. with Wireshark) and from the guest NIC (for more details, see Network tips)? You can send them directly to me (vasily _dot_ levchenko _at_ oracle _dot_ com).

comment:25 by vasily Levchenko, 13 years ago

Rojer, have you been able to collect pcap files from the host's interface and the guest's?

comment:26 by rojer, 13 years ago

Hachiman, sorry, been busy lately. i can do that, but note that this is reproducible with guest networking completely disabled, as i've shown in the video.

i also have to mention that even though host's base system is 10.04/lucid, the kernel is actually from 11.04/natty since 10.04 doesn't support my hardware. it's an official kernel build but from natty:

[ 0.000000] Linux version 2.6.38-11-generic (buildd@allspice) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #50~lucid1-Ubuntu SMP Tue Sep 13 21:53:24 UTC 2011 (Ubuntu 2.6.38-11.50~lucid1-generic 2.6.38.8)

i have the right headers installed too. so you may want to try to reproduce this with natty as the host as opposed to lucid.

in reply to:  26 comment:27 by vasily Levchenko, 13 years ago

Replying to rojer:

Hachiman, sorry, been busy lately. i can do that, but note that this is reproducible with guest networking completely disabled, as i've shown in the video.

Yes, I've seen that the guest's networking is disabled, but the fact that the guest's ssh client is affected might mean that some packets from the host stack appear on the guest stack. I'd like to investigate what these packets look like and whether we can emulate this situation.

I've tried several Ubuntu releases here without managing to reproduce the issue, so it probably depends on your hardware or your network environment.

comment:28 by rojer, 13 years ago

fwiw, my machine is Lenovo X220 laptop. it's somewhat less urgent for me now that i found a workaround using vmware player, but i promise to do more tests when i have some time. thanks for your patience!

in reply to:  28 comment:29 by vasily Levchenko, 13 years ago

Replying to rojer:

fwiw, my machine is Lenovo X220 laptop. it's somewhat less urgent for me now that i found a workaround using vmware player, but i promise to do more tests when i have some time.

Could you please paste the output of lspci to the defect, so that we can verify that the hardware we're testing is close to your configuration?

by rojer, 13 years ago

Attachment: lspci.txt added

output of lspci and lspci -v

comment:30 by Frank Mehnert, 12 years ago

rojer, we assume that this problem is somehow related to your hardware. Could you check if there is a BIOS update available and if so, if an update would solve your problem?

comment:31 by Frank Mehnert, 12 years ago

Does anything change if you use the PIIX3 chipset (not the ICH9 one)?

comment:32 by rojer, 12 years ago

tried that - no, it doesn't matter.

comment:33 by Frank Mehnert, 12 years ago

Description: modified (diff)

Still relevant with VBox 4.1.10?
