Opened 12 years ago
Closed 9 years ago
#11649 closed defect (fixed)
NAT-related crash of ubuntu guest on OSX host.
Reported by: | c_t | Owned by: | |
---|---|---|---|
Component: | network/NAT | Version: | VirtualBox 4.2.10 |
Keywords: | NAT, crash | Cc: | |
Guest type: | Linux | Host type: | Mac OS X |
Description
When trying to install node.js npm dependencies inside an Ubuntu 12.04 guest system, which causes a lot of simultaneous HTTP connections, VirtualBox crashes reproducibly with the following crash report:
Process:         VirtualBoxVM [52317]
Path:            /Applications/VirtualBox.app/Contents/MacOS/VirtualBoxVM
Identifier:      VirtualBoxVM
Version:         ??? (???)
Code Type:       X86-64 (Native)
Parent Process:  VBoxSVC [29251]
Date/Time:       2013-03-27 11:32:37.325 +0100
OS Version:      Mac OS X 10.7.5 (11G63)
Report Version:  9
Sleep/Wake UUID: 6554FC2E-A188-4914-A585-61EF6051F79C
Crashed Thread:  20  NAT
Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x0000000000000010
Thread 20 Crashed:: NAT
0   VBoxDD.dylib        0x000000010db6ed59 VBoxDevicesRegister + 838601
1   VBoxDD.dylib        0x000000010db65b85 VBoxDevicesRegister + 801269
2   VBoxDD.dylib        0x000000010db5d5bb VBoxDevicesRegister + 767019
3   VBoxVMM.dylib       0x000000010280f96a PDMR3ThreadCreate + 970
4   VBoxRT.dylib        0x00000001002409df RTThreadCreateF + 271
5   VBoxRT.dylib        0x00000001002930fc RTThreadPoke + 540
6   libsystem_c.dylib   0x00007fff8e6cb8bf _pthread_start + 335
7   libsystem_c.dylib   0x00007fff8e6ceb75 thread_start + 13
Full crash log attached.
I experience this on Mac OS X 10.7.5 and Mac OS X 10.8.3 with VirtualBox 4.1.23 and 4.2.10.
We are currently seeing this issue on 3 different host machines and it "just appeared" without any updates to VirtualBox or the guest OS.
Maybe it was introduced with a Mac OS X update?
Attachments (7)
Change History (55)
by , 12 years ago
Attachment: | VirtualBoxVM_2013-03-27-113240_localhost.crash added |
---|
comment:1 by , 12 years ago
comment:2 by , 12 years ago
I've uploaded a tarball called vbox-crash-ticket-11649.tar.gz into ftp://ftp.oracle.com/appsdev/incoming including the core dump, the crash report and some corresponding lines from /var/log/system.log
Thank you for your help.
comment:3 by , 12 years ago
Thanks for uploading the core file, but could you please attach the log file from your VBox session (please look at http://www.virtualbox.org/manual/ch12.html#idp16864224 for more details)?
comment:4 by , 12 years ago
Oh, sorry for the misunderstanding. I've uploaded another tarball which now includes the corresponding VBox.log to make sure all files are consistent.
Thanks again!
comment:5 by , 12 years ago
Could you please describe what your guest is doing? I've been able to get a stack trace from your core file, but it doesn't point to an obviously problematic place. It looks like a crash related to the dnsproxy code, which is enabled in your scenario. I've tried several DNS-related benchmarks to reproduce the issue you're experiencing, but did not succeed.
follow-up: 7 comment:6 by , 12 years ago
On the guest system (Ubuntu 12.04) I'm performing an
npm install
which installs about 132 npm packages for node.js (http://nodejs.org/). These packages are hosted on HTTP locations, so the process resolves a lot of DNS names and downloads many files in parallel. One workaround we have found is to install the 132 packages manually in series rather than in parallel, which does not cause crashes (so only a few concurrent DNS lookups and HTTP requests). Doing so in parallel, however, reproducibly results in the described crash.
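The contrast between the crashing parallel install and the serial workaround can be approximated without npm. The following is only a minimal sketch, not the procedure used in this ticket: resolve_all is a hypothetical helper, names.txt is an assumed file with one hostname per line, and getent hosts merely stands in for the lookups npm performs.

```shell
# Hypothetical helper (not part of VirtualBox or npm): resolve every hostname
# listed in a file, keeping at most <parallelism> lookups in flight.
# Usage: resolve_all <list-file> <parallelism> [lookup command...]
resolve_all() {
    list_file=$1
    parallelism=$2
    shift 2
    # Assumption: fall back to the guest's normal resolver via getent
    # when no lookup command is given.
    [ "$#" -gt 0 ] || set -- getent hosts
    # -P caps the number of concurrent processes, -n 1 hands one hostname
    # to each lookup process.
    xargs -P "$parallelism" -n 1 "$@" < "$list_file"
}
```

Running resolve_all names.txt 1 mimics the serial workaround, while resolve_all names.txt 50 keeps many DNS queries in flight at once. Whether this alone is enough to trigger the crash has not been verified here.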
Interestingly, I now did the same at home (so logged into a different network with the host machine) and the problem does not occur! Nothing else has changed with my machine except for the network connection of the host machine!
Maybe that's helping? I'll be asking my colleagues, who are experiencing the same issues at the office, whether they can try to reproduce the problems at home.
Do you have any suspicion on what in the office network could cause these weird crashes? Could it be something IPV6 related? I see the message
00:00:16.757430 NAT: IPv6 not supported
is one of the last messages in the log... I will try to check with our ISP at the office whether they changed anything concerning IPv6 or DNS lately.
At least something like this would explain why the crashes suddenly started to occur on many different computers at the same time without any changes in the software configuration of these machines.
comment:7 by , 12 years ago
Replying to c_t:
Could you please try the build? I've changed the cleanup performed when processing timeouts in the dnsproxy code, which could potentially be causing the issue.
On the guest system (Ubuntu 12.04) I'm performing an
npm install
Interestingly, I now did the same at home (so logged into a different network with the host machine) and the problem does not occur! Nothing else has changed with my machine except for the network connection of the host machine!
Hmm, interesting observation. What differs between your environments (the number of DNS servers, perhaps)?
Maybe that's helping? I'll be asking my colleagues, who are experiencing the same issues at the office, whether they can try to reproduce the problems at home.
Do you have any suspicion on what in the office network could cause these weird crashes?
There must be some difference in your DNS settings or in the DNS servers' behaviour that provokes the bug in NAT/dnsproxy.
Could it be something IPV6 related? I see the message
No, it isn't related to IPv6.
00:00:16.757430 NAT: IPv6 not supported
is one of the last messages in the log... I will try to check with our ISP at the office whether they changed anything concerning IPv6 or DNS lately.
At least something like this would explain why the crashes suddenly started to occur on many different computers at the same time without any changes in the software configuration of these machines.
comment:8 by , 12 years ago
Thanks for your investigations. I'll try the build out on Tuesday when I'm back in the office, and I'll also take a deeper look at the differences in DNS config.
follow-up: 10 comment:9 by , 12 years ago
Just tried out the new build, but the problem persists.
I've uploaded a new core dump as vbox-crash-ticket-11649-3.tar.gz
I will check whether I can tune DNS settings or whether configuring an external DNS server on my computer will help (thus circumventing the router's DNS cache). Is there anything else I can do?
Thanks again!
follow-up: 11 comment:10 by , 12 years ago
Replying to c_t:
Just tried out the new build, but the problem persists.
I've uploaded a new core dump as vbox-crash-ticket-11649-3.tar.gz
Thank you for the feedback. I will look at the core file and get back to you after the analysis.
comment:11 by , 12 years ago
Replying to Hachiman:
Replying to c_t:
Just tried out the new build, but the problem persists.
I've uploaded a new core dump as vbox-crash-ticket-11649-3.tar.gz
Thank you for the feedback. I will look at the core file and get back to you after the analysis.
Did you already manage to have a look at the core dump? Is there any other information I could provide or do you have any other idea, what I could check?
comment:12 by , 12 years ago
Yes, I've looked at the core dump and made extensive changes, but I'm still not sure whether I've missed anything. I would still appreciate seeing some kind of diff between your home and office resolv.conf settings: are there dead servers in the list, or differences in the timeout settings?
comment:13 by , 12 years ago
OK, we now observed the following:
- In the case of no special network configuration on the host machine (Mac OS X) the DNS-Server is set via DHCP and the /etc/resolv.conf looks like this:
#
# Mac OS X Notice
#
# This file is not used by the host name and address resolution
# or the DNS query routing mechanisms used by most processes on
# this Mac OS X system.
#
# This file is automatically generated.
#
domain fritz.box
nameserver 192.168.20.1
The nameserver IP is pointing to the local router which is also the default gateway (192.168.20.1). It is this device: https://www.avm.de/en/Produkte/FRITZBox/FRITZ_6360_Cable/index.php
- We then went to the Mac OS X network configuration on the host system and changed the DNS server's IP to 8.8.8.8 (i.e. a Google nameserver). After a reboot of the host machine and a restart of VirtualBox, the problem is gone. /etc/resolv.conf now looks like this:
#
# Mac OS X Notice
#
# This file is not used by the host name and address resolution
# or the DNS query routing mechanisms used by most processes on
# this Mac OS X system.
#
# This file is automatically generated.
#
domain fritz.box
nameserver 8.8.8.8
Setting the nameserver IP to the Google nameserver solved the problem on both your 4.2.11 build and VirtualBox 4.2.6.
So as you already pointed out it's definitely the nameserver! Unfortunately I have no idea how to find out how the nameserver on the cable-router is configured since you can only access it through a web frontend that doesn't really reveal any nameserver settings...
comment:14 by , 12 years ago
Can you check how the query time (e.g. ;; Query time: 66 msec) differs between the Google DNS server and the one in your router settings, e.g. with dig:
# dig www.de @8.8.8.8
By the way, I believe your router just runs its own DNS proxy, which forwards requests to your internet provider's nameservers as configured during setup. So some statistics on resolution times against them may give me a hint for building an emulation environment closer to your settings.
comment:15 by , 12 years ago
Here are some results (I stripped all the irrelevant stuff from the dig output):
$ dig www.de @8.8.8.8
;; Query time: 27 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
$ dig www.de @192.168.20.1
;; Query time: 148 msec
;; SERVER: 192.168.20.1#53(192.168.20.1)
$ dig www.de @192.168.20.1
;; Query time: 9 msec
;; SERVER: 192.168.20.1#53(192.168.20.1)
$ dig @83.169.184.33 www.de
;; Query time: 152 msec
;; SERVER: 83.169.184.33#53(83.169.184.33)
$ dig @8.8.8.8 www.de
;; Query time: 25 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
So all in all the Google nameserver is very fast (~25 ms), while the ISP's nameserver (83.169.184.33), which is used by the cable modem/router, seems to be a lot slower at ~152 ms. From the router's DNS proxy (192.168.20.1) we see the expected behaviour: the first request takes roughly as long as the ISP's nameserver, and once the result is cached, responses are very fast (9 ms).
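Collecting these timings by hand gets tedious once several servers are involved. A small sketch of how the comparison could be scripted follows; query_time_ms and compare_servers are hypothetical helper names, and the parsing assumes dig prints the ";; Query time: N msec" line shown above.

```shell
# Print the millisecond value from the ";; Query time: N msec" line
# of dig output read on stdin.
query_time_ms() {
    awk '/^;; Query time:/ { print $4 }'
}

# Hypothetical helper: time one lookup of <name> against each listed server.
# Usage: compare_servers <name> <server>...
compare_servers() {
    name=$1
    shift
    for server in "$@"; do
        printf '%s\t%s ms\n' "$server" "$(dig "$name" "@$server" | query_time_ms)"
    done
}
```

For the servers in this ticket the call would be compare_servers www.de 8.8.8.8 192.168.20.1 83.169.184.33. Listing the router twice makes the caching effect visible, since the second lookup is answered from its cache.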
Does that information help?
comment:17 by , 12 years ago
I've checked against your timeouts here and it works fine, let's check in your environment.
Could you please verify the build? Note: it's a trunk build, so you had better back up your VM settings before trying it.
follow-up: 19 comment:18 by , 12 years ago
Unfortunately no luck. With the router's NS it still crashes, with the google NS it works...
comment:19 by , 12 years ago
Replying to c_t:
Unfortunately no luck. With the router's NS it still crashes, with the google NS it works...
Could you please upload the core file for this session?
comment:20 by , 11 years ago
Even though this seems to be dormant: I seem to have the same problem: NAT thread crashes on MacOS, FritzBox as router (different model, though), after I read this I put the Google DNS which seems to work. Interestingly enough, it works with a different Mac and FritzBox at home (different provider, though).
VirtualBox version is 4.2.16, Host OS is Mac OS 10.8, Guest OS is CentOS 6.4. The problem happens when trying to install the first package from the repositories (which usually tries to contact all mirrors to find the fastest one).
I can try to post crash logs if that is still interesting.
comment:21 by , 11 years ago
I've got the same issue on a Windows 7 host with the most recent VirtualBox (4.3.6) and Vagrant!
When the networking type is 'NAT', 'npm install bower' crashes the guest vm. When I switch to 'Network Bridge', the crash doesn't happen.
Please also see: https://www.virtualbox.org/ticket/12588
follow-up: 23 comment:22 by , 11 years ago
Interesting though, because the same VM works fine on Mac OS...
comment:23 by , 11 years ago
Replying to machete143:
Interesting though, because the same VM works fine on Mac OS...
Could you please collect pcap file from your guest?
follow-up: 25 comment:24 by , 11 years ago
I've attached the pcap file you requested. The commands I used (after login) were:
sudo -s
npm install bower grunt-cli
The VM crashed very soon after (like .5 - 1 seconds later)
comment:25 by , 11 years ago
Replying to machete143:
I've attached the pcap file you requested. The commands I used (after login) were:
thank you.
comment:26 by , 11 years ago
Could you please also share your VBox.log file? Rather interesting to look at your NAT configuration.
by , 11 years ago
Attachment: | v11649-test.sh added |
---|
comment:27 by , 11 years ago
Could you please check whether v11649-test.sh crashes your VM? It sends DNS requests similar to the last ones in your pcap file. Note that it requires xxd (usually installed with vim) and socat to be installed in the guest.
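Before running the test script it may be worth confirming that both tools are present in the guest. This is a minimal sketch; missing_tools is a hypothetical helper, not something shipped with the attachment.

```shell
# Hypothetical helper: print the name of every listed tool that is not on PATH.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Calling missing_tools xxd socat in the guest lists whichever of the two still needs to be installed.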
by , 11 years ago
Attachment: | vagrant_default_1389228581153_51611-2014-01-09-01-55-26.log added |
---|
Windows 7 Host / Precise 32 Guest Vbox Log
comment:28 by , 11 years ago
I've attached the VBox log. Unfortunately I can not run the sh file, because it tells me:
line 23: Syntax error: Bad for loop variable.
However, that was solvable by replacing #!/bin/sh with #!/bin/bash
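The "Bad for loop variable" message is typical of dash, which Ubuntu uses as /bin/sh, when it meets a bash-style arithmetic for loop. The attached script is not reproduced here, so the following is only a sketch of a counting loop that runs under both shells; send_queries and its echo body are placeholders for the real per-iteration work.

```shell
# Portable counting loop: works in dash (Ubuntu's /bin/sh) and in bash,
# unlike the bash-only form  for ((i = 0; i < n; i++)).
send_queries() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        # Placeholder for the real work, e.g. sending one DNS request.
        echo "query $i"
        i=$((i + 1))
    done
}
```

With a loop of this shape the script would not need its shebang changed, although switching to #!/bin/bash as described is the quicker fix.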
But unfortunately the script didn't crash the vm.
Please see my updated posts below, the vm crashes when executing the script via nodejs
I think this issue could be related to nodejs' non-blocking I/O architecture. What happens (at least I think this is what happens) is that npm asks for multiple DNS entries very quickly and queues the handling of each answer in a callback. So we get multiple simultaneous DNS requests although no answer has arrived yet. What your script does (if I read it correctly) is open 100 DNS queries sequentially, *waiting* for each reply before opening the next. Could this be right?
I'm from another field of development, but we have there something called "Race condition" where one thread is faster than the other and causes a fatal error. This could be related to this issue (again, I'm wild guessing here) where a second DNS query arrives *before* the first DNS query and maybe that causes the VM to crash? That could also be the reason why this error happens randomly (sometimes I can install 200 packages, sometimes the VM crashes right at the 2nd package).
I have no background with nodejs whatsoever, but I'll ask a friend of mine to write a nodejs script that opens many simultaneous connections to one server; maybe that script will crash the VM as well.
Please also note that I have a very fast internet link (100mbit) with a very low ping (usually < 10ms) and a direct connection (lan) to the router, so that might favor this issue. My MacBook is connected through WLAN, maybe that's one of the reasons why this doesn't happen there.
comment:29 by , 11 years ago
I've now also added a logfile with a VM I've set up manually (not through vagrant). This VM *does not crash* when executing 'npm install'.
However, when creating a vagrant box out of it and running 'vagrant up', the crash occurs again! So my wild guess is that vagrant is setting some weird option, which causes the system to crash!
Maybe a diff on those two can help you identifying the issue.
I've added two logfiles:
- VM not crashing (above this post, Athene2-vm)
- Vagrant VM based on the non crashing VM (below this post, vagrant_default)
The first logfile is the one, I've setup manually - this one doesn't crash with 'npm install'
The second logfile is a VM setup by vagrant. It uses the image of the 'VM not crashing' VM. This one crashes when executing 'npm install'
by , 11 years ago
Attachment: | vagrant_default_1389266271691_35940-2014-01-09-12-18-20.log added |
---|
Vagrant VM based on the non crashing VM
comment:30 by , 11 years ago
I've now attached two nodejs scripts. One ('request') executes multiple requests. The other, executes your script. BOTH crash the VM!
However, they only crash 'vagrant_default', not 'Athene2-vm'!
So this is just as I expected, the non blocking I/O architecture of nodejs is causing the trouble
comment:31 by , 11 years ago
I've made multiple edits to my previous posts so if you're only monitoring the mailing list, please check those!
tracing bugs really is fun :D
comment:32 by , 11 years ago
Thank you for the investigation. Right, the DNS calls are made extremely intensively within a short period. Could you please attach pcap files for Athene2-vm and vagrant_default? There is another thing I've noticed in file.pcap, the order of the responses, which could potentially be a reason. One more thing: Athene2-vm isn't an exact clone of vagrant_default. The first one uses the list of DNS servers exported by VirtualBox to the guest, while vagrant_default uses dnsproxy. I believe that once you change it, Athene2-vm will also crash.
follow-up: 37 comment:33 by , 11 years ago
I will provide you with a pcap once I'm back home. Can you provide me with a command to switch the dns behaviour (for both VMs)? If that's related to the dnsproxy setting, is there a possibility to fix this? Or should I contact the vagrant developers and tell them that the other setting should be used instead?
follow-up: 36 comment:34 by , 11 years ago
Using
config.vm.provider :virtualbox do |vb|
  vb.customize ['modifyvm', :id, '--natdnshostresolver1', 'on']
end
the VM doesn't crash any more! Do you still need the file.pcap of Athene2-vm?
follow-up: 39 comment:35 by , 11 years ago
'--natdnshostresolver1' is an alternative to '--natdnsproxy1'. The difference is that NAT asks the host to resolve the name via the C API rather than forwarding the UDP packet to the registered servers. Both can be switched off via VBoxManage modifyvm --natdns{hostresolver,proxy}1 off. If both are on, --natdnshostresolver1 wins.
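The three combinations discussed in this ticket can be written out as explicit VBoxManage calls. The sketch below is an illustration only: set_nat_dns_mode and its mode names (hostresolver, dnsproxy, serverlist) are made up for this example, and the VBOXMANAGE variable is an assumption that allows a dry run with echo. The two modifyvm options are the ones named above.

```shell
# Hypothetical wrapper around "VBoxManage modifyvm" for NAT adapter 1.
# Usage: set_nat_dns_mode <vm-name> hostresolver|dnsproxy|serverlist
set_nat_dns_mode() {
    vm=$1
    case $2 in
        # Host resolves names via its C API (the workaround from this ticket).
        hostresolver) resolver=on;  proxy=off ;;
        # NAT forwards the guest's DNS packets to the registered servers.
        dnsproxy)     resolver=off; proxy=on  ;;
        # Both off: the guest receives the list of DNS servers directly.
        serverlist)   resolver=off; proxy=off ;;
        *) echo "unknown mode: $2" >&2; return 1 ;;
    esac
    # VBOXMANAGE can be set to "echo" to print the command instead of running it.
    "${VBOXMANAGE:-VBoxManage}" modifyvm "$vm" \
        --natdnshostresolver1 "$resolver" --natdnsproxy1 "$proxy"
}
```

For example, VBOXMANAGE=echo set_nat_dns_mode my-vm dnsproxy prints the command without touching any VM. Note that modifyvm changes settings of a VM that is not running, so the VM should be powered off first.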
comment:36 by , 11 years ago
Replying to machete143:
the VM doesn't crash any more! Do you still need the file.pcap of Athene2-vm?
Yes please. But perhaps you can simply switch dnsproxy on and off instead (please don't forget to switch hostresolver off first), because if all the requests are identical, the crash may have kept the really problematic request/response out of the serialized pcap.
comment:37 by , 11 years ago
Replying to machete143:
I will provide you with a pcap once I'm back home. Can you provide me with a command to switch the dns behaviour (for both VMs)? If that's related to the dnsproxy setting, is there a possibility to fix this? Or should I contact the vagrant developers and tell them that the other setting should be used instead?
--natdnsproxy1 is preferable; it has been in the shadow of hostresolver for too long, so it is better to find the root cause and fix it properly :).
comment:38 by , 11 years ago
Unfortunately the pcap file was too big for me to upload, so I put it on dropbox: https://www.dropbox.com/s/qozdlzfu6lzvvnz/athene.pcap
Hope this helps :)
follow-up: 40 comment:39 by , 11 years ago
Replying to Hachiman:
'--natdnshostresolver1' is an alternative to '--natdnsproxy1'. The difference is that NAT asks the host to resolve the name via the C API rather than forwarding the UDP packet to the registered servers. Both can be switched off via VBoxManage modifyvm --natdns{hostresolver,proxy}1 off. If both are on, --natdnshostresolver1 wins.
So if I understood correctly, I should do:
vb.customize ['modifyvm', :id, '--natdnshostresolver1', 'on']
vb.customize ['modifyvm', :id, '--natdnsproxy1', 'off']
?
comment:40 by , 11 years ago
Replying to machete143:
Replying to Hachiman:
'--natdnshostresolver1' is an alternative to '--natdnsproxy1'. The difference is that NAT asks the host to resolve the name via the C API rather than forwarding the UDP packet to the registered servers. Both can be switched off via VBoxManage modifyvm --natdns{hostresolver,proxy}1 off. If both are on, --natdnshostresolver1 wins.
So if I understood correctly, I should do:
vb.customize ['modifyvm', :id, '--natdnshostresolver1', 'on']
vb.customize ['modifyvm', :id, '--natdnsproxy1', 'off']
?
No, both should be off.
comment:41 by , 11 years ago
With those settings, vagrant doesn't work :) If any one else has this problem, this setting worked for me:
vb.customize ['modifyvm', :id, '--natdnshostresolver1', 'on']
comment:42 by , 10 years ago
Any progress on this? I'm most likely having the same issue: Using bower crashes my CentOS 6.5 virtual machine (setup using Vagrant).
comment:43 by , 10 years ago
I can confirm that the workaround provided above works for me:
config.vm.provider :virtualbox do |vb|
  vb.customize ['modifyvm', :id, '--natdnshostresolver1', 'on']
end
comment:44 by , 10 years ago
follow-up: 47 comment:46 by , 9 years ago
forgive me if this seems like a silly question, but where do we put this workaround?
comment:47 by , 9 years ago
Replying to fotoflo:
forgive me if this seems like a silly question, but where do we put this workaround?
Why do you need the workaround? Do you still experience this problem with 4.3.28 or later?
comment:48 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Please reopen if necessary. The above comments clearly say that the fix is part of VBox 4.3.28 or later (VBox 5.0.0 includes the fix as well).
If it is easy to reproduce in your environment, could you please collect a core dump and upload it to ftp://ftp.oracle.com/appsdev/incoming, and also attach the log file from the crashing session?