VirtualBox

Ticket #11649 (closed defect: fixed)

Opened 7 years ago

Last modified 5 years ago

NAT-related crash of ubuntu guest on OSX host.

Reported by: c_t
Owned by:
Component: network/NAT
Version: VirtualBox 4.2.10
Keywords: NAT, crash
Cc:
Guest type: Linux
Host type: Mac OS X

Description

Installing node.js npm dependencies inside an Ubuntu 12.04 guest system, which opens a lot of simultaneous HTTP connections, reproducibly crashes VirtualBox with the following crash report:

Process:         VirtualBoxVM [52317]
Path:            /Applications/VirtualBox.app/Contents/MacOS/VirtualBoxVM
Identifier:      VirtualBoxVM
Version:         ??? (???)
Code Type:       X86-64 (Native)
Parent Process:  VBoxSVC [29251]

Date/Time:       2013-03-27 11:32:37.325 +0100
OS Version:      Mac OS X 10.7.5 (11G63)
Report Version:  9
Sleep/Wake UUID: 6554FC2E-A188-4914-A585-61EF6051F79C

Crashed Thread:  20  NAT

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x0000000000000010


Thread 20 Crashed:: NAT
0   VBoxDD.dylib                  	0x000000010db6ed59 VBoxDevicesRegister + 838601
1   VBoxDD.dylib                  	0x000000010db65b85 VBoxDevicesRegister + 801269
2   VBoxDD.dylib                  	0x000000010db5d5bb VBoxDevicesRegister + 767019
3   VBoxVMM.dylib                 	0x000000010280f96a PDMR3ThreadCreate + 970
4   VBoxRT.dylib                  	0x00000001002409df RTThreadCreateF + 271
5   VBoxRT.dylib                  	0x00000001002930fc RTThreadPoke + 540
6   libsystem_c.dylib             	0x00007fff8e6cb8bf _pthread_start + 335
7   libsystem_c.dylib             	0x00007fff8e6ceb75 thread_start + 13

Full crash log attached.

I experience this on Mac OS X 10.7.5 and Mac OS X 10.8.3 with VirtualBox 4.1.23 and 4.2.10.

We are currently seeing this issue on 3 different host machines, and it "just appeared" without any updates to VirtualBox or the guest OS.

Maybe it was introduced with a Mac OS X update?

Attachments

VirtualBoxVM_2013-03-27-113240_localhost.crash (57.8 KB) - added by c_t 7 years ago.
file.pcap (26.0 KB) - added by machete143 6 years ago.
  Windows 7 Host PCAP File
v11649-test.sh (1.2 KB) - added by Hachiman 6 years ago.
vagrant_default_1389228581153_51611-2014-01-09-01-55-26.log (55.5 KB) - added by machete143 6 years ago.
  Windows 7 Host / Precise 32 Guest Vbox Log
Athene2-vm-2014-01-09-00-53-21.log (88.9 KB) - added by machete143 6 years ago.
  VM not crashing
vagrant_default_1389266271691_35940-2014-01-09-12-18-20.log (60.4 KB) - added by machete143 6 years ago.
  Vagrant VM based on the non crashing VM
nodejs.zip (2.0 KB) - added by machete143 6 years ago.
  nodejs scripts

Change History

Changed 7 years ago by c_t

comment:1 Changed 7 years ago by Hachiman

If it is easy to reproduce in your environment, could you please collect a core dump and upload it to ftp://ftp.oracle.com/appsdev/incoming, and also attach the log file from the crashing session?

comment:2 Changed 7 years ago by c_t

I've uploaded a tarball called vbox-crash-ticket-11649.tar.gz into  ftp://ftp.oracle.com/appsdev/incoming including the core dump, the crash report and some corresponding lines from /var/log/system.log

Thank you for your help.

comment:3 Changed 7 years ago by Hachiman

Thanks for uploading the core file, but could you please attach the log file from your VBox session (please look at http://www.virtualbox.org/manual/ch12.html#idp16864224 for more details)?

comment:4 Changed 7 years ago by c_t

Oh, sorry for the misunderstanding. I've uploaded another tarball which now includes the corresponding VBox.log to make sure all files are consistent.

Thanks again!

comment:5 Changed 7 years ago by Hachiman

Could you please describe what your guest is doing? I've been able to get a stack trace from your core file, but it doesn't point to an obviously problematic place. It looks like the crash is related to the dnsproxy code, which is enabled in your scenario. I've tried several DNS-related benchmarks to reproduce the issue you're experiencing, but didn't succeed.

comment:6 follow-up: ↓ 7 Changed 7 years ago by c_t

On the guest system (Ubuntu 12.04) I'm running

npm install

which installs about 132 npm packages for node.js ( http://nodejs.org/). These packages are hosted on HTTP locations, so the process resolves a lot of DNS names and downloads many files in parallel. One workaround we have found is to install the 132 packages manually in series rather than in parallel, which does not cause crashes (so only a few concurrent DNS lookups and HTTP requests). Installing them in parallel, however, reproducibly results in the described crash.

Interestingly, I now did the same at home (so logged into a different network with the host machine) and the problem does not occur! Nothing else has changed with my machine except for the network connection of the host machine!

Maybe that's helping? I'll be asking my colleagues, who are experiencing the same issues at the office, whether they can try to reproduce the problems at home.

Do you have any suspicion about what in the office network could cause these weird crashes? Could it be something IPv6 related? I see the message

00:00:16.757430 NAT: IPv6 not supported

is one of the last messages in the log... I will try to check with our ISP at the office whether they changed anything concerning IPv6 or DNS lately.

At least something like this would explain why the crashes suddenly started to occur on many different computers at the same time without any changes in the software configuration of these machines.

comment:7 in reply to: ↑ 6 Changed 7 years ago by Hachiman

Replying to c_t:

Could you please try the build? I've changed the cleanup performed when processing timeouts in the dnsproxy code, which could potentially have been causing the issue.

> On the guest system (Ubuntu 12.04) I'm running
>
> npm install
>
> Interestingly, I now did the same at home (so logged into a different network with the host machine) and the problem does not occur! Nothing else has changed with my machine except for the network connection of the host machine!

Hmm, interesting observation. What is the delta between your environments (the number of DNS servers, perhaps)?

> Maybe that's helping? I'll be asking my colleagues, who are experiencing the same issues at the office, whether they can try to reproduce the problems at home.
>
> Do you have any suspicion about what in the office network could cause these weird crashes?

There must be some difference in your DNS settings or in the DNS servers' behaviour that provokes the bug in NAT/dnsproxy.

> Could it be something IPv6 related? I see the message

No, it isn't related to IPv6.

> 00:00:16.757430 NAT: IPv6 not supported
>
> is one of the last messages in the log... I will try to check with our ISP at the office whether they changed anything concerning IPv6 or DNS lately.
>
> At least something like this would explain why the crashes suddenly started to occur on many different computers at the same time without any changes in the software configuration of these machines.

comment:8 Changed 7 years ago by c_t

Thanks for your investigations. I'll try the build out on Tuesday when I'm back in the office, and I'll also have a deeper look at the differences in DNS config.

comment:9 follow-up: ↓ 10 Changed 7 years ago by c_t

Just tried out the new build, but the problem persists.

I've uploaded a new core dump as vbox-crash-ticket-11649-3.tar.gz

I will check whether I can tune DNS settings or whether configuring an external DNS server on my computer will help (thus circumventing the router's DNS cache). Is there anything else I can do?

Thanks again!

comment:10 in reply to: ↑ 9 ; follow-up: ↓ 11 Changed 7 years ago by Hachiman

Replying to c_t:

> Just tried out the new build, but the problem persists.
>
> I've uploaded a new core dump as vbox-crash-ticket-11649-3.tar.gz

Thank you for the feedback. I will look at the core file and get back to you after the analysis.

comment:11 in reply to: ↑ 10 Changed 7 years ago by c_t

Replying to Hachiman:

> Replying to c_t:
>
> > Just tried out the new build, but the problem persists.
> >
> > I've uploaded a new core dump as vbox-crash-ticket-11649-3.tar.gz
>
> Thank you for the feedback. I will look at the core file and get back to you after the analysis.

Did you already manage to have a look at the core dump? Is there any other information I could provide or do you have any other idea, what I could check?

comment:12 Changed 7 years ago by Hachiman

Yes, I've looked at the core dump. I've made extensive changes, but I'm still not sure whether I've missed anything. I would still appreciate some kind of diff between your home and office resolv.conf settings: are there any dead servers in the list, or differences in the timeout settings?

comment:13 Changed 7 years ago by c_t

OK, we now observed the following:

  1. In the case of no special network configuration on the host machine (Mac OS X) the DNS-Server is set via DHCP and the /etc/resolv.conf looks like this:
#
# Mac OS X Notice
#
# This file is not used by the host name and address resolution
# or the DNS query routing mechanisms used by most processes on
# this Mac OS X system.
#
# This file is automatically generated.
#
domain fritz.box
nameserver 192.168.20.1

The nameserver IP is pointing to the local router which is also the default gateway (192.168.20.1). It is this device:  https://www.avm.de/en/Produkte/FRITZBox/FRITZ_6360_Cable/index.php

  2. We now went to the Mac OS X network configuration on the host system and changed the DNS server's IP to 8.8.8.8 (i.e. a Google nameserver). After a reboot of the host machine and a restart of VirtualBox the problem is gone. /etc/resolv.conf now looks like this:
#
# Mac OS X Notice
#
# This file is not used by the host name and address resolution
# or the DNS query routing mechanisms used by most processes on
# this Mac OS X system.
#
# This file is automatically generated.
#
domain fritz.box
nameserver 8.8.8.8

Setting the nameserver IP to the Google nameserver solved the problem on both your 4.2.11 build and VirtualBox 4.2.6.

So as you already pointed out it's definitely the nameserver! Unfortunately I have no idea how to find out how the nameserver on the cable-router is configured since you can only access it through a web frontend that doesn't really reveal any nameserver settings...
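The two configurations above differ only in the nameserver line. A minimal sketch for pulling the nameserver addresses out of a resolv.conf-style file, so that the office and home setups can be diffed, might look like this. The function name list_nameservers and the file names in the example are made up for illustration. As the file's own header notes, most Mac OS X processes do not read /etc/resolv.conf, so on the host the output of 'scutil --dns' is probably the more authoritative view.

```shell
# list_nameservers [FILE]: print the address from every "nameserver" line.
# FILE defaults to /etc/resolv.conf. Comment lines are skipped automatically
# because their first field is never exactly "nameserver".
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Example (hypothetical file names): compare the current list with a saved copy.
# list_nameservers > office-nameservers.txt
# list_nameservers /path/to/home-resolv.conf > home-nameservers.txt
# diff office-nameservers.txt home-nameservers.txt
```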

comment:14 Changed 7 years ago by Hachiman

Can you check how the query time (e.g. ';; Query time: 66 msec' in the dig report) differs between the Google DNS server and the one in your router settings, e.g. with dig:

# dig www.de @8.8.8.8

By the way, I believe your router just runs its own DNS proxy, which forwards requests to your internet provider's nameservers as configured during setup. So some statistics on resolution times against them might give me a hint for emulating an environment closer to yours.

Last edited 7 years ago by Hachiman

comment:15 Changed 7 years ago by c_t

Here are some results (I stripped all the irrelevant stuff from the dig output):

$ dig www.de @8.8.8.8

;; Query time: 27 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)

$ dig www.de @192.168.20.1

;; Query time: 148 msec
;; SERVER: 192.168.20.1#53(192.168.20.1)

$ dig www.de @192.168.20.1

;; Query time: 9 msec
;; SERVER: 192.168.20.1#53(192.168.20.1)

$ dig @83.169.184.33 www.de

;; Query time: 152 msec
;; SERVER: 83.169.184.33#53(83.169.184.33)

$ dig @8.8.8.8 www.de

;; Query time: 25 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)

All in all, the Google nameserver is very fast (~25 ms), while the ISP's nameserver used by the cable modem/router (83.169.184.33) is a lot slower (~152 ms). The router's DNS proxy (192.168.20.1) behaves as expected: the first request takes roughly as long as the ISP's nameserver, and once the result is cached, responses are very fast (9 ms).

Does that information help?
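The manual comparison above could be scripted roughly as follows. This is only a sketch: the function names query_time_ms and time_servers are invented here, and it assumes dig prints its usual ';; Query time: N msec' line as in the output quoted above.

```shell
# query_time_ms: read dig output on stdin and print the number of milliseconds
# from the ";; Query time: N msec" line (the fourth whitespace-separated field).
query_time_ms() {
  awk '/^;; Query time:/ { print $4 }'
}

# time_servers NAME SERVER...: query NAME against each server once and print
# "server<TAB>N ms". Run it twice to see the effect of a caching proxy.
time_servers() {
  name="$1"
  shift
  for server in "$@"; do
    ms=$(dig "$name" "@$server" | query_time_ms)
    printf '%s\t%s ms\n' "$server" "${ms:-n/a}"
  done
}

# Example (needs dig and network access):
# time_servers www.de 8.8.8.8 192.168.20.1 83.169.184.33
```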

comment:16 Changed 7 years ago by Hachiman

Thank you for the update. I will emulate the corresponding timeouts here.

comment:17 Changed 7 years ago by Hachiman

I've tested against your timeouts here and it works fine. Let's check in your environment.

Could you please verify the build? Note: it's a trunk build, so you'd better back up your VM settings before trying it.

Last edited 7 years ago by Hachiman

comment:18 follow-up: ↓ 19 Changed 7 years ago by c_t

Unfortunately no luck. With the router's NS it still crashes, with the google NS it works...

comment:19 in reply to: ↑ 18 Changed 7 years ago by Hachiman

Replying to c_t:

> Unfortunately no luck. With the router's NS it still crashes, with the google NS it works...

Could you please upload the core file for this session?

comment:20 Changed 7 years ago by betterdevs

Even though this seems to be dormant: I seem to have the same problem. The NAT thread crashes on Mac OS with a FritzBox as router (a different model, though). After reading this I configured the Google DNS, which seems to work. Interestingly enough, it works with a different Mac and FritzBox at home (different provider, though).

VirtualBox version is 4.2.16, Host OS is Mac OS 10.8, Guest OS is CentOS 6.4. The problem happens when trying to install the first package from the repositories (which usually tries to contact all mirrors to find the fastest one).

I can try to post crash logs if that is still interesting.

Last edited 7 years ago by betterdevs

comment:21 Changed 6 years ago by machete143

I've got the same issue on a Windows 7 host with the most recent Virtual Box (4.3.6) and vagrant!

When the networking type is 'NAT', 'npm install bower' crashes the guest vm. When I switch to 'Network Bridge', the crash doesn't happen.

Please also see: https://www.virtualbox.org/ticket/12588

comment:22 follow-up: ↓ 23 Changed 6 years ago by machete143

Interesting though, because the same VM works fine on Mac OS...

comment:23 in reply to: ↑ 22 Changed 6 years ago by Hachiman

Replying to machete143:

> Interesting though, because the same VM works fine on Mac OS...

Could you please collect a pcap file from your guest?

Changed 6 years ago by machete143

Windows 7 Host PCAP File

comment:24 follow-up: ↓ 25 Changed 6 years ago by machete143

I've attached the pcap file you requested. The commands I used (after login) were:

sudo -s
npm install bower grunt-cli

The VM crashed very soon after (like .5 - 1 seconds later)

comment:25 in reply to: ↑ 24 Changed 6 years ago by Hachiman

Replying to machete143:

> I've attached the pcap file you requested. The commands I used (after login) were:

Thank you.

comment:26 Changed 6 years ago by Hachiman

Could you please also share your VBox.log file? It would be interesting to look at your NAT configuration.

Changed 6 years ago by Hachiman

comment:27 Changed 6 years ago by Hachiman

Could you please check whether v11649-test.sh crashes your VM? It sends DNS requests similar to the last ones in your file. Note: it requires xxd (usually installed with vim) and socat to be installed in the guest.

Changed 6 years ago by machete143

Windows 7 Host / Precise 32 Guest Vbox Log

comment:28 Changed 6 years ago by machete143

I've attached the VBox log. Unfortunately I can not run the sh file, because it tells me:

line 23: Syntax error: Bad for loop variable.

However, that was solvable by replacing #!/bin/sh with #!/bin/bash

But unfortunately the script didn't crash the vm.

Please see my updated posts below, the vm crashes when executing the script via nodejs

I think this issue could be related to nodejs' non-blocking I/O architecture. What happens (at least I think this is what happens) is that npm requests multiple DNS entries in rapid succession and queues the handling of each answer in a callback. So we get multiple simultaneous DNS requests in flight before any answer has arrived. What your script does (if I read it correctly) is open 100 DNS queries sequentially, *waiting* for each reply before opening the next. Could this be right?
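The difference between the two request patterns described here can be sketched in shell. This is an illustration only, not the attached test script or the nodejs scripts, and it has not been verified to reproduce the crash. The function names are invented, and the use of 'getent hosts' as the lookup command is an assumption; any resolver command that takes a host name as its last argument would do.

```shell
# Command used for a single lookup (assumption: getent is available in the guest).
RESOLVE="${RESOLVE:-getent hosts}"

# Sequential pattern: each lookup finishes before the next one starts.
# Prints the number of lookups performed.
resolve_sequential() {
  count=0
  for name in "$@"; do
    $RESOLVE "$name" >/dev/null 2>&1
    count=$((count + 1))
  done
  echo "$count"
}

# Parallel pattern: every lookup is started in the background and awaited at
# the end, so many DNS queries are in flight before any answer has arrived.
# Prints the number of lookups started.
resolve_parallel() {
  count=0
  for name in "$@"; do
    $RESOLVE "$name" >/dev/null 2>&1 &
    count=$((count + 1))
  done
  wait
  echo "$count"
}

# Example (needs network access inside the guest):
# resolve_parallel registry.npmjs.org nodejs.org github.com
```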

I'm from another field of development, but there we have something called a "race condition", where one thread is faster than the other and causes a fatal error. This could be related to this issue (again, I'm guessing wildly here), where a second DNS query arrives *before* the first DNS query and maybe that causes the VM to crash? That could also be the reason why this error happens randomly (sometimes I can install 200 packages, sometimes the VM crashes right at the 2nd package).

I have no background with nodejs whatsoever, but I'll ask a friend of mine to write a nodejs script that opens many simultaneous connections to one server, and maybe that script will crash the VM as well.

Please also note that I have a very fast internet link (100 Mbit) with a very low ping (usually < 10 ms) and a direct connection (LAN) to the router, so that might favor this issue. My MacBook is connected through WLAN; maybe that's one of the reasons why this doesn't happen there.

Last edited 6 years ago by machete143

Changed 6 years ago by machete143

VM not crashing

comment:29 Changed 6 years ago by machete143

I've now also added a logfile with a VM I've set up manually (not through vagrant). This VM *does not crash* when executing 'npm install'.

However, when creating a vagrant box out of it and running 'vagrant up', the crash occurs again! So my wild guess is that vagrant is setting some weird option, which causes the system to crash!

Maybe a diff on those two can help you identifying the issue.

I've added two logfiles:

  • VM not crashing (above this post, Athene2-vm)
  • Vagrant VM based on the non crashing VM (below this post, vagrant_default)

The first logfile is from the VM I set up manually. This one doesn't crash with 'npm install'.

The second logfile is from a VM set up by vagrant. It uses the image of the 'VM not crashing' VM. This one crashes when executing 'npm install'.

Last edited 6 years ago by machete143

Changed 6 years ago by machete143

Vagrant VM based on the non crashing VM

comment:30 Changed 6 years ago by machete143

I've now attached two nodejs scripts. One ('request') executes multiple requests. The other executes your script. BOTH crash the VM!

However, they only crash 'vagrant_default', not 'Athene2-vm'!

So this is just as I expected: the non-blocking I/O architecture of nodejs is causing the trouble.

Last edited 6 years ago by machete143

comment:31 Changed 6 years ago by machete143

I've made multiple edits to my previous posts so if you're only monitoring the mailing list, please check those!

tracing bugs really is fun :D

Last edited 6 years ago by machete143

comment:32 Changed 6 years ago by Hachiman

Thank you for the investigation. Right, the DNS calls are made extremely intensively within a short period. Could you please attach pcap files for Athene2-vm and vagrant_default? There is another thing I've noticed in file.pcap, the order of the responses, which could potentially be a reason. One more thing: Athene2-vm isn't actually an exact clone of vagrant_default. The first one uses the list of DNS servers exported by VirtualBox to the guest, while vagrant_default uses dnsproxy. I believe that once you change that, Athene2-vm will also crash.

Changed 6 years ago by machete143

nodejs scripts

comment:33 follow-up: ↓ 37 Changed 6 years ago by machete143

I will provide you with a pcap once I'm back home. Can you provide me with a command to switch the dns behaviour (for both VMs)? If that's related to the dnsproxy setting, is there a possibility to fix this? Or should I contact the vagrant developers and tell them that the other setting should be used instead?

comment:34 follow-up: ↓ 36 Changed 6 years ago by machete143

Using

config.vm.provider :virtualbox do |vb|
  vb.customize ['modifyvm', :id, '--natdnshostresolver1', 'on']
end

the VM doesn't crash any more! Do you still need the file.pcap of Athene2-vm?

comment:35 follow-up: ↓ 39 Changed 6 years ago by Hachiman

'--natdnshostresolver1' is an alternative to '--natdnsproxy1'. The difference is that with the host resolver NAT asks the host to resolve the name via the C API, rather than forwarding the UDP packet to the registered servers. Either can be switched off via VBoxManage modifyvm --natdns{hostresolver,proxy}1 off. If both are on, --natdnshostresolver1 wins.
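Put together, switching between the DNS modes discussed in this ticket might look like the sketch below. The wrapper function set_nat_dns and its mode names are made up for illustration; only the modifyvm options themselves come from the comments above. The VM name is a placeholder, and modifyvm expects the VM to be powered off.

```shell
# Path to the CLI; overridable so the function can be dry-run (VBOXMANAGE=echo).
VBOXMANAGE="${VBOXMANAGE:-VBoxManage}"

# set_nat_dns VM MODE: choose how NAT adapter 1 handles guest DNS queries.
#   hostresolver  NAT asks the host OS to resolve names
#   proxy         NAT forwards DNS packets to the registered servers
#   none          both off; the guest gets the DNS server list exported by VirtualBox
set_nat_dns() {
  vm="$1"
  case "$2" in
    hostresolver) "$VBOXMANAGE" modifyvm "$vm" --natdnshostresolver1 on --natdnsproxy1 off ;;
    proxy)        "$VBOXMANAGE" modifyvm "$vm" --natdnshostresolver1 off --natdnsproxy1 on ;;
    none)         "$VBOXMANAGE" modifyvm "$vm" --natdnshostresolver1 off --natdnsproxy1 off ;;
    *)            echo "usage: set_nat_dns VM hostresolver|proxy|none" >&2; return 2 ;;
  esac
}

# Example ("my-vm" is a placeholder; the VM must not be running):
# set_nat_dns my-vm hostresolver
```

Setting the other option off explicitly is a design choice of this sketch; it avoids relying on the rule that the host resolver wins when both are on.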

comment:36 in reply to: ↑ 34 Changed 6 years ago by Hachiman

Replying to machete143:

> the VM doesn't crash any more! Do you still need the file.pcap of Athene2-vm?

Yes please, but perhaps you can just switch dnsproxy on and off (please don't forget to switch hostresolver off first). If all the requests are identical, the really problematic request/response may be missing from the serialized pcap because of the crash.

comment:37 in reply to: ↑ 33 Changed 6 years ago by Hachiman

Replying to machete143:

> I will provide you with a pcap once I'm back home. Can you provide me with a command to switch the dns behaviour (for both VMs)? If that's related to the dnsproxy setting, is there a possibility to fix this? Or should I contact the vagrant developers and tell them that the other setting should be used instead?

--natdnsproxy1 is preferable. It has been in the shadow of hostresolver for too long, so it is better to find the root cause and fix it properly :).

comment:38 Changed 6 years ago by machete143

Unfortunately the pcap file was too big for me to upload, so I put it on dropbox:  https://www.dropbox.com/s/qozdlzfu6lzvvnz/athene.pcap

Hope this helps :)

Last edited 6 years ago by machete143

comment:39 in reply to: ↑ 35 ; follow-up: ↓ 40 Changed 6 years ago by machete143

Replying to Hachiman:

> '--natdnshostresolver1' is an alternative to '--natdnsproxy1'. The difference is that with the host resolver NAT asks the host to resolve the name via the C API, rather than forwarding the UDP packet to the registered servers. Either can be switched off via VBoxManage modifyvm --natdns{hostresolver,proxy}1 off. If both are on, --natdnshostresolver1 wins.

So if I understood correctly, I should do:

vb.customize ['modifyvm', :id, '--natdnshostresolver1', 'on']
vb.customize ['modifyvm', :id, '--natdnsproxy1', 'off']

?

comment:40 in reply to: ↑ 39 Changed 6 years ago by Hachiman

Replying to machete143:

> Replying to Hachiman:
>
> > '--natdnshostresolver1' is an alternative to '--natdnsproxy1'. The difference is that with the host resolver NAT asks the host to resolve the name via the C API, rather than forwarding the UDP packet to the registered servers. Either can be switched off via VBoxManage modifyvm --natdns{hostresolver,proxy}1 off. If both are on, --natdnshostresolver1 wins.
>
> So if I understood correctly, I should do:
>
> vb.customize ['modifyvm', :id, '--natdnshostresolver1', 'on']
> vb.customize ['modifyvm', :id, '--natdnsproxy1', 'off']
>
> ?

No, both should be off.

comment:41 Changed 6 years ago by machete143

With those settings, vagrant doesn't work :) If anyone else has this problem, this setting worked for me:

vb.customize ['modifyvm', :id, '--natdnshostresolver1', 'on']

comment:42 Changed 5 years ago by Takis

Any progress on this? I'm most likely having the same issue: Using bower crashes my CentOS 6.5 virtual machine (setup using Vagrant).

comment:43 Changed 5 years ago by Takis

I can confirm that the workaround provided above works for me:

config.vm.provider :virtualbox do |vb|
  vb.customize ['modifyvm', :id, '--natdnshostresolver1', 'on']
end

comment:44 Changed 5 years ago by vushakov

This might be the same underlying use-after-free problem as the one described in #13994. It was fixed after 4.3.26 and will be available in the next 4.3 release. Meanwhile you may give a recent testbuild a try.

comment:45 Changed 5 years ago by vushakov

The fix is part of 4.3.28. Please, give it a try.
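Whether an installed VirtualBox already contains the fix could be checked with something like the following sketch. The function names strip_revision and has_fix are invented for illustration, and the sketch assumes that 'VBoxManage --version' prints a string such as 4.3.28r100309 (version followed by a revision suffix).

```shell
# strip_revision VERSION: drop everything from the first character that is
# neither a digit nor a dot, e.g. 4.3.28r100309 -> 4.3.28.
strip_revision() {
  printf '%s\n' "${1%%[!0-9.]*}"
}

# has_fix VERSION: succeed if VERSION is 4.3.28 or later.
has_fix() {
  v=$(strip_revision "$1")
  old_ifs=$IFS
  IFS=.
  set -- $v            # split "major.minor.patch" into $1 $2 $3
  IFS=$old_ifs
  major=${1:-0}
  minor=${2:-0}
  patch=${3:-0}
  if [ "$major" -gt 4 ]; then return 0; fi
  if [ "$major" -lt 4 ]; then return 1; fi
  if [ "$minor" -gt 3 ]; then return 0; fi
  if [ "$minor" -lt 3 ]; then return 1; fi
  [ "$patch" -ge 28 ]
}

# Example (needs VirtualBox installed):
# if has_fix "$(VBoxManage --version)"; then echo "fix included"; else echo "upgrade to 4.3.28 or later"; fi
```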

comment:46 follow-up: ↓ 47 Changed 5 years ago by fotoflo

Forgive me if this seems like a silly question, but where do we put this workaround?

comment:47 in reply to: ↑ 46 Changed 5 years ago by vushakov

Replying to fotoflo:

> Forgive me if this seems like a silly question, but where do we put this workaround?

Why do you need the workaround? Do you still experience this problem with 4.3.28 or later?

comment:48 Changed 5 years ago by frank

  • Status changed from new to closed
  • Resolution set to fixed

Please reopen if necessary. The above comments clearly say that the fix is part of VBox 4.3.28 or later (VBox 5.0.0 includes the fix as well).
