Opened 9 years ago
Last modified 3 years ago
#14748 new defect
nat network hangs intermittently
Reported by: | dtgeorge | Owned by: | |
---|---|---|---|
Component: | network/NAT | Version: | VirtualBox 5.0.8 |
Keywords: | nat | Cc: | |
Guest type: | Windows | Host type: | Linux |
Description
Host: Ubuntu 14.04 64Bit
Virtualbox 5.0.8
Guest: Windows Server 2012, Windows XP - defect present for both guests.
Both guests are headless, used only by RDP.
Sometimes (1-2 times per day) the guest network adapter stops receiving any packets from the "media"; the guest can't ping the default gateway for this network interface.
The only thing that fixes the issue is a NAT network restart:
VBoxManage natnetwork stop --netname nat-int-network-2
VBoxManage natnetwork start --netname nat-int-network-2
Of course, there may be other ways... rebooting the particular guest or performing "sudo /etc/init.d/vboxdrv restart" (which requires stopping all VMs)... But the only way I found to fix the issue without rebooting/stopping the guest is restarting the NAT network. I have two NAT networks (192.168.1.0/24 and 192.168.2.0/24). They hang independently:
- The first one, nat-int-network (192.168.1.0/24), connected to the WinXP guest (single network adapter within the guest), hangs rarely - 1-2 times per month.
- The second one, nat-int-network-2 (192.168.2.0/24), connected to the Windows Server 2012 guest (one of two network adapters within the guest; the second one is an openvpn tap0 interface attached to the guest as a Bridged Adapter), hangs more often - 1-2 times per day... sometimes once per 2-3 days...
The Windows Server guest is loaded with 5-10 simultaneous RDP sessions, but the average load is quite moderate...
I understand that the info provided is not enough. Please let me know what kind of details I should add.
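For reference, a minimal host-side helper wrapping the restart workaround above - a sketch only, assuming the network name shown in this report (pass another name as the first argument):
#!/bin/sh
# restart-natnet.sh - bounce one VirtualBox NAT network by name (hypothetical helper).
# Usage: ./restart-natnet.sh [netname]
NET="${1:-nat-int-network-2}"
VBoxManage natnetwork stop --netname "$NET"
VBoxManage natnetwork start --netname "$NET"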
Attachments (4)
Change History (86)
comment:1 by , 9 years ago
comment:3 by , 9 years ago
Replying to dtgeorge:
That is production server. Is it safe to use test builds?
Depends on values of "safe". It's a build from the release branch of what will eventually become the next dot-dot version. If you don't feel comfortable using a test build, you'll have to wait until the next release.
comment:4 by , 9 years ago
About your advice to install a test build: is it a standard procedure, or do you know of specific NAT network patches/fixes present in the test builds that could help solve my issue?
comment:5 by , 9 years ago
UPD: I spent some time trying to find out where I can see a list of exactly which patches and improvements the test build contains, but no success for me (
comment:6 by , 9 years ago
Test builds are builds of the respective branches which are used for building release packages. We don't publish changes for test builds. Test builds can be considered stable but there is no guarantee. If you are asked to install a test build there is a good reason for doing so. If you install the test build and can confirm that a problem is fixed then this is your contribution to making the product better. Otherwise you have to wait until the next maintenance release, with the risk that the fix included in the test build is not relevant for your problem.
comment:7 by , 9 years ago
Does phpvirtualbox work fine with test builds?
If I already have an installation of virtualbox 5.0.8 and phpvirtualbox, what steps should I take to become a test build user? Uninstall virtualbox and then execute the .run script?
comment:8 by , 9 years ago
Test build 103669 installed. Just started to contribute ))))
So far so good... Watching...
comment:9 by , 9 years ago
Got it! Test build 103669 installed. 7 days of stable work and bump - no packets received on the interface!
I restarted the network and only after 10-15 seconds (not immediately after the restart!) did the interface return to normal operation (ping to 192.168.2.1 from within the VM started to respond).
VBoxManage natnetwork stop --netname nat-int-network-2
VBoxManage natnetwork start --netname nat-int-network-2
comment:10 by , 9 years ago
comment:11 by , 9 years ago
I am not sure if it was #13899 or #13987 or something else, because nobody told me how to go deeper into diagnosing my issue... I was told to install a test build and that's it! I asked for instructions in my first post... So for now all those issues are a "black box" for me...
#13987 describes a Windows host, but I have Ubuntu 64-bit. Also, in the #13987 issue the VMs can "see" each other. In my case all VMs are in different nets, so I cannot confirm #13987-like behavior. But, as I mentioned, today I couldn't ping even the default gw (192.168.2.1) from the VM...
comment:12 by , 9 years ago
Sorry, I guess I was confusing this with another similar bug for a moment...
When your network becomes unresponsive, can you attach to the VBoxNetNAT process with gdb and issue
(gdb) set height 0
(gdb) thread apply all bt
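For reference, the same request can be run non-interactively; a sketch only, assuming gdb is installed on the Linux host and pgrep finds exactly one VBoxNetNAT process (otherwise substitute the PID by hand):
# Attach to the running VBoxNetNAT, dump all thread backtraces to a file, then detach.
sudo gdb -p "$(pgrep -x VBoxNetNAT)" -batch -ex 'set height 0' -ex 'thread apply all bt' > vboxnetnat-backtrace.txt 2>&1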
comment:13 by , 9 years ago
Never did this before, but I'll try... if I have enough time at that moment - I mean the server should be serving clients, there's no time for tests...
Maybe I should first update to a newer test build?
comment:14 by , 8 years ago
Hi, I've encountered this on 5.1.4. I've also seen this in 5.0.x.
Host: Windows 8.1 Guest: Debian sid (Kernel 4.7.0-1-amd64, current Guest Additions)
Switching back to "normal" NAT solves the problem, and it is stable. But running NAT Network leads to a broken network after some time. (usually hours.)
Cord
PS: Will try gdb if it happens again.
follow-up: 16 comment:15 by , 8 years ago
I believe I'm seeing the same problem with an OSX host, natnetwork with 2 RHEL 7.1 guests (no guest additions). I've seen it on 5.0.18, 5.0.26, 5.1.6 versions of VirtualBox. I have yet to see anything in any logs to suggest what's going on, but can recreate in about 15 minutes or so. Is there anything I can try to grab more diagnostics? Ticket #13987 looks to be basically the same but closed as no-one replied to it, is that right?
follow-up: 17 comment:16 by , 8 years ago
Replying to OSX_claw:
Is there anything I can try to grab more diagnostics?
A stack trace from the "stuck" VBoxNetNAT
process will be a good start. Thanks!
comment:17 by , 8 years ago
Replying to vushakov:
Replying to OSX_claw:
Is there anything I can try to grab more diagnostics?
A stack trace from the "stuck"
VBoxNetNAT
process will be a good start. Thanks!
Hi vushakov. Apologies, I'm not much of a developer, and even less of an OSX expert (but I am more than comfortable in a terminal/ linux environment). Do you have any hints on what I need to do to get a stack trace on OSX? I'll be googling frantically for a few minutes, but if you know a command to get exactly what you want that would help.
Cheers, Chris
comment:18 by , 8 years ago
From a terminal run:
$ sample -file output.txt VBoxNetNAT
and attach output.txt.
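If it helps, a sketch of the same capture by PID (hypothetical session; picking the PID explicitly avoids accidentally sampling the similarly named VBoxNetDHCP process, and the 10-second duration is arbitrary):
pgrep -l VBoxNet                                      # list VBoxNetNAT / VBoxNetDHCP with their PIDs
sample "$(pgrep -x VBoxNetNAT)" 10 -file output.txt   # sample the NAT process for 10 seconds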
comment:19 by , 8 years ago
I ran it against the PID rather than the process name, but attached is the output. I really hope you can make some sense of it, I know I can't...
comment:20 by , 8 years ago
This is for VBoxNetDHCP, not VBoxNetNAT :)
Also, we need a stack trace from VBoxNetNAT when the network is "stuck", otherwise you will be just reporting business-as-usual state, which is not very instructive.
Just to make sure, when you say "I believe I'm seeing the same problem", what are the actual symptoms that you observe?
follow-up: 25 comment:21 by , 8 years ago
Noooooooo :-D
The symptoms that I see are best explained with a little detail of my setup.
I have 2 RHEL clients.
I portforward (arbitrary) ports from the host to allow ssh access, so both end up on port 22
I ssh from the host to the guests. After 15-20 minutes or so, the natnetwork gets into an odd state. I am unable to initiate new outbound connections from the clients. I am unable to initiate new ssh connections - I get
write - broke pipe
from the command line. But I can't telnet to any open ports. What does work is to run
VBoxManage natnetwork stop --netname StoreNat && VBoxManage natnetwork start --netname StoreNat
What is very odd is that another natnetwork that I have set up (DefaultNat), which has one RHEL client and one ubuntu client, is fine. I tried putting these 2 clients into another natnetwork and it failed too. My colleague sitting next to me has created a virtually identical setup (we're testing an environment of 2 machines) and does not have this problem. I did not have it before around the 17th of August, but I have had it since then.
And with take 2 - hopefully the right process this time! Thank you for your patience.
(Edited to add formatting)
comment:22 by , 8 years ago
I meant to add - this is stuck now. That is "business as usual" for me - it spends more time stuck than not, and I have the stop and start command in its own window to be re-run on demand :-(
comment:24 by , 8 years ago
Currently,
VBoxManage --version
5.1.6r110634
I've seen it on 5.1.4, 5.0.26, 5.0.18 too I believe though.
I'm also going to leave work for the day soon, so if I don't reply immediately I've not given up but I'll be back tomorrow (UK time)!
Thanks again, Chris
comment:25 by , 8 years ago
Replying to OSX_claw:
I ssh from the host to the guests. After 15-20 minutes or so, the natnetwork gets into an odd state. I am unable to initiate new outbound connections from the clients. I am unable to initiate new ssh connections - I get write - broke pipe from the command line.
In this description you refer to "new" connections. Do established connections continue to work?
But I can't telnet to any open ports.
You mean, the ports on the host that are forwarded to the guests on the natnet, right?
comment:26 by , 8 years ago
Yes - established connections (i.e. an existing ssh) continue to work.
I think my second comment made it sound like I have lots of forwarded ports. Now I check, I only have ssh connections to each client. But I have seen telnetting to those ports not giving the expected result.
In fact, it's this -
$ telnet localhost 2251
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
SetSockOpt: Invalid argument
Connection closed by foreign host.
I'm running a test build today at the suggestion of a mod on the VirtualBox forums. It seemed to take a lot longer to fail, and it seemed to block outbound connections before inbound this time, but that could well be perception. Build is now - 5.1.7r111360
This won't match with the traces I took yesterday - is there benefit to a new one?
comment:27 by , 8 years ago
There were no relevant changes after the release, so there's no sense in running a test build in this case.
The original report was that there are no packets from the natnet at all, so your problem is something different.
I wonder if you are hitting a limit on the number of simultaneous connections. How many external connections do you have (on both of those two RHEL guests)?
$ netstat -n -A inet -A inet6
You may want to obfuscate/whiteout the remote IPs for privacy reasons.
comment:28 by , 8 years ago
IP addresses are only in the natnetwork, so obfuscation is not necessary.
On client machine 1 -
# netstat -n -A inet -A inet6
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0     36 10.0.2.10:22            10.0.2.2:54743          ESTABLISHED
tcp        0      0 10.0.2.10:22            10.0.2.2:52760          ESTABLISHED
On the client machine 2 (I had to ssh into this from the first as I didn't have an existing session)
# netstat -n -A inet -A inet6
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.2.51:22            10.0.2.10:45584         ESTABLISHED
Not exactly a lot of connections...
This gets a little complicated, but some more background might help. The natnetwork that is getting "stuck" is called StoreNat.
I'm developing/testing puppet code, so the bulk of the connections would actually be outbound from these machines to the external ip address of my host laptop to a port that is forwarded to the puppetmaster that runs in a VM in different natnetwork (DefaultNat). This other natnetwork has been solid as a rock - as had this one been until mid-August.
I see what you're saying about this not being identical to the initial problem. I was looking at 3 (to my eyes) very similar tickets yesterday, and perhaps 13987 is closer, but it was closed so I picked this one. The other similar looking one was 14774.
comment:29 by , 8 years ago
A crash on Windows is a separate problem (#13899).
Ok, I need a bit of time to think of what might be going wrong here and which instrumentation to add to try to ddx that (that would be the time to run a test build :).
comment:30 by , 8 years ago
Many thanks for taking the time to investigate. More than happy to run/capture whatever I can, and I'll add anything else I think might be relevant. One thing you might know more about is getting more verbose logging out of VBoxNetNAT. At one time I followed instructions I found on another old ticket, https://www.virtualbox.org/ticket/15223. In fact, now I look, it was you giving the instructions :-)
(export VBOXNET_StoreNat_RELEASE_LOG="+nat_service.e.l.l2.l3.f -nat_service.restrict"; export VBOXNET_StoreNat_RELEASE_LOG_FLAGS="thread timeprog"; VirtualBox &)
That should still work shouldn't it? It did once, but I can't seem to get it right since.
comment:31 by , 8 years ago
Right. The instrumentation that the above turned on was only put in temporarily for the test builds. As I said, I need to formulate at least some hypothesis of what I'm looking for before adding instrumentation for your case; I don't want the full-scale logging equivalent of carpet bombing if I can avoid it :)
comment:32 by , 8 years ago
When the network is stuck, can you ping 10.0.2.1 (default gateway)? 10.0.2.2 (mapped host loopback)? Can you use 10.0.2.1 (DNS proxy) as resolver? E.g.
$ nslookup www.virtualbox.org 10.0.2.1
comment:33 by , 8 years ago
Long reply where I have obfuscated some pieces.
[root@client1-1 ~]# ping 10.0.2.1
PING 10.0.2.1 (10.0.2.1) 56(84) bytes of data.
64 bytes from 10.0.2.1: icmp_seq=1 ttl=255 time=0.209 ms
64 bytes from 10.0.2.1: icmp_seq=2 ttl=255 time=0.177 ms
^C
--- 10.0.2.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.177/0.193/0.209/0.016 ms
You have new mail in /var/spool/mail/root
[root@client1-1 ~]# ping 10.0.2.2
PING 10.0.2.2 (10.0.2.2) 56(84) bytes of data.
64 bytes from 10.0.2.2: icmp_seq=1 ttl=64 time=2.35 ms
64 bytes from 10.0.2.2: icmp_seq=2 ttl=64 time=0.341 ms
^C
--- 10.0.2.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.341/1.348/2.355/1.007 ms
[root@client1-1 ~]# logout
Connection to 10.0.2.10 closed.
[root@client-2 ~]# nslookup www.virtualbox.org 10.0.2.1
Server:         10.0.2.1
Address:        10.0.2.1#53
Non-authoritative answer:
Name:   www.virtualbox.org
Address: 137.254.60.32
[root@client-2 ~]# telnet v.w.x.y 8140
Trying v.w.x.y...
telnet: connect to address v.w.x.y: Connection timed out
[root@client-2 ~]#
Connection to localhost closed by remote host.
Connection to localhost closed.
I only moved between the 2 boxes for ping/nslookup as the nslookup command wasn't installed on the client I was on initially.
The telnet to port 8140 on v.w.x.y (the address of my mac host) should work - but doesn't, and times out. At this point I stopped and started the natnetwork in another terminal which caused the connection to be closed
# osx_claww@mac-host:~ [10:14:53]
$ ssh virtualclient2
Warning: Permanently added '[localhost]:2251' (ECDSA) to the list of known hosts.
root@localhost's password:
Last login: Wed Oct 19 16:24:15 2016 from 10.0.2.2
[root@client-2 ~]# telnet v.w.x.y 8140
Trying v.w.x.y...
Connected to v.w.x.y.
Escape character is '^]'.
virtualclient2 is defined in my ssh config file as localhost on the correct port to be forwarded to the client. I can now connect into the clients, and I am also able to telnet outside of the natnetwork now.
comment:34 by , 8 years ago
One more test I forgot, to exclude the UDP problems for sure - does
$ nslookup www.virtualbox.org 8.8.8.8
work?
I expect it does, and the problem seems to be confined to TCP connections.
Can you run wireshark in the guest to capture the traffic on the intnet interface and then do e.g.
$ telnet www.virtualbox.org http
follow-up: 36 comment:35 by , 8 years ago
I can't access 8.8.8.8 behind my corporate firewall - but assuming you want to see if I can get to a nameserver that isn't in VirtualBox or even my host -
nslookup www.virtualbox.org <corporate nameserver>
Server:         <corporate nameserver>
Address:        <corporate nameserver>#53
Non-authoritative answer:
Name:   www.virtualbox.org
Address: 137.254.60.32
The second part I have to be even more careful with. I've had to remove wireshark from my mac as it was noticed by my IT department and I was explicitly told to take it off. Even if it was still on, telnetting to www.virtualbox.org on port 80 would fail as I'm stuck behind a firewall anyway.
I should be able to do something along those lines at home though, so can I be clear on what you want me to do? Do you want me to capture traffic on the client or the host? And which interface?
comment:36 by , 8 years ago
Replying to OSX_claw:
I should be able to do something along those lines at home though, so can I be clear on what you want me to do? Do you want me to capture traffic on the client or the host? And which interface?
What I'm asking is an outgoing TCP connection from the guest. It can even be to the host mapped loopback at 10.0.2.2, that doesn't matter. I used http at virtualbox.org as an example just because it's there and it will not expose your IP addresses.
I want traffic captured on the guest on its natnet interface.
comment:37 by , 8 years ago
Not sure if this is exactly what you want/expect, but I ran a tcpdump on the client once it was stuck. I've got a raw file I can make available I guess, but there isn't that much traffic.
I've also cropped out nfs traffic that flows between the 2 clients (I forgot that was happening, hopefully it's not relevant)
00:16:56.267344 IP 10.0.2.51.44200 > puppet.8140: Flags [S], seq 3965336833, win 14600, options [mss 1460,sackOK,TS val 28226548 ecr 0,nop,wscale 7], length 0
00:16:57.286518 IP 10.0.2.51.44200 > puppet.8140: Flags [S], seq 3965336833, win 14600, options [mss 1460,sackOK,TS val 28227567 ecr 0,nop,wscale 7], length 0
00:16:59.289952 IP 10.0.2.51.44200 > puppet.8140: Flags [S], seq 3965336833, win 14600, options [mss 1460,sackOK,TS val 28229570 ecr 0,nop,wscale 7], length 0
00:17:01.276768 ARP, Request who-has 10.0.2.2 tell 10.0.2.51, length 28
00:17:01.276988 ARP, Reply 10.0.2.2 is-at 52:54:00:12:35:00 (oui Unknown), length 46
00:17:03.305182 IP 10.0.2.51.44200 > puppet.8140: Flags [S], seq 3965336833, win 14600, options [mss 1460,sackOK,TS val 28233586 ecr 0,nop,wscale 7], length 0
00:17:11.323646 IP 10.0.2.51.44200 > puppet.8140: Flags [S], seq 3965336833, win 14600, options [mss 1460,sackOK,TS val 28241604 ecr 0,nop,wscale 7], length 0
00:17:27.401935 IP 10.0.2.51.44200 > puppet.8140: Flags [S], seq 3965336833, win 14600, options [mss 1460,sackOK,TS val 28257682 ecr 0,nop,wscale 7], length 0
where puppet is hardcoded in the hosts file to point to my external mac ip address.
I imagine you don't need this piece, but for completeness, this is the comparison with a working natnetwork.
00:23:55.584295 IP 10.0.2.51.44237 > puppet.8140: Flags [S], seq 3015116107, win 14600, options [mss 1460,sackOK,TS val 28645864 ecr 0,nop,wscale 7], length 0
00:23:55.584679 IP puppet.8140 > 10.0.2.51.44237: Flags [S.], seq 24054, ack 3015116108, win 32768, options [mss 1460], length 0
00:23:55.584733 IP 10.0.2.51.44237 > puppet.8140: Flags [.], ack 1, win 14600, length 0
00:23:55.584819 IP 10.0.2.51.44237 > puppet.8140: Flags [P.], seq 1:79, ack 1, win 14600, length 78
00:23:55.585735 IP puppet.8140 > 10.0.2.51.44237: Flags [P.], seq 1:589, ack 79, win 32690, length 588
00:23:55.585776 IP 10.0.2.51.44237 > puppet.8140: Flags [.], ack 589, win 15288, length 0
00:23:55.585792 IP puppet.8140 > 10.0.2.51.44237: Flags [F.], seq 589, ack 79, win 32690, length 0
00:23:55.589673 IP 10.0.2.51.44237 > puppet.8140: Flags [F.], seq 79, ack 590, win 15288, length 0
00:23:55.589916 IP puppet.8140 > 10.0.2.51.44237: Flags [.], ack 80, win 32689, length 0
I think I captured the packet contents, but didn't show them on the replay which is all I've pasted here - do you want more?
comment:38 by , 8 years ago
Thanks, I wanted to verify you don't get any Syn/Ack responses for the connection attempt, and this confirms it.
comment:39 by , 8 years ago
Sorry for the delay. Can you please try the 5.1 r111535 OS X test build with extra logging enabled as in comment:30 (I assume you used the right natnet name in that example).
The log file is in ~/Library/VirtualBox/StoreNat.log (again, assuming that's your natnet's name).
You should see unable to allocate TCP PCB... and then the dump of all currently used pcbs. IP addresses are logged, so if that's a privacy concern feel free to mask them.
comment:40 by , 8 years ago
Not great results so far. It seemed to take a little longer to get "stuck", but it got there in the end. I had invoked VirtualBox with the command -
$ (export VBOXNET_StoreNat_RELEASE_LOG="+nat_service.e.l.l2.l3.f -nat_service.restrict"; export VBOXNET_StoreNat_RELEASE_LOG_FLAGS="thread timeprog"; VirtualBox &)
but I didn't get any more logging in the StoreNat.log.
As a check, I listed the network
VBoxManage list natnets
NetworkName:    StoreNat
IP:             10.0.2.1
Network:        10.0.2.0/24
IPv6 Enabled:   No
IPv6 Prefix:    fd17:625c:f037:2::/64
DHCP Enabled:   Yes
Enabled:        Yes
Port-forwarding (ipv4)
        ClientSSH:tcp:[]:2251:[10.0.2.51]:22
        StoreSSH:tcp:[]:2210:[10.0.2.10]:22
        clientsshagain:tcp:[]:2215:[10.0.2.15]:22
        dhcpssh:tcp:[]:2207:[10.0.2.7]:22
        homerssh:tcp:[]:22922:[10.0.2.9]:22
        server2ssh:tcp:[]:2204:[10.0.2.4]:22
loopback mappings (ipv4)
        127.0.0.1=2
so it's the right network name. Am I missing something?
edit: removed erroneous hash character
comment:41 by , 8 years ago
Just to confirm that I wasn't getting something really basic wrong, I've re-installed an old build I had lying about -
Version 5.0.18 r106667
This gives a much more verbose log, e.g.
00:02:52.439357 lwIP2 ---> TCP ext.mac.addr:48002 socket 56
00:02:52.439515 lwIP3 pxtcp_add: new pxtcp 00007fd77ac16320; pcb 0000000100f3acb0; sock 56
00:02:52.439559 lwIP3 pxtcp_pmgr_connect: sock 56: connect: Connection refused
00:02:52.439579 lwIP2 pxtcp_pcb_accept_refuse: pxtcp 00007fd77ac16320, pcb 0000000100f3acb0, sock -1: Connection refused
00:02:52.439584 lwIP2 pxtcp_pcb_dissociate: pxtcp 00007fd77ac16320 <-> pcb 0000000100f3acb0
00:02:53.441169 lwIP2 ---> TCP ext.mac.addr:48002 socket 56
00:02:53.441299 lwIP3 pxtcp_add: new pxtcp 00007fd77ae0b650; pcb 0000000100f3acb0; sock 56
00:02:53.441326 lwIP3 pxtcp_pmgr_connect: sock 56: connect: Connection refused
00:02:53.441348 lwIP2 pxtcp_pcb_accept_refuse: pxtcp 00007fd77ae0b650, pcb 0000000100f3acb0, sock -1: Connection refused
00:02:53.441376 lwIP2 pxtcp_pcb_dissociate: pxtcp 00007fd77ae0b650 <-> pcb 0000000100f3acb0
00:02:54.019762 lwIP2 ---> TCP proxy.server:8080 socket 56
00:02:54.019890 lwIP3 pxtcp_add: new pxtcp 00007fd77ae0b650; pcb 0000000100f3acb0; sock 56
00:02:54.040747 lwIP3 sock 56: HUP
00:02:54.040804 lwIP2 pxtcp_pcb_pull_inbound: pxtcp 00007fd77ae0b650: pcb 0000000100f3acb0
00:02:54.040819 lwIP2 inbound_close: pxtcp 00007fd77ae0b650; pcb 0000000100f3acb0: ESTABLISHED
00:02:54.041075 lwIP2 pxtcp_pcb_sent: pxtcp 00007fd77ae0b650; pcb 0000000100f3acb0; all data ACKed
while it's working. Is there any value to this log - I'm expecting not as this instrumentation was for a different problem?
comment:42 by , 8 years ago
When you run that command to start VirtualBox with extra environment variables you must make sure you don't have VirtualBox, in particular the VBoxSVC process, already running. The VBoxNetNAT process is spawned from VBoxSVC, which will be spawned from VirtualBox if necessary. If you already had it running, the old copy will not pick up the env vars, obviously. Sorry I didn't make it clear.
BTW, is that log above in the state where it's "stuck"?
comment:43 by , 8 years ago
No, that log is from a working state. I'll keep it running until it breaks. I had been trying to kill all VirtualBox processes each time I was trying to get VirtualBox to fire up, but I was unable to get any more logging with the latest test build as directed. It's possible I did get that wrong, so I'll try again once I've seen it get stuck with this old build so that I can get you something from that log.
Thanks, Chris
comment:44 by , 8 years ago
Ah, my bad. I didn't put that extra logging in the right group. Make that VBOXNET_StoreNat_RELEASE_LOG="+all.l2 -all.restrict"
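Putting comment:42 and the corrected variable together, a sketch of the full launch sequence (names and the log path are the ones used in this thread; adjust to your setup):
# Make sure no VirtualBox processes are left over, otherwise the already-running
# VBoxSVC/VBoxNetNAT will not pick up the environment variables.
pgrep -l VBoxSVC && echo "shut down VirtualBox and wait for VBoxSVC to exit first"
export VBOXNET_StoreNat_RELEASE_LOG="+all.l2 -all.restrict"
export VBOXNET_StoreNat_RELEASE_LOG_FLAGS="thread timeprog"
VirtualBox &
# The natnet log should then appear in ~/Library/VirtualBox/StoreNat.log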
comment:45 by , 8 years ago
I'm going to try that new build again (r111535) now. The earlier build I was using with that logging on failed to get into the bad state at all - the longest it's run cleanly since August, typical...
comment:46 by , 8 years ago
Trying to think of things that I did differently. I had run VirtualBox without backgrounding it on the times when I didn't get stuck today. Something for me to try again later.
comment:47 by , 8 years ago
I've gone back to Version 5.1.9 r111535 (Qt5.5.1)
The command I used to start it was
export VBOXNET_StoreNat_RELEASE_LOG="+all.l2 -all.restrict" ; export VBOXNET_StoreNat_RELEASE_LOG_FLAGS="thread timeprog"; VirtualBox &
I'm not seeing anything immediately obvious in the ~/Library/VirtualBox/StoreNat.log file, should I?
follow-up: 52 comment:48 by , 8 years ago
Aha...
Yep, exactly what you suggested - here's a little snapshot
00:51:57.881822 lwIP2 == unable to allocate TCP PCB corp.proxy.server:8080 <-> 10.0.2.10:60352
00:51:57.881864 lwIP2 bound pcbs: none
00:51:57.881869 lwIP2 listen pcbs {
00:51:57.881892 lwIP2   0.0.0.0:53
00:51:57.881897 lwIP2 } // 1 listen pcbs
00:51:57.881901 lwIP2 active pcbs {
00:51:57.881904 lwIP2   corp.proxy.server:8080 <-> 10.0.2.10:60351
00:51:57.881909 lwIP2   corp.proxy.server:8080 <-> 10.0.2.10:60350
00:51:57.881913 lwIP2   corp.proxy.server:8080 <-> 10.0.2.10:60349
I can attach the log, but it's that message showing basically the full 128 connections to the proxy on port 8080, repeated about 70 times... Do you want it? Perhaps unsurprisingly, there are also 127 connections to the proxy server on port 8080 in CLOSE_WAIT if I do a netstat on my Mac host.
Also, is that saying I'm listening on port 53? Or is that for the DHCP server for that natnetwork?
Cheers, Chris
by , 8 years ago
Attachment: StoreNat.log.gz added
follow-up: 53 comment:49 by , 8 years ago
I think I've left out something important (that I've only just realised, apologies) :-(
I run docker on the VMs. The containers have their own network stacks... Any netstat commands I ran on the clients will have been incomplete, I think. Sorry.
So, I just went into the container I most suspect is at fault, and there weren't many bad connections at that point
netstat -atnp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN      6/Xvfb
tcp6       0      0 :::6000                 :::*                    LISTEN      6/Xvfb
tcp6       0      1 172.17.0.2:60416        corp.proxy.server:8080  SYN_SENT    16/java
However - that was with things in a broken state. I bounced the natnetwork again, and now I see this
user@b20fa7259901:/opt/build $ netstat -n -A inet -A inet6
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp6       0      1 172.17.0.2:60442        172.31.200.210:8080     LAST_ACK
tcp6       0      0 172.17.0.2:60523        172.31.200.210:8080     CLOSE_WAIT
tcp6       0      0 172.17.0.2:60522        172.31.200.210:8080     CLOSE_WAIT
tcp6       0      1 172.17.0.2:60454        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60445        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60452        172.31.200.210:8080     LAST_ACK
tcp6       0      0 172.17.0.2:60521        172.31.200.210:8080     CLOSE_WAIT
tcp6       0      1 172.17.0.2:60443        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60439        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60438        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60441        172.31.200.210:8080     LAST_ACK
tcp6       0      0 172.17.0.2:60524        172.31.200.210:8080     CLOSE_WAIT
tcp6       0      1 172.17.0.2:60455        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60436        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60437        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60435        172.31.200.210:8080     LAST_ACK
tcp6       0      0 172.17.0.2:60525        172.31.200.210:8080     CLOSE_WAIT
tcp6       0      1 172.17.0.2:60440        172.31.200.210:8080     LAST_ACK
tcp6       1      1 172.17.0.2:60434        172.31.200.210:8080     LAST_ACK
and it looks like that list is growing.
So - is this pure user error, or should the natnetwork deal with things better? I believe that those connections are all down to a java process attempting to connect through our proxy to a deliberately non-existent web server. Whether that means the proxy isn't closing things, or the java process isn't closing things, I'm not sure though.
comment:50 by , 8 years ago
So it might not be that simple, though it's likely related. It has just got stuck, and there are 31 reported connections inside that container, all moving from CLOSE_WAIT to LAST_ACK, not seemingly completely stuck.
comment:51 by , 8 years ago
It appears to be in a state whereby I can still ssh in to the clients (for now) but cannot initiate external connections, and I still get messages about max tcbs in the log. That was after bouncing what I thought was the problem container.
comment:52 by , 8 years ago
Replying to OSX_claw:
Also, is that saying I'm listening on port 53? Or is that for the DHCP server for that natnetwork?
It's the TCP part of the DNS proxy. It listens on the "inside" of the natnet, not on the host; i.e. it's what guests can reach as 10.0.2.1:53.
comment:53 by , 8 years ago
Replying to OSX_claw:
However - that was with things in a broken state. I bounced the natnetwork again, and now I see this
user@b20fa7259901:/opt/build $ netstat -n -A inet -A inet6
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp6       0      1 172.17.0.2:60442        172.31.200.210:8080     LAST_ACK
tcp6       0      0 172.17.0.2:60523        172.31.200.210:8080     CLOSE_WAIT
...and it looks like that list is growing.
So - is this pure user error, or should the natnetwork deal with things better? I believe that those connections are all down to a java process attempting to connect through our proxy to a deliberately non-existent web server. Whether that means the proxy isn't closing things, or the java process isn't closing things, I'm not sure though.
CLOSE_WAIT is when the remote side has (half-)closed. A classic example of this is rsh(1), which sends local stdin to the server and then half-closes on EOF. It's quite normal for a connection to stay in that state, though usually it's on the "server" side, after the client is done sending its request/input/... So it's a bit suspect that you have these CLOSE_WAIT connections on the client. When the proxy closes and stops sending data to you it doesn't make much sense to keep the outbound connection still open - whatever you gonna ask the proxy, it can't reply to you any longer anyway.
OTOH, it can be some http proxy tunnel. Anyway, transitioning out of that state requires an action by the local app.
LAST_ACK should be transient - a connection is in that state when the app closes a connection in CLOSE_WAIT state (i.e. the remote has already closed) and waits for the last ack that will terminate the connection. You shouldn't normally see a lot of connections in that state since that last ack usually comes pretty fast.
Here the guest waits for a packet from the natnet, and I can actually think of some pathological failure scenarios where the natnet is starved of resources and can't send that last ack...
Let me think a bit more on what I want to probe next, so that I don't bother you with a back and forth for each additional debug log statement.
comment:54 by , 8 years ago
Thanks again - I'll keep checking back. It's got to the broken state now, but to confirm, inside what I suspect to be the problem container I don't see many connections by this point - they do seem to follow the usual flow to get closed off, even if they were showing as CLOSE_WAIT for a while previously. On the host, I do still see the VBoxNetNat process with 126 CLOSE_WAIT connections though.
comment:55 by , 8 years ago
Test build r111579 should print a bit more info. When it's stuck, please provide the log, and the netstat output from the host (VBoxNetNAT process), and the guest and containers. TIA.
comment:56 by , 8 years ago
Incoming... I've obfuscated and zipped up the files you asked for. The problem is seeing exactly when it breaks - I think that the netstat output inside the containers changes even once it is stuck - so by the time I took the netstat, any sockets had been tidied up as far as that network stack was concerned. One thing that isn't as obvious as I'd meant it to be - I've changed IP addresses to be things like corp.proxy.server. There are 2 proxies (though I did at least remember to differentiate them) and a corp.docker.server. At a quick glance those strings look very similar in the logs is all.
by , 8 years ago
Attachment: 14748.diag.tar.gz added
comment:57 by , 8 years ago
Strange. I would have expected a bunch of connections in the guest/containers in CLOSE_WAIT/LAST_ACK states, like in comment:49. On the host I can see a bunch of CLOSE_WAIT sockets and there are corresponding pcbs in the natnet log in FIN_WAIT_2, i.e. the connections to corp.proxy.server:8080 that had inbound FIN from remote, but didn't yet get outbound FIN from the guest.
Can you please try r111619 (or later) that cherry picks a TCP fix from upstream that might be relevant.
Please, can you also capture some of the offending connections to the corp.proxy.server:8080 on the guest side (please, provide pcap file(s), not text output from tcpdump). I don't know if every connection to the proxy causes a stuck connection in natnet, or only some of them, so this might be a bit tricky, b/c by the time you have your whole natnet stuck, you can't make any new connections...
Obfuscating the addresses will be harder in this case though :). You can limit the capture size to just the TCP header. If you don't want to attach the captures here, you can send them to me privately at valery dot ushakov at oracle dot com.
follow-up: 59 comment:58 by , 8 years ago
I'll try the build, and see what I can do for the pcap files. It's made harder by needing to do the captures from inside the container on the client, but should still be possible. The images don't have tcpdump in them for example, but I think I can install it.
comment:59 by , 8 years ago
Replying to OSX_claw:
It's made harder by needing to do the captures from inside the container on the client.
Can't you capture on the "main" guest?
Worst case, you can use VBox's built-in nic trace feature, but it's rather rudimentary e.g. you can't specify the capture filter (the intention is that you'd normally use guest's capture facilities and only resort to nictrace for early boot and the like, so it's kept pretty bare bones). So you can use nictrace to capture everything and post-process, though that's obviously a chore.
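For reference, a sketch of that built-in trace (the VM name and adapter number are examples; modifyvm requires the VM to be powered off when the setting is changed):
# Enable tracing on adapter 1 of the guest, reproduce the hang, then turn it off
# and open the pcap on the host with wireshark/tcpdump.
VBoxManage modifyvm "client-2" --nictrace1 on --nictracefile1 /tmp/client2-nic1.pcap
VBoxManage startvm "client-2" --type headless
# ... reproduce the problem, shut the VM down, then:
VBoxManage modifyvm "client-2" --nictrace1 off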
comment:60 by , 8 years ago
You're probably right - presumably the docker0 interface or one of the veth devices should be connected to the containers so tcpdump should catch it. I was getting hung up on netstat not showing the different network namespaces.
follow-up: 62 comment:61 by , 8 years ago
Seems to be capturing. I'm going for
tcpdump -S -n -i <interface> -w <outfile> host corp.proxy.server
do you think that will get what you're after?
Thanks, Chris
comment:62 by , 8 years ago
Replying to OSX_claw:
Seems to be capturing. I'm going for
tcpdump -S -n -i <interface> -w <outfile> host corp.proxy.server
do you think that will get what you're after?
Yes, I'm interested only in those connections. A tiny nit-pick - -S and -n are redundant as they only affect text output, not that it matters.
Thanks!
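A possible refinement, folding in the header-only capture suggested in comment:57 - a sketch only, with the interface name, snap length and output file as assumptions:
# Capture only packet headers for traffic to/from the proxy on the guest's natnet
# interface; 96 bytes is usually enough for the Ethernet, IP and TCP headers.
tcpdump -i eth0 -s 96 -w stuck.pcap host corp.proxy.server and tcp port 8080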
comment:63 by , 8 years ago
I've sent the tcpdump output through, hopefully there is something useful in that.
Thanks, Chris
comment:64 by , 8 years ago
Thanks! That capture confirms that for some connections the guest sends (and resends) the FINs but never gets that final ACK from the natnet and gives up eventually.
comment:65 by , 8 years ago
Please, can you try r111633 with extra logging enabled like before. I'm shooting in the dark here as I'm not quite sure from RTFS how we can even end up in that state, so your patience is appreciated. Thanks in advance.
by , 8 years ago
Attachment: 111633_StoreNat.log.gz added
comment:67 by , 8 years ago
Apologies, I didn't post the comment I thought I had. Yes, that should be stopped when the natnet was stuck. I should have an accompanying tcpdump if it's any use to you?
comment:68 by , 8 years ago
Since you've captured it anyway, it would be nice to have. Though I don't see in the log anything that would hint at the potential problem.
comment:69 by , 8 years ago
The capture actually helped to provide some clues, hopefully. Please, can you collect logs and packet capture for r111660. Thanks in advance.
comment:70 by , 8 years ago
Slightly later than intended, I'm running with the new build - will let you know about diagnostics.
comment:71 by , 8 years ago
I didn't see the usual problem in about 3 hours of running yesterday, which is kinda nice but a bit frustrating from your point of view I suspect :-) Running again today.
follow-up: 76 comment:72 by , 7 years ago
I have exactly the same problem while running one of the most recent versions of VirtualBox in Linux. I have to restart natnetwork all the time to make it work.
$ dpkg -l | grep virtualbox-5.1
ii  virtualbox-5.1  5.1.30-118389~Ubuntu~trusty  amd64  Oracle VM VirtualBox
Here is how I restart natnetwork:
VBoxManage natnetwork stop --netname NatNetwork && VBoxManage natnetwork start --netname NatNetwork
comment:74 by , 6 years ago
I have the same problem. Restarting the virtualbox service without restarting the VMs resolves the problem for some days. Then the internet connection is lost again. VirtualBox 5.1.30-38 and VirtualBox 5.2.4-5.2.16 tested for this bug.
comment:76 by , 6 years ago
Replying to giner:
I have exactly the same problem while running one of the most recent versions of VirtualBox in Linux. I have to restart natnetwork all the time to make it work.
$ dpkg -l | grep virtualbox-5.1
ii  virtualbox-5.1  5.1.30-118389~Ubuntu~trusty  amd64  Oracle VM VirtualBox
Here is how I restart natnetwork:
VBoxManage natnetwork stop --netname NatNetwork && VBoxManage natnetwork start --netname NatNetwork
Restarting the natnetwork helps. My NAT network gets stuck about once every 10 days. I have two NAT networks, but only one has this problem. Host OS: Win Server 2012 R2, Guest OS: WinXP SP3, VirtualBox 5.2.22.
comment:77 by , 5 years ago
Still a problem in 6.0.10. :( I have a Win 10 host and a CentOS 7 guest. I don't think ARP even works when it's in this state. I believe this forum thread covers the same issue.
comment:78 by , 5 years ago
Also facing this problem on a Debian 9.11 host with Virtual Box 6.0.12. Affects at least WinXP and Debian guests.
Currently, as a "bypass", I have automated a stop/start of the NatNetwork every hour - far from ideal.
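For anyone copying that bypass, a sketch of the crontab entry (assumptions: "NatNetwork" is the network name and VBoxManage is on the crontab user's PATH - otherwise use its full path):
# Restart the NAT network at the top of every hour (m h dom mon dow command).
0 * * * * VBoxManage natnetwork stop --netname NatNetwork && VBoxManage natnetwork start --netname NatNetwork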
comment:79 by , 5 years ago
I have the same issue on VirtualBox version 6.0.14. I run Ubuntu 18.04.3 on both host and guest.
I have been running VirtualBox for ~3 months on this new computer with the default version 5.2.34 without any issues until yesterday when I started having issues with the network. After upgrading to 6.0.14 there is still the same issue.
To my knowledge I have made no changes or upgrades to any part of my system the last couple of days.
On the ~10 tries I've made, the network has been alive for ~1 to a maximum of 10 minutes.
comment:80 by , 4 years ago
I have the same problem, the NAT network dies after approx 15 minutes. Is this bug being worked on?
comment:81 by , 4 years ago
Exactly the same problem still here, noticed in several versions. The same with the latest update. Host Windows Server 2012 with RHEL v7. Not able to find the combination of factors that makes the NatNetwork die. Disappointed, serious problem. Any update? Thanks.
comment:82 by , 3 years ago
This issue has existed for years; is there someone still working on VBox at Oracle???
Please, can you give a recent (103571 or later) 5.0 test build a try?