Opened 9 years ago
Last modified 3 years ago
#14748 new defect
nat network hangs intermittently
Reported by: | dtgeorge | Owned by: | |
---|---|---|---|
Component: | network/NAT | Version: | VirtualBox 5.0.8 |
Keywords: | nat | Cc: | |
Guest type: | Windows | Host type: | Linux |
Description
Host: Ubuntu 14.04 64Bit
Virtualbox 5.0.8
Guest: Windows Server 2012, Windows XP - defect present for both guests.
Both guests are headless, used only by RDP.
Sometimes (1-2 times per day) the guest network adapter stops receiving any packets from the "media"; the guest can't ping the default gateway for this network interface.
The only thing that fixes the issue is a NAT network restart:
VBoxManage natnetwork stop --netname nat-int-network-2
VBoxManage natnetwork start --netname nat-int-network-2
Of course, there may be other ways... rebooting the particular guest or performing "sudo /etc/init.d/vboxdrv restart" (which requires stopping all VMs)... But the only way I found to fix the issue without rebooting/stopping the guest is restarting the NAT network. I have two NAT networks (192.168.1.0/24 and 192.168.2.0/24). They hang independently:
- The first one, nat-int-network (192.168.1.0/24), connected to the WinXP guest (single network adapter within the guest), hangs rarely - 1-2 times per month.
- The second one, nat-int-network-2 (192.168.2.0/24), connected to the Windows Server 2012 guest (one of two network adapters within the guest; the second one is an openvpn tap0 interface attached to the guest as a Bridged Adapter), hangs more often - 1-2 times per day... sometimes once per 2-3 days...
The Windows Server guest is loaded with 5-10 simultaneous RDP sessions, but the average load is quite moderate...
I understand that the info provided is not enough. Please let me know what kind of details I should add.
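For reference, a minimal host-side helper wrapping the restart workaround above - a sketch only, assuming the network name shown in this report (pass another name as the first argument):
#!/bin/sh
# restart-natnet.sh - bounce one VirtualBox NAT network by name (hypothetical helper).
# Usage: ./restart-natnet.sh [netname]
NET="${1:-nat-int-network-2}"
VBoxManage natnetwork stop --netname "$NET"
VBoxManage natnetwork start --netname "$NET"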
Attachments (4)
Change History (86)
comment:1 by , 9 years ago
comment:3 by , 9 years ago
Replying to dtgeorge:
That is production server. Is it safe to use test builds?
Depends on values of "safe". It's a build from the release branch of what will eventually become the next dot-dot version. If you don't feel comfortable using a test build, you'll have to wait until the next release.
comment:4 by , 9 years ago
About your advice to install a test build: is it a standard procedure, or do you know of specific NAT network patches/fixes present in the test builds that could help solve my issue?
comment:5 by , 9 years ago
UPD: I spent some time trying to find out where I can see a list of exactly which patches and improvements the test build contains, but no success for me (
comment:6 by , 9 years ago
Test builds are builds of the respective branches which are used for building release packages. We don't publish changes for test builds. Test builds can be considered stable but there is no guarantee. If you are asked to install a test build there is a good reason for doing so. If you install the test build and can confirm that a problem is fixed then this is your contribution to making the product better. Otherwise you have to wait until the next maintenance release, with the risk that the fix included in the test build is not relevant for your problem.
comment:7 by , 9 years ago
Does phpvirtualbox work fine with test builds?
If I already have an installation of virtualbox 5.0.8 and phpvirtualbox, what steps should I take to become a test build user? Uninstall virtualbox and then execute the .run script?
comment:8 by , 9 years ago
Test build 103669 installed. Just started to contribute ))))
So far so good... Watching...
comment:9 by , 9 years ago
Got it! Test build 103669 installed. 7 days of stable work and bump - no packets received on the interface!
I restarted the network and only after 10-15 seconds (not immediately after the restart!) did the interface return to normal operation (ping to 192.168.2.1 from within the VM started to respond).
VBoxManage natnetwork stop --netname nat-int-network-2
VBoxManage natnetwork start --netname nat-int-network-2
comment:10 by , 9 years ago
comment:11 by , 9 years ago
I am not sure if it was #13899 or #13987 or something else, because nobody told me how to go deeper into diagnosing my issue... I was told to install a test build and that's it! I asked for instructions in my first post... So for now all those issues are a "black box" for me...
#13987 describes a Windows host, but I have Ubuntu 64-bit. Also, in the #13987 issue the VMs can "see" each other. In my case all VMs are in different nets, so I cannot confirm #13987-like behavior. But, as I mentioned, today I couldn't ping even the default gw (192.168.2.1) from the VM...
comment:12 by , 9 years ago
Sorry, I guess I was confusing this with another similar bug for a moment...
When your network becomes unresponsive, can you attach to the VBoxNetNAT process with gdb and issue
(gdb) set height 0
(gdb) thread apply all bt
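For reference, the same request can be run non-interactively; a sketch only, assuming gdb is installed on the Linux host and pgrep finds exactly one VBoxNetNAT process (otherwise substitute the PID by hand):
# Attach to the running VBoxNetNAT, dump all thread backtraces to a file, then detach.
sudo gdb -p "$(pgrep -x VBoxNetNAT)" -batch -ex 'set height 0' -ex 'thread apply all bt' > vboxnetnat-backtrace.txt 2>&1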
comment:13 by , 9 years ago
Never did this before, but I'll try... if I have enough time at that moment - I mean the server should be serving clients, there's no time for tests...
Maybe I should first update to a newer test build?
comment:14 by , 8 years ago
Hi, I've encountered this on 5.1.4. I've also seen this in 5.0.x.
Host: Windows 8.1 Guest: Debian sid (Kernel 4.7.0-1-amd64, current Guest Additions)
Switching back to "normal" NAT solves the problem, and it is stable. But running NAT Network leads to a broken network after some time. (usually hours.)
Cord
PS: Will try gdb if it happens again.
follow-up: 16 comment:15 by , 8 years ago
I believe I'm seeing the same problem with an OSX host, natnetwork with 2 RHEL 7.1 guests (no guest additions). I've seen it on 5.0.18, 5.0.26, 5.1.6 versions of VirtualBox. I have yet to see anything in any logs to suggest what's going on, but can recreate in about 15 minutes or so. Is there anything I can try to grab more diagnostics? Ticket #13987 looks to be basically the same but closed as no-one replied to it, is that right?
follow-up: 17 comment:16 by , 8 years ago
Replying to OSX_claw:
Is there anything I can try to grab more diagnostics?
A stack trace from the "stuck" VBoxNetNAT
process will be a good start. Thanks!
comment:17 by , 8 years ago
Replying to vushakov:
Replying to OSX_claw:
Is there anything I can try to grab more diagnostics?
A stack trace from the "stuck"
VBoxNetNAT
process will be a good start. Thanks!
Hi vushakov. Apologies, I'm not much of a developer, and even less of an OSX expert (but I am more than comfortable in a terminal/ linux environment). Do you have any hints on what I need to do to get a stack trace on OSX? I'll be googling frantically for a few minutes, but if you know a command to get exactly what you want that would help.
Cheers, Chris
comment:18 by , 8 years ago
From a terminal run:
$ sample -file output.txt VBoxNetNAT
and attach output.txt.
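If it helps, a sketch of the same capture by PID (hypothetical session; picking the PID explicitly avoids accidentally sampling the similarly named VBoxNetDHCP process, and the 10-second duration is arbitrary):
pgrep -l VBoxNet                                      # list VBoxNetNAT / VBoxNetDHCP with their PIDs
sample "$(pgrep -x VBoxNetNAT)" 10 -file output.txt   # sample the NAT process for 10 seconds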
comment:19 by , 8 years ago
I ran it against the PID rather than the process name, but attached is the output. I really hope you can make some sense of it, I know I can't...
comment:20 by , 8 years ago
This is for VBoxNetDHCP, not VBoxNetNAT :)
Also, we need a stack trace from VBoxNetNAT when the network is "stuck", otherwise you will be just reporting business-as-usual state, which is not very instructive.
Just to make sure, when you say "I believe I'm seeing the same problem", what are the actual symptoms that you observe?
follow-up: 25 comment:21 by , 8 years ago
Noooooooo :-D
The symptoms that I see are best explained with a little detail of my setup.
I have 2 RHEL clients.
I portforward (arbitrary) ports from the host to allow ssh access, so both end up on port 22
I ssh from the host to the guests. After 15-20 minutes or so, the natnetwork gets into an odd state. I am unable to initiate new outbound connections from the clients. I am unable to initiate new ssh connections - I get
write - broke pipe
from the command line. But I can't telnet to any open ports. What does work is to run
VBoxManage natnetwork stop --netname StoreNat && VBoxManage natnetwork start --netname StoreNat
What is very odd is that another natnetwork that I have set up (DefaultNat), which has one RHEL client and one ubuntu client, is fine. I tried putting these 2 clients into another natnetwork and it failed too. My colleague sitting next to me has created a virtually identical setup (we're testing an environment of 2 machines) and does not have this problem. I did not have it before around the 17th of August, but I have had it since then.
And with take 2 - hopefully the right process this time! Thank you for your patience.
(Edited to add formatting)
comment:22 by , 8 years ago
I meant to add - this is stuck now. That is "business as usual" for me - it spends more time stuck than not, and I have the stop and start command in its own window to be re-run on demand :-(
comment:24 by , 8 years ago
Currently,
VBoxManage --version
5.1.6r110634
I've seen it on 5.1.4, 5.0.26, 5.0.18 too I believe though.
I'm also going to leave work for the day soon, so if I don't reply immediately I've not given up but I'll be back tomorrow (UK time)!
Thanks again, Chris
comment:25 by , 8 years ago
Replying to OSX_claw:
I ssh from the host to the guests. After 15-20 minutes or so, the natnetwork gets into an odd state. I am unable to initiate new outbound connections from the clients. I am unable to initiate new ssh connections - I get write - broke pipe from the command line.
In this description you refer to "new" connections. Do established connections continue to work?
But I can't telnet to any open ports.
You mean, the ports on the host that are forwarded to the guests on the natnet, right?
comment:26 by , 8 years ago
Yes - established connections (i.e. an existing ssh) continue to work.
I think my second comment made it sound like I have lots of forwarded ports. Now I check, I only have ssh connections to each client. But I have seen telnetting to those ports not giving the expected result.
In fact, it's this -
$ telnet localhost 2251
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
SetSockOpt: Invalid argument
Connection closed by foreign host.
I'm running a test build today at the suggestion of a mod on the VirtualBox forums. It seemed to take a lot longer to fail, and it seemed to block outbound connections before inbound this time, but that could well be perception. Build is now - 5.1.7r111360
This won't match with the traces I took yesterday - is there benefit to a new one?
comment:27 by , 8 years ago
There were no relevant changes after the release, so there's no sense in running a test build in this case.
The original report was that there are no packets from the natnet at all, so your problem is something different.
I wonder if you are hitting a limit on the number of simultaneous connections. How many external connections do you have (on both of those two RHEL guests)?
$ netstat -n -A inet -A inet6
You may want to obfuscate/whiteout the remote IPs for privacy reasons.
comment:28 by , 8 years ago
IP addresses are only in the natnetwork, so obfuscation is not necessary.
On client machine 1 -
# netstat -n -A inet -A inet6
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0     36 10.0.2.10:22            10.0.2.2:54743          ESTABLISHED
tcp        0      0 10.0.2.10:22            10.0.2.2:52760          ESTABLISHED
On the client machine 2 (I had to ssh into this from the first as I didn't have an existing session)
# netstat -n -A inet -A inet6
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.2.51:22            10.0.2.10:45584         ESTABLISHED
Not exactly a lot of connections...
This gets a little complicated, but some more background might help. The natnetwork that is getting "stuck" is called StoreNat.
I'm developing/testing puppet code, so the bulk of the connections would actually be outbound from these machines to the external ip address of my host laptop to a port that is forwarded to the puppetmaster that runs in a VM in different natnetwork (DefaultNat). This other natnetwork has been solid as a rock - as had this one been until mid-August.
I see what you're saying about this not being identical to the initial problem. I was looking at 3 (to my eyes) very similar tickets yesterday, and perhaps 13987 is closer, but it was closed so I picked this one. The other similar looking one was 14774.
comment:29 by , 8 years ago
A crash on Windows is a separate problem (#13899).
Ok, I need a bit of time to think of what might be going wrong here and which instrumentation to add to try to ddx that (that would be the time to run a test build :).
comment:30 by , 8 years ago
Many thanks for taking the time to investigate. More than happy to run/capture whatever I can, and I'll add anything else I think might be relevant. One thing you might know more about is getting more verbose logging out of VBoxNetNAT. At one time I followed instructions I found on another old ticket, https://www.virtualbox.org/ticket/15223. In fact, now I look, it was you giving the instructions :-)
(export VBOXNET_StoreNat_RELEASE_LOG="+nat_service.e.l.l2.l3.f -nat_service.restrict"; export VBOXNET_StoreNat_RELEASE_LOG_FLAGS="thread timeprog"; VirtualBox &)
That should still work shouldn't it? It did once, but I can't seem to get it right since.
comment:31 by , 8 years ago
Right. The instrumentation that the above turned on was only put in temporarily for the test builds. As I said, I need to formulate at least some hypothesis of what I'm looking for before adding instrumentation for your case; I don't want the full-scale logging equivalent of carpet bombing if I can avoid it :)
comment:32 by , 8 years ago
When the network is stuck, can you ping 10.0.2.1 (default gateway)? 10.0.2.2 (mapped host loopback)? Can you use 10.0.2.1 (DNS proxy) as resolver? E.g.
$ nslookup www.virtualbox.org 10.0.2.1
comment:33 by , 8 years ago
Long reply where I have obfuscated some pieces.
[root@client1-1 ~]# ping 10.0.2.1
PING 10.0.2.1 (10.0.2.1) 56(84) bytes of data.
64 bytes from 10.0.2.1: icmp_seq=1 ttl=255 time=0.209 ms
64 bytes from 10.0.2.1: icmp_seq=2 ttl=255 time=0.177 ms
^C
--- 10.0.2.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.177/0.193/0.209/0.016 ms
You have new mail in /var/spool/mail/root
[root@client1-1 ~]# ping 10.0.2.2
PING 10.0.2.2 (10.0.2.2) 56(84) bytes of data.
64 bytes from 10.0.2.2: icmp_seq=1 ttl=64 time=2.35 ms
64 bytes from 10.0.2.2: icmp_seq=2 ttl=64 time=0.341 ms
^C
--- 10.0.2.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.341/1.348/2.355/1.007 ms
[root@client1-1 ~]# logout
Connection to 10.0.2.10 closed.
[root@client-2 ~]# nslookup www.virtualbox.org 10.0.2.1
Server:         10.0.2.1
Address:        10.0.2.1#53
Non-authoritative answer:
Name:   www.virtualbox.org
Address: 137.254.60.32
[root@client-2 ~]# telnet v.w.x.y 8140
Trying v.w.x.y...
telnet: connect to address v.w.x.y: Connection timed out
[root@client-2 ~]#
Connection to localhost closed by remote host.
Connection to localhost closed.
I only moved between the 2 boxes for ping/nslookup as the nslookup command wasn't installed on the client I was on initially.
The telnet to port 8140 on v.w.x.y (the address of my mac host) should work - but doesn't, and times out. At this point I stopped and started the natnetwork in another terminal which caused the connection to be closed
# osx_claww@mac-host:~ [10:14:53]
$ ssh virtualclient2
Warning: Permanently added '[localhost]:2251' (ECDSA) to the list of known hosts.
root@localhost's password:
Last login: Wed Oct 19 16:24:15 2016 from 10.0.2.2
[root@client-2 ~]# telnet v.w.x.y 8140
Trying v.w.x.y...
Connected to v.w.x.y.
Escape character is '^]'.
virtualclient2 is defined in my ssh config file as localhost on the correct port to be forwarded to the client. I can now connect into the clients, and I am also able to telnet outside of the natnetwork now.
comment:34 by , 8 years ago
One more test I forgot, to exclude the UDP problems for sure - does
$ nslookup www.virtualbox.org 8.8.8.8
work?
I expect it does, and the problem seems to be confined to TCP connections.
Can you run wireshark in the guest to capture the traffic on the intnet interface and then do e.g.
$ telnet www.virtualbox.org http
follow-up: 36 comment:35 by , 8 years ago
I can't access 8.8.8.8 behind my corporate firewall - but assuming you want to see if I can get to a nameserver that isn't in VirtualBox or even my host -
nslookup www.virtualbox.org <corporate nameserver>
Server:         <corporate nameserver>
Address:        <corporate nameserver>#53
Non-authoritative answer:
Name:   www.virtualbox.org
Address: 137.254.60.32
The second part I have to be even more careful with. I've had to remove wireshark from my mac as it was noticed by my IT department and I was explicitly told to take it off. Even if it was still on, telnetting to www.virtualbox.org on port 80 would fail as I'm stuck behind a firewall anyway.
I should be able to do something along those lines at home though, so can I be clear on what you want me to do? Do you want me to capture traffic on the client or the host? And which interface?
comment:36 by , 8 years ago
Replying to OSX_claw:
I should be able to do something along those lines at home though, so can I be clear on what you want me to do? Do you want me to capture traffic on the client or the host? And which interface?
What I'm asking is an outgoing TCP connection from the guest. It can even be to the host mapped loopback at 10.0.2.2, that doesn't matter. I used http at virtualbox.org as an example just because it's there and it will not expose your IP addresses.
I want traffic captured on the guest on its natnet interface.
comment:37 by , 8 years ago
Not sure if this is exactly what you want/expect, but I ran a tcpdump on the client once it was stuck. I've got a raw file I can make available I guess, but there isn't that much traffic.
I've also cropped out nfs traffic that flows between the 2 clients (I forgot that was happening, hopefully it's not relevant)
00:16:56.267344 IP 10.0.2.51.44200 > puppet.8140: Flags [S], seq 3965336833, win 14600, options [mss 1460,sackOK,TS val 28226548 ecr 0,nop,wscale 7], length 0
00:16:57.286518 IP 10.0.2.51.44200 > puppet.8140: Flags [S], seq 3965336833, win 14600, options [mss 1460,sackOK,TS val 28227567 ecr 0,nop,wscale 7], length 0
00:16:59.289952 IP 10.0.2.51.44200 > puppet.8140: Flags [S], seq 3965336833, win 14600, options [mss 1460,sackOK,TS val 28229570 ecr 0,nop,wscale 7], length 0
00:17:01.276768 ARP, Request who-has 10.0.2.2 tell 10.0.2.51, length 28
00:17:01.276988 ARP, Reply 10.0.2.2 is-at 52:54:00:12:35:00 (oui Unknown), length 46
00:17:03.305182 IP 10.0.2.51.44200 > puppet.8140: Flags [S], seq 3965336833, win 14600, options [mss 1460,sackOK,TS val 28233586 ecr 0,nop,wscale 7], length 0
00:17:11.323646 IP 10.0.2.51.44200 > puppet.8140: Flags [S], seq 3965336833, win 14600, options [mss 1460,sackOK,TS val 28241604 ecr 0,nop,wscale 7], length 0
00:17:27.401935 IP 10.0.2.51.44200 > puppet.8140: Flags [S], seq 3965336833, win 14600, options [mss 1460,sackOK,TS val 28257682 ecr 0,nop,wscale 7], length 0
where puppet is hardcoded in the hosts file to point to my external mac ip address.
I imagine you don't need this piece, but for completeness, this is the comparison with a working natnetwork.
00:23:55.584295 IP 10.0.2.51.44237 > puppet.8140: Flags [S], seq 3015116107, win 14600, options [mss 1460,sackOK,TS val 28645864 ecr 0,nop,wscale 7], length 0
00:23:55.584679 IP puppet.8140 > 10.0.2.51.44237: Flags [S.], seq 24054, ack 3015116108, win 32768, options [mss 1460], length 0
00:23:55.584733 IP 10.0.2.51.44237 > puppet.8140: Flags [.], ack 1, win 14600, length 0
00:23:55.584819 IP 10.0.2.51.44237 > puppet.8140: Flags [P.], seq 1:79, ack 1, win 14600, length 78
00:23:55.585735 IP puppet.8140 > 10.0.2.51.44237: Flags [P.], seq 1:589, ack 79, win 32690, length 588
00:23:55.585776 IP 10.0.2.51.44237 > puppet.8140: Flags [.], ack 589, win 15288, length 0
00:23:55.585792 IP puppet.8140 > 10.0.2.51.44237: Flags [F.], seq 589, ack 79, win 32690, length 0
00:23:55.589673 IP 10.0.2.51.44237 > puppet.8140: Flags [F.], seq 79, ack 590, win 15288, length 0
00:23:55.589916 IP puppet.8140 > 10.0.2.51.44237: Flags [.], ack 80, win 32689, length 0
I think I captured the packet contents, but didn't show them on the replay which is all I've pasted here - do you want more?
comment:38 by , 8 years ago
Thanks, I wanted to verify you don't get any Syn/Ack responses for the connection attempt, and this confirms it.
comment:39 by , 8 years ago
Sorry for the delay. Can you please try the 5.1 r111535 OS X test build with extra logging enabled as in comment:30 (I assume you used the right natnet name in that example).
The log file is in ~/Library/VirtualBox/StoreNat.log (again, assuming that's your natnet's name).
You should see unable to allocate TCP PCB... and then the dump of all currently used pcbs. IP addresses are logged, so if that's a privacy concern feel free to mask them.
comment:40 by , 8 years ago
Not great results so far. It seemed to take a little longer to get "stuck", but it got there in the end. I had invoked VirtualBox with the command -
$ (export VBOXNET_StoreNat_RELEASE_LOG="+nat_service.e.l.l2.l3.f -nat_service.restrict"; export VBOXNET_StoreNat_RELEASE_LOG_FLAGS="thread timeprog"; VirtualBox &)
but I didn't get any more logging in the StoreNat.log.
As a check, I listed the network
VBoxManage list natnets
NetworkName:    StoreNat
IP:             10.0.2.1
Network:        10.0.2.0/24
IPv6 Enabled:   No
IPv6 Prefix:    fd17:625c:f037:2::/64
DHCP Enabled:   Yes
Enabled:        Yes
Port-forwarding (ipv4)
        ClientSSH:tcp:[]:2251:[10.0.2.51]:22
        StoreSSH:tcp:[]:2210:[10.0.2.10]:22
        clientsshagain:tcp:[]:2215:[10.0.2.15]:22
        dhcpssh:tcp:[]:2207:[10.0.2.7]:22
        homerssh:tcp:[]:22922:[10.0.2.9]:22
        server2ssh:tcp:[]:2204:[10.0.2.4]:22
loopback mappings (ipv4)
        127.0.0.1=2
so it's the right network name. Am I missing something?
edit: removed erroneous hash character
comment:41 by , 8 years ago
Just to confirm that I wasn't getting something really basic wrong, I've re-installed an old build I had lying about -
Version 5.0.18 r106667
This gives a much more verbose log, e.g.
00:02:52.439357 lwIP2 ---> TCP ext.mac.addr:48002 socket 56
00:02:52.439515 lwIP3 pxtcp_add: new pxtcp 00007fd77ac16320; pcb 0000000100f3acb0; sock 56
00:02:52.439559 lwIP3 pxtcp_pmgr_connect: sock 56: connect: Connection refused
00:02:52.439579 lwIP2 pxtcp_pcb_accept_refuse: pxtcp 00007fd77ac16320, pcb 0000000100f3acb0, sock -1: Connection refused
00:02:52.439584 lwIP2 pxtcp_pcb_dissociate: pxtcp 00007fd77ac16320 <-> pcb 0000000100f3acb0
00:02:53.441169 lwIP2 ---> TCP ext.mac.addr:48002 socket 56
00:02:53.441299 lwIP3 pxtcp_add: new pxtcp 00007fd77ae0b650; pcb 0000000100f3acb0; sock 56
00:02:53.441326 lwIP3 pxtcp_pmgr_connect: sock 56: connect: Connection refused
00:02:53.441348 lwIP2 pxtcp_pcb_accept_refuse: pxtcp 00007fd77ae0b650, pcb 0000000100f3acb0, sock -1: Connection refused
00:02:53.441376 lwIP2 pxtcp_pcb_dissociate: pxtcp 00007fd77ae0b650 <-> pcb 0000000100f3acb0
00:02:54.019762 lwIP2 ---> TCP proxy.server:8080 socket 56
00:02:54.019890 lwIP3 pxtcp_add: new pxtcp 00007fd77ae0b650; pcb 0000000100f3acb0; sock 56
00:02:54.040747 lwIP3 sock 56: HUP
00:02:54.040804 lwIP2 pxtcp_pcb_pull_inbound: pxtcp 00007fd77ae0b650: pcb 0000000100f3acb0
00:02:54.040819 lwIP2 inbound_close: pxtcp 00007fd77ae0b650; pcb 0000000100f3acb0: ESTABLISHED
00:02:54.041075 lwIP2 pxtcp_pcb_sent: pxtcp 00007fd77ae0b650; pcb 0000000100f3acb0; all data ACKed
while it's working. Is there any value to this log - I'm expecting not as this instrumentation was for a different problem?
comment:42 by , 8 years ago
When you run that command to start VirtualBox with extra environment variables you must make sure you don't have VirtualBox, in particular the VBoxSVC process, already running. The VBoxNetNAT process is spawned from VBoxSVC, which will be spawned from VirtualBox if necessary. If you already had it running, the old copy will not pick up the env vars, obviously. Sorry I didn't make it clear.
BTW, is that log above in the state where it's "stuck"?
comment:43 by , 8 years ago
No, that log is from a working state. I'll keep it running until it breaks. I had been trying to kill all VirtualBox processes each time I was trying to get VirtualBox to fire up, but I was unable to get any more logging with the latest test build as directed. It's possible I did get that wrong, so I'll try again once I've seen it get stuck with this old build so that I can get you something from that log.
Thanks, Chris
comment:44 by , 8 years ago
Ah, my bad. I didn't put that extra logging in the right group. Make that VBOXNET_StoreNat_RELEASE_LOG="+all.l2 -all.restrict"
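Putting comment:42 and the corrected variable together, a sketch of the full launch sequence (names and the log path are the ones used in this thread; adjust to your setup):
# Make sure no VirtualBox processes are left over, otherwise the already-running
# VBoxSVC/VBoxNetNAT will not pick up the environment variables.
pgrep -l VBoxSVC && echo "shut down VirtualBox and wait for VBoxSVC to exit first"
export VBOXNET_StoreNat_RELEASE_LOG="+all.l2 -all.restrict"
export VBOXNET_StoreNat_RELEASE_LOG_FLAGS="thread timeprog"
VirtualBox &
# The natnet log should then appear in ~/Library/VirtualBox/StoreNat.log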
comment:45 by , 8 years ago
I'm going to try that new build again (r111535) now. The earlier build I was using with that logging on failed to get into the bad state at all - the longest it's run cleanly since August, typical...
comment:46 by , 8 years ago
Trying to think of things that I did differently. I had run VirtualBox without backgrounding it on the times when I didn't get stuck today. Something for me to try again later.
comment:47 by , 8 years ago
I've gone back to Version 5.1.9 r111535 (Qt5.5.1)
The command I used to start it was
export VBOXNET_StoreNat_RELEASE_LOG="+all.l2 -all.restrict" ; export VBOXNET_StoreNat_RELEASE_LOG_FLAGS="thread timeprog"; VirtualBox &
I'm not seeing anything immediately obvious in the ~/Library/VirtualBox/StoreNat.log file, should I?
follow-up: 52 comment:48 by , 8 years ago
Aha...
Yep, exactly what you suggested - here's a little snapshot
00:51:57.881822 lwIP2 == unable to allocate TCP PCB corp.proxy.server:8080 <-> 10.0.2.10:60352
00:51:57.881864 lwIP2 bound pcbs: none
00:51:57.881869 lwIP2 listen pcbs {
00:51:57.881892 lwIP2   0.0.0.0:53
00:51:57.881897 lwIP2 } // 1 listen pcbs
00:51:57.881901 lwIP2 active pcbs {
00:51:57.881904 lwIP2   corp.proxy.server:8080 <-> 10.0.2.10:60351
00:51:57.881909 lwIP2   corp.proxy.server:8080 <-> 10.0.2.10:60350
00:51:57.881913 lwIP2   corp.proxy.server:8080 <-> 10.0.2.10:60349
I can attach the log, but it's that message showing basically the full 128 connections to the proxy on port 8080, repeated about 70 times... Do you want it? Perhaps unsurprisingly, there are also 127 connections to the proxy server on port 8080 in CLOSE_WAIT if I do a netstat on my Mac host.
Also, is that saying I'm listening on port 53? Or is that for the DHCP server for that natnetwork?
Cheers, Chris
by , 8 years ago
Attachment: StoreNat.log.gz added
follow-up: 53 comment:49 by , 8 years ago
I think I've left out something important (that I've only just realised, apologies) :-(
I run docker on the VMs. The containers have their own network stacks... Any netstat commands I ran on the clients will have been incomplete, I think. Sorry.
So, I just went into the container I most suspect is at fault, and there weren't many bad connections at that point
netstat -atnp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:6000            0.0.0.0:*               LISTEN      6/Xvfb
tcp6       0      0 :::6000                 :::*                    LISTEN      6/Xvfb
tcp6       0      1 172.17.0.2:60416        corp.proxy.server:8080  SYN_SENT    16/java
However - that was with things in a broken state. I bounced the natnetwork again, and now I see this
user@b20fa7259901:/opt/build $ netstat -n -A inet -A inet6
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp6       0      1 172.17.0.2:60442        172.31.200.210:8080     LAST_ACK
tcp6       0      0 172.17.0.2:60523        172.31.200.210:8080     CLOSE_WAIT
tcp6       0      0 172.17.0.2:60522        172.31.200.210:8080     CLOSE_WAIT
tcp6       0      1 172.17.0.2:60454        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60445        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60452        172.31.200.210:8080     LAST_ACK
tcp6       0      0 172.17.0.2:60521        172.31.200.210:8080     CLOSE_WAIT
tcp6       0      1 172.17.0.2:60443        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60439        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60438        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60441        172.31.200.210:8080     LAST_ACK
tcp6       0      0 172.17.0.2:60524        172.31.200.210:8080     CLOSE_WAIT
tcp6       0      1 172.17.0.2:60455        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60436        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60437        172.31.200.210:8080     LAST_ACK
tcp6       0      1 172.17.0.2:60435        172.31.200.210:8080     LAST_ACK
tcp6       0      0 172.17.0.2:60525        172.31.200.210:8080     CLOSE_WAIT
tcp6       0      1 172.17.0.2:60440        172.31.200.210:8080     LAST_ACK
tcp6       1      1 172.17.0.2:60434        172.31.200.210:8080     LAST_ACK
and it looks like that list is growing.
So - is this pure user error, or should the natnetwork deal with things better? I believe that those connections are all down to a java process attempting to connect through our proxy to a deliberately non-existent web server. Whether that means the proxy isn't closing things, or the java process isn't closing things, I'm not sure though.
comment:50 by , 8 years ago
So it might not be that simple, though it's likely related. It has just got stuck, and there are 31 reported connections inside that container, all moving from CLOSE_WAIT to LAST_ACK, not seemingly completely stuck.
comment:51 by , 8 years ago
It appears to be in a state whereby I can still ssh in to the clients (for now) but cannot initiate external connections, and I still get messages about max tcbs in the log. That was after bouncing what I thought was the problem container.
comment:52 by , 8 years ago
Replying to OSX_claw:
Also, is that saying I'm listening on port 53? Or is that for the DHCP server for that natnetwork?
It's the TCP part of the DNS proxy. It listens on the "inside" of the natnet, not on the host; i.e. it's what guests can reach as 10.0.2.1:53.
comment:53 by , 8 years ago
Replying to OSX_claw:
However - that was with things in a broken state. I bounced the natnetwork again, and now I see this
user@b20fa7259901:/opt/build $ netstat -n -A inet -A inet6
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp6       0      1 172.17.0.2:60442        172.31.200.210:8080     LAST_ACK
tcp6       0      0 172.17.0.2:60523        172.31.200.210:8080     CLOSE_WAIT
...and it looks like that list is growing.
So - is this pure user error, or should the natnetwork deal with things better? I believe that those connections are all down to a java process attempting to connect through our proxy to a deliberately non-existent web server. Whether that means the proxy isn't closing things, or the java process isn't closing things, I'm not sure though.
CLOSE_WAIT is when the remote side has (half-)closed. A classic example of this is rsh(1), which sends local stdin to the server and then half-closes on EOF. It's quite normal for a connection to stay in that state, though usually it's on the "server" side, after the client is done sending its request/input/... So it's a bit suspect that you have these CLOSE_WAIT connections on the client. When the proxy closes and stops sending data to you it doesn't make much sense to keep the outbound connection still open - whatever you gonna ask the proxy, it can't reply to you any longer anyway.
OTOH, it can be some http proxy tunnel. Anyway, transitioning out of that state requires an action by the local app.
LAST_ACK should be transient - a connection is in that state when the app closes a connection in CLOSE_WAIT state (i.e. the remote has already closed) and waits for the last ack that will terminate the connection. You shouldn't normally see a lot of connections in that state since that last ack usually comes pretty fast.
Here the guest waits for a packet from the natnet, and I can actually think of some pathological failure scenarios where the natnet is starved of resources and can't send that last ack...
Let me think a bit more on what I want to probe next, so that I don't bother you with a back and forth for each additional debug log statement.
comment:54 by , 8 years ago
Thanks again - I'll keep checking back. It's got to the broken state now, but to confirm, inside what I suspect to be the problem container I don't see many connections by this point - they do seem to follow the usual flow to get closed off, even if they were showing as CLOSE_WAIT for a while previously. On the host, I do still see the VBoxNetNat process with 126 CLOSE_WAIT connections though.
comment:55 by , 8 years ago
Test build r111579 should print a bit more info. When it's stuck, please provide the log, and the netstat output from the host (VBoxNetNAT process), and the guest and containers. TIA.
comment:56 by , 8 years ago
Incoming... I've obfuscated and zipped up the files you asked for. The problem is seeing exactly when it breaks - I think that the netstat output inside the containers changes even once it is stuck - so by the time I took the netstat, any sockets had been tidied up as far as that network stack was concerned. One thing that isn't as obvious as I'd meant it to be - I've changed IP addresses to be things like corp.proxy.server. There are 2 proxies (though I did at least remember to differentiate them) and a corp.docker.server. At a quick glance those strings look very similar in the logs is all.
by , 8 years ago
Attachment: 14748.diag.tar.gz added
comment:57 by , 8 years ago
Strange. I would have expected a bunch of connections in the guest/containers in CLOSE_WAIT/LAST_ACK states, like in comment:49. On the host I can see a bunch of CLOSE_WAIT sockets and there are corresponding pcbs in the natnet log in FIN_WAIT_2, i.e. the connections to corp.proxy.server:8080 that had inbound FIN from remote, but didn't yet get outbound FIN from the guest.
Can you please try r111619 (or later) that cherry picks a TCP fix from upstream that might be relevant.
Please, can you also capture some of the offending connections to the corp.proxy.server:8080 on the guest side (please, provide pcap file(s), not text output from tcpdump). I don't know if every connection to the proxy causes a stuck connection in natnet, or only some of them, so this might be a bit tricky, b/c by the time you have your whole natnet stuck, you can't make any new connections...
Obfuscating the addresses will be harder in this case though :). You can limit the capture size to just the TCP header. If you don't want to attach the captures here, you can send them to me privately at valery dot ushakov at oracle dot com.
follow-up: 59 comment:58 by , 8 years ago
I'll try the build, and see what I can do for the pcap files. It's made harder by needing to do the captures from inside the container on the client, but should still be possible. The images don't have tcpdump in them for example, but I think I can install it.
comment:59 by , 8 years ago
Replying to OSX_claw:
It's made harder by needing to do the captures from inside the container on the client.
Can't you capture on the "main" guest?
Worst case, you can use VBox's built-in nic trace feature, but it's rather rudimentary e.g. you can't specify the capture filter (the intention is that you'd normally use guest's capture facilities and only resort to nictrace for early boot and the like, so it's kept pretty bare bones). So you can use nictrace to capture everything and post-process, though that's obviously a chore.
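For reference, a sketch of that built-in trace (the VM name and adapter number are examples; modifyvm requires the VM to be powered off when the setting is changed):
# Enable tracing on adapter 1 of the guest, reproduce the hang, then turn it off
# and open the pcap on the host with wireshark/tcpdump.
VBoxManage modifyvm "client-2" --nictrace1 on --nictracefile1 /tmp/client2-nic1.pcap
VBoxManage startvm "client-2" --type headless
# ... reproduce the problem, shut the VM down, then:
VBoxManage modifyvm "client-2" --nictrace1 off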
comment:60 by , 8 years ago
You're probably right - presumably the docker0 interface or one of the veth devices should be connected to the containers so tcpdump should catch it. I was getting hung up on netstat not showing the different network namespaces.
follow-up: 62 comment:61 by , 8 years ago
Seems to be capturing. I'm going for
tcpdump -S -n -i <interface> -w <outfile> host corp.proxy.server
do you think that will get what you're after?
Thanks, Chris
comment:62 by , 8 years ago
Replying to OSX_claw:
Seems to be capturing. I'm going for
tcpdump -S -n -i <interface> -w <outfile> host corp.proxy.server
do you think that will get what you're after?
Yes, I'm interested only in those connections. A tiny nit-pick - -S and -n are redundant as they only affect text output, not that it matters.
Thanks!
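A possible refinement, folding in the header-only capture suggested in comment:57 - a sketch only, with the interface name, snap length and output file as assumptions:
# Capture only packet headers for traffic to/from the proxy on the guest's natnet
# interface; 96 bytes is usually enough for the Ethernet, IP and TCP headers.
tcpdump -i eth0 -s 96 -w stuck.pcap host corp.proxy.server and tcp port 8080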
comment:63 by , 8 years ago
I've sent the tcpdump output through, hopefully there is something useful in that.
Thanks, Chris
comment:64 by , 8 years ago
Thanks! That capture confirms that for some connections the guest sends (and resends) the FINs but never gets that final ACK from the natnet and gives up eventually.
comment:65 by , 8 years ago
Please, can you try r111633 with extra logging enabled like before. I'm shooting in the dark here as I'm not quite sure from RTFS how we can even end up in that state, so your patience is appreciated. Thanks in advance.
by , 8 years ago
Attachment: 111633_StoreNat.log.gz added
comment:67 by , 8 years ago
Apologies, I didn't post the comment I thought I had. Yes, that should be stopped when the natnet was stuck. I should have an accompanying tcpdump if it's any use to you?
comment:68 by , 8 years ago
Since you've captured it anyway, it would be nice to have. Though I don't see in the log anything that would hint at the potential problem.
comment:69 by , 8 years ago
The capture actually helped to provide some clues, hopefully. Please, can you collect logs and packet capture for r111660. Thanks in advance.
comment:70 by , 8 years ago
Slightly later than intended, I'm running with the new build - will let you know about diagnostics.
comment:71 by , 8 years ago
I didn't see the usual problem in about 3 hours of running yesterday, which is kinda nice but a bit frustrating from your point of view I suspect :-) Running again today.
follow-up: 76 comment:72 by , 7 years ago
I have exactly the same problem while running one of the most recent versions of VirtualBox in Linux. I have to restart natnetwork all the time to make it work.
$ dpkg -l | grep virtualbox-5.1
ii  virtualbox-5.1  5.1.30-118389~Ubuntu~trusty  amd64  Oracle VM VirtualBox
Here is how I restart natnetwork:
VBoxManage natnetwork stop --netname NatNetwork && VBoxManage natnetwork start --netname NatNetwork
comment:74 by , 6 years ago
I have the same problem. Restarting the virtualbox service without restarting the VMs resolves the problem for some days. Then the internet connection is lost again. VirtualBox 5.1.30-38 and VirtualBox 5.2.4-5.2.16 tested for this bug.
comment:76 by , 6 years ago
Replying to giner:
I have exactly the same problem while running one of the most recent versions of VirtualBox in Linux. I have to restart natnetwork all the time to make it work.
$ dpkg -l | grep virtualbox-5.1
ii  virtualbox-5.1  5.1.30-118389~Ubuntu~trusty  amd64  Oracle VM VirtualBox
Here is how I restart natnetwork:
VBoxManage natnetwork stop --netname NatNetwork && VBoxManage natnetwork start --netname NatNetwork
Restarting the natnetwork helps. My NAT network gets stuck about once every 10 days. I have two NAT networks, but only one has this problem. Host OS: Win Server 2012 R2, Guest OS: WinXP SP3, VirtualBox 5.2.22.
comment:77 by , 5 years ago
Still a problem in 6.0.10. :( I have a Win 10 host and a CentOS 7 guest. I don't think ARP even works when it's in this state. I believe this forum thread covers the same issue.
comment:78 by , 5 years ago
Also facing this problem on a Debian 9.11 host with Virtual Box 6.0.12. Affects at least WinXP and Debian guests.
Currently, as a "bypass", I have automated a stop/start of the NatNetwork every hour - far from ideal.
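For anyone copying that bypass, a sketch of the crontab entry (assumptions: "NatNetwork" is the network name and VBoxManage is on the crontab user's PATH - otherwise use its full path):
# Restart the NAT network at the top of every hour (m h dom mon dow command).
0 * * * * VBoxManage natnetwork stop --netname NatNetwork && VBoxManage natnetwork start --netname NatNetwork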
comment:79 by , 5 years ago
I have the same issue on VirtualBox version 6.0.14. I run Ubuntu 18.04.3 on both host and guest.
I have been running VirtualBox for ~3 months on this new computer with the default version 5.2.34 without any issues until yesterday when I started having issues with the network. After upgrading to 6.0.14 there is still the same issue.
To my knowledge I have made no changes or upgrades to any part of my system the last couple of days.
On the ~10 tries I've made, the network has been alive for ~1 to a maximum of 10 minutes.
comment:80 by , 4 years ago
I have the same problem, the NAT network dies after approx 15 minutes. Is this bug being worked on?
comment:81 by , 4 years ago
Exactly the same problem still here, noticed in several versions. The same with the latest update. Host Windows Server 2012 with RHEL v7. Not able to find the combination of factors that makes the NatNetwork die. Disappointed, serious problem. Any update? Thanks.
comment:82 by , 3 years ago
This issue has existed for years; is there someone still working on VBox at Oracle???
Please, can you give a recent (103571 or later) 5.0 test build a try?