Opened 14 years ago
Closed 10 years ago
#7459 closed defect (fixed)
NAT networking drops out after a few days' uptime -> fixed in svn.
Reported by: | dimitris | Owned by: | |
---|---|---|---|
Component: | network/NAT | Version: | VirtualBox 4.0.2 |
Keywords: | NAT DHCP | Cc: | |
Guest type: | Linux | Host type: | Linux |
Description (last modified by )
Both host and guest are Ubuntu Lucid servers, guest additions are up-to-date, guest uses a virtio adapter with NAT networking.
2-3 days after starting the guest, it loses connectivity to the outside world. These are the last few entries in its log:
31:39:39.715 NAT: adding domain name lan to search list 31:39:39.715 NAT: DHCP offered IP address 10.0.2.15 42:40:55.746 NAT: adding domain name lan to search list 42:40:55.746 NAT: DHCP offered IP address 10.0.2.15 51:08:57.554 NAT: error occurred while sending ICMP error message
The guest is running Logitech's Squeezebox Server, a perl-based music server. Ports for that have been forwarded (see machine XML) and traffic flows normally before the machine drops out.
log and machine XML to be attached.
Attachments (13)
Change History (70)
by , 14 years ago
Attachment: | squeezemachine.xml added |
---|
by , 14 years ago
by , 14 years ago
Attachment: | nodhcp.png added |
---|
comment:1 by , 14 years ago
VBoxManage controlvm <machine> reset
doesn't bring DHCP back. I have to power the machine off then back on to get DHCP working.
by , 14 years ago
Attachment: | VBox.2.log added |
---|
log including the loss of connectivity and a guest reboot
comment:2 by , 14 years ago
Just happened again. Attached guest screenshots of its ifconfig and syslog after the guest lost connectivity but before I rebooted it (from within the guest itself through the terminal server).
After the guest is rebooted, the DHCP is still not functioning, same as shown in the previously attached screenshot (nodhcp.png).
Also attaching a log that covers the guest's reboot.
comment:4 by , 14 years ago
Switched to
00:11.0 Ethernet controller: Intel Corporation 82545EM Gigabit Ethernet Controller (Copper) (rev 02)
Let's see how it does over the next few days.
comment:5 by , 14 years ago
Same behavior with Intel virtual network adapter, see latest attachments:
51:09:51.358 NAT: error occurred while sending ICMP error message
comment:8 by , 14 years ago
Version: | VirtualBox 3.2.8 → VirtualBox 3.2.10 |
---|
comment:9 by , 14 years ago
DHCP client, SSH server, the perl-based Squeezebox Server and its associated MySql server:
Active Internet connections (only servers) Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name tcp 0 0 127.0.0.1:9092 0.0.0.0:* LISTEN 744/mysqld tcp 0 0 0.0.0.0:9000 0.0.0.0:* LISTEN 565/perl tcp 0 0 127.0.0.1:3306 0.0.0.0:* LISTEN 551/mysqld tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 648/sshd tcp 0 0 0.0.0.0:3483 0.0.0.0:* LISTEN 565/perl tcp 0 0 0.0.0.0:9090 0.0.0.0:* LISTEN 565/perl udp 0 0 0.0.0.0:3483 0.0.0.0:* 565/perl udp 0 0 0.0.0.0:50609 0.0.0.0:* 565/perl udp 0 0 0.0.0.0:68 0.0.0.0:* 586/dhclient3
comment:10 by , 14 years ago
Here are the active connections. The one on local:3483 is the hardware player, and normally that's the only active TCP connection on this guest. Traffic is minimal when there's no music streaming (clock update on the player's idle screen is driven by the server). Streaming is a typical MP3 or Ogg stream, should be less than 200kbps.
Active Internet connections (w/o servers) Proto Recv-Q Send-Q Local Address Foreign Address State tcp 0 0 10.0.2.15:22 10.0.2.2:47061 ESTABLISHED tcp 0 0 10.0.2.15:3483 10.0.2.2:4718 ESTABLISHED
comment:11 by , 14 years ago
Here I've installed the similar vm, let see if I'll able to reproduce the issue.
comment:12 by , 14 years ago
I have the same problem but on a OS running Windows 7 and the guest is XP.
Takes about 4 hours , log says the same in the end "NAT: error occurred while sending ICMP error message".
Have to shutdown the Guest , reset wont bring the network back.
comment:15 by , 14 years ago
Version: | VirtualBox 3.2.10 → VirtualBox 3.2.12 |
---|
follow-up: 17 comment:16 by , 14 years ago
Any luck reproducing this? If not I could probably FTP the machine (XML and disk image) if that would help.
comment:17 by , 14 years ago
Replying to dimitris:
Any luck reproducing this? If not I could probably FTP the machine (XML and disk image) if that would help.
Instructions are send via mail.
comment:18 by , 14 years ago
Almost on the third day I've got effect you're describing. I'll restart the test with more information collecting to get more information what, could lead to this effect. Also I'll try to minimize tests to provoke the effect in more short period.
comment:19 by , 14 years ago
Debug build, which I've lunched about week ago, has met other unrelated issue, causing me recompile/restart build.
comment:20 by , 14 years ago
yesterday debug build enters the state you'd described, finally. Looks like drops are caused with huge amount requests to UDP.IP.dst:63.17.46.179 and ARP.TIP:63.17.46.179 from NAT service side to the guest. in order to debug the issue I'm trying to create testcase which will provoke similar issue in smaller timeframe.
comment:21 by , 14 years ago
Version: | VirtualBox 3.2.12 → VirtualBox 4.0.2 |
---|
follow-up: 23 comment:22 by , 14 years ago
Also upgraded to 4.0.2 here, problem persists, with this additional info now on the logs:
... 09:03:50.329 NAT: DHCP offered IP address 10.0.2.15 19:34:59.297 NAT: DHCP offered IP address 10.0.2.15 30:17:58.870 NAT: DHCP offered IP address 10.0.2.15 41:06:21.336 NAT: DHCP offered IP address 10.0.2.15 51:06:45.355 NAT: error occurred while sending ICMP error message 51:06:45.416 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY 51:06:45.618 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY 51:06:46.028 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY 51:06:46.842 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY 51:06:48.475 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY 51:06:51.738 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY 51:06:58.266 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY 51:07:04.354 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY 51:07:11.323 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY 51:07:18.234 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY 51:07:19.235 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY 51:07:20.234 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY 51:07:37.438 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY 51:07:38.439 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY 51:07:39.440 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY 51:08:03.550 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY 51:08:04.355 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY 51:08:04.550 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY 51:08:05.550 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY 51:09:04.354 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY 51:10:04.354 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY 51:11:04.354 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY 51:12:04.355 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY 51:13:04.356 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY 51:14:04.356 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY 51:14:26.453 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY ...
Over here though I don't see any UDP traffic, albeit on a relatively short packet capture. What I do see is the TCP traffic between the Squeezebox Server and the hardware player, e.g.:
11:45:35.855991 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [P.], seq 167190:168480, ack 3173, win 5840, length 1290 11:45:35.857154 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [.], ack 168480, win 65535, length 0 11:45:36.854949 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [P.], seq 168480:169770, ack 3173, win 5840, length 1290 11:45:36.856447 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [.], ack 169770, win 65535, length 0 11:45:37.855543 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [P.], seq 169770:171060, ack 3173, win 5840, length 1290 11:45:37.856682 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [.], ack 171060, win 65535, length 0 11:45:38.001067 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [P.], seq 171060:171090, ack 3173, win 5840, length 30 11:45:38.002478 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [.], ack 171090, win 65535, length 0 11:45:38.004206 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [P.], seq 3173:3234, ack 171090, win 65535, length 61 11:45:38.004269 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [P.], seq 3234:3295, ack 171090, win 65535, length 61 11:45:38.006554 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [.], ack 3295, win 5840, length 0 11:45:38.854897 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [P.], seq 171090:172380, ack 3295, win 5840, length 1290 11:45:38.855901 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [.], ack 172380, win 65535, length 0 11:45:39.854914 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [P.], seq 172380:173670, ack 3295, win 5840, length 1290 11:45:39.855924 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [.], ack 173670, win 65535, length 0
The above is guest-side: The guest's eth0 is at 10.0.2.15, and the hardware player, on the other side of the VBox NAT, is mapped here as 10.0.2.2.
The traffic looks the same on the host side, just with different addresses. Which UDP port(s) do you see traffic on?
comment:23 by , 14 years ago
Replying to dimitris:
Also upgraded to 4.0.2 here, problem persists, with this additional info now on the logs:
The traffic looks the same on the host side, just with different addresses. Which UDP port(s) do you see traffic on?
it's a lot of udp broadcasts to udp/3483. I don't touch web interface a lot so i have very rare tcp entries in dump. the major part is UDP broadcasts.
follow-up: 25 comment:24 by , 14 years ago
OK, on a longer capture I see the UDP traffic too:
11:11:39.001496 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 44) 10.0.2.15.3483 > 255.255.255.255.3483: UDP, length 16
It's once a minute, and IIRC it's the server process advertising itself on the network so that players can discover it.
comment:25 by , 14 years ago
Replying to dimitris:
It's once a minute, and IIRC it's the server process advertising itself on the network so that players can discover it.
Here, I've made tool broadcasting udp, that cause similar effect. But I don't sure that it's exactly or single the problem. I've fixed the issue and now I'm running the build to verify.
comment:26 by , 14 years ago
Summary: | NAT networking drops out after a few days' uptime → NAT networking drops out after a few days' uptime -> fixed in svn. |
---|
follow-up: 29 comment:28 by , 12 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
I've been seeing this issue on and off for a while now and it is still in the 4.1.14.
With the help of a UDP packet blaster and the PCNetIII adapter I was able to reproduce this problem every single time.
The steps:
- Create a VM with a Natted interface, use PCNetIII
- Start a UDP packet blast with size 1400, the packets must go from the VM and get routed/natted out to the physical NIC.
- After ~5 minutes, the PCNet adapter will bounce the link
- Shortly after that, the NAT interface seems to be trying to send a DHCP ACK back to the PCNetIII adapter
- Finally, the error "error occurred while sending ICMP error message" shows up and no packets can be passed after that. You wont even be able to ping internal ip of the NAT interface (10.0.2.2)
Notes:
- I also tried the same steps with the latest SVN and the problem is still there.
- The intel interfaces do not have the same issue because they do not bounce the link of the NIC card.
- See attached log
- The same bug will happen in other cases but this is the most reliable way so far.
Please let me know if you'd like to test a patch, I can build my own vbox and test it.
by , 12 years ago
Attachment: | ubuntu-2012-05-21-14-37-24.log added |
---|
vboxlog using vbox 4.1.14 on windows 7 32bit with ubuntu 32bit guest
comment:29 by , 12 years ago
Description: | modified (diff) |
---|
Replying to rbhkamal:
I've been seeing this issue on and off for a while now and it is still in the 4.1.14.
With the help of a UDP packet blaster and the PCNetIII adapter I was able to reproduce this problem every single time.
Could you please clarify what are you doing/or what are you meaning by "UDP packet blast"?
Please let me know if you'd like to test a patch, I can build my own vbox and test it.
Thanks, for cooperation, I'll check what disabled experimental code, could possibly solve your issue, but first I'd like to understand what issue have you met.
follow-up: 31 comment:30 by , 12 years ago
From within the guest OS, I ran a utility which generates UDP traffic at 70-85Mbps (udb traffic blast). UDP here is not significant, the traffic can be TCP or even ICMP. I simply chose UDP because you wont have to setup a server to receive the packets.
See attached diagram.
I got a little more info since I reopened the bug. I was suspecting that a DHCP renewal from the guest, while under load, was causing the issue to happen. However, the bug happens with/without DHCP enabled on the guest.
comment:31 by , 12 years ago
Replying to rbhkamal:
From within the guest OS, I ran a utility which generates UDP traffic at 700-850Mbps (udb traffic blast). UDP here is not significant, the traffic can be TCP or even ICMP. I simply chose UDP because you wont have to setup a server to receive the packets.
ok, what this utility is? And are you able to reproduce this issue let say with ipperf?
comment:32 by , 12 years ago
It is an in-house propitiatory utility, however, I'm confident that ipperf will have the same effect. I'll try it now.
The bug was initially discovered while FTPing large files over a 1Gbps line. After about 5 minutes, FTP would fail. FTPing on a 100Mbps line didn't have any problem at all. Sorry I didn't mention this in the beginning but I was worried that you might think that I was running into Ticket #10527
So FTP by it self was causing the problem to happen (as long as it is 700MB+ file and a 1Gbps line)
I'll update you soon with the ipperf results.
follow-up: 34 comment:33 by , 12 years ago
Alright, I reproduced with iperf. Same result, after about 80 seconds the NIC stops passing traffic. Please remember that this bug is reproducable ONLY when using PCNetIII adapter (vs. Intel)
iperf client log:
------------------------------------------------------------ root@testing-VirtualBox:~# cat run iperf -u -c 10.100.1.97 -i 5 -l 1400 -b 500M -t 300 root@testing-VirtualBox:~# root@testing-VirtualBox:~# ./run Client connecting to 10.100.1.97, UDP port 5001 Sending 1400 byte datagrams UDP buffer size: 160 KByte (default) ------------------------------------------------------------ [ 3] local 10.0.2.15 port 56444 connected with 10.100.1.97 port 5001 [ ID] Interval Transfer Bandwidth [ 3] 0.0- 5.0 sec 41.4 MBytes 69.5 Mbits/sec [ 3] 5.0-10.0 sec 41.4 MBytes 69.5 Mbits/sec [ 3] 10.0-15.0 sec 41.1 MBytes 69.0 Mbits/sec [ 3] 15.0-20.0 sec 40.9 MBytes 68.7 Mbits/sec [ 3] 20.0-25.0 sec 40.7 MBytes 68.3 Mbits/sec [ 3] 25.0-30.0 sec 40.4 MBytes 67.7 Mbits/sec [ 3] 30.0-35.0 sec 40.4 MBytes 67.7 Mbits/sec [ 3] 35.0-40.0 sec 40.2 MBytes 67.4 Mbits/sec [ 3] 40.0-45.0 sec 40.4 MBytes 67.8 Mbits/sec [ 3] 45.0-50.0 sec 40.3 MBytes 67.7 Mbits/sec [ 3] 50.0-55.0 sec 40.3 MBytes 67.6 Mbits/sec [ 3] 55.0-60.0 sec 39.6 MBytes 66.5 Mbits/sec [ 3] 60.0-65.0 sec 52.8 MBytes 88.6 Mbits/sec [ 3] 65.0-70.0 sec 476 KBytes 780 Kbits/sec [ 3] 70.0-75.0 sec 0.00 Bytes 0.00 bits/sec [ 3] 75.0-80.0 sec 0.00 Bytes 0.00 bits/sec [ 3] 80.0-85.0 sec 0.00 Bytes 0.00 bits/sec
iperf server log:
------------------------------------------------------------ Server listening on UDP port 5001 Receiving 1470 byte datagrams UDP buffer size: 112 KByte (default) ------------------------------------------------------------ [ 3] local 10.100.1.97 port 5001 connected with 10.100.1.77 port 59842 [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams [ 3] 0.0- 0.5 sec 3.94 MBytes 66.1 Mbits/sec 0.122 ms 17/ 2969 (0.57%) [ 3] 0.5- 1.0 sec 4.10 MBytes 68.8 Mbits/sec 0.015 ms 78/ 3150 (2.5%) [ 3] 1.0- 1.5 sec 3.96 MBytes 66.5 Mbits/sec 0.118 ms 74/ 3043 (2.4%) [ 3] 1.5- 2.0 sec 4.06 MBytes 68.2 Mbits/sec 0.010 ms 80/ 3123 (2.6%) [ 3] 2.0- 2.5 sec 4.06 MBytes 68.1 Mbits/sec 0.009 ms 66/ 3104 (2.1%) [ 3] 2.5- 3.0 sec 4.09 MBytes 68.7 Mbits/sec 0.094 ms 77/ 3144 (2.4%) [ 3] 3.0- 3.5 sec 4.05 MBytes 68.0 Mbits/sec 0.119 ms 82/ 3117 (2.6%) [ 3] 3.5- 4.0 sec 4.11 MBytes 69.0 Mbits/sec 0.026 ms 63/ 3142 (2%) . . Skipping . [ 3] 59.0-59.5 sec 4.02 MBytes 67.4 Mbits/sec 0.010 ms 63/ 3074 (2%) [ 3] 59.5-60.0 sec 3.93 MBytes 65.9 Mbits/sec 0.010 ms 53/ 2993 (1.8%) [ 3] 60.0-60.5 sec 3.45 MBytes 57.9 Mbits/sec 0.009 ms 57/ 2642 (2.2%) [ 3] 60.5-61.0 sec 5.29 MBytes 88.8 Mbits/sec 0.035 ms 14/ 3978 (0.35%) [ 3] 61.0-61.5 sec 5.40 MBytes 90.5 Mbits/sec 0.026 ms 16/ 4058 (0.39%) [ 3] 61.5-62.0 sec 5.41 MBytes 90.7 Mbits/sec 0.023 ms 0/ 4051 (0%) [ 3] 62.0-62.5 sec 5.51 MBytes 92.4 Mbits/sec 0.183 ms 0/ 4125 (0%) [ 3] 62.5-63.0 sec 5.56 MBytes 93.4 Mbits/sec 0.100 ms 0/ 4168 (0%) [ 3] 63.0-63.5 sec 5.58 MBytes 93.7 Mbits/sec 0.060 ms 0/ 4183 (0%) [ 3] 63.5-64.0 sec 5.65 MBytes 94.9 Mbits/sec 0.008 ms 0/ 4235 (0%) [ 3] 64.0-64.5 sec 5.64 MBytes 94.7 Mbits/sec 0.191 ms 0/ 4226 (0%) [ 3] 64.5-65.0 sec 5.27 MBytes 88.4 Mbits/sec 0.052 ms 0/ 3946 (0%)
Also I attached the vbox.log from that test run.
comment:34 by , 12 years ago
Replying to rbhkamal: Thank you for investigation. I will try to reproduce the issue locally.
comment:35 by , 12 years ago
In mean while I've tried to reproduce the issue, but I've used Mac host (Windows was used for other tests) and wasn't able to reproduce the issue. Could you please repeat the same test with bridged network attachment?
follow-up: 37 comment:36 by , 12 years ago
response from rbhkamal I've received via mail notification
My account (rbhkamal) got locked so I hope you get my response here: To reproduce the issue you *must* use the PCNetIII network with NAT, I was able to reproduce it every single time with vbox 3.1.16. If I switch it to E1000, the problem disappears. I suspect bridge networking will not have this problem as well.
comment:37 by , 12 years ago
Replying to rbhkamal:
response from rbhkamal I've received via mail notification
My account (rbhkamal) got locked so I hope you get my response here: To reproduce the issue you *must* use the PCNetIII network with NAT, I was able to reproduce it every single time with vbox 3.1.16.
I've tried this test on trunk with pcnet and e1000. Could you please try with 4.0.16 or 4.1.16?
If I switch it to E1000, the problem disappears. I suspect bridge networking will not have this problem as well.
comment:38 by , 12 years ago
Sorry for the late response, I had to redo the setup. I was able to reproduce the problem with 4.1.16 on a Windows 32bit host and a linux 32bit guest (Ubuntu 12.04).
I'm able to reproduce the problem 100% of the time on windows. If you like, I could also try it on my mbp laptop.
Previously, when I saw this issue, it would show up randomly like the original steps of this bug. But if you follow my steps exactly, you should be able to reproduce consistently.
For how long did you run the blast and how much bandwidth where you able get?
I've done my testing using a Dell latitude D830 (Intel Core 2 duo). On that machine, I ran the blast for 60-120 seconds before the problem occurred and I was able to get around 70Mbps with 2% packet loss.
follow-up: 40 comment:39 by , 12 years ago
Also make sure that you use UDP with iperf because TCP will throttle back once you start loosing packets.
I *may* be able to give you full access to my test environment, but it will take me a while to isolate it from the rest of the network. Let me know if you want it.
comment:40 by , 12 years ago
Replying to rbhkamal:
Also make sure that you use UDP with iperf because TCP will throttle back once you start loosing packets.
I'll try this on Windows machine tomorrow, looks like it Windows specific issue.
I *may* be able to give you full access to my test environment, but it will take me a while to isolate it from the rest of the network. Let me know if you want it.
I'd hoped, that it not necessary, I think next step would be following, I'll send you build of VBoxDD.dll with enabled logging, in case I won't be able to reproduce the issue.
comment:41 by , 12 years ago
rbhkamal replied offline that he can only reproduce this problem using the PCnet adapter with NAT.
comment:42 by , 12 years ago
Actually two days ago a coworker of mine was able to consistently reproduce the same exact issue with E1000 (Server T OEM) using VBox 4.1.16. So he tried to run the same exact test but using Linux instead, and he wasn't able to get the issue to happen.
This issue is fairly easy to reproduce on a windows machine using ~20Mbps packet stream (outbound from the VM). But I believe that the bug is generic to all platforms and that Windows happened to aggravate the problem with its "standard" behavior as always.
I'm going to reproduce the problem again on Windows and switch the same machine to Linux and see if I can reproduce the issue.
Thanks,
Ribhi
comment:43 by , 12 years ago
Sorry that I haven't updated the ticket before leaving to vacation, IIRC I wasn't able reproduce the issue on my Windows machine here, I will retry this with fresh forces.
follow-up: 45 comment:44 by , 12 years ago
Hope you had a great time :)
Back to work, are you sniffing the traffic while trying to reproduce this issue? So far I haven't been able to reproduce the bug if I was sniffing at the same time. I tried sniffing from the host and inside the guest and in both cases the issue stopped happening. However, I was able to sniff on the server side of iperf, which is running on a separate machine. But there was nothing interesting, some dropped packets but that is all.
I noticed yesterday something very strange that only happens while using NAT (with any driver.) Instead of using iperf to blast packets, I just used ping like this:
ping -i 0 -s 1024 10.100.1.123
And I ended up getting a lot of duplicate ICMP responses, so I stopped the ping and starting a normal ping:
ping 10.100.1.123
After that I got the same duplicate ICMP replies at the same rate as if I was doing "-i 0". So I stopped the ping and started it again and only then I started getting the correct pings. It was as if a queue wasn't getting flushed correctly.
One last thing to add to the bug, when reproducing the bug with PCNetIII adapter I get the following stack trace from the guest:
Jun 14 20:16:02 TESTING kernel: [ 8.680836] Console: switching to colour frame buffer device 80x30 Jun 14 20:16:02 TESTING kernel: [ 8.686448] fb0: VESA VGA frame buffer device Jun 14 20:16:02 TESTING kernel: [ 17.568649] eth4: no IPv6 routers present Jun 14 20:23:41 TESTING kernel: [ 524.016173] ------------[ cut here ]------------ Jun 14 20:23:41 TESTING kernel: [ 524.016221] WARNING: at /build/buildd/linux-3.0.0/net/sched/sch_generic.c:255 dev_watchdog+0x1e6/0x1f0() Jun 14 20:23:41 TESTING kernel: [ 524.016231] Hardware name: VirtualBox Jun 14 20:23:41 TESTING kernel: [ 524.016234] NETDEV WATCHDOG: eth4 (pcnet32): transmit queue 0 timed out Jun 14 20:23:41 TESTING kernel: [ 524.016237] Modules linked in: vesafb ppdev joydev psmouse serio_raw parport_pc parport i2c_piix4 usbhid hid ahci libahci pcnet32 Jun 14 20:23:41 TESTING kernel: [ 524.016279] Pid: 0, comm: swapper Not tainted 3.0.0-20-generic #34-Ubuntu Jun 14 20:23:41 TESTING kernel: [ 524.016282] Call Trace: Jun 14 20:23:41 TESTING kernel: [ 524.016316] [<c1047e32>] warn_slowpath_common+0x72/0xa0 Jun 14 20:23:41 TESTING kernel: [ 524.016316] [<c1455fe6>] ? dev_watchdog+0x1e6/0x1f0 Jun 14 20:23:41 TESTING kernel: [ 524.016539] [<c1455fe6>] ? dev_watchdog+0x1e6/0x1f0 Jun 14 20:23:41 TESTING kernel: [ 524.016543] [<c1047f03>] warn_slowpath_fmt+0x33/0x40 Jun 14 20:23:41 TESTING kernel: [ 524.016547] [<c1455fe6>] dev_watchdog+0x1e6/0x1f0 Jun 14 20:23:41 TESTING kernel: [ 524.016552] [<c1054d8f>] call_timer_fn+0x2f/0x110 Jun 14 20:23:41 TESTING kernel: [ 524.016556] [<c1455e00>] ? qdisc_reset+0x40/0x40 Jun 14 20:23:41 TESTING kernel: [ 524.016560] [<c105648b>] run_timer_softirq+0xeb/0x210 Jun 14 20:23:41 TESTING kernel: [ 524.016564] [<c1455e00>] ? qdisc_reset+0x40/0x40 Jun 14 20:23:41 TESTING kernel: [ 524.016568] [<c104ea60>] ? local_bh_enable_ip+0x90/0x90 Jun 14 20:23:41 TESTING kernel: [ 524.016572] [<c104eae1>] __do_softirq+0x81/0x1a0 Jun 14 20:23:41 TESTING kernel: [ 524.016576] [<c104ea60>] ? local_bh_enable_ip+0x90/0x90 Jun 14 20:23:41 TESTING kernel: [ 524.016578] <IRQ> [<c104ee26>] ? irq_exit+0x76/0xa0 Jun 14 20:23:41 TESTING kernel: [ 524.016586] [<c1537d89>] ? smp_apic_timer_interrupt+0x59/0x88 Jun 14 20:23:41 TESTING kernel: [ 524.016591] [<c106d07f>] ? sched_clock_cpu+0xcf/0x150 Jun 14 20:23:41 TESTING kernel: [ 524.016595] [<c1530d89>] ? apic_timer_interrupt+0x31/0x38 Jun 14 20:23:41 TESTING kernel: [ 524.016601] [<c10259da>] ? native_safe_halt+0xa/0x10 Jun 14 20:23:41 TESTING kernel: [ 524.016606] [<c100971a>] ? default_idle.part.5+0x4a/0x190 Jun 14 20:23:41 TESTING kernel: [ 524.016610] [<c100987f>] ? default_idle+0x1f/0x40 Jun 14 20:23:41 TESTING kernel: [ 524.016613] [<c100175a>] ? cpu_idle+0x8a/0xc0 Jun 14 20:23:41 TESTING kernel: [ 524.016618] [<c15036d5>] ? rest_init+0x5d/0x68 Jun 14 20:23:41 TESTING kernel: [ 524.016633] [<c17c3763>] ? start_kernel+0x352/0x358 Jun 14 20:23:41 TESTING kernel: [ 524.016638] [<c17c33a2>] ? pass_bootoption.constprop.2+0xe2/0xe2 Jun 14 20:23:41 TESTING kernel: [ 524.016642] [<c17c30a9>] ? i386_start_kernel+0xa9/0xaf Jun 14 20:23:41 TESTING kernel: [ 524.016645] ---[ end trace 74ccf6cac077ffb8 ]--- Jun 14 20:23:41 TESTING kernel: [ 524.016813] pcnet32: eth4: transmit timed out, status 0073, resetting Jun 14 20:23:51 TESTING kernel: [ 534.016669] pcnet32: eth4: transmit timed out, status 0073, resetting Jun 14 20:24:01 TESTING kernel: [ 544.016609] pcnet32: eth4: transmit timed out, status 0073, resetting Jun 14 20:24:11 TESTING kernel: [ 554.016295] pcnet32: eth4: transmit timed out, status 0073, resetting Jun 14 20:24:21 TESTING kernel: [ 564.016630] pcnet32: eth4: transmit timed out, status 0073, resetting
I'll check my coworker's machine to see if he has a similar stack trace with the E1000 driver.
by , 12 years ago
Attachment: | detailed-network.jpeg added |
---|
A more detailed diagram of my simple test environment.
follow-up: 46 comment:45 by , 12 years ago
Replying to rbhkamal:
Hope you had a great time :)
Yes, that was great. thank you.
Back to work, are you sniffing the traffic while trying to reproduce this issue? So far I haven't been able to reproduce the bug if I was sniffing at the same time. I tried sniffing from the host and inside the guest and in both cases the issue stopped happening. However, I was able to sniff on the server side of iperf, which is running on a separate machine. But there was nothing interesting, some dropped packets but that is all.
That smell like time sensitive issue. Grrr.
I noticed yesterday something very strange that only happens while using NAT (with any driver.) Instead of using iperf to blast packets, I just used ping like this
Unfortunately ICMP not very good friend for detecting network issues on Windows host, we're using ICMP API and not dealing with socket there. So it'd be better to return to the iperf scenario.
Regarding your test diagram is it intentional to run iperf in server mode on different machine, or the same result achievable on if iperf runs on the same machine. and if so does it make any difference for you whether you point -c 10.0.2.2 or -c <ip of your host>?
comment:46 by , 12 years ago
Replying to Hachiman:
Regarding your test diagram is it intentional to run iperf in server mode on different machine, or the same result achievable on if iperf runs on the same machine.
Because my host is a Windows machine and iperf was a Linux only app, I didn't want to bother with iperf windows ports. I'll try that now.
and if so does it make any difference for you whether you point -c 10.0.2.2 or -c <ip of your host>?
I haven't tried that either. Will do it now too
comment:47 by , 12 years ago
This is taking longer than expected but here is what I have so far.
1- When putting iperf in server mode on the Host OS; the problem does NOT happen
2- When blasting the natted port from the guest (10.0.2.15 to 10.0.2.2); the problem does NOT happen.
3- I removed the switch and used a straight through cable; the problem *STILL* happens.
However, I found something interesting, the bug will only happen if I can make the packets get to ~64Mbps. In the case of # 1 and # 2, the speed would never exceed ~50Mbps. So I'm worried that the bug may still happen. I will let the test run overnight and I'll update the bug if it happens
Also while sniffing, the speed gets limited to ~49Mbps. Below is the result from the server, from second 0 to 70, the sniffer was running. At the 75 seconds mark, I stopped the sniffer and the speed increased by 20Mbps then the bug happened. The CPU was never near %90 during my all of my testing.
Here is the log from the server (without the switch in the middle), notice how the speed goes up then the bug happens.
------------------------------------------------------------ Server listening on UDP port 5001 Receiving 1470 byte datagrams UDP buffer size: 112 KByte (default) ------------------------------------------------------------ [ 3] local 10.100.1.97 port 5001 connected with 10.100.1.222 port 53482 [ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams [ 3] 0.0- 5.0 sec 29.5 MBytes 49.4 Mbits/sec 0.008 ms 600/30777 (1.9%) [ 3] 5.0-10.0 sec 29.5 MBytes 49.5 Mbits/sec 0.129 ms 573/30779 (1.9%) [ 3] 10.0-15.0 sec 29.5 MBytes 49.4 Mbits/sec 0.019 ms 595/30774 (1.9%) [ 3] 15.0-20.0 sec 29.3 MBytes 49.2 Mbits/sec 0.068 ms 544/30582 (1.8%) [ 3] 20.0-25.0 sec 29.3 MBytes 49.2 Mbits/sec 0.018 ms 557/30591 (1.8%) [ 3] 25.0-30.0 sec 29.5 MBytes 49.6 Mbits/sec 0.016 ms 590/30833 (1.9%) [ 3] 30.0-35.0 sec 29.4 MBytes 49.3 Mbits/sec 0.056 ms 604/30692 (2%) [ 3] 35.0-40.0 sec 29.6 MBytes 49.7 Mbits/sec 0.009 ms 619/30927 (2%) [ 3] 40.0-45.0 sec 29.5 MBytes 49.5 Mbits/sec 0.127 ms 570/30797 (1.9%) [ 3] 45.0-50.0 sec 29.5 MBytes 49.5 Mbits/sec 0.029 ms 584/30792 (1.9%) [ 3] 50.0-55.0 sec 29.5 MBytes 49.4 Mbits/sec 0.106 ms 573/30740 (1.9%) [ 3] 55.0-60.0 sec 29.6 MBytes 49.7 Mbits/sec 0.017 ms 652/30999 (2.1%) [ 3] 60.0-65.0 sec 29.6 MBytes 49.6 Mbits/sec 0.023 ms 559/30846 (1.8%) [ 3] 65.0-70.0 sec 29.5 MBytes 49.6 Mbits/sec 0.009 ms 539/30787 (1.8%) [ 3] 70.0-75.0 sec 37.9 MBytes 63.6 Mbits/sec 0.062 ms 159/38960 (0.41%) [ 3] 75.0-80.0 sec 40.7 MBytes 68.2 Mbits/sec 0.018 ms 0/41655 (0%) <-- Sniffer stopped [ 3] 80.0-85.0 sec 40.6 MBytes 68.2 Mbits/sec 0.090 ms 0/41624 (0%) [ 3] 85.0-90.0 sec 40.6 MBytes 68.1 Mbits/sec 0.201 ms 7/41564 (0.017%) [ 3] 90.0-95.0 sec 40.7 MBytes 68.2 Mbits/sec 0.011 ms 0/41632 (0%) [ 3] 95.0-100.0 sec 40.4 MBytes 67.8 Mbits/sec 0.055 ms 0/41370 (0%) [ 3] 100.0-105.0 sec 40.6 MBytes 68.1 Mbits/sec 0.212 ms 0/41543 (0%) <-- No more packets
If needed, I saved the logs from the client-side iperf and the VBox.log.
follow-up: 49 comment:48 by , 12 years ago
Thank you for update. Looks like the reason why I couldn't reproduce the issue because of using iperf on the same host, so will modify my test environment to using remote iperf.
comment:49 by , 12 years ago
I've been able to reproduce the issue with UDP and have acceptable for UDP terms fix for it. now I need to reproduce this issue with TCP and find solution for it.
follow-up: 52 comment:51 by , 12 years ago
Could you please upgrade your VBox installation up to 4.1.18 and replace VBoxDD.dll under link. Please safe original VBoxDD.dll in the safe place. Note: this binary is expected to change behavior of UDP tests only.
comment:52 by , 12 years ago
Replying to Hachiman:
Could you please upgrade your VBox installation up to 4.1.18 and replace VBoxDD.dll under link. Please safe original VBoxDD.dll in the safe place. Note: this binary is expected to change behavior of UDP tests only.
I can't download the file I get error 404.
follow-up: 55 comment:54 by , 12 years ago
Alright so far the problem with UDP seems to be fixed. Would you please explain a little what was causing the problem to happen? Why didn't happen on Linux hosts? I need to come up better test cases for this issue. Also is the fix in the public svn yet? (r41812?)
Thanks
comment:55 by , 12 years ago
Replying to rbhkamal:
Alright so far the problem with UDP seems to be fixed. Would you please explain a little what was causing the problem to happen?
The problem with UDP was wrong handling of errors like EAGAIN, EWOULDBLOCK and so on, that leaded to sending ICMP_UNREACH/ICMP_UNREACH_NET error code to guest and closing the socket, while we should drop packet (at least for now) in such case , assuming for UDP case application do packet accounting in right way.
Why didn't happen on Linux hosts?
I suppose the difference is in multiplexing implementation, on Linux (and other Unixes) we're using poll and on Windows it's WSAWaitForMultipleEvents.
I need to come up better test cases for this issue.
I've looked through NAT's TCP stack implementation, and found that similar code presents at code, but perhaps it isn't correct. And I need good test case for such fix.
Also is the fix in the public svn yet? (r41812?)
not yet.
Thanks
comment:56 by , 12 years ago
Thanks Hachiman, I tried finding other ways to make the problem happen but so far the previous steps seem like the only way.
Here is what I tried so far:
1- Used a firewall on the host to block packets from the VM. I tried DROP,REJECT with tcp-reset and ICMP
2- Sent packets to a host that is not discoverable by ARP (causing windows to report an err with an ICMP-UNREACHABLE)
3- Sent packets to a host on the Internet while the GW is not discoerable by ARP.
4- I tired disabling all the interfaces, unplugging the Ethernet adapter and similar
I'm going to move on to different anomalies with NAT and try to find ways to reproduce them.
Thanks for fixing this annoying bug.
comment:57 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Confirmed as fixed, closing.
terminal server view after controlvm reset