VirtualBox

Ticket #7459 (reopened defect)

Opened 4 years ago

Last modified 22 months ago

NAT networking drops out after a few days' uptime -> fixed in svn.

Reported by: dimitris Owned by:
Priority: major Component: network/NAT
Version: VirtualBox 4.0.2 Keywords: NAT DHCP
Cc: Guest type: Linux
Host type: Linux

Description (last modified by Hachiman) (diff)

Both host and guest are Ubuntu Lucid servers, guest additions are up-to-date, guest uses a virtio adapter with NAT networking.

2-3 days after starting the guest, it loses connectivity to the outside world. These are the last few entries in its log:

31:39:39.715 NAT: adding domain name lan to search list
31:39:39.715 NAT: DHCP offered IP address 10.0.2.15
42:40:55.746 NAT: adding domain name lan to search list
42:40:55.746 NAT: DHCP offered IP address 10.0.2.15
51:08:57.554 NAT: error occurred while sending ICMP error message

The guest is running Logitech's Squeezebox Server, a perl-based music server. Ports for that have been forwarded (see machine XML) and traffic flows normally before the machine drops out.

log and machine XML to be attached.

Attachments

squeezemachine.xml Download (7.1 KB) - added by dimitris 4 years ago.
VBox.log Download (45.3 KB) - added by dimitris 4 years ago.
nodhcp.png Download (18.4 KB) - added by dimitris 4 years ago.
terminal server view after controlvm reset
ifconfig.png Download (16.5 KB) - added by dimitris 4 years ago.
ifconfig after machine loses connectivity
syslog.png Download (19.1 KB) - added by dimitris 4 years ago.
guest syslog after losing connectivity
VBox.2.log Download (58.0 KB) - added by dimitris 4 years ago.
log including the loss of connectivity and a guest reboot
dhcp.png Download (25.1 KB) - added by dimitris 4 years ago.
DHCP log messages using Intel virtual network adapter
ifconfig.2.png Download (16.5 KB) - added by dimitris 4 years ago.
ifconfig using Intel virtual network adapter
VBox.3.log Download (46.4 KB) - added by dimitris 4 years ago.
log using Intel virtual network adapter.
ubuntu-2012-05-21-14-37-24.log Download (92.5 KB) - added by rbhkamal 2 years ago.
vboxlog using vbox 4.1.14 on windows 7 32bit with ubuntu 32bit guest
udp.jpeg Download (14.5 KB) - added by rbhkamal 2 years ago.
Diagram of UDP packet "blast"
iperf-test-run-001.log Download (72.4 KB) - added by rbhkamal 2 years ago.
VBox.log while running iperf test
detailed-network.jpeg Download (16.9 KB) - added by rbhkamal 22 months ago.
A more detailed diagram of my simple test environment.

Change History

Changed 4 years ago by dimitris

Changed 4 years ago by dimitris

Changed 4 years ago by dimitris

terminal server view after controlvm reset

comment:1 Changed 4 years ago by dimitris

VBoxManage controlvm <machine> reset doesn't bring DHCP back. I have to power the machine off then back on to get DHCP working.

Changed 4 years ago by dimitris

ifconfig after machine loses connectivity

Changed 4 years ago by dimitris

guest syslog after losing connectivity

Changed 4 years ago by dimitris

log including the loss of connectivity and a guest reboot

comment:2 Changed 4 years ago by dimitris

Just happened again. Attached guest screenshots of its ifconfig and syslog after the guest lost connectivity but before I rebooted it (from within the guest itself through the terminal server).

After the guest is rebooted, DHCP is still not functioning, the same as shown in the previously attached screenshot (nodhcp.png).

Also attaching a log that covers the guest's reboot.

comment:3 Changed 4 years ago by Hachiman

Does it happen with other adapters or with virtio only?

comment:4 Changed 4 years ago by dimitris

Switched to

00:11.0 Ethernet controller: Intel Corporation 82545EM Gigabit Ethernet Controller (Copper) (rev 02)

Let's see how it does over the next few days.

Changed 4 years ago by dimitris

DHCP log messages using Intel virtual network adapter

Changed 4 years ago by dimitris

ifconfig using Intel virtual network adapter

Changed 4 years ago by dimitris

log using Intel virtual network adapter.

comment:5 Changed 4 years ago by dimitris

Same behavior with Intel virtual network adapter, see latest attachments:

51:09:51.358 NAT: error occurred while sending ICMP error message

comment:6 Changed 3 years ago by dimitris

No improvement or behavior change with 3.2.10.

comment:7 Changed 3 years ago by Hachiman

What kind of network applications are running on the guest?

comment:8 Changed 3 years ago by Hachiman

  • Version changed from VirtualBox 3.2.8 to VirtualBox 3.2.10

comment:9 Changed 3 years ago by dimitris

DHCP client, SSH server, the Perl-based Squeezebox Server and its associated MySQL server:

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:9092          0.0.0.0:*               LISTEN      744/mysqld      
tcp        0      0 0.0.0.0:9000            0.0.0.0:*               LISTEN      565/perl        
tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN      551/mysqld      
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      648/sshd        
tcp        0      0 0.0.0.0:3483            0.0.0.0:*               LISTEN      565/perl        
tcp        0      0 0.0.0.0:9090            0.0.0.0:*               LISTEN      565/perl        
udp        0      0 0.0.0.0:3483            0.0.0.0:*                           565/perl        
udp        0      0 0.0.0.0:50609           0.0.0.0:*                           565/perl        
udp        0      0 0.0.0.0:68              0.0.0.0:*                           586/dhclient3   

comment:10 Changed 3 years ago by dimitris

Here are the active connections. The one on local:3483 is the hardware player, and normally that's the only active TCP connection on this guest. Traffic is minimal when there's no music streaming (clock update on the player's idle screen is driven by the server). Streaming is a typical MP3 or Ogg stream, should be less than 200kbps.

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 10.0.2.15:22            10.0.2.2:47061          ESTABLISHED
tcp        0      0 10.0.2.15:3483          10.0.2.2:4718           ESTABLISHED

comment:11 Changed 3 years ago by Hachiman

I've installed a similar VM here; let's see if I'm able to reproduce the issue.

comment:12 Changed 3 years ago by Appiah

I have the same problem, but with a Windows 7 host and an XP guest.

It takes about 4 hours; the log says the same thing at the end: "NAT: error occurred while sending ICMP error message".

I have to shut down the guest; reset won't bring the network back.

comment:13 follow-up: ↓ 14 Changed 3 years ago by dimitris

Still here with 3.2.12.

comment:14 in reply to: ↑ 13 Changed 3 years ago by Hachiman

Replying to dimitris:

Still here with 3.2.12.

Thank you for the update.

comment:15 Changed 3 years ago by Hachiman

  • Version changed from VirtualBox 3.2.10 to VirtualBox 3.2.12

comment:16 follow-up: ↓ 17 Changed 3 years ago by dimitris

Any luck reproducing this? If not I could probably FTP the machine (XML and disk image) if that would help.

comment:17 in reply to: ↑ 16 Changed 3 years ago by Hachiman

Replying to dimitris:

Any luck reproducing this? If not I could probably FTP the machine (XML and disk image) if that would help.

Instructions were sent via mail.

comment:18 Changed 3 years ago by Hachiman

After almost three days I got the effect you're describing. I'll restart the test with more information collection, to learn what could lead to this effect. I'll also try to narrow the tests down so as to provoke the effect in a shorter period.

comment:19 Changed 3 years ago by Hachiman

The debug build, which I launched about a week ago, hit an unrelated issue, forcing me to recompile and restart it.

comment:20 Changed 3 years ago by Hachiman

Yesterday the debug build finally entered the state you described. It looks like the drops are caused by a huge number of requests to UDP.IP.dst:63.17.46.179 and ARP.TIP:63.17.46.179 from the NAT service side to the guest. In order to debug the issue, I'm trying to create a test case that provokes a similar issue in a smaller timeframe.

comment:21 Changed 3 years ago by Hachiman

  • Version changed from VirtualBox 3.2.12 to VirtualBox 4.0.2

comment:22 follow-up: ↓ 23 Changed 3 years ago by dimitris

Also upgraded to 4.0.2 here; the problem persists, with this additional info now in the logs:

...
09:03:50.329 NAT: DHCP offered IP address 10.0.2.15
19:34:59.297 NAT: DHCP offered IP address 10.0.2.15
30:17:58.870 NAT: DHCP offered IP address 10.0.2.15
41:06:21.336 NAT: DHCP offered IP address 10.0.2.15
51:06:45.355 NAT: error occurred while sending ICMP error message
51:06:45.416 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY
51:06:45.618 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY
51:06:46.028 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY
51:06:46.842 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY
51:06:48.475 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY
51:06:51.738 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY
51:06:58.266 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY
51:07:04.354 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY
51:07:11.323 virtio-net: failed to allocate SG buffer: size=1344 rc=VERR_NO_MEMORY
51:07:18.234 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY
51:07:19.235 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY
51:07:20.234 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY
51:07:37.438 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY
51:07:38.439 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY
51:07:39.440 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY
51:08:03.550 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY
51:08:04.355 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY
51:08:04.550 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY
51:08:05.550 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY
51:09:04.354 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY
51:10:04.354 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY
51:11:04.354 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY
51:12:04.355 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY
51:13:04.356 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY
51:14:04.356 virtio-net: failed to allocate SG buffer: size=58 rc=VERR_NO_MEMORY
51:14:26.453 virtio-net: failed to allocate SG buffer: size=42 rc=VERR_NO_MEMORY
...

Over here though I don't see any UDP traffic, albeit on a relatively short packet capture. What I do see is the TCP traffic between the Squeezebox Server and the hardware player, e.g.:

11:45:35.855991 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [P.], seq 167190:168480, ack 3173, win 5840, length 1290
11:45:35.857154 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [.], ack 168480, win 65535, length 0
11:45:36.854949 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [P.], seq 168480:169770, ack 3173, win 5840, length 1290
11:45:36.856447 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [.], ack 169770, win 65535, length 0
11:45:37.855543 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [P.], seq 169770:171060, ack 3173, win 5840, length 1290
11:45:37.856682 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [.], ack 171060, win 65535, length 0
11:45:38.001067 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [P.], seq 171060:171090, ack 3173, win 5840, length 30
11:45:38.002478 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [.], ack 171090, win 65535, length 0
11:45:38.004206 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [P.], seq 3173:3234, ack 171090, win 65535, length 61
11:45:38.004269 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [P.], seq 3234:3295, ack 171090, win 65535, length 61
11:45:38.006554 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [.], ack 3295, win 5840, length 0
11:45:38.854897 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [P.], seq 171090:172380, ack 3295, win 5840, length 1290
11:45:38.855901 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [.], ack 172380, win 65535, length 0
11:45:39.854914 IP 10.0.2.15.3483 > 10.0.2.2.38399: Flags [P.], seq 172380:173670, ack 3295, win 5840, length 1290
11:45:39.855924 IP 10.0.2.2.38399 > 10.0.2.15.3483: Flags [.], ack 173670, win 65535, length 0

The above is guest-side: The guest's eth0 is at 10.0.2.15, and the hardware player, on the other side of the VBox NAT, is mapped here as 10.0.2.2.

The traffic looks the same on the host side, just with different addresses. Which UDP port(s) do you see traffic on?

comment:23 in reply to: ↑ 22 Changed 3 years ago by Hachiman

Replying to dimitris:

Also upgraded to 4.0.2 here, problem persists, with this additional info now on the logs:

The traffic looks the same on the host side, just with different addresses. Which UDP port(s) do you see traffic on?

It's a lot of UDP broadcasts to udp/3483. I don't touch the web interface much, so TCP entries are rare in my dump; the major part is UDP broadcasts.

comment:24 follow-up: ↓ 25 Changed 3 years ago by dimitris

OK, on a longer capture I see the UDP traffic too:

11:11:39.001496 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 44)
    10.0.2.15.3483 > 255.255.255.255.3483: UDP, length 16

It's once a minute, and IIRC it's the server process advertising itself on the network so that players can discover it.
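
A minimal sketch of such a discovery announcer (hypothetical, for illustration only: the 16-byte payload, port 3483, and one-minute interval come from the capture above, while the function name and payload contents are made up):

```python
import socket
import time

def advertise(addr="255.255.255.255", port=3483, payload=b"\x00" * 16,
              interval=60.0, count=None):
    """Broadcast a discovery datagram every `interval` seconds.

    Mimics the once-a-minute server advertisement seen in the capture.
    Returns the number of datagrams sent (loops forever when count is None).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sent = 0
    try:
        while count is None or sent < count:
            sock.sendto(payload, (addr, port))
            sent += 1
            if count is None or sent < count:
                time.sleep(interval)
    finally:
        sock.close()
    return sent
```

Pointing addr at 127.0.0.1 with a short interval makes it easy to watch the datagrams locally without touching the real network.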

comment:25 in reply to: ↑ 24 Changed 3 years ago by Hachiman

Replying to dimitris:

It's once a minute, and IIRC it's the server process advertising itself on the network so that players can discover it.

Here, I've made a tool broadcasting UDP that causes a similar effect. But I'm not sure it's exactly, or the only, problem. I've fixed the issue and am now running the build to verify.

comment:26 Changed 3 years ago by Hachiman

  • Summary changed from NAT networking drops out after a few days' uptime to NAT networking drops out after a few days' uptime -> fixed in svn.

comment:27 Changed 3 years ago by frank

  • Status changed from new to closed
  • Resolution set to fixed

Should be fixed in 4.0.4.

comment:28 follow-up: ↓ 29 Changed 2 years ago by rbhkamal

  • Status changed from closed to reopened
  • Resolution fixed deleted

I've been seeing this issue on and off for a while now, and it is still present in 4.1.14.
With the help of a UDP packet blaster and the PCNetIII adapter, I was able to reproduce this problem every single time.

The steps:

  1. Create a VM with a NATed interface; use PCNetIII.
  2. Start a UDP packet blast with size 1400; the packets must go from the VM and get routed/NATed out to the physical NIC.
  3. After ~5 minutes, the PCNet adapter will bounce the link.
  4. Shortly after that, the NAT interface seems to be trying to send a DHCP ACK back to the PCNetIII adapter.
  5. Finally, the error "error occurred while sending ICMP error message" shows up and no packets can be passed after that. You won't even be able to ping the internal IP of the NAT interface (10.0.2.2).


Notes:

  • I also tried the same steps with the latest SVN and the problem is still there.
  • The Intel interfaces do not have the same issue because they do not bounce the NIC link.
  • See the attached log.
  • The same bug will happen in other cases, but this is the most reliable way so far.

Please let me know if you'd like to test a patch, I can build my own vbox and test it.
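
The steps above reference a packet blaster; a minimal stand-in might look like the following sketch (hypothetical: the target address and port are placeholders, the only requirement being that the traffic gets NATted out to the physical NIC, and the 1400-byte payload matches step 2):

```python
import socket
import time

def udp_blast(target_ip, port=5001, payload_size=1400, duration=60.0):
    """Send UDP datagrams to the target as fast as the socket allows.

    Returns the number of datagrams sent; the payload contents are
    irrelevant, only the sustained outbound packet rate matters here.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"\x00" * payload_size
    sent = 0
    deadline = time.monotonic() + duration
    try:
        while time.monotonic() < deadline:
            sock.sendto(payload, (target_ip, port))
            sent += 1
    finally:
        sock.close()
    return sent
```

Run inside the guest against a receiver outside the NAT (e.g. udp_blast("10.100.1.97")) while watching VBox.log for the ICMP error line.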

Changed 2 years ago by rbhkamal

vboxlog using vbox 4.1.14 on windows 7 32bit with ubuntu 32bit guest

comment:29 in reply to: ↑ 28 Changed 2 years ago by Hachiman

  • Description modified (diff)

Replying to rbhkamal:

I've been seeing this issue on and off for a while now and it is still in the 4.1.14.
With the help of a UDP packet blaster and the PCNetIII adapter I was able to reproduce this problem every single time.

Could you please clarify what you are doing, or what you mean by "UDP packet blast"?

Please let me know if you'd like to test a patch, I can build my own vbox and test it.

Thanks for cooperating. I'll check whether some disabled experimental code could possibly solve your issue, but first I'd like to understand what issue you've hit.

comment:30 follow-up: ↓ 31 Changed 2 years ago by rbhkamal

From within the guest OS, I ran a utility which generates UDP traffic at 700-850Mbps (UDP traffic blast). UDP here is not significant; the traffic can be TCP or even ICMP. I simply chose UDP because you won't have to set up a server to receive the packets.

See attached diagram.

I got a little more info since I reopened the bug. I was suspecting that a DHCP renewal from the guest, while under load, was causing the issue to happen. However, the bug happens with/without DHCP enabled on the guest.

[edit] corrected speeds

Last edited 2 years ago by rbhkamal (previous) (diff)

Changed 2 years ago by rbhkamal

Diagram of UDP packet "blast"

comment:31 in reply to: ↑ 30 Changed 2 years ago by Hachiman

Replying to rbhkamal:

From within the guest OS, I ran a utility which generates UDP traffic at 700-850Mbps (UDP traffic blast). UDP here is not significant; the traffic can be TCP or even ICMP. I simply chose UDP because you won't have to set up a server to receive the packets.

OK, what is this utility? And are you able to reproduce this issue with, say, iperf?

comment:32 Changed 2 years ago by rbhkamal

It is an in-house proprietary utility; however, I'm confident that iperf will have the same effect. I'll try it now.

The bug was initially discovered while FTPing large files over a 1Gbps line. After about 5 minutes, FTP would fail. FTPing over a 100Mbps line didn't have any problem at all. Sorry I didn't mention this in the beginning, but I was worried that you might think I was running into Ticket #10527.
So FTP by itself was causing the problem to happen (as long as it is a 700MB+ file and a 1Gbps line).

I'll update you soon with the iperf results.

comment:33 follow-up: ↓ 34 Changed 2 years ago by rbhkamal

Alright, I reproduced it with iperf. Same result: after about 80 seconds the NIC stops passing traffic. Please remember that this bug is reproducible ONLY when using the PCNetIII adapter (vs. Intel).

iperf client log:

------------------------------------------------------------
root@testing-VirtualBox:~# cat run
iperf -u -c 10.100.1.97 -i 5 -l 1400 -b 500M -t 300
root@testing-VirtualBox:~#
root@testing-VirtualBox:~# ./run 
Client connecting to 10.100.1.97, UDP port 5001
Sending 1400 byte datagrams
UDP buffer size:  160 KByte (default)
------------------------------------------------------------
[  3] local 10.0.2.15 port 56444 connected with 10.100.1.97 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 5.0 sec  41.4 MBytes  69.5 Mbits/sec
[  3]  5.0-10.0 sec  41.4 MBytes  69.5 Mbits/sec
[  3] 10.0-15.0 sec  41.1 MBytes  69.0 Mbits/sec
[  3] 15.0-20.0 sec  40.9 MBytes  68.7 Mbits/sec
[  3] 20.0-25.0 sec  40.7 MBytes  68.3 Mbits/sec
[  3] 25.0-30.0 sec  40.4 MBytes  67.7 Mbits/sec
[  3] 30.0-35.0 sec  40.4 MBytes  67.7 Mbits/sec
[  3] 35.0-40.0 sec  40.2 MBytes  67.4 Mbits/sec
[  3] 40.0-45.0 sec  40.4 MBytes  67.8 Mbits/sec
[  3] 45.0-50.0 sec  40.3 MBytes  67.7 Mbits/sec
[  3] 50.0-55.0 sec  40.3 MBytes  67.6 Mbits/sec
[  3] 55.0-60.0 sec  39.6 MBytes  66.5 Mbits/sec
[  3] 60.0-65.0 sec  52.8 MBytes  88.6 Mbits/sec
[  3] 65.0-70.0 sec   476 KBytes   780 Kbits/sec
[  3] 70.0-75.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 75.0-80.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 80.0-85.0 sec  0.00 Bytes  0.00 bits/sec



iperf server log:

------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size:  112 KByte (default)
------------------------------------------------------------
[  3] local 10.100.1.97 port 5001 connected with 10.100.1.77 port 59842
[ ID] Interval       Transfer     Bandwidth        Jitter   Lost/Total Datagrams
[  3]  0.0- 0.5 sec  3.94 MBytes  66.1 Mbits/sec   0.122 ms   17/ 2969 (0.57%)
[  3]  0.5- 1.0 sec  4.10 MBytes  68.8 Mbits/sec   0.015 ms   78/ 3150 (2.5%)
[  3]  1.0- 1.5 sec  3.96 MBytes  66.5 Mbits/sec   0.118 ms   74/ 3043 (2.4%)
[  3]  1.5- 2.0 sec  4.06 MBytes  68.2 Mbits/sec   0.010 ms   80/ 3123 (2.6%)
[  3]  2.0- 2.5 sec  4.06 MBytes  68.1 Mbits/sec   0.009 ms   66/ 3104 (2.1%)
[  3]  2.5- 3.0 sec  4.09 MBytes  68.7 Mbits/sec   0.094 ms   77/ 3144 (2.4%)
[  3]  3.0- 3.5 sec  4.05 MBytes  68.0 Mbits/sec   0.119 ms   82/ 3117 (2.6%)
[  3]  3.5- 4.0 sec  4.11 MBytes  69.0 Mbits/sec   0.026 ms   63/ 3142 (2%)
.
. Skipping
.
[  3] 59.0-59.5 sec  4.02 MBytes  67.4 Mbits/sec   0.010 ms   63/ 3074 (2%)
[  3] 59.5-60.0 sec  3.93 MBytes  65.9 Mbits/sec   0.010 ms   53/ 2993 (1.8%)
[  3] 60.0-60.5 sec  3.45 MBytes  57.9 Mbits/sec   0.009 ms   57/ 2642 (2.2%)
[  3] 60.5-61.0 sec  5.29 MBytes  88.8 Mbits/sec   0.035 ms   14/ 3978 (0.35%)
[  3] 61.0-61.5 sec  5.40 MBytes  90.5 Mbits/sec   0.026 ms   16/ 4058 (0.39%)
[  3] 61.5-62.0 sec  5.41 MBytes  90.7 Mbits/sec   0.023 ms    0/ 4051 (0%)
[  3] 62.0-62.5 sec  5.51 MBytes  92.4 Mbits/sec   0.183 ms    0/ 4125 (0%)
[  3] 62.5-63.0 sec  5.56 MBytes  93.4 Mbits/sec   0.100 ms    0/ 4168 (0%)
[  3] 63.0-63.5 sec  5.58 MBytes  93.7 Mbits/sec   0.060 ms    0/ 4183 (0%)
[  3] 63.5-64.0 sec  5.65 MBytes  94.9 Mbits/sec   0.008 ms    0/ 4235 (0%)
[  3] 64.0-64.5 sec  5.64 MBytes  94.7 Mbits/sec   0.191 ms    0/ 4226 (0%)
[  3] 64.5-65.0 sec  5.27 MBytes  88.4 Mbits/sec   0.052 ms    0/ 3946 (0%)


Also I attached the vbox.log from that test run.

Changed 2 years ago by rbhkamal

VBox.log while running iperf test

comment:34 in reply to: ↑ 33 Changed 2 years ago by Hachiman

Replying to rbhkamal: Thank you for the investigation. I will try to reproduce the issue locally.

comment:35 Changed 23 months ago by Hachiman

Meanwhile I've tried to reproduce the issue, but I used a Mac host (Windows was in use for other tests) and wasn't able to reproduce it. Could you please repeat the same test with a bridged network attachment?

comment:36 follow-up: ↓ 37 Changed 23 months ago by Hachiman

Response from rbhkamal, received via mail notification:

My account (rbhkamal) got locked so I hope you get my response here:
To reproduce the issue you *must* use the PCNetIII network with NAT, I was able to reproduce it every single time with vbox 3.1.16.

If I switch it to E1000, the problem disappears. I suspect bridge networking will not have this problem as well.

comment:37 in reply to: ↑ 36 Changed 23 months ago by Hachiman

Replying to rbhkamal:

response from rbhkamal I've received via mail notification

My account (rbhkamal) got locked so I hope you get my response here: To reproduce the issue you *must* use the PCNetIII network with NAT, I was able to reproduce it every single time with vbox 3.1.16.

I've tried this test on trunk with pcnet and e1000. Could you please try with 4.0.16 or 4.1.16?

If I switch it to E1000, the problem disappears. I suspect bridge networking will not have this problem as well.

comment:38 Changed 23 months ago by rbhkamal

Sorry for the late response; I had to redo the setup. I was able to reproduce the problem with 4.1.16 on a Windows 32-bit host and a Linux 32-bit guest (Ubuntu 12.04).

If you like, I could also try it on my MBP laptop. But overall, I'm able to reproduce the problem 100% of the time.
Previously, when I saw this issue, it would show up randomly, as in the original steps of this bug. But if you follow my steps exactly, you should be able to reproduce it consistently.

For how long did you run the blast, and how much bandwidth were you able to get?
I've done my testing on a Dell Latitude D830 (Intel Core 2 Duo). On that machine, I ran the blast for 60-120 seconds before the problem occurred, and I was able to get around 70Mbps with 2% packet loss.

Version 0, edited 23 months ago by rbhkamal (next)

comment:39 follow-up: ↓ 40 Changed 23 months ago by rbhkamal

Also make sure that you use UDP with iperf, because TCP will throttle back once you start losing packets.
I *may* be able to give you full access to my test environment, but it will take me a while to isolate it from the rest of the network. Let me know if you want it.

comment:40 in reply to: ↑ 39 Changed 23 months ago by Hachiman

Replying to rbhkamal:

Also make sure that you use UDP with iperf, because TCP will throttle back once you start losing packets.

I'll try this on a Windows machine tomorrow; it looks like a Windows-specific issue.

I *may* be able to give you full access to my test environment, but it will take me a while to isolate it from the rest of the network. Let me know if you want it.

I'd hoped that wouldn't be necessary. I think the next step would be this: I'll send you a build of VBoxDD.dll with logging enabled, in case I'm not able to reproduce the issue.

comment:41 Changed 22 months ago by frank

rbhkamal replied offline that he can only reproduce this problem using the PCnet adapter with NAT.

comment:42 Changed 22 months ago by rbhkamal

Actually, two days ago a coworker of mine was able to consistently reproduce the exact same issue with E1000 (Server T OEM) using VBox 4.1.16. He then ran the same exact test using Linux instead, and he wasn't able to get the issue to happen.

This issue is fairly easy to reproduce on a Windows machine using a ~20Mbps packet stream (outbound from the VM). But I believe the bug is generic to all platforms, and that Windows just happens to aggravate the problem with its "standard" behavior, as always.

I'm going to reproduce the problem again on Windows and switch the same machine to Linux and see if I can reproduce the issue.

Thanks,
Ribhi

comment:43 Changed 22 months ago by Hachiman

Sorry that I didn't update the ticket before leaving on vacation. IIRC I wasn't able to reproduce the issue on my Windows machine here; I will retry with fresh eyes.

comment:44 follow-up: ↓ 45 Changed 22 months ago by rbhkamal

Hope you had a great time :)

Back to work: are you sniffing the traffic while trying to reproduce this issue? So far I haven't been able to reproduce the bug while sniffing at the same time. I tried sniffing from the host and inside the guest, and in both cases the issue stopped happening. However, I was able to sniff on the server side of iperf, which runs on a separate machine. But there was nothing interesting there: some dropped packets, but that is all.

I noticed something very strange yesterday that only happens while using NAT (with any driver). Instead of using iperf to blast packets, I just used ping like this:

ping -i 0 -s 1024 10.100.1.123

And I ended up getting a lot of duplicate ICMP responses, so I stopped the ping and started a normal ping:

ping 10.100.1.123

After that I got the same duplicate ICMP replies at the same rate as if I were still doing "-i 0". So I stopped the ping and started it again, and only then did I start getting correct pings. It was as if a queue wasn't getting flushed correctly.

One last thing to add to the bug: when reproducing it with the PCNetIII adapter, I get the following stack trace from the guest:

Jun 14 20:16:02 TESTING kernel: [    8.680836] Console: switching to colour frame buffer device 80x30
Jun 14 20:16:02 TESTING kernel: [    8.686448] fb0: VESA VGA frame buffer device
Jun 14 20:16:02 TESTING kernel: [   17.568649] eth4: no IPv6 routers present
Jun 14 20:23:41 TESTING kernel: [  524.016173] ------------[ cut here ]------------
Jun 14 20:23:41 TESTING kernel: [  524.016221] WARNING: at /build/buildd/linux-3.0.0/net/sched/sch_generic.c:255 dev_watchdog+0x1e6/0x1f0()
Jun 14 20:23:41 TESTING kernel: [  524.016231] Hardware name: VirtualBox
Jun 14 20:23:41 TESTING kernel: [  524.016234] NETDEV WATCHDOG: eth4 (pcnet32): transmit queue 0 timed out
Jun 14 20:23:41 TESTING kernel: [  524.016237] Modules linked in: vesafb ppdev joydev psmouse serio_raw parport_pc parport i2c_piix4 usbhid hid ahci libahci pcnet32
Jun 14 20:23:41 TESTING kernel: [  524.016279] Pid: 0, comm: swapper Not tainted 3.0.0-20-generic #34-Ubuntu
Jun 14 20:23:41 TESTING kernel: [  524.016282] Call Trace:
Jun 14 20:23:41 TESTING kernel: [  524.016316]  [<c1047e32>] warn_slowpath_common+0x72/0xa0
Jun 14 20:23:41 TESTING kernel: [  524.016316]  [<c1455fe6>] ? dev_watchdog+0x1e6/0x1f0
Jun 14 20:23:41 TESTING kernel: [  524.016539]  [<c1455fe6>] ? dev_watchdog+0x1e6/0x1f0
Jun 14 20:23:41 TESTING kernel: [  524.016543]  [<c1047f03>] warn_slowpath_fmt+0x33/0x40
Jun 14 20:23:41 TESTING kernel: [  524.016547]  [<c1455fe6>] dev_watchdog+0x1e6/0x1f0
Jun 14 20:23:41 TESTING kernel: [  524.016552]  [<c1054d8f>] call_timer_fn+0x2f/0x110
Jun 14 20:23:41 TESTING kernel: [  524.016556]  [<c1455e00>] ? qdisc_reset+0x40/0x40
Jun 14 20:23:41 TESTING kernel: [  524.016560]  [<c105648b>] run_timer_softirq+0xeb/0x210
Jun 14 20:23:41 TESTING kernel: [  524.016564]  [<c1455e00>] ? qdisc_reset+0x40/0x40
Jun 14 20:23:41 TESTING kernel: [  524.016568]  [<c104ea60>] ? local_bh_enable_ip+0x90/0x90
Jun 14 20:23:41 TESTING kernel: [  524.016572]  [<c104eae1>] __do_softirq+0x81/0x1a0
Jun 14 20:23:41 TESTING kernel: [  524.016576]  [<c104ea60>] ? local_bh_enable_ip+0x90/0x90
Jun 14 20:23:41 TESTING kernel: [  524.016578]  <IRQ>  [<c104ee26>] ? irq_exit+0x76/0xa0
Jun 14 20:23:41 TESTING kernel: [  524.016586]  [<c1537d89>] ? smp_apic_timer_interrupt+0x59/0x88
Jun 14 20:23:41 TESTING kernel: [  524.016591]  [<c106d07f>] ? sched_clock_cpu+0xcf/0x150
Jun 14 20:23:41 TESTING kernel: [  524.016595]  [<c1530d89>] ? apic_timer_interrupt+0x31/0x38
Jun 14 20:23:41 TESTING kernel: [  524.016601]  [<c10259da>] ? native_safe_halt+0xa/0x10
Jun 14 20:23:41 TESTING kernel: [  524.016606]  [<c100971a>] ? default_idle.part.5+0x4a/0x190
Jun 14 20:23:41 TESTING kernel: [  524.016610]  [<c100987f>] ? default_idle+0x1f/0x40
Jun 14 20:23:41 TESTING kernel: [  524.016613]  [<c100175a>] ? cpu_idle+0x8a/0xc0
Jun 14 20:23:41 TESTING kernel: [  524.016618]  [<c15036d5>] ? rest_init+0x5d/0x68
Jun 14 20:23:41 TESTING kernel: [  524.016633]  [<c17c3763>] ? start_kernel+0x352/0x358
Jun 14 20:23:41 TESTING kernel: [  524.016638]  [<c17c33a2>] ? pass_bootoption.constprop.2+0xe2/0xe2
Jun 14 20:23:41 TESTING kernel: [  524.016642]  [<c17c30a9>] ? i386_start_kernel+0xa9/0xaf
Jun 14 20:23:41 TESTING kernel: [  524.016645] ---[ end trace 74ccf6cac077ffb8 ]---
Jun 14 20:23:41 TESTING kernel: [  524.016813] pcnet32: eth4: transmit timed out, status 0073, resetting
Jun 14 20:23:51 TESTING kernel: [  534.016669] pcnet32: eth4: transmit timed out, status 0073, resetting
Jun 14 20:24:01 TESTING kernel: [  544.016609] pcnet32: eth4: transmit timed out, status 0073, resetting
Jun 14 20:24:11 TESTING kernel: [  554.016295] pcnet32: eth4: transmit timed out, status 0073, resetting
Jun 14 20:24:21 TESTING kernel: [  564.016630] pcnet32: eth4: transmit timed out, status 0073, resetting


I'll check my coworker's machine to see if he has a similar stack trace with the E1000 driver.

Changed 22 months ago by rbhkamal

A more detailed diagram of my simple test environment.

comment:45 in reply to: ↑ 44 ; follow-up: ↓ 46 Changed 22 months ago by Hachiman

Replying to rbhkamal:

Hope you had a great time :)

Yes, that was great, thank you.

Back to work, are you sniffing the traffic while trying to reproduce this issue? So far I haven't been able to reproduce the bug if I was sniffing at the same time. I tried sniffing from the host and inside the guest and in both cases the issue stopped happening. However, I was able to sniff on the server side of iperf, which is running on a separate machine. But there was nothing interesting, some dropped packets but that is all.

That smells like a time-sensitive issue. Grrr.


I noticed something very strange yesterday that only happens while using NAT (with any driver). Instead of using iperf to blast packets, I just used ping like this

Unfortunately, ICMP is not a very good friend for detecting network issues on a Windows host: we're using the ICMP API there rather than dealing with sockets directly. So it would be better to return to the iperf scenario.

Regarding your test diagram: is it intentional to run iperf in server mode on a different machine, or is the same result achievable if iperf runs on the host itself? And if so, does it make any difference for you whether you use -c 10.0.2.2 or -c <ip of your host>?

comment:46 in reply to: ↑ 45 Changed 22 months ago by rbhkamal

Replying to Hachiman:

Regarding your test diagram: is it intentional to run iperf in server mode on a different machine, or is the same result achievable if iperf runs on the host itself?

Because my host is a Windows machine and iperf was originally a Linux-only app, I didn't want to bother with the iperf Windows ports. I'll try that now.

And if so, does it make any difference for you whether you use -c 10.0.2.2 or -c <ip of your host>?

I haven't tried that either. I'll do that now too.

comment:47 Changed 22 months ago by rbhkamal

This is taking longer than expected but here is what I have so far.
1- When iperf runs in server mode on the host OS, the problem does NOT happen.
2- When blasting the NATted port from the guest (10.0.2.15 to 10.0.2.2), the problem does NOT happen.
3- When I removed the switch and used a straight-through cable, the problem *STILL* happens.

However, I found something interesting: the bug only happens if I can push the traffic to ~64 Mbps. In cases #1 and #2, the speed never exceeded ~50 Mbps, so I'm worried the bug may still happen there. I'll let the test run overnight and update the ticket if it does.

Also, while sniffing, the speed is limited to ~49 Mbps. Below is the result from the server: from second 0 to 70 the sniffer was running; at the 75-second mark I stopped the sniffer, the speed increased by about 20 Mbps, and then the bug happened. The CPU was never near 90% during any of my testing.

Here is the log from the server (without the switch in the middle), notice how the speed goes up then the bug happens.

------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size:  112 KByte (default)
------------------------------------------------------------
[  3] local 10.100.1.97 port 5001 connected with 10.100.1.222 port 53482
[ ID] Interval       Transfer     Bandwidth        Jitter   Lost/Total Datagrams
[  3]  0.0- 5.0 sec  29.5 MBytes  49.4 Mbits/sec   0.008 ms  600/30777 (1.9%)
[  3]  5.0-10.0 sec  29.5 MBytes  49.5 Mbits/sec   0.129 ms  573/30779 (1.9%)
[  3] 10.0-15.0 sec  29.5 MBytes  49.4 Mbits/sec   0.019 ms  595/30774 (1.9%)
[  3] 15.0-20.0 sec  29.3 MBytes  49.2 Mbits/sec   0.068 ms  544/30582 (1.8%)
[  3] 20.0-25.0 sec  29.3 MBytes  49.2 Mbits/sec   0.018 ms  557/30591 (1.8%)
[  3] 25.0-30.0 sec  29.5 MBytes  49.6 Mbits/sec   0.016 ms  590/30833 (1.9%)
[  3] 30.0-35.0 sec  29.4 MBytes  49.3 Mbits/sec   0.056 ms  604/30692 (2%)
[  3] 35.0-40.0 sec  29.6 MBytes  49.7 Mbits/sec   0.009 ms  619/30927 (2%)
[  3] 40.0-45.0 sec  29.5 MBytes  49.5 Mbits/sec   0.127 ms  570/30797 (1.9%)
[  3] 45.0-50.0 sec  29.5 MBytes  49.5 Mbits/sec   0.029 ms  584/30792 (1.9%)
[  3] 50.0-55.0 sec  29.5 MBytes  49.4 Mbits/sec   0.106 ms  573/30740 (1.9%)
[  3] 55.0-60.0 sec  29.6 MBytes  49.7 Mbits/sec   0.017 ms  652/30999 (2.1%)
[  3] 60.0-65.0 sec  29.6 MBytes  49.6 Mbits/sec   0.023 ms  559/30846 (1.8%)
[  3] 65.0-70.0 sec  29.5 MBytes  49.6 Mbits/sec   0.009 ms  539/30787 (1.8%)
[  3] 70.0-75.0 sec  37.9 MBytes  63.6 Mbits/sec   0.062 ms  159/38960 (0.41%)
[  3] 75.0-80.0 sec  40.7 MBytes  68.2 Mbits/sec   0.018 ms    0/41655 (0%) <-- Sniffer stopped
[  3] 80.0-85.0 sec  40.6 MBytes  68.2 Mbits/sec   0.090 ms    0/41624 (0%)
[  3] 85.0-90.0 sec  40.6 MBytes  68.1 Mbits/sec   0.201 ms    7/41564 (0.017%)
[  3] 90.0-95.0 sec  40.7 MBytes  68.2 Mbits/sec   0.011 ms    0/41632 (0%)
[  3] 95.0-100.0 sec  40.4 MBytes  67.8 Mbits/sec   0.055 ms    0/41370 (0%)
[  3] 100.0-105.0 sec  40.6 MBytes  68.1 Mbits/sec   0.212 ms    0/41543 (0%) <-- No more packets

If needed, I saved the logs from the client-side iperf and the VBox.log.

comment:48 follow-up: ↓ 49 Changed 22 months ago by Hachiman

Thank you for the update. It looks like the reason I couldn't reproduce the issue was that I ran iperf on the same host, so I will modify my test environment to use a remote iperf.

comment:49 in reply to: ↑ 48 Changed 22 months ago by Hachiman

I've been able to reproduce the issue with UDP and have a fix for it that is acceptable in UDP terms. Now I need to reproduce the issue with TCP and find a solution for it.

comment:50 Changed 22 months ago by rbhkamal

Great news! Let me know if you need any testing done.

comment:51 follow-up: ↓ 52 Changed 22 months ago by Hachiman

Could you please upgrade your VBox installation to 4.1.18 and replace VBoxDD.dll with the one under the link? Please keep the original VBoxDD.dll in a safe place. Note: this binary is expected to change the behavior of UDP tests only.

comment:52 in reply to: ↑ 51 Changed 22 months ago by rbhkamal

Replying to Hachiman:

Could you please upgrade your VBox installation to 4.1.18 and replace VBoxDD.dll with the one under the link? Please keep the original VBoxDD.dll in a safe place. Note: this binary is expected to change the behavior of UDP tests only.

I can't download the file; I get a 404 error.

comment:53 Changed 22 months ago by frank

rbhkamal, fixed, please try again.

comment:54 follow-up: ↓ 55 Changed 22 months ago by rbhkamal

Alright, so far the problem with UDP seems to be fixed. Would you please explain a little about what was causing it? Why didn't it happen on Linux hosts? I need to come up with better test cases for this issue. Also, is the fix in the public svn yet? (r41812?)

Thanks

comment:55 in reply to: ↑ 54 Changed 22 months ago by Hachiman

Replying to rbhkamal:

Alright, so far the problem with UDP seems to be fixed. Would you please explain a little about what was causing it?

The problem with UDP was incorrect handling of errors like EAGAIN, EWOULDBLOCK and so on, which led to sending an ICMP_UNREACH/ICMP_UNREACH_NET error code to the guest and closing the socket. Instead, we should (at least for now) just drop the packet in such cases, on the assumption that for UDP the application does its own packet accounting.

Why didn't it happen on Linux hosts?

I suppose the difference is in the multiplexing implementation: on Linux (and other Unixes) we're using poll, while on Windows it's WSAWaitForMultipleEvents.

I need to come up with better test cases for this issue.

I've looked through the NAT TCP stack implementation and found that similar code is present there, but perhaps it isn't correct. I need a good test case for such a fix.

Also, is the fix in the public svn yet? (r41812?)

Not yet.

Thanks

comment:56 Changed 22 months ago by rbhkamal

Thanks Hachiman. I tried to find other ways to make the problem happen, but so far the previous steps seem like the only way.

Here is what I tried so far:
1- Used a firewall on the host to block packets from the VM. I tried DROP, and REJECT with tcp-reset and with ICMP.
2- Sent packets to a host that is not discoverable by ARP (causing Windows to report an error with ICMP unreachable).
3- Sent packets to a host on the Internet while the gateway is not discoverable by ARP.
4- Tried disabling all the interfaces, unplugging the Ethernet adapter, and similar.

I'm going to move on to different anomalies with NAT and try to find ways to reproduce them.

Thanks for fixing this annoying bug.
