VirtualBox

Opened 7 years ago

Last modified 7 years ago

#16128 new defect

Client loses TCP/UDP connectivity, VirtualBox on host uses 100% CPU

Reported by: Roman Trunov
Owned by:
Component: network/NAT
Version: VirtualBox 5.1.8
Keywords:
Cc:
Guest type: Linux
Host type: Linux

Description

I have a Debian guest server running on an Ubuntu host (both 64-bit). Networking mode was set to NAT, and port forwarding was configured for guest ports 22 and 80 to control and use this server.

It worked fine during low-load testing, but when it went into production with some load (1-2 requests per second from around the world), the following issue appeared:

  • It happens after some time, usually after 2-3 days, but the last time it took only 7 hours.
  • CPU usage of the VBoxHeadless process on the host jumps to 100% and stays there, although there is no load on the guest.
  • The guest system continues to work, but it can neither accept nor create new TCP connections (existing TCP connections, e.g. an earlier-opened ssh session, continue to work).
  • An outgoing TCP connection from the guest (e.g. wget) fails with the error "Network is unreachable". This error is a clear sign that something bad has happened.
  • DNS resolution on the guest fails, so it seems that UDP is affected too.
  • Pings from the guest are working! So I consider the network, routing, etc. to be OK, and the bug to be in the NAT code.

Of course, there are no messages in any log when it happens.

I think the symptoms are quite similar to bug #15223, but I don't know whether they could have a common cause or not. I first discovered this issue on an old 4.3.x build, then upgraded to 4.3.40 and then to 5.1.8, but the issue is still here.

Attachments (3)

VBox.log.1 (99.1 KB ) - added by Roman Trunov 7 years ago.
Log of one of failed sessions
debian-8-boinc-server.vbox (4.9 KB ) - added by Roman Trunov 7 years ago.
VM config
20200620_095436.jpg (129.3 KB ) - added by Kral kaplan 4 years ago.


Change History (9)

by Roman Trunov, 7 years ago

Attachment: VBox.log.1 added

Log of one of failed sessions

by Roman Trunov, 7 years ago

Attachment: debian-8-boinc-server.vbox added

VM config

comment:1 by Socratis, 7 years ago

Please take a look at #16084, #16095, #16103, #16113 and #16126 (the last 4 ones being duplicates of the 1st one). Also in https://forums.virtualbox.org/viewtopic.php?f=6&t=80310

One easy way to see if it is a duplicate of #16084 is to downgrade to 5.1.6. You shouldn't have the problem with that build.

in reply to:  1 comment:2 by Roman Trunov, 7 years ago

Replying to socratis:

Please take a look at #16084, #16095, #16103, #16113 and #16126 (the last 4 ones being duplicates of the 1st one). Also in https://forums.virtualbox.org/viewtopic.php?f=6&t=80310

Highly unlikely; those show completely different symptoms. In my case, existing connections keep working, but no new connection can be initiated at all. There is also 100% CPU usage by VirtualBox on the host. Exactly like in bug #15223:

vushakov wrote in bug #15223 8 months ago:

Thanks! I have reproduced the problem now. My guess is that some external client connects to VM's http server via port-forwarding, requests a lot of data, half-closes its tx side and/then aborts the transfer. Or something along this lines.

Same usage scenario (an external client connects to the VM's http server via port forwarding; there could well be connection aborts/errors, since requests come from all over the world), same symptoms. This is my first suspect.

comment:3 by Roman Trunov, 7 years ago

Reading the latest comments on #14748, I decided to run netstat on the host - and yes, there they are: HUNDREDS of sockets in the FIN_WAIT2 state, belonging to the VirtualBox process, and their number is growing:

tcp  0  0  192.168.0.xx:30080  xx.xxx.xxx.xxx:xxxxx  FIN_WAIT2  6937/VBoxHeadless
$ sudo netstat -tulpan | grep VBox | grep WAIT2 | wc -l
564

After a few hours:

$ sudo netstat -tulpan | grep VBox | grep WAIT2 | wc -l
902

And I suppose that as soon as it hits 1024, I'll have a problem.

There are no extra open sockets in the guest.

Is close() missing on some path in the vbox port forwarder code?

comment:4 by Roman Trunov, 7 years ago

I think I understand this bug now. Here is the sequence of events.

  • At some point, the port forwarder decides to call shutdown() on the outer socket. I don't know why it does this, so let's assume it has some reason to use shutdown() instead of completely closing the connection.
  • After shutdown(), the outer socket becomes half-closed and quickly advances to the FIN_WAIT2 state. In this state, it can either still receive some data or get a FIN and become completely closed.
  • Meanwhile, the inner socket (from the VM) is completely closed (it was the web server that sent a page and fully closed the socket), but the port forwarder seems to fail to notice this.
  • Although there is nobody on the inner end anymore and the port forwarder should shut the whole connection down, VirtualBox keeps the outer socket open for a reason unknown to me.

In the simplest case, the reason the socket is kept open may be a trivial bug: the disconnect of the inner VM socket was noticed, but the port forwarder code forgot to close() the outer socket, causing a leak of file descriptors.
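
A minimal sketch of that trivial-bug hypothesis (this is not the actual VirtualBox code; struct proxy_conn, on_guest_eof and on_guest_closed are invented names for illustration only): if the forwarder half-closes the outer socket when the guest stops sending but never fully close()s it once the guest connection is gone, the descriptor leaks and the host-side socket stays parked in FIN_WAIT2.

/* Hypothetical illustration of the suspected leak, NOT VirtualBox code. */
#include <sys/socket.h>
#include <unistd.h>

struct proxy_conn {
    int outer_fd;       /* socket to the external client                    */
    int guest_closed;   /* guest side of the forwarded connection is gone   */
};

static void on_guest_eof(struct proxy_conn *c)
{
    /* Guest stopped sending: propagate EOF to the external client.
     * After this the host socket moves through FIN_WAIT_1 to FIN_WAIT_2. */
    shutdown(c->outer_fd, SHUT_WR);
}

static void on_guest_closed(struct proxy_conn *c)
{
    c->guest_closed = 1;
    /* Suspected bug: nobody ever does
     *     close(c->outer_fd);
     * so the descriptor leaks, and the kernel cannot reap the stuck
     * FIN_WAIT2 state because the socket is never orphaned. */
}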

I don't know exactly how the inner part of the port forwarder works, so let's assume a more complex scenario, in which it is not possible to reliably detect the complete close of the inner/guest socket in some cases.

  • Remember that the half-closed socket is waiting for a FIN? The FIN never arrives. This is the key thing that triggers the bug. It happens often enough in my setup to leak 1024 handles in a few hours; there is packet loss or a buggy NAT somewhere on the path. Alas, this is the real world.
  • VirtualBox will keep the socket open forever. I don't know whether it has simply "forgotten" about the socket or is blocked on a read(), but such a read() will never complete; the remote client is long gone and the FIN is lost, so read() will never return.
  • Since the socket is kept open, the Linux kernel cannot clean up the stuck FIN_WAIT2 state the way it does for closed ("orphaned") sockets.

In this scenario (the full close of the inner socket cannot be detected), the port forwarder should implement the same logic the Linux kernel uses for sockets stuck in the FIN_WAIT2 state. When the port forwarder calls shutdown() and the outer socket becomes half-closed, the port forwarder should start a 60-second inactivity timeout (the default value of Linux tcp_fin_timeout). If no data is received within this period and the socket is still open, in most cases it means that the FIN was lost, and the connection should be forcibly closed. That is better than "DDoSing" the system with leaked sockets and file handles.
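
A rough sketch of that proposed timeout, assuming the forwarder has a poll loop it can hang a periodic check on (struct fwd_conn and all function names below are assumptions for illustration, not the real VirtualBox NAT interfaces): arm a 60-second deadline when the outer socket is half-closed, refresh it on any inbound data, and force-close the socket once the deadline passes.

/* Sketch of a FIN_WAIT2-style inactivity timeout; names and the event-loop
 * hooks are invented, not VirtualBox's actual code. */
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define HALF_CLOSE_TIMEOUT 60   /* seconds, mirrors the Linux tcp_fin_timeout default */

struct fwd_conn {
    int    outer_fd;
    time_t half_close_deadline;  /* 0 = outer socket not half-closed */
};

static void half_close_outer(struct fwd_conn *c)
{
    shutdown(c->outer_fd, SHUT_WR);
    c->half_close_deadline = time(NULL) + HALF_CLOSE_TIMEOUT;
}

static void on_outer_data(struct fwd_conn *c)
{
    /* Any inbound data proves the peer is still alive; push the deadline. */
    if (c->half_close_deadline != 0)
        c->half_close_deadline = time(NULL) + HALF_CLOSE_TIMEOUT;
}

/* Called periodically from the forwarder's poll loop. */
static void reap_stuck_conn(struct fwd_conn *c)
{
    if (c->half_close_deadline != 0 && time(NULL) >= c->half_close_deadline) {
        close(c->outer_fd);      /* FIN most likely lost; reclaim the descriptor */
        c->outer_fd = -1;
        c->half_close_deadline = 0;
    }
}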

comment:5 by Valery Ushakov, 7 years ago

Do you still see the 100% CPU usage with a recent test build?

Re stuck connections. Half-open connections are pretty normal (the classic example is rsh half-closing its tx side on reading local EOF), so shutdown() is correct.

Linux tcp_fin_timeout and TCP_LINGER2 only affect orphaned sockets, i.e. those that the application has closed and thus dissociated itself from. The TCP connection is still technically half-open, but there is no one left to read any inbound data that might still arrive. All those sockets on the host are not killed by that timeout since they are not orphaned. Unfortunately, NAT cannot immediately tell whether the guest connection is still alive. Maybe we should use keepalive between the NAT and the guest to verify that the guest hasn't yet killed the connection.
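
For reference, here is what TCP keepalive probing looks like with the standard Linux socket options. This is only a sketch of the mechanism being suggested: in VirtualBox the NAT-to-guest leg is handled by the internal NAT TCP stack rather than a host kernel socket, so any real fix would live there, and the helper name and timing values below are arbitrary.

/* Illustration of TCP keepalive probing on a Linux kernel socket. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int enable_keepalive(int fd)
{
    int on = 1, idle = 60, intvl = 10, cnt = 3;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    /* Start probing after 60s of silence, probe every 10s, and give up after
     * 3 unanswered probes, so a dead peer is detected in roughly 90 seconds. */
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof(cnt));
    return 0;
}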

Would it be possible to look at packet traces, both host-side and guest-side, that demonstrate the problem?

comment:6 by Frank Mehnert, 7 years ago

... or better test with VirtualBox 5.1.10 which was released 2 days ago.

by Kral kaplan, 4 years ago

Attachment: 20200620_095436.jpg added
