[vbox-dev] Race condition in DrvNAT.cpp (was: svn: NAT dropping all external packets)

Vasily Levchenko Vasily.Levchenko at Sun.COM
Mon Nov 30 17:29:46 GMT 2009


On Nov 30, 2009, at 7:23 PM, Robin Green wrote:

> Hi Vasily,
> I am sure that the patched code is still not correct. A thread switch could occur between the unlocking of one lock and the locking of another - right?

Need investigate that. 

But I think root cause of the problem isn't here (race), because earlier or later urgent channel will be waken up, by ICMP message from host and will wakeup Recv thread. But for me problem is located at mbuf zone water line mechanism some basic implementation of rfc 896. Have you tried first today's suggestion?  

> So, you would need to lock the *same* lock that is currently locked by RTSemEventWait (on systems that use semevent-posix.cpp, anyway), so that that lock will be temporarily released while it waits.
> 
> However, there is a problem here. With systems like Mac OS X that use semevent-posix.cpp, the implementation of RTSemEventWait in there uses a lock. So, OK, it should be possible to lock that lock (but you might have to prevent locking the same lock twice). But on Linux, semevent-linux.cpp is presumably used, and the implementation of RTSemEventWait in there does *not* use a lock! I think this is wrong - you can't use the code in RTSemEventWait in semevent-linux.cpp to implement a correctly-working condition variable or semaphore for what you're trying to do, because to implement it correctly you need to hold a lock to avoid race conditions, and atomically release the lock while you wait.
> 
> I know this stuff can be confusing - I find it confusing myself, and I might have misunderstood it - so please, do not hesitate to ask for further explanation.
> 
> -- 
> Robin
> 
> 2009/11/30 Vasily Levchenko <Vasily.Levchenko at sun.com>
> Hi Robin,
> Could you please check, that patch solves problem for you?
> 
> 
> On Nov 28, 2009, at 7:30 PM, Robin Green wrote:
> 
>> Further to my earlier report on this list of NAT adapters stopping working with svn and 3.1.0 beta versions, I believe I have identified the problem.
>> 
>> r23462 introduced the bug, by introducing a new thread to process "urgent" packets, the communication of which with the existing thread isn't properly synchronised. Basically, it's the classic bug of sending a signal to a thread (in this case, the existing, non-urgent thread) without using a critical section to ensure that the target thread is waiting for the signal when it's sent. So occasionally the non-urgent thread misses the signal, and thus (for all practical purposes) never wakes up, because it's waiting indefinitely.
>> 
>> There *is* a lock taken inside RTSemEventSignal and RTSemEventWait, but that doesn't prevent this because it only covers the waiting and signalling itself, and not the real work. This (locking but making the critical section too small) is also a classic bug.
>> 
>> I haven't tried to fix this because I'm not familiar with the code, but this is my best guess at what's going wrong.
>> -- 
>> Robin
>> 
>> 2009/11/26 Robin Green <greenrd at gmail.com>
>> I think I've found something important. The responses stop appearing in the pcap log file immediately after VirtualBox sends the first ICMP type 4 packet (wireshark says this is "Source quench (flow control)"). The packet is also malformed in the .pcap file, according to wireshark, which says it has a wrong header checksum - not for the overall packet, but in a nested IP header inside the ICMP packet.
>> 
>> -- 
>> Robin
>> 
>> 2009/11/26 Robin Green <greenrd at gmail.com>
>> 
>> OK, this time I fortunately managed to reproduce the bug very quickly, *and* I got a host log file as well using wireshark. This time I used r24972.
>> 
>> I had a quick look at both .pcap files in wireshark. If you look for example at my attempt to ping 137.44.1.27, you'll see that the ping is sent and the reply is received on the host, but the reply does *not* appear in VirtualBox's pcap log file. I was trying to ping this IP address (one of our DNS servers) because I could not access twitter.com in firefox in the guest.
>> 
>> Thanks,
>> -- 
>> Robin
>> 
>> 2009/11/21 Vasily Levchenko <Vasily.Levchenko at sun.com>
>> 
>> On Nov 20, 2009, at 7:43 PM, Robin Green wrote:
>> 
>>> I've attached the log file. What do you mean by host network file?
>>> 
>> 
>> It's actually output of wireshark or any other traffic dumping tool working on host. Rather useful to check how nat send/modify    received from the guest packets to remote host.  
>> 
>>> 2009/11/19 Vasily Levchenko <Vasily.Levchenko at sun.com>
>>> Hello Robin,
>>> Thanks for feedback.
>>> Could you please attach pcap file NAT<->guest (please see http://www.virtualbox.org/wiki/Network_tips) and host network and log file?
>>> On Nov 19, 2009, at 7:23 PM, Robin Green wrote:
>>> 
>>>> Bisection reveals that this bug was introduced at some point between r23223 and r23530 inclusive (I can't be more specific because all of those revisions except r23530 either failed to build or failed to boot).
>>>> -- 
>>>> Robin
>>>> 
>>>> 2009/11/18 Robin Green <greenrd at gmail.com>
>>>> I tried 3.1 beta 1 and the NAT interfaces on both my VMs very quickly stopped working. So I tried building from SVN, and with this (r24735) I can't connect to, or even ping, anything other than the host. Guest is Fedora rawhide (F13), but with Fedora 12 kernel (due to an uninteresting bug in the rawhide kernel).
>>>> 
>>>> -- 
>>>> Robin
>>>> 
>>>> _______________________________________________
>>>> vbox-dev mailing list
>>>> vbox-dev at virtualbox.org
>>>> http://vbox.innotek.de/mailman/listinfo/vbox-dev
>>> 
>>> 
>>> <VBox.log>
>> 
>> 
>> 
>> 
>> _______________________________________________
>> vbox-dev mailing list
>> vbox-dev at virtualbox.org
>> http://vbox.innotek.de/mailman/listinfo/vbox-dev
> 
> 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.virtualbox.org/pipermail/vbox-dev/attachments/20091130/df36051b/attachment.html>


More information about the vbox-dev mailing list