VirtualBox

Ticket #11207 (closed defect: fixed)

Opened 17 months ago

Last modified 10 months ago

bluetooth (via USB) not working reliably

Reported by: mbatle Owned by:
Priority: major Component: USB
Version: VirtualBox 4.2.4 Keywords: usb bluetooth hciconfig reset timeout hcitool
Cc: Guest type: Linux
Host type: all

Description

Bluetooth in VirtualBox (which uses USB pass-through) does not work reliably. A way to reproduce the problem using bluez:

# hciconfig hci0 reset Can't init device hci0: Connection timed out (110)

Several tests have been done reproducing the same problem:

  • Using Host OS Fedora 17 x64 (kernel 3.6.6)
  • Using Host OS Windows 7
  • Using Guest OS Ubuntu Quantal 12.10 32bit (kernel 3.5.0)
  • Using Guest OS Fedora 17 x64 (kernel 3.6.6)
  • Using different BT USB dongles from different brands

Same test natively on the host OS works ok.

Several files will be attached to help to debug this issue:

  • log-virtualbox.txt.bz2, the BT USB dongle is 0b05:179c ASUSTek Computer, Inc.
  • log-usbmon-guest.txt, log done with USB monitor in the guest OS
  • log-usbmon-host.txt, log done with USB monitor in the host OS, slightly different than previous one
  • log-strace.txt, log done with strace for hciconfig command

Attachments

log-usbmon-guest.txt Download (539 bytes) - added by mbatle 17 months ago.
usbmon log in guest OS
log-usbmon-host.txt Download (369 bytes) - added by mbatle 17 months ago.
usbmon log in host OS
log-strace.txt Download (1.9 KB) - added by mbatle 17 months ago.
strace log
log-virtualbox.txt.bz2 Download (66.3 KB) - added by mbatle 17 months ago.
virtualbox log filtered to usb stuff due to size contraints
vbox-ohci_cancellations.diff Download (2.0 KB) - added by Derek 16 months ago.
Attempt to provide a way for virtual box's ohci driver to cancel host URBs in response to guest cancellations
vbox_passthrough.diff Download (846 bytes) - added by Derek 14 months ago.
Simpler patch that only corrects the single connected passthrough device case
vbox-ohci_cancellations_v2.diff Download (3.7 KB) - added by Derek 13 months ago.
Cancellations by scanning for removed transfer descriptors

Change History

Changed 17 months ago by mbatle

usbmon log in guest OS

Changed 17 months ago by mbatle

usbmon log in host OS

Changed 17 months ago by mbatle

strace log

Changed 17 months ago by mbatle

virtualbox log filtered to usb stuff due to size contraints

comment:1 Changed 17 months ago by mbatle

I had to attach a filtered (smaller) file for log-virtualbox due to 512 kb size constraints. I still have the long one and can hang it somewhere if anybody finds it might be useful.

comment:2 Changed 16 months ago by Derek

I'd like to add some additional findings to this bug report (the host OS is linux for my tests).

hciconfig hci0 reset in the version of bluez we're using is the same as performing "hciconfig hci0 down ; hciconfig hci0 up".

the linux guest OS's btusb driver submits 3 URBs as part of its open() method. One interrupt URB and two bulk URBs. This appears to be translated into 2 URBs by the virtual box usb proxy layer - one interrupt URB and one bulk URB.

hciconfig hci0 down invokes btusb's close() method which retracts the 3 URBs via usb_kill_anchored_urbs(). The guest OS performs the appropriate OHCI register fiddling to perform these cancellations, and the btusb driver receives callbacks indicating the URBs are cancelled.

However, virtual box's virtual OHCI driver appears to lack any method for intentionally cancelling an interrupt URB, so the interrupt URB is not discarded, it remains attached to the host's usbdevfs driver indefinitely.

The bulk URB is discarded because the closure of the bluetooth USB device results in disabling of bulk list processing - however, if a second USB device has URBs in flight, the bulk URB will also be incorrectly left attached to the host driver.

when hciconfig hci0 up is called, the guest OS submits new bulk and interrupt URBs and sends a control URB that should result in the interrupt URB completing and receiving a call back.

Since the old interrupt URB is still attached on the host, it receives the data intended for the new URB. At this point virtual box's OHCI driver realizes that the URB had been cancelled and discards it - along with the data the btusb driver is expecting.

btusb waits several seconds for the device to reply, but since the reply has been incorrectly discarded, it times out.

My attempt at fixing this has been to add code in ohciStartOfFrame that tests the entire in flight list with ohciHasUrbBeenCanceled() and calls the cancellation function for any that have. The results have been moderately successful, but under heavy load (ie: resetting 2-3 passed through bt usb adapters concurrently with shell loops) it will fail badly. I'm not certain my approach is correct, but I will attach the patch anyway.

I've also tried (while running an unpatched virtual box) changing the btusb driver to not cancel URBs on close, and to re-use the existing ones on open (an unacceptable long term solution as module unloading will still result in an attempt to withdraw URBs that will fail) - when resetting multiple devices concurrently there appear to be spurious URB discards on the host as monitored via usb devfs snooping. I haven't had time to pursue this as much as I'd like, but it may be a different but related problem.

Changed 16 months ago by Derek

Attempt to provide a way for virtual box's ohci driver to cancel host URBs in response to guest cancellations

comment:3 Changed 15 months ago by Derek

In case the information is helpful, the bluetooth dongles I've tried are: (results from lsusb on a linux machine)

ID 0a5c:217f Broadcom Corp. Bluetooth Controller

ID 0a12:0001 Cambridge Silicon Radio, Ltd Bluetooth Dongle (HCI mode)

The broadcom device is built into my laptop (lenovo x201) and is on an internal usb connection.

These are both USB 2.0 full-speed devices plugged into EHCI busses on the host, and they end up using the OHCI driver in the guest.

As far as I can tell, on linux the same driver is used for all USB bluetooth devices, so the problem should be device invariant.

comment:4 Changed 14 months ago by mbatle

Thanks a lot Derek for your patch. It seems to work pretty well for me too. It is pity after the research you did and even providing a patch 2 months ago, virtualbox developers didn't have time yet to review it and provide some feedback or include it into the next version. It would be great to have some news soon on this.

Also, the problem is not bluetooth specific, it is generic due to how URB cancelling is not properly handled in virtualbox, and it is affecting all platforms.

comment:5 Changed 14 months ago by michaln

Couple of questions/comments:

  • Source comments like "I don't know why this happens, but it does!" spell trouble ;) When does it happen? Normal operation? Removing a USB device? Shutting down a VM?
  • There is already code to cancel URBs in the OHCI emulation; see existing calls to pfnCancelUrbsEp(). How is the newly added code expected to interact with that?
  • Wouldn't it be more logical to check for canceled URBs in ohciServicePeriodicList(), analogous to ohciServiceBulkList()? If I understand the problem description correctly, the issue is the lack of support for canceling interrupt URBs rather than canceling URBs in general?
  • Is the ED_HWINFO_SKIP bit set on the URBs the guest wants to cancel? If so, why not use that when processing the endpoint(s) instead of going over the entire URB list all the time, checking for something that will almost never happen?

By the way, we do appreciate the contribution!

comment:6 Changed 14 months ago by Derek

Thanks so much for your reply!

Please don't hit me too hard if I get any of the following wrong. ;)

  1. that (the pUrbLnx pointer being NULL) happens when resetting multiple bluetooth devices concurrently via hciconfig hci# reset. I've never seen it occur when not deliberately trying to break it. As I can't do this test without the patch, it's hard to determine if I've introduced a new problem or exposed a previously hidden one. :/
  1. As far as I can tell, the existing cancellation code is called:
    • when the controller is disabled but there are outstanding requests
    • when a BULK endpoint is set ED_HWINFO_SKIP

The first condition won't occur very often at all if there are multiple devices plugged into the virtual controller, since if one device is up and running the controller won't be disabled in response to bringing another device down.

The second condition is extremely timing dependent as explained in 4 below.

I had thought by performing my cancellations before either of these conditions are tested that ohciTdInFlightUrb(pOhci, TdAddr) would return NULL, because the cancelled URBs are no longer in flight... I could be wrong.

  1. It's not just interrupt URBs that aren't properly cancelled. Bulk ones aren't always cancelled correctly as well. If two devices are connected and one is brought down, the bulk URB cancellation code doesn't trigger.
  1. If I understand correctly how this stuff works (in a linux guest OS), ED_HWINFO_SKIP bit is set for the endpoint, but the OHCI emulator doesn't usually see it. It's set on an endpoint just prior to removing it from shared memory with the device. Once it's actually removed, it's simply not there anymore, and the emulator doesn't notice that it's missing.

So I think the root problem is that there are still orphaned URBs that have been sent by vbox to the host usbdevfs driver, but they don't have associated endpoint descriptors attached to an OHCI device on the guest OS anymore.

An earlier attempt I made at solving this just cancelled all in flight URBs if the endpoint list was empty in serviceperiodiclist. this works great if there's only a single usb passthrough device plugged in, but things fall apart when multiple devices are connected.

I'll attach that patch too - it's quite nice in the single device case, which is the only case where the current bulk urb cancellation code works.

Last edited 14 months ago by frank (previous) (diff)

Changed 14 months ago by Derek

Simpler patch that only corrects the single connected passthrough device case

comment:7 Changed 14 months ago by michaln

You probably didn't introduce a new problem in 1., more likely just got into some situation that never happened before.

It'll unfortunately take me a little bit to get back to you on the rest as I have to re-read the OHCI spec again in detail before I can say anything intelligent.

comment:8 Changed 14 months ago by michaln

Okay, memory refreshed... more questions.

  • Why "for (i = 0; cLeft && i < RT_ELEMENTS(pOhci->aInFlight); i++)" and not just "for (i = 0; i < cLeft; ++i)"?
  • Shouldn't the interrupt (and perhaps control/isoc?) endpoint processing cancel a URB immediately if it sees the sKip bit set? That might reduce the possibility that the wrong URB will receive data.

I reviewed the OHCI spec and saw that checking the sKip bit alone won't do. It seems to me that there's no guarantee the HCD will see the ED being skipped before it's removed from processing entirely. So yes, we have to check the in-flight URBs all the time and make sure they're still in-flight. Silly but unavoidable.

It may be possible to do this more efficiently - before each run, mark all URBs as inactive; go over all queues, mark all still-active URBs as such; when done, cancel all URBs not marked as active. It's not entirely clear to me how "paused" endpoints should be handled - cancel all in-flight URBs, resubmit them later?

If we knew that the guest will trigger a SOF interrupt when canceling URBs etc., we could watch out for that and only look for funny things happening in that case, but the guest could poll so that won't work.

comment:9 Changed 14 months ago by Derek

regarding RT_ELEMENTS(pOhci->aInFlight), I'm unfamiliar with the in flight list's format, so I was just being paranoid about running off the end. :) I think I may have been confused by the URB splitting stuff and thought it was possible for the number of URBs in flight to be greater than the number of elements in the list, but I can't really recall anymore...

I tried putting a cancel on skip in the isoc processing with no result. Then I instrumented it with LogRel(()) and it looks like it actually triggers once in a while, but with an incredibly low probability. In retrospect, maybe a complete fix does require that check as well, since ohciHasUrbBeenCancelled won't notice ED_HWINFO_SKIP, so just doing what I did can still result in a skipped interrupt URB being left around for an extra SOF.

More efficiency would be really nice - my patch really does feel heavy, and I for one don't have a good grasp on how frequently SOF occurs or how many entries are usually in that list... Your proposed solution sounds more robust than my patch.

I guess if the URBs are left around, the host will leave the real USB hardware running. It would seem to me that cancelling them as you suggest is the only way to keep unexpected things from happening between the host usb controller and the usb device.

comment:10 Changed 14 months ago by michaln

Actually, aInFlight is a simple array. The size check isn't wrong per se, but I'm pretty sure if we get into that situation, the in-flight list is garbage. In other words, if the number of in-flight URBs in the list does not equal cInFlight, we're in trouble. In these situations, we usually put an assertion in the code. So I guess leave the check as it is but add an assertion after the loop; probably check that cLeft is zero? I think that should always hold.

Yes, checking for the sKip bit is unfortunately not sufficient :/ As I mentioned earlier, the HC is not guaranteed to see the bit. OTOH if it does see it, it probably should act on it right away.

SOF (Start Of Frame) occurs at the beginning of each frame... so 1,000 times a second. We try to ratchet the rate down if USB is idle, but if it's not, all that stuff does get run 1,000 times per second.

How many entries are in the in-flight list depends entirely on the number of devices attached and their types. I'm guessing there are typically under 10 in-flight URBs, but I'm also sure the number can get much higher. Our current limit is 257.

Yes, if a URB isn't canceled, the host's USB HC will keep polling the device. And if the device does respond, the orphaned URB will be processed, but when it's completed in the OHCI emulation, the code will realize that the URB has been canceled and will effectively throw away the result. In the worst case there might be a different URB in the same location in guest's memory and things will go even more badly wrong.

Changed 13 months ago by Derek

Cancellations by scanning for removed transfer descriptors

comment:11 Changed 13 months ago by Derek

I've attached a new patch that a) performs cancellations on ED_SKIP in the periodiclist handler as is done in the bulk list handler b) once per frame removes any URBs that no longer have any associated transfer descriptors in memory

currently control transfers are exempt from this removal as I'm not sure it really makes sense to try to cancel them once they're already in flight.

comment:12 Changed 12 months ago by Derek

To fill in the missing patch submission requirements:

This patch is released under the MIT license.

My email address is derek.foreman@…

comment:13 Changed 10 months ago by frank

Finally we've integrated an improved version of this patch. If anyone is interested in testing, here are some 4.2.13 test builds:

comment:14 Changed 10 months ago by frank

  • Status changed from new to closed
  • Resolution set to fixed

Fix is part of VBox 4.2.14.

Note: See TracTickets for help on using tickets.

www.oracle.com
ContactPrivacy policyTerms of Use