Ticket #10535 (reopened defect)

Opened 2 years ago

Last modified 2 years ago

ehci_handle_endpoint_reclaimation panic in Solaris guest VM on VirtualBox 4.1.15

Reported by: jayce
Owned by:
Priority: major
Component: other
Version: VirtualBox 4.1.14
Keywords: solaris usb ehci panic
Cc: jason.banham@…
Guest type: Solaris
Host type: Mac OS X


Problem Description

I was trying to attach a Kingston USB memory stick to the guest VM (Solaris 11) when the guest VM panicked.

Host O/S is MacOS 10.6.8, running VirtualBox 4.1.15 after recent host O/S panics upon guest O/S (Linux) reboots. VirtualBox was upgraded to 4.1.15 as per:

Ticket #9897 Synopsis: Frequent panics on Mac OS X when shutting down/starting VMs.

Extension pack is: Oracle VM VirtualBox Extension Pack 4.1.4r77440

... I didn't see a 4.1.15 extension pack.

Core Dump Analysis

The panic stack is this:

> $C
ffffff0003ea7af0 ddi_get32+0x12()
ffffff0003ea7b30 ehci_handle_endpoint_reclaimation+0x91(ffffff0149379000)
ffffff0003ea7b70 ehci_intr+0x184(ffffff0149379000, 0)
ffffff0003ea7bc0 av_dispatch_autovect+0x74(13)
ffffff0003ea7c00 dispatch_hardint+0x33(13, 0)
ffffff0003e05a50 switch_sp_and_call+0x13()
ffffff0003e05aa0 do_interrupt+0xb6(ffffff0003e05ab0, 1)
ffffff0003e05ab0 _interrupt+0xba()
ffffff0003e05ba0 mach_cpu_idle+6()
ffffff0003e05bd0 cpu_idle+0xb2()
ffffff0003e05c00 idle+0x116()
ffffff0003e05c10 thread_start+8()

From msgbuf we see:

2012 May  9 09:07:13 ffffff0161aa3ac0 
WARNING: /pci@0,0/pci8086,265c@b (ehci0): Connecting device on port 1 failed
2012 May  9 09:08:07 ffffff0161ab3e00 
2012 May  9 09:08:07 ffffff0161aa3640 
BAD TRAP: type=e (#pf Page fault) rp=ffffff0003ea79c0 addr=ffffff023772af00
2012 May  9 09:08:07 ffffff015b93fa00 

2012 May  9 09:08:07 ffffff0167d940c0 sched: 
2012 May  9 09:08:07 ffffff0168024a00 #pf Page fault
2012 May  9 09:08:07 ffffff0161aa3100 Bad kernel fault at addr=0xffffff023772af00
2012 May  9 09:08:08 ffffff0168080200 
pid=0, pc=0xfffffffffb85ff32, sp=0xffffff0003ea7ab8, eflags=0x10246
2012 May  9 09:08:08 ffffff0161aa3e80 
cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6b8<xmme,fxsr,pge,pae,pse,de>
2012 May  9 09:08:08 ffffff0161aa3b80 cr2: ffffff023772af00
2012 May  9 09:08:08 ffffff0168080d40 cr3: 38a6000
2012 May  9 09:08:08 ffffff0168080ec0 cr8: c
2012 May  9 09:08:08 ffffff016807e840 
2012 May  9 09:08:08 ffffff0168024340 
        rdi: ffffff014bbe5bc0 rsi: ffffff023772af00 rdx:                c
2012 May  9 09:08:08 ffffff016807e300 
        rcx: ffffff014acfe300  r8:            2ceaf  r9:            2ceaf
2012 May  9 09:08:08 ffffff0161aa3400 
        rax: ffffff023772af00 rbx: ffffff014bb98800 rbp: ffffff0003ea7af0
2012 May  9 09:08:08 ffffff015c7959c0 
        r10: fffffffffbcb9ab0 r11: fffffffffb82d0bc r12: ffffff0149379000
2012 May  9 09:08:08 ffffff0161aa3d00 
        r13: ffffff023772af00 r14: ffffff014bb98854 r15: ffffff01466b4c30
2012 May  9 09:08:08 ffffff0161aa31c0 
        fsb:                0 gsb: fffffffffbc3ebc0  ds:               4b
2012 May  9 09:08:08 ffffff0168080800 
         es:               4b  fs:                0  gs:              1c3
2012 May  9 09:08:09 ffffff0161aa3340 
        trp:                e err:                0 rip: fffffffffb85ff32
2012 May  9 09:08:09 ffffff0168024580 
         cs:               30 rfl:            10246 rsp: ffffff0003ea7ab8
2012 May  9 09:08:09 ffffff0161ab3d40    ss:               38
2012 May  9 09:08:09 ffffff0161ab3c80 
2012 May  9 09:08:09 ffffff0161ab3bc0 ffffff0003ea78e0 unix:die+131 ()
2012 May  9 09:08:09 ffffff0161ab3b00 ffffff0003ea79b0 unix:trap+152b ()
2012 May  9 09:08:09 ffffff0161ab3a40 ffffff0003ea79c0 unix:cmntrap+e6 ()
2012 May  9 09:08:09 ffffff0161ab3980 ffffff0003ea7af0 unix:ddi_getl+12 ()
2012 May  9 09:08:09 ffffff0161ab38c0 
ffffff0003ea7b30 ehci:ehci_handle_endpoint_reclaimation+91 ()
2012 May  9 09:08:09 ffffff0161ab3800 ffffff0003ea7b70 ehci:ehci_intr+184 ()
2012 May  9 09:08:09 ffffff0161ab3740 ffffff0003ea7bc0 unix:av_dispatch_autovect+74 ()
2012 May  9 09:08:09 ffffff0161ab3680 ffffff0003ea7c00 unix:dispatch_hardint+33 ()
2012 May  9 09:08:09 ffffff0161ab35c0 ffffff0003e05a50 unix:switch_sp_and_call+13 ()
2012 May  9 09:08:09 ffffff0161ab3500 ffffff0003e05aa0 unix:do_interrupt+b6 ()
2012 May  9 09:08:10 ffffff0161ab3440 ffffff0003e05ab0 unix:cmnint+ba ()
2012 May  9 09:08:10 ffffff0161ab3380 ffffff0003e05ba0 unix:mach_cpu_idle+6 ()
2012 May  9 09:08:10 ffffff0161ab32c0 ffffff0003e05bd0 unix:cpu_idle+b2 ()
2012 May  9 09:08:10 ffffff0161ab3200 ffffff0003e05c00 unix:idle+116 ()
2012 May  9 09:08:10 ffffff0161ab3140 ffffff0003e05c10 unix:thread_start+8 ()
2012 May  9 09:08:10 ffffff0161ab3080 
2012 May  9 09:09:03 ffffff0161e85f00 
2012 May  9 09:09:03 ffffff0161e85e40 
BAD TRAP: type=e (#pf Page fault) rp=fffffffffbc879d0 addr=0 occurred in module "<unknown>" due to 
a NULL pointer dereference
2012 May  9 09:09:03 ffffff0161e85d80 
2012 May  9 09:09:03 ffffff0161e85cc0 syncing file systems...
2012 May  9 09:09:03 ffffff0161e85c00  done
2012 May  9 09:09:04 ffffff0161e85b40 
dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
2012 May  9 09:09:05 ffffff0161e85a80 NOTICE: ahci0: ahci_tran_reset_dport port 0 reset port

=> Interesting to see we couldn't attach the device on port 1

> *panic_thread::findstack -v
stack pointer for thread ffffff0003ea7c20: ffffff0003ea7730
  ffffff0003ea7870 0xffffff0003ea79c0()
  ffffff0003ea79c0 0x10000000e()
  ffffff0003ea7af0 ddi_get32+0x12()
  ffffff0003ea7b30 ehci_handle_endpoint_reclaimation+0x91(ffffff0149379000)
  ffffff0003ea7b70 ehci_intr+0x184(ffffff0149379000, 0)
  ffffff0003ea7bc0 av_dispatch_autovect+0x74(13)
  ffffff0003ea7c00 dispatch_hardint+0x33(13, 0)
  ffffff0003e05a50 switch_sp_and_call+0x13()
  ffffff0003e05aa0 do_interrupt+0xb6(ffffff0003e05ab0, 1)
  ffffff0003e05ab0 _interrupt+0xba()
  ffffff0003e05ba0 mach_cpu_idle+6()
  ffffff0003e05bd0 cpu_idle+0xb2()
  ffffff0003e05c00 idle+0x116()
  ffffff0003e05c10 thread_start+8()

=> Some confusion over whether we're running ddi_getl() or ddi_get32()

> ddi_get32::dis
ddi_get32:                      movl   0x68(%rdi),%edx
ddi_get32+3:                    cmpl   $0xa,%edx
ddi_get32+6:                    jne    +0x5     <ddi_get32+0xd>
ddi_get32+8:                    movq   %rsi,%rdx
ddi_get32+0xb:                  inl    (%dx)
ddi_get32+0xc:                  ret    
ddi_get32+0xd:                  cmpl   $0xc,%edx
ddi_get32+0x10:                 jne    +0x3     <ddi_get32+0x15>
ddi_get32+0x12:                 movl   (%rsi),%eax
ddi_get32+0x14:                 ret    
ddi_get32+0x15:                 jmp    *0x88(%rdi)

=> So we fell over at instruction offset 0x12, which was:

  movl   (%rsi),%eax
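
Read as C, the disassembly above amounts to roughly the following. This is a sketch reconstructed only from the instructions shown: the struct, field and constant names (acc_handle_model, ACC_IO_DIRECT, ACC_MEM_DIRECT and so on) are made up for illustration, only the two handle fields that ddi_get32() touches are modelled (without reproducing their real offsets), and the port I/O branch is stubbed out because inl cannot be run from user space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the values are the cmpl immediates in the disassembly. */
enum { ACC_IO_DIRECT = 0xa, ACC_MEM_DIRECT = 0xc };

enum access_path { PATH_PORT_IO, PATH_MEM_LOAD, PATH_INDIRECT };

/* Only the two fields ddi_get32() reads are modelled; offsets are not. */
struct acc_handle_model {
	uint32_t type;		/* read via 0x68(%rdi) */
	uint32_t (*get32)(struct acc_handle_model *, uint32_t *); /* 0x88(%rdi) */
};

/* Which branch of ddi_get32() a handle of this type would take. */
enum access_path
model_get32_path(uint32_t type)
{
	if (type == ACC_IO_DIRECT)
		return (PATH_PORT_IO);	/* ddi_get32+0xb:  inl (%dx) */
	if (type == ACC_MEM_DIRECT)
		return (PATH_MEM_LOAD);	/* ddi_get32+0x12: movl (%rsi),%eax */
	return (PATH_INDIRECT);		/* ddi_get32+0x15: jmp *0x88(%rdi) */
}

/* Model of the access itself; port I/O is not emulated. */
uint32_t
model_get32(struct acc_handle_model *h, uint32_t *addr)
{
	assert(h != NULL);
	switch (model_get32_path(h->type)) {
	case PATH_MEM_LOAD:
		return (*addr);		/* faults if addr is not mapped */
	case PATH_INDIRECT:
		return (h->get32(h, addr));
	default:
		return (0);		/* port I/O: stubbed out in this sketch */
	}
}
```

With %rdx = 0xc in the register dump, the handle took the plain memory-load branch, so the fault is simply the load from whatever address the caller passed in %rsi.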

> ::regs
%rax = 0xffffff023772af00                 %r9  = 0x000000000002ceaf 
%rbx = 0xffffff014bb98800                 %r10 = 0xfffffffffbcb9ab0 lwpsleepq+0x3570
%rcx = 0xffffff014acfe300                 %r11 = 0xfffffffffb82d0bc dispatch_hardint
%rdx = 0x000000000000000c                 %r12 = 0xffffff0149379000 
%rsi = 0xffffff023772af00                 %r13 = 0xffffff023772af00 
%rdi = 0xffffff014bbe5bc0                 %r14 = 0xffffff014bb98854 
%r8  = 0x000000000002ceaf                 %r15 = 0xffffff01466b4c30 

%rip = 0xfffffffffb85ff32 ddi_get32+0x12
%rbp = 0xffffff0003ea7af0
%rsp = 0xffffff0003ea7ab8
%rflags = 0x00010246
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0

                        %cs = 0x0030    %ds = 0x004b    %es = 0x004b
%trapno = 0xe           %fs = 0x0000    %gs = 0x01c3
   %err = 0x0

The value of %rsi is 0xffffff023772af00; we're trying to dereference that and store
the result into %eax

> 0xffffff023772af00/J
mdb: failed to read data from target: no mapping for address

And it sure looks bogus to me, plus it's the bogus address seen in the panic stack.
So this would have been passed up to us via ehci_handle_endpoint_reclaimation().
Looking at the illumos source code (which may be slightly different to the Solaris 11
source, which isn't publicly available):

    /*
     * ehci_handle_endpoint_reclamation:
     *
     * Reclamation of Host Controller (HC) Endpoint Descriptors (QH).
     */
    void
    ehci_handle_endpoint_reclaimation(ehci_state_t	*ehcip)
    {
    	usb_frame_number_t	current_frame_number;
    	usb_frame_number_t	endpoint_frame_number;
    	ehci_qh_t		*reclaim_qh;

> ffffff0149379000::print ehci_state_t
    ehci_dip = 0xffffff01466b4c30
    ehci_instance = 0
    ehci_hcdi_ops = 0xffffff0148486388
    ehci_cb_hdl = 0xffffff01466b4e90
    ehci_flags = 0x1e
    ehci_vendor_id = 0x8086
    ehci_device_id = 0x265c
    ehci_rev_id = 0
    ehci_caps_handle = 0xffffff014bbe5d00
    ehci_capsp = 0xffffff010272a000
    ehci_regsp = 0xffffff010272a020
    ehci_config_handle = 0xffffff014bbe5e40
    ehci_frame_interval = 0
    ehci_dma_attr = {


It's quite a long structure and I'm not 100% familiar with it, so not sure it's worth
reviewing it all.
Can I look at the device ID?

> 0xffffff01466b4c30::print struct dev_info  
    devi_parent = 0xffffff01466b8688
    devi_child = 0
    devi_sibling = 0xffffff01466b4960
    devi_binding_name = 0xffffff0146690b1c "pciclass,0c0320"
    devi_addr = 0xffffff014bbf3c00 "b"
    devi_hw_prop_ptr = 0xffffff014b6b6418
    devi_node_name = 0xffffff0148832e48 "pci8086,265c"
    devi_compat_names = 0xffffff0146690b00 "pci8086,265c.0"

Hmmm, seems so.
How about the vendor id and device id?

0x8086	Intel Corporation

Device Id	Chip Description	Vendor Id	Vendor Name
0x265C		USB 2.0 EHCI Controller	0x8086		Intel Corporation
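
For reference, both IDs can be read straight out of the node name printed above. A small sketch, with a helper name chosen here for illustration, assuming the node name has the form pciVVVV,DDDD with hex digits:

```c
#include <assert.h>
#include <stdio.h>

/* Parse a "pciVVVV,DDDD" node name; returns 1 on success, 0 otherwise. */
int
parse_pci_node_name(const char *name, unsigned int *vendor,
    unsigned int *device)
{
	assert(name != NULL && vendor != NULL && device != NULL);
	return (sscanf(name, "pci%4x,%4x", vendor, device) == 2);
}
```

Given "pci8086,265c" this yields vendor 0x8086 and device 0x265c, matching ehci_vendor_id and ehci_device_id in the ehci_state_t output.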

Well the structure itself looks reasonably sound from an initial review.
So where did we hand over to ddi_get32() ??

ehci_handle_endpoint_reclaimation+0x89: movq   %r12,%rsi
ehci_handle_endpoint_reclaimation+0x8c: call   -0x296d  <ehci_deallocate_qh>
ehci_handle_endpoint_reclaimation+0x91: movq   0x650(%r13),%r12

Interesting, the disassembly shows us calling ehci_deallocate_qh() and not ddi_get32().
Taking apart the stack:

> ffffff0003ea7730-0x60,100/nap
0xffffff0003ea76d0:             0xd1            
0xffffff0003ea76d8:             0xd1            
0xffffff0003ea76e0:             0xe             
0xffffff0003ea76e8:             0xffffff0003ea77b0
0xffffff0003ea76f0:             kvseg           
0xffffff0003ea76f8:             0xffffff0003ea7778
0xffffff0003ea7700:             0xffffff0003ea7760
0xffffff0003ea7708:             avl_find+0x56   
0xffffff0003ea7710:             vpanic+0x22     
0xffffff0003ea7718:             0xfffffffffb957418
0xffffff0003ea7720:             0xffffff0003ea7830
0xffffff0003ea7728:             0xfffffffffb957698
0xffffff0003ea7730:             0xffffff0003ea7870
0xffffff0003ea7738:             0xffffff0003ea79c0
0xffffff0003ea7740:             0xffffff023772af00
0xffffff0003ea7748:             0xffffff0003ea7780


0xffffff0003ea7a48:             0xffffff014bbe5d00
0xffffff0003ea7a50:             0xffffff0003ea7ac0
0xffffff0003ea7a58:             0x4b            
0xffffff0003ea7a60:             0x4b            
0xffffff0003ea7a68:             0               
0xffffff0003ea7a70:             0x1c3           
0xffffff0003ea7a78:             0xe             
0xffffff0003ea7a80:             0               
0xffffff0003ea7a88:             ddi_get32+0x12  
0xffffff0003ea7a90:             0x30            
0xffffff0003ea7a98:             0x10246         
0xffffff0003ea7aa0:             0xffffff0003ea7ab8
0xffffff0003ea7aa8:             0x38            
0xffffff0003ea7ab0:             0xffffff0003ea7af0
0xffffff0003ea7ab8:             ehci_deallocate_qh+0x57
0xffffff0003ea7ac0:             0xffffffffc002cea0
0xffffff0003ea7ac8:             0xffffff0149379000
0xffffff0003ea7ad0:             0xffffff014bb98800
0xffffff0003ea7ad8:             0x23a007        
0xffffff0003ea7ae0:             0xffffff014bb98800
0xffffff0003ea7ae8:             0xffffff0149379000
0xffffff0003ea7af0:             0xffffff0003ea7b30
0xffffff0003ea7af8:             ehci_handle_endpoint_reclaimation+0x91
0xffffff0003ea7b00:             0xffffff01466b4c30
0xffffff0003ea7b08:             0xffffff014b6d3298

... seems we do call ehci_deallocate_qh() before getting to ddi_get32()
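
Why ehci_deallocate_qh() is missing from $C but present in the raw stack can be worked out from the dump. ddi_get32() never sets up a frame (there is no pushq %rbp in its disassembly), so its return address is simply whatever %rsp pointed at when the trap hit. A minimal sketch of that arithmetic, using only values from the dump; the struct and function names are invented here, and the layout covers just the saved words visible between 0xffffff0003ea7a78 and 0xffffff0003ea7aa8:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saved trap state as it appears in the stack dump, lowest address first. */
struct trap_words {
	uint64_t trapno;	/* 0xffffff0003ea7a78: 0xe (page fault) */
	uint64_t err;		/* 0xffffff0003ea7a80 */
	uint64_t rip;		/* 0xffffff0003ea7a88: ddi_get32+0x12 */
	uint64_t cs;		/* 0xffffff0003ea7a90 */
	uint64_t rflags;	/* 0xffffff0003ea7a98 */
	uint64_t rsp;		/* 0xffffff0003ea7aa0 */
	uint64_t ss;		/* 0xffffff0003ea7aa8 */
};

/*
 * Address of the slot holding the return address of a frameless (leaf)
 * function that trapped: the stack pointer at the time of the trap.
 */
uint64_t
leaf_return_slot(const struct trap_words *t)
{
	assert(t != NULL);
	return (t->rsp);
}

/* Index of an address within a dump of consecutive 8-byte words. */
uint64_t
dump_index(uint64_t dump_start, uint64_t addr)
{
	assert(addr >= dump_start && (addr - dump_start) % 8 == 0);
	return ((addr - dump_start) / 8);
}
```

Feeding in the saved %rsp gives 0xffffff0003ea7ab8 as the slot, and the dump shows ehci_deallocate_qh+0x57 at exactly that address.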

> ehci_deallocate_qh::dis             
ehci_deallocate_qh:             pushq  %rbp
ehci_deallocate_qh+1:           movq   %rsp,%rbp
ehci_deallocate_qh+4:           subq   $0x10,%rsp
ehci_deallocate_qh+8:           movq   %rdi,-0x8(%rbp)
ehci_deallocate_qh+0xc:         movq   %rsi,-0x10(%rbp)
ehci_deallocate_qh+0x10:        pushq  %rbx
ehci_deallocate_qh+0x11:        pushq  %r12
ehci_deallocate_qh+0x13:        pushq  %r13
ehci_deallocate_qh+0x15:        subq   $0x8,%rsp
ehci_deallocate_qh+0x19:        movq   %rdi,%r12
ehci_deallocate_qh+0x1c:        movq   %rsi,%rbx
ehci_deallocate_qh+0x1f:        movq   0x128(%r12),%rdi
ehci_deallocate_qh+0x27:        leaq   0x10(%rbx),%rsi
ehci_deallocate_qh+0x2b:        call   +0x3ae00dc       <ddi_get32>
ehci_deallocate_qh+0x30:        andq   $0xffffffffffffffe0,%rax
ehci_deallocate_qh+0x34:        movq   %r12,%rdi
ehci_deallocate_qh+0x37:        movq   %rax,%rsi
ehci_deallocate_qh+0x3a:        call   +0x1671  <ehci_qtd_iommu_to_cpu>
ehci_deallocate_qh+0x3f:        movq   %rax,%r13
ehci_deallocate_qh+0x42:        testq  %r13,%r13
ehci_deallocate_qh+0x45:        je     +0x35    <ehci_deallocate_qh+0x7c>
ehci_deallocate_qh+0x47:        movq   0x160(%r12),%rdi
ehci_deallocate_qh+0x4f:        movq   %r13,%rsi
ehci_deallocate_qh+0x52:        call   +0x3ae00b5       <ddi_get32>
ehci_deallocate_qh+0x57:        movl   %eax,%esi
ehci_deallocate_qh+0x59:        movq   %r12,%rdi

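... and the stack shows the return address ehci_deallocate_qh+0x57, i.e. the call to
ddi_get32() at ehci_deallocate_qh+0x52.
It seems we take an offset from %rbx to get our %rsi for the first call to ddi_get32();
for the second call it's the value of %r13 (the result of ehci_qtd_iommu_to_cpu())
being moved into %rsi, and that's where we fall over. Both %r13 and %rsi hold
0xffffff023772af00 in the ::regs output above.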

In the source I see we're doing this:

    first_dummy_qtd = ehci_qtd_iommu_to_cpu(ehcip,
        (Get_QH(old_qh->qh_next_qtd) & EHCI_QH_NEXT_QTD_PTR));

Where Get_QH() decodes as:

    #define	Get_QH(addr)		ddi_get32(ehcip->ehci_qh_pool_mem_handle, \
    					(uint32_t *)&addr)

... so this is where the ddi_get32() comes into play.
Alas there are a number of calls to Get_QH() in this ehci_deallocate_qh() function.
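
To make the suspected failure mode concrete: the disassembly masks the 32-bit value read from the QH with 0xffffffe0 and hands it to ehci_qtd_iommu_to_cpu(), whose result (%r13) becomes the address for the second ddi_get32(). The sketch below assumes the conversion is a plain rebase from the pool's device-visible address to its kernel virtual address. That is an assumption, not confirmed against the Solaris 11 source, and the pool description, field names and helper names are invented for illustration. It only shows how a stale or garbage next-qTD value could turn into a pointer outside the pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	NEXT_QTD_PTR_MASK	0xffffffe0u	/* andq $0xffffffffffffffe0,%rax */

/* Hypothetical DMA pool: device-visible base, CPU base, size in bytes. */
struct pool_model {
	uint32_t	dma_base;
	uintptr_t	cpu_base;
	size_t		size;
};

/* Assumed conversion: rebase a device address onto the CPU mapping. */
uintptr_t
model_iommu_to_cpu(const struct pool_model *p, uint32_t dma_addr)
{
	assert(p != NULL);
	if (dma_addr == 0)
		return (0);
	return ((uintptr_t)(dma_addr - p->dma_base) + p->cpu_base);
}

/* Would dereferencing the converted pointer stay inside the pool? */
int
model_in_pool(const struct pool_model *p, uintptr_t cpu_addr)
{
	assert(p != NULL);
	return (cpu_addr >= p->cpu_base && cpu_addr - p->cpu_base < p->size);
}
```

In this model nothing checks the result before it is dereferenced, so any value in the QH that does not point into the pool produces an address like the unmapped one seen in %rsi.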


I suspect the failure to connect to the device port is causing some form of bogus/NULL address to appear in a structure here and we're tripping over it.

Change History

comment:1 Changed 2 years ago by michaln

  • Status changed from new to closed
  • Resolution set to invalid

Sorry, we do not accept bug reports against mismatched VirtualBox <-> Extension Pack versions. Unless you're running one of the Mountain Lion DPs, there is no point in running a 4.1.15 test build. In fact you should not use a test build unless specifically directed to do so; if you do, expect trouble (and no help).

Please go back to the official release (4.1.14) and the accompanying extension pack. If the error persists, then open a ticket against that combination.

comment:2 Changed 2 years ago by jayce

Hello there,

As I pointed out at the beginning of the bug report, I'm running 4.1.15 *because* I was hitting panics on my Mac when rebooting my guest virtual machine. Since I put that on I haven't had a panic under MacOS 10.6.8.

I would have installed guest extensions 4.1.15 but they do not appear to be available on your download web page. If this is available I'll happily try installing that and seeing if the panics in the guest OS (Solaris 11) are reproducible.

I know 4.1.15 vbox binaries and the 4.1.14 extension pack are not a good/supported option and I clearly mentioned this at the beginning being open and honest. You can see I performed a fair amount of work looking at the Solaris 11 dump, so I have to say I'm somewhat disappointed this has been closed without anyone actually discussing the matter with me.

If there is a problem in this code area come the next version then potentially you'll have a bunch of Solaris 11 boxes panicking, which I'd like to help prevent. Of course it's highly likely that the version number mismatch is the problem, so if I could install that and the problem goes away then hey we know that was the root cause. It's a pretty quick test and has very little impact on you guys if this is the fix.

On the other hand if I demonstrate the problem again then maybe there is something that needs investigating?

Just trying to work with you guys, see if someone can meet me half way here.



comment:3 Changed 2 years ago by frank

  • Status changed from closed to reopened
  • Resolution invalid deleted

Could you try the 4.1.15 Mac OS X test build together with this Extension Pack?

comment:4 Changed 2 years ago by jayce

Hi Frank,

I tried the 4.1.15 extension pack and had mixed results. The virtual machine did appear to hang with a USB stick; however, X wouldn't start. Technically the gdm service was running and Xorg was running, but I had no login box I could see, and the display had an odd red/yellow flashing output in the miniature display in the main VirtualBox window.

I had to revert back to 4.1.14 to see if that problem went away and I could get X back - I could. So I'll try upgrading to 4.1.15 (binaries and extension pack) again to see if the problem is reproducible.


