VirtualBox

#14779 closed defect (obsolete)
Opened 8 years ago; closed 6 years ago

Kernel panics with VirtualBox 5.0.8, possible network problem

Reported by: Thomas Dreibholz
Owned by:    (none)
Component:   network
Version:     VirtualBox 5.0.8
Keywords:    (none)
Cc:          (none)
Guest type:  Linux
Host type:   Linux

Description

I get regular kernel panics with VirtualBox 5.0.8 under Ubuntu Server 14.04 LTS (64-bit) when running an Ubuntu Server 12.04 LTS VM (64-bit). The issue seems to be a problem with IPv6 TCP offloading in vboxnetflt. See the attached picture of a stack trace. Both the host and the VM use IPv6 on 3 network interfaces. All 3 interfaces are bridged to Ethernet ports.

The problem is reproducible on my system: it occurs after the system has been running for some minutes. I did not observe the issue when running VirtualBox 4.3 on Ubuntu Server 12.04 LTS; the issue appeared after upgrading the system to Ubuntu Server 14.04 LTS and VirtualBox 5.0.8.

Attachments (9)

VBox-Bug1.png (81.5 KB ) - added by Thomas Dreibholz 8 years ago.
Screenshot of the stack trace
dmesg.txt (75.6 KB ) - added by Thomas Dreibholz 8 years ago.
dmesg output
VBox-Bug2.png (84.0 KB ) - added by Thomas Dreibholz 8 years ago.
Another stack trace
dmesg.201511040959 (108.0 KB ) - added by Thomas Dreibholz 8 years ago.
dmesg output obtained via kexec
ip4.txt (24.5 KB ) - added by Thomas Dreibholz 8 years ago.
IPv4 configuration of the VM on the crashed system
ip6.txt (26.8 KB ) - added by Thomas Dreibholz 8 years ago.
IPv6 configuration of the VM on the crashed system
tunnel4.txt (9.3 KB ) - added by Thomas Dreibholz 8 years ago.
GRE tunnel configuration of the VM on the crashed system
vbox.patch (177 bytes ) - added by Thomas Dreibholz 8 years ago.
A patch that seems to prevent the issue
VBoxNetFlt-linux.diff (569 bytes ) - added by Valery Ushakov 8 years ago.


Change History (43)

by Thomas Dreibholz, 8 years ago

Attachment: VBox-Bug1.png added

Screenshot of the stack trace

comment:1 by Valery Ushakov, 8 years ago

Please, can you provide real dmesg as text.

by Thomas Dreibholz, 8 years ago

Attachment: dmesg.txt added

dmesg output

by Thomas Dreibholz, 8 years ago

Attachment: VBox-Bug2.png added

Another stack trace

in reply to:  1 ; comment:2 by Thomas Dreibholz, 8 years ago

Replying to vushakov:

Please, can you provide real dmesg as text.

I attached a dmesg output.

in reply to:  2 ; comment:3 by Valery Ushakov, 8 years ago

Replying to Thomas Dreibholz:

Replying to vushakov:

Please, can you provide real dmesg as text.

I attached a dmesg output.

I mean can you provide the dmesg from the crash as text, not as a screenshot.

in reply to:  3 comment:4 by Thomas Dreibholz, 8 years ago

Replying to vushakov:

Replying to Thomas Dreibholz:

Replying to vushakov:

Please, can you provide real dmesg as text.

I attached a dmesg output.

I mean can you provide the dmesg from the crash as text, not as a screenshot.

No, unfortunately, the machine is remote. All I can get is a screenshot of the HP iLO Java applet showing the console screen. It is graphics output.

comment:5 by Valery Ushakov, 8 years ago

I see gre_gso_segment in both stack traces. Can you provide more details about your GRE setup? Do you have a lot of GRE traffic, or does the crash happen on the first GRE packet? On the first large GRE packet? A packet capture might be handy.

comment:6 by Thomas Dreibholz, 8 years ago

I have indeed a lot of GRE tunnels inside the VM. The VM is part of the NorNet Core setup (see https://www.nntb.no/nornet-core/ and https://www.nntb.no/pub/nornet-configuration/NorNetCore-Sites.html). The VM has 36 IPv4 GRE tunnels, as well as 26 IPv6-over-IPv6 tunnels configured. It definitely does not crash on the first packet via GRE, since there is an almost steady flow of packets due to RTT measurements. I am not sure about fragmentation. Unfortunately, I cannot easily generate a packet trace since the machine is remote. However, I could e.g. set up some test GRE tunnels and try to investigate behaviour with fragmentation if this could help debugging.

in reply to:  6 comment:7 by Thomas Dreibholz, 8 years ago

Replying to Thomas Dreibholz:

I have indeed a lot of GRE tunnels inside the VM. The VM is part of the NorNet Core setup (see https://www.nntb.no/nornet-core/ and https://www.nntb.no/pub/nornet-configuration/NorNetCore-Sites.html). The VM has 36 IPv4 GRE tunnels, as well as 26 IPv6-over-IPv6 tunnels configured. It definitely does not crash on the first packet via GRE, since there is an almost steady flow of packets due to RTT measurements. I am not sure about fragmentation. Unfortunately, I cannot easily generate a packet trace since the machine is remote. However, I could e.g. set up some test GRE tunnels and try to investigate behaviour with fragmentation if this could help debugging.

Note that the GRE tunnels are configured only inside the VM. The host machine has no GRE tunnels configured.

comment:8 by Valery Ushakov, 8 years ago

So it looks like you are hitting the BUG_ON(len) in the skb checksum code. I wonder if this happens on the first packet for which GRO kicks in.

comment:9 by Thomas Dreibholz, 8 years ago

I now also observe the kernel panics on 4 other systems with the same Ubuntu Server 14.04 LTS/VirtualBox 5.0.8 combination. All 5 affected systems have in common that the primary GRE tunnel (transporting a lot of management traffic) also transports IPv6 traffic, together with IPv4 traffic, over IPv4. I have 8 more systems of the same Ubuntu/VirtualBox versions, but these systems do not use IPv6 on their primary GRE tunnel. These systems seem to be stable. So, I assume there is some issue with IPv6 over GRE.

I will perform some further tests. At least, a problem when sending the first IPv6 packet over a tunnel seems to be highly unlikely. The machines at least run some minutes to hours before the kernel panic happens.

by Thomas Dreibholz, 8 years ago

Attachment: dmesg.201511040959 added

dmesg output obtained via kexec

comment:10 by Thomas Dreibholz, 8 years ago

I have now installed kdump and can generate kernel dumps of the crashes. I already attached one of the resulting dmesg files (dmesg.201511040959). This may be interesting:

... (many more of the following messages) ...
[  419.959822] VBoxNetFlt: Failed to segment a packet (-93).
[  420.265421] VBoxNetFlt: Failed to segment a packet (-93).
[  420.875506] VBoxNetFlt: Failed to segment a packet (-93).
[  421.478203] VBoxNetFlt: Failed to segment a packet (-93).
[  422.029902] VBoxNetFlt: Failed to segment a packet (-93).
[  473.713682] VBoxNetFlt: Failed to segment a packet (-93).
[  473.959018] VBoxNetFlt: Failed to segment a packet (-93).
[  474.096466] VBoxNetFlt: Failed to segment a packet (-93).
[  474.309785] VBoxNetFlt: Failed to segment a packet (-93).
[  474.334235] VBoxNetFlt: Failed to segment a packet (-93).
[  474.414036] ------------[ cut here ]------------
[  474.414118] kernel BUG at /build/linux-lts-vivid-Nr0FoT/linux-lts-vivid-3.19.0/net/core/skbuff.c:2135!

comment:11 by Thomas Dreibholz, 8 years ago

I have now also stored the full output of some kernel dumps under https://www.nntb.no/~nornetpp/temp/crash.tar.gz . The file size is about 750 MiB.

in reply to:  9 comment:12 by Valery Ushakov, 8 years ago

Replying to Thomas Dreibholz:

All 5 affected systems have in common that the primary GRE tunnel (transporting a lot of management traffic) also transports IPv6 traffic, together with IPv4 traffic, over IPv4.

Please, can you provide example GRE setup instructions for this?

by Thomas Dreibholz, 8 years ago

Attachment: ip4.txt added

IPv4 configuration of the VM on the crashed system

by Thomas Dreibholz, 8 years ago

Attachment: ip6.txt added

IPv6 configuration of the VM on the crashed system

by Thomas Dreibholz, 8 years ago

Attachment: tunnel4.txt added

GRE tunnel configuration of the VM on the crashed system

comment:13 by Thomas Dreibholz, 8 years ago

I attached IPv4, IPv6 and GRE tunnel configurations.

Probably most interesting is the main tunnel gre1-1-1:

nornetpp@tromsoe:~$ ip tunnel show gre1-1-1
gre1-1-1: gre/ip  remote 158.39.4.2  local 129.242.157.228  ttl 255  key 16843777

nornetpp@tromsoe:~$ ip -4 addr show dev gre1-1-1
47: gre1-1-1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1472 qdisc noqueue state UNKNOWN group default 
    inet 192.168.43.150 peer 192.168.43.151/32 scope global gre1-1-1
       valid_lft forever preferred_lft forever

nornetpp@tromsoe:~$ ip -6 addr show dev gre1-1-1
47: gre1-1-1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1472 
    inet6 2001:700:4100:ff:ffff:401:101:1/112 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::4444:1:4:1:1/64 scope link 
       valid_lft forever preferred_lft forever
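For reference, a setup like the one shown above can be recreated along these lines. This is a rough sketch (requires root): the outer remote/local endpoint addresses and the inner IPv6 prefix are documentation placeholders, not the real NorNet addresses; only the tunnel key, MTU, and inner IPv4 peer addresses are taken from the output above.

```shell
# Create an IPv4 GRE tunnel and carry both IPv4 and IPv6 over it,
# similar to gre1-1-1 above (placeholder outer addresses).
ip tunnel add gre-test mode gre \
    remote 198.51.100.2 local 198.51.100.1 ttl 255 key 16843777
ip link set gre-test mtu 1472 up

# Inner IPv4 point-to-point addresses, as in the 'ip -4 addr show' output:
ip -4 addr add 192.168.43.150 peer 192.168.43.151/32 dev gre-test

# IPv6 over the IPv4 GRE tunnel (placeholder prefix):
ip -6 addr add 2001:db8:ff::1/112 dev gre-test
```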

by Thomas Dreibholz, 8 years ago

Attachment: vbox.patch added

A patch that seems to prevent the issue

comment:15 by Thomas Dreibholz, 8 years ago

I added a patch to VirtualBox that seems to prevent the issue by turning off the offloading features in VBoxNetFlt-linux.c. So far, I have not observed further crashes after installing a patched VirtualBox on my machines.

comment:16 by Thomas Dreibholz, 8 years ago

So far, I have not observed any more kernel panics after applying my patch. The patch simply comments out these settings:

# define VBOXNETFLT_WITH_GSO                1
# define VBOXNETFLT_WITH_GSO_XMIT_HOST      1
# define VBOXNETFLT_WITH_GSO_XMIT_WIRE      1
# define VBOXNETFLT_WITH_GSO_RECV           1
# define VBOXNETFLT_WITH_GRO                1

That is, one of these settings causes the crashes.
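The attached vbox.patch (177 bytes) is not reproduced in the ticket, so the following is only an illustration of the workaround's idea, demonstrated on a sample snippet rather than the real VBoxNetFlt-linux.c; the sed expression and file path are assumptions, not the actual patch.

```shell
# Sample snippet standing in for the define block in VBoxNetFlt-linux.c:
cat > /tmp/gso-defines.h <<'EOF'
# define VBOXNETFLT_WITH_GSO                1
# define VBOXNETFLT_WITH_GSO_XMIT_HOST      1
# define VBOXNETFLT_WITH_GSO_XMIT_WIRE      1
# define VBOXNETFLT_WITH_GSO_RECV           1
# define VBOXNETFLT_WITH_GRO                1
EOF

# Comment out every VBOXNETFLT_WITH_GSO*/GRO define, as the patch does:
sed -e 's|^\(# *define VBOXNETFLT_WITH_G[SR]O\)|// \1|' /tmp/gso-defines.h
```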

Last edited 8 years ago by Valery Ushakov.

comment:17 by Thomas Dreibholz, 8 years ago

If necessary, I could test some specific combinations of the options and/or debug code on one of my machines.

comment:18 by Valery Ushakov, 8 years ago

Thanks for the update. Yes, the crash is triggered by us calling skb_gso_segment (comment:5) and we probably do some wrong modifications to the skb before that. I haven't yet had a chance to look closer into this. I'll try to get to it this week. Sorry, other stuff needs attention too...

by Valery Ushakov, 8 years ago

Attachment: VBoxNetFlt-linux.diff added

comment:19 by Valery Ushakov, 8 years ago

Actually, on a hunch... please, can you try the above patch (after re-enabling the GSO code your patch disables)? Not tested, but it looks like a cut-n-paste typo which might result in BUG_ON(len) later because of wrong header size.

comment:20 by Thomas Dreibholz, 8 years ago

I just installed a version with your patch on one of my machines. I will report what happens ...

comment:21 by Thomas Dreibholz, 8 years ago

Unfortunately, the patch does not solve the problem. The kernel panics still happen. I will again provide some kernel dumps ...

Note, I tried with VirtualBox-5.0.8. I could try VirtualBox-5.0.10 with your patch now.

comment:22 by Thomas Dreibholz, 8 years ago

in reply to:  21 comment:23 by Valery Ushakov, 8 years ago

Replying to Thomas Dreibholz:

Unfortunately, the patch does not solve the problem. The kernel panics still happen. I will again provide some kernel dumps ...

Note, I tried with VirtualBox-5.0.8. I could try VirtualBox-5.0.10 with your patch now.

Thanks for trying it, it's not necessary to try 5.0.10. I also don't think I need any more crash dumps for now.

comment:24 by Thomas Dreibholz, 8 years ago

I already tried 5.0.10 -> no change, i.e. the kernel panics happen as before.

comment:25 by Thomas Dreibholz, 8 years ago

I varied my work-around patch by keeping

# define VBOXNETFLT_WITH_GSO                1
# define VBOXNETFLT_WITH_GSO_XMIT_HOST      1
# define VBOXNETFLT_WITH_GSO_XMIT_WIRE      1
# define VBOXNETFLT_WITH_GSO_RECV           1

and just commenting out one setting:

# define VBOXNETFLT_WITH_GRO                1

So far, the system has run through the whole weekend without a kernel panic. VBOXNETFLT_WITH_GRO seems to cause the problem.

comment:26 by Valery Ushakov, 8 years ago

Yes, I'm also looking at that code right now. The skjennungen2 crash you posted in comment:22 is on an skb with

  len = 4294904698, 
  mac_len = 65430, 
  network_header = 0, 
  mac_header = 106, 

where mac_len, which is -106 when interpreted as a signed value, must be coming from VBoxNetFlt-linux.c:1109 inside the VBOXNETFLT_WITH_GRO ifdef.
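The signed reinterpretation can be double-checked in the shell (a sanity check of the numbers above, not VirtualBox code): mac_len is a 16-bit field, and 65430 is exactly the two's-complement encoding of -106, which matches mac_header = 106 with network_header = 0 underflowing in a "network_header - mac_header" style computation.

```shell
# Interpret the unsigned 16-bit value 65430 as signed:
echo $(( 65430 - 65536 ))        # prints -106

# And the reverse: two's complement of 106 in 16 bits:
echo $(( (106 ^ 0xFFFF) + 1 ))   # prints 65430
```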

comment:27 by Thomas Dreibholz, 8 years ago

Unfortunately, commenting out only # define VBOXNETFLT_WITH_GRO does not fix the problem. It crashed again.

I am currently trying to comment out just:

# define VBOXNETFLT_WITH_GSO_XMIT_HOST      1
# define VBOXNETFLT_WITH_GSO_XMIT_WIRE      1
# define VBOXNETFLT_WITH_GSO_RECV           1
Version 1, edited 8 years ago by Thomas Dreibholz.

in reply to:  27 comment:28 by Thomas Dreibholz, 8 years ago

This also results in the crashes.

I am currently trying to comment out all VBOXNETFLT_WITH_GSO* settings, just leaving VBOXNETFLT_WITH_GRO.

comment:29 by Thomas Dreibholz, 8 years ago

Is there any news on locating the bug? If a possible fix is available, I could test it.

Last edited 8 years ago by Thomas Dreibholz.

in reply to:  29 ; comment:30 by Valery Ushakov, 8 years ago

Replying to Thomas Dreibholz:

Is there any news on locating the bug? If a possible fix is available, I could test it.

Unfortunately, I have been unable to reproduce the problem locally so far.

comment:31 by Thomas Dreibholz, 8 years ago

I tried replacing VirtualBox with KVM to see whether this would solve the problem. However, the same problem also appears when using KVM directly. I therefore filed a kernel bug report as well: https://bugzilla.kernel.org/show_bug.cgi?id=109071 .

in reply to:  30 comment:32 by Thomas Dreibholz, 8 years ago

Replying to vushakov:

Replying to Thomas Dreibholz:

Is there any news on locating the bug? If a possible fix is available, I could test it.

Unfortunately, I have been unable to reproduce the problem locally so far.

If it may help, I could build and install a custom kernel e.g. with some kprintf() calls. It may also be possible to provide you access to a test setup machine.

comment:33 by Valery Ushakov, 8 years ago

Thank you for the update.

If you see the problem with KVM as well, then it's most likely a kernel bug with GRO/GSO of GRE. The VBox code basically does nothing much beyond skb_copy() and skb_gso_segment() on the passed skb, so I was starting to suspect as much. I'd rather wait for kernel folks to do their investigation. Unfortunately, we don't have enough resources to duplicate their effort, so we appreciate your offer, but won't take you up on it just yet.

comment:34 by Valery Ushakov, 6 years ago

Resolution: obsolete
Status: new → closed
