VirtualBox

Ticket #14779 (closed defect: obsolete)

Opened 6 years ago

Last modified 4 years ago

Kernel panics with VirtualBox 5.0.8, possible network problem

Reported by: Thomas Dreibholz Owned by:
Component: network Version: VirtualBox 5.0.8
Keywords: Cc:
Guest type: Linux Host type: Linux

Description

I get regular kernel panics with VirtualBox 5.0.8 under Ubuntu Server 14.04 LTS (64-bit) when running an Ubuntu Server 12.04 LTS VM (64-bit). The issue seems to be a problem with IPv6 TCP offloading in vboxnetflt. See the attached picture of a stack trace. Both the host and the VM use IPv6 on 3 network interfaces, all of which are bridged to Ethernet ports.

The problem is reproducible on my system, i.e. it occurs after the system has been running for some minutes. I did not observe the issue when running VirtualBox 4.3 on Ubuntu Server 12.04 LTS; the issue appeared after upgrading the system to Ubuntu Server 14.04 LTS and VirtualBox 5.0.8.

Attachments

VBox-Bug1.png Download (81.5 KB) - added by Thomas Dreibholz 6 years ago.
Screenshot of the stack trace
dmesg.txt Download (75.6 KB) - added by Thomas Dreibholz 6 years ago.
dmesg output
VBox-Bug2.png Download (84.0 KB) - added by Thomas Dreibholz 6 years ago.
Another stack trace
dmesg.201511040959 Download (108.0 KB) - added by Thomas Dreibholz 6 years ago.
dmesg output obtained via kexec
ip4.txt Download (24.5 KB) - added by Thomas Dreibholz 6 years ago.
IPv4 configuration of the VM on the crashed system
ip6.txt Download (26.8 KB) - added by Thomas Dreibholz 6 years ago.
IPv6 configuration of the VM on the crashed system
tunnel4.txt Download (9.3 KB) - added by Thomas Dreibholz 6 years ago.
GRE tunnel configuration of the VM on the crashed system
vbox.patch Download (177 bytes) - added by Thomas Dreibholz 6 years ago.
A patch that seems to prevent the issue
VBoxNetFlt-linux.diff Download (569 bytes) - added by vushakov 6 years ago.

Change History

Changed 6 years ago by Thomas Dreibholz

Screenshot of the stack trace

comment:1 follow-up: ↓ 2 Changed 6 years ago by vushakov

Please, can you provide real dmesg as text.

Changed 6 years ago by Thomas Dreibholz

dmesg output

Changed 6 years ago by Thomas Dreibholz

Another stack trace

comment:2 in reply to: ↑ 1 ; follow-up: ↓ 3 Changed 6 years ago by Thomas Dreibholz

Replying to vushakov:

Please, can you provide real dmesg as text.

I attached a dmesg output.

comment:3 in reply to: ↑ 2 ; follow-up: ↓ 4 Changed 6 years ago by vushakov

Replying to Thomas Dreibholz:

Replying to vushakov:

Please, can you provide real dmesg as text.

I attached a dmesg output.

I mean can you provide the dmesg from the crash as text, not as a screenshot.

comment:4 in reply to: ↑ 3 Changed 6 years ago by Thomas Dreibholz

Replying to vushakov:

Replying to Thomas Dreibholz:

Replying to vushakov:

Please, can you provide real dmesg as text.

I attached a dmesg output.

I mean can you provide the dmesg from the crash as text, not as a screenshot.

No, unfortunately, the machine is remote. All I can get is a screenshot of the HP iLO Java applet showing the console screen. It is graphics output.

comment:5 Changed 6 years ago by vushakov

I see gre_gso_segment in both stack traces. Can you provide more details about your GRE setup? Do you have a lot of GRE traffic, or does the crash happen on the first GRE packet? On the first large GRE packet? A packet capture might be handy.
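A capture such as the one requested here could be taken on the host's bridged uplink; a minimal sketch, assuming the interface is named eth0 (GRE is IP protocol 47):

```shell
# Capture all GRE-encapsulated traffic on the assumed uplink interface
# and write full packets to a file for later inspection in Wireshark.
tcpdump -i eth0 -s 0 -w gre.pcap 'ip proto 47'
```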

comment:6 follow-up: ↓ 7 Changed 6 years ago by Thomas Dreibholz

I have indeed a lot of GRE tunnels inside the VM. The VM is part of the NorNet Core setup (see  https://www.nntb.no/nornet-core/ and  https://www.nntb.no/pub/nornet-configuration/NorNetCore-Sites.html). The VM has 36 IPv4 GRE tunnels, as well as 26 IPv6-over-IPv6 tunnels configured. It definitely does not crash on the first packet via GRE, since there is an almost steady flow of packets due to RTT measurements. I am not sure about fragmentation. Unfortunately, I cannot easily generate a packet trace since the machine is remote. However, I could e.g. set up some test GRE tunnels and try to investigate behaviour with fragmentation if this could help debugging.

comment:7 in reply to: ↑ 6 Changed 6 years ago by Thomas Dreibholz

Replying to Thomas Dreibholz:

I have indeed a lot of GRE tunnels inside the VM. The VM is part of the NorNet Core setup (see  https://www.nntb.no/nornet-core/ and  https://www.nntb.no/pub/nornet-configuration/NorNetCore-Sites.html). The VM has 36 IPv4 GRE tunnels, as well as 26 IPv6-over-IPv6 tunnels configured. It definitely does not crash on the first packet via GRE, since there is an almost steady flow of packets due to RTT measurements. I am not sure about fragmentation. Unfortunately, I cannot easily generate a packet trace since the machine is remote. However, I could e.g. set up some test GRE tunnels and try to investigate behaviour with fragmentation if this could help debugging.

Note that the GRE tunnels exist only inside the VM. The host machine has no GRE tunnels configured.

comment:8 Changed 6 years ago by vushakov

So it looks like you are hitting the BUG_ON(len); in the skb checksum code. I wonder if this happens on the first packet for which GRO kicks in.

comment:9 follow-up: ↓ 12 Changed 6 years ago by Thomas Dreibholz

I now also observe the kernel panics on 4 other systems with the same Ubuntu Server 14.04 LTS/VirtualBox 5.0.8 combination. All 5 affected systems have in common that the primary GRE tunnel (which transports a lot of management traffic) also carries IPv6 traffic, together with IPv4 traffic, over IPv4. I have 8 more systems with the same Ubuntu/VirtualBox versions, but these systems do not use IPv6 on their primary GRE tunnel, and they seem to be stable. So I assume there is some issue with IPv6 over GRE.

I will perform some further tests. A problem triggered by the very first IPv6 packet sent over a tunnel seems highly unlikely, though: the machines run for some minutes to hours before the kernel panic happens.
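One way to narrow this down would be to deliberately push large IPv6 payloads over a tunnel so that segmentation is exercised; a sketch, where the peer address, packet size, and the presence of an iperf3 endpoint are all assumptions:

```shell
# Force large IPv6 packets over the tunnel to exercise GSO/fragmentation
# (peer address and payload size are hypothetical).
ping6 -c 10 -s 3000 2001:db8::2

# Alternatively, a sustained IPv6 TCP stream, assuming an iperf3 server
# is reachable at the tunnel peer.
iperf3 -6 -c 2001:db8::2
```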

Changed 6 years ago by Thomas Dreibholz

dmesg output obtained via kexec

comment:10 Changed 6 years ago by Thomas Dreibholz

I have now installed kdump and can generate kernel dumps of the crashes. I have already attached one of the resulting dmesg files (dmesg.201511040959). This part may be interesting:

... (many more of the following messages) ...
[  419.959822] VBoxNetFlt: Failed to segment a packet (-93).
[  420.265421] VBoxNetFlt: Failed to segment a packet (-93).
[  420.875506] VBoxNetFlt: Failed to segment a packet (-93).
[  421.478203] VBoxNetFlt: Failed to segment a packet (-93).
[  422.029902] VBoxNetFlt: Failed to segment a packet (-93).
[  473.713682] VBoxNetFlt: Failed to segment a packet (-93).
[  473.959018] VBoxNetFlt: Failed to segment a packet (-93).
[  474.096466] VBoxNetFlt: Failed to segment a packet (-93).
[  474.309785] VBoxNetFlt: Failed to segment a packet (-93).
[  474.334235] VBoxNetFlt: Failed to segment a packet (-93).
[  474.414036] ------------[ cut here ]------------
[  474.414118] kernel BUG at /build/linux-lts-vivid-Nr0FoT/linux-lts-vivid-3.19.0/net/core/skbuff.c:2135!

comment:11 Changed 6 years ago by Thomas Dreibholz

I have now also stored the full output of some kernel dumps at https://www.nntb.no/~nornetpp/temp/crash.tar.gz . The file size is about 750 MiB.
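Such vmcore files can be inspected with the crash utility; a sketch, where the debug-kernel and dump paths are assumptions that depend on the local installation:

```shell
# Open a kdump vmcore against the matching debug vmlinux
# (both paths are hypothetical examples).
crash /usr/lib/debug/boot/vmlinux-3.19.0-generic /var/crash/201511040959/vmcore

# At the crash prompt:
#   bt    - backtrace of the panicking task
#   log   - kernel log buffer, including the VBoxNetFlt messages
#   dis   - disassemble around the faulting instruction
```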

comment:12 in reply to: ↑ 9 Changed 6 years ago by vushakov

Replying to Thomas Dreibholz:

All 5 affected systems have in common that the primary GRE tunnel (transporting a lot of management traffic) also transports IPv6 traffic, together with IPv4 traffic, over IPv4.

Please, can you provide example GRE setup instructions for this?

Changed 6 years ago by Thomas Dreibholz

IPv4 configuration of the VM on the crashed system

Changed 6 years ago by Thomas Dreibholz

IPv6 configuration of the VM on the crashed system

Changed 6 years ago by Thomas Dreibholz

GRE tunnel configuration of the VM on the crashed system

comment:13 Changed 6 years ago by Thomas Dreibholz

I attached IPv4, IPv6 and GRE tunnel configurations.

Probably most interesting is the main tunnel gre1-1-1:

nornetpp@tromsoe:~$ ip tunnel show gre1-1-1
gre1-1-1: gre/ip  remote 158.39.4.2  local 129.242.157.228  ttl 255  key 16843777

nornetpp@tromsoe:~$ ip -4 addr show dev gre1-1-1
47: gre1-1-1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1472 qdisc noqueue state UNKNOWN group default 
    inet 192.168.43.150 peer 192.168.43.151/32 scope global gre1-1-1
       valid_lft forever preferred_lft forever

nornetpp@tromsoe:~$ ip -6 addr show dev gre1-1-1
47: gre1-1-1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1472 
    inet6 2001:700:4100:ff:ffff:401:101:1/112 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::4444:1:4:1:1/64 scope link 
       valid_lft forever preferred_lft forever
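For reference, a comparable dual-stack GRE tunnel can be recreated with iproute2 along these lines; the endpoints, addresses, and key below are placeholders, not the NorNet values:

```shell
# IPv4 GRE tunnel carrying both IPv4 and IPv6 (all values hypothetical)
ip tunnel add gre-test mode gre local 198.51.100.1 remote 198.51.100.2 ttl 255 key 1234
ip link set gre-test mtu 1472 up

# IPv4 point-to-point addressing on the tunnel
ip addr add 192.0.2.1 peer 192.0.2.2/32 dev gre-test

# Global IPv6 address on the same tunnel, as in the gre1-1-1 setup above
ip -6 addr add 2001:db8:ff::1/112 dev gre-test
```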

Changed 6 years ago by Thomas Dreibholz

A patch that seems to prevent the issue

comment:15 Changed 6 years ago by Thomas Dreibholz

I added a patch to VirtualBox that seems to prevent the issue by turning off the offloading features in VBoxNetFlt-linux.c. So far, I have not observed further crashes after installing the patched VirtualBox on my machines.

comment:16 Changed 6 years ago by Thomas Dreibholz

So far, I have not observed any more kernel panics after applying my patch. The patch simply comments out these settings:

# define VBOXNETFLT_WITH_GSO                1
# define VBOXNETFLT_WITH_GSO_XMIT_HOST      1
# define VBOXNETFLT_WITH_GSO_XMIT_WIRE      1
# define VBOXNETFLT_WITH_GSO_RECV           1
# define VBOXNETFLT_WITH_GRO                1

That is, one of these settings causes the crashes.
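A related experiment on the host side (not what the patch does, just a runtime analogue) would be to disable the corresponding offloads on the bridged interface with ethtool; the interface name is an assumption:

```shell
# Turn off segmentation/receive offloads on the host NIC (name assumed)
ethtool -K eth0 gso off gro off tso off

# Verify the resulting offload settings
ethtool -k eth0
```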

Last edited 6 years ago by vushakov (previous) (diff)

comment:17 Changed 6 years ago by Thomas Dreibholz

If necessary, I could test some specific combinations of the options and/or debug code on one of my machines.

comment:18 Changed 6 years ago by vushakov

Thanks for the update. Yes, the crash is triggered by us calling skb_gso_segment (comment:5), and we probably make some wrong modifications to the skb before that. I haven't yet had a chance to look more closely at this. I'll try to get to it this week. Sorry, other things need attention too...

Changed 6 years ago by vushakov

comment:19 Changed 6 years ago by vushakov

Actually, on a hunch... please, can you try the above patch (after re-enabling the GSO code your patch disables)? It is not tested, but it looks like a cut-and-paste typo which might result in BUG_ON(len) later because of a wrong header size.

comment:20 Changed 6 years ago by Thomas Dreibholz

I just installed a version with your patch on one of my machines. I will report what happens ...

comment:21 follow-up: ↓ 23 Changed 6 years ago by Thomas Dreibholz

Unfortunately, the patch does not solve the problem. The kernel panics still happen. I will again provide some kernel dumps ...

Note, I tried with VirtualBox-5.0.8. I could try VirtualBox-5.0.10 with your patch now.

comment:22 Changed 6 years ago by Thomas Dreibholz

comment:23 in reply to: ↑ 21 Changed 6 years ago by vushakov

Replying to Thomas Dreibholz:

Unfortunately, the patch does not solve the problem. The kernel panics still happen. I will again provide some kernel dumps ...

Note, I tried with VirtualBox-5.0.8. I could try VirtualBox-5.0.10 with your patch now.

Thanks for trying it, it's not necessary to try 5.0.10. I also don't think I need any more crash dumps for now.

comment:24 Changed 6 years ago by Thomas Dreibholz

I have already tried 5.0.10: no change, i.e. the kernel panics happen as before.

comment:25 Changed 6 years ago by Thomas Dreibholz

I varied my work-around patch by keeping

# define VBOXNETFLT_WITH_GSO                1
# define VBOXNETFLT_WITH_GSO_XMIT_HOST      1
# define VBOXNETFLT_WITH_GSO_XMIT_WIRE      1
# define VBOXNETFLT_WITH_GSO_RECV           1

and just commenting out one setting:

# define VBOXNETFLT_WITH_GRO                1

So far, the kernel has worked during the whole weekend without a kernel panic. VBOXNETFLT_WITH_GRO seems to cause the problem.

comment:26 Changed 6 years ago by vushakov

Yes, I'm also looking at that code right now. The skjennungen2 crash you posted in comment:22 crashes on an skb with

  len = 4294904698, 
  mac_len = 65430, 
  network_header = 0, 
  mac_header = 106, 

where mac_len, which is -106 when interpreted as a signed value, must be coming from VBoxNetFlt-linux.c:1109 inside the VBOXNETFLT_WITH_GRO ifdef.
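The negative interpretation can be checked with plain shell arithmetic, given that mac_len is an unsigned 16-bit field and len an unsigned 32-bit field in struct sk_buff:

```shell
# 65430 read back as a signed 16-bit value
echo $((65430 - (1 << 16)))        # -106

# 4294904698 read back as a signed 32-bit value
echo $((4294904698 - (1 << 32)))   # -62598
```

Both fields have underflowed, consistent with a negative header length being subtracted somewhere before segmentation.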

comment:27 follow-up: ↓ 28 Changed 6 years ago by Thomas Dreibholz

Unfortunately, # define VBOXNETFLT_WITH_GRO is not the problem after all: it crashed again.

I am currently trying to comment out just:

# define VBOXNETFLT_WITH_GSO_XMIT_HOST      1
# define VBOXNETFLT_WITH_GSO_XMIT_WIRE      1
# define VBOXNETFLT_WITH_GSO_RECV           1

That is:

# define VBOXNETFLT_WITH_GSO                1
# define VBOXNETFLT_WITH_GRO                1

remain active.

Last edited 6 years ago by Thomas Dreibholz (previous) (diff)

comment:28 in reply to: ↑ 27 Changed 6 years ago by Thomas Dreibholz

This also results in the crashes.

I am currently trying to comment out all VBOXNETFLT_WITH_GSO* settings, just leaving VBOXNETFLT_WITH_GRO.

comment:29 follow-up: ↓ 30 Changed 6 years ago by Thomas Dreibholz

Is there any news on locating the bug? If a possible fix is available, I could test it.

Last edited 6 years ago by Thomas Dreibholz (previous) (diff)

comment:30 in reply to: ↑ 29 ; follow-up: ↓ 32 Changed 6 years ago by vushakov

Replying to Thomas Dreibholz:

Is there any news on locating the bug? If a possible fix is available, I could test it.

Unfortunately, I have been unable to reproduce the problem locally so far.

comment:31 Changed 6 years ago by Thomas Dreibholz

I tried replacing VirtualBox with KVM to see whether that would solve the problem. However, when using KVM directly, the same problem still appears. I have therefore filed a kernel bug report as well: https://bugzilla.kernel.org/show_bug.cgi?id=109071 .

comment:32 in reply to: ↑ 30 Changed 6 years ago by Thomas Dreibholz

Replying to vushakov:

Replying to Thomas Dreibholz:

Is there any news on locating the bug? If a possible fix is available, I could test it.

Unfortunately, I have been unable to reproduce the problem locally so far.

If it would help, I could build and install a custom kernel, e.g. with some additional printk() calls. It may also be possible to provide you with access to a test machine.

comment:33 Changed 6 years ago by vushakov

Thank you for the update.

If you see the problem with KVM as well, then it's most likely a kernel bug in the GRO/GSO handling of GRE. The VBox code does little beyond skb_copy() and skb_gso_segment() on the passed skb, so I was starting to suspect as much. I'd rather wait for the kernel folks to do their investigation. Unfortunately, we don't have the resources to duplicate their effort, so we appreciate your offer but won't take you up on it just yet.

comment:34 Changed 4 years ago by vushakov

  • Status changed from new to closed
  • Resolution set to obsolete