Opened 13 years ago
Closed 13 years ago
#8755 closed defect (fixed)
Ubuntu 10.04 kernel panic on high network load => Fixed in SVN
Reported by: | Frederick Ryckbosch | Owned by: | |
---|---|---|---|
Component: | network | Version: | VirtualBox 4.0.6 |
Keywords: | e1000 | Cc: | |
Guest type: | Linux | Host type: | Linux |
Description (last modified by )
Setup:
- VirtualBox 4.0.4
- Ubuntu 10.04 host (64 bit)
- 1 Ubuntu 10.04 guest (64 bit)
- Running on a quad core AMD Opteron
- Bridged network, using e1000.
Workload:
- 4 concurrent HTTP downloads with wget from a local server (over a gigabit link)
Problem:
- No problems when using 1 core
- Kernel panic in guest when using 4 cores
The kernel panic always occurs in e1000_clean.
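To reproduce this, the workload above (4 concurrent HTTP downloads from a local server) can be approximated with a small script. This is a minimal sketch, not the reporter's actual test. It assumes Python 3 in the guest and uses only the standard library. Each worker discards the data as it arrives, roughly like wget -O /dev/null.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fetch(url: str, chunk_size: int = 64 * 1024) -> int:
    """Download url, discarding the data, and return the number of bytes received."""
    total = 0
    with urllib.request.urlopen(url) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:  # empty read means end of body
                break
            total += len(chunk)
    return total


def run_load(url: str, workers: int = 4) -> List[int]:
    """Run `workers` downloads of the same URL in parallel threads; return bytes per download."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, [url] * workers))
```

For example, run_load("http://192.168.1.10/large-file.bin") fetches the file four times in parallel. The address and file name are hypothetical placeholders. Since the panic reportedly took up to half an hour to appear, call it in a loop.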
Attachments (10)
Change History (42)
by , 13 years ago
by , 13 years ago
Attachment: | kernel_panic added |
---|
comment:1 by , 13 years ago
comment:3 by , 13 years ago
Same problem here. Also under "heavy" load. The virtual machine is running a web server with PHP.
FYI, the virtual machine has two network adapters attached (one on a public subnet, one on a private subnet; the latter is our private VLAN on the host).
I'm attaching the kernel_panic dump, and the VBox.log.
comment:4 by , 13 years ago
As fryckbos said earlier, the latest VirtualBox 4.0.6 does not solve the problem.
comment:5 by , 13 years ago
Description: | modified (diff) |
---|---|
Version: | VirtualBox 4.0.4 → VirtualBox 4.0.6 |
comment:6 by , 13 years ago
Just to be sure: does the guest kernel always panic within the E1000 driver? It looks that way, because we have two similar kernel logs from two different reporters...
comment:7 by , 13 years ago
An interesting question would be whether similar guest kernel crashes also happen with the PCNet device and with the virtio device (Ubuntu should support both devices).
comment:8 by , 13 years ago
The kernel always panics in e1000_clean. We've encountered this at least 10 times.
I will repeat the test with the other 2 network devices.
comment:9 by , 13 years ago
The same test with Virtio is even worse. After 2 minutes the network becomes flaky:
- I can open new ssh sessions, type in the password, but I can't issue any commands.
- The VM doesn't hang fully: I can still login on the console and run commands.
- If I do a ping from within the guest, the first ping returns but the others fail.
The VBox.log file does not show any output, and the guest does not show a kernel panic.
The PCNet test has now been running for one hour and is doing fine so far.
comment:10 by , 13 years ago
Oh, one more issue: Could both of you attach the binary e1000 module to this ticket? Thanks!
by , 13 years ago
Attachment: | e1000.2.ko added |
---|
Module for Ubuntu 10.04 x86_64. Output of uname -a: Linux demo 2.6.32-24-server #39-Ubuntu SMP Wed Jul 28 06:21:40 UTC 2010 x86_64 GNU/Linux
comment:11 by , 13 years ago
Hi Franck,
As of now, there seem to be 2 workarounds:
- use a 1 CPU only virtual machine (as fryckbos said),
- use Am79C973 as NIC type.
Currently our test platform is running with the latter option without crashing (let's wait and see, but according to our previous tests it should have crashed already).
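Both workarounds can be applied from the host with VBoxManage while the VM is powered off. The sketch below only builds and prints the command lines and does not run them. The VM name is a hypothetical placeholder. nic_index defaults to the first adapter, so adjust it for VMs with several adapters, like the two-adapter setup mentioned above. The two changes are alternatives: apply one or the other.

```python
import shlex
from typing import List


def workaround_commands(vm_name: str, nic_index: int = 1) -> List[List[str]]:
    """Return VBoxManage invocations for the two workarounds (use either one)."""
    return [
        # Workaround 1: give the VM a single virtual CPU.
        ["VBoxManage", "modifyvm", vm_name, "--cpus", "1"],
        # Workaround 2: switch adapter <nic_index> from e1000 to PCnet-FAST III (Am79C973).
        ["VBoxManage", "modifyvm", vm_name, f"--nictype{nic_index}", "Am79C973"],
    ]


if __name__ == "__main__":
    for cmd in workaround_commands("ubuntu-guest"):
        print(shlex.join(cmd))  # printed only; run the chosen one with the VM powered off
```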
comment:13 by , 13 years ago
There is no speed limitation. The e1000 device is usually a bit faster than pcnet, but even with pcnet you should easily be able to exceed 100 Mbit/s.
comment:14 by , 13 years ago
fryckbos, the kernel module you attached does not match the kernel log. Your guest kernel is apparently 2.6.35-24; please attach the correct e1000.ko module from this kernel.
comment:16 by , 13 years ago
The PCNet device also gave up, but it took about 5 hours. VirtualBox is still running and the guest is still running, but network connections fail. I can still ssh in, enter my password and get a console, but I can't type commands.
It seems to be a problem with the clocks: if I run a 10-second sleep at the console, it never returns. The following messages are in dmesg:
[  936.534361] hrtimer: interrupt took 59930437 ns
[18529.704149] Clocksource tsc unstable (delta = 4686851418 ns)
comment:17 by , 13 years ago
That might be a different problem. Are there any entries in the log related to the pcnet card? Could you attach the complete output of 'dmesg' from the guest and the VBox.log file?
comment:18 by , 13 years ago
I attached the VBox and console logs, but as far as I can tell they don't show any errors. I don't know why the hrtimer message does not show up in the console log. It was in the dmesg output, but I did not save the output of dmesg.
comment:19 by , 13 years ago
During testing it is always advisable to add the kernel parameters console=ttyS0 ignore_loglevel to the kernel command line, enable a serial port in the VM settings and attach this port to a file. This logs all guest kernel messages to the external file.
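A small sketch of that setup follows, under these assumptions. The guest's kernel command line is handled as a plain string, for example the GRUB_CMDLINE_LINUX value on a GRUB 2 system. The VM name and log path are hypothetical placeholders. The serial port is COM1 (I/O port 0x3F8, IRQ 4), redirected to a host file with the --uart1 and --uartmode1 options of VBoxManage modifyvm. Check those option names against the VBoxManage documentation for your VirtualBox version. add_kernel_params only appends parameters that are missing, so it also fixes a command line that already has console=ttyS0 but lacks ignore_loglevel.

```python
import shlex
from typing import List, Sequence

# Parameters recommended above for capturing all guest kernel messages.
DEBUG_PARAMS = ("console=ttyS0", "ignore_loglevel")


def add_kernel_params(cmdline: str, params: Sequence[str] = DEBUG_PARAMS) -> str:
    """Append any of `params` missing from a kernel command line, keeping existing order."""
    tokens = cmdline.split()
    for param in params:
        # Exact token match: a variant such as console=ttyS0,115200 is not recognised.
        if param not in tokens:
            tokens.append(param)
    return " ".join(tokens)


def serial_to_file_command(vm_name: str, log_path: str) -> List[str]:
    """VBoxManage call enabling COM1 (I/O 0x3F8, IRQ 4) with its output sent to a host file."""
    return ["VBoxManage", "modifyvm", vm_name,
            "--uart1", "0x3F8", "4",
            "--uartmode1", "file", log_path]


if __name__ == "__main__":
    print(add_kernel_params("quiet splash console=ttyS0"))
    print(shlex.join(serial_to_file_command("ubuntu-guest", "/tmp/guest-console.log")))
```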
comment:20 by , 13 years ago
Thanks, I was already logging to a file with console=ttyS0 but did not have ignore_loglevel.
comment:22 by , 13 years ago
Not sure if that will help. We are currently working on finding the e1000 issue.
comment:23 by , 13 years ago
We found a potential problem in the E1000 implementation. We are currently fixing this issue and will provide you a test build when the fix is done.
comment:26 by , 13 years ago
I've attached a tiny patch that should fix the issue. In short, it prevents context descriptors from being written back with the DD (Descriptor Done) bit set. That caused trouble in the TX cleanup function of the e1000 driver, because context descriptors have no skb associated with them. This happened in SMP guests because they can add TX descriptors and perform cleanup simultaneously. Could you check whether the patch indeed solves the problem for you?
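To make the mechanism described above more concrete, here is a toy model of a TX descriptor ring. It is a simplified sketch, not the VirtualBox E1000 emulation or the Linux e1000 driver. The names TxDesc, device_process, driver_clean and make_ring are hypothetical. The cleanup routine is assumed to treat every descriptor with DD set as a finished packet that owns an skb. The SMP race itself (one CPU queueing descriptors while another cleans up) is not modelled. The model only shows why writing DD back on a context descriptor trips such a cleanup routine, and why skipping that write-back avoids the problem.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TxDesc:
    """One slot of a hypothetical TX ring."""
    is_context: bool             # context descriptor: offload setup only, carries no packet
    skb: Optional[bytes] = None  # packet buffer the driver must free; None for context descriptors
    dd: bool = False             # "Descriptor Done" status bit written back by the device


def device_process(ring: List[TxDesc], report_context_done: bool) -> None:
    """Emulated NIC: set DD on every descriptor it has processed.

    report_context_done=True models the behaviour before the patch (DD also
    written back on context descriptors); False models the patched behaviour.
    """
    for desc in ring:
        if desc.is_context and not report_context_done:
            continue  # patched: no write-back for context descriptors
        desc.dd = True


def driver_clean(ring: List[TxDesc]) -> int:
    """Simplified TX cleanup that assumes a descriptor with DD set always owns an skb."""
    freed = 0
    for desc in ring:
        if not desc.dd:
            continue  # not reported as done, leave it alone
        if desc.skb is None:
            # A kernel driver making this assumption would dereference a NULL skb here.
            raise RuntimeError("DD set on a descriptor without an skb")
        freed += len(desc.skb)
        desc.skb, desc.dd = None, False  # release the buffer and recycle the slot
    return freed


def make_ring() -> List[TxDesc]:
    """A context descriptor followed by one 1500-byte data descriptor."""
    return [TxDesc(is_context=True), TxDesc(is_context=False, skb=b"\x00" * 1500)]


if __name__ == "__main__":
    ring = make_ring()
    device_process(ring, report_context_done=False)
    print("patched:", driver_clean(ring), "bytes freed")

    ring = make_ring()
    device_process(ring, report_context_done=True)
    try:
        driver_clean(ring)
    except RuntimeError as err:
        print("unpatched:", err)
```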
comment:27 by , 13 years ago
I'll add a "me too": I see what seems to be the same problem on Scientific Linux 6, which has the same kernel (2.6.32, 64-bit). The host is Windows XP 32-bit.
comment:29 by , 13 years ago
rgcosma, could you check if this build fixes the problem for you as well?
comment:31 by , 13 years ago
Summary: | Ubuntu 10.04 kernel panic on high network load → Ubuntu 10.04 kernel panic on high network load => Fixed in SVN |
---|
Great, thanks for the feedback! The fix will be also available in the next maintenance release.
I forgot to mention that the kernel panic does not occur instantly but only after a while; it can take up to half an hour.
We are having the same problem when running a web server at high load.