VirtualBox

Ticket #4433 (closed defect: fixed)

Opened 5 years ago

Last modified 4 years ago

SMP bugs in guest

Reported by: costing
Owned by:
Priority: major
Component: guest smp
Version: VirtualBox 3.0.0
Keywords:
Cc:
Guest type: Linux
Host type: Linux

Description

The guests run kernel 2.6.29.3; one is 32-bit, the other 64-bit. Nothing is logged in VBox.log. Here are some samples; I'll attach more information in separate files.

Eeek! page_mapcount(page) went negative! (-1)
page pfn = 66f89
page->flags = 8000087c
page->count = 2
page->mapping = f219f50c
vma->vm_ops = 0x0

kernel BUG at mm/rmap.c:725!
invalid opcode: 0000 #1 SMP

kernel BUG at arch/x86/mm/highmem_32.c:87!
invalid opcode: 0000 #12 SMP
last sysfs file: /sys/class/vc/vcs5/dev

BUG: unable to handle kernel paging request at ffffffde
IP: [<c015710a>] buffered_rmqueue+0x125/0x211
*pde = 00008067 *pte = 00000000
Oops: 0000 #1 SMP
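
For triage, the kinds of failure quoted above can be tallied from a saved guest kernel log. The following is a minimal sketch, not part of the original report: the pattern labels and the `tally_oops` helper are hypothetical names, and the regular expressions are assumptions derived only from the first lines of the samples in this ticket.

```python
import re
from collections import Counter

# Assumed patterns, modelled on the first line of each sample in this ticket.
OOPS_PATTERNS = {
    "kernel BUG": re.compile(r"kernel BUG at \S+:\d+!"),
    "paging request": re.compile(r"BUG: unable to handle kernel paging request at [0-9a-f]+"),
    "general protection fault": re.compile(r"general protection fault: [0-9a-f]{4}"),
    "negative mapcount": re.compile(r"page_mapcount\(page\) went negative"),
}


def tally_oops(lines):
    """Count how many lines match each assumed oops header pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in OOPS_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                break  # a line is counted under at most one label
    return counts


# Header lines copied from this ticket.
sample = [
    "Eeek! page_mapcount(page) went negative! (-1)",
    "kernel BUG at mm/rmap.c:725!",
    "kernel BUG at arch/x86/mm/highmem_32.c:87!",
    "BUG: unable to handle kernel paging request at ffffffde",
    "general protection fault: 0000 [#1] SMP",
]
print(dict(tally_oops(sample)))
```

The same function accepts an open file, for example `tally_oops(open("messages.32bit.log"))`, assuming the attachment has been saved locally under that name.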

Attachments

messages.32bit.log (241.0 KB) - added by costing 5 years ago.
kernel log for 32bit kernel
messages.64bit.log (163.0 KB) - added by costing 5 years ago.
kernel log for 64bit kernel
VBox.32bit.log (38.1 KB) - added by costing 5 years ago.
VBox.64bit.log (41.3 KB) - added by costing 5 years ago.
VBox.log (69.1 KB) - added by costing 5 years ago.
log of hung guest with 4 cores under 3.0.6

Change History

Changed 5 years ago by costing

kernel log for 32bit kernel

Changed 5 years ago by costing

kernel log for 64bit kernel

Changed 5 years ago by costing

Changed 5 years ago by costing

comment:1 Changed 5 years ago by costing

Another one just happened while running "du" on a large directory tree in the guest. du segfaulted and the kernel logged:

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/class/vc/vcs6/dev
CPU 3
Modules linked in: ipv6 parport_pc lp parport vboxvfs autofs4 sunrpc dm_mirror dm_region_hash dm_log dm_mod button battery ac i2c_piix4 i2c_core vboxadd e1000 floppy ext3 jbd ata_piix libata sd_mod scsi_mod [last unloaded: x_tables]
Pid: 16486, comm: du Tainted: G        W  2.6.28.3 #1
RIP: 0010:[<ffffffff80296b49>]  [<ffffffff80296b49>] do_lookup+0xcd/0x1c6
RSP: 0018:ffff880068bf5ca8  EFLAGS: 00010282
RAX: ffffffffa009d620 RBX: fffffffffffffff4 RCX: ffff88010833a3f0
RDX: ffff880068bf5e08 RSI: ffff88007f948380 RDI: ffff880097cf3288
RBP: ffff88007f948380 R08: 0000000000000007 R09: 0000000000000007
R10: 6f6f620000000000 R11: ffffffff80327ed9 R12: ffff880097cf3288
R13: ffff880097cf3358 R14: ffff880068bf5e08 R15: ffff880068bf5d38
FS:  00007f30a896b6e0(0000) GS:ffff88010ec05d00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000084c0c8 CR3: 00000000db878000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process du (pid: 16486, threadinfo ffff880068bf4000, task ffff880104cfc0a0)
Stack:
 ffffffffa009c61f ffff88010dc5ff00 ffff880068bf5d48 0000000000000000
 ffff880068bf5e08 ffff880097cf3288 ffff880068bf5d98 00000000ffffff9c
 ffff880068bf5d48 ffffffff802974e1 0000004468bf5da8 0000000000000003
Call Trace:
 [<ffffffffa009c61f>] ? ext3_check_acl+0x0/0x53 [ext3]
 [<ffffffff802974e1>] ? __link_path_walk+0x89f/0xc93
 [<ffffffff8029791e>] ? path_walk+0x49/0x8e
 [<ffffffff80297a73>] ? do_path_lookup+0x110/0x164
 [<ffffffff80297e9f>] ? user_path_at+0x48/0x79
 [<ffffffff8032b417>] ? _raw_spin_lock+0x61/0xfa
 [<ffffffff8032b565>] ? _raw_spin_unlock+0x86/0x89
 [<ffffffff80292adc>] ? vfs_lstat_fd+0x15/0x3f
 [<ffffffff80292e48>] ? sys_newlstat+0x19/0x31
 [<ffffffff8048da37>] ? _spin_lock_irqsave+0x22/0x2b
 [<ffffffff8032b565>] ? _raw_spin_unlock+0x86/0x89
 [<ffffffff8048dd1a>] ? error_exit+0x0/0x70
 [<ffffffff8020b31a>] ? system_call_fastpath+0x16/0x1b
Code: ff ff ff 75 3e 48 89 ef 4c 89 fe b3 f4 e8 b3 73 00 00 48 85 c0 48 89 c5 74 29 49 8b 84 24 30 01 00 00 4c 89 f2 48 89 ee 4c 89 e7 <ff> 50 08 48 85 c0 48 89 c3 74 0a 48 89 ef e8 66 67 00 00 eb 03
RIP  [<ffffffff80296b49>] do_lookup+0xcd/0x1c6
 RSP <ffff880068bf5ca8>
---[ end trace 4eaa2a86a8e2da22 ]---

comment:2 Changed 5 years ago by costing

Changing the kernel doesn't help; here is another one with 2.6.31-rc2. I don't know whether it's related, but another guest on the same host logged a kernel bug at about the same time. The first message is from a 32-bit 2.6.31-rc2 guest, the second from a 64-bit 2.6.28.3 guest, with 4 CPUs each. Both machines had been busy compiling for a long time.

I've also left two other guests with a single CPU each, and since then I have seen no more complaints from them, even under load. Two other guests with 4 CPUs each but very low load didn't show these problems.

BUG: unable to handle kernel paging request at c102f631
IP: [<c102f631>] sys_exit_group+0x0/0x10
*pde = 369ed063 *pte = 0102f161
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/irq
Modules linked in: ipv6 vboxvfs vboxadd autofs4 hidp rfcomm l2cap bluetooth rfkill sunrpc dm_mirror dm_multipath video output battery lp nvram ac button parport_pc parport pcspkr i2c_piix4 e1000 floppy dm_region_hash dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]

Pid: 20687, comm: true Not tainted (2.6.31-rc2 #1) VirtualBox
EIP: 0060:[<c102f631>] EFLAGS: 00010283 CPU: 0
EIP is at sys_exit_group+0x0/0x10
EAX: 000000fc EBX: 00000000 ECX: 00c34186 EDX: 3280cc36
ESI: 45453280 EDI: 45453280 EBP: f2d1f000 ESP: f2d1ffb0
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process true (pid: 20687, ti=f2d1f000 task=f6301270 task.ti=f2d1f000)
Stack:
 c1002ed4 00000000 00000004 00000000 45453280 45453280 bf9b5018 000000fc
<0> 0000007b 0000007b 00000000 00000000 000000fc ffffe424 00000073 00000246
<0> bf9b4fec 0000007b 00000000 00000000
Call Trace:
 [<c1002ed4>] ? sysenter_do_call+0x12/0x28
Code: ff ff 89 c2 8d 80 58 01 00 00 39 82 58 01 00 00 74 e9 eb be 89 73 34 c7 43 44 08 00 00 00 64 a1 00 30 41 c1 e8 41 7e 00 00 eb c9 <0f> b6 44 24 04 c1 e0 08 e8 6e ff ff ff 31 c0 c3 0f b6 44 24 04
EIP: [<c102f631>] sys_exit_group+0x0/0x10 SS:ESP 0068:f2d1ffb0
CR2: 00000000c102f631
---[ end trace 0a336e01786aae2f ]---
------------[ cut here ]------------
kernel BUG at mm/mmap.c:2125!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/irq
CPU 0
Modules linked in: nfs lockd nfs_acl ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc dm_mirror dm_multipath rfkill input_polldev battery lp floppy ac button parport_pc parport e1000 i2c_piix4 i2c_core pcspkr dm_region_hash dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 28333, comm: config.guess Tainted: G        W  2.6.28.3 #1
RIP: 0010:[<ffffffff80278de8>]  [<ffffffff80278de8>] exit_mmap+0x112/0x11d
RSP: 0018:ffff8800df97de68  EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff880028034460 RCX: ffffffff80278ccf
RDX: ffff88010303bbe0 RSI: ffff88003d9f5200 RDI: 0000000000000246
RBP: ffff88010303bb80 R08: 0000000000000000 R09: ffff8800281015c0
R10: ffff880008485738 R11: ffffffff802f559a R12: 0000000000000000
R13: ffff88010303bbe0 R14: 0000000000000000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffffffff80686000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000003661c03080 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process config.guess (pid: 28333, threadinfo ffff8800df97c000, task ffff88000756cc60)
Stack:
 00000000000000a5 ffff880028034460 ffff88010303bb80 ffff88010303bc28
 ffff88000756cc60 ffffffff8023219a 0000000000000296 ffff88000756d110
 ffff88010303bb80 ffffffff80235a20 0000000000000000 ffffffff802606a0
Call Trace:
 [<ffffffff8023219a>] ? mmput+0x40/0xbd
 [<ffffffff80235a20>] ? exit_mm+0xfa/0x105
 [<ffffffff802606a0>] ? audit_free+0x183/0x1bb
 [<ffffffff80236e14>] ? do_exit+0x1e7/0x769
 [<ffffffff8048641f>] ? _spin_lock_irqsave+0x25/0x2d
 [<ffffffff802373fd>] ? do_group_exit+0x67/0x96
 [<ffffffff8023743e>] ? sys_exit_group+0x12/0x17
 [<ffffffff8020b20a>] ? system_call_fastpath+0x16/0x1b
Code: 8d 7b 18 e8 b7 70 00 00 c7 43 08 00 00 00 00 eb 0b 4c 89 e7 e8 93 fe ff ff 49 89 c4 4d 85 e4 75 f0 48 83 bd 10 01 00 00 00 74 04 <0f> 0b eb fe 59 5e 5b 5d 41 5c c3 41 57 41 56 41 55 41 54 49 89
RIP  [<ffffffff80278de8>] exit_mmap+0x112/0x11d
 RSP <ffff8800df97de68>
---[ end trace 4eaa2a86a8e2da22 ]---
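
To compare reports like the ones above, it can help to pull out just the faulting function of each oops. This is a rough sketch under stated assumptions: `faulting_symbols` is a hypothetical helper, and the regular expression only targets the closing `RIP  [<addr>] symbol+0xoff/0xsize` and `EIP: [<addr>] symbol+0xoff/0xsize` lines as they appear in this ticket; other kernel versions may format these lines differently.

```python
import re

# Matches the closing summary line of an oops, e.g.
#   "RIP  [<ffffffff80296b49>] do_lookup+0xcd/0x1c6"
#   "EIP: [<c102f631>] sys_exit_group+0x0/0x10 SS:ESP 0068:f2d1ffb0"
# It deliberately skips the register-dump form "RIP: 0010:[<...>] ...".
FAULT_RE = re.compile(
    r"^[ER]IP:?\s+\[<[0-9a-f]+>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def faulting_symbols(lines):
    """Return (symbol, offset, function_size) for each oops summary line."""
    found = []
    for line in lines:
        match = FAULT_RE.match(line.strip())
        if match:
            symbol, offset, size = match.groups()
            found.append((symbol, int(offset, 16), int(size, 16)))
    return found


# Lines copied from the traces in this ticket.
sample = [
    "RIP: 0010:[<ffffffff80296b49>]  [<ffffffff80296b49>] do_lookup+0xcd/0x1c6",
    "RIP  [<ffffffff80296b49>] do_lookup+0xcd/0x1c6",
    "EIP is at sys_exit_group+0x0/0x10",
    "EIP: [<c102f631>] sys_exit_group+0x0/0x10 SS:ESP 0068:f2d1ffb0",
    "RIP  [<ffffffff80278de8>] exit_mmap+0x112/0x11d",
]
for symbol, offset, size in faulting_symbols(sample):
    print(f"{symbol} at offset {offset:#x} of {size:#x}")
```

Matching only the closing summary line keeps each oops from being counted twice, since the same address also appears in the register dump.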

comment:3 Changed 5 years ago by frank

  • Status changed from new to closed
  • Resolution set to fixed

There were several SMP-related fixes in 3.0.2. Please reopen if the problem persists.

comment:4 Changed 5 years ago by costing

  • Status changed from closed to reopened
  • Resolution fixed deleted

The problem is still present, even in 3.0.6. One of the virtual machines that had worked fine with only one core for ~40 days was upgraded to 4 cores, and within ~6 hours it hung. On the only console that was still open, a few date commands run at ~1 s intervals showed the following sequence, in which the clock jumps backwards:

> date
Tue Sep 15 01:59:40 CEST 2009
> date
Tue Sep 15 01:59:41 CEST 2009
> date
Tue Sep 15 01:59:42 CEST 2009
> date
Tue Sep 15 01:59:37 CEST 2009
> date
Tue Sep 15 01:59:38 CEST 2009
> date
Tue Sep 15 01:59:38 CEST 2009
> date
Tue Sep 15 01:59:39 CEST 2009

At the next command it hung for good, so no other details are available.
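
The backwards step in the sequence above can be checked mechanically. The following is a minimal sketch, not something that was run on the affected guest: `parse_date_output` and `backward_jumps` are hypothetical helpers, and the parsing assumes English `date` output in the layout shown above, with every sample in the same timezone.

```python
from datetime import datetime


def parse_date_output(line):
    """Parse `date` output such as 'Tue Sep 15 01:59:40 CEST 2009'.

    The weekday and timezone fields are dropped before parsing; this
    assumes all samples were taken in the same timezone.
    """
    _weekday, month, day, clock, _tz, year = line.split()
    return datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")


def backward_jumps(lines):
    """Return (index, delta_seconds) for each sample earlier than the previous one."""
    stamps = [parse_date_output(line) for line in lines]
    jumps = []
    for i in range(1, len(stamps)):
        delta = (stamps[i] - stamps[i - 1]).total_seconds()
        if delta < 0:
            jumps.append((i, delta))
    return jumps


# The seven samples quoted in this comment.
samples = [
    "Tue Sep 15 01:59:40 CEST 2009",
    "Tue Sep 15 01:59:41 CEST 2009",
    "Tue Sep 15 01:59:42 CEST 2009",
    "Tue Sep 15 01:59:37 CEST 2009",
    "Tue Sep 15 01:59:38 CEST 2009",
    "Tue Sep 15 01:59:38 CEST 2009",
    "Tue Sep 15 01:59:39 CEST 2009",
]
print(backward_jumps(samples))  # [(3, -5.0)]
```

For these samples the only regression is the fourth reading, which is 5 seconds earlier than the third.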

Restarting it again with only one allocated core solved the problem.

This machine quite frequently runs two or even three CPU-intensive processes in parallel. It would have been the perfect use case for more cores...

Host: 64bit Ubuntu 9.10 on Intel X5450

Guest: 64bit RHEL4 with custom 2.6.28.3 (to solve the timing issues)

comment:5 Changed 5 years ago by frank

costing, please attach the VBox.log file of this VM session.

comment:6 Changed 5 years ago by frank

  • Component changed from other to guest smp

Changed 5 years ago by costing

log of hung guest with 4 cores under 3.0.6

comment:7 Changed 4 years ago by sandervl73

  • Status changed from reopened to closed
  • Resolution set to fixed

3.0.10 contains guest SMP fixes to address such issues. Please reopen if necessary.
