VirtualBox

Opened 9 years ago

Last modified 7 years ago

#14034 new defect

NMI kernel panics on HP ProLiants running RHEL causing machines to reboot at VBoxHost_RTSemEventMultiWaitEx

Reported by: nj
Owned by:
Component: guest control
Version: VirtualBox 4.3.26
Keywords:
Cc:
Guest type: Windows
Host type: Linux

Description (last modified by Frank Mehnert)

Over the past few months we have been experiencing seemingly random reboots of our DL360 G7 and DL360 G6 machines every few days when using VirtualBox 4.3. We do not believe we suffered from this problem when using VirtualBox 4.2. The vmcore-dmesg.txt file that is written to /var/crash contains stack traces that might seem to implicate VirtualBox as a cause of the crashes:

<4>Pid: 5567, comm: EMT-0 Not tainted 2.6.32-504.12.2.el6.x86_64 #1
<4>Call Trace:
<4> <NMI>  [<ffffffff8152933c>] ? panic+0xa7/0x16f
<4> [<ffffffffa002e4df>] ? hpwdt_pretimeout+0x9f/0xcc [hpwdt]
<4> [<ffffffff815300f5>] ? notifier_call_chain+0x55/0x80
<4> [<ffffffff8153015a>] ? atomic_notifier_call_chain+0x1a/0x20
<4> [<ffffffff810a4eae>] ? notify_die+0x2e/0x30
<4> [<ffffffff8152de17>] ? do_nmi+0x217/0x340
<4> [<ffffffff8152d680>] ? nmi+0x20/0x30
<4> <<EOE>>  [<ffffffffa04bb210>] ? VBoxHost_RTSemEventMultiWaitEx+0x10/0x20 [vboxdrv]
<4> [<ffffffffa04b84da>] ? rtR0MemAllocEx+0x8a/0x250 [vboxdrv]
<4> [<ffffffffa04a98ca>] ? supdrvIOCtlFast+0x8a/0xa0 [vboxdrv]
<4> [<ffffffffa04a93a4>] ? VBoxDrvLinuxIOCtl_4_3_26+0x54/0x210 [vboxdrv]
<4> [<ffffffff811a3782>] ? vfs_ioctl+0x22/0xa0
<4> [<ffffffff811a3924>] ? do_vfs_ioctl+0x84/0x580
<4> [<ffffffff810e5c7b>] ? audit_syscall_entry+0x1cb/0x200
<4> [<ffffffff811a3ea1>] ? sys_ioctl+0x81/0xa0
<4> [<ffffffff810e5a7e>] ? __audit_syscall_exit+0x25e/0x290
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

These machines are running recently patched versions of Red Hat Enterprise Linux Server release 6.6 (Santiago). The problem has been seen on about six different machines. It is occurring with VirtualBox 4.3.26 but has also happened with many other recent releases in the 4.3 series. We run only one guest VM on each of these RHEL hosts, and it runs Windows 2008 R2 64-bit as a guest.

We have not, as far as we know, configured the HP Watchdog Timer in any special way. Ticket 13762 seems to bear some similarities to this one. I have attached one example of vmcore-dmesg.txt.

Attachments (1)

vmcore-dmesg.txt (60.9 KB) - added by nj 9 years ago.
Crash Dump Summary 1


Change History (8)

by nj, 9 years ago

Attachment: vmcore-dmesg.txt added

Crash Dump Summary 1

comment:1 by Frank Mehnert, 9 years ago

Description: modified (diff)

comment:2 by Frank Mehnert, 9 years ago

Actually I'm not sure if this is a VBox bug at all. See this Ubuntu ticket. Could you try blacklisting the hpwdt module as suggested there and check whether that resolves your problem as well?
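For reference, blacklisting a module this way is usually done with a small modprobe configuration fragment. A minimal sketch, assuming the conventional /etc/modprobe.d directory on RHEL-family systems (the filename here is an arbitrary choice):

```
# /etc/modprobe.d/blacklist-hpwdt.conf  (filename is an arbitrary choice)
# Map any attempt to load hpwdt to /bin/true, so the module never loads.
install hpwdt /bin/true
```

After creating the file, `modprobe -r hpwdt` unloads an already-loaded instance, and `lsmod | grep hpwdt` should then produce no output.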

comment:3 by nj, 9 years ago

After reading the indicated link I could not see that this definitely isn't a VBox problem. I presume that in the case of this ticket the hardware watchdog is working as expected and is correctly raising an NMI. There have been two more cases of servers rebooting since I submitted the ticket: one with VirtualBox 4.3.26 and one with 4.3.28. The stack traces were:
<4> <NMI> [<ffffffff8152933c>] ? panic+0xa7/0x16f
<4> [<ffffffff8152fac6>] ? kprobe_exceptions_notify+0x16/0x430
<4> [<ffffffffa00364df>] ? hpwdt_pretimeout+0x9f/0xcc [hpwdt]
<4> [<ffffffff8152e609>] ? perf_event_nmi_handler+0x9/0xb0
<4> [<ffffffff815300f5>] ? notifier_call_chain+0x55/0x80
<4> [<ffffffff8153015a>] ? atomic_notifier_call_chain+0x1a/0x20
<4> [<ffffffff810a4eae>] ? notify_die+0x2e/0x30
<4> [<ffffffff8152de17>] ? do_nmi+0x217/0x340
<4> [<ffffffff8152d680>] ? nmi+0x20/0x30
<4> <<EOE>> [<ffffffffa078b210>] ? VBoxHost_RTSemEventMultiWaitEx+0x10/0x20 [vboxdrv]
<4> [<ffffffffa07884da>] ? rtR0MemAllocEx+0x8a/0x250 [vboxdrv]
<4> [<ffffffffa07798ca>] ? supdrvIOCtlFast+0x8a/0xa0 [vboxdrv]
<4> [<ffffffffa07793a4>] ? VBoxDrvLinuxIOCtl_4_3_28+0x54/0x210 [vboxdrv]
<4> [<ffffffff811a3782>] ? vfs_ioctl+0x22/0xa0
<4> [<ffffffff811a3924>] ? do_vfs_ioctl+0x84/0x580
<4> [<ffffffff81529a3e>] ? thread_return+0x4e/0x7d0
<4> [<ffffffff811a3ea1>] ? sys_ioctl+0x81/0xa0
<4> [<ffffffff810e5a7e>] ? audit_syscall_exit+0x25e/0x290
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

and

<4> <NMI> [<ffffffff8152933c>] ? panic+0xa7/0x16f
<4> [<ffffffff8152fac6>] ? kprobe_exceptions_notify+0x16/0x430
<4> [<ffffffffa00364df>] ? hpwdt_pretimeout+0x9f/0xcc [hpwdt]
<4> [<ffffffff8152e609>] ? perf_event_nmi_handler+0x9/0xb0
<4> [<ffffffff815300f5>] ? notifier_call_chain+0x55/0x80
<4> [<ffffffff8153015a>] ? atomic_notifier_call_chain+0x1a/0x20
<4> [<ffffffff810a4eae>] ? notify_die+0x2e/0x30
<4> [<ffffffff8152de17>] ? do_nmi+0x217/0x340
<4> [<ffffffff8152d680>] ? nmi+0x20/0x30
<4> <<EOE>> [<ffffffff81064ba2>] ? default_wake_function+0x12/0x20
<4> [<ffffffffa04ac4da>] ? rtR0MemAllocEx+0x8a/0x250 [vboxdrv]
<4> [<ffffffffa049d8ca>] ? supdrvIOCtlFast+0x8a/0xa0 [vboxdrv]
<4> [<ffffffffa049d3a4>] ? VBoxDrvLinuxIOCtl_4_3_26+0x54/0x210 [vboxdrv]
<4> [<ffffffff811a3782>] ? vfs_ioctl+0x22/0xa0
<4> [<ffffffff811a3924>] ? do_vfs_ioctl+0x84/0x580
<4> [<ffffffff81175a28>] ? kfree+0x28/0x320
<4> [<ffffffff811a3ea1>] ? sys_ioctl+0x81/0xa0
<4> [<ffffffff810e5a7e>] ? audit_syscall_exit+0x25e/0x290
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

So if I read the stack traces correctly, the NMIs occur while vboxdrv is trying to allocate memory. In one of the stack traces it looks like the thread is waiting for a semaphore to become signalled before it can make progress.
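As an aside, comparing which frames recur across these dumps is easier with a small script. A sketch (the embedded trace excerpt and the regex are illustrative, not part of any VirtualBox or kernel tooling) that pulls the symbol and module out of each frame of a vmcore-dmesg call trace:

```python
import re

# Three illustrative frames copied from the traces above.
TRACE = """\
<4> <NMI> [<ffffffff8152933c>] ? panic+0xa7/0x16f
<4> [<ffffffffa04bb210>] ? VBoxHost_RTSemEventMultiWaitEx+0x10/0x20 [vboxdrv]
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
"""

# A frame looks like "[<addr>] ? symbol+0xoff/0xsize [module]";
# the "? " marker and the trailing "[module]" are both optional.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)

def frames(trace):
    """Return (symbol, module) pairs, top of stack first."""
    return [(m.group(1), m.group(2) or "kernel")
            for m in FRAME_RE.finditer(trace)]

for sym, mod in frames(TRACE):
    print(f"{mod:10s} {sym}")
```

Running this over each dump and diffing the resulting symbol lists makes it easy to see that hpwdt_pretimeout and rtR0MemAllocEx appear in every crash while the surrounding frames vary.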

Last edited 9 years ago by nj (previous) (diff)

comment:4 by nj, 9 years ago

Had another crash:
<4>Call Trace:
<4> <NMI> [<ffffffff8152933c>] ? panic+0xa7/0x16f
<4> [<ffffffff8152fac6>] ? kprobe_exceptions_notify+0x16/0x430
<4> [<ffffffffa00364df>] ? hpwdt_pretimeout+0x9f/0xcc [hpwdt]
<4> [<ffffffff8152e609>] ? perf_event_nmi_handler+0x9/0xb0
<4> [<ffffffff815300f5>] ? notifier_call_chain+0x55/0x80
<4> [<ffffffff8153015a>] ? atomic_notifier_call_chain+0x1a/0x20
<4> [<ffffffff810a4eae>] ? notify_die+0x2e/0x30
<4> [<ffffffff8152de17>] ? do_nmi+0x217/0x340
<4> [<ffffffff8152d680>] ? nmi+0x20/0x30
<4> <<EOE>> [<ffffffffa04af210>] ? VBoxHost_RTSemEventMultiWaitEx+0x10/0x20 [vboxdrv]
<4> [<ffffffffa04ac4da>] ? rtR0MemAllocEx+0x8a/0x250 [vboxdrv]
<4> [<ffffffffa049d8ca>] ? supdrvIOCtlFast+0x8a/0xa0 [vboxdrv]
<4> [<ffffffffa049d3a4>] ? VBoxDrvLinuxIOCtl_4_3_28+0x54/0x210 [vboxdrv]
<4> [<ffffffff811a3782>] ? vfs_ioctl+0x22/0xa0
<4> [<ffffffff811a3924>] ? do_vfs_ioctl+0x84/0x580
<4> [<ffffffff81529a3e>] ? thread_return+0x4e/0x7d0
<4> [<ffffffff810a9cbc>] ? current_kernel_time+0x2c/0x40
<4> [<ffffffff811a3ea1>] ? sys_ioctl+0x81/0xa0
<4> [<ffffffff810e5a7e>] ? audit_syscall_exit+0x25e/0x290
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

These crashes do not occur with VirtualBox 4.2; they only occur with VirtualBox 4.3.

comment:5 by nj, 8 years ago

Still getting NMIs and machine reboots with VirtualBox 5.0.22 and Oracle Linux 6r8:

<4>vboxdrv: ffffffffa0c57020 VBoxDDR0.r0
<4>vboxdrv: ffffffffa0c77020 VBoxDD2R0.r0
<0>Kernel panic - not syncing: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
<0>1. Integrated Management Log (IML)
<0>2. OA Syslog
<0>3. OA Forward Progress Log
<0>4. iLO Event Log
<4>Pid: 11483, comm: EMT-0 Not tainted 2.6.32-642.1.1.el6.x86_64 #1
<4>Call Trace:
<4> <NMI>  [<ffffffff81546ea1>] ? panic+0xa7/0x179
<4> [<ffffffff8154d716>] ? kprobe_exceptions_notify+0x16/0x430
<4> [<ffffffffa07204df>] ? hpwdt_pretimeout+0x9f/0xcc [hpwdt]
<4> [<ffffffff8154c219>] ? perf_event_nmi_handler+0x9/0xb0
<4> [<ffffffff8154dd45>] ? notifier_call_chain+0x55/0x80
<4> [<ffffffff8154ddaa>] ? atomic_notifier_call_chain+0x1a/0x20
<4> [<ffffffff810aceae>] ? notify_die+0x2e/0x30
<4> [<ffffffff8154ba1f>] ? do_nmi+0x21f/0x350
<4> [<ffffffff8154b283>] ? nmi+0x83/0x90
<4> <<EOE>>  [<ffffffff810a6ac0>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffffa09a1b20>] ? VBoxHost_RTSemEventMultiWaitExDebug+0x10/0x40 [vboxdrv]
<4> [<ffffffffa099ea4a>] ? rtR0MemAllocEx+0x8a/0x250 [vboxdrv]
<4> [<ffffffffa098ba1a>] ? supdrvIOCtlFast+0x8a/0xa0 [vboxdrv]
<4> [<ffffffffa098b3c4>] ? VBoxDrvLinuxIOCtl_5_0_24+0x54/0x210 [vboxdrv]
<4> [<ffffffff810097cc>] ? __switch_to+0x1ac/0x340
<4> [<ffffffff811af742>] ? vfs_ioctl+0x22/0xa0
<4> [<ffffffff811af8e4>] ? do_vfs_ioctl+0x84/0x580
<4> [<ffffffff815475be>] ? schedule+0x3ee/0xb70
<4> [<ffffffff811afe61>] ? sys_ioctl+0x81/0xa0
<4> [<ffffffff810ee47e>] ? __audit_syscall_exit+0x25e/0x290
<4> [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
Last edited 8 years ago by Frank Mehnert (previous) (diff)

comment:6 by nj, 8 years ago

This ticket seems similar to https://www.virtualbox.org/ticket/14075

We have disabled hpwdt by placing a file at /etc/modprobe/blacklist-hp.conf with the contents:

install hpwdt /bin/true

Since then we have had two occasions where we saw the following messages:

Message from syslogd@host1 at Jul 15 07:28:09 ...
kernel:Uhhuh. NMI received for unknown reason 00 on CPU 7.
kernel:Do you have a strange power saving mode enabled?
kernel:Dazed and confused, but trying to continue

We were using VirtualBox 5.0.24 at this point.

Does this mean that the underlying problem still exists, that NMIs are still occurring, and that the only difference now is that no thread dump is obtained for the affected CPU and the server is not rebooted by hpwdt?

comment:7 by nj, 7 years ago

In Jan 2017 we upgraded to Oracle Linux 7. At this time we were using VirtualBox 5.0.30r112061.

After upgrading we did not blacklist the hpwdt module, and the crash reboots due to NMI kernel panics returned:

[13994.968013] Call Trace:
[13994.968048]  <NMI>  [<ffffffff816860cc>] dump_stack+0x19/0x1b
[13994.968141]  [<ffffffff8167f4d3>] panic+0xe3/0x1f2
[13994.968213]  [<ffffffff8108574f>] nmi_panic+0x3f/0x40
[13994.968284]  [<ffffffffa02b9946>] hpwdt_pretimeout+0x86/0xfa [hpwdt]
[13994.968373]  [<ffffffff8168f119>] nmi_handle.isra.0+0x69/0xb0
[13994.968453]  [<ffffffff8168f36b>] do_nmi+0x20b/0x410
[13994.968522]  [<ffffffff8168e553>] end_repeat_nmi+0x1e/0x2e
[13994.968605]  <<EOE>>  [<ffffffff810c4f20>] ? try_to_wake_up+0x2d0/0x330
[13994.968719]  [<ffffffffa051f847>] ? supdrvIOCtlFast+0x77/0xa0 [vboxdrv]
[13994.968833]  [<ffffffffa051c4ab>] ? VBoxDrvLinuxIOCtl_5_0_30+0x5b/0x230 [vboxdrv]
[13994.968958]  [<ffffffff81212035>] ? do_vfs_ioctl+0x2d5/0x4b0
[13994.969037]  [<ffffffff812122b1>] ? SyS_ioctl+0xa1/0xc0
[13994.969113]  [<ffffffff816966c9>] ? system_call_fastpath+0x16/0x1b

At this point we will return to blacklisting hpwdt. When we did this previously, we would occasionally get logs like:

Jan 25 00:34:40 XXX kernel: Uhhuh. NMI received for unknown reason 00 on CPU 4.
Jan 25 00:34:40 XXX kernel: Do you have a strange power saving mode enabled?
Jan 25 00:34:40 XXX kernel: Dazed and confused, but trying to continue

Thus far we have not noticed any negative repercussions from these events after blacklisting hpwdt.

