VirtualBox

Opened 11 months ago

Last modified 11 months ago

#21706 new defect

Ubuntu 20.04 Stuck at reboot in Virtualbox kernel version 5.4.0-149-generic #166 SMP

Reported by: svorobev
Owned by:
Component: other
Version: VirtualBox-7.0.6
Keywords:
Cc:
Guest type: Linux
Host type: Linux

Description

The problem occurs when repeatedly rebooting a freshly installed Ubuntu 20.04 guest inside VirtualBox.

Host OS: ` No LSB modules are available. Description: Ubuntu 23.04 Release: 23.04 `

Guest OS: ` Description: Ubuntu 20.04.6 LTS Release: 20.04 `

VirtualBox version: ` VirtualBox Graphical User Interface Version 7.0.6_Ubuntu r155176 7.0.6_Ubuntur155176 `

Steps to reproduce: reboot the virtual machine several times via the reboot command.

Actual behavior: At some reboot, the guest hangs, with the last message being [ 0.000000] Linux agpgart interface v0.103. Before that point, the timestamps in the boot log show significant discrepancies.

Expected behavior: Normal boot.

The issue consistently reproduces on different VirtualBox versions (from 6.1.28r147628 to 7.0) and on different host kernel versions; 5.16 and 5.19 kernels were also tried.

The issue persists after changing various VirtualBox VM options. Disabling sound, graphics, PAE/NX, VT-x nested virtualization, and KVM nested paging does not make the issue go away. Enabling or disabling the serial port has no effect. Unchecking the VirtualBox System / "Hardware Clock in UTC Time" option does not help either. While stuck, the kernel inside the VM ignores SysRq sent via: ` VM="Ubuntu"; PRESS="26"; RELEASE=$(printf "%X\n" $((0x${PRESS} + 0x80))); VBoxManage controlvm "$VM" keyboardputscancode 1d 38 54 "${PRESS}" "${RELEASE}" d4 b8 9d ` (All the other SysRq commands were tried as well. To be sure, I double-checked that SysRq was enabled and sent the same SysRq sequences to a normally booted VM, where they did work.)
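
The scancode arithmetic in the command above can be sketched as follows. This is only an illustration: it assumes the bytes in the report are scancode set 1 make codes (0x1d, 0x38 and 0x54 taken verbatim from the command, 0x26 being the key chosen by the reporter) and that a break code is the make code plus 0x80, as the RELEASE computation implies. The function names are hypothetical, and the sketch only prints the command line rather than running it.

```python
from typing import List

# Make codes copied from the command in the report.
CTRL_MAKE, ALT_MAKE, SYSRQ_MAKE = 0x1D, 0x38, 0x54
BREAK_BIT = 0x80  # break (release) code = make code + 0x80, as in the report


def sysrq_scancodes(key_make: int) -> List[str]:
    """Return the press/release byte sequence for SysRq + one key as hex strings."""
    if not 0 < key_make < BREAK_BIT:
        raise ValueError("expected a single-byte make code below 0x80")
    press = [CTRL_MAKE, ALT_MAKE, SYSRQ_MAKE, key_make]
    # Keys are released in this order to mirror the original command.
    release = [key_make, SYSRQ_MAKE, ALT_MAKE, CTRL_MAKE]
    codes = press + [code | BREAK_BIT for code in release]
    return [f"{code:02x}" for code in codes]


def vboxmanage_args(vm: str, key_make: int) -> List[str]:
    """Build (but do not run) the VBoxManage argument list for the sequence."""
    return ["VBoxManage", "controlvm", vm, "keyboardputscancode",
            *sysrq_scancodes(key_make)]


print(" ".join(vboxmanage_args("Ubuntu", 0x26)))
```

The resulting list could be handed to subprocess.run on a host where VBoxManage is installed; the hex digits are lowercase here, whereas the original printf "%X" emits uppercase.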

Rebuilding the kernel with debug symbols does not affect the problem. Changing the verbosity level did not yield any additional output. Enabling or disabling KASLR and ASLR made no difference. Changing the clocksource to kvm-clock, tsc, or acpi_pm did not fix the problem. SIGNIFICANT NOTE: The problem goes away when the VM has more than 2 CPUs and 1 GB of RAM. Attaching kdb/kgdb does not seem possible, since SysRq is being ignored.

The issue does reproduce for other users: https://www.reddit.com/r/linuxquestions/comments/ols6f1/ubuntu_server_2004_boot_suddenly_it_freezes_at/

After adding some printk statements, I managed to find the source of the problem and take a stack trace with the VirtualBox debugger. The problem is that blk_mq_freeze_queue_wait gets stuck during loop driver initialization.

` # RBP Ret SS:RBP Ret RIP CS:RIP / Symbol [line]
00 ffffc90000013c80 0000:ffffc90000013c90 ffffffff81aaf4ba kallsyms!blk_mq_freeze_queue_wait retn/64
01 ffffc90000013c90 0000:ffffc90000013cc0 ffffffff81ab0d47 kallsyms!blk_mq_freeze_queue+e retn/64
02 ffffc90000013cc0 0000:ffffc90000013ce0 ffffffff81ab0f29 kallsyms!wbt_init+1af retn/64
03 ffffc90000013ce0 0000:ffffc90000013d28 ffffffff81aaef13 kallsyms!wbt_enable_default+b6 retn/64
04 ffffc90000013d28 0000:ffffc90000013d90 ffffffff814f2760 kallsyms!blk_register_queue+358 retn/64
05 ffffc90000013d90 0000:ffffc90000013da0 ffffffff814f27a3 kallsyms!__device_add_disk+450 retn/64
06 ffffc90000013da0 0000:ffffc90000013dd8 ffffffff81aca41c kallsyms!device_add_disk+13 retn/64
07 ffffc90000013dd8 0000:ffffc90000013e10 ffffffff82d0f072 kallsyms!loop_add+327 retn/64
08 ffffc90000013e10 0000:ffffc90000013e88 ffffffff810037da kallsyms!loop_init+134 retn/64
09 ffffc90000013e88 0000:ffffc90000013f38 ffffffff82ca240c kallsyms!do_one_initcall+4a retn/64
0a ffffc90000013f38 0000:ffffc90000013f48 ffffffff81aedd6e kallsyms!kernel_init_freeable+1e6 retn/64
0b ffffc90000013f48 0000:0000000000000000 ffffffff81c00255 kallsyms!kernel_init+e retn/64
0c 0000000000000000 0000:0000000000000000 0000000000000000 kallsyms!ret_from_fork+35 `

blk_mq_freeze_queue_wait simply calls wait_event on the condition percpu_ref_is_zero(&q->q_usage_counter), and the counter is not zero when the wait begins. Inside the wait_event macro, prepare_to_wait_event is reached but finish_wait never is (checked with debugger breakpoints).
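
The waiting pattern described above can be illustrated with a small user-space analogy. This is not kernel code and makes no claim about why the counter stays non-zero in the guest; it only shows how a "wait until the usage counter is zero" loop never completes if the last reference is not dropped. The UsageCounter class and its methods are hypothetical names, and the timeout exists only so the demonstration terminates.

```python
import threading


class UsageCounter:
    """User-space stand-in for a reference count that a waiter expects to reach zero."""

    def __init__(self, initial: int) -> None:
        self._count = initial
        self._cond = threading.Condition()

    def put(self) -> None:
        """Drop one reference and wake waiters once the count reaches zero."""
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_for_zero(self, timeout: float) -> bool:
        """Block until the count is zero; return False if the timeout expires first."""
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout=timeout)


# Case 1: another thread drops the last reference, so the wait completes.
released = UsageCounter(initial=1)
threading.Timer(0.05, released.put).start()
print("reference dropped:", released.wait_for_zero(timeout=2.0))

# Case 2: nobody drops the reference, so only the timeout ends the wait.
leaked = UsageCounter(initial=1)
print("reference kept:   ", leaked.wait_for_zero(timeout=0.2))
```

In the second case the waiter enters the wait but never leaves it on its own, which matches the observation that the wait is entered and its cleanup step is never reached.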

Attachments (1)

ubuntu_dmesg.log (19.9 KB) - added by svorobev 11 months ago.

Change History (8)

by svorobev, 11 months ago

Attachment: ubuntu_dmesg.log added

comment:1 by galitsyn, 11 months ago

Hi svorobev,

You are referring to the package released by Ubuntu, not by us. Also, VBox.log is missing. I would suggest trying the official VirtualBox package from the Downloads page.

in reply to:  1 comment:2 by svorobev, 11 months ago

Replying to galitsyn:

Hello, galitsyn. Thank you for your answer. I will definitely try reproducing this bug on the official VirtualBox build. I expect the issue to reproduce, since, as I mentioned, it reproduced on a variety of VirtualBox versions. Regarding the VBox.log file: please fix the bug tracker, since it does not allow uploading a 170 KB file and does not reset the attached file after an upload error. Here is a link instead: https://launchpadlibrarian.net/669621391/VBox.log

comment:3 by galitsyn, 11 months ago

Hi svorobev,

I see that the host is running on an 11th Gen Intel Core CPU. People are starting to complain about IBT. Do you see a difference if you boot the host with the "ibt=off" kernel command line parameter?

in reply to:  3 comment:4 by svorobev, 11 months ago

Replying to galitsyn:

Hi galitsyn, I tried this on VirtualBox version 7.0.8 r156879 (Qt5.15.8). It still reproduces. I tried "ibt=off" as a kernel parameter, and it did not help. Also, switching to version 7.0.8 has made the bug occur more often. Maybe I am doing something wrong, but I do see "ibt=off" in /proc/cmdline. Any other debugging ideas would be appreciated.
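
Checking whether a parameter such as "ibt=off" actually reached the running kernel can be sketched like this. The helper name and the sample command line are hypothetical; on a live system the string would come from /proc/cmdline, read on the host, since the suggestion was to boot the host with that parameter.

```python
import shlex
from typing import Optional


def kernel_param(cmdline: str, name: str) -> Optional[str]:
    """Return the value of `name`, "" for a bare flag, or None if it is absent."""
    value = None
    for token in shlex.split(cmdline):
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""  # this sketch reports the last occurrence
    return value


# Hypothetical sample; on a live host use open("/proc/cmdline").read() instead.
sample = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda1 ro quiet ibt=off"
print(kernel_param(sample, "ibt"))
```

Seeing the parameter here only confirms that it was passed at boot; it does not by itself show how the kernel acted on it.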

comment:5 by galitsyn, 11 months ago

Hi svorobev,

I cannot say at the moment what the reason might be. I see that the Extension Pack version does not match and that you are using NVMe as the storage attachment type (NVMe is part of the Extension Pack, so installing the proper version might be sufficient).

If the Extension Pack upgrade does not help, could you please attach the guest dmesg (https://www.virtualbox.org/wiki/Serial_redirect)? I see you might already have it in /tmp/serial.log. Also, the host dmesg might provide more information.

comment:6 by galitsyn, 11 months ago

Btw, have you tried changing the storage controller type to, say, AHCI or PIIX4?

in reply to:  5 comment:7 by svorobev, 11 months ago

Replying to galitsyn:

Hi galitsyn, I have attached ubuntu_dmesg.log. It is the guest dmesg log without the additional printk statements I added later (I can send that too). I have traced the problem to a specific kernel call, blk_mq_freeze_queue_wait(), which is reached when the loop device driver's init adds a disk, during disk queue registration and writeback throttling initialization. I have tried different Guest Additions versions and different VBox versions, including both Ubuntu-packaged and official VirtualBox builds. It is either a VirtualBox problem or a kernel problem (I think it is both).

Using the PIIX4 controller also made no difference.

Please take a look at the timing in the dmesg file; changing the clocksource in the kernel parameters does not help.

Here is my host dmesg from around the time the bug reproduced (I removed all audit, evdi, and Wi-Fi driver events; the issue also reproduced on a server with no evdi enabled).

[69148.695284] ACPI: EC: interrupt unblocked
[69149.506585] OOM killer enabled.
[69149.506588] Restarting tasks ... done.
[69149.516465] random: crng reseeded on system resumption
[69149.551620] thermal thermal_zone8: failed to read out thermal zone (-61)
[69149.601652] PM: suspend exit
[69149.916220] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_component_ops [i915])
[69149.916785] mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
[69150.255090] ucsi_acpi USBC000:00: failed to re-enable notifications (-110)
[69893.782157] vboxdrv: 0000000000000000 VMMR0.r0
[69893.877074] vboxdrv: 0000000000000000 VBoxDDR0.r0
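
Timestamp jumps in a dmesg excerpt like the one above can be located mechanically. The following is a minimal sketch, assuming the usual "[seconds.microseconds] message" line prefix; the function name and the 60-second threshold are arbitrary choices for illustration, and the sample lines are copied from the host log above. A gap in a filtered log may simply reflect removed or absent events, so a hit is a place to look rather than a diagnosis.

```python
import re
from typing import Iterable, List, Tuple

# Matches prefixes such as "[69148.695284]" and "[    0.000000]".
_TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def timestamp_gaps(lines: Iterable[str], threshold: float) -> List[Tuple[float, float]]:
    """Return (previous, current) timestamp pairs that are more than `threshold` seconds apart."""
    gaps = []
    previous = None
    for line in lines:
        match = _TIMESTAMP.match(line)
        if match is None:
            continue  # skip lines without a timestamp prefix
        current = float(match.group(1))
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


sample = [
    "[69149.601652] PM: suspend exit",
    "[69150.255090] ucsi_acpi USBC000:00: failed to re-enable notifications (-110)",
    "[69893.782157] vboxdrv: 0000000000000000 VMMR0.r0",
    "[69893.877074] vboxdrv: 0000000000000000 VBoxDDR0.r0",
]
for before, after in timestamp_gaps(sample, threshold=60.0):
    print(f"gap of {after - before:.3f} s between {before:.6f} and {after:.6f}")
```

The same function can be run over the guest log to find the timing discrepancies mentioned in the description.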

Thank you for your answer.

Last edited 11 months ago by galitsyn

© 2023 Oracle