VirtualBox

Opened 4 years ago

Closed 4 years ago

#19311 closed defect (invalid)

Changes on Archlinux 1/27/20 to support Linux 5.5 cause 10 min. boot of guest after downgrade to Linux 5.4

Reported by: drankinatty Owned by:
Component: other Version: VirtualBox 5.2.36
Keywords: Cc:
Guest type: Linux Host type: Linux

Description

This issue was first discussed on the virtualbox-users list on 2/4/20 under the thread "[VBox-users] 5.5 kernel and 5.2.36 - Does this not work with the new 5.2.36?"

Following direction to https://www.virtualbox.org/ticket/19145 I was directed to file a separate bug and submit the strace and dmesg output for the guest here.

I build the 5.2.x branch for the AUR repository. https://aur.archlinux.org/packages/virtualbox-bin-5/

During update on 2/4/20, Archlinux moved to Linux 5.5 kernel. The virtualbox drivers built and on reboot starting my Archlinux guest headless, virtualbox reports that the guest started. 5 minutes later the boot had yet to complete and a poweroff was issued. So I built with 5.2.36 - same result, so I built with testcase 5.2.37 (135942), same result.

After posting to the vbox-users list and being informed of https://www.virtualbox.org/ticket/19145 and 5.2 not yet supporting Linux 5.5 kernel, I downgraded my host. The same problem remained, but this time I just left the boot of a simple Archlinux guest that boot to a text console running. After 10 minutes -- the boot succeeded, but the guest was unusably slow.

For example, users cannot login because the 90 second login timeout expires between the time the user enters their user name and presses [Enter] and the display of the 'password:' prompt. root can login as there is no 90 second timeout (takes about 120 seconds for 'password:' to display after entering user name.

All was fine on 1/27/20 when I last updated the guest to the Linux 5.4.14 kernel. I have downgraded both the host and guest multiple times to the package/kernel state they were in on 1/14/20, 1/21/20, 1/24/20, and 1/27/20. The guests are still unusable. It is like a config or something similar was modified to support the 5.5 kernel that isn't reverted on downgrade (or something similar) that is causing the problem. That's the only reason I can think that downgrading the entire system before the point the problem appeared does not resole the issue. I have a number of guests I use, and all was well before the 5.5 update.

I have attached a strace of the communication between the Archlinux host and Archlinux guest. I am no expert interpreting the strace communications, but when each bit of I/O is taking more than 10 seconds -- that looks like an issue to me. I have also attached the dmesg output from the Archlinux guest after waiting though the 10 minute boot (it normally boots fully in less than 20 seconds).

I'm happy to provide anything else you can think of that may identify the issue.

Attachments (5)

dmesg+strace_vbox-5.2.tar.xz (21.5 KB ) - added by drankinatty 4 years ago.
strace of I/O between Host/Guest, dmesg from Guest
dmesg+strace_vbox-headless-5.2.tar.xz (17.1 KB ) - added by drankinatty 4 years ago.
Additional dmesg + strace from vbox-headless session
20200222-025921_vbox_err.log.xz (4.5 KB ) - added by drankinatty 4 years ago.
strace between Linux Host/Guest during upgrade of guest to Linux 5.5
vbox_5.2.38.png (62.3 KB ) - added by drankinatty 4 years ago.
Screenshot of headless guest attempting to boot (about 6 minutes)
virtualbox-ext-oracle-5-5.2.38-1.src.tar.gz (1.2 KB ) - added by drankinatty 4 years ago.
Archlinux Source Build for virtualbox-ext-oracle_5.2.38 to go along with AUR build

Download all attachments as: .zip

Change History (14)

by drankinatty, 4 years ago

strace of I/O between Host/Guest, dmesg from Guest

by drankinatty, 4 years ago

Additional dmesg + strace from vbox-headless session

comment:1 by drankinatty, 4 years ago

In the new attachment from the headless session, the strace shows what looks like some type of race condition. There are a few hundred lines at the beginning, ending with write(1, "Oracle VM VirtualBox Headless In"..., 1471) = 1471 and exit_group(1) and +++ exited with 1 +++

Then 40,000 lines of read(9, "\372", 1) followed by select(10, [9], NULL, [9], NULL) with an occasional futex(0x23e2f88, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)

Then at the end, roughly 100 lines of FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1581693996, tv_nsec=146944000}, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)

Something is spinning in circles and it is bringing the system to it knees. This last session I did an update to the Guest that normally takes less than 2 minutes to complete -- 4 hours later, the update finally completed.

by drankinatty, 4 years ago

strace between Linux Host/Guest during upgrade of guest to Linux 5.5

comment:2 by drankinatty, 4 years ago

Troubleshooting this issue, I managed to get the Archlinux guest updated to Linux 5.5 while running testcase 5.2.37 (135942), but the guest additions failed to build. I have opened a separate bug attaching the vboxadd-setup.log to https://www.virtualbox.org/ticket/19337.

I captured the strace of the host process which I included as an additional attachment here. Guest under 5.5 continue to be unusably slow. Several hours were required for this last update of the guest that would normally have taken 5 minutes. Let me know if I can provide anything further.

comment:3 by Frank Batschulat (Oracle), 4 years ago

if someone wants to play with this, here's the Archlinux installation guide:

https://wiki.archlinux.org/index.php/Installation_guide

comment:4 by drankinatty, 4 years ago

Here is my package that I build for AUR - https://aur.archlinux.org/packages/virtualbox-bin-5/

I have updated it to 5.2.38 and the same issue remains. I am currently on minute 10 of trying to boot the Archlinux guest. I'll add a screenshot of the guest failures. Thank you for any help you can give.

by drankinatty, 4 years ago

Attachment: vbox_5.2.38.png added

Screenshot of headless guest attempting to boot (about 6 minutes)

by drankinatty, 4 years ago

Archlinux Source Build for virtualbox-ext-oracle_5.2.38 to go along with AUR build

comment:5 by drankinatty, 4 years ago

When I look at top on both host and guest, the cpu utilization generally less that 4% on both host and guest with occasional cpu utilization up to 30% or so when data is processed. That generally lasts for no more than a second or two, then utilization drops back to idle. It is like there is an I/O deadlock between the host and guest, that will only let a few bytes through at a time.

iostat -c on the guest shows:

Linux 5.5.5-arch1-1 (vl1)       02/24/2020      _x86_64_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.14    0.00    0.18   54.55    0.00   45.14

iostat -d on the guest shows

Linux 5.5.5-arch1-1 (vl1)       02/24/2020      _x86_64_        (2 CPU)

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
sda               3.82       102.85        61.57         0.00     323790     193848          0
scd0              0.01         0.02         0.00         0.00         71          0          0

I'm not sure what that adds but the %iowait looks high. I don't know what it normally is for the guest, but I'm trying to run just about any tool I can look at to rule-in/out different issues as being related to this problem.

If you can think of anything else that I can post that would help -- just let me know.

comment:6 by drankinatty, 4 years ago

Here is a new piece of information. I have another server, a backup, that has not been updated since 1/25/20 that has the same Archlinux host/guest setup. (along with openSUSE guests). It is running the 5.4.14 kernel. I booted today to check and it runs PERFECTLY. No slowness, Archlinux guest boots in 18 seconds, iostat waits are minimal and it is a joy to use.

I will not update that box until we have resolved this bug as it can provide whatever information may be requested on the pre-update state when things were working. I haven't been able to make any further determination of what the problem is. I know the host/guest I/O is broken for 5.2 with whatever Archlinux changes were made to support the 5.5 kernel. (and since Arch is bleeding edge, others will likely hit this problem when their distro moves forward) I just do not know enough about the internal host/guest communications to know where/what to look at.

Any help would be appreciated and I'm happy to pull whatever information from the working or non-working systems you may care to see.

comment:7 by drankinatty, 4 years ago

Hold off testing, this may to be an issue with the Linux 5.5 kernel and the mdadm Raid1 array the guest is on where the Raid1 is comprised of 2-disks instead of 2-partitions. I have posted to the kernel-raid mailing list (under title "Linux 5.5 Breaks Raid1 on Device instead of Partition, Unusable I/O") and will provide the feedback I receive. (gotta love kernel version updates...) A Raid1 I/O issue would explain all the slowdown symptoms seen:

For what it is worth the Raid1 details are:

[14:17 valkyrie:/home/david/tmp] # mdadm -E /dev/sdc
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 6e520607:f152d8b9:dd2a3bec:5f9dc875
           Name : valkyrie:4  (local to host valkyrie)
  Creation Time : Mon Mar 21 02:27:21 2016
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
     Array Size : 2930135488 (2794.39 GiB 3000.46 GB)
  Used Dev Size : 5860270976 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=48 sectors
          State : clean
    Device UUID : e15f0ea7:7e973d0c:f7ae51a1:9ee4b3a4

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Mar  1 14:18:07 2020
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 62472be - correct
         Events : 8193


   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing, 'R' == replacing)


[14:18 valkyrie:/home/david/tmp] # mdadm -E /dev/sdd
/dev/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 6e520607:f152d8b9:dd2a3bec:5f9dc875
           Name : valkyrie:4  (local to host valkyrie)
  Creation Time : Mon Mar 21 02:27:21 2016
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
     Array Size : 2930135488 (2794.39 GiB 3000.46 GB)
  Used Dev Size : 5860270976 (2794.39 GiB 3000.46 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
   Unused Space : before=262056 sectors, after=48 sectors
          State : clean
    Device UUID : f745d11a:c323f477:71f8a0d9:27d8c717

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Mar  1 14:18:15 2020
  Bad Block Log : 512 entries available at offset 72 sectors
       Checksum : 9101220e - correct
         Events : 8194


   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing, 'R' == replacing)


[14:18 valkyrie:/home/david/tmp] # mdadm -D /dev/md4
/dev/md4:
           Version : 1.2
     Creation Time : Mon Mar 21 02:27:21 2016
        Raid Level : raid1
        Array Size : 2930135488 (2794.39 GiB 3000.46 GB)
     Used Dev Size : 2930135488 (2794.39 GiB 3000.46 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sun Mar  1 14:18:32 2020
             State : clean, checking
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

      Check Status : 1% complete

              Name : valkyrie:4  (local to host valkyrie)
              UUID : 6e520607:f152d8b9:dd2a3bec:5f9dc875
            Events : 8194

    Number   Major   Minor   RaidDevice State
       0       8       32        0      active sync   /dev/sdc
       2       8       48        1      active sync   /dev/sdd

comment:8 by drankinatty, 4 years ago

This bug can be closed. The problem was the underlying disk /dev/sdc. As you can see above, there are no errors regarding the array and there were no errors in the journal, but when drilling down with iostat, the /dev/sdc disk was suffering high io_wait which caused the guest that was stored on the array fits essentially bringing it to it's knees. The disk was replaced and array rebuilt and all guests are now happy on 5.2.38 and the Linux 5.5.7 kernel. The fact that no errors were generated regarding the disk failure and the coincidence of the problem occurring at the time of the Linux 5.4 -> 5.5 update, made diagnosing this issue a bit of a rabbit trail. Thank you for your help.

Please close this ticket as (invalid/not-a-bug).

comment:9 by Frank Batschulat (Oracle), 4 years ago

Resolution: invalid
Status: newclosed
Note: See TracTickets for help on using tickets.

© 2023 Oracle
ContactPrivacy policyTerms of Use