VirtualBox

Ticket #8631 (closed defect: fixed)

Opened 3 years ago

Last modified 3 years ago

Large page allocation times out - causes guest hang

Reported by: joe42
Owned by:
Priority: major
Component: VMM
Version: VirtualBox 4.0.4
Keywords: large pages
Cc:
Guest type: Linux
Host type: Solaris

Description

On a host currently running 11 guests (mixed Linux and Windows), one of the Debian 5 64-bit Linux guests quickly freezes up as soon as it starts running its batch jobs (CPU intensive and NFS I/O intensive).

The last message in the log file before the freeze is:

 00:00:51.020 PGMR3PhysAllocateLargePage: allocating large pages takes too long (last attempt 3960 ms; nr of timeouts 2); DISABLE
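
For anyone who wants to watch for this message across several VBox.log files, here is a minimal Python sketch. The regular expression is modelled only on the single log line quoted above, and the helper name `find_large_page_timeouts` is hypothetical. It detects the message after it has been logged; it does not predict the timeout.

```python
import re

# Pattern modelled on the one log line quoted in this ticket (an assumption:
# other VirtualBox versions may word the message differently).
LARGE_PAGE_RE = re.compile(
    r"PGMR3PhysAllocateLargePage: allocating large pages takes too long "
    r"\(last attempt (\d+) ms; nr of timeouts (\d+)\)"
)

def find_large_page_timeouts(log_text):
    """Return (last_attempt_ms, timeout_count) for each matching log line."""
    return [(int(ms), int(count)) for ms, count in LARGE_PAGE_RE.findall(log_text)]

# The log line from the description, used as sample input.
sample = (
    "00:00:51.020 PGMR3PhysAllocateLargePage: allocating large pages "
    "takes too long (last attempt 3960 ms; nr of timeouts 2); DISABLE"
)
events = find_large_page_timeouts(sample)
print(events)
```

To scan a real log, pass the contents of the VBox.log file to the function instead of `sample`.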

The host system processors are only utilized some 11-20% while all this is happening. Also, vmstat reports a 7.5GB free list.

Attaching to the VM console never results in an updated console window - I get the RDP window, but the contents of it never update.

Worse yet, when I attempt to power off the VM using VBoxManage controlvm poweroff, the poweroff hangs before 30%:

 $ VBoxManage controlvm sparrow.rd.evalesco.com poweroff
0%...10%...20%...

After several minutes, I try to kill the VBoxHeadless process that is hanging. This causes VBoxManage showvminfo to hang for the VM, but the process does not go away.

A kill -9 is not effective either. The process lives until I reboot the system.

Attachments

VBox.log Download (46.5 KB) - added by joe42 3 years ago.

Change History

Changed 3 years ago by joe42

comment:1 Changed 3 years ago by ramshankar

These are two separate issues. First is the large page allocation failure, and the next is the poweroff hang leading to unkillable VM processes.

I'm 99% sure the VM poweroff hang issue is the one we just internally fixed and will be part of the next VirtualBox release. FTR, it's a bug in the kernel-side event-semaphore code that was fixed.

As for the large-page issue, are you limiting your ZFS arc-cache? Could you post the output of "kstat zfs"?

comment:2 Changed 3 years ago by joe42

Sure:

module: zfs                             instance: 0     
name:   arcstats                        class:    misc
        c                               1073109888
        c_max                           1073741824
        c_min                           1073109888
        crtime                          170.747016437
        data_size                       499514368
        deleted                         3778
        demand_data_hits                232539
        demand_data_misses              3417
        demand_metadata_hits            550561
        demand_metadata_misses          2721
        evict_l2_cached                 0
        evict_l2_eligible               3102720
        evict_l2_ineligible             313006080
        evict_skip                      0
        hash_chain_max                  3
        hash_chains                     1371
        hash_collisions                 37633
        hash_elements                   38570
        hash_elements_max               38571
        hdr_size                        7510848
        hits                            791813
        l2_abort_lowmem                 0
        l2_cksum_bad                    0
        l2_evict_lock_retry             0
        l2_evict_reading                0
        l2_feeds                        0
        l2_free_on_write                0
        l2_hdr_size                     0
        l2_hits                         0
        l2_io_error                     0
        l2_misses                       0
        l2_read_bytes                   0
        l2_rw_clash                     0
        l2_size                         0
        l2_write_bytes                  0
        l2_writes_done                  0
        l2_writes_error                 0
        l2_writes_hdr_miss              0
        l2_writes_sent                  0
        memory_throttle_count           0
        mfu_ghost_hits                  2136
        mfu_hits                        591228
        misses                          40713
        mru_ghost_hits                  59
        mru_hits                        192945
        mutex_miss                      0
        other_size                      7910240
        p                               520093696
        prefetch_data_hits              1101
        prefetch_data_misses            734
        prefetch_metadata_hits          7612
        prefetch_metadata_misses        33841
        recycle_miss                    554
        size                            514935456
        snaptime                        538993.564543555

module: zfs                             instance: 0     
name:   vdev_cache_stats                class:    misc
        crtime                          170.747033753
        delegations                     79240
        hits                            243639
        misses                          67903
        snaptime                        538993.565089241

module: zfs                             instance: 0     
name:   zfetchstats                     class:    misc
        bogus_streams                   0
        colinear_hits                   24
        colinear_misses                 259227
        crtime                          170.745427598
        hits                            1307794
        misses                          259251
        reclaim_failures                241396
        reclaim_successes               17831
        snaptime                        538993.565153127
        streams_noresets                5876
        streams_resets                  2
        stride_hits                     1301918
        stride_misses                   84

I meant to limit the ZFS ARC - I have this in /etc/system:

 * Limit ZFS ARC to 1GiB because we really just want to use our
 * memory for VMs and do not use local disk...
 set zfs:zfs_arc_max = 1073741824
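
As a quick cross-check of the numbers above, the following Python sketch parses a few arcstats lines copied from this comment and confirms that c_max equals the 1 GiB set in /etc/system and that the current ARC size is below it. The helper name `parse_kstat` is hypothetical, and the parser assumes the simple "name value" layout shown here.

```python
# A few arcstats values copied from the "kstat zfs" output in this comment.
KSTAT_SAMPLE = """\
        c                               1073109888
        c_max                           1073741824
        c_min                           1073109888
        size                            514935456
"""

def parse_kstat(text):
    """Return {name: int} for every 'name value' line with an integer value."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats

GIB = 1024 ** 3
MIB = 1024 ** 2

arc = parse_kstat(KSTAT_SAMPLE)
arc_limit_is_1gib = arc["c_max"] == 1 * GIB   # matches zfs_arc_max in /etc/system
arc_size_mib = arc["size"] / MIB              # current ARC size in MiB

print(f"c_max is 1 GiB: {arc_limit_is_1gib}")
print(f"ARC size: {arc_size_mib:.0f} MiB of {arc['c_max'] // MIB} MiB allowed")
```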

comment:3 Changed 3 years ago by ramshankar

The arc-cache looks good. How much memory are the 11 VMs using in total? RAM+vRAM of the VMs. The log you posted indicates 2 GB. Are all VMs identical?

Is there anything interesting in the syslog (/var/adm/messages)?

comment:4 Changed 3 years ago by joe42

dmesg reported this some days ago, but that was long before the last failure (I had another VM fail the same way yesterday):

 Mar 25 08:57:13 turkey vboxdrv: [ID 456520 kern.notice] NOTICE: vbi_internal_alloc() failure for 2097152 bytes
 Mar 25 10:31:59 turkey vboxdrv: [ID 456520 kern.notice] NOTICE: vbi_internal_alloc() failure for 2097152 bytes

/var/adm/messages is empty.

No, the VMs are not identical. Some are Linux, some are Windows, most are dual processor, most have around 2-4G memory.

If I grep out RAM and VRAM for the VMs on the system, I get:

 RAM (MB)  VRAM (MB)
      512         12
     3072         18
     2048         18
     3072         18
     3072         18
     4096         18
     1024         12
     1024         12
     4096         12  (Large page alloc crash yesterday)
     2048         12  (Original large page crash)
     2048         18
      512         18

The total is some 26G if I am not much mistaken. Is that simply too much?
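
The arithmetic can be checked with a short Python sketch. The pairs are copied from the list above and are assumed to be in MB, as VirtualBox reports them.

```python
# (RAM in MB, VRAM in MB) pairs copied from the list in this comment.
VMS = [
    (512, 12), (3072, 18), (2048, 18), (3072, 18), (3072, 18), (4096, 18),
    (1024, 12), (1024, 12), (4096, 12), (2048, 12), (2048, 18), (512, 18),
]

total_ram_mb = sum(ram for ram, _ in VMS)
total_vram_mb = sum(vram for _, vram in VMS)
total_gib = (total_ram_mb + total_vram_mb) / 1024

print(f"RAM: {total_ram_mb} MB, VRAM: {total_vram_mb} MB, total: {total_gib:.2f} GiB")
```

This gives 26624 MB of RAM plus 186 MB of VRAM, about 26.2 GiB in total, which agrees with the estimate of some 26G.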

comment:5 Changed 3 years ago by joe42

I decreased the memory allocations for many of the VMs (and added another VM). Now I have a total memory reserved to VMs of just below 21GB.

The system seems to be perfectly stable this way. So far so good.

But I feel that I am flying blind. It seems that with 32GB of host memory, utilizing 26GB will get me in trouble but 21GB will not. There seems to be no "early warning"; when I use too much, a VM will crash. This is not very reassuring.

Is there any way that I can get an indication of beginning memory shortage, or an idea of the memory remaining? It seems that just using "vmstat" is not the answer as 26GB utilization got me in trouble (with the ZFS ARC limited to 1GB). (If only VM memory could be paged to disk, beginning page traffic would be an indicator, but I assume that there are significant problems with this..).

As for the bug report, let us close it for now. If the hang is fixed, this is more of a host utilization issue than an actual bug (unless you consider it a bug that 7GB of free memory is not enough to keep the system afloat).

comment:6 Changed 3 years ago by ramshankar

I did some allocation fixes now, but they are relevant for Solaris 11, not Solaris 10 (which your log indicates is what you're using). So I'm not really sure what might be failing here. For now you could try disabling large pages for all your VMs and see if that makes a difference (VBox-4.0.x turns on large pages by default). You can disable them using:

VBoxManage modifyvm <vmname> --largepages off
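
To apply this to every registered VM in one go, something like the following Python sketch could work. It assumes `VBoxManage list vms` prints one `"name" {uuid}` pair per line and that VBoxManage is on the PATH. The helper names and the UUID in the sample are hypothetical. Note that modifyvm only works on VMs that are powered off, so the function that actually runs the commands is defined here but not called.

```python
import re
import subprocess

# Assumed line format of `VBoxManage list vms`: "name" {uuid}
VM_LINE_RE = re.compile(r'^"(?P<name>.+)" \{(?P<uuid>[0-9a-fA-F-]+)\}$')

def parse_vm_names(list_vms_output):
    """Extract VM names from the text printed by `VBoxManage list vms`."""
    names = []
    for line in list_vms_output.splitlines():
        match = VM_LINE_RE.match(line.strip())
        if match:
            names.append(match.group("name"))
    return names

def largepages_off_commands(names):
    """Build one `modifyvm --largepages off` command per VM name."""
    return [["VBoxManage", "modifyvm", name, "--largepages", "off"] for name in names]

def disable_large_pages_everywhere():
    """Run the commands for all registered VMs (the VMs must be powered off)."""
    listing = subprocess.run(
        ["VBoxManage", "list", "vms"],
        capture_output=True, text=True, check=True,
    ).stdout
    for command in largepages_off_commands(parse_vm_names(listing)):
        subprocess.run(command, check=True)

# Dry run on sample text; the UUID is made up for illustration.
sample_listing = '"sparrow.rd.evalesco.com" {01234567-89ab-cdef-0123-456789abcdef}\n'
commands = largepages_off_commands(parse_vm_names(sample_listing))
for command in commands:
    print(" ".join(command))
```

Printing the commands first and running them only after review keeps the change easy to verify and to undo per VM.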

comment:7 Changed 3 years ago by ramshankar

  • Status changed from new to closed
  • Resolution set to fixed

This should be fixed in 4.1.x but 4.0.x will probably still need the above mentioned workaround.

Please reopen if the bug still persists in 4.1.2.
