VirtualBox

Ticket #10031 (new defect)

Opened 2 years ago

Last modified 19 months ago

High I/O causing filesystem corruption

Reported by: Indigo42 Owned by:
Priority: major Component: virtual disk
Version: VirtualBox 4.1.4 Keywords:
Cc: Guest type: Linux
Host type: Linux

Description (last modified by klaus) (diff)

Hello, we are running an Ubuntu 10.04 32-bit guest on a RHEL 64-bit host. The guest storage controller is SATA AHCI with host I/O caching enabled.

Host info:
OS Type: Linux (2.6.18-194.el5)
VirtualBox: 4.1.4 (74291)
Processors: Intel(R) Xeon(R) CPU 3060 @ 2.40GHz (2)
HWVirtEx, PAE, Long Mode (64-bit)

It seems that during periods of high I/O, such as backups, the guest filesystem is remounted read-only. We have this issue with both ext3 and ext4 filesystems, although switching to ext3 seemed to reduce the frequency.

Here is a snippet of syslog from around the time of the corruption:

Dec 15 08:01:48 vault sm-mta[1437]: rejecting connections on daemon MTA-v4: load average: 12
Dec 15 08:01:48 vault sm-mta[1437]: rejecting connections on daemon MSP-v4: load average: 12
Dec 15 08:02:03 vault sm-mta[1437]: rejecting connections on daemon MTA-v4: load average: 12
Dec 15 08:02:03 vault sm-mta[1437]: rejecting connections on daemon MSP-v4: load average: 12
Dec 15 08:02:18 vault sm-mta[1437]: rejecting connections on daemon MTA-v4: load average: 12
Dec 15 08:02:18 vault sm-mta[1437]: rejecting connections on daemon MSP-v4: load average: 12
Dec 15 08:02:29 vault kernel: [ 2338.011698] ata4.00: exception Emask 0x0 SAct 0xffff SErr 0x0 action 0x6 frozen
Dec 15 08:02:29 vault kernel: [ 2338.011714] ata4.00: failed command: WRITE FPDMA QUEUED
Dec 15 08:02:29 vault kernel: [ 2338.011721] ata4.00: cmd 61/00:00:b7:5c:9d/04:00:09:00:00/40 tag 0 ncq 524288 out
Dec 15 08:02:29 vault kernel: [ 2338.011723]          res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Dec 15 08:02:29 vault kernel: [ 2338.011726] ata4.00: status: { DRDY }
Dec 15 08:02:29 vault kernel: [ 2338.011729] ata4.00: failed command: WRITE FPDMA QUEUED
Dec 15 08:02:29 vault kernel: [ 2338.011736] ata4.00: cmd 61/00:08:77:31:9d/04:00:09:00:00/40 tag 1 ncq 524288 out
Dec 15 08:02:29 vault kernel: [ 2338.011737]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 15 08:02:29 vault kernel: [ 2338.011740] ata4.00: status: { DRDY }
Dec 15 08:02:29 vault kernel: [ 2338.011743] ata4.00: failed command: WRITE FPDMA QUEUED
Dec 15 08:02:29 vault kernel: [ 2338.011749] ata4.00: cmd 61/00:10:77:35:9d/04:00:09:00:00/40 tag 2 ncq 524288 out
Dec 15 08:02:29 vault kernel: [ 2338.011751]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

And then later in the file:

Dec 15 08:02:29 vault kernel: [ 2338.011969] ata4: hard resetting link
Dec 15 08:02:29 vault kernel: [ 2338.332268] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec 15 08:02:33 vault kernel: [ 2342.108804] ata4.00: configured for UDMA/133
Dec 15 08:02:33 vault kernel: [ 2342.108811] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108815] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108818] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108820] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108823] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108826] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108829] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108831] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108834] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108837] ata4.00: device reported invalid CHS sector 0
Dec 15 08:02:33 vault kernel: [ 2342.108840] ata4.00: device reported invalid CHS sector 0

At this point the guest has to be shut down and the drive mounted and checked with fsck, which complains about a lot of I/O errors.

Attachments

VBox.log Download (54.0 KB) - added by Indigo42 2 years ago.
Logfile for the guest
VBox.log.1 Download (93.5 KB) - added by Indigo42 2 years ago.
Second Logfile for the Guest.

Change History

comment:1 Changed 2 years ago by localhostargentina

This bug has been happening for a long time, since the 3.x versions; I tried 4.1.6 with kernel 3.1.5 and it still happens. Whenever a Linux guest has a lot of disk activity, its filesystem crashes and goes read-only. We have to shut down, repair the filesystem and reboot again to make it work. We tried both AMD and Intel (with VT extensions) hosts, toggled all the flags (ioapic, pae, nestedpaging, vtxvpid, different chipsets, hwvirtex, hwvirtexexcl, hpet, largepages, synthcpu and pagefusion) on and off, and ran with and without the virtualbox-guest-additions; nothing changes.

To reproduce, run several VirtualBox virtual machines on the same server and let one VM generate heavy disk I/O for a while; its filesystem will crash.

On the host, I see a lot of "page allocation failure" traces like this:

VBoxHeadless:

page allocation failure: order:9, mode:0x344d2
Pid: 3776, comm: VBoxHeadless Not tainted 3.1.5-gentoo #1
Call Trace:
 [<ffffffff8109e887>] ? warn_alloc_failed+0x106/0x130
 [<ffffffff810a18be>] ? __alloc_pages_nodemask+0x6bc/0x6f9
 [<ffffffffa0020a48>] ? rtR0MemAllocEx+0xbe/0x116 [vboxdrv]
 [<ffffffffa0020dde>] ? rtR0TermNative+0x264/0x54c [vboxdrv]
 [<ffffffffa0020f0f>] ? rtR0TermNative+0x395/0x54c [vboxdrv]
 [<ffffffffa0033edf>] ? g_abExecMemory+0x3e3f/0x180000 [vboxdrv]
 [<ffffffffa003874d>] ? g_abExecMemory+0x86ad/0x180000 [vboxdrv]
 [<ffffffffa0044367>] ? g_abExecMemory+0x142c7/0x180000 [vboxdrv]
 [<ffffffffa0046179>] ? g_abExecMemory+0x160d9/0x180000 [vboxdrv]
 [<ffffffffa0046c30>] ? g_abExecMemory+0x16b90/0x180000 [vboxdrv]
 [<ffffffffa003aac7>] ? g_abExecMemory+0xaa27/0x180000 [vboxdrv]
 [<ffffffffa001cc75>] ? supdrvIOCtl+0x128b/0x23d2 [vboxdrv]
 [<ffffffffa0020a48>] ? rtR0MemAllocEx+0xbe/0x116 [vboxdrv]
 [<ffffffffa001834a>] ? SUPR0Printf+0x1bb/0x235 [vboxdrv]
 [<ffffffff810e3692>] ? do_vfs_ioctl+0x3fd/0x43e
 [<ffffffff8107a99e>] ? sys_futex+0x132/0x144
 [<ffffffff810e370f>] ? sys_ioctl+0x3c/0x61
 [<ffffffff810e36d5>] ? sys_ioctl+0x2/0x61
 [<ffffffff813f293b>] ? system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
CPU    2: hi:    0, btch:   1 usd:   0
CPU    3: hi:    0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:  88
CPU    1: hi:  186, btch:  31 usd:   0
CPU    2: hi:  186, btch:  31 usd:   0
CPU    3: hi:  186, btch:  31 usd:   0
Node 0 Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd:  35
CPU    1: hi:  186, btch:  31 usd:   0
CPU    2: hi:  186, btch:  31 usd:   0
CPU    3: hi:  186, btch:  31 usd:   0
active_anon:55213 inactive_anon:13423 isolated_anon:0
 active_file:591922 inactive_file:592035 isolated_file:0
 unevictable:0 dirty:38310 writeback:280 unstable:0
 free:227453 slab_reclaimable:39250 slab_unreclaimable:26638
 mapped:83791 shmem:0 pagetables:6131 bounce:0
Node 0 DMA free:15924kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15700kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 3254 16132 16132
Node 0 DMA32 free:219600kB min:13620kB low:17024kB high:20428kB active_anon:57196kB inactive_anon:35036kB active_file:755084kB inactive_file:755128kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3332896kB mlocked:0kB dirty:55196kB writeback:220kB mapped:8040kB shmem:0kB slab_reclaimable:43016kB slab_unreclaimable:9960kB kernel_stack:8kB pagetables:2396kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:208 all_unreclaimable? no
lowmem_reserve[]: 0 0 12877 12877
Node 0 Normal free:674288kB min:53896kB low:67368kB high:80844kB active_anon:163656kB inactive_anon:18656kB active_file:1612428kB inactive_file:1613012kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:13186560kB mlocked:0kB dirty:98044kB writeback:900kB mapped:327124kB shmem:0kB slab_reclaimable:113984kB slab_unreclaimable:96592kB kernel_stack:2096kB pagetables:22128kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 1*4kB 0*8kB 1*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15924kB
Node 0 DMA32: 3024*4kB 2090*8kB 1462*16kB 875*32kB 528*64kB 249*128kB 96*256kB 48*512kB 22*1024kB 1*2048kB 0*4096kB = 219600kB
Node 0 Normal: 4490*4kB 5913*8kB 4506*16kB 3159*32kB 1876*64kB 965*128kB 313*256kB 107*512kB 54*1024kB 1*2048kB 0*4096kB = 674288kB
1184083 total pagecache pages
130 pages in swap cache
Swap cache stats: add 9947, delete 9817, find 594/648
Free swap  = 13349756kB
Total swap = 13386492kB
4194288 pages RAM
2629507 pages reserved
893107 pages shared
433355 pages non-shared

and the guests log things like this:

Buffer I/O error on device sda2, logical block 4099963
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda]  Result: hostbyte=0x00 driverbyte=0x06
sd 0:0:0:0: [sda] CDB: cdb[0]=0x2a: 2a 00 00 69 65 3e 00 00 08 00
end_request: I/O error, dev sda, sector 6907198
Buffer I/O error on device sda2, logical block 835286
EXT4-fs warning (device sda2): ext4_end_bio:242: I/O error writing to inode 2341288 (offset 0 size 4096 starting block 863400)
JBD2: Detected IO errors while flushing file data on sda2-8

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: failed command: WRITE DMA
ata1.00: cmd ca/00:08:d6:a0:1b/00:00:00:00:00/e0 tag 0 dma 4096 out
        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: configured for UDMA/133
ata1.00: device reported invalid CHS sector 0
ata1: EH complete

ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: hard resetting link

ata1.00: status: { DRDY }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/08:30:a6:a1:44/00:00:04:00:00/40 tag 6 ncq 4096 out
        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

ata1.00: device reported invalid CHS sector 0

Please fix this problem; we have already been waiting a long time for a fix.

Regards;

k

comment:2 Changed 2 years ago by msurkova

We also see similar messages in the guest logs sometimes during high I/O load (sometimes caused by the guests, sometimes caused by host applications):

Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771243] ata4.00: status: { DRDY }
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771247] ata4.00: failed command: WRITE FPDMA QUEUED
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771253] ata4.00: cmd 61/08:08:a7:43:48/00:00:07:00:00/40 tag 1 ncq 4096 out
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771257] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771261] ata4.00: status: { DRDY }
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771265] ata4.00: failed command: WRITE FPDMA QUEUED
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771270] ata4.00: cmd 61/10:10:2f:1b:87/00:00:07:00:00/40 tag 2 ncq 8192 out
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771273] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771277] ata4.00: status: { DRDY }
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771280] ata4.00: failed command: WRITE FPDMA QUEUED
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771285] ata4.00: cmd 61/b0:18:87:a4:87/00:00:07:00:00/40 tag 3 ncq 90112 out
Dec 28 17:09:07 xxxxxxxxx kernel: [182026.771288] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

However, we do not see drastic consequences like filesystem corruption, perhaps because the problem goes away before the guest OS retries the operation. We are running 64-bit Ubuntu 10.04 server as the host on a 24-CPU machine with 64 GB RAM, with a mix of OpenSuse 11.3 and Ubuntu 10.10 guests. The host cache is enabled since the VM disk images are on an ext4 filesystem. After digging around, it seems this may not be a bug in VirtualBox itself, but rather an I/O timeout issue that occurs when the host OS flushes dirty disk buffers to storage. This can stall the VirtualBox disk I/O subsystem in a write() call for some time, and in turn cause timeouts in the guest SATA code. So this could be a problem for any type-2 hypervisor running on top of the host file system cache.

If this is sensible, I would like advice on how to prevent this undesired behavior. The following approaches are possible:

1. Upgrade the host OS kernel to 2.6.36+ and disable "host I/O cache" for the virtual SATA adapters. This lets VirtualBox bypass the host file system buffers, so it neither pollutes them with dirty pages nor gets stalled during a buffer flush. In fact this is recommended in the VirtualBox manual (apart from the known issue with ext4 host file systems).

2. Tune the host OS page flusher to avoid accumulating a significant amount of dirty pages in the host file system buffers. For example, set vm.dirty_background_ratio to 5 or below (still a large size, about 1 GB on this machine) or set vm.dirty_background_bytes to something smaller, such as 20971520, to make the page flusher more active. This may make the host file system cache less effective, but it prevents write operations from getting stuck during a buffer flush.

3. Tune VirtualBox to flush the VM image files more often, so that fewer pages in the host buffer cache become dirty, as described here: http://www.virtualbox.org/manual/ch12.html#configPeriodicFlush
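For reference, the three approaches above roughly translate into commands like the following. This is a sketch only: the VM name "vault", the controller name "SATA", and the byte values are examples, and the exact extradata key depends on the storage controller (ahci here, per the manual section linked above).

```shell
# (1) Disable the host I/O cache on the virtual SATA controller
#     (controller name must match what "VBoxManage showvminfo" reports):
VBoxManage storagectl "vault" --name "SATA" --hostiocache off

# (2) Make the host page flusher start writeback sooner
#     (~20 MB of dirty pages instead of the default percentage-based limit):
sysctl -w vm.dirty_background_bytes=20971520

# (3) Ask VirtualBox to flush the image file after roughly every 1 MB written
#     (LUN#0 = first disk on the AHCI controller; value is in bytes):
VBoxManage setextradata "vault" \
    "VBoxInternal/Devices/ahci/0/LUN#0/Config/FlushInterval" 1000000
```

Note that (2) affects the whole host, while (1) and (3) are per-VM settings.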

Could the VirtualBox developers tell us whether this makes sense and, if so, which approach to take?

Thanks a lot !

comment:3 follow-up: ↓ 4 Changed 2 years ago by g.andrew.stone@…

I am seeing the exact same thing. Host: Ubuntu 11.10/64, guest: Ubuntu 10.04/64, VirtualBox 4.1.2.

comment:4 in reply to: ↑ 3 Changed 2 years ago by frank

Replying to g.andrew.stone@gmail.com:

I am seeing the exact same thing. Host: Ubuntu 11.10/64, guest: Ubuntu 10.04/64, VirtualBox 4.1.2.

Please try VirtualBox 4.1.8.

comment:5 Changed 2 years ago by Olegch

I am experiencing exactly the same problem with VirtualBox 4.1.8.

Host: Ubuntu Linux 3.0.0-15-generic #26-Ubuntu SMP Fri Jan 20 17:23:00 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Guest: CentOS 5 Linux 2.6.18-274.17.1.el5 #1 SMP Tue Jan 10 17:25:58 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

comment:6 Changed 2 years ago by FrozenCow

I seem to have exactly the same problem. The host machine runs Ubuntu 10.04 with VirtualBox 4.1.8r75467 and hosts 5 virtual machines, a mix of Arch Linux and Ubuntu 10.04. I've been running VirtualBox in this setup for a year now, and it has given me this error a number of times over that year. When this happens, most guests output lines with "failed command: WRITE FPDMA QUEUED". The host machine outputs the following to dmesg:

[2583901.668803] VBoxHeadless: page allocation failure. order:4, mode:0x4020
[2583901.668808] Pid: 1482, comm: VBoxHeadless Not tainted 2.6.32-21-generic #32-Ubuntu
[2583901.668811] Call Trace:
[2583901.668822]  [<ffffffff810f9d7e>] __alloc_pages_slowpath+0x56e/0x580
[2583901.668827]  [<ffffffff810f9eee>] __alloc_pages_nodemask+0x15e/0x1a0
[2583901.668832]  [<ffffffff81131c12>] kmalloc_large_node+0x62/0xb0
[2583901.668837]  [<ffffffff81136265>] __kmalloc_node_track_caller+0x125/0x180
[2583901.668842]  [<ffffffff81452e7d>] ? dev_alloc_skb+0x1d/0x40
[2583901.668845]  [<ffffffff81452b90>] __alloc_skb+0x80/0x190
[2583901.668849]  [<ffffffff81452e7d>] dev_alloc_skb+0x1d/0x40
[2583901.668856]  [<ffffffffa02ccb0a>] vboxNetFltLinuxSkBufFromSG+0x3a/0x410 [vboxnetflt]
[2583901.668861]  [<ffffffffa02ccf85>] vboxNetFltPortOsXmit+0xa5/0xd0 [vboxnetflt]
[2583901.668866]  [<ffffffffa02ce109>] vboxNetFltPortXmit+0x69/0x1d0 [vboxnetflt]
[2583901.668871]  [<ffffffffa02ce0a1>] ? vboxNetFltPortXmit+0x1/0x1d0 [vboxnetflt]
[2583901.668886]  [<ffffffffa0155029>] g_abExecMemory+0x4ec89/0x180000 [vboxdrv]
[2583901.668891]  [<ffffffff81052a60>] ? update_cpu_power+0x80/0xf0
[2583901.668896]  [<ffffffff81540bbf>] ? _spin_lock_irqsave+0x2f/0x40
[2583901.668910]  [<ffffffffa00f5985>] ? VBoxHost_RTSpinlockReleaseNoInts+0x15/0x20 [vboxdrv]
[2583901.668922]  [<ffffffffa0153b77>] ? g_abExecMemory+0x4d7d7/0x180000 [vboxdrv]
[2583901.668934]  [<ffffffffa0159b45>] g_abExecMemory+0x537a5/0x180000 [vboxdrv]
[2583901.668938]  [<ffffffff81540c9e>] ? _spin_lock+0xe/0x20
[2583901.668951]  [<ffffffffa00f5942>] ? VBoxHost_RTSpinlockRelease+0x12/0x20 [vboxdrv]
[2583901.668961]  [<ffffffffa00e9da0>] ? SUPR0ObjAddRefEx+0x130/0x240 [vboxdrv]
[2583901.668976]  [<ffffffffa00fd4ac>] ? VBoxHost_RTHandleTableLookupWithCtx+0x9c/0x130 [vboxdrv]
[2583901.668988]  [<ffffffffa015aa49>] g_abExecMemory+0x546a9/0x180000 [vboxdrv]
[2583901.668999]  [<ffffffffa01876ec>] g_abExecMemory+0x8134c/0x180000 [vboxdrv]
[2583901.669011]  [<ffffffffa01886d0>] g_abExecMemory+0x82330/0x180000 [vboxdrv]
[2583901.669022]  [<ffffffffa0189985>] g_abExecMemory+0x835e5/0x180000 [vboxdrv]
[2583901.669033]  [<ffffffffa0125a5b>] g_abExecMemory+0x1f6bb/0x180000 [vboxdrv]
[2583901.669044]  [<ffffffffa0126770>] ? g_abExecMemory+0x203d0/0x180000 [vboxdrv]
[2583901.669055]  [<ffffffffa0117e81>] g_abExecMemory+0x11ae1/0x180000 [vboxdrv]
[2583901.669067]  [<ffffffffa011511e>] ? g_abExecMemory+0xed7e/0x180000 [vboxdrv]
[2583901.669078]  [<ffffffffa0152907>] g_abExecMemory+0x4c567/0x180000 [vboxdrv]
[2583901.669088]  [<ffffffffa011c2bc>] g_abExecMemory+0x15f1c/0x180000 [vboxdrv]
[2583901.669098]  [<ffffffffa00e899a>] supdrvIOCtlFast+0x8a/0xa0 [vboxdrv]
[2583901.669110]  [<ffffffffa00e82ec>] VBoxDrvLinuxIOCtl_4_1_8+0x4c/0x1d0 [vboxdrv]
[2583901.669114]  [<ffffffff81053980>] ? __dequeue_entity+0x30/0x50
[2583901.669119]  [<ffffffff8101179c>] ? __switch_to+0x1ac/0x320
[2583901.669126]  [<ffffffff81152a92>] vfs_ioctl+0x22/0xa0
[2583901.669129]  [<ffffffff8153e738>] ? thread_return+0x48/0x420
[2583901.669133]  [<ffffffff81152d41>] do_vfs_ioctl+0x81/0x380
[2583901.669137]  [<ffffffff811530c1>] sys_ioctl+0x81/0xa0
[2583901.669141]  [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
[2583901.669144] Mem-Info:
[2583901.669146] Node 0 DMA per-cpu:
[2583901.669149] CPU    0: hi:    0, btch:   1 usd:   0
[2583901.669151] CPU    1: hi:    0, btch:   1 usd:   0
[2583901.669154] CPU    2: hi:    0, btch:   1 usd:   0
[2583901.669157] CPU    3: hi:    0, btch:   1 usd:   0
[2583901.669159] CPU    4: hi:    0, btch:   1 usd:   0
[2583901.669161] CPU    5: hi:    0, btch:   1 usd:   0
[2583901.669164] CPU    6: hi:    0, btch:   1 usd:   0
[2583901.669166] CPU    7: hi:    0, btch:   1 usd:   0
[2583901.669168] Node 0 DMA32 per-cpu:
[2583901.669171] CPU    0: hi:  186, btch:  31 usd: 123
[2583901.669173] CPU    1: hi:  186, btch:  31 usd: 172
[2583901.669175] CPU    2: hi:  186, btch:  31 usd:  29
[2583901.669178] CPU    3: hi:  186, btch:  31 usd:  41
[2583901.669180] CPU    4: hi:  186, btch:  31 usd: 138
[2583901.669182] CPU    5: hi:  186, btch:  31 usd:  11
[2583901.669185] CPU    6: hi:  186, btch:  31 usd: 153
[2583901.669187] CPU    7: hi:  186, btch:  31 usd: 147
[2583901.669189] Node 0 Normal per-cpu:
[2583901.669191] CPU    0: hi:  186, btch:  31 usd:  44
[2583901.669194] CPU    1: hi:  186, btch:  31 usd: 166
[2583901.669196] CPU    2: hi:  186, btch:  31 usd:  64
[2583901.669198] CPU    3: hi:  186, btch:  31 usd:  58
[2583901.669200] CPU    4: hi:  186, btch:  31 usd:  71
[2583901.669203] CPU    5: hi:  186, btch:  31 usd:  14
[2583901.669205] CPU    6: hi:  186, btch:  31 usd:  31
[2583901.669207] CPU    7: hi:  186, btch:  31 usd: 168
[2583901.669213] active_anon:115139 inactive_anon:15245 isolated_anon:8
[2583901.669214]  active_file:2290459 inactive_file:2298973 isolated_file:48
[2583901.669215]  unevictable:0 dirty:474987 writeback:30553 unstable:0
[2583901.669217]  free:28813 slab_reclaimable:136963 slab_unreclaimable:20844
[2583901.669218]  mapped:888150 shmem:86 pagetables:3706 bounce:0
[2583901.669221] Node 0 DMA free:15800kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15216kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[2583901.669232] lowmem_reserve[]: 0 2991 24201 24201
[2583901.669237] Node 0 DMA32 free:85700kB min:2460kB low:3072kB high:3688kB active_anon:2152kB inactive_anon:4308kB active_file:1202816kB inactive_file:1203848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3063520kB mlocked:0kB dirty:224284kB writeback:13476kB mapped:50360kB shmem:0kB slab_reclaimable:137644kB slab_unreclaimable:8784kB kernel_stack:24kB pagetables:140kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[2583901.669250] lowmem_reserve[]: 0 0 21210 21210
[2583901.669254] Node 0 Normal free:13752kB min:17440kB low:21800kB high:26160kB active_anon:458404kB inactive_anon:56672kB active_file:7959020kB inactive_file:7992044kB unevictable:0kB isolated(anon):32kB isolated(file):192kB present:21719040kB mlocked:0kB dirty:1675664kB writeback:108736kB mapped:3502240kB shmem:344kB slab_reclaimable:410208kB slab_unreclaimable:74592kB kernel_stack:2672kB pagetables:14684kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[2583901.669267] lowmem_reserve[]: 0 0 0 0
[2583901.669270] Node 0 DMA: 2*4kB 2*8kB 2*16kB 2*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15800kB
[2583901.669282] Node 0 DMA32: 65*4kB 10*8kB 3*16kB 4*32kB 3*64kB 202*128kB 155*256kB 18*512kB 6*1024kB 0*2048kB 1*4096kB = 85700kB
[2583901.669292] Node 0 Normal: 2672*4kB 116*8kB 19*16kB 1*32kB 0*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 13744kB
[2583901.669302] 4595822 total pagecache pages
[2583901.669305] 6272 pages in swap cache
[2583901.669307] Swap cache stats: add 12918, delete 6646, find 494548/494821
[2583901.669309] Free swap  = 48799272kB
[2583901.669311] Total swap = 48829432kB
[2583901.749131] 6291440 pages RAM
[2583901.749133] 992618 pages reserved
[2583901.749135] 3897567 pages shared
[2583901.749136] 1382636 pages non-shared

Each time this happened, all virtual machines needed a reboot, since they had remounted their root partition read-only. On 3 of those occasions the disk of one particular (Ubuntu) virtual machine became corrupted: Ubuntu could not find some of its files when it rebooted afterwards. Using fsck I was able to recover the important files, and I had to reinstall all packages to be able to boot again.

I don't know whether it is a bug in my hardware, Linux/Ubuntu or VirtualBox, but it is quite a severe problem for me.

What msurkova said (disk caching) is what I have been thinking too after some more digging. I'd also really like some advice on what to try, so that I do what is most sensible and don't mess with settings I don't fully understand.

Thank you in advance.

comment:7 Changed 2 years ago by V-Alexeev

We're having exactly the same problem here. VirtualBox 4.1.8, Ubuntu 10.10 host, Ubuntu 11.10 guest.

Although it is much more severe for us: our server runs a lot of I/O-intensive tasks, and the problem now happens every 30-40 minutes, rendering the guests completely unusable.

comment:8 follow-up: ↓ 9 Changed 2 years ago by V-Alexeev

Configuring periodic flush seems to fix the problem for us (https://www.virtualbox.org/manual/ch12.html#configPeriodicFlush)

comment:9 in reply to: ↑ 8 Changed 20 months ago by fade2gray

Replying to V-Alexeev:

Configuring periodic flush seems to fix the problem for us (https://www.virtualbox.org/manual/ch12.html#configPeriodicFlush)

Thanks for posting this - it also worked for me, running an Ubuntu server guest on a low-spec Windows host.

comment:10 Changed 19 months ago by klaus

  • Description modified (diff)

Forcing periodic flushes can only be a workaround for some bug in the Linux kernel (losing control over I/O scheduling when a lot of unwritten data is in the cache), so I'm not at all convinced that this is a good idea or the final word. VirtualBox does normal user-land reads/writes from/to the image files; no kernel-level trickery is involved at all. There have been other complaints about Linux becoming unresponsive when a lot of data is written, e.g. https://bugzilla.kernel.org/show_bug.cgi?id=43094 - that one was closed on formal grounds, but it's highly unlikely that the nvidia kernel module makes any difference. We'll see if we can provide a very simple testcase for the Linux folks to reproduce.
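The kind of minimal testcase hinted at above might look like the following sketch (an assumption on my part, not the official testcase): fill the page cache with dirty data from plain user-land writes, then time how long a flush stalls. On a host with default vm.dirty_* settings and a slow disk, the sync can take many seconds, which is the window in which a guest's SATA commands would time out.

```shell
# Write ~1 GB through the page cache (buffered, like VirtualBox image writes):
dd if=/dev/zero of=/tmp/dirty-test bs=1M count=1024

# Time how long flushing all that dirty data takes; any process issuing
# writes during this window can be stalled for a similar duration:
time sync

rm -f /tmp/dirty-test
```

File path and sizes are arbitrary; the point is only that ordinary buffered writes can accumulate enough dirty pages to make a subsequent flush stall for longer than a guest kernel's SATA timeout.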

comment:11 Changed 19 months ago by m.s.s.

Yes, VirtualBox does normal user-land reads/writes. But most normal user-land applications are not very sensitive to read/write operation timing, while a guest kernel running in VirtualBox is. So if a write operation on the host takes too long (which can happen when the host flushes too many dirty pages to disk), the guest kernel may treat it as a SATA timeout and trigger its error-handling procedure. There are several ways to overcome this issue:

1. Configure VirtualBox to bypass the host cache (use direct I/O). This may cause issues with ext4 in some kernel versions (as in Ubuntu 10.04), and may not be enough if the host itself performs heavy write operations.

2. Use a "raw" VMDK on top of LVM partitions to store the VM disks, rather than files on a file system (this also avoids the double file system overhead).

3. Configure the host kernel to flush the buffer cache more frequently, so that heavy write batches do not build up.

4. Use the periodic flush setting in VirtualBox to limit the amount of data in the host buffer cache.
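Option 2 above (a raw VMDK over an LVM logical volume) can be set up roughly as follows. This is a hedged sketch: the volume group /dev/vg0, logical volume vm-disk, and the VMDK path are example names, and the user running the VM needs read/write access to the block device.

```shell
# Create a logical volume to hold the guest disk (names are examples):
lvcreate -L 20G -n vm-disk vg0

# Create a VMDK descriptor that points directly at the raw block device,
# so guest writes bypass the host file system entirely:
VBoxManage internalcommands createrawvmdk \
    -filename /var/vbox/vm-disk.vmdk -rawdisk /dev/vg0/vm-disk
```

The resulting vm-disk.vmdk is then attached to the VM like any other disk image; since no host file system sits underneath it, the host's dirty-page flushing no longer stalls the guest's writes.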

In general, I think only type-1 hypervisors with para-virtual disk drivers may be completely free of this issue; emulating real hardware on top of user-land processes can always cause problems like this.
