VirtualBox

Ticket #2524 (closed defect: fixed)

Opened 5 years ago

Last modified 4 years ago

data corruption under heavy I/O load on host

Reported by: shmem
Owned by:
Priority: critical
Component: other
Version: VirtualBox 2.0.2
Keywords:
Cc:
Guest type: other
Host type: other

Description

Heavy I/O load on the host leads to IDE DMA timeouts, device resets, incorrect read/write operations and ultimately to data corruption in the VBox.

vb10:~# tail -n 0 -f /var/log/messages
Oct 27 18:28:35 vb10 kernel: [ 1559.569320] hdb: dma_timer_expiry: dma status == 0x61
Oct 27 18:28:45 vb10 kernel: [ 1569.569594] hdb: DMA timeout error
Oct 27 18:28:45 vb10 kernel: [ 1569.570105] hdb: dma timeout error: status=0x48 { DriveReady DataRequest }
Oct 27 18:28:45 vb10 kernel: [ 1569.570127] ide: failed opcode was: unknown
Oct 27 18:28:45 vb10 kernel: [ 1569.570766] hda: DMA disabled
Oct 27 18:28:45 vb10 kernel: [ 1569.572039] hdb: DMA disabled
Oct 27 18:29:20 vb10 kernel: [ 1599.568186] ide0: reset timed-out, status=0x90
Oct 27 18:29:20 vb10 kernel: [ 1604.572374] hda: status timeout: status=0x90 { Busy }
Oct 27 18:29:20 vb10 kernel: [ 1604.572396] ide: failed opcode was: unknown
Oct 27 18:29:20 vb10 kernel: [ 1604.580286] Clocksource tsc unstable (delta = 4687228551 ns)
Oct 27 18:30:00 vb10 kernel: quest: I/O error, dev hdb, sector 506799
Oct 27 18:30:01 vb10 kernel: [ 1635.421648] __ratelimit: 5022 messages suppressed
Oct 27 18:30:01 vb10 kernel: [ 1635.421648] lost page write due to I/O error on hdb1
Oct 27 18:31:02 vb10 kernel: nd_request: I/O error, dev hdb, sector 551111
Oct 27 18:31:09 vb10 kernel: [ 1714.280038] __ratelimit: 5014 messages suppressed
Oct 27 18:31:09 vb10 kernel: [ 1714.280038] lost page write due to I/O error on hdb1
Oct 27 18:31:20 vb10 kernel: [ 1725.024393] lost page write due to I/O error on hdb1
Oct 27 18:31:26 vb10 kernel: [ 1730.728966] lost page write due to I/O error on hdb1
Oct 27 18:31:29 vb10 kernel: [ 1733.614585] lost page write due to I/O error on hdb1
Oct 27 18:31:29 vb10 kernel: [ 1733.905566] lost page write due to I/O error on hdb1
Oct 27 18:31:31 vb10 kernel: [ 1736.305170] __ratelimit: 10 messages suppressed
Oct 27 18:31:31 vb10 kernel: [ 1736.305403] lost page write due to I/O error on hdb1
Oct 27 18:31:36 vb10 kernel: [ 1741.416768] __ratelimit: 23 messages suppressed
Oct 27 18:31:36 vb10 kernel: [ 1741.417144] lost page write due to I/O error on hdb1
Oct 27 18:31:41 vb10 kernel: [ 1746.255712] __ratelimit: 21 messages suppressed
Oct 27 18:31:41 vb10 kernel: [ 1746.255917] lost page write due to I/O error on hdb1
Oct 27 18:31:46 vb10 kernel: [ 1751.279075] __ratelimit: 22 messages suppressed
Oct 27 18:31:46 vb10 kernel: [ 1751.279288] lost page write due to I/O error on hdb1
Oct 27 18:31:51 vb10 kernel: [ 1756.235377] __ratelimit: 20 messages suppressed
Oct 27 18:31:51 vb10 kernel: [ 1756.235377] lost page write due to I/O error on hdb1
Oct 27 18:31:56 vb10 kernel: [ 1761.304274] __ratelimit: 20 messages suppressed
Oct 27 18:31:56 vb10 kernel: [ 1761.304455] lost page write due to I/O error on hdb1
Oct 27 18:32:01 vb10 kernel: 009488] end_request: I/O error, dev hdb, sector 580247
Oct 27 18:32:03 vb10 kernel: r, dev hdb, sector 601439
Oct 27 18:32:03 vb10 kernel: [ 1767.301521] __journal_remove_journal_head: freeing b_frozen_data
^C
vb10:~# ls -l /data
ls: cannot access /data/lost+found: Input/output error
total 260968
-rw-r----- 1 root root 266964992 2008-10-27 18:30 foo.img
d????????? ? ?    ?            ?                ? lost+found
vb10:~#

It looks like iowait conditions on the host don't lead to blocking of the virtual PCI bus, which causes the disk driver to run into timeouts as per the PCI spec.
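When comparing runs, it can help to tally the guest kernel messages quoted above per device. The following is only a rough sketch: the regular expressions are derived from the log lines in this ticket and are an assumption about what other kernels print, and the names `PATTERNS`, `SAMPLE` and `tally` are made up for illustration.

```python
import re
from collections import Counter

# Patterns modelled on the kernel messages quoted in this ticket (assumption:
# other kernel versions may word these messages differently).
PATTERNS = {
    "dma_timeout": re.compile(
        r"\b(hd[a-z]): (?:dma_timer_expiry|DMA timeout error|dma timeout error)"
    ),
    "dma_disabled": re.compile(r"\b(hd[a-z]): DMA disabled"),
    "io_error": re.compile(r"I/O error, dev (hd[a-z]), sector \d+"),
    "lost_write": re.compile(r"lost page write due to I/O error on (hd[a-z])\d*"),
}

# A few lines copied from the log excerpt in the description.
SAMPLE = [
    "Oct 27 18:28:35 vb10 kernel: [ 1559.569320] hdb: dma_timer_expiry: dma status == 0x61",
    "Oct 27 18:28:45 vb10 kernel: [ 1569.569594] hdb: DMA timeout error",
    "Oct 27 18:28:45 vb10 kernel: [ 1569.570766] hda: DMA disabled",
    "Oct 27 18:30:00 vb10 kernel: quest: I/O error, dev hdb, sector 506799",
    "Oct 27 18:30:01 vb10 kernel: [ 1635.421648] lost page write due to I/O error on hdb1",
]


def tally(lines):
    """Count matching kernel messages, keyed by (kind, device)."""
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, match.group(1))] += 1
                break  # at most one kind per log line
    return counts


if __name__ == "__main__":
    for (kind, device), n in sorted(tally(SAMPLE).items()):
        print(f"{device} {kind}: {n}")
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `tally(open("/var/log/messages"))`.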

Setup:

Host: 8 CPUs, 32 GB RAM, 2.8 TB RAID 5 divided into 32 logical volumes of 83 GB each, running debian lenny

VBoxes: 768 MB RAM, 1 GB ext3 root fs (hda), 11 GB ext3 data fs (hdb), running debian lenny

I have been testing throughput and stability running iozone simultaneously in different numbers of VBoxes. Corruption rate was 0 out of 4 (0/4), 5/8 and 16/16.

Change History

comment:1 Changed 5 years ago by martind

I can confirm this. I'm seeing the same "dma_timer_expiry" and "DMA disabled" sequence, until eventually the VM becomes unresponsive and has to be rebooted.

(I started seeing this during a batch load of a large PostgreSQL database, i.e. under high I/O load -- which I wouldn't consider an uncommon situation. I can't successfully load the data in one go, will try to split it up. So the bug seems critical to me indeed :)

comment:2 Changed 5 years ago by aeichner

We are aware of this issue and are working on it, but this will take some time as we have to change how I/O currently works. At the moment we use the kernel cache of the host to speed up reading/writing the data from/to the image. If the kernel cache is full, the data is written to the disk, which can block any other I/O operation for a long time, especially if the cache is quite big. We could just disable the cache, but this would cause bad I/O performance, so we have to implement our own caching. Could you please check whether the workaround mentioned in chapter 11.1.2 of the manual fixes the issue? You may need to try different values before the timeouts disappear while still getting decent I/O performance.
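For readers without the manual at hand, the workaround is the FlushInterval extradata setting. The sketch below only builds and prints the VBoxManage command. The key path is quoted from memory of the manual section on IDE/SATA errors with file-based images, so please check it against the manual for your version. The VM name "vb10", the LUN number and the 1000000-byte starting value are assumptions for illustration, and `flush_interval_cmd` is a hypothetical helper.

```python
import shlex


def flush_interval_cmd(vm_name, lun, interval_bytes, device="piix3ide"):
    """Build the VBoxManage call that sets FlushInterval for one disk.

    device: "piix3ide" for the IDE controller (assumed; the manual names
            "ahci" for SATA).
    lun:    for IDE, 0/1 = master/slave on the first channel,
            2/3 = master/slave on the second channel.
    interval_bytes: bytes written between forced flushes; 0 disables it.
    """
    key = f"VBoxInternal/Devices/{device}/0/LUN#{lun}/Config/FlushInterval"
    return ["VBoxManage", "setextradata", vm_name, key, str(interval_bytes)]


if __name__ == "__main__":
    # hdb in the guest would be the slave on the first IDE channel (LUN#1).
    cmd = flush_interval_cmd("vb10", 1, 1000000)
    # Dry run: print the command rather than executing it.
    print(shlex.join(cmd))
```

Smaller values should make the timeouts less likely at the cost of write throughput, which is why some trial and error is needed.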

comment:3 Changed 5 years ago by martind

Thanks for the feedback and the pointer to your docs.

I did experiment with lower FlushInterval settings, and at the values I tried it did reduce the frequency of error messages. But ultimately what I'm trying to do takes way too long with lower values, so I ended up doing it on another system. I did not confirm whether there was a setting that stopped all errors (I would have needed to wait for days for the process to finish.)

comment:4 Changed 5 years ago by frank

Is this problem still current with VBox 3.0.8?

comment:5 follow-up: ↓ 6 Changed 4 years ago by cokegen

Having this error but slightly different, maybe the same ? I'm using 3.0.10 on Windows (XP SP2) with a Debian guest. Lost a database on this possible bug / HD failing :-(

Could be my HD ? I don't see any events in the Windows Event Viewer

Dec 11 06:41:35 vbox kernel: [108262.205567] hdb: dma_intr: status=0x41 { DriveReady Error }
Dec 11 06:41:35 vbox kernel: [108262.205567] hdb: dma_intr: error=0x10 { SectorIdNotFound }, LBAsect=18749623, sector=18749623
Dec 11 06:41:35 vbox kernel: [108262.205567] ide: failed opcode was: unknown
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: dma_intr: status=0x41 { DriveReady Error }
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: dma_intr: error=0x10 { SectorIdNotFound }, LBAsect=18749623, sector=18749623
Dec 11 06:41:35 vbox kernel: [108262.237569] ide: failed opcode was: unknown
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: dma_intr: status=0x41 { DriveReady Error }
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: dma_intr: error=0x10 { SectorIdNotFound }, LBAsect=18749623, sector=18749623
Dec 11 06:41:35 vbox kernel: [108262.237569] ide: failed opcode was: unknown
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: dma_intr: status=0x41 { DriveReady Error }
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: dma_intr: error=0x10 { SectorIdNotFound }, LBAsect=18749623, sector=18749623
Dec 11 06:41:35 vbox kernel: [108262.237569] ide: failed opcode was: unknown
Dec 11 06:41:35 vbox kernel: [108262.237569] hda: DMA disabled
Dec 11 06:41:35 vbox kernel: [108262.237569] hdb: DMA disabled
Dec 11 06:41:35 vbox kernel: [108262.300669] ide0: reset: success
Dec 11 06:41:35 vbox kernel: [108262.348672] hda: task_out_intr: status=0x41 { DriveReady Error }
Dec 11 06:41:36 vbox kernel: [108262.348672] hda: task_out_intr: error=0x10 { SectorIdNotFound }, LBAsect=17011429, sector=17011429
Dec 11 06:41:36 vbox kernel: [108262.348672] ide: failed opcode was: unknown
Dec 11 06:41:36 vbox kernel: [108262.348672] hdb: task_out_intr: status=0x41 { DriveReady Error }
Dec 11 06:41:36 vbox kernel: [108262.348672] hdb: task_out_intr: error=0x10 { SectorIdNotFound }, LBAsect=18749623, sector=18749623
Dec 11 06:41:36 vbox kernel: [108262.348672] ide: failed opcode was: unknown

comment:6 in reply to: ↑ 5 Changed 4 years ago by klaus

Replying to cokegen:

Having this error but slightly different, maybe the same ? I'm using 3.0.10 on Windows (XP SP2) with a Debian guest. Lost a database on this possible bug / HD failing :-(

Could be my HD ? I don't see any events in the Windows Event Viewer

From the symptoms it looks like an unrelated issue (corrupted hard disk image or something), so please open a new ticket for your problem, and attach VBox.log from a VirtualBox run where those errors showed up.

Ah, just saw you already did. Thanks.

comment:7 Changed 4 years ago by nmelnick

Still seeing this in VirtualBox 3.1, trying to sign on to this ticket for updates. The workaround noted in the manual is subject to tweaking, and I'm still working at that. Is there a recommended filesystem for hosting VirtualBox VMs? ZFS on OpenSolaris? ext3/4 on Linux?

comment:8 Changed 4 years ago by frank

Yes, these file systems should be fine, although I'm not sure about ext4 because this file system isn't as stable as ext3 yet.

comment:9 Changed 4 years ago by achr

Having same problem with 3.1.4 on Ubuntu 9.10 host with Debian4 guest. I already use ext3 with my debian guest.

Is there already a workaround available?!

comment:10 Changed 4 years ago by achr

*aaah* I now read the ticket description more attentively... It says: "if the host disk is on heavy load, the vbox disk is having the described DMA etc. problems." Okay, in that case, I have to be more precise:

Actually, I have two machines:

  • VBox machine: powerful CPU, enough RAM
  • Storage-Server with lot of disks and RAID etc...

Both machines are directly connected (x-link) with 1Gbit network. The virtualized harddisks are located on the storage server. The VBox server mounts a special folder (that contains the virtualized harddisks) via NFS via this 1GBit x-link. This is actually very fast: I can transfer >80MB/sec without any problem.

But VBox has the above described problem if the guest system issues heavy load on the virtualized harddisk. I think the problem is the same, but with a different derivation...

I would love to see some progress on this bug-ticket ... ;-)

comment:11 Changed 4 years ago by nmelnick

Can we put forth donations for someone to work on this? I really don't want to uproot from vbox, but this is occasionally causing some pretty huge issues, and I really don't have the skills necessary to help much. :(

comment:12 Changed 4 years ago by aeichner

This was actually fixed with 3.2. Make sure to disable the host I/O cache in the controller settings and don't store the image on ext4 because there seems to be some bug in ext4 causing data corruption.
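The host I/O cache can also be switched off from the command line rather than through the controller settings dialog. This is a minimal sketch that builds the `VBoxManage storagectl` call with its `--hostiocache` option and prints it. The VM name "vb10" and the controller name "IDE Controller" are placeholders, `hostiocache_cmd` is a hypothetical helper, and the actual controller name has to be taken from your VM's configuration.

```python
import shlex


def hostiocache_cmd(vm_name, controller_name, enabled=False):
    """Build the VBoxManage call that toggles the host I/O cache
    for one storage controller of a VM."""
    state = "on" if enabled else "off"
    return [
        "VBoxManage", "storagectl", vm_name,
        "--name", controller_name,
        "--hostiocache", state,
    ]


if __name__ == "__main__":
    # Placeholder names; the VM should be powered off when changing this.
    cmd = hostiocache_cmd("vb10", "IDE Controller")
    # Dry run: print the command rather than executing it.
    print(shlex.join(cmd))
```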

comment:13 Changed 4 years ago by frank

  • Status changed from new to closed
  • Resolution set to fixed