VirtualBox

Opened 2 years ago

Closed 2 years ago

#20875 closed defect (fixed)

pdmBlkCacheEvictPagesFrom leaking locks and causing VM deadlocks [FIXED IN SVN]

Reported by: aaronk
Owned by:
Component: VMM
Version: VirtualBox 6.1.32
Keywords: deadlock cache
Cc:
Guest type: Linux
Host type: Linux

Description

I was experiencing VM I/O deadlocks. I analyzed one instance and found three threads that appeared to be deadlocked:

Thread 20 (Thread 0x7f23cd9b1700 (LWP 13085)):
==============================================
#0  0x00007f240286ba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f2401f79458 in RTSemEventWait () from /usr/lib/virtualbox/VBoxRT.so
#2  0x00007f2401f1ddfa in RTCritSectEnter () from /usr/lib/virtualbox/VBoxRT.so
#3  0x00007f23f4acdac6 in PDMR3BlkCacheWrite () from /usr/lib/virtualbox/components/VBoxVMM.so

Thread 19 (Thread 0x7f23cd430700 (LWP 13086)):
==============================================
#0  0x00007f240286b39e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x00007f2401f739c1 in RTSemRWRequestWrite () from /usr/lib/virtualbox/VBoxRT.so
#2  0x00007f23f4acbd9b in pdmBlkCacheEvictPagesFrom(PDMBLKCACHEGLOBAL*, unsigned long, PDMBLKLRULIST*, PDMBLKLRULIST*, bool, unsigned char**) [clone .isra.5] () from /usr/lib/virtualbox/components/VBoxVMM.so
#3  0x00007f23f4acbfe5 in pdmBlkCacheReclaim(PDMBLKCACHEGLOBAL*, unsigned long, bool, unsigned char**) [clone .part.6] [clone .constprop.9] () from /usr/lib/virtualbox/components/VBoxVMM.so
#4  0x00007f23f4acdbca in PDMR3BlkCacheWrite () from /usr/lib/virtualbox/components/VBoxVMM.so

Thread 32 (Thread 0x7f23cda32700 (LWP 30856)):
==============================================
#0  0x00007f240286b39e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x00007f2401f739c1 in RTSemRWRequestWrite () from /usr/lib/virtualbox/VBoxRT.so
#2  0x00007f23f4ace04e in PDMR3BlkCacheIoXferComplete () from /usr/lib/virtualbox/components/VBoxVMM.so

The first thread (Thread 20) was waiting on a lock on a pCache object, and the second and third threads (Thread 19 and Thread 32) were waiting on pBlkCache->SemRWEntries.

It turns out that Thread 19 was holding the lock on pCache, which I think is why Thread 20 was blocked. The only explanation I could find for why the pBlkCache->SemRWEntries lock was being held is an imbalance in the pdmBlkCacheEvictPagesFrom function. This is the first time I’ve really looked at the VirtualBox source, so I’m not entirely sure, but it seems that the imbalance in pdmBlkCacheEvictPagesFrom would only show up under high contention on VirtualBox’s write cache.
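
To illustrate the kind of imbalance described above, here is a minimal sketch in standard C++. It is not the VirtualBox code: ToyLock, ToyEntry, evictLeaky and evictBalanced are hypothetical names, and the shape of the function (take the lock, re-check the entry, release only when the re-check passes) is an assumption based on the patch context.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a write lock. This is NOT the IPRT RTSemRW API;
// it only records whether somebody currently holds the lock.
class ToyLock {
public:
    // Returns true if the lock was free and is now held by the caller.
    bool tryAcquire() { return !held_.exchange(true); }
    void release() { held_.store(false); }
    bool isHeld() const { return held_.load(); }
private:
    std::atomic<bool> held_{false};
};

// Hypothetical cache entry carrying only the field the re-check needs.
struct ToyEntry {
    int refs = 0;
};

// Leaky variant: the lock is released only on the path that evicts.
bool evictLeaky(ToyLock& lock, const ToyEntry& entry) {
    if (!lock.tryAcquire())
        return false;
    if (entry.refs == 0) {      // re-check under the lock
        lock.release();
        return true;            // entry evicted, lock released
    }
    return false;               // BUG: returns with the lock still held
}

// Balanced variant: every path that acquired the lock also releases it.
bool evictBalanced(ToyLock& lock, const ToyEntry& entry) {
    if (!lock.tryAcquire())
        return false;
    const bool evicted = (entry.refs == 0);
    lock.release();
    return evicted;
}
```

In this model the leak only happens when the re-check under the lock fails, that is, when another user picked up a reference in the meantime. That would be consistent with the problem appearing only under contention, although the sketch does not prove it for the real code.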

Here’s the fix I’ve tried and it appears to avoid the deadlocks:

Index: src/VBox/VMM/VMMR3/PDMBlkCache.cpp
===================================================================
--- src/VBox/VMM/VMMR3/PDMBlkCache.cpp  (revision 93537)
+++ src/VBox/VMM/VMMR3/PDMBlkCache.cpp  (working copy)
@@ -460,7 +460,10 @@
                     RTSemRWReleaseWrite(pBlkCache->SemRWEntries);
                     RTMemFree(pCurr);
                 }
-            }
+            } else {
+                LogRel(("Would have left without releasing pBlkCache->SemRWEntries"));
+                RTSemRWReleaseWrite(pBlkCache->SemRWEntries);
+            }

         }
         else
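
For comparison, a common way to rule out this class of bug in standard C++ is a scope guard that releases the lock on every exit path. The sketch below is only an illustration of that pattern, not what the patch above does: CountingLock and evictWithGuard are hypothetical names, and the real code uses RTSemRWRequestWrite / RTSemRWReleaseWrite directly.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical lock that only counts calls, so the acquire/release balance
// can be inspected afterwards. It does not block and is not thread-safe.
struct CountingLock {
    int acquired = 0;
    int released = 0;
    void lock() { ++acquired; }
    void unlock() { ++released; }
    bool balanced() const { return acquired == released; }
};

// Hypothetical eviction check: the guard calls unlock() when it goes out of
// scope, so the early return cannot leave the lock held.
bool evictWithGuard(CountingLock& lock, int refs) {
    std::lock_guard<CountingLock> guard(lock);
    if (refs != 0)
        return false;   // entry in use; guard still releases the lock
    return true;        // entry may be evicted; guard releases the lock
}
```

std::lock_guard works with any type that provides lock() and unlock(), which is why the counting stand-in can be used here in place of a real mutex.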

Change History (2)

comment:1 by bird, 2 years ago

Summary: pdmBlkCacheEvictPagesFrom leaking locks and causing VM deadlocks → pdmBlkCacheEvictPagesFrom leaking locks and causing VM deadlocks [FIXED IN SVN]

Thanks for debugging and tracking this down. The semaphore handling is certainly out of balance there in one code path. I've committed a similar fix (logging + style differs). May take a little while before this becomes externally visible, though. It will be included in 6.1.36, but it might just have missed the boat for 6.1.34, we'll see...

comment:2 by galitsyn, 2 years ago

Resolution: fixed
Status: new → closed

The issue should be fixed in VirtualBox 6.1.34. Closing.
