VirtualBox

Ticket #616 (closed defect: fixed)

Opened 16 years ago

Last modified 13 years ago

Assertion failed in sems-linux.cpp(219) => Fixed in SVN/3.0.6

Reported by: freggy Owned by:
Component: VMM Version: VirtualBox 3.0.4
Keywords: Cc:
Guest type: other Host type: Linux

Description (last modified by frank) (diff)

VirtualBox OSE 1.5.0, as included in Mandriva 2008.0 Cooker, crashes while installing the Mandriva 2008.0 i586 edition via network on a Mandriva 2008.0 Cooker x86_64 host. This can be found in the logs:

00:16:13.398 !!Assertion Failed!!
00:16:13.398 Expression: i < 4096
00:16:13.411 Location  : /home/mandrake/rpm/BUILD/VirtualBox-1.5.0_OSE/src/VBox/Runtime/r3/linux/sems-linux.cpp(219) int RTSemEventSignal(RTSEMEVENTINTERNAL*)
00:16:13.475 iCur=0x1 pIntEventSem=0000000000a5ccf0

Attachments

VBox.log (27.3 KB) - added by freggy 16 years ago.
Vbox.log
2.6.30-r5config.rtf (53.4 KB) - added by michael55123 14 years ago.
2.6.30-R5 Kernel Config- Gentoo
2.6.28.7.config (80.0 KB) - added by tg2861 14 years ago.
Kernel config on a machine with this issue
2.6.30-r4config (50.8 KB) - added by malte 14 years ago.
Kernel configuration 2.6.30-gentoo-r4
config-2.6.30-gentoo-r6 (64.1 KB) - added by SDNick484 14 years ago.
Another Gentoo .config

Change History

Changed 16 years ago by freggy

Vbox.log

comment:1 Changed 16 years ago by freggy

Actually, this seems to happen when I minimise the guest VM's window in GNOME - Mandriva Cooker 2008.0, x86_64.

comment:2 Changed 15 years ago by freggy

It seems like this problem still exists in 1.5.4. I just had the same crash on Mandriva Cooker x86_64 (Linux 2.6.24-rc6) with Virtualbox 1.5.4:

1193:59:47.780 !!Assertion Failed!! 1193:59:47.780 Expression: i < 4096 1193:59:47.780 Location : /home/mandrake/rpm/BUILD/VirtualBox-1.5.4_OSE/src/VBox/Runtime/r3/linux/sems-linux.cpp(219) int RTSemEventSignal(RTSEMEVENTINTERNAL*) 1193:59:47.810 iCur=0x1 pIntEventSem=00000000009cb000

The crash mentioned in  http://vbox.innotek.de/pipermail/vbox-dev/2007-November/000394.html seems to be the same problem too.

comment:3 Changed 15 years ago by freggy

crash in a code block to improve readability:

1193:59:47.780 
1193:59:47.780 !!Assertion Failed!!
1193:59:47.780 Expression: i < 4096
1193:59:47.780 Location  : /home/mandrake/rpm/BUILD/VirtualBox-1.5.4_OSE/src/VBox/Runtime/r3/linux/sems-linux.cpp(219) int RTSemEventSignal(RTSEMEVENTINTERNAL*)
1193:59:47.810 iCur=0x1 pIntEventSem=00000000009cb000
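
For anyone who needs to spot this assertion across many VBox.log files, the block above is regular enough to pick out mechanically. The following is a minimal Python sketch, not a VirtualBox tool: the function name find_assertions and the regular expressions are illustrative assumptions based only on the log lines quoted in this ticket, and it expects one log entry per line.

```python
import re

# Patterns modelled on the log lines quoted in this ticket (an assumption,
# not an official description of the VBox.log format).
LOCATION_RE = re.compile(
    r"Location\s*:\s*(?P<path>\S+)\((?P<line>\d+)\)\s+(?P<func>.+)$"
)
EXPRESSION_RE = re.compile(r"Expression:\s*(?P<expr>.+)$")


def find_assertions(log_text):
    """Return one dict per '!!Assertion Failed!!' block found in log_text."""
    results = []
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if "!!Assertion Failed!!" in line:
            # Start collecting details for a new assertion block.
            current = {}
            results.append(current)
            continue
        if current is None:
            continue
        match = EXPRESSION_RE.search(line)
        if match:
            current["expression"] = match.group("expr")
            continue
        match = LOCATION_RE.search(line)
        if match:
            # Keep only the file name, since the build directory differs per package.
            current["file"] = match.group("path").rsplit("/", 1)[-1]
            current["line"] = int(match.group("line"))
            current["function"] = match.group("func")
    return results


sample = """\
1193:59:47.780 !!Assertion Failed!!
1193:59:47.780 Expression: i < 4096
1193:59:47.780 Location  : /home/mandrake/rpm/BUILD/VirtualBox-1.5.4_OSE/src/VBox/Runtime/r3/linux/sems-linux.cpp(219) int RTSemEventSignal(RTSEMEVENTINTERNAL*)
1193:59:47.810 iCur=0x1 pIntEventSem=00000000009cb000
"""

for hit in find_assertions(sample):
    print(hit)
```

Reducing the location to a file name and line number makes reports from different builds comparable, since the source file and line vary between releases (sems-linux.cpp(219) here, semevent-linux.cpp at other lines in later reports).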

comment:4 Changed 15 years ago by benjamin9999

I had this same assert: 1.5.4 binary on Linux 2.6.24-rc8, running a win2k3 guest. This same box also happens to run vmware-server 1.0.3.

comment:5 Changed 15 years ago by freggy

This still happens very often in Mandriva Cooker 2008.1 (Linux 2.6.24 - Glibc 2.7 - x86_64) and it makes Virtualbox unusable for production use. Can anybody finally take a look at this please?

comment:6 Changed 15 years ago by blueyed

The bug has been reported for VirtualBox 1.5.6 on Ubuntu at https://launchpad.net/bugs/206615. The host is Ubuntu 8.04 AMD64 (beta), the guest is Windows XP, and it seems to happen after leaving the machine running/idle for a while.

The bug in Launchpad (https://launchpad.net/bugs/206615) provides additional debugging information, such as a stack trace.

comment:7 Changed 15 years ago by frank

Just to keep you up-to-date: This is a known issue. Still no fix available.

comment:8 follow-up: ↓ 29 Changed 15 years ago by pmatthew

I think this bug is related to preemption being enabled in the kernel... I compiled a kernel without preemption and it disappeared. That is with just a day of machine uptime so far; I'll send another report later ;)

comment:9 Changed 15 years ago by pmatthew

No, it's not preemption: it crashes less often, but it still aborts...

comment:10 Changed 15 years ago by frank

  • Description modified (diff)

comment:11 Changed 15 years ago by sandervl73

  • priority changed from major to critical

comment:12 Changed 15 years ago by sandervl73

  • Version changed from VirtualBox 1.5.0 to VirtualBox 1.6.2

comment:15 Changed 15 years ago by sandervl73

Similar reports in tickets 1733 and 1746.

comment:16 Changed 15 years ago by frank

  • Host type changed from other to Linux

comment:17 Changed 15 years ago by frank

  • Component changed from other to VMM

comment:18 Changed 15 years ago by vmorgo

Happens under Ubuntu Hardy Heron 64-bit edition. Core 2 Duo Penryn at 2.5 GHz, VirtualBox 1.5.6 OSE as supplied as a package in the Ubuntu Hardy Heron 8.04 repos.

Any other information required, please just ask! I'd sure like to know if/when this gets fixed.

comment:19 Changed 15 years ago by tomcrummey

I seem to be having a similar issue as described here.

Host OS is CentOS 5.2, Kernel 2.6.18-92.1.10.el5
Guest is Windows Vista SP1 32bit

VM is aborted. Log message:

04:06:26.848
04:06:26.848 !!Assertion Failed!!
04:06:26.848 Expression: i < 4096
04:06:26.848 Location  : /home/vbox/vbox-1.6/src/VBox/Runtime/r3/linux/semevent-linux.cpp(186) int RTSemEventSignal(RTSEMEVENTINTERNAL*)
04:06:26.848 iCur=0x1 pThis=000000000815f390

The abort seems to happen when the screensaver on the host kicks in.

Full log available if required.

comment:20 Changed 15 years ago by tomcrummey

I forgot to put the VirtualBox version number in. It's 1.6.4.

comment:21 Changed 15 years ago by tbroberg

Just saw this under 1.6.6.

!!Assertion Failed!!
Expression: i < 4096
Location  : /home/vbox/vbox-1.6.6/src/VBox/Runtime/r3/linux/semevent-linux.cpp(186) int RTSemEventSignal(RTSEMEVENTINTERNAL*)
iCur=0x1 pThis=00000000016e6e50

Running from VBoxHeadless on a Dell Precision 370 under Fedora 8 amd_64 host os and a Fedora 8 x86 guest (actually 3 Fedora guests, an XP guest, and a Win 2008 server guest in a test network). I left it pinging overnight with the failed VM acting as a gateway using host interface networking so the VM host could see all the network connections. One connection was bridged internally, the other went out a physical ethernet device.

comment:22 Changed 15 years ago by frank

  • Version changed from VirtualBox 1.6.2 to VirtualBox 2.0.0

comment:23 Changed 14 years ago by raxyx

Just to remind you: the problem still exists in 2.0.2.
Host: Debian Lenny 64bit
Guests: Debian Lenny 32bit and WinXP 32bit
Hardware: amd64 x2 4800+
Using the official Debian virtualbox-2.0 package

My Debian VMs are meant to be servers, manually started via the standard GUI and then basically running idle in the background somewhere.
Some have X installed, some do not; all use bridged networking.
They keep crashing with logs like this:

Executable: /usr/lib/virtualbox/VirtualBox
Arg[0]: /usr/lib/virtualbox/VirtualBox
Arg[1]: -comment
Arg[2]: Debian Lenny Postgresql
Arg[3]: -startvm
Arg[4]: 47f95c56-9d6b-419c-d7b3-f4cda9a2b8a4

!!Assertion Failed!!
Expression: i < 4096
Location  : /home/vbox/vbox-2.0.2/src/VBox/Runtime/r3/linux/semevent-linux.cpp(188) int RTSemEventSignal(RTSEMEVENTINTERNAL*)
iCur=0x1 pThis=00000000010416b0

comment:24 follow-ups: ↓ 27 ↓ 28 Changed 14 years ago by frank

We are aware of that problem, and yes, it is annoying. Unfortunately, even the next release expected soon will not have a fix for this problem. We will completely overhaul the NAT network stack and this will fix that problem as well. We hope that the new stack will be available this year.

comment:25 follow-up: ↓ 26 Changed 14 years ago by Skinkie

Is this a nat only problem? In that case I'll just disable my NAT network card and use bridging only.

comment:26 in reply to: ↑ 25 Changed 14 years ago by schinkelm

Replying to Skinkie:

Is this a nat only problem? In that case I'll just disable my NAT network card and use bridging only.

I have the same problem here with Virtualbox 2.0.2 Binary on x86-64 linux but I don't use NAT.

!!Assertion Failed!!
Expression: i < 4096
Location  : /home2/vbox/vbox/lnx64-rel/src/VBox/Runtime/r3/linux/semevent-linux.cpp(188) int RTSemEventSignal(RTSEMEVENTINTERNAL*)
iCur=0x1 pThis=00007f568002cfe0
Trace/breakpoint trap

comment:27 in reply to: ↑ 24 Changed 14 years ago by schinkelm

Replying to frank:

We are aware of that problem, and yes, it is annoying. Unfortunately, even the next release expected soon will not have a fix for this problem. We will completely overhaul the NAT network stack and this will fix that problem as well. We hope that the new stack will be available this year.

Unfortunately this bug renders VirtualBox useless because we cannot rely on the VMs without watching them all the time (or doing some kind of automatic restart). This bug also applies to usage of host only network adapters which are added to a bridge. Is there a chance that this usage case gets fixed even before the new nat stack is merged?

comment:28 in reply to: ↑ 24 Changed 14 years ago by schinkelm

Replying to frank:

We are aware of that problem, and yes, it is annoying. Unfortunately, even the next release expected soon will not have a fix for this problem. We will completely overhaul the NAT network stack and this will fix that problem as well. We hope that the new stack will be available this year.

There is also a forum thread here:  http://forums.virtualbox.org/viewtopic.php?t=2794

comment:29 in reply to: ↑ 8 Changed 14 years ago by schinkelm

Replying to pmatthew:

I think this bug is related to preemption being enabled in the kernel... I compiled a kernel without preemption and it disappeared. That is with just a day of machine uptime so far; I'll send another report later ;)

The crash occurs regardless of preemption type (on/voluntary/off). I verified this with kernel 2.6.27.3.

comment:30 Changed 14 years ago by bilbo

I encountered the same problem ...

Is there any chance at least for some quick temporary workaround before the complicated permanent fix? VM crashing every 4 hours or so isn't exactly the best thing ...

comment:31 Changed 14 years ago by Alecfyz

I have the same trouble.
Host machine: Fedora 9 (x64)
Guest: Windows XP SP3
Last lines in the VBox.log:

00:12:05.821 NAT: DHCP offered IP address 10.0.2.15
00:12:05.823 NAT: DHCP offered IP address 10.0.2.15
00:12:05.834 PCNet#0: Init: ss32=1 GCRDRA=0x021f9420[64] GCTDRA=0x021f9020[64]
00:14:52.879 
00:14:52.879 !!Assertion Failed!!
00:14:52.879 Expression: i < 4096
00:14:52.879 Location  : /home/vbox/vbox-2.0.4/src/VBox/Runtime/r3/linux/semevent-linux.cpp(188) int RTSemEventSignal(RTSEMEVENTINTERNAL*)
00:14:52.899 iCur=0x1 pThis=00007f432c02c8e0

comment:32 Changed 14 years ago by Alecfyz

Addendum to my previous post: I forgot to mention my version. I am using 2.0.4 (linux64).

comment:33 Changed 14 years ago by sandervl73

  • Version changed from VirtualBox 2.0.0 to VirtualBox 2.0.4

comment:34 Changed 14 years ago by frank

This annoying bug is not fixed, as we are still not able to reproduce it. It happens only on Linux/64 hosts. We would appreciate any hint on how to reproduce this assertion. And no, this bug has nothing (at least not directly) to do with NAT. If one of the reporters could generate a core dump, that would help as well.

comment:35 Changed 14 years ago by ebini

Hi,

I have the same problem here. FYI: I'm not using NAT. I'm using host interfaces.

The host is 64-bit Linux (Ubuntu 8.10). The client is also Linux (Ubuntu, CentOS).

And I have a core dump (zipped, about 70MB).

comment:36 Changed 14 years ago by frank

Could you make it somehow available to me (frank _dot_ mehnert _at_ sun _dot_ com)? Please don't forget to tell which package you are using.

comment:37 Changed 14 years ago by cyruspy

Hi, I'm using VB 2.0.4 on OpenSUSE 11.0@x86_64, and the same machine is crashing from time to time. The machine uses a host interface (bridging).

00:57:56.326 
00:57:56.326 !!Assertion Failed!!
00:57:56.326 Expression: i < 4096
00:57:56.326 Location  : /home/vbox/vbox-2.0.4/src/VBox/Runtime/r3/linux/semevent-linux.cpp(188) int RTSemEventSignal(RTSEMEVENTINTERNAL*)
00:57:56.327 iCur=0x1 pThis=00007f48a004ccc0

comment:38 follow-up: ↓ 39 Changed 14 years ago by frank

  • Status changed from new to closed
  • Resolution set to fixed

2.0.6 should fix that problem. Note that the fix currently only works for .deb/.rpm packages for distributions with glibc >= 2.6 (e.g. Ubuntu 7.10 / Hardy or later, Fedora 7 or later, ...). The .run packages are compiled for rhel4 and do not contain the fix. I will close that bug anyway.
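
To see which side of the glibc 2.6 threshold a given host falls on, something like the following Python sketch could be used. It only reports the host's own C library version; per the note above, the package type matters as well, so a new enough glibc does not help if the .run package is installed. The helper names are illustrative, and the sketch assumes a Linux host whose C library answers the CS_GNU_LIBC_VERSION query (it prints a fallback message elsewhere).

```python
import os

MIN_GLIBC = (2, 6)  # threshold quoted in the comment above


def parse_glibc_version(text):
    """Parse a string such as 'glibc 2.10.1' into a tuple of ints, e.g. (2, 10, 1)."""
    parts = text.split()
    if len(parts) != 2 or parts[0] != "glibc":
        return None
    try:
        return tuple(int(piece) for piece in parts[1].split("."))
    except ValueError:
        return None


def host_glibc_version():
    """Ask the C library for its version; return None if that is not possible."""
    try:
        text = os.confstr("CS_GNU_LIBC_VERSION")
    except (AttributeError, ValueError, OSError):
        # No os.confstr on this platform, or the name is unknown to this libc.
        return None
    return parse_glibc_version(text) if text else None


version = host_glibc_version()
if version is None:
    print("could not determine the glibc version")
else:
    print("glibc", ".".join(map(str, version)),
          "meets the 2.6 threshold:", version >= MIN_GLIBC)
```

Comparing tuples rather than strings avoids the classic mistake of treating "2.10" as older than "2.6".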

comment:39 in reply to: ↑ 38 ; follow-ups: ↓ 40 ↓ 41 Changed 14 years ago by schinkelm

  • Status changed from closed to reopened
  • Resolution fixed deleted

Replying to frank:

2.0.6 should fix that problem. Note that the fix currently only works for .deb/.rpm packages for distributions with glibc >= 2.6 (e.g. Ubuntu 7.10 / Hardy or later, Fedora 7 or later, ...). The .run packages are compiled for rhel4 and do not contain the fix. I will close that bug anyway

Could you please compile the .run package on a newer system? On Gentoo (which seems to use the .run package in the app-emulation/virtualbox-bin ebuild) the bug still exists.

comment:40 in reply to: ↑ 39 Changed 14 years ago by schinkelm

Replying to schinkelm:

Replying to frank:

2.0.6 should fix that problem. Note that the fix currently only works for .deb/.rpm packages for distributions with glibc >= 2.6 (e.g. Ubuntu 7.10 / Hardy or later, Fedora 7 or later, ...). The .run packages are compiled for rhel4 and do not contain the fix. I will close that bug anyway

Could you please compile the .run package on a newer system? On Gentoo (which seems to use the .run package in the app-emulation/virtualbox-bin ebuild) the bug still exists.

I commented on the new ebuild here:  http://bugs.gentoo.org/show_bug.cgi?id=248776#c11

comment:41 in reply to: ↑ 39 ; follow-up: ↓ 42 Changed 14 years ago by amdg

Replying to schinkelm:

Replying to frank:

2.0.6 should fix that problem. Note that the fix currently only works for .deb/.rpm packages for distributions with glibc >= 2.6 (e.g. Ubuntu 7.10 / Hardy or later, Fedora 7 or later, ...). The .run packages are compiled for rhel4 and do not contain the fix. I will close that bug anyway

Could you please compile the .run package on a newer system? On Gentoo (which seems to use the .run package in the app-emulation/virtualbox-bin ebuild) the bug still exists.

Seconding this. I'm running Gentoo on amd64 and I still see the bug (but so far, it has only happened when more than one VM is running).

comment:42 in reply to: ↑ 41 Changed 14 years ago by schinkelm

Replying to amdg:

Replying to schinkelm:

Replying to frank:

2.0.6 should fix that problem. Note that the fix currently only works for .deb/.rpm packages for distributions with glibc >= 2.6 (e.g. Ubuntu 7.10 / Hardy or later, Fedora 7 or later, ...). The .run packages are compiled for rhel4 and do not contain the fix. I will close that bug anyway

Could you please compile the .run package on a newer system? On Gentoo (which seems to use the .run package in the app-emulation/virtualbox-bin ebuild) the bug still exists.

Seconding this. I'm running Gentoo on amd64 and I still see the bug (but so far, it has only happened when more than one VM is running).

I currently run only one VM and have seen the problem on high network loads.

comment:43 Changed 14 years ago by pkerwien

I'm also seeing this with virtualbox-bin-2.1.4 on Gentoo amd64:

00:10:13.615 PCNet#0: Init: ss32=1 GCRDRA=0x0f9c7000[32] GCTDRA=0x0f934000[16]
01:38:05.274
01:38:05.274 !!Assertion Failed!!
01:38:05.274 Expression: i < 4096
01:38:05.274 Location  : /home/vbox/tinderbox/2.1-lnx64-rel/src/VBox/Runtime/r3/linux/semevent-linux.cpp(203) int RTSemEventSignal(RTSEMEVENTINTERNAL*)
01:38:05.310 iCur=0x1 pThis=00000000023b2090

The guest is running Debian 5.0 i386 with a host network interface. This happens just after a few minutes when I access the webserver running on the virtual machine.

comment:44 follow-up: ↓ 45 Changed 14 years ago by frank

The problem with Gentoo is still that our .run installer is built against a libc < 2.6.

comment:45 in reply to: ↑ 44 Changed 14 years ago by malte

Replying to frank:

The problem with Gentoo is still that our .run installer is built against a libc < 2.6.

So if that has been known for several months now, what exactly is the reason for linking the package with such an ancient libc? And if it's about compatibility, why not provide an alternative package that fixes this very annoying bug? I simply can't run more than one VM at a time which kind of undermines my attempts to test some different network setups :-( Thanks in advance for fixing!

comment:46 Changed 14 years ago by bramd

Same bug with Vbox 2.2.2 (closed source edition) on ArchLinux.

comment:47 Changed 14 years ago by paranoid

I have the same bug with VBox 2.2.2 on Debian 4.0 :( It crashes reliably every two days...

Kernel 2.6.26-bpo.1-amd64

628:36:14.255 !!Assertion Failed!!
628:36:14.255 Expression: i < 4096
628:36:14.255 Location  : /home/vbox/vbox-2.2.2/src/VBox/Runtime/r3/linux/semevent-linux.cpp(203) int RTSemEventSignal(RTSEMEVENTINTERNAL*)
628:36:14.264 iCur=0x1 pThis=0000000001040330

Do you have a solution to this problem?

comment:48 Changed 14 years ago by frank

Debian/Etch uses a libc < 2.6, therefore we have to use our re-implementation of this event semaphore, which is obviously buggy. No idea why; contributions are welcome. If you upgraded to Debian/Lenny, the problem would go away ...

comment:49 Changed 14 years ago by frank

An easy scenario for triggering this bug as quickly as possible would be helpful.

comment:50 Changed 14 years ago by Zer0COOL

Same bug with VirtualBox 2.2.4 binary on Gentoo x86_64.

comment:51 Changed 14 years ago by frank

I want to repeat: An easy scenario for triggering this bug as quickly as possible would be helpful.

comment:52 Changed 14 years ago by malte

Well...all I have to do is start two or more VMs and all but one of them will sooner or later die with the assertion failure, usually within five minutes. I've seen it happen with 32-bit WinXP and Win2k3 guests; I can try with other combinations if you want. Basically it's impossible to run more than one guest at a time. Host info: Gentoo Linux (x86_64 on a Core 2 Duo E6750 2.66GHz with 6GB RAM), kernel 2.6.28, glibc 2.8_p20080602-r1, VirtualBox 2.2.2 (haven't tested with 2.2.4 yet, but as Zer0COOL suggests it's still there).

comment:53 Changed 14 years ago by SDNick484

I'm still seeing this with VirtualBox 3.0.2 on a Gentoo Linux x86_64 host running kernel 2.6.30 (with Gentoo patches) & glibc 2.10.1. The problem has occurred on OpenSolaris (x86_64), Fedora 11 (x86), & Ubuntu 9.04 (x86) guests. I'm on a C2D P8400 with VT-x, PAE, 3D accel, & Nested Paging enabled. It has occurred both with and without IO-APIC enabled, and generally happens to me during the OS install phase. I do have one Windows 7 x86_64 guest which didn't run into that issue, but it did require IO-APIC enabled to install.

comment:54 Changed 14 years ago by tg2861

Confirmed the same issue with 3.0.4.

4GB of memory, dual Opteron CPUs. Software based RAID 1 mirrored SATA drives shared by 3 Windows server guests (Win 2k3 and Win 2k8).

Basically, if all 3 are started, this will occur sometime between a few hours and a few days later. More than 1 guest appears to be the trigger. Each of the guests is a legitimate server (Exchange 2007, AD domain controller, and Symantec SEP console) -- each of these can have substantial I/O bursts (sometimes concurrently).

I have another identical system that only runs 1 VM at a time and it has gone 6+ months without an issue.

comment:55 Changed 14 years ago by frank

  • Version changed from VirtualBox 2.0.4 to VirtualBox 3.0.4

comment:56 follow-up: ↓ 57 Changed 14 years ago by michael55123

I can also confirm this issue on 3.0.4.

share VirtualBox #
!!Assertion Failed!!
Expression: i < 4096
Location  : /home/vbox/tinderbox/3.0-lnx64-rel/src/VBox/Runtime/r3/linux/semevent-linux.cpp(203) int RTSemEventSignal(RTSEMEVENTINTERNAL*)
iCur=0x1 pThis=000000000099bfe0

[5]- Trace/breakpoint trap ./VBoxHeadless --startvm "Windows2008-Server2" --vrdpport 3387

share VirtualBox # uname -a
Linux share 2.6.30-gentoo-r6 #1 SMP Tue Sep 1 03:42:32 CDT 2009 x86_64 AMD Processor model unknown AuthenticAMD GNU/Linux

Happens when running more than 1 VM. Athlon X2 3.0. 8 GB ram.

comment:57 in reply to: ↑ 56 Changed 14 years ago by michael55123

It is very, very easy to reproduce this bug when installing Windows 2008 concurrently: 2 simultaneous installs triggered it for me 3-4 times.

comment:58 Changed 14 years ago by renanbirck

I can reproduce this systematically here by doing anything like copying a file to the VM. I'm struggling to get SP3 installed on my Windows XP VM, because with every heavy I/O it aborts!

This is the only VM I have here. VirtualBox 3.0.4. Core 2 Duo T5550, 2GB of RAM, Arch Linux, kernel 2.6.30

comment:59 Changed 14 years ago by frank

renanbirck, copying a file over the NAT network? Could you attach a VBox.log file of such a crashed session? I have done wget in three concurrent running VMs but was still not able to reproduce this problem.

comment:60 follow-up: ↓ 61 Changed 14 years ago by frank

I have more and more the feeling that some special Linux kernel configuration is required to trigger this bug. Since I have neither ArchLinux nor Gentoo installed here, could one of you who is experiencing this bug attach the configuration of your host Linux kernel here?

Changed 14 years ago by michael55123

2.6.30-R5 Kernel Config- Gentoo

comment:61 in reply to: ↑ 60 Changed 14 years ago by michael55123

Replying to frank:

I have more and more the feeling that some special Linux kernel configuration is required to trigger this bug. Since I have neither ArchLinux nor Gentoo installed here, could one of you who is experiencing this bug attach the configuration of your host Linux kernel here?

Attached. Hopefully some others can attach theirs to compare.

Changed 14 years ago by tg2861

Kernel config on a machine with this issue

Changed 14 years ago by malte

Kernel configuration 2.6.30-gentoo-r4

comment:62 Changed 14 years ago by tg2861

Another config uploaded

comment:63 Changed 14 years ago by malte

Me Too (TM)

Changed 14 years ago by SDNick484

Another Gentoo .config

comment:64 Changed 14 years ago by SDNick484

I was going through all the .configs posted thus far, and one thing that stands out is that they're all SMP machines (so perhaps some threading issues are present). I ran several older releases of VirtualBox on my previous laptop with a Pentium-M, also running Gentoo (several kernels all the way up to 2.6.28), but I never saw the issue.
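
A comparison like this can be done mechanically. The following Python sketch finds the options that are set identically in every kernel .config; the function names are illustrative, and the two config fragments are made up for the example rather than taken from the files attached to this ticket.

```python
NOT_SET_SUFFIX = " is not set"


def parse_kernel_config(text):
    """Map CONFIG_* names to their values; '# CONFIG_X is not set' becomes 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(NOT_SET_SUFFIX):
            # Strip the leading '# ' and the trailing ' is not set'.
            name = line[2:-len(NOT_SET_SUFFIX)]
            options[name] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def common_options(configs):
    """Return the options that have the same value in every parsed config."""
    if not configs:
        return {}
    first, rest = configs[0], configs[1:]
    return {
        name: value
        for name, value in first.items()
        if all(other.get(name) == value for other in rest)
    }


# Made-up fragments for illustration only (not the attached configs).
config_a = """\
CONFIG_SMP=y
CONFIG_PREEMPT=y
# CONFIG_PREEMPT_NONE is not set
"""
config_b = """\
CONFIG_SMP=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_NONE=y
"""

shared = common_options([parse_kernel_config(config_a), parse_kernel_config(config_b)])
print(shared)
```

With real configs the shared set will be large, so it is most useful when also diffed against the config of a machine that does not show the problem.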

comment:65 Changed 14 years ago by frank

Yes, I'm using an SMP box as well (T9550 @ 2.66GHz). Yesterday I used a 2.6.30.5 kernel with the adapted config file by tg2861 -- the build ran rock solid for hours doing wget guest=>host and wget host=>guest in parallel. I did similar experiments with a Pentium-D @ 3GHz. Are you guys using a CPU with hyperthreading?

comment:66 Changed 14 years ago by tg2861

I'm running dual Opterons.

CPUInfo says:

processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 246 HE
stepping : 10
cpu MHz : 1992.244
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow rep_good
bogomips : 3984.48
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 246 HE
stepping : 10
cpu MHz : 1992.244
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow rep_good
bogomips : 3984.72
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

comment:67 Changed 14 years ago by malte

I tried again now. Created four fresh VMs, everything default. Started a simultaneous installation of XP Pro SP3 (32 bit) in all of them from a CD image. Two machines died in the text setup phase while "Setup is copying files...". The other two finished installing. After the first login I copied the contents of the installation CD to My Documents, which is when the third machine died. On a second run with the same setup, the first VM went away right after setup started to load drivers; the second and third ones followed suit in the same phase. Again, the fourth one went on running. I re-ran the test with different Windows variants and Linux live CDs, and every time all but one VM sooner or later hit the assertion - usually sooner.

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz
stepping : 11
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5320.64
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU E6750 @ 2.66GHz
stepping : 11
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5319.97
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

comment:68 Changed 14 years ago by frank

Finally, that was a good test case; I was able to reproduce the assertion now. However, that does not necessarily mean that the bug can easily be fixed ...

comment:69 Changed 14 years ago by frank

Well, we think we finally fixed this problem. For your convenience I've uploaded a new  3.0.6 .run package for Linux/AMD64. This new package is still not linked from the download page (note the different build number 52130 versus 52128). This package only differs in this semaphore fix. Any feedback is welcome. If it works for you then we will probably update the other affected packages as well (Debian 4.0, RHEL5, sles10.1) and change the links.

comment:70 Changed 14 years ago by frank

  • Summary changed from Assertion failed in sems-linux.cpp(219) to Assertion failed in sems-linux.cpp(219) => Fixed in SVN

comment:71 Changed 14 years ago by malte

Looks very promising indeed, this build survived all my torturing so far :-) Thanks a lot for looking into this!

comment:72 Changed 14 years ago by tg2861

Fantastic! I'll get it installed this evening. I can't recall ever getting more than a week with all 3 of my VMs running; I'll post updates.

Thanks

comment:73 Changed 14 years ago by frank

  • Summary changed from Assertion failed in sems-linux.cpp(219) => Fixed in SVN to Assertion failed in sems-linux.cpp(219) => Fixed in SVN/3.0.6

Marked as fixed in 3.0.6 because I've replaced the packages on the download server and on the webpage. Replaced all affected packages (rhel5-amd64, Debian/Etch-amd64, SLES10-amd64, Linux/.run-amd64).

comment:74 Changed 14 years ago by SDNick484

Frank, any chance we can get a little more details on the fix? I submitted a Gentoo bug ( 285228) to get Portage updated, but downstream would appreciate a little more info (& notification).

comment:75 Changed 14 years ago by frank

Sure (and thanks, by the way, for notifying the Gentoo people). The fixes are contained in the changesets r22950, r22952, r22953, r22954, r22955, r22956, r22957, r22958, r22959. As written above, the reason for this problem was our own implementation of an event semaphore. Older libcs (version < 2.6) contain a bug in the 64-bit futex code, which is why we only use the generic implementation (Runtime/r3/posix/semevent-posix.cpp) on newer Linux distributions. But as we build our generic Linux package on RHEL4 (to be compatible with a lot of older Linux distributions), the generic package contained our own implementation and therefore this bug.

The problem was that the signalling thread was responsible for adjusting the number of waiting threads. This number was used to determine whether a thread which executes RTSemEventSignal() actually has to wake up another thread or whether there are no threads sleeping. If the signalling thread was preempted just after it woke up a waiting thread, it could take some time until the signalling thread was running again (especially if the system load is very high). The following happened: One thread A was leaving a critical section with RTSemEventSignal(). Another thread B was waiting in RTSemEventWait() and was woken up by A. A was preempted before it could adjust the number of waiting threads nWaiters. B continued to run and eventually left the critical section with RTSemEventSignal(). Because nWaiters was still not adjusted, B tried to wake up a waiting thread -- but there was no thread waiting, A just had had no chance to adjust nWaiters. B was now looping and waiting for some time, but as the system load was very high, it took a long time until A was scheduled again. So the general problem was that A had to adjust nWaiters. You can browse the fixed code to see how the problem is solved.
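
To make the interleaving above concrete, here is a small single-threaded Python model that replays the A/B schedule step by step. It is only an illustration and not VirtualBox code: the class and method names are invented, and the waiter_adjusts=True variant (the woken thread corrects the counter itself) is an assumed alternative shown for contrast, not necessarily what the changesets listed above do.

```python
class EventSemModel:
    """Toy, single-threaded model of the waiter bookkeeping (not VirtualBox code)."""

    def __init__(self, waiter_adjusts):
        # False: the signaller fixes up n_waiters after the wake-up (scheme described above).
        # True: the woken thread fixes it up itself (assumed alternative, for contrast).
        self.waiter_adjusts = waiter_adjusts
        self.n_waiters = 0   # what a signaller believes is sleeping
        self.sleeping = 0    # what is really sleeping

    def wait_begin(self):
        """A thread registers as a waiter and goes to sleep."""
        self.n_waiters += 1
        self.sleeping += 1

    def signal_wake(self):
        """First half of a signal: try to wake one thread."""
        if self.n_waiters == 0:
            return "nobody waiting"
        if self.sleeping == 0:
            return "spins"  # the counter is stale, there is nothing to wake
        self.sleeping -= 1
        return "woke a thread"

    def signal_finish(self):
        """Second half of a signal, run only once the signaller is scheduled again."""
        if not self.waiter_adjusts:
            self.n_waiters -= 1

    def waiter_resume(self):
        """The woken thread continues running."""
        if self.waiter_adjusts:
            self.n_waiters -= 1


def replay(waiter_adjusts):
    """Replay the schedule from the comment above and report what A and B observed."""
    sem = EventSemModel(waiter_adjusts)
    sem.wait_begin()               # B blocks in the wait call
    a_result = sem.signal_wake()   # A signals and wakes B ...
    # ... and A is preempted here, before it reaches signal_finish()
    sem.waiter_resume()            # B runs
    b_result = sem.signal_wake()   # B signals in turn
    sem.signal_finish()            # A finally gets the CPU back
    return a_result, b_result, sem.n_waiters


print(replay(waiter_adjusts=False))
print(replay(waiter_adjusts=True))
```

In the first replay B ends up in the "spins" state, which corresponds to the looping described above (presumably the loop whose bound the `i < 4096` assertion checks, although the ticket does not say so explicitly). In the second replay B sees an up-to-date counter and has nothing to wake. Both runs finish with n_waiters back at 0.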

comment:76 Changed 14 years ago by SDNick484

Thanks for the detailed response; Portage has been updated to include the new build.

comment:77 Changed 14 years ago by tg2861

I can confirm that this patch corrected my problems. I've been running for almost 2 weeks and all 3 VMs running on my dual Opteron system are running -- far longer than I'd ever been able to keep all 3 up.

Thanks for taking care of this!

comment:78 Changed 13 years ago by frank

  • Status changed from reopened to closed
  • Resolution set to fixed