VirtualBox

Ticket #9042 (reopened defect)

Opened 3 years ago

Last modified 5 months ago

OS/2 guest crashes on floating point exception => fixed in svn

Reported by: rudiIhle
Owned by:
Priority: major
Component: other
Version: VirtualBox 4.0.8
Keywords:
Cc:
Guest type: other
Host type: Windows

Description (last modified by michaln)

The following code snippet (which is supposed to die by SIGFPE) causes an OS/2 guest to trap. Kernel is 14.104_W4.

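#include <stdio.h>
#include <float.h>

int main(void)
{
  double d1 = 1.0;
  double d2 = 0.0;

  /* Clear the exception mask bits selected by 0x1f, i.e. unmask the
     corresponding FP exceptions (including divide-by-zero). */
  _control87(0, 0x1f);

  /* The division by zero below should now raise SIGFPE. */
  printf("%lf\n", d1 / d2);

  return 0;
}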

Attachments

eCS 2.0-2011-06-08-09-58-29.log Download (85.2 KB) - added by rudiIhle 3 years ago.
Log file
trapscreen.png Download (36.4 KB) - added by rudiIhle 2 years ago.
Trap screen after executing the program snippet.
VBox-4.1.6-r74713-eCS2-Mac.png Download (65.6 KB) - added by dbsoft 2 years ago.
Trap screen with eCS 2.0 on MacBook Pro w/ Intel Core 2 Duo
eComStationV2-2011-11-27-17-13-25.log Download (72.7 KB) - added by Erdmann 2 years ago.
Yet another log: Windows 7 host, Intel Core2 Duo CPU, 2 GB RAM
erdmann.png Download (26.0 KB) - added by erdmann 2 years ago.
Sudden trap on using Seamonkey 2.5
boottrap.PNG Download (21.0 KB) - added by lerdmann 2 years ago.
Trap on boot with SMP kernel (one CPU, no PSD loaded)
POPUPLOG.OS2 Download (1.8 KB) - added by lerdmann 2 years ago.
Traps in C2L.EXE (part of 16-bit Microsoft C Compiler)
shutdownW4trap.PNG Download (21.2 KB) - added by lerdmann 2 years ago.
Trap on shutdown with W4 kernel and OS2APIC.PSD

Change History

comment:1 Changed 3 years ago by frank

A VBox.log file is missing. It will show us which configuration your VM has and which processor features of your host are used.

Changed 3 years ago by rudiIhle

Log file

comment:2 Changed 3 years ago by dbsoft

This appears to be the bug I have been experiencing on my Macs, which have Intel CPUs. My AMD-based Windows 7 64-bit systems don't experience this problem; there the program ends with SIGFPE as expected.

comment:3 Changed 2 years ago by bird

I have not been able to reproduce this with the current trunk version of VirtualBox, testing on both an Intel Core i7 and an AMD Phenom II.

comment:4 Changed 2 years ago by dbsoft

It happens on both my Macs... MacBook Pro with an Intel Core 2 Duo and MacPro with an Intel Quad Core Xeon.

It does not happen on my PCs with AMD Phenom 2 and X2.

comment:5 Changed 2 years ago by rudiIhle

With 4.1.4r74291 it happens here. See attached screenshot...

Changed 2 years ago by rudiIhle

Trap screen after executing the program snippet.

Changed 2 years ago by dbsoft

Trap screen with eCS 2.0 on MacBook Pro w/ Intel Core 2 Duo

comment:6 follow-up: ↓ 7 Changed 2 years ago by Erdmann

I can add that this very same trap also happens with more complex OS/2 applications, namely Firefox 8.x and Seamonkey 2.3. But I guess that can be expected as the error is so fundamental.

comment:7 in reply to: ↑ 6 Changed 2 years ago by Erdmann

Replying to Erdmann:

I can add that this very same trap also happens with more complex OS/2 applications, namely Firefox 8.x and Seamonkey 2.3. But I guess that can be expected as the error is so fundamental.

Forgot to add: I am using VirtualBox Version 4.1.6.

Changed 2 years ago by Erdmann

Yet another log: Windows 7 host, Intel Core2 Duo CPU, 2 GB RAM

comment:8 Changed 2 years ago by erdmann

The error still exists in VBox 4.1.8

comment:9 Changed 2 years ago by michaln

Reproducible here on an Intel Core 2 Quad host. No problem on an AMD system. I wonder if this is specific to the crummy old VT-x implementation.

comment:10 Changed 2 years ago by erdmann

I still experience this crash. Funnily enough, it now occurs less often since I upgraded from Seamonkey 2.3.3 to Seamonkey 2.5. Maybe Seamonkey 2.5 uses less floating point. Unfortunately I don't know the technical details behind VT-x (I could have a look at the Intel manual, but I am sure I am lagging years behind ...). I am using an Intel Core2 Duo with Windows 7 as host. I had another trap on bootup just now; unfortunately I forgot to take a photo. The only thing I can say is that it happens pretty randomly. If you want me to test anything ...

Changed 2 years ago by erdmann

Sudden trap on using Seamonkey 2.5

comment:11 Changed 2 years ago by erdmann

I got a trap using Seamonkey. I attached trap screen. With the very same kernel the trap address has now changed.

comment:12 Changed 2 years ago by michaln

  • Description modified (diff)

There's one crucial piece of information missing here. This problem does not occur with the SMP OS/2 kernel. The reason being that the SMP kernel runs with the CR0.NE bit set. The actual number of processors in the guest does not matter.

comment:13 Changed 2 years ago by lerdmann

For information: the traps still occur with VirtualBox 4.1.10.

CR0.NE bit set implies that there is some old interrupt controller around that generates IRQ 13 on a floating point exception, correct ?
Ok, I will now install the SMP kernel in virtual box and see if the problem goes away. Maybe that's also the reason why traps occur so frequently when Seamonkey is in use. I have the impression Seamonkey creates a lot of floating point exceptions that are then handled internally by the application. But the underlying mechanisms have to work ...

Last edited 2 years ago by lerdmann

comment:14 Changed 2 years ago by michaln

Sure, it's the same with 4.1.10. No one said anything changed.

It's the other way around with CR0.NE. When it's set, it means the "new style" (implemented since the 286) math error handling should be used, i.e. #MF exception. That's also the only way a SMP system can work. The OS/2 UNI kernels use the ancient FERR/IRQ13/IGNNE math error handling (CR0.NE clear) which clearly doesn't work right in VirtualBox. No modern OS uses that, Windows 9x was the only other important OS which used the old style math error handling. Besides DOS, of course.

comment:15 Changed 2 years ago by lerdmann

Sorry, yes, I meant it the other way around for CR0.NE.
In any case, I have just upgraded to the SMP kernel along with enabled I/O APIC in VirtualBox configuration and OS2APIC.PSD (with no parameters) and I am using Seamonkey 2.5 under an OS/2 guest in VirtualBox. Should traps occur again, I will post them here.

By the way: when you say "OS/2 UNI kernels" do you mean only the "W4" kernel or also the "UNI" kernel ? I never really understood why there is yet a third kernel variant (UNI) besides the other 2 (W4, SMP).

Last edited 2 years ago by lerdmann

Changed 2 years ago by lerdmann

Trap on boot with SMP kernel (one CPU, no PSD loaded)

comment:16 Changed 2 years ago by lerdmann

I had a trap on bootup: SMP kernel, one CPU only, no I/O APIC VM emulation, no PSD loaded. As I can tell from the trap screen, the CR0.NE bit is NOT set even though it's an SMP kernel. Does that mean that I either need a PSD to operate or that the OS needs its time to change from CR0.NE = 0 to CR0.NE = 1 ? How do you handle Win 9x and DOS guests ? As far as I understand they also set CR0.NE = 0.

comment:17 Changed 2 years ago by michaln

I believe a PSD is required. It's also true that early in the boot, CR0.NE is not set. I can't say exactly when it does get set. See INIT_USE_FPERR_TRAP in SMP.INF.

Win9x and DOS guests are handled the same as OS/2. It would appear that running with FP exceptions unmasked is extremely rare on those guests.

comment:18 Changed 2 years ago by lerdmann

No, you can run an SMP kernel without a PSD. But of course, you will only get to use the BSP and none of the ASPs. But it surely looks like running the SMP kernel with a PSD (OS2APIC.PSD or ACPI.PSD) prevents the traps. I guess that OS2APIC.PSD will set CR0.NE very early in the boot process. For ACPI.PSD I could ask the developer and find out if it also explicitly sets CR0.NE (sets flag INIT_USE_FPERR_TRAP).
I have now readded OS2APIC.PSD to config.sys and enabled "I/O APIC" in the VM configuration. At the same time I have only enabled one CPU core (of two CPU cores available) in the VM configuration because mouse tends to get jerky with > 1 CPU core. I will see if this eliminates the traps on the long term.

Last edited 2 years ago by lerdmann

comment:19 Changed 2 years ago by michaln

There's a Windows test build at  http://www.virtualbox.org/download/testcase/VirtualBox-4.1.51-76951-Win.exe

It would be nice if someone could try it and check if the guest OS traps are gone. Please note that this is a development build and I'm not interested in anything other than whether the traps are gone or not.

I should also note that the issue does NOT affect AMD CPUs.

comment:20 Changed 2 years ago by rudiIhle

Hmm, installed it over a 4.1.8 and it would not start due to "driver structure changed" or so. Had some trouble getting back to a working setup (now at 4.1.10), so I'm not too enthusiastic to try again.

comment:21 Changed 2 years ago by frank

The problem is simply that you still have the 4.1.10 Extension Pack installed. If you don't need USB2 for that VM, just disable USB2 in the VM settings, otherwise we could provide you a test build of the 4.1.51 Extension Pack. Do you need one?

comment:22 Changed 2 years ago by dbsoft

All of my Windows machines have AMD, is there a Mac testbuild?

comment:23 Changed 2 years ago by rudiIhle

I think it would be good to have the 4.1.51 extension pack as well.

comment:24 Changed 2 years ago by frank

 Here is the corresponding 4.1.51 ExtPack and  here is a Mac OS X testbuild.

comment:25 Changed 2 years ago by rudiIhle

O.K., first of all, I don't get the trap in the guest anymore. However, there still seems to be something not quite right. When running the test program above on a freshly booted up guest I get a SIGFPE (as expected). But the location appears to be somewhere in the runtime lib instead of in the program code itself. Also, when running the program three or more times in a row, no SIGFPE is thrown anymore. Instead it simply continues printing out the unmodified value of "d1" (i.e. 1.00000).

comment:26 Changed 2 years ago by dbsoft

Similar here on both Macs...

10:56:00a nuke@ECS-[C:\HOME\DEFAULT]test

Killed by SIGFPE
pid=0x0041 ppid=0x0040 tid=0x0001 slot=0x007f pri=0x0200 mc=0x0001
C:\HOME\DEFAULT\TEST.EXE
LIBC063 0:0009a244
cs:eip=005b:1f39a244 ss:esp=0053:0212dddc ebp=0212de48
ds=0053 es=0053 fs=150b gs=0000 efl=00012202
eax=00000066 ebx=0212ff7c ecx=0212ff74 edx=0212df10 edi=00010032 esi=00000066
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.

10:56:01a nuke@ECS-[C:\HOME\DEFAULT]test
1.000000

10:57:14a nuke@ECS-[C:\HOME\DEFAULT]test
1.000000

comment:27 Changed 2 years ago by michaln

Yes, the exception may not be reported in the correct place. Is the behavior on AMDs any different?

comment:28 Changed 2 years ago by dbsoft

Michal, you are correct: the behavior does seem to be the same on AMD, although it seems unexpected to me on both.

11:27:01a nuke@ECS-[C:\HOME\DEFAULT]test

Killed by SIGFPE
pid=0x0046 ppid=0x0040 tid=0x0001 slot=0x007f pri=0x0200 mc=0x0001
C:\HOME\DEFAULT\TEST.EXE
LIBC064 0:00083123
cs:eip=005b:1f373123 ss:esp=0053:0212ddf0 ebp=0212de48
ds=0053 es=0053 fs=150b gs=0000 efl=00012286
eax=0212df10 ebx=00000004 ecx=ffffffff edx=80000000 edi=00000000 esi=00000066
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.

11:27:02a nuke@ECS-[C:\HOME\DEFAULT]test
1.000000

comment:29 Changed 2 years ago by lerdmann

I am not sure if this is related, if not, just ignore:
I am running the rusty old 16-bit Microsoft C compiler for OS/2. It's CL.exe with its subcomponents C1.exe (preprocessor(?)), C2.exe (tokenizer(?), optimizer(!)), C3.exe (output generator(?)). There also exists large memory model variants C1L.exe,C2L.exe,C3L.exe that can deal with large source files. I would have to use C2L.exe because I am using /Oe /Og (global optimizations) with rather large source files which require it (otherwise I get a warning that global optimizations cannot be performed for this and that routine).
I therefore specify /B2c2l.exe either on commandline or via CL env. var.

When I run the compiler with /B2... on a W4 kernel within VirtualBox, it just works but then I occasionally have these general trap problems.
When I run the compiler with /B2... on an SMP kernel within VirtualBox, I get "varying" results. I never get a trap but on the first run I might get a "C1001" compiler error, whereas on subsequent runs I will get a "Command line error D2030: INTERNAL COMPILER ERROR in 'P2'". But the internal compiler error might also occur on the very first run.
This is true for a source file of any size, small or big.

Unfortunately I don't have a native OS/2 on a multi-core system to test the SMP kernel on.
I would be grateful if anybody could test this behaviour on a multi-core system with SMP kernel on a native OS/2 installation and compare with behaviour in VirtualBox.

comment:30 Changed 2 years ago by lerdmann

As to the problems with C2L.EXE: I have to correct my statement. It keeps trapping, but the place where it traps is pretty much random, even though I compile the very same file with the very same command line switches. See attached POPUPLOG.OS2.

Version 0, edited 2 years ago by lerdmann

Changed 2 years ago by lerdmann

Traps in C2L.EXE (part of 16-bit Microsoft C Compiler)

comment:31 Changed 2 years ago by lerdmann

Some news: at some point in time Scott Garfinkle from IBM modified the W4 kernel to also support using a PSD along with it so I took the chance:
1) if I use a W4 kernel with OS2APIC.PSD (and only 1 CPU of course), it looks like it gets rid of the traps and C2L.exe starts working again. I will need more observation time and report back
2) using a W4 kernel without any PSD leads to the random traps

Here is what I found in the eComStation bug tracker about problems running Firefox with an OS/2 guest in VirtualBox. It explains why the W4 kernel is kind of "flaky": http://bugs.ecomstation.nl/view.php?id=2974


[The kernel trap is caused by a defect in the Warp4 kernels. The Firefox code issues a fldcw which generates a math fault (#MF) exception, which does not push an exception-specific error code on to the stack. The kernel code should push a dummy error code on to the stack before entering the common exception handler code, but it does not. The common code assumes that the EFLAGS are at a specific stack offset and checks the EFLAGS VM bit to determine if the trap occurred in V86 mode. If the bit happens to be set, the result is a trap in V86FaultEntry + 17. If the bit is not set, the kernel will trap or hang somewhat later because the stack contains one less dword than the code expects.

The defect has been fixed in the SMP kernel, so running the SMP kernel in VirtualBox is a possible workaround.

It is not known why the fldcw generates a #MF exception. This might be a VirtualBox defect. ]

Changed 2 years ago by lerdmann

Trap on shutdown with W4 kernel and OS2APIC.PSD

comment:32 Changed 2 years ago by lerdmann

I had a trap on shutdown. W4 kernel with OS2APIC.PSD. The trap screen says that CR0.NE bit was set. So this cannot be the only reason for trapping.

comment:33 Changed 2 years ago by michaln

Well, duh. For example, the reason could be that you're running the W4 kernel with a PSD, which I'm sure is an almost completely untested combination.

comment:34 Changed 2 years ago by lerdmann

ok,

1) POPUPLOG.OS2 was taken with the SMP kernel 10.104a and OS2APIC.PSD in place. I was using only one CPU.
2) I was invoking cl.exe 3 times with the very same parameters and the very same source file
3) the traps, however, occurred at 3 different places in C2L.exe.

The only reason I was mentioning the W4 kernel is to state that C2L.exe does not trap when I use the W4 kernel in conjunction with OS2APIC.PSD.

Last edited 2 years ago by lerdmann

comment:35 Changed 2 years ago by michaln

  • Status changed from new to closed
  • Resolution set to fixed
  • Summary changed from OS/2 guest crashes on floating point exception to OS/2 guest crashes on floating point exception => fixed in svn

This ticket has clearly outlived its usefulness. We really don't care about 20+ year old Microsoft compilers which have known problems running on modern systems.

The reported problem is now fixed and the fix should be included in the next VirtualBox release. The OS/2 kernel should no longer crash on Intel CPUs because it should never get a #MF exception anymore (unless it asked for it).

comment:36 Changed 2 years ago by rudiIhle

Michal,

does the fix also address the inconsistent behavior when running the test case program multiple times and the reporting of the exception in the correct place ?

comment:37 Changed 2 years ago by michaln

No, it doesn't. That's a completely different problem, which was visible on AMD CPUs since day one. Feel free to open a separate ticket, just don't expect it to be fixed anytime soon without giving some really good reason why we should spend time on that (it actually needs quite a bit of work).

comment:38 Changed 2 years ago by lerdmann

Rudi,

would you create a new bug ? Unfortunately, Michal does not consider the OS/2 Microsoft C-Compiler 6.0 a valid test case. I am sure that once the "inconsistent behaviour" is fixed that then the OS/2 Microsoft C-Compiler 6.0 will happily exhibit consistent behaviour (trapping or not) provided the same command line switches and the same input source file is used.

In any case, thanks for fixing this bug's problem.

Last edited 2 years ago by lerdmann

comment:39 Changed 2 years ago by rudiIhle

Lars,

I'm not convinced that the problems you are describing are really related to this issue. To summarize: we had a trap in the kernel because VirtualBox was delivering #MF, which is neither expected nor properly handled by the W4 kernel. To my understanding this has been fixed. Now we see two different problems:

1.) the SIGFPE fires only once or twice per guest session
2.) the reported exception location is wrong

I cannot tell if these two issues have a common cause (maybe a bug in the DOS-like FPE emulation) or if these are two separate things. I also don't know, if the location reporting is broken in general (i.e. not only for SIGFPE). Maybe Michal can tell and depending on this I might open one or two new ticket(s). Given the time it took until this one was addressed, expecting it to be fixed "anytime soon" is probably not realistic anyway...

comment:40 Changed 2 years ago by michaln

The two issues probably have a common cause. They are also specific to floating-point exceptions because a) the delivery is very different, and b) the FPU has a whole own internal state that's different from the CPU.

If you have some paying customer who depends on accurate FPU exception reporting in OS/2 guests, that would greatly accelerate the process. But I suspect there's no such customer because very few applications even run with FP exceptions unmasked. I'm sure you understand that we have better things to do. Of course if someone wants to spend a fun few weeks with VirtualBox and submit a patch, we won't object :)

I highly doubt the problems with MS C 6.0 are related at all. MS C 6.0 is well known to have all sorts of problems running on modern systems, both Windows and OS/2. If you still depend on MS C 6.0 in 2012, you have no one but yourself to blame. So far I've seen no evidence that the MS C 6.0 compiler even uses the FPU at all (it might, but I wouldn't count on that).

comment:41 Changed 6 months ago by dbsoft

  • Status changed from closed to reopened
  • Resolution fixed deleted

In version 4.3.2 this issue has resurfaced and now affects both AMD and Intel processors.

comment:42 Changed 5 months ago by frank

Thanks for the report. We reproduced the bug and are working on a fix.
