VirtualBox

Opened 13 years ago

Closed 9 years ago

#9042 closed defect (fixed)

OS/2 guest crashes on floating point exception => fixed in svn

Reported by: rudi
Owned by:
Component: other
Version: VirtualBox 4.0.8
Keywords:
Cc:
Guest type: other
Host type: Windows

Description (last modified by michaln)

The following code snippet (which is supposed to die by SIGFPE) causes an OS/2 guest to trap. Kernel is 14.104_W4.

#include <stdio.h>
#include <float.h>

int main(void)
{
  double d1 = 1.0;
  double d2 = 0.0;

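  /* Clear the exception-mask bits selected by 0x1f so that the
     division by zero below raises an unmasked FP exception. */
  _control87(0, 0x1f);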

  printf("%lf\n", d1 / d2);

  return 0;
}

Attachments (12)

eCS 2.0-2011-06-08-09-58-29.log (85.2 KB ) - added by rudi 13 years ago.
Log file
trapscreen.png (36.4 KB ) - added by rudi 12 years ago.
Trap screen after executing the program snippet.
VBox-4.1.6-r74713-eCS2-Mac.png (65.6 KB ) - added by Brian Smith 12 years ago.
Trap screen with eCS 2.0 on MacBook Pro w/ Intel Core 2 Duo
eComStationV2-2011-11-27-17-13-25.log (72.7 KB ) - added by Lars Erdmann 12 years ago.
Yet another log: Windows 7 host, Intel Core2 Duo CPU, 2 GB RAM
erdmann.png (26.0 KB ) - added by Lars Erdmann 12 years ago.
Sudden trap on using Seamonkey 2.5
boottrap.PNG (21.0 KB ) - added by lerdmann 12 years ago.
Trap on boot with SMP kernel (one CPU, no PSD loaded)
POPUPLOG.OS2 (1.8 KB ) - added by lerdmann 12 years ago.
Traps in C2L.EXE (part of 16-bit Microsoft C Compiler)
shutdownW4trap.PNG (21.2 KB ) - added by lerdmann 12 years ago.
Trap on shutdown with W4 kernel and OS2APIC.PSD
newTrapScreen.PNG (74.5 KB ) - added by lerdmann 10 years ago.
eComStation 2.png (9.5 KB ) - added by Brian Smith 10 years ago.
Trap with 4.3.17 test version
os2pcat.zip (10.4 KB ) - added by lerdmann 9 years ago.
PSD for Warp4 kernel that works around bugs in the #MF handler of Warp4 kernel
amdfx6300.png (51.4 KB ) - added by Brian Smith 9 years ago.

Change History (83)

comment:1 by Frank Mehnert, 13 years ago

A VBox.log file is missing. It will show us which configuration your VM has and which processor features of your host are used.

by rudi, 13 years ago

Log file

comment:2 by Brian Smith, 13 years ago

This appears to be the bug I have been experiencing on my Macs, which have Intel CPUs. My AMD-based Windows 7 64-bit systems don't experience this problem; there the program dies with SIGFPE as expected.

comment:3 by bird, 12 years ago

I have not been able to reproduce this with the current trunk version of VirtualBox, testing on both an Intel Core i7 and an AMD Phenom II.

comment:4 by Brian Smith, 12 years ago

It happens on both my Macs... MacBook Pro with an Intel Core 2 Duo and MacPro with an Intel Quad Core Xeon.

It does not happen on my PCs with AMD Phenom 2 and X2.

comment:5 by rudi, 12 years ago

With 4.1.4r74291 it happens here. See attached screenshot...

by rudi, 12 years ago

Attachment: trapscreen.png added

Trap screen after executing the program snippet.

by Brian Smith, 12 years ago

Trap screen with eCS 2.0 on MacBook Pro w/ Intel Core 2 Duo

comment:6 by Lars Erdmann, 12 years ago

I can add that this very same trap also happens with more complex OS/2 applications, namely Firefox 8.x and Seamonkey 2.3. But I guess that can be expected as the error is so fundamental.

in reply to:  6 comment:7 by Lars Erdmann, 12 years ago

Forgot to add: I am using VirtualBox Version 4.1.6.

by Lars Erdmann, 12 years ago

Yet another log: Windows 7 host, Intel Core2 Duo CPU, 2 GB RAM

comment:8 by Lars Erdmann, 12 years ago

The error still exists in VBox 4.1.8

comment:9 by michaln, 12 years ago

Reproducible here on an Intel Core 2 Quad host. No problem on an AMD system. I wonder if this is specific to the crummy old VT-x implementation.

comment:10 by Lars Erdmann, 12 years ago

I still experience this crash. Funnily enough, it now occurs less often since I upgraded from Seamonkey 2.3.3 to Seamonkey 2.5. Maybe Seamonkey 2.5 uses less floating point. Unfortunately I don't know the technical details behind VT-x (I could have a look into the Intel manual but I am sure I am lagging years behind ...). I am using an Intel Core2 Duo with Windows 7 as host. I had another trap on bootup just now; unfortunately I forgot to take a photo. The only thing I can say is that it happens pretty randomly. If you want me to test anything ...

by Lars Erdmann, 12 years ago

Attachment: erdmann.png added

Sudden trap on using Seamonkey 2.5

comment:11 by Lars Erdmann, 12 years ago

I got a trap using Seamonkey and attached the trap screen. With the very same kernel, the trap address has now changed.

comment:12 by michaln, 12 years ago

Description: modified (diff)

There's one crucial piece of information missing here. This problem does not occur with the SMP OS/2 kernel. The reason is that the SMP kernel runs with the CR0.NE bit set. The actual number of processors in the guest does not matter.

comment:13 by lerdmann, 12 years ago

For information: the traps still occur with VirtualBox 4.1.10.

CR0.NE bit set implies that there is some old interrupt controller around that generates IRQ 13 on a floating point exception, correct ?
Ok, I will now install the SMP kernel in virtual box and see if the problem goes away. Maybe that's also the reason why traps occur so frequently when Seamonkey is in use. I have the impression Seamonkey creates a lot of floating point exceptions that are then handled internally by the application. But the underlying mechanisms have to work ...

comment:14 by michaln, 12 years ago

Sure, it's the same with 4.1.10. No one said anything changed.

It's the other way around with CR0.NE. When it's set, it means the "new style" (implemented since the 286) math error handling should be used, i.e. the #MF exception. That's also the only way an SMP system can work. The OS/2 UNI kernels use the ancient FERR/IRQ13/IGNNE math error handling (CR0.NE clear), which clearly doesn't work right in VirtualBox. No modern OS uses that; Windows 9x was the only other important OS which used the old-style math error handling. Besides DOS, of course.
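
To make the CR0.NE distinction concrete, here is a minimal C sketch. It only classifies a CR0 value by its NE bit (bit 5); the enum and function names are hypothetical and are not taken from VirtualBox or the OS/2 kernel, and the CR0 values in the self-check are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* CR0.NE (Numeric Error) is bit 5 of the CR0 control register. */
#define CR0_NE (UINT32_C(1) << 5)

/* Hypothetical labels for the two x87 error-reporting styles. */
typedef enum {
    FP_ERR_LEGACY_IRQ13, /* CR0.NE clear: FERR#/IGNNE# wiring, seen as IRQ13 */
    FP_ERR_NATIVE_MF     /* CR0.NE set: #MF exception, vector 16 */
} fp_err_delivery;

/* Classify how an unmasked x87 error is reported for a given CR0 value. */
static fp_err_delivery fp_err_delivery_for_cr0(uint32_t cr0)
{
    return (cr0 & CR0_NE) ? FP_ERR_NATIVE_MF : FP_ERR_LEGACY_IRQ13;
}

/* Usage check with two made-up CR0 values that differ only in the NE bit. */
static void fp_err_delivery_self_check(void)
{
    assert(fp_err_delivery_for_cr0(UINT32_C(0x80000011)) == FP_ERR_LEGACY_IRQ13);
    assert(fp_err_delivery_for_cr0(UINT32_C(0x80000031)) == FP_ERR_NATIVE_MF);
}
```

A kernel that runs with the bit clear (as described for the UNI/W4 kernels) expects the first style, so being handed a #MF instead takes it down a path it was not written for.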

comment:15 by lerdmann, 12 years ago

Sorry, yes, I meant it the other way around for CR0.NE.
In any case, I have just upgraded to the SMP kernel and I am using Seamonkey 2.5 under an OS/2 guest in VirtualBox. Should traps occur again, I will post them here.

By the way: when you say "OS/2 UNI kernels" do you mean only the "W4" kernel or also the "UNI" kernel ? I never really understood why there is yet a third kernel variant (UNI) besides the other 2 (W4, SMP).

by lerdmann, 12 years ago

Attachment: boottrap.PNG added

Trap on boot with SMP kernel (one CPU, no PSD loaded)

comment:16 by lerdmann, 12 years ago

I had a trap on bootup: SMP kernel, one CPU only, no I/O APIC VM emulation, no PSD loaded. As I can tell from the trap screen, the CR0.NE bit is NOT set even though it's an SMP kernel. Does that mean that I either need a PSD to operate or that the OS needs its time to change from CR0.NE = 0 to CR0.NE = 1 ? How do you handle Win 9x and DOS guests ? As far as I understand they also set CR0.NE = 0.

comment:17 by michaln, 12 years ago

I believe a PSD is required. It's also true that early in the boot, CR0.NE is not set. I can't say exactly when it does get set. See INIT_USE_FPERR_TRAP in SMP.INF.

Win9x and DOS guests are handled the same as OS/2. It would appear that running with FP exceptions unmasked is extremely rare on those guests.

comment:18 by lerdmann, 12 years ago

No, you can run an SMP kernel without a PSD. But of course, you will only get to use the BSP and none of the ASPs. But it certainly looks as if running the SMP kernel with a PSD (OS2APIC.PSD or ACPI.PSD) prevents the traps. I guess that OS2APIC.PSD will set CR0.NE very early in the boot process. For ACPI.PSD I could ask the developer and find out if it also explicitly sets CR0.NE (sets flag INIT_USE_FPERR_TRAP).
I have now re-added OS2APIC.PSD to config.sys and enabled "I/O APIC" in the VM configuration. At the same time I have only enabled one CPU core (of two CPU cores available) in the VM configuration because the mouse tends to get jerky with > 1 CPU core. I will see if this eliminates the traps in the long term.

comment:19 by michaln, 12 years ago

There's a Windows test build at http://www.virtualbox.org/download/testcase/VirtualBox-4.1.51-76951-Win.exe

It would be nice if someone could try it and check if the guest OS traps are gone. Please note that this is a development build and I'm not interested in anything other than whether the traps are gone or not.

I should also note that the issue does NOT affect AMD CPUs.

comment:20 by rudi, 12 years ago

Hmm, installed it over a 4.1.8 and it would not start due to "driver structure changed" or so. Had some trouble getting back to a working setup (now at 4.1.10), so I'm not too enthusiastic to try again.

comment:21 by Frank Mehnert, 12 years ago

The problem is simply that you still have the 4.1.10 Extension Pack installed. If you don't need USB2 for that VM, just disable USB2 in the VM settings, otherwise we could provide you a test build of the 4.1.51 Extension Pack. Do you need one?

comment:22 by Brian Smith, 12 years ago

All of my Windows machines have AMD, is there a Mac testbuild?

comment:23 by rudi, 12 years ago

I think it would be good to have the 4.1.51 extension pack as well.

comment:24 by Frank Mehnert, 12 years ago

Here is the corresponding 4.1.51 ExtPack and here is a Mac OS X testbuild.

comment:25 by rudi, 12 years ago

O.K., first of all, I don't get the trap in the guest anymore. However, there still seems to be something not quite right. When running the test program above on a freshly booted up guest I get a SIGFPE (as expected). But the location appears to be somewhere in the runtime lib instead of in the program code itself. Also, when running the program three or more times in a row, no SIGFPE is thrown anymore. Instead it simply continues printing out the unmodified value of "d1" (i.e. 1.00000).

comment:26 by Brian Smith, 12 years ago

Similar here on both Macs...

10:56:00a nuke@ECS-[C:\HOME\DEFAULT]test

Killed by SIGFPE
pid=0x0041 ppid=0x0040 tid=0x0001 slot=0x007f pri=0x0200 mc=0x0001
C:\HOME\DEFAULT\TEST.EXE
LIBC063 0:0009a244
cs:eip=005b:1f39a244 ss:esp=0053:0212dddc ebp=0212de48
ds=0053 es=0053 fs=150b gs=0000 efl=00012202
eax=00000066 ebx=0212ff7c ecx=0212ff74 edx=0212df10 edi=00010032 esi=00000066
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.

10:56:01a nuke@ECS-[C:\HOME\DEFAULT]test
1.000000

10:57:14a nuke@ECS-[C:\HOME\DEFAULT]test
1.000000

comment:27 by michaln, 12 years ago

Yes, the exception may not be reported in the correct place. Is the behavior on AMDs any different?

comment:28 by Brian Smith, 12 years ago

Michal, you are correct: the behavior does seem to be the same on AMD... although it seems unexpected to me on both.

11:27:01a nuke@ECS-[C:\HOME\DEFAULT]test

Killed by SIGFPE
pid=0x0046 ppid=0x0040 tid=0x0001 slot=0x007f pri=0x0200 mc=0x0001
C:\HOME\DEFAULT\TEST.EXE
LIBC064 0:00083123
cs:eip=005b:1f373123 ss:esp=0053:0212ddf0 ebp=0212de48
ds=0053 es=0053 fs=150b gs=0000 efl=00012286
eax=0212df10 ebx=00000004 ecx=ffffffff edx=80000000 edi=00000000 esi=00000066
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.

11:27:02a nuke@ECS-[C:\HOME\DEFAULT]test
1.000000

comment:29 by lerdmann, 12 years ago

I am not sure if this is related, if not, just ignore:
I am running the rusty old 16-bit Microsoft C compiler for OS/2. It's CL.exe with its subcomponents C1.exe (preprocessor(?)), C2.exe (tokenizer(?), optimizer(!)), C3.exe (output generator(?)). There also exist large memory model variants (C1L.exe, C2L.exe, C3L.exe) that can deal with large source files. I have to use C2L.exe because I am using /Oe /Og (global optimizations) with rather large source files which require it (otherwise I get a warning that global optimizations cannot be performed for this and that routine).
I therefore specify /B2c2l.exe either on the command line or via the CL environment variable.

When I run the compiler with /B2... on a W4 kernel within VirtualBox, it just works but then I occasionally have these general trap problems.
When I run the compiler with /B2... on an SMP kernel within VirtualBox, I get "varying" results. I never get a trap but on the first run I might get a "C1001" compiler error, whereas on subsequent runs I will get a "Command line error D2030: INTERNAL COMPILER ERROR in 'P2'". But the internal compiler error might also occur on the very first run.
This is true for a source file of any size, small or big.

Unfortunately I don't have a native OS/2 on a multi-core system to test the SMP kernel on.
I would be grateful if anybody could test this behaviour on a multi-core system with SMP kernel on a native OS/2 installation and compare with behaviour in VirtualBox.

comment:30 by lerdmann, 12 years ago

As to probs with C2L.EXE: I have to correct my statement. It keeps trapping but the trap address is pretty much random. Even though I compile the very same file with the very same command line switches. See attached POPUPLOG.OS2. My gut feeling for this error is that it depends on how many segments the (segmented) executable contains. The more, the worse.

by lerdmann, 12 years ago

Attachment: POPUPLOG.OS2 added

Traps in C2L.EXE (part of 16-bit Microsoft C Compiler)

comment:31 by lerdmann, 12 years ago

Some news: at some point in time Scott Garfinkle from IBM modified the W4 kernel to also support using a PSD along with it so I took the chance:
1) if I use a W4 kernel with OS2APIC.PSD (and only 1 CPU of course), it looks like it gets rid of the traps and C2L.exe starts working again. I will need more observation time and report back
2) using a W4 kernel without any PSD leads to the random traps

Here is what I found in the eComStation bug tracker about problems running Firefox with an OS/2 guest in VirtualBox. It explains why the W4 kernel is kind of "flaky": http://bugs.ecomstation.nl/view.php?id=2974


[The kernel trap is caused by a defect in the Warp4 kernels. The Firefox code issues a fldcw which generates a math fault (#MF) exception, which does not push an exception-specific error code on to the stack. The kernel code should push a dummy error code on to the stack before entering the common exception handler code, but it does not. The common code assumes that the EFLAGS are at a specific stack offset and checks the EFLAGS VM bit to determine if the trap occurred in V86 mode. If the bit happens to be set, the result is a trap in V86FaultEntry + 17. If the bit is not set, the kernel will trap or hang somewhat later because the stack contains one less dword than the code expects.

The defect has been fixed in the SMP kernel, so running the SMP kernel in VirtualBox is a possible workaround.

It is not known why the fldcw generates a #MF exception. This might be a VirtualBox defect. ]
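
The stack-offset problem described in the quoted analysis can be modelled in a few lines of C. This is only a simplified sketch under stated assumptions: the stack is a small array, the slot layout (error code, EIP, CS, EFLAGS) follows the description above, and all names and sample values are hypothetical rather than taken from the Warp4 kernel.

```c
#include <assert.h>
#include <stdint.h>

/* EFLAGS.VM (virtual-8086 mode) is bit 17. */
#define EFLAGS_VM (UINT32_C(1) << 17)

/* Offsets, in dwords from the stack top, that the common handler assumes. */
enum { SLOT_ERROR_CODE = 0, SLOT_EIP = 1, SLOT_CS = 2, SLOT_EFLAGS = 3 };

/* The common handler's V86 test: it trusts the fixed EFLAGS offset. */
static int handler_thinks_v86(const uint32_t *stack_top)
{
    return (stack_top[SLOT_EFLAGS] & EFLAGS_VM) != 0;
}

/*
 * Build a simplified #MF frame and run the V86 test on it.
 * word_above stands for whatever dword sits just above EFLAGS on the stack.
 * push_dummy selects whether the entry stub pushes a placeholder error code.
 */
static int v86_test_on_mf_frame(uint32_t word_above, uint32_t eflags,
                                int push_dummy)
{
    /* Index 0 is the lowest address; the stack grows toward index 0.
       Slots: dummy error code, EIP, CS, EFLAGS, word above the frame. */
    uint32_t stack[5] = { 0, 0x1000, 0x5b, eflags, word_above };
    const uint32_t *top = push_dummy ? &stack[0] : &stack[1];
    return handler_thinks_v86(top);
}

/* Usage check: same frame, with and without the dummy error code. */
static void mf_frame_self_check(void)
{
    const uint32_t eflags = UINT32_C(0x00012202); /* VM bit clear */
    const uint32_t junk = EFLAGS_VM;              /* bit 17 happens to be set */

    assert(v86_test_on_mf_frame(junk, eflags, 1) == 0); /* reads real EFLAGS */
    assert(v86_test_on_mf_frame(junk, eflags, 0) == 1); /* reads the wrong dword */
}
```

Without the dummy push every slot is off by one dword, so the V86 check depends on unrelated stack contents, which fits the random-looking trap addresses reported here.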

by lerdmann, 12 years ago

Attachment: shutdownW4trap.PNG added

Trap on shutdown with W4 kernel and OS2APIC.PSD

comment:32 by lerdmann, 12 years ago

I had a trap on shutdown. W4 kernel with OS2APIC.PSD. The trap screen says that CR0.NE bit was set. So this cannot be the only reason for trapping.

comment:33 by michaln, 12 years ago

Well, duh. For example, the reason could be that you're running the W4 kernel with a PSD, which I'm sure is an almost completely untested combination.

comment:34 by lerdmann, 12 years ago

ok,

1) POPUPLOG.OS2 was taken with the SMP kernel 10.104a and OS2APIC.PSD in place. I was using only one CPU.
2) I was invoking cl.exe 3 times with the very same parameters and the very same source file
3) the traps however occurred at 3 different places in C2L.exe.

The only reason I was mentioning the W4 kernel is to state that C2L.exe does not trap when I use the W4 kernel in conjunction with OS2APIC.PSD.

comment:35 by michaln, 12 years ago

Resolution: fixed
Status: new → closed
Summary: OS/2 guest crashes on floating point exception → OS/2 guest crashes on floating point exception => fixed in svn

This ticket has clearly outlived its usefulness. We really don't care about 20+ year old Microsoft compilers which have known problems running on modern systems.

The reported problem is now fixed and the fix should be included in the next VirtualBox release. The OS/2 kernel should no longer crash on Intel CPUs because it should never get a #MF exception anymore (unless it asked for it).

comment:36 by rudi, 12 years ago

Michal,

does the fix also address the inconsistent behavior when running the test case program multiple times and the reporting of the exception in the correct place ?

comment:37 by michaln, 12 years ago

No, it doesn't. That's a completely different problem, which was visible on AMD CPUs since day one. Feel free to open a separate ticket, just don't expect it to be fixed anytime soon without giving some really good reason why we should spend time on that (it actually needs quite a bit of work).

comment:38 by lerdmann, 12 years ago

Rudi,

would you create a new bug ? Unfortunately, Michal does not consider the OS/2 Microsoft C-Compiler 6.0 a valid test case. I am sure that once the "inconsistent behaviour" is fixed that then the OS/2 Microsoft C-Compiler 6.0 will happily exhibit consistent behaviour (trapping or not) provided the same command line switches and the same input source file is used.

In any case, thanks for fixing this bug's problem.

comment:39 by rudi, 12 years ago

Lars,

I'm not convinced that the problems you are describing are really related to this issue. To summarize: we had a trap in the kernel because VirtualBox was delivering #MF, which is neither expected nor properly handled by the W4 kernel. To my understanding this has been fixed. Now we see two different problems:

1.) the SIGFPE fires only once or twice per guest session
2.) the reported exception location is wrong

I cannot tell if these two issues have a common cause (maybe a bug in the DOS-like FPE emulation) or if these are two separate things. I also don't know, if the location reporting is broken in general (i.e. not only for SIGFPE). Maybe Michal can tell and depending on this I might open one or two new ticket(s). Given the time it took until this one was addressed, expecting it to be fixed "anytime soon" is probably not realistic anyway...

comment:40 by michaln, 12 years ago

The two issues probably have a common cause. They are also specific to floating-point exceptions because a) the delivery is very different, and b) the FPU has its own internal state, separate from the CPU's.

If you have some paying customer who depends on accurate FPU exception reporting in OS/2 guests, that would greatly accelerate the process. But I suspect there's no such customer because very few applications even run with FP exceptions unmasked. I'm sure you understand that we have better things to do. Of course if someone wants to spend a fun few weeks with VirtualBox and submit a patch, we won't object :)

I highly doubt the problems with MS C 6.0 are related at all. MS C 6.0 is well known to have all sorts of problems running on modern systems, both Windows and OS/2. If you still depend on MS C 6.0 in 2012, you have no one but yourself to blame. So far I've seen no evidence that the MS C 6.0 compiler even uses the FPU at all (it might, but I wouldn't count on that).

comment:41 by Brian Smith, 10 years ago

Resolution: fixed
Status: closed → reopened

In version 4.3.2 this issue has resurfaced and now affects both AMD and Intel processors.

comment:42 by Frank Mehnert, 10 years ago

Thanks for the report. We reproduced the bug and are working on a fix.

comment:43 by Brian Smith, 10 years ago

Has any progress been made in the last 6 months?

comment:44 by Frank Mehnert, 10 years ago

There is a chance that this problem was fixed in VBox 4.3.16. Could you test?

in reply to:  44 comment:45 by Brian Smith, 10 years ago

Just tested on my Mac with 4.3.16... still traps on the floating point exception.

comment:46 by Klaus Espenlaub, 10 years ago

dbsoft, you never said what causes problems for you. A screenshot is not enough to debug whatever problem you might have.

comment:47 by lerdmann, 10 years ago

I still get a trap with the program snippet Rüdiger provided. See attached screenshot.

by lerdmann, 10 years ago

Attachment: newTrapScreen.PNG added

comment:48 by Frank Mehnert, 10 years ago

lerdmann, would you be willing to test a new fix? Which VirtualBox package do you need, Windows host?

comment:49 by lerdmann, 10 years ago

Sure, I'd like to test. I am using Windows 7 Professional as host. If it is not too much hassle I'd also like to have a matching extension pack.

comment:50 by lerdmann, 10 years ago

1) I forgot to mention: I am using an 8-core AMD CPU

2) Don't know if this has a bearing on the problem: see the second half of comment 31.
On the other hand, I understood from Michal's comments that with the existing fix the CPU should no longer get a #MF exception at all.
But the second half of comment 31 would explain why the W4 kernel traps on a #MF exception while the SMP kernel does not, and it would turn out to be a W4 kernel bug that cannot be fixed in VirtualBox.

comment:51 by Frank Mehnert, 10 years ago

Here is a 4.3.17 test build for Windows and here is the corresponding extension pack. Thank you for testing!

by Brian Smith, 10 years ago

Attachment: eComStation 2.png added

Trap with 4.3.17 test version

in reply to:  51 comment:52 by Brian Smith, 10 years ago

Replying to frank:

I just tested on Windows 8.1 x64 on an AMD FX-6300 with the 4.3.17 build and it still traps on that code snippet.

Attached my trap screen... it is a TRAP 000e instead of 0008 that lerdmann got.

(And I just double-checked... using 4.3.16 on my Mac I also get TRAP 0008 like lerdmann)

comment:53 by lerdmann, 10 years ago

Yes, that fixes it for me.
The funny thing is, for the W4 kernel the program snippet reports 0x2003e to be the program trap address in the program whereas with the very same program the SMP kernel reports 0x2003b to be the program trap address. For both kernels, the program trap address is consistent across multiple invocations of the program.

Whatever, I consider this problem fixed at least for my AMD CPU.

Just for fun, I loaded a self written PSD in conjunction with the W4 kernel that enables the new way of floating point exception reporting (#MF exception) and disables IRQ13. Under this scenario I get the kernel trap as already shown in "newTrapScreen.PNG".

@dbsoft: you should make sure that you run the W4 kernel WITHOUT ANY PSD as it is supposed to be for the W4 kernel.

Thanks a lot !

in reply to:  53 comment:54 by Brian Smith, 10 years ago

Replying to lerdmann:

@dbsoft: you should make sure that you run the W4 kernel WITHOUT ANY PSD as it is supposed to be for the W4 kernel.

As far as I know I am not using a PSD... there is no PSD in the CONFIG.SYS file.

comment:55 by michaln, 10 years ago

@lerdmann: The program trap address is probably the address of the instruction where the FP exception was detected, but not the address of the actual FP instruction which triggered the exception. I don't know why the reported address is different, but it should not cause problems.

The thing with the PSD is very interesting and yes, it basically exactly simulates the VirtualBox bug (FP exceptions are delivered as #MF and not IRQ13) which then triggers a bug/unexpected code path in the W4 kernel.

It would be nice if someone could test on an Intel machine, too.

in reply to:  55 comment:56 by Brian Smith, 10 years ago

Replying to michaln:

It would be nice if someone could test on an Intel machine, too.

I can test on Intel... can boot my Mac in Windows or test a Mac build if one is available... but I am experiencing a trap still with the test program... a different one though. So not sure how valuable my test will be.

So I booted Windows, installed the same VirtualBox 4.3.17 and using the same image I no longer get the trap. Seems to be fixed on my Intel Mac in Windows 7... not sure if my AMD PC has something configured differently but I am still getting the trap there with the same software.

I also now tested on an older AMD Athlon 64 X2 running Windows 7 and I also get the trap 000e... seems to not be fixed on AMD for me. I also tested on an older Core 2 Duo Mac running Windows 7 which also seems to be fixed.

So my testing shows Intel is fixed, AMD is still bugged but with trap 000e now instead of 0008.

comment:57 by Brian Smith, 9 years ago

Can anyone else test with various AMD systems to see if my results are accurate or if something weird is going on with my systems?

by lerdmann, 9 years ago

Attachment: os2pcat.zip added

PSD for Warp4 kernel that works around bugs in the #MF handler of Warp4 kernel

comment:58 by lerdmann, 9 years ago

@dbsoft: find attached file "os2pcat.zip". It contains a PSD (and all the source code) that fixes the existing problem in the Warp 4 kernel. Unzip OS2PCAT.PSD and OS2PCAT.SYM and place them into \os2\boot directory. Add this line to config.sys: PSD=OS2PCAT.PSD

That should fix your problem. It's no use fixing something in VirtualBox where in fact the Warp4 kernel is causing all the problems.

Add. info: with this PSD loaded in conjunction with the W4 kernel, the failing address (of the program snippet) displayed is exactly the same as for the SMP kernel.

in reply to:  58 comment:59 by Brian Smith, 9 years ago

Replying to lerdmann:

That should fix your problem. It's no use fixing something in VirtualBox where in fact the Warp4 kernel is causing all the problems.

That does seem to fix the trap on my AMD systems... however it isn't clear to me why the behavior is different on Intel and AMD?

comment:60 by lerdmann, 9 years ago

I am not a virtualization expert but there must be some difference between AMD and Intel. The point is that the PSD works around the bug in the W4 kernel and it correctly handles what VirtualBox does on an unmasked floating point exception.

comment:61 by lerdmann, 9 years ago

About virtualization, here is a relevant excerpt from the Intel documentation (volume 3, chapter 23.8) and I suppose that AMD followed closely:

The first processors to support VMX operation require that the
following bits be 1 in VMX operation: CR0.PE, CR0.NE, CR0.PG, and
CR4.VMXE.

The necessity to set the CR0.NE bit translates to the generation of an #MF exception for floating point exceptions instead of taking the route via an external interrupt controller issuing an IRQ13 interrupt.
I would believe that your AMD CPU is an earlier model that requires CR0.NE to be set in order to properly operate in a virtualized environment. As a consequence the Warp4 kernel has to properly deal with the #MF exception which is what OS2PCAT.PSD ensures.
Later CPUs might offer additional capabilities where setting CR0.NE bit is not necessary, I don't know.
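
As a rough illustration of the constraint quoted from the Intel documentation, the following C sketch reports which of the CR0 bits named in the excerpt (PE, NE, PG) are missing from a given CR0 value. The function name and sample values are hypothetical. Real hardware advertises its fixed bits through capability MSRs (IA32_VMX_CR0_FIXED0 and IA32_VMX_CR0_FIXED1), which this sketch does not read.

```c
#include <assert.h>
#include <stdint.h>

/* CR0 bits named in the quoted excerpt. */
#define CR0_PE (UINT32_C(1) << 0)  /* Protection Enable */
#define CR0_NE (UINT32_C(1) << 5)  /* Numeric Error */
#define CR0_PG (UINT32_C(1) << 31) /* Paging */

/* Return the subset of {PE, NE, PG} that is NOT set in cr0. */
static uint32_t vmx_missing_cr0_bits(uint32_t cr0)
{
    const uint32_t required = CR0_PE | CR0_NE | CR0_PG;
    return required & ~cr0;
}

/* Usage check with two made-up CR0 values. */
static void vmx_cr0_self_check(void)
{
    /* A guest that wants legacy IRQ13 reporting keeps NE clear. */
    assert(vmx_missing_cr0_bits(UINT32_C(0x80000011)) == CR0_NE);
    /* With NE set, nothing from the quoted list is missing. */
    assert(vmx_missing_cr0_bits(UINT32_C(0x80000031)) == 0);
}
```

When the guest wants NE clear but the hardware runs with it set, the hypervisor has to present the legacy behaviour to the guest itself, which is where the W4 kernel's expectations and the real CPU state can diverge.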

comment:62 by michaln, 9 years ago

The requirement to run with CR0.NE set (when using virtualization) applies to all Intel processors. The legacy FPU exception handling does not scale beyond a single CPU, which is why even OS/2 SMP kernels can't use it.

Intel probably plans to completely remove the old-style FPU exception handling in the future since it's not usable for any modern OS (where "modern" includes anything better than DOS, Windows 9x, and OS/2 W4-style kernels).

Anyway, if the PSD is necessary, it's a bug in VirtualBox (which we can't reproduce). Then again, the PSD isn't a bad solution and might actually make things slightly faster because FPU exceptions don't need to be intercepted.

in reply to:  62 comment:63 by Brian Smith, 9 years ago

Replying to michaln:

Anyway, if the PSD is necessary, it's a bug in VirtualBox (which we can't reproduce). Then again, the PSD isn't a bad solution and might actually make things slightly faster because FPU exceptions don't need to be intercepted.

That is kind of what I was thinking too since it works fine on Intel... the one processor I tried on is quite old but the newer one I purchased just a few months ago... it is Piledriver based which I think is the current series originally released at the end of 2012. So I don't think it is a problem with it being an old CPU.

by Brian Smith, 9 years ago

Attachment: amdfx6300.png added

comment:64 by lerdmann, 9 years ago

What was I talking ...

I had successfully tested 4.3.17 (with the W4 kernel and without any PSD) with an Intel dual-Core CPU and NOT with an AMD CPU. Combining with dbsoft's comments it looks like the problem is fixed for Intel CPUs but apparently not for AMD CPUs.

Sorry for the confusion.

comment:65 by Frank Mehnert, 9 years ago

Here is another Windows test build which contains a fix for AMD hosts. And here is the extpack.

in reply to:  65 comment:66 by Brian Smith, 9 years ago

Replying to frank:

Here is another Windows test build which contains a fix for AMD hosts. And here is the extpack.

Initial testing seems to show it works, I only tested on the new processor and just commented out the PSD line in the CONFIG.SYS to remove the OS2PCAT mentioned above. I'll test some more to verify the PSD is actually not loading and that it works on the other processor later today. Thanks!

Tested with an image that I did not install the PSD in and also on the older AMD system and both work correctly! Thanks looks like it is fixed for AMD now too.

comment:67 by michaln, 9 years ago

Getting more confirmation would be excellent. FYI, the latest fix applies to all AMD hosts (everything using AMD-V to be exact). No impact on Intels.

comment:68 by lerdmann, 9 years ago

As could be expected, I can confirm that it still works on Intel.

comment:69 by Frank Mehnert, 9 years ago

Could you recheck with VBox 4.3.20?

comment:70 by Brian Smith, 9 years ago

Tested on my main Mac (Intel) and PC (AMD) and both worked correctly.

Thank you!

comment:71 by Frank Mehnert, 9 years ago

Resolution: fixed
Status: reopened → closed

Thanks for the feedback! I will close this ticket.

© 2023 Oracle