VirtualBox

Ticket #20625 (new defect)

Opened 11 months ago

Last modified 8 months ago

After hardware upgrade OS/2 does not boot

Reported by: jimoe
Owned by:
Component: host support
Version: VirtualBox 6.1.26
Keywords: os/2, hardware
Cc:
Guest type: other
Host type: Linux

Description

Host: openSUSE Tumbleweed 20211012, Linux v5.14.9-1-default x86_64, VBox v6.1.26_SUSE r145957

After a hardware upgrade the OS/2 guest has stopped booting. The boot proceeds normally to a point, then simply stops.

CPU: old: AMD Athlon II 4x, new: AMD Ryzen 5 5600X

Mainboard: old: ASUS M5A88-M, new: ASUS TUF Gaming B550+

Attachments

VBox.log (204.0 KB) - added by jimoe 11 months ago.
Log of failed OS/2 boot. Line 1229 is the last entry before turning off the guest.

Change History

comment:1 Changed 11 months ago by klaus

I hope it's not caused by software which has trouble with a too fast CPU.

If it is related to tripping over certain CPU feature details you could try letting VirtualBox report a different (older) CPU profile. This works from the command line.

You can list the CPU profiles with

$ VBoxManage list cpu-profiles

Selecting a profile for a certain VM works with

$ VBoxManage modifyvm "vmname" --cpu-profile="profile name"

Note that not all profiles are going to work on a specific host CPU, because a profile does not add features to your CPU. The highest chance of success is with older CPU profiles from the same vendor (or quite old Intel models). Also, remember that a profile only specifies which CPU features to report. It does not have any impact on CPU clock speed.

comment:2 Changed 11 months ago by bird

The VBoxManage list cpu-profiles command doesn't work in 6.1; it's a feature in the upcoming major release. Sorry. Here is a list of available CPU profile names for 6.1 (in alphabetical order):

        "AMD Athlon 64 3200+"
        "AMD Athlon 64 X2 Dual Core 4200+"
        "AMD FX-8150 Eight-Core"
        "AMD Phenom II X6 1100T"
        "Hygon C86 7185 32-core"
        "Intel 80186"
        "Intel 80286"
        "Intel 80386"
        "Intel 80486"
        "Intel 8086"
        "Intel Atom 330 1.60GHz"
        "Intel Core Duo T2600 2.16GHz"
        "Intel Core i5-3570"
        "Intel Core i7-2635QM"
        "Intel Core i7-3820QM"
        "Intel Core i7-3960X"
        "Intel Core i7-5600U"
        "Intel Core i7-6700K"
        "Intel Core2 T7600 2.33GHz"
        "Intel Core2 X6800 2.93GHz"
        "Intel Pentium 4 3.00GHz"
        "Intel Pentium M processor 2.00GHz"
        "Intel Pentium N3530 2.16GHz"
        "Intel Xeon X5482 3.20GHz"
        "Quad-Core AMD Opteron 2384"
        "VIA QuadCore L4700 1.2+ GHz"
        "ZHAOXIN KaiXian KX-U5581 1.8GHz"

For debugging the issue, I would suggest enabling the driver loading messages (Alt-F2 or similar) and letting us know where the boot stops.

comment:3 Changed 11 months ago by jimoe

I remember other aspects of this. (After the upgrade there was a lot happening.)

The os/2 guest had been saved before the upgrade. Afterwards, the saved session ran as expected. It is the boot process that is the issue.

The last line of the boot screen: c:\os2\boot\testcfg.sys

I tried:

  • restoring from an older backup
  • reducing the execution cap to 30%

comment:4 Changed 11 months ago by fth0

FWIW, did you also try the current VirtualBox 6.1.28, which supports Linux kernel 5.14?

comment:5 Changed 11 months ago by jimoe

No.

I use the current version on the Tumbleweed repository, which is 6.1.26. It is usually only a week or so after a VBox release that the new version appears.

I have tried to mix the two releases in the past; it did not go well.

comment:6 Changed 11 months ago by jimoe

VBox was upgraded to 6.1.28_SUSE r147628 today.

It made no difference to the failure of os/2 to start.

comment:7 Changed 11 months ago by jimoe

Any progress with this? Or is it waiting for the v6.2 release?

comment:8 Changed 11 months ago by fth0

@klaus / @bird: A wild guess:

A few other VirtualBox users upgraded their host to modern AMD CPUs and experienced crashes in their Windows 9x guests. The technical background (without hypervisors involved) is described in "Windows 9x TLB Invalidation Bug" and "TLB and Pagewalk Coherence in x86 Processors". Do you think that OS/2 may have the same bug as Windows 9x?

@jimoe: For a test, disable System > Acceleration > Enable Nested Paging.
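
For hosts managed from the command line, the same test can probably be done with VBoxManage. The sketch below only builds the command line and does not run it. It assumes the `--nestedpaging on|off` option of `VBoxManage modifyvm` as found in 6.1 and reuses the `--cpu-profile=` form quoted in comment:1. The helper name `modifyvm_command` and the VM name "OS2 Warp" are made up for illustration.

```python
import shlex


def modifyvm_command(vm_name, *, nested_paging=None, cpu_profile=None):
    """Build (but do not run) a VBoxManage modifyvm command line.

    Hypothetical helper: only the settings discussed in this ticket
    are covered, and the VM must be powered off when it is applied.
    """
    argv = ["VBoxManage", "modifyvm", vm_name]
    if nested_paging is not None:
        # Assumed 6.1 spelling of the option: --nestedpaging on|off
        argv += ["--nestedpaging", "on" if nested_paging else "off"]
    if cpu_profile is not None:
        # Same form as used in comment:1 of this ticket.
        argv.append(f"--cpu-profile={cpu_profile}")
    return argv


if __name__ == "__main__":
    # "OS2 Warp" is a placeholder VM name.
    print(shlex.join(modifyvm_command("OS2 Warp", nested_paging=False)))
```

Passing the resulting list to `subprocess.run` would apply the setting; printing it first makes it easy to check before touching the VM configuration.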

comment:9 Changed 11 months ago by jimoe

Yay! That was it. OS/2 started and is functioning. Thank you.

comment:10 Changed 11 months ago by fth0

Thanks for reporting back! :)

Can you easily try whether OS/2 runs natively on your new hardware? The reason for this request is that the maintainer of the OS/2 Museum didn't encounter your issue (yet), despite having similar setups. TIA.

comment:11 Changed 11 months ago by jimoe

Does it boot from a storage drive instead of a VM? I have no idea. I have been running OS/2 as a guest for over 10 years. No reason to do differently.

comment:12 Changed 9 months ago by bird

Just my $0.02 here, but booting an OS/2 build VM on a Threadripper 3990X would crash when the display doctor (SDD) started unless I set the CPU profile to some older Intel CPU. I picked "Intel Pentium M processor 2.00GHz", as I know OS/2 worked on that CPU (old ThinkPad).

comment:13 Changed 9 months ago by bird

I think I've reproduced a related issue here while trying to implement unattended installation of OS/2 guests. TESTCFG.SYS frequently crashes during initialization after it returns from querying APM support from the BIOS in real mode (calling DevHlp 24h to read some DOS variable pointing to APM info). The registers restored from the stack in the epilogue are all wrong, and it finally #GPs on the RETF as there is no valid CS on the phantom stack it's using. When looking at the stack in the debugger, everything seems fine. I will try to track down where this goes south and whether it's specific to this real-mode tripping or not.

comment:14 Changed 9 months ago by bird

As mentioned, TESTCFG.SYS ends up causing a BIOS call in real mode. When switching to real mode, an identity-mapped page at virtual address 12000h is installed by tweaking the page directory / tables. When returning to protected mode, these page table and page directory changes are undone after enabling paging, but no CR3 flushing is done afterwards, and that's causing trouble.

VBoxDbg> u 1200:00000153 L 60
1200:00000153 55                      push bp
1200:00000154 8b ec                   mov bp, sp
1200:00000156 b8 00 0a                mov ax, 00a00h
1200:00000159 8e d8                   mov ds, ax
1200:0000015b 9a 3a 27 00 12          call far 01200h:0273ah
1200:00000160 fa                      cli
1200:00000161 8b 46 04                mov ax, word [bp+004h]
1200:00000164 25 ff 8f                and ax, 08fffh
1200:00000167 0d 00 30                or ax, 03000h
1200:0000016a 89 46 04                mov word [bp+004h], ax
1200:0000016d 25 ff fd                and ax, 0fdffh
1200:00000170 50                      push ax
1200:00000171 9d                      popfw
1200:00000172 e8 47 25                call 026bch
1200:00000175 66 0f 01 16 12 13       lgdt [01312h]
1200:0000017b 66 0f 01 1e 18 13       lidt [01318h]
1200:00000181 66 a1 90 64             mov eax, dword [06490h]
1200:00000185 0f 22 d8                mov cr3, eax                  ; Modified CR3 loaded prior to enabling paging.
1200:00000188 0f 20 c0                mov eax, cr0
1200:0000018b 66 0b 06 5b 0d          or eax, dword [00d5bh]
1200:00000190 66 50                   push eax
1200:00000192 0f 22 c0                mov cr0, eax
1200:00000195 ea 9a 01 00 12          jmp far 01200h:0019ah
1200:0000019a 33 c0                   xor ax, ax
1200:0000019c 8e c0                   mov es, ax
1200:0000019e b8 00 0a                mov ax, 00a00h
1200:000001a1 8e d8                   mov ds, ax
1200:000001a3 8e 16 5f 0d             mov ss, [00d5fh]
1200:000001a7 e8 dc 24                call 02686h
1200:000001aa 66 58                   pop eax
1200:000001ac 66 a9 00 00 00 80       test eax, dword 080000000h
1200:000001b2 0f 84 3a 00             je +0003ah (001f0h)
1200:000001b6 66 53                   push ebx                     ; Start of code restoring the page directory and page table
1200:000001b8 b8 60 01                mov ax, 00160h               ; to their original state.
1200:000001bb 8e d8                   mov ds, ax
1200:000001bd 67 66 a1 08 70 ed ff    mov eax, dword [0ffed7008h]
1200:000001c4 67 66 8b 1d 28 b9 ec ff mov ebx, dword [0ffecb928h]
1200:000001cc 67 66 89 18             mov dword [eax], ebx
1200:000001d0 66 33 c0                xor eax, eax
1200:000001d3 b8 00 12                mov ax, 01200h
1200:000001d6 66 c1 e8 06             shr eax, 006h
1200:000001da 67 66 03 05 30 b9 ec ff add eax, dword [0ffecb930h]
1200:000001e2 67 66 8b 1d 2c b9 ec ff mov ebx, dword [0ffecb92ch]
1200:000001ea 67 66 89 18             mov dword [eax], ebx
1200:000001ee 66 5b                   pop ebx                      ; Restored, but no TLB flushing.
1200:000001f0 b8 28 01                mov ax, 00128h
1200:000001f3 8e d8                   mov ds, ax
1200:000001f5 80 26 15 00 fd          and byte [00015h], 0fdh
1200:000001fa b8 10 00                mov ax, 00010h
1200:000001fd 0f 00 d8                ltr ax                       ; This can come in handy on AMD systems.
1200:00000200 b8 28 00                mov ax, 00028h
1200:00000203 0f 00 d0                lldt ax
1200:00000206 9a fc 2c 48 01          call far 00148h:02cfch
1200:0000020b 33 c0                   xor ax, ax
1200:0000020d 8e d8                   mov ds, ax
1200:0000020f 8e c0                   mov es, ax
1200:00000211 8e e0                   mov fs, ax
1200:00000213 8e e8                   mov gs, ax
1200:00000215 c9                      leave
1200:00000216 c3                      retn
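
The failure mode described in this comment can be illustrated with a toy model of a TLB sitting in front of a page table. This is only a sketch of the general mechanism, not of how OS2KRNL or VirtualBox work internally. The class `ToyMMU`, the function `demo` and the physical page number 0x345 are made up for illustration.

```python
class ToyMMU:
    """Toy model: a page table plus a TLB that caches translations."""

    def __init__(self, page_table):
        self.page_table = dict(page_table)  # virtual page -> physical page
        self.tlb = {}                       # cached translations

    def translate(self, vpage):
        # A TLB hit is returned without consulting the page table.
        if vpage not in self.tlb:
            self.tlb[vpage] = self.page_table[vpage]
        return self.tlb[vpage]

    def write_pte(self, vpage, ppage):
        # Updating the in-memory page table does not touch the TLB.
        self.page_table[vpage] = ppage

    def reload_cr3(self):
        # Reloading CR3 drops the cached translations.
        self.tlb.clear()


def demo(flush):
    original, identity = 0x345, 0x12    # 0x345 is a hypothetical page number
    mmu = ToyMMU({0x12: original})
    mmu.write_pte(0x12, identity)       # install identity mapping for 12000h
    mmu.translate(0x12)                 # code runs there, TLB caches the entry
    mmu.write_pte(0x12, original)       # restore the original mapping
    if flush:
        mmu.reload_cr3()                # the step missing in the listing above
    return mmu.translate(0x12)


if __name__ == "__main__":
    print(hex(demo(flush=False)))       # stale identity mapping is still used
    print(hex(demo(flush=True)))        # restored mapping is used
```

Without the flush, the lookup keeps returning the identity mapping even though the page table already holds the restored entry. That matches the observation that the stack looks fine in the debugger while the code sees wrong data.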

Patching the code in the debugger to do a TLB flush fixes the problem. I squeezed out 1 byte from the sequence loading 160h into DS and 5 bytes from an inefficient load of 48h into eax, giving me the 6 bytes needed for a CR3 reload.

VBoxDbg> u 1200:1b8
1200:000001b8 68 60 01                push 00160h
1200:000001bb 1f                      pop DS
1200:000001bc 67 66 a1 08 70 ed ff    mov eax, dword [0ffed7008h]
1200:000001c3 67 66 8b 1d 28 b9 ec ff mov ebx, dword [0ffecb928h]
1200:000001cb 67 66 89 18             mov dword [eax], ebx
1200:000001cf 66 31 c0                xor eax, eax
1200:000001d2 b0 48                   mov AL, 048h
1200:000001d4 67 66 03 05 30 b9 ec ff add eax, dword [0ffecb930h]
1200:000001dc 67 66 8b 1d 2c b9 ec ff mov ebx, dword [0ffecb92ch]
1200:000001e4 67 66 89 18             mov dword [eax], ebx
1200:000001e8 0f 20 d8                mov eax, cr3                ; Added TLB flush sequence.
1200:000001eb 0f 22 d8                mov cr3, eax
1200:000001ee 66 5b                   pop ebx
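
The byte budget of the patch can be checked with a few lines of arithmetic. The opcode bytes below are copied from the two VBoxDbg listings in this comment. The variable names and the `size` helper are made up for illustration, and reading 48h as the byte offset of a 4-byte page table entry for page 12h is an interpretation, not something the listing states.

```python
# Opcode bytes copied from the original and patched listings.
ORIG_LOAD_DS = ["b8 60 01", "8e d8"]                     # mov ax, 160h / mov ds, ax
NEW_LOAD_DS = ["68 60 01", "1f"]                         # push 160h / pop ds
ORIG_LOAD_48 = ["66 33 c0", "b8 00 12", "66 c1 e8 06"]   # xor / mov ax, 1200h / shr eax, 6
NEW_LOAD_48 = ["66 31 c0", "b0 48"]                      # xor / mov al, 48h
CR3_RELOAD = ["0f 20 d8", "0f 22 d8"]                    # mov eax, cr3 / mov cr3, eax


def size(instructions):
    """Total encoded length in bytes of a list of hex-dumped instructions."""
    return sum(len(instr.split()) for instr in instructions)


saved_ds = size(ORIG_LOAD_DS) - size(NEW_LOAD_DS)
saved_48 = size(ORIG_LOAD_48) - size(NEW_LOAD_48)

if __name__ == "__main__":
    print("saved on DS load:", saved_ds)
    print("saved on 48h load:", saved_48)
    print("needed for CR3 reload:", size(CR3_RELOAD))
    # The constant itself: segment 1200h shifted right by 6 gives 48h,
    # which equals (linear 12000h >> 12) * 4.
    print(hex(0x1200 >> 6), hex((0x12000 >> 12) * 4))
```

The two savings add up to exactly the size of the CR3 reload, which is why the patched code still ends with the `pop ebx` at 1200:000001ee.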

However, patching OS2KRNL is tedious, as the image would need to be uncompressed before patching and recompressed afterwards. There are probably also fixups that need adjusting. So, for AMD hosts, we could intercept the LTR or the LLDT instructions and take a sledgehammer to the TLBs from the VMM. An LTR intercept with flush works here.

There are other things we could use to flush the TLBs from the VMM, as the call to 00148h:02cfch typically triggers a certain amount of I/O exits. However, coming up with good heuristics for when to flush and when not to would be difficult, if not impossible.

P.S. I think there might be other copies of this code in SMP kernels.

comment:15 Changed 9 months ago by fth0

@bird: Kudos for the detailed analysis and sharing it. :)

I do understand the TLB and Pagewalk Coherence issues in general, but I don't know much about OS/2 and nothing about the significance of TESTCFG.SYS and SDD. Do your findings indicate that OS/2 has issues similar to those of Windows 9x, so that "many" OS/2 users with current AMD CPUs will trip over it?

comment:16 Changed 8 months ago by LewisR

I would suggest upgrading to ArcaOS. Among other things, TESTCFG.SYS has been completely rewritten and does not make any BIOS calls in real mode.

Further, Arca Noae is not aware of any issues such as described here (with any available host CPU, regardless of frequency, with or without nested paging enabled).
