Opened 4 years ago
Last modified 8 months ago
#20131 new defect
rcu_sched detected stalls on CPUs/tasks: Linux guest and host (possible Ryzen issue)
Reported by: | RT_db | Owned by: | |
---|---|---|---|
Component: | other | Version: | VirtualBox 6.1.16 |
Keywords: | | Cc: | |
Guest type: | Linux | Host type: | Linux |
Description
Debian Linux guest on a Debian host, VirtualBox 6.1.16.
Host: Linux 5.6.0-0.bpo.2-amd64 #1 SMP Debian 5.6.14-2~bpo10+1 (2020-06-09) x86_64 GNU/Linux, Debian 10, AMD Ryzen 9 3900X 12-Core, 32 GB RAM.
Guest: 4 cores, 4 GB RAM. dmesg output:
[ 225.539622] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 225.540724] rcu: 0-...!: (27 ticks this GP) idle=c50/0/0x0 softirq=2529/2530 fqs=0
[ 225.541761] rcu: 1-...!: (0 ticks this GP) idle=ac8/0/0x0 softirq=2101/2101 fqs=0
[ 225.542767] rcu: 3-...!: (26 ticks this GP) idle=99c/0/0x0 softirq=2173/2174 fqs=0
[ 225.543767] (detected by 2, t=5271 jiffies, g=2529, q=2)
[ 225.543770] Sending NMI from CPU 2 to CPUs 0:
[ 225.543825] NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0xe/0x10
[ 225.544771] Sending NMI from CPU 2 to CPUs 1:
[ 225.544797] NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0xe/0x10
[ 225.545769] Sending NMI from CPU 2 to CPUs 3:
[ 225.545796] NMI backtrace for cpu 3 skipped: idling at native_safe_halt+0xe/0x10
[ 225.546769] rcu: rcu_sched kthread starved for 5272 jiffies! g2529 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[ 225.547792] rcu: RCU grace-period kthread stack dump:
[ 225.548812] rcu_sched I 0 11 2 0x80004000
[ 225.548816] Call Trace:
[ 225.548836] ? schedule+0x2d8/0x760
[ 225.548838] ? switch_to_asm+0x40/0x70
[ 225.548840] ? switch_to_asm+0x40/0x70
[ 225.548842] schedule+0x4a/0xb0
[ 225.548843] schedule_timeout+0x15e/0x300
[ 225.548850] ? next_timer_interrupt+0xd0/0xd0
[ 225.548853] rcu_gp_kthread+0x452/0x8d0
[ 225.548864] kthread+0xf9/0x130
[ 225.548869] ? kfree_call_rcu+0x10/0x10
[ 225.548870] ? kthread_park+0x90/0x90
[ 225.548872] ret_from_fork+0x22/0x40
[ 285.421645] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 285.422242] rcu: 0-...!: (0 ticks this GP) idle=da0/0/0x0 softirq=2535/2535 fqs=0
[ 285.422780] rcu: 1-...!: (37 ticks this GP) idle=d4c/0/0x0 softirq=2103/2103 fqs=0
[ 285.423305] rcu: 3-...!: (46 ticks this GP) idle=b84/0/0x0 softirq=2178/2179 fqs=0
[ 285.423815] (detected by 2, t=5268 jiffies, g=2541, q=866)
[ 285.423817] Sending NMI from CPU 2 to CPUs 0:
[ 285.423858] NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0xe/0x10
[ 285.424818] Sending NMI from CPU 2 to CPUs 1:
[ 285.424845] NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0xe/0x10
[ 285.425817] Sending NMI from CPU 2 to CPUs 3:
[ 285.425847] NMI backtrace for cpu 3 skipped: idling at native_safe_halt+0xe/0x10
[ 285.426815] rcu: rcu_sched kthread starved for 5268 jiffies! g2541 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[ 285.427228] rcu: RCU grace-period kthread stack dump:
[ 285.427620] rcu_sched I 0 11 2 0x80004000
[ 285.427623] Call Trace:
[ 285.427629] ? schedule+0x2d8/0x760
[ 285.427630] ? switch_to_asm+0x40/0x70
[ 285.427631] ? switch_to_asm+0x40/0x70
[ 285.427633] schedule+0x4a/0xb0
[ 285.427634] schedule_timeout+0x15e/0x300
[ 285.427637] ? next_timer_interrupt+0xd0/0xd0
[ 285.427640] rcu_gp_kthread+0x452/0x8d0
[ 285.427643] kthread+0xf9/0x130
[ 285.427645] ? kfree_call_rcu+0x10/0x10
[ 285.427647] ? kthread_park+0x90/0x90
[ 285.427648] ret_from_fork+0x22/0x40
System is unusable.
The guest was transferred from an Intel Core i5 system, where it worked perfectly. The same error occurs in Debian 9, Debian 10 and bullseye guests. I tried multiple combinations of cores and RAM with no effect.
Interestingly, an Alpine 3.12 guest on the same system has no problem.
Current workaround: installing Guest Additions in the guest makes the error messages go away.
I've attached the VirtualBox logs. host1 relates to the dmesg output above; host2 and host3 are logs from the same system, just running longer.
Many thanks for your help.
Attachments (2)
Change History (5)
by , 4 years ago
comment:1 by , 4 years ago
Adding Guest Additions didn't ultimately fix the problem; it just delayed its onset.
comment:2 by , 3 years ago
I ran into this, or something very much like it, on my new Ryzen 5900X (Windows 10 host, Debian 11 guest, though I reproduced it with a Fedora 35 guest too). With some minimal experimenting I found a workaround that has worked for me so far.
I found that "perf top" was good at provoking the stall, and that running "vboxmanage modifyvm foo --hpet on" on the host made the problem occur virtually never or not at all for that VM, even while every other VM without that change was stalling.
Hopefully that's helpful to people who stumble onto this until whatever wacky root cause is run down (or it's just...made the default on AMD systems, heh).
comment:3 by , 8 months ago
This issue did not appear to affect me until I upgraded from a Ryzen 9 3900X to an AMD Ryzen 9 5950X. I applied your 'VBoxManage modifyvm <vmname> --hpet on' fix and it seems to have gone away. Thank you for your post.
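Since several VMs on the same host can be affected, here is a rough sketch of applying the same setting to every registered VM on a Linux host; it assumes all VMs are powered off and that their names don't contain embedded quotes:

    # "VBoxManage list vms" prints one '"name" {uuid}' line per VM;
    # strip each line down to the bare VM name and enable HPET for it
    VBoxManage list vms | sed 's/^"\(.*\)" .*/\1/' | while IFS= read -r vm; do
        VBoxManage modifyvm "$vm" --hpet on
    done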
host log 1