Re: dozens of qemu/kvm VMs getting into stuck states since kernel ~5.13

qemu-devel

From:	Sean Christopherson
Subject:	Re: dozens of qemu/kvm VMs getting into stuck states since kernel ~5.13
Date:	Tue, 7 Dec 2021 22:25:51 +0000

On Tue, Dec 07, 2021, Chris Murphy wrote:
> cc: qemu-devel
> 
> Hi,
> 
> I'm trying to help progress a very troublesome and so far elusive bug
> we're seeing in Fedora infrastructure. When running dozens of qemu-kvm
> VMs simultaneously, eventually they become unresponsive, as do any
> new processes we start to extract information from the host about
> what's gone wrong.

Have you tried bisecting?  IIUC, the issues showed up between v5.11 and v5.12.12,
so bisecting should be relatively straightforward.
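
The search that `git bisect` drives (with `git bisect good` / `git bisect bad` at each step) is a binary search over the commits between a known-good and a known-bad point. The Python sketch below illustrates why only about log2(N) test runs are needed for N commits. It assumes the hang, once introduced, is present in every later commit. The commit list and the `is_bad` predicate are hypothetical stand-ins for "build this kernel, boot the host, run the VM workload, and check whether it hangs".

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(commits: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the first commit for which is_bad() is True.

    Assumes commits[0] is known good, commits[-1] is known bad, and the bug
    stays present once introduced (the same assumption git bisect makes).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):  # one test build/boot per iteration
            hi = mid
        else:
            lo = mid
    return commits[hi]


if __name__ == "__main__":
    # Hypothetical history: 1000 commits, regression introduced at index 637.
    tested = []

    def hangs(commit: int) -> bool:
        tested.append(commit)  # record each probe to count test runs
        return commit >= 637

    print(first_bad(list(range(1000)), hangs), len(tested))  # 637 10
```

Each probe is expensive here (a kernel build plus a long VM soak test), so halving the range per probe is what makes the bisection practical.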

> Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
> state where forking does not work correctly, breaking most things:
> https://bugzilla.redhat.com/show_bug.cgi?id=2009585
> 
> In subsequent testing, we used newer kernels with lockdep and other
> debug stuff enabled, and managed to capture a hung task with a bunch
> of locks listed, including kvm and qemu processes. But I can't parse
> it.
> 
> 5.15-rc7
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941
> 5.15+
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939
> 
> If anyone can take a glance at those kernel messages, and/or give
> hints how we can extract more information for debugging, it'd be
> appreciated. Maybe all of that is normal and the actual problem isn't
> in any of these traces.

All the instances of

  (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x77/0x720 [kvm]

are uninteresting and expected; that's just each vCPU task taking its associated
vcpu->mutex, likely for KVM_RUN.

At a glance, the XFS stuff looks far more interesting/suspect.
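
When reading a long lockdep dump, the triage above can be applied mechanically: drop the expected per-vCPU `vcpu->mutex` entries so the remaining held locks (such as the XFS ones) stand out. This is a Python sketch under assumptions. The regular expression is based on the `(name){usage}-{n:n}, at: symbol+off/size [module]` shape of the line quoted above, with an optional `#N:` index and lock address in front, as lockdep's held-lock listing typically prints. The second sample line and its names are hypothetical, made up for illustration and not taken from the attachments.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# Assumed shape of one held-lock line in a lockdep report.
LOCK_RE = re.compile(
    r"(?:#(?P<idx>\d+):\s+)?"            # optional "#N: " index
    r"(?:[0-9a-f]+\s+)?"                 # optional lock instance address
    r"\((?P<name>.+?)\)"                 # lock class name, e.g. &vcpu->mutex
    r"\{(?P<usage>[^}]*)\}"              # usage-state characters, e.g. +.+.
    r"-\{(?P<wait>\d+:\d+)\}"            # lockdep wait-type annotation
    r", at: (?P<site>\S+)"               # acquisition site, symbol+off/size
    r"(?:\s+\[(?P<module>[^\]]+)\])?"    # optional [module]
)


@dataclass
class HeldLock:
    name: str
    usage: str
    wait: str
    site: str
    module: Optional[str]


def parse_held_locks(text: str) -> List[HeldLock]:
    """Extract every line that looks like a lockdep held-lock entry."""
    locks = []
    for line in text.splitlines():
        m = LOCK_RE.search(line)
        if m:
            locks.append(HeldLock(m["name"], m["usage"], m["wait"],
                                  m["site"], m["module"]))
    return locks


# Taken by each vCPU task in kvm_vcpu_ioctl(); expected, so treated as noise.
EXPECTED: Set[str] = {"&vcpu->mutex"}


def interesting(locks: Iterable[HeldLock],
                expected: Set[str] = EXPECTED) -> List[HeldLock]:
    """Keep only locks whose class is not in the expected set."""
    return [lock for lock in locks if lock.name not in expected]


if __name__ == "__main__":
    sample = (
        " #0: (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x77/0x720 [kvm]\n"
        # Hypothetical line, for illustration only:
        " #0: (&hypothetical_xfs_lock){++++}-{3:3}, at: hypothetical_fn+0x10/0x40 [xfs]\n"
    )
    for lock in interesting(parse_held_locks(sample)):
        print(lock.name, lock.site, lock.module)
```

Filtering by lock class name rather than by call site keeps the rule close to the reasoning in the reply: it is the vcpu->mutex class itself that every vCPU task is expected to hold.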

Current Thread

dozens of qemu/kvm VMs getting into stuck states since kernel ~5.13, Chris Murphy, 2021/12/07
- Re: dozens of qemu/kvm VMs getting into stuck states since kernel ~5.13, Sean Christopherson <=
  - Re: dozens of qemu/kvm VMs getting into stuck states since kernel ~5.13, Chris Murphy, 2021/12/08