Re: [PATCH 0/4] target/arm: Improvement on memory error handling

From:	Gavin Shan
Subject:	Re: [PATCH 0/4] target/arm: Improvement on memory error handling
Date:	Mon, 17 Feb 2025 13:58:26 +1000
User-agent:	Mozilla Thunderbird

On 2/14/25 10:59 PM, Mauro Carvalho Chehab wrote:

On Fri, 14 Feb 2025 14:16:31 +1000
Gavin Shan <gshan@redhat.com> wrote:

Currently, there is only one CPER buffer (entry), meaning only one
memory error can be reported at a time. In an extreme case, multiple
memory errors can be raised on different vCPUs. For example, a single
memory error on a 64KB page of the host can result in 16 memory errors
on 4KB pages of the guest.
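
The 64KB/4KB arithmetic above can be written out as a small helper. This is only an illustrative sketch: guest_pages_per_host_page() is a hypothetical name and not part of QEMU, and it assumes the host page size is an exact multiple of the guest page size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a QEMU function): number of guest pages
 * backed by one host page, i.e. how many guest-visible memory errors
 * a single poisoned host page could turn into.
 */
static size_t guest_pages_per_host_page(size_t host_page_size,
                                        size_t guest_page_size)
{
    /* Assumption: the host page is an exact multiple of the guest page. */
    assert(guest_page_size != 0);
    assert(host_page_size % guest_page_size == 0);

    return host_page_size / guest_page_size;
}
```

With a 64KB host page and 4KB guest pages, the helper returns 16, which matches the figure quoted above.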


There is already a patchset that allows multiple CPER entries. It has
been floating around since last year:

https://lore.kernel.org/qemu-devel/cover.1738345063.git.mchehab+huawei@kernel.org/

I guess it is almost ready to be merged and needs just some nitpick
changes to satisfy the ACPI maintainers. That changeset already adds
a second CPER entry for GED, and makes it easy to add more as needed.


Thanks for the link, Mauro. As I explained to Jonathan, the bottleneck
isn't the number of CPER entries (single or multiple). The bottleneck
is actually the acknowledgment mechanism. With this mechanism, a single
CPER buffer, which could contain multiple entries, is delivered and
acknowledged at once. Unless I'm missing something, I don't see that
your series changes anything in this regard.
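
To make the point about the acknowledgment bottleneck concrete, here is a minimal model of one buffer guarded by an acknowledgment flag. It is a sketch under assumptions, not the QEMU code: struct cper_slot, cper_slot_init(), cper_slot_push() and cper_slot_guest_ack() are hypothetical names, and the buffer size is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CPER_SLOT_SIZE 256   /* arbitrary size chosen for this sketch */

/* Hypothetical model: one error buffer plus its acknowledgment state. */
struct cper_slot {
    unsigned char data[CPER_SLOT_SIZE];
    size_t len;
    bool acked;              /* true when no record is outstanding */
};

static void cper_slot_init(struct cper_slot *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof(*s));
    s->acked = true;         /* nothing has been delivered yet */
}

/*
 * Try to publish a record. Returns 0 on success, -1 if the previous
 * record has not been acknowledged yet or the record does not fit.
 */
static int cper_slot_push(struct cper_slot *s, const void *rec, size_t len)
{
    assert(s != NULL && rec != NULL);

    if (!s->acked || len > CPER_SLOT_SIZE) {
        return -1;
    }
    memcpy(s->data, rec, len);
    s->len = len;
    s->acked = false;        /* wait for the guest to consume it */
    return 0;
}

/* Called in the model when the guest has consumed the record. */
static void cper_slot_guest_ack(struct cper_slot *s)
{
    assert(s != NULL);
    s->acked = true;
}
```

In this model a second cper_slot_push() fails until cper_slot_guest_ack() has run, no matter how many entries the buffer could hold. That is the sense in which the number of entries is not the limiting factor.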

Unfortunately, the virtual machine is simply aborted when multiple
memory errors occur concurrently, as the following call trace shows.
A SEA exception is injected into the guest so that the CPER buffer can
be claimed if the error is successfully pushed by acpi_ghes_memory_errors().
Otherwise, abort() is triggered to crash the virtual machine.

   kvm_vcpu_thread_fn
     kvm_cpu_exec
       kvm_arch_on_sigbus_vcpu
         kvm_cpu_synchronize_state
         acpi_ghes_memory_errors         (a)
         kvm_inject_arm_sea | abort

It's arguable whether crashing the virtual machine is the right thing
to do in this case. The better behaviour would be to retry pushing the
memory errors, to keep the virtual machine alive so that the
administrator has a chance to step in, for example to dump the
important data with luck. This series adds one more parameter to
acpi_ghes_memory_errors() so that pushing the memory error is retried
until it succeeds.
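
The retry idea can be sketched as follows. This is not the code from the series: push_fn, push_with_retry(), struct fake_source and fake_push() are hypothetical names, and the sketch uses a bounded number of attempts as an assumption, whereas the cover letter describes retrying until the push succeeds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical push callback: returns 0 on success, non-zero if busy. */
typedef int (*push_fn)(void *opaque);

/*
 * Retry the push up to max_attempts times. Returns the number of
 * attempts used on success, or -1 if every attempt failed.
 */
static int push_with_retry(push_fn push, void *opaque, int max_attempts)
{
    assert(push != NULL);

    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (push(opaque) == 0) {
            return attempt;
        }
        /*
         * A real implementation would have to let the guest make
         * progress here so that it can acknowledge the previous record.
         */
    }
    return -1;
}

/* Stand-in error source that stays busy for a fixed number of pushes. */
struct fake_source {
    int busy_rounds;
};

static int fake_push(void *opaque)
{
    struct fake_source *src = opaque;

    if (src->busy_rounds > 0) {
        src->busy_rounds--;
        return -1;           /* previous record not acknowledged yet */
    }
    return 0;
}
```

The caller would inject the SEA only when push_with_retry() reports success. What to do when the bound is exhausted (for example, falling back to abort()) is a policy choice that this sketch leaves open.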


Having a retry buffer might be interesting for some types of errors,
like error-injected and corrected errors. Yet, it doesn't sound right
to buffer uncorrected errors that would affect the virtual machine.


The question is how an uncorrected error can be delivered while a
previous corrected error is still being delivered and has not been
acknowledged yet. With the acknowledgement mechanism, all errors are
equal in priority when they're delivered, correct?

Thanks,
Gavin
