Re: Bug#1037264: cksum crashes intermittently with "Illegal instruction"
From: Kristoffer Brånemyr
Subject: Re: Bug#1037264: cksum crashes intermittently with "Illegal instruction" on some Xen DomU
Date: Mon, 12 Jun 2023 16:25:28 +0000 (UTC)
I guess it doesn't hurt to also check for the SSE variants in the function
that tests whether pclmul is supported.
But I think it's a bit suspicious that it only crashes sometimes. If a
particular instruction were causing this, shouldn't it happen every time?
Could it be something else, like an unaligned address read/write?
I guess ILL_ILLOPN might refer to the argument of an instruction (i.e. possibly
an address?).
Can you reproduce the problem running cksum in gdb? Then you could disassemble
the location where it crashes and possibly see a bit better what causes the
issue. Also dump the values of the hardware registers, and variables if you can.
--
/Kristoffer Brånemyr
On Monday, 12 June 2023 at 15:03:11 CEST, Philip Rowlands
<coreutils@dimebar.com> wrote:
On Sat, 10 Jun 2023, at 11:09, Pádraig Brady wrote:
> cksum since v9.0 checks at runtime whether pclmul is supported.
> It seems that check is not working appropriately on a Xen DomU.
Hypervisors routinely lie about CPUID feature flags, in order to maintain
compatibility between a fleet of diverse servers. It's possible in this case
that the system was misconfigured to present flags which the underlying CPU
doesn't support.
> The routine in question is pclmul_supported() at:
> https://github.com/coreutils/coreutils/blob/b841f111/src/cksum.c#L160-L191
>
> That either suggests xen is incorrectly setting PCLMUL and AVX bits,
> or perhaps these two bits are not sufficient.
> Hmm I wonder do we also need to explicitly check for SSSE3 support?
Intel says to check for SSE and SSE2; quoting the manual:
===
11.6.2 Checking for Intel® SSE and SSE2 Support
Before an application attempts to use Intel SSE and/or Intel SSE2, it should
check that they are present on the processor:
1. Check that the processor supports the CPUID instruction. Bit 21 of the
EFLAGS register can be used to check processor’s support the CPUID
instruction.
2. Check that the processor supports Intel SSE and/or SSE2 (true if
CPUID.01H:EDX.SSE[bit 25] = 1 and/or CPUID.01H:EDX.SSE2[bit 26] = 1).
12.13.4 Checking for Intel® AES-NI Support
Before an application attempts to use AESNI instructions or PCLMULQDQ, the
application should follow the steps illustrated in Section 11.6.2, “Checking
for Intel® SSE and SSE2 Support.” Next, use the additional step provided below:
Check that the processor supports Intel AES-NI (if CPUID.01H:ECX.AESNI[bit 25]
= 1); check that the processor supports PCLMULQDQ (if
CPUID.01H:ECX.PCLMULQDQ[bit 1] = 1).
===
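For illustration, the bit tests quoted above could be sketched in C roughly as
follows. This is not the code in cksum.c; the helper names
leaf1_allows_pclmul() and cpuid_reports_pclmul() are made up for this example,
and it assumes a GCC or Clang toolchain that ships <cpuid.h> with
__get_cpuid(). On non-x86 targets the sketch simply reports "not supported".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header providing __get_cpuid() */
#endif

/* Bit positions in CPUID leaf 01H, as given in the manual excerpt. */
#define EDX_SSE       (UINT32_C(1) << 25)
#define EDX_SSE2      (UINT32_C(1) << 26)
#define ECX_PCLMULQDQ (UINT32_C(1) << 1)

/* Hypothetical helper: pure bit test on already-fetched register values.
   Requires SSE, SSE2 and PCLMULQDQ to all be advertised.  */
static bool
leaf1_allows_pclmul (uint32_t edx, uint32_t ecx)
{
  return (edx & EDX_SSE) && (edx & EDX_SSE2) && (ecx & ECX_PCLMULQDQ);
}

/* Hypothetical helper: query CPUID leaf 1 and apply the test above.
   __get_cpuid() returns 0 when the requested leaf is unavailable.  */
static bool
cpuid_reports_pclmul (void)
{
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
    return false;
  return leaf1_allows_pclmul (edx, ecx);
#else
  return false;   /* no CPUID on this architecture */
#endif
}
```

Note that a check like this only reports what CPUID advertises, so it cannot
detect a hypervisor that presents flags the underlying CPU lacks.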
Wikipedia mentions an AVX-512 version (VPCLMULQDQ) but I don't think we're
using that.
I can't find the equivalent AMD docs. Is there a library / macro check for
this, to avoid the low-level bit inspection?
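One possible answer, as a sketch only: GCC (and Clang) provide the
__builtin_cpu_supports() builtin on x86, which takes a feature name such as
"sse", "sse2" or "pclmul" and hides the bit positions. The wrapper name
builtin_reports_pclmul() below is made up for this example, and whether this
matches what cksum.c does is not something this sketch asserts.

```c
#include <stdbool.h>

/* Hypothetical wrapper around the compiler builtins.  Falls back to
   "not supported" when the builtins are unavailable.  */
static bool
builtin_reports_pclmul (void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
  /* Initialise the compiler runtime's CPU feature data before querying.  */
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("sse")
         && __builtin_cpu_supports ("sse2")
         && __builtin_cpu_supports ("pclmul");
#else
  return false;
#endif
}
```

The builtin still derives its answer from CPUID, so it would be misled by
incorrect hypervisor flags in the same way as a manual bit test.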
It would be useful to see the output of "cpuid -1" which does a verbose decode
of all CPUID flags, on the system which sees the SIGILL. (How can it be
intermittent??)
Interesting that the strace output finishes with:
read(0, "", 61440) = 0
--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x55bec9cc6cf5} ---
+++ killed by SIGILL +++
i.e. ILL_ILLOPN (operand) rather than ILL_ILLOPC (opcode). What could cause
this?
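To see the si_code outside of strace, a small SIGILL handler can decode it. The
sketch below maps the POSIX ILL_* codes to their names and prints the one
delivered with the signal; sigill_code_name(), on_sigill() and
install_sigill_reporter() are names invented for this example, and it says
nothing about why the kernel chose ILL_ILLOPN in this report.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: name of a SIGILL si_code value (POSIX ILL_* codes). */
static const char *
sigill_code_name (int si_code)
{
  switch (si_code)
    {
    case ILL_ILLOPC: return "ILL_ILLOPC";   /* illegal opcode */
    case ILL_ILLOPN: return "ILL_ILLOPN";   /* illegal operand */
    case ILL_ILLADR: return "ILL_ILLADR";   /* illegal addressing mode */
    case ILL_ILLTRP: return "ILL_ILLTRP";   /* illegal trap */
    case ILL_PRVOPC: return "ILL_PRVOPC";   /* privileged opcode */
    case ILL_PRVREG: return "ILL_PRVREG";   /* privileged register */
    case ILL_COPROC: return "ILL_COPROC";   /* coprocessor error */
    case ILL_BADSTK: return "ILL_BADSTK";   /* internal stack error */
    default:         return "unknown";
    }
}

/* Handler: write the code name to stderr, then exit.  */
static void
on_sigill (int signo, siginfo_t *info, void *context)
{
  (void) signo;
  (void) context;
  const char *name = sigill_code_name (info->si_code);
  write (STDERR_FILENO, name, strlen (name));
  write (STDERR_FILENO, "\n", 1);
  _exit (1);
}

/* Hypothetical helper: install the handler; returns 0 on success.  */
static int
install_sigill_reporter (void)
{
  struct sigaction sa;
  memset (&sa, 0, sizeof sa);
  sa.sa_sigaction = on_sigill;
  sa.sa_flags = SA_SIGINFO;       /* request the siginfo_t argument */
  sigemptyset (&sa.sa_mask);
  return sigaction (SIGILL, &sa, NULL);
}
```

Calling install_sigill_reporter() early in a test program would make it report
the code itself when the fault occurs.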
Cheers,
Phil