From: Stefan Berger
Subject: Re: intermittent hang, s390x host, bios-tables-test test, TPM
Date: Fri, 6 Jan 2023 10:58:38 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.6.0

On 1/6/23 10:39, Peter Maydell wrote:
> On Fri, 6 Jan 2023 at 15:16, Stefan Berger <stefanb@linux.ibm.com> wrote:
>> On 1/6/23 07:10, Peter Maydell wrote:
>>> I'm seeing an intermittent hang on the s390 CI runner in the
>>> bios-tables-test test. It looks like we've deadlocked because:
>>>
>>> Thread 3 (Thread 0x3ff8dafe900 (LWP 2661316)):
>>> #0 0x000003ff8e9c6002 in __GI___wait4 (pid=<optimized out>, stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=<optimized out>, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
>>> #1 0x000003ff8e9c5f72 in __GI___waitpid (pid=<optimized out>, stat_loc=stat_loc@entry=0x2aa0b42c9bc, options=options@entry=0) at waitpid.c:38
>>> #2 0x000002aa0952a516 in qtest_wait_qemu (s=0x2aa0b42c9b0) at ../tests/qtest/libqtest.c:206
>>> #3 0x000002aa0952a58a in qtest_kill_qemu (s=0x2aa0b42c9b0) at ../tests/qtest/libqtest.c:229
>>> #4 0x000003ff8f0c288e in g_hook_list_invoke () from /lib/s390x-linux-gnu/libglib-2.0.so.0
>>> #5 <signal handler called>
>>> #6 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>>> #7 0x000003ff8e9240a2 in __GI_abort () at abort.c:79
>>> #8 0x000003ff8f0feda8 in g_assertion_message () from /lib/s390x-linux-gnu/libglib-2.0.so.0
>>> #9 0x000003ff8f0fedfe in g_assertion_message_expr () from /lib/s390x-linux-gnu/libglib-2.0.so.0
>>> #10 0x000002aa09522904 in tpm_emu_ctrl_thread (data=0x3fff5ffa160) at ../tests/qtest/tpm-emu.c:189
>>
>> This here seems to be the root cause. An unknown control channel
>> command was received from the TPM emulator backend by the control
>> channel thread and we end up in g_assert_not_reached().
>
> Yeah. It would be good if we didn't deadlock without printing
> the assertion, though...
>
> I guess we could improve qtest_kill_qemu() so it doesn't wait
> indefinitely for QEMU to exit but instead sends a SIGKILL 20
> seconds after the SIGTERM. (Annoyingly, there is no convenient
> "waitpid but with a timeout" function...)

Yes, wait5(&to,...) doesn't exist, yet. I guess one would have to use a loop sending signal 0 to the pid for 20 seconds?

Though I'd really like to know where that data race is coming from and why we get an unknown command. I am now running this on a ppc64 and x86_64 host over the weekend to see what happens. All good so far.

   Stefan

thanks -- PMM
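
For illustration, the "SIGTERM, wait up to N seconds, then SIGKILL" idea discussed above could look roughly like the sketch below. This is not the libqtest code; wait_child_timeout() and terminate_child() are hypothetical names, and the 10 ms poll step is an arbitrary assumption. It polls waitpid() with WNOHANG rather than sending signal 0, since kill(pid, 0) keeps succeeding for an exited child until that child has been reaped.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/*
 * Hypothetical helper (not part of libqtest): poll for the child's exit
 * for at most timeout_ms milliseconds. Returns true and fills *wstatus
 * if the child was reaped, false on timeout or on a waitpid() error.
 */
static bool wait_child_timeout(pid_t pid, int *wstatus, int timeout_ms)
{
    const struct timespec step = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
    int waited_ms = 0;

    for (;;) {
        pid_t r = waitpid(pid, wstatus, WNOHANG);

        if (r == pid) {
            return true;              /* child exited and was reaped */
        }
        if (r == -1 && errno != EINTR) {
            return false;             /* e.g. ECHILD: nothing to wait for */
        }
        if (waited_ms >= timeout_ms) {
            return false;             /* still running after the timeout */
        }
        nanosleep(&step, NULL);       /* assumed 10 ms poll interval */
        waited_ms += 10;
    }
}

/*
 * Hypothetical helper: ask the child to exit with SIGTERM and escalate
 * to SIGKILL if it has not exited within grace_ms milliseconds.
 * Returns the wait status of the child.
 */
static int terminate_child(pid_t pid, int grace_ms)
{
    int wstatus = 0;

    kill(pid, SIGTERM);
    if (!wait_child_timeout(pid, &wstatus, grace_ms)) {
        kill(pid, SIGKILL);
        /* SIGKILL cannot be caught or ignored, so a blocking wait is fine. */
        while (waitpid(pid, &wstatus, 0) == -1 && errno == EINTR) {
            /* retry if interrupted by a signal */
        }
    }
    return wstatus;
}
```

With something along these lines, a caller would use terminate_child(pid, 20 * 1000) to get the 20 second grace period mentioned above. Whether polling is acceptable in the abort path of the test harness is a separate question.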