qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: [PATCH 4/7] colo: fixed 'Segmentation fault' when the simplex mode P


From: Rao, Lei
Subject: RE: [PATCH 4/7] colo: fixed 'Segmentation fault' when the simplex mode PVM poweroff
Date: Mon, 5 Jul 2021 09:06:48 +0000

I can reproduce this bug successfully and the GDB stack is as follows:
Program terminated with signal SIGSEGV, Segmentation fault.
#0  object_class_dynamic_cast (class=0x55c8f5d2bf50, typename=0x55c8f2f7379e 
"qio-channel") at qom/object.c:832
832         if (type->class->interfaces &&
[Current thread is 1 (Thread 0x7f756e97eb00 (LWP 1811577))]
(gdb) bt
#0  object_class_dynamic_cast (class=0x55c8f5d2bf50, typename=0x55c8f2f7379e 
"qio-channel") at qom/object.c:832
#1  0x000055c8f2c3dd14 in object_dynamic_cast (obj=0x55c8f543ac00, 
typename=0x55c8f2f7379e "qio-channel") at qom/object.c:763
#2  0x000055c8f2c3ddce in object_dynamic_cast_assert (obj=0x55c8f543ac00, 
typename=0x55c8f2f7379e "qio-channel",
    file=0x55c8f2f73780 "migration/qemu-file-channel.c", line=117, 
func=0x55c8f2f73800 <__func__.18724> "channel_shutdown") at qom/object.c:786
#3  0x000055c8f2bbc6ac in channel_shutdown (opaque=0x55c8f543ac00, rd=true, 
wr=true, errp=0x0) at migration/qemu-file-channel.c:117
#4  0x000055c8f2bba56e in qemu_file_shutdown (f=0x7f7558070f50) at 
migration/qemu-file.c:67
#5  0x000055c8f2ba5373 in migrate_fd_cancel (s=0x55c8f4ccf3f0) at 
migration/migration.c:1699
#6  0x000055c8f2ba1992 in migration_shutdown () at migration/migration.c:187
#7  0x000055c8f29a5b77 in main (argc=69, argv=0x7fff3e9e8c08, 
envp=0x7fff3e9e8e38) at vl.c:4512

>From the call trace we can see that the reason for core dump is due to 
>QIOChannel *ioc = QIO_CHANNEL(opaque) in the channel_shutdown();
After analysis, I think what you said is right, Shutting down a file multiple 
times is safe and not a bug in it self. The reason for the segmentation fault 
Is that after doing failover, the COLO thread will 
qemu_fclose(s->rp_state.from_dst_file) in colo_process_checkpoint();
if we power off the guest at the same time. This will cause qemu_file_shutdown 
after qemu_close(from_dst_file). As a result, the qemu will crash.
So, I think the better way is as follows:
  /*
     * Must be called after failover BH is completed,
     * Or the failover BH may shutdown the wrong fd that
     * re-used by other threads after we release here.
     */
     if (s->rp_state.from_dst_file) {
            qemu_fclose(s->rp_state.from_dst_file);
+          s->rp_state.from_dst_file = NULL;
      }
After qemu_close(s->rp_state.from_dst_file), we set the from_dst_file = NULL;

Ways to reproduce bugs:
You can kill SVM, after executing the failover QMP command, immediately execute 
the power off command in the guest, which will have a high probability to 
reproduce this bug.

Thanks,
Lei.


-----Original Message-----
From: Rao, Lei 
Sent: Monday, July 5, 2021 10:54 AM
To: Lukas Straub <lukasstraub2@web.de>
Cc: Zhang, Chen <chen.zhang@intel.com>; lizhijian@cn.fujitsu.com; 
jasowang@redhat.com; zhang.zhanghailiang@huawei.com; quintela@redhat.com; 
dgilbert@redhat.com; like.xu.linux@gmail.com; qemu-devel@nongnu.org; Like Xu 
<like.xu@linux.intel.com>
Subject: RE: [PATCH 4/7] colo: fixed 'Segmentation fault' when the simplex mode 
PVM poweroff

It's been a long time since this bug, I will reproduce it to get the GDB stack, 
but it may take some time.

Thanks,
Lei.

-----Original Message-----
From: Lukas Straub <lukasstraub2@web.de>
Sent: Sunday, July 4, 2021 4:36 AM
To: Rao, Lei <lei.rao@intel.com>
Cc: Zhang, Chen <chen.zhang@intel.com>; lizhijian@cn.fujitsu.com; 
jasowang@redhat.com; zhang.zhanghailiang@huawei.com; quintela@redhat.com; 
dgilbert@redhat.com; like.xu.linux@gmail.com; qemu-devel@nongnu.org; Like Xu 
<like.xu@linux.intel.com>
Subject: Re: [PATCH 4/7] colo: fixed 'Segmentation fault' when the simplex mode 
PVM poweroff

On Thu, 17 Jun 2021 10:47:12 +0800
Lei Rao <lei.rao@intel.com> wrote:

> From: "Rao, Lei" <lei.rao@intel.com>
> 
> When a PVM completed its SVM failover steps and begins to run in the 
> simplex mode, QEMU would encounter a 'Segmentation fault' if the guest 
> poweroff with the following calltrace:
> 
> Program received signal SIGSEGV, Segmentation fault.
> 
> This is because primary_vm_do_failover() would call 
> "qemu_file_shutdown (s->rp_state.from_dst_file);" and later the
> migration_shutdown() would do it again. So, we should set the 
> s->rp_state.from_dst_file to NULL.

Hello,
Please provide a backtrace to such bugfixes. It helps the reviewers to better 
understand the bug and the fix and it helps yourself to check if it is actually 
correct. I suggest you to enable coredumps in your test (or even production) 
system, so for every crash you definitely have a backtrace.

> Signed-off-by: Like Xu <like.xu@linux.intel.com>
> Signed-off-by: Lei Rao <lei.rao@intel.com>
> ---
>  migration/colo.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/colo.c b/migration/colo.c index
> 616dc00..c25e488 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -156,14 +156,15 @@ static void primary_vm_do_failover(void)
>  
>      /*
>       * Wake up COLO thread which may blocked in recv() or send(),
> -     * The s->rp_state.from_dst_file and s->to_dst_file may use the
> -     * same fd, but we still shutdown the fd for twice, it is harmless.
> +     * The s->to_dst_file may use the same fd, but we still shutdown
> +     * the fd for twice, it is harmless.
>       */

This change to the comment is incorrect. Shutting down a file multiple times is 
safe and not a bug in it self. If it leads to a crash anyway, that points to a 
bug in the qemu_file_shutdown() function or similar.

>      if (s->to_dst_file) {
>          qemu_file_shutdown(s->to_dst_file);
>      }
>      if (s->rp_state.from_dst_file) {
>          qemu_file_shutdown(s->rp_state.from_dst_file);
> +        s->rp_state.from_dst_file = NULL;
>      }

This is incorrect, as the file won't be closed after this and is leaked. And 
indeed, when applying the patch to master, qemu crashes when triggering 
failover on the primary, due to the leak:

qemu-system-x86_64: ../util/yank.c:107: yank_unregister_instance: Assertion 
`QLIST_EMPTY(&entry->yankfns)' failed.

>      old_state = failover_set_state(FAILOVER_STATUS_ACTIVE,



-- 




reply via email to

[Prev in Thread] Current Thread [Next in Thread]