[Bug 1923583] [NEW] colo: pvm flush failed after svm killed

qemu-devel

From:	meeho yuen
Subject:	[Bug 1923583] [NEW] colo: pvm flush failed after svm killed
Date:	Tue, 13 Apr 2021 08:30:11 -0000

Public bug reported:

Hi,

After the secondary VM (SVM) is killed, flush requests on the primary
VM (PVM) fail, which leaves the primary guest's filesystem unavailable.

qemu version: 5.2.0
host/guest os: CentOS Linux release 7.6.1810 (Core)

Steps to reproduce:
1. Create a COLO VM following
https://github.com/qemu/qemu/blob/master/docs/COLO-FT.txt
2. Kill the secondary VM (without removing the nbd child from the quorum
node on the primary VM) and wait about a minute. The exact interval
depends on the guest OS.
Result: the primary VM's filesystem shuts down because of a flush cache
error.

After several tests, I found that qemu-5.0.0 works well. The behavior
change was introduced by commit 883833e29cb800b4d92b5d4736252f4004885191
("block: Flush all children in generic code"):
https://git.qemu.org/?p=qemu.git;a=commit;h=883833e29cb800b4d92b5d4736252f4004885191
Both virtio-blk and ide are affected.

I think the failed flush on the nbd (replication) child causes
bdrv_co_flush(quorum_bs) to fail. Here is the call stack:
#0  bdrv_co_flush (bs=0x56242b3cc0b0=nbd_bs) at ../block/io.c:2856
#1  0x0000562428b0f399 in bdrv_co_flush (bs=0x56242b3c7e00=replication_bs) at ../block/io.c:2920
#2  0x0000562428b0f399 in bdrv_co_flush (bs=0x56242a4ad800=quorum_bs) at ../block/io.c:2920
#3  0x0000562428b70d56 in blk_do_flush (blk=0x56242a4ad4a0) at ../block/block-backend.c:1672
#4  0x0000562428b70d87 in blk_aio_flush_entry (opaque=0x7fd0980073f0) at ../block/block-backend.c:1680
#5  0x0000562428c5f9a7 in coroutine_trampoline (i0=-1409269904, i1=32721) at ../util/coroutine-ucontext.c:173

I am not sure whether I am using COLO improperly. Can we assume that
the nbd child of the quorum node is removed immediately after the SVM
crashes? Or is this really a bug? Would the following patch fix it?
Any help is appreciated. Thanks a lot!

diff --git a/block/quorum.c b/block/quorum.c
index cfc1436..f2c0805 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -1279,7 +1279,7 @@ static BlockDriver bdrv_quorum = {
     .bdrv_dirname                       = quorum_dirname,
     .bdrv_co_block_status               = quorum_co_block_status,
 
-    .bdrv_co_flush_to_disk              = quorum_co_flush,
+    .bdrv_co_flush                      = quorum_co_flush,
 
     .bdrv_getlength                     = quorum_getlength,
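
To illustrate why switching the callback could matter, here is a minimal
toy model in C. It is not QEMU code: Node, generic_flush,
quorum_like_flush and flush_colo_topology are hypothetical names. The
model assumes that a full flush override ends the generic flush path,
while a flush-to-disk style callback is followed by a generic flush of
every child, and that the quorum-style flush succeeds once at least
`threshold` children succeed. It also assumes a vote threshold of 1, as
in the COLO-FT.txt example. These are simplifications that would need to
be checked against block/io.c and block/quorum.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_CHILDREN 4

typedef struct Node Node;
struct Node {
    int (*co_flush)(Node *n);         /* full override: generic code stops here */
    int (*co_flush_to_disk)(Node *n); /* driver step: children are flushed afterwards */
    Node *children[MAX_CHILDREN];
    size_t num_children;
    size_t threshold;                 /* only used by the quorum-like node */
};

/* Toy stand-in for the generic flush path (assumed behavior, not QEMU code). */
static int generic_flush(Node *n)
{
    int ret = 0;

    assert(n != NULL && n->num_children <= MAX_CHILDREN);
    if (n->co_flush) {
        return n->co_flush(n);        /* override: children are not visited here */
    }
    if (n->co_flush_to_disk) {
        ret = n->co_flush_to_disk(n);
        if (ret < 0) {
            return ret;
        }
    }
    /* Flush every child and keep the first error that shows up. */
    for (size_t i = 0; i < n->num_children; i++) {
        int child_ret = generic_flush(n->children[i]);
        if (ret == 0) {
            ret = child_ret;
        }
    }
    return ret;
}

static int flush_ok(Node *n)  { (void)n; return 0; }
static int flush_eio(Node *n) { (void)n; return -EIO; } /* peer is gone */

/* Quorum-style flush: tolerate failures while enough children succeed. */
static int quorum_like_flush(Node *n)
{
    size_t ok = 0;
    int last_err = 0;

    for (size_t i = 0; i < n->num_children; i++) {
        int r = generic_flush(n->children[i]);
        if (r == 0) {
            ok++;
        } else {
            last_err = r;
        }
    }
    return ok >= n->threshold ? 0 : last_err;
}

/* Build quorum -> { local disk, replication -> nbd (failing) } and flush it. */
int flush_colo_topology(int full_override)
{
    Node local = { .co_flush_to_disk = flush_ok };
    Node nbd = { .co_flush_to_disk = flush_eio };
    Node replication = { .children = { &nbd }, .num_children = 1 };
    Node quorum = { .children = { &local, &replication },
                    .num_children = 2, .threshold = 1 };

    if (full_override) {
        quorum.co_flush = quorum_like_flush;         /* like the patched table */
    } else {
        quorum.co_flush_to_disk = quorum_like_flush; /* like the current table */
    }
    return generic_flush(&quorum);
}
```

In this model, flush_colo_topology(0) returns -EIO because the generic
child flush reaches the dead nbd node after the quorum vote has already
passed, while flush_colo_topology(1) returns 0 because only the quorum
vote decides the result.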

** Affects: qemu
     Importance: Undecided
         Status: New

** Patch added: "primary guest kernel message"
   
https://bugs.launchpad.net/bugs/1923583/+attachment/5487235/+files/primary_guest_dmesg.log

-- 
You received this bug notification because you are a member of
qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1923583
