RE: The issues about architecture of the COLO checkpoint

From: Zhanghailiang
Subject: RE: The issues about architecture of the COLO checkpoint
Date: Wed, 12 Feb 2020 03:18:03 +0000


Thank you Dave,

I'll reply here directly.

-----Original Message-----
From: Dr. David Alan Gilbert [mailto:address@hidden] 
Sent: Wednesday, February 12, 2020 1:48 AM
To: Daniel Cho <address@hidden>; address@hidden; Zhanghailiang <address@hidden>
Cc: address@hidden
Subject: Re: The issues about architecture of the COLO checkpoint

cc'ing in COLO people:

* Daniel Cho (address@hidden) wrote:
> Hi everyone,
>      We have some issues with setting up the COLO feature. We hope
> somebody can give us some advice.
> Issue 1:
>      We enable the COLO feature dynamically for a PVM (2 cores, 16G
> memory), but the Primary VM pauses for a long time (depending on memory
> size) while waiting for the SVM to start. Is there any way to reduce the
> pause time?

Yes, we do have some ideas to optimize this downtime.

The main problem in the current version is that, for each checkpoint, we have
to send the whole of the PVM's pages to the SVM and then copy the whole VM's
state from the ram cache into the SVM. Both VMs have to be paused during this
process.
Just as you said, the downtime depends on the memory size.
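
As a rough illustration of why the pause grows with memory size, here is a
back-of-the-envelope sketch in Python. Only the 16G figure comes from the
report above; the link speed and the copy rate are made-up assumptions, not
measurements, and estimate_pause_seconds() is a hypothetical helper.

```python
GIB = 1024 ** 3

# Assumed figures, for illustration only (not measured on any real setup).
ASSUMED_LINK_BYTES_PER_S = 10e9 / 8   # a hypothetical 10 Gbit/s link
ASSUMED_COPY_BYTES_PER_S = 5 * GIB    # a hypothetical ram cache -> SVM copy rate


def estimate_pause_seconds(mem_bytes, link_rate, copy_rate):
    """Pause when every page is sent and then copied while both VMs are stopped."""
    send_time = mem_bytes / link_rate   # PVM -> ram cache on the secondary side
    copy_time = mem_bytes / copy_rate   # ram cache -> SVM memory
    return send_time + copy_time


for size_gib in (4, 16, 64):
    pause = estimate_pause_seconds(
        size_gib * GIB, ASSUMED_LINK_BYTES_PER_S, ASSUMED_COPY_BYTES_PER_S
    )
    print(f"{size_gib:3d} GiB of guest memory: about {pause:.1f} s paused")
```

Both terms are linear in the memory size, so under this model the pause can
only shrink if fewer pages are sent and copied while the VMs are stopped.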

So first, we need to reduce the amount of data sent during a checkpoint. We
can migrate part of the PVM's dirty pages in the background while both VMs are
running, and load these pages temporarily into the ram cache (backup memory)
on the SVM side. At checkpoint time, we just send the PVM's last dirty pages
to the secondary side and then copy the ram cache into the SVM. Furthermore,
we don't have to send the whole of the PVM's dirty pages; we can send only the
pages that were dirtied by the PVM or the SVM between two checkpoints. (If a
page is dirtied by neither the PVM nor the SVM, its data stays the same in the
SVM, the PVM and the backup memory.) This method reduces the time spent
sending data.
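
The idea can be sketched as a toy model in Python. This is not the attached
patch and not QEMU code: pages are plain dictionary entries, dirty tracking is
a set of page numbers, and all names are hypothetical. One simplifying
assumption: a page dirtied only by the SVM is restored from the ram cache
rather than sent again, which may differ from what the patch does.

```python
def simulate_checkpoint_interval(num_pages=1000):
    """Toy model: page number -> contents, with explicit dirty-page sets."""
    pvm = {page: 0 for page in range(num_pages)}
    svm = dict(pvm)
    ram_cache = dict(pvm)            # backup memory on the secondary side
    pvm_dirty, svm_dirty = set(), set()
    cache_newer = set()              # pages where the ram cache is ahead of the SVM

    def pvm_write(page, value):
        pvm[page] = value
        pvm_dirty.add(page)

    def svm_write(page, value):
        svm[page] = value
        svm_dirty.add(page)

    def send_dirty_pages():
        """Send the PVM's currently dirty pages into the ram cache."""
        sent = set(pvm_dirty)
        for page in sent:
            ram_cache[page] = pvm[page]
        cache_newer.update(sent)
        pvm_dirty.clear()
        return len(sent)

    # Both VMs run and dirty overlapping ranges of pages.
    for page in range(0, 100):
        pvm_write(page, 1)
    for page in range(50, 120):
        svm_write(page, 2)

    # Background pass: both VMs keep running while these pages are sent.
    sent_in_background = send_dirty_pages()

    # The PVM dirties a few more pages before the checkpoint starts.
    for page in range(90, 110):
        pvm_write(page, 3)

    # Checkpoint: both VMs are paused from here on.
    sent_while_paused = send_dirty_pages()
    to_copy = cache_newer | svm_dirty    # pages dirtied by the PVM or the SVM
    for page in to_copy:
        svm[page] = ram_cache[page]

    return {
        "pvm": pvm,
        "svm": svm,
        "ram_cache": ram_cache,
        "sent_in_background": sent_in_background,
        "sent_while_paused": sent_while_paused,
        "copied": len(to_copy),
    }


result = simulate_checkpoint_interval()
print(result["sent_in_background"], result["sent_while_paused"], result["copied"])
```

In this toy run only the pages re-dirtied after the background pass are sent
while the VMs are paused, and the SVM ends up identical to the PVM although
pages untouched by both sides are never sent or copied.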

For the second problem, we can reduce the memory copy in two ways. First, we
don't have to copy all of the pages in the ram cache; we can copy only the
pages that were dirtied by the PVM and the SVM since the last checkpoint.
Second, we can use the userfault missing function to reduce the time spent in
the memory copy. (With the second method, in theory, we can reduce the time
spent in the memory copy to the millisecond level.)

You can find the first optimization in the attachment. It is based on an old
qemu version (qemu-2.6), but it should not be difficult to rebase it onto
master or your version. And please feel free to send the new version to the
community if you want ;)

> Issue 2:
>      In 
> https://github.com/qemu/qemu/blob/master/migration/colo.c#L503,
> could we move start_vm() before Line 488? At the first checkpoint the
> PVM waits for the SVM's reply, which causes the PVM to stop for a while.

No, that makes no sense, because even if the PVM runs first, it still needs to
wait for the network packets from the SVM to compare against before its own
packets can be sent to the client side.
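
A minimal Python sketch of that dependency, with hypothetical names: an output
packet from the PVM is held until the matching packet from the SVM arrives.
Counting a checkpoint request on a mismatch and then releasing the PVM's
packet is a simplification assumed for this sketch, not a description of the
real compare code.

```python
from collections import deque


class OutputComparer:
    """Toy model: hold PVM output until the SVM's matching packet shows up."""

    def __init__(self):
        self.primary_queue = deque()
        self.secondary_queue = deque()
        self.released = []              # packets actually sent to the client
        self.checkpoint_requests = 0

    def from_primary(self, packet):
        self.primary_queue.append(packet)
        self._compare()

    def from_secondary(self, packet):
        self.secondary_queue.append(packet)
        self._compare()

    def _compare(self):
        # Nothing can be released while either queue is empty.
        while self.primary_queue and self.secondary_queue:
            primary = self.primary_queue.popleft()
            secondary = self.secondary_queue.popleft()
            if primary != secondary:
                # Assumed behaviour: a mismatch asks for a new checkpoint.
                self.checkpoint_requests += 1
            self.released.append(primary)


comparer = OutputComparer()
comparer.from_primary(b"reply-1")       # PVM is ahead: the packet is held
held_before_svm = list(comparer.released)
comparer.from_secondary(b"reply-1")     # SVM catches up: the packet is released
```

So starting the PVM earlier would let it produce output sooner, but that
output would still sit in the queue until the SVM is running and has produced
its own packets.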


>      We enable the COLO feature on a running VM, so we hope the running
> VM can continue to serve users.
> Do you have any suggestions for those issues?
> Best regards,
> Daniel Cho
Dr. David Alan Gilbert / address@hidden / Manchester, UK

Attachment: 0001-COLO-Migrate-dirty-pages-during-the-gap-of-checkpoin.patch
Description: 0001-COLO-Migrate-dirty-pages-during-the-gap-of-checkpoin.patch

Attachment: 0001-COLO-Optimize-memory-back-up-process.patch
Description: 0001-COLO-Optimize-memory-back-up-process.patch
