RE: The issues about architecture of the COLO checkpoint

Adding Jason Wang to Cc. He is a networking expert,

in case something goes wrong on the network side.

Thanks

Zhang Chen

From: Zhang, Chen
Sent: Thursday, February 13, 2020 10:10 AM
To: 'Zhanghailiang' <address@hidden>; Daniel Cho <address@hidden>
Cc: Dr. David Alan Gilbert <address@hidden>; address@hidden
Subject: RE: The issues about architecture of the COLO checkpoint

For the issue 2:

COLO needs to use the network packets to confirm that the PVM and SVM are in the same state.

Generally speaking, we can’t send PVM packets without comparing them with the SVM packets.

But to prevent jamming, I think COLO can force a checkpoint and send the PVM packets in this case.
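
To illustrate the force-checkpoint idea, here is a minimal Python sketch. It is a toy model, not QEMU's colo-compare code: the class name, the tick-based `max_wait` limit, and the rule that a mismatch also triggers a checkpoint are all assumptions made for illustration.

```python
from collections import deque


class CompareModel:
    """Toy model of a packet comparer (hypothetical, not QEMU code)."""

    def __init__(self, max_wait):
        self.max_wait = max_wait   # assumed limit, in abstract ticks
        self.pvm = deque()         # (arrival_tick, payload) from the primary
        self.svm = deque()         # payloads from the secondary
        self.released = []         # payloads forwarded to the client
        self.checkpoints = 0       # number of forced checkpoints

    def from_pvm(self, now, payload):
        self.pvm.append((now, payload))

    def from_svm(self, payload):
        self.svm.append(payload)

    def _checkpoint_and_flush(self):
        # After a checkpoint the SVM is a copy of the PVM again, so the
        # queued PVM packets can go out and stale SVM packets are dropped.
        self.checkpoints += 1
        self.released.extend(payload for _, payload in self.pvm)
        self.pvm.clear()
        self.svm.clear()

    def poll(self, now):
        # Normal path: release PVM packets that match the SVM's packets.
        while self.pvm and self.svm:
            if self.pvm[0][1] == self.svm[0]:
                self.released.append(self.pvm.popleft()[1])
                self.svm.popleft()
            else:
                self._checkpoint_and_flush()
                return
        # Anti-jamming path: the oldest PVM packet has waited too long
        # for an SVM packet, so force a checkpoint instead of blocking.
        if self.pvm and now - self.pvm[0][0] >= self.max_wait:
            self._checkpoint_and_flush()
```

In this model a PVM packet is never released uncompared unless a checkpoint has just resynchronized the two VMs, which keeps the rule above intact while bounding how long the PVM's traffic can be held.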

Thanks

Zhang Chen

From: Zhanghailiang <address@hidden>
Sent: Thursday, February 13, 2020 9:45 AM
To: Daniel Cho <address@hidden>
Cc: Dr. David Alan Gilbert <address@hidden>; address@hidden; Zhang, Chen <address@hidden>
Subject: RE: The issues about architecture of the COLO checkpoint

Hi,

1. After walking through the code again: yes, you are right. After the first migration, we keep dirty logging on at the primary side

and only send the PVM's dirty pages to the SVM. The ram cache on the secondary side is always a backup of the PVM, so we don’t have to

re-send the non-dirtied pages.

The reason the first checkpoint takes longer is that we have to back up the whole VM’s RAM into the ram cache; that is colo_init_ram_cache().

It is time-consuming, but I have optimized it in the second patch, “0001-COLO-Optimize-memory-back-up-process.patch”, which you can find in my previous reply.

Besides, regarding the statement in my previous reply, “We can only copy the pages that dirtied by PVM and SVM in last checkpoint.”,

I found that this optimization is already done in the current upstream code.

2. I don’t quite understand this question. For COLO, we always need the network packets of both the PVM and the SVM to compare before sending the packets to the client.

COLO depends on this comparison to decide whether or not the PVM and SVM are in the same state.
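
For point 1, the difference between the first checkpoint and the later ones can be sketched with a small Python model. This is only an illustration under simplifying assumptions: memory is a dict of page number to contents, the dirty sets are passed in by hand, and the function names are hypothetical rather than QEMU's.

```python
def first_checkpoint(pvm):
    """Back up every PVM page into the ram cache, then load the SVM from it.

    The cost grows with total memory size, which is why the first
    checkpoint pauses the longest.
    """
    cache = dict(pvm)
    svm = dict(cache)
    return cache, svm


def checkpoint(pvm, svm, cache, pvm_dirty, svm_dirty):
    """Later checkpoints only touch pages dirtied since the last one."""
    # Only the pages the PVM dirtied are sent and stored in the ram cache.
    for page in pvm_dirty:
        cache[page] = pvm[page]
    # Pages dirtied by either side differ between the SVM and the cache,
    # so only those are copied from the cache into the SVM.
    to_restore = set(pvm_dirty) | set(svm_dirty)
    for page in to_restore:
        svm[page] = cache[page]
    return len(pvm_dirty), len(to_restore)


# Usage: four pages, one dirtied on each side between checkpoints.
pvm = {0: "a", 1: "b", 2: "c", 3: "d"}
cache, svm = first_checkpoint(pvm)
pvm[1] = "B"   # the PVM dirties page 1
svm[2] = "x"   # the SVM diverges on page 2
sent, copied = checkpoint(pvm, svm, cache, {1}, {2})
```

After the second call the SVM matches the PVM again, although only one page was sent and two were copied, instead of all four.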

Thanks,

hailiang

From: Daniel Cho [mailto:address@hidden]
Sent: Wednesday, February 12, 2020 4:37 PM
To: Zhang, Chen <address@hidden>
Cc: Zhanghailiang <address@hidden>; Dr. David Alan Gilbert <address@hidden>; address@hidden
Subject: Re: The issues about architecture of the COLO checkpoint

Hi Hailiang,

Thanks for your reply and the detailed explanation.

We will try the attached patches to speed up the memory copy.

However, we have some questions about your reply.

1. As you said, "for each checkpoint, we have to send the whole PVM's pages to SVM". Why, then, does only the first checkpoint take a longer pause time?

In our observation, the first checkpoint pauses for a long time, while the other checkpoints pause only briefly. Does this mean that only the first checkpoint sends all pages to the SVM, and the other checkpoints send only the dirty pages to the SVM for reloading?

2. We notice that the COLO-COMPARE component holds a packet until it has received packets from both the PVM and the SVM. Under this rule, when we add COLO-COMPARE to the PVM, its network is stuck until the SVM starts. So this is another issue that makes the PVM stall while the COLO feature is being set up. Given this, could we let colo-compare pass the PVM's packets when the SVM's packet queue is empty? Then the PVM's network won't get stuck, and "if PVM runs firstly, it still need to wait for the network packets from SVM to compare before send it to client side" won't happen either.

Best regards,

Daniel Cho

Zhang, Chen <address@hidden> wrote on Wed, Feb 12, 2020 at 1:45 PM:

> -----Original Message-----
> From: Zhanghailiang <address@hidden>
> Sent: Wednesday, February 12, 2020 11:18 AM
> To: Dr. David Alan Gilbert <address@hidden>; Daniel Cho
> <address@hidden>; Zhang, Chen <address@hidden>
> Cc: address@hidden
> Subject: RE: The issues about architecture of the COLO checkpoint
>
> Hi,
>
> Thank you Dave,
>
> I'll reply here directly.
>
> -----Original Message-----
> From: Dr. David Alan Gilbert [mailto:address@hidden]
> Sent: Wednesday, February 12, 2020 1:48 AM
> To: Daniel Cho <address@hidden>; address@hidden;
> Zhanghailiang <address@hidden>
> Cc: address@hidden
> Subject: Re: The issues about architecture of the COLO checkpoint
>
>
> cc'ing in COLO people:
>
>
> * Daniel Cho (address@hidden) wrote:
> > Hi everyone,
> > We have some issues with setting up the COLO feature. We hope somebody
> > can give us some advice.
> >
> > Issue 1:
> > We dynamically enable the COLO feature for a PVM (2 cores, 16G memory), but
> > the Primary VM pauses for a long time (depending on memory size) while
> > waiting for the SVM to start. Is there any way to reduce the pause time?
> >
>
> Yes, we do have some ideas to optimize this downtime.
>
> The main problem for the current version is that, for each checkpoint, we have to
> send the whole PVM's pages
> to SVM, and then copy the whole VM's state into SVM from ram cache; in
> this process, we need both of them to be paused.
> Just as you said, the downtime is based on memory size.
>
> So firstly, we need to reduce the data sent during a checkpoint. Actually,
> we can migrate part of the PVM's dirty pages in the background
> while both VMs are running, and then load these pages into the ram
> cache (backup memory) in the SVM temporarily. During a checkpoint,
> we just send the last dirty pages of the PVM to the secondary side and then copy the ram
> cache into the SVM. Further, we don't have
> to send all of the PVM's dirty pages; we can send only the pages that were
> dirtied by the PVM or SVM between two checkpoints. (Because
> if a page is not dirtied by either the PVM or the SVM, the data of this page
> stays the same in the SVM, PVM, and backup memory.) This method can reduce
> the time consumed in sending data.
>
> For the second problem, we can reduce the memory copy by two methods.
> First, we don't have to copy all the pages in the ram cache.
> We can only copy the pages that dirtied by PVM and SVM in last checkpoint.
> Second, we can use the userfault missing function to reduce the
> time consumed in memory copy. (With the second method, in theory, we can
> reduce the time consumed in memory copy to the ms level.)
>
> You can find the first optimization in the attachment. It is based on an old qemu
> version (qemu-2.6); it should not be difficult to rebase it
> onto master or your version. And please feel free to send the new version to
> the community if you want ;)
>
>

Thanks Hailiang!
By the way, do you have time to push the patches upstream?
I think that is a better and faster option.

Thanks
Zhang Chen

> >
> > Issue 2:
> > In
> > https://github.com/qemu/qemu/blob/master/migration/colo.c#L503,
> > could we move start_vm() before Line 488? At the first checkpoint the
> > PVM waits for the SVM's reply, which causes the PVM to stop for a while.
> >
>
> No, that makes no sense, because if PVM runs firstly, it still need to wait for
> The network packets from SVM to compare before send it to client side.
>
>
> Thanks,
> Hailiang
>
> > We enable the COLO feature on a running VM, so we hope the running VM
> > can keep serving users.
> > Do you have any suggestions for these issues?
> >
> > Best regards,
> > Daniel Cho
> --
> Dr. David Alan Gilbert / address@hidden / Manchester, UK

From:	Zhang, Chen
Subject:	RE: The issues about architecture of the COLO checkpoint
Date:	Thu, 13 Feb 2020 02:17:05 +0000