From: Li, Liang Z
Subject: Re: [Qemu-devel] Migration dirty bitmap: should only mark pages as dirty after they have been sent
Date: Thu, 3 Nov 2016 09:59:03 +0000

> > > > > > > pages will be sent. Before that, during the migration setup, the
> > > > > > > ioctl(KVM_GET_DIRTY_LOG) is called once, so the kernel begins to
> > > > > > > produce the dirty bitmap from this moment. When the pages "that
> > > > > > > haven't been sent" are written, the kernel marks them as dirty.
> > > > > > > However, I don't think this is correct, because these pages will
> > > > > > > be sent during this and the next iterations with the same content
> > > > > > > (if they are not written again after they are sent). It only makes
> > > > > > > sense to mark the pages which have already been sent during one
> > > > > > > iteration as dirty when they are written.
> > > > > > >
> > > > > > > Am I right about this consideration? If I am right, is there some
> > > > > > > advice to improve this?
> > > > > >
> > > > > > I think you're right that this can happen; to clarify I think the
> > > > > > case you're talking about is:
> > > > > >
> > > > > >   Iteration 1
> > > > > >     sync bitmap
> > > > > >     start sending pages
> > > > > >     page 'n' is modified - but hasn't been sent yet
> > > > > >     page 'n' gets sent
> > > > > >   Iteration 2
> > > > > >     sync bitmap
> > > > > >        'page n is shown as modified'
> > > > > >     send page 'n' again
> > > > > >
> > > > >
> > > > > Yes, this is exactly the case I am talking about.
> > > > >
> > > > > > So you're right that it's wasteful; I guess it's more wasteful
> > > > > > on big VMs with slow networks where the length of each iteration
> > > > > > is large.
> > > > >
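To make the ordering concrete, here is a minimal sketch of one precopy
iteration (illustrative only, not QEMU's actual code; sync_dirty_bitmap(),
next_dirty_page() and send_page() are hypothetical helpers):

    #include <stdbool.h>
    #include <stdint.h>

    void sync_dirty_bitmap(void);         /* wraps ioctl(KVM_GET_DIRTY_LOG)    */
    bool next_dirty_page(uint64_t *pfn);  /* iterates pages marked dirty       */
    void send_page(uint64_t pfn);         /* puts the page content on the wire */

    void precopy_iteration(void)
    {
        uint64_t pfn;

        /* The kernel starts a fresh round of dirty tracking at this point. */
        sync_dirty_bitmap();

        while (next_dirty_page(&pfn)) {
            /* A guest write that lands here, after the sync but before the
             * page is sent, is already covered by the content about to go
             * on the wire, yet it is recorded as dirty again, so the next
             * iteration resends the page with identical data ("false dirty"). */
            send_page(pfn);
        }
    }
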
> > > > > I think this is "very" wasteful. Assume the workload dirties pages
> > > > > randomly within the guest address space, and the transfer speed is
> > > > > constant. Intuitively, I think nearly half of the dirty pages
> > > > > produced in Iteration 1 are not really dirty. This means the time of
> > > > > Iteration 2 is double that needed to send only the really dirty
> > > > > pages.
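
A rough back-of-the-envelope model of the "nearly half" intuition (my own
sketch, assuming each such page is written exactly once during the iteration,
at a time uniform over the iteration, and that the constant transfer speed
makes the time at which the page is sent uniform and independent as well):

    P(false dirty) = P(t_write < t_send) = 1/2

with the iteration length normalized to 1, which matches the estimate above.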
> > > >
> > > > It makes sense, can you get some perf numbers to show what kinds of
> > > > workloads get impacted the most?  That would also help us to figure
> > > > out what kinds of speed improvements we can expect.
> > > >
> > > >
> > > >                 Amit
> > >
> > > I have picked 6 workloads and got the following statistics for every
> > > iteration (except the last stop-copy one) during precopy. These numbers
> > > are obtained with basic precopy migration, without capabilities like
> > > xbzrle or compression, etc. The network for the migration is exclusive,
> > > with a separate network for the workloads; both are gigabit ethernet.
> > > I use qemu-2.5.1.
> > >
> > > Three (booting, idle, web server) of them converged to the stop-copy
> > > phase, with the given bandwidth and default downtime (300ms), while the
> > > other three (kernel compilation, zeusmp, memcached) did not.
> > >
> > > One page is "not-really-dirty" if it is written first and sent later
> > > (and not written again after that) during one iteration. I guess this
> > > would not happen as often during the other iterations as during the 1st
> > > iteration, because all the pages of the VM are sent to the dest node
> > > during the 1st iteration, while during the others only part of the
> > > pages are sent. So I think the "not-really-dirty" pages should be
> > > produced mainly during the 1st iteration, and maybe very few during the
> > > other iterations.
> > >
> > > If we could avoid resending the "not-really-dirty" pages, intuitively,
> > > I think the time spent on Iteration 2 would be halved. This is a chain
> > > reaction, because the dirty pages produced during Iteration 2 are
> > > halved, which in turn halves the time spent on Iteration 3, then
> > > Iteration 4, 5...
> >
> > Yes; these numbers don't show how many of them are false dirty though.
> >
> > One problem is thinking about pages that have been redirtied: if the page
> > is dirtied after the sync but before the network write, then it's the
> > false-dirty that you're describing.
> >
> > However, if the page is being written a few times, and so it would have
> > been written after the network write, then it isn't a false-dirty.
> >
> > You might be able to figure that out with some kernel tracing of when the
> > dirtying happens, but it might be easier to write the fix!
> >
> > Dave
> 
> Hi, I have made some new progress now.
> 
> To tell exactly how many false dirty pages there are in each iteration, I
> malloc a buffer in memory as big as the whole VM memory. When a page is
> transferred to the dest node, it is copied into the buffer; during the next
> iteration, if a page is transferred, it is compared to the old copy in the
> buffer, and the old copy is replaced for the next comparison if the page is
> really dirty. Thus, we are now able to get the exact number of false dirty
> pages.
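
A minimal sketch of that measurement (my reading of the description above,
not the actual instrumentation; shadow_buf and the counters are illustrative
names):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    static uint8_t *shadow_buf;     /* copy of guest RAM, one slot per page   */
    static uint64_t false_dirty;    /* pages resent with unchanged content    */
    static uint64_t really_dirty;   /* pages resent with changed content      */

    static void shadow_init(uint64_t nr_pages)
    {
        shadow_buf = calloc(nr_pages, PAGE_SIZE);
    }

    /* Called for every page that is about to be (re)transmitted. */
    static bool page_really_dirty(uint64_t page_index, const uint8_t *page)
    {
        uint8_t *old = shadow_buf + page_index * PAGE_SIZE;

        if (memcmp(old, page, PAGE_SIZE) == 0) {
            false_dirty++;              /* marked dirty, but content unchanged */
            return false;
        }
        memcpy(old, page, PAGE_SIZE);   /* remember the content that was sent */
        really_dirty++;
        return true;
    }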
> 
> This time, I use 15 workloads to get the statistics. They are:
>
>   1. 11 benchmarks picked from the cpu2006 benchmark suite. They are all
>      scientific computing workloads like Quantum Chromodynamics, Fluid
>      Dynamics, etc. I picked these 11 benchmarks because, compared to the
>      others, they have bigger memory occupation and a higher memory dirty
>      rate. Thus most of them could not converge to stop-and-copy using the
>      default migration speed (32MB/s).
>   2. kernel compilation
>   3. idle VM
>   4. Apache web server which serves static content
>
>   (the above workloads all run in a VM with 1 vcpu and 1GB memory, and the
>    migration speed is the default 32MB/s)
>
>   5. Memcached. The VM has 6 cpu cores and 6GB memory, and 4GB are used as
>      the cache. After filling up the 4GB cache, a client writes the cache
>      at a constant speed during migration. This time, the migration speed
>      has no limit, and is up to the capability of 1Gbps Ethernet.
> 
> To summarize the results first (you can read the precise numbers below):
>
>   1. 4 of these 15 workloads have a big proportion (>60%, even >80% during
>      some iterations) of false dirty pages out of all the dirty pages from
>      iteration 2 on (and the big proportion lasts during the following
>      iterations). They are cpu2006.zeusmp, cpu2006.bzip2, cpu2006.mcf, and
>      memcached.
>   2. 2 workloads (idle, webserver) spend most of the migration time on
>      iteration 1, so even though the proportion of false dirty pages is
>      big from iteration 2 on, the space to optimize is small.
>   3. 1 workload (kernel compilation) only has a big proportion during
>      iteration 2, not in the other iterations.
>   4. 8 workloads (the other 8 benchmarks of cpu2006) have a small
>      proportion of false dirty pages from iteration 2 on, so the space to
>      optimize for them is small.
> 
> Now I want to talk a little more about the reasons why false dirty pages
> are produced. The first reason is what we have discussed before---the
> mechanism used to track the dirty pages.
> Then there is another reason. Here is the situation: a write operation to a
> memory page happens, but it doesn't change any content of the page. So it
> is a "write but not dirty", and the kernel still marks the page as dirty.
> One guy in our lab has done some experiments to figure out the proportion
> of "write but not dirty" operations, using the cpu2006 benchmark suite.
> According to his results, most workloads have a small proportion (<10%) of
> "write but not dirty" out of all the write operations, while a few
> workloads have a higher proportion (one even as high as 50%). We are not
> yet sure why "write but not dirty" happens; it just does.
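
A trivial illustration of such a "write but not dirty" store (my own example):
the value written equals the value already there, so the page content does not
change, but the store still causes the dirty tracking to flag the page:

    #include <stdlib.h>

    int main(void)
    {
        volatile char *page = malloc(4096);

        page[0] = 'x';   /* content changes: really dirty                     */
        /* ... dirty log synced, page transferred ...                         */
        page[0] = 'x';   /* same value stored again: content is unchanged,
                          * but the write still marks the page dirty, so it
                          * shows up as (falsely) dirty at the next sync      */

        free((void *)page);
        return 0;
    }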
> 
> So these two reasons contribute to the false dirty pages. To optimize, I
> compute and store the SHA1 hash of each page before transferring it. Next
> time, if a page needs retransmission, its SHA1 hash is computed again and
> compared to the old hash. If the hashes are the same, it is a false dirty
> page and we just skip it; otherwise, the page is transferred, and the new
> hash replaces the old one for the next comparison.
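
A minimal sketch of that hash-based skip (illustrative; the mail does not say
which SHA-1 implementation is used, so OpenSSL's SHA1() stands in here, and
hash_table is a hypothetical per-page array):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>
    #include <openssl/sha.h>

    #define PAGE_SIZE 4096

    /* 20 bytes (SHA_DIGEST_LENGTH) of stored hash per guest page. */
    static uint8_t (*hash_table)[SHA_DIGEST_LENGTH];

    /* Returns true if the page content changed since it was last sent
     * (really dirty), false if it is a false dirty page we can skip.  */
    static bool page_needs_resend(uint64_t page_index, const uint8_t *page)
    {
        uint8_t digest[SHA_DIGEST_LENGTH];

        SHA1(page, PAGE_SIZE, digest);
        if (memcmp(hash_table[page_index], digest, SHA_DIGEST_LENGTH) == 0) {
            return false;                   /* same hash: skip retransmission */
        }
        memcpy(hash_table[page_index], digest, SHA_DIGEST_LENGTH);
        return true;
    }
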
> The reason to use a SHA1 hash rather than a byte-by-byte comparison is the
> memory overhead. One SHA1 hash is 20 bytes, so we need extra memory equal
> to 20/4096 (<1/200) of the whole VM memory, which is relatively small.
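
For concreteness, plugging in the guest sizes used above (my arithmetic, not
figures from the mail):

    20 B / 4096 B ~= 0.49%  ->  ~5 MB of hashes for a 1 GB guest,
                                ~30 MB for the 6 GB memcached guest
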
> As far as I know, SHA1 hashes are widely used for deduplication in backup
> systems. There it has been shown that the probability of a hash collision
> is far smaller than that of a disk hardware fault, so it is treated as a
> secure hash: if the hashes of two chunks are the same, the content must be
> the same. So I think the SHA1 hash can replace a byte-by-byte comparison in
> the VM memory case as well.
> 
> Then I do the same migration experiments using the SHA1 hash. For the 4
> workloads which have big proportions of false dirty pages, the improvement
> is remarkable. Without the optimization, they either cannot converge to
> stop-and-copy or take a very long time to complete. With the SHA1 hash
> method, all of them now complete in a relatively short time.
> For the reasons discussed above, the other workloads don't get notable
> improvements from the optimization. So below, I only show the exact numbers
> after optimization for the 4 workloads with remarkable improvements.
> 
> Any comments or suggestions?
> 

It seems the current XBZRLE feature can be used to solve the false dirty issue, no?

Liang

