
Re: [Qemu-devel] [PATCH COLO-Frame v15 00/38] COarse-grain LOck-stepping (COLO) Virtual Machines for Non-stop Service (FT)


From: Hailiang Zhang
Subject: Re: [Qemu-devel] [PATCH COLO-Frame v15 00/38] COarse-grain LOck-stepping(COLO) Virtual Machines for Non-stop Service (FT)
Date: Mon, 29 Feb 2016 20:16:19 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.5.1

On 2016/2/29 17:47, Dr. David Alan Gilbert wrote:
* Hailiang Zhang (address@hidden) wrote:
On 2016/2/27 0:36, Dr. David Alan Gilbert wrote:
* Dr. David Alan Gilbert (address@hidden) wrote:
* zhanghailiang (address@hidden) wrote:
From: root <address@hidden>

This is the 15th version of COLO (it still only supports periodic checkpoints).

Here is only COLO frame part, you can get the whole codes from github:
https://github.com/coloft/qemu/commits/colo-v2.6-periodic-mode

There are few changes in this series except for the network-related part.

I was looking at the time the guest is paused during COLO and
was surprised to find one of the larger chunks was the time to reset
the guest before loading each checkpoint;  I've traced it part way, the
biggest contributors for my test VM seem to be:

   3.8ms  pcibus_reset: VGA
   1.8ms  pcibus_reset: virtio-net-pci
   1.5ms  pcibus_reset: virtio-blk-pci
   1.5ms  qemu_devices_reset: piix4_reset
   1.1ms  pcibus_reset: piix3-ide
   1.1ms  pcibus_reset: virtio-rng-pci

I've not looked deeper yet, but some of these are very silly;
I'm running with -nographic, so why it's taking 3.8ms to reset VGA is
going to be interesting.
Also, my only block device is the virtio-blk, so while I understand the
standard PC machine has the IDE controller, I don't see why it takes over
a ms to reset an unused device.

OK, so I've dug a bit deeper, and it appears that it's the changes in the
PCI BARs that actually take the time; every time we do a reset we reset all
the BARs, which causes a pci_update_mappings and ends up doing a
memory_region_del_subregion.
Then we load the config space of the PCI device as we do the vmstate_load,
and this recreates all the mappings again.

I'm not sure what the fix is, but that sounds like it would
speed up the checkpoints usefully if we can avoid the map/remap when
they're the same.
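
A minimal sketch of that idea, using hypothetical names rather than QEMU's real API (BarState, bar_stage(), bar_commit() are all invented): defer the BAR teardown triggered by the reset, stage the addresses loaded from the checkpoint, and only del/add the memory regions whose address actually changed.

/* Hypothetical sketch, not QEMU's actual code: defer BAR remapping across a
 * reset + vmstate_load pair so that unchanged mappings are left alone. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_BARS      6
#define BAR_UNMAPPED  ((uint64_t)-1)

typedef struct {
    uint64_t addr[NUM_BARS];    /* currently mapped guest-physical address */
    uint64_t staged[NUM_BARS];  /* address implied by the config space */
    bool     deferred;          /* true between reset and end of vmstate_load */
} BarState;

/* Called on reset instead of tearing the mappings down immediately. */
static void bar_reset_deferred(BarState *s)
{
    s->deferred = true;
    for (int i = 0; i < NUM_BARS; i++) {
        s->staged[i] = BAR_UNMAPPED;          /* the reset value of a BAR */
    }
}

/* Called while loading the PCI config space from the checkpoint. */
static void bar_stage(BarState *s, int bar, uint64_t new_addr)
{
    s->staged[bar] = new_addr;
}

/* Called once the checkpoint is fully loaded: remap only the BARs whose
 * address differs from the pre-reset mapping. */
static void bar_commit(BarState *s,
                       void (*unmap)(int bar),
                       void (*map)(int bar, uint64_t addr))
{
    for (int i = 0; i < NUM_BARS; i++) {
        if (s->staged[i] == s->addr[i]) {
            continue;                         /* unchanged: skip the del/add */
        }
        if (s->addr[i] != BAR_UNMAPPED) {
            unmap(i);                         /* cf. memory_region_del_subregion */
        }
        s->addr[i] = s->staged[i];
        if (s->addr[i] != BAR_UNMAPPED) {
            map(i, s->addr[i]);               /* cf. memory_region_add_subregion */
        }
    }
    s->deferred = false;
}

In the common COLO case the checkpoint restores the same BAR addresses that were mapped before the reset, so bar_commit() would do nothing and the per-device del/add pair would disappear from the pause time.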


Interesting, and thanks for your report.

We already know qemu_system_reset() is a time-consuming function, and we
shouldn't call it here, but if we don't, there will be a bug which we
reported before in a previous COLO series; below is a copy of the related
patch comment:

     COLO VMstate: Load VM state into qsb before restore it

     We should not destroy the state of the secondary until we receive the whole
     state from the primary, in case the primary fails in the middle of sending
     the state; so here we cache the device state on the Secondary before
     restoring it.

     Besides, we should call qemu_system_reset() before loading the VM state,
     which can ensure the data is intact.
     Note: If we discard qemu_system_reset(), there will be some odd errors.
     For example, qemu on the slave side crashes and reports:

     KVM: entry failed, hardware error 0x7
     EAX=00000000 EBX=0000e000 ECX=00009578 EDX=0000434f
     ESI=0000fc10 EDI=0000434f EBP=00000000 ESP=00001fca
     EIP=00009594 EFL=00010246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
     ES =0040 00000400 0000ffff 00009300
     CS =f000 000f0000 0000ffff 00009b00
     SS =434f 000434f0 0000ffff 00009300
     DS =434f 000434f0 0000ffff 00009300
     FS =0000 00000000 0000ffff 00009300
     GS =0000 00000000 0000ffff 00009300
     LDT=0000 00000000 0000ffff 00008200
     TR =0000 00000000 0000ffff 00008b00
     GDT=     0002dcc8 00000047
     IDT=     00000000 0000ffff
     CR0=00000010 CR2=ffffffff CR3=00000000 CR4=00000000
     DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
     DR6=00000000ffff0ff0 DR7=0000000000000400
     EFER=0000000000000000
     Code=c0 74 0f 66 b9 78 95 00 00 66 31 d2 66 31 c0 e9 47 e0 fb 90 <f3> 90 fa fc 66 c3 66 53 66 89 c3
     ERROR: invalid runstate transition: 'internal-error' -> 'colo'

     The reason is, some of the device state will be ignored when saving the
     device state to the slave if the corresponding data still has its initial
     value, such as 0.
     But the device state on the slave may no longer have its initial value;
     after a loop of checkpoints the values of the device state become
     inconsistent.
     This happens when the PVM reboots, or the SVM runs ahead of the PVM in
     the startup process.
     Signed-off-by: zhanghailiang <address@hidden>
     Signed-off-by: Yang Hongyang <address@hidden>
     Signed-off-by: Gonglei <address@hidden>
     Reviewed-by: Dr. David Alan Gilbert <address@hidden>

As described above, some values in the device state are zero and will be
ignored during migration. That causes no problem for normal migration, because
on the destination VM the initial values are zero too. But COLO does more than
one round of migration, so a value may change from non-zero to zero; it will
then be ignored in the next checkpoint, and the VMState will be inconsistent
on the SVM.

Yes, this doesn't really surprise me; a lot of the migration code does weird
things like this, so reset is the safest way.


The above error is caused directly by a wrong value of 'async_pf_en_msr'.

static const VMStateDescription vmstate_async_pf_msr = {
     .name = "cpu/async_pf_msr",
     .version_id = 1,
     .minimum_version_id = 1,
     .needed = async_pf_msr_needed,
     .fields = (VMStateField[]) {
         VMSTATE_UINT64(env.async_pf_en_msr, X86CPU),
         VMSTATE_END_OF_LIST()
     }
};

static bool async_pf_msr_needed(void *opaque)
{
     X86CPU *cpu = opaque;

     return cpu->env.async_pf_en_msr != 0;
}
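
To make the failure mode concrete, here is a toy, self-contained model of that gate across two checkpoints (the MSR value is invented; only the divergence matters):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model of the .needed() gate across two COLO checkpoints. */
static bool needed(uint64_t v) { return v != 0; }

int main(void)
{
    uint64_t pvm = 0x12345, svm = 0;   /* async_pf_en_msr on each side */

    if (needed(pvm)) {                 /* checkpoint 1: non-zero, subsection sent */
        svm = pvm;
    }

    pvm = 0;                           /* PVM reboots, MSR goes back to 0 */

    if (needed(pvm)) {                 /* checkpoint 2: skipped, SVM keeps stale value */
        svm = pvm;
    }

    printf("PVM=0x%llx SVM=0x%llx -> %s\n",
           (unsigned long long)pvm, (unsigned long long)svm,
           pvm == svm ? "consistent" : "inconsistent");
    return 0;
}

A qemu_system_reset() before loading each checkpoint puts the SVM copy back to its initial value first, which is why the reset hides this problem.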

Some other VMState subsections for registers in CPUX86State have the same
problem; we can't be sure they won't cause any trouble if their values are
incorrect.
So here we simply call qemu_system_reset() to avoid the inconsistency.
Besides, compared with the most time-consuming operation (flushing RAM from
the COLO cache to the SVM), the time spent in qemu_system_reset() seems
acceptable ;)

I've got a patch where I've tried to multithread the flush - it's made it a
little faster, but not as much as I hoped (~20ms down to ~16ms using 4 cores).
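
For reference, a multi-threaded flush along those lines could look roughly like the sketch below; the names (FlushJob, flush_parallel) and the flat dirty-page list are assumptions for illustration, not the actual patch:

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

typedef struct {
    uint8_t *cache;            /* COLO cache base */
    uint8_t *ram;              /* SVM RAM base */
    const uint64_t *dirty;     /* indexes of dirty pages */
    size_t begin, end;         /* slice of the dirty list for this worker */
} FlushJob;

static void *flush_worker(void *opaque)
{
    FlushJob *job = opaque;
    for (size_t i = job->begin; i < job->end; i++) {
        uint64_t page = job->dirty[i];
        memcpy(job->ram   + page * PAGE_SIZE,
               job->cache + page * PAGE_SIZE, PAGE_SIZE);
    }
    return NULL;
}

/* Split the dirty-page list evenly across nthreads workers and wait. */
static void flush_parallel(uint8_t *cache, uint8_t *ram,
                           const uint64_t *dirty, size_t ndirty, int nthreads)
{
    pthread_t tid[nthreads];
    FlushJob job[nthreads];
    size_t chunk = (ndirty + nthreads - 1) / nthreads;

    for (int t = 0; t < nthreads; t++) {
        size_t begin = t * chunk;
        size_t end = begin + chunk < ndirty ? begin + chunk : ndirty;
        job[t] = (FlushJob){ .cache = cache, .ram = ram, .dirty = dirty,
                             .begin = begin < ndirty ? begin : ndirty,
                             .end = end };
        pthread_create(&tid[t], NULL, flush_worker, &job[t]);
    }
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);
    }
}

The copy itself is largely memory-bandwidth bound, which may be why 4 cores only brought the measurement from ~20ms down to ~16ms.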


Hmm, that seems to be a good idea. After switching to COLO (hybrid) mode, in
most cases we will get many more dirtied pages than in periodic mode, because
the delay between two checkpoints is usually longer.
The multi-threaded flush may gain much more in that case, but I doubt that in
some bad cases users could bear the pause time even so.

Actually, we have thought about this problem for a long time.
In our early tests based on the kernel COLO-proxy, we could easily get more
than a second of flushing time; IMHO, users can't bear such a long VM pause
if they choose to use COLO.

We have designed another scheme based on userfaultfd's page-miss capability.
The basic idea is to convert the flush action into a marking action; the actual
flushing is then processed during the SVM's running time. For now it is only an
idea, and we'd like to verify it first. (I'm not quite sure whether
userfaultfd's page-miss feature is designed for good performance when we use it
to mark one page as MISS at a time.)
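
As a rough illustration of that idea, the sketch below registers the SVM's RAM with Linux userfaultfd(2) and copies a page from the COLO cache only when the SVM actually touches it. The COLO-side details (svm_ram_base, colo_cache_base, and how dirty pages get dropped from the SVM mapping so that touching them faults, e.g. via madvise(MADV_DONTNEED)) are assumptions, not part of the posted series:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE_SIZE 4096UL

/* Assumed to be set up elsewhere: SVM RAM and the COLO cache are two
 * equally sized mappings, so a faulting address maps to a cache offset. */
static unsigned long svm_ram_base, colo_cache_base, ram_size;

static int uffd_register_svm_ram(void)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API };
    struct uffdio_register reg = {
        .range = { .start = svm_ram_base, .len = ram_size },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,   /* report page-miss faults */
    };

    if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0 ||
        ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
        return -1;
    }
    return uffd;
}

/* Fault-handling loop: resolve each miss by copying the page that the
 * checkpoint only marked as dirty, instead of flushing everything upfront. */
static void handle_misses(int uffd)
{
    struct uffd_msg msg;

    while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
        if (msg.event != UFFD_EVENT_PAGEFAULT) {
            continue;
        }
        unsigned long addr = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
        struct uffdio_copy copy = {
            .dst = addr,
            .src = colo_cache_base + (addr - svm_ram_base),
            .len = PAGE_SIZE,
        };
        ioctl(uffd, UFFDIO_COPY, &copy);   /* fill the page on demand */
    }
}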


Thanks,
Hailiang

Another way to fix the problem is to save the VMState while ignoring the
needed() return value, but this method is not so graceful.
diff --git a/migration/vmstate.c b/migration/vmstate.c
index e5388f0..7d15bba 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -409,7 +409,7 @@ static void vmstate_subsection_save(QEMUFile *f, const VMStateDescription *vmsd,
      bool subsection_found = false;

      while (sub && sub->needed) {
-        if (sub->needed(opaque)) {
+        if (sub->needed(opaque) || migrate_in_colo_state()) {
              const VMStateDescription *vmsd = sub->vmsd;
              uint8_t len;

Maybe, but I suspect this will also find other strange cases in devices.
For example, without the reset I wouldn't really trust devices to load the
new state in; they wouldn't have been tested with all the states they
might have left themselves in previously.

Dave



Thanks,
Hailiang

Dave


I guess reset is normally off anyone's radar since it's outside the time
anyone cares about, but perhaps the folks trying to make qemu start really
quickly would be interested.

Dave


Patch status:
Unreviewed: patch 21,27,28,29,33,38
Updated: patch 31,34,35,37

TODO:
1. Checkpoint based on proxy in qemu
2. The capability of continuous FT
3. Optimize the VM's downtime during checkpoint

v15:
  - Go on the shutdown process if encounter error while sending shutdown
    message to SVM. (patch 24)
  - Rename qemu_need_skip_netfilter to qemu_netfilter_can_skip and remove
    some useless comments. (patch 31, Jason)
  - Call object_new_with_props() directly to add filter in
    colo_add_buffer_filter. (patch 34, Jason)
  - Re-implement colo_set_filter_status() based on COLOBufferFilters
    list. (patch 35)
  - Re-implement colo_flush_filter_packets() based on COLOBufferFilters
    list. (patch 37)
v14:
  - Re-implement the network processing based on netfilter (Jason Wang)
  - Rename 'COLOCommand' to 'COLOMessage'. (Markus's suggestion)
  - Split two new patches (patch 27/28) from patch 29
  - Fix some other comments from Dave and Markus.

v13:
  - Refactor colo_*_cmd helper functions to use 'Error **errp' parameter
   instead of return value to indicate success or failure. (patch 10)
  - Remove the optional error message for COLO_EXIT event. (patch 25)
  - Use semaphore to notify colo/colo incoming loop that failover work is
    finished. (patch 26)
  - Move COLO shutdown related codes to colo.c file. (patch 28)
  - Fix memory leak bug for colo incoming loop. (new patch 31)
  - Re-use some existed helper functions to realize the process of
    saving/loading ram and device. (patch 32)
  - Fix some other comments from Dave and Markus.

zhanghailiang (38):
   configure: Add parameter for configure to enable/disable COLO support
   migration: Introduce capability 'x-colo' to migration
   COLO: migrate colo related info to secondary node
   migration: Integrate COLO checkpoint process into migration
   migration: Integrate COLO checkpoint process into loadvm
   COLO/migration: Create a new communication path from destination to
     source
   COLO: Implement colo checkpoint protocol
   COLO: Add a new RunState RUN_STATE_COLO
   QEMUSizedBuffer: Introduce two help functions for qsb
   COLO: Save PVM state to secondary side when do checkpoint
   COLO: Load PVM's dirty pages into SVM's RAM cache temporarily
   ram/COLO: Record the dirty pages that SVM received
   COLO: Load VMState into qsb before restore it
   COLO: Flush PVM's cached RAM into SVM's memory
   COLO: Add checkpoint-delay parameter for migrate-set-parameters
   COLO: synchronize PVM's state to SVM periodically
   COLO failover: Introduce a new command to trigger a failover
   COLO failover: Introduce state to record failover process
   COLO: Implement failover work for Primary VM
   COLO: Implement failover work for Secondary VM
   qmp event: Add COLO_EXIT event to notify users while exited from COLO
   COLO failover: Shutdown related socket fd when do failover
   COLO failover: Don't do failover during loading VM's state
   COLO: Process shutdown command for VM in COLO state
   COLO: Update the global runstate after going into colo state
   savevm: Introduce two helper functions for save/find loadvm_handlers
     entry
   migration/savevm: Add new helpers to process the different stages of
     loadvm
   migration/savevm: Export two helper functions for savevm process
   COLO: Separate the process of saving/loading ram and device state
   COLO: Split qemu_savevm_state_begin out of checkpoint process
   net/filter: Add a 'status' property for filter object
   filter-buffer: Accept zero interval
   net: Add notifier/callback for netdev init
   COLO/filter: add each netdev a buffer filter
   COLO: manage the status of buffer filters for PVM
   filter-buffer: make filter_buffer_flush() public
   COLO: flush buffered packets in checkpoint process or exit COLO
   COLO: Add block replication into colo process

  configure                     |  11 +
  docs/qmp-events.txt           |  16 +
  hmp-commands.hx               |  15 +
  hmp.c                         |  15 +
  hmp.h                         |   1 +
  include/exec/ram_addr.h       |   1 +
  include/migration/colo.h      |  42 ++
  include/migration/failover.h  |  33 ++
  include/migration/migration.h |  16 +
  include/migration/qemu-file.h |   3 +-
  include/net/filter.h          |   5 +
  include/net/net.h             |   4 +
  include/sysemu/sysemu.h       |   9 +
  migration/Makefile.objs       |   2 +
  migration/colo-comm.c         |  76 ++++
  migration/colo-failover.c     |  83 ++++
  migration/colo.c              | 866 ++++++++++++++++++++++++++++++++++++++++++
  migration/migration.c         | 109 +++++-
  migration/qemu-file-buf.c     |  61 +++
  migration/ram.c               | 175 ++++++++-
  migration/savevm.c            | 114 ++++--
  net/filter-buffer.c           |  14 +-
  net/filter.c                  |  40 ++
  net/net.c                     |  33 ++
  qapi-schema.json              | 104 ++++-
  qapi/event.json               |  15 +
  qemu-options.hx               |   4 +-
  qmp-commands.hx               |  23 +-
  stubs/Makefile.objs           |   1 +
  stubs/migration-colo.c        |  54 +++
  trace-events                  |   8 +
  vl.c                          |  31 +-
  32 files changed, 1908 insertions(+), 76 deletions(-)
  create mode 100644 include/migration/colo.h
  create mode 100644 include/migration/failover.h
  create mode 100644 migration/colo-comm.c
  create mode 100644 migration/colo-failover.c
  create mode 100644 migration/colo.c
  create mode 100644 stubs/migration-colo.c

--
1.8.3.1


--
Dr. David Alan Gilbert / address@hidden / Manchester, UK









