qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v3 00/32] implement vNVDIMM


From: Xiao Guangrong
Subject: Re: [Qemu-devel] [PATCH v3 00/32] implement vNVDIMM
Date: Mon, 12 Oct 2015 11:06:20 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0



On 10/12/2015 10:59 AM, Bharata B Rao wrote:
Xiao,

Are these patches present in any git tree so that they can be easily tried out.


Sorry, currently no git tree out of my workspace is available :(

BTW, this patchset is based on top of the commit b37686f7e on qemu tree:
commit b37686f7e84b22cfaf7fd01ac5133f2617cc3027
Merge: 8be6e62 98cf48f
Author: Peter Maydell <address@hidden>
Date:   Fri Oct 9 12:18:13 2015 +0100

    Merge remote-tracking branch 'remotes/stefanha/tags/tracing-pull-request' 
into staging

Thanks.

Regards,
Bharata.

On Sun, Oct 11, 2015 at 9:22 AM, Xiao Guangrong
<address@hidden> wrote:
Changelog in v3:
There is huge change in this version, thank Igor, Stefan, Paolo, Eduardo,
Michael for their valuable comments, the patchset finally gets better shape.
- changes from Igor's comments:
   1) abstract dimm device type from pc-dimm and create nvdimm device based on
      dimm, then it uses memory backend device as nvdimm's memory and NUMA has
      easily been implemented.
   2) let file-backend device support any kind of filesystem not only for
      hugetlbfs and let it work on file not only for directory which is
      achieved by extending 'mem-path' - if it's a directory then it works as
      current behavior, otherwise if it's file then directly allocates memory
      from it.
   3) we figure out a unused memory hole below 4G that is 0xFF00000 ~
      0xFFF00000, this range is large enough for NVDIMM ACPI as build 64-bit
      ACPI SSDT/DSDT table will break windows XP.
      BTW, only make SSDT.rev = 2 can not work since the width is only depended
      on DSDT.rev based on 19.6.28 DefinitionBlock (Declare Definition Block)
      in ACPI spec:
| Note: For compatibility with ACPI versions before ACPI 2.0, the bit
| width of Integer objects is dependent on the ComplianceRevision of the DSDT.
| If the ComplianceRevision is less than 2, all integers are restricted to 32
| bits. Otherwise, full 64-bit integers are used. The version of the DSDT sets
| the global integer width for all integers, including integers in SSDTs.
   4) use the lowest ACPI spec version to document AML terms.
   5) use "nvdimm" as nvdimm device name instead of "pc-nvdimm"

- changes from Stefan's comments:
   1) do not do endian adjustment in-place since _DSM memory is visible to guest
   2) use target platform's target page size instead of fixed PAGE_SIZE
      definition
   3) lots of code style improvement and typo fixes.
   4) live migration fix
- changes from Paolo's comments:
   1) improve the name of memory region

- other changes:
   1) return exact buffer size for _DSM method instead of the page size.
   2) introduce mutex in NVDIMM ACPI as the _DSM memory is shared by all nvdimm
      devices.
   3) NUMA support
   4) implement _FIT method
   5) rename "configdata" to "reserve-label-data"
   6) simplify _DSM arg3 determination
   7) main changelog update to let it reflect v3.

Changlog in v2:
- Use litten endian for DSM method, thanks for Stefan's suggestion

- introduce a new parameter, @configdata, if it's false, Qemu will
   build a static and readonly namespace in memory and use it serveing
   for DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests. In this case, no
   reserved region is needed at the end of the @file, it is good for
   the user who want to pass whole nvdimm device and make its data
   completely be visible to guest

- divide the source code into separated files and add maintain info

BTW, PCOMMIT virtualization on KVM side is work in progress, hopefully will
be posted on next week

====== Background ======
NVDIMM (A Non-Volatile Dual In-line Memory Module) is going to be supported
on Intel's platform. They are discovered via ACPI and configured by _DSM
method of NVDIMM device in ACPI. There has some supporting documents which
can be found at:
ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

Currently, the NVDIMM driver has been merged into upstream Linux Kernel and
this patchset tries to enable it in virtualization field

====== Design ======
NVDIMM supports two mode accesses, one is PMEM which maps NVDIMM into CPU's
address space then CPU can directly access it as normal memory, another is
BLK which is used as block device to reduce the occupying of CPU address
space

BLK mode accesses NVDIMM via Command Register window and Data Register window.
BLK virtualization has high workload since each sector access will cause at
least two VM-EXIT. So we currently only imperilment vPMEM in this patchset

--- vPMEM design ---
We introduce a new device named "nvdimm", it uses memory backend device as
NVDIMM memory. The file in file-backend device can be a regular file and block
device. We can use any file when we do test or emulation, however,
in the real word, the files passed to guest are:
- the regular file in the filesystem with DAX enabled created on NVDIMM device
   on host
- the raw PMEM device on host, e,g /dev/pmem0
Memory access on the address created by mmap on these kinds of files can
directly reach NVDIMM device on host.

--- vConfigure data area design ---
Each NVDIMM device has a configure data area which is used to store label
namespace data. In order to emulating this area, we divide the file into two
parts:
- first parts is (0, size - 128K], which is used as PMEM
- 128K at the end of the file, which is used as Label Data Area
So that the label namespace data can be persistent during power lose or system
failure.

We also support passing the whole file to guest without reserve any region for
label data area which is achieved by "reserve-label-data" parameter - if it's
false then QEMU will build static and readonly namespace in memory and that
namespace contains the whole file size. The parameter is false on default.

--- _DSM method design ---
_DSM in ACPI is used to configure NVDIMM, currently we only allow access of
label namespace data, i.e, Get Namespace Label Size (Function Index 4),
Get Namespace Label Data (Function Index 5) and Set Namespace Label Data
(Function Index 6)

_DSM uses two pages to transfer data between ACPI and Qemu, the first page
is RAM-based used to save the input info of _DSM method and Qemu reuse it
store output info and another page is MMIO-based, ACPI write data to this
page to transfer the control to Qemu

====== Test ======
In host
1) create memory backed file, e.g # dd if=zero of=/tmp/nvdimm bs=1G count=10
2) append "-object memory-backend-file,share,id=mem1,
    mem-path=/tmp/nvdimm -device nvdimm,memdev=mem1,reserve-label-data,
    id=nv1" in QEMU command line

In guest, download the latest upsteam kernel (4.2 merge window) and enable
ACPI_NFIT, LIBNVDIMM and BLK_DEV_PMEM.
1) insmod drivers/nvdimm/libnvdimm.ko
2) insmod drivers/acpi/nfit.ko
3) insmod drivers/nvdimm/nd_btt.ko
4) insmod drivers/nvdimm/nd_pmem.ko
You can see the whole nvdimm device used as a single namespace and /dev/pmem0
appears. You can do whatever on /dev/pmem0 including DAX access.

Currently Linux NVDIMM driver does not support namespace operation on this
kind of PMEM, apply below changes to support dynamical namespace:

@@ -798,7 +823,8 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *a
                         continue;
                 }

-               if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+               //if (nfit_mem->bdw && nfit_mem->memdev_pmem)
+               if (nfit_mem->memdev_pmem)
                         flags |= NDD_ALIASING;

You can append another NVDIMM device in guest and do:
# cd /sys/bus/nd/devices/
# cd namespace1.0/
# echo `uuidgen` > uuid
# echo `expr 1024 \* 1024 \* 128` > size
then reload nd.pmem.ko

You can see /dev/pmem1 appears

====== TODO ======
NVDIMM hotplug support

Xiao Guangrong (32):
   acpi: add aml_derefof
   acpi: add aml_sizeof
   acpi: add aml_create_field
   acpi: add aml_mutex, aml_acquire, aml_release
   acpi: add aml_concatenate
   acpi: add aml_object_type
   util: introduce qemu_file_get_page_size()
   exec: allow memory to be allocated from any kind of path
   exec: allow file_ram_alloc to work on file
   hostmem-file: clean up memory allocation
   hostmem-file: use whole file size if possible
   pc-dimm: remove DEFAULT_PC_DIMMSIZE
   pc-dimm: make pc_existing_dimms_capacity static and rename it
   pc-dimm: drop the prefix of pc-dimm
   stubs: rename qmp_pc_dimm_device_list.c
   pc-dimm: rename pc-dimm.c and pc-dimm.h
   dimm: abstract dimm device from pc-dimm
   dimm: get mapped memory region from DIMMDeviceClass->get_memory_region
   dimm: keep the state of the whole backend memory
   dimm: introduce realize callback
   nvdimm: implement NVDIMM device abstract
   nvdimm: init the address region used by NVDIMM ACPI
   nvdimm: build ACPI NFIT table
   nvdimm: init the address region used by DSM method
   nvdimm: build ACPI nvdimm devices
   nvdimm: save arg3 for NVDIMM device _DSM method
   nvdimm: support DSM_CMD_IMPLEMENTED function
   nvdimm: support DSM_CMD_NAMESPACE_LABEL_SIZE function
   nvdimm: support DSM_CMD_GET_NAMESPACE_LABEL_DATA
   nvdimm: support DSM_CMD_SET_NAMESPACE_LABEL_DATA
   nvdimm: allow using whole backend memory as pmem
   nvdimm: add maintain info

  MAINTAINERS                                        |   6 +
  backends/hostmem-file.c                            |  58 +-
  default-configs/i386-softmmu.mak                   |   2 +
  default-configs/x86_64-softmmu.mak                 |   2 +
  exec.c                                             | 113 ++-
  hmp.c                                              |   2 +-
  hw/Makefile.objs                                   |   2 +-
  hw/acpi/aml-build.c                                |  83 ++
  hw/acpi/ich9.c                                     |   8 +-
  hw/acpi/memory_hotplug.c                           |  26 +-
  hw/acpi/piix4.c                                    |   8 +-
  hw/i386/acpi-build.c                               |   4 +
  hw/i386/pc.c                                       |  37 +-
  hw/mem/Makefile.objs                               |   3 +
  hw/mem/{pc-dimm.c => dimm.c}                       | 240 ++---
  hw/mem/nvdimm/acpi.c                               | 961 +++++++++++++++++++++
  hw/mem/nvdimm/internal.h                           |  41 +
  hw/mem/nvdimm/namespace.c                          | 309 +++++++
  hw/mem/nvdimm/nvdimm.c                             | 136 +++
  hw/mem/pc-dimm.c                                   | 506 +----------
  hw/ppc/spapr.c                                     |  20 +-
  include/hw/acpi/aml-build.h                        |   8 +
  include/hw/i386/pc.h                               |   4 +-
  include/hw/mem/{pc-dimm.h => dimm.h}               |  68 +-
  include/hw/mem/nvdimm.h                            |  58 ++
  include/hw/mem/pc-dimm.h                           | 105 +--
  include/hw/ppc/spapr.h                             |   2 +-
  include/qemu/osdep.h                               |   1 +
  numa.c                                             |   4 +-
  qapi-schema.json                                   |   8 +-
  qmp.c                                              |   4 +-
  stubs/Makefile.objs                                |   2 +-
  ...c_dimm_device_list.c => qmp_dimm_device_list.c} |   4 +-
  target-ppc/kvm.c                                   |  21 +-
  trace-events                                       |   8 +-
  util/oslib-posix.c                                 |  16 +
  util/oslib-win32.c                                 |   5 +
  37 files changed, 2023 insertions(+), 862 deletions(-)
  rename hw/mem/{pc-dimm.c => dimm.c} (65%)
  create mode 100644 hw/mem/nvdimm/acpi.c
  create mode 100644 hw/mem/nvdimm/internal.h
  create mode 100644 hw/mem/nvdimm/namespace.c
  create mode 100644 hw/mem/nvdimm/nvdimm.c
  rewrite hw/mem/pc-dimm.c (91%)
  rename include/hw/mem/{pc-dimm.h => dimm.h} (50%)
  create mode 100644 include/hw/mem/nvdimm.h
  rewrite include/hw/mem/pc-dimm.h (97%)
  rename stubs/{qmp_pc_dimm_device_list.c => qmp_dimm_device_list.c} (56%)

--
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to address@hidden
More majordomo info at  http://vger.kernel.org/majordomo-info.html






reply via email to

[Prev in Thread] Current Thread [Next in Thread]