
Re: [Qemu-devel] [PATCH 1/9] dma-helpers: Expose the sg mapping logic


From: Alex Pyrgiotis
Subject: Re: [Qemu-devel] [PATCH 1/9] dma-helpers: Expose the sg mapping logic
Date: Thu, 25 Feb 2016 12:10:52 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0

Hi Paolo,

Thanks a lot for your clarifications. See my comments inline:

(tl;dr: I suggest we reconsider Fam Zheng's attempt to remove the global
bounce buffer, which would make dma-helpers simpler and unblock this patch)

On 02/22/2016 12:43 PM, Paolo Bonzini wrote:
> 
> 
> On 19/02/2016 12:50, Alex Pyrgiotis wrote:
>> QEMU/Hardware space:
>> 5. The SCSI controller code will create a QEMUSGList that points to
>>    the memory regions of the SCSI request. This QEMUSGList will also
>>    include the MMIO regions.
>> 6. The QEMU device implementation, e.g. scsi-block, chooses to use
>>    the dma_* interface.
>> 7. The dma_blk_read/write() code will ultimately attempt to map all the
>>    memory regions pointed by the QEMUSGList in order to create a
>>    QEMUIOVector.
>> 8. At some point during the mapping loop, the code will encounter an
>>    MMIO region. Since reading and writing from/to an MMIO region
>>    requires  special handling, e.g., we need to call
>>    MemoryRegion->ops->write(), we cannot include it in our read/write
>>    system call to the host kernel.

This step and the next one were not clear to me, but thanks to your
comments, I now get what's happening behind the scenes. So, let's reiterate:

All normal regions in a QEMUSGList point to an address range in the
guest's RAM. The MMIO regions of QEMU's virtual devices, however, do not
correspond to such an address range, so QEMU must create a bounce buffer
to represent them. This bounce buffer is added to the I/O vector that
contains the rest of the mapped addresses and is later passed to a
readv()/writev() call.

>> 9. This leads to a partial read/write and the mapping loop will resume
>>    once the partial read/write() has finished.

The MMIO region is the trigger for a partial read/write, but it's not
the actual reason. The actual reason is that there is only *one*
*global* bounce buffer. This means that if it's already in use, or if we
need to use it twice, we will have to wait.

>> Are we in the same page so far?
> 
> Yes.
> 
>> Are the above OK? If so, I have some questions:
>>
>> a) Is an MMIO region one of the reasons why we can't map an sg?
> 
> Yes, the only one pretty much.
> 
>> b) At which point will the relevant ops->write() method for the MMIO
>>    region be called when we have to DMA into the region?? Is it handled
>>    implicitly in dma_memory_map()?
> 
> It's in address_space_unmap:
> 
>     if (is_write) {
>         address_space_write(as, bounce.addr, MEMTXATTRS_UNSPECIFIED,
>                             bounce.buffer, access_len);
>     }
> 
> Likewise, address_space_map does the ops->read call through
> address_space_read.
> 
>> c) I'm not quite sure about the logic of the "nothing mapped" section.
>>    Correct me if I'm wrong, but what I think it does is that it
>>    registers a callback (reschedule_dma) once some sort of mapping has
>>    completed. What kind of mapping is this? Is there anything more to
>>    it?
> 
> Once something (presumably a concurrent user of dma-helpers.c) calls
> address_space_unmap to free the mapping (the bounce.buffer in the above
> address_space_write call), reschedule_dma is called.
>
>>> However, it is not possible to do the same for ioctls.  This is actually
>>> the reason why no one has ever tried to make scsi-generic do anything
>>> but bounce-buffering. I think that your code breaks horribly in this
>>> case, and I don't see a way to fix it, except for reverting to bounce
>>> buffering.

Sure, you're right, there's no sensible way to break an ioctl()
operation into many smaller ones. However, I'd argue that we shouldn't
need to, as it would be much better if the DMA operation were never
restarted in the first place. Instead, if we dealt with the bigger issue
of the global bounce buffer, we could kill two birds with one stone.

I see that there was an attempt [1] to replace the global bounce buffer
with something more dynamic. In short, the objections [2] were the
following:

1. It introduced locking/unlocking a global mutex in the hot path as
   well as a hash table lookup.
2. It allowed for unbounded memory allocations.

An improvement that would address (1) is to get rid of any global state:

Since the mapping operation takes place in the context of a DMA
operation, we could provide a ctx-type struct to the dma_memory_(un)map
--> address_space_(un)map functions that would contain a hash table. If
any memory allocations were needed, we would track them using that hash
table, which would require no locks. Moreover, if the initialization of
the hash table hurts performance in the general case, we could instead
use a skip list, since the number of memory allocations should be small
(e.g. < 100).

If a mapping operation does not take place in the context of a DMA
operation, we could pass NULL and the address_space_(un)map code would
fallback to the global bounce buffer. Having a fallback would allow for
a smooth transition.

As for the point raised in (2), we can have a limit on the allocated
pages, e.g. 1024. If a well-behaved guest reaches that limit, allocation
will resume once a request has completed. For misbehaving guests, this
means that their requests will block. Is this reasonable?
>>> This would require major changes in your patches, and I'm not sure
>>> whether they are worth it for the single use case of tape devices...
>>
>> Well, I wouldn't narrow it down to tape devices. The scsi-generic
>> interface is the universal interface for all SCSI devices and the only
>> interface that is fully passthrough.
> 
> Sure, but what's the advantage of a fully passthrough interface over
> scsi-block?

This is not a matter of advantage between the two interfaces. A fully
passthrough interface is simply an alternative data path which one
should be able to test and benchmark.

>> So this patch would effectively
>> boost the baseline performance of SCSI devices. I think it's worth a try.
> 
> I think the first step is understanding what to do about the weird "&
> ~BDRV_SECTOR_MASK" case, then.

We can discuss this case in the "dma-helpers: Do not truncate small
qiovs" thread. I'm all for the removal of this check, but I was hoping
that Kevin would clarify whether it's still relevant.

Cheers,
Alex


[1] http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg01776.html
[2] http://lists.nongnu.org/archive/html/qemu-devel/2015-03/msg01884.html


