[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

From: Andrzej Jakowski
Subject: Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
Date: Wed, 22 Jul 2020 11:14:15 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.9.0

On 7/22/20 10:21 AM, Klaus Jensen wrote:
> On Jul 22 10:00, Andrzej Jakowski wrote:
>> On 7/22/20 12:43 AM, Klaus Jensen wrote:
>>> @keith, please see below - can you comment on the Linux kernel 2 MB
>>> boundary requirement for the CMB? Or should we hail Stephen (or Logan
>>> maybe) since this seems to be related to p2pdma?
>>> On Jul 21 14:54, Andrzej Jakowski wrote:
>>>> On 7/15/20 1:06 AM, Klaus Jensen wrote:
>>>>> Hi Andrzej,
>>>>> I've not been ignoring this, but sorry for not following up earlier.
>>>>> I'm hesitent to merge anything that very obviously breaks an OS that we
>>>>> know is used a lot to this using this device. Also because the issue has
>>>>> not been analyzed well enough to actually know if this is a QEMU or
>>>>> kernel issue.
>>>> Hi Klaus,
>>>> Thx for your response! I understand your hesitance on merging stuff that
>>>> obviously breaks guest OS. 
>>>>> Now, as far as I can test, having the MSI-X vector table and PBA in BAR
>>>>> 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
>>>>> (irregardless of IOMMU on/off).
>>>>> Later, when the issue is better understood, we can add options to set
>>>>> offsets, BIRs etc.
>>>>> The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
>>>>> be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
>>>>> git://git.infradead.org/qemu-nvme.git nvme-next branch.
>>>>> Can you reproduce the issues with that patch? I can't on a stock Arch
>>>>> Linux 5.7.5-arch1-1 kernel.
>>>> While I'm happy that approach with MSIX and PBA in BAR0 works fine, I
>>>> feel that investigation part why it works while mine doesn't is
>>>> missing. It looks to me that both patches are basically following same 
>>>> approach: create memory subregion and overlay on top of other memory
>>>> region. Why one works and the other doesn't then?
>>>> Having in mind that, I have recently focused on understanding problem.
>>>> I observed that when guest assigns address to BAR4, addr field in
>>>> nvme-bar4 memory region gets populated, but it doesn't get automatically
>>>> populated in ctrl_mem (cmb) memory subregion, so later when 
>>>> nvme_addr_is_cmb() 
>>>> is called address check works incorrectly and as a consequence vmm does 
>>>> dma 
>>>> read instead of memcpy.
>>>> I created a patch that sets correct address on ctrl_mem subregion and 
>>>> guest 
>>>> OS boots up correctly.
>>>> When I looked into pci and memory region code I noticed that indeed address
>>>> is only assigned to top level memory region but not to contained 
>>>> subregions.
>>>> I think that because in your approach cmb grabs whole bar exclusively it 
>>>> works
>>>> fine.
>>>> Here is my question (perhaps pci folks can help answer :)): if we consider 
>>>> memory region overlapping for pci devices as valid use case should pci 
>>>> code on configuration cycles walk through all contained subregion and
>>>> update addr field accordingly?
>>>> Thx!
>>> Hi Andrzej,
>>> Thanks for looking into this. I think your analysis helped me nail this.
>>> The problem is that we added the use of a subregion and have some
>>> assumptions that no longer hold.
>>> nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
>>> But when the memory region is a subregion, addr holds an offset into the
>>> parent container instead. Thus, changing all occurances of
>>> n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
>>> (this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
>>> in your original patch[1]. The reason my version worked is because there
>>> was no subregion involved for the CMB, so the existing address
>>> validation calculations were still correct.
>> I'm a little bit concerned with this approach:
>> (n->bar0.addr + n->ctrl_mem.addr) and hoping to have some debate. Let me 
>> describe my understanding of the problem.
> Oh. In the context of your patch I meant bar4 of course, but anyway.
>> It looks to me that addr field sometimes contains *absolute* address (when 
>> no 
>> hierarchy is used) and other times it contains *relative* address (when
>> hierarchy is created). From my perspective use of this field is inconsistent
>> and thus error-prone.  
>> Because of that I think that doing n->bar0.addr + n->ctrl_mem.addr doesn't
>> solve root problem and is still prone to the same problem if in the future
>> we potentially build even more complicated hierarchy.
>> I think that we could solve it by introducing helper function like
>> hwaddr memory_region_get_abs_addr(MemoryRegion *mr) 
>> to retrieve absolute address and in the documentation indicate that addr 
>> field
>> can be relative or absolute and it is recommended to use above function to 
>> retrieve absolute address.
>> What do you think?
> I'm all for a helper - I was not gonna cheer for the quick'n'dirty fix I
> did just to convince myself that this was the issue ;)
> I think the helper might already be there in memory.c. It's just not
> exported.
>    static hwaddr memory_region_to_absolute_addr(MemoryRegion *mr, hwaddr 
> offset)

Awesome! I didn't notice. Let me check and repost series soon :)

>>> This leaves us with the Linux kernel complaining about not being able to
>>> register the CMB if it is not on a 2MB boundary - this is probably just
>>> a constraint in the kernel that we can't do anything about (but I'm no
>>> kernel hacker...), which can be fixed by either being "nice" towards the
>>> Linux kernel by forcing a 2 MB alignment in the device or exposing the
>>> SZU field such that the user can choose 16MiB size units (or higher)
>>> instead of 1MiB. I'm leaning towards ensuring the 2 MB alignment in the
>>> device such that we do not have to introduce new cmb_size parameters,
>>> while also making it easier for the user to configure. But I'm not
>>> really sure.
>> This is kernel limitation that we have to live with. The minimum granularity
>> of devm_memremap_pages() is 2MB and it must be 2MB aligned. It used to worse
>> at 128MB size+align (section-size), but sub-section memory-hotplug patches 
>> adjusted that to a 2MB section. 
> Thanks for that explanation!
>> Ensuring 2MiB size and alignment in the device emulation makes sense to me.
>> Perhaps we could document that limitations - making user more aware of it.
> Sounds good to me.

reply via email to

[Prev in Thread] Current Thread [Next in Thread]