
Re: [Qemu-devel] [RFC] qid path collision issues in 9pfs


From: Veaceslav Falico
Subject: Re: [Qemu-devel] [RFC] qid path collision issues in 9pfs
Date: Fri, 19 Jan 2018 17:37:58 +0100
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.5.2

On 1/19/2018 4:52 PM, Eduard Shishkin wrote:
> 
> 
> On 1/19/2018 11:27 AM, Greg Kurz wrote:
>> On Mon, 15 Jan 2018 11:49:31 +0800
>> Antonios Motakis <address@hidden> wrote:
>>
>>> On 13-Jan-18 00:14, Greg Kurz wrote:
>>>> On Fri, 12 Jan 2018 19:32:10 +0800
>>>> Antonios Motakis <address@hidden> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>
>>>> Hi Antonios,
>>>>
>>>> I see you have attached a patch to this email... this really isn't the
>>>> preferred way to do things, since it prevents commenting on the patch
>>>> (at least with my mail client). The appropriate way would have been to
>>>> send the patch with a cover letter, using git-send-email for example.
>>>
>>> I apologize for attaching the patch, I should have known better!
>>>
>>
>> np :)
>>
>>>>
>>>>> We have found an issue in the 9p implementation of QEMU, with how qid
>>>>> paths are generated, which can cause qid path collisions and several
>>>>> problems resulting from them. In our use case (running containers under
>>>>> VMs) these have proven to be critical.
>>>>>
>>>>
>>>> Ouch...
>>>>
>>>>> In particular, stat_to_qid in hw/9pfs/9p.c generates a qid path using the
>>>>> inode number of the file as input. According to the 9p spec, the path
>>>>> should uniquely identify a file; distinct files should not share a path
>>>>> value.
>>>>>
>>>>> The current implementation, which defines qid.path = inode nr, works fine
>>>>> as long as there are no files from multiple partitions visible under the
>>>>> 9p share. Otherwise, distinct files from different devices are allowed to
>>>>> have the same inode number, so with multiple partitions we have a very
>>>>> high probability of qid path collisions.
>>>>>
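For context, the mapping under discussion boils down to roughly the following
(a simplified sketch with local type and helper names, not the actual QEMU
code): qid.path is filled from st_ino alone, so st_dev never enters the
picture.

#include <stdint.h>
#include <sys/stat.h>

/* Illustrative qid layout; names are local to this sketch. */
typedef struct {
    uint8_t  type;
    uint32_t version;
    uint64_t path;
} SketchQid;

/* The path is derived from the inode number only.  Two files on
 * different devices that happen to share an inode number therefore
 * end up with identical qid paths. */
static void sketch_stat_to_qid(const struct stat *st, SketchQid *qid)
{
    qid->path = (uint64_t)st->st_ino;                 /* st_dev ignored */
    qid->version = (uint32_t)(st->st_mtime ^ (st->st_size << 8));
    qid->type = S_ISDIR(st->st_mode) ? 0x80 /* QTDIR */ : 0x00;
}
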
>>>>> How to demonstrate the issue:
>>>>> 1) Prepare a problematic share:
>>>>>  - mount one partition under share/p1/ with some files inside
>>>>>  - mount another one *with identical contents* under share/p2/
>>>>>  - confirm that both partitions have files with the same inode nr, size, etc.
>>>>> 2) Demonstrate breakage:
>>>>>  - start a VM with a virtio-9p device pointing to the share
>>>>>  - mount the 9p share with FSCACHE on
>>>>>  - keep share/p1/file open
>>>>>  - open and write to share/p2/file
>>>>>
>>>>> What should happen is that the guest will consider share/p1/file and
>>>>> share/p2/file to be the same file, and since we are using the cache it
>>>>> will not reopen it. We intended to write to partition 2, but we just
>>>>> wrote to partition 1. This is just one example of how the guest may rely
>>>>> on qid paths being unique.
>>>>>
>>>>> In the use case of containers, where we commonly have a few containers per
>>>>> VM, all based on similar images, these kinds of qid path collisions are
>>>>> very common and they seem to cause all kinds of funny behavior (sometimes
>>>>> very subtle).
>>>>>
>>>>> To avoid this situation, the device id of a file also needs to be taken
>>>>> as input for generating a qid path. Unfortunately, inode nr + device id
>>>>> together would be 96 bits, while we have only 64 bits for the qid path,
>>>>> so we can't just append them and call it a day :(
>>>>>
>>>>> We have thought of a few approaches, but we would definitely like to hear 
>>>>> what the upstream maintainers and community think:
>>>>>
>>>>> * Full fix: Change the 9p protocol
>>>>>
>>>>> We would need to support a longer qid path, based on a virtio feature
>>>>> flag. This would require reworking the host and guest parts of virtio-9p,
>>>>> i.e. both QEMU and Linux for most users.
>>>>>
>>>>
>>>> I agree on a longer qid path, but we shouldn't tie it to a virtio flag,
>>>> since 9p is transport agnostic and happens to be used with a variety of
>>>> transports. QEMU has both virtio-9p and a Xen backend, for example.
>>>>
>>>>> * Fallback and/or interim solutions
>>>>>
>>>>> A virtio feature flag may be refused by the guest, so we think we still 
>>>>> need to make collisions less likely even with 64 bit paths. E.g.
>>>>
>>>> In all cases, we would need a fallback solution to support current
>>>> guest setups. Also, there are several 9p server implementations out
>>>> there (ganesha, diod, kvmtool) that are currently used with Linux
>>>> clients... it will take some time to get everyone in sync :-\
>>>>
>>>>> 1. XOR the device id with inode nr to produce the qid path (we attach a 
>>>>> proof of concept patch)
>>>>
>>>> Hmm... this would still allow collisions. Not good for fallback.
>>>>
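The attached proof of concept isn't shown in the archive; the general idea
amounts to folding the device id into the 64-bit path, roughly like the
following (a hypothetical helper, one possible variant, not the actual patch):

#include <stdint.h>
#include <sys/stat.h>

/* Fold the device id into the 64-bit qid path.  Shifting st_dev into
 * the high bits before XOR-ing keeps small inode numbers distinct
 * across devices, but two different (dev, ino) pairs can still XOR to
 * the same value, which is why this can only be a fallback. */
static uint64_t qid_path_xor(const struct stat *st)
{
    return (uint64_t)st->st_ino ^ ((uint64_t)st->st_dev << 32);
}
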
>>>>> 2. 64 bit hash of device id and inode nr
>>>>
>>>> Same here.
>>>>
>>>>> 3. other ideas, such as allocating new qid paths and keeping a lookup
>>>>> table of some sort (possibly too expensive)
>>>>>
>>>>
>>>> This would be acceptable for fallback.
>>>
>>> Maybe we can use the QEMU hash table 
>>> (https://github.com/qemu/qemu/blob/master/util/qht.c) but I wonder if it 
>>> scales to millions of entries. Do you think it is appropriate in this case?
>>>
>>
>> I don't know QHT, hence Cc'ing Emilio for insights.
>>
>>> I was thinking about how to implement something like this without having to
>>> maintain millions of entries... One option we could consider is to split
>>> the bits into a directly-mapped part and a lookup part. For example:
>>>
>>> Inode =
>>> [ 10 first bits ] + [ 54 lowest bits ]
>>>
>>> A hash table maps [ inode 10 first bits ] + [ device id ] => [ 10 bit 
>>> prefix ]
>>> The prefix is uniquely allocated for each input.
>>>
>>> Qid path =
>>> [ 10 bit prefix ] + [ inode 54 lowest bits ]
>>>
>>> Since inodes are not completely random, and we usually have a handful of 
>>> device IDs, we get a much smaller number of entries to track in the hash 
>>> table.
>>>
>>> So what this would give:
>>> (1)    Would be faster and take less memory than mapping the full
>>> inode_nr,dev_id tuple to unique QID paths
>>> (2)    Guaranteed not to run out of bits when inode numbers stay below the
>>> lowest 54 bits and we have fewer than 1024 devices.
>>> (3)    When we get beyond this limit, there is a chance we run out of
>>> bits to allocate new QID paths, but we can detect this and refuse to serve
>>> the offending files instead of allowing a collision.
>>>
>>> We could tweak the prefix size to match the scenarios that we consider more
>>> likely, but I think something close to 10-16 bits sounds reasonable. What do
>>> you think?
>>>
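A sketch of how such a prefix table might look (hypothetical names; a linear
scan over a fixed array stands in for a real hash table such as QHT):

#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>

#define PREFIX_BITS   10
#define INODE_BITS    (64 - PREFIX_BITS)            /* 54 */
#define INODE_MASK    ((UINT64_C(1) << INODE_BITS) - 1)
#define MAX_PREFIXES  (1U << PREFIX_BITS)           /* 1024 */

/* One entry per (device id, top inode bits) pair seen so far. */
typedef struct {
    dev_t    dev;
    uint64_t ino_high;   /* top PREFIX_BITS of the inode number */
    int      in_use;
} PrefixEntry;

static PrefixEntry prefix_table[MAX_PREFIXES];

/* Map (st_dev, st_ino) to a 64-bit qid path:
 *   [ allocated prefix : 10 bits ][ low inode bits : 54 bits ]
 * Returns 0 on success, -1 once the prefix space is exhausted, in
 * which case the server should refuse to serve the file rather than
 * risk a collision. */
static int qid_path_prefixed(const struct stat *st, uint64_t *path)
{
    uint64_t ino_high = (uint64_t)st->st_ino >> INODE_BITS;
    uint32_t i;

    for (i = 0; i < MAX_PREFIXES; i++) {
        if (!prefix_table[i].in_use) {
            /* First time we see this (dev, ino_high): claim a prefix. */
            prefix_table[i].dev = st->st_dev;
            prefix_table[i].ino_high = ino_high;
            prefix_table[i].in_use = 1;
        }
        if (prefix_table[i].dev == st->st_dev &&
            prefix_table[i].ino_high == ino_high) {
            *path = ((uint64_t)i << INODE_BITS) |
                    ((uint64_t)st->st_ino & INODE_MASK);
            return 0;
        }
    }
    return -1;
}

Whatever the lookup structure, the table only grows with the number of
distinct (device, high-bits) pairs, not with the number of files.
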
>>
>> I think that should work. Please send a patchset :)
> 
> Hi Greg,
> 
> This is based on the assumption that the high bits of inode numbers are
> always zero, which is unacceptable from my standpoint. The inode number
> allocator is a per-volume file system feature, so allocators may appear
> that don't use a simple increment and that assign inode numbers with
> non-zero high bits even for the first file of a volume.
>
> As I understand it, the problem is that the guest doesn't have enough
> information to properly identify files (in our case share/p1/file
> and share/p2/file are wrongly identified as the same file). It seems
> that we need to transmit the device ID to the guest, and not in the high
> bits of the inode number.
>
> AFAIK Slava has patches for such a transmission; I hope that he will send
> them eventually.

They're pretty trivial; however, they'll require great effort and coordination
from all parties involved, as they break backwards compatibility (the QID
becomes bigger and thus breaks "Q" transmission).
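
For reference, a sketch of the fixed 13-byte qid layout that 9P2000 puts on
the wire (local struct name, not QEMU's own definitions); widening the path
field changes the size of every message carrying a qid, which is why both
ends would have to negotiate the new format:

#include <stdint.h>

/* 9P2000 qid as transmitted on the wire: 13 bytes, fixed layout. */
struct p9_wire_qid_sketch {
    uint8_t  type;     /* file type bits (dir, symlink, ...) */
    uint32_t version;  /* version, used for cache validation */
    uint64_t path;     /* unique id of the file on the server */
} __attribute__((packed));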

I'd like to raise another issue - it seems that st_dev and st_ino aren't
enough:

mkdir -p /tmp/mounted t1 t2
touch /tmp/mounted/file
mount -o bind /tmp/mounted t1
mount -o bind /tmp/mounted t2
stat t{1,2}/file  | grep Inode
Device: 805h/2053d      Inode: 42729487    Links: 3
Device: 805h/2053d      Inode: 42729487    Links: 3

In that case, even with the patch applied, we'll still have the issue of
colliding QIDs guest-side - so, for example, after umounting t1, t2/file
becomes inaccessible, as the inode points to t1...

I'm not sure how to fix this one. Ideas?

> 
> Thanks,
> Eduard.
> 
>>
>>>>
>>>>> With our proof of concept patch, the issues caused by qid path collisions 
>>>>> go away, so it can be seen as an interim solution of sorts. However, the 
>>>>> chance of collisions is not eliminated, we are just replacing the current 
>>>>> strategy, which is almost guaranteed to cause collisions in certain use 
>>>>> cases, with one that makes them more rare. We think that a virtio feature 
>>>>> flag for longer qid paths is the only way to eliminate these issues 
>>>>> completely.
>>>>>
>>>>> This is the extent that we were able to analyze the issue from our side, 
>>>>> we would like to hear if there are some better ideas, or which approach 
>>>>> would be more appropriate for upstream.
>>>>>
>>>>> Best regards,
>>>>>
>>>>
>>>> Cheers,
>>>>
>>>> -- 
>>>> Greg
>>>>
>>>
>>
> 
> 




