From: Andy Lutomirski
Subject: Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
Date: Mon, 25 Apr 2022 07:52:38 -0700
User-agent: Cyrus-JMAP/3.7.0-alpha0-569-g7622ad95cc-fm-20220421.002-g7622ad95


On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
>> 

>> 
>> 2. Bind the memfile to a VM (or at least to a VM technology).  Now it's in 
>> the initial state appropriate for that VM.
>> 
>> For TDX, this completely bypasses the cases where the data is prepopulated 
>> and TDX can't handle it cleanly.  For SEV, it bypasses a situation in which 
>> data might be written to the memory before we find out whether that data 
>> will be unreclaimable or unmovable.
>
> This sounds like a stricter rule that avoids unclear semantics.
>
> So userspace needs to know what exactly happens for a 'bind' operation.
> This differs when binding to different technologies. E.g. for SEV, it
> may imply that after this call the memfile can be accessed (through mmap or
> whatever) from userspace, while for current TDX this should not be allowed.

I think this is actually a good thing.  While SEV, TDX, pKVM, etc achieve 
similar goals and have broadly similar ways of achieving them, they really are 
different, and having userspace be aware of the differences seems okay to me.

(Although I don't think that allowing userspace to mmap SEV shared pages is 
particularly wise -- it will result in faults or cache incoherence depending on 
the variant of SEV in use.)
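
To make the bind step concrete, here's roughly what I have in mind -- a
sketch only, with every name below (KVM_MEMFILE_BIND, the flag bits, the
ioctl number) invented for illustration, not taken from this series:

#include <linux/ioctl.h>
#include <linux/types.h>
#include <sys/ioctl.h>

/* Everything here is invented for illustration -- not real uAPI. */
struct kvm_memfile_bind {
        __u32 fd;       /* memfd to bind, before any pages are populated */
        __u32 flags;    /* selects the technology and its semantics */
};

#define KVM_MEMFILE_BIND_TDX    (1u << 0)  /* userspace may never mmap */
#define KVM_MEMFILE_BIND_SEV    (1u << 1)  /* mmap ok until sealed private */

#define KVM_MEMFILE_BIND _IOW(0xAE /* KVMIO */, 0xf0, struct kvm_memfile_bind)

static int bind_memfile(int vm_fd, int memfd, __u32 flags)
{
        struct kvm_memfile_bind bind = { .fd = memfd, .flags = flags };

        return ioctl(vm_fd, KVM_MEMFILE_BIND, &bind);
}

The point is that the technology is chosen before any page can exist, so
the file is always in that technology's initial state.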

>
> And I feel we still need a third flow/operation to indicate the
> completion of the initialization on the memfile before the guest's
> first-time launch. SEV needs to check that previously mmap-ed areas are
> munmap-ed and to prevent future userspace access. After this point, the
> memfile becomes a truly private fd.

Even that is technology-dependent.  For TDX, this operation doesn't really 
exist.  For SEV, I'm not sure (I haven't read the specs in nearly enough 
detail).  For pKVM, I guess it does exist and isn't quite the same as a 
shared->private conversion.

Maybe this could be generalized a bit as an operation "measure and make 
private" that would be supported by the technologies for which it's useful.


>
>> 
>> 
>> ----------------------------------------------
>> 
>> Now I have a question, since I don't think anyone has really answered it: 
>> how does this all work with SEV- or pKVM-like technologies in which private 
>> and shared pages share the same address space?  It sounds like you're 
>> proposing to have a big memfile that contains private and shared pages and 
>> to use that same memfile as pages are converted back and forth.  IO and even 
>> real physical DMA could be done on that memfile.  Am I understanding 
>> correctly?
>
> For the TDX case, and probably SEV as well, this memfile contains private
> memory only. But this design at least makes it possible for usage cases like
> pKVM, which wants both private and shared memory in the same memfile and
> relies on other ways, like mmap/munmap or mprotect, to toggle private/shared
> instead of fallocate/hole punching.

Hmm.  Then we still need some way to get KVM to generate the correct SEV 
pagetables.  For TDX, there are private memslots and shared memslots, and they 
can overlap.  If they overlap and both contain valid pages at the same address, 
then the results may not be what the guest-side ABI expects, but everything 
will work.  So, when a single logical guest page transitions between shared and 
private, no change to the memslots is needed.  For SEV, this is not the case: 
everything is in one set of pagetables, and there isn't a natural way to 
resolve overlaps.

If the memslot code becomes efficient enough, then the memslots could be 
fragmented.  Or the memfile could support private and shared data in the same 
memslot.  And if pKVM does this, I don't see why SEV couldn't also do it and 
hopefully reuse the same code.
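
To spell out the contrast (all the types and helpers below are made up;
this is not real KVM code):

#include <stdbool.h>
#include <stdint.h>

struct memslots;                        /* stand-in types, not KVM's */
struct memslot;
struct vm {
        struct memslots *shared_slots;  /* TDX-like: two slot spaces */
        struct memslots *private_slots;
        struct memslots *slots;         /* SEV-like: just one */
        uint64_t gpa_shared_bit;        /* guest-ABI shared/private GPA bit */
};
struct memslot *slot_find(struct memslots *s, uint64_t gpa);

/* TDX-like: the GPA bit picks the slot space, so "overlapping" shared
 * and private slots never actually collide. */
struct memslot *tdx_lookup(struct vm *vm, uint64_t gpa)
{
        bool shared = gpa & vm->gpa_shared_bit;

        return slot_find(shared ? vm->shared_slots : vm->private_slots,
                         gpa & ~vm->gpa_shared_bit);
}

/* SEV-like: one set of pagetables, one slot space; there is no second
 * space to consult, so the slot (or the memfile behind it) must track
 * which pages are currently private. */
struct memslot *sev_lookup(struct vm *vm, uint64_t gpa)
{
        return slot_find(vm->slots, gpa);
}

Whatever tracks private/shared state for SEV has to live below the
memslot layer, which is exactly where the fragmented-memslots-vs-mixed-
memfile question comes from.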

>
>> 
>> If so, I think this makes sense, but I'm wondering if the actual memslot 
>> setup should be different.  For TDX, private memory lives in a logically 
>> separate memslot space.  For SEV and pKVM, it doesn't.  I assume the API can 
>> reflect this straightforwardly.
>
> I believe so. The flow should be similar, but we do need to pass different
> flags during the 'bind' to the backing store for different usages. That
> should be some new flags for pKVM, but the callbacks (the API here) between
> memfile_notifier and its consumers can be reused.

And also some different flag in the operation that installs the fd as a memslot?
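
Something like this at install time, where I'm guessing at the shape
(KVM_MEM_PRIVATE, the extended struct, and the pKVM-style flag are all
assumptions on my part, not quotes from the series):

#include <linux/kvm.h>
#include <linux/types.h>

/* Assumed shape: the ordinary region plus the private fd behind it. */
struct kvm_userspace_memory_region_ext {
        struct kvm_userspace_memory_region region;
        __u64 private_offset;   /* offset into the bound memfile */
        __u32 private_fd;       /* the memfile itself */
        __u32 pad[5];
};

/* KVM_MEM_LOG_DIRTY_PAGES is bit 0 and KVM_MEM_READONLY bit 1 today,
 * so these would come next -- both names assumed/invented: */
#define KVM_MEM_PRIVATE         (1u << 2)  /* TDX: private pages only */
#define KVM_MEM_PRIVATE_MIXED   (1u << 3)  /* pKVM/SEV: both in one slot */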

>
>> 
>> And the corresponding TDX question: is the intent still that shared pages 
>> aren't allowed at all in a TDX memfile?  If so, that would be the most 
>> direct mapping to what the hardware actually does.
>
> Exactly. TDX will still use fallocate/hole punching to turn the private
> page on or off. Once off, the traditional shared page becomes effective
> in KVM.

Works for me.
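
And for reference, the conversion in that scheme is just fallocate(2) on
the memfile (these are the real flags; the per-call granularity is up to
the VMM):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>

/* Private -> shared: punch a hole so the private memfile no longer
 * backs the range; KVM then falls back to the normal shared memslot. */
static int make_shared(int memfd, off_t offset, off_t len)
{
        return fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, len);
}

/* Shared -> private: allocate backing in the private memfile again. */
static int make_private(int memfd, off_t offset, off_t len)
{
        return fallocate(memfd, 0, offset, len);
}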

For what it's worth, I still think it should be fine to land all the TDX 
memfile bits upstream as long as we're confident that SEV, pKVM, etc can be 
added on without issues.

I think we can increase confidence in this by either getting one other 
technology's maintainers to get far enough along in the design to be confident 
and/or by having a pure-kernel-software implementation that serves as a 
testbed.  For the latter, maybe it could support two different models with 
little overhead:

Pure software "interleaved" model: pages are shared or private and a hypercall 
converts them.  The access mode is entirely determined by the state programmed 
by hypercall.  I think this is essentially what Vishal implemented, but with 
the "HACK" replaced by something permanent and (if they're not already in the 
series) appropriate access checks implemented to actually protect the private 
memory.
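
Guest side, that's one hypercall per transition.  Something like this,
if it rode on the existing KVM_HC_MAP_GPA_RANGE (a new hypercall number
would work just as well):

#include <linux/kvm_para.h>     /* kvm_hypercall3(), KVM_HC_MAP_GPA_RANGE */
#include <linux/types.h>

/* Flip one 4K guest page between shared and private; the host-side
 * access checks are what actually enforce the state. */
static long set_page_private(unsigned long gpa, bool private)
{
        return kvm_hypercall3(KVM_HC_MAP_GPA_RANGE, gpa,
                              1 /* npages, 4K each */,
                              private ? KVM_MAP_GPA_RANGE_ENCRYPTED
                                      : KVM_MAP_GPA_RANGE_DECRYPTED);
}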

Pure software "separate" mode: one GPA bit is set aside as the shared vs 
private bit.  The normal memslots are restricted to the shared half of GPA 
space.  Private memslots use the private half.  This works a lot like TDX.  
This would be new code.  We don't *really* need this for testing, since TDX 
itself exercises the same programming model, but it would let people without 
TDX hardware exercise the interesting bits of the memory management.
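
Concretely, the address-space split is just one stolen bit (bit 51
picked arbitrarily here for illustration):

#include <linux/types.h>

/* One stolen GPA bit splits the space, TDX-style.  Which bit is a
 * per-VM configuration choice; 51 is arbitrary. */
#define SW_GPA_SHARED_BIT       (1ULL << 51)

static inline __u64 shared_gpa(__u64 gpa)
{
        return gpa | SW_GPA_SHARED_BIT;
}

static inline __u64 private_gpa(__u64 gpa)
{
        return gpa & ~SW_GPA_SHARED_BIT;
}

/* Normal memslots may only cover GPAs with the bit set; private
 * memslots only GPAs with it clear -- so the two can never overlap. */
static inline int gpa_is_shared(__u64 gpa)
{
        return !!(gpa & SW_GPA_SHARED_BIT);
}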

Paolo, etc: what do you think?

>
> Chao
>> 
>> --Andy


