qemu-devel

Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against


From: Andy Lutomirski
Subject: Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK
Date: Tue, 12 Apr 2022 14:27:52 -0700
User-agent: Cyrus-JMAP/3.7.0-alpha0-386-g4174665229-fm-20220406.001-g41746652

On Tue, Apr 12, 2022, at 7:36 AM, Jason Gunthorpe wrote:
> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
>
>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the
>> past already with secretmem, it's not 100% that good of a fit (unmovable
>> is worse than mlocked). But it gets the job done for now at least.
>
> No, it doesn't. There are too many different interpretations of how
> MEMLOCK is supposed to work
>
> eg VFIO accounts per-process so hostile users can just fork to go past
> it.
>
> RDMA is per-process but uses a different counter, so you can double up
>
> iouring is per-user and uses a 3rd counter, so it can triple up on
> the above two
>
>> So I'm open to alternatives to limit the amount of unmovable memory we
>> might allocate for user space, and then we could convert secretmem as well.
>
> I think it has to be cgroup based considering where we are now :\
>

So this is another situation where the actual backend (TDX, SEV, pKVM, pure 
software) makes a difference -- depending on exactly what backend we're using, 
the memory may not be unmoveable.  It might even be swappable (in the 
potentially distant future).

Anyway, here's a concrete proposal, with a bit of handwaving:

We add new cgroup limits:

memory.unmoveable
memory.locked

These can be set to an actual number or they can be set to the special value 
ROOT_CAP.  If they're set to ROOT_CAP, then anyone in the cgroup with 
capable(CAP_SYS_RESOURCE) (i.e. the global capability) can allocate unmoveable or 
locked memory with this (and potentially other) new APIs.  If it's 0, then they 
can't.  If it's another value, then the memory can be allocated, charged to the 
cgroup, up to the limit, with no particular capability needed.  The default at 
boot is ROOT_CAP.  Anyone who wants to configure it differently is free to do 
so.  This avoids introducing a DoS, makes it easy to run tests without 
configuring cgroups, and lets serious users set up their cgroups.
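
To make the three cases above concrete, here is a rough user-space model of 
the check, not kernel code.  LIMIT_ROOT_CAP, struct cg_limit and try_charge() 
are names made up for illustration, and the has_cap flag stands in for 
capable(CAP_SYS_RESOURCE).  Whether ROOT_CAP allocations would also be counted 
somewhere is left open above; this sketch assumes they are not.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel standing in for the proposed ROOT_CAP value. */
#define LIMIT_ROOT_CAP UINT64_MAX

/* Hypothetical state for one limit (memory.unmoveable or memory.locked). */
struct cg_limit {
    uint64_t limit;   /* LIMIT_ROOT_CAP, 0, or a byte count */
    uint64_t charged; /* bytes currently charged to the cgroup */
};

/*
 * Decide whether an allocation of `bytes` is allowed, and charge it if so.
 * has_cap models capable(CAP_SYS_RESOURCE) for the calling task.
 */
static bool try_charge(struct cg_limit *cg, uint64_t bytes, bool has_cap)
{
    /* ROOT_CAP: only the global capability matters (assumed: no charge). */
    if (cg->limit == LIMIT_ROOT_CAP)
        return has_cap;

    /* Already at or over the limit (also covers limit == 0). */
    if (cg->charged >= cg->limit)
        return bytes == 0;

    /* Numeric limit: no capability needed, but stay within the limit. */
    if (bytes > cg->limit - cg->charged)
        return false;

    cg->charged += bytes;
    return true;
}
```

Note that the charge lives in the cgroup structure only; there is no per-mm 
or per-user counter in this model.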

Nothing is charged per mm.

To make this fully sensible, we need to know what the backend is for the 
private memory before allocating any so that we can charge it accordingly.


