Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options


From: Alex Bennée
Subject: Re: [PATCH 4/5] linux-user: Support CLONE_VM and extended clone options
Date: Thu, 16 Jul 2020 11:41:57 +0100
User-agent: mu4e 1.5.4; emacs 28.0.50

Josh Kunz <jkz@google.com> writes:

> Sorry for the late reply; response inline. Also, I noticed that a
> couple of mails ago I seem to have dropped the devel list and the
> maintainers, so I've re-added them to the CC line.
>
> On Wed, Jun 24, 2020 at 3:17 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>>
>> Josh Kunz <jkz@google.com> writes:
>>
>> > On Tue, Jun 23, 2020, 1:21 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>> >
>> > (snip)
>> >
>> >> >> > * Non-standard libc extension to allow creating TLS images
>> >> >> >   independent of threads. This would allow us to just `clone`
>> >> >> >   the child directly instead of this complicated maneuver.
>> >> >> >   Though we probably would still need the cleanup logic. For
>> >> >> >   libcs, TLS image allocation is tightly connected to thread
>> >> >> >   stack allocation, which is also arch-specific. I do not have
>> >> >> >   enough experience with libc development to know if
>> >> >> >   maintainers of any popular libcs would be open to supporting
>> >> >> >   such an API. Additionally, since it will probably take years
>> >> >> >   before a libc fix would be widely deployed, we need an
>> >> >> >   interim solution anyway.
>> >> >>
>> >> >> We could consider a custom lib stub that intercepts calls to the
>> >> >> guest's original libc and replaces it with a QEMU-aware one?
>> >> >
>> >> > Unfortunately the problem here is host libc, rather than guest libc.
>> >> > We need to make TLS variables in QEMU itself work, so intercepting
>> >> > guest libc calls won't help much. Or am I misunderstanding the point?
>> >>
>> >> Hold up - I'm a little confused now. Why does the host TLS affect the
>> >> guest TLS? We have complete control over the guest's view of the world,
>> >> so we should be able to control its TLS storage.
>> >
>> > Guest TLS is unaffected, just like in the existing case for guest
>> > threads. Guest TLS is handled by the guest libc and the CPU emulation.
>> > Just to be clear: This series changes nothing about guest TLS.
>> >
>> > The complexity of this series is to deal with *host* usage of TLS.
>> > That is to say: use of thread local variables in QEMU itself. Host TLS
>> > is needed to allow the subprocess created with `clone(CLONE_VM, ...)`
>> > to run at all. TLS variables are used in QEMU for the RCU
>> > implementation, parts of the TCG, and all over the place to access the
>> > CPU/TaskState for the running thread. Host TLS is managed by the host
>> > libc, and TLS is only set up for host threads created via
>> > `pthread_create`. Subprocesses created with `clone(CLONE_VM)` share a
>> > virtual memory map *and* TLS data with their parent[1], since libcs
>> > provide no special handling of TLS when `clone(CLONE_VM)` is used.
>> > Without the workaround used in this patch, both the parent and child
>> > process's thread local variables reference the same memory locations.
>> > This just doesn't work, since thread local data is assumed to actually
>> > be thread local.
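
As an aside, the clash being described is easy to reproduce standalone.
Here's a hypothetical demo, assuming an x86_64 Linux host with glibc -
to be clear, this is not code from the series:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <signal.h>
  #include <stdio.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static __thread int tls_var;         /* lives in the per-task TLS block */
  static char child_stack[64 * 1024];  /* clone() needs an explicit stack */

  static int child_fn(void *arg)
  {
      tls_var = 42;  /* writes the *parent's* slot: %fs was inherited */
      return 0;
  }

  int main(void)
  {
      tls_var = 1;
      /* CLONE_VM shares the address space, but no libc TLS setup runs. */
      pid_t pid = clone(child_fn, child_stack + sizeof(child_stack),
                        CLONE_VM | SIGCHLD, NULL);
      if (pid < 0) {
          return 1;
      }
      waitpid(pid, NULL, 0);
      printf("tls_var = %d\n", tls_var);  /* prints 42, not 1 */
      return 0;
  }

The parent's "thread-local" variable comes back as 42: parent and child
resolved tls_var through the same inherited %fs base.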
>> >
>> > The "alternative" proposed was to make the host libc support TLS for
>> > processes created using clone (there are several ways to go about
>> > this, each with different tradeoffs). You mentioned that "We could
>> > consider a custom lib stub that intercepts calls to the guest's
>> > original libc..." in your comment. Since *guest* libc is not involved
>> > here I was a bit confused about how this could help, and wanted to
>> > clarify.
>> >
>> >> >> Have you considered a daemon which could co-ordinate between the
>> >> >> multiple processes that are sharing some state?
>> >> >
>> >> > Not really for the `CLONE_VM` support added in this patch series. I
>> >> > have considered trying to pull TCG out of the guest process, but not
>> >> > very seriously, since it seems like a pretty heavyweight approach.
>> >> > Especially compared to the solution included in this series. Do you
>> >> > think there's a simpler approach that involves using a daemon to do
>> >> > coordination?
>> >>
>> >> I'm getting a little lost now. Exactly what state are we trying to share
>> >> between two QEMU guests which are now in separate execution contexts?
>> >
>> > Since this series only deals with `clone(CLONE_VM)` we always want to
>> > share guest virtual memory between the execution contexts. There is
>> > also some extra state that needs to be shared depending on which flags
>> > are provided to `clone()`. E.g., signal handler tables for
>> > CLONE_SIGHAND, file descriptor tables for CLONE_FILES, etc.
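
To make that concrete, the flag-dependent bookkeeping might be shaped
roughly like the skeleton below. The TaskState name and the helpers are
made up for illustration; this is not how the series implements it:

  #define _GNU_SOURCE
  #include <sched.h>

  struct TaskState;  /* stand-in for QEMU's per-task emulation state */

  /* Hypothetical helpers that alias a table instead of copying it. */
  void share_sighand_table(struct TaskState *p, struct TaskState *c);
  void share_fd_trans_table(struct TaskState *p, struct TaskState *c);

  static void share_clone_state(unsigned flags, struct TaskState *parent,
                                struct TaskState *child)
  {
      if (flags & CLONE_SIGHAND) {  /* one signal-handler table */
          share_sighand_table(parent, child);
      }
      if (flags & CLONE_FILES) {    /* one fd translation table */
          share_fd_trans_table(parent, child);
      }
  }

Guest memory itself is always shared on this path, since the series only
covers clone() calls with CLONE_VM set.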
>> >
>> > The problem is that since QEMU and the guest live in the same virtual
>> > memory map, keeping the mappings the same between the guest parent and
>> > guest child means that the mappings also stay the same between the
>> > host (QEMU) parent and host child. Two host contexts can live in the
>> > same virtual memory map, as we do right now with threads, but *only*
>> > with valid TLS for each thread/process. That's why we bend over
>> > backwards to set up TLS for emulation in the child process.
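
On x86_64, "valid TLS" ultimately means pointing %fs at a block laid out
the way the host libc expects. Purely to illustrate the mechanism - this
glosses over the very real work of building a usable glibc TCB, and is
not the series' actual approach:

  #define _GNU_SOURCE
  #include <asm/prctl.h>    /* ARCH_SET_FS */
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Repoint this task's TLS base. Afterwards __thread accesses resolve
   * relative to new_tls_block instead of the inherited parent base. */
  static int repoint_tls(void *new_tls_block)
  {
      return syscall(SYS_arch_prctl, ARCH_SET_FS, new_tls_block);
  }

The hard part, as described above, is producing a block that is actually
valid for the host libc.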
>>
>> OK thanks for that. I'd obviously misunderstood from my first read
>> through. So while hiding the underlying bits of QEMU from the guest is
>> relatively easy, it's quite hard to hide QEMU from itself in this
>> CLONE_VM case.
>
> Yes exactly.
>
>> The other approach would be to suppress CLONE_VM for the actual process
>> (thereby allowing QEMU to safely have a new instance and no clashing
>> shared data) but emulate CLONE_VM for the guest itself (making the guest
>> portions of memory shared and visible to each other). The trouble then
>> would be co-ordination of mapping operations and other things that
>> should be visible in a real CLONE_VM setup. This is the sort of
>> situation where I envisioned a co-ordination daemon might be useful.
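
One conceivable shape for the guest-memory half of that: back guest RAM
with a memfd and map it MAP_SHARED, so two host processes keep private
QEMU state while still seeing each other's guest-visible writes. A
minimal sketch assuming Linux's memfd_create(2), not worked-out QEMU
code:

  #define _GNU_SOURCE
  #include <sys/mman.h>
  #include <unistd.h>

  /* Guest pages stay mutually visible across fork(), emulating
   * CLONE_VM for the guest without CLONE_VM on the host side. */
  static void *map_shared_guest_ram(size_t size)
  {
      int fd = memfd_create("guest-ram", 0);
      if (fd < 0) {
          return MAP_FAILED;
      }
      if (ftruncate(fd, size) < 0) {
          close(fd);
          return MAP_FAILED;
      }
      void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
      close(fd);  /* the mapping keeps the backing memory alive */
      return ram;
  }

Later guest mmap/mprotect calls are where the co-ordination problem
comes in.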
>
> Ah. This is interesting. Effectively the inverse of this patch. I had
> not considered this approach. Thinking more about it, a "no shared
> memory" approach does seem more straightforward implementation wise.
> Unfortunately I think there would be a few substantial drawbacks:
>
> 1. Memory overhead. Every guest thread would need a full copy of QEMU
> memory, including the translated guest binary.

Sure, although I suspect the overhead is not that great. For linux-user
on 64-bit systems we only allocate a 128MB translation buffer per
process. What sort of size systems are you expecting to run on, and how
big are the binaries?

> 2. Performance overhead. To keep virtual memory maps consistent across
> tasks, a heavyweight two-phase commit scheme, or similar, would be
> needed for every `mmap`. That could have substantial performance
> overhead for the guest. This could be a huge problem for processes
> that use a large number of threads *and* do a lot of memory mapping or
> frequently change page permissions.

I suspect that cross-arch highly threaded apps are still in the realm of
"wow, that actually works, neat :-)" for linux-user. We don't have the
luxury of falling back to a single thread like we do for system
emulation, so things like strong-on-weak memory-order bugs can still trip
us up.
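
For reference, the two-phase commit Josh describes above would amount to
one round trip to every peer process per guest mapping change. A
hypothetical wire format - no such protocol exists in QEMU today:

  #include <stdint.h>

  enum mmap_msg_type {
      MMAP_PREPARE,  /* coordinator -> peers: reserve the range */
      MMAP_ACK,      /* peer -> coordinator: reservation succeeded */
      MMAP_COMMIT,   /* coordinator -> peers: apply the mapping */
      MMAP_ABORT     /* coordinator -> peers: roll the reservation back */
  };

  struct mmap_msg {
      uint32_t type;  /* enum mmap_msg_type */
      uint64_t addr;  /* guest address being mapped or reprotected */
      uint64_t len;
      uint32_t prot;  /* PROT_* bits requested by the guest */
  };

Every guest mmap/mprotect would block until all peers ACK and the commit
lands, which is exactly the per-call overhead being pointed out.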

> 3. There would be lots of similarly fiddly bits that need to be shared
> and coordinated in addition to guest memory. At least the signal
> handler tables and fd_trans tables, but there are likely others I'm
> missing.
>
> The performance drawbacks could be largely mitigated by using the
> current thread-only `CLONE_VM` support, but having *any* threads in
> the process at all would lead to deadlocks after fork() or similar
> non-CLONE_VM clone() calls. This could be worked around with a "stop
> the world" button somewhat like `start_exclusive`, but expanded to
> include all emulator threads. That will substantially slow down
> fork().
>
> Given all this I think the approach used in this series is probably at
> least as "good" as a "no shared memory" approach. It has its own
> complexities and drawbacks, but doesn't have obvious performance
> issues. If you or other maintainers disagree, I'd be happy to write up
> an RFC comparing the approaches in more detail (or we can just use
> this thread), just let me know. Until then I'll keep pursuing this
> patch.

I think that's fair. I'll leave it to the maintainers to chime in if
they have something to add. I've already given some comments on patch 1,
and given that it needs a re-spin I'll have another look at the next
iteration.

I will say that the series should expect to get some testing on multiple
backends, so if you can expand your testing beyond an x86_64 host,
please do.

>
>> > [1] At least on x86_64, because TLS references are defined in terms of
>> > the %fs segment, which is inherited on Linux. Theoretically it's up to
>> > the architecture to specify how TLS is inherited across execution
>> > contexts. It's possible that the child actually ends up with no valid
>> > TLS rather than using the parent TLS data. But that's not really
>> > relevant here. The important thing is that the child ends up with
>> > *valid* TLS, not invalid or inherited TLS.
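
The x86_64 behaviour in that footnote is easy to confirm by reading the
FS base in both tasks with ARCH_GET_FS. Hypothetical demo code, Linux
x86_64 assumed:

  #define _GNU_SOURCE
  #include <asm/prctl.h>    /* ARCH_GET_FS */
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Called in both the parent and a clone(CLONE_VM) child, this prints
   * the same address, confirming the child inherited the parent's %fs
   * rather than ending up with no TLS base at all. */
  static void print_fs_base(const char *who)
  {
      unsigned long base = 0;
      if (syscall(SYS_arch_prctl, ARCH_GET_FS, &base) == 0) {
          printf("%s: fs base = %#lx\n", who, base);
      }
  }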
>>
>>
>> --
>> Alex Bennée


-- 
Alex Bennée


