[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_L
|
From: |
Wang, Lei |
|
Subject: |
Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN |
|
Date: |
Wed, 3 Apr 2024 16:35:35 +0800 |
|
User-agent: |
Mozilla Thunderbird |
On 4/3/2024 5:39, Peter Xu wrote:>>>>>
>>>>> I'm not 100% sure such thing (no matter here or moved into it, which
>>>>> does look cleaner) would work for us.
>>>>>
>>>>> The problem is I still don't yet see an ordering restricted on top of
>>>>> (1)
>>>>> accept() happens, and (2) receive LISTEN cmd here. What happens if
>>>>> the
>>>>> accept() request is not yet received when reaching LISTEN? Or is it
>>>>> always guaranteed the accept(fd) will always be polled here?
>>>>>
>>>>> For example, the source QEMU (no matter pre-7.2 or later) will always
>>>>> setup the preempt channel asynchrounously, then IIUC it can connect()
>>>>> after sending the whole chunk of packed data which should include this
>>>>> LISTEN. I think it means it's not guaranteed this will 100% work, but
>>>>> maybe further reduce the possibility of the race.
>>>>
>>>> I think the following code:
>>>>
>>>> postcopy_start() ->
>>>> postcopy_preempt_establish_channel() ->
>>>> qemu_sem_wait(&s->postcopy_qemufile_src_sem);
>>>>
>>>> can guarantee that the connect() syscall is successful so the destination
>>>> side
>>>> receives the connect() request before it loads the LISTEN command,
>>>> otherwise
>>>> it won't post the sem:
>>>>
>>>> postcopy_preempt_send_channel_new() ->
>>>> postcopy_preempt_send_channel_done() ->
>>>> qemu_sem_post(&s->postcopy_qemufile_src_sem);
>>>>
>>>
>>> Yes. But as mentioned in another thread, connect() and accept() are async.
>>> So in theory accept() could still come later after the LISTEN command.
>>
>> IIUC accept() is the callback and will be triggered by the connect() event.
>>
>> The reason accept() is not called in the destination is the main loop doesn't
>> get a chance to handle other events (connect()), so if we can guarantee
>> connect() is before LISTEN, then when handling the LISTEN cmd, we yield to
>> the
>> main loop and the connect() event is there, then we can handle it by calling
>> the
>> accept():
>>
>> qio_net_listener_channel_func
>> qio_channel_socket_accept
>> qemu_accept
>> accept
>>
>> so it seems the case accept() comes alter after LISTEN is in our expectation?
>
> The major thing uncertain to me is "accept() will return with a valid fd"
> on dest host is not guaranteed to order against anything.
>
> For example, I won't be surprised if a kernel implementation provides an
> async model of "accept()" syscall, so that even if the other side returned
> with "connect()", the "accept()" can still fetch nothing if the async model
> will need a delay for the new channel to be finally delivered to the
> "accept()" thread context. It just sounds like tricky to rely on such
> thing.
>
> What I proposed below shouldn't rely on any kernel details on how accept()
> could be implemented, it simply waits for the fd to be created before doing
> anything else (including creating the preempt thread and process packed
> data).
Thanks for the detailed explanation!
>
>>
>>>
>>>>>
>>>>> One right fix that I can think of is moving the sem_wait(&done) into
>>>>> the main thread too, so we wait for the sem _before_ reading the
>>>>> packed data, so there's no chance of fault. However I don't think
>>>>> sem_wait() will be smart enough to yield when in a coroutine.. In the
>>>>> long term run I think we should really make migration loadvm to do
>>>>> work in the thread rather than the main thread. I think it means we
>>>>> have one more example to be listed in this todo so that's preferred..
>>>>>
>>>>> https://wiki.qemu.org/ToDo/LiveMigration#Create_a_thread_for_migration
>>>>> _destination
>>>>>
>>>>> I attached such draft patch below, but I'm not sure it'll work. Let
>>>>> me know how both of you think about it.
>>>>
>>>> Sadly it doesn't work, there is an unknown segfault.
>
> Could you paste the stack of the segfault ("(gdb) thread apply all bt")?
> Or help to figure out what is wrong?
>
> Since I cannot reproduce myself, I may need your help debugging this. If
> you agree with what I said above and agree on such fix, please also feel
> free to go ahead and fix the segfault.
We should change the following line from
while (!qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) {
to
while (qemu_sem_timedwait(&mis->postcopy_qemufile_dst_done, 100)) {
After that fix, test passed and no segfault.
Given that the test shows a yield to the main loop won't introduce much overhead
(<1ms), how about first yield unconditionally, then we enter the while loop to
wait for several ms and yield periodically?
- Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN, Peter Xu, 2024/04/01
- Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN, Fabiano Rosas, 2024/04/01
- Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN, Wang, Lei, 2024/04/02
- RE: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN, Wang, Wei W, 2024/04/02
- Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN, Wang, Lei, 2024/04/02
- Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN, Peter Xu, 2024/04/02
- Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN,
Wang, Lei <=
- Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN, Peter Xu, 2024/04/03
- RE: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN, Wang, Wei W, 2024/04/03
- Re: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN, Peter Xu, 2024/04/03
- RE: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN, Wang, Wei W, 2024/04/04
RE: [PATCH] migration: Yield coroutine when receiving MIG_CMD_POSTCOPY_LISTEN, Wang, Wei W, 2024/04/02