qemu-devel
TCP/IP connections sometimes stop retransmitting packets (in nested virtualization case)


From: Maxim Levitsky
Subject: TCP/IP connections sometimes stop retransmitting packets (in nested virtualization case)
Date: Sun, 17 Oct 2021 13:50:51 +0300
User-agent: Evolution 3.36.5 (3.36.5-2.fc32)

Hi!
 
This is a follow-up to my mail about the NFS client deadlock I was trying to
debug last week:
https://lore.kernel.org/all/e10b46b04fe4427fa50901dda71fb5f5a26af33e.camel@redhat.com/T/#u
 
I now strongly believe that this is not related to NFS, but rather to some
issue in the networking stack, possibly related to the somewhat non-standard
.config I was using for the kernels, which has many advanced networking
options disabled (to cut down on compile time).
This is why I chose to start a new thread about it.
 
Regarding the custom .config file: in particular, I disabled CONFIG_NET_SCHED
and CONFIG_TCP_CONG_ADVANCED.
Both the host and the fedora32 VM run the same kernel with those options disabled.
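For anyone wanting to reproduce the same kernel setup, a minimal sketch of checking whether a given kernel config disables these options (the sample text below is a hypothetical stand-in for a real file such as /boot/config-$(uname -r)):

```python
# Sketch: check a kernel .config for the options mentioned above.
# The sample text stands in for a real config file on disk.
OPTIONS = ("CONFIG_NET_SCHED", "CONFIG_TCP_CONG_ADVANCED")

sample_config = """\
# CONFIG_NET_SCHED is not set
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
"""

def option_state(config_text, option):
    """Return the value if set, 'not set' if explicitly disabled, None if absent."""
    for line in config_text.splitlines():
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line.strip() == "# %s is not set" % option:
            return "not set"
    return None

for opt in OPTIONS:
    print(opt, "->", option_state(sample_config, opt))
```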


My setup is a VM (fedora32) which runs a Win10 Hyper-V VM nested inside it,
which in turn runs a fedora32 VM
(but I was able to reproduce the issue with an ordinary, Hyper-V-disabled VM
running in the same fedora32 VM).

The host runs an NFS server, and the fedora32 VM runs an NFS client, which is
used to read/write a qcow2 file
that contains the disk of the nested Win10 VM. The L3 VM which the Windows VM
optionally runs is contained in the same qcow2 file.


I managed to capture (using wireshark) packets around the failure in both L0
and L1.
The trace shows a fair number of lost packets, a bit more than I would expect
from communication running on the same host,
but they are retransmitted and don't cause any issues until the moment of
failure.


The failure happens when one packet sent from the host to the guest
is not received by the guest (as evident from the L1 trace, and from the
subsequent SACKs from the guest, which exclude this packet),
and the host (on which the NFS server runs) then never attempts to retransmit
it.


The host keeps sending further TCP packets with replies to previous RPC
calls it received from the fedora32 VM,
with increasing sequence numbers, as evident from both traces, and the
fedora32 VM keeps SACK'ing those received packets,
patiently waiting for the retransmission.
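To make the failure pattern concrete, here is an illustrative sketch (with synthetic sequence numbers, not data from the actual capture) of detecting a TCP sequence "hole" that the sender never fills while it keeps advancing past it:

```python
# Illustrative sketch (synthetic data): detect a skipped TCP byte range that
# no later segment ever retransmits. Each record is (seq, payload_len) as
# seen by the receiver.
received = [
    (1000, 100),   # in order
    (1100, 100),   # in order
    (1300, 100),   # gap: bytes 1200-1299 missing -> receiver SACKs around it
    (1400, 100),   # sender keeps advancing, never resends seq 1200
    (1500, 100),
]

def find_unretransmitted_gaps(records):
    """Return byte ranges that were skipped and never filled later.

    Simplification: any segment whose seq falls inside a gap is treated as
    filling that whole gap (good enough for this illustration).
    """
    expected = None
    gaps = []
    for seq, length in records:
        if expected is not None and seq > expected:
            gaps.append((expected, seq))      # a hole opened
        gaps = [(lo, hi) for (lo, hi) in gaps if not (lo <= seq < hi)]
        expected = max(expected or 0, seq + length)
    return gaps

print(find_unretransmitted_gaps(received))    # -> [(1200, 1300)]
```

In the real traces this is exactly what the SACK blocks show: the hole stays open while the sequence numbers after it keep growing.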
 
After around 12 minutes (!), the host RSTs the connection.

It is worth mentioning that while all of this is happening, the fedora32 VM
can hang if one attempts to access files
on the NFS share, because effectively all NFS communication is blocked at the
TCP level.

I attached extracts from the two traces (in L0 and L1) around the failure, up
to the RST packet.

In this trace, the second packet with TCP sequence number 1736557331 (the
first one was empty, with no data) is not received by the guest
and is then never retransmitted by the host.

Also worth noting: to save storage, I captured only 512 bytes of each
packet, but wireshark
notes how many bytes were in the actual packet.
 
Best regards,
        Maxim Levitsky

Attachment: L0_packets.pcapng
Description: application/pcapng

Attachment: L1_packets.pcapng
Description: application/pcapng

