Re: [lwip-devel] [mqtt] Disconnection caused by a keep-alive timeout

From:	Giuseppe Modugno
Subject:	Re: [lwip-devel] [mqtt] Disconnection caused by a keep-alive timeout
Date:	Mon, 30 Nov 2020 17:13:11 +0100
User-agent:	Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.5.0

I tested my system thoroughly, and this is what I found.

My embedded device running lwIP is mainly an MQTTs client that MUST always be connected to an MQTTs broker (in my case, the AWS IoT Core service). So my system runs lwIP and mbedTLS. Everything usually works great, but some users complain that the system loses the connection to the server and isn't able to reconnect, even after many days and even if the network path to the server is fine. After an MCU reset, everything works perfectly again. Luckily this problem happens only on a few devices. I'm using a short MQTT keep-alive, only 5 seconds, because the server must know quickly when the device is no longer available.

So I thoroughly tested the behaviour of the system when the connection to the server is unstable, and I noticed some problems. The test setup is the following: my device, which has an Ethernet port, is connected to a WiFi bridge, which is connected to the WiFi router on my Android smartphone, which reaches the Internet through a 4G connection.

MYDEV ETH -> WIFI BRIDGE -> ANDROID (WiFi Router) -> 4G -> Internet -> AWS MQTT Broker

Now consider an active MQTT connection, with keep-alive messages exchanged between the end points (my device and the AWS broker). At some point, I disable 4G on the smartphone. The MQTT client correctly detects the failing connection, because the server doesn't reply to the last keep-alive, so it calls mqtt_close() and the user callback. At this moment, I think the TCP/IP stack wants to close the connection as cleanly as it can, i.e. by trying to send pending unacked data. Here that is the last PING message. lwIP retries the transmission many times and gives up only after many minutes (around 15 minutes; maybe it depends on the network, I don't know).

As you can understand, for many minutes the lwIP heap is partially occupied by data belonging to a connection that is in a closing state.

Now I enable 4G, let the client connect to the server again, and disable 4G another time. Now I have TWO pending connections in a closing state, with more data allocated on the heap. The same happens if I repeat this a third time.

In my case I have a heap of 2 kB, and after three disconnections there are 5-6 blocks allocated for a total of about 350 bytes, possibly fragmented. When the Internet connection is restored, the TLS connection can't be established because an allocation fails (out of memory). This is because I have 3 old pending TCP connections that are no longer useful to my application.

I can't increase the heap, and anyway I wouldn't know how to size it, because the worst case could be 5 or 10 or 100 disconnections, with 5/10/100 closing connections filling the heap uselessly.

I'm thinking of changing the MQTT client code to call altcp_abort() instead of altcp_close() when server_watchdog reaches its limit. The device is going to try to reconnect immediately, so I don't need to waste precious heap space on cleanly closing an already dead connection.
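As a rough illustration of why the closing connections add up, here is a toy model in C. It is not lwIP code: the struct, the function names and the per-connection cost of 117 bytes (about 350 bytes divided by three disconnections) are all assumptions made only for this sketch. It treats a graceful close as leaving the unacked segments allocated, and an abort as freeing them at once.

```c
#include <stddef.h>

/* Hypothetical toy model, not lwIP code: tracks bytes pinned by dead connections. */
struct toy_heap {
  size_t capacity;  /* total heap size in bytes */
  size_t lingering; /* bytes still held by connections stuck in a closing state */
};

/* Assumed cost per dead connection: roughly 350 bytes / 3 disconnections. */
#define TOY_BYTES_PER_DEAD_CONN 117u

/* Record one keep-alive timeout.
 * use_abort == 0: graceful close, unacked segments stay queued for retransmission.
 * use_abort != 0: abort, modelled as releasing everything immediately. */
void toy_on_keepalive_timeout(struct toy_heap *heap, int use_abort)
{
  if (!use_abort) {
    heap->lingering += TOY_BYTES_PER_DEAD_CONN;
  }
}

/* Bytes left for new allocations (e.g. the next TLS handshake). */
size_t toy_available(const struct toy_heap *heap)
{
  return heap->lingering >= heap->capacity ? 0 : heap->capacity - heap->lingering;
}
```

In this model the space lost with a graceful close grows linearly with the number of disconnections, while with an abort it stays at zero. The model ignores fragmentation, which probably makes the real situation worse.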


On 27/11/2020 18:26, Giuseppe Modugno wrote:

The MQTT client handles the case where the reply to a keep-alive (PINGRESP) doesn't arrive in time (mqtt_cyclic_timer):

    /* If reception from server has been idle for 1.5*keep_alive time, server is considered unresponsive */
    if ((client->server_watchdog * MQTT_CYCLIC_TIMER_INTERVAL) > (client->keep_alive + client->keep_alive / 2)) {
      LWIP_DEBUGF(MQTT_DEBUG_WARN, ("mqtt_cyclic_timer: Server incoming keep-alive timeout\n"));
      mqtt_close(client, MQTT_CONNECT_TIMEOUT);
      restart_timer = 0;
    }

Here mqtt_close() is called, and my user callback is called too to signal the event. After this event, I see that a couple of allocated blocks aren't freed. I strongly suspect the first is the TCP packet containing the last unacked keep-alive and the second is the TCP packet sent to the server to close the connection (FIN flag?). I think they are allocated by the tcp_output module.

Because I want to stay connected to the server, in my callback I call mqtt_client_connect() again after a few seconds. Now suppose the connection to the server is established again, but after some time a keep-alive again receives no answer (think of a client connected to the Internet through a bad link). I will have two more allocated and lost blocks in the heap. This is obviously very bad for the heap.

I suspect mqtt_close() doesn't immediately free all the packets that are waiting for an ACK. That function does its best to close the connection cleanly, so it sends the FIN packet (the second block) and waits for an answer from the server. However, I think there should be a timeout, because otherwise unacked packets can stay in the heap forever. What is this timeout? What could be the reason why in my case those packets are never freed? I tried to avoid calling mqtt_client_connect() again when my callback is called at disconnection, but the two blocks above stay allocated, even after many seconds.

One solution is to call altcp_abort() instead of altcp_close(); indeed, in this case I see the heap completely empty when the event occurs and my callback is called. Thinking about this, maybe it's better to call abort() instead of close(). If the server hasn't answered a keep-alive, most probably the connection is not stable and it is useless to try the clean closing procedure.

I don't know if it's important, but I'm using mbedtls for the MQTT connection.
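To make the timeout condition quoted above concrete, here is a small standalone sketch that reproduces only the comparison, outside of lwIP. The helper names are hypothetical, and the value 5 for MQTT_CYCLIC_TIMER_INTERVAL is an assumption about the default configuration; check mqtt_opts.h in your own tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value (seconds per timer tick); verify against your lwIP configuration. */
#define MQTT_CYCLIC_TIMER_INTERVAL 5

/* Same comparison as in the quoted snippet: true once the idle time
 * (ticks * interval) exceeds keep_alive + keep_alive / 2 (integer division). */
bool server_timed_out(uint16_t server_watchdog, uint16_t keep_alive)
{
  return (server_watchdog * MQTT_CYCLIC_TIMER_INTERVAL) > (keep_alive + keep_alive / 2);
}

/* Hypothetical helper: smallest watchdog count for which the condition holds. */
uint16_t ticks_until_timeout(uint16_t keep_alive)
{
  uint16_t ticks = 0;
  while (!server_timed_out(ticks, keep_alive)) {
    ticks++;
  }
  return ticks;
}
```

Under these assumptions, with a 5 second keep-alive the threshold is 5 + 2 = 7 seconds, so the condition first holds when the watchdog count reaches 2, i.e. about 10 seconds of silence from the server. The exact moment also depends on when the real timer increments and resets server_watchdog, which this sketch does not model.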
