Re: [lwip-devel] [mqtt] Disconnection caused by a keep-alive timeout

From: Giuseppe Modugno
Subject: Re: [lwip-devel] [mqtt] Disconnection caused by a keep-alive timeout
Date: Mon, 30 Nov 2020 17:13:11 +0100
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.5.0

I tested my system in depth and this is what I found.

My embedded device running lwIP is mainly an MQTTs client that MUST always stay connected to an MQTTs broker (in my case, the AWS IoT Core service), so the system runs lwIP and mbedTLS. Everything usually works well, but some users complain that the system loses the connection to the server and cannot reconnect, even after many days and even when the network path to the server is fine. After an MCU reset, everything works perfectly again. Luckily this problem happens only on a few devices. I'm using a short MQTT keep-alive, only 5 seconds, because the server must learn quickly when the device is no longer available.

So I thoroughly tested the behaviour of the system when the connection to the server is unstable, and I noticed some problems. The test setup is the following: my device, which has an Ethernet port, is connected to a WiFi bridge, which is connected to the WiFi hotspot of my Android smartphone, which in turn reaches the Internet through a 4G connection.

  MYDEV ETH -> WIFI BRIDGE -> ANDROID (WiFi Router) -> 4G -> Internet -> AWS MQTT Broker

Now consider an active MQTT connection in which keep-alive messages are exchanged between the end points (my device and the AWS broker). At some point I disable 4G on the smartphone. The MQTT client correctly detects the failing connection, because the server doesn't reply to the last keep-alive, so it calls mqtt_close() and the user callback. At this moment, I think the TCP/IP stack wants to close the connection as cleanly as it can, i.e. by trying to send the pending unacked data, which here is the last PING message. lwIP retransmits that data many times and gives up only after many minutes (around 15 minutes, maybe it depends on the network, I don't know).

As you can understand, for many minutes part of the lwIP heap stays allocated to data belonging to a connection that is in the closing state.

Now I enable 4G, let the client connect to the server again, and disable 4G once more. Now I have TWO pending connections in the closing state, with more data allocated on the heap. The same happens if I repeat this a third time.

In my case I have a heap of 2kB, and after three disconnections there are 5-6 blocks allocated for a total of 350 bytes, possibly fragmented. When the Internet connection is restored at this point, the TLS connection can't be established because an allocation fails (out of memory). This is because I have 3 old pending TCP connections that are no longer of any use to my application.

I can't increase the heap, and anyway I wouldn't know how to size it, because the worst case could be 5 or 10 or 100 disconnections, with 5/10/100 closing connections filling the heap uselessly.

I'm thinking of changing the MQTT client code to call altcp_abort() instead of altcp_close() when server_watchdog reaches its limit. The client is going to try to reconnect immediately anyway, so I don't need to waste precious heap space on cleanly closing an already dead connection.
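
To make the heap argument concrete, here is a toy model of the accounting described above. It is not lwIP code: the type and function names are hypothetical, and the figure of two retained blocks per lingering connection is only what I observed and suspect (the unacked PING plus the segment that closes the connection). It just shows that a graceful close towards an unreachable peer accumulates blocks per disconnection until retransmission gives up, while an abort keeps nothing.

```c
/* Toy model, NOT lwIP code: all names below are hypothetical. */

/* Assumption: each gracefully closed but unreachable connection keeps
 * two heap blocks (unacked PING + closing segment), as observed. */
#define BLOCKS_PER_LINGERING_CONN 2u

typedef enum { POLICY_CLOSE, POLICY_ABORT } teardown_policy_t;

typedef struct {
  unsigned lingering_conns; /* connections still retransmitting */
  unsigned blocks_held;     /* heap blocks those connections retain */
} heap_model_t;

static void model_init(heap_model_t *m)
{
  m->lingering_conns = 0;
  m->blocks_held = 0;
}

/* A keep-alive timeout fires while the peer is unreachable. */
static void model_keepalive_timeout(heap_model_t *m, teardown_policy_t policy)
{
  if (policy == POLICY_CLOSE) {
    /* Graceful close: unacked data stays queued for retransmission. */
    m->lingering_conns++;
    m->blocks_held += BLOCKS_PER_LINGERING_CONN;
  }
  /* POLICY_ABORT: everything is released at once, nothing is retained. */
}

/* The stack finally gives up retransmitting (many minutes later). */
static void model_retransmit_giveup(heap_model_t *m)
{
  m->lingering_conns = 0;
  m->blocks_held = 0;
}
```

With three timeouts in a row the close policy holds 6 blocks in this model, in line with the 5-6 blocks I measured, and the abort policy holds none.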


On 27/11/2020 18:26, Giuseppe Modugno wrote:
The MQTT client handles the case where the reply to a keep-alive (PINGRESP) doesn't arrive in time (mqtt_cyclic_timer):

    /* If reception from server has been idle for 1.5*keep_alive time, server is considered unresponsive */
    if ((client->server_watchdog * MQTT_CYCLIC_TIMER_INTERVAL) > (client->keep_alive + client->keep_alive / 2)) {
      LWIP_DEBUGF(MQTT_DEBUG_WARN, ("mqtt_cyclic_timer: Server incoming keep-alive timeout\n"));
      mqtt_close(client, MQTT_CONNECT_TIMEOUT);
      restart_timer = 0;
    }

Here mqtt_close() is called, and my user callback is called too to signal the event. After this event, I see that a couple of allocated blocks aren't freed. I strongly suspect the first is the TCP packet containing the last unacked keep-alive and the second is the TCP packet sent to the server to close the connection (FIN flag?). I think they are allocated by the tcp_output module.

Because I want to stay connected to the server, in my callback I call mqtt_client_connect() again after some seconds. Now suppose the connection to the server is established again, but after some time a keep-alive again receives no answer (think of a client connected to the Internet through a bad connection). I will have another two allocated and lost blocks in the heap. This is obviously very bad for the heap.

I suspect mqtt_close() doesn't immediately free all the packets that are waiting for an ACK. That function does its best to close the connection cleanly, so it sends the FIN packet (the second block) and waits for an answer from the server. However, I think there should be a timeout, because otherwise unacked packets can stay in the heap forever. What is this timeout? What could be the reason why in my case those packets are never freed? I tried not calling mqtt_client_connect() again when my callback is called at disconnection, but the two blocks above stay unfreed, even after many seconds.

One solution is to call altcp_abort() instead of altcp_close(): in this case I see the heap completely empty when the event occurs and my callback is called. Thinking about this, maybe it's better to call abort() instead of close(). If the server hasn't answered a keep-alive, most probably the connection is not stable and it is useless to attempt the clean closing procedure. I don't know if it's important, but I'm using mbedTLS for the MQTT connection.
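
As a worked example of the timeout condition quoted above, the small sketch below evaluates the same comparison outside lwIP. The helper names are hypothetical, and the 5 second value for MQTT_CYCLIC_TIMER_INTERVAL is an assumption (check your own lwipopts/mqtt_opts settings). It also assumes server_watchdog simply counts timer ticks without any reception from the server.

```c
#include <stdint.h>

/* Assumption: stands in for MQTT_CYCLIC_TIMER_INTERVAL, in seconds. */
#define CYCLIC_TIMER_INTERVAL_S 5u

/* Same comparison as in the quoted snippet, with integer division. */
static int server_timed_out(uint16_t server_watchdog, uint16_t keep_alive_s)
{
  return ((uint32_t)server_watchdog * CYCLIC_TIMER_INTERVAL_S) >
         ((uint32_t)keep_alive_s + keep_alive_s / 2u);
}

/* Hypothetical helper: first watchdog count at which the check trips. */
static uint16_t first_timeout_tick(uint16_t keep_alive_s)
{
  uint16_t ticks = 0;
  do {
    ticks++;
  } while (!server_timed_out(ticks, keep_alive_s));
  return ticks;
}
```

Under these assumptions, a 5 second keep-alive gives a threshold of 5 + 5/2 = 7 (integer division), so the check first trips at a watchdog count of 2, i.e. after roughly 10 seconds of silence from the server.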
