Re: [lwip-devel] [mqtt] Disconnection caused by a keep-alive timeout
From: Giuseppe Modugno
Subject: Re: [lwip-devel] [mqtt] Disconnection caused by a keep-alive timeout
Date: Mon, 30 Nov 2020 17:13:11 +0100
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.5.0
I tested my system in depth and this is what I've found.
My embedded device running lwip is mainly an MQTTs client that MUST
always be connected to an MQTTs broker (in my case, the AWS IoT Core
service). So my system runs lwip and mbedTLS. Everything usually works
great, but some users complain that the system loses the connection to
the server and is never able to reconnect, even after many days and even
when the network path to the server is fine. After an MCU reset,
everything works perfectly again. Luckily this problem happens on only a
few devices.
I'm using a short MQTT keep-alive, only 5 seconds, because the server
must know quickly when the device is no longer available.
So I thoroughly tested the behaviour of the system when the connection
to the server is not stable, and I noticed some problems. The test setup
is the following: my device, which features an Ethernet port, is
connected to a WiFi bridge, which is connected to the WiFi router on my
Android smartphone, which in turn is connected to the Internet through a
4G connection.
    MYDEV ETH -> WIFI BRIDGE -> ANDROID (WiFi Router) -> 4G -> Internet
              -> AWS MQTT Broker
Now consider an active MQTT connection, with keep-alive messages being
exchanged between the end points (my device and the AWS broker). At some
point, I disable 4G on the smartphone. The MQTT client correctly detects
a failing connection, because the server doesn't reply to the last
keep-alive, so it calls mqtt_close() and the user callback.
At this moment, I think the TCP/IP stack wants to close the connection
as cleanly as it can, i.e. by trying to send the pending unacked data.
Here that data is the last PING message. lwip tries many times to
retransmit it and gives up only after many minutes (around 15 minutes;
maybe it depends on the network, I don't know).
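
As a rough illustration of why giving up can take many minutes: TCP retransmission timers back off exponentially, so the waits add up quickly. The sketch below sums the waits under an assumed doubling schedule with a cap. The function name and all the numbers used with it (1 s base timeout, 12 retransmissions, cap at 2^7) are illustrative assumptions, not values read from lwip or from my configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Total time (in ms) spent waiting across all retransmissions of one
 * segment, assuming the wait doubles after each attempt until the shift
 * reaches max_shift. This is a simplified model, not lwip's code. */
uint64_t total_retransmit_wait_ms(uint32_t base_rto_ms,
                                  unsigned max_rtx,
                                  unsigned max_shift)
{
    uint64_t total = 0;
    unsigned i;

    assert(max_shift < 32);
    for (i = 0; i < max_rtx; i++) {
        /* Wait before retransmission i: base << min(i, max_shift). */
        unsigned shift = (i < max_shift) ? i : max_shift;
        total += (uint64_t)base_rto_ms << shift;
    }
    return total;
}
```

With these assumed numbers, total_retransmit_wait_ms(1000, 12, 7) gives 767000 ms, i.e. a bit under 13 minutes, which is the same order of magnitude as what I observe.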
As you can understand, for many minutes the lwip heap is partially
allocated with data related to a connection that is in closing state.
Now I enable 4G, let the client connect to the server again, and disable
4G another time. Now I have TWO pending connections in closing state,
with more data allocated on the heap. And the same again if this happens
a third time.
In my case I have a heap of 2kB, and after three disconnections there
are 5-6 blocks allocated for a total of about 350 bytes, possibly
fragmented. When the Internet connection is restored now, the TLS
connection can't be established because an allocation fails (out of
memory). This is because I have 3 old pending TCP connections that are
no longer useful to my application.
I can't increase the heap, and anyway I wouldn't know how to size it,
because the worst case could be 5 or 10 or 100 disconnections, with
5/10/100 closing connections uselessly filling the heap.
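
To make the sizing problem concrete, here is a small back-of-the-envelope sketch. It ignores fragmentation (which only makes things worse) and assumes every lingering connection holds the same number of bytes. The function name is hypothetical, the 117 bytes per connection is just 350 / 3 from my measurement, and the 1700 bytes needed on reconnect is a made-up figure chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest number of lingering (closing) connections after which an
 * allocation of `needed` bytes no longer fits in the heap.
 * Simplified model: no fragmentation, equal cost per dead connection. */
unsigned dead_conns_until_alloc_fails(size_t heap_size,
                                      size_t bytes_per_dead_conn,
                                      size_t needed)
{
    size_t spare;

    assert(bytes_per_dead_conn > 0);
    if (needed > heap_size) {
        return 0; /* does not fit even in an empty heap */
    }
    spare = heap_size - needed;
    /* spare / cost connections still fit; one more breaks it. */
    return (unsigned)(spare / bytes_per_dead_conn) + 1;
}
```

For example, dead_conns_until_alloc_fails(2048, 117, 1700) returns 3. Whatever the real figures are, the result is always a finite number, so a bigger heap only moves the failure further away instead of removing it.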
I'm thinking of changing the mqtt client code to call altcp_abort()
instead of altcp_close() when server_watchdog reaches its limit. The
client is going to try to reconnect immediately anyway, so I don't need
to waste precious heap space on cleanly closing an already dead
connection.
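
A minimal sketch of the idea, written so that it compiles without lwip: the err_t, ERR_OK, struct altcp_pcb, altcp_close() and altcp_abort() definitions below are stand-ins that only count calls, and teardown_conn() with its peer_unresponsive flag is a hypothetical helper, not existing mqtt client code.

```c
#include <assert.h>
#include <stddef.h>

/* --- Stand-ins for lwip types and calls (NOT the real implementation) --- */
typedef signed char err_t;
#define ERR_OK 0
struct altcp_pcb { int closed; int aborted; };

static err_t altcp_close(struct altcp_pcb *conn) { conn->closed++; return ERR_OK; }
static void  altcp_abort(struct altcp_pcb *conn) { conn->aborted++; }

/* Hypothetical helper: abort right away when the peer is known to be
 * unresponsive (keep-alive timeout), otherwise try a clean close and
 * fall back to abort only if the close fails. */
void teardown_conn(struct altcp_pcb *conn, int peer_unresponsive)
{
    assert(conn != NULL);
    if (peer_unresponsive) {
        /* No point waiting for ACKs from a dead peer: drop the
         * connection so that its queued segments are released. */
        altcp_abort(conn);
    } else if (altcp_close(conn) != ERR_OK) {
        altcp_abort(conn);
    }
}
```

The clean close is kept for the normal disconnect path, where the server is still reachable and can acknowledge the closing handshake.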
On 27/11/2020 18:26, Giuseppe Modugno wrote:
The MQTT client handles the situation where the reply to a keep-alive
(PINGRESP) doesn't arrive in time (in mqtt_cyclic_timer):

    /* If reception from server has been idle for 1.5*keep_alive time,
       server is considered unresponsive */
    if ((client->server_watchdog * MQTT_CYCLIC_TIMER_INTERVAL) >
        (client->keep_alive + client->keep_alive / 2)) {
      LWIP_DEBUGF(MQTT_DEBUG_WARN,
                  ("mqtt_cyclic_timer: Server incoming keep-alive timeout\n"));
      mqtt_close(client, MQTT_CONNECT_TIMEOUT);
      restart_timer = 0;
    }

Here mqtt_close() is called, and my user callback is called too to
signal the event. After this event, I see that a couple of allocated
blocks aren't freed. I strongly suspect the first is the TCP packet
containing the last unacked keep-alive, and the second is the TCP packet
that is sent to the server to close the connection (FIN flag?). I think
they are allocated by the tcp_output module.

Because I want to stay connected to the server, in my callback I call
mqtt_client_connect() again after some seconds. Now suppose the
connection to the server is established again, but after some time a
keep-alive again doesn't receive an answer (think of a client connected
to the Internet through a bad connection). I will have another two
allocated and lost blocks in the heap. This is obviously very bad for
the heap.

I suspect mqtt_close() doesn't immediately free all the packets that are
waiting for an ACK. That function does its best to close the connection
cleanly, so it sends the FIN packet (the second block) and waits for an
answer from the server. However, I think there should be a timeout,
because otherwise unacked packets can stay in the heap forever. What is
this timeout? What could be the reason why in my case those packets are
never freed? I tried not calling mqtt_client_connect() again when my
callback is called at the disconnection, but the two blocks above stay
allocated, even after many seconds.

One solution is to call altcp_abort() instead of altcp_close(): in this
case I do see the heap completely empty when the event occurs and my
callback is called. Thinking about this, maybe it's better to call
abort() instead of close(). If the server hasn't answered a keep-alive,
most probably the connection is not stable and it is useless to try the
clean closing procedure.

I don't know if it's important, but I'm using mbedtls for the MQTT
connection.
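
To see how long the server must be silent before the condition quoted above fires, here is a small sketch that replays the same integer arithmetic. The function name is hypothetical, and it assumes that server_watchdog counts cyclic timer ticks since the last reception and that MQTT_CYCLIC_TIMER_INTERVAL is 5 seconds (please check mqtt_opts.h for the value in your build).

```c
#include <assert.h>
#include <stdint.h>

/* Smallest watchdog count for which
 *   watchdog * timer_interval_s > keep_alive_s + keep_alive_s / 2
 * holds, using integer division as in the quoted condition. */
unsigned ticks_until_keepalive_timeout(uint16_t keep_alive_s,
                                       unsigned timer_interval_s)
{
    unsigned limit = keep_alive_s + keep_alive_s / 2u;
    unsigned watchdog = 0;

    assert(timer_interval_s > 0);
    while (watchdog * timer_interval_s <= limit) {
        watchdog++;
    }
    return watchdog;
}
```

With a 5 second keep-alive and the assumed 5 second timer interval, the limit is 5 + 5/2 = 7 and the function returns 2, so the timeout is detected at a watchdog value corresponding to 10 seconds rather than 7.5, because of the timer granularity.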