[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [lwip-users] Assert after dropped TX packet
From: |
Dittrich, Matthew |
Subject: |
Re: [lwip-users] Assert after dropped TX packet |
Date: |
Wed, 26 Oct 2011 11:51:33 -0500 |
I have verified that my MAC is dropping the tx frames during the "dead time".
Today I put some debug output into my low_level_output() and lwIP is passing it
in the frames, they are just not making it to the wire. So that's definitely on
me (or the silicon! :)
Yesterday I was fortunate to get a really clean capture, sorry to spam the list
with pcaps when it's a driver issue, but I think the attached pcap highlights
"something odd". This capture is much easier to digest than the previous ones.
The problem is between frame #29 and #30, debug log line 463. There are no
issues immediately leading up to frame #29 (no nagle-ish activity like my
previous pcaps), then my driver drops a frame twice in the "dead time", there a
few tcp_write()'s after the first tx failure, after which the congestion window
shrinks to MSS. All these tcp_write()'s seem to complete successfully. A new
PC app level request comes in after it's timeout with frame #30, lwip attempts
to send the response data, but the "ntohl(seg->tcphdr->seqno) - pcb->lastack +
seg->len <= wnd" test of the while() loop in tcp_output() at line 962 of
tcp_out.c keeps any data from sending. This is the "my_wnd" line 621 of the
debug dump. An ACK is sent instead (frame #31). lwIP makes one more attempt to
send the unacked segment (which my driver drops again... and only that specific
frame!) at line 770, and lwIP never calls tcp_output_segment() again.
So, does this seam normal? Is everything just "plugged up" waiting for the PC
to ACK that segment that keeps getting dropped? Would the next
tcp_output_segment() contain the same 795985:796699 segment that I keep
dropping, but the PC app gives up before it should be sent (exponential
backoff?)? After the first drop at 134447ms, the reattempts are made ~4sec
later at 138733ms, then ~10sec after that at 148037ms. The PC closes the
connection ~16sec after that.
Now I am off to find why my MAC is refusing to send a single unique frame
amongst 100's of MB of data!
Thanks again,
MD
-----Original Message-----
From: address@hidden [mailto:address@hidden On Behalf Of Dittrich, Matthew
Sent: Tuesday, October 25, 2011 1:43 PM
To: Mailing list for lwIP users
Subject: Re: [lwip-users] Assert after dropped TX packet
Thanks for the reply Kieran,
I am glad to hear I am not (willfully) breaking the pbuf API rules. Hope you
guys are ready for another long one! :)
> My guess is that you've hit the
> common fault of failing to ensure that there is only a single thread
> active in the lwIP core at any one time.
I can assure you I am not in the stack from different threads. I am using the
raw API in order to avoid creating a task per connection. My Ethernet task
loop consists of a "handle_packets()", see below, netif->input is
ethernet_input(). low_level_input() returns a PBUF_RAW from PBUF_POOL. Then
sys_check_timeouts() is run, then my app level "monitor_events()" is run, this
watches a FreeRTOS queue that the other app tasks put messages into for the
ethernet task. The monitor_events() only makes tcp_writes() (no eth input
handling here) and only when we are not currently sending something else. Then
a FreeRTOS vTaskDelay(1) will pause the task until the next tick. Then repeat
the loop. No interrupts are involved.
I have been narrowing in on the failure, and it appears to happen only when the
machine is running "cycles". I mentioned that our protocol is mostly
request/response, but there are "unsolicited" responses sent from
monitor_events(), one happens at the beginning of a cycle, one at the end. It
also appears that these unsolicited responses must happen at the exact wrong
time for things to fall apart. monitor_events() also maintains a timer that
fires off the udp_sendto() for the broadcast, though that has been disabled
lately to rule it out as a cause.
Attached is a pcap, and some debug output. I prepended the value of sys_now()
(units of milliseconds) and a hypen in my LWIP_PLATFORM_DIAG() definition, my
(novice) comments are preceded by # and were added after the fact. The debug
output starts at packet #167 of the capture, I can provide debug output before
that corresponding to this capture if more "good" transactions would help the
analysis. I also modified the two identical "tcp_output: snd_wnd " debug
output lines to differentiate them. Here is my reader's digest version of the
cap/debug:
1. Packet #195 is the last "good" segment sent from lwip (192.168.10.6).
2. Lwip sends no data after this
3. There is a 10 second pause between 196 and 197, this is an PC app level
timeout, causing a "fallback" "poll" message request from the PC (192.168.10.5).
During this "dead time" lwip thinks it sent 202 bytes, but that packet is
not seen in wireshark (timestamp 433878 in debug)
There is an "cycle start" (at 433930) unsolicited response sent, tcp_write()
returns OK, never seen on wire. After this the congestion window reduces to
MSS.
A "cycle stop" (at 434852) message has the same fate... the actual wall time
between start and stop events in the system is ~1.42 sec with this setup.
Lwip attempts to send the same 2267210:2267412 with nothing on the wire
during the dead time.
4. Packet #197 containing a new request breaks the cycle, the lwip app code
services the request, tcp_write()'s without indication of failure (see " -mt 17
ERR_OK!!!" app level debug output).
However the test involving tcp_do_output_nagle() fails, and nothing is sent.
I "blew up" the macro into its own function for easier debug,
The result of its output and the flag test are in the debug dump as
"my_do_nagle"
5. #198 appears to be a retransmission of 197 (only difference is ip ID field)
from the PC, lwip app is not informed of it (correctly, I think) 6.
tcp_do_output_nagle() keeps any data from sending for the rest of the capture.
7. acks are not sent until winXP does a cumulative retransmission, packets
#202, #207, #214 8. In the end, the PC app has "so many" timed-out request that
it gives up and closes the connection. Somewhere along the line the assert()
"went away", it has not happened since last Friday.
I tried dropping back to released 1.4.0 with no difference. I have some
captures where the while() loop "ntohl(seg->tcphdr->seqno) - pcb->lastack +
seg->len <= wnd" test fails for awhile, then the tcp_do_output_nagle() fails in
tcp_output(). I think it is a matter of timing of the unsolicited responses in
relation to some dropped packet, I get different results depending upon how
much debug output I have enabled. Again, it only seems to happen when the
machine is running cycles, creating the unsolicited messages, but the API
returns "all good" every time. These captures are from a direct PC-to-device
physical connection, no routers/switches, lwip is handling a single TCP
connection. The fact that lwip thinks it is sending something during the "dead
time" but nothing appears on the wire makes me question my driver again... but
shouldn't lwip handle lost packets correctly? And why does it ACK the
cumulative retransmissions only? Those always make it out. Is there anything
that can fail after the "tcp_output_segment: " debug output?
Thanks again,
MD
-------------------------------------------------------
void ethernetif_handlepackets(struct netif *netif) {
struct eth_hdr* ethhdr;
struct pbuf* p;
/* move received packet into a new pbuf */
p = low_level_input(netif);
while (p != NULL) {
eth_dbg_hp_cnt++;
/* points to packet payload, which starts with an Ethernet header */
ethhdr = p->payload;
#if LINK_STATS
lwip_stats.link.recv++;
#endif /* LINK_STATS */
switch (htons(ethhdr->type)) {
case ETHTYPE_IP:
case ETHTYPE_ARP:
if (netif->input(p, netif) != ERR_OK) {
pbuf_free(p);
p = NULL;
}
break;
default:
pbuf_free(p);
p = NULL;
break;
}
p = low_level_input(netif);
}
}
-------------------------------------------------------
20111025e_wnd_short.pcap
Description: 20111025e_wnd_short.pcap
20111025e_wnd_short.txt
Description: 20111025e_wnd_short.txt