From: | Jackie |
Subject: | Re: [lwip-users] LWIP - TCP receive assert failed |
Date: | Fri, 16 Jan 2015 23:46:02 +0800 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.3.0 |
Hi Sylvain,
Thanks for your reply. I've been working hard on this issue lately, and I found something interesting.

I am using FTP as the upper-level application protocol, running over a TCP connection in lwIP. For ease of testing, I use PPP to connect to the FTP server on a host PC, so the setup looks like this:

FTP client <---> TCP/IP (lwIP) <---> PPP <-----------------------> TCP/IP (Linux) <---> FTP server

After stress testing and debugging, with more than 10 hours of uploading data, I found that the PCB gets corrupted in tcp_output(). What happens is that tcp_output() can block in the lower-level call made from tcp_output_segment() when the lower-layer protocol's buffer is full, which leaves the upper layer pending. Meanwhile the TCP timer keeps running, and tcp_slowtmr() also calls tcp_output(), so this second tcp_output() runs and returns before the first call has finished:

tcp_output()
{
    ......
    tcp_output_segment();  // may be pending here
                           // ---> tcp_output() is called by tcp_slowtmr(), and returns
    ......
    do something about pcb->unacked and pcb->unsent;
    ......
}

Obviously pcb->unacked and pcb->unsent can be corrupted while pcb->snd_queuelen is left unchanged, which results in a mismatch between the queue length and the segments actually on the unacked and unsent queues. Eventually the program hits the assertion.

Since I am using a very old version of lwIP, I am not sure whether the problem still exists in newer versions. In my opinion, tcp_output() would be better designed as a reentrant function: it could block when the lower-layer buffer is full and wait for a "write signal" before it continues sending data.

As a workaround, I re-check the PCB after tcp_output_segment() returns: the local pointer useg should still point to the tail of the unacked queue; otherwise the contents of the unacked queue can be overwritten.

Do you have any concerns about this? Any suggestions and discussion are welcome.
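To make the workaround concrete, here is a minimal sketch of the idea, not the actual lwIP code. The struct seg and struct pcb types and the helpers unacked_tail(), append_unacked(), queue_len() and demo_stale_tail() are hypothetical, simplified stand-ins that I made up for illustration. The nested call is simulated in a single thread, so the sketch only shows why a tail pointer cached before a blocking call can go stale and how re-deriving it avoids overwriting the queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an lwIP TCP segment. */
struct seg {
    struct seg *next;
    int id;
};

/* Hypothetical, simplified stand-in for the two PCB queues. */
struct pcb {
    struct seg *unsent;
    struct seg *unacked;
};

/* Walk the unacked queue and return its last segment, or NULL if empty. */
static struct seg *unacked_tail(const struct pcb *pcb)
{
    struct seg *s = pcb->unacked;
    while (s != NULL && s->next != NULL) {
        s = s->next;
    }
    return s;
}

/* Append seg to the unacked queue. The tail is re-derived on every call
 * instead of trusting a pointer cached before a call that may have blocked. */
static void append_unacked(struct pcb *pcb, struct seg *seg)
{
    struct seg *tail = unacked_tail(pcb);
    seg->next = NULL;
    if (tail == NULL) {
        pcb->unacked = seg;
    } else {
        tail->next = seg;
    }
}

/* Count the segments on a queue. */
static size_t queue_len(const struct seg *s)
{
    size_t n = 0;
    for (; s != NULL; s = s->next) {
        n++;
    }
    return n;
}

/* Simulate an outer call whose cached tail goes stale while it is pending. */
size_t demo_stale_tail(void)
{
    struct seg a = { NULL, 1 }, b = { NULL, 2 }, c = { NULL, 3 };
    struct pcb pcb = { NULL, &a };

    /* The outer call caches the tail before "sending" segment c. */
    struct seg *stale = unacked_tail(&pcb);

    /* Simulated nested call (e.g. from the timer) appends b meanwhile. */
    append_unacked(&pcb, &b);

    /* The cached pointer no longer names the tail; writing stale->next = &c
     * here would unlink b from the queue. */
    assert(stale != unacked_tail(&pcb));

    /* Re-checking the tail keeps both b and c on the queue. */
    append_unacked(&pcb, &c);
    return queue_len(pcb.unacked);
}
```

In this sketch demo_stale_tail() returns 3, meaning all three segments stay on the unacked queue. It does not address pcb->snd_queuelen or real concurrency, so it is only a partial illustration of the re-check.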
Best,
Jackie

On 01/11/15 01:17, Sylvain Rochet wrote:
> Hi Jackie,
>
> On Mon, Jan 05, 2015 at 11:59:00PM +0800, Jackie wrote:
>> Hi all,
>>
>> Recently when I am working on LWIP to do some stress test, e.g. continuously uploading data to a server via TCP connection, the device often crashed on an assert statement in tcp_receive(),
>>
>>     if (pcb->snd_queuelen != 0) {
>>       LWIP_ASSERT("tcp_receive: valid queue length",
>>                   pcb->unacked != NULL || pcb->unsent != NULL);
>>     }
>>
>> After debugging the crash case, I found some possible cause that the pcb structure has been corrupted by another thread during a context switch. I singled out one likely candidate, tcp_slowtmr(). In this timer, it calls another function tcp_pcb_purge(), in which it resets both unacked and unsent queue to NULL but without setting queuelen to 0. In some cases (like tcp state is FIN_WAIT_2), this timer will interrupt the current tcp thread in a preemptive OS environment, modifying the current pcb before hitting the assert statement afterwards.
>>
>> How likely will it be if so? Has anyone encountered a similar issue? Any suggestions?
>
> You are not specific enough to be able to conclude, but, as usual, it looks like a broken port or usage which do not follow lwIP threading model.
>
> Summary:
>
> - Do *NOT* call anything in interrupt context, nothing, never, never, use your OS semaphore signaling to an Ethernet/serial/… RX thread
>
> - memp_* functions are thread-safe if SYS_LIGHTWEIGHT_PROT is set, and again, thread safe does not mean it is interrupt safe, especially if your hardware does nested interrupts
>
> - Do *NOT* call any function from the RAW API outside lwIP thread
>
> - Use Netconn or Socket API in others threads, but keep in mind you should not share a Netconn/Socket control block between threads, (or use proper locking if you really have to, of course).
>
> Sylvain