Re: [lwip-users] Infinite hang in tcp_slowtmr()

On Fri, Oct 23, 2015 at 7:34 PM, Dinesh Pandey <address@hidden> wrote:

Seeing a similar problem:

Assertion "tcp_input: pcb->next != pcb (before cache)" failed at line 182 in <...>/core/tcp_in.c

I have two machines, one ARM and one i386, running the same code. I can reproduce it consistently on the ARM. I don't see it on the i386.

The LWIP task is running with NO_SYS=1 (as one task in a multitasking environment).

Will investigate over the next few days. Any hints welcome.

On Wed, Oct 14, 2015 at 11:03 PM, Sylvain Rochet <address@hidden> wrote:
Hi Stephen,

On Wed, Oct 14, 2015 at 09:13:59AM -0500, Stephen Cowell wrote:
> Hey Enrico,
> I'm using GNU toolchain/compiler, supplied with Atmel Studio 6.1.
> Since I've added the code I've had no other problems; I really don't
> have much time to research this, what with other pressures at work.
>
> It seems the issue is not unknown... sometimes the pcb ends up pointing
> to itself. These times appear to be correlated with high-stress I/O.
>
> Obviously the last pcb should point to null... and it should never point
> to itself. It is easy enough to catch it pointing to itself and make that
> null. I verified that this was the first pcb, and that we weren't going to
> have a memory leak when we just terminated the list. I did not have
> the resources to chase down when the pointer to self happened...
> I only know that it does, and that the pcb that this happens to is
> at the first allocated pcb address. The obvious thing to do was to
> correct the pointer to break the endless loop... seems to work.
>
> As Sylvain wrote, the Atmel port has some serious differences from
> what he's used to seeing... I'm assuming this has something to do
> with it. As I get more time (the product ships soon) I'll be able to
> spend some more time on this issue. I'm just glad to get it out there
> and let others know it's happening.

A linked list corruption is a very serious problem; you really must not
ship your product with such a known bug. Your workaround only mitigates a
single common corruption pattern on a linked list, and that is only one of
them. It will break sooner or later with another pattern.
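
To illustrate why a "points to itself" check is only one case, here is a
minimal sketch of a list sanity check that catches a loop of any length,
not just a node whose next pointer is itself. The struct node type and the
function name are hypothetical stand-ins, not lwIP's struct tcp_pcb or any
lwIP API, and a check like this only detects loops; it does not detect
other corruption such as a dangling next pointer.

```c
#include <stddef.h>

/* Hypothetical stand-in for a PCB list node; NOT lwIP's struct tcp_pcb. */
struct node {
    struct node *next;
};

/* Returns 1 if the list starting at head ends in NULL, 0 if it loops.
 * Floyd's tortoise-and-hare: the fast pointer advances two nodes per
 * iteration and can only meet the slow pointer if the list has a cycle. */
static int list_is_terminated(const struct node *head)
{
    const struct node *slow = head;
    const struct node *fast = head;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast) {
            return 0; /* cycle of some length, including a self-loop */
        }
    }
    return 1; /* reached the NULL terminator */
}
```

A check of this kind is useful as a debugging assertion to find when the
corruption happens; it is not a substitute for fixing the cause.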

If a linked list is corrupted, it's because there is a reentrancy problem
in the functions modifying the linked list, which really limits the scope
where reentrancy can occur. We have critical sections for !NO_SYS
systems; you could use the critical section hooks to check whether
reentrancy constraints are respected:
SYS_ARCH_DECL_PROTECT/SYS_ARCH_PROTECT/SYS_ARCH_UNPROTECT.
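
A rough sketch of what such a check could look like, assuming the port is
free to define the three macros itself. Everything except the three macro
names is hypothetical: cur_ctx stands in for whatever the target uses to
tell contexts apart (task handle, "in ISR" flag), and int is used in place
of the port's own protection type. Here cur_ctx is a plain variable so the
sketch compiles and can be exercised on a host.

```c
/* Hypothetical diagnostic state; none of these names come from lwIP. */
static int cur_ctx = 0;          /* stand-in for "which context am I in" */
static int owner_ctx = -1;       /* context currently inside a section   */
static unsigned depth = 0;       /* nesting depth of protected sections  */
static unsigned violations = 0;  /* entries by a second context          */

static int check_enter(void)
{
    if (depth == 0) {
        owner_ctx = cur_ctx;     /* first entry: remember the owner */
    } else if (owner_ctx != cur_ctx) {
        violations++;            /* another context got in: reentrancy */
    }
    depth++;
    return 0;                    /* dummy level value, unused here */
}

static void check_leave(int lev)
{
    (void)lev;
    if (depth > 0) {
        depth--;
    }
    if (depth == 0) {
        owner_ctx = -1;          /* section fully released */
    }
}

#define SYS_ARCH_DECL_PROTECT(lev) int lev
#define SYS_ARCH_PROTECT(lev)      lev = check_enter()
#define SYS_ARCH_UNPROTECT(lev)    check_leave(lev)
```

These hooks only count violations and protect nothing. The counters are
themselves unprotected, so this is a debugging aid for finding the
offending call path, for example by setting a breakpoint where violations
is incremented.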

At least, if you want to ship your product very quickly, just define
those hooks to something appropriate (those are recursive locks, so
you'll have to take care of that) and you should be safe, for now.
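
One common way to get the recursive behaviour is to save the previous
interrupt state on entry and restore exactly that state on exit, rather
than unconditionally re-enabling. The following is a sketch under that
assumption only: irq_enabled simulates the CPU's interrupt-enable flag so
the code runs on a host, and arch_protect/arch_unprotect are hypothetical
names. On a real target they would wrap the platform's own "disable and
return previous state" and "restore state" primitives.

```c
/* Simulated interrupt-enable flag; a real port reads the CPU state. */
static volatile int irq_enabled = 1;

/* Disable "interrupts" and return the state that was active before. */
static int arch_protect(void)
{
    int prev = irq_enabled;
    irq_enabled = 0;
    return prev;
}

/* Restore the saved state. A nested section restores "disabled", so
 * only the outermost unprotect actually re-enables interrupts. */
static void arch_unprotect(int prev)
{
    irq_enabled = prev;
}

#define SYS_ARCH_DECL_PROTECT(lev) int lev
#define SYS_ARCH_PROTECT(lev)      lev = arch_protect()
#define SYS_ARCH_UNPROTECT(lev)    arch_unprotect(lev)
```

With this shape, a protected function can safely call another protected
function: the inner SYS_ARCH_UNPROTECT leaves interrupts disabled because
that is the state it saved.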

Sylvain

_______________________________________________
lwip-users mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/lwip-users

From:	Dinesh Pandey
Subject:	Re: [lwip-users] Infinite hang in tcp_slowtmr()
Date:	Thu, 29 Oct 2015 20:06:30 +0530
