Re: [lwip-users] 1.4 rc1 non-blocking issues

lwip-users

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [lwip-users] 1.4 rc1 non-blocking issues

From:	Yoav Nissim
Subject:	Re: [lwip-users] 1.4 rc1 non-blocking issues
Date:	Thu, 02 Dec 2010 13:00:22 +0200
User-agent:	Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.9) Gecko/20100915 Lightning/1.0b2 Thunderbird/3.1.4


Hi Simon, Kieran, thank you for your replies.

I will post these issues on savannah as soon as I can get to it.


About our select issues -

Since we have encountered the thread concurrency limitation a while
back, we have changed our design - we have only two threads calling lwIP
sockets, and they never do so concurrently.

I am not sure whether this limitation applies to a blocking select - can
calls to socket api's be made while select is blocking on that same socket?


As for the crash I mentioned, it occurs in event_callbacks' select
handling, and is not the result of the above-mentioned threading limitation.

It seems to be the result of changing the synchronization mechanism in
1.4.0. lwIP 1.3.2 used a "broad" semaphore to synchronize event_callback
and lwip_select whereas 1.4.0 rc1 uses the critical section and a counter.

The following code in event_callback crashes when dereferencing scb:


        for (scb = select_cb_list; scb; scb = scb->next) {

We have found the following problems:


1. In the event_callback code:


        /* unlock interrupts with each step */

        last_select_cb_ctr = select_cb_ctr;
        SYS_ARCH_UNPROTECT(lev);
        SYS_ARCH_PROTECT(lev);
        if (last_select_cb_ctr != select_cb_ctr) {
            /* someone has changed select_cb_list, restart at the
beginning */
            scb = select_cb_list;

        }


When someone removes the last element in the list (scb == NULL) we
should break out of the loop.


2. In lwip_select, the counter is not updated when an element is removed
from the list.


        /* Take us off the list */
        SYS_ARCH_PROTECT(lev);
        if (select_cb.next != NULL) {
            select_cb.next->prev = select_cb.prev;
        }
        if (select_cb_list == &select_cb) {
            LWIP_ASSERT("select_cb.prev == NULL", select_cb.prev == NULL);
            select_cb_list = select_cb.next;
        } else {
            LWIP_ASSERT("select_cb.prev != NULL", select_cb.prev != NULL);
            select_cb.prev->next = select_cb.next;
        }
        SYS_ARCH_UNPROTECT(lev);



HTH,
Tal & Yoav.




On 2/12/2010 9:14, Simon Goldschmidt wrote:

> Kieran Mansley <address@hidden> wrote:
>>> 1. ERR_WOULDBLOCK is treated as a FATAL error - it seems as if someone
>>> forgot to update the ERR_IS_FATAL macro when the error code was added. A
>>> non-blocking operation that sets the conn error to WOULDBLOCK (e.g
>>> send() and recv() ) renders the socket unusable. Our workaround was to
>>> use ERR_WOULDBLOCK in the ERR_IS_FATAL macro instead of ERR_VAL.
>> I agree that looks wrong.  I'm sure it was correct in the past.  I'll make
>> sure this is fixed before 1.4.0 is released.  If you could file a bug for
>> this on savannah it would help.
> I've just fixed that in CVS, thanks for reporting. (I can still not add a new 
> bug to savannah...)
>
>>> 2. As far as we know, EMSGSIZE is not a valid return code for send() on
>>> a STREAM socket. netconn_write does not return the number of bytes
>>> processed and cannot perform partial sends. This makes an application
>>> that uses select run in tight loops since select returns writable, but
>>> send [working on an all or nothing assumption] returns an error
>>> (EWOULDBLOCK)
>> I'd like to see lwIP updated to support partial sends (which I think would
>> solve this problem) but it won't happen before 1.4.0.  A task on savannah
>> would again be very helpful, if there's not one already - I think this
>> might have been discussed before.
>
> There's bug #31084 for that on savannah: 
> https://savannah.nongnu.org/bugs/?31084
> I've targeted it at 1.4.0 though. I think it's one of the main missing things 
> for lwIP to support nonblocking sockets, so it might be worth including that 
> to prevent having to launch 1.4.1 too soon...
>
>>> 3. connect has several problems:
>>>
>>>     a. connect sets sock->err to EINPROGRESS. When select returns
>>> writable, getsockopt(SO_ERROR) will never let us know what happened [i.e
>>> no access to conn->err] since getsockopt(SO_ERROR) does not return the
>>> error value when sock->err is not 0 (it is set to EINPROGRESS). It seems
>>> to me the non-blocking path lacks the propagation of the connect result
>>> to sock->err (which does happen when using a blocking call).
> Could you try CVS HEAD? There was a bug there 
> (http://savannah.nongnu.org/bugs/?31590) which should be fixed since a week 
> or so now.
>
>>>     b. getsockopt(SO_ERROR) - behaviour according to Posix is to return
>>> and clear the _pending_ error for the socket (if one exists). instead
>>> getsockopt returns the last socket call error once. If additional calls
>>> are made netconn's last error is returned repeatedly.
> That's a problem of the mixture of 3 APIs here: socket API is built on top of 
> the netconn API which in turn is built on top of the raw API. In a fatal 
> error state, the socket could have an error that cannot easily be removed, so 
> not returning that first could be a problem. This code needs further 
> thoughts, so it would be best to open a bug entry for discussion (as soon as 
> savannah is fully back online again).
>
>>>     c. if connect is called again while a previous non-blocking connect
>>> is being processed, ERR_ISCONN is assigned to conn->err [which by the
>>> way translates to an errno of -1]. Now, if the connection succeeds,
>>> do_connected will not be able to set conn->err to ERR_OK since it checks
>>> for ERR_INPROGRESS. To make things worse, ERR_ISCONN is treated as a
>>> FATAL error, and will therefore render the socket unusable. According to
>>> Posix, EALREADY should be returned while a connect is in progress, and
>>> EISCONN should be returned when a socket is connected.
> That's a good one. Again, please file a bug report to make sure this doesn't 
> get lost!
>
>>> 4. lwip_select seems to be susceptible to race conditions and has issued
>>> many ASSERTs as well as crashed.
>> These sounds like you're using a socket from two different threads (one
>> calling select(), the other calling close()).  Unfortunately this style of
>> operation isn't supported in lwIP.
> Hmm, select() does use a global list to queue all threads waiting in a select 
> (select_cb_list), but for me, that worked quite well, so any detailed bug 
> report would be great.
>
> Thanks for the feedback, and please remember to open bug reports for the open 
> issues: there are too many posts on this list to keep track of bugs without 
> the bugtracker.
>
> Simon
>
> Simon

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [lwip-users] 1.4 rc1 non-blocking issues, Kieran Mansley, 2010/12/01
- Re: [lwip-users] 1.4 rc1 non-blocking issues, Simon Goldschmidt, 2010/12/02
  - Re: [lwip-users] 1.4 rc1 non-blocking issues, Yoav Nissim <=

Prev by Date: RE: [lwip-users] RTAI and LwIP for Realtime Communication
Next by Date: [lwip-users] where do I find a recent documentation about lwip?
Previous by thread: Re: [lwip-users] 1.4 rc1 non-blocking issues
Next by thread: [lwip-users] RTAI and LwIP for Realtime Communication
Index(es):
- Date
- Thread