[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [lwip-users] 1.4 rc1 non-blocking issues
From: |
Yoav Nissim |
Subject: |
Re: [lwip-users] 1.4 rc1 non-blocking issues |
Date: |
Thu, 02 Dec 2010 13:00:22 +0200 |
User-agent: |
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.9) Gecko/20100915 Lightning/1.0b2 Thunderbird/3.1.4 |
Hi Simon, Kieran, thank you for your replies.
I will post these issues on savannah as soon as I can get to it.
About our select issues -
Since we have encountered the thread concurrency limitation a while
back, we have changed our design - we have only two threads calling lwIP
sockets, and they never do so concurrently.
I am not sure whether this limitation applies to a blocking select - can
calls to socket api's be made while select is blocking on that same socket?
As for the crash I mentioned, it occurs in event_callbacks' select
handling, and is not the result of the above-mentioned threading limitation.
It seems to be the result of changing the synchronization mechanism in
1.4.0. lwIP 1.3.2 used a "broad" semaphore to synchronize event_callback
and lwip_select whereas 1.4.0 rc1 uses the critical section and a counter.
The following code in event_callback crashes when dereferencing scb:
for (scb = select_cb_list; scb; scb = scb->next) {
We have found the following problems:
1. In the event_callback code:
/* unlock interrupts with each step */
last_select_cb_ctr = select_cb_ctr;
SYS_ARCH_UNPROTECT(lev);
SYS_ARCH_PROTECT(lev);
if (last_select_cb_ctr != select_cb_ctr) {
/* someone has changed select_cb_list, restart at the
beginning */
scb = select_cb_list;
}
When someone removes the last element in the list (scb == NULL) we
should break out of the loop.
2. In lwip_select, the counter is not updated when an element is removed
from the list.
/* Take us off the list */
SYS_ARCH_PROTECT(lev);
if (select_cb.next != NULL) {
select_cb.next->prev = select_cb.prev;
}
if (select_cb_list == &select_cb) {
LWIP_ASSERT("select_cb.prev == NULL", select_cb.prev == NULL);
select_cb_list = select_cb.next;
} else {
LWIP_ASSERT("select_cb.prev != NULL", select_cb.prev != NULL);
select_cb.prev->next = select_cb.next;
}
SYS_ARCH_UNPROTECT(lev);
HTH,
Tal & Yoav.
On 2/12/2010 9:14, Simon Goldschmidt wrote:
> Kieran Mansley <address@hidden> wrote:
>>> 1. ERR_WOULDBLOCK is treated as a FATAL error - it seems as if someone
>>> forgot to update the ERR_IS_FATAL macro when the error code was added. A
>>> non-blocking operation that sets the conn error to WOULDBLOCK (e.g
>>> send() and recv() ) renders the socket unusable. Our workaround was to
>>> use ERR_WOULDBLOCK in the ERR_IS_FATAL macro instead of ERR_VAL.
>> I agree that looks wrong. I'm sure it was correct in the past. I'll make
>> sure this is fixed before 1.4.0 is released. If you could file a bug for
>> this on savannah it would help.
> I've just fixed that in CVS, thanks for reporting. (I can still not add a new
> bug to savannah...)
>
>>> 2. As far as we know, EMSGSIZE is not a valid return code for send() on
>>> a STREAM socket. netconn_write does not return the number of bytes
>>> processed and cannot perform partial sends. This makes an application
>>> that uses select run in tight loops since select returns writable, but
>>> send [working on an all or nothing assumption] returns an error
>>> (EWOULDBLOCK)
>> I'd like to see lwIP updated to support partial sends (which I think would
>> solve this problem) but it won't happen before 1.4.0. A task on savannah
>> would again be very helpful, if there's not one already - I think this
>> might have been discussed before.
>
> There's bug #31084 for that on savannah:
> https://savannah.nongnu.org/bugs/?31084
> I've targeted it at 1.4.0 though. I think it's one of the main missing things
> for lwIP to support nonblocking sockets, so it might be worth including that
> to prevent having to launch 1.4.1 too soon...
>
>>> 3. connect has several problems:
>>>
>>> a. connect sets sock->err to EINPROGRESS. When select returns
>>> writable, getsockopt(SO_ERROR) will never let us know what happened [i.e
>>> no access to conn->err] since getsockopt(SO_ERROR) does not return the
>>> error value when sock->err is not 0 (it is set to EINPROGRESS). It seems
>>> to me the non-blocking path lacks the propagation of the connect result
>>> to sock->err (which does happen when using a blocking call).
> Could you try CVS HEAD? There was a bug there
> (http://savannah.nongnu.org/bugs/?31590) which should be fixed since a week
> or so now.
>
>>> b. getsockopt(SO_ERROR) - behaviour according to Posix is to return
>>> and clear the _pending_ error for the socket (if one exists). instead
>>> getsockopt returns the last socket call error once. If additional calls
>>> are made netconn's last error is returned repeatedly.
> That's a problem of the mixture of 3 APIs here: socket API is built on top of
> the netconn API which in turn is built on top of the raw API. In a fatal
> error state, the socket could have an error that cannot easily be removed, so
> not returning that first could be a problem. This code needs further
> thoughts, so it would be best to open a bug entry for discussion (as soon as
> savannah is fully back online again).
>
>>> c. if connect is called again while a previous non-blocking connect
>>> is being processed, ERR_ISCONN is assigned to conn->err [which by the
>>> way translates to an errno of -1]. Now, if the connection succeeds,
>>> do_connected will not be able to set conn->err to ERR_OK since it checks
>>> for ERR_INPROGRESS. To make things worse, ERR_ISCONN is treated as a
>>> FATAL error, and will therefore render the socket unusable. According to
>>> Posix, EALREADY should be returned while a connect is in progress, and
>>> EISCONN should be returned when a socket is connected.
> That's a good one. Again, please file a bug report to make sure this doesn't
> get lost!
>
>>> 4. lwip_select seems to be susceptible to race conditions and has issued
>>> many ASSERTs as well as crashed.
>> These sounds like you're using a socket from two different threads (one
>> calling select(), the other calling close()). Unfortunately this style of
>> operation isn't supported in lwIP.
> Hmm, select() does use a global list to queue all threads waiting in a select
> (select_cb_list), but for me, that worked quite well, so any detailed bug
> report would be great.
>
> Thanks for the feedback, and please remember to open bug reports for the open
> issues: there are too many posts on this list to keep track of bugs without
> the bugtracker.
>
> Simon
>
> Simon