Brent A Nelson
Re: [Gluster-devel] Re: [bug #19614] System crashes when node fails even with xfr
Fri, 11 May 2007 15:37:54 -0400 (EDT)
On Sat, 12 May 2007, Anand Avati wrote:
> You have observed the reconnection logic correctly. This effect
> crept in after we introduced the non-blocking TCP connect
> functionality, which pushes the connect to the background if it takes
> more than N usecs (the current I/O request is returned as failed if
> the connect() didn't succeed in that shot). By the time the second I/O
> request comes, the connect would have succeeded and the call goes
> through.
>
> This can be 'fixed' by tuning the "N usecs" (currently hardcoded in
> the code, but I want to make it configurable from the spec soon) in
> the transport code. But the flip side of making this "N" large is
> that if the server is really dead for a long time, all I/O on the dead
> transport will be blocked for that period, which can accumulate into
> quite an inconvenience.
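
For readers following the thread, the behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not the GlusterFS transport code: the function names start_connect and connect_finished and the wait_usecs parameter are hypothetical, and wait_usecs stands in for the "N usecs" mentioned above.

```python
import errno
import os
import select
import socket


def start_connect(addr):
    """Begin a non-blocking TCP connect and return the socket immediately."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex(addr)
    # 0 means connected already; EINPROGRESS/EWOULDBLOCK means "in background".
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        raise OSError(err, os.strerror(err))
    return sock


def connect_finished(sock, wait_usecs):
    """Wait at most wait_usecs for the pending connect to complete.

    Returns True if connected, False if the connect is still pending
    (the caller would fail the current I/O request in that case).
    Raises OSError if the connect attempt itself failed.
    """
    _, writable, _ = select.select([], [sock], [], wait_usecs / 1_000_000)
    if not writable:
        return False
    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err != 0:
        raise OSError(err, os.strerror(err))
    return True
```

In this sketch the first I/O request would call start_connect() and then connect_finished(sock, N); if that returns False the request fails, and a later request calling connect_finished(sock, 0) would normally find the connection established. A larger N makes the first request more likely to succeed, but every request on a dead transport then waits up to N.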
Cool. I agree that the time should be quite short (in case nodes are
still down, that gives you access to what is available without a delay for
each and every request), but it would be nice if it waited a minimal
period for a reconnect to succeed. Making it user-configurable would be nice. It
would help in my mysterious disconnect case (where all machines are
running fine, it's just that the client/server briefly disconnect,
disrupting the current I/O). It could also help on bad network links.
It's probably not that important in real disconnect cases, though, where a
machine may be down or rebooting.
> But then, that's not the end of it. The reconnection logic is being
> redesigned so that reconnection is done proactively when a connection
> dies (not when I/O is triggered).
Sounds good. Maybe both could work together?
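
One way the two ideas might fit together, sketched in Python under stated assumptions: a background thread reconnects proactively whenever the link is down, and each I/O request waits only a short, configurable grace period for the link to come back. The Transport class and its retry_interval and grace parameters are hypothetical names for illustration, not anything from the GlusterFS code.

```python
import socket
import threading


class Transport:
    """Hypothetical client transport that reconnects in the background."""

    def __init__(self, addr, retry_interval=0.1, grace=0.05):
        self.addr = addr
        self.retry_interval = retry_interval  # seconds between attempts
        self.grace = grace                    # max seconds an I/O call waits
        self.sock = None
        self.lock = threading.Lock()
        self.connected = threading.Event()
        self.stopped = threading.Event()
        threading.Thread(target=self._reconnect_loop, daemon=True).start()

    def _reconnect_loop(self):
        # Proactive part: runs whether or not any I/O is pending.
        while not self.stopped.is_set():
            if not self.connected.is_set():
                try:
                    sock = socket.create_connection(self.addr, timeout=1.0)
                except OSError:
                    sock = None
                if sock is not None:
                    with self.lock:
                        if self.stopped.is_set():
                            sock.close()
                        else:
                            self.sock = sock
                            self.connected.set()
            self.stopped.wait(self.retry_interval)

    def mark_dead(self):
        """Drop the current socket so the background loop reconnects."""
        with self.lock:
            if self.sock is not None:
                self.sock.close()
                self.sock = None
            self.connected.clear()

    def send(self, data):
        # Short wait: give a brief disconnect a chance to heal, then fail.
        if not self.connected.wait(self.grace):
            raise ConnectionError("transport is down")
        with self.lock:
            sock = self.sock
        if sock is None:
            raise ConnectionError("transport is down")
        try:
            sock.sendall(data)
        except OSError as exc:
            self.mark_dead()
            raise ConnectionError("send failed") from exc

    def close(self):
        self.stopped.set()
        self.mark_dead()
```

With a small grace value, requests against a node that is really down fail almost immediately, while a brief client/server disconnect is usually repaired by the background loop before the next request notices it.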