From: "Pranith Kumar Karampuri"<address@hidden>
To: "Anand Avati"<address@hidden>
Cc: "Vijay Bellur"<address@hidden>, "Amar Tumballi"<address@hidden>, "Krishnan
Parthasarathi"
<address@hidden>, "Raghavendra Gowdappa"<address@hidden>
Sent: Tuesday, May 22, 2012 8:42:58 AM
Subject: Re: RFC on fix to bug #802414
Dude,
We already added logs yesterday at the LOCK and UNLOCK sites and saw
that the fd->inode->lock address changed between LOCK and UNLOCK.
"The hang in fuse_migrate_fd is _before_ the inode swap performed
there."
All the fds are opened on the same file, so all fds in the fd
migration point to the same inode. The race is hit by the nth fd; the
(n+1)th fd hangs. We saw that afr_local_cleanup was doing fd_unref:
LOCK(fd->inode->lock) was taken on one address, but by the time
UNLOCK(fd->inode->lock) ran the address had changed. So the next fd
that has to migrate hung, because the previous inode's lock was never
unlocked.
If, after the nth fd introduces the race, a _cbk arrives in the epoll
thread on the (n+1)th fd and tries to LOCK(fd->inode->lock), the epoll
thread will hang.
That is my theory for the hang we observed on Saturday.
Pranith.
----- Original Message -----
From: "Anand Avati"<address@hidden>
To: "Raghavendra Gowdappa"<address@hidden>
Cc: "Vijay Bellur"<address@hidden>, "Amar Tumballi"
<address@hidden>, "Krishnan Parthasarathi"
<address@hidden>, "Pranith Kumar Karampuri"
<address@hidden>
Sent: Tuesday, May 22, 2012 2:09:33 AM
Subject: Re: RFC on fix to bug #802414
On 05/21/2012 11:11 AM, Raghavendra Gowdappa wrote:
Avati,
fuse_migrate_fd (running in the reader thread, rdthr) assigns a new
inode to the fd once it looks up the inode in the new graph. But this
assignment can race with code in the poll thread (pthr) that accesses
fd->inode->lock, as follows:
pthr:  LOCK (fd->inode->lock);    (inode in old graph)
rdthr: fd->inode = inode;         (resolved in new graph)
pthr:  UNLOCK (fd->inode->lock);  (inode in new graph)
The way I see it (from the backtrace output in the other mail), the
swap happening in fuse_create_cbk() must be the one causing lock/unlock
to land on different inode objects. The hang in fuse_migrate_fd is
_before_ the inode swap performed there. Can you put some logs in
fuse_create_cbk()'s inode swap code and confirm this?
Now, any lock operation on the inode in the old graph will block.
Thanks to Pranith for pointing out this race condition.
The problem here is that we don't have a single lock that can
synchronize the assignment "fd->inode = inode" with other locking
attempts on fd->inode->lock. So, instead of trying to synchronize
them, we are thinking of eliminating the parallel accesses altogether.
This can be done by splitting fd migration into two tasks:
1. Actions on old graph (like fsync to flush writes to disk)
2. Actions in new graph (lookup, open)
We can send PARENT_DOWN when:
1. Task 1 is complete.
2. No fop sent by fuse is pending.
On receiving PARENT_DOWN, protocol/client will shut down its
transports. As part of transport cleanup, all pending frames are
unwound and protocol/client notifies its parents with the
PARENT_DOWN_HANDLED event. Each translator passes this event on to
its parents once it is convinced that there are no pending fops
started by it (background self-heal, reads issued by read-ahead,
etc.). Once fuse receives PARENT_DOWN_HANDLED, it is guaranteed that
no replies will race with migration (note that migration is done
using syncops). At that point it is safe to start Task 2 (which
associates the fd with an inode in the new graph).
Also note that the reader thread will not perform other operations
until it completes both tasks.
As far as the implementation of this patch goes, the major work is in
translators like read-ahead, afr, and dht, to provide the guarantee
required for sending the PARENT_DOWN_HANDLED event to their parents.
Please let me know your thoughts on this.
All the above steps might not apply if it is caused by the swap in
fuse_create_cbk(). Let's confirm that first.
Avati