Re: [Gluster-devel] preventing gfid-mismatches because of crashes in afr


From: raghav
Subject: Re: [Gluster-devel] preventing gfid-mismatches because of crashes in afr
Date: Wed, 12 Mar 2014 15:23:24 +0530
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130402 Thunderbird/17.0.5

On 03/11/2014 09:07 PM, Pranith Kumar Karampuri wrote:
hi,

    Traditionally afr just remembers which of the directories are good vs stale
in extended attributes and then, at the time of self-heal, does a full directory
scan, deletes stale entries and creates new entries. There are two problems
with this approach:
1) Even creating/deleting/renaming one entry requires a full scan of the
directory.
2) If both bricks crash at the same time while a rename is going on, then it 
can lead to same-name, different gfid split-brains.
    Example:
             0) dir1 has file 'a' with gfid-a, dir2 has file 'b' with gfid-b.
             1) user executes rename dir1/a -> dir2/b on the mount over-writing 
the original file b.
             2) On brick-0 rename succeeds so the end result is dir1 does not 
have 'a' and dir2 has file 'b' with gfid-a
             3) at this point both the brick processes go down or data center 
shutdown happens etc, so brick-1 still has dir1 with file 'a' with 'gfid-a' and 
dir2 with file 'b' with 'gfid-b'.
             4) Now when both bricks are back up, dir1 can be healed
conservatively: 'a' will be recreated on brick-0 with 'gfid-a' and healed from
brick-1 to brick-0 (incorrectly undoing the rename).
             5) But for dir2, brick-0 has a file 'b' with 'gfid-a' whereas
brick-1 has a file 'b' with 'gfid-b'. afr at the moment doesn't store any
information to figure out which one is correct.
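
The scenario in steps 0-5 can be reproduced with a toy model. This is only an
illustrative sketch, not afr code: each brick is modelled as a dict of
directories, and the helper names (rename, gfid_mismatches) are made up for
the example.

```python
import copy

# Toy model: a brick maps directory name -> {entry name: gfid}.
initial = {"dir1": {"a": "gfid-a"}, "dir2": {"b": "gfid-b"}}


def rename(brick, src_dir, src_name, dst_dir, dst_name):
    """Rename inside one brick, overwriting any existing destination entry."""
    gfid = brick[src_dir].pop(src_name)
    brick[dst_dir][dst_name] = gfid


def gfid_mismatches(brick0, brick1):
    """List (dir, name, gfid0, gfid1) for names that exist on both bricks
    but carry different gfids."""
    found = []
    for d in sorted(brick0.keys() & brick1.keys()):
        for name in sorted(brick0[d].keys() & brick1[d].keys()):
            if brick0[d][name] != brick1[d][name]:
                found.append((d, name, brick0[d][name], brick1[d][name]))
    return found


brick0 = copy.deepcopy(initial)
brick1 = copy.deepcopy(initial)

# Step 2: the rename completes on brick-0 only, then both bricks go down.
rename(brick0, "dir1", "a", "dir2", "b")

mismatches = gfid_mismatches(brick0, brick1)
print(mismatches)  # [('dir2', 'b', 'gfid-a', 'gfid-b')]
```

Nothing in either dict says which of the two 'b' entries is the stale one,
which is exactly the information that is missing today.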

To address this issue, the granularity of the preop/postop of entry operations
needs to be made finer.
A filename inside a directory can be uniquely identified by the entry-tuple
(parent-gfid, entryname, entry-gfid).
Example: For dir2/b in the example above we can represent it as (gfid-of-dir2, 
b, gfid-b) on brick-1

So we need to remember such information for every entry fop along with whether 
that entry is coming 'in' to the directory or going 'out' of the directory.
So in the previous example we would have remembered that dir2/b with gfid-b is
going out of that directory, so that entry could be deleted and dir2/b with
gfid-a could be healed from brick-0.

The solution that we come up with should have the following functionalities 
broadly:
1) Given an entry-tuple it should be able to remember that it is going in or 
out of that directory.
2) Given an existing entry-tuple it should be able to forget it.
3) Given an entry-tuple, we should be able to query if that entry-tuple is 
going in/out.
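
The three requirements above amount to a small remember/forget/query
interface. A minimal in-memory sketch follows; the names EntryTuple and
EntryIndex are hypothetical, and a real implementation would have to persist
this state on the brick rather than keep it in a dict.

```python
from typing import Dict, NamedTuple, Optional


class EntryTuple(NamedTuple):
    pargfid: str  # gfid of the parent directory
    name: str     # entry name inside that directory
    gfid: str     # gfid of the entry itself


class EntryIndex:
    """Tracks whether an entry-tuple is going 'in' to or 'out' of its parent."""

    def __init__(self) -> None:
        self._pending: Dict[EntryTuple, str] = {}

    def remember(self, entry: EntryTuple, direction: str) -> None:
        # Requirement 1: record the direction of a pending entry operation.
        if direction not in ("in", "out"):
            raise ValueError("direction must be 'in' or 'out'")
        self._pending[entry] = direction

    def forget(self, entry: EntryTuple) -> None:
        # Requirement 2: drop the record once the operation has completed.
        self._pending.pop(entry, None)

    def query(self, entry: EntryTuple) -> Optional[str]:
        # Requirement 3: 'in', 'out', or None if nothing is pending.
        return self._pending.get(entry)


# The rename example: on dir2, 'b' with gfid-b is going out and
# 'b' with gfid-a is coming in.
idx = EntryIndex()
old_b = EntryTuple("gfid-of-dir2", "b", "gfid-b")
new_b = EntryTuple("gfid-of-dir2", "b", "gfid-a")
idx.remember(old_b, "out")
idx.remember(new_b, "in")

print(idx.query(old_b))  # out
print(idx.query(new_b))  # in
```

With these two records, self-heal can tell that 'b' with gfid-b is the entry
to delete.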

This is one possible way to address this issue:
0) Create directory .glusterfs/indices/entry and two files 'in', 'out' in that
directory.
1) Every time creat/mknod/symlink/link/mkdir happens, create a hardlink from
the following path .glusterfs/indices/entry/pargfid/gfid/filename to
'.glusterfs/indices/entry/in' as part of pre-op
2) Every time unlink/rmdir happens, create a hardlink from the following path
.glusterfs/indices/entry/pargfid/gfid/filename to
'.glusterfs/indices/entry/out' as part of pre-op
3) Every time rename happens create the following 2/3 hardlinks
    - .glusterfs/indices/entry/old-pargfid/gfid/old-filename to 
'.glusterfs/indices/entry/out'
    - .glusterfs/indices/entry/new-pargfid/gfid/new-filename to 
'.glusterfs/indices/entry/in'
and if the destination exists:
    - .glusterfs/indices/entry/new-pargfid/existing-file-gfid/new-filename to
'.glusterfs/indices/entry/out'
4) Delete the same files as part of post-op.
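
Steps 0-4 could be sketched roughly as below. This is an assumption-laden
illustration rather than the actual implementation: the class name
HardlinkEntryIndex and its methods are hypothetical, the index lives in a
temporary directory standing in for a brick, and no fsync or crash-safety
handling is attempted.

```python
import os
import tempfile


class HardlinkEntryIndex:
    """Pending entry ops stored as hardlinks to the 'in' / 'out' marker files."""

    def __init__(self, index_root):
        self.root = index_root
        os.makedirs(self.root, exist_ok=True)
        for marker in ("in", "out"):          # step 0: the two marker files
            path = os.path.join(self.root, marker)
            if not os.path.exists(path):
                open(path, "w").close()

    def _path(self, pargfid, gfid, name):
        return os.path.join(self.root, pargfid, gfid, name)

    def remember(self, pargfid, gfid, name, direction):
        """Pre-op: make pargfid/gfid/name a hardlink of the chosen marker."""
        if direction not in ("in", "out"):
            raise ValueError("direction must be 'in' or 'out'")
        path = self._path(pargfid, gfid, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        os.link(os.path.join(self.root, direction), path)

    def query(self, pargfid, gfid, name):
        """Return 'in', 'out', or None by checking which marker shares the inode."""
        path = self._path(pargfid, gfid, name)
        if not os.path.lexists(path):
            return None
        for marker in ("in", "out"):
            if os.path.samefile(path, os.path.join(self.root, marker)):
                return marker
        return None

    def forget(self, pargfid, gfid, name):
        """Post-op (step 4): remove the link and prune directories left empty."""
        path = self._path(pargfid, gfid, name)
        try:
            os.unlink(path)
        except FileNotFoundError:
            return
        gfid_dir = os.path.dirname(path)
        for d in (gfid_dir, os.path.dirname(gfid_dir)):
            try:
                os.rmdir(d)
            except OSError:                   # directory still has other entries
                break


# Step 3: rename dir1/a -> dir2/b, where dir2/b already exists with gfid-b.
with tempfile.TemporaryDirectory() as brick:
    idx = HardlinkEntryIndex(os.path.join(brick, ".glusterfs", "indices", "entry"))
    links = [
        ("gfid-of-dir1", "gfid-a", "a", "out"),   # old name going out
        ("gfid-of-dir2", "gfid-a", "b", "in"),    # new name coming in
        ("gfid-of-dir2", "gfid-b", "b", "out"),   # overwritten file going out
    ]
    for pargfid, gfid, name, direction in links:
        idx.remember(pargfid, gfid, name, direction)
    before = [idx.query(p, g, n) for p, g, n, _ in links]
    for pargfid, gfid, name, _ in links:
        idx.forget(pargfid, gfid, name)
    after = [idx.query(p, g, n) for p, g, n, _ in links]
    leftover = sorted(os.listdir(idx.root))

print(before)    # ['out', 'in', 'out']
print(after)     # [None, None, None]
print(leftover)  # ['in', 'out']
```

Only the marker files remain once every post-op has run, so any link still
present after a crash points at an entry operation that may not have completed
on this brick.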

2 questions:

1) How does this approach solve the 1st case, where a scan of the full directory would be required? If we delete these transit files during the post-op, wouldn't we still need a full scan if one brick is down and directory entry operations are done on the other brick? (Or am I missing something here?)

2) After a crash, the indices directory itself need not be intact. Wouldn't that cause problems if we expect the indices to be crash-consistent?

Regards
Raghav


