Re: [Gluster-devel] Choice of Translator question

From: Kevan Benson
Subject: Re: [Gluster-devel] Choice of Translator question
Date: Thu, 27 Dec 2007 09:58:13 -0800
User-agent: Thunderbird (X11/20071031)

Gareth Bult wrote:
The trusted.afr.version extended attribute tracks which file
version is being used, and on a read, all participating AFR members
should respond with this information, and any older/obsoleted file
versions are replaced by a newer copy from one of the valid AFR
members (this is self-heal)

Yes, understood.
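
The version comparison described above can be sketched roughly as follows. This is only an illustration of the idea, not GlusterFS code: the function name plan_self_heal and the assumption that each member reports its version as a plain integer are hypothetical.

```python
def plan_self_heal(versions):
    """Given {member_name: version}, return (source, stale_members).

    Hypothetical helper: assumes every AFR member reports an integer
    version and that the highest number marks the valid copy.
    """
    if not versions:
        raise ValueError("no AFR members responded")
    newest = max(versions.values())
    # Any member holding the newest version is a valid source;
    # pick the alphabetically first one so the choice is deterministic.
    source = min(name for name, v in versions.items() if v == newest)
    # Members with an older version need a fresh copy from the source.
    stale = sorted(name for name, v in versions.items() if v < newest)
    return source, stale
```

For example, plan_self_heal({"brick1": 7, "brick2": 6}) names brick1 as the source and brick2 as the member to be overwritten.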

I think they are planning striped reads per block (maybe definable)
at a later date.

Mmm, so at the moment, when it says AFR does striped reads, what it
really means is that it does striped reads, just so long as you have
lots of relatively small files and not a few large files .. ???

I'm not sure. It could very well depend on which version you are using, and where you read that. I'm sure some features listed in the wiki are only implemented in the TLA releases until they put out the next point release.

Read the file from a client (head -c1 FILE >/dev/null to

OR find /mountedfs -exec head -c1 > /dev/null {} \;

.. which is good, but VERY inefficient for a large file-system.

Agreed, which is why I just showed the single file self-heal method, since in your case targeted self heal (maybe before a full filesystem self heal) might be more useful.
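
A targeted self-heal over a short list of important files could look something like the sketch below. It does the same thing as running head -c1 on each file from the client mount. The helper names and the idea of passing in an explicit list of paths are assumptions for illustration only.

```python
def touch_first_byte(path):
    """Read one byte from path, like `head -c1 path >/dev/null`."""
    with open(path, "rb") as f:
        f.read(1)


def heal_selected(paths):
    """Read the first byte of each listed file on the client mount.

    Returns the paths that could not be read, so they can be retried
    or investigated. The read itself is what is expected to trigger
    the self-heal; this code does no healing of its own.
    """
    failed = []
    for p in paths:
        try:
            touch_first_byte(p)
        except OSError:
            failed.append(p)
    return failed
```

Calling heal_selected(["/mnt/stripe/database"]) before a full find over the filesystem would heal the file that matters most first.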

you could use the stripe translator over AFR to AFR chunks of the
DB file, thus allowing per chunk self-heal.

Mmm, my experimentation indicates that this does not happen. I've
just spent 3 hours trying to prove / disprove this with various
configurations - AFR self-heals on a file basis, not on a
stripe-chunk basis.

If I have 4 bricks, two stripes using 2 bricks each, then an AFR on
top - any sort of self-heal replicates the entire DB. If I have 4
bricks, two AFR's and one stripe on top, I get the same thing.

I would expect AFR over stripe to replicate the whole file on inconsistent AFR versions, but I would have thought stripe over AFR would work, as the AFR should only be seeing chunks of files. I don't see how the AFR could even be aware the chunks belong to the same file, so how it would know to replicate all the chunks of a file is a bit of a mystery to me. I will admit I haven't done much with the stripe translator though, so my understanding of its operation may be wrong.

I'm not familiar enough with database file writing practices in
general (not to mention your particular database's practices), or
the stripe translator to tell whether any of the following will
cause you problems, but they are worth looking into:

We're talking about flat files here, some with append, some with
seek/write updates.

Eh, it's probably not a problem anyways because of the way filesystems do block management.

1) Will the overhead the stripe translator introduces with a very
large file and relatively small chunks cause performance problems?
(5G in 1MB stripes = 5000 parts...)

No, this would be fine if the AFR/Stripe combination actually did a
per-chunk self heal.

I was thinking the stripe translator may add some extra overhead to the network, but it probably only requests the stripes that hold data you are requesting, so it probably is a non-issue (as you said).
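
To put rough numbers on this, here is a small sketch of the chunk arithmetic. It assumes fixed-size chunks numbered from zero by byte offset, which is an assumption about the layout rather than something confirmed in this thread; the function name is hypothetical.

```python
MiB = 1024 * 1024
GiB = 1024 * MiB


def chunks_for_read(offset, length, stripe_size):
    """Return the indices of the chunks covering [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return range(first, last + 1)


# Number of 1 MiB chunks in a 5 GiB file (ceiling division).
total_chunks = (5 * GiB + MiB - 1) // MiB
```

With binary units, total_chunks comes out to 5120, close to the "5000 parts" estimate above. A short read at the 900 MiB mark falls entirely inside chunk 900, so only that one chunk would need to be fetched if the translator works per chunk.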

2) How will GlusterFS handle a write to a stripe that is currently
self-healing?  Block?

The stripe replicates the entire stripe (which is big) and both read
and write operations block during the heal.

Do you mean that a change to a stripe replicates the entire file?

3) Does the way the DB writes the DB file cause massive updates
throughout the file, or does it generally just append and update
the indices, or something completely different.  It could have an
effect on how well something like this works.

I don't think access speed is an issue, glusterfs is very quick. The
issue is recovery, it appears not to operate as advertised!

Understood. I'll have to actually try this when I have some time, instead of just doing some armchair theorizing.

Essentially, using this layout, you are keeping track of which
stripes have changed and only have to sync those particular ones on
self-heal. The longer the downtime, the longer self-heal will take,
but you can mitigate that problem with an rsync of the stripes
between the active and failed GlusterFS nodes BEFORE starting
glusterfsd on the failed node (make sure to get the extended
attributes too).

Ok, firstly, manual rsync's sort of defeat the object of the
exercise. Secondly, having to go through this process every time a
configuration is changed / glusterfsd is restarted is unworkable. Thirdly, replicating many GB's of data hammers the IO system and
slows down the entire cluster - again undesirable.

Well, it depends on your goal. I only suggested rsync for when a node was offline for quite a while, which meant a large number of stripe components would have needed to be updated, requiring a long sync time. If it was a quick outage (glusterfs restart or system reboot), it wouldn't be needed. Think of it as a jumpstart on the self-heal process without blocking.

This, of course, was assuming that the stripe of AFR setup works.

Being able to restart a glusterfsd without breaking the replica's
would help, but I see no mention of this ...

Because I'm not a dev, and have no control over this. ;) Yes, I would like this feature as well, although I can imagine a couple of snags that can make it problematic to implement.

The above setup, if feasible, would mitigate restart cost, to the
point where only a few megs might need to be synced on a glusterfs

Ok, well I appear to have both AFR and Striping working and I can
observe their operation at brick level and confirm they are working

Here's my basic test harness;

On the client system;

$ dd if=/dev/zero of=/mnt/stripe/database bs=1M count=1024

#!/usr/bin/python
io=open("/mnt/stripe/database","r+")
io.seek(1024*1024*900)
io.write("Change set version # 6\n")
io.close()

On the bricks I have;

#!/usr/bin/python
io=open("/export/stripe-1/database","r+")
io.seek(1024*1024*900)
print(io.readline())
io.close()

When I run the client script, both bricks show the correct
change. Then I kill glusterfsd on brick2. Running the client script
again shows an update on brick1, obviously not on brick2. Restarting
glusterfsd on brick2 shows a reconnect in the logs. On the client,
"head -c1 database" initiates a self-heal, shown in the logs with
DEBUG turned on. Running the brick script on brick1 and brick2
blocks ... An entire 1G chunk is copied to brick2; the scripts on
bricks 1 and 2 then continue when the copy finishes ..


Was this on AFR over stripe or stripe over AFR?

I'm using fuse-2.7.2 from the repos and gluster 1.3.7 from the stable
tgz ...

fyi; The fuse that comes with Ubuntu/Gutsy seems to cause gluster to
crash under write-load, I'm still waiting to see if the current CVS
version solves the problem ...

The GlusterFS provided fuse is supposed to have some better default values for certain variables relating to transfer block size or some such that optimize it for glusterfs, and it's probably what they test against, so it's what I've been using.


-Kevan Benson
-A-1 Networks
