Re: [Gluster-devel] Architecture advice

From: Gordan Bobic
Subject: Re: [Gluster-devel] Architecture advice
Date: Wed, 14 Jan 2009 12:00:48 +0000

On Tue 13/01/09 01:15 , Martin Fick <address@hidden> wrote:
> --- On Mon, 1/12/09, Gordan Bobic  wrote:
> Ding, ding, ding ding!!!  I get it, you are 
> using NFS to achieve blocking, exactly my #1
> remaining gripe with glusterfs, it does not 
> block!  Please try explaining why this is
> important to you to the glusterfs devs! I
> am not sure that I made my case clear to 
> them.

It's convenient because there doesn't need to be application level handling of 
the "filesystem went away" condition.

> It seems like your use of NFS is 
> primarily based upon this (what I perceive
> to be) major remaining shortcoming of 
> glusterfs.  Would you give up NFS if 
> blocking were implemented in glusterfs?

Depends on the use-case. NFS is mature, fast and well supported on most 
platforms, more so than GlusterFS.

> One remaining drawback to NFS, which you 
> may not care about, is the fact the NFS
> servers should not themselves be NFS 
> clients.

I have never seen that cause a problem. Can you explain a use-case where a 
server also being its own client could be problematic?

> > > So, the HA translator supports talking to two
> > > different servers with two different transport
> > > mechanism and two different IPs.  Bonding does not
> > > support anything like this as far as I can tell.
> > 
> > True. Bonding is more transparent. You make two NICs into
> > one virtual NIC and round-robin packets down them. If one
> > NIC/path fails, all the traffic will fail over to the other
> > NIC/path.
> Another benefit of the HA translator is that 
> you can have two entirely different paths which 
> is very hard to do with bonding.  With bonding 
> you are restricted to one IP.  If you think 
> about using a WAN, this would not allow you to 
> access a remote server using two entirely 
> different IPs which use two entirely different 
> WAN GWs.  The HA translator should in theory 
> make this very easy.

It would make it considerably easier, I'll grant you that, but with some 
failover hooked NAT-ing iptables magic you could achieve more or less the same 

> > >> This is a reinvention of a wheel. NFS already
> > >> handles this gracefully for the use-case you 
> > >> are describing.
> > > 
> > > I am lost, what does NFS have to do with it?
> > 
> > It already handles the "server has gone away"
> > situation gracefully. What I'm saying is that you can
> > use GlusterFS underneath for mirroring the data (AFR) and
> > re-export with NFS to the clients. If you want to avoid
> > client-side AFR and still have graceful failover with
> > lightweight transport, NFS is not a bad choice.
> Uh, not exactly a good choice though, it seems like
> an awfully big hammer to use just because you think
> it's better than reinventing the wheel.  I can see
> that it will work in your strict client/server use
> case, but not in "peer 2 peer".  A simple HA 
> translator would be a much better more flexible, 
> better glusterfs integrated solution, don't you 
> think?

Sorry, I don't see a problem. :-/

> > > No, I have not confirmed that this actually
> > > works with the HA translator, but I was told
> > > that the following would happen if it were used. 
> > > Client A talks to Server A and submits a read request.  The
> > > read request is received on Server A (TCP acked to the
> > > client), and then Server A dies.  Client A
> > > will then in theory retry the read request
> > > on Server B.  Bonding cannot do anything
> > > like this (since the read was tcp ACKed)?  
> > 
> > Agreed, if a server fails, bonding won't help. Cluster
> > fail-over server-side, however, will, provided the network
> > file system protocol can deal with it reasonably well.
> Yes, but I fear you might still have a corner case 
> where you can get some non-posix behavior with this
> setup, just as I mentioned that I believe you would 
> with the HA translator.

I'm not 100% sure, but I think you break atomicity as soon as a server fails 
anyway. I don't think there is any corrective action taken to prevent the race 
condition that arises upon server failure. There is an inherent risk that the 
write will hit one server but not the other before failover occurs. I don't 
think you can work around this without client-side AFR.

> > > I think that this is quite different from
> > > any bonding solution.  Not better, different,
> > > If I were to use this it would not preclude me from
> > > also using bonding, but it solves a somewhat different
> > > problem.  It is not a complete solution, it is a piece, but
> > > not a duplicated piece.  If you don't like it,
> > > or it doesn't fit your backend use case, don't
> > > use it! :)
> > 
> > If it can handle the described failure more gracefully than
> > what I'm proposing, then I'm all for it. I'm
> > just not sure there is that much scope for it being better
> > since the last write may not have made it to the mirror
> > server anyway, so even if the protocol can re-try, it would
> > need to have some kind of journaling, roll back the journal
> > and replay the operation.
> That's why I said theory about the HA translator! :) 
> I do not see anything in the code that actually keeps
> track of requests until they are replied to, but I
> was told that it can replay it.  Can someone explain
> where this is done?
> I can't see how this is done without some type of RAM 
> journal?  I say RAM, because requests need not 
> survive a client crash, they simply need to hit the 
> server disk before the client returns a success, but 
> if the client crashes, the apps never got a confirm, 
> so requests will not need to be replayed.

This would require the journal to be synced across the servers, as the 
surviving server is the only one that can replay the journal, even though the 
server that failed is the one that was talking to the client.

> Why do you think a client would need to be able
> to roll back the journal, it should just have to 
> replay it, no roll back.

That's what I meant.
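
To make the replay-only idea concrete, here is a minimal sketch of a client-side RAM journal that keeps each request until it is confirmed and re-sends the unconfirmed ones after a failover. The names (PendingJournal, record, complete, replay) are hypothetical and are not taken from the HA translator code. The sketch also ignores the point made above: the surviving server still has no way of knowing what the failed server had already applied.

```python
class PendingJournal:
    """In-memory (RAM) journal of requests that have not been confirmed yet.

    Hypothetical illustration only; not GlusterFS code.
    """

    def __init__(self):
        self._pending = {}   # request id -> request payload
        self._next_id = 0

    def record(self, request):
        """Remember a request before it is sent; return its id."""
        rid = self._next_id
        self._next_id += 1
        self._pending[rid] = request
        return rid

    def complete(self, rid):
        """Forget a request once the server has confirmed it."""
        self._pending.pop(rid, None)

    def replay(self, send):
        """Re-send every unconfirmed request in its original order."""
        for rid in sorted(self._pending):
            send(self._pending[rid])


journal = PendingJournal()
first = journal.record(("write", "/f", b"one"))
second = journal.record(("write", "/f", b"two"))
journal.complete(first)            # server A confirmed the first write

replayed = []                      # stands in for "send to server B"
journal.replay(replayed.append)    # server A died; replay what is left
```

Only the second write is replayed, and nothing is rolled back. Whether replaying it is safe depends on the surviving server not having a partial copy of it already.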

> > This, however, is a much more complex approach (very
> > similar to what GFS does), and there is a high price to pay
> > in terms of performance when the nodes aren't on the
> > same LAN.
> With glusterfs's architecture it should not be much of 
> a price, just the buffering of requests until they are 
> completed.

It would be expensive performance-wise. It sounds an awful lot like RHCS/DLM/GFS.

> > > No. I am proposing adding a complete transactional
> > > model to AFR so that if a write fails on one node, some
> > > policy can decide whether the same write should be committed
> > > or rolled back on the other nodes.  Today, the policy is to
> > > simply apply it to the other nodes regardless.  This is a
> > > recipe for split brain.  
> > 
> > OK, I get what you mean. It's basically the same
> > problem I described above when I mentioned that you'd
> > need some kind of a journal to roll-back the operation that
> > hasn't been fully committed.
> I don't see it at all like above, since above you do
> not need to roll back.  In this case, depending on 
> which side of the segregated network you are on, the
> journal may need to be rolled back or committed.

But the journal needs to be synced on both servers, something that would 
require synchronous blocking I/O. That would be painful over a WAN, and at that 
rate you might as well use DRBD+GFS rather than GlusterFS.

> > > In the case of network segregation some policy should
> > > decide to allow writes to be applied
> > > to one side of the segregation and denied on the
> > > other.  This does not require fencing (but it
> > > would be better with it), it could be a simple policy
> > > like: "apply writes if a majority of nodes can be
> > > reached", if not fail (or block would be
> > > even better).
> > 
> > Hmm... This could lead to an elastic shifting quorum.
> > I'm not sure how you'd handle resyncing if nodes are
> > constantly leaving/joining. It seems a bit
> > non-deterministic.
> I wasn't trying to focus on a specific policy, but
> I fail to see any actual problem as long as you always
> have a majority?  Could you be specific about a 
> problematic case?
> I would suggest other policies also, thus my request
> for an external hook.

The problem is that if you have, say, 3 servers of which you need 2, you could 
end up in a situation where you write to servers 1,2 and then something happens 
and server 2 disappears but 3 comes back. You again have quorum of 2, but now 
it's 1,3. So servers 2,3 could end up with different data. It's also still 
prone to split-braining because there is no enforced cluster-wide atomicity. In 
GFS, all FS operations are blocked until the node that stopped responding is 
confirmed fenced. I don't see any other way of reliably preventing the 
split-brain from occurring.
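
The shifting-quorum scenario can be walked through with a small simulation. This is only a sketch under simple assumptions: the Server class and quorum_write function are made up for illustration, writes go to whichever servers are reachable, and there is no resync or fencing step.

```python
QUORUM = 2  # 3 servers, of which 2 must be reachable for a write


class Server:
    """Hypothetical stand-in for one storage node."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


def quorum_write(servers, key, value):
    """Apply the write to all reachable servers if a majority is up."""
    reachable = [s for s in servers if s.up]
    if len(reachable) < QUORUM:
        return False          # no majority: refuse the write
    for s in reachable:
        s.data[key] = value
    return True


s1, s2, s3 = Server("1"), Server("2"), Server("3")
servers = [s1, s2, s3]

s3.up = False
first_ok = quorum_write(servers, "file", "v1")    # lands on servers 1,2

s2.up = False                                     # server 2 disappears
s3.up = True                                      # server 3 comes back
second_ok = quorum_write(servers, "file", "v2")   # lands on servers 1,3
```

Both writes succeed because a quorum of 2 was present each time, yet server 2 still holds "v1" while server 3 holds "v2" and never saw "v1" at all.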

> > > I guess what you call tiny, I call huge.  Even if you
> > > have your heartbeat fencing occur in under a
> > > tenth of a second, that is time enough to split brain
> > > a major portion of a filesystem.  I would never trust it.
> > 
> > In GlusterFS that problem exists anyway, 
> Why "anyway"?  It exists, sure, but it's certainly
> something that I would hope gets fixed eventually.

The problem is that the "solution" to split-brain is the "clobber-older" 
recovery method. The FS will correct itself, but you may lose the version that 
you'd have preferred to keep. Coda, for example, is much more paranoid about 
this scenario, but it is also a lot less transparent - the user has to resolve 
the conflict manually.
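
As a rough illustration of why "clobber-older" can lose the version you wanted, here is a sketch that picks the surviving copy purely by modification time. Using mtime as the "newer" criterion is an assumption made for the example, and the Copy and clobber_older names are hypothetical; this is not how AFR actually decides.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    """One server's version of a file after a split-brain (hypothetical)."""
    server: str
    mtime: float
    content: bytes


def clobber_older(copies):
    """Keep the most recently modified copy; every other copy is overwritten."""
    return max(copies, key=lambda c: c.mtime)


copies = [
    Copy("A", 100.0, b"the edit the user wanted to keep"),
    Copy("B", 105.0, b"a later, unrelated edit"),
]

winner = clobber_older(copies)
lost = [c for c in copies if c is not winner]
```

The copy on server B wins because it is newer, and the content on server A is discarded without anyone being asked. That is the trade-off against Coda's manual conflict resolution.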

> > but it is largely
> > mitigated by the fact that it works on file level rather
> > than block device level. 
> Certainly not FS devastating like it would be for 
> a block device, but bad data is still bad data.
> It would be of no consolation to me that I have
> access to the rest of my FS if one really 
> important file is corrupt!

Sure, but if preventing this scenario is that much of an issue, you should 
probably look into DRBD+GFS instead. I see GlusterFS as a looser solution that 
fits a different use case. It's a trade-off between performance and integrity 

> > > To borrow your analogy, adding heartbeat to the
> > > current AFR:  "It's a bit like fitting a big
> > > padlock on the door when there's a wall missing."
> > > :)  
> > > Every single write needs to ensure that it will not
> > > cause split brain for me to trust it.
> > 
> > Sounds like GlusterFS isn't necessarily the solution
> > for you, then. :(
> It's not all bad, it's just not usable for some 
> use cases yet.

Personally, I'm against a "one tool fits all use-cases" approach. There are 
several similar file systems designed with different primary design goals, and 
the associated advantages and drawbacks for different use-cases. Just pick the 
one that fits. :)

> > > If not, why would I bother with glusterfs over
> > > AFR instead of glusterfs over DRBD?  Oh right, because
> > > I cannot get glusterfs to failover without
> > > incurring connection errors on the client! ;)
> > > (not your beef, I know, from another thread)
> > 
> > Precisely - which is why I originally suggested not using
> > GlusterFS for client-server communication. :)
> ...
> > And this is exactly why I suggested using NFS for the
> > client-server connection. NFS blocks until the
> > server becomes contactable again.
> Yes, but do you have any other suggestions
> besides NFS?  Anything that can be safely 
> used as both a client and a server? :)

I still don't see what the problem is with NFS when used as both a client and a 

> > > But, to be clear, I am not disagreeing with you
> > > that the HA translator does not solve the split
> > > brain problem at all.  Perhaps this is what is really
> > > "upsetting" you, not that it is
> > > "duplicated" functionality, but rather that
> > > it does not help AFR solve its split brain personality
> > > disorders, it only helps make them more available, thus
> > > making split brain even more likely!! ;(
> > 
> > I'm not sure it makes it any worse WRT split-brain, it
> > just seems that you are looking for GlusterFS+HA to provide
> > you with exactly the same set of features that NFS+(server
> > fail-over) already provides. 
> You are right, glusterfs + AFR + HA is probably no 
> worse than glusterfs + AFR + NFS.  But both make it 
> slightly more likely to have split brain than simply 
> glusterfs + AFR.  And glusterfs + AFR itself is much 
> more likely to split brain than glusterfs + DRBD.

I think you may be confusing GlusterFS and GFS. They are different file 
systems. DRBD+GFS has no scope for split-braining if your fencing is configured 
correctly because it is journaled and will block until the failed server is 

> > Of course, there could be
> > advantages in GlusterFS behaving the same way as NFS when
> > the server goes away if it's a single-server setup 
> I fail to see how having it not behave that way even
> if you have many servers and they all went down would 
> not be desirable?

You're probably right for a vast majority of use-cases. There could be an edge 
case where it might cause a server overload with hundreds of threads spawning 
to perform a task relying on the blocked file system. Since they all block yet 
more keep spawning, it could bring the machine down even though it might still 
be able to perform other tasks unhindered.

> > - it
> > would be easier to set up and a bit more elegant. But it
> > wouldn't add any functionality that couldn't be
> > re-created using the sort of a setup I described.
> I guess just multi path, multi protocol (encrypt one, 
> not the other...).  Primarily flexibility, bonding is 
> very limited.  I would think that it might in some
> use cases increase bandwidth also.  My reading on 
> bonding suggests that if you are using separate 
> switches that you can get either HA bonding or 
> link aggregation bonding, but not both right?

I haven't used it in a while, but IIRC, you can use both, and only suffer a 
bandwidth reduction if one leg fails. I could be wrong though. Now that you 
mention it, you might be right, the issue does sound vaguely familiar.
