Re: [Gluster-devel] Improving real world performance by moving files clo

gluster-devel

From:	gordan
Subject:	Re: [Gluster-devel] Improving real world performance by moving files closer to their target workloads
Date:	Fri, 16 May 2008 17:14:58 +0100 (BST)
User-agent:	Alpine 1.10 (LRH 962 2008-03-14)

On Fri, 16 May 2008, Derek Price wrote:

I mostly agree with you.  A few additional points are inlined below.

address@hidden wrote:
On Fri, 16 May 2008, Derek Price wrote:
address@hidden wrote:
Isn't that effectively the same thing? Unless there is quorum, DLM locks out the entire FS (it also does this when a node dies, until it gets definitive confirmation that it has been successfully fenced). For normal file I/O all nodes in the cluster have to acknowledge a lock before it can be granted.
Why? It requires a meta-data cache, but as long as every node in the quorum stores a given file's most recent revision # when any lock is granted, even if it doesn't actually sync the file data, then any quorum should be able to agree on what the version number of the most up-to-date copy of a file is. All nodes are required to report only if you assume that any given file has a small number of "owners" and that the querier doesn't know who the owner is.
That's to do with file versioning, not locking, though. What am I missing?
I'm assuming that versioning and locking can and should be combined. You've admitted the necessity for keeping copies of files synchronized and IO is always going to require some sort of lock to accomplish this. By having the quorum remain aware of what the most recent version of a given file is, whether that file is locked, and perhaps where copies of the file reside, you could reduce the number of nodes that must be consulted when a lock is needed.

True enough, but some care would need to be exercised to ensure that this doesn't lead to edge cases where a node thinks it still has a lock, but all the other nodes have expired it (e.g. temporary network outage).

I think you will also speed things up if you don't have to consult all nodes for every IO operation. If all available nodes must be consulted, then you introduce an implicit wait until a specified timeout for every IO request if any single node is down. With the quorum model, even before fencing takes place, almost half the nodes can go incommunicado and the rest can operate as efficiently as they did with all nodes in service.

Indeed, a quorum of (n/2)+1 nodes should, in theory, suffice for safely granting a lock, but it would probably mean that locks should be refreshed several times within the default lock TTL, just to allow for packet loss. Releases of locks should, of course, be explicitly notified to the cluster.
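
To make the arithmetic concrete, here is a minimal Python sketch of the majority rule and of a refresh schedule. The function names and the default of three refreshes per TTL are assumptions made purely for illustration; none of this is taken from DLM or GlusterFS.

```python
def quorum_size(n_nodes: int) -> int:
    """Smallest strict majority: (n/2)+1 using integer division."""
    return n_nodes // 2 + 1


def tolerated_failures(n_nodes: int) -> int:
    """How many nodes can be unreachable while a quorum still exists."""
    return n_nodes - quorum_size(n_nodes)


def refresh_interval(ttl_seconds: float, refreshes_per_ttl: int = 3) -> float:
    """Spacing between refreshes so that several fit inside one TTL.

    refreshes_per_ttl is a hypothetical tuning knob: a higher value
    tolerates more lost refresh packets before the lock expires.
    """
    if refreshes_per_ttl < 1:
        raise ValueError("need at least one refresh per TTL")
    return ttl_seconds / refreshes_per_ttl
```

With 5 nodes the quorum is 3 and 2 nodes may be unreachable; with 4 nodes the quorum is still 3, so only 1 may be unreachable.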

If some HA and fault-tolerant DHT implementation exists that already handles atomic hash inserts with recognizable failures for keys that already exist, then perhaps that could take the place of DLM's quorum model, but I think any algorithm that requires contacting all nodes will prove to be a bad idea in the end.

Not all nodes - only the nodes that contain a certain file. A single ping broadcast to find out who has a copy of the file should prove to be of insignificant bandwidth overhead compared to actual file transfers, unless you are dealing with a lot of files that are significantly smaller than a network packet.

To remain fault tolerant, this requires that servers make some effort to stay up-to-date with the meta-data cache, but maybe this could be dealt with efficiently with the DHT someone else brought up?
I'm not sure that so much metadata caching is actually necessary. If a file open brings the file to the local machine (this cannot be guaranteed because the local machine may be out of space, and it may be unable to free space by expunging an old file due to that file not being redundant enough in the network), then the metadata of that file, being attached to the file, is implicitly "cached". But this isn't really caching at all - it's migration.
The algorithm for opening a file might be as follows:
1) node broadcasts/multicasts an open request to all peers
2) peers that have the file available respond with the metadata (size, version, etc) they have and possibly their current load (to assist with load balancing by fetching the file from the least loaded peer)
3.1) if the file is available locally, agree a lock with other nodes, and use it.
3.2) if the file is not available locally, but there is enough space, fetch it and do 3.1)
3.3) if there isn't enough space locally to fetch the file, see if enough space can be freed. If this succeeds, do 3.2)
3.4) if space cannot be freed, use the file remotely from the least loaded peer.
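
The decision part of the steps above might be sketched roughly as follows in Python. The PeerReply type, the plan_open function and the action names are all hypothetical, the network exchange and the lock agreement are left out, and the sketch does not check whether a local copy is still the newest version.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PeerReply:
    """Metadata a peer returns in step 2 (all field names are assumptions)."""
    peer: str
    version: int
    size: int
    load: float


def least_loaded_source(replies: List[PeerReply]) -> PeerReply:
    """Among peers holding the newest version, pick the least loaded one."""
    newest = max(r.version for r in replies)
    return min((r for r in replies if r.version == newest), key=lambda r: r.load)


def plan_open(has_local_copy: bool, free_bytes: int, reclaimable_bytes: int,
              replies: List[PeerReply]) -> Tuple[str, Optional[str]]:
    """Return (action, source peer) following steps 3.1 to 3.4."""
    if has_local_copy:
        return ("use_local", None)                  # 3.1
    if not replies:
        raise FileNotFoundError("no peer reported a copy of the file")
    source = least_loaded_source(replies)
    if source.size <= free_bytes:
        return ("fetch", source.peer)               # 3.2
    if source.size <= free_bytes + reclaimable_bytes:
        return ("free_then_fetch", source.peer)     # 3.3
    return ("use_remote", source.peer)              # 3.4
```

Here reclaimable_bytes stands for whatever the expunge logic reports it could free, which is why 3.3 falls through to 3.4 when even that is not enough.
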
The expunging algorithm would be similar - broadcast a file status request (similar to 1) above). If enough nodes respond with the latest version of the file (set some threshold depending on how much redundancy is required), the local file can be removed and the space freed for a file that is more useful locally. This shouldn't really happen until the local data store starts to get full.
I might optimize the expunge algorithm slightly by having nodes with low loads volunteer to copy files that otherwise couldn't be expunged from a node. Better yet, perhaps, would be a background process that runs on lightly loaded nodes and tries to create additional redundant copies at some configurable tolerance beyond the "minimum # of copies" threshold.
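
As a rough Python sketch of that check, assuming each responding peer reports an integer version number (the function names and the 90% fullness threshold are made up for illustration):

```python
from typing import Iterable


def can_expunge(local_version: int, peer_versions: Iterable[int],
                min_copies: int) -> bool:
    """True if enough *other* nodes hold the latest version of the file.

    The local copy is not counted, since it is the one being removed.
    """
    versions = list(peer_versions)
    latest = max([local_version, *versions])
    up_to_date_peers = sum(1 for v in versions if v == latest)
    return up_to_date_peers >= min_copies


def store_is_filling_up(used_bytes: int, capacity_bytes: int,
                        threshold: float = 0.9) -> bool:
    """Only start expunging once the local store is nearly full."""
    return used_bytes >= threshold * capacity_bytes
```
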

Not just lightly loaded nodes, but more importantly, nodes with most free space available. :)

If copies beyond the minimum are only created on file access, then a heavily loaded node could quickly fill up its own disk with all the "redundant" copies of files and have to start relying on remote access, further bogging down the busy node.


Agreed.
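
One possible way to rank volunteers for an extra copy, preferring free space first and low load second, is sketched below in Python. NodeStatus and pick_replica_target are hypothetical names, and the ordering is only one reading of the discussion above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class NodeStatus:
    """What a node might advertise about itself (assumed fields)."""
    name: str
    free_bytes: int
    load: float
    has_copy: bool


def pick_replica_target(nodes: Iterable[NodeStatus],
                        file_size: int) -> Optional[str]:
    """Pick the node with the most free space, breaking ties by lowest load.

    Nodes that already hold the file, or cannot fit it, are skipped.
    Returns None when no node qualifies.
    """
    candidates = [n for n in nodes
                  if not n.has_copy and n.free_bytes >= file_size]
    if not candidates:
        return None
    best = max(candidates, key=lambda n: (n.free_bytes, -n.load))
    return best.name
```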

Locking could be handled somewhat lazily - a lock request gets broadcast and as long as quorum peers respond, and there are no peers saying "no, I have that lock!", the lock can be granted. A lock can have TTL (in case a node dies while holding a lock), and the refresh should be expected if the node expects to keep the lock. This could be used to speed up locking (each node would have a list of currently valid locks, without the need to check explicitly, for example - it would only need to broadcast a lock-request when it looks like the lock can be granted).
For file delta writes, an AFR type mechanism could be used to send the deltas to all the nodes that have the file. This could all get quite tricky, because it might require a separate multicast group to be set up for up to every node combination subset, in order to keep the network bandwidth down (or you'd just end up broadcasting to all nodes, which means things wouldn't scale as switches should, it'd be more like using hubs).
This would potentially have the problem that there are only 24 bits of IP multicast address space, but that should provide enough groups with sensible redundancy levels to cover all node combinations. This may or may not be way OTT complicated, though. There is probably a simpler and more sane solution.
I'm not sure what overhead is involved in creating multicast groups, but they would only be required for files currently locked for write, so perhaps creating and discarding the multicast groups could be done in conjunction with creation and release of write locks.
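
A minimal Python sketch of such a lazily maintained lock list follows. LockTable, can_grant and the explicit now parameter are assumptions for illustration; a real implementation would also have to deal with clock skew and lost messages.

```python
from typing import Dict, Optional, Tuple


class LockTable:
    """Per-node view of locks heard on the wire, each expiring after a TTL."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._locks: Dict[str, Tuple[str, float]] = {}  # path -> (holder, expiry)

    def observe(self, path: str, holder: str, now: float) -> None:
        """Record a grant or a refresh; both push the expiry forward."""
        self._locks[path] = (holder, now + self.ttl)

    def release(self, path: str, holder: str) -> None:
        """Drop the lock only if the releasing node is the recorded holder."""
        entry = self._locks.get(path)
        if entry is not None and entry[0] == holder:
            del self._locks[path]

    def holder(self, path: str, now: float) -> Optional[str]:
        """Current holder, or None if unlocked or the TTL has run out."""
        entry = self._locks.get(path)
        if entry is None:
            return None
        owner, expires = entry
        if now >= expires:
            del self._locks[path]
            return None
        return owner

    def worth_requesting(self, path: str, me: str, now: float) -> bool:
        """Only broadcast a request when the lock looks obtainable."""
        current = self.holder(path, now)
        return current is None or current == me


def can_grant(acks: int, refusals: int, n_nodes: int) -> bool:
    """Grant when a majority responded and nobody claimed the lock."""
    return refusals == 0 and acks >= n_nodes // 2 + 1
```

Passing now explicitly keeps the sketch deterministic; in practice it would come from a monotonic clock.
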
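
For a feel of the numbers, the following Python sketch counts how many distinct groups would be needed if every possible set of replica holders had its own multicast group. It takes the 24-bit figure above as given, and the function names are hypothetical.

```python
from math import comb

# Number of groups available, taking the 24-bit figure above at face value.
MULTICAST_GROUPS = 2 ** 24


def groups_needed(n_nodes: int, copies: int) -> int:
    """Distinct subsets of exactly `copies` nodes out of n_nodes."""
    return comb(n_nodes, copies)


def fits_in_address_space(n_nodes: int, copies: int) -> bool:
    """True if one group per replica set fits in the available space."""
    return groups_needed(n_nodes, copies) <= MULTICAST_GROUPS
```

For example, 100 nodes with 3 copies per file gives 161700 possible replica sets, which fits, while 100 nodes with 5 copies gives 75287520, which does not.
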

Sure, these could be dynamic, but setup and teardown might cause enough overhead that you might as well be broadcasting all the locks and writes, and just expect the affected nodes to pick those out of the air and act on them.

It's also possible that you could reduce the complexity of this problem by simply discarding as many copies down to as close to the minimum # as other nodes will allow, on write. However, I think that might reduce some of the performance benefits this design otherwise gives each node.

Also remember that the broadcasts or multicasts would only actually be useful for locks and file discovery. The actual read file transfer would be point-to-point and writes would be distributed to only the subset of nodes that are currently caching the files.

There would need to be special handling of a case where a node accepting a big write is running out of space as a consequence and something has to be dropped. Obviously, none of the currently open files can be discarded, so there would need to be some kind of an auxiliary process that would make a node request a "volunpeer" (pun intended) to take over a file that it needs to flush out, if discarding it would bring the redundancy below the required threshold.

Perhaps there are some useful ideas on how to perform this complex synchronization already in the design of P2P file transfer networks? What would that be, something like implicit striping based on the locations of valid redundant copies/deltas?

Freenet and Entropy do something similar, but with fewer constraints. They store files in a DHT, route by hash, expunge LRU and cache probabilistically. However, in that kind of an environment you cannot sensibly enforce a minimum redundancy level. Least used files will eventually fall off the network as the frequently used files get cached by nodes.


Gordan
