[Gnu-arch-users] Arch Cache & cached archives

From: Aaron Bentley
Subject: [Gnu-arch-users] Arch Cache & cached archives
Date: Tue, 14 Sep 2004 12:56:48 -0400
User-agent: Mozilla Thunderbird 0.5 (X11/20040309)

This is a description of the preliminary Arch Cache implementation with notes on future directions. Please consider it a draft; comments and critiques are welcome.

Although I'm using the term "cache", it's really about memoizing data that can be time-consuming to produce, but will always be equivalent once produced.

Layer 1: The Arch Cache
The Arch Cache abstraction connects "query paths" with streams, or things that streams can represent. Query paths look suspiciously like POSIX pathnames. Convenience functions are available for use with strings.

There is a test for whether the cache is enabled:
extern int
arch_cache_active (void)

Attempts to use the cache when it is not enabled will cause panics.

extern int
arch_cache_put (t_uchar **tmp_name, t_uchar *rel_query_path)

To add something to the cache, we use arch_cache_put. This returns a file descriptor that we'll have to close, and a tmp_name that we'll ultimately need to free.

extern void
arch_cache_commit (t_uchar *tmp_name, t_uchar *rel_query_path)

After we have written the answer to the file descriptor, we must commit it before the answer becomes active. This step is not required for the string wrappers.
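The put/commit sequence can be sketched as "write to a temporary file, then rename it into place". This is only an illustration, not the actual implementation: the sketch_ names, the explicit cache_root parameter, the int return from commit, and the restriction to query paths without subdirectories are all assumptions made to keep the example self-contained.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>   /* write, close: used by callers of the sketch */

/* Hypothetical stand-in for arch_cache_put: creates a temporary file beside
   the final location.  The caller closes the returned descriptor and
   eventually frees *tmp_name. */
static int
sketch_cache_put (char **tmp_name, const char *cache_root,
                  const char *rel_query_path)
{
  size_t len = strlen (cache_root) + strlen (rel_query_path)
               + sizeof ("/.XXXXXX");
  char *name = malloc (len);

  assert (name != NULL);
  snprintf (name, len, "%s/%s.XXXXXX", cache_root, rel_query_path);
  *tmp_name = name;
  return mkstemp (name);   /* fills in the XXXXXX part, returns -1 on error */
}

/* Hypothetical stand-in for arch_cache_commit: the answer becomes visible
   only once the temporary file is renamed to its final name. */
static int
sketch_cache_commit (const char *tmp_name, const char *cache_root,
                     const char *rel_query_path)
{
  size_t len = strlen (cache_root) + strlen (rel_query_path) + 2;
  char *final_name = malloc (len);
  int status;

  assert (final_name != NULL);
  snprintf (final_name, len, "%s/%s", cache_root, rel_query_path);
  status = rename (tmp_name, final_name);
  free (final_name);
  return status;
}
```

A caller would write the answer to the descriptor, close it, call the commit function, and then free tmp_name. Because the rename happens last, a reader never sees a half-written answer under the final name.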

extern int
arch_cache_has_answer (t_uchar * rel_query_path)

We can use arch_cache_has_answer to find out whether the cache has an answer for a particular query.

extern int
arch_cache_get (t_uchar * rel_query_path)

We can use arch_cache_get to retrieve the answer for a query. It will panic if no answer is available for that query. This is where the smart caching functionality Tom's mentioned could hook in. One possible implementation would be to register a set of query handlers, and invoke them in sequence until one of them produced an answer.
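The handler idea could look roughly like the following. Nothing here exists in the current code: the handler type, the fixed-size table, and the two example handlers are hypothetical, and the handlers return strings rather than file descriptors so that the sketch stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical handler type: returns the answer, or NULL if this handler
   cannot answer the query. */
typedef const char *(*sketch_query_handler) (const char *rel_query_path);

#define SKETCH_MAX_HANDLERS 8

static sketch_query_handler sketch_handlers[SKETCH_MAX_HANDLERS];
static size_t sketch_n_handlers = 0;

/* Append a handler to the table; returns -1 if the table is full. */
static int
sketch_register_handler (sketch_query_handler handler)
{
  assert (handler != NULL);
  if (sketch_n_handlers == SKETCH_MAX_HANDLERS)
    return -1;
  sketch_handlers[sketch_n_handlers++] = handler;
  return 0;
}

/* Invoke the handlers in registration order until one produces an answer. */
static const char *
sketch_dispatch (const char *rel_query_path)
{
  size_t i;

  for (i = 0; i < sketch_n_handlers; ++i)
    {
      const char *answer = sketch_handlers[i] (rel_query_path);
      if (answer != NULL)
        return answer;
    }
  return NULL;
}

/* Example handler: answers only queries ending in "/log". */
static const char *
sketch_log_handler (const char *q)
{
  size_t n = strlen (q);
  return (n >= 4 && strcmp (q + n - 4, "/log") == 0) ? "patch log text" : NULL;
}

/* Example handler: answers only queries ending in "/ancestor". */
static const char *
sketch_ancestor_handler (const char *q)
{
  size_t n = strlen (q);
  return (n >= 9 && strcmp (q + n - 9, "/ancestor") == 0)
         ? "ancestor revision name" : NULL;
}
```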

extern int
arch_cache_maybe_get (t_uchar * rel_query_path)

If we don't want to panic when the answer isn't there, we can use maybe_get. This will return -1 if the answer is unavailable.
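The difference between the two calls can be sketched as below. The sketch_ functions are hypothetical and take a full filesystem path rather than a query path, and abort() stands in for the panic.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for arch_cache_maybe_get: returns a read-only
   descriptor, or -1 when no answer is stored at that path. */
static int
sketch_cache_maybe_get (const char *answer_path)
{
  assert (answer_path != NULL);
  return open (answer_path, O_RDONLY);
}

/* Hypothetical stand-in for arch_cache_get: same lookup, but a missing
   answer is treated as a fatal error. */
static int
sketch_cache_get (const char *answer_path)
{
  int fd = sketch_cache_maybe_get (answer_path);

  if (fd < 0)
    {
      fprintf (stderr, "no cached answer for %s\n", answer_path);
      abort ();   /* stand-in for the panic */
    }
  return fd;
}
```

A caller that can produce the answer some other way would use the maybe_get form and test for -1. A caller that has already checked has_answer can use the plain get form.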

The convenience functions are:
arch_cache_put_str and arch_cache_get_str. Since these copy the string verbatim, I expect to add arch_cache_put_line and arch_cache_get_line, which will make sure that the files end with a terminating '\n' but the strings do not.
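The intended line semantics amount to adding a newline on the way in and dropping it on the way out. A minimal sketch, using hypothetical helper names and working on in-memory strings rather than on the cache itself:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* What a put_line-style wrapper would store: the string plus one '\n'.
   The caller frees the result. */
static char *
sketch_line_to_file (const char *str)
{
  size_t n = strlen (str);
  char *out = malloc (n + 2);

  assert (out != NULL);
  memcpy (out, str, n);
  out[n] = '\n';
  out[n + 1] = '\0';
  return out;
}

/* What a get_line-style wrapper would return: the stored contents with a
   single trailing '\n' removed, if present.  The caller frees the result. */
static char *
sketch_file_to_line (const char *contents)
{
  size_t n = strlen (contents);
  char *out;

  if (n > 0 && contents[n - 1] == '\n')
    --n;
  out = malloc (n + 1);
  assert (out != NULL);
  memcpy (out, contents, n);
  out[n] = '\0';
  return out;
}
```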

Things currently unhandled include:
1. statistics
2. listing answers available to certain kinds of queries, e.g. listing full-trees available in a version.
3. erasing answers

Implementation notes:
Yes, this is implemented using the local filesystem, exactly as you'd expect. (It could also be implemented on top of a pseudo-filesystem.) The $HOME/.arch-params/=arch-cache file contains the prefix of the cache hierarchy.
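Locating the cache could be sketched as follows. The helper names are hypothetical, and the sketch assumes the =arch-cache file holds the prefix on a single line, possibly followed by a newline; the exact file format is not stated above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the name of the parameter file for a given home directory.
   Returns 0 on success, -1 if buf is too small. */
static int
sketch_params_file (char *buf, size_t len, const char *home)
{
  int n;

  assert (buf != NULL && len > 0 && home != NULL);
  n = snprintf (buf, len, "%s/.arch-params/=arch-cache", home);
  return (n < 0 || (size_t) n >= len) ? -1 : 0;
}

/* Read the cache prefix from an already opened parameter file, dropping a
   trailing newline if there is one.  Returns 0 on success, -1 on failure. */
static int
sketch_read_prefix (char *buf, size_t len, FILE *in)
{
  size_t n;

  assert (buf != NULL && len > 0 && in != NULL);
  if (fgets (buf, (int) len, in) == NULL)
    return -1;
  n = strlen (buf);
  if (n > 0 && buf[n - 1] == '\n')
    buf[n - 1] = '\0';
  return 0;
}
```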

Layer 2: Namespace
The current namespace looks like this:

/archives : data for archives, but not for specific locations

/archives/$ARCHIVE: data for a particular archive

/archives/$ARCHIVE/$REVISION: data for a particular archive revision. I'm not sure I want to keep it this way. For scalability reasons, this might be better: /archives/$ARCHIVE/$VERSION/$DATATYPE/$PATCHLEVEL. That way, listing data would scale with the number of patchlevels (which have cached queries) in the version, not version*patchlevels.

/archives/$ARCHIVE/$REVISION/full-tree.tar.gz: The full tree (same contents as a cacherev or import) for the revision

/archives/$ARCHIVE/$REVISION/log: The patchlog for the revision

/archives/$ARCHIVE/$REVISION/delta.tar.gz: The changeset between the revision and its direct ancestor

/archives/$ARCHIVE/$REVISION1/delta-from-REVISION2.tar.gz: (not implemented) The changeset that transforms $REVISION2 into $REVISION1

/archives/$ARCHIVE/$REVISION/ancestor: (not implemented) The direct ancestor of the revision

/archives/$ARCHIVE/$REVISION/type: (not implemented) The type of the revision ("import", "simple" or "continuation")

/locations/$MANGLED_URL/NAME: (not implemented) The official name associated with an archive location. Required for disconnected operation or lazy initialization, but may occasionally change.
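To make the two revision layouts concrete, here is a sketch that builds a query path in the current form and in the per-version form floated above. The function names are hypothetical. The second function assumes a revision name is the version name followed by "--" and the patchlevel, and splits at the last "--".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Current layout: /archives/$ARCHIVE/$REVISION/$DATATYPE.
   Returns 0 on success, -1 if buf is too small. */
static int
sketch_revision_query (char *buf, size_t len, const char *archive,
                       const char *revision, const char *datatype)
{
  int n;

  assert (buf != NULL && len > 0);
  n = snprintf (buf, len, "/archives/%s/%s/%s", archive, revision, datatype);
  return (n < 0 || (size_t) n >= len) ? -1 : 0;
}

/* Alternative layout: /archives/$ARCHIVE/$VERSION/$DATATYPE/$PATCHLEVEL.
   Returns -1 if the revision has no "--" or buf is too small. */
static int
sketch_version_query (char *buf, size_t len, const char *archive,
                      const char *revision, const char *datatype)
{
  const char *sep = NULL;
  const char *p = revision;
  int n;

  assert (buf != NULL && len > 0);
  /* Find the last "--", which separates the version from the patchlevel. */
  while ((p = strstr (p, "--")) != NULL)
    {
      sep = p;
      p += 2;
    }
  if (sep == NULL)
    return -1;
  n = snprintf (buf, len, "/archives/%s/%.*s/%s/%s", archive,
                (int) (sep - revision), revision, datatype, sep + 2);
  return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

With the second layout, every cached log for a version sits in one directory, so listing them touches only the patchlevels that actually have a cached log.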

Cached Archives
Cached archives are the first clients of the Arch Cache. They are a new archive type that implements the archive.h interface. Any location prefixed with "cached:" is created as a cached archive.

When they are initialized, they initialize a pointer to the real archive by removing the "cached:" prefix.
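Recovering the wrapped location is a plain prefix strip. A minimal sketch with a hypothetical helper name, assuming the prefix is spelled "cached:" as quoted above:

```c
#include <assert.h>
#include <string.h>

/* Return the location of the real archive, or NULL if the location does
   not carry the "cached:" prefix.  The result points into the argument. */
static const char *
sketch_wrapped_location (const char *location)
{
  static const char prefix[] = "cached:";
  size_t n = sizeof (prefix) - 1;

  assert (location != NULL);
  if (strncmp (location, prefix, n) != 0)
    return NULL;
  return location + n;
}
```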

Most implementations are exact wrappers. The functions that use the cache are:

These functions check whether the cache has an answer already. If not, they retrieve the answer from the wrapped archive, and put it in the cache. Then, they unconditionally get from the cache.
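The check, fill, then get pattern can be sketched with a small in-memory table standing in for the Arch Cache. All names are hypothetical, and the fetch counter exists only to show that the wrapped archive is consulted once per revision.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_SLOTS 16

/* In-memory stand-in for the cache.  It stores the pointers it is given,
   so callers must keep those strings alive. */
struct sketch_entry { const char *query; const char *answer; };
static struct sketch_entry sketch_cache[SKETCH_SLOTS];
static size_t sketch_used = 0;
static int sketch_fetch_count = 0;

/* Return the stored answer for a query, or NULL if there is none. */
static const char *
sketch_cache_lookup (const char *query)
{
  size_t i;

  for (i = 0; i < sketch_used; ++i)
    if (strcmp (sketch_cache[i].query, query) == 0)
      return sketch_cache[i].answer;
  return NULL;
}

/* Record an answer for a query. */
static void
sketch_cache_store (const char *query, const char *answer)
{
  assert (sketch_used < SKETCH_SLOTS);
  sketch_cache[sketch_used].query = query;
  sketch_cache[sketch_used].answer = answer;
  ++sketch_used;
}

/* Stand-in for the slow call on the wrapped archive. */
static const char *
sketch_wrapped_archive_log (const char *revision)
{
  (void) revision;
  ++sketch_fetch_count;
  return "log text from the wrapped archive";
}

/* The wrapper: fill the cache on a miss, then always answer from it. */
static const char *
sketch_cached_archive_log (const char *revision)
{
  if (sketch_cache_lookup (revision) == NULL)
    sketch_cache_store (revision, sketch_wrapped_archive_log (revision));
  return sketch_cache_lookup (revision);
}
```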

It would be nice to cache at commit time, but that would need to be done at a higher level.

Comparison with local mirrors
- The user never downloads anything they don't need
- Commits are possible
- Never out of date
- Disconnected operation is not yet supported

Comparison with sparse, greedy revlibs
- Stores intermediate downloads, not just the target revisions
- Typically more space-efficient
- Not suitable as a reference tree

Comparison with proxy caches
- No "stale data" problems
- Available for SFTP
- Permanent by default
- Adaptable for disconnected use
- Higher level: Because its datatypes are Arch datatypes (not just files), it knows what kinds of data can be stored permanently, and keeps them by default.
- Since data is grouped by archive, accessing the same data through different transports will not cause it to be duplicated in two caches
- Visible to Arch; higher-level functions like build_revision can use it.
- Potentially visible to tla-wrapping utilities

Aaron Bentley
Director of Technology
Panometrics, Inc.
