[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [bug-recutils] Indexing plan
From: |
Jose E. Marchesi |
Subject: |
Re: [bug-recutils] Indexing plan |
Date: |
Wed, 09 May 2012 19:53:11 +0200 |
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/24.0.92 (gnu/linux) |
Hi.
Some notes on your plan (which I generally like).
The design and implementation of indexing support could be split
into the following parts (each of them would also include tests and
documentation if it's user-visible):
- design index file structure
A single file would be used, with record descriptors in binary form
and several indices of various types.
This must probably be the first step. Given the structure and format of
recfiles, it is important to determine _what_ can be indexed. The data
structure to use for each index is an implementation detail at this
point.
- find the index file, using recfile metadata to check if it's up to
date
A combination of the last modified time and the size of the file will
probably be enough. But it must be a portable solution.
- delay reading the recfile
If indices are found, do it on first operation requiring whole rsets
or not being doable with the indices found.
Sorry, I don't understand. Do what?
- API for building indices and a recfix command to use it
- API for parsing indices
Maybe support a textual serialization of an index file for debugging
and tests.
Those APIs can go into librec, maybe under the rec_idx_* namespace.
The above parts could be done without indexing of records. There will
be other uses for indexing rset descriptors and maybe their offsets in
the file, so this can be used to test the above parts before choosing
record index data structures.
I like the approach of first supporting the indexing of rsets, and then
to figure out what to do for indexing records.
- design and implement tree-based index of value of a specific field
-> record mapping
Could use a memory-mapped btree with offline build and search
operations.
A new record descriptor field would be added to specify which fields
(or tuples of fields) to index using a specific index type.
I would go simple and use %key: for that purpose. The support for
multi-field keys is another sub-project that will probably be done
before 2.0.
- find what index to use for a query
An easy solution would be to find a comparison operation on a field
that has an index. I think much more could be done, but this would be
enough to work for simple queries and to test other parts.
E.g. for selection expression '(field1 = X) && (field2 = Y) && ...'
an index could be used to find all records with a field1 equal to X
and evaluate the expression only on them.
That sounds like a good approach.
- more performance tests
Should help make indexing improve specific operations. Won't find
which operations to improve, unlike real applications of big recfiles.
- other index structures if they would be useful
Please keep in mind that we want to use as much as stuff from gnulib as
possible. This applies to that kind of data structures as well. Since
you will probably find problems/improvements in the modules providing
data structures, it would be a good idea if you fill a copyright
assignment for gnulib right now. Would that be ok for you? As soon as
you confirm I will start the process with the gnulib maintainers.
--
Jose E. Marchesi http://www.jemarch.net
GNU Project http://www.gnu.org