[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug-recutils] Index trees

From: Michał Masłowski
Subject: [bug-recutils] Index trees
Date: Mon, 20 Aug 2012 18:13:09 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.1 (gnu/linux)


The index-tree branch of git://elderthing.mtjm.eu/mtjm-recutils.git
contains changes implementing the b+tree-based indexes nearly as
described previously and using them with simple queries like the ones
done in performance tests (will send the patches separately).

I learned three things writing these changes: APIs I propose make
testing hard, so my unit tests are too incomplete;
just-return-NULL-on-error is a bad strategy for this code (e.g. the user
should know if they should run recfix --build-index, free some memory or
fix a bug for the query to succeed); index trees are much faster than
parsing the whole record set for appropriate tests.

Of the 13 patches two add recfix commands for index file changes, these
should be rewritten for error handling; two other patches fix bugs in
rec-idx-file that should be found by unit tests in a future revision of
index file patches.

The future change to support other endian index files will require API
changes in the index tree reader, probably could be done by passing a
boolean parameter to rec_idx_tree_new.  (It will make testing the reader
on various trees easier.)

I have the following questions related to the index tree APIs:

- Should the index builder support changing parameters like the node
  size limit?  Currently 200+ records might be needed to make a tree
  consisting of more than a single node, I'm not sure if this would be
  ok in a unit test for performance reasons (and test complexity).
  Exposing some internals of this implementation could make some
  possibly useful changes more difficult in future (although unlike
  e.g. fast_string indexes or non-string keys I have no specific ideas
  for such changes).

- Should index file and index tree modules use error codes like the
  parser, or split functions so each could fail for only one reason?
  rec_db_add_rsets_from_file would need a more complex solution
  (unsupported indexes don't prevent other indexes from working, while
  the user should know about them).

- Should new tests be added for recsel or the existing tests modified to
  run also with appropriate indexes?  The second solution seems more
  directly verifying the "indexes don't affect results" rule, although
  would make the tests more complex.  The first one could require adding
  similar number of tests (except for ones testing features "obviously
  unsupported" with indexes like joins or aggregates).

- It will be easy to introduce performance regressions when editing the
  query planner (if it will optimize more queries), should it be
  prevented by e.g. counting records iterated and making this available
  to the recsel correctness tests?  This will make optional output that
  could change depending on version (e.g. due to future optimizations).

Attachment: pgpPS21uvM4Dd.pgp
Description: PGP signature

reply via email to

[Prev in Thread] Current Thread [Next in Thread]