From: behoffski
Subject: bug#17280: "Untangle" script to lay foundations for refactoring dfa.c
Date: Thu, 17 Apr 2014 12:19:08 +0930
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0
G'day,

I decided, back in October, to seriously scratch an itch that I've been wanting to scratch since about 2000 (!): my desire to include performance enhancements that I discovered while writing Grouse Grep, which was published in Dr. Dobb's Journal in November 1997, and which was updated in 2000 to run on Linux machines. The performance improvements relate to the Boyer-Moore skip search algorithm, and in particular:

1. Case-insensitive searches are feasible for unibyte locales;

2. The self-tuning Boyer-Moore (STBM) search that I invented does reduce some pathological cases, and can also speed up the normal search cases by a few percent. It's certainly worth trying, but I'm not sure if it will be worthwhile across the huge range of use cases that Grep has to cover; and

3. Case-insensitivity is a special case of character classes being handled by the B-M skip search. There may be a few other cases where classes could be considered part of a "must-have" string, e.g. "fred bloggs[xyz]". This case (the class at the tail of the string) is fairly easy; extending the match to handle classes elsewhere may be worthwhile, but the benefits dwindle quickly as the number of matching characters goes up, and the cost of matching classes comes to dominate.

I initially submitted a patch entitled "Patch to speed up GNU Grep, mostly with -i option" on 6 July 2003, but did not have the necessary copyright assignments in place, and then personal life intervened and I was unable to proceed at that time. Now, however, I do have the relevant paperwork in place, and so I want to try again.
------- Lua script: untangle

Writing in Lua (http://www.lua.org/), which is a small, powerful, readable, fairly popular, efficient and liberally-licensed scripting language, I've written a longish script, called "untangle", that takes source code, mainly dfa.c, but also dfa.h and Makefile.am, and provides tools to pull the code apart into segments of different sizes, and then put them back together in different ways, possibly applying edits in some cases. The script works with both Lua 5.1 and 5.2, and uses a module from LuaRocks called "strictness", which flags global variables as errors unless explicitly declared (local variables are always declared explicitly, and a global reference is almost always a typo).

This script has waxed and waned over time; its progression has not been linear, and this shows in some places. I've tended to write notes to myself as I go along; sometimes these are in the script comments; at other times they are in the emitted code. I apologise in advance for any offence that comments may give; these should be read entirely as notes to myself recording reactions to the code that I'm working on, and at times they may reflect a half-baked understanding of the complexities of the problem that the code is attempting to handle. Refactoring code is always fraught with danger, and these comments were intended to help me flag areas where more care was needed.

I've tried to conform to GNU and existing coding standards in the code I've generated; the Lua script, however, mostly reflects my personal style preferences. The script is quite long, but this is somewhat inflated by the use of a patch-like facility to do text substitutions.
The script has some features to check the integrity of the segmentation effort: it checks for non-empty lines that have been skipped between segment selections, ensures that each segment name is unique, and writes out a "reconstructed" version of each source file, so that any overlaps or other glitches in the segmentation effort can be caught.

-------- Resulting changes (additions) to grep/src:

The script tries to break dfa.c into smaller, self-contained modules, with each module's internals strongly hidden from outsiders, so that changes can be made with less fear of unintended side-effects. Each module has an "fsa" prefix. This is because the modules are search-engine-agnostic; they could be used by either a deterministic or a non-deterministic matcher, or by a kwset matcher, or by a Boyer-Moore-Gosper/Sunday matcher, etc.

I've tended to use "?? " (a trigraph, I know, but *always* followed by a space) to mark areas where there is some doubt, or where multiple options are available at a point in the script or in the code, and I've made a choice without being very confident that my selection is the best. These markers could perhaps be read as "REVIEWME: " or "FIXME: ".

The new modules are:

charclass.[ch]: Classes are now purely opaque; the client gets a pointer, but does not know how the class is implemented. Functions to map between indexes and pointers are provided, so the previous "index" scheme, used as an implied parameter to CSET, still works. However, realloc has been abandoned for holding class contents, in favour of a "list of pools" approach, so that the pointer for a class remains valid for the lifetime of the module. This, in turn, leads to more opportunities for lazy caching, such as in the lexer's find_pred function. The module continues to aggressively reuse existing classes in preference to creating duplicates; a class-state variable, with values UNUSED/WORKING/FINALISED, has been added, so that de-duplication searches only consider finalised classes.
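To make the "list of pools" idea concrete, here is a minimal sketch of how pool allocation can keep class pointers stable while still supporting the old index scheme. All names here (class_alloc, class_index, pool sizes, etc.) are my own invention for illustration, not the actual API in the attached charclass.[ch]:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch: classes are carved out of a growable list of fixed-size
   pools.  Only the small array of pool pointers is ever realloc'd;
   the pools themselves never move, so a handed-out charclass pointer
   stays valid for the lifetime of the module. */

enum { CLASS_BYTES = 32 };      /* 256 bits for a unibyte class */
enum { POOL_CLASSES = 64 };     /* classes allocated per pool */

typedef struct charclass { unsigned char bits[CLASS_BYTES]; } charclass;

typedef struct pool {
  size_t used;                       /* classes handed out so far */
  charclass classes[POOL_CLASSES];   /* storage; never moves */
} pool;

static pool **pools = NULL;          /* the "list of pools" */
static size_t npools = 0;

/* Allocate a class; unlike elements of a realloc'd flat array, the
   returned pointer remains valid forever. */
static charclass *class_alloc (void)
{
  if (npools == 0 || pools[npools - 1]->used == POOL_CLASSES)
    {
      pools = realloc (pools, (npools + 1) * sizeof *pools);
      pools[npools++] = calloc (1, sizeof (pool));
    }
  pool *p = pools[npools - 1];
  return &p->classes[p->used++];
}

/* Index <-> pointer mapping, so an integer index can still ride along
   as the implied operand of a CSET-style token. */
static size_t class_index (const charclass *c)
{
  for (size_t i = 0; i < npools; i++)
    if (c >= pools[i]->classes && c < pools[i]->classes + pools[i]->used)
      return i * POOL_CLASSES + (size_t) (c - pools[i]->classes);
  return (size_t) -1;  /* not a known class */
}

static charclass *class_from_index (size_t idx)
{
  return &pools[idx / POOL_CLASSES]->classes[idx % POOL_CLASSES];
}
```

The design point is that growth happens by adding pools, never by moving existing storage, which is what makes long-lived cached pointers (e.g. in find_pred) safe.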
More functions have been added: bits can be set/cleared in ranges, and set union and set intersection have been implemented. Ranges are useful for setting up utf8 octets without needing to know implementation details. Some of these range operations could be optimised (e.g. set k bits via ((1UL << k) - 1)), but I'm just not in the mood to take this on at present. In the future, SIMD operations could be used within the charclass module, without any disruption to outsiders.

A "gutter" has been added to each side of each class. The main reason for this is so that setbit(EOF) and clrbit(EOF), where EOF is -1, can be executed harmlessly, without adding overhead to normal operations. A single class module instance is shared by all users; at present, the code is not thread-safe, but a mutex could be used to protect areas where race conditions might occur.

fsatoken.[ch]: Defines the lexical tokens shared by other modules, plus some tools to assist debugging (e.g. prtok).

fsalex.[ch]: First receives directives regarding regular expression syntax (e.g. whether backreferences are supported, or whether NUL is to be included in the "." class), then receives a pattern string to work on, and, given this information, supplies a stream of tokens, possibly with associated information (e.g. {min,max} parameters), to a client (probably fsaparse).

fsaparse.[ch]: Works with a lexer to translate a stream of tokens into a parse tree, with the tokens flattened into a linear list, and the tree structure imposed by adding postfix tokens to describe relationships between preceding lists (trees?) of tokens.

fsamusts.[ch]: Given a postfix-tree token list supplied by fsaparse, finds a single simple string (if any) that is a must-have item if the expression is to match at all, and adds any such string to a linked list of "musts".
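The gutter idea described above can be sketched in a few lines. This is an illustration under my own naming and layout assumptions (one spare word per side, a biased bit position), not the layout used in the attached charclass.c:

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Sketch: the bit array carries one spare word on each side of the
   member bits, so setbit (EOF), with EOF == -1, lands harmlessly in
   the low gutter -- no bounds test on the fast path. */

#define CLASS_BITS 256
#define WORD_BITS (sizeof (unsigned long) * CHAR_BIT)
#define CLASS_WORDS (CLASS_BITS / WORD_BITS)

typedef struct {
  unsigned long storage[CLASS_WORDS + 2];  /* +1 gutter word per side */
} charclass;

/* Bias the bit position by one word, so b == -1 indexes into
   storage[0] (the low gutter) instead of underflowing. */
static void setbit (int b, charclass *c)
{
  size_t pos = (size_t) (b + (int) WORD_BITS);
  c->storage[pos / WORD_BITS] |= 1UL << (pos % WORD_BITS);
}

static int testbit (int b, const charclass *c)
{
  size_t pos = (size_t) (b + (int) WORD_BITS);
  return (c->storage[pos / WORD_BITS] >> (pos % WORD_BITS)) & 1;
}

/* Range setting, one bit at a time; the mask trick mentioned in the
   text (build k bits at once via ((1UL << k) - 1)) would be the
   obvious optimisation for the whole-word middle of the range. */
static void setbit_range (int lo, int hi, charclass *c)
{
  for (int b = lo; b <= hi; b++)
    setbit (b, c);
}
```

A value one past the top of the class (b == CLASS_BITS) falls into the high gutter the same way, which is what makes the scheme symmetric.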
dfa-prl.c: A "parallel" version of dfa.c, with extra code ("HOOK:") added at various places, so that the existing code and the new code can be run side-by-side, and the outputs compared. The debug output from these hooks is appended to the file /tmp/parallel.log. This file, and these hooks, are mostly for debugging at present; however, the ability to hook a different lexer into the parser may be valuable in the future.

-------- Long names, painful, but hopefully only in the short term

Nearly all externally-visible names in each module have the module name as a prefix (e.g. fsalex_ or FSALEX_). This results in very long names in some places (e.g. FSATOKEN_TK_CSET instead of CSET), and the code looks ungainly as a result. The major benefit of this approach is that I can link the old and new dfa/fsa code side-by-side in a single executable, and try out various combinations of old code calling new, new code calling old, or even old code calling both old and new in turn, without linker namespace clashes. This feature is used in various ways by dfa-prl.c, and especially in how information is logged to /tmp/parallel.log. So, please don't be put off by disruptions to indentation, or the presence of overly long lines, as a result of this naming style. The code is a demonstration/discussion vehicle, and once merits and/or deficiencies are decided, changes can be made as appropriate.

-------- Locale, re_syntax and other configuration items

At present, there is an assumption that the same locale is in effect for the duration of the execution of the dfa machinery. I've tried to tighten this, so that the locale in effect when fsalex_syntax() is called is the locale to be used, regardless of later global changes. The code is not perfect at achieving this; some areas are better than others.

-------- Where to next? Provocation, discussion, timeline?

The special case that I am aiming at in the short term is case-insensitive Boyer-Moore string searching.
My intended approach is to recondition dfa.c without major changes, then rework modules by introducing a more expressive token syntax that allows concepts such as case-insensitive characters to be named explicitly, and then have a fairly easy and direct task of converting these outputs into formats suitable for high-performance search algorithms. Norihiro Tanaka, however, has very neatly devised ways of making an end-run around the hairier bits of the code, and has come out with a case-insensitive B-M match at the other end. He's also come up with a whole raft of other impressive improvements, working in conjunction with the grep, awk and other GNU component teams. However, I still believe that there is some merit in the approach I'm proposing.

I've chosen to be aggressive in some of my changes (e.g. demanding strict modularity; changing charclass storage so that the type is truly opaque, and yet the pointer remains valid for the lifetime of the module; my reworking of FETCH_WC, which was partially made obsolete when MBS_SUPPORT variants were dropped; the way that I've reworked find_pred, including using mbrtowc_cache; and rewriting closure ()). These are intended to stimulate discussion, not simply to be change for the sake of change.

Sometimes, a change in the way information is shared comes up with neat surprises: look at how fsalex shares dotclass with fsaparse, and so neatly handles the single-byte match case of utf8-anychar, without fsaparse needing to have any detailed knowledge of either the locale or of the selections made by regular expression option bits.

The code is *not* bug-free; the most obvious failure is that the arrays detailing multibyte character classes are not handed from fsalex to fsaparse. Although I haven't checked further, it is likely that some features that were previously exposed by lex and parse, and that are now hidden, are needed by the higher DFA-building machinery; I haven't investigated this yet.
Also, the code does not try to free up resources cleanly. I believe that sharing the code now, and starting a discussion, is better than trying to polish things further. I have only tried this code on one machine, running a recent version of gcc and the GNU toolchain (under Gentoo). While I've worked on different CPUs, including different int sizes and different endianness, I've had less experience with legacy Unix systems, and with the differences between Unix/Linux/BSD etc. So, treat this code as a demonstration/discussion starting point, and over time hopefully risk/benefit analysis can help decide what to do next.

--------------------

I've attached both the untangle script, and all the files created and/or modified by it. I've also attached the "strictness.lua" module, extracted from LuaRocks.

cheers,

behoffski (Brenton Hoff)
Programmer, Grouse Software
Makefile.am
Description: Text document
charclass.c
Description: Text Data
charclass.h
Description: Text Data
dfa-prl.c
Description: Text Data
fsalex.h
Description: Text Data
fsalex.c
Description: Text Data
fsamusts.c
Description: Text Data
fsamusts.h
Description: Text Data
fsaparse.c
Description: Text Data
fsaparse.h
Description: Text Data
fsatoken.c
Description: Text Data
fsatoken.h
Description: Text Data
strictness.lua
Description: Text Data
untangle
Description: Text document