From: behoffski
Subject: bug#17280: "Untangle" script to lay foundations for refactoring dfa.c
Date: Thu, 17 Apr 2014 12:19:08 +0930
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0
G'day,

I decided, back in October, to seriously scratch an itch that I've been wanting to scratch since about 2000 (!): my desire to include performance enhancements that I discovered while writing Grouse Grep, which was published in Dr. Dobb's Journal in November 1997, and which was updated in 2000 to run on Linux machines. The performance improvements relate to the Boyer-Moore skip search algorithm, and in particular:

1. Case-insensitive searches are feasible for unibyte locales;

2. The self-tuning Boyer-Moore (STBM) search that I invented does reduce some pathological cases, and can also speed up the normal search cases by a few percent. It's certainly worth trying, but I'm not sure if it will be worthwhile across the huge range of use cases that Grep has to cover; and

3. Case-insensitivity is a special case of character classes being handled by the B-M skip search. There may be a few other cases where classes could be considered part of a "must-have" string, e.g. "fred bloggs[xyz]". This case (the class at the tail of the string) is fairly easy; extending the match to handle classes elsewhere may be worthwhile, but the benefits dwindle quickly as the number of matching characters goes up, and the cost of matching classes comes to dominate.

I initially submitted a patch entitled "Patch to speed up GNU Grep, mostly with -i option" on 6 July 2003, but did not have the necessary copyright assignments in place, and then personal life intervened and I was unable to proceed at that time. Now, however, I do have the relevant paperwork in place, and so I want to try again.
------- Lua script: untangle

Writing in Lua (http://www.lua.org/), which is a small, powerful, readable, fairly popular, efficient and liberally-licensed scripting language, I've written a longish script, called "untangle", that takes source code, mainly dfa.c, but also dfa.h and Makefile.am, and provides tools to pull the code apart into segments of different sizes, and then put them back together in different ways, possibly applying edits in some cases. The script works with both Lua 5.1 and 5.2, and uses a module from LuaRocks called "strictness", which flags global variables as errors unless explicitly declared (local variables are always declared explicitly, and a global reference is almost always a typo).

This script has waxed and waned over time; its progression has not been linear, and this shows in some places. I've tended to write notes to myself as I go along; sometimes these are in the script comments; at other times they are in the emitted code. I apologise in advance for any offence that comments may give; these should be read entirely as notes to myself recording reactions to the code that I'm working on, and at times they may reflect a half-baked understanding of the complexities of the problem that the code is attempting to handle. Refactoring code is always fraught with danger, and these comments were intended to help me flag areas where more care was needed.

I've tried to conform to GNU and existing coding standards in the code I've generated; the Lua script, however, mostly reflects my personal style preferences. The script is quite long, but this is somewhat inflated by the use of a patch-like facility to do text substitutions.
The script has some features to check the integrity of the segmentation effort: it checks for non-empty lines that have been skipped between segment selections, ensures that each segment name is unique, and writes out a "reconstructed" version of each source file, so that any overlaps or other glitches in the segmentation effort can be caught.

-------- Resulting changes (additions) to grep/src:

The script tries to break dfa.c into smaller, self-contained modules, with each module's internals strongly hidden from outsiders, so that changes can be made with less fear of unintended side-effects. Each module has an "fsa" prefix. This is because the modules are search-engine-agnostic; they could be used by either a deterministic or a non-deterministic matcher, or by a kwset matcher, or by a Boyer-Moore-Gosper/Sunday matcher, etc.

I've tended to use "?? " (a trigraph, I know, but *always* followed by a space) to mark areas where there is some doubt, or where multiple options are available at a point in the script or in the code, and I've made a choice without being very confident that my selection is the best. These markers could perhaps be read as "REVIEWME: " or "FIXME: ".

The new modules are:

charclass.[ch]: Classes are now purely opaque; the client gets a pointer, but does not know how the class is implemented. Functions to map between indexes and pointers are provided, so the previous "index" scheme, used as an implied parameter to CSET, still works. However, realloc has been abandoned for holding class contents, in favour of a "list of pools" approach, so that the pointer for a class remains valid for the lifetime of the module. This, in turn, leads to more opportunities for lazy caching, such as in the lexer's find_pred function. The module continues to aggressively reuse existing classes in preference to creating duplicates; a class-state variable, with values UNUSED/WORKING/FINALISED, has been added, so that de-duplication searches only consider finalised classes.
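To make the "list of pools" idea concrete, here is a minimal sketch of how pool allocation can keep class pointers stable while still supporting the old index scheme. All names here (class_alloc, class_index, pool sizes, etc.) are my own invention for illustration, not the actual API in the attached charclass.[ch]:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch: classes are carved out of a growable list of fixed-size
   pools.  Only the small array of pool pointers is ever realloc'd;
   the pools themselves never move, so a handed-out charclass pointer
   stays valid for the lifetime of the module. */

enum { CLASS_BYTES = 32 };      /* 256 bits for a unibyte class */
enum { POOL_CLASSES = 64 };     /* classes allocated per pool */

typedef struct charclass { unsigned char bits[CLASS_BYTES]; } charclass;

typedef struct pool {
  size_t used;                       /* classes handed out so far */
  charclass classes[POOL_CLASSES];   /* storage; never moves */
} pool;

static pool **pools = NULL;          /* the "list of pools" */
static size_t npools = 0;

/* Allocate a class; unlike elements of a realloc'd flat array, the
   returned pointer remains valid forever. */
static charclass *class_alloc (void)
{
  if (npools == 0 || pools[npools - 1]->used == POOL_CLASSES)
    {
      pools = realloc (pools, (npools + 1) * sizeof *pools);
      pools[npools++] = calloc (1, sizeof (pool));
    }
  pool *p = pools[npools - 1];
  return &p->classes[p->used++];
}

/* Index <-> pointer mapping, so an integer index can still ride along
   as the implied operand of a CSET-style token. */
static size_t class_index (const charclass *c)
{
  for (size_t i = 0; i < npools; i++)
    if (c >= pools[i]->classes && c < pools[i]->classes + pools[i]->used)
      return i * POOL_CLASSES + (size_t) (c - pools[i]->classes);
  return (size_t) -1;  /* not a known class */
}

static charclass *class_from_index (size_t idx)
{
  return &pools[idx / POOL_CLASSES]->classes[idx % POOL_CLASSES];
}
```

The design point is that growth happens by adding pools, never by moving existing storage, which is what makes long-lived cached pointers (e.g. in find_pred) safe.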
More functions have been added: bits can be set/cleared in ranges, and set union and set intersection have been implemented. Ranges are useful for setting up utf8 octets without needing to know implementation details. Some of these range operations could be optimised (e.g. set k bits via ((1UL << k) - 1)), but I'm just not in the mood to take this on at present. In the future, SIMD operations could be used within the charclass module, without any disruption to outsiders.

A "gutter" has been added to each side of each class. The main reason for this is so that setbit(EOF) and clrbit(EOF), where EOF is -1, can be executed harmlessly, without adding overhead to normal operations. A single class module instance is shared by all users; at present, the code is not thread-safe, but a mutex could be used to protect areas where race conditions might occur.

fsatoken.[ch]: Defines the lexical tokens shared by other modules, plus some tools to assist debugging (e.g. prtok).

fsalex.[ch]: First receives directives regarding regular expression syntax (e.g. whether backreferences are supported, or whether NUL is to be included in the "." class), then receives a pattern string to work on, and, given this information, supplies a stream of tokens, possibly with associated information (e.g. {min,max} parameters), to a client (probably fsaparse).

fsaparse.[ch]: Works with a lexer to translate a stream of tokens into a parse tree, with the tokens flattened into a linear list, and the tree structure imposed by adding postfix tokens to describe relationships between preceding lists (trees?) of tokens.

fsamusts.[ch]: Given a postfix-tree token list supplied by fsaparse, finds a single simple string (if any) that is a must-have item if the expression is to match at all, and adds any such string to a linked list of "musts".
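The gutter idea described above can be sketched in a few lines. This is an illustration under my own naming and layout assumptions (one spare word per side, a biased bit position), not the layout used in the attached charclass.c:

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Sketch: the bit array carries one spare word on each side of the
   member bits, so setbit (EOF), with EOF == -1, lands harmlessly in
   the low gutter -- no bounds test on the fast path. */

#define CLASS_BITS 256
#define WORD_BITS (sizeof (unsigned long) * CHAR_BIT)
#define CLASS_WORDS (CLASS_BITS / WORD_BITS)

typedef struct {
  unsigned long storage[CLASS_WORDS + 2];  /* +1 gutter word per side */
} charclass;

/* Bias the bit position by one word, so b == -1 indexes into
   storage[0] (the low gutter) instead of underflowing. */
static void setbit (int b, charclass *c)
{
  size_t pos = (size_t) (b + (int) WORD_BITS);
  c->storage[pos / WORD_BITS] |= 1UL << (pos % WORD_BITS);
}

static int testbit (int b, const charclass *c)
{
  size_t pos = (size_t) (b + (int) WORD_BITS);
  return (c->storage[pos / WORD_BITS] >> (pos % WORD_BITS)) & 1;
}

/* Range setting, one bit at a time; the mask trick mentioned in the
   text (build k bits at once via ((1UL << k) - 1)) would be the
   obvious optimisation for the whole-word middle of the range. */
static void setbit_range (int lo, int hi, charclass *c)
{
  for (int b = lo; b <= hi; b++)
    setbit (b, c);
}
```

A value one past the top of the class (b == CLASS_BITS) falls into the high gutter the same way, which is what makes the scheme symmetric.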
dfa-prl.c: A "parallel" version of dfa.c, with extra code ("HOOK:") added at various places, so that the existing code and the new code can be run side-by-side, and the outputs compared. The debug output from these hooks is appended to the file /tmp/parallel.log. This file, and these hooks, are mostly for debugging at present; however, the ability to hook a different lexer into the parser may be valuable in the future.

-------- Long names, painful, but hopefully only in the short term

Nearly all externally-visible names in each module have the module name as a prefix (e.g. fsalex_ or FSALEX_). This results in very long names in some places (e.g. FSATOKEN_TK_CSET instead of CSET), and the code looks ungainly as a result. The major benefit of this approach is that I can link the old and new dfa/fsa code side-by-side in a single executable, and try out various combinations of old code calling new, new code calling old, or even old code calling both old and new in turn, without linker namespace clashes. This feature is used in various ways by dfa-prl.c, and especially in how information is logged to /tmp/parallel.log. So, please don't be put off by disruptions to indentation, or the presence of overly long lines, as a result of this naming style. The code is a demonstration/discussion vehicle, and once merits and/or deficiencies are decided, changes can be made as appropriate.

-------- Locale, re_syntax and other configuration items

At present, there is an assumption that the same locale is in effect for the duration of the execution of the dfa machinery. I've tried to tighten this, so that the locale in effect when fsalex_syntax() is called is the locale to be used, regardless of later global changes. The code is not perfect at achieving this; some areas are better than others.

-------- Where to next? Provocation, discussion, timeline?

The special case that I am aiming at in the short term is case-insensitive Boyer-Moore string searching.
My intended approach is to recondition dfa.c without major changes, then rework modules by introducing a more expressive token syntax that allows concepts such as case-insensitive characters to be named explicitly, and then have a fairly easy and direct task of converting these outputs into formats suitable for high-performance search algorithms. Norihiro Tanaka, however, has very neatly devised ways of making an end-run around the hairier bits of the code, and has come out with a case-insensitive B-M match at the other end. He's also come up with a whole raft of other impressive improvements, working in conjunction with the grep, awk and other GNU component teams. However, I still believe that there is some merit in the approach I'm proposing.

I've chosen to be aggressive in some of my changes (e.g. demanding strict modularity; changing charclass storage so that the type is truly opaque, and yet the pointer remains valid for the lifetime of the module; my reworking of FETCH_WC, which was partially made obsolete when MBS_SUPPORT variants were dropped; the way that I've reworked find_pred, including using mbrtowc_cache; and rewriting closure ()). These are intended to stimulate discussion, not simply to be change for the sake of change.

Sometimes, a change in the way information is shared comes up with neat surprises: look at how fsalex shares dotclass with fsaparse, and so neatly handles the single-byte match case of utf8-anychar, without fsaparse needing to have any detailed knowledge of either the locale or of the selections made by regular expression option bits.

The code is *not* bug-free; the most obvious failure is that the arrays detailing multibyte character classes are not handed from fsalex to fsaparse. Although I haven't checked further, it is likely that some features that were previously exposed by lex and parse, and that are now hidden, are needed by the higher DFA-building machinery; I haven't investigated this yet.
Also, the code does not try to free up resources cleanly. I believe that sharing the code now, and starting a discussion, is better than trying to polish things further. I have only tried this code on one machine, running a recent version of gcc and the GNU toolchain (under Gentoo). While I've worked on different CPUs, including different int sizes and different endianness, I've had less experience with legacy Unix systems, and with the differences between Unix/Linux/BSD etc. So, treat this code as a demonstration/discussion starting point, and over time hopefully risk/benefit analysis can help decide what to do next.

--------------------

I've attached both the untangle script, and all the files created and/or modified by it. I've also attached the "strictness.lua" module, extracted from LuaRocks.

cheers,

behoffski (Brenton Hoff)
Programmer, Grouse Software
Makefile.am
Description: Text document
charclass.c
Description: Text Data
charclass.h
Description: Text Data
dfa-prl.c
Description: Text Data
fsalex.h
Description: Text Data
fsalex.c
Description: Text Data
fsamusts.c
Description: Text Data
fsamusts.h
Description: Text Data
fsaparse.c
Description: Text Data
fsaparse.h
Description: Text Data
fsatoken.c
Description: Text Data
fsatoken.h
Description: Text Data
strictness.lua
Description: Text Data
untangle
Description: Text document