bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#19837: Announcing initial release of hstbm, Version 0.13.


From: Glenn Morris
Subject: bug#19837: Announcing initial release of hstbm, Version 0.13.
Date: Wed, 11 Feb 2015 11:59:24 -0500
User-agent: Gnus (www.gnus.org), GNU Emacs (www.gnu.org/software/emacs/)

The following mail arrived at debbugs.gnu.org via bcc, so could not be
associated with a package. I guess it was meant for grep, so I have
reassigned it and am sending this mail which will appear on the grep
mailing list.

sur-behoffski wrote:

> G'day,
>
> I've been working on an "exploratory fork" of GNU Grep, in order
> to provide a framework where I can set up a pared-down program
> that provides support for some simple search cases, but does not
> have the daunting complexity and (to some extent) extensive
> legacy support baggage associated with GNU Grep.
>
> This has resulted in a new project, "hstbm", which is an acronym
> for Hybrid Self-Tuning Boyer-Moore.
>
> The Savannah folks have kindly agreed to host hstbm as a non-GNU
> project at:
>
>     https://savannah.nongnu.org/projects/hstbm
>
> The release can be downloaded as a separate tarball from:
>
>     http://git.savannah.gnu.org/cgit/hstbm.git/snapshot/hstbm-0.13.tar.gz
>
> Git users probably want to clone the repository instead; the
> latest sources (which may include changes after the 0.13 release)
> can be retrieved via "git clone":
>
>     git clone git://git.savannah.nongnu.org/hstbm.git
>
> -------------
>
> hstbm is a partner to the "untangle" script that I've been posting
> to the GNU Grep over time; whereas "untangle" tries to lay the
> foundations for a "bottom-up" refactoring of GNU Grep code, hstbm
> is a "top-down" approach, reducing the layers of code between the
> caller, pattern interpretation (lexing and/or parsing), and the
> underlying search algorithm -- which is also called hstbm.
>
> Only a small fraction of regular expression syntax -- classes --
> is recognised, and the class support is competent but not
> bullet-proof.  It's best to think of the current level of pattern
> support as "fgrep (or grep -F) plus some class support".  This
> level of support is enough to fulfil the needs of the string
> search algorithm included in this release; more support, such as
> anchors, optional elements, repetition, backreferences etc. are
> not included for now (although there's a nod in one or two places
> that these could included in the future).
>
> -------------
>
> The hstbm algorithm itself is competitive with the equivalent
> algorithm in GNU Grep (culminating in bmexec () in src/kwset.c).
> In my limited testing, choosing data and patterns that are
> somewhat biased towards showing hstbm's strong points, it can
> outperform GNU Grep.  I'm hopeful that these improvements can
> hold up over a very wide range of data and patterns, with very
> few or no performance regressions, such that it might be
> worthwhile incorporating hstbm into GNU Grep:
>
> ["repl" is a Lua script to wrap (roughly) the following shell
> script, run it multiple times (throwing the first iteration
> away), and log the results, as well as outputting them:
>
>      Cmd=$*
>      time (let j=1 \
>            while [ $((j++)) -lt 1000 ] ; do \
>                  $Cmd    >/dev/null   \
>            done  \
>      ) 2>&1
> ]
>
> --------------
>
> $ grep --version | head 1
> grep (GNU grep) 2.21
> $ ./h --version | head -1
> hstbm (Hybrid Self-Tuning Boyer-Moore) 0.13
>
>
> $ cat /usr/src/linux-3.14.32-gentoo/*/*.[ch] >csrc
> $ wc csrc
>   420778  1348027 11521841 csrc
>
>
> $ ./repl grep -c "123[abc]456" csrc
> Cmd:    "grep" "-c" "123[abc]456" "csrc"
> 0
>   real  0m7.268s  user  0m3.140s  sys   0m3.767s
>   real  0m7.299s  user  0m3.317s  sys   0m3.590s
>   real  0m7.298s  user  0m3.253s  sys   0m3.657s
>   real  0m7.245s  user  0m3.233s  sys   0m3.673s
>
> $ ./repl ./h -c "123[abc]456" csrc
> Cmd:    "./h" "-c" "123[abc]456" "csrc"
> 0
>   real  0m6.570s  user  0m2.237s  sys   0m1.327s
>   real  0m6.586s  user  0m2.200s  sys   0m1.383s
>   real  0m6.647s  user  0m2.317s  sys   0m1.367s
>   real  0m6.598s  user  0m2.247s  sys   0m1.367s
>
> $ ./repl grep -ci "123[abc]456" csrc
> Cmd:    "grep" "-ci" "123[abc]456" "csrc"
> 0
>   real  0m7.307s  user  0m3.163s  sys   0m3.743s
>   real  0m7.269s  user  0m3.373s  sys   0m3.533s
>   real  0m7.266s  user  0m3.273s  sys   0m3.633s
>   real  0m7.266s  user  0m3.357s  sys   0m3.553s
>
> $ ./repl ./h -ci "123[abc]456" csrc
> Cmd:    "./h" "-ci" "123[abc]456" "csrc"
> 0
>   real  0m6.582s  user  0m2.250s  sys   0m1.313s
>   real  0m6.589s  user  0m2.287s  sys   0m1.293s
>   real  0m6.629s  user  0m2.197s  sys   0m1.370s
>   real  0m6.637s  user  0m2.217s  sys   0m1.357s
>
> $ ./repl grep -c "123456[abc]" csrc
> Cmd:    "grep" "-c" "123456[abc]" "csrc"
> 0
>   real  0m6.652s  user  0m2.170s  sys   0m1.407s
>   real  0m6.659s  user  0m2.257s  sys   0m1.330s
>   real  0m6.662s  user  0m2.130s  sys   0m1.450s
>   real  0m6.658s  user  0m2.157s  sys   0m1.430s
>
> $ ./repl ./h -c "123456[abc]" csrc
> Cmd:    "./h" "-c" "123456[abc]" "csrc"
> 0
>   real  0m6.584s  user  0m2.227s  sys   0m1.337s
>   real  0m6.546s  user  0m2.210s  sys   0m1.360s
>   real  0m6.546s  user  0m2.180s  sys   0m1.393s
>   real  0m6.544s  user  0m2.210s  sys   0m1.363s
>
> $ ./repl grep -ci "123456[abc]" csrc
> Cmd:    "grep" "-ci" "123456[abc]" "csrc"
> 0
>   real  0m6.676s  user  0m2.177s  sys   0m1.407s
>   real  0m6.681s  user  0m2.060s  sys   0m1.537s
>   real  0m6.671s  user  0m2.250s  sys   0m1.337s
>   real  0m6.683s  user  0m2.090s  sys   0m1.497s
>
> $ ./repl ./h -ci "123456[abc]" csrc
> Cmd:    "./h" "-ci" "123456[abc]" "csrc"
> 0
>   real  0m6.644s  user  0m2.333s  sys   0m1.233s
>   real  0m6.689s  user  0m2.113s  sys   0m1.457s
>   real  0m6.602s  user  0m2.187s  sys   0m1.383s
>   real  0m6.604s  user  0m2.220s  sys   0m1.353s
>
> $ ./repl grep -c "123456a" csrc
> Cmd:    "grep" "-c" "123456a" "csrc"
> 0
>   real  0m12.035s  user 0m7.743s  sys   0m2.490s
>   real  0m12.024s  user 0m7.763s  sys   0m2.473s
>   real  0m12.019s  user 0m7.820s  sys   0m2.413s
>   real  0m12.017s  user 0m7.717s  sys   0m2.517s
>
> $ ./repl ./h -c "123456a" csrc
> Cmd:    "./h" "-c" "123456a" "csrc"
> 0
>   real  0m6.266s  user  0m1.923s  sys   0m1.640s
>   real  0m6.273s  user  0m1.980s  sys   0m1.590s
>   real  0m6.252s  user  0m2.217s  sys   0m1.357s
>   real  0m6.254s  user  0m2.033s  sys   0m1.540s
>
> $ ./repl grep -ci "123456a" csrc
> Cmd:    "grep" "-ci" "123456a" "csrc"
> 0
>   real  0m15.220s  user 0m11.060s  sys  0m2.507s
>   real  0m15.327s  user 0m10.980s  sys  0m2.583s
>   real  0m15.215s  user 0m11.090s  sys  0m2.473s
>   real  0m15.218s  user 0m10.900s  sys  0m2.663s
>
> $ ./repl ./h -ci "123456a" csrc
> Cmd:    "./h" "-ci" "123456a" "csrc"
> 0
>   real  0m6.651s  user  0m2.307s  sys   0m1.260s
>   real  0m6.665s  user  0m2.253s  sys   0m1.327s
>   real  0m6.602s  user  0m2.320s  sys   0m1.243s
>   real  0m6.611s  user  0m2.243s  sys   0m1.327s
>
>
>
> $ wc candide-8859-1.txt
>   4352  35591 221126 candide-8859-1.txt
>
>
> $ LC_ALL=fr_FR.iso88591 ./repl grep -c "tr[[=e=]]s" candide-8859-1.txt
> Cmd:    "grep" "-c" "tr[[=e=]]s" "candide-8859-1.txt"
> 154
>   real  0m4.330s  user  0m3.240s  sys   0m0.290s
>   real  0m4.340s  user  0m3.273s  sys   0m0.263s
>   real  0m4.345s  user  0m3.213s  sys   0m0.323s
>   real  0m4.346s  user  0m3.213s  sys   0m0.327s
>
> $ LC_ALL=fr_FR.iso88591 ./repl ./h -c "tr[[=e=]]s" candide-8859-1.txt
> Cmd:    "./h" "-c" "tr[[=e=]]s" "candide-8859-1.txt"
> 154
>   real  0m1.376s  user  0m0.043s  sys   0m0.133s
>   real  0m1.382s  user  0m0.073s  sys   0m0.103s
>   real  0m1.383s  user  0m0.043s  sys   0m0.133s
>   real  0m1.386s  user  0m0.050s  sys   0m0.127s
>
> $ LC_ALL=fr_FR.iso88591 ./repl grep -ci "tr[[=e=]]s" candide-8859-1.txt
> Cmd:    "grep" "-ci" "tr[[=e=]]s" "candide-8859-1.txt"
> 155
>   real  0m5.704s  user  0m3.357s  sys   0m0.177s
>   real  0m5.683s  user  0m3.383s  sys   0m0.150s
>   real  0m5.686s  user  0m3.343s  sys   0m0.190s
>   real  0m5.690s  user  0m3.363s  sys   0m0.173s
>
> $ LC_ALL=fr_FR.iso88591 ./repl ./h -ci "tr[[=e=]]s" candide-8859-1.txt
> Cmd:    "./h" "-ci" "tr[[=e=]]s" "candide-8859-1.txt"
> 155
>   real  0m1.598s  user  0m0.043s  sys   0m0.137s
>   real  0m1.600s  user  0m0.060s  sys   0m0.117s
>   real  0m1.608s  user  0m0.040s  sys   0m0.140s
>   real  0m1.609s  user  0m0.070s  sys   0m0.110s
>
>
> --------------------
>
>
> While many, many features of GNU Grep are not supported, two switches,
> -J and -K=SKIP_LOOP_EFFICIENCY_THRESHOLD, are included.  Here's an
> example of the debug output created when -J is specified:
>
>
>
>
> $ ./h -J -c "tr[[=e=]]s" candide-8859-1.txt
> * show_internals:  Compiled pattern:  Number of tokens: 4
>     OP_CHAR_DIRECT('t')      -- 0x74
>     OP_CHAR_DIRECT('r')      -- 0x72
>     OP_CHAR_DIRECT('e')      -- 0x65
>     OP_CHAR_DIRECT('s')      -- 0x73
>
> * debug: hstbm context at 0x20ee000:
> - fold_table (not used):
>     ................................ ................................
>     ................................ ................................
>     ................................ ................................
>     ................................ ................................
> - delta1:
>     44444444444444444444444444444444 44444444444444444444444444444444
>     44444444444444444444444444444444 44444144444444444420344444444444
>     44444444444444444444444444444444 44444444444444444444444444444444
>     44444444444444444444444444444444 44444444444444444444444444444444
> -      Pattern:    t   r   e   s
>
> * Hybrid (memchr/memchr2) efficiency variables:
> - skip_loop_efficiency_threshold: 80
> -   nr_aliases:    0   0   0   0
> -  first_alias:    .   .   .   .
> - memchr_offset_list [negated]:    3   2   1   0
> - current_memchr_offset_index: 0
>
> * Guard (delta2) and match (delta 3/4) skip variables:
> - delta2 candidate(s):   4
> - delta2: 4
> -       delta3:    4   4   4
> -       delta4:    4   4
>
> * context check... [OK]
> 93
> $ ./h -J -ci "tr[[=e=]]s" candide-8859-1.txt
> * show_internals:  Compiled pattern:  Number of tokens: 4
>     OP_CHAR_CLASS(ref=1)     -- members:2
>       -- 54 74                                             Tt
>     OP_CHAR_CLASS(ref=2)     -- members:2
>       -- 52 72                                             Rr
>     OP_CHAR_CLASS(ref=3)     -- members:2
>       -- 45 65                                             Ee
>     OP_CHAR_CLASS(ref=4)     -- members:2
>       -- 53 73                                             Ss
>
> * debug: hstbm context at 0x1a4e000:
> - fold_table:
>     ................................  !"#$%&'()*+,-./0123456789:;<=>?
>     @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ `abcdEfghijklmnopqRsTuvwxyz{|}~.
>     ................................ ................................
>     ................................ ................................
> - delta1:
>     44444444444444444444444444444444 44444444444444444444444444444444
>     44444144444444444420344444444444 44444144444444444420344444444444
>     44444444444444444444444444444444 44444444444444444444444444444444
>     44444444444444444444444444444444 44444444444444444444444444444444
> -      Pattern:    T   R   E   S
>
> * Hybrid (memchr/memchr2) efficiency variables:
> - skip_loop_efficiency_threshold: 80
> -   nr_aliases:    1   1   1   1
> -  first_alias:    t   r   e   s
> - memchr_offset_list [negated]:    3   2   1   0
> - current_memchr_offset_index: 0
>
> * Guard (delta2) and match (delta 3/4) skip variables:
> - delta2 candidate(s):   4
> - delta2: 4
> -       delta3:    4   4   4
> -       delta4:    4   4
>
> * context check... [OK]
> 93
> $ LC_ALL=fr_FR.iso88591 ./h -J -ci "tr[[=e=]]s" candide-8859-1.txt
> * show_internals:  Compiled pattern:  Number of tokens: 4
>     OP_CHAR_CLASS(ref=1)     -- members:2
>       -- 54 74                                             Tt
>     OP_CHAR_CLASS(ref=2)     -- members:2
>       -- 52 72                                             Rr
>     OP_CHAR_CLASS(ref=3)     -- members:10
>       -- 45 65 c8 c9 ca cb e8 e9  ea eb                    Eeà à à à èé êë
>     OP_CHAR_CLASS(ref=4)     -- members:2
>       -- 53 73                                             Ss
>
> * debug: hstbm context at 0x1ae7000:
> - fold_table:
>     ................................  !"#$%&'()*+,-./0123456789:;<=>?
>     @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ `abcdEfghijklmnopqRsTuvwxyz{|}~.
>     ................................  
> ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿
>     à Áà à à à à à EEEEà Íà ÏÐÃ'Ã'Ã"Ã"à à à à à à à à Ýà à  à
> áâãäåæçEEEEìíîïðñòóôõö÷øùúûüýþÿ
> - delta1:
>     44444444444444444444444444444444 44444444444444444444444444444444
>     44444144444444444420344444444444 44444144444444444420344444444444
>     44444444444444444444444444444444 44444444444444444444444444444444
>     44444444111144444444444444444444 44444444111144444444444444444444
> -      Pattern:    T   R   E   S
>
> * Hybrid (memchr/memchr2) efficiency variables:
> - skip_loop_efficiency_threshold: 80
> -   nr_aliases:    1   1   9   1
> -  first_alias:    t   r   e   s
> - memchr_offset_list [negated]:    3   2   0
> - current_memchr_offset_index: 0
>
> * Guard (delta2) and match (delta 3/4) skip variables:
> - delta2 candidate(s):   4
> - delta2: 4
> -       delta3:    4   4   4
> -       delta4:    4   4
>
> * context check... [OK]
> 155
> $
>
>
>
> --------------------
>
>
>
> WARNING:  Because hstbm supports far fewer features than GNU Grep,
> it takes less time to initialise.  This will give hstbm an advantage
> when the total input is small (?? perhaps less than 4k bytes?), and
> should be factored out when comparing performance.  (Note that both
> csrc and candide-8859-1.txt above are greater than 100k bytes, so
> this effect should not be a large factor in the above results.)
>
>
> $ ./repl grep -c "123[abc]456" /dev/null
> Cmd:    "grep" "-c" "123[abc]456" "/dev/null"
> 0
>   real  0m0.990s  user  0m0.073s  sys   0m0.097s
>   real  0m1.001s  user  0m0.050s  sys   0m0.123s
>   real  0m0.993s  user  0m0.020s  sys   0m0.147s
>   real  0m0.989s  user  0m0.040s  sys   0m0.127s
>
> $ ./repl ./h -c "123[abc]456" /dev/null
> Cmd:    "./h" "-c" "123[abc]456" "/dev/null"
> 0
>   real  0m0.748s  user  0m0.047s  sys   0m0.103s
>   real  0m0.758s  user  0m0.057s  sys   0m0.093s
>   real  0m0.762s  user  0m0.037s  sys   0m0.113s
>   real  0m0.754s  user  0m0.050s  sys   0m0.100s
>
> $ ./repl grep -ci "123[abc]456" /dev/null
> Cmd:    "grep" "-ci" "123[abc]456" "/dev/null"
> 0
>   real  0m0.993s  user  0m0.043s  sys   0m0.127s
>   real  0m0.998s  user  0m0.057s  sys   0m0.110s
>   real  0m0.995s  user  0m0.037s  sys   0m0.130s
>   real  0m0.993s  user  0m0.053s  sys   0m0.110s
>
> $ ./repl ./h -ci "123[abc]456" /dev/null
> Cmd:    "./h" "-ci" "123[abc]456" "/dev/null"
> 0
>   real  0m0.768s  user  0m0.037s  sys   0m0.113s
>   real  0m0.787s  user  0m0.033s  sys   0m0.123s
>   real  0m0.770s  user  0m0.037s  sys   0m0.110s
>   real  0m0.779s  user  0m0.027s  sys   0m0.127s
>
>
>
> --------------------
>
>
>
> Have fun hacking, and reading the code... all feedback, comments,
> patches, etc. welcome.
>
> cheers,
>
> sur-behoffski (Brenton Hoff)
> Programmer, Grouse Software





reply via email to

[Prev in Thread] Current Thread [Next in Thread]