[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#19837: Announcing initial release of hstbm, Version 0.13.
From: |
Glenn Morris |
Subject: |
bug#19837: Announcing initial release of hstbm, Version 0.13. |
Date: |
Wed, 11 Feb 2015 11:59:24 -0500 |
User-agent: |
Gnus (www.gnus.org), GNU Emacs (www.gnu.org/software/emacs/) |
The following mail arrived at debbugs.gnu.org via bcc, so could not be
associated with a package. I guess it was meant for grep, so I have
reassigned it and am sending this mail which will appear on the grep
mailing list.
sur-behoffski wrote:
> G'day,
>
> I've been working on an "exploratory fork" of GNU Grep, in order
> to provide a framework where I can set up a pared-down program
> that provides support for some simple search cases, but does not
> have the daunting complexity and (to some extent) extensive
> legacy support baggage associated with GNU Grep.
>
> This has resulted in a new project, "hstbm", which is an acronym
> for Hybrid Self-Tuning Boyer-Moore.
>
> The Savannah folks have kindly agreed to host hstbm as a non-GNU
> project at:
>
> https://savannah.nongnu.org/projects/hstbm
>
> The release can be downloaded as a separate tarball from:
>
> http://git.savannah.gnu.org/cgit/hstbm.git/snapshot/hstbm-0.13.tar.gz
>
> Git users probably want to clone the repository instead; the
> latest sources (which may include changes after the 0.13 release)
> can be retrieved via "git clone":
>
> git clone git://git.savannah.nongnu.org/hstbm.git
>
> -------------
>
> hstbm is a partner to the "untangle" script that I've been posting
> to the GNU Grep over time; whereas "untangle" tries to lay the
> foundations for a "bottom-up" refactoring of GNU Grep code, hstbm
> is a "top-down" approach, reducing the layers of code between the
> caller, pattern interpretation (lexing and/or parsing), and the
> underlying search algorithm -- which is also called hstbm.
>
> Only a small fraction of regular expression syntax -- classes --
> is recognised, and the class support is competent but not
> bullet-proof. It's best to think of the current level of pattern
> support as "fgrep (or grep -F) plus some class support". This
> level of support is enough to fulfil the needs of the string
> search algorithm included in this release; more support, such as
> anchors, optional elements, repetition, backreferences etc. are
> not included for now (although there's a nod in one or two places
> that these could included in the future).
>
> -------------
>
> The hstbm algorithm itself is competitive with the equivalent
> algorithm in GNU Grep (culminating in bmexec () in src/kwset.c).
> In my limited testing, choosing data and patterns that are
> somewhat biased towards showing hstbm's strong points, it can
> outperform GNU Grep. I'm hopeful that these improvements can
> hold up over a very wide range of data and patterns, with very
> few or no performance regressions, such that it might be
> worthwhile incorporating hstbm into GNU Grep:
>
> ["repl" is a Lua script to wrap (roughly) the following shell
> script, run it multiple times (throwing the first iteration
> away), and log the results, as well as outputting them:
>
> Cmd=$*
> time (let j=1 \
> while [ $((j++)) -lt 1000 ] ; do \
> $Cmd >/dev/null \
> done \
> ) 2>&1
> ]
>
> --------------
>
> $ grep --version | head 1
> grep (GNU grep) 2.21
> $ ./h --version | head -1
> hstbm (Hybrid Self-Tuning Boyer-Moore) 0.13
>
>
> $ cat /usr/src/linux-3.14.32-gentoo/*/*.[ch] >csrc
> $ wc csrc
> 420778 1348027 11521841 csrc
>
>
> $ ./repl grep -c "123[abc]456" csrc
> Cmd: "grep" "-c" "123[abc]456" "csrc"
> 0
> real 0m7.268s user 0m3.140s sys 0m3.767s
> real 0m7.299s user 0m3.317s sys 0m3.590s
> real 0m7.298s user 0m3.253s sys 0m3.657s
> real 0m7.245s user 0m3.233s sys 0m3.673s
>
> $ ./repl ./h -c "123[abc]456" csrc
> Cmd: "./h" "-c" "123[abc]456" "csrc"
> 0
> real 0m6.570s user 0m2.237s sys 0m1.327s
> real 0m6.586s user 0m2.200s sys 0m1.383s
> real 0m6.647s user 0m2.317s sys 0m1.367s
> real 0m6.598s user 0m2.247s sys 0m1.367s
>
> $ ./repl grep -ci "123[abc]456" csrc
> Cmd: "grep" "-ci" "123[abc]456" "csrc"
> 0
> real 0m7.307s user 0m3.163s sys 0m3.743s
> real 0m7.269s user 0m3.373s sys 0m3.533s
> real 0m7.266s user 0m3.273s sys 0m3.633s
> real 0m7.266s user 0m3.357s sys 0m3.553s
>
> $ ./repl ./h -ci "123[abc]456" csrc
> Cmd: "./h" "-ci" "123[abc]456" "csrc"
> 0
> real 0m6.582s user 0m2.250s sys 0m1.313s
> real 0m6.589s user 0m2.287s sys 0m1.293s
> real 0m6.629s user 0m2.197s sys 0m1.370s
> real 0m6.637s user 0m2.217s sys 0m1.357s
>
> $ ./repl grep -c "123456[abc]" csrc
> Cmd: "grep" "-c" "123456[abc]" "csrc"
> 0
> real 0m6.652s user 0m2.170s sys 0m1.407s
> real 0m6.659s user 0m2.257s sys 0m1.330s
> real 0m6.662s user 0m2.130s sys 0m1.450s
> real 0m6.658s user 0m2.157s sys 0m1.430s
>
> $ ./repl ./h -c "123456[abc]" csrc
> Cmd: "./h" "-c" "123456[abc]" "csrc"
> 0
> real 0m6.584s user 0m2.227s sys 0m1.337s
> real 0m6.546s user 0m2.210s sys 0m1.360s
> real 0m6.546s user 0m2.180s sys 0m1.393s
> real 0m6.544s user 0m2.210s sys 0m1.363s
>
> $ ./repl grep -ci "123456[abc]" csrc
> Cmd: "grep" "-ci" "123456[abc]" "csrc"
> 0
> real 0m6.676s user 0m2.177s sys 0m1.407s
> real 0m6.681s user 0m2.060s sys 0m1.537s
> real 0m6.671s user 0m2.250s sys 0m1.337s
> real 0m6.683s user 0m2.090s sys 0m1.497s
>
> $ ./repl ./h -ci "123456[abc]" csrc
> Cmd: "./h" "-ci" "123456[abc]" "csrc"
> 0
> real 0m6.644s user 0m2.333s sys 0m1.233s
> real 0m6.689s user 0m2.113s sys 0m1.457s
> real 0m6.602s user 0m2.187s sys 0m1.383s
> real 0m6.604s user 0m2.220s sys 0m1.353s
>
> $ ./repl grep -c "123456a" csrc
> Cmd: "grep" "-c" "123456a" "csrc"
> 0
> real 0m12.035s user 0m7.743s sys 0m2.490s
> real 0m12.024s user 0m7.763s sys 0m2.473s
> real 0m12.019s user 0m7.820s sys 0m2.413s
> real 0m12.017s user 0m7.717s sys 0m2.517s
>
> $ ./repl ./h -c "123456a" csrc
> Cmd: "./h" "-c" "123456a" "csrc"
> 0
> real 0m6.266s user 0m1.923s sys 0m1.640s
> real 0m6.273s user 0m1.980s sys 0m1.590s
> real 0m6.252s user 0m2.217s sys 0m1.357s
> real 0m6.254s user 0m2.033s sys 0m1.540s
>
> $ ./repl grep -ci "123456a" csrc
> Cmd: "grep" "-ci" "123456a" "csrc"
> 0
> real 0m15.220s user 0m11.060s sys 0m2.507s
> real 0m15.327s user 0m10.980s sys 0m2.583s
> real 0m15.215s user 0m11.090s sys 0m2.473s
> real 0m15.218s user 0m10.900s sys 0m2.663s
>
> $ ./repl ./h -ci "123456a" csrc
> Cmd: "./h" "-ci" "123456a" "csrc"
> 0
> real 0m6.651s user 0m2.307s sys 0m1.260s
> real 0m6.665s user 0m2.253s sys 0m1.327s
> real 0m6.602s user 0m2.320s sys 0m1.243s
> real 0m6.611s user 0m2.243s sys 0m1.327s
>
>
>
> $ wc candide-8859-1.txt
> 4352 35591 221126 candide-8859-1.txt
>
>
> $ LC_ALL=fr_FR.iso88591 ./repl grep -c "tr[[=e=]]s" candide-8859-1.txt
> Cmd: "grep" "-c" "tr[[=e=]]s" "candide-8859-1.txt"
> 154
> real 0m4.330s user 0m3.240s sys 0m0.290s
> real 0m4.340s user 0m3.273s sys 0m0.263s
> real 0m4.345s user 0m3.213s sys 0m0.323s
> real 0m4.346s user 0m3.213s sys 0m0.327s
>
> $ LC_ALL=fr_FR.iso88591 ./repl ./h -c "tr[[=e=]]s" candide-8859-1.txt
> Cmd: "./h" "-c" "tr[[=e=]]s" "candide-8859-1.txt"
> 154
> real 0m1.376s user 0m0.043s sys 0m0.133s
> real 0m1.382s user 0m0.073s sys 0m0.103s
> real 0m1.383s user 0m0.043s sys 0m0.133s
> real 0m1.386s user 0m0.050s sys 0m0.127s
>
> $ LC_ALL=fr_FR.iso88591 ./repl grep -ci "tr[[=e=]]s" candide-8859-1.txt
> Cmd: "grep" "-ci" "tr[[=e=]]s" "candide-8859-1.txt"
> 155
> real 0m5.704s user 0m3.357s sys 0m0.177s
> real 0m5.683s user 0m3.383s sys 0m0.150s
> real 0m5.686s user 0m3.343s sys 0m0.190s
> real 0m5.690s user 0m3.363s sys 0m0.173s
>
> $ LC_ALL=fr_FR.iso88591 ./repl ./h -ci "tr[[=e=]]s" candide-8859-1.txt
> Cmd: "./h" "-ci" "tr[[=e=]]s" "candide-8859-1.txt"
> 155
> real 0m1.598s user 0m0.043s sys 0m0.137s
> real 0m1.600s user 0m0.060s sys 0m0.117s
> real 0m1.608s user 0m0.040s sys 0m0.140s
> real 0m1.609s user 0m0.070s sys 0m0.110s
>
>
> --------------------
>
>
> While many, many features of GNU Grep are not supported, two switches,
> -J and -K=SKIP_LOOP_EFFICIENCY_THRESHOLD, are included. Here's an
> example of the debug output created when -J is specified:
>
>
>
>
> $ ./h -J -c "tr[[=e=]]s" candide-8859-1.txt
> * show_internals: Compiled pattern: Number of tokens: 4
> OP_CHAR_DIRECT('t') -- 0x74
> OP_CHAR_DIRECT('r') -- 0x72
> OP_CHAR_DIRECT('e') -- 0x65
> OP_CHAR_DIRECT('s') -- 0x73
>
> * debug: hstbm context at 0x20ee000:
> - fold_table (not used):
> ................................ ................................
> ................................ ................................
> ................................ ................................
> ................................ ................................
> - delta1:
> 44444444444444444444444444444444 44444444444444444444444444444444
> 44444444444444444444444444444444 44444144444444444420344444444444
> 44444444444444444444444444444444 44444444444444444444444444444444
> 44444444444444444444444444444444 44444444444444444444444444444444
> - Pattern: t r e s
>
> * Hybrid (memchr/memchr2) efficiency variables:
> - skip_loop_efficiency_threshold: 80
> - nr_aliases: 0 0 0 0
> - first_alias: . . . .
> - memchr_offset_list [negated]: 3 2 1 0
> - current_memchr_offset_index: 0
>
> * Guard (delta2) and match (delta 3/4) skip variables:
> - delta2 candidate(s): 4
> - delta2: 4
> - delta3: 4 4 4
> - delta4: 4 4
>
> * context check... [OK]
> 93
> $ ./h -J -ci "tr[[=e=]]s" candide-8859-1.txt
> * show_internals: Compiled pattern: Number of tokens: 4
> OP_CHAR_CLASS(ref=1) -- members:2
> -- 54 74 Tt
> OP_CHAR_CLASS(ref=2) -- members:2
> -- 52 72 Rr
> OP_CHAR_CLASS(ref=3) -- members:2
> -- 45 65 Ee
> OP_CHAR_CLASS(ref=4) -- members:2
> -- 53 73 Ss
>
> * debug: hstbm context at 0x1a4e000:
> - fold_table:
> ................................ !"#$%&'()*+,-./0123456789:;<=>?
> @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ `abcdEfghijklmnopqRsTuvwxyz{|}~.
> ................................ ................................
> ................................ ................................
> - delta1:
> 44444444444444444444444444444444 44444444444444444444444444444444
> 44444144444444444420344444444444 44444144444444444420344444444444
> 44444444444444444444444444444444 44444444444444444444444444444444
> 44444444444444444444444444444444 44444444444444444444444444444444
> - Pattern: T R E S
>
> * Hybrid (memchr/memchr2) efficiency variables:
> - skip_loop_efficiency_threshold: 80
> - nr_aliases: 1 1 1 1
> - first_alias: t r e s
> - memchr_offset_list [negated]: 3 2 1 0
> - current_memchr_offset_index: 0
>
> * Guard (delta2) and match (delta 3/4) skip variables:
> - delta2 candidate(s): 4
> - delta2: 4
> - delta3: 4 4 4
> - delta4: 4 4
>
> * context check... [OK]
> 93
> $ LC_ALL=fr_FR.iso88591 ./h -J -ci "tr[[=e=]]s" candide-8859-1.txt
> * show_internals: Compiled pattern: Number of tokens: 4
> OP_CHAR_CLASS(ref=1) -- members:2
> -- 54 74 Tt
> OP_CHAR_CLASS(ref=2) -- members:2
> -- 52 72 Rr
> OP_CHAR_CLASS(ref=3) -- members:10
> -- 45 65 c8 c9 ca cb e8 e9 ea eb Eeà à à à èé êë
> OP_CHAR_CLASS(ref=4) -- members:2
> -- 53 73 Ss
>
> * debug: hstbm context at 0x1ae7000:
> - fold_table:
> ................................ !"#$%&'()*+,-./0123456789:;<=>?
> @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ `abcdEfghijklmnopqRsTuvwxyz{|}~.
> ................................
> ¡¢£¤¥¦§¨©ª«¬Â®¯°±²³´µ¶·¸¹º»¼½¾¿
> Ã ÃÃ Ã Ã Ã Ã Ã EEEEÃ ÃÃ ÃÃÃ'Ã'Ã"Ã"Ã Ã Ã Ã Ã Ã Ã Ã ÃÃ Ã Ã
> áâãäåæçEEEEìÃîïðñòóôõö÷øùúûüýþÿ
> - delta1:
> 44444444444444444444444444444444 44444444444444444444444444444444
> 44444144444444444420344444444444 44444144444444444420344444444444
> 44444444444444444444444444444444 44444444444444444444444444444444
> 44444444111144444444444444444444 44444444111144444444444444444444
> - Pattern: T R E S
>
> * Hybrid (memchr/memchr2) efficiency variables:
> - skip_loop_efficiency_threshold: 80
> - nr_aliases: 1 1 9 1
> - first_alias: t r e s
> - memchr_offset_list [negated]: 3 2 0
> - current_memchr_offset_index: 0
>
> * Guard (delta2) and match (delta 3/4) skip variables:
> - delta2 candidate(s): 4
> - delta2: 4
> - delta3: 4 4 4
> - delta4: 4 4
>
> * context check... [OK]
> 155
> $
>
>
>
> --------------------
>
>
>
> WARNING: Because hstbm supports far fewer features than GNU Grep,
> it takes less time to initialise. This will give hstbm an advantage
> when the total input is small (?? perhaps less than 4k bytes?), and
> should be factored out when comparing performance. (Note that both
> csrc and candide-8859-1.txt above are greater than 100k bytes, so
> this effect should not be a large factor in the above results.)
>
>
> $ ./repl grep -c "123[abc]456" /dev/null
> Cmd: "grep" "-c" "123[abc]456" "/dev/null"
> 0
> real 0m0.990s user 0m0.073s sys 0m0.097s
> real 0m1.001s user 0m0.050s sys 0m0.123s
> real 0m0.993s user 0m0.020s sys 0m0.147s
> real 0m0.989s user 0m0.040s sys 0m0.127s
>
> $ ./repl ./h -c "123[abc]456" /dev/null
> Cmd: "./h" "-c" "123[abc]456" "/dev/null"
> 0
> real 0m0.748s user 0m0.047s sys 0m0.103s
> real 0m0.758s user 0m0.057s sys 0m0.093s
> real 0m0.762s user 0m0.037s sys 0m0.113s
> real 0m0.754s user 0m0.050s sys 0m0.100s
>
> $ ./repl grep -ci "123[abc]456" /dev/null
> Cmd: "grep" "-ci" "123[abc]456" "/dev/null"
> 0
> real 0m0.993s user 0m0.043s sys 0m0.127s
> real 0m0.998s user 0m0.057s sys 0m0.110s
> real 0m0.995s user 0m0.037s sys 0m0.130s
> real 0m0.993s user 0m0.053s sys 0m0.110s
>
> $ ./repl ./h -ci "123[abc]456" /dev/null
> Cmd: "./h" "-ci" "123[abc]456" "/dev/null"
> 0
> real 0m0.768s user 0m0.037s sys 0m0.113s
> real 0m0.787s user 0m0.033s sys 0m0.123s
> real 0m0.770s user 0m0.037s sys 0m0.110s
> real 0m0.779s user 0m0.027s sys 0m0.127s
>
>
>
> --------------------
>
>
>
> Have fun hacking, and reading the code... all feedback, comments,
> patches, etc. welcome.
>
> cheers,
>
> sur-behoffski (Brenton Hoff)
> Programmer, Grouse Software
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- bug#19837: Announcing initial release of hstbm, Version 0.13.,
Glenn Morris <=