Re: [PATCH 4/9] dfa: speed up handling of brackets

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 4/9] dfa: speed up handling of brackets

From:	Jim Meyering
Subject:	Re: [PATCH 4/9] dfa: speed up handling of brackets
Date:	Wed, 17 Mar 2010 14:36:48 +0100

Paolo Bonzini wrote:
> This patch has two sides.  One is to fold the parsing of brackets in the
> single- and multi-byte cases.  The second is to leverage this change,
> and use a bitset to test for single-byte characters in the charset.
> Splitting the two would be very hard.
>
> Testcase:
>    yes 'the quick brown fox jumps over the lazy dog' | sed 100000q | \
>      time grep -c [ABCDEFGHIJKLMNOPQRSTUVWXYZ,]
>
> Before: 59ms (best of three runs); after: 51ms (best of three runs).
> Nice, but mostly providing infrastructure for the next patch.
...

Nice indeed.
A few nits:

...
> +  { "blank", is_blank },
> +  { 0, 0 }
> +};

Please use "NULL" for each of those, not 0.
That will avoid from warnings from a tool or two.

> -#ifdef MBS_SUPPORT

Please drop the !__STDC__ part.
It stopped being useful about 10 years ago.

> +#ifdef __STDC__
> +#define FUNC(F, P) static int F(int c) { return P(c); }
> +#else
> +#define FUNC(F, P) static int F(c) int c; { return P(c); }
> +#endif
...
> +static predicate *
> +find_pred (const char *str)
> +{
> +  int i;

s/int/unsigned int/

Please use "unsigned int" not just "unsigned".
I recall making a point of keeping the "int" years ago, e.g.,
in coreutils, perhaps to placate compilers that
could warn about implicit "int".  Besides, spelling it
"unsigned int" seems slightly clearer than relying on the
reader to extrapolate the "int", and makes it far easier
to automate searches for "int"-related variables.

> +  for (i = 0; prednames[i].name; ++i)
> +    if (!strcmp(str, prednames[i].name))
> +      break;
> +
> +  return prednames[i].pred;
> +}
> +
>  /* Multibyte character handling sub-routine for lex.
>     This function  parse a bracket expression and build a struct
>     mb_char_classes.  */
>  static token
> -parse_bracket_exp_mb (void)
> +parse_bracket_exp (void)
>  {
> +  int invert;
> +  int c, c1, c2;
> +  charclass ccl;
> +
> +#ifdef MBS_SUPPORT
>    wint_t wc, wc1, wc2;
>
>    /* Work area to build a mb_char_classes.  */
> @@ -367,63 +453,68 @@ parse_bracket_exp_mb (void)
>    int chars_al, range_sts_al, range_ends_al, ch_classes_al,
>      equivs_al, coll_elems_al;
>
> -  REALLOC_IF_NECESSARY(dfa->mbcsets, struct mb_char_classes,
> -                    dfa->mbcsets_alloc, dfa->nmbcsets + 1);
> -  /* dfa->multibyte_prop[] hold the index of dfa->mbcsets.
> -     We will update dfa->multibyte_prop[] in addtok(), because we can't
> -     decide the index in dfa->tokens[].  */
> -
> -  /* Initialize work are */
> -  work_mbc = &(dfa->mbcsets[dfa->nmbcsets++]);
> -
>    chars_al = 1;
>    range_sts_al = range_ends_al = 0;
>    ch_classes_al = equivs_al = coll_elems_al = 0;
> +  if (MB_CUR_MAX > 1)
> +    {
> +      REALLOC_IF_NECESSARY(dfa->mbcsets, struct mb_char_classes,
> +                           dfa->mbcsets_alloc, dfa->nmbcsets + 1);
> +
> +      /* dfa->multibyte_prop[] hold the index of dfa->mbcsets.
> +         We will update dfa->multibyte_prop[] in addtok(), because we can't
> +         decide the index in dfa->tokens[].  */
> +
> +      /* Initialize work area.  */
> +      work_mbc = &(dfa->mbcsets[dfa->nmbcsets++]);
> +      work_mbc->nchars = work_mbc->nranges = work_mbc->nch_classes = 0;
> +      work_mbc->nequivs = work_mbc->ncoll_elems = 0;

I was going to write "Please don't put multiple initializations on
the same line." but then saw this was merely an indentation change,
so never mind.

> +      work_mbc->chars = NULL;
> +      work_mbc->ch_classes = NULL;
> +      work_mbc->range_sts = work_mbc->range_ends = NULL;
> +      work_mbc->equivs = work_mbc->coll_elems = NULL;
> +    }
> +  else
> +    work_mbc = NULL;
> +#endif

ACK!
Finally.  Thanks!

[Prev in Thread]

Current Thread

[Next in Thread]

[PATCH 2/9] dfa: fix handling of ranges in multibyte character sets, (continued)
- [PATCH 2/9] dfa: fix handling of ranges in multibyte character sets, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 2/9] dfa: fix handling of ranges in multibyte character sets, Jim Meyering, 2010/03/15
    - Re: [PATCH 2/9] dfa: fix handling of ranges in multibyte character sets, Paolo Bonzini, 2010/03/15
    - Re: [PATCH 2/9] dfa: fix handling of ranges in multibyte character sets, Jim Meyering, 2010/03/15
    - Re: [PATCH 2/9] dfa: fix handling of ranges in multibyte character sets, Paolo Bonzini, 2010/03/15
    - Re: [PATCH 2/9] dfa: fix handling of ranges in multibyte character sets, Jim Meyering, 2010/03/15
- [PATCH 3/9] dfa: rewrite handling of multibyte case_fold lexing, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 3/9] dfa: rewrite handling of multibyte case_fold lexing, Jim Meyering, 2010/03/16
    - Re: [PATCH 3/9] dfa: rewrite handling of multibyte case_fold lexing, Paolo Bonzini, 2010/03/17
- [PATCH 4/9] dfa: speed up handling of brackets, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 4/9] dfa: speed up handling of brackets, Jim Meyering <=
    - Re: [PATCH 4/9] dfa: speed up handling of brackets, Paolo Bonzini, 2010/03/17
    - Re: [PATCH 4/9] dfa: speed up handling of brackets, Jim Meyering, 2010/03/17
- [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets, Jim Meyering, 2010/03/17
    - Re: [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets, Paolo Bonzini, 2010/03/17
- [PATCH 7/9] dfa: run simple UTF-8 regexps as a single-byte character set, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 7/9] dfa: run simple UTF-8 regexps as a single-byte character set, Jim Meyering, 2010/03/15
- [PATCH 6/9] dfa: cache MB_CUR_MAX for dfaexec, Paolo Bonzini, 2010/03/14
  - Re: [PATCH 6/9] dfa: cache MB_CUR_MAX for dfaexec, Jim Meyering, 2010/03/17
    - Re: [PATCH 6/9] dfa: cache MB_CUR_MAX for dfaexec, Paolo Bonzini, 2010/03/17

Prev by Date: Re: [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets
Next by Date: [PATCH] dfa: spell out "unsigned int"
Previous by thread: [PATCH 4/9] dfa: speed up handling of brackets
Next by thread: Re: [PATCH 4/9] dfa: speed up handling of brackets
Index(es):
- Date
- Thread