bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 4/9] dfa: speed up handling of brackets


From: Jim Meyering
Subject: Re: [PATCH 4/9] dfa: speed up handling of brackets
Date: Wed, 17 Mar 2010 14:36:48 +0100

Paolo Bonzini wrote:
> This patch has two sides.  One is to fold the parsing of brackets in the
> single- and multi-byte cases.  The second is to leverage this change,
> and use a bitset to test for single-byte characters in the charset.
> Splitting the two would be very hard.
>
> Testcase:
>    yes 'the quick brown fox jumps over the lazy dog' | sed 100000q | \
>      time grep -c [ABCDEFGHIJKLMNOPQRSTUVWXYZ,]
>
> Before: 59ms (best of three runs); after: 51ms (best of three runs).
> Nice, but mostly providing infrastructure for the next patch.
...

Nice indeed.
A few nits:

...
> +  { "blank", is_blank },
> +  { 0, 0 }
> +};

Please use "NULL" for each of those, not 0.
That will avoid from warnings from a tool or two.

> -#ifdef MBS_SUPPORT

Please drop the !__STDC__ part.
It stopped being useful about 10 years ago.

> +#ifdef __STDC__
> +#define FUNC(F, P) static int F(int c) { return P(c); }
> +#else
> +#define FUNC(F, P) static int F(c) int c; { return P(c); }
> +#endif
...
> +static predicate *
> +find_pred (const char *str)
> +{
> +  int i;

s/int/unsigned int/

Please use "unsigned int" not just "unsigned".
I recall making a point of keeping the "int" years ago, e.g.,
in coreutils, perhaps to placate compilers that
could warn about implicit "int".  Besides, spelling it
"unsigned int" seems slightly clearer than relying on the
reader to extrapolate the "int", and makes it far easier
to automate searches for "int"-related variables.

> +  for (i = 0; prednames[i].name; ++i)
> +    if (!strcmp(str, prednames[i].name))
> +      break;
> +
> +  return prednames[i].pred;
> +}
> +
>  /* Multibyte character handling sub-routine for lex.
>     This function  parse a bracket expression and build a struct
>     mb_char_classes.  */
>  static token
> -parse_bracket_exp_mb (void)
> +parse_bracket_exp (void)
>  {
> +  int invert;
> +  int c, c1, c2;
> +  charclass ccl;
> +
> +#ifdef MBS_SUPPORT
>    wint_t wc, wc1, wc2;
>
>    /* Work area to build a mb_char_classes.  */
> @@ -367,63 +453,68 @@ parse_bracket_exp_mb (void)
>    int chars_al, range_sts_al, range_ends_al, ch_classes_al,
>      equivs_al, coll_elems_al;
>
> -  REALLOC_IF_NECESSARY(dfa->mbcsets, struct mb_char_classes,
> -                    dfa->mbcsets_alloc, dfa->nmbcsets + 1);
> -  /* dfa->multibyte_prop[] hold the index of dfa->mbcsets.
> -     We will update dfa->multibyte_prop[] in addtok(), because we can't
> -     decide the index in dfa->tokens[].  */
> -
> -  /* Initialize work are */
> -  work_mbc = &(dfa->mbcsets[dfa->nmbcsets++]);
> -
>    chars_al = 1;
>    range_sts_al = range_ends_al = 0;
>    ch_classes_al = equivs_al = coll_elems_al = 0;
> +  if (MB_CUR_MAX > 1)
> +    {
> +      REALLOC_IF_NECESSARY(dfa->mbcsets, struct mb_char_classes,
> +                           dfa->mbcsets_alloc, dfa->nmbcsets + 1);
> +
> +      /* dfa->multibyte_prop[] hold the index of dfa->mbcsets.
> +         We will update dfa->multibyte_prop[] in addtok(), because we can't
> +         decide the index in dfa->tokens[].  */
> +
> +      /* Initialize work area.  */
> +      work_mbc = &(dfa->mbcsets[dfa->nmbcsets++]);
> +      work_mbc->nchars = work_mbc->nranges = work_mbc->nch_classes = 0;
> +      work_mbc->nequivs = work_mbc->ncoll_elems = 0;

I was going to write "Please don't put multiple initializations on
the same line." but then saw this was merely an indentation change,
so never mind.

> +      work_mbc->chars = NULL;
> +      work_mbc->ch_classes = NULL;
> +      work_mbc->range_sts = work_mbc->range_ends = NULL;
> +      work_mbc->equivs = work_mbc->coll_elems = NULL;
> +    }
> +  else
> +    work_mbc = NULL;
> +#endif

ACK!
Finally.  Thanks!




reply via email to

[Prev in Thread] Current Thread [Next in Thread]