Re: [PATCH 1/2] dfa: convert to wide character line-by-line

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 1/2] dfa: convert to wide character line-by-line

From:	Jim Meyering
Subject:	Re: [PATCH 1/2] dfa: convert to wide character line-by-line
Date:	Wed, 05 May 2010 09:56:34 +0200

Paolo Bonzini wrote:
> This provides a nice speedup for -m in general, but especially
> it avoids quadratic complexity in case we have to go to glibc.
>
> * NEWS: Document change.
> * src/dfa.c (prepare_wc_buf): Extract out of dfaexec.  Convert
> only up to the next newline.
> (dfaexec): Exit multibyte processing loop if past buf_end.
> Call prepare_wc_buf again after processing a newline.

Nice indeed, but it induces an abort from glibc with some inputs:
This is on F13:

    $ export LC_ALL=fr_FR.UTF-8
    $ printf '\xc3\n' > in; src/grep '[é]' in
    *** buffer overflow detected ***: src/grep terminated
    ======= Backtrace: =======
    /lib64/libc.so.6(__fortify_fail+0x37)[0x33920fedb7]
    /lib64/libc.so.6[0x33920fcd30]
    /lib64/libc.so.6(__strncpy_chk+0x17b)[0x33920fbfeb]
    ...
    zsh: abort (core dumped)  src/grep '[é]' in

> +/* Initialize mblen_buf and inputwcs with data from the next line.  */
> +
> +static void
> +prepare_wc_buf (const char *begin, const char *end)
> +{
> +  unsigned char eol = eolbyte;
> +  size_t remain_bytes, i;

Here's the quick summary:

  "remain_bytes" must not be of type "size_t".
  Please leave it as "int", like it was in the code you moved.

That solves the problem.  Details, as I debugged it...

> +  buf_begin = (unsigned char *) begin;
> +
> +  remain_bytes = 0;
> +  for (i = 0; i < end - begin + 1; i++)
> +    {
> +      if (remain_bytes == 0)
> +        {
> +          remain_bytes
> +            = mbrtowc(inputwcs + i, begin + i, end - begin - i + 1, &mbs);

The trouble is here, in this moved code.

Here's what happens: with invalid-multibyte input, mbrtowc returns
(size_t)-1 or (size_t)-2, which makes us take the inner "else" branch,
and that code fails to define inputwcs[i].  Later, that undefined byte,
inputwcs[0], is referenced by this code,

#define SKIP_REMAINS_MB_IF_INITIAL_STATE(s, p)          \
  if (s == 0)                                           \
    {                                                   \
      while (inputwcs[p - buf_begin] == 0               \
            && mblen_buf[p - buf_begin] > 0             \
            && (unsigned char const *) p < buf_end)     \
        ++p;                                            \

But that's only a relatively harmless read-uninitialized.

> +          if (remain_bytes < 1
> +              || (remain_bytes == 1 && inputwcs[i] == (wchar_t)begin[i]))
> +            {
> +              remain_bytes = 0;
> +              inputwcs[i] = (wchar_t)begin[i];
> +              mblen_buf[i] = 0;
> +              if (begin[i] == eol)
> +                break;
> +            }
> +          else
> +            {
> +              mblen_buf[i] = remain_bytes;
> +              remain_bytes--;
> +            }
> +        }
> +      else
> +        {
> +          mblen_buf[i] = remain_bytes;
> +          inputwcs[i] = 0;
> +          remain_bytes--;
> +        }

Later the real problem arises because we've stored that huge unsigned
value, (size_t)-1, into mblen_buf[0].

In match_mb_charset, this happens:

    match_len = (mblen_buf[idx] == 0)? 1 : mblen_buf[idx];
    ...
    strncpy(buffer, (char const *) buf_begin + idx, match_len);
    buffer[match_len] = '\0';

with idx==0, which leads to match_len = -858993460.
That triggers an abort in the fortify-augmented strncpy,
or sometimes a segfault on the next line.

----------------------------------------------------------

The real problem is that in moving that block of code,
you also changed it:

  > -  int remain_bytes, i;
  > +  size_t remain_bytes, i;

While it's nice to make "i" be unsigned, remain_bytes
must remain signed.

The moral of the story:

  When moving code, minimize ancillary/cleanup changes as much as possible.
  Put them in separate patches where it's much easier to see what's changing.

Another bit of advice that might have helped avoid this:

  Do not group multiple variable declarations on the same line.

[Prev in Thread]

Current Thread

[Next in Thread]

[PATCH 1/2] dfa: convert to wide character line-by-line, Paolo Bonzini, 2010/05/04
- [PATCH 2/2] tests: add test for newly-fixed performance problem, Paolo Bonzini, 2010/05/04
  - Re: [PATCH 2/2] tests: add test for newly-fixed performance problem, Jim Meyering, 2010/05/04
    - Re: [PATCH 2/2] tests: add test for newly-fixed performance problem, Paolo Bonzini, 2010/05/05
    - Re: [PATCH 2/2] tests: add test for newly-fixed performance problem, Jim Meyering, 2010/05/05
    - Re: [PATCH 2/2] tests: add test for newly-fixed performance problem, Aharon Robbins, 2010/05/05
- Re: [PATCH 1/2] dfa: convert to wide character line-by-line, Jim Meyering <=
  - Re: [PATCH 1/2] dfa: convert to wide character line-by-line, Jim Meyering, 2010/05/05
    - Re: [PATCH 1/2] dfa: convert to wide character line-by-line, Jim Meyering, 2010/05/05
    - Re: [PATCH 1/2] dfa: convert to wide character line-by-line, Paolo Bonzini, 2010/05/05

Prev by Date: Re: [PATCH 2/2] tests: add test for newly-fixed performance problem
Next by Date: Re: [PATCH 1/2] dfa: convert to wide character line-by-line
Previous by thread: Re: [PATCH 2/2] tests: add test for newly-fixed performance problem
Next by thread: Re: [PATCH 1/2] dfa: convert to wide character line-by-line
Index(es):
- Date
- Thread