[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH 1/2] dfa: convert to wide character line-by-line
From: |
Jim Meyering |
Subject: |
Re: [PATCH 1/2] dfa: convert to wide character line-by-line |
Date: |
Wed, 05 May 2010 09:56:34 +0200 |
Paolo Bonzini wrote:
> This provides a nice speedup for -m in general, but especially
> it avoids quadratic complexity in case we have to go to glibc.
>
> * NEWS: Document change.
> * src/dfa.c (prepare_wc_buf): Extract out of dfaexec. Convert
> only up to the next newline.
> (dfaexec): Exit multibyte processing loop if past buf_end.
> Call prepare_wc_buf again after processing a newline.
Nice indeed, but it induces an abort from glibc with some inputs:
This is on F13:
$ export LC_ALL=fr_FR.UTF-8
$ printf '\xc3\n' > in; src/grep '[é]' in
*** buffer overflow detected ***: src/grep terminated
======= Backtrace: =======
/lib64/libc.so.6(__fortify_fail+0x37)[0x33920fedb7]
/lib64/libc.so.6[0x33920fcd30]
/lib64/libc.so.6(__strncpy_chk+0x17b)[0x33920fbfeb]
...
zsh: abort (core dumped) src/grep '[é]' in
> +/* Initialize mblen_buf and inputwcs with data from the next line. */
> +
> +static void
> +prepare_wc_buf (const char *begin, const char *end)
> +{
> + unsigned char eol = eolbyte;
> + size_t remain_bytes, i;
Here's the quick summary:
"remain_bytes" must not be of type "size_t".
Please leave it as "int", like it was in the code you moved.
That solves the problem. Details, as I debugged it...
> + buf_begin = (unsigned char *) begin;
> +
> + remain_bytes = 0;
> + for (i = 0; i < end - begin + 1; i++)
> + {
> + if (remain_bytes == 0)
> + {
> + remain_bytes
> + = mbrtowc(inputwcs + i, begin + i, end - begin - i + 1, &mbs);
The trouble is here, in this moved code.
Here's what happens: with invalid-multibyte input, mbrtowc returns
(size_t)-1 or (size_t)-2, which makes us take the inner "else" branch,
and that code fails to define inputwcs[i]. Later, that undefined byte,
inputwcs[0], is referenced by this code,
#define SKIP_REMAINS_MB_IF_INITIAL_STATE(s, p) \
if (s == 0) \
{ \
while (inputwcs[p - buf_begin] == 0 \
&& mblen_buf[p - buf_begin] > 0 \
&& (unsigned char const *) p < buf_end) \
++p; \
But that's only a relatively harmless read-uninitialized.
> + if (remain_bytes < 1
> + || (remain_bytes == 1 && inputwcs[i] == (wchar_t)begin[i]))
> + {
> + remain_bytes = 0;
> + inputwcs[i] = (wchar_t)begin[i];
> + mblen_buf[i] = 0;
> + if (begin[i] == eol)
> + break;
> + }
> + else
> + {
> + mblen_buf[i] = remain_bytes;
> + remain_bytes--;
> + }
> + }
> + else
> + {
> + mblen_buf[i] = remain_bytes;
> + inputwcs[i] = 0;
> + remain_bytes--;
> + }
Later the real problem arises because we've stored that huge unsigned
value, (size_t)-1, into mblen_buf[0].
In match_mb_charset, this happens:
match_len = (mblen_buf[idx] == 0)? 1 : mblen_buf[idx];
...
strncpy(buffer, (char const *) buf_begin + idx, match_len);
buffer[match_len] = '\0';
with idx==0, which leads to match_len = -858993460.
That triggers an abort in the fortify-augmented strncpy,
or sometimes a segfault on the next line.
----------------------------------------------------------
The real problem is that in moving that block of code,
you also changed it:
> - int remain_bytes, i;
> + size_t remain_bytes, i;
While it's nice to make "i" be unsigned, remain_bytes
must remain signed.
The moral of the story:
When moving code, minimize ancillary/cleanup changes as much as possible.
Put them in separate patches where it's much easier to see what's changing.
Another bit of advice that might have helped avoid this:
Do not group multiple variable declarations on the same line.
Re: [PATCH 1/2] dfa: convert to wide character line-by-line,
Jim Meyering <=