bug#47264: [PATCH] pcre: migrate to pcre2

bug-grep

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#47264: [PATCH] pcre: migrate to pcre2

From:	Carlo Arenas
Subject:	bug#47264: [PATCH] pcre: migrate to pcre2
Date:	Mon, 8 Nov 2021 14:28:11 -0800

On Mon, Nov 8, 2021 at 11:53 AM Paul Eggert <eggert@cs.ucla.edu> wrote:
>
> On 11/8/21 01:47, Carlo Arenas wrote:
> > On Sun, Nov 7, 2021 at 4:30 PM Paul Eggert <eggert@cs.ucla.edu> wrote:
>
> > Let me know how to help otherwise.
>
> The main thing from my point of view is that I'd like to know what those
> other bugs are. I can split out the patch into pieces if I know what to
> look for.

The way pcre_jit_stack_alloc is called is incorrect and could leak if
new_size is not enough and PCRE_ERROR_JIT_STACKLIMIT is triggered
again.  The new code calls it only once and sets the maximum to the
same maximum the old code would do after looping.

It doesn't handle silent failures from pcre_study() from old versions,
where JIT might need to be disabled because no RWX page could be
allocated (ex: SELinux enabled and enforcing).

This last one might be better to fix in PCRE2 as it is also version
dependent and behaves differently between PCRE and PCRE2 and the PCRE
version hasn't been needed and won't be needed after the migration,
though.

> > I didn't want anyone hitting on those old PCRE2 bugs though with this
> > first release, hence why the configure rule is there for now (even if
> > I am likely going to remove it for the next version)
>
> If it's a PCRE2 bug we can ask people to fix it in their PCRE2 library.

agree, but it might be worth adding some functionality to allow
workarounds (ex: disabling JIT, which is currently enabled
unconditionally)

> Possibly we should continue to support PCRE1 as a configure-time option;
> that would assuage concerns about bugs in PCRE2. More work for us, though.

We did that in git for a while, but I wouldn't do it now, considering
that PCRE1 is no longer maintained and is arguably more buggy than
PCRE2.

Chances are there is a bugfix available in PCRE2 already that needs
backporting if upgrading their version is not possible.

> > \C is supported with -P in the PCRE version now though, is removing that ok?
>
> I guess I don't see the harm of supporting \C; why disable it?

It is not very useful and interacts badly with the state of UTF-8
matching as it will leave with a partial codeset, which is then
incorrect UTF-8 (if lucky) or could be interpreted incorrectly.

my assumption was that it wasn't disabled by accident, since old PCRE
also default to non UTF-8 by default at compile time, unlike what was
done in PCRE2.

> >> If memory serves grep currently takes care to not pass invalid UTF-8 in
> >> the buffer or pattern. Does PCRE2_MATCH_INVALID_UTF make this work 
> >> obsolete?
> >
> > not sure I understand what you mean
>
> I guess I was thinking about an older grep version.
>
> Currently grep compiles with PCRE_UTF8 and checks for PCRE_ERROR_BADUTF8
> returns from pcre_exec, so it's relying on the count of bytes that this
> pcre_exec returns in sub[0] before calling pcre_exec with
> PCRE_NO_UTF8_CHECK. So, effectively it's using pcre_exec to check that a
> buffer contains valid UTF-8.

note that it does its own skipping first, but could rely on PCRE2 and
PCRE2_MATCH_INVALID_UTF instead.

> I don't see how this works with the proposed patch. It uses sub[0] but I
> don't see how it's set. What am I missing?

a fix that was on my next version, but I got stuck in the refactoring,
should be (from master) :

-          int valid_bytes = sub[0];
+          PCRE2_SIZE valid_bytes = sub[0] = pcre2_get_startchar (pc->data);

it is set by the pcre2_match() call (which is equivalent to
pcre-exec), but setting it on bad UTF-8 (which is only reported in
versions that don't have PCRE2_MATCH_INVALID_UTF

> One more thing I just noticed: this test:
>
>    if (PCRE2_ERROR_UTF8_ERR1 <= e || e < PCRE2_ERROR_UTF8_ERR21)
>
> is logically equivalent to the following (which is clearer to me):
>
>    if (! (PCRE2_ERROR_UTF8_ERR21 <= e && e < PCRE2_ERROR_UTF8_ERR1))
>
> Shouldn't that be the following instead?
>
>    if (! (PCRE2_ERROR_UTF8_ERR21 <= e && e <= PCRE2_ERROR_UTF8_ERR1))

Makes sense, will include and add a comment to explain better the translation.

Carlo

[1] https://www.pcre.org/original/doc/html/pcre_jit_stack_alloc.html

[Prev in Thread]

Current Thread

[Next in Thread]

bug#47264: [PATCH] pcre: migrate to pcre2, Carlo Marcelo Arenas Belón, 2021/11/07
- bug#47264: [PATCH] pcre: migrate to pcre2, Paul Eggert, 2021/11/07
  - bug#47264: [PATCH] pcre: migrate to pcre2, Jim Meyering, 2021/11/07
  - bug#47264: [PATCH] pcre: migrate to pcre2, Carlo Arenas, 2021/11/08
    - bug#47264: [PATCH] pcre: migrate to pcre2, Paul Eggert, 2021/11/08
    - bug#47264: [PATCH] pcre: migrate to pcre2, Carlo Arenas <=

Prev by Date: bug#51458: grep PCRE - '^' and '$' are not recognized as begin and end of line for multiline strings
Next by Date: bug#51698: surrogate-pair test fails under Cygwin
Previous by thread: bug#47264: [PATCH] pcre: migrate to pcre2
Next by thread: bug#51458: grep PCRE - "^" and "$" are not recognized as begin and end of line for multiline strings
Index(es):
- Date
- Thread