From: Paul Eggert
Subject: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Sun, 14 Nov 2021 19:17:58 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.1

On 11/14/21 14:25, Carlo Arenas wrote:
> the one in patch6 where a uint32_t option is doubled, triggers
> warnings because of comparing an unsigned variable with 0 AFAIK, but
> there are several of those in the upstream gnulib so presumably not a
> concern?

Yes, that's right. intprops.h can generate tons of bogus warnings with
older or non-GCC compilers. We typically don't worry about those
warnings. Recent GCC should be OK here.

> using idx_t instead of size_t should be fine (if only halves the max
> size of the objects managed), but I am concerned that assuming
> PCRE2_SIZE_MAX is always equivalent to SIZE_MAX (as done in patch 4)
> might be risky (at least without a comment), and considering that is
> part of the API anyway might be better if kept as PCRE2_SIZE_MAX IMHO.

This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for
forward compatibility to a potential future version of PCRE2 that may
define PCRE2_SIZE to be some other type. For PCRE2 10.20 and earlier
PCRE2_SIZE is hardwired to size_t, so there is only one plausible
default for PCRE2_SIZE_MAX, namely SIZE_MAX.
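
The size relationship mentioned above (idx_t halving the maximum object size relative to size_t) can be checked numerically. This is only a sketch: it assumes Gnulib's idx_t is a signed type with the same width as size_t, and it uses Python's ctypes.c_ssize_t as a stand-in for that type, which is an assumption rather than anything taken from the patches.

```python
import ctypes

# Assumption: ssize_t stands in for the signed type behind Gnulib's idx_t;
# on mainstream platforms it has the same width as size_t.
size_bits = 8 * ctypes.sizeof(ctypes.c_size_t)
idx_bits = 8 * ctypes.sizeof(ctypes.c_ssize_t)

SIZE_MAX = 2**size_bits - 1       # largest value of the unsigned type
IDX_MAX = 2**(idx_bits - 1) - 1   # largest value of the signed type

# The signed maximum is roughly half the unsigned one.
print(SIZE_MAX, IDX_MAX)
```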

> As I mentioned before, PCRE matches the Perl definition as mentioned
> before in an early draft that also had this change reversed.

I see that PCRE2 documents that PCRE2_EXTRA_MATCH_WORD surrounds the
pattern with "\b(?:" and ")\b". However, this is bogus: it doesn't
correspond to the intuitive meaning of "match words", and it doesn't
correspond to how grep -w behaves for any grep that I know of.
Which "early draft" are you talking about? This appears to be merely a
bug in libpcre2's documentation and implementation.

> I would suggest instead that -P should also follow perl convention
> instead when used together with -w, but maybe that is something that a
> -P feature flag could enable or disable as needed?

I can't imagine anybody intuitively saying in an English locale that
"%%" is a word in the string "aa%%aa". PCRE2 is broken, that's all. If a
user really wants PCRE2's buggy interpretation, they can simply surround
their regexp with "\b(?:" and ")\b" and not use -w; so there's no need
to have a different flag for pcre2grep's bizarre interpretation of -w.
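
To make the wrapping concrete, here is a minimal sketch of the documented PCRE2_EXTRA_MATCH_WORD behavior. It uses Python's re module as a stand-in for PCRE2, on the assumption that \b behaves the same way for these ASCII inputs; pcre2_match_word is a hypothetical helper name, not part of any library.

```python
import re

def pcre2_match_word(pattern):
    """Hypothetical helper: wrap a pattern the way PCRE2 documents
    PCRE2_EXTRA_MATCH_WORD, i.e. with "\\b(?:" and ")\\b"."""
    return r"\b(?:" + pattern + r")\b"

wrapped = pcre2_match_word("%%")

# "%%" sits between word characters, so both \b assertions succeed.
hit_inside = re.search(wrapped, "aa%%aa")
# A line that is just "%%" has no word character next to it, so \b fails.
hit_alone = re.search(wrapped, "%%")

print(wrapped, bool(hit_inside), bool(hit_alone))
```

Under this interpretation "%%" counts as a word only when it is glued to word characters on both sides, which is the opposite of what one would expect from a word match.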
Here's another reason why pcre2grep -w is obviously busted:
$ pcre2grep -w ',' <<'EOF'
> a,a
> a, a
> a,
> EOF
a,a
Why is "," a word in the first input line, but not in the second or
third? pcre2grep is simply wrong here.
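
The comma example can be reproduced without pcre2grep. The sketch below again uses Python's re module as a stand-in (an assumption), and compares the "\b(?:" ... ")\b" wrapping against a rough lookaround model of the usual grep -w rule, namely that the match must not be directly preceded or followed by a word character. The lookaround pattern is only an approximation for illustration, not grep's actual implementation.

```python
import re

lines = ["a,a", "a, a", "a,"]

# PCRE2-style -w: the pattern wrapped in \b(?:...)\b.
pcre2_style = re.compile(r"\b(?:,)\b")
# Rough model of grep -w: no word character directly before or after.
grep_style = re.compile(r"(?<!\w)(?:,)(?!\w)")

pcre2_hits = [s for s in lines if pcre2_style.search(s)]
grep_hits = [s for s in lines if grep_style.search(s)]

print(pcre2_hits)
print(grep_hits)
```

Only the first line satisfies the \b wrapping, matching the pcre2grep output shown above, while the lookaround model selects none of the three lines because the comma always directly follows "a".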

> Note that "word" definition also has a different meaning in a post
> Unicode world

Yes, but that's an independent issue.