bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b

bug-grep

From:	Paul Eggert
Subject:	bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P
Date:	Mon, 9 Jan 2023 15:12:23 -0800
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.6.0

On 1/9/23 11:51, Ævar Arnfjörð Bjarmason wrote:

        /b:
        155781
        (*UCP)/b:
        46035
        /s:
        0
        (*UCP)/s:
        0
        /w:
        142468
        (*UCP)/w:
        9706

So the output still differs, and some of those differences may or may
not be wanted.

I took a look at the output, and by and large I'd want the differences; that is, I'd want the UCP version, which generates less output. This is because several Emacs source files are not UTF-8, and \b has nonsense matches when searching text files encoded via Shift-JIS or Big 5 or whatever. For this sort of thing, the fewer matches the better.
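
As a rough illustration of the kind of nonsense match meant here, the following sketch uses Python's re module as a stand-in; it is not grep or PCRE2, and the sample string is an assumption picked for the demo. It counts \b positions in the raw Shift-JIS bytes of a short katakana word, and then in the decoded text.

```python
import re

# Hypothetical sample: three katakana letters forming a single word.
text = "アイウ"
raw = text.encode("shift_jis")

# Byte patterns in Python's re treat only ASCII letters, digits and "_"
# as word characters, so a Shift-JIS trail byte that happens to fall in
# the ASCII letter range creates word boundaries inside a character.
byte_boundaries = len(re.findall(rb"\b", raw))

# Str patterns use Unicode properties, so the word is seen as a whole.
char_boundaries = len(re.findall(r"\b", text))

print(raw)
print("byte-level boundaries:", byte_boundaries)
print("character-level boundaries:", char_boundaries)
```

The byte-level count comes out higher than the character-level count, because the extra boundaries fall in the middle of multibyte characters.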

If all you're doing is matching either ASCII or Japanese text and you
want "locale-aware numbers" it might do the wrong thing.

I'm not seeing much of a problem here. When searching Japanese text, I would expect \d and [0-9０-９] (using both ASCII and full-width digits) to be equivalent so (assuming UCP) it's not a big deal as to which regex you use, since Japanese text won't contain Bengali (or whatever) digits. And when searching binary data, I'd expect a bunch of garbage no matter how \d is interpreted.

Here I'm assuming [０-９] (using full-width digits) has the expected meaning in PCRE2, i.e., that PCRE2 didn't make the same mistake that POSIX made.
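
A small sketch of the equivalence argued above, again using Python's re module as an analogy and not PCRE2 itself. In Python, \d in a str pattern matches any Unicode decimal digit, which is roughly what (*UCP) does for \d, and re.ASCII restricts it to 0-9. Both sample strings are assumptions made up for the demo.

```python
import re

# Hypothetical samples: Japanese text with ASCII and full-width digits,
# and the same text with two Bengali digits appended.
japanese = "電話 03-１２３４"
mixed = japanese + " ৫৬"

explicit = r"[0-9０-９]"  # ASCII digits plus full-width digits, as code point ranges

# On Japanese-only text the two regexes pick out the same digits.
print(re.findall(r"\d", japanese) == re.findall(explicit, japanese))

# Bengali digits match \d but not the explicit class.
print(re.findall(r"\d", mixed))
print(re.findall(explicit, mixed))

# With re.ASCII, \d is limited to 0-9.
print(re.findall(r"\d", mixed, flags=re.ASCII))
```

Python compares character-class ranges by code point, so ０-９ covers exactly the ten full-width digits here; whether PCRE2 treats the range the same way is the assumption stated above.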
