bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b


From: Paul Eggert
Subject: bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P
Date: Mon, 9 Jan 2023 15:12:23 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.6.0

On 1/9/23 11:51, Ævar Arnfjörð Bjarmason wrote:

>         /b:
>         155781
>         (*UCP)/b:
>         46035
>         /s:
>         0
>         (*UCP)/s:
>         0
>         /w:
>         142468
>         (*UCP)/w:
>         9706
>
> So the output still differs, and some of those differences may or may
> not be wanted.

I took a look at the output, and by and large I'd want the differences; that is, I'd want the UCP version, which generates less output. This is because several Emacs source files are not UTF-8, and \b has nonsense matches when searching text files encoded via Shift-JIS or Big 5 or whatever. For this sort of thing, the fewer matches the better.
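
To illustrate the kind of nonsense match meant here, below is a minimal sketch using Python's `re` module as a stand-in; it is not PCRE2 or grep itself, and the helper names are hypothetical. It relies on one fact about Shift-JIS: the katakana "ア" is encoded as the bytes 0x83 0x41, and 0x41 is also ASCII "A". A matcher that treats each byte as a character therefore sees a word starting inside a multibyte character.

```python
import re

# "\u30a2" is the katakana letter A. In Shift-JIS it is the two bytes
# 0x83 0x41, and the trailing byte 0x41 is also the ASCII letter "A".
DATA = "\u30a2".encode("shift_jis")


def bytewise_words(raw):
    """Match word runs byte by byte (bytes patterns use ASCII-only \\w)."""
    return re.findall(rb"\b\w+", raw)


def decoded_ascii_words(raw, encoding):
    """Decode with the right encoding first, then look for ASCII words."""
    return re.findall(r"\b[A-Za-z]+", raw.decode(encoding))


if __name__ == "__main__":
    print(bytewise_words(DATA))                    # spurious match on the trail byte
    print(decoded_ascii_words(DATA, "shift_jis"))  # no ASCII word in the real text
```

The byte-wise search reports a word that does not exist in the text, which is why fewer matches are preferable when the file encoding does not agree with what the matcher assumes.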


> If all you're doing is matching either ASCII or Japanese text and you
> want "locale-aware numbers" it might do the wrong thing.

I'm not seeing much of a problem here. When searching Japanese text, I would expect \d and [0-9０-９] (using both ASCII and full-width digits) to be equivalent, so (assuming UCP) it's not a big deal as to which regex you use, since Japanese text won't contain Bengali (or whatever) digits. And when searching binary data, I'd expect a bunch of garbage no matter how \d is interpreted.

Here I'm assuming [０-９] (using full-width digits) has the expected meaning in PCRE2, i.e., that PCRE2 didn't make the same mistake that POSIX made.
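
As a rough analogy only, here is a sketch of the digit comparison using Python's `re` module, whose Unicode-aware `\d` behaves like the UCP case described above. This is not PCRE2, so it says nothing about how PCRE2 actually treats the range; the pattern names are hypothetical, and the full-width digits are written as escapes (U+FF10 to U+FF19) to avoid any ambiguity.

```python
import re

# Explicit class: ASCII digits plus full-width digits U+FF10..U+FF19,
# interpreted as a plain code point range.
ASCII_AND_FULLWIDTH = re.compile(r"[0-9\uFF10-\uFF19]+")

# Unicode-aware \d (the default for str patterns): any decimal digit.
UNICODE_DIGITS = re.compile(r"\d+")

# ASCII-only \d, roughly the non-UCP behavior.
ASCII_DIGITS = re.compile(r"\d+", re.ASCII)

MIXED = "12\uFF13\uFF14"   # "12" followed by full-width "34"
BENGALI = "\u09E9"         # Bengali digit three

if __name__ == "__main__":
    for name, pattern in [
        ("explicit class", ASCII_AND_FULLWIDTH),
        ("Unicode \\d", UNICODE_DIGITS),
        ("ASCII \\d", ASCII_DIGITS),
    ]:
        print(name,
              bool(pattern.fullmatch(MIXED)),
              bool(pattern.fullmatch(BENGALI)))
```

On text that contains only ASCII and full-width digits, the explicit class and the Unicode-aware `\d` agree; they diverge only on digits from other scripts, such as the Bengali example.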




