bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

bug-grep

From:	Vincent Lefevre
Subject:	bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date:	Sat, 20 Dec 2014 01:13:39 +0100
User-agent:	Mutt/1.5.23-6371-vl-r75100 (2014-11-04)

On 2014-12-19 23:00:38 +0900, Norihiro Tanaka wrote:
> I got them from pcre_valid_utf8(), but I made some mistakes.  They are
> as following.
> 
>   0xE0 0xAF 0xBF

This one is valid UTF-8 and corresponds to the code point U+0BFF, and
the following matches:

$ printf "\xE0\xAF\xBF\n" | grep -P .
௿

>   0xED 0xA0 0xBF

OK, this decodes to U+D83F, which is in the surrogate range
(U+D800..U+DFFF), and PCRE does not match it.

>   0xF0 0x8F 0xBF 0xBF

This one decodes to U+FFFF, but as an overlong (non-shortest-form)
4-byte encoding, so it is invalid UTF-8.
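
One way to double-check the three sequences above is to decode the
bits by hand. The following is only a minimal sketch of the RFC 3629
rules (length from the lead byte, continuation bytes, shortest form,
no surrogates, nothing above U+10FFFF). The function name
classify_utf8 is made up for this illustration; it is not how grep,
glibc or PCRE implement the check.

```python
def classify_utf8(seq: bytes) -> str:
    """Classify one UTF-8 byte sequence (illustrative helper, not a library API)."""
    if not seq:
        return "empty"
    lead = seq[0]
    # Lead byte gives the length, the payload bits, and the smallest
    # code point that legitimately needs that many bytes.
    if lead < 0x80:
        length, cp, minimum = 1, lead, 0x0
    elif 0xC0 <= lead < 0xE0:
        length, cp, minimum = 2, lead & 0x1F, 0x80
    elif 0xE0 <= lead < 0xF0:
        length, cp, minimum = 3, lead & 0x0F, 0x800
    elif 0xF0 <= lead < 0xF8:
        length, cp, minimum = 4, lead & 0x07, 0x10000
    else:
        return "invalid lead byte"
    if len(seq) != length:
        return "wrong length"
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            return "invalid continuation byte"
        cp = (cp << 6) | (b & 0x3F)  # append 6 payload bits
    if cp < minimum:
        return f"overlong encoding of U+{cp:04X}"
    if 0xD800 <= cp <= 0xDFFF:
        return f"surrogate U+{cp:04X}"
    if cp > 0x10FFFF:
        return f"out of range U+{cp:04X}"
    return f"valid U+{cp:04X}"


if __name__ == "__main__":
    for raw in (b"\xE0\xAF\xBF", b"\xED\xA0\xBF", b"\xF0\x8F\xBF\xBF"):
        print(raw.hex(" "), "->", classify_utf8(raw))
```

The checks run in this order so that an overlong form is reported as
such before the surrogate and range tests are applied to the decoded
value.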

> > BTW,
> > 
> >   printf "\xF4\xBF\xBF\xBF\n" | grep .
> > 
> > finds a match, and this appears to be a bug (grep should follow
> > the current standard).
> 
> I also see it is a bug as you say.  mbrlen() in glibc returns (size_t) -1
> for the sequence.

Ditto with:

  printf "\xED\xA0\xBF\n" | grep .

(surrogate area).
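
For comparison, a strict decoder rejects both of these sequences.
The sketch below uses Python 3's built-in "utf-8" codec purely as a
reference point for what following the current standard looks like;
the helper name is_strict_utf8 is made up here, and this says nothing
about how grep or mbrlen() behave internally.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for raw in (b"\xF4\xBF\xBF\xBF", b"\xED\xA0\xBF", b"\xE0\xAF\xBF"):
        print(raw.hex(" "), "->", is_strict_utf8(raw))
```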

-- 
Vincent Lefèvre <address@hidden> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
