Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8

From: Jim Meyering
Subject: Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8
Date: Fri, 01 Jun 2012 12:02:47 +0200

Strahinja Kustudic wrote:
> URL:
>   <http://savannah.gnu.org/bugs/?36567>
>                  Summary: grep -i (case-insensitive) is broken with UTF8
>                  Project: grep
>             Submitted by: kustodian
>             Submitted on: Thu 31 May 2012 11:18:30 AM GMT
>                 Category: None
>                 Severity: 3 - Normal
>               Item Group: None
>                   Status: None
>                  Privacy: Public
>              Assigned to: None
>              Open/Closed: Open
>          Discussion Lock: Any
> Details:
> Since version 2.6.1 grep doesn't work correctly if you use a case-insensitive
> search with UTF8 encoding when there is an UTF8 character. Here is the
> example:
> # Without -i switch everything works correctly
> $ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep 'AA'
> AA UTF8 char İ 12345
> AA 12345
> # With -i it breaks
> $ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep -i 'AA'
> AA UTF8 char İ 12345AA 12345
> As you can see it somehow deletes the new line character in the line which has
> an UTF8 'İ' character.
> Everything works correctly in versions 2.5.4 and below, it's broken from 2.6.1
> to the latest version (which is atm 2.6.12).
> This is a big concern, since it can break scripts which filtered UTF8 input

Thanks for the report.
This is the same bug that prompted the addition of the
tests/turkish-I test (still expected to fail):


Sorry no one has followed up since then.

Here's another demonstrator:

    # Assumed definition of $i: U+0130 (I-with-dot), 2 bytes in UTF-8
    i=$(printf '\304\260')
    printf "$i$i$i$i$i$i$i\n" > in
    LC_ALL=en_US.UTF-8 grep -i .... in > out
    cmp in out > /dev/null || echo FAIL

As I mentioned in the link above, this is a problem because of the way
grep's -i is implemented: it converts both the RE and the buffer-to-search
to lower case, and then performs the search.  The problem arises with
turkish-I because the conversion changes the length of the buffer (in
the example test, the input is 15 bytes long -- 7 x 2-byte I-with-dot
+ newline, yet the lower case version has a length of just 8: 7 x
lower-cased i + NL), and the code returns the match offset and length
relative to the shortened lower-case buffer (that lower-cased buffer is
internal to code duplicated in EGexecute/Fexecute), yet it uses those
offset,length numbers to manipulate the original buffer.
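
To make the arithmetic above concrete, here is a minimal sketch in C.
The names toy_lower and newline_offset are hypothetical and are not
grep code; toy_lower folds only ASCII A-Z and U+0130, whereas grep's
real conversion is locale-driven.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the lower-casing step (NOT grep's mbtolower):
   folds ASCII A-Z, and folds the 2-byte UTF-8 sequence C4 B0
   (U+0130, I-with-dot) to the single byte 'i'.  Every other byte is
   copied unchanged.  DST must have room for at least LEN bytes.
   Returns the length of the lowered buffer.  */
size_t
toy_lower (const char *src, size_t len, char *dst)
{
  assert (src && dst);
  size_t n = 0;
  for (size_t k = 0; k < len; k++)
    {
      unsigned char c = (unsigned char) src[k];
      if (c == 0xC4 && k + 1 < len && (unsigned char) src[k + 1] == 0xB0)
        {
          dst[n++] = 'i';
          k++;                  /* skip the second byte of U+0130 */
        }
      else if ('A' <= c && c <= 'Z')
        dst[n++] = (char) (c - 'A' + 'a');
      else
        dst[n++] = (char) c;
    }
  return n;
}

/* Offset of the first newline in BUF, or LEN if there is none.  */
size_t
newline_offset (const char *buf, size_t len)
{
  const char *p = memchr (buf, '\n', len);
  return p ? (size_t) (p - buf) : len;
}
```

For the 15-byte demonstrator input, toy_lower returns 8.  The newline
sits at offset 7 in the lowered buffer but at offset 14 in the original,
so applying offset 7 to the original lands inside the fourth I-with-dot.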

Without re-architecting too much, one solution is to change mbtolower to
return additional information: a malloc'd mapping vector M, of the same
length as its returned buffer, where M[i] is the length-in-bytes of the
character that formed byte i of the result.  With that, or something
similar, the caller could then map the currently-erroneous offset,len
numbers to equivalent numbers that apply to the original buffer.  This
mapping could be allocated/defined only when lengths actually differ,
so that the cost in general would be negligible.
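
A rough sketch of the "something similar" idea, in C.  This is a variant,
not the exact vector described above: instead of per-byte character
lengths it records, for each byte of the lowered buffer, the offset in the
original buffer of the character that produced it, plus one sentinel
entry.  toy_lower_with_map and map_match are hypothetical names, the
folding handles only ASCII A-Z and U+0130, and unlike the proposal it
always allocates the map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy lower-casing (NOT grep's mbtolower) that also builds a map.
   Writes the lowered bytes to DST (capacity >= LEN) and returns their
   count N.  *MAP receives a malloc'd array of N+1 entries where
   (*MAP)[i] is the offset in SRC of the character that produced DST[i],
   and (*MAP)[N] == LEN.  The caller frees *MAP.
   Returns (size_t) -1 if allocation fails.  */
size_t
toy_lower_with_map (const char *src, size_t len, char *dst, size_t **map)
{
  assert (src && dst && map);
  size_t *m = malloc ((len + 1) * sizeof *m);
  if (!m)
    return (size_t) -1;

  size_t n = 0;
  for (size_t k = 0; k < len; k++)
    {
      unsigned char c = (unsigned char) src[k];
      m[n] = k;                 /* DST[n] comes from the char at SRC[k] */
      if (c == 0xC4 && k + 1 < len && (unsigned char) src[k + 1] == 0xB0)
        {
          dst[n++] = 'i';       /* U+0130: 2 bytes in, 1 byte out */
          k++;
        }
      else if ('A' <= c && c <= 'Z')
        dst[n++] = (char) (c - 'A' + 'a');
      else
        dst[n++] = (char) c;
    }
  m[n] = len;                   /* sentinel: end of the original buffer */
  *map = m;
  return n;
}

/* Translate a match [OFF, OFF+SIZE) found in the lowered buffer into
   the equivalent offset and length in the original buffer.  */
void
map_match (const size_t *map, size_t off, size_t size,
           size_t *orig_off, size_t *orig_size)
{
  *orig_off = map[off];
  *orig_size = map[off + size] - map[off];
}
```

With the demonstrator input, a match on the first four lowered
characters (offset 0, length 4) maps to offset 0, length 8 in the
original, and the whole lowered line (length 8) maps to all 15 bytes.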
