bug-grep

Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8


From: Jim Meyering
Subject: Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8
Date: Fri, 01 Jun 2012 17:18:04 +0200

Jim Meyering wrote:

> Strahinja Kustudic wrote:
>> URL:
>>   <http://savannah.gnu.org/bugs/?36567>
>>
>>                  Summary: grep -i (case-insensitive) is broken with UTF8
>>                  Project: grep
>>             Submitted by: kustodian
>>             Submitted on: Thu 31 May 2012 11:18:30 AM GMT
>>                 Category: None
>>                 Severity: 3 - Normal
>>               Item Group: None
>>                   Status: None
>>                  Privacy: Public
>>              Assigned to: None
>>              Open/Closed: Open
>>          Discussion Lock: Any
>>
>> Details:
>>
>> Since version 2.6.1, grep doesn't work correctly if you use a case-insensitive
>> search with UTF8 encoding when the input contains a UTF8 character. Here is
>> an example:
>>
>> # Without -i switch everything works correctly
>> $ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep 'AA'
>> AA UTF8 char İ 12345
>> AA 12345
>>
>>
>> # With -i it breaks
>> $ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep -i 'AA'
>> AA UTF8 char İ 12345AA 12345
>>
>>
>> As you can see, it somehow deletes the newline character in the line which
>> has a UTF8 'İ' character.
>>
>> Everything works correctly in versions 2.5.4 and below; it's broken from
>> 2.6.1 to the latest version (which is currently 2.6.12).
>>
>> This is a big concern, since it can break scripts which filter UTF8 input.
>
> Thanks for the report.
> This is the same bug that prompted the addition of the
> tests/turkish-I test (still expected to fail):
>
>     http://thread.gmane.org/gmane.comp.gnu.grep.bugs/3413/focus=3417
>
> Sorry no one has followed up since then.
>
> Here's another demonstrator:
>
>     i='\xC4\xB0'
>     printf "$i$i$i$i$i$i$i\n" > in
>     LC_ALL=en_US.UTF-8 grep -i .... in > out
>     cmp in out > /dev/null || echo FAIL
>
> As I mentioned in the link above, this is a problem because of the way
> grep's -i is implemented: it converts both the RE and the buffer-to-search
> to lower case, and then performs the search.  The problem arises with
> turkish-I because the conversion changes the length of the buffer (in
> the example test, the input is 15 bytes long -- 7 x 2-byte I-with-dot
> + newline, yet the lower case version has a length of just 8: 7 x
> lower-cased i + NL), and the code returns the match offset and length
> relative to the shortened lower-case buffer (that lower-cased buffer is
> internal to code duplicated in EGexecute/Fexecute), yet it uses those
> offset,length numbers to manipulate the original buffer.
>
> Without re-architecting too much, one solution is to change mbtolower to
> return additional information: a malloc'd mapping vector M, of the same
> length as its returned buffer, where M[i] is the length-in-bytes of the
> character that formed byte i of the result.  With that, or something
> similar, the caller could then map the currently-erroneous offset,len
> numbers to equivalent numbers that apply to the original buffer.  This
> mapping could be allocated/defined only when lengths actually differ,
> so that the cost in general would be negligible.
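
The mapping idea in the quoted paragraph can be sketched roughly as follows. This is a toy illustration, not grep's code: `toy_lower_with_map` and `map_to_original` are hypothetical names, and the lowering step is a stand-in that only knows ASCII and the two-byte sequence C4 B0 (U+0130), so that it runs without any locale setup. It also assumes every lower-cased character occupies a single byte; with multi-byte output, the original length would have to be counted once per character rather than once per byte.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for mbtolower.  Lower-cases ASCII bytes and maps
   the UTF-8 sequence C4 B0 (U+0130) to 'i'; any other byte is copied
   unchanged.  Returns a malloc'd buffer of *out_len bytes and stores in
   *map a malloc'd vector in which (*map)[k] is the length in bytes of the
   original character that produced output byte k.  Returns NULL on
   allocation failure.  */
char *
toy_lower_with_map (const char *in, size_t in_len, size_t *out_len,
                    unsigned char **map)
{
  /* The output is never longer than the input in this toy version.  */
  char *out = malloc (in_len + 1);
  unsigned char *m = malloc (in_len + 1);
  size_t o = 0;

  if (!out || !m)
    {
      free (out);
      free (m);
      return NULL;
    }

  for (size_t k = 0; k < in_len; )
    {
      if (k + 1 < in_len
          && (unsigned char) in[k] == 0xC4
          && (unsigned char) in[k + 1] == 0xB0)
        {
          out[o] = 'i';         /* two input bytes become one output byte */
          m[o++] = 2;
          k += 2;
        }
      else
        {
          out[o] = (char) tolower ((unsigned char) in[k]);
          m[o++] = 1;
          k++;
        }
    }

  *out_len = o;
  *map = m;
  return out;
}

/* Translate a match (OFF, LEN) found in the lower-cased buffer into the
   equivalent (offset, length) in the original buffer, by summing the
   original character lengths recorded in MAP.  */
void
map_to_original (const unsigned char *map, size_t off, size_t len,
                 size_t *orig_off, size_t *orig_len)
{
  size_t start = 0, span = 0;

  for (size_t k = 0; k < off; k++)
    start += map[k];
  for (size_t k = off; k < off + len; k++)
    span += map[k];

  *orig_off = start;
  *orig_len = span;
}
```

For the 15-byte demonstrator input, the lowered buffer is 8 bytes long, and a 4-byte match at offset 3 in it corresponds to 8 bytes at offset 6 in the original. A real implementation would presumably build the vector only when the two lengths differ, as the paragraph suggests.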

I've implemented the above, and have begun testing.
The testing exposed an additional problem with -F.
This fails both with and without the complication of multi-byte:

    $ i='\xC4\xB0'
    $ printf "$i$i$i$i$i$i$i\n" > in
    $ LC_ALL=C grep "$i" in || echo FAIL
    FAIL

I'll post once I've resolved that.


