Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8
From: Jim Meyering
Subject: Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8
Date: Fri, 01 Jun 2012 17:18:04 +0200
Jim Meyering wrote:
> Strahinja Kustudic wrote:
>> URL:
>> <http://savannah.gnu.org/bugs/?36567>
>>
>> Summary: grep -i (case-insensitive) is broken with UTF8
>> Project: grep
>> Submitted by: kustodian
>> Submitted on: Thu 31 May 2012 11:18:30 AM GMT
>> Category: None
>> Severity: 3 - Normal
>> Item Group: None
>> Status: None
>> Privacy: Public
>> Assigned to: None
>> Open/Closed: Open
>> Discussion Lock: Any
>>
>> Details:
>>
>> Since version 2.6.1 grep doesn't work correctly if you use a case-insensitive
>> search with UTF8 encoding when there is a UTF8 character. Here is the
>> example:
>>
>> # Without -i switch everything works correctly
>> $ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep 'AA'
>> AA UTF8 char İ 12345
>> AA 12345
>>
>>
>> # With -i it breaks
>> $ echo -e 'AA UTF8 char İ 12345\nAA 12345' | grep -i 'AA'
>> AA UTF8 char İ 12345AA 12345
>>
>>
>> As you can see it somehow deletes the newline character in the line which
>> has a UTF8 'İ' character.
>>
>> Everything works correctly in versions 2.5.4 and below, it's broken from
>> 2.6.1 to the latest version (which is atm 2.6.12).
>>
>> This is a big concern, since it can break scripts which filtered UTF8 input
>
> Thanks for the report.
> This is the same bug that prompted the addition of the
> tests/turkish-I test (still expected to fail):
>
> http://thread.gmane.org/gmane.comp.gnu.grep.bugs/3413/focus=3417
>
> Sorry no one has followed up since then.
>
> Here's another demonstrator:
>
> i='\xC4\xB0'
> printf "$i$i$i$i$i$i$i\n" > in
> LC_ALL=en_US.UTF-8 grep -i .... in > out
> cmp in out > /dev/null || echo FAIL
>
> As I mentioned in the link above, this is a problem because of the way
> grep's -i is implemented: it converts both the RE and the buffer-to-search
> to lower case, and then performs the search. The problem arises with
> turkish-I because the conversion changes the length of the buffer (in
> the example test, the input is 15 bytes long -- 7 x 2-byte I-with-dot
> + newline, yet the lower case version has a length of just 8: 7 x
> lower-cased i + NL), and the code returns the match offset and length
> relative to the shortened lower-case buffer (that lower-cased buffer is
> internal to code duplicated in EGexecute/Fexecute), yet it uses those
> offset,length numbers to manipulate the original buffer.
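
To make the arithmetic above concrete, here is a minimal sketch in C. It is
not grep's code: toy_lower and lowered_line_length are hypothetical names, and
the lower-casing is a stand-in that knows only ASCII letters and the two-byte
sequence C4 B0 (U+0130), which it maps to a single 'i' as the message
describes. It shows the line length being measured in the lower-cased copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the lower-casing step.  Handles only ASCII
   letters and the UTF-8 sequence C4 B0 (U+0130), mapped to one 'i'.
   Writes at most N bytes to DST and returns the number written.  */
size_t
toy_lower (const char *src, size_t n, char *dst)
{
  size_t out = 0;
  for (size_t i = 0; i < n; )
    {
      unsigned char c = src[i];
      if (c == 0xC4 && i + 1 < n && (unsigned char) src[i + 1] == 0xB0)
        {
          dst[out++] = 'i';     /* two input bytes become one */
          i += 2;
        }
      else
        {
          dst[out++] = (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
          i++;
        }
    }
  return out;
}

/* Length of the first line, newline included, as measured in the
   lower-cased copy.  This is the number that goes wrong when it is
   later applied to the original buffer.  */
size_t
lowered_line_length (const char *src, size_t n)
{
  char tmp[64];
  assert (n <= sizeof tmp);     /* sketch only: fixed scratch buffer */
  size_t len = toy_lower (src, n, tmp);
  const char *nl = memchr (tmp, '\n', len);
  return nl ? (size_t) (nl - tmp) + 1 : len;
}
```

For the 15-byte input of the demonstrator, lowered_line_length returns 8.
Taking 8 bytes from the original buffer yields four complete I-with-dot
characters and no newline, which matches the missing newline in the report.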
>
> Without re-architecting too much, one solution is to change mbtolower to
> return additional information: a malloc'd mapping vector M, of the same
> length as its returned buffer, where M[i] is the length-in-bytes of the
> character that formed byte i of the result. With that, or something
> similar, the caller could then map the currently-erroneous offset,len
> numbers to equivalent numbers that apply to the original buffer. This
> mapping could be allocated/defined only when lengths actually differ,
> so that the cost in general would be negligible.
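
The mapping idea can be sketched as follows. This is only an illustration
under stated assumptions, not the actual patch: toy_lower_with_map and
map_to_original are hypothetical names, the lower-casing again handles only
ASCII and the sequence C4 B0, and map[i] here records how many input bytes
were consumed to produce output byte i. That agrees with the description
above only because every lower-cased character in this toy is one byte long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical lower-casing that also returns a mapping vector.
   On success returns a malloc'd buffer of *OUT_LEN bytes and stores in
   *MAP a malloc'd vector of the same length; returns NULL on allocation
   failure.  The caller frees both.  */
char *
toy_lower_with_map (const char *src, size_t n, size_t *out_len,
                    unsigned char **map)
{
  char *dst = malloc (n ? n : 1);
  unsigned char *m = malloc (n ? n : 1);
  if (!dst || !m)
    {
      free (dst);
      free (m);
      return NULL;
    }
  size_t out = 0;
  for (size_t i = 0; i < n; out++)
    {
      unsigned char c = src[i];
      if (c == 0xC4 && i + 1 < n && (unsigned char) src[i + 1] == 0xB0)
        {
          dst[out] = 'i';
          m[out] = 2;           /* this output byte came from 2 input bytes */
          i += 2;
        }
      else
        {
          dst[out] = (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
          m[out] = 1;
          i++;
        }
    }
  *out_len = out;
  *map = m;
  return dst;
}

/* Convert an (offset, length) pair that refers to the lower-cased
   buffer into the equivalent pair for the original buffer.  */
void
map_to_original (const unsigned char *map, size_t *off, size_t *len)
{
  size_t orig_off = 0, orig_len = 0;
  for (size_t k = 0; k < *off; k++)
    orig_off += map[k];
  for (size_t k = *off; k < *off + *len; k++)
    orig_len += map[k];
  *off = orig_off;
  *len = orig_len;
}
```

With the demonstrator's input, the pair (0, 8) in the lower-cased buffer maps
back to (0, 15) in the original, so the whole line including its newline is
selected. A real implementation would need to handle arbitrary multi-byte
characters, and could skip building the map when no length changes.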

I've implemented the above, and have begun testing.
The testing exposed an additional problem with -F.
This fails both with and without the complication of multi-byte:

  $ i='\xC4\xB0'
  $ printf "$i$i$i$i$i$i$i\n" > in
  $ LC_ALL=C grep "$i" in || echo FAIL
  FAIL

I'll post once I've resolved that.