From: Jim Meyering
Subject: Re: grep -i in UTF-8: newline not printed after matching line if it contains I WITH DOT (U+0130)
Date: Wed, 19 Jan 2011 22:16:28 +0100

Jim Meyering wrote:
> Ilya Basin wrote:
>> $ grep -i . greptest.txt
>> aIabIbcIcdId$
>>
>> This doesn't happen without -i or with LANG=C
>>
>>
>> $ grep --version
>> grep (GNU grep) 2.7
>> $ echo $LANG
>> en_US.UTF-8
>>
>> pcre 8.10
>>
>> Linux IL 2.6.36-ARCH #1 SMP PREEMPT Wed Nov 24 06:44:11 UTC 2010 i686
>> Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz GenuineIntel GNU/Linux
>
> Thanks for the report. That is indeed a bug.
> It affects even the very latest in git.
>
> Here's another variant of it:
> [note how it fails to print the matched "."]
>
> $ i='\xC4\xB0'; printf "$i$i$i.$i$i$i$i\n" \
> | LC_ALL=en_US.UTF-8 ./grep -oi '.\.'|od -a -tx1
> 0000000 D 0 nl
> c4 b0 0a
> 0000003
>
> -----------------------------
> More like your example, this shows how, with -i,
> grep is searching a different string (down-cased)
> and the width of the lower-case "i" is just one byte.
> The end-of-line offset is calculated using the all-lower-case
> string, yet that offset is not valid in the original, longer string,
> so grep fails to print the entire line:
>
> i='\xC4\xB0'; printf "$i$i$i$i$i$i$i\n" |LC_ALL=en_US.UTF-8 ./grep -i ....
> İİİİ
>
> One of us should find time to fix it before too long.
First step is (at least this time) to write the test.
I've just pushed this:
From 955695aea8fac194db07009a8673af3aaa6e0f8c Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Wed, 19 Jan 2011 22:12:09 +0100
Subject: [PATCH 1/2] maint: sort test names in Makefile.am
* tests/Makefile.am (TESTS): Sort test names.
---
tests/Makefile.am | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/tests/Makefile.am b/tests/Makefile.am
index ac0e3c1..0d78d26 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -35,9 +35,9 @@ endif
TESTS = \
backref \
+ backref-multibyte-slow \
backref-word \
bre \
- backref-multibyte-slow \
case-fold-backref \
case-fold-backslash-w \
case-fold-char-class \
@@ -46,8 +46,8 @@ TESTS = \
char-class-multibyte \
dfaexec-multibyte \
empty \
- ere \
equiv-classes \
+ ere \
euc-mb \
fedora \
fgrep-infloop \
@@ -65,15 +65,15 @@ TESTS = \
options \
pcre \
pcre-z \
+ prefix-of-multibyte \
reversed-range-endpoints \
sjis-mb \
spencer1 \
spencer1-locale \
status \
- prefix-of-multibyte \
warn-char-classes \
- word-multi-file \
word-delim-multibyte \
+ word-multi-file \
yesno
EXTRA_DIST = \
--
1.7.3.5
From ebfc46553d56ec3ab3feade82e53fac0863fd102 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Wed, 19 Jan 2011 22:12:43 +0100
Subject: [PATCH 2/2] tests: add a known-to-fail test
* tests/turkish-I: New test.
* tests/Makefile.am (TESTS): Add it.
(XFAIL_TESTS): Add here, too.
Reported by Ilya Basin.
---
THANKS | 1 +
tests/Makefile.am | 2 ++
tests/turkish-I | 32 ++++++++++++++++++++++++++++++++
3 files changed, 35 insertions(+), 0 deletions(-)
create mode 100755 tests/turkish-I
diff --git a/THANKS b/THANKS
index 8c3d0d9..116b9c4 100644
--- a/THANKS
+++ b/THANKS
@@ -37,6 +37,7 @@ H. Merijn Brand <address@hidden>
Harald Hanche-Olsen <address@hidden>
Hans-Bernhard Broeker <address@hidden>
Heikki Korpela <address@hidden>
+Ilya Basin <address@hidden>
Isamu Hasegawa <address@hidden>
Jaroslav Škarvada <address@hidden>
Jeff Bailey <address@hidden>
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 0d78d26..7233c01 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -32,6 +32,7 @@ XFAIL_TESTS = \
if USE_INCLUDED_REGEX
XFAIL_TESTS += equiv-classes
endif
+XFAIL_TESTS += turkish-I
TESTS = \
backref \
@@ -71,6 +72,7 @@ TESTS = \
spencer1 \
spencer1-locale \
status \
+ turkish-I \
warn-char-classes \
word-delim-multibyte \
word-multi-file \
diff --git a/tests/turkish-I b/tests/turkish-I
new file mode 100755
index 0000000..ac536c4
--- /dev/null
+++ b/tests/turkish-I
@@ -0,0 +1,32 @@
+#!/bin/sh
+# grep -i in UTF-8: missing NL in output on line containing I WITH DOT (U+0130)
+
+# Copyright (C) 2011 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+
+require_en_utf8_locale_
+
+fail=0
+
+i='\xC4\xB0'
+printf "$i$i$i$i$i$i$i\n" > in || framework_failure_
+
+LC_ALL=en_US.UTF-8 grep -i .... in > out || fail=1
+
+compare out in || fail=1
+
+Exit $fail
--
1.7.3.5
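
For anyone following along, here is a rough sketch of the offset mismatch described in the quoted explanation. It does not run grep at all: the down-cased buffer is hard-coded as seven one-byte "i" characters, which is an assumption taken from that explanation, and the variable names are only illustrative. It computes the end-of-line offset on the shorter, down-cased string and then applies it to the original 14-byte line, which is enough to reproduce the four-character, newline-less output shown above.

```shell
#!/bin/sh
# Illustration only: this mimics the arithmetic, not grep's actual code.

i='\304\260'                      # U+0130 in UTF-8 (C4 B0), in octal for portable printf
orig=$(printf "$i$i$i$i$i$i$i")   # original line text: 7 characters, 14 bytes
lower='iiiiiii'                   # ASSUMED down-cased text: 7 characters, 7 bytes

orig_len=$(printf '%s' "$orig" | wc -c | tr -d ' ')
lower_len=$(printf '%s' "$lower" | wc -c | tr -d ' ')

# End-of-line offset as it would be computed on the down-cased buffer
# (the text plus its trailing newline).
eol=$((lower_len + 1))

# Apply that offset to the original buffer: only the first $eol bytes
# get printed, so the line is cut short and the newline is lost.
printed=$(printf '%s\n' "$orig" | head -c "$eol" | od -An -tx1 | tr -d ' \n')

echo "orig_len=$orig_len lower_len=$lower_len eol=$eol"
echo "bytes printed: $printed"
```

The eight bytes that survive are four complete copies of C4 B0, i.e. four of the seven characters, with no 0a at the end. That is the symptom the turkish-I test checks for when it compares "out" against "in".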