From: L. A. Walsh
Subject: bug#20678: new bug that Paul "asked" for... grep -P aborts on non-utf8 input.
Date: Wed, 27 May 2015 14:41:12 -0700
User-agent: Thunderbird
(skip to end if you don't care to read how I found this
mess)...
Paul Eggert wrote:
> Linda Walsh wrote:
>> I had one file that it bailed on
>> saying it has an invalid UTF-8 encoding -- but the line was
>> recursive starting from '.' -- and it didn't name the file
> That's pretty vague. Can you reproduce that problem? I don't observe
> it:
----
I'm not quite *sure* how to tell someone else to reproduce this, but
I can now trigger it pretty reliably. Some output from a checker:
*** file = libvtkUtilitiesPythonInitializer-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
-----
*** file = libvtkPVClientServerCoreCore-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
-----
*** file = libsystemd.so.0
grep: invalid UTF-8 byte sequence in input
-----
*** file = libvtkParallelCore-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
-----
Now before you think I'm too daft, the code that produces those
messages is in perl and is:
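for my $k (@sorted_missing) {
    # P is a printf-style print helper (not shown here)
    P "*** file = %s", $k;
    # search every file listing under /home/rpms/13.2 for the missing name
    open(my $gh, "grep -rP '/$k' /home/rpms/13.2|")
        or die "cannot run grep: $!";
    while (<$gh>) {
        print;
    }
    close($gh);
    P "-----";
}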
Those are files that came up "missing" as pre-reqs.
In /home/rpms/...., I have the *file listings* of each of
the rpms, created in the same structure as in the distro, so
a file under that dir (/home/rpms/13.2..) is a plain-text listing
of an rpm's contents, not the rpm itself. This is why I had
a problem finding it:
Ishtar:rpms/13.2/repo/oss/suse> file -bi x86_64/*>/tmp/x86files.txt
Ishtar:rpms/13.2/repo/oss/suse> sort </tmp/x86files.txt |uniq -c
2 text/plain; charset=iso-8859-1
13269 text/plain; charset=us-ascii
2 text/plain; charset=utf-8
--- I'd say it's likely 1-2 files out of 13274 files that could
have the problem. Yeah, I run into a lot of needles in haystacks..
but trying to find the needle... just generating the file of types:
time file -i x86_64/*>/tmp/fullx86files.txt
27.71sec 27.07usr 0.63sys (99.99% cpu)
Then grep helps!
Ishtar:rpms/13.2/repo/oss/suse> grep iso-88 /tmp/fullx86files.txt
x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm:text/plain; charset=iso-8859-1
x86_64/aspell-nb-0.50.10-46.1.2.x86_64.rpm:text/plain; charset=iso-8859-1
---
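For anyone wanting to hunt for such files without going through `file`, here is a minimal sketch of another way to do it. It assumes an `iconv` that exits non-zero on an invalid input sequence (glibc's does); the function name `list_non_utf8` is made up for this example and is not part of grep or of the checker above.

```shell
# Hypothetical helper: print the names of files that are not valid UTF-8.
# Assumption: iconv returns a non-zero status when it meets an invalid
# byte sequence while decoding its input as UTF-8.
list_non_utf8() {
    for f in "$@"; do
        # skip directories and anything else that is not a regular file
        [ -f "$f" ] || continue
        if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
            printf '%s\n' "$f"
        fi
    done
}
```

Something like `list_non_utf8 x86_64/*` should then name only the listings that contain bytes such as the lone \355 shown further down; pure us-ascii files pass, since ASCII is valid UTF-8.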
Ishtar:rpms/13.2/repo/oss/suse> more x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm
/usr/lib64/aspell-0.60/icelandic.alias
/usr/lib64/aspell-0.60/is.dat
/usr/lib64/aspell-0.60/is.multi
/usr/lib64/aspell-0.60/is.rws
/usr/lib64/aspell-0.60/is_phonet.dat
/usr/lib64/aspell-0.60/355slenska.alias <<-- the 355 was in inverse color
/usr/share/doc/packages/aspell-is
/usr/share/doc/packages/aspell-is/COPYING
/usr/share/doc/packages/aspell-is/Copyright
/usr/share/doc/packages/aspell-is/README
----
Same w/the other file (it had this 1 'violation'):
/usr/lib64/aspell-0.60/bokmal.alias
/usr/lib64/aspell-0.60/bokm345l.alias <<-- the 345
So those are 'octal' code points (using a little calc prog):
pcalc
pcalc V0.1.8: Type 'constants' to see constants
(1)> 0355
= 237 (0x00ed) "í"
(2)> 0345
= 229 (0x00e5) "å"
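The same conversion can be sketched without a calculator program, assuming a POSIX-style `printf` that reads a numeric argument with a leading 0 as octal. The function name `oct_info` is made up for this example.

```shell
# Hypothetical helper: show an octal value in decimal and hex.
# Assumption: the shell's printf treats a leading "0" in a numeric
# argument as octal, as POSIX specifies.
oct_info() {
    printf '%s = %d (0x%04x)\n' "$1" "$1" "$1"
}

oct_info 0355   # prints: 0355 = 237 (0x00ed)
oct_info 0345   # prints: 0345 = 229 (0x00e5)
```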
-------------------------------------------------------------------------------
So the 1st part of the bug is the message w/no filename.
The 2nd part of the bug is this: looking for '^nobody' in
"/etc/passwd" works, as shown in the 1st example:
grep -P '^nobody' /etc/passwd
nobody:x:65534:65533:(group Nobody):/var/lib/nobody:/bin/nologin
but the 'error' message aborts any further file searches:
---
grep -P '^nobody' x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm /etc/passwd
grep: invalid UTF-8 byte sequence in input
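Until that is fixed, one possible workaround is to run grep once per file, so an error names the file and the later files still get searched. This is only a sketch, not anything grep provides: `grep_each` is a made-up name, and it relies on grep's documented exit status (0 = match, 1 = no match, 2 = error).

```shell
# Hypothetical wrapper: run "grep -P" on one file at a time.
# grep's own error message still goes to stderr; this adds the file name
# and carries on with the remaining files.
grep_each() {
    pattern=$1
    shift
    for f in "$@"; do
        # -H prefixes each match with the file name, as a multi-file grep would
        grep -HP -- "$pattern" "$f"
        status=$?
        if [ "$status" -ge 2 ]; then
            printf 'grep_each: error while searching %s\n' "$f" >&2
        fi
    done
}
```

With that, `grep_each '^nobody' x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm /etc/passwd` should still print the passwd line, at the cost of one grep process per file.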
----------------------------------------------------------
This is why I objected to a file with '\000' in it being treated
as a binary file (and why I think it's bad grep can't look for that):
If one works with windows, such a file is far more likely
just to be text in UTF-16 encoding.
-l