bug-grep

Re: Cannot searh MessengerPlus xhtml chat logs


From: Paolo Bonzini
Subject: Re: Cannot searh MessengerPlus xhtml chat logs
Date: Mon, 26 Apr 2010 10:14:38 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100301 Fedora/3.0.3-1.fc12 Lightning/1.0b2pre Thunderbird/3.0.3

On 04/25/2010 11:36 AM, Wuhtzu wrote:

Hi everyone

Forgive my possible lack of knowledge, but I've come across something
strange using GNU grep 2.5.4 with Windows 7
(http://gnuwin32.sourceforge.net/packages/grep.htm).

I was trying to search xhtml chat logs generated by MessengerPlus! Live v.
4.83.0.376 but I have been completely unable to do so. Trying to search for
something a little more complicated, I came to a dead stop trying just this:

grep -ic td *.html

All the xhtml files in the current directory were listed but with count 0
even though they contain hundreds of td-tags. In order to get matches within
a file I had to open it, copy its content to a new file and save it again.
Then it matched just as it's supposed to.

Two file samples are available here:

Original chat log. Not able to match anything:
http://wuhtzu.dk/random/posts/ex-april-2009.html

This one is in UTF-16. It's very hard to match anything in this encoding since even "normal" Latin characters are not represented the same way as in ASCII: in UTF-16LE each of them takes two bytes, the ASCII byte followed by a NUL byte.
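To make the difference concrete, here is a minimal sketch that dumps the bytes of the string "td" in both encodings. It assumes iconv and od are available; the helper names utf8_bytes and utf16le_bytes are made up for this illustration.

```shell
# Hex dump of a string as it is typed on an ASCII/UTF-8 terminal.
utf8_bytes() { printf '%s' "$1" | od -An -tx1; }

# Hex dump of the same string after conversion to UTF-16LE.
utf16le_bytes() { printf '%s' "$1" | iconv -f UTF-8 -t UTF-16LE | od -An -tx1; }

utf8_bytes td       # bytes: 74 64
utf16le_bytes td    # bytes: 74 00 64 00
```

A plain "td" pattern is the two bytes 74 64, which never occur next to each other in the UTF-16LE file, hence the count of 0.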

In particular, this won't work

LANG=en_US.UTF-16LE fgrep -c 't\x00d\x00' ex-april-2009.html

because grep does not handle \x escape sequences; maybe that could be added as a feature. These three approaches, on the other hand, work:

1) using Perl regular expressions:

LANG=en_US.UTF-16LE grep -Pc 't\x00d\x00' ex-april-2009.html


2) using tr or printf to generate the regex, with bash <(...) process substitution. Note that the following does not work, because the argument to echo is truncated at the first NUL character:

LANG=en_US.UTF-16LE grep -icf <(echo $'t\x00d\x00') ex-april-2009.html

however you can use these two:

LANG=en_US.UTF-16LE grep -icf <(echo t@d@ | tr @ '\0') ex-april-2009.html

LANG=en_US.UTF-16LE grep -icf <(printf '%c\0' t d) ex-april-2009.html
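The printf variant relies on printf reusing its format string for every argument. A small sketch of just that step, with the result piped to od so the bytes are visible (utf16le_pattern is a hypothetical helper name, not something grep or bash provides):

```shell
# Emit each argument's first character followed by one NUL byte.
# printf applies the format '%c\0' once per argument, so "t d" becomes
# t, NUL, d, NUL: the UTF-16LE encoding of "td" for ASCII characters.
utf16le_pattern() { printf '%c\0' "$@"; }

utf16le_pattern t d | od -An -tx1    # bytes: 74 00 64 00
```

This only covers characters in the ASCII range; anything else would need a real conversion, for example with iconv.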


3) same as above using a temporary file.

echo -n td | iconv -f UTF-8 -t UTF-16LE > test-re
LANG=en_US.UTF-16LE grep -icf test-re ex-april-2009.html


With some care, the full power of regular expressions can be used, for example

LANG=en_US.UTF-16LE grep -icf <(printf '%c\0' t . t) ex-april-2009.html

However, there are a lot of tricky areas here. For example, the \n _byte_ is used as a separator rather than the Unicode character \n (which would be "\n\x00" in UTF-16LE), and that is why "echo -n" is needed in the temporary-file example (3) above.
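The effect of the trailing newline can be seen by dumping the pattern file contents with and without it. This is only a sketch, assuming iconv and od are installed; the function names are invented for the illustration, and printf is used instead of echo so the behavior does not depend on the shell's echo.

```shell
# Pattern as produced by "echo td | iconv ...": the newline is converted too.
pattern_with_newline() { printf 'td\n' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1; }

# Pattern as produced by "echo -n td | iconv ...": no newline at all.
pattern_without_newline() { printf 'td' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1; }

pattern_with_newline       # bytes: 74 00 64 00 0a 00
pattern_without_newline    # bytes: 74 00 64 00
```

In the first case the 0a byte ends the first pattern and the stray 00 byte after it is left over as what looks like a second pattern, which is presumably not what was intended.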

Paolo



