Re: Cannot searh MessengerPlus xhtml chat logs

bug-grep

From:	Paolo Bonzini
Subject:	Re: Cannot searh MessengerPlus xhtml chat logs
Date:	Mon, 26 Apr 2010 10:14:38 +0200
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100301 Fedora/3.0.3-1.fc12 Lightning/1.0b2pre Thunderbird/3.0.3

On 04/25/2010 11:36 AM, Wuhtzu wrote:


Hi everyone

Forgive my possible lack of knowledge, but I've come across something
strange using GNU grep 2.5.4 with Windows 7
(http://gnuwin32.sourceforge.net/packages/grep.htm).

I was trying to search xhtml chat logs generated by MessengerPlus! Live v.
4.83.0.376 but I have been completely unable to do so. Trying to search for
something a little more complicated I came to a dead stop trying just this:

grep -ic td *.html

All the xhtml files in the current directory were listed but with count 0
even though they contain hundreds of td-tags. In order to get matches within
a file I had to open it, copy its content to a new file and save it again.
Then it matched just as it's supposed to.

Two file samples are available here:

Original chat log. Not able to match anything:
http://wuhtzu.dk/random/posts/ex-april-2009.html

This one is in UTF-16. It's very hard to match anything in this encoding since "normal" Latin characters are not represented the same way as ASCII.
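
To make the difference concrete, here is a minimal sketch that dumps the bytes of "td" once it is encoded as UTF-16LE. It assumes iconv and od are installed; the variable name utf16_bytes is only for illustration.

```shell
# Encode the two characters "td" as UTF-16LE and dump the result in hex.
# Each ASCII character becomes two bytes: the character, then a NUL byte.
utf16_bytes=$(printf 'td' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n')
echo "$utf16_bytes"   # prints 74006400
```

A byte-oriented search for "td" looks for 74 64 next to each other, which never occurs in such a file.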


In particular, this won't work

LANG=en_US.UTF-16LE fgrep -c 't\x00d\x00' ex-april-2009.html

because grep does not handle \x escape sequences; maybe that could be added as a feature. These three on the other hand work:


1) using Perl regular expressions:

LANG=en_US.UTF-16LE grep -Pc 't\x00d\x00' ex-april-2009.html

2) using tr or printf to print the regex, using bash <(...) syntax. Note that this won't work, because the argument to echo is truncated at the first NUL character:


LANG=en_US.UTF-16LE grep -icf <(echo $'t\x00d\x00') ex-april-2009.html

however you can use these two:

LANG=en_US.UTF-16LE grep -icf <(echo t@d@ | tr @ '\0') ex-april-2009.html

LANG=en_US.UTF-16LE grep -icf <(printf '%c\0' t d) ex-april-2009.html


3) same as above using a temporary file.

echo -n td | iconv -f UTF-8 -t UTF-16LE > test-re
LANG=en_US.UTF-16LE grep -icf test-re ex-april-2009.html
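
A different route, not one of the three options above, is to convert the log rather than the pattern, so that an ordinary search applies. The sketch below is only an illustration: it builds a small UTF-16LE sample as a stand-in for the real chat log, and the names sample and count are hypothetical.

```shell
# Build a small UTF-16LE sample file (a stand-in for the real chat log).
sample=$(mktemp)
printf '<td>one</td>\n<TD>two</TD>\n<p>none</p>\n' \
  | iconv -f UTF-8 -t UTF-16LE > "$sample"

# Convert to UTF-8 on the fly, then count matching lines case-insensitively.
count=$(iconv -f UTF-16LE -t UTF-8 "$sample" | grep -ic td)
echo "$count"   # prints 2

rm -f "$sample"
```

If the real file starts with a byte order mark, "-f UTF-16" may be the better choice, since iconv then takes the byte order from the mark.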

With some care, the full power of regular expressions can be used, for example


LANG=en_US.UTF-16LE grep -icf <(printf '%c\0' t . t) ex-april-2009.html

However, there are a lot of tricky areas here, for example the \n _byte_ is used as a separator rather than the Unicode character \n (which would be "\n\x0"), and that is why "echo -n" is needed in the example above.
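
The effect of the trailing newline can be checked by dumping the pattern bytes. This is a minimal sketch assuming iconv and od are available; printf stands in for "echo -n" because it never appends a newline, and the variable names are illustrative.

```shell
# With a trailing newline, iconv encodes it as the two bytes 0a 00.
with_newline=$(echo td | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n')

# Without a trailing newline, only the four pattern bytes remain.
without_newline=$(printf 'td' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n')

echo "$with_newline"      # prints 740064000a00
echo "$without_newline"   # prints 74006400
```

In the first case the 0a byte ends the pattern, and the 00 that follows it is presumably read as a second pattern of its own.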


Paolo
