Re: regexp filter to match non-english characters

info-gnus-english

From:	Ted Zlatanov
Subject:	Re: regexp filter to match non-english characters
Date:	Thu, 06 Nov 2008 08:50:59 -0600
User-agent:	Gnus/5.110011 (No Gnus v0.11) Emacs/23.0.60 (gnu/linux)

On Thu, 06 Nov 2008 10:34:25 +0100 Michal Nazarewicz <mina86@tlen.pl> wrote: 

MN> "<<" and ">>" have codes U+00AB and U+00BB so that's why they match but
MN> there are plenty of other characters which may show up in an English
MN> text, like (I'll use a (sequence of) ASCII characters which resembles
MN> the proper unicode character) "`" (U+2018), "'" (U+2019), "``" (U+201C)
MN> , "''" (U+201D) or "..." (U+2026) which will cause the entry to be
MN> filtered out.

Agreed.  It's not an easy problem without Unicode properties, but for
the *subject* of the message it's a passable heuristic.

MN> Besides, I think what you really meant was:

MN> (string-match "[^\0-\177]" "string")

MN> since "1ff" is not a valid octal number.

Yes.  Sorry.
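
For reference, a minimal sketch of the "any non-ASCII character" test discussed above, assuming it is run from Emacs Lisp. The helper name `my-contains-non-ascii-p` is hypothetical, and the `[:ascii:]` character class is a substitute for the octal range, not something proposed in this thread. If the octal range is preferred, note that inside a Lisp string the escapes are written with a single backslash ("\0", "\177"), because a backslash is not special inside a regexp bracket expression.

```elisp
;; Hypothetical helper; the name is illustrative and not from the thread.
(defun my-contains-non-ascii-p (string)
  "Return t if STRING contains a character outside ASCII (codes 0-127)."
  ;; [:ascii:] is an Emacs regexp character class for codes 0-127,
  ;; so the negated class matches any other character.
  (and (string-match-p "[^[:ascii:]]" string) t))

(my-contains-non-ascii-p "plain subject")          ; => nil
(my-contains-non-ascii-p "\u00AB quoted \u00BB")   ; => t (U+00AB, U+00BB)
```

As the second call shows, a single typographic character is enough to trigger the match, which is the false-positive risk raised earlier.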

MN> I think that taking the title of the entry and checking if at least 90%
MN> are ASCII characters would be sufficient to filter out Asian texts.  You
MN> can also try taking first 100 (or so) characters of the body.  I think
MN> you could use replace-regexp-in-string for that purpose:

MN> (defun mn-non-english-p (string)
MN>   ;; Non-nil when fewer than 90% of STRING's characters are ASCII.
MN>   (<
MN>    (* (length (replace-regexp-in-string "[^\0-\177]" "" string)) 10)
MN>    (* (length string) 9)))
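
The 90% idea can be sketched as follows. This is only an illustration under stated assumptions: the name `my-mostly-non-ascii-p` and the optional THRESHOLD argument are hypothetical, and `[:ascii:]` is used in place of the octal range.

```elisp
;; Hypothetical helper; name and THRESHOLD argument are illustrative.
(defun my-mostly-non-ascii-p (string &optional threshold)
  "Return t if fewer than THRESHOLD of STRING's characters are ASCII.
THRESHOLD defaults to 0.9.  An empty STRING yields nil."
  (let ((total (length string))
        ;; Deleting every non-ASCII character leaves only the ASCII ones.
        (ascii (length (replace-regexp-in-string "[^[:ascii:]]" "" string))))
    (and (> total 0)
         (< ascii (* (or threshold 0.9) total)))))

(my-mostly-non-ascii-p "hello world")               ; => nil
(my-mostly-non-ascii-p "caf\u00e9 au lait")         ; => nil (11 of 12 ASCII)
(my-mostly-non-ascii-p "\u65e5\u672c\u8a9e abc")    ; => t   (4 of 7 ASCII)
```

Unlike a plain regexp match, this tolerates the occasional typographic quote or accented letter in otherwise English text.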

That might work, but for a score file a simple regular expression is
better, and I understood the OP to need a score file.
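
A score file entry along these lines might look like the sketch below. The score value -1000 is an arbitrary example, and using `[:ascii:]` here is an assumption that the Emacs in use supports that character class; `nil` in the date slot makes the entry permanent and `r` selects regexp matching.

```elisp
;; Sketch only: lowers the score of any article whose subject
;; contains at least one non-ASCII character.
(("subject"
  ("[^[:ascii:]]" -1000 nil r)))
```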

Ted
