Re: regexp filter to match non-english characters

From: Robert D. Crawford
Subject: Re: regexp filter to match non-english characters
Date: Thu, 06 Nov 2008 10:38:02 -0600
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.60 (gnu/linux)

Michal Nazarewicz writes:

> "<<" and ">>" have codes U+00AB and U+00BB so that's why they match but
> there are plenty of other characters which may show up in an English
> text, like (I'll use a (sequence of) ASCII characters which resembles
> the proper Unicode character) "`" (U+2018), "'" (U+2019), "``" (U+201C),
> "''" (U+201D) or "..." (U+2026) which will cause the entry to be
> filtered out.
> Besides, I think what you really meant was:
> (string-match "[^\0-\177]" "string")
> since "1ff" is not a valid octal number.
> I think that taking the title of the entry and checking if at least 90%
> are ASCII characters would be sufficient to filter out Asian texts.  You
> can also try taking first 100 (or so) characters of the body.  I think
> you could use replace-regexp-in-string for that purpose:
> (defun mn-non-english-p (string)
>   ;; Non-nil when fewer than 90% of STRING's characters are ASCII.
>   (<
>    (* (length (replace-regexp-in-string "[^\0-\177]" "" string)) 10)
>    (* (length string) 9)))

I like the way this looks.  It seems it will keep the characters I
want while removing posts I cannot read.  My problem, and forgive what
I can only assume is my lack of understanding of complex
scoring/filtering, is that I don't know how to implement it.  I have
read through the scoring section of the Gnus Info manual and don't see
anywhere that I can plug in a function to perform this check on the
subject.  I readily admit that I probably just missed it.  If someone
could point me to the place in the manual where this is explained, I
would be very appreciative.
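
To make the question concrete, here is a minimal sketch of the check on
its own, outside of Gnus.  The name my-mostly-non-ascii-p and the
optional RATIO argument are hypothetical, made up for illustration, and
I use the [:ascii:] character class rather than an octal range.  It
only tests a plain string; how to hook it into scoring is still the
open question.

```elisp
;; Hypothetical helper, not part of Gnus: flags strings in which fewer
;; than RATIO (default 0.9) of the characters are ASCII.
(defun my-mostly-non-ascii-p (string &optional ratio)
  "Return non-nil if less than RATIO of STRING's characters are ASCII.
RATIO defaults to 0.9.  An empty STRING is never flagged."
  (let* ((ratio (or ratio 0.9))
         (total (length string))
         ;; Delete every non-ASCII character and count what remains.
         (ascii (length (replace-regexp-in-string
                         "[^[:ascii:]]" "" string))))
    (and (> total 0)
         (< ascii (* ratio total)))))

;; Example calls on plain strings (no Gnus involved):
(my-mostly-non-ascii-p "regexp filter to match characters")
(my-mostly-non-ascii-p "日本語のタイトル")
```

With the default ratio, a subject that contains only the odd
typographic quote or ellipsis should still pass, while one written
mostly in a non-Latin script should be flagged.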

I must add that the body of each post from this nnrss group consists of
only the following lines:

Tables Linearized                  
 About This Style 



About This Style

Table contents are turned into a sequence of paragraphs, one per cell.

The part about "Tables Linearized" is added by something I use.  The
explanation is on the last line.  I mention this because I don't think
scoring on the body of the post will work in this case.

Thanks for all the help from both you and Ted,
Robert D. Crawford                            

Your temporary financial embarrassment will be relieved in a surprising manner.
