Re: regexp filter to match non-english characters

info-gnus-english

From:	Robert D. Crawford
Subject:	Re: regexp filter to match non-english characters
Date:	Thu, 06 Nov 2008 10:38:02 -0600
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/23.0.60 (gnu/linux)

Michal Nazarewicz <mina86@tlen.pl> writes:

> "<<" and ">>" have codes U+00AB and U+00BB so that's why they match but
> there are plenty of other characters which may show up in an English
> text, like (I'll use a (sequence of) ASCII characters which resembles
> the proper Unicode character) "`" (U+2018), "'" (U+2019), "``" (U+201C),
> "''" (U+201D) or "..." (U+2026) which will cause the entry to be
> filtered out.
>
> Besides, I think what you really meant was:
>
> (string-match "[^\000-\177]" "string")
>
> since "1ff" is not a valid octal number.
>
> I think that taking the title of the entry and checking if at least 90%
> are ASCII characters would be sufficient to filter out Asian texts.  You
> can also try taking first 100 (or so) characters of the body.  I think
> you could use replace-regexp-in-string for that purpose:
>
> (defun mn-non-english-p (string)
>   ;; Non-nil when fewer than 90% of STRING's characters are ASCII.
>   (<
>    (* (length (replace-regexp-in-string "[^\000-\177]" "" string)) 10)
>    (* (length string) 9)))
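
For reference, the 90% ASCII check suggested above could be sketched roughly as follows. This is only an illustration under a few assumptions: the names `my-ascii-ratio` and `my-mostly-non-ascii-p` are hypothetical and do not come from the thread, the `[:ascii:]` character class stands in for an octal range, and an empty string is arbitrarily treated as fully ASCII.

```elisp
;; Hypothetical helper names; neither function appears in the thread.

(defun my-ascii-ratio (string)
  "Return the fraction of characters in STRING that are ASCII.
An empty STRING is treated as fully ASCII (an assumption)."
  (if (string= string "")
      1.0
    ;; Delete every non-ASCII character, then compare lengths.
    (/ (float (length (replace-regexp-in-string "[^[:ascii:]]" "" string)))
       (length string))))

(defun my-mostly-non-ascii-p (string &optional threshold)
  "Return non-nil if less than THRESHOLD of STRING is ASCII.
THRESHOLD defaults to 0.9, the 90% figure suggested above."
  (< (my-ascii-ratio string) (or threshold 0.9)))
```

With these definitions, a plain ASCII title such as "Tables Linearized" is not flagged, while a title written entirely in CJK characters is. A mostly English title containing a few typographic quotes stays above the 90% threshold and is kept.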

I like the way this looks.  It seems that it will keep the characters I
want while removing posts I cannot read.  My problem, and forgive what I
can only assume is my lack of experience with complex scoring and
filtering, is that I don't know how to implement this.  I have read
through the scoring section of the Gnus Info manual and don't see
anywhere to plug in a function that performs this check on the subject.
I readily admit that I probably just missed it.  If someone could point
me to the place in the manual where this is explained, I would be very
appreciative.

I must add that the body of each post from this nnrss group consists of
only the following lines:

Tables Linearized                  
 About This Style 

link

comments

About This Style

Table contents are turned into a sequence of paragraphs, one per cell.

The part about "Tables Linearized" is added by something I use.  The
explanation is on the last line.  I mention this because I don't think
scoring on the body of the post will work in this case.

Thanks for all the help from both you and Ted,
rdc
-- 
Robert D. Crawford                                      rdc1x@comcast.net

Your temporary financial embarrassment will be relieved in a surprising manner.
