Re: regexp filter to match non-english characters
From: Robert D. Crawford
Subject: Re: regexp filter to match non-english characters
Date: Thu, 06 Nov 2008 10:38:02 -0600
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.0.60 (gnu/linux)
Michal Nazarewicz <mina86@tlen.pl> writes:
> "<<" and ">>" have codes U+00AB and U+00BB so that's why they match but
> there are plenty of other characters which may show up in an English
> text, like (I'll use a (sequence of) ASCII characters which resembles
> the proper unicode character) "`" (U+2018), "'" (U+2019), "``" (U+201C)
> , "''" (U+201D) or "..." (U+2026) which will cause the entry to be
> filtered out.
>
> Besides, I think what you really meant was:
>
> (string-match "[^\\0-\\177]" "string")
>
> since "1ff" is not a valid octal number.
>
> I think that taking the title of the entry and checking if at least 90%
> are ASCII characters would be sufficient to filter out Asian texts. You
> can also try taking first 100 (or so) characters of the body. I think
> you could use replace-regexp-in-string for that purpose:
>
> (defun mn-non-english-p (string)
>   (>
>    (* (length (replace-regexp-in-string "[^\\0-\\77]" "" string)) 10)
>    (* (length string) 9)))
I like the way this looks. Seems that it will allow the characters I
would like to keep but remove posts which I cannot read. Here is my
problem, and forgive what I can only assume is my lack of understanding
in doing complex scoring/filtering, but I don't know how to implement
this. I have read through the gnus info manual section on scoring and
don't see anywhere that I can plug in a function to perform this action
on the subject. I will readily admit that it is probable that I just
missed it. If someone could point me to the place where this is
explained in the manual I would be very appreciative.
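For reference, here is a minimal sketch of the 90% check described in the quoted text. The name my-mostly-ascii-p is hypothetical, and the [:ascii:] character class is swapped in for the octal range, since the quoted examples disagree on the upper bound (\\177 versus \\77). Note that, as written, the quoted function appears to return non-nil when the string is mostly ASCII, which is the opposite of what its name suggests, so this sketch names the test for what it actually computes. It does not show how to hook the check into Gnus scoring.

```elisp
;; Hypothetical helper, not part of Gnus: a sketch of the 90% ASCII test.
(defun my-mostly-ascii-p (string)
  "Return non-nil if more than 90% of the characters in STRING are ASCII."
  ;; Delete every non-ASCII character, then count what is left.
  (let ((ascii-count
         (length (replace-regexp-in-string "[^[:ascii:]]" "" string))))
    ;; Integer comparison equivalent to ascii-count / length > 0.9.
    (> (* ascii-count 10)
       (* (length string) 9))))
```

A subject line would be treated as unreadable when (my-mostly-ascii-p subject) returns nil. An empty string also returns nil, so a caller may want to handle that case separately.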
I must add that the body of each post from this nnrss group consists of
only the following lines:
    Tables Linearized
    About This Style
    link
    comments
    About This Style
    Table contents are turned into a sequence of paragraphs, one per cell.
The part about "Tables Linearized" is added by something I use. The
explanation is on the last line. I mention this because I don't think
scoring on the body of the post will work in this case.
Thanks for all the help from both you and Ted,
rdc
--
Robert D. Crawford rdc1x@comcast.net
Your temporary financial embarrassment will be relieved in a surprising manner.